The intent of this book is to bring readers a straightforward, crisp, and practical approach to designing world-class high-availability systems from the ground up, i.e., systems in which high availability is a critical design element and differentiator as well as a customer requirement. Such systems include, but are not limited to, telecom, automotive, medical, manufacturing, aerospace, financial, and other information systems, which typically combine high-reliability hardware, embedded and off-the-shelf software, multi-site, multi-threaded distributed processing environments, complex real-time applications, and demanding performance requirements. Though high availability and reliability are typically "must-haves" and taken for granted, such systems are usually complex and difficult to design and implement, for a variety of reasons, and can take many iterations involving significant cost and effort. This book brings together practical techniques used in the industry to successfully design, predict, and deploy high-availability systems while reducing costs.

Preface xiii

List of Abbreviations xvii

**1. Introduction 1**

**2. Initial Considerations for Reliability Design 3**

2.1 The Challenge 3

2.2 Initial Data Collection 3

2.3 Where Do We Get MTBF Information? 5

2.4 MTTR and Identifying Failures 6

2.5 Summary 7

**3. A Game of Dice: An Introduction to Probability 8**

3.1 Introduction 8

3.2 A Game of Dice 10

3.3 Mutually Exclusive and Independent Events 10

3.4 Dice Paradox Problem and Conditional Probability 15

3.5 Flip a Coin 21

3.6 Dice Paradox Revisited 23

3.7 Probabilities for Multiple Dice Throws 24

3.8 Conditional Probability Revisited 27

3.9 Summary 29

**4. Discrete Random Variables 30**

4.1 Introduction 30

4.2 Random Variables 31

4.3 Discrete Probability Distributions 33

4.4 Bernoulli Distribution 34

4.5 Geometric Distribution 35

4.6 Binomial Coefficients 38

4.7 Binomial Distribution 40

4.8 Poisson Distribution 43

4.9 Negative Binomial Random Variable 48

4.10 Summary 50

**5. Continuous Random Variables 51**

5.1 Introduction 51

5.2 Uniform Random Variables 52

5.3 Exponential Random Variables 53

5.4 Weibull Random Variables 54

5.5 Gamma Random Variables 55

5.6 Chi-Square Random Variables 59

5.7 Normal Random Variables 59

5.8 Relationship between Random Variables 60

5.9 Summary 61

**6. Random Processes 62**

6.1 Introduction 62

6.2 Markov Process 63

6.3 Poisson Process 63

6.4 Deriving the Poisson Distribution 64

6.5 Poisson Interarrival Times 69

6.6 Summary 71

**7. Modeling and Reliability Basics 72**

7.1 Introduction 72

7.2 Modeling 75

7.3 Failure Probability and Failure Density 77

7.4 Unreliability, F(t) 78

7.5 Reliability, R(t) 79

7.6 MTTF 79

7.7 MTBF 79

7.8 Repairable System 80

7.9 Nonrepairable System 80

7.10 MTTR 80

7.11 Failure Rate 81

7.12 Maintainability 81

7.13 Operability 81

7.14 Availability 82

7.15 Unavailability 84

7.16 Five 9s Availability 85

7.17 Downtime 85

7.18 Constant Failure Rate Model 85

7.19 Conditional Failure Rate 88

7.20 Bayes’s Theorem 94

7.21 Reliability Block Diagrams 98

7.22 Summary 107

**8. Discrete-Time Markov Analysis 110**

8.1 Introduction 110

8.2 Markov Process Defined 112

8.3 Dynamic Modeling 116

8.4 Discrete Time Markov Chains 116

8.5 Absorbing Markov Chains 123

8.6 Nonrepairable Reliability Models 129

8.7 Summary 140

**9. Continuous-Time Markov Systems 141**

9.1 Introduction 141

9.2 Continuous-Time Markov Processes 141

9.3 Two-State Derivation 143

9.4 Steps to Create a Markov Reliability Model 147

9.5 Asymptotic Behavior (Steady-State Behavior) 148

9.6 Limitations of Markov Modeling 154

9.7 Markov Reward Models 154

9.8 Summary 155

**10. Markov Analysis: Nonrepairable Systems 156**

10.1 Introduction 156

10.2 One Component, No Repair 156

10.3 Nonrepairable Systems: Parallel System with No Repair 165

10.4 Series System with No Repair: Two Identical Components 172

10.5 Parallel System with Partial Repair: Identical Components 176

10.6 Parallel System with No Repair: Nonidentical Components 183

10.7 Summary 192

**11. Markov Analysis: Repairable Systems 193**

11.1 Repairable Systems 193

11.2 One Component with Repair 194

11.3 Parallel System with Repair: Identical Component Failure and Repair Rates 204

11.4 Parallel System with Repair: Different Failure and Repair Rates 217

11.5 Summary 239

**12. Analyzing Confidence Levels 240**

12.1 Introduction 240

12.2 pdf of a Squared Normal Random Variable 240

12.3 pdf of the Sum of Two Random Variables 243

12.4 pdf of the Sum of Two Gamma Random Variables 245

12.5 pdf of the Sum of n Gamma Random Variables 246

12.6 Goodness-of-Fit Test Using Chi-Square 249

12.7 Confidence Levels 257

12.8 Summary 264

**13. Estimating Reliability Parameters 266**

13.1 Introduction 266

13.2 Bayes’ Estimation 268

13.3 Example of Estimating Hardware MTBF 273

13.4 Estimating Software MTBF 273

13.5 Revising Initial MTBF Estimates and Tradeoffs 274

13.6 Summary 277

**14. Six Sigma Tools for Predictive Engineering 278**

14.1 Introduction 278

14.2 Gathering Voice of Customer (VOC) 279

14.3 Processing Voice of Customer 281

14.4 Kano Analysis 282

14.5 Analysis of Technical Risks 284

14.6 Quality Function Deployment (QFD) or House of Quality 284

14.7 Program Level Transparency of Critical Parameters 287

14.8 Mapping DFSS Techniques to Critical Parameters 287

14.9 Critical Parameter Management (CPM) 287

14.10 First Principles Modeling 289

14.11 Design of Experiments (DOE) 289

14.12 Design Failure Modes and Effects Analysis (DFMEA) 289

14.13 Fault Tree Analysis 290

14.14 Pugh Matrix 290

14.15 Monte Carlo Simulation 291

14.16 Commercial DFSS Tools 291

14.17 Mathematical Prediction of System Capability instead of “Gut Feel” 293

14.18 Visualizing System Behavior Early in the Life Cycle 297

14.19 Critical Parameter Scorecard 297

14.20 Applying DFSS in Third-Party Intensive Programs 298

14.21 Summary 300

**15. Design Failure Modes and Effects Analysis 302**

15.1 Introduction 302

15.2 What Is Design Failure Modes and Effects Analysis (DFMEA)? 302

15.3 Definitions 303

15.4 Business Case for DFMEA 303

15.5 Why Conduct DFMEA? 305

15.6 When to Perform DFMEA 305

15.7 Applicability of DFMEA 306

15.8 DFMEA Template 306

15.9 DFMEA Life Cycle 312

15.10 The DFMEA Team 324

15.11 DFMEA Advantages and Disadvantages 327

15.12 Limitations of DFMEA 328

15.13 DFMEAs, FTAs, and Reliability Analysis 328

15.14 Summary 330

**16. Fault Tree Analysis 331**

16.1 What Is Fault Tree Analysis? 331

16.2 Events 332

16.3 Logic Gates 333

16.4 Creating a Fault Tree 335

16.5 Fault Tree Limitations 339

16.6 Summary 339

**17. Monte Carlo Simulation Models 340**

17.1 Introduction 340

17.2 System Behavior over Mission Time 344

17.3 Reliability Parameter Analysis 344

17.4 A Worked Example 348

17.5 Component and System Failure Times Using Monte Carlo Simulations 359

17.6 Limitations of Using Nontime-Based Monte Carlo Simulations 361

17.7 Summary 365

**18. Updating Reliability Estimates: Case Study 367**

18.1 Introduction 367

18.2 Overview of the Base Station Controller—Data Only (BSC-DO) System 367

18.3 Downtime Calculation 368

18.4 Calculating Availability from Field Data Only 371

18.5 Assumptions Behind Using the Chi-Square Methodology 372

18.6 Fault Tree Updates from Field Data 372

18.7 Summary 376

**19. Fault Management Architectures 377**

19.1 Introduction 377

19.2 Faults, Errors, and Failures 378

19.3 Fault Management Design 381

19.4 Repair versus Recovery 382

19.5 Design Considerations for Reliability Modeling 383

19.6 Architecture Techniques to Improve Availability 383

19.7 Redundancy Schemes 384

19.8 Summary 395

**20. Application of DFMEA to Real-Life Example 397**

20.1 Introduction 397

20.2 Cage Failover Architecture Description 397

20.3 Cage Failover DFMEA Example 399

20.4 DFMEA Scorecard 401

20.5 Lessons Learned 402

20.6 Summary 403

**21. Application of FTA to Real-Life Example 404**

21.1 Introduction 404

21.2 Calculating Availability Using Fault Tree Analysis 404

21.3 Building the Basic Events 405

21.4 Building the Fault Tree 406

21.5 Steps for Creating and Estimating the Availability Using FTA 408

21.6 Summary 416

**22. Complex High Availability System Analysis 420**

22.1 Introduction 420

22.2 Markov Analysis of the Hardware Components 420

22.3 Building a Fault Tree from the Hardware Markov Model 427

22.4 Markov Analysis of the Software Components 427

22.5 Markov Analysis of the Combined Hardware and Software Components 433

22.6 Techniques for Simplifying Markov Analysis 437

22.7 Summary 446

References 447

Index 450