9780471356011

Foreword

Eric Schmidt

Preface

Introduction

(8)

Why an Availability Book?

(1)

Our Approach to the Problem Set

(1)

What's Not Here

(1)

Our Mission

(1)

The Availability Index

(1)

Summary

(1)

Organization of the Book

(1)

Key Points

(1)

What is Resiliency?

(22)

Measuring Availability

(6)

Defining Downtime

(1)

Causes of Downtime

(2)

What is Availability?

(1)

'M' is for Mean

(2)

Failure Modes

(6)

Hardware

(1)

Environmental and Physical Failures

(1)

Network Failures

(1)

Database System Failures

(2)

Web Server Failures

(1)

File and Print Server Failures

(1)

Cost/Risk Tradeoffs

(7)

The Costs of Downtime

(2)

Explaining the Problems to Management

(1)

Levels of Availability (The Availability Continuum)

(1)

Regular Availability: Do Nothing Special

(1)

Increased Availability: Protect the Data

(1)

High Availability: Protect the System

(1)

Disaster Recovery: Protect the Organization

(1)

Fault-Tolerant Systems

(1)

Balancing Risk and Rewards

(1)

Don't Overspend

(1)

Key Points

(2)

Twenty Key System Design Principles

(16)

Spend Money ... but Not Blindly

(1)

Assume Nothing

(1)

Remove Single Points of Failure

(1)

Maintain Tight Security

(1)

Consolidate Your Servers

(1)

Automate Common Tasks

(1)

Document Everything

(2)

Establish Service Level Agreements

(1)

Plan Ahead

(1)

Test Everything

(1)

Maintain Separate Environments

(1)

Invest in Failure Isolation

(1)

Examine the History of the System

(1)

Build for Growth

(1)

Choose Mature Software

(1)

Select Reliable and Serviceable Hardware

(1)

Reuse Configurations

(1)

Exploit External Resources

(1)

One Problem, One Solution

(1)

KISS: Keep It Simple

(3)

Highly Available Data Management

(28)

Fundamental Truths

(3)

Disk Hardware and Connectivity Terminology

(8)

SCSI (Small Computer Systems Interface)

(2)

Fibrechannel

(1)

Multihosting

(1)

Multipathing

(1)

Disk Array

(1)

JBOD (Just a Bunch of Disks)

(1)

Hot-Pluggable Disks

(1)

Warm-Pluggable Disks

(1)

Hot Spares

(1)

Write Cache

(1)

Storage Area Network (SAN)

(3)

SCSI versus Fibrechannel

(1)

RAID Technology

(12)

RAID Levels

(1)

Striping

(1)

Mirroring

(1)

Combining RAID-0 and RAID-1

(2)

Hamming Encoding

(1)

Parity RAID

(2)

Hardware RAID

(3)

Disk Arrays

(1)

Software RAID

(1)

Logical Volume Management

(1)

The Right Answer

(1)

Disk Space and FileSystems

(4)

What Happens When a LUN Fills up?

(1)

Managing Disk and Volume Availability

(1)

File System Recovery

(1)

Key Points

(1)

Redundant Server Design

(22)

Server Failures and Failover

(2)

Logical, Application-Centric Thinking

(2)

Failover Requirements

(1)

Servers

(3)

Failing Over between Incompatible Servers

(2)

Networks

(10)

Heartbeat Networks

(2)

When the Heartbeat Stops

(1)

Running Heartbeat Networks

(1)

Public Networks

(1)

Redundant Network Connectivity

(1)

Moving Network Identities

(2)

IP Addresses and Names

(1)

Selecting Logical Hostnames

(1)

Administrative Networks

(1)

Disks

(3)

Private Disks

(1)

Shared Disks

(1)

Placing Critical Applications on Disks

(1)

Key Points

(1)

Failover Management

(10)

Component Monitoring

(3)

When Component Tests Fail

(1)

Time to Manual Failover

(2)

Homegrown Failover Software versus Commercial Software

(1)

Commercial Failover Management Software

(2)

Key Points

(2)

Failover Configurations and Issues

(30)

Two-Node Failover Configurations

(9)

Asymmetric 1-to-1 Configuration

(1)

How Can I Use the Standby Server?

(3)

Symmetric 1-to-1 Failover

(2)

Symmetric or Asymmetric?

(1)

Service Level Failover

(1)

More Complex Failover Configurations

(4)

N-to-1 Asymmetric

(1)

N Host, Networked

(2)

Offbeat Failover Configurations

(4)

N-to-1 Symmetric

(1)

1-to-N (Spray) Asymmetric

(1)

Round-Robin Symmetric

(2)

When Good Failovers Go Bad

(5)

Split-Brain Syndrome

(1)

Causes and Remedies of Split-Brain Syndrome

(3)

Undesirable Failovers

(1)

Verification and Testing

(3)

State Transition Diagrams

(2)

Testing the Works

(1)

Managing Failovers

(3)

System Monitoring

(1)

Consoles

(1)

Utilities

(1)

Time Matters

(1)

Key Points

(2)

Redundant Network Services

(30)

Network Failure Taxonomy

(9)

Network Reliability Challenges

(2)

Network Failure Modes

(1)

Physical Device Failures

(1)

IP Level Failures

(1)

IP Address Configuration

(1)

Routing Information

(1)

Congestion-Induced Failures

(1)

Network Traffic Congestion

(2)

Design and Operations Guidelines

(1)

Building Redundant Networks

(12)

Virtual IP Addresses

(1)

Redundant Network Connections

(1)

Redundant Network Attach

(1)

Multiple Network Attach

(2)

Interface Trunking

(1)

Configuring Multiple Networks

(3)

IP Routing Redundancy

(3)

Choosing the Failover Mechanism

(1)

Network Service Reliability

(7)

Network Service Dependencies

(4)

Hardening Core Services

(1)

Denial-of-Service Attacks

(1)

Key Points

(1)

Data Service Reliability

(22)

Network FileSystem Services

(7)

Detecting RPC Failures

(2)

NFS Server Constraints

(1)

Inside an NFS Failover

(1)

Optimizing NFS Recovery

(1)

File Locking

(2)

Stale File Handles

(1)

Database Servers

(8)

Managing Recovery Time

(1)

Database Probes

(1)

Database Restarts

(2)

Client Reconnection

(1)

Surviving Corruption

(1)

Unsafe at Any (High) Speed

(1)

Transaction Size and Checkpointing

(1)

Parallel Databases

(2)

Web Servers

(5)

Availability Constraints

(1)

Web Server Farms

(1)

High-Availability Pairs

(1)

Round-Robin DNS

(1)

IP Redirection

(1)

Deep or Wide?

(1)

Key Points

(1)

Replication Techniques

(24)

What is Replication?

(3)

Replication Applications

(2)

Overview of Replication Techniques

(1)

Filesystem Replication

(6)

Archive Distribution

(2)

Distribution Utilities

(1)

File Replication with Finesse

(1)

Software Distribution

(1)

Database Replication

(8)

Log Replay

(1)

Database Replication Managers

(1)

To Block Copy or Not?

(1)

Transaction Processing Monitors

(1)

Queuing Systems

(3)

Process Replication

(4)

Redundant Service Processes

(2)

Process State Multicast

(1)

Checkpointing

(1)

Key Points

(2)

Application Recovery

(24)

Application Recovery Overview

(4)

Application Failure Modes

(1)

Application Recovery Techniques

(2)

Kinder, Gentler Failures

(1)

Tolerating Data Service Failures

(5)

File Server Client Recovery

(1)

NFS Soft Mounts

(1)

Automounter Tricks

(1)

Database Application Recovery

(1)

Web Client Recovery

(1)

Application Recovery from System Failures

(5)

Virtual Memory Exhaustion

(1)

I/O Errors

(1)

Network Connectivity

(1)

Restarting Network Services

(1)

Internal Application Failures

(2)

Memory Access Faults

(1)

Memory Corruption and Recovery

(1)

Hanging Processes

(1)

Developer Hygiene

(5)

Return Value Checks

(1)

Boundary Condition Checks

(1)

Value-Based Security

(1)

Logging Support

(1)

Assume Nothing, Manage Everything

(1)

Key Points

(1)

Backups and Restores

(32)

The Basic Rules for Backups

(2)

Backup Software

(4)

Commercial or Homegrown?

(1)

Examples of Commercial Backup Software

(1)

Commercial Backup Software Features

(2)

Backup Performance

(9)

Improving Backup Performance: Find the Bottleneck

(5)

Solving for Performance

(4)

Backup Styles

(3)

Incremental Backups of Databases

(1)

Backup Windows

(7)

Hot Backups

(2)

Have Less Data, Save More Time (and Space)

(1)

Hierarchical Storage Management

(1)

Blueprints for High Availability: Designing Resilient Distributed Systems

0471356018