Reinforcement Learning: An Introduction

by Richard S. Sutton and Andrew G. Barto

  • ISBN13: 9780262193986
  • ISBN10: 0262193981
  • Format: Hardcover
  • Copyright: 1998-03-01
  • Publisher: MIT Press

Note: Supplemental materials are not guaranteed with Rental or Used book purchases.

Purchase Benefits

  • Free Shipping On Orders Over $35!
    Your order must be $35 or more to qualify for free economy shipping. Bulk sales, POs, Marketplace items, eBooks, and apparel do not qualify for this offer.
  • Get Rewarded for Ordering Your Textbooks!
  • We Buy This Book Back!
    In-Store Credit: $1.05
    Check/Direct Deposit: $1.00
    PayPal: $1.00

List Price: $75.00 (save up to $28.69)

  • Rent Book: $46.31 (free shipping)

    Availability: Special order, 1-2 weeks.

    *This item is part of an exclusive publisher rental program and requires an additional convenience fee. This fee will be reflected in the shopping cart.

How To: Textbook Rental

Looking to rent a book? Rent Reinforcement Learning: An Introduction [ISBN: 9780262193986] for the semester, quarter, or short term, or search our site for other textbooks by Richard S. Sutton and Andrew G. Barto. Renting a textbook can save you up to 90% off the cost of buying.

Summary

Reinforcement learning, one of the most active research areas in artificial intelligence, is a computational approach to learning whereby an agent tries to maximize the total amount of reward it receives when interacting with a complex, uncertain environment. In Reinforcement Learning, Richard Sutton and Andrew Barto provide a clear and simple account of the key ideas and algorithms of reinforcement learning. Their discussion ranges from the history of the field's intellectual foundations to the most recent developments and applications. The only necessary mathematical background is familiarity with elementary concepts of probability. The book is divided into three parts. Part I defines the reinforcement learning problem in terms of Markov decision processes. Part II provides basic solution methods: dynamic programming, Monte Carlo methods, and temporal-difference learning. Part III presents a unified view of the solution methods and incorporates artificial neural networks, eligibility traces, and planning; the two final chapters present case studies and consider the future of reinforcement learning.
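
As a rough illustration of the kind of algorithm the book covers, the sketch below implements tabular Q-learning (the off-policy temporal-difference control method listed in Section 6.5 of the table of contents) on a hypothetical five-state chain. The environment, constants, and names are assumptions made for this example and do not come from the book; the update applied is the standard Q-learning rule Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)].

    # Illustrative sketch only: tabular Q-learning on a made-up five-state chain.
    # The environment, constants, and names are assumptions for this example,
    # not material from the book.
    import random

    N_STATES = 5          # states 0..4; reaching state 4 ends the episode
    ACTIONS = (-1, +1)    # step left or right along the chain
    ALPHA = 0.1           # learning-rate (step-size) parameter
    GAMMA = 0.9           # discount factor
    EPSILON = 0.1         # exploration probability

    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

    def step(state, action):
        """Toy dynamics: walls at both ends, reward +1 only on reaching state 4."""
        nxt = min(max(state + action, 0), N_STATES - 1)
        reward = 1.0 if nxt == N_STATES - 1 else 0.0
        return nxt, reward, nxt == N_STATES - 1

    def greedy(state):
        """Pick a highest-valued action, breaking ties at random."""
        best = max(Q[(state, a)] for a in ACTIONS)
        return random.choice([a for a in ACTIONS if Q[(state, a)] == best])

    for episode in range(200):
        state, done = 0, False
        while not done:
            # epsilon-greedy behavior policy
            action = random.choice(ACTIONS) if random.random() < EPSILON else greedy(state)
            nxt, reward, done = step(state, action)
            # Q-learning backup: bootstrap from the best action in the next state
            target = reward + (0.0 if done else GAMMA * max(Q[(nxt, a)] for a in ACTIONS))
            Q[(state, action)] += ALPHA * (target - Q[(state, action)])
            state = nxt

    # After training, the greedy policy should move right (+1) in every nonterminal state.
    print({s: greedy(s) for s in range(N_STATES - 1)})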

Table of Contents

Series Foreword xiii(2)
Preface xv
I The Problem 1(86)
1 Introduction 3(22)
1.1 Reinforcement Learning 3(3)
1.2 Examples 6(1)
1.3 Elements of Reinforcement Learning 7(3)
1.4 An Extended Example: Tic-Tac-Toe 10(5)
1.5 Summary 15(1)
1.6 History of Reinforcement Learning 16(7)
1.7 Bibliographical Remarks 23(2)
2 Evaluative Feedback 25(26)
2.1 An n-Armed Bandit Problem 26(1)
2.2 Action-Value Methods 27(3)
2.3 Softmax Action Selection 30(1)
2.4 Evaluation Versus Instruction 31(5)
2.5 Incremental Implementation 36(2)
2.6 Tracking a Nonstationary Problem 38(1)
2.7 Optimistic Initial Values 39(2)
2.8 Reinforcement Comparison 41(2)
2.9 Pursuit Methods 43(2)
2.10 Associative Search 45(1)
2.11 Conclusions 46(2)
2.12 Bibliographical and Historical Remarks 48(3)
3 The Reinforcement Learning Problem 51(36)
3.1 The Agent-Environment Interface 51(5)
3.2 Goals and Rewards 56(1)
3.3 Returns 57(3)
3.4 Unified Notation for Episodic and Continuing Tasks 60(1)
3.5 The Markov Property 61(5)
3.6 Markov Decision Processes 66(2)
3.7 Value Functions 68(7)
3.8 Optimal Value Functions 75(5)
3.9 Optimality and Approximation 80(1)
3.10 Summary 81(2)
3.11 Bibliographical and Historical Remarks 83(4)
II Elementary Solution Methods 87(74)
4 Dynamic Programming 89(22)
4.1 Policy Evaluation 90(3)
4.2 Policy Improvement 93(4)
4.3 Policy Iteration 97(3)
4.4 Value Iteration 100(3)
4.5 Asynchronous Dynamic Programming 103(2)
4.6 Generalized Policy Iteration 105(2)
4.7 Efficiency of Dynamic Programming 107(1)
4.8 Summary 108(1)
4.9 Bibliographical and Historical Remarks 109(2)
5 Monte Carlo Methods 111(22)
5.1 Monte Carlo Policy Evaluation 112(4)
5.2 Monte Carlo Estimation of Action Values 116(2)
5.3 Monte Carlo Control 118(4)
5.4 On-Policy Monte Carlo Control 122(2)
5.5 Evaluating One Policy While Following Another 124(2)
5.6 Off-Policy Monte Carlo Control 126(2)
5.7 Incremental Implementation 128(1)
5.8 Summary 129(2)
5.9 Bibliographical and Historical Remarks 131(2)
6 Temporal-Difference Learning 133(28)
6.1 TD Prediction 133(5)
6.2 Advantages of TD Prediction Methods 138(3)
6.3 Optimality of TD(0) 141(4)
6.4 Sarsa: On-Policy TD Control 145(3)
6.5 Q-Learning: Off-Policy TD Control 148(3)
6.6 Actor-Critic Methods 151(2)
6.7 R-Learning for Undiscounted Continuing Tasks 153(3)
6.8 Games, Afterstates, and Other Special Cases 156(1)
6.9 Summary 157(1)
6.10 Bibliographical and Historical Remarks 158(3)
III A Unified View 161(130)
7 Eligibility Traces 163(30)
7.1 n-Step TD Prediction 164(5)
7.2 The Forward View of TD(λ) 169(4)
7.3 The Backward View of TD(λ) 173(3)
7.4 Equivalence of Forward and Backward Views 176(3)
7.5 Sarsa(λ) 179(3)
7.6 Q(λ) 182(3)
7.7 Eligibility Traces for Actor-Critic Methods 185(1)
7.8 Replacing Traces 186(3)
7.9 Implementation Issues 189(1)
7.10 Variable λ 189(1)
7.11 Conclusions 190(1)
7.12 Bibliographical and Historical Remarks 191(2)
8 Generalization and Function Approximation 193(34)
8.1 Value Prediction with Function Approximation 194(3)
8.2 Gradient-Descent Methods 197(3)
8.3 Linear Methods 200(10)
8.4 Control with Function Approximation 210(6)
8.5 Off-Policy Bootstrapping 216(4)
8.6 Should We Bootstrap? 220(2)
8.7 Summary 222(1)
8.8 Bibliographical and Historical Remarks 223(4)
9 Planning and Learning 227(28)
9.1 Models and Planning 227(3)
9.2 Integrating Planning, Acting, and Learning 230(5)
9.3 When the Model Is Wrong 235(3)
9.4 Prioritized Sweeping 238(4)
9.5 Full vs. Sample Backups 242(4)
9.6 Trajectory Sampling 246(4)
9.7 Heuristic Search 250(2)
9.8 Summary 252(2)
9.9 Bibliographical and Historical Remarks 254(1)
10 Dimensions of Reinforcement Learning 255(6)
10.1 The Unified View 255(3)
10.2 Other Frontier Dimensions 258(3)
11 Case Studies 261(30)
11.1 TD-Gammon 261(6)
11.2 Samuel's Checkers Player 267(3)
11.3 The Acrobot 270(4)
11.4 Elevator Dispatching 274(5)
11.5 Dynamic Channel Allocation 279(4)
11.6 Job-Shop Scheduling 283(8)
References 291(22)
Summary of Notation 313(2)
Index 315

Supplemental Materials

What is included with this book?

The New copy of this book will include any supplemental materials advertised. Please check the title of the book to determine if it should include any access cards, study guides, lab manuals, CDs, etc.

The Used, Rental and eBook copies of this book are not guaranteed to include any supplemental materials. Typically, only the book itself is included. This is true even if the title states it includes any access cards, study guides, lab manuals, CDs, etc.