Preface to the second edition

The nearly twenty years since the publication of the first edition of this book have seen tremendous progress in artificial intelligence, propelled in large part by advances in machine learning, including advances in reinforcement learning. Although the impressive computational power that became available ...

Contents

Preface to the first edition
Preface to the second edition
Summary of Notation

1 The Reinforcement Learning Problem
1.1 Reinforcement Learning
1.2 Examples
1.3 Elements of Reinforcement Learning
1.4 Limitations and Scope
1.5 An Extended Example: Tic-Tac-Toe
1.6 Summary
1.7 History of Reinforcement Learning

I Tabular Solution Methods

2 Multi-arm Bandits
2.1 A k-Armed Bandit Problem
2.2 Action-Value Methods
2.3 The 10-armed Testbed
2.4 Incremental Implementation
2.5 Tracking a Nonstationary Problem
2.6 Optimistic Initial Values
2.7 Upper-Confidence-Bound Action Selection
2.8 Gradient Bandit Algorithms
2.9 Associative Search (Contextual Bandits)
2.10 Summary

3 Finite Markov Decision Processes
3.1 The Agent-Environment Interface
3.2 Goals and Rewards
3.3 Returns
3.4 Unified Notation for Episodic and Continuing Tasks
3.5 *The Markov Property
3.6 Markov Decision Processes
3.7 Value Functions
3.8 Optimal Value Functions
3.9 Optimality and Approximation
3.10 Summary

4 Dynamic Programming
4.1 Policy Evaluation
4.2 Policy Improvement
4.3 Policy Iteration
4.4 Value Iteration
4.5 Asynchronous Dynamic Programming
4.6 Generalized Policy Iteration
4.7 Efficiency of Dynamic Programming
4.8 Summary

5 Monte Carlo Methods
5.1 Monte Carlo Prediction
5.2 Monte Carlo Estimation of Action Values
5.3 Monte Carlo Control
5.4 Monte Carlo Control without Exploring Starts
5.5 Off-policy Prediction via Importance Sampling
5.6 Incremental Implementation
5.7 Off-policy Monte Carlo Control
5.8 *Discounting-Aware Importance Sampling
5.9 *Per-Reward Importance Sampling
5.10 Off-policy Returns
5.11 Summary

6 Temporal-Difference Learning
6.1 TD Prediction
6.2 Advantages of TD Prediction Methods
6.3 Optimality of TD(0)
6.4 Sarsa: On-policy TD Control
6.5 Q-learning: Off-policy TD Control
6.6 Expected Sarsa
6.7 Maximization Bias and Double Learning
6.8 Games, Afterstates, and Other Special Cases
6.9 Summary

7 Multi-step Bootstrapping
7.1 n-step TD Prediction
7.2 n-step Sarsa
7.3 n-step Off-policy Learning by Importance Sampling
7.4 *Per-reward Off-policy Methods
7.5 Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm
7.6 *A Unifying Algorithm: n-step Q(σ)
7.7 Summary

8 Planning and Learning with Tabular Methods
8.1 Models and Planning
8.2 Dyna: Integrating Planning, Acting, and Learning
8.3 When the Model Is Wrong
8.4 Prioritized Sweeping
8.5 Full vs. Sample Backups
8.6 Trajectory Sampling
8.7 Real-Time Dynamic Programming
8.8 Planning at Decision Time
8.9 Heuristic Search
8.10 Rollout Algorithms
8.11 Monte Carlo Tree Search
8.12 Summary

II Approximate Solution Methods

9 On-policy Prediction with Approximation
9.1 Value-function Approximation
9.2 The Prediction Objective (MSVE)
9.3 Stochastic-gradient and Semi-gradient Methods
9.4 Linear Methods
9.5 Feature Construction for Linear Methods
9.5.1 Polynomials
9.5.2 Fourier Basis
9.5.3 Coarse Coding
9.5.4 Tile Coding
9.5.5 Radial Basis Functions
9.6 Nonlinear Function Approximation: Artificial Neural Networks
9.7 Least-Squares TD
9.8 Memory-based Function Approximation
9.9 Kernel-based Function Approximation
9.10 Looking Deeper at On-policy Learning: Interest and Emphasis
9.11 Summary

10 On-policy Control with Approximation
10.1 Episodic Semi-gradient Control
10.2 n-step Semi-gradient Sarsa
10.3 Average Reward: A New Problem Setting for Continuing Tasks
10.4 Deprecating the Discounted Setting
10.5 n-step Differential Semi-gradient Sarsa
10.6 Summary

11 Off-policy Methods with Approximation
11.1 Semi-gradient Methods
11.2 Examples of Off-policy Divergence
11.3 The Deadly Triad
11.4 Linear Value-function Geometry
11.5 Stochastic Gradient Descent in the Bellman Error
11.6 Learnability of the Bellman Error
11.7 Gradient-TD Methods
11.8 Emphatic-TD Methods
11.9 Reducing Variance
11.10 Summary

12 Eligibility Traces
12.1 The λ-return
12.2 TD(λ)
12.3 n-step Truncated λ-return Methods
12.4 Redoing Updates: The Online λ-return Algorithm
12.5 True Online TD(λ)
12.6 Dutch Traces in Monte Carlo Learning
12.7 Sarsa(λ)
12.8 Variable λ and γ
12.9 Off-policy Eligibility Traces
12.10 Watkins's Q(λ) to Tree-Backup(λ)
12.11 Stable Off-policy Methods with Traces
12.12 Implementation Issues
12.13 Conclusions

13 Policy Gradient Methods
13.1 Policy Approximation and its Advantages
13.2 The Policy Gradient Theorem
13.3 REINFORCE: Monte Carlo Policy Gradient
13.4 REINFORCE with Baseline
13.5 Actor-Critic Methods
13.6 Policy Gradient for Continuing Problems
13.7 Policy Parameterization for Continuous Actions
13.8 Summary

III Looking Deeper

14 Psychology
14.1 Prediction and Control
14.2 Classical Conditioning
14.2.1 Blocking and Higher-Order Conditioning
14.2.2 The Rescorla-Wagner Model
14.2.3 The TD Model
14.2.4 TD Model Simulations
14.3 Instrumental Conditioning
14.4 Delayed Reinforcement
14.5 Cognitive Maps
14.6 Habitual and Goal-Directed Behavior
14.7 Summary

15 Neuroscience
15.1 Neuroscience Basics
15.2 Reward Signals, Reinforcement Signals, Values, and Prediction Errors
15.3 The Reward Prediction Error Hypothesis
15.4 Dopamine
15.5 Experimental Support for the Reward Prediction Error Hypothesis
15.6 TD Error/Dopamine Correspondence
15.7 Neural Actor-Critic
15.8 Actor and Critic Learning Rules
15.9 Hedonistic Neurons
15.10 Collective Reinforcement Learning
15.11 Model-Based Methods in the Brain
15.12 Addiction
15.13 Summary

16 Applications and Case Studies
16.1 TD-Gammon
16.2 Samuel's Checkers Player
16.3 The Acrobot
16.4 Watson's Daily-Double Wagering
16.5 Optimizing Memory Control
16.6 Human-Level Video Game Play
16.7 Mastering the Game of Go
16.8 Personalized Web Services
16.9 Thermal Soaring

17 Frontiers

References

Preface to the first edition

We first came to focus on what is now known as reinforcement learning in late 1979. We were both at the University of Massachusetts, working on one of the earliest projects to revive the idea that networks of neuronlike adaptive elements might prove to be a promising approach to artificial adaptive intelligence.
The project explored the "heterostatic theory of adaptive systems" developed by A. Harry Klopf. Harry's work was a rich source of ideas, and we were permitted to explore them critically and compare them with the long history of prior work in adaptive systems. Our task became one of teasing the ideas apart and understanding their relationships and relative importance. This continues today, but in 1979 we came to realize that perhaps the simplest of the ideas, which had long been taken for granted, had received surprisingly little attention from a computational perspective. This was simply the idea of a learning system that wants something, that adapts its behavior in order to maximize a special signal from its environment. This was the idea of a "hedonistic" learning system, or, as we would say now, the idea of reinforcement learning.

Like others, we had a sense that reinforcement learning had been thoroughly explored in the early days of cybernetics and artificial intelligence. On closer inspection, though, we found that it had been explored only slightly. While reinforcement learning had clearly motivated some of the earliest computational studies of learning, most of these researchers had gone on to other things, such as pattern classification, supervised learning, and adaptive control, or they had abandoned the study of learning altogether. As a result, the special issues involved in learning how to get something from the environment received relatively little attention. In retrospect, focusing on this idea was the critical step that set this branch of research in motion. Little progress could be made in the computational study of reinforcement learning until it was recognized that such a fundamental idea had not yet been thoroughly explored.

The field has come a long way since then, evolving and maturing in several directions. Reinforcement learning has gradually become one of the most active research areas in machine learning, artificial intelligence, and neural network research. The field has developed strong mathematical foundations and impressive applications. The computational study of reinforcement learning is now a large field, with hundreds of active researchers around the world in diverse disciplines such as psychology, control theory, artificial intelligence, and neuroscience. Particularly important have been the contributions establishing and developing the relationships to the theory of optimal control and dynamic programming. The overall problem of learning from interaction to achieve goals is still far from being solved, but our understanding of it has improved significantly. We can now place component ideas, such as temporal-difference learning, dynamic programming, and function approximation, within a coherent perspective with respect to the overall problem.

Our goal in writing this book was to provide a clear and simple account of the key ideas and algorithms of reinforcement learning. We wanted our treatment to be accessible to readers in all of the related disciplines, but we could not cover all of these perspectives in detail. For the most part, our treatment takes the point of view of artificial intelligence and engineering. Coverage of connections to other fields we leave to others or to another time. We also chose not to produce a rigorous formal treatment of reinforcement learning. We did not reach for the highest possible level of mathematical abstraction and did not rely on a theorem-proof format.
We tried to choose a level of mathematical detail that points the mathematically inclined in the right directions without distracting from the simplicity and potential generality of the underlying ideas.

The book is largely self-contained. The only mathematical background assumed is familiarity with elementary concepts of probability, such as expectations of random variables. Chapter 9 is substantially easier to digest if the reader has some knowledge of artificial neural networks or some other kind of supervised learning method, but it can be read without prior background. We strongly recommend working the exercises provided throughout the book. Solution manuals are available to instructors. This and other related and timely material is available via the Internet.

At the end of most chapters is a section entitled "Bibliographical and Historical Remarks," wherein we credit the sources of the ideas presented in that chapter, provide pointers to further reading and ongoing research, and describe relevant historical background. Despite our attempts to make these sections authoritative and complete, we have undoubtedly left out some important prior work. For that we apologize, and welcome corrections and extensions for incorporation into a subsequent edition.

In some sense we have been working toward this book for thirty years, and we have lots of people to thank. First, we thank those who have personally helped us develop the overall view presented in this book: Harry Klopf, for helping us recognize that reinforcement learning needed to be revived; Chris Watkins, Dimitri Bertsekas, John Tsitsiklis, and Paul Werbos, for helping us see the value of the relationships to dynamic programming; John Moore and Jim Kehoe, for insights and inspirations from animal learning theory; Oliver Selfridge, for emphasizing the breadth and importance of adaptation; and, more generally, our colleagues and students who have contributed in countless ways: Ron Williams, Charles Anderson, Satinder Singh, Sridhar Mahadevan, Steve Bradtke, Bob Crites, Peter Dayan, and Leemon Baird. Our view of reinforcement learning has been significantly enriched by discussions with Paul Cohen, Paul Utgoff, Martha Steenstrup, Gerry Tesauro, Mike Jordan, Leslie Kaelbling, Andrew Moore, Chris Atkeson, Tom Mitchell, Nils Nilsson, Stuart Russell, Tom Dietterich, Tom Dean, and Bob Narendra. We thank Michael Littman,