Algorithm for reinforcement learningRewardystemActionControllerFigure 1: The basic reinforcement learning scenariodescribe the core ideas together with a large number of state of the art algorithInsfollowed by the discussion of their theoretical properties and limitationsKeywords: reinforcement learning; Markov Decision Processes; temporal difference learning; stochastic appro imation; twio-timescale stochastic approximation, Monte-Carlo methods; simulation optimization; function appro.mimation; stochastic gradient methods; least-squares methods; overfitting; bias-variance tradeoff; online learning; active learning; plan-ning; simulation PAC-learning;Q-lcarning; actor-critic mcthods; policy gradient; naturalgradient1 OverviewReinforcement learning(RL) refers to both a learning problem and a subfield of machinelearning. As a learning problem, it refers to learning to control a system so as to maxi-mize some numerical value which represents a long-term objective. A typical setting wherereinforcement learning operates is shown in Figure If A controller receives the controlledsystem's state and a reward associated with the last state transition. It then calculates anaction which is sent back to the system. In response, the system makes a transition to anew state and the cycle is repeated. The problen is to learn a way of controlling the systernso as to maximize the total reward. The learning problems differ in the details of how thedata is collected and how performance is measuredIn this book, we assume that the system that we wish to control is stochastic. Furtherwe assume that the measurements available on the system's state are detailed enough sothat the the controller can avoid reasoning about how to collect information about thestate. Problems with these characteristics are best described in the framework of markovianDecision Processes (MDPs). The standard approach to 'solve MDPs is to use dynamicprogramming, which transforms the problem of finding a good controller into the problemof finding a good value function. However, apart from the simplest cases when the mdp hasvery few states and actions, dynamic programming is infeasible. The rl algorithms thatwe discuss here can be thought of as a way of turning the infeasible dynamic programmingmethods into practical algorithms so that they can be applied to large-scale problemsThere are two key ideas that allow Rl algorithms to achieve this goal. The first idea is touse samples to compactly represent the dynamics of the control problem. This is importantfor two reasons: First, it allows one to deal with learning scenarios when the dynamics isunknown. Second, even if the dynamics is available, exact reasoning that uses it mightbe intractable on its own. The second key idea behind Rl algorithms is to use powerfulfunction approximation methods to compactly represent value functions. The significanceof this is that it allows dealing with large, high-dimensional state- and action-spaces. Whatis more, the two ideas fit nicely together: Samples may be focused on a small subset of thspaces they belong to, which clever function approximation techniques might exploit. Itthe understanding of the interplay between dynamic programming, samples and functionapproximation that is at the heart of designing analyzing and applying rl algorithmsThe purpose of this book is to allow the reader to have a chance to peek into this beautifulfield. However, certainly we are not the first to set out to accomplish this goal. In 1996Kaelbling etahave written a nice, compact survey about the approaches and algorithmsavailable at the time(Kaelbling et al. 1996). This was followed by the publication of the bookby Bertsekas and Tsitsiklis(1996), which detailed the theoretical foundations. A few yearslater Sutton and Barto the fathers' of RL, published their book, where they presented theiideas on RL in a very clear and accessible manner(Sutton and Barto, 1998). A more recentand comprehensive overview of the tools and techniques of dynamic programming/optimalcontrol is given in the two-volume book by Bertsekas(2007a b) which devotes one chapterto RL methods At times, when a field is rapidly developing, books can get out of datepretty quickly. In fact, to keep up with the growing body of new results, Bertsekas maintainsan online version of his Chapter 6 of volume It of his book, which, at, the time of writingthis survey counted as much as 160 pages(Bertsekas, 2010). Other recent books on thesubject include the book of Gosavi(2003) who devotes 60 pages to reinforcement learningalgorithms in Chapter 9, concentrating on average cost problems, or that of Cao(2007) whofocuses on policy gradient methods. Powell (2007) presents the algorithms and ideas from anoperations research perspective and emphasizes methods that are capable of handling largeIIn this book, RL is called neuro-dynamic programming or approximate dynamic programming. Theterm neuro-dynamic programming stems from the fact that, in many cases, RL algorithms are used withartificial neural networks4control spaces, Chang et al2008) focuses on adaptive sampling (i.e, simulation-basedperformance optimization), while the center of the recent book by Busoniu et al. (2010) isfunction approximationThus, by no means do Rl researchers lack a good body of literature. However, what seemsto be missing is a self-contained and yet relatively short summary that can help newcomersto the field to develop a good sense of the state of the art as well as existing researchers tobroaden their overview of the field, an article, similar to that of Kaelbling et al. (1996),butwith an updated contents. To fill this gap is the very purpose of this short bookHaving the goal of keeping the text short, we had to make a few, hopefully, not too troupling compromises. The first compromise we made was to present results only for the totalexpected discounted reward criterion. This choice is motivated by that this is the criterionthat is both widely used and the easiest to deal with mathematically. The next compro-mise is that the background on MDPs and dynamic programming is kept ultra-compact(although an appendix is added that, explains these basic results). Apart from these, thbook aims to cover a bit of all aspects of Rl, up to the level that the reader should becble to understand the whats and hows, as well as to implement the algorithms presentedNaturally, we still had to be selective in what we present. Here. the decision was to focuson the basic algorithms, ideas, as well as the available theory. Special attention was paid todescribing the choices of the user. as well as the tradeoffs that come with these. We triedto be impartial as much as possible, but some personal bias, as usual, surely remained. Thepseudocode of almost twenty algorithms was included hoping that this will make it easierfor the practically inclined reader to implement the algorithms describedThe target audience is advanced undergaduate and graduate students, as well as researchersand practitioners who want to get a good overview of the state of the art in Rl quicklyResearchers who are already working on RL might also enjoy reading about parts of the rlliterature that they are not so familiar with, thus broadening their perspective on RL. Thereader is assumed to be familiar with the basics of linear algebra, calculus, and probabilitytheory. In particular, we assume that the reader is familiar with the concepts of randomvariables, conditional expectations, and markov chains. It is helpful, but not necessaryfor the reader to be familiar with statistical learning theory, as the essential concepts willbe explained as needed. In some parts of the book, knowledge of regression techniques ofmachine learning will be usefulThis book has three parts. In the first part, in Section 2 we provide the necessary back-ground. It is here where the notation is introduced, followed by a short overview of thetheory of Markov Decision Processes and the description of the basic dynamic programmingalgorithms. Readers familiar with MDPs and dynamic programming should skim throughthis part to familiarize themselves with the notation used. Readers, who are less familiarwith MDPs, must spend enough time here before moving on because the rest of the bookbuilds heavily on the results and ideas presented hereThe remaining two parts are devoted to the two basic RL problems(cf. Figure I), one partdevoted to each. In Section 3 the problem of learning to predict values associated withstates is studied. We start by explaining the basic ideas for the So-called tabular case whenthe MDp is small enough so that one can store one value per state in an array allocated ina computer's main memory. The first algorithm explained is TD(), which can be viewedas the learning analogue to value iteration from dynamic programming. After this, weconsider the more challenging situation when there are more states than what fits into acomputer's memory. Clearly, in this case, one must compress the table representing thevalues. Abstractly, this can be done by relying on an appropriate function approxinationmethod. First, we describe how TD()can be used in this situation. This is followed by thedescription of some new gradient based methods(GTD2 and TDC), which can be viewedas improved versions of TD() in that they avoid some of the convergence difficulties thatTD() faces. We then discuss least-squares methods (in particular, LSTD() and A-LSPE)and compare them to the incremental methods described earlier. Finally, we describe choicesavailable for implementing function approximation and the tradeoffs that these choices comeThe second part( Section 4) is devoted to algorithms that are developed for control learningFirst, we describe methods whose goal is optimizing online performance. In particular,we describe the "optimism in the face of uncertainty"" principle and methods that exploretheir environment based on this principle. State of the art algorithms are given both forbandit problems and MDPs. The message here is that clever exploration methods makea large difference, but more work is needed to scale up the available methods to largeproblems. The rest of this section is devoted to methods that aim at developing methodsthat can be used in large-scale applications. As learning in large-scale MdPs is significantlymore difficult than learning when the Mdp is small, the goal of learning is relaxed tolearning a good enough policy in the limit. First, direct methods are discussed which aim atstimating the optimal action-values directly. These can be viewed as the learning analogueof value iteration of dynamic programming This is followed by the description of actor-critic methods, which can be thought of as the counterpart of the policy iteration algorithmof dynamic programming. Both methods based on direct policy improvement and policygradient (i. e, which use parametric policy classes) are presentedThe book is concluded in Sectionwhich lists some topics for further explorationPredictionValue iterationPolicy iterationPolicy searchFigure 2: Types of reinforcement problems and approaches2 Markov decision processesThe purpose of this section is to introduce the notation that will be used in the subsequentparts and the most essential facts that we will need from the theory of Markov DecisionProcesses (dps) in the rest of the book. Readers familiar with MDPs should skim throughthis section to familiarize themselves with the notation. Readers unfamiliar with mdps aresuggested to spend enough time with this section to understand the details. Proofs of mostof the results(with sone simplifications)are included in Appendix Al The reader who isinterested in learning more about MDPs is suggested to consult one of the many excellentbooks on the subject, such as the books of Bertsekas and Shreve(1978), Puterman(199.or the two-volume book by Bertsekas(2007a b)2.1 PreliminariesWe use n to denote the set of natural numbers: n=0, 1, 2, .., while R denotes the setof reals. By a vector v(unless it is transposed, v), we mean a column vector. The innerproduct of two finite-dimensional vectors. u, vE Rd is (u,v)=2ir u; vj. The resulting 2norm is u2=(u, u). The maximun norIn for vectors is defined by halloo=maxi-L. uilwhile for a function f:→R,‖· loo is defined by‖川=supn∈x|f(x). A mappingT between the metric spaces(Mi,d1),(M2, d2) is called Lipschitz with modulus L E R iffor any a, bE M1, d2(T(a), T(b))sldi(a, b). If T is Lipschitz with a modulus Lofor any ( c, a)ExxA, IR(c,a)0)+ CAt. Thus, there is a ficed entry cost K of ordering nonzero items andeach item must be purchased at a ficed price c. Here K,c>0. In addition, there is a cost oholding an inventory of size x>0. In the simplest case, this cost is proportional to the sizeMathematically, a behavior is an infinite sequence of probability kernels To, T1Tt maps histories of length t to a probability distribution over the action space 4.", where丌t(·{xo,ao,70,…of the inventory with proportionality factor h>0. Finally, upon selling x units the manageris paid the monetary amount of p z, where p>0. In, order to make the problern interestingwe must have p>h, otherwise there is no incentive to order new itemsThis problem can be represented as an MDP as follows: Let the state Xt on day>o be thesize of the inventory in the evening of that dag. Thus, &=0, 1,., M, where ME n isthe macimum inventory size. The action At gives the number of items ordered in the eveningof day t. Thus, we can choose A=10, 1,..., M since there is no need to consider orderslarger than the inventory size. Given Xt and At, the size of the nect inventory is given b(X+A)AM-D2:1)where an b is a shorthand notation for the minimum of the numbers a, b,(a)=avomax(a, 0)is the positive part of a, and Dt+1 E n is the demand on the (t+ 1)th dayBy assumption,(Dt; t>0 is a sequence of independent and identically distributed (i.i. dinteger-valued random variables. The revenue made on day t+lR++1=-KIAt0-c((Xt+ At)AM- Xthx+p(X+4)∧M-Xt+1)Equations(2)-(3) can be written in the compact form(x1+1,R+1)=f(X2A+,D+1),with an appropriately chosen function f. Then, Po is given byPo(U|x,a)=P((a,D)∈U)=∑I(ma=pD()Here pp(is the probability mass function of the random demands and d pp(. Thisfinishes the definition of the MDP underlying the inventory optimization problemnventory control is just one of the many operations research problems that give rise toan MDP. Other problems include optimizing transportation systems, optimizing schedulesor production. MDPs arise naturally in many engineering optimal control problems. toosuch as the optimal control of chemical, electronic or mechanical systems(the latter classincludes the problem of controlling robots). Quite a few information theory problems canalso be represented as MDPs(.g, optimal coding optimizing channel allocation, or sensornetworks). Another important class of problems comes from finance. These include, amongstothers, optimal portfolio management and option pricing