Feedback Systems and Reinforcement Learning
Sean Meyn
August 26, 2020
Pre-draft manuscript for the Simons RL Bootcamp

Welcome to the Simons Institute RL Bootcamp! This draft manuscript was conceived in March 2020. During the spring semester I was giving my stochastic control course, for which the final weeks always focus on topics in RL. Throughout the semester I was thinking ahead to the Simons RL crash course, and a similar one I was supposed to present in Berlin before a pandemic altered my travel plans. A publisher living far away contacted me in May to see if I had any plans for a book: "...someone mentioned that you were scheduled to deliver lectures on reinforcement learning..." The idea of a book came up, and I decided this must be destiny. You can learn more about my motivation and goals in the introduction.

This draft manuscript is a product of handouts I've prepared over more than a decade, bits and pieces of papers and book chapters prepared over an even longer period, and the longest time ever confined to home without travel. There were two themes in particular that I felt a need to spell out:

(i) There is an apparent paradox in Q-learning, considering the long history of convex analytic approaches to dynamic programming, as we will discuss in the September lectures. In contrast, there is nothing convex about the Q-learning algorithms used today. Some would object that the least-squares techniques and DQN are based on convex optimization (subject to linear function approximation). However, the root-finding problems they seek to solve are highly non-linear and not well understood. The paradox was addressed in [142], through a convex re-formulation of Q-learning. This summer I set to work to make this more accessible, and to show how the ideas are related to standard algorithms in RL.

(ii) Since I was a graduate student, I have enjoyed the idea of ODE methods for algorithm design, probably first inspired by the ODE method (a phrase coined by Ljung [134, 171]) in applications to adaptive control. We now know that ODE methods have amazing power in the design of optimization algorithms [193, 224, 181], and they offer enormous insight for algorithm design in general. The book chapter [70] and thesis [67] make this case in the context of RL, with much of the theory built upon the Newton-Raphson flow. This exotic ODE is universally stable, and provides a new tool for algorithm design.

Much of this theory seems to be highly technical. In particular, [70, 67] assume significant background in the theory of stochastic processes, mainly as a vehicle to explain the theory of stochastic approximation: the engine behind the ODE method. I wrote this book to lift the veil: many aspects of RL can be understood by a student with only a good grasp of calculus and matrix theory. In particular, there is nothing inherently stochastic about stochastic approximation, provided you are willing to work with sinusoids or other deterministic probing signals instead of stochastic processes (a small sketch of this idea appears below).

This is my third book, and like the others the writing came with discoveries. While working on theme (i), my colleague Prashant Mehta and I found that Convex Q-Learning could be made much more practical by borrowing batch RL concepts that are currently popular. This led to the new work [143]; you will find text and equations from this paper scattered throughout Chapters 3 and 5. Chapter 4 on ODE methods and quasi-stochastic approximation was to be built primarily on [24, 25].
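To give a taste of item (ii): below is a minimal sketch, in Python, of gradient-free optimization by quasi-stochastic approximation, in which the probing signal is a vector of sinusoids rather than random noise. It is only an illustration of the idea; the function name, objective, gain schedule, probing amplitude, and frequencies are illustrative choices, and the algorithms developed in Chapter 4 differ in their details.

# Minimal quasi-stochastic approximation (QSA) sketch: gradient-free
# minimization driven by sinusoidal probing instead of random noise.
# All names and tuning constants here are illustrative assumptions.
import numpy as np

def qsa_gradient_free(loss, theta0, horizon=100.0, dt=1e-3, eps=0.1, a0=1.0):
    """Euler discretization of the deterministic ODE
        d/dt theta = -a(t) * (1/eps) * xi(t)
                     * [loss(theta + eps*xi(t)) - loss(theta - eps*xi(t))],
    where xi(t) is a vector of sinusoids with distinct frequencies.
    Averaging over the probing signal gives a mean drift of roughly
    -a(t) * grad loss(theta), with no randomness anywhere."""
    theta = np.asarray(theta0, dtype=float).copy()
    freqs = 2.0 * np.pi * (1.0 + np.arange(theta.size))  # one frequency per coordinate
    for n in range(int(horizon / dt)):
        t = n * dt
        xi = np.sin(freqs * t)        # deterministic probing signal
        a_t = a0 / (1.0 + t)          # vanishing gain
        # symmetric difference: approximately 2*eps*<xi, grad loss(theta)>
        y = loss(theta + eps * xi) - loss(theta - eps * xi)
        # the time average of xi*xi^T is (1/2)*I, so the averaged update
        # moves theta along the negative gradient direction
        theta -= dt * a_t * (1.0 / eps) * xi * y
    return theta

# Example: the iterates settle near [1, -2], the minimizer of a quadratic.
if __name__ == "__main__":
    target = np.array([1.0, -2.0])
    quadratic = lambda th: 0.5 * float(np.sum((th - target) ** 2))
    print(qsa_gradient_free(quadratic, theta0=[0.0, 0.0]))

The symmetric difference of two function evaluations cancels the large oscillatory term proportional to the loss itself; averaging over the sinusoids then leaves a mean drift along the negative gradient, which is the sense in which deterministic probing replaces stochastic exploration.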
Over the summer all of this material was generalized to create a complete theory of convergence and convergence rates for these algorithms, along with a better understanding of their application to both gradient-free optimization and policy gradient techniques for RL. The chapter is currently incomplete because a manuscript is in preparation, to be uploaded to arXiv in September.

A quote from [66] came to mind while thinking about this preface: "...little more than a grab-bag of techniques which have been successfully applied to special situations and are therefore worth trying in sufficiently closely related settings". The authors were referring to the theory of Large Deviations: a highly applicable, but highly abstract branch of mathematics. The same statement applies to the field of reinforcement learning, and control theory, and perhaps any successful engineering sub-discipline. We need a big "grab-bag" because one technique cannot possibly solve every control problem in fields ranging from healthcare to supply chains to autonomous vehicles. We need theory for understanding, but we also need reassurance that tricks from the grab-bag have been successfully applied in at least a few special situations. For this reason, any book on reinforcement learning will have its own biases and eccentricities, reflecting those of the discipline and the author. I hope this book will be a bit less eccentric when it is finished in 2021!

Sean Meyn, August 26, 2020

Prerequisites for the control-RL crash course on September 3rd: please watch at least the first hour of Richard Murray's crash course on control, Feedback Control Theory: Architectures and Tools for Real-Time Decision Making: https://simons.berkeley.edu/talks/murray-control-1

Contents

Preface 1

1 Introduction 9
1.1 What You Can Find in Here 9
1.2 What's Missing? 12
1.3 Words of Thanks 13
1.4 Resources 13

I Fundamentals Without Noise 15

2 Control Crash Course 17
2.1 You Have a Control Problem 17
2.2 What To Do About It? 18
2.3 State Space Models 19
2.4 Stability and Performance 24
2.5 A Glance Ahead: From Control Theory to RL 36
2.6 How Can We Ignore Noise? 39
2.7 Examples 39
2.8 Exercises 50
2.9 Notes 53

3 Optimal Control 55
3.1 Value Function for Total Cost 55
3.2 Bellman Equation 56
3.3 Variations 63
3.4 Inverse Dynamic Programming 66
3.5 Bellman Equation is a Linear Program 68
3.6 Linear Quadratic Regulator 70
3.7 A Second Glance Ahead 72
3.8 Examples 73
3.9 Exercises 74
3.10 Notes 75

4 ODE Methods for Algorithm Design 77
4.1 Ordinary Differential Equations 77
4.2 A Brief Return to Reality 79
4.3 Newton-Raphson Flow 80
4.4 Optimization 82
4.5 Quasi-Stochastic Approximation 86
4.6 Gradient-Free Optimization 99
4.7 Quasi Policy Gradient Algorithms 103
4.8 Stability of ODEs* 107
4.9 Convergence theory for QSA* 113
4.10 Exercises 120
4.11 Notes 120

5 Value Function Approximations 123
5.1 Function Approximation Architectures 124
5.2 Exploration and ODE Approximations 132
5.3 TD-learning and SARSA 133
5.4 Projected Dynamic Programming and TD Algorithms 137
5.5 Convex Q-learning 144
5.6 Safety Constraints 153
5.7 Examples 153
5.8 Exercises 154
5.9 Notes 154

II Stochastic State Space Models 155

6 Markov Chains 157
6.1 Markov Models are State Space Models 158
6.2 Simple Examples 159
6.3 Spectra and Ergodicity 162
6.4 Poisson's Equation 165
6.5 Lyapunov functions 165
6.6 Simulation: Confidence Bounds & Control Variates 168
6.7 Exercises 169

7 Markov Decision Processes 177
7.1 Total Cost and Every Other Criterion 178
7.2 Computational Aspects of MDPs 180
7.3 Relative DP Equations 187
7.4 Inverse Dynamic Programming 188
7.5 Exercises 189
7.6 Notes 192

8 Examples 193
8.1 Fluid Models for Policy Approximation 193
8.2 LQG 194
8.3 Queues 194
8.4 Speed scaling 194
8.5 Contention for resources and instability 198
8.6 A Queueing Game 200
8.7 Controlling Rover