Feb 16, 2021
Introduction
6.246: Reinforcement Learning: Foundations and Methods
Cathy Wu

Sources
• Adapted from Alessandro Lazaric (FAIR/INRIA)

Why: Important Problems
• Autonomous robotics: elder care, connected and automated vehicles, exploration of dangerous environments, robotics for entertainment
• Finance: trading execution algorithms, portfolio management, option pricing
• Resource management: inventory management, pricing, revenue management; energy grid integration, market regulation, production management; vehicle routing, ridesharing, traffic management; network resource allocation, routing, scheduling
• Recommender systems: web advertising, product / news recommendation, mobile health (adaptive treatments), MOOCs
• Games: board games, computer games
(Playing Atari with Deep Reinforcement Learning, Mnih et al., 2013; Mastering the game of Go with deep neural networks and tree search, Silver, Huang, et al., Nature, 2016)

What: Dynamic Programming
"[We] consider systems where decisions are made in stages. The outcome of each decision is not fully predictable but can be anticipated to some extent before the next decision is made. Each decision results in some immediate cost but also affects the context in which future decisions are to be made and therefore affects the cost incurred in future stages. We are interested in decision making policies that minimize the total cost over a number of stages. Such problems are challenging primarily because of the tradeoff between immediate and future costs. Dynamic programming (DP) provides a mathematical formalization of this tradeoff."
— Bertsekas and Tsitsiklis (1996)
[Margin notes: stochastic system (transition probabilities; state s; decision/control at time t); feedback control policy; minimize the total cost (or, as in this course, maximize the total reward); exploration vs. exploitation dilemma.]
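To make the quoted tradeoff concrete, here is a minimal finite-horizon DP sketch in Python. The two-state, two-action system, its transition probabilities, and its rewards are illustrative assumptions, not an example from the course; the point is the backward recursion that balances immediate and future reward.

```python
# A minimal finite-horizon dynamic programming sketch on a toy problem.
# All numbers below are illustrative assumptions.
import numpy as np

n_states, n_actions, horizon = 2, 2, 5
# P[s, a, s'] = probability of moving to s' after taking action a in state s
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
# r[s, a] = immediate reward (we maximize, per the course's convention)
r = np.array([[1.0, 0.0],
              [0.5, 2.0]])

V = np.zeros(n_states)      # value at the terminal stage
for t in reversed(range(horizon)):
    # Q[s, a] = immediate reward + expected value of the remaining stages
    Q = r + P @ V
    V = Q.max(axis=1)       # act optimally at each stage
print("optimal 5-stage values per start state:", V)
```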
What: Reinforcement Learning
Also known as approximate dynamic programming (ADP); we will use these terms more-or-less interchangeably.

"Reinforcement learning is learning how to map states to actions so as to maximize a numerical reward signal in an unknown and uncertain environment. In the most interesting and challenging cases, actions affect not only the immediate reward but also the next situation and all subsequent rewards (delayed reward). The agent is not told which actions to take but it must discover which actions yield the most reward by trying them (trial-and-error)."
— Sutton and Barto (1998)
[Figure: the RL agent sends action a_t to the environment; the environment returns reward r_t and state s_t to the agent.]
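The interaction loop in the figure takes only a few lines of Python. This sketch assumes a hypothetical environment object with a gym-style reset()/step() interface and an `actions` attribute; none of these names come from the course.

```python
# A minimal sketch of the agent-environment loop, with a random
# (pure trial-and-error) agent. `env` is a hypothetical object.
import random

def run_episode(env, n_steps=100):
    state = env.reset()                        # observe initial state s_0
    total_reward = 0.0
    for t in range(n_steps):
        action = random.choice(env.actions)    # agent picks action a_t
        state, reward, done = env.step(action) # env returns r_t and s_{t+1}
        total_reward += reward                 # delayed rewards accumulate
        if done:
            break
    return total_reward
```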
Super-Human Performance Playing ATARI Games
[Figure: an image encoder, a multilayer neural network, maps each game frame to an embedding Z_t; an action predictor, also a multilayer neural network, maps Z_t to an action. Adapted from Pulkit Agrawal.]

Similarity to Image Classification
The architecture mirrors image classification: a class predictor maps the embedding Z_t of an image to a label (Cat? Dog?), and the correct label (CAT) is supplied in the training data. The action predictor has the same structure, but how do we get the "action" label? Two answers: reinforcement learning methods, or behavior cloning (supervised learning on expert demonstrations), as sketched below.
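A minimal behavior-cloning sketch, assuming PyTorch (the slides do not prescribe a framework) and a hypothetical six-action ATARI-style setup; all layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Encoder: 84x84 grayscale frame -> embedding Z_t (sizes are assumptions)
encoder = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=8, stride=4), nn.ReLU(),   # -> 16 x 20 x 20
    nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),  # -> 32 x 9 x 9
    nn.Flatten(),                                           # -> 2592
)
# Action predictor: Z_t -> logits over 6 hypothetical ATARI actions
action_predictor = nn.Sequential(
    nn.Linear(32 * 9 * 9, 128), nn.ReLU(), nn.Linear(128, 6),
)
loss_fn = nn.CrossEntropyLoss()
params = list(encoder.parameters()) + list(action_predictor.parameters())
opt = torch.optim.Adam(params, lr=1e-4)

def bc_step(frames, expert_actions):
    """One behavior-cloning step: supervised learning, labels = expert actions.
    frames: (B, 1, 84, 84) float tensor; expert_actions: (B,) long tensor."""
    logits = action_predictor(encoder(frames))
    loss = loss_fn(logits, expert_actions)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```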
What: the Highlights of the Course
How to model DP & RL problems
• What: problem space, deterministic vs. Markov decision processes, imperfect information
• Tools: probability, stochastic processes, Markov chains

How to solve DP & RL problems exactly
• What: Bellman equations, dynamic programming algorithms
• Tools: optimality principle, fixed-point operators

How to solve DP & RL problems incrementally
• What: Monte Carlo, temporal difference (TD), Q-learning
• Tools: stochastic approximation, max-norm contraction analysis

How to solve DP & RL problems approximately
• What: approximate RL (TD-based methods, policy space methods, deep RL)
• Tools: function approximation, Lyapunov function analysis, deep learning
With (simple!) examples from resource optimization, control systems, and computer games.
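As a small preview of the "incremental" toolbox above, here is a minimal tabular Q-learning sketch. The five-state chain environment and every constant in it are illustrative assumptions; the course develops the stochastic-approximation analysis that justifies the update.

```python
# Minimal tabular Q-learning on a hypothetical 5-state chain.
import random
import numpy as np

n_states, n_actions = 5, 2           # toy chain; actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))  # tabular action-value estimates
alpha, gamma, eps = 0.1, 0.95, 0.1   # step size, discount, exploration rate

def step(s, a):
    """Deterministic chain dynamics: reward 1 only for reaching the right end."""
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s2, float(s2 == n_states - 1)

for episode in range(500):
    s = 0
    for t in range(20):
        # epsilon-greedy: mostly exploit current estimates, sometimes explore
        a = random.randrange(n_actions) if random.random() < eps else int(Q[s].argmax())
        s2, r = step(s, a)
        # TD update: nudge Q(s, a) toward the one-step bootstrapped target
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2
print(Q)  # right-moving actions should dominate after training
```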
How: Textbooks and readings
(a) Required: Dynamic Programming and Optimal Control (2007), Vol. I, 4th Edition, ISBN-13: 978-1-886529-43-4, by Dimitri P. Bertsekas. [DPOC]
(b) The second volume of the text is a useful and comprehensive reference. It is recommended, but not required.
(c) Required: Neuro-Dynamic Programming (1996) by Dimitri P. Bertsekas and John N. Tsitsiklis. [NDP]

A note on notation: we will be using contemporary notation (e.g., s, a, V), which differs from the notation in these texts (e.g., x, u, J). We will also be maximizing instead of minimizing, etc.
How: Prerequisites
(a) Solid knowledge of undergraduate probability (6.041A & 6.041B)
(b) Mathematical maturity and the ability to write down precise and rigorous arguments
(c) Exposure to convex optimization / analysis and linear algebra (18.06)
(d) Python programming
(e) Analysis of algorithms (6.006, 6.046)

We will issue a HW0 (not graded) to help you gauge your level of familiarity with the prerequisite material. If you are not comfortable with HW0, consider taking this course another time.
When/What/Where
Lecture: TR 1-2:30pm (Zoom)
Instructor: Cathy Wu

Course pointers
• web.mit.edu/6.246/www/ (materials & info)
• https://canvas.mit.edu/courses/7560 (Zoom links/recordings)
• piazza.com/mit/spring2021/6246 (announcements, solutions, collaboration)
• gradescope.com/courses/246411 (sign up with your @mit.edu email)
• psetpartners.mit.edu (to find pset partners)

When/What/Where
5 homework assignments (25%) – roughly 1 hw per 2-3 lectures; 1.5 weeks/hw
Lecture presentation + lecture note (5%) – 2nd half of semester (details forthcoming)
1 in-class quiz (25%) – coverage: first 12 lectures
Class project + final project presentation (35%)
Class participation (10%) – asynchronous discussion (on Piazza); participation during lecture; answering questions on Piazza; attending office hours and recitation
Lecture scribing (extra credit, up to 5%)
Homeworks: Late homework will be penalized 10% every 24 hours. Solutions for homework will be released shortly after the deadline (late submitters must abide by the honor code).
Caution & Opportunity
Caution: We are planning under uncertainty. Course details are subject to change. Your cooperation is appreciated.
Opportunity: The world has seen some changes in the last few decades, including:
• recent advances in machine learning and reinforcement learning
• newly available data sources & compute
• open source everything (GitHub, arXiv)
• new potential impact areas for DP & RL
What is an appropriate foundational course to support and advance research and practice in sequential decision making?
Why: A multi-disciplinary field
[Figure: Venn diagram situating reinforcement learning at the intersection of A.I., statistics, applied math, optimal control, neuroscience, cognitive sciences, and psychology, with related areas including clustering, statistical learning, learning theory, neural networks, approximation theory, dynamic programming, automatic control, categorization, and active learning. 6.231 (dynamic programming & stochastic control), the 6.246 core lectures, and the 6.246 special topics cover different regions of the diagram. Note: circles may not be to scale. Credit: Alessandro Lazaric.]
Sampling of special topics (from Spring 2020)
• Finite Sample Analysis
• RL for Combinatorial Optimization
• Model-based Planning and Policy Learning
• Exploration vs. Exploitation in Bandits
• Exploration vs. Exploitation in Deep RL
• Multi-agent RL / Game Theory
• Multi-agent Deep RL
• Off-policy Learning
• Transfer Learning / Curriculum Learning
• Variance Reduction Techniques
• Artificial Life / Evolutionary Strategy

Why: Parable of the blind men and the elephant
Bibliography
Bertsekas, D. and Tsitsiklis, J. (1996). Neuro-Dynamic Programming. Athena Scientific, Belmont, MA.
Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. A Bradford Book, MIT Press, Cambridge, MA.