Feb 16, 2021
Introduction
6.246: Reinforcement Learning: Foundations and Methods
Cathy Wu

Sources
• Adapted from Alessandro Lazaric (FAIR/INRIA)

Why: Important Problems
• Autonomous robotics: elder care, connected and automated vehicles, exploration of dangerous environments, robotics for entertainment
• Finance: trading execution algorithms, portfolio management, option pricing
• Resource management: inventory management, pricing, revenue management; energy grid integration, market regulation, production management; vehicle routing, ridesharing, traffic management; network resource allocation, routing, scheduling
• Recommender systems: web advertising, product / news recommendation, mobile health (adaptive treatments), MOOCs
• Games: board games, computer games
(Playing Atari with Deep Reinforcement Learning, Mnih et al., 2013; Mastering the game of Go with deep neural networks and tree search, Silver, Huang, et al., Nature, 2016)

What: Dynamic Programming
"[We] consider systems where decisions are made in stages. The outcome of each decision is not fully predictable but can be anticipated to some extent before the next decision is made. Each decision results in some immediate cost but also affects the context in which future decisions are to be made and therefore affects the cost incurred in future stages. We are interested in decision making policies that minimize the total cost over a number of stages. Such problems are challenging primarily because of the tradeoff between immediate and future costs. Dynamic programming (DP) provides a mathematical formalization of this tradeoff."
— Bertsekas and Tsitsiklis (1996)
[Margin notes: stochastic system (transition probabilities; state s; decision/control at time t); feedback control policy; minimize the total cost (or, as in this course, maximize the total reward); exploration vs. exploitation dilemma.]
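To make the quoted tradeoff concrete, here is a minimal finite-horizon DP sketch in Python. The two-state, two-action system, its transition probabilities, and its rewards are illustrative assumptions, not an example from the course; the point is the backward recursion that balances immediate and future reward.

```python
# A minimal finite-horizon dynamic programming sketch on a toy problem.
# All numbers below are illustrative assumptions.
import numpy as np

n_states, n_actions, horizon = 2, 2, 5
# P[s, a, s'] = probability of moving to s' after taking action a in state s
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
# r[s, a] = immediate reward (we maximize, per the course's convention)
r = np.array([[1.0, 0.0],
              [0.5, 2.0]])

V = np.zeros(n_states)      # value at the terminal stage
for t in reversed(range(horizon)):
    # Q[s, a] = immediate reward + expected value of the remaining stages
    Q = r + P @ V
    V = Q.max(axis=1)       # act optimally at each stage
print("optimal 5-stage values per start state:", V)
```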
What: Reinforcement Learning
Also known as approximate dynamic programming (ADP); we will use these terms more-or-less interchangeably.

"Reinforcement learning is learning how to map states to actions so as to maximize a numerical reward signal in an unknown and uncertain environment. In the most interesting and challenging cases, actions affect not only the immediate reward but also the next situation and all subsequent rewards (delayed reward). The agent is not told which actions to take but it must discover which actions yield the most reward by trying them (trial-and-error)."
— Sutton and Barto (1998)
[Figure: the RL agent sends action a_t to the environment; the environment returns reward r_t and state s_t to the agent.]
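The interaction loop in the figure takes only a few lines of Python. This sketch assumes a hypothetical environment object with a gym-style reset()/step() interface and an `actions` attribute; none of these names come from the course.

```python
# A minimal sketch of the agent-environment loop, with a random
# (pure trial-and-error) agent. `env` is a hypothetical object.
import random

def run_episode(env, n_steps=100):
    state = env.reset()                        # observe initial state s_0
    total_reward = 0.0
    for t in range(n_steps):
        action = random.choice(env.actions)    # agent picks action a_t
        state, reward, done = env.step(action) # env returns r_t and s_{t+1}
        total_reward += reward                 # delayed rewards accumulate
        if done:
            break
    return total_reward
```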
Super-Human Performance Playing ATARI Games
[Figure: an image encoder, a multilayer neural network, maps each game frame to an embedding Z_t; an action predictor, also a multilayer neural network, maps Z_t to an action. Adapted from Pulkit Agrawal.]

Similarity to Image Classification
The architecture mirrors image classification: a class predictor maps the embedding Z_t of an image to a label (Cat? Dog?), and the correct label (CAT) is supplied in the training data. The action predictor has the same structure, but how do we get the "action" label? Two answers: reinforcement learning methods, or behavior cloning (supervised learning on expert demonstrations), as sketched below.
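A minimal behavior-cloning sketch, assuming PyTorch (the slides do not prescribe a framework) and a hypothetical six-action ATARI-style setup; all layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Encoder: 84x84 grayscale frame -> embedding Z_t (sizes are assumptions)
encoder = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=8, stride=4), nn.ReLU(),   # -> 16 x 20 x 20
    nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),  # -> 32 x 9 x 9
    nn.Flatten(),                                           # -> 2592
)
# Action predictor: Z_t -> logits over 6 hypothetical ATARI actions
action_predictor = nn.Sequential(
    nn.Linear(32 * 9 * 9, 128), nn.ReLU(), nn.Linear(128, 6),
)
loss_fn = nn.CrossEntropyLoss()
params = list(encoder.parameters()) + list(action_predictor.parameters())
opt = torch.optim.Adam(params, lr=1e-4)

def bc_step(frames, expert_actions):
    """One behavior-cloning step: supervised learning, labels = expert actions.
    frames: (B, 1, 84, 84) float tensor; expert_actions: (B,) long tensor."""
    logits = action_predictor(encoder(frames))
    loss = loss_fn(logits, expert_actions)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```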
What: the Highlights of the Course
How to model DP & RL problems
• What: problem space, deterministic vs. Markov decision processes, imperfect information
• Tools: probability, stochastic processes, Markov chains

How to solve DP & RL problems exactly
• What: Bellman equations, dynamic programming algorithms
• Tools: optimality principle, fixed-point operators

How to solve DP & RL problems incrementally
• What: Monte Carlo, temporal difference (TD), Q-learning
• Tools: stochastic approximation, max-norm contraction analysis

How to solve DP & RL problems approximately
• What: approximate RL (TD-based methods, policy space methods, deep RL)
• Tools: function approximation, Lyapunov function analysis, deep learning
With (simple!) examples from resource optimization, control systems, and computer games.
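As a small preview of the "incremental" toolbox above, here is a minimal tabular Q-learning sketch. The five-state chain environment and every constant in it are illustrative assumptions; the course develops the stochastic-approximation analysis that justifies the update.

```python
# Minimal tabular Q-learning on a hypothetical 5-state chain.
import random
import numpy as np

n_states, n_actions = 5, 2           # toy chain; actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))  # tabular action-value estimates
alpha, gamma, eps = 0.1, 0.95, 0.1   # step size, discount, exploration rate

def step(s, a):
    """Deterministic chain dynamics: reward 1 only for reaching the right end."""
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s2, float(s2 == n_states - 1)

for episode in range(500):
    s = 0
    for t in range(20):
        # epsilon-greedy: mostly exploit current estimates, sometimes explore
        a = random.randrange(n_actions) if random.random() < eps else int(Q[s].argmax())
        s2, r = step(s, a)
        # TD update: nudge Q(s, a) toward the one-step bootstrapped target
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2
print(Q)  # right-moving actions should dominate after training
```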
How: Textbooks and readings
(a) Required: Dynamic Programming and Optimal Control (2007), Vol. I, 4th Edition, ISBN-13: 978-1-886529-43-4, by Dimitri P. Bertsekas. [DPOC]
(b) The second volume of the text is a useful and comprehensive reference. It is recommended, but not required.
(c) Required: Neuro-Dynamic Programming (1996) by Dimitri P. Bertsekas and John N. Tsitsiklis. [NDP]

A note on notation: we will be using contemporary notation (e.g., s, a, V), which differs from the notation in these texts (e.g., x, u, J). We will also be maximizing instead of minimizing, etc.
How: Prerequisites
(a) Solid knowledge of undergraduate probability (6.041A & 6.041B)
(b) Mathematical maturity and the ability to write down precise and rigorous arguments
(c) Exposure to convex optimization / analysis and linear algebra (18.06)
(d) Python programming
(e) Analysis of algorithms (6.006, 6.046)

We will issue a HW0 (not graded) to help you gauge your level of familiarity with the prerequisite material. If you are not comfortable with HW0, consider taking this course another time.
When/What/Where
Lecture: TR 1-2:30pm (Zoom)
Instructor: Cathy Wu

Course pointers
• web.mit.edu/6.246/www/ (materials & info)
• https://canvas.mit.edu/courses/7560 (Zoom links/recordings)
• piazza.com/mit/spring2021/6246 (announcements, solutions, collaboration)
• gradescope.com/courses/246411 (sign up with your @mit.edu email)
• psetpartners.mit.edu (to find pset partners)

When/What/Where
5 homework assignments (25%) – roughly 1 hw per 2-3 lectures; 1.5 weeks/hw
Lecture presentation + lecture note (5%) – 2nd half of semester (details forthcoming)
1 in-class quiz (25%) – coverage: first 12 lectures
Class project + final project presentation (35%)
Class participation (10%) – asynchronous discussion (on Piazza); participation during lecture; answering questions on Piazza; attending office hours and recitation
Lecture scribing (extra credit, up to 5%)
Homeworks: Late homework will be penalized 10% every 24 hours. Solutions for homework will be released shortly after the deadline (late submitters must abide by the honor code).
Caution & Opportunity
Caution: We are planning under uncertainty. Course details are subject to change. Your cooperation is appreciated.
Opportunity: The world has seen some changes in the last few decades, including:
• recent advances in machine learning and reinforcement learning
• newly available data sources & compute
• open source everything (GitHub, arXiv)
• new potential impact areas for DP & RL
What is an appropriate foundational course to support and advance research and practice in sequential decision making?
Why: A multi-disciplinary field
[Figure: Venn diagram situating reinforcement learning at the intersection of A.I., statistics, applied math, optimal control, neuroscience, cognitive sciences, and psychology, with related areas including clustering, statistical learning, learning theory, neural networks, approximation theory, dynamic programming, automatic control, categorization, and active learning. 6.231 (dynamic programming & stochastic control), the 6.246 core lectures, and the 6.246 special topics cover different regions of the diagram. Note: circles may not be to scale. Credit: Alessandro Lazaric.]
Sampling of special topics (from Spring 2020)
• Finite Sample Analysis
• RL for Combinatorial Optimization
• Model-based Planning and Policy Learning
• Exploration vs. Exploitation in Bandits
• Exploration vs. Exploitation in Deep RL
• Multi-agent RL / Game Theory
• Multi-agent Deep RL
• Off-policy Learning
• Transfer Learning / Curriculum Learning
• Variance Reduction Techniques
• Artificial Life / Evolutionary Strategy

Why: Parable of the blind men and the elephant
Bibliography
Bertsekas, D. and Tsitsiklis, J. (1996). Neuro-Dynamic Programming. Athena Scientific, Belmont, MA.
Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. A Bradford Book, MIT Press, Cambridge, MA.