Reinforcement Learning: Theory and Algorithms

Alekh Agarwal Nan Jiang Sham M. Kakade Wen Sun

September 30, 2021

WORKING DRAFT: We will be frequently updating the book this fall, 2021. Please email [email protected] with any typos or errors you find. We appreciate it!

Contents

1 Fundamentals

1 Markov Decision Processes
  1.1 Discounted (Infinite-Horizon) Markov Decision Processes
    1.1.1 The objective, policies, and values
    1.1.2 Bellman Consistency Equations for Stationary Policies
    1.1.3 Bellman Optimality Equations
  1.2 Finite-Horizon Markov Decision Processes
  1.3 Computational Complexity
    1.3.1 Value Iteration
    1.3.2 Policy Iteration
    1.3.3 Value Iteration for Finite Horizon MDPs
    1.3.4 The Linear Programming Approach
  1.4 Sample Complexity and Sampling Models
  1.5 Bonus: Advantages and The Performance Difference Lemma
  1.6 Bibliographic Remarks and Further Reading

2 Sample Complexity with a Generative Model
  2.1 Warmup: a naive model-based approach
  2.2 Sublinear Sample Complexity
  2.3 Minmax Optimal Sample Complexity (and the Model Based Approach)
    2.3.1 The Discounted Case
    2.3.2 Finite Horizon Setting
  2.4 Analysis
    2.4.1 Variance Lemmas
    2.4.2 Completing the proof
  2.5 Scalings and Effective Horizon Dependencies
  2.6 Bibliographic Remarks and Further Readings

3 Linear Bellman Completeness
  3.1 The Linear Bellman Completeness Condition
  3.2 The LSVI Algorithm
  3.3 LSVI with D-Optimal Design
    3.3.1 D-Optimal Design
    3.3.2 Performance Guarantees
    3.3.3 Analysis
  3.4 How Strong is Bellman Completion as a Modeling?
  3.5 Offline Reinforcement Learning
    3.5.1 Offline Learning
    3.5.2 Offline Policy Evaluation
  3.6 Bibliographic Remarks and Further Readings

4 Fitted Dynamic Programming Methods
  4.1 Fitted Q-Iteration (FQI) and Offline RL
    4.1.1 The FQI Algorithm
    4.1.2 Performance Guarantees of FQI
  4.2 Fitted Policy-Iteration (FPI)
  4.3 Failure Cases Without Assumption 4.1
  4.4 FQI for Policy Evaluation
  4.5 Bibliographic Remarks and Further Readings

5 Statistical Limits of Generalization
  5.1 Agnostic Learning
    5.1.1 Review: Binary Classification
    5.1.2 Importance Sampling and a Reduction to
  5.2 Linear Realizability
    5.2.1 Offline Policy Evaluation with Linearly Realizable Values
    5.2.2 Linearly Realizable Q*
    5.2.3 Linearly Realizable π*
  5.3 Discussion: Studying Generalization in RL
  5.4 Bibliographic Remarks and Further Readings

2 Strategic Exploration

6 Multi-Armed & Linear Bandits
  6.1 The K-Armed Bandit Problem
    6.1.1 The Upper Confidence Bound (UCB) Algorithm
  6.2 Linear Bandits: Handling Large Action Spaces
    6.2.1 The LinUCB algorithm
    6.2.2 Upper and Lower Bounds
  6.3 LinUCB Analysis
    6.3.1 Regret Analysis
    6.3.2 Confidence Analysis
  6.4 Bibliographic Remarks and Further Readings

7 Strategic Exploration in Tabular MDPs
  7.1 On The Need for Strategic Exploration
  7.2 The UCB-VI algorithm
  7.3 Analysis
    7.3.1 Proof of Lemma 7.2
  7.4 An Improved Regret Bound
  7.5 Phased Q-learning
  7.6 Bibliographic Remarks and Further Readings

8 Linearly Parameterized MDPs
  8.1 Setting
    8.1.1 Low-Rank MDPs and Linear MDPs
  8.2 Planning in Linear MDPs
  8.3 Learning Transition using Ridge Linear Regression
  8.4 Uniform Convergence via Covering
  8.5 Algorithm
  8.6 Analysis of UCBVI for Linear MDPs
    8.6.1 Proving Optimism
    8.6.2 Regret Decomposition
    8.6.3 Concluding the Final Regret Bound
  8.7 Bibliographic Remarks and Further Readings

9 Parametric Models with Bounded Bellman Rank
  9.1 Problem setting
  9.2 Value-function approximation
  9.3 Bellman Rank
    9.3.1 Examples
    9.3.2 Examples that do not have low Bellman Rank
  9.4 Algorithm
  9.5 Extension to Model-based Setting
  9.6 Bibliographic Remarks and Further Readings

10 Deterministic MDPs with Linearly Parameterized Q*

3 Policy Optimization

11 Policy Gradient Methods and Non-Convex Optimization
  11.1 Policy Gradient Expressions and the Likelihood Ratio Method
  11.2 (Non-convex) Optimization
    11.2.1 Gradient ascent and convergence to stationary points
    11.2.2 Monte Carlo estimation and stochastic gradient ascent
  11.3 Bibliographic Remarks and Further Readings

12 Optimality
  12.1 Vanishing Gradients and Saddle Points
  12.2 Policy Gradient Ascent
  12.3 Log Barrier Regularization
  12.4 The Natural Policy Gradient
  12.5 Bibliographic Remarks and Further Readings

13 Function Approximation and the NPG
  13.1 Compatible function approximation and the NPG
  13.2 Examples: NPG and Q-NPG
    13.2.1 Log-linear Policy Classes and Soft Policy Iteration
    13.2.2 Neural Policy Classes
  13.3 The NPG “Regret Lemma”
  13.4 Q-NPG: Performance Bounds for Log-Linear Policies
    13.4.1 Analysis
  13.5 Q-NPG Sample Complexity
  13.6 Bibliographic Remarks and Further Readings

14 CPI, TRPO, and More
  14.1 Conservative Policy Iteration
    14.1.1 The CPI Algorithm
  14.2 Trust Region Methods and Covariant Policy Search
    14.2.1 Proximal Policy Optimization
  14.3 Bibliographic Remarks and Further Readings

4 Further Topics

15 Imitation Learning
  15.1 Setting
  15.2 Offline IL: Behavior Cloning
  15.3 The Hybrid Setting: Statistical Benefit and Algorithm
    15.3.1 Extension to Agnostic Setting
  15.4 Maximum Entropy Inverse Reinforcement Learning
    15.4.1 MaxEnt IRL: Formulation and The Principle of Maximum Entropy
    15.4.2 Algorithm
    15.4.3 Maximum Entropy RL: Implementing the Planning Oracle in Eq. 0.4
  15.5 Interactive Imitation Learning: AggreVaTe and Its Statistical Benefit over Offline IL Setting
  15.6 Bibliographic Remarks and Further Readings

16 Linear Quadratic Regulators
  16.1 The LQR Model
  16.2 Bellman Optimality: Value Iteration & The Algebraic Riccati Equations
    16.2.1 Planning and Finite Horizon LQRs
    16.2.2 Planning and Infinite Horizon LQRs
  16.3 Convex Programs to find P and K*
    16.3.1 The Primal for Infinite Horizon LQR
    16.3.2 The Dual
  16.4 Policy Iteration, Gauss Newton, and NPG
    16.4.1 Gradient Expressions
    16.4.2 Convergence Rates
    16.4.3 Gauss-Newton Analysis
  16.5 System Level Synthesis for Linear Dynamical Systems
  16.6 Bibliographic Remarks and Further Readings

17 Partially Observable Markov Decision Processes

Bibliography

A Concentration

Notation

The reader might find it helpful to refer back to this notation section.

• We slightly abuse notation and let $[K]$ denote the set $\{0, 1, 2, \ldots, K-1\}$ for an integer $K$.

• For a vector $v$, we let $(v)^2$, $\sqrt{v}$, and $|v|$ be the component-wise square, square root, and absolute value operations.

• Inequalities between vectors are elementwise, e.g. for vectors $v, v'$, we say $v \le v'$ if the inequality holds elementwise.

• For a vector $v$, we refer to the $j$-th component of this vector by either $v(j)$ or $[v]_j$.

• Denote the variance of any real valued $f$ under a distribution $\mathcal{D}$ as:
$$\mathrm{Var}_{\mathcal{D}}(f) := \mathbb{E}_{x\sim\mathcal{D}}[f(x)^2] - (\mathbb{E}_{x\sim\mathcal{D}}[f(x)])^2.$$

• We overload notation where, for a distribution $\mu$ over $\mathcal{S}$, we write:
$$V^\pi(\mu) = \mathbb{E}_{s\sim\mu}\big[V^\pi(s)\big].$$

• It is helpful to overload notation and let $P$ also refer to a matrix of size $(|\mathcal{S}|\cdot|\mathcal{A}|)\times|\mathcal{S}|$ where the entry $P_{(s,a),s'}$ is equal to $P(s'|s,a)$. We also will define $P^\pi$ to be the transition matrix on state-action pairs induced by a deterministic policy $\pi$. In particular, $P^\pi_{(s,a),(s',a')} = P(s'|s,a)$ if $a' = \pi(s')$ and $P^\pi_{(s,a),(s',a')} = 0$ if $a' \neq \pi(s')$. With this notation,
$$Q^\pi = r + \gamma P V^\pi, \qquad Q^\pi = r + \gamma P^\pi Q^\pi, \qquad Q^\pi = (I - \gamma P^\pi)^{-1} r.$$

• For a vector $Q \in \mathbb{R}^{|\mathcal{S}\times\mathcal{A}|}$, denote the greedy policy and value as:
$$\pi_Q(s) := \mathrm{argmax}_{a\in\mathcal{A}} Q(s,a),$$
$$V_Q(s) := \max_{a\in\mathcal{A}} Q(s,a).$$

• For a vector $Q \in \mathbb{R}^{|\mathcal{S}\times\mathcal{A}|}$, the Bellman optimality operator $\mathcal{T} : \mathbb{R}^{|\mathcal{S}\times\mathcal{A}|} \to \mathbb{R}^{|\mathcal{S}\times\mathcal{A}|}$ is defined as:
$$\mathcal{T}Q := r + \gamma P V_Q. \tag{0.1}$$

• For a vector $x$ and a matrix $M$, unless explicitly stated otherwise, we let $\|x\|$ and $\|M\|$ denote the Euclidean and spectral norms, respectively. We use the notation $\|x\|_M = \sqrt{x^\top M x}$, where $x$ and $M$ are assumed to be of appropriate size.

Part 1

Fundamentals


Chapter 1

Markov Decision Processes

1.1 Discounted (Infinite-Horizon) Markov Decision Processes

In reinforcement learning, the interactions between the agent and the environment are often described by an infinite-horizon, discounted Markov Decision Process (MDP) $M = (\mathcal{S}, \mathcal{A}, P, r, \gamma, \mu)$, specified by:

• A state space $\mathcal{S}$, which may be finite or infinite. For mathematical convenience, we will assume that $\mathcal{S}$ is finite or countably infinite.

• An action space $\mathcal{A}$, which also may be discrete or infinite. For mathematical convenience, we will assume that $\mathcal{A}$ is finite.

• A transition function $P : \mathcal{S}\times\mathcal{A} \to \Delta(\mathcal{S})$, where $\Delta(\mathcal{S})$ is the space of probability distributions over $\mathcal{S}$ (i.e., the probability simplex). $P(s'|s,a)$ is the probability of transitioning into state $s'$ upon taking action $a$ in state $s$. We use $P_{s,a}$ to denote the vector $P(\cdot|s,a)$.

• A reward function $r : \mathcal{S}\times\mathcal{A} \to [0,1]$. $r(s,a)$ is the immediate reward associated with taking action $a$ in state $s$. More generally, $r(s,a)$ could be a random variable (where the distribution depends on $s,a$). While we largely focus on the case where $r(s,a)$ is deterministic, the extension to methods with stochastic rewards is often straightforward.

• A discount factor $\gamma \in [0,1)$, which defines a horizon for the problem.

• An initial state distribution $\mu \in \Delta(\mathcal{S})$, which specifies how the initial state $s_0$ is generated.

In many cases, we will assume that the initial state is fixed at s0, i.e. µ is a distribution supported only on s0.
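To make the tabular setup concrete, here is a minimal sketch (not from the text) of how a small, discounted MDP might be represented with numpy arrays; the names `P`, `r`, `gamma`, and `mu`, and the 2-state, 2-action example itself, are illustrative choices mirroring the notation above.

```python
import numpy as np

# A small, hypothetical 2-state, 2-action MDP in the tabular representation:
# P[s, a, s'] = P(s' | s, a), r[s, a] in [0, 1], discount gamma, initial distribution mu.
num_states, num_actions = 2, 2

P = np.zeros((num_states, num_actions, num_states))
P[0, 0] = [0.9, 0.1]   # taking action 0 in state 0 mostly stays in state 0
P[0, 1] = [0.2, 0.8]
P[1, 0] = [0.7, 0.3]
P[1, 1] = [0.05, 0.95]

r = np.array([[0.0, 0.1],
              [0.5, 1.0]])   # immediate rewards r(s, a)

gamma = 0.9
mu = np.array([1.0, 0.0])    # the initial state is fixed at s_0 = 0

# Sanity checks: each P(.|s, a) is a probability distribution and rewards lie in [0, 1].
assert np.allclose(P.sum(axis=-1), 1.0)
assert (0.0 <= r).all() and (r <= 1.0).all()
```

These arrays are reused by the later sketches in this chapter.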

1.1.1 The objective, policies, and values

Policies. In a given MDP $M = (\mathcal{S}, \mathcal{A}, P, r, \gamma, \mu)$, the agent interacts with the environment according to the following protocol: the agent starts at some state $s_0 \sim \mu$; at each time step $t = 0, 1, 2, \ldots$, the agent takes an action $a_t \in \mathcal{A}$, obtains the immediate reward $r_t = r(s_t, a_t)$, and observes the next state $s_{t+1}$ sampled according to $s_{t+1} \sim P(\cdot|s_t, a_t)$. The interaction record at time $t$,
$$\tau_t = (s_0, a_0, r_0, s_1, \ldots, s_t, a_t, r_t),$$
is called a trajectory, which includes the observed state at time $t$.

In the most general setting, a policy specifies a decision-making strategy in which the agent chooses actions adaptively based on the history of observations; precisely, a policy is a (possibly randomized) mapping from a trajectory to an action, i.e. $\pi : \mathcal{H} \to \Delta(\mathcal{A})$ where $\mathcal{H}$ is the set of all possible trajectories (of all lengths) and $\Delta(\mathcal{A})$ is the space of probability distributions over $\mathcal{A}$. A stationary policy $\pi : \mathcal{S} \to \Delta(\mathcal{A})$ specifies a decision-making strategy in which the agent chooses actions based only on the current state, i.e. $a_t \sim \pi(\cdot|s_t)$. A deterministic, stationary policy is of the form $\pi : \mathcal{S} \to \mathcal{A}$.

Values. We now define values for (general) policies. For a fixed policy $\pi$ and a starting state $s_0 = s$, we define the value function $V^\pi_M : \mathcal{S} \to \mathbb{R}$ as the discounted sum of future rewards:
$$V^\pi_M(s) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\Big|\, \pi, s_0 = s\Big],$$
where the expectation is with respect to the randomness of the trajectory, that is, the randomness in state transitions and the stochasticity of $\pi$. Here, since $r(s,a)$ is bounded between 0 and 1, we have $0 \le V^\pi_M(s) \le 1/(1-\gamma)$.

Similarly, the action-value (or Q-value) function $Q^\pi_M : \mathcal{S}\times\mathcal{A} \to \mathbb{R}$ is defined as:
$$Q^\pi_M(s, a) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\Big|\, \pi, s_0 = s, a_0 = a\Big],$$
and $Q^\pi_M(s,a)$ is also bounded by $1/(1-\gamma)$.
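As a quick illustration (not from the text), $V^\pi(s)$ can be estimated by Monte Carlo rollouts, truncating the infinite discounted sum at a finite horizon. The sketch below assumes the tabular arrays `P`, `r`, `gamma` from the earlier sketch and a hypothetical `policy` array of shape $|\mathcal{S}|\times|\mathcal{A}|$ holding action probabilities.

```python
import numpy as np

def rollout_value(P, r, gamma, policy, s0, horizon=200, num_rollouts=1000, rng=None):
    """Monte Carlo estimate of V^pi(s0): average truncated discounted returns.

    Truncating at `horizon` incurs at most gamma**horizon / (1 - gamma) bias,
    since rewards are bounded in [0, 1].
    """
    rng = np.random.default_rng() if rng is None else rng
    num_states, num_actions = r.shape
    returns = np.zeros(num_rollouts)
    for i in range(num_rollouts):
        s, discount = s0, 1.0
        for _ in range(horizon):
            a = rng.choice(num_actions, p=policy[s])   # a_t ~ pi(.|s_t)
            returns[i] += discount * r[s, a]
            s = rng.choice(num_states, p=P[s, a])      # s_{t+1} ~ P(.|s_t, a_t)
            discount *= gamma
    return returns.mean()
```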

Goal. Given a state s, the goal of the agent is to find a policy π that maximizes the value, i.e. the optimization problem the agent seeks to solve is:

$$\max_\pi V^\pi_M(s) \tag{0.1}$$
where the max is over all (possibly non-stationary and randomized) policies. As we shall see, there exists a deterministic and stationary policy which is simultaneously optimal for all starting states $s$. We drop the dependence on $M$ and write $V^\pi$ when it is clear from context.

Example 1.1 (Navigation). Navigation is perhaps the simplest example of RL to visualize. The state of the agent is their current location. The four actions might be moving one step along each of east, west, north or south. The transitions in the simplest setting are deterministic. Taking the north action moves the agent one step north of their location, assuming that the size of a step is standardized. The agent might have a goal state $g$ they are trying to reach, and the reward is 0 until the agent reaches the goal, and 1 upon reaching the goal state. Since the discount factor $\gamma < 1$, there is an incentive to reach the goal state earlier in the trajectory. As a result, the optimal behavior in this setting corresponds to finding the shortest path from the initial state to the goal state, and the value function of a state, given a policy, is $\gamma^d$, where $d$ is the number of steps required by the policy to reach the goal state.

Example 1.2 (Conversational agent). This is another fairly natural RL problem. The state of an agent can be the current transcript of the conversation so far, along with any additional information about the world, such as the context for the conversation, characteristics of the other agents or humans in the conversation, etc. Actions depend on the domain. In the most basic form, we can think of an action as the next statement to make in the conversation. Sometimes, conversational agents are designed for task completion, such as a travel assistant, tech support, or a virtual office receptionist. In these cases, there might be a predefined set of slots which the agent needs to fill before it can find a good solution. For instance, in the travel agent case, these might correspond to the dates, source, destination and mode of travel. The actions might correspond to natural language queries to fill these slots.

In task completion settings, reward is naturally defined as a binary outcome on whether the task was completed or not, such as whether the travel was successfully booked or not. Depending on the domain, we could further refine it based on the quality or the price of the travel package found. In more generic conversational settings, the ultimate reward is whether the conversation was satisfactory to the other agents or humans, or not.

Example 1.3 (Strategic games). This is a popular category of RL applications, where RL has been successful in achieving human level performance in Backgammon, Go, Chess, and various forms of Poker. The usual setting consists of the state being the current game board, actions being the potential next moves, and reward being the eventual win/loss outcome or a more detailed score when it is defined in the game. Technically, these are multi-agent RL settings, and, yet, the algorithms used are often non-multi-agent RL algorithms.

1.1.2 Bellman Consistency Equations for Stationary Policies

Stationary policies satisfy the following consistency conditions:

Lemma 1.4. Suppose that $\pi$ is a stationary policy. Then $V^\pi$ and $Q^\pi$ satisfy the following Bellman consistency equations: for all $s \in \mathcal{S}$, $a \in \mathcal{A}$,
$$V^\pi(s) = Q^\pi(s, \pi(s)),$$
$$Q^\pi(s,a) = r(s,a) + \gamma\,\mathbb{E}_{a\sim\pi(\cdot|s),\, s'\sim P(\cdot|s,a)}\big[V^\pi(s')\big].$$

We leave the proof as an exercise to the reader.

It is helpful to view $V^\pi$ as a vector of length $|\mathcal{S}|$ and $Q^\pi$ and $r$ as vectors of length $|\mathcal{S}|\cdot|\mathcal{A}|$. We overload notation and let $P$ also refer to a matrix of size $(|\mathcal{S}|\cdot|\mathcal{A}|)\times|\mathcal{S}|$ where the entry $P_{(s,a),s'}$ is equal to $P(s'|s,a)$.

We also will define $P^\pi$ to be the transition matrix on state-action pairs induced by a stationary policy $\pi$, specifically:
$$P^\pi_{(s,a),(s',a')} := P(s'|s,a)\,\pi(a'|s').$$
In particular, for deterministic policies we have:
$$P^\pi_{(s,a),(s',a')} := \begin{cases} P(s'|s,a) & \text{if } a' = \pi(s') \\ 0 & \text{if } a' \neq \pi(s') \end{cases}$$
With this notation, it is straightforward to verify:
$$Q^\pi = r + \gamma P V^\pi, \qquad Q^\pi = r + \gamma P^\pi Q^\pi.$$

Corollary 1.5. Suppose that $\pi$ is a stationary policy. We have that:
$$Q^\pi = (I - \gamma P^\pi)^{-1} r \tag{0.2}$$
where $I$ is the identity matrix.

Proof: To see that $I - \gamma P^\pi$ is invertible, observe that for any non-zero vector $x \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$,
$$\begin{aligned}
\|(I - \gamma P^\pi)x\|_\infty &= \|x - \gamma P^\pi x\|_\infty \\
&\ge \|x\|_\infty - \gamma\|P^\pi x\|_\infty && \text{(triangle inequality for norms)} \\
&\ge \|x\|_\infty - \gamma\|x\|_\infty && \text{(each element of } P^\pi x \text{ is an average of } x\text{)} \\
&= (1-\gamma)\|x\|_\infty > 0 && (\gamma < 1,\ x \neq 0)
\end{aligned}$$
which implies $I - \gamma P^\pi$ is full rank.

The following is also a helpful lemma:

Lemma 1.6. We have that:
$$\big[(1-\gamma)(I - \gamma P^\pi)^{-1}\big]_{(s,a),(s',a')} = (1-\gamma)\sum_{t=0}^{\infty} \gamma^t \Pr^\pi(s_t = s', a_t = a' \mid s_0 = s, a_0 = a),$$
so we can view the $(s,a)$-th row of this matrix as an induced distribution over states and actions when following $\pi$ after starting with $s_0 = s$ and $a_0 = a$.

We leave the proof as an exercise to the reader.
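Corollary 1.5 and Lemma 1.6 translate directly into a few lines of linear algebra. The sketch below (illustrative, not from the text) forms $P^\pi$ for a stochastic tabular policy, solves $(I-\gamma P^\pi)Q^\pi = r$, and returns the normalized visitation matrix $(1-\gamma)(I-\gamma P^\pi)^{-1}$; it assumes the arrays `P`, `r`, `gamma`, and a `policy` of shape $|\mathcal{S}|\times|\mathcal{A}|$ from the earlier sketches.

```python
import numpy as np

def policy_transition_matrix(P, policy):
    """P^pi over state-action pairs: P^pi[(s,a),(s',a')] = P(s'|s,a) * pi(a'|s')."""
    S, A, _ = P.shape
    P_sa = P.reshape(S * A, S)                    # rows indexed by (s, a)
    return (P_sa[:, :, None] * policy[None, :, :]).reshape(S * A, S * A)

def q_pi(P, r, gamma, policy):
    """Exact Q^pi = (I - gamma P^pi)^{-1} r (Corollary 1.5)."""
    S, A, _ = P.shape
    P_pi = policy_transition_matrix(P, policy)
    return np.linalg.solve(np.eye(S * A) - gamma * P_pi, r.reshape(S * A)).reshape(S, A)

def visitation_matrix(P, gamma, policy):
    """(1 - gamma)(I - gamma P^pi)^{-1}; each row is a distribution over (s, a) (Lemma 1.6)."""
    S, A, _ = P.shape
    P_pi = policy_transition_matrix(P, policy)
    return (1 - gamma) * np.linalg.inv(np.eye(S * A) - gamma * P_pi)
```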

1.1.3 Bellman Optimality Equations

A remarkable and convenient property of MDPs is that there exists a stationary and deterministic policy that simultaneously maximizes $V^\pi(s)$ for all $s \in \mathcal{S}$. This is formalized in the following theorem:

Theorem 1.7. Let $\Pi$ be the set of all non-stationary and randomized policies. Define:

$$V^\star(s) := \sup_{\pi\in\Pi} V^\pi(s), \qquad Q^\star(s,a) := \sup_{\pi\in\Pi} Q^\pi(s,a),$$
which is finite since $V^\pi(s)$ and $Q^\pi(s,a)$ are bounded between 0 and $1/(1-\gamma)$.

There exists a stationary and deterministic policy $\pi$ such that for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$,
$$V^\pi(s) = V^\star(s), \qquad Q^\pi(s,a) = Q^\star(s,a).$$

We refer to such a π as an optimal policy.

Proof: For any $\pi \in \Pi$ and for any time $t$, $\pi$ specifies a distribution over actions conditioned on the history of observations; here, we write
$$\pi(A_t = a \mid S_0 = s_0, A_0 = a_0, R_0 = r_0, \ldots, S_{t-1} = s_{t-1}, A_{t-1} = a_{t-1}, R_{t-1} = r_{t-1}, S_t = s_t)$$
as the probability that $\pi$ selects action $a$ at time $t$ given an observed history $s_0, a_0, r_0, \ldots, s_{t-1}, a_{t-1}, r_{t-1}, s_t$. For the purposes of this proof, it is helpful to formally let $S_t$, $A_t$ and $R_t$ denote random variables, which will distinguish them from outcomes, which are denoted by lower case variables. First, let us show that, conditioned on $(S_0, A_0, R_0, S_1) = (s, a, r, s')$, the maximum future discounted value, from time 1 onwards, is not a function of $s, a, r$. More precisely, we seek to show that:

$$\sup_{\pi\in\Pi}\, \mathbb{E}\Big[\sum_{t=1}^{\infty} \gamma^t r(s_t, a_t) \,\Big|\, \pi, (S_0,A_0,R_0,S_1) = (s,a,r,s')\Big] = \gamma V^\star(s') \tag{0.3}$$

For any policy $\pi$, define an “offset” policy $\pi_{(s,a,r)}$, which is the policy that chooses actions on a trajectory $\tau$ according to the same distribution that $\pi$ chooses actions on the trajectory $(s, a, r, \tau)$. Precisely, for all $t$, define

$$\pi_{(s,a,r)}(A_t = a \mid S_0 = s_0, A_0 = a_0, R_0 = r_0, \ldots, S_t = s_t)$$
$$:= \pi(A_{t+1} = a \mid S_0 = s, A_0 = a, R_0 = r, S_1 = s_0, A_1 = a_0, R_1 = r_0, \ldots, S_{t+1} = s_t).$$
By the Markov property, we have that:

$$\mathbb{E}\Big[\sum_{t=1}^{\infty}\gamma^t r(s_t,a_t) \,\Big|\, \pi, (S_0,A_0,R_0,S_1)=(s,a,r,s')\Big] = \gamma\,\mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^t r(s_t,a_t) \,\Big|\, \pi_{(s,a,r)}, S_0 = s'\Big] = \gamma V^{\pi_{(s,a,r)}}(s'),$$

where the first equality follows from a change of variables on the time index, along with the definition of the policy $\pi_{(s,a,r)}$. Also, we have, for all $(s,a,r)$, that the set $\{\pi_{(s,a,r)} \mid \pi \in \Pi\}$ is equal to $\Pi$ itself, by the definition of $\Pi$ and $\pi_{(s,a,r)}$. This implies:

$$\sup_{\pi\in\Pi}\,\mathbb{E}\Big[\sum_{t=1}^{\infty}\gamma^t r(s_t,a_t) \,\Big|\, \pi, (S_0,A_0,R_0,S_1)=(s,a,r,s')\Big] = \gamma\sup_{\pi\in\Pi} V^{\pi_{(s,a,r)}}(s') = \gamma\sup_{\pi\in\Pi} V^\pi(s') = \gamma V^\star(s'),$$
thus proving Equation 0.3.

We now show the deterministic and stationary policy

$$\widetilde{\pi}(s) = \mathrm{argmax}_{a\in\mathcal{A}}\, \mathbb{E}\big[r(s,a) + \gamma V^\star(s_1) \,\big|\, (S_0,A_0) = (s,a)\big]$$
is optimal, i.e. that $V^{\widetilde{\pi}}(s) = V^\star(s)$. For this, we have that:

$$\begin{aligned}
V^\star(s_0) &= \sup_{\pi\in\Pi}\,\mathbb{E}\Big[r(s_0,a_0) + \sum_{t=1}^{\infty}\gamma^t r(s_t,a_t)\Big] \\
&\overset{(a)}{=} \sup_{\pi\in\Pi}\,\mathbb{E}\Big[r(s_0,a_0) + \mathbb{E}\Big[\sum_{t=1}^{\infty}\gamma^t r(s_t,a_t) \,\Big|\, \pi, (S_0,A_0,R_0,S_1)=(s_0,a_0,r_0,s_1)\Big]\Big] \\
&\le \sup_{\pi\in\Pi}\,\mathbb{E}\Big[r(s_0,a_0) + \sup_{\pi'\in\Pi}\mathbb{E}\Big[\sum_{t=1}^{\infty}\gamma^t r(s_t,a_t) \,\Big|\, \pi', (S_0,A_0,R_0,S_1)=(s_0,a_0,r_0,s_1)\Big]\Big] \\
&\overset{(b)}{=} \sup_{\pi\in\Pi}\,\mathbb{E}\big[r(s_0,a_0) + \gamma V^\star(s_1)\big] \\
&= \sup_{a_0\in\mathcal{A}}\,\mathbb{E}\big[r(s_0,a_0) + \gamma V^\star(s_1)\big] \\
&\overset{(c)}{=} \mathbb{E}\big[r(s_0,a_0) + \gamma V^\star(s_1) \,\big|\, \widetilde{\pi}\big],
\end{aligned}$$
where (a) is by the tower property of conditional expectations; (b) uses Equation 0.3; and (c) follows from the definition of $\widetilde{\pi}$. Applying the same argument recursively leads to:

$$V^\star(s_0) \le \mathbb{E}\big[r(s_0,a_0) + \gamma V^\star(s_1) \,\big|\, \widetilde{\pi}\big] \le \mathbb{E}\big[r(s_0,a_0) + \gamma r(s_1,a_1) + \gamma^2 V^\star(s_2) \,\big|\, \widetilde{\pi}\big] \le \ldots \le V^{\widetilde{\pi}}(s_0).$$

Since $V^{\widetilde{\pi}}(s) \le \sup_{\pi\in\Pi} V^\pi(s) = V^\star(s)$ for all $s$, we have that $V^{\widetilde{\pi}} = V^\star$, which completes the proof of the first claim. For the same policy $\widetilde{\pi}$, an analogous argument can be used to prove the second claim.

This shows that we may restrict ourselves to using stationary and deterministic policies without any loss in performance. The following theorem, also due to [Bellman, 1956], gives a precise characterization of the optimal value function.

Theorem 1.8 (Bellman Optimality Equations). We say that a vector $Q \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$ satisfies the Bellman optimality equations if:
$$Q(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s'\sim P(\cdot|s,a)}\Big[\max_{a'\in\mathcal{A}} Q(s',a')\Big].$$

For any $Q \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$, we have that $Q = Q^\star$ if and only if $Q$ satisfies the Bellman optimality equations. Furthermore, the deterministic policy defined by $\pi(s) \in \mathrm{argmax}_{a\in\mathcal{A}} Q^\star(s,a)$ is an optimal policy (where ties are broken in some arbitrary manner).

Before we prove this claim, we will provide a few definitions. Let $\pi_Q$ denote the greedy policy with respect to a vector $Q \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$, i.e.
$$\pi_Q(s) := \mathrm{argmax}_{a\in\mathcal{A}} Q(s,a),$$
where ties are broken in some arbitrary manner. With this notation, by the above theorem, the optimal policy $\pi^\star$ is given by:
$$\pi^\star = \pi_{Q^\star}.$$

Let us also use the following notation to turn a vector $Q \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$ into a vector of length $|\mathcal{S}|$:

$$V_Q(s) := \max_{a\in\mathcal{A}} Q(s,a).$$

The Bellman optimality operator $\mathcal{T}_M : \mathbb{R}^{|\mathcal{S}||\mathcal{A}|} \to \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$ is defined as:

$$\mathcal{T}Q := r + \gamma P V_Q. \tag{0.4}$$

This allows us to rewrite the Bellman optimality equation in the concise form:

$$\mathcal{T}Q = Q,$$
and, so, the previous theorem states that $Q = Q^\star$ if and only if $Q$ is a fixed point of the operator $\mathcal{T}$.

Proof: Let us begin by showing that:
$$V^\star(s) = \max_a Q^\star(s,a). \tag{0.5}$$
Let $\pi^\star$ be an optimal stationary and deterministic policy, which exists by Theorem 1.7. Consider the policy which takes action $a$ and then follows $\pi^\star$. Since $V^\star(s)$ is the maximum value over all non-stationary policies (as shown in Theorem 1.7),
$$V^\star(s) \ge Q^{\pi^\star}(s,a) = Q^\star(s,a),$$
which shows that $V^\star(s) \ge \max_a Q^\star(s,a)$ since $a$ is arbitrary in the above. Also, by Lemma 1.4 and Theorem 1.7,
$$V^\star(s) = V^{\pi^\star}(s) = Q^{\pi^\star}(s,\pi^\star(s)) \le \max_a Q^{\pi^\star}(s,a) = \max_a Q^\star(s,a),$$
which proves our claim since we have upper and lower bounded $V^\star(s)$ by $\max_a Q^\star(s,a)$.

We first show sufficiency, i.e. that $Q^\star$ (the state-action value of an optimal policy) satisfies $\mathcal{T}Q^\star = Q^\star$. Now for all actions $a\in\mathcal{A}$, we have:
$$\begin{aligned}
Q^\star(s,a) = \max_\pi Q^\pi(s,a) &= r(s,a) + \gamma\max_\pi \mathbb{E}_{s'\sim P(\cdot|s,a)}[V^\pi(s')] \\
&= r(s,a) + \gamma\,\mathbb{E}_{s'\sim P(\cdot|s,a)}[V^\star(s')] \\
&= r(s,a) + \gamma\,\mathbb{E}_{s'\sim P(\cdot|s,a)}\big[\max_{a'} Q^\star(s',a')\big].
\end{aligned}$$
Here, the second equality follows from Theorem 1.7 and the final equality follows from Equation 0.5. This proves sufficiency.

For the converse, suppose $\mathcal{T}Q = Q$ for some $Q$. We now show that $Q = Q^\star$. Let $\pi = \pi_Q$. That $\mathcal{T}Q = Q$ implies that $Q = r + \gamma P^\pi Q$, and so:
$$Q = (I - \gamma P^\pi)^{-1} r = Q^\pi$$
using Corollary 1.5 in the last step. In other words, $Q$ is the action value of the policy $\pi_Q$. Also, let us show that $(P^\pi - P^{\pi'})Q^\pi \ge 0$ for any deterministic and stationary policy $\pi'$. To see this, observe that:
$$[(P^\pi - P^{\pi'})Q^\pi]_{s,a} = \mathbb{E}_{s'\sim P(\cdot|s,a)}\big[Q^\pi(s',\pi(s')) - Q^\pi(s',\pi'(s'))\big] \ge 0,$$
where the last step uses that $\pi = \pi_Q$. Now observe for any other deterministic and stationary policy $\pi'$:

$$\begin{aligned}
Q - Q^{\pi'} &= Q^\pi - Q^{\pi'} \\
&= Q^\pi - (I - \gamma P^{\pi'})^{-1} r \\
&= (I - \gamma P^{\pi'})^{-1}\big((I - \gamma P^{\pi'}) - (I - \gamma P^\pi)\big)Q^\pi \\
&= \gamma (I - \gamma P^{\pi'})^{-1}(P^\pi - P^{\pi'})Q^\pi \\
&\ge 0,
\end{aligned}$$
where the last step follows since we have shown $(P^\pi - P^{\pi'})Q^\pi \ge 0$ and since $(1-\gamma)(I - \gamma P^{\pi'})^{-1}$ is a matrix with all positive entries (see Lemma 1.6). Thus, $Q^\pi = Q \ge Q^{\pi'}$ for all deterministic and stationary $\pi'$, which shows that $\pi$ is an optimal policy. Thus, $Q = Q^\pi = Q^\star$, using Theorem 1.7. This completes the proof.

1.2 Finite-Horizon Markov Decision Processes

In some cases, it is natural to work with finite-horizon (and time-dependent) Markov Decision Processes (see the discussion below). Here, a finite horizon, time-dependent Markov Decision Process (MDP) $M = (\mathcal{S}, \mathcal{A}, \{P_h\}, \{r_h\}, H, \mu)$ is specified as follows:

• A state space $\mathcal{S}$, which may be finite or infinite.

• An action space $\mathcal{A}$, which also may be discrete or infinite.

• A time-dependent transition function $P_h : \mathcal{S}\times\mathcal{A} \to \Delta(\mathcal{S})$, where $\Delta(\mathcal{S})$ is the space of probability distributions over $\mathcal{S}$ (i.e., the probability simplex). $P_h(s'|s,a)$ is the probability of transitioning into state $s'$ upon taking action $a$ in state $s$ at time step $h$. Note that the time-dependent setting generalizes the stationary setting where all steps share the same transition.

• A time-dependent reward function $r_h : \mathcal{S}\times\mathcal{A} \to [0,1]$. $r_h(s,a)$ is the immediate reward associated with taking action $a$ in state $s$ at time step $h$.

• An integer $H$ which defines the horizon of the problem.

• An initial state distribution $\mu \in \Delta(\mathcal{S})$, which specifies how the initial state $s_0$ is generated.

Here, for a policy $\pi$, a state $s$, and $h \in \{0, \ldots, H-1\}$, we define the value function $V^\pi_h : \mathcal{S} \to \mathbb{R}$ as
$$V^\pi_h(s) = \mathbb{E}\Big[\sum_{t=h}^{H-1} r_t(s_t, a_t) \,\Big|\, \pi, s_h = s\Big],$$
where again the expectation is with respect to the randomness of the trajectory, that is, the randomness in state transitions and the stochasticity of $\pi$. Similarly, the state-action value (or Q-value) function $Q^\pi_h : \mathcal{S}\times\mathcal{A} \to \mathbb{R}$ is defined as
$$Q^\pi_h(s,a) = \mathbb{E}\Big[\sum_{t=h}^{H-1} r_t(s_t, a_t) \,\Big|\, \pi, s_h = s, a_h = a\Big].$$
We also use the notation $V^\pi(s) = V^\pi_0(s)$. Again, given a state $s$, the goal of the agent is to find a policy $\pi$ that maximizes the value, i.e. the optimization problem the agent seeks to solve is:
$$\max_\pi V^\pi(s) \tag{0.6}$$
where recall that $V^\pi(s) = V^\pi_0(s)$.

Theorem 1.9. (Bellman optimality equations) Define
$$Q^\star_h(s,a) = \sup_{\pi\in\Pi} Q^\pi_h(s,a),$$
where the sup is over all non-stationary and randomized policies. Suppose that $Q_H = 0$. We have that $Q_h = Q^\star_h$ for all $h \in [H]$ if and only if for all $h \in [H]$,
$$Q_h(s,a) = r_h(s,a) + \mathbb{E}_{s'\sim P_h(\cdot|s,a)}\Big[\max_{a'\in\mathcal{A}} Q_{h+1}(s',a')\Big].$$
Furthermore, $\pi(s,h) = \mathrm{argmax}_{a\in\mathcal{A}} Q^\star_h(s,a)$ is an optimal policy.

We leave the proof as an exercise to the reader.

Discussion: Stationary MDPs vs Time-Dependent MDPs. For the purposes of this book, it is natural for us to study both of these models, where we typically assume stationary dynamics in the infinite horizon setting and time-dependent dynamics in the finite-horizon setting. From a theoretical perspective, the finite horizon, time-dependent setting is often more amenable to analysis, where optimal statistical rates often require simpler arguments. However, we should note that from a practical perspective, time-dependent MDPs are rarely utilized because they lead to policies and value functions that are O(H) larger (to store in memory) than those in the stationary setting. In practice, we often incorporate temporal information directly into the definition of the state, which leads to more compact value functions and policies (when coupled with function approximation methods, which attempt to represent both the values and policies in a more compact form).

1.3 Computational Complexity

This section will be concerned with computing an optimal policy when the MDP $M = (\mathcal{S}, \mathcal{A}, P, r, \gamma)$ is known; this can be thought of as solving the planning problem. While much of this book is concerned with statistical limits, understanding the computational limits can be informative. We will consider algorithms which give both exact and approximately optimal policies. In particular, we will be interested in polynomial time (and strongly polynomial time) algorithms.

Suppose that $(P, r, \gamma)$ in our MDP $M$ is specified with rational entries. Let $L(P, r, \gamma)$ denote the total bit-size required to specify $M$, and assume that basic arithmetic operations $+, -, \times, \div$ take unit time. Here, we may hope for an algorithm which (exactly) returns an optimal policy whose runtime is polynomial in $L(P, r, \gamma)$ and the number of states and actions.

More generally, it may also be helpful to understand which algorithms are strongly polynomial. Here, we do not want to explicitly restrict $(P, r, \gamma)$ to be specified by rationals. An algorithm is said to be strongly polynomial if it returns an optimal policy with runtime that is polynomial in only the number of states and actions (with no dependence on $L(P, r, \gamma)$).

The first two subsections will cover classical iterative algorithms that compute $Q^\star$, and then we cover the linear programming approach.

1.3.1 Value Iteration

Perhaps the simplest algorithm for discounted MDPs is to iteratively apply the fixed point mapping: starting at some $Q$, we iteratively apply $\mathcal{T}$:
$$Q \leftarrow \mathcal{T}Q.$$
This algorithm is referred to as Q-value iteration.

Poly?
- Value Iteration: $\frac{|\mathcal{S}|^2|\mathcal{A}|\, L(P,r,\gamma)\, \log\frac{1}{1-\gamma}}{1-\gamma}$
- Policy Iteration: $\frac{(|\mathcal{S}|^3 + |\mathcal{S}|^2|\mathcal{A}|)\, L(P,r,\gamma)\, \log\frac{1}{1-\gamma}}{1-\gamma}$
- LP-Algorithms: $|\mathcal{S}|^3|\mathcal{A}|\, L(P,r,\gamma)$

Strongly Poly?
- Value Iteration: ✗ (not strongly polynomial)
- Policy Iteration: $(|\mathcal{S}|^3 + |\mathcal{S}|^2|\mathcal{A}|)\cdot\min\Big\{\frac{|\mathcal{A}|^{|\mathcal{S}|}}{|\mathcal{S}|},\ \frac{|\mathcal{S}|^2|\mathcal{A}|\log\frac{|\mathcal{S}|^2}{1-\gamma}}{1-\gamma}\Big\}$
- LP-Algorithms: $|\mathcal{S}|^4|\mathcal{A}|^4\log\frac{|\mathcal{S}|}{1-\gamma}$

Table 0.1: Computational complexities of various approaches (we drop universal constants). Polynomial time algorithms depend on the bit complexity, $L(P,r,\gamma)$, while strongly polynomial algorithms do not. Note that only for a fixed value of $\gamma$ are value and policy iteration polynomial time algorithms; otherwise, they are not polynomial time algorithms. Similarly, only for a fixed value of $\gamma$ is policy iteration a strongly polynomial time algorithm. In contrast, the LP-approach leads to both polynomial time and strongly polynomial time algorithms; for the latter, the approach is an interior point algorithm. See text for further discussion, and Section 1.6 for references. Here, $|\mathcal{S}|^2|\mathcal{A}|$ is the assumed runtime per iteration of value iteration, and $|\mathcal{S}|^3 + |\mathcal{S}|^2|\mathcal{A}|$ is the assumed runtime per iteration of policy iteration (note that for this complexity we would directly update the values $V$ rather than $Q$ values, as described in the text); these runtimes are consistent with assuming cubic complexity for linear system solving.
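A minimal numpy sketch of Q-value iteration follows, assuming the tabular arrays `P` (shape $|\mathcal{S}|\times|\mathcal{A}|\times|\mathcal{S}|$) and `r` (shape $|\mathcal{S}|\times|\mathcal{A}|$) from the earlier sketches; the stopping rule is an illustrative choice justified by the contraction property shown below.

```python
import numpy as np

def bellman_optimality_operator(Q, P, r, gamma):
    """(T Q)(s, a) = r(s, a) + gamma * E_{s' ~ P(.|s,a)}[ max_a' Q(s', a') ]."""
    V_Q = Q.max(axis=1)                 # V_Q(s) = max_a Q(s, a)
    return r + gamma * P @ V_Q          # P @ V_Q has shape (|S|, |A|)

def q_value_iteration(P, r, gamma, tol=1e-8, max_iters=10_000):
    """Iterate Q <- T Q from Q = 0 until successive iterates are tol-close in sup norm."""
    Q = np.zeros_like(r)
    for _ in range(max_iters):
        Q_next = bellman_optimality_operator(Q, P, r, gamma)
        if np.max(np.abs(Q_next - Q)) < tol:
            return Q_next
        Q = Q_next
    return Q

# The greedy policy pi_Q(s) = argmax_a Q(s, a) is then (approximately) optimal.
```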

Lemma 1.10. (contraction) For any two vectors $Q, Q' \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$,
$$\|\mathcal{T}Q - \mathcal{T}Q'\|_\infty \le \gamma\|Q - Q'\|_\infty.$$

Proof: First, let us show that for all $s\in\mathcal{S}$, $|V_Q(s) - V_{Q'}(s)| \le \max_{a\in\mathcal{A}}|Q(s,a) - Q'(s,a)|$. Assume $V_Q(s) > V_{Q'}(s)$ (the other direction is symmetric), and let $a$ be the greedy action for $Q$ at $s$. Then

$$|V_Q(s) - V_{Q'}(s)| = Q(s,a) - \max_{a'\in\mathcal{A}} Q'(s,a') \le Q(s,a) - Q'(s,a) \le \max_{a\in\mathcal{A}}|Q(s,a) - Q'(s,a)|.$$

Using this,

$$\begin{aligned}
\|\mathcal{T}Q - \mathcal{T}Q'\|_\infty &= \gamma\|P V_Q - P V_{Q'}\|_\infty \\
&= \gamma\|P(V_Q - V_{Q'})\|_\infty \\
&\le \gamma\|V_Q - V_{Q'}\|_\infty \\
&= \gamma\max_s |V_Q(s) - V_{Q'}(s)| \\
&\le \gamma\max_s\max_a |Q(s,a) - Q'(s,a)| \\
&= \gamma\|Q - Q'\|_\infty
\end{aligned}$$
where the first inequality uses that each element of $P(V_Q - V_{Q'})$ is a convex average of $V_Q - V_{Q'}$ and the second inequality uses our claim above.

The following result bounds the sub-optimality of the greedy policy itself, based on the error in the Q-value function.

Lemma 1.11. (Q-Error Amplification) For any vector $Q \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$,
$$V^{\pi_Q} \ge V^\star - \frac{2\|Q - Q^\star\|_\infty}{1-\gamma}\,\mathbf{1},$$
where $\mathbf{1}$ denotes the vector of all ones.

Proof: Fix state $s$ and let $a = \pi_Q(s)$. We have:

$$\begin{aligned}
V^\star(s) - V^{\pi_Q}(s) &= Q^\star(s,\pi^\star(s)) - Q^{\pi_Q}(s,a) \\
&= Q^\star(s,\pi^\star(s)) - Q^\star(s,a) + Q^\star(s,a) - Q^{\pi_Q}(s,a) \\
&= Q^\star(s,\pi^\star(s)) - Q^\star(s,a) + \gamma\,\mathbb{E}_{s'\sim P(\cdot|s,a)}\big[V^\star(s') - V^{\pi_Q}(s')\big] \\
&\le Q^\star(s,\pi^\star(s)) - Q(s,\pi^\star(s)) + Q(s,a) - Q^\star(s,a) + \gamma\,\mathbb{E}_{s'\sim P(\cdot|s,a)}\big[V^\star(s') - V^{\pi_Q}(s')\big] \\
&\le 2\|Q - Q^\star\|_\infty + \gamma\|V^\star - V^{\pi_Q}\|_\infty,
\end{aligned}$$
where the first inequality uses $Q(s,\pi^\star(s)) \le Q(s,\pi_Q(s)) = Q(s,a)$ due to the definition of $\pi_Q$.

Theorem 1.12. (Q-value iteration convergence). Set $Q^{(0)} = 0$. For $k = 0, 1, \ldots$, suppose:

$$Q^{(k+1)} = \mathcal{T}Q^{(k)}.$$
Let $\pi^{(k)} = \pi_{Q^{(k)}}$. For $k \ge \frac{\log\frac{2}{(1-\gamma)^2\epsilon}}{1-\gamma}$,
$$V^{\pi^{(k)}} \ge V^\star - \epsilon\,\mathbf{1}.$$

Proof: Since $\|Q^\star\|_\infty \le 1/(1-\gamma)$, $Q^{(k)} = \mathcal{T}^k Q^{(0)}$ and $Q^\star = \mathcal{T}Q^\star$, Lemma 1.10 gives

$$\|Q^{(k)} - Q^\star\|_\infty = \|\mathcal{T}^k Q^{(0)} - \mathcal{T}^k Q^\star\|_\infty \le \gamma^k\|Q^{(0)} - Q^\star\|_\infty = (1-(1-\gamma))^k\|Q^\star\|_\infty \le \frac{\exp(-(1-\gamma)k)}{1-\gamma}.$$
The proof is completed with our choice of $k$ and using Lemma 1.11.

Iteration complexity for an exact solution. With regards to computing an exact optimal policy, when the gap between the current objective value and the optimal objective value is smaller than $2^{-L(P,r,\gamma)}$, then the greedy policy will be optimal. This leads to the claimed complexity in Table 0.1. Value iteration is not a strongly polynomial algorithm because, in finite time, it may never return the optimal policy.

1.3.2 Policy Iteration

The policy iteration algorithm, for discounted MDPs, starts from an arbitrary policy π0, and repeats the following iterative procedure: for k = 0, 1, 2,...

1. Policy evaluation. Compute $Q^{\pi_k}$.

2. Policy improvement. Update the policy:
$$\pi_{k+1} = \pi_{Q^{\pi_k}}$$

In each iteration, we compute the Q-value function of πk, using the analytical form given in Equation 0.2, and update the policy to be greedy with respect to this new Q-value. The first step is often called policy evaluation, and the second step is often called policy improvement.
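A minimal sketch of policy iteration under the same tabular representation (illustrative, not from the text); it evaluates the current deterministic policy with a direct $|\mathcal{S}|\times|\mathcal{S}|$ linear solve, consistent with the per-iteration cost $|\mathcal{S}|^3 + |\mathcal{S}|^2|\mathcal{A}|$ assumed in Table 0.1.

```python
import numpy as np

def policy_iteration(P, r, gamma, max_iters=1_000):
    """Policy iteration for a tabular, discounted MDP (P: |S|x|A|x|S|, r: |S|x|A|)."""
    S, A, _ = P.shape
    policy = np.zeros(S, dtype=int)                    # arbitrary initial deterministic policy
    for _ in range(max_iters):
        # Policy evaluation: solve (I - gamma P^pi) V = r^pi for the current policy.
        P_pi = P[np.arange(S), policy]                 # P_pi[s, s'] = P(s' | s, pi(s))
        r_pi = r[np.arange(S), policy]
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily w.r.t. Q^{pi_k}(s, a) = r(s, a) + gamma E[V(s')].
        Q = r + gamma * P @ V
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):         # greedy policy unchanged => optimal
            return policy, Q
        policy = new_policy
    return policy, Q
```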

Lemma 1.13. We have that:

1. $Q^{\pi_{k+1}} \ge \mathcal{T}Q^{\pi_k} \ge Q^{\pi_k}$

2. $\|Q^{\pi_{k+1}} - Q^\star\|_\infty \le \gamma\|Q^{\pi_k} - Q^\star\|_\infty$

Proof: First let us show that $\mathcal{T}Q^{\pi_k} \ge Q^{\pi_k}$. Note that the policies produced in policy iteration are always deterministic, so $V^{\pi_k}(s) = Q^{\pi_k}(s,\pi_k(s))$ for all iterations $k$ and states $s$. Hence,

$$\mathcal{T}Q^{\pi_k}(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s'\sim P(\cdot|s,a)}\big[\max_{a'} Q^{\pi_k}(s',a')\big] \ge r(s,a) + \gamma\,\mathbb{E}_{s'\sim P(\cdot|s,a)}\big[Q^{\pi_k}(s',\pi_k(s'))\big] = Q^{\pi_k}(s,a).$$

Now let us prove that $Q^{\pi_{k+1}} \ge \mathcal{T}Q^{\pi_k}$. First, let us see that $Q^{\pi_{k+1}} \ge Q^{\pi_k}$:
$$Q^{\pi_k} = r + \gamma P^{\pi_k} Q^{\pi_k} \le r + \gamma P^{\pi_{k+1}} Q^{\pi_k} \le \sum_{t=0}^{\infty}\gamma^t (P^{\pi_{k+1}})^t r = Q^{\pi_{k+1}},$$
where we have used that $\pi_{k+1}$ is the greedy policy in the first inequality and recursion in the second inequality. Using this,

$$\begin{aligned}
Q^{\pi_{k+1}}(s,a) &= r(s,a) + \gamma\,\mathbb{E}_{s'\sim P(\cdot|s,a)}\big[Q^{\pi_{k+1}}(s',\pi_{k+1}(s'))\big] \\
&\ge r(s,a) + \gamma\,\mathbb{E}_{s'\sim P(\cdot|s,a)}\big[Q^{\pi_k}(s',\pi_{k+1}(s'))\big] \\
&= r(s,a) + \gamma\,\mathbb{E}_{s'\sim P(\cdot|s,a)}\big[\max_{a'} Q^{\pi_k}(s',a')\big] = \mathcal{T}Q^{\pi_k}(s,a),
\end{aligned}$$
which completes the proof of the first claim. For the second claim,

$$\|Q^\star - Q^{\pi_{k+1}}\|_\infty \le \|Q^\star - \mathcal{T}Q^{\pi_k}\|_\infty = \|\mathcal{T}Q^\star - \mathcal{T}Q^{\pi_k}\|_\infty \le \gamma\|Q^\star - Q^{\pi_k}\|_\infty,$$
where we have used that $Q^\star \ge Q^{\pi_{k+1}} \ge \mathcal{T}Q^{\pi_k}$ in the second step and the contraction property of $\mathcal{T}$ (see Lemma 1.10) in the last step.

With this lemma, a convergence rate for the policy iteration algorithm immediately follows.

Theorem 1.14. (Policy iteration convergence). Let $\pi_0$ be any initial policy. For $k \ge \frac{\log\frac{1}{(1-\gamma)\epsilon}}{1-\gamma}$, the $k$-th policy in policy iteration has the following performance bound:

$$Q^{\pi_k} \ge Q^\star - \epsilon\,\mathbf{1}.$$

Iteration complexity for an exact solution. With regards to computing an exact optimal policy, it is clear from the previous results that policy iteration is no worse than value iteration. However, with regards to obtaining an exact solution with runtime independent of the bit complexity, $L(P,r,\gamma)$, improvements are possible (and where we assume basic arithmetic operations on real numbers are order one cost). Naively, the number of iterations of policy iteration is bounded by the number of policies, namely $|\mathcal{A}|^{|\mathcal{S}|}$; here, a small improvement is possible, where the number of iterations of policy iteration can be bounded by $\frac{|\mathcal{A}|^{|\mathcal{S}|}}{|\mathcal{S}|}$. Remarkably, for a fixed value of $\gamma$, policy iteration can be shown to be a strongly polynomial time algorithm, where policy iteration finds an exact policy in at most $\frac{|\mathcal{S}|^2|\mathcal{A}|\log\frac{|\mathcal{S}|^2}{1-\gamma}}{1-\gamma}$ iterations. See Table 0.1 for a summary, and Section 1.6 for references.

1.3.3 Value Iteration for Finite Horizon MDPs

Let us now specify the value iteration algorithm for finite-horizon MDPs. For the finite-horizon setting, it turns out that the analogues of value iteration and policy iteration lead to identical algorithms. The value iteration algorithm is specified as follows:

1. Set $Q_{H-1}(s,a) = r_{H-1}(s,a)$.

2. For $h = H-2, \ldots, 0$, set:
$$Q_h(s,a) = r_h(s,a) + \mathbb{E}_{s'\sim P_h(\cdot|s,a)}\Big[\max_{a'\in\mathcal{A}} Q_{h+1}(s',a')\Big].$$

By Theorem 1.9, it follows that $Q_h(s,a) = Q^\star_h(s,a)$ and that $\pi(s,h) = \mathrm{argmax}_{a\in\mathcal{A}} Q_h(s,a)$ is an optimal policy.
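A sketch of this backward induction, assuming hypothetical time-dependent arrays `P_h` of shape $H\times|\mathcal{S}|\times|\mathcal{A}|\times|\mathcal{S}|$ and `r_h` of shape $H\times|\mathcal{S}|\times|\mathcal{A}|$ that mirror the notation above:

```python
import numpy as np

def finite_horizon_value_iteration(P_h, r_h):
    """Backward induction: returns Q[h, s, a] = Q*_h(s, a) and the greedy policy pi[h, s]."""
    H, S, A, _ = P_h.shape
    Q = np.zeros((H, S, A))
    Q[H - 1] = r_h[H - 1]
    for h in range(H - 2, -1, -1):
        V_next = Q[h + 1].max(axis=1)          # V*_{h+1}(s') = max_a' Q*_{h+1}(s', a')
        Q[h] = r_h[h] + P_h[h] @ V_next        # no discounting in the finite-horizon setting
    policy = Q.argmax(axis=2)                  # pi(s, h) = argmax_a Q_h(s, a)
    return Q, policy
```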

1.3.4 The Linear Programming Approach

It is helpful to understand an alternative approach to finding an optimal policy for a known MDP. With regards to computation, consider the setting where our MDP $M = (\mathcal{S}, \mathcal{A}, P, r, \gamma, \mu)$ is known and $P$, $r$, and $\gamma$ are all specified by rational numbers. Here, from a computational perspective, the previous iterative algorithms are, strictly speaking, not polynomial time algorithms, because they depend polynomially on $1/(1-\gamma)$, which is not polynomial in the description length of the MDP. In particular, note that any rational value of $1-\gamma$ may be specified with only $O(\log\frac{1}{1-\gamma})$ bits of precision. In this context, we may hope for a fully polynomial time algorithm, when given knowledge of the MDP, which would have a computation time which would depend polynomially on the description length of the MDP $M$, when the parameters are specified as rational numbers. We now see that the LP approach provides a polynomial time algorithm.

The Primal LP and A Polynomial Time Algorithm

Consider the following optimization problem with variables $V \in \mathbb{R}^{|\mathcal{S}|}$:
$$\begin{aligned}
\min \quad & \sum_s \mu(s) V(s) \\
\text{subject to} \quad & V(s) \ge r(s,a) + \gamma\sum_{s'} P(s'|s,a)V(s') \quad \forall a\in\mathcal{A},\ s\in\mathcal{S}
\end{aligned}$$

Provided that $\mu$ has full support, the optimal value function $V^\star(s)$ is the unique solution to this linear program. With regards to computation time, linear programming approaches only depend on the description length of the coefficients in the program, since this determines the computational complexity of basic additions and multiplications. Thus, this approach will only depend on the bit length description of the MDP, when the MDP is specified by rational numbers.
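The primal LP can be handed to an off-the-shelf solver. Below is a minimal sketch using `scipy.optimize.linprog` (an assumed dependency, not something the text prescribes); the constraint $V(s) \ge r(s,a) + \gamma\sum_{s'}P(s'|s,a)V(s')$ is rewritten in the `A_ub @ V <= b_ub` form the solver expects.

```python
import numpy as np
from scipy.optimize import linprog

def optimal_values_via_lp(P, r, gamma, mu):
    """Solve min_V mu^T V  s.t.  V(s) >= r(s,a) + gamma * sum_s' P(s'|s,a) V(s') for all (s, a)."""
    S, A, _ = P.shape
    # One inequality row per (s, a):  -(V(s) - gamma * sum_s' P(s'|s,a) V(s')) <= -r(s, a).
    A_ub = np.zeros((S * A, S))
    b_ub = np.zeros(S * A)
    for s in range(S):
        for a in range(A):
            row = gamma * P[s, a]
            row[s] -= 1.0
            A_ub[s * A + a] = row
            b_ub[s * A + a] = -r[s, a]
    res = linprog(c=mu, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * S)
    return res.x   # the optimal value function V* (when mu has full support)
```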

Computational complexity for an exact solution. Table 0.1 shows the runtime complexity for the LP approach, where we assume a standard runtime for solving a linear program. The strongly polynomial algorithm is an interior point algorithm. See Section 1.6 for references.

16 Policy iteration and the simplex algorithm. It turns out that the policy iteration algorithm is actually the simplex method with block pivot. While the simplex method, in general, is not a strongly polynomial time algorithm, the policy iteration algorithm is a strongly polynomial time algorithm, provided we keep the discount factor fixed. See [Ye, 2011].

The Dual LP and the State-Action Polytope

For a fixed (possibly stochastic) policy $\pi$, let us define a visitation measure over states and actions induced by following $\pi$ after starting at $s_0$. Precisely, define this distribution, $d^\pi_{s_0}$, as follows:

$$d^\pi_{s_0}(s,a) := (1-\gamma)\sum_{t=0}^{\infty}\gamma^t \Pr^\pi(s_t = s, a_t = a \mid s_0) \tag{0.7}$$

where $\Pr^\pi(s_t = s, a_t = a \mid s_0)$ is the probability that $s_t = s$ and $a_t = a$, after starting at state $s_0$ and following $\pi$ thereafter. It is straightforward to verify that $d^\pi_{s_0}$ is a distribution over $\mathcal{S}\times\mathcal{A}$. We also overload notation and write:
$$d^\pi_\mu(s,a) = \mathbb{E}_{s_0\sim\mu}\big[d^\pi_{s_0}(s,a)\big]$$

for a distribution $\mu$ over $\mathcal{S}$. Recall that Lemma 1.6 provides a way to easily compute $d^\pi_\mu(s,a)$ through an appropriate vector-matrix multiplication.

It is straightforward to verify that $d^\pi_\mu$ satisfies, for all states $s\in\mathcal{S}$:
$$\sum_a d^\pi_\mu(s,a) = (1-\gamma)\mu(s) + \gamma\sum_{s',a'} P(s|s',a')\, d^\pi_\mu(s',a').$$
Let us define the state-action polytope $\mathcal{K}_\mu$ as follows:

$$\mathcal{K}_\mu := \Big\{ d \,\Big|\, d \ge 0 \ \text{and}\ \sum_a d(s,a) = (1-\gamma)\mu(s) + \gamma\sum_{s',a'} P(s|s',a')\, d(s',a') \Big\}.$$
We now see that this set precisely characterizes all state-action visitation distributions.

Proposition 1.15. We have that $\mathcal{K}_\mu$ is equal to the set of all feasible state-action visitation distributions, i.e. $d \in \mathcal{K}_\mu$ if and only if there exists a stationary (and possibly randomized) policy $\pi$ such that $d^\pi_\mu = d$.

With respect to the variables $d \in \mathbb{R}^{|\mathcal{S}|\cdot|\mathcal{A}|}$, the dual LP formulation is as follows:
$$\begin{aligned}
\max \quad & \frac{1}{1-\gamma}\sum_{s,a} d(s,a)\, r(s,a) \\
\text{subject to} \quad & d \in \mathcal{K}_\mu
\end{aligned}$$

Note that $\mathcal{K}_\mu$ is itself a polytope, and one can verify that this is indeed the dual of the aforementioned LP. This provides an alternative approach to finding an optimal solution. If $d^\star$ is the solution to this LP, and provided that $\mu$ has full support, then we have that:

$$\pi^\star(a|s) = \frac{d^\star(s,a)}{\sum_{a'} d^\star(s,a')}$$
is an optimal policy. An alternative optimal policy is $\mathrm{argmax}_a d^\star(s,a)$ (and these policies are identical if the optimal policy is unique).
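A sketch of the dual route (again using `scipy.optimize.linprog` as an assumed dependency): solve for $d \in \mathcal{K}_\mu$ and read off the policy $\pi^\star(a|s) \propto d^\star(s,a)$. Variable ordering and names here are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def optimal_policy_via_dual_lp(P, r, gamma, mu):
    """max (1/(1-gamma)) sum_{s,a} d(s,a) r(s,a)  s.t.  d in K_mu,  d >= 0."""
    S, A, _ = P.shape
    # Equality constraints: sum_a d(s,a) - gamma * sum_{s',a'} P(s|s',a') d(s',a') = (1-gamma) mu(s).
    A_eq = np.zeros((S, S * A))
    for s in range(S):
        for s_prime in range(S):
            for a_prime in range(A):
                A_eq[s, s_prime * A + a_prime] = -gamma * P[s_prime, a_prime, s]
        A_eq[s, s * A: s * A + A] += 1.0
    b_eq = (1 - gamma) * mu
    # linprog minimizes, so negate the (scaled) reward objective; d >= 0 is the default bound.
    res = linprog(c=-r.reshape(S * A) / (1 - gamma), A_eq=A_eq, b_eq=b_eq)
    d_star = res.x.reshape(S, A)
    return d_star / d_star.sum(axis=1, keepdims=True)   # pi*(a|s) = d*(s,a) / sum_a' d*(s,a')
```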

1.4 Sample Complexity and Sampling Models

Much of reinforcement learning is concerned with finding a near optimal policy (or obtaining near optimal reward) in settings where the MDP is not known to the learner. We will study these questions in a few different models of how the agent obtains information about the unknown underlying MDP. In each of these settings, we are interested in understanding the number of samples required to find a near optimal policy, i.e. the sample complexity. Ultimately, we are interested in obtaining results which are applicable to cases where the number of states and actions is large (or, possibly, countably or uncountably infinite). This is in many ways analogous to the supervised learning question of generalization, though, as we shall see, this question is fundamentally more challenging in the reinforcement learning setting.

The Episodic Setting. In the episodic setting, in every episode, the learner acts for some finite number of steps starting from a starting state $s_0 \sim \mu$; the learner observes the trajectory, and the state resets to $s_0 \sim \mu$. This episodic model of feedback is applicable to both the finite-horizon and infinite horizon settings.

• (Finite Horizon MDPs) Here, each episode lasts for $H$ steps, and then the state is reset to $s_0 \sim \mu$.

• (Infinite Horizon MDPs) Even for infinite horizon MDPs it is natural to work in an episodic model for learning, where each episode terminates after a finite number of steps. Here, it is often natural to assume either that the agent can terminate the episode at will or that the episode will terminate at each step with probability $1-\gamma$. After termination, we again assume that the state is reset to $s_0 \sim \mu$. Note that, if each step in an episode is terminated with probability $1-\gamma$, then the observed cumulative reward in an episode of a policy provides an unbiased estimate of the infinite-horizon, discounted value of that policy.

In this setting, we are often interested in either the number of episodes it takes to find a near optimal policy, which is a PAC (probably approximately correct) guarantee, or we are interested in a regret guarantee (which we will study in Chapter 7). Both of these questions are with regards to the statistical complexity (i.e. the sample complexity) of learning. The episodic setting is challenging in that the agent has to engage in some exploration in order to gain information at the relevant states. As we shall see in Chapter 7, this exploration must be strategic, in the sense that simply behaving randomly will not lead to information being gathered quickly enough. It is often helpful to study the statistical complexity of learning in a more abstract sampling model, a generative model, which allows us to avoid having to directly address this exploration issue. Furthermore, this sampling model is natural in its own right.

The generative model setting. A generative model takes as input a state action pair $(s,a)$ and returns a sample $s' \sim P(\cdot|s,a)$ and the reward $r(s,a)$ (or a sample of the reward if the rewards are stochastic).

The offline RL setting. The offline RL setting is where the agent has access to an offline dataset, say generated under some policy (or a collection of policies). In the simplest of these settings, we may assume our dataset is of the form $\{(s, a, s', r)\}$ where $r$ is the reward (corresponding to $r(s,a)$ if the reward is deterministic) and $s' \sim P(\cdot|s,a)$. Furthermore, for simplicity, it can be helpful to assume that the $(s,a)$ pairs in this dataset were sampled i.i.d. from some fixed distribution $\nu$ over $\mathcal{S}\times\mathcal{A}$.

1.5 Bonus: Advantages and The Performance Difference Lemma

Throughout, we will overload notation where, for a distribution $\mu$ over $\mathcal{S}$, we write:
$$V^\pi(\mu) = \mathbb{E}_{s\sim\mu}\big[V^\pi(s)\big].$$

The advantage $A^\pi(s,a)$ of a policy $\pi$ is defined as
$$A^\pi(s,a) := Q^\pi(s,a) - V^\pi(s).$$
Note that:
$$A^\star(s,a) := A^{\pi^\star}(s,a) \le 0$$
for all state-action pairs.

Analogous to the state-action visitation distribution (see Equation 0.7), we can define a visitation measure over just the states. When clear from context, we will overload notation and also denote this distribution by $d^\pi_{s_0}$, where:
$$d^\pi_{s_0}(s) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^t \Pr^\pi(s_t = s \mid s_0). \tag{0.8}$$
Here, $\Pr^\pi(s_t = s \mid s_0)$ is the state visitation probability, under $\pi$ starting at state $s_0$. Again, we write:
$$d^\pi_\mu(s) = \mathbb{E}_{s_0\sim\mu}\big[d^\pi_{s_0}(s)\big]$$
for a distribution $\mu$ over $\mathcal{S}$.

The following lemma is helpful in the analysis of RL algorithms.

Lemma 1.16. (The performance difference lemma) For all policies $\pi, \pi'$ and distributions $\mu$ over $\mathcal{S}$,
$$V^\pi(\mu) - V^{\pi'}(\mu) = \frac{1}{1-\gamma}\,\mathbb{E}_{s'\sim d^\pi_\mu}\,\mathbb{E}_{a'\sim\pi(\cdot|s')}\big[A^{\pi'}(s',a')\big].$$

Proof: Let $\Pr^\pi(\tau \mid s_0 = s)$ denote the probability of observing a trajectory $\tau$ when starting in state $s$ and following the policy $\pi$. By definition of $d^\pi_{s_0}$, observe that for any function $f : \mathcal{S}\times\mathcal{A} \to \mathbb{R}$,
$$\mathbb{E}_{\tau\sim\Pr^\pi}\Big[\sum_{t=0}^{\infty}\gamma^t f(s_t,a_t)\Big] = \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^\pi_{s_0}}\,\mathbb{E}_{a\sim\pi(\cdot|s)}\big[f(s,a)\big]. \tag{0.9}$$
Using a telescoping argument, we have:
$$\begin{aligned}
V^\pi(s) - V^{\pi'}(s) &= \mathbb{E}_{\tau\sim\Pr^\pi(\tau|s_0=s)}\Big[\sum_{t=0}^{\infty}\gamma^t r(s_t,a_t)\Big] - V^{\pi'}(s) \\
&= \mathbb{E}_{\tau\sim\Pr^\pi(\tau|s_0=s)}\Big[\sum_{t=0}^{\infty}\gamma^t\big(r(s_t,a_t) + V^{\pi'}(s_t) - V^{\pi'}(s_t)\big)\Big] - V^{\pi'}(s) \\
&\overset{(a)}{=} \mathbb{E}_{\tau\sim\Pr^\pi(\tau|s_0=s)}\Big[\sum_{t=0}^{\infty}\gamma^t\big(r(s_t,a_t) + \gamma V^{\pi'}(s_{t+1}) - V^{\pi'}(s_t)\big)\Big] \\
&\overset{(b)}{=} \mathbb{E}_{\tau\sim\Pr^\pi(\tau|s_0=s)}\Big[\sum_{t=0}^{\infty}\gamma^t\big(r(s_t,a_t) + \gamma\,\mathbb{E}[V^{\pi'}(s_{t+1}) \mid s_t, a_t] - V^{\pi'}(s_t)\big)\Big] \\
&\overset{(c)}{=} \mathbb{E}_{\tau\sim\Pr^\pi(\tau|s_0=s)}\Big[\sum_{t=0}^{\infty}\gamma^t\big(Q^{\pi'}(s_t,a_t) - V^{\pi'}(s_t)\big)\Big] \\
&= \mathbb{E}_{\tau\sim\Pr^\pi(\tau|s_0=s)}\Big[\sum_{t=0}^{\infty}\gamma^t A^{\pi'}(s_t,a_t)\Big] \\
&= \frac{1}{1-\gamma}\,\mathbb{E}_{s'\sim d^\pi_{s}}\,\mathbb{E}_{a'\sim\pi(\cdot|s')}\big[A^{\pi'}(s',a')\big],
\end{aligned}$$
where (a) rearranges terms in the summation via telescoping; (b) uses the tower property of conditional expectations; (c) follows by definition; and the final equality follows from Equation 0.9.
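The performance difference lemma is easy to sanity-check numerically on a small MDP. The sketch below (illustrative only) compares both sides of the identity using exact quantities; it assumes the hypothetical `q_pi` and `visitation_matrix` helpers from the earlier sketch in Section 1.1.2.

```python
import numpy as np

def check_performance_difference(P, r, gamma, mu, policy, policy_prime):
    """Verify V^pi(mu) - V^{pi'}(mu) = (1/(1-gamma)) E_{(s,a)~d^pi_mu}[A^{pi'}(s,a)]."""
    S, A, _ = P.shape
    Q = q_pi(P, r, gamma, policy)                      # Q^pi
    Q_prime = q_pi(P, r, gamma, policy_prime)          # Q^{pi'}
    V = (policy * Q).sum(axis=1)                       # V^pi(s) = E_{a~pi(.|s)}[Q^pi(s,a)]
    V_prime = (policy_prime * Q_prime).sum(axis=1)
    adv_prime = Q_prime - V_prime[:, None]             # A^{pi'}(s, a)

    # d^pi_mu over state-action pairs: (mu x pi) weighting of the rows of (1-gamma)(I - gamma P^pi)^{-1}.
    init_sa = (mu[:, None] * policy).reshape(S * A)
    d_mu = init_sa @ visitation_matrix(P, gamma, policy)

    lhs = mu @ V - mu @ V_prime                        # V^pi(mu) - V^{pi'}(mu)
    rhs = (d_mu * adv_prime.reshape(S * A)).sum() / (1 - gamma)
    return np.isclose(lhs, rhs)
```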

1.6 Bibliographic Remarks and Further Reading

We refer the reader to [Puterman, 1994] for a more detailed treatment of dynamic programming and MDPs. [Puterman, 1994] also contains a thorough treatment of the dual LP, along with a proof of Proposition 1.15.

With regards to the computational complexity of policy iteration, [Ye, 2011] showed that policy iteration is a strongly polynomial time algorithm for a fixed discount rate.¹ Also, see [Ye, 2011] for a good summary of the computational complexities of various approaches. [Mansour and Singh, 1999] showed that the number of iterations of policy iteration can be bounded by $\frac{|\mathcal{A}|^{|\mathcal{S}|}}{|\mathcal{S}|}$. With regards to a strongly polynomial algorithm, the CIPA algorithm [Ye, 2005] is an interior point algorithm with the claimed runtime in Table 0.1.

Lemma 1.11 is due to Singh and Yee [1994]. The performance difference lemma is due to [Kakade and Langford, 2002, Kakade, 2003], though the lemma was implicit in the analysis of a number of prior works.

¹ The stated strongly polynomial runtime in Table 0.1 for policy iteration differs from that in [Ye, 2011] because we assume that the runtime per iteration of policy iteration is $|\mathcal{S}|^3 + |\mathcal{S}|^2|\mathcal{A}|$.

Chapter 2

Sample Complexity with a Generative Model

This chapter begins our study of the sample complexity, where we focus on the (minmax) number of transitions we need to observe in order to accurately estimate $Q^\star$ or in order to find a near optimal policy. We assume that we have access to a generative model (as defined in Section 1.4) and that the reward function is deterministic (the latter is often a mild assumption, since much of the difficulty in RL is due to the uncertainty in the transition model $P$). This chapter follows the results due to [Azar et al., 2013], along with some improved rates due to [Agarwal et al., 2020c].

One of the key observations in this chapter is that we can find a near optimal policy using a number of observed transitions that is sublinear in the model size, i.e. use a number of samples that is smaller than $O(|\mathcal{S}|^2|\mathcal{A}|)$. In other words, we do not need to learn an accurate model of the world in order to learn to act near optimally.

Notation. We define $\widehat{M}$ to be the empirical MDP that is identical to the original $M$, except that it uses $\widehat{P}$ instead of $P$ for the transition model. When clear from context, we drop the subscript $M$ on the values and action values (and the one-step variances and variances which we define later). We let $\widehat{V}^\pi$, $\widehat{Q}^\pi$, $\widehat{Q}^\star$, and $\widehat{\pi}^\star$ denote the value function, state-action value function, optimal state-action value, and optimal policy in $\widehat{M}$, respectively.

2.1 Warmup: a naive model-based approach

A central question in this chapter is: Do we require an accurate model of the world in order to find a near optimal policy?

Recall that a generative model takes as input a state action pair $(s,a)$ and returns a sample $s' \sim P(\cdot|s,a)$ and the reward $r(s,a)$ (or a sample of the reward if the rewards are stochastic). Let us consider the most naive approach to learning (when we have access to a generative model): suppose we call our simulator $N$ times at each state action pair. Let $\widehat{P}$ be our empirical model, defined as follows:

$$\widehat{P}(s'|s,a) = \frac{\mathrm{count}(s', s, a)}{N},$$
where $\mathrm{count}(s', s, a)$ is the number of times the state-action pair $(s,a)$ transitions to state $s'$. As $N$ is the number of calls for each state action pair, the total number of calls to our generative model is $|\mathcal{S}||\mathcal{A}| N$. As before, we can view $\widehat{P}$ as a matrix of size $|\mathcal{S}||\mathcal{A}| \times |\mathcal{S}|$.

Note that since $P$ has $|\mathcal{S}|^2|\mathcal{A}|$ parameters, we would expect that observing $O(|\mathcal{S}|^2|\mathcal{A}|)$ transitions is sufficient to provide us with an accurate model. The following proposition shows that this is the case.

Proposition 2.1. There exists an absolute constant $c$ such that the following holds. Suppose $\epsilon \in \big(0, \frac{1}{1-\gamma}\big)$ and that we obtain
$$\#\text{ samples from generative model} = |\mathcal{S}||\mathcal{A}|\, N \ \ge\ \frac{\gamma\, |\mathcal{S}|^2|\mathcal{A}|\, \log(c|\mathcal{S}||\mathcal{A}|/\delta)}{(1-\gamma)^4\, \epsilon^2},$$
where we uniformly sample every state action pair. Then, with probability greater than $1-\delta$, we have:

• (Model accuracy) The transition model has error bounded as:
$$\max_{s,a}\|P(\cdot|s,a) - \widehat{P}(\cdot|s,a)\|_1 \le (1-\gamma)^2\epsilon.$$

• (Uniform value accuracy) For all policies $\pi$,
$$\|Q^\pi - \widehat{Q}^\pi\|_\infty \le \epsilon.$$

• (Near optimal planning) Suppose that $\widehat{\pi}^\star$ is the optimal policy in $\widehat{M}$. We have that:
$$\|\widehat{Q}^\star - Q^\star\|_\infty \le \epsilon, \quad \text{and} \quad \|Q^{\widehat{\pi}^\star} - Q^\star\|_\infty \le 2\epsilon.$$

Before we provide the proof, the following lemmas will be helpful throughout:

Lemma 2.2. (Simulation Lemma) For all $\pi$ we have that:

$$Q^\pi - \widehat{Q}^\pi = \gamma(I - \gamma\widehat{P}^\pi)^{-1}(P - \widehat{P})V^\pi$$

Proof: Using our matrix equality for Qπ (see Equation 0.2), we have:

$$\begin{aligned}
Q^\pi - \widehat{Q}^\pi &= (I - \gamma P^\pi)^{-1}r - (I - \gamma\widehat{P}^\pi)^{-1}r \\
&= (I - \gamma\widehat{P}^\pi)^{-1}\big((I - \gamma\widehat{P}^\pi) - (I - \gamma P^\pi)\big)Q^\pi \\
&= \gamma(I - \gamma\widehat{P}^\pi)^{-1}(P^\pi - \widehat{P}^\pi)Q^\pi \\
&= \gamma(I - \gamma\widehat{P}^\pi)^{-1}(P - \widehat{P})V^\pi
\end{aligned}$$
which proves the claim.

Lemma 2.3. For any policy $\pi$, MDP $M$ and vector $v \in \mathbb{R}^{|\mathcal{S}|\times|\mathcal{A}|}$, we have $\|(I - \gamma P^\pi)^{-1}v\|_\infty \le \|v\|_\infty / (1-\gamma)$.

Proof: Note that $v = (I - \gamma P^\pi)(I - \gamma P^\pi)^{-1}v = (I - \gamma P^\pi)w$, where $w = (I - \gamma P^\pi)^{-1}v$. By the triangle inequality, we have

$$\|v\|_\infty = \|(I - \gamma P^\pi)w\|_\infty \ge \|w\|_\infty - \gamma\|P^\pi w\|_\infty \ge \|w\|_\infty - \gamma\|w\|_\infty,$$
where the final inequality follows since $P^\pi w$ is an average of the elements of $w$ by the definition of $P^\pi$, so that $\|P^\pi w\|_\infty \le \|w\|_\infty$. Rearranging terms completes the proof.

Now we are ready to complete the proof of our proposition.

Proof: Using the concentration of a distribution in the $\ell_1$ norm (Lemma A.8), we have, for a fixed $(s,a)$, that with probability greater than $1-\delta$:
$$\|P(\cdot|s,a) - \widehat{P}(\cdot|s,a)\|_1 \le c\sqrt{\frac{|\mathcal{S}|\log(1/\delta)}{m}}$$

where $m$ is the number of samples used to estimate $\widehat{P}(\cdot|s,a)$. The first claim now follows by the union bound (and redefining $\delta$ and $c$ appropriately).

For the second claim, we have that:

$$\begin{aligned}
\|Q^\pi - \widehat{Q}^\pi\|_\infty &= \gamma\|(I - \gamma\widehat{P}^\pi)^{-1}(P - \widehat{P})V^\pi\|_\infty \le \frac{\gamma}{1-\gamma}\|(P - \widehat{P})V^\pi\|_\infty \\
&\le \frac{\gamma}{1-\gamma}\max_{s,a}\|P(\cdot|s,a) - \widehat{P}(\cdot|s,a)\|_1\,\|V^\pi\|_\infty \le \frac{\gamma}{(1-\gamma)^2}\max_{s,a}\|P(\cdot|s,a) - \widehat{P}(\cdot|s,a)\|_1
\end{aligned}$$
where the penultimate step uses Hölder's inequality. The second claim now follows.

For the final claim, first observe that $|\sup_x f(x) - \sup_x g(x)| \le \sup_x |f(x) - g(x)|$, where $f$ and $g$ are real valued functions. This implies:

$$|\widehat{Q}^\star(s,a) - Q^\star(s,a)| = \Big|\sup_\pi \widehat{Q}^\pi(s,a) - \sup_\pi Q^\pi(s,a)\Big| \le \sup_\pi|\widehat{Q}^\pi(s,a) - Q^\pi(s,a)| \le \epsilon,$$
which proves the first inequality. The second inequality is left as an exercise to the reader.
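To make the naive model-based approach concrete, here is a sketch (with illustrative names) that draws $N$ samples per state-action pair from a generative model, builds $\widehat{P}$, and plans in the empirical MDP $\widehat{M}$; it reuses the hypothetical `q_value_iteration` helper from the value-iteration sketch in Chapter 1.

```python
import numpy as np

def empirical_model(P, N, rng=None):
    """Call the generative model N times per (s, a): P_hat(s'|s,a) = count(s', s, a) / N."""
    rng = np.random.default_rng() if rng is None else rng
    S, A, _ = P.shape
    P_hat = np.zeros_like(P)
    for s in range(S):
        for a in range(A):
            samples = rng.choice(S, size=N, p=P[s, a])        # s' ~ P(.|s, a)
            P_hat[s, a] = np.bincount(samples, minlength=S) / N
    return P_hat

def model_based_plan(P, r, gamma, N, rng=None):
    """Certainty-equivalence planning: return the optimal (greedy) policy of M_hat."""
    P_hat = empirical_model(P, N, rng=rng)
    Q_hat_star = q_value_iteration(P_hat, r, gamma)
    return Q_hat_star.argmax(axis=1), P_hat
```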

2.2 Sublinear Sample Complexity

In the previous approach, we are able to accurately estimate the value of every policy in the unknown MDP $M$. However, with regards to planning, we only need an accurate estimate $\widehat{Q}^\star$ of $Q^\star$, which we may hope would require fewer samples. Let us now see that the model based approach can be refined to obtain minmax optimal sample complexity, which we will see is sublinear in the model size.

We will state our results in terms of $N$, and recall that $N$ is the number of calls to the generative model per state-action pair, so that:
$$\#\text{ samples from generative model} = |\mathcal{S}||\mathcal{A}|\, N.$$
Let us start with a crude bound on the optimal action-values, which provides a sublinear rate. In the next section, we will improve upon this to obtain the minmax optimal rate.

Proposition 2.4. (Crude Value Bounds) Let $\delta \ge 0$. With probability greater than $1-\delta$,
$$\|Q^\star - \widehat{Q}^\star\|_\infty \le \Delta_{\delta,N},$$
$$\|Q^\star - \widehat{Q}^{\pi^\star}\|_\infty \le \Delta_{\delta,N},$$
where:
$$\Delta_{\delta,N} := \frac{\gamma}{(1-\gamma)^2}\sqrt{\frac{2\log(2|\mathcal{S}||\mathcal{A}|/\delta)}{N}}.$$

Note that the first inequality above shows a sublinear rate on estimating the value function. Ultimately, we are interested in the value $V^{\widehat{\pi}^\star}$ when we execute $\widehat{\pi}^\star$, not just an estimate $\widehat{Q}^\star$ of $Q^\star$. Here, by Lemma 1.11, we lose an additional horizon factor and have:
$$\|Q^\star - Q^{\widehat{\pi}^\star}\|_\infty \le \frac{1}{1-\gamma}\Delta_{\delta,N}.$$
As we see in Theorem 2.6, this is improvable. Before we provide the proof, the following lemma will be helpful throughout.

23 Lemma 2.5. (Component-wise Bounds) We have that:

$$Q^\star - \widehat{Q}^\star \le \gamma(I - \gamma\widehat{P}^{\pi^\star})^{-1}(P - \widehat{P})V^\star$$
$$Q^\star - \widehat{Q}^\star \ge \gamma(I - \gamma\widehat{P}^{\widehat{\pi}^\star})^{-1}(P - \widehat{P})V^\star$$

Proof: For the first claim, the optimality of π? in M implies:

$$Q^\star - \widehat{Q}^\star = Q^{\pi^\star} - \widehat{Q}^{\widehat{\pi}^\star} \le Q^{\pi^\star} - \widehat{Q}^{\pi^\star} = \gamma(I - \gamma\widehat{P}^{\pi^\star})^{-1}(P - \widehat{P})V^\star,$$
where we have used Lemma 2.2 in the final step. This proves the first claim. For the second claim,

$$\begin{aligned}
Q^\star - \widehat{Q}^\star &= Q^{\pi^\star} - \widehat{Q}^{\widehat{\pi}^\star} \\
&= (I - \gamma P^{\pi^\star})^{-1}r - (I - \gamma\widehat{P}^{\widehat{\pi}^\star})^{-1}r \\
&= (I - \gamma\widehat{P}^{\widehat{\pi}^\star})^{-1}\big((I - \gamma\widehat{P}^{\widehat{\pi}^\star}) - (I - \gamma P^{\pi^\star})\big)Q^\star \\
&= \gamma(I - \gamma\widehat{P}^{\widehat{\pi}^\star})^{-1}(P^{\pi^\star} - \widehat{P}^{\widehat{\pi}^\star})Q^\star \\
&\ge \gamma(I - \gamma\widehat{P}^{\widehat{\pi}^\star})^{-1}(P^{\pi^\star} - \widehat{P}^{\pi^\star})Q^\star \\
&= \gamma(I - \gamma\widehat{P}^{\widehat{\pi}^\star})^{-1}(P - \widehat{P})V^\star,
\end{aligned}$$
where the inequality follows from $\widehat{P}^{\widehat{\pi}^\star}Q^\star \le \widehat{P}^{\pi^\star}Q^\star$, due to the optimality of $\pi^\star$. This proves the second claim.

Proof: (of Proposition 2.4) Following from the simulation lemma (Lemma 2.2) and Lemma 2.3, we have:

$$\|Q^\star - \widehat{Q}^{\pi^\star}\|_\infty \le \frac{\gamma}{1-\gamma}\|(P - \widehat{P})V^\star\|_\infty.$$
Also, the previous lemma implies that:

$$\|Q^\star - \widehat{Q}^\star\|_\infty \le \frac{\gamma}{1-\gamma}\|(P - \widehat{P})V^\star\|_\infty.$$
By applying Hoeffding's inequality and the union bound,
$$\|(P - \widehat{P})V^\star\|_\infty = \max_{s,a}\big|\mathbb{E}_{s'\sim P(\cdot|s,a)}[V^\star(s')] - \mathbb{E}_{s'\sim\widehat{P}(\cdot|s,a)}[V^\star(s')]\big| \le \frac{1}{1-\gamma}\sqrt{\frac{2\log(2|\mathcal{S}||\mathcal{A}|/\delta)}{N}},$$
which holds with probability greater than $1-\delta$. This completes the proof.

2.3 Minmax Optimal Sample Complexity (and the Model Based Approach)

We now see that the model based approach is minmax optimal, for both the discounted case and the finite horizon setting.

2.3.1 The Discounted Case

Upper bounds. The following theorem refines our crude bound on $\widehat{Q}^\star$.

Theorem 2.6. For $\delta \ge 0$ and for an appropriately chosen absolute constant $c$, we have that:

Lower Bounds. Let us say that an estimation algorithm , which is a map from samples to an estimate Qb?, is ? ? A (, δ)-good on MDP M if Q Qb ∞  holds with probability greater than 1 δ. k − k ≤ − Theorem 2.8. There exists 0, δ0, c and a set of MDPs such that for  (0, 0) and δ (0, δ0) if algorithm is (, δ)-good on all M , then must use a number ofM samples that is lower∈ bounded as∈ follows A ∈ M A c log(c /δ) # samples from generative model |S||A| |S||A| . ≥ (1 γ)3 2 − In other words, this theorem shows that the model based approach minmax optimal.

2.3.2 Finite Horizon Setting

Recall the setting of finite horizon MDPs defined in Section 1.2. Again, we can consider the most naive approach to learning (when we have access to a generative model): suppose we call our simulator N times for every (s, a, h) 0 ∈ [H], i.e. we obtain N i.i.d. samples where s Ph( s, a), for every (s, a, h) [H]. Note that the totalS × A number × of observed transitions is H N. ∼ ·| ∈ S × A × |S||A| 25 Upper bounds. The following theorem provides an upper bound on the model based approach.

Theorem 2.9. For δ 0 and with probability greater than 1 δ, we have that: ≥ −

(Value estimation) • r ? ? log(c /δ) log(c /δ) Q Qb ∞ cH |S||A| + cH |S||A| , k 0 − 0k ≤ N N (Sub-optimality) • r ? π? log(c /δ) log(c /δ) Q Qb ∞ cH |S||A| + cH |S||A| , k 0 − 0 k ≤ N N where c is an absolute constant.

Note that the above bound requires N to be O(H2) in order to achieve an -optimal policy, while in the discounted case, we require N to be O(1/(1 γ)3) for the same guarantee. While this may seem like an improvement by a horizon factor, recall that for the finite− horizon case, N corresponds to observing O(H) more transitions than in the discounted case.

Lower Bounds. In the minmax sense of Theorem 2.8, the previous upper bound provided by the model based approach for the finite horizon setting achieves the minmax optimal sample complexity.

2.4 Analysis

We now prove (the first claim in) Theorem 2.6.

2.4.1 Variance Lemmas

The key to the sharper analysis is to more sharply characterize the variance in our estimates. Denote the variance of any real valued f under a distribution as: D 2 2 VarD(f) := Ex∼D[f(x) ] (Ex∼D[f(x)]) − |S| |S||A| Slightly abusing the notation, for V R , we define the vector VarP (V ) R as: ∈ ∈

VarP (V )(s, a) := VarP (·|s,a)(V )

Equivalently, 2 2 VarP (V ) = P (V ) (PV ) . − Now we characterize a relevant deviation in terms of the its variance.

Lemma 2.10. Let δ > 0. With probability greater than 1 δ, − r ? 2 log(2 /δ)p ? 1 2 log(2 /δ) (P Pb)V |S||A| VarP (V ) + |S||A| 1 . | − | ≤ N 1 γ 3N − 26 Proof: The claims follows from Bernstein’s inequality along with a union bound over all state-action pairs.

π? −1p ? π? −1p ? The key ideas in the proof are in how we bound (I γPb ) VarP (V ) ∞ and (I γPbb ) VarP (V ) ∞. k − k k − k π It is helpful to define ΣM as the variance of the discounted reward, i.e.   ∞ !2 π X t π ΣM (s, a) := E  γ r(st, at) QM (s, a) s0 = s, a0 = a − t=0 where the expectation is induced under the trajectories induced by π in M. It is straightforward to verify that π 2 2 Σ ∞ γ /(1 γ) . k M k ≤ − π The following lemma shows that ΣM satisfies a Bellman consistency condition. Lemma 2.11. (Bellman consistency of Σ) For any MDP M, π 2 π 2 π π ΣM = γ VarP (VM ) + γ P ΣM (0.1) where P is the transition model in MDP M.

The proof is left as an exercise to the reader. Lemma 2.12. (Weighted Sum of Deviations) For any policy π and MDP M, s q 2 π −1 π (I γP ) VarP (VM ) , − ∞ ≤ (1 γ)3 − where P is the transition model of M.

Proof:Note that (1 γ)(I γP π)−1 is matrix whose rows are a probability distribution. For a positive vector v and a distribution ν (where− ν is− vector of the same dimension of v), Jensen’s inequality implies that ν √v √ν v. This implies: · ≤ ·

π −1 1 π −1 (I γP ) √v ∞ = (1 γ)(I γP ) √v ∞ k − k 1 γ k − − k − r 1 (I γP π)−1v ≤ 1 γ − ∞ − r 2 (I γ2P π)−1v . ≤ 1 γ − ∞ − π −1 2 π −1 where we have used that (I γP ) v ∞ 2 (I γ P ) v ∞ (which we will prove shortly). The proof is k − πk ≤2 k −2 π −1 k π π completed as follows: by Equation 0.1, ΣM = γ (I γ P ) VarP (VM ), so taking v = VarP (VM ) and using that π 2 2 − Σ ∞ γ /(1 γ) completes the proof. k M k ≤ − π −1 2 π −1 Finally, to see that (I γP ) v ∞ 2 (I γ P ) v ∞, observe: k − k ≤ k − k π −1 π −1 2 π 2 π −1 (I γP ) v ∞ = (I γP ) (I γ P )(I γ P ) v ∞ k − k k − − − k π −1 π  2 π −1 = (I γP ) (1 γ)I + γ(I γP ) (I γ P ) v ∞ k − − − − k  π −1  2 π −1 = (1 γ)(I γP ) + γI (I γ P ) v ∞ k − − − k π −1 2 π −1 2 π −1 (1 γ) (I γP ) (I γ P ) v ∞ + γ (I γ P ) v ∞ ≤ − k − − k k − k 1 γ 2 π −1 2 π −1 − (I γ P ) v ∞ + γ (I γ P ) v ∞ ≤ 1 γ k − k k − k − 2 π −1 2 (I γ P ) v ∞ ≤ k − k which proves the claim.

27 2.4.2 Completing the proof

Lemma 2.13. Let δ 0. With probability greater than 1 δ, we have: ≥ − ? π? 0 VarP (V ) 2Var (Vb ) + ∆ 1 ≤ Pb δ,N ? ? 0 VarP (V ) 2Var (Vb ) + ∆ 1 ≤ Pb δ,N where r 1 18 log(6 /δ) 1 4 log(6 /δ) ∆0 := |S||A| + |S||A| . δ,N (1 γ)2 N (1 γ)4 N − − Proof: By definition, ? ? ? ? VarP (V ) = VarP (V ) Var (V ) + Var (V ) − Pb Pb = P (V ?)2 (PV ?)2 P (V ?)2 + (PV ?)2 + Var (V ?) b b Pb − −  = (P Pb)(V ?)2 (PV ?)2 (PVb ?)2 + Var (V ?) − − − Pb Now we bound each of these terms with Hoeffding’s inequality and the union bound. For the first term, with probability greater than 1 δ, − r ? 2 1 2 log(2 /δ) (P Pb)(V ) ∞ |S||A| . k − k ≤ (1 γ)2 N − For the second term, again with probability greater than 1 δ, − ? 2 ? 2 ? ? ? ? (PV ) (PVb ) ∞ PV + PVb ∞ PV PVb ∞ k − k ≤ k rk k − k 2 ? 2 2 log(2 /δ) (P Pb)V ∞ |S||A| . ≤ 1 γ k − k ≤ (1 γ)2 N − − where we have used that ( )2 is a component-wise operation in the second step. For the last term: · ? ? Var (V ?) = Var (V ? V π + V π ) Pb Pb b b − ? ? 2Var (V ? V π ) + 2Var (V π ) Pb b Pb b ≤ −? ? 2 V ? V π 2 + 2Var (V π ) b ∞ Pb b ≤ k − k ? = 2∆2 + 2Var (V π ) . δ,N Pb b where ∆δ,N is defined in Proposition 2.4. To obtain a cumulative probability of error less than δ, we replace δ in the above claims with δ/3. Combining these bounds completes the proof of the first claim. The argument in the above display also implies that Var (V ?) 2∆2 + 2Var (Vb ?) which proves the second claim. Pb ≤ δ,N Pb Using Lemma 2.10 and 2.13, we have the following corollary. Corollary 2.14. Let δ 0. With probability greater than 1 δ, we have: ≥ − s Var (Vb π? ) log(c /δ) (P Pb)V ? c Pb |S||A| + ∆00 1 | − | ≤ N δ,N s Var (Vb ?) log(c /δ) (P Pb)V ? c Pb |S||A| + ∆00 1 , | − | ≤ N δ,N where 1 log(c /δ)3/4 c log(c /δ) ∆00 := c |S||A| + |S||A| , δ,N 1 γ N (1 γ)2 N − − and where c is an absolute constant.

28 Proof:(of Theorem 2.6) The proof consists of bounding the terms in Lemma 2.5. We have:

? ? π? −1 ? Q Qb γ (I γPb ) (P Pb)V ∞ − ≤ k − − k r 3/4 ? q   log(c /δ) π −1 π? cγ log(c /δ) cγ |S||A| (I γPb ) Var (Vb ) ∞ + |S||A| ≤ N k − Pb k (1 γ)2 N − cγ log(c /δ) + |S||A| (1 γ)3 N − s r 3/4 2 log(c /δ) cγ log(c /δ) cγ log(c /δ) γ |S||A| + |S||A| + |S||A| ≤ (1 γ)3 N (1 γ)2 N (1 γ)3 N − − − s r 1 log(c /δ) cγ log(c /δ) 3γ c |S||A| + 2 |S||A| , ≤ (1 γ)3 N (1 γ)3 N − − where the first step uses Corollary 2.14; the second uses Lemma 2.12; and the last step uses that 2ab a2 + b2 (and choosing a, b appropriately). The proof of the lower bound is analogous. Taking a different absolute≤ constant completes the proof.

2.5 Scalings and Effective Horizon Dependencies

It will be helpful to more intuitively understand why 1/(1 γ)3 is the effective horizon dependency one might hope to expect, from a dimensional analysis viewpoint. Due to that− Q? is a quantity that is as large as 1/(1 γ), to account for this scaling, it is natural to look at obtaining relative accuracy. − In particular, if c log(c /δ) N |S||A| |S||A| , ≥ 1 γ 2 − then with probability greater than 1 δ, then − ? π?  ? ?  Q Qb ∞ , and Q Qb ∞ . k − k ≤ 1 γ k − k ≤ 1 γ − − (provided that  √1 γ using Theorem 2.6). In other words, if we had normalized the value functions 1, then for additive accuracy≤ (on our− normalized value functions) our sample size would scale linearly with the effective horizon.

2.6 Bibliographic Remarks and Further Readings

The notion of a generative model was first introduced in [Kearns and Singh, 1999], which made the argument that, up to horizon factors and logarithmic factors, both model based methods and model free methods are comparable. [Kakade, 2003] gave an improved version of this rate (analogous to the crude bounds seen here). The first claim in Theorem 2.6 is due to [Azar et al., 2013], and the proof in this section largely follows this work. ? Improvements are possible with regards to bounding the quality of πb ; here, Theorem 2.6 shows that the model based ? approach is near optimal even for policy itself; showing that the quality of πb does suffer any amplification factor of 1/(1 γ). [Sidford et al., 2018] provides the first proof of this improvement using a variance reduction algorithm with value− iteration. The second claim in Theorem 2.6 is due to [Agarwal et al., 2020c], which shows that the naive model based approach is sufficient. The lower bound in Theorem 2.8 is due to [Azar et al., 2013].

1Rescaling the value functions by multiplying by (1−γ), i.e. Qπ ← (1−γ)Qπ, would keep the values bounded between 0 and 1. Throughout,

29 We also remark that we may hope for the sub-optimality bounds (on the value of the argmax policy) to hold up to for “large” , i.e. up to  1/(1 γ) (see the second claim in Theorem 2.6). Here, the work in [Li et al., 2020] shows this limit is achievable,≤ albeit− with a slightly different algorithm where they introduce perturbations. It is currently an open question if the naive model based approach also achieves the non-asymptotic statistical limit. This chapter also provided results, without proof, on the optimal sample complexity in the finite horizon setting (see Section 2.3.2). The proof of this claim would also follow from the line of reasoning in [Azar et al., 2013], with the added simplification that the sub-optimality analysis is simpler in the finite-horizon setting with time-dependent transition matrices (e.g. see [Yin et al., 2021]).

this book it is helpful to understand sample size with regards to normalized quantities.

30 Chapter 3

Linear Bellman Completeness

Up to now we have focussed on “tabular” MDPs, where the sample complexity (and the computational complexity) scaled polynomially in the size of the state and action spaces. Ultimately, we seek to obtain methods which are applicable to cases where number of states and actions is large (or, possibly, countably or uncountably infinite). In this chapter, we will consider one of the simplest conditions, where such a result is possible. In particular, we will consider a setting where we have a set of features which satisfy the linear Bellman completion condition, and we show that, with access to a generative model, there is a simple algorithm that learns a near optimal policy with polynomial sample complexity in the dimension of these features. The linear Bellman completeness concept and first analysis was due to [Munos, 2005]. This chapter studies this concept in the setting of finite horizon MDPs, where we have access to a generative model.

3.1 The Linear Bellman Completeness Condition

We will work with given a feature mapping φ : Rd, where we assume the following property is true: given any linear function f(s, a) := θ>φ(s, a), we haveS × that A 7→ the Bellman operator applied to f(s, a) also returns a linear function with respect to φ. Precisely, we have:

Definition 3.1 (Linear Bellman Completeness). We say the features φ satisfy the linear Bellman completeness property if for all θ Rd and (s, a, h) [H], there exists w Rd such that: ∈ ∈ S × A × ∈ > > 0 0 0 w φ(s, a) = r(s, a) + Es ∼Ph(s,a) max θ φ(s , a ). a0

d d As w depends on θ, we use the notation h : R R to represent such a w, i.e., w := h(θ) in the above equation. T 7→ T Note that the above implies that r(s, a) is in the span of φ (to see this, take θ = 0). Furthermore, it also implies that ? ? ? ? > Qh(s, a) is linear in φ, i.e., there exists θh such that Qh(s, a) = (θh) φ(s, a). We can easily see that tabular MDP is captured by the linear Bellman completeness with φ(s, a) R|S||A| being a one-hot encoding vector with zeros everywhere except one at the entry corresponding to state-action∈ (s, a). In Chapters 8 and 9, we will provide precise examples of models which satisfy the linear Bellman completeness condition. We focus on the generative model setting, i.e., where we can input any (s, a) pair to obtain a next state sample 0 π ? s P (s, a) and reward r(s, a). Our goal here is to design an algorithm that finds a policy πb such that V b V , with∼ number of samples scaling polynomially with respect to all relevant parameters d, H, 1/. ≥ −

31 Algorithm 1 Least Squares Value Iteration

1: Input: 0,..., H−1 D D 2: Set VH (s) = 0 for all s ∈ S 3: for h = H 1 0 do − → 4: Solve least squares

X > 0 2 θh = arg min θ φ(s, a) r Vh+1(s ) (0.1) θ 0 − − s,a,r,s ∈Dh

> 5: Set Vh(s) = maxa∈A θ φ(s, a), s h ∀ 6: end for H−1 > 7: Return: πh(s) where πh(s) := arg maxa θ φ(s, a), s, h. {b }h=0 b h ∀

3.2 The LSVI Algorithm

We first present the least square value iteration (LSVI) algorithm (Alg. 1) here. The algorithm takes H many datasets 0 0,..., H−1 as inputs, where each h = s, a, r, s . D D D { } ? > For each time step, LSVI estimate Qh(s, a) via θh φ(s, a) with θh computed from a least square problem. To make sure the algorithm succeeds in terms of finding a near optimal policy, we need to design the datasets P > 0,..., H−1. Intuitively, we may want to make sure that in each h, we have φ(s, a)φ(s, a) being D D D s,a∈Dh full rank, so that least square has closed-form solution, and θh has a good generalization bound. In the next section, we will leverage the generative model setting and the D-optimal design to construct such datasets.

3.3 LSVI with D-Optimal Design

In this section, we start by introducing the D-optimal design and its properties. We then use the D-optimal design to construct a dataset and give a generalization bound for the solution to the least squares problem on the constructed dataset. Finally, we present the sample complexity bound for LSVI and its analysis. We will use the notation x 2 = x>Mx for a matrix M and vector x of appropriate dimensions. k kM

3.3.1 D-Optimal Design

We now specify a sampling distribution which, roughly speaking, ensures good coverage (in a spectral sense) over our feature set.

Theorem 3.2. Suppose Rd is a compact set (and full dimensional). There exists a distribution ρ on such that: X ⊂ X ρ is supported on at most d(d + 1)/2 points (all of which are in ). • X Define • > Σ = Ex∼ρ[xx .] We have that for all x , ∈ X 2 x −1 d. k kΣ ≤ The distribution ρ is referred to as the D-optimal design.

32 With these two properties, one can show that for any x , it can be written as a linear combination of the points on P ∈ X 2 the support of ρ, i.e., x = αiρ(xi)xi, where α d. xi∈support(ρ) k k2 ≤ The following specifies an explicit construction: ρ is the following maximizer:

> ρ argmaxν∈∆(X ) ln det Ex∼ν xx . (0.2) ∈ (and there exists an optimizer which is supported on at most d(d + 1)/2 points). The design ρ has a geometric 2 interpretation: the centered ellipsoid = v : v Σ−1 d is the unique minimum volume centered ellipsoid containing We do not provide the proofE here{ (Seek Sectionk ≤ 3.6).} X In our setting, we will utilize the D-optimal design on the set Φ := φ(s, a): s, a . { ∈ S × A}

Change of basis interpretation with D-optimal design. Consider the coordinate transformation:

−1/2 xe := Σ x. Here we have that: > Σe := Ex∼ρ[xexe ] = I. Furthermore, we still have that x 2 d. In this sense, we can interpret the D-optimal design as providing a e Σe−1 well-conditioned sampling distributionk k for≤ the set . X

3.3.2 Performance Guarantees

Now we explain how we can construct a dataset h for h = 0,...,H 1 using the D-optimal design ρ and the generative model: D −

At h, for each s, a support(ρ), we sample ns,a := ρ(s, a)N many next states independently from Ph( s, a); • ∈ 0 d e ·| combine all (s, a, r(s, a), s ) and denote the whole dataset as h D Repeat for all h [0,...,H 1] to construct 0,..., H−1. • ∈ − D D

P > The key property of the dataset h is that the empirical covariance matrix φ(s, a)φ(s, a) is full rank. D s,a∈Dh Denote Λ = P φ(s, a)φ(s, a)>, we can verify that h s,a∈Dh

X > Λh = φ(s, a)φ(s, a) < NΣ (0.3) s,a∈D where the inequality comes from the fact that in h, for each s, a support(ρ), we have ρ(s, a)N many copies of D ∈ d> −1 e > it. The D-optimal design ensures that Λ covers all φ(s, a) in the sense that maxs,a φ(s, a) Λ φ(s, a) d due to h ≤ the property of the D-optimal design. This is the key property that ensures the solution of the least squares on h is accurate globally. D Now we state the main sample complexity bound theorem for LSVI. Theorem 3.3 (Sample Complexity of LSVI). Suppose our features satisfy the linear Bellman completion property. l 6 2 m Fix δ (0, 1) and  (0, 1). Set parameter N := 64H d ln(1/δ) . With probability at least 1 δ, Algorithm 1 ∈ ∈ 2 − outputs πb such that: ? πb Es∼µV0 (s) Es∼µV0 (s) , − ≤ 6 2  2 64H d ln(1/δ)  with total number of samples H d + 2 .

33 3.3.3 Analysis

To prove the theorem, we first show that if we had Qbh h such that the Bellman residual Qbh hQbh+1 ∞ is small, { } k − T k then the performance of the greedy policies with respect to Qbh is already near optimal. Note that this result is general and has nothing to do with the LSVI algorithm.

Lemma 3.4. Assume that for all h, we have Qbh h+1Qbh+1 . Then we have: − T ∞ ≤

? 1. Accuracy of Qbh: for all h: Qbh Qh (H h), − ∞ ≤ − π ? 2 2. Policy performance: for πh(s) := argmax Qbh(s, a), we have V b V 2H . b a − ≤ Proof: The proof of this lemma is elementary, which is analog to what we have proved in Value Iteration in the discounted setting. For completeness, we provide a proof here.

Starting from QH (s, a) = 0, s, a, we have H−1QH = r by definition. The condition implies that QbH−1 r ∞ ∀ ? T k − k ≤ , which means that QbH−1 Q ∞ . This proves the base case. k − H−1k ≤ ? Our inductive hypothesis is that Qbh+1 Q ∞ (H h 1). Note that: k − h+1k ≤ − − ? ? Qbh(s, a) Q (s, a) Qbh(s, a) hQbh+1(s, a) + hQbh+1(s, a) Q (s, a) | − h | ≤ | − T | |T − h | ?  + hQbh+1(s, a) hQ (s, a)  + (H h 1) = (H h), ≤ |T − T h+1 | ≤ − − − which concludes the proof for the first claim. For the second claim, Starting from time step H 1, for any s, we have: − π ? π ? ? V b (s) V (s) = Qb (s, πH−1(s)) Q (s, π (s)) H−1 − H−1 H−1 b − H−1 πb ? ? ? ? = QH−1(s, πbH−1(s)) QH−1(s, πbH−1(s)) + QH−1(s, πbH−1(s)) QH−1(s, π (s)) ? − ? ? − = Q (s, πH−1(s)) Q (s, π (s)) H−1 b − H−1 ? ? ? ? Q (s, πH−1(s)) QbH−1(s, πH−1(s)) + QbH−1(s, π (s)) Q (s, π (s)) ≥ H−1 b − b − H−1 2, ≥ πb ? where the third equality uses the fact that QH−1(s, a) = QH−1(s, a) = r(s, a), the first inequality uses the definition of πH−1 and QbH−1(s, πH−1(s)) QbH−1(s, a), a. This concludes the base case. For time step h + 1, our inductive b b ≥ ∀ hypothesis is defined as V πb (s) V ? (s) 2(H h 1)H. For time step h, we have: h+1 − h+1 ≥ − − − π ? π ? ? V b(s) V (s) = Qb(s, πh(s)) Q (s, π (s)) h − h h b − h π ? ? ? ? = Qb(s, πh(s)) Q (s, πh(s)) + Q (s, πh(s)) Q (s, π (s)) h b − h b h b − h h π 0 ? 0 i ? ? ? 0 b = Es ∼Ph(s,πh(s)) Vh+1(s ) Vh+1(s ) + Qh(s, πh(s)) Qh(s, π (s)) b − b − ? ? ? ? 2(H h 1) + Q (s, πh(s)) Qbh(s, πh(s)) + Qbh(s, π (s)) Q (s, π (s)) ≥ − − − h b − b h − h 2(H h 1)H 2(H h) 2(H h)H. ≥ − − − − − ≥ − − This concludes that for any s, at time step h = 0, we have V πb(s) V ?(s) 2H2. 0 − 0 ≥ − To conclude the proof, what we left to show is that the Qbh that is returned by LSVI satisfies the condition in the above ˆ> lemma. Note that Qbh(s, a) = θh φ(s, a), and hQbh+1 is indeed the Bayes optimal solution of the least squares in T ˆ > Eq. 0.1, and via the Bellman completion assumption, we have hQbh+1(s, a) = h(θh+1) φ(s, a), s, a. With the T T ∀ constructed h based on the D-optimal design, Linear regression immediately gives us a generalization bound on D Qbh hQbh+1 ∞. The following theorem formalize the above statement. k − T k 34 l 2 2 m Lemma 3.5. Fix δ (0, 1) and  (0, 1), set N = 16H d ln(H/δ) . With probability at least 1 δ, for all h, we ∈ ∈ 2 − have:

h : Qbh hQbh+1 . ∀ − T ∞ ≤

Proof: The proof uses a standard result for ordinary least squares (see Theorem A.10 in the Appendix). Consider a ˆ> 0 0 time step h. In time step h, we perform linear regression from φ(s, a) to r(s, a) + maxa0 θh+1φ(s , a ). Note that the Bayes optimal here is

h > 0 0 i > 0 ˆ ˆ Es ∼Ph(s,a) r(s, a) + max θh+1φ(s , a ) = h(θh+1) φ(s, a). a0 T

 0 0  Also we can show that the noise s,a := r(s, a) + maxa0 Qbh+1(s , a ) hQbh(s, a) is bounded by 2H, and also − T h 0 0 i 0 0 note that the noises are independent, and has mean zero: Es ∼Ph(s,a) r(s, a) + maxa Qbh+1(s , a ) hQbh(s, a) = − T 0. Using Theorem A.10, we have that with probability at least 1 δ, − 2   ˆ ˆ 2 p θh h(θh+1) 4H d + d ln(1/δ) + 2 ln(1/δ) . − T Λh ≤ Now for any s, a, we must have: 2 2 2 ˆ ˆ > ˆ ˆ 2 ˆ ˆ d (θ (θ )) φ(s, a) θ (θ ) φ(s, a) −1 θ (θ ) h h h+1 h h h+1 Λ h h h+1 − T ≤ − T Λh k k h ≤ − T Λh N d2 + d1.5pln(1/δ) + 2d ln(1/δ) 16H2d2 ln(1/δ) 4H2 , ≤ N ≤ N which implies that:

p ˆ ˆ > 4Hd ln(1/δ) Qbh hQbh+1 = max (θh h(θh+1)) φ(s, a) . − T ∞ s,a − T ≤ √N

16H2d2ln(1/δ) Set the right hand side to be , we have N = 2 . Finally, apply union bound over all h = 0,...,H 1, we conclude the proof. − Now we are ready to conclude the proof for the main theorem.

l 64H6d2 ln(H/δ) m Proof:[Proof of Theorem 3.3] Using Lemma 3.5, with N = 2 , we have that Qbh hQbh+1 ∞  k − T k ≤ /(2H2) for all h, with probability at least 1 δ. Now apply Lemma 3.4, we immediately have that V πˆ V ? . − | − | ≤ Regarding sample complexity, we have:

X X  64H6d2ln(H/δ) H ρ(s, a)N H (1 + ρ(s, a)N) H d2 + , d e ≤ ≤ 2 s,a s,a∈support(ρ) where the second inequality use the support size upper bound of the D-optimal design ρ (see Lemma 3.2). This completes the proof.

3.4 How Strong is Bellman Completion as a Modeling?

To be added.

35 3.5 Offline Reinforcement Learning

One notable observation about LSVI is that the algorithm is non-adaptive, in the sense that it works using a fixed set of observed transitions and rewards. Hence, it is applicable to the offline setting discussed in Chapter 1. We now discuss two offline objectives: that of learning a near optimal policy and that of policy evaluation (i.e. evalu- ating the quality of a given policy).

0 N We consider the setting where have a H datasets of the form Dh = (si, ai, si, r(si, ai)) i=1, where we assume 0 { } for each i and h, that we have independent samples si Ph( si, ai). In other words, Dh is a dataset of observed transitions corresponding to stage h, where each next state∼ has·| been sampled independently. Note that here we have not made explicit distributional assumption about how the (si, ai)’s have been generated. The results we present can also be extended to where we have a fixed data generation distribution over these quadruples. It is straight forward to extend the LSVI guarantees to these setting provided our dataset has coverage in the following sense: Assumption 3.6. (Coverage) Suppose that for each h [H], we have that: ∈ 1 X > 1 φ(si, ai)φ(si, ai) Σ N  κ (si,ai)∈Dh where Σ is the D-optimal design covariance (see Equation 0.2).

3.5.1 Offline Learning

A minor modification of the proof in Theorem 3.3 leads to the following guarantee: Theorem 3.7 (Sample Complexity of LSVI). Suppose that Assumption 3.6 holds and that our features satisfy the l 6 2 m linear Bellman completion property. Fix δ (0, 1) and  (0, 1). Set parameter N := cκH d ln(1/δ) , where c is ∈ ∈ 2 an appropriately chosen absolute constant. With probability at least 1 δ, Algorithm 1 outputs π such that: − b ? πb Es∼µ[V0 (s)] Es∼µ[V0 (s)] . − ≤

3.5.2 Offline Policy Evaluation

Here we are interested in question of evaluating some given policy π using the offline data. For this, we will make a completeness assumption with respect to π as follows: Definition 3.8 (Linear Completeness for π). We say the features φ satisfy the policy completeness property for π if for all θ Rd and (s, a, h) [H], there exists w Rd such that: ∈ ∈ S × A × ∈ > > 0 0 0 w φ(s, a) = r(s, a) + Es ∼Ph(s,a)θ φ(s , π(s )).

For this case, Algorithm 2, Least Squares Policy Evaluation (LSPE), is a modified version of LSVI for the purposes of estimation; note the algorithm no longer estimates Vh(s) using the greedy policy. Again, a nearly identical argument to the proof in Theorem 3.3 leads to the following guarantee. Here, we set: X Vb(µ) = θ0 φ(si, π(si)) · si∈D0 where θ0 is the parameter returned by LSVI.

36 Algorithm 2 Least Squares Policy Evaluation

1: Input: π, 0,..., H−1 D D 2: Set VH (s) = 0 for all s ∈ S 3: for h = H 1 0 do − → 4: Solve least squares

X > 0 2 θbh = arg min θ φ(s, a) r Vh+1(s ) (0.4) θ 0 − − s,a,r,s ∈Dh

> 5: Set Vh(s) = θb φ(s, π(s)), s h ∀ 6: end for H−1 7: Return: θbh . { }h=0

Theorem 3.9 (Sample Complexity of LSPE). Suppose that Assumption 3.6 holds and that our features satisfy the lin- l 4 2 m ear policy completion property (with respect to π). Fix δ (0, 1) and  (0, 1). Set parameter N := cκH d ln(1/δ) , ∈ ∈ 2 where c is an appropriately chosen absolute constant. With probability at least 1 δ, Algorithm 2 outputs θb0 such that for all s, − V π(s) θb>φ(s, π(s)) . | 0 − 0 | ≤ Note that the above theorem has a sample complexity improvement by a factor of H. This is due to that the analysis is improvable, as we only care about value accuracy (the first claim in Lemma 3.4, rather than the second, is what is relevant here). We should note that the sample size bounds presented in this chapter have not been optimized with regards to their H dependencies.

3.6 Bibliographic Remarks and Further Readings

The idea of Bellman completion under general function class was introduced in [Munos, 2005] under the setting of batch RL. For the episodic online learning setting, Zanette et al. [2020] provided a statistically efficient algorithm under the linear Bellman completion condition, and [Jin et al., 2021] proposes a statistically efficient algorithms under the Bellman completion condition with general function approximation. We refer readers to to [Lattimore and Szepesvari,´ 2020] for a proof of the D-optimal design; the idea directly follows from John’s theorem (e.g. see [Ball, 1997, Todd, 2016]).

37 38 Chapter 4

Fitted Dynamic Programming Methods

Let us again consider the question of learning in large MDPs, when the underlying MDPs is unknown. In the previous chapter, we relied on the linear Bellman completeness assumption to provide strong guarantees with only polynomial dependence on the dimension of the feature space and the horizon H. This chapter considers the approach of using function approximation methods with iterative dynamic programming approaches. In particular, we now consider approaches which rely on (average-case) supervised learning methods (namely regres- sion), where we use regression to approximate the target functions in both the value iteration and policy iteration algorithms. We refer to these algorithms as fitted Q-iteration (FQI) and fitted policy iteration (FPI). The FQI algo- rithm can be implemented in an offline RL sampling model while the FPI algorithm requires the sampling access to an episodic model (see Chapter 1 for review of these sampling models). This chapter focuses on obtaining of average case function approximation error bounds, provided we have a somewhat stringent condition on how the underlying MDP behaves, quantified by the concentrability coefficient. This notion was introduced in [Munos, 2003, 2005]. While the notion is somewhat stringent, we will see that it is not avoidable without further assumptions. The next chapter more explicitly considers lower bounds, while Chapters 13 and 14 seek to relax the concentrability notion.

4.1 Fitted Q-Iteration (FQI) and Offline RL

We consider infinite horizon discounted MDP = , , γ, P, r, µ where µ is the initial state distribution. We M {S A } assume reward is bounded, i.e., sup r(s, a) [0, 1]. For notation simplicity, we denote Vmax := 1/(1 γ). s,a ∈ − Given any f : R, we denote the Bellman operator f : R as follows. For all s, a , S × A 7→ T S × A 7→ ∈ S × A 0 0 f(s, a) := r(s, a) + γEs0∼P (·|s,a) max f(s , a ). T a0∈A

0 n We assume that we have a distribution ν ∆( ). We collect a dataset := (si, ai, ri, si) i=1 where si, ai 0 ∈ S × A D { } ∼ ν, ri = r(si, ai), si P ( si, ai). Given , our goal is to output a near optimal policy for the MDP, that is we would like our algorithm∼ to·| produce a policyDπˆ such that, with probability at least 1 δ, V (ˆπ) V ? , for some (, δ) pair. As usual, the number of samples n will depend on the accuracy parameters− (, δ) and≥ we− would like n to scale favorably with these. Given any distribution ν , and any function f : R, we write 2 2 ∈ S × A S × A 7→ f 2,ν := Es,a∼ν f (s, a) k k Denote a function class = f : [0,Vmax] . F { S × A 7→ } 39 We require the data distribution ν is exploratory enough.

Assumption 4.1 (Concentrability). There exists a constant C such that for any policy π (including non-stationary policies), we have:

dπ(s, a) π, s, a : C. ∀ ν(s, a) ≤

Note that concentrability does not require that the state space is finite, but it does place some constraints on the system dynamics. Note that the above assumption requires that ν to cover all possible policies’s state-action distribution, even including non-stationary policies. In additional to the above two assumptions, we need an assumption on the representational condition of class . F Assumption 4.2 (Inherent Bellman Error). We assume the following error bound holds:

0 2 approx,ν := max min f f 2,ν . f∈F f 0∈F k − T k

We refer to approx,ν as the inherent Bellman error with respect to the distribution ν.

4.1.1 The FQI Algorithm

Fitted Q Iteration (FQI) simply performs the following iteration. Start with some f0 , FQI iterates: ∈ F n  2 X 0 FQI: ft argminf∈F f(si, ai) ri γ max ft−1(si, ai) . (0.1) ∈ − − a0∈A i=1

k After k many iterations, we output a policy π (s) := argmax fk(s, a), s. a ∀ To get an intuition why this approach can work, let us assume the inherent Bellman error approx = 0 which is saying that for any f , we have f (i.e., Bellman completion). Note that the Bayes optimal solution is ft−1. ∈ F T ∈ F T Due to the Bellman completion assumption, the Bayes optimal solution ft−1 . Thus, we should expect that T ∈ F ft is close to the Bayes optimal ft−1 under the distribution ν, i.e., with a uniform convergence argument, for the generalization bound, we should expectT that:

2 p Es,a∼ν (ft(s, a) ft−1(s, a)) 1/n. − T ≈

Thus in high level, ft ft−1 as our data distribution ν is exploratory, and we know that based on value iteration, ≈ T ? ? ? fk−1 is a better approximation of Q than fk, i.e., ft−1 Q ∞ γ ft−1 Q ∞, we can expect FQI to convergeT to the optimal solution when n , t kT. We formalize− k the≤ abovek intuition− below.k → ∞ → ∞

4.1.2 Performance Guarantees of FQI

We first state the performance guarantee of FQI.

Theorem 4.3 (FQI guarantee). Fix K N+. Fitted Q Iteration guarantees that with probability 1 δ, ∈ − r 2 2 ! K ? πK 1 576CVmax ln( K/δ) p 2γ V V |F| + 8Capprox,ν + − ≤ (1 γ)3 n (1 γ)2 − − 40 2 We start from the following lemma which shows that if for all t, ft+1 ft 2,ν ε, then the greedy policy with ? k − T k ≤ respect to ft is approaching to π .

Lemma 4.4. Assume that for all t, we have ft+1 ft 2,ν ε, then for any k, we have: k − T k ≤ k k √Cε γ V V π V ? + max . − ≤ (1 γ)2 1 γ − − Proof: We start from the Performance Difference Lemma:

? πk ? (1 γ)(V V ) = Es∼dπk [ A (s, πk(s))] − − −? ? ? = Es∼dπk [Q (s, π (s)) Q (s, πk(s))] ? ? − ? ? Es∼dπk [Q (s, π (s)) fk(s, π (s)) + fk(s, πk(s)) Q (s, πk(s))] ≤ ? − ? − π ? π Q fk 1,d k ◦π + Q fk 1,d k ◦πk ≤ k ? − k k ? − k Q fk 2,dπk ◦π? + Q fk 2,dπk ◦π , ≤ k − k k − k k where the first inequality comes from the fact that πk is a greedy policy of fk, i.e., fk(s, πk(s)) fk(s, a) for any other a including π?(s). Now we bound each term on the RHS of the above inequality. We do≥ this by consider a state-action distribution β that is induced by some policy. We have:

? ? Q fk 2,β Q fk−1 2,β + fk fk−1 2,β k − k ≤ ks − T k k − T k  2 ? 0 0 γ Es,a∼β Es0∼P (·|s,a) max Q (s , a) max fk−1(s , a) + fk fk−1 2,β ≤ a − a k − T k r ? 0 0 2 γ Es,a∼β,s0∼P (·|s,a) max (Q (s , a) fk−1(s , a)) + √C fk fk−1 2,ν , ≤ a − k − T k

2 2 2 2 where in the last inequality, we use the fact that (E[x]) E[x ], (maxx f(x) maxx g(x)) maxx(f(x) g(x)) for any two functions f and g, and assumption 4.1. ≤ − ≤ −

0 0 0 P 0 0 ? 0 0 2 Denote β (s , a ) = s,a β(s, a)P (s s, a)1 a = argmaxa (Q (s , a) fk−1(s , a)) , the above inequality be- comes: | { − }

? ? Q fk 2,β γ Q fk−1 2,β0 + √C fk fk−1 2,ν . k − k ≤ k − k k − T k ? We can recursively repeat the same process for Q fk−1 2,ν0 till step k = 0: k − k k−1 X Q? f √C γt f f + γk Q? f , k 2,β k−t k−t−1 2,ν 0 2,νe k − k ≤ t=0 k − T k k − k where νe is some valid state-action distribution. Note that for the first term on the RHS of the above inequality, we can bound it as follow:

k−1 k−1 X t X k 1 √C γ fk−t fk−t−1 2,µ √C γ ε √Cε . k − T k ≤ ≤ 1 γ t=0 t=0 − For the second term, we have:

? k γ Q f0 2,ν γ Vmax. k − k e ≤ Thus, we have:

? √Cε k Q fk 2,β = + γ Vmax, k − k 1 γ − 41 π ? π for all β, including β = d k π , and β = d k πk. This concludes the proof. ◦ ◦ What left is to show that least squares control the error ε. We will use the generalization bound for least squares shown in Lemma A.11.

Lemma 4.5 (Bellman error). With probability at least 1 δ, for all t = 0,...,K, we have: − 2 2 2 576Vmax ln( K/δ) ft+1 ft |F| + 8approx,ν . k − T k2,ν ≤ n

Proof:

0 0 Let us fix a function g , and consider the regression problem on dataset φ(s, a), r(s, a) + γ maxa0 g(s , a ) . ˆ ∈P Fn 0 0 { } Denote fg = argmin (f(si, ai) ri γ maxa0 g(s , a )), as the least square solution. f∈F i=1 − − 0 0 Note that in this regression problem, we have ri + γ maxa0 g(s , a ) 1 + γVmax 2Vmax, and for our Bayes | | ≤ 0 0 ≤ optimal solution, we have g(s, a) = r(s, a) + γEs0∼P (·|s,a) maxa0 g(s , a ) 1 + γVmax 2Vmax. Also note |T | | | ≤ 2 ≤ that our inherent Bellman error condition implies that minf∈F Es,a∼ν (f(s, a) g(s, a)) approx,ν . Thus, we can apply Lemma A.11 directly here. With probability at least 1 δ, we get: − T ≤ − 2 ˆ 2 576Vmax ln( /δ) fg g |F| + 8approx,ν . k − T k2,ν ≤ n Note that the above inequality holds for the fixed g. We apply a union bound over all possible g , we get that with probability at least 1 δ: ∈ F − 2 2 ˆ 2 576Vmax ln( /δ) g : fg g |F| + 8approx,ν . (0.2) ∀ ∈ F k − T k2,ν ≤ n

ˆ Consider iteration t, set g = ft+1 in Eq. 0.2, and by the definition of ft, we have fg = ft in this case, which means that ˆ above inequality in Eq. 0.2 holds for the pair (fg, g) = (ft, ft+1). Now apply a union bound over all t = 0,...,K, we conclude the proof. Now we are ready to prove the main theorem. Proof:[Proof of Theorem 4.3] From Lemma 4.5, we see that with probability at least 1 δ, we have for all t K: − ≤

r 2 2 2 144Vmax ln( K/δ) p ft+1 ft |F| + 8approx,ν . k − T k2,ν ≤ n

q 2 2 144Vmax ln(|F| K/δ) p Set ε in Lemma 4.4 to be n + 8approx,ν , we conclude the proof.

4.2 Fitted Policy-Iteration (FPI) to be added..

42 4.3 Failure Cases Without Assumption 4.1

4.4 FQI for Policy Evaluation

4.5 Bibliographic Remarks and Further Readings

The notion of concentrability was developed in [Munos, 2003, 2005] in order to permitting sharper bounds in terms of average case function approximation error, provided that the concentrability coefficient is bounded. These methods also permit sample based fitting methods, with sample size and error bounds, provided there is a data collection policy that induces a bounded concentrability coefficient [Munos, 2005, Szepesvari´ and Munos, 2005, Antos et al., 2008, Lazaric et al., 2016]. Chen and Jiang [2019] provide a more detailed discussion on this quantity.

43 44 Chapter 5

Statistical Limits of Generalization

In reinforcement learning, we seek to have learnability results which are applicable to cases where number of states is large (or, possibly, countably or uncountably infinite). This is a question of generalization, which, more broadly, is one of the central challenges in . The previous two chapters largely focussed on sufficient conditions under which we can obtain sample complexity results which do not explicitly depend on the size of the state (or action) space. This chapter focusses on what are necessary conditions for generalization. Here, we can frame our questions by examining the extent to which generalization in RL is similar to (or different from) that in supervised learning. Two most basic settings in supervised learning are: (i) agnostic learning (i.e. finding the best classifier or hypothesis in some class) and (ii) learning with linear models (i.e. learning the best linear regressor or the best linear classifier). This chapter will focus on lower bounds with regards to the analogues of these two questions for reinforcement learning:

(Agnostic learning) Given some hypothesis class (of policies, value functions, or models), what is the sample • complexity of finding (nearly) the best hypothesis in this class?

(Linearly realizable values or policies) Suppose we are given some d-dimensional feature mapping where we • are guaranteed that either the optimal value function is linear in these given features or that the optimal policy has a linear parameterization. Are we able to obtain sample complexity guarantees that are polynomial in d, with little to no explicit dependence on the size of the state or action spaces? We will consider this question in both the offline setting (for the purposes of policy evaluation, as in Chapter 3) and for in online setting where our goal is to learn a near optimal optimal policy.

Observe that supervised learning can be viewed as horizon one, H = 1, RL problem (where the learner only receives feedback for the “label”, i.e. the action, chosen). We can view the second question above as the analogue of linear regression or classification with halfspaces. In supervised learning, both of these settings have postive answers, and they are fundamental in our understanding of generalization. Perhaps surprisingly, we will see negative answers to these questions in the RL setting. The significance of this provides insights as to why our study of generalization in reinforcement learning is substantially more subtle than in supervised learning. Importantly, the insights we develop here will also help us to motivate the more refined assumptions and settings that we consider in subsequent chapters (see Section 5.3 for discussion). This chapter will work with finite horizon MDPs, where we consider both the episodic setting and the generative model setting. With regards to the first question on agnostic learning, this chapter follows the ideas first introduced in [Kearns et al., 2000]. With regards to the second question on linear realizability, this chapter follows the results in [Du et al., 2019, Wang et al., 2020, Weisz et al., 2021, Wang et al., 2021].

45 5.1 Agnostic Learning

Suppose we have a hypothesis class (either finite or infinite), where for each f we have an associated policy H ∈ H πf : , which, for simplicity, we assume is deterministic. Here, we could have that: S → A itself is a class of policies. •H is a set of state-action values, where for f , we associate it with the greedy policy πf (s, h) = argmaxa f(s, h), •Hwhere s and h [H]. ∈ H ∈ S ∈ could be a class of models (i.e. each f is an MDP itself). Here, for each f , we can let πf be the •Hoptimal policy in the MDP f. ∈ H ∈ H

We let Π denote the induced set of policies from our hypothesis class , i.e. Π = πf f . H { | ∈ H} The goal of agnostic learning can be formulated by the following optimization problem:

π max Es0∼µV (s0), (0.1) π∈Π where we are interested in the number of samples required to approximately solve this optimization problem. We will work both in the episodic setting and the generative model setting, and, for simplicity, we will restrict ourselves to finite hypothesis classes.

Binary classification as an H = 1 RL problem: Observe that the problem of binary classification can be viewed as learning in an MDP with a horizon of one. In particular, take H = 1; take = 2; let the distribution over starting |A| states s0 µ correspond to the input distribution; and take the reward function as r(s, a) = 1(label(s) = a). In other words, we∼ equate our action with the prediction of the binary class membership, and the reward function is determined by if our prediction is correct or not.

5.1.1 Review: Binary Classification

One of the most important concepts for learning binary classifiers is that it is possible to generalize even when the state space is infinite. Here note that the domain of our classifiers, often denoted by , is analogous to the state space . We now briefly review some basics of supervised learning before we turn to theX question of generalization in reinforcementS learning.

N Consider the problem of binary classification with N labeled examples of the form (xi, yi) , with xi and i=1 ∈ X yi 0, 1 . Suppose we have a (finite or infinte) set of binary classifiers where each h is a mapping of the form∈ {h : } 0, 1 . Let 1(h(x) = y) be an indicatorH which takes the value 0 if h(x) =∈y Hand 1 otherwise. We assume thatX our → {samples} are drawn i.i.d.6 according to a fixed joint distribution D over (x, y). Define the empirical error and the true error as:

N 1 X err(h) = 1(h(xi) = yi), err(h) = E(X,Y )∼D1(h(X) = Y ). c N 6 6 i=1 For a given h , Hoeffding’s inequality implies that with probability at least 1 δ: ∈ H − r 1 2 err(h) err(h) log . | − c | ≤ 2N δ This and the union bound give rise to what is often referred to as the “Occam’s razor” bound:

46 Proposition 5.1. (The “Occam’s razor” bound) Suppose is finite. Let bh = arg minh∈H errc(h). With probability at least 1 δ: H r − 2 2 err(bh) min err(h) + log |H|. ≤ h∈H N δ

Hence, provided that c log 2|H| N δ , ≥ 2 then with probability at least 1 δ, we have that: − err(bh) min err(h) + . ≤ h∈H A key observation here is that the our regret — the regret is the left hand side of the above inequality — has no dependence on the size of (i.e. ) which may be infinite and is only logarithmic in the number of hypothesis in our class. X S

5.1.2 Importance Sampling and a Reduction to Supervised Learning

Now let us return to the agnostic learning question in Equation 0.1. We will see that, provided we are willing to be exponential in the horizon H, then agnostic learning is possible. Furthermore, it is a straightforward argument to see that we are not able to do better.

An Occam’s Razor Bound for RL

We now provide a reduction of RL to the supervised learning problem, given only sampling access in the episodic setting. The key issue is how to efficiently reuse data. The idea is that we will simply collect N trajectories by executing a policy which chooses actions uniformly at random; let UnifA denote this policy. The following shows how we can obtain an unbiased estimate of the value of any policy π using this uniform policy UnifA: π Lemma 5.2. (Unbiased estimation of V0 (µ)) Let π be any deterministic policy. We have that:

" H−1 # π H   X V0 (µ) = Eτ∼PrUnif 1 π(s0) = a0, . . . , π(sH−1) = aH−1 r(sh, ah) |A| · A h=0

where PrUnifA specifies the distribution over trajectories τ = (s0, a0, r0, . . . sH−1, aH−1, rH−1) under the policy UnifA.

The proof follows from a standard importance sampling argument (applied to the distribution over the trajectories). Proof: We have that: "H−1 # π X V0 (µ) = Eτ∼Prπ rh h=0 " H−1 # Prπ(τ) X = r Eτ∼PrUnifA h PrUnif (τ) A h=0 " H−1 # H   X = Eτ∼PrUnif 1 π(s0) = a0, . . . , π(sH−1) = aH−1 rh |A| · A h=0

47 where the last step follows due to that the probability ratio is only nonzero when UnifA choose actions identical to that of π. Crudely, the factor of H is due to that the estimated reward of π on a trajectory is nonzero only when π takes |A| H exactly identical actions to those taken by UnifA on the trajectory, which occurs with probability 1/ . |A| Now, given sampling access in the episodic model, we can use UnifA to get any estimate of any other policy π Π. Note that the factor of H in the previous lemma will lead to this approach being a high variance estimator. Suppose∈ |A| n n n n n n n we draw N trajectories under UnifA. Denote the n-th sampled trajectory by (s0 , a0 , r1 , s1 , . . . , sH−1, aH−1, rH−1). We can then use following to estimate the finite horizon reward of any given policy π: N H−1 H X   X Vb π(µ) = |A| 1 π(sn) = an, . . . π(sn ) = an r(sn, an). 0 N 0 0 H−1 H−1 t t n=1 t=0 Proposition 5.3. (An “Occam’s razor bound” for RL) Let δ 0. Suppose Π is a finite and suppose we use the π ≥ π aforementioned estimator, Vb0 (µ), to estimate the value of every π Π. Let πb = arg maxπ∈Π Vb0 (µ). We have that with probability at least 1 δ, ∈ − r 2 2 Π πb π H V0 (µ) max V0 (µ) H log | |. ≥ π∈Π − |A| N δ

Proof: Note that, fixing π, our estimator is sum of i.i.d. random variables, where each independent estimator,   H 1 π(sn) = an, . . . π(sn ) = an PH−1 r(sn, an), is bounded by H H . The remainder of the argu- |A| 0 0 H−1 H−1 t=0 t t |A| ment is identical to that used in Proposition 5.1. Hence, provided that 2 log(2 Π /δ) N H H | | , ≥ |A| 2 then with probability at least 1 δ, we have that: − πb π V0 (s0) max V0 (s0) . ≥ π∈Π − This is the analogue of the Occam’s razor bound for RL. Importantly, the above shows that we can avoid dependence on the size of the state space, though this comes at the price of an exponential dependence on the horizon. As we see in the next section, this dependence is unavoidable (without making further assumptions).

Infinite Policy Classes. In the supervised learning setting, a crucial observation is that even though a hypothesis set of binary classifiers may be infinite, we may still be able to obtain non-trivial generalization bounds. A crucial observationH here is that even though the set may be infinite, the number of possible behaviors of on a finite set of states is not necessarily exhaustive. The Vapnik–ChervonenkisH (VC) dimension of , VC( ), is formal way to characterize this intuition, and, using this concept, we are able to obtain generalization boundsH inH terms of the VC( ). H With regards to infinite hypothesis classes of policies (say for the case where = 2), extending our Occam’s razor bound can be done with precisely the same approach. In particular, when |A|= 2, each π Π can be viewed as Boolean function, and this gives rise to the VC dimension VC(Π) of our policy|A| class. The bound∈ in Proposition 5.3 can be replaced with a VC-dimension bound that is analogous to that of binary classification (see Section 5.4).

Lower Bounds

Clearly, the drawback of this agnostic learning approach is that we would require a number of samples that is expo- nential in the problem horizon. We now see that if we desire a sample complexity that scales with O(log Π ), then an | | 48 exponential dependence on the effective horizon is unavoidable, without making further assumptions. Here, our lower bound permits a (possibly randomized) algorithm to utilize a generative model (which is a more powerful sampling model than the episodic one). Proposition 5.4. (Lower Bound with a Generative Model) Suppose algorithm has access to a generative model. There exists a policy class Π, where Π = H such that if algorithm returnsA any policy π (not necessarily in Π) such that | | |A| A π π V0 (µ) max V0 (µ) 0.5. ≥ π∈Π − with probability greater than 1/2, then must make a number of number calls N to the generative model where: A N c H ≥ |A| (where c is an absolute constant).

Proof: The proof is simple. Consider a -ary balanced tree, with H states and actions, where states corre- spond nodes and actions correspond to edges;|A| actions always move the|A| agent from the|A| root towards a leaf node. We make only one leaf node rewarding, which is unknown to the algorithm. We consider the policy class to be all H policies. The theorem now immediately follows since the algorithm gains no knowledge of the rewarding leaf|A| node unless it queries that node. Note this immediately rules out the possibility that any algorithm which can obtain a log Π dependence without paying a factor of H in the sample size due to that log Π = H log in the example above.| | |A| | | |A|

5.2 Linear Realizability

In supervised learning, two of the most widely studied settings are those of linear regression and binary classification with halfspaces. In both settings, we are able to obtain sample complexity results that are polynomial in the feature dimension. We now consider the analogue of these assumptions for RL, starting with the analogue of linear regression. When the state space is large or infinite, we may hope that linearly realizability assumptions on may permit a more sample efficient approach. We will start with linear realizability on Qπ and consider the offline policy evaluation problem. Then we will consider the problem of learning with only a linearly realizability assumption on Q? (along with access to either a generative model or sampling access in the episodic setting).

5.2.1 Offline Policy Evaluation with Linearly Realizable Values

In Chapter 3, we observed that the LSVI and LSPE algorithm could be used with an offline dataset for the purposes of policy evaluation. Here, we made the linear Bellman completeness assumption on our features. Let us now show that with only a linear realizability assumption, then not only is LSPE sample inefficient but we will also see that, information theoretically, every algorithm is sample inefficient, in a minmax sense. This section is concerned with the offline RL setting. In this setting, the agent does not have direct access to the MDP H−1 and instead is given access to data distributions µh h=0 where for each h [H], µh ∆( h ). The inputs H−1 { } ∈ ∈ S × A 0 of the agent are H datasets Dh h=0 , and for each h [H], Dh consists i.i.d. samples of the form (s, a, r, s ) { } ∈ 0 ∈ h R h+1 tuples, where (s, a) µh, r r(s, a), s P (s, a). S × A × × S ∼ ∼ ∼ We now focus on the offline policy evaluation problem: given a policy π : ∆ ( ) and a feature mapping S → A φ : Rd, the goal is to output an accurate estimate of the value of π (i.e., V π) approximately, using the S × A → H−1 collected datasets Dh , with as few samples as possible. { }h=0 49 We will make the following linear realizability assumption with regards to a feature mapping φ : Rd, which we can think of as either being hand-crafted or coming from a pre-trained neural network thatS transforms × A → a state- action pair to a d-dimensional embedding, and the Q-functions can be predicted by linear functions of the features. In particular, this section will assume the following linear realizability assumption with regards to every policy π.

Assumption 5.5 (Realizable Linear Function Approximation). For every policy π : ∆( ), there exists θπ, . . . θπ S → A 0 H−1 ∈ Rd such that for all (s, a) and h [H], ∈ S × A ∈ π π > Qh(s, a) = (θh ) φ(s, a).

Note that this assumption is substantially stronger than assuming realizability with regards to a single target policy π (say the policy that we wish to evaluate); this assumption imposes realizability for all policies. We will also assume a coverage assumption, analogous to Assumption 3.6. It should be evident that without feature coverage in our dataset, realizability alone is clearly not sufficient for sample-efficient estimation. Note that , we will make the strongest possible assumption, with regards to the conditioning of the feature covariance matrix; in particular, this assumption is equivalent to µ being a D-optimal design

Assumption 5.6 (Coverage). For all (s, a) , assume our feature map is bounded such that φ(s, a) 2 1. ∈ S × A k k ≤ Furthermore, suppose for each h [H], the data distributions µh satisfies the following: ∈ 1 [φ(s, a)φ(s, a)>] = I. E(s,a)∼µh d

Note that the minimum eigenvalue of the above matrix is 1/d, which is the largest possible minimum eigenvalue over all data distributions µ , since σ ( [φ(s, a)φ(s, a)>]) is less than or equal to 1/d (due to that φ(s, a) eh min E(s,a)∼µeh 2 1 for all (s, a) ). Also, it is not difficult to see that this distribution satisfies the D-optimal designk property.k ≤ ∈ S × A

Clearly, for the case where H = 1, the realizability assumption (Assumption 5.5), and coverage assumption (As- π sumption 5.6) imply that the ordinary least squares estimator will accurately estimate θ0 . The following shows these assumptions are not sufficient for offline policy evaluation for long horizon problems.

Theorem 5.7. Suppose Assumption 5.6 holds. Fix an algorithm that takes as input both a policy and a feature mapping. There exists a (deterministic) MDP satisfying Assumption 5.5, such that for any policy π : ∆( ), the algorithm requires Ω((d/2)H ) samples to output the value of π up to constant additive approximationS →error withA probability at least 0.9.

Although we focus on offline policy evaluation, this hardness result also holds for finding near-optimal policies under Assumption 5.5 in the offline RL setting with linear function approximation. Below we give a simple reduction. At the initial state, if the agent chooses action a1, then the agent receives a fixed reward value (say 0.5) and terminates. If the agent chooses action a2, then the agent transits to the hard instance. Therefore, in order to find a policy with suboptimality at most 0.5, the agent must evaluate the value of the optimal policy in the hard instance up to an error of 0.5, and hence the hardness result holds.

Least-Squares Policy Evaluation (LSPE) has exponential variance. For offline policy evaluation with linear func- tion approximation, it is not difficult to see that LSPE, Algorithm 2, will provide an unbiased estimate (provided the feature covariance matrices are full rank, which will occur with high probability). Interestingly, as a direct corollary, the above theorem implies that LSPE has exponential variance in H. More generally, this theorem implies that there is no estimator that can avoid such exponential dependence in the offline setting.

50 � �, � = � � � �, � = � � / � � , � = (� + � + ⋯ + �)/�

� �, � = � �()/ � �, � = � �/ � � � … � � �, � = 0 � / ()/ � � , � = �(� − � )

()/ ()/ � �, � = �� � � , � = � � � … � � �, � = 0 � � � ()/ ()/ � � , � = �(� − � )

()/ ()/ � � , � = �� � �, � = �� ⋯ ⋯ ⋯ … ⋯ ⋯ � �, � = 0 ()/ ()/ � � , � = �(� − � )

/ � �, � = �� � �, � = �� … � � � � � / � �, � = 0 � �, � = �(� − � )

/ � �, � = � � �, � = �� � � � … � � �[� �, � ] = � / � �, � = ��

Figure 0.1: An illustration of the hard instance. Recall that dˆ = d/2. States on the top are those in the first level (h = 0), while states at the bottom are those in the last level (h = H 1). Solid line (with arrow) corresponds to − transitions associated with action a1, while dotted line (with arrow) corresponds to transitions associated with action 1 2 dˆ a2. For each level h [H], reward values and Q-values associated with sh, sh, . . . , sh are marked on the left, while ∈ dˆ+1 reward values and Q-values associated with sh are mark on the right. Rewards and transitions are all deterministic, 1 2 dˆ except for the reward distributions associated with sH−1, sH−1, . . . , sH−1. We mark the expectation of the reward value when it is stochastic. For each level h [H], for the data distribution µh, the state is chosen uniformly at random ∈ from those states in the dashed rectangle, i.e., s1 , s2 , . . . , sdˆ , while the action is chosen uniformly at random from { h h h} dˆ+1 ˆ−H/2 a1, a2 . Suppose the initial state is s1 . When r∞ = 0, the value of the policy is 0. When r∞ = d , the value { } ˆH/2 of the policy is r∞ d = 1. ·

A Hard Instance for Offline Policy Evaluation

We now provide the hard instance construction and the proof of Theorem 5.7. We use d the denote the feature dimension, and we assume d is even for simplicity. We use dˆ to denote d/2 for convenience. We also provide an illustration of the construction in Figure 0.1.

State Space, Action Space and Transition Operator. The action space = a1, a2 . For each h [H], h ˆ A { } ∈ S contains dˆ+ 1 states s1 , s2 , . . . , sdˆ and sd+1. For each h [H 1], for each c 1, 2,..., dˆ+ 1 , we have h h h h ∈ − ∈ { }  ˆ 1 s = sd+1 , a = a  h+1 1 P (s sc , a) = 1 s = sc , a = a . | h h+1 2 0 else

51 ˆ−H/2 ˆ Reward Distributions. Let 0 r∞ d be a parameter to be determined. For each (h, c) [H 1] [d] c ≤ ≤dˆ+1 ˆ1/2 ˆ(H−h)/2 ∈ − ˆ× and a , we set r(sh, a) = 0 and r(sh , a) = r∞ (d 1) d . For the last level, for each c [d] and a ∈, we A set · − · ∈ ∈ A ( c 1 with probability (1 + r∞)/2 r(sH−1, a) = 1 with probability (1 r∞)/2 − − c dˆ+1 ˆ1/2 so that E[r(sH−1, a)] = r∞. Moreover, for all actions a , r(s , a) = r∞ d . ∈ A H−1 ·

d Feature Mapping. Let e1, e2, . . . , ed be a set of orthonormal vectors in R . Here, one possible choice is to set ˆ c c e1, e2, . . . , ed to be the standard basis vectors. For each (h, c) [H] [d], we set φ(sh, a1) = ec, φ(sh, a2) = ec+dˆ, and ∈ × dˆ+1 1 X φ(s , a) = ec h dˆ1/2 c∈dˆ for all a . ∈ A

Verifying Assumption 5.5. Now we verify that Assumption 5.5 holds for this construction.

π Lemma 5.8. For every policy π : ∆( ), for each h [H], for all (s, a) h , we have Qh(s, a) = π > π d S → A ∈ ∈ S × A (θh ) φ(s, a) for some θh R . ∈

Proof: We first verify Qπ is linear for the first H 1 levels. For each (h, c) [H 1] [dˆ], we have − ∈ − × π c c dˆ+1 dˆ+1 dˆ+1 ˆ(H−h)/2 Q (s , a1) =r(s , a1) + r(s , a1) + r(s , a1) + ... + r(s , a1) = r∞ d . h h h h+1 h+2 H−1 ·

Moreover, for all a , ∈ A π dˆ+1 dˆ+1 dˆ+1 dˆ+1 dˆ+1 ˆ(H−h+1)/2 Q (s , a) =r(s , a) + r(s , a1) + r(s , a1) + ... + r(s , a1) = r∞ d . h h h h+1 h+2 H−1 ·

Therefore, if we define dˆ dˆ π X ˆ(H−h)/2 X π c θh = r∞ d ec + Qh(sh, a2) ec+dˆ, c=1 · · c=1 ·

π π > then Q (s, a) = (θ ) φ(s, a) for all (s, a) h . h h ∈ S × A ˆ π c Now we verify that the Q-function is linear for the last level. Clearly, for all c [d] and a , Q (s , a) = r∞ ∈ ∈ A H−1 H−1 π dˆ+1 p ˆ π Pd π π > and Q (s , a) = r∞ d. Thus by defining θ = r∞ ec, we have Q (s, a) = θ φ(s, a) H−1 H−1 · H−1 c=1 · H−1 H−1 for all (s, a) H−1 . ∈ S × A

The Data Distributions. For each level h [H], the data distribution µh is a uniform distribution over the set 1 1 2 2 dˆ ∈ dˆ dˆ+1 (sh, a1), (sh, a2), (sh, a1), (sh, a2),..., (sh, a1), (sh, a2) . Notice that (sh , a) is not in the support of µh for all {a . It can be seen that, } ∈ A d 1 X 1 φ(s, a)φ(s, a)> = e e> = I. E(s,a)∼µh d c c d c=1

52 The Information-Theoretic Argument

ˆ−H/2 We show that it is information-theoretically hard for any algorithm to distinguish the case r∞ = 0 and r∞ = d . dˆ+1 We fix the initial state to be s1 , and consider any policy π : ∆( ). When r∞ = 0, all reward values will S → A ˆ−H/2 be zero, and thus the value of π would be zero. On the other hand, when r∞ = d , the value of π would be ˆH/2 r∞ d = 1. Thus, if the algorithm approximates the value of the policy up to an error of 1/2, then it must · ˆ−H/2 distinguish the case that r∞ = 0 and r∞ = d . ˆ−H/2 H We first notice that for the case r∞ = 0 and r∞ = d , the data distributions µh , the feature mapping { }h=1 φ : Rd, the policy π to be evaluated and the transition operator P are the same. Thus, in order to distinguish S × A → ˆ−H/2 the case r∞ = 0 and r∞ = d , the only way is to query the reward distribution by using sampling taken from the data distributions. For all state-action pairs (s, a) in the support of the data distributions of the first H 1 levels, the reward distributions dˆ+1 − will be identical. This is because for all s h sh and a , we have r(s, a) = 0. For the case r∞ = 0 and ˆ−H/2 ∈ S \{ } ∈ A r∞ = d , for all state-action pairs (s, a) in the support of the data distribution of the last level, ( 1 with probability (1 + r )/2 r(s, a) = ∞ . 1 with probability (1 r∞)/2 − − ˆ−H/2 Therefore, to distinguish the case that r∞ = 0 and r∞ = d , the agent needs to distinguish two reward distribu- tions ( 1 with probability 1/2 r(1) = 1 with probability 1/2 − and ( 1 with probability (1 + dˆ−H/2)/2 r(2) = . 1 with probability (1 dˆ−H/2)/2 − − It is standard argument that in order to distinguish r(1) and r(2) with probability at least 0.9, any algorithm requires Ω(dˆH ) samples.

dˆ+1 P ˆ1/2 The key in this construction is the state sh in each level, whose feature vector is defined to be c∈dˆ ec/d . In dˆ+1 ˆ1/2 each level, sh amplifies the Q-values by a d factor, due to the linearity of the Q-function. After all the H levels, ˆH/2 dˆ+1 the value will be amplified by a d factor. Since sh is not in the support of the data distribution, the only way for the agent to estimate the value of the policy is to estimate the expected reward value in the last level. This construction forces the estimation error of the last level to be amplified exponentially and thus implies an exponential lower bound.

5.2.2 Linearly Realizable Q?

Specifically, we will assume access to a feature map φ : Rd, and we will assume that a linear function of φ can accurately represent the Q?-function. Specifically, S × A →

? ∗ d Assumption 5.9 (Linear Q Realizability). For all h [H], assume there exists θh R such that for all (s, a) , ∈ ∈ ∈ S × A Q∗ (s, a) = θ∗ φ(s, a). h h · The hope is that this assumption may permit a sample complexity that is polynomial in d and H, with no explicit or dependence. |S| |A| Another assumption that we will use is that the minimum suboptimality gap is lower bounded.

53 ...

......

h =0 1 2 H 1 Figure 0.2: The Leaking Complete Graph Construction. Illustration of a hard MDP. There are m + 1 states in the MDP, where f is an absorbing terminal state. Starting from any non-terminal state a, regardless of the action, there is at least α = 1/6 probability that the next state will be f.

Assumption 5.10 (Constant Sub-optimality Gap). For any state s , a , the suboptimality gap is defined as ∗ ∗ ∈ S ∈ A ∆h(s, a) := V (s) Q (s, a). We assume that h − h

min ∆h(s, a) : ∆h(s, a) > 0 ∆min. h∈[H],s∈S,a∈A { } ≥

1 The hope is that with a “large gap”, the identification of the optimal policy itself (as opposed to just estimating its value accurately) may be statistically easier, thus making the problem easier. We now present two hardness results. We start with the case where we have access to a generative model.

Theorem 5.11. (Linear Q?; Generative Model Case) Consider any algorithm which has access to the episodic A sampling model and which takes as input the feature mapping φ : Rd. There exists an MDP with a feature mapping φ satisfying Assumption 5.9 such that if (when given φSas × input) A → finds a policy π such that A π ∗ Es1∼µV (s1) Es1∼µV (s1) 0.05 ≥ − with probability 0.1, then requires min 2cd, 2cH samples (c is an absolute constant). A { } We now see that in the episodic setting, even if we have a constant gap in the MDP, the hardness of learning is still unavoidable.

Theorem 5.12. (Linear Q? Realizability + Gap; Episodic Setting) Consider any algorithm which has access to the A episodic sampling model and which takes as input the feature mapping φ : Rd. There exists an MDP with a S × A → feature mapping φ satisfying Assumption 5.9 and Assumption 5.10 (where ∆min is an absolute constant) such that if (using φ as input) finds a policy π such that A π ∗ Es1∼µV (s1) Es1∼µV (s1) 0.05 ≥ − with probability 0.1, then requires min 2cd, 2cH samples (where c is an absolute constant). A { }

Linear Q? Realizability + Gap; Generative Model Case. For this case, it is not difficult to see that, with a genera- tive model, it is possible to learn exactly learn an optimal policy with a number of samples that is poly(d, H, 1/∆min)

54 (see Section 5.4). The key idea is that, at the last stage, the large gap permits us to learn the optimal policy itself (using linear regression). Furthermore, if we have an optimal policy from timestep h onwards, then we are able to ? obtain unbiased estimates of Qh−1(s, a), at any state action pair, using the generative model. We leave the proof as an exercise to the interested reader.

A Hard Instance with Constant Suboptimality Gap

The remainder of this section provides the construction of a hard family of MDPs where Q∗ is linearly realizable and has constant suboptimality gap and where it takes exponential samples to learn a near-optimal policy. Each of these hard MDPs can roughly be seen as a “leaking complete graph” (see Fig. 0.2). Information about the optimal policy can only be gained by: (1) taking the optimal action; (2) reaching a non-terminal state at level H. We will show that when there are exponentially many actions, either events happen with negligible probability unless exponentially many trajectories are played. Let m be an integer to be determined. The state space is 1¯, , m,¯ f . The special state f is called the terminal state. { ··· } ∗ The action space is simply [m]. Each MDP in this family is specified by an index a [m] and denoted by a∗ . In other words, there are m MDPs in this family. We will make use of a (large) set of approximately∈ orthogonalM vectors, which exists by the Johnson-Lindenstrauss lemma. We state the following lemma without proof:

1 2 0 Lemma 5.13 (Johnson-Lindenstrauss). For any α > 0, if m exp( 8 α d ), there exists m unit vectors v1, , vm d0 ≤ { ··· } in R such that i, j [m] such that i = j, vi, vj α. ∀ ∈ 6 |h i| ≤ We will set α = 1 and m = exp( 1 α2d) . By Lemma 5.13, we can find such a set of d-dimensional unit vectors 6 b 8 c v1, , vm . For the clarity of presentation, we will use vi and v(i) interchangeably. The construction of a∗ is { ··· } M specified below. Note that in this construction, the features, the rewards and the transitions are defined for all a1, a2 ∗ with a1, a2 [m] and a1 = a2. In particular, this construction is properly defined even when a1 = a . ∈ 6

Features. The feature map, which maps state-action pairs to d + 1 dimensional vectors, is defined as follows.  D E   φ(a1, a2) := 0, v(a1), v(a2) + 2α v(a2) , ( a1, a2 [m], a1 = a2) · ∀ ∈ 6 3  φ(a1, a1) := α, 0 , ( a1 [m]) 4 ∀ ∈ φ(f, 1) = (0, 0) , φ(f, a) := ( 1, 0) . ( a = 1) − ∀ 6 Here 0 is the zero vector in Rd. Note that the feature map is independent of a∗ and is shared across the MDP family.

Rewards. For h [H], the rewards are defined as ∈ ∗ D ∗ E ∗ rh(a1, a ) := v(a1), v(a ) + 2α, (a1 = a ) 6 hD E i ∗ rh(a1, a2) := 2α v(a1), v(a2) + 2α , (a2 = a , a2 = a1) − 6 6 3 rh(a1, a1) := α, ( a1) 4 ∀ rh(f, 1) := 0,

rh(f, a) := 1. (a = 1) − 6 ∗ For h = H 1, rH−1(s, a) := φ(s, a), (1, v(a )) for every state-action pair. − h i 55 Transitions. The initial state distribution µ is set as a uniform distribution over 1¯, , m¯ . The transition proba- bilities are set as follows. { ··· }

∗ Pr[f a1, a ] = 1, | Pr[f a1, a1] = 1, |  D E a2 : v(a1), v(a2) + 2α  ∗ Pr[ a1, a2] = D E , (a2 = a , a2 = a1) ·| f : 1 v(a1), v(a2) 2α 6 6 − − Pr[f f, ] = 1. | ·

After taking action a2, the next state is either a2 or f. Thus this MDP looks roughly like a “leaking complete graph” (see Fig. 0.2): starting from state a, it is possible to visit any other state (except for a∗); however, there is always at least 1 3α probability of going to the terminal state f. The transition probabilities are indeed valid, because − D E 0 < α v(a1), v(a2) + 2α 3α < 1. ≤ ≤

We now verify that realizability, i.e. Assumption 5.9, is satisfied.

∗ Lemma 5.14. (LInear realizability) In the MDP a∗ , h [H], for any state-action pair (s, a), Qh(s, a) = φ(s, a), θ∗ with θ∗ = (1, v(a∗)). M ∀ ∈ h i Proof: We first verify the statement for the terminal state f. Observe that at the terminal state f, the next state is always f and the reward is either 0 (if action 1 is chosen) or 1 (if an action other than 1 is chosen). Hence, we have − ( 0 a = 1 Q∗ (f, a) = h 1 a = 1 − 6 and ∗ Vh (f) = 0. This implies Q∗ (f, ) = φ(f, ), (1, v(a∗)) . h · h · i We now verify realizability for other states via induction on h = H 1, , 0. The induction hypothesis is that for − ··· all a1, a2 [m], we have ∈ ( ∗ ∗ ( v(a1), v(a2) + 2α) v(a2), v(a ) a1 = a2 Qh(a1, a2) = 3h i · h i 6 (0.2) 4 α a1 = a2 and ( ∗ ∗ ∗ v(a1), v(a ) + 2α a1 = a Vh (a1) = h3 i 6 ∗ . (0.3) 4 α a1 = a

Note that (0.2) implies that realizability is satisfied. In the remaining part of the proof we verify Eq. (0.2) and (0.3). When h = H 1, (0.2) holds by the definition of rewards. Next, note that for all h [H], (0.3) follows from (0.2). − ∗ ∗ ∈ This is because for all a1 = a , for all a2 / a1, a . 6 ∈ { } ∗ ∗ 2 Q (a1, a2) = ( v(a1), v(a2) + 2α) v(a2), v(a ) 3α , h h i · h i ≤ ∗ Moreover, for all a1 = a , 6 3 Q∗ (a , a ) = α < α. h 1 1 4

56 ∗ Furthermore, for all a1 = a , 6 ∗ ∗ ∗ 2 Q (a1, a ) = v(a1), v(a ) + 2α α > 3α . h h i ≥ ∗ ∗ ∗ In other words, (0.2) implies that a is always the optimal action for all state a1 with a1 = a . Now, for state a , for all a = a∗, we have 6 6 3 Q∗ (a∗, a) = ( v(a∗), v(a) + 2α) v(a∗), v(a) 3α2 < α = Q∗ (a∗, a∗). h h i · h i ≤ 4 h Hence, (0.2) implies that a∗ is always the optimal action for all states a with a [m]. ∈ Thus, it remains to show that (0.2) holds for h assuming (0.3) holds for h + 1. Here we only consider the case that ∗ a2 = a1 and a2 = a , since otherwise Pr[f a1, a2] = 1 and thus (0.2) holds by the definition of the rewards and the 6 ∗ 6 ∗ | fact that V (f) = 0. When a2 / a1, a , we have h ∈ { } ∗  ∗  Qh(a1, a2) = rh(a1, a2) + Esh+1 Vh+1(sh+1) a1, a2 hD E i ∗ ∗ = 2α v(a1), v(a2) + 2α + Pr[sh+1 = a2] V (a2) + Pr[sh+1 = f] V (f) − · h+1 · h+1 hD E i hD E i D ∗ E  = 2α v(a1), v(a2) + 2α + v(a1), v(a2) + 2α v(a1), v(a ) + 2α − · D E  D ∗ E = v(a1), v(a2) + 2α v(a1), v(a ) . · This is exactly (0.2) for h. Hence both (0.2) and (0.3) hold for all h [H]. ∈ We now verify the constant sub-optimality gap, i.e. Assumption 5.10, is satisfied.

Lemma 5.15. (Constant Gap) Assumption 5.10 is satisfied with ∆min = 1/24.

∗ ∗ Proof: From Eq. (0.2) and (0.3), it is easy to see that at state a1 = a , for a2 = a , the suboptimality gap is 6 6   ∗ ∗ 2 3 1 ∆h(a1, a2) := V (a1) Q (a1, a2) α max 3α , α = . h − h ≥ − 4 24

Moreover, at state a∗, for a = a∗, the suboptimality gap is 6

∗ ∗ ∗ ∗ ∗ 3 2 1 ∆h(a , a) := V (a ) Q (a , a) α 3α = . h − h ≥ 4 − 24 1 Finally, for the terminal state f, the suboptimality gap is obviously 1. Therefore ∆min 24 for all MDPs under consideration. ≥

The Information-Theoretic Argument

We provide a proof sketch for the lower bound below in Theorem 5.12.

Proof sketch. We will show that for any algorithm, there exists a∗ [m] such that in order to output π with ∈ π ∗ Es1∼µV (s1) Es1∼µV (s1) 0.05 ≥ − Ω(min{d,H}) with probability at least 0.1 for a∗ , the number of samples required is 2 . M ∗ ∗ Observe that the feature map of a∗ does not depend on a , and that for h < H 1 and a2 = a , the reward M ∗ − 6 ∗ Rh(a1, a2) also contains no information about a . The transition probabilities are also independent of a , unless the

57 action a∗ is taken. Moreover, the reward at state f is always 0. Thus, to receive information about a∗, the agent either needs to take the action a∗, or be at a non-game-over state at the final time step (h = H 1). − However, note that the probability of remaining at a non-terminal state at the next layer is at most 3 sup v(a1), v(a2) + 2α 3α . a16=a2h i ≤ ≤ 4

3 H Thus for any algorithm, Pr[sH−1 = f] , which is exponentially small. 6 ≤ 4 ∗ ∗ In other words, any algorithm that does not know a either needs to “be lucky” so that sH−1 = f, or needs to take a “by accident”. Since the number of actions is m = 2Θ(d), either event cannot happen with constant probability unless the number of episodes is exponential in min d, H . { } In order to make this claim rigorous, we can construct a reference MDP 0 as follows. The state space, action space, M and features of 0 are the same as those of a. The transitions are defined as follows: M M  D E a2 : v(a1), v(a2) + 2α Pr[ a1, a2] = D E , ( a1, a2 s.t. a1 = a2) ·| f : 1 v(a1), v(a2) 2α ∀ 6 − − Pr[f f, ] = 1. | · The rewards are defined as follows: hD E i rh(a1, a2) := 2α v(a1), v(a2) + 2α , ( a1, a2 s.t. a1 = a2) − ∀ 6 rh(f, ) := 0. · ∗ Note that 0 is identical to a∗ , except when a is taken, or when an trajectory ends at a non-terminal state. Since the latter eventM happens withM an exponentially small probability, we can show that for any algorithm the probability of ∗ ∗ ∗ taking a in a∗ is close to the probability of taking a in 0. Since 0 is independent of a , unless an exponential M M∗ M ∗ number of samples are used, for any algorithm there exists a [m] such that the probability of taking a in 0 is ∗ ∈ ∗ M o(1). It then follows that the probability of taking a in a∗ is o(1). Since a is the optimal action for every state, M such an algorithm cannot output a near-optimal policy for a∗ . M

5.2.3 Linearly Realizable π?

Similar to value-based learning, a natural assumption for policy-based learning is that the optimal policy is realizable by a halfspace. ∗ ? d Assumption 5.16 (Linear π Realizability). For any h [H], there exists θh R that satisfies for any s , we have ∈ ∈ ∈ S ∗ π (s) argmax θh, φ (s, a) . ∈ a h i As before (and as is natural in supervised learning), we will also consider the case where that theere is large margin (which is analogous to previous large gap assumption). Assumption 5.17 (Constant Margin). Without loss of generality, assume the scalings are such that for all (s, a, h) [H], φ(s, a) 1 and θ? 1 Now suppose that for all s and any a / π∗(s), ∈ S × A × k k2 ≤ k hk ≤ ∈ S ∈ ∗ θh, φ (s, π (s)) θh, φ (s, a) ∆min. h i − h i ≥ Here we restrict the linear coefficients and features to have unit norm for normalization. We state the following without proof (see Section 5.4):

58 Theorem 5.18. (Linear π? Realizability + Margin) Consider any algorithm which has access to a generative A model and which takes as input the feature mapping φ : Rd. There exists an MDP with a feature mapping φ S × A → satisfying Assumption 5.16 and Assumption 5.17 (where ∆min is an absolute constant) such that if (with φ as input) finds a policy π such that A π ∗ Es1∼µV (s1) Es1∼µV (s1) 0.05 ≥ − with probability 0.1, then requires min 2cd, 2cH samples, where c is an absolute constant. A { }

5.3 Discussion: Studying Generalization in RL

The previous lower bounds shows that: (i) without further assumptions, agnostic learning (in the standard supervised learning sense) is not possible in RL, unless we can tolerate an exponential dependence on the horizon, and (ii) simple realizability assumptions are also not sufficient for sample efficient RL. This motivates the study of RL to settings where we either make stronger assumptions or have a means in which the agent can obtain side information. Three examples of these approaches that are considered in this book are:

Structural (and Modelling) Assumptions: By making stronger assumptions about the how the hypothesis class • relates to the underlying MDP, we can move away from agnostic learning lower bound. We have seen one example of this with the stronger Bellman completeness assumption that we considered in Chapter 3. We will see more examples of this in Part 2.

Distribution Dependent Results (and Distribution Shift): We have seen one example of this approach when we • consider the approximate dynamic programming approach in Chapter 4. When we move to policy gradient methods (in Part 3), we will also consider results which depend on the given distribution (i.e. where we suppose our starting state distribution µ has some reasonable notion of coverage). Imitation Learning and Behavior Cloning: here we will consider settings where the agent has input from, effec- • tively, a teacher, and we see how this can circumvent statistical hardness results.

5.4 Bibliographic Remarks and Further Readings

The reduction from reinforcement learning to supervised learning was first introduced in [Kearns et al., 2000], which used a different algorithm (the “trajectory tree” algorithm), as opposed to the importance sampling approach presented here. [Kearns et al., 2000] made the connection to the VC dimension of the policy class. The fundamental sample complexity tradeoff — between polynomial dependence on the size of the state space and exponential dependence on the horizon — was discussed in depth in [Kakade, 2003]. Utilizing linear methods for dynamic programming goes back to, at least, [Shannon, 1959, Bellman and Dreyfus, 1959]. Formally considering the linear Q? assumption goes back to at least [Wen and Van Roy, 2017]. Resolving the learnability under this assumption was an important open question discussed in [Du et al., 2019], which is now resolved. Here, Theorem 5.11, due to [Weisz et al., 2021], resolved this question. Furthermore, Theorem 5.12, due to [Wang et al., 2021], resolves this question in the online setting with the additional assumption of having a constant sub-optimality gap. Theorem 5.18, which assumed a linearly realizable optimal policy, is due to [Du et al., 2019]. Theorem 5.7, on offline policy evaluation, is due to [Wang et al., 2020, Zanette, 2021].

59 60 Part 2

Strategic Exploration

61

Chapter 6

Multi-Armed & Linear Bandits

For the case, where γ = 0 (or H = 1 in the undiscounted case), the problem of learning in an unknown MDP reduce to the multi-armed bandit problem. The basic algorithms and proof methodologies here are important to understand in their own right, due to that we will have to extend these with more sophisticated variants to handle the exploration- exploitation tradeoff in the more challenging reinforcement learning problem. This chapter follows analysis of the LinUCB algorithm from the original proof in [Dani et al., 2008], with a simplified concentration analysis due to [Abbasi-Yadkori et al., 2011]. Throughout this chapter, we assume reward is stochastic.

6.1 The K-Armed Bandit Problem

The setting is where we have K decisions (the “arms”), where when we play arm a 1, 2,...K we obtain a ∈ { } random reward ra [ 1, 1] from R(a) ∆([ 1, 1]) which has mean reward: ∈ − ∈ −

Era∼R(a)[ra] = µa where it is easy to see that µa [ 1, 1]. ∈ − Every iteration t, the learner will pick an arm It [1, 2,...K]. Our cumulative regret is defined as: ∈ T −1 X RT = T max µi µI i t · − t=0 ? We denote a = argmax µi as the optimal arm. We define gap ∆a = µa? µa for any arm a. i − Theorem 6.1. There exists an algorithm such that with probability at least 1 δ, we have: −     p X ln(T K/δ) RT = O min KT ln(T K/δ), + K . · ∆a  a6=a? 

6.1.1 The Upper Confidence Bound (UCB) Algorithm

We summarize the upper confidence bound (UCB) algorithm in Alg. 3. For simplicity, we allocate the first K rounds to pull each arm once.

63 Algorithm 3 UCB

1: Play each arm once and denote received reward as ra for all a 1, 2,...K ∈ { } 2: for t = 1 T K do → −  t q log(T K/δ)  3: Execute arm It = arg maxi∈[K] µˆi + t Ni

4: Observe rt := rIt 5: end for where every iteration t, we maintain counts of each arm:

t−1 t X N = 1 + 1 Ii = a , a { } i=1 where It is the index of the arm that is picked by the algorithm at iteration t. We main the empirical mean for each arm as follows:

t−1 ! t 1 X µ = ra + 1 Ii = a ri . ba N t { } a i=1

Recall that ra is the reward of arm a we got during the first K rounds. We also main the upper confidence bound for each arm as follows: s ln(T K/δ) µt + 2 . ba t Na

The following lemma shows that this is a valid upper confidence bound with high probability. Lemma 6.2 (Upper Confidence Bound). For all t [1,...,T ] and a [1, 2,...K], we have that with probability at least 1 δ, ∈ ∈ − s ln(T K/δ) µt µ 2 . (0.1) ba a t − ≤ Na

The proof of the above lemma uses Azuma-Hoeffding’s inequality (Theorem A.5) for each arm a and iteration t and then apply a union bound over all T iterations and K arms. The reason one needs to use Azuma-Hoeffding’s inequality t is that here the number of trials of arm a, i.e. Na, itself is a random variable while the classic Hoeffding’s inequality only applies to the setting where the number of trials is a fixed number.

Proof: We first consider a fixed arm a. Let us define the following random variables, X0 = ra µa,X1 = 1 I1 = − { a (r1 µa),X2 = 1 I2 = a (r2 µa),...,Xi = 1 Ii = a (ri µa),... . } − { } − { } − Regarding the boundedness on these random variables, we have that for i where 1 Ii = a = 1, we have Xi 1, { } | | ≤ and for i where 1 Ii = a = 0, we have Xi = 0. Now, consider E [Xi

t−1 X t t t p t Xi = N µ N µa 2 ln(1/δ)N , a ba − a ≤ a i=0

64 Pt−1 2 t t where we use the fact that Xi N . Divide N on both side, we have: i=0 | | ≤ a a t p t µ µa 2 ln(1/δ)/N ba − ≤ a Apply union bound over all 0 t T and a 1,...,K , we have that with probability 1 δ, for all t, a: ≤ ≤ ∈ { } − t p t p t µ µa 2 ln(1/δ)/N 2 ln(T K/δ)/N ba − ≤ a ≤ a This concludes the proof. Now we can conclude the proof of the main theorem. Proof: Below we conditioned on the above inequality 0.1 holds. This gives us the following optimism: s ln(T K/δ) µ µt + 2 , a, t. a ba t ≤ Na ∀

Thus, we can upper bound the regret as follows: s s t ln(T K/δ) ln(T K/δ) µa? µI µ + 2 µI 4 . t bIt N t t N t − ≤ It − ≤ It Sum over all iterations, we get:

T −1 T −1 s X p X 1 µa? µI 4 ln(T K/δ) − t ≤ N t t=0 t=0 It N T a s X X 1 X q X = 4pln(T K/δ) 8pln(T K/δ) N T 8pln(T K/δ) K N T √ ≤ a ≤ a a i=1 i a a 8pln(T K/δ)√KT. ≤ Note that our algorithm has regret K at the first K rounds.

On the other hand, if for each arm a, the gap ∆a > 0, then, we must have:

T 4 ln(T K/δ) Na 2 . ≤ ∆a which is because after the UCB of an arm a is below µa? , UCB algorithm will never pull this arm a again (the UCB ? of a is no smaller than µa? ). Thus for the regret calculation, we get:

T −1 X X T X 4 ln(T K/δ) ? µa µIt Na ∆a . − ≤ ≤ ∆a t=0 a6=a? a6=a?

Together with the fact that Inequality 0.1 holds with probability at least 1 δ, we conclude the proof. −

6.2 Linear Bandits: Handling Large Action Spaces

Let D Rd be a compact (but otherwise arbitrary) set of decisions. On each round, we must choose a decision ⊂ xt D. Each such choice results in a reward rt [ 1, 1]. ∈ ∈ − 65 Algorithm 4 The Linear UCB algorithm

Input: λ, βt 1: for t = 0, 1 ... do 2: Execute xt = argmaxx∈D max µ x µ∈BALLt ·

and observe the reward rt. 3: Update BALLt+1 (as specified in Equation 0.2). 4: end for

We assume that, regardless of the history of decisions and observed rewards, the conditional expectation of rt is a fixed linear function, i.e. for all x D, H ∈ ? E[rt xt = x] = µ x [ 1, 1], | · ∈ − where x D is arbitrary. Here, observe that we have assumed the mean reward for any decision is bounded in [ 1, 1]. Under these∈ assumptions, the noise sequence, − ? ηt = rt µ xt − · is a martingale difference sequence. The is problem is essentially a bandit version of a fundamental geometric optimization problem, in which the agent’s ? feedback on each round t is only the observed reward rt and where the agent does not know µ apriori.

If x0, . . . xT −1 are the decisions made in the game, then define the cumulative regret by T −1 ? ? X ? RT = µ x µ xt · − t=0 · where x? D is an optimal decision for µ?, i.e. ∈ x? argmax µ? x ∈ x∈D · which exists since D is compact. Observe that if the mean µ? were known, then the optimal strategy would be to play x? every round. Since the expected loss for each decision x equals µ? x, the cumulative regret is just the difference · between the expected loss of the optimal algorithm and the expected loss for the actual decisions xt. By the Hoeffding- PT −1 Azuma inequality (see Lemma A.5), the observed reward t=0 rt will be close to their (conditional) expectations PT −1 ? µ xt. t=0 · Since the sequence of decisions x1, . . . , xT −1 may depend on the particular sequence of random noise encountered, RT is a random variable. Our goal in designing an algorithm is to keep RT as small as possible.

6.2.1 The LinUCB algorithm

LinUCB is based on “optimism in the face of uncertainty,” which is described in Algorithm 4. At episode t, we use all previous experience to define an uncertainty region (an ellipse) BALLt. The center of this region, µbt, is the solution of the following regularized least squares problem: t−1 X 2 2 µt = arg min µ xτ rτ 2 + λ µ 2 b µ τ=0 k · − k k k t−1 −1 X = Σt rτ xτ , τ=0

66 where λ is a parameter and where t−1 X > Σt = λI + xτ xτ , with Σ0 = λI. τ=0

The shape of the region BALLt is defined through the feature covariance Σt. Precisely, the uncertainty region, or confidence ball, is defined as:

 > BALLt = µ (µt µ) Σt(µt µ) βt , (0.2) | b − b − ≤ where βt is a parameter of the algorithm.

Computation. Suppose that we have an efficient linear optimization oracle, i.e. that we can efficiently solve the problem: max ν x x∈D · for any ν. Even with this, Step 2 of LinUCB may not be computationally tractable. For example, suppose that D is provided to us as a polytope, then the above oracle can be efficiently computed using linear programming, while LinUCB is an NP-hard optimization. Here, we can actually use a wider confidence region, where we can keep track of `1 ball which contains BALLt. See Section 6.4 for further reading.

6.2.2 Upper and Lower Bounds

Our main result here is that we have sublinear regret with only a polynomial dependence on the dimension d and, importantly, no dependence on the cardinality of the decision space D, i.e. on D . | | Theorem 6.3. Suppose that the expected costs are bounded, in magnitude, by 1, i.e. that µ? x 1 for all x D; ? 2 1 | · | ≤ ∈ that µ W and x B for all x D; and that the noise ηt is σ sub-Gaussian . Set k k ≤ k k ≤ ∈   tB2W 2   λ = σ2/W 2, β := σ2 2 + 4d log 1 + + 8 log(4/δ) . t d

We have that with probability greater than 1 δ, that (simultaneously) for all T 0, − ≥   TB2W 2   RT cσ√T d log 1 + + log(4/δ) ≤ dσ2 where c is an absolute constant. In other words, we have that RT is O˜(d√T ) with high probability.

The following shows that no algorithm can (uniformly) do better.

Theorem 6.4. (Lower bound) There exists a distribution over linear bandit problems (i.e. a distribution over µ) with the rewards being bounded by 1, in magnitude, and σ2 1, such that for every (randomized) algorithm, we have for n max 256, d2/16 , ≤ ≥ { } 1 Eµ ERT d√T. ≥ 2500 where the inner expectation is with respect to randomness in the problem and the algorithm.

We will only prove the upper bound (See Section bib:bandits).

1 Roughly, this assumes the tail probabilities of ηt decay no more slowly than a Gaussian distribution. See Definition A.2.

67 LinUCB and D-optimal design. Let us utilize the D-optimal design to improve the dependencies in Theorem 6.3. Let ΣD denote the D-optimal design matrix from Theorem 3.2. Consider the coordinate transformation:

−1/2 ? 1/2 ? x˜ = ΣD x, µe = ΣD µ . Observe that µ? x = µ? x, so we still have a linear expected reward function in this new coordinate systems. Also, e · e · observe that x˜ = x Σ d, which is a property of the D-optimal design, and that k k k k D ≤ q q ? ? ? > ? ? 2 µ = µ ΣD = (µ ) ΣDµ = Ex∼ρ[(µ x) ] 1, ke k k k · ≤ where the last step uses our assumption that the rewards are bounded, in magnitude, by 1. The following corollary shows that, without loss of generality, we we can remove the dependencies on B and W from the previous theorem, due to that B √d and W 1 when working under this coordinate transform. ≤ ≤ ? Corollary 6.5. Suppose that the expected costs are bounded, in magnitude, by 1, i.e. that µe xe 1 for all x D 2 | · | ≤ ∈ and that the noise ηt is σ sub-Gaussian. Suppose linUCB is implemented in the xe coordinate system, as described above, with the following settings of the parameters.

2 2  λ = σ , βt := σ 2 + 4d log (1 + t) + 8 log(4/δ) .

We have that with probability greater than 1 δ, that (simultaneously) for all T 0, − ≥   T   RT cσ√T d log 1 + + log(4/δ) ≤ σ2 where c is an absolute constant.

6.3 LinUCB Analysis

In establishing the upper bounds there are two main propositions from which the upper bounds follow. The first is in showing that the confidence region is appropriate. Proposition 6.6. (Confidence) Let δ > 0. We have that

? Pr( t, µ BALLt) 1 δ. ∀ ∈ ≥ −

Section 6.3.2 is devoted to establishing this confidence bound. In essence, the proof seeks to understand the growth of ? > ? the quantity (µt µ ) Σt(µt µ ). b − b − The second main step in analyzing LinUCB is to show that as long as the aforementioned high-probability event holds, we have some control on the growth of the regret. Let us define

? ∗ ? regret = µ x µ xt t · − · which denotes the instantaneous regret. The following bounds the sum of the squares of instantaneous regret.

Proposition 6.7. (Sum of Squares Regret Bound) Suppose that x B for x D. Suppose βt is increasing and ? k k ≤ ∈ that βt 1. For LinUCB, if µ BALLt for all t, then ≥ ∈ T −1 X  TB2  regret2 8β d log 1 + t T dλ t=0 ≤

68 This is proven in Section 6.3.1. The idea of the proof involves a potential function argument on the log volume (i.e. the ? log determinant) of the “precision matrix” Σt (which tracks how accurate our estimates of µ are in each direction). The proof involves relating the growth of this volume to the regret. Using these two results we are able to prove our upper bound as follows: Proof:[Proof of Theorem 6.3] By Propositions 6.6 and 6.7 along with the Cauchy-Schwarz inequality, we have, with probability at least 1 δ, − v T −1 u T −1 s X u X  TB2  R = regret tT regret2 8T β d log 1 + . T t t T dλ t=0 ≤ t=0 ≤

The remainder of the proof follows from using our chosen value of βT and algebraic manipulations (that 2ab a2 + b2). ≤ We now provide the proofs of these two propositions.

6.3.1 Regret Analysis

In this section, we prove Proposition 6.7, which says that the sum of the squares of the instantaneous regrets of the algorithm is small, assuming the evolving confidence balls always contain the true mean µ?. An important observation ? is that on any round t in which µ BALLt, the instantaneous regret is at most the “width” of the ellipsoid in the direction of the chosen decision. Moreover,∈ the algorithm’s choice of decisions forces the ellipsoids to shrink at a rate that ensures that the sum of the squares of the widths is small. We now formalize this.

Unless explicitly stated, all norms refer to the `2 norm.

Lemma 6.8. Let x D. If µ BALLt and x D. Then ∈ ∈ ∈ q > > −1 (µ µt) x βtx Σ x | − b | ≤ t

Proof: By Cauchy-Schwarz, we have:

> > 1/2 −1/2 1/2 > −1/2 (µ µbt) x = (µ µbt) Σt Σt x = (Σt (µ µbt)) Σt x | − | | − | | q− q | 1/2 −1/2 1/2 > −1 > −1 Σ (µ µt) Σ x = Σ (µ µt) x Σ x βtx Σ x ≤ k t − b kk t k k t − b k t ≤ t where the last inequality holds since µ BALLt. ∈ Define q > −1 wt := xt Σt xt which we interpret as the “normalized width” at time t in the direction of the chosen decision. We now see that the width, 2√βtwt, is an upper bound for the instantaneous regret.

? Lemma 6.9. Fix t T . If µ BALLt and βt 1, then ≤ ∈ ≥ p p regret 2 min ( βtwt, 1) 2 βT min (wt, 1) t ≤ ≤

> Proof: Let µ BALLt denote the vector which minimizes the dot product µ xt. By the choice of xt, we have e ∈ e > > > ? > ∗ µe xt = max µ xt = max max µ x (µ ) x , µ∈BALLt x∈D µ∈BALLt ≥

69 ? where the inequality used the hypothesis µ BALLt. Hence, ∈ ? > ∗ ? > ? > regret = (µ ) x (µ ) xt (µ µ ) xt t − ≤ e − > ? > p = (µ µt) xt + (µt µ ) xt 2 βtwt e − b b − ≤ ? where the last step follows from Lemma 6.8 since µ and µ are in BALLt. Since rt [ 1, 1], regret is always at e ∈ − t most 2 and the first inequality follows. The final inequality is due to that βt is increasing and larger than 1. The following two lemmas prove useful in showing that we can treat the log determinant as a potential function, where can bound the sum of widths independently of the choices made by the algorithm.

Lemma 6.10. We have: T −1 Y 2 det ΣT = det Σ0 (1 + wt ). t=0

Proof: By the definition of Σt+1, we have

> 1/2 −1/2 > −1/2 1/2 det Σt+1 = det(Σt + xtxt ) = det(Σt (I + Σt xtxt Σt )Σt ) −1/2 −1/2 > > = det(Σt) det(I + Σt xt(Σt xt) ) = det(Σt) det(I + vtvt ),

−1/2 > 2 where vt := Σt xt. Now observe that vt vt = wt and

> > 2 (I + vtvt )vt = vt + vt(vt vt) = (1 + wt )vt

2 > > > Hence (1 + wt ) is an eigenvalue of I + vtvt . Since vtvt is a rank one matrix, all other eigenvalues of I + vtvt equal > 2 2 1. Hence, det(I + vtvt ) is (1 + wt ), implying det Σt+1 = (1 + wt ) det Σt. The result follows by induction.

Lemma 6.11. (“Potential Function” Bound) For any sequence x0, . . . xT −1 such that, for t < T , xt 2 B, we have: k k ≤

T −1 !   1 X  TB2  log det Σ / det Σ = log det I + x x> d log 1 + . T −1 0 λ t t dλ t=0 ≤

PT −1 > Proof: Denote the eigenvalues of t=0 xtxt as σ1, . . . σd, and note:

d T −1 ! T −1 X X > X 2 2 σi = Trace xtx = xt TB . t k k ≤ i=1 t=0 t=0 Using the AM-GM inequality,

T −1 ! d ! 1 X Y log det I + x x> = log (1 + σ /λ) λ t t i t=0 i=1 d !1/d d ! Y 1 X  TB2  = d log (1 + σi/λ) d log (1 + σi/λ) d log 1 + , ≤ d ≤ dλ i=1 i=1 which concludes the proof. Finally, we are ready to prove that if µ? always stays within the evolving confidence region, then our regret is under control.

70 ? Proof:[Proof of Proposition 6.7] Assume that µ BALLt for all t. We have that: ∈ T −1 T −1 T −1 X 2 X 2 X 2 regrett 4βt min(wt , 1) 4βT min(wt , 1) t=0 ≤ t=0 ≤ t=0 T −1 X    TB2  8β ln(1 + w2) 8β log det Σ / det Σ = 8β d log 1 + T t T T −1 0 T dλ ≤ t=0 ≤ where the first inequality follow from By Lemma 6.9; the second from that βt is an increasing function of t; the third uses that for 0 y 1, ln(1 + y) y/2; the final two inequalities follow by Lemmas 6.10 and 6.11. ≤ ≤ ≥

6.3.2 Confidence Analysis

? Proof:[Proof of Proposition 6.6] Since rτ = xτ µ + ητ , we have: · t−1 t−1 ? −1 X ? −1 X ? ? µbt µ = Σt rτ xτ µ = Σt xτ (xτ µ + ητ ) µ − τ=0 − τ=0 · − t−1 ! t−1 t−1 −1 X > ? ? −1 X −1 ? −1 X = Σt xτ (xτ ) µ µ + Σt ητ xτ = λΣt µ + Σt ητ xτ τ=0 − τ=0 τ=0

For any 0 < δt < 1, using Lemma A.9, it holds with probability at least 1 δt, − q ? > ? 1/2 ? (µt µ ) Σt(µt µ ) = (Σt) (µt µ ) b − b − k b − k

t−1 −1/2 ? −1/2 X λΣt µ + Σt ητ xτ ≤ τ=0 ? p 2 0 −1 √λ µ + 2σ log (det(Σt) det(Σ ) /δt). ≤ k k where we have also used the triangle inequality and that Σ−1 1/λ. k t k ≤ ? We seek to lower bound Pr( t, µ BALLt). Note that at t = 0, by our choice of λ, we have that BALL0 contains ? ? ∀ ∈ 2 2 W , so Pr(µ / BALL0) = 0. For t 1, let us assign failure probability δt = (3/π )/t δ for the t-th event, which, using the above,∈ gives us an upper bound≥ on the sum failure probability as

∞ ∞ ? ? X ? X 2 2 1 Pr( t, µ BALLt) = Pr( t, µ / BALLt) Pr(µ / BALLt) < (1/t )(3/π )δ = δ/2. − ∀ ∈ ∃ ∈ ≤ t=1 ∈ t=1 This along with Lemma 6.11 completes the proof.

6.4 Bibliographic Remarks and Further Readings

The orignal multi-armed bandit model goes to back to [Robbins, 1952]. The linear bandit model was first introduced in [Abe and Long, 1999]. Our analysis of the LinUCB algorithm follows from the original proof in [Dani et al., 2008], with a simplified concentration analysis due to [Abbasi-Yadkori et al., 2011]. The first sublinear regret bound here was due to [Auer et al., 2002], which used a more complicated algorithm. The lower bound we present is also due to [Dani et al., 2008], which also shows that LinUCB is minimax optimal.

71 72 Chapter 7

Strategic Exploration in Tabular MDPs

We now turn to how an agent acting in an unknown MDP can obtain a near-optimal reward over time. Compared with the previous setting with access to a generative model, we no longer have easy access to transitions at each state, but only have the ability to execute trajectories in the MDP. The main complexity this adds to the learning process is that the agent has to engage in exploration, that is, plan to reach new states where enough samples have not been seen yet, so that optimal behavior in those states can be learned.

We will work with finite-horizon MDPs with a fixed start state s0 (see Section 1.2), and we will assume the agent learns in an episodic setting, as introduced in Chapter 2. Here, in in every episode k, the learner acts for H step starting from a fixed starting state s0 and, at the end of the H-length episode, the state is reset to s0. It is straightforward to extend this setting where the starting state is sampled from a distribution, i.e. s0 µ. ∼ The goal of the agent is to minimize her expected cumulative regret over K episodes:

" K−1 H−1 # ? X X k k Regret := E KV (s0) r(sh, ah) , − k=0 h=0 where the expectation is with respect to the randomness of the MDP environment and, possibly, any randomness of the agent’s strategy. This chapter considers tabular MDPs, where and are discrete. We will now present a (nearly) sublinear regret algorithm, UCB-Value Iteration. This chapterS followsA the proof in [Azar et al., 2017]. We first give a simplified analysis of UCB-VI that gives a regret bound in the order of H2S√AK, followed by a more refined analysis that gives a H2√SAK + H3S2A regret bound.

7.1 On The Need for Strategic Exploration

Consider the chain MDP of length H + 2 shown in Figure 0.1. The starting state of interest is state s0. We consider a uniformly random policy πh(s) that at any time step h and any state s, selects an action from all A many actions uniform randomly. It is easy to see that starting from s0, the probability of such uniform random policy hitting the H  rewarding state sH+1 is exponentially small, i.e., O A , since it needs to select the right action that moves one step right at every time step h. This example demonstrates that we need to perform strategic exploration in order to avoid the exponential sample complexity.

73 a1 a4 a4

a3 a3

a2 a2 s0 s1 sH sH+1 ··· a1 a1 a1

Figure 0.1: A deterministic, chain MDP of length H + 2. Rewards are 0 everywhere other than r(sH+1, a1) = 1. In this example we have A = 4.

Algorithm 5 UCBVI Input: reward function r (assumed to be known), confidence parameters 1: for k = 0 ...K 1 do k− 2: Compute Pbh as the empirical estimates, for all h (Eq. 0.1) k 3: Compute reward bonus bh for all h (Eq. 0.2) H−1 4: Run Value-Iteration on Pbk, r + bk (Eq. 0.3) { h h}h=0 5: Set πk as the returned policy of VI. 6: end for

7.2 The UCB-VI algorithm

k k k k H−1 If the learner then executes π in the underlying MDP to generate a single trajectory τ = sh, ah h=0 with ah = k k k k k { } π (s ) and s Ph( s , a ). We first define some notations below. Consider the very beginning of episode k. h h h+1 ∼ ·| h h We use the history information up to the end of episode k 1 (denoted as

k−1 X N k(s, a, s0) = 1 (si , ai , si ) = (s, a, s0) , h { h h h+1 } i=0 k−1 X N k(s, a) = 1 (si , ai ) = (s, a) , h, s, a. h { h h } ∀ i=0

Namely, we maintain counts of how many times s, a, s0 and s, a are visited at time step h from the beginning of the learning process to the end of the episode k 1. We use these statistics to form an empirical model: − k 0 k 0 Nh (s, a, s ) 0 Pbh (s s, a) = k , h, s, a, s . (0.1) | Nh (s, a) ∀

We will also use the counts to define a reward bonus, denoted as bh(s, a) for all h, s, a. Denote L := ln (SAHK/δ) (δ as usual represents the failure probability which we will define later). We define reward bonus as follows:

s k L bh(s, a) = 2H k . (0.2) Nh (s, a)

k With reward bonus and the empirical model, the learner uses Value Iteration on the empirical transition Pbh and the k combined reward rh + b . Starting at H (note that H is a fictitious extra step as an episode terminates at H 1), we h − 74 perform dynamic programming all the way to h = 0:

Vb k (s) = 0, s, H ∀ k n k k k o Qb (s, a) = min rh(s, a) + b (s, a) + Pb ( s, a) Vb ,H , h h h ·| · h+1 k k k k Vbh (s) = max Qbh(s, a), πh(s) = argmaxa Qbh(s, a), h, s, a. (0.3) a ∀

k k Note that when using Vbh+1 to compute Qbh, we truncate the value by H. This is because we know that due to the assumption that r(s, a) [0, 1], no policy’s Q value will ever be larger than H. ∈ Denote πk = πk, . . . , πk . Learner then executes πk in the MDP to get a new trajectory τ k. { 0 H−1} UCBVI repeats the above procedure for K episodes.

7.3 Analysis

We will prove the following theorem. Theorem 7.1 (Regret Bound of UCBVI). UCBVI achieves the following regret bound:

"K−1 #  k    X ? π 2 p 2 2 2 Regret := E V (s0) V (s0) 10H S AK ln(SAH K ) = Oe H S√AK − ≤ · k=0

Remark While the above regret is sub-optimal, the algorithm we presented here indeed achieves a sharper bound in the leading term Oe(H2√SAK) [Azar et al., 2017], which gives the tight dependency bound on S, A, K. In Section 7.4, we will present a refined analysis that indeed achieves H2√SAK regret. The dependency on H is not tight and tightening the dependency on H requires modifications to the reward bonus (use Bernstein inequality rather than Hoeffding’s inequality for reward bonus design). We prove the above theorem in this section.

k We start with bounding the error from the learned model Pbh . Lemma 7.2 (State-action wise model error). Fix δ (0, 1). For all k [0,...,K 1], s , a , h [0,...,H 1], with probability at least 1 δ, we have∈ that for any f : ∈[0,H]: − ∈ S ∈ A ∈ − − S 7→ s >  k ?  S ln(SAHK/δ) Pbh ( s, a) Ph ( s, a) f 8H k . ·| − ·| ≤ Nh (s, a)

Remark The proof of the above lemma requires a careful argument using Martingale difference sequence mainly k because Nh (s, a) itself is a random quantity. We give a proof the above lemma at the end of this section (Sec. 7.3.1). Using the above lemma will result a sub-optimal regret bound in terms of the dependence on S. In Section 7.4, we

 > will give a different approach to bound Pbk( s, a) P ?( s, a) f which leads to an improved regret bound. h ·| − h ·| The following lemma is still about model error, but this time we consider an average model error with respect to V ? — a deterministic quantity.

75 Lemma 7.3 (State-action wise average model error under V ?). Fix δ (0, 1). For all k [1,...,K 1], s , a , h [0,...,H 1], and consider V ? : [0,H], with probability∈ at least 1 δ, we∈ have: − ∈ S ∈ A ∈ − h S → − s ln(SAHN/δ) P k( s, a) V ? P ?( s, a) V ? 2H . bh h+1 h h+1 k ·| · − ·| · ≤ Nh (s, a)

Note that we can use Lemma 7.2 to bound Pbk( s, a) V ? P ?( s, a) V ? . However, the above lemma shows h ·| · h+1 − h ·| · h+1 that by leveraging the fact that V ? is a deterministic quantity (i.e., independent of the data collected during learning), we can get a tighter upper bound which does not scale polynomially with respect to S.

k Proof: Consider a fixed tuple s, a, k, h first. By the definition of Pbh , we have:

k−1 1 X Pbk( s, a) V ? = 1 (si , ai ) = (s, a) V ? (si ). h ·| · h+1 N k(s, a) { h h } h+1 h+1 h i=0

Now denote h,i as the entire history from t = 0 to iteration t = i where in iteration i, h,i includes history from H i iH ? i time step 0 up to and including time step h. We define random variables: Xi = 1 (sh, ah) = (s, a) Vh+1(sh+1)  i i ? 0  { } − E 1 (sh, ah) = (s, a) Vh+1(sh+1) h,i for i = 0,...,K 1. These random variables have the following proper- { } i i |H − Pk 2 k ties. First Xi H if 1 (s , a ) = (s, a) = 1, else Xi = 0, which implies that Xi N (s, a). Also | | ≤ { h h } | | i=0 | | ≤ h we have E [Xi h,i] = 0 for all i. Thus, Xi i≥0 is a Martingale difference sequence. Using Azuma-Hoeffding’s inequality, we get|H that for any fixed k, with{ probability} at least 1 δ, −

k−1 k−1 q X X i i ? i k ? 0 k ? Xi = 1 (sh, ah) = (s, a) Vh+1(sh+1) Nh (s, a)Es0∼P (s,a)V (s ) 2H N (s, a) ln(1/δ). { } − h ≤ h i=0 i=0

k > ? Divide N (s, a) on both sides and use the definition of Pbh( s, a) V , we get: h ·| h+1 q > ? ? > ? k Pbh( s, a) V P ( s, a) V 2H ln(1/δ)/N (s, a). ·| h+1 − h ·| h+1 ≤ h

With a union bound over all s , a , k [N], h [H], we conclude the proof. ∈ S ∈ A ∈ ∈ We denote the two inequalities in Lemma 7.2 and Lemma 7.3 as event model. Note that the failure probability of E model is at most 2δ. Below we condition on model being true (we deal with failure event at the very end). E E Now we study the effect of reward bonus. Similar to the idea in multi-armed bandits, we want to pick a policy πk, k k k such that the value of π in under the combined reward rh +bh and the empirical model Pbh is optimistic, i.e., we want k ? Vb (s0) V (s0) for all s0. The following lemma shows that via reward bonus, we are able to achieve this optimism. 0 ≥ 0

Lemma 7.4 (Optimism). Assume model is true. For all episode k, we have: E k ? Vb (s0) V (s0), s0 ; 0 ≥ 0 ∀ ∈ S

k where Vbh is computed based on VI in Eq. 0.3.

k ? Proof: We prove via induction. At the additional time step H we have VbH (s) = VH (s) = 0 for all s.

Starting at h + 1, and assuming we have Vb k (s) V ? (s) for all s, we move to h below. h+1 ≥ h+1 76 Consider any s, a . First, if Qk (s, a) = H, then we have Qk (s, a) Q? (s, a). ∈ S × A h h ≥ h Qbk (s, a) Q? (s, a) = bk (s, a) + Pbk( s, a) Vb k P ?( s, a) V ? h − h h h ·| · h+1 − h ·| · h+1 k k ? ? ? bh(s, a) + Pbh ( s, a) Vh+1 Ph ( s, a) Vh+1 ≥  ·| · − ·| · = bk (s, a) + Pbk( s, a) P ?( s, a) V ? h h ·| − h ·| · h+1 s k ln(SAHK/δ) bh(s, a) 2H k 0. ≥ − Nh (s, a) ≥ where the first inequality is from the inductive hypothesis, and the last inequality uses Lemma 7.3 and the definition of bonus.

From Qbk , one can finish the proof by showing Vb k(s) V ?(s), s. h h ≥ h ∀ Before we prove our final result, one more technical lemma will be helpful: Lemma 7.5. Consider arbitrary K sequence of trajectories τ k = sk , ak H−1 for k = 0,...,K 1. We have { h h}h=0 − K−1 H−1 1 X X √ q 2H SAK. k k k ≤ k=0 h=0 Nh (sh, ah)

Proof: We swap the order of the two summation above:

K K−1 H−1 H−1 K−1 H−1 N (s,a) X X 1 X X 1 X X hX 1 q = q = k k k k k k √i k=0 h=0 Nh (sh, ah) h=0 k=0 Nh (sh, ah) h=0 s,a∈S×A i=1 H−1 H−1 X X q X s X 2 N K (s, a) SA N K (s, a) = H√SAK, ≤ h ≤ h h=0 s,a∈S×A h=0 s,a PN where in the first inequality we use the fact that i=1 1/√i 2√N, and in the second inequality we use CS inequality. ≤ The proof of our main theorem now follows: Proof:[Proof of Theorem 7.1]

Let us consider episode k and denote

 k ?  πk Under model, we can bound Pb ( sh, ah) P ( sh, ah) Vb (recall Lemma 7.2) with a Holder’s inequality: E h ·| − ·| · h+1  k ?  πk k ? πk Pbh ( sh, ah) P ( sh, ah) Vbh+1 Pbh ( sh, ah) P ( sh, ah) Vbh+1 ·| − ·| · ≤ ·| − ·| 1 ∞ s SL 8H k , ≤ Nh (s, a)

77 ? πk where recall L := ln(SAKH/δ). Hence, for the per-episode regret V (s0) V (s0), we obtain: − H−1  q  ? πk X k k V (s0) V (s0) Es ,a ∼dπk bh(sh, ah) + 8H SL/Nh (sh, ah) − ≤ h h h h=0 H−1  q  X k Es ,a ∼dπk 10H SL/Nh (sh, ah) ≤ h h h h=0   H−1 1 √ X = 10H SL E  q

"K−1 # " K−1 !# " K−1 !# X ? πk X ? πk X ? πk E V (s0) V (s0) = E 1 model V (s0) V (s0) + E 1 model V (s0) V (s0) − {E } − {E } − k=0 k=0 k=0 " K−1 !# X ? πk E 1 model V (s0) V (s0) + 2δKH (0.5) ≤ {E } − k=0 K−1 H−1  p X X 1 10H S ln(SAHK/δ)E  q  + 2δKH, ≤ k k k k=0 h=0 Nh (sh, ah) where the last inequality uses the law of total expectation. We can bound the double summation term above using lemma 7.5. We can conclude that:

"K−1 # X ? πk 2 p E V (s0) V (s0) 10H S AK ln(SAHK/δ) + 2δKH. − ≤ k=0

Now set δ = 1/KH, we get:

"K−1 # k   X ? π 2 p 2 2 2 p 2 2 E V (s0) V (s0) 10H S AK ln(SAH K ) + 2 = O H S AK ln(SAH K ) . − ≤ k=0

This concludes the proof of Theorem 7.1.

7.3.1 Proof of Lemma 7.2

Here we include a proof of Lemma 7.2. The proof consists of two steps, first we using Martingale concentration to k ? > bound (Pbh ( s, a) Ph ( s, a)) f for all s, a, h, k, but for a fixed f : [0,H]. Note that there are infinitely many such| f·|that maps− from·| to |[0,H]. Thus, the second step is to useS 7→ a covering argument (i.e., -net on all S f : [0,H] ) to show that (Pbk( s, a) P ?( s, a))>f is bounded for all s, a, h, k, and all f : [0,H]. { S 7→ } | h ·| − h ·| | S 7→ k Proof:[Proof of Lemma 7.2] We consider a fixed tuple s, a, k, h, f first. Recall the definition of Pbh (s, a), and we k−1 have: Pbk( s, a)>f = P 1 si , ai = s, a f(si )/N k(s, a). h ·| i=0 { h h } h+1 h 78 Define h,i as the history starting from the beginning of iteration 0 all the way up to and include time step h at iteration H i i i i i 0 i. Define random variables X as X = 1 s , a = s, a f(s ) 1 s , a = s, a 0 ? f(s ). We now show i i h h h+1 h h Es ∼Ph (s,a) { } − { } i i that Xi is a Martingale difference sequence. First we have E[Xi h,i] = 0, since 1 sh, ah = s, a is a deterministic { } i i |H { i i} quantity given h,i. Second, we have Xi = 0 for (sh, ah) = (s, a), and Xi H when (sh, ah) = (s, a). Thus, we have shownH that it is a Martingale difference| | sequence. 6 | | ≤ Now we can apply Azuma-Hoeffding’s inequality (Theorem A.5). With probability at least 1 δ, we have: −

k−1 k−1 q X X i i i k 0 k ? Xi = 1 (sh, ah) = (s, a) f(sh+1) Nh (s, a)Es0∼P (s,a)f(s ) 2H N (s, a) ln(1/δ). { } − h ≤ h i=0 i=0 Now apply union bound over all s , a , h [0,...,H 1], k [0,...,K 1], we have that with probability at least 1 δ, for all s, a, h, k, we have:∈ S ∈ A ∈ − ∈ − −

k−1 q X i i i k 0 k ? 1 (sh, ah) = (s, a) f(sh+1) Nh (s, a)Es0∼P (s,a)f(s ) 2H N (s, a) ln(SAKH/δ). (0.6) { } − h ≤ h i=0 Note that the above result only holds for the fixed f that we specified at the beginning of the proof. We are left to perform a covering argument over all functions that map from to [0,H]. We use a standard -net argument for that. We abuse notation a bit and denote f as a vector in [0,H]S. S

Note that f 2 H√S for any f : [0,H]. A standard -net argument shows that we can construct an -net k k S ≤ S 7→ S S 0  over [0,H] with  (1 + 2H√S/) , such that for any f [0,H] , there exists a point f , such that N 0 |N | ≤ ∈ ∈ N f f 2 . Note that the -net  is independent of the training data during the learning process. k − k ≤ N Now using Eq. 0.6 and a union bound over all f in , we have that with probability at least 1 δ, for all s, a, k, h, f N − ∈ , we have: N Pk−1 i i i s i=0 1 (sh, ah) = (s, a) f(sh+1) 0 S ln(SAKH(1 + 2H√S/)/δ) { } s0∼P ?(s,a)f(s ) 2H . k E h k Nh (s, a) − ≤ Nh (s, a)

S 0 Finally, for any f [0,H] , denote its closest point in  as f , we have: ∈ N Pk−1 i i i i=0 1 (sh, ah) = (s, a) f(sh+1) 0 { } s0∼P ?(s,a)f(s ) k E h Nh (s, a) − Pk−1 i i 0 i Pk−1 i i i 0 i i=0 1 (sh, ah) = (s, a) f (sh+1) 0 0 i=0 1 (sh, ah) = (s, a) (f(sh+1) f (sh+1)) { } s0∼P ?(s,a)f (s ) + { } − k E h k ≤ Nh (s, a) − Nh (s, a) 0 0 0 + Es0∼P ?(s,a)(f(s ) f (s )) h − v u   uS ln SAKH(1 + 2H√S/)/δ t 2H k + 2, ≤ Nh (s, a) 0 0 where in the last inequality we use the fact that if f f 2 , then f(s) f (s) , s . k − k ≤ | − | ≤ ∀ ∈ S Now we can set  = 1/K and use the fact that N k(s, a) K, we have: h ≤ v u   Pk−1 1 (si , ai ) = (s, a) f(si ) uS ln SAKH(1 + 2HK√S)/δ i=0 h h h+1 0 t 2 { } s0∼P ?(s,a)f(s ) 2H + k E h k Nh (s, a) − ≤ Nh (s, a) K s s S ln(4H2S2K2A/δ) S ln(HSKA/δ) 4H k 8H k , ≤ Nh (s, a) ≤ Nh (s, a)

79 where the last inequality uses the simple inequality ln(4H2S2K2A) 4 ln(HSKA). This concludes the proof. ≤

7.4 An Improved Regret Bound

  In this section, we show that UCB-VI also enjoys the regret bound O˜ H2√SAK + H3S2A .

Theorem 7.6 (Improved Regret Bound for UCB-VI). The total expected regret is bounded as follows:

" K−1 # ? X πk E KV (s0) V (s0) − k=0   8eH2√SAK pln(SAH2K2) + 2eH3S2A ln(K) ln(SAH2K2) = O˜ H2√SAK + H3S2A . ≤ · ·

Note that comparing to the regret bound we had in the previous section, this new regret improves the S dependence in the leading order term (i.e., the term that contains √K), but at the cost of introducing a lower order term H3S2A. We first give an improved version of the model error presented in Lemma 7.2. Lemma 7.7. With probability at least 1 δ, for all h, k, s, a, and all f : [0,H], we have: − S 7→ 0 > 0 ? f(s ) 2  k ?  Es ∼Ph (·|s,a) 3H SL Pbh ( s, a) Ph ( s, a) f + k . ·| − ·| ≤ H Nh (s, a)

Below we prove this lemma. Recall the definition of Pbk(s0 s, a) in Eq. 0.1. We can use Azume-Bernstein’s inequality h | to bound Pbk(s0 s, a) P ?(s0 s, a) . | h | − h | | Lemma 7.8. With probability at least 1 δ, for all s, a, k, h, we have: − s 2P ?(s0 s, a)L 2L P k(s0 s, a) P ?(s0 s, a) h + , bh h k | k | − | ≤ Nh (s, a) Nh (s, a) where L := ln(SAKH/δ).

Proof: Consider a fixed tuple (s, a, h, k). Define h,i as the history starting from the beginning of iteration 0 to time iH i i i i step h at iteration i (including step h, i.e., up to sh, ah). Define random variables Xi i≥0 as Xi = 1 sh, ah, sh+1 = 0 i i ? 0 { } { s, a, s 1 sh, ah = s, a Ph (s s, a). We now show that Xi is a Martingale difference sequence. Note that } − { } | { } 2 E [Xi h,i] = 0. We have Xi 1. To use Azuma-Bernstein’s inequality, we note that E[Xi h,i] is bounded as: |H | | ≤ |H 2 i i ? 0 ? 0 E[Xi h,i] = 1 sh, ah = s, a Ph (s s, a)(1 Ph (s s, a)), |H { } | − | Pk−1 2 where we use the fact that the variance of a Bernoulli with parameter p is p(1 p). This means that i=0 E[Xi h,i] = ? 0 ? 0 k − |H Ph (s s, a)(1 Ph (s s, a))Nh (s, a). Now apply Bernstein’s inequality on the martingale difference sequence Xi , we have:| − | { }

k−1 s ? 0 ? 0 X 2Ph (s s, a)(1 Ph (s s, a)) ln(1/δ) 2 ln(1/δ) Xi | − | + ≤ N k(s, a) N k(s, a) i=0 h h s ? 0 2Ph (s s, a) ln(1/δ) 2 ln(1/δ) |k + k ≤ Nh (s, a) Nh (s, a)

80 Recall the definition of L := ln(SAHK/δ). Apply a union bound over all s , a , h [0,...,H 1], k [0,...,K 1], we conclude the proof. ∈ S ∈ A ∈ − ∈ − Now we are ready to prove Lemma 7.7. Proof:[Proof of Lemma 7.7] Let us condition on the event in Lemma 7.8 being true, of which the probability is at least 1 δ. − Take any function f : [0,H], we have: S 7→

>  k ?  Pb ( s, a) P ( s, a) f h ·| − h ·| s X X 2LP ?(s0 s, a)f 2(s0) X 2Lf(s0) Pbk(s0 s, a) P ?( s, a) f(s0) h | + ≤ h | − h ·| ≤ N k(s, a) N k(s, a) s0∈S s0∈S h s0∈S h

s ? 0 2 0 sP ? 0 2 0 X 2LP (s s, a)f (s ) 2SHL 0 2LP (s s, a)f (s ) 2SHL h | + √S s ∈S h | + ≤ N k(s, a) N k(s, a) ≤ N k(s, a) N k(s, a) s0∈S h h h h s 2 P ? 0 2 0 2 2 0 0 P (s s, a)f (s ) 0 ? f (s ) 2H SL s ∈S h 2SHL H SL Es ∼Ph (·|s,a) 2SHL = k 2| + k k + 2 + k Nh (s, a) · H Nh (s, a) ≤ Nh (s, a) H Nh (s, a) 0 0 ? f(s ) 2 Es ∼Ph (·|s,a) H SL 2SHL + k + k , ≤ H Nh (s, a) Nh (s, a)

where the third inequality uses that condition that f ∞ H, the fourth inequality uses Cauchy-Schwarz inequality, the fifth inequality uses the inequality that ab (ak2 +k b2)≤/2 for any real numbers a and b, and the last inequality uses ≤ the condition that f ∞ H again. This concludes the proof. k k ≤ So far in Lemma 7.7, we have not specify what function f we will use. The following lemma instantiate f to be q Vb k V ? . Recall the bonus definition, bk (s, a) := 2H L/N k(s, a) where L := ln(SAHK/δ). h+1 − h+1 h h

k ? Lemma 7.9. Assume the events in Lemma 7.7 and Lemma 7.3 are true. For all s, a, k, h, set f := Vbh+1 Vh+1, we have: −

> " H−1 #  k ?  k e X k k  k Pbh ( s, a) Ph ( s, a) f ξh(s, a) + E ξτ (sτ , aτ ) + bτ (sτ , aτ ) sh = s, ah = a, π ·| − ·| ≤ H τ=h+1

i 3H2SL where ξτ (s, a) = N i (s,a) , for any τ [0,...,H 1], i [0,...,K 1], and the expectation is with respect to the τ ∈ − ∈ − randomness from the event of following policy πk from (s, a) starting from time step h.

Proof: Using Lemma 7.7 and the fact that f := Vb k V ? , we have: h+1 − h+1

k 0 ? 0 > 0 ? (V (s ) V (s ))  k ?  Es ∼Ph (·|s,a) bh+1 h+1 k Pb ( s, a) P ( s, a) f − + ξ (s, a). (0.7) h ·| − h ·| ≤ H h

81 Now consider Vb k (s0) V ? (s0)), we have: h+1 − h+1 k 0 ? 0 k 0 0 k 0 ? 0 k 0 > ?  Vb (s ) V (s ) Vb (s ) r(s , π (s )) + P ( s , π (s )) V h+1 − h+1 ≤ h+1 − h+1 ·| h+2 0 k 0 k 0 k 0 > k 0 k 0 ? 0 k 0 > ?  r(s , π (s )) + Pbh+1( s , π (s )) Vbh+2 r(s , π (s )) + Ph+1( s , π (s )) Vh+2 ≤ ·| − ·|   = Pbk ( s0, πk(s0))>Vb k P ? ( s0, πk(s0))>Vb k + P ? ( s0, πk(s0))> Vb k V ? h+1 ·| h+2 − h+1 ·| h+2 h+1 ·| h+2 − h+2  >   = Pbk ( s0, πk(s0)) P ? ( s0, πk(s0)) Vb k V ? h+1 ·| − h+1 ·| h+2 − h+2  > + Pbk ( s0, πk(s0)) P ? ( s0, πk(s0)) V ? h+1 ·| − h+1 ·| h+2   + P ? ( s0, πk(s0))> Vb k V ? h+1 ·| h+2 − h+2

? 0 0 0 ? 0 0 > ? where the first inequality uses the fact that Vh+1(s ) = maxa0 (r(s , a ) + Ph+1( s , a ) Vh+2), the second inequality k k ·| uses the definition of Vbh+1 and π . Note that the second term in the RHS of the above inequality can be upper bounded exactly by the bonus using Lemma 7.3, and the first term can be further upper bounded using the same operation that we had in Lemma 7.7, i.e.,

? 0 k 0 >  k ?   >   Ph+1( s , π (s )) Vbh+2 Vh+2 Pbk ( s0, πk(s0)) P ? ( s0, πk(s0)) Vb k V ? ·| − + ξk (s0, πk(s0)). h+1 ·| − h+1 ·| h+2 − h+2 ≤ H h+1 Combine the above inequalities together, we arrive at:

  k 0 ? 0 k 0 k 0 k 0 k 0 1  k 00 ? 00  Vbh+1(s ) Vh+1(s ) bh+1(s , π (s )) + ξh+1(s , π (s )) + 1 + Es00∼P ? (s0,πk(s0)) Vbh+2(s ) Vh+2(s ) . − ≤ H h+1 −

We can recursively apply the same operation on Vb k (s00) V ? (s00) till step H, we have: h+2 − h+2 " H−1  τ−h−1 # k 0 ? 0 X 1 k k  0 k Vbh+1(s ) Vh+1(s ) E 1 + ξτ (sτ , aτ ) + bτ (sτ , aτ ) sh+1 = s , π − ≤ H τ=h+1 " H−1 # X k k  0 k eE ξτ (sτ , aτ ) + bτ (sτ , aτ ) sh+1 = s , π , ≤ τ=h+1 where the last inequality uses the fact that (1 + 1/H)x e for any x N+. Substitute the above inequality into Eq. 0.7, we get: ≤ ∈

> " H−1 #  k ?  k e X k k  k Pbh ( s, a) Ph ( s, a) f ξh(s, a) + E ξτ (sτ , aτ ) + bτ (sτ , aτ ) sh = s, ah = a, π ·| − ·| ≤ H τ=h+1

Now we are ready to derive a refined upper bound for per-episode regret.

Lemma 7.10. Assume the events in Lemma 7.9 and Lemma 7.4 (optimism) hold. For all k [0,K 1], we have: ∈ − H−1 ? πk X  k k  V (s0) V (s0) e Es ,a ∼dπk 2bh(sh, ah) + ξh(sh, ah) . − ≤ h h h h=0

82 Proof: Via optimism in Lemma 7.4, we have:

H−1 ? πk k πk X h k  k ?  k i V (s0) V (s0) Vb0 (s0) V0 (s0) Es ,a ∼dπk bh(sh, ah) + Pbh ( sh, ah) P ( sh, ah) Vbh+1 − ≤ − ≤ h h h ·| − ·| · h=0

H−1  > >  X k  k ?   k ?   k ?  ? = Es ,a ∼dπk bh(sh, ah) + Pbh ( sh, ah) P ( sh, ah) Vbh+1 Vh+1 + Pbh ( sh, ah) P ( sh, ah) Vh+1 h h h ·| − ·| − ·| − ·| h=0

H−1  >  X k  k ?   k ?  Es ,a ∼dπk 2bh(sh, ah) + Pbh ( sh, ah) P ( sh, ah) Vbh+1 Vh+1 ≤ h h h ·| − ·| − h=0 H−1 " " H−1 ## X k k e X k k  k Es ,a ∼dπk 2bh(sh, ah) + ξh(sh, ah) + E ξτ (sτ , aτ ) + bτ (sτ , aτ ) π ≤ h h h H h=0 τ=h+1 H−1 X  k k  e Es ,a ∼dπk 2bh(sh, ah) + ξh(sh, ah) ≤ h h h h=0 where the first inequality uses optimism, and the fourth inequality uses Lemma 7.9, and the last inequality rearranges by grouping same terms together.

To conclude the final regret bound, let us denote the event model as the joint event in Lemma 7.3 and Lemma 7.8, which holds with probability at least 1 2δ. E − Proof:[Proof of Theorem 7.6] Following the derivation we have in Eq. 0.5, we decompose the expected regret bound as follows:

"K−1 # " K−1 !# X ? πk X ? πk E V (s0) V (s0) E 1 model V (s0) V (s0) + 2δKH − ≤ {E } − k=0 k=0 "K−1 H−1 # X X  k k k k k k  eE 2bh(sh, ah) + ξh(sh, ah) + 2δKH, ≤ k=0 h=0 where the last inequality uses Lemma 7.10. Using Lemma 7.5, we have that:

K−1 H−1 1 X X k k k √ X X 2√ 2bh(sh, ah) 4H L q 8H SAKL. ≤ k k k ≤ k=0 h=0 k h Nh (sh, ah)

P P k k k For k h ξh(sh, ah), we have:

K K−1 H−1 K−1 H−1 H−1 N (s,a) X X X X 1 X X hX 1 ξk(sk , ak ) = 2H2SL = 2H2SL h h h N k(si , ak ) i k=0 h=0 k=0 h=0 h h h h=0 s,a i=1 2H2SALHS ln(K) = 2H3S2AL ln(K), ≤ K Pt where in the last inequality, we use the fact that Nh (s, a) K for all s, a, h, and the inequality i=1 1/i ln(t). Putting things together, and setting δ = 1/(KH), we conclude≤ the proof. ≤

7.5 Phased Q-learning to be added

83 7.6 Bibliographic Remarks and Further Readings

The first provably correct PAC algorithm for reinforcement learning (which finds a near optimal policy) was due to Kearns and Singh [2002], which provided the E3 algorithm; it achieves polynomial sample complexity in tabular MDPs. Brafman and Tennenholtz [2002] presents the Rmax algorithm which provides a refined PAC analysis over E3. Both are model based approaches [Kakade, 2003] improves the sample complexity to be O(S2A). Both E3 and Rmax uses the concept of absorbing MDPs to achieve optimism and balance exploration and exploitation. Jaksch et al. [2010] provides the first O(p(T )) regret bound, where T is the number of timesteps in the MDP (T is proportional to K in our setting); this dependence on T is optimal. Subsequently, Azar et al. [2017], Dann et al. [2017] provide algorithms that, asymptotically, achieve minimax regret bound in tabular MDPs. By this, we mean that for sufficiently large T (for T Ω( 2)), the results in Azar et al. [2017], Dann et al. [2017] obtain optimal dependencies on and . The requirement≥ |S| that T Ω( 2) before these bounds hold is essentially the requirement that non- trivial|S| model|A| accuracy is required. It is an≥ opent|S| question to remove this dependence. Lower bounds are provided in [Dann and Brunskill, 2015, Osband and Van Roy, 2016, Azar et al., 2017].

Further exploration strategies. Refs and discussion for Q-learning, reward free, and thompson sampling to be added...

84 Chapter 8

Linearly Parameterized MDPs

In this chapter, we consider learning and exploration in linearly parameterized MDPs—the linear MDP. Linear MDP generalizes tabular MDPs into MDPs with potentially infinitely many state and action pairs. This chapter largely follows the model and analysis first provided in [Jin et al., 2020].

8.1 Setting

We consider episodic finite horizon MDP with horizon H, = , , rh h, Ph h, H, s0 , where s0 is a fixed M {S A { } { } } initial state, rh : [0, 1] and Ph : ∆( ) are time-dependent reward function and transition kernel. Note that for time-dependentS × A 7→ finite horizonS MDP, × A the 7→ optimalS policy will be time-dependent as well. For simplicity, we π π overload notations a bit and denote π = π0, . . . , πH−1 , where each πh : . We also denote V := V (s0), { } S 7→ A 0 i.e., the expected total reward of π starting at h = 0 and s0. We define the learning protocol below. Learning happens in an episodic setting. Every episode k, learner first proposes a policy πk based on all the history information up to the end of episode k 1. The learner then executes πk in the k k k H−1 − k k k k k underlying MDP to generate a single trajectory τ = sh, ah h=0 with ah = πh(sh) and sh+1 Ph( sh, ah). The goal of the learner is to minimize the following cumulative{ regret} over N episodes: ∼ ·|

"K−1 # X  ? πk  Regret := E V V , − k=0 where the expectation is with respect to the randomness of the MDP environment and potentially the randomness of the learner (i.e., the learner might make decisions in a randomized fashion).

8.1.1 Low-Rank MDPs and Linear MDPs

Note that here we do not assume and are finite anymore. Indeed in this note, both of them could be continuous. Without any further structural assumption,S A the lower bounds we saw in the Generalization Lecture forbid us to get a polynomially regret bound. The structural assumption we make in this note is a linear structure in both reward and the transition.

Definition 8.1 (Linear MDPs). Consider transition Ph and rh h. A linear MDP has the following structures on rh { } { } 85 and Ph:

? ? rh(s, a) = θ φ(s, a),Ph( s, a) = µ φ(s, a), h h · ·| h ∀ d ? |S|×d ? where φ is a known state-action feature map φ : R , and µh R . Here φ, θh are known to the learner, ? S × A 7→ ∈ while µ is unknown. We further assume the following norm bound on the parameters: (1) sups,a φ(s, a) 2 1, (2) > ? ? k k ≤ v µh 2 √d for any v such that v ∞ 1, and all h, and (3) θh 2 W for all h. We assume rh(s, a) [0, 1] kfor all hk and≤ s, a. k k ≤ k k ≤ ∈

|S|×|S||A| ? The model essentially says that the transition matrix Ph R has rank at most d, and Ph = µhΦ. where ∈ Φ Rd×|S||A| and each column of Φ corresponds to φ(s, a) for a pair s, a . ∈ ∈ S × A

Linear Algebra Notations For real-valued matrix A, we denote A 2 = supx:kxk2=1 Ax 2 which denotes the k k 2 P 2k k maximum singular value of A. We denote A F as the Frobenius norm A F = i,j Ai,j where Ai,j denotes the k k > k k2 i, j’th entry of A. For any Positive Definite matrix Λ, we denote x Λx = x Λ. We denote det(A) as the determinant Qd k k of the matrix A. For a PD matrix Λ, we note that det(Λ) = i=1 σi where σi is the eigenvalues of Λ. For notation simplicity, during inequality derivation, we will use ., h to suppress all absolute constants. We will use Oe to suppress all absolute constants and log terms.

8.2 Planning in Linear MDPs

We first study how to do value iteration in linear MDP if µ is given.

? ? ? ? ? We start from QH−1(s, a) = θH−1 φ(s, a), and πH−1(s) = argmaxa QH−1(s, a) = argmaxa θH−1 φ(s, a), and ? ? · · VH−1(s) = argmaxa QH−1(s, a). Now we do dynamic programming from h + 1 to h:

? ? ? 0 ? ? ? ? > ? Qh(s, a) = θh φ(s, a) + Es∼Ph(·|s,a)Vh+1(s ) = θh φ(s, a) + Ph( s, a) Vh+1 = θh φ(s, a) + (µhφ(s, a)) Vh+1 · · ·| · · (0.1) ? ? > ?  = φ(s, a) θ + (µ ) V = φ(s, a) wh, (0.2) · h h h+1 · ? ? > ? ? where we denote wh := θh + (µh) Vh+1. Namely we see that Qh(s, a) is a linear function with respect to φ(s, a)! ? ? ? ? We can continue by defining πh(s) = argmaxa Qh(s, a) and Vh (s) = maxa Qh(s, a). ? ? At the end, we get a sequence of linear Q , i.e., Qh(s, a) = wh φ(s, a), and the optimal policy is also simple, ? · π (s) = argmax wh φ(s, a), for all h = 0,...,H 1. h a · − One key property of linear MDP is that a Bellman Backup of any function f : R is a linear function with respect to φ(s, a). We summarize the key property in the following claim. S 7→

Claim 8.2. Consider any arbitrary function f : [0,H]. At any time step h [0,...H 1], there must exist a S 7→ ∈ − w Rd, such that, for all s, a : ∈ ∈ S × A > rh(s, a) + Ph( s, a) f = w φ(s, a). ·| ·

The proof of the above claim is essentially the Eq. 0.1.

86 8.3 Learning Transition using Ridge Linear Regression

In this section, we consider the following simple question: given a dataset of state-action-next state tuples, how can we learn the transition Ph for all h?

Note that µ? R|S|×d. Hence explicitly writing down and storing the parameterization µ? takes time at least . We show that we∈ can represent the model in a non-parametric way. |S| We consider a particular episode n. Similar to Tabular-UCBVI, we learn a model at the very beginning of the episode n using all data from the previous episodes (episode 1 to the end of the episode n 1). We denote such dataset as: − n−1 n = si , ai , si . Dh h h h+1 i=0

We maintain the following statistics using n: Dh n−1 n X i i i i > Λh = φ(sh, ah)φ(sh, ah) + λI, i=0 where λ R+ (it will be set to 1 eventually, but we keep it here for generality). ∈ To get intuition of Λn, think about the tabular setting where φ(s, a) is a one-hot vector (zeros everywhere except that n n the entry corresponding to (s, a) is one). Then Λh is a diagonal matrix and the diagonal entry contains N (s, a)—the number of times (s, a) has been visited. We consider the following multi-variate linear regression problem. Denote δ(s) as a one-hot vector that has zero i i i i everywhere except that the entry corresponding to s is one. Denote h = P ( sh, ah) δ(sh+1). Conditioned on i i ·| − history h (history h denotes all information from the very beginning of the learning process up to and including i i H H (sh, ah)), we have:

 i i  E h h = 0, |H i i i i i i simply because s is sampled from Ph( s , a ) conditioned on (s , a ). Also note that  1 2 for all h, i. h+1 ·| h h h h k hk ≤ ? i i i i i i i i i Since µhφ(sh, ah) = Ph( sh, ah), and δ(sh+1) is an unbiased estimate of Ph( sh, ah) conditioned on sh, ah, it is ? ·| i i i ·| reasonable to learn µ via regression from φ(sh, ah) to δ(sh+1). This leads us to the following ridge linear regression:

n−1 n X i i i 2 2 µ = argmin |S|×d µφ(s , a ) δ(s ) + λ µ . bh µ∈R h h − h+1 2 k kF i=0

Ridge linear regression has the following closed-form solution:

n−1 n X i i i > n −1 µbh = δ(sh+1)φ(sh, ah) (Λh) (0.3) i=0

n |S|×d n Note that µbh R , so we never want to explicitly store it. Note that we will always use µbh together with a specific ∈ n n s, a pair and a value function V (think about value iteration case), i.e., we care about Pbh ( s, a) V := (µbhφ(s, a)) V , which can be re-written as: ·| · ·

n−1 X Pbn( s, a) V := (µnφ(s, a)) V = φ(s, a)> (Λn)−1φ(si , ai )V (si ), h ·| · bh · h h h h+1 i=0

87 > n where we use the fact that δ(s) V = V (s). Thus the operator Pbh ( s, a) V simply requires storing all data and can be computed via simple linear algebra and the computation complexity·| is simply· poly(d, n)—no poly dependency on . |S| n ? Let us calculate the difference between µbh and µh. ? Lemma 8.3 (Difference between µbh and µh). For all n and h, we must have: n−1 −1 X −1 µn µ? = λµ? (Λn) + i φ(si , ai )> (Λn) . bh − h − h h h h h h i=1

n Proof: We start from the closed-form solution of µbh: n−1 n−1 X X µn = δ(si )φ(si , ai )>(Λn)−1 = (P ( si , ai ) + n)φ(si , ai )>(Λn)−1 bh h+1 h h h ·| h h h h h h i=0 i=0 n−1 n−1 n−1 X ? i i i i i > n −1 X ∗ i i i i > n −1 X i i i > n −1 = (µhφ(sh, ah) + h)φ(sh, ah) (Λh) = µhφ(sh, ah)φ(sh, ah) (Λh) + hφ(sh, ah) (Λh) i=0 i=0 i=0 n−1 n−1 X ? i i i i > n −1 X i i i > n −1 = µhφ(sh, ah)φ(sh, ah) (Λh) + hφ(sh, ah) (Λh) i=0 i=0 n−1 n−1 X X = µ? (Λn λI)(Λn)−1 + i φ(si , ai )>(Λn)−1 = µ∗ λµ? (Λn)−1 + i φ(si , ai )>(Λn)−1. h h − h h h h h h − h h h h h h i=0 i=0 Rearrange terms, we conclude the proof. Lemma 8.4. Fix V : [0,H]. For all n and s, a , and h, with probability at least 1 δ, we have: S 7→ ∈ S × A − n−1 r X H det(Λn)1/2 det(λI)−1/2 φ(si , ai )(V >i ) 3H ln h . h h h ≤ δ i=0 n −1 (Λh )

> i Proof: We first check the noise terms V h h,i. Since V is independent of the data (it’s a pre-fixed function), and by linear property of expectation, we have:{ }

 > i i  > i i E V h h = 0, V h V ∞ h 1 2H, h, i. |H | | ≤ k k k k ≤ ∀ Hence, this is a Martingale difference sequence. Using the Self-Normalized vector-valued Martingale Bound (Lemma A.9), we have that for all n, with probability at least 1 δ: − n−1 r X det(Λn)1/2 det(λI)−1/2 φ(si , ai )(V >i ) 3H ln h . h h h ≤ δ i=0 n −1 (Λh ) Apply union bound over all h [H], we get that with probability at least 1 δ, for all n, h: ∈ − n−1 r X H det(Λn)1/2 det(λI)−1/2 φ(si , ai )(V >i ) 3H ln h . (0.4) h h h ≤ δ i=0 n −1 (Λh )

88 8.4 Uniform Convergence via Covering

Now we take a detour first and consider how to achieve a uniform convergence result over a function class that contains infinitely many functions. Previously we know how to get uniform convergence if is finite—we simplyF do a union bound. However, when contains infinitely many functions, we cannot simply applyF a union bound. We will use the covering argument here.F

d + Consider the following ball with radius R: Θ = θ R : θ 2 R R . Fix an . An -net  Θ is a set 0 { ∈ k k0 ≤ ∈ } N ⊂ such that for any θ Θ, there exists a θ , such that θ θ 2 . We call the smallest -net as -cover. Abuse ∈ ∈ N k − k ≤ notations a bit, we simply denote  as the -cover. N The -covering number is the size of -cover . We define the covering dimension as ln (  ) N |N |

d + d Lemma 8.5. The -covering number of the ball Θ = θ R : θ 2 R R is upper bounded by (1 + 2R/) . { ∈ k k ≤ ∈ }

We can extend the above definition to a function class. Specifically, we look at the following function. For a triple of d (w, β, Λ) where w R and w 2 L, β [0,B], and Λ such that σmin(Λ) λ, we define fw,β,Λ : [0,R] as follows: ∈ k k ≤ ∈ ≥ S 7→

  q   > > −1 fw,β,Λ(s) = min max w φ(s, a) + β φ(s, a) Λ φ(s, a) ,H , s . (0.5) a ∀ ∈ S

We denote the function class as: F

= fw,β,Λ : w 2 L, β [0,B], σmin(Λ) λ . (0.6) F { k k ≤ ∈ ≥ }

Note that contains infinitely many functions as the parameters are continuous. However we will show that it has finite coveringF number that scales exponentially with respect to the number of parameters in (w, β, Λ).

Why do we look at ? As we will see later in this chapter contains all possible Qbh functions one could encounter during the learning process.F F

Lemma 8.6 (-covering dimension of ). Consider defined in Eq. 0.6. Denote its -cover as  with the `∞ norm F F N as the distance metric, i.e., d(f1, f2) = f1 f2 ∞ for any f1, f2 . We have that: k − k ∈ F

2 2 2 ln (  ) d ln(1 + 6L/) + ln(1 + 6B/(√λ)) + d ln(1 + 18B √d/(λ )). |N | ≤

Note that the -covering dimension scales quadratically with respect to d. Proof: We start from building a net over the parameter space (w, β, Λ), and then we convert the net over parameter space to an -net over under the `∞ distance metric. F 89 We pick two functions that corresponding to parameters (w, β, Λ) and (w, ˆ β,ˆ Λ)b .  q   q  > > f(s) fˆ(s) max w φ(s, a) + β φ(s, a)>Λ−1φ(s, a) max wˆ φ(s, a) + βˆ φ(s, a)>Λˆ −1φ(s, a) | − | ≤ a − a  q   q  > > max w φ(s, a) + β φ(s, a)>Λ−1φ(s, a) wˆ φ(s, a) + βˆ φ(s, a)>Λˆ −1φ(s, a) ≤ a − q > max (w wˆ) φ(s, a) + max (β βˆ) φ(s, a)>Λ−1φ(s, a) ≤ a − a −

q q + max βˆ( φ(s, a)>Λ−1φ(s, a) φ(s, a)>Λˆ −1φ(s, a)) a − r ˆ > −1 −1 w wˆ 2 + β β /√λ + B φ(s, a) (Λ Λˆ )φ(s, a) ≤ k − k | − | − q ˆ −1 −1 w wˆ 2 + β β /√λ + B Λ Λˆ F ≤ k − k | − | k − k −1 −1 Note that Λ is a PD matrix with σmax(Λ ) 1/λ. ≤ √ Now we consider the /3-Net /3,w over w : w 2 L , √λ/3-net λ/3,β over interval [0,B] for β, and 2 2 N { k k ≤ } N  /(9B )-net 2 over Λ: Λ F √d/λ . The product of these three nets provide a -cover for , which N /(9B),Λ { k k ≤ } F means that that size of the -net  for is upper bounded as: N F √ ln  ln + ln + ln 2 2 |N | ≤ |N/3,w| |N λ/3,β| |N /(9B ),Λ| d ln(1 + 6L/) + ln(1 + 6B/(√λ)) + d2 ln(1 + 18B2√d/(λ2)). ≤

Remark Covering gives a way to represent the complexity of function class (or hypothesis class). Relating to VC, covering number is upper bound roughly by exp(d) with d being the VC-dimension. However, there are cases where VC-dimensional is infinite, but covering number if finite. Now we can build a uniform convergence argument for all f . ∈ F Lemma 8.7 (Uniform Convergence Results). Set λ = 1. Fix δ (0, 1). For all n, h, all s, a, and all f , with probability at least 1 δ, we have: ∈ ∈ F − r !   q q H n √ 2√ Pbh ( s, a) P ( s, a) f . H φ(s, a) (Λn)−1 d ln(1 + 6L N) + d ln(1 + 18B dN) + ln . ·| − ·| · k k h δ

Proof: Recall Lemma 8.4, we have with probability at least 1 δ, for all n, h, for a pre-fixed V (independent of the random process): −

n−1 2 X H det(Λn)1/2 det(λI)−1/2  H  φ(si , ai )(V >i ) 9H2ln h 9H2 ln + d ln (1 + N) h h h ≤ δ ≤ δ i=1 n −1 (Λh ) n where we have used the fact that φ 2 1, λ = 1, and Λ 2 N + 1. k k ≤ k hk ≤ Denote the -cover of as . With an application of a union bound over all functions in , we have that with F N N probability at least 1 δ, for all V , all n, h, we have: − ∈ N n−1 2   X i i > i 2 H φ(s , a )(V  ) 9H ln + ln (  ) + d ln (1 + N) . h h h ≤ δ |N | i=1 n −1 (Λh )

90 Recall Lemma 8.6, substitute the expression of ln  into the above inequality, we get: |N |

n−1 2 X  H  φ(si , ai )(V >i ) 9H2 ln + d ln(1 + 6L/) + d2 ln(1 + 18B2√d/2) + d ln (1 + N) . h h h ≤ δ i=1 n −1 (Λh )

Now consider an arbitrary f . By the definition of -cover, we know that for f, there exists a V , such that ∈ F ∈ N f V ∞ . Thus, we have: k − k ≤

n−1 2 n−1 2 n−1 2 X X X φ(si , ai )(f >i ) 2 φ(si , ai )(V >i ) + 2 φ(si , ai )((V f)>i ) h h h ≤ h h h h h − h i=1 n −1 i=1 n −1 i=1 n −1 (Λh ) (Λh ) (Λh ) n−1 2 X 2 φ(si , ai )(V >i ) + 82N ≤ h h h i=1 n −1 (Λh )  H  9H2 ln + d ln(1 + 6L/) + d2 ln(1 + 18B2√d/2) + d ln (1 + N) + 82N, ≤ δ

Pn−1 i i > i 2 2 where in the second inequality we use the fact that i=1 φ(sh, ah)(V f) h (Λn)−1 4 N, which is from k − k h ≤ 2 n−1 n−1 42 n−1 X i i > i X i i 2 X i i 2 φ(sh, ah)(V f) h φ(sh, ah)2 (Λn)−1 φ(sh, ah) 2 4 N. − ≤ k k h ≤ λ k k ≤ i=1 n −1 i=1 i=1 (Λh )

Set  = 1/√N, we get:

n−1 2 X  H  φ(si , ai )(f >i ) 9H2 ln + d ln(1 + 6L√N) + d2 ln(1 + 18B2√dN) + d ln (1 + N) + 8 h h h ≤ δ i=1 n −1 (Λh )  H  H2 ln + d ln(1 + 6L√N) + d2 ln(1 + 18B2√dN) , . δ where we recall . ignores absolute constants.   Now recall that we can express Pbn( s, a) P ( s, a) f = φ(s, a)>(µn µ? )>f. Recall Lemma 8.3, we have: h ·| − ·| · bh − h

n−1 −1 > X (µnφ(s, a) µ? φ(s, a)) f λφ(s, a)> (Λn) (µ? ) f + φ(s, a)>(Λn)−1φ(si , ai )(i )>f | bh − h · | ≤ h h h h h h i=1 s   H  √ n 2 √ 2 2√ . H d φ(s, a) (Λn)−1 + φ(s, a) (Λ )−1 H ln + d ln(1 + 6L N) + d ln(1 + 18B dN) k k h k k h δ r ! H q q √ 2√ h H φ(s, a) (Λn)−1 ln + d ln(1 + 6L N) + d ln(1 + 18B dN) . k k h δ

91 8.5 Algorithm

Our algorithm, Upper Confidence Bound Value Iteration (UCB-VI) will use reward bonus to ensure optimism. Specif- ically, we will the following reward bonus, which is motivated from the reward bonus used in linear bandit: q n > n −1 bh(s, a) = β φ(s, a) (Λh) φ(s, a), (0.7) where β contains poly of H and d, and other constants and log terms. Again to gain intuition, please think about what this bonus would look like when we specialize linear MDP to tabular MDP.

Algorithm 6 UCBVI for Linear MDPs 1: Input: parameters β, λ 2: for n = 1 ...N do n 3: Compute Pbh for all h (Eq. 0.3) n 4: Compute reward bonus bh for all h (Eq. 0.7) n n H−1 5: Run Value-Iteration on Pb , rh + b (Eq. 0.8) { h h}h=0 6: Set πn as the returned policy of VI. 7: end for

n With the above setup, now we describe the algorithm. Every episode n, we learn the model µbh via ridge linear regression. We then form the quadratic reward bonus as shown in Eq. 0.7. With that, we can perform the following truncated Value Iteration (always truncate the Q function at H): n VbH (s) = 0, s, ∀ q Qbn(s, a) = θ? φ(s, a) + β φ(s, a)>(Λn)−1φ(s, a) + φ(s, a)>(µn)>Vb n h · h bh h+1 q > n −1 ? n > n > = β φ(s, a) (Λh) φ(s, a) + (θ + (µbh) Vbh+1) φ(s, a), n n n n Vbh (s) = min max Qbh(s, a),H , πh (s) = argmaxa Qbh(s, a). (0.8) { a } n n Note that above Qbh contains two components: a quadratic component and a linear component. And Vbh has the format of fw,β,Λ defined in Eq. 0.5.

n The following lemma bounds the norm of linear weights in Qbh. n n Lemma 8.8. Assume β [0,B]. For all n, h, we have Vbh is in the form of Eq. 0.5, and Vbh falls into the following class: ∈ HN = fw,β,Λ : w 2 W + , β [0,B], σmin(Λ) λ . (0.9) V { k k ≤ λ ∈ ≥ }

? n > n Proof: We just need to show that θ + (µbh) Vbh+1 has its `2 norm bounded. This is easy to show as we always have n Vb ∞ H as we do truncation at Value Iteration: k h+1k ≤ ? n > n n > n θ + (µh) Vbh+1 W + (µh) Vbh+1 . b 2 ≤ b 2 n Now we use the closed-form of µbh from Eq. 0.3: n−1 n−1 n > n X n i i i > n −1 n −1 X i i Hn (µh) Vbh+1 = Vbh+1(sh+1)φ(sh, ah) (Λh) H (Λh) φ(sh, ah) , b 2 ≤ ≤ λ i=1 2 i=0 2 n −1 where we use the fact that Vb ∞ H, σmax(Λ ) 1/λ, and sup φ(s, a) 2 1. k h+1k ≤ ≤ s,a k k ≤ 92 8.6 Analysis of UCBVI for Linear MDPs

In this section, we prove the following regret bound for UCBVI.

Theorem 8.9 (Regret Bound). Set β = Oe (Hd), λ = 1. UCBVI (Algorithm 6) achieves the following regret bound:

" N # ? X πn  2  E NV V Oe H √d3N − ≤ i=0

The main steps of the proof are similar to the main steps of UCBVI in tabular MDPs. We first prove optimism via induction, and then we use optimism to upper bound per-episode regret. Finally we use simulation lemma to decompose the per-episode regret. In this section, to make notation simple, we set λ = 1 directly.

8.6.1 Proving Optimism

Proving optimism requires us to first bound model error which we have built in the uniform convergence result shown n in Lemma 8.7, namely, the bound we get for (Pbh ( s, a) P ( s, a)) f for all f . Recall Lemma 8.7 but this time replacing by defined in Eq. 0.9. With probability·| at− least·|1 δ, for· all n, h, s,∈ aVand for all f , F V − ∈ V r ! H q q n √ 2√ (Pbh ( s, a) P ( s, a)) f H φ(s, a) (Λn)−1 ln + d ln(1 + 6(W + HN) N) + d ln(1 + 18B dN) ·| − ·| · ≤ k k h δ r ! H p 2 p 2 . Hd φ(s, a) (Λn)−1 ln + ln(WN + HN ) + ln(B dN) k k h δ r ! H p √ √ √ . φ(s, a) (Λn)−1 Hd ln + ln(W + H) + ln B + ln d + ln N . k k h δ | {z } :=β

Denote the above inequality as event model. Below we are going to condition on model being hold. Note that here for notation simplicity, we denote E E

r ! H p β = Hd ln + ln(W + H) + √ln B + √ln d + √ln N = Oe (Hd) . δ

remark Note that in the definition of (Eq. 0.9), we have β [0,B]. And in the above formulation of β, note that B appears inside a log term. So weV need to set B such that β∈ B and we can get the correct B by solving the inequality β B for B. ≤ ≤

Lemma 8.10 (Optimism). Assume event model is true. for all n and h, E Vb n(s) V ?(s), s. h ≥ h ∀

93 n ? Proof: We consider a fixed episode n. We prove via induction. Assume that Vbh+1(s) Vh+1(s) for all s. For time step h, we have: ≥

ˆn ? Qh(s, a) Qh(s, a) − q = θ? φ(s, a) + β φ(s, a)>(Λn)−1φ(s, a) + φ(s, a)>(µn)>Vb n θ? φ(s, a) φ(s, a)>(µ? )>V ? · h bh h+1 − · − h h+1 q > β φ(s, a)>(Λn)−1φ(s, a) + φ(s, a)> (µn µ? ) Vb n , ≥ h bh − h h+1

n ? ? where in the last inequality we use the inductive hypothesis that Vbh+1(s) Vh+1(s), and µhφ(s, a) is a valid distri- n ≥ bution (note that µbhφ(s, a) is not necessarily a valid distribution). We need to show that the bonus is big enough to > n ? > n offset the model error φ(s, a) (µ µ ) Vb . Since we have event model being true, we have that: bh − h h+1 E

n n n (Pbh ( s, a) P ( s, a)) Vbh+1 . β φ(s, a) (Λ )−1 , ·| − ·| · k k h as by the construction of , we know that Vb n . V h+1 ∈ V This concludes the proof.

8.6.2 Regret Decomposition

Now we can upper bound the per-episode regret as follows:

? πn n πn V V Vb (s0) V (s0). − ≤ 0 − 0 We can further bound the RHS of the above inequality using simulation lemma. Recall Eq. 0.4 that we derived in the note for tabular MDP (Chapter 7:

H−1 h   i n πn X n n n Vb0 (s0) V0 (s0) Es,a∼dπn bh(s, a) + Pbh ( s, a) P ( s, a) Vbh+1 . − ≤ h ·| − ·| · h=0

(recall that the simulation lemma holds for any MDPs—it’s not specialized to tabular).

 n  n In the event , we already know that for any s, a, h, n, we have P ( s, a) P ( s, a) V β φ(s, a) n −1 = model bh bh+1 . (Λh ) n E ·| − ·| · k k b (s, a). Hence, under model, we have: h E H−1 H−1 n πn X n X n Vb0 (s0) V0 (s0) Es,a∼dπn [2bh(s, a)] . Es,a∼dπn [bh(s, a)] . − ≤ h h h=0 h=0

Sum over all episodes, we have the following statement.

Lemma 8.11 (Regret Bound). Assume the event model holds. We have: E N−1 N−1 H−1 X ? πn X X n n n (V0 (s0) V0 (s0)) Esn,an∼dπn [bh(sh, ah)] − ≤ h h h n=0 n=0 h=0

94 8.6.3 Concluding the Final Regret Bound

We first consider the following elliptical potential argument, which is similar to what we have seen in the linear bandit lecture. i i Lemma 8.12 (Elliptical Potential). Consider an arbitrary sequence of state action pairs sh, ah. Assume sups,a φ(s, a) 2 n Pn−1 i i i i > k k ≤ 1. Denote Λh = I + i=0 φ(sh, ah)φ(sh, ah) . We have:

N−1 ! X det(ΛN+1) φ(si , ai )(Λi )−1φ(si , ai ) 2 ln h 2d ln(N). h h h h h ≤ det(I) . i=0

Proof: By the Lemma 3.7 and 3.8 in the linear bandit lecture note,

N N X X φ(si , ai )(Λi )−1φ(si , ai ) 2 ln(1 + φ(si , ai )(Λi )−1φ(si , ai )) h h h h h ≤ h h h h h i=1 i=1 ! det(ΛN+1) 2 ln h ≤ det(I) N + 1 2d ln(1 + ) 2d ln(N). ≤ dλ . where the first inequality uses that for 0 y 1, ln(1 + y) y/2. ≤ ≤ ≥ Now we use Lemma 8.11 together with the above inequality to conclude the proof. Proof:[Proof of main Theorem 8.9]

We split the expected regret based on the event model. E " N # " N !# ? X πn ? X πn E NV V = E 1 model holds NV V − n=1 {E } − n=1 " N !# ? X πn + E 1 model doesn’t hold NV V {E } − n=1 " N !# ? X πn E 1 model holds NV V + δNH ≤ {E } − n=1 " N H−1 # X X n n n . E bh(sh, ah) + δNH. n=1 h=0 Note that: N H−1 N H−1 q X X n n n X X n n > n −1 n n bh(sh, ah) = β φ(sh, ah) (Λh) φ(sh, ah) n=1 h=0 n=1 h=0 H−1 N q X X n n > n −1 n n = β φ(sh, ah) (Λh) φ(sh, ah) h=0 n=1 v H−1 u N X u X p β tN φ(sn, an)(Λn)−1φ(sn, an) βH Nd ln(N). ≤ h h h h h . h=0 n=1

Recall that β = Oe(Hd). This concludes the proof.

95 8.7 Bibliographic Remarks and Further Readings

There are number of ways to linearly parameterize an MDP such that it permits for efficient reinforcement learning (both statistically and computationally). The first observation that such assumptions lead to statistically efficient algorithms was due to [Jiang et al., 2017] due to that these models have low Bellman rank (as we shall see in Chapter 9). The first statistically and computationally efficient algorithm for a linearly parameterized MDP model was due to [Yang and Wang, 2019a,b]. Subsequently, [Jin et al., 2020] provided a computationally and statistically efficient algorithm for simplified version of this model, which is the model we consider here. The model of [Modi et al., 2020, Jia et al., 2020, Ayoub et al., 2020, Zhou et al., 2020] provides another linearly parameterized model, which can viewed as parameterizing P (s0 s, a) as a linear combination of feature functions φ(s, a, s0). One notable aspect of the model | ? we choose to present here, where Ph( s, a) = µhφ(s, a), is that this model has a number of free parameters that is d (note that µ is unknown and is of·| size d), and yet the statistical complexity does not depend on . Notably, |S|·this implies that accurate model estimation request|S|· O( ) samples, while the regret for reinforcement learning|S| is only polynomial in d. The linearly parameterized models of|S| [Modi et al., 2020, Jia et al., 2020, Ayoub et al., 2020, Zhou et al., 2020] are parameterized by O(d) parameters, and, while O(d) free parameters suggests lower model capacity (where accurate model based estimation requires only polynomial in d samples), these models are incomparable to the linearly parameterized models presented in this chapter; It is worth observing that all of these models permit statistically efficient estimation due to that they have bounded Bellman rank [Jiang et al., 2017] (and bounded Witness rank [Sun et al., 2019a]), a point which we return to in the next Chapter. The specific linear model we consider here was originally introduced by [Jin et al., 2020]. The non-parametric model- based algorithm we study here was first introduced by [Lykouris et al., 2019] (but under the context of adversarial attacks). The analysis we present here does not easily extend to infinite dimensional feature φ (e.g., RBF kernel); here, [Agarwal et al., 2020a] provide an algorithm and an analysis that extends to infinite dimensional φ, i.e. where we have a Reproducing Kernel Hilbert Space (RKHS) and the regret is based on the concept of Information Gain.

96 Chapter 9

Parametric Models with Bounded Bellman Rank

Our previous lectures on exploration in RL focused on the UCBVI algorithm designed for the tabular MDPs and Linear MDPs. While linear MDPs extends tabular MDPs to the function approximation regime, it is still limited in linear function approximation, and indeed the assumption that a Bellman backup of any function is still a linear function is a strong assumption. In this chapter, we consider the setting beyond tabular and linear representation. We aim to design algorithm with general function approximation that works for a large family of MDPs that subsumes not only tabular MDPs and linear MDPs, but also other models such as Linear function approximation with Bellman Completion (this generalizes linear MDPs), reactive predictive state representation (PSRs), and reactive Partially Observable Markov Decision Process (POMDPs).

9.1 Problem setting

We consider finite-horizon episodic time-dependent Markov Decision Process (MDP) = ( h, h,Ph, s0, r, H) M {S} {A} where h is the state space for time step h and we assume 0, 2,..., H−1 are disjoint ( 0, 2,..., H−1 are dis- S S S S A A A joint as well). We assume 0 = s0 though we can generalize to an arbitrary 0 with an initial state distribution. S { } S

Remark This setting indeed generalizes our previous finite horizon setting where we have a fixed and , but S A time-dependent Ph and rh, as we just need to add time step h into state space to create h, i.e., every state s h only contains h. The benefit of using the above slightly more general notation isS that this allowsS us to ignore h in∈P, S r, and in π, V and Q as now state action contains time step h. The goal of an agent is to maximize the cumulative expected reward it obtains over H steps:

π max V (s0). π In this chapter, we focus on a PAC (Probably Approximately Correct) guarantee. Namely, our goal is to find a policy π ? πˆ such that V (s0) V (s0). ≥ We make the following boundedness assumption on the rewards. PH−1 Assumption 9.1. Almost surely, for any trajectory τ and step h, 0 rh 1. Additionally, 0 h=0 rh 1 almost surely for any trajectory τ. ≤ ≤ ≤ ≤

97 While the first part of the assumption is the standard boundedness assumption we have made throughout, the second assumes that the trajectory level rewards are also bounded by 1, instead of H, which is helpful for capturing sparse- reward goal-directed problems with rewards only at one point in a successful trajectory. While normalization of the trajectory level reward also keeps the net reward bounded, this makes the total reward only scale as 1/H if the rewards are sparse along the trajectory.

9.2 Value-function approximation

We consider a model-free, value function based approach here. More specifically, denote = h h and := h h, we assume that we are given the following function class: S ∪ S A ∪ A

= f : [0, 1] . F { S × A → }

Since we want to learn a near-optimal behavior, we seek to approximate the Q-value function of the optimal policy, namely Q? using f . To this end, we start with a simplifying assumption that Q? lies in . In practice, this can be weakened to having∈ F a good approximation for Q? in , but we focus on exact containmentF for the cleanest setting. Formally, we make the following realizability assumption.F Assumption 9.2 (Value-function realizability). The function class satisfies Q? . F ∈ F Armed with this assumption, we may ask whether we can find Q? using a number of samples which does not scale as , trading it off for a statistical complexity measure for such as ln . The next result, adapted from Krishna- |Xmurthy | et al. [2016] shows that this is not possible. F |F| p Theorem 9.3. Fix H,K N with K 2 and  (0, 1/8]. For any algorithm, there exists an MDP with a horizon of H and K actions, a class∈ of predictors≥ with∈ = KH and Q? and a constant c > 0 such that the F |F| ? ∈ F probability that the algorithm outputs a policy πb with V (πb) V  after collecting T trajectories from the MDP is at most 2/3 for all T cKH /2. ≥ − ≤ In words, the theorem says that for any algorithm, there exists an MDP where it cannot find a good policy in fewer than an exponential number of samples in the planning horizon, even when Q? Furthermore, the size of the class required for this result is KH , so that a logarithmic dependence on will not∈ Fexplain the lower bound. The lower boundF construction basically uses the binary tree example that we have|F| seen in the Generalization chapter. The lower bound indicates that in order to learn in polynomial sample complexity, we need additional assumptions. While we have seen that polynomially sample complexity is possible in linear MDPs and tabular MDPs, our goal in this chapter is to significantly weaken the structural assumption on the MDPs.

9.3 Bellman Rank

Having concluded that we cannot find a near optimal policy using a reasonable number of samples with just the realizability assumption, it is clear that additional structural assumptions on the problem are required in order to make progress. We now give one example of such a structure, named Bellman rank, which was introduced by Jiang et al. [2017]. In order to motivate and define this quantity, we need some additional notation. For a function f , let ∈ F us define πf (x) = argmaxa∈A f(x, a). Namely πf is the greedy action selector with respect to f. For a policy π, function f and h [H], let us also define the average Bellman error of the function approximator f: ∈ F ∈   (f; π, h) = π f(s , a ) r max f(s , a) . (0.1) Esh,ah∼dh h h h Esh+1∼P (·|sh,ah) h+1 E − − a∈Ah+1

98 This is called the average Bellman error as it is not the error on an individual state action s, a, but an expected error under the state-action distribution induced by the policy π. To see why the definition might be natural, we note that the following property of Q? from the Bellman optimality equations. Fact 9.4. (Q?, π, h) = 0, for all policies π and levels h. E The fact holds due to Q? satisfying the Bellman optimality condition. We also have seen that for any f such that 0 0 ? f(s, a) = r(s, a) + Es0∼P (·|s,a) maxa0 f(s , a ) for all s, a , then f = Q . ∈ S × A Thus, if we ever discover a function f such that (f; π, h) = 0 for an arbitrary policy π and time step h, then we know that f = Q?. The average Bellman error allowsE us to detect6 wrong function approximators, i.e., f such that f = Q?. 6 6 We now make a structural assumption on average Bellman errors, which allows us to reason about the Bellman errors induced by all policies πf in a sample-efficient manner. For any h [H 1], let is define the Bellman error matrix |F|×|F| ∈ − h R as E ∈ [ h]g,f = (f; πg, h). (0.2) E E That is, each entry in the matrix captures the Bellman error of the function indexed by the column f under the greedy policy πg induced by the row g at step h. With this notation, we define the Bellman rank of the MDP and a function class below. F Definition 9.5 (Bellman Rank). The Bellman rank of an MDP and a function class is the smallest integer M such F that rank( h) M for all h [H 1]. E ≤ ∈ − Intuitively, if the Bellman rank is small, then for any level h, the number of linearly independent rows is small. That is, the average Bellman error for any function under most policies can be expressed as linear combination of the Bellman errors of that function on a small set of policies corresponding to the linearly independent rows. Note that the definition presented here is a modification of the original definition from Jiang et al. [2017] in order to precisely capture the linear MDP example we covered before. [Jiang et al., 2017] showed that Bellman rank can be further upper bounded in terms of latent quantities such as the rank of the transition matrix, or the number of latent states if the MDP has an equivalent formulation as an MDP with a small number of latent states. We refer the reader to Jiang et al. [2017] for detailed examples, as well as connections of Bellman rank with other rank type notions in the RL literature to measure problem complexity.

9.3.1 Examples

Before moving on the the algorithm, we first present three examples that admit small Bellman Rank: tabular MDPs, linear MDPs, and Bellman Completion with Linear Function class. We note that the setting of Bellman Completion with Linear Function class subsumes linear MDPs, and linear MDPs subsumes tabular MDPs.

Tabular MDPs

For tabular MDPs, we can directly rewrite the Bellman error as follows. Focusing on an arbitrary h:

X πg   (f; πg, h) = d (sh, ah) f(sh, ah) r(sh, ah) Esh+1∼P (·|sh,ah) max f(sh+1, a) E h − − a sh,ah∈Sh×Ah

D πg E = d , f( , ) r( , ) Esh+1∼P (·|·,·) max f(sh+1, a) . h · · − · · − a

Namely, (f; πg, h) can be written as an inner product of two vectors whose dimension is h h . Note that in tabular MDPs,E number of states and number of actions are the natural complexity quantity in|S sample||A complexity| and regret bounds.

99 Linear MDPs

Recall the linear MDP definition. In linear MDPs, we know that Q? is a linear function with respect to feature vector ? ? > φ, i.e., Q (sh, ah) = (w ) φ(sh, ah). We will parameterize to be the following linear function class: F > d = w φ(s, a): w R , w 2 W , F { ∈ k k ≤ } ? ? where we assume w 2 W . Recall that the transition P and reward are also linear, i.e., P ( sh, ah) = µ φ(sh, ah), r(sh, ah) = > ? k k ≤ ·| φ(sh, ah) θ . In this case, for the average Bellman error, we have:   > ? > > (f; π , h) = πg w φ(s , a ) (θ ) φ(s , a ) max w φ(s , a) g Esh,ah∼d h h h h Esh+1∼P (·|sh,ah) h+1 E h − − a∈Ah+1    > ? > > ? > > = πg w φ(s , a ) (θ ) φ(s , a ) φ(s , a ) (µ ) max w φ( , a) Esh,ah∼d h h h h h h h − − a∈Ah+1 ·     ? ? > > = w θ (µ ) max w φ( , a) , πg φ(s , a ) . Esh,ah∼d h h − − a∈Ah+1 · h

Namely, Bellman error is written as the inner product of two vectors whose dimension is d. Thus, Bellman rank is at most d for linear MDPs.

Bellman Completion in Linear Function Approximation

The last example we study here further generalizes linear MDPs. The setting Bellman Completion in Linear Function Approximation is defined as follows.

? Definition 9.6 (Bellman Completion under Linear Function Class). We assume r(sh, ah) = θ φ(sh, ah) and we > d · consider the following linear function class = w φ(s, a): w R , w 2 W . Recall the Bellman operator . We have Bellman completion under if andF only{ if for any f ∈, f k k . ≤ } T F ∈ F T ∈ F

We can verify that linear MDP is a special instance here.

> > Consider f(s, a) := w φ(s, a), and denote f(s, a) := w φ(sh, ah). We can rewrite the Bellman error as follows: T e   > ? > > (f; π , h) = πg w φ(s , a ) (θ ) φ(s , a ) max w φ(s , a) g Esh,ah∼d h h h h Esh+1∼P (·|sh,ah) h+1 E h − − a∈Ah+1

π  > >  = Es ,a ∼d g w φ(sh, ah) w φ(sh, ah) h h h − e D E π = w w, Es ,a ∼d g [φ(sh, ah)] . − e h h h Namely, we can write the bellman error as an inner product of two vectors with dimension d. This indicates that the Bellman Rank is at most d.

9.3.2 Examples that do not have low Bellman Rank

[Jiang et al., 2017] also demonstrates a few more examples (with a slightly modification of the definition of average Bellman error) also admits low-Bellman rank including reactive POMDPs and reactive PSRs. So far in the literature, the only example that doesn’t admit low Bellman rank but has polynomial sample complexity algorithms is the factored MDPs [Kearns and Koller, 1999]. Sun et al. [2019a] showed that Bellman rank can be exponential in Factored MDPs.

100 9.4 Algorithm

Having defined our main structural assumption, we now describe an algorithm whose sample complexity depends on the Bellman rank, with no explicit dependence on and and only logarithmic scaling with . For ease of presentation, we will assume that all the expectations can|S| be measured|A| exactly with no errors, which serves|F| to illustrate the key idea of Explore-or-Terminate. For a more careful analysis with finite samples, we refer the reader to Jiang et al. [2017]. The algorithm, named OLIVE for Optimism Led Iterative Value-function Elimination is an iterative algorithm which successively prunes value functions that have non-zero average Bellman error (recall Q? has zero Bellman error at any state-action pair). It then uses the principle of optimism in the face of uncertainty to select its next policy which allows us to detect if a new optimal policy has been found. The algorithm is described in Algorithm 8.

Algorithm 7 The OLIVE algorithm for MDPs with low Bellman rank Input: Function class . F 1: Initialize 0 = . F F 2: for t = 1, 2,..., do 3: Define ft = argmaxf∈Ft−1 maxa f(s0, a) and πt = πft . πt 4: if maxa ft(s0, a) = V then return πt. 5: else 6: Update t = f t−1 : (f; πt, h) = 0, for all h [H 1] . F { ∈ F E ∈ − } 7: end if 8: end for

The key step here is that during elimination procedure (Line 6), we enumerate all remaining f t−1 under a fixed ∈ F roll-in policy πt, i.e., we can estimate (f; πt, h) for all f via a dataset that is collected from πt. Assume we get the i i i i N E i i πt i i i following dataset s , a , r , s where s , a d , and s P ( s , a ). For any f h−1, we can { h h h h+1}i=1 h h ∼ h h+1 ∼ ·| h h ∈ F form the following empirical estimate of (f; πt, h): E N 1 X  i i i i  e(f; πt, h) = f(sh, ah) rh max f(sh+1, a) . E N − − a i=1

We can apply Hoeffding’s inequality (Lemma A.1) and a union bound over t−1 together here to get a uniform F p convergence argument, i.e., for any f t−1, we will have e(f; πt, h) (f; πt, h) = Oe( ln( /δ)/N). ∈ F E − E |F| Since we assume that all the expectations are available exactly, the main complexity analysis in OLIVE concerns the number of iterations before it terminates. When we estimate expectations using samples, this iteration complexity is critical as it also scales the sample complexity of the algorithm. We will state and prove the following theorem regarding the iteration complexity of OLIVE.

We need the following lemma to show that performance difference between maxa f(s0, a) and the value of its induced greedy policy V πt . Lemma 9.7 (Performance Difference). For any f, we have:

H−1 h h ii πf X max f(x , a) V (x ) = πf f(x , a ) r(x , a ) max f(x , a) . 0 0 Ex ,a ∼d h h h h Exh+1∼P (·|xh,ah) h+1 a − h h h − − a h=0

We leave the proof of the above lemma in HW2. Theorem 9.8. For any MDP and with Bellman rank M, OLIVE terminates in at most MH iterations and outputs π?. F

101 ? Proof: Consider an iteration t of OLIVE. Due to Assumption 9.2 and Fact 9.4, we know that Q t−1. Suppose ∈ F OLIVE terminates at this iteration and returns πt. Then we have

πt ? ? V = Vft = max Vf VQ = V , f∈Ft−1 ≥

? since Q t−1. So the algorithm correctly outputs an optimal policy when it terminates. ∈ F πt On the other hand, if it does not terminate then V = Vf and Lemma 9.7 implies that (ft, πt, h) > 0 for some step 6 t E h [H 1]. This certainly ensures that ft / t, but has significantly stronger implications. Note that ft t−1 ∈ − ∈ F ∈ F implies that (ft, πs, h) = 0 for all s < t and h [H 1]. Since we just concluded that (ft, πt, h) > 0 for some h, E ∈ − E it must be the case that the row corresponding to πt is linearly independent of those corresponding to π1, . . . , πt−1 in the matrix h. Consequently, at each non-final iteration, OLIVE finds a row that is linearly independent of all previous E identified rows (i.e., indexed by f1, f2, . . . , ft−1). Since h has rank M, the total number of linearly independent rows one could find is at most M! Recall that there are at mostE H many Bellman error matrices, then the algorithm must terminate in number of rounds HM, which gives the statement of the theorem. The proof of the theorem makes it precise that the factorization underlying Bellman rank really plays the role of an efficient basis for exploration in complex MDPs. Extending these ideas to noisy estimates of expectations requires some care since algebraic notions like rank are not robust to noise. Instead Jiang et al. [2017] use a more general volumetric argument to analyze the noisy case, as well as describe robustness to requirements of exact low-rank factorization and realizability. Unfortunately, the OLIVE algorithm is not computationally efficient, and a computational hardness result was discov- ered by Dann et al. [2018]. Developing both statistically and computationally efficient exploration algorithms for RL with rich observations is an area of active research. However, whether there exists a computationally efficient algorithm for low Bellman rank setting is still an open problem.

9.5 Extension to Model-based Setting

We briefly discuss the extension to model-based setting. Different from model-free value-based setting, in model- based learning, we start with a model class that contains possible transitions. To ease presentation, we denote the ground truth model as P ?.

= P : ∆( ) . P { S × A → S } Again we assume realizability:

Assumption 9.9. We assume P ? . ∈ P

Given any P , we can define the optimal value function and optimal Q function under P . Specifically, we denote ? ? ∈ P VP and QP as the optimal value and Q function under model P , and πP as the optimal policy under P . We introduce a class of witness functions (a.k.a discriminators),

= g : R . G { S × A × A 7→ } We denote the witness model-misfit for a model P as follows:   ¯ π ¯ ? (P ; P , h, ) := sup Es ,a ∼d P Esh+1∼P (·|sh,ah)g(sh, ah, sh+1) Esh+1∼P (·|sh,ah)g(sh, ah, sh+1) . W G g∈G h h h −

102 πP¯ πP¯ ? Namely, we are using a discriminator class to distinguish two distributions dh P and dh P . Note that for the ground truth model P ?, we always have witnessG misfit being zero, i.e., (P ?; P¯ ,◦ h, ) = 0 for◦ any P¯ , h, . W G G To further compare to Bellman rank, we assume the following assumption on the discriminators: Assumption 9.10. We assume contains functions r + V ? for all P , i.e., G P ∈ P r(s, a) + V ?(s0): P . { P ∈ P} ⊆ G

¯ ? With this assumption, we can show that Bellman Domination, i.e., (P ; P , h, ) (QP ; πP¯ , h), i.e., the average ? W G ≥ E ¯ Bellman error of QP under the state-action distribution induced by the optimal policy of P . We define Witness rank as follows. |P|×|P| ¯ Definition 9.11 (Witness Rank). Consider the Witness misfit matrix h R , where [ h]P¯ ,P := (P ; P , h, ). W ∈ W W G Witness rank is the smallest number M such that rank( h) M for all h. W ≤ Sun et al. [2019a] provides a more refined definition of Witness rank which is at least as small as the Bellman rank ? under the value functions = QP : P , and shows that for factored MDPs, Bellman rank with could be exponentially large with respectQ { to horizon∈H P}while witness rank remains small and captures the rightQ complexity measure in factored MDPs. Indeed one can show that there exists a factored MDP, such that no model-free algorithms using as a function class input is able to achieve polynomial sample complexity, while the algorithm OLIME which weQ present later, achieves polynomially sample complexity [Sun et al., 2019a]. This demonstrates an exponential separation in sample complexity between model-based approaches and model-free approaches. We close here by providing the algorithm Optimism-Led Iterative Model Elimination (OLIME) below.

Algorithm 8 The OLIME algorithm for MDPs with low Witness rank Input: model class . P 1: Initialize 0 = . P P 2: for t = 1, 2,..., do ? 3: Define Pt = argmaxP ∈P VP (s0) and πt = πPt . π t−1 4: if V t = V πt then return π . Pt t 5: else 6: Update t = f t−1 : (P ; πt, h, ) = 0, for all h [H 1] . P { ∈ P W G ∈ − } 7: end if 8: end for

The algorithm use optimism and maintains t in each iteration. Every iteration, it picks the most optimistic model Pt P from the current model class t. Again optimism allows us to identify if the algorithm has find the optimal policy. P If the algorithm does not terminate, then we are guaranteed to find a model Pt that admits non-zero witness misfit. We then update t by eliminating all models that have non-zero model misfit under the distribution of πt. Again, following the similarP argument we had for OLIVE, one can show that OLIME terminates in at most MH iterations, with M being the witness rank.

9.6 Bibliographic Remarks and Further Readings

Bellman rank was original proposed by Jiang et al. [2017]. Note that the definition of our average Bellman error is a simple modification of the original average Bellman error definition in [Jiang et al., 2017]. Our definition allows it

103 to capture linear MDPs and the Bellman Completion in Linear Function Approximation, without assuming discrete action space and paying a polynomial dependency on the number of actions. Witness rank was proposed by Sun et al. [2019a] for model-based setting. Sun et al. [2019a] showed that witness rank captures the correct complexity in Factored MDPs [Kearns and Koller, 1999] while Bellman rank could be exponentially large in H when applied to Factored MDPs. Sun et al. [2019a] also demonstrates an exponential sample complexity gap between model-based algorithms and model-free algorithms with general function approximation.

104 Chapter 10

Deterministic MDPs with Linearly Parameterized Q?

To be added

105 106 Part 3

Policy Optimization

107

Chapter 11

Policy Gradient Methods and Non-Convex Optimization

For a distribution ρ over states, define: π π V (ρ) := Es0∼ρ[V (s0)] , d where we slightly overload notation. Consider a class of parametric policies πθ θ Θ R . The optimization problem of interest is: { | ∈ ⊂ } max V πθ (ρ) . θ∈Θ We drop the MDP subscript in this chapter.

One immediate issue is that if the policy class πθ consists of deterministic policies then πθ will, in general, not be differentiable. This motivates us to consider policy{ } classes that are stochastic, which permit differentiability.

Example 11.1 (Softmax policies). It is instructive to explicitly consider a “tabular” policy representation, given by the softmax policy: exp(θs,a) πθ(a s) = P , (0.1) | a0 exp(θs,a0 ) where the parameter space is Θ = R|S||A|. Note that (the closure of) the set of softmax policies contains all stationary and deterministic policies.

d Example 11.2 (Log-linear policies). For any state, action pair s, a, suppose we have a feature mapping φs,a R . Let us consider the policy class ∈ exp(θ φs,a) πθ(a s) = P · | 0 exp(θ φs,a0 ) a ∈A · with θ Rd. ∈ Example 11.3 (Neural softmax policies). Here we may be interested in working with the policy class

exp f (s, a) π (a s) = θ θ P 0  | a0∈A exp fθ(s, a )

d where the scalar function fθ(s, a) may be parameterized by a neural network, with θ R . ∈ 109 11.1 Policy Gradient Expressions and the Likelihood Ratio Method

π Let τ denote a trajectory, whose unconditional distribution Prµ(τ) under π with starting distribution µ, is

π Pr (τ) = µ(s0)π(a0 s0)P (s1 s0, a0)π(a1 s1) . (0.2) µ | | | ··· We drop the µ subscript when it is clear from context. It is convenient to define the discounted total reward of a trajectory as:

∞ X t R(τ) := γ r(st, at) t=0 where st, at are the state-action pairs in τ. Observe that:

πθ V (µ) = πθ [R(τ)] . Eτ∼Prµ

π Theorem 11.4. (Policy gradients) The following are expressions for θV θ (µ): ∇ REINFORCE: • " ∞ # πθ X V (µ) = πθ R(τ) log π (a s ) Eτ∼Prµ θ t t ∇ t=0 ∇ | Action value expression: • " ∞ # πθ X t πθ V (µ) = πθ γ Q (s , a ) log π (a s ) Eτ∼Prµ t t θ t t ∇ t=0 ∇ |

1 h π i π θ = Es∼d θ Ea∼πθ (·|s) Q (s, a) log πθ(a s) 1 γ ∇ | − Advantage expression: • π 1 h π i θ π θ V (µ) = Es∼d θ Ea∼πθ (·|s) A (s, a) log πθ(a s) ∇ 1 γ ∇ | − The alternative expressions are more helpful to use when we turn to Monte Carlo estimation. Proof: We have:

πθ X πθ V (µ) = R(τ)Prµ (τ) ∇ ∇ τ

X πθ = R(τ) Prµ (τ) τ ∇

X πθ πθ = R(τ)Prµ (τ) log Prµ (τ) τ ∇

X πθ = R(τ)Prµ (τ) log (µ(s0)πθ(a0 s0)P (s1 s0, a0)πθ(a1 s1) ) τ ∇ | | | ··· ∞ ! X πθ X = R(τ)Prµ (τ) log πθ(at st) τ t=0 ∇ | which completes the proof of the first claim.

110 For the second claim, for any state s0

πθ V (s0) ∇ X πθ = πθ(a0 s0)Q (s0, a0) ∇ | a0   X πθ X πθ = πθ(a0 s0) Q (s0, a0) + πθ(a0 s0) Q (s0, a0) ∇ | | ∇ a0 a0   X πθ = πθ(a0 s0) log πθ(a0 s0) Q (s0, a0) | ∇ | a0   X X πθ + πθ(a0 s0) r(s0, a0) + γ P (s1 s0, a0)V (s1) | ∇ | a0 s1   X πθ X πθ = πθ(a0 s0) log πθ(a0 s0) Q (s0, a0) + γ πθ(a0 s0)P (s1 s0, a0) V (s1) | ∇ | | | ∇ a0 a0,s1

π πθ π πθ = Eτ∼Pr θ [Q (s0, a0) log πθ(a0 s0)] + γEτ∼Pr θ [ V (s1)] . s0 ∇ | s0 ∇ By linearity of expectation,

V πθ (µ) ∇ πθ πθ = E πθ [Q (s0, a0) log πθ(a0 s0)] + γE πθ [ V (s1)] τ∼Prµ ∇ | τ∼Prµ ∇ πθ πθ = E πθ [Q (s0, a0) log πθ(a0 s0)] + γE πθ [Q (s1, a1) log πθ(a1 s1)] + .... τ∼Prµ ∇ | τ∼Prµ ∇ | where the last step follows from recursion. This completes the proof of the second claim. The proof of the final claim is left as an exercise to the reader.

11.2 (Non-convex) Optimization

It is worth explicitly noting that V πθ (s) is non-concave in θ for the softmax parameterizations, so the standard tools of convex optimization are not applicable.

Lemma 11.5. (Non-convexity) There is an MDP M (described in Figure 0.1) such that the optimization problem V πθ (s) is not concave for both the direct and softmax parameterizations.

Proof: Recall the MDP in Figure 0.1. Note that since actions in terminal states s3, s4 and s5 do not change the expected reward, we only consider actions in states s1 and s2. Let the ”up/above” action as a1 and ”right” action as a2. Note that π V (s1) = π(a2 s1)π(a1 s2) r | | · Now consider θ(1) = (log 1, log 3, log 3, log 1), θ(2) = ( log 1, log 3, log 3, log 1) − − − − where θ is written as a tuple (θa1,s1 , θa2,s1 , θa1,s2 , θa2,s2 ). Then, for the softmax parameterization, we have:

(1) 3 (1) 3 (1) 9 π (a2 s1) = ; π (a1 s2) = ; V (s1) = r | 4 | 4 16 and (2) 1 (2) 1 (2) 1 π (a2 s1) = ; π (a1 s2) = ; V (s1) = r. | 4 | 4 16 111 0 0

s3 s4

0 0 r > 0

0 0 s1 s2 s5

Figure 0.1: (Non-concavity example) A deterministic MDP corresponding to Lemma 11.5 where V πθ (s) is not con- cave. Numbers on arrows represent the rewards for each action.

(mid) θ(1)+θ(2) Also, for θ = 2 ,

(mid) 1 (mid) 1 (mid) 1 π (a2 s1) = ; π (a1 s2) = ; V (s1) = r. | 2 | 2 4

This gives (1) (2) (mid) V (s1) + V (s1) > 2V (s1), which shows that V π is non-concave.

11.2.1 Gradient ascent and convergence to stationary points

Let us say a function f : Rd R is β -smooth if → f(w) f(w0) β w w0 , (0.3) k∇ − ∇ k ≤ k − k where the norm is the Euclidean norm. In other words, the derivatives of f do not change too quickly. k · k Gradient ascent, with a fixed stepsize η, follows the update rule:

πθ θt+1 = θt + η V t (µ) . ∇ It is convenient to use the shorthand notation:

(t) (t) πθt π := πθt ,V := V

The next lemma is standard in non-convex optimization.

Lemma 11.6. (Convergence to Stationary Points) Assume that for all θ Θ, V πθ is β-smooth and bounded below by ∈ V∗. Suppose we use the constant stepsize η = 1/β. For all T , we have that 2β(V ∗(µ) V (0)(µ)) min V (t)(µ) 2 − . t≤T k∇ k ≤ T

11.2.2 Monte Carlo estimation and stochastic gradient ascent

One difficulty is that even if we know the MDP M, computing the gradient may be computationally intensive. It turns out that we can obtain unbiased estimates of π with only simulation based access to our model, i.e. assuming we can obtain sampled trajectories τ Prπθ . ∼ µ 112 With respect to a trajectory τ, define: ∞ 0 π X t −t Qdθ (st, at) := γ r(st0 , at0 ) t0=t ∞ π X t π \V θ (µ) := γ Qdθ (st, at) log πθ(at st) ∇ t=0 ∇ |

We now show this provides an unbiased estimated of the gradient: Lemma 11.7. (Unbiased gradient estimate) We have : h i π πθ E πθ \V θ (µ) = V (µ) τ∼Prµ ∇ ∇

Proof: Observe: " ∞ # π X t π E[\V θ (µ)] = E γ Qdθ (st, at) log πθ(at st) ∇ t=0 ∇ | " ∞ # (a) X t π = E γ E[Qdθ (st, at) st, at] log πθ(at st) t=0 | ∇ | " ∞ # (b) X t πθ = E γ Q (st, at) log πθ(at st) t=0 ∇ | where (a) follows from the tower property of the conditional expectations and (b) follows from that the Markov π πθ property implies E[Qdθ (st, at) st, at] = Q (st, at). | Hence, the following procedure is a stochastic gradient ascent algorithm:

1. initialize θ0. 2. For t = 0, 1,... (a) Sample τ Prπθ . ∼ µ (b) Update: π θt+1 = θt + ηt \V θ (µ) ∇ π where ηt is the stepsize and \V θ (µ) estimated with τ. ∇ Note here that we are ignoring that τ is an infinte length sequence. It can be truncated appropriately so as to control the bias. The following is standard result with regards to non-convex optimization. Again, with reasonably bounded variance, we will obtain a point θt with small gradient norm. Lemma 11.8. (Stochastic Convergence to Stationary Points) Assume that for all θ Θ, V πθ is β-smooth and bounded ∈ below by V∗. Suppose the variance is bounded as follows:

π 2 2 E[ \V πθ (µ) V θ (µ) ] σ k∇ − ∇ k ≤ ∗ (0) 2 For t β(V (µ) V (µ))/σ , suppose we use a constant stepsize of ηt = 1/β, and thereafter, we use ηt = p2/(βT≤ ). For all T−, we have:

∗ (0) r 2 (t) 2 2β(V (µ) V (µ)) 2σ min E[ V (µ) ] − + . t≤T k∇ k ≤ T T

113 Baselines and stochastic gradient ascent

A significant practical issue is that the variance σ2 is often large in practice. Here, a form of variance reduction is often critical in practice. A common method is as follows.

Let f : R. S →

1. Construct f as an estimate of V πθ (µ). This can be done using any previous data. 2. Sample a new trajectory τ, and define:

∞ 0 π X t −t Qdθ (st, at) := γ r(st0 , at0 ) t0=t ∞   π X t π \V θ (µ) := γ Qdθ (st, at) f(st) log πθ(at st) ∇ t=0 − ∇ |

We often refer to f(s) as a baseline at state s. Lemma 11.9. (Unbiased gradient estimate with Variance Reduction) For any procedure used to construct to the π baseline function f : R, if the samples used to construct f are independent of the trajectory τ, where Qdθ (st, at) is constructed using τS, then: →

" ∞ #   X t π πθ E γ Qdθ (st, at) f(st) log πθ(at st) = V (µ) t=0 − ∇ | ∇ where the expectation is with respect to both the random trajectory τ and the random function f( ). · Proof: For any function g(s), X X X E [ log π(a s)g(s)] = π(a s)g(s) = g(s) π(a s) = g(s) π(a s) = g(s) 1 = 0 ∇ | a ∇ | a ∇ | ∇ a | ∇

Using that f( ) is independent of τ, we have that for all t · " ∞ # X t E γ f(st) log πθ(at st) = 0 t=0 ∇ | The result now follow froms Lemma 11.7.

11.3 Bibliographic Remarks and Further Readings

The REINFOCE algorithm is due to [Williams, 1992], which is an example of the likelihood ratio method for gradient estimation [Glynn, 1990]. For standard optimization results in non-convex optimization (e.g. Lemma 11.6 and 11.8), we refer the reader to [Beck, 2017]. Our results for convergence rates for SGD to approximate stationary points follow from [Ghadimi and Lan, 2013].

114 Chapter 12

Optimality

We now seek to understand the global convergence properties of policy gradient methods, when given exact gradients. Here, we will largely limit ourselves to the (tabular) softmax policy class in Example 11.1. Given our a starting distribution ρ over states, recall our objective is:

max V πθ (ρ) . θ∈Θ

d where πθ θ Θ R is some class of parametric policies. { | ∈ ⊂ } While we are interested in good performance under ρ, we will see how it is helpful to optimize under a different measure µ. Specifically, we consider optimizing V πθ (µ), i.e.

max V πθ (µ) , θ∈Θ even though our ultimate goal is performance under V πθ (ρ). We now consider the softmax policy parameterization (0.1). Here, we still have a non-concave optimization problem in general, as shown in Lemma 11.5, though we do show that global optimality can be reached under certain reg- ularity conditions. From a practical perspective, the softmax parameterization of policies is preferable to the direct parameterization, since the parameters θ are unconstrained and standard unconstrained optimization algorithms can be employed. However, optimization over this policy class creates other challenges as we study in this section, as the optimal policy (which is deterministic) is attained by sending the parameters to infinity. This chapter will study three algorithms for this problem, for the softmax policy class. The first performs direct policy gradient ascent on the objective without modification, while the second adds a log barrier regularizer to keep the parameters from becoming too large, as a means to ensure adequate exploration. Finally, we study the natural policy gradient algorithm and establish a global optimality convergence rate, with no dependence on the dimension-dependent factors. The presentation in this chapter largely follows the results in [Agarwal et al., 2020d].

12.1 Vanishing Gradients and Saddle Points

To understand the necessity of optimizing under a distribution µ that is different from ρ, let us first give an informal argument that some condition on the state distribution of π, or equivalently µ, is necessary for stationarity to imply

115 a1 a4 a4

a3 a3

a2 a2 s0 s1 sH sH+1 ··· a1 a1 a1

Figure 0.1: (Vanishing gradient example) A deterministic, chain MDP of length H + 2. We consider a policy where π(a si) = θs ,a for i = 1, 2,...,H. Rewards are 0 everywhere other than r(sH+1, a1) = 1. See Proposition 12.1. | i optimality. For example, in a sparse-reward MDP (where the agent is only rewarded upon visiting some small set of states), a policy that does not visit any rewarding states will have zero gradient, even though it is arbitrarily suboptimal in terms of values. Below, we give a more quantitative version of this intuition, which demonstrates that even if π chooses all actions with reasonable probabilities (and hence the agent will visit all states if the MDP is connected), then there is an MDP where a large fraction of the policies π have vanishingly small gradients, and yet these policies are highly suboptimal in terms of their value.

Concretely, consider the chain MDP of length H + 2 shown in Figure 0.1. The starting state of interest is state s0 and the discount factor γ = H/(H + 1). Suppose we work with the direct parameterization, where πθ(a s) = θs,a for | a = a1, a2, a3 and πθ(a4 s) = 1 θs,a1 θs,a2 θs,a3 . Note we do not over-parameterize the policy. For this MDP and policy structure, if we| were to− initialize− the probabilities− over actions, say deterministically, then there is an MDP (obtained by permuting the actions) where all the probabilities for a1 will be less than 1/4. The following result not only shows that the gradient is exponentially small in H, it also shows that many higher order derivatives, up to O(H/ log H), are also exponentially small in H. Proposition 12.1 (Vanishing gradients at suboptimal parameters). Consider the chain MDP of Figure 0.1, with H + 2 states, γ = H/(H + 1), and with the direct policy parameterization (with 3 parameters, as described in the text |S| H above). Suppose θ is such that 0 < θ < 1 (componentwise) and θs,a < 1/4 (for all states s). For all k 1 ≤ 40 log(2H) − k πθ H/4 k πθ πθ 1, we have θ V (s0) (1/3) , where θ V (s0) is a tensor of the kth order derivatives of V (s0) and the ∇ ≤ 1 ∇ ? π 2 H norm is the operator norm of the tensor. Furthermore, V (s0) V θ (s0) (H + 1)/8 (H + 1) /3 . − ≥ − We do not prove this lemma here (see Section 12.5). The lemma illustrates that lack of good exploration can indeed be detrimental in policy gradient algorithms, since the gradient can be small either due to π being near-optimal, or, simply because π does not visit advantageous states often enough. Furthermore, this lemma also suggests that varied results in the non-convex optimization literature, on escaping from saddle points, do not directly imply global convergence due to that the higher order derivatives are small. While the chain MDP of Figure 0.1, is a common example where sample based estimates of gradients will be 0 under random exploration strategies; there is an exponentially small in H chance of hitting the goal state under a random exploration strategy. Note that this lemma is with regards to exact gradients. This suggests that even with exact computations (along with using exact higher order derivatives) we might expect numerical instabilities.

12.2 Policy Gradient Ascent

Let us now return to the softmax policy class, from Equation 0.1, where:

exp(θs,a) πθ(a s) = P , | a0 exp(θs,a0 )

1 d⊗k The operator norm of a kth-order tensor J ∈ is defined as sup d hJ, u1 ⊗ ... ⊗ udi. R u1,...,uk∈R : kuik2=1

116 where the number of parameters in this policy class is . |S||A| Observe that: ∂ log πθ(a s) h 0i h 0i 0  | = 1 s = s 1 a = a πθ(a s) ∂θs0,a0 − | where 1[ ] is the indicator of being true. E E Lemma 12.2. For the softmax policy class, we have:

∂V πθ (µ) 1 πθ πθ = dµ (s)πθ(a s)A (s, a) (0.1) ∂θs,a 1 γ | −

Proof: Using the advantage expression for the policy gradient (see Theorem 11.4),

∂V πθ (µ) 1 h ∂ log π (a s)i πθ θ = πθ A (s, a) Es∼dµ Ea∼πθ (·|s) | ∂θs0,a0 1 γ ∂θs0,a0 − 1 h π h 0i h 0i 0 i π θ 1 1 = E θ Ea∼πθ (·|s) A (s, a) s = s a = a πθ(a s) 1 γ s∼dµ − | −1 h  h i i πθ 0 πθ 0 0 0 0 0 1 = dµ (s )Ea∼πθ (·|s ) A (s , a) a = a πθ(a s ) 1 γ − | −1  h i h i πθ 0 πθ 0  0 0 0 πθ 0 0 1 0 = dµ (s ) Ea∼πθ (·|s ) A (s , a) a = a πθ(a s )Ea∼πθ (·|s ) A (s , a) 1 γ − | −1 πθ 0 0 0 πθ 0 0 = d (s )πθ(a s )A (s , a ) 0 , 1 γ µ | − − where the last step uses that for any policy P π(a s)Aπ(s, a) = 0. a | The update rule for gradient ascent is: (t+1) (t) (t) θ = θ + η θV (µ). (0.2) ∇ Recall from Lemma 11.5 that, even for the case of the softmax policy class (which contains all stationary policies), our optimization problem is non-convex. Furthermore, due to the exponential scaling with the parameters θ in the softmax parameterization, any policy that is nearly deterministic will have gradients close to 0. Specifically, for any sequence θt of policies πθt that becomes deterministic, V π 0. k∇ k → In spite of these difficulties, it turns out we have a positive result that gradient ascent asymptotically converges to the global optimum for the softmax parameterization.

Theorem 12.3 (Global convergence for softmax parameterization). Assume we follow the gradient ascent update rule as specified in Equation (0.2) and that the distribution µ is strictly positive i.e. µ(s) > 0 for all states s. Suppose 3 η (1−γ) , then we have that for all states s, V (t)(s) V ?(s) as t . ≤ 8 → → ∞

The proof is somewhat technical, and we do not provide a proof here (see Section 12.5). A few remarks are in order. Theorem 12.3 assumed that optimization distribution µ was strictly positive, i.e. µ(s) > 0 for all states s. We conjecture that any gradient ascent may not globally converge if this condition is not met. The πθ concern is that if this condition is not met, then gradient ascent may not globally converge due to that dµ (s) effectively scales down the learning rate for the parameters associated with state s (see Equation 0.1). Furthermore, there is strong reason to believe that the convergence rate for this is algorithm (in the worst case) is exponentially slow in some of the relevant quantities, such as in terms of the size of state space. We now turn to a regularization based approach to ensure convergence at a polynomial rate in all relevant quantities.

117 12.3 Log Barrier Regularization

Due to the exponential scaling with the parameters θ, policies can rapidly become near deterministic, when optimizing under the softmax parameterization, which can result in slow convergence. Indeed a key challenge in the asymptotic analysis in the previous section was to handle the growth of the absolute values of parameters as they tend to infinity. Recall that the relative-entropy for distributions p and q is defined as:

KL(p, q) := Ex∼p[ log q(x)/p(x)]. −

Denote the uniform distribution over a set by UnifX , and define the following log barrier regularized objective as: X   πθ Lλ(θ) := V (µ) λ Es∼UnifS KL(UnifA, πθ( s)) − ·| λ πθ X = V (µ) + log πθ(a s) + λ log , (0.3) | |A| |S| |A| s,a where λ is a regularization parameter. The constant (i.e. the last term) is not relevant with regards to optimization. This regularizer is different from the more commonly utilized entropy regularizer, a point which we return to later.

The policy gradient ascent updates for Lλ(θ) are given by:

(t+1) (t) (t) θ = θ + η θLλ(θ ). (0.4) ∇ We now see that any approximate first-order stationary points of the entropy-regularized objective is approximately globally optimal, provided the regularization is sufficiently small. Theorem 12.4. (Log barrier regularization) Suppose θ is such that:

θLλ(θ) 2 opt k∇ k ≤ and opt λ/(2 ). Then we have that for all starting state distributions ρ: ≤ |S| |A| π? 2λ dρ V πθ (ρ) V ?(ρ) . ≥ − 1 γ µ − ∞

π? dρ We refer to µ as the distribution mismatch coefficient. The above theorem shows the importance of having an ∞ appropriate measure µ(s) in order for the approximate first-order stationary points to be near optimal.

π Proof: The proof consists of showing that maxa A θ (s, a) 2λ/(µ(s) ) for all states. To see that this is sufficient, observe that by the performance difference lemma (Lemma≤ 1.16), |S|

1 X ? V ?(ρ) V πθ (ρ) = dπ (s)π?(a s)Aπθ (s, a) − 1 γ ρ | − s,a 1 ? X π πθ dρ (s) max A (s, a) ≤ 1 γ a∈A − s 1 X ? 2dπ (s)λ/(µ(s) ) ≤ 1 γ ρ |S| − s ? ! 2λ dπ (s) max ρ . ≤ 1 γ s µ(s) −

118 which would then complete the proof.

π π We now proceed to show that maxa A θ (s, a) 2λ/(µ(s) ). For this, it suffices to bound A θ (s, a) for any state- ≤ |S| action pair s, a where Aπθ (s, a) 0 else the claim is trivially true. Consider an (s, a) pair such that Aπθ (s, a) > 0. Using the policy gradient expression≥ for the softmax parameterization (see Equation 0.1),

∂L (θ) 1 λ  1  λ πθ πθ = dµ (s)πθ(a s)A (s, a) + πθ(a s) . (0.5) ∂θs,a 1 γ | − | − |S| |A|

The gradient norm assumption θLλ(θ) 2 opt implies that: k∇ k ≤ ∂L (θ) 1 λ  1  λ πθ πθ opt = dµ (s)πθ(a s)A (s, a) + πθ(a s) ≥ ∂θs,a 1 γ | − | − |S| |A| λ  1  πθ(a s) , ≥ − | |S| |A|

πθ where we have used A (s, a) 0. Rearranging and using our assumption opt λ/(2 ), ≥ ≤ |S| |A|

1 opt 1 πθ(a s) |S| . | ≥ − λ ≥ 2 |A| |A| Solving for Aπθ (s, a) in (0.5), we have:

1 γ  1 ∂L (θ) λ  1  πθ λ A (s, a) = π−θ + 1 dµ (s) πθ(a s) ∂θs,a − πθ(a s) | |S| | |A| 1 γ  λ  π−θ 2 opt + ≤ dµ (s) |A| |S| 1 γ λ 2 π−θ ≤ dµ (s) |S| 2λ/(µ(s) ) , ≤ |S|

πθ where the penultimate step uses opt λ/(2 ) and the final step uses dµ (s) (1 γ)µ(s). This completes the proof. ≤ |S| |A| ≥ −

The policy gradient ascent updates for Lλ(θ) are given by:

(t+1) (t) (t) θ = θ + η θLλ(θ ). (0.6) ∇

By combining the above theorem with the convergence of gradient ascent to first order stationary points (Lemma 11.6), we obtain the following corollary.

8γ 2λ Corollary 12.5. (Iteration complexity with log barrier regularization) Let βλ := (1−γ)3 + |S| . Starting from any (0) (1−γ) initial θ , consider the updates (0.6) with λ = π? and η = 1/βλ. Then for all starting state distributions ρ, dρ 2 µ ∞ we have 2 2 2 π? n o 320 dρ min V ?(ρ) V (t)(ρ)  whenever T |S| |A| . t

The corollary shows the importance of balancing how the regularization parameter λ is set relative to the desired accuracy , as well as the importance of the initial distribution µ to obtain global optimality.

119 Proof:[of Corollary 12.5] Let βλ be the smoothness of Lλ(θ). A valid upper bound on βλ is: 8γ 2λ β = + , λ (1 γ)3 − |S| where we leave the proof as an exercise to the reader.

(1−γ) Using Theorem 12.4, the desired optimality gap  will follow if we set λ = π? and if θLλ(θ) 2 dρ 2 µ k∇ k ≤ ∞ λ/(2 ). In order to complete the proof, we need to bound the iteration complexity of making the gradient sufficiently|S| |A| small.

By Lemma 11.6, after T iterations of gradient ascent with stepsize of 1/βλ, we have

2 ? (0) (t) 2βλ(Lλ(θ ) Lλ(θ )) 2βλ min θLλ(θ ) − , (0.7) t≤T ∇ 2 ≤ T ≤ (1 γ) T − where βλ is an upper bound on the smoothness of Lλ(θ). We seek to ensure s 2βλ λ opt ≤ (1 γ) T ≤ 2 − |S| |A|

2 2 Choosing T 8βλ |S| |A| satisfies the above inequality. Hence, ≥ (1−γ) λ2 2 2 2 2 2 8βλ 64 16 |S| |A| |S| |A| + |S||A| (1 γ) λ2 ≤ (1 γ)4 λ2 (1 γ) λ − − − 80 2 2 |S| |A| ≤ (1 γ)4 λ2 − 2 2 2 π? 320 dρ = |S| |A| (1 γ)6 2 µ − ∞ where we have used that λ < 1. This completes the proof.

Entropy vs. log barrier regularization. A commonly considered regularizer is the entropy, where the regularizer would be: 1 X 1 X X H(πθ( s)) = πθ(a s) log πθ(a s). ·| − | | |S| s |S| s a Note the entropy is far less aggressive in penalizing small probabilities, in comparison to the log barrier, which is equivalent to the relative entropy. In particular, the entropy regularizer is always bounded between 0 and log , while the relative entropy (against the uniform distribution over actions), is bounded between 0 and infinity, where|A| it tends to infinity as probabilities tend to 0. Here, it can be shown that the convergence rate is asymptotically O(1) (see Section 12.5) though it is unlikely that the convergence rate for this method is polynomial in other relevant quantities, including , , 1/(1 γ), and the distribution mismatch coefficient. The polynomial convergence rate using the log barrier|S| (KL)|A| regularizer− crucially relies on the aggressive nature in which the relative entropy prevents small probabilities.

12.4 The Natural Policy Gradient

Observe that a policy constitutes a family of probability distributions πθ( s) s . We now consider a pre- conditioned method based on this family of distributions.{ Recall·| | that∈ S} the Fisher information matrix

120  > θ of a parameterized density pθ(x) is defined as Ex∼pθ log pθ(x) log pθ(x) . Now we let us define ρ as an ∇ ∇ F (average) Fisher information matrix on the family of distributions πθ( s) s as follows: { ·| | ∈ S} θ π  > ρ := E θ Ea∼πθ (·|s) ( log πθ(a s)) log πθ(a s) . F s∼dρ ∇ | ∇ | Note that the average is under the state-action visitation frequencies. The NPG algorithm performs gradient updates in the geometry induced by this matrix as follows: (t+1) (t) (t) † (t) θ = θ + ηFρ(θ ) θV (ρ), (0.8) ∇ where M † denotes the Moore-Penrose pseudoinverse of the matrix M. Throughout this section, we restrict to using the initial state distribution ρ ∆( ) in our update rule in (0.8) (so our optimization measure µ and the performance measure ρ are identical). Also,∈ weS restrict attention to states s reachable from ρ, since, without loss of generality, we can exclude states that are not reachable under this start state∈ S distribution2. For the softmax parameterization, this method takes a particularly convenient form; it can be viewed as a soft policy iteration update. Lemma 12.6. (Softmax NPG as soft policy iteration) For the softmax parameterization (0.1), the NPG updates (0.8) take the form: η exp ηA(t)(s, a)/(1 γ) θ(t+1) = θ(t) + A(t) + ηv and π(t+1)(a s) = π(t)(a s) − , 1 γ | | Zt(s) − P (t) (t)  where Zt(s) = π (a s) exp ηA (s, a)/(1 γ) . Here, v is only a state dependent offset (i.e. vs,a = cs for a∈A | − some cs R for each state s), and, owing to the normalization Zt(s), v has no effect on the update rule. ∈ (t) It is important to note that while the ascent direction was derived using the gradient θV (ρ), which depends on ρ, the NPG update rule actually has no dependence on the measure ρ. Furthermore,∇ there is no dependence on the (t) state distribution dρ , which is due to the pseudoinverse of the Fisher information cancelling out the effect of the state distribution in NPG.

θ † πθ Proof: By definition of the Moore-Penrose pseudoinverse, we have that ( ρ ) V (ρ) = w? if an only if w? is the minimum norm solution of: F ∇ πθ θ 2 min V (ρ) ρ w . w k∇ − F k

Let us first evaluate θw. For the softmax policy parameterization, Lemma 0.1 implies: Fρ > X 0 w θ log πθ(a s) = ws,a ws,a0 πθ(a s) := ws,a ws ∇ | − | − a0∈A where ws is not a function of a. This implies that:    θ π > ρ w = E θ Ea∼πθ (·|s) log πθ(a s) w θ log πθ(a s) F s∼dρ ∇ | ∇ |    h i π π = E θ Ea∼πθ (·|s) log πθ(a s) ws,a ws = E θ Ea∼πθ (·|s) ws,a log πθ(a s) , s∼dρ ∇ | − s∼dρ ∇ | where the last equality uses that ws is not a function of s. Again using the functional form of derivative of the softmax policy parameterization, we have:     θ πθ 0 0 0 ρ w = d (s )πθ(a s ) ws0,a0 ws0 . F s0,a0 | −

2 π Specifically, we restrict the MDP to the set of states {s ∈ S : ∃π such that dρ (s) > 0}.

121 This implies:

  1 2 πθ θ 2 X πθ πθ  θ  V (ρ) w = d (s)πθ(a s) A (s, a) w k∇ − Fρ k | 1 γ − Fρ s,a s,a − !2  1  X πθ πθ X 0 = d (s)πθ(a s) A (s, a) ws,a ws,a0 πθ(a s) . | 1 γ − − | s,a − a0∈A

1 πθ 1 πθ Due to that w = 1−γ A leads to 0 error, the above implies that all 0 error solutions are of the form w = 1−γ A +v, where v is only a state dependent offset (i.e. vs,a = cs for some cs R for each state s). The first claim follows due to that the minimum norm solution is one of these solutions. The proof∈ of the second claim now follows by the definition of the NPG update rule, along with that v has no effect on the update rule due to the normalization constant Zt(s). We now see that this algorithm enjoys a dimension free convergence rate.

Theorem 12.7 (Global convergence for NPG). Suppose we run the NPG updates (0.8) using ρ ∆( ) and with θ(0) = 0. Fix η > 0. For all T > 0, we have: ∈ S

log 1 V (T )(ρ) V ∗(ρ) |A| . ≥ − ηT − (1 γ)2T −

Note in the above the theorem that the NPG algorithm is directly applied to the performance measure V π(ρ), and the guarantees are also with respect to ρ. In particular, there is no distribution mismatch coefficient in the rate of convergence. Now setting η (1 γ)2 log , we see that NPG finds an -optimal policy in a number of iterations that is at most: ≥ − |A| 2 T , ≤ (1 γ)2 − which has no dependence on the number of states or actions, despite the non-concavity of the underlying optimization problem. The proof strategy we take borrows ideas from the classical multiplicative weights algorithm (see Section!12.5). First, the following improvement lemma is helpful:

Lemma 12.8 (Improvement lower bound for NPG). For the iterates π(t) generated by the NPG updates (0.8), we have for all starting state distributions µ

(t+1) (t) (1 γ) V (µ) V (µ) − Es∼µ log Zt(s) 0. − ≥ η ≥

Proof: First, let us show that log Zt(s) 0. To see this, observe: ≥

X (t) (t) log Zt(s) = log π (a s) exp(ηA (s, a)/(1 γ)) a | − X η X π(t)(a s) log exp(ηA(t)(s, a)/(1 γ)) = π(t)(a s)A(t)(s, a) = 0. ≥ | − 1 γ | a − a where we have used Jensen’s inequality on the concave function log x and that P π(t)(a s)A(t)(s, a) = 0. By the a | 122 performance difference lemma,

(t+1) (t) 1 X (t+1) (t) V (µ) V (µ) = E (t+1) π (a s)A (s, a) − 1 γ s∼dµ | − a (t+1) 1 X (t+1) π (a s)Zt(s) = E (t+1) π (a s) log | η s∼dµ | π(t)(a s) a | 1 (t+1) (t) 1 = E (t+1) KL(πs πs ) + E (t+1) log Zt(s) η s∼dµ || η s∼dµ 1 1 γ E (t+1) log Zt(s) − Es∼µ log Zt(s), ≥ η s∼dµ ≥ η (t+1) where the last step uses that dµ (1 γ)µ (by (0.8)) and that log Zt(s) 0. ≥ − ≥ With this lemma, we now prove Theorem 12.7.

? π? Proof:[of Theorem 12.7] Since ρ is fixed, we use d as shorthand for dρ ; we also use πs as shorthand for the vector of π( s). By the performance difference lemma (Lemma 1.16), ·| π? (t) 1 X ? (t) V (ρ) V (ρ) = Es∼d? π (a s)A (s, a) − 1 γ | − a (t+1) 1 X ? π (a s)Zt(s) = Es∼d? π (a s) log | η | π(t)(a s) a | ! 1 ? (t) ? (t+1) X ∗ = ? KL(π π ) KL(π π ) + π (a s) log Z (s) η Es∼d s s s s t || − || a |

1  ? (t) ? (t+1)  = Es∼d? KL(πs πs ) KL(πs πs ) + log Zt(s) , η || − || where we have used the closed form of our updates from Lemma 12.6 in the second step. By applying Lemma 12.8 with d? as the starting state distribution, we have:

1 1  (t+1) ? (t) ?  Es∼d? log Zt(s) V (d ) V (d ) η ≤ 1 γ − − which gives us a bound on Es∼d? log Zt(s). Using the above equation and that V (t+1)(ρ) V (t)(ρ) (as V (t+1)(s) V (t)(s) for all states s by Lemma 12.8), we have: ≥ ≥ T −1 ? 1 X ? V π (ρ) V (T −1)(ρ) (V π (ρ) V (t)(ρ)) T − ≤ t=0 − T −1 T −1 1 X ? (t) ? (t+1) 1 X = ? (KL(π π ) KL(π π )) + ? log Z (s) ηT Es∼d s s s s ηT Es∼d t t=0 || − || t=0 ? (0) T −1 Es∼d? KL(πs π ) 1 X  (t+1) ? (t) ?  || + V (d ) V (d ) ≤ ηT (1 γ)T − − t=0 ? (0) (T ) ? (0) ? Es∼d? KL(πs π ) V (d ) V (d ) = || + − ηT (1 γ)T − log 1 |A| + . ≤ ηT (1 γ)2T − The proof is completed using that V (T )(ρ) V (T −1)(ρ). ≥ 123 12.5 Bibliographic Remarks and Further Readings

The natural policy gradient method was originally presented in [Kakade, 2001]; a number of arguments for this method have been provided based on information geometry [Kakade, 2001, Bagnell and Schneider, 2003, Peters and Schaal, 2008]. The convergence rates in this chapter are largely derived from [Agarwal et al., 2020d]. The proof strategy for the NPG analysis has origins in the online regret framework in changing MDPs [Even-Dar et al., 2009], which would result in a worst rate in comparison to [Agarwal et al., 2020d]. This observation that the proof strategy from [Even-Dar et al., 2009] provided a convergence rate for the NPG was made in [Neu et al., 2017]. The faster NPG rate we present here is due to [Agarwal et al., 2020d]. The analysis of the MD-MPI algorithm [Geist et al., 2019] also implies a O(1/T ) rate for the NPG, though with worse dependencies on other parameters. Building on ideas in [Agarwal et al., 2020d], [Mei et al., 2020] showed that, for the softmax policy class, both the gradient ascent and entropy regularized gradient ascent asymptotically converge at a O(1/t); it is unlikely these meth- ods are have finite rate which are polynomial in other quantities (such as the , , 1/(1 γ), and the distribution mismatch coefficient). |S| |A| − [Mnih et al., 2016] introduces the entropy regularizer (also see [Ahmed et al., 2019] for a more detailed empirical investigation).

124 Chapter 13

Function Approximation and the NPG

We now analyze the case of using parametric policy classes:

d Π = πθ θ R , { | ∈ } where Π may not contain all stochastic policies (and it may not even contain an optimal policy). In contrast with the tabular results in the previous sections, the policy classes that we are often interested in are not fully expressive, e.g. d (indeed or need not even be finite for the results in this section); in this sense, we are in the regime of function |S||A| approximation.|S| |A| We focus on obtaining agnostic results, where we seek to do as well as the best policy in this class (or as well as some other comparator policy). While we are interested in a solution to the (unconstrained) policy optimization problem

max V πθ (ρ), d θ∈R (for a given initial distribution ρ), we will see that optimization with respect to a different distribution will be helpful, just as in the tabular case, We will consider variants of the NPG update rule (0.8):

† θ θ θ + ηFρ(θ) θV (ρ) . (0.1) ← ∇ Our analysis will leverage a close connection between the NPG update rule (0.8) with the notion of compatible function approximation. We start by formalizing this connection. The compatible function approximation error also provides a measure of the expressivity of our parameterization, allowing us to quantify the relevant notion of approximation error for the NPG algorithm. The main results in this chapter establish the effectiveness of NPG updates where there is error both due to statistical estimation (where we may not use exact gradients) and approximation (due to using a parameterized function class). We will see an precise estimation/approximation decomposition based on the compatible function approximation error. The presentation in this chapter largely follows the results in [Agarwal et al., 2020d].

13.1 Compatible function approximation and the NPG

We now introduce the notion of compatible function approximation, which both provides some intuition with regards to policy gradient methods and it will help us later on with regards to characterizing function approximation.

125 Lemma 13.1 (Gradients and compatible function approximation). Let w? denote the following minimizer:

 2 ? π πθ  w E θ Ea∼πθ (·|s) A (s, a) w θ log πθ(a s) , ∈ s∼dµ − · ∇ | where the squared error above is referred to as the compatible function approximation. Denote the best linear predic- π π tor of A θ (s, a) using θ log πθ(a s) by Ab θ (s, a), i.e. ∇ |

πθ ? Ab (s, a) := w θ log πθ(a s). · ∇ | We have that: 1 πθ π  πθ  θV (µ) = E θ Ea∼πθ (·|s) θ log πθ(a s)Ab (s, a) . ∇ 1 γ s∼dµ ∇ | − Proof: The first order optimality conditions for w? imply h i π πθ ?  E θ Ea∼πθ (·|s) A (s, a) w θ log πθ(a s) θ log πθ(a s) = 0 (0.2) s∼dµ − · ∇ | ∇ |

Rearranging and using the definition of Abπθ (s, a), 1 πθ π  πθ  θV (µ) = E θ Ea∼πθ (·|s) θ log πθ(a s)A (s, a) ∇ 1 γ s∼dµ ∇ | −1 π  πθ  = E θ Ea∼πθ (·|s) θ log πθ(a s)Ab (s, a) , 1 γ s∼dµ ∇ | − which completes the proof. The next lemma shows that the weight vector above is precisely the NPG ascent direction. Precisely, Lemma 13.2. We have that: † θ 1 ? Fρ(θ) θV (ρ) = w , (0.3) ∇ 1 γ − where w? is a minimizer of the following regression problem:

?  > πθ 2 w argminw E πθ (w θ log πθ( s) A (s, a)) . ∈ s∼dρ ,a∼πθ (·|s) ∇ ·| −

Proof: The above is a straightforward consequence of the first order optimality conditions (see Equation 0.2). Specif- ically, Equation 0.2, along with the advantage expression for the policy gradient (see Theorem 11.4), imply that w? must satisfy: θ 1 ? θV (ρ) = Fρ(θ)w ∇ 1 γ − which completes the proof. This lemma implies that we might write the NPG update rule as: η θ θ + w? . ← 1 γ − where w? is minimizer of the compatible function approximation error (which depends on θ.

The above regression problem can be viewed as “compatible” function approximation: we are approximating Aπθ (s, a) using the θ log πθ( s) as features. We also consider a variant of the above update rule, Q-NPG, where instead of us- ing advantages∇ in the·| above regression we use the Q-values. This viewpoint provides a methodology for approximate updates, where we can solve the relevant regression problems with samples.

126 13.2 Examples: NPG and Q-NPG

In practice, the most common policy classes are of the form: (  ) exp fθ(s, a) Π = π (a s) = θ d , (0.4) θ P 0  R | a0∈A exp fθ(s, a ) ∈ where fθ is a . For example, the tabular softmax policy class is one where fθ(s, a) = θs,a. Typically, fθ is either a linear function or a neural network. Let us consider the NPG algorithm, and a variant Q-NPG, in each of these two cases.

13.2.1 Log-linear Policy Classes and Soft Policy Iteration

d For any state-action pair (s, a), suppose we have a feature mapping φs,a R . Each policy in the log-linear policy class is of the form: ∈ exp(θ φs,a) πθ(a s) = P · , | 0 exp(θ φs,a0 ) a ∈A · d with θ R . Here, we can take fθ(s, a) = θ φs,a. ∈ · With regards to compatible function approximation for the log-linear policy class, we have:

θ θ 0 0 θ log πθ(a s) = φs,a, where φs,a = φs,a Ea ∼πθ (·|s)[φs,a ], ∇ | − θ ¯π that is, φs,a is the centered version of φs,a. With some abuse of notation, we accordingly also define φ for any policy π. Here, using (0.3), the NPG update rule (0.1) is equivalent to:

h θ 2i πθ  NPG: θ θ + ηw?, w? argminw E πθ A (s, a) w φs,a . ← ∈ s∼dρ ,a∼πθ (·|s) − ·

(We have rescaled the learning rate η in comparison to (0.1)). Note that we recompute w? for every update of θ. Here, the compatible function approximation error measures the expressivity of our parameterization in how well linear functions of the parameterization can capture the policy’s advantage function. We also consider a variant of the NPG update rule (0.1), termed Q-NPG, where:

h 2i πθ  Q-NPG: θ θ + ηw?, w? argminw E πθ Q (s, a) w φs,a . ← ∈ s∼dρ ,a∼πθ (·|s) − · Note we do not center the features for Q-NPG; observe that Qπ(s, a) is also not 0 in expectation under π( s), unlike the advantage function. ·| (NPG/Q-NPG and Soft-Policy Iteration) We now see how we can view both NPG and Q-NPG as an incremental (soft) version of policy iteration, just as in Lemma 12.6 for the tabular case. Rather than writing the update rule in terms of the parameter θ, we can write an equivalent update rule directly in terms of the (log-linear) policy π:

h π π 2i NPG: π(a s) π(a s) exp(w? φs,a)/Zs, w? argminw Es∼dπ ,a∼π(·|s) A (s, a) w φs,a , | ← | · ∈ ρ − · π where Zs is normalization constant. While the policy update uses the original features φ instead of φ , whereas the π quadratic error minimization is terms of the centered features φ , this distinction is not relevant due to that we may π also instead use φ (in the policy update) which would result in an equivalent update; the normalization makes the update invariant to (constant) translations of the features. Similarly, an equivalent update for Q-NPG, where we update π directly rather than θ, is:

h π 2i Q-NPG: π(a s) π(a s) exp(w? φs,a)/Zs, w? argminw Es∼dπ ,a∼π(·|s) Q (s, a) w φs,a . | ← | · ∈ ρ − · 127 (On the equivalence of NPG and Q-NPG) If it is the case that the compatible function approximation error is 0, then it straightforward to verify that the NPG and Q-NPG are equivalent algorithms, in that their corresponding policy updates will be equivalent to each other.

13.2.2 Neural Policy Classes

d Now suppose fθ(s, a) is a neural network parameterized by θ R , where the policy class Π is of form in (0.4). Observe: ∈ 0 0 θ log πθ(a s) = gθ(s, a), where gθ(s, a) = θfθ(s, a) Ea ∼πθ (·|s)[ θfθ(s, a )], ∇ | ∇ − ∇ and, using (0.3), the NPG update rule (0.1) is equivalent to:

h 2i πθ  NPG: θ θ + ηw?, w? argminw E πθ A (s, a) w gθ(s, a) ← ∈ s∼dρ ,a∼πθ (·|s) − · (Again, we have rescaled the learning rate η in comparison to (0.1)). The Q-NPG variant of this update rule is:

h 2i πθ  Q-NPG: θ θ + ηw?, w? argminw E πθ Q (s, a) w θfθ(s, a) . ← ∈ s∼dρ ,a∼πθ (·|s) − · ∇

13.3 The NPG “Regret Lemma”

It is helpful for us to consider NPG more abstractly, as an update rule of the form

θ(t+1) = θ(t) + ηw(t). (0.5)

We will now provide a lemma where w(t) is an arbitrary (bounded) sequence, which will be helpful when specialized.

Recall a function f : Rd R is said to be β-smooth if for all x, x0 Rd: → ∈ 0 0 f(x) f(x ) 2 β x x 2 , k∇ − ∇ k ≤ k − k and, due to Taylor’s theorem, recall that this implies:

0 0 β 0 2 f(x ) f(x) f(x) (x x) x x . (0.6) − − ∇ · − ≤ 2 k − k2

The following analysis of NPG is draws close connections to the mirror-descent approach used in online learning (see Section 13.6), which motivates us to refer to it as a “regret lemma”.

Lemma 13.3. (NPG Regret Lemma) Fix a comparison policy πe and a state distribution ρ. Assume for all s (0) ∈ S and a that log πθ(a s) is a β-smooth function of θ. Consider the update rule (0.5), where π is the uniform ∈ A | (0) (T ) (t) distribution (for all states) and where the sequence of weights w , . . . , w , satisfies w 2 W (but is otherwise arbitrary). Define: k k ≤ h (t) (t) (t) i errt = E Ea∼π(·|s) A (s, a) w θ log π (a s) . s∼de e − · ∇ | We have that:

2 T −1 ! n π (t) o 1 log ηβW 1 X min V e(ρ) V (ρ) |A| + + errt . t

Note that when errt = 0, as will be the case with the (tabular) softmax policy class with exact gradients, we obtain a convergence rate of O(p1/T ) using a learning rate of η = O(p1/T ). Note that this is slower than the faster rate of O(1/T ), provided in Theorem 12.7. Obtaining a bound that leads to a faster rate in the setting with errors requires more complicated dependencies on errt than those stated above. Proof: By smoothness (see (0.6)),

(t+1) π (a s) (t) (t+1) (t) β (t+1) (t) 2 log | θ log π (a s) θ θ θ θ π(t)(a s) ≥ ∇ | · − − 2 k − k2 | (t) (t) 2 β (t) 2 = η θ log π (a s) w η w . ∇ | · − 2 k k2

π We use deas shorthand for dρe (note ρ and πe are fixed); for any policy π, we also use πs as shorthand for the vector π( s). Using the performance difference lemma (Lemma 1.16), ·|

 (t) (t+1)  E KL(πs πs ) KL(πs πs ) s∼de e || − e ||  π(t+1)(a s) = log | Es∼deEa∼πe(·|s) π(t)(a s) | h (t) (t)i 2 β (t) 2 ηE Ea∼π(·|s) θ log π (a s) w η w 2 (using previous display) ≥ s∼de e ∇ | · − 2 k k h (t) i 2 β (t) 2 = ηE Ea∼π(·|s) A (s, a) η w 2 s∼de e − 2 k k h (t) (t) (t) i + ηE Ea∼π(·|s) θ log π (a s) w A (s, a) s∼de e ∇ | · −   π (t) 2 β (t) 2 = (1 γ)η V e(ρ) V (ρ) η w η errt − − − 2 k k2 −

Rearranging, we have:

1  1   ηβ  πe (t) (t) (t+1) 2 V (ρ) V (ρ) E KL(πs πs ) KL(πs πs ) + W + errt − ≤ 1 γ η s∼de e || − e || 2 − Proceeding,

T −1 T −1 1 X 1 X πe (t) (t) (t+1) (V (ρ) V (ρ)) E (KL(πs πs ) KL(πs πs )) T − ≤ ηT (1 γ) s∼de e || − e || t=0 − t=0 T −1 1 X ηβW 2  + + err T (1 γ) 2 t − t=0 KL(π π(0)) 2 T −1 Es∼de es ηβW 1 X || + + errt ≤ ηT (1 γ) 2(1 γ) T (1 γ) − − − t=0 T −1 log ηβW 2 1 X |A| + + errt, ≤ ηT (1 γ) 2(1 γ) T (1 γ) − − − t=0 which completes the proof.

129 13.4 Q-NPG: Performance Bounds for Log-Linear Policies

For a state-action distribution υ, define:

 2 πθ  L(w; θ, υ) := Es,a∼υ Q (s, a) w φs,a . − · The iterates of the Q-NPG algorithm can be viewed as minimizing this loss under some (changing) distribution υ. We now specify an approximate version of Q-NPG. It is helpful to consider a slightly more general version of the algorithm in the previous section, where instead of optimizing under a starting state distribution ρ, we have a different starting state-action distribution ν. The motivation for this is similar in spirit to our log barrier regularization: we seek to maintain exploration (and estimation) over the action space even if the current policy does not have coverage over the action space. dπ Analogous to the definition of the state-action visitation measure s0 (see Equation 0.7), we define another state-action dπ visitation measure, s0,a0 , as follows:

∞ X dπ (s, a) := (1 γ) γtPrπ(s = s, a = a s , a ) s0,a0 t t 0 0 (0.7) − t=0 | π where Pr (st = s, at = a s0, a0) is the probability that st = s and at = a, after starting at state s0, taking action a0, and following π thereafter.| Again, we overload notation and write:

π dν (s, a) = Es0,a0∼ν [ds0,a0 ] where ν is a distribution over . S × A Q-NPG will be defined with respect to the on-policy state action measure starting with s0, a0 ν. As per our convention, we define ∼ (t) π(t) d := dν . The approximate version of this algorithm is:

Approx. Q-NPG: θ(t+1) = θ(t) + ηw(t), w(t) argmin L(w; θ(t), d(t)), (0.8) ≈ kwk2≤W where the above update rule also permits us to constrain the norm of the update direction w(t) (alternatively, we could use `2 regularization as is also common in practice). The exact minimizer is denoted as:

(t) (t) (t) w? argmin L(w; θ , d ). ∈ kwk2≤W (t) (t) Note that w? depends on the current parameter θ . Our analysis will take into account both the excess risk (often also referred to as estimation error) and the approxima- tion error. The standard approximation-estimation error decomposition is as follows:

(t) (t) (t) (t) (t) (t) (t) (t) (t) (t) (t) (t) L(w ; θ , d ) = L(w ; θ , d ) L(w? ; θ , d ) + L(w? ; θ , d ) | {z− } | {z } Excess risk Approximation error

Using a sample based approach, we would expect stat = O(1/√N) or better, where N is the number of samples (t) used to estimate. w? In constrast, the approximation error is due to modeling error, and does not tend to 0 with more samples. We will see how these two errors have strikingly different impact on our final performance bound.

Note that we have already considered two cases where approx = 0. For the tabular softmax policy class, it is immediate that approx = 0. A more interesting example (where the state and action space could be infinite) is provided by

130 the linear parameterized MDP model from Chapter 8. Here, provided that we use the log-linear policy class (see Section 13.2.1) with features corresponding to the linear MDP features, it is straightforward to see that approx = 0 for this log-linear policy class. More generally, we will see the effect of model misspecification in our performance bounds. We make the following assumption on these errors: Assumption 13.4 (Approximation/estimation error bounds). Let w(0), w(1), . . . w(T −1) be the sequence of iterates used by the Q-NPG algorithm Suppose the following holds for all t < T :

1.( Excess risk) Suppose the estimation error is bounded as follows:

(t) (t) (t) (t) (t) (t) L(w ; θ , d ) L(w? ; θ , d ) stat − ≤ 2.( Approximation error) Suppose the approximation error is bounded as follows:

(t) (t) (t) L(w? ; θ , d ) approx. ≤ We will also see how, with regards to our estimation error, we will need a far more mild notion of coverage. Here, with respect to any state-action distribution υ, define:  >  Συ = Es,a∼υ φs,aφs,a . We make a the following conditioning assumption: Assumption 13.5 (Relative condition number). Fix a state distribution ρ (this will be what ultimately be the perfor- mance measure that we seek to optimize). Consider an arbitrary comparator policy π? (not necessarily an optimal policy). With respect to π?, define the state-action measure d? as

? π? d (s, a) = d (s) UnifA(a) ρ · ? π? i.e. d samples states from the comparators state visitation measure, dρ and actions from the uniform distribution. Define > w Σd? w sup > = κ, d w Σ w w∈R ν and assume that κ is finite.

We later discuss why it is reasonable to expect that κ is not a quantity related to the size of the state space. The main result of this chapter shows how the approximation error, the excess risk, and the conditioning, determine the final performance. Theorem 13.6. Fix a state distribution ρ; a state-action distribution ν; an arbitrary comparator policy π? (not necessarily an optimal policy). Suppose Assumption 13.5 holds with respect to these choices and that φs,a 2 B k k ≤ for all s, a. Suppose the Q-NPG update rule (in (0.8)) starts with θ(0) = 0, η = p2 log /(B2W 2T ), and the (random) sequence of iterates satisfies Assumption 13.4. We have that: |A| s   r  ?  n π? (t) o BW 2 log 4 d E min V (ρ) V (ρ) |A| + |A| κ stat + approx . t

131 Transfer learning, distribution shift, and the approximation error. In large scale problems, the worst case distri- ? d bution mismatch factor ν is unlikely to be small. However, this factor is ultimately due to transfer learning. Our ∞ (t) (t) (t) (t) approximation error is with respect to the fitting distribution d , where we assume that L(w? ; θ , d ) approx. (t) (t) ? ? ≤ As the proof will show, the relevant notion of approximation error will be L(w? ; θ , d ), where d is the fixed comparators measure. In others words, to get a good performance bound we need to successfully have low transfer learning error to the fixed measure d?. Furthermore, in many modern machine learning applications, this error is often is favorable, in that it is substantially better than worst case theory might suggest. See Section 13.6 for further remarks on this point.

Dimension dependence in κ and the importance of ν. It is reasonable to think about κ as being dimension de- pendent (or worse), but it is not necessarily related to the size of the state space. For example, if φs,a 2 B, then B2 k k ≤ κ σ ( [φ φ> ]) though this bound may be pessimistic. Here, we also see the importance of choice of ν ≤ min Es,a∼ν s,a s,a in having a small (relative) condition number; in particular, this is the motivation for considering the generalization which allows for a starting state-action distribution ν vs. just a starting state distribution µ (as we did in the tabular case). Roughly speaking, we desire a ν which provides good coverage over the features. As the following lemma shows, there always exists a universal distribution ν, which can be constructed only with knowledge of the feature set (without knowledge of d?), such that κ d. ≤ Lemma 13.7. (κ d is always possible) Let Φ = φ(s, a) (s, a) Rd and suppose Φ is a compact set. There always exists≤ a state-action distribution ν, which{ is supported| ∈S on × at A} most ⊂ d2 state-action pairs and which can be constructed only with knowledge of Φ (without knowledge of the MDP or d?), such that:

κ d. ≤

Proof: The distribution can be found through constructing the minimal volume ellipsoid containing Φ, i.e. the Lowner-¨ John ellipsoid. To be added...

Direct policy optimization vs. approximate value function programming methods Part of the reason for the success of the direct policy optimization approaches is to due their more mild dependence on the approximation error. ? d Here, our theoretical analysis has a dependence on a distribution mismatch coefficient, ν , while approximate ∞ value function methods have even worse dependencies. See Chapter 4. As discussed earlier and as can be seen in the regret lemma (Lemma 13.3), the distribution mismatch coefficient is due to that the relevant error for NPG is a transfer error notion to a fixed comparator distribution, while approximate value function methods have more stringent conditions where the error has to be small under, essentially, the distribution of any other policy.

13.4.1 Analysis

Proof: (of Theorem 13.6) For the log-linear policy class, due to that the feature mapping φ satisfies φs,a 2 B, 2 k k ≤ then it is not difficult to verify that log πθ(a s) is a B -smooth function. Using this and the NPG regret lemma (Lemma 13.3), we have: |

r T −1 n π? (t) o BW 2 log 1 X min V (ρ) V (ρ) |A| + errt . t

132 We make the following decomposition of errt:

h (t) (t) (t) i errt = Es∼d?,a∼π?(·|s) A (s, a) w? θ log π (a s) ρ − · ∇ | h (t) (t) (t) i + Es∼d?,a∼π?(·|s) w? w θ log π (a s) . ρ − · ∇ |

0 0 For the first term, using that θ log πθ(a s) = φs,a Ea ∼πθ (·|s)[φs,a ] (see Section 13.2.1), we have: ∇ | − h (t) (t) (t) i Es∼d?,a∼π?(·|s) A (s, a) w? θ log π (a s) ρ − · ∇ | h (t) (t) i h (t) 0 (t) i = Es∼d?,a∼π?(·|s) Q (s, a) w? φs,a Es∼d?,a0∼π(t)(·|s) Q (s, a ) w? φs,a0 ρ − · − ρ − · r r  2  2 (t) (t) (t) 0 (t) Es∼d?,a∼π?(·|s) Q (s, a) w? φs,a + Es∼d?,a0∼π(t)(·|s) Q (s, a ) w? φs,a0 ≤ ρ − · ρ − · r 2 q h  (t)  i (t) ? (t) (t) ? 2 Es∼d ,a∼UnifA Q (s, a) w? φs,a = 2 L(w? ; θ , d ). ≤ |A| ρ − · |A|

? (t) (t) ? where we have used the definition of d and L(w? ; θ , d ) in the last step. Using following crude upper bound,

? ? (t) (t) ? d (t) (t) (t) 1 d (t) (t) (t) L(w? ; θ , d ) L(w? ; θ , d ) L(w? ; θ , d ), ≤ d(t) ∞ ≤ 1 γ ν ∞ − (where the last step uses the defintion of d(t), see Equation 0.7), we have that: s h i ? (t) (t) (t) d (t) (t) (t) Es∼d?,a∼π?(·|s) A (s, a) w? θ log π (a s) 2 |A| L(w? ; θ , d ). (0.9) ρ − · ∇ | ≤ 1 γ ν ∞ −

For the second term, let us now show that:

h (t) (t) (t) i Es∼d?,a∼π?(·|s) w? w θ log π (a s) ρ − · ∇ | s   κ (t) (t) (t) (t) (t) (t) 2 |A| L(w ; θ , d ) L(w? ; θ , d ) (0.10) ≤ 1 γ − − To see this, first observe that a similar argument to the above leads to:

h (t) (t) (t) i Es∼d?,a∼π?(·|s) w? w θ log π (a s) ρ − · ∇ | h (t) (t) i h (t) (t) i = Es∼d?,a∼π?(·|s) w? w φs,a Es∼d?,a0∼π(t)(·|s) w? w φs,a0 ρ − · − ρ − · r 2 q h  (t)   i (t) 2 ? (t) (t) 2 Es,a∼d w? w φs,a = 2 w? w ? , ≤ |A| − · |A| · k − kΣd where we use the notation x 2 := x>Mx for a matrix M and a vector x. From the definition of κ, k kM (t) (t) 2 (t) (t) 2 κ (t) (t) 2 w? w Σ ? κ w? w Σ w? w Σ k − k d ≤ k − k ν ≤ 1 γ k − k d(t) − π(t) (t) (t) (t) using that (1 γ)ν dν (see (0.7)). Due to that w? minimizes L(w; θ , d ) over the set := w : w 2 − ≤ (t) W { k k ≤ W , for any w the first-order optimality conditions for w? imply that: } ∈ W (t) (t) (t) (t) (w w? ) L(w? ; θ , d ) 0. − · ∇ ≥ 133 Therefore, for any w , ∈ W (t) (t) (t) (t) (t) L(w; θ , d ) L(w? ; θ , d ) − h (t) 2i (t) (t) (t) = Es,a∼d(t) w φ(s, a) w? φ(s, a) + w? φ(s, a) Q (s, a) L(w? ; θ , d ) · − · · − − h 2i h (t) i = Es,a∼d(t) w φ(s, a) w? φ(s, a) + 2(w w?)Es,a∼d(t) φ(s, a) w? φ(s, a) Q (s, a) · − · − · − (t) 2 (t) (t) (t) (t) = w w? Σ + (w w? ) L(w? ; θ , d ) k − k d(t) − · ∇ (t) 2 w w? Σ . ≥ k − k d(t) Noting that w(t) by construction in Algorithm 0.8 yields the claimed bound on the second term in (0.10). ∈ W Using the bounds on the first and second terms in (0.9) and (0.10), along with concavity of the square root function, we have that: s s ?   d (t) (t) (t) κ (t) (t) (t) (t) (t) (t) errt 2 |A| L(w? ; θ , d ) + 2 |A| L(w ; θ , d ) L(w? ; θ , d ) . ≤ 1 γ ν ∞ 1 γ − − −

The proof is completed by substitution and using our assumptions on stat and bias.

13.5 Q-NPG Sample Complexity

To be added...

13.6 Bibliographic Remarks and Further Readings

The notion of compatible function approximation was due to [Sutton et al., 1999], which also proved the claim in Lemma 13.1. The close connection of the NPG update rule to compatible function approximation (Lemma 0.3) was noted in [Kakade, 2001]. The regret lemma (Lemma 13.3) for the NPG analysis has origins in the online regret framework in changing MDPs [Even- Dar et al., 2009]. The convergence rates in this chapter are largely derived from [Agarwal et al., 2020d]. The Q-NPG algorithm for the log-linear policy classes is essentially the same algorithm as POLITEX [Abbasi-Yadkori et al., 2019], with a distinction that it is important to use a state action measure over the initial distribution. The analysis and error decomposition of Q-NPG is from [Agarwal et al., 2020d], which has a more general analysis of NPG with function approximation under the regret lemma. This more general approach also permits the analysis of neural policy classes, as shown in [Agarwal et al., 2020d]. Also, [Liu et al., 2019] provide an analysis of the TRPO algorithm [Schulman et al., 2015] (essentially the same as NPG) for neural network parameterizations in the somewhat restrictive linearized “neural tangent kernel” regime.

134 Chapter 14

CPI, TRPO, and More

In this chapter, we consider conservative policy iteration (CPI) and trust-region constrained policy optimization (TRPO). Both CPI and TRPO can be understood as making small incremental update to the policy by forcing that the new policy’s state action distribution is not too far away from the current policy’s. We will see that CPI achieves that by forming a new policy that is a mixture of the current policy and a local greedy policy, while TRPO forcing that by explicitly adding a KL constraint (over polices’ induced trajectory distributions space) in the optimization procedure. We will show that TRPO gives an equivalent update procedure as Natural Policy Gradient. Along the way, we discuss the benefit of incremental policy update, by contrasting it to another family of policy update procedure called Approximate Policy Iteration (API), which performs local greedy policy search and could potentially lead to abrupt policy change. We show that API in general fails to converge or make local improvement, unless under a much stronger concentrability ratio assumption. The algorithm and analysis of CPI is adapted from the original one in [Kakade and Langford, 2002], and the we follow the presentation of TRPO from [Schulman et al., 2015], while making a connection to the NGP algorithm.

14.1 Conservative Policy Iteration

As the name suggests, we will now describe a more conservative version of the policy iteration algorithm, which shifts the next policy away from the current policy with a small step size to prevent drastic shifts in successive state distributions. We consider the discounted MDP here , , P, r, γ, ρ where ρ is the initial state distribution. Similar to Policy Gradient Methods, we assume that we have{S aA restart distribution} µ (i.e., the µ-restart setting). Throughout this section, π π for any policy π, we denote dµ as the state-action visitation starting from s0 µ instead of ρ, and d the state-action ∼ π visitation starting from the true initial state distribution ρ, i.e., s0 ρ. Similarly, we denote Vµ as expected discounted total reward of policy π starting at µ, while V π as the expected discounted∼ total reward of π with ρ as the initial state distribution. We assume is discrete but could be continuous. A S CPI is based on the concept of Reduction to Supervised Learning. Specifically we will use the Approximate Greedy Policy Selector defined in Chapter 4 (Definition ??). We recall the definition of the ε-approximate Greedy Policy Selector ε(π, Π, µ) below. Given a policy π, policy class Π, and a restart distribution µ, denote πb = ε(π, Π, µ), we have that:G G

π π π A (s, π(s)) max π A (s, π(s)) ε. Es∼dµ b Es∼dµ e ≥ πe∈Π −

135 Recall that in Chapter 4 we explained two approach to implement such approximate oracle: one with a reduction to classification oracle, and the other one with a reduction to regression oracle.

14.1.1 The CPI Algorithm

i 0 CPI, summarized in Alg. 9, will iteratively generate a sequence of policies π . Note we use πα = (1 α)π+απ to refer to a randomized policy which at any state s, chooses an action according to π with probability 1 −α and according to 0 0 − t π with probability α. The greedy policy π is computed using the ε-approximate greedy policy selector ε(π , Π, µ). t πt G 0 The algorithm is terminate when there is no significant one-step improvement over π , i.e., Es∼dπt A (s, π (s))) ε. µ ≤

Algorithm 9 Conservative Policy Iteration (CPI) Input: Initial policy π0 Π, accuracy parameter ε. ∈ 1: for t = 0, 1, 2 ... do 0 t 2: π = ε(π , Π, µ) G πt 0 3: if Es∼dπt A (s, π (s)) ε then µ ≤ 4: Return πt 5: end if 6: Update πt+1 = (1 α)πt + απ0 − 7: end for

The main intuition behind the algorithm is that the stepsize α controls the difference between state distributions of πt and πt+1. Let us look into the performance difference lemma to get some intuition on this conservative update. From PDL, we have:

πt+1 πt 1 πt t+1 α πt 0 Vµ Vµ = Es∼dπt+1 A (s, π (s)) = Es∼dπt+1 A (s, π (s)), − 1 γ µ 1 γ µ − − t where the last equality we use the fact that πt+1 = (1 α)πt + απ0 and Aπ (s, πt(s)) = 0 for all s. Thus, if we can 0 −πt 0 πt 0 search for a policy π Π that maximizes Es∼dπt+1 A (s, π (s)) and makes Es∼dπt+1 A (s, π (s)) > 0, then we ∈ µ µ can guarantee policy improvement. However, at episode t, we do not know the state distribution of πt+1 and all we πt πt have access to is dµ . Thus, we explicitly make the policy update procedure to be conservative such that dµ and the πt+1 πt 0 new policy’s distribution dµ is guaranteed to be not that different. Thus we can hope that πt+1 A (s, π (s)) is Es∼dµ πt 0 close to πt A (s, π (s)), and the latter is something that we can manipulate using the greedy policy selector. Es∼dµ Below we formalize the above intuition and show that with small enough α, we indeed can ensure monotonic policy improvement. We start from the following lemma which shows that πt+1 and πt are close to each other in terms of total variation πt+1 πt distance at any state, and dµ and dµ are close as well. Lemma 14.1 (Similar Policies imply similar state visitations). Consider any t, we have that:

t+1 t π ( s) π ( s) 2α, s; ·| − ·| 1 ≤ ∀ Further, we have:

πt+1 πt 2αγ dµ dµ . − 1 ≤ 1 γ −

136 Proof: The first claim in the above lemma comes from the definition of policy update:

t+1 t t 0 π ( s) π ( s) = α π ( s) π ( s) 2α. ·| − ·| 1 ·| − ·| 1 ≤ π Denote Ph as the state distribution resulting from π at time step h with µ as the initial state distribution. We consider πt+1 πt bounding Ph Ph 1 with h 1. k − k ≥ πt+1 0 πt 0 X  πt+1 t+1 πt t  0 Ph (s ) Ph (s ) = Ph−1 (s)π (a s) Ph−1(s)π (a s) P (s s, a) − s,a | − | | X  πt+1 t+1 πt+1 t πt+1 t πt t  0 = Ph−1 (s)π (a s) Ph−1 (s)π (a s) + Ph−1 (s)π (a s) Ph−1(s)π (a s) P (s s, a) s,a | − | | − | | X πt+1 X t+1 t  0 = Ph−1 (s) π (a s) π (a s) P (s s, a) s a | − | | X  πt+1 πt  X t 0 + Ph−1 (s) Ph−1(s) π (a s)P (s s, a). s − a | | Take absolute value on both sides, we get:

t+1 t t+1 X π 0 π 0 X π X t+1 t X 0 Ph (s ) Ph (s ) Ph−1 (s) π (a s) π (a s) P (s s, a) − ≤ | − | | s0 s a s0 t+1 t X π π X X t 0 + Ph−1 (s) Ph−1(s) π (a s)P (s s, a) − | | s s0 a πt+1 πt πt+1 πt 2α + Ph−1 Ph−1 1 4α + Ph−2 Ph−2 1 = 2hα. ≤ k − k ≤ k − k π Now use the definition of dµ, we have: ∞ πt+1 πt X h  πt+1 πt  dµ dµ = (1 γ) γ Ph Ph . − − − h=0

Add `1 norm on both sides, we get: ∞ πt+1 πt X h dµ dµ (1 γ) γ 2hα − 1 ≤ − h=0 P∞ h γ It is not hard to verify that h=0 γ h = (1−γ)2 . Thus, we can conclude that:

πt+1 πt 2αγ dµ dµ . − 1 ≤ 1 γ −

The above lemma states that if πt+1 and πt are close in terms of total variation distance for every state, then the total variation distance between the resulting state visitations from πt+1 and πt will be small up to a effective horizon 1/(1 γ) amplification. − The above lemma captures the key of the conservative policy update. Via the conservative policy update, we make πt+1 πt sure that dµ and dµ are close to each other in terms of total variation distance. Now we use the above lemma to show a monotonic policy improvement. πt 0 Theorem 14.2 (Monotonic Improvement in CPI). Consider any episode t. Denote = πt A (s, π (s)). We A Es∼dµ have:   πt+1 πt α 2αγ Vµ Vµ A − ≥ 1 γ − (1 γ)2 − −

137 2 A(1−γ) Set α = 4γ , we get:

2 πt+1 πt A (1 γ) V V − . µ − µ ≥ 8γ

The above lemma shows that as long as we still have positive one-step improvement, i.e., A > 0, then we guarantee that πt+1 is strictly better than πt. Proof: Via PDL, we have:

πt+1 πt 1 πt 0 Vµ Vµ = Es∼dπt+1 αA (s, π (s)). − 1 γ µ − Recall Lemma 14.1, we have:

 πt+1 πt  πt 0 πt 0 πt 0 (1 γ) Vµ Vµ = Es∼dπt αA (s, π (s)) + Es∼dπt+1 αA (s, π (s)) Es∼dπt αA (s, π (s)) − − µ µ − µ πt 0 π πt πt+1 Es∼dπt αA (s, π (s)) α sup A (s, a) dµ dµ 1 ≥ µ − s,a,π | |k − k πt 0 α πt πt+1 Es∼dπt αA (s, π (s)) dµ dµ 1 ≥ µ − 1 γ k − k − 2   πt 0 2α γ 2αγ Es∼dπt αA (s, π (s)) = α A ≥ µ − (1 γ)2 − (1 γ)2 − − where the first inequality we use the fact that for any two distributions P1 and P2 and any function f, Ex∼P1 f(x) π | − Ex∼P2 f(x) supx f(x) P1 P2 1, for the second inequality, we use the fact that A (s, a) 1/(1 γ) for any π, s, a, and| the ≤ last inequality| |k uses− Lemmak 14.1. | | ≤ − For the second part of the above theorem, note that we want to maximum the policy improvement as much as possible by choosing α. So we can pick α which maximizes α(A 2αγ/(1 γ)2) . This gives us the α we claimed in the − − lemma. Plug in α back into α(A 2αγ/(1 γ)2), we conclude the second part of the theorem. − −

The above theorem indicates that with the right choice of α, we guarantee that the policy is making improvement as long as A > 0. Recall the termination criteria in CPI where we terminate CPI when A ε. Putting these results together, we obtain the following overall convergence guarantee for the CPI algorithm. ≤ Theorem 14.3 (Local optimality of CPI). Algorithm 9 terminates in at most 8γ/2 steps and outputs a policy πt πt satisfying maxπ∈Π Es∼dπt A (s, π(s)) 2ε. µ ≤

π Proof: Note that our reward is bounded in [0, 1] which means that Vµ [0, 1/(1 γ)]. Note that we have shown in ∈2 − A (1−γ) Theorem 14.2, every iteration t, we have policy improvement at least 8γ , where recall A at episode t is defined πt 0 as = πt A (s, π (s)). If the algorithm does not terminate at episode t, then we guarantee that: A Es∼dµ

2 t+1 t ε (1 γ) V π V π + − µ ≥ µ 8γ Since V π is upper bounded by 1/(1 γ), so we can at most make improvement 8γ/2 many iterations. µ − 0 t Finally, recall that ε-approximate greedy policy selector π = ε(π , Π, µ), we have: G πt πt 0 max Es∼dπt A (s, π(s)) Es∼dπt A (s, π (s)) + ε 2ε π∈Π µ ≤ µ ≤

138 This concludes the proof. Theorem 14.3 can be viewed as a local optimality guarantee in a sense. It shows that when CPI terminates, we cannot find a policy π Π that achieves local improvement over the returned policy more than ε. However, this does not necessarily imply∈ that the value of π is close to V ?. However, similar to the policy gradient analysis, we can turn this ? local guarantee into a global one when the restart distribution µ covers dπ . We formalize this intuition next. Theorem 14.4 (Global optimality of CPI). Upon termination, we have a policy π such that:

π? ? π 2ε + Π d V V , − ≤ (1 γ)2 µ − ∞ π π where Π := Es∼dπ [maxa∈A A (s, a)] maxπ∈Π Es∼dπ [A (s, π(s))]. µ − µ

π π In other words, if our policy class is rich enough to approximate the policy maxa∈A A (s, a) under dµ, i.e., Π is ? π? π d small, and µ covers d in a sense that µ , CPI guarantees to find a near optimal policy. ∞ ≤ ∞ Proof: By the performance difference lemma,

? π 1 π ? V V = Es∼dπ? A (s, π (s)) − 1 γ − 1 π Es∼dπ? max A (s, a) ≤ 1 γ a∈A − π∗ 1 d π Es∼dπ max A (s, a) ≤ 1 γ dπ µ a∈A − µ ∞ π∗ 1 d π Es∼dπ max A (s, a) ≤ (1 γ)2 µ µ a∈A − ∞ π∗   1 d π π π = max Es∼dπ A (s, πˆ(s)) max Es∼dπ A (s, πˆ(s)) + Es∼dπ max A (s, a) ≤ (1 γ)2 µ πˆ∈Π µ − πˆ∈Π µ µ a∈A − ∞ π∗ 1 d (2ε + Π) , ≤ (1 γ)2 µ − ∞ π where the second inequality holds due to the fact that maxa A (s, a) 0, the third inequality uses the fact that π ≥ d (s) (1 γ)µ(s) for any s and π, and the last inequality uses the definition Π and Theorem 14.3. µ ≥ − It is informative to contrast CPI and policy gradient algorithms due to the similarity of their guarantees. Both provide local optimality guarantees. For CPI, the local optimality always holds, while for policy gradients it requires a smooth value function as a function of the policy parameters. If the distribution mismatch between an optimal policy and the output of the algorithm is not too large, then both algorithms further yield a near optimal policy. The similarities are not so surprising. Both algorithms operate by making local improvements to the current policy at each iteration, by inspecting its advantage function. The changes made to the policy are controlled using a stepsize parameter in both the approaches. It is the actual mechanism of the improvement which differs in the two cases. Policy gradients assume that the policy’s reward is a differentiable function of the parameters, and hence make local improvements through gradient ascent. The differentiability is certainly an assumption and does not necessarily hold for all policy classes. An easy example is when the policy itself is not an easily differentiable function of its parameters. For instance, if the policy is parametrized by regression trees, then performing gradient updates can be challenging. In CPI, on the other hand, the basic computational primitive required on the policy class is the ability to maximize the advantage function relative to the current policy. Notice that Algorithm 9 does not necessarily restrict to a policy class, such as a set of parametrized policies as in policy gradients. Indeed, due to the reduction to supervised learning approach (e.g., using the weighted classification oracle CO), we can parameterize policy class via decision tree, for

139 instance. This property makes CPI extremely attractive. Any policy class over which efficient supervised learning algorithms exist can be adapted to reinforcement learning with performance guarantees. A second important difference between CPI and policy gradients is in the notion of locality. Policy gradient updates are local in the parameter space, and we hope that this makes small enough changes to the state distribution that the new policy is indeed an improvement on the older one (for instance, when we invoke the performance difference lemma between successive iterates). While this is always true in expectation for correctly chosen stepsizes based on properties of stochastic gradient ascent on smooth functions, the variance of the algorithm and lack of robustness to suboptimal stepsizes can make the algorithm somewhat finicky. Indeed, there are a host of techniques in the literature to both lower the variance (through control variates) and explicitly control the state distribution mismatch between successive iterates of policy gradients (through trust region techniques). On the other hand, CPI explicitly controls the amount of perturbation to the state distribution by carefully mixing policies in a manner which does not drastically alter the trajectories with high probability. Indeed, this insight is central to the proof of CPI, and has been instrumental in several follow-ups, both in the direct policy improvement as well as policy gradient literature.

14.2 Trust Region Methods and Covariant Policy Search

So far we have seen policy gradient methods and CPI which all uses a small step-size to ensure incremental update in policies. Another popular approach for incremental policy update is to explicitly forcing small change in policies’ distribution via a trust region constraint. More specifically, let us go back to the general policy parameterization πθ. At iteration t with the current policy πθt , we are interested in the following local trust-region constrained optimization:

πθt θ max E π t Ea∼πθ (·|s)A (s, a) θ s∼dµ

 θt  s.t., KL Prπ Prπθ δ, µ || µ ≤

π where recall Pr (τ) is the trajectory distribution induced by π starting at s0 µ, and KL(P1 P2) are KL-divergence µ ∼ || between two distribution P1 and P2. Namely we explicitly perform local policy search with a constraint forcing the πθt new policy not being too far away from Prµ in terms of KL divergence. As we are interested in small local update in parameters, we can perform sequential quadratic programming here, i.e., we can further linearize the objective function at θt and quadratize the KL constraint at θt to form a local quadratic programming:

D πθt E max θt ln π (a s)A (s, a), θ (0.1) Es∼dπ Ea∼πθt (·|s) θ θt θ µ ∇ |

 θt    θt   π πθ 1 > 2 π πθ s.t., θKL Pr Pr θ=θ , θ θt + (θ θt) KL Pr Pr θ=θ (θ θt) δ, (0.2) h∇ µ || µ | t − i 2 − ∇θ µ || µ | t − ≤

2 where we denote KL θ=θt as the Hessian of the KL constraint measured at θt. Note that KL divergence is not a valid metric as it∇ is not| symmetric. However, its local quadratic approximation can serve as a valid local distance 2 metric, as we prove below that the Hessian KL θ=θt is a positive semi-definite matrix. Indeed, we will show that the Hessian of the KL constraint is exactly equal∇ to the| fisher information matrix, and the above quadratic programming exactly reveals a Natural Policy Gradient update. Hence Natural policy gradient can also be interpreted as performing sequential quadratic programming with KL constraint over policy’s trajectory distributions. To match to the practical algorithms in the literature (e.g., TRPO), below we focus on episode finite horizon setting again (i.e., an MDP with horizon H).

140 Claim 14.5. Consider a finite horizon MDP with horizon H. Consider any fixed θt. We have:

 θt  π πθ θKL Pr Pr θ=θ = 0, ∇ µ || µ | t  πθt π  > 2 θ π θKL Prµ Prµ θ=θt = HEs,a∼d θt ln πθt (a s)( ln πθt (a s)) . ∇ || | ∇ | ∇ |

Proof: We first recall the trajectory distribution in finite horizon setting.

H−1 π Y Pr (τ) = µ(s0) π(ah sh)P (sh+1 sh, ah). µ | | h=0

We first prove that the gradient of KL is zero. First note that:

πθt  θt  X θt Prµ (τ) KL Prπ Prπθ = Prπ (τ) ln µ µ µ πθ || τ Prµ (τ) H−1 ! H−1 θ X π t X πθt (ah sh) X πθt (ah sh) = Pr (τ) ln = πθ ln . µ | Es ,a ∼ t | πθ(ah sh) h h Ph πθ(ah sh) τ h=0 | h=0 |

 θt  π πθ Thus, for θKL Pr Pr , we have: ∇ µ || µ H−1  θt  π πθ X KL Pr Pr = πθ ln π (a s ) µ µ θ=θt Es ,a ∼ t θt h h ∇ || | h h Ph − ∇ | h=0 H−1 X π = θt ln πθ (ah sh) = 0, Es ∼ Eah∼πθt (·|sh) t − h Ph ∇ | h=0 where we have seen the last step when we argue the unbiased nature of policy gradient with an action independent baseline. Now we move to the Hessian. H−1  θt  2 π πθ X 2 KL Pr Pr = πθ ln π (a s ) µ µ θ=θt Es ,a ∼ t θ h h θ=θt ∇ || | − h h Ph ∇ | | h=0 H−1    X πθ(a s) = πθ Es ,a ∼ t ∇ | − h h Ph ∇ πθ(a s) h=0 | H−1  2 >  X πθ(ah sh) πθ(ah sh) πθ(ah sh) = πθ Es ,a ∼ t ∇ | ∇ | 2 ∇ | − h h Ph πθ(ah sh) − π (ah sh) h=0 | θ | H−1 X > = πθ ln π (a s ) ln π (a s ) , Es ,a ∼ t θ θ h h θ θ h h h h Ph ∇ | ∇ | h=0

∇2π (a |s ) π θ h h where in the last equation we use the fact that E θt π (a |s ) = 0. sh,ah∼Ph θ h h The above claim shows that a second order taylor expansion of the KL constraint over trajectory distribution gives a local distance metric at θt:

(θ θt)Fθ (θ θt), − t −

141 > π where again Fθt := HEs,a∼d θt ln πθt (a s)( ln πθt (a s)) is proportional to the fisher information matrix. Note ∇ | ∇ | that Fθt is a PSD matrix and thus d(θ, θt) := (θ θt)Fθt (θ θt) is a valid distance metric. By sequential quadratic programming, we are using local geometry information− of the− trajectory distribution manifold induced by the param- eterization θ, rather the naive Euclidean distance in the parameter space. Such a method is sometimes referred to as Covariant Policy Search, as the policy update procedure will be invariant to linear transformation of parameterization (See Section 14.3 for further discussion). Now using the results from Claim 14.5, we can verify that the local policy optimization procedure in Eq. 0.2 exactly recovers the NPG update, where the step size is based on the trust region parameter δ. Denote ∆ = θ θt, we have: −

πθ max ∆, θV t , ∆ h ∇ i > > s.t., ∆ Fθ ∆ δ, t ≤ which gives the following update procedure: s δ −1 π θ = θ + ∆ = θ + F V θt , t+1 t t π −1 π θt ( V θt )>F V θt · ∇ ∇ θt ∇ where note that we use the self-normalized learning rate computed using the trust region parameter δ.

14.2.1 Proximal Policy Optimization

Here we consider an `∞ style trust region constraint:

πθt max E πθt Ea∼πθ (·|s)A (s, a) (0.3) θ s∼dµ

θ θt s.t., sup π ( s) π ( s) tv δ, (0.4) s ·| − ·| ≤

Namely, we restrict the new policy such that it is close to πθt at every state s under the total variation distance. Recall the CPI’s update, CPI indeed makes sure that the new policy will be close the old policy at every state. In other words, the new policy computed by CPI is a feasible solution of the constraint Eq. 0.4, but is not the optimal solution of the above constrained optimization program. Also one downside of the CPI algorithm is that one needs to keep all previous learned policies around, which requires large storage space when policies are parameterized by large deep neural networks. Proximal Policy Optimization (PPO) aims to directly optimize the objective Eq. 0.3 using multiple steps of gradient updates, and approximating the constraints Eq. 0.4 via a clipping trick. We first rewrite the objective function using importance weighting:

θ π (a s) πθt max Es∼dπθt Ea∼πθt (·|s) θ | A (s, a), (0.5) θ µ π t ( s) ·|

θt π θt where we can easily approximate the expectation Es∼dπθt Ea∼πθt (·|s) via finite samples s d , a π ( s). µ ∼ ∼ ·| To make sure πθ(a s) and πθt (a s) are not that different, PPO modifies the objective function by clipping the density | | ratio πθ(a s) and πθt (a s): | |  θ  θ   π (a s) πθt π (a s) πθt L(θ) := Es∼dπθt Ea∼πθt (·|s) min θ | A (s, a), clip θ | ; 1 , 1 +  A (s, a) , (0.6) µ π t ( s) π t ( s) − ·| ·|

142  1  x 1   − ≤ − where clip (x; 1 , 1 + ) = 1 +  x 1 + . The clipping operator ensures that for πθ, at any state action pair − ≥ x else θ h   θt i θ θt π (a|s) π where π (a s)/π (a s) [1 , 1 + ], we get zero gradient, i.e., θ clip θ ; 1 , 1 +  A (s, a) = 0. | | 6∈ − ∇ π t (·|s) − The outer min makes sure the objective function L(θ) is a lower bound of the original objective. PPO then proposes θt π θt to collect a dataset (s, a) with s dµ and a π ( s), and then perform multiple steps of mini-batch stochastic gradient ascent on L(θ). ∼ ∼ ·| One of the key difference between PPO and other algorithms such as NPG is that PPO targets to optimizes objective θt π θt Eq. 0.5 via multiple steps of mini-batch stochastic gradient ascent with mini-batch data from dµ and π , while algorithm such as NPG indeed optimizes the first order taylor expansion of Eq. 0.5 at θt, i.e.,

D θt E t θt π max (θ θ ), Es,a∼dπθt θ ln π (a s)A (s, a) , θ − µ ∇ |

2 upper to some trust region constraints (e.g., θ θt 2 δ in policy gradient, and θ θt F δ for NPG ). k − k ≤ k − k θt ≤

14.3 Bibliographic Remarks and Further Readings

The analysis of CPI is adapted from the original one in [Kakade and Langford, 2002]. There have been a few further interpretations of CPI. One interesting perspective is that CPI can be treated as a boosting algorithm [Scherrer and Geist, 2014]. More generally, CPI and NPG are part of family of incremental algorithms, including Policy Search by Dynamic Programming (PSDP) [Bagnell et al., 2004] and MD-MPI [Geist et al., 2019]. PSDP operates in a finite horizon setting and optimizes a sequence of time-dependent policies; from the last time step to the first time step, every iteration of, PSDP only updates the policy at the current time step while holding the future policies fixed — thus making incremental update on the policy. See [Scherrer, 2014] for more a detailed discussion and comparison of some of these approaches. Mirror Descent-Modified Policy Iteration (MD-MPI) algorithm [Geist et al., 2019] is a family of actor-critic style algorithms which is based on regularization and is incremental in nature; with negative entropy as the Bregman divergence (for the tabular case), MD-MPI recovers the NPG the tabular case (for the softmax parameterization). Broadly speaking, these incremental algorithms can improve upon the stringent concentrability conditions for approx- imate value iteration methods, presented in Chapter 4. Scherrer [2014] provide a more detailed discussion of bounds which depend on these density ratios. As discussed in the last chapter, the density ratio for NPG can be interpreted as a factor due to transfer learning to a single, fixed distribution. The interpretation of NPG as Covariant Policy Search is due to [Bagnell and Schneider, 2003], as the policy update procedure will be invariant to linear transformations of the parameterization; see [Bagnell and Schneider, 2003] for a more detailed discussion on this. The TRPO algorithm is due to [Schulman et al., 2015]. The original TRPO analysis provides performance guarantees, largely relying on a reduction to the CPI guarantees. In this Chapter, we make a sharper connection of TRPO to NPG, which was subsequently observed by a number of researchers; this connection provides a sharper analysis for the generalization and approximation behavior of TRPO (e.g. via the results presented in Chapter 13). In practice, a popular variant is the Proximal Policy Optimization (PPO) algorithm [Schulman et al., 2017].

143 144 Part 4

Further Topics

145

Chapter 15

Imitation Learning

In this chapter, we study imitation learning. Unlike the Reinforcement Learning setting, in Imitation Learning, we do not have access to the ground truth reward function (or cost function), but instead, we have expert demonstrations. We often assume that the expert is a policy that approximately optimizes the underlying reward (or cost) functions. The goal is to leverage the expert demonstrations to learn a policy that performs as good as the expert. We consider three settings of imitation learning: (1) pure offline where we only have expert demonstrations and no more real world interaction is allowed, (2) hybrid where we have expert demonstrations, and also is able to interact with the real world (e.g., have access to the ground truth transition dynamics); (3) interactive setting where we have an interactive expert and also have access to the underlying reward (cost) function.

15.1 Setting

We will focus on finite horizon MDPs = , , r, µ, P, H where r is the reward function but is unknown to the learner. We represent the expert as a closed-loopM {S A policy π? :} ∆( ). For analysis simplicity, we assume the expert π? is indeed the optimal policy of the original MDP Swith 7→ respectA to the ground truth reward r. Again, our M π ? π goal is to learn a policy πb : ∆( ) that performs as well as the expert, i.e., V b needs to be close to V , where V denotes the expected total rewardS 7→ ofA policy π under the MDP . We denote dπ as the state-action visitation of policy π under . M M ? We assume we have a pre-collected expert dataset in the format of s?, a? M where s?, a? dπ . { i i }i=1 i i ∼

15.2 Offline IL: Behavior Cloning

We study offline IL here. Specifically, we study the classic Behavior Cloning algorithm. We consider a policy class Π = π : ∆(A) . We consider the following realizable assumption. { S 7→ } Assumption 15.1. We assume Π is rich enough such that π? Π. ∈ For analysis simplicity, we assume Π is discrete. But our sample complexity will only scale with respect to ln( Π ). | | Behavior cloning is one of the simplest Imitation Learning algorithm which only uses the expert data ? and does not require any further interaction with the MDP. It computes a policy via a reduction to supervised learning.D Specifically,

147 we consider a reduction to Maximum Likelihood Estimation (MLE):

N X Behavior Cloning (BC): π = argmax ln π(a? s?). (0.1) b π∈Π i | i i=1 Namely we try to find a policy from Π that has the maximum likelihood of fitting the training data. As this is a reduction to MLE, we can leverage the existing classic analysis of MLE (e.g., Chapter 7 in [Geer and van de Geer, 2000]) to analyze the performance of the learned policy. Theorem 15.2 (MLE Guarantee). Consider the MLE procedure (Eq. 0.1). With probability at least 1 δ, we have: − ? 2 ln ( Π /δ) Es∼dπ? π( s) π ( s) | | . kb ·| − ·| ktv ≤ M

We refer readers to Section E, Theorem 21 in [Agarwal et al., 2020b] for detailed proof of the above MLE guarantee.

? π π? Now we can transfer the average divergence between πb and π to their performance difference V b and V . ? π? π? One thing to note is that BC only ensures that the learned policy πb is close to π under d . Outside of d ’s support, ? we have no guarantee that πb and π will be close to each other. The following theorem shows that a compounding error occurs when we study the performance of the learned policy V pib . Theorem 15.3 (Sample Complexity of BC). With probability at least 1 δ, BC returns a policy π such that: − b r 2 ln( Π /δ) V ? V πb | | . − ≤ (1 γ)2 M − π Proof: We start with the performance difference lemma and the fact that Ea∼π(·|s)A (s, a) = 0:

 ? π π b ? b (1 γ) V V = Es∼dπ Ea∼π?(·|s)A (s, a) − − π π ? b ? b = Es∼dπ Ea∼π?(·|s)A (s, a) Es∼dπ Ea∼π(·|s)A (s, a) − b 1 ? Es∼dπ? π ( s) π( s) ≤ 1 γ k ·| − b ·| k1 q − 1 ? 2 Es∼dπ? π ( s) π( s) ≤ 1 γ k ·| − b ·| k1 − q 1 ? 2 = 4Es∼dπ? π ( s) π( s) . 1 γ k ·| − b ·| ktv − π 1 2 2 where we use the fact that sups,a,π A (s, a) , and the fact that (E[x]) E[x ]. | | ≤ 1−γ ≤ Using Theorem 15.2 and rearranging terms conclude the proof. For Behavior cloning, from the supervised learning error the quadratic polynomial dependency on the effect horizon 1/(1 γ) is not avoidable in worst case [Ross and Bagnell, 2010]. This is often referred as the distribution shift issue ? in the− literature. Note that π is trained under dπ , but during execution, π makes prediction on states that are generated b ? b by itself, i.e., dπb, instead of the training distribution dπ .

15.3 The Hybrid Setting: Statistical Benefit and Algorithm

The question we want to answer here is that if we know the underlying MDP’s transition P (but the reward is still unknown), can we improve Behavior Cloning? In other words:

148 what is the benefit of the known transition in addition to the expert demonstrations?

Instead of a quadratic dependency on the effective horizon, we should expect a linear dependency on the effective horizon. The key benefit of the known transition is that we can test our policy using the known transition to see how far away we are from the expert’s demonstrations, and then use the known transition to plan to move closer to the expert demonstrations. In this section, we consider a statistical efficient, but computationally intractable algorithm, which we use to demon- strate that informationally theoretically, by interacting with the underlying known transition, we can do better than Behavior Cloning. In the next section, we will introduce a popular and computationally efficient algorithm Maximum Entropy Inverse Reinforcement Learning (MaxEnt-IRL) which operates under this setting (i.e., expert demonstrations with a known transition). We start from the same policy class Π as we have in the BC algorithm, and again we assume realizability (Assump- tion 15.1) and Π is discrete. Since we know the transition P , information theoretically, for any policy π, we have dπ available (though it is compu- ? tationally intractable to compute dπ for large scale MDPs). We have (s?, a?)M dπ . i i i=1 ∼ Below we present an algorithm which we called Distribution Matching with Scheffe´ Tournament (DM-ST).

0 For any two policies π and π , we denote fπ,π0 as the following witness function: h i fπ,π0 := argmaxf:kfk ≤1 Es,a∼dπ f(s, a) Es,a∼dπ0 f(s, a) . ∞ − We denote the set of witness functions as:

0 0 = fπ,π0 : π, π Π, π = π . F { ∈ 6 } Note that Π 2. |F| ≤ | | DM-ST selects πb using the following procedure: " " M ## 1 X ? ? DM-ST: π argminπ∈Π max Es,a∼dπ f(s, a) f(si , ai ) . b ∈ f∈F − M i=1

The following theorem captures the sample complexity of DM-ST. Theorem 15.4 (Sample Complexity of DM-ST). With probability at least 1 δ, DM-ST finds a policy π such that: − b s 4 2 ln( Π ) + ln 1  V ? V πb | | δ . − ≤ 1 γ M − Proof: The proof basically relies on a uniform convergence argument over of which the size is Π 2. First we note that for all policy π Π: F | | ∈ π π? π ? π ? max [Es,a∼d f(s, a) Es,a∼d f(s, a)] = max [Es,a∼d f(s, a) Es,a∼d f(s, a)] = d d , f∈F − f:kfk∞≤1 − − 1

π ? where the first equality comes from the fact that includes arg maxf:kfk∞≤1 [Es,a∼d f(s, a) Es,a∼d f(s, a)]. F − Via Hoeffding’s inequality and a union bound over , we get that with probability at least 1 δ, for all f : F − ∈ F M r 1 X ? ? ln( /δ) ? f(si , ai ) Es,a∼d f(s, a) 2 |F| := stat. M − ≤ M i=1

149   1 M ? P Denote fb := arg maxf∈F Es,a∼dπb f(s, a) Es,a∼d f(s, a) , and fe := arg maxf∈F Es,a∼dπb f(s, a) f(si, ai). − − M i=1 Hence, for πb, we have:

M π ? 1 X ? ? b ? d d = Es,a∼dπb fb(s, a) Es,a∼d fb(s, a) Es,a∼dπb fb(s, a) fb(si , ai ) + stat − 1 − ≤ − M i=1 M 1 X Es,a∼dπb fe(s, a) fe(si, ai) + stat ≤ − M i=1 M 1 X Es,a∼dπ? fe(s, a) fe(si, ai) + stat ≤ − M i=1

Es,a∼dπ? fe(s, a) Es,a∼d? fe(s, a) + 2stat = 2stat, ≤ − where in the third inequality we use the optimality of πb. π Recall that V = Es,a∼dπ r(s, a)/(1 γ), we have: −

π ? 1  sups,a r(s, a) π ? 2 b ? b V V = Es,a∼dπb r(s, a) Es,a∼d r(s, a) | | d d stat. − 1 γ − ≤ 1 γ − 1 ≤ 1 γ − − − This concludes the proof. Note that above theorem confirms the statistical benefit of having the access to a known transition: comparing to the classic Behavior Cloning algorithm, the approximation error of DM-ST only scales linearly with respect to horizon 1/(1 γ) instead of quadratically. −

15.3.1 Extension to Agnostic Setting

So far we focused on agnostic learning setting. What would happen if π? Π? We can still run our DM-ST algorithm as is. We state the sample complexity of DM-ST in agnostic setting below.6∈

Theorem 15.5 (Agnostic Guarantee of DM-ST). Assume Π is finite, but π? Π. With probability at least 1 δ, 6∈ − DM-ST learns a policy πb such that: r ! ? π 1 ? π 3 π ? 1 ln( Π + ln(1/δ)) V V b d de 1 min d d 1 + Oe | | . − ≤ 1 γ k − k ≤ 1 γ π∈Π k − k 1 γ M − − −

π ? Proof: We first define some terms below. Denote π := argmin d d 1. Let us denote: e π∈Π k − k   fe = argmaxf∈F Es,a∼dπb f(s, a) Es,a∼dπe f(s, a) , − " M # 1 X ? ? f = argmaxf∈F Es,a∼dπb f(s, a) f(si , ai ) , − M i=1 " M # 0 1 X ? ? f = arg max Es,a∼dπe [f(s, a)] f(si , ai ) . f∈F − M i=1

150 Starting with a triangle inequality, we have:

? ? dπb dπ dπb dπe + dπe dπ − 1 ≤ − 1 − 1 h i h i ? πe π = Es,a∼dπb fe(s, a) Es,a∼dπe fe(s, a) + d d − − 1 M M h i 1 X 1 X h i ? ? ? ? ? πe π = Es,a∼dπb fe(s, a) fe(si , ai ) + fe(si , ai ) Es,a∼dπe fe(s, a) + d d − M M − − 1 i=1 i=1 M M   1 X ? ? 1 X ? ? ? Es,a∼dπb f(s, a) f(si , ai ) + fe(si , ai ) Es,a∼d fe(s, a) ≤ − M M − i=1 i=1 h h ii π π? ? e + Es,a∼d fe(s, a) Es,a∼dπe fe(s, a) + d d − − 1 M r 1 X ln( /δ) 0 0 ? ? πe ? Es,a∼dπe [f (s, a)] f (si , ai ) + 2 |F| + 2 d d 1 ≤ − M M k − k i=1 r 0 0 ln( /δ) π ? ? e Es,a∼dπe [f (s, a)] Es,a∼d [f (s, a)] + 4 |F| + 2 d d 1 ≤ − M k − k r ? π ln( /δ) 3 d de 1 + 4 |F| , ≤ k − k M where the first inequality uses the definition of f, the second inequality uses the fact that πb is the minimizer of 1 PM ? ? maxf∈F Es,a∼dπ f(s, a) M i=1 f(si , ai ), along the way we also use Heoffding’s inequality where f , PM − ? ? p ∀ ∈ F Es,a∼d? f(s, a) f(si , ai ) 2 ln( /δ)/M, with probability at least 1 δ. | − i=1 | ≤ |F| − π As we can see, comparing to the realizable setting, here we have an extra term that is related to minπ∈Π d ? k − d 1. Note that the dependency on horizon also scales linearly in this case. In general, the constant 3 in front of k π ? minπ∈Π d d 1 is not avoidable in Scheffe´ estimator [Devroye and Lugosi, 2012]. k − k

15.4 Maximum Entropy Inverse Reinforcement Learning

? ? ? N Similar to Behavior cloning, we assume we have a dataset of state-action pairs from expert = si , ai i=1 where ? ? π? D { } si , ai d . Different from Behavior cloning, here we assume that we have access to the underlying MDP’s transi- tion, i.e.,∼ we assume transition is known and we can do planning if we were given a cost function.

We assume that we are given a state-action feature mapping φ : Rd (this can be extended infinite dimensional feature space in RKHS, but we present finite dimensional settingS ×Afor simplicity). 7→ We assume the true ground truth cost ? ? ? function as c(s, a) := θ φ(s, a) with θ 2 1 and θ being unknown. · k k ≤ The goal of the learner is to compute a policy π : ∆(A) such that when measured under the true cost function, it ? S 7→ ? performs as good as the expert, i.e., Es,a∼dπ θ φ(s, a) Es,a∼dπ? θ φ(s, a). · ≈ · We will focus on finite horizon MDP setting in this section. We denote ρπ(τ) as the trajectory distribution induced by π and dπ as the average state-action distribution induced by π.

151 15.4.1 MaxEnt IRL: Formulation and The Principle of Maximum Entropy

MaxEnt IRL uses the principle of Maximum Entropy and poses the following policy optimization optimization pro- gram: X max ρπ(τ) ln ρπ(τ), π − τ N X ? ? s.t., Es,a∼dπ φ(s, a) = φ(si , ai )/N. i=1

N P ? ? ? Note that i=1 φ(si , ai )/N is an unbiased estimate of Es,a∼dπ φ(s, a). MaxEnt-IRL searches for a policy that maximizes the entropy of its trajectory distribution subject to a moment matching constraint.

Note that there could be many policies that satisfy the moment matching constraint, i.e., Es,a∼dπ φ(s, a) = Es,a∼dπ? φ(s, a), and any feasible solution is guaranteed to achieve the same performance of the expert under the ground truth cost func- tion θ? φ(s, a). The maximum entropy objective ensures that the solution is unique. · Using the Markov property, we notice that:

X π π argmaxπ ρ (τ) ln ρ (τ) = argmaxπ Es,a∼dπ ln π(a s). − τ − |

This, we can rewrite the MaxEnt-IRL as follows:

min Es,a∼dπ ln π(a s), (0.2) π | N X ? ? s.t., Es,a∼dπ φ(s, a) = φ(si , ai )/N. (0.3) i=1

15.4.2 Algorithm

P ? ? ? To better interpret the objective function, below we replace i φ(si , ai )/N by its expectation Es,a∼dπ φ(s, a). Note P ? ? ? that we can use standard concentration inequalities to bound the difference between i φ(si , ai )/N and Es,a∼dπ φ(s, a). Using Lagrange multipliers, we can rewrite the constrained optimization program in Eq. 0.2 as follows:

> > min Es,a∼dπ ln π(a s) + max Es,a∼dπ θ φ(s, a) Es,a∼dπ? θ φ(s, a). π | θ − The above objective conveys a clear goal of our imitation learning problem: we are searching for π that minimizes the ? MMD between dπ and dπ with a (negative) entropy regularization on the trajectory distribution of policy π. To derive an algorithm that optimizes the above objective, we first again swap the min max order via the minimax theorem:

 > >  max min Es,a∼dπ θ φ(s, a) Es,a∼dπ? θ φ(s, a) + Es,a∼dπ ln π(a s) . θ π − | The above objective proposes a natural algorithm where we perform projected gradient ascent on θ, while perform best response update on π, i.e., given θ, we solve the following planning problem:

> argminπ Es,a∼dπ θ φ(s, a) + Es,a∼dπ ln π(a s). (0.4) |

152 Note that the above objective can be understood as planning with cost function θ>φ(s, a) with an additional negative entropy regularization on the trajectory distribution. On the other hand, given π, we can compute the gradient of θ, which is simply the difference between the expected features:

Es,a∼dπ φ(s, a) Es,a∼dπ? φ(s, a), − which gives the following gradient ascent update on θ:  θ := θ + η Es,a∼dπ φ(s, a) Es,a∼dπ? φ(s, a) . (0.5) −

Algorithm of MaxEnt-IRL MaxEnt-IRL updates π and θ alternatively using Eq. 0.4 and Eq. 0.5, respectively. We summarize the algorithm in Alg. 10. Note that for the stochastic gradient of θt, we see that it is the empirical average ? feature difference between the current policy πt and the expert policy π .

Algorithm 10 MaxEnt-IRL Input: Expert data ? = s?, a? M , MDP , parameters β, η, N. D { i i }i=1 M 1: Initialize θ0 with θ0 2 1. k k ≤ 2: for t = 1, 2,..., do >  >  3: Entropy-regularized Planning with cost θt φ(s, a): πt argminπ Es,a∼dπ θt φ(s, a) + β ln π(a s) . N πt ∈ | 4: Draw samples si, ai d by executing πt in . { }i=1 ∼ M 0 h 1 PN 1 PM ? ? i 5: Stochastic Gradient Update: θ = θt + η φ(si, ai) φ(s , a ) . N i=1 − M i=1 i i 6: end for

Note that Alg. 10 uses an entropy-regularized planning oracle. Below we discuss how to implement such entropy- regularized planning oracle via dynamic programming.

15.4.3 Maximum Entropy RL: Implementing the Planning Oracle in Eq. 0.4

The planning oracle in Eq. 0.4 can be implemented in a value iteration fashion using Dynamic Programming. We denote c(s, a) := θ φ(s, a). · We are interested in implementing the following planning objective:

argminπ Es,a∼dπ [c(s, a) + ln π(a s)] | The subsection has its own independent interests beyond the framework of imitation learning. This maximum entropy regularized planning formulation is widely used in RL as well and it is well connected to the framework of RL as Inference. As usually, we start from the last time step H 1. For any policy π and any state-action (s, a), we have the cost-to-go π − QH−1(s, a):

π π X QH−1(s, a) = c(s, a) + ln π(a s),VH−1(s) = π(a s)(c(s, a) + ln π(a s)) . | a | | We have:

? X VH−1(s) = min ρ(a)c(s, a) + ρ(a) ln ρ(a). (0.6) ρ∈∆(A) a Take gradient with respect to ρ, set it to zero and solve for ρ, we get: π? (a s) exp ( c(s, a)) , s, a. H−1 | ∝ − ∀

153 ? Substitute πH−1 back to the expression in Eq. 0.6, we get: ! ? X VH−1(s) = ln exp ( c(s, a)) , − a − i.e., we apply a softmin operator rather than the usual min operator (recall here we are minimizing cost).

? ? With Vh+1, we can continue to h. Denote Q (s, a) as follows:

? ? 0 Qh(s, a) = r(s, a) + Es0∼P (·|s,a)Vh+1(s ).

? For Vh , we have:

? X ? 0  Vh (s) = min ρ(a) c(s, a) + ln ρ(a) + Es0∼P (·|s,a)Vh+1(s ) . ρ∈∆(A) a

Again we can show that the minimizer of the above program has the following form:

π?(a s) exp ( Q? (s, a)) . (0.7) h | ∝ − h ? ? Substitute πh back to Qh, we can show that: ! ? X ? Vh (s) = ln exp ( Qh(s, a)) , − a −

? ? where we see again that Vh is based on a softmin operator on Q . Thus the soft value iteration can be summarized below:

Soft Value Iteration: ! ? ? ? ? X ? QH−1(s, a) = c(s, a), πh(a s) exp( Qh(s, a)),Vh (s) = ln exp( Qh(s, a)) , h. | ∝ − − a − ∀

We can continue the above procedure to h = 0, which gives us the optimal policies, all in the form of Eq. 0.7

15.5 Interactive Imitation Learning: AggreVaTe and Its Statistical Benefit over Offline IL Setting

We study the Interactive Imitation Learning setting where we have an expert policy that can be queried at any time during training, and we also have access to the ground truth reward signal. We present AggreVaTe (Aggregate Values to Imitate) first and then analyze its sample complexity under the realizable setting. Again we start with a realizable policy class Π that is discrete. We denote ∆(Π) as the convex hull of Π and each |Π| policy π ∆(Π) is a mixture policy represented by a distribution ρ R with ρ[i] 0 and ρ 1 = 1. With this parameterization,∈ our decision space simply becomes ∆( Π ), i.e., any∈ point in ∆( Π≥) correspondsk k to a mixture | | | | policy. Notation wise, given ρ ∆( Π ), we denote πρ as the corresponding mixture policy. We denote πi as the i-th policy in Π. We denote ρ[i] as the∈ i-th| element| of the vector ρ.

154 Algorithm 11 AggreVaTe Input: The interactive expert, regularization λ 1: Initialize ρ0 to be a uniform distribution. 2: for t = 0, 2,..., do πρ 3: Sample st d t ∼ ? 4: Query expert to get A (st, a) for all a ∈P At P|Π| ?  P|Π| 5: Policy update: ρt+1 = argmaxρ∈∆(|Π|) ρ[i] Ea∼πi(·|st)A (st, a) λ ρ[i] ln(ρ[i]) j=0 i=1 − i=1 6: end for

AggreVaTe assumes an interactive expert from whom I can query for action feedback. Basically, given a state s, let us assume that expert returns us the advantages of all actions, i.e., one query of expert oracle at any state s returns A?(s, a), a .1 ∀ ∈ A The algorithm in summarized in Alg. 11.

P|Π| ? To analyze the algorithm, let us introduce some additional notations. Let us denote `t(ρ) = i=1 ρ[i]Ea∼πi(·|s)A (st, a), which is dependent on state st generated at iteration t, and is a linear function with respect to decision variable ρ. Ag- greVaTe is essentially running the specific online learning algorithm, Follow-the-Regularized Leader (FTRL) (e.g., see [Shalev-Shwartz, 2011]) with Entropy regularization:

t X ρt+1 = argmaxρ∈∆(|Π|) `t(ρ) + λEntropy(ρ). i=0

π Denote c = maxs,a A (s, a). This implies that supρ,t `t(ρ) c. FTRL with linear loss functions and entropy regularization gives the following deterministic regret guaranteesk k ≤([Shalev-Shwartz, 2011]):

T −1 T −1 X X p max `t(ρ) `t(ρt) c log( Π )T. (0.8) ρ t=0 − t=0 ≤ | |

We will analyze AggreVaTe’s sample complexity using the above result.

? Theorem 15.6 (Sample Complexity of AggreVaTe). Denote c = sups,a A (s, a) . Let us denote Π and stat as follows: | |

M−1 r r 1 X ? ln( Π )/δ ln(1/δ) Π := max `t(ρ ), stat := | | + 2 . ρ∈∆(|Π|) M M M t=0

With probability at least 1 δ, after M iterations (i.e., M calls of expert oracle), AggreVaTe finds a policy π such that: − b

? π c 1 V V b stat Π. − ≤ 1 γ − 1 γ − −

˜ Proof: At each iteration t, let us define `t(ρt) as follows:

|Π| X ? ˜ πρ `t(ρt) = Es∼d t ρt[i]Ea∼πi(·|s)A (s, a) i=1

1Technically one cannot use one query to get A?(s, a) for all a. But one can use importance weighting to get an unbiased estimate of A?(s, a) ? |A| for all a via just one expert roll-out. For analysis simplicity, we assume one expert query at s returns the whole vector A (s, ·) ∈ R .

155 Denote Et[`t(ρt)] as the conditional expectation of `t(ρt), conditioned on all history up and including the end of ˜ iteration t 1. Thus, we have Et[`t](ρt) = `t(ρt) as ρt only depends on the history up to the end of iteration t 1. − − Also note that `t(ρ) c. Thus by Azuma-Hoeffding inequality (Theorem A.5), we get: | | ≤ M−1 M−1 r 1 X 1 X ln(1/δ) `˜ (ρ ) ` (ρ ) 2c , M t t T t t M t=0 − t=0 ≤

? 1 PM−1 with probability at least 1 δ. Now use Eq. 0.8, and denote ρ = argminρ∈∆(Π) M t=0 `t(ρ), i.e., the best − PM−1 minimizer that minimizes the average loss t=0 `t(ρ)/M. we get:

M−1 M−1 r M−1 r r 1 X 1 X ln(1/δ) 1 X ln( Π ) ln(1/δ) `˜ (ρ ) ` (ρ ) 2c ` (ρ?) c | | 2c , M t t M t t M M t M M t=0 ≥ t=0 − ≥ t=0 − − which means that there must exist a tˆ 0,...,M 1 , such that: ∈ { − } M−1 r r 1 X ? ln( Π ) ln(1/δ) `eˆ(ρˆ) `t(ρ ) c | | 2c . t t M M M ≥ t=0 − −

Now use Performance difference lemma, we have:

|Π| ? πρ X ? tˆ πρ (1 γ)( V + V ) = E tˆ ρtˆ[i]Ea∼πi(·|s)A (s, a) = `tˆ(ρtˆ) − − s∼d i=1 M−1 r r 1 X ln( Π ) ln(1/δ) ` (ρ?) c | | 2c . M t T T ≥ t=0 − − Rearrange terms, we get:

" M−1 # "r r # ? πρ 1 1 X ? c ln( Π )/δ ln(1/δ) V V tˆ `t(ρ ) + | | + 2 , − ≤ −1 γ M 1 γ M M − t=0 − which concludes the proof.

Remark We analyze the Π by discussing realizable setting and non-realizable setting separately. When Π is re- ? PM ? alizable, i.e., π Π, by the definition of our `t, we immediately have i=1 `t(ρ ) 0 since ? ? ∈ ? ≥ A (s, π (s)) = 0 for all s . Moreover, when π is not the globally optimal policy, it is possible that Π := PM ? ∈ S i=1 `t(ρ )/M > 0, which implies that when M (i.e., stat 0), AggreVaTe indeed can learn a policy that outperforms the expert policy π?. In general when→π? ∞ Π, there might→ not exist a mixture policy ρ that achieves ? 6∈ positive advantage against π under the M training samples s0, . . . , sM−1 . In this case, Π < 0. { } ? Does AggreVaTe avoids distribution shift? Under realizable setting, note that the bound explicitly depends on sups,a A (s, a) ? | | (i.e., it depends on `t(ρt) for all t). In worst case, it is possible that sups,a A (s, a) = Θ(1/(1 γ)), which implies that AggreVaTe could| suffer| a quadratic horizon dependency, i.e., 1/(1 | γ)2. Note| that DM-ST− provably scales linearly with respect to 1/(1 γ), but DM-ST requires a stronger assumption− that the transition P is known. − ? When sups,a A (s, a) = o(1/(1 γ)), then AggreVaTe performs strictly better than BC. This is possible when the MDP is mixing| quickly| under π?,− or Q? is L- Lipschitz continuous, i.e., Q?(s, a) Q?(s, a0) L a a0 , with 0 + | − | ≤ k − k bounded range in action space, e.g., supa,a0 a a β R . In this case, if L and β are independent of 1/(1 γ), k − k ≤ ∈ − then sup A?(s, a) Lβ, which leads to a βL dependency. s,a | | ≤ 1−γ 156 When π? Π, the agnostic result of AggreVaTe and the agnostic result of DM-ST is not directly comparable. Note that 6∈ π π? the model class error that DM-ST suffers is algorithmic-independent, i.e., it is minπ∈Π d d 1 and it only depends ? k − k ? on π and Π, while the model class Π in AggreVaTe is algorithmic path-dependent, i.e., in additional to π and Π, π it depends the policies π1, π2,... computed during the learning process. Another difference is that minπ∈Π d π? 1 ?k − d 1 [0, 2], while Π indeed could scale linearly with respect to 1/(1 γ), i.e., π [ 1−γ , 0] (assume π is the globallyk ∈ optimal policy for simplicity). − ∈ −

15.6 Bibliographic Remarks and Further Readings

Behavior cloning was used in autonomous driving back in 1989 [Pomerleau, 1989], and the distribution shift and compounding error issue was studied by Ross and Bagnell [2010]. Ross and Bagnell [2010], Ross et al. [2011] proposed using interactive experts to alleviate the distribution shift issue. MaxEnt-IRL was proposed by Ziebart et al. [2008]. The original MaxEnt-IRL focuses on deterministic MDPs and derived distribution over trajectories. Later Ziebart et al. [2010] proposed the Principle of Maximum Causal Entropy framework which captures MDPs with stochastic transition. MaxEnt-IRL has been widely used in real applications (e.g., [Kitani et al., 2012, Ziebart et al., 2009]). To the best of our knowledge, the algorithm Distribution Matching with Scheffe´ Tournament introduced in this chapter is new here and is the first to demonstrate the statistical benefit (in terms of sample complexity of the expert policy) of the hybrid setting over the pure offline setting with general function approximation. Recently, there are approaches that extend the linear cost functions used in MaxEnt-IRL to deep neural networks which are treated as discriminators to distinguish between experts datasets and the datasets generated by the policies. Different distribution divergences have been proposed, for instance, JS-divergence [Ho and Ermon, 2016], general f-divergence which generalizes JS-divergence [Ke et al., 2019], and Integral Probability Metric (IPM) which includes Wasserstein distance [Sun et al., 2019b]. While these adversarial IL approaches are promising as they fall into the hy- brid setting, it has been observed that empirically, adversarial IL sometimes cannot outperform simple algorithms such as Behavior cloning which operates in pure offline setting (e.g., see experiments results on common RL benchmarks from Brantley et al. [2019]). Another important direction in IL is to combine expert demonstrations and reward functions together (i.e., perform Imitation together with RL). There are many works in this direction, including learning with pre-collected expert data [Hester et al., 2017, Rajeswaran et al., 2017, Cheng et al., 2018, Le et al., 2018, Sun et al., 2018], learning with an interactive expert [Daume´ et al., 2009, Ross and Bagnell, 2014, Chang et al., 2015, Sun et al., 2017, Cheng et al., 2020, Cheng and Boots, 2018]. The algorithm AggreVaTe was originally introduced in [Ross and Bagnell, 2014]. A policy gradient and natural policy gradient version of AggreVaTe are introduced in [Sun et al., 2017]. The precursor of AggreVaTe is the algorithm Data Aggregation (DAgger) [Ross et al., 2011], which leverages interactive experts and a reduction to no-regret online learning, but without assuming access to the ground truth reward signals. The maximum entropy RL formulation has been widely used in RL literature as well. For instance, Guided Policy Search (GPS) (and its variants) use maximum entropy RL formulation as its planning subroutine [Levine and Koltun, 2013, Levine and Abbeel, 2014]. We refer readers to the excellent survey from Levine [2018] for more details of Maximum Entropy RL formulation and its connections to the framework of RL as Probabilistic Inference.

157 158 Chapter 16

Linear Quadratic Regulators

This chapter will introduce some of the fundamentals of optimal control for the linear quadratic regulator model. This model is an MDP, with continuous states and actions. While the model itself is often inadequate as a global model, it can be quite effective as a locally linear model (provided our system does not deviate away from the regime where our linear model is reasonable approximation). The basics of optimal control theory can be found in any number of standards text Anderson and Moore [1990], Evans [2005], Bertsekas [2017]. The treatment of Gauss-Newton and the NPG algorithm are due to Fazel et al. [2018].

16.1 The LQR Model

In the standard optimal control problem, a dynamical system is described as

xt+1 = ft(xt, ut, wt) ,

d k d where ft maps a state xt R , a control (the action) ut R , and a disturbance wt, to the next state xt+1 R , ∈ ∈ ∈ starting from an initial state x0. The objective is to find the control policy π which minimizes the long term cost,

" H # X minimize Eπ ct(xt, ut) t=0 such that xt+1 = ft(xt, ut, wt) t = 0,...,H. where H is the time horizon (which can be finite or infinite). In practice, this is often solved by considering the linearized control (sub-)problem where the dynamics are approxi- mated by

xt+1 = Atxt + Btut + wt, with the matrices At and Bt are derivatives of the dynamics f and where the costs are approximated by a quadratic function in xt and ut. This chapter focuses on an important special case: finite and infinite horizon problem referred to as the linear quadratic regulator (LQR) problem. We can view this model as being an local approximation to non-linear model. However, we will analyze these models under the assumption that they are globally valid.

159 Finite Horizon LQRs. The finite horizon LQR problem is given by

" H−1 # > X > > minimize E xH QxH + (xt Qxt + ut Rut) t=0 2 such that xt+1 = Atxt + Btut + wt , x0 , wt N(0, σ I) , ∼ D ∼ d where initial state x0 is assumed to be randomly distributed according to distribution ; the disturbance wt R ∼ D 2 d×d D d×k ∈ follows the law of a multi-variate normal with covariance σ I; the matrices At R and Bt R are referred ∈ ∈ to as system (or transition) matrices; Q Rd×d and R Rk×k are both positive definite matrices that parameterize ∈ ∈ the quadratic costs. Note that this model is a finite horizon MDP, where the = Rd and = Rk. S A

Infinite Horizon LQRs. We also consider the infinite horizon LQR problem:

" H # 1 X > > minimize lim E (xt Qxt + ut Rut) H→∞ H t=0 2 such that xt+1 = Axt + But + wt , x0 , wt N(0, σ I). ∼ D ∼ Note that here we are assuming the dynamics are time homogenous. We will assume that the optimal objective function (i.e. the optimal average cost) is finite; this is referred to as the system being controllable. This is a special case of an MDP with an average reward objective. Throughout this chapter, we assume A and B are such that the optimal cost is finite. Due to the geometric nature of the system dynamics (say for a controller which takes controls ut that are linear in the state xt), there may exists linear controllers with infinite costs. This instability of LQRs (at least for some A and B and some controllers) leads to that the theoretical analysis often makes various assumptions on A and B in order to guarantee some notion of stability. In practice, the finite horizon setting is more commonly used in practice, particularly due to that the LQR model is only a good local approximation of the system dynamics, where the infinite horizon model tends to be largely of theoretical interest. See Section 16.6.

The infinite horizon discounted case? The infinite horizon discounted case tends not to be studied for LQRs. This is largely due to that, for the undiscounted case (with the average cost objective), we have may infinte costs (due to the aforementioned geometric nature of the system dynamics); in such cases, discounting will not necessarily make the average cost finite.

16.2 Bellman Optimality: Value Iteration & The Algebraic Riccati Equations

A standard result in optimal control theory shows that the optimal control input can be written as a linear function in the state. As we shall see, this is a consequence of the Bellman equations.

16.2.1 Planning and Finite Horizon LQRs

Slightly abusing notation, it is convenient to define the value function and state-action value function with respect to π d the costs as follows: For a policy π, a state x, and h 0,...H 1 , we define the value function Vh : R R as ∈ { − } → H−1 π h > X > > i Vh (x) = E xH QxH + (xt Qxt + ut Rut) π, xh = x , t=h

160 where again expectation is with respect to the randomness of the trajectory, that is, the randomness in state transitions. π d k Similarly, the state-action value (or Q-value) function Qh : R R R is defined as × → H−1 π h > X > > i Qh(x, u) = E xH QxH + (xt Qxt + ut Rut) π, xh = x, uh = u . t=h

We define V ? and Q? analogously. The following theorem provides a characterization of the optimal policy, via the algebraic Riccati equations. These equations are simply the value iteration algorithm for the special case of LQRs.

Theorem 16.1. (Value Iteration and the Riccati Equations). Suppose R is positive definite. The optimal policy is a linear controller specified by: ? ? π (xt) = K xt − t where ? > −1 > Kt = (Bt Pt+1Bt + R) Bt Pt+1At.

Here, Pt can be computed iteratively, in a backwards manner, using the following algebraic Riccati equations, where for t [H], ∈ > > > −1 > Pt := At Pt+1At + Q At Pt+1Bt(Bt Pt+1Bt + R) Bt Pt+1At > − ? > > ? = A Pt+1At + Q (K ) (B Pt+1Bt + R)K t − t+1 t t+1 and where PH = Q. (The above equation is simply the value iteration algorithm). Furthermore, for t [H], we have that: ∈ H ? > 2 X Vt (x) = x Ptx + σ Trace(Ph). h=t+1

We often refer to K? as the optimal control gain matrices. It is straightforward to generalize the above when σ 0. t ≥ We have assumed that R is strictly positive definite, to avoid have to working with the pseudo-inverse; the theorem is still true when R = 0, provided we use a pseudo-inverse. Proof: By the Bellman optimality conditions (Theorem ?? for episodic MDPs), we know the optimal policy (among all possibly history dependent, non-stationary, and randomized policies), is given by a deterministic stationary policy which is only a function of xt and t. We have that:

 >  > > QH−1(x, u) = E (AH−1x + BH−1u + wH−1) Q(AH−1x + BH−1u + wH−1) + x Qx + u Ru > 2 > > = (AH−1x + BH−1u) Q(AH−1x + BH−1u) + σ Trace(Q) + x Qx + u Ru

> 2 due to that xH = AH−1x + BH−1u + wH−1, and E[wH−1QwH−1] = Trace(σ Q). Due to that this is a quadratic function of u, we can immediately derive that the optimal control is given by:

? > −1 > ? π (x) = (B QBH−1 + R) B QAH−1x = K x, H−1 − H−1 H−1 − H−1 where the last step uses that PH := Q.

? For notational convenience, let K = KH−1, A = AH−1, and B = BH−1. Using the optimal control at x, i.e.

161 u = K? x, we have: − H−1 ? ? VH−1(x) = QH−1(x, KH−1x)  −  = x> (A BK)>Q(A BK) + Q + K>RK x + σ2Trace(Q) − −   = x> AQA + Q 2K>B>QA + K>(B>QB + R)K x + σ2Trace(Q) −   = x> AQA + Q 2K>(B>QB + R)K + K>(B>QB + R)K x + σ2Trace(Q) −   = x> AQA + Q K>(B>QB + R)K x + σ2Trace(Q) − > 2 = x PH−1x + σ Trace(Q). where the fourth step uses our expression for K = K? . This proves our claim for t = H 1. H−1 − This implies that: ? ? > > QH−2(x, u) = E[VH−1(AH−2x + BH−2u + wH−2)] + x Qx + u Ru > 2 2 > > = (AH−2x + BH−2u) PH−1(AH−2x + BH−2u) + σ Trace(PH−1) + σ Trace(Q) + x Qx + u Ru. The remainder of the proof follows from a recursive argument, which can be verified along identical lines to the t = H 1 case. −

16.2.2 Planning and Infinite Horizon LQRs

Theorem 16.2. Suppose that the optimal cost is finite and that R is positive definite. Let P be a solution to the following algebraic Riccati equation: P = AT PA + Q AT PB(BT PB + R)−1BT P A. (0.1) − (Note that P is a positive definite matrix). We have that the optimal policy is: π?(x) = K?x − where the optimal control gain is: K∗ = (BT PB + R)−1BT P A. (0.2) − We have that P is unique and that the optimal average cost is σ2Trace(P ).

As before, P parameterizes the optimal value function. We do not prove this theorem here though it follows along similar lines to the previous proof, via a limiting argument. To find P , we can again run the recursion: P Q + AT PA AT PB(R + BT PB)−1BT PA ← − starting with P = Q, which can be shown to converge to the unique positive semidefinite solution of the Riccati equation (since one can show the fixed-point iteration is contractive). Again, this approach is simply value iteration.

16.3 Convex Programs to find P and K?

For the infinite horizon LQR problem, the optimization may be formulated as a convex program. In particular, the LQR problem can also be expressed as a semidefinite program (SDP) with variable P , for the infinite horizon case. We now present this primal program along with the dual program.

162 Note that these programs are the analogues of the linear programs from Section ?? for MDPs. While specifying these linear programs for an LQR (as an LQR is an MDP) would result in infinite dimensional linear programs, the special structure of the LQR implies these primal and dual programs have a more compact formulation when specified as an SDP.

16.3.1 The Primal for Infinite Horizon LQR

The primal optimization problem is given as: maximize σ2Trace(P )  AT PA + Q IA>PB  subject to − 0,P 0, BT PAB>PB + R   where the optimization variable is P . This SDP has a unique solution, P ?, which satisfies the algebraic Riccati equations (Equation 0.1 ); the optimal average cost of the infinite horizon LQR is σ2Trace(P ?); and the optimal policy is given by Equation 0.2. The SDP can be derived by relaxing the equality in the Riccati equation to an inequality, then using the Schur comple- ment lemma to rewrite the resulting Riccati inequality as linear matrix inequality. In particular, we can consider the relaxation where P must satisfy: P AT PA + Q AT PB(BT PB + R)−1BT P A.  − That the solution to this relaxed optimization problem leads to the optimal P ? is due to the Bellman optimality conditions. Now the Schur complement lemma for positive semi-definiteness is as follows: define  DE  X = . E> F (for matrices D, E, and F of appropriate size, with D and F being square symmetric matrices). We have that X is PSD if and only if F E>D−1E 0. −  This shows that the constraint set is equivalent to the above relaxation.

16.3.2 The Dual

The dual optimization problem is given as:   Q 0  minimize Trace Σ · 0 R > 2 subject to Σxx = (AB)Σ(AB) + σ I, Σ 0,  where the optimization variable is a symmetric matrix Σ, which is a (d + k) (d + k) matrix with the block structure: ×  Σ Σ  Σ = xx xu . Σux Σuu The interpretation of Σ is that it is the covariance matrix of the stationary distribution. This analogous to the state- visitation measure for an MDP. This SDP has a unique solution, say Σ?. The optimal gain matrix is then given by: K? = Σ? (Σ? )−1. − ux xx 163 16.4 Policy Iteration, Gauss Newton, and NPG

Note that the noise variance σ2 does not impact the optimal policy, in either the discounted case or in the infinite horizon case. Here, when we examine local search methods, it is more convenient to work with case where σ = 0. In this case, we can work with cumulative cost rather the average cost. Precisely, when σ = 0, the infinite horizon LQR problem takes the form: " H # X > > min C(K), where C(K) = Ex0∼D (xt Qxt + ut Rut) , K t=0 where they dynamics evolves as xt+1 = (A BK)xt. − Note that we have directly parameterized our policy as a linear policy in terms of the gain matrix K, due to that we know the optimal policy is linear in the state. Again, we assume that C(K?) is finite; this assumption is referred to as the system being controllable. We now examine local search based approaches, where we will see a close connection to policy iteration. Again, we have a non-convex optimization problem:

Lemma 16.3. (Non-convexity) If $d \ge 3$, there exists an LQR optimization problem, $\min_K C(K)$, which is not convex or quasi-convex.

Regardless, we will see that gradient based approaches are effective. For local search based approaches, the importance of (some) randomization, either in $x_0$ or through a noise disturbance, is analogous to our use of a wide-coverage distribution $\mu$ (for MDPs).

16.4.1 Gradient Expressions

Gradient descent on $C(K)$, with a fixed stepsize $\eta$, follows the update rule:

$K \leftarrow K - \eta \nabla C(K).$

It is helpful to explicitly write out the functional form of the gradient. Define $P_K$ as the solution to:

$P_K = Q + K^\top R K + (A - BK)^\top P_K (A - BK),$

and, under this definition, it follows that $C(K)$ can be written as:

$C(K) = \mathbb{E}_{x_0 \sim \mathcal{D}}\left[x_0^\top P_K x_0\right].$

Also, define $\Sigma_K$ as the (un-normalized) state correlation matrix, i.e.

$\Sigma_K = \mathbb{E}_{x_0 \sim \mathcal{D}} \sum_{t=0}^{\infty} x_t x_t^\top.$

Lemma 16.4. (Policy Gradient Expression) The policy gradient is:

$\nabla C(K) = 2\left((R + B^\top P_K B)K - B^\top P_K A\right)\Sigma_K.$

For convenience, define $E_K$ to be

$E_K = (R + B^\top P_K B)K - B^\top P_K A,$

so that the gradient can be written as $\nabla C(K) = 2 E_K \Sigma_K$.

Proof: Observe:

$C_K(x_0) = x_0^\top P_K x_0 = x_0^\top\left(Q + K^\top R K\right)x_0 + x_0^\top (A - BK)^\top P_K (A - BK) x_0 = x_0^\top\left(Q + K^\top R K\right)x_0 + C_K\left((A - BK)x_0\right).$

Let $\nabla$ denote the gradient with respect to $K$; note that $\nabla C_K((A - BK)x_0)$ has two terms as a function of $K$, one with respect to the $K$ in the subscript and one with respect to the input $(A - BK)x_0$. This implies

$\nabla C_K(x_0) = 2RKx_0x_0^\top - 2B^\top P_K(A - BK)x_0x_0^\top + \nabla C_K(x_1)\big|_{x_1 = (A-BK)x_0} = 2\left((R + B^\top P_K B)K - B^\top P_K A\right)\sum_{t=0}^{\infty} x_t x_t^\top,$

where we have used the recursion and that $x_1 = (A - BK)x_0$. Taking expectations completes the proof.

The natural policy gradient. Let us now motivate a version of the natural gradient. The natural policy gradient follows the update:

" ∞ # −1 X > θ θ η F (θ) C(θ), where F (θ) = E log πθ(ut xt) log πθ(ut xt) , ← − ∇ t=0 ∇ | ∇ | where F (θ) is the Fisher information matrix. A natural special case is using a linear policy with additive Gaussian noise, i.e. 2 πK (x, u) = (Kx, σ I) (0.3) N where K Rk×d and σ2 is the noise variance. In this case, the natural policy gradient of K (when σ is considered fixed) takes∈ the form: −1 K K η C(πK )Σ (0.4) ← − ∇ K Note a subtlety here is that C(πK ) is the randomized policy.

To see this, one can verify that the Fisher matrix, which is of size $kd \times kd$ and is indexed as $[G_K]_{(i,j),(i',j')}$ where $i, i' \in \{1, \ldots, k\}$ and $j, j' \in \{1, \ldots, d\}$, has a block diagonal form where the only non-zero blocks are $[G_K]_{(i,\cdot),(i,\cdot)} = \Sigma_K$ (this is the block corresponding to the $i$-th coordinate of the action, as $i$ ranges from 1 to $k$). This form holds more generally, for any diagonal noise.
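To make these expressions concrete, the following is a minimal numpy sketch for a hypothetical system (the distribution $\mathcal{D}$ is taken to have identity covariance, so that $C(K) = \mathrm{Trace}(P_K)$ and $\mu = 1$); it computes $P_K$, $\Sigma_K$, $C(K)$, and the policy gradient of Lemma 16.4, and checks the latter against finite differences. The helper names below are illustrative, not from the text.

```python
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)
Sigma0 = np.eye(2)                      # covariance of x0 under D (assumed)

def P_of(K, iters=2000):
    """Solve P_K = Q + K^T R K + (A - BK)^T P_K (A - BK) by fixed-point iteration."""
    Acl, P = A - B @ K, Q + K.T @ R @ K
    for _ in range(iters):
        P = Q + K.T @ R @ K + Acl.T @ P @ Acl
    return P

def Sigma_of(K, iters=2000):
    """Solve Sigma_K = Sigma0 + (A - BK) Sigma_K (A - BK)^T (state correlation matrix)."""
    Acl, S = A - B @ K, Sigma0.copy()
    for _ in range(iters):
        S = Sigma0 + Acl @ S @ Acl.T
    return S

def cost(K):
    return np.trace(P_of(K) @ Sigma0)   # C(K) = E[x0^T P_K x0]

def grad(K):
    P, S = P_of(K), Sigma_of(K)
    E_K = (R + B.T @ P @ B) @ K - B.T @ P @ A
    return 2 * E_K @ S                  # Lemma 16.4: grad C(K) = 2 E_K Sigma_K

# Finite-difference check of the policy gradient expression at a stabilizing K.
K = np.array([[1.0, 1.5]])
G, eps = grad(K), 1e-5
fd = np.zeros_like(K)
for i in range(K.shape[0]):
    for j in range(K.shape[1]):
        dK = np.zeros_like(K); dK[i, j] = eps
        fd[i, j] = (cost(K + dK) - cost(K - dK)) / (2 * eps)
print(np.max(np.abs(fd - G)))           # ~0: matches the analytic gradient
```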

16.4.2 Convergence Rates

We consider three exact update rules, where we assume access to exact gradients. As before, we can also estimate these gradients through simulation. For gradient descent, the update is

$K_{n+1} = K_n - \eta \nabla C(K_n). \qquad (0.5)$

For natural policy gradient descent, the direction is defined so that it is consistent with the stochastic case, as per Equation 0.4; in the exact case the update is:

$K_{n+1} = K_n - \eta \nabla C(K_n)\, \Sigma_{K_n}^{-1}. \qquad (0.6)$

One can show that the Gauss-Newton update is:

$K_{n+1} = K_n - \eta\, (R + B^\top P_{K_n} B)^{-1} \nabla C(K_n)\, \Sigma_{K_n}^{-1}. \qquad (0.7)$

(Gauss-Newton is a non-linear optimization approach which uses a certain Hessian approximation. It can be shown that this leads to the above update rule.) Interestingly, for the case when $\eta = 1$, the Gauss-Newton method is equivalent to the policy iteration algorithm, which optimizes a one-step deviation from the current policy.
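To illustrate the policy iteration connection concretely, the following sketch runs the exact greedy one-step improvement $K \leftarrow (R + B^\top P_K B)^{-1} B^\top P_K A$ with respect to the current value $P_K$, reusing the hypothetical helpers and matrices from the previous sketch; under those assumptions it converges to the optimal gain.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def policy_iteration_step(K):
    # Greedy one-step improvement with respect to the current value P_K.
    P = P_of(K)                       # policy evaluation (helper from the sketch above)
    return np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

K = np.array([[1.0, 1.5]])            # a stabilizing initial gain
for _ in range(20):
    K = policy_iteration_step(K)

P_star = solve_discrete_are(A, B, Q, R)
K_star = np.linalg.solve(R + B.T @ P_star @ B, B.T @ P_star @ A)
print(np.max(np.abs(K - K_star)))     # ~0: policy iteration converges to K*
```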

The Gauss-Newton method requires the most complex oracle to implement: it requires access to $\nabla C(K)$, $\Sigma_K$, and $R + B^\top P_K B$; as we shall see, it also enjoys the strongest convergence rate guarantee. At the other extreme, gradient descent requires oracle access to only $\nabla C(K)$ and has the slowest convergence rate. The natural policy gradient sits in between, requiring oracle access to $\nabla C(K)$ and $\Sigma_K$, and having a convergence rate between the other two methods. In the following theorem, $\|M\|_2$ denotes the spectral norm of a matrix $M$.

Theorem 16.5. (Global Convergence of Gradient Methods) Suppose $C(K_0)$ is finite and, for $\mu$ defined as

$\mu := \sigma_{\min}\left(\mathbb{E}_{x_0 \sim \mathcal{D}}\, x_0 x_0^\top\right),$ suppose $\mu > 0$.

• Gauss-Newton case: Suppose $\eta = 1$. The Gauss-Newton algorithm (Equation 0.7) enjoys the following performance bound:

$C(K_N) - C(K^\star) \le \epsilon, \quad \text{for } N \ge \frac{\|\Sigma_{K^\star}\|_2}{\mu} \log \frac{C(K_0) - C(K^\star)}{\epsilon}.$

• Natural policy gradient case: For a stepsize $\eta = 1/\left(\|R\|_2 + \frac{\|B\|_2^2 C(K_0)}{\mu}\right)$, natural policy gradient descent (Equation 0.6) enjoys the following performance bound:

$C(K_N) - C(K^\star) \le \epsilon, \quad \text{for } N \ge \frac{\|\Sigma_{K^\star}\|_2}{\mu}\left(\frac{\|R\|_2}{\sigma_{\min}(R)} + \frac{\|B\|_2^2\, C(K_0)}{\mu\, \sigma_{\min}(R)}\right) \log \frac{C(K_0) - C(K^\star)}{\epsilon}.$

• Gradient descent case: For any starting policy $K_0$, there exists a (constant) stepsize $\eta$ (which could be a function of $K_0$), such that:

$C(K_N) \to C(K^\star), \quad \text{as } N \to \infty.$

16.4.3 Gauss-Newton Analysis

We only provide a proof for the Gauss-Newton case (see Section 16.6 for further readings). We overload notation and let $K$ denote the policy $\pi(x) = -Kx$. For the infinite horizon cost function, define:

$V_K(x) := \sum_{t=0}^{\infty}\left(x_t^\top Q x_t + u_t^\top R u_t\right) = x^\top P_K x$

(where $x_0 = x$ and $u_t = -Kx_t$), and

$Q_K(x, u) := x^\top Q x + u^\top R u + V_K(Ax + Bu),$ and $A_K(x, u) = Q_K(x, u) - V_K(x).$

The next lemma is identical to the performance difference lemma.

Lemma 16.6. (Cost difference lemma) Suppose $K$ and $K'$ have finite costs. Let $\{x'_t\}$ and $\{u'_t\}$ be the state and action sequences generated by $K'$, i.e. starting with $x'_0 = x$ and using $u'_t = -K'x'_t$. It holds that:

$V_{K'}(x) - V_K(x) = \sum_t A_K(x'_t, u'_t).$

Also, for any $x$, the advantage is:

$A_K(x, K'x) = 2x^\top (K' - K)^\top E_K\, x + x^\top (K' - K)^\top (R + B^\top P_K B)(K' - K)\, x. \qquad (0.8)$

Proof: Let $\{c'_t\}$ be the cost sequence generated by $K'$. Telescoping the sum appropriately:

$V_{K'}(x) - V_K(x) = \sum_{t=0}^{\infty} c'_t - V_K(x) = \sum_{t=0}^{\infty}\left(c'_t + V_K(x'_t) - V_K(x'_t)\right) - V_K(x) = \sum_{t=0}^{\infty}\left(c'_t + V_K(x'_{t+1}) - V_K(x'_t)\right) = \sum_{t=0}^{\infty} A_K(x'_t, u'_t),$

which completes the first claim (the third equality uses the fact that $x = x'_0$). For the second claim, observe that:

$V_K(x) = x^\top\left(Q + K^\top R K\right)x + x^\top (A - BK)^\top P_K (A - BK)x.$

And, for $u = -K'x$,

$A_K(x, u) = Q_K(x, u) - V_K(x)$
$= x^\top\left(Q + (K')^\top R K'\right)x + x^\top (A - BK')^\top P_K (A - BK')x - V_K(x)$
$= x^\top\left(Q + (K' - K + K)^\top R (K' - K + K)\right)x + x^\top\left(A - BK - B(K' - K)\right)^\top P_K\left(A - BK - B(K' - K)\right)x - V_K(x)$
$= 2x^\top (K' - K)^\top\left((R + B^\top P_K B)K - B^\top P_K A\right)x + x^\top (K' - K)^\top (R + B^\top P_K B)(K' - K)x,$

which completes the proof.

We have the following corollary, which can be viewed analogously to a smoothness lemma.

Corollary 16.7. ("Almost" smoothness) $C(K)$ satisfies:

$C(K') - C(K) = -2\,\mathrm{Trace}\left(\Sigma_{K'}(K - K')^\top E_K\right) + \mathrm{Trace}\left(\Sigma_{K'}(K - K')^\top (R + B^\top P_K B)(K - K')\right).$

To see why this is related to smoothness (recall the definition of a smooth function in Equation 0.6), suppose $K'$ is sufficiently close to $K$ so that:

$\Sigma_{K'} \approx \Sigma_K + O(\|K - K'\|) \qquad (0.9)$

and the leading order term $2\,\mathrm{Trace}(\Sigma_{K'}(K' - K)^\top E_K)$ would then behave as $\mathrm{Trace}((K' - K)^\top \nabla C(K))$.

Proof: The claim immediately results from Lemma 16.6, by using Equation 0.8 and taking an expectation.

We now use this cost difference lemma to show that $C(K)$ is gradient dominated.

Lemma 16.8. (Gradient domination) Let $K^\star$ be an optimal policy. Suppose $K$ has finite cost and $\mu > 0$. It holds that:

$C(K) - C(K^\star) \le \|\Sigma_{K^\star}\|\, \mathrm{Trace}\left(E_K^\top (R + B^\top P_K B)^{-1} E_K\right) \le \frac{\|\Sigma_{K^\star}\|}{\mu^2\, \sigma_{\min}(R)}\, \mathrm{Trace}\left(\nabla C(K)^\top \nabla C(K)\right).$

Proof: From Equation 0.8 and by completing the square,

$A_K(x, K'x) = Q_K(x, K'x) - V_K(x)$
$= 2\,\mathrm{Trace}\left(xx^\top (K' - K)^\top E_K\right) + \mathrm{Trace}\left(xx^\top (K' - K)^\top (R + B^\top P_K B)(K' - K)\right)$
$= \mathrm{Trace}\left(xx^\top \left(K' - K + (R + B^\top P_K B)^{-1}E_K\right)^\top (R + B^\top P_K B)\left(K' - K + (R + B^\top P_K B)^{-1}E_K\right)\right) - \mathrm{Trace}\left(xx^\top E_K^\top (R + B^\top P_K B)^{-1} E_K\right)$
$\ge -\mathrm{Trace}\left(xx^\top E_K^\top (R + B^\top P_K B)^{-1} E_K\right), \qquad (0.10)$

with equality when $K' = K - (R + B^\top P_K B)^{-1} E_K$. Let $\{x^\star_t\}$ and $\{u^\star_t\}$ be the sequence generated under $K^\star$. Using this and Lemma 16.6,

$C(K) - C(K^\star) = -\mathbb{E}\sum_t A_K(x^\star_t, u^\star_t) \le \mathbb{E}\sum_t \mathrm{Trace}\left(x^\star_t (x^\star_t)^\top E_K^\top (R + B^\top P_K B)^{-1} E_K\right)$
$= \mathrm{Trace}\left(\Sigma_{K^\star} E_K^\top (R + B^\top P_K B)^{-1} E_K\right) \le \|\Sigma_{K^\star}\|\, \mathrm{Trace}\left(E_K^\top (R + B^\top P_K B)^{-1} E_K\right),$

which completes the proof of the first inequality. For the second inequality, observe:

$\mathrm{Trace}\left(E_K^\top (R + B^\top P_K B)^{-1} E_K\right) \le \|(R + B^\top P_K B)^{-1}\|\, \mathrm{Trace}\left(E_K^\top E_K\right) \le \frac{1}{\sigma_{\min}(R)}\mathrm{Trace}\left(E_K^\top E_K\right)$
$\le \frac{1}{\sigma_{\min}(R)}\mathrm{Trace}\left(\Sigma_K^{-1}\nabla C(K)^\top \nabla C(K)\Sigma_K^{-1}\right) \le \frac{1}{\sigma_{\min}(\Sigma_K)^2\, \sigma_{\min}(R)}\mathrm{Trace}\left(\nabla C(K)^\top \nabla C(K)\right) \le \frac{1}{\mu^2\, \sigma_{\min}(R)}\mathrm{Trace}\left(\nabla C(K)^\top \nabla C(K)\right),$

which completes the proof of the upper bound. Here the last step is because $\Sigma_K \succeq \mathbb{E}[x_0 x_0^\top]$.

The next lemma bounds the one-step progress of Gauss-Newton.

Lemma 16.9. (Gauss-Newton Contraction) Suppose that:

$K' = K - \eta\, (R + B^\top P_K B)^{-1} \nabla C(K)\, \Sigma_K^{-1}.$

If $\eta \le 1$, then

$C(K') - C(K^\star) \le \left(1 - \frac{\eta\mu}{\|\Sigma_{K^\star}\|}\right)\left(C(K) - C(K^\star)\right).$

Proof: Observe that for PSD matrices $A$ and $B$, we have that $\mathrm{Trace}(AB) \ge \sigma_{\min}(A)\,\mathrm{Trace}(B)$. Also, observe that $K' = K - \eta\,(R + B^\top P_K B)^{-1} E_K$. Using Corollary 16.7 and the condition on $\eta$,

$C(K') - C(K) = -2\eta\, \mathrm{Trace}\left(\Sigma_{K'} E_K^\top (R + B^\top P_K B)^{-1} E_K\right) + \eta^2\, \mathrm{Trace}\left(\Sigma_{K'} E_K^\top (R + B^\top P_K B)^{-1} E_K\right)$
$\le -\eta\, \mathrm{Trace}\left(\Sigma_{K'} E_K^\top (R + B^\top P_K B)^{-1} E_K\right)$
$\le -\eta\, \sigma_{\min}(\Sigma_{K'})\, \mathrm{Trace}\left(E_K^\top (R + B^\top P_K B)^{-1} E_K\right)$
$\le -\eta \mu\, \mathrm{Trace}\left(E_K^\top (R + B^\top P_K B)^{-1} E_K\right)$
$\le -\eta\, \frac{\mu}{\|\Sigma_{K^\star}\|}\left(C(K) - C(K^\star)\right),$

where the last step uses Lemma 16.8.

With this lemma, the proof of the convergence rate of the Gauss-Newton algorithm is immediate.

Proof: (of Theorem 16.5, Gauss-Newton case) The theorem follows because $\eta = 1$ leads to a contraction factor of $1 - \frac{\mu}{\|\Sigma_{K^\star}\|}$ at every step.

16.5 System Level Synthesis for Linear Dynamical Systems

We now present another parameterization of controllers which admits convexity. Specifically, in this section we consider the finite horizon setting, i.e.,

$x_{t+1} = A x_t + B u_t + w_t, \quad x_0 \sim \mathcal{D}, \quad w_t \sim \mathcal{N}(0, \sigma^2 I), \quad t \in \{0, 1, \ldots, H-1\}.$

Instead of focusing on a quadratic cost function, we consider a general convex cost function $c(x_t, u_t)$, and the goal is to optimize:

"H−1 # X Eπ ct(xt, ut) , t=0 where π is a time-dependent policy. While we relax the assumption on the cost function from quadratic cost to general convex cost, we still focus on linearly parameterized policy, i.e., we are still interested in searching for a sequence of time-dependent linear controllers ? H−1 ? Kt t=0 with ut = Kt xt, such that it minimizes the above objective function. We again assume the system is {−controllable,} i.e., the expected− total cost under K? H−1 is finite. {− t }t=0 Note that due to the fact that now the cost function is not quadratic anymore, the Riccati equation we studied before does not hold , and it is not necessarily true that the value function of any linear controllers will be a quadratic function. We present another parameterization of linear controllers which admit convexity in the objective function with respect to parameterization (recall that the objective function is non-convex with respect to linear controllers Kt). With the new parameterization, we will see that due to the convexity, we can directly apply gradient descent to find− the globally optimal solution.

Consider any fixed time-dependent linear controllers $\{-K_t\}_{t=0}^{H-1}$. We start by rolling out the linear system under these controllers. Note that during the rollout, once we observe $x_{t+1}$, we can compute $w_t$ as $w_t = x_{t+1} - A x_t - B u_t$. With $x_0 \sim \mu$, we have that at time step $t$:

$u_t = -K_t x_t = -K_t\left(A x_{t-1} - B K_{t-1} x_{t-1} + w_{t-1}\right)$
$= -K_t w_{t-1} - K_t(A - BK_{t-1})x_{t-1}$
$= -K_t w_{t-1} - K_t(A - BK_{t-1})\left(A x_{t-2} - B K_{t-2} x_{t-2} + w_{t-2}\right)$
$= \underbrace{-K_t}_{:=M_{t-1;t}} w_{t-1} \;\underbrace{- K_t(A - BK_{t-1})}_{:=M_{t-2;t}} w_{t-2} \;\underbrace{- K_t(A - BK_{t-1})(A - BK_{t-2})}_{:=M_{t-3;t}} x_{t-2}$
$= -K_t\left(\prod_{\tau=1}^{t}(A - BK_{t-\tau})\right)x_0 - \sum_{\tau=0}^{t-1} K_t\left(\prod_{h=1}^{t-1-\tau}(A - BK_{t-h})\right)w_\tau.$

Note that we can equivalently write ut using x0 and the noises w0, . . . , wt−1. Hence for time step t, let us do the following re-parameterization. Denote

$M_t := -K_t\left(\prod_{\tau=1}^{t}(A - BK_{t-\tau})\right), \qquad M_{\tau;t} := -K_t \prod_{h=1}^{t-1-\tau}(A - BK_{t-h}), \quad \text{where } \tau \in \{0, \ldots, t-1\}.$

Now we can express the control $u_t$ using $M_{\tau;t}$ for $\tau \in \{0, \ldots, t-1\}$ and $M_t$ as follows:

$u_t = M_t x_0 + \sum_{\tau=0}^{t-1} M_{\tau;t} w_\tau,$

which is equal to $u_t = -K_t x_t$. Note that above we only reasoned about time step $t$; we can repeat the same calculation for all $t = 0, \ldots, H-1$.

The above calculation basically proves the following claim.

Claim 16.10. For any linear controllers $\pi := \{-K_0, \ldots, -K_{H-1}\}$, there exists a parameterization $\tilde{\pi} := \{M_t, M_{0;t}, \ldots, M_{t-1;t}\}_{t=0}^{H-1}$, such that when $\pi$ and $\tilde{\pi}$ are executed under any initialization $x_0$ and any sequence of noises $\{w_t\}_{t=0}^{H-1}$, they generate exactly the same state-control trajectory.

We can execute $\tilde{\pi} := \{M_t, M_{0;t}, \ldots, M_{t-1;t}\}_{t=0}^{H-1}$ in the following way. Given any $x_0$, we execute $u_0 = M_0 x_0$ and observe $x_1$; at time step $t$, with the observed $x_t$, we calculate $w_{t-1} = x_t - A x_{t-1} - B u_{t-1}$, and execute the control $u_t = M_t x_0 + \sum_{\tau=0}^{t-1} M_{\tau;t} w_\tau$. We repeat until we execute the last control $u_{H-1}$ and reach $x_H$.

What is the benefit of the above parameterization? Note that for any $t$, it is clearly over-parameterized: the simple linear controller $-K_t$ has $d \times k$ many parameters, while the new controller $\{M_t, \{M_{\tau;t}\}_{\tau=0}^{t-1}\}$ has $(t+1) \times d \times k$ many parameters. The benefit of the above parameterization is that the objective function now is convex! The following claim formally shows the convexity.

Claim 16.11. Given $\tilde{\pi} := \{M_t, M_{0;t}, \ldots, M_{t-1;t}\}_{t=0}^{H-1}$, denote its expected total cost as $J(\tilde{\pi}) := \mathbb{E}\left[\sum_{t=0}^{H-1} c(x_t, u_t)\right]$, where the expectation is with respect to the noise $w_t \sim \mathcal{N}(0, \sigma^2 I)$ and the initial state $x_0 \sim \mu$, and $c$ is any convex function with respect to $(x, u)$. We have that $J(\tilde{\pi})$ is convex with respect to the parameters $\{M_t, M_{0;t}, \ldots, M_{t-1;t}\}_{t=0}^{H-1}$.

Proof: We consider a fixed $x_0$ and a fixed sequence of noises $\{w_t\}_{t=0}^{H-1}$. Recall that for $u_t$, we can write it as:

$u_t = M_t x_0 + \sum_{\tau=0}^{t-1} M_{\tau;t} w_\tau,$

which is clearly linear with respect to the parameterization $\{M_t, M_{0;t}, \ldots, M_{t-1;t}\}$.

For $x_t$, we can show by induction that it is linear with respect to $\{M_\tau, M_{0;\tau}, \ldots, M_{\tau-1;\tau}\}_{\tau=0}^{t-1}$. Note that this is clearly true for $x_0$, which is independent of the parameterization. Assume the claim holds for $x_t$ with $t \ge 0$; we now check $x_{t+1}$. We have:

$x_{t+1} = A x_t + B u_t = A x_t + B M_t x_0 + \sum_{\tau=0}^{t-1} B M_{\tau;t} w_\tau.$

Note that by the inductive hypothesis $x_t$ is linear with respect to $\{M_\tau, M_{0;\tau}, \ldots, M_{\tau-1;\tau}\}_{\tau=0}^{t-1}$, and the part $B M_t x_0 + \sum_{\tau=0}^{t-1} B M_{\tau;t} w_\tau$ is clearly linear with respect to $\{M_t, M_{0;t}, \ldots, M_{t-1;t}\}$. Together, this implies that $x_{t+1}$ is linear with respect to $\{M_\tau, M_{0;\tau}, \ldots, M_{\tau-1;\tau}\}_{\tau=0}^{t}$, which concludes that every $x_t$, for $t = 0, \ldots, H-1$, is linear with respect to $\{M_\tau, M_{0;\tau}, \ldots, M_{\tau-1;\tau}\}_{\tau=0}^{H-1}$.

Note that the trajectory total cost is $\sum_{t=0}^{H-1} c_t(x_t, u_t)$. Since $c_t$ is convex with respect to $(x_t, u_t)$, and $x_t$ and $u_t$ are linear with respect to $\{M_\tau, M_{0;\tau}, \ldots, M_{\tau-1;\tau}\}_{\tau=0}^{H-1}$, we have that $\sum_{t=0}^{H-1} c_t(x_t, u_t)$ is convex with respect to $\{M_\tau, M_{0;\tau}, \ldots, M_{\tau-1;\tau}\}_{\tau=0}^{H-1}$.

In the last step we simply take the expectation with respect to $x_0$ and $w_0, \ldots, w_{H-1}$; since an expectation of convex functions is convex, this concludes the proof.

The above immediately suggests that (sub)gradient-based algorithms, such as projected gradient descent on the parameters $\{M_t, M_{0;t}, \ldots, M_{t-1;t}\}_{t=0}^{H-1}$, can converge to a globally optimal solution for any convex cost $c_t$. Recall that the best linear controllers $\{-K^\star_t\}_{t=0}^{H-1}$ have their own corresponding parameterization $\{M^\star_t, M^\star_{0;t}, \ldots, M^\star_{t-1;t}\}_{t=0}^{H-1}$. Thus gradient-based optimization (with some care about the boundedness of the parameters $\{M^\star_t, M^\star_{0;t}, \ldots, M^\star_{t-1;t}\}_{t=0}^{H-1}$) can find a solution that is at least as good as the best linear controllers $\{-K^\star_t\}_{t=0}^{H-1}$.
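To make the execution procedure in Claim 16.10 concrete, the following is a minimal numpy sketch (a hypothetical system with a time-invariant gain $K$ for simplicity) that builds the $(M_t, M_{\tau;t})$ parameterization of $u_t = -Kx_t$, executes it by recovering the disturbances from the observed states, and checks that it reproduces the same controls.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
K = np.array([[1.0, 1.5]])
H, sigma = 20, 0.1
Acl = A - B @ K

# For a time-invariant gain: M_t = -K Acl^t and M_{tau;t} = -K Acl^{t-1-tau}.
def M(t):           return -K @ np.linalg.matrix_power(Acl, t)
def M_tau(tau, t):  return -K @ np.linalg.matrix_power(Acl, t - 1 - tau)

x0 = rng.standard_normal(2)
w = sigma * rng.standard_normal((H, 2))

# Roll out the linear gain directly ...
x, traj_K = x0.copy(), []
for t in range(H):
    u = -K @ x
    traj_K.append(u.copy())
    x = A @ x + B @ u + w[t]

# ... and roll out the M-parameterization, recovering w_{t-1} from observations.
x, ws, traj_M = x0.copy(), [], []
for t in range(H):
    u = M(t) @ x0 + sum(M_tau(tau, t) @ ws[tau] for tau in range(t))
    traj_M.append(u.copy())
    x_next = A @ x + B @ u + w[t]        # environment step
    ws.append(x_next - A @ x - B @ u)    # w_t recovered from the observed x_{t+1}
    x = x_next

print(max(np.max(np.abs(uK - uM)) for uK, uM in zip(traj_K, traj_M)))  # ~0
```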

Remark The above claims easily extend to time-dependent transitions and cost functions, i.e., $A_t, B_t, c_t$ are time-dependent. One can also extend this to the episodic online control setting with adversarial noises $w_t$ and adversarial cost functions, using no-regret online convex programming algorithms [Shalev-Shwartz, 2011]. In episodic online control, at every episode $k$, an adversary determines a sequence of bounded noises $w^k_0, \ldots, w^k_{H-1}$ and a cost function $c^k(x, u)$; the learner proposes a sequence of controllers $\tilde{\pi}^k = \{M^k_t, M^k_{0;t}, \ldots, M^k_{t-1;t}\}_{t=0}^{H-1}$ and executes them (the learner does not know the cost function $c^k$ until the end of the episode, and the noises are revealed as she observes $x^k_t$ and calculates $w^k_{t-1} = x^k_t - A x^k_{t-1} - B u^k_{t-1}$); at the end of the episode, the learner observes $c^k$ and suffers total cost $\sum_{t=0}^{H-1} c^k(x^k_t, u^k_t)$. The goal of episodic online control is to be no-regret with respect to the best linear controllers in hindsight:

$\sum_{k=0}^{K-1}\sum_{t=0}^{H-1} c^k(x^k_t, u^k_t) \;-\; \min_{\{-K^\star_t\}_{t=0}^{H-1}} \sum_{k=0}^{K-1} J^k\left(\{-K^\star_t\}_{t=0}^{H-1}\right) = o(K),$

where we denote by $J^k(\{-K^\star_t\}_{t=0}^{H-1})$ the total cost of executing $\{-K^\star_t\}_{t=0}^{H-1}$ in episode $k$ under $c^k$ and noises $\{w^k_t\}$, i.e., $\sum_{t=0}^{H-1} c^k(x_t, u_t)$ with $u_t = -K^\star_t x_t$ and $x_{t+1} = A x_t + B u_t + w^k_t$, for all $t$.

16.6 Bibliographic Remarks and Further Readings

The basics of optimal control theory can be found in any number of standard texts [Anderson and Moore, 1990, Evans, 2005, Bertsekas, 2017]. The primal and dual formulations for the continuous time LQR are derived in Balakrishnan and Vandenberghe [2003], while the dual formulation for the discrete time LQR that we use here is derived in Cohen et al. [2019]. The treatment of the Gauss-Newton, NPG, and PG algorithms is due to Fazel et al. [2018]. We have only provided the proof for the Gauss-Newton case; the proofs of the convergence rates for NPG and PG can be found in Fazel et al. [2018]. For many applications, the finite horizon LQR model is widely used as a model of locally linear dynamics, e.g. [Ahn et al., 2007, Todorov and Li, 2005, Tedrake, 2009, Perez et al., 2012]. Here, the issue of instabilities is largely due to model misspecification and to the accuracy of the Taylor expansion; it is less evident how the infinite horizon LQR model captures these issues. In contrast, for MDPs, practice tends to deal with stationary (and discounted) MDPs, because stationary policies are more convenient to represent and learn; there, the non-stationary, finite horizon MDP model tends to be more of theoretical interest, since it is straightforward (and often more effective) to simply incorporate temporal information into the state. Roughly, if our policy is parametric, then practical representational constraints lead us to use stationary policies (incorporating temporal information into the state), while if we tend to use non-parametric policies (e.g. through some rollout based procedure, say with "on the fly" computations as in model predictive control, e.g. [Williams et al., 2017]), then it is often more effective to work with finite horizon, non-stationary models.

Sample Complexity and Regret for LQRs. We have not treated the sample complexity of learning in an LQR (see [Dean et al., 2017, Simchowitz et al., 2018, Mania et al., 2019] for rates). Here, the basic analysis follows the online regression approach, which was developed in the study of linear bandits [Dani et al., 2008, Abbasi-Yadkori et al., 2011]; in particular, the self-normalized bound for vector-valued martingales [Abbasi-Yadkori et al., 2011] (see Lemma A.9) provides a direct means to obtain sharp confidence intervals for estimating the system matrices A and B from data from a single trajectory (e.g. see [Simchowitz and Foster, 2020]). Another family of work provides regret analyses of online LQR problems [Abbasi-Yadkori and Szepesvári, 2011, Dean et al., 2018, Mania et al., 2019, Cohen et al., 2019, Simchowitz and Foster, 2020]. Here, naive random search suffices for sample efficient learning of LQRs [Simchowitz and Foster, 2020]. For the learning and control of more complex nonlinear dynamical systems, one would expect this to be insufficient and that strategic exploration is required for sample efficient learning, just as is the case for MDPs (e.g. the UCB-VI algorithm).

Convex Parameterization of Linear Controllers The convex parameterization in Section 16.5 is based on [Agarwal et al., 2019], which is equivalent to the System Level Synthesis (SLS) parameterization [Wang et al., 2019].

Agarwal et al. [2019] use the SLS parameterization in the infinite horizon online control setting and leverage a reduction to online learning with memory. Note that in the episodic online control setting, we can simply use a classic no-regret online learner such as projected gradient descent [Zinkevich, 2003]. For partially observable linear systems, the classic Youla parameterization [Youla et al., 1976] provides a convex parameterization. We also refer readers to [Simchowitz et al., 2020] for a more detailed discussion of the Youla parameterization and its generalizations. Moreover, using the performance difference lemma, the exact optimal control policy for adversarial noise with full observations can be exactly characterized [Foster and Simchowitz, 2020, Goel and Hassibi, 2020] and yields a form reminiscent of the SLS parameterization [Wang et al., 2019].

Chapter 17

Partially Observable Markov Decision Processes

To be added...

Bibliography

Yasin Abbasi-Yadkori and Csaba Szepesvari.´ Regret bounds for the adaptive control of linear quadratic systems. In Conference on Learning Theory, pages 1–26, 2011.

Yasin Abbasi-Yadkori, David´ Pal,´ and Csaba Szepesvari.´ Improved algorithms for linear stochastic bandits. In Ad- vances in Neural Information Processing Systems, pages 2312–2320, 2011.

Yasin Abbasi-Yadkori, Peter Bartlett, Kush Bhatia, Nevena Lazic, Csaba Szepesvari, and Gellert´ Weisz. POLITEX: Regret bounds for policy iteration using expert prediction. In International Conference on Machine Learning, pages 3692–3702, 2019.

Naoki Abe and Philip M. Long. Associative reinforcement learning using linear probabilistic concepts. In Proc. 16th International Conf. on Machine Learning, pages 3–11. Morgan Kaufmann, San Francisco, CA, 1999.

Alekh Agarwal, Mikael Henaff, Sham Kakade, and Wen Sun. Pc-pg: Policy cover directed exploration for provable policy gradient learning. NeurIPS, 2020a.

Alekh Agarwal, Sham Kakade, Akshay Krishnamurthy, and Wen Sun. Flambe: Structural complexity and representa- tion learning of low rank mdps. NeurIPS, 2020b.

Alekh Agarwal, Sham Kakade, and Lin F. Yang. Model-based reinforcement learning with a generative model is minimax optimal. In COLT, volume 125, pages 67–83, 2020c.

Alekh Agarwal, Sham M. Kakade, Jason D. Lee, and Gaurav Mahajan. On the theory of policy gradient methods: Optimality, approximation, and distribution shift, 2020d.

Naman Agarwal, Brian Bullins, Elad Hazan, Sham M Kakade, and Karan Singh. Online control with adversarial disturbances. arXiv preprint arXiv:1902.08721, 2019.

Zafarali Ahmed, Nicolas Le Roux, Mohammad Norouzi, and Dale Schuurmans, editors. Understanding the impact of entropy on policy optimization, 2019. URL https://arxiv.org/abs/1811.11214.

Hyo-Sung Ahn, YangQuan Chen, and Kevin L. Moore. Iterative learning control: Brief survey and categorization. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 37(6):1099–1121, 2007.

Brian D. O. Anderson and John B. Moore. Optimal Control: Linear Quadratic Methods. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1990. ISBN 0-13-638560-5.

Andras´ Antos, Csaba Szepesvari,´ and Remi´ Munos. Learning near-optimal policies with bellman-residual minimiza- tion based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129, 2008.

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Mach. Learn., 47 (2-3):235–256, 2002. ISSN 0885-6125.

Alex Ayoub, Zeyu Jia, Csaba Szepesvári, Mengdi Wang, and Lin F. Yang. Model-based reinforcement learning with value-targeted regression. arXiv preprint arXiv:2006.01107, 2020. URL https://arxiv.org/abs/2006.01107.

Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J Kappen. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine Learning, 91(3):325–349, 2013.

Mohammad Gheshlaghi Azar, Ian Osband, and Remi´ Munos. Minimax regret bounds for reinforcement learning. In Doina Precup and Yee Whye Teh, editors, Proceedings of Machine Learning Research, volume 70, pages 263–272, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

J. Andrew Bagnell and Jeff Schneider. Covariant policy search. Proceedings of the 18th International Joint Confer- ence on Artificial Intelligence, pages 1019–1024, 2003. URL http://dl.acm.org/citation.cfm?id= 1630659.1630805.

J Andrew Bagnell, Sham M Kakade, Jeff G Schneider, and Andrew Y Ng. Policy search by dynamic programming. In Advances in neural information processing systems, pages 831–838, 2004.

V. Balakrishnan and L. Vandenberghe. Semidefinite programming duality and linear time-invariant systems. IEEE Transactions on Automatic Control, 48(1):30–41, 2003.

Keith Ball. An elementary introduction to modern convex geometry. Flavors of geometry, 31:1–58, 1997.

A. Beck. First-Order Methods in Optimization. Society for Industrial and Applied Mathematics, Philadelphia, PA, 2017. doi: 10.1137/1.9781611974997.

Richard Bellman. Dynamic programming and Lagrange multipliers. Proceedings of the National Academy of Sciences, 42(10):767–769, 1956.

Richard Bellman and Stuart Dreyfus. Functional approximations and dynamic programming. Mathematical Tables and Other Aids to Computation, 13(68):247–251, 1959.

Dimitri P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 2017.

Ronen I Brafman and Moshe Tennenholtz. R-max-a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.

Kiante´ Brantley, Wen Sun, and Mikael Henaff. Disagreement-regularized imitation learning. In International Confer- ence on Learning Representations, 2019.

Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daume, and John Langford. Learning to search better than your teacher. In International Conference on Machine Learning, pages 2058–2066. PMLR, 2015.

Jinglin Chen and Nan Jiang. Information-theoretic considerations in batch reinforcement learning. In International Conference on Machine Learning, pages 1042–1051, 2019.

Ching-An Cheng and Byron Boots. Convergence of value aggregation for imitation learning. arXiv preprint arXiv:1801.07292, 2018.

Ching-An Cheng, Xinyan Yan, Nolan Wagener, and Byron Boots. Fast policy learning through imitation and rein- forcement. arXiv preprint arXiv:1805.10413, 2018.

Ching-An Cheng, Andrey Kolobov, and Alekh Agarwal. Policy improvement from multiple experts. arXiv preprint arXiv:2007.00795, 2020.

Alon Cohen, Tomer Koren, and Yishay Mansour. Learning linear-quadratic regulators efficiently with only √T regret. In International Conference on Machine Learning, pages 1300–1309, 2019.

176 Varsha Dani, Thomas P Hayes, and Sham M Kakade. Stochastic linear optimization under bandit feedback. In COLT, pages 355–366, 2008. Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. In Ad- vances in Neural Information Processing Systems, pages 2818–2826, 2015. Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying pac and regret: Uniform pac bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pages 5713–5723, 2017. Christoph Dann, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E. Schapire. On oracle-efficient PAC reinforcement learning with rich observations. In Advances in Neural Information Processing Systems 31, 2018. Hal Daume,´ John Langford, and Daniel Marcu. Search-based structured prediction. Machine learning, 75(3):297–325, 2009. S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu. On the sample complexity of the linear quadratic regulator. ArXiv e-prints, 2017. Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. Regret bounds for robust adaptive control of the linear quadratic regulator. In Advances in Neural Information Processing Systems, pages 4188–4197, 2018. Luc Devroye and Gabor´ Lugosi. Combinatorial methods in density estimation. Springer Science & Business Media, 2012. Simon S Du, Sham M Kakade, Ruosong Wang, and Lin F Yang. Is a good representation sufficient for sample efficient reinforcement learning? In International Conference on Learning Representations, 2019. Lawrence C. Evans. An introduction to mathematical optimal control theory. University of California, Department of Mathematics, page 126, 2005. ISSN 14712334. Eyal Even-Dar, Sham M Kakade, and Yishay Mansour. Online Markov decision processes. Mathematics of Operations Research, 34(3):726–736, 2009. Maryam Fazel, Rong Ge 0001, Sham M. Kakade, and Mehran Mesbahi. Global Convergence of Policy Gradient Methods for the Linear Quadratic Regulator. In Proceedings of the 35th International Conference on Machine Learning, pages 1466–1475. PMLR, 2018. Dylan J Foster and Max Simchowitz. Logarithmic regret for adversarial online control. arXiv preprint arXiv:2003.00189, 2020. Sara A Geer and Sara van de Geer. Empirical Processes in M-estimation, volume 6. Cambridge university press, 2000. Matthieu Geist, Bruno Scherrer, and Olivier Pietquin. A theory of regularized markov decision processes. arXiv preprint arXiv:1901.11275, 2019. Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013. Peter W. Glynn. Likelihood ratio gradient estimation for stochastic systems. Commun. ACM, 33(10):75–84, 1990. ISSN 0001-0782. Gautam Goel and Babak Hassibi. The power of linear controllers in lqr control. arXiv preprint arXiv:2002.02574, 2020. Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Gabriel Dulac-Arnold, et al. Deep q-learning from demonstrations. arXiv preprint arXiv:1704.03732, 2017.

177 Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in neural information pro- cessing systems, pages 4565–4573, 2016. Daniel Hsu, Sham Kakade, and Tong Zhang. A spectral algorithm for learning hidden markov models. Journal of Computer and System Sciences, 78, 11 2008. Daniel J. Hsu, S. Kakade, and Tong Zhang. Random design analysis of ridge regression. Foundations of Computational Mathematics, 14:569–600, 2012. Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010. Zeyu Jia, Lin Yang, Csaba Szepesvari, and Mengdi Wang. Model-based reinforcement learning with value-targeted regression. Proceedings of Machine Learning Research, 120:666–686, 2020. Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E. Schapire. Contextual decision processes with low Bellman rank are PAC-learnable. In International Conference on Machine Learning, 2017. Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143, 2020. Chi Jin, Qinghua Liu, and Sobhan Miryoosefi. Bellman eluder dimension: New rich classes of rl problems, and sample-efficient algorithms. arXiv preprint arXiv:2102.00815, 2021. S. Kakade. A natural policy gradient. In NIPS, 2001. Sham Kakade and John Langford. Approximately Optimal Approximate Reinforcement Learning. In Proceedings of the 19th International Conference on Machine Learning, volume 2, pages 267–274, 2002. Sham Machandranath Kakade. On the sample complexity of reinforcement learning. PhD thesis, University of College London, 2003. Liyiming Ke, Matt Barnes, Wen Sun, Gilwoo Lee, Sanjiban Choudhury, and Siddhartha Srinivasa. Imitation learning as f-divergence minimization. arXiv preprint arXiv:1905.12888, 2019. Michael Kearns and Daphne Koller. Efficient reinforcement learning in factored mdps. In IJCAI, volume 16, pages 740–747, 1999. Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49 (2-3):209–232, 2002. Michael J Kearns and Satinder P Singh. Finite-sample convergence rates for q-learning and indirect algorithms. In Advances in neural information processing systems, pages 996–1002, 1999. Michael J. Kearns, Yishay Mansour, and Andrew Y. Ng. Approximate planning in large pomdps via reusable trajec- tories. In S. A. Solla, T. K. Leen, and K. Muller,¨ editors, Advances in Neural Information Processing Systems 12, pages 1001–1007. MIT Press, 2000. Kris M Kitani, Brian D Ziebart, James Andrew Bagnell, and Martial Hebert. Activity forecasting. In European Conference on Computer Vision, pages 201–214. Springer, 2012. Akshay Krishnamurthy, Alekh Agarwal, and John Langford. PAC reinforcement learning with rich observations. In Advances in Neural Information Processing Systems, pages 1840–1848, 2016. Tor Lattimore and Csaba Szepesvari.´ Bandit algorithms. Cambridge University Press, 2020. Alessandro Lazaric, Mohammad Ghavamzadeh, and Remi´ Munos. Analysis of classification-based policy iteration algorithms. The Journal of Machine Learning Research, 17(1):583–612, 2016.

178 Hoang M Le, Nan Jiang, Alekh Agarwal, Miroslav Dud´ık, Yisong Yue, and Hal Daume´ III. Hierarchical imitation and reinforcement learning. arXiv preprint arXiv:1803.00590, 2018. Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018. Sergey Levine and Pieter Abbeel. Learning neural network policies with guided policy search under unknown dynam- ics. In Advances in Neural Information Processing Systems, pages 1071–1079, 2014. Sergey Levine and Vladlen Koltun. Guided policy search. In International Conference on Machine Learning, pages 1–9, 2013. Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. Breaking the sample size barrier in model-based reinforcement learning with a generative model. CoRR, abs/2005.12900, 2020. Boyi Liu, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Neural proximal/trust region policy optimization attains globally optimal policy. CoRR, abs/1906.10306, 2019. URL http://arxiv.org/abs/1906.10306. Thodoris Lykouris, Max Simchowitz, Aleksandrs Slivkins, and Wen Sun. Corruption robust exploration in episodic reinforcement learning. arXiv preprint arXiv:1911.08689, 2019. Horia Mania, Stephen Tu, and Benjamin Recht. Certainty equivalent control of LQR is efficient. arXiv preprint arXiv:1902.07826, 2019. Yishay Mansour and Satinder Singh. On the complexity of policy iteration. UAI, 01 1999. C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics, pages 148–188. Cambridge University Press, 1989. Jincheng Mei, Chenjun Xiao, Csaba Szepesvari, and Dale Schuurmans. On the global convergence rates of softmax policy gradient methods, 2020. Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Sil- ver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016. Aditya Modi, Nan Jiang, Ambuj Tewari, and Satinder P. Singh. Sample complexity of reinforcement learning using linearly combined model ensembles. In The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS, volume 108 of Proceedings of Machine Learning Research, 2020. Remi´ Munos. Error bounds for approximate policy iteration. In ICML, volume 3, pages 560–567, 2003. Remi´ Munos. Error bounds for approximate value iteration. In AAAI, 2005. Gergely Neu, Anders Jonsson, and Vicenc¸Gomez.´ A unified view of entropy-regularized markov decision processes. CoRR, abs/1705.07798, 2017. Ian Osband and Benjamin Van Roy. On lower bounds for regret in reinforcement learning. ArXiv, abs/1608.02732, 2016. Alejandro Perez, Robert Platt, George Konidaris, Leslie Kaelbling, and Tomas Lozano-Perez. LQR-RRT*: Optimal sampling-based motion planning with automatically derived extension heuristics. In IEEE International Conference on Robotics and Automation, pages 2537–2542, 2012. Jan Peters and Stefan Schaal. Natural actor-critic. Neurocomput., 71(7-9):1180–1190, 2008. ISSN 0925-2312. Dean A Pomerleau. Alvinn: An autonomous land vehicle in a neural network. In Advances in neural information processing systems, pages 305–313, 1989.

179 Martin Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley-Interscience, 1994. Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087, 2017. H. Robbins. Some aspects of the sequential design of experiments. In Bulletin of the American Mathematical Society, volume 55, 1952. Stephane´ Ross and Drew Bagnell. Efficient reductions for imitation learning. In Proceedings of the thirteenth inter- national conference on artificial intelligence and statistics, pages 661–668, 2010. Stephane Ross and J Andrew Bagnell. Reinforcement and imitation learning via interactive no-regret learning. arXiv preprint arXiv:1406.5979, 2014. Stephane´ Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635, 2011. Bruno Scherrer. Approximate policy iteration schemes: a comparison. In International Conference on Machine Learning, pages 1314–1322, 2014. Bruno Scherrer and Matthieu Geist. Local policy search in a convex space and conservative policy iteration as boosted policy search. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 35–50. Springer, 2014. John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015. John Schulman, F. Wolski, Prafulla Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. ArXiv, abs/1707.06347, 2017. Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2011. Claude E. Shannon. Programming a computer playing chess. Philosophical Magazine, Ser.7, 41(312), 1959. Aaron Sidford, Mengdi Wang, Xian Wu, Lin F. Yang, and Yinyu Ye. Near-optimal time and sample complexities for for solving discounted markov decision process with a generative model. In Advances in Neural Information Processing Systems 31, 2018. Max Simchowitz and Dylan J Foster. Naive exploration is optimal for online LQR. arXiv preprint arXiv:2001.09576, 2020. Max Simchowitz, Horia Mania, Stephen Tu, Michael I Jordan, and Benjamin Recht. Learning without mixing: To- wards a sharp analysis of linear system identification. In COLT, 2018. Max Simchowitz, Karan Singh, and Elad Hazan. Improper learning for non-stochastic control. arXiv preprint arXiv:2001.09254, 2020. Satinder Singh and Richard Yee. An upper bound on the loss from approximate optimal-value functions. Machine Learning, 16(3):227–233, 1994. Wen Sun, Arun Venkatraman, Geoffrey J Gordon, Byron Boots, and J Andrew Bagnell. Deeply aggrevated: Differen- tiable imitation learning for sequential prediction. arXiv preprint arXiv:1703.01030, 2017. Wen Sun, J Andrew Bagnell, and Byron Boots. Truncated horizon policy search: Combining reinforcement learning & imitation learning. arXiv preprint arXiv:1805.11240, 2018.

Wen Sun, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, and John Langford. Model-based RL in contextual decision processes: PAC bounds and exponential improvements over model-free approaches. In Conference on Learning Theory, pages 2898–2933, 2019a.

Wen Sun, Anirudh Vemula, Byron Boots, and J Andrew Bagnell. Provably efficient imitation learning from observation alone. arXiv preprint arXiv:1905.10948, 2019b.

Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforce- ment learning with function approximation. In Advances in Neural Information Processing Systems, volume 99, pages 1057–1063, 1999.

Csaba Szepesvari´ and Remi´ Munos. Finite time bounds for sampling based fitted value iteration. In Proceedings of the 22nd international conference on Machine learning, pages 880–887. ACM, 2005.

Russ Tedrake. LQR-trees: Feedback motion planning on sparse randomized trees. The International Journal of Robotics Research, 35, 2009.

Michael J Todd. Minimum-Volume Ellipsoids: Theory and Algorithms, volume 23. SIAM, 2016.

Emanuel Todorov and Weiwei Li. A generalized iterative LQG method for locally-optimal feedback control of con- strained nonlinear stochastic systems. In American Control Conference, pages 300–306, 2005.

Ruosong Wang, Dean P. Foster, and Sham M. Kakade. What are the statistical limits of offline rl with linear function approximation?, 2020.

Yuanhao Wang, Ruosong Wang, and Sham M. Kakade. An exponential lower bound for linearly-realizable mdps with constant suboptimality gap. CoRR, abs/2103.12690, 2021. URL https://arxiv.org/abs/2103.12690.

Yuh-Shyang Wang, Nikolai Matni, and John C Doyle. A system-level approach to controller synthesis. IEEE Trans- actions on Automatic Control, 64(10):4079–4093, 2019.

Gellert´ Weisz, Philip Amortila, and Csaba Szepesvari.´ Exponential lower bounds for planning in mdps with linearly- realizable optimal action-value functions. In Algorithmic Learning Theory, 16-19 March 2021, Virtual Conference, Worldwide, volume 132 of Proceedings of Machine Learning Research, 2021.

Zheng Wen and Benjamin Van Roy. Efficient reinforcement learning in deterministic systems with value function generalization. Mathematics of Operations Research, 42(3):762–782, 2017.

Grady Williams, Andrew Aldrich, and Evangelos A. Theodorou. Model predictive path integral control: From theory to parallel computation. Journal of Guidance, Control, and Dynamics, 40(2):344–357, 2017.

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.

Lin F Yang and Mengdi Wang. Reinforcement leaning in feature space: Matrix bandit, kernels, and regret bound. arXiv preprint arXiv:1905.10389, 2019a.

Lin F. Yang and Mengdi Wang. Sample-optimal parametric q-learning using linearly additive features. In International Conference on Machine Learning, pages 6995–7004, 2019b.

Yinyu Ye. A new complexity result on solving the markov decision problem. Math. Oper. Res., 30:733–749, 08 2005.

Yinyu Ye. The simplex and policy-iteration methods are strongly polynomial for the markov decision problem with a fixed discount rate. Math. Oper. Res., 36(4):593–603, 2011.

Ming Yin, Yu Bai, and Yu-Xiang Wang. Near optimal provable uniform convergence in off-policy evaluation for reinforcement learning. ArXiv, abs/2007.03760, 2021.

Dante Youla, Hamid Jabr, and Jr Bongiorno. Modern Wiener-Hopf design of optimal controllers, part II: The multivariable case. IEEE Transactions on Automatic Control, 21(3):319–338, 1976.

Andrea Zanette. Exponential lower bounds for batch reinforcement learning: Batch RL can be exponentially harder than online RL. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, Proceedings of Machine Learning Research. PMLR, 2021. Andrea Zanette, Alessandro Lazaric, Mykel Kochenderfer, and Emma Brunskill. Learning near optimal policies with low inherent bellman error. In International Conference on Machine Learning, pages 10978–10989. PMLR, 2020.

Dongruo Zhou, Jiafan He, and Quanquan Gu. Provably efficient reinforcement learning for discounted mdps with feature mapping. arXiv preprint arXiv:2006.13165, 2020. Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI, pages 1433–1438, 2008.

Brian D Ziebart, Nathan Ratliff, Garratt Gallagher, Christoph Mertz, Kevin Peterson, J Andrew Bagnell, Martial Hebert, Anind K Dey, and Siddhartha Srinivasa. Planning-based prediction for pedestrians. In 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 3931–3936. IEEE, 2009. Brian D Ziebart, J Andrew Bagnell, and Anind K Dey. Modeling interaction via the principle of maximum causal entropy. In Proceedings of the 27th International Conference on Machine Learning (ICML-10). Carnegie Mellon University, 2010. Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th international conference on machine learning (icml-03), pages 928–936, 2003.

Appendix A

Concentration

Lemma A.1. (Hoeffding's inequality) Suppose $X_1, X_2, \ldots, X_n$ are a sequence of independent, identically distributed (i.i.d.) random variables with mean $\mu$. Let $\bar{X}_n = n^{-1}\sum_{i=1}^n X_i$. Suppose that $X_i \in [b_-, b_+]$ with probability 1. Then

$P\left(\bar{X}_n \ge \mu + \epsilon\right) \le e^{-2n\epsilon^2/(b_+ - b_-)^2}.$

Similarly,

$P\left(\bar{X}_n \le \mu - \epsilon\right) \le e^{-2n\epsilon^2/(b_+ - b_-)^2}.$

The Chernoff bound implies that, with probability at least $1 - \delta$:

$\bar{X}_n - \mathbb{E}X \le (b_+ - b_-)\sqrt{\ln(1/\delta)/(2n)}.$
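The following is a quick numerical sanity check of Hoeffding's bound for i.i.d. Bernoulli(1/2) variables (so $b_- = 0$ and $b_+ = 1$); the sample size and deviation below are arbitrary illustrative choices, and numpy is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps, trials = 100, 0.1, 200_000
X = rng.integers(0, 2, size=(trials, n))        # Bernoulli(1/2) samples
emp_prob = np.mean(X.mean(axis=1) >= 0.5 + eps) # empirical tail probability
print(emp_prob, np.exp(-2 * n * eps**2))        # empirical tail vs. exp(-2 n eps^2)
```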

Now we introduce the definition of subGaussian random variables.

Definition A.2. [Sub-Gaussian Random Variable] A random variable $X$ is $\sigma$-subGaussian if for all $\lambda \in \mathbb{R}$, it holds that $\mathbb{E}[\exp(\lambda X)] \le \exp\left(\lambda^2 \sigma^2/2\right)$.

One can show that a Gaussian random variable with zero mean and standard deviation $\sigma$ is a $\sigma$-subGaussian random variable. The following theorem shows that the tails of a $\sigma$-subGaussian random variable decay approximately as fast as those of a Gaussian variable with zero mean and standard deviation $\sigma$.

Theorem A.3. If $X$ is $\sigma$-subGaussian, then for any $\epsilon > 0$, we have $\Pr(X \ge \epsilon) \le \exp\left(-\frac{\epsilon^2}{2\sigma^2}\right)$.

The following lemma shows that the sum of independent sub-Gaussian variables is still sub-Gaussian.

Lemma A.4. Suppose that $X_1$ and $X_2$ are independent and $\sigma_1$- and $\sigma_2$-subGaussian, respectively. Then for any $c \in \mathbb{R}$, $cX_1$ is $|c|\sigma_1$-subGaussian. We also have that $X_1 + X_2$ is $\sqrt{\sigma_1^2 + \sigma_2^2}$-subGaussian.

Theorem A.5 (Hoeffding-Azuma Inequality). Suppose $X_1, \ldots, X_T$ is a martingale difference sequence where each $X_t$ is $\sigma_t$-sub-Gaussian. Then, for all $\epsilon > 0$ and all positive integers $N$,

$P\left(\sum_{i=1}^{N} X_i \ge \epsilon\right) \le \exp\left(\frac{-\epsilon^2}{2\sum_{i=1}^{N}\sigma_i^2}\right).$

Lemma A.6. (Bernstein's inequality) Suppose $X_1, \ldots, X_n$ are independent random variables. Let $\bar{X}_n = n^{-1}\sum_{i=1}^n X_i$, $\mu = \mathbb{E}\bar{X}_n$, and let $\mathrm{Var}(X_i)$ denote the variance of $X_i$. If $X_i - \mathbb{E}X_i \le b$ for all $i$, then

$P\left(\bar{X}_n \ge \mu + \epsilon\right) \le \exp\left(\frac{-n^2\epsilon^2}{2\sum_{i=1}^n \mathrm{Var}(X_i) + 2nb\epsilon/3}\right).$

If all the variances are equal, the Bernstein inequality implies that, with probability at least $1 - \delta$,

$\bar{X}_n - \mathbb{E}X \le \sqrt{2\,\mathrm{Var}(X)\ln(1/\delta)/n} + \frac{2b\ln(1/\delta)}{3n}.$

Lemma A.7 (Bernstein's Inequality for Martingales). Suppose $X_1, X_2, \ldots$ is a martingale difference sequence where $|X_i| \le M \in \mathbb{R}^+$ almost surely. Then for all positive $t$ and $n \in \mathbb{N}^+$, we have:

$P\left(\sum_{i=1}^n X_i \ge t\right) \le \exp\left(\frac{-t^2/2}{\sum_{i=1}^n \mathbb{E}\left[X_i^2 \mid X_1, \ldots, X_{i-1}\right] + Mt/3}\right).$

The following concentration bound is a simple application of McDiarmid's inequality [McDiarmid, 1989] (e.g. see [Hsu et al., 2008] for a proof).

Proposition A.8. (Concentration for Discrete Distributions) Let $z$ be a discrete random variable that takes values in $\{1, \ldots, d\}$, distributed according to $q$. We write $q$ as a vector where $\vec{q} = [\Pr(z = j)]_{j=1}^d$. Assume we have $N$ i.i.d. samples, and that our empirical estimate of $\vec{q}$ is $[\hat{q}]_j = \sum_{i=1}^N \mathbf{1}[z_i = j]/N$. We have that for all $\epsilon > 0$:

$\Pr\left(\|\hat{q} - \vec{q}\|_2 \ge 1/\sqrt{N} + \epsilon\right) \le e^{-N\epsilon^2},$

which implies that:

$\Pr\left(\|\hat{q} - \vec{q}\|_1 \ge \sqrt{d}\left(1/\sqrt{N} + \epsilon\right)\right) \le e^{-N\epsilon^2}.$

Lemma A.9 (Self-Normalized Bound for Vector-Valued Martingales; [Abbasi-Yadkori et al., 2011]). Let $\{\varepsilon_i\}_{i=1}^{\infty}$ be a real-valued stochastic process with corresponding filtration $\{\mathcal{F}_i\}_{i=1}^{\infty}$ such that $\varepsilon_i$ is $\mathcal{F}_i$-measurable, $\mathbb{E}[\varepsilon_i \mid \mathcal{F}_{i-1}] = 0$, and $\varepsilon_i$ is conditionally $\sigma$-sub-Gaussian with $\sigma \in \mathbb{R}^+$. Let $\{X_i\}_{i=1}^{\infty}$ be a stochastic process with $X_i \in \mathcal{H}$ (some Hilbert space) and $X_i$ being $\mathcal{F}_{i-1}$-measurable. Assume that a linear operator $\Sigma_0 : \mathcal{H} \to \mathcal{H}$ is positive definite, i.e., $x^\top \Sigma_0 x > 0$ for any $x \in \mathcal{H}$. For any $t$, define the linear operator $\Sigma_t = \Sigma_0 + \sum_{i=1}^t X_i X_i^\top$ (here $xx^\top$ denotes the outer product in $\mathcal{H}$). With probability at least $1 - \delta$, we have for all $t \ge 1$:

$\left\|\sum_{i=1}^t X_i\varepsilon_i\right\|^2_{\Sigma_t^{-1}} \le \sigma^2 \log\left(\frac{\det(\Sigma_t)\det(\Sigma_0)^{-1}}{\delta^2}\right).$

Theorem A.10 (OLS with a fixed design). Consider a fixed dataset $\mathcal{D} = \{x_i, y_i\}_{i=1}^N$ where $x_i \in \mathbb{R}^d$ and $y_i = (\theta^\star)^\top x_i + \epsilon_i$, where $\{\epsilon_i\}_{i=1}^N$ are independent $\sigma$-sub-Gaussian random variables with $\mathbb{E}[\epsilon_i] = 0$. Denote $\Lambda = \sum_{i=1}^N x_i x_i^\top$, and further assume that $\Lambda$ is full rank so that $\Lambda^{-1}$ exists. Denote $\hat{\theta}$ as the least squares solution, i.e., $\hat{\theta} = \Lambda^{-1}\sum_{i=1}^N x_i y_i$. Pick $\delta \in (0, 1)$. With probability at least $1 - \delta$, we have:

$\left\|\hat{\theta} - \theta^\star\right\|^2_{\Lambda} \le \sigma^2\left(d + \sqrt{d\ln(1/\delta)} + 2\ln(1/\delta)\right).$

The proof of the above theorem is standard (e.g. the proof can be easily adapted from Lemma 6 in [Hsu et al., 2012]).
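The following is a quick simulation of this bound with Gaussian noise (which is $\sigma$-sub-Gaussian); the design, dimensions, and $\delta$ below are arbitrary illustrative choices, and numpy is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, sigma, delta = 500, 10, 1.0, 0.01
X = rng.standard_normal((N, d))                  # fixed design, held constant
theta_star = rng.standard_normal(d)
Lam = X.T @ X

failures = 0
for _ in range(2000):
    y = X @ theta_star + sigma * rng.standard_normal(N)
    theta_hat = np.linalg.solve(Lam, X.T @ y)    # least squares solution
    err = theta_hat - theta_star
    lhs = err @ Lam @ err                        # squared Lambda-norm error
    rhs = sigma**2 * (d + np.sqrt(d * np.log(1/delta)) + 2 * np.log(1/delta))
    failures += lhs > rhs
print(failures / 2000)                           # at most about delta
```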

Lemma A.11 (Least Squares Generalization Bound). Given a dataset $\mathcal{D} = \{x_i, y_i\}_{i=1}^n$ where $x_i \in \mathcal{X}$, $x_i \sim \nu$, and $y_i = f^\star(x_i) + \epsilon_i$, suppose $|y_i| \le Y$, $\max_x |f^\star(x)| \le Y$, and $|\epsilon_i| \le \sigma$ for all $i$, with the $\{\epsilon_i\}$ independent from each other. Given a function class $\mathcal{F} : \mathcal{X} \mapsto [0, Y]$, we assume approximate realizability, i.e., $\min_{f \in \mathcal{F}} \mathbb{E}_{x\sim\nu}\,|f^\star(x) - f(x)|^2 \le \epsilon_{\mathrm{approx}}$. Denote $\hat{f}$ as the least squares solution, i.e., $\hat{f} = \mathrm{argmin}_{f\in\mathcal{F}} \sum_{i=1}^n (f(x_i) - y_i)^2$. With probability at least $1 - \delta$, we have:

$\mathbb{E}_{x\sim\nu}\left(\hat{f}(x) - f^\star(x)\right)^2 \le \frac{144\, Y^2 \ln(|\mathcal{F}|/\delta)}{n} + 8\,\epsilon_{\mathrm{approx}}.$

For notation simplicity, let us denote f ?(x) := E[y x]. Let us consider a fixed function f . | ∈ F f 2 ? 2 f ? 2 Denote random variables zi := (f(xi) yi) (f (xi) yi) . We now show that E[zi ] = f f 2,v, and f 2 2 ? 2 − − − k − k E[(zi ) ] 4Y f f 2,ν : ≤ k − k f ? ? ? 2 ? 2 E[zi ] = E [(f(xi) f (xi))(f(xi) + f (xi) 2yi)] = E[(f(xi) f (xi)) ] := f f 2,ν − − − k − k ? where the last equality uses the fact that E[yi xi] = f (xi). For the second moment, we have: | f 2  ? 2 ? 2 2 ? 2 2 ? 2 E[(zi ) ] = E (f(xi) f (xi)) (f(xi) + f (xi) 2yi) 4Y E[(f(xi) f (xi)) ] := 4Y f f 2,ν , − − ≤ − k − k ? where the inequality uses the conditions that f ∞ Y, f ∞ Y, yi Y . Now we can bound the deviation Pn f k k ≤ k k ? ≤2 | | ≤ from the empirical mean i=1 zi /n to its expectation f f 2,ν using Bernstein’s inequality. Together with a union bound over all f , with probability at least 1 kδ, we− havek ∈ F − s n 2 ? 2 2 X 8Y f f 2,ν ln( /δ) 4Y ln( /δ) f : f f ? 2 zf /n k − k |F| + |F| . (0.1) ∀ ∈ F k − k2,ν − i ≤ n n i=1 Pn fˆ ˆ Below we first bound i=1 zi /n where recall f is the least squares solution. Denote f˜ := argmin f f ? 2 . Note that f˜ is independent of the dataset by its definition. Our previous f∈F k − k2,ν f˜ ˜ ? 2 f˜ 2 2 ˜ ? 2 calculation implies that E[zi ] = f f 2,ν , and E[(zi ) ] 4Y f f 2,ν . Directly applying Bernstein’s ˜ k − k ≤ k − k inequality to bound Pn zf /n f˜ f ? 2 , we have that with probability at least 1 δ: i=1 i − k − k2,ν − s n 2 ˜ ? 2 2 X ˜ 8Y f f 2,ν ln(1/δ) 4Y ln(1/δ) zf /n f˜ f ? 2 k − k + . i − k − k2,ν ≤ n 3n i=1

˜ Under this inequality, when Pn zf /n 4Y 2 ln(1/δ)/n, we can show that the above inequality implies that: i=1 i ≥ n ˜ ? 2 2 X ˜ f f 2,ν 10Y ln(1/δ) zf /n k − k + . i ≤ 2 3n i=1

˜ ? 2 Since f f approx by the assumption, we can conclude that: k − k2,ν ≤ n 2 X ˜ approx 10Y ln(1/δ) zf /n + . i ≤ 2 3n i=1

ˆ ˜ Note that by definition of fˆ, we have Pn zf Pn zf , which implies that: i=1 i ≤ i=1 i n 2 X ˆ approx 10Y ln(1/δ) zf /n + , i ≤ 2 3n i=1

185 Pn fˆ which concludes the upper bound of i=1 zi /n. ˆ Pn fˆ Going back to Eq. 0.1, by setting f = f there, and use the upper bound on i=1 zi /n, we have that: s 8Y 2 fˆ f ? 2 ln( /δ) 2 2 ˆ ? 2 2,ν 4Y ln( /δ) 14Y ln( /δ) f f k − k |F| + |F| + |F| + approx. k − k2,ν ≤ n n 3n

Solve for fˆ f ? 2 from the above inequality, we get: k − k2,ν 2 ˆ ? 2 144Y ln( /δ) f f |F| + 8approx. k − k2,ν ≤ n This concludes the proof.
