
Reinforcement Learning: Theory and Algorithms

Alekh Agarwal    Nan Jiang    Sham M. Kakade    Wen Sun

September 30, 2021

WORKING DRAFT: We will be frequently updating the book this fall, 2021. Please email [email protected] with any typos or errors you find. We appreciate it!

Contents

Part 1: Fundamentals  3

1  Markov Decision Processes  5
   1.1  Discounted (Infinite-Horizon) Markov Decision Processes  5
        1.1.1  The objective, policies, and values  5
        1.1.2  Bellman Consistency Equations for Stationary Policies  7
        1.1.3  Bellman Optimality Equations  8
   1.2  Finite-Horizon Markov Decision Processes  11
   1.3  Computational Complexity  12
        1.3.1  Value Iteration  12
        1.3.2  Policy Iteration  14
        1.3.3  Value Iteration for Finite Horizon MDPs  16
        1.3.4  The Linear Programming Approach  16
   1.4  Sample Complexity and Sampling Models  18
   1.5  Bonus: Advantages and The Performance Difference Lemma  18
   1.6  Bibliographic Remarks and Further Reading  20

2  Sample Complexity with a Generative Model  21
   2.1  Warmup: a naive model-based approach  21
   2.2  Sublinear Sample Complexity  23
   2.3  Minmax Optimal Sample Complexity (and the Model Based Approach)  24
        2.3.1  The Discounted Case  24
        2.3.2  Finite Horizon Setting  25
   2.4  Analysis  26
        2.4.1  Variance Lemmas  26
        2.4.2  Completing the proof  28
   2.5  Scalings and Effective Horizon Dependencies  29
   2.6  Bibliographic Remarks and Further Readings  29

3  Linear Bellman Completeness  31
   3.1  The Linear Bellman Completeness Condition  31
   3.2  The LSVI Algorithm  32
   3.3  LSVI with D-Optimal Design  32
        3.3.1  D-Optimal Design  32
        3.3.2  Performance Guarantees  33
        3.3.3  Analysis  34
   3.4  How Strong is Bellman Completion as a Modeling?  35
   3.5  Offline Reinforcement Learning  36
        3.5.1  Offline Learning  36
        3.5.2  Offline Policy Evaluation  36
   3.6  Bibliographic Remarks and Further Readings  37

4  Fitted Dynamic Programming Methods  39
   4.1  Fitted Q-Iteration (FQI) and Offline RL  39
        4.1.1  The FQI Algorithm  40
        4.1.2  Performance Guarantees of FQI  40
   4.2  Fitted Policy-Iteration (FPI)  42
   4.3  Failure Cases Without Assumption 4.1  43
   4.4  FQI for Policy Evaluation  43
   4.5  Bibliographic Remarks and Further Readings  43

5  Statistical Limits of Generalization  45
   5.1  Agnostic Learning  46
        5.1.1  Review: Binary Classification  46
        5.1.2  Importance Sampling and a Reduction to Supervised Learning  47
   5.2  Linear Realizability  49
        5.2.1  Offline Policy Evaluation with Linearly Realizable Values  49
        5.2.2  Linearly Realizable Q⋆  53
        5.2.3  Linearly Realizable π⋆  58
   5.3  Discussion: Studying Generalization in RL  59
   5.4  Bibliographic Remarks and Further Readings  59

Part 2: Strategic Exploration  61

6  Multi-Armed & Linear Bandits  63
   6.1  The K-Armed Bandit Problem  63
        6.1.1  The Upper Confidence Bound (UCB) Algorithm  63
   6.2  Linear Bandits: Handling Large Action Spaces  65
        6.2.1  The LinUCB algorithm  66
        6.2.2  Upper and Lower Bounds  67
   6.3  LinUCB Analysis  68
        6.3.1  Regret Analysis  69
        6.3.2  Confidence Analysis  71
   6.4  Bibliographic Remarks and Further Readings  71

7  Strategic Exploration in Tabular MDPs  73
   7.1  On The Need for Strategic Exploration  73
   7.2  The UCB-VI algorithm  74
   7.3  Analysis  75
        7.3.1  Proof of Lemma 7.2  78
   7.4  An Improved Regret Bound  80
   7.5  Phased Q-learning  83
   7.6  Bibliographic Remarks and Further Readings  84

8  Linearly Parameterized MDPs  85
   8.1  Setting  85
        8.1.1  Low-Rank MDPs and Linear MDPs  85
   8.2  Planning in Linear MDPs  86
   8.3  Learning Transition using Ridge Linear Regression  87
   8.4  Uniform Convergence via Covering  89
   8.5  Algorithm  92
   8.6  Analysis of UCBVI for Linear MDPs  93
        8.6.1  Proving Optimism  93
        8.6.2  Regret Decomposition  94
        8.6.3  Concluding the Final Regret Bound  95
   8.7  Bibliographic Remarks and Further Readings  96

9  Parametric Models with Bounded Bellman Rank  97
   9.1  Problem setting  97
   9.2  Value-function approximation  98
   9.3  Bellman Rank  98
        9.3.1  Examples  99
        9.3.2  Examples that do not have low Bellman Rank  100
   9.4  Algorithm  101
   9.5  Extension to Model-based Setting  102
   9.6  Bibliographic Remarks and Further Readings  103

10  Deterministic MDPs with Linearly Parameterized Q⋆  105

Part 3: Policy Optimization  107

11  Policy Gradient Methods and Non-Convex Optimization  109
    11.1  Policy Gradient Expressions and the Likelihood Ratio Method  …