Spinning Up Documentation
Joshua Achiam
Feb 07, 2020

User Documentation

1 Introduction
  1.1 What This Is
  1.2 Why We Built This
  1.3 How This Serves Our Mission
  1.4 Code Design Philosophy
  1.5 Long-Term Support and Support History
2 Installation
  2.1 Installing Python
  2.2 Installing OpenMPI
  2.3 Installing Spinning Up
  2.4 Check Your Install
  2.5 Installing MuJoCo (Optional)
3 Algorithms
  3.1 What's Included
  3.2 Why These Algorithms?
  3.3 Code Format
4 Running Experiments
  4.1 Launching from the Command Line
  4.2 Launching from Scripts
5 Experiment Outputs
  5.1 Algorithm Outputs
  5.2 Save Directory Location
  5.3 Loading and Running Trained Policies
6 Plotting Results
7 Part 1: Key Concepts in RL
  7.1 What Can RL Do?
  7.2 Key Concepts and Terminology
  7.3 (Optional) Formalism
8 Part 2: Kinds of RL Algorithms
  8.1 A Taxonomy of RL Algorithms
  8.2 Links to Algorithms in Taxonomy
9 Part 3: Intro to Policy Optimization
  9.1 Deriving the Simplest Policy Gradient
  9.2 Implementing the Simplest Policy Gradient
  9.3 Expected Grad-Log-Prob Lemma
  9.4 Don't Let the Past Distract You
  9.5 Implementing Reward-to-Go Policy Gradient
  9.6 Baselines in Policy Gradients
  9.7 Other Forms of the Policy Gradient
  9.8 Recap
10 Spinning Up as a Deep RL Researcher
  10.1 The Right Background
  10.2 Learn by Doing
  10.3 Developing a Research Project
  10.4 Doing Rigorous Research in RL
  10.5 Closing Thoughts
  10.6 PS: Other Resources
  10.7 References
11 Key Papers in Deep RL
  11.1 1. Model-Free RL
  11.2 2. Exploration
  11.3 3. Transfer and Multitask RL
  11.4 4. Hierarchy
  11.5 5. Memory
  11.6 6. Model-Based RL
  11.7 7. Meta-RL
  11.8 8. Scaling RL
  11.9 9. RL in the Real World
  11.10 10. Safety
  11.11 11. Imitation Learning and Inverse Reinforcement Learning
  11.12 12. Reproducibility, Analysis, and Critique
  11.13 13. Bonus: Classic Papers in RL Theory or Review
12 Exercises
  12.1 Problem Set 1: Basics of Implementation
  12.2 Problem Set 2: Algorithm Failure Modes
  12.3 Challenges
13 Benchmarks for Spinning Up Implementations
  13.1 Performance in Each Environment
  13.2 Experiment Details
  13.3 PyTorch vs Tensorflow
14 Vanilla Policy Gradient
  14.1 Background
  14.2 Documentation
  14.3 References
15 Trust Region Policy Optimization
  15.1 Background
  15.2 Documentation
  15.3 References
16 Proximal Policy Optimization
  16.1 Background
  16.2 Documentation
  16.3 References
17 Deep Deterministic Policy Gradient
  17.1 Background
  17.2 Documentation
  17.3 References
18 Twin Delayed DDPG
  18.1 Background
  18.2 Documentation
  18.3 References
19 Soft Actor-Critic
  19.1 Background
  19.2 Documentation
  19.3 References
20 Logger
  20.1 Using a Logger
  20.2 Logger Classes
  20.3 Loading Saved Models (PyTorch Only)
  20.4 Loading Saved Graphs (Tensorflow Only)
21 Plotter
22 MPI Tools
  22.1 Core MPI Utilities
  22.2 MPI + PyTorch Utilities
  22.3 MPI + Tensorflow Utilities
23 Run Utils
  23.1 ExperimentGrid
  23.2 Calling Experiments
24 Acknowledgements
25 About the Author
26 Indices and tables
Python Module Index

CHAPTER 1

Introduction

Table of Contents

• Introduction
  – What This Is
  – Why We Built This
  – How This Serves Our Mission
  – Code Design Philosophy
  – Long-Term Support and Support History

1.1 What This Is

Welcome to Spinning Up in Deep RL! This is an educational resource produced by OpenAI that makes it easier to learn about deep reinforcement learning (deep RL).

For the unfamiliar: reinforcement learning (RL) is a machine learning approach for teaching agents how to solve tasks by trial and error. Deep RL refers to the combination of RL with deep learning.

This module contains a variety of helpful resources, including:

• a short introduction to RL terminology, kinds of algorithms, and basic theory,
• an essay about how to grow into an RL research role,
• a curated list of important papers organized by topic,
• a well-documented code repo of short, standalone implementations of key algorithms,
• and a few exercises to serve as warm-ups.

1.2 Why We Built This

One of the single most common questions that we hear is, "If I want to contribute to AI safety, how do I get started?"

At OpenAI, we believe that deep learning generally—and deep reinforcement learning specifically—will play central roles in the development of powerful AI technology. To ensure that AI is safe, we have to come up with safety strategies and algorithms that are compatible with this paradigm. As a result, we encourage everyone who asks this question to study these fields.

However, while there are many resources to help people quickly ramp up on deep learning, deep reinforcement learning is more challenging to break into. To begin with, a student of deep RL needs to have some background in math, coding, and regular deep learning. Beyond that, they need both a high-level view of the field—an awareness of what topics are studied in it, why they matter, and what's been done already—and careful instruction on how to connect algorithm theory to algorithm code.

The high-level view is hard to come by because of how new the field is. There is not yet a standard deep RL textbook, so most of the knowledge is locked up in either papers or lecture series, which can take a long time to parse and digest. And learning to implement deep RL algorithms is typically painful, because either

• the paper that publishes an algorithm omits or inadvertently obscures key design details,
• or widely-public implementations of an algorithm are hard to read, hiding how the code lines up with the algorithm.

While fantastic repos like garage, Baselines, and rllib make it easier for researchers who are already in the field to make progress, they build algorithms into frameworks