
Winter 2020 CSC 594 Topics in AI: Advanced Deep Learning

5. Deep Reinforcement Learning (1): Key Concepts & Algorithms

(Most content adapted from OpenAI ‘Spinning Up’)

Noriko Tomuro

Reinforcement Learning (RL)

• Reinforcement Learning (RL) is a type of Machine Learning where an agent learns to achieve a goal by interacting with the environment -- through trial and error.
• RL is one of the three basic machine learning paradigms, alongside supervised learning and unsupervised learning.
• The purpose of RL is to learn an optimal policy -- one that maximizes the return over the sequence of the agent's actions.

https://en.wikipedia.org/wiki/Reinforcement_learning

• RL has recently enjoyed a wide variety of successes, e.g.
  – Robot control in simulation as well as in the real world
  – Games, such as
      • Go, with AlphaGo (by Google DeepMind)
      • Atari video games

https://en.wikipedia.org/wiki/Reinforcement_learning

Deep Reinforcement Learning (DRL)

• A policy is essentially a function that maps the agent's state (or observation) to an action. Deep Reinforcement Learning (DRL) uses deep neural networks to represent that function (and other components, such as value functions).

https://spinningup.openai.com/en/latest/spinningup/rl_intro.html

Some Key Concepts and Terminology

1. States and Observations
  – A state s is a complete description of the state of the world. For now, we can think of the state as belonging to the environment.
  – An observation o is a partial description of a state, which may omit information.
  – A state may be fully or partially observable to the agent. If it is only partially observable, the agent forms an internal state (or state estimate).

https://spinningup.openai.com/en/latest/spinningup/rl_intro.html

• States and Observations (cont.)
  – In deep RL, we almost always represent states and observations by a real-valued vector, matrix, or higher-order tensor.
  – For example, a state in the CartPole environment is represented by four features (see the sketch below):
      • Cart position
      • Cart velocity
      • Pole angle
      • Pole angular velocity
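[Illustration] A minimal sketch (not from the slides) of inspecting the CartPole state vector with OpenAI Gym. It assumes the classic `gym` package and the `CartPole-v1` environment id; newer `gymnasium` releases return `(obs, info)` from `reset()` instead.

```python
import gym

# Minimal sketch: the CartPole state/observation is a length-4 real-valued vector.
env = gym.make("CartPole-v1")

obs = env.reset()
# obs = [cart position, cart velocity, pole angle, pole angular velocity]
print(obs)                    # e.g. [ 0.03  0.01 -0.02  0.04]
print(env.observation_space)  # a 4-dimensional continuous (Box) space
```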

https://spinningup.openai.com/en/latest/spinningup/rl_intro.html

2. Action Spaces
  – Different environments allow different kinds of actions. The set of all valid actions in a given environment is often called the action space.
  – Some environments, like Atari and Go, have discrete action spaces, where only a finite number of moves are available to the agent.
  – Other environments, like those where the agent controls a robot in the physical world, have continuous action spaces. In continuous spaces, actions are real-valued vectors (a sketch contrasting the two kinds follows below).
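[Illustration] A sketch (not from the slides) contrasting a discrete and a continuous action space in Gym. The environment ids `CartPole-v1` and `Pendulum-v1` are assumptions based on a standard Gym installation.

```python
import gym

# Discrete action space: a finite set of moves (here 2: push cart left / right).
discrete_env = gym.make("CartPole-v1")
print(discrete_env.action_space)           # Discrete(2)
print(discrete_env.action_space.sample())  # e.g. 0 or 1

# Continuous action space: actions are real-valued vectors (here a 1-D torque).
continuous_env = gym.make("Pendulum-v1")
print(continuous_env.action_space)           # Box(-2.0, 2.0, (1,), float32)
print(continuous_env.action_space.sample())  # e.g. [0.73]
```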

https://spinningup.openai.com/en/latest/spinningup/rl_intro.html

3. Policies
  – A policy is a rule used by an agent to decide what actions to take. It can be deterministic, in which case it is usually denoted by µ:

        $a_t = \mu(s_t)$

    or it may be stochastic, usually denoted by π:

        $a_t \sim \pi(\cdot \mid s_t)$

    [Note: ~ represents 'sampling' from the stochastic process.]
  – In deep RL, we deal with parameterized policies: policies whose outputs depend on a set of parameters θ (e.g. the weights and biases of a neural network); a toy sketch follows below:

        $a_t = \mu_\theta(s_t)$        $a_t \sim \pi_\theta(\cdot \mid s_t)$
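[Illustration] A toy sketch (not from the slides) of a parameterized stochastic policy $\pi_\theta(\cdot \mid s)$: a one-hidden-layer NumPy network whose weights and biases are the parameters θ, producing a softmax distribution over two discrete actions. All sizes and values here are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Parameters theta of a tiny stochastic policy pi_theta(a | s):
# 4-dimensional state -> hidden layer -> probabilities over 2 discrete actions.
theta = {
    "W1": rng.normal(scale=0.1, size=(4, 32)), "b1": np.zeros(32),
    "W2": rng.normal(scale=0.1, size=(32, 2)), "b2": np.zeros(2),
}

def policy_distribution(state, theta):
    """Return the action probabilities pi_theta(. | s) for one state."""
    hidden = np.tanh(state @ theta["W1"] + theta["b1"])
    logits = hidden @ theta["W2"] + theta["b2"]
    exp = np.exp(logits - logits.max())      # numerically stable softmax
    return exp / exp.sum()

def sample_action(state, theta):
    """a_t ~ pi_theta(. | s_t): sample an action from the policy."""
    probs = policy_distribution(state, theta)
    return rng.choice(len(probs), p=probs)

state = np.array([0.03, 0.01, -0.02, 0.04])  # e.g. a CartPole-like state
print(sample_action(state, theta))           # 0 or 1
```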

https://spinningup.openai.com/en/latest/spinningup/rl_intro.html

4. Trajectories
  – A trajectory τ is a sequence of states and actions in the world:

        $\tau = (s_0, a_0, s_1, a_1, \dots)$

  – State transitions (what happens to the world between the state at time t ($s_t$) and the state at time t+1 ($s_{t+1}$)) are governed by the natural laws of the environment, and depend only on the most recent action $a_t$. They can be either deterministic,

        $s_{t+1} = f(s_t, a_t)$

    or stochastic,

        $s_{t+1} \sim P(\cdot \mid s_t, a_t)$

  – Actions come from the agent according to its policy.
  – Trajectories are also frequently called episodes or rollouts (see the rollout sketch below).
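[Illustration] A sketch (not from the slides) of collecting one trajectory (rollout) in Gym with a random policy. It assumes the classic Gym API in which `reset()` returns the observation and `step()` returns a 4-tuple.

```python
import gym

# Sketch: collect one trajectory tau = (s0, a0, s1, a1, ...) with a random policy.
env = gym.make("CartPole-v1")

states, actions, rewards = [], [], []
state, done = env.reset(), False
while not done:
    action = env.action_space.sample()                  # a_t from a (random) policy
    next_state, reward, done, info = env.step(action)   # environment transition
    states.append(state)
    actions.append(action)
    rewards.append(reward)
    state = next_state

print(f"episode length: {len(actions)}, total reward: {sum(rewards)}")
```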

https://spinningup.openai.com/en/latest/spinningup/rl_intro.html

5. Reward and Return
  – The reward function R is critically important in RL.
  – It takes the current state $s_t$, the action $a_t$ taken in that state, and the next state $s_{t+1}$ where the environment placed the agent after that action, and returns a reward $r_t$ (for just that one step; the value is usually a real number):

        $r_t = R(s_t, a_t, s_{t+1})$

    But sometimes simplified notations are used:  $r_t = R(s_t, a_t)$  or  $r_t = R(s_t)$
  – Then we can define the return: the cumulative reward over a whole trajectory τ -- the total reward the agent receives from (or at the end of) the trajectory:

        $R(\tau) = \sum_{t=0}^{T-1} r_t = \sum_{t=0}^{T-1} R(s_t, a_t, s_{t+1})$

https://spinningup.openai.com/en/latest/spinningup/rl_intro.html

  – We often apply a discount factor γ (gamma; between 0 and 1) to rewards obtained later in the trajectory (especially when trajectories are infinite):

        $R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t = \gamma^0 r_0 + \gamma^1 r_1 + \gamma^2 r_2 + \cdots$
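[Illustration] A small worked example (not from the slides) computing the finite-horizon and the discounted return from a list of per-step rewards. The reward values and γ = 0.99 are made up for the example.

```python
# Computing the return R(tau) from per-step rewards r_0, r_1, ..., r_{T-1}.
rewards = [1.0, 0.0, 0.5, 1.0]   # made-up rewards of a 4-step trajectory
gamma = 0.99                     # discount factor, between 0 and 1

undiscounted_return = sum(rewards)                                    # sum of r_t
discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))  # sum of gamma^t * r_t

print(undiscounted_return)   # 2.5
print(discounted_return)     # 1.0 + 0 + 0.99^2 * 0.5 + 0.99^3 * 1.0 ≈ 2.46
```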

  – But the ultimate goal of the agent is to find an optimal policy that maximizes the expected return (over all trajectories). To wit, assuming a stochastic policy and stochastic environment transitions, we first define the probability distributions.
  – The probability of a T-step trajectory is (a numeric example follows below):

        $P(\tau \mid \pi) = \rho_0(s_0) \prod_{t=0}^{T-1} P(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t)$

    where $\rho_0(s_0)$ is the initial state probability, $P(s_{t+1} \mid s_t, a_t)$ is the state transition probability of the environment, and $\pi(a_t \mid s_t)$ is the action probability of the agent.
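[Illustration] A numeric example (not from the slides) evaluating this product for a tiny made-up two-state, two-action MDP; the distributions ρ0, P, and π below are invented for the example.

```python
import numpy as np

# A made-up 2-state, 2-action MDP, just to evaluate P(tau | pi) numerically.
rho0 = np.array([0.8, 0.2])                 # initial state distribution rho_0(s)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],     # P[s, a, s'] = P(s' | s, a)
              [[0.5, 0.5], [0.1, 0.9]]])
pi = np.array([[0.7, 0.3],                  # pi[s, a] = pi(a | s)
               [0.4, 0.6]])

# A short trajectory tau = (s0, a0, s1, a1, s2).
states, actions = [0, 1, 1], [1, 0]

prob = rho0[states[0]]
for t in range(len(actions)):
    s, a, s_next = states[t], actions[t], states[t + 1]
    prob *= pi[s, a] * P[s, a, s_next]      # pi(a_t | s_t) * P(s_{t+1} | s_t, a_t)

print(prob)   # 0.8 * (0.3 * 0.8) * (0.4 * 0.5) = 0.0384
```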

https://spinningup.openai.com/en/latest/spinningup/rl_intro.html

  – Then the expected return, denoted J(π), is defined as (a Monte Carlo sketch of estimating it follows below):

        $J(\pi) = \int_{\tau} P(\tau \mid \pi)\, R(\tau) = \mathbb{E}_{\tau \sim \pi}[R(\tau)]$

    i.e., the return R(τ) of each trajectory, weighted by the probability of that trajectory under the given policy.
  – The central optimization problem in RL can then be expressed as

        $\pi^{*} = \arg\max_{\pi} J(\pi)$

    where $\pi^{*}$ is the optimal policy.
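[Illustration] A sketch (not from the slides) of the standard Monte Carlo estimate of J(π): average the returns of trajectories sampled by running the policy. The random policy, `CartPole-v1`, the classic 4-tuple Gym API, and the 100-episode budget are all assumptions of the example.

```python
import gym
import numpy as np

# Monte Carlo estimate of J(pi) = E_{tau ~ pi}[R(tau)]: average sampled returns.
env = gym.make("CartPole-v1")

def sample_return(env):
    """Roll out one trajectory under a random policy and return R(tau)."""
    env.reset()
    total, done = 0.0, False
    while not done:
        _, reward, done, _ = env.step(env.action_space.sample())
        total += reward
    return total

returns = [sample_return(env) for _ in range(100)]
print(np.mean(returns))   # estimated J(pi) of the random policy (roughly 20 on CartPole)
```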

https://spinningup.openai.com/en/latest/spinningup/rl_intro.html

6. Value Functions
  – Value functions are used in almost every RL algorithm.
  – A value function returns the expected return if you start from a given state (V(s)) or from a given state-action pair (Q(s,a)).
  – There are four main functions to note (a sketch estimating the first two follows after this list):

    1. On-policy Value Function -- returns the expected return if you start from the given state s and always act according to the given/current policy π:

        $V^{\pi}(s) = \mathbb{E}_{\tau \sim \pi}[R(\tau) \mid s_0 = s]$

    2. On-policy Action-Value Function -- returns the expected return if you start from the given state s, take an arbitrary action a, then forever after act according to policy π:

        $Q^{\pi}(s, a) = \mathbb{E}_{\tau \sim \pi}[R(\tau) \mid s_0 = s, a_0 = a]$

https://spinningup.openai.com/en/latest/spinningup/rl_intro.html

    3. Optimal Value Function -- returns the expected return if you start from the given state s and always act according to the optimal policy in the environment, π*:

        $V^{*}(s) = \max_{\pi} \mathbb{E}_{\tau \sim \pi}[R(\tau) \mid s_0 = s]$

    4. Optimal Action-Value Function -- returns the expected return if you start from the given state s, take an arbitrary action a, then forever after act according to the optimal policy π*:

        $Q^{*}(s, a) = \max_{\pi} \mathbb{E}_{\tau \sim \pi}[R(\tau) \mid s_0 = s, a_0 = a]$
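[Illustration] A rough sketch (not from the slides) of estimating $V^{\pi}(s)$ and $Q^{\pi}(s, a)$ by Monte Carlo rollouts in a small simulated MDP. The 2-state MDP, the uniform-random policy, and all constants below are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up 2-state, 2-action MDP; the policy pi is uniform random.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s'] = P(s' | s, a)
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[0.0, 1.0],                 # R[s, a] = reward for action a in state s
              [2.0, 0.0]])
gamma, horizon, n_rollouts = 0.9, 50, 2000

def rollout_return(s, a=None):
    """Discounted return of one rollout from state s (optionally forcing a_0 = a)."""
    total = 0.0
    for t in range(horizon):
        action = a if (t == 0 and a is not None) else rng.integers(2)  # pi: random
        total += (gamma ** t) * R[s, action]
        s = rng.choice(2, p=P[s, action])
    return total

V0  = np.mean([rollout_return(0)      for _ in range(n_rollouts)])   # estimate of V^pi(0)
Q01 = np.mean([rollout_return(0, a=1) for _ in range(n_rollouts)])   # estimate of Q^pi(0, 1)
print(V0, Q01)
```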

------
* Note on the connection between the V and Q functions:

        $V^{\pi}(s) = \mathbb{E}_{a \sim \pi}[Q^{\pi}(s, a)]$
        $V^{*}(s) = \max_{a} Q^{*}(s, a)$

https://spinningup.openai.com/en/latest/spinningup/rl_intro.html

7. Bellman Equations
  – All four value functions obey special self-consistency equations called Bellman equations.
  – Their basic idea is that the value of the starting state is "the reward you expect to get from being there, plus the value of wherever you land next."

  • Bellman equations for the on-policy value functions:

        $V^{\pi}(s) = \mathbb{E}_{a \sim \pi,\, s' \sim P}[r(s, a) + \gamma V^{\pi}(s')]$
        $Q^{\pi}(s, a) = \mathbb{E}_{s' \sim P}[r(s, a) + \gamma\, \mathbb{E}_{a' \sim \pi}[Q^{\pi}(s', a')]]$

  • Bellman equations for the optimal value functions (a value-iteration sketch based on these follows after the next subsection):

        $V^{*}(s) = \max_{a} \mathbb{E}_{s' \sim P}[r(s, a) + \gamma V^{*}(s')]$
        $Q^{*}(s, a) = \mathbb{E}_{s' \sim P}[r(s, a) + \gamma \max_{a'} Q^{*}(s', a')]$

https://spinningup.openai.com/en/latest/spinningup/rl_intro.html

8. Advantage Functions
  – Sometimes in RL, we don't need to describe how good an action is in an absolute sense, but only how much better it is than the others on average. That is to say, we want to know the relative advantage of that action.
  – The advantage function $A^{\pi}(s, a)$ corresponding to a policy π describes how much better it is to take a specific action a in state s than to randomly select an action according to $\pi(\cdot \mid s)$:

        $A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$
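[Illustration] A sketch (not from the slides) of turning the optimal Bellman equation into a value-iteration update on the same made-up tabular MDP used in the earlier sketches; the tables and iteration count are invented for the example.

```python
import numpy as np

# Value iteration: repeatedly apply the optimal Bellman backup
#   Q*(s,a) = E_{s'}[ r(s,a) + gamma * max_{a'} Q*(s',a') ]
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s'] = P(s' | s, a)
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[0.0, 1.0],                 # R[s, a]
              [2.0, 0.0]])
gamma = 0.9

Q = np.zeros((2, 2))
for _ in range(500):          # iterate the backup until (approximately) a fixed point
    V = Q.max(axis=1)         # V*(s) = max_a Q*(s, a)
    Q = R + gamma * P @ V     # expectation over next states s'

print(Q)                      # approximately Q*
print(Q.max(axis=1))          # approximately V*
# For any Q and V, the advantage is A(s, a) = Q[s, a] - V[s].
```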

https://spinningup.openai.com/en/latest/spinningup/rl_intro.html

Kinds of RL Algorithms

• There are many algorithms in RL. Here is an example taxonomy (see the algorithm-taxonomy diagram on the Spinning Up page cited below).

https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html

Model-free vs. Model-based

• Model-based RL algorithms are applied when the agent has access to a model of the environment -- a function that predicts state transitions and rewards.
• In that case, the agent can plan by thinking ahead to obtain (or learn) an optimal policy without actually acting out the plan (a toy lookahead sketch follows below).
• A particularly well-known example of model-based RL is AlphaZero, a Google DeepMind project applied to chess, shogi, and Go (whereas its predecessor AlphaGo Zero played only Go).
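[Illustration] A toy sketch (not from the slides) of "planning by thinking ahead" with a known model: a depth-limited lookahead over a made-up 2-state, 2-action MDP. This only illustrates planning with a model; it is not AlphaZero's search procedure.

```python
import numpy as np

# Known model of a tiny MDP: transition probabilities and rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s'] = P(s' | s, a)
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[0.0, 1.0],                 # R[s, a]
              [2.0, 0.0]])
gamma = 0.9

def plan_value(s, depth):
    """Best expected return from state s when looking `depth` steps ahead."""
    if depth == 0:
        return 0.0
    return max(R[s, a] + gamma * sum(P[s, a, s2] * plan_value(s2, depth - 1)
                                     for s2 in range(2))
               for a in range(2))

def plan_action(s, depth=5):
    """Pick the action with the best lookahead value -- no acting in the world needed."""
    q = [R[s, a] + gamma * sum(P[s, a, s2] * plan_value(s2, depth - 1)
                               for s2 in range(2))
         for a in range(2)]
    return int(np.argmax(q))

print(plan_action(0), plan_action(1))   # planned actions for states 0 and 1
```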

https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html

[Note] Training in AlphaGo Zero (from Wikipedia)

"AlphaGo Zero's neural network was trained using TensorFlow, with 64 GPU workers and 19 CPU parameter servers. Only four TPUs were used for inference. The neural network initially knew nothing about Go beyond the rules. Unlike earlier versions of AlphaGo, Zero only perceived the board's stones, rather than having some rare human-programmed edge cases to help recognize unusual Go board positions. The AI engaged in reinforcement learning, playing against itself until it could anticipate its own moves and how those moves would affect the game's outcome.[8] In the first three days AlphaGo Zero played 4.9 million games against itself in quick succession.[9] It appeared to develop the skills required to beat top humans within just a few days, whereas the earlier AlphaGo took months of training to achieve the same level.[10]"

https://en.wikipedia.org/wiki/AlphaGo_Zero

• But the main downside of model-based RL algorithms is that a ground-truth model of the environment is usually NOT available to the agent.
------
• Algorithms that do not use a model are called model-free methods. The agent learns (policies, value functions, etc.) through experience -- by acting in the environment.
• Model-free methods are more popular than model-based methods, and they are also easier to implement and tune!
• There are two main approaches to representing and training agents with model-free RL:
  1. Policy Optimization -- learns the policy (a function/model) directly by optimizing the model parameters.
  2. Q-Learning -- learns value functions.

https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html

Policy Optimization

• Methods in this family represent a policy explicitly as $\pi_\theta(a \mid s)$. They optimize the parameters θ directly by gradient ascent on the performance objective $J(\pi_\theta)$ (a minimal sketch follows below).
  Note: In deep RL, the policy is a deep neural network.
• This optimization is almost always performed on-policy, which means that each update only uses data collected while acting according to the most recent version of the policy -- i.e., the parameters are NOT updated while an epoch or batch of data is being collected.
• The primary strength of policy optimization methods is that they are principled, in the sense that you directly optimize for the thing you want. This tends to make them stable and reliable.
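[Illustration] A minimal REINFORCE-style sketch (not from the slides) of one gradient ascent step on an estimate of $J(\pi_\theta)$, using a linear softmax policy on CartPole. The classic 4-tuple Gym API, the linear parameterization, the learning rate, and the episode count are all illustrative assumptions.

```python
import gym
import numpy as np

# One policy-gradient update: theta <- theta + alpha * grad of estimated J(pi_theta).
env = gym.make("CartPole-v1")
theta = np.zeros((4, 2))            # policy parameters: state (4) -> action logits (2)
alpha, n_episodes = 0.01, 10
rng = np.random.default_rng(0)

def action_probs(s):
    logits = s @ theta
    e = np.exp(logits - logits.max())
    return e / e.sum()

grad, returns = np.zeros_like(theta), []
for _ in range(n_episodes):
    s, done = env.reset(), False
    grads_log_pi, rewards = [], []
    while not done:
        p = action_probs(s)
        a = rng.choice(2, p=p)
        # grad of log pi_theta(a|s) w.r.t. theta for a linear softmax policy:
        grads_log_pi.append(np.outer(s, np.eye(2)[a] - p))
        s, r, done, _ = env.step(a)
        rewards.append(r)
    R = sum(rewards)                          # return R(tau) of this trajectory
    returns.append(R)
    grad += sum(grads_log_pi) * R             # sum_t grad log pi(a_t|s_t) * R(tau)

theta += alpha * grad / n_episodes            # gradient ASCENT on estimated J
print("average return of the sampled batch:", np.mean(returns))
```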

https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html

• Some example policy optimization methods:
  – Policy Gradient computes the gradient of the policy's performance and takes gradient ascent steps along it to improve the policy.
  – A2C (Advantage Actor Critic) / A3C (Asynchronous Advantage Actor Critic) perform gradient ascent to directly maximize performance.
  – PPO (Proximal Policy Optimization) indirectly maximizes performance by maximizing a surrogate objective function.

https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html

Q-Learning

• Methods in this family learn an approximator $Q_\theta(s, a)$ for the optimal action-value function $Q^{*}(s, a)$.
• Typically they use an objective function based on the Bellman equation (a tabular sketch follows below).
• This optimization is almost always performed off-policy, which means that each update can use data collected at any point during training, regardless of how the agent was choosing to explore the environment when the data was obtained -- i.e., the parameters CAN be updated while an epoch or batch of data is being collected.
• Q-learning methods only indirectly optimize for agent performance, by training $Q_\theta$ to satisfy a self-consistency equation. There are many failure modes for this kind of learning, so it tends to be less stable.
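[Illustration] A tabular sketch (not from the slides) of off-policy Q-learning: updates move $Q(s,a)$ toward the Bellman-equation target $r + \gamma \max_{a'} Q(s', a')$, using data gathered by an epsilon-greedy behavior policy. The MDP is the same made-up 2-state example used above; the hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tabular Q-learning: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s'] = P(s' | s, a)
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[0.0, 1.0],                 # R[s, a]
              [2.0, 0.0]])
gamma, alpha, epsilon = 0.9, 0.1, 0.2

Q = np.zeros((2, 2))
s = 0
for _ in range(20000):
    # Behavior (epsilon-greedy) policy may differ from the greedy target policy: off-policy.
    a = rng.integers(2) if rng.random() < epsilon else int(Q[s].argmax())
    s_next = rng.choice(2, p=P[s, a])
    target = R[s, a] + gamma * Q[s_next].max()     # Bellman-equation-based target
    Q[s, a] += alpha * (target - Q[s, a])
    s = s_next

print(Q)   # approaches the Q* computed by the value-iteration sketch above
```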

https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html

• Some example Q-learning methods:
  – DQN (Deep Q-Network) is a classic method that substantially launched the field of deep RL.
  – C51 is a variant of DQN that learns a distribution over returns whose expectation is Q*.
• Some hybrid methods:
  – DDPG (Deep Deterministic Policy Gradient) is an algorithm that concurrently learns a deterministic policy and a Q-function, using each to improve the other.

https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html

Reinforcement Learning Resources

• Tools
  – OpenAI Gym -- "Gym is a toolkit for developing and comparing reinforcement learning algorithms."
  – TF-Agents (Google) -- "A reliable, scalable and easy to use Reinforcement Learning library for TensorFlow."
  – Many code examples on the internet, for example, the implementations from the free course "Deep Reinforcement Learning with Tensorflow" (by Thomas Simonini).

• Books, papers
  – Sutton, R. and Barto, A. (2018). "Reinforcement Learning: An Introduction" (2nd ed.). The MIT Press. http://incompleteideas.net/book/the-book-2nd.html
  – Francois-Lavet, V.; Henderson, P.; Islam, R.; Bellemare, M. G.; Pineau, J. (2018). "An Introduction to Deep Reinforcement Learning". Foundations and Trends in Machine Learning, 11(3-4): 219-354. arXiv:1811.12560. doi:10.1561/2200000071. ISSN 1935-8237. https://arxiv.org/abs/1811.12560
  – Arulkumaran, K.; Deisenroth, M. P.; Brundage, M.; Bharath, A. A. (2017). "Deep Reinforcement Learning: A Brief Survey". IEEE Signal Processing Magazine, 34(6): 26-38. arXiv:1708.05866. doi:10.1109/MSP.2017.2743240. ISSN 1053-5888. https://arxiv.org/pdf/1708.05866.pdf
  – Mnih, V.; et al. (2013). "Playing Atari with Deep Reinforcement Learning". NIPS Deep Learning Workshop 2013. https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf
  – And a huge amount more...
