
Winter 2020 CSC 594 Topics in AI: Advanced Deep Learning

5. Deep Reinforcement Learning (1): Key Concepts & Algorithms

(Most content adapted from OpenAI ‘Spinning Up’)

Noriko Tomuro

Reinforcement Learning (RL)

• Reinforcement Learning (RL) is a type of Machine Learning where an agent learns to achieve a goal by interacting with the environment -- through trial and error.
• RL is one of the three basic machine learning paradigms, alongside supervised learning and unsupervised learning.
• The purpose of RL is to learn an optimal policy -- one that maximizes the return over the sequence of the agent's actions.

https://en.wikipedia.org/wiki/Reinforcement_learning

• RL has recently enjoyed a wide variety of successes, e.g.
  – Robot control in simulation as well as in the real world
  – Games, such as
      • Go, with AlphaGo (by Google DeepMind)
      • Atari video games

https://en.wikipedia.org/wiki/Reinforcement_learning

Deep Reinforcement Learning (DRL)

• A policy is essentially a function that maps the agent's state (or observation) to an action. Deep Reinforcement Learning (DRL) uses deep neural networks to represent that function (and other components, such as value functions).

https://spinningup.openai.com/en/latest/spinningup/rl_intro.html

Some Key Concepts and Terminology

1. States and Observations
  – A state s is a complete description of the state of the world. For now, we can think of the state as belonging to the environment.
  – An observation o is a partial description of a state, which may omit information.
  – A state may be fully or partially observable to the agent. If it is only partially observable, the agent forms an internal state (or state estimate).

https://spinningup.openai.com/en/latest/spinningup/rl_intro.html

• States and Observations (cont.)
  – In deep RL, we almost always represent states and observations by a real-valued vector, matrix, or higher-order tensor.
  – For example, a state in the CartPole environment is represented by four features (see the sketch below):
      • Cart position
      • Cart velocity
      • Pole angle
      • Pole angular velocity
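[Illustration] A minimal sketch (not from the slides) of inspecting the CartPole state vector with OpenAI Gym. It assumes the classic `gym` package and the `CartPole-v1` environment id; newer `gymnasium` releases return `(obs, info)` from `reset()` instead.

```python
import gym

# Minimal sketch: the CartPole state/observation is a length-4 real-valued vector.
env = gym.make("CartPole-v1")

obs = env.reset()
# obs = [cart position, cart velocity, pole angle, pole angular velocity]
print(obs)                    # e.g. [ 0.03  0.01 -0.02  0.04]
print(env.observation_space)  # a 4-dimensional continuous (Box) space
```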

https://spinningup.openai.com/en/latest/spinningup/rl_intro.html

2. Action Spaces
  – Different environments allow different kinds of actions. The set of all valid actions in a given environment is often called the action space.
  – Some environments, like Atari and Go, have discrete action spaces, where only a finite number of moves are available to the agent.
  – Other environments, like those where the agent controls a robot in the physical world, have continuous action spaces. In continuous spaces, actions are real-valued vectors (a sketch contrasting the two kinds follows below).
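[Illustration] A sketch (not from the slides) contrasting a discrete and a continuous action space in Gym. The environment ids `CartPole-v1` and `Pendulum-v1` are assumptions based on a standard Gym installation.

```python
import gym

# Discrete action space: a finite set of moves (here 2: push cart left / right).
discrete_env = gym.make("CartPole-v1")
print(discrete_env.action_space)           # Discrete(2)
print(discrete_env.action_space.sample())  # e.g. 0 or 1

# Continuous action space: actions are real-valued vectors (here a 1-D torque).
continuous_env = gym.make("Pendulum-v1")
print(continuous_env.action_space)           # Box(-2.0, 2.0, (1,), float32)
print(continuous_env.action_space.sample())  # e.g. [0.73]
```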

https://spinningup.openai.com/en/latest/spinningup/rl_intro.html

3. Policies
  – A policy is a rule used by an agent to decide what actions to take. It can be deterministic, in which case it is usually denoted by µ:

        $a_t = \mu(s_t)$

    or it may be stochastic, usually denoted by π:

        $a_t \sim \pi(\cdot \mid s_t)$

    [Note: ~ represents 'sampling' from the stochastic process.]
  – In deep RL, we deal with parameterized policies: policies whose outputs depend on a set of parameters θ (e.g. the weights and biases of a neural network); a toy sketch follows below:

        $a_t = \mu_\theta(s_t)$        $a_t \sim \pi_\theta(\cdot \mid s_t)$
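[Illustration] A toy sketch (not from the slides) of a parameterized stochastic policy $\pi_\theta(\cdot \mid s)$: a one-hidden-layer NumPy network whose weights and biases are the parameters θ, producing a softmax distribution over two discrete actions. All sizes and values here are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Parameters theta of a tiny stochastic policy pi_theta(a | s):
# 4-dimensional state -> hidden layer -> probabilities over 2 discrete actions.
theta = {
    "W1": rng.normal(scale=0.1, size=(4, 32)), "b1": np.zeros(32),
    "W2": rng.normal(scale=0.1, size=(32, 2)), "b2": np.zeros(2),
}

def policy_distribution(state, theta):
    """Return the action probabilities pi_theta(. | s) for one state."""
    hidden = np.tanh(state @ theta["W1"] + theta["b1"])
    logits = hidden @ theta["W2"] + theta["b2"]
    exp = np.exp(logits - logits.max())      # numerically stable softmax
    return exp / exp.sum()

def sample_action(state, theta):
    """a_t ~ pi_theta(. | s_t): sample an action from the policy."""
    probs = policy_distribution(state, theta)
    return rng.choice(len(probs), p=probs)

state = np.array([0.03, 0.01, -0.02, 0.04])  # e.g. a CartPole-like state
print(sample_action(state, theta))           # 0 or 1
```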

https://spinningup.openai.com/en/latest/spinningup/rl_intro.html

4. Trajectories
  – A trajectory τ is a sequence of states and actions in the world:

        $\tau = (s_0, a_0, s_1, a_1, \dots)$

  – State transitions (what happens to the world between the state at time t ($s_t$) and the state at time t+1 ($s_{t+1}$)) are governed by the natural laws of the environment, and depend only on the most recent action $a_t$. They can be either deterministic,

        $s_{t+1} = f(s_t, a_t)$

    or stochastic,

        $s_{t+1} \sim P(\cdot \mid s_t, a_t)$

  – Actions come from the agent according to its policy.
  – Trajectories are also frequently called episodes or rollouts (see the rollout sketch below).
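[Illustration] A sketch (not from the slides) of collecting one trajectory (rollout) in Gym with a random policy. It assumes the classic Gym API in which `reset()` returns the observation and `step()` returns a 4-tuple.

```python
import gym

# Sketch: collect one trajectory tau = (s0, a0, s1, a1, ...) with a random policy.
env = gym.make("CartPole-v1")

states, actions, rewards = [], [], []
state, done = env.reset(), False
while not done:
    action = env.action_space.sample()                  # a_t from a (random) policy
    next_state, reward, done, info = env.step(action)   # environment transition
    states.append(state)
    actions.append(action)
    rewards.append(reward)
    state = next_state

print(f"episode length: {len(actions)}, total reward: {sum(rewards)}")
```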

https://spinningup.openai.com/en/latest/spinningup/rl_intro.html

5. Reward and Return
  – The reward function R is critically important in RL.
  – It takes the current state $s_t$, the action $a_t$ taken in that state, and the next state $s_{t+1}$ where the environment placed the agent after that action, and returns a reward $r_t$ (for just that one step; the value is usually a real number):

        $r_t = R(s_t, a_t, s_{t+1})$

    But sometimes simplified notations are used:  $r_t = R(s_t, a_t)$  or  $r_t = R(s_t)$
  – Then we can define the return: the cumulative reward over a whole trajectory τ -- the total reward the agent receives from (or at the end of) the trajectory:

        $R(\tau) = \sum_{t=0}^{T-1} r_t = \sum_{t=0}^{T-1} R(s_t, a_t, s_{t+1})$

https://spinningup.openai.com/en/latest/spinningup/rl_intro.html

  – We often apply a discount factor γ (gamma; between 0 and 1) to rewards obtained later in the trajectory (especially when trajectories are infinite):

        $R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t = \gamma^0 r_0 + \gamma^1 r_1 + \gamma^2 r_2 + \cdots$
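[Illustration] A small worked example (not from the slides) computing the finite-horizon and the discounted return from a list of per-step rewards. The reward values and γ = 0.99 are made up for the example.

```python
# Computing the return R(tau) from per-step rewards r_0, r_1, ..., r_{T-1}.
rewards = [1.0, 0.0, 0.5, 1.0]   # made-up rewards of a 4-step trajectory
gamma = 0.99                     # discount factor, between 0 and 1

undiscounted_return = sum(rewards)                                    # sum of r_t
discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))  # sum of gamma^t * r_t

print(undiscounted_return)   # 2.5
print(discounted_return)     # 1.0 + 0 + 0.99^2 * 0.5 + 0.99^3 * 1.0 ≈ 2.46
```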

  – But the ultimate goal of the agent is to find an optimal policy that maximizes the expected return (over all trajectories). To wit, assuming a stochastic policy and stochastic environment transitions, we first define the probability distributions.
  – The probability of a T-step trajectory is (a numeric example follows below):

        $P(\tau \mid \pi) = \rho_0(s_0) \prod_{t=0}^{T-1} P(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t)$

    where $\rho_0(s_0)$ is the initial state probability, $P(s_{t+1} \mid s_t, a_t)$ is the state transition probability of the environment, and $\pi(a_t \mid s_t)$ is the action probability of the agent.
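[Illustration] A numeric example (not from the slides) evaluating this product for a tiny made-up two-state, two-action MDP; the distributions ρ0, P, and π below are invented for the example.

```python
import numpy as np

# A made-up 2-state, 2-action MDP, just to evaluate P(tau | pi) numerically.
rho0 = np.array([0.8, 0.2])                 # initial state distribution rho_0(s)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],     # P[s, a, s'] = P(s' | s, a)
              [[0.5, 0.5], [0.1, 0.9]]])
pi = np.array([[0.7, 0.3],                  # pi[s, a] = pi(a | s)
               [0.4, 0.6]])

# A short trajectory tau = (s0, a0, s1, a1, s2).
states, actions = [0, 1, 1], [1, 0]

prob = rho0[states[0]]
for t in range(len(actions)):
    s, a, s_next = states[t], actions[t], states[t + 1]
    prob *= pi[s, a] * P[s, a, s_next]      # pi(a_t | s_t) * P(s_{t+1} | s_t, a_t)

print(prob)   # 0.8 * (0.3 * 0.8) * (0.4 * 0.5) = 0.0384
```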

https://spinningup.openai.com/en/latest/spinningup/rl_intro.html

  – Then the expected return, denoted J(π), is defined as (a Monte Carlo sketch of estimating it follows below):

        $J(\pi) = \int_{\tau} P(\tau \mid \pi)\, R(\tau) = \mathbb{E}_{\tau \sim \pi}[R(\tau)]$

    i.e., the return R(τ) of each trajectory, weighted by the probability of that trajectory under the given policy.
  – The central optimization problem in RL can then be expressed as

        $\pi^{*} = \arg\max_{\pi} J(\pi)$

    where $\pi^{*}$ is the optimal policy.
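[Illustration] A sketch (not from the slides) of the standard Monte Carlo estimate of J(π): average the returns of trajectories sampled by running the policy. The random policy, `CartPole-v1`, the classic 4-tuple Gym API, and the 100-episode budget are all assumptions of the example.

```python
import gym
import numpy as np

# Monte Carlo estimate of J(pi) = E_{tau ~ pi}[R(tau)]: average sampled returns.
env = gym.make("CartPole-v1")

def sample_return(env):
    """Roll out one trajectory under a random policy and return R(tau)."""
    env.reset()
    total, done = 0.0, False
    while not done:
        _, reward, done, _ = env.step(env.action_space.sample())
        total += reward
    return total

returns = [sample_return(env) for _ in range(100)]
print(np.mean(returns))   # estimated J(pi) of the random policy (roughly 20 on CartPole)
```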

https://spinningup.openai.com/en/latest/spinningup/rl_intro.html

6. Value Functions
  – Value functions are used in almost every RL algorithm.
  – A value function returns the expected return if you start from a given state (V(s)) or from a given state-action pair (Q(s,a)).
  – There are four main functions to note (a sketch estimating the first two follows after this list):

    1. On-policy Value Function -- returns the expected return if you start from the given state s and always act according to the given/current policy π:

        $V^{\pi}(s) = \mathbb{E}_{\tau \sim \pi}[R(\tau) \mid s_0 = s]$

    2. On-policy Action-Value Function -- returns the expected return if you start from the given state s, take an arbitrary action a, then forever after act according to policy π:

        $Q^{\pi}(s, a) = \mathbb{E}_{\tau \sim \pi}[R(\tau) \mid s_0 = s, a_0 = a]$

https://spinningup.openai.com/en/latest/spinningup/rl_intro.html

    3. Optimal Value Function -- returns the expected return if you start from the given state s and always act according to the optimal policy in the environment, π*:

        $V^{*}(s) = \max_{\pi} \mathbb{E}_{\tau \sim \pi}[R(\tau) \mid s_0 = s]$

    4. Optimal Action-Value Function -- returns the expected return if you start from the given state s, take an arbitrary action a, then forever after act according to the optimal policy π*:

        $Q^{*}(s, a) = \max_{\pi} \mathbb{E}_{\tau \sim \pi}[R(\tau) \mid s_0 = s, a_0 = a]$
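[Illustration] A rough sketch (not from the slides) of estimating $V^{\pi}(s)$ and $Q^{\pi}(s, a)$ by Monte Carlo rollouts in a small simulated MDP. The 2-state MDP, the uniform-random policy, and all constants below are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up 2-state, 2-action MDP; the policy pi is uniform random.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s'] = P(s' | s, a)
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[0.0, 1.0],                 # R[s, a] = reward for action a in state s
              [2.0, 0.0]])
gamma, horizon, n_rollouts = 0.9, 50, 2000

def rollout_return(s, a=None):
    """Discounted return of one rollout from state s (optionally forcing a_0 = a)."""
    total = 0.0
    for t in range(horizon):
        action = a if (t == 0 and a is not None) else rng.integers(2)  # pi: random
        total += (gamma ** t) * R[s, action]
        s = rng.choice(2, p=P[s, action])
    return total

V0  = np.mean([rollout_return(0)      for _ in range(n_rollouts)])   # estimate of V^pi(0)
Q01 = np.mean([rollout_return(0, a=1) for _ in range(n_rollouts)])   # estimate of Q^pi(0, 1)
print(V0, Q01)
```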

------
* Note on the connection between the V and Q functions:

        $V^{\pi}(s) = \mathbb{E}_{a \sim \pi}[Q^{\pi}(s, a)]$
        $V^{*}(s) = \max_{a} Q^{*}(s, a)$

https://spinningup.openai.com/en/latest/spinningup/rl_intro.html

7. Bellman Equations
  – All four value functions obey special self-consistency equations called Bellman equations.
  – Their basic idea is that the value of the starting state is "the reward you expect to get from being there, plus the value of wherever you land next."

  • Bellman equations for the on-policy value functions:

        $V^{\pi}(s) = \mathbb{E}_{a \sim \pi,\, s' \sim P}[r(s, a) + \gamma V^{\pi}(s')]$
        $Q^{\pi}(s, a) = \mathbb{E}_{s' \sim P}[r(s, a) + \gamma\, \mathbb{E}_{a' \sim \pi}[Q^{\pi}(s', a')]]$

  • Bellman equations for the optimal value functions (a value-iteration sketch based on these follows after the next subsection):

        $V^{*}(s) = \max_{a} \mathbb{E}_{s' \sim P}[r(s, a) + \gamma V^{*}(s')]$
        $Q^{*}(s, a) = \mathbb{E}_{s' \sim P}[r(s, a) + \gamma \max_{a'} Q^{*}(s', a')]$

https://spinningup.openai.com/en/latest/spinningup/rl_intro.html

8. Advantage Functions
  – Sometimes in RL, we don't need to describe how good an action is in an absolute sense, but only how much better it is than the others on average. That is to say, we want to know the relative advantage of that action.
  – The advantage function $A^{\pi}(s, a)$ corresponding to a policy π describes how much better it is to take a specific action a in state s than to randomly select an action according to $\pi(\cdot \mid s)$:

        $A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$
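[Illustration] A sketch (not from the slides) of turning the optimal Bellman equation into a value-iteration update on the same made-up tabular MDP used in the earlier sketches; the tables and iteration count are invented for the example.

```python
import numpy as np

# Value iteration: repeatedly apply the optimal Bellman backup
#   Q*(s,a) = E_{s'}[ r(s,a) + gamma * max_{a'} Q*(s',a') ]
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s'] = P(s' | s, a)
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[0.0, 1.0],                 # R[s, a]
              [2.0, 0.0]])
gamma = 0.9

Q = np.zeros((2, 2))
for _ in range(500):          # iterate the backup until (approximately) a fixed point
    V = Q.max(axis=1)         # V*(s) = max_a Q*(s, a)
    Q = R + gamma * P @ V     # expectation over next states s'

print(Q)                      # approximately Q*
print(Q.max(axis=1))          # approximately V*
# For any Q and V, the advantage is A(s, a) = Q[s, a] - V[s].
```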

https://spinningup.openai.com/en/latest/spinningup/rl_intro.html

Kinds of RL Algorithms

• There are many algorithms in RL. Here is an example taxonomy (see the algorithm-taxonomy diagram on the Spinning Up page cited below).

https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html

Model-free vs. Model-based

• Model-based RL algorithms are applied when the agent has access to a model of the environment -- a function that predicts state transitions and rewards.
• In that case, the agent can plan by thinking ahead to obtain (or learn) an optimal policy without actually acting out the plan (a toy lookahead sketch follows below).
• A particularly well-known example of model-based RL is AlphaZero, a Google DeepMind project applied to chess, shogi, and Go (whereas its predecessor AlphaGo Zero played only Go).
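[Illustration] A toy sketch (not from the slides) of "planning by thinking ahead" with a known model: a depth-limited lookahead over a made-up 2-state, 2-action MDP. This only illustrates planning with a model; it is not AlphaZero's search procedure.

```python
import numpy as np

# Known model of a tiny MDP: transition probabilities and rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s'] = P(s' | s, a)
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[0.0, 1.0],                 # R[s, a]
              [2.0, 0.0]])
gamma = 0.9

def plan_value(s, depth):
    """Best expected return from state s when looking `depth` steps ahead."""
    if depth == 0:
        return 0.0
    return max(R[s, a] + gamma * sum(P[s, a, s2] * plan_value(s2, depth - 1)
                                     for s2 in range(2))
               for a in range(2))

def plan_action(s, depth=5):
    """Pick the action with the best lookahead value -- no acting in the world needed."""
    q = [R[s, a] + gamma * sum(P[s, a, s2] * plan_value(s2, depth - 1)
                               for s2 in range(2))
         for a in range(2)]
    return int(np.argmax(q))

print(plan_action(0), plan_action(1))   # planned actions for states 0 and 1
```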

https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html

[Note] Training in AlphaGo Zero (from Wikipedia)

"AlphaGo Zero's neural network was trained using TensorFlow, with 64 GPU workers and 19 CPU parameter servers. Only four TPUs were used for inference. The neural network initially knew nothing about Go beyond the rules. Unlike earlier versions of AlphaGo, Zero only perceived the board's stones, rather than having some rare human-programmed edge cases to help recognize unusual Go board positions. The AI engaged in reinforcement learning, playing against itself until it could anticipate its own moves and how those moves would affect the game's outcome.[8] In the first three days AlphaGo Zero played 4.9 million games against itself in quick succession.[9] It appeared to develop the skills required to beat top humans within just a few days, whereas the earlier AlphaGo took months of training to achieve the same level.[10]"

https://en.wikipedia.org/wiki/AlphaGo_Zero

• But the main downside of model-based RL algorithms is that a ground-truth model of the environment is usually NOT available to the agent.
------
• Algorithms that do not use a model are called model-free methods. The agent learns (policies, value functions, etc.) through experience -- by acting in the environment.
• Model-free methods are more popular than model-based methods, and they are also easier to implement and tune!
• There are two main approaches to representing and training agents with model-free RL:
  1. Policy Optimization -- learns the policy (a function/model) directly by optimizing the model parameters.
  2. Q-Learning -- learns value functions.

https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html

Policy Optimization

• Methods in this family represent a policy explicitly as $\pi_\theta(a \mid s)$. They optimize the parameters θ directly by gradient ascent on the performance objective $J(\pi_\theta)$ (a minimal sketch follows below).
  Note: In deep RL, the policy is a deep neural network.
• This optimization is almost always performed on-policy, which means that each update only uses data collected while acting according to the most recent version of the policy -- i.e., the parameters are NOT updated while an epoch or batch of data is being collected.
• The primary strength of policy optimization methods is that they are principled, in the sense that you directly optimize for the thing you want. This tends to make them stable and reliable.
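[Illustration] A minimal REINFORCE-style sketch (not from the slides) of one gradient ascent step on an estimate of $J(\pi_\theta)$, using a linear softmax policy on CartPole. The classic 4-tuple Gym API, the linear parameterization, the learning rate, and the episode count are all illustrative assumptions.

```python
import gym
import numpy as np

# One policy-gradient update: theta <- theta + alpha * grad of estimated J(pi_theta).
env = gym.make("CartPole-v1")
theta = np.zeros((4, 2))            # policy parameters: state (4) -> action logits (2)
alpha, n_episodes = 0.01, 10
rng = np.random.default_rng(0)

def action_probs(s):
    logits = s @ theta
    e = np.exp(logits - logits.max())
    return e / e.sum()

grad, returns = np.zeros_like(theta), []
for _ in range(n_episodes):
    s, done = env.reset(), False
    grads_log_pi, rewards = [], []
    while not done:
        p = action_probs(s)
        a = rng.choice(2, p=p)
        # grad of log pi_theta(a|s) w.r.t. theta for a linear softmax policy:
        grads_log_pi.append(np.outer(s, np.eye(2)[a] - p))
        s, r, done, _ = env.step(a)
        rewards.append(r)
    R = sum(rewards)                          # return R(tau) of this trajectory
    returns.append(R)
    grad += sum(grads_log_pi) * R             # sum_t grad log pi(a_t|s_t) * R(tau)

theta += alpha * grad / n_episodes            # gradient ASCENT on estimated J
print("average return of the sampled batch:", np.mean(returns))
```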

https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html

• Some example policy optimization methods:
  – Policy Gradient computes the gradient of the policy's performance and takes gradient ascent steps along it to improve the policy.
  – A2C (Advantage Actor Critic) / A3C (Asynchronous Advantage Actor Critic) perform gradient ascent to directly maximize performance.
  – PPO (Proximal Policy Optimization) indirectly maximizes performance by maximizing a surrogate objective function.

https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html

Q-Learning

• Methods in this family learn an approximator $Q_\theta(s, a)$ for the optimal action-value function $Q^{*}(s, a)$.
• Typically they use an objective function based on the Bellman equation (a tabular sketch follows below).
• This optimization is almost always performed off-policy, which means that each update can use data collected at any point during training, regardless of how the agent was choosing to explore the environment when the data was obtained -- i.e., the parameters CAN be updated while an epoch or batch of data is being collected.
• Q-learning methods only indirectly optimize for agent performance, by training $Q_\theta$ to satisfy a self-consistency equation. There are many failure modes for this kind of learning, so it tends to be less stable.
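[Illustration] A tabular sketch (not from the slides) of off-policy Q-learning: updates move $Q(s,a)$ toward the Bellman-equation target $r + \gamma \max_{a'} Q(s', a')$, using data gathered by an epsilon-greedy behavior policy. The MDP is the same made-up 2-state example used above; the hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tabular Q-learning: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s'] = P(s' | s, a)
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[0.0, 1.0],                 # R[s, a]
              [2.0, 0.0]])
gamma, alpha, epsilon = 0.9, 0.1, 0.2

Q = np.zeros((2, 2))
s = 0
for _ in range(20000):
    # Behavior (epsilon-greedy) policy may differ from the greedy target policy: off-policy.
    a = rng.integers(2) if rng.random() < epsilon else int(Q[s].argmax())
    s_next = rng.choice(2, p=P[s, a])
    target = R[s, a] + gamma * Q[s_next].max()     # Bellman-equation-based target
    Q[s, a] += alpha * (target - Q[s, a])
    s = s_next

print(Q)   # approaches the Q* computed by the value-iteration sketch above
```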

https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html

• Some example Q-learning methods:
  – DQN (Deep Q-Network) is a classic method that substantially launched the field of deep RL.
  – C51 is a variant of DQN that learns a distribution over returns whose expectation is Q*.
• Some hybrid methods:
  – DDPG (Deep Deterministic Policy Gradient) is an algorithm that concurrently learns a deterministic policy and a Q-function, using each to improve the other.

https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html

Reinforcement Learning Resources

• Tools
  – OpenAI Gym -- "Gym is a toolkit for developing and comparing reinforcement learning algorithms."
  – TF-Agents (Google) -- "A reliable, scalable and easy to use Reinforcement Learning library for TensorFlow."
  – Many code examples on the internet, for example, the implementations from the free course "Deep Reinforcement Learning with Tensorflow" (by Thomas Simonini).

• Books, papers
  – Sutton, R. and Barto, A. (2018). "Reinforcement Learning: An Introduction" (2nd ed.). The MIT Press. http://incompleteideas.net/book/the-book-2nd.html
  – Francois-Lavet, V.; Henderson, P.; Islam, R.; Bellemare, M. G.; Pineau, J. (2018). "An Introduction to Deep Reinforcement Learning". Foundations and Trends in Machine Learning, 11(3-4): 219-354. arXiv:1811.12560. doi:10.1561/2200000071. ISSN 1935-8237. https://arxiv.org/abs/1811.12560
  – Arulkumaran, K.; Deisenroth, M. P.; Brundage, M.; Bharath, A. A. (2017). "Deep Reinforcement Learning: A Brief Survey". IEEE Signal Processing Magazine, 34(6): 26-38. arXiv:1708.05866. doi:10.1109/MSP.2017.2743240. ISSN 1053-5888. https://arxiv.org/pdf/1708.05866.pdf
  – Mnih, V.; et al. (2013). "Playing Atari with Deep Reinforcement Learning". NIPS Deep Learning Workshop 2013. https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf
  – And a huge amount more...
