An Introduction to Reinforcement Learning and the AlphaZero AI
James Frost, Data Platform Director, Quorum

About the speaker
• James Frost
• Data Platform Director at Quorum, an Edinburgh-based IT consultancy
• Recently completed an MSc in Data Science at Dundee University
• Final-year project was to build a backgammon AI using techniques from DeepMind's AlphaGo; it reached human Grandmaster level

Session agenda
• An introduction to reinforcement learning concepts
• Monte-Carlo learning
• Neural networks as function approximators
• Issues with reinforcement learning and deep neural networks
• DeepMind and AlphaGo

What is reinforcement learning?

Types of Machine Learning
• Supervised learning: starts with a dataset of known examples; the engine then trains "by example" ("A" is for Apple… that is not an apple!)
• Unsupervised learning: starts with a dataset where the categories might not be known, and looks for patterns, similarities or clusters that may be of interest – for example, customer segmentation or fraud investigation
• Reinforcement learning: based on an agent interacting with the environment and getting feedback in the form of a reward mechanism

How do dogs learn?
"All training should be reward based. Giving your dog something they really like such as food, toys or praise when they show a particular behaviour means that they are more likely to do it again." – RSPCA website

Principles of reinforcement learning
[Diagram: the agent–environment loop – the agent chooses an action At, and the environment returns the next state St+1 and a reward Rt.]

Rewards
• A reward Rt is a scalar feedback signal
• It indicates the value of carrying out step t
• Examples: Kill John Connor (+10,000); getting to the destination (+100); falling over (-50); taking a step (-1); money won or lost – win (+1) or loss (-1), e.g. poker or the stock market

Agent
The agent generally has the following components:
• Model – the agent's representation of the environment
• Policy – how the agent behaves
• Value function – an estimate of how good a state or action is

Model
• The environment state is the environment's private representation, and is often not visible to the agent
• The model is the agent's representation of the environment state, built up through observation

Value Functions
"Almost all reinforcement learning algorithms involve estimating value functions that estimate how good it is for the agent to be in a given state… the most important component of almost all reinforcement learning algorithms we consider is a method for efficiently estimating values." (Sutton, 2017)

Value Functions / Policy – chess
A sample value function for chess might be the estimated chance of winning from that position.
Chess policy: from each state, calculate all legal moves and move to the successor state with the highest value function – or, from each state, pick the move with the highest action value. If the value function is accurate, this greedy policy is the optimal policy.

Principles of reinforcement learning
1. Accurately estimating the value function is critical for reinforcement learning.
But how do we do that?

Technique 1 – Monte-Carlo Learning
• Play a large number of games at random.
• Record how many times each state is seen, and how many games were won from that state (or action). This gives us the estimated value function.
• This is the evaluation step for the random policy (a minimal sketch of this idea follows below).
[Figure: tic-tac-toe action values and state values estimated from random play – roughly a 41% win rate from the illustrated position.]
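To make the Monte-Carlo idea concrete, here is a small, self-contained Python sketch (not from the talk – all function and variable names are invented for illustration). It plays tic-tac-toe entirely at random and estimates a state-value table as the average win/loss outcome, from X's point of view, of the games that passed through each state – first-visit Monte-Carlo evaluation of the random policy.

```python
import random
from collections import defaultdict

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X' or 'O' if someone has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def play_random_game():
    """Both players move at random. Returns the states seen after X's moves
    and the final reward from X's point of view (+1 win, -1 loss, 0 draw)."""
    board, player, seen = [" "] * 9, "X", []
    while True:
        moves = [i for i, cell in enumerate(board) if cell == " "]
        if not moves:
            return seen, 0.0                     # board full: draw
        board[random.choice(moves)] = player
        if player == "X":
            seen.append("".join(board))          # record states from X's side
        if winner(board):
            return seen, 1.0 if player == "X" else -1.0
        player = "O" if player == "X" else "X"

# First-visit Monte-Carlo evaluation of the random policy:
# value(s) is the average final reward of the random games that visited state s.
visits, totals = defaultdict(int), defaultdict(float)
for _ in range(50_000):
    states, reward = play_random_game()
    for s in set(states):
        visits[s] += 1
        totals[s] += reward

value = {s: totals[s] / visits[s] for s in visits}
print(f"{len(value)} states evaluated")
```

Acting greedily with respect to the resulting value table – always moving to the successor state with the highest estimated value – is exactly the policy-improvement step described on the next slides.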
Now use Monte-Carlo Control
• Now play another batch of games, but this time act "greedily" with respect to the value predicted by the previous policy.
• Again record how many times each state is seen, and how many games were won from that state.
[Figure: the tic-tac-toe values re-estimated under the greedy policy – now a 100% win rate from the illustrated position.]

Don't get too greedy…
• However, by only picking the best moves (greedy) we sometimes miss moves that might turn out to be better.
• So we need to introduce an element of exploration.
• ε-greedy learning works as follows:
  • With probability (1-ε), make a greedy move.
  • With probability ε, move at random.
• ε is often reduced as the number of episodes increases – this scheme is guaranteed to converge to the optimal policy.

Learning cycle
• Policy evaluation / policy improvement is the core concept of reinforcement learning.
• By acting greedily with respect to the value function we can create a new, improved policy.
• Iterating this process trends towards the optimal policy.
[Diagram: alternating policy evaluation and policy improvement, converging on the optimal policy.]

Problems with large state spaces
Unfortunately, for most useful problems we cannot store a value for every state:
• Chess has around 10^47 states
• Go has around 10^170 states
• How many states would it take to record every possible scenario for a driverless car or a Terminator robot?
So we need some form of function approximator.

Neural networks as value approximators
[Figure: a neural network that takes an encoded board position as input and outputs an estimated value.]

Monte-Carlo Learning and Neural Networks
• Same principles as tic-tac-toe.
• Play a number of games at random.
• Sample states (or state–action pairs) from those games, together with the reward they eventually led to, discounted by the number of steps taken to reach it.
• Feed these samples into the neural network for training.
• Now repeat the process, but instead of random play, use the neural network to predict the best moves, still picking some moves at random (ε-greedy).

Deep Neural Networks as Value Functions
Deep neural networks (DNNs) would appear to be a great candidate for a value function approximator. However, they can suffer from the following causes of instability:
• correlations present in the sequence of observations
• small updates to action-value estimates (Q) may significantly change the policy

Atari
• DeepMind paper from 2015.
• Taught a deep neural network to play Atari games, such as Breakout and Kung-Fu Master.
• Achieved above-human level in over half of the games – in some cases superhuman performance (e.g. Breakout).
• The network was trained using a technique based on Q-learning, but introduced two important concepts: experience replay and a target Q network.
Source: Human-level control through deep reinforcement learning. Nature, 518. https://doi.org/10.1038/nature14236

Experience Replay and the target Q network
[Diagram: the agent interacts with the environment and stores each transition (St, At, Rt, St+1) in a replay buffer; a second copy of the network learns the new policy from random samples drawn from the buffer.]
A minimal sketch of both mechanisms follows below.
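The two stabilising ideas from the Atari work – experience replay and a separate target network – can be sketched in a few lines of Python. This is an illustrative outline only (the class and function names are invented here, and the neural network itself is abstracted away as a callable), not DeepMind's implementation:

```python
import random
from collections import deque

class ReplayBuffer:
    """Store (state, action, reward, next_state, done) transitions and hand back
    random mini-batches, which breaks the correlation between consecutive
    observations that destabilises training."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def q_targets(batch, target_q, gamma=0.99):
    """Compute DQN-style regression targets r + gamma * max_a Q_target(s', a).
    Here `target_q` stands in for the frozen target network: any callable
    mapping a state to a dict of action values."""
    targets = []
    for state, action, reward, next_state, done in batch:
        bootstrap = 0.0 if done else gamma * max(target_q(next_state).values())
        targets.append((state, action, reward + bootstrap))
    return targets
```

In a full DQN, the online network is fitted by gradient descent to mini-batches of these targets sampled from the buffer, and its weights are only copied into the target network every N steps, so the agent is not chasing a target that moves with every update.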
Effect of Target Q and Experience Replay

AlphaGo
• Initially trained on a database of expert human games, then by self-play.
• After months of training it beat Lee Sedol.
• AlphaGo Zero was then trained from completely random play, with no human data.
• After 36 hours of training, AlphaGo Zero had surpassed AlphaGo Lee, and it went on to beat that version 100-0.
"Each time you put knowledge into a system you are actually handicapping it." – David Silver (Silver, 2015)

AlphaGo Zero
DeepMind's AlphaGo Zero project applied the following principles:
• only the raw board position is used as input features
• a simple Monte-Carlo Tree Search (MCTS) is used to evaluate positions and sample moves
• a residual neural network architecture
• a dual-headed network (a shared body with separate policy and value heads)

Monte-Carlo tree search
[Figure: the MCTS cycle of selection, expansion/evaluation and back-up used during self-play. Source: Mastering the game of Go without human knowledge. https://doi.org/10.1038/nature24270]

AlphaZero Success
• Within 4 hours it had mastered chess from first principles, beating the world's then-strongest chess engine.
• Within 2 hours it had beaten Elmo at shogi.
• (Running on 5,000 TPUs.)

What is most important? Data OR Algorithm

Reinforcement learning concepts summary
1. Value estimation is critical to the success of a reinforcement learning algorithm.
2. Monte-Carlo learning is a relatively simple technique to get started with and can be applied to a wide range of problems.
3. Balancing exploration against exploitation is critical.
4. Deep neural networks can be unstable with reinforcement learning; deep Q-networks with experience replay can help stabilise this.

Where next
• The reinforcement learning community is now focusing on games with imperfect information and much deeper strategies:
  • No-limit Texas Hold'em poker – DeepStack
  • StarCraft II – AlphaStar
  • Dota 2 – OpenAI Five
  • OpenSpiel
• Ultimately the aim is to apply deep reinforcement learning to real-world scenarios:
  • Energy efficiency
  • Self-driving cars
  • Protein folding
  • Medical diagnosis
  • Materials research
  • Artificial General Intelligence

References / Useful links
• Netflix AlphaGo documentary
• Silver, D. (2015) UCL course on RL
• Sutton, R. S. and Barto, A. G. (2017) Reinforcement Learning: An Introduction
• OpenSpiel – https://github.com/deepmind/open_spiel
• DeepMind (2015) Human-level control through deep reinforcement learning
• DeepMind (2017) Mastering the game of Go without human knowledge
• DeepMind (2018) AlphaZero: Shedding new light on the grand games of chess, shogi and Go, deepmind.com

Thank You
Quorum Network Resources Ltd
18 Greenside Lane, Edinburgh, EH1 3AH
www.qnrl.com | [email protected]
Reg. No. SC 196645, Registered Office: 18 Greenside Lane, Edinburgh, EH1 3AH