PLANNING-BASED RL: ALPHAGO, ALPHAZERO AND MUZERO

GO

▸ ~10^170 board positions

▸ ~10^82 atoms in (observable) universe

▸ Might never be possible to brute-force solve Go

▸ The only ‘reward’ signal: win/loss at end of game!

GO

Naive Solution: Exhaustive Search

Taken from David Silver, 2020
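
To make "exhaustive search" concrete, here is a minimal sketch (not from the slides; the `game` interface is hypothetical) of what brute-forcing a two-player game would look like: recursively expand every legal move to the end of the game and back up the result. This is exactly what ~10^170 positions rules out for Go.

```python
# Minimal sketch of exhaustive game-tree search (hypothetical `game` interface).
# Expands every legal move to the end of the game -- hopeless for Go,
# whose game tree contains on the order of 10^170 positions.

def exhaustive_value(game, state, player):
    """Exact game-theoretic value of `state` for `player`."""
    if game.is_terminal(state):
        return game.outcome(state, player)   # e.g. +1 win, 0 draw, -1 loss
    values = [
        exhaustive_value(game, game.next_state(state, move), player)
        for move in game.legal_moves(state)
    ]
    # Maximise on `player`'s turn, assume the opponent minimises on theirs.
    return max(values) if game.to_play(state) == player else min(values)
```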

TIMELINE: ALPHA-FAMILY

▸ 2016: AlphaGo

▸ Only plays Go. Beats world champion

▸ 2017: AlphaGo Zero

▸ Removed need to train on human games first

▸ 2018: AlphaZero

▸ Generalized to work on Go, chess, shogi, etc.

▸ Beat the world computer chess champion, Stockfish

▸ Stockfish has vast amounts of domain-specific engineering (14,000 LOC)

▸ 2019: MuZero

▸ Learns rules of games

▸ Works in single-agent settings (eg Atari)


IDEA 1: REDUCE SEARCH SPACE WITH NEURAL NET

(p, v) = f_θ(s)

▸ s: state of the game (eg board position)

▸ p: a probability distribution over possible actions

▸ v: a scalar value, the expected outcome of the game from state s

▸ θ: learnable neural net parameters (a PyTorch sketch follows)
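
A minimal sketch of what f_θ could look like in PyTorch. This is an illustrative two-head MLP of my own, not the residual conv net used in the papers: a shared trunk, a policy head producing p over actions, and a value head producing a scalar v in [-1, 1].

```python
import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    """Sketch of (p, v) = f_theta(s): shared trunk with policy and value heads."""

    def __init__(self, state_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.policy_head = nn.Linear(hidden, num_actions)  # logits over moves
        self.value_head = nn.Linear(hidden, 1)             # expected game outcome

    def forward(self, s: torch.Tensor):
        h = self.trunk(s)
        p = torch.softmax(self.policy_head(h), dim=-1)   # probability distribution p
        v = torch.tanh(self.value_head(h)).squeeze(-1)   # scalar value v in [-1, 1]
        return p, v
```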

REDUCE SEARCH WIDTH

Reduce search-width with policy p

Taken from David Silver, 2020

REDUCE SEARCH DEPTH

Reduce search-depth with value v

MAKING A MOVE: MCTS

▸ Before making a real move:

▸ Search: try the most promising moves in its mind, and see which leads to the highest value!

▸ A position that looks good at first glance might lead to checkmate in 5 moves

▸ Obviously needs to know the rules/environment model

▸ Also add some noise to the search (explore vs exploit); AlphaZero mixes Dirichlet noise into the root priors

▸ Known as Monte-Carlo Tree Search (MCTS)

▸ Use the results of this search to create a new, better policy p̂ (a code sketch follows below)

▸ MCTS can be seen as a policy improvement operator
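
The search itself fits in a few dozen lines. Below is a simplified AlphaZero-style MCTS (PUCT selection); the `game` and `net` interfaces and the constant c_puct are my own assumptions, and the Dirichlet root noise used in the real system is reduced to a comment. Each simulation walks the tree balancing the prior p against visit counts, expands a leaf with the network, and backs the value v up the path; the root visit counts become the improved policy p̂.

```python
import math

class Node:
    """One search-tree node: statistics for a state reached during search."""
    def __init__(self, prior):
        self.prior = prior        # p(a|s) from the neural net for the move leading here
        self.visit_count = 0
        self.value_sum = 0.0
        self.children = {}        # action -> Node

    def value(self):
        return self.value_sum / self.visit_count if self.visit_count else 0.0


def puct_score(parent, child, c_puct=1.5):
    # Explore moves with a high prior and few visits; exploit good value estimates.
    u = c_puct * child.prior * math.sqrt(parent.visit_count) / (1 + child.visit_count)
    return -child.value() + u     # child values are from the opponent's perspective


def expand(node, state, game, net):
    p, v = net(state)             # (p, v) = f_theta(s)
    for action in game.legal_moves(state):
        node.children[action] = Node(prior=p[action])
    return v


def run_mcts(root_state, game, net, num_simulations=100):
    root = Node(prior=1.0)
    expand(root, root_state, game, net)
    # (AlphaZero also mixes Dirichlet noise into the root priors here: explore vs exploit.)
    for _ in range(num_simulations):
        node, state, path = root, root_state, [root]
        # 1. SELECT: walk down the tree with the PUCT rule until a leaf.
        while node.children:
            parent = node
            action, node = max(node.children.items(),
                               key=lambda kv: puct_score(parent, kv[1]))
            state = game.next_state(state, action)
            path.append(node)
        # 2. EXPAND and EVALUATE the leaf with the network, or use the true result
        #    at a terminal state (value is from the player-to-move's perspective).
        if game.is_terminal(state):
            value = game.outcome(state)
        else:
            value = expand(node, state, game, net)
        # 3. BACKUP: propagate the value up the path, flipping sign each ply.
        for n in reversed(path):
            n.value_sum += value
            n.visit_count += 1
            value = -value
    # The improved policy p_hat: normalised visit counts at the root.
    total = sum(c.visit_count for c in root.children.values())
    return {a: c.visit_count / total for a, c in root.children.items()}
```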


IDEA 2: USE MCTS POLICY FOR TRAINING OBJECTIVE

▸ Reward Sparsity: win/loss/draw only at the end of the game

▸ Can take hundreds of moves to get there!

▸ Solution: train the neural net so that p is the same as p̂

▸ Minimize cross-entropy between two distributions

▸ L_alphazero = (z - v)^2 - p̂^T log(p) (sketched in code below)
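
A minimal sketch of that loss in PyTorch (the papers also add L2 weight regularisation, omitted here): squared error between the game outcome z and the predicted value v, plus cross-entropy between the MCTS policy p̂ and the network policy p.

```python
import torch

def alphazero_loss(p, v, p_hat, z):
    """L = (z - v)^2 - p_hat^T log(p), averaged over the batch.

    p     : (batch, actions) network policy (probabilities)
    v     : (batch,)         network value prediction
    p_hat : (batch, actions) MCTS visit-count policy (training target)
    z     : (batch,)         final game outcome from the player's perspective
    """
    value_loss = (z - v) ** 2
    policy_loss = -(p_hat * torch.log(p + 1e-8)).sum(dim=-1)  # cross-entropy
    return (value_loss + policy_loss).mean()
```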


SAME IDEA: EXPERT ITERATION

[Figure: Expert Iteration: MCTS (expert) and policy p (apprentice)]

Taken from: Thinking Fast and Slow with Deep Learning and Tree Search, T. Anthony et al., 2017

IDEA 3: SELF-PLAY

▸ Difference from normal RL

▸ Don’t have an agent to play against!

▸ How to evaluate strength?

“Modern learning algorithms are outstanding test-takers: once a problem is packaged into a suitable objective, deep (reinforcement) learning algorithms often find a good solution. However, in many multi-agent domains, the question of what test to take, or what objective to optimize, is not clear... Learning in games is often conservatively formulated as training agents that tie or beat, on average, a fixed set of opponents. However, the dual task, that of generating useful opponents to train and evaluate against, is under-studied. It is not enough to beat the agents you know; it is also important to generate better opponents, which exhibit behaviours that you don’t know.” ~ Balduzzi et al. (2019)

IDEA 3: SELF-PLAY

▸ Two types of game:

▸ Transitive: Chess, Go, etc

▸ Intransitive (or Cyclic): Rock-Paper-Scissors, StarCraft

▸ Transitive games have a special property: if agent A beats agent B and B beats C, then A beats C

▸ Basis of Elo ratings in Chess (see the Elo sketch below)
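
Transitivity is what makes a single scalar rating meaningful. Under the standard Elo model (a well-known formula, not taken from the slides), if A is rated above B and B above C, then A is automatically predicted to beat C:

```python
def elo_expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# Transitivity in action: an ordering on one scale implies A should beat C.
print(round(elo_expected_score(2800, 2700), 2))  # ~0.64  (A vs B)
print(round(elo_expected_score(2700, 2600), 2))  # ~0.64  (B vs C)
print(round(elo_expected_score(2800, 2600), 2))  # ~0.76  (A vs C)
```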

IDEA 3: SELF-PLAY

▸ Self-Play

▸ Start with neural net agent f1

▸ Play f1 against itself to generate training data

▸ Train f1 on this data using L = (z - v)^2 - p̂^T log(p) to produce f2

▸ Repeat

▸ Key: fn will beat f1, f2, … due to transitivity! (a sketch of this loop follows)
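
A sketch of that loop, reusing the `run_mcts` and `alphazero_loss` sketches above. The `game` interface, the sampling choice, and the `update` callable are my own assumptions, not the papers' exact procedure.

```python
import random

def sample_action(p_hat):
    """Sample an action from the MCTS policy p_hat (dict: action -> probability)."""
    actions, probs = zip(*p_hat.items())
    return random.choices(actions, weights=probs, k=1)[0]

def self_play_game(game, net, num_simulations=100):
    """One game of the current net against itself; returns (state, p_hat, z) examples."""
    history, state = [], game.initial_state()
    while not game.is_terminal(state):
        p_hat = run_mcts(state, game, net, num_simulations)   # improved policy from search
        history.append((state, p_hat, game.to_play(state)))
        state = game.next_state(state, sample_action(p_hat))  # sampling keeps games diverse
    result = game.outcome(state)     # from the perspective of the player to move at the end
    final_player = game.to_play(state)
    # z: the final result, seen from each stored position's player-to-move perspective.
    return [(s, p, result if pl == final_player else -result) for s, p, pl in history]

def self_play_training(game, net, update, iterations=100, games_per_iteration=25):
    """Outer loop f1 -> f2 -> ...; `update` is any gradient step on the AlphaZero loss."""
    for _ in range(iterations):
        examples = []
        for _ in range(games_per_iteration):
            examples.extend(self_play_game(game, net))   # f_n plays against itself
        for s, p_hat, z in examples:
            update(net, s, p_hat, z)   # minimise L = (z - v)^2 - p_hat^T log(p)
    return net   # by transitivity, later nets should beat f1, f2, ...
```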

ALPHASTAR FOR INTRANSITIVE GAMES: NEEDS HUMAN GAMES!

• https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/

IDEA 4: A LEARNED MODEL + MCTS = POWER

▸ For things like Atari, or almost any real-world RL problem, we don’t know the rules/model:

▸ if I have a state s1 and take action a, what will state s2 be?

▸ MuZero: use RNN to predict next state given previous states + previous actions!

▸ Then use it for planning with MCTS, like in AlphaZero (toy model sketched below)
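
A toy sketch of the learned model (my own minimal version, not the paper's architecture): a representation network h encodes observations into a hidden state, a dynamics network g steps that hidden state forward given an action (this is the "learned rules" part), and a prediction network f outputs policy and value. MCTS then plans over these hidden states exactly as AlphaZero plans over real board positions, with g standing in for the real environment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMuZeroModel(nn.Module):
    """Three learned functions: representation h, dynamics g, prediction f."""

    def __init__(self, obs_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.num_actions = num_actions
        self.h = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())               # obs -> hidden state
        self.g = nn.Sequential(nn.Linear(hidden + num_actions, hidden), nn.ReLU())  # (state, action) -> next state
        self.reward_head = nn.Linear(hidden + num_actions, 1)                       # predicted reward
        self.policy_head = nn.Linear(hidden, num_actions)
        self.value_head = nn.Linear(hidden, 1)

    def initial_state(self, obs):
        """h: encode the (stacked) observations into the model's hidden state."""
        return self.h(obs)

    def step(self, state, action):
        """g: the learned 'rules'. `action` is an integer tensor; returns (next state, reward)."""
        a = F.one_hot(action, self.num_actions).float()
        x = torch.cat([state, a], dim=-1)
        return self.g(x), self.reward_head(x).squeeze(-1)

    def predict(self, state):
        """f: policy and value for planning with MCTS inside the learned model."""
        p = torch.softmax(self.policy_head(state), dim=-1)
        v = torch.tanh(self.value_head(state)).squeeze(-1)
        return p, v
```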

IDEA 4: A LEARNED MODEL + MCTS = POWER

[Figure: PacMan reward]

ALPHA ZERO

▸ Idea 1: Reduce search-space with policy-value neural net

▸ Needs rules/model of environment. Can be learnt (MuZero)

▸ Idea 2: Use the MCTS policy as the training objective

▸ Idea 3: Use self-play to generate training data

▸ Idea 4: MCTS with a learned model of the world works well

COMPARISON TO HUMANS

COMPARISON TO HUMANS: DUAL PROCESS THEORY

“System 1 operates automatically and quickly, with little or no effort and no sense of voluntary control. System 2 allocates attention to the effortful mental activities that demand it, including complex computations. The operations of System 2 are often associated with the subjective experience of agency, choice, and concentration.” ~ Daniel Kahneman, Thinking, Fast and Slow

COMPARISON TO HUMANS: DUAL PROCESS THEORY

[Figure: Neural Net alone vs. Neural Net used in MCTS]

Taken from David Barber, Learning From Scratch by Thinking Fast and Slow with Deep Learning and Tree Search

COMPARISON TO HUMANS

https://deepmind.com/blog/article/alphazero-shedding-new-light-grand-games-chess-shogi-and-go

LESSONS FROM IMPLEMENTING

DeepMind OpenSpiel (https://github.com/deepmind/open_spiel): “OpenSpiel is a collection of environments and algorithms for research in general reinforcement learning and search/planning in games.”

LESSONS FROM IMPLEMENTING

▸ A simple tic-tac-toe example

▸ Takes around 2 minutes to train on a laptop (a minimal OpenSpiel example follows)
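
For orientation, the smallest possible use of the OpenSpiel Python API on tic-tac-toe, playing random moves; OpenSpiel's AlphaZero code drives this same state interface, just with MCTS choosing the actions.

```python
import random
import pyspiel  # pip install open_spiel

game = pyspiel.load_game("tic_tac_toe")
state = game.new_initial_state()

while not state.is_terminal():
    action = random.choice(state.legal_actions())  # a trained agent would pick via MCTS
    state.apply_action(action)

print(state)            # final board
print(state.returns())  # e.g. [1.0, -1.0] if the first player won, [0.0, 0.0] for a draw
```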

LESSONS: HYPERPARAMETERS

“The hyperparameters of AlphaGo Zero were tuned by Bayesian optimization.” ~ AlphaZero paper


▸ Luckily, there are only two game-specific hyperparameters

▸ Found reasonable values for tic-tac-toe with some trial and error

LESSONS: OPEN SOURCE

▸ Code reviews by real experts

▸ You cannot buy the sort of feedback you can get from DeepMind engineers/researchers!

▸ It’s easy to fool yourself that you understand something better than you do

▸ Fortuitous Connections

▸ Looks good for employers

▸ “Research and software engineer experience demonstrated via an internship, contributions to open source, work experience, or coding competitions.” ~ from recent Cape Town AI company job posting

▸ Benefits others!

OPENSPIEL: CALL FOR CONTRIBUTIONS

See: https://github.com/deepmind/open_spiel/blob/master/docs/contributing.md

THANKS FOR LISTENING!