Mastering the game of Go with deep neural networks and tree search

David Silver1*, Aja Huang1*, Chris J. Maddison1, Arthur Guez1, Laurent Sifre1, George van den Driessche1

Julian Schrittwieser1, Ioannis Antonoglou1, Veda Panneershelvam1, Marc Lanctot1, Sander Dieleman1, Dominik Grewe1, John Nham2, Nal Kalchbrenner1, Ilya Sutskever2, Timothy Lillicrap1, Madeleine Leach1, Koray Kavukcuoglu1, Thore Graepel1 & Demis Hassabis1
(1 Google DeepMind. 2 Google.)
(*These authors contributed equally to this work)

Outline
• Introduction
• Supervised learning of policy networks
• Reinforcement learning of policy networks
• Reinforcement learning of value networks
• Monte Carlo tree search
• Evaluations
• Conclusions

Introduction

• In large games, such as chess (b ≈ 35, d ≈ 80) and especially Go (b ≈ 250, d ≈ 150), exhaustive search is infeasible, where b is the game's breadth (number of legal moves per position) and d is its depth (game length).
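As a rough sense of scale, here is a back-of-the-envelope calculation using the approximate b and d values above (the Python snippets in this write-up are illustrative sketches, not AlphaGo's code):

```python
import math

# Back-of-the-envelope size of the game tree, b^d, using the rough values above.
for game, b, d in [("chess", 35, 80), ("Go", 250, 150)]:
    print(f"{game}: b^d = {b}^{d} ~ 10^{d * math.log10(b):.0f}")
# Chess comes out around 10^124 possible move sequences and Go around 10^360,
# which is why exhaustive search is hopeless.
```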

[Game-tree diagram: from a state s (state), the computer (AI) chooses an action a (action), reaching a new state s′; the tree of possible continuations grows with each additional depth level d = 1, 2, ...]

• Reducing “action candidates” (breadth reduction): if there is a model that can tell you which moves are not common or probable (e.g. judged against expert play), those moves can be removed from the search candidates in advance. A minimal sketch of the idea follows.
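The sketch below assumes a hypothetical `policy(state)` function that returns a probability for each legal move; only the most probable moves are kept for the search:

```python
# Breadth reduction: keep only the moves the policy model thinks are plausible,
# and hand just those to the search. `policy(state)` is a hypothetical function
# returning {move: probability} over the legal moves of `state`.
def candidate_moves(state, policy, top_k=5):
    priors = policy(state)
    ranked = sorted(priors, key=priors.get, reverse=True)
    return ranked[:top_k]          # everything else is pruned before searching
```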

• Position evaluation ahead of time (depth reduction): instead of simulating every continuation until the maximum depth, stop early if there is a function V(s) that can measure the “board evaluation of state s” (the diagram marks example positions with V = 1, V = 2, V = 10). A minimal sketch follows.
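A minimal sketch of that idea, assuming hypothetical helpers `legal_moves`, `apply_move`, and an evaluator `V`:

```python
# Depth reduction: search only `depth` plies ahead, then trust V(s) for the rest
# instead of simulating the game to its end. Negamax convention: V(s) is the
# evaluation from the point of view of the player to move in s.
def search_value(state, depth, legal_moves, apply_move, V):
    moves = legal_moves(state)
    if depth == 0 or not moves:
        return V(state)                                   # evaluate, don't play on
    return max(-search_value(apply_move(state, m), depth - 1,
                             legal_moves, apply_move, V)
               for m in moves)
```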

• 1. Reducing “action candidates” (breadth reduction) → Policy Network
• 2. Board evaluation (depth reduction) → Value Network

Supervised learning of policy networks

• Imitating expert moves (supervised learning)
• 160K games, 30M positions from the KGS Go Server (online Go experts, 5~9 dan)
• Learning: P(next action | current state) = p_σ(a|s), modelled by a 13-layer deep convolutional neural network, the SL policy network p_σ.
• Training maximizes the likelihood by stochastic gradient ascent: Δσ ∝ ∂log p_σ(a|s) / ∂σ. The network reached an accuracy of 57.0% at predicting expert moves; a rough training sketch follows below.
• A faster but less accurate rollout policy p_π(a|s), using a linear softmax of small pattern features, achieved an accuracy of 24.2%, using just 2 μs to select an action rather than 3 ms for the policy network.
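A rough PyTorch sketch of the supervised training step (the layer count, channel widths, and learning rate are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Convolutional policy network over 19x19 boards with 48 input feature planes."""
    def __init__(self, in_planes=48, channels=192, n_conv=12):
        super().__init__()
        layers = [nn.Conv2d(in_planes, channels, 5, padding=2), nn.ReLU()]
        for _ in range(n_conv - 1):
            layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU()]
        layers.append(nn.Conv2d(channels, 1, 1))      # 1x1 conv -> one logit per point
        self.net = nn.Sequential(*layers)

    def forward(self, boards):                        # boards: (N, 48, 19, 19)
        return self.net(boards).flatten(1)            # (N, 361) move logits

policy = PolicyNet()
opt = torch.optim.SGD(policy.parameters(), lr=0.003)
xent = nn.CrossEntropyLoss()                          # minimizes -log p_sigma(a|s)

def sl_step(boards, expert_moves):
    """One SGD step toward the expert move; expert_moves: (N,) indices in [0, 361)."""
    opt.zero_grad()
    loss = xent(policy(boards), expert_moves)
    loss.backward()
    opt.step()
    return loss.item()
```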

Reinforcement learning of policy networks

• Improving through self-play (reinforcement learning)
• The RL policy network p_ρ has the same structure as the expert-moves imitator (the SL policy network) and is improved by policy gradient on the game outcome z_t (+1 for a win, −1 for a loss):
  Δρ ∝ z_t · ∂log p_ρ(a_t|s_t) / ∂ρ
A minimal REINFORCE-style sketch follows.
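A minimal REINFORCE-style sketch of that update, reusing the `policy` and `opt` names from the supervised sketch above (a simplification of the paper's training setup):

```python
import torch

def rl_update(states, moves, z):
    """Policy-gradient step on one self-play game.

    states: (T, 48, 19, 19) board features, moves: (T,) chosen move indices,
    z: +1.0 if this player won the game, -1.0 if it lost.
    """
    opt.zero_grad()
    log_probs = torch.log_softmax(policy(states), dim=1)         # (T, 361)
    chosen = log_probs.gather(1, moves.unsqueeze(1)).squeeze(1)  # log p_rho(a_t|s_t)
    loss = -(z * chosen).mean()          # gradient ascent on z * d log p / d rho
    loss.backward()
    opt.step()
    return loss.item()
```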

• Self-play games pit the current (newer) model against randomly selected older versions of itself (e.g. Ver. 1.1 vs. Ver. 1.3, Ver. 2343 vs. Ver. 3455, ... up to Ver. 100000 vs. the original expert-moves imitator). Each updated model uses the same topology as the expert-moves imitator model and just uses the updated parameters.
• Each game returns the board positions visited and the win/lose outcome, which feed the policy-gradient update above.

• The final model wins 80% of the time when playing against the first model (the SL policy network).

Reinforcement learning of value networks

• Focuses on position evaluation, estimating a value function.
• In practice, we approximate the value function using a value network v_θ(s) with weights θ: v_θ(s) ≈ v^{p_ρ}(s) ≈ v*(s).

• Adds a regression output layer on top of the network (the figure shows the value prediction model built from the updated model, Ver. 100000).
• Predicts values between 0 and 1: close to 1 means a good board position, close to 0 a bad one.

• Training minimizes the mean squared error (MSE) between the predicted value v_θ(s) and the corresponding outcome z, by stochastic gradient descent:
  Δθ ∝ ∂v_θ(s)/∂θ · (z − v_θ(s))
A rough training sketch follows.
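A rough PyTorch sketch of that regression step (architecture details are illustrative; this sketch follows the previous slide's 0-to-1 convention, with z = 1.0 for a win and z = 0.0 for a loss):

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Convolutional trunk plus a scalar regression head for position evaluation."""
    def __init__(self, in_planes=48, channels=192):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_planes, channels, 5, padding=2), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 1, 1), nn.ReLU(), nn.Flatten(),
        )
        self.head = nn.Sequential(nn.Linear(361, 256), nn.ReLU(),
                                  nn.Linear(256, 1), nn.Sigmoid())

    def forward(self, boards):                          # boards: (N, 48, 19, 19)
        return self.head(self.trunk(boards)).squeeze(1) # v_theta(s) in (0, 1)

value_net = ValueNet()
v_opt = torch.optim.SGD(value_net.parameters(), lr=0.003)

def value_step(boards, z):
    """One regression step; z: (N,) game outcomes (1.0 win, 0.0 loss)."""
    v_opt.zero_grad()
    loss = nn.functional.mse_loss(value_net(boards), z)   # (z - v_theta(s))^2
    loss.backward()
    v_opt.step()
    return loss.item()
```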

To avoid overfitting to the strongly correlated positions within a single game, each training example is generated from a separate game of self-play:
1. Randomly sample a time step U ~ unif{1, 450}.
2. Sample the first t = 1, ..., U − 1 moves from the SL policy network.
3. Sample one move uniformly at random from the available moves, a_U ~ unif{1, 361} (repeated until a_U is legal).
4. Sample the remaining moves t = U + 1, ..., T from the RL policy network until the game terminates.
5. Finally, score the game to determine the outcome z_t = ±r(s_T).
Only a single training example (s_{U+1}, z_{U+1}) is added to the data set from each game. This data provides unbiased samples of the value function v^{p_ρ}(s_{U+1}) = E[z_{U+1} | s_{U+1}, a_{U+1,...,T} ~ p_ρ]. A data-generation sketch follows.
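A sketch of that recipe as a loop, with hypothetical helpers (`new_game`, `game_over`, `play`, `sl_move`, `random_legal_move`, `rl_move`, `outcome`) standing in for an actual Go engine and the two policy networks:

```python
import random

def one_value_example(new_game, game_over, play, sl_move,
                      random_legal_move, rl_move, outcome):
    """Generate a single (s_{U+1}, z_{U+1}) training pair from one game, or None."""
    state = new_game()
    U = random.randint(1, 450)                  # U ~ unif{1, 450}
    for _ in range(U - 1):                      # moves 1 .. U-1 from the SL policy
        if game_over(state):
            return None                         # game ended early; discard it
        state = play(state, sl_move(state))
    if game_over(state):
        return None
    state = play(state, random_legal_move(state))   # the single uniform move a_U
    sample_state = state                            # this is s_{U+1}
    while not game_over(state):                     # moves U+1 .. T from the RL policy
        state = play(state, rl_move(state))
    z = outcome(state, sample_state)   # +/-1 from the view of the player to move at s_{U+1}
    return sample_state, z
```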


• 1. Reducing “action candidates” (breadth reduction) → Policy Network
• 2. Board evaluation (depth reduction) → Value Network

Monte Carlo tree search

• Each simulation traverses the tree by selecting the edge with maximum action value Q, plus a bonus u(P) that depends on a stored prior probability P(s, a) for that edge.
• The search combines board evaluation (value network) with action-candidate reduction (policy network).
• Rollouts use a faster version of estimating p(a|s) with a much shallower model (2 μs per move instead of 3 ms).

• At each time step t of each simulation, an action a_t is selected from state s_t through the following phases:
• Selection: choose the edge maximizing Q(s, a) + u(s, a), where Q is the exploit term and the explore term u(s, a) is proportional to the prior P(s, a) and decays with the edge's visit count.
• Expansion: when the visit count exceeds a threshold, N_r(s, a) > n_thr, the successor state s′ = f(s, a) is added to the search tree with prior probabilities from the policy network.
• Evaluation: the leaf node s_L reached at step L is evaluated by mixing the value network with a fast rollout outcome, V(s_L) = (1 − λ)·v_θ(s_L) + λ·z_L; the setting λ = 0.5 won ≥95% of games against the other variants.
• Backup: the leaf evaluation from each simulation is backed up to update the visit counts and mean action values of the traversed edges.
A compact sketch of these formulas follows.
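A compact sketch of those formulas with simplified bookkeeping: each node is assumed to carry an `edges` dict mapping every action to {"P": prior, "N": visit count, "Q": mean value}, and `c_puct` is an assumed exploration constant, not a value from the slides:

```python
import math

C_PUCT, LAMBDA = 5.0, 0.5        # exploration constant (assumed) and mixing weight

def select_action(node):
    """Selection: argmax_a Q(s,a) + u(s,a), with u growing with the prior P(s,a)
    and shrinking as the edge's visit count N(s,a) grows."""
    total_visits = sum(e["N"] for e in node.edges.values())
    def score(edge):
        u = C_PUCT * edge["P"] * math.sqrt(total_visits) / (1 + edge["N"])
        return edge["Q"] + u
    return max(node.edges, key=lambda a: score(node.edges[a]))

def evaluate_leaf(leaf_state, value_net, rollout):
    """Evaluation: mix the value network with a fast rollout, (1-λ)·v_θ(s_L) + λ·z_L."""
    return (1 - LAMBDA) * value_net(leaf_state) + LAMBDA * rollout(leaf_state)

def backup(path, leaf_value):
    """Backup: update visit counts and running-mean action values along the path."""
    for node, action in path:
        edge = node.edges[action]
        edge["N"] += 1
        edge["Q"] += (leaf_value - edge["Q"]) / edge["N"]
# Expansion (not shown): when an edge's visit count exceeds n_thr, its successor
# state s' = f(s, a) is added to the tree with priors from the policy network.
```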

Evaluations

• The final single-machine version of AlphaGo used 40 search threads, 48 CPUs, and 8 GPUs; a distributed version exploited multiple machines, with 40 search threads, 1,202 CPUs, and 176 GPUs.
• Google promised not to use more CPUs/GPUs for the game with Lee Se-dol than were used against Fan Hui.

• Fan Hui match: October 5–9, 2015 • Time limit: 1 hour

[Game diagram: AlphaGo (black) vs. Fan Hui (樊麾) in an informal game; the figure highlights the move AlphaGo predicted, the move it selected, and Fan Hui's response.]

• Lee Sedol match: March 9–15, 2016 • Time limit: 2 hours

• AlphaGo seals 4-1 victory over Go grandmaster Lee Sedol

Game no. | Date          | Black     | White     | Result             | Moves
1        | 9 March 2016  | Lee Sedol | AlphaGo   | Lee Sedol resigned | 186
2        | 10 March 2016 | AlphaGo   | Lee Sedol | Lee Sedol resigned | 211
3        | 12 March 2016 | Lee Sedol | AlphaGo   | Lee Sedol resigned | 176
4        | 13 March 2016 | AlphaGo   | Lee Sedol | AlphaGo resigned   | 180
5        | 15 March 2016 | Lee Sedol | AlphaGo   | Lee Sedol resigned | 280

Result: AlphaGo 4 – 1 Lee Sedol

Conclusions

• In this work we have developed a Go program, based on a combination of deep neural networks and tree search, that plays at the level of the strongest human players, thereby achieving one of artificial intelligence's “grand challenges”.
