Mastering the game of Go with deep neural networks and tree search
David Silver1*, Aja Huang1*, Chris J. Maddison1, Arthur Guez1, Laurent Sifre1, George van den Driessche1, Julian Schrittwieser1, Ioannis Antonoglou1, Veda Panneershelvam1, Marc Lanctot1, Sander Dieleman1, Dominik Grewe1, John Nham2, Nal Kalchbrenner1, Ilya Sutskever2, Timothy Lillicrap1, Madeleine Leach1, Koray Kavukcuoglu1, Thore Graepel1 & Demis Hassabis1 (1Google DeepMind; 2Google) (*These authors contributed equally to this work)

• Introduction • Supervised learning of policy networks • Reinforcement learning of policy networks • Reinforcement learning of value networks • Monte Carlo tree search • Evaluations • Conclusions
• In large games such as chess (b ≈ 35, d ≈ 80) and especially Go (b ≈ 250, d ≈ 150), exhaustive search is infeasible, where b is the game's breadth (number of legal moves per position) and d is its depth (game length).
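The scale of those numbers can be sanity-checked directly. A minimal script, using only the b and d figures quoted above (everything else is standard library):

```python
import math

# The game tree holds roughly b^d positions: breadth b raised to depth d.
def tree_size_digits(b: int, d: int) -> int:
    """Number of decimal digits in b**d."""
    return int(d * math.log10(b)) + 1

print(tree_size_digits(35, 80))    # chess: a ~124-digit count of positions
print(tree_size_digits(250, 150))  # Go: a ~360-digit count
```

Even at a billion positions per second, numbers this large rule out brute force; hence the two reductions introduced next.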
[Diagram: the computer (AI) in state s takes an action a, reaching a successor state s′; the search tree branches at depths d = 1, 2, …]
• Reducing "action candidates" (breadth reduction): if a model can tell us in advance which moves are uncommon or improbable (e.g., rarely played by experts), we can remove them from the search candidates before expanding the tree.
• Position evaluation ahead of time (depth reduction): instead of simulating until the maximum depth, use a function V(s), the "board evaluation of state s", to score positions directly (e.g., V = 1, 2, or 10 in the diagram).
• 1. Reducing "action candidates" (Breadth Reduction) → Policy Network
• 2. Board Evaluation (Depth Reduction) → Value Network
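The two reductions can be sketched together on a toy game (this is not AlphaGo's actual search; the toy game, policy, and value function below are all hypothetical stand-ins):

```python
# The policy keeps only the top-k candidate moves (breadth reduction), and
# the value function replaces simulation to the end of the game once the
# depth limit is reached (depth reduction).

def toy_policy(state):
    # Prior probability for each candidate move (here: add 1..5 to state).
    moves = [1, 2, 3, 4, 5]
    return {m: m / 15 for m in moves}    # larger moves deemed more likely

def toy_value(state):
    return state                          # "board evaluation" of the state

def search(state, depth, k=2):
    if depth == 0:
        return toy_value(state)           # depth reduction: evaluate here
    prior = toy_policy(state)
    top_k = sorted(prior, key=prior.get, reverse=True)[:k]  # breadth reduction
    return max(search(state + m, depth - 1, k) for m in top_k)

print(search(0, depth=3))  # explores only 2^3 = 8 leaves instead of 5^3 = 125
```

The same structure carries over to AlphaGo: the policy network plays the role of `toy_policy`, the value network the role of `toy_value`, inside Monte Carlo tree search rather than this exhaustive loop.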
• Supervised learning of policy networks
• Imitating expert moves (supervised learning)
• 160K games and 30M positions from the KGS Go Server (online Go experts, 5-9 dan)
• Learning: P(next action | current state) = P(a|s)
• Prediction model: a deep 13-layer CNN, the SL policy network p_σ
• Maximize the likelihood: Δσ ∝ ∂log p_σ(a|s) / ∂σ; this reached a move-prediction accuracy of 57.0%
• A faster but less accurate rollout policy p_π(a|s), using a linear softmax of small pattern features
• This achieved an accuracy of 24.2%, using just 2 μs to select an action, rather than 3 ms for the policy network
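The update Δσ ∝ ∂log p_σ(a|s)/∂σ can be sketched on a toy linear-softmax policy (the real network is the 13-layer CNN; the features and the single "expert" pair below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_moves = 8, 4
sigma = np.zeros((n_features, n_moves))   # policy parameters σ

def p_sigma(s):
    """Move probabilities of the toy linear-softmax policy."""
    logits = s @ sigma
    e = np.exp(logits - logits.max())
    return e / e.sum()

def sl_update(s, a, lr=0.1):
    """One ascent step on log p_σ(a|s): Δσ ∝ ∂log p_σ(a|s)/∂σ."""
    global sigma
    probs = p_sigma(s)
    grad = -np.outer(s, probs)            # log-softmax gradient: -(s ⊗ p)
    grad[:, a] += s                       # plus s on the expert move's column
    sigma += lr * grad

s, a = rng.normal(size=n_features), 2     # one toy "expert" (state, move) pair
before = p_sigma(s)[a]                    # uniform 0.25 while σ = 0
for _ in range(50):
    sl_update(s, a)
assert p_sigma(s)[a] > before             # the expert move became more likely
```

Training is ordinary maximum-likelihood classification over expert (state, move) pairs; the only Go-specific part is the input representation.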
• Reinforcement learning of policy networks
• Improving through self-play (reinforcement learning)
• Start from the expert-moves imitator model (the SL policy network)
• Maximize the expected outcome: Δρ ∝ z_t · ∂log p_ρ(a_t|s_t) / ∂ρ, where z_t = ±1 is the final game outcome
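The RL update Δρ ∝ z_t · ∂log p_ρ(a_t|s_t)/∂ρ is the same log-likelihood gradient as in the supervised stage, only scaled by the game outcome z_t, so moves from won games are reinforced and moves from lost games suppressed. A toy linear-softmax sketch (policy and data are stand-ins):

```python
import numpy as np

rng = np.random.default_rng(1)
n_features, n_moves = 8, 4
rho = np.zeros((n_features, n_moves))     # RL policy parameters ρ

def p_rho(s):
    logits = s @ rho
    e = np.exp(logits - logits.max())
    return e / e.sum()

def rl_update(s, a, z, lr=0.1):
    """Δρ ∝ z · ∂log p_ρ(a|s)/∂ρ, the outcome-weighted gradient."""
    global rho
    probs = p_rho(s)
    grad = -np.outer(s, probs)
    grad[:, a] += s
    rho += lr * z * grad

s, a = rng.normal(size=n_features), 0
p0 = p_rho(s)[a]                          # 0.25 while ρ = 0
rl_update(s, a, z=+1)                     # move from a won game
p_win = p_rho(s)[a]
rho[:] = 0
rl_update(s, a, z=-1)                     # same move from a lost game
p_loss = p_rho(s)[a]
assert p_loss < p0 < p_win
```

This is the REINFORCE-style policy gradient: no move labels are needed, only the win/lose signal at the end of each self-play game.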
• Self-play games are played between the current model and older versions of itself (e.g., updated model Ver. 1.1 vs. Ver. 1.3)
• The RL policy network uses the same topology as the expert-moves imitator model, just with updated parameters
• Training continues across many iterations (e.g., Ver. 2343 vs. Ver. 3455); each game returns the board positions and the win/lose outcome
• Eventually the final updated model (Ver. 100000) plays against the original expert-moves imitator model
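The "older vs. newer models" matchups above amount to an opponent pool: the current policy plays a randomly chosen earlier checkpoint rather than always its latest self, which stabilizes training. A minimal sketch, with checkpoints reduced to version numbers and the parameter update left as a stand-in:

```python
import random

random.seed(0)
pool = []                                  # previously saved checkpoints
current = 0                                # "parameters" of the current model
for step in range(1, 10_001):
    if step % 500 == 0:
        pool.append(current)               # snapshot the current model
    opponent = random.choice(pool) if pool else current  # random older version
    current = step                         # stand-in for a training update

print(len(pool))                           # number of stored checkpoints
```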
• The final model wins 80% of the time when playing against the first model

• Reinforcement learning of value networks
• Focuses on position evaluation, estimating a value function
• In practice, we approximate the value function using a value network v_θ(s) with weights θ: v_θ(s) ≈ v^{p_ρ}(s) ≈ v*(s)
• Adds a regression layer to the updated model (Ver. 100000), predicting values between 0 and 1: close to 1 is a good board position, close to 0 a bad one
• Minimize the mean squared error (MSE) between the predicted value v_θ(s) and the corresponding outcome z: Δθ ∝ (z − v_θ(s)) · ∂v_θ(s)/∂θ
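The update Δθ ∝ (z − v_θ(s)) · ∂v_θ(s)/∂θ is plain gradient descent on the squared error (z − v_θ(s))². A sketch on a toy value function, a linear model squashed to (0, 1) as on the slide (the real v_θ is a deep CNN with a regression head; features are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
theta = np.zeros(8)                       # value-network parameters θ

def v_theta(s):
    return 1 / (1 + np.exp(-s @ theta))   # predicted value in (0, 1)

def value_update(s, z, lr=0.5):
    """Δθ ∝ (z − v_θ(s)) · ∂v_θ(s)/∂θ, with the sigmoid's chain rule."""
    global theta
    v = v_theta(s)
    theta += lr * (z - v) * v * (1 - v) * s

s, z = rng.normal(size=8), 1.0            # one toy (position, outcome) pair
err0 = (z - v_theta(s)) ** 2              # 0.25: the untrained net says 0.5
for _ in range(100):
    value_update(s, z)
assert (z - v_theta(s)) ** 2 < err0       # the squared error shrinks
```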
• Training data is generated by self-play, one example per game:
1. Randomly sample a time step U ~ unif{1, 450}
2. Sample the first t = 1, …, U − 1 moves from the SL policy network
3. Sample one move uniformly at random from the available moves, a_U ~ unif{1, 361} (repeating until a_U is legal)
4. Sample the remaining moves t = U + 1, …, T from the RL policy network until the game terminates
5. Score the game to determine the outcome z_t = ±r(s_T)
• Only a single training example (s_{U+1}, z_{U+1}) is added to the data set from each game
• This provides unbiased samples of the value function: v^{p_ρ}(s_{U+1}) = E[z_{U+1} | s_{U+1}, a_{U+1,…,T} ~ p_ρ]
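The five steps above can be sketched as control flow. The SL/RL policies and the scoring function are hypothetical callables passed in; the legality re-sampling of step 3 is elided:

```python
import random

def generate_example(sl_move, rl_move, score, max_moves=450):
    U = random.randint(1, max_moves)             # 1. U ~ unif{1, 450}
    moves = [sl_move(t) for t in range(1, U)]    # 2. moves 1..U-1 from SL policy
    moves.append(random.randint(1, 361))         # 3. move U uniform on the board
    s_U1 = tuple(moves)                          #    the sampled position s_{U+1}
    while len(moves) < max_moves:
        moves.append(rl_move(len(moves) + 1))    # 4. moves U+1..T from RL policy
    z = score(moves)                             # 5. outcome z = ±1
    return s_U1, z                               # one example per game

random.seed(0)
s, z = generate_example(sl_move=lambda t: t, rl_move=lambda t: t,
                        score=lambda ms: 1)     # stand-in scorer
```

Keeping only one position per game is what makes the samples unbiased: successive positions within a game are strongly correlated, which would otherwise cause the value network to memorize games rather than evaluate positions.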
• 1. Reducing "action candidates" (Breadth Reduction) → Policy Network
• 2. Board Evaluation (Depth Reduction) → Value Network
• Monte Carlo tree search
• Each simulation traverses the tree by selecting the edge with maximum action value Q, plus a bonus u(P) that depends on a stored prior probability P(s, a) for that edge
• The two reductions meet in the search: the policy network proposes the action candidates, and the value network evaluates the boards
• Rollout: a faster version of estimating p(a|s) that uses a shallow network (3 ms → 2 μs)
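The selection rule "maximum Q plus a bonus u(P)" can be sketched as follows. The bonus shrinks with visit count, so search gradually shifts from trusting the prior P(s, a) to trusting the measured values; the constant `c_puct` and the toy edge statistics are hypothetical:

```python
import math

def select(edges, c_puct=5.0):
    """Pick the action maximizing Q(s,a) + u(s,a) over a node's edges."""
    total_n = sum(e["N"] for e in edges.values())
    def score(a):
        e = edges[a]
        u = c_puct * e["P"] * math.sqrt(total_n) / (1 + e["N"])  # decaying bonus
        return e["Q"] + u
    return max(edges, key=score)

edges = {
    "a": {"P": 0.6, "N": 100, "Q": 0.48},  # strong prior, heavily visited
    "b": {"P": 0.3, "N": 5,   "Q": 0.50},  # few visits, so a large bonus
    "c": {"P": 0.1, "N": 1,   "Q": 0.20},
}
print(select(edges))
```

Here the rarely visited move "b" wins selection despite its smaller prior, which is exactly the explore/exploit trade-off the next slide names.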
• At each time step t of each simulation, an action a_t is selected from state s_t:
a. Selection: choose the action maximizing Q plus the bonus u (an exploit term plus an explore term)
b. Evaluation: the leaf node s_L at step L is evaluated by both the value network and a rollout; with mixing parameter λ = 0.5, this won ≥95% of games against the other variants
c. Expansion: when the visit count exceeds a threshold, N_r(s, a) > n_thr, the successor state s′ = f(s, a) is added to the search tree with its prior probability
d. Backup: the visit counts and action values of the edges traversed by the ith simulation are updated with the leaf evaluation

• Evaluations
• The final version of AlphaGo used 40 search threads, 48 CPUs, and 8 GPUs
• A distributed version of AlphaGo exploited multiple machines: 40 search threads, 1,202 CPUs, and 176 GPUs
• Google promised not to use more CPUs/GPUs for the match with Lee Se-dol than were used against Fan Hui
• AlphaGo vs. Fan Hui (樊麾), October 5-9, 2015
[Game diagrams from an informal game, AlphaGo playing black: moves AlphaGo selected, moves Fan Hui played in response, and moves AlphaGo predicted]

• AlphaGo vs. Lee Sedol, March 9-15, 2016
• AlphaGo seals 4-1 victory over Go grandmaster Lee Sedol

Game 1 (9 March 2016): Lee Sedol (Black) vs. AlphaGo (White), Lee Sedol resigned after 186 moves
Game 2 (10 March 2016): AlphaGo (Black) vs. Lee Sedol (White), Lee Sedol resigned after 211 moves
Game 3 (12 March 2016): Lee Sedol (Black) vs. AlphaGo (White), Lee Sedol resigned after 176 moves
Game 4 (13 March 2016): AlphaGo (Black) vs. Lee Sedol (White), AlphaGo resigned after 180 moves
Game 5 (15 March 2016): Lee Sedol (Black) vs. AlphaGo (White), Lee Sedol resigned after 280 moves

Result: AlphaGo 4 - 1 Lee Sedol

• Conclusions
• In this work we have developed a Go program, based on a combination of deep neural networks and tree search, that plays at the level of the strongest human players, thereby achieving one of artificial intelligence's "grand challenges".