Reinforcement Learning
AlphaGo's Self-Play (左右互搏)
Tsung-Hsi Tsai, 2018.8.8, Statistics Workshop (統計研習營)
1 "A consummate art like yours is rare under heaven; an idle man like me is found nowhere in the world. After we part, on snowy, windy nights by the bamboo window, a lamp will flicker dim and bright as I replay our game." — Du Mu (杜牧), Another Farewell Quatrain (重送絕句)
2 Overview
I. Story of AlphaGo
II. Algorithm of AlphaGo Zero
III. Experiments of simple AlphaZero
3 Story of AlphaGo
4 To develop an AlphaGo
What is technically needed:
• Approach: reinforcement learning + deep neural network
• Manpower: programming skill
• Computing power: CPUs + GPUs or TPUs
• Human expert games (optional)
5 AlphaGo
= dream + hard work + good fortune
storytelling
6 Three key persons
• Demis Hassabis (direction)
• David Silver (method)
• Aja Huang (implementation)
7 Demis Hassabis (b. 1976)
• At age 13, a chess master, ranked No. 2 in the world for his age
• In 1993, co-designed the classic game Theme Park; in 1998, founded Elixir Studios, a games developer
• In 2009, PhD in Cognitive Neuroscience
• In 2010, founded DeepMind
• In 2014, started AlphaGo project
8 David Silver
• Demis Hassabis's partner in game development in 1998
• Deep Q-Network (a breakthrough toward AGI, artificial general intelligence), demonstrated on Atari games
• Idea of the value network (a breakthrough for Go programs)
9 Aja Huang
• AlphaGo's "human arm" (AlphaGo 的人肉手臂)
• Amateur 6-dan Go player
• In 2010, his Go program "Erica" won the tournament championship.
• At National Taiwan Normal University (CSIE): master's thesis, Strategies for Ko Fights in Computer Go (2003); PhD thesis, New Heuristic Algorithms for Monte Carlo Tree Search Applied to Computer Go (2011)
• Joined DeepMind in 2012
10 The birth of AlphaGo
• The DREAM started right after DeepMind joined Google in 2014.
• Research direction: deep learning and reinforcement learning.
• First achievement: a high-quality policy network for playing Go, trained on big data of human expert games.
• It beat the then No. 1 Go program CrazyStone about 70% of the time.
11 A cool idea
• The most challenging part of a Go program is evaluating the board position.
• David Silver's new idea: self-play using the policy network to produce big data of games. Training data: board positions labeled with the game result {0, 1}.
• This yielded a high-quality evaluation of board positions, which the AlphaGo team called the value network.
12 Defeating a professional player
• The strength of AlphaGo improved quickly after the value network was introduced.
• In Oct. 2015, it beat the European Go champion Fan Hui (樊麾) 5:0, "a feat previously thought to be at least a decade away".
• On Jan. 27, 2016, Nature published the AlphaGo paper together with the news of the win over Fan Hui, a Chinese professional 2 dan.
13 Next Challenge
Lee Sedol (李世乭), age 33
This century's world champion of Go
• Challenge Lee Sedol
• Comparing Fan Hui and Lee Sedol
14 Breaking news (號外): a notable breakthrough
AlphaGo beat Lee 4:1 in March 2016
15 AlphaGo's signature move: move 37 of game 2
16 Lee Sedol's counterattack in game 4
(AlphaGo playing Black)
17 AlphaGo upgrade
• Fixes for the bug revealed in game 4.
• AlphaGo Master: starting at the end of 2016, it played on online Go platforms at a pace of 10 games a day, "cutting down" the very top Chinese, Korean, and Japanese professionals, including Ke Jie (柯潔), Park Junghwan (朴廷桓), Iyama Yuta (井山裕太), and more.
• Retirement match: in May 2017, at Wuzhen, Zhejiang, China, it played a three-game even (no-handicap) match against Ke Jie (world No. 1), with US$1.5 million in prize money.
• DeepMind released the records of 50 games AlphaGo played against itself.
18 DeepMind makes another breakthrough: AlphaGo Zero
19 AlphaGo Zero
• The algorithm is based solely on reinforcement learning, without human data.
• It starts from completely random play and continues without human intervention.
• AlphaGo Zero is "simple" and elegant, and the approach seems universal for board games.
20 Progress in Elo ratings
[figure: Elo rating over the course of training, with reference lines marking the 2015 and 2005 levels]
21 AlphaZero
A general approach that was applied to other board games such as chess and shogi (Japanese chess, 日本將棋).
David Silver presented the results at NIPS 2017.
22 Algorithm of AlphaGo Zero
23 Chess, etc.: the minimax algorithm with alternating moves
[figure: game tree whose levels alternate between max and min]
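For concreteness, here is a minimal sketch of minimax search over such a tree; the game-specific helpers (legal_moves, apply, is_terminal, evaluate) are assumed for illustration and are not taken from the slides.

```python
# Minimax with alternating max/min levels (illustrative sketch).
def minimax(state, depth, maximizing):
    if depth == 0 or is_terminal(state):
        return evaluate(state)                      # static evaluation of the position
    if maximizing:                                  # our move: take the best child
        return max(minimax(apply(state, m), depth - 1, False)
                   for m in legal_moves(state))
    else:                                           # opponent's move: assume the worst for us
        return min(minimax(apply(state, m), depth - 1, True)
                   for m in legal_moves(state))
```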
24 Evaluation of the position (example)
25 Monte Carlo tree search (4 steps in one round: selection, expansion, simulation, backpropagation)
Each node stores a ratio: # wins / # playouts
26 Balancing exploitation and exploration (introduced in 2006)
Recommendation: choose the move with the highest value of
w_i / n_i + c √( ln t / n_i ),
where w_i is the number of wins after the i-th move, n_i is the number of simulations after the i-th move, c is the exploration parameter, and t is the total number of simulations for the parent node.
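A minimal sketch of this UCT rule applied to a node's children; each child is assumed to carry its (wins, visits) pair, and unvisited children are explored first.

```python
import math

# UCT selection: pick the child maximizing w_i/n_i + c*sqrt(ln t / n_i).
def uct_select(children, c=math.sqrt(2)):
    t = sum(n for _, n in children)                 # total simulations at this node
    def score(child):
        w, n = child
        if n == 0:
            return float("inf")                     # try every move at least once
        return w / n + c * math.sqrt(math.log(t) / n)
    return max(range(len(children)), key=lambda i: score(children[i]))
```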
27 Monte Carlo tree search in AlphaGo
At each time step t of each simulation, an action (or move) a_t is selected from state s_t by
a_t = argmax_a ( Q(s_t, a) + u(s_t, a) ),
where Q(s, a) is the action value and u(s, a) is an exploration bonus; both are determined with the help of the policy network and value network.
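As a sketch of this selection step (the bonus below follows the PUCT form u(s, a) ∝ P(s, a) √(Σ_b N(s, b)) / (1 + N(s, a)) used in the AlphaGo papers; the constant c_puct is an assumed tuning parameter):

```python
import math

# Select a_t = argmax_a ( Q(s_t, a) + u(s_t, a) ) with a PUCT-style bonus.
# Q, P, N are dicts keyed by action: action value, policy-network prior, visit count.
def select_action(actions, Q, P, N, c_puct=1.0):
    total_visits = sum(N[a] for a in actions)
    def puct(a):
        u = c_puct * P[a] * math.sqrt(total_visits) / (1 + N[a])
        return Q[a] + u
    return max(actions, key=puct)
```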
28 Determining the next move from MCTS
The distribution of recommended moves is
π(a) ∝ N(s, a)^(1/τ),
where s is the current state (board position), a is an action (move), N(s, a) is the visit count of the edge (s, a) in the tree, and τ is a temperature parameter (taken close to zero for near-greedy play).
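A small sketch of turning visit counts into this distribution; the cutoff used to treat τ as "zero" is an arbitrary choice for illustration.

```python
import numpy as np

# pi(a) ∝ N(s, a)^(1/tau): convert MCTS visit counts into a move distribution.
def move_distribution(visit_counts, tau=1.0):
    counts = np.asarray(visit_counts, dtype=np.float64)
    if tau < 1e-3:                                  # tau -> 0: play the most-visited move
        pi = np.zeros_like(counts)
        pi[np.argmax(counts)] = 1.0
        return pi
    scaled = counts ** (1.0 / tau)
    return scaled / scaled.sum()
```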
29 Self-play to generate data
• Use the policy network and value network to perform Monte Carlo tree search.
• AlphaGo Zero performs 1,600 MCTS simulations to select each move.
• Training data are triples (s, π, z), where s is the position, π is the distribution of recommended next moves, and z ∈ {−1, +1} is the game outcome.
• Loss function: l = (z − v)² − π^T log p + c‖θ‖², where (p, v) is the network output for s and the last term is L2 regularization.
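A per-example sketch of this loss on plain arrays (the L2 term is left to the optimizer or a weight regularizer):

```python
import numpy as np

# l = (z - v)^2 - pi^T log p for one example (s, pi, z) with network output (p, v).
def zero_loss(z, v, pi, p, eps=1e-8):
    value_loss = (z - v) ** 2                       # squared error on the game outcome
    policy_loss = -np.sum(pi * np.log(p + eps))     # cross-entropy between pi and p
    return value_loss + policy_loss
```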
30 Policy network (predicts the best move) vs. value network (predicts the game result)
31 New network
AlphaGo Zero combines the policy network and the value network into a single two-head (two-output) network.
32 Network architecture
The network consists of a single convolutional block followed by either 19 or 39 residual convolutional blocks, each built from the following modules:
• A convolution of 256 filters of kernel size 3 × 3 with stride 1
• Batch normalization
• A rectifier nonlinearity (ReLU)
The output of this "residual tower" is passed into two separate "heads" (ending in fully connected layers) that compute the policy and the value.
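A scaled-down Keras sketch of such a two-head residual network (the filter count, block count, head design, and the 17-plane 19×19 input below are illustrative assumptions, much smaller and simpler than the published network):

```python
from tensorflow.keras import layers, Model, Input

def conv_bn_relu(x, filters):
    x = layers.Conv2D(filters, 3, strides=1, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def residual_block(x, filters):
    skip = x
    x = conv_bn_relu(x, filters)
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Add()([skip, x])
    return layers.ReLU()(x)

def build_two_head_net(board_size=19, planes=17, filters=64, blocks=4):
    inputs = Input(shape=(board_size, board_size, planes))
    x = conv_bn_relu(inputs, filters)               # single convolutional block
    for _ in range(blocks):                         # the "residual tower"
        x = residual_block(x, filters)
    flat = layers.Flatten()(x)
    policy = layers.Dense(board_size * board_size + 1,   # all points + pass
                          activation="softmax", name="policy")(flat)
    value = layers.Dense(1, activation="tanh", name="value")(flat)
    model = Model(inputs, [policy, value])
    model.compile(optimizer="adam",
                  loss={"policy": "categorical_crossentropy", "value": "mse"})
    return model
```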
33 Training with 20 blocks
AlphaGo Zero won 100 : 0 against AlphaGo Lee.
34 Training with 40 blocks
AlphaGo Zero won 89 : 11 against AlphaGo Master.
(See also: AlphaGo Zero game-record analysis, 对局研究)
35 Simple experiments
36 Question
Can simple reinforcement learning, without using MCTS, master Go?
37 Implemented on simple variants of Go.
38 Atari Go or the Capture Game
A player wins as soon as they capture any of the opponent's stones.
39 More examples
40 Advanced strategy
The strategy involves not only capturing the opponent's stones but also building one's own territory.
41 Training pipeline
Self-play using the current network → games inserted into the game pool → pick data for training → train the network → a better network (which is then used for further self-play).
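A loop-level sketch of this pipeline; self_play, sample_training_data, and train are assumed helpers, and the pool size and batch size are illustrative parameters:

```python
# One pass: self-play -> game pool -> pick data -> train -> better network.
def pipeline(network, game_pool, iterations=10, n_games=100,
             n_samples=2048, pool_limit=10000):
    for _ in range(iterations):
        new_games = [self_play(network) for _ in range(n_games)]  # self-play using network
        game_pool.extend(new_games)                               # insert into the game pool
        del game_pool[:max(0, len(game_pool) - pool_limit)]       # delete the oldest games
        batch = sample_training_data(game_pool, n_samples)        # pick data for training
        network = train(network, batch)                           # better network
    return network
```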
42 Playing a game using the network
Apply the network's prediction. Input: W×W×3, where W is the board size.
Example: a board position with Black to move.
43 Choosing a move from the network output
• Network output: W×W values, each in (0, 1).
• Exclude invalid points.
• Compute the distribution of recommended moves.
• Choose a move randomly according to the distribution (sketched below).
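A minimal sketch of these steps, assuming the W×W network output and a 0/1 mask of valid points are given as NumPy arrays:

```python
import numpy as np

def choose_move(policy_output, valid_mask):
    probs = policy_output.flatten() * valid_mask.flatten()   # exclude invalid points
    if probs.sum() == 0:                                      # fallback: uniform over valid points
        probs = valid_mask.flatten().astype(float)
    probs = probs / probs.sum()                               # distribution of recommended moves
    idx = np.random.choice(len(probs), p=probs)               # sample a move from it
    W = policy_output.shape[0]
    return divmod(idx, W)                                     # (row, col) of the chosen move
```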
44 Network (Keras code)
1. Convolution (3×3) layers: number of filters (denoted by F)
2. Number of residual blocks (denoted by K)
3. Batch normalization & ReLU activation
4. Flatten before the fully connected output layer
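A hedged Keras sketch matching this description (the default board size W, F, and K below are placeholders, not values from the experiments):

```python
from tensorflow.keras import layers, Model, Input

def build_simple_net(W=9, F=32, K=3):
    inputs = Input(shape=(W, W, 3))                 # W x W x 3 input planes
    x = layers.Conv2D(F, 3, padding="same")(inputs) # 3x3 convolution, F filters
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    for _ in range(K):                              # K residual blocks
        skip = x
        x = layers.Conv2D(F, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
        x = layers.Conv2D(F, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Add()([skip, x])
        x = layers.ReLU()(x)
    x = layers.Flatten()(x)                         # flatten before the output layer
    outputs = layers.Dense(W * W, activation="softmax")(x)   # W x W move probabilities
    return Model(inputs, outputs)
```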
45 Training the network
1. Labeling data (giving incentive): the data are W×W×3 positions, taken from the winning side only; the point of the next move is labeled 1 and every other point 0.
2. Pick training data randomly from the game pool.
3. Compile with the loss function 'categorical_crossentropy'.
[figure: an example game position and its one-hot move label]
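A sketch of this training step; the game-pool record format ("winning-side positions" stored as (planes, move) pairs) is an assumption made for illustration:

```python
import numpy as np

def make_example(board_planes, move, W):
    label = np.zeros(W * W, dtype=np.float32)
    label[move[0] * W + move[1]] = 1.0              # 1 at the next move, 0 elsewhere
    return board_planes, label

def train_step(model, game_pool, W, n_samples=1024, epochs=1):
    examples = [make_example(planes, move, W)
                for game in game_pool
                for planes, move in game["winning_side_positions"]]  # winning side only
    idx = np.random.choice(len(examples),
                           size=min(n_samples, len(examples)),
                           replace=False)                            # pick data randomly
    X = np.stack([examples[i][0] for i in idx])
    y = np.stack([examples[i][1] for i in idx])
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    model.fit(X, y, epochs=epochs, verbose=0)
    return model
```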
46 Program flowchart & factors
Initially random play → data set / game pool (size: n_game, with insert & delete) → pick N data → train and renew the network → self-play n_1 games → back into the data set.
Factors: K = number of residual blocks, F = number of filters.
47 Examining the progression
1. Values of the loss function.
2. Winning rate against random play.
3. Winning rate against the previous network.
4. Other statistics, such as the Black winning rate.
5. Print some games and check them by eye.