Reinforcement Learning
AlphaGo 的左右互搏 (AlphaGo playing against itself)
Tsung-Hsi Tsai, 2018-08-08, 統計研習營 (Statistics Workshop)

絕藝如君天下少，閒人似我世間無。別後竹窗風雪夜，一燈明暗覆吳圖。——杜牧《重送絕句》
(Roughly: a consummate art like yours is rare under heaven; an idle man like me is found nowhere in the world. After we part, on windy, snowy nights by the bamboo window, under a flickering lamp I will replay our games of Go.)

Overview
I. Story of AlphaGo
II. Algorithm of AlphaGo Zero
III. Experiments with a simple AlphaZero

I. Story of AlphaGo

To develop an AlphaGo, technically one needs:
• Human expert games (optional)
• Approach: reinforcement learning + deep neural networks
• Manpower: programming skill
• Computing power: CPUs + GPUs or TPUs

AlphaGo = dream (夢想) + hard work (努力) + timing and luck (時運), told as a story.

Three key persons
• Demis Hassabis (direction)
• David Silver (method)
• Aja Huang (implementation)

Demis Hassabis (born 1976)
• Age 13: chess master, ranked world No. 2 for his age.
• In 1993 designed the classic game Theme Park; in 1998 founded Elixir Studios, a games developer.
• 2009: PhD in cognitive neuroscience.
• In 2010 founded DeepMind.
• In 2014 started the AlphaGo project.

David Silver
• Demis Hassabis's partner in game development in 1998.
• Deep Q-Network (a breakthrough toward AGI, artificial general intelligence), demonstrated on Atari games.
• The idea of the value network (the breakthrough for the Go program).

Aja Huang (黃士傑)
• AlphaGo's "human arm": he placed the stones on the board.
• Amateur 6-dan Go player.
• In 2010 his Go program "Erica" won a computer Go championship.
• National Taiwan Normal University, computer science: master's thesis《電腦圍棋打劫的策略》(strategies for ko fights in computer Go), 2003; PhD thesis《應用於電腦圍棋之蒙地卡羅樹搜尋法的新啟發式演算法》(new heuristics for Monte Carlo tree search in computer Go), 2011.
• Joined DeepMind in 2012.

The birth of AlphaGo
• The dream started right after DeepMind joined Google in 2014.
• Research direction: deep learning and reinforcement learning.
• First achievement: a high-quality policy network for playing Go, trained on a large collection of human expert games.
• It beat the then-No. 1 Go program CrazyStone about 70% of the time.

A cool idea
• The most challenging part of a Go program is evaluating the board position.
• David Silver's new idea: self-play using the policy network to produce a large dataset of games. Training data: board positions labeled with the game result {0, 1}.
• This produced a high-quality evaluation of board positions, which the AlphaGo team called the value network.

Defeating a professional player
• AlphaGo's strength improved quickly after the value network was introduced.
• In October 2015 it beat the European Go champion Fan Hui (樊麾) 5:0, "a feat previously thought to be at least a decade away".
• On January 27, 2016, Nature published the AlphaGo paper together with the news of the win over Fan Hui, a Chinese-born professional 2 dan.

The next challenge
Lee Sedol (李世乭), 33 years old, the dominant world Go champion of this century.
• Challenge Lee Sedol.
• Compare Fan Hui and Lee Sedol.

Breaking news (號外): a notable breakthrough. AlphaGo beat Lee Sedol 4:1 in March 2016.

[Figure: AlphaGo's signature move, move 37 of game 2.]

[Figure: Lee Sedol's counterattack in game 4 (AlphaGo playing Black).]

AlphaGo upgrade
• Fixes for the weakness exposed in game 4.
• AlphaGo Master: from late 2016 it played about ten games a day on online Go servers, defeating the very best Chinese, Korean, and Japanese professionals, including Ke Jie (柯潔), Park Junghwan (朴廷桓), Iyama Yuta (井山裕太), and others.
• Retirement match: in May 2017 in Wuzhen, Zhejiang, China, it played a three-game even match against Ke Jie (then world No. 1), with a prize of US$1.5 million.
• DeepMind also released the records of 50 games of AlphaGo playing against itself.

DeepMind then made another breakthrough: AlphaGo Zero.

AlphaGo Zero
• The algorithm is based solely on reinforcement learning, without human data.
• It starts from completely random play and continues without human intervention.
• AlphaGo Zero is "simple", elegant, and seems universal for board games.

[Figure: progress of Go programs in Elo ratings, roughly 2005 to 2015.]

AlphaZero
A general approach applied to other board games such as chess and shogi (Japanese chess). David Silver presented the result at NIPS 2017.
II. Algorithm of AlphaGo Zero

Chess and similar games: the minimax algorithm with alternating moves. [Figure: a game tree with alternating max and min levels.]

Evaluation of the position. [Figure: an example evaluation.]

Monte Carlo tree search (four steps in one round: selection, expansion, simulation, backpropagation). Each node stores a ratio: # wins / # playouts.

Balancing exploitation and exploration (UCT, introduced in 2006)
Recommend the move with the highest value of

    w_i / n_i + c·√(ln t / n_i)

where
• w_i: # wins after the i-th move
• n_i: # simulations after the i-th move
• c: exploration parameter
• t: total # simulations for the node (the sum of the n_i)
(A short code sketch of this rule is given at the end of Part III.)

Monte Carlo tree search in AlphaGo
At each time step t of each simulation, an action (move) a_t is selected from state s_t by

    a_t = argmax_a [ Q(s_t, a) + u(s_t, a) ]

where Q(s, a) is the action value and u(s, a) is a bonus; both are determined by the policy network and the value network.

Determining the next move from MCTS
The distribution of recommended moves is π(a) ∝ N(s, a)^(1/τ), where s is the current state (board position), a is an action (move), N(s, a) is the visit count of the edge (s, a) in the tree, and τ is a temperature parameter (taken close to zero).

Self-play to generate data
• Use the policy network and value network to perform Monte Carlo tree search.
• AlphaGo Zero performs 1,600 MCTS simulations to select each move.
• Training data are triples (s, π, z), where s is the position, π is the distribution of recommended next moves, and z ∈ {−1, +1} is the outcome of the game.
• Loss function (as in the AlphaGo Zero paper): l = (z − v)² − πᵀ log p + c‖θ‖², where (p, v) are the network's policy and value outputs and c weights the L2 regularization.

Policy network (predicts the best next move) and value network (predicts the game result). [Figure: the two networks.]

New network
AlphaGo Zero combines the policy network and the value network into a single two-headed network (two outputs).

Network architecture
A single convolutional block followed by either 19 or 39 residual convolutional blocks built from the following modules:
• a convolution of 256 filters of kernel size 3×3 with stride 1,
• batch normalization,
• a rectifier nonlinearity (ReLU).
The output of "the residual tower" is passed into two separate "heads" (fully connected layers) that compute the policy and the value.

Training with 20 blocks: AlphaGo Zero won 100:0 against AlphaGo Lee.

Training with 40 blocks: AlphaGo Zero won 89:11 against AlphaGo Master. (See the AlphaGo Zero 对局研究 game commentaries.)

III. Simple experiments

Question: can simple reinforcement learning, without MCTS, master Go?

We implement it on simple variants of Go.

Atari Go, or the capture game: the game is won as soon as a player captures the opponent's stones.

More examples. [Figure: further example games.]

Advanced strategy: the strategy is not only to capture the opponent's stones but also to build one's own territory.

Training pipeline
Self-play using the current network fills a game pool; data picked from the pool are used to train the network; the resulting better network is then used for further self-play. [Figure: diagram of this loop.]

Playing a game using the network
Apply the network's prediction. Input: w × w × 3, where w is the board size. Example: a board position with Black to move.

Choosing a move from the network output
• Network output: w × w values, each in (0, 1).
• Exclude invalid points.
• Compute the distribution of recommended moves.
• Choose a move randomly according to this distribution.

Network (Keras code); a minimal sketch is given below
1. Convolution (3×3) layers; the number of filters is denoted by F.
2. The number of residual blocks is denoted by K.
3. Batch normalization & ReLU activation.
4. Flatten before the fully connected output layer.

Training the network
1. Labeling data (assigning the incentive): inputs are w × w × 3 arrays (taking all positions from the winning side); the label is 1 at the point of the next move and 0 at every other point.
2. Pick training data randomly from the game pool.
3. Compile with the loss function 'categorical_crossentropy'.
[Figure: an example game position and its label.]

Program flowchart & factors
Initially the data set is filled with random-play games. Self-play games are inserted into (and old games deleted from) a data set of size n_game; N data points are picked to train the network (K = # residual blocks, F = # filters); the renewed network then plays n_1 further self-play games, and the cycle repeats. [Figure: flowchart of this loop.]
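To make the UCT rule from "Balancing exploitation and exploration" concrete, here is a minimal Python sketch in the notation of that slide; the (move, wins, visits) tuples and the default value of c are illustrative assumptions, not from the talk.

```python
import math

def uct_score(wins, visits, parent_visits, c=1.4):
    """UCT value w_i/n_i + c*sqrt(ln t / n_i) for one candidate move; c ~ sqrt(2) is a common default."""
    if visits == 0:
        return float("inf")  # always try an unvisited move first
    return wins / visits + c * math.sqrt(math.log(parent_visits) / visits)

def select_move(children):
    """children: list of (move, wins, visits); return the move with the highest UCT value."""
    t = sum(visits for _, _, visits in children)  # total simulations for the node
    return max(children, key=lambda ch: uct_score(ch[1], ch[2], t))[0]
```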
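Likewise, the move distribution π(a) ∝ N(s, a)^(1/τ) from "Determining the next move from MCTS" can be sketched in a few lines; the example visit counts and the cutoff used for "τ tending to zero" are illustrative.

```python
import numpy as np

def move_distribution(visit_counts, tau=1.0):
    """pi(a) proportional to N(s, a)^(1/tau); a small tau concentrates mass on the most-visited move."""
    counts = np.asarray(visit_counts, dtype=np.float64)
    if tau < 1e-3:                          # tau -> 0: play the most-visited move deterministically
        pi = np.zeros_like(counts)
        pi[np.argmax(counts)] = 1.0
        return pi
    powered = counts ** (1.0 / tau)
    return powered / powered.sum()

# e.g. move_distribution([10, 40, 5], tau=1.0) is roughly [0.18, 0.73, 0.09]
```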
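The "Network (Keras code)" slide lists the ingredients of the simple-experiment network without showing the code itself. The following is a minimal sketch assuming those ingredients (F filters, K residual blocks, batch normalization + ReLU, a flatten layer, a softmax over the w × w points, and the 'categorical_crossentropy' loss); the layer sizes, names, and optimizer are assumptions, not the speaker's actual code. AlphaGo Zero's full network would add a second, value head on top of the residual tower.

```python
from tensorflow.keras import layers, models

def build_network(w=9, F=64, K=3):
    """Small residual policy network: input w x w x 3 planes, output a distribution over the w*w points."""
    inputs = layers.Input(shape=(w, w, 3))
    x = layers.Conv2D(F, 3, padding="same")(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    for _ in range(K):                      # K residual blocks: Conv-BN-ReLU-Conv-BN + skip connection
        shortcut = x
        x = layers.Conv2D(F, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
        x = layers.Conv2D(F, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.add([x, shortcut])
        x = layers.ReLU()(x)
    x = layers.Flatten()(x)                 # flatten before the fully connected output layer
    outputs = layers.Dense(w * w, activation="softmax")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    return model
```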
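"Choosing a move from the network output" then reduces to masking invalid points, renormalizing, and sampling; the board encoding and the legal-move mask below are assumed for illustration.

```python
import numpy as np

def choose_move(model, board_planes, legal_mask):
    """board_planes: (w, w, 3) array; legal_mask: (w*w,) 0/1 array marking valid points."""
    probs = model.predict(board_planes[np.newaxis], verbose=0)[0]  # network output over the w*w points
    probs = probs * legal_mask                                     # exclude invalid points
    probs = probs / probs.sum()                                    # renormalize to a distribution
    return int(np.random.choice(len(probs), p=probs))              # index of the sampled point
```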
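Steps 1 and 2 of "Training the network" (one-hot labels on the winner's next move, sampled from the game pool) might look like the following sketch; game_pool and its record format are hypothetical, not from the talk.

```python
import random
import numpy as np

def make_training_batch(game_pool, n_samples, w):
    """game_pool: list of (board_planes, next_move_index) pairs taken from the winning side of each game."""
    batch = random.sample(game_pool, n_samples)
    X = np.stack([planes for planes, _ in batch])   # shape (n_samples, w, w, 3)
    y = np.zeros((n_samples, w * w))
    for i, (_, move) in enumerate(batch):
        y[i, move] = 1.0                            # 1 at the played point, 0 elsewhere
    return X, y

# one training step on the picked data, e.g.:
# X, y = make_training_batch(game_pool, 256, w=9)
# model.fit(X, y, epochs=1)
```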
Examining the progression
1. Values of the loss function.
2. Winning rate against random play.
3. Winning rate against the previous network.
4. Other statistics, such as the winning rate of Black.
5. Print some games and check them by hand.
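A rough way to track item 2 above (the winning rate against random play) is sketched below; play_game and the two player objects are hypothetical helpers, not part of the talk.

```python
def winning_rate_vs_random(network_player, random_player, play_game, n_games=100):
    """Estimate the network's winning rate against a uniformly random player."""
    wins = 0
    for g in range(n_games):
        if g % 2 == 0:  # alternate colours so neither player always moves first
            wins += play_game(black=network_player, white=random_player) == "black"
        else:
            wins += play_game(black=random_player, white=network_player) == "white"
    return wins / n_games
```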