
Reinforcement Learning

AlphaGo's self-play: the left hand sparring with the right (左右互搏)

Tsung-Hsi Tsai, 2018/8/8, Statistics Workshop (統計研習營)

1 絕藝如君天下少，閒人似我世間無。別後竹窗風雪夜，一燈明暗覆吳圖。
"Consummate skill like yours is rare under heaven; an idle man like me is found nowhere in the world. After we part, on windy, snowy nights by the bamboo window, I will replay our game of Go under a flickering lamp." (Du Mu 杜牧, 《重送絕句》)

2 Overview

I. Story of AlphaGo

II. Algorithm of AlphaGo Zero

III. Experiments with a simple AlphaZero

3 Story of AlphaGo

4 To develop an AlphaGo

• Human expert games (optional)

• Approach: reinforcement learning + deep neural network

• Manpower: programming skill

• Computing power: CPUs + GPUs or TPUs

5 AlphaGo

= dream + effort + good timing (夢想 + 努力 + 時運)

storytelling

6 Three key persons

• Demis Hassabis (direction)

• David Silver (method)

• Aja Huang (implementation)

7 Demis Hassabis (1976)

• At age 13, the world No. 2 chess player in his age group

• In 1993, designed the classic game Theme Park; in 1998 founded Elixir Studios, a games developer

• In 2009, completed a PhD in Cognitive Neuroscience

• In 2010, founded DeepMind

• In 2014, started the AlphaGo project

8 David Silver

• Demis Hassabis's partner in game development in 1998

• Deep Q-Network (a breakthrough toward AGI, artificial general intelligence), demonstrated on Atari games

• The idea of the value network (the breakthrough for Go programs)

9 Aja Huang

• AlphaGo's "human arm": he placed AlphaGo's stones on the board during matches
• Amateur 6-dan Go player
• In 2010, the Go program "Erica" developed by Aja Huang (黃士傑) won a tournament championship
• Graduate institute of computer science, National Taiwan Normal University (臺師大資工所); master's thesis 《電腦圍棋打劫的策略》 (strategies for ko fights in computer Go, 2003), PhD thesis 《應用於電腦圍棋之蒙地卡羅樹搜尋法的新啟發式演算法》 (new heuristics for Monte Carlo tree search applied to computer Go, 2011)
• Joined DeepMind in 2012

10 The birth of AlphaGo

• The DREAM started right after DeepMind joined Google in 2014.
• Research direction: deep neural networks and reinforcement learning.
• First achievement: a high-quality policy network for playing Go, trained on a large collection of human expert games.
• It beat the then No. 1 Go program, CrazyStone, with a 70% winning rate.

11 A cool idea

• The most challenging part of a Go program is evaluating the board position.
• David Silver's new idea: self-play using the policy network to produce a large collection of games. Training data: board positions labeled with the game result {0, 1}.
• This gave a high-quality evaluation of the board position, which the AlphaGo team called the value network.

12 Defeating a professional player

• The strength of AlphaGo improved quickly after the value network was introduced.
• In Oct. 2015, it beat the European Go champion Fan Hui (樊麾) 5:0, "a feat previously thought to be at least a decade away".
• On Jan. 27, 2016, Nature published the AlphaGo paper, together with the news of the win over Fan Hui, a Chinese professional 2 dan.

13 Next Challenge

Lee Sedol (李世乭, age 33)

the world Go champion of this century

• Challenge Lee Sedol
• Compare Fan Hui & Lee Sedol

14 Extra! (號外) Notable breakthrough

Beat Lee 4:1 in March 2016

15 AlphaGo's signature move: move 37 of game 2

16 Lee Sedol's counterattack in game 4

(AlphaGo playing black)

17 AlphaGo upgrade

• Fixed the bug revealed in game 4.

• AlphaGo Master: from the end of 2016 it played continuously on online Go platforms, at a pace of 10 games a day, "cutting down" the very best players of China, Korea, and Japan, including 柯潔 (Ke Jie), 朴廷桓 (Park Junghwan), 井山裕太 (Iyama Yuta), and others.

• Retirement match: in May 2017, in Wuzhen, Zhejiang, China, it played a three-game even match against 柯潔 (the world No. 1), with US$1.5 million in prize money.

• 50 games of AlphaGo playing against itself were released publicly.

18 DeepMind makes another breakthrough: AlphaGo Zero

19 AlphaGo Zero

• The algorithm is based solely on reinforcement learning, without human data.

• It starts from completely random behavior and continues without human intervention.

• AlphaGo Zero is "simple", elegant, and seems universal for board games.

20 Progress in Elo ratings


21 AlphaZero

A general approach that was applied to other board games such as chess and shogi (Japanese chess).

David Silver presented the result at NIPS 2017.

22 Algorithm of AlphaGo Zero

23 Chess, …: the minimax algorithm with alternating moves

[Diagram: a game tree with alternating max and min levels]
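As a concrete illustration of the alternating max/min levels, here is a minimal minimax sketch in Python; the helpers legal_moves, play, is_terminal, and evaluate are hypothetical stand-ins for a two-player game, not part of the original slides.

```python
def minimax(state, depth, maximizing):
    """Plain minimax with alternating max and min levels.
    `legal_moves`, `play`, `is_terminal`, `evaluate` are assumed helpers
    for a hypothetical two-player game."""
    if depth == 0 or is_terminal(state):
        return evaluate(state)          # static evaluation of the position
    if maximizing:
        return max(minimax(play(state, m), depth - 1, False)
                   for m in legal_moves(state))
    else:
        return min(minimax(play(state, m), depth - 1, True)
                   for m in legal_moves(state))
```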

24 Evaluation of the position

example

25 Monte Carlo tree search

(4 steps in one round: selection, expansion, simulation, backpropagation)

Each node stores a ratio: # wins / # playouts

26 Balancing exploitation and exploration (introduced in 2006)

Recommendation: choose the move with the highest value of

w_i / n_i + c · sqrt( ln t / n_i )

where
w_i : # wins after the i-th move
n_i : # simulations after the i-th move
c : exploration parameter
t : total # simulations for the node
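A small sketch of this selection rule (the UCB1/UCT formula of Monte Carlo tree search), using the symbols defined above; the default value of c is only an illustrative choice.

```python
import math

def uct_score(w_i, n_i, t, c=math.sqrt(2)):
    """UCT value of a move: exploitation (w_i / n_i) plus exploration bonus."""
    if n_i == 0:
        return float('inf')             # unvisited moves are tried first
    return w_i / n_i + c * math.sqrt(math.log(t) / n_i)

def select_move(children, t):
    """children: list of (move, wins, visits); pick the move with the highest UCT score."""
    return max(children, key=lambda mwn: uct_score(mwn[1], mwn[2], t))[0]
```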

27 Monte Carlo tree search in AlphaGo

At each time step t of each simulation, an action (or move) a_t is selected from state s_t:

a_t = argmax_a ( Q(s_t, a) + u(s_t, a) )

where Q(s, a) is the action value and u(s, a) is the bonus; both are determined with the help of the policy network and the value network.
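A hedged sketch of this selection step; the PUCT-style form of the bonus u(s, a), decaying with the visit count N(s, a) and proportional to the policy network's prior P(s, a), follows the published AlphaGo papers, and the constant c_puct is an assumed parameter.

```python
import math

def select_action(Q, prior, N, c_puct=1.0):
    """One selection step: a_t = argmax_a [ Q(s,a) + u(s,a) ].
    Q (action values), prior (policy-network probabilities P(s,a)) and
    N (visit counts) are dicts keyed by action; c_puct is an assumed constant."""
    total_visits = sum(N.values())
    def u(a):
        return c_puct * prior[a] * math.sqrt(total_visits) / (1 + N[a])
    return max(Q, key=lambda a: Q[a] + u(a))
```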

28 Determine next move from MCTS

The distribution of recommended moves is

π(a) ∝ N(s, a)^(1/τ)

where s is the current state (board position), a is an action (move), N(s, a) is the visit count of the edge (s, a) in the tree, and τ (tending to zero) is a temperature parameter.
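A minimal sketch of turning the visit counts N(s, a) into this distribution, assuming the counts are stored in a Python dict keyed by move:

```python
def move_distribution(visit_counts, tau=1.0):
    """pi(a) proportional to N(s,a)^(1/tau); small tau sharpens toward the most-visited move."""
    powered = {a: n ** (1.0 / tau) for a, n in visit_counts.items()}
    total = sum(powered.values())
    return {a: p / total for a, p in powered.items()}

# Example: as tau tends to zero, the distribution concentrates on the most-visited move.
print(move_distribution({'A': 80, 'B': 15, 'C': 5}, tau=0.1))
```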

29 Self-Play to generate data

• Use the policy network and value network to perform Monte Carlo tree search.
• AlphaGo Zero performs 1,600 MCTS simulations to select each move.
• Training data: (s, π, z), where s is the position, π is the distribution of recommended next moves, and z ∈ {−1, +1} is the outcome of the game.
• Loss function: l = (z − v)² − πᵀ log p + c‖θ‖², where (p, v) is the network output and θ are the network weights.
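For one training sample, the loss can be written out as below; this NumPy sketch keeps only the squared value error and the policy cross-entropy, leaving the L2 weight-regularization term c‖θ‖² to the optimizer.

```python
import numpy as np

def alphago_zero_loss(pi, z, p, v, eps=1e-12):
    """Combined loss for one sample: (z - v)^2 - pi . log(p).
    pi: MCTS move distribution (target), z: game outcome in {-1, +1},
    p: predicted move probabilities, v: predicted value.
    (The paper adds an L2 regularization term on the network weights.)"""
    value_loss = (z - v) ** 2
    policy_loss = -np.sum(pi * np.log(p + eps))
    return value_loss + policy_loss
```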

30 Policy network (predicts the perfect move) and value network (predicts the game result)

31 New network

AlphaGo Zero combines the policy network and the value network into a single two-head (two-output) network.

32 Network architecture

Consists of a single convolutional block followed by either 19 or 39 residual convolutional blocks, built from the following modules:
• A convolution of 256 filters of kernel size 3 × 3 with stride 1
• Batch normalization
• A rectifier nonlinearity (ReLU)
The output of "the residual tower" is passed into two separate "heads" (fully connected layers) for computing the policy and the value, as sketched below.
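A condensed Keras sketch of this architecture; the residual tower follows the modules listed above, while the number of input planes and the head layout are simplifying assumptions, not the exact configuration of AlphaGo Zero's heads.

```python
from tensorflow import keras
from tensorflow.keras import layers

def conv_bn_relu(x, filters=256):
    """Convolution of 256 filters, 3x3, stride 1, then batch normalization and ReLU."""
    x = layers.Conv2D(filters, 3, strides=1, padding='same')(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def residual_block(x, filters=256):
    """Residual block: two convolution/batch-norm stages plus a skip connection."""
    y = conv_bn_relu(x, filters)
    y = layers.Conv2D(filters, 3, padding='same')(y)
    y = layers.BatchNormalization()(y)
    return layers.ReLU()(layers.Add()([x, y]))

def two_head_network(board_size=19, n_blocks=19, planes=17):
    """Residual tower feeding two heads: a policy over moves and a scalar value.
    The plane count and head sizes are assumptions made for this sketch."""
    inputs = keras.Input(shape=(board_size, board_size, planes))
    x = conv_bn_relu(inputs)                      # single convolutional block
    for _ in range(n_blocks):                     # the residual tower
        x = residual_block(x)
    flat = layers.Flatten()(x)
    policy = layers.Dense(board_size * board_size + 1, activation='softmax',
                          name='policy')(flat)    # all board points plus pass
    value = layers.Dense(1, activation='tanh', name='value')(flat)
    return keras.Model(inputs, [policy, value])
```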

33 Training under 20 blocks

With 20 blocks, AlphaGo Zero won 100 : 0 against AlphaGo Lee.

34 Training under 40 blocks

With 40 blocks, AlphaGo Zero won 89 : 11 against AlphaGo Master.

(Link: AlphaGo Zero 对局研究, game analyses)

35 Simple experiments

36 Question

Is simple reinforcement learning, without using MCTS, able to master Go?

37 Implemented on simple variants of Go.

38 Atari Go or the Capture Game

A player wins the game upon capturing the opponent's stones.

39 More examples

40 Advanced strategy

The strategy involves not only capturing the opponent's stones but also building one's own territory.

41 Training pipeline

• Self-play using the current network to generate games
• Insert the games into the game pool
• Pick data from the pool for training
• Train the network to obtain a better network, then repeat (a sketch of this loop follows)
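A sketch of this loop in Python; self_play and sample_batch are hypothetical helpers, and the schedule constants are arbitrary.

```python
def training_pipeline(network, n_iterations=100, games_per_iter=100, batch_size=2048):
    """Self-play training loop sketch; self_play and sample_batch are hypothetical helpers."""
    game_pool = []
    for _ in range(n_iterations):
        game_pool += self_play(network, n_games=games_per_iter)   # generate new games
        states, labels = sample_batch(game_pool, batch_size)      # pick data for training
        network.fit(states, labels, epochs=1, verbose=0)          # train -> better network
        # optionally: keep the new network only if it beats the previous one
    return network
```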

42 Play game using network

Apply the prediction of the network. Input: W×W×3, where W is the board size.

Example: a board position with black to move next
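One plausible way to build the W×W×3 input tensor; the exact meaning of the three planes is not stated on the slide, so the layout below (own stones, opponent stones, colour to move) is an assumption.

```python
import numpy as np

def encode_board(board, black_to_move, W):
    """board: W x W array with +1 for black stones, -1 for white, 0 for empty.
    Returns a W x W x 3 tensor: plane 0 = stones of the side to move,
    plane 1 = opponent stones, plane 2 = constant colour-to-move plane.
    (The exact plane layout used on the slide is an assumption.)"""
    board = np.asarray(board)
    me, opp = (1, -1) if black_to_move else (-1, 1)
    x = np.zeros((W, W, 3), dtype=np.float32)
    x[..., 0] = (board == me)
    x[..., 1] = (board == opp)
    x[..., 2] = 1.0 if black_to_move else 0.0
    return x
```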

43 Choose move using network output

• Network output: W×W values, each in (0, 1)

• Exclude invalid points.

• Compute the distribution of recommended moves.

• Choose a move randomly according to the distribution (see the sketch below).
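These four steps can be written as a short routine; valid_mask, marking the legal points, is assumed to come from the game logic.

```python
import numpy as np

def choose_move(network_output, valid_mask):
    """network_output: W x W values in (0,1); valid_mask: W x W booleans for legal points.
    Masks out invalid points, normalizes to a distribution, and samples a move."""
    probs = network_output.flatten() * valid_mask.flatten()
    probs = probs / probs.sum()                       # distribution of recommended moves
    idx = np.random.choice(len(probs), p=probs)       # random choice by the distribution
    W = network_output.shape[0]
    return divmod(idx, W)                             # (row, col) of the chosen move
```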

44 Network (Keras code)

1. Convolution (3×3) layers: number of filters (denoted by F)

2. Number of residual blocks (denoted by K)

3. Batch normalization & ReLU activation

4. Flatten before the fully connected output layer
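The slide shows the Keras code as an image; a minimal sketch consistent with the four points above (F filters, K residual blocks, batch normalization + ReLU, flatten before the output layer) could look like the following, with all hyperparameter values merely illustrative.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_network(W, F=64, K=4):
    """Policy network for the W x W board experiment (sketch): F filters, K residual blocks."""
    inputs = keras.Input(shape=(W, W, 3))
    x = layers.ReLU()(layers.BatchNormalization()(
        layers.Conv2D(F, 3, padding='same')(inputs)))
    for _ in range(K):                              # residual blocks
        y = layers.ReLU()(layers.BatchNormalization()(
            layers.Conv2D(F, 3, padding='same')(x)))
        y = layers.BatchNormalization()(layers.Conv2D(F, 3, padding='same')(y))
        x = layers.ReLU()(layers.Add()([x, y]))
    x = layers.Flatten()(x)                         # flatten before the output layer
    outputs = layers.Dense(W * W, activation='softmax')(x)   # one probability per point
    return keras.Model(inputs, outputs)
```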

45 Training network

1. Labeling the data (giving the incentive): data are W×W×3 tensors, taking all positions played by the winning side; the label marks the point of the next move with 1 and every other point with 0.
2. Pick training data randomly from the game pool.
3. Compile with the loss function 'categorical_crossentropy' (a sketch follows).

[Example: a game position (W×W×3 data) and its one-hot label]
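A sketch of steps 1 to 3; the game_pool record format (a dict with 'board' and 'move' keys) is an assumption made for illustration.

```python
import random
import numpy as np

def make_label(move, W):
    """One-hot label over the W*W points: 1 at the chosen next move, 0 elsewhere."""
    label = np.zeros(W * W, dtype=np.float32)
    label[move[0] * W + move[1]] = 1.0
    return label

def train_once(model, game_pool, W, N=4096):
    """Pick N (position, next-move) samples from the winning sides and fit the network."""
    samples = random.sample(game_pool, min(N, len(game_pool)))
    X = np.stack([s['board'] for s in samples])                 # W x W x 3 inputs
    y = np.stack([make_label(s['move'], W) for s in samples])
    model.compile(optimizer='adam', loss='categorical_crossentropy')
    model.fit(X, y, epochs=1, verbose=0)
```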

46 Program flowchart & factors

• Initially the data set is filled by random play (pool size: n_game); games are inserted and deleted as training proceeds.
• Each iteration: self-play n_1 games, pick N data points from the pool, and renew the network.
• Network factors: K = # residual blocks, F = # filters.

47 Examining the progression

1. Values of the loss function.

2. Winning rate against random play (see the sketch below).

3. Winning rate against the previous network.

4. Other statistics, such as the black winning rate?

5. Print some games and check them by hand.
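Items 2 and 3 amount to a simple estimate such as the following; play_game is a hypothetical helper returning +1 when the first player wins.

```python
def winning_rate(network, opponent, n_games=200):
    """Estimate the winning rate of `network` against another policy
    (e.g. random play or the previous network)."""
    wins = sum(play_game(network, opponent) == 1 for _ in range(n_games))
    return wins / n_games
```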

48