
Mastering the game of Go from scratch

Michael Painter *1 Luke Johnston *1

*Equal contribution  1Stanford University, Palo Alto, USA.

Abstract

In this report we pursue a transfer-learning inspired approach to learning to play the game of Go through pure self-play reinforcement learning. We train a policy network on a 5×5 Go board, and evaluate a mechanism for transferring this knowledge to a larger board size. Although our model did learn a few interesting strategies on the 5×5 board, it never achieved human level, and the transfer learning to a larger board size yielded neither faster convergence nor better play.

1. Introduction

In the recent paper "Mastering the game of Go with deep neural networks and tree search" [1], superhuman performance on the game of Go was achieved with a combination of supervised learning (from professional Go games) and reinforcement learning (from self-play). Traditional pure reinforcement learning approaches have not yielded satisfactory results on the game of Go on a 19×19 board because its state space is so large and its optimal value function so complex that learning from self-play is infeasible. However, neither of these limitations applies to smaller board sizes (for example, 5×5).

In this paper, we attempt to investigate this problem. Specifically, we evaluate an entirely reinforcement-learning based approach to 'mastering the game of Go', which first learns how to play on smaller board sizes, and then uses a form of transfer learning to learn to play on successively larger board sizes (without ever consulting data from expert play).

The inspiration for the transfer learning component comes from the observation that humans almost always learn large tasks by breaking them up into simpler components first. A human learning the game of Go would be taught techniques and strategies at a much smaller scale than the full 19×19 board, and only after mastering these concepts would they have any chance of mastering the larger board. So our ultimate goal is to set up a general framework for this sort of transfer learning, so that complex tasks such as Go can be learned without reference to expert data.

In a more general setting, it is an interesting question to consider how we can, in the absence of human professional datasets, apply machine learning techniques (specifically reinforcement learning techniques) to problems with large state spaces. We pursue this unsupervised learning approach due to one main motivation: for tasks which lack expert data, or for which no expert exists, supervised training is impossible. Hence, this area is vital to the development of general artificial intelligence (GAI), since no GAI will be able to rely on expert data for all of its tasks.

Full of hope and optimism, we had hoped to coin the term "BetaGo" for our agent (excuse the pun). However, as will become clear in the remainder of the report, we settled for the more apt "ZetaGo", leaving Beta, Gamma, and, well, the rest of the Greek alphabet available for more capable AI.

2. Related Work

2.1. TD-gammon

TD-gammon [6] was the first reinforcement learning agent to achieve human master level play on a board game (backgammon) using only self-play. A neural network is used to estimate the value function of the game state, and the TD update is applied after every step:

\[ \Delta w_t = \alpha \big[ V(s_{t+1}) - V(s_t) \big] \sum_{k=1}^{t} \lambda^{t-k} \, \nabla_w V(s_k) \]

where $\alpha$ is the learning rate, $w$ the weights of the network, $V$ the value function, and $\lambda$ the future reward discount. Starting from no knowledge, and only playing against itself, this network is able to learn a value function that is competitive with human masters using the simple TD learning update above.

However, the game of backgammon has many advantages for this approach that other games, like Go, lack. First, backgammon state transitions are nondeterministic, so a form of random exploration is inherently built into the policy. Second, simple evaluations of the board state are relatively straightforward, as the board is one-dimensional and the inherent objective is simply to move stones along the board in one direction (although the strategy is complex). It is also worthwhile noting that the TD-gammon agent uses human-constructed features for the input in addition to the basic board state. Finally, a 19×19 Go board has a much, much larger state space than backgammon - approximately 10^170 versus 10^20.

2.2. Policy Networks

A policy network approximates a stochastic policy with a neural network. The policy gradient theorem [7] states that for a policy function approximator $\pi$ with parameters $\theta$, we can compute the derivative of the expected total discounted reward $\rho$ with respect to the parameters $\theta$ as follows:

\[ \frac{\partial \rho}{\partial \theta} = \sum_{s} d^{\pi}(s) \sum_{a} \frac{\partial \pi(s, a)}{\partial \theta} \, Q^{\pi}(s, a) \]

where the $s$ are all states, the $a$ are all actions, $d^{\pi}$ is the stationary distribution of states according to the policy, and $Q$ is an estimate of the expected discounted reward from taking action $a$ at state $s$. In this paper we use the actual returns, $Q(s_t, a_t) = R_t = \sum_{k=1}^{\infty} \gamma^{k-1} r_{t+k}$, although $Q$ can also be approximated with another function approximator. In either case, the policy will converge to a locally optimal policy [7]. Policy networks have been used to great success in recent reinforcement learning tasks, such as AlphaGo below [1], and the A3C algorithm on diverse tasks such as navigating 3D environments and playing Atari games [8].
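To make the TD(λ) update of Section 2.1 concrete, the following is a minimal numpy sketch for a linear value approximator. It is illustrative only: TD-gammon itself used a neural network over handcrafted backgammon features and special handling of the terminal outcome, and the function and variable names here are our own.

```python
import numpy as np

def td_lambda_update(phis, w, alpha=0.1, lam=0.7):
    """One episode of the TD(lambda) update from Section 2.1 for a linear
    value function V(s) = w . phi(s).

    `phis` is the sequence of feature vectors phi(s_0), ..., phi(s_T) for one
    self-play game.  For a linear V, grad_w V(s_k) is just phi(s_k), so the sum
    over k in the equation becomes an eligibility trace updated incrementally.
    """
    trace = np.zeros_like(w)
    for t in range(len(phis) - 1):
        trace = lam * trace + phis[t]            # sum_k lam^(t-k) grad_w V(s_k)
        v_t = w @ phis[t]
        v_next = w @ phis[t + 1]
        w = w + alpha * (v_next - v_t) * trace   # Delta w_t from the equation above
    return w

# Toy usage: ten random 5-dimensional feature vectors standing in for board features.
rng = np.random.default_rng(0)
w = td_lambda_update(list(rng.normal(size=(10, 5))), w=np.zeros(5))
```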

2.3. AlphaGo

AlphaGo [1] is an AI capable of defeating human masters at the game of Go. This feat was considered a milestone of AI and was accomplished recently, in the spring of 2016. To achieve this, they first train a 13-layer convolutional policy network to predict expert play. Then, they initialize a second policy network with these results, and train it further with reinforcement learning from self-play. During self-play, network parameters are saved every 500 iterations, and the opponent is selected randomly from one of those saves (to reduce overfitting to the current policy). Thirdly, they train a value network to estimate the value function for each state. It has the same architecture as the convolutional policy network, except the final layer, which outputs a single value instead of policy logits. This network was trained on 30 million game positions, each a random position from a distinct self-play game. Positions are taken from unique games to reduce correlation in the training data and prevent overfitting. Finally, a form of Monte-Carlo tree search parameterized by the policy and value networks is utilized for action selection during both training and testing.

3. Approach

3.1. OpenAI Gym

The OpenAI Gym toolkit for reinforcement learning [2] provides an environment for learning the game of Go on both 9×9 and 19×19 sized boards, playing against an opponent controlled by the Pachi open-source Go program [3]. For our project, we added our own environments for each board width from 5 to 19, and modified the environment to allow self-play (we made it a two-player environment instead of a single-player environment where the opponent is always Pachi). In this environment, for a board width of W, the current player has W² + 2 actions: play at any location on the board, resign, or pass. If the agent attempts to make an impossible move, this is interpreted as resignation. Since this results in extremely frequent resignations in the early stages of training, we masked out the probabilities of obvious resignations, disallowing the model from attempting to play on top of an existing piece. The game ends when either player resigns, or when both players pass in succession (at which point the board is evaluated and the winner is determined). Rewards are 0 until the terminal state, at which point a reward of 1 indicates a win, −1 a loss, and 0 a draw. The state of the board is represented as a (W × W × 3) array, where each entry is either 0 or 1, and one channel represents black stones, one white, and one empty spaces.

The OpenAI Gym toolkit also provides a Pachi agent implementation [3], which uses a standard UCB1 search policy with Monte-Carlo tree search (described in section 3.6).

3.2. Policy network reinforcement learning

We train a policy network to estimate the probability of each action given a state at time t:

\[ p_\theta(s_t) \in \mathbb{R}^{|A|} \]

where $p_\theta$ is the policy function with parameters $\theta$, $s_t$ is a state of the game, and $|A|$ is the number of actions. These values are normalized to probabilities with a softmax layer.

To train the policy network we follow a similar approach to [1]. For a minibatch of size n, the policy network plays n games against a randomly selected previous iteration of the policy network. According to [1], this randomization prevents overfitting to the current policy. Let the outcome of the i-th game be $r^i$ at terminal turn $T^i$. Then the policy update we implement is

\[ \Delta\theta = \frac{\alpha}{n} \sum_{i=1}^{n} \sum_{t=1}^{T^i} \frac{\partial \log p_\theta(a^i_t \mid s^i_t)}{\partial \theta} \, r^i \]

Note that this represents a modification from [1], which includes a baseline estimate of the value function for variance reduction.
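As a concrete illustration of this update, here is a minimal numpy sketch that uses a log-linear softmax policy in place of the convolutional network, so that the gradient of the log-probability has a closed form. The function names and the (features, actions, outcome) representation of a game are our own illustrative choices, not the project's actual code.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def policy_update(theta, games, alpha=0.01):
    """Policy update of Section 3.2 for a minibatch of self-play games.

    `theta` has shape (num_actions, feature_dim) and defines a log-linear policy
    p_theta(a | s) = softmax(theta @ phi(s)).  `games` is a list of
    (features, actions, outcome) tuples: per-move feature vectors phi(s_t), the
    chosen action indices a_t, and the final reward r^i (+1 win, -1 loss, 0 draw)
    from that player's perspective.  For this policy,
    grad_theta log p(a | s) = (onehot(a) - p) phi(s)^T, so the accumulation below
    is exactly (alpha / n) * sum_i sum_t grad log p_theta(a_t | s_t) * r^i.
    """
    grad = np.zeros_like(theta)
    for features, actions, outcome in games:
        for phi, a in zip(features, actions):
            p = softmax(theta @ phi)
            onehot = np.zeros(len(p))
            onehot[a] = 1.0
            grad += np.outer(onehot - p, phi) * outcome
    return theta + alpha * grad / len(games)
```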

The update used in [1] is

\[ \Delta\theta = \frac{\alpha}{n} \sum_{i=1}^{n} \sum_{t=1}^{T^i} \frac{\partial \log p_\theta(a^i_t \mid s^i_t)}{\partial \theta} \, \big( r^i - v_\rho(s^i_t) \big) \]

where $v_\rho$ is the value network learned in [1]. We chose to omit this optimization for computational speed, although, as we discuss in section 4.1.3, it appears to be crucial for stable and fast convergence [7]. Every 500 steps, the current policy network is saved to the list of previous policy networks, for random sampling of the opponent policy.

3.3. Small model (5 x 5 board)

3.3.1. BOARD SIZE MOTIVATION

In order for transfer learning to work effectively, we first need to train an effective agent on a small board size; however, we need it to be able to learn about structures that appear on full-size Go boards. We therefore choose to start with a 5×5 board, because on any smaller board size, structures such as eyes cannot be formed.

Furthermore, for evaluation against the Pachi agent, due to the smaller board sizes, if Pachi is moving second, Pachi almost always resigns immediately. On the 5×5 board, the Pachi agent will still resign against a central first move, but will play against any other move. Even so, we found this board size to be a challenge for training (it was difficult for the agent to learn that the middle is the optimal move).

3.3.2. POLICY NETWORK ARCHITECTURE

The architecture for the 5×5 network was heavily inspired by [1]. The board input described in section 3.1 is passed through Cs convolutional layers, each with ks filters. The first layer has its filter size set to 5, the rest are set to 3. The outputs of each layer are passed through a ReLU nonlinearity. The output of the final layer is the feature extractor of the small network, which will be used for the global scaling transfer learning below. In order to obtain the policy from these features, they are passed through one final convolutional layer with a single filter of size 1. The resulting values are passed through a two-dimensional softmax, to obtain probability estimates for playing in each location on the board. As mentioned in section 3.1, these probabilities are zeroed for all actions that would place a stone on top of another existing stone. However, the model can still perform invalid moves if it attempts to suicide a single stone. This is considered forfeiting.
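A sketch of this architecture in Keras is shown below, assuming TensorFlow is available. It is illustrative only (the project's actual code and framework are not reproduced here), and the pass/resign actions and the invalid-move mask of Section 3.1 are omitted.

```python
import tensorflow as tf

def build_small_policy_net(board_width=5, num_feature_layers=5, num_filters=32):
    """Sketch of the 5x5 policy network of Section 3.3.2.

    Input is the (W, W, 3) board encoding from Section 3.1; 'same' padding keeps
    the spatial shape, as described in the poster.  The final 1x1 convolution
    produces one logit per board location, and the softmax over the flattened
    board gives the move probabilities.
    """
    board = tf.keras.Input(shape=(board_width, board_width, 3))
    x = tf.keras.layers.Conv2D(num_filters, 5, padding="same", activation="relu")(board)
    for _ in range(num_feature_layers - 1):
        x = tf.keras.layers.Conv2D(num_filters, 3, padding="same", activation="relu")(x)
    features = x  # feature extractor reused by the transfer methods of Section 3.4
    logits = tf.keras.layers.Conv2D(1, 1, padding="same")(features)
    probs = tf.keras.layers.Softmax()(tf.keras.layers.Flatten()(logits))
    return tf.keras.Model(inputs=board, outputs=[probs, features])
```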

3.4. Board size transfer learning

Here we consider two ways that we could use the smaller network, locally and globally, when learning to play on a larger board. These are intended to mimic the way that humans would use knowledge gained from playing on a smaller board when moving up to a larger board size.

3.4.1. LOCAL

Once we have an effective neural network trained to output the policy for a small (i.e. 5×5) board, we can "scale up" the network to create a larger network to estimate the values of a larger (i.e. 9×9) board as follows. The small network is applied to each window of the larger board, as a convolutional filter would be applied, with no padding and a stride of 1. For each window of the large board, the small-board network generates a policy for play in that window. Each one of these policies is zero-padded back to the size of the larger board, as can be seen in figure 1.

In general terms, if transfer learning is applied from a board of size W1 to a board of size W2, this results in F = (W2 − W1 + 1)² convolution windows. The policies from each window are stacked, giving the first set of board features R1 of shape (W2, W2, F). The weights of this part of the network are locked during training.

3.4.2. GLOBAL DOWNSCALING

In addition to convolving the smaller network on the larger board, we learn a general down-scaling operation that "resizes" the larger board to the smaller size, so that it can be passed through the smaller network to extract features. The larger board state is passed through a new convolutional network, to represent it in the state space of the smaller board size. This is done with N different convolutional reductions, each with ks filters and a filter size of 3. Each reduction is passed through the smaller network, extracting N × ks features in total and giving R2 of shape (W2, W2, N·ks), the second set of board features passed to the final output fully-connected layer of the network. The convolutional scaling operation weights are trained, but the smaller network weights are again locked.

3.4.3. LARGER BOARD FEATURES

The above two feature-extraction methods primarily use the information learned from the smaller model. However, we also need to be able to learn features specific to the larger board. So we additionally train a convolutional feature extractor that takes as input the larger board representation (of shape (W2, W2, 3)). We use CL convolutional layers, each with kL filters, in the same structure as section 3.3.2. This results in a third set of board features R3 of shape (W2, W2, kL).
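The local feature block R1 of Section 3.4.1 can be sketched directly in numpy, treating the frozen small-board network as an opaque function. The helper name `local_transfer_features` and the callable interface assumed for the small policy are our own, introduced only for illustration.

```python
import numpy as np

def local_transfer_features(big_board, small_policy, w1=5):
    """Build the R1 feature block of Section 3.4.1 with a frozen small-board policy.

    `big_board` has shape (W2, W2, 3); `small_policy` maps a (w1, w1, 3) window to
    a (w1, w1) array of move probabilities.  The small network is slid over the
    large board like a convolutional filter (stride 1, no padding); each window's
    policy is zero-padded back to (W2, W2), and the F = (W2 - w1 + 1)^2 results
    are stacked as channels, giving R1 of shape (W2, W2, F).
    """
    w2 = big_board.shape[0]
    n = w2 - w1 + 1                                # windows per dimension
    r1 = np.zeros((w2, w2, n * n))
    for i in range(n):
        for j in range(n):
            window = big_board[i:i + w1, j:j + w1, :]
            policy = small_policy(window)          # (w1, w1) move probabilities
            padded = np.zeros((w2, w2))
            padded[i:i + w1, j:j + w1] = policy    # zero-pad back to the large board
            r1[:, :, i * n + j] = padded
    return r1

# Toy usage with a uniform "policy" standing in for the trained 5x5 network.
board_9x9 = np.zeros((9, 9, 3))
r1 = local_transfer_features(board_9x9, lambda w: np.full((5, 5), 1.0 / 25.0))
assert r1.shape == (9, 9, 25)
```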

3.4.4. PUTTING IT ALL TOGETHER

The three sets of board features are then concatenated together, R = (R1; R2; R3). Thus, R is of shape (W2, W2, F + N·ks + kL).

The board features are now used as input to a further Cf convolutional layers, with kf filters in each layer, each using a filter size of 3.

The output layer consists of a final convolutional layer, using a filter of size 1, and a softmax activation for the output, to provide a probability distribution over actions, similar to the output in section 3.3.2. The final architecture is depicted in figure 1, and the parameter values were set to those in Table 1.

Table 1. Parameters used in construction of the 9x9 network.

    Parameter  Value    Parameter  Value
    W1         5        W2         9
    Cs         5        ks         32
    N          32
    CL         5        kL         32
    Cf         3        kf         32

3.5. Training

We include pseudo-code for our training loop in Algorithm 1. One design choice we made was to store triples of (s, a, v) in a replay buffer, and use these triples to perform mini-batch stochastic gradient descent updates of the form described in section 3.2. Internally, within the replay buffer, for any (s, a, v) triple stored, we instead store eight triples (s_i, a_i, v), for i = 1, ..., 8, where s_i and a_i are obtained by flipping or rotating the board; we found this to lead to more even, symmetrical values being learned by the network.

Algorithm 1 Training Loop
    Initialize ReplayBuffer, OpponentPool, PolicyNetwork
    for i = 1 to NumEpisodes do
        if i % 500 = 0 then
            Add PolicyNetwork to OpponentPool
        end if
        Initialize States, Actions, Environment
        Sample Opponent from OpponentPool
        while t < horizon do
            state, action = Sample from PolicyNetwork
            Add state to States
            Add action to Actions
            Update Environment from action
            if state not terminal then
                state, action = Sample from Opponent
                Update Environment from action
            end if
            Update PolicyNetwork()
        end while
        Compute Values using Environment.winner
        Add States, Actions, Values to ReplayBuffer
    end for
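The eight-fold symmetry augmentation performed inside the replay buffer can be sketched as follows. The assumption that board-placement actions are encoded as flattened board indices (pass and resign left untouched) is ours, made for illustration.

```python
import numpy as np

def symmetric_triples(state, action, value, w):
    """Expand one (s, a, v) triple into the eight board symmetries stored in the
    replay buffer (Section 3.5).

    `state` has shape (w, w, 3) and `action` is a board location index in
    [0, w*w).  The action is transformed by applying the same rotation/flip to a
    one-hot board and reading off the new index.
    """
    action_board = np.zeros((w, w))
    action_board[action // w, action % w] = 1.0
    triples = []
    for flip in (False, True):
        s = state[:, ::-1, :] if flip else state          # optional horizontal flip
        a = action_board[:, ::-1] if flip else action_board
        for k in range(4):                                 # four rotations x flip = 8 symmetries
            s_k = np.rot90(s, k, axes=(0, 1))
            a_k = np.rot90(a, k)
            triples.append((s_k, int(np.argmax(a_k)), value))
    return triples

# Toy usage on a 5x5 board: playing at location 1 (row 0, column 1) in a won game.
eight = symmetric_triples(np.zeros((5, 5, 3)), action=1, value=1.0, w=5)
assert len(eight) == 8
```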

We train two 9×9 board agents: one with transfer learning, and one without. The one with transfer learning is trained for the same number of iterations as the 5×5 network it uses for transfer learning, and the one with no transfer learning is trained for twice that number of iterations (so that in total, both models have been trained for the same number of iterations). This is so we can compare speed of training (since there is no useful loss or score curve to analyze).

Figure 1. Transfer learning architecture. The three feature vectors R1, R2 and R3 are concatenated and passed through a final fully-connected two-layer output network to estimate Q-values for the larger board.

Figure 2. Outline of MCTS. [5]

3.6. Monte-Carlo Tree Search

3.6.1. ALGORITHM

Monte Carlo Tree Search (MCTS) is a tree search algorithm frequently used within game-playing agents. We have implemented our own version of MCTS.

Each MCTS iteration consists of 4 steps, over which a partial search tree is built, growing by a single node with each iteration. For each node in the search tree we store a few additional pieces of meta-data, including the number of times that we have visited the node, and the average value of the game state as witnessed from this node (see backpropagation). As we progress through the iterations, we hope that our value for each node approximates the value of that state, and hence can be used to make informed decisions on what action to take.

In our version of the algorithm, we need two stochastic policies to be provided: the selection policy and the simulation policy. Here, a stochastic policy is a function mapping actions, from some given state, to a weight that intuitively represents how 'good' that action is.

Selection: In the selection phase, we stochastically walk down the partial search tree, according to the aforementioned selection policy, until we reach a state that is not in the tree.

Expansion: The expansion phase extends the partial search tree by one node. Our agent expands by simply adding the state that we reached in the selection phase to the tree.

Simulation: From the 'expansion node' (the node we added in the expansion phase), we finish playing a game according to the simulation policy and see which player won. Each action is chosen stochastically according to the simulation policy, and there is no backtracking.

Backpropagation: When we backpropagate, we update the data stored in each of the nodes on the path traversed in the selection phase, as can be seen in figure 2. The values updated are the number of times each node has been visited, and the average value of the node.

Intuitively, this agent can act in the same way as our pure Monte-Carlo agent if handed uniform policies and we sample only according to that policy. However, to encourage exploration, when we are following our selection policy $\pi_{sel}$ from some state $s$, we sample actions $a$ from

\[ p(a \mid s) \propto \frac{\pi_{sel}(a; s)}{n_{s'} + 1} \]

where $s' = \mathrm{Succ}(s, a)$ is the successor state of $s$ from action $a$, and $n_{s'}$ is the number of times that $s'$ has been visited, as tracked by the partial search tree.

Finally, due to the model's flexibility we can try many policies for searching, such as the ones learned by our neural networks.

3.6.2. MCTS AGENT

For competitive play, we can use the Monte Carlo Tree Search algorithm with the policy output by the policy network described in sections 3.2 to 3.4 as the selection and simulation policy. By the nature of using a search, and therefore more evaluations of the network, the resulting agent tends to be stronger than a 'reflex agent' (an agent that doesn't use any form of searching), as it can avoid, with almost certainty, actions that lead to losing in the near future.
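Below is a minimal sketch of the node bookkeeping and the exploration-adjusted selection rule described above. The class and function names are illustrative and do not reproduce the full implementation (tree construction, expansion and simulation are omitted).

```python
import numpy as np

class Node:
    """Per-state bookkeeping used by MCTS (Section 3.6.1): visit count and a
    running mean of the game values observed through this node."""
    def __init__(self):
        self.visits = 0
        self.value = 0.0

    def update(self, outcome):
        # Backpropagation step: fold one simulated outcome into the running mean.
        self.visits += 1
        self.value += (outcome - self.value) / self.visits

def selection_probs(policy_weights, child_visits):
    """Exploration-adjusted selection distribution of Section 3.6.1:
    p(a | s) proportional to pi_sel(a; s) / (n_{s'} + 1), where n_{s'} is the
    visit count of the successor reached by action a."""
    weights = np.asarray(policy_weights, dtype=float) / (np.asarray(child_visits) + 1.0)
    return weights / weights.sum()

# Toy usage: a uniform selection policy over 3 actions; the first successor has
# already been visited 4 times, so it is sampled less often.
p = selection_probs([1.0, 1.0, 1.0], [4, 0, 0])
rng = np.random.default_rng(0)
action = rng.choice(3, p=p)
```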

In our version of the algorithm, we need two stochastic 4. Results policies to be provided, the selection policy and simula- tion policy. Here, a stochastic policy is a function mapping 4.1. 5x5 board actions, from some given state, to a weight, that intuitively represents how ‘good’ that action is. 4.1.1. 2-3 HOUR TRAINING RUNS When moving first, our 5 5 board agent is able to de- Selection: In the selection phase, we stochastically walk ⇥ down the partial search tree, according to the aforemen- feat Pachi every single time. However, this is only because tioned selection policy, until we reach a state that is not in Pachi surrenders very quickly to good opening moves. the tree. When playing second, the agent never defeats Pachi, and when we (novice Go players) play our agent, we always Expansion: The expansion phase extends the partial search win. However, this does not mean that our agent does not tree by one node. Our agent expands by simply adding the learn anything. In order to understand what is happening, state that we reached in the selection phase to the tree. we analyzed games played by different training runs of our Simulation: From the ‘expansion node’ (the node we added agent against Pachi, and against us, with a couple represen- in the expansion phase), we finish playing a game, accord- tative positions depicted in figure 7. First, it is worthwhile ing to the simulation policy and see which player of the to note that the agent consistently learns that the best start- game won. Each action is chosen stochastically according ing move is the middle square. It learns a couple interesting to the simulation policy, and there is no backtracking. strategies - what we will call the “grid” strategy and “rush eye” strategy. In the “grid” strategy, it places a series of Backpropogation: When we backpropogate, we update stones on diagonals in order to cover as much territory as data stored in each of the nodes on the path we traversed quickly as possible. This maximizes the fraction of moves in the selection phase, as can be seen in figure 2. The val- that will cause the opponent to resign - so when playing ues updated are the number of times each node has been against a random or unintelligent opponent, this will result visited, and the average value of the node. in the quickest winning strategy. However, this strategy is Intuitively we see that this agent can act in the same way very weak, and can be beaten by an agent capable of plan- as our Pure MC agent, if handed uniform policies, and we ning ahead to capture each diagonal stone, one at a time sample only according to that policy. However, to encour- (as the agent never solidifies the stones into uncapturable age exploration, when we are following our selection pol- eyes). The “rush eye” strategy involves quickly building a particular type of “eye”, which is a stone formation that is icy ⇡sel from some state s, we sample actions a from uncapturable in go, and among the first techniques students of the game are taught. However, by focusing on this par- ticular eye, the agent sacrifices the majority of the territory

⇡sel(a; s) of the board, which allows intelligent player (like Pachi) p(a s) to easily win in the long run. Note that the agent never | / ns +1 0 learns to play both these strategies at once - these were taken from different training runs. In both cases, when the model learns one of these strategies, it consistently applies where s0 = Succ(s, a), the successor state of s from action it in every game. So, we were unable to achieve the play- a, and n is the number of times that s has been visited, ing power we hoped for on a 5 5 board, but the agent did s0 0 ⇥ as tracked by the partial search tree. learn some techniques, and although transfer learning to a 9 9 board will probably not result in stellar results when ⇥ Finally, due to the model’s flexibility we can try many poli- working with a weak 5 5 player, we decided to proceed ⇥ cies for searching, such as the ones learned by our neural anyway to see if the techniques it did learn for the 5 5 networks. ⇥ Mastering the game of Go from scratch board were beneficial for training speed or learning of the 9 9 board. ⇥ 4.1.2. 10 HOUR TRAINING RUN The above results occurred from two separate 2 hour ⇡ training runs (of 100,000 iterations). We tried one much Figure 5. The gradient magnitude throughout the 10 hour run, longer training run of 10 hours (500,000 iterations), and ⇡ with respect to time. found the final results were not much different than the shorter runs. However, during these 10 hours of training, at one point the model played a game against itself of 1,311 4.1.3. REFLECTIONS ON THE TRAINING RUNS moves - much longer than the average game length of 20 ⇡ (Figure 7). This crashed the training run (we had a bug Whilst watching the training during the above runs, we saw in the replay buffer), but we were able to resume training a lot of oscillatory behavior. On some runs we witnessed from this checkpoint. Later, we went back and investigated the probability for picking the center move in the initial how the model was playing during this game, and were sur- state reach highs of 0.99, and then plummet down to 0.03 prised with the results. Its performance at this point looked later in the run. We also see in the non-smoothed curves the most human-like to our qualitative analysis - It would in figures 4 and 5 that there is a high variation in values create a wall of stones, blocking off territory on one side through the whole training run. of the board, and then when it could not obtain any more territory, it would play the only valid moves available to One technique for reducing this variance is subtracting a re- it (which were to play in its own territory, decreasing its inforcement baseline as described in [7] and [9], and used score). This strategy would have resulted in victory on the in [1]. When included as a term in the policy gradient up- 5 5 board if it had only learned that the center square date from section 2.2, it can be shown to reduce variance in ⇥ R is the optimal location for the first move (as far as we can the discounted reward that we are aiming to maximize tell, it is impossible to defeat Pachi with any other opening the expected value of, whilst not effecting it’s expected move). Hence, these were our most promising results on value, provided the reinforcement baseline is chosen suit- the 5 5 board. Unfortunately we were unable to further ably. 
Thus this suggests that we should work to implement ⇥ explore the effect of longer training times before the dead- a value network as a reinforcement baseline, as used in [1] line. Nevertheless this is a valuable lesson for the future, and described in section 3.2 that even if short-term results are unsatisfactory, it is im- portant to at least try long training runs to see if they result 4.2. 9x9 board in significant improvements. Unfortunately, all our attempts at training any type of agent (with or without transfer learning) on the 9 9 board failed ⇥ miserably. Our agents never defeat Pachi, and when we ob- serve their self-games, or play them ourselves, their moves appear no better than random. This does not necessary mean that our transfer learning method is useless - it is pos- sible that it would perform better if we could train a better 5 5 agent to start with. ⇥ Figure 3. The maximum probability value output from the net- work throughout the 10 hour run, with respect to time. 5. Conclusion 5.1. Future Work We have decided that we would like to continue working on ZetaGo beyond this project, because we have identified a number of potential avenues of exploration that we sim- ply did not have time to try in the course of this quarter. Most importantly, in this paper we omitted the value net- work for the baseline in the policy update, but according to [7], this baseline is essential for rapid learning. And we Figure 4. The value of the loss throughout the 10 hour run, with observe a lot of variance in our training results. Hopefully respect to time. this addition will prevent oscillatory behavior in training, and reduce the likelihood of landing in local optima. We Mastering the game of Go from scratch
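A sketch of how such a baselined update could look is given below. The per-game data layout and the helper name are illustrative assumptions; in practice the baseline values would come from a separately trained value network, as in [1].

```python
import numpy as np

def baselined_policy_gradient(log_prob_grads, outcomes, values, alpha=0.01):
    """Sketch of the variance-reduced update discussed in Sections 4.1.3 and 5.1.

    `log_prob_grads[i][t]` is grad_theta log p_theta(a_t^i | s_t^i) for step t of
    game i (any array shape), `outcomes[i]` is the final reward r^i, and
    `values[i][t]` is a baseline estimate v(s_t^i).  Subtracting the baseline
    leaves the expected update unchanged while reducing its variance.
    """
    n = len(log_prob_grads)
    total = None
    for grads, r, vs in zip(log_prob_grads, outcomes, values):
        for g, v in zip(grads, vs):
            term = g * (r - v)                 # advantage-weighted gradient
            total = term if total is None else total + term
    return alpha * total / n

# Toy usage with scalar "gradients" for two one-move games.
delta = baselined_policy_gradient([[np.array(0.5)], [np.array(-0.2)]],
                                  outcomes=[1.0, -1.0],
                                  values=[[0.3], [-0.1]])
```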

Figure 6. When our model crashed after encountering a game of length 1,311 during training, we decided to plot the game lengths. However, the plot shows a mostly constant average game length, although the variance does seem to change a bit during training.

Figure 7. Examples of a couple of strategies that our 5×5 agent learns. The agent plays white ("O") in example 1 and uses the "grid" strategy. After a separate training run, the agent in example 2 plays black ("X") and uses the "rush eye" strategy. The eye is the structure in the upper left corner enclosing two empty (".") squares.

We have another idea for improving the model, inspired by a very successful recent reinforcement learning paper, "Reinforcement Learning with Unsupervised Auxiliary Tasks" [10]. In this paper, auxiliary tasks are added to the loss computation, so that the reinforcement learning agent must learn those tasks (as well as the policy for maximizing expected reward) during training. This can drastically speed up training, and improve results. For the Go agent, we could add auxiliary tasks such as predicting the next board state. This would force the model to learn explicit representations of when a move will capture opponent stones, and when a move will suicide a group of its own stones - something that we never saw our model learn (it seemed perfectly willing to sacrifice large groups of stones, and rarely captured opponent stones).

Figure 8. The exciting game where ZetaGo beats Pachi on a 5×5 board.

5.2. Final Remarks

Throughout this project we had been hoping that ZetaGo would learn to play Go well, from scratch. We saw some promising initial results, and our agent learns a couple of interesting strategies. Unfortunately the transfer learning components never conferred any advantage to the larger models, and we were unable to train a strong agent on even the small 5×5 board. From observing some major oscillatory behavior from the agent during training, we think the best immediate next steps will involve implementing the variance reduction techniques mentioned, and we remain hopeful for the eventual success of ZetaGo.

5.3. Team member contributions

We both discussed every technical decision before and/or after implementation. Michael implemented most of the training loop, replay buffer, network configuration, and Monte Carlo tree search. Luke implemented most of the policy network and transfer learning architecture, and saving and loading opponents. We both feel that our relative contributions were equivalent in scope.

References

[1] Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.

[2] Brockman, Greg, et al. "OpenAI gym." arXiv preprint arXiv:1606.01540 (2016).

[3] Baudiš, Petr, and Jean-loup Gailly. "Pachi: State of the art open source Go program." Advances in Computer Games. Springer Berlin Heidelberg, 2011.

[4] Storkey, Amos. "Training deep convolutional neural networks to play Go." Proceedings of the 32nd International Conference on Machine Learning. 2015.

[5] Chaslot, Guillaume, Sander Bakkes, Istvan Szita, and Pieter Spronck. "Monte-Carlo Tree Search: A New Framework for Game AI." AIIDE. 2008.

[6] Tesauro, Gerald. "Temporal difference learning and TD-Gammon." Communications of the ACM 38.3 (1995): 58-68.

[7] Sutton, Richard S., et al. "Policy gradient methods for reinforcement learning with function approximation." Advances in Neural Information Processing Systems. 2000.

[8] Mnih, Volodymyr, et al. "Asynchronous methods for deep reinforcement learning." International Conference on Machine Learning. 2016.

[9] Williams, Ronald J. "Simple statistical gradient-following algorithms for connectionist reinforcement learning." Machine Learning 8.3-4 (1992): 229-256.

[10] Jaderberg, Max, et al. "Reinforcement learning with unsupervised auxiliary tasks." arXiv preprint arXiv:1611.05397 (2016).


Mastering the game of Go from scratch
Michael Painter, Luke Johnston
Stanford University

Introduction

In the landmark paper "Mastering the game of Go with deep neural networks and tree search" [1], superhuman performance on the game of Go was achieved with a combination of supervised learning (from professional Go games) and reinforcement learning (from self-play). However, traditional pure RL approaches haven't yielded satisfactory results on the game of Go on a 19x19 board because its state space is so large and its optimal value function so complex that learning from self-play is infeasible. However, these limitations do not apply to smaller board sizes. In this project, we investigate an entirely unsupervised, reinforcement-learning based approach to playing Go, by learning how to play on a smaller board and using a form of transfer learning to learn to play on larger board sizes. If successful, this suggests an effective strategy using RL for approaching large state space problems for which we, as humans, are not experts (and cannot generate any good dataset).

5x5 Board Architecture

The architecture used for the 5x5 board, the 'base case', is a 5 layer convolutional neural network. Padding is used so that each of the intermediate layers maintains the same shape as the input.
● The first layer uses a 5x5 convolution, with 32 filters and a ReLU activation.
● The second to fourth layers use a 3x3 convolution, with 32 filters and a ReLU activation.
● The final, fifth layer uses a 1x1 convolution, with 1 filter, and a softmax activation to provide a probability distribution as output.
● Each layer includes bias terms.
To encode the input, we use 3 channels, one for the white pieces, one for the black pieces and one for free spaces.

Example of Transfer Learning to 9x9 Board

Black box:
● A network trained to play on a 5x5 board is used as a black box (weights frozen).
● The black box is used like an oracle to query about 'global' information and 'local' information.

Transfer method 1 - convolution, or 'local' use:
● The 5x5 network is applied to every possible window of the 9x9 board (as if the 5x5 network were a convolutional filter).
● Each application yields a 5x5 output of 'policy values' for actions at each board location:
  ○ which is padded with zeros to match the original 9x9 board;
  ○ all outputs are concatenated to form a 9x9x25 tensor.

Transfer method 2 - scaling, or 'global' use:
● A convolution is applied to the 9x9 board, such that the output is 5x5, by using a 5x5 convolution and no padding.
● The output is run through the black box.
● The transpose of the same filter is applied to the 5x5 output from the black box, to mimic an 'inverse' and yield a 9x9 output.

MCTS

For competitive play we use the Monte Carlo Tree Search (MCTS) algorithm, which uses a stochastic policy to guide a tree search for a zero-sum game. The algorithm maintains a partial search tree, adding one node to it on every iteration. Internally, every node maintains a 'value'.
● Selection: traverse the partial tree (probabilistically, according to the search policy), until a node not in the tree is reached.
● Expansion: add the node to the tree.
● Simulation: use a policy to (greedily) pick actions, until a terminal state. The value of this run of the game is then known.
● Backpropagation: update the value of all nodes traversed in this iteration using the value obtained from the simulation.
After some fixed number of iterations or time, the MCTS agent selects the action that leads to the child with maximum value from the root of the tree.

Reinforcement Learning Details

A policy $p_\rho(s)$, parameterised by $\rho$, is learned by optimising the expected reward

\[ J(\rho) = \mathbb{E}_{\tau \sim p_\rho}\big[ r(\tau) \big] \]

where $\tau$ is a path of states. By the policy gradient theorem [2] and using an empirical estimate, we can arrive at an update rule for $\rho$ of

\[ \Delta\rho = \frac{\alpha}{n} \sum_{i=1}^{n} \sum_{t} \frac{\partial \log p_\rho(a^i_t \mid s^i_t)}{\partial \rho} \, v^i_t \]

where, for each episode i, $s^i_t$, $a^i_t$ and $v^i_t$ are the t'th state, action and value.

To train the network we play games against old opponents: we play against a policy $\rho^-$, chosen uniformly from the pool of old opponents. The current policy is copied every 500 episodes into the opponent pool.

To evaluate the learned policy we play against the Pachi AI [4], provided by OpenAI's Go Environment [5]. We track two metrics: average game length (to see if the game was 'competitive') and the average value of the game from our agent's perspective (the win ratio).

A number of different design choices were made compared to AlphaGo:
● Removal of the variance reduction term from the update rule, as it would significantly increase the computation time.
● No use of Monte Carlo Tree Search (MCTS) in training, allowing the self-play to run some orders of magnitude quicker.
● We still learn a policy (softmax output) rather than a Q function, as many actions could have large Q-values, leading to a large branching factor when run with MCTS (for competitive play).

Composite Network Architecture

Gameplay

● 5x5 board: learns to contest the middle 3x3 squares first
  ○ Does not learn the optimal first move (center square)
● Forms an "eye" structure on the top border of the board
● Makes a couple of obvious mistakes - 2B in the first image, 1D in the second image
● The above diagram illustrates a couple of Go concepts

References

[1] Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.
[2] Sutton, Richard S., et al. "Policy gradient methods for reinforcement learning with function approximation." NIPS. Vol. 99. 1999.
[3] Greensmith, Evan, Peter L. Bartlett, and Jonathan Baxter. "Variance reduction techniques for gradient estimates in reinforcement learning." Journal of Machine Learning Research 5.Nov (2004): 1471-1530.
[4] Baudiš, Petr, and Jean-loup Gailly. "Pachi: State of the art open source Go program." Advances in Computer Games. Springer Berlin Heidelberg, 2011.
[5] Brockman, Greg, et al. "OpenAI gym." arXiv preprint arXiv:1606.01540 (2016).