

AlphaGo Overview
Ron Parr
CompSci 590.2
Duke University

Overview

• Digression: Zero sum, alternating move games

• Original AlphaGo approach

• Recent AlphaGo Zero approach


Alternating move, zero-sum, 2-player games

• Ordinary Bellman equation:
  $V(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V(s') \right]$

• Two player:
  $V_{\max}(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V_{\min}(s') \right]$
  $V_{\min}(s) = \min_a \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V_{\max}(s') \right]$

Vmin and Vmax

• Vmin is, by definition, the negative of Vmax

• No need to store two separate value functions

• Can store just one and flip the sign based upon who is playing
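
A minimal sketch (not from the slides) of this sign-flipping trick: value iteration on a hypothetical random zero-sum game, storing a single value function from the perspective of the player to move and negating successor values in the backup. The model arrays R and P are made up for illustration.

import numpy as np

# Hypothetical tabular game model (made up for illustration).
n_states, n_actions, gamma = 5, 2, 0.9
rng = np.random.default_rng(0)
R = rng.normal(size=(n_states, n_actions))                          # R[s, a]
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))    # P[s, a, s']

# V[s] = value of state s from the perspective of the player about to move.
V = np.zeros(n_states)
for _ in range(200):
    # The mover maximizes; successor values are negated because they are
    # stored from the *opponent's* perspective (V_min = -V_max).
    Q = R + gamma * (P @ (-V))          # Q[s, a]
    V = Q.max(axis=1)

print(np.round(V, 3))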


Original AlphaGo

Ingredients

• Policy network trained by supervised learning (imitating expert moves)

• Policy network trained by policy gradient

• Value function network

• Monte Carlo Tree Search (MCTS)


Supervised Policy Network

• Similar to Neurogammon – just tried to mimic human moves
• CNN with softmax output
• Trained on 30 million expert moves
• 57% accuracy on held-out test set

• Seems low, but keep in mind that this is not a binary prediction problem, so this is way better than flipping a coin

• Also trained a faster “rollout” network that got 24% accuracy – used this for fast rollouts
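
A rough illustration of the supervised step (assumed layer sizes, not the paper's architecture): a small CNN with a softmax over the 19x19 board, trained by cross-entropy to imitate expert moves. The batch of positions and move labels here is fake, and the pass move is omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNet(nn.Module):
    def __init__(self, in_planes=4, width=32):
        super().__init__()
        self.conv1 = nn.Conv2d(in_planes, width, 3, padding=1)
        self.conv2 = nn.Conv2d(width, width, 3, padding=1)
        self.head = nn.Conv2d(width, 1, 1)      # one move logit per board point

    def forward(self, x):                       # x: (batch, planes, 19, 19)
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        return self.head(x).flatten(1)          # (batch, 361) move logits

net = PolicyNet()
opt = torch.optim.SGD(net.parameters(), lr=0.01)

# One imitation step on a fake batch of (position, expert move) pairs.
boards = torch.randn(8, 4, 19, 19)
expert_moves = torch.randint(0, 361, (8,))
loss = F.cross_entropy(net(boards), expert_moves)   # softmax + log-loss
opt.zero_grad(); loss.backward(); opt.step()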

RL policy network

• Used same network structure as supervised network
• Initialized to same weights as supervised network
• Trained using policy gradient
• Trained initially against supervised network, then against previous versions of the RL policy network (with some randomization)
  • Is this correct? NO!
  • Still worked very well!
• 80% win rate against supervised network
• 85% win rate against Pachi, a very strong MCTS program at the time (2 amateur dan)
• Surprising that it played this well with no lookahead
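
A minimal sketch of the policy-gradient step (REINFORCE on the final game outcome). The toy linear policy and the play_game stub are hypothetical stand-ins: the opponent's moves and the Go rules are not simulated, and none of this is the paper's actual code.

import torch
import torch.nn as nn

# Toy policy over a flattened 19x19 board (361 move logits; pass omitted).
policy = nn.Sequential(nn.Flatten(), nn.Linear(19 * 19, 361))
frozen_opponent = nn.Sequential(nn.Flatten(), nn.Linear(19 * 19, 361))
frozen_opponent.load_state_dict(policy.state_dict())  # earlier snapshot (unused in this stub)
opt = torch.optim.SGD(policy.parameters(), lr=0.01)

def play_game(policy):
    """Hypothetical stand-in for a full game: returns the log-probabilities
    of the learner's sampled moves and the final outcome z in {+1, -1}."""
    log_probs = []
    for _ in range(10):                              # fake 10 learner moves
        board = torch.randn(1, 19, 19)               # fake board observation
        dist = torch.distributions.Categorical(logits=policy(board))
        move = dist.sample()
        log_probs.append(dist.log_prob(move))
    z = 1.0 if torch.rand(()) < 0.5 else -1.0        # fake win/loss outcome
    return torch.stack(log_probs), z

# REINFORCE: raise the probability of every move played in a won game,
# lower it for a lost game.
log_probs, z = play_game(policy)
loss = -(z * log_probs).sum()
opt.zero_grad(); loss.backward(); opt.step()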


Value network

• Should predict probability of win given board position
  • Assuming both players play the same policy
  • Is this a reasonable thing to assume?
• Tried training on the human database, but this didn't do well
  • No single policy used?
  • Relatively small amount of training data, so overfit
  • Paper mentions correlations, but not clear
• Trained on data generated from the RL policy network
  • Did reasonably well: 0.234 MSE
  • Did not overfit badly
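
A minimal sketch of the value-network regression (assumed sizes, not the paper's architecture): a network maps a position to a scalar outcome prediction in [-1, 1] and is fit by mean-squared error to game results. The batch of positions and outcomes is fake.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy value network: board planes in, scalar outcome prediction out.
value_net = nn.Sequential(
    nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(32 * 19 * 19, 1), nn.Tanh(),           # prediction in [-1, 1]
)
opt = torch.optim.SGD(value_net.parameters(), lr=0.01)

# Fake batch: positions sampled from RL-policy self-play games, each labeled
# with that game's final outcome z (+1 win, -1 loss).
positions = torch.randn(16, 4, 19, 19)
outcomes = (torch.rand(16) < 0.5).float() * 2 - 1

pred = value_net(positions).squeeze(1)
loss = F.mse_loss(pred, outcomes)    # regression target: the game outcome
opt.zero_grad(); loss.backward(); opt.step()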

MCTS

• Does MCTS with an exploration bonus
• The first time a new node is encountered:
  • Evaluated by the value network, AND
  • A rollout using the fast, supervised policy network for each action
  • Initializes Q-value estimates for each action
• Exploration bonuses are weighted by prior probabilities = the policy network distribution
• Value-network and rollout results are mixed using a voodoo constant

• When time is up (done searching):
  • Algorithm chooses the most visited node from the root
  • Ignores the value estimates at the root, though the most visited node will tend to have high value
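
A minimal sketch of the flavor of this search (not the actual implementation): action selection by value plus a prior-weighted exploration bonus, leaves scored by mixing a value-network estimate with a rollout result, and the final move chosen by visit count. The constants and the stand-in leaf evaluations are hypothetical.

import math, random

class Node:
    def __init__(self, priors):
        self.N = {a: 0 for a in priors}        # visit counts
        self.Q = {a: 0.0 for a in priors}      # mean action values
        self.P = priors                        # policy-network priors

def select_action(node, c_puct=1.0):
    total = sum(node.N.values()) + 1
    # Q plus an exploration bonus that shrinks with visits and is scaled by
    # the prior probability from the policy network.
    return max(node.N, key=lambda a: node.Q[a] +
               c_puct * node.P[a] * math.sqrt(total) / (1 + node.N[a]))

def evaluate_leaf(value_net_estimate, rollout_result, lam=0.5):
    # Mix the value-network prediction with a fast-rollout outcome using a
    # mixing constant (the "voodoo constant" above).
    return (1 - lam) * value_net_estimate + lam * rollout_result

def backup(node, action, leaf_value):
    node.N[action] += 1
    node.Q[action] += (leaf_value - node.Q[action]) / node.N[action]

# Toy usage: a few simulations from a root with made-up priors and random
# leaf evaluations, then play the most visited move.
root = Node({"A": 0.6, "B": 0.3, "C": 0.1})
for _ in range(100):
    a = select_action(root)
    leaf = evaluate_leaf(random.uniform(-1, 1), random.choice([-1, 1]))
    backup(root, a, leaf)
best_move = max(root.N, key=root.N.get)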


Observations

• Supervised learning network (trained on human moves) was better for rollouts than RL network, even though RL network produced a stronger player

• Why? Authors argue that supervised network covered the space more, but that’s not a rigorous argument

Performance

• On a single machine, AlphaGo significantly outperformed all available computer players (see figure 4 in the paper)

• Rollouts, the value network, and the policy network were all individually pretty strong

• Combining them made the overall system stronger

• Distributed version beat the best European go player


Comparison with Deep Blue

• Deep Blue (best computer chess player):
  • Used little/no learning
  • Evaluated more board positions (more brute force)

• AlphaGo:
  • Supervised and reinforcement learning reduce the search space
  • Learned value function and rollouts initialize new leaves w/ reasonable values
  • Less time spent on exhaustive search

AlphaGo History

• Nature paper published in January 2016
• Played best living human go player, Lee Sedol (possibly one of the strongest go players ever), in March 2016
• Used between 1K and 2K CPUs, 150-300 GPUs, and possibly Google's new TPUs (not clear if these were counted as GPUs)
• Won 4/5 games

• Subsequently revealed that:
  • Value network was tuned by self play
  • Used bigger NN and 48 TPUs

• Play was described as surprising and original
  • Made moves that were unexpected at first, but made sense in hindsight


AlphaGo History II

• Played the current best go player (Ke Jie) in May 2017

• Won 3/3 games

• Livestream was censored in China: https://www.theguardian.com/technology/2017/may/24/china-censored-googles-alphago-match-against-worlds-best-go-player

Surprising things about AlphaGo

• Rollouts with fast, supervised policy better than rollouts with RL policy – enables more searching?
• RL policy used only indirectly to train value function
• Searchless play was surprisingly good for both RL policy and V
• Contrast with chess:
  • Search seems essential for reasonable chess play
  • Rollouts less helpful
• Deep Blue was a hardware triumph
• AlphaGo is an AI/ML triumph


AlphaGo Zero

What is AlphaGo Zero?

• Announced in mid October 2017 with much hype

• Learns to play with “zero” human knowledge

• No obvious relationship to Coke Zero


Differences from AlphaGo Classic I

• Uses a single convolutional network to propose actions (softmax) and produce value estimates
• First time a node is visited: NN assigns value and probabilities of actions (same as classic?)
• No handcrafted features (were these previously discussed?)
• No rollouts
• Used just a single machine with 4 TPUs
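
A minimal sketch of this two-headed design (assumed sizes; the real network is a much deeper residual net, and the pass move is omitted here): a shared convolutional trunk feeding a softmax policy head and a tanh value head.

import torch
import torch.nn as nn

class DualNet(nn.Module):
    """Toy two-headed network: shared conv trunk, policy head with one
    logit per board point, scalar value head in [-1, 1]."""
    def __init__(self, in_planes=3, width=64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_planes, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
        )
        self.policy_head = nn.Conv2d(width, 1, 1)
        self.value_head = nn.Sequential(
            nn.Flatten(), nn.Linear(width * 19 * 19, 1), nn.Tanh())

    def forward(self, x):                      # x: (batch, planes, 19, 19)
        h = self.trunk(x)
        p = self.policy_head(h).flatten(1)     # (batch, 361) move logits
        v = self.value_head(h).squeeze(1)      # (batch,) value estimate
        return p, v

net = DualNet()
logits, value = net(torch.randn(2, 3, 19, 19))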

Differences from AlphaGo Classic II

• Exploits rotation and reflection invariance (did classic do this?)

• When game ends:
  • Neural network is trained to maximize similarity between predicted and actual outcome for each state in the game
  • Network is also trained to make action probabilities consistent with MCTS action selection probabilities
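
A minimal sketch of these two training targets: for each position of a finished self-play game, the value head regresses on the game outcome z (squared error) and the policy head is pushed toward the MCTS visit distribution pi (cross-entropy). The tensors below are fake placeholders for network outputs and targets.

import torch
import torch.nn.functional as F

batch = 8
policy_logits = torch.randn(batch, 361, requires_grad=True)      # policy head output
value_pred = torch.tanh(torch.randn(batch, requires_grad=True))  # value head output
mcts_pi = torch.softmax(torch.randn(batch, 361), dim=1)          # MCTS visit distribution
outcome_z = (torch.rand(batch) < 0.5).float() * 2 - 1            # final game result

# Value head: match the actual outcome.  Policy head: match the MCTS
# action-selection probabilities.
value_loss = F.mse_loss(value_pred, outcome_z)
policy_loss = -(mcts_pi * F.log_softmax(policy_logits, dim=1)).sum(dim=1).mean()
loss = value_loss + policy_loss      # (plus weight regularization in the real setup)
loss.backward()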


Training

• Millions of games of self-play
• 64 GPUs, 19 CPUs
• Surpassed the version that defeated Lee Sedol after just 36 hours of training
• After ~30 days, surpassed AlphaGo Master – a version that had been beating top human masters 60-0 online
• See figure 6

Remarkable things about AlphaGo Zero

• Compare w/ TD-Gammon:
  • TD-Gammon did pretty well without expert features
  • Needed expert features to excel
• AlphaGo Zero used raw board positions (essentially images)
  • Advantage of convolutional networks?
• Independently learned expert-level knowledge of important board/end-game configurations
• Developed new approaches to known Go problems
• Simpler and cleaner algorithm than classic
• Reduced computational resources at execution time
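
A minimal sketch of what "raw board positions as images" can mean, simplified from the paper's multi-plane input (which also stacks recent board history): the position is encoded as binary feature planes a convolutional network can consume directly, with no handcrafted Go features.

import numpy as np

def encode(board, to_move):
    """board: 19x19 ints with 1 = black stone, -1 = white stone, 0 = empty.
    to_move: +1 if black plays next, -1 if white."""
    planes = np.stack([
        (board == 1).astype(np.float32),                            # black stones
        (board == -1).astype(np.float32),                           # white stones
        np.full((19, 19), float(to_move == 1), dtype=np.float32),   # side to move
    ])
    return planes                                                   # shape (3, 19, 19)

board = np.zeros((19, 19), dtype=np.int8)
board[3, 3], board[15, 15] = 1, -1       # one black and one white stone
print(encode(board, to_move=1).shape)    # (3, 19, 19)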


Why this matters

• Go was previously viewed as a challenge problem for AI

• Previous challenge problems were solved using:
  • Simple algorithms and massive hardware (Deep Blue)
  • Special purpose hardware and algorithms (autonomous driving)

• AlphaGo Zero uses almost no domain specific human knowledge – only:
  • Rules of the game
  • Symmetries

Is this a watershed event for AI/ML?

• Could be
• Some questions that need to be answered:
  • MCTS helps for other domains, but seems like a particularly big win for Go
  • Can MCTS + RL be as big of a win for other domains?
  • Human Go ability leveraged human pattern-recognition prowess
  • Was human Go dominance an artifact of a vanishing edge humans had in pattern recognition?
  • Are humans the right benchmark?
  • Can this be done with fewer resources than a Google-like firm has?
  • What does this say about general intelligence?
