AlphaGo Overview
Ron Parr
CompSci 590.2, Duke University
10/24/17
Overview
• Digression: Zero sum, alternating move games
• Original AlphaGo approach
• Recent AlphaGo Zero approach
Alternating move, zero-sum, 2-player games
• Ordinary Bellman equation: $V(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V(s') \right]$
• Two player:
  $V_{\max}(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V_{\min}(s') \right]$
  $V_{\min}(s) = \min_a \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V_{\max}(s') \right]$
Vmin and Vmax
• Vmin is, by definition, the negative of Vmax
• No need to store two separate value functions
• Can store just one and flip the sign based upon who is playing
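A minimal sketch of this one-value-function idea in Python, assuming a small tabular game with a known transition model and terminating episodes; the array layout and function name are illustrative, not from any AlphaGo paper:

```python
import numpy as np

# Hypothetical tabular game with a known model (names are illustrative):
#   P[s, a, s2] = probability of reaching state s2 from state s under action a
#   R[s, a]     = reward to the player moving at s
# Because the game is zero-sum and alternating, one value function suffices:
# V[s] is the value for the player to move at s; the opponent's value is -V[s].

def negamax_value_iteration(P, R, gamma=1.0, iters=1000):
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(iters):
        # The opponent moves next from s2, so the mover's continuation value is -V[s2].
        Q = R + gamma * (P @ (-V))   # Q[s,a] = R[s,a] + gamma * sum_s2 P[s2|s,a] * (-V[s2])
        V = Q.max(axis=1)            # the mover maximizes; the minimizer is handled by the sign flip
    return V
```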
Original AlphaGo
Ingredients
• Policy network trained by supervised learning
• Policy network trained by policy gradient
• Value function network
• Monte Carlo Tree Search (MCTS)
Supervised Policy Network
• Similar to Neurogammon – just tried to mimic human moves
• CNN with softmax output layer
• Trained on 30 million expert moves
• 57% accuracy on held-out test set
• Seems low, but keep in mind that this is not a binary prediction problem, so 57% is far better than chance
• Also trained a much faster “rollout” policy network with 24% accuracy, used for fast rollouts during search
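As a rough illustration of what “CNN with softmax output layer, trained to mimic human moves” amounts to, here is a hedged PyTorch-style sketch; the layer sizes, the IN_PLANES count, and all names are placeholders, not the actual AlphaGo architecture:

```python
import torch
import torch.nn as nn

IN_PLANES = 17   # placeholder for however many input feature planes are used

class PolicyNet(nn.Module):
    """Toy stand-in for the SL policy network: conv trunk + one logit per board point."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(IN_PLANES, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(64, 1, 1)

    def forward(self, x):                   # x: (batch, IN_PLANES, 19, 19)
        logits = self.head(self.conv(x))
        return logits.view(x.size(0), -1)   # (batch, 361) move logits

net = PolicyNet()
opt = torch.optim.SGD(net.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

def sl_step(boards, expert_moves):
    """Supervised step: maximize log-probability of the human move (plain classification)."""
    opt.zero_grad()
    loss = loss_fn(net(boards), expert_moves)   # expert_moves: indices in [0, 361)
    loss.backward()
    opt.step()
    return loss.item()
```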
RL policy network
• Used same network structure as supervised network
• Initialized to same weights as supervised network
• Trained using policy gradient
• Trained initially against supervised network, then against previous versions of the RL policy network (with some randomization)
• Is this correct? NO!
• Still worked very well!
  • 80% win rate against supervised network
  • 85% win rate against Pachi, a very strong MCTS program at the time (2 amateur dan)
  • Surprising that it played this well with no lookahead
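A hedged sketch of the policy-gradient (REINFORCE-style) self-play update, assuming a policy network like the one sketched above and a final game outcome z of +1 (win) or -1 (loss); the function and variable names are assumptions, not the paper's exact update:

```python
import torch

# Self-play: the learner plays a full game against a frozen earlier copy of itself,
# then every one of its own moves is reinforced by the final outcome z.

def reinforce_update(policy, opt, game_states, game_moves, z):
    """game_states/game_moves: the learner's positions and chosen moves; z: +1 or -1."""
    opt.zero_grad()
    logits = policy(torch.stack(game_states))
    logp = torch.log_softmax(logits, dim=-1)
    chosen = logp[torch.arange(len(game_moves)), torch.tensor(game_moves)]
    loss = -(z * chosen).sum()    # gradient ascent on z * log pi(a|s)
    loss.backward()
    opt.step()
```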
Value network
• Should predict probability of win given board position
  • Assuming both players play the same policy
  • Is this a reasonable thing to assume?
• Tried training on the human database, but this didn’t do well
  • No single policy used?
  • Relatively small amount of training data, so overfit
  • Paper mentions correlations, but not clear
• Trained on data generated from the RL policy network
  • Did reasonably well: 0.234 MSE
  • Did not overfit badly
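A minimal sketch of the value-network regression, assuming positions sampled from RL self-play games labeled with the eventual outcome z in {-1, +1}; the architecture and names are illustrative:

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Toy value net: conv trunk, then a single tanh output predicting the game outcome."""
    def __init__(self, in_planes=17):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_planes, 64, 3, padding=1), nn.ReLU(),
            nn.Flatten(), nn.Linear(64 * 19 * 19, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Tanh(),
        )

    def forward(self, x):
        return self.trunk(x).squeeze(-1)   # predicted outcome in [-1, 1]

def value_step(net, opt, boards, outcomes):
    """Plain MSE regression of the predicted outcome toward the actual game result."""
    opt.zero_grad()
    loss = ((net(boards) - outcomes) ** 2).mean()
    loss.backward()
    opt.step()
    return loss.item()
```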
MCTS
• Does MCTS with exploration bonus
• The first time a new node is encountered:
  • Evaluated by the value network, AND
  • A rollout using the fast, supervised policy network for each action
  • Initializes Q-value estimates for each action
• Exploration bonuses weighted by prior probabilities = policy network distribution
• Value-network and rollout results mixed using a voodoo constant
• When time is up (done searching):
  • Algorithm chooses the most-visited child of the root
  • Ignores the values, though the most-visited child will tend to have high value
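A simplified sketch of the selection rule, leaf evaluation, and final move choice described above, assuming a tree node object with children, prior, visits, and q fields; the constants (C_PUCT, LAMBDA) and exact formula details are assumptions, not the paper's precise definitions:

```python
import math

C_PUCT = 1.0   # exploration constant ("voodoo constant" territory)
LAMBDA = 0.5   # assumed mixing weight between value-network estimate and rollout result

def select_action(node):
    """Pick the child maximizing Q plus an exploration bonus scaled by the policy prior."""
    total_visits = sum(child.visits for child in node.children.values())
    def score(a, child):
        u = C_PUCT * node.prior[a] * math.sqrt(total_visits) / (1 + child.visits)
        return child.q + u
    return max(node.children.items(), key=lambda kv: score(*kv))[0]

def evaluate_leaf(value_net_estimate, rollout_result):
    """Mix the value-network prediction with the fast-rollout outcome."""
    return (1 - LAMBDA) * value_net_estimate + LAMBDA * rollout_result

def choose_move(root):
    """After search time is up: play the most-visited child, as in the slide above."""
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```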
Observations
• Supervised learning network (trained on human moves) was better for rollouts than RL network, even though RL network produced a stronger player
• Why? Authors argue that supervised network covered the space more, but that’s not a rigorous argument
Performance
• On a single machine, AlphaGo significantly outperformed all available computer players (see figure 4 in paper)
• Rollouts, the value network, and the policy network were all individually pretty strong
• Combining them made them stronger
• Distributed version beat the best European go player
Comparison with Deep Blue
• Deep Blue (best computer chess player)
• Deep Blue used little/no machine learning
• Evaluated more board positions (more brute force)
• Supervised and reinforcement learning reduce the search space
  • Learned value function and rollouts initialize new leaves w/reasonable values
  • Less time spent on exhaustive search
AlphaGo History
• Nature paper published in January 2016
• Played best living human go player, Lee Sedol (possibly one of the strongest go players ever), in March 2016
  • Used between 1K and 2K CPUs, 150-300 GPUs, and possibly Google’s new TPUs (not clear if these were counted as GPUs)
  • Won 4/5 games
• Subsequently revealed that:
  • Value network was tuned by self-play
  • Used bigger NN and 48 TPUs
• Play was described as surprising and original
• Made moves that were unexpected at first, but made sense in hindsight
AlphaGo History II
• Played the current best go player (Ke Jie) in May 2017
• Won 3/3 games
• Livestream was censored in China: https://www.theguardian.com/technology/2017/may/24/china-censored-googles-alphago-match-against-worlds-best-go-player
Surprising things about AlphaGo
• Rollouts with the fast, supervised policy better than rollouts with the RL policy – enables more searching?
• RL policy used only indirectly, to train the value function
• Searchless play was surprisingly good for both the RL policy and V
• Contrast with chess:
  • Search seems essential for reasonable chess play
  • Rollouts less helpful
  • Deep Blue was a hardware triumph
  • AlphaGo is an AI/ML triumph
AlphaGo Zero
What is AlphaGo Zero?
• Announced in mid October 2017 with much hype
• Learns to play with “zero” human knowledge
• No obvious relationship to Coke Zero
Differences from AlphaGo Classic I
• Uses a single convolutional network to propose actions (softmax) and produce value estimates
• First time a node is visited: NN assigns a value and probabilities of actions (same as classic?)
• No handcrafted features (were these previously discussed?)
• No rollouts
• Used just a single machine with 4 TPUs
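A minimal sketch of what such a single two-headed network can look like (policy logits plus a scalar value from a shared trunk); the layer sizes here are simplified stand-ins, not the published residual architecture:

```python
import torch
import torch.nn as nn

class DualHeadNet(nn.Module):
    """Shared conv trunk with a policy head (logits over 361 points + pass)
    and a value head (tanh scalar), roughly in the AlphaGo Zero style."""
    def __init__(self, in_planes=17, channels=64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_planes, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        self.policy_head = nn.Sequential(
            nn.Conv2d(channels, 2, 1), nn.ReLU(), nn.Flatten(),
            nn.Linear(2 * 19 * 19, 19 * 19 + 1),   # one logit per point, plus pass
        )
        self.value_head = nn.Sequential(
            nn.Conv2d(channels, 1, 1), nn.ReLU(), nn.Flatten(),
            nn.Linear(19 * 19, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Tanh(),
        )

    def forward(self, x):
        h = self.trunk(x)
        return self.policy_head(h), self.value_head(h).squeeze(-1)
```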
Differences from AlphaGo Classic II
• Exploits rotation and reflection invariance (did classic do this?)
• When the game ends:
  • Neural network is trained to maximize similarity between predicted and actual outcome for each state in the game
  • Network is also trained to make action probabilities consistent with MCTS action selection probabilities
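A hedged sketch of a combined training step matching those two targets (value regressed toward the game outcome, policy cross-entropy against the MCTS visit distribution), assuming a two-headed network like the sketch above; the L2 weight and variable names are assumptions:

```python
import torch
import torch.nn.functional as F

def zero_train_step(net, opt, boards, mcts_probs, outcomes, l2=1e-4):
    """boards: batch of self-play positions.
    mcts_probs: MCTS visit-count distributions over moves for those positions.
    outcomes: final game result z (+1 / -1) from the mover's perspective."""
    opt.zero_grad()
    policy_logits, value = net(boards)
    value_loss = F.mse_loss(value, outcomes)                               # (z - v)^2
    policy_loss = -(mcts_probs * F.log_softmax(policy_logits, -1)).sum(dim=1).mean()
    reg = l2 * sum((p ** 2).sum() for p in net.parameters())               # assumed weight decay
    loss = value_loss + policy_loss + reg
    loss.backward()
    opt.step()
    return loss.item()
```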
Training
• Millions of games of self-play
• 64 GPUs, 19 CPUs
• Surpassed the version that defeated Lee Sedol after just 36 hours of training
• After ~30 days, surpassed AlphaGo Master – a version that had been beating top human masters 60-0 online
• See figure 6
Remarkable things about AlphaGo Zero
• Compare w/TD-Gammon
  • TD-Gammon did pretty well without expert features
  • Needed expert features to excel
• AlphaGo Zero used raw board positions (essentially images)
  • Advantage of convolutional networks?
• Independently learned expert-level knowledge of important board/endgame configurations
• Developed new approaches to known Go problems
• Simpler and cleaner algorithm than classic
• Reduced computational resources at execution time
Why this matters
• Previously viewed as a challenge problem for AI
• Previous challenge problems were solved using:
  • Simple algorithms and massive hardware (Deep Blue)
  • Special-purpose hardware and algorithms (autonomous driving)
• AlphaGo Zero uses almost no domain-specific human knowledge:
  • Rules of the game
  • Symmetries
Is this a watershed event for AI/ML?
• Could be
• Some questions that need to be answered:
  • MCTS helps for other domains, but seems like a particularly big win for Go
  • Can MCTS + RL be as big of a win for other domains?
  • Human Go ability leveraged human pattern recognition prowess
  • Was human Go dominance an artifact of a vanishing edge humans had in pattern recognition?
  • Are humans the right benchmark?
  • Can this be done with fewer resources than a Google-like firm?
  • What does this say about general intelligence?