

AlphaGo Overview
Ron Parr
CompSci 590.2
Duke University

Overview

• Digression: Zero sum, alternating move games

• Original AlphaGo approach

• Recent AlphaGo Zero approach


Alternating move, zero-sum, 2-player games

• Ordinary Bellman equation:
  $V(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V(s') \right]$

• Two player:
  $V_{\max}(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V_{\min}(s') \right]$
  $V_{\min}(s) = \min_a \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V_{\max}(s') \right]$

Vmin and Vmax

• Vmin is, by definition, the negative of Vmax

• No need to store two separate value functions

• Can store just one and flip the sign based upon who is playing
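
A minimal sketch (not from the slides) of this sign-flipping trick: value iteration on a hypothetical random zero-sum game, storing a single value function from the perspective of the player to move and negating successor values in the backup. The model arrays R and P are made up for illustration.

import numpy as np

# Hypothetical tabular game model (made up for illustration).
n_states, n_actions, gamma = 5, 2, 0.9
rng = np.random.default_rng(0)
R = rng.normal(size=(n_states, n_actions))                          # R[s, a]
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))    # P[s, a, s']

# V[s] = value of state s from the perspective of the player about to move.
V = np.zeros(n_states)
for _ in range(200):
    # The mover maximizes; successor values are negated because they are
    # stored from the *opponent's* perspective (V_min = -V_max).
    Q = R + gamma * (P @ (-V))          # Q[s, a]
    V = Q.max(axis=1)

print(np.round(V, 3))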


Original AlphaGo

Ingredients

• Policy network trained by supervised learning (imitating expert moves)

• Policy network trained by policy gradient

• Value function network

• Monte Carlo Tree Search (MCTS)


Supervised Policy Network

• Similar to Neurogammon – just tried to mimic human moves
• CNN with softmax output
• Trained on 30 million expert moves
• 57% accuracy on held-out test set

• Seems low, but keep in mind that this is not a binary prediction problem, so this is way better than flipping a coin

• Also trained a faster “rollout” network that got 24% accuracy – used this for fast rollouts
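
A rough illustration of the supervised step (assumed layer sizes, not the paper's architecture): a small CNN with a softmax over the 19x19 board, trained by cross-entropy to imitate expert moves. The batch of positions and move labels here is fake, and the pass move is omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNet(nn.Module):
    def __init__(self, in_planes=4, width=32):
        super().__init__()
        self.conv1 = nn.Conv2d(in_planes, width, 3, padding=1)
        self.conv2 = nn.Conv2d(width, width, 3, padding=1)
        self.head = nn.Conv2d(width, 1, 1)      # one move logit per board point

    def forward(self, x):                       # x: (batch, planes, 19, 19)
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        return self.head(x).flatten(1)          # (batch, 361) move logits

net = PolicyNet()
opt = torch.optim.SGD(net.parameters(), lr=0.01)

# One imitation step on a fake batch of (position, expert move) pairs.
boards = torch.randn(8, 4, 19, 19)
expert_moves = torch.randint(0, 361, (8,))
loss = F.cross_entropy(net(boards), expert_moves)   # softmax + log-loss
opt.zero_grad(); loss.backward(); opt.step()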

RL policy network

• Used same network structure as supervised network
• Initialized to same weights as supervised network
• Trained using policy gradient
• Trained initially against supervised network, then against previous versions of the RL policy network (with some randomization)
  • Is this correct? NO!
  • Still worked very well!
• 80% win rate against supervised network
• 85% win rate against Pachi, a very strong MCTS program at the time (2 amateur dan)
• Surprising that it played this well with no lookahead
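
A minimal sketch of the policy-gradient step (REINFORCE on the final game outcome). The toy linear policy and the play_game stub are hypothetical stand-ins: the opponent's moves and the Go rules are not simulated, and none of this is the paper's actual code.

import torch
import torch.nn as nn

# Toy policy over a flattened 19x19 board (361 move logits; pass omitted).
policy = nn.Sequential(nn.Flatten(), nn.Linear(19 * 19, 361))
frozen_opponent = nn.Sequential(nn.Flatten(), nn.Linear(19 * 19, 361))
frozen_opponent.load_state_dict(policy.state_dict())  # earlier snapshot (unused in this stub)
opt = torch.optim.SGD(policy.parameters(), lr=0.01)

def play_game(policy):
    """Hypothetical stand-in for a full game: returns the log-probabilities
    of the learner's sampled moves and the final outcome z in {+1, -1}."""
    log_probs = []
    for _ in range(10):                              # fake 10 learner moves
        board = torch.randn(1, 19, 19)               # fake board observation
        dist = torch.distributions.Categorical(logits=policy(board))
        move = dist.sample()
        log_probs.append(dist.log_prob(move))
    z = 1.0 if torch.rand(()) < 0.5 else -1.0        # fake win/loss outcome
    return torch.stack(log_probs), z

# REINFORCE: raise the probability of every move played in a won game,
# lower it for a lost game.
log_probs, z = play_game(policy)
loss = -(z * log_probs).sum()
opt.zero_grad(); loss.backward(); opt.step()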


Value network

• Should predict probability of win given board position
  • Assuming both players play the same policy
  • Is this a reasonable thing to assume?
• Tried training on the human database, but this didn't do well
  • No single policy used?
  • Relatively small amount of training data, so overfit
  • Paper mentions correlations, but not clear
• Trained on data generated from the RL policy network
  • Did reasonably well: 0.234 MSE
  • Did not overfit badly
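
A minimal sketch of the value-network regression (assumed sizes, not the paper's architecture): a network maps a position to a scalar outcome prediction in [-1, 1] and is fit by mean-squared error to game results. The batch of positions and outcomes is fake.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy value network: board planes in, scalar outcome prediction out.
value_net = nn.Sequential(
    nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(32 * 19 * 19, 1), nn.Tanh(),           # prediction in [-1, 1]
)
opt = torch.optim.SGD(value_net.parameters(), lr=0.01)

# Fake batch: positions sampled from RL-policy self-play games, each labeled
# with that game's final outcome z (+1 win, -1 loss).
positions = torch.randn(16, 4, 19, 19)
outcomes = (torch.rand(16) < 0.5).float() * 2 - 1

pred = value_net(positions).squeeze(1)
loss = F.mse_loss(pred, outcomes)    # regression target: the game outcome
opt.zero_grad(); loss.backward(); opt.step()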

MCTS

• Does MCTS with an exploration bonus
• The first time a new node is encountered:
  • Evaluated by the value network, AND
  • A rollout using the fast, supervised policy network for each action
  • Initializes Q-value estimates for each action
• Exploration bonuses are weighted by prior probabilities = the policy network distribution
• Value-network and rollout results are mixed using a voodoo constant

• When time is up (done searching):
  • Algorithm chooses the most visited node from the root
  • Ignores the value estimates at the root, though the most visited node will tend to have high value
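
A minimal sketch of the flavor of this search (not the actual implementation): action selection by value plus a prior-weighted exploration bonus, leaves scored by mixing a value-network estimate with a rollout result, and the final move chosen by visit count. The constants and the stand-in leaf evaluations are hypothetical.

import math, random

class Node:
    def __init__(self, priors):
        self.N = {a: 0 for a in priors}        # visit counts
        self.Q = {a: 0.0 for a in priors}      # mean action values
        self.P = priors                        # policy-network priors

def select_action(node, c_puct=1.0):
    total = sum(node.N.values()) + 1
    # Q plus an exploration bonus that shrinks with visits and is scaled by
    # the prior probability from the policy network.
    return max(node.N, key=lambda a: node.Q[a] +
               c_puct * node.P[a] * math.sqrt(total) / (1 + node.N[a]))

def evaluate_leaf(value_net_estimate, rollout_result, lam=0.5):
    # Mix the value-network prediction with a fast-rollout outcome using a
    # mixing constant (the "voodoo constant" above).
    return (1 - lam) * value_net_estimate + lam * rollout_result

def backup(node, action, leaf_value):
    node.N[action] += 1
    node.Q[action] += (leaf_value - node.Q[action]) / node.N[action]

# Toy usage: a few simulations from a root with made-up priors and random
# leaf evaluations, then play the most visited move.
root = Node({"A": 0.6, "B": 0.3, "C": 0.1})
for _ in range(100):
    a = select_action(root)
    leaf = evaluate_leaf(random.uniform(-1, 1), random.choice([-1, 1]))
    backup(root, a, leaf)
best_move = max(root.N, key=root.N.get)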


Observations

• Supervised learning network (trained on human moves) was better for rollouts than RL network, even though RL network produced a stronger player

• Why? Authors argue that supervised network covered the space more, but that’s not a rigorous argument

Performance

• On a single machine, AlphaGo significantly outperformed all available computer players (see figure 4 in the paper)

• Rollouts, the value network, and the policy network were all individually pretty strong

• Combining them made the overall system stronger

• Distributed version beat the best European go player


Comparison with Deep Blue

• Deep Blue (best computer chess player):
  • Used little/no learning
  • Evaluated more board positions (more brute force)

• AlphaGo:
  • Supervised and reinforcement learning reduce the search space
  • Learned value function and rollouts initialize new leaves w/ reasonable values
  • Less time spent on exhaustive search

AlphaGo History

• Nature paper published in January 2016
• Played best living human go player, Lee Sedol (possibly one of the strongest go players ever), in March 2016
• Used between 1K and 2K CPUs, 150-300 GPUs, and possibly Google's new TPUs (not clear if these were counted as GPUs)
• Won 4/5 games

• Subsequently revealed that:
  • Value network was tuned by self play
  • Used bigger NN and 48 TPUs

• Play was described as surprising and original
  • Made moves that were unexpected at first, but made sense in hindsight


AlphaGo History II

• Played the current best go player (Ke Jie) in May 2017

• Won 3/3 games

• Livestream was censored in China: https://www.theguardian.com/technology/2017/may/24/china-censored-googles-alphago-match-against-worlds-best-go-player

Surprising things about AlphaGo

• Rollouts with fast, supervised policy better than rollouts with RL policy – enables more searching?
• RL policy used only indirectly to train value function
• Searchless play was surprisingly good for both RL policy and V
• Contrast with chess:
  • Search seems essential for reasonable chess play
  • Rollouts less helpful
• Deep Blue was a hardware triumph
• AlphaGo is an AI/ML triumph


AlphaGo Zero

What is AlphaGo Zero?

• Announced in mid October 2017 with much hype

• Learns to play with “zero” human knowledge

• No obvious relationship to Coke Zero


Differences from AlphaGo Classic I

• Uses a single convolutional network to propose actions (softmax) and produce value estimates
• First time a node is visited: NN assigns value and probabilities of actions (same as classic?)
• No handcrafted features (were these previously discussed?)
• No rollouts
• Used just a single machine with 4 TPUs
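
A minimal sketch of this two-headed design (assumed sizes; the real network is a much deeper residual net, and the pass move is omitted here): a shared convolutional trunk feeding a softmax policy head and a tanh value head.

import torch
import torch.nn as nn

class DualNet(nn.Module):
    """Toy two-headed network: shared conv trunk, policy head with one
    logit per board point, scalar value head in [-1, 1]."""
    def __init__(self, in_planes=3, width=64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_planes, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
        )
        self.policy_head = nn.Conv2d(width, 1, 1)
        self.value_head = nn.Sequential(
            nn.Flatten(), nn.Linear(width * 19 * 19, 1), nn.Tanh())

    def forward(self, x):                      # x: (batch, planes, 19, 19)
        h = self.trunk(x)
        p = self.policy_head(h).flatten(1)     # (batch, 361) move logits
        v = self.value_head(h).squeeze(1)      # (batch,) value estimate
        return p, v

net = DualNet()
logits, value = net(torch.randn(2, 3, 19, 19))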

Differences from AlphaGo Classic II

• Exploits rotation and reflection invariance (did classic do this?)

• When game ends:
  • Neural network is trained to maximize similarity between predicted and actual outcome for each state in the game
  • Network is also trained to make action probabilities consistent with MCTS action selection probabilities
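
A minimal sketch of these two training targets: for each position of a finished self-play game, the value head regresses on the game outcome z (squared error) and the policy head is pushed toward the MCTS visit distribution pi (cross-entropy). The tensors below are fake placeholders for network outputs and targets.

import torch
import torch.nn.functional as F

batch = 8
policy_logits = torch.randn(batch, 361, requires_grad=True)      # policy head output
value_pred = torch.tanh(torch.randn(batch, requires_grad=True))  # value head output
mcts_pi = torch.softmax(torch.randn(batch, 361), dim=1)          # MCTS visit distribution
outcome_z = (torch.rand(batch) < 0.5).float() * 2 - 1            # final game result

# Value head: match the actual outcome.  Policy head: match the MCTS
# action-selection probabilities.
value_loss = F.mse_loss(value_pred, outcome_z)
policy_loss = -(mcts_pi * F.log_softmax(policy_logits, dim=1)).sum(dim=1).mean()
loss = value_loss + policy_loss      # (plus weight regularization in the real setup)
loss.backward()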


Training

• Millions of games of self-play
• 64 GPUs, 19 CPUs
• Surpassed the version that defeated Lee Sedol after just 36 hours of training
• After ~30 days, surpassed AlphaGo Master – a version that had been beating top human masters 60-0 online
• See figure 6

Remarkable things about AlphaGo Zero

• Compare w/ TD-Gammon:
  • TD-Gammon did pretty well without expert features
  • Needed expert features to excel
• AlphaGo Zero used raw board positions (essentially images)
  • Advantage of convolutional networks?
• Independently learned expert-level knowledge of important board/end-game configurations
• Developed new approaches to known Go problems
• Simpler and cleaner algorithm than classic
• Reduced computational resources at execution time
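
A minimal sketch of what "raw board positions as images" can mean, simplified from the paper's multi-plane input (which also stacks recent board history): the position is encoded as binary feature planes a convolutional network can consume directly, with no handcrafted Go features.

import numpy as np

def encode(board, to_move):
    """board: 19x19 ints with 1 = black stone, -1 = white stone, 0 = empty.
    to_move: +1 if black plays next, -1 if white."""
    planes = np.stack([
        (board == 1).astype(np.float32),                            # black stones
        (board == -1).astype(np.float32),                           # white stones
        np.full((19, 19), float(to_move == 1), dtype=np.float32),   # side to move
    ])
    return planes                                                   # shape (3, 19, 19)

board = np.zeros((19, 19), dtype=np.int8)
board[3, 3], board[15, 15] = 1, -1       # one black and one white stone
print(encode(board, to_move=1).shape)    # (3, 19, 19)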


Why this matters

• Go was previously viewed as a challenge problem for AI

• Previous challenge problems were solved using:
  • Simple algorithms and massive hardware (Deep Blue)
  • Special purpose hardware and algorithms (autonomous driving)

• AlphaGo Zero uses almost no domain specific human knowledge – only:
  • Rules of the game
  • Symmetries

Is this a watershed event for AI/ML?

• Could be
• Some questions that need to be answered:
  • MCTS helps for other domains, but seems like a particularly big win for Go
  • Can MCTS + RL be as big of a win for other domains?
  • Human Go ability leveraged human pattern-recognition prowess
  • Was human Go dominance an artifact of a vanishing edge humans had in pattern recognition?
  • Are humans the right benchmark?
  • Can this be done with fewer resources than a Google-like firm has?
  • What does this say about general intelligence?
