AlphaGo Overview
Ron Parr
CompSci 590.2, Duke University
10/24/17

Overview
• Digression: zero-sum, alternating-move games
• Original AlphaGo approach
• Recent AlphaGo Zero approach

Alternating-move, zero-sum, 2-player games
• Ordinary Bellman equation:
    V(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V(s') \right]
• Two-player version:
    V_{max}(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V_{min}(s') \right]
    V_{min}(s) = \min_a \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V_{max}(s') \right]

Vmin and Vmax
• V_min is, by definition, the negative of V_max
• No need to store two separate value functions
• Can store just one and flip the sign based upon who is playing

Original AlphaGo Ingredients
• Policy network trained by supervised learning
• Policy network trained by policy gradient
• Value function network
• Monte Carlo Tree Search (MCTS)

Supervised Policy Network
• Similar to Neurogammon – just tried to mimic human moves
• CNN with softmax output layer
• Trained on 30 million expert moves
• 57% accuracy on a held-out test set
  • Seems low, but keep in mind that this is not a binary prediction problem, so this is far better than flipping a coin
• Also trained a faster "rollout" network that got 24% accuracy – used for fast rollouts

RL Policy Network
• Used the same network structure as the supervised network
• Initialized to the same weights as the supervised network
• Trained using policy gradient
• Trained initially against the supervised network, then against previous versions of the RL policy network (with some randomization)
  • Is this correct? NO!
  • Still worked very well:
    • 80% win rate against the supervised network
    • 85% win rate against Pachi, a very strong MCTS program at the time (2 amateur dan)
• Surprising that it played this well with no lookahead

Value Network
• Should predict the probability of a win given a board position
  • Assuming both players play the same policy
  • Is this a reasonable thing to assume?
• Tried training on the human database, but this didn't do well
  • No single policy used?
  • Relatively small amount of training data, so it overfit
  • Paper mentions correlations, but not clear
• Trained on data generated from the RL policy network
  • Did reasonably well: 0.234 MSE
  • Did not overfit badly

MCTS
• Does MCTS with an exploration bonus
• The first time a new node is encountered:
  • Evaluated by the value network, AND
  • A rollout using the fast, supervised policy network
  • Initializes Q-value estimates for each action
• Exploration bonuses weighted by prior probabilities = policy network distribution
  • Results mixed together using a voodoo constant
• When time is up (done searching):
  • Algorithm chooses the most visited node from the root
  • Ignores the value of the root, though the most visited node will tend to have high value
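The MCTS slide above describes a prior-weighted exploration bonus and a most-visited move choice at the root. The Python sketch below is only meant to illustrate that selection rule; the Node class, the c_puct constant (the "voodoo constant"), and the "+1" inside the square root are simplifying assumptions for illustration, not the paper's exact implementation.

```python
import math

class Node:
    """Per-action search statistics at one board state."""
    def __init__(self, priors):
        self.P = priors                        # prior prob. per action, from the policy network
        self.N = {a: 0 for a in priors}        # visit counts
        self.W = {a: 0.0 for a in priors}      # summed backed-up values
        self.Q = {a: 0.0 for a in priors}      # mean backed-up values

    def select(self, c_puct=1.0):
        """Pick the action maximizing Q plus a prior-weighted exploration bonus."""
        sqrt_total = math.sqrt(sum(self.N.values()) + 1)  # +1 avoids a zero bonus at a fresh node
        def score(a):
            bonus = c_puct * self.P[a] * sqrt_total / (1 + self.N[a])
            return self.Q[a] + bonus
        return max(self.P, key=score)

    def update(self, action, value):
        """Back up a leaf evaluation (value network and/or rollout result)."""
        self.N[action] += 1
        self.W[action] += value
        self.Q[action] = self.W[action] / self.N[action]

def best_move(root):
    """When search time is up, play the most visited action at the root."""
    return max(root.N, key=root.N.get)
```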
Observations
• The supervised learning network (trained on human moves) was better for rollouts than the RL network, even though the RL network produced a stronger player
• Why? The authors argue that the supervised network covered the space more, but that's not a rigorous argument

Performance
• On a single CPU, AlphaGo significantly outperformed all available computer players (see figure 4 in the paper)
• Rollouts, the value network, and the policy network were all individually pretty strong
• Combining them made them stronger
• The distributed version beat the best European go player

Comparison with Deep Blue
• Deep Blue (best computer chess player):
  • Used little/no machine learning
  • Evaluated more board positions (more brute force)
• Supervised and reinforcement learning reduce the search space
  • Learned value function and rollouts initialize new leaves with reasonable values
  • Less time spent on exhaustive search

AlphaGo History
• Nature paper published in January 2016
• Played the best living human go player, Lee Sedol (possibly one of the strongest go players ever), in March 2016
  • Used between 1K and 2K CPUs, 150–300 GPUs, and possibly Google's new TPUs (not clear if these were counted as GPUs)
  • Won 4/5 games
• Subsequently revealed that:
  • The value network was tuned by self-play
  • Used a bigger NN and 48 TPUs
• Play was described as surprising and original
  • Made moves that were unexpected at first, but made sense in hindsight

AlphaGo History II
• Played the current best go player (Ke Jie) in May 2017
• Won 3/3 games
• Livestream was censored in China: https://www.theguardian.com/technology/2017/may/24/china-censored-googles-alphago-match-against-worlds-best-go-player

Surprising things about AlphaGo
• Rollouts with the fast, supervised policy were better than rollouts with the RL policy – enables more searching?
• RL policy used only indirectly, to train the value function
• Searchless play was surprisingly good for both the RL policy and V
• Contrast with chess:
  • Search seems essential for reasonable chess play
  • Rollouts less helpful
• Deep Blue was a hardware triumph
• AlphaGo is an AI/ML triumph

AlphaGo Zero

What is AlphaGo Zero?
• Announced in mid-October 2017 with much hype
• Learns to play with "zero" human knowledge
• No obvious relationship to Coke Zero

Differences from AlphaGo Classic I
• Uses a single convolutional network to propose actions (softmax) and produce value estimates
• First time a node is visited: the NN assigns a value and action probabilities (same as classic?)
• No handcrafted features (were these previously discussed?)
• No rollouts
• Used just a single machine with 4 TPUs

Differences from AlphaGo Classic II
• Exploits rotation and reflection invariance (did classic do this?)
• When a game ends:
  • The neural network is trained to maximize similarity between the predicted and actual outcome for each state in the game
  • The network is also trained to make its action probabilities consistent with the MCTS action selection probabilities
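The two training targets on the slide above (predicted vs. actual outcome, and policy vs. MCTS action selection) are naturally expressed as a single combined objective. Below is a minimal PyTorch-style sketch of such a loss; the function name, tensor shapes, and the absence of any regularization term are illustrative assumptions, not the paper's exact specification.

```python
import torch
import torch.nn.functional as F

def combined_loss(policy_logits, value_pred, mcts_probs, outcome):
    """Sketch of a two-headed policy/value training objective.

    policy_logits: (batch, num_moves) raw scores from the policy head
    value_pred:    (batch,) predicted outcome in [-1, 1] from the value head
    mcts_probs:    (batch, num_moves) MCTS visit-count distribution (target)
    outcome:       (batch,) actual game result from the current player's view (+1/-1)
    """
    # Value head: push the predicted outcome toward the actual game outcome.
    value_loss = F.mse_loss(value_pred, outcome)

    # Policy head: cross-entropy between the MCTS visit distribution and the
    # network's move probabilities.
    log_probs = F.log_softmax(policy_logits, dim=1)
    policy_loss = -(mcts_probs * log_probs).sum(dim=1).mean()

    return value_loss + policy_loss
```

In self-play training, every position of a finished game contributes one (mcts_probs, outcome) pair to this objective.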
Training
• Millions of games of self-play
• 64 GPUs, 19 CPUs
• Surpassed the version that defeated Lee Sedol after just 36 hours of training
• After ~30 days, surpassed AlphaGo Master – a version that had been beating top human masters 60-0 online
• See figure 6

Remarkable things about AlphaGo Zero
• Compare with TD-Gammon:
  • TD-Gammon did pretty well without expert features
  • Needed expert features to excel
• AlphaGo Zero used raw board positions (essentially images)
  • Advantage of convolutional networks?
• Independently learned expert-level knowledge of important board/endgame configurations
• Developed new approaches to known Go problems
• Simpler and cleaner algorithm than classic
• Reduced computational resources at execution time

Why this matters
• Go was previously viewed as a challenge problem for AI
• Previous challenge problems were solved using:
  • Simple algorithms and massive hardware (Deep Blue)
  • Special-purpose hardware and algorithms (autonomous driving)
• AlphaGo Zero uses almost no domain-specific human knowledge:
  • Rules of the game
  • Symmetries (rotation/reflection – see the sketch below)

Is this a watershed event for AI/ML?
• Could be
• Some questions that need to be answered:
  • MCTS helps in other domains, but seems like a particularly big win for Go
  • Can MCTS + RL be as big of a win for other domains?
  • Human Go ability leveraged human pattern-recognition prowess
  • Was human Go dominance an artifact of a vanishing edge humans had in pattern recognition?
  • Are humans the right benchmark?
  • Can this be done with fewer resources than a Google-like firm?
  • What does this say about general intelligence?
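The board symmetries mentioned above (four rotations and their reflections) are one of the two pieces of domain knowledge the slides list. The NumPy sketch below shows one simple way such symmetries can be used to augment training positions; the array layout (board and policy target as same-shaped 2-D arrays, pass moves ignored) is an assumption for illustration only.

```python
import numpy as np

def symmetries(board, policy):
    """Yield the eight rotations/reflections of a board and its policy target.

    board:  (N, N) array of stone encodings
    policy: (N, N) array of move probabilities aligned with the board
    (A pass move, if present, would be handled separately; omitted here.)
    """
    for k in range(4):                      # 0, 90, 180, 270 degree rotations
        b = np.rot90(board, k)
        p = np.rot90(policy, k)
        yield b, p
        yield np.fliplr(b), np.fliplr(p)    # mirror image of each rotation

# Example: one training position expands into eight equivalent ones.
board = np.zeros((19, 19), dtype=np.int8)
policy = np.full((19, 19), 1.0 / (19 * 19))
augmented = list(symmetries(board, policy))
assert len(augmented) == 8
```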
