APPROXIMATE UNIVERSAL ARTIFICIAL INTELLIGENCE AND SELF-PLAY LEARNING FOR GAMES

Doctor of Philosophy Dissertation
School of Computer Science and Engineering

Joel Veness

Supervisors: Kee Siong Ng, Marcus Hutter, Alan Blair, William Uther, John Lloyd

January 2011

Joel Veness: Approximate Universal Artificial Intelligence and Self-play Learning for Games, Doctor of Philosophy Dissertation, © January 2011

When we write programs that learn, it turns out that we do and they don't.
— Alan Perlis

ABSTRACT

This thesis is split into two independent parts.

The first is an investigation of some practical aspects of Marcus Hutter's Universal Artificial Intelligence theory [29]. The main contributions are to show how a very general agent can be built and analysed using the mathematical tools of this theory. Before the work presented in this thesis, it was an open question as to whether this theory was of any relevance to reinforcement learning practitioners. This work suggests that it is indeed relevant and worthy of future investigation.

The second part of this thesis looks at self-play learning in two-player, deterministic, adversarial turn-based games. The main contribution is the introduction of a new technique for training the weights of a heuristic evaluation function from data collected by classical game tree search algorithms. This method is shown to outperform previous self-play training routines based on Temporal Difference learning when applied to the game of Chess. In particular, the main highlight was using this technique to construct a Chess program that learnt to play master-level Chess by tuning a set of initially random weights from self-play games.

PUBLICATIONS

A significant portion of the technical content of this thesis has been previously published at leading international Artificial Intelligence conferences and peer-reviewed journals. The relevant material for Part I includes:

• Reinforcement Learning via AIXI Approximation [82]
  Joel Veness, Kee Siong Ng, Marcus Hutter, David Silver
  Association for the Advancement of Artificial Intelligence (AAAI), 2010.

• A Monte-Carlo AIXI Approximation [83]
  Joel Veness, Kee Siong Ng, Marcus Hutter, William Uther, David Silver
  Journal of Artificial Intelligence Research (JAIR), 2010.

The relevant material for Part II includes:

• Bootstrapping from Game Tree Search [81]
  Joel Veness, David Silver, Will Uther, Alan Blair
  Neural Information Processing Systems (NIPS), 2009.

We should not only use the brains we have, but all that we can borrow.
— Woodrow Wilson

ACKNOWLEDGMENTS

Special thanks to Kee Siong for all his time, dedication and encouragement. This work would not have been possible without him. Thanks to Marcus for having the courage to both write his revolutionary book and take me on late as a PhD student. Thanks to Alan, John and Will for many helpful discussions and suggestions. Finally, a collective thanks to all of my supervisors for letting me pursue my own interests.

Thank you to UNSW and NICTA for the financial support that allowed me to write this thesis and attend a number of overseas conferences. Thank you to Peter Cheeseman for giving me my first artificial intelligence job. Thanks to my external collaborators, in particular David Silver and Shane Legg. Thank you to Michael Bowling and the University of Alberta for letting me finish my thesis on campus. Thank you to the external examiners for their constructive feedback. Thank you to the international cricket community for years of entertainment.
Thank you to the few primary, secondary and university teachers who kept me interested. Thank you to the science fiction community for being a significant source of inspiration, in particular Philip K. Dick, Ursula Le Guin, Frederik Pohl, Robert Heinlein and Vernor Vinge. Thank you to my family, and in particular my mother, who in spite of significant setbacks managed to ultimately do a good job. Thank you to my friends for all the good times. Thank you to my fiancée Felicity for her love and support. Finally, thanks to everyone who gave a word or two of encouragement along the way.

DECLARATION

I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgment is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged.

Joel Veness

CONTENTS

Part I  Approximate Universal Artificial Intelligence

1 Reinforcement Learning via AIXI Approximation
  1.1 Overview
  1.2 Introduction
    1.2.1 The General Reinforcement Learning Problem
    1.2.2 The AIXI Agent
    1.2.3 AIXI as a Principle
    1.2.4 Approximating AIXI
  1.3 The Agent Setting
    1.3.1 Agent Setting
    1.3.2 Reward, Policy and Value Functions
  1.4 Bayesian Agents
    1.4.1 Prediction with a Mixture Environment Model
    1.4.2 Theoretical Properties
    1.4.3 AIXI: The Universal Bayesian Agent
    1.4.4 Direct AIXI Approximation

2 Expectimax Approximation
  2.1 Background
  2.2 Overview
  2.3 Action Selection at Decision Nodes
  2.4 Chance Nodes
  2.5 Estimating Future Reward at Leaf Nodes
  2.6 Reward Backup
  2.7 Pseudocode
  2.8 Consistency of ρUCT
  2.9 Parallel Implementation of ρUCT

3 Model Class Approximation
  3.1 Context Tree Weighting
    3.1.1 Krichevsky-Trofimov Estimator
    3.1.2 Prediction Suffix Trees
    3.1.3 Action-conditional PST
    3.1.4 A Prior on Models of PSTs
    3.1.5 Context Trees
    3.1.6 Weighted Probabilities
    3.1.7 Action Conditional CTW as a Mixture Environment Model
  3.2 Incorporating Type Information
  3.3 Convergence to the True Environment
  3.4 Summary
  3.5 Relationship to AIXI

4 Putting It All Together
  4.1 Convergence of Value
  4.2 Convergence to Optimal Policy
  4.3 Computational Properties
  4.4 Efficient Combination of FAC-CTW with ρUCT
  4.5 Exploration/Exploitation in Practice
  4.6 Top-level Algorithm

5 Results
  5.1 Empirical Results
    5.1.1 Domains
    5.1.2 Experimental Setup
    5.1.3 Results
    5.1.4 Discussion
    5.1.5 Comparison to 1-ply Rollout Planning
    5.1.6 Performance on a Challenging Domain
  5.2 Discussion
    5.2.1 Related Work
    5.2.2 Limitations

6 Future Work
  6.1 Future Scalability
    6.1.1 Online Learning of Rollout Policies for ρUCT
    6.1.2 Combining Mixture Environment Models
    6.1.3 Richer Notions of Context for FAC-CTW
    6.1.4 Incorporating CTW Extensions
    6.1.5 Parallelization of ρUCT
    6.1.6 Predicting at Multiple Levels of Abstraction
  6.2 Conclusion
  6.3 Closing Remarks
Part II  Learning from Self-Play Using Game Tree Search

7 Bootstrapping from Game Tree Search
  7.1 Overview
  7.2 Introduction
  7.3 Background
  7.4 Minimax Search Bootstrapping
  7.5 Alpha-Beta Search Bootstrapping
    7.5.1 Updating Parameters in TreeStrap(αβ)
    7.5.2 The TreeStrap(αβ) Algorithm
  7.6 Learning Chess Program
  7.7 Experimental Results
    7.7.1 Relative Performance Evaluation
    7.7.2 Evaluation by Internet Play
  7.8 Conclusion

Bibliography

LIST OF FIGURES

Figure 1   A ρUCT search tree
Figure 2   An example prediction suffix tree
Figure 3   A depth-2 context tree (left). Resultant trees after processing one (middle) and two (right) bits respectively.
Figure 4   The MC-AIXI agent loop
Figure 5   The cheese maze
Figure 6   A screenshot (converted to black and white) of the PacMan domain
Figure 7   Average Reward per Cycle vs Experience
Figure 8   Performance versus ρUCT search effort
Figure 9   Online performance on a challenging domain
Figure 10  Scaling properties on a challenging domain
Figure 11  Online performance when using a learnt rollout policy on the Cheese Maze
Figure 12  Left: TD, TD-Root and TD-Leaf backups. Right: RootStrap(minimax) and TreeStrap(minimax).
Figure 13  Performance when trained via self-play starting from random initial weights. 95% confidence intervals are marked at each data point. The x-axis uses a logarithmic scale.

LIST OF TABLES

Table 1  Domain characteristics
Table 2  Binary encoding of the domains
Table 3  MC-AIXI(fac-ctw) model learning configuration
Table 4  U-Tree model learning configuration
Table 5  Resources required for (near) optimal performance by MC-AIXI(fac-ctw)
Table 6  Average reward per cycle: ρUCT versus 1-ply rollout planning
Table 7  Backups for various learning algorithms
Table 8  Best performance when trained by self-play. 95% confidence intervals given.
Table 9  Blitz performance at the Internet Chess Club

Part I

APPROXIMATE UNIVERSAL ARTIFICIAL INTELLIGENCE

Beware the Turing tar-pit, where everything is possible but nothing of interest is easy.
— Alan Perlis

1 REINFORCEMENT LEARNING VIA AIXI APPROXIMATION

1.1 Overview

This part of the thesis introduces a principled approach for the design of a scalable general reinforcement learning agent.