Self-Play Deep Learning for Games

Maximising Experiences

by Joseph West, BE Electrical (Hons), BSc Computers, MEng Sci
School of Electrical Engineering and Robotics
Queensland University of Technology

A dissertation submitted in fulfilment of the requirements for the degree of Doctor of Philosophy, 2020.

Keywords: Machine Learning, Deep Learning, Games, Monte-Carlo Tree Search, Reinforcement Learning, Curriculum Learning.

In accordance with the requirements of the degree of Doctor of Philosophy in the School of Electrical Engineering and Robotics, I present the following thesis entitled,

Self-Play Deep Learning for Games: Maximising Experiences

Statement of Original Authorship

The work contained in this thesis has not been previously submitted to meet requirements for an award at this or any other higher education institution. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference is made.

Signed, QUT Verified Signature

Joseph West 4 September 2020


Computer: Shall we play a game?

David: Love to. How about Global Thermonuclear War.

Computer: Wouldn’t you prefer a good game of Chess?

David: Later. Let’s play Global Thermonuclear War.

- WarGames, 1983

Acknowledgements

I would like to thank my wife Kerry for her patience, for supporting my indulgence in undertaking this research, and for her willingness to act as first reader for a topic she had no prior knowledge of. I would also like to acknowledge my children Joey and Mitchell for being mature enough to allow me to focus on my research, sometimes at the expense of spending more time with them.

To my supervisory team, Frederic, Simon and Cameron: I have enjoyed working with you, and whilst challenging, the process has been overwhelmingly positive. Thank you for your patience, mentoring and time.

This project has been a large consumer of computing resources and would not have been possible without the QUT High Performance Computing (HPC) resources and the support staff that maintain the system. The HPC staff have been responsive to my unique needs and their efforts are greatly appreciated. I would like to thank the Signal Processing, Artificial Intelligence and Vision Technologies (SAIVT) project for allocating additional resources to me for use in this research.

Finally, I would also like to acknowledge Dr Ross Cooper BSc (App Chem) Hons PhD Grad Dip Ed (Sec) CEng MIChemE RPEQ 19081 for reviewing the introduction and abstract with a view to confirming clarity for the benefit of readers without machine-learning expertise.


Abstract

Humans learn complex abstract concepts faster if examples are curated. Our education system, for example, is curriculum based, with problems becoming progressively harder as the student advances. Ideally, problems which are presented to the student are selected to ensure that they are just beyond the student's current capabilities, making the problem challenging but solvable. Problems which are too complex or too simple can reduce the speed at which the student learns; i.e. the efficiency of learning can be improved by presenting a well curated selection of problems. The idea that a student can effectively learn by randomly selecting a set of problems from the entire subject, regardless of complexity, is so absurd that it is not even under consideration as a possible methodology in our educational system; however, this is exactly the approach the current state-of-the-art machine-learning game playing agent uses to learn. The state-of-the-art game playing agent is a combined neural-network/tree-search architecture which generates examples by self-play, with these examples subsequently used to train the agent in how to play the game. New examples are generated from the entire problem domain with no consideration of the complexity of the examples or the ability of the agent.

In this thesis we explore methods of curating self-play experiences which can improve the efficiency of learning, focusing on enforcing a curriculum within the agent's training pipeline. We introduce the end-game-first training curriculum, a new approach which is focused on learning from the end-game first, resulting in reduced training time relative to a comparable player not using the curriculum. We further extend the end-game-first curriculum by automatically adjusting the problem complexity based on the agent's performance, showing that an agent with an automated end-game-first curriculum outperforms a comparable player not following a curriculum. Self-play is resource intensive, and throughout this thesis the underlying premise of our research is to extract the maximum benefit from the generated experiences. The improvements presented in this thesis can be attributed to improving the efficiency of the agent's training loop by more effective use of self-play experiences. We use the games of Connect-4, Reversi and Racing-Kings to demonstrate the generality of our methods.


Contents

1 Introduction
  1.1 Computer Game Playing
    1.1.1 Game Theory
    1.1.2 Performance of Artificial Game Players
  1.2 Artificial Intelligence
  1.3 Real-World Games
  1.4 Research Contributions
  1.5 Publications
  1.6 Thesis Outline

2 Literature Review
  2.1 Overview
  2.2 Historical Background
    2.2.1 Game Theory
    2.2.2 Computers Playing Board Games
    2.2.3 Computer Game Playing Approaches
    2.2.4 Playing the Game of Go
    2.2.5 General Game Playing
    2.2.6 AlphaZero's Evolution
  2.3 Tree Search
    2.3.1 Search Methods
    2.3.2 Monte Carlo Tree Search
  2.4 Neural Networks
    2.4.1 Machine Learning Problem Types
    2.4.2 Neural Network Architectures
    2.4.3 Neural Network Learning
    2.4.4 Overfitting

3 Evaluation Framework
  3.1 Introduction
  3.2 Game Environment
    3.2.1 Game Representation
  3.3 Neural-Network/Tree-Search Game Player
    3.3.1 Neural-Network
    3.3.2 Training the Neural-Network
    3.3.3 Tree Search
    3.3.4 Move Selection
    3.3.5 Parameter Selection
    3.3.6 AlphaGo Zero Inspired Player
  3.4 Reference-Opponents
    3.4.1 MCTS Opponent
    3.4.2 Stockfish Opponent
  3.5 Games
    3.5.1 Connect-4
    3.5.2 Racing-Kings
    3.5.3 Reversi
    3.5.4 Searching State-Space of Games
  3.6 Summary

4 End-Game-First Curriculum
  4.1 Introduction
    4.1.1 Weakness of the Current Approach
  4.2 Related Work
    4.2.1 Using a Curriculum for Machine Learning
  4.3 Method
    4.3.1 Curriculum-Player
  4.4 Experiment
    4.4.1 Experimental Conduct
  4.5 Results
  4.6 Discussion
    4.6.1 Success Rate vs Time Comparison
    4.6.2 Success Rate vs Steps Comparison
    4.6.3 Success Rate vs Epoch Comparison
    4.6.4 Using a Linear Curriculum Function
    4.6.5 Limitations
    4.6.6 Curriculum Considerations
  4.7 Summary

5 Automated End-Game-First Curriculum
  5.1 Introduction
  5.2 Prerequisites
    5.2.1 Visibility-Horizon
    5.2.2 Performance of the Agent
  5.3 Method
  5.4 Evaluation
  5.5 Results
  5.6 Discussion
    5.6.1 Comparison With Baseline-Player
    5.6.2 Comparison with Handcrafted Curriculum
  5.7 Summary

6 Priming the Neural-Network
  6.1 Introduction
  6.2 Terminal Priming
    6.2.1 Method
    6.2.2 Results
    6.2.3 Discussion
  6.3 Incremental-Buffering
    6.3.1 Method
    6.3.2 Results
    6.3.3 Discussion
  6.4 Summary

7 Maximising Experiences
  7.1 Introduction
  7.2 Early Stopping For Reinforcement-Learning
    7.2.1 Method
    7.2.2 Results
    7.2.3 Discussion
  7.3 Spatial Dropout
    7.3.1 Method
    7.3.2 Results
    7.3.3 Discussion
  7.4 Summary

8 Pairwise Player Comparison With Confidence
  8.1 Introduction
  8.2 Hypothesis Testing
    8.2.1 The Pairwise Player Comparison Problem
    8.2.2 Null-Hypothesis Testing
    8.2.3 Determining Significance
  8.3 Confidence Bounds
    8.3.1 Confidence Bound Convergence
    8.3.2 Confidence Bounds Without a Statistics Library - The Wilson-Score
    8.3.3 Prediction Method
    8.3.4 Types of Error
    8.3.5 Conducting a Trial Game
    8.3.6 Choosing Parameters α, β, δ, n
    8.3.7 The Fixed n Prediction Method
  8.4 Stopping a Contest Early
    8.4.1 Accumulating Error
    8.4.2 Accounting for Additional Error
    8.4.3 Putting it into Practice
  8.5 Summary

9 Conclusion
  9.1 Computer Game Playing Today
  9.2 Key Findings
  9.3 Further Work
  9.4 Longevity of Findings

A Parameters For the Players

Bibliography

Glossary

List of Figures

1.1 Basic Closed-Loop Control System
2.1 Double Dichotomy of Game Space
2.2 State-Space and Game-Tree Complexity
2.3 Games' Location on the Double Dichotomy
2.4 A Go Board
2.5 Classical Development vs General Development
2.6 Partial Tic-Tac-Toe Tree
2.7 MCTS Process
2.8 MCTS Chess Trap State
2.9 Neural-Network Example for Boolean AND
2.10 2D Convolutions
2.11 AlphaGo Hand-Crafted Features
2.12 Feature Learning With CNN
2.13 Deep CNN Architecture For Image Processing
2.14 Deep Residual Network Building Block
2.15 A Deep Residual Network
2.16 AutoEncoder Architecture
2.17 Agent/Environment Interaction
2.18 Cart-Pole Problem
2.19 Cat Detection
2.20 Dropout
2.21 Simple Regression and Covariate Shift
3.1 Representing Games
3.2 Architecture
3.3 AlphaZero and AlphaGo Training Loops
3.4 Epochs and Replacement Rate Parameters
3.5 Connect-4 Game
3.6 Racing Kings Game
3.7 Reversi Game
4.1 Curriculum Player Performance in Racing-Kings
4.2 Curriculum Player Performance in Reversi
4.3 Linear Curriculum Player Performance in Racing-Kings
4.4 Linear Curriculum Player Performance in Reversi


4.5 Curriculum Player Performance using AlphaGo Zero Methods in Racing-Kings
5.1 Example Game Tree for Tic-Tac-Toe
5.2 Visibility-Horizon
5.3 Depth of Search
5.4 Automated End-Game-First Curriculum Player Performance in Connect-4
5.5 Automated End-Game-First Curriculum Player Performance in Reversi
5.6 Automated End-Game-First Curriculum Player Performance in Racing Kings
5.7 Automated End-Game-First Curriculum Player Performance in Reversi
6.1 Terminal Priming Player's Performance in Connect-4
6.2 Terminal Priming Player's Performance in Reversi
6.3 Terminal Priming Player's Performance in Racing-Kings
6.4 Incremental-Buffer Player's Performance in Reversi
6.5 Incremental-Buffer Player's Performance in Racing Kings
7.1 Architecture for Players
7.2 Figure 3.4 Reproduced
7.3 Early-Stopping Player's Performance Playing Reversi
7.4 Early-Stopping Player's Performance Playing Racing-Kings
7.5 Dropout Player's Performance
7.6 Variable Dropout and Fixed Dropout Performance Comparison
8.1 Probability Mass Function
8.2 Lower Confidence Bound Using PMF
8.3 Effect of Varying α
8.4 Observed Type II Errors With Different Confidence Bounds
8.5 Type II Errors When Varying Player Probability Difference
8.6 Relationship Between n and δ for Typical β Values
8.7 Type I Error
8.8 Type II Error
8.9 Average Game When Using Sequential Stopping

List of Tables

1.1 Computing Resources for Research
5.1 Confidence in Experiences Knowledge
6.1 Incremental Priming Parameters
8.1 Prediction Outcomes and Different Error Types
8.2 Parameter Summary for Error Types
8.3 Common Z Values
8.4 Common Parameter Combinations For Confidence Testing
8.5 Common Parameters for our Sequential Stopping Method
A.1 Parameters for Racing-Kings players
A.2 Parameters for Reversi player


Chapter 1

Introduction

1.1 Computer Game Playing

The fascination with automated board game players is older than computers themselves. ENIAC (Electronic Numerical Integrator And Computer), built in the 1940s to calculate artillery trajectories, is considered the first general-purpose computer, yet by the 1950s a number of computer game playing algorithms had already been proposed [1]. Preceding ENIAC, in 1912, El Ajedrecista was the first automated Chess playing machine, able to play out certain end-game positions on a Chess board [2].

The applications of machine-learning1 (ML) are seemingly endless, however much of the success has arisen from applying tailored solutions to each problem. In 1958, Newell et al. thought Chess was the ultimate challenge for artificial intelligence (AI) and stated that a successful Chess machine "would seem to have penetrated the core of human intellectual endeavour" [3]. Newell's view either overestimated the complexity of Chess or significantly underestimated the diversity of human problem solving. When World Chess Champion Garry Kasparov was defeated by Deep-Blue in 1997 [4], the solution, despite being an engineering success, was far from parity with human intelligence; it was instead a highly tailored system which relied on custom hardware, Chess-specific software and brute-force processing power. Despite the accolade of being the first machine to beat a reigning world champion, Deep-Blue was not able to play the game of Checkers [5, 6].

Whilst significant advancements in ML applications have occurred recently, much of the focus is on hand-crafting solutions for each problem, and this is also true of computer game players. The study of generic ML is less prolific, due in part to the immediate gains that can be made by developing a specific solution rather than investigating the potential broad utility of ML agents. Our focus is on general decision making, where an ML agent learns how to make good decisions without human intervention or assistance. Games provide a convenient, well defined and diverse problem set to investigate complex general decision making [7]. The benefit of using games as the test-bed is that many real-world problems can also be expressed in the form of a game, making advances in game playing applicable to a wide range of problems.

1 Machine Learning (ML) and Artificial Intelligence (AI) are often used interchangeably in much of the literature. We discuss this later in this chapter.

1.1.1 Game Theory

Salen and Zimmerman consider a number of possible definitions of games and conclude that a game is "a system in which players engage in an artificial conflict, defined by rules, that results in a quantifiable outcome" [8, Chapter 7]. In our opinion, however, conflict need not be artificial for a system to be classified as a game, as real conflict also qualifies. Carmichael's definition of a game is more concise, stating that a game is simply a strategic situation, i.e. a "scenario or situation where, for two or more (players), their choice of action or behaviour has an impact on the others" [9].

A game can be represented by its state s_t after t time steps. A game has a set of possible moves A = {a_1, ..., a_m}. For each state there exists a subset of the possible moves which are legal actions, L(s) ⊂ A, and taking an action results in a change of the game's state to s_{t+1}; i.e. if F denotes the game's transition function and action a_t is selected, then F : s_t × a_t → s_{t+1}.

Two predominant methods exist for decision making in strategic games: brute-force methods and knowledge-based methods [10]. One brute-force approach is to build a search-tree exploring every possible board position. Practically, the state-space of many interesting games is too large for current technology, so heuristics are used to reduce the tree to a size that is small enough to be processed. One approach to deriving a heuristic is to conduct a Monte-Carlo simulation from a board position, and this is what underpins Monte-Carlo tree search (MCTS) [11], which we discuss in detail in Section 2.3.2.

Knowledge-based methods rely on knowledge of the game's underlying dynamics. Although Deep-Blue's solution incorporated tree search methods, the search was customised for Chess, and only employed when the existing board position was not known, making Deep-Blue a hand-crafted, knowledge-based solution. This is one of the common approaches for game playing agents, where the designer uses their prior knowledge of the game to customise the player.

Artificial Neural Networks have been shown to be universal approximators [12], and with a suitable design can learn the underlying structure of a game. Although typically not listed under knowledge-based approaches in the literature, a neural-network can still be considered a knowledge-based approach. The difference is that a neural-network acquires the knowledge through learning, instead of obtaining it directly from the designer. Currently many applications using neural-networks also have preexisting human-crafted elements. Our key interest is to use a neural-network without any human customisation, preferring instead that the neural-network acquire all of its knowledge through learning.
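The state/action/transition formalism above can be captured in a small, game-agnostic interface. The following Python sketch is illustrative only; the class and method names are our own and do not correspond to the framework described in Chapter 3.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

# Illustrative only: a minimal formalisation of a turn-based game.
# State s_t, move set A, legal moves L(s) subset of A, transition F: (s_t, a_t) -> s_{t+1}.

@dataclass(frozen=True)
class TicTacToe:
    board: Tuple[str, ...] = (".",) * 9   # state s_t: nine cells, '.', 'X' or 'O'
    to_play: str = "X"

    def legal_actions(self) -> FrozenSet[int]:
        """L(s): the subset of the move set A = {0, ..., 8} that is legal in this state."""
        return frozenset(i for i, cell in enumerate(self.board) if cell == ".")

    def transition(self, action: int) -> "TicTacToe":
        """F: s_t x a_t -> s_{t+1}; applying an action yields the successor state."""
        if action not in self.legal_actions():
            raise ValueError(f"illegal action {action}")
        cells = list(self.board)
        cells[action] = self.to_play
        return TicTacToe(tuple(cells), "O" if self.to_play == "X" else "X")
```

A player's policy then maps each state to a choice among legal_actions(), and a game is simply a sequence of transition() calls until a terminal state is reached.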

1.1.2 Performance of Artificial Game Players

Artificial agents can now consistently defeat human professionals in Chess, Checkers, Reversi and numerous other games by using modern tree search techniques and/or neural-networks [13]. The game of Go is one of the more complicated traditional board games. Research on the game of Go began in the late 1960s [14] and became more extensive immediately after Deep-Blue's Chess win in 1997, but despite the years of research effort the best Go agent up until 2015 was Crazy Stone, which achieved a playing standard equivalent to that of a strong amateur by winning occasional games with handicaps on a reduced board size [15, 16]. Crazy Stone employed a variant of the MCTS algorithm with hand-crafted, game-specific heuristics [17]. Neural-networks were also found to be effective in correctly predicting expert moves for the game of Go, and by 2015 were able to equal the performance of state-of-the-art MCTS-based Go agents [18].

The breakthrough in the game of Go came in March 2016 when Google Deepmind's agent AlphaGo defeated 9 dan Go professional Lee Sedol 4 games to 1 in a million dollar challenge by successfully employing a combined neural-network/tree-search architecture [19], notionally advancing the field by a decade [14]. In 2019 Lee Sedol retired from playing Go after a career spanning 24 years, declaring that since AI cannot be defeated he could only be the second best (Chinadaily.com.cn, "Go master Lee Sedol retires, says AI 'cannot be defeated'", Nov 2019).

Much of the success in specialised game playing has come from tailoring solutions to a single game. Although AlphaGo's neural-network/tree-search architecture was somewhat reusable, it still relied heavily on hand-crafted features, a large quantity of existing expert games and a large computational budget to train the agent offline. These elements are not necessarily available if the game cannot be studied in advance, such as when seeking to play a variety of previously unseen games. The current state-of-the-art general game playing agents utilise MCTS [20] due to the algorithm being domain independent. However, MCTS is not optimal for all games, especially if the state-space is large or if terminal states cannot be readily found by random playout. In these situations knowledge-based methods are required to guide the search, and although this may be addressed by employing a general, rules-based solution [6], neural-networks can also be used if sufficient time is provided in advance to learn the game.

The field of general game playing (GGP) encompasses agents which are designed to play a diverse set of games without further human intervention [21]; however, GGP is often a keyword used to describe a specific subfield of general games involving the use and analysis of logic to play general games [20]. When we refer to general game playing we are referring to an agent that is capable of playing a diverse set of games without dictating the mechanism for achieving this - whether it uses logic or not.

1.2 Artificial Intelligence

With the recent advances in low cost processing power and graphics processing units, the applications of machine-learning (ML) have become widespread. However, many of these ML applications are very narrow in their approach to solving a given problem, relying on humans to tailor the system in such a way that the system is unable to be used for any other problem. Whilst the term Artificial Intelligence (AI) is often used broadly as a synonym for ML, one definition of intelligence is the "ability to achieve goals in a wide range of environments" [22]. As such, many of the current AI applications cannot really be considered intelligent. For an artificial agent to succeed in "a wide range of environments" it will be expected to be able to plan and make effective decisions, using either its prior experiences or the ability to explore and analyse the environment's state.

A system that can be applied to absolutely any environment is the aim of artificial general intelligence (AGI), although the current status of research is far from AGI as it is still grappling with applying a single system to different problem classes within the same domain. For instance, cancer detection is one of the more prominent AI application areas, yet there exist many different cancer detection research problems, each with their own unique system: there are unique approaches for the detection of cancers of the skin, head and neck, prostate, pancreas and breast [23, 24, 25, 26, 27, 28, 29]. Whilst a general cancer detection system may be possible, cancer subject-matter-experts, for good reason, focus on solving specific problems which provide the most immediate and beneficial return for their efforts. Whilst a true AGI agent may actually be a fantasy, progressing towards a generic agent2 that is suitable for a single domain will allow subject-matter-experts to advance their fields using machine-learning without requiring them to build custom agents of their own.

The study of generic artificial agents has progressed substantially in the last five years, predominantly using games as the study domain. Initially, our research motivation was the creation of a generic agent for the entire domain of two-player, zero-sum, turn-based board games; however, this was achieved by what is now considered the state-of-the-art generic board game agent AlphaZero [30]. AlphaZero is effective at learning a previously unseen game by self-play, i.e. playing against itself, and when tested it succeeded in beating professional human players and all other game playing agent benchmarks. For this work we use AlphaZero as the baseline agent because, in our opinion, its approach is likely to persist into the future and become more widely adapted to other domains. AlphaZero derives its success from the combination of two different machine-learning methods, tree search and neural-networks, and is the culmination of decades of work by numerous researchers in each of those domains. Regardless of future advancements in tree-search or neural-network methods, AlphaZero's underlying approach will likely remain valid and will itself benefit from any further improvements.

2 We use the term generic agent to mean an agent designed for a broad problem domain or class of problem, as opposed to a general agent, which is an agent used across domains.

Agent                       Self-Play Resources        Competition Resources
AlphaGo [19]                1,920 CPUs + 280 GPUs      1,202 CPUs + 176 GPUs
AlphaGo Zero* [30]          64 GPUs + 19 CPUs          4 TPUs
AlphaZero [31]              5,000 TPUs                 4 TPUs
MuZero [32]                 1,000 TPUs                 4 TPUs
AI Player in this thesis    1 GPU + 10 CPUs            20 CPUs

Table 1.1: The computing resources allocated to Deepmind's game playing agents during self-play/training and during competition. Note the acronyms: Central Processing Unit (CPU), Graphics Processing Unit (GPU) and Tensor Processing Unit (TPU). A TPU is a machine-learning customised GPU variant which is optimised for neural-network inference and gradient-descent, performing 15-30 times better than equivalent GPUs [33]. *An informal estimate (without peer review) for the AlphaGo Zero 40 day experiment is upwards of $35M in processing costs (www.yuzeh.com/data/agz-cost.html). Each artificial player presented in this thesis uses 1 GPU + 10 CPUs per player for training and 20 CPUs for competition play.

AlphaZero's achievements are an exciting development in the progress towards a general agent; however, the self-play reinforcement-learning (RL) method used to train the agent is very processor intensive, costing millions of dollars. AlphaZero uses a two-stage learning pipeline: experiences are generated via self-play, and these experiences are then used to train the agent using typical neural-network gradient-descent methods. Table 1.1 shows the allocated processing resources for different generations of Deepmind's state-of-the-art game playing agents. An informal estimate for the 40 day AlphaGo Zero experiment is in the order of $35M (www.yuzeh.com/data/agz-cost.html, Jan 2019) - well beyond the resources available to many researchers. The table highlights the larger proportion of resources required for self-play compared with that required for competition, indicating a bottleneck in the self-play learning loop. The contributions we present in this thesis improve the efficiency of the self-play learning loop by seeking to minimise the creation of expectedly poor quality experiences and by improving the efficiency with which generated experiences are used. By improving the self-play training loop we also reduce the resources required to achieve similar performance results, making the state-of-the-art approach more widely accessible and applicable to problems without large resource budgets.

In this thesis we zero-in on the self-play learning loop, having identified an inherent weakness with the current method. To address this weakness we present a novel approach which enforces a structure on the training process, improving the training time when compared with a comparable agent not using our approach. We further explore how to improve the self-play learning loop and present a number of novel methods which enhance the current state-of-the-art method. The methods we present here are likely to remain relevant regardless of advancements in tree-search or neural-networks, and by improving the rate at which an agent trains they will allow the state-of-the-art approach to be more widely used.
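As a sketch of the two-stage pipeline described above (generate experiences by self-play, then train on them by gradient descent), the following Python pseudocode shows the shape of the loop. The function signatures are placeholders of our own; buffer management, evaluation rules and all hyper-parameters are deliberately omitted and are not taken from the thesis implementation.

```python
from typing import Any, Callable, List, Tuple

Experience = Tuple[Any, Any, float]  # (state, policy_target, value_target)

def training_loop(network: Any,
                  self_play: Callable[[Any], List[Experience]],
                  train: Callable[[Any, List[Experience]], Any],
                  iterations: int = 1000,
                  games_per_iteration: int = 100) -> Any:
    """Illustrative AlphaZero-style loop: self-play generates experiences,
    gradient descent consumes them. Not the thesis implementation."""
    replay_buffer: List[Experience] = []
    for _ in range(iterations):
        # Stage 1: generate experiences by the network playing against itself.
        for _ in range(games_per_iteration):
            replay_buffer.extend(self_play(network))
        # Stage 2: update the network on the accumulated experiences.
        network = train(network, replay_buffer)
    return network
```

The resource imbalance in Table 1.1 sits in Stage 1: generating experiences dominates the cost of the loop, which is why the later chapters focus on getting more value out of each generated experience.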

Figure 1.1: A basic closed-loop control system from [38]. In the context of game playing the Controller is the agent and the Process is the game environment. The Input to process is the state of the game and the Disturbances are caused by the opponent(s). The Reference is the desired reward, while the Response is the reward obtained from the game. The Error/Deviation is the difference between the desired reward and the received reward, and the Controlled input to process is the actions taken by the agent. The Controller attempts to minimise the Error, and achieves this by learning how to play the game using machine-learning methods.

1.3 Real-World Games

A game can be defined as an interaction between decision makers in a given environment; as such, many real-world problems fall within the remit of game theory [34]. In the pioneering work on game theory by von Neumann, economics and games are discussed interchangeably [35]. Alan Turing's ultimate test of intelligence was also presented in the form of a game [36]. An artificial agent learning to play a zero-sum game is fundamentally seeking to learn an optimal decision making policy within an environment to maximise its reward, while another adversarial actor takes actions to minimise the agent's reward. A strategic situation is one where the outcome for each actor depends on both the environment and the actions of others. In game theory, a game is defined as a strategic situation. The stock market, global economics and military conflict are all considered strategic situations, and there are many real-world problems which can be presented in the form of a game.

Recent advances in computer game playing have resulted in a generic agent being able to learn to play highly complex games through self-play reinforcement-learning, with no game specific customisation [37]. This work, using a cross-section of Atari video games, demonstrated how game playing methods can be used to learn control in a continuous virtual environment. Likewise, board games can be considered to be discrete closed-loop control problems where: the environment's transition function is encapsulated in the game rules, the player acts as the controller seeking to maximise some reward by winning the game, and the opponent adds noise by seeking to minimise the player's reward. Figure 1.1, from [38], shows a basic closed-loop control system, and in the description of the figure we explain how it is analogous to games; a minimal sketch of this loop is given at the end of this section.

Since board games can be viewed as control problems, advances in generic game playing provide a mechanism to explore the possibilities of a general controller which could be applied as an expert system for a diverse set of applications, instead of developing bespoke systems for every problem. In our view, a general controller would need to acquire the knowledge of the environment through learning, rather than being pre-programmed in advance. Board games provide a diverse, well defined domain to explore how a single artificial agent can learn optimal control in a previously unknown virtual environment - without human intervention. It is for these reasons that we use board games, more specifically combinatorial board games, as the problem-set to investigate how a deep learning agent can learn control for a previously unseen environment.

The real-world applications of a general controller are almost unlimited. For example, in the field of Neurology, electrical currents can be used to stimulate parts of the brain to control crippling neurological issues arising from a range of illnesses including Parkinson's Disease, Epilepsy and Post Traumatic Stress Disorder, using a therapy called Deep Brain Stimulation (DBS) [39]. Currently DBS provides a fixed, steady-state stimulation to the patient's brain through surgically implanted electrodes. The problem with a fixed stimulus is that patients' conditions are often variable [40]. Each patient reacts differently to stimulation due to differences in their condition and physiological characteristics; as such, the DBS parameters are found by trial and error, individually for each patient, over a number of months, and once set there is almost no variability to account for changes in the patient. If sufficient advances can be made towards a general controller then it might be possible for an artificial agent to learn continuous adaptive DBS control for each individual patient, restoring normalcy to the lives of many people who have these neurological illnesses [41]. Whilst learning how to play board games seems like a fun topic to research, there are many serious real-world applications that could benefit from this research.
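The sketch below makes the control-loop analogy of Figure 1.1 concrete: the agent is the controller, the game is the process, and the opponent's moves act as the disturbance. It is illustrative only; the interface names are our own and the reward convention (+1 win, -1 loss, 0 otherwise) is an assumption rather than the convention used later in the thesis.

```python
from typing import Any

# Illustrative closed-loop view of a two-player board game (cf. Figure 1.1).
# `game`, `agent_policy` and `opponent_policy` are assumed to expose the
# methods used below; none of these names come from the thesis implementation.

def play_episode(game: Any, agent_policy: Any, opponent_policy: Any) -> float:
    state = game.initial_state()
    while not game.is_terminal(state):
        # Controller output: the agent selects an action to maximise its reward.
        state = game.transition(state, agent_policy.select(state))
        if game.is_terminal(state):
            break
        # Disturbance: the opponent acts to minimise the agent's reward.
        state = game.transition(state, opponent_policy.select(state))
    # Response fed back to the controller: +1 win, -1 loss, 0 draw (assumed convention).
    return game.reward(state, player="agent")
```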

1.4 Research Contributions

The research contributions of this thesis are as follows:

• We present a new curriculum-learning paradigm which we call the end-game-first curriculum and show its utility by using it to improve the training time of a state-of-the-art inspired reinforcement-learning agent on different games. Our method excludes expectedly poor experiences from the experience-buffer. This contribution is explained in Chapter 4.

• We extend the utility of the end-game-first curriculum by presenting a generalised, automated method of applying the end-game-first curriculum, using the relative search-depth of the tree-search to indicate how well trained the neural-network is. We demonstrate the effectiveness of this method on three different games. This contribution is described in Chapter 5.

• We present a new approach for training a reinforcement-learning agent by treating the earliest epochs differently, trading off the risk of overfitting for the speed gained by learning only from a small sample of experiences. This contribution is described in Chapter 6.

• We present a new application of early-stopping principles, traditionally used only in supervised-learning problems, employing them instead during the reinforcement-learning training pipeline. This method is a unique adaptation of an existing method and is explained in Chapter 7.

• We present a new variation of spatial-dropout for convolutional neural-networks (CNN), applying varying spatial-dropout regularisation to the convolutional layers of a residual neural-network based on a calculated estimate of the proportion of the state-space being presented to the network. Our approach is unique in its employment in a neural-network/tree-search architecture where batch-normalisation is already employed. This contribution is explained in Chapter 7. This work is still in progress; however, we have demonstrated from this preliminary work that additional research may yield a further improvement to learning stability.

• We contribute to the understanding and use of statistical hypothesis testing as it applies to computer game playing. We demonstrate the use of hypothesis testing and provide a step-by-step procedure to perform a pairwise comparison of the playing quality of two players with a desired confidence. We then present a new approach permitting the hypothesis trial to terminate early if one player is significantly outperforming the other, while retaining the appropriate confidence level. We take an empirical approach to hypothesis testing, allowing game playing researchers to verify that the approach is effective and demystifying an often misunderstood process. This contribution is explained in Chapter 8.

1.5 Publications

Three papers have arisen from this work:

• Improved Reinforcement-Learning with Curriculum in the Elsevier Journal of Expert Systems with Applications. This paper has been accepted for publication, outlines the effectiveness of an end-game-first curriculum, and contains much of the content from Chapter 4.

• Automated End-Game-First Curriculum for Reinforcement-Learning in the Elsevier Expert Systems with Applications. This paper is in progress and contains much of the content from Chapters 5, 6 and 7.

• Automated Stopping With Confidence for Computer Game Playing Trials in the Elsevier Journal of Artificial Intelligence. This paper is in progress and contains much of the content from Chapter 8.

1.6 Thesis Outline

We first present a literature review in Chapter 2 which covers the history of computer game playing and the components which make up the state-of-the-art general game playing system. The literature review includes an analysis of the state-of-the-art game playing methods and the key elements which contribute to their success. Our research uses the state-of-the-art methods as the baseline against which we compare our results.

Chapter 3 outlines the platform which we have developed to allow us to evaluate the modifications which we cover in this thesis. Chapter 3 also provides an insight into the reasons for selecting particular parameters for the network and the motivation behind the games which we use.

In Chapter 4 we explain our new approach to reinforcement-learning which we call the end-game-first curriculum. Our novel approach, which has been peer reviewed, uses a hand-crafted end-game-first curriculum to improve the training time for a state-of-the-art-inspired neural-network/tree-search game playing agent. One of the limitations of our approach in Chapter 4 is that the method relies on some prior knowledge of the game's complexity, and we address this limitation in Chapter 5.

In Chapter 5 we present a novel method for automating the end-game-first curriculum, without the need for any prior game knowledge. We also identify, for what we believe to be the first time, that the relative search-depth for a neural-network/tree-search agent is a measure of the agent's skill at playing a game. In Chapter 5 we use this knowledge regarding the relative search-depth to control the applied curriculum, improving learning with no game specific customisations.

Throughout this thesis we identify that the cost of generating experiences is a key bottleneck in training for our agent, and in Chapters 6 and 7 we further explore how to obtain the maximum value from generated experiences. We have found that the novel application of existing methods typically used for supervised-learning problems can lead to improved learning speeds for our reinforcement-learning problem.

Finally, the topic of statistical confidence is addressed in Chapter 8. We provide an overview of how to apply confidence intervals to game playing, using empirical results to support the statistical theory. A typical requirement when comparing two machine-learning agents is to identify which player is better and to be able to make a declaration with a level of confidence. This confidence measure is common in a number of scientific fields, however it is less widely used for machine-learning and in particular for game playing agents. In addition to providing an insight into how to apply statistical confidence testing to game playing we also provide, to our knowledge, a new method which permits stopping a contest early while still retaining confidence in the outcome.

Chapter 9 concludes the thesis, reflecting on its key outputs and the direction of future work.

Chapter 2

Literature Review

2.1 Overview

According to van-den Herik and Uiterwijk [13], a game playing agent can use one of two approaches to decide which move to make: either knowledge-based or brute-force methods.

Historically, knowledge-based agents were created by hand-coding the player, although recently neural-networks have been used to pre-learn the dynamics of the game, creating an internalised knowledge within the agent. Neural-networks used in successful game playing agents have often relied on hand-crafted elements to achieve reasonable game performance, making them unsuitable as generic game playing agents. A player using brute-force methods can achieve this by testing every possible combination of the game for the best move to make, and can do this with tree-search methods. For practical reasons most tree-search methods require some element of knowledge, such as a heuristic, to be useful for non-trivial games. Using the heuristic, the agent typically builds a search-tree1 at the moment when the decision is needed, effectively acquiring the knowledge on-demand in order to choose an action. These two approaches lend themselves to different problem types and are discussed further in Section 2.2.1.

Our view differs slightly from van-den Herik and Uiterwijk [13] in that we prefer to consider the two approaches as: one where the knowledge is stored within a system (internal knowledge), and the other where knowledge is acquired on-demand, through exploration (on-demand knowledge). In our opinion the terms used by van-den Herik and Uiterwijk are insufficiently accurate given recent developments in computer game playing. Brute-force tends to imply a comprehensive, naive exploration of the state-space; however, this is rarely purely employed - it could be argued that even the selection of the brute-force methods requires some level of prior knowledge. Likewise, the term knowledge-based for some researchers implies that the agent is pre-programmed with all knowledge a priori - however there are other mechanisms by which an agent can build internal knowledge. Other than a shift in perspective, van-den Herik and Uiterwijk's findings [13] are still valid for our renewed perspective, with on-demand knowledge and internal knowledge mapping directly to brute-force and knowledge-based respectively.

1 or a graph

The literature for modern board game playing agents primarily covers either tree-search or neural-networks as the two predominant machine-learning approaches; however, the two were infrequently used together. The players that combined knowledge-based methods with brute-force methods typically utilised static evaluations and custom heuristics for the search, like Deep-Blue's method [42]. Prior to 2016, all of the artificial players that were capable of playing complex board games to a professional human standard were highly customised to the specific game. However, in 2016 a major breakthrough occurred with Google Deepmind's AlphaGo successfully defeating professional players in the game of Go, a success which was touted as being a decade before its time [19]. Although AlphaGo was highly customised for the game of Go, the methodology was relatively general in its approach. This methodology has since undergone a number of revisions, progressing to the current state-of-the-art generic board game player AlphaZero, which uses no game specific customisations and has demonstrated superhuman performance in a cross-section of board games. AlphaZero requires time before a competition to learn the game, but the learning requires no human involvement. This time requirement exists regardless of whether the game is learned or programmed, since a programmer likewise requires time to program knowledge into a specialised game player. AlphaZero successfully exploits the benefits of both internally stored knowledge and on-demand knowledge, providing a method for learning any board game tabula-rasa2; however, the computing resources required are well beyond those which are available to many researchers.

Our contribution to the field of machine-learning starts here, with AlphaZero as the baseline. We have conducted a deep-dive into AlphaZero's method and have identified a bottleneck in its training loop which can be mitigated by using a structured training curriculum. Addressing this bottleneck reduces the required resources and/or improves the time needed for the agent to learn to the same standard.

In this chapter we review the history of computer board game playing and the key research giving rise to AlphaZero's success. We review the preceding work with two key themes, games and machine-learning, then present an overview of the common machine-learning methods used in game playing.

2.2 Historical Background

2.2.1 Game Theory

Seriousness of Games

Game Theory, described as the "study of conflict and cooperation" by von Stengel and Turocy [43], not only covers abstract games like Chess but also encompasses many real-life situations. The definition of a game is broad: it is "a formal description of a strategic situation" [43], i.e. a situation where the outcomes for each player are dependent on the actions of all players [44].

2 Without any prior knowledge.

With the definition of a game being so broad, many problems in economics, logistics, biology, decision making and conflict are all covered by the definition of a game. Pennachin and Goertzel have even claimed that "nearly every artificial intelligence problem could be brought into the form of a game" [45, Chapter 5]. In Alan Turing's Imitation Game, his proposed test of intelligence for a machine was also presented in the form of a game, where the artificial machine wins if it cannot be distinguished from a human in a blind conversation [36]. Since games are so widely applicable, the study of how to play them optimally has many real-world applications.

Decisions for many real conflict situations are made sequentially, with the opponent's decisions becoming evident as the situation develops [46]. The process of finding a sequence of actions to obtain some goal-state is called search [47, Chapter 3]. Many board games are sequential decision making problems, and research interest in how a machine can play board games has a history which even predates computers. Chess was once considered to be the ultimate challenge of machine intelligence [3]; however, when Deep-Blue finally defeated the reigning world Chess champion, Garry Kasparov, the machine's decision making was not relatable to human thought processes [48].

While psychologists still disagree on the precise definition of human intelligence, broadly speaking intelligence is the "ability to achieve goals in a wide range of environments" [49]. With this definition, agents like Deep-Blue fail to qualify as being intelligent, despite beating the world's best human at the single task it was designed for. A singular-task machine-learning application is termed weak AI and, although not considered intelligent, has underpinned many recent successful applications [47, Chapter 26]. Strong AI, or more formally artificial general intelligence (AGI), is described as a machine which is able to be applied to any problem [47]. A true AGI agent is not on the foreseeable horizon, however exploring how a single agent can learn to play any game is a small step in furthering AGI.

Game Complexity

For games of perfect information, there exists an optimal value function v∗(s) which determines the outcome of the game under perfect play, where s is the state representation of the game environment [50, Chapter 6.3]. Practically, players seek to maximise the value v(s) through their decisions, and the winner is the player with the most accurate estimation of v∗(s), which is reflective of the optimal strategy. Where v∗(s) has been determined for a game, the game is said to have been solved, although there are three levels of solved depending on how comprehensively v∗(s) is defined: ultra-weakly solved, weakly solved and strongly solved [10].

The difficulty in determining the game-theoretic value function, v∗(s), can be quantified by considering the double dichotomy of state-space complexity and game tree complexity. State-space complexity is how many unique states exist in the game, whilst the game tree complexity is "the number of leaf-nodes in the solution search-tree" [10]. Heule and Rothkrantz outline how these two metrics can be used to forecast whether a game will be solved by either brute-force or knowledge-based techniques [10]. In games that are not yet solved, this approach can also be used to predict which methods would be used to achieve super-human performance. According to Heule and Rothkrantz, games which have a low state-space complexity are more likely to be solvable using brute-force methods, whilst games with a low game tree complexity are more likely to utilise knowledge-based methods. Figure 2.1 shows the effect of state-space and game tree complexity on solvability, Figure 2.2 shows the complexity of a selection of common games, and Figure 2.3 shows approximately where in the complexity space these games are positioned. Given recent advancements which we outline in this thesis, it is our view that Category 4 in Figure 2.1 should be updated to read "if solvable at all, then by a combination of both brute-force and knowledge-based methods".

Figure 2.1: A double dichotomy of the game space from van-den-Herik et al. [13]. Whilst this plot is not prescriptive, it indicates the conditions under which brute-force or knowledge-based methods are suitable for a problem. Problems which have a high game tree complexity but low state-space complexity can be solved using brute-force methods, while the inverse condition would use knowledge-based methods. This plot was developed in 2002, and in this author's opinion the description under Category 4 should be "if solvable, then by combined brute-force/knowledge-based methods." Figure 2.2 shows estimated complexity values for different games, and Figure 2.3 shows a comparative plot for these games.

2.2.2 Computers Playing Board Games

The success of a computer-based game playing agent relies primarily on its estimate of the value function v(s), where the actual value function is v∗(s). The closer v(s) is to v∗(s), the better the player. For effective move selection, it is necessary to know either the value function or the relative expected values of all actions for state s. The relative value is the expected return for taking an action from a certain state, and is expressed as a probability distribution across all legal actions from the state s called the policy, π(s).

In his 1950 paper on computer Chess [51], Shannon noted that for trivial games like Nim the exact value function is easily calculated; however, for Chess an approximation of v∗(s) would be required, since the true value of a state only becomes apparent when approaching the terminal state. Exhaustive tree search methods can be used to accurately determine v(s) if the game's state-space is small enough given the available processing budget.

Figure 2.2: State-space and game tree complexity for a selection of games from [13]. Figure 2.3 shows a plot of these values.

Figure 2.3: Games from Figure 2.2 with the game's ID shown on the plot - reproduced from [13]. Overlaying this plot with Figure 2.1 provides some insight into the method required for a game playing agent. ID 15 refers to the game of Pentominoes and cross-referencing with Figure 2.1 indicates that the game is probably category 1 or 2 - it would be solvable either by any method or by brute-force methods. ID 9 is the game of Go, and using Figure 2.1 this game is not solvable; however, recent research has shown that a successful Go player can be obtained using combined brute-force/knowledge-based methods.

Complex games, like Chess, make choosing moves by brute-force methods intractable, as we have explained in the previous section. A conservative estimate is that for Chess there are in excess of 10^120 game states [51]. Despite this large state-space, Deep-Blue's success relied in part on brute-force, evaluating positions at a rate in the order of 10^8 per second [48]. It would be inaccurate, however, to categorise the method purely as brute-force, since the search itself was highly tailored, employing hand-crafted position evaluation. A more accurate description of Deep-Blue is that it uses a combined brute-force/knowledge-based approach. Figure 2.3 shows Chess having both a large state-space complexity and a large game tree complexity, and using Figure 2.1 this would imply that a combined brute-force/knowledge-based approach is required - and the architecture employed by Deep-Blue used a combined approach. Although Chess is not solved, computer Chess players that employ a combined brute-force/knowledge-based method are able to beat professional Chess players with relatively low computing resource budgets.

A game-tree can be constructed for a game, and its branching factor can be used as a measure of the complexity of the game3. The branching factor of a game tree is the number of children each node has on average. The branching factor for the 19x19 game of Go is large, approximately 200 for each board position, compared with 35 for Chess [52], making Go's state-space approximately 10^170 positions [53]. The subtlety of Go having only one piece type per player results in very minor board changes per turn, which can seem visually innocuous; however, well placed individual stones can change the value of a game's state markedly. These subtle visual changes on the board, combined with the large game space, make identifying the value function of Go intractable by brute-force methods, and this is one of the reasons why Deepmind's creation of a superhuman, machine-learning Go player was such an important milestone for machine-learning.
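As a rough, back-of-the-envelope illustration of why these numbers grow so quickly, game-tree size can be approximated as b^d for an average branching factor b and game length d. The values below are indicative assumptions only (b = 35, d = 80 for Chess following Shannon's estimates; b = 200, d = 150 assumed here for Go) and are not figures taken from the thesis.

```python
import math

# Rough game-tree size estimate: b ** d leaf nodes, reported as a power of ten.
# Branching factors and depths are illustrative assumptions, not thesis values.
def game_tree_order_of_magnitude(branching_factor: float, depth: int) -> int:
    return round(depth * math.log10(branching_factor))

print(game_tree_order_of_magnitude(35, 80))    # Chess: ~10^124, consistent with "in excess of 10^120"
print(game_tree_order_of_magnitude(200, 150))  # Go: ~10^345, far beyond exhaustive search
```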

2.2.3 Computer Game Playing Approaches

As outlined in the previous subsection, two approaches exist for solving a game, brute-force or knowledge-based, and depending on the type of game one method is more likely to succeed than the other. A convergent game is one where "every action taken by a player leads the game toward its terminal state", while a non-convergent game is one where the game does not necessarily progress closer to an end with each move [54]. A feature of convergent games is that random play will always result in a terminal state, whilst random play on a non-convergent game can result in an unbounded number of moves. Games where the board fills with each move are usually convergent, like Tic-Tac-Toe, Reversi and Go; whereas games that have piece mobility, like Chess and Checkers, are often non-convergent. Although placing a move limit on a Chess game, as is done in many tournaments, makes the game technically convergent, a Chess game ending in this manner is atypical and not a reflection of the game's dynamics.

3 We discuss how to build a game-tree later in Section 2.3, however at this stage we seek to simply highlight the complexity of the game of Go.

Uninformed tree-search methods require terminal states in order to calculate the optimum action to take; as such, tree-search methods are more effective on convergent games than non-convergent games. To obtain a check-mate in Chess, for example, a King has to be both under threat of capture and trapped by the opponent's pieces. There are a number of possible check-mate configurations, however finding these configurations randomly is difficult4 - making any uninformed tree-search method ineffective.

For some non-convergent games (particularly Chess) retrograde analysis has been used to build end-game databases prior to competition from known terminal states. Retrograde analysis, a backwards induction approach [55], is a method which starts with a terminal state and then successively backs up moves, fully analysing previous states as it progresses toward the start of the game. The depth of the retrograde analysis is limited by the processing resources, and once the database is built, significantly fewer resources are needed to look up a pre-calculated position. For Chess, large end-game databases have been constructed using only those pieces required for victory, i.e. filtering out non-essential pieces for a particular terminal state [56, 57]. This allows a player approaching the end-game to use the database to force a previously known winning configuration. To conduct retrograde analysis a manageable number of terminal states is required, or at a minimum a subset of the most common terminal states can be used.

For the same sized board, the number of terminal piece arrangements for non-convergent games is less than the number of terminal board positions for convergent games, because usually every full board is terminal in a convergent game while only specific arrangements are terminal in non-convergent games. Retrograde analysis relies on a smaller terminal-state-space size, and as such is not typically suitable for convergent games. An end-game database, created from retrograde analysis, was a core component of the Chinook Checkers agent which came close to beating the world champion in 1994, and won the first Man-Machine World Championship in Checkers in 2009 [58]. Although Chinook did not technically beat a human Checkers world champion at the tournament, it was later re-purposed and used to weakly solve5 the game of Checkers.

The game of Reversi is convergent, which renders the creation of end-game databases difficult, because the piece arrangements at the end of the game are rarely repeated. Logistello was the first super-human Reversi agent to defeat a world champion, in 1997 [60]. Logistello employed a hand-crafted evaluation function, derived from patterns on the board, and used this function to inform its search [61]. This approach of using a static evaluator or other heuristic to guide the building of a search-tree has become important for more complex games; that is, using knowledge to improve search.
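The following sketch illustrates the backward-induction idea behind retrograde analysis on a toy scale. It assumes a generic game interface (terminal_states(), terminal_value(), predecessors(), successors(), is_terminal()) of our own invention, considers only win/loss outcomes for simplicity, and is not the construction used for real Chess or Checkers end-game databases.

```python
from collections import deque
from typing import Any, Dict

# Toy retrograde analysis for a two-player game with win/loss terminal values,
# evaluated from the perspective of the player to move. Draws are ignored for
# simplicity; the game interface is an assumed, illustrative API.

WIN, LOSS = 1, -1

def build_endgame_database(game: Any) -> Dict[Any, int]:
    value: Dict[Any, int] = {}
    remaining: Dict[Any, int] = {}          # unresolved successor counts per state
    queue = deque()

    for s in game.terminal_states():
        value[s] = game.terminal_value(s)   # +1/-1 for the player to move in s
        queue.append(s)

    while queue:
        s = queue.popleft()
        for p in game.predecessors(s):
            if p in value or game.is_terminal(p):
                continue
            if value[s] == LOSS:
                # The player to move in p can move into s, which is lost for the
                # opponent: p is therefore a win.
                value[p] = WIN
                queue.append(p)
            else:
                # s is a win for the opponent; p is a loss only once *every*
                # successor of p has been shown to be a win for the opponent.
                remaining.setdefault(p, len(game.successors(p)))
                remaining[p] -= 1
                if remaining[p] == 0:
                    value[p] = LOSS
                    queue.append(p)
    return value
```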

4 We discuss a variant of the game of Chess and Reversi in more detail later in this chapter.
5 For weakly solved games both the result and a strategy for achieving it from the start of the game are known; the result of Checkers is a draw [59].

We have previously discussed that games with perfect information have an optimal value function v*(s) which describes the value of a state s, and that for simple games the optimal value function is easily calculated [51]. For larger games, where the state-space complexity is large, like Chess with a state-space complexity in the order of 10^120 states, an agent is only able to approximate the value function. Methods of accurately approximating v*(s) underpin the research of many game agents. To employ search methods for large state-spaces, some heuristic or knowledge can be used to either reduce the search space or bias it in such a way that the most consequential paths are explored. Numerous tree-search algorithms exist, and one of the more commonly used methods for general games is MCTS.

MCTS is domain independent, asymmetrically building a search-tree along the most promising paths, which are estimated using random playouts. MCTS can be used without any pre-existing domain knowledge since a board position's value is derived from the outcome of a random roll-out. If terminal states are not readily found then MCTS is not likely to be the optimum approach6. MCTS is discussed in more detail later in this chapter, and is an important general knowledge-on-demand approach for game playing.

The game of Go, identified by the number 9 in Figure 2.3, is the most complex board game under consideration in [13]. In accordance with our discussion in Section 2.2.1, a successful Go game playing agent requires both brute-force (knowledge-on-demand) and knowledge-based (internal knowledge) methods to solve this game. Whilst the game of Go is not solved, Google Deepmind employed a combination of both of these methods in AlphaGo, resulting in this combined architecture defeating the world Go champion in 2016 [19].

2.2.4 Playing the game of Go

The game of Go, thought to be one of the oldest board games in the world, originated in China approximately 2,500 years ago [62]. Go is a two player game, played on a 19x19 grid, with players alternately placing their coloured stones on the vacant intersections of the grid. The aim of the game is to surround more territory than the opponent, and despite the fact that the only choice a player makes is where to place their stone, the game itself is particularly complex, with approximately 10^170 unique positions [53].

Figure 2.4: A Go board (goban) showing a game of Go in progress.

On 15 March 2016 Google Deepmind's AlphaGo defeated 9 dan Go professional Lee Sedol in a Go tournament, in what might be described as a watershed moment for AI. Google Deepmind had published their research in the January 2016 edition of Nature [19], having previously defeated a 2 dan Go professional in October 2015.

6 Figure 2.8 and the discussion later in this chapter provide an example of how MCTS can fail to find an optimum decision.

AlphaGo succeeded by using a neural-network to bias both the building of an MCTS tree and the playout used to evaluate non-terminal nodes. One of the key contributions of AlphaGo's system was how it combined existing neural-network and tree-search methods. AlphaGo builds an MCTS tree to determine which move to make. The breadth of the search is reduced by using a CNN to estimate the probability that a successful player would choose a particular move, ensuring that the most promising moves are explored first. AlphaGo also used a new technique for determining the value of non-terminal states: moves from the search were cached and similar moves were repeated during the MCTS playout, making the playout less random [19]. Research independent of Google by Jin and Keutzer in December 2015 also highlights the benefits of a combined neural-network/MCTS solution for Go [63]. Gelly et al. also describe a method to bias MCTS random playouts using board patterns to generate a more realistic playout [64, 65]. AlphaGo's method was similar to Gelly's work in that both the playout and the tree-policy were biased utilising hand-crafted Go features, which limited its generality.
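The following fragment sketches the general idea of using a learned policy to narrow an MCTS search: each edge carries a prior probability from the network, and the selection score mixes the observed value with that prior (a PUCT-style rule in the spirit of AlphaGo's approach; the `Edge` fields and the `c_puct` constant are illustrative choices, not Deepmind's published hyper-parameters).

```python
import math
from dataclasses import dataclass

@dataclass
class Edge:
    prior: float              # P(a|s) from the policy network
    visits: int = 0           # n(s, a)
    total_value: float = 0.0  # sum of backed-up values

    @property
    def q(self):
        return self.total_value / self.visits if self.visits else 0.0

def select_move(edges, c_puct=1.5):
    """Pick the action maximising Q + U, where U is proportional to the
    network's prior and shrinks as the edge accumulates visits."""
    parent_visits = sum(e.visits for e in edges.values())
    def score(edge):
        u = c_puct * edge.prior * math.sqrt(parent_visits + 1) / (1 + edge.visits)
        return edge.q + u
    return max(edges, key=lambda action: score(edges[action]))
```

Moves assigned a high prior by the network are explored first, which is exactly the breadth reduction described above; rarely visited moves with small priors may never be expanded at all.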

2.2.5 General Game Playing

The field of General Game Playing (GGP) requires an agent to be able to play a variety of games without human intervention or prior knowledge of the game [66, Chapter 1]. Figure 2.5 articulates how the development of a GGP agent is differentiated from the development of a single-game agent. With a GGP agent, although the agent is created by the developer, humans have no further involvement beyond this initial development, i.e. for each different game the agent is required to learn by itself. GGP focuses on how to learn, rather than how to solve a particular problem. The Stanford Game Description Language (GDL) [66] provides a full logic-based description of the game environment, and is one of the most utilised game descriptions in GGP research. One particular GGP niche focuses on analysing the game description [20], however requiring the full game description limits the applicability to real-world game applications, as outlined in our game theory discussion in Section 1.1.1. For many real-world situations the environment is not fully known to the agent [22, 49]; instead the agent has to explore the environment in order to determine the relevant dynamics. In order to retain broader relevance to real-world applications we prefer that the game environment is represented as a black box system to the agent, as detailed in Section 3, requiring the agent to discover the game dynamics from a virtual environment. Players employing MCTS are currently considered to be using the state-of-the-art approach for GGP [54, 67, 68, 6, 21], as MCTS is game independent and can be halted at any time. Whilst a number of variations to the MCTS algorithm have been suggested, as covered in Browne's MCTS survey [69], they do not necessarily provide an improvement in all games, and in some cases these variations can result in performance degradation [54].

2.2.6 AlphaZero’s Evolution

Google Deepmind's AlphaGo surprised many researchers by defeating a professional human player at the game of Go in 2016 [19], however it was highly customised to the game of Go.

Figure 2.5: Classic Agent Development vs GGP Agent Development, from [70]. Historically, when a computer game player was created the programmer incorporated their knowledge of the game into the design of the system, making the agent only suitable for the game the programmer had in mind at the time of the design. For GGP development the programmer does not know in advance what the game will be; as such the agent is created in such a way as to acquire the required knowledge of the game itself.

In 2017 AlphaGo Zero was released, superseding AlphaGo with a generic algorithm with no game-specific customisation or hand-crafted features, although it was only tested on Go [30]. The AlphaGo Zero method was quickly validated as being generic with the release of AlphaZero, an adaptation of AlphaGo Zero for the games of Chess and Shogi [30]. Both AlphaGo and AlphaGo Zero had a three stage training pipeline - self-play, optimization and evaluation - as shown in Figure 3.3 and explained in Section 3.3.2. AlphaZero differs primarily from AlphaGo Zero in that no evaluation step is conducted. While AlphaGo utilises a neural-network to bias the MCTS expansion [71, 69], AlphaGo Zero and AlphaZero completely exclude Monte-Carlo rollouts during the tree-search, instead obtaining value estimates from the neural-network. This MCTS process is explained in Section 3.3.3.
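As a rough sketch, the three-stage pipeline can be expressed as a simple loop; the function names (`self_play`, `optimise`, `evaluate`) are placeholders for the components described in Chapter 3, not an actual implementation.

```python
def training_pipeline(network, generations, games_per_generation, win_threshold=0.55):
    """Skeleton of a self-play / optimisation / evaluation loop.

    AlphaZero-style training would drop the evaluation gate and always
    promote the newly optimised network.
    """
    best = network
    for generation in range(generations):
        # 1. Self-play: the current best network generates training examples.
        examples = self_play(best, games_per_generation)

        # 2. Optimisation: train a candidate network on those examples.
        candidate = optimise(best, examples)

        # 3. Evaluation: keep the candidate only if it beats the incumbent.
        if evaluate(candidate, best) >= win_threshold:
            best = candidate
    return best
```

The `win_threshold` gate is the evaluation step referred to above; removing it (and always assigning `best = candidate`) gives the AlphaZero variant of the pipeline.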

2.3 Tree Search

2.3.1 Search Methods

A perfect game player selects the optimal move for a given board position to maximise their return. The challenge in designing a game player is how that player will make their decisions, and one approach is to build a game tree. A tree is a hierarchical data-structure suitable for modelling sequential decision situations [46] like board games. The process of exploring a tree to find a sequence of actions that yield a desired state (goal) is called tree-search. In game theory a tree representation of a game is called a game tree, where each node of the tree is a comprehensive representation of the game's state at a particular time and the edges of the tree (connections between nodes) are the actions available to the player from that game state. The root-node is the top of the tree, and is not exclusively the initial game state because a tree can be built any time after the game has commenced; as such the game tree can be thought of as a collection of sub-trees. Figure 2.6 shows a partial game tree-expansion for the game Tic-Tac-Toe. Even for the simple game of Tic-Tac-Toe, which has been solved, building a complete game tree is a larger task than might be expected, with 3^9 = 19,683 possible board arrangements.

While a board game can be represented by a tree, fully defined trees do not exist for complex games [72]; as such, trees are usually built while the game is in progress, specifically for the move being considered. Given the size and complexity of game trees, a large part of game playing research has been to find and/or improve heuristics which can be used to prune the game tree to a manageable size while it is being built. Commencing at the root-node, a tree can be naively built by:

• storing a reference to every edge (move) from the visited nodes in a queue;
• selecting an edge from the queue and then expanding the tree along that edge;
• adding a new child node to the tree and subsequently its edges to the queue; and
• taking the next edge from the queue, repeating until the tree is fully expanded.

Once the tree is built, the highest value move can be selected. Using this naive method for building a tree, if the queue is in first-in-first-out (FIFO) order then the tree will grow breadth-first, and if the queue is in last-in-first-out (LIFO) order then the tree will grow depth-first.

A straightforward consideration is to realise that one player is attempting to maximise their value while trying to minimise their opponent's value. As the tree deepens the edges alternate between moves for the player and the opponent, and consequently the desire to maximise or minimise that player's value. The max player chooses the maximum value and the min player chooses the minimum value in a process called mini-max. For the reasons outlined above, this naive approach results in a large and often unmanageable tree size. α-β pruning is a process which allows a child node and its entire sub-tree to be removed from the tree (pruned) if a move is found to be worse than the previously found best move, by comparing the resulting nodes' values. For a mini-max tree, pruning occurs when a node's value is less than the previously discovered maximum value for the (max) player or greater than the previously discovered minimum value for the opponent (min player). Applying α-β pruning to a mini-max tree results in the same best move as it would without pruning, but from a smaller sized tree. A mini-max tree with α-β pruning is an improvement over the naive method, however it is still too large to be of practical use for many problems.
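A compact sketch of mini-max with α-β pruning is given below; the `GameState` interface (`legal_moves`, `apply`, `is_terminal`, `evaluate`) is an assumed abstraction for illustration rather than the framework used later in this thesis.

```python
import math

def alphabeta(state, depth, alpha=-math.inf, beta=math.inf, maximising=True):
    """Return the mini-max value of `state`, pruning branches that cannot
    influence the final decision."""
    if depth == 0 or state.is_terminal():
        return state.evaluate()           # heuristic or terminal value, max player's view
    if maximising:
        value = -math.inf
        for move in state.legal_moves():
            value = max(value, alphabeta(state.apply(move), depth - 1, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:
                break                     # beta cut-off: min player avoids this branch
        return value
    else:
        value = math.inf
        for move in state.legal_moves():
            value = min(value, alphabeta(state.apply(move), depth - 1, alpha, beta, True))
            beta = min(beta, value)
            if beta <= alpha:
                break                     # alpha cut-off: max player avoids this branch
        return value
```

The pruning tests (`alpha >= beta` and `beta <= alpha`) are exactly the comparisons against the previously discovered best values described above, and removing them recovers the plain mini-max search.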
In addition to the size of the tree, another fundamental challenge with tree-search is how to estimate the value of non-terminal nodes. With most board games the outcome of the game is not known until the end of the game, and even for those games where there is an incremental score, players typically prioritise the final result over some intermediate value (this is especially true in strategic scenarios), giving primacy to the outcome at the terminal state.

Figure 2.6: A partial mini-max game tree for Tic-Tac-Toe, from Russell et al. [47, Chapter 5]. The player at the root-node is trying to maximise their value by their choice of action, while the opponent at the second level down is attempting to minimise the player's value. The utility shown at the bottom of the tree we call the reward, as it is a value assigned to a win, draw or loss in a game for the player whose turn it is.

Due to this focus on the reward at the terminal state, building the tree as outlined above will provide no valuable information about the game's value function until a terminal-node is reached. For this reason a number of approaches are used to estimate the value of non-terminal states. If the game is known in advance then a hand-crafted evaluation function can be created. If the game is not known in advance, then a random play-out (Monte Carlo7 playout) can be used to estimate a node's value. The MCTS method uses random play-outs to estimate the value of non-terminal nodes, and to obtain an upper confidence bound for a node's value to guide the building of an asymmetrical game tree which prioritises the most promising paths. MCTS is a useful generic algorithm which can be used where there exists no prior knowledge of the game to be played.
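A Monte-Carlo estimate of a non-terminal node's value can be as simple as the sketch below, which assumes the same hypothetical `GameState` interface as the earlier examples, plus a `reward_for(player)` method that returns the terminal reward from a given player's perspective.

```python
import random

def rollout_value(state, player, playouts=100):
    """Estimate the value of `state` for `player` by averaging the rewards
    of random play-outs to a terminal state."""
    total = 0.0
    for _ in range(playouts):
        current = state
        while not current.is_terminal():
            current = current.apply(random.choice(current.legal_moves()))
        total += current.reward_for(player)   # e.g. +1 win, 0 draw, -1 loss
    return total / playouts
```

This estimator needs no knowledge of the game beyond its rules, which is what makes the play-out approach attractive when the game is not known in advance.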

2.3.2 Monte Carlo Tree Search

Kocsis and Szepesvári combined statistical confidence bounds with Monte-Carlo planning to develop the MCTS algorithm, which builds an asymmetric tree along the most promising paths [71]. Browne et al. explain MCTS as "a method for finding optimal decisions in a given domain by taking random samples in the decision space and building a search-tree according to the results" [69]. The strength of MCTS is that it is a statistical anytime algorithm, it is simple and domain independent, it converges to the exact solution, and it uses random play-outs to estimate the value of a given node. It is the combination of the exploration provided by tree-search with the random sampling used to guide the tree-expansion that has resulted in MCTS being the cornerstone of many modern general game AI systems [69].

7 Monte Carlo methods use random sampling to estimate unknown values by conducting a random trial and inferring the value from the resulting distribution.

The MCTS process is shown in Figure 2.7. Starting with the current game state, a root-node is created from that state and a game tree is built; the MCTS process for building the game tree is an iterative loop involving four steps:

1. Select a node to explore, descending the tree using the tree-policy until a leaf of the tree is reached.
2. Add a new node to the tree.
3. Employ the default-policy to play out the remainder of the game until a terminal state is reached.
4. Back-propagate the result of the play-out through the visited nodes and update each node's estimates.

These steps are typically repeated until the allocated time has expired, however the steps can also be performed for a specific number of iterations. For each node in the tree the number of wins w and the number of visits n are maintained. If the play-out results in a win then w and n are both incremented during the back-propagation step for all visited edges; only n is incremented if the play-out is a loss. The default-policy describes how the playout is conducted to find a terminal state, and typically this is done by simply randomly selecting actions until the game is over, i.e. a Monte-Carlo rollout.

The selection of which nodes to explore in step 1 is guided by the tree-policy, and the most commonly used method is the upper confidence bound for trees (UCT). UCT utilises the distribution of the results obtained from the play-outs to estimate an upper confidence value for each node. The node with the highest upper confidence value is selected for exploration; this is the most promising path. The upper confidence value is calculated by Equation 2.1. The UCT formula balances the exploration of rarely visited paths with the exploitation of paths which win more frequently. Once the search has concluded, the edge8 which leads from the root-node to the node with the most visits is the predicted best move. A move could instead be selected by choosing the edge which leads to the highest win-rate, however typically the number of visits is used as the deciding variable.

UCT = w/n + A·√(ln(n_p)/n),    (2.1)

where w is the number of wins of the current node, n is the number of visits for the current node, n_p is the number of visits for the parent node and A is a constant. The number of wins is determined by the play-outs. The UCT formula consists of two elements: the value v = w/n and the exploration term e = A·√(ln(n_p)/n). As the tree is expanded asymmetrically each node has a different number of visits, with the more promising nodes having received more visits.
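Equation 2.1 translates directly into code; the sketch below assumes each child node stores its win count, visit count and that the parent's visit count is known, and treating an unvisited node (n = 0) as infinitely attractive is a common implementation convention rather than part of the formula itself.

```python
import math

def uct_score(wins, visits, parent_visits, a_const=1.4):
    """Upper confidence bound for trees (Equation 2.1)."""
    if visits == 0:
        return math.inf                       # force at least one visit per child
    exploitation = wins / visits              # v = w / n
    exploration = a_const * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration

def select_child(children, parent_visits):
    """Choose the child node with the highest UCT score (the tree-policy)."""
    return max(children, key=lambda c: uct_score(c.wins, c.visits, parent_visits))
```

The constant `a_const` plays the role of A in Equation 2.1, trading off exploitation of high win-rate children against exploration of rarely visited ones.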

8 Recall that an edge is the connector to each node, and for a game tree an edge represents a legal action/move that the player can take from that node.


Figure 2.7: MCTS process figure from Browne et al. [69], showing the four stages: selection, expansion, simulation and backpropagation. The method of expanding the tree is called the tree-policy, and typically the upper confidence bound for trees (UCT) is used to guide the tree-expansion. The default-policy is the approach for conducting the playout from a non-terminal state to a terminal state, and typically this is done by randomly selecting actions until the game is over.

Despite the benefits of MCTS and the numerous variations described in Browne et al.'s MCTS survey paper [69], the highest standard achieved by a Go-playing MCTS agent is that of a good amateur. The reason for the underwhelming performance in Go is that MCTS requires a representative random sample in order to select the best move [73], and the Go game space is too large to achieve this given current processing limitations. A key limitation of random sampling is that trap states hidden within a large search-space can be overlooked; this occurs frequently in non-convergent games like Chess. Figure 2.8 shows a Chess position which has only one forced-win sequence for white within 3 ply. There are approximately 108,000 possible unique combinations of legal actions which can be made within 3 ply, without including repeated board positions arising from moving a piece and then moving it back. This means that there is approximately a 1 in 108,000 chance that a random rollout will find the winning position if repeated positions are excluded, and a much lower probability if repeated positions are included. Trap states like this one are problematic for MCTS due to its reliance on random sampling, and quickly become intractable with only a small number of ply. However, if there were some mechanism to inform the search in such a way that moves near the opponent's King were explored first, then it is reasonable to expect that the winning sequence of actions might be found more readily.

MCTS is not suitable for all games, particularly non-convergent games, however it has been adopted for general problems due to its general nature and is the preferred decision system for GGP, superseding methods using classical AI approaches [6]. With sufficient processing resources MCTS has been shown to converge to the optimal solution when using UCT [71], however the required resources for complex games are beyond the budget of many researchers.

Figure 2.8: Chess problem from [74], white to play and win in 3 ply. Each white piece is labelled A-L and the board is marked indicating each possible cell that a particular white piece could be moved to in the first action, with a corresponding label. The numbers 1-3 indicate the moves required to force white's win within 3 ply. White has approximately 60 moves to choose from, black will then have approximately 30 moves and finally white another 60 moves, giving an estimated number of possible action sequences of 60 × 30 × 60 ≈ 108,000, but only one sequence which is a forced win for white. Naive random sampling of this state-space is unlikely to result in the optimal moves being selected.

2.4 Neural Networks

A problem can take one of two forms: calculating answers from a given function, or calculating the function from a set of answers (examples). The field of machine-learning studies how computers can learn without being explicitly programmed [75]. Linear regression models the relationship between two variables, e.g. y = Mx + C where M and C are constants and y and x are variables. The process of determining the line of best fit for this function can be thought of as a simplistic machine-learning algorithm, since example data-points are used to estimate the relationship between the two variables. Machine-learning can also learn to classify data, with the key difference being that classification is discrete, whilst regression is continuous. For example, the binary outcome of whether a particular object exists in an image or not is a machine-learning classification problem. Real-world problems, however, are rarely linear and typically have many more variables than two. Neural-networks are a machine-learning approach which can be used for multi-dimensional regression or classification problems by learning from examples. For the two-player game-playing regression problem, the aim is to find the optimal value-function from a set of exemplar game experiences which, when employed by a game playing agent, results in the player winning.
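As a concrete illustration of the "learn the function from examples" framing, the snippet below fits y = Mx + C to noisy samples with ordinary least squares using NumPy; the synthetic data and true coefficients are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.5 * x + 1.0 + rng.normal(scale=1.0, size=x.shape)  # noisy samples of y = 2.5x + 1

# Least-squares estimate of M and C from the examples alone.
M, C = np.polyfit(x, y, deg=1)
print(f"estimated M={M:.2f}, C={C:.2f}")      # should be close to 2.5 and 1.0
```

The estimated coefficients can then be used for inference on x values never seen during fitting, which is the same train-then-infer pattern used by the neural-network approaches discussed below, only in a far higher-dimensional setting.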

2.4.1 Machine Learning Problem Types

Machine-learning problems may or may not have exemplar experiences available in advance. For instance, in a typical linear regression problem a series of data-points are known in advance and these are used to build a model which represents the data as closely as possible. This model can then be used to infer additional data-points (extrapolate or interpolate). The process of using a machine-learning model is called inference. The linear regression example is single-shot, in that the model is built once and then utilised as required. This is similar to a number of more complex machine-learning problems, like the identification of an object in an image. In an object identification problem, labelled data consisting of exemplar images with and without the object is collated in advance and the machine-learning agent is trained to recognise the object from the data-set. Once trained, if successful, the agent can be used for inference on images which it has previously not seen. Supervised and unsupervised-learning, which are briefly explained below, are typically single-shot learning problems; however some problems do not have labelled data readily available.

For some problems where examples do not exist in advance, they can be generated while the agent explores the problem space. A robot navigating a new environment, a computer game player, and most control problems are problems where examples can be created through exploration. These types of problems do not necessarily have a defined conclusion to learning and could be called continual learning problems. For a continual learning problem the agent continues to navigate within the environment, generating examples to learn from; this is distinct from single-shot learning, where once the examples are used learning is over. An agent in a continual learning environment will continue to improve up to the capacity of its neural-network, if the system is operating effectively, provided the experiences are accurate and sufficiently diverse. For many continual learning problems, the generated experience utilises feedback from the environment to create the ground-truth label, and for some environments this feedback is only sparsely provided, e.g. occurring only in terminal states. The consequence of sparse feedback is that the agent has to use an estimated ground-truth label where no feedback is given, which can result in examples which are imprecise reflections of the environment. Since the agent is "learning as it goes", it is not able to accurately label the examples until it has received feedback from the environment, which in turn means that the early estimations are inaccurate. This early inaccuracy is expected when employing reinforcement-learning, as discussed in Section 2.4.3, because as training progresses and the agent improves, the generated examples also progressively improve. We focus on this characteristic of reinforcement-learning to improve learning by seeking to prevent generating predictably poor training examples.

2.4.2 Neural Network Architectures

Multi-layer Perceptron

Inspired by biological processes, neural-networks consist of interconnected virtual neurons (perceptrons) which respond to stimulus with an output defined by an activation function Y′(x), which ideally simulates a desired response. Neural-networks have been shown to be universal approximators [76] and are useful for regression problems. Feed-forward neural-networks propagate information from input to output, as opposed to recurrent neural-networks which have bi-directional information flow using a feedback loop. Figure 2.9 shows a simple feed-forward network for the Boolean AND function; although the weights in this example are pre-defined, learning such weights through the process of training is the core challenge for a neural-network. A perceptron layer is also known as a fully connected layer, since weights connect the perceptrons to all neurons in the previous layer. Neural-networks can learn features from noisy data, as demonstrated by being able to discriminate voices from noisy audio signals [77] or identify objects in a poor quality image [78]. This ability to learn under noisy conditions allows neural-networks to learn from approximations rather than exact data-sets. CNNs are one specific type of neural-network which have been effectively used for many image processing problems.

Figure 2.9: Neural-network Boolean AND implementation with truth table. A three-layer perceptron with an input layer (x1, x2), a hidden weighted-summation layer and an output layer.
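The AND network of Figure 2.9 can be reproduced in a few lines; the particular weights and bias below are one valid choice assumed for illustration, since any weights satisfying the truth table would do.

```python
import numpy as np

def perceptron_and(x1, x2, weights=(1.0, 1.0), bias=-1.5):
    """Single perceptron implementing Boolean AND with a step activation."""
    activation = np.dot([x1, x2], weights) + bias
    return int(activation > 0)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, perceptron_and(a, b))      # prints 1 only for (1, 1)
```

Training replaces the hand-picked `weights` and `bias` with values learned from examples, which is the process described in Section 2.4.3.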

Convolutional-Neural-Networks

CNNs are a specific type of feed-forward neural-network where individual neurons process overlapping perspectives of the input. CNNs have typically been used in many visual applications as they can identify features and objects in an image [79]. A convolution is an integral transform9 which can indicate the amount of similarity between the kernel g and the input f, such that [f ∗ g](t) = ∫_0^t f(τ)g(t − τ)dτ [81, Chapter 1]. In image processing, a 2-dimensional convolution can be used to create filters by using pre-defined kernels to transform an image, for example transforming an image into an embossed version of itself as shown in Figure 2.10. A CNN utilises stacks of kernels to transform the input into the decision space, with multi-layer perceptrons representing the kernels. These kernels are often referred to as CNN features because, once they are learned, they represent the features of the problem.


Figure 2.10: 2D Convolution from https://developer.apple.com/[82]. (a) Demonstrates how a patch of the image space is compressed in a convolution to produce a single datapoint. (b) Embossing using a convolution with the original image and the embossed output.
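A direct (unoptimised) 2D convolution can be written in a few lines; the 3×3 emboss-style kernel and toy 5×5 patch below are assumed examples, and real CNN libraries implement the same operation far more efficiently.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D convolution of a greyscale image with a small kernel."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    flipped = kernel[::-1, ::-1]               # true convolution flips the kernel
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * flipped)
    return out

emboss = np.array([[-2, -1, 0],
                   [-1,  1, 1],
                   [ 0,  1, 2]], dtype=float)
patch = np.arange(25, dtype=float).reshape(5, 5)
print(conv2d(patch, emboss))                   # 3x3 filtered output
```

In a CNN the kernel values are not fixed in advance as they are here; they are the trainable parameters, which is why the learned kernels are referred to as features.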

The weights of a CNN are learned using neural-network training methods, and sequentially cascading multiple convolutional layers to create a deep CNN allows higher order features to be learned, as shown in Figure 2.12. An example deep CNN architecture is shown in Figure 2.13. An image can be represented as a tensor with a given height and width extending over several planes, typically with each plane representing a primary colour of the image's pixels. Likewise, a board game with x × y cells and n types of pieces can also be represented as an x × y tensor with n planes, in a similar way to how an image is represented. This insight allows the same neural-network methods used for vision applications to also be used to represent board game inputs. CNNs have underpinned much of the advancement in vision applications [83] and also provide a suitable architecture for board game problems. By using a CNN to learn from board game positions, the CNN learns abstract game features which underlie the structure of the game, essentially seeking to learn the transform from the board position to the value function of the game.

Despite the ability of CNNs to learn features, many solutions still rely on hand-crafted game specific features. Google's AlphaGo employed a CNN in addition to using approximately 142 million hand-crafted small pattern features, as listed in Figure 2.11. It is understood within the field of AI that "learning methods work best when supplied with features of a state that are relevant to predicting the state's value, rather than with just the raw state description" [84, Chapter 3].

9 An integral transform maps the input from its domain to another domain [80].

Figure 2.11: AlphaGo hand-crafted features [19].

The requirement to handcraft features highlights the limitations of many learning algorithms [85]. Whilst a board position (game state) can generally be represented in a similar way to an image, supplementing the feature learning of the neural-network with a pre-processing step which highlights previously known features has been fundamental in computer game-playing neural-network approaches prior to AlphaZero.
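The board-as-image analogy described above can be made concrete as follows; the piece codes and plane ordering are arbitrary choices for illustration, not the encoding used elsewhere in this thesis.

```python
import numpy as np

def encode_board(board, piece_types):
    """Encode an x-by-y board into an (x, y, n) binary tensor,
    one plane per piece type (mirroring the colour planes of an image)."""
    x, y = len(board), len(board[0])
    planes = np.zeros((x, y, len(piece_types)), dtype=np.float32)
    for i, row in enumerate(board):
        for j, cell in enumerate(row):
            if cell in piece_types:
                planes[i, j, piece_types.index(cell)] = 1.0
    return planes

# Example: a 3x3 Tic-Tac-Toe position with planes for 'X' and 'O'.
board = [['X', '.', 'O'],
         ['.', 'X', '.'],
         ['.', '.', 'O']]
tensor = encode_board(board, piece_types=['X', 'O'])
print(tensor.shape)                            # (3, 3, 2)
```

Because the output has the same height-width-channel structure as an image, it can be fed directly to the same convolutional architectures used for vision, without any game-specific pre-processing.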

Deep Residual Neural Networks

Image processing is an important challenge with many direct uses and external applications. A number of competitions exist to determine the best methods of understanding images, including object detection, object localisation and image classification. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [87] was one of the primary challenges allowing the comparison of methods on a standardised dataset. ILSVRC operated independently from 2010 to 2017, and by 2015 the classification error for their dataset had decreased 4.2x (28.2% to 6.7%), with a 1.7x reduction in single object localisation error (42.5% to 25.3%) [87]. The competition is now held on kaggle.com [88].

One of the primary changes giving rise to the improvement in image processing was the structure of the neural-networks. Between approximately 1989 and 2015 [89, 90, 91] deep CNNs were the predominant image processing architecture, as shown in Figure 2.13. A deep CNN is loosely defined in the literature as meaning more than one convolutional layer, with deeper CNNs allowing higher order features to be learned, resulting in better image processing performance. Training a deep neural-network is difficult, and one of the key advancements arising from ILSVRC was the 2015 winner's use of a deep residual network. A deep residual network, as shown in Figure 2.15, was found to be easier to optimise than an equivalent CNN, and as such significantly deeper networks were able to be used [92].

Explained in detail in the next section, neural-networks are most commonly trained by propagating error-reducing adjustments (gradients) backwards from the output to the input. The result of back-propagating gradients is that each preceding layer's calculation is dependent on the previous one, and as a result these gradients can either grow uncontrollably (explode) or vanish [93, 94], especially if there are a large number of layers [95], i.e. a very deep network.

Figure 2.12: Learned features of a three layer CNN, with layer 1 showing fundamental features and each layer having higher order features. Layer 1 has lines and blobs, while layer 2 has recognisable facial features and layer three different facial types. Reproduced from [86].

Figure 2.13: Deep CNN architecture used for image processing, from [87, Wang et al., 2014 Challenge], showing 6 convolutional layers (grey), two dropout layers and the output layer (softmax).

Figure 2.14: Deep residual neural-network basic building block, showing the output of a residual block combining the block's output with the block's input. Residual blocks can be layered like any other neural-network blocks, with numerous residual blocks being combined to create the neural-network. The weight layers shown in the figure can be any neural-network layer; in He et al. they use a CNN [92].

A residual neural-network is structured so that the output of a residual block combines with its input using skip connections, as shown in Figure 2.14. The effect of combining the input and output in this manner is explained fully in He et al. [92]; in short, the skip connection provides a reference point which helps ensure the back-propagated error does not vanish or explode.
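In a modern framework a residual block reduces to a handful of lines; the PyTorch sketch below assumes equal input and output channel counts so the identity skip connection applies (the zero-padded variant from Figure 2.15 is omitted), and the layer sizes are illustrative only.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: two conv layers whose output is added to the input."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)              # skip connection: add the block's input

block = ResidualBlock(channels=64)
print(block(torch.randn(1, 64, 8, 8)).shape)   # torch.Size([1, 64, 8, 8])
```

Stacking many such blocks yields the deep residual architectures shown in Figure 2.15, with the skip connections keeping the back-propagated gradients well behaved.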

2.4.3 Neural Network Learning

Back-propagation

When training a neural-network, its parameters (weights) θ are adjusted to minimise some loss function. To determine how to adjust the parameters, the loss function compares the network's prediction y′(s) with the ground-truth value y(s); this difference is then reduced by adjusting the weights, where s is the input to the neural-network. Hinton provides a simple three step process to describe the training process [96]:

• Provide labelled data - exemplar input data labelled with ground-truth output data i.e. {s, y(s)}.

• Measure the difference between the inferred neural-network output y′(s) and the ground-truth value y(s), giving the error.

• Change the weights θ so the network produces a better approximation, reducing the error between y(s) and y′(s).

The error arising from the difference between the neural-network's inference and the ground-truth can be used to build a difference map with the proportional contribution of each weight to the error, called the error derivative [96]. The error derivative can be incrementally calculated by first processing the output layer, then progressively propagating the error backwards toward the input layers, in a process called back-propagation. The back-propagation algorithm, first proposed by Paul Werbos in 1974 as part of his doctoral thesis at Harvard [97], was at first thought to be only applicable to classification problems, but Werbos again defended the algorithm in 1988, demonstrating its general applicability [98]. The generality of the back-propagation algorithm is now uncontested, and the algorithm is one of the most commonly used methods to calculate the error derivative and enable the adjustment of neural-network weights. To reduce the error, the steepest gradient is found, reflecting the largest change in the loss, and the weights are adjusted accordingly; this method is called gradient-descent.

The back-propagation algorithm is processor intensive, as a gradient for each weight in the network needs to be calculated. It is only relatively recently that the hardware has become available to truly enable deep learning, i.e. neural-networks with large numbers of neuronal layers. Fortunately, arising from a decade of commercial competition in computer video gaming, the development of graphics processing units (GPU) has resulted in consumer-affordable parallel computing hardware. Designed to calculate 3-dimensional rendered scenes on the fly, GPUs are also suitable for back-propagation, and consequently machine-learning applications. To some extent the advancement in 3D computer game graphics hardware has enabled the recent progress in machine-learning.

Depending on the problem type, there are a number of neural-network learning approaches, however the predominant method for updating the weights remains the back-propagation algorithm with gradient-descent. Despite the different problem types that neural-networks can be used for, if they use back-propagation with gradient-descent they require a loss function which is both differentiable and reflective of the difference between the network's prediction and the ground-truth value. The challenge with some problems is that these are not readily available; for example, some problems exist without ground-truth values, while other problems seek to control the long-term outcome of an environment rather than making an immediate decision, as is the case with a classification problem.

Supervised learning is where exemplar inputs are labelled with ground-truth data, and Hinton's three step process above is directly applied to these problems. If ground-truth data is not available then unsupervised-learning can still be used to learn an underlying representation of the problem [85, 100]. With unsupervised-learning the output of the neural-network seeks to reproduce the input, and the input is used as the ground-truth value.
This process still allows the neural-network to learn the dynamics of the problem using essentially the same process as with supervised-learning. An auto-encoder is a typical unsupervised-learning system, with an example architecture shown in Figure 2.16. Once trained, the learned features can be used to supplement other machine-learning or decision making methods [99].

Figure 2.16: Unsupervised learning: auto-encoder architecture from [99], showing the input L1 = x_1..n, the output L3 = x̃_1..n and the learned representation L2 = y_1..m. Note that the learned representation L2 has fewer parameters than the input L1, i.e. m < n. h_w,b(x) is the output of the autoencoder, which seeks to reproduce the input x using the reduced representation y; the input L1 becomes the ground-truth value for estimating loss gradients. Despite not having any specific ground-truth labels, this architecture allows the standard back-propagation algorithm with gradient-descent to learn the network's weights.

To train a network using supervised or unsupervised-learning, the ground-truth value is immediately compared with the inference value; however, in control problems the interest lies in some future reward, not the immediate outcome of a singular action. Reinforcement-learning is a method which is able to learn a sequence of actions in problems where a number of steps are taken before the true consequences of the actions are known.
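Hinton's three steps map directly onto a gradient-descent update; the NumPy sketch below fits a single linear layer with a mean-squared-error loss, a deliberately minimal stand-in (with illustrative synthetic data) for the full back-propagation procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
s = rng.normal(size=(256, 4))                  # exemplar inputs
true_theta = np.array([0.5, -1.0, 2.0, 0.1])
y = s @ true_theta                             # ground-truth labels y(s)

theta = np.zeros(4)                            # network parameters to learn
learning_rate = 0.1
for step in range(200):
    y_pred = s @ theta                         # 1. infer y'(s)
    error = y_pred - y                         # 2. measure the difference
    gradient = s.T @ error / len(s)            # gradient of the MSE loss w.r.t. theta
    theta -= learning_rate * gradient          # 3. adjust the weights
print(np.round(theta, 2))                      # approaches [0.5, -1.0, 2.0, 0.1]
```

In a deep network the single gradient expression is replaced by back-propagation through every layer, but the loop structure (infer, measure the error, step the weights down the gradient) is unchanged.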

Reinforcement-Learning

Reinforcement-learning (RL) seeks to learn good decision making for an environment by mapping the environment’s states to their optimal actions. Actions taken within the environment result in reward signals, and an artificial agent using RL discovers these rewards during exploration as shown in Figure 2.17. These rewards can occur frequently or sparsely. RL agents typically seek to maximise rewards over the long-term, allowing strategies to be learned which forgo short-term gains for more favourable long- term outcomes. For example, in the share market, trading a stock incurs a cost which would be considered a negative reward however it is clear that to profit in playing the share market these short-term costs are unavoidable. The same situation can occur when playing board games where one player might choose to sacrifice one or more key pieces, appearing to reduce their value in order to set up a later victory. For these reasons short-term rewards provided by the environment are often re-evaluated by the agent. Reinforcement-learning is a credit assignment problem, seeking to allocate credit for each of the actions that were taken. Exploration of the environment and delayed reward are the driving characteristics of RL problems [101].

Figure 2.17: The interactions of an agent and its environment, from Sutton and Barto [101]. At any time t the environment is in state st with a reward rt. The agent makes a decision to take an action at from its entire set of actions, which changes the environment to st+1, rt+1. The agent is seeking to maximise its reward.

An agent operating in any environment has:

• sensors (a mechanism to measure the state of the environment),

• actions which it can take to change the environment (which includes simply moving itself spatially), and

• some goal or motivation.

With reinforcement-learning problems, while the actions are known their effect on the environment is not. For example an agent may operate in a 1 dimensional environ- ment where they can only move forwards or backwards, however despite only having two movements the consequences of those actions can be highly complicated like in the cart-pole problem. The challenge of the cart-pole problem is to keep a pendulum pole vertically balanced while on a rail-cart which can move only left and right as shown in Figure 2.18. Despite the cart-pole problem having a small action space (left and right), and the state being able to be described with just 2 variables (cart position and pole angle) the dynamics of balancing the pole is a complex problem. An RL agent initially learning the cart-pole problem repeatedly fails to keep the pole vertical, however after sufficient episodes (attempts) it can progressively learn to keep the pole vertical by moving the cart.

Figure 2.18: Cart-pole diagram from Nagendra et al. where the challenge is to keep a pendulum pole vertical on a moving cart where the cart can only move left and right. The cart problem has a small action space (left and right) and requires only 2 variables (cart position and pole angle) to describe the state, however the dynamics of the problem are highly complex.

In board games, exploration of the environment can occur by self-play, i.e. the agent playing the game against itself. The state of a board game can sometimes be comprehensively represented by the board position, i.e. the arrangement of pieces on the board, however some games have move limits, time limits or repetition restrictions which are external to the board. To ensure that the game state is comprehensively represented, these non-board elements of the game need to be encoded in the state description. For instance, a position that is repeated for the third time in Chess results in the game being over and a draw being declared.

The third occurrence of the board position is obviously different from the first two, since it results in the game being over despite the board positions being identical. To create a comprehensive state representation which incorporates this element of Chess, the number of times a position has been seen needs to form part of the game's state representation.

If the state is comprehensively defined and does not require past states, then it is said to have the Markov property and to be a Markov state [102, 103]. A sequence of Markov states, called a Markov chain, is a stochastic model where each state depends only on the previous state. Formally, a chain is a Markov chain if and only if P(St+1 | St) = P(St+1 | S1, ..., St), where P is the state transition probability matrix, S is a finite set of states and t is the time step [104]. A Markov chain is represented by the tuple <S, P>, where S is a set of states and P is a transition probability matrix [105].

A Markov chain can be extended by adding the expected reward R when transitioning to a new state. A Markov reward process is a Markov chain with an accompanying reward for each state, Rs = E[Rt+1 | St = s], giving the representation tuple <S, P, R>10. Finally, a Markov decision process (MDP) is a Markov reward process which incorporates the finite set of legal actions A, giving the tuple <S, A, P, R> where:

• S is a finite set of states,

• A is a finite set of actions,

• P is a state transition probability matrix, P^a_ss' = P[St+1 = s' | St = s, At = a],

• R is a reward function, R^a_s = E[Rt+1 | St = s, At = a], and

• typically a reward discount factor γ ∈ [0, 1] is also included.

An MDP can formally describe a reinforcement learning problem, recalling that both MDPs and reinforcement learning problems are defined by a set of states S, with actions A that result in transitions between states with a probability distribution P, and some reward R which is received for changing state. The aim of a reinforcement learning problem is to find the mapping of states to actions, i.e. the policy π, which maximises the cumulative reward11. Markov chains and Markov decision processes are proven mathematically and are also able to address situations where states are not fully Markov compliant and where the problem is not fully observable [106]. For a deeper understanding of Markov processes the book by Boucherie and van Dijk [107] provides comprehensive coverage; the critical element for reinforcement-learning in this thesis is that the representation of a state has the Markov property, such that it stands alone in being able to inform the agent irrespective of the path taken to arrive at that state. Sutton and Barto, in Chapter 3 of the second edition of their book on reinforcement learning, also cover MDPs in detail, as well as providing extensive coverage of the mathematical foundations of reinforcement learning [108].

10 A Markov reward process can include some reward discount factor γ to avoid infinite returns in cyclic processes and to account for uncertainty in future rewards.
11 The cumulative reward is also called the return.

Training a Reinforcement-Learning Agent

The process of learning the neural-network weights for an RL agent is often done using back-propagation with gradient-descent, as explained in Section 2.4.3. As noted in Section 2.4.3, gradient-descent requires the comparison of the neural-network inference12 y′(s) with some ground-truth value y(s), where s is the input to the neural-network, i.e. the state of the environment. We established in our preceding discussion that the defining properties of RL were "exploration of the environment and delayed reward". It follows that, since the environment needs to be explored and may only intermittently provide a reward, ground-truth data may not necessarily exist for RL problems; however it can be estimated during exploration.

An agent learns via reinforcement-learning by minimising the difference between the network's prediction for a state at one time and a future prediction for the same state at some later time, called temporal difference (TD) learning [75, 109, 110]. In most board games it is only at the terminal state that the winner can be declared and a corresponding reward R assigned. For terminal state sT, it is common for R(sT) = {+1, −1, 0}, the reward for where the game is a {win, loss, draw} for the agent. For states of the game which are not terminal, typically no reward is provided; as such we seek to learn the value function v(s) for all states, where the value for terminal states is equal to the reward, v(sT) ← R(sT). With v(sT) defined, it follows that the state immediately preceding the terminal state, sT−1, would have some proportion of the terminal state's value, i.e. v(sT−1) = γ · v(sT), where γ is the proportion of the future state's reward to apply to sT−1, known as the discount factor. The value function can then be estimated by propagating the future values backwards through all of the previously visited non-terminal states. This can be written more generally as v(st) = γ · v(st+1), stating that the value of the current state is some proportion of the value of future states. Having used this method to calculate v(st), it can then be compared with the neural-network's inference v′(st) as part of the back-propagation algorithm to adjust the weights. With a sufficient number of training episodes/games conducted, a sufficiently diverse mapping for v(st) can be built using this approach.
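The backward propagation of the terminal reward through an episode can be sketched as follows; the `(state, target)` trajectory format is an assumption made for illustration, consistent with the v(st) = γ · v(st+1) relation above.

```python
def td_targets(episode_states, terminal_reward, gamma=0.99):
    """Assign a training target to every state visited in one self-play game.

    The terminal state receives the game's reward directly; each earlier
    state receives a discounted copy of its successor's value.
    """
    targets = [0.0] * len(episode_states)
    targets[-1] = terminal_reward              # v(s_T) <- R(s_T)
    for t in range(len(episode_states) - 2, -1, -1):
        targets[t] = gamma * targets[t + 1]    # v(s_t) = gamma * v(s_t+1)
    return list(zip(episode_states, targets))  # (state, target) training pairs

# Example: a five-move game that the agent eventually won (+1).
pairs = td_targets(['s0', 's1', 's2', 's3', 's4'], terminal_reward=+1.0)
print([(s, round(v, 3)) for s, v in pairs])
```

Each (state, target) pair then plays the role of the labelled example {s, y(s)} from supervised learning, with the target compared against the network's inference v′(st) during back-propagation.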

Since each state has a set of actions which can be taken, where st+1 ← st(ai) and ai is an action chosen from the set of n legal actions At = {a0, a1, .., an}, the value of the state can be estimated by averaging the future rewards over all possible actions. The future prediction can also be approximated by conducting a tree-search to look ahead in time; this variation of TD learning is called TD-Leaf [111] and underpins AlphaZero's combined neural-network design. That is, AlphaZero uses the difference between the neural-network's inference and the outcome of a tree-search to train its network using back-propagation with gradient-descent.

12 The output of the neural-network for the given input s.

2.4.4 Overfitting

In most applications of neural-networks the aim is to train the network using the available examples and then employ it for problems which it has not explicitly seen, but which are still within the same problem space. If the neural-network fails to generalise then it is said to have overfit [112]. Take for instance Figure 2.19, which shows two images of the same cat. The cat images are very different, with differences in pose, distance to the cat, background and image lighting, however a cat detection classification neural-network would be expected to identify the cat in both images. Furthermore, cats of different colours and shapes would also be expected to be identified regardless of the image they appeared in, without having previously been seen by the neural-network. Given the significant variance in cat breeds, the large number of potential poses, and the possibilities with regard to background and image exposure, it becomes evident that the entire set of cat images cannot be presented to a neural-network for learning. An object detection system using current classification and object detection methods [113] would have no issues in detecting the cats shown in Figure 2.19, despite having never seen either this cat or these images previously. There are a number of potential causes of overfitting which have to be considered when designing a neural-network.

Figure 2.19: Cats with different pose, background and image exposure are still cats. The challenge in training a neural-network is to learn only the features which are relevant to the problem.

Consider a cat detection system using a neural-network which is only trained using images of one breed of orange cats. The system will most likely fail when a cat of a different breed or different colour is presented to it. This example of overfitting is caused by insufficient diversity in the training set and can occur inadvertently where the smaller subset of training samples is highly correlated. Cats which are one breed and one colour form only a small subset of the entire cat problem space and, as expected, if a neural-network is trained on a narrow subset of the problem space then it will only be usable for that subset. This hypothetical cat detection system may be able to generalise and detect other orange cats of the breed type that it hasn't seen, however unless this is a specific design requirement the network will have failed to generalise to all cats, i.e. it is overfitted. To prevent this, a key principle of training a neural-network is to ensure diversity in the exemplar data-set. For complex games that have large state-spaces, only a fraction of the entire game space can be stored to train a neural-network. It is a practical limitation that the entire state-space cannot be used; as such, care has to be taken to ensure that the smaller subset used to train the neural-network is sufficiently diverse to represent the entire problem space. In some applications noise is deliberately added to increase the diversity of examples, however in some cases the neural-network can also learn the noise and incorrectly associate it as a feature of the solution.

The examples used to train a neural-network will often be noisy or have extraneous information which is irrelevant to the problem. Images of cats can have diverse backgrounds and include other objects which are not cats, and if these other elements occur frequently then the network may include them as part of its model. Burnham et al. define overfitting as "unknowingly extracting the residual variation (i.e. the noise)" [114]. Erroneously learning the noise can occur if a common irrelevant signal consistently appears with the positive examples, e.g. if most of the cats were wearing a small bell then the neural-network may learn that bells form part of the feature space for cat detection. In practice small bells may in some way contribute to identifying some cats, however they are not organic to a cat's composition and as such are sub-optimal features to learn for generalisation.

Ribeiro et al. performed an investigation into whether we should trust some classifiers, and tested a system which was intended to discriminate a husky13 from a wolf, where the key feature that the neural-network relied on was whether there was snow in the background [115] and not the physical appearance of the animal. Many of the wolf examples were in the wild (with snow), whilst many husky images included urban terrain since they are pets. This is a subtly different issue from a lack of diversity, as extraneous features are saturating the true features which would permit more general prediction. If a game playing neural-network agent were to be trained on experiences using only one opponent, then the network may learn how to defeat only that opponent, and not actually learn how to play the game, i.e. learning the features related to that opponent's playing style instead of the game.
Likewise, to learn a board game from scratch implies that the agent may play very poorly (or randomly) initially, and by doing so it may introduce some level of consistent noise which the agent may learn, particularly if it is generating its own experiences. Learning extraneous features can, in some cases, be attributed to having a network which is too complex in relation to the problem space. For a simple regression problem, attempting to fit a 2nd order polynomial, or higher, to a first order (linear) regression problem is likely to result in a good approximation for the presented data, but a significant divergence when extrapolating [116]. The same is true of a neural-network's trainable parameters: the more parameters that exist, the more likely it is that the network can be fitted to the data that has been presented, regardless of whether the data contains extraneous features or not.

13 A domesticated dog that has a wolf-like appearance.

The conundrum that exists is that for more complex problems, more trainable parameters are required, and consequently the network becomes more prone to learning irrelevant features at the expense of generalising.

Parameter normalisation methods are regularisation techniques used to prevent overfitting of large neural-networks by penalising large weight parameter values, typically by adding an element to the loss function [117]. In Section 2.4.3 we explained that a neural-network seeks to minimise some loss function via back-propagation. Adding an element to the loss function which increases as the neural-network's weight parameters increase has the effect of keeping the weight parameters smaller. Equation 3.1, the loss function used for training the agents throughout this thesis, shows L2-norm regularisation. A related, but slightly different, method is weight decay, where the weights are directly reduced by some proportion after each training step [118, 119]. These methods seek to ensure that the parameter values retain relatively small, manageable values.

Another common method to prevent over-fitting related to network complexity is the use of dropout [120]. Dropout removes a portion of the network's weights during training in order to temporarily reduce the number of parameters which could be fitted to the experiences. Dropout prevents the neural-network's parameters "from co-adapting" [121]. With CNNs, however, randomly removing individual weights during training makes little sense, since features are derived from spatial relationships. Spatial-dropout [122] differs from traditional dropout in that a portion of the feature space is dropped out during training, as shown in Figure 2.20. Spatial-dropout retains spatial relationships, but still reduces the number of free parameters used to train the neural-network.

Overfitting can also be caused by too much training, resulting in the trainable parameters becoming saturated. To prevent overfitting arising from excessive training in supervised-learning problems, early-stopping is employed [123, Chapter 2]. Early-stopping methods monitor the change in loss between training epochs and halt when learning begins to deteriorate. Batch-normalisation is another method to prevent the saturation of trainable parameters by continually balancing the weights.

Unless training experiences are highly curated, there is often a difference between the distributions of the training set and the test set. For game playing this can present itself in the form of having a disproportionate number of drawn games, or an imbalance between terminal experiences and non-terminal experiences. Because of the difference in distributions, when training the neural-network the back-propagation algorithm will attempt to fit the training data without accounting for these distribution differences. Covariate shift refers to a change in the distribution of the input variables [125]. Figure 2.21 shows a simple regression problem and demonstrates how different distributions can result in a poorly fitted model. In machine learning, batch-normalisation attempts to compensate for internal covariate shift [126]. The process of batch-normalisation seeks to readjust weights to account for covariate shift.
Ioffe and Szegedy claim that by using the batch-normalisation process a neural-network's learning is less sensitive to the initialisation and the learning rates, and in some cases it eliminates the need for dropout [126].

Figure 2.20: Example of dropout in a fully connected network from [124]. The standard network (a) has all neurons connected, however when applying dropout (b) a random selection of neurons is dropped for that training run.

Recent deep CNN solutions use batch-normalisation in lieu of dropout, including the current state-of-the-art board game player AlphaZero.

Figure 2.21: A simple regression problem where the training and testing data-sets have different distributions. The learned function shown in green fits the training data and a portion of the problem space well, however it is not indicative of the portion of the problem space of interest where the test samples are located. From Herrera's presentation to IWANN on covariate shift [127].

Chapter 3

Evaluation Framework

Highlights

• The state-of-the-art board game playing agent is a combined neural-network/tree-search system and our player design is inspired by the state-of-the-art system.

• We trade speed for reproducibility in design considerations.

• Demonstrating comparative performance is the primary design consideration.

3.1 Introduction

In order to test and evaluate different learning approaches we have built an experimental system which compares the performance of two reinforcement-learning game players. This section outlines the key elements of the system. Our experimental platform consists of two core modules, the game environment and the players, and has the following characteristics:

• Both the environment and the players utilise a standard framework, regardless of the game or the type of player.

• The system is designed in such a way as to reduce the variability between compared players.

• The agent’s performance during both training and gameplay is traded off in favour of reducing the variability of the experiments.

• The system is designed for observing comparative performance not the absolute performance in any one game.

For example, the training pipeline which is used to train the neural-network outlined in Section 3.3.2 is conducted sequentially, however better performance could be obtained by conducting the training in parallel, in the same way that AlphaZero does. Likewise, when two players are compared they are both trained on the same system simultaneously to further reduce the impact of any variance in processor load. Again, better individual player performance could be achieved by providing all of the resources to train a single agent. The result of these design decisions is that the complexity of the agent, the difficulty of the games and the quality of the opponents are constrained to permit the simultaneous training of two AI agents, using sequential processes, within a period of time which is reasonable yet is still sufficiently complex to demonstrate the effectiveness of the methods proposed in this thesis. Our focus is not on the absolute performance of the agent in playing any particular game, but instead the comparative improvement resulting from any modifications we make.

3.2 Game Environment

The game environment encodes the definitions of the game, including the rules and actions, and maintains the progress of the game. The game environment is a black box to the players [128], permitting players to make moves and updating the game state accordingly. The players query the game environment to inform their decisions. Every game has a set A of m possible actions, A = {a1, .., am}. For any of the game's board positions, the environment provides the following (a minimal interface sketch follows this list):

• a tensor of sufficient size to represent the current state s of the game;

• a vector representing the legal actions for the state L(s) ⊂ A in a format accepted by the environment’s move function; NB that A is used to denote the set of all actions while a is used to denote a single action; and

• a scalar W indicating if the game is ongoing, a draw, player 1 wins, or player 2 wins.
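To make this interface concrete, the following is a minimal sketch of the environment API described above. The class and method names (GameEnv, state, legal_actions, winner, move) are illustrative assumptions rather than the exact implementation used in this thesis.

# Minimal sketch of the game-environment interface; names are illustrative only.
import numpy as np


class GameEnv:
    """Black-box game environment queried by the players."""

    m: int  # total number of possible actions for this game

    def state(self) -> np.ndarray:
        """Tensor representation s of the current position."""
        raise NotImplementedError

    def legal_actions(self) -> np.ndarray:
        """Length-m binary vector marking the legal actions L(s)."""
        raise NotImplementedError

    def winner(self) -> int:
        """W: 0 = ongoing, 1 = player 1 wins, 2 = player 2 wins, 3 = draw."""
        raise NotImplementedError

    def move(self, action: int) -> None:
        """Apply a single action a, advancing the state from s_t to s_{t+1}."""
        raise NotImplementedError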

3.2.1 Game Representation

Bengio stated that "The performance of many machine-learning methods are heavily dependent on the choice of data representation on which they are applied" [129], highlighting one of the primary challenges to using a neural-network for general games; that is, suitably representing every possible game with a standardised representation. Numerous researchers have discussed the importance of good representations and highlight that different representations can contain more or fewer "explanatory factors" behind the problem domain [130, 131]. However, the focus of our research is on general games, meaning that hand-crafting features is outside of our scope. Instead we utilise a standardised representation by directly mapping piece locations to positions in a vector.

Players require certain information from the game environment to make decisions. The player requires that any game environment it interfaces with can be queried for the current board position (state) and which player's turn it is. In addition to this, for any given position, the interface is required to provide whether the position is terminal, who has won if it is terminal, and what legal moves are available for the position. Formally the state S is represented as S := (B, P, W), where B is a tensor describing the board, P is an integer indicating which player's turn it is, and W is an integer indicating:

• if the game is still ongoing, W = 0;

• which player is the winner, W ∈ {1, 2}; and

• if the game is a draw W = 3.

The legal actions (moves) for a single state s are represented as a tensor L(s) of size m. For move number t in a game, the player selects a single action a_i from the legal actions L(s_t) based on its evaluation of s_t, and the environment will change state to s_{t+1}; i.e. if F denotes the game's transition function and action a_t is selected then

F : (s_t, a_t, W_t) → {s_{t+1}, L_{t+1}, W_{t+1}}.

The board B is represented as a set of bit planes with sufficient size to represent the playing space. Decomposing the board into bit planes creates sparse matrices, where most of the elements are zero, which has been reported to be beneficial to training neural-network weights [132]. In the simplest form, each unique playing piece has its own plane representing the cells on the board which that piece occupies; i.e. the board shown in Figure 3.1 is represented as 6 independent bit planes, one for each piece type (3 planes for each colour).

0 0 1 0 0 0 0 0 1 0 0 0 0 0 0       0 0 0 0 0 0 0 0 0 0 1 1 1 1 1       B = 0 0 0 0 0 , 0 0 0 0 0 , 0 0 0 0 0             0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0       0 0 0 0 0 0 0 0 0 0 0 0 0 0 0       W = 0 0 0 0 0 , 0 0 0 0 0 , 0 0 0 0 0             0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0

Figure 3.1: Example of how to represent a board position by bit planes. The position is decomposed into bit planes, one for each piece type, which represent the cells that each piece occupies. Shown in B are the bit planes for the black pieces, while W shows the planes for the white pieces.
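As a concrete illustration of Figure 3.1, the short sketch below builds bit planes from a toy 5x5 board. The integer piece codes and the array layout are assumptions made for this example only.

# Illustrative sketch of decomposing a toy board into bit planes (cf. Figure 3.1).
# Piece codes (1..3 for black, -1..-3 for white) are assumptions for the example.
import numpy as np

def to_bit_planes(board: np.ndarray, piece_codes) -> np.ndarray:
    """One binary plane per piece code; the planes are mostly zero (sparse)."""
    return np.stack([(board == code).astype(np.uint8) for code in piece_codes])

board = np.zeros((5, 5), dtype=np.int8)
board[0, 2], board[0, 3] = 1, 2        # two black pieces on the back row
board[1, :] = 3                        # a row of black pieces
board[4, 2], board[4, 3] = -1, -2      # mirrored white pieces
board[3, :] = -3                       # a row of white pieces

B = to_bit_planes(board, [1, 2, 3])    # 3 planes for the black pieces
W = to_bit_planes(board, [-1, -2, -3]) # 3 planes for the white pieces
print(B.shape, W.shape)                # (3, 5, 5) (3, 5, 5)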

3.3 Neural-Network/Tree-Search Game Player

A baseline-player is used to compare our new methods with the current state-of-the-art methods. The baseline-player is a combined neural-network/tree-search reinforcement-learning system which chooses an action a from the legal moves L(s) for any given state s, where L(s) ⊂ A. The neural-network with parameters θ is trained by self-play reinforcement-learning. As the players are inspired by AlphaZero [31], only the components relevant to this thesis will be covered in this section. The parameters for the neural-network/tree-search players are shown in Appendix A. This section details the key design elements for the baseline-player.

3.3.1 Neural-Network

The input to the neural-network is a state tensor s from the game environment. The neural-network with parameters θ has two outputs, a policy vector pθ(s) and a scalar value vθ(s) for the given state s. The policy pθ(s) is an m sized vector, indexed by actions a, representing the probability distribution of the best actions to take from s. vθ(s) is the neural-network's value estimate for s. pθ(s) is used to guide the tree-search, while vθ(s) provides an estimate for the value of states which are non-terminal. The neural-network is comprised of a multi-layer deep residual CNN. Deep residual CNNs have been found to be more stable and less prone to overfitting than traditional CNNs [92] for large networks. The network architecture is shown in Figure 3.2.

Figure 3.2: Architecture for the AI players with input s from the environment and outputs pθ(s) and vθ(s).
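A minimal sketch of such a dual-headed residual CNN is given below, written in PyTorch as an assumption about the framework; the channel widths and number of residual blocks are placeholders rather than the values used for the players in this thesis (those are listed in Appendix A).

# Sketch of a dual-headed residual CNN: shared residual trunk, a policy head
# approximating p_theta(s) and a value head approximating v_theta(s).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(ch)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(ch)

    def forward(self, x):
        y = F.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return F.relu(x + y)          # skip connection

class PolicyValueNet(nn.Module):
    def __init__(self, planes, board_size, n_actions, ch=64, blocks=5):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(planes, ch, 3, padding=1),
                                  nn.BatchNorm2d(ch), nn.ReLU())
        self.trunk = nn.Sequential(*[ResidualBlock(ch) for _ in range(blocks)])
        flat = ch * board_size * board_size
        self.policy = nn.Linear(flat, n_actions)                   # p_theta(s)
        self.value = nn.Sequential(nn.Linear(flat, 64), nn.ReLU(),
                                   nn.Linear(64, 1), nn.Tanh())    # v_theta(s)

    def forward(self, s):
        x = self.trunk(self.stem(s)).flatten(1)
        return F.softmax(self.policy(x), dim=1), self.value(x).squeeze(1)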

3.3.2 Training the Neural-Network

Training Pipeline

The pipeline to train the neural-network consists of two independent processes: self-play and optimisation. A new generation of player is created at the completion of each training cycle i. The initial weights θi=0 for the neural-network are randomised. After the optimisation step, the weights are updated using gradient-descent, yielding new weights θi+1. The self-play process generates training examples using a player with the latest weights to play against itself, filling an experience-buffer as explained later in this section. The optimisation process trains the network using these experiences via batched gradient-descent. The process is shown in Figure 3.3(a). An experience is created and saved for each move made in a game, and given that experiences from the same game are likely to be highly correlated, if care is not taken this will increase the risk of overfitting. Since experiences created from within the same game are correlated, the experience-buffer size is measured in games instead of experiences, to ensure the experience-buffer contains sufficient move diversity. We seek to reduce the variability of training by conducting the training process sequentially on a single system instead of in parallel across multiple systems like the state-of-the-art system. When the experience-buffer contains the experiences from a set number of games, self-play stops and optimisation begins. After optimisation has finished, all experiences from a percentage of the earliest games are removed from the experience-buffer, and self-play recommences.
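The cycle can be summarised with the following sketch. The helper callables (self_play_game, optimise), the buffer layout and the figure of 100 games per cycle are assumptions used only to show the control flow; the 20% replacement fraction is the value adopted later in this chapter.

# Sketch of one sequential training cycle: self-play fills the buffer (measured in
# games), the network is optimised on every buffered experience, then the oldest
# games are removed before self-play recommences.
def training_cycle(buffer, self_play_game, optimise,
                   games_per_cycle=100, replace_fraction=0.2):
    while len(buffer) < games_per_cycle:
        buffer.append(self_play_game())               # one game = list of experiences
    optimise([x for game in buffer for x in game])    # batched gradient-descent pass
    del buffer[:int(replace_fraction * len(buffer))]  # drop the oldest games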

Experiences

Training experiences are generated during self-play using the latest network weights. Each training experience X is comprised of a set X := {s, π(s), z(s)}, where s is the tensor representation of the state, π(s) is a probability distribution over the most likely actions (policy), indexed by a, obtained from the tree search, and z(s) is the scalar reward from the perspective of the player for the game's terminal state. During self-play, an experience is saved for every ply1 in the game, creating an experience-buffer full of experiences from a number of different self-play games. z(s) = rT(s), where rT(s) is the reward for the terminal state of the game: rT(s) is −1 for a loss, +1 for a win, and −0.5 for a draw. We use a slightly negative reward for a draw instead of 0 to discourage the search from settling on a draw, instead preferring exploration of other nodes which are predicted as having a slightly negative value.
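A sketch of how one finished self-play game could be converted into experiences with this reward convention follows; the data layout and the helper name label_game are assumptions for illustration.

# Sketch of labelling the experiences of one finished self-play game.
# Reward convention from the text: +1 win, -1 loss, -0.5 draw, from the mover's view.
def label_game(states, policies, movers, winner):
    """states[t], policies[t]: s and pi(s) at ply t; movers[t]: player to move;
    winner: 1, 2 or 3 (draw). Returns (s, pi(s), z(s)) triples."""
    experiences = []
    for s, pi, player in zip(states, policies, movers):
        if winner == 3:            # draw: slightly negative to discourage settling
            z = -0.5
        else:
            z = 1.0 if winner == player else -1.0
        experiences.append((s, pi, z))
    return experiences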

Updating Weights

During training, experiences are uniformly randomly selected, and parameters θ are updated to minimise the difference between vθ(s) and z(s), and to maximise the similarity between pθ(s) and π(s), using the loss function shown in Equation 3.1.

1 The term 'ply' is used to disambiguate one player's turn, which has different meanings in different games. One ply is a player's single action.

Figure 3.3: Two training loops used to demonstrate the effectiveness of curriculum-learning: (a) AlphaZero inspired and (b) AlphaGo Zero inspired.

One training step is the presentation of a single batch of experiences for training, while an epoch e is completed when all experiences in the buffer have been utilised once. A number of epochs are conducted for each training cycle i. Experiences are stored in the experience-buffer in the order in which they were created. At the conclusion of each training cycle the buffer is partially emptied by removing a portion of the oldest games. We have performed a small preliminary experiment and determined that removing 20% of the oldest games each training cycle is an effective replacement portion, as shown in Figure 3.4.

l = (z(s) − vθ(s))^2 − π(s) · log(pθ(s)) + c · ||θ||^2    (3.1)

where:

z(s) = the reward from the game's terminal state (return),
vθ(s) = neural-network value inference,
π(s) = policy from the MCTS,
pθ(s) = neural-network policy inference,
c · ||θ||^2 = L2-norm weight regularisation.
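A sketch of Equation 3.1 computed over a batch is shown below, again assuming PyTorch tensors; the small epsilon inside the logarithm and the value of c are illustrative choices rather than the thesis' parameters.

# Sketch of the loss in Equation 3.1 for a batch of experiences (PyTorch assumed).
import torch

def loss_fn(z, v, pi, p, params, c=1e-4):
    """z, v: shape (batch,); pi, p: shape (batch, m); params: network weights."""
    value_term = ((z - v) ** 2).mean()                           # (z(s) - v_theta(s))^2
    policy_term = -(pi * torch.log(p + 1e-8)).sum(dim=1).mean()  # -pi(s) . log p_theta(s)
    l2_term = c * sum((w ** 2).sum() for w in params)            # c * ||theta||^2
    return value_term + policy_term + l2_term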

AlphaGo Zero Training Pipeline

For completeness we have also conducted an additional experiment using AlphaGo Zero's method of adding a third process to evaluate the best weights θ* [19]. The primary difference when employing this training pipeline is that self-play is conducted with the best weights θ* instead of the current weights, as shown in Figure 3.3(b).

3.3.3 Tree Search

The tree search used in our system is MCTS. MCTS builds an asymmetric tree with the states s relating directly to nodes, and actions a as edges. At the conclusion of the search the policy π(s) is generated by calculating the proportion of the number of visits to each edge. During the tree-search the following variables are stored:

• The number of times an action was taken N(s, a).

• The neural-network's estimation of the optimum policy pθ(s) and the network's estimate of the node's value vθ(s).

• The value of the node V (s).

• The average action value for a particular action from a node Q(s, a).

MCTS is conducted as follows:

• Selection. The tree is traversed from the root-node by calculating the upper confidence bound U(s, a) using Equation 3.2 and selecting the action a_i = argmax_a U(s, a) until a leaf-node is found [30]. Note the use of pθ(s) from the neural-network in Equation 3.2.

• Expansion. When a leaf-node is found vθ(s) and pθ(s) are obtained from the neural-network and a node is added to the tree.

• Evaluation. If the node is terminal the reward r is obtained from the environment for the current player and V(s) := r, otherwise V(s) := vθ(s). Note that there is no monte-carlo roll-out.

• Back-propagation. Q(s, a) is updated by the weighted average of the old value and the new value using Equation 3.3.

U(s, a) = Q(s, a) + 3 · (pθ(s) + δ) · √( Σ_{j=1}^{m} N(s, aj) ) / (1 + N(s, a))    (3.2)

Q(s, a) ← ( N(s, a) · Q(s, a) + V(s) ) / ( N(s, a) + 1 )    (3.3)

where

Q(s, a) = average action value,
pθ(s) = policy for s from the neural-network,
δ = the Dirichlet noise function,
p′(s) = pθ(s) with added Dirichlet noise,
Σ_{j=1}^{m} N(s, aj) = total visits to the parent node,
N(s, a) = number of visits to edge (s, a),
V(s) = the estimated value of the node.

At the conclusion of the search a probability distribution π(s) is calculated from the proportion of visits to each edge, π_{aj}(s) = N(s, aj) / Σ_{j=1}^{m} N(s, aj), for j = 1, .., m. The tree is retained for reuse in the player's next turn after it is trimmed to the relevant portion based on the opponent's action.
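The selection and back-propagation rules of Equations 3.2 and 3.3 can be sketched per node as follows; the array-per-edge layout and the masking of illegal actions are assumptions made for the example.

# Sketch of the per-node MCTS bookkeeping implied by Equations 3.2 and 3.3.
import numpy as np

def select_action(N, Q, p, delta, legal):
    """N, Q: arrays over the m edges of this node; p: p_theta(s); delta: Dirichlet
    noise; legal: boolean mask of legal actions. Returns argmax_a U(s, a)."""
    U = Q + 3.0 * (p + delta) * np.sqrt(N.sum()) / (1.0 + N)   # Equation 3.2
    U[~legal] = -np.inf                                        # never pick illegal edges
    return int(np.argmax(U))

def backup(N, Q, a, V):
    """Update the edge statistics for action a with the leaf value V (Equation 3.3)."""
    Q[a] = (N[a] * Q[a] + V) / (N[a] + 1.0)
    N[a] += 1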

3.3.4 Move Selection

To ensure move diversity during each self-play game, for a given number of ply as detailed in Appendix A, actions are chosen probabilistically based on π(s) after the search is conducted. After a given number of ply, actions are then chosen in a greedy manner by selecting the action with the largest probability from π(s). When conducting a tournament to test the effectiveness of any modifications, moves are always selected greedily.

3.3.5 Parameter Selection

The training pipeline of our machine-learning player differs from the state-of-the-art player in that we execute the training pipeline sequentially to reduce variability. One of the outcomes of this design decision is that the number of experiences to be replaced each training cycle is required to be set, instead of generating as many new games as possible. Likewise, a design decision is required as to how many epochs to train the agent for each training cycle. Regardless of the parameters selected, when comparing the relative performance of two players the same parameters are used for both players. We have tested suitable parameters and have decided to use 20 epochs, with 20% experience replacement, as shown in Figure 3.4.

3.3.6 AlphaGo Zero Inspired Player

The AlphaGo Zero inspired player is very similar to the player explained above (see Section 4.3.1), with the exception of an evaluation step in the training loop. The evaluation step plays a two player competition between the current best player with weights θ* and a challenger using the latest weights θi. The competition is stopped when the lower bound of the 95% Wilson2 confidence interval with continuity correction [133] is above 0.5, or the upper bound is below 0.5, allowing a competition winner to be declared. The competition is also stopped when the difference between the upper and lower confidence bounds is less than 0.1, in which case no replacement is conducted. If the challenger is declared the winner of the competition, then its weights become the best weights and are used for subsequent self-play until they are replaced after a future evaluation competition. Although this method for stopping does not provide exactly 95% confidence [134], it provides sufficient precision for determining which weights to use to create self-play training examples.
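The stopping rule can be sketched as below using the textbook Wilson score interval with continuity correction; the exact equations the thesis relies on are given in its Chapter 8, so this particular form and the threshold handling are assumptions. z = 1.96 approximates the 95% level.

# Sketch of the evaluation stopping rule (Wilson score interval, continuity corrected).
import math

def wilson_ccc(successes, n, z=1.96):
    if n == 0:
        return 0.0, 1.0
    p = successes / n
    denom = 2 * (n + z * z)
    adj = z * math.sqrt(z * z - 1 / n + 4 * n * p * (1 - p) + (4 * p - 2)) + 1
    lower = max(0.0, (2 * n * p + z * z - adj) / denom)
    adj = z * math.sqrt(z * z - 1 / n + 4 * n * p * (1 - p) - (4 * p - 2)) + 1
    upper = min(1.0, (2 * n * p + z * z + adj) / denom)
    return lower, upper

def evaluation_verdict(challenger_successes, n):
    lower, upper = wilson_ccc(challenger_successes, n)
    if lower > 0.5:
        return "replace best weights"      # challenger wins the competition
    if upper < 0.5:
        return "keep best weights"         # challenger loses the competition
    if upper - lower < 0.1:
        return "stop, no replacement"      # interval too narrow to keep playing
    return "keep playing"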

2 We provide further detail on the Wilson confidence intervals in Chapter 8; Equations 8.6 and 8.7 show the formula.

Figure 3.4: Determining suitable epoch and replacement-rate parameters. We compare the performance of an agent using 20 epochs and 20% replacement (dotted lines) against agents using 100 epochs with 20% replacement and 20 epochs with 100% replacement of experiences. Note that the red plots were performed on the same system simultaneously, and the green plots were performed together on the same system. We can see that 20 epochs and 20% replacement outperforms the other parameter values. To obtain this plot two experiments are run; both use 20 epochs and a 20% replacement rate as a fixed reference point alongside the alternative set of parameters. This is done to account for any variations arising due to processor load. The colours indicate which runs were conducted simultaneously.

3.4 Reference-Opponents

The reference-opponents provide fixed-performance opponents for comparing two AI players. When testing if one of our methods is effective, we play the two variants of the player - one with and the other without modifications - against the reference-opponent. Rather than playing the two players head-to-head, the external reference-opponent is used to verify that the agent's learning is universally applicable, and not just relative to another player using a similar methodology. We liken the use of the external player in these experiments to using a held-back validation dataset to measure the accuracy of a trained supervised-learning neural-network. Two external players are utilised: MCTS and Stockfish-multivariant.

3.4.1 MCTS Opponent

A standard MCTS player [71, 69] is used as a reference-opponent for the games of Reversi and Connect-4. The quality of the MCTS player can be easily adjusted by varying how many iterations of the MCTS process are conducted per move. The number of MCTS iterations is approximately equal to the number of nodes which are added to the search-tree - a single node is added for every iteration, except when a visited terminal node already exists in the tree. The edge from the root-node with the greatest number of visits at the end of the search is selected as the move to play. The number of MCTS iterations for the reference-opponent is chosen to ensure that the AI player initially loses 100% of the games. In addition to this, the number of MCTS iterations for the reference-opponent is higher than the number used in the AI player, creating a handicap in favour of the reference-opponent. For example, the MCTS reference-opponent for Reversi uses 1000 MCTS iterations while the AI player uses 200 iterations. The reason for this is to highlight the learning which is occurring in the AI player, demonstrating how the learned game knowledge eventually overcomes this handicap. This handicap allows us to highlight the importance of efficient neural-network learning in this system. MCTS is suitable as an opponent for games where terminal states are readily found by random play, which means that MCTS is not suitable for Chess variants. When using MCTS as an opponent for a Chess variant we observed that whilst the initialised player would lose all games, it would not take long for the AI player to win a significant number of games. This is the reason why a Chess-specific player is used.

3.4.2 Stockfish Opponent

First released in 2008, Stockfish is a well-known open-source computer Chess player which uses a combination of opening and closing book plays combined with a heuristic-based tree-search. A multi-variant version of Stockfish [135] was used as the reference-opponent for Racing-Kings. The Stockfish level was set such that initially the AI player would lose 100% of the games. Even though the AI player is initially untrained, it still conducts a tree search with 200 MCTS iterations, so it is initially better than a random player.

3.5 Games

We evaluate the effectiveness of our methods using a modified version of Racing-Kings [136] and Reversi [137]. The games selected have fundamental differences in how a player makes their move. Reversi is a game in which the board starts nearly empty and, as the game progresses, tiles are placed on the board, filling the board up until the game ends; tiles are not relocated once they are placed. Games like Tic-Tac-Toe, Connect Four, Hex and Go have the same movement mechanics as Reversi and, with the exception of the occasional piece removal in Go, these games also fill the board as the game progresses. Racing-Kings was chosen as the representative of the class of games that have piece mobility, i.e. where a piece is moved by picking it up from one cell and placing it at another. Like Racing-Kings, games with piece mobility often also have the characteristic of piece removal through capture. Games with similar movement mechanics to Racing-Kings include the many Chess variants, Shogi and Nim. The calculation of the winner for both Racing-Kings and Reversi is relatively straightforward: for Reversi it is simply the number of tiles, and for Racing-Kings it is when a King is at the final row of the board. The calculation of the winner of Connect-4, however, requires consideration of the spatial relationship between a number of pieces, making it more complex to calculate the winner. We include Connect-4 in our experiments where this additional end-game complexity is required to demonstrate a finding.

3.5.1 Connect-4

Connect-4 is a two-player board game with the aim being for a player to get four of their pieces in a row. The players alternate by dropping a coloured disc into a column of the vertical board, with the disc settling at the top of the stack of pieces in that column. If the board is full without 4 in a row then the game is a draw. The common name Connect-4 is a trademark of Milton Bradley, however the game itself has also been known as Captain's Mistress, Four Up, Plot Four, Find Four, Four in a Row, Four in a Line, Drop Four and Gravitrips3. For this research the standard board with 7 columns and 6 rows is used. Connect Four was first solved in 1988 using knowledge-based methods [138] and was later solved in 1995 using brute-force methods [139] by generating an 8-ply position database. The first player is guaranteed to win if they play perfectly by starting in the central column. Using mini-max with α-β pruning, move ordering and transposition tables allows a machine-learning player to play Connect-4 nearly perfectly. In this research we found that whilst MCTS performs well, it does not play perfectly, since some board positions have only a single optimal action but the winning terminal state is too deep for MCTS to search to. Since our AI player contains an MCTS component, we use MCTS as the reference-opponent to demonstrate how the addition of a neural-network to the search can overcome the inherent weakness of using monte-carlo search. Figure 3.5 shows the board of Connect-4, and also a position in which it is very difficult for a monte-carlo search to find the optimal action. Although the game of Connect-4 is solved, it is far from trivial without using prior knowledge, requiring a substantial search-depth to play perfectly. Each column is a possible action until it is full, making a drawn game

3 https://en.wikipedia.org/wiki/Connect_Four

7 × 6 = 42 ply long, with a total possible number of approximately 4 × 10^12 unique board positions.

Figure 3.5: Connect-4 after 2 ply of perfect play. Using the analysis from https://connect4.gamesolver.org we can see that red has only 1 winning action, and if that action is taken red can win after 40 ply (20 moves). This equates to a large number of sub-optimal combinations for a monte-carlo search, giving a low probability of discovering this single optimal move randomly.

3.5.2 Racing-Kings

Racing-Kings is a game played on a Chess board with Chess pieces. Pieces are placed on a single row at one end of the board, with the aim being to be the first player to move their King to the other end of the board. Checkmate is not permitted, and neither is castling, however pieces can be captured and removed from the board. We modify the full Racing-Kings game by using fewer pieces and adjusting the starting position of the pieces by placing them in the middle of the board, as shown in Figure 3.6, instead of on the first rank. As discussed previously, we reduce the game complexity to ensure our experiment runs within a reasonable time, given our design constraints to reduce variability by implementing a sequential training pipeline as well as training two agents on shared computing resources. As the Racing-Kings game library is part of a suite of Chess variants [140], we maintain the environment as it would be for Chess. A state s is represented as a tensor of size 8 × 8 × 12; the width and height dimensions represent cells on the board and the 12 planes represent each of the players' pieces: King, Queen, Rook, Bishop, Knight and Pawn. As pieces are moved from one cell to another, the total number of actions includes all possible pick-and-place options, 64 × 64 = 4096, excluding any additional actions such as promotion of pawns. There are 44 possible from/to pawn-promotion movements in Chess, but this also needs to be multiplied by the number of pieces to which a pawn can be promoted. We only consider promotion to a Knight or a Queen, giving 88 possible promotion actions. The total number of actions for the Racing-Kings game environment is m = 4184.

Figure 3.6: Starting position for the game of Racing-Kings. The game is won when a player's King reaches the top row of the board. Kings are not permitted to be placed in check in Racing-Kings; with this exception all other Chess rules are the same. The standard Racing-Kings board starts with these pieces on the first row; additionally the remaining major Chess pieces are also utilised and are placed next to their partner piece - the Queen is placed above the King. This modified version allows two games to be run simultaneously on a single system, reducing experiment variability.

3.5.3 Reversi

Reversi "is a strategic boardgame which involves play by two parties on an eight-by-eight square grid with pieces that have two distinct sides ... a light and a dark face. There are 64 identical pieces called 'disks'. The basic rule of Reversi, if there are player's discs between opponent's discs, then the discs that belong to the player become the opponent's discs" [141]. The starting board position for the game of Reversi is shown in Figure 3.7. The winner is the player with the most tiles when both players have no further moves. For the Reversi game environment a state s is represented as a tensor of size 8 × 8 × 2; the width and height dimensions represent cells on the board and each plane represents the location of each player's pieces, with 64 possible different actions, m = 64. Our Reversi environment was inspired by the open-source project at github.com/mokemokechicken [142].

3.5.4 Searching State-Space of Games

For games that fill the board as the game progresses, such as Reversi, Go and Tic-Tac-Toe, terminal states are easily found by simply randomly choosing moves until the board is full. Games like Chess and Racing-Kings, however, have potentially infinite ply, which makes finding a terminal state via random play much more challenging. For example, the game of Reversi, as outlined in Section 3.5.3, is played on a 64-cell board and starts with 4 pieces in the centre. Reversi has a maximum of 60 ply, regardless of whether the moves are selected randomly or not. The game of Racing-Kings is also played on a 64-cell board and, when played randomly and allowing position repetitions, it can take in excess of 360 ply to end the game, with most of the games ending in a draw either due to insufficient material or the 75-move-rule (no pieces captured and no pawns moved in the last 75 moves). This highlights why search alone cannot readily be used for playing this game. The agent we employ utilises tree-search, and in some cases Racing-Kings' branching factor toward the end of the game becomes small enough for the search to find a check-mate. Using our neural-network/tree-search agent we found that for the initial training cycle a self-play Racing-Kings game takes approximately 120 ply, with the majority of games still ending in a draw - some games exceeded 300 ply. To avoid adding an overwhelming number of drawn examples to the experience-buffer, we have a 70% chance of dropping experiences from drawn games.

Figure 3.7: Starting board position for Reversi.
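The drop rule for drawn games can be sketched as follows; the buffer handling and the function name are assumptions made for illustration.

# Sketch of the 70% drop rule for drawn Racing-Kings games described above.
import random

def maybe_keep_drawn_game(experiences, winner, drop_prob=0.7):
    """Return the game's experiences, or an empty list for most drawn games."""
    if winner == 3 and random.random() < drop_prob:
        return []                # discard this drawn game's experiences
    return experiences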

3.6 Summary

Our experimental setup is designed specifically to minimise variance between training runs and to highlight the relative performance differences between different methodologies. The consequence of this design decision is that the absolute performance of the agent is inhibited, requiring, in some cases, minor simplifications of the games. Importantly, we handicap the machine-learning player by performing fewer MCTS iterations than the MCTS reference-opponent, emphasising the value of training the neural-network efficiently. This framework is used throughout the remainder of this thesis for all experiments.

Chapter 4

End-Game-First Curriculum

Highlights

• Early-game experiences generated by self-play in the early stages of training are poor experiences.

• Training time can be reduced by not generating expectedly poor experiences.

• End-game-first curriculum reduces training time by learning end-games first.

4.1 Introduction

Humans tend to learn complex abstract concepts faster if examples are presented in a structured manner. For instance, when learning how to play a board game, one of the first concepts a human needs to learn is how the game ends, i.e. which actions for given board positions will result in a terminal state (win, lose or draw). The advantage of learning the end-game first is that once the actions that lead to a terminal state are understood, it then becomes possible to incrementally learn the consequences of actions that are further away from a terminal state, called backwards induction. We call a structured training approach which enforces backwards induction an end-game-first curriculum. This approach is not currently used in machine-learning game players. Currently the state-of-the-art machine-learning player for general board games, AlphaZero by Google DeepMind, does not employ a structured training curriculum, instead learning from the entire game at all times. By employing an end-game-first curriculum to train an AlphaZero inspired player, we empirically show that the rate of learning of an artificial player can be improved when compared to a player not using a training curriculum. Whilst DeepMind's approach is effective, their method for generating experiences by self-play is resource intensive, costing millions of dollars in computational resources. We have developed a new method called the end-game-first training curriculum which, when applied to the self-play/experience-generation loop, reduces the computational resources required to achieve the same level of learning. Our approach improves performance by not generating experiences which are expected to be of low training value. The end-game-first curriculum enables significant savings in processing resources and is potentially applicable to other problems that can be framed in terms of a game.

AlphaZero is a combined neural-network/tree-search game player which is trained by self-play reinforcement-learning. Neural-network/tree-search agents do not use the neural-network to directly make move decisions; instead the neural-network has a dual role: to identify the most promising actions for the search to explore and to estimate the value of non-terminal positions during the search. One of the benefits of using self-play reinforcement-learning to train a neural-network/tree-search agent is that as the neural-network improves, so do the decisions arising from the tree-search. A known characteristic of reinforcement-learning is that the network's predictions are initially inaccurate since it has had little training. AlphaZero's method is effective, making it state-of-the-art, however many of the early generated training examples are of inherently poor quality as they are generated using an inadequately trained1 neural-network. By applying an end-game-first curriculum we seek to eliminate the use of these poor quality training experiences, resulting in an improved rate of learning when compared with an agent without a curriculum.

4.1.1 Weakness of the Current Approach

AlphaZero generates its own training examples as part of its learning loop through self-play. The self-play generated examples continually improve as the network learns via reinforcement-learning [106, 110, 143]. Back-propagation with gradient-descent is commonly used to train the neural-network's weights with reinforcement-learning by minimising the difference between the network's prediction for a state at one time and a future prediction for the same state at some later time, called temporal difference (TD) learning [110], as explained in Section 2.4.3. The future prediction can be approximated by conducting a tree-search to look ahead. This variation of TD learning is called TD-Leaf [111] and underpins AlphaZero's combined neural-network design; that is, AlphaZero uses the difference between the neural-network's prediction and the outcome of a tree-search to train its network. The neural-network in a neural-network/tree-search agent does not directly make move predictions; instead its net effect is to identify an initial set of most promising paths for the tree-search to explore. As the search is conducted, if the most promising paths are poorly selected by the neural-network they can be overridden by a sufficiently deep search, resulting in a good move decision despite the poor initial most-promising set. A weakness with this approach arises when:

• the network is inadequately trained, typically during the early stages of training; and

• the tree search is less likely to discover terminal states, typically in the early stages of a game.

1 A network can be inadequately trained either due to poor training or insufficient training.

If the network is inadequately trained, then the set of moves that the network selects, which are intended to be promising moves, will likely be random or at best poor. If the tree-search does not find sufficient terminal states, then the expansion of the tree is primarily determined by the neural-network, meaning that there may not be enough actual game-environment rewards for the tree-search to correct any poor network predictions. If both of these situations occur together then we say that the resulting decision is uninformed, which results in an uninformed training example that has little or no information about the game that is being learned. In this chapter we demonstrate how employing an end-game-first training curriculum to exclude expectedly uninformed training experiences results in an improved rate of learning for a combined neural-network game playing agent.

4.2 Related Work

4.2.1 Using a Curriculum for Machine Learning

Two approaches are predominant in the literature relating to curriculum-learning for machine-learning: reward shaping and incrementally increasing problem complexity [144]. A third curriculum-learning approach is also covered, where the agent is placed at the terminal-state.

Reward Shaping

Reward shaping is where the reward is adjusted to encourage behaviour in the direction of the goal-state, effectively providing subgoals which lead to the final goal. For example, an agent with an objective of moving to a particular location may be rewarded for the simpler task of progressing to a point in the direction of the final goal, but closer to the point of origin. When this subgoal can be achieved, the reward is adjusted to encourage progression closer to the target position [145]. Reward shaping has been used successfully to train an agent to control the flight of an autonomous helicopter conducting highly non-linear manoeuvres [146, 147, 148]. Sun et al. [149] use a variation of reward shaping, diminishing consecutive rewards to mitigate any one agent greedily consuming all resources in a multi-agent environment. The method we propose in this chapter differs from these approaches in that it focuses on controlling training examples instead of the reward, as in other reward-shaping literature [150].

Incrementally Increasing Problem Complexity

Another approach to curriculum-learning for a neural-network is to incrementally increase the problem complexity as the neural-network learns [151]. This method has been used to train a network to classify geometric shapes using a two-step training process. In this approach, a simpler training set was used to initially train the network, before further training the network with the complete dataset which contained additional classes. This approach resulted in an improved rate of learning [152] for a simple neural-network classifier. Both reward shaping and incrementally increasing problem complexity rely on prior knowledge of the problem and some level of human customisation [153, 154].

Reverse Curriculum

Florensa et al. [155] demonstrated how an agent can learn by reversing the problem space, by positioning an agent at the goal-state and then exploring progressively further away from the goal. Their method, reverse curriculum, sets the agent's starting position to the goal-state, and noisy actions are taken to obtain other valid states in close proximity to the goal. As the agent learns, the actions result in the agent moving further and further from the goal. Their method was demonstrated to improve the time taken to train the agent in a number of single-goal reinforcement-learning problems, including a robot navigating a maze to a particular location and a robot arm conducting a single task such as placing a ring on a peg. The set of problems to which the reverse curriculum is suited is constrained due to the requirement of a known goal-state, and the requirement that the goal-states themselves be non-terminal; i.e. on reaching the goal-state, legal actions still exist which move the agent away from the goal. Having a known goal-state permits the reverse curriculum to be particularly useful if the problem's goal-state is unlikely to be discovered through random actions. The weakness of the reverse curriculum is that if the problem has a number of distinct goal-states, then focusing on training near a single known goal would result in the agent being overfitted to the selected known goal at the exclusion of all others. The reverse curriculum is not suitable for games, as games have multiple terminal states which are not known in advance, and once a terminal state is reached there are no further legal actions that can be taken.

End-Game-First Curriculum

We define an end-game-first curriculum as one where the initial focus is to learn the consequences of actions near a terminal state/goal-state, and then to progressively learn from experiences that are further and further from the terminal state. We consider the reverse curriculum to be a special case of the end-game-first curriculum due to the additional requirements outlined above. It is not a requirement of the end-game-first curriculum that the agent commence near a terminal/goal-state, or that one is even known; but in the course of exploration, when a terminal/goal-state is discovered the distance to any visited non-goal-states can be calculated and a decision can be made as to whether those states will be used for training the agent, depending on their distance. The end-game-first curriculum differs from incrementally increasing the problem complexity in that the consequences of actions leading to a terminal state may in fact be more complex to learn than the earlier transitions. The end-game-first curriculum does, however, temporarily reduce the size of the problem space by initially training the network on a smaller subset of the overall problem space. The advantage of an end-game-first curriculum is that it doesn't rely on any prior knowledge of the problem - including requiring a known terminal state. By first focusing near a terminal state, the agent is trained to recognise the features and actions which give rise to environmental rewards (terminal states), then progressively learns how to behave further and further from these states. We demonstrate in this chapter that using an end-game-first curriculum for training a combined neural-network game playing agent can improve the rate at which the agent learns.

4.3 Method

An end-game-first training curriculum requires that end-game positions are presented to the agent first. An end-game-first approach can be achieved by discarding a fraction of early-game experiences in each self-play game, with the fraction dependent on the number of training epochs which have occurred - progressively discarding fewer and fewer early experiences as training continues. By selectively discarding the early-game experiences, we propose that the result is a net improvement in the quality of training experiences, which leads to the observed improvement in the rate of learning of the player. Since creating an experience requires a full tree search to be conducted, discarding an experience which has already been created is inefficient. Instead, we seek to simply avoid generating these experiences in the first place. We demonstrate the effectiveness of the end-game-first training curriculum on two games: modified Racing-Kings and Reversi. During training, we compare the success-rate of two AlphaZero inspired game playing agents, a baseline-player without any modification and a player using an end-game-first curriculum (curriculum-player), against a fixed reference-opponent. We find that by using only the late-game data during the early training epochs, and then expanding to include earlier game experiences as the training progresses, the curriculum-player learns faster compared to the baseline-player. Whilst we empirically demonstrate that this method improves the player's success-rate over the early stages of training, the curricula used in this experiment were chosen semi-arbitrarily, and as such we do not claim that the implemented curricula are optimal. We do, however, show that an end-game-first curriculum-learning approach improves the training of a combined neural-network game playing reinforcement-learning agent like AlphaZero.

4.3.1 Curriculum-Player

We modify the baseline-player defined in Chapter 3 to create a curriculum-player which demonstrates the effectiveness of the end-game-first curriculum. The difference between the baseline-player and the curriculum-player is that some experiences are not generated during self-play for the curriculum-player. A curriculum function ζ(e) is introduced, where e is the number of training epochs conducted. ζ(e) is a proportion indicating which portion of a game's experiences will be retained for training, starting from the end of the game; i.e. an experience is retained only if its distance Xd to the game's terminal state is less than n · ζ(e), where n is the number of ply in the game. Equivalently, the first n(1 − ζ(e)) ply of a game are excluded, which has the effect of retaining only the last ζ(e) × 100% portion of any game in the experience-buffer. For example, the baseline-player has an equivalent function ζ(e) = 1 for all e; that is, retaining 100% of a game's experiences. The curriculum used for Racing-Kings is shown in Equation 4.1, while the curriculum used for Reversi is shown in Equation 4.2. We also investigate the performance of a linearly increasing curriculum, as shown in Equation 4.3. Generating experiences and then trimming them is inefficient, so instead we aim not to generate them in the first place. We exploit the fact that the curriculum-player excludes early-game experiences during training by not conducting a tree-search for actions which will result in an experience that is going to be excluded. To achieve this we calculate the approximate number of ply required to reach the n(1 − ζ(e)) barrier and then randomly play that many actions. We calculate the number of random actions by maintaining the average ply, n̄, for all games played. The first n̄(1 − ζ(e)) of a game's actions are randomly played. After these random actions, we then use the full MCTS process to choose the remaining actions to complete the game. If a terminal state is found during random play then the game is rolled back to n̄(1 − ζ(e)) ply from this terminal state and MCTS is used for the remaining actions. A sketch of the curriculum functions and the random-prefix rule is given after Equation 4.3 below.

Racing-Kings, ζ(t) = 0.1 for t = 0; 0.33 for 0 < t < 100; 0.5 for 100 ≤ t < 500; 0.66 for 500 ≤ t < 800; 0.8 for 800 ≤ t < 1000; 1.0 for 1000 ≤ t.    (4.1)

Reversi, ζ(t) = 0.25 for t = 0; 0.5 for 0 < t < 100; 0.75 for 100 ≤ t < 500; 1.0 for 500 ≤ t.    (4.2)

Incremental CL, ζ(t) = A · t + B, where A = 0.001, B = 0.1 for Reversi, and A = 0.01, B = 0.1 for Racing-Kings.    (4.3)
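The following sketch implements the step curricula of Equations 4.1 and 4.2, the linear curriculum of Equation 4.3 and the random-prefix rule of Section 4.3.1; the helper names are assumptions, and the clamp of the linear curriculum to 1.0 reflects that ζ is a proportion.

# Sketch of the curriculum functions and the random-prefix length.
def zeta_racing_kings(t):
    steps = [(0, 0.1), (1, 0.33), (100, 0.5), (500, 0.66), (800, 0.8), (1000, 1.0)]
    return max(v for thresh, v in steps if t >= thresh)     # Equation 4.1

def zeta_reversi(t):
    steps = [(0, 0.25), (1, 0.5), (100, 0.75), (500, 1.0)]
    return max(v for thresh, v in steps if t >= thresh)     # Equation 4.2

def zeta_linear(t, A, B):
    return min(1.0, A * t + B)                              # Equation 4.3, clamped to 1

def random_prefix_length(n_bar, t, zeta):
    """Number of opening ply to play randomly (no search, no experience stored)."""
    return int(n_bar * (1.0 - zeta(t)))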

4.4 Experiment

4.4.1 Experimental Conduct

We use the experimental setup explained in Chapter 3 for this experiment. Two AI players are trained: one using the proposed curriculum-learning approach (curriculum-player), the other without (baseline-player), and weights θi are saved after each training cycle i. During training, weights θi are selected randomly for a competition between the AI players and a reference-opponent, and the players' respective success rates yθi are recorded. Success is a win or a draw, since in some games perfect play may result in a draw.

A competition consisting of a minimum of 30 games for randomly selected θi is conducted against the opponent to obtain the success ratio yθi. The experiment is conducted in full three times and the results are combined. The wins, draws and losses are accumulated periodically to create a data-point with a 95% confidence interval obtained using the Wilson-Score with continuity correction [156]. The Wilson-Score with continuity correction can be used to determine the confidence-bounds for a Bernoulli trial - a trial which has two outcomes, success or failure. A training step is completed after the presentation of one batch of experiences, and a training epoch is completed after all experiences have been presented once. As training begins after a set number of games, the number of steps for each epoch varies from experiment to experiment; likewise, the time taken to play a game is also completely unique, making the time, epochs and steps independent from experiment to experiment. As such, when combining data from multiple experiments the three plots may have different appearances. Note that the number of steps in an experiment is independent of the time taken for an action/experience to be created, and that epochs are independent of how many actions/experiences exist in the experience-buffer.

Whilst yθi vs time is the measure which we are primarily interested in, measuring against steps and epochs is also informative. We mitigate the potential differences which might arise from differing system loads by training both AI players simultaneously on one dual-GPU system. Each player is allocated a single GPU and shares in the combined CPU resources of the system.

4.5 Results

Figure 4.1 shows the performance of the curriculum-player and the baseline-player in a Racing-Kings competition against the Stockfish opponent. Figure 4.2 shows the performance of the curriculum-player and the baseline-player playing Reversi against the MCTS opponent. Figures 4.3 and 4.4 show the performance of the curriculum-player and the baseline-player playing Racing-Kings and Reversi respectively with the incremental curriculum. Figure 4.5 shows the performance whilst playing Racing-Kings against the Stockfish opponent with both the curriculum-player and the baseline-player using the AlphaGo Zero training pipeline with the added evaluation step. Note that the plots showing success against time are the metric in which we are most interested. The additional plots allow us to investigate some of the effects of using a curriculum. The success-rate of the player using the end-game-first training curriculum exceeds that of the baseline-player during the early stages of training in all cases when measured against time. This was also observed across multiple training runs with differing network parameters and differing curricula.

4.6 Discussion

Our results indicate that a player trained using an end-game-first curriculum learns faster than a player with no curriculum during the early training periods. The improvement over time can, in part, be attributed to the increased speed of self-play when moves are chosen randomly by the curriculum-player instead of conducting a full search; however the improvement in success-rate is also observed when compared against training steps, indicating that a more subtle benefit is obtained. The success-rate of the two players when compared against training epoch shows similar performance for the two players. In this section we compare and contrast the performance of the curriculum-player and the baseline-player against the reference-opponents, and separately discuss the results with respect to time (Subsection 4.6.1), training steps (Subsection 4.6.2) and epochs (Subsection 4.6.3). We provide a brief discussion of the results obtained using the linear curriculum (Subsection 4.6.4) and finish with an outline of the limitations of these experiments (Subsection 4.6.5).

4.6.1 Success Rate vs Time Comparison

Plot (a) of Figures 4.1, 4.2 and 4.5 shows the success-rate of the AI players vs time. The curriculum-player does not retain experiences generated early in a game during the early training periods. This is achieved by selecting these moves randomly instead of using the naive approach of performing a full search to create experiences which will then be discarded, as explained in Section 4.3.1. The time saved as a result of randomly selecting actions instead of conducting a tree-search can be significant. However, when making a random action an experience is not created, which means that a randomly selected move contributes in no way to the training of the player. The number of experiences created using a curriculum is therefore less than when not using the curriculum. The curriculum needs to balance the time saved by playing random moves against the reduction in generated training experiences. Whilst the curriculum-player leads in success rate compared to the baseline-player during the early time periods, the success-rates of both players converge in all experiments. For a given neural-network configuration there is some expected maximum performance threshold against a fixed opponent; in the ideal case this would be a 100% success-rate but if the network is inadequate it may be less. Although we expect that the success-rates of the two players would converge eventually, it appears that convergence occurs prior to the maximum performance threshold, as shown in Figure 4.1(a) near 4000 minutes and Figure 4.5(a) near 5000 minutes; while Figure 4.2(a) shows convergence near 6000 minutes at what appears to be the network's maximum performance threshold. For an optimal curriculum, the convergence of the two players would be expected to occur only at the maximum performance threshold.


Figure 4.1: Success rate for the AlphaZero inspired player vs Stockfish level 2 playing Racing-Kings, with and without the end-game-first curriculum from Equation 4.1. Whilst the time-based success-rate is the measure we are concerned with, success-rate vs steps and epochs is also informative. Note that an improved learning rate is evident in both the time and the steps plots. Improvement in learning when measured against steps indicates that the performance improvement is not just from the time saved by conducting random moves during self-play. Note that some temporary unlearning was observed for some individual training runs for the curriculum-learning player. This plot is the combination of results from 3 separate experiments; 95% confidence-bounds are obtained using the Wilson-Score with continuity correction. Since the agent's weights were selected at random, the competitions which produced these results yield fewer data-points at the end of the training run, resulting in wider confidence intervals. The confidence interval widths depend on both the variance of the values and the number of games played, as discussed in Chapter 8 - fewer games were played at the end of training, resulting in wider confidence intervals.


Figure 4.2: Success rate for the AlphaZero inspired player vs the MCTS opponent playing Reversi, with and without the training curriculum from Equation 4.2. Consistent with the results in Figure 4.1, improvement is observed over both time and steps, and the success-rate of both players is also similar over epochs despite the player with curriculum-learning having fewer training examples per epoch due to the curriculum. This plot is the combination of results from 3 separate experiments; 95% confidence-bounds are obtained using the Wilson-Score with continuity correction. Note that since the agent's weights were selected at random during training for the competition that produced these results, there are fewer data-points at the end of the training run, resulting in wider confidence intervals.


Figure 4.3: Success rate for the AlphaZero inspired player vs the Stockfish opponent playing Racing-Kings, with and without the incremental training curriculum from Equation 4.3. This plot is the combination of results from 3 separate experiments; confidence-bounds are obtained using the Wilson-Score with continuity correction. Note that using the same curriculum as used for Reversi resulted in the curriculum-player learning slower than the baseline-player due to the curriculum changing too slowly for the game. As we increased the rate of change of the linear curriculum the agent performed better. Due to the nature of the Racing-Kings game we found that using a step curriculum was more efficient in finding a curriculum which performed better than the baseline-player.


Figure 4.4: Success rate for the AlphaZero inspired player vs the MCTS opponent playing Reversi, with and without the incremental training curriculum from Equation 4.3. This plot is the combination of results from 3 separate experiments; confidence-bounds are obtained using the Wilson-Score with continuity correction. Note that the GPUs used to obtain data for this plot were of a lower standard than those used for Figure 4.2. We used 2 x M40 GPUs for this data, whilst 2 x P100s were used for Figure 4.2, leading to a lower final success-rate for the baseline-player.


Figure 4.5: Success rate for AlphaGoZero inspired player vs Stockfish level 2 playing Racing-Kings with and without the training curriculum from Equation 4.1. This plot is the combination of results from 3 separate experiments; confidence-bounds are obtained using the Wilson-Score with continuity correction.

4.6.2 Success Rate vs Steps Comparison

Plot (b) in Figures 4.1, 4.2 and 4.5 shows the success-rate of the AI players vs the number of training steps. Although the success-rate improvement over time for the curriculum-player can be attributed in part to the use of random move selection, a success-rate improvement is also observed when measured against training steps. A training step occurs when one batch of experiences is presented to the optimise module, making a training step time independent; i.e. how long it takes to create an experience is not a factor for training steps. Given that the curriculum-player outperforms the baseline-player when measured over training steps, we argue that there is a gain which relates directly to the net quality of the training experiences. Consider, for example, the baseline-player's very first move decision in the very first self-play game for a newly initialised network. With a small number of MCTS simulations relative to the branching factor of the game, the search for this first decision will not build a tree of sufficient depth to reach a terminal node, meaning that the decision will be derived solely from the untrained network weights - recall that the neural-network is used for non-terminal value estimates. The resulting policy π which is stored in the experience-buffer has no relevance to the game environment as it is solely reflective of the untrained network. We posit that excluding these uninformed experiences results in a net improvement in the quality of examples in the training buffer. Later in that first game, terminal states will eventually be explored, over-ruling the inaccurate network estimates, and the resulting policy will become reflective of the actual game environment - these are experiences which should be retained. As the training progresses the network is able to make more accurate predictions further and further from a terminal state, creating a visibility-horizon which becomes larger after each training cycle. The optimum curriculum would match the change of the visibility-horizon.

4.6.3 Success Rate vs Epoch Comparison

Plot (c) in Figures 4.1, 4.2 and 4.5 shows the success-rate of the players vs the number of training epochs. Recall that an epoch is when all experiences in the experience-buffer have been presented to the optimise module. Since training is only conducted when the experience-buffer has sufficient games, an epoch is directly proportional to the number of self-play games played, but is independent of the number of experiences in the buffer and the time it takes to play a game. When applying the curriculum, fewer experiences per game are stored during the early training periods compared to the baseline-player, meaning that the curriculum-player is trained with fewer training experiences during early epochs. The plots of success-rate vs epoch show the curriculum-player outperforming the baseline-player for the early epochs in Figure 4.1(c), with the two success-rates converging rapidly; while Figures 4.2(c) and 4.5(c) show the two players performing similarly. The similarity of the results when measured against epochs shows that, despite the curriculum-player excluding experiences, no useful information is lost by doing so. That useful information is not excluded during these early epochs supports our view that the net quality of the data in the experience-buffer is improved by applying the specified end-game-first curriculum. It is expected that due to a combination of the game mechanics and the order in which a network learns there may be some learning resistance² which could result in plateaus in the success-rate plot, allowing the trailing player to catch up temporarily. We expect a sub-optimal curriculum to result in additional learning resistance or, in the extreme case, deteriorated performance, which would predominantly be observed immediately following a change in the curriculum value. For Reversi the final curriculum increment from Equation 4.2 occurs after 500 epochs, and Figure 4.2(c) shows a training plateau shortly after this change, albeit at the network's maximum learning limit. For the Racing-Kings experiment, Figure 4.1, a performance reduction was observed in each training run for the curriculum-player shortly after the curriculum had changed to 80% as shown in Equation 4.1. This learning loss is averaged out in the plot; however, of the three experiments that comprise the data for this plot two of them have a learning loss around 800 epochs and the other at 1000 epochs - the final step in the curriculum. The observed learning loss differs with each individual training run and highlights the importance of the order in which learning occurs and its impact on the effectiveness of the curriculum.
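As a concrete reading of these two definitions, the sketch below tallies steps and epochs inside a hypothetical optimise module; buffer.batches() and network.train_on_batch() are assumed helpers rather than the thesis code.

# Illustrative bookkeeping only, under the assumptions stated above.
def optimise(network, buffer, batch_size, steps=0, epochs=0):
    for batch in buffer.batches(batch_size):   # one pass over every buffered experience
        network.train_on_batch(batch)
        steps += 1                             # a step is one batch, independent of wall-clock time
    epochs += 1                                # an epoch is one full pass over the buffer
    # A curriculum-player holding fewer experiences per game therefore
    # performs fewer steps per epoch during the early training periods.
    return steps, epochs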

4.6.4 Using a Linear Curriculum Function

Figures 4.3 and 4.4 show the success-rate of the AI players using the linear curriculum from Equation 4.3. For the game of Reversi the linear curriculum performed better than the baseline-player, consistent with the previous observations. However, for the game of Racing-Kings the linear curriculum performed to approximately the same standard as the baseline-player. Due to the nature of Racing-Kings, non-stalemate terminal states occur infrequently under random play, so the initial self-play process is time consuming when compared with the time a training cycle takes once the agent is trained. After the initial training cycles, learning about terminal-states occurs relatively quickly, since a winning terminal state is simply the King being positioned on the last row of the board. Before conducting this Racing-Kings experiment, we first conducted the experiment using the same linear curriculum as used for Reversi, from Equation 4.3. When using the Reversi curriculum for Racing-Kings we found that the curriculum-player performed worse than the baseline-player because the variation was too slow - this result was expected, since different games require different curriculum changes. Consider the extreme case where only the last 10% of the game's board positions are used for training, i.e. a very slow curriculum change. With a very slow rate of change, the agent would only learn a small portion of the game and the baseline-player would be expected to perform better over time. To address this, we customised the linear curriculum to increase the rate of change compared with the Reversi curriculum, accounting for the nature of the Racing-Kings game. The purpose of our research is not to find an optimum curriculum for any specific game, but to demonstrate the effective use of the end-game-first curriculum. Whilst a better linear training curriculum might exist for Racing-Kings, our findings from this experiment indicate that it is easier to find a curriculum which outperforms the baseline-player by using a hand-crafted step curriculum rather than a linear function.

² To our knowledge, the term learning resistance is not defined in relation to machine-learning. We define it to mean a short-term resistance to network improvement.
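For readers who prefer code, the two schedule shapes discussed in this section can be sketched as follows. The breakpoints, rates and return values below are invented for the illustration; the actual schedules are those defined in Equations 4.1-4.3.

# Illustrative curriculum schedules only; thresholds and rates are made up,
# the thesis schedules are Equations 4.1-4.3.
def step_curriculum(epoch):
    """Hand-crafted step schedule: fraction of the game, measured from the end,
    whose positions are allowed to produce training experiences."""
    if epoch < 200:
        return 0.2      # only the last 20% of the game
    if epoch < 600:
        return 0.5
    if epoch < 1000:
        return 0.8
    return 1.0          # eventually the whole game is used

def linear_curriculum(epoch, start=0.1, rate=0.001):
    """Linear schedule: the usable end-game fraction grows at a fixed rate per
    epoch; Racing-Kings needed a faster rate than Reversi."""
    return min(1.0, start + rate * epoch)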

4.6.5 Limitations

The nature of self-play reinforcement-learning means that as the agent learns, the quality of experiences also improves - in turn improving the training of the agent. This means that the agent could potentially continue to improve indefinitely, subject to the capacity of the architecture. The experiments performed for this research were continued for sufficient time to observe the effects of applying the prescribed curricula when compared with the baseline-player; however, no guarantee exists as to how the comparative performance would progress should training continue indefinitely. This limitation is not specific to our experiment or our proposed curriculum-learning method, but is instead a characteristic of self-play reinforcement-learning in general. When training an agent via self-play reinforcement-learning, the decision to halt training can never guarantee that the chosen moment was the optimum time to stop. We halt the experiment when the network appears to have stopped learning, when the rate of improvement for both players has plateaued, or at approximately 10,000 minutes. To reduce variance, as discussed in Chapter 3, the size of the network is smaller than it would be if we were training a single agent solely for maximum performance; as such the maximum level of playing performance is expected to be reached earlier and to occur before the success-rate is 100%. As shown in Table 1.1, training via self-play reinforcement-learning using the state-of-the-art architecture to achieve superhuman performance can be costly and/or time consuming. The results presented by Deepmind introducing their state-of-the-art agents were obtained from singularly trained agents (one single experiment). Time and resource constraints limit the ability to perform multiple training runs for every version of an agent, although one of Deepmind's key researchers noted during questions at the Neural Information Processing Systems conference that numerous non-reported experiments were conducted, indicating repeatable performance [157]. Likewise, the results reported in this chapter were preceded by several smaller scale experiments confirming the generality of the new methods outlined here, prior to conducting the final larger-scale experiment. These smaller scale experiments, whilst useful for research progress, were often less interesting because they were not sufficiently complex to observe a significant improvement. We observed during this preliminary work that as game complexity increases the benefits also increase, driving our choice of games for this experiment. Compared with Deepmind's experiment, our reduced game complexity has enabled us to perform three separate experiments for each game. The purpose of running each experiment three times is to demonstrate that the benefits of applying an end-game-first curriculum are observed on average, rather than from any single training run.

4.6.6 Curriculum Considerations

Although it is expected that the baseline-player's performance would converge with the curriculum-player's performance, ideally this would occur near the maximum success-rate, and we observe this effect in the Reversi tournament shown in Figure 4.2. In the Racing-Kings tournament the two players' success-rates converge well before the maximum success-rate, indicating that the curriculum is perhaps sub-optimal. Given that the implemented curriculum is semi-arbitrary, this is expected. While curriculum-learning is shown to be beneficial during the early stages of training, the gain could still be lost if the curriculum changes too slowly or too abruptly; in such cases it is likely that overfitting occurs, resulting in a loss of the early gains. Methods of regularisation as discussed in Section 2.4.4 could be used to address this overfitting if it is problematic, and we explore this in Chapter 7. When designing a curriculum, consideration needs to be given to the speed at which the curriculum changes. At one extreme is the current practice where the full game is attempted to be learnt at all times, i.e. the curriculum is too fast because it immediately attempts to learn from 100% of the game at epoch 0. At the other extreme the curriculum is too slow, which can result in the network overfitting to a small portion of the game space or discarding examples which contain useful information. We argue that each training run for each game could have its own optimum curriculum profile, due to the different training examples which are generated.

4.7 Summary

We have shown that the rate at which a self-play reinforcement-learning AI agent learns when using a combined neural-network/tree-search architecture can be improved by employing an end-game-first training curriculum. Although the hand-crafted curricula used in this study are not necessarily optimal, it is possible that no fixed curriculum exists which would be optimal at all times, as the order and the composition of the generated experiences are themselves a factor in the neural-network's learning. Likewise, the optimum curriculum is likely to be non-linear, with different stages of the game being harder to learn than others. In learning how to play a game the agent is learning to control a previously unseen environment, and by employing an end-game-first curriculum it is able to speed up that learning.

Further work

Whilst using hand-crafted curricula is an improvement over using no curriculum, further work is required to identify an approach which makes this method suitable for general problems. Alternatively, if the problem of interest is singular in nature, then further research could involve identifying an optimum curriculum for the desired environment. The following considerations will need to be addressed to obtain an optimal curriculum:

• Balancing the time saved by random moves with the loss of training experiences.

• Ensuring the curriculum profile does not change so fast as to include uninformed examples.

• Ensuring the curriculum profile does not change so slowly:

– as to cause the network to overfit to a smaller portion of the environment space;

– as to discard examples that would be good training examples.

To address these requirements, the curriculum will likely need to be adaptive. An adaptive curriculum, which dynamically changes the difficulty of the generated examples in response to learning progress, would likely be effective for general control problems. The next chapter investigates such an adaptive curriculum.

Chapter 5

Automated End Game First Curriculum

Highlights

• If the neural-network in a neural-network/tree-search agent is untrained, informed decisions can only occur when the search reaches a terminal node.

• A neural-network/tree-search agent has an average search-depth which increases as the agent learns.

• Adding the search-depths from each training epoch creates a visibility-horizon, which is the approximate distance (ply) from a terminal node at which informed decisions can be made.

• A general end-game-first curriculum is shown to be effective when based on the visibility-horizon.

5.1 Introduction

In the previous chapter we described the end-game-first curriculum and demonstrated how it can be employed to speed up training of a neural-network/tree-search agent. As the name suggests, the end-game-first curriculum focuses on learning positions close to the end-game first, then progressively seeks to learn positions that are further and further from the end of the game. In this chapter we present a new method for automatically adjusting the curriculum without any human intervention, which outperforms the existing state-of-the-art approach of not using a curriculum for all games tested. We also present the concept of the visibility-horizon for an agent, a measure which indicates how far from a terminal state experiences are sufficiently informed and therefore useful for training. We also present, to our knowledge for the first time, the concept of using the change in search-depth from one training cycle to the next to indicate the learned knowledge of the agent. In Chapter 4 the hand-crafted curriculum function ζ(e) was dependent on the number of epochs e, where ζ was used as the criterion for permitting an experience to be added to the experience-buffer depending on its distance from a terminal state.

The curriculum function was based on the assumption that the agent improves over time (epochs). The problem with this approach is that it requires some prior knowledge of the game to assess the optimal rate at which the curriculum should progress. A complex game like Reversi requires substantially more time to learn than Tic-Tac-Toe, and as such the rate of change of the curriculum for Reversi needs to be slower per epoch than for Tic-Tac-Toe. However, complexity cannot simply be measured by the size of the state-space, as some games can be trivial despite the state-space size. Take for instance a 3-in-a-row game on an 8x8 board. This trivial game has a large state-space, yet the player to act first can win trivially. Despite the large state-space, the curriculum for such a game needs to change rapidly, as the value function for the game is simple to learn. When defining the curriculum function by hand, if the game is not well understood in advance then the curriculum may actually slow down learning by progressing too slowly or too quickly. The approach to curriculum-learning that we presented previously is based on the number of training epochs - which is directly proportional to training time. Whilst we use the number of training epochs as the independent variable to set our curriculum, we have no real indication as to whether sufficient time was allocated to the particular portion of the state-space enforced by the curriculum. In setting curricula by hand, the timing of any changes needs to account for the estimated complexity of the portion of the state-space being learned. Further complicating the design of a curriculum, the complexity of small packets of the game's state-space does not progress linearly when working backward from the end-state. Take for example the very narrow state-space which incorporates only terminal states in the game of Reversi. The value of a terminal state in Reversi is simply the difference between the numbers of each coloured tile. However, the value calculations for the set of states which includes board positions that are 3-ply from terminal are substantially more complex - taking into account not only the number of tiles but also their absolute position on the board and their relative position to other tiles on the board. Similarly, complexity does not always decrease as the game progresses from the start-state. For instance, in some games like Reversi the very first action is trivial, since the available first actions are all identical when rotations are taken into account. Our method for automating the curriculum accounts for these issues. We discussed in the previous chapter the following considerations for setting an end-game-first curriculum, and in this chapter we address these concepts:

• Balancing the time saved by random moves with the loss of training experiences.

• Ensuring the curriculum profile does not change so fast as to include uninformed examples.

• Ensuring the curriculum profile does not change so slowly:

– as to cause the network to overfit to a smaller portion of the environment space,

– as to result in discarding examples that are sufficiently effective for training.

In the previous chapter we established that by applying an end-game-first curriculum to a neural-network/tree-search game playing agent the time to learn a game can be reduced; however, some weaknesses were highlighted. A single optimal hand-crafted curriculum could not be found for all games, and even when an effective curriculum was found for a single game it was sensitive to the neural-network hyper-parameters and varied from one training run to another. One of the key findings from Chapter 4 was that a useful generic curriculum would need to be based on the performance of the neural-network, not just the number of training cycles or time. Our approach to setting an automated curriculum will be to increase the curriculum if the agent's rate of learning is fast, maintain the curriculum if the agent is showing some improvement, and decrease the curriculum if the agent is learning too slowly.

5.2 Prerequisites

Progressing the Curriculum at the Optimum Rate

Our method of implementing an end-game-first curriculum in Chapter 4 was to incrementally add experiences further and further from a terminal state as the agent improves. We relied on the assumption that the agent improves¹ with each training epoch, thus our motivation for basing the curriculum function ζ(e) on epochs e. The assumption that the agent improves with each training epoch was found to be correct on average, and we were able to demonstrate that incrementing the curriculum based on epochs was an improvement when compared against an agent not using the curriculum. The weakness with using epochs as the independent variable for the curriculum function is that it does not take into account variances between different training runs, and it also requires some prior knowledge of the rate of learning for each specific game and neural-network combination. We established in Section 4.1.1 that for a completely untrained agent, informed decisions can only be made when terminal states are found during the search. We have also found that the agent's ability to make informed decisions propagates further and further from terminal states as training progresses. The agent, using the neural-network, builds an internal action-value mapping of the environment during training, and information about the environment propagates backwards through its map from the terminal-states to the starting states - recalling the discussion from Section 2.4.3. At any point in time the reward information from the environment will have propagated some distance from the terminal-state within the agent's internal mapping; we call this the informed distance. For a search-tree built to estimate the ground-truth for an experience, any nodes that are within the informed distance from a terminal-node contain some information about the environment, and we call these informed nodes. Informed nodes may or may not contain enough information to be useful for training, but they contain at least some amount of environment information. Consider the situation for a newly initialised agent when the tree search includes just one terminal state - there is no guarantee that the single environmental reward will be enough to override the neural-network's random estimate, but the root-node would still be within the informed distance from terminal. The distance from a terminal-node within which nodes are sufficiently informed to make accurate predictions and are useful for training is what we call the visibility-horizon, as shown in Figure 5.1. Calculating the precise value of the visibility-horizon is likely to be intractable; however, the upper limit for the visibility-horizon is the informed distance. The lower limit for the visibility-horizon is 0, since an untrained agent may not be able to make accurate predictions at any distance from a terminal state. Informed nodes contain either information directly from the environment via terminal-nodes, or indirectly through other informed nodes, and with each training cycle the informed distance increases if the search finds at least one informed node. If the network is learning with each training cycle, the visibility-horizon also increases. When building a tree for a particular game state, if the deepest node in the game tree does not reach the visibility-horizon then it is uninformed, i.e. the tree will be comprised of nodes which have insufficient information about the environment. We liken the visibility-horizon to the distance out to sea from which a boat can determine the location of land, i.e. has useable information about the land. The boat can directly observe the land's location if it is within the boat's range of sight. If the land is beyond the horizon then, without any other information, the boat's navigation decisions to find land will be uninformed. To extend this analogy further, to aid future navigation decisions a marker buoy can be dropped at the location where the land is first seen, and in future this buoy can be used to extend the distance at which the boat can make navigation decisions, as shown in Figure 5.2. This buoy allows the boat to derive information about the land without needing to directly see the land, i.e. extending the visibility-horizon further out to sea - subsequently more and more buoys can be placed further and further out. The furthest distance from the land at which a buoy can be seen is equivalent to the visibility-horizon for an agent. To bring this ocean analogy back to our RL problem: land is equivalent to terminal-nodes - where the ground-truth value of the environment is directly obtained in the form of a reward. How far the boat can see is equivalent to the depth of the search. The marker buoys have some similarity to the neural-network learning the dynamics of the state-space at that distance from a terminal node. Consider for a moment whether there is any benefit to dropping a marker buoy if the boat spawned in the middle of the ocean has no prior knowledge and no landmarks - this buoy would be completely uninformed about where the land is. The placement of such a random marker buoy is similar to creating an experience with no direct or indirect information from the environment, which is the current approach to generating ground-truth experiences for neural-network/tree-search agents. For newly initialised players, experiences are created even if no environmental reward is observed. The difference between the boat analogy and the RL problem is that a randomly dropped buoy may eventually connect with others which are connected to land and become useful, whereas the uninformed RL experience will always remain uninformed.

¹ Implicit in our use of the word "improve" is that the agent makes more accurate predictions.

Figure 5.1: Example game tree for the game of Tic-Tac-Toe with an expansion of 9 nodes for the board position shown inside the root-node (red). Note that the depth of the tree is 3 nodes, and that the deepest node is terminal. If a newly initialised agent produced this tree, then we could say that the experience created from this root-node was informed, as it contains at least some information about the environment. Whether there is enough information to make the experience useful for training is uncertain. The visibility-horizon attempts to define the distance from a terminal node within which generated experiences are useful for training; the upper limit in this case is 3 nodes, but the true value is likely to be less than that.

Figure 5.2: Without directly seeing land, a boat can still derive information about where the land is, if markers are placed in a manner such that their separation is within the distance the boat can see, and one of the markers has visibility of the land.

5.2.1 Visibility-Horizon

To formalise the concept, we define the visibility-horizon as extending from a terminal-node up the tree for some distance within which the agent is still able to make informed decisions. Experiences created outside the visibility-horizon are a waste of resources since they will be uninformed. The distance from a terminal-node to the visibility-horizon is ν; unfortunately the precise value of ν is not easily calculated. ν has an upper limit, ν+, which becomes larger with each training cycle and is based on the previous value of ν+ and the depth of the search-tree, such that

ν+_{i+1} = min(ν+_i + di, n)

where i is the training cycle, di is the search-depth and n is the number of ply in the game. Note that our definition of ν is the distance from terminal-nodes from which an agent is able to make an informed decision. Finding a terminal-node during a tree-search does not guarantee that sufficient information is obtained to enable the agent to make an informed decision for that position, especially if only a small number of terminal nodes are found. Because of this, the true value of ν will be less than ν+, and in our opinion calculating it precisely is likely to be intractable. Instead of attempting to find the true value of ν, we hypothesise that ν increases as the agent's knowledge (skill) increases, and as such we explore a mechanism to measure the relative skill of the agent which we will use to guide our estimate of ν. To implement an optimum curriculum, the curriculum should be set so that experiences are only created for board positions that are within ν of a terminal-state. Since ν is expected to be difficult to calculate, we will instead periodically adjust our estimate of ν using the agent's relative performance. As outlined in Chapter 4, experiences are created by performing a tree-search using the current board position as the root-node, and these experiences are then used to train the agent. The ground-truth of an experience for non-terminal-nodes is initially estimated by the neural-network, and if the tree-search does not override the neural-network then the created experience depends solely on how well trained the neural-network is. It is useful to be able to describe experiences, nodes and states as being either informed or uninformed depending on whether their distance to the closest terminal state is within the agent's visibility-horizon. Given our previous discussions we highlight the following observations with respect to the visibility-horizon, noting that the ground-truth value of an experience is derived from the root-node of a tree-search, that each node in the tree relates directly to a game state, and that an experience inherits the label of being informed/uninformed from the associated node/state that was used to create it.

• Terminal nodes are informed, in fact they are directly informed with first-hand information from the environment. They are unambiguous and perfectly accurate as to their value since the information comes directly from the environment. Note that non-terminal reward signals, if they exist, do not necessarily provide perfect value information if the desire is to maximise the reward over the long-term - recalling our discussion in Section 2.4.3 about an agent choosing to forgo a small immediate reward for a larger reward later.

• An informed node has at least one subordinate informed node in its search-tree - noting that terminal nodes are also informed nodes.

• An uninformed node has no subordinate informed nodes.

• The maximum visibility-horizon ν+ is the frontier distance in a game tree, measured from a terminal-node, where nodes are informed. ν+ increases with each training cycle by the depth of the search-tree, but only where the tree includes an informed node. The change in ν+ can be calculated by ν+_{i+1} = ν+_i + di, provided the tree has an informed node, where di is the depth to the informed node.

• The visibility-horizon ν is some uncertain distance, less than ν+, where experiences are sufficiently informed that they are valuable for training.

• The closer a node is to a terminal node the more informed it is.

Inherently, experiences are created with varying amounts of environmental knowledge, and ideally only well informed experiences will be generated during training. Defining the visibility-horizon in this manner allows us to crudely quantify our confidence in the accuracy of an experience's ground-truth, based on its distance Xd from a terminal experience and its relationship to the maximum visibility-horizon ν+, as shown in Table 5.1.

Relationship | Label | Confidence in accuracy of derived ground-truth
ν+ < Xd | Uninformed | 0% (No confidence)
Xd = 0 | Directly Informed | 100% (Complete confidence)
0 < Xd ≤ ν+ | Knowledge increases as Xd → 0 | Confidence increases as Xd → 0

Table 5.1: Labelling experiences based on whether they include information from the environment. Labelling experiences in this manner allows us to express a level of confidence in the derived ground-truth value for the experience depending on how far the experience is from a terminal experience and the maximum visibility-horizon ν+.
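The update for ν+ and the labelling in Table 5.1 can be written down directly; the sketch below is a literal transcription under our notation, with function names of our own choosing.

def update_nu_plus(nu_plus, search_depth, game_ply):
    """nu+_(i+1) = min(nu+_i + d_i, n): the maximum visibility-horizon grows by at
    most the search-depth each training cycle, capped at the game's ply count."""
    return min(nu_plus + search_depth, game_ply)

def label_experience(distance_to_terminal, nu_plus):
    """Label an experience by its distance X_d to the nearest terminal state (Table 5.1)."""
    if distance_to_terminal == 0:
        return "directly informed"   # ground-truth straight from the environment
    if distance_to_terminal <= nu_plus:
        return "informed"            # confidence grows as X_d approaches 0
    return "uninformed"              # beyond the maximum visibility-horizon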

5.2.2 Performance of the Agent

From the previous section we have defined the upper bound for the visibility-horizon, ν+, as the maximum distance from a terminal state from which an informed decision can be made. The actual distance at which informed decisions can be made is ν ≤ ν+. Since the desire is to only generate informed experiences, we will create an estimate ν∗ ≈ ν and use this as the criterion to determine whether an experience will be generated for a given board position. We will adjust ν∗ based on the agent's skill, and in this section we explain how we measure the agent's skill. To use another analogy, determining the agent's skill could be likened to the problem of determining which year an unknown/unseen student should be placed in at school. The desired result is that the student is placed in the appropriate year to challenge their knowledge with problems that are just beyond their current ability. Likewise, we desire to set the curriculum so that appropriate experiences are presented to the agent which are at the limit of ν. To determine ν∗, during training we increase ν∗ if the agent is progressing quickly in its learning, and decrease ν∗ if the previous adjustment was too abrupt, causing a severe slowing of progress. ν∗ is adjusted in this manner for the duration of training, increasing and decreasing depending on how fast the agent is improving. Before this can be achieved, the agent's performance needs to be readily measurable without incurring additional resource costs.

The Agent’s Relative Performance

One method to determine the agent's skill at any point in time is to perform external testing against a known reference-opponent, in the same way we have presented our comparative results in the previous chapter. Whilst this provides a clear reference point as to how the agent's skill is progressing, doing so is both time and resource intensive and sits outside the existing training pipeline. In addition, there exists no single external reference-opponent which can play all games to superhuman levels, and if such an agent were to exist then it could be used directly to generate the self-play experiences. Performing external skill testing of an agent for the purposes of setting a curriculum would likely inhibit the speed of learning, countering the aim of this thesis. For these reasons we have dismissed this as an option for testing the agent's skill to inform the appropriate curriculum. Another potential indicator of the agent's skill could be the loss function. Despite the neural-network generating its own ground-truth experiences, the loss function shown in Equation 3.1 in Chapter 3 does tend to reduce over time as the agent learns the environment. The problem with this approach, however, is that by implementing a curriculum we are varying the complexity and the variety of experiences over the training time, which in turn causes the absolute value of the loss function to vary. This makes the loss unsuitable for determining the agent's improvement.

Search-Depth as a Performance Indicator

The neural-network in the agent is used to bias the tree-search so that the most promising moves are searched, as we explained in Section 4.1. The neural-network effectively recommends which actions the tree-search should explore. A well trained neural-network can correctly predict the best action for a node with a high value relative to the other actions, and in doing so makes a strong recommendation. If a strong recommendation is made then the search is likely to accept it, visiting the recommended child node and subsequently exploring its grandchildren - increasing the depth of the tree instead of the breadth. Strong neural-network predictions can occur at any depth, and as the neural-network learns we expect more confident predictions. Because of this characteristic, we hypothesise that the search-depth increases as the agent learns (making the tree narrower and deeper). Reversing our hypothesis, we then infer that the relative depth of the search from one training cycle to the next provides an indication as to whether the neural-network is improving: the bigger the jump in search-depth between training cycles, the more learning that has occurred. The depth of the search is dependent on the branching factor of the problem and the number of search iterations conducted. We hypothesise that for a single game with a constant average branching factor, and an agent performing a fixed number of search iterations, any variation in search-depth is indicative of the neural-network's learning. We confirmed this hypothesis and found that the average search for a given game was deeper using a trained neural-network than using an untrained neural-network, as shown in Figure 5.3. The relative search-depth provides a measure of how well the neural-network is trained, i.e. the agent's skill at playing. If the neural-network were perfectly trained, then the search would not need to explore other actions (it would immediately identify the optimal moves for victory), causing the depth to be large. Likewise, if only some positions were well understood then the tree search would effectively accept the neural-network's recommendation and explore the children of those well understood positions. The advantage of using the search-depth as a measure of the agent's learned knowledge is that it requires no additional resources, simply an additional variable to record the search-depths achieved for each move decision. Whilst the precise playing quality cannot be inferred from this measure, we can use it to predict whether the agent's learning is improving.
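The bookkeeping this requires is small, as the sketch below suggests; the exp.search_depth attribute is an assumption about how the depth might be stored with each experience.

def average_search_depth(buffer):
    """Average depth of the search-trees that produced the buffered experiences."""
    depths = [exp.search_depth for exp in buffer]
    return sum(depths) / len(depths) if depths else 0.0

def depth_delta(buffer, previous_depth):
    """Delta = d_i - d_(i-1): the change in average search-depth between training
    cycles, used as a zero-cost proxy for learning progress."""
    d_i = average_search_depth(buffer)
    return d_i - previous_depth, d_i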

Figure 5.3: The average search-depth of experiences in the buffer at each epoch for a baseline-player without any improvements, playing different games. This demonstrates that the depth achieved by the search increases as training progresses. Note that the game of Connect4 7x6 is the full Connect4 game on a 7x6 board with a maximum ply of 42, and after 700 epochs the search-depth is greater than 12 ply on average - making it a relatively deep search for the state-space size. Different games perform different numbers of epochs depending on the game's complexity and the parameters used for the player.

5.3 Method

We apply the end-game-first curriculum by attempting to create only experiences within the visibility-horizon, i.e. experiences whose distance to the end-game Xd is less than the visibility-horizon ν (Xd < ν). We reuse the method from Chapter 4 to enforce this condition by playing random moves until the distance-to-terminal for a game is approximately ν, then creating experiences for the remaining moves in the game. We obtain an approximate value for the distance-to-terminal for any experience by maintaining the average number of ply per game. Key parameters used in this chapter are shown in Appendix A.
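A sketch of the gate and the running ply estimate is given below; the momentum constant and function names are assumptions for the example, and the surrounding self-play loop is as in the earlier curriculum sketch.

def update_avg_ply(avg_ply, game_length, momentum=0.95):
    """Maintain a running estimate of the average game length in ply."""
    return momentum * avg_ply + (1 - momentum) * game_length

def should_create_experience(current_ply, avg_ply, nu):
    """Create an experience only if the estimated distance-to-terminal Xd is
    within the visibility-horizon estimate (Xd < nu)."""
    estimated_distance = max(avg_ply - current_ply, 0)
    return estimated_distance < nu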

For the ith training cycle we estimate νi by increasing or decreasing νi−1, depending on how the average search-depth di has changed compared with the previous training cycle's depth di−1. We define ∆ as the difference in average search-depth between the current and previous experience-buffers, giving ∆ = di − di−1. Equation 5.1 shows the function we use to estimate ν.

νi(∆) = νi−1 + di/10   if −0.5 ≤ ∆ ≤ 0.5   (a)
        νi−1 − 1       if ∆ < −0.5          (b)        (5.1)
        νi−1 + di/2    if ∆ > 0.5           (c)

The reasons for the values in Equation 5.1 are as follows:

• Equation 5.1(a) - when the average search-depth has not changed significantly. Since ∆ is small we assume that the agent has not improved significantly since the preceding training cycle, and as such we wish to continue training with this value of ν. There is still a requirement to adjust it by some small positive amount: there is an absolute maximum depth for a search, and this maximum may be reached before ν encompasses the entire game, causing ∆ to be small despite the agent having learned all it can with this value of ν. In this case it is desirable to continue to advance the curriculum so that it eventually encompasses the entire game space, so we increment ν by a small portion of the search-depth.

• Equation 5.1(b) - when the average depth decreases. This situation may indicate that the curriculum has moved too fast, which is why ν is decreased by 1 when this occurs. We choose not to decrement by the same proportion as in (c) to ensure that, where possible, the curriculum keeps progressing toward encompassing the full game.

• Equation 5.1(c) - when the average depth increases significantly. A large increase in search-depth suggests that the agent has rapidly learned this portion of the state-space for this value of ν. We therefore increase the problem space by incrementing ν by a portion of the search-depth - in our case di/2. With this adjustment we aim to create new experiences which are at the edge of the agent's visibility-horizon. We chose di/2 simply because it is halfway between the upper and lower limits. If, after varying ν, we find that the change was too rapid, then condition (b) is expected to occur during the next training cycle, reducing ν.
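A direct transcription of Equation 5.1, with variable names following the symbols above, is sketched below.

def update_nu(nu_prev, d_i, delta):
    """Adjust the visibility-horizon estimate from the change in average
    search-depth, delta = d_i - d_(i-1)."""
    if delta > 0.5:               # (c) depth jumped: the agent learned this region quickly
        return nu_prev + d_i / 2
    if delta < -0.5:              # (b) depth fell: the curriculum moved too fast
        return nu_prev - 1
    return nu_prev + d_i / 10     # (a) little change: keep creeping toward the full game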

The chosen parameters used to adjust ν are not necessarily optimal; however, these parameters were found to improve the speed of learning when compared against an unmodified AlphaZero inspired player and the hand-crafted curriculum-player presented in Chapter 4.

Further research into how to more precisely estimate ν may result in an optimum curriculum, further speeding up training.

5.4 Evaluation

The experimental setup is the same as in Chapter 4, and uses the framework outlined in Chapter 3. We train two players side by side: the auto-curriculum-player using the specified automated curriculum, and a second player without a curriculum, the baseline-player. We train both players simultaneously on the same dual-GPU machine, allocating one GPU per player. Periodically during training a competition of at least 30 games is played against the appropriate reference-opponent and the agent's success-rate is recorded. The training pipeline of the agent results in experiences being created during self-play, with the tree-search being used to create the ground-truth. For this experiment the search-depth and the distance to the terminal state are stored with each experience to provide deeper insight into the effect of applying the curriculum. Confidence-bounds were calculated using the Wilson-Score with continuity correction, giving 95% confidence with data accumulated over the specified time interval.
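For reference, the confidence bounds can be computed as in the sketch below, which implements the standard Wilson score interval with continuity correction; the function name and the handling of the degenerate all-loss and all-win cases are our own.

import math

def wilson_interval_cc(successes, n, z=1.96):
    """Wilson score interval with continuity correction for `successes` wins
    (plus draws) out of `n` games; z = 1.96 gives 95% confidence."""
    if n == 0:
        return 0.0, 1.0
    p = successes / n
    denom = 2 * (n + z * z)
    lower = (2 * n * p + z * z - 1
             - z * math.sqrt(z * z - 2 - 1 / n + 4 * p * (n * (1 - p) + 1))) / denom
    upper = (2 * n * p + z * z + 1
             + z * math.sqrt(z * z + 2 - 1 / n + 4 * p * (n * (1 - p) - 1))) / denom
    if successes == 0:
        lower = 0.0
    if successes == n:
        upper = 1.0
    return max(0.0, lower), min(1.0, upper)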

5.5 Results

We simultaneously trained an agent using the automated end-game-first curriculum and a baseline-player using no curriculum on a particular game, then periodically conducted competitions against the reference-opponent. The success-rate for these competitions is recorded, as well as a number of performance metrics from the agent, which are plotted below. Each experiment is performed three times and the success-rate data is combined to confirm that the improved performance is observed on average, rather than just during a lucky training run. In addition to recording the success-rate of the agents we also record a number of performance values during training. We present these additional performance values individually instead of averaging them, since averaging can mask some characteristics of the automated curriculum method. Figures 5.4, 5.5 and 5.6 show the results for the games of Connect-4, Reversi and Racing-Kings of an automated-curriculum player compared with a player using no curriculum. As we have stated previously, the hand-crafted step curriculum used in Chapter 4 was chosen semi-arbitrarily, and as such we made no claim as to whether it was optimal. Comparing the automated curriculum outlined in this chapter with the potentially suboptimal hand-crafted step curriculum from Chapter 4 is not likely to yield any definitive conclusions, so we do not dwell on this point. We do perform a single experiment comparing the automated curriculum with the hand-crafted step curriculum using the game of Reversi, as shown in Figure 5.7, to demonstrate that both methods can result in similar curricula with similar playing performance. For each experiment we record the following parameters and plot them against the training time (a sketch of the per-experience record that supports these plots is given after the list):

• (a) Success-rate. The auto-curriculum-player’s rate of wins and draws against the reference-opponent in the periodic competitions.

• (b) Average search-depth. The average of each search-tree’s depth for the experiences stored in the buffer at the given time - recalling that a search-tree is built to both create the ground-truth value for training and choose an action during the game.

• (c) Maximum search-depth. The maximum depth achieved for a search-tree in the buffer at the given time.

• (d) Total experiences in buffer. The number of experiences in the buffer at the given point in time. Note that self-play halts when there are sufficient games in the buffer, regardless of the number of experiences per game. The curriculum can reduce the number of experiences per game, since experiences are only created toward the end of a game depending on the value of the curriculum-parameter ν.

• (e) Average distance from terminal. For each experience in the buffer the distance to the terminal state is recorded and this plot shows the average distance-to-terminal of all experiences in the buffer. For the curriculum-player the curriculum-parameter ν will reduce this distance, since only those experiences that are within ν of the terminal state are included in the experience-buffer.

• (f) Average Loss for experiences in buffer. The loss measures the error between the ground-truth for an experience's state and the neural-network's inference for that state.

• (g) Curriculum Distance. This plot shows the distance from terminal within which experiences will be created, i.e. the curriculum-parameter ν. The agent not using a curriculum permits all experiences to be created regardless of distance, whereas the curriculum-player only permits experiences where the distance is less than ν.
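The record sketched below is one possible shape for a stored experience that supports plots (b) to (e); the field names are our own, not those of the thesis code.

from dataclasses import dataclass
from typing import Any, Dict

@dataclass
class Experience:
    state: Any                   # board position used as the search root
    policy: Dict[int, float]     # MCTS visit distribution (ground-truth policy target)
    value: float                 # game outcome from this player's perspective
    search_depth: int            # depth of the tree built for this decision
    distance_to_terminal: int    # ply between this position and the game's end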


Figure 5.4: Player using automated end-game-first curriculum vs player with no curriculum playing Connect-4. (a) success-rate, (b) average search-depth, (c) max search-depth, (d) number of experiences stored in the buffer, (e) the average distance-to-terminal of buffer experiences, (f) the average loss for experiences in the buffer and (g) the curriculum value ν.


Figure 5.5: Player using automated end-game-first curriculum vs player with no curriculum playing Reversi. (a) success-rate, (b) average search-depth, (c) max search-depth, (d) number of experiences stored in the buffer, (e) the average distance-to-terminal of buffer experiences, (f) the average loss for experiences in the buffer and (g) the curriculum value ν. Run 3 was conducted on a system using a lower performance GPU than runs 1 and 2.


Figure 5.6: Player using automated end-game-first curriculum vs player with no curriculum playing Racing-Kings. (a) success-rate, (b) average search-depth, (c) max search-depth, (d) number of experiences stored in the buffer, (e) the average distance-to-terminal of buffer experiences, (f) the average loss for experiences in the buffer and (g) the curriculum value ν. Note that the planned buffer size was reduced to accommodate the player with no curriculum during the early stages of training. Because Racing-Kings is a game where terminal experiences are difficult to find by random play, as explained in Section 3.5.4, there is a disproportionate number of experiences skewing the initial values shown in these plots.


Figure 5.7: Player using automated end-game-first curriculum vs player with hand-crafted step curriculum playing Reversi. (a) success-rate, (b) average search-depth, (c) max search-depth, (d) number of experiences stored in the buffer, (e) the average distance-to-terminal of buffer experiences, (f) the average loss for experiences in the buffer and (g) the curriculum value ν.

5.6 Discussion

5.6.1 Comparison With Baseline-Player

Figures 5.4, 5.5 and 5.6 show the performance of a player employing our automated end-game-first curriculum relative to the unimproved baseline-player in the games of Connect-4, Reversi and Racing-Kings. Plot (a) in each figure shows the success-rate of each player against the reference-opponent as time progresses. Plots (b) to (g) show interesting metrics, as detailed below, while the agent is undergoing training. While considering these results, recall that Racing-Kings is non-convergent as explained in Section 2.2.3. With non-convergent games, terminal-states are difficult to find by random play, and an untrained agent plays almost randomly. The average number of ply in a Racing-Kings game at the start of training is substantially more than when the agent has learned the game and is playing less randomly. This characteristic causes the initial peak observed in Figure 5.6, subplots (b)-(g).

Plot (a) Success Rate of Agent with No Curriculum and Automated Cur- riculum

Plot (a) of Figures 5.4, 5.5 and 5.6 shows the success-rate over time of an automated-curriculum player and an agent with no curriculum (the baseline-player) against a reference-opponent. Three independent training runs were conducted for each experiment and the results averaged to create this plot. These plots all show that using an automated end-game-first curriculum results in faster learning than using no curriculum. The curriculum-player typically completes its first training cycle before the non-curriculum agent; the reason is that the curriculum-player makes a number of random moves, which is faster than building a tree for decision making, so the buffer (which is measured in games) fills up sooner. The confidence intervals are a little wider toward the end because weights are selected randomly during training to compete against the reference-opponent, and the final weights have less time to be selected in the competition, reducing the total number of games that the later weights are involved in.

Plot (b) Average Search-Depth of Experiences in Buffer

This plot shows the search-depth for the experiences stored in the buffer at the given point in time, averaged across all stored experiences. When an experience is created its search-depth is stored with it to enable this analysis. Each independent training run is shown individually instead of being averaged as was done for Plot (a).

Reversi and Connect-4. For the baseline-player (no curriculum) the search-depth increases with time for the games of Reversi and Connect-4, as seen in Figures 5.4 and 5.5. This increase in depth shows some correlation with the increase in playing performance observed in Plot (a). This correlation is the premise for using the relative search-depth as an indicator of playing performance. For the automated curriculum-player, the average depth varies significantly over time and does not correlate with the player's performance. This outcome is expected, recalling that the curriculum distance is adjusted depending on the change in search-depth; the curriculum is adjusted to always oppose the change in depth.

Racing-Kings. The game of Racing-Kings, shown in Figure 5.6, has a higher initial search-depth due in part to the game being non-convergent, and also because toward the end of a Racing-Kings game there are usually only a few pieces - making the decision complexity small and resulting in deep trees. The correlation between average depth and player performance is non-existent for the initial training period, as shown in these plots. This high initial average depth does not exist when employing a curriculum, because initially, due to the curriculum, search-trees are only built for positions which are close to the terminal-state, limiting the depth of some searches. Employing the curriculum eliminates this higher initial search-depth, which in turn restores some correlation between search-depth and player performance, allowing the curriculum to be automated in accordance with our method.

Plot (c) Maximum Search-Depth of Experience in Buffer

This sub-plot shows the single largest search-depth achieved whilst generating the experiences stored in the buffer at the given point in time. As with the average search-depth, this value generally increases to some maximum; however, since it is taken from a single search it is less useful as a measure of the agent's performance, as one lucky expansion can skew the results. As can be seen in Figures 5.4, 5.5 and 5.6, the maximum depth in the buffer after sufficient training can reach a relatively high proportion of the number of ply in a game. The same initial peak occurs with the game of Racing-Kings as noted in the preceding discussion of the average search-depth, showing an initial maximum depth greater than 100 ply. Given that there are 200 MCTS iterations, 100 ply represents a significant depth considering that a maximum of 200 nodes can be added to the tree. Without knowing the specifics of the search giving rise to this value, it can be envisaged to occur if both players have only one piece on the board and the untrained neural-network incorrectly favours one action, causing the tree-expansion to repeat positions.

Plot (d) Total Experiences in Buffer

This plot shows the number of individual experiences held in the buffer at a given point in time. Recall that the criterion for halting self-play is the number of games in the buffer, not the number of experiences; as such the number of experiences differs from player to player and between training runs depending on the length of the games. An experience is created for every ply within a game for the baseline-player, whereas for the curriculum-player experiences are created only for those actions which fall within the curriculum distance. It was demonstrated in Chapter 4 that the end-game-first curriculum improves the efficiency of learning, as the agent learns from fewer experiences than the baseline-player, and this plot highlights that characteristic of end-game-first learning: despite using significantly fewer experiences than the baseline-player, the curriculum-player learns faster. An ancillary benefit of the end-game-first curriculum is that the size of the buffer is kept relatively small for games where terminal states are difficult to find via playout. The first self-play batch conducted immediately after initialisation in the game of Racing-Kings resulted in the average number of ply per game being in excess of 100, with some games reaching more than 300 ply. When designing this Racing-Kings experiment the buffer size was planned to be 3000 games; however, the number of experiences generated by the baseline-player exceeded the memory capacity of the system, so the value was set to 2000 games. Figure 5.6(d) shows that the initial size of the buffer for the baseline-player was approximately 60,000 experiences compared with just 10,000 for the curriculum-player. The number of experiences in the buffer for the baseline-player then reduces rapidly because the agent has begun to learn and is guiding the search more accurately.

Plot (e) Average Distance from Terminal

This plot shows the average distance to an experience's terminal state for the buffer at the given point in time. Some games, like Reversi, have nearly the same number of ply in every game, while others, like Connect-4 and Racing-Kings, vary; on average the value remains relatively constant for the baseline-player. For the curriculum-player, however, this plot shows how, over time, the experiences in the buffer are further and further from a terminal state. A portion of the oldest experiences is removed from the buffer after each training cycle and replaced with new ones, and for the curriculum-player the new experiences are further from terminal with each training cycle. Again the game of Racing-Kings causes this plot to appear anomalous during the early stages of training for the reasons explained previously; however, as the agent learns, games become shorter, avoiding unnecessary repetition.

Plot (f) Average Loss for Experiences in Buffer

This plot shows the average policy loss for all experiences in the buffer at the given point in time, computed as shown in Equation 5.2. As expected, the loss diminishes as time progresses for the baseline-player, indicating, as in most neural-network approaches, that the agent is learning. However, the curriculum-player shows an increasing loss for the game of Reversi in Figure 5.5. By enforcing a curriculum, the complexity of the problem space changes with each change in curriculum, which can alter the loss over time. The curriculum aims to progressively increase the complexity of the problem as the agent learns, so an artefact of end-game-first curriculum-learning is that the absolute loss may not decrease with time. Given sufficient time, when the agent is learning from the full game, the loss will then begin to decrease.

l = −π(s) · log(pθ(s))    (5.2)

where:

π(s) = Policy from the MCTS.

pθ(s) = Neural-network policy inference.
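As an illustration of Equation 5.2, the following minimal sketch computes the buffer-averaged policy loss. The Experience fields and the network.predict interface are hypothetical stand-ins rather than the thesis implementation, and the negative sign follows the usual cross-entropy convention.

```python
import numpy as np

def policy_loss(pi, p_theta, eps=1e-12):
    # Cross-entropy between the MCTS policy pi(s) and the network policy p_theta(s).
    return -float(np.sum(pi * np.log(p_theta + eps)))

def average_buffer_loss(buffer, network):
    # Average policy loss over every experience currently held in the buffer.
    losses = []
    for exp in buffer:                        # exp.state and exp.pi are hypothetical fields
        p_theta, _value = network.predict(exp.state)
        losses.append(policy_loss(exp.pi, p_theta))
    return sum(losses) / len(losses)
```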

Plot (g) Curriculum Distance from Terminal for Experience to be added to Buffer

This plot represents the permissible distance-to-terminal ν for experiences to be placed into the buffer. Experiences which have a distance-to-terminal less than ν are stored in the buffer. This value is not plotted for the baseline-player as it would simply equal the number of ply in a game and offer no insight.
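The admission rule can be summarised in a few lines; the sketch below assumes a hypothetical experience object exposing a distance_to_terminal field and is illustrative only.

```python
def admit_to_buffer(experience, buffer, nu):
    # Store an experience only if it lies within the current curriculum window nu.
    if experience.distance_to_terminal < nu:
        buffer.append(experience)
        return True
    return False
```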

5.6.2 Comparison with Handcrafted Curriculum

The previous section demonstrated that the curriculum can be set using the relative search-depth during training and that this method is effective in reducing learning times in all tested games when compared against a standard neural-network/tree-search agent not using any curriculum. In Chapter 4 we observed an improvement by employing a hand-crafted curriculum. For completeness we performed an experiment comparing the performance of the automated curriculum-player against the hand-crafted player used in the previous chapter for the game of Reversi. Figure 5.7 shows that the automated curriculum performs as well as the hand-crafted curriculum-player. Figure 5.7(g) shows the two curricula as they were applied to the game playing agent, and it can be seen that both progress their curriculum at similar rates - one continuously, the other in steps. With such similar curricula we would expect to see similar performance - and this is what we have observed.

5.7 Summary

We demonstrated in the previous chapter that employing an end-game-first curriculum improves the rate of learning of a neural-network/tree-search agent employing reinforcement-learning. Our work in the previous chapter utilised hand-crafted curricula, and as such required prior knowledge of the game to ensure the curricula progressed at a suitable rate. The hand-crafted method was effective, however the reinforcement-learning algorithm used throughout this thesis is otherwise relatively general. We have proposed in this chapter a new automated end-game-first curriculum which is also general. The automated curriculum we have used is not necessarily optimal, however in all tested cases it outperformed not using a curriculum, and performed equally with the hand-crafted curriculum in the game of Reversi. Using our method to automate the end-game-first curriculum improves the rate of learning for a neural-network/tree-search agent. In the outline of this chapter it was noted that setting the curriculum at the visibility-horizon would likely produce the optimum curriculum progression, however it was also noted that calculating the visibility-horizon is likely to be intractable. Our method for estimating ν is relatively simple and is guided by the search-depth when assessing actions (which is itself a measure of the agent's learning).

This thesis so far has focussed on limiting the creation of uninformed experiences, and we have achieved success by recognising that the early training epochs contain the most uninformed experiences. At the extreme of this perspective is the initial epoch, which has the most uninformed experiences. In the next chapter we focus specifically on the initial training epochs and ways to improve the efficiency of training by investigating methods which can be used to prime the neural-network's learning.

Chapter 6

Priming the Neural-Network

Highlights

• The concept of priming a neural-network is established.
• Terminal nodes can be used to create experiences which are completely informed.
• During the course of a search to determine which action to take, a number of terminal nodes may be visited.
• Retaining all terminal experiences during the initial training cycles improves learning.

6.1 Introduction

The Macquarie Dictionary (2002) definition of priming is “to prepare or make ready”. In this section we demonstrate two methods which improve learning by modifying the training pipeline for the early training cycles. Recall from Chapter 3 that the ground-truth of an experience is created by conducting a tree-search and that the values for non-terminal leaf-nodes are obtained by using the neural-network that is being trained. In Chapter 5 we demonstrated that for a neural-network/tree-search agent most of the early experiences are uninformed when an end-game-first curriculum is not employed. An experience is uninformed if it has insufficient feedback from the environment to make an accurate prediction. In Chapters 4 and 5 we found that:

• For the initial training cycle, any experience created without the search reaching a terminal-node is uninformed and we have zero confidence in its usefulness in training.
• For subsequent training cycles experiences created without the search reaching an informed node are also completely uninformed.

These findings mean that without applying an end-game-first curriculum, much of the computing resources are wasted generating uninformed experiences during the early training cycles. This issue is particularly pronounced in the very first training cycle because an experience will be uninformed unless the tree-search used to create it includes a terminal-node. The problem diminishes with each training cycle as the agent learns, because after the first training cycle a tree-search which visits an informed state may lead to informed experiences. Experiences which are the furthest away from a terminal-node are more likely to be poorly informed.

Experiences created from the early board positions of a game are effectively random during the initial training cycles, however these experiences eventually become informed as more training cycles are conducted. The early stages of training result in a high proportion of uninformed experiences, and our motivation in this chapter is to counter this weakness. We present two new methods which vary the training methodology for the initial training cycles - where the most uninformed experiences are created. The two methods are named:

• terminal priming; and
• incremental-buffer priming.

With terminal priming we propose a variation to the existing method in which we add experiences for all terminal-states that are discovered during training. Our discussions in the previous chapter have already established that terminal states are directly informed, meaning we can be confident that the generated experience will be accurate. The incremental-buffering method shortens the time taken for the earliest training cycles by starting with a smaller buffer-size and incrementally increasing it. Whilst incrementally increasing the buffer-size may appear simplistic, to our knowledge the approach has not been covered in the literature, and there are additional risks of overfitting which need analysis before confirming it to be an effective method.

6.2 Terminal Priming

6.2.1 Method

In previous chapters we highlighted the importance of generating quality experiences, and demonstrated how learning commences from the terminal-states. The efficiency with which an agent creates and utilises experiences is a key factor in how fast an agent learns. In Section 5.2.1 we found that terminal nodes are directly informed, which means that unlike non-terminal nodes, which need to be visited a number of times to estimate their value, the precise value of a terminal-node can be obtained from a single visit. The result of this observation is that the state-action pair which leads to that terminal-node can be used to create an experience which is completely accurate from this single visit. When a search is conducted to determine which move to make for a board position, the root-node of the search is the current board position. The experience which is created is based on that root-node after the search is complete. What is overlooked in the current approach is that terminal nodes may be visited during the search. Actions which lead to terminal nodes can be used to create experiences with perfect information about the environment, however the current practice is to ignore this information. In this section we demonstrate how retaining these experiences can enhance learning.

Recall that the agent used in this thesis has a two-tailed output: policy pθ(s) and value vθ(s) for an input state s. The neural-network is trained using experiences X := {s, π(s), z(s)}, where π(s) is the policy derived from the tree-search and z(s) is the reward from the environment for the game's terminal-state, as explained in Section 3.3.2. When an explored action leads to a winning terminal-state during the search, the value of the action which gives rise to the terminal-state s_{t+1} can be determined exactly, since it is equal to the reward obtained directly from the environment: v(s_t, a_t) ← z(s_{t+1}). If the next state s_{t+1} is a terminal-state and is a win for the player, then to build the ground-truth experience we set z(s_t) = 1 and the policy π(s_t) is set to always choose the winning action, i.e. the index in the policy which represents the winning action is set to 1 and the remaining actions to 0. Conversely, when an explored action results in a losing terminal-state we want to avoid selecting that action, so the index in the policy representing the losing action becomes 0 and z(s_t) = −1. The policy is then re-normalised to ensure that it sums to 1, as required by the definition of a policy. We use these substituted values for π(s_t) and z(s_t) and save these experiences to the buffer. These experiences can be created with very little additional overhead, since the information has already been discovered as part of the existing search.

There is an additional experience which we can also extract using this concept. If a winning action is available to a player from state s_t, then we can declare that the opponent's previous move at s_{t−1} was a weak one since it gave the player a winning action. To create an experience using this insight the experience needs to be from the opponent's perspective, since the opponent loses, which gives z(s_{t−1}) = −1; and for the policy π(s_{t−1}), the element representing that action becomes 0 (never selected) and the policy is re-normalised. An additional experience cannot be created for the opponent if the player's terminal-state is a losing one, unless all actions lead to a loss, because it is possible that the player overlooked a winning action. We could potentially create this additional s_{t−1} experience for a losing terminal-state if and only if the search were extended to confirm that all actions were losing; however, there is no guarantee that the additional search would result in a conclusive terminal-state, so we do not attempt to create an experience in this case.

By applying terminal priming to create additional experiences we are potentially introducing a large number of terminal-states, and when we combine this with the smaller state-space-size caused by applying a curriculum we have an added risk of overfitting to the smaller state-space. This problem can be exacerbated if the terminal-value calculation is trivial, as in the game of Reversi, recalling that for Reversi a winner is determined simply by the difference between the number of pieces for each player on the board. In the case of Reversi these additional terminal experiences add more examples for a trivial part of the problem space. Likewise the game of Racing-Kings has a simple terminal-value-function, since a terminal-state occurs when the King is on the final row. Despite being a solved game, Connect-4 has a more complicated terminal-value-function since winning depends on the spatial relationship between pieces. For these reasons we only add these additional terminal experiences for the very first training cycle. We view the problem of overfitting during the first training cycle to be less critical when we consider that the weights are initially randomised and that the existing approach is to learn from a large proportion of uninformed experiences.
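A minimal sketch of the construction described above is given below. The data structures (the mcts_policy array and the terminal_results mapping from explored action index to win/loss outcome) are hypothetical stand-ins for the thesis implementation, and the opponent-perspective experience for s_{t−1} is omitted for brevity.

```python
import numpy as np

def terminal_primed_experiences(state, mcts_policy, terminal_results):
    # terminal_results maps an explored action index to +1 (winning terminal state
    # for the player to move) or -1 (losing terminal state).
    experiences = []
    for action, result in terminal_results.items():
        if result == +1:
            # Winning action: the policy always selects it and the value is the win reward.
            pi = np.zeros_like(mcts_policy)
            pi[action] = 1.0
            experiences.append((state, pi, +1.0))
        elif result == -1:
            # Losing action: never select it; zero that entry and re-normalise the policy.
            pi = np.array(mcts_policy, dtype=float)
            pi[action] = 0.0
            if pi.sum() > 0.0:
                pi /= pi.sum()
                experiences.append((state, pi, -1.0))
    return experiences
```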
This approach of adding additional terminal-states could potentially be applied for the duration of training; however, in our opinion, once these terminal-states are learned the focus of training should be the earlier stages of the game. This experiment employs a similar method to the preceding chapters, where we compare the performance of two agents simultaneously:

• an auto-curriculum-player using the methodology from Chapter 5; and
• an auto-curriculum-player utilising the terminal priming method covered in this section, which we call the terminal priming player.

We conduct periodic competitions for each player against a reference-opponent and record the success-rate to determine the player’s performance. We also record key metrics throughout the training of each player to enable further insights.

6.2.2 Results

Figures 6.1, 6.2 and 6.3 show the performance of a terminal priming player and an auto-curriculum-player for the games of Connect-4, Reversi and Racing-Kings respectively. Plot (a) shows the average success-rate against the reference-opponent for three separate training runs, and plots (b) - (g) show internal characteristics of the agent for each training run as learning progresses.

6.2.3 Discussion

Performance

The results show that for Connect-4, in Figure 6.1(a), terminal priming makes a significant difference to agent learning; however for Reversi and Racing-Kings, Figures 6.2(a) and 6.3(a), the performance is similar. We expected that the terminal priming method would be better suited to games with complex terminal-value-functions like Connect-4. For games with trivial terminal-value-functions, the fact that in the long run the performance of the terminal priming player is similar to that of the player without terminal priming means that this method can be used in general applications and not just in situations where the complexity of the terminal-value-function is known in advance. For Reversi and Racing-Kings we observe an initial slow-down of learning, as indicated by the first data-point in their success plots. Recall that both Racing-Kings and Reversi have trivial terminal-value-functions; adding additional terminal-experiences therefore increases the number of experiences to be processed during training without necessarily improving the learning outcome, resulting in a slowing down of the early training cycles. For the game of Racing-Kings, Figure 6.3(a), it appears that using terminal priming might be superior in the long-term, overcoming the initial slow-down; however the confidence-bounds still overlap slightly at the end of the experiment, meaning that we cannot conclusively declare that terminal priming is better from our experiment. The Racing-Kings result may indicate that priming the network with terminal experiences improves training over the long-term for this game, however further work is required before that conclusion can be made. Instead we can conclude that terminal priming is an effective method where the game has complex terminal-value calculations and is at least neutral for games which have simple terminal-value calculations.

Internal Metrics

The remaining plots (b) - (g) provide metrics derived from the experience-buffer as learning progresses for each of the individual training runs.

Reversi, Figure 6.2(b) - (g). We focus on the game of Reversi for this discussion because the game effectively has a fixed number of moves. For this experiment we also continued training for longer than usual in order to observe the loss once the curriculum finished (i.e. once the agent was learning from the entire game). As with the previous chapter, the search-depth plot (b) appears noisy, resulting from the application of the automated curriculum; however at around 5000 minutes the large variations in search-depth reduce and eventually become an increasing gradient for runs 1 and 3. The length of a game of Reversi is almost always 60 moves - with very few games taking fewer moves. When we consider the curriculum distance plot (g) we see that at around 5000 minutes the curriculum distance exceeds the typical game length for runs 1 and 3 - meaning that the curriculum incorporates the entire game from this point onwards, i.e. it has finished. This means that the state-space size for newly created experiences is constant from 5000 minutes for runs 1 and 3, and it also means that the curriculum no longer impacts the creation of experiences. Creating an experience for every move in a game means that the average distance-to-terminal for the total experiences will be half the number of moves in a typical game, so for Reversi this will be a distance of approximately 30 moves. Plot (e) shows that at around 12000 minutes the average distance-to-terminal for the experience-buffer is 30, recognising that it takes a number of training cycles to fully replace the buffer at 20% replacement per cycle. This experiment was extended to observe how the loss changes after the curriculum has finished, and in the previous chapter we stated that we would expect to see the loss decrease when the curriculum finished. As with previous observations we see the loss increase as the curriculum increases, and once the curriculum is finished (5000 minutes) we see the loss decreasing.

Total experiences in the buffer, plot (d). We note that for the terminal priming player learning the games of Racing-Kings and Connect-4 (Figures 6.3(d) and 6.1(d)) the number of experiences is initially high; however this is not the case with the game of Reversi (Figure 6.2(d)). The nature of Reversi is that the board has to be nearly full before the game ends. Terminal-states in Reversi will only be found after having played nearly all of the moves, particularly if the search-depth is shallow, whilst in Racing-Kings and Connect-4 terminal states can be found after a smaller number of moves. The consequence of this characteristic of Reversi is that more terminal-states are found and added for Connect-4 and Racing-Kings than for Reversi.

Figure 6.1: Connect-4. Auto-curriculum player with and without terminal priming. (a) success rate vs reference-opponent, (b) average search-depth, (c) max search-depth, (d) number of experiences stored in the buffer, (e) the average distance-to-terminal of buffer experiences, (f) the average loss for experiences in the buffer and (g) the curriculum value ν. Run 2 was performed on a faster GPU than runs 1 and 3.

Figure 6.2: Reversi. Auto-curriculum player with and without terminal priming. (a) success rate vs reference-opponent, (b) average search-depth, (c) max search-depth, (d) number of experiences stored in the buffer, (e) the average distance-to-terminal of buffer experiences, (f) the average loss for experiences in the buffer and (g) the curriculum value ν. Run 2 was performed on a slower GPU than runs 1 and 3.

Figure 6.3: Racing-Kings. Auto-curriculum player with and without terminal priming. (a) success rate vs reference-opponent, (b) average search-depth, (c) max search-depth, (d) number of experiences stored in the buffer, (e) the average distance-to-terminal of buffer experiences, (f) the average loss for experiences in the buffer and (g) the curriculum value ν. Run 2 was performed on a slower GPU than runs 1 and 3.

6.3 Incremental-Buffering

6.3.1 Method

The experience-buffer stores all experiences which are created during self-play and is used to train the neural-network as described in Chapter 3. The size of the experience-buffer is constrained by the computing resources available, however it needs to be large enough to represent the diversity of the game, otherwise there is a risk that the neural-network will overfit to a smaller portion of the game. As explained in Chapter 3, experiences are created and saved for each move made during a self-play game, and we measure the size of the buffer in games played rather than experiences. A small number of games in the buffer heightens the risk of overfitting since experiences from the same game are related and usually highly correlated. This is why the number of games in the experience-buffer, rather than the number of individual experiences, is used as the precondition for starting training, i.e. to ensure a level of diversity. We provide no analysis with respect to how to set the maximum experience-buffer size, except to say that it should be, in general, as large as the computing resources permit. The largest possible buffer size ensures the largest diversity of board positions for the agent to learn from. In our incremental-buffering method we relax our concern about overfitting for the initial training cycles and start with a smaller buffer-size, increasing it with each training cycle until reaching the maximum buffer-size. Whilst we accept that some overfitting may occur, the smaller buffer-size is still sufficiently large to ensure diversity of experiences. The benefit is that the initial buffer-size is a fraction of the maximum size, which results in the initial training cycles completing faster. The parameters used to increase the buffer are shown in Table 6.1. As discussed in Section 6.2, we hypothesise that overfitting for the early training cycles will be no more problematic than the existing wasted resources of generating and learning from uninformed experiences. Whilst we relax our concern about overfitting it cannot be fully disregarded, and as such we ensure that the minimum buffer size still retains some level of diversity across a number of games. The conduct of this experiment is consistent with the preceding chapters, where we compare the performance of two agents simultaneously:

• an auto-curriculum-player using the methodology from Chapter 5 as the base-line player; and
• an auto-curriculum-player utilising the incremental-buffer priming method covered in this section, which we call the incremental-buffer player.

We conduct periodic competitions for each player against a reference-opponent and record the success-rate to determine the player's performance.

6.3.2 Results

Figure 6.4 shows the effect of employing incremental-buffering in the game of Reversi while Figure 6.5 shows the effect while playing Racing-Kings.

Game           Initial Buffer Size (n games)   Multiplier   Maximum Buffer Size (n games)
Reversi        500                             1.5          2000
Racing-Kings   500                             2            4000

Table 6.1: The incremental-buffer priming parameters. The multiplier is applied each training cycle to the buffer size until it reaches the maximum buffer size.
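The growth schedule implied by Table 6.1 can be written as a short generator; this is a hedged sketch rather than the thesis code, with the Reversi parameters used purely as an example.

```python
def buffer_schedule(initial_size, multiplier, max_size):
    # Yield the buffer size (in games) for successive training cycles,
    # multiplying each cycle until the maximum buffer size is reached.
    size = initial_size
    while True:
        yield min(int(size), max_size)
        size *= multiplier

# Reversi parameters from Table 6.1: 500, 750, 1125, 1687, 2000, 2000, ...
sizes = buffer_schedule(500, 1.5, 2000)
first_cycles = [next(sizes) for _ in range(6)]
```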

We have not included Connect-4 in this experiment as the time it takes an agent to progress through the early training-cycles is already fast; as such we were not expecting the Connect-4 experiment to be very informative. The results show that using incremental-buffer priming can improve the speed of learning. Whilst these results show a benefit in using incremental-buffer priming, initial experiments which used smaller initial buffer sizes or smaller multipliers showed improvement over the short-term, then a period of un-learning. Importantly, in Figure 6.5 the agent not using an incremental-buffer shows its first data-point occurring after 2000 minutes while the agent using an incremental-buffer has its first data-point much earlier. This difference in the time of the first data-point shown in Figure 6.5 is the key benefit of using incremental-buffering. The difference also occurs in Figure 6.4, although it is less pronounced, in part due to the aggregation of data-points every 1000 minutes to obtain the confidence-bounds, but also resulting from the combination of the chosen parameters in Table 6.1 and the convergent nature of the game.

6.3.3 Discussion

Incremental-buffer priming is a simple method which can be employed to speed up training by spending less time in the phase where the agent is poorly informed. Recall our discussion about the visibility-horizon1 and how it propagates backwards from terminal-states with each training cycle, and that a training cycle includes sufficient self-play games to fill the buffer, concluding with training the agent on the experiences in the buffer. By using a smaller buffer-size, self-play finishes earlier since it creates fewer experiences, and training finishes earlier since it has fewer experiences to train on. With each completed training cycle the visibility-horizon progresses. The prevailing wisdom for preventing overfitting is that the buffer-size should be as large as possible, contrary to this approach. However, as we have stated previously, we believe it is acceptable to risk overfitting for the early training cycles given that the current state-of-the-art approach uses uninformed experiences. By priming the agent with a smaller number of experiences the result is an improvement in the speed of learning. Figure 6.5 clearly shows the first data-point of the non-incremental-buffer player (base-line player) trailing the incremental-buffer player by over 1000 minutes, yet in the Reversi experiment in Figure 6.4 the difference is not as clear.

1 The visibility-horizon is the measure of the distance from a terminal-state where experiences are useful for training.

Figure 6.4: Reversi. Auto-curriculum-player with and without incremental-buffer priming.

Figure 6.5: Racing-Kings. Auto-curriculum-player with and without incremental-buffer priming.

Table 6.1 shows the parameters used to increase the buffer; for Reversi the minimum buffer size is 500/2000 = 1/4 of the maximum buffer size. This means that for the first training cycle the incremental-buffer player plays 500 games while the base-line player plays 2000 games. The game of Reversi is convergent and almost always has a game length of 60 ply. Both players employ the auto-curriculum and the minimum curriculum distance was set at 4 ply, which means that for each game only 4 plies are played. This means that for the very first training cycle 4 × 2000 = 8000 experiences are created for the base-line player versus 4 × 500 = 2000 experiences for the incremental-buffer player.

The Racing-Kings result in Figure 6.5 differs from the Reversi result because Racing-Kings has an unpredictable game length due to its non-convergent nature and, in part, because of the parameters used for incrementing the buffer. Table 6.1 shows the ratio between minimum and maximum buffer-size as 500/4000 = 1/8 for the Racing-Kings experiment. The larger maximum buffer-size creates a more pronounced difference between the two players because the player with incremental buffering completes its initial training cycles much earlier. The implementation of the curriculum is to play a random number of moves until the game is the curriculum-distance from the expected game length - however in the first training cycle accurate game-length information does not exist. In practice, because random moves do not necessarily lead to terminal-states, the game is usually played randomly until an end-game is found and then rewound by the curriculum-distance (initially 4 ply). Once the game is rewound there is no guarantee that the game is still near the end, since both players will be building a search-tree to inform their decisions going forward, making the impact of the additional experiences more of an issue for Racing-Kings than for Reversi. If the maximum buffer-size is large then the benefits of applying this method were found to be even more pronounced.

There is a real risk of overfitting using this method, however provided it is constrained to the initial training cycles the agent using incremental-buffer priming progressed its learning faster. A buffer of 500 games was the minimum size which we found worked for these games, however we observed during some experimental runs that the rate of progression of the buffer-size was a critical factor. We observed in some experiments that when the buffer-size progressed too slowly there would be an immediate improvement in success-rate, only to be followed by a learning loss at later stages of training that took longer to recover from. From our experiments, it seems almost inconsequential how small the initial buffer-size is, provided this small size is not retained for many training cycles. The larger the maximum buffer-size is, the larger the minimum buffer-size can be while still expediting the early stages of learning. Priming the network by incrementally increasing the buffer size is an effective method provided consideration is given to the additional risk of overfitting to a smaller part of the state-space. The gain obtained from the incremental-buffer method outweighs the risk of overfitting during the early training periods provided the size of the buffer results in a sufficiently diverse set of experiences.
Combining both methods is likely to result in further improvement in the speed of learning during early training cycles, however we leave this experiment for future work.

6.4 Summary

The methods presented in this chapter can be used to prime the neural-network and improve learning outcomes when the agent is first learning. The predominant view in machine-learning is to avoid training on narrow portions of the state-space, however as we have shown in this chapter this concern can be relaxed to some degree when dealing with the initial training cycles of a neural-network/tree-search agent. Our hypothesis that the initial training cycles are mostly comprised of uninformed experiences led to our view that the earliest training cycles can be treated differently despite the increased risk of overfitting. We have found that a buffer containing a disproportionate number of terminal experiences, or a smaller number of experiences overall, does not undermine learning when priming the network. The terminal priming method increases the number of well-informed experiences available to train the agent, however these experiences are all terminal, narrowing the diversity of the experience-buffer to predominantly end-games. The incremental-buffer priming method narrows the diversity of the experience-buffer by playing fewer games, resulting in the buffer potentially representing only a small fraction of the possible state-space. The end-game-first curriculum also narrows the diversity of experiences in the buffer and, when combined with these priming methods, creates additional risks of overfitting. We have found that overfitting is not a major issue for a neural-network/tree-search agent learning by reinforcement-learning provided the small state-space is short lived and used only for the early training cycles. We have shown that an agent's learning benefits from reducing the time it spends in the initial training cycles by employing a rapidly changing incremental-buffer. We have also found that incorporating additional terminal-states found during the extant tree-search can provide a measurable improvement in learning for games where the terminal-value calculation is complex. Where the terminal-value-function is not complex this method does not significantly impede learning.

Chapter 7

Maximising Experiences

Highlights

• Early-stopping methods can be used to extract the maximum learning value from a generated experience.
• Employment of early-stopping methods can reduce the need for hyper-parameter tuning - training epochs and experience retention.
• Spatial-dropout is a promising method for future research.

7.1 Introduction

Creating a well-informed ground-truth experience using tree-search is not guaranteed, especially if the experience is based on a board position which is a long way from an informed node1. By implementing an end-game-first curriculum we are attempting to determine, for any board position, whether an informed experience will be generated by performing a tree-search, rather than blindly creating experiences for every possible board position. We have established that there is some optimum curriculum which, if known, would only permit a search to be conducted for board positions that result in informed experiences, i.e. experiences that are useful for training the agent. For any created experience there exists some measure of the knowledge which it has acquired, and in Chapter 5 we quantified this in crude confidence terms based on an experience's distance from a terminal state. Ideally we would allocate more time to training the agent on experiences which are highly informed, less time on experiences which are on the threshold of being informed, and no time on uninformed experiences. The problem with this approach is that when using a neural-network/tree-search agent which learns via reinforcement-learning, informed experiences are only a small subset of the full state-space during the early training cycles. Presenting only a small subset of the state-space during training creates a risk of the agent learning only that small subset of the problem space, i.e. overfitting. We can reduce the risk of overfitting by ensuring that the curriculum is adjusted rapidly or by performing only a small number of training epochs per training cycle, however the risk still remains.

1 See Chapter 5 for a full discussion on how informed the agent's decision relating to a board position is.


The end-game-first curriculum approach we have presented in this thesis mitigates the risk of overfitting by seeking to progress in line with informed experiences and the expectation that the excluded experiences were not a true reflection of the environment. For some early experiments conducted in Chapter 4 we did observe overfitting using the approaches in this thesis, where the number of epochs per training cycle was too large and where the size of the experience-buffer was too small. We have established that each experience contains some level of knowledge about the environment, and it stands to reason that the more knowledgeable the experience is, the more training we would like to use it for. This introduces the concept of varying the number of training epochs based on the training value of the experiences, since conducting more epochs will utilise the experiences more. A method called early-stopping is commonly used in supervised-learning problems; it varies the number of epochs depending on the agent's learning progress by using the loss function to inform whether to halt or continue training. Early-stopping is commonly employed in the supervised-learning literature but is rarely used in reinforcement-learning, and Bengio's practical advice regarding early-stopping methods only mentions supervised and unsupervised-learning problems [158]. We demonstrate that early-stopping methods can be used during the training phase of the reinforcement-learning training pipeline described in this thesis, and that doing so results in enhanced learning for our agent. As far as we are aware this is a novel application of the existing early-stopping method.

The neural-network's residual-network contains the bulk of the trainable weights in the CNN. The neural-network architecture diagram from Chapter 3 is shown again here for convenience in Figure 7.1. To reduce overfitting, dropout is commonly used in fully connected network layers, however the use of dropout in convolutional layers is relatively new. As discussed in Section 2.4.4, spatial-dropout [122], which removes a portion of the CNN filters rather than simply the trainable weights, can be used for convolutional layers, and we perform a preliminary experiment where we investigate its effects on an end-game-first curriculum-learning agent. When using dropout a constant proportion of the network is dropped out, and for fully connected networks Srivastava et al. found that, for their experiments, either 20% or 50% dropout was effective [124]. By employing dropout fewer of the weights are changed with each back-propagation step, therefore on average the rate of change of the weights is lower when compared with not using dropout. We have already identified that our agents avoid overfitting with suitable parameter selection, however we are interested in whether dropout can provide additional benefits to learning.

Figure 7.1: Architecture for the AI players with input s from the environment and outputs pθ(s) and vθ(s). Recall that s is a comprehensive description of the game state including board description.

Our presented method of employing spatial-dropout differs from the existing literature in that we vary the amount of dropout based on the known portion of the state-space which is being used for training, as dictated by the curriculum-parameter. For a curriculum-player we set a maximum dropout level of 50%, and when the curriculum admits more than 50% of a game's positions for experience creation we proportionally decrease the dropout value. The amount of dropout is another hyper-parameter requiring human knowledge, as discussed in Section 2.4.4. However, when applying dropout to directly address an end-game-first curriculum this hyper-parameter can be calculated. Although the techniques covered in this section already exist in the literature for supervised-learning, their use as part of a neural-network/tree-search agent employing an end-game-first curriculum is as yet unexplored. In Section 2.4.4 we discuss other regularisation methods which could also be used to prevent overfitting. The agent incorporates L2 regularisation in its loss function; however, it may benefit instead from using weight decay methods. Both L2 regularisation and weight decay methods include a hyperparameter indicating the magnitude of regularisation to apply. It is possible that varying these hyperparameters as the curriculum changes would also mitigate any additional risk of overfitting introduced by the methods in this thesis. We leave this investigation as future work.

7.2 Early Stopping For Reinforcement-Learning

7.2.1 Method

The back-propagation algorithm adjusts the weights of the neural-network based on the experiences that are stored in the experience-buffer. One epoch is complete when all of the experiences have been processed by the back-propagation algorithm, and for effective training multiple epochs are usually conducted. We conducted a brief experiment which found that 20 epochs was an effective number of training epochs for our agent, as outlined in Figure 3.4 of Chapter 3. Figure 3.4 is reproduced below for convenience as Figure 7.2. We can see in Figure 7.2 that additional epochs do not necessarily improve the learning outcome, and with too many epochs learning can be significantly hampered. It is highly likely that the optimum number of epochs differs from one training cycle to the next. For supervised-learning problems, where all the ground-truth data exists prior to training, a key consideration is how many epochs should be conducted. One approach would be to create a hyper-parameter for the number of epochs and perform multiple training runs to test the performance of different values. To determine the best agent, a portion of data is held back from training; at the conclusion of training, the best agent is the one that most closely predicts this held-back ground-truth dataset. This approach is inefficient as it requires the training of multiple agents. The method of early-stopping allows a single agent's training to be halted when the error increases from one epoch to the next. Prechelt, in Early Stopping - But When? [123, Chapter 2], describes the early-stopping process with these four steps:

Figure 7.2: Reproduced from Figure 3.4 for convenience. Determining suitable epoch and replacement rate parameters. We compare the performance of an agent using 20 epochs and 20% replacement (dotted lines) against agents using 100 epochs with 20% replacement and 20 epochs with 100% replacement of experiences. Note that the red plots were performed on the same system simultaneously, and the green plots were performed together on the same system. We can see that 20 epochs and 20% replacement outperforms the other parameter values. To obtain this plot two experiments were run; both use 20 epochs and a 20% replacement rate as a fixed reference point alongside the alternative set of parameters. This is done to mitigate any variations arising due to processor load. The colours indicate that they were run simultaneously.

1. Split the training data into a training set and a validation set.

2. Train only on the training set and evaluate the per-example error on the validation set every n epochs.

3. Stop training as soon as the error on the validation set is higher than it was the last time it was checked.

4. Use the previous weights as the final agent.

Bengio [158] provided an update to the early-stopping method, suggesting that the condition for stopping should be when the change in loss from one epoch to the next, ∆L_e, is below some threshold for a consecutive number of epochs, governed by a patience parameter p; i.e. stop when ∆L_e < T for p consecutive epochs, where ∆L_e = L_e − L_{e−1}, e is the epoch number, T is the loss threshold parameter and p is the patience parameter.

Halting training based on ∆L_e < T, instead of Prechelt's suggestion of ∆L_e > 0, means that the latest weights can be used for the final agent instead of the second-last weights. The patience parameter p provides a mechanism to overcome temporary learning resistance during back-propagation, commonly known as local minima, improving the learning outcome for the network. Our reinforcement-learning agent trains itself over numerous training cycles; for every training cycle some number of epochs must be performed to adjust the weights, and by using early-stopping the number of epochs can vary. It seems apparent that the optimum number of epochs would be different for each training cycle of a reinforcement-learning agent, however despite this the current approach is to use a fixed number of epochs. In this experiment we train an auto-curriculum-player using early-stopping methods (the early-stopping player) and an auto-curriculum-player without early-stopping (the curriculum-player) to observe the effects. Prior to training the early-stopping agent we hold back the last 10% of experiences that were created and use these experiences as the validation set. The remaining 90% of experiences are used to train the neural-network. We use the most recently created experiences as the validation set because they are likely to be the most informed experiences, and they will not have been previously seen by the neural-network. Even if a board position has been seen by the network, the ground-truth values will be different due to the updated network.

We allow the agent to train for up to 100 epochs with the stopping condition being ∆L_e < 0.1 for p = 10, i.e. T = 0.1 for 10 consecutive epochs. For one experiment we use a patience parameter of p = 20 epochs. These parameters were determined by running a short test experiment like the one performed to determine the replacement rate. Like many of the hyper-parameters for the neural-network agent, these parameters were selected semi-arbitrarily based on experience and preliminary tests. One of our findings is that our parameter selections in this case are possibly sub-optimal, and we discuss possible future work to automate the control of these parameters, since a fixed value is not likely to be optimal for every training cycle for the same reasons that a fixed curriculum is not suitable. A larger loss threshold was found to decrease the number of epochs performed while a smaller threshold increased the number of epochs.
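The stopping rule can be expressed as a short training loop. The sketch below is illustrative only: network.fit_one_epoch and network.evaluate_loss are hypothetical stand-ins for one back-propagation pass and a validation-loss evaluation, and the epoch-to-epoch change in loss is taken in magnitude.

```python
def train_with_early_stopping(network, train_set, val_set,
                              max_epochs=100, loss_threshold=0.1, patience=10):
    # Train until the change in validation loss stays below loss_threshold (T)
    # for `patience` (p) consecutive epochs, or until max_epochs is reached.
    previous_loss = network.evaluate_loss(val_set)
    quiet_epochs = 0
    for epoch in range(1, max_epochs + 1):
        network.fit_one_epoch(train_set)
        current_loss = network.evaluate_loss(val_set)
        if abs(current_loss - previous_loss) < loss_threshold:
            quiet_epochs += 1
        else:
            quiet_epochs = 0
        previous_loss = current_loss
        if quiet_epochs >= patience:
            break          # stopping condition met; the latest weights are kept
    return epoch
```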

7.2.2 Results

Figure 7.3 shows the effects of using early-stopping with an auto-curriculum-player in the game of Reversi, and Figure 7.4 shows the effect while playing Racing-Kings. Whilst the agent with early-stopping outperforms the agent without, the relative performance of the two players is close; however, one of the key benefits of this method is the simplification of hyper-parameter selection. Plot (b) shows how many epochs are performed at each time period. We can see that the auto-player without early-stopping takes 20 epochs for each training cycle, whereas for the early-stopping player we see some initial variance followed by a constant number of epochs. It is likely that the number of epochs conducted for the early-stopping player, shown in plot (b), would be more varied if the stopping condition were smaller, the replacement rate were higher, or the patience parameter increased.


Figure 7.3: Reversi: Early-stopping player vs fixed-epoch player. (a) success rate vs reference-opponent, (b) number of epochs conducted per cycle. Plot (a) is the accumulation of results from three separate training runs which are shown separately in (b). Note that run 1 uses a patience parameter of 20 epochs while the other runs use 10.


Figure 7.4: Racing-Kings: Early-stopping player vs fixed-epoch player. (a) success rate vs reference-opponent, (b) number of epochs conducted per cycle. Plot (a) is the accumulation of results from three separate training runs which are shown separately in (b). Note that run 1 uses a patience parameter of 20 epochs while the other runs use 10.

7.2.3 Discussion

Employing early-stopping methods within a reinforcement-learning training pipeline results in improved learning outcomes when the agent is also employing an end-game-first curriculum. We stop short of declaring early-stopping a beneficial method for reinforcement-learning without an end-game-first curriculum, as we only experimented with end-game-first players. It remains possible that early-stopping methods hinder learning without a curriculum due to the large number of uninformed experiences in the buffer during the early training cycles. Consider the situation where the experience-buffer is comprised of numerous random (uninformed) experiences and a large portion of poorly informed experiences. When the early-stopping method is applied, it is possible that the agent will persist with a high number of training epochs attempting to fit the neural-network to this low quality data - in our experiment up to 100 epochs. The result would be that the weights are changed in such a way that they are further from the actual solution. Then, for the following training cycle, the generated experiences could be even more poorly informed, preventing the visibility-horizon from propagating backwards to the start of the game and hindering learning. This is why we only claim that early-stopping works when employing an end-game-first curriculum, not in general reinforcement-learning approaches. In both figures we see the player without early-stopping taking 20 epochs for each time period, as expected since this was set for that agent. Figures 7.3(b) and 7.4(b) show that initially a large number of epochs are performed during the training of the early-stopping player, then the number plateaus to a constant value. The plateau is slightly above the patience parameter set for each early-stopping player. For instance, in Figure 7.4(b) the early-stopping player in run 1 performs in excess of 50 epochs initially, then completes 21 to 22 epochs, near the preset patience value p = 20. We would expect this to occur if the stopping condition is being too easily met. Three parameters can be adjusted to make the stopping condition more difficult to achieve:

• decrease the loss threshold parameter T ,

• increase the patience parameter p, or

• increase the replacement rate of experiences after each training cycle.

It is not clear whether continually achieving the stopping condition (resulting in a constant number of epochs) is optimal or whether it would be better to have varying epochs each training cycle. It is our opinion that readily achieving the stopping condition is indicative of a problem set that is not sufficiently challenging for the agent - however further work is required to determine this. The observed plateau of the number of epochs for the early-stopping players indicates that the agent can readily achieve the stopping condition. This might indicate that the stopping condition is too easy, however it may also be possible to use this as an indicator that the agent is well trained on the state-space which has been presented to it, by comparing the patience parameter p with the number of epochs conducted e. If e ≈ p then we could infer that the problem space is not challenging enough for the agent and use this to adjust the curriculum of an automated end-game-first curriculum-player - recalling that we currently use the search-depth to control the curriculum. Using the relationship between e and p may also provide an opportunity to automatically decrease the loss threshold T, or increase the patience parameter p. Further work is required to ascertain whether the relationship between e and p can be used in these ways to further improve learning.

The training-pipeline currently used by the agent includes removing 20% of the buffer at the end of each training cycle. It is possible that the early-stopping agent continually meeting the stopping condition might provide some insight into whether the rate of replacement is sufficient. It may be possible to adjust the replacement rate in much the same way that the auto-curriculum was adjusted, by always seeking to ensure that the number of epochs moves about some constant value, with the replacement rate resisting any changes in the number of epochs conducted. We conclude that the use of early-stopping aids learning and simplifies parameter selection for a reinforcement-learning agent employing an end-game-first curriculum. Our research in this section opens up a number of further research opportunities which may lead to the automation of the replacement rate, the loss threshold parameter and the patience parameter; this may further enhance the rate of learning, making the algorithm more general and extracting additional benefit from any generated self-play experience.

7.3 Spatial dropout

7.3.1 Method

A neural-network is overfitted when it is able to make accurate predictions for the data which was used to train it, but performs poorly when making predictions for data that it has not explicitly seen. More subtly, the network is still overfit if it can only make predictions for a small portion of the entire state-space - unless it is designed specifically for that smaller set. If only a small portion of the state-space is ever presented to the neural-network then it stands to reason that the neural-network will only be able to learn to make predictions for that portion of the state-space which it has seen. Overfitting can occur when there is insufficient data, poor data diversity, or if back-propagation is applied too many times. Overfitting is a fundamental machine-learning challenge driving the requirement for large data-sets, and a number of regularisation methods exist to prevent it, as discussed in Section 2.4.4. As detailed in Chapter 3 we utilise batch-normalisation as the primary method of regularisation, in line with related work. We also ensure that the curriculum, which narrows the state-space, progresses at a sufficient rate to avoid overfitting. Dropout has been found to be effective in reducing overfitting for fully connected networks and, more recently, spatial-dropout has been used for CNNs. Perhaps due to how recent the spatial-dropout method is, it has not been prolifically used in CNN-based reinforcement-learning agents. We have performed a quick study to determine whether the addition of spatial-dropout improves learning for an end-game-first curriculum-player, given that the dynamics of employing dropout may differ when using a curriculum-learning agent compared to an agent without. When employing an end-game-first curriculum, initially only the positions that are near terminal states are used for learning, i.e. only a small portion of the actual problem space is used for training. Training on a small portion of the problem space increases the risk of the network overfitting. The end-game-first methods discussed in Chapters 4 and 5 mitigate the risk of overfitting by progressively increasing the problem space before overfitting can occur. With the way that we progress the end-game-first curriculum, the curriculum-parameter can be used to establish the portion of the state-space which the agent is learning. For example, the game of Reversi has approximately 60 moves on average, and if the auto-curriculum parameter for a given training cycle includes 40 moves from terminal then we can estimate that 40/60 = 2/3 of the state-space is being presented to the agent. We can use this calculation to set the proportion of dropout for the agent. The research in this section is a preliminary study to observe the effects of spatial-dropout on a neural-network/tree-search agent using an end-game-first curriculum with reinforcement-learning. Our aim here is to determine whether further research is warranted with relation to spatial-dropout specifically on an agent using the improvements we presented in this thesis, recalling that we already use batch-normalisation in the network. We perform two preliminary experiments:

• Comparing the effect of an auto-curriculum agent using spatial-dropout with an agent without; and

• Observing how variable spatial-dropout compares with fixed dropout.

For this experiment we use all improvements discovered in the course of this research: auto-curriculum, incremental priming, terminal priming and early-stopping. Our experiments in this section are preliminary, with a view to informing whether to continue investigating how spatial-dropout can improve learning for an end-game-first reinforcement-learning agent.
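As an illustration, the sketch below computes a variable spatial-dropout rate from the fraction of the game admitted by the curriculum, following one plausible reading of the scheme described in the chapter introduction (dropout held at its 50% maximum while the curriculum covers less than half the game, then decreased proportionally). The function and its arguments are hypothetical, not the thesis implementation.

```python
def variable_dropout_rate(curriculum_distance, typical_game_length, max_rate=0.5):
    # Fraction of the game currently admitted by the end-game-first curriculum.
    fraction = min(curriculum_distance / typical_game_length, 1.0)
    if fraction <= 0.5:
        return max_rate                          # full 50% dropout while diversity is low
    return max_rate * (1.0 - fraction) / 0.5     # falls linearly to 0 as fraction approaches 1

# e.g. for Reversi (about 60 ply): variable_dropout_rate(40, 60) gives roughly 0.33
```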

7.3.2 Results

Figure 7.5 shows the relative performance of an agent with spatial-dropout and an agent with no dropout, both using an auto-curriculum and all improvements mentioned throughout this thesis. Figure 7.6 shows the relative performance of an agent with fixed spatial-dropout and an agent with variable spatial-dropout (maximum dropout of 50%), again with an auto-curriculum and all improvements mentioned throughout this thesis.

7.3.3 Discussion

The results shown in Figure 7.5 are inconclusive as to whether spatial-dropout is beneficial when applied to an end-game-first curriculum-learning player. The architecture of the agent includes the use of batch-normalisation, which has become the popular regularisation method for deep CNNs. The results may indicate that batch-normalisation is effective enough by itself, rendering spatial-dropout superfluous, as reported in [126]. We had thought that dropout might provide some additional benefit, particularly when all improvements were added to the agent, since they all in their own way reduce the state-space-size for some period of time.

Figures 7.5(a) and 7.5(b) are encouraging results and suggest that further research in this area may be fruitful. Of particular interest would be whether the use of spatial-dropout mitigates poor hyper-parameter selection in which an unmodified agent would be overfitted. The results shown in Figure 7.6 with regard to varying the dropout proportion were also found to be inconclusive. With the end-game-first curriculum we can estimate what proportion of the state-space is being presented to the agent; as such we hypothesised that dropping out an equal portion of the network would yield either better performance or more stable training. We were cautious with our variable-dropout agent, setting the dropout to the smaller of a maximum of 50% or the proportion of the state-space used for training by the curriculum. We expected to observe a difference between the player with variable-dropout and the 50%-fixed-dropout player when the curriculum included greater than 50% of the state-space, however the results are inconclusive, though promising. To extend this work it would be interesting to observe the learning behaviour of a variable agent with a maximum dropout of 90% against the fixed dropout agent. This would be a fairly significant departure from the current recommendations in the literature, however we hypothesise that since we already control the state-space proportion with the curriculum, setting the dropout to the same proportion would be beneficial to learning, either by decreasing learning time or decreasing sensitivities to hyper-parameters. Our variable method of applying dropout decreases the dropout based on the known portion of the game space being learned, ensuring that during the training cycles where there is less diversity due to the curriculum, more of the network is subject to dropout.

7.4 Summary

In this chapter we study the effectiveness of two existing methods from supervised-learning which are not commonly used for reinforcement-learning. The early-stopping method essentially allows a supervised-learning agent to train until all of the information is extracted from the prepared ground-truth examples. We made the case early in this thesis that self-play generation of experiences for a reinforcement-learning agent is very costly; as such, when the agent is being trained on these experiences it is beneficial to extract as much value as possible from them before discarding them and creating new ones. Each training cycle for a reinforcement-learning agent is different, with different experiences, different state-space-size and different gradients. As such, setting a fixed number of epochs for learning seems counter-intuitive, yet this is what the current state-of-the-art reinforcement-learning agents employ. We have shown that the same process of early-stopping that has been found to be beneficial to supervised-learning is also beneficial to reinforcement-learning. We have identified that further research in this area may yield further efficiencies in the use of self-play generated experiences.

Dropout was used in many of the early state-of-the-art image recognition networks as the neural-networks became deeper, however with the increase in use of CNNs the use of dropout seems to be diminishing in favour of other regularisation methods like batch-normalisation. Our curriculum-learning agent, however, creates a unique diversity problem by deliberately presenting only a portion of the state-space at any one time; this leads to enhanced learning and may benefit further from adaptive dropout. We have shown that spatial-dropout can still be used in conjunction with batch-normalisation for deep CNNs and have identified further research areas where variable-dropout may be employed to improve learning for an end-game-first curriculum agent.

Figure 7.5: Auto-curriculum-players with and without dropout (DO). (a) Connect-4 (b) Reversi (c) Racing-Kings

Figure 7.6: Auto-curriculum-players with fixed dropout (DO) and with variable-dropout. (a) Connect-4 (b) Reversi (c) Racing-Kings

Chapter 8

Pairwise Player Comparison With Confidence

Highlights

• How to compare two players using statistical hypothesis testing methods for proportions.

• A sequential stopping method for a pairwise trial retaining confidence.

8.1 Introduction

In computer game playing, researchers are often interested in the comparative performance of two players to indicate whether one playing method is better than another. To determine if one player is better than another, the two players often compete against each other in a tournament and their relative performance is measured by their win-rate. Alternatively, as done in this thesis, the two players can compete against a third opponent and their win-rates against that reference-opponent are compared.

The process of estimating binomial proportions with confidence is well established and is utilised in many scientific fields [159], however it is rarely covered in machine-learning game playing literature. In our case the proportion we are most interested in is a player's win-rate. A study in 2005 by Belia et al. found that many scientists misunderstood confidence intervals and the researchers recommended that guidelines be provided for researchers in each field [160], and more recently similar comments were made by Verdam et al. [161]. In the course of the research of this thesis it was observed that confidence intervals are less commonly reported when presenting machine-learning results - particularly in game playing, which suggests that perhaps the findings of Belia et al. are still relevant. Even for the premier scientific journal Nature, statistical confidence methods are not always required when reporting state-of-the-art game playing advancements [19, 30], although their presented results were a demonstrable improvement. The purpose of this chapter, in part, is to demonstrate the simplicity of using hypothesis testing to differentiate two players and to present a new sequential-stopping-with-confidence method for conducting a tournament. Our sequential-stopping method allows the competition to halt early if one player significantly outperforms the other.


One of the first steps in conducting research is to establish a hypothesis; an experiment is then designed and conducted to test that hypothesis. Typically a hypothesis cannot be definitively proven or disproven and as a result, findings are often expressed in probabilistic terms. When testing a new drug for instance, researchers hypothesise that the drug is effective in treating a particular illness. Then, a trial is conducted where the drug is given to a sample of patients and the instances of improvement are recorded. The distribution of symptoms in the treated patients is compared with a control population and if the distributions are sufficiently different the researcher can make a claim with some level of confidence about the drug's effectiveness based on statistical methods [156]. The statistical process of hypothesis testing thus enables researchers to quantify their claims about a hypothesis [162].

In this chapter we provide an outline of hypothesis testing as it applies to comparing two machine-learning players, then we provide a practical step-by-step guide for performing hypothesis testing to determine, with confidence, if one player is better than the other. We develop the method incrementally using empirical analysis to confirm that the user specified confidence levels are achieved. Although we provide a deep-dive into hypothesis testing from basic principles, all that is needed to determine the confidence levels for a pairwise player comparison is the number of games played n and the number of wins of the player w. We discuss in detail pairwise player comparison via a head-to-head competition, however the method also holds for comparing players against an independent reference-opponent. For a head-to-head competition a single distribution is obtained and the confidence intervals are measured against the value of 0.5 to determine the better player. However, when using an external reference-opponent the same method can be used to calculate the confidence intervals, in which case two distributions will be obtained: one for each player. If the extremities of both distributions, i.e. the confidence-bounds, do not overlap then this can be used to determine the better player.

The standard hypothesis test requires that a fixed number of trials (fixed-n) are conducted before the hypothesis is tested. After presenting the fixed-n method we then propose a variation which permits halting a trial early if one player is significantly outperforming another. As far as the authors are aware, our proposed procedure for sequential-stopping is a unique variation of the fixed-n method in the machine-learning literature and, if not new in statistical fields, then it is at least not commonly presented.

8.2 Hypothesis Testing

8.2.1 The Pairwise Player Comparison Problem

In Section 3.3.2 we explained the three stage training pipeline used by AlphaGoZero¹, being: self-play, optimisation and evaluation, where the best player is used to generate the self-play examples. It is during the optimisation stage that a new generation of player is created. The evaluation stage tests whether the newest player is better than the current best player. In AlphaGoZero, the criterion used to accept or reject the new player as the best player is from a tournament of 400 games between the best and the new player, and the new player is accepted if they win more than 55% of the games, i.e. the new player wins more than 220 games [30]. The explanation for these decision parameters is not covered in [30], however given their chosen number of evaluation games of n = 400 and the required number of wins w > 0.55 · n, if a decision is made to replace the best player it is done with close to 95% confidence that the new player is better. We will cover how this can be calculated in this chapter. What is not obvious from the reported AlphaGoZero results is that with only 400 games and 95% confidence the two players need to have a difference in probability of winning of approximately 0.2, i.e. the challenger has to have a true probability of winning of p = 0.6, for replacement to occur. The AlphaGoZero evaluation is one potential use-case for the methods contained in this chapter. Our sequential stopping method, explained in the next section, would also allow the competition to finish earlier if one player substantially outperformed the other and still retain the specified confidence in the decision.

¹ AlphaGoZero was the last generation of Deepmind's computer Go players to utilise a three step training process.

In a tournament consisting of n games between a player and an opponent², the total number of wins w for the player is recorded, and this can be used to estimate the player's probability of winning against the opponent using Equation 8.1; the opponent's probability of winning is q̂. Note that p̂ = 1 − q̂, i.e. the sum of both players' probabilities p̂ + q̂ = 1, because when one of the players wins a game the other player loses that game, i.e. it is a zero-sum game. This zero-sum property means that we can estimate the win-rate of one player and calculate the relative performance of the opponent. The problem then is simply to calculate the lower L and upper U limits of the true win-rate p and assess whether these limits have a value which can be used to make a determination about the players' relative abilities. If the two players' win-rates are labelled p and q, then the values required to declare one player better are L_p > 0.5 or U_p < 0.5 for a head-to-head competition, and when using a reference-opponent the two confidence intervals must not overlap, i.e. L_p > U_q or U_p < L_q.

p̂ = w/n    (8.1)

where:

p̂ is the probability of winning for one player, w is the number of wins, and n is the number of games conducted.

When competing against an opponent, the true probability of winning for the player is p - this is not known in advance and instead the competition is conducted to estimate p. The number of games in the competition determines how accurately the confidence-bounds reflect the true value p, and with enough games, L ≤ p ≤ U can be described in a way which is sufficiently accurate to declare that the two players have different playing abilities with a predefined level of confidence, if their performance in the competition is sufficiently different. If one player is better than the other, then it follows that the two players are not equal and p ≠ q, and specifically when using the head-to-head competition p > 0.5 or p < 0.5, i.e. p ≠ 0.5.

² The opponent can be either the other player of interest or the reference-opponent.

The remainder of our discussion revolves around performing a head-to-head competition between two players of interest, although only a small modification is required to adapt it to the case of playing against a reference-opponent. In the case of a head-to-head competition, we have established that only the results for one player are required to be maintained, and to simplify our notation we only refer to those results as being for the player - as opposed to the opponent.

8.2.2 Null-Hypothesis Testing

If the two players of interest, one we call the player and the other we call the opponent, are of equal strength then p = q = 0.5, recalling that p is the probability of winning against the opponent. If p = 0.5 then, after playing a large number of games, by rearranging Equation 8.1 we get w ≈ p̂ · n → 0.5 · n, or the estimated probability of winning for the player is as shown in Equation 8.2.

p̂ = w/n ≈ 0.5    (8.2)

If instead after the competition we discovered that p̂ ≉ 0.5 then we would be inclined to infer that one player was better than the other and could reject the idea that both players were of equal strength. This is essentially the underlying premise of hypothesis testing. First, we establish our hypothesis that both players are equal; then we conduct an experiment, in our case a tournament of n games; finally, the outcome of the experiment may or may not disprove the hypothesis. To formally define the hypothesis, we call the hypothesis which we are trying to disprove the null-hypothesis, H0. The alternate hypothesis, Ha, is what would be true if H0 is false. In the case of a pairwise player comparison our hypotheses are:

H0 : p = 0.5

Ha : p ≠ 0.5

Where p is the true, unknown, probability of winning. The experiment used to test H0 is a competition of n games between the players, recording the number of wins w for one player. If p̂ from Equation 8.2 is found to be significantly different from 0.5 we can reject H0 and accept Ha : p ≠ 0.5. If we find p̂ ≈ 0.5 then we conclude that our experiment could not disprove H0. In the next section we outline how to determine the criteria for dismissing H0 with some level of confidence.

Importantly, an experiment can not prove a null-hypothesis [163], it can only disprove it; e.g. for our game playing example we could always play more games and potentially discover the players are actually different. For example, if after n games in a competition we observe a player's win-rate as p̂ ≈ 0.5, which is predicted by the null-hypothesis, we might be tempted to declare that the players are equal, however we can only state that the experiment did not detect any potential differences between the players. To illustrate why we can never prove a null-hypothesis, if our null-hypothesis was p = 0.5 then imagine if we played n = 1000 games and the outcome resulted in p̂ = 0.5; the true win-rate of the player may still be p = 0.500001, a small difference between the players, however strictly both players are still not equal. Such a small difference might only be discovered after more games than originally planned.

The purpose of comparing the two players is to determine if they are different from each other. A game ending in a draw does not provide any information as to which player is better, as such draws are excluded from the competition for this hypothesis. Since draws provide no insight into whether one player is better or not, when a game is drawn it is discarded, as if it never existed. It may be tempting to attempt to utilise draws to infer some level of similarity between the players, however a null-hypothesis can not be proven. A null-hypothesis is a statement that is the status quo until proven otherwise, i.e. there is no statistical difference between two phenomena, and experiments are conducted to "gather empirical evidence to determine if it is unreasonable" [164]; drawn games do not in any way contribute to the evidence that the players are different. A null-hypothesis is based on a theoretically infinite population, as such there is always more evidence which could be gathered which might disprove it [165]. In our case the null-hypothesis is that the two players are presumed equal, until proven otherwise.

8.2.3 Determining Significance

A Bernoulli trial is an experiment with two outcomes [166]; in the case of a pairwise player comparison the outcomes are: win and loss. X is usually used to represent a success, and for our case a success is a win w, so for clarity we use X ← w. The number of wins in a competition is a discrete value, so X is a discrete random variable. If we define a win as 1 and a loss as 0 then for n games we can represent a sequence of independent Bernoulli trials as a binomial distribution B(n, p) [167], and doing so allows us to visualise the distribution of probabilities with the probability mass function (PMF). The PMF is defined more formally by Chen, "letting X be the number of successes in n independent Bernoulli trials with common success probability p; i.e. X is distributed as a binomial distribution B(n, p)" [168]. Practically, the common success probability p is a player's hidden probability of winning against their opponent. From Stewart, "the PMF for a discrete random variable X, provides the probability that the value obtained by X on the outcome of a probability experiment is equal to x" [169]. The PMF is accessible in Python's scipy.stats library and is common in statistics libraries for other languages; an example PMF is shown in Figure 8.1(a). We also find it informative to observe the probability mass over the p̂ values from Equation 8.2, as shown in Figure 8.1(b), although conceptually this becomes a little harder to discuss since the plot indicates the probability that the true population probability p is some value.

Figure 8.1: Plot (a) shows the probability mass function for the possible values for X, given the outcomes from n Bernoulli trials with probability of 0.65. Plot (b) shows the same data however the horizontal axis is converted to its sample probability using Equation 8.2. It can be seen from the two plots in (b) that the higher value of n results in a narrower probability distribution. The total probability under the curve is 1, since 0 ≤ p ≤ 1.
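
As a brief illustration of the quantities plotted in Figure 8.1, the following sketch (our addition) evaluates the PMF of B(100, 0.65) with scipy.stats.binom and relabels the horizontal axis as a win-rate via Equation 8.2; the plotting itself is omitted.

    from scipy.stats import binom

    n, p = 100, 0.65
    x_values = range(n + 1)
    pmf = [binom.pmf(x, n, p) for x in x_values]   # P(X = x) for B(n, p), Figure 8.1(a)
    win_rates = [x / n for x in x_values]          # same data on the p-hat axis, Figure 8.1(b)
    mode = max(x_values, key=lambda x: pmf[x])     # most likely number of wins, near n*p = 65
    print(mode, pmf[mode])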

When we make any claims resulting from the number of wins for a competition we can also make the same claim regarding the sample probability, which, for a game, is the win-rate achieved in a competition.

8.3 Confidence Bounds

Figure 8.2: Two PMFs for binomial distributions B differing only by n, with the red area showing the accumulated probability mass Σ_{x_i=0}^{L_x} P(x_i) = α = 5%. For B(100, 0.65) the 95% lower confidence bound (LCB) is L_x = 57 wins and for B(1000, 0.65) L_x = 620 wins, i.e. 95% of the probability mass is where X > L_x. Again this can be discussed in terms of a player's probability of winning against an opponent by dividing by n.

Since the PMF for B(n, p) provides the probability distribution for discrete values of X, we can accumulate a sequence of these probabilities to sum to any desired probability mass, e.g. for B(100, 0.65), using the PMF we can obtain the probabilities of winning 1 game, then 2 games, then 3 games etc. and continually accumulate these probabilities until we have reached the desired mass. More formally, if we define the desired probability mass as α and each discrete possible value for X as X ← {x_i = 0, .., n}, with the value of i being directly related to the number of wins, we define L_x as the i-th element in a summation where the sum of the probabilities equals α, as shown in Equation 8.3. We use the subscript x for L_x to denote the units which the bounds specify, because later we prefer to have the bounds specified as probabilities of winning (win-rates).

Σ_{x_i=0}^{L_x} P(x_i) = α    (8.3)

By doing this, we can then make a claim, with a probability of α, that X ≤ L_x, or inversely that there is a probability of 1 − α that X > L_x. 1 − α is called the confidence level and L_x is the lower confidence bound in number of successes. For example, if α = 0.05 and we apply this method to B(n, p) to obtain L_x, we can claim 95% confidence that X > L_x. The percentage point function (e.g. scipy.stats.binom.ppf) provides the ability to calculate the point L_x at which the accumulated sum of P(x) reaches α. This is how a confidence bound can be calculated from a binomial distribution, with α defined as the significance level and 1 − α defined as the confidence level. Figure 8.2 provides an example of how to obtain a desired mass for a confidence bound.
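
A short sketch (our addition) of that calculation: scipy.stats.binom.ppf returns the smallest number of wins L_x whose accumulated probability mass reaches α, which is the lower confidence bound described above.

    from scipy.stats import binom

    n, p, alpha = 100, 0.65, 0.05
    L_x = int(binom.ppf(alpha, n, p))   # smallest L_x with P(X <= L_x) >= alpha
    print(L_x)                          # 57 wins for B(100, 0.65), as in Figure 8.2
    print(L_x / n)                      # the same bound expressed as a win-rate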

P (x = Ux) until the probability is 1 − α, resulting in α of the probability mass being greater than Ux. If both the LCB and UCB are used to create an interval then the α confidence level for that interval is 1 − 2 , the mass that is between the two values is given by Equation 8.4;

α P (L ≤ x ≤ U) = 1 − (8.4) 2 For example, if P (x ≤ L) = 0.05 and P (x ≥ U) = 0.05 then 90% of the probability 136 mass is between the two bounds. This is called a two-sided test and in order to obtain 95% of the probability mass between the LCB and UCB α is halved.

We can test our hypothesis by obtaining U_x and L_x for p = 0.5 and, if the number of wins from a competition falls outside of the bounds, we can reject our hypothesis with confidence. Another point of view could be to instead obtain the distribution from the observation, B(n, p̂), and check if its bounds fall outside the hypothesised value. It is more convenient for our problem to specify the confidence in terms of probabilities of winning and, as previously mentioned, we achieve that with Equation 8.2, or with L, U = L_x/n, U_x/n. For the purposes of further discussion we now revert to using the number of wins w instead of x and the number of games played in a competition n, and we also prefer to specify the confidence bounds as a proportion, since it is more relevant for testing our hypothesis. We also prefer to conduct the hypothesis test by using the observation distribution because this method is also suitable for comparing two players while using a reference-opponent. In the case of using an external reference-opponent, if the confidence bounds do not overlap then the null-hypothesis can be rejected and a claim made that one player is better than the other.

8.3.1 Confidence Bound Convergence

Further discussion uses the confidence bounds as proportions, not the numbers of successes. The proportion we are interested in is a player's probability of winning p where 0 ≤ p ≤ 1. With n = 0 trials conducted, the confidence bounds L and U are initially 0 and 1 respectively. As n increases the confidence bounds converge to the true value p, i.e. as:

n → ∞: U − L → 0, and L, U → p

Having narrow confidence bounds is desirable as it has the advantage of providing a more precise estimate of p. To obtain a higher confidence level for the same n, the width of the confidence interval could be increased, i.e. L and U could be moved closer to their initial values. At one extreme, consider a confidence level of 100% - other than when n = ∞, the only solution for 100% confidence is [L, U] = [0, 1] regardless of the size of n. In order to obtain a high level of confidence, the U and L estimates are more conservative, resulting in their rate of change per trial being lower as the required confidence increases. On the other hand, low confidence levels result in the bounds moving more freely since more errors are permitted. Figure 8.3 shows how different confidence levels impact the lower confidence bound (LCB) given the same n and the same p̂ - the lower the confidence the more the LCB moves from its initial value of 0. It is this relationship between the confidence level, the width of the confidence interval and the rate at which the bounds move per trial which gives rise to the requirement to conduct more games in order to obtain sufficiently narrow confidence intervals when p is closer to 0.5, i.e. further away from the initial [0, 1]. In other words, the more similar the players are, the more games are needed to differentiate them.

Figure 8.3: Holding p and n constant we vary the 1 − α confidence level and observe its impact on L. With the initial value L = 0 we see that as the confidence increases L becomes further from p.

8.3.2 Confidence Bounds Without a Statistics Library - The Wilson-Score

A number of closed-form methods exist which can be used to estimate the confidence bounds for a number of Bernoulli trials if a statistics package is not available. A binomial distribution can be approximated by a normal distribution, resulting in a continuous function which makes it easier to derive closed-form solutions. The method most commonly taught is the Wald method [170], shown in Equation 8.5, due to its simplicity, however it has practical limitations.

L, U = p̂ ± z·√(p̂(1 − p̂)/n)    (8.5)

Where:

• z is the 100(1 − α/2)th percentile of the standard normal distribution and

• p̂ = w/n with w being the number of wins from a tournament of n games [171].
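
For completeness, a minimal sketch (our addition) of the Wald interval in Equation 8.5; note that the raw interval is deliberately not clipped to [0, 1], which illustrates one of the limitations discussed next.

    from math import sqrt

    def wald_bounds(w: int, n: int, z: float = 1.96):
        """Wald confidence bounds (Equation 8.5) for w wins in n games."""
        p_hat = w / n
        half_width = z * sqrt(p_hat * (1 - p_hat) / n)
        return p_hat - half_width, p_hat + half_width

    print(wald_bounds(230, 400))   # e.g. 230 wins from 400 games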

By naively applying a normal approximation to a binomial distribution, inaccuracies occur at the boundaries [172]. Since a binomial distribution is in the range [0, 1], if the distribution mean is near 0 or 1 a normal approximation can result in values outside the permitted range. Additionally, it was found that the Wald interval is prone to unlucky pairs of numbers which result in actual confidence levels below expectations. It is for these reasons that numerous other methods to obtain confidence bounds have been developed [173]. The comparison of confidence interval methods conducted in [156] recommends that the Wald intervals be "no longer acceptable for scientific literature". For these reasons we use the Wilson-Score with continuity correction [174] as the closed-form solution for obtaining confidence bounds, as shown in Equations 8.6 and 8.7 [159]. From a tournament of n games, with w wins for a player, the Wilson method can provide the confidence bounds. The advantage of a closed-form solution is that it is faster to calculate than the exact method used above, and does not require any special libraries, although it does require a z-score table.

L_α = max(0, (2np̂ + z² − 1 − z·√(z² − 2 − 1/n + 4p̂(n(1 − p̂) + 1))) / (2(n + z²)))    (8.6)

U_α = min(1, (2np̂ + z² + 1 + z·√(z² + 2 − 1/n + 4p̂(n(1 − p̂) − 1))) / (2(n + z²)))    (8.7)

Where:

• L_α and U_α are the lower and upper 1 − α confidence limits.

• z is the z-score relating to the desired confidence level, where z = z_α for the one-sided test and z = z_{α/2} for the two-sided test. z is the 100(1 − α/2)th percentile of the standard normal distribution. Z-scores are often simply obtained from a z-score table. Common z-score values for two-tailed confidence levels are

z_0.05 = 1.96 and z_0.02 = 2.33. The score for any α can also be obtained using the scipy.stats.norm.ppf function.

• n is the number of trials (games)

• p̂ = w/n; w is the number of successes (wins)
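
The following sketch (our addition) implements Equations 8.6 and 8.7 directly; the max(0.0, ...) guard inside the square root is ours, protecting the degenerate cases p̂ = 0 or p̂ = 1. With w = 221 wins from n = 400 games - just above the AlphaGoZero acceptance threshold discussed in Section 8.2.1 - the lower bound comes out at roughly 0.502, i.e. marginally above 0.5.

    from math import sqrt

    def wilson_bounds(w: int, n: int, z: float = 1.96):
        """Wilson-score bounds with continuity correction (Equations 8.6 and 8.7)."""
        p_hat = w / n
        denom = 2 * (n + z ** 2)
        lower_root = z ** 2 - 2 - 1 / n + 4 * p_hat * (n * (1 - p_hat) + 1)
        upper_root = z ** 2 + 2 - 1 / n + 4 * p_hat * (n * (1 - p_hat) - 1)
        L = max(0.0, (2 * n * p_hat + z ** 2 - 1 - z * sqrt(max(0.0, lower_root))) / denom)
        U = min(1.0, (2 * n * p_hat + z ** 2 + 1 + z * sqrt(max(0.0, upper_root))) / denom)
        return L, U

    print(wilson_bounds(221, 400))   # lower bound just above 0.5 at z = 1.96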

8.3.3 Prediction Method

Given two players of unknown ability, we propose the following method to predict, with confidence, if one player is better than the other. Whilst there are a number of different statistical approaches that can achieve this, the method we propose is relatively straightforward, and we extend it in the next section to enable terminating a competition early if one player significantly outperforms the other. Recall that we are attempting to disprove the null-hypothesis that two players have an equal probability of winning.

H0 : p = 0.5

Ha : p ≠ 0.5

In order to test the null-hypothesis the method is as follows:

• Play n games between both players and record w for player 1.

• Using w and n, calculate the confidence intervals [L ≤ p ≤ U].

• If L > 0.5 or U < 0.5, reject the null-hypothesis H0 and state with confidence that the two players are not equal. When H0 is rejected a prediction can be made as to which player is the best one by selecting the player with the most wins. Technically, this prediction claim does not have precisely the same confidence as rejecting H0, but in a practical sense the difference is insignificant³. We call [L > 0.5 or U < 0.5] the prediction criteria, as these values are required before a prediction can be made.

• If L ≤ 0.5 and U ≥ 0.5 then n was insufficient to distinguish a difference between the two players and no prediction is made.
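
A compact sketch (our addition) of the prediction step above, reusing the wilson_bounds helper sketched in Section 8.3.2:

    def predict_better_player(w: int, n: int, z: float = 1.96):
        """Return 'player', 'opponent' or None when the players cannot be differentiated."""
        L, U = wilson_bounds(w, n, z)   # 1 - alpha confidence bounds on the win-rate
        if L > 0.5:
            return "player"             # H0 rejected; the tested player is better
        if U < 0.5:
            return "opponent"           # H0 rejected; the opponent is better
        return None                     # H0 not rejected; no prediction is made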

We have now transformed the binary hypothesis test into three prediction outcomes. The possible outcomes are:

1. The players are indistinguishable, i.e. H0 was not rejected, with the possibility that the players were actually different. If this error is made it is a type II error.

2. There is a difference between the players at the 1 − α confidence level, and approximately the same confidence that the tested player is better, i.e. H0 was rejected and the tested player was found to be better, with the possibility that the players were actually equal. If this error is made it is a type I error.

3. There is a difference between the players at the 1 − α confidence level, and approximately the same confidence that the tested player is worse, i.e. H0 was rejected and the tested player was found to be worse, with the possibility that the players were actually equal. If this error is made it is a type I error.

³ We ran 100,000 tournaments of 500 games with p = 0.52 and in cases where H0 was correctly rejected there were zero errors in predicting the better player.

8.3.4 Types of Error

Table 8.1: Prediction outcomes showing the conditions which result in different error types.

                      H0 True         H0 False
Reject H0             Type I Error    Correct
Fail to Reject H0     Correct         Type II Error

Type I Errors

If the null-hypothesis H0 is rejected when in actual fact it was true, this is a type I error. The probability of making a type I error is called the significance level α, and 1 − α is called the confidence level. If after n games the prediction criteria is met, [L > 0.5 or U < 0.5], but p = 0.5, a type I error would be committed. It is an important point to note that type I errors can only occur if p = 0.5.

The methods for calculating the confidence bounds outlined above are designed to ensure that the calculated values of L and U are always at the 1 − α confidence level, so if we observe L > 0.5 or U < 0.5 after n games then our confidence in rejecting H0 and making a prediction is directly inherited from the method used to calculate these intervals. With the prediction criteria being [L > 0.5 or U < 0.5], we could completely avoid type I errors by simply not playing enough games, resulting in both L and U remaining on their respective sides of 0.5. That is to say, we will never make a type I error if we never reject H0 and never make a prediction, which is why type II errors are important in our application of hypothesis testing. It is important to keep in mind that type I errors only occur when a prediction is made and when p = 0.5.

Type II Errors

If the true p ≠ 0.5 and we fail to reject H0, this is a type II error, and the probability of making a type II error is β. If after n games the prediction criteria is not met, L < 0.5 and U > 0.5, but p ≠ 0.5, a type II error would be committed. Type II errors can only occur if p ≠ 0.5. Table 8.1 shows a summary of when the error types occur.

The effect of using a very high 1 − α confidence level is that the bounds move at a slower rate with each game played, so if a higher confidence level is used, additional games are required to ensure that the prediction criteria can be met if the players are sufficiently different. If insufficient games are played then L and U will not move sufficiently to cross 0.5, preventing a prediction from being made. The more games that are played, the smaller the difference between the players that can be detected. We define the difference between the probabilities of the two players as δ = |p − q|. Power ρ is defined as the probability of not committing a type II error i.e. ρ = 1 − β and is dependent on the sample size n, the significance α and p [175, Chapter 1]. Figure 8.4 shows how type II errors vary with n for a fixed δ = 0.1. This figure shows that as n decreases the T2 error increases. T2 was obtained by the method outlined in Section 8.3.5.

Type III Errors

The standard hypothesis test has two outcomes: either reject the hypothesis H0 or do not reject H0. However, our method provides three prediction outcomes:

1. The players are indistinguishable, i.e. H0 was not rejected, risking a type II error.

2. There is a difference between the players at the 1 − α confidence level, and approximately the same confidence that the tested player is better, i.e. H0 was rejected and the tested player was found to be better, risking a type I error.

3. There is a difference between the players at the 1 − α confidence level, and approximately the same confidence that the tested player is worse, i.e. H0 was rejected and the tested player was found to be worse, risking a type I error.

In converting the binary hypothesis test to a ternary prediction problem, it appears we have introduced an additional source of error, being when H0 was correctly rejected but the incorrect player is identified as the best player. This error is called a type III error, and these errors rarely occur when suitable parameters are chosen. Type III errors are rarely discussed in the literature as they are infrequent. Although it appears that we have introduced the potential for type III errors by employing three prediction outcomes, what we have actually done is made type III errors more visible. Using the outcomes of hypothesis testing in this ternary manner was first proposed by Mosteller in 1942 in his seminal paper identifying the possibility of type III errors [176], however the approach is not commonly utilised. With the original binary hypothesis test it is possible that the hypothesis is correctly rejected, but for the wrong reasons, e.g. if p was only slightly above 0.5 it is possible to envisage a random distribution causing U to have a value less than 0.5, resulting in correctly rejecting H0 but for essentially the wrong reason. Type III errors are more prevalent when the difference between the players δ = |p − q| is small and n is small, but even in extreme cases type III errors are still relatively rare. For example, with α = 0.05, δ = 0.005 and n = 100, 25 type III errors were observed in 2000 tournaments, giving a type III error rate t3 ≈ 0.0125 - for these parameters, no type III errors were observed where δ > 0.08. Later in this chapter we provide a method to determine the number of games required based on user defined requirements, and we have only observed type III errors occurring well outside of our recommended combinations of parameters, e.g. with α = 0.05 and n = 10, type III errors were only observed when δ < 0.32, but our findings indicate that only players with a difference of δ > 0.9 can be confidently differentiated with n = 10, as shown in Figure 8.6.

Table 8.2: Parameter summary for error types

                             Type I                        Type II
Error Probability            α                             β
Inverse Error Probability    Confidence, 1 − α             Power, ρ = 1 − β
Permissible p                p = 0.5, players are equal    p ≠ 0.5, players not equal

Figure 8.4: Plot of observed type II errors T2 vs n using the exact and the Wilson confidence intervals with δ = 0.1 and α = 0.05. 1000 competitions of 100 games were played at each data-point to calculate T2. It can be inferred from these plots that the fewer games conducted, the higher the occurrence of T2 errors.

Which Error Is More Important?

Type III errors are rarely discussed in the literature due to their low probability of occurring and for practical purposes they can be ignored.

Figure 8.5: Plot of type II errors T2 vs δ using the exact and the Wilson confidence intervals with n = 100 and α = 0.05. Note that there are 0 type II errors at p = 0.5 because only type I errors can occur when δ = 0 (p = 0.5). 1000 competitions of 100 games were played at each data-point to obtain T2. The plots show that as δ → 0, T2 → 1.

Type I errors are typically the primary concern because in rejecting H0 a definitive statement is made that “one player is better than the other with 1 − α confidence”.

In medical experiments, successfully rejecting H0 often means that a new drug or treatment has been effective and will advance to the next stage of testing. The lost opportunity by failing to reject H0 is often outweighed by the risk of approving a treatment that doesn’t perform as expected.

Where rejecting H0 is definitive, failing to reject H0 yields a much less definitive claim - "our experiment was unable to differentiate the players". Some researchers go so far as to recommend "err(ing) on the side of caution and risk type II errors" [177], however when comparing two game players either error may be more critical. In determining which error is more important, the consequence of rejecting or failing to reject H0 : p = 0.5 should be considered. If, for example, a player is discarded when it is shown to be worse than another, in some use-cases the losing player will be deleted and will not be tested again, creating the risk of removing a player that may have been better. This would imply a user requirement to control type I errors, α. If, however, a challenger is discarded when it fails to perform better than some base-line player, then controlling type II errors, β, may be more important. AlphaGoZero's evaluation step had three distinct outcomes:

• discard the new player if it did not outperform the current best player, i.e. discard the new player when H0 was not rejected, risking a type II error;

• discard the new player if H0 was rejected but the new player was found to be weaker, risking a type I error; and

• promote the new player to the best player if H0 was rejected and the new player was stronger, risking a type I error.

As such the outcomes arising from AlphaGoZero’s evaluation imply that both type I and type II errors are important. In general we recommend using α = β, however both can be traded off for a reduced n if the use-case permits.

8.3.5 Conducting a Trial Game

We empirically demonstrate the effectiveness of our proposed prediction method by running controlled trials of n games in which the winner of each game is randomly sampled with probability p. The number of wins w for one player is used to make predictions using the method above and the error rate is observed. A number of trials are conducted to obtain the average errors. Throughout, α and β are the required user-defined type I and type II error parameters, and T1 and T2 are the observed error rates from an experiment.

8.3.6 Choosing Parameters α,β,δ,n

Choosing 1 − α Confidence Level Parameter

Type I errors, T1, only occur when a prediction is made. Type I errors are controlled by selecting the desired value for α when calculating the confidence bounds or, in the case of the Wilson-score method, selecting the appropriate value of z corresponding to the required α. In choosing the 1 − α confidence value, consideration needs to be given to its effect on other parameters. A higher confidence level results in the confidence bounds changing more conservatively with each game, which impacts β and ultimately results in the requirement to conduct more games per tournament. As the required confidence increases, more games are required to constrain β. Common values for α and their z-scores are shown in Table 8.3.

Table 8.3: Common values for α and the respective z-score. When using the Wilson-Score method the z-score is required; when calculating the bounds directly, α is used.

α      Z-Score
0.01   2.58
0.05   1.96
0.1    1.64
0.2    1.28

Choosing 1 − β - Power Level Parameter

Type II errors, T2, occur when predictions are not made, but the players are actually different. β is the maximum type II error rate parameter. The value chosen for β will depend on the consequences of failing to detect a difference between two players. If there are no pressing user requirements we recommend setting β = α. With α fixed, the rate of type II errors is dependent on the difference between the players, δ, and the number of games played n. Two of the three parameters β, δ, n need to be fixed and the third parameter can be derived from Figure 8.6. We use ∆ to indicate the chosen fixed parameter value of δ.

If δ is fixed, increasing n decreases T2 errors, as shown in Figure 8.4. Likewise, Figure 8.5 shows for fixed n that as the players become more similar, i.e. δ decreases, the error increases. To obtain the 1 − β power level, either n needs to be set directly, which then determines the value of ∆ that can differentiate the players with 1 − β power, or conversely ∆ is chosen, which will then determine the minimum number of games required. Figure 8.6 provides the respective (n, ∆) combinations for common values of β. Figure 8.5 shows how, when n is fixed, β increases as δ → 0; in other words β increases as the players become more similar. The exception is when δ = 0, which is a special case where type II errors cannot be made. In Figure 8.6, by fixing δ and α we demonstrate that β decreases with n. For any pair (n, p) we can empirically obtain β via the process outlined in Section 8.3.5. Conversely, we can also empirically obtain the required n for any (β, δ) pair. This results in a final design decision relating to the parameter ∆, and that is setting either:

• the similarity parameter ∆ which will stipulate the required n, or

• setting n which will impact how effective the system is at differentiating two players with δ = ∆.

Since n = F (α, β, δ), to assist with the selection of this parameter Figure 8.6 can be used.

Choosing ∆ - Similarity Parameter

The similarity parameter ∆ indicates the maximum value of δ at which the players could be considered practically equal. For instance, if p = 0.5001 and q = 0.4999, then for most practical applications these two players would be considered essentially the same quality. An infinitesimally small value for ∆ can be set, however the trade-off is that more games will be required to obtain the desired 1 − β power. It is where δ ≥ ∆ that the specified 1 − β power level is achieved. For example, a value of ∆ = 0.1 would mean that if p ≥ 0.55 or p ≤ 0.45, i.e. δ = |p − q| ≥ 0.1, then we could predict the difference between these players with greater than 1 − β power. We have shown in this chapter that the more similar the two players are, the more difficult it is to differentiate between them; as such the choice of ∆ directly dictates the value of n needed. For example, with α = 0.05, β = 0.05 and ∆ = 0.1, using Figure 8.6 the required n ≈ 1300 games.

Table 8.4: Some parameter combinations from Figure 8.6

α      β      ∆      n
0.01   0.01   0.1    2600
0.05   0.05   0.05   8000
0.05   0.05   0.1    1300
0.1    0.1    0.2    250
0.2    0.2    0.11   400

Choosing n - Number of Games

If resource limitations are a primary consideration, then the value of n can be set directly. In doing so, Figure 8.6 can be used to estimate at which value of ∆ the 1 − β power levels are achieved. If n = 1000 games with α = 0.3, β = 0.3 then, from the figure, ∆ = 0.05. That is to say that when δ ≥ 0.05 we can expect up to 30% type II errors. Previously it was noted that AlphaGoZero conducted 400 games per evaluation. Using Figure 8.6 we can infer that a difference between both players of ∆ ≈ 0.08 would be identified only 70% of the time, assuming α = 0.3, β = 0.3. Other possible variations of these parameters are: with δ ≈ 0.15 the difference could be identified 90% of the time using α = 0.1, β = 0.1, and for α = 0.05, β = 0.05 then ∆ ≈ 0.2.

Figure 8.6: The relationship between n and δ for typical β values. This plot is obtained by using the method in Section 8.3.5 and incrementing n until the required value of β is obtained; n = F(α, β, δ).

8.3.7 The Fixed-n Prediction Method

The method covered in this section uses a fixed number of games, n, regardless of how different the players actually are. Even if one player is significantly outperforming the other during the competition, the tournament cannot be stopped early under this method without impacting the error rates. In the next section we modify this method to permit early-stopping of the competition if one player is clearly better than the other before the allocated number of games is completed. To conduct fixed-n hypothesis testing to compare two players:

• Set the desired permissible type I and II error rates, α, β.

• Set the desired minimum probability difference ∆ between the two players at which the error rates are to be achieved.

• Either calculate the required number of games n or use Figure 8.6 to obtain an estimate.

• Conduct a tournament of n games and obtain the number of wins w for one of the players.

• Obtain the confidence interval [L, U] using w and n, e.g. with the Wilson method from Equations 8.6 and 8.7.

• If L > 0.5 or U < 0.5 declare with confidence that the player with the higher win-rate is better. Otherwise declare that testing was insufficient to differentiate the players.
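
A hedged end-to-end example (our addition) of this fixed-n procedure, taking n = 1300 from Table 8.4 for α = β = 0.05 and ∆ = 0.1, and simulating game outcomes with a hidden win probability in place of real players; it reuses the predict_better_player sketch from Section 8.3.3.

    import random

    alpha, n = 0.05, 1300            # from Table 8.4 with beta = 0.05 and delta = 0.1
    p_true = 0.57                    # hidden from the method; used only by the simulator
    w = sum(random.random() < p_true for _ in range(n))
    print(predict_better_player(w, n, z=1.96))   # usually 'player' for this p_true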

8.4 Stopping a Contest Early

We have shown a method to determine the better of two players with hidden winning probabilities of p and q = 1 − p by conducting an experiment consisting of a competition of n games and using the win-rate w/n to make a prediction with some confidence. Prior to conducting the experiment, decisions are required as to the acceptable error rate for type I, α, and type II, β, errors. These correspond respectively to the confidence 1 − α and the power 1 − β. In addition to the confidence and power, the desired prediction fidelity ∆ = |p − q| is also required unless n is known in advance; both can be obtained from Figure 8.6. Figure 8.5 also shows that as δ → 1, β → 0, demonstrating that the larger the difference between the two players, the more likely a correct prediction will be made. If one player is significantly better than the other and this was known in advance then a smaller number of games could be played, however this information is not available. If, however, one player significantly outperforms the other player while the competition is in progress, we consider whether some threshold of comparative performance can permit early-stopping of the competition while still retaining the required confidence.

If the method outlined in the previous section is naively modified to check every f games, additional type I errors will be introduced. Every time a check occurs, there is some chance that a type I error is made, resulting in more opportunities for the tournament to be stopped. This issue is highlighted in statistical sequential stopping literature [178]. In this section we adapt the previous fixed-n method to allow stopping a tournament early while still maintaining the required error specifications. By using our proposed sequential stopping method significant time can be saved by only playing the requisite number of games, instead of always playing n games.

8.4.1 Accumulating Error

The methods used for obtaining confidence intervals are well established in the literature and they ensure that the type I error rate, T1, is held below α, as the parameter α, or some derivation of it (z), is used directly in the confidence bounds formula. For clarity we refer to α and β as the type I and II error parameters, i.e. the desired error rates, and T1 and T2 as the observed error rates. In situations where the players have significantly different performance, it can be inferred from the previous plots that as the difference increases, i.e. δ → 1, the confidence bounds L and U are more likely to cross the prediction boundary of 0.5 before the set number of games in a tournament, n, are played. If the tournament were to be stopped as soon as the bounds cross the prediction boundary, an additional parameter is required, f, being the number of games to be played between checking the bounds, i.e. the check period. To stop a tournament early, the condition for making a fixed-n prediction also becomes the stopping condition for the tournament, i.e. when the lower bound L > 0.5 or the upper bound U < 0.5 the tournament is halted. Algorithm 1 shows the pseudo-code for conducting a number of tournaments with sequential stopping and returning the observed error rates.

Over the course of a tournament of n games, as the stopping condition is checked multiple times, the actual error rate is higher. Figure 8.7 shows the effects on the type I error, T1, of naively applying the sequential stopping conditions, L > 0.5 or U < 0.5, to a tournament of n games. Figure 8.7 demonstrates that the more frequently (the lower the period) the stopping condition is tested, the higher the type I error. Figure 8.8 shows the effects on the type II error, T2, of naively applying the sequential stopping conditions to a tournament. Figure 8.8 demonstrates that the more frequently the stopping condition is checked, the better the type II error rate, since a type II error occurs when the players are not equal, p ≠ 0.5, but a prediction wasn't made. Enabling sequential stopping rules provides more opportunity to make a prediction, resulting in an improved type II error rate as more checks are conducted.

8.4.2 Accounting for Additional Error

The previous section demonstrated that T1 and T2 performed differently when implementing sequential stopping rules for predicting if one player is better than another. This section outlines a modification to our method to account for these variances.

Algorithm 1 Testing the type I and II error rates with sequential stopping. p is a player's true winning probability. 1 − α is the desired confidence level. n is the number of games to be played per tournament, selected to achieve the desired 1 − β power. t is the number of tournaments to be played. f is the number of games between checking the stopping conditions, i.e. the checking period.

1:  procedure SeqTournaments(p, α, n, t, f)
2:    T1 ← 0                                    ▷ Type I error count
3:    T2 ← 0                                    ▷ Type II error count
4:    for i ← 1 to t do                         ▷ Play t tournaments
5:      w ← 0; predicted_flag ← False
6:      for k ← 1 to n do                       ▷ Play at most n games
7:        play game with p probability of winning
8:        w ← w + 1 if winner of game is player with p
9:        if k mod f == 0 then                  ▷ Every f games
10:         L, U ← confidence bounds using α, w, k
11:         if L > 0.5 or U < 0.5 then          ▷ Check stopping conditions
12:           predicted_flag ← True
13:           T1 ← T1 + 1 if p = q
14:           break                             ▷ Stop tournament
15:         end if
16:       end if
17:     end for
18:     if not(predicted_flag) and p ≠ q then
19:       T2 ← T2 + 1                           ▷ T2 error committed, didn't predict
20:     end if
21:   end for
22:   return [T1/t, T2/t]                       ▷ Error rates per tournament
23: end procedure
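
A runnable Python sketch (our addition) of Algorithm 1, reusing the wilson_bounds helper sketched in Section 8.3.2; note that the pseudo-code's α is passed here as the corresponding z-score.

    import random

    def seq_tournaments(p: float, z: float, n: int, t: int, f: int):
        """Return the observed (T1, T2) error rates per tournament for Algorithm 1."""
        t1 = t2 = 0
        for _ in range(t):                       # play t tournaments
            w = 0
            predicted = False
            for k in range(1, n + 1):            # play at most n games
                w += random.random() < p         # the player wins with probability p
                if k % f == 0:                   # every f games
                    L, U = wilson_bounds(w, k, z)
                    if L > 0.5 or U < 0.5:       # stopping condition met
                        predicted = True
                        if p == 0.5:             # players equal: a type I error
                            t1 += 1
                        break
            if not predicted and p != 0.5:       # players differ but no prediction made
                t2 += 1                          # a type II error
        return t1 / t, t2 / t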

To implement sequential stopping in a tournament of n games, the tournament will be stopped early when:

L > 0.5 or U < 0.5

where:

• L and U are the lower and upper confidence bounds obtained with 1 − α confidence after the i-th game;

• n is chosen to obtain the required 1 − β power at the required δ, the difference in probability of winning between the two players δ = |p − q| where p and q are the two players’ probabilities of winning against each other.

The method we apply to obtain the minimum confidence and power levels whilst implementing sequential stopping is to substitute A for α, where A is a more restrictive error rate (enhanced confidence) to account for the increased T1, and then to recalculate n to obtain β at δ.

Figure 8.7: Type I error (T1) increases as the period between checking the stopping conditions, L > 0.5 or U < 0.5, decreases. The more frequently the conditions are checked the worse the type I performance. L and U were obtained using Wilson confidence bounds. The parameters used for this plot, α = 0.05, n = 1300, were shown previously to produce error rates T1 ≲ 0.05 and T2 ≲ 0.05, however this plot demonstrates that the observed T1 error is dependent on f when using sequential stopping. When only 1 check is conducted at f = 1300 we obtain the expected performance of T1 ≲ 0.05, however when checking more frequently the T1 accuracy deteriorates. This plot was obtained by performing 20000 tournaments of n = 1300 games with p = 0.5, α = 0.05 using the method from Algorithm 1 to obtain the average T1 error per tournament.

Substituting A for α

We hypothesise that there exists some function A = A(α) which, for a set of parameters, will result in T1 ≲ α when employing the sequential stopping method. The user sets the required values for α, β and δ and chooses how frequently the stopping condition will be checked, f. The value for n is obtained using the fixed-n method covered in the previous section. The desired confidence level is then achieved by substituting A for α with all other parameters fixed. We achieve the required confidence level α by substituting A and incrementally decreasing A until T1 < α. Algorithm 2 explains the process for finding A. This method works by essentially making the substituted confidence level A more strict than the desired confidence α, accounting for the accumulated error arising from periodically checking the stopping condition. If the frequency of checking increases then A needs to be more stringent.

Figure 8.8: Type II error (T2) decreases as the period between checking the stopping conditions, L > 0.5 or U < 0.5, decreases. The more frequently the conditions are checked the better the type II performance. L and U were obtained using Wilson confidence bounds. The parameters used for this plot, α = 0.05, n = 1300, were shown previously to produce error rates T1 ≲ 0.05 and T2 ≲ 0.05, however this plot demonstrates that the observed T2 error is dependent on f when using sequential stopping and achieves an error rate better than the desired rate of β. When only 1 check is conducted at f = 1300 we obtain the expected performance of T2 ≲ 0.05, however when checking more frequently the T2 accuracy improves. This plot was obtained by performing 20000 tournaments of n = 1300 games with p = 0.5, α = 0.05 and obtaining the average T2 error per tournament.

Finding N to Achieve β

Figure 8.8 demonstrates that the more frequently the stopping condition is checked, the lower the type II error; however, we have also noted previously that the higher the confidence level, the higher the type II errors. These two effects do not necessarily cancel each other out, as such we also introduce N, a substitute value for n. To obtain A we used the value of n obtained from the preceding fixed-n problem, however now that A is substituting for α, the value of n won't necessarily provide β. The process we use to determine N is essentially the same process used in the preceding section for n and is explained in Algorithm 3. Table 8.5 provides a lookup table for a number of common parameters for the sequential stopping method outlined in this section.

Figure 8.9: The average number of games taken using sequential stopping methods, checking every 100 games. Note that N = 1620 games, however for many values of δ significantly fewer games are conducted on average before making a prediction.

8.4.3 Putting it into Practice

The procedure for performing sequential stopping hypothesis testing is almost as straightforward as for fixed-n. Table 8.5 contains a set of common values for α, β, δ and f, and the procedure is as follows:

• Obtain the corresponding values for A and N.

• Conduct a tournament of N games checking the stopping conditions every f games.

• The stopping conditions are L > 0.5 or U < 0.5, where L, U are the 1 − A confidence bounds.

• If the tournament is halted then declare, with confidence, that the player with the higher win-rate is better. Otherwise declare that the players' performance could not be differentiated.
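
A worked example (our addition) of this procedure using the α = β = 0.05, δ = 0.1, f = 100 row of Table 8.5 (A = 0.01, N = 1620), again simulating the games and reusing the wilson_bounds sketch from Section 8.3.2:

    import random
    from scipy.stats import norm

    A, N, f = 0.01, 1620, 100
    z_A = norm.ppf(1 - A / 2)          # two-sided z-score for the substituted level A
    p_true = 0.6                       # simulator only; unknown in practice
    w = 0
    for k in range(1, N + 1):
        w += random.random() < p_true
        if k % f == 0:
            L, U = wilson_bounds(w, k, z_A)
            if L > 0.5 or U < 0.5:
                better = "player" if L > 0.5 else "opponent"
                print("stopped after", k, "games; better:", better)
                break
    else:
        print("could not differentiate the players after", N, "games")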

Algorithm 2 Finding A for sequential stopping. α is the desired type I error, T1 is the observed type I error. n is the maximum number of games to be played per tournament. t is the number of tournaments. f is the number of games between checking the stopping conditions (period).

1: procedure FindA(α, n, t, f)
2:   A ← T1 ← α                                  ▷ A and T1 initialised to α
3:   while T1 ≥ α do
4:     T1, T2 ← SeqTournaments(0.5, A, n, t, f)  ▷ T1 obtained
5:     decrement A                               ▷ A reduced by some proportion
6:   end while
7:   return A                                    ▷ A is the required parameter
8: end procedure

Algorithm 3 Finding N to achieve β for given parameters when using sequential stopping. β is the desired type II error for a player difference of δ. A is the adjusted 1 − α confidence parameter from FindA. ns is the starting number of games required to achieve β given α, obtained using the fixed-n methods covered previously, see Figure 8.6. t is the number of tournaments. f is the number of games between checking the stopping conditions (period).

1: procedure FindN(δ, β, A, ns, t, f)
2:   N ← ns
3:   T2 ← β
4:   p ← δ/2 + 0.5                               ▷ Convert δ to p
5:   while T2 ≥ β do
6:     T1, T2 ← SeqTournaments(p, A, N, t, f)    ▷ T2 obtained
7:     increment N                               ▷ N increased by some proportion
8:   end while
9:   return N                                    ▷ N is the required parameter
10: end procedure
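
A Python sketch (our addition) of Algorithm 2, built on the seq_tournaments sketch above; the factor used to decrement A each iteration is our own choice and is not specified in the pseudo-code. Algorithm 3 can be translated in the same way by sweeping N upwards until the observed T2 falls below β.

    from scipy.stats import norm

    def find_A(alpha: float, n: int, t: int, f: int) -> float:
        """Decrease the substituted error rate A until the observed T1 drops below alpha."""
        A = alpha
        while True:
            z_A = norm.ppf(1 - A / 2)                    # z-score for the candidate A
            t1, _ = seq_tournaments(0.5, z_A, n, t, f)   # p = 0.5, so only T1 can occur
            if t1 < alpha:
                return A
            A *= 0.9                                     # make A more strict and retry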

8.5 Summary

Hypothesis testing is rarely utilised in the domain of computer game playing, however we have provided Table 8.5, populated with common parameters for use in sequential stopping approaches. Figure 8.6 provides the relationship between the number of games, player similarity and type I and II error ratios, and permits easy selection of parameters to perform fixed-n hypothesis tests.

Table 8.5: Look-up table of common parameters for our sequential stopping method. The three user-defined performance specifications are α, β and δ, being the maximum Type I and Type II error rates and the minimum probability difference between the players at which these error rates are required to be achieved. f is how frequently the stopping conditions are tested, A is the value to substitute for α in the confidence-bounds formula and N is the maximum number of games required to be played. The final column, Average n-games, indicates how many games are actually played on average using this method to make the prediction with the required level of confidence.

α      β      δ      f      A       N       Average n-games
0.3    0.3    0.1    20     0.06    500     267.6
0.3    0.3    0.1    100    0.21    280     213.1
0.2    0.2    0.1    20     0.02    880     454.3
0.2    0.2    0.1    100    0.06    650     381.9
0.1    0.1    0.1    20     0.01    1320    611.9
0.1    0.1    0.1    100    0.02    1120    589.3
0.05   0.05   0.1    20     0.005   1820    751.8
0.05   0.05   0.1    100    0.01    1620    735.0
0.01   0.01   0.1    20     0.001   3140    1159.5
0.01   0.01   0.1    100    0.001   3020    1162.2

Chapter 9

Conclusion

9.1 Computer Game Playing Today

Games are essentially an interaction between decision makers in an environment, and as such many real-world problems fall within the remit of game theory. A number of machine-learning methods exist which can be used by an artificial agent operating in a game environment, which may be real or virtual. To make an informed decision an artificial agent can use either a knowledge-based or an exploration-based approach. Traditionally, a knowledge-based approach was hard-coded by the programmer; more recently, neural-network methods have been used to learn this knowledge. An exploration-based approach is suitable when the environment is not known in advance, and can be achieved by conducting a search of the state-space.

We discussed in Chapter 2 the requirement to combine knowledge-based and exploration-based approaches for complex problems, and this is what the current state-of-the-art generic game playing agent does, using a combined neural-network/tree-search architecture. A neural-network is used to guide a search of the state-space, with the aim of prioritising the exploration of the most promising branches. The neural-network in the current state-of-the-art system learns through self-play, generating its own experiences with each move that it makes, which it then uses to train itself. We have also discussed how experience-creation via self-play is resource intensive, literally costing millions of dollars in some large-scale experiments. The findings presented in this thesis focus on this self-play bottleneck so that generating experiences, and learning from them, is as efficient as possible.

9.2 Key Findings

Without prior knowledge of the game, a neural-network/tree-search agent first explores the game environment randomly, and one of our first insights in Chapter 4 is that during this initial period the current state-of-the-art system is creating experiences which in no way reflect the dynamics of the environment. In the early training cycles the vast majority of created experiences are completely uninformed, resulting in wasted processing effort and erroneous exemplar data for the neural-network. It is not until the agent discovers a state which results in an environmental reward (a terminal state) that experiences begin to be created with actual feedback from the environment. Because the reward is the only feedback from the environment, the neural-network will learn from the end of the game towards the beginning, and current practice does not account for this.

A key contribution of our research, presented in Chapter 4, is the end-game-first curriculum, a new approach which seeks to enforce the creation of experiences only where they are informed by the environment, directly or indirectly. We have shown that a neural-network/tree-search agent, inspired by the state-of-the-art methods, learns faster when employing our end-game-first curriculum. We first show that an end-game-first curriculum is effective by using a hand-crafted curriculum; however, the limitation of this method is that it requires some prior knowledge of the game, reducing the generality of the system. We then further develop the end-game-first curriculum in Chapter 5 by demonstrating a novel method for automating the curriculum, which addresses the loss of generality incurred by a hand-crafted curriculum. We show that our automated end-game-first curriculum speeds up learning without requiring any prior knowledge when compared with a state-of-the-art-inspired agent.

The key theme of our research is to maximise the value of the experiences created during self-play, and we have focussed on the most inefficient part of the training cycle: the early stages of training. We highlighted that almost all experiences generated in the very first training cycle are uninformed, making this period of training the most inefficient. In Chapter 6 we presented two new methods which we refer to as "priming the neural-network", since their largest impact is on the earliest training cycles. The first method, terminal priming, is the creation of additional experiences from terminal states discovered in the course of generating experiences, something that is overlooked by the current methods. Terminal priming is found to be significantly beneficial where a game's terminal-value-function is complex, as in Connect-4. We also found that for simple terminal-value-functions terminal priming does not hinder learning, making it a low-overhead improvement. The second priming method, incremental-buffering, simply creates fewer experiences in the early training cycles, allowing the agent to cycle through the self-play/learning process faster and shortening the time spent in the most inefficient stages.

We have also conducted preliminary studies, in Chapter 7, into whether methods used to combat overfitting in supervised-learning problems can be adapted to reinforcement-learning. We found that early-stopping methods, which to our knowledge have not been applied to reinforcement-learning, can improve the efficiency of learning. Our preliminary experiments using spatial drop-out in conjunction with batch-normalisation indicate that further efficiencies may be obtained with this approach. The prevailing view is that using batch-normalisation overrides the need for drop-out; however, our results indicate that this may not be the case.
Confidence intervals are not commonly reported in the machine-learning literature, and some statisticians have indicated that statistical confidence methods are poorly understood by many. We have sought to demystify the use of confidence intervals and provided empirical examples as they pertain specifically to game playing. We have also presented a method which, to our knowledge, is new, and which allows the early stopping of a contest between two agents if one player is substantially better than the other, while retaining the required confidence level.

9.3 Further Work

We have shown that there are a number of further research areas which may result in improved self-play efficiency. For specialist applications, researching hand-crafted curricula may result in significantly improved learning whilst also potentially providing further insights into the dynamics of a problem. For general applications we have highlighted further research opportunities for adaptive hyper-parameters such as the buffer replacement rate, the learning rate and the early-stopping patience parameter, in addition to investigating better mechanisms for setting the automated curriculum. The more information that can be extracted from generated self-play experiences, the more cost effective these state-of-the-art game-playing agents will be, making them available to a wider range of applications.

9.4 Longevity of Findings

In closing, the approach of using a neural-network to inform a search will, in our view, remain necessary for difficult decision-making problems. This method has already been applied to a diverse cross-section of games and continues to demonstrate that it can learn to make super-human decisions tabula-rasa¹, through self-play. Fundamental to this approach is that if a combined neural-network/tree-search agent has no prior knowledge of the problem space then it has to learn from environmental rewards. When an agent is dependent on learning from environmental rewards, it must learn the end-game first, and consequently learning will progress from the end-game to the start of the game. In this case, the end-game-first curriculum prevents uninformed experiences from being created, which speeds up learning. As such, the contributions presented in this thesis are likely to retain their utility regardless of future advancements in tree-search or neural-networks.

¹ Tabula-rasa means from a blank slate, i.e. without any prior knowledge.

Appendix A

Parameters For the Players

Tables A.1 and A.2 show the parameters used for the automated curriculum experiment in Chapter 5.

Parameter                       Value                               Comment
Resnet Blocks                   10                                  Number of Residual Network blocks
Batch Size                      350                                 Training batch size
Value function size             256                                 Neurons before vθ(s)
CNN Filters                     512                                 How many filters are used
Simulations per action          200                                 How many nodes are added to the tree during the MCTS tree search
Drop draw chance                0.5                                 If a game is a draw then it has this chance of being dropped; used to mitigate a game with a lot of draws filling the buffer with too many drawn experiences
Min Buffer Size                 350; 500; 750; 1000; 1500; 2000     Minimum number of games in the training buffer to stop self-play and commence training; for Chess (Racing-Kings) the number of required games was incremented to allow training to commence earlier
Experience replacement          0.1                                 Proportion of experiences removed from the training buffer after self-play
Number of probabilistic turns   30                                  Number of the player's actions which are selected probabilistically from the policy before switching to selecting moves greedily

Table A.1: Parameters for the Racing-Kings player. The minimum buffer size parameter was incremented each training cycle for Racing-Kings, instead of the fixed size used for Reversi. Without the incremental-buffer priming from Chapter 6, the time required to obtain results was substantially longer.
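These parameters translate naturally into a configuration object. The sketch below is our own illustrative rendering in Python (the field names are assumptions, not taken from the thesis code); it also shows how the incremented minimum-buffer schedule for Racing-Kings might be applied per training cycle.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PlayerConfig:
    """Illustrative container for the Table A.1 parameters (field names are ours)."""
    resnet_blocks: int = 10
    batch_size: int = 350
    value_head_neurons: int = 256          # neurons before v_theta(s)
    cnn_filters: int = 512
    simulations_per_action: int = 200      # MCTS nodes added per move
    drop_draw_chance: float = 0.5
    experience_replacement: float = 0.1    # proportion of the buffer replaced after self-play
    probabilistic_turns: int = 30          # moves sampled from the policy before playing greedily
    # Minimum games in the experience-buffer before training commences; the schedule
    # is stepped through one value per training cycle (incremental-buffer priming).
    min_buffer_schedule: List[int] = field(
        default_factory=lambda: [350, 500, 750, 1000, 1500, 2000])

def min_buffer_size(config: PlayerConfig, training_cycle: int) -> int:
    """Minimum buffer size for a given training cycle, holding the final value
    once the schedule is exhausted (the fixed Reversi setting of 2000 is the
    one-element schedule [2000])."""
    schedule = config.min_buffer_schedule
    return schedule[min(training_cycle, len(schedule) - 1)]
```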


Parameter                       Value    Comment
Resnet Blocks                   10       Number of Residual Network blocks
Batch Size                      350      Training batch size
Value function size             256      Neurons before vθ(s)
CNN Filters                     512      How many filters are used
Simulations per action          200      How many nodes are added to the tree during the MCTS tree search
Drop draw chance                0.1      If a game is a draw then it has this chance of being dropped; used to mitigate a game with a lot of draws filling the buffer with too many drawn experiences
Min Buffer Size                 2000     Minimum number of games in the training buffer to stop self-play and commence training
Experience replacement          0.1      Proportion of experiences removed from the training buffer after self-play
Number of probabilistic turns   16       Number of the player's actions which are selected probabilistically from the policy before switching to selecting moves greedily

Table A.2: Parameters for the Reversi player.

Bibliography

[1] J. N. Shurkin, Engines of the Mind: The Evolution of the Computer from Main- frames to Microprocessors. W. W. Norton & Company, 1996.1

[2] A. Williams, History of Digital Games: Developments in Art, Design and Inter- action. CRC Press, 2017.1

[3] A. Newell, J. Shaw, and H. Simon, “Chess-Playing Programs and the Problem of Complexity,” IBM Journal, vol. 2, no. 4, pp. 320–335, Oct. 1958.1, 13

[4] J. Schaeffer and A. Plaat, “Kasparov versus deep blue: The re-match,” ICCA Journal, vol. 20, no. 2, pp. 95–101, 1997.1

[5] T. D. Kelley and L. N. Long, “Deep blue cannot play checkers: The need for generalized intelligence for mobile robots,” Journal of Robotics, 2010.1

[6] M. Świechowski, H. Park, J. Mańdziuk, and K.-J. Kim, “Recent advances in general game playing,” The Scientific World Journal, vol. 2015, 2015. 1, 3, 19, 24

[7] M. Minsky, “Steps toward artificial intelligence,” Proceedings of the IRE, vol. 49, no. 1, pp. 8–30, 1961.1

[8] K. Salen and E. Zimmerman, Rules of Play: Game Design Fundamentals, ser. The MIT Press. The MIT Press, 2003.2

[9] F. Carmichael, A Guide to Game Theory. Harlow, Essex, England ; New York: Financial Times Prentice Hall, 2005.2, 185

[10] M. Heule and L. Rothkrantz, “Solving games,” Science of Computer Program- ming, vol. 67, no. 1, pp. 105–124, 2007.2, 13

[11] G. Chaslot, S. Bakkes, I. Szita, and P. Spronck, “Monte-Carlo Tree Search: A New Framework for Game AI.” in AIIDE, 2008. [Online]. Available: http://www.aaai.org/Papers/AIIDE/2008/AIIDE08-036.pdf2

[12] K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neural Networks, vol. 2, no. 5, pp. 359–366, 1989.2

[13] H. van den Herik, J. W. Uiterwijk, and J. van Rijswijck, “Games solved: Now and in the future,” Artificial Intelligence, vol. 134, no. 1-2, pp. 277–311,


Jan. 2002. [Online]. Available: http://linkinghub.elsevier.com/retrieve/pii/S0004370201001527 3, 11, 14, 15, 18, 184, 185

[14] M. Müller, “Computer go,” Artificial Intelligence, vol. 134, pp. 145–179, 2002. 3

[15] R. Coulom, “Efficient selectivity and backup operators in monte-carlo tree search,” in International Conference on Computers and Games. Springer, 2006, pp. 72–83. [Online]. Available: http://link.springer.com/chapter/10.1007/978-3- 540-75538-8 73

[16] ——, “Monte-Carlo Tree Search in Crazy Stone,” in Proc. Game Prog. Workshop, Tokyo, Japan, 2007, pp. 74–75. [Online]. Available: https: //www.remi-coulom.fr/Hakone2007/Hakone.pdf3

[17] ——, “Computing Elo Ratings of Move Patterns in the Game of Go,” in Proc. Game Prog. Workshop, Tokyo, Japan, 2007. [Online]. Available: https://www.remi-coulom.fr/Hakone2007/Hakone.pdf3

[18] C. J. Maddison, A. Huang, I. Sutskever, and D. Silver, “Move Evaluation in Go Using Deep Convolutional Neural Networks,” in International Conference on Learning Representations, Apr. 2015, pp. 1–8. [Online]. Available: http://arxiv.org/abs/1412.6564v23

[19] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driess- che, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Diele- man, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, “Mastering the Game of Go with Deep Neural Networks and Tree Search,” Nature, vol. 529, no. 7587, pp. 484–489, Jan. 2016.3,5, 12, 18, 19, 29, 50, 129

[20] M. Genesereth, N. Love, and B. Pell, “General Game Playing: Overview of the AAAI Competition,” AI Magazine, vol. 26, no. 2, p. 62, Jun. 2005. [Online]. Available: http://www.aaai.org/ojs/index.php/aimagazine/article/view/18133, 19

[21] H. Finnsson and Y. Björnsson, “Learning Simulation Control in General Game-Playing Agents,” in AAAI, vol. 10, 2010, pp. 954–959. [Online]. Available: http://www.aaai.org/ocs/index.php/AAAI/AAAI10/paper/download/1892/2124 3, 19

[22] S. Legg and M. Hutter, “A Formal Measure of Machine Intelligence,” May 2006. 4, 19

[23] A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun, “Dermatologist-level classification of skin cancer with deep neural networks,” Nature, vol. 542, no. 7639, pp. 115–118, Feb. 2017. 4

[24] P. Jurmeister, M. Bockmayr, P. Seegerer, T. Bockmayr, D. Treue, G. Montavon, C. Vollbrecht, A. Arnold, D. Teichmann, K. Bressem, U. Schüller, M. v. Laffert, K.-R. Müller, D. Capper, and F. Klauschen, “Machine learning analysis of DNA methylation profiles distinguishes primary lung squamous cell carcinomas from head and neck metastases,” Science Translational Medicine, vol. 11, no. 509, Sep. 2019. [Online]. Available: https://stm.sciencemag.org/content/11/509/eaaw8513 4

[25] K. Kourou, T. P. Exarchos, K. P. Exarchos, M. V. Karamouzis, and D. I. Fotiadis, “Machine learning applications in cancer prognosis and prediction,” Computational and Structural Biotechnology Journal, vol. 13, pp. 8–17, 2015. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/ S20010370140004644

[26] R. Cuocolo, M. B. Cipullo, A. Stanzione, L. Ugga, V. Romeo, L. Radice, A. Brunetti, and M. Imbriaco, “Machine learning applications in prostate cancer magnetic resonance imaging,” European Radiology Experimental, vol. 3, Aug. 2019. [Online]. Available: https://www.ncbi.nlm.nih.gov/pmc/articles/ PMC6686027/4

[27] S. Liu, H. Zheng, Y. Feng, and W. Li, “Prostate cancer diagnosis using deep learning with 3d multiparametric MRI,” in Prostate cancer diagnosis using deep learning with 3D multiparametric MRI, S. G. Armato and N. A. Petrick, Eds., Orlando, Florida, United States, Mar. 2017, p. 1013428. [Online]. Available: http: //proceedings.spiedigitallibrary.org/proceeding.aspx?doi=10.1117/12.22771214

[28] W. Muhammad, G. R. Hart, B. Nartowt, J. J. Farrell, K. Johung, Y. Liang, and J. Deng, “Pancreatic cancer prediction through an artificial neural network,” Frontiers in Artificial Intelligence, vol. 2, 2019.4

[29] P. Ferroni, F. M. Zanzotto, S. Riondino, N. Scarpato, F. Guadagni, and M. Roselli, “Breast cancer prognosis using a machine learning approach,” Can- cers, vol. 11, no. 3, 2019.4

[30] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis, “Mastering the game of go without human knowledge,” Nature, vol. 550, no. 7676, pp. 354–359, 2017. [Online]. Available: http://www.nature.com/articles/nature242704,5, 20, 50, 129, 131

[31] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis, “Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm,” arXiv:1712.01815 [cs], Dec. 2017. [Online]. Available: http://arxiv.org/abs/1712.01815 5, 45

[32] J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel, T. Lillicrap, and D. Silver, “Mastering atari, go, chess and shogi by planning with a learned model,” arXiv:1911.08265 [cs, stat], 2019. [Online]. Available: http://arxiv.org/abs/1911.082655

[33] K. Sato, C. Young, and D. Patterson. (2017) An in-depth look at google’s first tensor processing unit (TPU). [Online]. Avail- able: https://cloud.google.com/blog/products/gcp/an-in-depth-look-at-googles- first-tensor-processing-unit-tpu/5

[34] E. Barron, “Game theory: An introduction, second edition,” in Game Theory. John Wiley & Sons, Inc., 2013, pp. 432–435.6

[35] J. v. Neumann, O. Morgenstern, and A. Rubinstein, Theory of Games and Economic Behavior (60th Anniversary Commemorative Edition). Princeton University Press, 1944. [Online]. Available: http://www.jstor.org/stable/j. ctt1r2gkx6

[36] A. M. Turing, “Computing Machinery and Intelligence,” Mind, vol. 59, no. 236, pp. 433–460, 1950. [Online]. Available: http://www.jstor.org/stable/22512996, 13

[37] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with Deep Reinforcement Learning,” arXiv preprint arXiv:1312.5602, 2013. [Online]. Available: https://arxiv.org/abs/1312.56026

[38] K. Sharma, “Automation strategies,” in Overview of Industrial Process Automa- tion. Elsevier, 2011, pp. 53–62.6

[39] J. G. V. Habets, M. Heijmans, M. L. Kuijf, M. L. F. Janssen, Y. Temel, and P. L. Kubben, “An update on adaptive deep brain stimulation in parkinson’s disease,” Movement Disorders, vol. 33, no. 12, pp. 1834–1843, 2018. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/mds.1157

[40] D. Cella, C. Nowinski, A. Peterman, D. Victorson, D. Miller, J.-S. Lai, and C. Moy, “The neurology quality of life measurement initiative,” Archives of physical medicine and rehabilitation, vol. 92, no. 10, pp. S28–S36, 2011. [Online]. Available: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3193028/7

[41] A. Ramirez-Zamora, J. Giordano, A. Gunduz, J. Alcantara, J. N. Cagle, S. Cernera, P. Difuntorum, R. S. Eisinger, J. Gomez, S. Long, B. Parks, J. K. Wong, S. Chiu, B. Patel, W. M. Grill, H. C. Walker, S. J. Little, R. Gilron, G. Tinkhauser, W. Thevathasan, N. C. Sinclair, A. M. Lozano, T. Foltynie, A. Fasano, S. A. Sheth, K. Scangos, T. D. Sanger, J. Miller, A. C. Brumback, P. Rajasethupathy, C. McIntyre, L. Schlachter, N. Suthana, C. Kubu, L. R. Sankary, K. Herrera-Ferrá, S. Goetz, B. Cheeran, G. K.

Steinke, C. Hess, L. Almeida, W. Deeb, K. D. Foote, and M. S. Okun, “Proceedings of the seventh annual deep brain stimulation think tank: Advances in neurophysiology, adaptive DBS, virtual reality, neuroethics and technology,” Frontiers in Human Neuroscience, vol. 14, 2020. [Online]. Available: https://www.frontiersin.org/articles/10.3389/fnhum.2020.00054/full7

[42] R. Korf, “Does deep blue use AI?” AAAI Technical Report, p. 2, 1997. 12

[43] B. von Stengel and T. L. Turocy, “What Is Game Theory?” Encyclopedia of Information Systems: EJ, vol. 2, p. 403, 2003. 12

[44] H. de Garis, “Artificial brains,” in Artificial General Intelligence, B. Goertzel and C. Pennachin, Eds. Springer Berlin Heidelberg, 2007, pp. 159–174. [Online]. Available: http://link.springer.com/10.1007/978-3-540-68677-4 5 12

[45] C. Pennachin and B. Goertzel, “Contemporary Approaches to Artificial General Intelligence,” in Artificial general intelligence. Springer, 2007, pp. 1–30. [Online]. Available: http://link.springer.com/content/pdf/10.1007/978-3-540- 68677-4 1.pdf 13

[46] P. D. Straffin, M. J. McAsey, A. Benjamin, L. H. Lange, G. Berzsenyi, M. E. Saul, R. K. Guy, R. A. Honsberger, and J. Wyzkoski Weiss, Game Theory and Strategy, ser. Anneli Lax New Mathematical Library. American Mathematical Society, 1993, vol. 36. 13, 20

[47] S. J. Russell, P. Norvig, and E. Davis, Artificial Intelligence: A Modern Approach, 3rd ed., ser. Prentice Hall series in artificial intelligence. Upper Saddle River: Prentice Hall, 2010. 13, 22

[48] D. Clark, “Bluer Than Blue: How IBM Managed It [Interview],” IEEE Concur- rency, vol. 5, no. 3, pp. 8–11, Jul. 1997. 13, 16

[49] S. Legg and M. Hutter, “A Collection of Definitions of Intelligence,” Frontiers in Artificial Intelligence and applications, vol. 157, p. 17, 2007. 13, 19

[50] J. Von Neumann and O. Morgenstern, Theory of games and economic behavior, 60th ed., ser. Princeton classic editions. Princeton University Press, 1944. 13

[51] C. E. Shannon, “Programming a Computer for Playing Chess,” Philosophical Magazine, vol. 41, pp. 256–275, 1950. [Online]. Available: http://www. bibsonomy.org/bibtex/28d7d487f1632d05b56788250f1d60b72/idsia 14, 16, 18

[52] J. Burmeister and J. Wiles, “The challenge of Go as a domain for AI research: a comparison between Go and chess,” in Proceedings of the Third Australian and New Zealand Conference on Intelligent Information Systems, 1995. ANZIIS-95, Nov. 1995, pp. 181–186. 16

[53] C. J. Maddison, A. Huang, I. Sutskever, and D. Silver, “Move evaluation in go using deep convolutional neural networks,” arXiv preprint arXiv:1412.6564, 2014. [Online]. Available: https://arxiv.org/abs/1412.6564 16, 18

[54] M. Świechowski and J. Mańdziuk, “Self-Adaptation of Playing Strategies in General Game Playing,” IEEE Transactions on Computational Intelligence and AI in Games, vol. 6, no. 4, pp. 367–381, Dec. 2014. 16, 19

[55] P. Ciancarini and G. P. Favini, “Retrograde analysis of endgames,” in Proceedings of the 2010 IEEE Conference on Computational Intelligence and Games, 2010, pp. 411–418. 17

[56] K. Thompson, “Retrograde analysis of certain endgames.” ICGA Journal, vol. 9, no. 3, pp. 131–139, 1986. 17

[57] H. Iida, J. Yoshimura, K. Morita, and J. W. Uiterwijk, “Retrograde analysis of the KGK endgame in shogi: Its implications for ancient heian shogi,” in International Conference on Computers and Games. Springer, 1986, pp. 318–335. 17

[58] J. Schaeffer, One Jump Ahead : Computer Perfection at Checkers. Boston, MA: Springer US, 2009. [Online]. Available: http://link.springer.com/10.1007/978-0- 387-76576-1 17

[59] J. Schaeffer, N. Burch, Y. Bj¨ornsson, A. Kishimoto, M. M¨uller, R. Lake, P. Lu, and S. Sutphen, “Checkers Is Solved,” science, vol. 317, no. 5844, pp. 1518–1522, 2007. [Online]. Available: http://science.sciencemag.org/content/ 317/5844/1518.short 17

[60] M. Buro, “Takeshi Murakami vs Logistello,” in ICGA, vol. 20, 1997, pp. 189–193. [Online]. Available: https://pdfs.semanticscholar.org/beed/ 572f4191d5ac23305565f573256a33f3a05b.pdf 17

[61] ——, “Logistello: A Strong Learning Othello Program,” in 19th Annual Conference Gesellschaft f¨ur Klassifikation, vol. 2. Princeton, NJ: NEC Research Institute, 1995. [Online]. Available: http://citeseerx.ist.psu.edu/ viewdoc/download?doi=10.1.1.114.1746&rep=rep1&type=pdf 17

[62] P. Shotwell, “The game of go: Speculations on its origins and symbolism in ancient china,” Go World, no. 70, p. 62, 1994. 18

[63] P. H. Jin and K. Keutzer, “Convolutional Monte Carlo Rollouts in Go,” University of California, Berkeley, Berkeley, Tech. Rep., Dec. 2015. [Online]. Available: http://arxiv.org/abs/1512.03375v1;http://arxiv.org/pdf/1512.03375v1 19

[64] S. Gelly, Y. Wang, R. Munos, and O. Teytaud, “Modification of UCT with patterns in Monte-Carlo Go,” Ph.D. dissertation, INRIA, 2006. [Online]. Available: https://hal.inria.fr/inria-00117266/ 19

[65] S. Gelly and D. Silver, “Combining Online and Offline Knowledge in Uct,” in ICML ’07, ser. ACM International Conference Proceeding Series, vol. 227. ACM, 2007, pp. 273–280. 19

[66] M. Genesereth and M. Thielscher, “General Game Playing,” Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 8, no. 2, pp. 1–229, Mar. 2014. [Online]. Available: http://www.morganclaypool.com/doi/abs/10.2200/ S00564ED1V01Y201311AIM024 19

[67] Y. Bjornsson and H. Finnsson, “CadiaPlayer: A Simulation-Based General Game Player,” IEEE Transactions on Computational Intelligence and AI in Games, vol. 1, no. 1, pp. 4–15, Mar. 2009. [Online]. Available: http://ieeexplore.ieee.org/document/4804731/ 19

[68] H. Finnsson and Y. Bj¨ornsson,“Simulation-Based Approach to General Game Playing,” in Proceedings of the 23rd National Conference on Artificial Intelli- gence, ser. AAAI ’08, vol. 1. Chicago, Illinois: AAAI Press, 2008, pp. 259–264. 19

[69] C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton, “A Survey of Monte Carlo Tree Search Methods,” IEEE Transactions on Computational Intelligence and AI in Games, vol. 4, no. 1, pp. 1–43, 2012. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs all.jsp?arnumber=6145622 19, 20, 22, 23, 24, 53

[70] D. Michulke and M. Thielscher, “Neural networks for state evaluation in general game playing,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2009, pp. 95–110. [Online]. Available: http://link.springer.com/chapter/10.1007/978-3-642-04174-7 7 20

[71] L. Kocsis and C. Szepesvári, “Bandit Based Monte-Carlo Planning,” in European conference on machine learning. Springer, 2006, pp. 282–293. [Online]. Available: http://link.springer.com/chapter/10.1007/11871842_29 20, 22, 24, 53

[72] A. A. Elnaggar, M. Abdel, M. Gadallah, and H. El-Deeb, “A Comparative Study of Game Tree Searching Methods,” Int. J. Adv. Comput. Sci. Appl., vol. 5, no. 5, pp. 68–77, 2014. 21

[73] S. Gelly and D. Silver, “Monte-Carlo tree search and rapid action value estimation in computer Go,” Artificial Intelligence, vol. 175, no. 11, pp. 1856–1875, Jul. 2011. [Online]. Available: http://www.sciencedirect.com/ science/article/pii/S000437021100052X 24

[74] C. for students, “Mate in Two Problem | Chess Puzzles!” 2017. [Online]. Available: http://www.chesspuzzles.com/mate-in-two 25

[75] A. L. Samuel, “Some Studies in Machine Learning Using the Game of Checkers,” IBM Journal of research and development, vol. 3, no. 3, pp. 210–229, 1959. [Online]. Available: http://ieeexplore.ieee.org/abstract/document/5392560/ 25, 38

[76] K. Hornik, “Approximation capabilities of multilayer feedforward networks,” Neural Networks, vol. 4, no. 2, pp. 251–257, Jan. 1991. [Online]. Available: http://www.sciencedirect.com/science/article/pii/089360809190009T 27

[77] A. Rashed and W. M. Bahgat, “Modified Technique for Speaker Recognition using ANN,” International Journal of Com- puter Science and Network Security (IJCSNS), vol. 13, no. 8, p. 8, 2013. [Online]. Available: http://search.proquest.com/openview/ 93dc75b23441d961486ee2bd51c36e2d/1?pq-origsite=gscholar&cbl=1026368 27

[78] M. Betke and N. C. Makris, “Fast Object Recognition in Noisy Images Using Simulated Annealing,” in Computer Vision, 1995. Proceedings., Fifth International Conference on. IEEE, 1995, pp. 523–530. [Online]. Available: http://ieeexplore.ieee.org/abstract/document/466895/ 27

[79] Y. LeCun, K. Kavukcuoglu, and C. Farabet, “Convolutional Networks and Applications in Vision,” in Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on. IEEE, 2010, pp. 253–256. [Online]. Available: http://ieeexplore.ieee.org/abstract/document/5537907/ 27, 182

[80] L. Debnath and D. Bhatta, Integral Transforms and Their Applications. CRC Press LLC, 2006. [Online]. Available: http://ebookcentral.proquest.com/lib/ qut/detail.action?docID=282788 28

[81] I. I. Hirschman and D. V. Widder, The Convolution Transform. Courier Cor- poration, 2012. 28

[82] Performing convolution operations. [Online]. Available: https: //developer.apple.com/library/content/documentation/Performance/ Conceptual/vImage/ConvolutionOperations/ConvolutionOperations.html 28

[83] L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, and M. Pietikäinen, “Deep learning for generic object detection: A survey,” International Journal of Computer Vision, vol. 128, no. 2, pp. 261–318, 2020. 28

[84] S. Russell, P. Norvig, and A. Intelligence, “AI: A Modern Approach,” Artificial Intelligence. Prentice-Hall, Egnlewood Cliffs, vol. 25, p. 27, 1995. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1. 259.8854&rep=rep1&type=pdf 29

[85] Y. Bengio, A. C. Courville, and P. Vincent, “Unsupervised Feature Learning and Deep Learning: A Review and New Perspectives,” CoRR,

abs/1206.5538, vol. 1, 2012. [Online]. Available: https://pdfs.semanticscholar. org/f8c8/619ea7d68e604e40b814b40c72888a755e95.pdf 29, 33

[86] S. Albawi, T. A. Mohammed, and S. Al-Zawi, “Understanding of a convolutional neural network,” in 2017 International Conference on Engineering and Technology (ICET). IEEE, 2017, pp. 1–6. [Online]. Available: https: //ieeexplore.ieee.org/document/8308186/ 30

[87] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet large scale visual recognition challenge,” arXiv:1409.0575 [cs], 2015. [Online]. Available: http://arxiv.org/abs/1409.0575 29, 30

[88] ImageNet object localization challenge. Library Catalog: www.kaggle.com. [Online]. Available: https://kaggle.com/c/imagenet-object-localization-challenge 29

[89] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural Computation, vol. 1, no. 4, pp. 541–551, 1989. [Online]. Available: http://www.mitpressjournals.org/doi/10.1162/neco.1989.1.4.541 29

[90] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “OverFeat: Integrated recognition, localization and detection using convolutional networks,” arXiv:1312.6229 [cs], 2014. [Online]. Available: http://arxiv.org/ abs/1312.6229 29

[91] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Communications of the ACM, vol. 60, no. 6, pp. 84–90, 2017. [Online]. Available: http://dl.acm.org/citation.cfm? doid=3098997.3065386 29

[92] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2016, pp. 770–778. [Online]. Available: http://ieeexplore.ieee.org/document/7780459/ 29, 31, 32, 46

[93] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, 1994, conference Name: IEEE Transactions on Neural Networks. 31

[94] Y. Bengio, “Learning deep architectures for AI,” Foundations and trends R in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009. [Online]. Available: http://dl.acm.org/citation.cfm?id=1658424 31

[95] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks.” in Aistats, vol. 9, 2010, pp. 249–256.

[Online]. Available: http://www.jmlr.org/proceedings/papers/v9/glorot10a/ glorot10a.pdf?hc location=ufi 31

[96] G. E. Hinton, “How neural networks learn from experience,” SCIENTIFIC AMERICAN, p. 8, 1992. 31, 33

[97] P. J. Werbos, “Thesis: Beyond regression new tools for prediction and analysis in the behavioral science werbos,” 1974. 33

[98] Werbos, “Backpropagation: past and future,” in IEEE 1988 International Con- ference on Neural Networks, 1988, pp. 343–353 vol.1. 33

[99] Y. Wang, H. Yao, and S. Zhao, “Auto-encoder based dimensionality reduction,” Neurocomputing, vol. 184, pp. 232–242, 2016. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S0925231215017671 33, 34

[100] Y. Bengio, “Deep learning of representations for unsupervised and transfer learn- ing,” in JMLR, vol. 27, 2012, p. 21. 33

[101] R. S. Sutton and A. G. Barto, Reinforcement learning: an introduction, ser. Adaptive computation and machine learning. Cambridge, Mass: MIT Press, 1998. 35

[102] R. A. Howard, “Dynamic programming,” Management Science, vol. 12, no. 5, pp. 317–348, 1966, publisher: INFORMS. [Online]. Available: http://www.jstor.org/stable/2627818 37

[103] E. J. Sondik, “THE OPTIMAL CONTROL OF PARTIALLY OBSERVABLE MARKOV PROCESSES,” Stanford Univ Calif Stanford Electronics Labs, p. 220, 1971. 37

[104] M. L. Littman, “Markov games as a framework for multi-agent reinforcement learning,” in Machine Learning Proceedings 1994. Elsevier, 1994, pp. 157–163. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/ B9781558603356500271 37

[105] F. Grabski, Semi-Markov Processes: Applications in System Reliability and Main- tenance, 1st ed. Elsevier, 2014. 37

[106] E. J. Sondik, “The Optimal Control of Partially Observable Markov Processes Over the Infinite Horizon: Discounted Costs,” Operations Research, vol. 26, no. 2, pp. 282–304, 1978. [Online]. Available: http://www.jstor.org/stable/169635 37, 60

[107] R. J. Boucherie and N. M. van Dijk, “Markov decision processes in practice,” in International Series in Operations Research & Management Science, vol. 248. Springer International Publishing, 2017. [Online]. Available: http://link.springer.com/10.1007/978-3-319-47766-4 37

[108] R. S. Sutton and A. G. Barto, Reinforcement learning: an introduction, 2nd ed., ser. Adaptive computation and machine learning series. The MIT Press, 2018. 38

[109] A. L. Samuel, “Some studies in machine learning using the gameof checkers. II- recent progress,” IBM Journal of research and development, p. 17, 1967. 38

[110] R. S. Sutton, “Learning to predict by the methods of temporal differences,” in Machine Learning. Kluwer Academic Publishers, 1988, pp. 9–44. 38, 60

[111] J. Baxter, A. Tridgell, and L. Weaver, “TDLeaf(lambda): Combining temporal difference learning with game-tree search,” arXiv:cs/9901001, 1999. [Online]. Available: http://arxiv.org/abs/cs/9901001 38, 60

[112] X. Ying, “An overview of overfitting and its solutions,” Journal of Physics: Conference Series, vol. 1168, p. 022022, 2019. [Online]. Available: https://iopscience.iop.org/article/10.1088/1742-6596/1168/2/022022 39

[113] P. N. Druzhkov and V. D. Kustikova, “A survey of deep learning methods and software tools for image classification and object detection,” Pattern Recognition and Image Analysis, vol. 26, no. 1, pp. 9–15, 2016. [Online]. Available: http://link.springer.com/10.1134/S1054661816010065 39

[114] K. P. Burnham, D. R. Anderson, and K. P. Burnham, Model selection and mul- timodel inference: a practical information-theoretic approach, 2nd ed. Springer, 2002. 40

[115] M. T. Ribeiro, S. Singh, and C. Guestrin, “”why should i trust you?”: Explaining the predictions of any classifier,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016, pp. 1135–1144. [Online]. Available: https: //dl.acm.org/doi/10.1145/2939672.2939778 40

[116] C. Li and M. Qiu, Reinforcement learning for cyber-physical systems with cyber- security case studies. CRC Press, 2019. 40

[117] I. Goodfellow, Y. Bengio, and A. Courville, “Chapter 7 regularization for deep learning,” in Deep Learning. MIT Press, 2016. [Online]. Available: https://www.deeplearningbook.org/contents/regularization.html 41

[118] S. J. Hanson and L. Y. Pratt, “Comparing biases for minimal network construction with back-propagation,” in Advances in Neural Information Processing Systems 1, D. S. Touretzky, Ed. Morgan-Kaufmann, 1989, pp. 177–185. [Online]. Available: http://papers.nips.cc/paper/156-comparing- biases-for-minimal-network-construction-with-back-propagation.pdf 41

[119] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” 2019. [Online]. Available: http://arxiv.org/abs/1711.05101 41

[120] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” arXiv:1207.0580 [cs], 2012. [Online]. Available: http://arxiv.org/abs/1207.0580 41

[121] N. Srivastava, “Improving neural networks with dropout,” p. 26, 2013. 41

[122] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler, “Efficient Object Localization Using Convolutional Networks,” arXiv:1411.4280 [cs], Nov. 2014. [Online]. Available: http://arxiv.org/abs/1411.4280 41, 114

[123] G. Orr and K.-R. Müller, “Neural networks: tricks of the trade,” in Lecture notes in computer science, no. 1524. Springer, 1998. 41, 116

[124] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” JMLR, p. 30, 2014. 42, 114

[125] H. Shimodaira, “Improving predictive inference under covariate shift by weighting the log-likelihood function,” Journal of Statistical Planning and Inference, vol. 90, no. 2, pp. 227–244, 2000. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S0378375800001154 41

[126] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015. [Online]. Available: http://arxiv.org/abs/1502.03167 41, 42, 124

[127] F. Herrera, “Dataset shift in classification: Approaches and problems,” international Work Conference on Artifical Neural Networks. [Online]. Available: http://iwann.ugr.es/2011/pdf/InvitedTalk-FHerrera-IWANN11.pdf 42

[128] P. Rosin and F. Fierens, “Improving neural network generalisation,” in 1995 International Geoscience and Remote Sensing Symposium, IGARSS ’95. Quanti- tative Remote Sensing for Science and Applications, vol. 2, 1995, pp. 1255–1257 vol.2. 44

[129] Y. Bengio, “Deep Learning of Representations: Looking Forward,” in International Conference on Statistical Language and Speech Processing. Springer, 2013, pp. 1–37. [Online]. Available: http://link.springer.com/chapter/ 10.1007/978-3-642-39593-2 1 44

[130] M. Kirci, N. Sturtevant, and J. Schaeffer, “A GGP Feature Learning Algorithm,” KI-K¨unstlicheIntelligenz, vol. 25, no. 1, pp. 35–42, 2011. [Online]. Available: http://link.springer.com/article/10.1007/s13218-010-0081-8 44

[131] C. Ré, A. A. Sadeghian, Z. Shan, J. Shin, F. Wang, S. Wu, and C. Zhang, “Feature Engineering for Knowledge Base Construction,” arXiv preprint arXiv:1407.6439, 2014. [Online]. Available: https://arxiv.org/abs/1407.6439 44

[132] B. Graham, “Spatially-Sparse Convolutional Neural Networks,” arXiv preprint arXiv:1409.6070, 2014. [Online]. Available: https://arxiv.org/abs/1409.6070 45

[133] J. F. Reed, “Better binomial confidence intervals,” Journal of Modern Applied Statistical Methods, vol. 6, no. 1, pp. 153–161, 2007. [Online]. Available: http://digitalcommons.wayne.edu/jmasm/vol6/iss1/15 51

[134] J. Frey, “Fixed-Width Sequential Confidence Intervals for a Proportion,” The American Statistician, vol. 64, no. 3, pp. 242–249, Aug. 2010. [Online]. Available: http://www.tandfonline.com/doi/abs/10.1198/tast.2010.09140 52

[135] D. Dugovic. (2019, Jan.) Multi-variant fork of popular UCI . Original-date: 2014-07-31T23:11:27Z. [Online]. Available: https://github.com/ ddugovic/Stockfish 53

[136] “Racing Kings • Race your King to the eighth rank to win. • lichess.org.” [Online]. Available: https://lichess.org/variant/racingKings 53

[137] T. Landau, “Othello: Brief & Basic,” 1990. [Online]. Available: http: //www.tlandau.com/files/Othello-B&B.pdf 53

[138] V. Allis, “A Knowledge-Based Approach of Connect-Four: The Game Is Solved: White Wins,” ICGA Journal, vol. 11, no. 4, pp. 165–165, Dec. 1988. [Online]. Available: http://www.medra.org/servlet/aliasResolver?alias=iospress&doi=10. 3233/ICG-1988-11410 54

[139] J. Tromp, “Solving connect-4 on medium board sizes,” ICGA Journal, vol. 31, pp. 110–112, 2018. 54

[140] N. Fiekas, “A pure Python chess library with move generation and validation, PGN parsing and writing, Polyglot opening book reading, Gaviota tablebase probing, Syzygy tablebase probing and UCI/XBoard engine co..” Jan. 2019, original-date: 2012-10-03T01:55:50Z. [Online]. Available: https://github.com/niklasf/python-chess 55

[141] Gunawan, H. Armanto, J. Santoso, D. Giovanni, F. Kurniawan, R. Yudianto, and Steven, “Evolutionary Neural Network for Othello Game,” Procedia - Social and Behavioral Sciences, vol. 57, pp. 419–425, Oct. 2012. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S187704281204668X 56

[142] K. Morishita, “Reversi reinforcement learning by AlphaGo Zero methods.: mokemokechicken/reversi-alpha-zero,” Jan. 2019, original-date: 2017-10- 22T04:16:43Z. [Online]. Available: https://github.com/mokemokechicken/ reversi-alpha-zero 56

[143] R. S. Sutton and B. Tanner, “Temporal-Difference Networks,” arXiv:1504.05539 [cs], Apr. 2015. [Online]. Available: http://arxiv.org/abs/1504.05539 60

[144] J. L. Elman, “Learning and development in neural networks: the importance of starting small,” Cognition, vol. 48, no. 1, pp. 71–99, Jul. 1993. [Online]. Available: http://linkinghub.elsevier.com/retrieve/pii/0010027793900584 61

[145] H. Asoh, S. Hayamizu, I. Hara, Y. Motomura, and S. Akaho, “Socially Embedded Learning of the Office-Conversant Mobile Robot Jijo-2,” in Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence, vol. 2, 1997, pp. 880–885. 61

[146] A. Y. Ng, D. Harada, and S. J. Russell, “Policy invariance under reward trans- formations: Theory and application to reward shaping,” in Proceedings of the Sixteenth International Conference on Machine Learning, ser. ICML ’99. Mor- gan Kaufmann Publishers Inc., 1999, pp. 278–287. 61

[147] A. Y. Ng, H. J. Kim, M. I. Jordan, and S. Sastry, “Autonomous helicopter flight via reinforcement learning,” Neural Information Processing Systems, vol. 16, p. 8, 2004. 61

[148] A. Y. Ng, A. Coates, M. Diel, V. Ganapathi, J. Schulte, B. Tse, E. Berger, and E. Liang, “Autonomous inverted helicopter flight via reinforcement learning,” in Experimental Robotics IX, ser. Springer Tracts in Advanced Robotics, M. H. Ang and O. Khatib, Eds. Springer Berlin Heidelberg, 2006, pp. 363–372. 61

[149] F.-Y. Sun, Y.-Y. Chang, Y.-H. Wu, and S.-D. Lin, “Designing non-greedy reinforcement learning agents with diminishing reward shaping,” in Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, ser. AIES ’18. Association for Computing Machinery, 2018, pp. 297–302. [Online]. Available: https://doi.org/10.1145/3278721.3278759 61

[150] J. Fu, K. Luo, and S. Levine, “Learning robust rewards with adversarial inverse reinforcement learning,” in International Conference on Learning Representations, 2018. [Online]. Available: http://arxiv.org/abs/1710.11248 61

[151] Y. J. Lee and K. Grauman, “Learning the easy things first: Self- paced visual category discovery,” in CVPR 2011. Colorado Springs, CO, USA: IEEE, Jun. 2011, pp. 1721–1728. [Online]. Available: http: //ieeexplore.ieee.org/document/5995523/ 61

[152] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum Learning,” in Proceedings of the 26th Annual International Conference on Machine Learning, ser. ICML ’09. New York, NY, USA: ACM, 2009, pp. 41–48. [Online]. Available: http://doi.acm.org/10.1145/1553374.1553380 61

[153] L. Jiang, D. Meng, Q. Zhao, S. Shan, and A. G. Hauptmann, “Self-paced curriculum learning,” in AAAI Publications, Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015, p. 7. 62

[154] M. P. Kumar, B. Packer, and D. Koller, “Self-paced learning for latent variable models,” Neural Information Processing Systems, p. 9, 2010. 62

[155] C. Florensa, D. Held, and M. Wulfmeier, “Reverse Curriculum Generation for Reinforcement Learning,” p. 14, 2017. 62

[156] R. G. Newcombe, “Two-sided confidence intervals for the single proportion: com- parison of seven methods,” Statistics in Medicine, vol. 17, no. 8, pp. 857–872, 1998. 65, 130, 138

[157] D. Silver. Keynote david silver NIPS 2017 deep reinforcement learning symposium AlphaZero. [Online]. Available: https://www.youtube.com/watch? v=A3ekFcZ3KNw 75

[158] Y. Bengio, “Practical recommendations for gradient-based training of deep architectures,” arXiv:1206.5533 [cs], 2012. [Online]. Available: http: //arxiv.org/abs/1206.5533 114, 117

[159] K. Dunnigan and S. Consulting, “Confidence Interval Calculation for Binomial Proportions,” p. 13, Nov. 2008. 129, 138

[160] S. Belia, F. Fidler, J. Williams, and G. Cumming, “Researchers Misunderstand Confidence Intervals and Standard Error Bars.” Psychological Methods, vol. 10, no. 4, pp. 389–396, 2005. [Online]. Available: http://doi.apa.org/getdoi.cfm? doi=10.1037/1082-989X.10.4.389 129

[161] M. G. E. Verdam, F. J. Oort, and M. A. G. Sprangers, “Significance, truth and proof of p values: reminders about common misconceptions regarding null hypothesis significance testing,” Quality of Life Research, vol. 23, no. 1, pp. 5–7, 2014. [Online]. Available: http://link.springer.com/10.1007/s11136-013-0437-2 129

[162] Z. Ali and S. B. Bhaskar, “Basic statistical tools in research and data analysis,” Indian Journal of Anaesthesia, vol. 60, no. 9, pp. 662–669, Sep. 2016. [Online]. Available: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5037948/ 130

[163] P. I. Good and J. W. Hardin, Common errors in statistics (and how to avoid them), 4th ed. Wiley, 2012. 132

[164] R. R. Wilcox, Fundamentals of Modern Statistical Methods. Springer New York, 2010. [Online]. Available: http://link.springer.com/10.1007/978-1-4419-5525-8 133

[165] J. Krueger, “Null hypothesis significance testing: On the survival of a flawed method.” American Psychologist, vol. 56, no. 1, pp. 16–26, 2001. [Online]. Available: http://doi.apa.org/getdoi.cfm?doi=10.1037/0003-066X.56.1.16 133

[166] A. Papoulis and S. U. Pillai, Probability, random variables, and stochastic processes, 4th ed. McGraw-Hill, 2002. 133

[167] W. J. Stewart, “Discrete distribution functions,” in Probability, Markov Chains, Queues, and Simulation, ser. The Mathematical Basis of Performance Modeling. Princeton University Press, 2009, pp. 115–133. [Online]. Available: https://www.jstor.org/stable/j.ctvcm4gtc.9 133

[168] H. Chen, “The accuracy of approximate intervals for a binomial parameter,” Journal of the American Statistical Association, vol. 85, no. 410, pp. 514–518, 1990. 133

[169] W. J. Stewart, “Random variables and distribution functions,” in Probability, Markov Chains, Queues, and Simulation, ser. The Mathematical Basis of Performance Modeling. Princeton University Press, 2009, pp. 40–63. [Online]. Available: https://www.jstor.org/stable/j.ctvcm4gtc.6 133

[170] A. Wald, “Tests of Statistical Hypotheses Concerning Several Parameters When the Number of Observations is Large,” Transactions of the American Mathematical Society, vol. 54, no. 3, pp. 426–482, 1943. [Online]. Available: www.jstor.org/stable/1990256 138

[171] Y. Guan, “A generalized score confidence interval for a binomial proportion,” Journal of Statistical Planning and Inference, vol. 142, no. 4, pp. 785–793, Apr. 2012. [Online]. Available: http://www.sciencedirect.com/science/article/ pii/S0378375811003569 138

[172] A. Agresti and B. A. Coull, “Approximate is Better than “Exact” for Interval Estimation of Binomial Proportions,” The American Statistician, vol. 52, no. 2, pp. 119–126, May 1998. [Online]. Available: http: //www.tandfonline.com/doi/abs/10.1080/00031305.1998.10480550 138

[173] S. E. Vollset, “Confidence intervals for a binomial proportion,” Statistics in Medicine, vol. 12, no. 9, pp. 809–824, 1993. [Online]. Available: http://onlinelibrary.wiley.com/doi/abs/10.1002/sim.4780120902 138

[174] E. Wilson, “Probable Inference, the Law of Succession, and Statistical Inference: Journal of the American Statistical Association: Vol 22, No 158,” Journal of the American Statistical Association, vol. 22, no. 158, pp. 209–212, Jun. 1927. 138

[175] K. R. Murphy, B. Myors, and A. H. Wolach, Statistical power analysis: a simple and general model for traditional and modern hypothesis tests, 4th ed. Routledge, Taylor & Francis Group, 2014. 141

[176] F. Mosteller, “A k-sample slippage test for an extreme population,” in Selected Papers of Frederick Mosteller, ser. Springer Series in Statistics, S. E. Fienberg and D. C. Hoaglin, Eds. Springer, 1948, pp. 101–109. [Online]. Available: https://doi.org/10.1007/978-0-387-44956-2 5 141

[177] S. Wallis, “Binomial Confidence Intervals and Contingency Tests: Mathematical Fundamentals and the Evaluation of Alternative Methods,” Journal of

Quantitative Linguistics, vol. 20, no. 3, pp. 178–208, Aug. 2013. [Online]. Available: http://www.tandfonline.com/doi/abs/10.1080/09296174.2013.799918 143

[178] A. Wald, “Sequential Tests of Statistical Hypotheses,” The Annals of Mathematical Statistics, vol. 16, no. 2, pp. 117–186, Jun. 1945. [Online]. Available: https://projecteuclid.org/euclid.aoms/1177731118 149

[179] J. Pearl, “Some recent results in heuristic search theory,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-6, no. 1, pp. 1–13, 1984. 181

[180] U. Larsson, R. J. Nowakowski, and C. P. Santos, “Absolute combinatorial game theory,” 2016. [Online]. Available: https://arxiv-org.ezp01.library.qut.edu.au/ abs/1606.01975 182

[181] J. C. Kolecki, “An introduction to tensors for students of physics and engineer- ing,” p. 29, 2002. 186

Glossary

Agent An entity which operates within an environment, which may be artificial or not. An agent uses a representation of the environment's state and makes decisions regarding which actions it will take, resulting in changes to the environment. An agent has a set of actions A = {a_1, ..., a_n} which it can take in the environment, and from this set a smaller subset of legal actions exists depending on the state. 1

Baseline-Player A baseline-player is a game player with some known prior quality of play, which we then use as the baseline against which to measure any improvement in performance from the methods used in this thesis. The baseline-player has identical architecture and parameters to the tested player, with the exception of the variations specified in the experiment. We first use an unmodified Alpha-Zero inspired baseline-player; however, once we establish that our methods have resulted in an improvement, we use the improved player as the baseline-player for later experiments. 45

Branching Factor The branching factor of a game tree is a measure of how many branches each child node has on average. Formally, for a game tree with a depth of d, a branching degree of b and with terminal positions distributed as F, the branching factor R_A is defined as R_A(b, F) = lim_{d→∞} [I_A(d, b, F)]^{1/d} according to Pearl [179]. The branching degree for a given node is its number of children. 16
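The same definition in display form, for readability. This is our transcription of the garbled formula above; our reading of Pearl [179] is that I_A(d, b, F) denotes the expected number of nodes examined by the search algorithm A, which the glossary entry itself does not spell out.

```latex
R_A(b, F) \;=\; \lim_{d \to \infty} \big[\, I_A(d, b, F) \,\big]^{1/d}
```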


Brute-Force Methods Explores as much of the environment as possible before making a decision. There are a number of tree-search methods which use some knowledge-based heuristic that are still categorised as brute-force, however true brute-force methods have no prior knowledge. 2

Combinatorial Board Game Larsson et al. define a combinatorial game as “a perfect information game (without dice and hidden cards) such as Chess, Mancala and Go. There are two players, and the game is represented by a finite game tree, with the root as the starting position and two types of edges representing each player’s move options. Information about the outcome of a game is given at the terminal level of the leaves” [180]. 7 Confidence Level The statistical probability that a prediction obtained from a sample population is also true for the actual population.8 Convolutional Neural-Networks Convolutional neural-networks (CNN) are a specific type of feed-forward neural-network where individual neurons process overlapping perspectives of the input. CNNs have typi- cally been used in many visual applications as they can identify features and objects in an image [79].8

Directly Informed Experience An experience which is directly informed ob- tains value information directly from the envi- ronment from a terminal-state. The value es- timate for the experience’s root-node receives first-hand knowledge from the environment. 83

End-Game-First Curriculum An end-game-first curriculum is one where the initial focus is to learn the consequences of actions near a terminal state/goal-state and then progressively learn from experiences that are further and further from the terminal state. 7

Environment A problem space that agents operate within. An environment at any time step t is represented by a state s_t ∈ S, where S represents all possible states of the environment. For any state s_t there exists a set of legal actions A(s_t) which an agent can take. An agent chooses a single action a_t ∈ A(s_t), causing the environment to transition from s_t to s_{t+1}. For this thesis the environment is a board game. 4, 35

Experience-Buffer A collection of experiences which were created during self-play and are used for training the neural-network. The experience-buffer has a maximum size, measured in number of games. 7
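The state/action notation above maps onto a simple programming interface. The sketch below is our own illustrative rendering in Python; the class and method names are assumptions, not the thesis implementation.

```python
from abc import ABC, abstractmethod
from typing import Any, List, Optional

class GameEnvironment(ABC):
    """Illustrative rendering of the glossary's environment: states s_t,
    legal actions A(s_t) and transitions to s_{t+1}."""

    @abstractmethod
    def legal_actions(self, state: Any) -> List[Any]:
        """Return A(s_t), the actions available to the agent in this state."""

    @abstractmethod
    def step(self, state: Any, action: Any) -> Any:
        """Apply a_t in s_t and return the successor state s_{t+1}."""

    @abstractmethod
    def terminal_value(self, state: Any) -> Optional[float]:
        """Return the environmental reward if the state is terminal, else None."""
```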

Generic Agent A generic agent is an agent designed for a broad problem domain or class of problem as opposed to a general agent which is an agent used across domains.4

Inadequately Trained Neural Network A network can be inadequately trained either due to poor training or insufficient training. 60

Incremental-Buffer Priming Incremental-buffer priming is when a player starts with a small experience-buffer size and increases the buffer size as training progresses. 100

Inference The process of using a neural-network to make a prediction. 26

Informed distance The distance, from a terminal-node, that reward information will have propagated from the terminal-state within the agent's internal mapping of the environment. The informed distance is the upper limit for the visibility-horizon. 79

Internal Knowledge Instead of thinking of problems as being solved by brute-force or knowledge-based approaches [13], we consider these approaches as having the knowledge either obtained on-demand or stored internal to the player. Internal knowledge is a knowledge-based approach where information about the environment is stored and recalled when needed. The prior knowledge may be programmed by a human or obtained by the agent. 11

Knowledge-Based Methods Uses the knowledge of the environment’s un- derlying structure to make decisions.2

Learning Resistance Learning resistance is a short-term resistance to network improvement, which may be caused by reaching a local minimum or by some other aspect of the game dynamics. 73

Monte-Carlo Tree Search MCTS is a tree search method which uses Monte-Carlo (random) methods to estimate a node’s value. MCTS is a best-first tree search algorithm which asymmetrically builds a game tree based on the most promising branches. MCTS is used in numerous game playing agents and is one of the most widely used general-game decision-making algorithms. Typically, the expansion of the tree is guided by estimating the upper confidence bound for a node’s value by calculating the upper confidence bound for trees (UCT).

Monte Carlo Monte-Carlo methods use random sampling to estimate unknown values by conducting a random trial and inferring the value from the resulting distribution.
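A minimal, self-contained sketch of a Monte-Carlo value estimate follows; the toy take-away game and the function names random_playout and mc_estimate are invented here purely for illustration. Random play-outs are run from a position and their outcomes averaged to estimate its value:

    import random

    def random_playout(pile):
        """Toy game: players alternately remove 1-3 stones and whoever takes
        the last stone wins. Returns +1 if the player to move wins, else -1."""
        to_move = +1
        while True:
            pile -= random.randint(1, min(3, pile))
            if pile == 0:
                return to_move            # this side took the last stone
            to_move = -to_move

    def mc_estimate(pile, trials=10_000):
        # Average the random-trial outcomes to estimate the position's value.
        return sum(random_playout(pile) for _ in range(trials)) / trials

    print(mc_estimate(4))                 # estimated value of a 4-stone position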

On-Demand Knowledge Instead of thinking of problems as being solved by brute-force or knowledge-based approaches [13], we consider these approaches as having the knowledge either acquired on-demand or stored internal to the player. On-demand knowledge is a brute-force approach where information about the environment is obtained when needed by searching the environment.

Priming The Macquarie Dictionary (2002) definition of priming is “to prepare or make ready”. In the context of this thesis, priming a neural-network is preparing it for training so that learning can be achieved more efficiently.

Reference-Opponent The reference-opponent is a standardised opponent used for competitions in a particular game. The reference-opponent has a pre-defined standard of game-play, and both the tested player and the baseline-player compete in a competition against the reference-opponent.

Self-Play The process of playing both sides of a two-player game. An agent can use self-play to explore the environment by alternating between players and taking turns which maximise the player’s rewards. For the agents in this thesis, self-play is used to create examples which are used to train the agent.

Strategic Game/Situation A “scenario or situation where, for two or more (players), their choice of action or behaviour has an impact on the others” [9].

Tensor A tensor is an algebraic construct encapsulating multi-linear variables of any rank. For example, a tensor of rank 0 is commonly called a scalar, a tensor of rank 1 is called a vector and a tensor of rank 2 is commonly called a matrix. Tensors of rank greater than 2 are commonly called n-dimensional matrices [181]. For example, a colour image may be represented as a stack of 3 matrices, each of size $x \times y$, one for each colour red, green and blue, or as a stack of four matrices, one for each of the colours cyan, magenta, yellow and black. It is more concise to generalise the possible size of the tensor than to seek to have it specified in any definition. For the purposes of this thesis we consider a tensor as being an algebraic descriptor for a variable of indeterminate rank (a brief illustrative sketch follows the next entry).

Terminal Priming Terminal priming is the method of creating experiences for every discovered terminal node during a search.
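The brief sketch referred to in the Tensor entry above, using NumPy (assumed here purely for illustration) to show tensors of increasing rank:

    import numpy as np

    scalar = np.array(3.0)                # rank 0: a scalar
    vector = np.array([1.0, 2.0, 3.0])    # rank 1: a vector
    matrix = np.zeros((4, 4))             # rank 2: a matrix
    rgb_image = np.zeros((3, 32, 32))     # rank 3: three colour planes of size 32 x 32

    print(scalar.ndim, vector.ndim, matrix.ndim, rgb_image.ndim)  # prints: 0 1 2 3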

Uninformed Experience An experience that is uninformed has no information about the environment.

Upper Confidence Bound for Trees Upper Confidence Bound for Trees (UCT). UCT utilises the distribution of the results obtained from a play-out to estimate the upper confidence value for a node. The node with the highest upper confidence value is selected for exploration; this is the most promising path. The upper confidence value is calculated by Equation 2.1. The UCT formula balances the exploration of rarely visited paths with the exploitation of paths which win more frequently.
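As an illustrative sketch only, and not a reproduction of the thesis’s Equation 2.1, the UCT selection value is commonly written in the standard form below; the symbols $\bar{X}_i$ (mean play-out reward of child $i$), $n_i$ (the child’s visit count), $N$ (the parent’s visit count) and $c$ (an exploration constant) are naming assumptions made here for illustration:

\[
UCT(i) = \bar{X}_i + c \sqrt{\frac{\ln N}{n_i}}
\]

The first term favours exploitation of children whose play-outs win frequently, while the second term grows for rarely visited children, encouraging exploration, in line with the balance described above.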

Visibility-Horizon The distance from a terminal-node within which nodes are sufficiently informed to make accurate predictions and, as such, are useful for training; i.e. extending from the terminal-node up the tree for some distance over which the agent is still able to make informed decisions.