Self-Play Deep Learning for Games

Maximising Experiences

by Joseph West, BE Electrical (Hons), BSc Computers, MEng Sci
School of Electrical Engineering and Robotics
Queensland University of Technology

A dissertation submitted in fulfilment of the requirements for the degree of Doctor of Philosophy, 2020.

Keywords: Machine Learning, Deep Learning, Games, Monte-Carlo Tree Search, Reinforcement Learning, Curriculum Learning.

In accordance with the requirements of the degree of Doctor of Philosophy in the School of Electrical Engineering and Robotics, I present the following thesis entitled,

Self-Play Deep Learning for Games: Maximising Experiences

Statement of Original Authorship

The work contained in this thesis has not been previously submitted to meet requirements for an award at this or any other higher education institution. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference is made.

Signed, QUT Verified Signature

Joseph West 4 September 2020


Computer: Shall we play a game?

David: Love to. How about Global Thermonuclear War.

Computer: Wouldn’t you prefer a good game of Chess?

David: Later. Let’s play Global Thermonuclear War.

- WarGames, 1983

Acknowledgements

I would like to thank my wife Kerry for her patience, for supporting my indulgence in undertaking this research, and for her willingness to act as first reader for a topic she had no prior knowledge of. I would also like to acknowledge my children Joey and Mitchell for being mature enough to allow me to focus on my research, sometimes at the expense of spending more time with them.

To my supervisory team, Frederic, Simon and Cameron: I have enjoyed working with you, and whilst challenging, the process has been overwhelmingly positive. Thank you for your patience, mentoring and time.

This project has been a large consumer of computing resources and would not have been possible without the QUT High Performance Computing (HPC) resources and the support staff that maintain the system. The HPC staff have been responsive to my unique needs and their efforts are greatly appreciated. I would like to thank the Signal Processing, Artificial Intelligence and Vision Technologies (SAIVT) project for allocating additional resources to me for use in this research.

Finally, I would also like to acknowledge Dr Ross Cooper BSc (App Chem) Hons PhD Grad Dip Ed (Sec) CEng MIChemE RPEQ 19081 for reviewing the introduction and abstract with a view to confirming clarity for the benefit of readers without machine-learning expertise.


Abstract

Humans learn complex abstract concepts faster if examples are curated. Our education system, for example, is curriculum based, with problems becoming progressively harder as the student advances. Ideally, problems which are presented to the student are selected to ensure that they are just beyond the student's current capabilities, making the problem challenging but solvable. Problems which are too complex or too simple can reduce the speed at which the student learns; i.e. the efficiency of learning can be improved by presenting a well curated selection of problems. The idea that a student can effectively learn by randomly selecting a set of problems from the entire subject, regardless of complexity, is so absurd that it is not even under consideration as a possible methodology in our educational system; however, this is exactly the approach the current state-of-the-art machine-learning game playing agent uses to learn. The state-of-the-art game playing agent is a combined neural-network/tree-search architecture which generates examples by self-play, with these examples subsequently used to train the agent in how to play the game. New examples are generated from the entire problem domain with no consideration of the complexity of the examples or the ability of the agent.

In this thesis we explore methods of curating self-play experiences which can improve the efficiency of learning, focusing on enforcing a curriculum within the agent's training pipeline. We introduce the end-game-first training curriculum, a new approach which is focused on learning from the end-game first, resulting in reduced training time relative to a comparable player not using the curriculum. We further extend the end-game-first curriculum by automatically adjusting the problem complexity based on the agent's performance, showing that an agent with an automated end-game-first curriculum outperforms a comparable player not following a curriculum. Self-play is resource intensive, and throughout this thesis the underlying premise of our research is to extract the maximum benefit from the generated experiences. The improvements presented in this thesis can be attributed to improving the efficiency of the agent's training loop by more effective use of self-play experiences. We use the games of Connect-4, Reversi and Racing-Kings to demonstrate the generality of our methods.


Contents

1 Introduction
  1.1 Computer Game Playing
    1.1.1 Game Theory
    1.1.2 Performance of Artificial Game Players
  1.2 Artificial Intelligence
  1.3 Real-World Games
  1.4 Research Contributions
  1.5 Publications
  1.6 Thesis Outline

2 Literature Review
  2.1 Overview
  2.2 Historical Background
    2.2.1 Game Theory
    2.2.2 Computers Playing Board Games
    2.2.3 Computer Game Playing Approaches
    2.2.4 Playing the Game of Go
    2.2.5 General Game Playing
    2.2.6 AlphaZero's Evolution
  2.3 Tree Search
    2.3.1 Search Methods
    2.3.2 Monte Carlo Tree Search
  2.4 Neural Networks
    2.4.1 Machine Learning Problem Types
    2.4.2 Neural Network Architectures
    2.4.3 Neural Network Learning
    2.4.4 Overfitting

3 Evaluation Framework
  3.1 Introduction
  3.2 Game Environment
    3.2.1 Game Representation
  3.3 Neural-Network/Tree-Search Game Player
    3.3.1 Neural-Network
    3.3.2 Training the Neural-Network
    3.3.3 Tree Search
    3.3.4 Move Selection
    3.3.5 Parameter Selection
    3.3.6 AlphaGo Zero Inspired Player
  3.4 Reference-Opponents
    3.4.1 MCTS Opponent
    3.4.2 Stockfish Opponent
  3.5 Games
    3.5.1 Connect-4
    3.5.2 Racing-Kings
    3.5.3 Reversi
    3.5.4 Searching State-Space of Games
  3.6 Summary

4 End-Game-First Curriculum
  4.1 Introduction
    4.1.1 Weakness of the Current Approach
  4.2 Related Work
    4.2.1 Using a Curriculum for Machine Learning
  4.3 Method
    4.3.1 Curriculum-Player
  4.4 Experiment
    4.4.1 Experimental Conduct
  4.5 Results
  4.6 Discussion
    4.6.1 Success Rate vs Time Comparison
    4.6.2 Success Rate vs Steps Comparison
    4.6.3 Success Rate vs Epoch Comparison
    4.6.4 Using a Linear Curriculum Function
    4.6.5 Limitations
    4.6.6 Curriculum Considerations
  4.7 Summary

5 Automated End-Game-First Curriculum
  5.1 Introduction
  5.2 Prerequisites
    5.2.1 Visibility-Horizon
    5.2.2 Performance of the Agent
  5.3 Method
  5.4 Evaluation
  5.5 Results
  5.6 Discussion
    5.6.1 Comparison With Baseline-Player
    5.6.2 Comparison with Handcrafted Curriculum
  5.7 Summary

6 Priming the Neural-Network
  6.1 Introduction
  6.2 Terminal Priming
    6.2.1 Method
    6.2.2 Results
    6.2.3 Discussion
  6.3 Incremental-Buffering
    6.3.1 Method
    6.3.2 Results
    6.3.3 Discussion
  6.4 Summary

7 Maximising Experiences
  7.1 Introduction
  7.2 Early Stopping For Reinforcement-Learning
    7.2.1 Method
    7.2.2 Results
    7.2.3 Discussion
  7.3 Spatial Dropout
    7.3.1 Method
    7.3.2 Results
    7.3.3 Discussion
  7.4 Summary

8 Pairwise Player Comparison With Confidence
  8.1 Introduction
  8.2 Hypothesis Testing
    8.2.1 The Pairwise Player Comparison Problem
    8.2.2 Null-Hypothesis Testing
    8.2.3 Determining Significance
  8.3 Confidence Bounds
    8.3.1 Confidence Bound Convergence
    8.3.2 Confidence Bounds Without a Statistics Library - The Wilson-Score
    8.3.3 Prediction Method
    8.3.4 Types of Error
    8.3.5 Conducting a Trial Game
    8.3.6 Choosing Parameters α, β, δ, n
    8.3.7 The Fixed n Prediction Method
  8.4 Stopping a Contest Early
    8.4.1 Accumulating Error
    8.4.2 Accounting for Additional Error
    8.4.3 Putting it into Practice
  8.5 Summary

9 Conclusion
  9.1 Computer Game Playing Today
  9.2 Key Findings
  9.3 Further Work
  9.4 Longevity of Findings

A Parameters For the Players

Bibliography

Glossary

List of Figures

1.1 Basic Closed-Loop Control System
2.1 Double Dichotomy of Game Space
2.2 State-Space and Game-Tree Complexity
2.3 Games' Location on the Double Dichotomy
2.4 A Go Board
2.5 Classical Development vs General Development
2.6 Partial Tic-Tac-Toe Tree
2.7 MCTS Process
2.8 MCTS Chess Trap State
2.9 Neural-Network Example for Boolean AND
2.10 2D Convolutions
2.11 AlphaGo Hand-Crafted Features
2.12 Feature Learning With CNN
2.13 Deep CNN Architecture For Image Processing
2.14 Deep Residual Network Building Block
2.15 A Deep Residual Network
2.16 AutoEncoder Architecture
2.17 Agent/Environment Interaction
2.18 Cart-Pole Problem
2.19 Cat Detection
2.20 Dropout
2.21 Simple Regression and Covariate Shift
3.1 Representing Games
3.2 Architecture
3.3 AlphaZero and AlphaGo Training Loops
3.4 Epochs and Replacement Rate Parameters
3.5 Connect-4 Game
3.6 Racing Kings Game
3.7 Reversi Game
4.1 Curriculum Player Performance in Racing-Kings
4.2 Curriculum Player Performance in Reversi
4.3 Linear Curriculum Player Performance in Racing-Kings
4.4 Linear Curriculum Player Performance in Reversi


4.5 Curriculum Player Performance using AlphaGo Zero Methods in Racing-Kings
5.1 Example Game Tree for Tic-Tac-Toe
5.2 Visibility-Horizon
5.3 Depth of Search
5.4 Automated End-Game-First Curriculum Player Performance in Connect-4
5.5 Automated End-Game-First Curriculum Player Performance in Reversi
5.6 Automated End-Game-First Curriculum Player Performance in Racing Kings
5.7 Automated End-Game-First Curriculum Player Performance in Reversi
6.1 Terminal Priming Player's Performance in Connect-4
6.2 Terminal Priming Player's Performance in Reversi
6.3 Terminal Priming Player's Performance in Racing-Kings
6.4 Incremental-Buffer Player's Performance in Reversi
6.5 Incremental-Buffer Player's Performance in Racing Kings
7.1 Architecture for Players
7.2 Figure 3.4 Reproduced
7.3 Early-Stopping Player's Performance Playing Reversi
7.4 Early-Stopping Player's Performance Playing Racing-Kings
7.5 Dropout Player's Performance
7.6 Variable Dropout and Fixed Dropout Performance Comparison
8.1 Probability Mass Function
8.2 Lower Confidence Bound Using PMF
8.3 Effect of Varying α
8.4 Observed Type II Errors With Different Confidence Bounds
8.5 Type II Errors When Varying Player Probability Difference
8.6 Relationship Between n and δ for Typical β Values
8.7 Type I Error
8.8 Type II Error
8.9 Average Game When Using Sequential Stopping

List of Tables

1.1 Computing Resources for Research
5.1 Confidence in Experiences Knowledge
6.1 Incremental Priming Parameters
8.1 Prediction Outcomes and Different Error Types
8.2 Parameter Summary for Error Types
8.3 Common Z Values
8.4 Common Parameter Combinations For Confidence Testing
8.5 Common Parameters for our Sequential Stopping Method
A.1 Parameters for Racing-Kings players
A.2 Parameters for Reversi player


Chapter 1

Introduction

1.1 Computer Game Playing

The fascination with automated board game players is older than computers themselves. ENIAC (Electronic Numerical Integrator And Computer), built in the 1940s to calculate artillery trajectories, is considered the first general-purpose computer, yet by the 1950s a number of computer game playing algorithms had already been proposed [1]. Preceding ENIAC, in 1912, El Ajedrecista was the first automated Chess playing machine, able to play out certain end-game positions on a Chess board [2].

The applications of machine-learning1 (ML) are seemingly endless, however much of the success has arisen from applying tailored solutions to each problem. In 1958, Newell et al. thought Chess was the ultimate challenge for artificial intelligence (AI) and stated that a successful Chess machine "would seem to have penetrated the core of human intellectual endeavour" [3]. Newell's view either overestimated the complexity of Chess or significantly underestimated the diversity of human problem solving. When World Chess Champion Garry Kasparov was defeated by Deep-Blue in 1997 [4], the solution, despite being an engineering success, was far from parity with human intelligence; it was instead a highly tailored system which relied on custom hardware, Chess-specific software and brute-force processing power. Despite the accolade of being the first machine to beat a reigning world champion, Deep-Blue was not able to play the game of Checkers [5, 6].

Whilst significant advancements in ML applications have occurred recently, much of the focus is on hand-crafting solutions for each problem, and this is also true of computer game players. The study of generic ML is less prolific, due in part to the immediate gains that can be made by developing a specific solution rather than investigating the potential broad utility of ML agents. Our focus is on general decision making, where an ML agent learns how to make good decisions without human intervention or assistance. Games provide a convenient, well defined and diverse problem set to investigate complex general decision making [7]. The benefit of using games as the test-bed is that many real-world problems can also be expressed in the form of a game, making advances in game playing applicable to a wide range of problems.

1 Machine Learning (ML) and Artificial Intelligence (AI) are often used interchangeably in much of the literature. We discuss this later in this chapter.

1.1.1 Game Theory

Salen and Zimmerman consider a number of possible definitions of games and conclude that a game is "a system in which players engage in an artificial conflict, defined by rules, that results in a quantifiable outcome" [8, Chapter 7]. In our opinion, however, conflict need not be artificial for a system to be classified as a game, as real conflict also qualifies. Carmichael's definition of a game is more concise, stating that a game is simply a strategic situation, i.e. a "scenario or situation where, for two or more (players), their choice of action or behaviour has an impact on the others" [9].

A game can be represented by its state s_t after t time steps. A game has a set of possible moves A = {a_1, ..., a_m}. For each state there exists a subset of the possible moves which are legal actions, L(s) ⊂ A, and taking an action results in a change of the game's state to s_{t+1}; i.e. if F denotes the game's transition function and action a_t is selected, then F : s_t × a_t → s_{t+1}.

Two predominant methods exist for decision making in strategic games: brute-force methods and knowledge-based methods [10]. One brute-force approach is to build a search-tree exploring every possible board position. Practically, the state-space of many interesting games is too large for current technology, so heuristics are used to reduce the tree to a size that is small enough to be processed. One approach to deriving a heuristic is to conduct a Monte-Carlo simulation from a board position, and this is what underpins Monte-Carlo tree search (MCTS) [11], which we discuss in detail in Section 2.3.2.

Knowledge-based methods rely on knowledge of the game's underlying dynamics. Although Deep-Blue's solution incorporated tree search methods, the search was customised for Chess, and only employed when the existing board position was not known, making Deep-Blue a hand-crafted, knowledge-based solution. This is one of the common approaches for game playing agents, where the designer uses their prior knowledge of the game to customise the player.

Artificial Neural Networks have been shown to be universal approximators [12], and with a suitable design can learn the underlying structure of a game. Although typically not listed under knowledge-based approaches in the literature, a neural-network can still be considered a knowledge-based approach. The difference is that a neural-network acquires the knowledge through learning, instead of obtaining it directly from the designer. Currently many applications using neural-networks also have preexisting human-crafted elements. Our key interest is to use a neural-network without any human customisation, preferring instead that the neural-network acquire all of its knowledge through learning.
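The state/action/transition formalism above can be captured in a small, game-agnostic interface. The following Python sketch is illustrative only; the class and method names are our own and do not correspond to the framework described in Chapter 3.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

# Illustrative only: a minimal formalisation of a turn-based game.
# State s_t, move set A, legal moves L(s) subset of A, transition F: (s_t, a_t) -> s_{t+1}.

@dataclass(frozen=True)
class TicTacToe:
    board: Tuple[str, ...] = (".",) * 9   # state s_t: nine cells, '.', 'X' or 'O'
    to_play: str = "X"

    def legal_actions(self) -> FrozenSet[int]:
        """L(s): the subset of the move set A = {0, ..., 8} that is legal in this state."""
        return frozenset(i for i, cell in enumerate(self.board) if cell == ".")

    def transition(self, action: int) -> "TicTacToe":
        """F: s_t x a_t -> s_{t+1}; applying an action yields the successor state."""
        if action not in self.legal_actions():
            raise ValueError(f"illegal action {action}")
        cells = list(self.board)
        cells[action] = self.to_play
        return TicTacToe(tuple(cells), "O" if self.to_play == "X" else "X")
```

A player's policy then maps each state to a choice among legal_actions(), and a game is simply a sequence of transition() calls until a terminal state is reached.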

1.1.2 Performance of Artificial Game Players

Artificial agents can now consistently defeat human professionals in Chess, Checkers, Reversi and numerous other games by using modern tree search techniques and/or neural-networks [13]. The game of Go is one of the more complicated traditional board games. Research on the game of Go began in the late 1960s [14] and became more extensive immediately after Deep-Blue's Chess win in 1997, but despite the years of research effort the best Go agent up until 2015 was Crazy Stone, which achieved a playing standard equivalent to that of a strong amateur by winning occasional games with handicaps on a reduced board size [15, 16]. Crazy Stone employed a variant of the MCTS algorithm with hand-crafted, game-specific heuristics [17]. Neural-networks were also found to be effective in correctly predicting expert moves for the game of Go, and by 2015 were able to equal the performance of state-of-the-art MCTS-based Go agents [18].

The breakthrough in the game of Go came in March 2016 when Google Deepmind's agent AlphaGo defeated 9 dan Go professional Lee Sedol 4 games to 1 in a million dollar challenge by successfully employing a combined neural-network/tree-search architecture [19], notionally advancing the field by a decade [14]. In 2019 Lee Sedol retired from playing Go after a career spanning 24 years, declaring that since AI cannot be defeated he could only be the second best (Chinadaily.com.cn, "Go master Lee Sedol retires, says AI 'cannot be defeated'", Nov 2019).

Much of the success in specialised game playing has come from tailoring solutions to a single game. Although AlphaGo's neural-network/tree-search architecture was somewhat reusable, it still relied heavily on hand-crafted features, a large quantity of existing expert games and a large computational budget to train the agent offline. These elements are not necessarily available if the game cannot be studied in advance, such as when seeking to play a variety of previously unseen games. The current state-of-the-art general game playing agents utilise MCTS [20] due to the algorithm being domain independent. However, MCTS is not optimal for all games, especially if the state-space is large or if terminal states cannot be readily found by random playout. In these situations knowledge-based methods are required to guide the search, and although this may be addressed by employing a general, rules-based solution [6], neural-networks can also be used if sufficient time is provided in advance to learn the game.

The field of general game playing (GGP) encompasses agents which are designed to play a diverse set of games without further human intervention [21]; however, GGP is often a keyword used to describe a specific subfield of general games involving the use and analysis of logic to play general games [20]. When we refer to general game playing we are referring to an agent that is capable of playing a diverse set of games without dictating the mechanism for achieving this - whether it uses logic or not.

1.2 Artificial Intelligence

With the recent advances in low cost processing power and graphics processing units, the applications of machine-learning (ML) have become widespread. However, many of these ML applications are very narrow in their approach to solving a given problem, relying on humans to tailor the system in such a way that the system is unable to be used for any other problem. Whilst the term Artificial Intelligence (AI) is often used broadly as a synonym for ML, one definition of intelligence is the "ability to achieve goals in a wide range of environments" [22]. As such, many of the current AI applications cannot really be considered intelligent. For an artificial agent to succeed in "a wide range of environments" it will be expected to be able to plan and make effective decisions, using either its prior experiences or the ability to explore and analyse the environment's state.

A system that can be applied to absolutely any environment is the aim of artificial general intelligence (AGI), although the current status of research is far from AGI as it is still grappling with applying a single system to different problem classes within the same domain. For instance, cancer detection is one of the more prominent AI application areas, yet there exist many different cancer detection research problems, each with their own unique system: there are unique approaches for the detection of cancers of the skin, head and neck, prostate, pancreas and breast [23, 24, 25, 26, 27, 28, 29]. Whilst a general cancer detection system may be possible, cancer subject-matter-experts, for good reason, focus on solving specific problems which provide the most immediate and beneficial return for their efforts. Whilst a true AGI agent may actually be a fantasy, progressing towards a generic agent2 that is suitable for a single domain will allow subject-matter-experts to advance their fields using machine-learning without requiring them to build custom agents of their own.

The study of generic artificial agents has progressed substantially in the last five years, predominantly using games as the study domain. Initially, our research motivation was the creation of a generic agent for the entire domain of two-player, zero-sum, turn-based board games; however, this was achieved by what is now considered the state-of-the-art generic board game agent AlphaZero [30]. AlphaZero is effective at learning a previously unseen game by self-play, i.e. playing against itself, and when tested it succeeded in beating professional human players and all other game playing agent benchmarks. For this work we use AlphaZero as the baseline agent because, in our opinion, its approach is likely to persist into the future and become more widely adapted to other domains. AlphaZero derives its success from the combination of two different machine-learning methods, tree search and neural-networks, and is the culmination of decades of work by numerous researchers in each of those domains. Regardless of future advancements in tree-search or neural-network methods, AlphaZero's underlying approach will likely remain valid and will itself benefit from any further improvements.

2 We use the term generic agent to mean an agent designed for a broad problem domain or class of problem, as opposed to a general agent, which is an agent used across domains.

Agent                       Self-Play Resources        Competition Resources
AlphaGo [19]                1,920 CPUs + 280 GPUs      1,202 CPUs + 176 GPUs
AlphaGo Zero* [30]          64 GPUs + 19 CPUs          4 TPUs
AlphaZero [31]              5,000 TPUs                 4 TPUs
MuZero [32]                 1,000 TPUs                 4 TPUs
AI Player in this thesis    1 GPU + 10 CPUs            20 CPUs

Table 1.1: The computing resources allocated to Deepmind's game playing agents during self-play/training and during competition. Note the acronyms: Central Processing Unit (CPU), Graphics Processing Unit (GPU) and Tensor Processing Unit (TPU). A TPU is a machine-learning customised GPU variant which is optimised for neural-network inference and gradient-descent, performing 15-30 times better than equivalent GPUs [33]. *An informal estimate (without peer review) for the AlphaGo Zero 40 day experiment is upwards of $35M in processing costs (www.yuzeh.com/data/agz-cost.html). Each artificial player presented in this thesis uses 1 GPU + 10 CPUs per player for training and 20 CPUs for competition play.

AlphaZero's achievements are an exciting development in the progress towards a general agent; however, the self-play reinforcement-learning (RL) method used to train the agent is very processor intensive, costing millions of dollars. AlphaZero uses a two-stage learning pipeline: experiences are generated via self-play, and these experiences are then used to train the agent using typical neural-network gradient-descent methods. Table 1.1 shows the allocated processing resources for different generations of Deepmind's state-of-the-art game playing agents. An informal estimate for the 40 day AlphaGo Zero experiment is in the order of $35M (www.yuzeh.com/data/agz-cost.html, Jan 2019) - well beyond the resources available to many researchers. The table highlights the larger proportion of resources required for self-play compared with that required for competition, indicating a bottleneck in the self-play learning loop. The contributions we present in this thesis improve the efficiency of the self-play learning loop by seeking to minimise the creation of expectedly poor quality experiences and by improving the efficiency with which generated experiences are used. By improving the self-play training loop we also reduce the resources required to achieve similar performance results, making the state-of-the-art approach more widely accessible and applicable to problems without large resource budgets.

In this thesis we zero-in on the self-play learning loop, having identified an inherent weakness with the current method. To address this weakness we present a novel approach which enforces a structure on the training process, improving the training time when compared with a comparable agent not using our approach. We further explore how to improve the self-play learning loop and present a number of novel methods which enhance the current state-of-the-art method. The methods we present here are likely to remain relevant regardless of advancements in tree-search or neural-networks, and by improving the rate at which an agent trains they will allow the state-of-the-art approach to be more widely used.
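As a sketch of the two-stage pipeline described above (generate experiences by self-play, then train on them by gradient descent), the following Python pseudocode shows the shape of the loop. The function signatures are placeholders of our own; buffer management, evaluation rules and all hyper-parameters are deliberately omitted and are not taken from the thesis implementation.

```python
from typing import Any, Callable, List, Tuple

Experience = Tuple[Any, Any, float]  # (state, policy_target, value_target)

def training_loop(network: Any,
                  self_play: Callable[[Any], List[Experience]],
                  train: Callable[[Any, List[Experience]], Any],
                  iterations: int = 1000,
                  games_per_iteration: int = 100) -> Any:
    """Illustrative AlphaZero-style loop: self-play generates experiences,
    gradient descent consumes them. Not the thesis implementation."""
    replay_buffer: List[Experience] = []
    for _ in range(iterations):
        # Stage 1: generate experiences by the network playing against itself.
        for _ in range(games_per_iteration):
            replay_buffer.extend(self_play(network))
        # Stage 2: update the network on the accumulated experiences.
        network = train(network, replay_buffer)
    return network
```

The resource imbalance in Table 1.1 sits in Stage 1: generating experiences dominates the cost of the loop, which is why the later chapters focus on getting more value out of each generated experience.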

Figure 1.1: A basic closed-loop control system from [38]. In the context of game playing the Controller is the agent and the Process is the game environment. The Input to process is the state of the game and the Disturbances are caused by the opponent(s). The Reference is the desired reward, while the Response is the reward obtained from the game. The Error/Deviation is the difference between the desired reward and the received reward, and the Controlled input to process is the actions taken by the agent. The Controller attempts to minimise the Error, and achieves this by learning how to play the game using machine-learning methods.

1.3 Real-World Games

A game can be defined as an interaction between decision makers in a given environment; as such, many real-world problems fall within the remit of game theory [34]. In the pioneering work on game theory by von Neumann, economics and games are discussed interchangeably [35]. Alan Turing's ultimate test of intelligence was also presented in the form of a game [36]. An artificial agent learning to play a zero-sum game is fundamentally seeking to learn an optimal decision making policy within an environment to maximise its reward, while another adversarial actor takes actions to minimise the agent's reward. A strategic situation is one where the outcome for each actor depends on both the environment and the actions of others. In game theory, a game is defined as a strategic situation. The stock market, global economics and military conflict are all considered strategic situations, and there are many real-world problems which can be presented in the form of a game.

Recent advances in computer game playing have resulted in a generic agent being able to learn to play highly complex games through self-play reinforcement-learning, with no game specific customisation [37]. This work, using a cross-section of Atari video games, demonstrated how game playing methods can be used to learn control in a continuous virtual environment. Likewise, board games can be considered to be discrete closed-loop control problems where: the environment's transition function is encapsulated in the game rules, the player acts as the controller seeking to maximise some reward by winning the game, and the opponent adds noise by seeking to minimise the player's reward. Figure 1.1, from [38], shows a basic closed-loop control system, and in the description of the figure we explain how it is analogous to games; a minimal sketch of this loop is given at the end of this section.

Since board games can be viewed as control problems, advances in generic game playing provide a mechanism to explore the possibilities of a general controller which could be applied as an expert system for a diverse set of applications, instead of developing bespoke systems for every problem. In our view, a general controller would need to acquire the knowledge of the environment through learning, rather than being pre-programmed in advance. Board games provide a diverse, well defined domain to explore how a single artificial agent can learn optimal control in a previously unknown virtual environment - without human intervention. It is for these reasons that we use board games, more specifically combinatorial board games, as the problem-set to investigate how a deep learning agent can learn control for a previously unseen environment.

The real-world applications of a general controller are almost unlimited. For example, in the field of Neurology, electrical currents can be used to stimulate parts of the brain to control crippling neurological issues arising from a range of illnesses including Parkinson's Disease, Epilepsy and Post Traumatic Stress Disorder, using a therapy called Deep Brain Stimulation (DBS) [39]. Currently DBS provides a fixed, steady-state stimulation to the patient's brain through surgically implanted electrodes. The problem with a fixed stimulus is that patients' conditions are often variable [40]. Each patient reacts differently to stimulation due to differences in their condition and physiological characteristics; as such, the DBS parameters are found by trial and error, individually for each patient, over a number of months, and once set there is almost no variability to account for changes in the patient. If sufficient advances can be made towards a general controller then it might be possible for an artificial agent to learn continuous adaptive DBS control for each individual patient, restoring normalcy to the lives of many people who have these neurological illnesses [41]. Whilst learning how to play board games seems like a fun topic to research, there are many serious real-world applications that could benefit from this research.
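The sketch below makes the control-loop analogy of Figure 1.1 concrete: the agent is the controller, the game is the process, and the opponent's moves act as the disturbance. It is illustrative only; the interface names are our own and the reward convention (+1 win, -1 loss, 0 otherwise) is an assumption rather than the convention used later in the thesis.

```python
from typing import Any

# Illustrative closed-loop view of a two-player board game (cf. Figure 1.1).
# `game`, `agent_policy` and `opponent_policy` are assumed to expose the
# methods used below; none of these names come from the thesis implementation.

def play_episode(game: Any, agent_policy: Any, opponent_policy: Any) -> float:
    state = game.initial_state()
    while not game.is_terminal(state):
        # Controller output: the agent selects an action to maximise its reward.
        state = game.transition(state, agent_policy.select(state))
        if game.is_terminal(state):
            break
        # Disturbance: the opponent acts to minimise the agent's reward.
        state = game.transition(state, opponent_policy.select(state))
    # Response fed back to the controller: +1 win, -1 loss, 0 draw (assumed convention).
    return game.reward(state, player="agent")
```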

1.4 Research Contributions

The research contributions of this thesis are as follows:

• We present a new curriculum-learning paradigm which we call the end-game-first curriculum and show its utility by using it to improve the training time of a state-of-the-art inspired reinforcement-learning agent on different games. Our method excludes expectedly poor experiences from the experience-buffer. This contribution is explained in Chapter 4.

• We extend the utility of the end-game-first curriculum by presenting a generalised, automated method of applying the end-game-first curriculum, using the relative search-depth of the tree-search to indicate how well trained the neural-network is. We demonstrate the effectiveness of this method on three different games. This contribution is described in Chapter 5.

• We present a new approach for training a reinforcement-learning agent by treating the earliest epochs differently, trading off the risk of overfitting for the speed gained by learning only from a small sample of experiences. This contribution is described in Chapter 6.

• We present a new application of early-stopping principles, traditionally used only in supervised-learning problems, employing them instead during the reinforcement-learning training pipeline. This method is a unique adaptation of an existing method and is explained in Chapter 7.

• We present a new variation of spatial-dropout for convolutional neural-networks (CNN), applying varying spatial-dropout regularisation to the convolutional layers of a residual neural-network based on a calculated estimate of the proportion of the state-space being presented to the network. Our approach is unique in its employment in a neural-network/tree-search architecture where batch-normalisation is already employed. This contribution is explained in Chapter 7. This work is still in progress; however, we have demonstrated from this preliminary work that additional research may yield a further improvement to learning stability.

• We contribute to the understanding and use of statistical hypothesis testing as it applies to computer game playing. We demonstrate the use of hypothesis testing and provide a step-by-step procedure to perform a pairwise comparison of the playing quality of two players with a desired confidence. We then present a new approach permitting the hypothesis trial to terminate early if one player is significantly outperforming the other, while retaining the appropriate confidence level. We take an empirical approach to hypothesis testing, allowing game playing researchers to verify that the approach is effective and demystifying an often misunderstood process. This contribution is explained in Chapter 8.

1.5 Publications

Three papers have arisen from this work:

• Improved Reinforcement-Learning with Curriculum in the Elsevier Journal of Expert Systems with Applications. This paper has been accepted for publication, outlines the effectiveness of an end-game-first curriculum, and contains much of the content from Chapter 4.

• Automated End-Game-First Curriculum for Reinforcement-Learning in the Elsevier Expert Systems with Applications. This paper is in progress and contains much of the content from Chapters 5, 6 and 7.

• Automated Stopping With Confidence for Computer Game Playing Trials in the Elsevier Journal of Artificial Intelligence. This paper is in progress and contains much of the content from Chapter 8.

1.6 Thesis Outline

We first present a literature review in Chapter 2 which covers the history of computer game playing and the components which make up the state-of-the-art general game playing system. The literature review includes an analysis of the state-of-the-art game playing methods and the key elements which contribute to their success. Our research uses the state-of-the-art methods as the baseline against which we compare our results.

Chapter 3 outlines the platform which we have developed to allow us to evaluate the modifications which we cover in this thesis. Chapter 3 also provides an insight into the reasons for selecting particular parameters for the network and the motivation behind the games which we use.

In Chapter 4 we explain our new approach to reinforcement-learning which we call the end-game-first curriculum. Our novel approach, which has been peer reviewed, uses a hand-crafted end-game-first curriculum to improve the training time for a state-of-the-art-inspired neural-network/tree-search game playing agent. One of the limitations of our approach in Chapter 4 is that the method relies on some prior knowledge of the game's complexity, and we address this limitation in Chapter 5.

In Chapter 5 we present a novel method for automating the end-game-first curriculum, without the need for any prior game knowledge. We also identify, for what we believe to be the first time, that the relative search-depth for a neural-network/tree-search agent is a measure of the agent's skill at playing a game. In Chapter 5 we use this knowledge regarding the relative search-depth to control the applied curriculum, improving learning with no game specific customisations.

Throughout this thesis we identify that the cost of generating experiences is a key bottleneck in training for our agent, and in Chapters 6 and 7 we further explore how to obtain the maximum value from generated experiences. We have found that the novel application of existing methods typically used for supervised-learning problems can lead to improved learning speeds for our reinforcement-learning problem.

Finally, the topic of statistical confidence is addressed in Chapter 8. We provide an overview of how to apply confidence intervals to game playing, using empirical results to support the statistical theory. A typical requirement when comparing two machine-learning agents is to identify which player is better and to be able to make a declaration with a level of confidence. This confidence measure is common in a number of scientific fields, however it is less widely used for machine-learning and in particular for game playing agents. In addition to providing an insight into how to apply statistical confidence testing to game playing we also provide, to our knowledge, a new method which permits stopping a contest early while still retaining confidence in the outcome.

Chapter 9 concludes the thesis, reflecting on its key outputs and the direction of future work.

Chapter 2

Literature Review

2.1 Overview

According to van-den Herik and Uiterwijk [13], a game playing agent can use one of two approaches to decide which move to make: either knowledge-based or brute-force methods.

Historically, knowledge-based agents were created by hand-coding the player, although recently neural-networks have been used to pre-learn the dynamics of the game, creating an internalised knowledge within the agent. Neural-networks used in successful game playing agents have often relied on hand-crafted elements to achieve reasonable game performance, making them unsuitable as generic game playing agents. A player using brute-force methods can achieve this by testing every possible combination of the game for the best move to make, and can do this with tree-search methods. For practical reasons most tree-search methods require some element of knowledge, such as a heuristic, to be useful for non-trivial games. Using the heuristic, the agent typically builds a search-tree1 at the moment when the decision is needed, effectively acquiring the knowledge on-demand in order to choose an action. These two approaches lend themselves to different problem types and are discussed further in Section 2.2.1.

Our view differs slightly from van-den Herik and Uiterwijk [13] in that we prefer to consider the two approaches as: one where the knowledge is stored within a system (internal knowledge), and the other where knowledge is acquired on-demand, through exploration (on-demand knowledge). In our opinion the terms used by van-den Herik and Uiterwijk are insufficiently accurate given recent developments in computer game playing. Brute-force tends to imply a comprehensive, naive exploration of the state-space; however, this is rarely purely employed - it could be argued that even the selection of the brute-force methods requires some level of prior knowledge. Likewise, the term knowledge-based for some researchers implies that the agent is pre-programmed with all knowledge a priori - however there are other mechanisms by which an agent can build internal knowledge. Other than a shift in perspective, van-den Herik and Uiterwijk's findings [13] are still valid for our renewed perspective, with on-demand knowledge and internal knowledge mapping directly to brute-force and knowledge-based respectively.

1 or a graph

The literature for modern board game playing agents primarily covers either tree-search or neural-networks as the two predominant machine-learning approaches; however, the two were infrequently used together. The players that combined knowledge-based methods with brute-force methods typically utilised static evaluations and custom heuristics for the search, like Deep-Blue's method [42]. Prior to 2016, all of the artificial players that were capable of playing complex board games to a professional human standard were highly customised to the specific game. However, in 2016 a major breakthrough occurred with Google Deepmind's AlphaGo successfully defeating professional players in the game of Go, a success which was touted as being a decade before its time [19]. Although AlphaGo was highly customised for the game of Go, the methodology was relatively general in its approach. This methodology has since undergone a number of revisions, progressing to the current state-of-the-art generic board game player AlphaZero, which uses no game specific customisations and has demonstrated superhuman performance in a cross-section of board games. AlphaZero requires time before a competition to learn the game, but the learning requires no human involvement. This time requirement exists regardless of whether the game is learned or programmed, since a programmer likewise requires time to program knowledge into a specialised game player. AlphaZero successfully exploits the benefits of both internally stored knowledge and on-demand knowledge, providing a method for learning any board game tabula-rasa2; however, the computing resources required are well beyond those which are available to many researchers.

Our contribution to the field of machine-learning starts here, with AlphaZero as the baseline. We have conducted a deep-dive into AlphaZero's method and have identified a bottleneck in its training loop which can be mitigated by using a structured training curriculum. Addressing this bottleneck reduces the required resources and/or improves the time needed for the agent to learn to the same standard.

In this chapter we review the history of computer board game playing and the key research giving rise to AlphaZero's success. We review the preceding work with two key themes, games and machine-learning, then present an overview of the common machine-learning methods used in game playing.

2.2 Historical Background

2.2.1 Game Theory

Seriousness of Games

Game Theory, described as the "study of conflict and cooperation" by von Stengel and Turocy [43], not only covers abstract games like Chess but also encompasses many real-life situations. The definition of a game is broad: it is "a formal description of a strategic situation" [43], i.e. a situation where the outcomes for each player are dependent on the actions of all players [44].

2 Without any prior knowledge.

With the definition of a game being so broad, many problems in economics, logistics, biology, decision making and conflict are all covered by the definition of a game. Pennachin and Goertzel have even claimed that "nearly every artificial intelligence problem could be brought into the form of a game" [45, Chapter 5]. In Alan Turing's Imitation Game, his proposed test of intelligence for a machine was also presented in the form of a game, where the artificial machine wins if it cannot be distinguished from a human in a blind conversation [36]. Since games are so widely applicable, the study of how to play them optimally has many real-world applications.

Decisions for many real conflict situations are made sequentially, with the opponent's decisions becoming evident as the situation develops [46]. The process of finding a sequence of actions to obtain some goal-state is called search [47, Chapter 3]. Many board games are sequential decision making problems, and research interest in how a machine can play board games has a history which even predates computers. Chess was once considered to be the ultimate challenge of machine intelligence [3]; however, when Deep-Blue finally defeated the reigning world Chess champion, Garry Kasparov, the machine's decision making was not relatable to human thought processes [48].

While psychologists still disagree on the precise definition of human intelligence, broadly speaking intelligence is the "ability to achieve goals in a wide range of environments" [49]. With this definition, agents like Deep-Blue fail to qualify as being intelligent, despite beating the world's best human at the single task it was designed for. A singular-task machine-learning application is termed weak AI and, although not considered intelligent, has underpinned many recent successful applications [47, Chapter 26]. Strong AI, or more formally artificial general intelligence (AGI), is described as a machine which is able to be applied to any problem [47]. A true AGI agent is not on the foreseeable horizon, however exploring how a single agent can learn to play any game is a small step in furthering AGI.

Game Complexity

For games of perfect information, there exists an optimal value function v∗(s) which determines the outcome of the game under perfect play, where s is the state representation of the game environment [50, Chapter 6.3]. Practically, players seek to maximise the value v(s) through their decisions, and the winner is the player with the most accurate estimation of v∗(s), which is reflective of the optimal strategy. Where v∗(s) has been determined for a game, the game is said to have been solved, although there are three levels of solved depending on how comprehensively v∗(s) is defined: ultra-weakly solved, weakly solved and strongly solved [10].

The difficulty in determining the game-theoretic value function, v∗(s), can be quantified by considering the double dichotomy of state-space complexity and game tree complexity. State-space complexity is how many unique states exist in the game, whilst the game tree complexity is "the number of leaf-nodes in the solution search-tree" [10]. Heule and Rothkrantz outline how these two metrics can be used to forecast whether a game will be solved by either brute-force or knowledge-based techniques [10]. In games that are not yet solved, this approach can also be used to predict which methods would be used to achieve super-human performance. According to Heule and Rothkrantz, games which have a low state-space complexity are more likely to be solvable using brute-force methods, whilst games with a low game tree complexity are more likely to utilise knowledge-based methods. Figure 2.1 shows the effect of state-space and game tree complexity on solvability, Figure 2.2 shows the complexity of a selection of common games, and Figure 2.3 shows approximately where in the complexity space these games are positioned. Given recent advancements which we outline in this thesis, it is our view that Category 4 in Figure 2.1 should be updated to read "if solvable at all, then by a combination of both brute-force and knowledge-based methods".

Figure 2.1: A double dichotomy of the game space from van-den-Herik et al. [13]. Whilst this plot is not prescriptive, it indicates the conditions under which brute-force or knowledge-based methods are suitable for a problem. Problems which have a high game tree complexity but low state-space complexity can be solved using brute-force methods, while the inverse condition would use knowledge-based methods. This plot was developed in 2002, and in this author's opinion the description under Category 4 should be "if solvable, then by combined brute-force/knowledge-based methods." Figure 2.2 shows estimated complexity values for different games, and Figure 2.3 shows a comparative plot for these games.

2.2.2 Computers Playing Board Games

The success of a computer-based game playing agent relies primarily on its estimate of the value function v(s), where the actual value function is v∗(s). The closer v(s) is to v∗(s), the better the player. For effective move selection, it is necessary to know either the value function or the relative expected values of all actions for state s. The relative value is the expected return for taking an action from a certain state, and is expressed as a probability distribution across all legal actions from the state s called the policy, π(s).

In his 1950 paper on computer Chess [51], Shannon noted that for trivial games like Nim the exact value function is easily calculated; however, for Chess an approximation of v∗(s) would be required, since the true value of a state only becomes apparent when approaching the terminal state. Exhaustive tree search methods can be used to accurately determine v(s) if the game's state-space is small enough given the available processing budget.

Figure 2.2: State-space and game tree complexity for a selection of games from [13]. Figure 2.3 shows a plot of these values.

Figure 2.3: Games from Figure 2.2 with the game's ID shown on the plot - reproduced from [13]. Overlaying this plot with Figure 2.1 provides some insight into the method required for a game playing agent. ID 15 refers to the game of Pentominoes and cross-referencing with Figure 2.1 indicates that the game is probably category 1 or 2 - it would be solvable either by any method or by brute-force methods. ID 9 is the game of Go, and using Figure 2.1 this game is not solvable; however, recent research has shown that a successful Go player can be obtained using combined brute-force/knowledge-based methods.

Complex games, like Chess, make choosing moves by brute-force methods intractable, as we have explained in the previous section. A conservative estimate is that for Chess there are in excess of 10^120 game states [51]. Despite this large state-space, Deep-Blue's success relied in part on brute-force, evaluating positions at a rate in the order of 10^8 per second [48]. It would be inaccurate, however, to categorise the method purely as brute-force, since the search itself was highly tailored, employing hand-crafted position evaluation. A more accurate description of Deep-Blue is that it uses a combined brute-force/knowledge-based approach. Figure 2.3 shows Chess having both a large state-space complexity and a large game tree complexity, and using Figure 2.1 this would imply that a combined brute-force/knowledge-based approach is required - and the architecture employed by Deep-Blue used a combined approach. Although Chess is not solved, computer Chess players that employ a combined brute-force/knowledge-based method are able to beat professional Chess players with relatively low computing resource budgets.

A game-tree can be constructed for a game, and its branching factor can be used as a measure of the complexity of the game3. The branching factor of a game tree is the number of children each node has on average. The branching factor for the 19x19 game of Go is large, approximately 200 for each board position, compared with 35 for Chess [52], making Go's state-space approximately 10^170 positions [53]. The subtlety of Go having only one piece type per player results in very minor board changes per turn, which can seem visually innocuous; however, well placed individual stones can change the value of a game's state markedly. These subtle visual changes on the board, combined with the large game space, make identifying the value function of Go intractable by brute-force methods, and this is one of the reasons why Deepmind's creation of a superhuman, machine-learning Go player was such an important milestone for machine-learning.
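As a rough, back-of-the-envelope illustration of why these numbers grow so quickly, game-tree size can be approximated as b^d for an average branching factor b and game length d. The values below are indicative assumptions only (b = 35, d = 80 for Chess following Shannon's estimates; b = 200, d = 150 assumed here for Go) and are not figures taken from the thesis.

```python
import math

# Rough game-tree size estimate: b ** d leaf nodes, reported as a power of ten.
# Branching factors and depths are illustrative assumptions, not thesis values.
def game_tree_order_of_magnitude(branching_factor: float, depth: int) -> int:
    return round(depth * math.log10(branching_factor))

print(game_tree_order_of_magnitude(35, 80))    # Chess: ~10^124, consistent with "in excess of 10^120"
print(game_tree_order_of_magnitude(200, 150))  # Go: ~10^345, far beyond exhaustive search
```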

2.2.3 Computer Game Playing Approaches

As outlined in the previous subsection, two approaches exist for solving a game, brute-force or knowledge-based, and depending on the type of game one method is more likely to succeed than the other. A convergent game is one where "every action taken by a player leads the game toward its terminal state", while a non-convergent game is one where the game does not necessarily progress closer to an end with each move [54]. A feature of convergent games is that random play will always result in a terminal state, whilst random play on a non-convergent game can result in an unbounded number of moves. Games where the board fills with each move are usually convergent, like Tic-Tac-Toe, Reversi and Go; whereas games that have piece mobility, like Chess and Checkers, are often non-convergent. Although placing a move limit on a Chess game, as is done in many tournaments, makes the game technically convergent, a Chess game ending in this manner is atypical and not a reflection of the game's dynamics.

3 We discuss how to build a game-tree later in Section 2.3, however at this stage we seek to simply highlight the complexity of the game of Go.

Uninformed tree-search methods require terminal states in order to calculate the optimum action to take; as such, tree-search methods are more effective on convergent games than non-convergent games. To obtain a check-mate in Chess, for example, a King has to be both under threat of capture and trapped by the opponent's pieces. There are a number of possible check-mate configurations, however finding these configurations randomly is difficult4 - making any uninformed tree-search method ineffective.

For some non-convergent games (particularly Chess) retrograde analysis has been used to build end-game databases prior to competition from known terminal states. Retrograde analysis, a backwards induction approach [55], is a method which starts with a terminal state and then successively backs up moves, fully analysing previous states as it progresses toward the start of the game. The depth of the retrograde analysis is limited by the processing resources, and once the database is built, significantly fewer resources are needed to look up a pre-calculated position. For Chess, large end-game databases have been constructed using only those pieces required for victory, i.e. filtering out non-essential pieces for a particular terminal state [56, 57]. This allows a player approaching the end-game to use the database to force a previously known winning configuration. To conduct retrograde analysis a manageable number of terminal states is required, or at a minimum a subset of the most common terminal states can be used.

For the same sized board, the number of terminal piece arrangements for non-convergent games is less than the number of terminal board positions for convergent games, because usually every full board is terminal in a convergent game while only specific arrangements are terminal in non-convergent games. Retrograde analysis relies on a smaller terminal-state-space size, and as such is not typically suitable for convergent games. An end-game database, created from retrograde analysis, was a core component of the Chinook Checkers agent which came close to beating the world champion in 1994, and won the first Man-Machine World Championship in Checkers in 2009 [58]. Although Chinook did not technically beat a human Checkers world champion at the tournament, it was later re-purposed and used to weakly solve5 the game of Checkers.

The game of Reversi is convergent, which renders the creation of end-game databases difficult, because the piece arrangements at the end of the game are rarely repeated. Logistello was the first super-human Reversi agent to defeat a world champion, in 1997 [60]. Logistello employed a hand-crafted evaluation function, derived from patterns on the board, and used this function to inform its search [61]. This approach of using a static evaluator or other heuristic to guide the building of a search-tree has become important for more complex games; that is, using knowledge to improve search.
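The following sketch illustrates the backward-induction idea behind retrograde analysis on a toy scale. It assumes a generic game interface (terminal_states(), terminal_value(), predecessors(), successors(), is_terminal()) of our own invention, considers only win/loss outcomes for simplicity, and is not the construction used for real Chess or Checkers end-game databases.

```python
from collections import deque
from typing import Any, Dict

# Toy retrograde analysis for a two-player game with win/loss terminal values,
# evaluated from the perspective of the player to move. Draws are ignored for
# simplicity; the game interface is an assumed, illustrative API.

WIN, LOSS = 1, -1

def build_endgame_database(game: Any) -> Dict[Any, int]:
    value: Dict[Any, int] = {}
    remaining: Dict[Any, int] = {}          # unresolved successor counts per state
    queue = deque()

    for s in game.terminal_states():
        value[s] = game.terminal_value(s)   # +1/-1 for the player to move in s
        queue.append(s)

    while queue:
        s = queue.popleft()
        for p in game.predecessors(s):
            if p in value or game.is_terminal(p):
                continue
            if value[s] == LOSS:
                # The player to move in p can move into s, which is lost for the
                # opponent: p is therefore a win.
                value[p] = WIN
                queue.append(p)
            else:
                # s is a win for the opponent; p is a loss only once *every*
                # successor of p has been shown to be a win for the opponent.
                remaining.setdefault(p, len(game.successors(p)))
                remaining[p] -= 1
                if remaining[p] == 0:
                    value[p] = LOSS
                    queue.append(p)
    return value
```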

4 We discuss a variant of the game of Chess and Reversi in more detail later in this chapter.
5 For weakly solved games both the result and a strategy for achieving it from the start of the game are known; the result of Checkers is a draw [59].

We have previously discussed that games with perfect information have an optimal value function v*(s) which describes the value of a state s, and that for simple games the optimal value function is easily calculated [51]. For larger games, where the state-space complexity is large, like Chess with a state-space complexity in the order of 10^120 states, an agent is only able to approximate the value function. Methods of accurately approximating v*(s) underpin the research of many game agents. To employ search methods for large state-spaces, some heuristic or knowledge can be used to either reduce the search space or bias it in such a way that the most consequential paths are explored. Numerous tree-search algorithms exist, and one of the more commonly used methods for general games is MCTS.

MCTS is domain independent, asymmetrically building a search-tree along the most promising paths, which are estimated using random playouts. MCTS can be used without any pre-existing domain knowledge since a board position's value is derived from the outcome of a random roll-out. If terminal states are not readily found then MCTS is not likely to be the optimum approach6. MCTS is discussed in more detail later in this chapter, and is an important general knowledge-on-demand approach for game playing.

The game of Go, identified by the number 9 in Figure 2.3, is the most complex board game under consideration in [13]. In accordance with our discussion in Section 2.2.1, a successful Go game playing agent requires both brute-force (knowledge-on-demand) and knowledge-based (internal knowledge) methods to solve this game. Whilst the game of Go is not solved, Google Deepmind employed a combination of both of these methods in AlphaGo, resulting in this combined architecture defeating the world Go champion in 2016 [19].

2.2.4 Playing the game of Go

The game of Go, thought to be one of the oldest board games in the world, originated in China approximately 2,500 years ago [62]. Go is a two player game, played on a 19x19 grid, with players alternately placing their coloured stones on the vacant intersections of the grid. The aim of the game is to surround more territory than the opponent, and despite the fact that the only choice a player makes is where to place their stone, the game itself is particularly complex, with approximately 10^170 unique positions [53].

Figure 2.4: A Go board (goban) showing a game of Go in progress.

On 15 March 2016 Google Deepmind's AlphaGo defeated 9 dan Go professional Lee Sedol in a Go tournament, in what might be described as a watershed moment for AI. Google Deepmind had published their research in the January 2016 edition of Nature [19], having previously defeated a 2 dan Go professional in October 2015.

6 Figure 2.8 and the discussion later in this chapter provide an example of how MCTS can fail to find an optimum decision.

AlphaGo succeeded by using a neural-network to bias both the building of an MCTS tree and the playout used to evaluate non-terminal nodes. One of the key contributions of AlphaGo's system was how it combined existing neural-network and tree-search methods. AlphaGo builds an MCTS tree to determine which move to make. The breadth of the search is reduced by using a CNN to estimate the probability that a successful player would choose a particular move, ensuring that the most promising moves are explored first. AlphaGo also used a new technique for determining the value of non-terminal states: moves from the search were cached and similar moves were repeated during the MCTS playout, making the playout less random [19]. Research independent of Google by Jin and Keutzer in December 2015 also highlights the benefits of a combined neural-network/MCTS solution for Go [63]. Gelly et al. also describe a method to bias MCTS random playouts using board patterns to generate a more realistic playout [64, 65]. AlphaGo's method was similar to Gelly's work in that both the playout and the tree-policy were biased utilising hand-crafted Go features, which limited its generality.
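The following fragment sketches the general idea of using a learned policy to narrow an MCTS search: each edge carries a prior probability from the network, and the selection score mixes the observed value with that prior (a PUCT-style rule in the spirit of AlphaGo's approach; the `Edge` fields and the `c_puct` constant are illustrative choices, not Deepmind's published hyper-parameters).

```python
import math
from dataclasses import dataclass

@dataclass
class Edge:
    prior: float              # P(a|s) from the policy network
    visits: int = 0           # n(s, a)
    total_value: float = 0.0  # sum of backed-up values

    @property
    def q(self):
        return self.total_value / self.visits if self.visits else 0.0

def select_move(edges, c_puct=1.5):
    """Pick the action maximising Q + U, where U is proportional to the
    network's prior and shrinks as the edge accumulates visits."""
    parent_visits = sum(e.visits for e in edges.values())
    def score(edge):
        u = c_puct * edge.prior * math.sqrt(parent_visits + 1) / (1 + edge.visits)
        return edge.q + u
    return max(edges, key=lambda action: score(edges[action]))
```

Moves assigned a high prior by the network are explored first, which is exactly the breadth reduction described above; rarely visited moves with small priors may never be expanded at all.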

2.2.5 General Game Playing

The field of General Game Playing (GGP) requires an agent to be able to play a variety of games without human intervention or prior knowledge of the game [66, Chapter 1]. Figure 2.5 articulates how the development of a GGP agent is differentiated from the development of a single-game agent. With a GGP agent, although the agent is created by the developer, humans have no further involvement beyond this initial development, i.e. for each different game the agent is required to learn by itself. GGP focuses on how to learn, rather than how to solve a particular problem. The Stanford Game Description Language (GDL) [66] provides a full logic-based description of the game environment, and is one of the most utilised game descriptions in GGP research. One particular GGP niche focuses on analysing the game description [20], however requiring the full game description limits the applicability to real-world game applications, as outlined in our game theory discussion in Section 1.1.1. For many real-world situations the environment is not fully known to the agent [22, 49]; instead the agent has to explore the environment in order to determine the relevant dynamics. In order to retain broader relevance to real-world applications we prefer that the game environment is represented as a black box system to the agent, as detailed in Section 3, requiring the agent to discover the game dynamics from a virtual environment. Players employing MCTS are currently considered to be using the state-of-the-art approach for GGP [54, 67, 68, 6, 21], as MCTS is game independent and can be halted at any time. Whilst a number of variations to the MCTS algorithm have been suggested, as covered in Browne's MCTS survey [69], they do not necessarily provide an improvement in all games, and in some cases these variations can result in performance degradation [54].

2.2.6 AlphaZero’s Evolution

Google Deepmind's AlphaGo surprised many researchers by defeating a professional human player at the game of Go in 2016 [19], however it was highly customised to the game of Go.

Figure 2.5: Classic Agent Development vs GGP Agent Development, from [70]. Historically, when a computer game player was created the programmer incorporated their knowledge of the game into the design of the system, making the agent only suitable for the game the programmer had in mind at the time of the design. For GGP development the programmer does not know in advance what the game will be; as such the agent is created in such a way as to acquire the required knowledge of the game itself.

In 2017 AlphaGo Zero was released, superseding AlphaGo with a generic algorithm with no game-specific customisation or hand-crafted features, although it was only tested on Go [30]. The AlphaGo Zero method was quickly validated as being generic with the release of AlphaZero, an adaptation of AlphaGo Zero for the games of Chess and Shogi [30]. Both AlphaGo and AlphaGo Zero had a three stage training pipeline - self-play, optimization and evaluation - as shown in Figure 3.3 and explained in Section 3.3.2. AlphaZero differs primarily from AlphaGo Zero in that no evaluation step is conducted. While AlphaGo utilises a neural-network to bias the MCTS expansion [71, 69], AlphaGo Zero and AlphaZero completely exclude Monte-Carlo rollouts during the tree-search, instead obtaining value estimates from the neural-network. This MCTS process is explained in Section 3.3.3.
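As a rough sketch, the three-stage pipeline can be expressed as a simple loop; the function names (`self_play`, `optimise`, `evaluate`) are placeholders for the components described in Chapter 3, not an actual implementation.

```python
def training_pipeline(network, generations, games_per_generation, win_threshold=0.55):
    """Skeleton of a self-play / optimisation / evaluation loop.

    AlphaZero-style training would drop the evaluation gate and always
    promote the newly optimised network.
    """
    best = network
    for generation in range(generations):
        # 1. Self-play: the current best network generates training examples.
        examples = self_play(best, games_per_generation)

        # 2. Optimisation: train a candidate network on those examples.
        candidate = optimise(best, examples)

        # 3. Evaluation: keep the candidate only if it beats the incumbent.
        if evaluate(candidate, best) >= win_threshold:
            best = candidate
    return best
```

The `win_threshold` gate is the evaluation step referred to above; removing it (and always assigning `best = candidate`) gives the AlphaZero variant of the pipeline.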

2.3 Tree Search

2.3.1 Search Methods

A perfect game player selects the optimal move for a given board position to maximise their return. The challenge in designing a game player is how that player will make their decisions, and one approach is to build a game tree. A tree is a hierarchical data-structure suitable for modelling sequential decision situations [46] like board games. The process of exploring a tree to find a sequence of actions that yield a desired state (goal) is called tree-search. In game theory a tree representation of a game is called a game tree, where each node of the tree is a comprehensive representation of the game's state at a particular time and the edges of the tree (connections between nodes) are the actions available to the player from that game state. The root-node is the top of the tree, and is not exclusively the initial game state because a tree can be built any time after the game has commenced; as such the game tree can be thought of as a collection of sub-trees. Figure 2.6 shows a partial game tree-expansion for the game Tic-Tac-Toe. Even for the simple game of Tic-Tac-Toe, which has been solved, building a complete game tree is a larger task than might be expected, with 3^9 = 19,683 possible board arrangements.

While a board game can be represented by a tree, fully defined trees do not exist for complex games [72]; as such, trees are usually built while the game is in progress, specifically for the move being considered. Given the size and complexity of game trees, a large part of game playing research has been to find and/or improve heuristics which can be used to prune the game tree to a manageable size while it is being built. Commencing at the root-node, a tree can be naively built by:

• storing a reference to every edge (move) from the visited nodes in a queue;
• selecting an edge from the queue and then expanding the tree along that edge;
• adding a new child node to the tree and subsequently its edges to the queue; and
• taking the next edge from the queue, repeating until the tree is fully expanded.

Once the tree is built, the highest value move can be selected. Using this naive method for building a tree, if the queue is in first-in-first-out (FIFO) order then the tree will grow breadth-first, and if the queue is in last-in-first-out (LIFO) order then the tree will grow depth-first.

A straightforward consideration is to realise that one player is attempting to maximise their value while trying to minimise their opponent's value. As the tree deepens the edges alternate between moves for the player and the opponent, and consequently the desire to maximise or minimise that player's value. The max player chooses the maximum value and the min player chooses the minimum value in a process called mini-max. For the reasons outlined above, this naive approach results in a large and often unmanageable tree size. α-β pruning is a process which allows a child node and its entire sub-tree to be removed from the tree (pruned) if a move is found to be worse than the previously found best move, by comparing the resulting nodes' values. For a mini-max tree, pruning occurs when a node's value is less than the previously discovered maximum value for the (max) player or greater than the previously discovered minimum value for the opponent (min player). Applying α-β pruning to a mini-max tree results in the same best move as it would without pruning, but from a smaller sized tree. A mini-max tree with α-β pruning is an improvement over the naive method, however it is still too large to be of practical use for many problems.
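A compact sketch of mini-max with α-β pruning is given below; the `GameState` interface (`legal_moves`, `apply`, `is_terminal`, `evaluate`) is an assumed abstraction for illustration rather than the framework used later in this thesis.

```python
import math

def alphabeta(state, depth, alpha=-math.inf, beta=math.inf, maximising=True):
    """Return the mini-max value of `state`, pruning branches that cannot
    influence the final decision."""
    if depth == 0 or state.is_terminal():
        return state.evaluate()           # heuristic or terminal value, max player's view
    if maximising:
        value = -math.inf
        for move in state.legal_moves():
            value = max(value, alphabeta(state.apply(move), depth - 1, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:
                break                     # beta cut-off: min player avoids this branch
        return value
    else:
        value = math.inf
        for move in state.legal_moves():
            value = min(value, alphabeta(state.apply(move), depth - 1, alpha, beta, True))
            beta = min(beta, value)
            if beta <= alpha:
                break                     # alpha cut-off: max player avoids this branch
        return value
```

The pruning tests (`alpha >= beta` and `beta <= alpha`) are exactly the comparisons against the previously discovered best values described above, and removing them recovers the plain mini-max search.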
In addition to the size of the tree, another fundamental challenge with tree-search is how to estimate the value of non-terminal nodes. With most board games the outcome of the game is not known until the end of the game, and even for those games where there is an incremental score, players typically prioritise the final result over some intermediate value (this is especially true in strategic scenarios), giving primacy to the outcome at the terminal state.

Figure 2.6: A partial mini-max game tree for Tic-Tac-Toe, from Russell et al. [47, Chapter 5]. The player at the root-node is trying to maximise their value by their choice of action, while the opponent at the second level down is attempting to minimise the player's value. The utility shown at the bottom of the tree we call the reward, as it is a value assigned to a win, draw or loss in a game for the player whose turn it is.

Due to this focus on the reward at the terminal state, building the tree as outlined above will provide no valuable information about the game's value function until a terminal-node is reached. For this reason a number of approaches are used to estimate the value of non-terminal states. If the game is known in advance then a hand-crafted evaluation function can be created. If the game is not known in advance, then a random play-out (Monte Carlo7 playout) can be used to estimate a node's value. The MCTS method uses random play-outs to estimate the value of non-terminal nodes, and to obtain an upper confidence bound for a node's value to guide the building of an asymmetrical game tree which prioritises the most promising paths. MCTS is a useful generic algorithm which can be used where there exists no prior knowledge of the game to be played.
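A Monte-Carlo estimate of a non-terminal node's value can be as simple as the sketch below, which assumes the same hypothetical `GameState` interface as the earlier examples, plus a `reward_for(player)` method that returns the terminal reward from a given player's perspective.

```python
import random

def rollout_value(state, player, playouts=100):
    """Estimate the value of `state` for `player` by averaging the rewards
    of random play-outs to a terminal state."""
    total = 0.0
    for _ in range(playouts):
        current = state
        while not current.is_terminal():
            current = current.apply(random.choice(current.legal_moves()))
        total += current.reward_for(player)   # e.g. +1 win, 0 draw, -1 loss
    return total / playouts
```

This estimator needs no knowledge of the game beyond its rules, which is what makes the play-out approach attractive when the game is not known in advance.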

2.3.2 Monte Carlo Tree Search

Kocsis and Szepesvári combined statistical confidence bounds with Monte-Carlo planning to develop the MCTS algorithm, which builds an asymmetric tree along the most promising paths [71]. Browne et al. explain MCTS as "a method for finding optimal decisions in a given domain by taking random samples in the decision space and building a search-tree according to the results" [69]. The strength of MCTS is that it is a statistical anytime algorithm, it is simple and domain independent, it converges to the exact solution, and it uses random play-outs to estimate the value of a given node. It is the combination of the exploration provided by tree-search with the random sampling used to guide the tree-expansion that has resulted in MCTS being the cornerstone of many modern general game AI systems [69].

7 Monte Carlo methods use random sampling to estimate unknown values by conducting a random trial and inferring the value from the resulting distribution.

The MCTS process is shown in Figure 2.7. Starting with the current game state, a root-node is created from that state and a game tree is built; the MCTS process for building the game tree is an iterative loop involving four steps:

1. Select a node to explore, descending the tree using the tree-policy until a leaf of the tree is reached.
2. Add a new node to the tree.
3. Employ the default-policy to play out the remainder of the game until a terminal state is reached.
4. Back-propagate the result of the play-out through the visited nodes and update each node's estimates.

These steps are typically repeated until the allocated time has expired, however the steps can also be performed for a specific number of iterations. For each node in the tree the number of wins w and the number of visits n are maintained. If the play-out results in a win then w and n are both incremented during the back-propagation step for all visited edges; only n is incremented if the play-out is a loss. The default-policy describes how the playout is conducted to find a terminal state, and typically this is done by simply randomly selecting actions until the game is over, i.e. a Monte-Carlo rollout.

The selection of which nodes to explore in step 1 is guided by the tree-policy, and the most commonly used method is the upper confidence bound for trees (UCT). UCT utilises the distribution of the results obtained from the play-outs to estimate an upper confidence value for each node. The node with the highest upper confidence value is selected for exploration; this is the most promising path. The upper confidence value is calculated by Equation 2.1. The UCT formula balances the exploration of rarely visited paths with the exploitation of paths which win more frequently. Once the search has concluded, the edge8 which leads from the root-node to the node with the most visits is the predicted best move. A move could instead be selected by choosing the edge which leads to the highest win-rate, however typically the number of visits is used as the deciding variable.

UCT = w/n + A·√(ln(n_p)/n),    (2.1)

where w is the number of wins of the current node, n is the number of visits for the current node, n_p is the number of visits for the parent node and A is a constant. The number of wins is determined by the play-outs. The UCT formula consists of two elements: the value v = w/n and the exploration term e = A·√(ln(n_p)/n). As the tree is expanded asymmetrically each node has a different number of visits, with the more promising nodes having received more visits.
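Equation 2.1 translates directly into code; the sketch below assumes each child node stores its win count, visit count and that the parent's visit count is known, and treating an unvisited node (n = 0) as infinitely attractive is a common implementation convention rather than part of the formula itself.

```python
import math

def uct_score(wins, visits, parent_visits, a_const=1.4):
    """Upper confidence bound for trees (Equation 2.1)."""
    if visits == 0:
        return math.inf                       # force at least one visit per child
    exploitation = wins / visits              # v = w / n
    exploration = a_const * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration

def select_child(children, parent_visits):
    """Choose the child node with the highest UCT score (the tree-policy)."""
    return max(children, key=lambda c: uct_score(c.wins, c.visits, parent_visits))
```

The constant `a_const` plays the role of A in Equation 2.1, trading off exploitation of high win-rate children against exploration of rarely visited ones.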

8 Recall that an edge is the connector to each node, and for a game tree an edge represents a legal action/move that the player can take from that node.


Figure 2.7: MCTS process figure from Browne et al. [69], showing the four stages: selection, expansion, simulation and backpropagation. The method of expanding the tree is called the tree-policy, and typically the upper confidence bound for trees (UCT) is used to guide the tree-expansion. The default-policy is the approach for conducting the playout from a non-terminal state to a terminal state, and typically this is done by randomly selecting actions until the game is over.

Despite the benefits of MCTS and the numerous variations described in Browne et al.'s MCTS survey paper [69], the highest standard achieved by a Go-playing MCTS agent is that of a good amateur. The reason for the underwhelming performance in Go is that MCTS requires a representative random sample in order to select the best move [73], and the Go game space is too large to achieve this given current processing limitations. A key limitation of random sampling is that trap states hidden within a large search-space can be overlooked; this occurs frequently in non-convergent games like Chess. Figure 2.8 shows a Chess position which has only one forced-win sequence for white within 3 ply. There are approximately 108,000 possible unique combinations of legal actions which can be made within 3 ply, without including repeated board positions arising from moving a piece and then moving it back. This means that there is approximately a 1 in 108,000 chance that a random rollout will find the winning position if repeated positions are excluded, and a much lower probability if repeated positions are included. Trap states like this one are problematic for MCTS due to its reliance on random sampling, and quickly become intractable with only a small number of ply. However, if there were some mechanism to inform the search in such a way that moves near the opponent's King were explored first, then it is reasonable to expect that the winning sequence of actions might be found more readily.

MCTS is not suitable for all games, particularly non-convergent games, however it has been adopted for general problems due to its general nature and is the preferred decision system for GGP, superseding methods using classical AI approaches [6]. With sufficient processing resources MCTS has been shown to converge to the optimal solution when using UCT [71], however the required resources for complex games are beyond the budget of many researchers.

Figure 2.8: Chess problem from [74], white to play and win in 3 ply. Each white piece is labelled A-L and the board is marked indicating each possible cell that a particular white piece could be moved to in the first action, with a corresponding label. The numbers 1-3 indicate the moves required to force white's win within 3 ply. White has approximately 60 moves to choose from, black will then have approximately 30 moves and finally white another 60 moves, giving an estimated number of possible action sequences of 60 × 30 × 60 ≈ 108,000, but only one sequence which is a forced win for white. Naive random sampling of this state-space is unlikely to result in the optimal moves being selected.

2.4 Neural Networks

A problem can take one of two forms: calculating answers from a given function, or calculating the function from a set of answers (examples). The field of machine-learning studies how computers can learn without being explicitly programmed [75]. Linear regression models the relationship between two variables, e.g. y = Mx + C where M and C are constants and y and x are variables. The process of determining the line of best fit for this function can be thought of as a simplistic machine-learning algorithm, since example data-points are used to estimate the relationship between the two variables. Machine-learning can also learn to classify data, with the key difference being that classification is discrete, whilst regression is continuous. For example, the binary outcome of whether a particular object exists in an image or not is a machine-learning classification problem. Real-world problems, however, are rarely linear and typically have many more variables than two. Neural-networks are a machine-learning approach which can be used for multi-dimensional regression or classification problems by learning from examples. For the two-player game-playing regression problem, the aim is to find the optimal value-function from a set of exemplar game experiences which, when employed by a game playing agent, results in the player winning.
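As a concrete illustration of the "learn the function from examples" framing, the snippet below fits y = Mx + C to noisy samples with ordinary least squares using NumPy; the synthetic data and true coefficients are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.5 * x + 1.0 + rng.normal(scale=1.0, size=x.shape)  # noisy samples of y = 2.5x + 1

# Least-squares estimate of M and C from the examples alone.
M, C = np.polyfit(x, y, deg=1)
print(f"estimated M={M:.2f}, C={C:.2f}")      # should be close to 2.5 and 1.0
```

The estimated coefficients can then be used for inference on x values never seen during fitting, which is the same train-then-infer pattern used by the neural-network approaches discussed below, only in a far higher-dimensional setting.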

2.4.1 Machine Learning Problem Types

Machine-learning problems may or may not have exemplar experiences available in advance. For instance, in a typical linear regression problem a series of data-points are known in advance and these are used to build a model which represents the data as closely as possible. This model can then be used to infer additional data-points (extrapolate or interpolate). The process of using a machine-learning model is called inference. The linear regression example is single-shot, in that the model is built once and then utilised as required. This is similar to a number of more complex machine-learning problems, like the identification of an object in an image. In an object identification problem, labelled data consisting of exemplar images with and without the object is collated in advance and the machine-learning agent is trained to recognise the object from the data-set. Once trained, if successful, the agent can be used for inference on images which it has previously not seen. Supervised and unsupervised-learning, which are briefly explained below, are typically single-shot learning problems; however some problems do not have labelled data readily available.

For some problems where examples do not exist in advance, they can be generated while the agent explores the problem space. A robot navigating a new environment, a computer game player, and most control problems are problems where examples can be created through exploration. These types of problems do not necessarily have a defined conclusion to learning and could be called continual learning problems. For a continual learning problem the agent continues to navigate within the environment, generating examples to learn from; this is distinct from single-shot learning, where once the examples are used learning is over. An agent in a continual learning environment will continue to improve up to the capacity of its neural-network, if the system is operating effectively, provided the experiences are accurate and sufficiently diverse. For many continual learning problems, the generated experience utilises feedback from the environment to create the ground-truth label, and for some environments this feedback is only sparsely provided, e.g. occurring only in terminal states. The consequence of sparse feedback is that the agent has to use an estimated ground-truth label where no feedback is given, which can result in examples which are imprecise reflections of the environment. Since the agent is "learning as it goes", it is not able to accurately label the examples until it has received feedback from the environment, which in turn means that the early estimations are inaccurate. This early inaccuracy is expected when employing reinforcement-learning, as discussed in Section 2.4.3, because as training progresses and the agent improves, the generated examples also progressively improve. We focus on this characteristic of reinforcement-learning to improve learning by seeking to prevent generating predictably poor training examples.

2.4.2 Neural Network Architectures

Multi-layer Perceptron

Inspired by biological processes, neural-networks consist of interconnected virtual neurons (perceptrons) which respond to stimulus with an output defined by an activation function Y′(x), which ideally simulates a desired response. Neural-networks have been shown to be universal approximators [76] and are useful for regression problems. Feed-forward neural-networks propagate information from input to output, as opposed to recurrent neural-networks which have bi-directional information flow using a feedback loop. Figure 2.9 shows a simple feed-forward network for the Boolean AND function; although the weights in this example are pre-defined, learning such weights through the process of training is the core challenge for a neural-network. A perceptron layer is also known as a fully connected layer, since weights connect the perceptrons to all neurons in the previous layer. Neural-networks can learn features from noisy data, as demonstrated by being able to discriminate voices from noisy audio signals [77] or identify objects in a poor quality image [78]. This ability to learn under noisy conditions allows neural-networks to learn from approximations rather than exact data-sets. CNNs are one specific type of neural-network which have been effectively used for many image processing problems.

Figure 2.9: Neural-network Boolean AND implementation with truth table. A three-layer perceptron with an input layer (x1, x2), a hidden weighted-summation layer and an output layer.
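The AND network of Figure 2.9 can be reproduced in a few lines; the particular weights and bias below are one valid choice assumed for illustration, since any weights satisfying the truth table would do.

```python
import numpy as np

def perceptron_and(x1, x2, weights=(1.0, 1.0), bias=-1.5):
    """Single perceptron implementing Boolean AND with a step activation."""
    activation = np.dot([x1, x2], weights) + bias
    return int(activation > 0)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, perceptron_and(a, b))      # prints 1 only for (1, 1)
```

Training replaces the hand-picked `weights` and `bias` with values learned from examples, which is the process described in Section 2.4.3.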

Convolutional-Neural-Networks

CNNs are a specific type of feed-forward neural-network where individual neurons process overlapping perspectives of the input. CNNs have typically been used in many visual applications as they can identify features and objects in an image [79]. A convolution is an integral transform9 which can indicate the amount of similarity between the kernel g and the input f, such that [f ∗ g](t) = ∫_0^t f(τ)g(t − τ)dτ [81, Chapter 1]. In image processing, a 2-dimensional convolution can be used to create filters by using pre-defined kernels to transform an image, for example transforming an image into an embossed version of itself as shown in Figure 2.10. A CNN utilises stacks of kernels to transform the input into the decision space, with multi-layer perceptrons representing the kernels. These kernels are often referred to as CNN features because, once they are learned, they represent the features of the problem.


Figure 2.10: 2D Convolution from https://developer.apple.com/[82]. (a) Demonstrates how a patch of the image space is compressed in a convolution to produce a single datapoint. (b) Embossing using a convolution with the original image and the embossed output.
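A direct (unoptimised) 2D convolution can be written in a few lines; the 3×3 emboss-style kernel and toy 5×5 patch below are assumed examples, and real CNN libraries implement the same operation far more efficiently.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D convolution of a greyscale image with a small kernel."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    flipped = kernel[::-1, ::-1]               # true convolution flips the kernel
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * flipped)
    return out

emboss = np.array([[-2, -1, 0],
                   [-1,  1, 1],
                   [ 0,  1, 2]], dtype=float)
patch = np.arange(25, dtype=float).reshape(5, 5)
print(conv2d(patch, emboss))                   # 3x3 filtered output
```

In a CNN the kernel values are not fixed in advance as they are here; they are the trainable parameters, which is why the learned kernels are referred to as features.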

The weights of a CNN are learned using neural-network training methods, and sequentially cascading multiple convolutional layers to create a deep CNN allows higher order features to be learned, as shown in Figure 2.12. An example deep CNN architecture is shown in Figure 2.13. An image can be represented as a tensor with a given height and width extending over several planes, typically with each plane representing a primary colour of the image's pixels. Likewise, a board game with x × y cells and n types of pieces can also be represented as an x × y tensor with n planes, in a similar way to how an image is represented. This insight allows the same neural-network methods used for vision applications to also be used to represent board game inputs. CNNs have underpinned much of the advancement in vision applications [83] and also provide a suitable architecture for board game problems. By using a CNN to learn from board game positions, the CNN learns abstract game features which underlie the structure of the game, essentially seeking to learn the transform from the board position to the value function of the game.

Despite the ability of CNNs to learn features, many solutions still rely on hand-crafted game specific features. Google's AlphaGo employed a CNN in addition to using approximately 142 million hand-crafted small pattern features, as listed in Figure 2.11. It is understood within the field of AI that "learning methods work best when supplied with features of a state that are relevant to predicting the state's value, rather than with just the raw state description" [84, Chapter 3].

9 An integral transform maps the input from its domain to another domain [80].

Figure 2.11: AlphaGo hand-crafted features [19].

The requirement to handcraft features highlights the limitations of many learning algorithms [85]. Whilst a board position (game state) can generally be represented in a similar way to an image, supplementing the feature learning of the neural-network with a pre-processing step which highlights previously known features has been fundamental in computer game-playing neural-network approaches prior to AlphaZero.
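The board-as-image analogy described above can be made concrete as follows; the piece codes and plane ordering are arbitrary choices for illustration, not the encoding used elsewhere in this thesis.

```python
import numpy as np

def encode_board(board, piece_types):
    """Encode an x-by-y board into an (x, y, n) binary tensor,
    one plane per piece type (mirroring the colour planes of an image)."""
    x, y = len(board), len(board[0])
    planes = np.zeros((x, y, len(piece_types)), dtype=np.float32)
    for i, row in enumerate(board):
        for j, cell in enumerate(row):
            if cell in piece_types:
                planes[i, j, piece_types.index(cell)] = 1.0
    return planes

# Example: a 3x3 Tic-Tac-Toe position with planes for 'X' and 'O'.
board = [['X', '.', 'O'],
         ['.', 'X', '.'],
         ['.', '.', 'O']]
tensor = encode_board(board, piece_types=['X', 'O'])
print(tensor.shape)                            # (3, 3, 2)
```

Because the output has the same height-width-channel structure as an image, it can be fed directly to the same convolutional architectures used for vision, without any game-specific pre-processing.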

Deep Residual Neural Networks

Image processing is an important challenge with many direct uses and external applications. A number of competitions exist to determine the best methods of understanding images, including object detection, object localisation and image classification. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [87] was one of the primary challenges allowing the comparison of methods on a standardised dataset. ILSVRC operated independently from 2010 to 2017, and by 2015 the classification error for their dataset had decreased 4.2x (28.2% to 6.7%), with a 1.7x reduction in single object localisation error (42.5% to 25.3%) [87]. The competition is now held on kaggle.com [88].

One of the primary changes giving rise to the improvement in image processing was the structure of the neural-networks. Between approximately 1989 and 2015 [89, 90, 91] deep CNNs were the predominant image processing architecture, as shown in Figure 2.13. A deep CNN is loosely defined in the literature as meaning more than one convolutional layer, with deeper CNNs allowing higher order features to be learned, resulting in better image processing performance. Training a deep neural-network is difficult, and one of the key advancements arising from ILSVRC was the 2015 winner's use of a deep residual network. A deep residual network, as shown in Figure 2.15, was found to be easier to optimise than an equivalent CNN, and as such significantly deeper networks were able to be used [92].

Explained in detail in the next section, neural-networks are most commonly trained by propagating error-reducing adjustments (gradients) backwards from the output to the input. The result of back-propagating gradients is that each preceding layer's calculation is dependent on the previous one, and as a result these gradients can either grow uncontrollably (explode) or vanish [93, 94], especially if there are a large number of layers [95], i.e. a very deep network.

Figure 2.12: Learned features of a three layer CNN, with layer 1 showing fundamental features and each layer having higher order features. Layer 1 has lines and blobs, while layer 2 has recognisable facial features and layer three different facial types. Reproduced from [86].

Figure 2.13: Deep CNN architecture used for image processing, from [87, Wang et al., 2014 Challenge], showing 6 convolutional layers (grey), two dropout layers and the output layer (softmax).

Figure 2.14: Deep residual neural-network basic building block, showing the output of a residual block combining the block's output with the block's input. Residual blocks can be layered like any other neural-network blocks, with numerous residual blocks being combined to create the neural-network. The weight layers shown in the figure can be any neural-network layer; in He et al. they use a CNN [92].

A residual neural-network is structured so that the output of a residual block combines with its input using skip connections, as shown in Figure 2.14. The effect of combining the input and output in this manner is explained fully in He et al. [92]; in short, the skip connection provides a reference point which helps ensure the back-propagated error does not vanish or explode.
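In a modern framework a residual block reduces to a handful of lines; the PyTorch sketch below assumes equal input and output channel counts so the identity skip connection applies (the zero-padded variant from Figure 2.15 is omitted), and the layer sizes are illustrative only.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: two conv layers whose output is added to the input."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)              # skip connection: add the block's input

block = ResidualBlock(channels=64)
print(block(torch.randn(1, 64, 8, 8)).shape)   # torch.Size([1, 64, 8, 8])
```

Stacking many such blocks yields the deep residual architectures shown in Figure 2.15, with the skip connections keeping the back-propagated gradients well behaved.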

2.4.3 Neural Network Learning

Back-propagation

When training a neural-network, its parameters (weights) θ are adjusted to minimise some loss function. To determine how to adjust the parameters, the loss function compares the network's prediction y′(s) with the ground-truth value y(s); this difference is then reduced by adjusting the weights, where s is the input to the neural-network. Hinton provides a simple three step process to describe the training process [96]:

• Provide labelled data - exemplar input data labelled with ground-truth output data i.e. {s, y(s)}.

• Measure the difference between the inferred neural-network output y′(s) and the ground-truth value y(s), giving the error.

• Change the weights θ so the network produces a better approximation, reducing the error between y(s) and y′(s).

The error arising from the difference between the neural-network's inference and the ground-truth can be used to build a difference map with the proportional contribution of each weight to the error, called the error derivative [96]. The error derivative can be incrementally calculated by first processing the output layer, then progressively propagating the error backwards toward the input layers, in a process called back-propagation. The back-propagation algorithm, first proposed by Paul Werbos in 1974 as part of his doctoral thesis at Harvard [97], was at first thought to be only applicable to classification problems, but Werbos again defended the algorithm in 1988, demonstrating its general applicability [98]. The generality of the back-propagation algorithm is now uncontested, and the algorithm is one of the most commonly used methods to calculate the error derivative and enable the adjustment of neural-network weights. To reduce the error, the steepest gradient is found, reflecting the largest change in the loss, and the weights are adjusted accordingly; this method is called gradient-descent.

The back-propagation algorithm is processor intensive, as a gradient for each weight in the network needs to be calculated. It is only relatively recently that the hardware has become available to truly enable deep learning, i.e. neural-networks with large numbers of neuronal layers. Fortunately, arising from a decade of commercial competition in computer video gaming, the development of graphics processing units (GPU) has resulted in consumer-affordable parallel computing hardware. Designed to calculate 3-dimensional rendered scenes on the fly, GPUs are also suitable for back-propagation, and consequently machine-learning applications. To some extent the advancement in 3D computer game graphics hardware has enabled the recent progress in machine-learning.

Depending on the problem type, there are a number of neural-network learning approaches, however the predominant method for updating the weights remains the back-propagation algorithm with gradient-descent. Despite the different problem types that neural-networks can be used for, if they use back-propagation with gradient-descent they require a loss function which is both differentiable and reflective of the difference between the network's prediction and the ground-truth value. The challenge with some problems is that these are not readily available; for example, some problems exist without ground-truth values, while other problems seek to control the long-term outcome of an environment rather than making an immediate decision, as is the case with a classification problem.

Supervised learning is where exemplar inputs are labelled with ground-truth data, and Hinton's three step process above is directly applied to these problems. If ground-truth data is not available then unsupervised-learning can still be used to learn an underlying representation of the problem [85, 100]. With unsupervised-learning the output of the neural-network seeks to reproduce the input, and the input is used as the ground-truth value.
This process still allows the neural-network to learn the dynamics of the problem using essentially the same process as with supervised-learning. An auto-encoder is a typical unsupervised-learning system, with an example architecture shown in Figure 2.16. Once trained, the learned features can be used to supplement other machine-learning or decision making methods [99].

Figure 2.16: Unsupervised learning: auto-encoder architecture from [99], showing the input L1 = x_1..n, the output L3 = x̃_1..n and the learned representation L2 = y_1..m. Note that the learned representation L2 has fewer parameters than the input L1, i.e. m < n. h_w,b(x) is the output of the autoencoder, which seeks to reproduce the input x using the reduced representation y; the input L1 becomes the ground-truth value for estimating loss gradients. Despite not having any specific ground-truth labels, this architecture allows the standard back-propagation algorithm with gradient-descent to learn the network's weights.

To train a network using supervised or unsupervised-learning, the ground-truth value is immediately compared with the inference value; however, in control problems the interest lies in some future reward, not the immediate outcome of a singular action. Reinforcement-learning is a method which is able to learn a sequence of actions in problems where a number of steps are taken before the true consequences of the actions are known.
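Hinton's three steps map directly onto a gradient-descent update; the NumPy sketch below fits a single linear layer with a mean-squared-error loss, a deliberately minimal stand-in (with illustrative synthetic data) for the full back-propagation procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
s = rng.normal(size=(256, 4))                  # exemplar inputs
true_theta = np.array([0.5, -1.0, 2.0, 0.1])
y = s @ true_theta                             # ground-truth labels y(s)

theta = np.zeros(4)                            # network parameters to learn
learning_rate = 0.1
for step in range(200):
    y_pred = s @ theta                         # 1. infer y'(s)
    error = y_pred - y                         # 2. measure the difference
    gradient = s.T @ error / len(s)            # gradient of the MSE loss w.r.t. theta
    theta -= learning_rate * gradient          # 3. adjust the weights
print(np.round(theta, 2))                      # approaches [0.5, -1.0, 2.0, 0.1]
```

In a deep network the single gradient expression is replaced by back-propagation through every layer, but the loop structure (infer, measure the error, step the weights down the gradient) is unchanged.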

Reinforcement-Learning

Reinforcement-learning (RL) seeks to learn good decision making for an environment by mapping the environment’s states to their optimal actions. Actions taken within the environment result in reward signals, and an artificial agent using RL discovers these rewards during exploration as shown in Figure 2.17. These rewards can occur frequently or sparsely. RL agents typically seek to maximise rewards over the long-term, allowing strategies to be learned which forgo short-term gains for more favourable long- term outcomes. For example, in the share market, trading a stock incurs a cost which would be considered a negative reward however it is clear that to profit in playing the share market these short-term costs are unavoidable. The same situation can occur when playing board games where one player might choose to sacrifice one or more key pieces, appearing to reduce their value in order to set up a later victory. For these reasons short-term rewards provided by the environment are often re-evaluated by the agent. Reinforcement-learning is a credit assignment problem, seeking to allocate credit for each of the actions that were taken. Exploration of the environment and delayed reward are the driving characteristics of RL problems [101].

Figure 2.17: The interactions of an agent and its environment, from Sutton and Barto [101]. At any time t the environment is in state st with a reward rt. The agent makes a decision to take an action at from its entire set of actions, which changes the environment to st+1, rt+1. The agent is seeking to maximise its reward.

An agent operating in any environment has:

• sensors (a mechanism to measure the state of the environment),

• actions which it can take to change the environment (which includes simply moving itself spatially), and

• some goal or motivation.

With reinforcement-learning problems, while the actions are known their effect on the environment is not. For example an agent may operate in a 1 dimensional environ- ment where they can only move forwards or backwards, however despite only having two movements the consequences of those actions can be highly complicated like in the cart-pole problem. The challenge of the cart-pole problem is to keep a pendulum pole vertically balanced while on a rail-cart which can move only left and right as shown in Figure 2.18. Despite the cart-pole problem having a small action space (left and right), and the state being able to be described with just 2 variables (cart position and pole angle) the dynamics of balancing the pole is a complex problem. An RL agent initially learning the cart-pole problem repeatedly fails to keep the pole vertical, however after sufficient episodes (attempts) it can progressively learn to keep the pole vertical by moving the cart.

Figure 2.18: Cart-pole diagram from Nagendra et al. where the challenge is to keep a pendulum pole vertical on a moving cart where the cart can only move left and right. The cart problem has a small action space (left and right) and requires only 2 variables (cart position and pole angle) to describe the state, however the dynamics of the problem are highly complex.

In board games, exploration of the environment can occur by self-play, i.e. the agent playing the game against itself. The state of a board game can sometimes be comprehensively represented by the board position, i.e. the arrangement of pieces on the board, however some games have move limits, time limits or repetition restrictions which are external to the board. To ensure that the game state is comprehensively represented, these non-board elements of the game need to be encoded in the state description. For instance, a position that is repeated for the third time in Chess results in the game being over and a draw being declared.

The third occurrence of the board position is obviously different from the first two, since it results in the game being over despite the board positions being identical. To create a comprehensive state representation which incorporates this element of Chess, the number of times a position has been seen needs to form part of the game's state representation.

If the state is comprehensively defined and does not require past states, then it is said to have the Markov property and to be a Markov state [102, 103]. A sequence of Markov states, called a Markov chain, is a stochastic model where each state depends only on the previous state. Formally, a chain is a Markov chain if and only if P(St+1 | St) = P(St+1 | S1, ..., St), where P is the state transition probability matrix, S is a finite set of states and t is the time step [104]. A Markov chain is represented by the tuple <S, P>, where S is a set of states and P is a transition probability matrix [105].

A Markov chain can be extended by adding the expected reward R when transitioning to a new state. A Markov reward process is a Markov chain with an accompanying reward for each state, Rs = E[Rt+1 | St = s], giving the representation tuple <S, P, R>10. Finally, a Markov decision process (MDP) is a Markov reward process which incorporates the finite set of legal actions A, giving the tuple <S, A, P, R> where:

• S is a finite set of states,

• A is a finite set of actions,

• P is a state transition probability matrix, P^a_ss' = P[St+1 = s' | St = s, At = a],

• R is a reward function, R^a_s = E[Rt+1 | St = s, At = a], and

• typically a reward discount factor γ ∈ [0, 1] is also included.

An MDP can formally describe a reinforcement learning problem, recalling that both MDPs and reinforcement learning problems are defined by a set of states S, with actions A that result in transitions between states with a probability distribution P, and some reward R which is received for changing state. The aim of a reinforcement learning problem is to find the mapping of states to actions, i.e. the policy π, which maximises the cumulative reward11. Markov chains and Markov decision processes are proven mathematically and are also able to address situations where states are not fully Markov compliant and where the problem is not fully observable [106]. For a deeper understanding of Markov processes the book by Boucherie and van Dijk [107] provides comprehensive coverage; the critical element for reinforcement-learning in this thesis is that the representation of a state has the Markov property, such that it stands alone in being able to inform the agent irrespective of the path taken to arrive at that state. Sutton and Barto, in Chapter 3 of the second edition of their book on reinforcement learning, also cover MDPs in detail, as well as providing extensive coverage of the mathematical foundations of reinforcement learning [108].

10 A Markov reward process can include some reward discount factor γ to avoid infinite returns in cyclic processes and to account for uncertainty in future rewards.
11 The cumulative reward is also called the return.

Training a Reinforcement-Learning Agent

The process of learning the neural-network weights for an RL agent is often done using back-propagation with gradient-descent, as explained in Section 2.4.3. As noted in Section 2.4.3, gradient-descent requires the comparison of the neural-network inference12 y′(s) with some ground-truth value y(s), where s is the input to the neural-network, i.e. the state of the environment. We established in our preceding discussion that the defining properties of RL were "exploration of the environment and delayed reward". It follows that, since the environment needs to be explored and may only intermittently provide a reward, ground-truth data may not necessarily exist for RL problems; however it can be estimated during exploration.

An agent learns via reinforcement-learning by minimising the difference between the network's prediction for a state at one time and a future prediction for the same state at some later time, called temporal difference (TD) learning [75, 109, 110]. In most board games it is only at the terminal state that the winner can be declared and a corresponding reward R assigned. For terminal state sT, it is common for R(sT) = {+1, −1, 0}, the reward for where the game is a {win, loss, draw} for the agent. For states of the game which are not terminal, typically no reward is provided; as such we seek to learn the value function v(s) for all states, where the value for terminal states is equal to the reward, v(sT) ← R(sT). With v(sT) defined, it follows that the state immediately preceding the terminal state, sT−1, would have some proportion of the terminal state's value, i.e. v(sT−1) = γ · v(sT), where γ is the proportion of the future state's reward to apply to sT−1, known as the discount factor. The value function can then be estimated by propagating the future values backwards through all of the previously visited non-terminal states. This can be written more generally as v(st) = γ · v(st+1), stating that the value of the current state is some proportion of the value of future states. Having used this method to calculate v(st), it can then be compared with the neural-network's inference v′(st) as part of the back-propagation algorithm to adjust the weights. With a sufficient number of training episodes/games conducted, a sufficiently diverse mapping for v(st) can be built using this approach.
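The backward propagation of the terminal reward through an episode can be sketched as follows; the `(state, target)` trajectory format is an assumption made for illustration, consistent with the v(st) = γ · v(st+1) relation above.

```python
def td_targets(episode_states, terminal_reward, gamma=0.99):
    """Assign a training target to every state visited in one self-play game.

    The terminal state receives the game's reward directly; each earlier
    state receives a discounted copy of its successor's value.
    """
    targets = [0.0] * len(episode_states)
    targets[-1] = terminal_reward              # v(s_T) <- R(s_T)
    for t in range(len(episode_states) - 2, -1, -1):
        targets[t] = gamma * targets[t + 1]    # v(s_t) = gamma * v(s_t+1)
    return list(zip(episode_states, targets))  # (state, target) training pairs

# Example: a five-move game that the agent eventually won (+1).
pairs = td_targets(['s0', 's1', 's2', 's3', 's4'], terminal_reward=+1.0)
print([(s, round(v, 3)) for s, v in pairs])
```

Each (state, target) pair then plays the role of the labelled example {s, y(s)} from supervised learning, with the target compared against the network's inference v′(st) during back-propagation.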

Since each state has a set of actions which can be taken, where st+1 ← st(ai) and ai is an action chosen from the set of n legal actions At = {a0, a1, .., an}, the value of the state can be estimated by averaging the future rewards over all possible actions. The future prediction can also be approximated by conducting a tree-search to look ahead in time; this variation of TD learning is called TD-Leaf [111] and underpins AlphaZero's combined neural-network design. That is, AlphaZero uses the difference between the neural-network's inference and the outcome of a tree-search to train its network using back-propagation with gradient-descent.

12 The output of the neural-network for the given input s.

2.4.4 Overfitting

In most applications of neural-networks the aim is to train the network using the available examples and then employ it for problems which it has not explicitly seen, but which are still within the same problem space. If the neural-network fails to generalise then it is said to have overfit [112]. Take for instance Figure 2.19, which shows two images of the same cat. The cat images are very different, with differences in pose, distance to the cat, background and image lighting, however a cat detection classification neural-network would be expected to identify the cat in both images. Furthermore, cats of different colours and shapes would also be expected to be identified regardless of the image they appeared in, without having previously been seen by the neural-network. Given the significant variance in cat breeds, the large number of potential poses, and the possibilities with regard to background and image exposure, it becomes evident that the entire set of cat images cannot be presented to a neural-network for learning. An object detection system using current classification and object detection methods [113] would have no issues in detecting the cats shown in Figure 2.19, despite having never seen either this cat or these images previously. There are a number of potential causes of overfitting which have to be considered when designing a neural-network.

Figure 2.19: Cats with different pose, background and image exposure are still cats. The challenge in training a neural-network is to learn only the features which are relevant to the problem.

Consider a cat detection system using a neural-network which is only trained using images of one breed of orange cats. The system will most likely fail when a cat of a different breed or different colour is presented to it. This example of overfitting is caused by insufficient diversity in the training set and can occur inadvertently where the smaller subset of training samples is highly correlated. Cats which are one breed and one colour form only a small subset of the entire cat problem space and, as expected, if a neural-network is trained on a narrow subset of the problem space then it will only be usable for that subset. This hypothetical cat detection system may be able to generalise and detect other orange cats of the breed type that it hasn't seen, however unless this is a specific design requirement the network will have failed to generalise to all cats, i.e. it is overfitted. To prevent this, a key principle of training a neural-network is to ensure diversity in the exemplar data-set. For complex games that have large state-spaces, only a fraction of the entire game space can be stored to train a neural-network. It is a practical limitation that the entire state-space cannot be used; as such, care has to be taken to ensure that the smaller subset used to train the neural-network is sufficiently diverse to represent the entire problem space. In some applications noise is deliberately added to increase the diversity of examples, however in some cases the neural-network can also learn the noise and incorrectly associate it as a feature of the solution.

The examples used to train a neural-network will often be noisy or have extraneous information which is irrelevant to the problem. Images of cats can have diverse backgrounds and include other objects which are not cats, and if these other elements occur frequently then the network may include them as part of its model. Burnham et al. define overfitting as "unknowingly extracting the residual variation (i.e. the noise)" [114]. Erroneously learning the noise can occur if a common irrelevant signal consistently appears with the positive examples, e.g. if most of the cats were wearing a small bell then the neural-network may learn that bells form part of the feature space for cat detection. In practice small bells may in some way contribute to identifying some cats, however they are not organic to a cat's composition and as such are sub-optimal features to learn for generalisation.

Ribeiro et al. performed an investigation into whether we should trust some classifiers, and tested a system which was intended to discriminate a husky13 from a wolf, where the key feature that the neural-network relied on was whether there was snow in the background [115] and not the physical appearance of the animal. Many of the wolf examples were in the wild (with snow), whilst many husky images included urban terrain since they are pets. This is a subtly different issue from a lack of diversity, as extraneous features are saturating the true features which would permit more general prediction. If a game playing neural-network agent were to be trained on experiences using only one opponent, then the network may learn how to defeat only that opponent, and not actually learn how to play the game, i.e. learning the features related to that opponent's playing style instead of the game.
Likewise, to learn a board game from scratch implies that the agent may play very poorly (or randomly) initially, and by doing so it may introduce some level of consistent noise which the agent may learn, particularly if it is generating its own experiences. Learning extraneous features can, in some cases, be attributed to having a network which is too complex in relation to the problem space. For a simple regression problem, attempting to fit a 2nd order polynomial, or higher, to a first order (linear) regression problem is likely to result in a good approximation for the presented data, but a significant divergence when extrapolating [116]. The same is true of a neural-network's trainable parameters: the more parameters that exist, the more likely it is that the network can be fitted to the data that has been presented, regardless of whether the data contains extraneous features or not.

13 A domesticated dog that has a wolf-like appearance.

The conundrum that exists is that for more complex problems, more trainable parameters are required, and consequently the network becomes more prone to learning irrelevant features at the expense of generalising.

Parameter normalisation methods are regularisation techniques used to prevent overfitting of large neural-networks by penalising large weight parameter values, typically by adding an element to the loss function [117]. In Section 2.4.3 we explained that a neural-network seeks to minimise some loss function via back-propagation. Adding an element to the loss function which increases as the neural-network's weight parameters increase has the effect of keeping the weight parameters smaller. Equation 3.1, the loss function used for training the agents throughout this thesis, shows L2-norm regularisation. A related, but slightly different, method is weight decay, where the weights are directly reduced by some proportion after each training step [118, 119]. These methods seek to ensure that the parameter values retain relatively small, manageable values.

Another common method to prevent over-fitting related to network complexity is the use of dropout [120]. Dropout removes a portion of the network's weights during training in order to temporarily reduce the number of parameters which could be fitted to the experiences. Dropout prevents the neural-network's parameters "from co-adapting" [121]. With CNNs, however, randomly removing individual weights during training makes little sense, since features are derived from spatial relationships. Spatial-dropout [122] differs from traditional dropout in that a portion of the feature space is dropped out during training, as shown in Figure 2.20. Spatial-dropout retains spatial relationships, but still reduces the number of free parameters used to train the neural-network.

Overfitting can also be caused by too much training, resulting in the trainable parameters becoming saturated. To prevent overfitting arising from excessive training in supervised-learning problems, early-stopping is employed [123, Chapter 2]. Early-stopping methods monitor the change in loss between training epochs and halt when learning begins to deteriorate. Batch-normalisation is another method to prevent the saturation of trainable parameters by continually balancing the weights.

Unless training experiences are highly curated, there is often a difference between the distributions of the training set and the test set. For game playing this can present itself in the form of having a disproportionate number of drawn games, or an imbalance between terminal experiences and non-terminal experiences. Because of the difference in distributions, when training the neural-network the back-propagation algorithm will attempt to fit the training data without accounting for these distribution differences. Covariate shift refers to a change in the distribution of the input variables [125]. Figure 2.21 shows a simple regression problem and demonstrates how different distributions can result in a poorly fitted model. In machine learning, batch-normalisation attempts to compensate for internal covariate shift [126]. The process of batch-normalisation seeks to readjust weights to account for covariate shift.
Ioffe and Szegedy claim that by using the batch-normalisation process a neural-network's learning is less sensitive to the initialisation and the learning rates, and in some cases it eliminates the need for dropout [126].

Figure 2.20: Example of dropout in a fully connected network from [124]. The standard network (a) has all neurons connected, however when applying dropout (b) a random selection of neurons is dropped for that training run.

Recent deep CNN solutions use batch-normalisation in lieu of dropout, including the current state-of-the-art board game player AlphaZero.

Figure 2.21: A simple regression problem where the training and testing data-sets have different distributions. The learned function shown in green fits the training data and a portion of the problem space well, however it is not indicative of the portion of the problem space of interest where the test samples are located. From Herrera's presentation to IWANN on covariate shift [127].

Chapter 3

Evaluation Framework

Highlights

• The state-of-the-art board game playing agent is a combined neural-network/tree-search system and our player design is inspired by the state-of-the-art system.

• We trade speed for reproducibility in design considerations.

• Demonstrating comparative performance is the primary design consideration.

3.1 Introduction

In order to test and evaluate different learning approaches we have built an experimental system which compares the performance of two reinforcement-learning game players. This section outlines the key elements of the system. Our experimental platform consists of two core modules, the game environment and the players, and has the following characteristics:

• Both the environment and the players utilise a standard framework, regardless of the game or the type of player.

• The system is designed in such a way as to reduce the variability between compared players.

• The agent’s performance during both training and gameplay is traded off in favour of reducing the variability of the experiments.

• The system is designed for observing comparative performance not the absolute performance in any one game.

For example, the training pipeline which is used to train the neural-network outlined in Section 3.3.2 is conducted sequentially, however better performance could be obtained by conducting the training in parallel, in the same way that AlphaZero does. Likewise, when two players are compared they are both trained on the same system simultaneously to further reduce the impact of any variance in processor load. Again, better individual player performance could be achieved by providing all of the resources to train a single agent. The result of these design decisions is that the complexity of the agent, the difficulty of the games and the quality of the opponents are constrained to permit the simultaneous training of two AI agents, using sequential processes, within a period of time which is reasonable yet is still sufficiently complex to demonstrate the effectiveness of the methods proposed in this thesis. Our focus is not on the absolute performance of the agent in playing any particular game, but instead the comparative improvement resulting from any modifications we make.

3.2 Game Environment

The game environment encodes the definitions of the game, including the rules and actions, and maintains the progress of the game. The game environment is a black box to the players [128], permitting players to make moves and updating the game state accordingly. The players query the game environment to inform their decisions. Every game has a set A of m possible actions, A = {a1, .., am}. For any of the game's board positions, the environment provides the following (a minimal interface sketch follows this list):

• a tensor of sufficient size to represent the current state s of the game;

• a vector representing the legal actions for the state L(s) ⊂ A in a format accepted by the environment’s move function; NB that A is used to denote the set of all actions while a is used to denote a single action; and

• a scalar W indicating if the game is ongoing, a draw, player 1 wins, or player 2 wins.
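To make this interface concrete, the following is a minimal sketch of the environment API described above. The class and method names (GameEnv, state, legal_actions, winner, move) are illustrative assumptions rather than the exact implementation used in this thesis.

# Minimal sketch of the game-environment interface; names are illustrative only.
import numpy as np


class GameEnv:
    """Black-box game environment queried by the players."""

    m: int  # total number of possible actions for this game

    def state(self) -> np.ndarray:
        """Tensor representation s of the current position."""
        raise NotImplementedError

    def legal_actions(self) -> np.ndarray:
        """Length-m binary vector marking the legal actions L(s)."""
        raise NotImplementedError

    def winner(self) -> int:
        """W: 0 = ongoing, 1 = player 1 wins, 2 = player 2 wins, 3 = draw."""
        raise NotImplementedError

    def move(self, action: int) -> None:
        """Apply a single action a, advancing the state from s_t to s_{t+1}."""
        raise NotImplementedError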

3.2.1 Game Representation

Bengio stated that "The performance of many machine-learning methods are heavily dependent on the choice of data representation on which they are applied" [129], highlighting one of the primary challenges to using a neural-network for general games; that is, suitably representing every possible game with a standardised representation. Numerous researchers have discussed the importance of good representations and highlight that different representations can contain more or fewer "explanatory factors" behind the problem domain [130, 131]. However, the focus of our research is on general games, meaning that hand-crafting features is outside of our scope. Instead we utilise a standardised representation by directly mapping piece locations to positions in a vector.

Players require certain information from the game environment to make decisions. The player requires that any game environment it interfaces with can be queried for the current board position (state) and which player's turn it is. In addition to this, for any given position, the interface is required to provide whether the position is terminal, who has won if it is terminal, and what legal moves are available for the position. Formally the state S is represented as S := (B, P, W), where B is a tensor describing the board, P is an integer indicating which player's turn it is, and W is an integer indicating:

• if the game is still ongoing, W = 0;

• which player is the winner, W ∈ {1, 2}; and

• if the game is a draw W = 3.

The legal actions (moves) for a single state s are represented as a tensor L(s) of size m. For move number t in a game, the player selects a single action a_i from the legal actions L(s_t) based on its evaluation of s_t, and the environment will change state to s_{t+1}; i.e. if F denotes the game's transition function and action a_t is selected then

F : (s_t, a_t, W_t) → {s_{t+1}, L_{t+1}, W_{t+1}}.

The board B is represented as a set of bit planes with sufficient size to represent the playing space. Decomposing the board into bit planes creates sparse matrices, where most of the elements are zero, which has been reported to be beneficial to training neural-network weights [132]. In the simplest form, each unique playing piece has its own plane representing the cells on the board which that piece occupies; i.e. the board shown in Figure 3.1 is represented as 6 independent bit planes, one for each piece type (3 planes for each colour).

0 0 1 0 0 0 0 0 1 0 0 0 0 0 0       0 0 0 0 0 0 0 0 0 0 1 1 1 1 1       B = 0 0 0 0 0 , 0 0 0 0 0 , 0 0 0 0 0             0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0       0 0 0 0 0 0 0 0 0 0 0 0 0 0 0       W = 0 0 0 0 0 , 0 0 0 0 0 , 0 0 0 0 0             0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0

Figure 3.1: Example of how to represent a board position by bit planes. The position is decomposed into bit planes, one for each piece type, which represent the cells that each piece occupies. Shown in B are the bit planes for the black pieces, while W shows the planes for the white pieces.
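As a concrete illustration of Figure 3.1, the short sketch below builds bit planes from a toy 5x5 board. The integer piece codes and the array layout are assumptions made for this example only.

# Illustrative sketch of decomposing a toy board into bit planes (cf. Figure 3.1).
# Piece codes (1..3 for black, -1..-3 for white) are assumptions for the example.
import numpy as np

def to_bit_planes(board: np.ndarray, piece_codes) -> np.ndarray:
    """One binary plane per piece code; the planes are mostly zero (sparse)."""
    return np.stack([(board == code).astype(np.uint8) for code in piece_codes])

board = np.zeros((5, 5), dtype=np.int8)
board[0, 2], board[0, 3] = 1, 2        # two black pieces on the back row
board[1, :] = 3                        # a row of black pieces
board[4, 2], board[4, 3] = -1, -2      # mirrored white pieces
board[3, :] = -3                       # a row of white pieces

B = to_bit_planes(board, [1, 2, 3])    # 3 planes for the black pieces
W = to_bit_planes(board, [-1, -2, -3]) # 3 planes for the white pieces
print(B.shape, W.shape)                # (3, 5, 5) (3, 5, 5)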

3.3 Neural-Network/Tree-Search Game Player

A baseline-player is used to compare our new methods with the current state-of-the-art methods. The baseline-player is a combined neural-network/tree-search reinforcement-learning system which chooses an action a from the legal moves L(s) for any given state s, where L(s) ⊂ A. The neural-network with parameters θ is trained by self-play reinforcement-learning. As the players are inspired by AlphaZero [31], only the components relevant to this thesis will be covered in this section. The parameters for the neural-network/tree-search players are shown in Appendix A. This section details the key design elements for the baseline-player.

3.3.1 Neural-Network

The input to the neural-network is a state tensor s from the game environment. The neural-network with parameters θ has two outputs, a policy vector pθ(s) and a scalar value vθ(s) for the given state s. The policy pθ(s) is an m sized vector, indexed by actions a, representing the probability distribution of the best actions to take from s. vθ(s) is the neural-network's value estimate for s. pθ(s) is used to guide the tree-search, while vθ(s) provides an estimate for the value of states which are non-terminal. The neural-network is comprised of a multi-layer deep residual CNN. Deep residual CNNs have been found to be more stable and less prone to overfitting than traditional CNNs [92] for large networks. The network architecture is shown in Figure 3.2.

Figure 3.2: Architecture for the AI players with input s from the environment and outputs pθ(s) and vθ(s).
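A minimal sketch of such a dual-headed residual CNN is given below, written in PyTorch as an assumption about the framework; the channel widths and number of residual blocks are placeholders rather than the values used for the players in this thesis (those are listed in Appendix A).

# Sketch of a dual-headed residual CNN: shared residual trunk, a policy head
# approximating p_theta(s) and a value head approximating v_theta(s).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(ch)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(ch)

    def forward(self, x):
        y = F.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return F.relu(x + y)          # skip connection

class PolicyValueNet(nn.Module):
    def __init__(self, planes, board_size, n_actions, ch=64, blocks=5):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(planes, ch, 3, padding=1),
                                  nn.BatchNorm2d(ch), nn.ReLU())
        self.trunk = nn.Sequential(*[ResidualBlock(ch) for _ in range(blocks)])
        flat = ch * board_size * board_size
        self.policy = nn.Linear(flat, n_actions)                   # p_theta(s)
        self.value = nn.Sequential(nn.Linear(flat, 64), nn.ReLU(),
                                   nn.Linear(64, 1), nn.Tanh())    # v_theta(s)

    def forward(self, s):
        x = self.trunk(self.stem(s)).flatten(1)
        return F.softmax(self.policy(x), dim=1), self.value(x).squeeze(1)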

3.3.2 Training the Neural-Network

Training Pipeline

The pipeline to train the neural-network consists of two independent processes: self-play and optimisation. A new generation of player is created at the completion of each training cycle i. The initial weights θi=0 for the neural-network are randomised. After the optimisation step, the weights are updated using gradient-descent, yielding new weights θi+1. The self-play process generates training examples using a player with the latest weights to play against itself, filling an experience-buffer as explained later in this section. The optimisation process trains the network using these experiences via batched gradient-descent. The process is shown in Figure 3.3(a). An experience is created and saved for each move made in a game, and given that experiences from the same game are likely to be highly correlated, if care is not taken this will increase the risk of overfitting. Since experiences created from within the same game are correlated, the experience-buffer size is measured in games instead of experiences, to ensure the experience-buffer contains sufficient move diversity. We seek to reduce the variability of training by conducting the training process sequentially on a single system instead of in parallel across multiple systems like the state-of-the-art system. When the experience-buffer contains the experiences from a set number of games, self-play stops and optimisation begins. After optimisation has finished, all experiences from a percentage of the earliest games are removed from the experience-buffer, and self-play recommences.
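The cycle can be summarised with the following sketch. The helper callables (self_play_game, optimise), the buffer layout and the figure of 100 games per cycle are assumptions used only to show the control flow; the 20% replacement fraction is the value adopted later in this chapter.

# Sketch of one sequential training cycle: self-play fills the buffer (measured in
# games), the network is optimised on every buffered experience, then the oldest
# games are removed before self-play recommences.
def training_cycle(buffer, self_play_game, optimise,
                   games_per_cycle=100, replace_fraction=0.2):
    while len(buffer) < games_per_cycle:
        buffer.append(self_play_game())               # one game = list of experiences
    optimise([x for game in buffer for x in game])    # batched gradient-descent pass
    del buffer[:int(replace_fraction * len(buffer))]  # drop the oldest games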

Experiences

Training experiences are generated during self-play using the latest network weights. Each training experience X is comprised of a set X := {s, π(s), z(s)}, where s is the tensor representation of the state, π(s) is a probability distribution over the most likely actions (policy), indexed by a, obtained from the tree search, and z(s) is the scalar reward from the perspective of the player for the game's terminal state. During self-play, an experience is saved for every ply1 in the game, creating an experience-buffer full of experiences from a number of different self-play games. z(s) = rT(s), where rT(s) is the reward for the terminal state of the game: rT(s) is −1 for a loss, +1 for a win, and −0.5 for a draw. We use a slightly negative reward for a draw instead of 0 to discourage the search from settling on a draw, instead preferring exploration of other nodes which are predicted as having a slightly negative value.
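A sketch of how one finished self-play game could be converted into experiences with this reward convention follows; the data layout and the helper name label_game are assumptions for illustration.

# Sketch of labelling the experiences of one finished self-play game.
# Reward convention from the text: +1 win, -1 loss, -0.5 draw, from the mover's view.
def label_game(states, policies, movers, winner):
    """states[t], policies[t]: s and pi(s) at ply t; movers[t]: player to move;
    winner: 1, 2 or 3 (draw). Returns (s, pi(s), z(s)) triples."""
    experiences = []
    for s, pi, player in zip(states, policies, movers):
        if winner == 3:            # draw: slightly negative to discourage settling
            z = -0.5
        else:
            z = 1.0 if winner == player else -1.0
        experiences.append((s, pi, z))
    return experiences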

Updating Weights

During training, experiences are uniformly randomly selected, and parameters θ are updated to minimise the difference between vθ(s) and z(s), and to maximise the similarity between pθ(s) and π(s), using the loss function shown in Equation 3.1.

1 The term 'ply' is used to disambiguate one player's turn, which has different meanings in different games. One ply is a player's single action.

Figure 3.3: Two training loops used to demonstrate the effectiveness of curriculum-learning: (a) AlphaZero inspired and (b) AlphaGo Zero inspired.

One training step is the presentation of a single batch of experiences for training, while an epoch e is completed when all experiences in the buffer have been utilised once. A number of epochs are conducted for each training cycle i. Experiences are stored in the experience-buffer in the order in which they were created. At the conclusion of each training cycle the buffer is partially emptied by removing a portion of the oldest games. We have performed a small preliminary experiment and determined that removing 20% of the oldest games each training cycle is an effective replacement portion, as shown in Figure 3.4.

l = (z(s) − vθ(s))^2 − π(s) · log(pθ(s)) + c · ||θ||^2    (3.1)

where:

z(s) = the reward from the game's terminal state (return),
vθ(s) = neural-network value inference,
π(s) = policy from the MCTS,
pθ(s) = neural-network policy inference,
c · ||θ||^2 = L2-norm weight regularisation.
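A sketch of Equation 3.1 computed over a batch is shown below, again assuming PyTorch tensors; the small epsilon inside the logarithm and the value of c are illustrative choices rather than the thesis' parameters.

# Sketch of the loss in Equation 3.1 for a batch of experiences (PyTorch assumed).
import torch

def loss_fn(z, v, pi, p, params, c=1e-4):
    """z, v: shape (batch,); pi, p: shape (batch, m); params: network weights."""
    value_term = ((z - v) ** 2).mean()                           # (z(s) - v_theta(s))^2
    policy_term = -(pi * torch.log(p + 1e-8)).sum(dim=1).mean()  # -pi(s) . log p_theta(s)
    l2_term = c * sum((w ** 2).sum() for w in params)            # c * ||theta||^2
    return value_term + policy_term + l2_term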

AlphaGo Zero Training Pipeline

For completeness we have also conducted an additional experiment using AlphaGo Zero's method of adding a third process to evaluate the best weights θ* [19]. The primary difference when employing this training pipeline is that self-play is conducted with the best weights θ* instead of the current weights, as shown in Figure 3.3(b).

3.3.3 Tree Search

The tree search used in our system is MCTS. MCTS builds an asymmetric tree with the states s relating directly to nodes, and actions a as edges. At the conclusion of the search the policy π(s) is generated by calculating the proportion of the number of visits to each edge. During the tree-search the following variables are stored:

• The number of times an action was taken N(s, a).

• The neural-network's estimation of the optimum policy pθ(s) and the network's estimate of the node's value vθ(s).

• The value of the node V (s).

• The average action value for a particular action from a node Q(s, a).

MCTS is conducted as follows:

• Selection. The tree is traversed from the root-node by calculating the upper confidence bound U(s, a) using Equation 3.2 and selecting the action a_i = argmax_a U(s, a) until a leaf-node is found [30]. Note the use of pθ(s) from the neural-network in Equation 3.2.

• Expansion. When a leaf-node is found vθ(s) and pθ(s) are obtained from the neural-network and a node is added to the tree.

• Evaluation. If the node is terminal the reward r is obtained from the environment for the current player and V(s) := r, otherwise V(s) := vθ(s). Note that there is no monte-carlo roll-out.

• Back-propagation. Q(s, a) is updated by the weighted average of the old value and the new value using Equation 3.3.

U(s, a) = Q(s, a) + 3 · (pθ(s) + δ) · √( Σ_{j=1}^{m} N(s, aj) ) / (1 + N(s, a))    (3.2)

Q(s, a) ← ( N(s, a) · Q(s, a) + V(s) ) / ( N(s, a) + 1 )    (3.3)

where

Q(s, a) = average action value,
pθ(s) = policy for s from the neural-network,
δ = the Dirichlet noise function,
p′(s) = pθ(s) with added Dirichlet noise,
Σ_{j=1}^{m} N(s, aj) = total visits to the parent node,
N(s, a) = number of visits to edge (s, a),
V(s) = the estimated value of the node.

At the conclusion of the search a probability distribution π(s) is calculated from the proportion of visits to each edge, π_{aj}(s) = N(s, aj) / Σ_{j=1}^{m} N(s, aj), for j = 1, .., m. The tree is retained for reuse in the player's next turn after it is trimmed to the relevant portion based on the opponent's action.
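The selection and back-propagation rules of Equations 3.2 and 3.3 can be sketched per node as follows; the array-per-edge layout and the masking of illegal actions are assumptions made for the example.

# Sketch of the per-node MCTS bookkeeping implied by Equations 3.2 and 3.3.
import numpy as np

def select_action(N, Q, p, delta, legal):
    """N, Q: arrays over the m edges of this node; p: p_theta(s); delta: Dirichlet
    noise; legal: boolean mask of legal actions. Returns argmax_a U(s, a)."""
    U = Q + 3.0 * (p + delta) * np.sqrt(N.sum()) / (1.0 + N)   # Equation 3.2
    U[~legal] = -np.inf                                        # never pick illegal edges
    return int(np.argmax(U))

def backup(N, Q, a, V):
    """Update the edge statistics for action a with the leaf value V (Equation 3.3)."""
    Q[a] = (N[a] * Q[a] + V) / (N[a] + 1.0)
    N[a] += 1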

3.3.4 Move Selection

To ensure move diversity during each self-play game, for a given number of ply as detailed in Appendix A, actions are chosen probabilistically based on π(s) after the search is conducted. After a given number of ply, actions are then chosen in a greedy manner by selecting the action with the largest probability from π(s). When conducting a tournament to test the effectiveness of any modifications, moves are always selected greedily.

3.3.5 Parameter Selection

The training pipeline of our machine-learning player differs from the state-of-the-art player in that we execute the training pipeline sequentially to reduce variability. One of the outcomes of this design decision is that the number of experiences to be replaced each training cycle is required to be set, instead of generating as many new games as possible. Likewise, a design decision is required as to how many epochs to train the agent for each training cycle. Regardless of the parameters selected, when comparing the relative performance of two players the same parameters are used for both players. We have tested suitable parameters and have decided to use 20 epochs, with 20% experience replacement, as shown in Figure 3.4.

3.3.6 AlphaGo Zero Inspired Player

The AlphaGo Zero inspired player is very similar to the player explained above (see Section 4.3.1), with the exception of an evaluation step in the training loop. The evaluation step plays a two player competition between the current best player with weights θ* and a challenger using the latest weights θi. The competition is stopped when the lower bound of the 95% Wilson2 confidence interval with continuity correction [133] is above 0.5, or the upper bound is below 0.5, allowing a competition winner to be declared. The competition is also stopped when the difference between the upper and lower confidence bounds is less than 0.1, in which case no replacement is conducted. If the challenger is declared the winner of the competition, then its weights become the best weights and are used for subsequent self-play until they are replaced after a future evaluation competition. Although this method for stopping does not provide exactly 95% confidence [134], it provides sufficient precision for determining which weights to use to create self-play training examples.
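The stopping rule can be sketched as below using the textbook Wilson score interval with continuity correction; the exact equations the thesis relies on are given in its Chapter 8, so this particular form and the threshold handling are assumptions. z = 1.96 approximates the 95% level.

# Sketch of the evaluation stopping rule (Wilson score interval, continuity corrected).
import math

def wilson_ccc(successes, n, z=1.96):
    if n == 0:
        return 0.0, 1.0
    p = successes / n
    denom = 2 * (n + z * z)
    adj = z * math.sqrt(z * z - 1 / n + 4 * n * p * (1 - p) + (4 * p - 2)) + 1
    lower = max(0.0, (2 * n * p + z * z - adj) / denom)
    adj = z * math.sqrt(z * z - 1 / n + 4 * n * p * (1 - p) - (4 * p - 2)) + 1
    upper = min(1.0, (2 * n * p + z * z + adj) / denom)
    return lower, upper

def evaluation_verdict(challenger_successes, n):
    lower, upper = wilson_ccc(challenger_successes, n)
    if lower > 0.5:
        return "replace best weights"      # challenger wins the competition
    if upper < 0.5:
        return "keep best weights"         # challenger loses the competition
    if upper - lower < 0.1:
        return "stop, no replacement"      # interval too narrow to keep playing
    return "keep playing"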

2 We provide further detail on the Wilson confidence intervals in Chapter 8; Equations 8.6 and 8.7 show the formula.

Figure 3.4: Determining suitable epoch and replacement-rate parameters. We compare the performance of an agent using 20 epochs and 20% replacement (dotted lines) against agents using 100 epochs with 20% replacement and 20 epochs with 100% replacement of experiences. Note that the red plots were performed on the same system simultaneously, and the green plots were performed together on the same system. We can see that 20 epochs and 20% replacement outperforms the other parameter values. To obtain this plot two experiments are run; both use 20 epochs and a 20% replacement rate as a fixed reference point alongside the alternative set of parameters. This is done to account for any variations arising due to processor load. The colours indicate which runs were conducted simultaneously.

3.4 Reference-Opponents

The reference-opponents provide fixed-performance opponents for comparing two AI players. When testing if one of our methods is effective, we play the two variants of the player - one with and the other without modifications - against the reference-opponent. Rather than playing the two players head-to-head, the external reference-opponent is used to verify that the agent's learning is universally applicable, and not just relative to another player using a similar methodology. We liken the use of the external player in these experiments to using a held-back validation dataset to measure the accuracy of a trained supervised-learning neural-network. Two external players are utilised: MCTS and Stockfish-multivariant.

3.4.1 MCTS Opponent

A standard MCTS player [71, 69] is used as a reference-opponent for the games of Reversi and Connect-4. The quality of the MCTS player can be easily adjusted by varying how many iterations of the MCTS process are conducted per move. The number of MCTS iterations is approximately equal to the number of nodes which are added to the search-tree - a single node is added for every iteration, except when a visited terminal node already exists in the tree. The edge from the root-node with the greatest number of visits at the end of the search is selected as the move to play. The number of MCTS iterations for the reference-opponent is chosen to ensure that the AI player initially loses 100% of the games. In addition to this, the number of MCTS iterations for the reference-opponent is higher than the number used in the AI player, creating a handicap in favour of the reference-opponent. For example, the MCTS reference-opponent for Reversi uses 1000 MCTS iterations while the AI player uses 200 iterations. The reason for this is to highlight the learning which is occurring in the AI player, demonstrating how the learned game knowledge eventually overcomes this handicap. This handicap allows us to highlight the importance of efficient neural-network learning in this system. MCTS is suitable as an opponent for games where terminal states are readily found by random play, which means that MCTS is not suitable for Chess variants. When using MCTS as an opponent for a Chess variant we observed that whilst the initialised player would lose all games, it would not take long for the AI player to win a significant number of games. This is the reason why a Chess-specific player is used.

3.4.2 Stockfish Opponent

First released in 2008, Stockfish is a well-known open-source computer Chess player which uses a combination of opening and closing book plays combined with a heuristic-based tree-search. A multi-variant version of Stockfish [135] was used as the reference-opponent for Racing-Kings. The Stockfish level was set such that initially the AI player would lose 100% of the games. Even though the AI player is initially untrained, it still conducts a tree search with 200 MCTS iterations, so it is initially better than a random player.

3.5 Games

We evaluate the effectiveness of our methods using a modified version of Racing-Kings [136] and Reversi [137]. The games selected have fundamental differences in how a player makes their move. Reversi is a game in which the board starts nearly empty and, as the game progresses, tiles are placed on the board, filling the board up until the game ends; tiles are not relocated once they are placed. Games like Tic-Tac-Toe, Connect Four, Hex and Go have the same movement mechanics as Reversi and, with the exception of the occasional piece removal in Go, these games also fill the board as the game progresses. Racing-Kings was chosen as the representative of the class of games that have piece mobility, i.e. where a piece is moved by picking it up from one cell and placing it at another. Like Racing-Kings, games with piece mobility often also have the characteristic of piece removal through capture. Games with similar movement mechanics to Racing-Kings include the many Chess variants, Shogi and Nim. The calculation of the winner for both Racing-Kings and Reversi is relatively straightforward: for Reversi it is simply the number of tiles, and for Racing-Kings it is when a King is at the final row of the board. The calculation of the winner of Connect-4, however, requires consideration of the spatial relationship between a number of pieces, making it more complex to calculate the winner. We include Connect-4 in our experiments where this additional end-game complexity is required to demonstrate a finding.

3.5.1 Connect-4

Connect-4 is a two-player board game with the aim being for a player to get four of their pieces in a row. The players alternate by dropping a coloured disc into a column of the vertical board, with the disc settling at the top of the stack of pieces in that column. If the board is full without 4 in a row then the game is a draw. The common name Connect-4 is a trademark of Milton Bradley, however the game itself has also been known as Captain's Mistress, Four Up, Plot Four, Find Four, Four in a Row, Four in a Line, Drop Four and Gravitrips3. For this research the standard board with 7 columns and 6 rows is used. Connect Four was first solved in 1988 using knowledge-based methods [138] and was later solved in 1995 using brute-force methods [139] by generating an 8-ply position database. The first player is guaranteed to win if they play perfectly by starting in the central column. Using mini-max with α-β pruning, move ordering and transposition tables allows a machine-learning player to play Connect-4 nearly perfectly. In this research we found that whilst MCTS performs well, it does not play perfectly, since some board positions have only a single optimal action but the winning terminal state is too deep for MCTS to search to. Since our AI player contains an MCTS component, we use MCTS as the reference-opponent to demonstrate how the addition of a neural-network to the search can overcome the inherent weakness of using monte-carlo search. Figure 3.5 shows the board of Connect-4, and also a position in which it is very difficult for a monte-carlo search to find the optimal action. Although the game of Connect-4 is solved, it is far from trivial without using prior knowledge, requiring a substantial search-depth to play perfectly. Each column is a possible action until it is full, making a drawn game

3 https://en.wikipedia.org/wiki/Connect_Four

7 × 6 = 42 ply long, with a total possible number of approximately 4 × 10^12 unique board positions.

Figure 3.5: Connect-4 after 2 ply of perfect play. Using the analysis from https://connect4.gamesolver.org we can see that red has only 1 winning action, and if that action is taken red can win after 40 ply (20 moves). This equates to a large number of sub-optimal combinations for a monte-carlo search, giving a low probability of discovering this single optimal move randomly.

3.5.2 Racing-Kings

Racing-Kings is a game played on a Chess board with Chess pieces. Pieces are placed on a single row at one end of the board, with the aim being to be the first player to move their King to the other end of the board. Checkmate is not permitted, and neither is castling, however pieces can be captured and removed from the board. We modify the full Racing-Kings game by using fewer pieces and adjusting the starting position of the pieces by placing them in the middle of the board, as shown in Figure 3.6, instead of on the first rank. As discussed previously, we reduce the game complexity to ensure our experiment runs within a reasonable time, given our design constraints to reduce variability by implementing a sequential training pipeline as well as training two agents on shared computing resources. As the Racing-Kings game library is part of a suite of Chess variants [140], we maintain the environment as it would be for Chess. A state s is represented as a tensor of size 8 × 8 × 12; the width and height dimensions represent cells on the board and the 12 planes represent each of the players' pieces: King, Queen, Rook, Bishop, Knight and Pawn. As pieces are moved from one cell to another, the total number of actions includes all possible pick-and-place options, 64 × 64 = 4096, excluding any additional actions such as promotion of pawns. There are 44 possible from/to pawn-promotion movements in Chess, but this also needs to be multiplied by the number of pieces to which a pawn can be promoted. We only consider promotion to a Knight or a Queen, giving 88 possible promotion actions. The total number of actions for the Racing-Kings game environment is m = 4184.

Figure 3.6: Starting position for the game of Racing-Kings. The game is won when a player's King reaches the top row of the board. Kings are not permitted to be placed in check in Racing-Kings; with this exception all other Chess rules are the same. The standard Racing-Kings board starts with these pieces on the first row; additionally the remaining major Chess pieces are also utilised and are placed next to their partner piece - the Queen is placed above the King. This modified version allows two games to be run simultaneously on a single system, reducing experiment variability.

3.5.3 Reversi

Reversi "is a strategic boardgame which involves play by two parties on an eight-by-eight square grid with pieces that have two distinct sides ... a light and a dark face. There are 64 identical pieces called 'disks'. The basic rule of Reversi, if there are player's discs between opponent's discs, then the discs that belong to the player become the opponent's discs" [141]. The starting board position for the game of Reversi is shown in Figure 3.7. The winner is the player with the most tiles when both players have no further moves. For the Reversi game environment a state s is represented as a tensor of size 8 × 8 × 2; the width and height dimensions represent cells on the board and each plane represents the location of each player's pieces, with 64 possible different actions, m = 64. Our Reversi environment was inspired by the open-source project at github.com/mokemokechicken [142].

3.5.4 Searching State-Space of Games

For games that fill the board as the game progresses, such as Reversi, Go and Tic-Tac-Toe, terminal states are easily found by simply randomly choosing moves until the board is full. Games like Chess and Racing-Kings, however, have potentially infinite ply, which makes finding a terminal state via random play much more challenging. For example, the game of Reversi, as outlined in Section 3.5.3, is played on a 64-cell board and starts with 4 pieces in the centre. Reversi has a maximum of 60 ply, regardless of whether the moves are selected randomly or not. The game of Racing-Kings is also played on a 64-cell board and, when played randomly and allowing position repetitions, it can take in excess of 360 ply to end the game, with most of the games ending in a draw either due to insufficient material or the 75-move-rule (no pieces captured and no pawns moved in the last 75 moves). This highlights why search alone cannot readily be used for playing this game. The agent we employ utilises tree-search, and in some cases Racing-Kings' branching factor toward the end of the game becomes small enough for the search to find a check-mate. Using our neural-network/tree-search agent we found that for the initial training cycle a self-play Racing-Kings game takes approximately 120 ply, with the majority of games still ending in a draw - some games exceeded 300 ply. To avoid adding an overwhelming number of drawn examples to the experience-buffer, we have a 70% chance of dropping experiences from drawn games.

Figure 3.7: Starting board position for Reversi.
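The drop rule for drawn games can be sketched as follows; the buffer handling and the function name are assumptions made for illustration.

# Sketch of the 70% drop rule for drawn Racing-Kings games described above.
import random

def maybe_keep_drawn_game(experiences, winner, drop_prob=0.7):
    """Return the game's experiences, or an empty list for most drawn games."""
    if winner == 3 and random.random() < drop_prob:
        return []                # discard this drawn game's experiences
    return experiences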

3.6 Summary

Our experimental setup is designed specifically to minimise variance between training runs and to highlight the relative performance differences between different methodologies. The consequence of this design decision is that the absolute performance of the agent is inhibited, requiring, in some cases, minor simplifications of the games. Importantly, we handicap the machine-learning player by performing fewer MCTS iterations than the MCTS reference-opponent, emphasising the value of training the neural-network efficiently. This framework is used throughout the remainder of this thesis for all experiments.

Chapter 4

End-Game-First Curriculum

Highlights

• Early-game experiences generated by self-play in the early stages of training are poor experiences.

• Training time can be reduced by not generating expectedly poor experiences.

• End-game-first curriculum reduces training time by learning end-games first.

4.1 Introduction

Humans tend to learn complex abstract concepts faster if examples are presented in a structured manner. For instance, when learning how to play a board game, one of the first concepts a human needs to learn is how the game ends, i.e. which actions for given board positions will result in a terminal state (win, lose or draw). The advantage of learning the end-game first is that once the actions that lead to a terminal state are understood, it then becomes possible to incrementally learn the consequences of actions that are further away from a terminal state, called backwards induction. We call a structured training approach which enforces backwards induction an end-game-first curriculum. This approach is not currently used in machine-learning game players. Currently the state-of-the-art machine-learning player for general board games, AlphaZero by Google DeepMind, does not employ a structured training curriculum, instead learning from the entire game at all times. By employing an end-game-first curriculum to train an AlphaZero inspired player, we empirically show that the rate of learning of an artificial player can be improved when compared to a player not using a training curriculum. Whilst DeepMind's approach is effective, their method for generating experiences by self-play is resource intensive, costing millions of dollars in computational resources. We have developed a new method called the end-game-first training curriculum which, when applied to the self-play/experience-generation loop, reduces the computational resources required to achieve the same level of learning. Our approach improves performance by not generating experiences which are expected to be of low training value. The end-game-first curriculum enables significant savings in processing resources and is potentially applicable to other problems that can be framed in terms of a game.

AlphaZero is a combined neural-network/tree-search game player which is trained by self-play reinforcement-learning. Neural-network/tree-search agents do not use the neural-network to directly make move decisions; instead the neural-network has a dual role: to identify the most promising actions for the search to explore and to estimate the value of non-terminal positions during the search. One of the benefits of using self-play reinforcement-learning to train a neural-network/tree-search agent is that as the neural-network improves, so do the decisions arising from the tree-search. A known characteristic of reinforcement-learning is that the network's predictions are initially inaccurate since it has had little training. AlphaZero's method is effective, making it state-of-the-art, however many of the early generated training examples are of inherently poor quality as they are generated using an inadequately trained1 neural-network. By applying an end-game-first curriculum we seek to eliminate the use of these poor quality training experiences, resulting in an improved rate of learning when compared with an agent without a curriculum.

4.1.1 Weakness of the Current Approach

AlphaZero generates its own training examples as part of its learning loop through self-play. The self-play generated examples continually improve as the network learns via reinforcement-learning [106, 110, 143]. Back-propagation with gradient-descent is commonly used to train the neural-network's weights with reinforcement-learning by minimising the difference between the network's prediction for a state at one time and a future prediction for the same state at some later time, called temporal difference (TD) learning [110], as explained in Section 2.4.3. The future prediction can be approximated by conducting a tree-search to look ahead. This variation of TD learning is called TD-Leaf [111] and underpins AlphaZero's combined neural-network design; that is, AlphaZero uses the difference between the neural-network's prediction and the outcome of a tree-search to train its network. The neural-network in a neural-network/tree-search agent does not directly make move predictions; instead its net effect is to identify an initial set of most promising paths for the tree-search to explore. As the search is conducted, if the most promising paths are poorly selected by the neural-network they can be overridden by a sufficiently deep search, resulting in a good move decision despite the poor initial most-promising set. A weakness with this approach arises when:

• the network is inadequately trained, typically during the early stages of training; and

• the tree search is less likely to discover terminal states, typically in the early stages of a game.

1 A network can be inadequately trained either due to poor training or insufficient training.

If the network is inadequately trained, then the set of moves that the network selects, which are intended to be promising moves, will likely be random or at best poor. If the tree-search does not find sufficient terminal states, then the expansion of the tree is primarily determined by the neural-network, meaning that there may not be enough actual game-environment rewards for the tree-search to correct any poor network predictions. If both of these situations occur together then we say that the resulting decision is uninformed, which results in an uninformed training example that has little or no information about the game that is being learned. In this chapter we demonstrate how employing an end-game-first training curriculum to exclude expectedly uninformed training experiences results in an improved rate of learning for a combined neural-network game playing agent.

4.2 Related Work

4.2.1 Using a Curriculum for Machine Learning

Two approaches are predominant in the literature relating to curriculum-learning for machine-learning: reward shaping and incrementally increasing problem complexity [144]. A third curriculum-learning approach is also covered, where the agent is placed at the terminal-state.

Reward Shaping

Reward shaping is where the reward is adjusted to encourage behaviour in the direction of the goal-state, effectively providing subgoals which lead to the final goal. For example, an agent with an objective of moving to a particular location may be rewarded for the simpler task of progressing to a point in the direction of the final goal, but closer to the point of origin. When this subgoal can be achieved, the reward is adjusted to encourage progression closer to the target position [145]. Reward shaping has been used successfully to train an agent to control the flight of an autonomous helicopter conducting highly non-linear manoeuvres [146, 147, 148]. Sun et al. [149] use a variation of reward shaping, diminishing consecutive rewards to mitigate any one agent greedily consuming all resources in a multi-agent environment. The method we propose in this chapter differs from these approaches in that it focuses on controlling training examples instead of the reward, as in other reward-shaping literature [150].

Incrementally Increasing Problem Complexity

Another approach to curriculum-learning for a neural-network is to incrementally increase the problem complexity as the neural-network learns [151]. This method has been used to train a network to classify geometric shapes using a two-step training process. In this approach, a simpler training set was used to initially train the network, before further training the network with the complete dataset which contained additional classes. This approach resulted in an improved rate of learning [152] for a simple neural-network classifier. Both reward shaping and incrementally increasing problem complexity rely on prior knowledge of the problem and some level of human customisation [153, 154].

Reverse Curriculum

Florensa et al. [155] demonstrated how an agent can learn by reversing the problem space, by positioning an agent at the goal-state and then exploring progressively further away from the goal. Their method, reverse curriculum, sets the agent's starting position to the goal-state, and noisy actions are taken to obtain other valid states in close proximity to the goal. As the agent learns, the actions result in the agent moving further and further from the goal. Their method was demonstrated to improve the time taken to train the agent in a number of single-goal reinforcement-learning problems, including a robot navigating a maze to a particular location and a robot arm conducting a single task such as placing a ring on a peg. The set of problems to which the reverse curriculum is suited is constrained due to the requirement of a known goal-state, and the requirement that the goal-states themselves be non-terminal; i.e. on reaching the goal-state, legal actions still exist which move the agent away from the goal. Having a known goal-state permits the reverse curriculum to be particularly useful if the problem's goal-state is unlikely to be discovered through random actions. The weakness of the reverse curriculum is that if the problem has a number of distinct goal-states, then focusing on training near a single known goal would result in the agent being overfitted to the selected known goal at the exclusion of all others. The reverse curriculum is not suitable for games, as games have multiple terminal states which are not known in advance, and once a terminal state is reached there are no further legal actions that can be taken.

End-Game-First Curriculum

We define an end-game-first curriculum as one where the initial focus is to learn the consequences of actions near a terminal state/goal-state, and then to progressively learn from experiences that are further and further from the terminal state. We consider the reverse curriculum to be a special case of the end-game-first curriculum due to the additional requirements outlined above. It is not a requirement of the end-game-first curriculum that the agent commence near a terminal/goal-state, or that one is even known; but in the course of exploration, when a terminal/goal-state is discovered the distance to any visited non-goal-states can be calculated and a decision can be made as to whether those states will be used for training the agent, depending on their distance. The end-game-first curriculum differs from incrementally increasing the problem complexity in that the consequences of actions leading to a terminal state may in fact be more complex to learn than the earlier transitions. The end-game-first curriculum does, however, temporarily reduce the size of the problem space by initially training the network on a smaller subset of the overall problem space. The advantage of an end-game-first curriculum is that it doesn't rely on any prior knowledge of the problem - including requiring a known terminal state. By first focusing near a terminal state, the agent is trained to recognise the features and actions which give rise to environmental rewards (terminal states), then progressively learns how to behave further and further from these states. We demonstrate in this chapter that using an end-game-first curriculum for training a combined neural-network game playing agent can improve the rate at which the agent learns.

4.3 Method

An end-game-first training curriculum requires that end-game positions are presented to the agent first. An end-game-first approach can be achieved by discarding a fraction of early-game experiences in each self-play game, with the fraction dependent on the number of training epochs which have occurred - progressively discarding fewer and fewer early experiences as training continues. By selectively discarding the early-game experiences, we propose that the result is a net improvement in the quality of training experiences, which leads to the observed improvement in the rate of learning of the player. Since creating an experience requires a full tree search to be conducted, discarding an experience which has already been created is inefficient. Instead, we seek to simply avoid generating these experiences in the first place. We demonstrate the effectiveness of the end-game-first training curriculum on two games: modified Racing-Kings and Reversi. During training, we compare the success-rate of two AlphaZero inspired game playing agents, a baseline-player without any modification and a player using an end-game-first curriculum (curriculum-player), against a fixed reference-opponent. We find that by using only the late-game data during the early training epochs, and then expanding to include earlier game experiences as the training progresses, the curriculum-player learns faster compared to the baseline-player. Whilst we empirically demonstrate that this method improves the player's success-rate over the early stages of training, the curricula used in this experiment were chosen semi-arbitrarily, and as such we do not claim that the implemented curricula are optimal. We do, however, show that an end-game-first curriculum-learning approach improves the training of a combined neural-network game playing reinforcement-learning agent like AlphaZero.

4.3.1 Curriculum-Player

We modify the baseline-player defined in Chapter 3 to create a curriculum-player which demonstrates the effectiveness of the end-game-first curriculum. The difference between the baseline-player and the curriculum-player is that some experiences are not generated during self-play for the curriculum-player. A curriculum function ζ(e) is introduced, where e is the number of training epochs conducted. ζ(e) is a proportion indicating which portion of a game's experiences will be retained for training, starting from the end of the game; i.e. an experience is retained only if its distance Xd to the game's terminal state is less than n · ζ(e), where n is the number of ply in the game. Equivalently, the first n(1 − ζ(e)) ply of a game are excluded, which has the effect of retaining only the last ζ(e) × 100% portion of any game in the experience-buffer. For example, the baseline-player has an equivalent function ζ(e) = 1 for all e; that is, retaining 100% of a game's experiences. The curriculum used for Racing-Kings is shown in Equation 4.1, while the curriculum used for Reversi is shown in Equation 4.2. We also investigate the performance of a linearly increasing curriculum, as shown in Equation 4.3. Generating experiences and then trimming them is inefficient, so instead we aim not to generate them in the first place. We exploit the fact that the curriculum-player excludes early-game experiences during training by not conducting a tree-search for actions which will result in an experience that is going to be excluded. To achieve this we calculate the approximate number of ply required to reach the n(1 − ζ(e)) barrier and then randomly play that many actions. We calculate the number of random actions by maintaining the average ply, n̄, for all games played. The first n̄(1 − ζ(e)) of a game's actions are randomly played. After these random actions, we then use the full MCTS process to choose the remaining actions to complete the game. If a terminal state is found during random play then the game is rolled back to n̄(1 − ζ(e)) ply from this terminal state and MCTS is used for the remaining actions. A sketch of the curriculum functions and the random-prefix rule is given after Equation 4.3 below.

Racing-Kings, ζ(t) = 0.1 for t = 0; 0.33 for 0 < t < 100; 0.5 for 100 ≤ t < 500; 0.66 for 500 ≤ t < 800; 0.8 for 800 ≤ t < 1000; 1.0 for 1000 ≤ t.    (4.1)

Reversi, ζ(t) = 0.25 for t = 0; 0.5 for 0 < t < 100; 0.75 for 100 ≤ t < 500; 1.0 for 500 ≤ t.    (4.2)

Incremental CL, ζ(t) = A · t + B, where A = 0.001, B = 0.1 for Reversi, and A = 0.01, B = 0.1 for Racing-Kings.    (4.3)
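The following sketch implements the step curricula of Equations 4.1 and 4.2, the linear curriculum of Equation 4.3 and the random-prefix rule of Section 4.3.1; the helper names are assumptions, and the clamp of the linear curriculum to 1.0 reflects that ζ is a proportion.

# Sketch of the curriculum functions and the random-prefix length.
def zeta_racing_kings(t):
    steps = [(0, 0.1), (1, 0.33), (100, 0.5), (500, 0.66), (800, 0.8), (1000, 1.0)]
    return max(v for thresh, v in steps if t >= thresh)     # Equation 4.1

def zeta_reversi(t):
    steps = [(0, 0.25), (1, 0.5), (100, 0.75), (500, 1.0)]
    return max(v for thresh, v in steps if t >= thresh)     # Equation 4.2

def zeta_linear(t, A, B):
    return min(1.0, A * t + B)                              # Equation 4.3, clamped to 1

def random_prefix_length(n_bar, t, zeta):
    """Number of opening ply to play randomly (no search, no experience stored)."""
    return int(n_bar * (1.0 - zeta(t)))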

4.4 Experiment

4.4.1 Experimental Conduct

We use the experimental setup explained in Chapter 3 for this experiment. Two AI players are trained: one using the proposed curriculum-learning approach (curriculum-player), the other without (baseline-player), and weights θi are saved after each training cycle i. During training, weights θi are selected randomly for a competition between the AI players and a reference-opponent, and the players' respective success rates yθi are recorded. Success is a win or a draw, since in some games perfect play may result in a draw.

A competition consisting of a minimum of 30 games for randomly selected θi is conducted against the opponent to obtain the success ratio yθi. The experiment is conducted in full three times and the results are combined. The wins, draws and losses are accumulated periodically to create a data-point with a 95% confidence interval obtained using the Wilson-Score with continuity correction [156]. The Wilson-Score with continuity correction can be used to determine the confidence-bounds for a Bernoulli trial - a trial which has two outcomes, success or failure. A training step is completed after the presentation of one batch of experiences, and a training epoch is completed after all experiences have been presented once. As training begins after a set number of games, the number of steps for each epoch varies from experiment to experiment; likewise, the time taken to play a game is also completely unique, making the time, epochs and steps independent from experiment to experiment. As such, when combining data from multiple experiments the three plots may have different appearances. Note that the number of steps in an experiment is independent of the time taken for an action/experience to be created, and that epochs are independent of how many actions/experiences exist in the experience-buffer.

Whilst yθi vs time is the measure which we are primarily interested in, measuring against steps and epochs is also informative. We mitigate the potential differences which might arise from differing system loads by training both AI players simultaneously on one dual-GPU system. Each player is allocated a single GPU and shares in the combined CPU resources of the system.

4.5 Results

Figure 4.1 shows the performance of the curriculum-player and the baseline-player in a Racing-Kings competition against the Stockfish opponent. Figure 4.2 shows the performance of the curriculum-player and the baseline-player playing Reversi against the MCTS opponent. Figures 4.3 and 4.4 show the performance of the curriculum-player and the baseline-player playing Racing-Kings and Reversi respectively with the incremental curriculum. Figure 4.5 shows the performance whilst playing Racing-Kings against the Stockfish opponent with both the curriculum-player and the baseline-player using the AlphaGo Zero training pipeline with the added evaluation step. Note that the plots showing success against time are the metric in which we are most interested. The additional plots allow us to investigate some of the effects of using a curriculum. The success-rate of the player using the end-game-first training curriculum exceeds that of the baseline-player during the early stages of training in all cases when measured against time. This was also observed across multiple training runs with differing network parameters and differing curricula.

4.6 Discussion

Our results indicate that a player trained using an end-game-first curriculum learns faster than a player with no curriculum during the early training periods. The improvement over time can, in part, be attributed to the increased speed of self-play when moves are chosen randomly by the curriculum-player instead of conducting a full search; however the improvement in success-rate is also observed when compared against training steps, indicating that a more subtle benefit is obtained. The success-rate of the two players when compared against training epoch shows similar performance for the two players. In this section we compare and contrast the performance of the curriculum-player and the baseline-player against the reference-opponents, and separately discuss the results with respect to time (Subsection 4.6.1), training steps (Subsection 4.6.2) and epochs (Subsection 4.6.3). We provide a brief discussion of the results obtained using the linear curriculum (Subsection 4.6.4) and finish with an outline of the limitations of these experiments (Subsection 4.6.5).

4.6.1 Success Rate vs Time Comparison

Plot (a) of Figures 4.1, 4.2 and 4.5 shows the success-rate of the AI players vs time. The curriculum-player does not retain experiences generated early in a game during the early training periods. This is achieved by selecting these moves randomly instead of using the naive approach of performing a full search to create experiences which will then be discarded, as explained in Section 4.3.1. The time saved as a result of randomly selecting actions instead of conducting a tree-search can be significant. However, when making a random action an experience is not created, which means that a randomly selected move contributes in no way to the training of the player. The number of experiences created using a curriculum is therefore less than when not using the curriculum. The curriculum needs to balance the time saved by playing random moves against the reduction in generated training experiences. Whilst the curriculum-player leads in success rate compared to the baseline-player during the early time periods, the success-rates of both players converge in all experiments. For a given neural-network configuration there is some expected maximum performance threshold against a fixed opponent; in the ideal case this would be a 100% success-rate but if the network is inadequate it may be less. Although we expect that the success-rates of the two players would converge eventually, it appears that convergence occurs prior to the maximum performance threshold, as shown in Figure 4.1(a) near 4000 minutes and Figure 4.5(a) near 5000 minutes; while Figure 4.2(a) shows convergence near 6000 minutes at what appears to be the network's maximum performance threshold. For an optimal curriculum, the convergence of the two players would be expected to occur only at the maximum performance threshold.


Figure 4.1: Success rate for the AlphaZero inspired player vs Stockfish level 2 playing Racing-Kings, with and without the end-game-first curriculum from Equation 4.1. Whilst the time-based success-rate is the measure we are concerned with, success-rate vs steps and epochs is also informative. Note that an improved learning rate is evident in both the time and the steps plots. Improvement in learning when measured against steps indicates that the performance improvement is not just from the time saved by conducting random moves during self-play. Note that some temporary unlearning was observed for some individual training runs for the curriculum-learning player. This plot is the combination of results from 3 separate experiments; 95% confidence-bounds are obtained using the Wilson-Score with continuity correction. Since the agent's weights were selected at random, the competitions which produced these results yield fewer data-points at the end of the training run, resulting in wider confidence intervals. The confidence interval widths depend on both the variance of the values and the number of games played, as discussed in Chapter 8 - fewer games were played at the end of training, resulting in wider confidence intervals.


Figure 4.2: Success rate for the AlphaZero inspired player vs the MCTS opponent playing Reversi, with and without the training curriculum from Equation 4.2. Consistent with the results in Figure 4.1, improvement is observed over both time and steps, and the success-rate of both players is also similar over epochs despite the player with curriculum-learning having fewer training examples per epoch due to the curriculum. This plot is the combination of results from 3 separate experiments; 95% confidence-bounds are obtained using the Wilson-Score with continuity correction. Note that since the agent's weights were selected at random during training for the competition that produced these results, there are fewer data-points at the end of the training run, resulting in wider confidence intervals.


Figure 4.3: Success rate for the AlphaZero inspired player vs the Stockfish opponent playing Racing-Kings, with and without the incremental training curriculum from Equation 4.3. This plot is the combination of results from 3 separate experiments; confidence-bounds are obtained using the Wilson-Score with continuity correction. Note that using the same curriculum as used for Reversi resulted in the curriculum-player learning slower than the baseline-player due to the curriculum changing too slowly for the game. As we increased the rate of change of the linear curriculum the agent performed better. Due to the nature of the Racing-Kings game we found that using a step curriculum was more efficient in finding a curriculum which performed better than the baseline-player.


Figure 4.4: Success rate for the AlphaZero inspired player vs the MCTS opponent playing Reversi, with and without the incremental training curriculum from Equation 4.3. This plot is the combination of results from 3 separate experiments; confidence-bounds are obtained using the Wilson-Score with continuity correction. Note that the GPUs used to obtain data for this plot were of a lower standard than those used for Figure 4.2. We used 2 x M40 GPUs for this data, whilst 2 x P100s were used for Figure 4.2, leading to a lower final success-rate for the baseline-player.


Figure 4.5: Success rate for AlphaGoZero inspired player vs Stockfish level 2 playing Racing-Kings with and without the training curriculum from Equation 4.1. This plot is the combination of results from 3 separate experiments; confidence-bounds are obtained using the Wilson-Score with continuity correction.

4.6.2 Success Rate vs Steps Comparison

Plot (b) in Figures 4.1, 4.2 and 4.5 shows the success-rate of the AI players vs the number of training steps. Although the success-rate improvement over time for the curriculum-player can be attributed in part to the use of random move selection, a success-rate improvement is also observed when measured against training steps. A training step occurs when one batch of experiences is presented to the optimise module, making a training step time independent; i.e. how long it takes to create an experience is not a factor for training steps. Given that the curriculum-player outperforms the baseline-player when measured over training steps, we argue that there is a gain which relates directly to the net quality of the training experiences. Consider, for example, the baseline-player's very first move decision in the very first self-play game for a newly initialised network. With a small number of MCTS simulations relative to the branching factor of the game, the search for this first decision will not build a tree of sufficient depth to reach a terminal node, meaning that the decision will be derived solely from the untrained network weights - recall that the neural-network is used for non-terminal value estimates. The resulting policy π which is stored in the experience-buffer has no relevance to the game environment as it is solely reflective of the untrained network. We posit that excluding these uninformed experiences results in a net improvement in the quality of examples in the training buffer. Later in that first game, terminal states will eventually be explored, over-ruling the inaccurate network estimates, and the resulting policy will become reflective of the actual game environment - these are experiences which should be retained. As the training progresses the network is able to make more accurate predictions further and further from a terminal state, creating a visibility-horizon which becomes larger after each training cycle. The optimum curriculum would match the change of the visibility-horizon.

4.6.3 Success Rate vs Epoch Comparison

Plot (c) in Figures 4.1, 4.2 and 4.5 shows the success-rate of the players vs the number of training epochs. Recall that an epoch is when all experiences in the experience-buffer have been presented to the optimise module. Since training is only conducted when the experience-buffer has sufficient games, an epoch is directly proportional to the number of self-play games played, but is independent of the number of experiences in the buffer and the time it takes to play a game. When applying the curriculum, fewer experiences per game are stored during the early training periods compared to the baseline-player, meaning that the curriculum-player is trained with fewer training experiences during early epochs. The plots of success-rate vs epoch show the curriculum-player outperforming the baseline-player for the early epochs in Figure 4.1(c), with the two success-rates converging rapidly; while Figures 4.2(c) and 4.5(c) show the two players performing similarly. The similarity of the results when measured against epochs shows that, despite the curriculum-player excluding experiences, no useful information is lost by doing so. That useful information is not excluded during these early epochs supports our view that the net quality of the data in the experience-buffer is improved by applying the specified end-game-first curriculum. It is expected that due to a combination of the game mechanics and the order in which a network learns there may be some learning resistance² which could result in plateaus in the success-rate plot, allowing the trailing player to catch up temporarily. We expect a sub-optimal curriculum to result in additional learning resistance or, in the extreme case, deteriorated performance, which would predominantly be observed immediately following a change in the curriculum value. For Reversi the final curriculum increment from Equation 4.2 occurs after 500 epochs, and Figure 4.2(c) shows a training plateau shortly after this change, albeit at the network's maximum learning limit. For the Racing-Kings experiment, Figure 4.1, a performance reduction was observed in each training run for the curriculum-player shortly after the curriculum had changed to 80% as shown in Equation 4.1. This learning loss is averaged out in the plot; however, of the three experiments that comprise the data for this plot two of them have a learning loss around 800 epochs and the other at 1000 epochs - the final step in the curriculum. The observed learning loss differs with each individual training run and highlights the importance of the order in which learning occurs and its impact on the effectiveness of the curriculum.
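As a concrete reading of these two definitions, the sketch below tallies steps and epochs inside a hypothetical optimise module; buffer.batches() and network.train_on_batch() are assumed helpers rather than the thesis code.

# Illustrative bookkeeping only, under the assumptions stated above.
def optimise(network, buffer, batch_size, steps=0, epochs=0):
    for batch in buffer.batches(batch_size):   # one pass over every buffered experience
        network.train_on_batch(batch)
        steps += 1                             # a step is one batch, independent of wall-clock time
    epochs += 1                                # an epoch is one full pass over the buffer
    # A curriculum-player holding fewer experiences per game therefore
    # performs fewer steps per epoch during the early training periods.
    return steps, epochs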

4.6.4 Using a Linear Curriculum Function

Figures 4.3 and 4.4 show the success-rate of the AI players using the linear curriculum from Equation 4.3. For the game of Reversi the linear curriculum performed better than the baseline-player, consistent with the previous observations. However, for the game of Racing-Kings the linear curriculum performed to approximately the same standard as the baseline-player. Due to the nature of Racing-Kings, non-stalemate terminal states occur infrequently under random play, so the initial self-play process is time consuming when compared with the time a training cycle takes once the agent is trained. After the initial training cycles, learning about terminal-states occurs relatively quickly, since a winning terminal state is simply the King being positioned on the last row of the board. Before conducting this Racing-Kings experiment, we first conducted the experiment using the same linear curriculum as used for Reversi, from Equation 4.3. When using the Reversi curriculum for Racing-Kings we found that the curriculum-player performed worse than the baseline-player because the variation was too slow - this result was expected, since different games require different curriculum changes. Consider the extreme case where only the last 10% of the game's board positions are used for training, i.e. a very slow curriculum change. With a very slow rate of change, the agent would only learn a small portion of the game and the baseline-player would be expected to perform better over time. To address this, we customised the linear curriculum to increase the rate of change compared with the Reversi curriculum, accounting for the nature of the Racing-Kings game. The purpose of our research is not to find an optimum curriculum for any specific game, but to demonstrate the effective use of the end-game-first curriculum. Whilst a better linear training curriculum might exist for Racing-Kings, our findings from this experiment indicate that it is easier to find a curriculum which outperforms the baseline-player by using a hand-crafted step curriculum rather than a linear function.

² To our knowledge, the term learning resistance is not defined in relation to machine-learning. We define it to mean a short-term resistance to network improvement.
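For readers who prefer code, the two schedule shapes discussed in this section can be sketched as follows. The breakpoints, rates and return values below are invented for the illustration; the actual schedules are those defined in Equations 4.1-4.3.

# Illustrative curriculum schedules only; thresholds and rates are made up,
# the thesis schedules are Equations 4.1-4.3.
def step_curriculum(epoch):
    """Hand-crafted step schedule: fraction of the game, measured from the end,
    whose positions are allowed to produce training experiences."""
    if epoch < 200:
        return 0.2      # only the last 20% of the game
    if epoch < 600:
        return 0.5
    if epoch < 1000:
        return 0.8
    return 1.0          # eventually the whole game is used

def linear_curriculum(epoch, start=0.1, rate=0.001):
    """Linear schedule: the usable end-game fraction grows at a fixed rate per
    epoch; Racing-Kings needed a faster rate than Reversi."""
    return min(1.0, start + rate * epoch)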

4.6.5 Limitations

The nature of self-play reinforcement-learning means that as the agent learns, the quality of experiences also improves - in turn improving the training of the agent. This means that the agent could potentially continue to improve indefinitely, subject to the capacity of the architecture. The experiments performed for this research were continued for sufficient time to observe the effects of applying the prescribed curricula when compared with the baseline-player; however, no guarantee exists as to how the comparative performance would progress should training continue indefinitely. This limitation is not specific to our experiment or our proposed curriculum-learning method, but is instead a characteristic of self-play reinforcement-learning in general. When training an agent via self-play reinforcement-learning, the decision to halt training can never guarantee that the chosen moment was the optimum time to stop. We halt the experiment when the network appears to have stopped learning, when the rate of improvement for both players has plateaued, or at approximately 10,000 minutes. To reduce variance, as discussed in Chapter 3, the size of the network is smaller than it would be if we were training a single agent solely for maximum performance; as such the maximum level of playing performance is expected to be reached earlier and to occur before the success-rate is 100%. As shown in Table 1.1, training via self-play reinforcement-learning using the state-of-the-art architecture to achieve superhuman performance can be costly and/or time consuming. The results presented by Deepmind introducing their state-of-the-art agents were obtained from singularly trained agents (one single experiment). Time and resource constraints limit the ability to perform multiple training runs for every version of an agent, although one of Deepmind's key researchers noted during questions at the Neural Information Processing Systems conference that numerous non-reported experiments were conducted, indicating repeatable performance [157]. Likewise, the results reported in this chapter were preceded by several smaller scale experiments confirming the generality of the new methods outlined here, prior to conducting the final larger-scale experiment. These smaller scale experiments, whilst useful for research progress, were often less interesting because they were not sufficiently complex to observe a significant improvement. We observed during this preliminary work that as game complexity increases the benefits also increase, driving our choice of games for this experiment. Compared with Deepmind's experiment, our reduced game complexity has enabled us to perform three separate experiments for each game. The purpose of running each experiment three times is to demonstrate that the benefits of applying an end-game-first curriculum are observed on average, rather than from any single training run.

4.6.6 Curriculum Considerations

Although it is expected that the baseline-player's performance would converge with the curriculum-player's performance, ideally this would occur near the maximum success-rate, and we observe this effect in the Reversi tournament shown in Figure 4.2. In the Racing-Kings tournament the two players' success-rates converge well before the maximum success-rate, indicating that the curriculum is perhaps sub-optimal. Given that the implemented curriculum is semi-arbitrary, this is expected. While curriculum-learning is shown to be beneficial during the early stages of training, the gain could still be lost if the curriculum changes too slowly or too abruptly; in such cases it is likely that overfitting occurs, resulting in a loss of the early gains. Methods of regularisation as discussed in Section 2.4.4 could be used to address this overfitting if it is problematic, and we explore this in Chapter 7. When designing a curriculum, consideration needs to be given to the speed at which the curriculum changes. At one extreme is the current practice where the full game is attempted to be learnt at all times, i.e. the curriculum is too fast because it immediately attempts to learn from 100% of the game at epoch 0. At the other extreme the curriculum is too slow, which can result in the network overfitting to a small portion of the game space or discarding examples which contain useful information. We argue that each training run for each game could have its own optimum curriculum profile, due to the different training examples which are generated.

4.7 Summary

We have shown that the rate at which a self-play reinforcement-learning AI agent learns when using a combined neural-network/tree-search architecture can be improved by employing an end-game-first training curriculum. Although the hand-crafted curricula used in this study are not necessarily optimal, it is possible that no fixed curriculum exists which would be optimal at all times, as the order and the composition of the generated experiences are themselves a factor in the neural-network's learning. Likewise, the optimum curriculum is likely to be non-linear, with different stages of the game being harder to learn than others. In learning how to play a game the agent is learning to control a previously unseen environment, and by employing an end-game-first curriculum it is able to speed up that learning.

Further work

Whilst using hand-crafted curricula is an improvement over using no curriculum, further work is required to identify an approach which makes this method suitable for general problems. Alternatively, if the problem of interest is singular in nature, then further research could involve identifying an optimum curriculum for the desired environment. The following considerations will need to be addressed to obtain an optimal curriculum:

• Balancing the time saved by random moves with the loss of training experiences.

• Ensuring the curriculum profile does not change so fast as to include uninformed examples.

• Ensuring the curriculum profile does not change so slowly:

– as to cause the network to overfit to a smaller portion of the environment space;

– as to discard examples that would be good training examples.

To address these requirements, the curriculum will likely need to be adaptive. An adaptive curriculum, which dynamically changes the difficulty of the generated examples in response to learning progress, would likely be effective for general control problems. The next chapter investigates such an adaptive curriculum.

Chapter 5

Automated End Game First Curriculum

Highlights

• If the neural-network in a neural-network/tree-search agent is untrained, informed decisions can only occur when the search reaches a terminal node.

• A neural-network/tree-search agent has an average search-depth which increases as the agent learns.

• Adding the search-depths from each training epoch creates a visibility-horizon, which is the approximate distance (ply) from a terminal node at which informed decisions can be made.

• A general end-game-first curriculum is shown to be effective when based on the visibility-horizon.

5.1 Introduction

In the previous chapter we described the end-game-first curriculum and demonstrated how it can be employed to speed up training of a neural-network/tree-search agent. As the name suggests, the end-game-first curriculum focuses on learning positions close to the end-game first, then progressively seeks to learn positions that are further and further from the end of the game. In this chapter we present a new method for automatically adjusting the curriculum without any human intervention, which outperforms the existing state-of-the-art approach of not using a curriculum for all games tested. We also present the concept of the visibility-horizon for an agent, a measure which indicates how far from a terminal state experiences are sufficiently informed and therefore useful for training. We also present, to our knowledge for the first time, the concept of using the change in search-depth from one training cycle to the next to indicate the learned knowledge of the agent. In Chapter 4 the hand-crafted curriculum function ζ(e) was dependent on the number of epochs e, where ζ was used as the criterion for permitting an experience to be added to the experience-buffer depending on its distance from a terminal state.

The curriculum function was based on the assumption that the agent improves over time (epochs). The problem with this approach is that it requires some prior knowledge of the game to assess the optimal rate at which the curriculum should progress. A complex game like Reversi requires substantially more time to learn than Tic-Tac-Toe, and as such the rate of change of the curriculum for Reversi needs to be slower per epoch than for Tic-Tac-Toe. However, complexity cannot simply be measured by the size of the state-space, as some games can be trivial despite the state-space size. Take for instance a 3-in-a-row game on an 8x8 board. This trivial game has a large state-space, yet the player to act first can win trivially. Despite the large state-space, the curriculum for such a game needs to change rapidly, as the value function for the game is simple to learn. When defining the curriculum function by hand, if the game is not well understood in advance then the curriculum may actually slow down learning by progressing too slowly or too quickly. The approach to curriculum-learning that we presented previously is based on the number of training epochs - which is directly proportional to training time. Whilst we use the number of training epochs as the independent variable to set our curriculum, we have no real indication as to whether sufficient time was allocated to the particular portion of the state-space enforced by the curriculum. In setting curricula by hand, the timing of any changes needs to account for the estimated complexity of the portion of the state-space being learned. Further complicating the design of a curriculum, the complexity of small packets of the game's state-space does not progress linearly when working backward from the end-state. Take for example the very narrow state-space which incorporates only terminal states in the game of Reversi. The value of a terminal state in Reversi is simply the difference between the numbers of each coloured tile. However, the value calculations for the set of states which includes board positions that are 3-ply from terminal are substantially more complex - taking into account not only the number of tiles but also their absolute position on the board and their relative position to other tiles on the board. Similarly, complexity does not always decrease as the game progresses from the start-state. For instance, in some games like Reversi the very first action is trivial, since the available first actions are all identical when rotations are taken into account. Our method for automating the curriculum accounts for these issues. We discussed in the previous chapter the following considerations for setting an end-game-first curriculum, and in this chapter we address these concepts:

• Balancing the time saved by random moves with the loss of training experiences.

• Ensuring the curriculum profile does not change so fast as to include uninformed examples.

• Ensuring the curriculum profile does not change so slowly:

– as to cause the network to overfit to a smaller portion of the environment space,

– as to result in discarding examples that are sufficiently effective for training.

In the previous chapter we established that by applying an end-game-first curriculum to a neural-network/tree-search game playing agent the time to learn a game can be reduced; however, some weaknesses were highlighted. A single optimal hand-crafted curriculum could not be found for all games, and even when an effective curriculum was found for a single game it was sensitive to the neural-network hyper-parameters and varied from one training run to another. One of the key findings from Chapter 4 was that a useful generic curriculum would need to be based on the performance of the neural-network, not just the number of training cycles or time. Our approach to setting an automated curriculum will be to increase the curriculum if the agent's rate of learning is fast, maintain the curriculum if the agent is showing some improvement, and decrease the curriculum if the agent is learning too slowly.

5.2 Prerequisites

Progressing the Curriculum at the Optimum Rate

Our method of implementing an end-game-first curriculum in Chapter 4 was to incrementally add experiences further and further from a terminal state as the agent improves. We relied on the assumption that the agent improves¹ with each training epoch, thus our motivation for basing the curriculum function ζ(e) on epochs e. The assumption that the agent improves with each training epoch was found to be correct on average, and we were able to demonstrate that incrementing the curriculum based on epochs was an improvement when compared against an agent not using the curriculum. The weakness with using epochs as the independent variable for the curriculum function is that it does not take into account variances between different training runs, and it also requires some prior knowledge of the rate of learning for each specific game and neural-network combination. We established in Section 4.1.1 that for a completely untrained agent, informed decisions can only be made when terminal states are found during the search. We have also found that the agent's ability to make informed decisions propagates further and further from terminal states as training progresses. The agent, using the neural-network, builds an internal action-value mapping of the environment during training, and information about the environment propagates backwards through its map from the terminal-states to the starting states - recalling the discussion from Section 2.4.3. At any point in time the reward information from the environment will have propagated some distance from the terminal-state within the agent's internal mapping; we call this the informed distance. For a search-tree built to estimate the ground-truth for an experience, any nodes that are within the informed distance from a terminal-node contain some information about the environment, and we call these informed nodes. Informed nodes may or may not contain enough information to be useful for training, but they contain at least some amount of environment information. Consider the situation for a newly initialised agent when the tree search includes just one terminal state - there is no guarantee that the single environmental reward will be enough to override the neural-network's random estimate, but the root-node would still be within the informed distance from terminal. The distance from a terminal-node within which nodes are sufficiently informed to make accurate predictions and are useful for training is what we call the visibility-horizon, as shown in Figure 5.1. Calculating the precise value of the visibility-horizon is likely to be intractable; however, the upper limit for the visibility-horizon is the informed distance. The lower limit for the visibility-horizon is 0, since an untrained agent may not be able to make accurate predictions at any distance from a terminal state. Informed nodes contain either information directly from the environment via terminal-nodes, or indirectly through other informed nodes, and with each training cycle the informed distance increases if the search finds at least one informed node. If the network is learning with each training cycle, the visibility-horizon also increases. When building a tree for a particular game state, if the deepest node in the game tree does not reach the visibility-horizon then it is uninformed, i.e. the tree will be comprised of nodes which have insufficient information about the environment. We liken the visibility-horizon to the distance out to sea from which a boat can determine the location of land, i.e. has useable information about the land. The boat can directly observe the land's location if it is within the boat's range of sight. If the land is beyond the horizon then, without any other information, the boat's navigation decisions to find land will be uninformed. To extend this analogy further, to aid future navigation decisions a marker buoy can be dropped at the location where the land is first seen, and in future this buoy can be used to extend the distance at which the boat can make navigation decisions, as shown in Figure 5.2. This buoy allows the boat to derive information about the land without needing to directly see the land, i.e. extending the visibility-horizon further out to sea - subsequently more and more buoys can be placed further and further out. The furthest distance from the land at which a buoy can be seen is equivalent to the visibility-horizon for an agent. To bring this ocean analogy back to our RL problem: land is equivalent to terminal-nodes - where the ground-truth value of the environment is directly obtained in the form of a reward. How far the boat can see is equivalent to the depth of the search. The marker buoys have some similarity to the neural-network learning the dynamics of the state-space at that distance from a terminal node. Consider for a moment whether there is any benefit to dropping a marker buoy if the boat spawned in the middle of the ocean has no prior knowledge and no landmarks - this buoy would be completely uninformed about where the land is. The placement of such a random marker buoy is similar to creating an experience with no direct or indirect information from the environment, which is the current approach to generating ground-truth experiences for neural-network/tree-search agents. For newly initialised players, experiences are created even if no environmental reward is observed. The difference between the boat analogy and the RL problem is that a randomly dropped buoy may eventually connect with others which are connected to land and become useful, whereas the uninformed RL experience will always remain uninformed.

¹ Implicit in our use of the word "improve" is that the agent makes more accurate predictions.

Figure 5.1: Example game tree for the game of Tic-Tac-Toe with an expansion of 9 nodes for the board position shown inside the root-node (red). Note that the depth of the tree is 3 nodes, and that the deepest node is terminal. If a newly initialised agent produced this tree, then we could say that the experience created from this root-node was informed, as it contains at least some information about the environment. Whether there is enough information to make the experience useful for training is uncertain. The visibility-horizon attempts to define the distance from a terminal node within which generated experiences are useful for training; the upper limit in this case is 3 nodes, but the true value is likely to be less than that.

Figure 5.2: Without directly seeing land, a boat can still derive information about where the land is, if markers are placed in a manner such that their separation is within the distance the boat can see, and one of the markers has visibility of the land.

5.2.1 Visibility-Horizon

To formalise the concept, we define the visibility-horizon as extending from a terminal-node up the tree for some distance within which the agent is still able to make informed decisions. Experiences created outside the visibility-horizon are a waste of resources since they will be uninformed. The distance from a terminal-node to the visibility-horizon is ν; unfortunately the precise value of ν is not easily calculated. ν has an upper limit, ν+, which becomes larger with each training cycle and is based on the previous value of ν+ and the depth of the search-tree, such that

ν+_{i+1} = min(ν+_i + di, n)

where i is the training cycle, di is the search-depth and n is the number of ply in the game. Note that our definition of ν is the distance from terminal-nodes from which an agent is able to make an informed decision. Finding a terminal-node during a tree-search does not guarantee that sufficient information is obtained to enable the agent to make an informed decision for that position, especially if only a small number of terminal nodes are found. Because of this, the true value of ν will be less than ν+, and in our opinion calculating it precisely is likely to be intractable. Instead of attempting to find the true value of ν, we hypothesise that ν increases as the agent's knowledge (skill) increases, and as such we explore a mechanism to measure the relative skill of the agent which we will use to guide our estimate of ν. To implement an optimum curriculum, the curriculum should be set so that experiences are only created for board positions that are within ν of a terminal-state. Since ν is expected to be difficult to calculate, we will instead periodically adjust our estimate of ν using the agent's relative performance. As outlined in Chapter 4, experiences are created by performing a tree-search using the current board position as the root-node, and these experiences are then used to train the agent. The ground-truth of an experience for non-terminal-nodes is initially estimated by the neural-network, and if the tree-search does not override the neural-network then the created experience depends solely on how well trained the neural-network is. It is useful to be able to describe experiences, nodes and states as being either informed or uninformed depending on whether their distance to the closest terminal state is within the agent's visibility-horizon. Given our previous discussions we highlight the following observations with respect to the visibility-horizon, noting that the ground-truth value of an experience is derived from the root-node of a tree-search, that each node in the tree relates directly to a game state, and that an experience inherits the label of being informed/uninformed from the associated node/state that was used to create it.

• Terminal nodes are informed, in fact they are directly informed with first-hand information from the environment. They are unambiguous and perfectly accurate as to their value since the information comes directly from the environment. Note that non-terminal reward signals, if they exist, do not necessarily provide perfect value information if the desire is to maximise the reward over the long-term - recalling our discussion in Section 2.4.3 about an agent choosing to forgo a small immediate reward for a larger reward later.

• An informed node has at least one subordinate informed node in its search-tree - noting that terminal nodes are also informed nodes.

• An uninformed node has no subordinate informed nodes.

• The maximum visibility-horizon ν+ is the frontier distance in a game tree, measured from a terminal-node, where nodes are informed. ν+ increases with each training cycle by the depth of the search-tree, but only where the tree includes an informed node. The change in ν+ can be calculated by ν+_{i+1} = ν+_i + di, provided the tree has an informed node, where di is the depth to the informed node.

• The visibility-horizon ν is some uncertain distance, less than ν+, where experiences are sufficiently informed that they are valuable for training.

• The closer a node is to a terminal node the more informed it is.

Inherently, experiences are created with varying amounts of environmental knowledge, and ideally only well informed experiences will be generated during training. Defining the visibility-horizon in this manner allows us to crudely quantify our confidence in the accuracy of an experience's ground-truth, based on its distance Xd from a terminal experience and its relationship to the maximum visibility-horizon ν+, as shown in Table 5.1.

Relationship | Label | Confidence in accuracy of derived ground-truth
ν+ < Xd | Uninformed | 0% (No confidence)
Xd = 0 | Directly Informed | 100% (Complete confidence)
0 < Xd ≤ ν+ | Knowledge increases as Xd → 0 | Confidence increases as Xd → 0

Table 5.1: Labelling experiences based on whether they include information from the environment. Labelling experiences in this manner allows us to express a level of confidence in the derived ground-truth value for the experience depending on how far the experience is from a terminal experience and the maximum visibility-horizon ν+.
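The update for ν+ and the labelling in Table 5.1 can be written down directly; the sketch below is a literal transcription under our notation, with function names of our own choosing.

def update_nu_plus(nu_plus, search_depth, game_ply):
    """nu+_(i+1) = min(nu+_i + d_i, n): the maximum visibility-horizon grows by at
    most the search-depth each training cycle, capped at the game's ply count."""
    return min(nu_plus + search_depth, game_ply)

def label_experience(distance_to_terminal, nu_plus):
    """Label an experience by its distance X_d to the nearest terminal state (Table 5.1)."""
    if distance_to_terminal == 0:
        return "directly informed"   # ground-truth straight from the environment
    if distance_to_terminal <= nu_plus:
        return "informed"            # confidence grows as X_d approaches 0
    return "uninformed"              # beyond the maximum visibility-horizon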

5.2.2 Performance of the Agent

From the previous section we have defined the upper bound for the visibility-horizon, ν+, as the maximum distance from a terminal state from which an informed decision can be made. The actual distance at which informed decisions can be made is ν ≤ ν+. Since the desire is to only generate informed experiences, we will create an estimate ν∗ ≈ ν and use this as the criterion to determine whether an experience will be generated for a given board position. We will adjust ν∗ based on the agent's skill, and in this section we explain how we measure the agent's skill. To use another analogy, determining the agent's skill could be likened to the problem of determining which year an unknown/unseen student should be placed in at school. The desired result is that the student is placed in the appropriate year to challenge their knowledge with problems that are just beyond their current ability. Likewise, we desire to set the curriculum so that appropriate experiences are presented to the agent which are at the limit of ν. To determine ν∗, during training we increase ν∗ if the agent is progressing quickly in its learning, and decrease ν∗ if the previous adjustment was too abrupt, causing a severe slowing of progress. ν∗ is adjusted in this manner for the duration of training, increasing and decreasing depending on how fast the agent is improving. Before this can be achieved, the agent's performance needs to be readily measurable without incurring additional resource costs.

The Agent’s Relative Performance

One method to determine the agent's skill at any point in time is to perform external testing against a known reference-opponent, in the same way we have presented our comparative results in the previous chapter. Whilst this provides a clear reference point as to how the agent's skill is progressing, doing so is both time and resource intensive and sits outside the existing training pipeline. In addition, there exists no single external reference-opponent which can play all games to superhuman levels, and if such an agent were to exist then it could be used directly to generate the self-play experiences. Performing external skill testing of an agent for the purposes of setting a curriculum would likely inhibit the speed of learning, countering the aim of this thesis. For these reasons we have dismissed this as an option for testing the agent's skill to inform the appropriate curriculum. Another potential indicator of the agent's skill could be the loss function. Despite the neural-network generating its own ground-truth experiences, the loss function shown in Equation 3.1 in Chapter 3 does tend to reduce over time as the agent learns the environment. The problem with this approach, however, is that by implementing a curriculum we are varying the complexity and the variety of experiences over the training time, which in turn causes the absolute value of the loss function to vary. This makes the loss unsuitable for determining the agent's improvement.

Search-Depth as a Performance Indicator

The neural-network in the agent is used to bias the tree-search so that the most promising moves are searched, as we explained in Section 4.1. The neural-network effectively recommends which actions the tree-search should explore. A well trained neural-network can correctly predict the best action for a node with a high value relative to the other actions, and in doing so makes a strong recommendation. If a strong recommendation is made then the search is likely to accept it, visiting the recommended child node and subsequently exploring its grandchildren - increasing the depth of the tree instead of the breadth. Strong neural-network predictions can occur at any depth, and as the neural-network learns we expect more confident predictions. Because of this characteristic, we hypothesise that the search-depth increases as the agent learns (making the tree narrower and deeper). Reversing our hypothesis, we then infer that the relative depth of the search from one training cycle to the next provides an indication as to whether the neural-network is improving: the bigger the jump in search-depth between training cycles, the more learning that has occurred. The depth of the search is dependent on the branching factor of the problem and the number of search iterations conducted. We hypothesise that for a single game with a constant average branching factor, and an agent performing a fixed number of search iterations, any variation in search-depth is indicative of the neural-network's learning. We confirmed this hypothesis and found that the average search for a given game was deeper using a trained neural-network than using an untrained neural-network, as shown in Figure 5.3. The relative search-depth provides a measure of how well the neural-network is trained, i.e. the agent's skill at playing. If the neural-network were perfectly trained, then the search would not need to explore other actions (it would immediately identify the optimal moves for victory), causing the depth to be large. Likewise, if only some positions were well understood then the tree search would effectively accept the neural-network's recommendation and explore the children of those well understood positions. The advantage of using the search-depth as a measure of the agent's learned knowledge is that it requires no additional resources, simply an additional variable to record the search-depths achieved for each move decision. Whilst the precise playing quality cannot be inferred from this measure, we can use it to predict whether the agent's learning is improving.
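The bookkeeping this requires is small, as the sketch below suggests; the exp.search_depth attribute is an assumption about how the depth might be stored with each experience.

def average_search_depth(buffer):
    """Average depth of the search-trees that produced the buffered experiences."""
    depths = [exp.search_depth for exp in buffer]
    return sum(depths) / len(depths) if depths else 0.0

def depth_delta(buffer, previous_depth):
    """Delta = d_i - d_(i-1): the change in average search-depth between training
    cycles, used as a zero-cost proxy for learning progress."""
    d_i = average_search_depth(buffer)
    return d_i - previous_depth, d_i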

Figure 5.3: The average search-depth of experiences in the buffer at each epoch for a baseline-player without any improvements, playing different games. This demonstrates that the depth achieved by the search increases as training progresses. Note that the game of Connect4 7x6 is the full Connect4 game on a 7x6 board with a maximum ply of 42, and after 700 epochs the search-depth is greater than 12 ply on average - making it a relatively deep search for the state-space size. Different games perform different numbers of epochs depending on the game's complexity and the parameters used for the player.

5.3 Method

We apply the end-game-first curriculum by attempting to create only experiences within the visibility-horizon, i.e. experiences whose distance to the end-game Xd is less than the visibility-horizon ν (Xd < ν). We reuse the method from Chapter 4 to enforce this condition by playing random moves until the distance-to-terminal for a game is approximately ν, then creating experiences for the remaining moves in the game. We obtain an approximate value for the distance-to-terminal for any experience by maintaining the average number of ply per game. Key parameters used in this chapter are shown in Appendix A.
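A sketch of the gate and the running ply estimate is given below; the momentum constant and function names are assumptions for the example, and the surrounding self-play loop is as in the earlier curriculum sketch.

def update_avg_ply(avg_ply, game_length, momentum=0.95):
    """Maintain a running estimate of the average game length in ply."""
    return momentum * avg_ply + (1 - momentum) * game_length

def should_create_experience(current_ply, avg_ply, nu):
    """Create an experience only if the estimated distance-to-terminal Xd is
    within the visibility-horizon estimate (Xd < nu)."""
    estimated_distance = max(avg_ply - current_ply, 0)
    return estimated_distance < nu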

For the ith training cycle we estimate νi by increasing or decreasing νi−1, depending on how the average search-depth di has changed compared with the previous training cycle's depth di−1. We define ∆ as the difference in average search-depth between the current and previous experience-buffers, giving ∆ = di − di−1. Equation 5.1 shows the function we use to estimate ν.

νi(∆) = νi−1 + di/10   if −0.5 ≤ ∆ ≤ 0.5   (a)
        νi−1 − 1       if ∆ < −0.5          (b)        (5.1)
        νi−1 + di/2    if ∆ > 0.5           (c)

The reasons for the values in Equation 5.1 are as follows:

• Equation 5.1(a) - when the average search-depth has not changed significantly. Since ∆ is small we assume that the agent has not improved significantly since the preceding training cycle, and as such we wish to continue training with this value of ν. There is still a requirement to adjust it by some small positive amount: there is an absolute maximum depth for a search, and this maximum may be reached before ν encompasses the entire game, causing ∆ to be small despite the agent having learned all it can with this value of ν. In this case it is desirable to continue to advance the curriculum so that it eventually encompasses the entire game space, so we increment ν by a small portion of the search-depth.

• Equation 5.1(b) - when the average depth decreases. This situation may indicate that the curriculum has moved too fast, which is why ν is decreased by 1 when this occurs. We choose not to decrement by the same proportion as in (c) to ensure that, where possible, the curriculum keeps progressing toward encompassing the full game.

• Equation 5.1(c) - when the average depth increases significantly. A large increase in search-depth suggests that the agent has rapidly learned this portion of the state-space for this value of ν. We therefore increase the problem space by incrementing ν by a portion of the search-depth - in our case di/2. With this adjustment we aim to create new experiences which are at the edge of the agent's visibility-horizon. We chose di/2 simply because it is halfway between the upper and lower limits. If, after varying ν, we find that the change was too rapid, then condition (b) is expected to occur during the next training cycle, reducing ν.
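A direct transcription of Equation 5.1, with variable names following the symbols above, is sketched below.

def update_nu(nu_prev, d_i, delta):
    """Adjust the visibility-horizon estimate from the change in average
    search-depth, delta = d_i - d_(i-1)."""
    if delta > 0.5:               # (c) depth jumped: the agent learned this region quickly
        return nu_prev + d_i / 2
    if delta < -0.5:              # (b) depth fell: the curriculum moved too fast
        return nu_prev - 1
    return nu_prev + d_i / 10     # (a) little change: keep creeping toward the full game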

The chosen parameters used to adjust ν are not necessarily optimal; however, these parameters were found to improve the speed of learning when compared against an unmodified AlphaZero inspired player and the hand-crafted curriculum-player presented in Chapter 4.

Further research into how to more precisely estimate ν may result in an optimum curriculum, further speeding up training.

5.4 Evaluation

The experimental setup is the same as in Chapter 4, and uses the framework outlined in Chapter 3. We train two players side by side: the auto-curriculum-player using the specified automated curriculum, and a second player without a curriculum, the baseline-player. We train both players simultaneously on the same dual-GPU machine, allocating one GPU per player. Periodically during training a competition of at least 30 games is played against the appropriate reference-opponent and the agent's success-rate is recorded. The training pipeline of the agent results in experiences being created during self-play, with the tree-search being used to create the ground-truth. For this experiment the search-depth and the distance to the terminal state are stored with each experience to provide deeper insight into the effect of applying the curriculum. Confidence-bounds were calculated using the Wilson-Score with continuity correction, giving 95% confidence with data accumulated over the specified time interval.
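For reference, the confidence bounds can be computed as in the sketch below, which implements the standard Wilson score interval with continuity correction; the function name and the handling of the degenerate all-loss and all-win cases are our own.

import math

def wilson_interval_cc(successes, n, z=1.96):
    """Wilson score interval with continuity correction for `successes` wins
    (plus draws) out of `n` games; z = 1.96 gives 95% confidence."""
    if n == 0:
        return 0.0, 1.0
    p = successes / n
    denom = 2 * (n + z * z)
    lower = (2 * n * p + z * z - 1
             - z * math.sqrt(z * z - 2 - 1 / n + 4 * p * (n * (1 - p) + 1))) / denom
    upper = (2 * n * p + z * z + 1
             + z * math.sqrt(z * z + 2 - 1 / n + 4 * p * (n * (1 - p) - 1))) / denom
    if successes == 0:
        lower = 0.0
    if successes == n:
        upper = 1.0
    return max(0.0, lower), min(1.0, upper)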

5.5 Results

We simultaneously trained an agent using the automated end-game-first curriculum and a baseline-player using no curriculum on a particular game, then periodically conducted competitions against the reference-opponent. The success-rate for these competitions is recorded, as well as a number of performance metrics from the agent, which are plotted below. Each experiment is performed three times and the success-rate data is combined to confirm that the improved performance is observed on average, rather than just during a lucky training run. In addition to recording the success-rate of the agents we also record a number of performance values during training. We present these additional performance values individually instead of averaging them, since averaging can mask some characteristics of the automated curriculum method. Figures 5.4, 5.5 and 5.6 show the results for the games of Connect-4, Reversi and Racing-Kings of an automated-curriculum player compared with a player using no curriculum. As we have stated previously, the hand-crafted step curriculum used in Chapter 4 was chosen semi-arbitrarily, and as such we made no claim as to whether it was optimal. Comparing the automated curriculum outlined in this chapter with the potentially suboptimal hand-crafted step curriculum from Chapter 4 is not likely to yield any definitive conclusions, so we do not dwell on this point. We do perform a single experiment comparing the automated curriculum with the hand-crafted step curriculum using the game of Reversi, as shown in Figure 5.7, to demonstrate that both methods can result in similar curricula with similar playing performance. For each experiment we record the following parameters and plot them against the training time (a sketch of the per-experience record that supports these plots is given after the list):

• (a) Success-rate. The auto-curriculum-player’s rate of wins and draws against the reference-opponent in the periodic competitions.

• (b) Average search-depth. The average of each search-tree’s depth for the experiences stored in the buffer at the given time - recalling that a search-tree is built to both create the ground-truth value for training and choose an action during the game.

• (c) Maximum search-depth. The maximum depth achieved for a search-tree in the buffer at the given time.

• (d) Total experiences in buffer. The number of experiences in the buffer at the given point in time. Note that self-play halts when there are sufficient games in the buffer, regardless of the number of experiences per game. The curriculum can reduce the number of experiences per game, since experiences are only created toward the end of a game depending on the value of the curriculum-parameter ν.

• (e) Average distance from terminal. For each experience in the buffer the distance to the terminal state is recorded and this plot shows the average distance-to-terminal of all experiences in the buffer. For the curriculum-player the curriculum-parameter ν will reduce this distance, since only those experiences that are within ν of the terminal state are included in the experience-buffer.

• (f) Average Loss for experiences in buffer. The loss measures the error between the ground-truth for an experience's state and the neural-network's inference for that state.

• (g) Curriculum Distance. This plot shows the distance from terminal within which experiences will be created, i.e. the curriculum-parameter ν. The agent not using a curriculum permits all experiences to be created regardless of distance, whereas the curriculum-player only permits experiences where the distance is less than ν.
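The record sketched below is one possible shape for a stored experience that supports plots (b) to (e); the field names are our own, not those of the thesis code.

from dataclasses import dataclass
from typing import Any, Dict

@dataclass
class Experience:
    state: Any                   # board position used as the search root
    policy: Dict[int, float]     # MCTS visit distribution (ground-truth policy target)
    value: float                 # game outcome from this player's perspective
    search_depth: int            # depth of the tree built for this decision
    distance_to_terminal: int    # ply between this position and the game's end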


Figure 5.4: Player using automated end-game-first curriculum vs player with no curriculum playing Connect-4. (a) success-rate, (b) average search-depth, (c) max search-depth, (d) number of experiences stored in the buffer, (e) the average distance-to-terminal of buffer experiences, (f) the average loss for experiences in the buffer and (g) the curriculum value ν.


Figure 5.5: Player using automated end-game-first curriculum vs player with no curriculum playing Reversi. (a) success-rate, (b) average search-depth, (c) max search-depth, (d) number of experiences stored in the buffer, (e) the average distance-to-terminal of buffer experiences, (f) the average loss for experiences in the buffer and (g) the curriculum value ν. Run 3 was conducted on a system using a lower performance GPU than runs 1 and 2.


Figure 5.6: Player using automated end-game-first curriculum vs player with no curriculum playing Racing-Kings. (a) success-rate, (b) average search-depth, (c) max search-depth, (d) number of experiences stored in the buffer, (e) the average distance-to-terminal of buffer experiences, (f) the average loss for experiences in the buffer and (g) the curriculum value ν. Note that the planned buffer size was reduced to accommodate the player with no curriculum during the early stages of training. Because Racing-Kings is a game where terminal experiences are difficult to find by random play, as explained in Section 3.5.4, there is a disproportionate number of experiences skewing the initial values shown in these plots.


Figure 5.7: Player using automated end-game-first curriculum vs player with hand-crafted step curriculum playing Reversi. (a) success-rate, (b) average search-depth, (c) max search-depth, (d) number of experiences stored in the buffer, (e) the average distance-to-terminal of buffer experiences, (f) the average loss for experiences in the buffer and (g) the curriculum value ν.

5.6 Discussion

5.6.1 Comparison With Baseline-Player

Figures 5.4, 5.5 and 5.6 show the performance of a player employing our automated end-game-first curriculum relative to the unimproved baseline-player in the games of Connect-4, Reversi and Racing-Kings. Plot (a) in each figure shows the success-rate of each player against the reference-opponent as time progresses. Plots (b) to (g) show interesting metrics, as detailed below, while the agent is undergoing training. While considering these results, recall that Racing-Kings is non-convergent as explained in Section 2.2.3. With non-convergent games, terminal-states are difficult to find by random play, and an untrained agent plays almost randomly. The average number of ply in a Racing-Kings game at the start of training is substantially more than when the agent has learned the game and is playing less randomly. This characteristic causes the initial peak observed in Figure 5.6, subplots (b)-(g).

Plot (a) Success Rate of Agent with No Curriculum and Automated Cur- riculum

Plot (a) of Figures 5.4, 5.5 and 5.6 shows the success-rate over time of an automated-curriculum player and an agent with no curriculum (the baseline-player) against a reference-opponent. Three independent training runs were conducted for each experiment and the results averaged to create this plot. These plots all show that using an automated end-game-first curriculum results in faster learning than using no curriculum. The curriculum-player typically completes its first training cycle before the non-curriculum agent; the reason is that the curriculum-player makes a number of random moves, which is faster than building a tree for decision making, so the buffer (which is measured in games) fills up sooner. The confidence intervals are a little wider toward the end because weights are selected randomly during training to compete against the reference-opponent, and the final weights have less time to be selected in the competition, reducing the total number of games that the later weights are involved in.

Plot (b) Average Search-Depth of Experiences in Buffer

This plot shows the search-depth for the experiences stored in the buffer at the given point in time, averaged across all stored experiences. When an experience is created its search-depth is stored with it to enable this analysis. Each independent training run is shown individually instead of being averaged as was done for Plot (a).

Reversi and Connect-4. For the baseline-player (no curriculum) the search-depth increases with time for the games of Reversi and Connect-4, as seen in Figures 5.4 and 5.5. This increase in depth shows some correlation with the increase in playing performance observed in Plot (a). This correlation is the premise for using the relative search-depth as an indicator of playing performance. For the automated curriculum-player, the average depth varies significantly over time and does not correlate with the player's performance. This outcome is expected, recalling that the curriculum distance is adjusted depending on the change in search-depth; the curriculum is adjusted to always oppose the change in depth.

Racing-Kings. The game of Racing-Kings, shown in Figure 5.6, has a higher initial search-depth due in part to the game being non-convergent, and also because toward the end of a Racing-Kings game there are usually only a few pieces - making the decision complexity small and resulting in deep trees. The correlation between average depth and player performance is non-existent for the initial training period, as shown in these plots. This high initial average depth does not exist when employing a curriculum, because initially, due to the curriculum, search-trees are only built for positions which are close to the terminal-state, limiting the depth of some searches. Employing the curriculum eliminates this higher initial search-depth, which in turn restores some correlation between search-depth and player performance, allowing the curriculum to be automated in accordance with our method.

Plot (c) Maximum Search-Depth of Experience in Buffer

This sub-plot shows the single largest search-depth achieved whilst generating the experiences stored in the buffer at the given point in time. As with the average search-depth, this value generally increases to some maximum; however, since it is taken from a single search it is less useful as a measure of the agent's performance, as one lucky expansion can skew the results. As can be seen in Figures 5.4, 5.5 and 5.6, the maximum depth in the buffer after sufficient training can reach a relatively high proportion of the number of ply in a game. The same initial peak occurs with the game of Racing-Kings as noted in the preceding discussion of the average search-depth, showing an initial maximum depth greater than 100 ply. Given that there are 200 MCTS iterations, 100 ply represents a significant depth considering that a maximum of 200 nodes can be added to the tree. Without knowing the specifics of the search giving rise to this value, it can be envisaged to occur if both players have only one piece on the board and the untrained neural-network incorrectly favours one action, causing the tree-expansion to repeat positions.

Plot (d) Total Experiences in Buffer

This plot shows the number of individual experiences held in the buffer at a given point in time. Recall that the criterion for halting self-play is the number of games in the buffer, not the number of experiences; as such the number of experiences differs from player to player and between training runs depending on the length of the games. An experience is created for every ply within a game for the baseline-player, whereas for the curriculum-player experiences are created only for those actions which fall within the curriculum distance. It was demonstrated in Chapter 4 that the end-game-first curriculum improves the efficiency of learning, as the agent learns from fewer experiences than the baseline-player, and this plot highlights that characteristic of end-game-first learning: despite using significantly fewer experiences than the baseline-player, the curriculum-player learns faster. An ancillary benefit of the end-game-first curriculum is that the size of the buffer is kept relatively small for games where terminal states are difficult to find via playout. The first self-play batch conducted immediately after initialisation in the game of Racing-Kings resulted in the average number of ply per game being in excess of 100, with some games reaching more than 300 ply. When designing this Racing-Kings experiment the buffer size was planned to be 3000 games; however, the number of experiences generated by the baseline-player exceeded the memory capacity of the system, so the value was set to 2000 games. Figure 5.6(d) shows that the initial size of the buffer for the baseline-player was approximately 60,000 experiences compared with just 10,000 for the curriculum-player. The number of experiences in the buffer for the baseline-player then reduces rapidly because the agent has begun to learn and is guiding the search more accurately.

Plot (e) Average Distance from Terminal

This plot shows the average distance to an experience's terminal state for the buffer at the given point in time. Some games, like Reversi, have nearly the same number of ply in every game, while others, like Connect-4 and Racing-Kings, vary; on average the value remains relatively constant for the baseline-player. For the curriculum-player, however, this plot shows how, over time, the experiences in the buffer are further and further from a terminal state. A portion of the oldest experiences is removed from the buffer after each training cycle and replaced with new ones, and for the curriculum-player the new experiences are further from terminal with each training cycle. Again the game of Racing-Kings causes this plot to appear anomalous during the early stages of training for the reasons explained previously; however, as the agent learns, games become shorter, avoiding unnecessary repetition.

Plot (f) Average Loss for Experiences in Buffer

This plot shows the average policy loss for all experiences in the buffer at the given point in time, computed as shown in Equation 5.2. As expected, the loss diminishes as time progresses for the baseline-player, indicating, as in most neural-network approaches, that the agent is learning. However, the curriculum-player shows an increasing loss for the game of Reversi in Figure 5.5. By enforcing a curriculum, the complexity of the problem space changes with each change in curriculum, which can alter the loss over time. The curriculum aims to progressively increase the complexity of the problem as the agent learns, so an artefact of end-game-first curriculum-learning is that the absolute loss may not decrease with time. Given sufficient time, when the agent is learning from the full game, the loss will then begin to decrease.

l = −π(s) · log(pθ(s))    (5.2)

where:

π(s) = Policy from the MCTS.

pθ(s) = Neural-network policy inference.
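As an illustration of Equation 5.2, the following minimal sketch computes the buffer-averaged policy loss. The Experience fields and the network.predict interface are hypothetical stand-ins rather than the thesis implementation, and the negative sign follows the usual cross-entropy convention.

```python
import numpy as np

def policy_loss(pi, p_theta, eps=1e-12):
    # Cross-entropy between the MCTS policy pi(s) and the network policy p_theta(s).
    return -float(np.sum(pi * np.log(p_theta + eps)))

def average_buffer_loss(buffer, network):
    # Average policy loss over every experience currently held in the buffer.
    losses = []
    for exp in buffer:                        # exp.state and exp.pi are hypothetical fields
        p_theta, _value = network.predict(exp.state)
        losses.append(policy_loss(exp.pi, p_theta))
    return sum(losses) / len(losses)
```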

Plot (g) Curriculum Distance from Terminal for Experience to be added to Buffer

This plot represents the permissible distance-to-terminal ν for experiences to be placed into the buffer. Experiences which have a distance-to-terminal less than ν are stored in the buffer. This value is not plotted for the baseline-player as it would simply equal the number of ply in a game and offer no insight.
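The admission rule can be summarised in a few lines; the sketch below assumes a hypothetical experience object exposing a distance_to_terminal field and is illustrative only.

```python
def admit_to_buffer(experience, buffer, nu):
    # Store an experience only if it lies within the current curriculum window nu.
    if experience.distance_to_terminal < nu:
        buffer.append(experience)
        return True
    return False
```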

5.6.2 Comparison with Handcrafted Curriculum

The previous section demonstrated that the curriculum can be set using the relative search-depth during training and that this method is effective in reducing learning times in all tested games when compared against a standard neural-network/tree-search agent not using any curriculum. In Chapter 4 we observed an improvement by employing a hand-crafted curriculum. For completeness we performed an experiment comparing the performance of the automated curriculum-player against the hand-crafted player used in the previous chapter for the game of Reversi. Figure 5.7 shows that the automated curriculum performs as well as the hand-crafted curriculum-player. Figure 5.7(g) shows the two curricula as they were applied to the game playing agent, and it can be seen that both progress their curriculum at similar rates - one continuously, the other in steps. With such similar curricula we would expect to see similar performance - and this is what we have observed.

5.7 Summary

We demonstrated in the previous chapter that employing an end-game-first curriculum improves the rate of learning of a neural-network/tree-search agent employing reinforcement-learning. Our work in the previous chapter utilised hand-crafted curricula, and as such required prior knowledge of the game to ensure the curricula progressed at a suitable rate. The hand-crafted method was effective, however the reinforcement-learning algorithm used throughout this thesis is otherwise relatively general. We have proposed in this chapter a new automated end-game-first curriculum which is also general. The automated curriculum we have used is not necessarily optimal, however in all tested cases it outperformed not using a curriculum, and performed equally with the hand-crafted curriculum in the game of Reversi. Using our method to automate the end-game-first curriculum improves the rate of learning for a neural-network/tree-search agent. In the outline of this chapter it was noted that setting the curriculum at the visibility-horizon would likely produce the optimum curriculum progression, however it was also noted that calculating the visibility-horizon is likely to be intractable. Our method for estimating ν is relatively simple and is guided by the search-depth when assessing actions (which is itself a measure of the agent's learning).

This thesis so far has focussed on limiting the creation of uninformed experiences, and we have achieved success by recognising that the early training epochs contain the most uninformed experiences. At the extreme of this perspective is the initial epoch, which has the most uninformed experiences. In the next chapter we focus specifically on the initial training epochs and ways to improve the efficiency of training by investigating methods which can be used to prime the neural-network's learning.

Chapter 6

Priming the Neural-Network

Highlights

• The concept of priming a neural-network is established.
• Terminal nodes can be used to create experiences which are completely informed.
• During the course of a search to determine which action to take, a number of terminal nodes may be visited.
• Retaining all terminal experiences during the initial training cycles improves learning.

6.1 Introduction

The Macquarie Dictionary (2002) definition of priming is “to prepare or make ready”. In this section we demonstrate two methods which improve learning by modifying the training pipeline for the early training cycles. Recall from Chapter 3 that the ground-truth of an experience is created by conducting a tree-search and that the values for non-terminal leaf-nodes are obtained by using the neural-network that is being trained. In Chapter 5 we demonstrated that for a neural-network/tree-search agent most of the early experiences are uninformed when an end-game-first curriculum is not employed. An experience is uninformed if it has insufficient feedback from the environment to make an accurate prediction. In Chapters 4 and 5 we found that:

• For the initial training cycle, any experience created without the search reaching a terminal-node is uninformed and we have zero confidence in its usefulness in training.
• For subsequent training cycles experiences created without the search reaching an informed node are also completely uninformed.

These findings mean that without applying an end-game-first curriculum, much of the computing resources are wasted generating uninformed experiences during the early training cycles. This issue is particularly pronounced in the very first training cycle because an experience will be uninformed unless the tree-search used to create it includes a terminal-node. The problem diminishes with each training cycle as the agent learns, because after the first training cycle a tree-search which visits an informed state may lead to informed experiences. Experiences which are the furthest away from a terminal-node are more likely to be poorly informed.

Experiences created from the early board positions of a game are effectively random during the initial training cycles, however these experiences eventually become informed as more training cycles are conducted. The early stages of training result in a high proportion of uninformed experiences, and our motivation in this chapter is to counter this weakness. We present two new methods which vary the training methodology for the initial training cycles - where the most uninformed experiences are created. The two methods are named:

• terminal priming; and
• incremental-buffer priming.

With terminal priming we propose a variation to the existing method in which we add experiences for all terminal-states that are discovered during training. Our discussions in the previous chapter have already established that terminal states are directly informed, meaning we can be confident that the generated experience will be accurate. The incremental-buffering method shortens the time taken for the earliest training cycles by starting with a smaller buffer-size and incrementally increasing it. Whilst incrementally increasing the buffer-size may appear simplistic, to our knowledge the approach has not been covered in the literature, and there are additional risks of overfitting which need analysis before confirming it to be an effective method.

6.2 Terminal Priming

6.2.1 Method

In previous chapters we highlighted the importance of generating quality experiences, and demonstrated how learning commences from the terminal-states. The efficiency with which an agent creates and utilises experiences is a key factor in how fast an agent learns. In Section 5.2.1 we found that terminal nodes are directly informed, which means that unlike non-terminal nodes, which need to be visited a number of times to estimate their value, the precise value of a terminal-node can be obtained from a single visit. The result of this observation is that the state-action pair which leads to that terminal-node can be used to create an experience which is completely accurate from this single visit. When a search is conducted to determine which move to make for a board position, the root-node of the search is the current board position. The experience which is created is based on that root-node after the search is complete. What is overlooked in the current approach is that terminal nodes may be visited during the search. Actions which lead to terminal nodes can be used to create experiences with perfect information about the environment, however the current practice is to ignore this information. In this section we demonstrate how retaining these experiences can enhance learning.

Recall that the agent used in this thesis has a two-tailed output: policy pθ(s) and value vθ(s) for an input state s. The neural-network is trained using experiences X := {s, π(s), z(s)}, where π(s) is the policy derived from the tree-search and z(s) is the reward from the environment for the game's terminal-state, as explained in Section 3.3.2. When an explored action leads to a winning terminal-state during the search, the value of the action which gives rise to the terminal-state s_{t+1} can be determined exactly, since it is equal to the reward obtained directly from the environment: v(s_t, a_t) ← z(s_{t+1}). If the next state s_{t+1} is a terminal-state and is a win for the player, then to build the ground-truth experience we set z(s_t) = 1 and the policy π(s_t) is set to always choose the winning action, i.e. the index in the policy which represents the winning action is set to 1 and the remaining actions to 0. Conversely, when an explored action results in a losing terminal-state we want to avoid selecting that action, so the index in the policy representing the losing action becomes 0 and z(s_t) = −1. The policy is then re-normalised to ensure that it sums to 1, as required by the definition of a policy. We use these substituted values for π(s_t) and z(s_t) and save these experiences to the buffer. These experiences can be created with very little additional overhead, since the information has already been discovered as part of the existing search.

There is an additional experience which we can also extract using this concept. If a winning action is available to a player from state s_t, then we can declare that the opponent's previous move at s_{t−1} was a weak one since it gave the player a winning action. To create an experience using this insight the experience needs to be from the opponent's perspective, since the opponent loses, which gives z(s_{t−1}) = −1; and for the policy π(s_{t−1}), the element representing that action becomes 0 (never selected) and the policy is re-normalised. An additional experience cannot be created for the opponent if the player's terminal-state is a losing one, unless all actions lead to a loss, because it is possible that the player overlooked a winning action. We could potentially create this additional s_{t−1} experience for a losing terminal-state if and only if the search were extended to confirm that all actions were losing; however, there is no guarantee that the additional search would result in a conclusive terminal-state, so we do not attempt to create an experience in this case.

By applying terminal priming to create additional experiences we are potentially introducing a large number of terminal-states, and when we combine this with the smaller state-space-size caused by applying a curriculum we have an added risk of overfitting to the smaller state-space. This problem can be exacerbated if the terminal-value calculation is trivial, as in the game of Reversi, recalling that for Reversi a winner is determined simply by the difference between the number of pieces for each player on the board. In the case of Reversi these additional terminal experiences add more examples for a trivial part of the problem space. Likewise the game of Racing-Kings has a simple terminal-value-function, since a terminal-state occurs when the King is on the final row. Despite being a solved game, Connect-4 has a more complicated terminal-value-function since winning depends on the spatial relationship between pieces. For these reasons we only add these additional terminal experiences for the very first training cycle. We view the problem of overfitting during the first training cycle to be less critical when we consider that the weights are initially randomised and that the existing approach is to learn from a large proportion of uninformed experiences.
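A minimal sketch of the construction described above is given below. The data structures (the mcts_policy array and the terminal_results mapping from explored action index to win/loss outcome) are hypothetical stand-ins for the thesis implementation, and the opponent-perspective experience for s_{t−1} is omitted for brevity.

```python
import numpy as np

def terminal_primed_experiences(state, mcts_policy, terminal_results):
    # terminal_results maps an explored action index to +1 (winning terminal state
    # for the player to move) or -1 (losing terminal state).
    experiences = []
    for action, result in terminal_results.items():
        if result == +1:
            # Winning action: the policy always selects it and the value is the win reward.
            pi = np.zeros_like(mcts_policy)
            pi[action] = 1.0
            experiences.append((state, pi, +1.0))
        elif result == -1:
            # Losing action: never select it; zero that entry and re-normalise the policy.
            pi = np.array(mcts_policy, dtype=float)
            pi[action] = 0.0
            if pi.sum() > 0.0:
                pi /= pi.sum()
                experiences.append((state, pi, -1.0))
    return experiences
```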
This approach of adding additional terminal-states could potentially be applied for the duration of training; however, in our opinion, once these terminal-states are learned the focus of training should be the earlier stages of the game. This experiment employs a similar method to the preceding chapters, where we compare the performance of two agents simultaneously:

• an auto-curriculum-player using the methodology from Chapter 5; and
• an auto-curriculum-player utilising the terminal priming method covered in this section, which we call the terminal priming player.

We conduct periodic competitions for each player against a reference-opponent and record the success-rate to determine the player’s performance. We also record key metrics throughout the training of each player to enable further insights.

6.2.2 Results

Figures 6.1, 6.2 and 6.3 show the performance of a terminal priming player and an auto-curriculum-player for the games of Connect-4, Reversi and Racing-Kings respectively. Plot (a) shows the average success-rate against the reference-opponent for three separate training runs, and plots (b) - (g) show internal characteristics of the agent for each training run as learning progresses.

6.2.3 Discussion

Performance

The results show that for Connect-4, in Figure 6.1(a), terminal priming makes a significant difference to agent learning; however for Reversi and Racing-Kings, Figures 6.2(a) and 6.3(a), the performance is similar. We expected that the terminal priming method would be better suited to games with complex terminal-value-functions like Connect-4. For games with trivial terminal-value-functions, the fact that in the long run the performance of the terminal priming player is similar to that of the player without terminal priming means that this method can be used in general applications and not just in situations where the complexity of the terminal-value-function is known in advance. For Reversi and Racing-Kings we observe an initial slow-down of learning, as indicated by the first data-point in their success plots. Recall that both Racing-Kings and Reversi have trivial terminal-value-functions; adding additional terminal-experiences therefore increases the number of experiences to be processed during training without necessarily improving the learning outcome, resulting in a slowing down of the early training cycles. For the game of Racing-Kings, Figure 6.3(a), it appears that using terminal priming might be superior in the long-term, overcoming the initial slow-down; however the confidence-bounds still overlap slightly at the end of the experiment, meaning that we cannot conclusively declare that terminal priming is better from our experiment. The Racing-Kings result may indicate that priming the network with terminal experiences improves training over the long-term for this game, however further work is required before that conclusion can be made. Instead we can conclude that terminal priming is an effective method where the game has complex terminal-value calculations and is at least neutral for games which have simple terminal-value calculations.

Internal Metrics

The remaining plots (b) - (g) provide metrics derived from the experience-buffer as learning progresses for each of the individual training runs.

Reversi, Figure 6.2(b) - (g). We focus on the game of Reversi for this discussion because the game effectively has a fixed number of moves. For this experiment we also continued training for longer than usual in order to observe the loss once the curriculum finished (i.e. once the agent was learning from the entire game). As with the previous chapter, the search-depth plot (b) appears noisy, resulting from the application of the automated curriculum; however at around 5000 minutes the large variations in search-depth reduce and eventually become an increasing gradient for runs 1 and 3. The length of a game of Reversi is almost always 60 moves - with very few games taking fewer moves. When we consider the curriculum distance plot (g) we see that at around 5000 minutes the curriculum distance exceeds the typical game length for runs 1 and 3 - meaning that the curriculum incorporates the entire game from this point onwards, i.e. it has finished. This means that the state-space size for newly created experiences is constant from 5000 minutes for runs 1 and 3, and it also means that the curriculum no longer impacts the creation of experiences. Creating an experience for every move in a game means that the average distance-to-terminal for the total experiences will be half the number of moves in a typical game, so for Reversi this will be a distance of approximately 30 moves. Plot (e) shows that at around 12000 minutes the average distance-to-terminal for the experience-buffer is 30, recognising that it takes a number of training cycles to fully replace the buffer at 20% replacement per cycle. This experiment was extended to observe how the loss changes after the curriculum has finished, and in the previous chapter we stated that we would expect to see the loss decrease when the curriculum finished. As with previous observations we see the loss increase as the curriculum increases, and once the curriculum is finished (5000 minutes) we see the loss decreasing.

Total experiences in the buffer, plot (d). We note that for the terminal priming player learning the games of Racing-Kings and Connect-4 (Figures 6.3(d) and 6.1(d)) the number of experiences is initially high; however this is not the case with the game of Reversi (Figure 6.2(d)). The nature of Reversi is that the board has to be nearly full before the game ends. Terminal-states in Reversi will only be found after having played nearly all of the moves, particularly if the search-depth is shallow, whilst in Racing-Kings and Connect-4 terminal states can be found after a smaller number of moves. The consequence of this characteristic of Reversi is that more terminal-states are found and added for Connect-4 and Racing-Kings than for Reversi.

Figure 6.1: Connect-4. Auto-curriculum player with and without terminal priming. (a) success rate vs reference-opponent, (b) average search-depth, (c) max search-depth, (d) number of experiences stored in the buffer, (e) the average distance-to-terminal of buffer experiences, (f) the average loss for experiences in the buffer and (g) the curriculum value ν. Run 2 was performed on a faster GPU than runs 1 and 3.

Figure 6.2: Reversi. Auto-curriculum player with and without terminal priming. (a) success rate vs reference-opponent, (b) average search-depth, (c) max search-depth, (d) number of experiences stored in the buffer, (e) the average distance-to-terminal of buffer experiences, (f) the average loss for experiences in the buffer and (g) the curriculum value ν. Run 2 was performed on a slower GPU than runs 1 and 3.

Figure 6.3: Racing-Kings. Auto-curriculum player with and without terminal priming. (a) success rate vs reference-opponent, (b) average search-depth, (c) max search-depth, (d) number of experiences stored in the buffer, (e) the average distance-to-terminal of buffer experiences, (f) the average loss for experiences in the buffer and (g) the curriculum value ν. Run 2 was performed on a slower GPU than runs 1 and 3.

6.3 Incremental-Buffering

6.3.1 Method

The experience-buffer stores all experiences which are created during self-play and is used to train the neural-network as described in Chapter 3. The size of the experience-buffer is constrained by the computing resources available, however it needs to be large enough to represent the diversity of the game, otherwise there is a risk that the neural-network will overfit to a smaller portion of the game. As explained in Chapter 3, experiences are created and saved for each move made during a self-play game, and we measure the size of the buffer in games played rather than experiences. A small number of games in the buffer heightens the risk of overfitting since experiences from the same game are related and usually highly correlated. This is why the number of games in the experience-buffer, rather than the number of individual experiences, is used as the precondition for starting training, i.e. to ensure a level of diversity. We provide no analysis with respect to how to set the maximum experience-buffer size, except to say that it should be, in general, as large as the computing resources permit. The largest possible buffer size ensures the largest diversity of board positions for the agent to learn from. In our incremental-buffering method we relax our concern about overfitting for the initial training cycles and start with a smaller buffer-size, increasing it with each training cycle until reaching the maximum buffer-size. Whilst we accept that some overfitting may occur, the smaller buffer-size is still sufficiently large to ensure diversity of experiences. The benefit is that the initial buffer-size is a fraction of the maximum size, which results in the initial training cycles completing faster. The parameters used to increase the buffer are shown in Table 6.1. As discussed in Section 6.2, we hypothesise that overfitting for the early training cycles will be no more problematic than the existing wasted resources of generating and learning from uninformed experiences. Whilst we relax our concern about overfitting it cannot be fully disregarded, and as such we ensure that the minimum buffer size still retains some level of diversity across a number of games. The conduct of this experiment is consistent with the preceding chapters, where we compare the performance of two agents simultaneously:

• an auto-curriculum-player using the methodology from Chapter 5 as the base-line player; and
• an auto-curriculum-player utilising the incremental-buffer priming method covered in this section, which we call the incremental-buffer player.

We conduct periodic competitions for each player against a reference-opponent and record the success-rate to determine the player's performance.

6.3.2 Results

Figure 6.4 shows the effect of employing incremental-buffering in the game of Reversi while Figure 6.5 shows the effect while playing Racing-Kings.

Game           Initial Buffer Size (n games)   Multiplier   Maximum Buffer Size (n games)
Reversi        500                             1.5          2000
Racing-Kings   500                             2            4000

Table 6.1: The incremental-buffer priming parameters. The multiplier is applied each training cycle to the buffer size until it reaches the maximum buffer size.
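The growth schedule implied by Table 6.1 can be written as a short generator; this is a hedged sketch rather than the thesis code, with the Reversi parameters used purely as an example.

```python
def buffer_schedule(initial_size, multiplier, max_size):
    # Yield the buffer size (in games) for successive training cycles,
    # multiplying each cycle until the maximum buffer size is reached.
    size = initial_size
    while True:
        yield min(int(size), max_size)
        size *= multiplier

# Reversi parameters from Table 6.1: 500, 750, 1125, 1687, 2000, 2000, ...
sizes = buffer_schedule(500, 1.5, 2000)
first_cycles = [next(sizes) for _ in range(6)]
```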

We have not included Connect-4 in this experiment as the time it takes an agent to progress through the early training-cycles is already fast; as such we were not expecting the Connect-4 experiment to be very informative. The results show that using incremental-buffer priming can improve the speed of learning. Whilst these results show a benefit in using incremental-buffer priming, initial experiments which used smaller initial buffer sizes or smaller multipliers showed improvement over the short-term, then a period of un-learning. Importantly, in Figure 6.5 the agent not using an incremental-buffer shows its first data-point occurring after 2000 minutes while the agent using an incremental-buffer has its first data-point much earlier. This difference in the time of the first data-point shown in Figure 6.5 is the key benefit of using incremental-buffering. The difference also occurs in Figure 6.4, although it is less pronounced, in part due to the aggregation of data-points every 1000 minutes to obtain the confidence-bounds, but also resulting from the combination of the chosen parameters in Table 6.1 and the convergent nature of the game.

6.3.3 Discussion

Incremental-buffer priming is a simple method which can be employed to speed up training by spending less time in the phase where the agent is poorly informed. Recall our discussion about the visibility-horizon1 and how it propagates backwards from terminal-states with each training cycle, and that a training cycle includes sufficient self-play games to fill the buffer, concluding with training the agent on the experiences in the buffer. By using a smaller buffer-size, self-play finishes earlier since it creates fewer experiences, and training finishes earlier since it has fewer experiences to train on. With each completed training cycle the visibility-horizon progresses. The prevailing wisdom for preventing overfitting is that the buffer-size should be as large as possible, contrary to this approach. However, as we have stated previously, we believe it is acceptable to risk overfitting for the early training cycles given that the current state-of-the-art approach uses uninformed experiences. By priming the agent with a smaller number of experiences the result is an improvement in the speed of learning. Figure 6.5 clearly shows the first data-point of the non-incremental-buffer player (base-line player) trailing the incremental-buffer player by over 1000 minutes, yet in the Reversi experiment in Figure 6.4 the difference is not as clear.

1 The visibility-horizon is the measure of the distance from a terminal-state where experiences are useful for training.

Figure 6.4: Reversi. Auto-curriculum-player with and without incremental-buffer priming.

Figure 6.5: Racing-Kings. Auto-curriculum-player with and without incremental-buffer priming.

Table 6.1 shows the parameters used to increase the buffer; for Reversi the minimum buffer size is 500/2000 = 1/4 of the maximum buffer size. This means that for the first training cycle the incremental-buffer player plays 500 games while the base-line player plays 2000 games. The game of Reversi is convergent and almost always has a game length of 60 ply. Both players employ the auto-curriculum and the minimum curriculum distance was set at 4 ply, which means that for each game only 4 plies are played. This means that for the very first training cycle 4 × 2000 = 8000 experiences are created for the base-line player versus 4 × 500 = 2000 experiences for the incremental-buffer player.

The Racing-Kings result in Figure 6.5 differs from the Reversi result because Racing-Kings has an unpredictable game length due to its non-convergent nature and, in part, because of the parameters used for incrementing the buffer. Table 6.1 shows the ratio between minimum and maximum buffer-size as 500/4000 = 1/8 for the Racing-Kings experiment. The larger maximum buffer-size creates a more pronounced difference between the two players because the player with incremental buffering completes its initial training cycles much earlier. The implementation of the curriculum is to play a random number of moves until the game is the curriculum-distance from the expected game length - however in the first training cycle accurate game-length information does not exist. In practice, because random moves do not necessarily lead to terminal-states, the game is usually played randomly until an end-game is found and then rewound by the curriculum-distance (initially 4 ply). Once the game is rewound there is no guarantee that the game is still near the end, since both players will be building a search-tree to inform their decisions going forward, making the impact of the additional experiences more of an issue for Racing-Kings than for Reversi. If the maximum buffer-size is large then the benefits of applying this method were found to be even more pronounced.

There is a real risk of overfitting using this method, however provided it is constrained to the initial training cycles the agent using incremental-buffer priming progressed its learning faster. A buffer of 500 games was the minimum size which we found worked for these games, however we observed during some experimental runs that the rate of progression of the buffer-size was a critical factor. We observed in some experiments that when the buffer-size progressed too slowly there would be an immediate improvement in success-rate, only to be followed by a learning loss at later stages of training that took longer to recover from. From our experiments, it seems almost inconsequential how small the initial buffer-size is, provided this small size is not retained for many training cycles. The larger the maximum buffer-size is, the larger the minimum buffer-size can be while still expediting the early stages of learning. Priming the network by incrementally increasing the buffer size is an effective method provided consideration is given to the additional risk of overfitting to a smaller part of the state-space. The gain obtained from the incremental-buffer method outweighs the risk of overfitting during the early training periods provided the size of the buffer results in a sufficiently diverse set of experiences.
Combining both methods is likely to result in further improvement in the speed of learning during early training cycles, however we leave this experiment for future work.

6.4 Summary

The methods presented in this chapter can be used to prime the neural-network and improve learning outcomes when the agent is first learning. The predominant view in machine-learning is to avoid training on narrow portions of the state-space, however as we have shown in this chapter this concern can be relaxed to some degree when dealing with the initial training cycles of a neural-network/tree-search agent. Our hypothesis that the initial training cycles are mostly comprised of uninformed experiences led to our view that the earliest training cycles can be treated differently despite the increased risk of overfitting. We have found that a buffer containing a disproportionate number of terminal experiences, or a smaller number of experiences overall, does not undermine learning when priming the network. The terminal priming method increases the number of well-informed experiences available to train the agent, however these experiences are all terminal, narrowing the diversity of the experience-buffer to predominantly end-games. The incremental-buffer priming method narrows the diversity of the experience-buffer by playing fewer games, resulting in the buffer potentially representing only a small fraction of the possible state-space. The end-game-first curriculum also narrows the diversity of experiences in the buffer and, when combined with these priming methods, creates additional risks of overfitting. We have found that overfitting is not a major issue for a neural-network/tree-search agent learning by reinforcement-learning provided the small state-space is short lived and used only for the early training cycles. We have shown that an agent's learning benefits from reducing the time it spends in the initial training cycles by employing a rapidly changing incremental-buffer. We have also found that incorporating additional terminal-states found during the extant tree-search can provide a measurable improvement in learning for games where the terminal-value calculation is complex. Where the terminal-value-function is not complex this method does not significantly impede learning.

Chapter 7

Maximising Experiences

Highlights

• Early-stopping methods can be used to extract the maximum learning value from a generated experience.
• Employment of early-stopping methods can reduce the need for hyper-parameter tuning - training epochs and experience retention.
• Spatial-dropout is a promising method for future research.

7.1 Introduction

Creating a well-informed ground-truth experience using tree-search is not guaranteed, especially if the experience is based on a board position which is a long way from an informed node1. By implementing an end-game-first curriculum we are attempting to determine, for any board position, whether an informed experience will be generated by performing a tree-search, rather than blindly creating experiences for every possible board position. We have established that there is some optimum curriculum which, if known, would only permit a search to be conducted for board positions that result in informed experiences, i.e. experiences that are useful for training the agent. For any created experience there exists some measure of the knowledge which it has acquired, and in Chapter 5 we quantified this in crude confidence terms based on an experience's distance from a terminal state. Ideally we would allocate more time to training the agent on experiences which are highly informed, less time on experiences which are on the threshold of being informed, and no time on uninformed experiences. The problem with this approach is that when using a neural-network/tree-search agent which learns via reinforcement-learning, informed experiences are only a small subset of the full state-space during the early training cycles. Presenting only a small subset of the state-space during training creates a risk of the agent learning only that small subset of the problem space, i.e. overfitting. We can reduce the risk of overfitting by ensuring that the curriculum is adjusted rapidly or by performing only a small number of training epochs per training cycle, however the risk still remains.

1 See Chapter 5 for a full discussion on how informed the agent's decision relating to a board position is.


The end-game-first curriculum approach we have presented in this thesis mitigates the risk of overfitting by seeking to progress in line with informed experiences and the expectation that the excluded experiences were not a true reflection of the environment. For some early experiments conducted in Chapter 4 we did observe overfitting using the approaches in this thesis, where the number of epochs per training cycle was too large and where the size of the experience-buffer was too small. We have established that each experience contains some level of knowledge about the environment, and it stands to reason that the more knowledgeable the experience is, the more training we would like to use it for. This introduces the concept of varying the number of training epochs based on the training value of the experiences, since conducting more epochs will utilise the experiences more. A method called early-stopping is commonly used in supervised-learning problems; it varies the number of epochs depending on the agent's learning progress by using the loss function to inform whether to halt or continue training. Early-stopping is commonly employed in the supervised-learning literature but is rarely used in reinforcement-learning, and Bengio's practical advice regarding early-stopping methods only mentions supervised and unsupervised-learning problems [158]. We demonstrate that early-stopping methods can be used during the training phase of the reinforcement-learning training pipeline described in this thesis, and that doing so results in enhanced learning for our agent. As far as we are aware this is a novel application of the existing early-stopping method.

The neural-network's residual-network contains the bulk of the trainable weights in the CNN. The neural-network architecture diagram from Chapter 3 is shown again here for convenience in Figure 7.1. To reduce overfitting, dropout is commonly used in fully connected network layers, however the use of dropout in convolutional layers is relatively new. As discussed in Section 2.4.4, spatial-dropout [122], which removes a portion of the CNN filters rather than simply the trainable weights, can be used for convolutional layers, and we perform a preliminary experiment where we investigate its effects on an end-game-first curriculum-learning agent. When using dropout a constant proportion of the network is dropped out, and for fully connected networks Srivastava et al. found that, for their experiments, either 20% or 50% dropout was effective [124]. By employing dropout fewer of the weights are changed with each back-propagation step, therefore on average the rate of change of the weights is lower when compared with not using dropout. We have already identified that our agents avoid overfitting with suitable parameter selection, however we are interested in whether dropout can provide additional benefits to learning.

Figure 7.1: Architecture for the AI players with input s from the environment and outputs pθ(s) and vθ(s). Recall that s is a comprehensive description of the game state including board description.

Our presented method of employing spatial-dropout differs from the existing literature in that we vary the amount of dropout based on the known portion of the state-space which is being used for training, as dictated by the curriculum-parameter. For a curriculum-player we set a maximum dropout level of 50%, and when the curriculum admits more than 50% of a game's positions for experience creation we proportionally decrease the dropout value. The amount of dropout is another hyper-parameter requiring human knowledge, as discussed in Section 2.4.4. However, when applying dropout to directly address an end-game-first curriculum this hyper-parameter can be calculated. Although the techniques covered in this section already exist in the literature for supervised-learning, their use as part of a neural-network/tree-search agent employing an end-game-first curriculum is as yet unexplored. In Section 2.4.4 we discuss other regularisation methods which could also be used to prevent overfitting. The agent incorporates L2 regularisation in its loss function; however, it may benefit instead from using weight decay methods. Both L2 regularisation and weight decay methods include a hyperparameter indicating the magnitude of regularisation to apply. It is possible that varying these hyperparameters as the curriculum changes would also mitigate any additional risk of overfitting introduced by the methods in this thesis. We leave this investigation as future work.

7.2 Early Stopping For Reinforcement-Learning

7.2.1 Method

The back-propagation algorithm adjusts the weights of the neural-network based on the experiences that are stored in the experience-buffer. One epoch is complete when all of the experiences have been processed by the back-propagation algorithm, and for effective training multiple epochs are usually conducted. We conducted a brief experiment which found that 20 epochs was an effective number of training epochs for our agent, as outlined in Figure 3.4 of Chapter 3. Figure 3.4 is reproduced below for convenience as Figure 7.2. We can see in Figure 7.2 that additional epochs do not necessarily improve the learning outcome, and with too many epochs learning can be significantly hampered. It is highly likely that the optimum number of epochs differs from one training cycle to the next. For supervised-learning problems, where all the ground-truth data exists prior to training, a key consideration is how many epochs should be conducted. One approach would be to create a hyper-parameter for the number of epochs and perform multiple training runs to test the performance of different values. To determine the best agent, a portion of data is held back from training; at the conclusion of training, the best agent is the one that most closely predicts this held-back ground-truth dataset. This approach is inefficient as it requires the training of multiple agents. The method of early-stopping allows a single agent's training to be halted when the error increases from one epoch to the next. Prechelt, in Early Stopping - But When? [123, Chapter 2], describes the early-stopping process with these four steps:

Figure 7.2: Reproduced from Figure 3.4 for convenience. Determining suitable epoch and replacement rate parameters. We compare the performance of an agent using 20 epochs and 20% replacement (dotted lines) against agents using 100 epochs with 20% replacement and 20 epochs with 100% replacement of experiences. Note that the red plots were performed on the same system simultaneously, and the green plots were performed together on the same system. We can see that 20 epochs and 20% replacement outperforms the other parameter values. To obtain this plot two experiments were run; both use 20 epochs and a 20% replacement rate as a fixed reference point alongside the alternative set of parameters. This is done to mitigate any variations arising due to processor load. The colours indicate that they were run simultaneously.

1. Split the training data into a training set and a validation set.

2. Train only on the training set and evaluate the per-example error on the validation set every n epochs.

3. Stop training as soon as the error on the validation set is higher than it was the last time it was checked.

4. Use the previous weights as the final agent.

Bengio [158] provided an update to the early-stopping method, suggesting that the condition for stopping should be when the change in loss from one epoch to the next, ∆L_e, is below some threshold for a consecutive number of epochs, governed by a patience parameter p; i.e. stop when ∆L_e < T for p consecutive epochs, where ∆L_e = L_e − L_{e−1}, e is the epoch number, T is the loss threshold parameter and p is the patience parameter.

Halting training based on ∆L_e < T, instead of Prechelt's suggestion of ∆L_e > 0, means that the latest weights can be used for the final agent instead of the second-last weights. The patience parameter p provides a mechanism to overcome temporary learning resistance during back-propagation, commonly known as local minima, improving the learning outcome for the network. Our reinforcement-learning agent trains itself over numerous training cycles; for every training cycle some number of epochs must be performed to adjust the weights, and by using early-stopping the number of epochs can vary. It seems apparent that the optimum number of epochs would be different for each training cycle of a reinforcement-learning agent, however despite this the current approach is to use a fixed number of epochs. In this experiment we train an auto-curriculum-player using early-stopping methods (the early-stopping player) and an auto-curriculum-player without early-stopping (the curriculum-player) to observe the effects. Prior to training the early-stopping agent we hold back the last 10% of experiences that were created and use these experiences as the validation set. The remaining 90% of experiences are used to train the neural-network. We use the most recently created experiences as the validation set because they are likely to be the most informed experiences, and they will not have been previously seen by the neural-network. Even if a board position has been seen by the network, the ground-truth values will be different due to the updated network.

We allow the agent to train for up to 100 epochs with the stopping condition being ∆L_e < 0.1 for p = 10, i.e. T = 0.1 for 10 consecutive epochs. For one experiment we use a patience parameter of p = 20 epochs. These parameters were determined by running a short test experiment like the one performed to determine the replacement rate. Like many of the hyper-parameters for the neural-network agent, these parameters were selected semi-arbitrarily based on experience and preliminary tests. One of our findings is that our parameter selections in this case are possibly sub-optimal, and we discuss possible future work to automate the control of these parameters, since a fixed value is not likely to be optimal for every training cycle for the same reasons that a fixed curriculum is not suitable. A larger loss threshold was found to decrease the number of epochs performed while a smaller threshold increased the number of epochs.
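The stopping rule can be expressed as a short training loop. The sketch below is illustrative only: network.fit_one_epoch and network.evaluate_loss are hypothetical stand-ins for one back-propagation pass and a validation-loss evaluation, and the epoch-to-epoch change in loss is taken in magnitude.

```python
def train_with_early_stopping(network, train_set, val_set,
                              max_epochs=100, loss_threshold=0.1, patience=10):
    # Train until the change in validation loss stays below loss_threshold (T)
    # for `patience` (p) consecutive epochs, or until max_epochs is reached.
    previous_loss = network.evaluate_loss(val_set)
    quiet_epochs = 0
    for epoch in range(1, max_epochs + 1):
        network.fit_one_epoch(train_set)
        current_loss = network.evaluate_loss(val_set)
        if abs(current_loss - previous_loss) < loss_threshold:
            quiet_epochs += 1
        else:
            quiet_epochs = 0
        previous_loss = current_loss
        if quiet_epochs >= patience:
            break          # stopping condition met; the latest weights are kept
    return epoch
```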

7.2.2 Results

Figure 7.3 shows the effects of using early-stopping with an auto-curriculum-player in the game of Reversi, and Figure 7.4 shows the effect while playing Racing-Kings. Whilst the agent with early-stopping outperforms the agent without, the relative performance of the two players is close; however, one of the key benefits of this method is the simplification of hyper-parameter selection. Plot (b) shows how many epochs are performed at each time period. We can see that the auto-player without early-stopping takes 20 epochs for each training cycle, whereas for the early-stopping player we see some initial variance followed by a constant number of epochs. It is likely that the number of epochs conducted for the early-stopping player, shown in plot (b), would be more varied if the stopping condition were smaller, the replacement rate were higher, or the patience parameter increased.


Figure 7.3: Reversi: Early-stopping player vs fixed-epoch player. (a) success rate vs reference-opponent, (b) number of epochs conducted per cycle. Plot (a) is the accumulation of results from three separate training runs which are shown separately in (b). Note that run 1 uses a patience parameter of 20 epochs while the other runs use 10.


Figure 7.4: Racing-Kings: Early-stopping player vs fixed-epoch player. (a) success rate vs reference-opponent, (b) number of epochs conducted per cycle. Plot (a) is the accumulation of results from three separate training runs which are shown separately in (b). Note that run 1 uses a patience parameter of 20 epochs while the other runs use 10.

7.2.3 Discussion

Employing early-stopping methods within a reinforcement-learning training pipeline results in improved learning outcomes when the agent is also employing an end-game-first curriculum. We stop short of declaring early-stopping a beneficial method for reinforcement-learning without an end-game-first curriculum, as we only experimented with end-game-first players. It remains possible that early-stopping methods hinder learning without a curriculum due to the large number of uninformed experiences in the buffer during the early training cycles. Consider the situation where the experience-buffer is comprised of numerous random (uninformed) experiences and a large portion of poorly informed experiences. When the early-stopping method is applied, it is possible that the agent will persist with a high number of training epochs attempting to fit the neural-network to this low quality data - in our experiment up to 100 epochs. The result would be that the weights are changed in such a way that they are further from the actual solution. Then, for the following training cycle, the generated experiences could be even more poorly informed, preventing the visibility-horizon from propagating backwards to the start of the game and hindering learning. This is why we only claim that early-stopping works when employing an end-game-first curriculum, not in general reinforcement-learning approaches. In both figures we see the player without early-stopping taking 20 epochs for each time period, as expected since this was set for that agent. Figures 7.3(b) and 7.4(b) show that initially a large number of epochs are performed during the training of the early-stopping player, then the number plateaus to a constant value. The plateau is slightly above the patience parameter set for each early-stopping player. For instance, in Figure 7.4(b) the early-stopping player in run 1 performs in excess of 50 epochs initially, then completes 21 to 22 epochs, near the preset patience value p = 20. We would expect this to occur if the stopping condition is being too easily met. Three parameters can be adjusted to make the stopping condition more difficult to achieve:

• decrease the loss threshold parameter T ,

• increase the patience parameter p, or

• increase the replacement rate of experiences after each training cycle.

It is not clear whether continually achieving the stopping condition (resulting in a constant number of epochs) is optimal or whether it would be better to have varying epochs each training cycle. It is our opinion that readily achieving the stopping condition is indicative of a problem set that is not sufficiently challenging for the agent - however further work is required to determine this. The observed plateau of the number of epochs for the early-stopping players indicates that the agent can readily achieve the stopping condition. This might indicate that the stopping condition is too easy, however it may also be possible to use this as an indicator that the agent is well trained on the state-space which has been presented to it, by comparing the patience parameter p with the number of epochs conducted e. If e ≈ p then we could infer that the problem space is not challenging enough for the agent and use this to adjust the curriculum of an automated end-game-first curriculum-player - recalling that we currently use the search-depth to control the curriculum. Using the relationship between e and p may also provide an opportunity to automatically decrease the loss threshold T, or increase the patience parameter p. Further work is required to ascertain whether the relationship between e and p can be used in these ways to further improve learning.

The training-pipeline currently used by the agent includes removing 20% of the buffer at the end of each training cycle. It is possible that the early-stopping agent continually meeting the stopping condition might provide some insight into whether the rate of replacement is sufficient. It may be possible to adjust the replacement rate in much the same way that the auto-curriculum was adjusted, by always seeking to ensure that the number of epochs moves about some constant value, with the replacement rate resisting any changes in the number of epochs conducted. We conclude that the use of early-stopping aids learning and simplifies parameter selection for a reinforcement-learning agent employing an end-game-first curriculum. Our research in this section opens up a number of further research opportunities which may lead to the automation of the replacement rate, the loss threshold parameter and the patience parameter; this may further enhance the rate of learning, making the algorithm more general and extracting additional benefit from any generated self-play experience.

7.3 Spatial dropout

7.3.1 Method

A neural-network is overfitted when it is able to make accurate predictions for the data which was used to train it, but performs poorly when making predictions for data that it has not explicitly seen. More subtly, the network is still overfit if it can only make predictions for a small portion of the entire state-space - unless it is designed specifically for that smaller set. If only a small portion of the state-space is ever presented to the neural-network then it stands to reason that the neural-network will only be able to learn to make predictions for that portion of the state-space which it has seen. Overfitting can occur when there is insufficient data, poor data diversity, or if back-propagation is applied too many times. Overfitting is a fundamental machine-learning challenge driving the requirement for large data-sets, and a number of regularisation methods exist to prevent it, as discussed in Section 2.4.4. As detailed in Chapter 3 we utilise batch-normalisation as the primary method of regularisation, in line with related work. We also ensure that the curriculum, which narrows the state-space, progresses at a sufficient rate to avoid overfitting. Dropout has been found to be effective in reducing overfitting for fully connected networks and, more recently, spatial-dropout has been used for CNNs. Perhaps due to how recent the spatial-dropout method is, it has not been prolifically used in CNN-based reinforcement-learning agents. We have performed a quick study to determine whether the addition of spatial-dropout improves learning for an end-game-first curriculum-player, given that the dynamics of employing dropout may differ when using a curriculum-learning agent compared to an agent without. When employing an end-game-first curriculum, initially only the positions that are near terminal states are used for learning, i.e. only a small portion of the actual problem space is used for training. Training on a small portion of the problem space increases the risk of the network overfitting. The end-game-first methods discussed in Chapters 4 and 5 mitigate the risk of overfitting by progressively increasing the problem space before overfitting can occur. With the way that we progress the end-game-first curriculum, the curriculum-parameter can be used to establish the portion of the state-space which the agent is learning. For example, the game of Reversi has approximately 60 moves on average, and if the auto-curriculum parameter for a given training cycle includes 40 moves from terminal then we can estimate that 40/60 = 2/3 of the state-space is being presented to the agent. We can use this calculation to set the proportion of dropout for the agent. The research in this section is a preliminary study to observe the effects of spatial-dropout on a neural-network/tree-search agent using an end-game-first curriculum with reinforcement-learning. Our aim here is to determine whether further research is warranted with relation to spatial-dropout specifically on an agent using the improvements we presented in this thesis, recalling that we already use batch-normalisation in the network. We perform two preliminary experiments:

• Comparing the effect of an auto-curriculum agent using spatial-dropout with an agent without; and

• Observing how variable spatial-dropout compares with fixed dropout.

For this experiment we use all improvements discovered in the course of this research: auto-curriculum, incremental priming, terminal priming and early-stopping. Our experiments in this section are preliminary, with a view to informing whether to continue investigating how spatial-dropout can improve learning for an end-game-first reinforcement-learning agent.
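As an illustration, the sketch below computes a variable spatial-dropout rate from the fraction of the game admitted by the curriculum, following one plausible reading of the scheme described in the chapter introduction (dropout held at its 50% maximum while the curriculum covers less than half the game, then decreased proportionally). The function and its arguments are hypothetical, not the thesis implementation.

```python
def variable_dropout_rate(curriculum_distance, typical_game_length, max_rate=0.5):
    # Fraction of the game currently admitted by the end-game-first curriculum.
    fraction = min(curriculum_distance / typical_game_length, 1.0)
    if fraction <= 0.5:
        return max_rate                          # full 50% dropout while diversity is low
    return max_rate * (1.0 - fraction) / 0.5     # falls linearly to 0 as fraction approaches 1

# e.g. for Reversi (about 60 ply): variable_dropout_rate(40, 60) gives roughly 0.33
```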

7.3.2 Results

Figure 7.5 shows the relative performance of an agent with spatial-dropout and an agent with no dropout, both using an auto-curriculum and all improvements mentioned throughout this thesis. Figure 7.6 shows the relative performance of an agent with fixed spatial-dropout and an agent with variable spatial-dropout (maximum dropout of 50%), again with an auto-curriculum and all improvements mentioned throughout this thesis.

7.3.3 Discussion

The results shown in Figure 7.5 are inconclusive as to whether spatial-dropout is beneficial when applied to an end-game-first curriculum-learning player. The architecture of the agent includes the use of batch-normalisation, which has become the popular regularisation method for deep CNNs. The results may indicate that batch-normalisation is effective enough by itself, rendering spatial-dropout superfluous, as reported in [126]. We had thought that dropout might provide some additional benefit, particularly when all improvements were added to the agent, since they all in their own way reduce the state-space-size for some period of time.

Figures 7.5(a) and 7.5(b) are encouraging results and suggest that further research in this area may be fruitful. Of particular interest would be whether the use of spatial-dropout mitigates poor hyper-parameter selection in which an unmodified agent would be overfitted. The results shown in Figure 7.6 with regard to varying the dropout proportion were also found to be inconclusive. With the end-game-first curriculum we can estimate what proportion of the state-space is being presented to the agent; as such we hypothesised that dropping out an equal portion of the network would yield either better performance or more stable training. We were cautious with our variable-dropout agent, setting the dropout to the smaller of a maximum of 50% or the proportion of the state-space used for training by the curriculum. We expected to observe a difference between the player with variable-dropout and the 50%-fixed-dropout player when the curriculum included greater than 50% of the state-space, however the results are inconclusive, though promising. To extend this work it would be interesting to observe the learning behaviour of a variable agent with a maximum dropout of 90% against the fixed dropout agent. This would be a fairly significant departure from the current recommendations in the literature, however we hypothesise that since we already control the state-space proportion with the curriculum, setting the dropout to the same proportion would be beneficial to learning, either by decreasing learning time or decreasing sensitivities to hyper-parameters. Our variable method of applying dropout decreases the dropout based on the known portion of the game space being learned, ensuring that during the training cycles where there is less diversity due to the curriculum, more of the network is subject to dropout.

7.4 Summary

In this chapter we study the effectiveness of two existing methods from supervised-learning which are not commonly used for reinforcement-learning. The early-stopping method essentially allows a supervised-learning agent to train until all of the information is extracted from the prepared ground-truth examples. We made the case early in this thesis that self-play generation of experiences for a reinforcement-learning agent is very costly; as such, when the agent is being trained on these experiences it is beneficial to extract as much value as possible from them before discarding them and creating new ones. Each training cycle for a reinforcement-learning agent is different, with different experiences, different state-space-size and different gradients. As such, setting a fixed number of epochs for learning seems counter-intuitive, yet this is what the current state-of-the-art reinforcement-learning agents employ. We have shown that the same process of early-stopping that has been found to be beneficial to supervised-learning is also beneficial to reinforcement-learning. We have identified that further research in this area may yield further efficiencies in the use of self-play generated experiences.

Dropout was used in many of the early state-of-the-art image recognition networks as the neural-networks became deeper, however with the increase in use of CNNs the use of dropout seems to be diminishing in favour of other regularisation methods like batch-normalisation. Our curriculum-learning agent, however, creates a unique diversity problem by deliberately presenting only a portion of the state-space at any one time; this leads to enhanced learning and may benefit further from adaptive dropout. We have shown that spatial-dropout can still be used in conjunction with batch-normalisation for deep CNNs and have identified further research areas where variable-dropout may be employed to improve learning for an end-game-first curriculum agent.

Figure 7.5: Auto-curriculum-players with and without dropout (DO). (a) Connect-4 (b) Reversi (c) Racing-Kings

Figure 7.6: Auto-curriculum-players with fixed dropout (DO) and with variable-dropout. (a) Connect-4 (b) Reversi (c) Racing-Kings

Chapter 8

Pairwise Player Comparison With Confidence

Highlights

• How to compare two players using statistical hypothesis testing methods for proportions.

• A sequential stopping method for a pairwise trial retaining confidence.

8.1 Introduction

In computer game playing, researchers are often interested in the comparative performance of two players to indicate whether one playing method is better than another. To determine if one player is better than another, the two players often compete against each other in a tournament and their relative performance is measured by their win-rate. Alternatively, as done in this thesis, the two players can compete against a third opponent and their win-rates against that reference-opponent are compared.

The process of estimating binomial proportions with confidence is well established and is utilised in many scientific fields [159], however it is rarely covered in machine-learning game playing literature. In our case the proportion we are most interested in is a player's win-rate. A study in 2005 by Belia et al. found that many scientists misunderstood confidence intervals and the researchers recommended that guidelines be provided for researchers in each field [160], and more recently similar comments were made by Verdam et al. [161]. In the course of the research of this thesis it was observed that confidence intervals are less commonly reported when presenting machine-learning results - particularly in game playing, which suggests that perhaps the findings of Belia et al. are still relevant. Even for the premier scientific journal Nature, statistical confidence methods are not always required when reporting state-of-the-art game playing advancements [19, 30], although their presented results were a demonstrable improvement. The purpose of this chapter, in part, is to demonstrate the simplicity of using hypothesis testing to differentiate two players and to present a new sequential-stopping-with-confidence method for conducting a tournament. Our sequential-stopping method allows the competition to halt early if one player significantly outperforms the other.


One of the first steps in conducting research is to establish a hypothesis; an experiment is then designed and conducted to test that hypothesis. Typically a hypothesis cannot be definitively proven or disproven and as a result, findings are often expressed in probabilistic terms. When testing a new drug for instance, researchers hypothesise that the drug is effective in treating a particular illness. Then, a trial is conducted where the drug is given to a sample of patients and the instances of improvement are recorded. The distribution of symptoms in the treated patients is compared with a control population and if the distributions are sufficiently different the researcher can make a claim with some level of confidence about the drug's effectiveness based on statistical methods [156]. The statistical process of hypothesis testing thus enables researchers to quantify their claims about a hypothesis [162].

In this chapter we provide an outline of hypothesis testing as it applies to comparing two machine-learning players, then we provide a practical step-by-step guide for performing hypothesis testing to determine, with confidence, if one player is better than the other. We develop the method incrementally using empirical analysis to confirm that the user specified confidence levels are achieved. Although we provide a deep-dive into hypothesis testing from basic principles, all that is needed to determine the confidence levels for a pairwise player comparison is the number of games played n and the number of wins of the player w. We discuss in detail pairwise player comparison via a head-to-head competition, however the method also holds for comparing players against an independent reference-opponent. For a head-to-head competition a single distribution is obtained and the confidence intervals are measured against the value of 0.5 to determine the better player. However, when using an external reference-opponent the same method can be used to calculate the confidence intervals, in which case two distributions will be obtained: one for each player. If the extremities of both distributions, i.e. the confidence-bounds, do not overlap then this can be used to determine the better player.

The standard hypothesis test requires that a fixed number of trials (fixed-n) are conducted before the hypothesis is tested. After presenting the fixed-n method we then propose a variation which permits halting a trial early if one player is significantly outperforming another. As far as the authors are aware, our proposed procedure for sequential-stopping is a unique variation of the fixed-n method in the machine-learning literature and, if not new in statistical fields, then it is at least not commonly presented.

8.2 Hypothesis Testing

8.2.1 The Pairwise Player Comparison Problem

In Section 3.3.2 we explained the three stage training pipeline used by AlphaGoZero¹, being: self-play, optimisation and evaluation, where the best player is used to generate the self-play examples. It is during the optimisation stage that a new generation of player is created. The evaluation stage tests whether the newest player is better than the current best player. In AlphaGoZero, the criterion used to accept or reject the new player as the best player is from a tournament of 400 games between the best and the new player, and the new player is accepted if they win more than 55% of the games, i.e. the new player wins more than 220 games [30]. The explanation for these decision parameters is not covered in [30], however given their chosen number of evaluation games of n = 400 and the required number of wins w > 0.55 · n, if a decision is made to replace the best player it is done with close to 95% confidence that the new player is better. We will cover how this can be calculated in this chapter. What is not obvious from the reported AlphaGoZero results is that with only 400 games and 95% confidence the two players need to have a difference in probability of winning of approximately 0.2, i.e. the challenger has to have a true probability of winning of p = 0.6, for replacement to occur. The AlphaGoZero evaluation is one potential use-case for the methods contained in this chapter. Our sequential stopping method, explained in the next section, would also allow the competition to finish earlier if one player substantially outperformed the other and still retain the specified confidence in the decision.

¹ AlphaGoZero was the last generation of Deepmind's computer Go players to utilise a three step training process.

In a tournament consisting of n games between a player and an opponent², the total number of wins w for the player is recorded, and this can be used to estimate the player's probability of winning against the opponent using Equation 8.1; the opponent's probability of winning is q̂. Note that p̂ = 1 − q̂, i.e. the sum of both players' probabilities p̂ + q̂ = 1, because when one of the players wins a game the other player loses that game, i.e. it is a zero-sum game. This zero-sum property means that we can estimate the win-rate of one player and calculate the relative performance of the opponent. The problem then is simply to calculate the lower L and upper U limits of the true win-rate p and assess whether these limits have a value which can be used to make a determination about the players' relative abilities. If the two players' win-rates are labelled p and q, then the values required to declare one player better are L_p > 0.5 or U_p < 0.5 for a head-to-head competition, and when using a reference-opponent the two confidence intervals must not overlap, i.e. L_p > U_q or U_p < L_q.

p̂ = w/n    (8.1)

where:

p̂ is the probability of winning for one player, w is the number of wins, and n is the number of games conducted.

When competing against an opponent, the true probability of winning for the player is p - this is not known in advance and instead the competition is conducted to estimate p. The number of games in the competition determines how accurately the confidence-bounds reflect the true value p, and with enough games, L ≤ p ≤ U can be described in a way which is sufficiently accurate to declare that the two players have different playing abilities with a predefined level of confidence, if their performance in the competition is sufficiently different. If one player is better than the other, then it follows that the two players are not equal and p ≠ q, and specifically when using the head-to-head competition p > 0.5 or p < 0.5, i.e. p ≠ 0.5.

² The opponent can be either the other player of interest or the reference-opponent.

The remainder of our discussion revolves around performing a head-to-head competition between two players of interest, although only a small modification is required to adapt it to the case of playing against a reference-opponent. In the case of a head-to-head competition, we have established that only the results for one player are required to be maintained, and to simplify our notation we only refer to those results as being for the player - as opposed to the opponent.

8.2.2 Null-Hypothesis Testing

If the two players of interest, one we call the player and the other we call the opponent, are of equal strength then p = q = 0.5, recalling that p is the probability of winning against the opponent. If p = 0.5 then, after playing a large number of games, by rearranging Equation 8.1 we get w ≈ p̂ · n → 0.5 · n, or the estimated probability of winning for the player is as shown in Equation 8.2.

p̂ = w/n ≈ 0.5    (8.2)

If instead after the competition we discovered that p̂ ≉ 0.5 then we would be inclined to infer that one player was better than the other and could reject the idea that both players were of equal strength. This is essentially the underlying premise of hypothesis testing. First, we establish our hypothesis that both players are equal; then we conduct an experiment, in our case a tournament of n games; finally, the outcome of the experiment may or may not disprove the hypothesis. To formally define the hypothesis, we call the hypothesis which we are trying to disprove the null-hypothesis, H0. The alternate hypothesis, Ha, is what would be true if H0 is false. In the case of a pairwise player comparison our hypotheses are:

H0 : p = 0.5

Ha : p ≠ 0.5

Where p is the true, unknown, probability of winning. The experiment used to test H0 is a competition of n games between the players, recording the number of wins w for one player. If p̂ from Equation 8.2 is found to be significantly different from 0.5 we can reject H0 and accept Ha : p ≠ 0.5. If we find p̂ ≈ 0.5 then we conclude that our experiment could not disprove H0. In the next section we outline how to determine the criteria for dismissing H0 with some level of confidence.

Importantly, an experiment can not prove a null-hypothesis [163], it can only disprove it; e.g. for our game playing example we could always play more games and potentially discover the players are actually different. For example, if after n games in a competition we observe a player's win-rate as p̂ ≈ 0.5, which is predicted by the null-hypothesis, we might be tempted to declare that the players are equal, however we can only state that the experiment did not detect any potential differences between the players. To illustrate why we can never prove a null-hypothesis, if our null-hypothesis was p = 0.5 then imagine if we played n = 1000 games and the outcome resulted in p̂ = 0.5; the true win-rate of the player may still be p = 0.500001, a small difference between the players, however strictly both players are still not equal. Such a small difference might only be discovered after more games than originally planned.

The purpose of comparing the two players is to determine if they are different from each other. A game ending in a draw does not provide any information as to which player is better, as such draws are excluded from the competition for this hypothesis. Since draws provide no insight into whether one player is better or not, when a game is drawn it is discarded, as if it never existed. It may be tempting to attempt to utilise draws to infer some level of similarity between the players, however a null-hypothesis can not be proven. A null-hypothesis is a statement that is the status quo until proven otherwise, i.e. there is no statistical difference between two phenomena, and experiments are conducted to "gather empirical evidence to determine if it is unreasonable" [164]; drawn games do not in any way contribute to the evidence that the players are different. A null-hypothesis is based on a theoretically infinite population, as such there is always more evidence which could be gathered which might disprove it [165]. In our case the null-hypothesis is that the two players are presumed equal, until proven otherwise.

8.2.3 Determining Significance

A Bernoulli trial is an experiment with two outcomes [166]; in the case of a pairwise player comparison the outcomes are: win and loss. X is usually used to represent a success, and for our case a success is a win w, so for clarity we use X ← w. The number of wins in a competition is a discrete value, so X is a discrete random variable. If we define a win as 1 and a loss as 0 then for n games we can represent a sequence of independent Bernoulli trials as a binomial distribution B(n, p) [167], and doing so allows us to visualise the distribution of probabilities with the probability mass function (PMF). The PMF is defined more formally by Chen, "letting X be the number of successes in n independent Bernoulli trials with common success probability p; i.e. X is distributed as a binomial distribution B(n, p)" [168]. Practically, the common success probability p is a player's hidden probability of winning against their opponent. From Stewart, "the PMF for a discrete random variable X, provides the probability that the value obtained by X on the outcome of a probability experiment is equal to x" [169]. The PMF is accessible in Python's scipy.stats library and is common in statistics libraries for other languages; an example PMF is shown in Figure 8.1(a). We also find it informative to observe the probability mass over the p̂ values from Equation 8.2, as shown in Figure 8.1(b), although conceptually this becomes a little harder to discuss since the plot indicates the probability that the true population probability p is some value.

Figure 8.1: Plot (a) shows the probability mass function for the possible values for X, given the outcomes from n Bernoulli trials with probability of 0.65. Plot (b) shows the same data however the horizontal axis is converted to its sample probability using Equation 8.2. It can be seen from the two plots in (b) that the higher value of n results in a narrower probability distribution. The total probability under the curve is 1, since 0 ≤ p ≤ 1.
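
As a brief illustration of the quantities plotted in Figure 8.1, the following sketch (our addition) evaluates the PMF of B(100, 0.65) with scipy.stats.binom and relabels the horizontal axis as a win-rate via Equation 8.2; the plotting itself is omitted.

    from scipy.stats import binom

    n, p = 100, 0.65
    x_values = range(n + 1)
    pmf = [binom.pmf(x, n, p) for x in x_values]   # P(X = x) for B(n, p), Figure 8.1(a)
    win_rates = [x / n for x in x_values]          # same data on the p-hat axis, Figure 8.1(b)
    mode = max(x_values, key=lambda x: pmf[x])     # most likely number of wins, near n*p = 65
    print(mode, pmf[mode])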

When we make any claims resulting from the number of wins for a competition we can also make the same claim regarding the sample probability, which, for a game, is the win-rate achieved in a competition.

8.3 Confidence Bounds

Figure 8.2: Two PMFs for binomial distributions B differing only by n, with the red area showing the accumulated probability mass Σ_{x_i=0}^{L_x} P(x_i) = α = 5%. For B(100, 0.65) the 95% lower confidence bound (LCB) is L_x = 57 wins and for B(1000, 0.65) L_x = 620 wins, i.e. 95% of the probability mass is where X > L_x. Again this can be discussed in terms of a player's probability of winning against an opponent by dividing by n.

Since the PMF for B(n, p) provides the probability distribution for discrete values of X, we can accumulate a sequence of these probabilities to sum to any desired probability mass, e.g. for B(100, 0.65), using the PMF we can obtain the probabilities of winning 1 game, then 2 games, then 3 games etc. and continually accumulate these probabilities until we have reached the desired mass. More formally, if we define the desired probability mass as α and each discrete possible value for X as X ← {x_i = 0, .., n}, with the value of i being directly related to the number of wins, we define L_x as the i-th element in a summation where the sum of the probabilities equals α, as shown in Equation 8.3. We use the subscript x for L_x to denote the units which the bounds specify, because later we prefer to have the bounds specified as probabilities of winning (win-rates).

Σ_{x_i=0}^{L_x} P(x_i) = α    (8.3)

By doing this, we can then make a claim, with a probability of α, that X ≤ L_x, or inversely that there is a probability of 1 − α that X > L_x. 1 − α is called the confidence level and L_x is the lower confidence bound in number of successes. For example, if α = 0.05 and we apply this method to B(n, p) to obtain L_x, we can claim 95% confidence that X > L_x. The percentage point function (e.g. scipy.stats.binom.ppf) provides the ability to calculate the point L_x at which the accumulated sum of P(x) reaches α. This is how a confidence bound can be calculated from a binomial distribution, with α defined as the significance level and 1 − α defined as the confidence level. Figure 8.2 provides an example of how to obtain a desired mass for a confidence bound.
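
A short sketch (our addition) of that calculation: scipy.stats.binom.ppf returns the smallest number of wins L_x whose accumulated probability mass reaches α, which is the lower confidence bound described above.

    from scipy.stats import binom

    n, p, alpha = 100, 0.65, 0.05
    L_x = int(binom.ppf(alpha, n, p))   # smallest L_x with P(X <= L_x) >= alpha
    print(L_x)                          # 57 wins for B(100, 0.65), as in Figure 8.2
    print(L_x / n)                      # the same bound expressed as a win-rate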

P (x = Ux) until the probability is 1 − α, resulting in α of the probability mass being greater than Ux. If both the LCB and UCB are used to create an interval then the α confidence level for that interval is 1 − 2 , the mass that is between the two values is given by Equation 8.4;

α P (L ≤ x ≤ U) = 1 − (8.4) 2 For example, if P (x ≤ L) = 0.05 and P (x ≥ U) = 0.05 then 90% of the probability 136 mass is between the two bounds. This is called a two-sided test and in order to obtain 95% of the probability mass between the LCB and UCB α is halved.

We can test our hypothesis by obtaining U_x and L_x for p = 0.5 and, if the number of wins from a competition falls outside of the bounds, we can reject our hypothesis with confidence. Another point of view could be to instead obtain the distribution from the observation, B(n, p̂), and check if its bounds fall outside the hypothesised value. It is more convenient for our problem to specify the confidence in terms of probabilities of winning and, as previously mentioned, we achieve that with Equation 8.2, or with L, U = L_x/n, U_x/n. For the purposes of further discussion we now revert to using the number of wins w instead of x and the number of games played in a competition n, and we also prefer to specify the confidence bounds as a proportion, since it is more relevant for testing our hypothesis. We also prefer to conduct the hypothesis test by using the observation distribution because this method is also suitable for comparing two players while using a reference-opponent. In the case of using an external reference-opponent, if the confidence bounds do not overlap then the null-hypothesis can be rejected and a claim made that one player is better than the other.

8.3.1 Confidence Bound Convergence

Further discussion uses the confidence bounds as proportions, not the numbers of successes. The proportion we are interested in is a player's probability of winning p where 0 ≤ p ≤ 1. With n = 0 trials conducted, the confidence bounds L and U are initially 0 and 1 respectively. As n increases the confidence bounds converge to the true value p, i.e. as:

n → ∞: U − L → 0, and L, U → p

Having narrow confidence bounds is desirable as it has the advantage of providing a more precise estimate of p. To obtain a higher confidence level for the same n, the width of the confidence interval could be increased, i.e. L and U could be moved closer to their initial values. At one extreme, consider a confidence level of 100% - other than when n = ∞, the only solution for 100% confidence is [L, U] = [0, 1] regardless of the size of n. In order to obtain a high level of confidence, the U and L estimates are more conservative, resulting in their rate of change per trial being lower as the required confidence increases. On the other hand, low confidence levels result in the bounds moving more freely since more errors are permitted. Figure 8.3 shows how different confidence levels impact the lower confidence bound (LCB) given the same n and the same p̂ - the lower the confidence the more the LCB moves from its initial value of 0. It is this relationship between the confidence level, the width of the confidence interval and the rate at which the bounds move per trial which gives rise to the requirement to conduct more games in order to obtain sufficiently narrow confidence intervals when p is closer to 0.5, i.e. further away from the initial [0, 1]. In other words, the more similar the players are, the more games are needed to differentiate them.

Figure 8.3: Holding p and n constant we vary the 1 − α confidence level and observe its impact on L. With the initial value L = 0 we see that as the confidence increases L becomes further from p.

8.3.2 Confidence Bounds Without a Statistics Library - The Wilson-Score

A number of closed-form methods exist which can be used to estimate the confidence bounds for a number of Bernoulli trials if a statistics package is not available. A binomial distribution can be approximated by a normal distribution, resulting in a continuous function which makes it easier to derive closed-form solutions. The method most commonly taught is the Wald method [170], shown in Equation 8.5, due to its simplicity, however it has practical limitations.

L, U = p̂ ± z·√(p̂(1 − p̂)/n)    (8.5)

Where:

• z is the 100(1 − α/2)th percentile of the standard normal distribution and

• p̂ = w/n with w being the number of wins from a tournament of n games [171].
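
For completeness, a minimal sketch (our addition) of the Wald interval in Equation 8.5; note that the raw interval is deliberately not clipped to [0, 1], which illustrates one of the limitations discussed next.

    from math import sqrt

    def wald_bounds(w: int, n: int, z: float = 1.96):
        """Wald confidence bounds (Equation 8.5) for w wins in n games."""
        p_hat = w / n
        half_width = z * sqrt(p_hat * (1 - p_hat) / n)
        return p_hat - half_width, p_hat + half_width

    print(wald_bounds(230, 400))   # e.g. 230 wins from 400 games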

By naively applying a normal approximation to a binomial distribution, inaccuracies occur at the boundaries [172]. Since a binomial distribution is in the range [0, 1], if the distribution mean is near 0 or 1 a normal approximation can result in values outside the permitted range. Additionally, it was found that the Wald interval is prone to unlucky pairs of numbers which result in actual confidence levels below expectations. It is for these reasons that numerous other methods to obtain confidence bounds have been developed [173]. The comparison of confidence interval methods conducted in [156] recommends that the Wald intervals be "no longer acceptable for scientific literature". For these reasons we use the Wilson-Score with continuity correction [174] as the closed-form solution for obtaining confidence bounds, as shown in Equations 8.6 and 8.7 [159]. From a tournament of n games, with w wins for a player, the Wilson method can provide the confidence bounds. The advantage of a closed-form solution is that it is faster to calculate than the exact method used above, and does not require any special libraries, although it does require a z-score table.

L_α = max(0, (2np̂ + z² − 1 − z·√(z² − 2 − 1/n + 4p̂(n(1 − p̂) + 1))) / (2(n + z²)))    (8.6)

U_α = min(1, (2np̂ + z² + 1 + z·√(z² + 2 − 1/n + 4p̂(n(1 − p̂) − 1))) / (2(n + z²)))    (8.7)

Where:

• L_α and U_α are the lower and upper 1 − α confidence limits.

• z is the z-score relating to the desired confidence level, where z = z_α for the one-sided test and z = z_{α/2} for the two-sided test. z is the 100(1 − α/2)th percentile of the standard normal distribution. Z-scores are often simply obtained from a z-score table. Common z-score values for two-tailed confidence levels are

z_0.05 = 1.96 and z_0.02 = 2.33. The score for any α can also be obtained using the scipy.stats.norm.ppf function.

• n is the number of trials (games)

• p̂ = w/n; w is the number of successes (wins)
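
The following sketch (our addition) implements Equations 8.6 and 8.7 directly; the max(0.0, ...) guard inside the square root is ours, protecting the degenerate cases p̂ = 0 or p̂ = 1. With w = 221 wins from n = 400 games - just above the AlphaGoZero acceptance threshold discussed in Section 8.2.1 - the lower bound comes out at roughly 0.502, i.e. marginally above 0.5.

    from math import sqrt

    def wilson_bounds(w: int, n: int, z: float = 1.96):
        """Wilson-score bounds with continuity correction (Equations 8.6 and 8.7)."""
        p_hat = w / n
        denom = 2 * (n + z ** 2)
        lower_root = z ** 2 - 2 - 1 / n + 4 * p_hat * (n * (1 - p_hat) + 1)
        upper_root = z ** 2 + 2 - 1 / n + 4 * p_hat * (n * (1 - p_hat) - 1)
        L = max(0.0, (2 * n * p_hat + z ** 2 - 1 - z * sqrt(max(0.0, lower_root))) / denom)
        U = min(1.0, (2 * n * p_hat + z ** 2 + 1 + z * sqrt(max(0.0, upper_root))) / denom)
        return L, U

    print(wilson_bounds(221, 400))   # lower bound just above 0.5 at z = 1.96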

8.3.3 Prediction Method

Given two players of unknown ability, we propose the following method to predict, with confidence, if one player is better than the other. Whilst there are a number of different statistical approaches that can achieve this, the method we propose is relatively straightforward, and we extend it in the next section to enable terminating a competition early if one player significantly outperforms the other. Recall that we are attempting to disprove the null-hypothesis that two players have an equal probability of winning.

H0 : p = 0.5

Ha : p ≠ 0.5

In order to test the null-hypothesis the method is as follows:

• Play n games between both players and record w for player 1.

• Using w and n, calculate the confidence intervals [L ≤ p ≤ U].

• If L > 0.5 or U < 0.5, reject the null-hypothesis H0 and state with confidence that the two players are not equal. When H0 is rejected a prediction can be made as to which player is the best one by selecting the player with the most wins. Technically, this prediction claim does not have precisely the same confidence as rejecting H0, but in a practical sense the difference is insignificant³. We call [L > 0.5 or U < 0.5] the prediction criteria, as these values are required before a prediction can be made.

• If L ≤ 0.5 and U ≥ 0.5 then n was insufficient to distinguish a difference between the two players and no prediction is made.
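
A compact sketch (our addition) of the prediction step above, reusing the wilson_bounds helper sketched in Section 8.3.2:

    def predict_better_player(w: int, n: int, z: float = 1.96):
        """Return 'player', 'opponent' or None when the players cannot be differentiated."""
        L, U = wilson_bounds(w, n, z)   # 1 - alpha confidence bounds on the win-rate
        if L > 0.5:
            return "player"             # H0 rejected; the tested player is better
        if U < 0.5:
            return "opponent"           # H0 rejected; the opponent is better
        return None                     # H0 not rejected; no prediction is made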

We have now transformed the binary hypothesis test into three prediction outcomes. The possible outcomes are:

1. The players are indistinguishable, i.e. H0 was not rejected, with the possibility that the players were actually different. If this error is made it is a type II error.

2. There is a difference between the players at the 1 − α confidence level, and approximately the same confidence that the tested player is better, i.e. H0 was rejected and the tested player was found to be better, with the possibility that the players were actually equal. If this error is made it is a type I error.

3. There is a difference between the players at the 1 − α confidence level, and approximately the same confidence that the tested player is worse, i.e. H0 was rejected and the tested player was found to be worse, with the possibility that the players were actually equal. If this error is made it is a type I error.

³ We ran 100,000 tournaments of 500 games with p = 0.52 and in cases where H0 was correctly rejected there were zero errors in predicting the better player.

8.3.4 Types of Error

Table 8.1: Prediction outcomes showing the conditions which result in different error types.

                      H0 True         H0 False
Reject H0             Type I Error    Correct
Fail to Reject H0     Correct         Type II Error

Type I Errors

If the null-hypothesis H0 is rejected when in actual fact it was true, this is a type I error. The probability of making a type I error is called the significance level α, and 1 − α is called the confidence level. If after n games the prediction criteria is met, [L > 0.5 or U < 0.5], but p = 0.5, a type I error would be committed. It is an important point to note that type I errors can only occur if p = 0.5.

The methods for calculating the confidence bounds outlined above are designed to ensure that the calculated values of L and U are always at the 1 − α confidence level, so if we observe L > 0.5 or U < 0.5 after n games then our confidence in rejecting H0 and making a prediction is directly inherited from the method used to calculate these intervals. With the prediction criteria being [L > 0.5 or U < 0.5], we could completely avoid type I errors by simply not playing enough games, resulting in both L and U remaining on their respective sides of 0.5. That is to say, we will never make a type I error if we never reject H0 and never make a prediction, which is why type II errors are important in our application of hypothesis testing. It is important to keep in mind that type I errors only occur when a prediction is made and when p = 0.5.

Type II Errors

If the true p ≠ 0.5 and we fail to reject H0, this is a type II error, and the probability of making a type II error is β. If after n games the prediction criteria is not met, L < 0.5 and U > 0.5, but p ≠ 0.5, a type II error would be committed. Type II errors can only occur if p ≠ 0.5. Table 8.1 shows a summary of when the error types occur.

The effect of using a very high 1 − α confidence level is that the bounds move at a slower rate with each game played, so if a higher confidence level is used, additional games are required to ensure that the prediction criteria can be met if the players are sufficiently different. If insufficient games are played then L and U will not move sufficiently to cross 0.5, preventing a prediction from being made. The more games that are played, the smaller the difference between the players that can be detected. We define the difference between the probabilities of the two players as δ = |p − q|. Power ρ is defined as the probability of not committing a type II error i.e. ρ = 1 − β and is dependent on the sample size n, the significance α and p [175, Chapter 1]. Figure 8.4 shows how type II errors vary with n for a fixed δ = 0.1. This figure shows that as n decreases the T2 error increases. T2 was obtained by the method outlined in Section 8.3.5.

Type III Errors

The standard hypothesis test has two outcomes: either reject the hypothesis H0 or do not reject H0. However, our method provides three prediction outcomes:

1. The players are indistinguishable, i.e. H0 was not rejected, risking a type II error.

2. There is a difference between the players at the 1 − α confidence level, and approximately the same confidence that the tested player is better, i.e. H0 was rejected and the tested player was found to be better, risking a type I error.

3. There is a difference between the players at the 1 − α confidence level, and approximately the same confidence that the tested player is worse, i.e. H0 was rejected and the tested player was found to be worse, risking a type I error.

In converting the binary hypothesis test to a ternary prediction problem, it appears we have introduced an additional source of error, being when H0 was correctly rejected but the incorrect player is identified as the best player. This error is called a type III error, and these errors rarely occur when suitable parameters are chosen. Type III errors are rarely discussed in the literature as they are infrequent. Although it appears that we have introduced the potential for type III errors by employing three prediction outcomes, what we have actually done is made type III errors more visible. Using the outcomes of hypothesis testing in this ternary manner was first proposed by Mosteller in 1942 in his seminal paper identifying the possibility of type III errors [176], however the approach is not commonly utilised. With the original binary hypothesis test it is possible that the hypothesis is correctly rejected, but for the wrong reasons, e.g. if p was only slightly above 0.5 it is possible to envisage a random distribution causing U to have a value less than 0.5, resulting in correctly rejecting H0 but for essentially the wrong reason. Type III errors are more prevalent when the difference between the players δ = |p − q| is small and n is small, but even in extreme cases type III errors are still relatively rare. For example, with α = 0.05, δ = 0.005 and n = 100, 25 type III errors were observed in 2000 tournaments, giving a type III error rate t3 ≈ 0.0125 - for these parameters, no type III errors were observed where δ > 0.08. Later in this chapter we provide a method to determine the number of games required based on user defined requirements, and we have only observed type III errors occurring well outside of our recommended combinations of parameters, e.g. with α = 0.05 and n = 10, type III errors were only observed when δ < 0.32, but our findings indicate that only players with a difference of δ > 0.9 can be confidently differentiated with n = 10, as shown in Figure 8.6.

Table 8.2: Parameter summary for error types

                             Type I                        Type II
Error Probability            α                             β
Inverse Error Probability    Confidence, 1 − α             Power, ρ = 1 − β
Permissible p                p = 0.5, players are equal    p ≠ 0.5, players not equal

Figure 8.4: Plot of observed type II errors T2 vs n using the exact and the Wilson confidence intervals with δ = 0.1 and α = 0.05. 1000 competitions of 100 games were played at each data-point to calculate T2. It can be inferred from these plots that the fewer games conducted, the higher the occurrence of T2 errors.

Which Error Is More Important?

Type III errors are rarely discussed in the literature due to their low probability of occurring and for practical purposes they can be ignored.

Figure 8.5: Plot of type II errors T2 vs δ using the exact and the Wilson confidence intervals with n = 100 and α = 0.05. Note that there are 0 type II errors at p = 0.5 because only type I errors can occur when δ = 0 (p = 0.5). 1000 competitions of 100 games were played at each data-point to obtain T2. The plots show that as δ → 0, T2 → 1.

Type I errors are typically the primary concern because in rejecting H0 a definitive statement is made that “one player is better than the other with 1 − α confidence”.

In medical experiments, successfully rejecting H0 often means that a new drug or treatment has been effective and will advance to the next stage of testing. The lost opportunity by failing to reject H0 is often outweighed by the risk of approving a treatment that doesn’t perform as expected.

Where rejecting H0 is definitive, failing to reject H0 yields a much less definitive claim - "our experiment was unable to differentiate the players". Some researchers go so far as to recommend "err(ing) on the side of caution and risk type II errors" [177], however when comparing two game players either error may be more critical. In determining which error is more important, the consequence of rejecting or failing to reject H0 : p = 0.5 should be considered. If, for example, a player is discarded when it is shown to be worse than another, in some use-cases the losing player will be deleted and will not be tested again, creating the risk of removing a player that may have been better. This would imply a user requirement to control type I errors, α. If, however, a challenger is discarded when it fails to perform better than some base-line player, then controlling type II errors, β, may be more important. AlphaGoZero's evaluation step had three distinct outcomes:

• discard the new player if it did not outperform the current best player, i.e. discard the new player when H0 was not rejected, risking a type II error;

• discard the new player if H0 was rejected but the new player was found to be weaker, risking a type I error; and

• promote the new player to the best player if H0 was rejected and the new player was stronger, risking a type I error.

As such the outcomes arising from AlphaGoZero’s evaluation imply that both type I and type II errors are important. In general we recommend using α = β, however both can be traded off for a reduced n if the use-case permits.

8.3.5 Conducting a Trial Game

We empirically demonstrate the effectiveness of our proposed prediction method by running controlled trials of n games in which the winner of each game is randomly sampled with probability p. The number of wins w for one player is used to make predictions using the method above and the error rate is observed. A number of trials are conducted to obtain the average errors. Throughout, α and β are the required user-defined type I and type II error parameters, and T1 and T2 are the observed error rates from an experiment.

8.3.6 Choosing Parameters α,β,δ,n

Choosing 1 − α Confidence Level Parameter

Type I errors, T1, only occur when a prediction is made. Type I errors are controlled by selecting the desired value for α when calculating the confidence bounds or, in the case of the Wilson-score method, selecting the appropriate value of z corresponding to the required α. In choosing the 1 − α confidence value, consideration needs to be given to its effect on other parameters. A higher confidence level results in the confidence bounds changing more conservatively with each game, which impacts β and ultimately results in the requirement to conduct more games per tournament. As the required confidence increases, more games are required to constrain β. Common values for α and their z-scores are shown in Table 8.3.

Table 8.3: Common values for α and the respective z-score. When using the Wilson-Score method the z-score is required; when calculating the bounds directly, α is used.

α      Z-Score
0.01   2.58
0.05   1.96
0.1    1.64
0.2    1.28

Choosing 1 − β - Power Level Parameter

Type II errors, T2, occur when predictions are not made, but the players are actually different. β is the maximum type II error rate parameter. The value chosen for β will depend on the consequences of failing to detect a difference between two players. If there are no pressing user requirements we recommend setting β = α. With α fixed, the rate of type II errors is dependent on the difference between the players, δ, and the number of games played n. Two of the three parameters β, δ, n need to be fixed and the third parameter can be derived from Figure 8.6. We use ∆ to indicate the chosen fixed parameter value of δ.

If δ is fixed, increasing n decreases T2 errors, as shown in Figure 8.4. Likewise, Figure 8.5 shows for fixed n that as the players become more similar, i.e. δ decreases, the error increases. To obtain the 1 − β power level, either n needs to be set directly, which then determines the value of ∆ that can differentiate the players with 1 − β power, or conversely ∆ is chosen, which will then determine the minimum number of games required. Figure 8.6 provides the respective (n, ∆) combinations for common values of β. Figure 8.5 shows how, when n is fixed, β increases as δ → 0; in other words β increases as the players become more similar. The exception is when δ = 0, which is a special case where type II errors cannot be made. In Figure 8.6, by fixing δ and α we demonstrate that β decreases with n. For any pair (n, p) we can empirically obtain β via the process outlined in Section 8.3.5. Conversely, we can also empirically obtain the required n for any (β, δ) pair. This results in a final design decision relating to the parameter ∆, and that is setting either:

• the similarity parameter ∆ which will stipulate the required n, or

• setting n which will impact how effective the system is at differentiating two players with δ = ∆.

Since n = F (α, β, δ), to assist with the selection of this parameter Figure 8.6 can be used.

Choosing ∆ - Similarity Parameter

The similarity parameter ∆ indicates the maximum value of δ at which the players could be considered practically equal. For instance, if p = 0.5001 and q = 0.4999, then for most practical applications these two players would be considered essentially the same quality. An infinitesimally small value for ∆ can be set, however the trade-off is that more games will be required to obtain the desired 1 − β power. It is where δ ≥ ∆ that the specified 1 − β power level is achieved. For example, a value of ∆ = 0.1 would mean that if p ≥ 0.55 or p ≤ 0.45, i.e. δ = |p − q| ≥ 0.1, then we could predict the difference between these players with greater than 1 − β power. We have shown in this chapter that the more similar the two players are, the more difficult it is to differentiate between them; as such the choice of ∆ directly dictates the value of n needed. For example, with α = 0.05, β = 0.05 and ∆ = 0.1, using Figure 8.6 the required n ≈ 1300 games.

Table 8.4: Some parameter combinations from Figure 8.6

α      β      ∆      n
0.01   0.01   0.1    2600
0.05   0.05   0.05   8000
0.05   0.05   0.1    1300
0.1    0.1    0.2    250
0.2    0.2    0.11   400

Choosing n - Number of Games

If resource limitations are a primary consideration, then the value of n can be set directly. In doing so, Figure 8.6 can be used to estimate at which value of ∆ the 1 − β power levels are achieved. If n = 1000 games with α = 0.3, β = 0.3 then, from the figure, ∆ = 0.05. That is to say that when δ ≥ 0.05 we can expect up to 30% type II errors. Previously it was noted that AlphaGoZero conducted 400 games per evaluation. Using Figure 8.6 we can infer that a difference between both players of ∆ ≈ 0.08 would be identified only 70% of the time, assuming α = 0.3, β = 0.3. Other possible variations of these parameters are: with δ ≈ 0.15 the difference could be identified 90% of the time using α = 0.1, β = 0.1, and for α = 0.05, β = 0.05 then ∆ ≈ 0.2.

Figure 8.6: The relationship between n and δ for typical β values. This plot is obtained by using the method in Section 8.3.5 and incrementing n until the required value of β is obtained; n = F(α, β, δ).

8.3.7 The Fixed-n Prediction Method

The method covered in this section uses a fixed number of games, n, regardless of how different the players actually are. Even if one player is significantly outperforming the other during the competition, the tournament cannot be stopped early under this method without impacting the error rates. In the next section we modify this method to permit early-stopping of the competition if one player is clearly better than the other before the allocated number of games is completed. To conduct fixed-n hypothesis testing to compare two players:

• Set the desired permissible type I and II error rates, α, β.

• Set the desired minimum probability difference ∆ between the two players at which the error rates are to be achieved.

• Either calculate the required number of games n or use Figure 8.6 to obtain an estimate.

• Conduct a tournament of n games and obtain the number of wins w for one of the players.

• Obtain the confidence interval [L, U] using w and n, e.g. with the Wilson method from Equations 8.6 and 8.7.

• If L > 0.5 or U < 0.5 declare with confidence that the player with the higher win-rate is better. Otherwise declare that testing was insufficient to differentiate the players.
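
A hedged end-to-end example (our addition) of this fixed-n procedure, taking n = 1300 from Table 8.4 for α = β = 0.05 and ∆ = 0.1, and simulating game outcomes with a hidden win probability in place of real players; it reuses the predict_better_player sketch from Section 8.3.3.

    import random

    alpha, n = 0.05, 1300            # from Table 8.4 with beta = 0.05 and delta = 0.1
    p_true = 0.57                    # hidden from the method; used only by the simulator
    w = sum(random.random() < p_true for _ in range(n))
    print(predict_better_player(w, n, z=1.96))   # usually 'player' for this p_true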

8.4 Stopping a Contest Early

We have shown a method to determine the better of two players with hidden winning probabilities of p and q = 1 − p by conducting an experiment consisting of a competition of n games and using the win-rate w/n to make a prediction with some confidence. Prior to conducting the experiment, decisions are required as to the acceptable error rate for type I, α, and type II, β, errors. These correspond respectively to the confidence 1 − α and the power 1 − β. In addition to the confidence and power, the desired prediction fidelity ∆ = |p − q| is also required unless n is known in advance; both can be obtained from Figure 8.6. Figure 8.5 also shows that as δ → 1, β → 0, demonstrating that the larger the difference between the two players, the more likely a correct prediction will be made. If one player is significantly better than the other and this was known in advance then a smaller number of games could be played, however this information is not available. If, however, one player significantly outperforms the other player while the competition is in progress, we consider whether some threshold of comparative performance can permit early-stopping of the competition while still retaining the required confidence.

If the method outlined in the previous section is naively modified to check every f games, additional type I errors will be introduced. Every time a check occurs, there is some chance that a type I error is made, resulting in more opportunities for the tournament to be stopped. This issue is highlighted in statistical sequential stopping literature [178]. In this section we adapt the previous fixed-n method to allow stopping a tournament early while still maintaining the required error specifications. By using our proposed sequential stopping method significant time can be saved by only playing the requisite number of games, instead of always playing n games.

8.4.1 Accumulating Error

The methods used for obtaining confidence intervals are well established in the literature and they ensure that the type I error rate, T1, is held below α, as the parameter α, or some derivation of it (z), is used directly in the confidence bounds formula. For clarity we refer to α and β as the type I and II error parameters, i.e. the desired error rates, and T1 and T2 as the observed error rates. In situations where the players have significantly different performance, it can be inferred from the previous plots that as the difference increases, i.e. δ → 1, the confidence bounds L and U are more likely to cross the prediction boundary of 0.5 before the set number of games in a tournament, n, are played. If the tournament were to be stopped as soon as the bounds cross the prediction boundary, an additional parameter is required, f, being the number of games to be played between checking the bounds, i.e. the check period. To stop a tournament early, the condition for making a fixed-n prediction also becomes the stopping condition for the tournament, i.e. when the lower bound L > 0.5 or the upper bound U < 0.5 the tournament is halted. Algorithm 1 shows the pseudo-code for conducting a number of tournaments with sequential stopping and returning the observed error rates.

Over the course of a tournament of n games, as the stopping condition is checked multiple times, the actual error rate is higher. Figure 8.7 shows the effects on the type I error, T1, of naively applying the sequential stopping conditions, L > 0.5 or U < 0.5, to a tournament of n games. Figure 8.7 demonstrates that the more frequently (the lower the period) the stopping condition is tested, the higher the type I error. Figure 8.8 shows the effects on the type II error, T2, of naively applying the sequential stopping conditions to a tournament. Figure 8.8 demonstrates that the more frequently the stopping condition is checked, the better the type II error rate, since a type II error occurs when the players are not equal, p ≠ 0.5, but a prediction wasn't made. Enabling sequential stopping rules provides more opportunity to make a prediction, resulting in an improved type II error rate as more checks are conducted.

8.4.2 Accounting for Additional Error

The previous section demonstrated that T1 and T2 performed differently when implementing sequential stopping rules for predicting if one player is better than another. This section outlines a modification to our method to account for these variances.

Algorithm 1 Testing the type I and II error rates with sequential stopping. p is a player's true winning probability. 1 − α is the desired confidence level. n is the number of games to be played per tournament, selected to achieve the desired 1 − β power. t is the number of tournaments to be played. f is the number of games between checking the stopping conditions, i.e. the checking period.

1:  procedure SeqTournaments(p, α, n, t, f)
2:    T1 ← 0                                    ▷ Type I error count
3:    T2 ← 0                                    ▷ Type II error count
4:    for i ← 1 to t do                         ▷ Play t tournaments
5:      w ← 0; predicted_flag ← False
6:      for k ← 1 to n do                       ▷ Play at most n games
7:        play game with p probability of winning
8:        w ← w + 1 if winner of game is player with p
9:        if k mod f == 0 then                  ▷ Every f games
10:         L, U ← confidence bounds using α, w, k
11:         if L > 0.5 or U < 0.5 then          ▷ Check stopping conditions
12:           predicted_flag ← True
13:           T1 ← T1 + 1 if p = q
14:           break                             ▷ Stop tournament
15:         end if
16:       end if
17:     end for
18:     if not(predicted_flag) and p ≠ q then
19:       T2 ← T2 + 1                           ▷ T2 error committed, didn't predict
20:     end if
21:   end for
22:   return [T1/t, T2/t]                       ▷ Error rates per tournament
23: end procedure
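
A runnable Python sketch (our addition) of Algorithm 1, reusing the wilson_bounds helper sketched in Section 8.3.2; note that the pseudo-code's α is passed here as the corresponding z-score.

    import random

    def seq_tournaments(p: float, z: float, n: int, t: int, f: int):
        """Return the observed (T1, T2) error rates per tournament for Algorithm 1."""
        t1 = t2 = 0
        for _ in range(t):                       # play t tournaments
            w = 0
            predicted = False
            for k in range(1, n + 1):            # play at most n games
                w += random.random() < p         # the player wins with probability p
                if k % f == 0:                   # every f games
                    L, U = wilson_bounds(w, k, z)
                    if L > 0.5 or U < 0.5:       # stopping condition met
                        predicted = True
                        if p == 0.5:             # players equal: a type I error
                            t1 += 1
                        break
            if not predicted and p != 0.5:       # players differ but no prediction made
                t2 += 1                          # a type II error
        return t1 / t, t2 / t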

To implement sequential stopping in a tournament of n games, the tournament will be stopped early when:

L > 0.5 or U < 0.5

where:

• L and U are the lower and upper confidence bounds obtained with 1 − α confidence after the i-th game;

• n is chosen to obtain the required 1 − β power at the required δ, the difference in probability of winning between the two players δ = |p − q| where p and q are the two players’ probabilities of winning against each other.

The method we apply to obtain the minimum confidence and power levels whilst implementing sequential stopping is to substitute A for α, where A is a more restrictive error rate (enhanced confidence) to account for the increased T1, and then to recalculate n to obtain β at δ.

Figure 8.7: Type I error (T1) increases as the period between checking the stopping conditions, L > 0.5 or U < 0.5, decreases. The more frequently the conditions are checked the worse the type I performance. L and U were obtained using Wilson confidence bounds. The parameters used for this plot, α = 0.05, n = 1300, were shown previously to produce error rates T1 ≲ 0.05 and T2 ≲ 0.05, however this plot demonstrates that the observed T1 error is dependent on f when using sequential stopping. When only 1 check is conducted at f = 1300 we obtain the expected performance of T1 ≲ 0.05, however when checking more frequently the T1 accuracy deteriorates. This plot was obtained by performing 20000 tournaments of n = 1300 games with p = 0.5, α = 0.05 using the method from Algorithm 1 to obtain the average T1 error per tournament.

Substituting A for α

We hypothesise that there exists some function A = A(α) which, for a set of parameters, will result in T1 ≲ α when employing the sequential stopping method. The user sets the required values for α, β and δ and chooses how frequently the stopping condition will be checked, f. The value for n is obtained using the fixed-n method covered in the previous section. The desired confidence level is then achieved by substituting A for α with all other parameters fixed. We achieve the required confidence level α by substituting A and incrementally decreasing A until T1 < α. Algorithm 2 explains the process for finding A. This method works by essentially making the substituted confidence level A more strict than the desired confidence α, accounting for the accumulated error arising from periodically checking the stopping condition. If the frequency of checking increases then A needs to be more stringent.

Figure 8.8: Type II error (T2) decreases as the period between checking the stopping conditions, L > 0.5 or U < 0.5, decreases. The more frequently the conditions are checked the better the type II performance. L and U were obtained using Wilson confidence bounds. The parameters used for this plot, α = 0.05, n = 1300, were shown previously to produce error rates T1 ≲ 0.05 and T2 ≲ 0.05, however this plot demonstrates that the observed T2 error is dependent on f when using sequential stopping and achieves an error rate better than the desired rate of β. When only 1 check is conducted at f = 1300 we obtain the expected performance of T2 ≲ 0.05, however when checking more frequently the T2 accuracy improves. This plot was obtained by performing 20000 tournaments of n = 1300 games with p = 0.5, α = 0.05 and obtaining the average T2 error per tournament.

Finding N to Achieve β

Figure 8.8 demonstrates that the more frequently the stopping condition is checked, the lower the type II error; however, we have also noted previously that the higher the confidence level, the higher the type II errors. These two effects do not necessarily cancel each other out, as such we also introduce N, a substitute value for n. To obtain A we used the value of n obtained from the preceding fixed-n problem, however now that A is substituting for α, the value of n won't necessarily provide β. The process we use to determine N is essentially the same process used in the preceding section for n and is explained in Algorithm 3. Table 8.5 provides a lookup table for a number of common parameters for the sequential stopping method outlined in this section.

Figure 8.9: The average number of games taken using sequential stopping methods, checking every 100 games. Note that N = 1620 games, however for many values of δ significantly fewer games are conducted on average before making a prediction.

8.4.3 Putting it into Practice

The procedure for performing sequential stopping hypothesis testing is almost as straightforward as for fixed-n. Table 8.5 contains a set of common values for α, β, δ and f, and the procedure is as follows:

• Obtain the corresponding values for A and N.

• Conduct a tournament of N games checking the stopping conditions every f games.

• The stopping conditions are L > 0.5 or U < 0.5, where L, U are the 1 − A confidence bounds.

• If the tournament is halted then declare, with confidence, that the player with the higher win-rate is better. Otherwise declare that the players' performance could not be differentiated.
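
A worked example (our addition) of this procedure using the α = β = 0.05, δ = 0.1, f = 100 row of Table 8.5 (A = 0.01, N = 1620), again simulating the games and reusing the wilson_bounds sketch from Section 8.3.2:

    import random
    from scipy.stats import norm

    A, N, f = 0.01, 1620, 100
    z_A = norm.ppf(1 - A / 2)          # two-sided z-score for the substituted level A
    p_true = 0.6                       # simulator only; unknown in practice
    w = 0
    for k in range(1, N + 1):
        w += random.random() < p_true
        if k % f == 0:
            L, U = wilson_bounds(w, k, z_A)
            if L > 0.5 or U < 0.5:
                better = "player" if L > 0.5 else "opponent"
                print("stopped after", k, "games; better:", better)
                break
    else:
        print("could not differentiate the players after", N, "games")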

Algorithm 2 Finding A for sequential stopping. α is the desired type I error, T1 is the observed type I error. n is the maximum number of games to be played per tournament. t is the number of tournaments. f is the number of games between checking the stopping conditions (period).

1: procedure FindA(α, n, t, f)
2:   A ← T1 ← α                                  ▷ A and T1 initialised to α
3:   while T1 ≥ α do
4:     T1, T2 ← SeqTournaments(0.5, A, n, t, f)  ▷ T1 obtained
5:     decrement A                               ▷ A reduced by some proportion
6:   end while
7:   return A                                    ▷ A is the required parameter
8: end procedure

Algorithm 3 Finding N to achieve β for given parameters when using sequential stopping. β is the desired type II error for a player difference of δ. A is the adjusted 1 − α confidence parameter from FindA. ns is the starting number of games required to achieve β given α, obtained using the fixed-n methods covered previously, see Figure 8.6. t is the number of tournaments. f is the number of games between checking the stopping conditions (period).

1: procedure FindN(δ, β, A, ns, t, f)
2:   N ← ns
3:   T2 ← β
4:   p ← δ/2 + 0.5                               ▷ Convert δ to p
5:   while T2 ≥ β do
6:     T1, T2 ← SeqTournaments(p, A, N, t, f)    ▷ T2 obtained
7:     increment N                               ▷ N increased by some proportion
8:   end while
9:   return N                                    ▷ N is the required parameter
10: end procedure
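
A Python sketch (our addition) of Algorithm 2, built on the seq_tournaments sketch above; the factor used to decrement A each iteration is our own choice and is not specified in the pseudo-code. Algorithm 3 can be translated in the same way by sweeping N upwards until the observed T2 falls below β.

    from scipy.stats import norm

    def find_A(alpha: float, n: int, t: int, f: int) -> float:
        """Decrease the substituted error rate A until the observed T1 drops below alpha."""
        A = alpha
        while True:
            z_A = norm.ppf(1 - A / 2)                    # z-score for the candidate A
            t1, _ = seq_tournaments(0.5, z_A, n, t, f)   # p = 0.5, so only T1 can occur
            if t1 < alpha:
                return A
            A *= 0.9                                     # make A more strict and retry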

8.5 Summary

Hypothesis testing is rarely utilised in the domain of computer game playing, however we have provided Table 8.5, populated with common parameters for use in sequential stopping approaches. Figure 8.6 provides the relationship between the number of games, player similarity and type I and II error ratios, and permits easy selection of parameters to perform fixed-n hypothesis tests.

Table 8.5: Look-up table of common parameters for our sequential stopping method. The three user-defined performance specifications are α, β and δ, being the maximum Type I and Type II error rates and the minimum probability difference between the players at which these error rates are required to be achieved. f is how frequently the stopping conditions are tested, A is the value to substitute for α in the confidence-bounds formula and N is the maximum number of games required to be played. The final column, Average n-games, indicates how many games are actually played on average using this method to make the prediction with the required level of confidence.

α      β      δ      f      A       N       Average n-games
0.3    0.3    0.1    20     0.06    500     267.6
0.3    0.3    0.1    100    0.21    280     213.1
0.2    0.2    0.1    20     0.02    880     454.3
0.2    0.2    0.1    100    0.06    650     381.9
0.1    0.1    0.1    20     0.01    1320    611.9
0.1    0.1    0.1    100    0.02    1120    589.3
0.05   0.05   0.1    20     0.005   1820    751.8
0.05   0.05   0.1    100    0.01    1620    735.0
0.01   0.01   0.1    20     0.001   3140    1159.5
0.01   0.01   0.1    100    0.001   3020    1162.2

Chapter 9

Conclusion

9.1 Computer Game Playing Today

Games are essentially an interaction between decision makers in an environment, and as such many real-world problems fall within the remit of game theory. A number of machine-learning methods exist which can be used by an artificial agent operating in a game environment, which may be real or virtual. To make an informed decision an artificial agent can use either a knowledge-based or an exploration-based approach. Traditionally, a knowledge-based approach was hard-coded by the programmer; more recently, neural-network methods have been used to learn this knowledge. An exploration-based approach is suitable when the environment is not known in advance, and can be achieved by conducting a search of the state-space.

We discussed in Chapter 2 the requirement to combine knowledge-based and exploration-based approaches for complex problems, and this is what the current state-of-the-art generic game playing agent does, using a combined neural-network/tree-search architecture. A neural-network is used to guide a search of the state-space, with the aim of prioritising the exploration of the most promising branches. The neural-network in the current state-of-the-art system learns through self-play, generating its own experiences with each move that it makes, which it then uses to train itself. We have also discussed how experience-creation via self-play is resource intensive, literally costing millions of dollars in some large-scale experiments. The findings presented in this thesis focus on this self-play bottleneck so that generating experiences, and learning from them, is as efficient as possible.

9.2 Key Findings

Without prior knowledge of the game, a neural-network/tree-search agent first explores the game environment randomly, and one of our first insights in Chapter 4 is that during this initial period the current state-of-the-art system is creating experiences which in no way reflect the dynamics of the environment. In the early training cycles the vast majority of created experiences are completely uninformed, resulting in wasted processing effort and erroneous exemplar data for the neural-network. It is not until the agent discovers a state which results in an environmental reward (a terminal state) that experiences begin to be created with actual feedback from the environment. Because the reward is the only feedback from the environment, the neural-network will learn from the end of the game towards the beginning, and current practice does not account for this.

A key contribution of our research, presented in Chapter 4, is the end-game-first curriculum, a new approach which seeks to enforce the creation of experiences only where they are informed by the environment, directly or indirectly. We have shown that a neural-network/tree-search agent, inspired by the state-of-the-art methods, learns faster when employing our end-game-first curriculum. We first show that an end-game-first curriculum is effective by using a hand-crafted curriculum; however, the limitation of this method is that it requires some prior knowledge of the game, reducing the generality of the system. We then further develop the end-game-first curriculum in Chapter 5 by demonstrating a novel method for automating the curriculum, which addresses the loss of generality incurred by a hand-crafted curriculum. We show that our automated end-game-first curriculum speeds up learning without requiring any prior knowledge when compared with a state-of-the-art-inspired agent.

The key theme of our research is to maximise the value of the experiences created during self-play, and we have focussed on the most inefficient part of the training cycle: the early stages of training. We highlighted that almost all experiences generated in the very first training cycle are uninformed, making this period of training the most inefficient. In Chapter 6 we presented two new methods which we refer to as "priming the neural-network", since their largest impact is on the earliest training cycles. The first method, terminal priming, is the creation of additional experiences from terminal states discovered in the course of generating experiences, something that is overlooked by the current methods. Terminal priming is found to be significantly beneficial where a game's terminal-value-function is complex, as in Connect-4. We also found that for simple terminal-value-functions terminal priming does not hinder learning, making it a low-overhead improvement. The second priming method, incremental-buffering, simply creates fewer experiences in the early training cycles, allowing the agent to cycle through the self-play/learning process faster and shortening the time spent in the most inefficient stages.

We have also conducted preliminary studies, in Chapter 7, into whether methods used to combat overfitting in supervised-learning problems can be adapted to reinforcement-learning. We found that early-stopping methods, which to our knowledge have not been applied to reinforcement-learning, can improve the efficiency of learning. Our preliminary experiments using spatial drop-out in conjunction with batch-normalisation indicate that further efficiencies may be obtained with this approach. The prevailing view is that using batch-normalisation overrides the need for drop-out; however, our results indicate that this may not be the case.
Confidence intervals are not commonly reported in the machine-learning literature, and some statisticians have indicated that statistical confidence methods are poorly understood by many. We have sought to demystify the use of confidence intervals and provided empirical examples as they pertain specifically to game playing. We have also presented a method which, to our knowledge, is new, and which allows the early stopping of a contest between two agents if one player is substantially better than the other, while retaining the required confidence level.

9.3 Further Work

We have shown that there are a number of further research areas which may result in improved self-play efficiency. For specialist applications, researching hand-crafted curricula may result in significantly improved learning whilst also potentially providing further insights into the dynamics of a problem. For general applications we have highlighted further research opportunities for adaptive hyper-parameters such as the buffer replacement rate, the learning rate and the early-stopping patience parameter, in addition to investigating better mechanisms for setting the automated curriculum. The more information that can be extracted from generated self-play experiences, the more cost effective these state-of-the-art game-playing agents will be, making them available to a wider range of applications.

9.4 Longevity of Findings

In closing, the approach of using a neural-network to inform a search will, in our view, remain necessary for difficult decision-making problems. This method has already been applied to a diverse cross-section of games and continues to demonstrate that it can learn to make super-human decisions tabula-rasa¹, through self-play. Fundamental to this approach is that if a combined neural-network/tree-search agent has no prior knowledge of the problem space then it has to learn from environmental rewards. When an agent is dependent on learning from environmental rewards, it must learn the end-game first, and consequently learning will progress from the end-game to the start of the game. In this case, the end-game-first curriculum prevents uninformed experiences from being created, which speeds up learning. As such, the contributions presented in this thesis are likely to retain their utility regardless of future advancements in tree-search or neural-networks.

¹ Tabula-rasa means from a blank slate, i.e. without any prior knowledge.

Appendix A

Parameters For the Players

Tables A.1 and A.2 show the parameters used for the automated curriculum experiment in Chapter 5.

Parameter                       Value                               Comment
Resnet Blocks                   10                                  Number of Residual Network blocks
Batch Size                      350                                 Training batch size
Value function size             256                                 Neurons before vθ(s)
CNN Filters                     512                                 How many filters are used
Simulations per action          200                                 How many nodes are added to the tree during the MCTS tree search
Drop draw chance                0.5                                 If a game is a draw then it has this chance of being dropped; used to mitigate a game with a lot of draws filling the buffer with too many drawn experiences
Min Buffer Size                 350; 500; 750; 1000; 1500; 2000     Minimum number of games in the training buffer to stop self-play and commence training; for Chess (Racing-Kings) the number of required games was incremented to allow training to commence earlier
Experience replacement          0.1                                 Proportion of experiences removed from the training buffer after self-play
Number of probabilistic turns   30                                  Number of the player's actions which are selected probabilistically from the policy before switching to selecting moves greedily

Table A.1: Parameters for the Racing-Kings player. The minimum buffer size parameter was incremented each training cycle for Racing-Kings, instead of the fixed size used for Reversi. Without the incremental-buffer priming from Chapter 6, the time required to obtain results was substantially longer.
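These parameters translate naturally into a configuration object. The sketch below is our own illustrative rendering in Python (the field names are assumptions, not taken from the thesis code); it also shows how the incremented minimum-buffer schedule for Racing-Kings might be applied per training cycle.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PlayerConfig:
    """Illustrative container for the Table A.1 parameters (field names are ours)."""
    resnet_blocks: int = 10
    batch_size: int = 350
    value_head_neurons: int = 256          # neurons before v_theta(s)
    cnn_filters: int = 512
    simulations_per_action: int = 200      # MCTS nodes added per move
    drop_draw_chance: float = 0.5
    experience_replacement: float = 0.1    # proportion of the buffer replaced after self-play
    probabilistic_turns: int = 30          # moves sampled from the policy before playing greedily
    # Minimum games in the experience-buffer before training commences; the schedule
    # is stepped through one value per training cycle (incremental-buffer priming).
    min_buffer_schedule: List[int] = field(
        default_factory=lambda: [350, 500, 750, 1000, 1500, 2000])

def min_buffer_size(config: PlayerConfig, training_cycle: int) -> int:
    """Minimum buffer size for a given training cycle, holding the final value
    once the schedule is exhausted (the fixed Reversi setting of 2000 is the
    one-element schedule [2000])."""
    schedule = config.min_buffer_schedule
    return schedule[min(training_cycle, len(schedule) - 1)]
```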


Parameter                       Value    Comment
Resnet Blocks                   10       Number of Residual Network blocks
Batch Size                      350      Training batch size
Value function size             256      Neurons before vθ(s)
CNN Filters                     512      How many filters are used
Simulations per action          200      How many nodes are added to the tree during the MCTS tree search
Drop draw chance                0.1      If a game is a draw then it has this chance of being dropped; used to mitigate a game with a lot of draws filling the buffer with too many drawn experiences
Min Buffer Size                 2000     Minimum number of games in the training buffer to stop self-play and commence training
Experience replacement          0.1      Proportion of experiences removed from the training buffer after self-play
Number of probabilistic turns   16       Number of the player's actions which are selected probabilistically from the policy before switching to selecting moves greedily

Table A.2: Parameters for the Reversi player.

Bibliography

[1] J. N. Shurkin, Engines of the Mind: The Evolution of the Computer from Main- frames to Microprocessors. W. W. Norton & Company, 1996.1

[2] A. Williams, History of Digital Games: Developments in Art, Design and Inter- action. CRC Press, 2017.1

[3] A. Newell, J. Shaw, and H. Simon, “Chess-Playing Programs and the Problem of Complexity,” IBM Journal, vol. 2, no. 4, pp. 320–335, Oct. 1958.1, 13

[4] J. Schaeffer and A. Plaat, “Kasparov versus deep blue: The re-match,” ICCA Journal, vol. 20, no. 2, pp. 95–101, 1997.1

[5] T. D. Kelley and L. N. Long, “Deep blue cannot play checkers: The need for generalized intelligence for mobile robots,” Journal of Robotics, 2010.1

[6] M. Świechowski, H. Park, J. Mańdziuk, and K.-J. Kim, “Recent advances in general game playing,” The Scientific World Journal, vol. 2015, 2015. 1, 3, 19, 24

[7] M. Minsky, “Steps toward artificial intelligence,” Proceedings of the IRE, vol. 49, no. 1, pp. 8–30, 1961.1

[8] K. Salen and E. Zimmerman, Rules of Play: Game Design Fundamentals, ser. The MIT Press. The MIT Press, 2003.2

[9] F. Carmichael, A Guide to Game Theory. Harlow, Essex, England ; New York: Financial Times Prentice Hall, 2005.2, 185

[10] M. Heule and L. Rothkrantz, “Solving games,” Science of Computer Program- ming, vol. 67, no. 1, pp. 105–124, 2007.2, 13

[11] G. Chaslot, S. Bakkes, I. Szita, and P. Spronck, “Monte-Carlo Tree Search: A New Framework for Game AI.” in AIIDE, 2008. [Online]. Available: http://www.aaai.org/Papers/AIIDE/2008/AIIDE08-036.pdf2

[12] K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neural Networks, vol. 2, no. 5, pp. 359–366, 1989.2

[13] H. van den Herik, J. W. Uiterwijk, and J. van Rijswijck, “Games solved: Now and in the future,” Artificial Intelligence, vol. 134, no. 1-2, pp. 277–311,


Jan. 2002. [Online]. Available: http://linkinghub.elsevier.com/retrieve/pii/S0004370201001527 3, 11, 14, 15, 18, 184, 185

[14] M. Müller, “Computer go,” Artificial Intelligence, vol. 134, pp. 145–179, 2002. 3

[15] R. Coulom, “Efficient selectivity and backup operators in monte-carlo tree search,” in International Conference on Computers and Games. Springer, 2006, pp. 72–83. [Online]. Available: http://link.springer.com/chapter/10.1007/978-3- 540-75538-8 73

[16] ——, “Monte-Carlo Tree Search in Crazy Stone,” in Proc. Game Prog. Workshop, Tokyo, Japan, 2007, pp. 74–75. [Online]. Available: https: //www.remi-coulom.fr/Hakone2007/Hakone.pdf3

[17] ——, “Computing Elo Ratings of Move Patterns in the Game of Go,” in Proc. Game Prog. Workshop, Tokyo, Japan, 2007. [Online]. Available: https://www.remi-coulom.fr/Hakone2007/Hakone.pdf3

[18] C. J. Maddison, A. Huang, I. Sutskever, and D. Silver, “Move Evaluation in Go Using Deep Convolutional Neural Networks,” in International Conference on Learning Representations, Apr. 2015, pp. 1–8. [Online]. Available: http://arxiv.org/abs/1412.6564v23

[19] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driess- che, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Diele- man, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, “Mastering the Game of Go with Deep Neural Networks and Tree Search,” Nature, vol. 529, no. 7587, pp. 484–489, Jan. 2016.3,5, 12, 18, 19, 29, 50, 129

[20] M. Genesereth, N. Love, and B. Pell, “General Game Playing: Overview of the AAAI Competition,” AI Magazine, vol. 26, no. 2, p. 62, Jun. 2005. [Online]. Available: http://www.aaai.org/ojs/index.php/aimagazine/article/view/18133, 19

[21] H. Finnsson and Y. Björnsson, “Learning Simulation Control in General Game-Playing Agents,” in AAAI, vol. 10, 2010, pp. 954–959. [Online]. Available: http://www.aaai.org/ocs/index.php/AAAI/AAAI10/paper/download/1892/2124 3, 19

[22] S. Legg and M. Hutter, “A Formal Measure of Machine Intelligence,” May 2006. 4, 19

[23] A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun, “Dermatologist-level classification of skin cancer with deep neural networks,” Nature, vol. 542, no. 7639, pp. 115–118, Feb. 2017. 4

[24] P. Jurmeister, M. Bockmayr, P. Seegerer, T. Bockmayr, D. Treue, G. Montavon, C. Vollbrecht, A. Arnold, D. Teichmann, K. Bressem, U. Schüller, M. v. Laffert, K.-R. Müller, D. Capper, and F. Klauschen, “Machine learning analysis of DNA methylation profiles distinguishes primary lung squamous cell carcinomas from head and neck metastases,” Science Translational Medicine, vol. 11, no. 509, Sep. 2019. [Online]. Available: https://stm.sciencemag.org/content/11/509/eaaw8513 4

[25] K. Kourou, T. P. Exarchos, K. P. Exarchos, M. V. Karamouzis, and D. I. Fotiadis, “Machine learning applications in cancer prognosis and prediction,” Computational and Structural Biotechnology Journal, vol. 13, pp. 8–17, 2015. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/ S20010370140004644

[26] R. Cuocolo, M. B. Cipullo, A. Stanzione, L. Ugga, V. Romeo, L. Radice, A. Brunetti, and M. Imbriaco, “Machine learning applications in prostate cancer magnetic resonance imaging,” European Radiology Experimental, vol. 3, Aug. 2019. [Online]. Available: https://www.ncbi.nlm.nih.gov/pmc/articles/ PMC6686027/4

[27] S. Liu, H. Zheng, Y. Feng, and W. Li, “Prostate cancer diagnosis using deep learning with 3d multiparametric MRI,” in Prostate cancer diagnosis using deep learning with 3D multiparametric MRI, S. G. Armato and N. A. Petrick, Eds., Orlando, Florida, United States, Mar. 2017, p. 1013428. [Online]. Available: http: //proceedings.spiedigitallibrary.org/proceeding.aspx?doi=10.1117/12.22771214

[28] W. Muhammad, G. R. Hart, B. Nartowt, J. J. Farrell, K. Johung, Y. Liang, and J. Deng, “Pancreatic cancer prediction through an artificial neural network,” Frontiers in Artificial Intelligence, vol. 2, 2019.4

[29] P. Ferroni, F. M. Zanzotto, S. Riondino, N. Scarpato, F. Guadagni, and M. Roselli, “Breast cancer prognosis using a machine learning approach,” Can- cers, vol. 11, no. 3, 2019.4

[30] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis, “Mastering the game of go without human knowledge,” Nature, vol. 550, no. 7676, pp. 354–359, 2017. [Online]. Available: http://www.nature.com/articles/nature242704,5, 20, 50, 129, 131

[31] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis, “Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm,” arXiv:1712.01815 [cs], Dec. 2017. [Online]. Available: http://arxiv.org/abs/1712.01815 5, 45

[32] J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel, T. Lillicrap, and D. Silver, “Mastering atari, go, chess and shogi by planning with a learned model,” arXiv:1911.08265 [cs, stat], 2019. [Online]. Available: http://arxiv.org/abs/1911.082655

[33] K. Sato, C. Young, and D. Patterson. (2017) An in-depth look at google’s first tensor processing unit (TPU). [Online]. Avail- able: https://cloud.google.com/blog/products/gcp/an-in-depth-look-at-googles- first-tensor-processing-unit-tpu/5

[34] E. Barron, “Game theory: An introduction, second edition,” in Game Theory. John Wiley & Sons, Inc., 2013, pp. 432–435.6

[35] J. v. Neumann, O. Morgenstern, and A. Rubinstein, Theory of Games and Economic Behavior (60th Anniversary Commemorative Edition). Princeton University Press, 1944. [Online]. Available: http://www.jstor.org/stable/j. ctt1r2gkx6

[36] A. M. Turing, “Computing Machinery and Intelligence,” Mind, vol. 59, no. 236, pp. 433–460, 1950. [Online]. Available: http://www.jstor.org/stable/22512996, 13

[37] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with Deep Reinforcement Learning,” arXiv preprint arXiv:1312.5602, 2013. [Online]. Available: https://arxiv.org/abs/1312.56026

[38] K. Sharma, “Automation strategies,” in Overview of Industrial Process Automa- tion. Elsevier, 2011, pp. 53–62.6

[39] J. G. V. Habets, M. Heijmans, M. L. Kuijf, M. L. F. Janssen, Y. Temel, and P. L. Kubben, “An update on adaptive deep brain stimulation in parkinson’s disease,” Movement Disorders, vol. 33, no. 12, pp. 1834–1843, 2018. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/mds.1157

[40] D. Cella, C. Nowinski, A. Peterman, D. Victorson, D. Miller, J.-S. Lai, and C. Moy, “The neurology quality of life measurement initiative,” Archives of physical medicine and rehabilitation, vol. 92, no. 10, pp. S28–S36, 2011. [Online]. Available: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3193028/7

[41] A. Ramirez-Zamora, J. Giordano, A. Gunduz, J. Alcantara, J. N. Cagle, S. Cernera, P. Difuntorum, R. S. Eisinger, J. Gomez, S. Long, B. Parks, J. K. Wong, S. Chiu, B. Patel, W. M. Grill, H. C. Walker, S. J. Little, R. Gilron, G. Tinkhauser, W. Thevathasan, N. C. Sinclair, A. M. Lozano, T. Foltynie, A. Fasano, S. A. Sheth, K. Scangos, T. D. Sanger, J. Miller, A. C. Brumback, P. Rajasethupathy, C. McIntyre, L. Schlachter, N. Suthana, C. Kubu, L. R. Sankary, K. Herrera-Ferrá, S. Goetz, B. Cheeran, G. K.

Steinke, C. Hess, L. Almeida, W. Deeb, K. D. Foote, and M. S. Okun, “Proceedings of the seventh annual deep brain stimulation think tank: Advances in neurophysiology, adaptive DBS, virtual reality, neuroethics and technology,” Frontiers in Human Neuroscience, vol. 14, 2020. [Online]. Available: https://www.frontiersin.org/articles/10.3389/fnhum.2020.00054/full7

[42] R. Korf, “Does deep blue use AI?” AAAI Technical Report, p. 2, 1997. 12

[43] B. von Stengel and T. L. Turocy, “What Is Game Theory?” Encyclopedia of Information Systems: EJ, vol. 2, p. 403, 2003. 12

[44] H. de Garis, “Artificial brains,” in Artificial General Intelligence, B. Goertzel and C. Pennachin, Eds. Springer Berlin Heidelberg, 2007, pp. 159–174. [Online]. Available: http://link.springer.com/10.1007/978-3-540-68677-4 5 12

[45] C. Pennachin and B. Goertzel, “Contemporary Approaches to Artificial General Intelligence,” in Artificial general intelligence. Springer, 2007, pp. 1–30. [Online]. Available: http://link.springer.com/content/pdf/10.1007/978-3-540- 68677-4 1.pdf 13

[46] P. D. Straffin, M. J. McAsey, A. Benjamin, L. H. Lange, G. Berzsenyi, M. E. Saul, R. K. Guy, R. A. Honsberger, and J. Wyzkoski Weiss, Game Theory and Strategy, ser. Anneli Lax New Mathematical Library. American Mathematical Society, 1993, vol. 36. 13, 20

[47] S. J. Russell, P. Norvig, and E. Davis, Artificial Intelligence: A Modern Approach, 3rd ed., ser. Prentice Hall series in artificial intelligence. Upper Saddle River: Prentice Hall, 2010. 13, 22

[48] D. Clark, “Bluer Than Blue: How IBM Managed It [Interview],” IEEE Concur- rency, vol. 5, no. 3, pp. 8–11, Jul. 1997. 13, 16

[49] S. Legg and M. Hutter, “A Collection of Definitions of Intelligence,” Frontiers in Artificial Intelligence and applications, vol. 157, p. 17, 2007. 13, 19

[50] J. Von Neumann and O. Morgenstern, Theory of games and economic behavior, 60th ed., ser. Princeton classic editions. Princeton University Press, 1944. 13

[51] C. E. Shannon, “Programming a Computer for Playing Chess,” Philosophical Magazine, vol. 41, pp. 256–275, 1950. [Online]. Available: http://www. bibsonomy.org/bibtex/28d7d487f1632d05b56788250f1d60b72/idsia 14, 16, 18

[52] J. Burmeister and J. Wiles, “The challenge of Go as a domain for AI research: a comparison between Go and chess,” in Proceedings of the Third Australian and New Zealand Conference on Intelligent Information Systems, 1995. ANZIIS-95, Nov. 1995, pp. 181–186. 16

[53] C. J. Maddison, A. Huang, I. Sutskever, and D. Silver, “Move evaluation in go using deep convolutional neural networks,” arXiv preprint arXiv:1412.6564, 2014. [Online]. Available: https://arxiv.org/abs/1412.6564 16, 18

[54] M. Świechowski and J. Mańdziuk, “Self-Adaptation of Playing Strategies in General Game Playing,” IEEE Transactions on Computational Intelligence and AI in Games, vol. 6, no. 4, pp. 367–381, Dec. 2014. 16, 19

[55] P. Ciancarini and G. P. Favini, “Retrograde analysis of endgames,” in Proceedings of the 2010 IEEE Conference on Computational Intelligence and Games, 2010, pp. 411–418. 17

[56] K. Thompson, “Retrograde analysis of certain endgames.” ICGA Journal, vol. 9, no. 3, pp. 131–139, 1986. 17

[57] H. Iida, J. Yoshimura, K. Morita, and J. W. Uiterwijk, “Retrograde analysis of the KGK endgame in shogi: Its implications for ancient heian shogi,” in International Conference on Computers and Games. Springer, 1986, pp. 318–335. 17

[58] J. Schaeffer, One Jump Ahead : Computer Perfection at Checkers. Boston, MA: Springer US, 2009. [Online]. Available: http://link.springer.com/10.1007/978-0- 387-76576-1 17

[59] J. Schaeffer, N. Burch, Y. Bj¨ornsson, A. Kishimoto, M. M¨uller, R. Lake, P. Lu, and S. Sutphen, “Checkers Is Solved,” science, vol. 317, no. 5844, pp. 1518–1522, 2007. [Online]. Available: http://science.sciencemag.org/content/ 317/5844/1518.short 17

[60] M. Buro, “Takeshi Murakami vs Logistello,” in ICGA, vol. 20, 1997, pp. 189–193. [Online]. Available: https://pdfs.semanticscholar.org/beed/ 572f4191d5ac23305565f573256a33f3a05b.pdf 17

[61] ——, “Logistello: A Strong Learning Othello Program,” in 19th Annual Conference Gesellschaft f¨ur Klassifikation, vol. 2. Princeton, NJ: NEC Research Institute, 1995. [Online]. Available: http://citeseerx.ist.psu.edu/ viewdoc/download?doi=10.1.1.114.1746&rep=rep1&type=pdf 17

[62] P. Shotwell, “The game of go: Speculations on its origins and symbolism in ancient china,” Go World, no. 70, p. 62, 1994. 18

[63] P. H. Jin and K. Keutzer, “Convolutional Monte Carlo Rollouts in Go,” University of California, Berkeley, Berkeley, Tech. Rep., Dec. 2015. [Online]. Available: http://arxiv.org/abs/1512.03375v1;http://arxiv.org/pdf/1512.03375v1 19

[64] S. Gelly, Y. Wang, R. Munos, and O. Teytaud, “Modification of UCT with patterns in Monte-Carlo Go,” Ph.D. dissertation, INRIA, 2006. [Online]. Available: https://hal.inria.fr/inria-00117266/ 19

[65] S. Gelly and D. Silver, “Combining Online and Offline Knowledge in Uct,” in ICML ’07, ser. ACM International Conference Proceeding Series, vol. 227. ACM, 2007, pp. 273–280. 19

[66] M. Genesereth and M. Thielscher, “General Game Playing,” Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 8, no. 2, pp. 1–229, Mar. 2014. [Online]. Available: http://www.morganclaypool.com/doi/abs/10.2200/ S00564ED1V01Y201311AIM024 19

[67] Y. Bjornsson and H. Finnsson, “CadiaPlayer: A Simulation-Based General Game Player,” IEEE Transactions on Computational Intelligence and AI in Games, vol. 1, no. 1, pp. 4–15, Mar. 2009. [Online]. Available: http://ieeexplore.ieee.org/document/4804731/ 19

[68] H. Finnsson and Y. Bj¨ornsson,“Simulation-Based Approach to General Game Playing,” in Proceedings of the 23rd National Conference on Artificial Intelli- gence, ser. AAAI ’08, vol. 1. Chicago, Illinois: AAAI Press, 2008, pp. 259–264. 19

[69] C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton, “A Survey of Monte Carlo Tree Search Methods,” IEEE Transactions on Computational Intelligence and AI in Games, vol. 4, no. 1, pp. 1–43, 2012. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs all.jsp?arnumber=6145622 19, 20, 22, 23, 24, 53

[70] D. Michulke and M. Thielscher, “Neural networks for state evaluation in general game playing,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2009, pp. 95–110. [Online]. Available: http://link.springer.com/chapter/10.1007/978-3-642-04174-7 7 20

[71] L. Kocsis and C. Szepesvári, “Bandit Based Monte-Carlo Planning,” in European conference on machine learning. Springer, 2006, pp. 282–293. [Online]. Available: http://link.springer.com/chapter/10.1007/11871842_29 20, 22, 24, 53

[72] A. A. Elnaggar, M. Abdel, M. Gadallah, and H. El-Deeb, “A Comparative Study of Game Tree Searching Methods,” Int. J. Adv. Comput. Sci. Appl., vol. 5, no. 5, pp. 68–77, 2014. 21

[73] S. Gelly and D. Silver, “Monte-Carlo tree search and rapid action value estimation in computer Go,” Artificial Intelligence, vol. 175, no. 11, pp. 1856–1875, Jul. 2011. [Online]. Available: http://www.sciencedirect.com/ science/article/pii/S000437021100052X 24

[74] C. for students, “Mate in Two Problem | Chess Puzzles!” 2017. [Online]. Available: http://www.chesspuzzles.com/mate-in-two 25

[75] A. L. Samuel, “Some Studies in Machine Learning Using the Game of Checkers,” IBM Journal of research and development, vol. 3, no. 3, pp. 210–229, 1959. [Online]. Available: http://ieeexplore.ieee.org/abstract/document/5392560/ 25, 38

[76] K. Hornik, “Approximation capabilities of multilayer feedforward networks,” Neural Networks, vol. 4, no. 2, pp. 251–257, Jan. 1991. [Online]. Available: http://www.sciencedirect.com/science/article/pii/089360809190009T 27

[77] A. Rashed and W. M. Bahgat, “Modified Technique for Speaker Recognition using ANN,” International Journal of Com- puter Science and Network Security (IJCSNS), vol. 13, no. 8, p. 8, 2013. [Online]. Available: http://search.proquest.com/openview/ 93dc75b23441d961486ee2bd51c36e2d/1?pq-origsite=gscholar&cbl=1026368 27

[78] M. Betke and N. C. Makris, “Fast Object Recognition in Noisy Images Using Simulated Annealing,” in Computer Vision, 1995. Proceedings., Fifth International Conference on. IEEE, 1995, pp. 523–530. [Online]. Available: http://ieeexplore.ieee.org/abstract/document/466895/ 27

[79] Y. LeCun, K. Kavukcuoglu, and C. Farabet, “Convolutional Networks and Applications in Vision,” in Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on. IEEE, 2010, pp. 253–256. [Online]. Available: http://ieeexplore.ieee.org/abstract/document/5537907/ 27, 182

[80] L. Debnath and D. Bhatta, Integral Transforms and Their Applications. CRC Press LLC, 2006. [Online]. Available: http://ebookcentral.proquest.com/lib/ qut/detail.action?docID=282788 28

[81] I. I. Hirschman and D. V. Widder, The Convolution Transform. Courier Cor- poration, 2012. 28

[82] Performing convolution operations. [Online]. Available: https: //developer.apple.com/library/content/documentation/Performance/ Conceptual/vImage/ConvolutionOperations/ConvolutionOperations.html 28

[83] L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, and M. Pietikäinen, “Deep learning for generic object detection: A survey,” International Journal of Computer Vision, vol. 128, no. 2, pp. 261–318, 2020. 28

[84] S. Russell, P. Norvig, and A. Intelligence, “AI: A Modern Approach,” Artificial Intelligence. Prentice-Hall, Egnlewood Cliffs, vol. 25, p. 27, 1995. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1. 259.8854&rep=rep1&type=pdf 29

[85] Y. Bengio, A. C. Courville, and P. Vincent, “Unsupervised Feature Learning and Deep Learning: A Review and New Perspectives,” CoRR,

abs/1206.5538, vol. 1, 2012. [Online]. Available: https://pdfs.semanticscholar. org/f8c8/619ea7d68e604e40b814b40c72888a755e95.pdf 29, 33

[86] S. Albawi, T. A. Mohammed, and S. Al-Zawi, “Understanding of a convolutional neural network,” in 2017 International Conference on Engineering and Technology (ICET). IEEE, 2017, pp. 1–6. [Online]. Available: https: //ieeexplore.ieee.org/document/8308186/ 30

[87] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet large scale visual recognition challenge,” arXiv:1409.0575 [cs], 2015. [Online]. Available: http://arxiv.org/abs/1409.0575 29, 30

[88] ImageNet object localization challenge. Library Catalog: www.kaggle.com. [Online]. Available: https://kaggle.com/c/imagenet-object-localization-challenge 29

[89] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural Computation, vol. 1, no. 4, pp. 541–551, 1989. [Online]. Available: http://www.mitpressjournals.org/doi/10.1162/neco.1989.1.4.541 29

[90] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “OverFeat: Integrated recognition, localization and detection using convolutional networks,” arXiv:1312.6229 [cs], 2014. [Online]. Available: http://arxiv.org/ abs/1312.6229 29

[91] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Communications of the ACM, vol. 60, no. 6, pp. 84–90, 2017. [Online]. Available: http://dl.acm.org/citation.cfm? doid=3098997.3065386 29

[92] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2016, pp. 770–778. [Online]. Available: http://ieeexplore.ieee.org/document/7780459/ 29, 31, 32, 46

[93] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, 1994, conference Name: IEEE Transactions on Neural Networks. 31

[94] Y. Bengio, “Learning deep architectures for AI,” Foundations and trends R in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009. [Online]. Available: http://dl.acm.org/citation.cfm?id=1658424 31

[95] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks.” in Aistats, vol. 9, 2010, pp. 249–256.

[Online]. Available: http://www.jmlr.org/proceedings/papers/v9/glorot10a/ glorot10a.pdf?hc location=ufi 31

[96] G. E. Hinton, “How neural networks learn from experience,” SCIENTIFIC AMERICAN, p. 8, 1992. 31, 33

[97] P. J. Werbos, “Thesis: Beyond regression new tools for prediction and analysis in the behavioral science werbos,” 1974. 33

[98] Werbos, “Backpropagation: past and future,” in IEEE 1988 International Con- ference on Neural Networks, 1988, pp. 343–353 vol.1. 33

[99] Y. Wang, H. Yao, and S. Zhao, “Auto-encoder based dimensionality reduction,” Neurocomputing, vol. 184, pp. 232–242, 2016. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S0925231215017671 33, 34

[100] Y. Bengio, “Deep learning of representations for unsupervised and transfer learn- ing,” in JMLR, vol. 27, 2012, p. 21. 33

[101] R. S. Sutton and A. G. Barto, Reinforcement learning: an introduction, ser. Adaptive computation and machine learning. Cambridge, Mass: MIT Press, 1998. 35

[102] R. A. Howard, “Dynamic programming,” Management Science, vol. 12, no. 5, pp. 317–348, 1966, publisher: INFORMS. [Online]. Available: http://www.jstor.org/stable/2627818 37

[103] E. J. Sondik, “THE OPTIMAL CONTROL OF PARTIALLY OBSERVABLE MARKOV PROCESSES,” Stanford Univ Calif Stanford Electronics Labs, p. 220, 1971. 37

[104] M. L. Littman, “Markov games as a framework for multi-agent reinforcement learning,” in Machine Learning Proceedings 1994. Elsevier, 1994, pp. 157–163. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/ B9781558603356500271 37

[105] F. Grabski, Semi-Markov Processes: Applications in System Reliability and Main- tenance, 1st ed. Elsevier, 2014. 37

[106] E. J. Sondik, “The Optimal Control of Partially Observable Markov Processes Over the Infinite Horizon: Discounted Costs,” Operations Research, vol. 26, no. 2, pp. 282–304, 1978. [Online]. Available: http://www.jstor.org/stable/169635 37, 60

[107] R. J. Boucherie and N. M. van Dijk, “Markov decision processes in practice,” in International Series in Operations Research & Management Science, vol. 248. Springer International Publishing, 2017. [Online]. Available: http://link.springer.com/10.1007/978-3-319-47766-4 37

[108] R. S. Sutton and A. G. Barto, Reinforcement learning: an introduction, 2nd ed., ser. Adaptive computation and machine learning series. The MIT Press, 2018. 38

[109] A. L. Samuel, “Some studies in machine learning using the gameof checkers. II- recent progress,” IBM Journal of research and development, p. 17, 1967. 38

[110] R. S. Sutton, “Learning to predict by the methods of temporal differences,” in Machine Learning. Kluwer Academic Publishers, 1988, pp. 9–44. 38, 60

[111] J. Baxter, A. Tridgell, and L. Weaver, “TDLeaf(lambda): Combining temporal difference learning with game-tree search,” arXiv:cs/9901001, 1999. [Online]. Available: http://arxiv.org/abs/cs/9901001 38, 60

[112] X. Ying, “An overview of overfitting and its solutions,” Journal of Physics: Conference Series, vol. 1168, p. 022022, 2019. [Online]. Available: https://iopscience.iop.org/article/10.1088/1742-6596/1168/2/022022 39

[113] P. N. Druzhkov and V. D. Kustikova, “A survey of deep learning methods and software tools for image classification and object detection,” Pattern Recognition and Image Analysis, vol. 26, no. 1, pp. 9–15, 2016. [Online]. Available: http://link.springer.com/10.1134/S1054661816010065 39

[114] K. P. Burnham, D. R. Anderson, and K. P. Burnham, Model selection and mul- timodel inference: a practical information-theoretic approach, 2nd ed. Springer, 2002. 40

[115] M. T. Ribeiro, S. Singh, and C. Guestrin, “”why should i trust you?”: Explaining the predictions of any classifier,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016, pp. 1135–1144. [Online]. Available: https: //dl.acm.org/doi/10.1145/2939672.2939778 40

[116] C. Li and M. Qiu, Reinforcement learning for cyber-physical systems with cyber- security case studies. CRC Press, 2019. 40

[117] I. Goodfellow, Y. Bengio, and A. Courville, “Chapter 7 regularization for deep learning,” in Deep Learning. MIT Press, 2016. [Online]. Available: https://www.deeplearningbook.org/contents/regularization.html 41

[118] S. J. Hanson and L. Y. Pratt, “Comparing biases for minimal network construction with back-propagation,” in Advances in Neural Information Processing Systems 1, D. S. Touretzky, Ed. Morgan-Kaufmann, 1989, pp. 177–185. [Online]. Available: http://papers.nips.cc/paper/156-comparing- biases-for-minimal-network-construction-with-back-propagation.pdf 41

[119] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” 2019. [Online]. Available: http://arxiv.org/abs/1711.05101 41

[120] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” arXiv:1207.0580 [cs], 2012. [Online]. Available: http://arxiv.org/abs/1207.0580 41

[121] N. Srivastava, “Improving neural networks with dropout,” p. 26, 2013. 41

[122] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler, “Efficient Object Localization Using Convolutional Networks,” arXiv:1411.4280 [cs], Nov. 2014. [Online]. Available: http://arxiv.org/abs/1411.4280 41, 114

[123] G. Orr and K.-R. Müller, “Neural networks: tricks of the trade,” in Lecture notes in computer science, no. 1524. Springer, 1998. 41, 116

[124] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” JMLR, p. 30, 2014. 42, 114

[125] H. Shimodaira, “Improving predictive inference under covariate shift by weighting the log-likelihood function,” Journal of Statistical Planning and Inference, vol. 90, no. 2, pp. 227–244, 2000. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S0378375800001154 41

[126] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015. [Online]. Available: http://arxiv.org/abs/1502.03167 41, 42, 124

[127] F. Herrera, “Dataset shift in classification: Approaches and problems,” international Work Conference on Artifical Neural Networks. [Online]. Available: http://iwann.ugr.es/2011/pdf/InvitedTalk-FHerrera-IWANN11.pdf 42

[128] P. Rosin and F. Fierens, “Improving neural network generalisation,” in 1995 International Geoscience and Remote Sensing Symposium, IGARSS ’95. Quanti- tative Remote Sensing for Science and Applications, vol. 2, 1995, pp. 1255–1257 vol.2. 44

[129] Y. Bengio, “Deep Learning of Representations: Looking Forward,” in International Conference on Statistical Language and Speech Processing. Springer, 2013, pp. 1–37. [Online]. Available: http://link.springer.com/chapter/ 10.1007/978-3-642-39593-2 1 44

[130] M. Kirci, N. Sturtevant, and J. Schaeffer, “A GGP Feature Learning Algorithm,” KI-K¨unstlicheIntelligenz, vol. 25, no. 1, pp. 35–42, 2011. [Online]. Available: http://link.springer.com/article/10.1007/s13218-010-0081-8 44

[131] C. Ré, A. A. Sadeghian, Z. Shan, J. Shin, F. Wang, S. Wu, and C. Zhang, “Feature Engineering for Knowledge Base Construction,” arXiv preprint arXiv:1407.6439, 2014. [Online]. Available: https://arxiv.org/abs/1407.6439 44

[132] B. Graham, “Spatially-Sparse Convolutional Neural Networks,” arXiv preprint arXiv:1409.6070, 2014. [Online]. Available: https://arxiv.org/abs/1409.6070 45

[133] J. F. Reed, “Better binomial confidence intervals,” Journal of Modern Applied Statistical Methods, vol. 6, no. 1, pp. 153–161, 2007. [Online]. Available: http://digitalcommons.wayne.edu/jmasm/vol6/iss1/15 51

[134] J. Frey, “Fixed-Width Sequential Confidence Intervals for a Proportion,” The American Statistician, vol. 64, no. 3, pp. 242–249, Aug. 2010. [Online]. Available: http://www.tandfonline.com/doi/abs/10.1198/tast.2010.09140 52

[135] D. Dugovic. (2019, Jan.) Multi-variant fork of popular UCI . Original-date: 2014-07-31T23:11:27Z. [Online]. Available: https://github.com/ ddugovic/Stockfish 53

[136] “Racing Kings • Race your King to the eighth rank to win. • lichess.org.” [Online]. Available: https://lichess.org/variant/racingKings 53

[137] T. Landau, “Othello: Brief & Basic,” 1990. [Online]. Available: http: //www.tlandau.com/files/Othello-B&B.pdf 53

[138] V. Allis, “A Knowledge-Based Approach of Connect-Four: The Game Is Solved: White Wins,” ICGA Journal, vol. 11, no. 4, pp. 165–165, Dec. 1988. [Online]. Available: http://www.medra.org/servlet/aliasResolver?alias=iospress&doi=10. 3233/ICG-1988-11410 54

[139] J. Tromp, “Solving connect-4 on medium board sizes,” ICGA Journal, vol. 31, pp. 110–112, 2018. 54

[140] N. Fiekas, “A pure Python chess library with move generation and validation, PGN parsing and writing, Polyglot opening book reading, Gaviota tablebase probing, Syzygy tablebase probing and UCI/XBoard engine co..” Jan. 2019, original-date: 2012-10-03T01:55:50Z. [Online]. Available: https://github.com/niklasf/python-chess 55

[141] Gunawan, H. Armanto, J. Santoso, D. Giovanni, F. Kurniawan, R. Yudianto, and Steven, “Evolutionary Neural Network for Othello Game,” Procedia - Social and Behavioral Sciences, vol. 57, pp. 419–425, Oct. 2012. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S187704281204668X 56

[142] K. Morishita, “Reversi reinforcement learning by AlphaGo Zero methods.: mokemokechicken/reversi-alpha-zero,” Jan. 2019, original-date: 2017-10- 22T04:16:43Z. [Online]. Available: https://github.com/mokemokechicken/ reversi-alpha-zero 56

[143] R. S. Sutton and B. Tanner, “Temporal-Difference Networks,” arXiv:1504.05539 [cs], Apr. 2015. [Online]. Available: http://arxiv.org/abs/1504.05539 60

[144] J. L. Elman, “Learning and development in neural networks: the importance of starting small,” Cognition, vol. 48, no. 1, pp. 71–99, Jul. 1993. [Online]. Available: http://linkinghub.elsevier.com/retrieve/pii/0010027793900584 61

[145] H. Asoh, S. Hayamizu, I. Hara, Y. Motomura, and S. Akaho, “Socially Embedded Learning of the Office-Conversant Mobile Robot Jijo-2,” in Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence, vol. 2, 1997, pp. 880–885. 61

[146] A. Y. Ng, D. Harada, and S. J. Russell, “Policy invariance under reward trans- formations: Theory and application to reward shaping,” in Proceedings of the Sixteenth International Conference on Machine Learning, ser. ICML ’99. Mor- gan Kaufmann Publishers Inc., 1999, pp. 278–287. 61

[147] A. Y. Ng, H. J. Kim, M. I. Jordan, and S. Sastry, “Autonomous helicopter flight via reinforcement learning,” Neural Information Processing Systems, vol. 16, p. 8, 2004. 61

[148] A. Y. Ng, A. Coates, M. Diel, V. Ganapathi, J. Schulte, B. Tse, E. Berger, and E. Liang, “Autonomous inverted helicopter flight via reinforcement learning,” in Experimental Robotics IX, ser. Springer Tracts in Advanced Robotics, M. H. Ang and O. Khatib, Eds. Springer Berlin Heidelberg, 2006, pp. 363–372. 61

[149] F.-Y. Sun, Y.-Y. Chang, Y.-H. Wu, and S.-D. Lin, “Designing non-greedy reinforcement learning agents with diminishing reward shaping,” in Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, ser. AIES ’18. Association for Computing Machinery, 2018, pp. 297–302. [Online]. Available: https://doi.org/10.1145/3278721.3278759 61

[150] J. Fu, K. Luo, and S. Levine, “Learning robust rewards with adversarial inverse reinforcement learning,” in International Conference on Learning Representations, 2018. [Online]. Available: http://arxiv.org/abs/1710.11248 61

[151] Y. J. Lee and K. Grauman, “Learning the easy things first: Self- paced visual category discovery,” in CVPR 2011. Colorado Springs, CO, USA: IEEE, Jun. 2011, pp. 1721–1728. [Online]. Available: http: //ieeexplore.ieee.org/document/5995523/ 61

[152] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum Learning,” in Proceedings of the 26th Annual International Conference on Machine Learning, ser. ICML ’09. New York, NY, USA: ACM, 2009, pp. 41–48. [Online]. Available: http://doi.acm.org/10.1145/1553374.1553380 61

[153] L. Jiang, D. Meng, Q. Zhao, S. Shan, and A. G. Hauptmann, “Self-paced curriculum learning,” in AAAI Publications, Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015, p. 7. 62

[154] M. P. Kumar, B. Packer, and D. Koller, “Self-paced learning for latent variable models,” Neural Information Processing Systems, p. 9, 2010. 62

[155] C. Florensa, D. Held, and M. Wulfmeier, “Reverse Curriculum Generation for Reinforcement Learning,” p. 14, 2017. 62

[156] R. G. Newcombe, “Two-sided confidence intervals for the single proportion: com- parison of seven methods,” Statistics in Medicine, vol. 17, no. 8, pp. 857–872, 1998. 65, 130, 138

[157] D. Silver. Keynote david silver NIPS 2017 deep reinforcement learning symposium AlphaZero. [Online]. Available: https://www.youtube.com/watch? v=A3ekFcZ3KNw 75

[158] Y. Bengio, “Practical recommendations for gradient-based training of deep architectures,” arXiv:1206.5533 [cs], 2012. [Online]. Available: http: //arxiv.org/abs/1206.5533 114, 117

[159] K. Dunnigan and S. Consulting, “Confidence Interval Calculation for Binomial Proportions,” p. 13, Nov. 2008. 129, 138

[160] S. Belia, F. Fidler, J. Williams, and G. Cumming, “Researchers Misunderstand Confidence Intervals and Standard Error Bars.” Psychological Methods, vol. 10, no. 4, pp. 389–396, 2005. [Online]. Available: http://doi.apa.org/getdoi.cfm? doi=10.1037/1082-989X.10.4.389 129

[161] M. G. E. Verdam, F. J. Oort, and M. A. G. Sprangers, “Significance, truth and proof of p values: reminders about common misconceptions regarding null hypothesis significance testing,” Quality of Life Research, vol. 23, no. 1, pp. 5–7, 2014. [Online]. Available: http://link.springer.com/10.1007/s11136-013-0437-2 129

[162] Z. Ali and S. B. Bhaskar, “Basic statistical tools in research and data analysis,” Indian Journal of Anaesthesia, vol. 60, no. 9, pp. 662–669, Sep. 2016. [Online]. Available: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5037948/ 130

[163] P. I. Good and J. W. Hardin, Common errors in statistics (and how to avoid them), 4th ed. Wiley, 2012. 132

[164] R. R. Wilcox, Fundamentals of Modern Statistical Methods. Springer New York, 2010. [Online]. Available: http://link.springer.com/10.1007/978-1-4419-5525-8 133

[165] J. Krueger, “Null hypothesis significance testing: On the survival of a flawed method.” American Psychologist, vol. 56, no. 1, pp. 16–26, 2001. [Online]. Available: http://doi.apa.org/getdoi.cfm?doi=10.1037/0003-066X.56.1.16 133

[166] A. Papoulis and S. U. Pillai, Probability, random variables, and stochastic processes, 4th ed. McGraw-Hill, 2002. 133

[167] W. J. Stewart, “Discrete distribution functions,” in Probability, Markov Chains, Queues, and Simulation, ser. The Mathematical Basis of Performance Modeling. Princeton University Press, 2009, pp. 115–133. [Online]. Available: https://www.jstor.org/stable/j.ctvcm4gtc.9 133

[168] H. Chen, “The accuracy of approximate intervals for a binomial parameter,” Journal of the American Statistical Association, vol. 85, no. 410, pp. 514–518, 1990. 133

[169] W. J. Stewart, “Random variables and distribution functions,” in Probability, Markov Chains, Queues, and Simulation, ser. The Mathematical Basis of Performance Modeling. Princeton University Press, 2009, pp. 40–63. [Online]. Available: https://www.jstor.org/stable/j.ctvcm4gtc.6 133

[170] A. Wald, “Tests of Statistical Hypotheses Concerning Several Parameters When the Number of Observations is Large,” Transactions of the American Mathematical Society, vol. 54, no. 3, pp. 426–482, 1943. [Online]. Available: www.jstor.org/stable/1990256 138

[171] Y. Guan, “A generalized score confidence interval for a binomial proportion,” Journal of Statistical Planning and Inference, vol. 142, no. 4, pp. 785–793, Apr. 2012. [Online]. Available: http://www.sciencedirect.com/science/article/ pii/S0378375811003569 138

[172] A. Agresti and B. A. Coull, “Approximate is Better than “Exact” for Interval Estimation of Binomial Proportions,” The American Statistician, vol. 52, no. 2, pp. 119–126, May 1998. [Online]. Available: http: //www.tandfonline.com/doi/abs/10.1080/00031305.1998.10480550 138

[173] S. E. Vollset, “Confidence intervals for a binomial proportion,” Statistics in Medicine, vol. 12, no. 9, pp. 809–824, 1993. [Online]. Available: http://onlinelibrary.wiley.com/doi/abs/10.1002/sim.4780120902 138

[174] E. Wilson, “Probable Inference, the Law of Succession, and Statistical Inference: Journal of the American Statistical Association: Vol 22, No 158,” Journal of the American Statistical Association, vol. 22, no. 158, pp. 209–212, Jun. 1927. 138

[175] K. R. Murphy, B. Myors, and A. H. Wolach, Statistical power analysis: a simple and general model for traditional and modern hypothesis tests, 4th ed. Routledge, Taylor & Francis Group, 2014. 141

[176] F. Mosteller, “A k-sample slippage test for an extreme population,” in Selected Papers of Frederick Mosteller, ser. Springer Series in Statistics, S. E. Fienberg and D. C. Hoaglin, Eds. Springer, 1948, pp. 101–109. [Online]. Available: https://doi.org/10.1007/978-0-387-44956-2 5 141

[177] S. Wallis, “Binomial Confidence Intervals and Contingency Tests: Mathematical Fundamentals and the Evaluation of Alternative Methods,” Journal of

Quantitative Linguistics, vol. 20, no. 3, pp. 178–208, Aug. 2013. [Online]. Available: http://www.tandfonline.com/doi/abs/10.1080/09296174.2013.799918 143

[178] A. Wald, “Sequential Tests of Statistical Hypotheses,” The Annals of Mathematical Statistics, vol. 16, no. 2, pp. 117–186, Jun. 1945. [Online]. Available: https://projecteuclid.org/euclid.aoms/1177731118 149

[179] J. Pearl, “Some recent results in heuristic search theory,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-6, no. 1, pp. 1–13, 1984. 181

[180] U. Larsson, R. J. Nowakowski, and C. P. Santos, “Absolute combinatorial game theory,” 2016. [Online]. Available: https://arxiv-org.ezp01.library.qut.edu.au/ abs/1606.01975 182

[181] J. C. Kolecki, “An introduction to tensors for students of physics and engineer- ing,” p. 29, 2002. 186

Glossary

Agent An entity which operates within an environment, which may be artificial or not. An agent uses a representation of the environment's state and makes decisions regarding which actions it will take, resulting in changes to the environment. An agent has a set of actions A = {a_1, ..., a_n} which it can take in the environment, and from this set a smaller subset of legal actions exists depending on the state. 1

Baseline-Player A baseline-player is a game player with some known prior quality of play, which we then use as the baseline against which to measure any improvement in performance from the methods used in this thesis. The baseline-player has identical architecture and parameters to the tested player, with the exception of the variations specified in the experiment. We first use an unmodified Alpha-Zero inspired baseline-player; however, once we establish that our methods have resulted in an improvement, we use the improved player as the baseline-player for later experiments. 45

Branching Factor The branching factor of a game tree is a measure of how many branches each child node has on average. Formally, for a game tree with a depth of d, a branching degree of b and with terminal positions distributed as F, the branching factor R_A is defined as R_A(b, F) = lim_{d→∞} [I_A(d, b, F)]^{1/d} according to Pearl [179]. The branching degree for a given node is its number of children. 16
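The same definition in display form, for readability. This is our transcription of the garbled formula above; our reading of Pearl [179] is that I_A(d, b, F) denotes the expected number of nodes examined by the search algorithm A, which the glossary entry itself does not spell out.

```latex
R_A(b, F) \;=\; \lim_{d \to \infty} \big[\, I_A(d, b, F) \,\big]^{1/d}
```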


Brute-Force Methods Explores as much of the environment as possible before making a decision. There are a number of tree-search methods which use some knowledge-based heuristic that are still categorised as brute-force, however true brute-force methods have no prior knowledge. 2

Combinatorial Board Game Larsson et al. define a combinatorial game as “a perfect information game (without dice and hidden cards) such as Chess, Mancala and Go. There are two players, and the game is represented by a finite game tree, with the root as the starting position and two types of edges representing each player’s move options. Information about the outcome of a game is given at the terminal level of the leaves” [180]. 7 Confidence Level The statistical probability that a prediction obtained from a sample population is also true for the actual population.8 Convolutional Neural-Networks Convolutional neural-networks (CNN) are a specific type of feed-forward neural-network where individual neurons process overlapping perspectives of the input. CNNs have typi- cally been used in many visual applications as they can identify features and objects in an image [79].8

Directly Informed Experience An experience which is directly informed ob- tains value information directly from the envi- ronment from a terminal-state. The value es- timate for the experience’s root-node receives first-hand knowledge from the environment. 83

End-Game-First Curriculum An end-game-first curriculum is one where the initial focus is to learn the consequences of actions near a terminal state/goal-state and then progressively learn from experiences that are further and further from the terminal state. 7

Environment A problem space that agents operate within. An environment at any time step t is represented by a state s_t ∈ S, where S represents all possible states of the environment. For any state s_t there exists a set of legal actions A(s_t) which an agent can take. An agent chooses a single action a_t ∈ A(s_t), causing the environment to transition from s_t to s_{t+1}. For this thesis the environment is a board game. 4, 35

Experience-Buffer A collection of experiences which were created during self-play and are used for training the neural-network. The experience-buffer has a maximum size, measured in number of games. 7
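The state/action notation above maps onto a simple programming interface. The sketch below is our own illustrative rendering in Python; the class and method names are assumptions, not the thesis implementation.

```python
from abc import ABC, abstractmethod
from typing import Any, List, Optional

class GameEnvironment(ABC):
    """Illustrative rendering of the glossary's environment: states s_t,
    legal actions A(s_t) and transitions to s_{t+1}."""

    @abstractmethod
    def legal_actions(self, state: Any) -> List[Any]:
        """Return A(s_t), the actions available to the agent in this state."""

    @abstractmethod
    def step(self, state: Any, action: Any) -> Any:
        """Apply a_t in s_t and return the successor state s_{t+1}."""

    @abstractmethod
    def terminal_value(self, state: Any) -> Optional[float]:
        """Return the environmental reward if the state is terminal, else None."""
```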

Generic Agent A generic agent is an agent designed for a broad problem domain or class of problem as opposed to a general agent which is an agent used across domains.4

Inadequately Trained Neural Network A network can be inadequately trained either due to poor training or insufficient training. 60

Incremental-Buffer Priming Incremental-buffer priming is when a player starts with a small experience-buffer size and increases the buffer size as training progresses. 100

Inference The process of using a neural-network to make a prediction. 26

Informed distance The distance, from a terminal-node, that reward information will have propagated from the terminal-state within the agent's internal mapping of the environment. The informed distance is the upper limit for the visibility-horizon. 79

Internal Knowledge Instead of thinking of problems as being solved by brute-force or knowledge-based approaches [13], we consider these approaches as having the knowledge either obtained on-demand or stored internal to the player. Internal knowledge is a knowledge-based approach where information about the environment is stored and recalled when needed. The prior knowledge may be programmed by a human or obtained by the agent. 11

Knowledge-Based Methods Uses the knowledge of the environment’s un- derlying structure to make decisions.2

Learning Resistance Learning resistance is a short-term resistance to network improvement, which may be caused by reaching a local minimum or by some other aspect of the game dynamics. 73

Monte-Carlo Tree Search MCTS is a tree search method which uses Monte-Carlo (random) methods to estimate a node’s value. MCTS is a best-first tree search algorithm which asymmetrically builds a game tree based on the most promising branches. MCTS is used in numerous game playing agents and is one of the most widely used general-game decision-making algorithms. Typically, the expansion of the tree is guided by estimating the upper confidence bound for a node’s value by calculating the upper confidence bound for trees (UCT).

Monte Carlo Monte-Carlo methods use random sampling to estimate unknown values by conducting a random trial and inferring the value from the resulting distribution.
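A minimal, self-contained sketch of a Monte-Carlo value estimate follows; the toy take-away game and the function names random_playout and mc_estimate are invented here purely for illustration. Random play-outs are run from a position and their outcomes averaged to estimate its value:

    import random

    def random_playout(pile):
        """Toy game: players alternately remove 1-3 stones and whoever takes
        the last stone wins. Returns +1 if the player to move wins, else -1."""
        to_move = +1
        while True:
            pile -= random.randint(1, min(3, pile))
            if pile == 0:
                return to_move            # this side took the last stone
            to_move = -to_move

    def mc_estimate(pile, trials=10_000):
        # Average the random-trial outcomes to estimate the position's value.
        return sum(random_playout(pile) for _ in range(trials)) / trials

    print(mc_estimate(4))                 # estimated value of a 4-stone position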

On-Demand Knowledge Instead of thinking of problems as being solved by brute-force or knowledge-based approaches [13], we consider these approaches as having the knowledge either acquired on-demand or stored internal to the player. On-demand knowledge is a brute-force approach where information about the environment is obtained when needed by searching the environment.

Priming The Macquarie Dictionary (2002) definition of priming is “to prepare or make ready”. In the context of this thesis, priming a neural-network is preparing it for training so that learning can be achieved more efficiently.

Reference-Opponent The reference-opponent is a standardised opponent used for competitions in a particular game. The reference-opponent has a pre-defined standard of game-play, and both the tested player and the baseline-player compete in a competition against the reference-opponent.

Self-Play The process of playing both sides of a two-player game. An agent can use self-play to explore the environment by alternating between players and taking turns which maximise the player’s rewards. For the agents in this thesis, self-play is used to create examples which are used to train the agent.

Strategic Game/Situation A “scenario or situation where, for two or more (players), their choice of action or behaviour has an impact on the others” [9].

Tensor A tensor is an algebraic construct encapsulating multi-linear variables of any rank. For example, a tensor of rank 0 is commonly called a scalar, a tensor of rank 1 is called a vector and a tensor of rank 2 is commonly called a matrix. Tensors of rank greater than 2 are commonly called n-dimensional matrices [181]. For example, a colour image may be represented as a stack of 3 matrices, each of size $x \times y$, one for each colour red, green and blue, or as a stack of four matrices, one for each of the colours cyan, magenta, yellow and black. It is more concise to generalise the possible size of the tensor than to seek to have it specified in any definition. For the purposes of this thesis we consider a tensor as being an algebraic descriptor for a variable of indeterminate rank (a brief illustrative sketch follows the next entry).

Terminal Priming Terminal priming is the method of creating experiences for every discovered terminal node during a search.
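The brief sketch referred to in the Tensor entry above, using NumPy (assumed here purely for illustration) to show tensors of increasing rank:

    import numpy as np

    scalar = np.array(3.0)                # rank 0: a scalar
    vector = np.array([1.0, 2.0, 3.0])    # rank 1: a vector
    matrix = np.zeros((4, 4))             # rank 2: a matrix
    rgb_image = np.zeros((3, 32, 32))     # rank 3: three colour planes of size 32 x 32

    print(scalar.ndim, vector.ndim, matrix.ndim, rgb_image.ndim)  # prints: 0 1 2 3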

Uninformed Experience An experience that is uninformed has no information about the environment.

Upper Confidence Bound for Trees Upper Confidence Bound for Trees (UCT). UCT utilises the distribution of the results obtained from a play-out to estimate the upper confidence value for a node. The node with the highest upper confidence value is selected for exploration; this is the most promising path. The upper confidence value is calculated by Equation 2.1. The UCT formula balances the exploration of rarely visited paths with the exploitation of paths which win more frequently.
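As an illustrative sketch only, and not a reproduction of the thesis’s Equation 2.1, the UCT selection value is commonly written in the standard form below; the symbols $\bar{X}_i$ (mean play-out reward of child $i$), $n_i$ (the child’s visit count), $N$ (the parent’s visit count) and $c$ (an exploration constant) are naming assumptions made here for illustration:

\[
UCT(i) = \bar{X}_i + c \sqrt{\frac{\ln N}{n_i}}
\]

The first term favours exploitation of children whose play-outs win frequently, while the second term grows for rarely visited children, encouraging exploration, in line with the balance described above.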

Visibility-Horizon The distance from a terminal-node within which nodes are sufficiently informed to make accurate predictions and, as such, are useful for training; i.e. extending from the terminal-node up the tree for some distance over which the agent is still able to make informed decisions.