ABSTRACT

ANALYSIS OF Q-LEARNING BASED GAME PLAYING AGENTS FOR ABSTRACT BOARD GAMES WITH INCREASING STATE-SPACE COMPLEXITY

by Indrima Upadhyay

This thesis investigates Q-learning agents in a Reinforcement Learning framework for abstract board games. The two key contributions are: exploring the training of Q-learning agents, and a methodology to evaluate different agents playing abstract games. We focus on Tic-Tac-Toe, Nine-Men's Morris, and Mancala, noting that each of these games is solved. To train our Q-agent, we test the impact of a teaching agent (a Q-learning twin, a deterministic Min-Max, and a non-deterministic Min-Max) based on the number of training epochs needed until an agent converges. We find that a deterministic Min-Max is the best teaching agent, but a Q-learning twin allows us to converge without needing a pre-existing agent. In training, we include a methodology to create weaker agents for entertainment purposes. To evaluate a range of competitive agents, we provide a methodology and conduct a round-robin tournament. We find that a deterministic Min-Max agent scores maximum points for all three games, the Q-learning based agent places second for both Tic-Tac-Toe and Nine Men's Morris, and a non-deterministic Min-Max places second for Mancala. From these two contributions we summarize our conclusions, discuss these results, and provide insight on how this space might be further investigated.

ANALYSIS OF Q-LEARNING BASED GAME PLAYING AGENTS FOR ABSTRACT BOARD GAMES WITH INCREASING STATE-SPACE COMPLEXITY

A Thesis

Submitted to the Faculty of Miami University
in partial fulfillment of the requirements
for the degree of Master of Science
by
Indrima Upadhyay
Miami University
Oxford, Ohio
2021

Advisor: Dr. Peter Jamieson
Reader: Dr. Chi-Hao Cheng
Reader: Dr. Ran Zhang

©2021 Indrima Upadhyay

This Thesis titled

ANALYSIS OF Q-LEARNING BASED GAME PLAYING AGENTS FOR ABSTRACT BOARD GAMES WITH INCREASING STATE-SPACE COMPLEXITY

by

Indrima Upadhyay

has been approved for publication by

The College of Engineering and Computing

and

The Department of Electrical & Computer Engineering

Dr. Peter Jamieson

Dr. Chi-Hao Cheng

Dr. Ran Zhang

Table of Contents

List of Tables

List of Figures

Dedication

Acknowledgements

1 Chapter 1: Introduction

2 Chapter 2: Background
  2.1 Ludii General Game System
  2.2 Game Complexity
    2.2.1 State-Space Complexity
  2.3 Abstract Games in this Thesis
    2.3.1 Tic-Tac-Toe
    2.3.2 Nine-Men's Morris
    2.3.3 Mancala
    2.3.4 Zero-Sum Games and Fully Solved Games
  2.4 Reinforcement Learning for Game Playing Agents
    2.4.1 Reinforcement Learning Algorithms
  2.5 API for development of Game Playing Agent of Ludii

3 Chapter 3: Training a Q-learning Agent
  3.1 Details of the Q-learning Algorithm
  3.2 How to select Learning Factor and Discount Factor
  3.3 Training Methodology
  3.4 Stopping Criteria For Training
    3.4.1 Definitions: Convergence until Stable versus Convergence until Fully Solved
  3.5 Training Tic-Tac-Toe Agent Against Different Teaching Agents
    3.5.1 Summary of Training Convergence for Tic-Tac-Toe
  3.6 Training Nine-Men's Morris Agent Against Different Teaching Agents
    3.6.1 Summary of Training Convergence for Nine-Men's Morris
  3.7 Training Mancala Q-learning Agent against Different Teaching Agents
    3.7.1 Summary of Training Convergence for Mancala
  3.8 Computational Run-Time Complexity for Q-learning and Min-Max Algorithms
  3.9 Snapshots for Creating Lower Quality Agents
  3.10 Confirming the Importance of Good Teaching Agents

4 Chapter 4: Evaluation of RL Agents
  4.1 Evaluating Tic-Tac-Toe
    4.1.1 Defining a "Good" game agent for Tic-Tac-Toe
    4.1.2 Method to find the Number of Games to find a Stable Result for Tic-Tac-Toe between Two Agents
    4.1.3 Round-Robin Match tournament Results for the game of Tic-Tac-Toe
  4.2 Evaluating Nine Men's Morris
    4.2.1 Defining "Good" game playing agents for Nine Men's Morris
    4.2.2 Method to find the Number of Games to find a Stable Result for Nine Men's Morris between Two Agents
    4.2.3 Round-Robin Match tournament Results for the game of Nine Men's Morris
  4.3 Evaluating Mancala
    4.3.1 Defining "Good" game agent for Mancala
    4.3.2 Method to find the Number of Games to find a Stable Result for Mancala between Two Agents
    4.3.3 Round-Robin Match tournament Results for the game of Mancala

5 Chapter 5: Discussion and Future Work

6 Chapter 6: Conclusions

References

List of Tables

2.1 Table to Understand RL Approach to Tic-Tac-Toe

3.1 Summary of Training until Convergence for Tic-Tac-Toe with Different Teaching Agents
3.2 Summary of Training until Convergence for Nine-Men's Morris with Different Teaching Agents
3.3 Summary of Training until Convergence for Mancala with Different Teaching Agents
3.4 Summary of Run-Time Required for One Cycle of Training of Tic-Tac-Toe With Different Agents
3.5 Summary of Run-Time Required for One Cycle of Training of Nine Men's Morris With Different Agents
3.6 Summary of Run-Time Required for One Cycle of Training of Mancala With Different Agents
3.7 Summary of Training until Convergence for Tic-Tac-Toe with Different Snapshot Teaching Agents
3.8 Summary of Training until Convergence for Nine Men's Morris with Different Snapshot Teaching Agents
3.9 Summary of Training until Convergence for Mancala with Different Snapshot Teaching Agents

4.1 The base line after 10000 games of Tic-Tac-Toe with Min-Max and Random agent as players
4.2 Number of trials the Q-learning agent for Tic-Tac-Toe requires to reach a stable win/draw to loss ratio
4.3 Tic-Tac-Toe tournament to look at win/draw rate for each of the players
4.4 The base line after 1000 games of Nine Men's Morris with Min-Max and Random agent as players
4.5 Number of trials the Q-learning agent for Nine Men's Morris at a particular stage in training requires to reach a stable win/draw to loss ratio
4.6 Nine Men's Morris tournament to look at win rate for each of the players
4.7 The base line after 1000 games of Mancala with Min-Max and Random agent as players
4.8 Number of trials the Q-learning agent for Mancala requires to reach a stable win/draw to loss ratio
4.9 Mancala tournament to look at win/draw rate for each of the players

List of Figures

1.1 The mind map of the research

2.1 A game of Tic-Tac-Toe in progress
2.2 Board for the game of Tic-Tac-Toe
2.3 Board for the game of Nine-Men's Morris
2.4 Board for the game of Mancala

3.1 Contents of Chapter 3 Training
3.2 Illustration of how the Q-agent trains
3.3 Q-learning based game playing agent for Tic-Tac-Toe learning by playing against a Q-learning based player
3.4 Q-learning based game playing agent for Tic-Tac-Toe learning by playing against a non-deterministic Min-Max player
3.5 Q-learning based game playing agent for Tic-Tac-Toe learning by playing against a deterministic Min-Max player
3.6 Q-learning based game playing agent for Tic-Tac-Toe learning by playing against a random player
3.7 Q-learning based game playing agent for Tic-Tac-Toe learning by playing against a fully converged Q-learning based game playing agent
3.8 Q-learning based game playing agent for Nine-Men's Morris learning by playing against itself
3.9 Q-learning based game playing agent for Nine-Men's Morris learning by playing against a non-deterministic Min-Max player
3.10 Q-learning based game playing agent for Nine-Men's Morris learning by playing against a deterministic Min-Max player
3.11 Q-learning based game playing agent for Nine-Men's Morris learning by playing against a fully converged Q-learning player
3.12 Q-learning based game playing agent for the game of Mancala learning by playing against a Q-learning based player
3.13 Q-learning based game playing agent for Mancala learning by playing against a non-deterministic Min-Max player
3.14 Q-learning based game playing agent for Mancala learning by playing against a deterministic Min-Max player
3.15 Q-learning based game playing agent for Mancala learning by playing against a fully converged Q-learning based player

4.1 10000 games between Min-Max - Min-Max players to establish baseline
4.2 10000 games between Random - Min-Max players to establish baseline

Dedication

To my parents, Mrs. Nandita Upadhyay, and Dr. Ajay Kumar Upadhyay. Thank you for always being my best friends, my support system and everything else in between. To my sister, Dr. Ipshita Upadhyay and brother-in-law, Mr. Shashank Bajpai. Thank you for showing me nothing is unachievable and for always being there whenever I needed.

Acknowledgements

Writing a thesis is harder than I thought and more rewarding than I could have ever imagined. Throughout the writing of this thesis I have received a great deal of support and assistance. This thesis and the research behind it would not have been possible without the remarkable guidance of my advisor, Dr. Peter Jamieson, Associate Professor in the Department of Electrical and Computer Engineering, Miami University. His supervision, advice, guidance from the very early stage of this research and timely help took me towards the completion of this thesis. His continuous encouragement and support not only aided me in completing the thesis but also inspired me to explore my career in the pertinent industry. It is indeed a great pleasure to work under the guidance of Dr. Peter Jamieson. His knowledge and thorough attention to detail have been an absolute inspiration. I am also deeply indebted to Mr. Michael Bomholt for constantly providing his insight and expertise, which greatly assisted the research and proved to be indispensable support and advice. His patience, motivation, enthusiasm, and immense knowledge helped me throughout the course of this research and the writing of this thesis. I am very thankful to my thesis committee members Dr. Chi Hao Cheng and Dr. Ran Zhang for their constructive criticism and feedback that helped to improve my thesis. I am grateful to Miami University for providing me with ample opportunities and the perfect environment to grow not only as a research student but also as an individual. My fellow graduate students and friends in the Electrical and Computer Engineering Department, Miami University made the entire experience all the more enriching and enjoyable. Thanks a lot for all the stimulating and interesting discussions we have had over the last two years. It would be remiss if I did not thank my parents, Dr. Ajay Kumar Upadhyay and Mrs. Nandita Upadhyay, for their unwavering faith in me and for their support in all of my decisions. Thank you so much for instilling the qualities of resilience, kindness, and perseverance in me from a very young age and for being my superheroes. From the bottom of my heart, I would like to say a big thank you to my beloved sister Dr. Ipshita Upadhyay for her endless provision in completing this and everything else in my life. She is the personification of perfection to me and without her the completion of this thesis or anything else would not have been possible. A debt of gratitude is also owed to the nicest, most helpful brother-in-law, Mr. Shashank Bajpai. He has been a constant source of inspiration whose perpetual encouragement has been priceless. A special thanks to my friends Isha Singh, Shruti Shaunik, Arushi Nautiyal, and Vaishnavi Dasaka for constantly helping me through challenging times. Lastly, I would like to thank the Almighty for giving me wonderful parents, teachers, friends, and the perfect milieu. I feel immensely blessed.

Chapter 1: Introduction

In this thesis, we examine the training and evaluation of agents. Our goal is to understand how an existing algorithm, Q-learning, performs as a game playing agent. We provide a methodology to train and study this algorithm for three different abstract board games that increase in complexity, and we provide a detailed methodology on how to train and evaluate Q-learning agents. Reinforcement Learning (RL) is a powerful model to create learning agents for complex problems. It is a research sub-area in the broader field of artificial intelligence (AI) where, in RL, an agent attempts to maximize the total reward for its actions in an uncertain environment. The application of RL in board games is interesting because board games present simplified spaces where we can test and observe a decision-based agent. Board games also provide a competitive space to compare different AI techniques within a limited state-space complexity [1]. With a deeper understanding of the performance of RL game agents with respect to changing state-space complexity, we are able to evaluate AI techniques and provide insight on how to employ these agents in a board game market. This includes providing a range of quality AI agents that are challenging for players at all levels of skill, which allows a human player to improve, beat, and have fun playing against them. Our goal is to explore the quantitative difference in the performance of RL based game playing agents with changing state-space complexity of abstract board games. For this purpose, our game playing agents are designed using the Q-learning framework for three different games with different state-space complexities. Each agent is trained against a range of AI agents (Min-Max, Q-learning, and Random) to get a better understanding of how quickly we find convergence when training a Q-agent for a particular game. Figure 1.1 shows the details of this thesis. First, we create a Q-learning agent that learns to play three different games: Tic-Tac-Toe, Nine-Men's Morris, and Mancala. Each of these games is of an increased state-space complexity going from first to last. Chapter 2 describes the different kinds of agents used throughout this thesis, which are the deterministic Min-Max agent, the non-deterministic Min-Max agent, the Q-learning based agent, and the random agent, shown by boxes labelled in yellow in Figure 1.1. The yellow outlined boxes in Figure 1.1 contain the names used to refer to each of the agents in the blue outlined boxes, which describe the methodology used for evaluation of the agents and are discussed later. In Chapter 3, boxes labelled in green in Figure 1.1, we analyze our training approach. We report the number of training generations needed when training is done against itself and against opponent agents as trained by us. We hypothesize that Q-learning will only be useful as an RL-based technique for a limited state-space complexity, but each of the games chosen has been shown to be solvable, and we do not study this problem beyond our game choices. However, training time increases as state complexity increases. We also explore how quickly our agents train to convergence under different teaching agents. From our trials we find that the deterministic Min-Max agent is the best teaching agent in terms of the number of training generations required for Tic-Tac-Toe, Nine Men's Morris, and Mancala. In Chapter 3, boxes labelled in green in Figure 1.1, we also show our method to create weakened

Figure 1.1: The mind map of the research.

agents that will be used in the following chapter. Additionally, to confirm our hypothesis that better teaching agents accelerate training, we experiment with training our agent against weakened versions of our Q-learning agent, where each teaching agent has an increasing number of training generations going from first to last. Boxes labelled in blue in Figure 1.1 show sections of Chapter 4, where we describe our methodology to compare agents when playing against each other for each of the three games: Tic-Tac-Toe, Nine-Men's Morris, and Mancala. Our approach is a round-robin tournament approach where we can compare agents. To do this, we first need to determine what a reasonable number of trial games is to conclusively determine the win to draw to loss ratio. We provide our methodology for finding this number of trials for each game, and then use this methodology to compare the range of agents we have against one another. For this research, we use a General Game Playing (GGP) system, boxes labelled in pink in Figure 1.1, that provides a program that can perform well across various types of games. We specifically use the Ludii General Game System, as created by the Digital Ludeme Project (DLP), which allows us to conduct experiments to investigate whether there is a discernible difference in the performance of game playing agents with different state-space complexity. We describe this game playing system in Chapter 2. The major contributions of this thesis are:

• An analysis of training a Q-learning RL agent in terms of finding convergence for solved zero-sum abstract board games and evaluating training speed based on using a mirror of itself versus a Min-Max agent and a random agent (Chapter 3). This includes how to select parameters in the Q-learning algorithm.

• A method to create lower quality Q-learning based agents that provide human players with challenging, competitive opponents that are not too hard to beat (Chapter 3).

• A method to create a round-robin tournament to evaluate the quality of agents, which includes a method to determine how many games must be played between opponents to find stable results (Chapter 4)

The results of these contributions provide other RL researchers with methods to properly evaluate RL agents and to use Q-learning based agents in the future for virtual game playing agents. The thesis is organized as follows: Chapter 2 provides background on the use of RL in gaming, the Ludii General Game System, game complexity, and the games explored for the purpose of this research, followed by RL approaches for game playing agents including Q-learning and Min-Max, along with details of the API for developing game-playing agents in Ludii. Chapter 3 describes our training approach and analysis of training Q-learning, including finding the number of generations an agent must be trained for before the algorithm converges on "the best solution", what "the best solution" is, and how to decide if an agent is good enough. Additionally, we show how to use the converged agent to create lesser quality agents. Chapter 4 explains how we create a system to evaluate agents. This includes a methodology on how to find the number of trials such that the evaluation of agents is stable. We then perform a round-robin tournament for each of the agents

within our system. Chapter 5 provides a summary of what we have learned from our experiments and provides caveats to our approach and future avenues of investigation, and finally, Chapter 6 concludes the thesis.

Chapter 2: Background

In this chapter, we provide background details related to abstract board games and RL agents that can be taught to play them. Board games are fixed environments that are good for experimenting with and testing various RL algorithms and agent-based learning techniques. RL, in general games, has achieved human-level performance playing Atari games [2]. Temporal difference learning was used to create TD-Gammon, a game-learning program, which proved to be the world's best player of Backgammon [3]. RL was used by IBM's Watson, in 2011, to make strategic decisions in Jeopardy! [4]. In this chapter, we first provide details of the Ludii General Game System, the platform on which we implement our agents. Next, we look at different ways of measuring the complexity of an abstract board game. In particular, we focus on the state-space complexity of games. Next, we describe some details of the three games used in this thesis, the basics of RL, the basics of our Q-learning algorithm, the Min-Max algorithm, and some details of the API for the development of a game agent. All of our system and software has been released on GitHub and can be downloaded at: www.github.com/indrima24

2.1 Ludii General Game System

The main objective of a General Game Playing (GGP) system is to provide a program that can perform well across various types of games, i.e., it has applicability to more than one specific game [5]. A GGP works by assuming that the AI program is not tightly coupled to the board game, and therefore, it requires game descriptions similar to those found in game manuals. The text and logic based description of games and their rules is known as a Game Definition Language (GDL) [6]. In recent years, there have been many advances in GGPs [7]. The Ludii General Game System, the GGP used in this research, is created by the Digital Ludeme Project (DLP). DLP is Maastricht University's attempt towards creating a GGP for the reconstruction and analysis of numerous traditional and modern strategic abstract games, where an abstract game is defined as any game where the theme of the game is not important in terms of the game playing experience, and if present, it only serves the mechanics of the game. The DLP facilitates the work of game designers, historians, and educators [8], along with allowing us to research AI agents similar to other General Game Playing (GGP) systems such as FLUXPLAYER, developed at TU-Dresden, which is based on FLUX, a high-level programming system for cognitive agents. FLUXPLAYER intends to foster the development of integrated cognitive information processing technology [9]. Over the past couple of years a new GGP language called Regular Board games (RBG) has been developed at the Institute of Computer Science, University of Wrocław, Poland. RBG is focused on describing finite deterministic turn-based games with perfect information [10]. The creators of the Ludii General Game System, however, find that Ludii is superior in comparison

with the performance of the Regular Board games (RBG) system, demonstrating that it is more general, extendable, and has other qualitative facets not found in RBG [11]. The object-oriented nature of Ludii is chosen based on a "ludemic" approach, meaning that games are decomposed into their atomic components that encompass all related pieces and rules. Ludemes are defined as "game memes" or building blocks of games. These are units of information relevant to the game under consideration. These abstract units are what game designers work with when creating their designs. Every bit of pertinent information that is needed to play the game, i.e. the players, the equipment, the rules, is presented in a straightforward, well-defined, and structured format using ludemes. This high-level style of game description summarizes the key game-related notions while obscuring the intricacy of the underlying implementation. The ludeme library thus acts as the basis of the Ludii system. The ludeme library, tagged with underlying mathematical concepts, is rendered together by object-oriented programming classes, each one pushing a certain ludeme into action. A single immutable Ludii game object consists of the definition of all relevant ludemes (players, equipment, rules) for a particular game [12]. As is required for any general game playing system, to display the board state, every single component of equipment can draw a default bitmap image for itself at a given resolution. For example, Figure 2.1 shows a very simple Ludeme game of Tic-Tac-Toe in progress.

Figure 2.1: A game of Tic-Tac-Toe in progress.

The Ludeme class code shown in Algorithm 1 gives the full game description for Tic-Tac-Toe. It starts with a minimum legal game description in line 1 as game "Tic-Tac-Toe", followed by defining the total number of players, here 2, and the shape and size of the board given by square 3 in line 5. The pieces are then defined as "Disc" for player 1 and "Cross" for player 2. Finally, the minimum playing and ending rules are defined to complete the legal game description, ready to be loaded in Ludii.

Algorithm 1 Ludeme class code for Tic-Tac-Toe
1: (game "Tic-Tac-Toe" (players
2:     2
3: )
4: (equipment
5:     { (board (square 3)) (piece "Disc" P1) (piece "Cross" P2) }
6: )
7: (rules
8:     (play (add (mover) (empty)))
9:     (end (if (isLine 3) (result Mover Win)))
10: )
11: )

2.2 Game Complexity

Combinatorial game theory outlines several ways of measuring game complexity [13], and this is used in quantifying the complexity of a game. It is important to understand that complexity is the quantitative scope of the game, whereas difficulty relates more to the qualitative strategic elements. This means that a more complex game is not necessarily a more difficult one.

2.2.1 State-Space Complexity
Allis [14], in 1994, describes the state-space complexity of a game as the total count of legal game positions accessible from the starting point of the game. In other words, it is the number of all possible legal positions with the constraints being the initial position of any particular game [15]. Each time any of the pieces makes a legal move, or captures a piece, a fresh board state is conceived; the sum total of all of these ends up as the game's state-space complexity. The legality restriction here is what creates great difficulty in the precise calculation of these numbers. Even though it is not especially challenging to figure out the number of ways a game's pieces can be set up on and off the board, it is considerably trickier to decide the legal subgroup of these. For example, in the game of Tic-Tac-Toe there are three potential values: a cross, a naught, or neither, each of which could be plugged into any of the nine spaces present on the board. Simply by calculating three to the power of nine we find that a Tic-Tac-Toe board has 19683 unique states. This number, however, comprises a variety of illegal situations, such as a situation where there are five naughts and no crosses, or a situation where crosses and naughts both form rows of three. If all of the illegal positions on the board are removed, a total of 5,478 states remain, and when all reflections and rotations of positions are deemed to be the same, there are only 765 fundamentally distinct board states. For games of substantially higher complexity, such exact counts become impractical, so researchers and scholars tend to focus on the upper and lower bounds of state-space complexity when calculating them [16].
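These counts can be reproduced with a short brute-force enumeration. The following sketch is a standalone illustration (it is not part of the thesis toolchain): it enumerates all 3^9 = 19683 raw board configurations and filters them down to the legal subset using the turn-order and winner constraints described above; collapsing the eight board symmetries would further reduce the legal set to the 765 fundamentally distinct states.

public class TicTacToeStateCount {
    // Winning index triples on a 3x3 board stored as a flat array of 9 cells.
    private static final int[][] LINES = {
        {0, 1, 2}, {3, 4, 5}, {6, 7, 8},   // rows
        {0, 3, 6}, {1, 4, 7}, {2, 5, 8},   // columns
        {0, 4, 8}, {2, 4, 6}               // diagonals
    };

    private static boolean hasLine(int[] board, int player) {
        for (int[] line : LINES) {
            if (board[line[0]] == player && board[line[1]] == player && board[line[2]] == player) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int legal = 0;
        // Each cell is 0 (empty), 1 (cross), or 2 (naught): 3^9 = 19683 raw configurations.
        for (int code = 0; code < 19683; code++) {
            int[] board = new int[9];
            int c = code, crosses = 0, naughts = 0;
            for (int i = 0; i < 9; i++) {
                board[i] = c % 3;
                c /= 3;
                if (board[i] == 1) crosses++;
                if (board[i] == 2) naughts++;
            }
            boolean xWins = hasLine(board, 1);
            boolean oWins = hasLine(board, 2);
            // Crosses move first, so piece counts differ by at most one; both players
            // cannot have completed lines; and the winner must have moved last.
            boolean legalCounts = (crosses == naughts) || (crosses == naughts + 1);
            boolean legalWins = !(xWins && oWins)
                    && !(xWins && crosses != naughts + 1)
                    && !(oWins && crosses != naughts);
            if (legalCounts && legalWins) {
                legal++;
            }
        }
        System.out.println("Legal positions (including the empty board): " + legal);  // 5478
    }
}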

Figure 2.2: Board for the game of Tic-Tac-Toe.

The upper bound is the count of all the potential configurations of pieces, without giving any heed to the constraints in terms of legal positions and initial setup. In this case, a situation with five naughts and no crosses would be included. The lower bound is a measure of the minimum state-space complexity, calculated using simplified mathematical models of the game under consideration. Even for uncomplicated games, state-space complexity amounts to large, cumbersome numbers; hence, scholars and researchers describe state-space complexity with log to base 10. This means that Tic-Tac-Toe has an upper-bound state-space complexity of 4.

2.3 Abstract Games in this Thesis

Next, we describe the three abstract games we use in this thesis. To explore quantitative differences in the performance of reinforcement learning based game playing agents with changing state-space complexity of abstract board games, we make use of three different board games:

1. Tic-Tac-Toe

2. Nine Men’s Morris

3. Mancala

2.3.1 Tic-Tac-Toe
Tic-Tac-Toe, shown in Figure 2.2, also known as Xs and Os, or noughts and crosses, is a two-player game where the two players take turns filling X and O in a 3×3 grid. To win the game a player must succeed in placing three of their marks in a horizontal, vertical, or diagonal row. It is a solved game with a forced draw assuming best play from both players. The board size, for Tic-Tac-Toe, is 9 and the state-space complexity (as log to base 10) is 4 [17].

2.3.2 Nine-Men’s Morris

Figure 2.3: Board for the game of Nine-Men’s Morris.

Nine-Men's Morris, also known as cowboy checkers, has a board that is made up of a grid with twenty-four intersecting points, as shown in Figure 2.3. There are two players, each of whom has nine pieces, or "men". Each player actively attempts to form 'mills', where a mill is three of a player's own pieces lined up horizontally or vertically. Forming a mill allows the player to eliminate an opponent's piece from the game; whenever a player achieves a mill, it can immediately remove one piece belonging to the opponent from the board. To win, a player must either reduce the opponent to two pieces, so that they cannot create any mills, or leave them with no possible legal move. The board size, for Nine-Men's Morris, is 24 and the state-space complexity (as log to base 10) is 10 [18].

2.3.3 Mancala
Mancala, also known as Kalah, has a higher state-space complexity than Nine-Men's Morris. Mancala was brought into the United States of America by William Julius Champion, Jr. in the year

Figure 2.4: Board for the game of Mancala.

1940 [19]. The game comprises a Mancala board and 24 seeds or counters. As shown in Figure 2.4, each side of the board has 6 pits, called houses, each of which contains 4 seeds at the beginning of the game. There is also one big pit at each end, called the end zone or store. Seeds are sown alternately by each player. When it is a particular player's turn, they must remove all of the seeds from any one of the houses on their side of the board and move the seeds, dropping one seed at a time in each house, except the opponent's store, in a counter-clockwise direction. If the last sown seed lands in one of the player-owned empty houses and the opponent's adjacent house contains seeds, both the player's own last seed and the opponent's seeds in the adjacent house are put into the player's store. However, if the last sown seed lands in the player's own store, they get an extra turn. There are no limitations in terms of the total number of moves a particular player can make in a specific turn. The object of the game is to capture more seeds than one's opponent. The board size is 14 and the state-space complexity is 13 [20].
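To make the sowing rule concrete, the sketch below simulates a single Kalah-style move under the rules just described. It is an illustrative stand-alone example, not the Ludii implementation used in this thesis, and it follows the common Kalah convention in which a capture takes the seeds from the house directly opposite the last-sown house; the pit indexing is an assumption of this sketch.

public class MancalaMove {
    // Pits 0-5: player 0's houses, pit 6: player 0's store,
    // pits 7-12: player 1's houses, pit 13: player 1's store.
    // Returns true if the mover earns an extra turn.
    public static boolean sow(int[] pits, int player, int house) {
        int myStore = (player == 0) ? 6 : 13;
        int oppStore = (player == 0) ? 13 : 6;
        int seeds = pits[house];
        pits[house] = 0;
        int pos = house;
        while (seeds > 0) {
            pos = (pos + 1) % 14;
            if (pos == oppStore) {
                continue;                       // skip the opponent's store
            }
            pits[pos]++;
            seeds--;
        }
        if (pos == myStore) {
            return true;                        // last seed in own store: extra turn
        }
        // Last seed landed in a previously empty house on the mover's own side:
        // capture that seed plus the seeds in the opposite house, if any.
        boolean ownSide = (player == 0) ? (pos <= 5) : (pos >= 7 && pos <= 12);
        if (ownSide && pits[pos] == 1 && pits[12 - pos] > 0) {
            pits[myStore] += pits[pos] + pits[12 - pos];
            pits[pos] = 0;
            pits[12 - pos] = 0;
        }
        return false;
    }

    public static void main(String[] args) {
        // Standard start: 4 seeds in each of the 12 houses, empty stores.
        int[] pits = {4, 4, 4, 4, 4, 4, 0, 4, 4, 4, 4, 4, 4, 0};
        boolean extraTurn = sow(pits, 0, 2);    // player 0 sows from house 2
        System.out.println("Extra turn earned: " + extraTurn);  // true: last seed lands in the store
    }
}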

2.3.4 Zero-Sum Games and Fully Solved Games
For the above games, we note two characteristics, namely that they are both "zero-sum" and "fully solved" games. Zero-sum games are games where one agent's gain is equivalent to the other one's loss; if the total losses are subtracted from the sum of all gains, the result is zero. This means that whenever one player wins, the other player loses. Popular games like tennis, poker, and rock-paper-scissors are all zero-sum games. Fully solved games are games whose outcome from any particular state can be correctly predicted, with the assumption that both players play perfectly. The optimal policy for such games can be found. Common examples of such games are Nim, English draughts, and Chopsticks. All the games used for the purpose of this research are both zero-sum and fully solved games.

2.4 Reinforcement Learning for Game Playing Agents

RL tries to optimize an agent's performance and behavior corresponding to a certain reward function when it is operating in an environment comprised of different states. For RL based game

playing agents, the environment is a particular game which has many well-defined actions, and the agent has to decide which actions must be selected to win the game. Moreover, the encouragement to learn new and good strategies stems from agents getting rewards for winning a game. A finite Markov Decision Process (MDP) [21] is used for RL. An MDP describes the problem of learning from a consequence of actions to achieve a goal. A set of actions and states that forms the foundation of the agent's decision-making process is defined. A finite MDP mainly involves actions and the environment, which is made up of everything other than the agent. An MDP requires:

1. A finite set of actions (potential positions on the board of a game)

2. A finite set of states (a distinct arrangement on the board of a game)

3. A function of reward to return a value based on performing a specific action in a given state. This means that if the agent performs a particular action that results in finally winning the game from a given state, then R = 1. Otherwise, if the agent blunders and picks the erroneous action leading to eventually losing the game then R = -1. Else, in a situation where none of the above happens, the reward is R = 0.

4. A transition function to provide the probability of moving from an existing state to a specific new state when performing a certain action. The necessity for a robust transition function arises in situations where there is uncertainty about whether an action will eventually lead to a certain desired result.

Finite states, actions, and rewards make up a finite MDP. The framework of a finite MDP acts as an aid in formalizing any problem so that actions, depending on a current state, can be easily recognized and thereby used to maximize the agent's total reward during a game. To understand the RL approach, consider a game of Tic-Tac-Toe and assume an imperfect opponent, i.e. an opponent that can lose or make mistakes. Next, as shown in Table 2.1, a table with one entry per state can be considered, where each entry is an estimate of the probability of a win from that particular state. This signifies the value of that state. Assuming the agent plays O's, the estimated probability of winning is initialized as V[s = a state with three O's in a row] = 1, V[s = a state with three X's in a row] = -1, and V[s = all draw states] = 0. Table 2.1 can help in understanding the RL approach to Tic-Tac-Toe. Many games are played to train, and the subsequent moves at any point are chosen by looking ahead one step. The next state with the highest estimated probability of winning, a greedy move, is generally picked. This determines the learning rule for the RL based game playing agent.
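The value table and greedy one-step lookahead described above can be outlined in code as follows. This is an illustrative sketch only: the string encoding of a board state, the default value of 0.5 for unseen states, and the helper names are assumptions of the example, not the thesis implementation.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ValueTableSketch {
    // V(s): estimated probability of winning from state s, keyed by a string
    // encoding of the board (hypothetical encoding such as "XO..X.O..").
    private final Map<String, Double> value = new HashMap<>();

    // Reward as defined for the MDP: +1 for a win, -1 for a loss, 0 otherwise.
    public double reward(boolean won, boolean lost) {
        if (won)  return 1.0;
        if (lost) return -1.0;
        return 0.0;
    }

    // Look up V(s), defaulting to 0.5 for states whose outcome is still unknown.
    public double valueOf(String state) {
        return value.getOrDefault(state, 0.5);
    }

    // Greedy one-step lookahead: among the states reachable in one move,
    // pick the one with the highest estimated probability of winning.
    public String greedyNextState(List<String> reachableStates) {
        String best = reachableStates.get(0);
        for (String s : reachableStates) {
            if (valueOf(s) > valueOf(best)) {
                best = s;
            }
        }
        return best;
    }
}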

2.4.1 Reinforcement Learning Algorithms
There are two main types of RL algorithms:

1. Model-based RL algorithms: These try to use experience to form an internal model of the transitions and immediate outcomes in the environment and then choose the optimal policy based on the learned model. To do this, a model can be learnt by running a base policy, such as a random policy or any educated policy, while the observed trajectory is used to plan through a control function f(s,a) to choose the optimal actions.

[Table 2.1 lists example Tic-Tac-Toe board positions (board diagrams omitted) together with their estimated values V(s): intermediate positions valued at 0.5, a position with three O's in a row valued at 1 (WIN), a position with three X's in a row valued at -1 (LOSS), and a terminal position valued at 0.]

Table 2.1: Table to Understand RL Approach to Tic-Tac-Toe.

2. Model-free RL algorithms: These use experience to directly update either state/action values or policies, or both, which can then attain the identical optimal behavior but without the use of a world model or estimation, i.e., the agent uses a trial-and-error method for achieving an optimal policy. Q-learning is a model-free RL algorithm. It learns the action-value function Q(s,a), which is a scalar value assigned to an action a given the state s.

For our research we use the Q-learning framework, a model-free RL algorithm, because model-based methods are generally limited to specific types of tasks despite having more assumptions and approximations for a particular task. Moreover, Q-learning converges with probability one under reasonable conditions on the learning rates and the Markovian environment.

Q-learning
Q-learning follows value-based learning. It is a reinforcement learning algorithm used to learn the value of an action in a particular state [22]. The Q-learning algorithm operates without learning a model of the environment. The basis of Q-learning is simply updating the Q-value, which is an estimate of how good performing an action would be in a given state, and thereby figuring out how to navigate in an unknown environment. We discuss Q-learning in detail in the following chapter.

The Min-Max Algorithm
As we use Min-Max to train our Q-learning agent, we provide a description of the algorithm. The Min-Max algorithm minimizes the loss in a worst-case scenario. Given the state of the board, it finds the best possible move by modelling all possible continuations from that position and chooses the move that is best for the player. The best possible move is the one with the best outcome, under the assumptions that the player always makes the move that maximizes the game value for it and the opponent always makes the move that maximizes the game value for itself, thereby minimizing the game value for the player [23]. A zero-sum game is a game where a win for one player is a loss for the opponent, i.e. at any point of time, if the total gains of all the players are added together and the total losses are subtracted, the sum total should be equal to zero. For a two-player zero-sum game, the solution given by the Min-Max algorithm is the same as the Nash equilibrium [24]. This implies that for any two-person, zero-sum game with a finite number of strategies, there exists a value V and a mixed strategy for each player, such that Player 1's strategy ensures a payoff of V regardless of Player 2's strategy, and Player 2's strategy ensures a payoff of -V. We designed and coded two types of player classes to implement the Min-Max algorithm:

1. Deterministic Min-Max agent: If there is more than one move with the exact same best score in a given position, a deterministic Min-Max agent will always choose the same move.

2. Non-deterministic Min-Max agent: If there is more than one move with the exact same best score in a given position, a non-deterministic Min-Max agent will choose among them in a stochastic manner. A minimal sketch of this tie-breaking difference is given after this list.
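The only difference between the two player classes is how ties among equally scored moves are broken. The sketch below isolates that difference; the integer move type, the evaluation callback, and the class name are placeholders for illustration, not the actual Min-Max classes written for this thesis.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.ToDoubleFunction;

public class TieBreakSketch {
    // Collect every move whose Min-Max score equals the best score found.
    static List<Integer> bestMoves(List<Integer> moves, ToDoubleFunction<Integer> evaluate) {
        double best = Double.NEGATIVE_INFINITY;
        List<Integer> ties = new ArrayList<>();
        for (int m : moves) {
            double score = evaluate.applyAsDouble(m);
            if (score > best) {
                best = score;
                ties.clear();
                ties.add(m);
            } else if (score == best) {
                ties.add(m);
            }
        }
        return ties;
    }

    // Deterministic agent: always the same choice among tied moves (the first one found).
    static int deterministicChoice(List<Integer> ties) {
        return ties.get(0);
    }

    // Non-deterministic agent: a uniformly random choice among tied moves.
    static int stochasticChoice(List<Integer> ties, Random rng) {
        return ties.get(rng.nextInt(ties.size()));
    }
}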

Deep Q-Learning
One popular variant of the Q-learning algorithm is deep Q-learning. Deep Q-learning derives its inspiration from a biological mechanism, called experience replay, that uses a random sample of prior actions instead of the most recent action to proceed. This means that a deep Q-learning agent learns from a batch of experiences [25]. This method helps in removing correlations in the observation sequence and smoothing changes in the data distribution. Iterative updates are made only periodically to adjust the Q-values towards the target, which further reduces correlations with the target. Deep Q-learning is generally implemented in environments that involve continuous actions and states. Since our research uses games with discrete environments, we use the Q-learning algorithm instead of deep Q-learning, as it is computationally simpler and still effective.

2.5 API for development of Game Playing Agent of Ludii

To develop a game playing agent, first, Ludii loads the game from the Ludii.jar file. Static helper methods are provided by the player.utils.loading.GameLoader class of Ludii. These make it easy to programmatically load a game by passing a single argument containing the filename of the desired game, with a .lud extension. For example, a game of Tic-Tac-Toe may be loaded as:

final Game ticTacToe = GameLoader.loadGameFromName("/Tic-Tac-Toe.lud");

Algorithm 2 Ludii trials
1: final Trial trial = new Trial(game);
2: final Context context = new Context(game, trial);

To run a trial, the game playing agent object that selects the moves during the trials is defined, as shown in Algorithm 3. Lastly, the main loop is implemented to carry out multiple trials. This main loop comprises initializing the game, acquiring a Model that manages the "control flow" of the trial, looping until a terminal game state is reached, examining which player(s) is/are to move, asking the corresponding agent object to decide on the moves, and applying them to the game. An example of this is shown in Algorithm 4.

Algorithm 3 Running a trial
1: final List<AI> agent = new ArrayList<AI>();
2: agent.add(null);
3: for (p = 1; p <= game.players().count(); ++p) do
4:     agent.add(new RLAgent());
5: end for

Algorithm 4 Main loop to carry out multiple trials
1: for (i = 0; i < NUMBEROFTRIALS; ++i) do
2:     game.start(context);
3:     for (p = 1; p <= game.players().count(); ++p) do
4:         agent.get(p).initAI(game, p);
5:     end for
6:     final Model model = context.model();
7:     while (!trial.over()) do
8:         model.startNewStep(context, agent, 1.0);
9:     end while
10: end for

It is important to note that custom Ludii agents are required to be written in Java and to extend the abstract util.AI class. If it is essential to execute initialization steps before starting to play, or to perform cleanup after finishing a game, both initialization and cleanup methods can be overridden for agents, as shown in Algorithm 5.

Algorithm 5 extend the abstract util.AI class
1: public void initAI(final Game game, final int playerID) {}
2: public void closeAI() {}

The initAI() method also tells the agent which player it is expected to play as in the upcoming trial, as shown in Algorithm 6. To use a custom agent, Ludii > Preferences... is selected in the menu bar of the Ludii application, the From JAR option is chosen, and then the JAR file containing the desired custom game playing agent's .class file can be selected from the list of all such JAR files that extend Ludii's util.AI abstract class.

Algorithm 6 initAI() method
1: @Override
2: public void initAI(final Game game, final int playerID) {
3:     this.player = playerID;
4: }
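Putting Algorithms 5 and 6 together, a minimal custom agent might look like the sketch below. This is an assumption-laden outline rather than the agent used in this thesis: the package names and the exact selectAction signature follow the Ludii example-AI tutorial and may differ between Ludii versions, so they should be checked against the util.AI class inside the Ludii.jar actually being used.

// Package paths and the selectAction signature are assumed from the Ludii
// example-AI tutorial; verify them against the Ludii version in use.
import java.util.concurrent.ThreadLocalRandom;
import game.Game;
import main.collections.FastArrayList;
import util.AI;
import util.Context;
import util.Move;

public class SketchAgent extends AI {
    protected int player = -1;

    public SketchAgent() {
        this.friendlyName = "Sketch Agent";
    }

    @Override
    public Move selectAction(final Game game, final Context context, final double maxSeconds,
                             final int maxIterations, final int maxDepth) {
        // Illustrative behaviour only: pick uniformly among the legal moves.
        final FastArrayList<Move> legalMoves = game.moves(context).moves();
        return legalMoves.get(ThreadLocalRandom.current().nextInt(legalMoves.size()));
    }

    @Override
    public void initAI(final Game game, final int playerID) {
        this.player = playerID;   // remember which player this agent controls (Algorithm 6)
    }

    @Override
    public void closeAI() {
        // No cleanup needed for this sketch (Algorithm 5).
    }
}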

Chapter 3: Training a Q-learning Agent

In this chapter, as shown in Figure 3.1, our goal is to analyze the training of a Q-learning based game playing agent with respect to training methodologies for abstract games. It is important to note that Q-learning is a model-free algorithm, implying that it requires no model of the environment. This makes Q-learning capable of dealing with stochastic transitions and rewards, with no need for adaptations and alterations. We analyze how to train a Q-learning based agent by using different training partners in the RL methodology. Specifically, we test how the training works using the agent itself, a deterministic Min-Max agent, and a non-deterministic Min-Max agent as the teaching agent.

Figure 3.1: Contents of Chapter 3 Training.

We report the number of episodes needed when training is done against itself and against opponent agents as trained by us. Here, episodes signify all the states between the initial and final state, i.e. one whole game. We also explore how quickly our agents train to convergence under different training opponents. We also provide a method to create weakened agents. These agents are useful as they help

learning humans play against opponents that make errors. We use these weakened agents to run an experiment where we observe how important it is to have good opponents for training a Q-learning agent. To confirm our hypothesis that better teaching agents accelerate training, we experimented with training our agent against different agents, where each teaching agent has been trained for an increasing number of episodes going from first to last.

3.1 Details of the Q-learning Algorithm

Q-learning is a model-free RL algorithm, which can be seen as a method of asynchronous dynamic programming (DP). The Q-learning policy can be described as a function Q that is maximized at a given state s when the agent performs an action a. Here, Q is the expected reward the agent will get at the end of the game on performing a particular action a at a specific state s. This acts as an encouragement for the agent to learn and pick the actions that maximize Q, as it wants to increase the reward [26]. The computation that forms the foundation of Q-learning is:

\[ a_{\text{best}} = \arg\max_{a \in A} Q(s,a) \tag{3.1} \]

To successfully calculate Q(s,a), all probable pairs of states and actions have to be looked into by the agent while obtaining feedback from the reward function R(s,a). Q(s,a) can be revised repetitively by making the agent play several games against itself or an opponent. The equation used to update the value of Q is based on the well-known equation within RL called the Bellman equation [27]. The equation to update Q values is:

\[ Q(s,a)_{\text{new}} \leftarrow (1-\alpha)\,Q(s,a) + \alpha\left(R(s,a) + \gamma \max_{\hat{a} \in A} Q(\hat{s},\hat{a})\right) \tag{3.2} \]

Here, a is the action being performed in the current state s, ŝ is the new state reached after performing action a, â is the best possible action in state ŝ, α is the learning factor, and γ is the discount factor.
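As a concrete illustration of Equation 3.2, the tabular update can be written as in the sketch below. The map keyed by a "state|action" string is an assumption made for this example rather than the data structure used in the thesis implementation; the parameter values match those selected in Section 3.2.

import java.util.HashMap;
import java.util.Map;

public class QTableSketch {
    private final Map<String, Double> q = new HashMap<>();  // Q(s,a), keyed by "state|action"
    private final double alpha = 0.4;   // learning factor (Section 3.2)
    private final double gamma = 0.9;   // discount factor (Section 3.2)

    private String key(String state, String action) {
        return state + "|" + action;
    }

    public double get(String state, String action) {
        return q.getOrDefault(key(state, action), 0.0);
    }

    // Highest Q-value over the actions available in the given state
    // (0 for a terminal state with no further actions).
    public double maxQ(String state, Iterable<String> actions) {
        double best = Double.NEGATIVE_INFINITY;
        for (String a : actions) {
            best = Math.max(best, get(state, a));
        }
        return (best == Double.NEGATIVE_INFINITY) ? 0.0 : best;
    }

    // Equation 3.2: Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (R(s,a) + gamma * max Q(s', a'))
    public void update(String state, String action, double reward,
                       String nextState, Iterable<String> nextActions) {
        double target = reward + gamma * maxQ(nextState, nextActions);
        double updated = (1.0 - alpha) * get(state, action) + alpha * target;
        q.put(key(state, action), updated);
    }
}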

3.2 How to select Learning Factor and Discount Factor

As shown in Equation 3.2, by playing multiple iterations of the game and updating the value of Q for a particular state s and action a using the given equation at the end of each game, convergence on the true Q(s,a) will be observed. The learning factor α determines to what degree old values are overwritten. If we choose α as 0, we do not change anything; if we choose α as 1, we completely replace the old value with the new value; if we choose α as 0.5, we get the average between the old value and the new value. Even though choosing a higher value of α would result in quicker learning, for the sake of our experiments we do not want to lose the past value entirely and use a value similar to Wang et al. [28]. The value of the discount factor γ determines the weight given to future rewards compared to present rewards at the current time step, i.e., it is the value of future rewards. It

can change the learning quite a bit and can be either a dynamic value or a static value. At any step k of the algorithm, the reward received is discounted by γ^k [29]. If the value of γ is equal to one, the agent values a future reward just as much as a current reward. This means that if an agent does something good ten actions in the future, it is valued just as much as doing that action immediately, whereas a γ of zero will cause the agent to only value immediate rewards, which only works with very detailed reward functions. Since we are only interested in rewards for end-of-game states (win, loss, or draw), and even though a γ of 1 would mean that the agent is "farsighted", we intend to keep γ less than 1 because there is a time limit to the sequence of actions needed to solve a task, as rewards are preferred sooner rather than later [30]. Although Q-learning converges to the optimal policy only under the discounted-reward infinite-horizon criterion, keeping time limitations in consideration, after a few trials (with α = 0.1 and γ = 0.95, α = 0.1 and γ = 0.9, α = 0.4 and γ = 0.4, and α = 0.4 and γ = 0.9) and following the literature [31], γ = 0.9 and α = 0.4 are used in our experiments.

3.3 Training Methodology

Algorithm 7 Pseudo-code for the training phase to learn the values of Q(s,a)
1: Initialize: Q(s,a) = 0 for all states s and actions a,
2: Starting state s,
3: Starting player P,
4: Number of training games N;
5: for (t = 1; t <= N; t++) do
6:     Reset the game to the starting state s
7:     Play one game: at each move, with probability ε, P picks a random action a;
8:     otherwise, P picks the action a that maximizes Q(s,a)
9:     if (result == win) then reward R(s,a) = winValue
10:    if (result == loss) then reward R(s,a) = lossValue
11:    if (result == draw) then reward R(s,a) = drawValue
12:    while (!starting state) do
13:        Step back through the game history: observe state ŝ and reward R(s,a)
14:        Update Q(s,a) ← (1 − α)·Q(s,a) + α·(R(s,a) + γ max_{â∈A} Q(ŝ, â))
15:    end while
16:    Switch turn, P = second player
17: end for

To train a Q-learning agent in an RL based framework, it needs to learn the values of Q(s,a). This training is done by using two agents to play against each other, including the possibility of having the learning agent play against itself. We will call the agent that is being trained the "Learning Agent" and the agent that is being used to train it the "Teaching Agent". The probability of an agent choosing a random action is given by ε; otherwise, it will choose to perform the best known action according to Q(s,a). This allows a learning agent to sometimes

explore new actions and at other times exploit the information that it has already learned. The pseudo-code for our training phase is shown in Algorithm 7. To understand how to get good Q values, assume that a table is created where the Q value for every potential state and move is stored. Tabulated Q values might be calculated in the manner explained below for an example in Tic-Tac-Toe.

Figure 3.2: Illustration of how the Q-agent trains.

Figure 3.2 shows an example set of steps for a Tic-Tac-Toe game where the learning agent plays with a cross and is being trained against a teaching agent (it does not matter whether it is a human or an RL agent) playing naught, which in this example is a random agent. Since the learning agent's last move leads to a win, that action is rewarded with +1. The Q-value becomes 0 + α × 1; since α = 0.4, the Q-value becomes 0.4. This value is used to update the previous action in the game history. Since the moves before this move do not end the game, there is no direct reward. The discount factor γ is set to 0.9 and the maximum Q-value for the next state is 0.4, so we update the Q-value to 0.4 × (0.9 × 0.4) = 0.144. Going one more state back in history, the starting position is reached. The maximum Q-value for the next state is 0.144, so the new Q-value for the first move turns out to be 0.4 × (0.9 × 0.144) = 0.05184. To train a Q-agent, the process is repeated for many episodes. The number of iterations, denoted by N in Algorithm 7, must be fairly substantial, and we will describe in the next section how we determine N experimentally. The value of N will change for different games, different teaching

agents, and the value of N gets larger with increased state-space complexity of the board game being trained for, as we will show later in this chapter. Once the value of the Q function is known, Tic-Tac-Toe is played simply by observing the Q values for all the moves possible in the current state, and the move with the highest Q value is picked. In situations where there is more than one possible move with the same highest value, a move is chosen randomly amongst the options available. Having the highest value means that this move is the best move in a given situation. Occasionally a move at random, despite a lower Q-value, can also be picked. This is known as an exploratory move. A sketch of this move-selection step is given below.
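The play-time move selection just described, picking the highest-valued move, breaking ties at random, and occasionally exploring, can be sketched as follows. The generic move type, the Q-value callback, and the exploration rate of 0.1 are placeholders for illustration and are not taken from the thesis code.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.ToDoubleFunction;

public class MoveSelectionSketch {
    private final Random rng = new Random();
    private final double epsilon = 0.1;   // hypothetical exploration rate for this sketch

    // Pick the move with the highest Q-value, break ties at random, and with
    // probability epsilon make an exploratory (random) move instead.
    public <M> M selectMove(List<M> legalMoves, ToDoubleFunction<M> qValue) {
        if (rng.nextDouble() < epsilon) {
            return legalMoves.get(rng.nextInt(legalMoves.size()));   // exploratory move
        }
        double best = Double.NEGATIVE_INFINITY;
        List<M> ties = new ArrayList<>();
        for (M move : legalMoves) {
            double v = qValue.applyAsDouble(move);
            if (v > best) {
                best = v;
                ties.clear();
                ties.add(move);
            } else if (v == best) {
                ties.add(move);
            }
        }
        return ties.get(rng.nextInt(ties.size()));
    }
}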

3.4 Stopping Criteria For Training

In this section, we answer the following questions with respect to training time: How many generations does a learning agent need to be trained for before the algorithm converges on "the best solution"? What is "the best solution"? How do we decide if the agent is good enough? These questions relate to how we find N (the number of training iterations) in Algorithm 7.

3.4.1 Definitions: Convergence until Stable versus Convergence until Fully Solved Before we delve into training our agents it is important to understand the two conditions of con- vergence that we are working with. In our research, “Convergence until Fully Solved” (Convergence-FS) happens once a learning agent appears to have achieved perfect play, that is meets our stopping criteria against a strong opponent. In our methodology, we perform an extra 5000 trials of training to ensure that the agent hasn’t just gotten ”lucky”, and that training has truly Converged-FS. Convergence until stability (Convergence-ST) is the number of epochs an agent requires to initially reach a win/draw to loss ratio which remains constant for at least the next 20 trials with a fluctuation of one. This means that Convergence-ST is a defined stopping mechanism, i.e. understanding it gives us guidelines on observing the number of games required to achieve a win/draw to loss ratio that can be categorised as stable despite minimal perturbations. It is interesting to note that, for any learning agent if the teaching agent is too weak, then it might attain Convergence-ST, however, still remain unable to achieve Convergance-FS . We show our results for N, the number of training epochs required for algorithm to converge until fully solved, in the following sections for each of the three games. To answer these questions we use our previously described abstract games - Tic-Tac-Toe [32], Nine-Men’s Morris[33], and Mancala [34], which are all solved zero-sum games. This means that one agent’s gain for any of these games is equivalent to the other one’s loss, and that the optimal policy for all these games has previously been found for different algorithms. Note that each of the games we have chosen has a converged learning state of the agent as follows. For Tic-Tac-Toe, the state of a forced draw assuming best play from both players[32]. It

has been shown that with perfect play from both players, the game of Nine-Men's Morris results in a draw [33]. Mancala is a solved game in which the first player wins if both players play perfectly [34].
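The Convergence-ST stopping criterion can be expressed as a check over the recent history of win/draw to loss ratios. The sketch below is one possible reading of that criterion, holding the ratio within a fluctuation of one for 20 consecutive trials; the window size and tolerance come from the definition above, while the sliding-window data structure is an assumption of this sketch.

import java.util.ArrayDeque;
import java.util.Deque;

public class ConvergenceCheck {
    private static final int WINDOW = 20;          // consecutive trials the ratio must hold
    private static final double TOLERANCE = 1.0;   // allowed fluctuation of the ratio

    private final Deque<Double> recentRatios = new ArrayDeque<>();

    // Record the win/draw to loss ratio observed for the latest block of games
    // and report whether Convergence-ST has been reached.
    public boolean recordAndCheck(int winsAndDraws, int losses) {
        double ratio = (losses == 0) ? winsAndDraws : (double) winsAndDraws / losses;
        recentRatios.addLast(ratio);
        if (recentRatios.size() > WINDOW) {
            recentRatios.removeFirst();
        }
        if (recentRatios.size() < WINDOW) {
            return false;                          // not enough history yet
        }
        double min = Double.MAX_VALUE;
        double max = -Double.MAX_VALUE;
        for (double r : recentRatios) {
            min = Math.min(min, r);
            max = Math.max(max, r);
        }
        return (max - min) <= TOLERANCE;           // stable: fluctuation of at most one
    }
}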

3.5 Training Tic-Tac-Toe Agent Against Different Teaching Agents

To see how a Q-learning agent trains, a learning agent for the game of Tic-Tac-Toe is trained against the following teaching agents:

1. Q-learning based game playing agent: The twin of the learning agent, which itself is changing in each generation.

2. Non-deterministic Min-Max agent: A Min-Max agent designed by us such that it randomly chooses a move if there is more than one move with the exact same best score in a given position.

3. Deterministic Min-Max agent: A Min-Max agent designed by us such that it always chooses the same move if there is more than one move with the exact same best score in a given position.

4. Random agent: An agent that makes its move purely randomly, i.e., it looks for a random empty spot on the board and puts its piece there.

The Q-learning learning agent trained with itself as the teaching agent is shown in Figure 3.3, and it reaches the convergence point after approximately eight thousand games. Note that in this figure, the curve in red denotes games that end in player 1 winning, which is the learning agent (the Q-learning agent being trained), the green line denotes games that end in player 2 (the teaching agent) winning, and the blue line denotes games that end in a draw. In this figure, the x-axis shows the training game number proceeding from 0 to the convergence number plus 5000, where a game number is equivalent to a training generation, and the y-axis is the cumulative game outcome in percentage. Note that there are 100 games per each "Game Number" that give us an outcome in % on the y-axis. For example, for "Game Number" 2000 on the x-axis in Figure 3.3, the number of draws over 100 games is 10 (in blue), player 1 (the learning agent) wins 40 (in red), and player 2 (the teaching agent) wins 50 (in green). This format for showing training is reused throughout this thesis. When the learning agent that is based on Q-learning is trained against a non-deterministic Min-Max teaching agent, the teaching agent randomly chooses a move for board positions where more than one move with the maximum evaluation exists. Figure 3.4 shows that the learning agent converges after approximately six thousand games, and this graph follows the same conventions as Figure 3.3. Note that the learning agent (Q-learning agent) has a straight line parallel to the x-axis; this is because, for the scope of this research, in the game of Tic-Tac-Toe a draw is as good as a win condition and the agent pursues a draw. When the teaching agent is a deterministic Min-Max agent, this agent always makes the same move when a given board position results in more than one possible best move.

Figure 3.3: Q-learning based game playing agent for Tic-Tac-Toe learning by playing against a Q-learning based player.

Figure 3.4: Q-learning based game playing agent for Tic-Tac-Toe learning by playing against a non-deterministic Min-Max player.

Figure 3.5 shows that the learning agent converges after approximately two hundred games. We hypothesize that this teaching agent has less randomness in its behavior and helps the learning agent converge to a good solution much more rapidly than the other teaching agents. Also, note that the learning agent never finds a winning strategy and that this line remains at 0. This is again because a draw, for Tic-Tac-Toe, is just as good as a win. We also observe how a learning agent trains with a random teaching agent. Figure 3.6 shows that even after fifty thousand games, our agent was still losing games. This result makes sense since the learning agent cannot find good strategies (unless by random chance) from a teaching agent that cannot play the game, at least, sufficiently well. The learning agent does perform well in terms of winning, but it never converges according to our stopping criteria definition. The Q-learning learning agent is, finally, trained against a fully converged Q-learning teaching agent. Figure 3.7 shows that the learning agent converges after three hundred games, and this graph follows the same conventions as Figure 3.3. Note that, again, we see similar results to the deterministic Min-Max agent, and stabilization occurs very quickly.

Figure 3.5: Q-learning based game playing agent for Tic-Tac-Toe learning by playing against a deterministic Min-Max player.

Figure 3.6: Q-learning based game playing agent for Tic-Tac-Toe learning by playing against a random player.

3.5.1 Summary of Training Convergence for Tic-Tac-Toe

Teaching Agent               N until Convergence
Q-learning                   8000
Non-deterministic Min-Max    6000
Deterministic Min-Max        200
Random                       -
Fully Converged Q-learning   300

Table 3.1: Summary of Training until Convergence for Tic-Tac-Toe with Different Teaching Agents

Table 3.1 shows the convergence results for all of the different scenarios. Column 1 lists the teaching agent and column 2 shows the approximate number of games at which the learning agent (Q-learning in all cases) converges based on our stopping criteria. We can see that the deterministic Min-Max agent is the best teaching agent, as it provides the clearest guidance to the learning agent, but we also see that using a converged Q-learning agent gives similar results. In cases where no existing teaching agent is available, using the twinned Q-learning agent is a viable means to train an agent; however, to accelerate training for a known game with finite states, the Q-learning based player should be trained against a higher-quality agent. Currently, we have no hypothesis on why the deterministic Min-Max agent is better than the converged Q-learning agent.

Figure 3.7: Q-learning based game playing agent for Tic-Tac-Toe learning by playing against a fully converged Q-learning based game playing agent.

3.6 Training Nine-Men’s Morris Agent Against Different Teaching Agents

The Q-learning based player for Nine-Men’s Morris, a game with a state-space complexity of 10, was also trained with different teaching agents. Note that we no longer use the random agent, as this type of training was shown not to be useful for the previous game. As discussed in section 3.4, training for Nine-Men’s Morris converges successfully when the game ends in a draw. The Q-learning learning agent trained with itself as the teaching agent is shown in Figure 3.8, and it reaches its convergence point after approximately sixteen thousand games. Figure 3.9 shows that the learning agent converges after approximately seven thousand games against a non-deterministic Min-Max teaching agent. Figure 3.10 shows that the learning agent converges after approximately two thousand games against a deterministic Min-Max teaching agent. Figure 3.11 shows that the learning agent converges after approximately three thousand games against a fully converged Q-learning teaching agent.

Figure 3.8: Q-learning based game playing agent for Nine-Men’s Morris learning by playing against itself.

Figure 3.9: Q-learning based game playing agent for Nine-Men’s Morris learning by playing against a non-deterministic Min-Max player.

Figure 3.10: Q-learning based game playing agent for Nine-Men’s Morris learning by playing against a deterministic Min-Max player.

Figure 3.11: Q-learning based game playing agent for Nine-Men’s Morris learning by playing against a fully converged Q-Learning player.

Teaching Agent               N until Convergence
Q-learning                   16000
Non-deterministic Min-Max    7000
Deterministic Min-Max        2000
Fully Converged Q-learning   3000

Table 3.2: Summary of Training until Convergence for Nine-Men’s Morris with Different Teaching Agents

3.6.1 Summary of Training Convergence for Nine-Men’s Morris

Table 3.2 shows the convergence results for all of the different teaching agents. Column 1 lists the teaching agent and column 2 shows the approximate number of games at which the learning agent (Q-learning in all cases) converges based on our stopping criteria. With results similar to the Tic-Tac-Toe training, we can see that the deterministic Min-Max agent is the best teaching agent, as it provides the clearest guidance to the learning agent. Still, when no existing teaching agent is available, using the twinned Q-learning agent is a viable means to train an agent.

3.7 Training Mancala Q-learning Agent against Different Teaching Agents

To see whether the observation made in section 3.5, that the Q-learning learning agent converges the quickest when trained against a deterministic Min-Max teaching agent, also holds here, we repeat the experiments for Mancala. As mentioned in section 3.4, for the game of Mancala the first player should win if both players play perfect games. The Q-learning learning agent trained with itself as the teaching agent is shown in Figure 3.12, and it reaches its convergence point after approximately thirty-two thousand games. Figure 3.13 shows that the learning agent converges after approximately seventeen thousand games against a non-deterministic Min-Max teaching agent.

Figure 3.12: Q-learning based game playing agent for the game of Mancala learning by playing against a Q-learning based player.

Figure 3.13: Q-learning based game playing agent for Mancala learning by playing against a non-deterministic Min-Max player.

Figure 3.14: Q-learning based game playing agent for Mancala learning by playing against a deterministic Min-Max player.

Figure 3.14 shows that the learning agent converges after approximately nine thousand games against a deterministic Min-Max teaching agent.

Figure 3.15: Q-learning based game playing agent for Mancala learning by playing against a fully converged Q-Learning based player.

Figure 3.15 shows that the learning agent converges after approximately twelve thousand games with a fully converged Q-Learning based teaching agent.

3.7.1 Summary of Training Convergence for Mancala

Teaching Agent               N until Convergence
Q-learning                   32000
Non-deterministic Min-Max    17000
Deterministic Min-Max        9000
Fully Converged Q-learning   12000

Table 3.3: Summary of Training until Convergence for Mancala with Different Teaching Agents

Table 3.3 again shows results similar to those of the previous two games, in that a deterministic Min-Max teaching agent provides the most efficient training of our learning agent. The results follow a similar pattern to the previous games; however, we can see that as the state-space complexity increases, so does the number of games N required for convergence.

3.8 Computational Run-Time Complexity for Q-learning and Min-Max Algorithms

The Q-learning agent for each game in this research trains against other teaching agents. We designed and coded two types of player classes to implement the Min-Max algorithm. To understand the computational run-time complexity of the Q-learning, deterministic Min-Max, and non-deterministic Min-Max algorithms, we report the time required to run one cycle of training. We focus on a single cycle because the per-cycle training cost is constant, i.e., it does not increase at any point during training.
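One way the per-cycle run-times below could be measured is sketched here: wrap a single training cycle in a wall-clock timer and average over a few repeats. run_one_training_cycle is a hypothetical placeholder for one game plus the Q-matrix update, not the thesis code.

```python
import time

def run_one_training_cycle(learner, teacher):
    """Hypothetical placeholder: one training game plus the Q-matrix update."""
    time.sleep(0.01)  # stand-in for the real work

def time_one_cycle(learner, teacher, repeats=10):
    """Return the average wall-clock time of one training cycle in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        run_one_training_cycle(learner, teacher)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / repeats

if __name__ == "__main__":
    print(f"{time_one_cycle(None, None):.1f} ms per cycle")
```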

Agent                        Run-time required (milliseconds)
Q-learning                   1750
Non-deterministic Min-Max    2250
Deterministic Min-Max        70510

Table 3.4: Summary of Run-Time Required for One Cycle of Training of Tic-Tac-Toe With Different Agents

Agent                        Run-time required (milliseconds)
Q-learning                   6750
Non-deterministic Min-Max    16850
Deterministic Min-Max        59565

Table 3.5: Summary of Run-Time Required for One Cycle of Training of Nine Men’s Morris With Different Agents

Agent                        Run-time required (milliseconds)
Q-learning                   9150
Non-deterministic Min-Max    22785
Deterministic Min-Max        45570

Table 3.6: Summary of Run-Time Required for One Cycle of Training of Mancala With Different Agents

Tables 3.4, 3.5, and 3.6 show the computational run-times for each of the games we study in this thesis. Across all three games, the Q-learning agent requires the least time per training cycle and the deterministic Min-Max agent the most; for the Q-learning and non-deterministic Min-Max agents, the per-cycle cost also grows as the state-space complexity of the game increases.

3.9 Snapshots for Creating Lower Quality Agents

As described earlier, the goal of a game agent is not always to create the “best” agent, but rather to provide humans with artificial opponents who are competitively matched to the humans’ current capabilities. In this section, we provide a methodology for creating these weaker agents by taking snapshots during training.
Our methodology maintains data about the learning agent throughout the training process. This is achieved by saving the Q matrix to a file as the training proceeds; these files are then snapshots of the trained agent at particular points on the training scale. Once we have met our stopping criteria, we have a point in time that we call the parameter N in Algorithm 7. We can then take the snapshot at time N and call it Q-100, which represents the Q-learning agent as 100% trained. Using this approach we can also take other snapshots to create other Q-percent agents, where, for example, a 50% agent uses the snapshot at N/2 as its learned state.
The next question is which teaching agent should be used when creating our snapshots. In this thesis we have selected the Q-learning twin as the teaching agent of choice for our snapshot methodology. The reason is that the convergence graphs for this teaching agent have the most gradual steps in the training process, and we hypothesize that this gradual nature is best suited to our snapshot method. However, we do not have conclusive data on why this approach is best, and we leave this as future work, where an experiment would need to be created in which different snapshot approaches are compared, either with human opponents or with some measured comparison.
Given our methodology, since the Q-learning agent converges against itself after 8000 Tic-Tac-Toe games, we define N = 8000, and we define a ten per cent agent for Tic-Tac-Toe, denoted as Q-T-10, as the Q matrix of the agent after 10 per cent of 8000 games, which is 800 games. In the following chapter, we will use our snapshot agents in the evaluation results to observe how these opponents compare against the converged agents.
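A minimal sketch of the snapshot bookkeeping, assuming the Q matrix is stored as a plain Python dictionary and pickled to disk, is shown below; the file naming scheme and snapshot interval are illustrative assumptions rather than the thesis implementation.

```python
import pickle
from pathlib import Path

SNAPSHOT_DIR = Path("snapshots")   # illustrative location
SNAPSHOT_EVERY = 100               # games between snapshots (assumption)

def save_snapshot(q_matrix, game_number):
    """Write the Q matrix to disk so it can later serve as a weaker agent."""
    SNAPSHOT_DIR.mkdir(exist_ok=True)
    with open(SNAPSHOT_DIR / f"q_{game_number:06d}.pkl", "wb") as f:
        pickle.dump(q_matrix, f)

def load_percent_agent(n_convergence, percent):
    """Load the snapshot taken closest to `percent` per cent of N games.

    For example, with N = 8000 and percent = 10 this loads the snapshot
    nearest to game 800, i.e. the Q-T-10 agent for Tic-Tac-Toe.
    """
    target = round(n_convergence * percent / 100 / SNAPSHOT_EVERY) * SNAPSHOT_EVERY
    with open(SNAPSHOT_DIR / f"q_{target:06d}.pkl", "rb") as f:
        return pickle.load(f)
```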

3.10 Confirming the Importance of Good Teaching Agents

With our snapshot agents, we perform one final training experiment in this chapter. We hypothesize that better teaching agents accelerate training. We have already shown that a random teaching agent does not produce a learning agent that satisfies our stopping condition. To test this further, we train our agent against the different snapshot agents, starting with Q-T-10, followed by Q-T-20, and so on. We then compare the number of episodes required for convergence in each case with the number of episodes required when the Q-learning based player learns by playing against itself. Tables 3.7, 3.8, and 3.9 show our results in the same form as the previous summary tables; a sketch of how this experiment can be driven appears after Table 3.9. Note that the agent trained here is not trained until fully converged, but only until it first reaches a stable win/draw to loss ratio, where a “stable” ratio means the win/draw to loss ratio remains constant, with a fluctuation of at most one, for at least the next 20 trials.
Table 3.7 shows that better teaching agents accelerate training almost linearly for the game of Tic-Tac-Toe. Column 1 shows the teaching agent and column 2 shows the number of episodes until convergence. Table 3.8 shows the same results with the same structure for the game of Nine Men’s Morris; again, better teaching agents accelerate training. Table 3.9 shows that better teaching agents likewise accelerate training for the game of Mancala.

Teaching Agent    N until Convergence for Stability
Q-T-10            7200
Q-T-20            6500
Q-T-30            5700
Q-T-40            4900
Q-T-50            4100
Q-T-60            3300
Q-T-70            2500
Q-T-80            1700
Q-T-90            1000

Table 3.7: Summary of Training until Convergence for Tic-Tac-Toe with Different Snapshot Teaching Agents

Teaching Agent    N until Convergence for Stability
Q-T-10            14500
Q-T-20            13000
Q-T-30            11500
Q-T-40            10000
Q-T-50            8500
Q-T-60            7200
Q-T-70            6000
Q-T-80            5000
Q-T-90            4000

Table 3.8: Summary of Training until Convergence for Nine Men’s Morris with Different Snapshot Teaching Agents

Teaching Agent    N until Convergence for Stability
Q-T-10            29500
Q-T-20            27000
Q-T-30            24500
Q-T-40            22000
Q-T-50            19500
Q-T-60            17500
Q-T-70            16000
Q-T-80            14500
Q-T-90            13000

Table 3.9: Summary of Training until Convergence for Mancala with Different Snapshot Teaching Agents
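For concreteness, the snapshot-teacher experiment above could be driven roughly as in the following sketch; load_percent_agent (from the snapshot sketch in section 3.9) and train_until_stable are assumed helper routines, not part of the thesis code.

```python
def snapshot_teacher_experiment(n_convergence, load_percent_agent, train_until_stable):
    """Train a fresh learner against each snapshot teacher Q-T-10 ... Q-T-90
    and record how many episodes it needs to reach a stable win/draw to
    loss ratio (the quantity reported in Tables 3.7-3.9)."""
    episodes_until_stable = {}
    for percent in range(10, 100, 10):
        teacher = load_percent_agent(n_convergence, percent)
        episodes_until_stable[f"Q-T-{percent}"] = train_until_stable(teacher)
    return episodes_until_stable
```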

Chapter 4: Evaluation of RL Agents

The next question with RL based frameworks that we address in this thesis is how we can compare a range of agents. In particular, we want to evaluate our Q-learning agent and its weakened snapshot agents against existing algorithms to understand the relationship between the quality of these agents. More importantly, this chapter presents a methodology for evaluating a range of agents for abstract board games. We provide details and results of this evaluation approach, noting that it can be modified and used by other researchers for other agent-comparison needs.
To evaluate the performance of AI agents, the first step is to define their intended purpose and what constitutes success. In this thesis, the intended purpose is to see how well the agents perform in competitive multi-player abstract games of varying state-space complexity against a range of agents. A key property of each of the games under investigation is that they are solved games, even though each game has a different state-space complexity.
In previous research, a number of methods have been proposed to evaluate players/agents. We are dealing with multi-player games and the opponents are not always constant. Thus, one popular approach in these spaces is aggregated relative performance [35], a performance metric that suits our needs. There are two different ways in which aggregating the results can help quantify the performance of an agent:

1. computing the win rates by distributing points per match-up (this is used in many sports leagues) [35]

2. using iterative measures such as player rating (for example, MMR in StarCraft II, ELO in chess, etc.) [35]

Since we do not have an established ranking system to rate the players/agents, we focus on computing win rates for our evaluation method. The aim here is to run multiple matches of a game playing agent against multiple other agents and against itself. The number of matches won, lost, or drawn by a particular agent for each of the three games (Tic-Tac-Toe, Nine-Men’s Morris, and Mancala) will be recorded and presented in this chapter.
The first step, however, is to define what “good” means in terms of wins, draws, and losses. Next, we describe a methodology to identify the number of games to be played between a pair of opponents such that the result is very likely to be stable, meaning that the reported win, loss, and draw rates will show only small perturbations if the number of games is increased. Finally, we use this number of games to execute a round-robin tournament that compares the agents in this study.
In terms of chapter structure, we treat each game by first defining what is “good” for an agent, then finding the necessary number of matches to be played, and finally providing results from the round-robin tournament. For each of the three games (Tic-Tac-Toe, Nine Men’s Morris, and Mancala) we hold a tournament with the following agents:

33 • A fully converged Q-learning agent, Q-100

• Q-learning agent at 50 per cent, Q-50

• Non-deterministic Min Max (Nd-Min-Max)

• Deterministic Min Max (D-Min-Max)

• Random (rand)

4.1 Evaluating Tic-Tac-Toe

4.1.1 Defining a “Good” game agent for Tic-Tac-Toe

For Tic-Tac-Toe, a converged, solved agent playing another such agent results in a forced draw, assuming best play from both players, as discussed in section 3.4. So, to quantitatively decide that an agent is “good”, we would expect it to draw against another “good” agent, while against a weaker agent we would expect a “good” agent to win. An interesting aspect of training is that the agent may be motivated to draw; therefore, this tournament structure will reveal whether one RL agent is superior to another by whether it wins against weaker agents.

Player-1         Player-2         Player-1 wins    Player-2 wins    Draws
Min-Max agent    Random agent     99.5 per cent    0 per cent       0.5 per cent
Random agent     Min-Max agent    0 per cent       80 per cent      20 per cent
Min-Max agent    Min-Max agent    0 per cent       0 per cent       100 per cent

Table 4.1: The baseline after 10000 games of Tic-Tac-Toe with the Min-Max and Random agents as players

Table 4.1 shows some simple baseline results for “good”. In this experiment we show that, as player 1, a “good” agent should win almost all the time against a random agent. The Min-Max agent as player 2 versus a random agent wins fewer games, approximately 80%, and draws the remainder. In general, the minimum requirement of an agent is to draw each game, and better agents will win more. Second, we can see that the results differ depending on the agent’s playing position. Therefore, our full round-robin tournament will have agents play both as player 1 and as player 2.

4.1.2 Method to find the Number of Games for a Stable Result for Tic-Tac-Toe between Two Agents

Our setup for evaluation is a round-robin tournament, but the question remains: how many matches need to be played between opponents to find a stable result? For example, two converged agents only need to play one game, since every game between them results in a draw because both agents are sufficiently “good”;

Figure 4.1: 10000 games between Min-Max and Min-Max players to establish a baseline

Figure 4.2: 10000 games between Random and Min-Max players to establish a baseline

Figure 4.1 shows the instant stability of this pairing. However, two random agents will never come to a stable result and will fluctuate in wins, draws, and losses. Figure 4.2 shows a Random agent versus a Min-Max agent; in this graph we can see how the Min-Max agent wins most of the time and draws many games.
The goal here is to find NtrialsTTT, the number of games played between opponents needed to find a stable result. To do this, we run a set of experiments, find the number of games needed for stable results under a number of conditions, and take the maximum. We start with a constant N1, which we set to 100 because this number is computationally feasible for our experiments. We define the number of trials required for a “stable” win/draw to loss ratio as the number of games an agent requires to first reach a win/draw to loss ratio which then remains constant, with a fluctuation of at most one, for at least the next 20 trials. Since fully converged trained agents reach a “stable” ratio after a single game, we use weaker agents, such as the snapshot agents spaced at regular intervals, to find the total number of trials required. These findings are summarised in Table 4.2.
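To make the stability criterion concrete, the sketch below (a minimal illustration, not the thesis code) counts games until the cumulative win/draw to loss ratio has stayed within a fluctuation of one for 20 consecutive games; play_game is an assumed callable returning the outcome of one match from player 1's perspective.

```python
def games_until_stable(play_game, window=20, tolerance=1, max_games=10_000):
    """Return the number of games needed before the win/draw to loss ratio
    stays within `tolerance` of a fixed value for `window` consecutive games.

    `play_game()` is assumed to return 'win', 'draw', or 'loss' from the
    perspective of player 1.
    """
    wins_draws, losses = 0, 0
    history = []
    for n in range(1, max_games + 1):
        outcome = play_game()
        if outcome in ("win", "draw"):
            wins_draws += 1
        else:
            losses += 1
        # Use a ratio that is defined even before the first loss occurs.
        ratio = wins_draws / max(losses, 1)
        history.append(ratio)
        recent = history[-window:]
        if len(recent) == window and max(recent) - min(recent) <= tolerance:
            return n - window + 1   # first game of the stable stretch
    return None  # never stabilised within max_games (e.g. the random agent)
```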

Player-1    Player-2      No. of Episodes for Stability
Q-20        rand          NT1a = 35
Q-20        Nd-Min-Max    NT1b = 20
Q-20        D-Min-Max     NT1c = 7
Q-20        Q-50          NT1d = 27
Q-20        Q-100         NT1e = 23
Q-40        rand          NT2a = 32
Q-40        Nd-Min-Max    NT2b = 18
Q-40        D-Min-Max     NT2c = 5
Q-40        Q-50          NT2d = 25
Q-40        Q-100         NT2e = 20
Q-60        rand          NT3a = 30
Q-60        Nd-Min-Max    NT3b = 15
Q-60        D-Min-Max     NT3c = 3
Q-60        Q-50          NT3d = 23
Q-60        Q-100         NT3e = 17
Q-80        rand          NT4a = 26
Q-80        Nd-Min-Max    NT4b = 11
Q-80        D-Min-Max     NT4c = 2
Q-80        Q-50          NT4d = 20
Q-80        Q-100         NT4e = 15

Table 4.2: Number of trials the Q-learning agent for Tic-Tac-Toe requires to reach a stable win/draw to loss ratio

The number of trials for the Tic-Tac-Toe tournament can then be determined by the following equation:

N_{trialsTTT} = \max(N_1, N_{T1a}, N_{T1b}, N_{T1c}, N_{T1d}, N_{T1e}, N_{T2a}, N_{T2b}, N_{T2c}, N_{T2d}, N_{T2e},
                     N_{T3a}, N_{T3b}, N_{T3c}, N_{T3d}, N_{T3e}, N_{T4a}, N_{T4b}, N_{T4c}, N_{T4d}, N_{T4e})        (4.1)

The total number of trials required to set up a Tic-Tac-Toe tournament is thus NtrialsTTT = 100, which, although greater than all of the experimentally observed values in the table, is still computationally feasible for running the tournament. We could save some time by instead using the largest experimentally identified number in Table 4.2, which is 35.

4.1.3 Round-Robin Match Tournament Results for the Game of Tic-Tac-Toe

Now that we have found the total number of trials required for stable results, NtrialsTTT = 100, we perform and present the tournament results for all of the agents. Table 4.3 contains the summary of the tournament, with column 1 listing the agents acting as player 1 and row 1 listing the agents acting as player 2, where the players are: the Q-learning agent at 100 per cent (Q-100), the Q-learning agent at 50 per cent (Q-50), the deterministic Min-Max agent (Mdet), the non-deterministic Min-Max agent (MNdet), and the random agent (R). Each entry is of the form P1-wins:draws:P1-losses, where, for example, 2:94:4 means that player 1 won 2 games, drew 94 games, and lost 4 games. The last column shows the total points earned by a particular agent, where 3 points are given for each game won and 1 point for each game drawn.

P1/P2    Q-100      Q-50       Mdet       MNdet      R          Points
Q-100    0:100:0    5:95:0     0:100:0    0:100:0    95:5:0     700
Q-50     0:75:25    60:5:35    0:50:50    0:52:48    5:55:40    432
Mdet     0:100:0    10:90:0    0:100:0    0:100:0    100:0:0    720
MNdet    0:100:0    5:95:0     0:100:0    0:100:0    90:10:0    690
R        15:10:75   0:15:85    0:25:75    0:20:80    65:5:30    315

Table 4.3: Tic-Tac-Toe tournament to look at win/draw rate for each of the players

Table 4.3 shows that the fully converged Q-learning (Q-100) agent is a “good” agent and performs almost on par with the deterministic and non-deterministic Min-Max agents. It can also be seen that the Q-learning agent at 50 per cent (Q-50) performs much better than a random agent. By looking at the total points earned by each of the players, we can see that the deterministic Min-Max agent is the best agent, while the fully converged Q-learning agent is a close second.
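As a check on the point totals, the short sketch below tallies one agent's row of wins:draws:losses entries under the stated scheme (3 points per win, 1 per draw). Applied to the deterministic Min-Max row of Table 4.3 it reproduces the 720 points shown, which suggests the totals count each agent's games as player 1; the helper name row_points is ours, not from the thesis.

```python
def row_points(row):
    """Total points for one agent's row of results (as player 1).

    `row` is a list of 'wins:draws:losses' strings, one per opponent;
    3 points are awarded per win and 1 per draw.
    """
    points = 0
    for entry in row:
        wins, draws, _losses = (int(x) for x in entry.split(":"))
        points += 3 * wins + draws
    return points

# Deterministic Min-Max row of the Tic-Tac-Toe tournament (Table 4.3).
mdet_row = ["0:100:0", "10:90:0", "0:100:0", "0:100:0", "100:0:0"]
print(row_points(mdet_row))  # 720, matching the table
```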

4.2 Evaluating Nine Men’s Morris

4.2.1 Defining “Good” game playing agents for Nine Men’s Morris

For Nine Men’s Morris, a “good” solution is a draw [33], assuming best play from both players, as discussed in section 3.4. So, to quantitatively decide that an agent is “good”, we would expect it to draw against another “good” agent, while against a weaker agent we would expect a “good” agent to win. As for Tic-Tac-Toe, the agent may be motivated to draw during training; therefore, this tournament structure will reveal whether one RL agent is superior to another by whether it wins against weaker agents.

Player-1         Player-2         Player-1 wins    Player-2 wins    Draws
Min-Max agent    Random agent     95.4 per cent    4.1 per cent     0.5 per cent
Random agent     Min-Max agent    0 per cent       74.8 per cent    25.2 per cent
Min-Max agent    Min-Max agent    0 per cent       0 per cent       100 per cent

Table 4.4: The baseline after 1000 games of Nine Men’s Morris with the Min-Max and Random agents as players

Table 4.4 shows some simple baseline results for “good”. In this experiment we show that, as player 1, a “good” agent should win almost all the time against a random agent. The Min-Max agent as player 2 versus a random agent wins fewer games, approximately 75%, and draws the remainder. In general, the minimum requirement of an agent is to draw each game, and better agents will win more. Second, we can see that the results differ depending on the agent’s playing position. Therefore, our full round-robin tournament will have agents play both as player 1 and as player 2.

4.2.2 Method to find the Number of Games for a Stable Result for Nine Men’s Morris between Two Agents

Our setup for evaluation is a round-robin tournament, and again we determine how many matches need to be played between opponents to find a stable result. Our goal is to find NtrialsNMM, the number of games played between opponents needed to find a stable result. To do this, we run a set of experiments, find the number of games needed for stable results under a number of conditions, and take the maximum, following the same methodology as previously used for Tic-Tac-Toe.

Player-1    Player-2      No. of Episodes for Stability
Q-20        rand          NT1a = 55
Q-20        Nd-Min-Max    NT1b = 38
Q-20        D-Min-Max     NT1c = 27
Q-20        Q-50          NT1d = 48
Q-20        Q-100         NT1e = 43
Q-40        rand          NT2a = 52
Q-40        Nd-Min-Max    NT2b = 35
Q-40        D-Min-Max     NT2c = 25
Q-40        Q-50          NT2d = 45
Q-40        Q-100         NT2e = 40
Q-60        rand          NT3a = 48
Q-60        Nd-Min-Max    NT3b = 33
Q-60        D-Min-Max     NT3c = 22
Q-60        Q-50          NT3d = 42
Q-60        Q-100         NT3e = 37
Q-80        rand          NT4a = 46
Q-80        Nd-Min-Max    NT4b = 30
Q-80        D-Min-Max     NT4c = 18
Q-80        Q-50          NT4d = 38
Q-80        Q-100         NT4e = 35

Table 4.5: Number of trials the Q-learning agent for Nine Men’s Morris at a particular stage in training requires to reach a stable win/draw to loss ratio

Since fully converged trained agents reach a “stable” ratio after a single game, we use weaker agents, such as the snapshot agents spaced at regular intervals, to find the total number of trials required. These findings are summarised in Table 4.5. The number of trials for the Nine Men’s Morris tournament can now be decided as:

N_{trialsNMM} = \max(N_1, N_{T1a}, N_{T1b}, N_{T1c}, N_{T1d}, N_{T1e}, N_{T2a}, N_{T2b}, N_{T2c}, N_{T2d}, N_{T2e},
                     N_{T3a}, N_{T3b}, N_{T3c}, N_{T3d}, N_{T3e}, N_{T4a}, N_{T4b}, N_{T4c}, N_{T4d}, N_{T4e})        (4.2)

The total number of trials required to set up a Nine Men’s Morris tournament is thus NtrialsNMM = 100, which, although greater than all of the experimentally observed values in the table, is computationally feasible for running the experiments. Also note that the maximum observed number has increased to 55 for this game.

4.2.3 Round-Robin Match Tournament Results for the Game of Nine Men’s Morris

Now that we have found the total number of trials required for stable results, NtrialsNMM = 100, we perform and present the tournament results for all of the agents. Table 4.6 contains the summary of the tournament, with column 1 listing the agents acting as player 1 and row 1 listing the agents acting as player 2, where the players are: the Q-learning agent at 100 per cent (Q-100), the Q-learning agent at 50 per cent (Q-50), the deterministic Min-Max agent (Mdet), the non-deterministic Min-Max agent (MNdet), and the random agent (R). Each entry is of the form P1-wins:draws:P1-losses, where, for example, 2:94:4 means that player 1 won 2 games, drew 94 games, and lost 4 games. The last column shows the total points earned by a particular agent, where 3 points are given for each game won and 1 point for each game drawn.

P1/P2    Q-100      Q-50       Mdet       MNdet      R           Points
Q-100    0:100:0    3:97:0     0:100:0    0:100:0    92:8:0      690
Q-50     3:90:7     5:60:35    0:45:55    0:50:50    20:35:45    364
Mdet     0:100:0    8:92:0     0:100:0    0:100:0    92:8:0      700
MNdet    0:100:0    5:95:0     0:100:0    0:100:0    94:6:0      698
R        10:15:75   0:17:83    0:25:75    0:22:78    10:55:35    194

Table 4.6: Nine Men’s Morris tournament to look at win rate for each of the players

Looking at the win rate of each of the players in Table 4.6, we can infer that the fully converged Q-learning agent is a “good” agent and performs on par with the deterministic and non-deterministic Min-Max agents. It can also be seen that the Q-learning agent at 50 per cent (Q-50) performs much better than a random agent. By looking at the total points earned by each of the players, where 3 points are given for each game won and 1 point for each game drawn, we can see that the deterministic Min-Max agent is the best agent, with the non-deterministic Min-Max and fully converged Q-learning agents both following closely.

4.3 Evaluating Mancala

4.3.1 Defining a “Good” game agent for Mancala

For Mancala, with converged, solved agents, the first player wins if both players play perfect games, as discussed in section 3.4. So, to quantitatively decide that an agent is “good”, we would expect the first player to win against another “good” agent. Also, against a weaker agent we would expect a “good” agent to win.

Player-1         Player-2         Player-1 wins    Player-2 wins    Draws
Min-Max agent    Random agent     95.6 per cent    2.9 per cent     1.5 per cent
Random agent     Min-Max agent    34.5 per cent    34 per cent      31.5 per cent
Min-Max agent    Min-Max agent    100 per cent     0 per cent       0 per cent

Table 4.7: The baseline after 1000 games of Mancala with the Min-Max and Random agents as players

Table 4.7 shows some simple baseline results for “good”. In this experiment we show that, as player 1, a “good” agent should win almost all the time against a random agent. In general, the minimum requirement of a “good” agent is that it wins most games as the first player. Our full round-robin tournament will have agents play both as player 1 and as player 2.

4.3.2 Method to find the Number of Games for a Stable Result for Mancala between Two Agents

Our setup for evaluation, as for the previous two games, is a round-robin tournament. To find out how many matches need to be played between opponents for a stable result, we run a set of experiments and find NtrialsM, the number of games played between opponents needed to find a stable result, by finding stable results under a number of conditions and taking the maximum. We start with a constant N1, which we set to 100 because this number is computationally feasible for our experiments. We define the number of trials required for a “stable” win/draw to loss ratio as the number of games an agent requires to first reach a win/draw to loss ratio which then remains constant for at least the next 20 trials. Since fully converged trained agents reach a “stable” ratio after a single game, we use weaker agents, such as the snapshot agents spaced at regular intervals, to find the total number of trials required. These findings are summarised in Table 4.8. The number of trials for the Mancala tournament can now be decided as:

N_{trialsM} = \max(N_1, N_{T1a}, N_{T1b}, N_{T1c}, N_{T1d}, N_{T1e}, N_{T2a}, N_{T2b}, N_{T2c}, N_{T2d}, N_{T2e},
                   N_{T3a}, N_{T3b}, N_{T3c}, N_{T3d}, N_{T3e}, N_{T4a}, N_{T4b}, N_{T4c}, N_{T4d}, N_{T4e})        (4.3)

Player-1    Player-2      No. of Episodes for Stability
Q-20        rand          NT1a = 115
Q-20        Nd-Min-Max    NT1b = 97
Q-20        D-Min-Max     NT1c = 65
Q-20        Q-50          NT1d = 84
Q-20        Q-100         NT1e = 78
Q-40        rand          NT2a = 111
Q-40        Nd-Min-Max    NT2b = 87
Q-40        D-Min-Max     NT2c = 55
Q-40        Q-50          NT2d = 78
Q-40        Q-100         NT2e = 60
Q-60        rand          NT3a = 108
Q-60        Nd-Min-Max    NT3b = 78
Q-60        D-Min-Max     NT3c = 43
Q-60        Q-50          NT3d = 63
Q-60        Q-100         NT3e = 53
Q-80        rand          NT4a = 105
Q-80        Nd-Min-Max    NT4b = 66
Q-80        D-Min-Max     NT4c = 35
Q-80        Q-50          NT4d = 56
Q-80        Q-100         NT4e = 37

Table 4.8: Number of trials the Q-learning agent for Mancala requires to reach a stable win/draw to loss ratio

The total number of trials required to set up a Mancala tournament is thus NtrialsM = 115, the largest value observed in Table 4.8, which is still computationally feasible for running the experiments. Also note that this result shows that our initial choice of N1 = 100 has now been surpassed, as the stability point increases with this game’s larger state-space complexity.

4.3.3 Round-Robin Match Tournament Results for the Game of Mancala

Now that we have found the total number of trials required for stable results, NtrialsM = 115, we perform and present the tournament results for all of the agents. Table 4.9 contains the summary of the tournament, with column 1 listing the agents acting as player 1 and row 1 listing the agents acting as player 2, where the players are: the Q-learning agent at 100 per cent (Q-100), the Q-learning agent at 50 per cent (Q-50), the deterministic Min-Max agent (Mdet), the non-deterministic Min-Max agent (MNdet), and the random agent (R). Each entry is of the form P1-wins:draws:P1-losses, where, for example, 2:94:4 means that player 1 won 2 games, drew 94 games, and lost 4 games. The last column shows the total points earned by a particular agent, where 3 points are given for each game won and 1 point for each game drawn.

P1/P2    Q-100      Q-50       Mdet        MNdet       R           Points
Q-100    115:0:0    115:0:0    115:0:0     115:0:0     110:5:0     1715
Q-50     58:5:52    60:5:50    46:31:38    49:21:30    55:35:10    901
Mdet     115:0:0    114:1:0    115:0:0     115:0:0     114:1:0     1721
MNdet    115:0:0    114:1:0    115:0:0     115:0:0     112:3:0     1717
R        35:0:80    54:15:46   37:29:49    39:50:26    45:15:55    739

Table 4.9: Mancala tournament to look at win/draw rate for each of the players

Table 4.9 shows that the fully converged Q-learning agent is a “good” agent and performs almost on par with the deterministic and non-deterministic Min-Max agents. It can also be seen that the Q-learning agent at 50 per cent (Q-50) performs much better than a random agent. By looking at the total points earned by each of the players, we can see that the deterministic Min-Max agent is the best agent, while the non-deterministic Min-Max agent is second.

Chapter 5 : Discussion and Future Work

In the previous two chapters, we showed how to train Q-learning agents (including parameter selection and stopping criteria) for three abstract board games, provided a method to create weaker versions of the converged Q-learning agent, and provided a methodology to evaluate different game playing agents in a round-robin tournament. In this chapter, we provide insights into these methods and our analysis.
For all the games used in this research, perfect play, i.e., the action or behavior of a game playing agent that leads to the best possible outcome for that player no matter the response by the opponent, is known, as these games are solved. An interesting extension of our work would be to implement similar techniques for partially solved and unsolved games such as Chess, Go, etc.
Another exciting application of this research would be to apply similar techniques to non-zero-sum games, i.e., games where a win for one player is not a loss for the other. Some typical examples of non-zero-sum games are:

• The Battle of the Sexes, where a couple has to choose between two activities, A and B, for a particular evening. While the man might prefer to engage in activity A, he would prefer to do B with his wife rather than pursuing A alone. Similarly, the wife would prefer activity B, but she too would rather pursue activity A with her husband than B alone.

• The Prisoner’s Dilemma, where two criminals who are allegedly responsible for a murder are brought in by the police, but the police do not have enough evidence to prove their crime in court. The prisoners are put in separate cells with no way to communicate, and each is offered the chance to confess. If neither prisoner confesses, both will go free. If both prisoners confess to the murder, both will be sentenced to 5 years. If, however, one prisoner confesses while the other does not, then the prisoner who confesses will be granted immunity while the prisoner who did not confess will go to jail for 20 years.

Since non-zero-sum games generally have both competitive and cooperative elements, they better represent real-life conflicts. Thus, observing the behavior of our game-playing agents in such scenarios might be exciting.
Additionally, since we hypothesize that Q-learning will only be useful as an RL-based technique for a limited state-space complexity, it might be fascinating to see if our observations about the agents hold for more computationally expensive games such as Stratego (state-space complexity: 115), Twixt (state-space complexity: 140), Connect6 (state-space complexity: 170), etc. This approach might even be extended to observe the game-playing agents’ behavior when playing infinite games, i.e., games that are played to keep the gameplay going. Infinite games exist by playing with the boundaries and rules of the game, for example, Infinite Chess, Magic: The Gathering, etc. These games can be treated as continuous tasks, i.e., RL tasks which are not made of episodes but instead last forever.

It could also be interesting to note the effect on training and performance if a Q-learning agent is trained against a fully converged Q-learning agent which has itself been trained against a Min-Max agent, as it might be riveting to observe how an agent performs when it has been taught by the “best” teaching agent.
It should be noted that we created lower-quality agents by taking snapshots during training. It might also be interesting to explore other methods of creating these weaker agents. Such methods may include, but are not limited to, training against weaker opponents, or tuning the parameters α (the learning factor), γ (the discount factor), and ε (the exploration parameter) in such a manner that Q-learning converges prematurely (before actually reaching the optimal policy).
While evaluating our agents, we devised a point system to decide which agent was the best, where 3 points were given for each game won and 1 point for each game drawn. However, since a perfect game of Tic-Tac-Toe or Nine Men’s Morris played by both players results in a draw, during training a win and a draw were given equal weight. It would be fascinating to observe whether training for these games would improve by pushing for wins rather than draws, and how this would change the quality of the Q-learning based agents for those games.

Chapter 6: Conclusions

In this thesis, our goal was to create and analyse Q-learning agents and to evaluate them on different abstract board games, each of which has a different state-space complexity.
In analyzing the training of a Q-learning RL agent in terms of finding convergence for solved abstract board games, we found that the training speed is largely affected by the state-space complexity of the game, as we initially expected. We also found that, for the state-space complexities within the scope of this research, a fully converged Q-learning agent playing against a Min-Max player, which it has never played before, does not lose any games. This implies that we were successful in creating a player that can, just by playing against itself, learn how to play solved abstract board games with state-space complexities ranging between 4 and 13 perfectly against a Min-Max opponent. Against a random player, a fully converged Q-learning agent reports no losses when playing games with small state-space complexities.
We were also successful in creating lower-quality Q-learning based agents that allow human players to find competitive opponents that are challenging but not too hard to beat. We used these lower-quality Q-learning based agents to observe that better teaching agents accelerate training for abstract board games with state-space complexities in the range of 4 to 13.
Finally, the quality of the agents was evaluated using a round-robin tournament, which required a method for deciding how many games must be played between opponents to find stable results. To quantitatively assess which of the players is best, we used a point system and found that the deterministic Min-Max agent is the best agent for all three of the games, the fully converged Q-learning based agent is a close second for both Tic-Tac-Toe and Nine Men’s Morris, and the non-deterministic Min-Max agent comes second for Mancala.

Bibliography

[1] G. Andrade, H. Santana, A. Furtado, A. Leitão, and G. Ramalho. Online adaptation of computer games agents: A reinforcement learning approach. Scientia, 15(2), 2004. [2] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing atari with deep reinforcement learning. In arXiv:1312.5602, NIPS Deep Learning Workshop, 2013. [3] G. Tesauro. Temporal difference learning and td-gammon. In Communications of the ACM, volume 38, 1995. [4] D. A. Ferrucci. Introduction to “this is watson”. In IBM Journal of Research and Development, volume 56, pages 1:1–1:15, 2012. [5] M. Genesereth, N. Love, and B. Pell. General game playing: overview of the aaai competition. In AI Magazine, volume 26, page 62–72, 2005. [6] N. Love, T. Hinrichs, and M. Genesereth. General game playing: game description language specification. In Stanford Logic Group LG-2006-01, Computer Science Department, Stanford University, Stanford, Calif, USA, 2006. [7] M. Świechowski, H. Park, J. Mańdziuk, and K. Kim. Recent advances in general game playing. In The Scientific World Journal, volume 2015, page 22, 2015. [8] C. Browne. Modern techniques for ancient games. In 2018 IEEE Conf. Comput. Intell. Games, page 490–497, 2018. [9] Stephan Schiffel and Michael Thielscher. Fluxplayer: A successful general game player. In Proceedings of the 22nd AAAI Conference on Artificial Intelligence (AAAI-07), pages 1191–1196. AAAI Press, 2007. [10] Jakub Kowalski, Jakub Sutowicz, and Marek Szykula. Regular boardgames. CoRR, abs/1706.02462, 2017.

[11] E. Piette, D. J. N. J. Soemers, M. Stephenson, C. F. Sironi, M. H. M. Winands, and C. Browne. Ludii - the ludemic general game system. In CoRR, volume abs/1905.05013, 2019.

[12] E. Piette, D. J. N. J. Soemers, M. Stephenson, C. F. Sironi, M. H. M. Winands, and C. Browne. An overview of the ludii general game system. In arXiv:1907.00240v1 [cs.AI], 2019. [13] Erik D. Demaine. Playing games with algorithms: Algorithmic combinatorial game theory. In Jiří Sgall, Aleš Pultr, and Petr Kolman, editors, Mathematical Foundations of Computer Science 2001, pages 18–33, Berlin, Heidelberg, 2001. Springer Berlin Heidelberg.

[14] L. V. Allis. Searching for solutions in games and artificial intelligence. PhD thesis, Maastricht University, January 1994.

[15] E. Piette, D. J. N. J. Soemers, M. Stephenson, C. F. Sironi, M. H. M. Winands, and C. Browne. A practical introduction to the ludii general game system. In Proceedings of Advances in Computer Games (ACG 2019), Macao, Springer, 2019.

[16] Bikramjit Banerjee and Peter Stone. General game learning using knowledge transfer. In The 20th International Joint Conference on Artificial Intelligence, pages 672–677, January 2007.

[17] H. Jaap van den Herik, Jos W. H. M. Uiterwijk, and Jack van Rijswijck. Games solved: Now and in the future. Artificial Intelligence, 134(1):277–311, 2002.

[18] Gábor E. Gévay and Gábor Danner. Calculating ultrastrong and extended solutions for nine men’s morris, morabaraba, and lasker morris. IEEE Transactions on Computational Intelligence and AI in Games, 8(3):256–267, 2016.

[19] B. Jones, L. Taalman, and A. Tongen. Solitaire mancala games and the chinese remainder theorem. In The American Mathematical Monthly, pages 706–724, 2013.

[20] Colin Divilly, Colm O’Riordan, and Seamus Hill. Exploration and analysis of the evolution of strategies for mancala variants. In 2013 IEEE Conference on Computational Intelligence in Games (CIG), pages 1–7, 2013.

[21] N. Ferns, P. Panangaden, and D. Precup. Metrics for finite markov decision processes. In The Nineteenth National Conference on Artificial Intelligence, Association for the Advancement of Artificial Intelligence, pages 104–124, 2004.

[22] Stephan ten Hagen and Ben Kröse. Q-learning for systems with continuous state and action spaces. In BENELEARN 2000, 10th Belgian-Dutch Conference on Machine Learning, 2000.

[23] Yang Cai and Constantinos Daskalakis. On Minmax Theorems for Multiplayer Games, pages 217–234.

[24] Yang Cai and Constantinos Daskalakis. On minmax theorems for multiplayer games. In Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’11, page 217–234, USA, 2011. Society for Industrial and Applied Mathematics.

[25] Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Ian Osband, Gabriel Dulac-Arnold, John Agapiou, Joel Leibo, and Audrunas Gruslys. Deep q-learning from demonstrations. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1), Apr. 2018.

[26] R. Lorentz and Teofilo Erin Zosa IV. Machine learning in the game of breakthrough. In ACG, 2017.

[27] Bahare Kiumarsi, Frank L. Lewis, Hamidreza Modares, Ali Karimpour, and Mohammad-Bagher Naghibi-Sistani. Reinforcement q-learning for optimal tracking control of linear discrete-time systems with unknown dynamics. Automatica, 50(4):1167–1175, 2014.

[28] H. Wang, M. Emmerich, and A. Plaat. Monte carlo q-learning for general game playing. In arXiv:1802.05944 [cs.AI], 2018.

[29] E. Even-Dar and Y. Mansour. Learning rates for q-learning. In Journal of Machine Learning Research, volume 5, 2003.

[30] S. Mahadevan. To discount or not to discount in reinforcement learning: A case study comparing r learning and q-learning. In Machine Learning Proceedings, Proceedings of the Eleventh International Conference, Rutgers University, New Brunswick, NJ, pages 164–172, 1994.

[31] Sridhar Mahadevan. To discount or not to discount in reinforcement learning: A case study comparing r learning and q learning. In William W. Cohen and Haym Hirsh, editors, Machine Learning Proceedings 1994, pages 164–172. Morgan Kaufmann, San Francisco (CA), 1994.

[32] S. Jain and N. Khera. An intelligent method for solving the tic-tac-toe problem. In International Conference on Computing, Communication Automation, pages 181–184, 2015.

[33] R. Gasser. Solving nine men’s morris. In Computational Intelligence, volume 12, pages 24–41, 1996.

[34] L. Manosh, R. O. Abayomi, and O. O. Olugbara. An overview of supervised machine learning techniques in evolving mancala game players. In Proceedings of the 2nd International Conference on Computer Science and Electronics Engineering (ICCSEE 2013), pages 978–983, 2013.

[35] V. Volz and B. Naujoks. Towards game-playing ai benchmarks via performance reporting standards. In arXiv:2007.02742v1 [cs.AI], 2020.
