Solving General Game Playing with Incomplete Information Problem using Iterative Tree Search and Language Learning

Armin Chitizadeh

Supervisor: Prof. Michael Thielscher Co-Supervisor: Dr. Alan David Blair

A thesis in fulfilment of the requirements for the degree of Doctor of Philosophy

School of Computer Science and Engineering Faculty of Engineering

December 2019

Thesis/Dissertation Sheet

Surname/Family Name : Chitizadeh

Given Name/s : Armin
Abbreviation for degree as given in the University calendar : PhD
Faculty : Engineering
School : Computer Science and Engineering
Thesis Title : Solving General Game Playing with Incomplete Information Problems using Iterative Tree Search and Language Learning

Abstract

General Game Playing with Incomplete Information (GGP-II) is about developing a system capable of successfully playing incomplete information games without human intervention by just receiving their rules at runtime. Different algorithms (players) have been proposed to play games in GGP-II. This research is concerned with three main limitations of algorithms in the literature: valuing information, generating mixed strategies, and cooperating in games which require implicit communication. In this thesis, I theoretically and experimentally show why past GGP-II players suffer from these problems, introduce four algorithms to overcome them, and discuss the advantages and limitations of each algorithm. Firstly, I introduce the Iterative Tree Search (ITS) algorithm. ITS learns the best strategy by simulating different plays with itself. I show theoretically and experimentally how ITS correctly values information and models opponents by generating mixed strategies in different games. However, ITS fails to play large games and also cooperative games which require implicit communication. Secondly, I present the Monte Carlo Iterative Tree Search (MCITS). This algorithm uses the Monte Carlo Tree Search technique to focus the search on a more promising part of the game. I experimentally show the success of this algorithm on different games from the literature. MCITS fails to generate mixed strategies and to correctly play games which require implicit communication. Thirdly, I introduce a communication language learning technique called General Language (GL). GL is capable of generating an implicit communication language for cooperative players to share their information. The GL technique sees a communication language as an additional game rule. It can be used on top of any existing GGP-II player. This feature makes it a general algorithm. The main limitation of GL is its inability to solve large problems. Finally, I present the General Language Tree Search algorithm (GLTS). This algorithm extends the GL technique to be applicable to large games. It prioritises communication languages according to their closeness to the most successful one. To validate my claim, I perform an experiment using GLTS by providing it with a Multi-Agent Path Finding with Destination Uncertainty problem. The GLTS algorithm successfully discovers the desired strategies by utilising the implicit communication among agents.

Declaration relating to disposition of project thesis/dissertation

I hereby grant to the University of New South Wales or its agents a non-exclusive licence to archive and to make available (including to members of the public) my thesis or dissertation in whole or in part in the University libraries in all forms of media, now or hereafter known. I acknowledge that I retain all intellectual property rights which subsist in my thesis or dissertation, such as copyright and patent rights, subject to applicable law. I also retain the right to use all or part of my thesis or dissertation in future works (such as articles or books).

Signature                                Date

The University recognises that there may be exceptional circumstances requiring restrictions on copying or conditions on use. Requests for restriction for a period of up to 2 years can be made when submitting the final copies of your thesis to the UNSW Library. Requests for a longer period of restriction may be considered in exceptional circumstances and require the approval of the Dean of Graduate Research.

INCLUSION OF PUBLICATIONS STATEMENT

UNSW is supportive of candidates publishing their research results during their candidature as detailed in the UNSW Thesis Examination Procedure.

Publications can be used in their thesis in lieu of a Chapter if:
• The candidate contributed greater than 50% of the content in the publication and is the “primary author”, ie. the candidate was responsible primarily for the planning, execution and preparation of the work for publication
• The candidate has approval to include the publication in their thesis in lieu of a Chapter from their supervisor and Postgraduate Coordinator.
• The publication is not subject to any obligations or contractual agreements with a third party that would constrain its inclusion in the thesis

Please indicate whether this thesis contains published material or not:

☐ This thesis contains no publications, either published or submitted for publication (if this box is checked, you may delete all the material on page 2)

☒ Some of the work described in this thesis has been published and it has been documented in the relevant Chapters with acknowledgement (if this box is checked, you may delete all the material on page 2)

☐ This thesis has publications (either published or submitted for publication) incorporated into it in lieu of a chapter and the details are presented below

CANDIDATE’S DECLARATION

I declare that:
• I have complied with the UNSW Thesis Examination Procedure
• where I have used a publication in lieu of a Chapter, the listed publication(s) below meet(s) the requirements to be included in the thesis.

Candidate’s Name                Signature                Date (dd/mm/yy)

Originality Statement

I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project’s design and conception or in style, presentation and linguistic expression is acknowledged.

Signed: ......

Date: ......

Copyright Statement

I hereby grant the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or part in the University libraries in all forms of media, now or hereafter known, subject to the provisions of the Copyright Act 1968. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. I also authorise University Microfilms to use the 350 word abstract of my thesis in Dissertation Abstract International (this is applicable to doctoral theses only). I have either used no substantial portions of copyright material in my thesis or I have obtained permission to use copyright material; where permission has not been granted I have applied/will apply for a partial restriction of the digital copy of my thesis or dissertation.

Signed: ......

Date: ......

Authenticity Statement

I certify that the Library deposit digital copy is a direct equivalent of the final officially approved version of my thesis. No emendation of content has occurred and if there are any minor variations in formatting, they are the result of the conversion to digital format.

Signed: ......

Date: ......

Acknowledgements

I wish to express my deepest gratitude to the outstanding individuals who supported me in undertaking this research; it would not have been possible without their help. First and foremost, I wish to pay my special regards to my supervisor, Professor Michael Thielscher. It was his presentation on General Artificial Intelligence back in 2012 which made me interested in pursuing this research. During my PhD, his guidance, support and ongoing encouragement were beyond the call of duty. Words are not enough to express to Michael how much it meant to me that he took the time to be such a caring mentor and improve my level of research, writing and critical thinking. Thank you. I would like to thank my parents, Jafar Chiti Zadeh and Nasrin Chitizadeh, my brother and sister, Amir Chiti Zadeh and Ayda Chitizadeh, and my grandparents for their encouragement and support, both mental and financial, throughout this research. I am so fortunate to have such a generous and understanding family who were always willing to talk and listen. Their continuous encouragement and advice kept me moving at every step of this journey. I also wish to thank my co-supervisor, my panel, all the members of the GGP meeting group and my reviewers who helped me with their feedback and constructive criticism.


Contents

Acknowledgements xi

List of Figures xx

List of Tables xxi

Publications xxiii

1 Introduction 1
  1.1 Background 1
  1.2 Contributions 6
  1.3 Structure 8

2 Multi-Agent Systems in Incomplete Information Environment 9
  2.1 Game Representations 10
    2.1.1 Game Theory 10
    2.1.2 General Game Playing with Incomplete Information 15
  2.2 Representations Comparison 21
  2.3 Multi-agent Algorithms in the Literature 23
    2.3.1 Two Common Problems with Incomplete Information Approaches 24
    2.3.2 Competitive Games 26
    2.3.3 Cooperative Games 31
    2.3.4 GGP-II Past Players 34
  2.4 Summary 38

3 Iterative Tree Search 39
  3.1 Iterative Tree Search 45
    3.1.1 Initialising the Tree 45
    3.1.2 Iteration 48
  3.2 Analysis 56
    3.2.1 Games with Dominant Pure Strategy and Single Player Games 56
    3.2.2 Non-Locality Problem 60
    3.2.3 One-Step Joint-Move Two-Player Constant-Sum Games 67
    3.2.4 Move Separable Games 69
  3.3 Summary 72

4 MCITS: An Online Tree Search algorithm for Incomplete Information Games 75
  4.1 Introduction 75
    4.1.1 Background: Monte Carlo Tree Search 76
  4.2 Monte Carlo Iterative Tree Search 77
  4.3 Analysis 82
    4.3.1 Valuing Information in the Game 82
    4.3.2 Non-Locality Problem 88
    4.3.3 Summary 90

5 General Language Evolution in General Game Playing 93
  5.1 Simplified Iterative Tree Search Algorithm 95
  5.2 General Language Algorithm 98
  5.3 Analysis 103
    5.3.1 Naming Game 103
    5.3.2 Air-Strike Alarm 104
  5.4 Experimental Analysis 105
  5.5 Summary 109

6 GLTS and Its Application to Multi-Agent Path Finding with Destination Uncertainty 113
  6.1 Background: Multi-Agent Path Finding with Destination Uncertainty Problem 116
  6.2 General Language Tree Search 117
  6.3 Analysis 128
    6.3.1 Example: Extended Cooperative Spies Game 129
    6.3.2 Example: Two Robots Game 131
    6.3.3 Applicability and the Bias of GLTS 133
  6.4 Experimental Results 134
  6.5 Summary 137

7 Conclusion 139
  7.1 Contributions 139
    7.1.1 Competitive Players 139
    7.1.2 Communication Language Generators 140
  7.2 Future Work 141

A Counterfactual Regret Minimization Performance on the PBECW Game 145
  A.1 Background: Counterfactual Regret Minimisation Algorithm 145
  A.2 Counterfactual Regret Minimization Failure 147

B Battle Ships in the Fog: Comparing Different Strategies 155

C Number of Worlds in the Two Robots Example 159

D How a World Turns into a Terminal World 163

Bibliography 165

List of Figures

2.2  A constant-sum game described in the world model 16
2.3  GDL-II description of the Cutting Wire game 18
2.4  The extensive form representation of the Cutting Wire game 20
2.5  General Game Playing Match Ecosystem [37] 21
2.6  Generality of the discussed representations 22
2.7  The Prisoner’s Dilemma in extensive form 22
2.8  The extensive form representation of the Frank Basin Non-locality game 23
2.9  A game with non-locality problem described in world model. The thick lines represent the chosen strategy by an algorithm which has the non-locality problem 25
2.10 Applying the Vector Minimaxing to a game in the world model 30
2.11 The Cooperative Spies game in extensive form 32
2.12 GDL-II description of the Cooperative Spies game 33

3.1  The Partially Hidden Extended Cutting Wire game tree in the extensive form representation 40
3.2  Two Nash equilibrium solutions 41
3.3  The mixed Nash equilibrium with average reward of 90 in the Extended Cutting Wire game 44
3.4  The Extended Cutting Wire problem with variables’ values during the first iteration 49
3.5  The Extended Cutting Wire problem with variables’ values during the first iteration 52
3.6  Two numerical solutions 55
3.7  Values in Extended Cutting Wire example during the 6th iteration 57
3.8  The game tree for Fully Hidden Extended Cutting Wire 58
3.9  Probability of telling at different states during the first 1,000 iterations in the Fully Hidden Extended Cutting Wire game 59
3.10 The Frank Basin Non-Locality game to show non-locality problem represented using the world model 61
3.11 The Frank Basin Non-Locality game converted to the extensive form representation 62
3.12 Variables in the Frank Basin Non-Locality game during the first iteration of ITS 63
3.13 Variables in the Frank Basin Non-Locality game during the second iteration of ITS 64
3.14 Variables in the Frank Basin Non-Locality game during the third iteration of ITS 65
3.15 The optimal strategy for Max player suggested by the designer of the game 66
3.16 Mixed strategy at state d in the Frank Basin Non-Locality game during the first 100 iterations 67
3.17 Mixed strategy of the Keeper for the biased penalty kick game for the first 10,000 iterations 69
3.18 The probability change toward equilibrium for four strategies in the banker and thief game when the faulty bank is the first one 72

4.1  Number of selection after 500 iterations for wait1 and tell1 moves in Fully Hidden Extended Cutting Wire 83
4.2  Number of selection after 500 iterations for wait2 and tell2 moves in Fully Hidden Extended Cutting Wire 84
4.3  Different directions that a ship can move on a 4 × 4 board 85
4.4  One of the three scenarios in which MCITS wins against HP-II 87
4.5  Number of selection after 1601 iterations for shoot, scan or move of the Battle Ships in the Fog game 88
4.6  The Frank Basin Non-Locality game described in worlds model. The suggested optimal strategy for the Max is shown with thick lines 89
4.7  The Frank Basin Non-Locality game described in extensive normal form. Thickness of lines represents the frequency of move being chosen by MCITS 90
4.8  Number of selection after 400 iterations at information set d in the Frank Basin Non-Locality game 91

5.1  General Language Algorithm placement in General Game Playing match ecosystem 100
5.2  Two numerical solutions 102
5.3  Experiment on language evolution with natural selection 108
5.4  No language evolution can be seen on a society without natural selection 109
5.5  Game tree for the Four Cutting Wire game with three mustRules 110
5.6  The evolution of optimal language with ITS search algorithm 111

6.1  The Two Robots game which is a Multi Agent Path Finding with Destination Uncertainty. Each robot must go to its target cell marked by a solid square or solid circle. Each robot can only see its own target with solid mark but can not distinguish between other robot’s solid or shallow marks 117
6.2  General Language Tree Search at early iterations for the Cooperative Spies game. The composite world is not part of the world tree directly 118
6.3  Extended Cooperative Spies game 120
6.4  The four depths combinations of the children from a parent world which has a mustRule with depth1 at 2 and depth2 at 7 125
6.5  A world tree example. Each node represents a world. Values for QS, Q and V variables are shown in a box next to each node. Colour of the edges show the path that has affected the Q values 127
6.6  An abstract generated tree world by GLTS in the game of Extended Cooperative Spies. Nodes here are the set of worlds with the given depths. The Shallow node represent a dummy node which has no world associated with them. Dummy nodes are just keeping the depths’ values to be later used on generating new nodes 130
6.7  The paths both robots take to reach their goal if they use either NEXUSBAUN or ITS algorithms in the Two Robots game 131
6.8  Number of steps when Square goes clockwise versus when Square goes anti-clockwise in the Two Robots game 132
6.9  Average self-value QS/V for three singleton worlds 135
6.10 Average self-value QS/V for three composite worlds 135

A.1  The Partially Blind Extended Cutting Wire example with CFR variable represented during the first iteration 149
A.2  The Partially Blind Extended Cutting Wire example with CFR variable represented during the second iteration 150
A.3  The Partially Blind Extended Cutting Wire example with CFR variable represented during the third iteration 151
A.4  The Partially Blind Extended Cutting Wire example with CFR variable represented during the fourth iteration 152
A.5  The final pure strategy profile that CFR finds in the Partially Blind Extended Cutting Wire example 153

B.1  HP-II algorithm vs sampling technique in Battle Ships in the Fog 156
B.2  MCITS algorithm vs sampling technique in Battle Ships in the Fog 157

C.1  The Two Robots game in a loop board. Small solid square and circle shows the target cells for robots 160

D.1  A world tree in which a non-terminal world turned into a terminal world. Values inside the boxes show the mustRule’s depths of each world 164

List of Tables

2.1  The matrix for the prisoner’s dilemma 12
3.1  The matrix for the Battle of Sexes problem 43
3.2  Probability of choosing a money distribution by ITS in the banker and thief game 72

Publications

Conference

• [15] Armin Chitizadeh and Michael Thielscher, "General language evolution in general game playing", In: AI 2018: Advances in Artificial Intelligence - 31st Australasian Joint Conference, Wellington, New Zealand, Dec 11-14 2018, pp. 51-64.

Some parts of this publication have been used within chapters 2 and 5.

Workshop

• [16] Armin Chitizadeh and Michael Thielscher, "Iterative tree search in general game playing with incomplete information", In: Tristan Cazenave, Abdallah Saffidine, Nathan Sturtevant (Editors), CGW 2018: Computer Games - 7th Workshop, Held in Conjunction with the 27th International Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence 2018, Stockholm, Sweden, July 13, 2018, Revised Selected Papers, pp. 98–115.

Some parts of this publication have been used within chapters 2 and 3.


Chapter 1

Introduction

This thesis contributes to the field of General Game Playing with Incomplete Information (GGP-II) [104]. General Game Playing is concerned with the development of a general Artificial Intelligence (AI) system that, in principle, can learn to play any game by only receiving its rules [37]. GGP-II is an extension of the original GGP which additionally models games with asymmetric information and chance events. Three of the existing gaps in this field are discussed in this research. Two gaps are about modelling the opponents: one is the inability to value information and the other is not producing mixed strategies. These result in incorrect opponent modelling. In a world without access to all information, modelling others assists us to predict their past and possible future actions. As part of this thesis, I discovered these two gaps by analysing current GGP-II approaches. Full details are discussed in Chapter 3. The other gap is the inability of players to come up with a new way of communicating without it being explicitly defined [87]. Communication among allies for sharing of information is extremely important in some games. Even though the techniques in this thesis are defined for GGP-II, they are not limited to this field and can be applied to other fields, such as Multi-Robot Systems [91] and economics [22].

1.1 Background

Computer systems are becoming an essential part of our lives. We interact with them every day and cannot imagine our lives without them. Computers are designed to make our lives easier by performing our tedious tasks efficiently.

In the beginning, computers tended to perform only sequences of instructions and arithmetic operations (algorithms) [79]. However, with the expansion of computer usage to more challenging tasks, the expectations of computer systems have increased. Computer systems are now expected to make their own decisions on different tasks. In order to make correct decisions, they need to have rational thinking, or intelligence. The common term used for such behaviour of computers is Artificial Intelligence (AI) [82].

One of the earliest applications of AI was in automated theorem proving. In 1959 Gelernter developed the Geometry Theorem Prover. It proved theorems which were considered to be nontrivial for mathematics students [35]. It was followed by another automated theorem prover in 1960 by Wang. This automated theorem prover was able to prove all logic theorems in Principia Mathematica [111]. Automated theorem proving systems use a process known as high-level symbolic reasoning.

Meanwhile, there was considerable work on low-level learning inspired by the neurons in our brain. The early sparks of neural networks were set off by a book called The Organization of Behaviour in 1949 [43]. This neuro-psychological book proposes that there exists a relation between the behaviour of neurons and the synchronous activation of neighbouring neurons. It claims that neural pathways are strengthened by the simultaneous firing of two nearby neurons. In 1959 ADALINE and MADALINE were the first neural network systems to be applied to real-life problems [114]. They were used for noise cancellation on telephone lines, a technique which is still in use today. The neural network hype cooled down in the 1970s and the early 1980s. It regained popularity in the late 1980s with the advent of parallel processing [64]. Research on neural networks has reached extreme popularity in recent years due to the introduction of deep learning [18].

Another area of AI which massively shaped people’s opinions towards artificial intelligence is AI in games. Since ancient times, games have been used as a tool to compare intelligence among people [63]. Scientists used games to measure the intelligence of their systems and also to show the advancement of AI to the public. AI received general public interest through the publicity of human vs AI games. Alan Turing developed the first algorithm for a player, called Turochamp, in 1951 [38]. However, it was never executed on an actual computer. The earliest AI game player executed on a computer was developed for the game of checkers [84] in 1958. The extension of this system beat the Connecticut state checkers champion in 1961. This event resulted in attracting public interest to AI [14]. AI was then used to create smart opponents in arcade games such as Space Invaders (1978), Pac-Man (1980) and several other games, including recent ones [9]. In 1997 Deep Blue, a computer system designed by IBM, beat the World Chess Champion of the time, Garry Kasparov [10]. This symbolic event was praised by many as the machines overcoming humankind. Nearly two decades later, in 2016, mankind lost against AI again, this time in the game of Go. Lee Sedol, one of the strongest players in the history of Go, lost 4 games out of 5 in a match against AlphaGo [10]. In other words, it took AI nearly two decades to go from beating mankind in the game of chess to beating mankind again in another game.

During the past decades, apart from theorem proving, neural networks and games, artificial intelligence systems have been used in a variety of different tasks. However, there are still several tasks which might seem trivial to a human but are still impossible for AI. Some examples are playing football and writing a high-school essay [41]. This shows that intelligence in computer systems is not transferable among different tasks. The majority of AI systems are designed to solve a single problem and are only applicable to that problem, such as only trading stocks, only filtering emails or only playing chess. To design such an AI system, a group of specialists and computer scientists need to spend time and work together [10]. This means that for each new task we need human intervention to design an AI system for it. However, this stands against the main idea of having computer systems, which is to reduce human involvement and increase efficiency. To create a true AI system we need to create a general AI system: a system which, after being built, is still applicable to a variety of problems. In the literature, this general AI system is referred to as Strong AI or General AI. General AI should be able to solve any problem by just knowing the problem. Introducing a strong AI is a big task. For this reason, researchers have separated it into smaller subtasks. One subtask is to come up with a system which can solve any game1 by just receiving its rules. However, the issue is how to give the system the game rules. Despite the variation in games, they all share a similar common graph structure. The nodes in the graph are the states of the game and the edges are the actions of the players. However, it is not practical to represent a game as a graph. The graph can be nearly impossible to draw for some games, such as chess. In games, states and actions have composite structures. In chess, for example, each state of the game can be represented by the location of the pieces on the board and whose turn it is. Using such a compositional structure, we should be able to represent the game in a compact form [37]. The General Game Playing framework was introduced to address this issue.

General Game Playing

General Game Playing (GGP) is concerned with the development of a system that, in principle, can learn to play any game by only receiving its rules [37]. The rules are given to each player in the Game Description Language (GDL), which is a Prolog-like logical language [36]. In games, states and moves have a composite structure. For example, in chess, each state of the game can be described with a set of entities, like the position of each piece and whose turn it is. GDL defines each state as a database of entities and defines legal moves with the help of logic. As an example, the legality of a move in chess can be described according to the type of the piece and the locations of all pieces. This way GDL can be extremely compact compared with other representation forms. Players can then convert the GDL format to any applicable format they prefer. A GGP competition is held annually; the most recent winner of the GGP competition was "Woodstock" [53].

1Multi-agent problems, like trading on a stock market, sometimes are referred to as a game.


The original GGP has two limitations: lack of support for asymmetric information and no random player [104]. Asymmetry allows games with hidden information, like poker, to be modelled. A random player allows games with chance events to be modelled, such as backgammon. In 2010 Thielscher [104] extended GDL to support incomplete information (GDL-II). To this end, a random player and perceptions were added to the original GDL. The new extension of GGP which takes advantage of GDL-II is called General Game Playing with Incomplete Information (GGP-II).

In GGP-II players will no longer see all moves by other players but only some perceptions. Players might not always infer a single taken move but a set of possible moves, given the game description rules and the sequence of received perceptions. This uncertainty adds to the complexity of the problem. In such scenarios, players are required to guess which moves are more likely to be taken and play according to it. In other words, when the game is competitive, players need to model the opponent and avoid being predicted by the opponent. On the other hand, when a game is cooperative, players need to share information to assist each other in guessing the possible played moves. Both competitive and cooperative games can be modelled in GDL-II. Predicting what a player might have played or is planning to play is easier in cooperative games. However, in some cooperative games, players need to share their knowledge about the world without it being explicitly described in the rules of the game. For example, consider a simple game, taken from [14], with a random player, the Nature, and two agents, respectively called the Cutter and the Viewer. The game is named “cooperative spies game”. It proceeds as follows: First, the random player arms a bomb by randomly choosing one of two wires. The Viewer only sees which wire is chosen while the Cutter has to decide which wire to cut. After seeing which wire is chosen to arm the bomb, the Viewer can send either of two possible messages to the Cutter. The crux is that there is no rule which helps the Viewer choose one message over the other after seeing which wire is chosen. This inability of GGP-II to implicitly communicate is mentioned as a limitation of all current GGP-II player algorithms [14].

In 2016, Thielscher presented an extension of GDL-II to include epistemic games [105]. Epistemic games are characterised by rules that depend on the knowledge of players. The new language is called the Game Description Language with Incomplete Information and Introspection (GDL-III). Epistemic games and GDL-III are beyond the scope of this thesis. The main focus of this thesis is games with incomplete information. In this thesis, I illustrate that this limitation is due to the inability of players to automatically generate a common implicit language among themselves when it is required. Also, I show that implicit communication can increase the efficiency of solving Multi-Agent Path Finding with Destination Uncertainty (MAPF/DU) problems. MAPF/DU can model several real-world multi-agent applications in which agents need to move to different destinations without any collision and central planning. Examples are automated cars at intersections [23] and office or warehouse robots [117, 109].

1.2 Contributions

The first generation of GGP-II players mainly used the determinisation technique to play games [86, 25, 70, 12]. The determinisation technique picks some complete-information samples of an incomplete-information state of a game, then plays each sample as a complete-information game. It then chooses the move with the best average outcome. One criticism of this approach is its lack of information valuation [87]. Since in each sample the player plays the game like a complete-information game, it avoids possibly advantageous moves which decrease uncertainty by increasing the player's information but do not directly increase the goal value. This is mainly due to the wrong assumption that the player, because it is playing a complete-information game model, already has all information and requires no extra information to play strongly. This limitation was shown at the incomplete information track of the GGP competition at the 2012 Australasian Joint Conference on Artificial Intelligence: three games were designed to test the ability of players to value information. All the players of that time failed to play those games successfully, as none of them valued information [87].
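To make the determinisation idea concrete, the following Python sketch shows the move-selection loop just described. It is a minimal illustration, not the code of any particular player from the literature; the helpers sample_complete_state and solve_as_perfect_information are hypothetical placeholders for a sampler over the current information set and an ordinary perfect-information solver.

from collections import defaultdict

def determinisation_move(information_set, legal_moves, num_samples,
                         sample_complete_state, solve_as_perfect_information):
    """Pick a move by averaging perfect-information evaluations over samples.

    information_set: the states the player cannot currently tell apart.
    sample_complete_state: hypothetical sampler returning one concrete state.
    solve_as_perfect_information: hypothetical solver returning the value of
        playing `move` in `state` as if everything were visible.
    """
    totals = defaultdict(float)
    for _ in range(num_samples):
        # 1. Guess one fully determined state consistent with what we know.
        state = sample_complete_state(information_set)
        # 2. Evaluate every legal move as if the game had complete information.
        for move in legal_moves:
            totals[move] += solve_as_perfect_information(state, move)
    # 3. Choose the move with the best average value over all samples.
    return max(legal_moves, key=lambda m: totals[m] / num_samples)

Because step 2 treats every sample as fully observable, a move that merely gathers information never looks better than one that chases reward directly, which is exactly the information-valuation failure discussed above.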


HyperPlayer with Incomplete Information (HP-II) [86] was introduced as the answer to this problem. However, HP-II itself has two other limitations. The first limitation, which is noted in the HP-II paper itself, is its incapability to solve cooperative games with implicit communication. In cooperative games with implicit communication, players need to share information without being exactly directed on how. The second problem is its inability to return an optimal policy in games which require long-distance information valuation. In this thesis, I reveal this problem by showing that HP-II returns a good but not optimal strategy in the Battle Ships in the Fog board game, a game introduced in the HP-II paper itself [86]. As mentioned earlier, another gap of current GGP-II players is the inability to implicitly communicate. In this thesis, I illustrate that this limitation is due to the inability of players to automatically generate a common implicit language among themselves when it is required. Also, I show that implicit communication can increase the efficiency of solving Multi-Agent Path Finding with Destination Uncertainty (MAPF/DU) problems. MAPF/DU can model several real-world multi-agent applications in which agents need to move to different destinations without any collision and central planning. Examples are automated cars at intersections [23] and office or warehouse robots [117, 109]. The core contributions of this thesis are summarised as follows:

• Discovering the vulnerability of the state-of-the-art GGP-II approach, HP-II.

• Introducing the Iterative Tree Search (ITS) algorithm as a significant enhancement over state-of-the-art algorithms for general game playing with incomplete information.

• Introducing a new, more efficient version of ITS, called Monte Carlo Iterative Tree Search (MCITS). MCITS can correctly solve games in extensive form, including GGP-II games, by valuing information with efficient memory usage.

• Introducing the novel General Language evolution technique (GL). GL generates an agreement, also known as a common “language”, between cooperative players to assist them in sharing their knowledge; hence, it correctly solves games with an implicit communication requirement.

• Increasing the efficiency of the GL technique by introducing General Language Tree Search in order to solve complex problems such as Multi-Agent Path Finding Problem with Destination Uncertainty.

1.3 Structure

The rest of this thesis is organised as follows: Chapter 2 provides an introduction to multi-agent systems. It starts by introducing different fields in the literature which discuss incomplete information games, such as game theory and general game playing. It then explores the literature in these areas and describes limitations in their approaches. Chapter 3 describes the ITS algorithm. ITS extends fictitious play [6] to extensive form games. The original fictitious play is a learning rule in game theory which only applies to games in normal form. Monte Carlo Iterative Tree Search is presented in Chapter 4. Then, Chapter 5 introduces the innovative General Language evolution technique (GL) to solve cooperative games with implicit communication. Chapter 6 introduces the General Language Tree Search (GLTS). To evaluate GLTS I apply it to the well-known Multi-Agent Path Finding with Destination Uncertainty (MAPF/DU) problem. I explain how MAPF/DU problems can be solved more easily with the help of implicit communication. Chapter 7 concludes the thesis with a discussion of the results and suggestions for future work.

Chapter 2

Multi-Agent Systems in Incomplete Information Environment

Many real-world problems can be seen as a game in which multiple agents (aka. players) interact with each other. Some examples are the stock market and city traffic. Games can be divided into two categories: games with complete information and games with incomplete information. In games with complete information, agents fully know the rules of the game and past moves. However, in incomplete information games, agents have asymmetric information. In other words, agents might not see all moves or fully know the rules of the game. The term incomplete information is common in the field of AI, but in the field of game theory this category is called imperfect information. This thesis uses the term incomplete information. The focus of this thesis is on incomplete information games. It is centred around the two well-known fields of game theory and General Game Playing. This chapter starts by introducing two well-known game representations in game theory, the normal form and the extensive form, and the game representation in General Game Playing, the Game Description Language. Later, it compares the generality of these representations and how to convert one to another. Finally, it reports on the literature in game theory and General Game Playing.


Publications

This chapter recapitulates and expands on previously published works.

• [16] Armin Chitizadeh and Michael Thielscher, "Iterative tree search in general game playing with incomplete information", In: Tristan Cazenave, Abdallah Saffidine, Nathan Sturtevant (Editors), CGW 2018: Computer Games - 7th Workshop, Held in Conjunction with the 27th International Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence 2018, Stockholm, Sweden, July 13, 2018, Revised Selected Papers, pp. 98–115.

This publication has been used within sections 2.1, 2.2, 2.3.2 and 2.3.4.

• [15] Armin Chitizadeh and Michael Thielscher, "General language evolution in general game playing", In: AI 2018: Advances in Artificial Intelligence - 31st Australasian Joint Conference, Wellington, New Zealand, Dec 11-14 2018, pp. 51-64.

This publication has been used within section 2.3.3.

2.1 Game Representations

The first step of solving a problem is to represent it. Following are representations from both game theory and General Game Playing.

2.1.1 Game Theory

Game theory is a general framework for analysing situations in which several agents’ decisions impact each other [24]. Game theory was introduced as a new field when the paper On the Theory of Games of Strategy [110] was published by John von Neumann. The paper only considered two-player games in which one’s gain equals the other’s loss. This category of games is referred to as zero-sum games1. In collaboration

1Zero-sum is identical to constant-sum. Constant-sum means the sum of all rewards is always constant. Any constant-sum game can be turned into a zero-sum game and vice versa.

with Morgenstern, von Neumann extended game theory to multi-agent, non-zero-sum games by publishing the book Theory of Games and Economic Behavior [73]. In game theory, there are two main representations to model a game: the normal form and the extensive form.

The Normal Form

A game in normal form (aka. strategic form) consists of three elements [69]:

• r ∈ R finite set of players

• Sr the pure-strategy space for each player r

• ur(s) pay-off (or utility) functions for each player r at each profile s = (s1, ..., sr). The aim of each player is to maximise its own utility.

In normal form games, players choose their preferred strategies simultaneously. A game in normal form can be represented by a matrix. Each matrix shows the utility of the players for different profiles of strategies. To illustrate this, the well-known problem of the Prisoner’s Dilemma [80] is used as an example here. Table 2.1 shows the matrix for this game. Each row represents prisoner B’s strategies and each column represents prisoner A’s strategies. The values inside each cell represent the rewards, also known as the utilities, for the players. The left value in each cell is the reward for prisoner B and the right value is the reward for prisoner A. The rewards can be either negative or positive. The Prisoner’s Dilemma problem can be described as follows: Two members of a gang are arrested and placed in separate rooms with no means of communication. The prosecutor lacks enough evidence to fully charge the two for their main crimes. However, the prosecutor has enough evidence to convict them of lesser crimes. So he gives them an offer. The offer is to betray the other by testifying against him. The reward for betraying is a reduction of the prison sentence by 2 years. The prison sentence for the lesser crime is 4 years and the sentence for the bigger crime is 10 years. Each year in prison is represented by a reward of −1.


                              Prisoner A
                              Betray        StaySilent
Prisoner B    Betray          (−8, −8)      (−2, −10)
              StaySilent      (−10, −2)     (−4, −4)

Table 2.1: The matrix for the prisoner’s dilemma

Three scenarios can occur. The first is when neither of them betrays the other. This way both will be charged with the lesser crime and are sent to prison for 4 years. The second is when both betray by testifying. They will then be charged for the larger crime with two years being forgiven due to their cooperation with law enforcement. In this scenario, both are sent to prison for 8 years. The final scenario is when one betrays the other while the other remains silent. In this scenario, the betrayer is only sentenced to 2 years of prison while the one who has been betrayed is sentenced to 10 years.

Depending on different scenarios, such as the level of trust or a pre-game arrangement, different strategies can be considered optimal. If there is no trust among the players or pre-game arrangement, then the best strategy for a player is the one which provides a higher self-reward, regardless of the other players’ chosen strategies. Such a strategy is called a dominant strategy [80].

In the Prisoner’s Dilemma, betraying is the dominant strategy for a prisoner. To understand why, let us consider the two possible situations based on the strategy of the opponent. If the opponent chooses to stay silent, then betraying is the best strategy for the other prisoner, since it results in 2 rather than 4 years. If the opponent chooses to betray, it is again better to betray, because this results in 8 years of prison, which is better than staying silent and receiving 10 years of prison. If in a game there exists at least one dominant strategy for each player, then the combination of these strategies is called the dominant strategy equilibrium [82]. A strategy profile is an equilibrium if each player sees no benefit in changing his strategy. In other words, an equilibrium strategy profile is a local optimum.
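As a small illustration of the dominant-strategy argument above, the Python sketch below encodes the payoffs of Table 2.1 and checks, for each prisoner, whether one strategy is never worse than the alternatives against every choice of the opponent. This is only an illustrative check of the definition, not part of the algorithms developed in this thesis.

# Payoffs from Table 2.1: (reward for prisoner B, reward for prisoner A),
# indexed by (B's strategy, A's strategy).
PAYOFFS = {
    ("Betray",     "Betray"):     (-8,  -8),
    ("Betray",     "StaySilent"): (-2, -10),
    ("StaySilent", "Betray"):     (-10, -2),
    ("StaySilent", "StaySilent"): (-4,  -4),
}
STRATEGIES = ["Betray", "StaySilent"]

def is_dominant(player, candidate):
    """True if `candidate` is at least as good as every other strategy of
    `player` against every possible strategy of the opponent."""
    idx = 0 if player == "B" else 1            # B's reward is the left entry
    for other in STRATEGIES:
        for opp in STRATEGIES:
            cand_profile = (candidate, opp) if player == "B" else (opp, candidate)
            other_profile = (other, opp) if player == "B" else (opp, other)
            if PAYOFFS[cand_profile][idx] < PAYOFFS[other_profile][idx]:
                return False
    return True

for player in ("B", "A"):
    for s in STRATEGIES:
        if is_dominant(player, s):
            print(f"Prisoner {player}: {s} is a dominant strategy")
# Prints that Betray is dominant for both prisoners, so (Betray, Betray) is the
# dominant strategy equilibrium described in the text.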

So far we have only dealt with pure strategies. Strategies can also take the form of a mixed strategy. A mixed strategy is a probability distribution over the set of a player’s pure strategies [76]. The mixed strategy for player r is represented by the vector π⃗r. A mixed strategy can be written as π⃗r = Σ_{sr ∈ Sr} Csr · sr, where Csr is the probability of the strategy sr, with Σ_{sr} Csr = 1 and Csr ≥ 0. It was proven by John Nash that “for every game, there exists at least one equilibrium” [71]. In honour of his work, this kind of equilibrium is now called the Nash equilibrium in game theory [82]. The Nash equilibrium can be made from either mixed or pure strategies. The main limitation of the normal form is its inability to correctly represent sequential games. In a sequential game, players choose their moves one after another. This allows a player to see what other players have previously played and to play accordingly. The extensive form is designed to model sequential games.
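As a worked illustration of the mixed-strategy notation above (with hypothetical probabilities of my own choosing, not values computed anywhere in this thesis), a mixed strategy for prisoner B over SB = {Betray, StaySilent} can be written in LaTeX form as:

\vec{\pi}_B = 0.7\,\textit{Betray} + 0.3\,\textit{StaySilent},
\qquad C_{\textit{Betray}} = 0.7,\quad C_{\textit{StaySilent}} = 0.3,
\qquad \sum_{s_B \in S_B} C_{s_B} = 1,\quad C_{s_B} \ge 0 .

Under this strategy, B betrays with probability 0.7; a pure strategy is the special case in which one coefficient equals 1 and all others equal 0.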

The Extensive Form

The extensive form represents the game as a tree. Nodes are the states of the game with the starting state to be the root of the tree. Branches are the moves, each connecting a node, parent, to one of its successor nodes, children. Leaf nodes in the game tree are called terminal states. At each level, a player is in charge of choosing a move. The game continues until it reaches a terminal state in which each player will be provided with a reward. In order for the extensive form to be capable of modelling incomplete information games, chance and information set need to be added. The random player, also known as Nature, is a special player which plays at random with a pre-set probability. The information sets are the sets of states which are indistinguishable for some players. In the extensive form, the information set is only represented for the player who is playing the round. To present a game tree in the extensive form, I use the Kodak vs. Polaroid [27] example. The Polaroid company is famous for its instant cameras. It had a monopoly on the market due to its patent rights. In 1975, the main patent rights on instant cameras had expired. Right after the expiration date, the Kodak company announced its plan to challenge Polaroid in the instant camera industry. There had been rumours that Polaroid had spent a tremendous amount of money

on research. However, no results had yet been published. There are two possible scenarios. First, Polaroid might have invented a technology which could revolutionise the industry but kept it secret, to use it only if the company was challenged by opponents. The other scenario is that Polaroid was not able to invent anything and was bluffing. This Kodak vs Polaroid case can be modelled using Nature and information sets. The extensive form graph of this problem is shown in Figure 2.1. First, Nature decides whether Polaroid has invented or not. Only Polaroid knows about the success of the research; Kodak does not. For this reason, Kodak finds itself in an information set with two possible states. Kodak then chooses either to enter or not to enter the market. Not entering simply ends the game with Kodak receiving 0 as its utility and Polaroid receiving 3. However, if Kodak chooses to enter, then Polaroid needs to decide either to accommodate or to fight. Regardless of the invention, accommodating leads to terminal states with a utility of 1 for both companies. If Polaroid decides to fight and has actually invented, then it receives a reward of 2 and Kodak is rewarded −1. The utilities for the players are swapped if Polaroid fights and has just bluffed. Figure 2.1 shows the Kodak vs. Polaroid game tree represented in extensive form. The dashed line connecting the two nodes shows the information set for Kodak. This means Kodak only knows it is in one of the two states but is not sure which one. The black circles at the bottom of the figure are the terminal states. When the game reaches a terminal state, the players receive their rewards. The rewards for each player are shown at the bottom of the terminal states. There are other forms of extensive form representation. They are used to represent some specific groups of games compactly. In this thesis, I also use a specific form of extensive form representation called the world model.

[Figure 2.1: Kodak vs. Polaroid extensive form game tree]
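To make this tree concrete, the following minimal Python sketch encodes the Kodak vs. Polaroid game using the utilities given in the text. The node layout, the helper names and the info_set label are my own illustrative conventions, not notation from this thesis, and the probability of Nature's choice is left unspecified, as in the example.

# Terminal nodes carry (Kodak utility, Polaroid utility); inner nodes name the
# player to move and map each move to a child node.  Both of Kodak's decision
# nodes carry the same info_set label because Kodak cannot tell whether
# Polaroid really invented or merely bluffed.
def terminal(kodak, polaroid):
    return {"terminal": True, "utility": (kodak, polaroid)}

def node(player, children, info_set=None):
    return {"terminal": False, "player": player,
            "children": children, "info_set": info_set}

def kodak_subtree(fight_kodak, fight_polaroid):
    return node(
        "Kodak",
        {
            "not_enter": terminal(0, 3),
            "enter": node("Polaroid", {
                "accommodate": terminal(1, 1),
                "fight": terminal(fight_kodak, fight_polaroid),
            }),
        },
        info_set="kodak_cannot_tell_invented_from_bluffed",
    )

game = node("Nature", {
    "invented": kodak_subtree(-1, 2),   # fighting hurts Kodak if the invention is real
    "bluffed":  kodak_subtree(2, -1),   # fighting backfires on Polaroid if it bluffed
})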

The World Model

The world model is a kind of extensive form which is used to represent incomplete information games with Nature choosing the starting position of the game, like shuffling in poker. Each starting position is called a world. In poker, a world is a unique combination of hands and the order of cards in the deck. Some players might know which world is the chosen world while others might have limited or no information.


Figure 2.2 shows a constant-sum game described in the world model. This game consists of three players: Nature, Max and Min. Unlike the usual extensive form, the world model does not have any nodes or edges related to Nature. They are all summarised in the rewards for each world. First, Nature chooses a world. In this example, there are three worlds for Nature to choose from: w1, w2 and w3. Since this is a constant-sum game, it is sufficient to show the rewards for only one player. The convention is to choose the rewards of the first player and call it the Max player. The other player is called the Min player. For clarity, delta ∆ is used for Max’s nodes and nabla ∇ is used for Min’s nodes. The information of each player is explicitly expressed in the description of the game, separately from the diagram. For some games it is more convenient to label the moves with names, as in Figure 2.1, while in this case, Figure 2.2, it is easier to differentiate the nodes by giving each an exact name.
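The world model needs very little machinery beyond an ordinary game tree: a set of worlds, a reward assignment per world, and a separate statement of what each player knows. The short Python sketch below only illustrates that shape; the reward numbers and the knowledge partitions are hypothetical placeholders, not the values of Figure 2.2.

# A world model: one shared tree of Max/Min decision nodes, and for every world
# chosen by Nature a separate assignment of Max's reward to each terminal node.
# (Min's reward is implied because the game is constant-sum.)
worlds = ["w1", "w2", "w3"]

# Hypothetical placeholder rewards: rewards[world][terminal_node] = Max's reward.
rewards = {
    "w1": {"t1": 1, "t2": 0},
    "w2": {"t1": 0, "t2": 1},
    "w3": {"t1": 0, "t2": 0},
}

# Each player's information is described separately from the diagram, e.g. as the
# partition of worlds it can distinguish (again, purely illustrative values).
knowledge = {
    "Max": [{"w1"}, {"w2"}, {"w3"}],   # singleton blocks: Max knows the true world
    "Min": [{"w1", "w2", "w3"}],       # one block: Min has no information
}

def max_reward(world, terminal_node):
    return rewards[world][terminal_node]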

2.1.2 General Game Playing with Incomplete Information

General Game Playing (GGP) was motivated by the idea that true intelligence is general intelligence. Unlike specialised AI systems, like Deep Blue, a GGP system receives the game rules at runtime in a logical description language. The standard language for this purpose is the Game Description Language (GDL). GDL was recently extended to support incomplete information games (GDL-II).

[Figure 2.2: A constant-sum game described in the world model]

Game Description Language

The Game Description Language is a Prolog-like language with a compact representation that uses prefix notation [36]. There are 101 game-independent vocabulary items in GDL: the base-ten integers from 0 to 100, inclusive. There are also game-independent relation constants, as shown below [37]:

role(a)       meaning a is a role in the game.
input(r, a)   meaning action a is feasible for role r.
base(p)       meaning p is a base proposition in the game.
init(p)       meaning proposition p is true in the start of the game.
true(p)       meaning proposition p is true in the current state of the game.
does(r, a)    meaning role r performs action a in the current state of the game.
next(p)       meaning proposition p is true in the next state of the game.
legal(r, a)   meaning action a is available for role r in the current state of the game.
goal(r, n)    meaning player r receives the utility of n in the current state of the game.
terminal      meaning the current state of the game is a terminal state.

States in GGP are represented by a set of game features that are true. The initial state and terminal states are distinguished using init(p) and terminal. After reaching terminal states, players receive their rewards according to the goal(r, n)

relation constant. By convention, the minimum reward is 0 and the maximum reward is 100. The intention of each player is to maximise his or her reward. Legal moves and next states are defined by a set of logical rules. The original GDL was designed to model games with complete information. It assumes players can see others’ moves. It also lacks a way to model random events such as shuffling cards or rolling a die. GDL was recently extended to support incomplete information games (GDL-II) [104]. To this end, a random player (aka. Nature) and perceptions were added to the original GDL using the following relation constants and keyword:

percept(r, p)  meaning player r has perception p in the game.
sees(r, p)     meaning player r receives perception p in the next state.
random         a predefined role that moves at random.

The random player chooses actions randomly with equal probability. In GGP-II games, players cannot see other players’ moves. Instead, after each round of the game, they receive perception tokens [104]. Similar to legal moves, perceptions are defined by a set of logical rules. Different moves send different perceptions to each player. If one player receives the same perception for two different moves of another player, then the player cannot distinguish between the two moves. This plausible asymmetric information among different players increases the complexity of the problem of playing optimally. As an example, Figure 2.3 shows the GDL-II rules for a game called Cutting Wire. This game was suggested to test the ability of players to correctly value information in games [87]: There are two wires, blue and red, one of which is used by the random player to arm a bomb (lines 5 to 7). Two players (lines 1 to 2) need to cooperate in order to cut the right wire to disarm the bomb. One player, Teller, only sees which wire is armed (lines 22 to 23), while the other player, Cutter, has to decide which wire to cut (lines 18 to 20). After seeing which wire is armed, Teller needs to decide either to tell the colour or to wait (lines 14 to 17). By telling, Cutter will see the colour of the wire with which the bomb is armed (lines 24 to 26).


 1 (role cutter)
 2 (role teller)
 3 (role random)
 4 (init (round 0))
 5 (colour red) (colour blue)
 6 (<= (legal random (arm ?c))
 7     (colour ?c) (true round 0))
 8 (<= (legal random noop)
 9     (not (true round 0)))
10 (<= (legal teller noop)
11     (true round 0))
12 (<= (legal cutter noop)
13     (true round 0))
14 (<= (legal teller tell)
15     (true round 1))
16 (<= (legal teller wait)
17     (true round 1))
18 (<= (legal cutter (cut ?c))
19     (true round 2)
20     (true colour ?c))
21 ...
22 (<= (sees teller ?c)
23     (does random (arm ?c))
24 (<= (sees cutter ?c)
25     (does teller tell)
26     (true armed ?c))
27 (<= (next (round 1))
28     (true (round 0)))
29 ...
30 (<= (next (armed ?c))
31     (does random (arm ?c)))
32 (<= (next (armed ?c))
33     (true (armed ?c))
34
35 (<= disarmed
36     (does cutter (cut ?c))
37     (true (armed ?c)))
38 (<= exploded
39     (does cutter (cut ?c))
40     (not true (armed ?c)))
41 (<= terminal
42     (true round 3))
43 (<= (goal ?role 100)
44     (disarmed)
45     (distinct ?role random))
46 (<= (goal ?role 0)
47     (exploded)
48     (distinct ?role random))
49 (<= (goal random 0))

Figure 2.3: GDL-II description of the Cutting Wire game.

Formalisation. In chapters 3 and 4, I will not be concerned with a set of GDL rules themselves but rather consider the induced game tree, including players’ perceptions [104]. GDL and GDL-II allow us to describe games with simultaneous moves. For simplicity of explanation, I use the standard transformation by which joint-move incomplete-information games in GGP-II can be converted into sequential incomplete-information games. A minimal data-structure sketch of the resulting game tuple is given below, after Definition 1.

Definition 1. Let G = ⟨S, R, M, Σ, s0, Z, u, do⟩ be a game with incomplete information, where:

• S is a set of states;

• R is a set of players, and R(s) is a function which given state s provides the player whose turn it is;

• M is a set of moves, and M(s) is the list of legal moves at state s by the player2;

2Each move is unique. Having similar names for moves at different states does not mean the moves are the same.


• Σ is a set of perceptions, and Σ(s) is the list of perceptions for R(s) from the initial state to s. By convention, I use the function Σ′(σ), which returns the set of states given a list of perceptions σ.

• s0 ∈ S is the initial state of the game;

• Z ⊂ S is the set of terminal states, for which we have M(z) = ∅ for any z ∈ Z;

• u : Z → ℝ^|R| is the terminal utility function;

• do : S × M → S is the successor function.
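The following sketch, using my own Python naming, shows one direct way to hold the components of Definition 1 in code; it is only an illustration of the tuple's shape, and a real player would derive these functions from the GDL-II rules rather than spell them out by hand.

from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = str          # states are treated as opaque identifiers here (e.g. "s_rt")
Move = str
Perception = Tuple   # one perception entry per past round

@dataclass
class IncompleteInformationGame:
    """Container for the tuple G = <S, R, M, Sigma, s0, Z, u, do> of Definition 1."""
    states: List[State]                              # S
    players: List[str]                               # R
    to_move: Callable[[State], str]                  # R(s): whose turn it is
    legal_moves: Callable[[State], List[Move]]       # M(s)
    perceptions: Callable[[State], List[Perception]] # Sigma(s): percept history for R(s)
    initial_state: State                             # s0
    terminals: List[State]                           # Z, with M(z) empty for z in Z
    utility: Callable[[State], Dict[str, float]]     # u: terminal state -> reward per player
    do: Callable[[State, Move], State]               # successor function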

The information sets can be determined from the observation tokens a player has received [104] and his past moves.

Definition 2. Let G be a GDL-II game as in Def. 1, then:

• I(s): S → S∗ is the information set function which takes a state s and returns a set of states. The given state and all the states in the returned set are indistinguishable by the player for whom s is a decision node.

Since we are converting games into sequential form, the player for which states are indistinguishable is always the player whose turn it is. The following equation describes formally how an information set is generated:

I(s) = {x ∈ S : Σ(s) = Σ(x) ∧ ξr(s) = ξr(x) ∧ r = R(s)} (2.1)

Here, ξr : S → M* is the history function which provides the sequence of moves of player r from the start to the provided state³. In the Cutting Wire example, the only non-singleton information set is I(srw) = {srw, sbw}. To illustrate these concepts, I look at the Cutting Wire game, which is represented in extensive form in Figure 2.4. In this thesis, I use a sequence of moves as a subscript to denote a state in the game tree; for example, srtb means a state reached after performing arm_red, tell and cut_blue.

3Unlike GDL, which represents each state by a set of features that are true, I represent each state by the sequence of moves that are played to reach the state.

The roles are R = {random, Cutter, Teller}. Examples of the set of legal moves and the list of percepts for a state are M(sr) = {wait, tell} and Σ(srt) = [(), red], respectively. As Cutter reaches its decision state srt, it has received two perceptions along the path. The first perception is empty because the first action by the random player does not result in any perception for Cutter. The second is red because, in this particular state, Teller has chosen tell. The final utility in this case is u(srtr) = 90. The do(s, m) function returns the next state, e.g. do(sr, wait) = srw. A small code sketch after Figure 2.4 below makes these definitions concrete.

[Game tree: random chooses arm_red or arm_blue; Teller then chooses wait or tell; Cutter then chooses cut_r or cut_b. Both players receive 100 for a correct cut after wait, 90 for a correct cut after tell, and 0 for a wrong cut.]

Figure 2.4: The extensive form representation of the Cutting Wire game
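To make Definitions 1 and 2 concrete, the following is a minimal Python sketch of the Cutting Wire game tree from Figure 2.4. It identifies states by their move sequences (as in footnote 3), reads the payoffs off the figure, and computes information sets with equation (2.1). The helper names (percepts, own_moves, info_set) are illustrative only and are not the notation used in the rest of the thesis.

# A hypothetical encoding of the Cutting Wire game tree (Figure 2.4).
# States are identified by the sequence of moves played so far.

LEGAL = {                                   # M(s): legal moves of the player to move
    (): ["arm_red", "arm_blue"],            # random
    ("arm_red",): ["wait", "tell"],         # Teller
    ("arm_blue",): ["wait", "tell"],
}
for arm in ("arm_red", "arm_blue"):
    for msg in ("wait", "tell"):
        LEGAL[(arm, msg)] = ["cut_r", "cut_b"]   # Cutter

def utility(state):
    # u(z): 100 for a correct cut after wait, 90 after tell, 0 otherwise.
    arm, told, cut = state
    correct = (arm == "arm_red") == (cut == "cut_r")
    return (90 if told == "tell" else 100) if correct else 0

def percepts(state, player):
    # Sigma(s): the perceptions `player` receives along the path to `state`.
    seq = []
    if len(state) >= 1:
        seq.append(state[0] if player == "teller" else None)   # only Teller sees the armed wire
    if len(state) >= 2:
        seq.append(state[0] if (player == "cutter" and state[1] == "tell") else None)
    return tuple(seq)

def own_moves(state, player):
    # xi_r(s): the player's own past moves along the path to `state`.
    roles = ["random", "teller", "cutter"]
    return tuple(m for i, m in enumerate(state) if roles[i] == player)

def info_set(state, player, all_states):
    # I(s) from equation (2.1): same perceptions and same own moves.
    return {x for x in all_states
            if percepts(x, player) == percepts(state, player)
            and own_moves(x, player) == own_moves(state, player)}

cutter_states = [s for s in LEGAL if len(s) == 2]
print(info_set(("arm_red", "wait"), "cutter", cutter_states))
# contains ('arm_red', 'wait') and ('arm_blue', 'wait'), i.e. I(srw) = {srw, sbw}
print(utility(("arm_red", "tell", "cut_r")))   # 90, matching u(srtr)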

General Game Playing Match

General Game Playing is more than just a game representation. It is a platform to test the intelligence of several systems. To understand this platform, it is essential to understand its ecosystem.

Figure 2.5 shows a typical ecosystem for General Game Playing [37]. At the centre is the match manager⁴. The match manager maintains a database of game descriptions, match records and temporary state data for the running match. It also provides graphics for spectators. The match manager communicates with players through HTTP connections.

4The common term for this entity in the literature is the game manager. However, the term match is more proper because the issue is to manage individual matches and not the actual game [37].


[Diagram: the match manager is connected to the game descriptions, match records, temporary match state data and spectator graphics, and communicates with Player 1 ... Player n over TCP/IP.]

Figure 2.5: General Game Playing Match Ecosystem [37]

At the start of the match, the match manager sends the game description to the players. The players then have time to prepare for the match. After the preparation time is over, all players have to submit a move to the match manager through an HTTP connection. After checking the legality of the moves, the match manager updates the state data. Then, it notifies the players by sending them perceptions according to the game description. Players are given a short time to decide on the next move to send; this time is generally shorter than the initial preparation time, and they need to submit a new move before it runs out. This process continues until the game reaches a terminal state. In the end, the match manager notifies the players that the match has ended and informs them of their rewards.
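As an illustration of this exchange, here is a minimal sketch of a player-side HTTP loop in Python. The message keywords (start, play, stop), the port number and the reply strings are assumptions made for the sketch rather than the exact GGP-II protocol, and choose_move is a placeholder for the player's actual reasoning.

# Schematic sketch of a GGP player's HTTP loop (message formats are assumed).
from http.server import BaseHTTPRequestHandler, HTTPServer
import random

legal_moves = ["noop"]        # a real player derives these from the game description

def choose_move(message):
    # Placeholder reasoning step: a real player would update its state with the
    # moves/percepts contained in `message` and search within the play clock.
    return random.choice(legal_moves)

class PlayerHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        message = self.rfile.read(length).decode().lstrip("( ").lower()
        if message.startswith("start"):
            reply = "ready"               # preparation time: store the game rules
        elif message.startswith("stop"):
            reply = "done"                # terminal state reached: clean up
        else:
            reply = choose_move(message)  # each round: submit a move in time
        self.send_response(200)
        self.end_headers()
        self.wfile.write(reply.encode())

if __name__ == "__main__":
    HTTPServer(("", 9147), PlayerHandler).serve_forever()   # port is an assumption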

2.2 Representations Comparison

The Game Description Language is the most general of these representations and normal form is the least general. The generality of a representation means that any problem described in a representation can also be described in a more general one, but the reverse is not always true. Any game in normal form can be described in extensive form by making players' moves hidden from the others [108]. As an example, consider the Prisoner's Dilemma from Table 2.1; Figure 2.7 illustrates this example in extensive form.


Figure 2.6: Generality of the discussed representations (from most to least general: Game Description Language, extensive form, world model, normal form).

[Game tree: Prisoner A chooses betray or stay_silent, then Prisoner B chooses without seeing A's move; the payoffs are those of Table 2.1.]

Figure 2.7: The Prisoner’s Dilemma in extensive form

Similarly, any game described in the world model can be described in extensive form. As an example, consider the game in Figure 2.2. We can convert it to extensive form by adding an extra node at the top of the graph. The node belongs to Nature, and it has as many outgoing edges as there are worlds. Figure 2.8 shows the extensive form representation of the game in Figure 2.2. Since the game is a constant-sum game, delta and nabla symbols are used for the nodes. For clarity, only Max's rewards are shown, and dotted squares are used to mark the states in an information set instead of dotted lines connecting states. Any game described in GDL-II format can be converted to extensive form and vice versa; for the proof, please refer to [106]. However, some extra information, such as perception names, is lost in translating GDL-II to extensive form. In Chapters 5 and 6 this extra information will be used for implicit communication, and it is described in more detail there. Figure 2.6 compares the generality of the discussed representations. In Chapters 3 and 4 I am not concerned with the GDL rules themselves but rather with the induced game tree; in Chapters 5 and 6, however, I consider the GDL rules as well. The logic-based structure of GDL gives it an advantage over extensive form for solving the implicit communication problem in Chapters 5 and 6.

Figure 2.8: The extensive form representation of the Frank and Basin non-locality game.

2.3 Multi-agent Algorithms in the Literature

Games can be categorised into two groups: competitive, like Zodak vs. Polaroid, or cooperative, like the Cutting Wire example. There are different approaches for these two categories. In this thesis, the focus is on incomplete information games of both categories. First, I illustrate the two common problems with current approaches for incomplete information games in both game theory and GGP-II. Then, I describe approaches in game theory for both categories of games and their limitations. Next, I describe the current approaches in GGP-II. GGP-II approaches are applicable to both competitive and some cooperative games. I finish this chapter by describing the state-of-the-art approach in GGP-II, Hyper Player with Incomplete Information (HP-II) [87]. HP-II was developed to solve the two common problems of strategy-fusion and non-locality.

2.3.1 Two Common Problems with Incomplete Information Approaches

Following are the two most common problems with several current approaches in incomplete information games.

Non-Locality

Frank and Basin formalised the non-locality problem in 1998 [31]. They stated that a game tree search algorithm which evaluates the value of a state solely based on its subtree will, in an incomplete information game, suffer from the non-locality problem. During the game, an opponent who possesses more information can lead the game toward a node located in a completely different branch of the tree. As a result, the true value of a node might depend on other, non-local nodes of the tree. To illustrate this, I use the game in Figure 2.2, which is borrowed from [32]. This game was created by Frank et al. to illustrate the non-locality problem. They used the world model rather than extensive form to show the incomplete information game tree. Search tree algorithms with the non-locality problem always choose right at both the e and d nodes, even though choosing right at node d is non-optimal. The reason for this wrong choice is that the Max player thinks it is equally likely to be in any of the three worlds, which means choosing right at node d results in a win 2/3 of the time. However, Min has no reason to move toward d in worlds w2 and w3 when node e gives it a guaranteed win. For this reason, the optimal moves are to choose left at node d and right at node e. Figure 2.9 demonstrates the strategy that an algorithm with the non-locality problem chooses. The thick lines represent the chosen move at a given node by an algorithm that suffers from the non-locality problem.


Figure 2.9: A game with the non-locality problem, described in the world model. The thick lines represent the strategy chosen by an algorithm which has the non-locality problem.

Strategy-Fusion

Strategy-fusion occurs when an algorithm tries to find the best move in an information set by averaging the best moves for each state in the information set, under the wrong assumption that each state belongs to a complete information game. This is called the determinisation technique. While this technique might work in some games, it fails in games where information is valuable, because it ignores the value of moves that provide more information to the player. Solving the game as a combination of several complete information games gives the player the illusion that it knows and sees everything in the game; for this reason, it avoids seeking further information. To demonstrate strategy-fusion, I use the Cutting Wire game in Figure 2.4. Algorithms with the strategy-fusion problem always choose the non-optimal strategy of not telling, for the following reason. In one possible scenario the red wire is armed, and in another the blue wire is armed. If we treat each scenario as a complete information game, then in both of them the optimal strategy is to wait and not to tell. If we only consider the scenario in which the red wire is armed, then it is better to wait and cut the red wire. In the other scenario, with the blue wire armed, it is also

better to wait and then cut the blue wire. As a result, it is better to wait in both complete information games, and the algorithm mistakenly decides to wait rather than to tell. In the actual game, which has incomplete information, it is better to tell and take the 10-point penalty in order to collect more information.

2.3.2 Competitive Games

Designing a player for competitive games has been studied in different areas. There are two memorable milestones in history when humankind was beaten by computers. In 1997, Deep Blue, designed by IBM, was able to beat the World Chess Champion Garry Kasparov in a six-game match [10]. More recently, in 2016, a computer program from the DeepMind company, called AlphaGo, was able to beat Lee Sedol, the World Go Champion, in a five-game match [74]. These programs were a breakthrough in the field of AI, but it is believed that their success relies heavily on the game-specific expertise of their developers and tailored algorithms. More recently, AlphaZero [92] was able to learn the games of chess and Go from scratch, through self-play, to a point where it was able to beat Stockfish [81], the best current chess engine, and AlphaGo. AlphaZero uses a relatively general approach compared to its predecessor AlphaGo [92]. However, AlphaZero is only applicable to complete-information two-player zero-sum competitive board games [92]. The following subsections detail relatively general techniques for competitive games.

Fictitious Play

In game theory, fictitious play has been proposed as a learning technique [5]. It was originally designed for normal form games. It is known to find a Nash equilibrium in the time-average sense for two-player zero-sum games, for games solvable with iterated strict dominance, and for so-called identical interest games as well as potential games [66]. Recently, the technique has been extended to some extensive form games, for example full-width extensive-form fictitious play (XFP) [44] and neural fictitious self-play (NFSP) [45]. The XFP technique learns a strategy which

is realization-equivalent to normal form fictitious play, meaning it considers a strategy as a whole. For this reason, this technique suffers from the curse of dimensionality. The NFSP technique uses sampling and neural networks to discover a Nash equilibrium strategy for a game. It was able to play Limit Texas Hold'em successfully, but this game does not require information valuation. The standard fictitious play updates a mixed policy after each iteration by averaging the previously played moves [5]. The updating algorithm can be described mathematically as follows:

π_r^{t+1} = ( t · π_r^t + π(b(π_{-r}^t)) ) / (t + 1)    (2.2)

Here, r is the player and t is the iteration index; π_r^t is the mixed policy of player r at iteration t, and π_{-r} is the mixed policy of all players except r. A mixed policy π_r defines the probabilities with which player r chooses each move at every state where it is that player's turn.

Given the mixed policy π_{-r}^t of the other players, player r finds the best response as follows:

b(π_{-r}^t) = argmax_{m ∈ M} u(m, π_{-r}^t)    (2.3)

Here, b(π_{-r}) is the best-response function, which returns the best move m for player r given the mixed policy of the other players, and u(m, π_{-r}) is the expected reward of player r if it plays move m against the opponents' strategy π_{-r}. In this chapter, I also use π as a function which takes a move and returns a policy that assigns probability 1 to the given move and 0 to all others.
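To make the update rule of equations (2.2) and (2.3) concrete, here is a small self-contained Python sketch of fictitious play on a two-player zero-sum normal-form game, Matching Pennies, whose unique equilibrium is the uniform mixed strategy. The payoff matrix and the variable names are illustrative and not part of the thesis notation.

# Fictitious play on a two-player zero-sum normal-form game (Matching Pennies).
# Each player best-responds to the opponent's time-averaged mixed policy.

PAYOFF = [[+1, -1],        # row player's payoffs; the column player gets the negation
          [-1, +1]]

def best_response(opponent_policy, payoffs):
    # Equation (2.3): the move maximising expected utility against pi_{-r}.
    values = [sum(p * row[j] for j, p in enumerate(opponent_policy)) for row in payoffs]
    return max(range(len(values)), key=values.__getitem__)

def fictitious_play(iterations=10000):
    policies = [[0.5, 0.5], [0.5, 0.5]]                      # start from uniform
    col_payoffs = [[-PAYOFF[i][j] for i in range(2)] for j in range(2)]
    for t in range(1, iterations + 1):
        br = [best_response(policies[1], PAYOFF),
              best_response(policies[0], col_payoffs)]
        for r in (0, 1):                                     # equation (2.2): running average
            for m in range(2):
                played = 1.0 if m == br[r] else 0.0
                policies[r][m] = (t * policies[r][m] + played) / (t + 1)
    return policies

print(fictitious_play())   # both policies approach [0.5, 0.5]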

Information Set Monte Carlo Tree Search

Three techniques lie in this category: the Single Observer Information Set Monte Carlo Tree Search (SO-ISMCTS), the Single Observer Information Set Monte Carlo Tree Search with Partially Observable Moves (SO-ISMCTS + POM) and the Multiple-Observer Information Set Monte Carlo Tree Search (MO-ISMCTS) [21]. All of these algorithms search the information set tree rather than the game

state tree. However, each makes assumptions which limit it to specific classes of games. SO-ISMCTS assumes all moves by players are observable. This assumption makes it inapplicable to games with partially observable moves, like poker. SO-ISMCTS + POM handles partially observable moves but assumes opponents choose their moves at random. In other words, it does not consider opponents to be rational. This assumption leads to no opponent modelling and weak performance against a non-random player. MO-ISMCTS tries to solve the problem of the second approach by considering the opponents to be rational. It switches between players when it needs to find the other players' moves. Its main weakness is its inapplicability to games with simultaneous moves: MO-ISMCTS only considers moves by one player at a time and avoids considering joint moves. The main problem with all the ISMCTS techniques is that they use MCTS state simulation as a heuristic, and MCTS has the strategy-fusion problem [112]. MCTS starts from a single state in the current information set and chooses moves at random until it reaches a terminal state. This simulation misguides the algorithm into choosing moves without valuing information. In the Cutting Wire example, all the ISMCTS techniques choose the wait rather than the tell move.

Vector Minimaxing

Minimax is a common solution for perfect-information zero-sum games. It considers two players, Max and Min. Max tries to maximise its utility while Min tries to minimise Max's utility in order to increase its own. The algorithm starts at the terminal nodes. If it is Max's turn at a node, the value of the node is the highest value among its successors; if it is Min's turn, the value is the lowest. This process is applied recursively until the initial node is reached. Minimax can be implemented for incomplete information games by sampling different states of the current information set and playing each as a complete information game. However, this causes the strategy-fusion problem.
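As a minimal sketch of this recursion (on a hypothetical toy tree, not on any game from this chapter):

# Minimax on a complete-information zero-sum game tree.
# A node is either a number (terminal utility for Max) or ("max" | "min", [children]).

def minimax(node):
    if not isinstance(node, tuple):          # terminal node: return its utility
        return node
    player, children = node
    values = [minimax(c) for c in children]
    return max(values) if player == "max" else min(values)

# Max moves first, Min replies; Max can guarantee a value of 3 on this toy tree.
tree = ("max", [("min", [3, 5]), ("min", [1, 8])])
print(minimax(tree))   # 3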

28 2.3 Multi-agent Algorithms in the Literature

Vector Minimaxing is an extension of the original Minimax algorithm, and it solves games with incomplete information without the strategy-fusion error [32]. Vector Minimaxing assumes the incomplete information is created by the random player's hidden moves at the start of the game. Each unique combination of the random player's moves creates a world. Each world has a different reward at a terminal node. The Max player does not know which world it is placed in, but it can observe the other player's moves. On the other hand, Min knows which world it is placed in and sees Max's moves, so it chooses moves to minimise Max's reward. At the end of the game, players are notified about the world they are in and their rewards. If the original Minimax used the determinisation technique, the player would assume both players know the world they are placed in and would choose moves accordingly. This incorrect assumption can misguide the player.

Vector Minimaxing tries to overcome the problem by considering a vector of rewards instead of a single reward value at each terminal. Each entry of the reward vector at a terminal is the reward at that terminal in one world. The algorithm works recursively from the leaves to the initial node. At each Max node, the algorithm selects the child vector with the highest average reward over all worlds. At each Min node, it merges the children's reward vectors by taking the minimum reward value in each world. For example, in Figure 2.10 the value at node b, which is a Min node, will be [min(0, 0), min(0, 0), min(1, 1), min(1, 1), min(1, 1)], and at node a, which is a Max node, the choice is made by max(0 + 0 + 1 + 1 + 1, 1 + 1 + 1 + 1 + 0).
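The backup just described can be sketched as follows; the tree shape, the number of worlds and the payoff vectors below are illustrative and are not the game of Figure 2.10.

# Vector Minimaxing sketch: each leaf carries one payoff per possible world.
# Max picks the child vector with the highest average over worlds;
# Min merges children by taking the per-world minimum.

def vector_minimax(node):
    if not isinstance(node, tuple):                  # leaf: payoff vector
        return node
    player, children = node
    vectors = [vector_minimax(c) for c in children]
    if player == "max":
        return max(vectors, key=lambda v: sum(v) / len(v))
    return [min(vals) for vals in zip(*vectors)]     # per-world minimum

# Three worlds; Max to move at the root, Min at the inner nodes.
tree = ("max", [("min", [[1, 0, 0], [0, 1, 1]]),
                ("min", [[1, 1, 0], [1, 0, 1]])])
print(vector_minimax(tree))   # [1, 0, 0]: the right branch has the higher average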

Vector Minimaxing can solve games with strategy-fusion, but it assumes the other player has full information of the game, which leads the Max player to play paranoidly. Furthermore, Vector Minimaxing can only handle games in which players' moves are not hidden and there are no simultaneous moves. It also suffers from the non-locality problem. Vector Minimaxing is a compositional algorithm, meaning it only considers the subtree below a node to evaluate the value of that node, even though in games of imperfect information the decision also relies on other parts of the tree; it only lessens the effect of non-locality [32]. The original Vector Minimaxing fails to solve the simple game in Figure 2.2. More analysis is provided in Chapter 3.


Figure 2.10: Applying Vector Minimaxing to a game in the world model.

Paranoid and Over Confidence

These two algorithms are mainly concerned with modelling the opponent. The over confidence modelling algorithm assumes the opponent plays at random [78]. Assuming the opponent plays at random can result in weak play. On the other hand, the paranoid modelling algorithm believes the opponent knows everything and will always play the best move. Assuming a strong opponent may be correct in games with complete information, but it is a strong assumption for games of imperfect information, where players usually do not have enough information about the world and so cannot always choose the best action [78]. The two models were tested on several games, including the game of Kriegspiel [17, 13]⁵, and it was shown that in the majority of them the over confidence model performs

⁵Kriegspiel is an incomplete information variant of chess. In this game, players can only see their own pieces and not the opponent's. For this reason, a third person, or a computer, is needed to manage the game.

better. The two main reasons are the high degree of uncertainty and the processing speed. A player cannot always choose the best move in games with imperfect information because it does not hold enough information to make a sound decision. Also, calculating the best move for the opponent can be time-consuming compared with choosing moves at random; this efficiency helps the player to search the tree deeper, which leads to better play. The over confidence opponent modelling shows promising results in games with incomplete information, but the test was only performed on a few games and the result was not statistically proven. The strong assumption of a random opponent results in poor performance against a strong opponent [13].

Counterfactual Regret Minimization

CounterFactual Regret minimization (CFR) [118] was introduced to solve the imperfect-information game of poker. This technique performs well thanks to poker-specific optimisations [50]. However, its general implementation has failed to perform equally well in other games with incomplete information. One of the reasons is the high computational complexity of the algorithm, O(I²N), where I is the number of information sets and N the total number of states in the game [59]. A well-known extension of the original CFR is the Monte Carlo sampling CounterFactual Regret minimization [54]. The original CFR needs to update all information sets during each iteration; the Monte Carlo CFR only needs to update a sampled set of information sets. However, the Monte Carlo CFR still needs to generate all the information sets and keep them in memory. All successful forms of CFR use abstraction, and generating an abstraction of a game requires extensive knowledge of the domain [7], which makes it impractical in the GGP framework.
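Although the full CFR algorithm is beyond the scope of this summary, the sketch below shows the regret-matching rule that CFR applies at each information set, here in self-play on one-shot Rock-Paper-Scissors; the time-averaged strategies approach the uniform Nash equilibrium. This is only an illustration of the core update, not the poker implementation discussed above.

# Regret matching, the per-information-set update at the heart of CFR,
# in self-play on one-shot Rock-Paper-Scissors.
import random

ACTIONS = 3                                   # rock, paper, scissors
def payoff(a, b):                             # +1 win, 0 tie, -1 loss for action a
    return 0 if a == b else (1 if (a - b) % 3 == 1 else -1)

def strategy_from_regrets(regrets):
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    return [p / total for p in positive] if total > 0 else [1.0 / ACTIONS] * ACTIONS

def train(iterations=50000):
    regrets = [[0.0] * ACTIONS for _ in range(2)]
    strategy_sum = [[0.0] * ACTIONS for _ in range(2)]
    for _ in range(iterations):
        strategies = [strategy_from_regrets(r) for r in regrets]
        actions = [random.choices(range(ACTIONS), weights=s)[0] for s in strategies]
        for p in range(2):
            me, opp = actions[p], actions[1 - p]
            for a in range(ACTIONS):          # regret of not having played a instead
                regrets[p][a] += payoff(a, opp) - payoff(me, opp)
            for a in range(ACTIONS):
                strategy_sum[p][a] += strategies[p][a]
    return [[s / iterations for s in sums] for sums in strategy_sum]

print(train())   # both average strategies are close to [0.33, 0.33, 0.33]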

2.3.3 Cooperative Games

In cooperative games agents will gain the most when they fully cooperate [76]. Some of the approaches designed for competitive games will return competent results when

applied to some categories of cooperative games. However, there are two categories in which I found a gap in the literature: cooperative games with an implicit communication requirement and the Multi-Agent Path Finding problem. Chapters 5 and 6 discuss these two categories of cooperative games.

Cooperative Games with Implicit Communication Requirement

In some cooperative games, agents need to share their knowledge about the world without this being explicitly described in the rules of the game. As an example, consider a simple game, taken from [87], called the Cooperative Spies game. This is almost the same game as the Cutting Wire game in Figure 2.4, but with a slight twist. As before, there are three players: Nature and two agents, respectively called Cutter and Teller. After Teller sees which wire is used to arm the bomb, it needs to choose either the tell_b or the tell_r move. Figure 2.11 shows the extensive form representation and Figure 2.12 shows the GDL representation of this game.

[Game tree: random chooses arm_red or arm_blue; Teller then chooses tell_b or tell_r; Cutter then chooses cut_r or cut_b; both players receive the payoffs shown in the original figure (100, 90 or 0).]

Figure 2.11: The Cooperative Spies game in extensive form.

The twist in this game is the lack of any connection between the perception of Teller (which wire has been used) and the message it can send to Cutter. This is shown in the GDL representation in Figure 2.12 on lines 23 to 32⁶. This problem is mentioned as a limitation of all current approaches in GGP-II [87]. The limitation is due to the inability of agents to automatically generate a common language among themselves. The study of common language generation in computer science can roughly be divided into two categories based on the environments that are considered: embodied or simulated.

6This is an example in which GDL is more general than extensive form


 1  (role cutter)
 2  (role teller)
 3  (role random)
 4  (init (round 0))
 5  (colour red) (colour blue)
 6  (<= (legal random (arm ?c))
 7      (colour ?c) (true (round 0)))
 8  (<= (legal random noop)
 9      (not (true (round 0))))
10  (<= (legal teller noop)
11      (true (round 0)))
12  (<= (legal cutter noop)
13      (true (round 0)))
14  (<= (legal teller tellA)
15      (true (round 1)))
16  (<= (legal teller tellB)
17      (true (round 1)))
18  (<= (legal cutter (cut blue))
19      (true (round 2)))
20  (<= (legal cutter (cut red))
21      (true (round 2)))
22  ...
23  (<= (sees teller red)
24      (does random (arm red)))
25  (<= (sees teller blue)
26      (does random (arm blue)))
27  (<= (sees cutter a)
28      (does teller tellA))
29  (<= (sees cutter b)
30      (does teller tellB))
31  (<= (next (round 1))
32      (true (round 0)))
33  ...
34  (<= (next (armed ?c))
35      (does random (arm ?c)))
36  (<= (next (armed ?c))
37      (true (armed ?c)))
38
39  (<= disarmed
40      (does cutter (cut ?c))
41      (true (armed ?c)))
42  (<= exploded
43      (does cutter (cut ?c))
44      (not (true (armed ?c))))
45  (<= terminal
46      (true (round 3)))
47  (<= (goal ?role 100)
48      (disarmed)
49      (distinct ?role random))
50  (<= (goal ?role 0)
51      (exploded)
52      (distinct ?role random))
53  (<= (goal random 0))

Figure 2.12: GDL-II description of the Cooperative Spies game.

The embodied systems mainly focus on language games. There have been three main variants of language games: the object naming game, the colour categorising and naming game, and the lexicon spatial language game [96, 95, 97, 3, 100, 93]. In simulated environments, agents do not need any interaction with the real world or image recognition, so they can focus more on extending the communication to a population of agents [40]. This extension allows the simulation to test how a common language equilibrium is affected when a new agent enters the environment [94].

Whether they used simulated or embodied environments, prior research methods were all limited to the design and evaluation of one specific problem. Recently, Reinforcement Learning (RL) has been suggested as a relatively general approach to generate a common language [28, 60]. RL techniques consist of centralised learning and decentralised execution. They also assume there exists a communication channel for sending messages with no effect on the world. These assumptions limit the generality of the algorithms. Firstly, centralised learning means agents who will

cooperate need to come together and train with each other. This reduces the applicability of these algorithms to games in which allies are always allies and enemies are always enemies. Secondly, the more complicated scenarios in which signalling might come at a cost for agents cannot be solved by these techniques; in such scenarios, agents need to weigh the benefit against the cost of signalling. Moreover, the RL techniques always need centralised learning even though some problems can be solved without requiring it.

Multi-Agent Path Finding Problem

Many problems require multiple agents to relocate to different destinations without any collision. Real-world examples include automated vehicles at intersections [23], office and warehouse robots [117, 109] or video games [55] in which agents must move collision-free to different destinations. This is known as Multi-Agent Path Finding (MAPF). Many previous works [55, 42, 47, 90] assumed that agents plan centrally and that each agent knows the destinations of the other agents. These two assumptions, however, do not always hold, e.g. there is no centralised planning for cars at an intersection or for interactions between robots and humans. This setting is referred to as MAPF under Destination Uncertainty (MAPF/DU) [26]. The only known general solution to MAPF/DU problems has PSPACE-complete complexity [4].

2.3.4 GGP-II Past Players

Since the introduction of General Game Playing with Incomplete Information, different techniques have been introduced. Unlike the previous works, approaches in GGP-II are meant to be applicable to a wide variety of both cooperative and competitive games. These approaches and their limitations are described below.

Determinisation

The determinisation technique is one of the earliest approaches in GGP-II. As explained earlier, in the determinisation technique the algorithm chooses all or some states

from its information set. Then it plays each state as if it were a complete information game. In the end, it plays the best move on average over all states. Some examples of successful determinisation algorithms in the field of GGP-II are the Hyper Player [86], NEXUSBAUM [25], the TIIGER player [70] and the Shodan player [12]. Each of these algorithms has modified and extended the determinisation approach in order to make it more general and efficient. The main advantages of the determinisation technique are its speed and its simplicity of implementation. However, the technique has two problems, non-locality and strategy-fusion [31, 87], as previously described in Section 2.3.1.
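A minimal sketch of this idea (not of any specific player listed above) on the Cutting Wire example: for each sampled world, Teller's moves are scored as if the game had complete information and the scores are averaged. Because waiting looks best in every individual world, determinisation picks wait, which is exactly the strategy-fusion error of Section 2.3.1. The payoff numbers come from Figure 2.4; the function names are illustrative.

# Determinisation sketch on the Cutting Wire example (values from Figure 2.4).
# In each world (armed wire known), score Teller's moves as if the game were
# a complete-information game, then average the scores over the worlds.

WORLDS = ["red", "blue"]

def complete_information_value(world, teller_move):
    # With complete information, Cutter is assumed to always cut correctly,
    # regardless of which world we are in.
    return 90 if teller_move == "tell" else 100

def determinisation_choice(sampled_worlds):
    scores = {}
    for move in ("wait", "tell"):
        scores[move] = sum(complete_information_value(w, move)
                           for w in sampled_worlds) / len(sampled_worlds)
    return max(scores, key=scores.get), scores

print(determinisation_choice(WORLDS))
# ('wait', {'wait': 100.0, 'tell': 90.0}); the true expected values in the
# incomplete-information game are 50 for wait and 90 for tell.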

Norns Algorithm

Norns is a GGP-II single-player algorithm which values information [34]. It uses Action-Observation Trees to simulate a game and to determine the value of information. Each simulation starts from the starting state of the game, and during this process it generates states redundantly. The high resource consumption and the applicability to only single-player games are the two main limitations of this algorithm.

Shodan Player

The Shodan player uses game-theoretic algorithms to play GGP-II [12]. It consists of two reasoners: the state machine and the propositional network. The state machine reasoner is the common reasoner used by almost all current GGP-II players, but it is slower than the propositional network; the propositional network reasoner, however, only works on a limited category of games. The Shodan player uses MCTS and the EXP3 bandit selection technique. EXP3 bandit selection on average gives better results for simultaneous moves. The game-theoretic algorithm is similar to determinisation algorithms but outperforms them in games with simultaneous moves. However, it cannot model the opponent or value information in a game. As a result, it has both the non-locality and the strategy-fusion problem.


Hyper Player with Incomplete Information

The original Hyper Player uses the determinisation technique to choose the best moves, so it does not value information in games [86]. Hyper Player with Incomplete Information (HP-II) is the extended version of the original HP which uses nested players to choose the best move: it wraps a Hyper Player inside another Hyper Player [87]. The algorithm works as follows: firstly, it randomly chooses different states in the current information set. Afterwards, it generates some children of the chosen states. Then, it runs Hyper Player from the generated children. The Hyper Player uses determinisation, and determinisation uses MCTS. I discovered that HP-II values information in the game, but only one step ahead. In other words, if there exist two moves which provide the same beneficial information but are located in two different parts of the game, the HP-II algorithm will always choose the move which is closest, no matter the possible cost of each move. This will be discussed further in Chapter 3. Following is the move selection policy of HP-II⁷:

argmax_{m ∈ M(s)} [ Σ_{s′ ∈ I(s)} eval(replay(s0, I_{r∈R}(do(s′, m)), π_hp), π_hp, n) ]    (2.4)

where eval(s, π, n) is defined as:

eval(s, π, n) = (1/n) Σ_{i=1}^{n} µ(s) × v(play(s, π))    (2.5)

where µ(s) is the probability of the state being the true state and v(play(s, π)) is the value returned by playing policy π from state s. It was stated that HP-II is able to correctly value information in games and avoid the strategy-fusion problem [87]. The Cutting Wire example of Figure 2.4 was used to demonstrate this technique, and I will use the same example here. Imagine Nature chooses the red wire to arm the bomb. Then, Teller has two

7I modified the algorithm to suit sequential games.


moves to choose from, tell and wait: M(sr) = {tell, wait}. At this stage s′ is only sr, since I(sr) = {sr}. The replay(s0, s, search_policy) function starts searching from the starting state s0 to find all the states that are in the same information set as s; in theory, for this example, it returns I(s). For s′ = sr and m = wait we get I(do(sr, wait)) = {srw, sbw}. HP-II then uses the HP policy π_hp to evaluate the move wait. There are four states, each with equal probability, so the returned value is 0.25 × (100 + 0 + 100 + 0) = 50. If we instead evaluate the move tell, the returned value is 0.25 × (90 + 90 + 90 + 90) = 90. So, unlike determinisation techniques, HP-II always chooses to tell in the Cutting Wire example.
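The two values above can be reproduced with a short calculation that mirrors equation (2.5) with µ(s) = 0.25 for the four sampled states; the payoffs come from Figure 2.4 and the helper names are illustrative.

# Reproducing HP-II's evaluation of Teller's moves in the Cutting Wire game.
# After `wait`, Cutter cannot distinguish the worlds, so both cuts are sampled
# in both worlds; after `tell`, Cutter always cuts the armed wire (cost 10).

def outcomes_after(move):
    if move == "wait":
        return [100, 0, 100, 0]          # correct/incorrect cut in each world
    return [90, 90, 90, 90]              # told the colour: always correct, minus 10

def eval_move(move):
    values = outcomes_after(move)
    return sum(values) / len(values)     # eq. (2.5) with mu(s) = 1/4

print({m: eval_move(m) for m in ("wait", "tell")})   # {'wait': 50.0, 'tell': 90.0}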

To this date, HP-II is the most successful GGP-II algorithm, but it has three major gaps. The first gap is its inability to come up with a new way of communicating when one is not explicitly defined. This gap was introduced and discussed in the HP-II paper [87] itself, where it is also claimed that all previous algorithms suffer from it. The second gap is its inability to produce mixed strategies. A mixed strategy plays an important role in modelling opponents and even fooling them. This gap can easily be seen from HP-II's move selection policy in equation 2.4: the value returned by argmax_{m∈M(s)} is m, which is a single move and not a mixed strategy. Here is an example to clarify it: if the expected optimal solution in a game is to play move a 60% of the time and move b 40% of the time, HP-II will always choose to play a in every single match. Playing a mixed strategy is essential in some games, such as the well-known "Biased Penalty Kick", which I discuss in detail in Section 3.2.3 of this thesis. The third gap is being short-sighted in valuing information. The main upgrade of HP-II compared to its predecessors is its power of valuing information in the game. However, it mistakenly collects some information as soon as it can, even if collecting the same information later in the game would cost less. Because of this gap, HP-II plays non-optimal strategies in the two games of "Banker and Thief" and "Battleships in the Fog". These two games are mentioned in the HP-II paper itself [87], but their described "optimal" strategies are, in fact, not optimal. In Section 3.2.4 of this thesis, I describe the "Banker and Thief" game and theoretically show what its optimal strategy is. I do the same for

the "Battleships in the Fog" game in Section 4.3.1 of this thesis. I use HP-II's performance as a benchmark in my thesis because, to this date, HP-II is the most successful GGP-II algorithm.

2.4 Summary

In this chapter I have introduced four ways to represent games: normal form, extensive form, world model and Game Description Language (GDL). Then, I introduced two main issues with current approaches for incomplete information games, strategy-fusion and non-locality. Later I introduced current approaches for games in extensive form and Game Description Language with Incomplete Information.

Chapter 3

Iterative Tree Search

In this chapter, I introduce the Iterative Tree Search (ITS) algorithm to overcome the limitations of previous approaches in GGP-II by adapting the classic idea of fictitious play. The two limitations addressed in this chapter are not valuing information throughout the match and not generating mixed strategies. Both limitations stop the previous approaches from correctly modelling the opponents. First, I introduce the new Partially Hidden Extended Cutting Wire example (PHECW). The game was inspired by the Cutting Wire game, which was originally published to motivate the HP-II technique [87] and which I described in Section 2.3.1 of this thesis. Then I introduce the novel Iterative Tree Search algorithm and use the PHECW example to demonstrate it. Later, I show both theoretically and experimentally that ITS provides an improvement over existing solutions on several classes of games that have been discussed in the literature. I finish this chapter by describing the main limitation of ITS. Publications: This chapter recapitulates and expands on the following previously published work.

• [16] Armin Chitizadeh and Michael Thielscher, "Iterative tree search in general game playing with incomplete information", In: Tristan Cazenave, Abdallah Saffidine, Nathan Sturtevant (Editors), CGW 2018: Computer Games - 7th Workshop, Held in Conjunction with the 27th International Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence


2018, Stockholm, Sweden, July 13, 2018, Revised Selected Papers, pp. 98–115.

Partially Hidden Extended Cutting Wire Example

The main difference between the Cutting Wire game and the Partially Hidden Extended Cutting Wire game is that in the latter Teller can share the colour at two different steps of the game. Figure 3.1 shows the Partially Hidden Extended Cutting Wire game tree. The roles are R = {random, Cutter, Teller}. At first, random arms a bomb, and only Teller sees which of the two wires is used for this purpose. For the next two moves, Teller can decide either to tell which wire was used or to wait. Telling first costs 20 points and telling later costs 10 points for both players. In the end, Cutter has to decide which wire to cut. Cutting the correct wire gives both players 100 points (minus the aforementioned costs); otherwise, they get 0 points. I introduced this game for the purpose of testing GGP-II approaches in valuing information at different steps of a game and in choosing the best Nash equilibrium in the game.

Figure 3.1: The Partially Hidden Extended Cutting Wire game tree in the extensive form representation.

This cooperative variant of the Cutting Wire example has an interesting feature. There are three unique Nash equilibrium strategy profiles in this game: two return the value of 95 while one returns the value of 80. The two Nash equilibrium strategy profiles which return the maximum rewards for both players are not the optimal strategies in a one-off match. A strategy profile is a Nash equilibrium if players would gain less by changing their strategy under the assumption that the others keep their strategies [82]. The crux is the assumption of knowing what strategies the others are holding.

[Figure 3.2 consists of two game trees of the Partially Hidden Extended Cutting Wire game with the equilibrium strategies marked by thick lines: (a) the Nash equilibrium of waiting when the blue wire arms the bomb; (b) the Nash equilibrium of waiting when the red wire arms the bomb.]

Figure 3.2: The game tree of the Partially Hidden Extended Cutting Wire game with two Nash equilibria marked. Both of these equilibria return the average reward of 95 to the players.

The two Nash equilibria which return the highest rewards are shown in Figure 3.2. The pure strategies are shown with thick coloured lines, with different colours for different players. The idea is to tell the colour of the wire in only one case and to wait in the other case. This way, Cutter knows which wire to cut when he is not told the colour. Here is a quick explanation of why these strategy profiles are Nash equilibria. To check whether a strategy profile is a Nash equilibrium, we need to check, for each player, whether the player benefits from changing strategy while the others hold their strategies. If no player benefits from changing strategy while the others hold theirs, then the strategy profile is a Nash equilibrium. Consider the strategy profile in Figure 3.2a. On the left section of the graph, Teller's other strategy options are tell1 → wait2¹, tell1 → tell2 or wait1 → wait2. If Teller chooses tell1 → wait2, he gets 80, and if he chooses tell1 →

¹I use the "→" sign to denote a strategy of a player. If the player only chooses a move at one step of the game, I simply use that move as the strategy.

tell2, he gets 70 instead of 90. If he chooses wait1 → wait2 then he gets 0, because Cutter has decided to choose cut_b. Remember, to check whether a strategy profile is a Nash equilibrium, we change the strategy of only one agent at a time.

On the right side of the graph, the other strategies for Teller are tell1 → tell2, tell1 → wait2 or wait1 → tell2. All of these strategies return a reward lower than what Teller already receives, which is 100.

Now consider Cutter's strategy options. For most states, the optimal strategy is apparent: cut the correct wire if you are certain of your current state. The exceptions are srww and sbww. These two states are in the same information set, which means Cutter has to choose the same move in both, since he cannot distinguish which state he is placed in. Considering the chosen strategies of Teller in Figure 3.2a, the game never reaches srww, so Cutter should choose cut_b in both states.

These two Nash equilibria return 95, the highest reward among all Nash equilibria in this game, but they require a strong assumption. This assumption cannot be guaranteed within the GGP framework, which deals with one-off games, and the same holds for several real-world problems.

As already mentioned, a strategy profile is a Nash equilibrium if no player can benefit by switching strategies, given that every other player sticks with the same strategy. The assumption is that players know each other's chosen strategy. In this example, Cutter must know the strategy of Teller: he needs to know whether Teller decided not to tell when the random player used the blue wire or when it used the red wire to arm the bomb.

The crux is similar to the one in the Battle of Sexes problem [19], shown in Table 3.1. There are two players in this problem, Man and Woman. Man enjoys watching football while Woman enjoys going to the opera more, but this couple hates being separated. This means Man still prefers watching the opera with Woman over being alone, and Woman still prefers watching football with Man over being alone. In Table 3.1, rows represent strategies of Man and columns represent strategies of Woman. The values in each cell are the utilities of the two players if they choose the row's and the column's strategies. The left number in each cell is the

utility of Man and the right number is the utility of Woman.

              Woman
              football   opera
Man football  (2, 1)     (0, 0)
    opera     (0, 0)     (1, 2)

Table 3.1: The matrix for the Battle of Sexes problem.

This game has two pure Nash equilibria and one mixed Nash equilibrium. One pure Nash equilibrium is when both go to the football and the other is when both go to the opera. This returns a utility of 2 to the person who enjoys the event more and 1 to the other. The mixed Nash equilibrium is when each player goes to his or her preferred event with probability 2/3. This mixed Nash equilibrium returns an average reward of 2/3, which is less for a player than even sacrificing and choosing the pure Nash equilibrium with the undesirable activity [19]. The Partially Hidden Extended Cutting Wire example has a similar issue. Without any pre-game agreement or any form of communication, agents fail to coordinate on a Nash equilibrium which returns the highest utility for the players in a cooperative game.
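The 2/3 figure can be verified with a few lines, using the payoffs of Table 3.1 and the equilibrium probabilities of 2/3 for each player's preferred event:

# Expected utility of Man under the mixed Nash equilibrium of the Battle of Sexes.
# Man plays football with probability 2/3, Woman plays opera with probability 2/3.
U_MAN = {("football", "football"): 2, ("opera", "opera"): 1,
         ("football", "opera"): 0, ("opera", "football"): 0}
p_man = {"football": 2/3, "opera": 1/3}
p_woman = {"football": 1/3, "opera": 2/3}

value = sum(p_man[a] * p_woman[b] * U_MAN[(a, b)] for a in p_man for b in p_woman)
print(value)   # 0.666..., i.e. the average reward of 2/3 mentioned above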

Optimal Strategy in Partially Hidden Extended Cutting Wire  The Partially Hidden Extended Cutting Wire game has a mixed Nash equilibrium which, unlike in the Battle of Sexes problem, returns more than the mixed strategy profile obtained by randomising over the two pure Nash equilibria. To compare the average returned values, let me first show the average returned value of that randomised profile, in which each player randomly picks one of the two pure equilibria and plays its strategy in the one-off game of PHECW. On one hand, with probability 0.5 the armed wire matches the case in which Teller's chosen equilibrium tells, so Cutter learns which wire is used and cuts the correct wire; as a result, they receive 90 points. On the other hand, with probability 0.5 Teller waits and Cutter does not know which wire is used to arm the bomb. In such a scenario, he has to act according to one of the two optimal pure Nash equilibria. With probability 0.5 he picks the same equilibrium as Teller and they receive 100; otherwise they receive 0. This means that, on average, the reward for randomly choosing one of the optimal pure Nash equilibria is:


(1/2) × 90 + (1/2) × (1/2) × 100 + (1/2) × (1/2) × 0 = 70

Figure 3.3: The mixed Nash equilibrium with average reward of 90 in the Extended Cutting Wire game.

Unlike the Battle of Sexes, the Partially Hidden Extended Cutting Wire game has a mixed Nash equilibrium with a higher average reward in a one-off game. This mixed Nash equilibrium is shown in Figure 3.3. In this Nash equilibrium, Teller always chooses the wait1 → tell2 strategy while

Cutter chooses the cut_r move at srwt and cut_b at sbwt. Technically, for this to be a Nash equilibrium, Cutter needs to mix between cut_b and cut_r at srww and sbww, for example with probability 0.5 each. Any probability higher than 0.9 for either move at those two states makes the strategy profile unstable, given the current utilities of the terminal states in this game. To understand the reason, consider a strategy profile similar to the one described but with one difference: Cutter chooses cut_r with probability 0.91 and cut_b with probability 0.09. Here is why this strategy profile is not a Nash equilibrium.

With this strategy of Cutter, if Teller guides the game to state srww, the expected reward for both players is 0.91 × 100 = 91, which is higher than the 90 they receive at state srwt. Hence, under this strategy profile of Cutter, it is more beneficial for Teller to change his strategy in the arm_red branch from choosing the tell2 move to choosing the wait2 move.


This game shows that choosing an arbitrary Nash equilibrium strategy is not always enough to play optimally, even in a cooperative game and even when a Nash equilibrium returning the highest reward exists. This actually illustrates an advantage of the ITS algorithm over the CounterFactual Regret minimization algorithm: CFR is proven to play a Nash equilibrium in some games [7], but in this game the equilibrium it plays is one of the non-optimal Nash equilibria. A more detailed explanation is provided in Appendix A. In the next section, I use the PHECW game to explain the ITS algorithm.

3.1 Iterative Tree Search

In this section, I introduce the novel Iterative Tree Search algorithm (ITS) for GGP- II. ITS is an offline search, meaning it finds the best strategy before the game begins, and then during the game, it plays based on the pre-calculated mixed move policy. We first need to discover the game’s information sets and initialize variables for all states and move probabilities; then, we need to update these variables iteratively, all within the given time limit before the start of the game. Algorithm 1 is the pseudocode of ITS. I will use the Partially Hidden Extended Cutting Wire as an example to demonstrate the ITS algorithm by showing how different variables change during each iteration. I describe the algorithm in two sections: initialising the tree and updating the tree. Initialising the tree refers to discovering the information sets and assigning the initial probabilities to each move in the game. Updating the tree starts by setting variables for each state according to the move probabilities and choosing the optimal move in each state. The final step in updating the tree is to update the move probabilities according to the chosen moves. Updating the tree loops until the allowed pre-game time is over.

3.1.1 Initialising the Tree

Initialising the tree consists of discovering the information sets and initialising the move probabilities.


Algorithm 1 Iterative Tree Search
 1: procedure ITS
 2:     ▷ All the sets, including S, R, M, etc., are assumed to have been computed from the GDL description as defined earlier.
 3:     InitialiseValues
 4:     while within_time_constraint_pre-game do
 5:         UpdateStateProbabilities
 6:         UpdateMovesProbabilites
 7:     end while
 8:     return m ← PlayOptimalStrategy(selfM, receivedP)    ▷ repeated each round until game_ends_in_a_terminal_state
 9: end procedure
10:
11: procedure InitialiseValues
12:     for all s ∈ S do    ▷ discovering the information sets using eq. (2.1)
13:         I(s) ← {x ∈ S : Σ(s) = Σ(x) ∧ ξr(s) = ξr(x) ∧ r = R(s)}
14:     end for
15:     t ← 1
16:     D ← S \ Z    ▷ D is the set of non-terminal states
17:     for all s ∈ D do
18:         for all m ∈ M(s) do
19:             µ(m) ← 1 / |M(s)|
20:         end for
21:     end for
22: end procedure
23:
24: procedure UpdateStateProbabilities
25:     for all s ∈ D do
26:         ρFactor(s) ← null
27:         ρ(s) ← null
28:         u(s) ← null
29:         chosenMove(s) ← null
30:     end for
31:     ρFactor(s0) ← 1    ▷ s0 is the initial state
32:     for all s ∈ S and m ∈ M(s) do
33:         ρFactor(do(s, m)) ← ρFactor(s) × µ(m)
34:     end for
35:     for all s ∈ D do
36:         ρ(s) ← ρFactor(s) / Σ_{s′ ∈ I(s)} ρFactor(s′)
37:     end for
38:     for all s ∈ D do    ▷ for terminal states, u is already set
39:         u(s) ← Σ_{m ∈ M(s)} u(do(s, m)) × µ(m)
40:     end for
41:     for all s ∈ D do
42:         chosenMove(s) ← argmax_{m ∈ M(s)} [ Σ_{s′ ∈ I(s)} ρ(s′) × u_{R(s)}(do(s′, m)) ]
43:     end for
44: end procedure

45:
46: procedure UpdateMovesProbabilites
47:     for all s ∈ D do
48:         for all m ∈ chosenMove(s) do
49:             µ(m) ← (µ(m) × t + 1) / (t + |chosenMove(s)|)
50:         end for
51:         for all m ∈ M(s) with m ∉ chosenMove(s) do
52:             µ(m) ← (µ(m) × t) / (t + 1)
53:         end for
54:     end for
55: end procedure
56:
57: procedure PlayOptimalStrategy(selfM, receivedP)
58:     for all s ∈ D do
59:         if selfM = ξR(s)(s) and receivedP = Σ(s) then
60:             currentState ← s
61:             break
62:         end if
63:     end for
64:     return rand(M(currentState), according_to(µ))
65: end procedure

Unlike in the extensive form game representation, in the GGP framework information sets are not explicitly described in the provided game rules. Figure 3.4 shows the information set and the initial move probabilities in the PHECW game. In Algorithm 1, the InitialiseValues procedure (line 3) discovers the information sets and assigns the initial value of each move probability. This procedure only needs to be called once per match. It discovers all the information sets in the game (lines 12-14). According to equation 2.1 from Chapter 2, states are in a common information set if they have the same perception sequence, Σ(s) = Σ(s′), and the same self-played move sequence, ξr(s) = ξr(s′). There is only one non-singleton information set in the PHECW game. This information set contains two states: srww and sbww. The past self-played move sequence in these states is the empty set, ξCutter(srww) = ξCutter(sbww) = {}, and their sequences of perceptions are identical, Σ(srww) = Σ(sbww) = {none, wait1, wait2}. All the other states in this game have different past self-played moves and/or perception sequences, which is why they are all in singleton information sets. As an example, srwt is in a singleton information set because its sequence of perceptions

ξCutter(srww) = ξCutter(srww) = {} and their sequence of perceptions are identical to each other, Σ(srww) = Σ(sbww) = {none, wait1, wait2}. All the other states in this game have different past self played moves or/and perception sequence to each other which is the reason why they are all in singleton information sets. As an example, srwt is in a singleton information set because its sequence of perceptions

is {none, wait1, red}, which is unique to this state. In Figure 3.4, the dashed line connecting states represents the non-singleton information set in the game. After discovering the information sets, ITS initialises the counter t (line 15) and gets the set of non-terminal states D (line 16). The final step of initialisation is to assign equal probability to all the moves in each state (lines 17-21). The move probability function, µ(m), can be defined as follows:

Definition 3. Let G be a GDL-II game, then:

• µ : M → [0, 1] is the probability function which, given a move, returns the probability of the move to be chosen by the player.

In Chapter 2, equation 2.2, I described the mixed policy notation. A mixed policy can also be described using move probabilities. For player r, a mixed policy πr can be defined with the move probabilities µ as follows:

πr = {µ(m)|r = R(s) ∧ m ∈ M(s)} (3.1)

Initially, ITS assigns equal probabilities to all the moves in a state (line 19). Figure 3.4 shows each move probability next to its move.

3.1.2 Iteration

After the InitialiseValues procedure, the ITS algorithm starts iterating (lines 4-7) until the pre-game calculation time ends. The first procedure in the loop is UpdateStateProbabilities (line 5). The UpdateStateProbabilities procedure updates the state probability ρ(s), the state probability factor ρFactor(s), the state utility u(s) and the chosen move chosenMove(s) for all the states in each iteration.

Definition 4. Let G be a GDL-II game as defined, then:

• ρ : S → [0, 1] is the probability function which for a state returns the probability of the state to be the true state in its information set.

ρ can be calculated with the help of the probability factor, ρFactor (lines 31-34).


Figure 3.4: The Extended Cutting Wire problem with variables' values during the first iteration.


ρ(s) = ρFactor(s) / Σ_{s′ ∈ I(s)} ρFactor(s′)    (3.2)

And the probability factor of each state can be calculated recursively as below:

ρFactor(do(s, m)) = ρFactor(s) × µ(m)    (3.3)

where m ∈ M(s) and ρFactor(s0) = 1 (lines 31-34). The order in which ITS assigns the ρFactor values starts from the initial state, s0, and moves toward the terminal states. In Figure 3.5, the numbers on top of the squares next to each state show the order in which the ρFactor values are assigned. According to equation 3.3, the probability factor of a state can be calculated recursively by multiplying the probability factor of the parent with the move probability from the parent to the state. Consider state srww as an example. The probability factor of state srww is ρFactor(s0) × µ(arm_red) × µ(wait1) × µ(wait2), which, given ρFactor(s0) = 1 and µ(arm_red) = µ(wait1) = µ(wait2) = 0.5, is 1 × 0.5 × 0.5 × 0.5 = 0.125. Other probability factors can be calculated similarly. After all probability factors are set, the state probabilities are set according to equation 3.2. For the next step, the utilities of the non-terminal states are set (lines 38-40). Unlike ρFactor, utilities are set starting from the parents of the terminal states toward the starting state s0. This is the reverse of the order shown in Figure 3.5 for the probability factors: the assignment starts from state 15 and moves toward state 1. The utility of each state u(s) can be calculated from the utilities of the successor states and the probabilities of the moves (lines 38-40). The recursion is as follows:

X u(s) = [u(do(s, m)) ∗ µ(m)] (3.4) m∈M(s) Terminal utilities are the base case of this recursion. In Figure 3.5 utility of each state is shown in the third row of each square next to the states. First, the state 15, sbtt, will has it its utility calculated. For move cut_b is has a probability of 0.5 and

50 3.1 Iterative Tree Search it leads to a state with a utility of 70. For move cut_r it has probability of 0.5 and it leads to a state with a utility of 0. So the utility of state sbtt can be calculated as: 0.5 ∗ 70 + 0.5 ∗ 0 = 35. The final step in the UpdateStatesProbabilties procedure is to calculate the value of each move in its information set and then set the move or moves with the highest utility as the chosen one (lines 41-43). To calculate the utility of a move we consider all the states in the information set.
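To make the UpdateStateProbabilities steps concrete, the following is a minimal Python sketch (not the thesis implementation) of equations 3.2-3.4. The State class and its fields (children, mu, info_set, utility) are illustrative assumptions standing in for the GDL-II game model, and utilities are treated as a single scalar for brevity.

class State:
    def __init__(self, terminal_utility=None):
        self.children = {}       # move -> successor State
        self.mu = {}             # move -> move probability mu(m)
        self.info_set = [self]   # states indistinguishable from this one
        self.utility = terminal_utility
        self.rho_factor = 0.0
        self.rho = 0.0

def update_probability_factors(state, factor=1.0):
    # Equation 3.3: propagate rhoFactor from the initial state toward the terminals.
    state.rho_factor = factor
    for move, child in state.children.items():
        update_probability_factors(child, factor * state.mu[move])

def update_state_probabilities(all_states):
    # Equation 3.2: normalise rhoFactor within each information set.
    for s in all_states:
        total = sum(sn.rho_factor for sn in s.info_set)
        s.rho = s.rho_factor / total if total > 0 else 0.0

def update_utilities(state):
    # Equation 3.4: back up expected utilities from the terminal states.
    if not state.children:
        return state.utility
    state.utility = sum(state.mu[m] * update_utilities(c)
                        for m, c in state.children.items())
    return state.utility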

Definition 5. Let G be a GDL-II game as defined, then:

• chosenMove : S → 2^M is the move selection function which chooses the moves with the highest reward for the player in an information set as:

chosenMove(s) = argmax_m Σ_{s′∈I(s)} [ρ(s′) ∗ u_r(do(s′, m))]    (3.5)

where r is the role who is about to play at the given state s, r = R(s). Note that chosenMove(s) returns a set of moves if there is more than one move with the highest value. Figure 3.5 shows all the variables for states and moves up to this step in the first iteration. The move probabilities, µ, are shown next to each move. The probability factor, probability and utility of each state are shown in a square next to the state, and the order in which the probability factors are set is shown on top of each square. The chosen move for each state is shown by a thick coloured line; different colours represent different players.
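Continuing the same sketch, the move selection of Definition 5 (equation 3.5) can be written as follows. It assumes that every state in an information set offers the same legal moves and that utility already holds the utility of the role to move, r = R(s).

def chosen_moves(state):
    # Equation 3.5: value each move over the whole information set I(state)
    # and return the set of moves with the highest expected reward.
    values = {}
    for move in state.children:
        values[move] = sum(s.rho * s.children[move].utility for s in state.info_set)
    best = max(values.values())
    return {m for m, v in values.items() if v == best}   # may contain several moves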

For state srw the chosen move is wait1, and for s0 the chosen moves are both arm_red and arm_blue, since the random player is indifferent between its moves: the utility of the random player is always 0 at all terminal states by default. A few interesting points should be mentioned. First, as mentioned earlier, a state can have more than one chosen move. In Figure 3.5, states s0, srww and sbww all have two moves as their chosenMove. The random player always has utility 0 in GGP-II, which means it is always indifferent about which move to choose, so its chosenMove set contains all the available moves.

Figure 3.5: The Extended Cutting Wire problem with variables' values during the first iteration.

States srww and sbww are in the same information set, which means both have an identical chosenMove set. To see why they have two moves as their chosenMove in this iteration, let us evaluate the value of each move.

For state srww we have two moves: cut_r and cut_b. According to Definition 5, the value of cut_r can be calculated as follows:

ρ(srww) ∗ uCutter(srwwr) + ρ(sbww) ∗ uCutter(sbwwr)

where, according to the values in Figure 3.5, ρ(srww) = 0.5, uCutter(srwwr) = 100, ρ(sbww) = 0.5 and uCutter(sbwwr) = 0. This means the value of move cut_r at state srww is 50. Similarly, we can calculate the value of move cut_b at state srww as follows:

ρ(srww) ∗ uCutter(srwwb) + ρ(sbww) ∗ uCutter(sbwwb)

where, according to the values in Figure 3.5, ρ(srww) = 0.5, uCutter(srwwb) = 0, ρ(sbww) = 0.5 and uCutter(sbwwb) = 100. This means the value of move cut_b at state srww is also 50. Since both values are equal, both moves are chosen as the chosenMove. The calculation is identical for sbww because, to calculate the value of a move in a state, we consider all the states in its information set. The last step of each iteration is to update the move probabilities accordingly.

After UpdateStateProbabilities, the ITS algorithm calls the UpdateMoveProbabilities procedure (line 6). The UpdateMoveProbabilities procedure (lines 46-55) updates the move probabilities µ(m) according to the chosen moves. Let m be a chosen move in state s; then the move probability function µ is updated as follows:

µ_{t+1}(m) = ((µ_t(m) ∗ t) + 1) / (t + |chosenMove(s)|)    (3.6)

µ_{t+1}(m′) = (µ_t(m′) ∗ t) / (t + |chosenMove(s)|)    (3.7)

where t is the iteration round, |chosenMove(s)| is the number of chosen moves, m ∈ chosenMove(s) and m′ ∉ chosenMove(s). These updates are similar to the move updates of the fictitious play technique (section 2.3.2); a more detailed comparison is provided later in this chapter, in section 3.2.3.

Now that the move probabilities are updated, the second iteration starts by calling the UpdateStateProbabilities and UpdateMoveProbabilities procedures again. In the ITS algorithm, the only variables that keep incrementing or decrementing are the move probabilities, µ; all other variables, ρ, ρFactor, u and chosenMove, are set according to the µ values of the previous iteration. Figure 3.5 illustrates the values for the second iteration before the final step, which is updating the move probabilities for the third iteration. A few things can be seen from this figure. First, the move probabilities of cut_r and cut_b at states srww and sbww are stable; the reason is that both were chosen moves, so their probabilities do not change. The following is the updating process for these two moves based on equation 3.6:

((0.5 ∗ 1) + 1) / (1 + 2) = 1.5 / 3 = 0.5

Since both sides of the tree change symmetrically, the probability factors of these two states stay the same, and as a result the probabilities ρ stay at 0.5. Another interesting aspect of the second iteration is that the utility of the initial state has increased. The reason for this behaviour is that after each iteration, ITS chooses to cut the correct wire more often. This increases the probabilities of those moves and, as a result, the utilities of all the states above keep increasing. This indicates that in this cooperative game the players shift their strategies toward a better strategy after each iteration. Figure 3.6 shows the game with its values up to the 5th iteration; only the left part of the tree is shown, as it is identical to the right side of the tree in this game.

Figure 3.7 illustrates the move and state values during the 6th iteration of the Extended Cutting Wire example. At this point, the values of states srw and sbw are both 80 and the chosen move for these two states is tell2.



Figure 3.6: The Extended Cutting Wire problem with moves’ probabilities and chosen moves during the third, fourth and fifth iterations.


This means that during the next iteration, the probability of tell2 increases, so the values of srw and sbw will rise above 80 accordingly. During subsequent iterations, the values of these two states keep increasing until they reach 90. On the other hand, the values of states srt and sbt never rise above 80 because their maximum terminal value is 80. For this reason, in all future iterations the wait1 move keeps being chosen and no further change in the structure of the overall strategy occurs. As can be seen, ITS successfully discovered the optimal strategy for the Partially Hidden Extended Cutting Wire game.
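For completeness, the move-probability update of equations 3.6 and 3.7, which drives the convergence just described, can be sketched as follows. Here mu is assumed to map each legal move of a state to its current probability, t is the iteration count and chosen is the set returned by the move selection sketched earlier.

def update_move_probabilities(mu, chosen, t):
    k = len(chosen)
    for move in mu:
        if move in chosen:
            mu[move] = (mu[move] * t + 1) / (t + k)   # equation 3.6
        else:
            mu[move] = (mu[move] * t) / (t + k)       # equation 3.7
    return mu

# At t = 1 with both cut_r and cut_b chosen (k = 2), a probability of 0.5
# stays at (0.5 * 1 + 1) / (1 + 2) = 0.5, matching the worked update above.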

3.2 Analysis

In the following, I characterize the classes of GDL-II games that ITS can solve. I resort to the theory of fictitious play and game theory to show theoretically how ITS finds the optimal strategy in these classes of games. I also demonstrate how HP-II fails in these games and show experimental results to confirm our observations. All the mentioned games have either been previously introduced in the literature or are extensions of games from the literature.

3.2.1 Games with Dominant Pure Strategy and Single Player Games

ITS can correctly solve games in which there exists a dominant pure strategy. As described earlier, having a dominant strategy means that playing one specific move at each information set guarantees the player the highest reward; the actions of the opponents do not affect the player's decision. If a single-player game with incomplete information has an optimal strategy, then this will be a pure dominant strategy, as the random player does not change its strategy.

The ITS algorithm at the first iteration assigns equal probabilities to all moves from the same information set. This guarantees that no µ(m) will ever have zero probability. As the calculation progresses, the player with a dominant strategy tends to play it more often, because the states on the path of a dominant strategy have the highest rewards and the probability of the parent state never changes. As a result, the probability of playing the dominant strategy increases with each iteration, so ρ(s) ∗ u_r(s′) always increases and the dominant move will always be the chosen move. We can use the Fully Hidden Extended Cutting Wire game to illustrate how ITS can correctly play this type of game and also why HP-II fails.


Figure 3.7: Values in the Extended Cutting Wire example during the 6th iteration.


Figure 3.8: The game tree for Fully Hidden Extended Cutting Wire.

Example: Fully Hidden Extended Cutting Wire

The main difference between the Fully Hidden Extended Cutting Wire (FHECW) game and the Partially Hidden Extended Cutting Wire is that in the former the Teller also cannot see the colour of the wire. Telling, for the Teller, is therefore a matter of deciding whether or not to let the Cutter see the colour of the wire. Figure 3.8 shows the game tree for the Fully Hidden Extended Cutting Wire game. HP-II uses nested players to overcome the strategy-fusion problem. However, at each step, it values information only one step ahead. This causes the algorithm to choose tell1 → wait2 rather than wait1 → tell2. Consider the move selection policy for HP-II, ~πhpii, as presented in section 2.3.4:

~πhpii = argmax_{m∈M(s)} Σ_{s′∈I(s)} eval(replay(s0, I_{r∈R}(do(s′, m)), ~πhp), ~πhp, n)    (3.8)

At state sr, ~πhpii chooses the best move based on which of ~πhp(srw) or ~πhp(srt) gives the higher expected reward. The HP-II policy, ~πhp, uses a Monte-Carlo search. Monte-Carlo takes a weighted average of the reachable terminal nodes, so the move wait1 returns (100 + 90)/2 and the move tell1 returns (70 + 80)/2. As a result, HP-II considers tell1 a better move than wait1. We refer to this problem as short-sighted information valuation.

Unlike HP-II, the ITS algorithm can correctly value information anywhere in the game tree. Similar to what was described in Section 3.1, the Teller will tend to always choose the strategy wait1 → tell2 for the left part of the game tree, which is the optimal strategy.

To experimentally validate these claims, I ran ITS on the FHECW game. The graph in Figure 3.9 shows the probability of choosing the tell1 and tell2 moves at three different states in the game during the first 1,000 iterations. If the probability of the telling action is high in a state, then the probability of waiting is low, and vice versa. As can be seen from the graph, the telling probabilities converge quickly in the early iterations. The probability of choosing tell1, the more costly telling move, drops to almost zero in less than 1,000 iterations. In fact, after less than 200 iterations ITS will very likely choose wait1 first and then tell2.

Figure 3.9: Probability of telling at different states during the first 1,000 iterations in the Fully Hidden Extended Cutting Wire game.


3.2.2 Non-Locality Problem

Frank and Basin [31] formalized and analyzed the problem of non-locality. As described in section 2.3.1, non-locality occurs when an algorithm only considers the children of a state to find the best move for that state. The ITS algorithm models the opponent, which can be shown to lessen the impact of this problem.²

Example: Frank Basin Non-Locality Game

To show the ability of ITS algorithm to solve games that exhibit the non-locality problem, the example from [31], as we called it the Frank Basin Non-Locality game, is considered. In this game, as demonstrated in section 2.3.1, the first move by random is only visible to Player2 while both other players’ moves are visible to each other. The random player’s choice places the game in a particular world, and the utilities for the players depend on their moves and the world they are in3, as described in section 2.3.1. This game was originally demonstrated using the world model [31] as shown in Figure 3.10. In the paper, they introduced an algorithm which correctly solves this problem. The moves that it choose for Max are shown by thick lines in Figure 3.15. For the ITS algorithm to process this problem it needs to be transferred into the extensive form. Figure 3.11 shows this game in the extensive form. The trans- formation is as described in Chapter 2 section 2.2 of this thesis. The worlds are represented by parallel sequences of the game starting from the initial moves of the random player. The reason behind the success of ITS on this game is the state probability variable ρ. Figure 3.12 shows the non-locality game in the extensive form with the variables’ values during the first iteration of the ITS algorithm. The information set is rep- resented using the dashed rounded rectangle around states in the same information set. States’ variables are represented in a box next to each state. Due to the limit on the size of the graph, the variables for the final Max’s states are drawn beneath

²Frank and Basin [32] introduced the "vector minimaxing" technique, which also just lessens, rather than completely avoids, the impact of non-locality.
³This game can, of course, be straightforwardly axiomatized in GDL-II as a GGP-II game [107].



Figure 3.10: The Frank Basin Non-Locality game, illustrating the non-locality problem, represented using the world model.

Each square has the name of the state it belongs to in its first row. In Figure 3.12, the probability factor is shown in the second row, the state probability in the third row and the utility in the final row of the square next to each state.

The first step of the ITS algorithm is to initialise the move probabilities by assigning equal values to the moves in a state. The next step is to find the probability factor and the probability of each state accordingly. At the first iteration of this game, all the states in an information set have equal probabilities. The next step is to find the utility of each state in the tree. The utilities are determined solely by the move probabilities and the utilities of the children. As an example, the utility of state sw1d is µ(left) ∗ 1 + µ(right) ∗ 0 = 0.5. The utilities of the rest of the states are calculated in a similar way.

The final step before updating the move probabilities is to decide the optimal move for each state. Moves for each state are chosen according to the probabilities of the states in their information sets and the utilities of their children. As an example, the value of move right at states sw1d, sw2d and sw3d is ρ(sw1d) ∗ u(do(sw1d, right)) + ρ(sw2d) ∗ u(do(sw2d, right)) + ρ(sw3d) ∗ u(do(sw3d, right)) = 0.66, whereas the value of move left is 0.33.


Figure 3.11: The Frank Basin Non-Locality game converted to the extensive form representation.

So for all the states in that information set, including sw1d, the right move will be chosen. It is important to note that when choosing a move, we only consider the utilities of the states, not the currently chosen moves. Also, a state can have more than one chosen move. As an example, for state sw1b both moves are added to the chosen-move set, because both sw1d and sw1e have a utility of 0.5 at this iteration.

Figure 3.13 shows the game tree during the second iteration of the ITS algorithm.

For the second iteration, the probabilities of sw2d and sw3d decrease while the probability of sw1d increases. However, this is not yet enough for the Max player to prefer move left over move right in those states.

Figure 3.14 shows the game tree during the third iteration of the ITS algorithm, which is an important iteration for this game. During this iteration, if we calculate the value of choosing move left we get 0.64 ∗ 1 + 0.18 ∗ 0 + 0.18 ∗ 0 = 0.64, and if we calculate the value of choosing move right we get 0.64 ∗ 0 + 0.18 ∗ 1 + 0.18 ∗ 1 = 0.36. This means the probability of sw1d has increased enough for the Max player to choose move left over right.

As can be seen from Figure 3.14, after the third iteration Min will never choose move d at states sw1b, sw2b and sw3b. This dramatically increases the probability of sw1d being the true state.

Figure 3.12: Variables in the Frank Basin Non-Locality game during the first iteration of ITS.

Figure 3.13: Variables in the Frank Basin Non-Locality game during the second iteration of ITS.

Figure 3.14: Variables in the Frank Basin Non-Locality game during the third iteration of ITS.

Figure 3.15: The optimal strategy for the Max player suggested by the designer of the game.

For this reason, Max will never choose the right move at states sw1d, sw2d and sw3d. For states sw1e, sw2e and sw3e, Max has no reason to choose right, so he always chooses the left move. Regarding Max's top states: according to the strategies chosen at Max's bottom states, he will always get a value greater than 0, which means he will always choose move b over move c.

Figure 3.15 depicts the game tree for this game in a compact form; the thick lines show the suggested optimal strategy for Max [31]. This strategy guarantees that Max always receives 1 in w1. The strategy generated by ITS from the third iteration onward converges toward the suggested optimal strategy.

To experimentally show the correctness of ITS, I ran it on the Frank Basin Non-Locality game. Figure 3.16 shows the mixed strategy of the player at state d during the first 100 iterations. After less than 100 iterations, the implemented ITS player was able to play left at d with a probability of 99%.


Figure 3.16: Mixed strategy at state d in the Frank Basin Non-Locality game during the first 100 iterations.

3.2.3 One-Step Joint-Move Two-Player Constant-Sum Games

For this category of games, I show that the ITS algorithm reduces to fictitious play; since fictitious play is known to solve these games [5], ITS can solve them too. For ITS, I convert this category of games into a two-step sequential game with incomplete information: all moves of Player1 lead to states that are in the same information set. To show that the ITS algorithm works like fictitious play for this class of games, I show that the updating policy and the move selection of fictitious play are similar to those of ITS. I refer to the player who moves first as Player1 and call the opponent Player2. I will show that, for both players, updating the mixed policy π via equation (2.2) is identical to updating the move probabilities µ in ITS, that is, equations (3.6) and (3.7). For the move selection policy of ITS, we need to consider each player separately. With regard to choosing the best move for Player1 in ITS, if we combine equations (3.5) and (3.4), then for chosenMove() we get:

argmax_m Σ_{s∈I} ρ(s) ∗ Σ_{m′∈M(s′)} [u(do(s′, m′)) ∗ µ(m′)]    (3.9)


Here, s is the initial state s0 and is in a singleton information set; s′ is the state after the initial state, referred to as sm in what follows. ρ(s0) is always equal to 1. I also substitute do(s′, m′) with smm′, which is a terminal state with a fixed reward. Also, to simplify the notation, I replace M(s′) with M2. For chosenMove() we then obtain the following:

argmax_m Σ_{m′∈M2} [u(smm′) ∗ µ(m′)]    (3.10)

Considering the relation between move probabilities and mixed policies described in equation (3.1), this equation is indeed equal to the best-response equation (2.3) for fictitious play. With regard to choosing the best move for Player2 in ITS, the action of Player1 changes the state probability ρ(s). Since Player2's decision states are all in the same information set and are all generated from the initial state, ρFactor(sm) = µ(m). So ρ(sm) = µ(m) / Σ_{m∈M1} µ(m), and by definition the denominator is equal to 1. By substituting these into equation (3.5), for chosenMove() we obtain:

argmax_{m′∈M2} Σ_{m∈M1} [µ(m) ∗ u(smm′)]    (3.11)

which is identical to equation (3.10) with m replaced by m′. This completes the proof that ITS reduces to basic fictitious play for this category of games and that, therefore, ITS is able to optimally play any one-step, joint-move, constant-sum game with two players. As an example, I look at Biased Penalty Kick, a common game in game theory, to show the ability of my algorithm to find a Nash equilibrium.

Example: Biased Penalty Kick

This is a well-known game for illustrating opponent modelling and the value of playing a mixed strategy. The game has two players: Kicker and Keeper. If the Kicker shoots_right and the Keeper jumps_right, the Keeper gets 60; otherwise, the Kicker gets 60. If the Kicker shoots_left, the rewards are instead 40.


Figure 3.17: Mixed strategy of the Keeper for the biased penalty kick game for the first 10,000 iterations.

There is no pure Nash equilibrium, and the optimal mixed strategy in a Nash equilibrium for the Kicker is to shoot 40% right and 60% left; for the Keeper, it is to jump 40% left and 60% right. While both HP and HP-II only deliver pure strategies, ITS can solve this problem after just a few iterations, since this game is a one-step joint-move game. To verify this claim, I ran the implemented ITS algorithm on this problem. Figure 3.17 shows the mixed strategy of the Keeper for the first 10,000 iterations. ITS quickly finds the correct probabilities for both players.
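As a small illustration of the reduction to fictitious play, the following Python sketch runs a fictitious-play-style self-play on a payoff matrix for the Biased Penalty Kick. The matrices are an assumed encoding of the rules above (the scoring or catching player receives the stated reward, the other receives 0); under this encoding the empirical frequencies approach the reported 40% right / 60% left mix for the Kicker and 60% right / 40% left for the Keeper.

import numpy as np

# Rows: Kicker shoots (right, left); columns: Keeper jumps (right, left).
kicker_payoff = np.array([[0.0, 60.0],
                          [40.0, 0.0]])
keeper_payoff = np.array([[60.0, 0.0],
                          [0.0, 40.0]])

kicker_counts = np.ones(2)   # empirical move counts, initialised uniformly
keeper_counts = np.ones(2)

for _ in range(100_000):
    kicker_freq = kicker_counts / kicker_counts.sum()
    keeper_freq = keeper_counts / keeper_counts.sum()
    # Each player best-responds to the opponent's empirical mixture,
    # in the spirit of the fictitious-play-style updates (3.6) and (3.7).
    kicker_counts[np.argmax(kicker_payoff @ keeper_freq)] += 1
    keeper_counts[np.argmax(kicker_freq @ keeper_payoff)] += 1

print("Kicker (shoot right, shoot left):", kicker_counts / kicker_counts.sum())
print("Keeper (jump right, jump left): ", keeper_counts / keeper_counts.sum())
# Expected to approach roughly (0.4, 0.6) and (0.6, 0.4) respectively.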

3.2.4 Move Separable Games

In this category of games, each player is only responsible for the moves in one stage of the game. This means that when Player2 begins to move after a series of moves by Player1, the game ends after Player2 has made their moves, and Player1 cannot interrupt Player2's course of actions. If the random player exists in the game, its actions are visible to the player of the corresponding stage of the game. In this class of games, Player2 may or may not be able to see some or all of the actions performed by Player1. I show that these games can be reduced to a game where Player1 chooses a joint-move game for both players to play. HP-II fails to solve games in this class as it can only find a pure strategy, whereas the ITS algorithm can again solve this category of games. I theoretically prove the ability of ITS to play this category of games correctly, while I use an example, Banker and Thief, to show the failure of HP-II on this category of games.

First, consider the simpler case of games where Player2 cannot see Player1's moves. The probability of a state being the true state in the information set of Player2 depends on the sequence of Player1's moves. This sequence can be considered as one single, combined move, whose probability is the same as the frequency with which the corresponding sequence is chosen. The game can therefore be reduced to a one-step, joint-move game that is solvable by the ITS algorithm, as I have demonstrated in the previous section.

Now consider the case where Player2 can see some of the actions of Player1. Player1 can lead Player2 into one of the possible information sets, and the ρ(s) of the states in each information set can be changed by the unobserved moves of Player1. Thus a game of this type can be reduced to a game where Player1 chooses, among all sub-games, the sub-game of a one-step, joint-move game with the highest Nash equilibrium payout for himself, and then plays the sub-game according to the corresponding Nash equilibrium strategy. From the result in the previous section, it follows that ITS can solve each such sub-game and determine the payout of its Nash equilibrium. Choosing the sub-game with the highest Nash equilibrium payout then just requires a simple search. In this way, ITS can solve Move Separable Games. I end my analysis with an example of a game from the literature that falls into this category.

Example: Banker and Thief

This game was used in the HP-II paper [87] to show the ability of HP-II to value the withholding of information. I use the exact same game as a counterexample to disprove that claim: I will show that the "optimal" solution suggested by the HP-II algorithm reveals that this algorithm does not model its opponent properly. There are two banks in this game, one of which has a faulty alarm system. The owner of the banks has to decide how to distribute $100 between the two banks in $10 notes. The thief can see the distribution of the money between the banks but not which of the two is faulty. If the thief robs the faulty bank then he succeeds in getting the money; otherwise, the banker receives all the money left in the faulty bank at the end of the match.

Using HP-II, the banker places $40 in the faulty bank and $60 in the other, implicitly assuming that the thief is greedy and will choose to rob the bank with $60, which means the banker wins. This strategy was considered the winning strategy in the HP-II paper [87]. I claim that this is, in fact, a sub-optimal strategy, as the banker wrongly assumes the thief to be greedy. Indeed, the thief might well assume that the banker assumes he is greedy, and hence decide to go after the $40. The best strategy for this game must, therefore, be a mixed strategy that makes the thief indifferent between the banks; only then is the (mixed) strategy a Nash equilibrium. Different distributions lead to different Nash equilibria with different expected payouts. The highest expected payout for the banker is achieved by a $50-$50 distribution, with an expected payout of $25, while the expected payout for the $40-$60 distribution is just $24. This means HP-II finds a non-optimal strategy and fails to correctly solve this category of games. But, since this is a Move Separable Game, ITS can solve it, in contrast to HP-II.
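The expected payouts quoted above can be checked with a short computation. Under the standard indifference argument (the banker randomises which physical bank is the faulty one so that the thief is indifferent between the two amounts), the banker's expected payout for a split of a and 100 − a dollars works out to a(100 − a)/100, which is maximised by the $50-$50 split. A minimal check:

def banker_equilibrium_payout(a, total=100):
    # The banker mixes which bank is faulty so that the thief is indifferent:
    # a * p = b * (1 - p) with b = total - a, giving an expected payout of a * b / total.
    b = total - a
    return a * b / total

for a in (40, 50):
    print(f"${a}-${100 - a} split -> expected payout ${banker_equilibrium_payout(a)}")
# $40-$60 split -> expected payout $24.0
# $50-$50 split -> expected payout $25.0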

To experimentally confirm the correctness of this claim, I ran the ITS algorithm on this game, but to make it more challenging I added two extra safe banks. The banker can then choose a distribution of his money in $10 chunks among four banks; there is a total of 287 ways to do so. Table 3.2 shows the probabilities of some mixed strategies for the banker in the case when the first bank has been selected as faulty by the random player. The theoretical analysis for this variant of the game shows that the optimal strategy is to put $50 in the faulty bank and $50 in any one of the safe banks. As can be seen from the table, this is what the ITS algorithm does 94% of the time after one million iterations. Figure 3.18 shows how the probabilities for some of the 287 strategies evolve.


Figure 3.18: The probability change toward equilibrium for four strategies in the banker and thief game when the faulty bank is the first one.

Money distribution        50 50 0 0   50 0 50 0   50 0 0 50   0 50 50 0   all others
Probability of choosing   35.67%      27.5%       30.88%      0%          5.95%

Table 3.2: Probability of choosing a money distribution by ITS in the banker and thief game.

3.3 Summary

In this chapter, I have introduced the novel Iterative Tree Search (ITS) algorithm as a significant improvement over state-of-the-art algorithms, in particular HP-II, for general game playing with incomplete information. I have identified two limitations of HP-II in this chapter: one is being short-sighted in valuing information, and the other is its inability to generate mixed strategies. These two limitations prevent HP-II from modelling its opponents properly. I have shown HP-II to be short-sighted by introducing the Fully Hidden Extended Cutting Wire example, and I have shown that the strategy which HP-II claimed to be optimal in its Banker and Thief example [87] is in fact not optimal. This demonstrates its inability to model opponents and to generate mixed strategies to fool the opponent.

While HP-II is short-sighted in valuing information, the ITS algorithm has been shown to correctly value information in a game by gathering information at the lowest possible cost that promises the highest benefit. An ITS-based general game player is also able to withhold information from opponents and to play a Nash equilibrium in several classes of games. Moreover, HP-II is not able to compute mixed strategies, so it fails to find the best strategy in games that require opponent modelling. With ITS, we overcome these limitations by iteratively self-playing the game using an incomplete-information tree and thus learning the expected behaviour of a rational opponent.

With regard to the CFR minimisation technique, by introducing the Partially Hidden Extended Cutting Wire example I have shown that even though CFR techniques always find a Nash equilibrium, they might find a non-optimal one. For the PHECW game, on the other hand, the ITS algorithm finds the optimal Nash equilibrium.

The only limitation of the ITS algorithm is its poor memory efficiency and its resulting failure on large games. ITS needs to store the whole game tree in memory before it can start making a decision. The only game discussed in the HP-II paper that ITS cannot solve in practice is the Battleships in the Fog game: ITS would need to store 1.4 terabytes of data in memory after just 3 rounds of that game. This game is discussed in more detail in the next chapter, where I introduce an extended version of ITS, called Monte Carlo Iterative Tree Search, to overcome this limitation of the ITS algorithm.


Chapter 4

MCITS: An Online Tree Search algorithm for Incomplete Information Games

4.1 Introduction

As discussed in Chapter 2, several approaches have been suggested to solve the games described in GDL-II. Two of the most recent are the Lifted Hyper-Player (HP-II) [87] and the Iterative Tree Search (ITS) algorithm from Chapter 3. HP-II has two limitations: it only generates pure strategies and it is short-sighted. These two limitations stop HP-II from correctly modelling its opponents. ITS has neither of HP-II's limitations; however, as demonstrated in Chapter 3, it can only solve small games in practice. In this chapter, I develop a novel algorithm by embedding Monte Carlo Tree Search (MCTS) [52, 20] into ITS, which I call the Monte Carlo Iterative Tree Search (MCITS) algorithm. Experimental results and analysis show the success of this algorithm: MCITS retains the advantages of ITS while handling both small and large games.

This chapter is organised as follows. I first introduce Monte Carlo Tree Search. The new technique is then formally presented. I theoretically analyse the algorithm and perform experiments with games that have been discussed in the literature. I conclude this chapter with a short summary.

4.1.1 Background: Monte Carlo Tree Search

Monte Carlo Tree Search (MCTS) is an online tree search algorithm [8]. It received attention when a program developed using MCTS showed strong performance in the game of Go [56]. MCTS guides the search toward the more promising segments of the search tree, which gives it an advantage in games with a large game tree and no human knowledge. The main idea of MCTS is to use multiple random simulations to predict the value of states. It is also an anytime algorithm, meaning we can stop the search at any point and it still provides reasonable play. The game tree is initially empty and is constructed through the search. MCTS consists of four steps [8]:

1. Selection: The selection process starts from the initial state in the game tree. The next state is chosen by a tree policy that balances exploitation and exploration. Exploitation tends to choose more promising states, while exploration expands the search and may find hidden gems in the game tree. One of the most common strategies for keeping this balance is the Upper Confidence Bound (UCB) [1]; a generic sketch is given below. The selection process continues until we reach a terminal state or a state with unexpanded children.

2. Expansion: When the selection reaches a node with an unexpanded child, one unexpanded child is added to the game tree.

3. Simulation: After the game tree is expanded, a simulation is played according to some default policy until it reaches a terminal node. The default policy is usually random play. This means that from the newly added state the algorithm plays a series of random moves until it reaches a terminal state of the game.

4. Back-propagation: When a simulation or selection reaches a terminal state, the result is back-propagated from the leaf of the explored tree toward the initial node. This involves updating the statistics in the nodes, such as the visit counts and the win/loss ratio.

After the search time finishes, the player uses a decision policy to choose the best move according to the generated tree. There are two common decision policies: one is to choose the most-selected child, and the other is to choose the child with the highest value [8].
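As a concrete illustration of the tree policy used in the selection step, a generic UCB1 score can be sketched as follows. This is a standard formulation, not necessarily the exact variant used later in this chapter, and the node fields visits, total_value and children are illustrative assumptions.

import math

def ucb1(parent_visits, child, c=math.sqrt(2)):
    # Unvisited children are explored first.
    if child.visits == 0:
        return float("inf")
    exploit = child.total_value / child.visits                       # average reward so far
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)  # exploration bonus
    return exploit + explore

def select_child(node):
    # Tree policy: descend to the child with the highest UCB1 score.
    return max(node.children, key=lambda child: ucb1(node.visits, child))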

4.2 Monte Carlo Iterative Tree Search

I now present the new Monte Carlo Iterative Tree Search (MCITS). It is an online MCTS inspired by the ITS algorithm. By merging MCTS with ITS, the new algorithm values information in games while keeping memory consumption low. Like the original MCTS, it searches the section of the game tree which it finds promising. The advantage of MCITS over ITS is its higher speed and memory efficiency, and its advantage over MCTS is that it values information and models the opponents in incomplete information games.

The original MCTS consists of four main steps, selection, expansion, simulation and back-propagation, with three policies: the tree policy, the default policy and the decision policy. For MCITS the steps are the same, but the Selection step and the tree policy are modified. Algorithm 2¹ describes MCITS. The algorithm starts by receiving the sequence of the player's own past moves, pastM, and the sequence of perceptions received by the player, pastP. The algorithm consists of two phases: the first phase discovers the current information set by finding as many of its states as possible, and the second phase finds the optimal move. In theory, we could use only the second phase together with the states of the information set that have already been generated; however, I found that having the first phase increases the performance of the algorithm.

As just said, the first phase finds the current information set and is defined in the FindCurrentIS procedure (lines 11-24). The procedure starts from the starting state, s0, and randomly simulates until it reaches a state with the same number of past moves as the provided sequence of the player's own past moves.

¹Similar to Algorithm 1, all the sets, including S, R, M, etc., are assumed to have been computed from the GDL description as defined earlier.


Algorithm 2 Monte Carlo Iterative Tree Search
1: procedure MCITS(pastM, pastP)
2:   Γ ← Γ ∪ {s0}
3:   currentInfoSet ← FindCurrentIS(pastM, pastP)               ▷ Phase 1
4:   while within_time_constraint_phase2 do                     ▷ Phase 2
5:     s ← pickState(currentInfoSet)          ▷ picks according to the visited count
6:     Selection(s)
7:   end while
8:   return argmax_{m∈M(currentInfoSet)} visited(do(rand(currentInfoSet), m))
9: end procedure
10:
11: procedure FindCurrentIS(pastMoves, pastPerceptions)
12:   currentIS ← ∅
13:   while within_time_constraint_phase1 do
14:     s ← s0
15:     while |ξ_{R(s)}(s)| ≤ |pastMoves| do
16:       if Σ(s) = pastPerceptions and ξ_{R(s)}(s) = pastMoves then
17:         currentIS ← currentIS ∪ {s}
18:       else
19:         s ← do(s, rand(M(s)))
20:       end if
21:     end while
22:   end while
23:   return currentIS
24: end procedure
25:
26:
27: procedure Selection(s)
28:   if s ∈ Z then
29:     BackPropagation(s, u(s))
30:   else if not isFullyExpanded(s) then
31:     newState ← Expansion(s)
32:     ~util ← Simulation(newState)
33:     BackPropagation(s, ~util)
34:   else
35:     m ← ChooseBestMove(I(s))
36:     for all s′ ∈ I(s) do
37:       s′′ ← do(s′, m)
38:       if s′′ ∈ Γ then visited(s′′)++
39:       end if
40:     end for
41:     Selection(do(s, m))
42:   end if
43: end procedure


44:
45: procedure BackPropagation(s, ~u)
46:   if s ≠ s0 then
47:     for all s′ ∈ I(s) do
48:       Q(s′) ← (Q(s′) × (visited(s′) − 1) + u_r) / visited(s′)   ▷ u_r is the reward of the player who is about to play at state s′
49:     end for
50:     BackPropagation(parent(s), ~u)
51:   end if
52: end procedure
53:
54: procedure Simulation(s)
55:   if isTerminal(s) then
56:     BackPropagation(s, u(s))
57:   else
58:     Simulation(do(s, rand(M(s))))
59:   end if
60: end procedure
61:
62: procedure Expansion(s)
63:   newState ← rand({unexploredChild(s)})
64:   visited(newState) ← 1
65:   Γ ← Γ ∪ {newState}
66:   return newState
67: end procedure
68:
69: procedure ChooseBestMove(infoSet)
70:   totalV ← Σ_{s∈infoSet} visited(s)
71:   for all m ∈ M(infoSet) do
72:     Ext(m) ← Σ_{s∈infoSet} (visited(s) / totalV) × Q(do(s, m))
73:     Exr(m) ← Σ_{s∈infoSet} (visited(s) / totalV) × sqrt(totalV / visited(s))
74:     Q(m) ← Ext(m) + Exr(m)
75:   end for
76:   return argmax_{m∈M(infoSet)} Q(m)
77: end procedure

It then checks whether the reached state is in the information set that the player is currently in (line 16). An information set is a set of states which are indistinguishable by the player. In the General Game Playing platform, states are distinguished by the sequence of the player's own past moves and the sequence of received perceptions, as described in Definition 2. The first phase finishes when time_constraint_phase1 is reached (line 13). It then returns the possible current information set (line 23).
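A rough Python sketch of this first phase is given below. The helper names (initial_state, legal_moves, apply_move, own_moves, percepts, is_terminal) are assumptions standing in for the GDL-II game model; the loop keeps a state as soon as its own-move and percept histories match the player's.

import random
import time

def find_current_information_set(game, past_moves, past_percepts, budget_seconds):
    current_is = set()
    deadline = time.time() + budget_seconds
    while time.time() < deadline:
        s = game.initial_state()
        # Random simulation until the state has as many own moves as the player has made.
        while not game.is_terminal(s) and len(game.own_moves(s)) <= len(past_moves):
            if game.percepts(s) == past_percepts and game.own_moves(s) == past_moves:
                current_is.add(s)   # indistinguishable from the player's true history
                break
            s = game.apply_move(s, random.choice(game.legal_moves(s)))
    return current_is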

The second phase begins at line 4. A starting state is chosen from the possible current information set. The states are chosen not at random but based on the probability distribution assigned to the states in the information set the last time MCITS was called. It then calls the Selection procedure (lines 27-43). The algorithm keeps calling the Selection procedure on different starting states until the time allowed by time_constraint_phase2 is over. In the end, it needs to return the best move according to its decision policy. I have tested both decision policies in my experiments and both returned the same best move after enough iterations. For the pseudocode and the experimental results, I use the most-selected move as the decision policy. MCITS does not count the number of times a move is selected but counts how many times a state was visited, the visited variable. So, for its decision policy, it returns the move from a state in the current information set whose successor state, after applying the move, has the highest visited count. Line 8 represents the decision policy of MCITS. The algorithm can randomly pick a state when checking how often its moves were selected, because moves from all states of an information set are picked equally. Below are descriptions of the procedures of the Monte Carlo Iterative Tree Search algorithm.

The Selection procedure is similar to the selection in the original MCTS algorithm. It calls the BackPropagation procedure when it reaches a terminal state (lines 28-29). If it reaches an unexpanded state, it calls Expansion, then calls Simulation on the newly generated state, followed by the BackPropagation procedure (lines 30-33). Otherwise, if it reaches a state which is neither terminal nor unexpanded, it chooses the optimal move by calling the ChooseBestMove procedure. One difference between MCITS and the original algorithm is that MCITS increments the visited count in the Selection procedure. The states whose visited counts are incremented are the children, after applying the chosen move, of all the states in the information set (lines 36-38). This process could be implemented in the BackPropagation procedure as in the original MCTS, but that would require storing the chosen moves during each iteration.

The BackPropagation procedure takes a state s and the utility vector ~u and updates the state value Q(s) (lines 45-52). It continues until it reaches the starting state s0. The u_r in line 48 denotes the reward of the player who is about to play at the state s′.
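The value update in line 48 is simply a running average of the rewards observed for the player to move; a minimal sketch, with q_value and visited assumed to be dictionaries keyed by state:

def backpropagate_value(q_value, visited, state, reward):
    # Running-average update of Q(s'), assuming visited[state] has already been
    # incremented for this iteration: Q <- (Q * (n - 1) + reward) / n.
    n = visited[state]
    q_value[state] = (q_value[state] * (n - 1) + reward) / n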

The Expansion procedure (lines 62-67) is called in the Selection procedure when an unexpanded state is reached (line 31). After the state is expanded, the Simulation procedure is called (line 32). The Simulation procedure (lines 54-60) is similar to the one in the original MCTS algorithm: it plays randomly until it reaches a terminal state and then returns the utility ~u of that terminal state.

The ChooseBestMove procedure (lines 69-77) is called both in the Selection procedure (line 35) and in the main procedure, to choose the optimal move when the iterations are finished (line 8). This procedure takes into account both the balance between exploitation and exploration and the evaluation of a move over all the states in the information set rather than only one state. Aside from the ChooseBestMove procedure, MCITS also differs from the original MCTS in how it increments the visited count: the visited count is incremented on the successor states of all the states in the information set. This allows better modelling of the knowledge of the opponent.

Note that the ChooseBestMove procedure returns only a single move. As a result, MCITS does not generate a mixed policy, so it cannot successfully play games which require a mixed policy, such as Biased Penalty Kick from Section 3.2.3 of this thesis. I have experimented with having ChooseBestMove return a mixed policy instead of the single best move. However, due to the pruning behaviour of MCITS, the resulting mixed policy was usually different from the expected optimal mixed policy.

4.3 Analysis

In this section, to demonstrate the success of the MCITS algorithm, I ran it on games which have been previously discussed in the literature and discuss the reasons for its success on these categories of games. In the HP-II paper and in Chapter 3, a few games were used to test the abilities of these algorithms; each game is designed to test one aspect of incomplete-information games, and the games are categorised by the feature they test. After experimenting with different time ratios for the two phases of MCITS, I found the 1 : 9 ratio to be the most successful for all the games discussed in this chapter. This means that, of the provided time, 10% is used to find as many states as possible in the current information set (phase 1) and 90% is used to find the best move from the current information set (phase 2).

4.3.1 Valuing Information in the Game

Earlier solutions to GGP-II mainly used sampling techniques [86, 25]. Sampling techniques do not value information in the game. In fact, the main reason for the development of both ITS and HP-II was to correctly value the information in the game, unlike previous solutions. The problem of valuing information is also referred to as the strategy-fusion problem (section 2.3.1). I chose two games from the literature for this category: the Fully Hidden Extended Cutting Wire game and the Battle Ships in the Fog game.

Example: Fully Hidden Extended Cutting Wire The Fully Hidden Extended Cutting Wire game was introduced in section 3.2.1 to show the limitation of HP-II in valuing information throughout the game. Similar to ITS, MCITS runs on information sets, so it avoids strategy fusion. I ran MCITS on this game and recorded the number of times each move was chosen in the information set. The counting starts when the first state from the information set is added to the MCITS tree. Figure 4.1 shows the number of times the algorithm selected wait1 versus tell1, and Figure 4.2 shows the number of times the algorithm selected wait2 versus tell2. The number of selections is important since, at the end, the algorithm uses the same procedure to select the optimal move.


Figure 4.1: Number of selection after 500 iterations for wait1 and tell1 moves in Fully Hidden Extended Cutting Wire.

As can be seen in Figure 4.1, after fewer than 150 iterations the optimal move, wait1, was selected twice as often as tell1. Similarly, Figure 4.2 shows that after approximately 500 iterations the optimal move, tell2, was selected approximately eight times more often than wait2. One reason for the different gradients of these two graphs is the difference in utilities. If the Teller tells at the first step and then plays optimally, both players receive 10 points less than if he plays the wait1 → tell2 strategy, which is the optimal strategy. However, if the Teller chooses wait1 → wait2, then they receive 40 points less than if the Teller plays optimally. If the utility difference between two strategies is high, the difference in their selection slopes is high as well.

Example: Battle Ships in the Fog This is a turn-taking constant-sum game designed to test the ability of players to value information. In this game, players need to hide their own information from the opponent and to seek the opponent's valuable information.



Figure 4.2: Number of selection after 500 iterations for wait2 and tell2 moves in Fully Hidden Extended Cutting Wire.

Both ships are placed randomly on two separate n × n boards. Similar to the HP-II paper, I chose a 4 × 4 board size. Each player only sees its own location on the board. At each turn, a ship decides either to move in one of four directions, to shoot at a cell on the opponent's board, or to scan. Hitting the opponent when shooting gives the maximum reward of 100 and the opponent loses, but shooting also reveals one's own location. Scanning reveals the exact location of the opponent, but the opponent is notified that it has been scanned. Figure 4.3 shows the directions in which a ship can move from different cells of the board.

This game was introduced in the HP-II paper [87]. All the previous GGP-II algorithms, including HP-II, which claimed to play this game optimally [88], fail to play the optimal strategy. All the GGP algorithms which suffer from strategy fusion choose to shoot at random; as a result, they miss the opponent n² − 1 times out of n², where n is the dimension of the board. The ITS algorithm, which I introduced in Chapter 3, fails to play the game in practice due to the high branching factor. At the start of the game, the random player places each battleship into its cell; there are 16 × 16 possibilities for the combination of both placements.


Figure 4.3: Different directions that a ship can move on a 4 × 4 board.

In each round, each battleship can decide to shoot at one of the 16 cells on the opponent's board, move to one of its (up to) four neighbouring cells, or scan. Only one shot hits the opponent's ship, and the game then ends, so the total branching factor is (16 − 1) + 4 + 1 = 20. In each round, both battleships make decisions, so this branching factor applies twice per round of the game. As a result, for r rounds, the total number of states in the Battle Ships in the Fog game tree is 16² ∗ 20^(2r). Assuming that each state requires only 84 bytes of memory, after only three rounds of the game ITS requires 1.4 terabytes of memory. This makes ITS practically incapable of playing this game or other large games.
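The 1.4-terabyte figure follows directly from these numbers; a quick check:

states_after_3_rounds = 16**2 * 20**(2 * 3)   # 16 x 16 initial placements, branching 20 per ship per round
bytes_per_state = 84                          # assumed memory footprint per state
total_bytes = states_after_3_rounds * bytes_per_state
print(total_bytes / 1e12, "terabytes")        # roughly 1.38, i.e. about 1.4 TB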

All the known variants of the CFR algorithm also fail to play this game without any abstraction. Texas hold'em poker has approximately 3.2 × 10^14 information sets [49], and a 4 × 4 Battle Ships in the Fog game with a length of 10 steps has over 4.2 × 10^15 information sets [85]. All variants of CFR fail to play Texas hold'em without any abstraction because of the large number of information sets [49]. Since Battle Ships in the Fog has even more information sets, all the variants of CFR will also fail to find the optimal strategy in this game without abstraction.

The HP-II algorithm finds a better strategy than its predecessors, but not an optimal one: it first scans then shoots, scan → shoot. A better strategy is not to shoot but to scan or move; never shooting actually wins most of the time against the scan → shoot strategy.


The reason is as follows: if the first ship scans then the second ship knows it has been scanned. So it moves in order to add uncertainty to the first ship’s knowledge. It can either move up, right, left or down. If the first ship misses its shot, the second ship knows the exact location of the first ship. As a result, any rational opponent as the second ship will shoot back and always hit the first ship. This means if the first ship misses its shot after the scan, it will certainly lose. The first ship misses 75% of the times since the second ship can move to four directions after being scanned and the first ship can not see which direction the second ship was taken. We can then conclude that scanning → shooting strategy has a 25% chance of success and 75% chance of failure against a rational opponent. The MCITS algorithm discovers the better strategy, namely to move. Figure 4.4 shows one of the three scenarios in which MCITS wins against HP-II.

There are a total of four scenarios, and MCITS wins in three of them. The left side of Figure 4.4 shows the match from the MCITS point of view and the right side shows it from the HP-II point of view. The bright green circle on the left shows the actual position of the MCITS ship and the bright blue triangle shows the actual position of the HP-II ship. The pale shapes are the possible positions at which a ship guesses the opponent may be placed. At first, neither of the two ships knows the exact position of the other. Scanning reveals the position of the opponent, but the opponent can then move in one of four directions. When a ship wins, both players are notified about everything and the game ends. In this game, with the HP-II and MCITS strategies, HP-II wins only if it shoots the correct cell at the second step; if it shoots any of the other three cells, it loses. A more detailed comparison between MCITS, HP-II and the sampling techniques is provided in Appendix B.



Figure 4.4: One of the three scenarios in which MCITS wins against HP-II



Figure 4.5: Number of selections after 16,800 iterations for the shoot, scan and move actions in the Battle Ships in the Fog game.

As mentioned, the MCITS algorithm discovers the better strategy, namely to move. To support this claim, I ran MCITS for a total of one minute to find the optimal move in this game. During the first 6 seconds, MCITS ran 7,200 iterations to find the current information set and the probabilities of its states. Then, for the remaining 54 seconds, it ran 16,800 iterations from the current information set. Figure 4.5 shows that after some iterations MCITS was able to discover that moving is better than both shooting and scanning. The reason for the closeness of scan and move is the low advantage of moving or scanning on a 4 × 4 map. If the second ship is located in the centre of the board, it can move in four directions, which means a 75% chance of missing the shot. If the second ship is located on an edge, it can only move in three directions, which means a 66.6% chance of missing. If the ship is located in a corner, there is a 50% chance of missing. For this reason, on a 4 × 4 board the optimal strategy, move, is 24% more likely to succeed than the scan → shoot strategy.

4.3.2 Non-Locality Problem

I introduced the non-locality problem in section 2.3.1. The non-locality problem arises when an algorithm evaluates the value of a state solely based on its subtree.

In incomplete information games, a node might depend on other nodes in the tree. The opponent can possess more information and lead the game toward a node in another branch of the game tree. This makes the value of a node depend on other, non-local nodes.

Example: Frank Basin Non-Locality Game

Consider the game in Figure 4.6, which was described in section 2.3.1 and which I call the Frank Basin Non-Locality game. This game was developed by Frank et al. [30] to illustrate the non-locality problem.

[Figure: a three-level tree with Max at node a, Min at nodes b and c, and Max at nodes d, e and f; the leaf payoffs in worlds w1, w2 and w3 are, per leaf, w1: 1 0 0 1 0, w2: 0 1 0 0 0, w3: 0 1 0 0 0.]

Figure 4.6: The Frank Basin Non-Locality game described in the worlds model. The suggested optimal strategy for Max is shown with thick lines.

It has been shown in section 2.3.1 that several incomplete information search algorithms suffer from this problem; some examples are Perfect Information Monte Carlo (PIMC) [115], Smooth UCT [59] and Information Set Monte Carlo Tree Search (ISMCT) [21]. To analyse how the MCITS algorithm performs on this game, I used a version of the game in an extensive normal form representation with incomplete information; Figure 4.7 represents it in extensive normal form. MCITS keeps different probabilities for different states in an information set, and these probabilities depend on the number of selections of the previous moves. At early iterations, Max decides to choose right at information set d.


[Figure: the extensive normal form tree, with a random node at the root branching to worlds w1, w2 and w3, followed by Max (states w1a–w3a), Min (w1b–w3c) and Max (w1d–w3f) decision levels; the leaves are reached by the moves left, right or single and carry payoffs of 1 or 0.]

Figure 4.7: The Frank Basin Non-Locality game described in extensive normal form. The thickness of the lines represents how frequently a move is chosen by MCITS.

However, as the game progresses and Min chooses right at states w2b and w3b, the probabilities of states w2d and w3d being the true state decrease. This means that after some iterations Max reaches the conclusion that w1d has a higher chance of being the true state, and it places a higher probability on w1d when deciding which move to choose at information set d. As can be seen from Figure 4.8, during the first 100 iterations the Max player tends to choose right, because it considers all three states to be almost equally likely. But with more iterations, and with Min mainly selecting right at w2b and w3b, Max understands that if it finds itself in information set d it is probably at state w1d. It then chooses left at information set d, which is the optimal move to play.

4.3.3 Summary

This chapter has introduced the Monte Carlo Iterative Tree Search (MCITS), a combination of the Monte Carlo Tree Search (MCTS) and the Iterative Tree Search (ITS) algorithms. I evaluated the performance of MCITS on the non-locality and strategy-fusion problems. Non-locality and strategy-fusion (information valuation) are considered the two main limitations of incomplete information search algorithms.


[Plot: number of selections per iteration, over 400 iterations, for the moves left and right at information set d.]

Figure 4.8: Number of selections of left and right after 400 iterations at information set d in the Frank Basin Non-Locality game.

MCITS evaluates a move according to all the states in an information set. In this way it correctly values information in games and can therefore successfully play games with strategy-fusion. MCITS correctly plays the Battle Ships in the Fog game, which exhibits strategy-fusion, whereas all previous GGP-II approaches, including ITS, fail to play the optimal strategy in this game [87]. I have proved that the state of the art, HP-II, fails to generate the optimal strategy in the Battle Ships in the Fog game, which HP-II itself introduced, while MCITS can successfully generate a strategy to win against HP-II. MCITS is also able to successfully play games with the non-locality problem by placing different probabilities on different states of an information set and by evaluating a move according to all the states in that information set. These two features allow MCITS to successfully play games with non-locality while other algorithms such as Smooth UCT, Perfect Information Monte Carlo and Information Set Monte Carlo Tree Search fail.


Chapter 5

General Language Evolution in General Game Playing

MCITS is able to solve a wide variety of games. However, none of the previous approaches, including MCITS, is able to solve cooperative games that require implicit communication. As discussed in Chapter 2, section 2.3.3, in cooperative games with implicit communication agents need to share their knowledge about the match without this being explicitly described in the rules of the game. As an example, recall the simple Cooperative Spies game from section 2.3.3. First, a bomb is armed by randomly choosing one of two wires. Only the Teller can see which wire the bomb is armed with. This agent can then send one of two possible messages to the Cutter. Finally, the Cutter needs to cut the right wire for both to win. The crux of this game is the lack of any connection between the perception of the Teller, namely which wire has been used, and the message it can send to the Cutter. This problem is mentioned as a limitation of all current methods for GGP-II [87]. One reason for this shortcoming is their inability to generate the necessary coordination language.

The study of common language evolution in computer science can roughly be divided into two categories based on the environments that are considered: embodied or simulated. The focus of the embodied environment is mainly on language games. There are three main variants of language games: the object naming game, the colour categorising and naming game, and the lexicon spatial language game [96, 95, 97, 3, 100, 93].

One of the earliest studies of the object naming game was an experiment with the Sony AIBO robotic dog [96]. Reinforcement learning was used to teach the robotic dog the names of some objects. The setting was then implemented between two robots [95], rather than between one robot and one human. In the two-robot experiment [95], two robotic arms with attached cameras were used. They were given a set of words and a wall with some colour stickers attached to it. With the help of trial and error, the robots came to an agreement on what to call each colour using the given set of words.

One of the most recent developments in the object naming game introduced it to humanoid robots [3]. In that experiment, two mobile humanoid robots are placed in a room. When one of the robots sees an object, it points at it. The object can be either an abstract object, like a sphere, or a toy. The robots then give the object and its colour names which they both agree on.

While the naming game can be used to model some real-world problems, many real problems require a lexicon spatial language. Humans regularly use spatial language by referring to objects' locations in space relative to other objects; some examples are "the cat behind that tree" or "the person inside the car". One of the earliest studies of spatial language was performed with two robotic arms with attached cameras [93]. The scale of the experiment was later expanded by introducing mobile humanoid robots [100]. The robots first use a mirror to learn about their body movement and its relation to space. Then, they interact with each other in the room to achieve a requested goal. They can only achieve the goal by successfully communicating with the help of spatial language.

All the mentioned examples are in embodied environments located in the real world. In simulated environments, on the other hand, the need to interact with the real world is removed but the agent population is larger [40, 94]. Most existing language evolution techniques focus on generating a common language without considering its generality or its use for problem-solving [28, 60].

In this chapter, I extend GGP with language evolution to develop a general language generation technique. My main contributions are as follows. I extend GGP-II so that it can be applied to the field of language learning in AI, with the aim of studying general language learning algorithms that can be applied across a wide variety of problems. I also introduce a general language learning algorithm in this framework which allows agents to reach a common language for sharing information and correctly playing coordination games. Agents learning new common "languages" for problems solely by being given a formal problem description, without a dedicated communication channel, is new to both GGP and the field of language learning.

In the following, I briefly present the Simplified ITS algorithm. Then, I introduce the new General Language learning algorithm (GL) for GGP-II, followed by an analysis of the algorithm in a variety of different games, including the Cooperative Spies game. I also report on an experiment performed with the help of the Genetic algorithm and GL. The Genetic algorithm is widely used as a way to create agreement between agents in an environment [58, 89, 77]. The chapter ends with a short summary.

Publications

This chapter recapitulates and expands on the following previously published work.

• [15] Armin Chitizadeh and Michael Thielscher, "General language evolution in general game playing", In: AI 2018: Advances in Artificial Intelligence - 31st Australasian Joint Conference, Wellington, New Zealand, Dec 11-14 2018, pp. 51-64.

5.1 Simplified Iterative Tree Search Algorithm

In Chapter 3, I developed Iterative Tree Search (ITS) as a new algorithm that can successfully play a wider variety of GGP-II games than previous techniques. ITS searches the incomplete information game tree in order to value information in games. It also iteratively plays against itself before the game starts in order to learn the optimal strategy against a rational opponent. However, it is incapable of playing coordination games such as the above-mentioned Cooperative Spies game. For a detailed description of the ITS algorithm I refer to Chapter 3; here, I introduce a simplified version, called sITS. In this version, the algorithm searches the game tree only once and avoids further iterations. The iterations were required to model the opponent in a game; since in this chapter we are mainly interested in cooperative games, we can avoid them. This dramatically reduces the time complexity of the algorithm. The simplified version is described in Algorithm 3.

The sITS algorithm is described so that, instead of returning a chosen move, it returns the value of the game. The value of the game for a player is the utility of the starting state s0 for that player. This simplifies the description of the General Language approach in this chapter. In the original ITS pseudocode, Algorithm 1, the game description is not given as an input, but in sITS I explicitly include it as an input. The General Language approach, which will be described next, modifies the game description, so the game description needs to be an input here.

The first step in the sITS algorithm is to initialise the game tree (line 2). The InitialiseGameTree procedure (lines 7-24) initialises the game tree in a similar way to the ITS algorithm. First, it generates the game tree according to [106], then it generates the information sets according to eq. 2.1. Next, it initialises the move probabilities µ (lines 10-16) and the probability factors ρFactor (lines 17-20), in order to set the state probabilities ρ (lines 21-23). One main difference between sITS and ITS is that in sITS all states belonging to players other than random have all move probabilities set to 1 (line 14). The reason is that sITS will be used for cooperative games, so there is no requirement to model the opponent.
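In compact notation, the initialisation in lines 10-23 of Algorithm 3 below amounts to setting:

\[
\mu(m) =
\begin{cases}
\dfrac{1}{|M(s)|} & \text{if } R(s) = \mathit{random},\\[4pt]
1 & \text{otherwise},
\end{cases}
\qquad
\rho(s) = \frac{\rho\mathit{Factor}(s)}{\sum_{s' \in I(s)} \rho\mathit{Factor}(s')}.
\]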

The next step is to calculate the utility of the provided game (line 3). The CalculateUtil procedure uses recursion to compute utilities starting from the final states and moving back toward the initial state. If a state belongs to random, it takes the average utility of the children (line 37). Otherwise, it sets the utility of the state to the maximum utility among the children (line 41).


Algorithm 3 Simplified ITS algorithm
 1: procedure sITS(GDL)
 2:     s0 ← InitialiseGameTree(GDL)
 3:     gameValue ← CalculateUtil(s0)
 4:     return gameValue_thePlayer                  ▷ thePlayer is the role of the player who ran sITS
 5: end procedure
 6:
 7: procedure InitialiseGameTree(GDL)
 8:     generateGameTree(GDL)                       ▷ According to [106]
 9:     generateInformationSet(GDL)                 ▷ According to eq. 2.1
10:     for all s ∈ S \ Z and m ∈ M(s) do
11:         if R(s) = random then
12:             µ(m) ← 1 / |M(s)|
13:         else
14:             µ(m) ← 1
15:         end if
16:     end for
17:     ρFactor(s0) ← 1
18:     for all s ∈ S \ Z and m ∈ M(s) do
19:         ρFactor(do(s, m)) ← ρFactor(s) × µ(m)
20:     end for
21:     for all s ∈ S \ Z do
22:         ρ(s) ← ρFactor(s) / Σ_{s' ∈ I(s)} ρFactor(s')
23:     end for
24:     return s0
25: end procedure
26:
27: procedure CalculateUtil(state)
28:     if state ∈ Z then
29:         return u(state)
30:     else if R(state) = random then
31:         return CalculateRandomUtil(state)
32:     else                                        ▷ If the player is an agent
33:         return CalculatePlayerUtil(state)
34:     end if
35: end procedure
36:
37: procedure CalculateRandomUtil(state)
38:     return Σ_{m ∈ M(state)} CalculateUtil(do(state, m)) / |M(state)|
39: end procedure
40:
41: procedure CalculatePlayerUtil(state)
42:     return max_{m ∈ M(state)} Σ_{s ∈ I(state)} ρ(s) × CalculateUtil(do(s, m))_{R(s)} / |I(state)|
43: end procedure
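To make the recursion of Algorithm 3 concrete, the following is a minimal Python sketch of the utility calculation for a purely cooperative game with a single shared payoff; the Node structure and its fields are my own simplified stand-ins for the GDL-derived sets, not the thesis implementation.

from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Node:
    role: str                                    # "terminal", "random", or an agent name
    utility: float = 0.0                         # payoff at a terminal node
    children: Dict[str, "Node"] = field(default_factory=dict)   # move -> successor
    info_set: Optional[List["Node"]] = None      # states indistinguishable to the mover
    rho: float = 1.0                             # state probability within the information set

def calculate_util(node: Node) -> float:
    """Recursion of Algorithm 3 for a cooperative game (single shared payoff)."""
    if node.role == "terminal":
        return node.utility
    if node.role == "random":
        # CalculateRandomUtil: average over the random player's moves.
        values = [calculate_util(child) for child in node.children.values()]
        return sum(values) / len(values)
    # CalculatePlayerUtil: evaluate each move over every state of the information set,
    # weighted by the state probabilities rho, and keep the best move.
    # (The constant division by |I(state)| in Algorithm 3 is omitted here,
    #  since it only rescales values and does not change which move is best.)
    states = node.info_set or [node]
    best = float("-inf")
    for move in node.children:
        value = sum(s.rho * calculate_util(s.children[move]) for s in states)
        best = max(best, value)
    return best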


5.2 General Language Algorithm

Traditionally, the focus of GGP players has been on one-shot games, even though the GGP framework has always allowed players to memorise games from the past. In this chapter, I utilise this rarely used feature of the GGP framework and go beyond one-shot games in order to use a policy learning technique. In other words, I allow players to keep information about their past matches. With these constraints removed, I can now introduce the General Language (GL) algorithm.

Only for the sake of clarity in the description of the algorithm, I introduce a new syntactical element to GDL-II, a pre-defined keyword called must(). An instance (must (does ?agent ?action1)) forces an agent to choose the given specific action. Effectively, the algorithm simply removes all the legal actions except one for a player in a state. I refer to the remaining rule as a mustRule.

My General Language technique for GGP-II learning is as follows. A common "communication language" can be described as a set of mustRules added to the original GDL-II description of the game. These rules connect perception tokens received by a player to the actions of that player. In other words, each mustRule forces an agent to play a particular move that triggers a specific percept if, and only if, the agent made a specific observation beforehand. A move that releases a percept must happen after the triggering percept has been received by the player. This is a one-to-one relation. Formally, all the added mustRules have the following structure:

(⇐ (must(does ?agent1 ?action1)) (sees ?agent1 ?perception1))

where ?action1 has the following consequence:

(⇐ (sees ?agent2 ?perception2) (does ?agent1 ?action1))

It is worth noting that ?perception1 is always received before ?perception2 in a game. To represent a mustRule in this thesis I use the percept - action notation within the text and, due to the limited space, an abbreviated form in the figures. For example, (⇐ (must(does teller tellA)) (sees teller arm_red)) is represented as arm_red - tellA within the text or as Red-A in the figures.
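Operationally, a mustRule simply prunes the legal-move set of the perceiving agent once the triggering percept has been seen. The following small Python sketch illustrates this; the triple-based representation and the function are my own illustration, not GDL-II machinery.

# Hypothetical, simplified view of a mustRule: (agent, triggering_percept, forced_action).
must_rules = [("teller", "arm_red", "tellA")]

def legal_moves(agent, percepts_seen, candidate_moves, rules=must_rules):
    """Prune an agent's legal moves according to any mustRule whose percept was seen."""
    for rule_agent, percept, forced in rules:
        if rule_agent == agent and percept in percepts_seen:
            # The mustRule leaves exactly one legal action for this agent.
            return [forced]
    return list(candidate_moves)

# Example: once the Teller has seen arm_red, only tellA remains legal.
print(legal_moves("teller", {"arm_red"}, ["tellA", "tellB"]))   # -> ['tellA']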


The first step of the GL algorithm is to generate a bag of different GDL games, each of which we refer to as a world. The original GDL of the game is a world, and so is the original GDL with one or more mustRules added. We begin by adding all the possible worlds. Next, we run the sITS search on each world and set the returned value as the value of that world. We then choose the world with the highest value and play the policy generated by running sITS on it. If there is more than one world with the maximum value, we have a coordination problem. To solve this coordination problem, I use the policy learning algorithm: each agent randomly chooses a world with maximum value and plays an optimal policy for it. If the coordination succeeds, the agents stick with the chosen world for all future rounds. If it fails, they repeat the process by randomly choosing a world with maximum value again. Algorithm 4¹ describes this GL algorithm more formally.

Algorithm 4 General Language Algorithm
 1: procedure GL(GDL)
 2:     worldList ← GDL ∪ generateBagOfWorlds(GDL)      ▷ As described in section 5.2
 3:     maxReward ← −1
 4:     for all world ∈ worldList do
 5:         if maxReward < sITS(world) then
 6:             maxWorlds ← {world}
 7:             maxReward ← sITS(world)
 8:         else if maxReward = sITS(world) then
 9:             maxWorlds ← {world} ∪ maxWorlds
10:         end if
11:     end for
12:     chosenWorld ← rand(maxWorlds)
13:     while time_allows do
14:         play according to the sITS(chosenWorld) policy
15:         if matchReward < maxReward then
16:             chosenWorld ← rand(maxWorlds)
17:         end if
18:     end while
19: end procedure

1Similar to Algorithm 1 and Algorithm 2, all the sets including S, R, M, etc. are assumed to have been computed from the GDL description as defined earlier.
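For readability, the loop of Algorithm 4 can also be rendered as a short Python sketch; the helpers evaluate (standing in for sITS), generate_worlds, play and time_allows are assumed interfaces introduced only for this illustration.

import random

def general_language(gdl, evaluate, generate_worlds, play, time_allows):
    """Sketch of Algorithm 4: pick among the highest-valued worlds, then learn to coordinate."""
    worlds = [gdl] + generate_worlds(gdl)
    values = [evaluate(w) for w in worlds]                 # one sITS-style evaluation per world
    max_reward = max(values)
    max_worlds = [w for w, v in zip(worlds, values) if v == max_reward]

    chosen = random.choice(max_worlds)
    while time_allows():
        match_reward = play(chosen)                        # play the policy of the chosen world
        if match_reward < max_reward:                      # coordination failed: draw another world
            chosen = random.choice(max_worlds)
    return chosen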


[Diagram: the Match Manager communicates over TCP/IP with the General Language Algorithm, which in turn uses a GGP-II Player.]

Figure 5.1: General Language Algorithm placement in General Game Playing match ecosystem.

The General Language algorithm communicates directly with the match manager in a GGP-II match. It receives the game rules from the match manager, then generates different altered versions of the game rules, each known as a world. It then uses any GGP-II player to evaluate the value of each of them and plays according to the optimal world. Figure 5.1 illustrates the General Language algorithm in a GGP-II match ecosystem.

Example: Cooperative Spies

To illustrate the General Language technique, the Cooperative Spies game is used as an example. It was previously presented in section 2.3.3 and was originally introduced in the HP-II paper [87] to illustrate a limitation of HP-II and of all other existing approaches to general game playing with incomplete information. As described in section 2.3.3, the crux is that in the description of the game (cf. Figure 2.12) there is no logical dependency between the colour of the wire and the signal that the Teller can send. For this reason, none of the previous GGP-II approaches can solve this problem. Recall from Figure 2.12 the following sees rules for the Teller:

(⇐ (sees teller redWire) (does random (arm red)))


(⇐ (sees teller blueWire) (does random (arm blue)))

along with the following rules for the Cutter:

(⇐ (sees cutter a) (does teller tellA))
(⇐ (sees cutter b) (does teller tellB))

For this game, there are four possible mustRules that can be added:

1. (⇐ (must(does teller tellA)) (sees teller redWire))

2. (⇐ (must(does teller tellA)) (sees teller blueWire))

3. (⇐ (must(does teller tellB)) (sees teller blueWire))

4. (⇐ (must(does teller tellB)) (sees teller redWire))

There are 16 combinations of the above mustRules. However, any combination must respect the one-to-one relation described above, which applies to mustRules whose ?agent1 performs its ?action1 at the same information set. As an example, combining mustRule 1 and mustRule 4 from the list above would contradict it: together they would force the Teller to perform both tellA and tellB after seeing redWire, which is impossible. Taking this constraint into account, there are only six legal combinations that can be added to the original GDL, which gives seven worlds in total. We then run the sITS algorithm on all seven worlds and calculate their values. Consider, for example, the world with only mustRule 1. This world is similar to the original GDL but limits the Teller to choosing only tellA after the random player has armed the red wire. In other words, the branch where the Teller chooses tellB after the random player armed the red wire is removed from the game tree. The value of the original GDL determined by sITS is 50. The value of each world with only one mustRule is computed as 75, and the value of the two worlds with two mustRules is 100. This shows that in this game we have a coordination problem: the spies must choose between the combination of the first and third mustRules or the combination of the second and fourth. Let us assume that one spy chooses the first combination and the other chooses the second one. They fail and need to choose again. As soon as the spies make the same choice, they stick to it for all future matches, thus having learned to cooperate through the development of a common language.


[Figure: subfigure (a), the original world — random arms red or blue, the Teller may send tellA or tellB and the Cutter cuts cut_r or cut_b; every non-terminal state has utility u : 50 and the terminal payoffs are 100 for the correct cut and 0 otherwise. Subfigure (b), the world with two mustRules — the Teller is forced to a single message per wire; every non-terminal state has utility u : 100 and the Cutter's correct cut yields 100.]

Figure 5.2: Comparison of a world with two mustRules vs the original game in Cooperative Spies. The utilities next to each state are calculated using the sITS algorithm.

Figure 5.2 illustrates the difference between the original game tree and the tree for an enhanced world. The utilities are calculated using the sITS algorithm. The tree in Figure 5.2(b) belongs to a world with the first and third mustRules. Thick lines show how frequently a move is chosen. The value inside each node is the value of that node after running the sITS algorithm, and the value of the initial node s0 is the value of the world.

5.3 Analysis

In the following, I describe different games and show how with the GL algorithm agents are always able to generate a common language and play the optimal move.

5.3.1 Naming Game

One of the earliest and simplest examples, different variants of which are still heavily used in the literature, is the Naming Game [101, 3, 98, 48, 99]. It is a game in which players need to find a common language that connects an object to a name. In this game, any world with a number of mustRules equal to the number of objects is an optimal world. This class of games can be described in GDL-II as follows. The game has three players: one random player and two agents. The random player chooses an object at random, and both agents can see the chosen object. Each of them should then choose one among different actions, where each action sends a specific percept to the other agent. If both choose the same action they win and receive a score of 100; otherwise they lose and score 0. There are n!/obj! optimal legal worlds, where obj is the number of objects and n the number of names. All of these optimal worlds contain exactly obj mustRules, and the value of each optimal world is 100 after running the sITS algorithm on it. The agents can then solve the coordination problem with policy learning. As can be seen, the GL algorithm can successfully play the Naming Game.
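As a toy illustration of this coordination step (the two-object instance, names and data layout are my own, introduced only for demonstration), the following simulation re-draws optimal languages at random until both agents hold the same one:

import itertools
import random

objects = ["sphere", "toy"]
names = ["A", "B"]

# Optimal worlds for this toy instance: assignments of distinct names to the objects.
optimal_worlds = [dict(zip(objects, perm))
                  for perm in itertools.permutations(names, len(objects))]

def play_round(lang1, lang2):
    obj = random.choice(objects)                 # the random player shows an object
    return 100 if lang1[obj] == lang2[obj] else 0

a, b = random.choice(optimal_worlds), random.choice(optimal_worlds)
for round_no in range(1, 50):
    if play_round(a, b) == 100:                  # coordination succeeded: keep this language
        print(f"agreed on {a} after {round_no} round(s)")
        break
    a, b = random.choice(optimal_worlds), random.choice(optimal_worlds)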


5.3.2 Air-Strike Alarm

Many made-up examples, like the well-known Naming Game, have more than one optimal language. This forces the agents to use a technique such as policy learning to agree on a common language. In contrast, real-world examples, such as efficient message encoding [116], are asymmetric, which means they have only one optimal language. An agent then needs to combine language learning with planning, without any policy learning, to find the optimal language. The use of planning together with language learning is a unique feature of GL and its advantage over other existing language learning algorithms. To demonstrate this, I created an example closer to the real world, which I call the Air-Strike Alarm game. It is an asymmetric version of Cooperative Spies.

The Air-Strike Alarm game has three players: Enemy, Signalman and Citizen. The Enemy is played by random and attacks 10% of the time. The Signalman sees whether aeroplanes are coming toward the city and can then sound the alarm or not. The Citizen then needs to decide whether or not to take shelter. The game is similar to the Cooperative Spies game: the Teller is replaced by the Signalman; the Cutter is the Citizen; the percepts 'a' and 'b' are replaced, respectively, by seeing or not seeing the aeroplanes; cutting the wires is replaced by taking or not taking shelter; and the two messages the Signalman can send are sounding the alarm or not. The main differences to the original game are, first, that raising the alarm causes auditory discomfort, which comes at a cost of 10 points for everyone, and second, that the Enemy attacks only 10% of the time.

Clearly, the optimal strategy is for the Signalman to sound the alarm during an attack and for the Citizen to take shelter upon hearing it. However, all the existing approaches to GGP-II would fail to play this game optimally, because the best strategy they find is always not to sound the alarm. The Citizen then knows it is better to go about their normal life, as an air strike occurs only 10% of the time; as a result, they get an average of 90 points. This mistake of choosing a non-optimal strategy occurs because there is no learning involved. On the other hand, reinforcement learning techniques cannot solve games with a penalty for sending messages.

Moreover, they always require prearranged centralised learning of the agents. The GL algorithm finds seven worlds, similar to the Cooperative Spies game. sITS determines 90 as the value of the original GDL and 82 for the world where the alarm is off during an air strike and on during a safe situation. The optimal world is the one in which the alarm is on during an air strike and off when it is safe; sITS returns 98 points for this world. If we assume that all citizens are rational, then all of them will find the optimal world with the right language in it, play the optimal strategy and receive on average a value of 99 in the long run.
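As a quick sanity check of the long-run figure (my own arithmetic, under the assumption that on the optimal path the only cost is the 10-point alarm penalty incurred in the 10% of rounds with an attack):

\[
0.9 \times 100 + 0.1 \times (100 - 10) = 90 + 9 = 99.
\]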

5.4 Experimental Analysis

In the following, I report on the experiments with sITS and its complete version, ITS. The main focus here is sITS, but we have also found that ITS can offer some additional advantages over sITS.

Experiment with sITS

We have performed a language evolution experiment with sITS and the Genetic Algorithm (GA) [46]. The Genetic Algorithm is widely used as a way to create agreement between agents in different environments [58, 89, 77]. This makes it an ideal algorithm to combine with sITS and GL in order to find the optimal language. For this experiment, I let the population fluctuate until it reaches an equilibrium in which the majority has developed an optimal common language. This can also be used as a way to find an optimal world when the space of possible languages is too large to search exhaustively. A language, i.e. a set of mustRules, is hardcoded in each agent's DNA. Agents reproduce and die similarly to single-celled organisms. Agents whose language is less compatible with the others in the society are eliminated sooner, while more compatible agents survive longer and multiply more often. In other words, we use natural selection to show the evolution of a society without any language into an optimal society with an advanced common language.

For the purpose of the experiment, I introduced a new matchmaker which is in charge of creating agents, penalising them, removing dead agents and organising games among them. The society starts with a group of agents with no mustRule. The matchmaker randomly chooses an agent from the society. Then, if any agent with a similar language exists, the matchmaker randomly picks one of them; otherwise, it randomly picks an agent with a different language. It then lets the two agents play a game. If they succeed, they both age by the normal ageing value. If they fail, both are penalised by ageing faster (the penalised ageing value). All agents have a lifespan: any agent with an age equal to or higher than the lifespan dies and is removed from the society. After playing a specific number of matches, an agent gives birth to a new agent; this parameter is referred to as the reproduction rate. The new agent can have the exact same language (chromosome) as its parent or some mutation of it. A mutation is the addition or removal of one mustRule (gene), and the probability of a mutation is given by the mutation rate. To make the setting similar to a real society, a limit is set on reproduction: reproduction stops when the population reaches the reproduction limit value and restarts when the population is reduced to the restart reproduction value.
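The matchmaker loop can be sketched in Python as follows; the parameter names follow the text, but the gene pool, the cooperation test and the concrete values are my own placeholders rather than the actual experimental setup.

import random

LIFESPAN, NORMAL_AGEING, PENALISED_AGEING = 100, 1, 70
REPRODUCTION_RATE, MUTATION_RATE = 3, 0.01
REPRODUCTION_LIMIT, RESTART_REPRODUCTION = 60, 55
ALL_RULES = [("red", "A"), ("blue", "B"), ("white", "C"), ("yellow", "D")]  # hypothetical genes
ACTIONS = [action for _, action in ALL_RULES]

class Agent:
    def __init__(self, language=frozenset()):
        self.language, self.age, self.matches = language, 0, 0

def cooperate(a, b):
    """One match: both agents act on a random percept; unforced agents guess at random."""
    percept = random.choice(ALL_RULES)[0]
    def act(agent):
        forced = next((m for p, m in agent.language if p == percept), None)
        return forced if forced is not None else random.choice(ACTIONS)
    return act(a) == act(b)

def mutate(language):
    if random.random() < MUTATION_RATE:
        gene = random.choice(ALL_RULES)       # a mutation adds or removes one mustRule
        return language - {gene} if gene in language else language | {gene}
    return language

def step(society, reproducing):
    a = random.choice(society)
    similar = [x for x in society if x is not a and x.language == a.language]
    others = [x for x in society if x is not a]
    b = random.choice(similar or others or [a])
    ageing = NORMAL_AGEING if cooperate(a, b) else PENALISED_AGEING
    for agent in (a, b):
        agent.age += ageing
        agent.matches += 1
    society[:] = [x for x in society if x.age < LIFESPAN]          # death by old age
    for parent in (a, b):                                          # reproduction every few matches
        if reproducing and parent in society and parent.matches % REPRODUCTION_RATE == 0:
            society.append(Agent(mutate(parent.language)))
    if len(society) >= REPRODUCTION_LIMIT:                         # pause reproduction at the limit
        return False
    if len(society) <= RESTART_REPRODUCTION:                       # resume once the population shrinks
        return True
    return reproducing

society, reproducing = [Agent() for _ in range(20)], True
for _ in range(5000):
    if not society:
        break
    reproducing = step(society, reproducing)
print(sorted(set(tuple(sorted(ag.language)) for ag in society)))   # surviving languages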

All of these parameters can be varied but need to satisfy the following constraints. The penalised ageing value needs to be larger than the normal ageing value, and agents should be able to live long enough to reproduce. Some legal values might push the population to extinction with high probability, for example high ageing values or low restart reproduction values. Some combinations of legal values can also slow down reaching an equilibrium; as an example, having penalised ageing close to normal ageing or a low mutation rate will stop the population from changing quickly.

GA is a technique with uncertainty. There is always a chance that all members of the population die and the experiment ends without any suggested solution, or that equilibrium is only reached far in the future. I ran this experiment with different combinations of parameters within the suggested ranges.

For each combination, I ran the experiment 200 times. The combination which reached equilibrium in all 200 runs and was on average the fastest is described below; the discovered optimal combination of parameters is also relatively close to the parameters suggested in [113]. The values used in the experiment are as follows: I set the lifespan to 100 years, penalised ageing to a 70-year reduction, normal ageing after each game to a 1-year reduction, the reproduction limit to 60 agents, the restart reproduction value to 55 agents, the reproduction rate to 3 rounds for each agent and the mutation rate to 1%. As Figure 5.3 indicates, the society started with some agents lacking any language. The common language then slowly shifted towards a more complicated version with more mustRules, and after 700 matches an equilibrium was reached.

In a society with natural selection, the species which can successfully cooperate most of the time remain in the society; other species shrink dramatically in population or even become extinct. To show the effect of natural selection on language evolution, I ran a second experiment with parameters identical to the first except that penalised ageing was set equal to normal ageing. Specifically, I ran experiments with an Extended Cooperative Spies game in which I doubled the number of wires. This extension has the effect that success by chance is unlikely and living without adaptation is much more difficult. As discussed, in the Cooperative Spies game an agent with the maximum number of mustRules is optimal. Optimal agents can fully cooperate with their own kind and partially cooperate with their primitive kind2. I was able to test the effect of natural selection by simply changing the penalised ageing to only −1 in the second experiment, so that agents face the same penalty whether they succeed or fail. This simple change stops the society from evolving into a more cooperative one: as can be seen from Figure 5.4, changes hardly happen in a society without natural selection. However, as all agents in such a society benefit equally, in an extremely long run the society might turn into a scattered population of different kinds of agents. The experiment without natural selection shows that the society hardly changes even after 10,000 total games were played.

2An agent is referred to as a primitive version of another agent if the language of the former is a subset of the language of the latter.


[Plot titled "Language Evolution with Natural Selection": population size per language over the matches. Legend: empty language; Red-A & Blue-C & White-B & Yellow-D; Red-A & White-B & Yellow-D; Red-A & Blue-C & White-B; Red-A & White-B; Red-A; the rest.]

Figure 5.3: Experiment on language evolution with natural selection.

Experiment with full version of ITS

In the following, I describe another experiment, with a similar four-wire Cooperative Spies game, using ITS. The main difference between ITS and its simplified version is the addition of opponent learning: with the help of several self-plays, ITS can find Nash equilibria in a variety of classes of games. The ITS algorithm helps players with an incomplete language to cooperate fully. An incomplete language misses one or more mustRules compared to an optimal language. The experiments show that smarter algorithms, such as ITS, can guess the missing mustRule in such games. For the mentioned game, only three mustRules are enough for a smart player to guess the missing mustRule. Figure 5.5 shows the game tree with three mustRules after a few iterations.


Figure 5.4: No language evolution can be seen in a society without natural selection.

Each box next to a state in Figure 5.5 shows the utility and the probability of the state. The thick coloured lines show the moves preferred by ITS after the iterations. For the full details of the ITS algorithm, I refer to Chapter 3. Since ITS can guess the final mustRule, the majority in the society can have either four or three mustRules. As can be seen in Figure 5.6, the majority swings between two languages: one of size three and one of size four.

5.5 Summary

In this chapter, I have introduced GGP-II, a well-known framework for general artificial intelligence, into the field of language learning. With just a few modifications to GGP-II, I was able to introduce a general language learning algorithm. The GL algorithm allows agents, at least in principle, to generate a common language in any game if there is a need for it. I have also shown that it can solve common problems in language learning such as the Naming Game.

[Figure: the game tree of the four-wire game, annotated with the utility u and probability ρ of each state (values such as u : 100 with ρ : 0.99, ρ : 1.00 or ρ : 0.01); the moves include arm_red, arm_blue, arm_white and arm_yellow for random, tellA–tellD for the Teller, and the four cuts for the Cutter, with terminal payoffs of 100 or 0.]

Figure 5.5: Game tree for the Four Cutting Wire game with three mustRules.


[Plot titled "Evolution of language with complete ITS": population size per language over the matches. Legend: empty language; Red-C & Blue-D & White-B; Red-C & Blue-D & White-B & Yellow-A; the rest.]

Figure 5.6: The evolution of the optimal language with the ITS search algorithm.

Moreover, with the help of planning in GL, and without repeating the match, agents can reach an optimal common language if only one exists. The GL technique is not restricted to ITS as the underlying solver; other algorithms can be used, which makes it a general algorithm. The General Language technique is successful in small games, but as the complexity of the problem increases the technique fails. Consider a simple Multi-Agent Path Finding with Destination Uncertainty problem with only two robots and 12 cells: more than 1.9 × 10^58 worlds can be created for this game. In theory, GL can solve this game, but in practice it cannot generate 1.9 × 10^58 worlds. In the next chapter, I describe the Multi-Agent Path Finding with Destination Uncertainty problem in more detail and introduce General Language Tree Search, which is applicable to larger games.


Chapter 6

GLTS and Its Application to Multi-Agent Path Finding with Destination Uncertainty

As discussed in Chapter 5, all the GGP-II algorithms, including MCITS, are incapable of successfully playing games which require implicit communication. I then introduced the General Language (GL) learning technique to overcome this problem. The GL technique starts with the rules of a game and then generates a bag of different worlds for the game, whereby each world has a unique common "language" among the players. This "language" can simply be described as a collection of conventions of the form: if a player receives some specific percept, then it must perform an action that sends another specific percept to another player. These conventions are referred to as mustRules. We can run a normal GGP-II player on each world with its specific language, and then choose the optimal common language from the world in which the player achieves the highest reward.

The main problem with GL is that it needs to generate all the possible worlds of a game, which makes it impractical to run on large games. As human beings, we do not suffer from such a limitation: we are able to prioritise the messages that we can send to each other. One way of prioritising is based on similarity. For example, if we want to convey the information water but are limited to uttering only a colour, then we would naturally prefer to say blue instead of, say, red, due to the close association of blue with water. Another way to prioritise is to look for the optimal perception token by modifying a good percept that we have already found, in the hope of reaching the optimal one. An example is how infants learn a language.

There are three different stages of language learning among humans: discovering units1, packaging words into meaningful units, and creating art using language [83]. Each happens in a period of life: discovering units is for infants, packaging words into meaningful units is for toddlers, and creating art using language is for preschoolers and onward. Sounds are the raw material, and each combination of them can form a word unit in a language. This creates a tremendous number of possible words, of which only a small portion are meaningful in any given language. Despite this complexity, infants are still able to learn these units easily using the mentioned prioritisation technique, which in the literature is called imitating and babbling [119]. An infant listens to surrounding sounds and learns some ambiguous pronunciations, called babbling. Infants usually have a need such as hunger, thirst or attention. When an infant needs attention, he or she starts to pronounce some of these ambiguous learnings and at the same time watches the parents' response. Infants recognise the degree of emotion in their parents' faces [67]. They then discover which pronunciation makes their parents happier and keep using that word more often. The pronunciation, or babbling, might not be exactly a word but something similar to an actual word, like /kjʊt/. The infant then tries new pronunciations similar to what made the parents happiest, like /kjʊs/. When the infant sees that the parents are less excited, he or she returns to /kjʊt/ and tries to change something else within the pronunciation, like /kju:t/. As he or she sees that the parents are now even more excited, the infant keeps using this word. In the future, the infant might modify the word again, but on seeing less excitement from the parents, the infant goes back to /kju:t/. Using such a prioritisation technique, infants can easily learn the words of a language among all the possible combinations of sounds2.

1Languages are made of smaller units [51]. A single unit can be a simple hand sign used by Homo heidelbergensis [57] or a hieroglyph (character) of the ancient Egyptian writing system [39].

In this chapter, I consider the second prioritisation technique and introduce a tree search technique which prioritises communication languages based on the proximity of percepts, in order to address the inapplicability of the GL technique to large games. I also use the Upper Confidence Bound (UCB) method [52] to keep a balance between exploration and exploitation. I call this technique General Language Tree Search (GLTS); it is an any-time version of the GL algorithm. I then show how GLTS can be applied to the well-known Multi-Agent Path Finding with Destination Uncertainty (MAPF/DU) problem and that, with the help of implicit communication, a MAPF/DU problem can be solved more efficiently. MAPF/DU can model several real-world multi-agent applications in which agents need to move to different destinations without any collision and without central planning; examples are automated cars at intersections [23] and office or warehouse robots [117, 109].

The General Language learning algorithm from Chapter 5 was originally formalised for General Game Playing with Incomplete Information (GGP-II) [104]. I continue to use GGP-II as the framework to extend GL to General Language Tree Search. Being developed in a general framework also allows the algorithm to be tested on other problems, including different variants of MAPF such as MAPF with deadlines (MAPF-DL) [62], combined Target Assignment and Path Finding (TAPF) [75] and the Package Exchange Robot Routing problem (PERR) [61].

The rest of the chapter is organised as follows: I first recall from [4] the Multi-Agent Path Finding with Destination Uncertainty problem. Then, I describe in detail the new General Language Tree Search, followed by an analysis and a report on an experiment performed on a MAPF/DU example with GLTS. The chapter ends with a summary.

2Birds like parrots and mynas learn to imitate human language in a similar way [29, 68].


6.1 Background: Multi-Agent Path Finding with Destination Uncertainty Problem

Many problems require multiple agents to relocate to different destinations without any collision. Real-world examples include automated vehicles at intersections [23], office and warehouse robots [117, 109] or video games [55] in which agents must move collision-free to different destinations. This is known as Multi-Agent Path Finding (MAPF).

Many previous works [55, 42, 47, 90] assumed agents plan centrally and each agent knows the destinations of the other agents. These two assumptions, however, might not always be possible, e.g. the lack of centralised planning with cars at an intersection or the interaction between robots and humans. This is then referred to as MAPF under Destination Uncertainty (MAPF/DU) [26]. The only known general solution to MAPF/DU problems has PSPACE-complete complexity [4].

Figure 6.1 shows an example of a MAPF/DU problem as a game, adapted from [4]. I call it the Two Robots game. The game consists of two robots and the random player. The robots are placed on a loop-like board. In this sequential game, the robots need to move to their prescribed destination cells. At the start of the game, both agents can fully see the map. In the depicted example, both players see that the Circle robot is required to go to either cell (2,0) or (2,2), while the Square robot needs to move to either cell (3,0) or (3,2). In the first step, the random player chooses which of the two possible cells for each player becomes its actual destination. Players are only informed about their own cell and not the other player's. The crux of the game is that when both chosen cells are on the same side of the board, one agent needs to take the longer route if both want to reach their cells without collision. In order to replicate the real world and to value efficiency, I add penalties for idling and moving: each move costs both players 2 points and staying costs them 1 point; staying at the goal location, however, results in no penalty. I illustrate how the new technique, General Language Tree Search, can be applied to this type of problem to find acceptable solutions for MAPF/DU problems more efficiently.



Figure 6.1: The Two Robots game, a Multi-Agent Path Finding with Destination Uncertainty problem. Each robot must go to its target cell, marked by a solid square or a solid circle. Each robot can only see its own target as solid and cannot distinguish between the other robot's solid and hollow marks.

6.2 General Language Tree Search

As described in Chapter 5, the original General Language learning technique first generates all the possible worlds and then uses brute-force search to find the world with the highest reward value. Unfortunately, it is practically impossible to generate all the possible worlds for large games. Even for the simple game of Figure 6.1, there are more than 1.9 × 10^58 possible worlds3. For this reason, we need a better search technique to make General Language Evolution applicable to large games. In this section, I introduce a novel technique called General Language Tree Search (GLTS). It allows the search to focus on more promising worlds by looking for worlds which are similar to already successful ones, and it keeps a balance between exploration and exploitation using UCB [52]. The main idea of GLTS is to find the optimal mustRules by generating a tree of worlds. The nodes of the tree are the worlds of the game, and each node is assigned two values, depth1 and depth2 (abbreviated d1 and d2).

3Please check Appendix C for the calculations.


[Diagram: the original game tree (the initial world); the world tree (all singleton worlds); a composite world; the composite worlds set (all composite worlds).]

Figure 6.2: General Language Tree Search at early iterations for the Cooperative Spies game. The composite world is not part of the world tree directly.

The depth1 value holds the depth at which the first percept is received by the first player, and depth2 holds the depth at which the second percept, triggered by the first player, is received by the second player.
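One possible in-memory representation of such a world node is sketched below; the field names follow Algorithm 5, but the Python structure itself is only an illustrative assumption, not the thesis implementation.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class WorldNode:
    """One node of the GLTS world tree: a game description plus search bookkeeping."""
    gdl: str                                            # game rules with this world's mustRules applied
    must_rules: List[Tuple[str, str]] = field(default_factory=list)
    depth1: int = 1      # depth at which the triggering percept is received
    depth2: int = 1      # depth at which the forwarded percept is received
    Q: float = 0.0       # best QS/V ratio on the path through this node
    QS: float = 0.0      # sum of the values returned by running this world
    V: int = 0           # number of times this world has been run
    children: List["WorldNode"] = field(default_factory=list)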

Both depths of the starting node have the value 1. This means the starting node of the tree is the game with the full original rules and no mustRule. Children are created by increasing or decreasing the depths of the parent and creating new worlds out of the new depths. The tree is expanded through the most promising world. At the end of each expansion, or when a terminal world is reached, the two most promising worlds are combined to generate a new composite world. Figure 6.2 shows the first part of a GLTS search tree on the Cooperative Spies game. As can be seen, the composite world has all the mustRules of its parents.

For clarity of explanation, I introduce a new variation of the Cooperative Spies game, which I call the Extended Cooperative Spies game; Figure 6.3 shows this game in extensive form. In this variation, random sends different percepts to the Teller at two different steps of the game. Similarly, the Teller can send different percepts to the Cutter at two different steps of the game. We refer to each step as a depth. The initial node of the game tree is at depth 0, and the depth numbers are shown on the right side of the figure. For ease of explanation, I give a different number to each state in Figure 6.3. Also, percept names are identical to the moves which trigger them: if a player receives a percept, it is identical to the name of the move shown in the graph. For example, when the game is at state 2 and random chooses the move arm_blue, the Teller receives arm_blue as a percept. As in the original Cooperative Spies, in this variation the random player's moves only trigger percepts for the Teller and the Teller's moves trigger percepts for the Cutter.

Algorithm 5 describes this technique. It starts by generating the initial world (line 2). The ApplyMustRule procedure takes the game description and a set of mustRules and generates a world according to section 5.2. The ApplyMustRule procedure (lines 18-25) first copies the complete GDL of the game (line 21) and then calls the InitialiseWorld procedure to initialise the world (line 22). It then limits the player to playing one set of moves at a set of states4 (line 23). Finally, the procedure returns the modified new world.

The InitialiseWorld procedure (lines 27-33) takes a world as input. It runs it with the given GGP-II algorithm, in my case UCT, and assigns the returned value to two variables of the world, Q(w) and QS(w) (lines 28-30). The QS variable holds the total sum of the results returned by running the world with the provided GGP-II algorithm, and the V variable holds the number of times we have run the world. The Q variable holds the maximum QS/V ratio among a world node and its descendants.

The next step is to set the initial world's depth1 and depth2 (lines 3-4). These variables store the depth at which a percept is received and the depth at which a player chooses a specific move (or moves) to send its own percept. They are initialised to 1, since by convention in GGP-II percepts are received starting from depth 1 [86] and there is no previous knowledge to share at depth 0. In Figure 6.3, states 1 and 2 are at depth 1 of the game. As an example, for the game in Figure 6.3, a possible mustRule of arm_blue – tellF has depth1 = 1 and depth2 = 3. The next step is to add the initial world to the set of all worlds (line 5).

4It will be a set of states if the player is in an information set.

[Figure: the extensive-form tree of the Extended Cooperative Spies game. random's moves (arm_red, arm_blue, arm_up, arm_down) trigger percepts for the Teller; the Teller's messages (tellA and tellB at one step, tellC–tellF at a later step) trigger percepts for the Cutter; the Cutter finally plays cut_r or cut_b for a payoff of 100 or 0. The states are numbered and depths 0–4 are marked on the right.]

Figure 6.3: The Extended Cooperative Spies game.


Algorithm 5 General Language Tree Search
 1: procedure GeneralLanguageTreeSearch(GDL, UCT)
 2:     initialWorld ← ApplyMustRule(GDL, {})           ▷ The initial world has no mustRule
 3:     depth1(initialWorld) ← 1                        ▷ In GGP the initial percepts are received at depth 1
 4:     depth2(initialWorld) ← 1
 5:     allWorlds ← {initialWorld}
 6:     while within_computational_time do
 7:         w ← Select(initialWorld)
 8:         BackPropagate(−∞, w)
 9:         max1 ← argmax_{w ∈ allWorlds} QS(w)/V(w)  and  max2 ← argmax_{w' ∈ allWorlds \ max1} QS(w')/V(w')
10:         newCompositeWorld ← GenerateCombinedWorlds(max1, max2)
11:         compositeWorlds ← {newCompositeWorld} ∪ compositeWorlds
12:         allWorlds ← {newCompositeWorld} ∪ allWorlds
13:         RunCompositeWorlds(compositeWorlds)
14:     end while
15:     return argmax_{w ∈ allWorlds} QS(w)/V(w)
16: end procedure
17:
18: procedure ApplyMustRule(GDL, mustRules)
19:     ▷ This is similar to adding must() to the GDL as described in section 5.2
20:     ▷ M(s) is derived from the GDL according to section 2.1.2
21:     theNewWorld ← GDL
22:     theNewWorld ← InitialiseWorld(theNewWorld)
23:     theNewWorld.M(s) ← {m | (s, m) ∈ mustRules}     ▷ It can be a set of moves
24:     return theNewWorld
25: end procedure
26:
27: procedure InitialiseWorld(w)
28:     value ← run(w, UCT)
29:     Q(w) ← value                                    ▷ The highest exploitation term in the path
30:     QS(w) ← value                                   ▷ Exploitation term of the single node
31:     V(w) ← 1                                        ▷ Number of visits
32:     return w
33: end procedure


34: procedure Select(world)
35:     if world is_terminal then
36:         return world
37:     else if world is_not_fully_expanded then
38:         return BestChild(GenerateChildren(world))
39:     else
40:         return Select(BestChild(childrenOf(world)))
41:     end if
42: end procedure
43:
44: procedure BestChild(wSet)
45:     chosenWorld ← argmax_{w ∈ wSet} ( Q(w) + c · sqrt( 2 ln N(w) / Σ_{w' ∈ wSet} N(w') ) )
46:     QS(chosenWorld) ← QS(chosenWorld) + run(chosenWorld, UCT)
47:     V(chosenWorld) ← V(chosenWorld) + 1
48:     return chosenWorld
49: end procedure
50:
51: procedure GenerateChildren(parentWorld)
52:     d1 ← depth1(parentWorld)
53:     d2 ← depth2(parentWorld)
54:     childrenSet ← ∅
55:     possibleDepths ← {(d1 + 1, d2), (d1 − 1, d2), (d1, d2 + 1), (d1, d2 − 1)}
56:     for relD ∈ possibleDepths do
57:         newWorlds ← GenerateNewWorlds(relD)
58:         if newWorlds ⊄ generatedWorldSet and AreDepthsLegal(relD) then
59:             childrenSet ← childrenSet ∪ newWorlds
60:         end if
61:     end for
62:     childrenOf(parentWorld) ← childrenSet
63:     return childrenSet
64: end procedure
65:
66: procedure AreDepthsLegal(d1, d2)
67:     if d1 < 1 or d1 > d2 then
68:         return False
69:     else
70:         return True
71:     end if
72: end procedure


73: procedure GenerateNewWorlds(d1, d2)
74:     allStartingStates ← getAllStatesAt(d1)          ▷ According to the GDL
75:     for all pl1 ∈ AllPlayers \ {random} do
76:         if |Σ(allStartingStates, pl1)| > 1 then
77:             for all per1 ∈ Σ(allStartingStates, pl1) do
78:                 startingStates ← {s | Σ(s, pl1) = per1 and s ∈ allStartingStates}
79:                 perceptionStates ← {s | round(s) = d2}
80:                 for all pl2 ∈ {p | p ∈ R and p ≠ random and p ≠ pl1} do
81:                     if |Σ(perceptionStates, pl2)| > 1 then
82:                         for all per2 ∈ Σ(perceptionStates, pl2) do
83:                             mustRules ← ∅
84:                             for all s' ∈ Σ'(per2) do
85:                                 mustRules ← {(s, m) | do(s, m) = s'}
86:                             end for
87:                             w ← ApplyMustRule(GDL, mustRules)
88:                             newWorldSets ← {w} ∪ newWorldSets
89:                         end for
90:                     end if
91:                 end for
92:             end for
93:         end if
94:     end for
95:     return newWorldSets
96: end procedure
97:
98: procedure BackPropagate(maxValue, world)
99:     maxValue ← max(maxValue, QS(world)/V(world))
100:    Q(world) ← maxValue
101:    if world ≠ initialWorld then
102:        BackPropagate(maxValue, parentOf(world))
103:    end if
104: end procedure
105:
106: procedure GenerateCombinedWorlds(world1, world2)
107:     combinedMustRules ← getMustRule(world1) ∪ getMustRule(world2)
108:     return ApplyMustRule(GDL, combinedMustRules)
109: end procedure
110:
111: procedure RunCompositeWorlds(compositeWorlds)
112:     for all w ∈ compositeWorlds do
113:         V(w) += 1
114:         QS(w) += run(w, UCT)
115:     end for
116: end procedure


At this stage, we start populating the tree to find the optimal world. The tree is populated toward the more promising world nodes. This process continues until the computational time is over (line 6); the algorithm then returns the optimal world. The first step of the loop is to call the Select procedure (line 7).

The Select procedure (lines 34-42) considers three different kinds of worlds. The first kind is a terminal world. A world is terminal if it cannot generate a new world, meaning all of its children have already been generated and assigned to other worlds.5 If the given world is terminal, the procedure simply returns it. The second kind is an unexpanded world, meaning its children have not been generated yet. In this case, the procedure calls GenerateChildren and then BestChild on it (line 38). Finally, if the children have already been generated, it simply calls BestChild to choose the best child and then calls Select on that child (line 40).

As mentioned earlier, in this algorithm there are two variables which hold the value of each world: Q and QS. Each time a child is selected in the BestChild procedure, it is run (line 46) and the returned value is added to its QS. At the same time, the visit counter V of the world is incremented by one (line 47). The Q variable is similar to QS, but whereas QS holds the accumulated value of the world itself, Q holds the maximum average value QS/V among the world and its descendants. QS is updated inside the BestChild procedure, while Q is updated in the BackPropagate procedure. The BestChild procedure uses the UCB1 policy [2] (line 45) to decide which child to expand next.
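To make the bookkeeping concrete, the following is a minimal Python sketch of the BestChild step. The WorldStats class, the run_uct callback and the constant c are illustrative assumptions rather than the thesis implementation, and the exploration term follows the textbook UCB1 rule from [2]; a faithful implementation should use the exact term of line 45 of Algorithm 5.

```python
import math
from dataclasses import dataclass

@dataclass
class WorldStats:
    Q: float   # best average value seen on the path below this world
    QS: float  # accumulated self-value of this world
    V: int     # visit count

def best_child(w_set, run_uct, c=1.0):
    """UCB1-style selection [2] plus the QS/V bookkeeping of lines 46-47."""
    total = sum(w.V for w in w_set)
    chosen = max(w_set, key=lambda w: w.Q + c * math.sqrt(2 * math.log(total) / w.V))
    chosen.QS += run_uct(chosen)  # one evaluation by the underlying UCT player
    chosen.V += 1
    return chosen
```

The Q value used for exploitation here is the one maintained by back-propagation, described below.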

As described earlier, in the Select procedure, if a world is unexpanded then its children need to be generated (line 38). The GenerateChildren procedure generates the children of the provided world (lines 51-64). First, the depths of the parent world are stored. There are four possibilities for the depths of the children: each child has a depth pair that differs from the parent's by one in exactly one of the two depths (line 55). Figure 6.4 shows the four depth combinations of the children of a parent world whose mustRule has d1 = 2 and d2 = 7.

5For an example of how a world becomes terminal, please see Appendix D.



Figure 6.4: The four depth combinations of the children of a parent world whose mustRule has depth1 = 2 and depth2 = 7.

The GenerateNewWorlds procedure generates new worlds with the given depths (line 57). A new world is added as a child only if it has not already been generated in a different part of the world tree and its depths are legal (line 58). The legality of the depths is checked by the AreDepthsLegal procedure (lines 66-72), which rejects any combination of depths whose depth1 is less than 1 or larger than depth2 (line 67). The GenerateNewWorlds procedure (lines 73-96) takes two depths, d1 and d2, and generates all the possible singular worlds at those depths. Recall that in a singular world every element of its set of mustRules involves only one received and one sent percept.
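A minimal sketch of the candidate depths and the legality check, with names of my own choosing rather than the thesis code:

```python
def candidate_depths(d1, d2):
    """The four depth combinations a child world can take (line 55)."""
    return [(d1 + 1, d2), (d1 - 1, d2), (d1, d2 + 1), (d1, d2 - 1)]

def are_depths_legal(d1, d2):
    """AreDepthsLegal (lines 66-72): depth1 must be at least 1 and no larger than depth2."""
    return 1 <= d1 <= d2

# For the parent world of Figure 6.4, with depths (2, 7), all four candidates
# (3, 7), (1, 7), (2, 8) and (2, 6) pass the legality check.
legal_children = [d for d in candidate_depths(2, 7) if are_depths_legal(*d)]
```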

The getAllStatesAt function returns all the states at the given depth, d1 (line 74). As an example, in the Extended Cooperative Spies game in Figure 6.3, states 1 and 2 are at depth 1. The depth d1 represents the depth at which the first percept is received by a player. This is the percept that one player wants to share with the other. Unlike moves in extensive form, all players receive a percept at every depth, even though it might be only a noop. So the algorithm looks for all the percepts of all players in all the states at level d1 (lines 75-79). The algorithm filters out any percept which is not worth sharing: if a player only receives one percept across allStartingStates, then the player receives no information. As an example, consider the Extended Cooperative Spies game in Figure 6.3 again. For states 1 and 2, the only percept that Cutter receives is noop, which means he has no information about the past moves to share with Teller.

In the next step, the algorithm picks another player and one of that player's percepts.

The state at which this percept is received lies at depth d2 + 1 (line 78). The depth d2 is the depth of the move which is about to be played and which triggers sending the percept.

As mentioned earlier, a mustRule can be described as: if percept 'a' has been seen, then send percept 'x'. It can equivalently be represented as: at states S, play moves M, where S are the states at which percept a is received and M are the moves that send percept x. At the end of the GenerateNewWorlds procedure, worlds are created by applying the mustRules, and each new world is added to newWorldSets (lines 87-88).
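To make this correspondence concrete, a mustRule can be compiled into (state, move) pairs roughly as follows. The helpers states_receiving and moves_sending are hypothetical stand-ins for the GDL reasoning of lines 84-85 of Algorithm 5, not functions defined in this thesis.

```python
def must_rules_for(percept_a, percept_x, game):
    """Translate 'if percept a was seen, send percept x' into the equivalent
    set of (state, move) pairs described above."""
    rules = set()
    for s in game.states_receiving(percept_a):      # the states S where a is received
        for m in game.moves_sending(s, percept_x):  # the moves M at s that trigger x
            rules.add((s, m))
    return rules
```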

After the Select procedure ends and returns a terminal world or the best newly generated child (line 7), the GLTS algorithm calls the BackPropagate procedure on the returned world (line 8). The BackPropagate procedure (lines 98-104) updates the Q value of each world recursively, starting from the leaf world returned by the Select procedure. Starting from this leaf, the procedure sets the Q value of each world on the path to its own average self-value QS/V or the maximum among its descendants on that path, whichever is higher (line 99). Figure 6.5 shows an example world tree; the values of the variables QS, Q and V are shown in a box next to each world node, and each world is given a number next to its box. Different colours are used to show the path along which a Q value is passed from a child node to its parent. It is important to note that Q is derived from the average value QS/V, not from the raw QS. The colour between nodes 12 and 10 is different from that of their ancestors: the Q value of node 8 is better than the Q value of node 12, so it is node 8's Q value that is propagated along the rest of the path.
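The recursion itself is short. Below is a minimal sketch, assuming each world object exposes QS, V, Q and a parent reference (the attribute names are mine, not the thesis implementation):

```python
def back_propagate(world, max_value=float("-inf")):
    """Lines 98-104: push the best average self-value QS/V up toward the root."""
    max_value = max(max_value, world.QS / world.V)
    world.Q = max_value
    if world.parent is not None:          # the initial world has no parent
        back_propagate(world.parent, max_value)
```

Mirroring line 8 of Algorithm 5, the call back_propagate(selected_world) starts the propagation with negative infinity as the initial maximum.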

Up to this stage, the GLTS algorithm looks for the optimal singleton world. The next step is to generate composite worlds (lines 9-13).



Figure 6.5: A world tree example. Each node represents a world. The values of the QS, Q and V variables are shown in a box next to each node. The colours of the edges show the paths that have affected the Q values.

A composite world is generated by merging the mustRules of two worlds, either composite or singleton. The algorithm first finds the top two worlds based on their QS/V values. It then combines them using the GenerateCombinedWorlds procedure (line 10) and adds the result to the compositeWorlds set. The GenerateCombinedWorlds procedure (lines 106-109) takes all the mustRules from the two worlds and combines them (line 107). Then, it uses the ApplyMustRule procedure to apply them to the game, GDL, in order to generate a combined world. When the algorithm generates a new composite world out of the current best worlds, it adds it to the set of compositeWorlds. It then calls the RunCompositeWorlds procedure (line 13) to update QS and V for all the composite worlds. The reason for updating all the composite worlds is that composite worlds are not part of the world tree and would never be updated through selection and back-propagation. When the provided computational time is over, the algorithm returns the world from the set allWorlds which has the highest value of QS/V. The allWorlds variable holds all the composite and singular worlds.
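A minimal sketch of this composite-world step is given below. The attributes must_rules, QS and V, and the callbacks apply_must_rules and run_uct, are illustrative assumptions standing in for the ApplyMustRule procedure and the underlying UCT player.

```python
def expand_composites(all_worlds, composite_worlds, apply_must_rules, run_uct):
    """Lines 9-13 of Algorithm 5: merge the two best worlds and refresh all composites."""
    ranked = sorted(all_worlds, key=lambda w: w.QS / w.V, reverse=True)
    best, second = ranked[0], ranked[1]
    combined = apply_must_rules(best.must_rules | second.must_rules)
    composite_worlds.append(combined)
    all_worlds.append(combined)
    for w in composite_worlds:      # composite worlds sit outside the world tree,
        w.QS += run_uct(w)          # so they are re-run here instead of being
        w.V += 1                    # updated by selection and back-propagation
    return combined
```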

6.3 Analysis

Previous GGP-II algorithms can be categorised into two groups: determinisation techniques and incomplete-information reasoners, as described in Chapter 2. The determinisation techniques pick different complete-information samples of an incomplete-information state and play each sample as if it were a complete-information game; they then choose the best average move. Some examples are the HyperPlayer [86], NEXUSBAUM [25], the TIIGER player [70], the Shodan player [12] and the Norns algorithms [34]. Incomplete-information reasoners, on the other hand, treat the game as a whole incomplete-information problem. They are resource-heavy but can return better results than determinisation techniques when given enough time. Some examples are HyperPlayer-II [87] and ITS from Chapter 3.

In the following, I consider one GGP-II algorithm from each group: NEXUSBAUM, specifically the UCT version with no game-specific pruning, from the first group, and ITS from the group of incomplete-information reasoners, to compare them with my new algorithm. To show the success of the GLTS algorithm, I use the Extended Cooperative Spies game, introduced in Section 6.2 and shown in Figure 6.3, and the Multi-Agent Path Finding with Destination Uncertainty problem illustrated in Figure 6.1, which is borrowed from [4] and modified to have a single optimal world.

6.3.1 Example: Extended Cooperative Spies Game

I introduced the Extended Cooperative Spies game by extending the original Cooperative Spies game [87] in order to show the ability of the GLTS algorithm to generate a communication language for large problems. There are eight unique perception tokens in this game: arm_red, arm_blue, arm_up, arm_down, tellA, tellB, tellC, tellD, tellE and tellF, with similarly named moves which trigger them. So there are 35 different singleton mustRules, but only 16 legal ones according to Algorithm 5. The optimal language, which results in the maximum reward of 100, is a composite world consisting of 4 singular mustRules. There are four different information sets, each containing four different states, in the Extended Cooperative Spies game. Teller decides which information set Cutter finds himself in, and random's actions place him in one of the four states of that information set. If the game is played by either ITS or NEXUSBAUM without GL or GLTS, on average both players will receive 50 as their reward; with the help of GL or GLTS this value becomes 100. There are different optimal combinations of mustRules. To see which will be chosen by GLTS, we first need to see which singular worlds are generated, and for that we need to follow Algorithm 5 step by step. The initial world, the untouched game, has depths (1,1) (lines 3-4). No mustRule can be generated at these depths. To expand the tree, the algorithm then generates children according to the first node's depths. The possible children depths are (0,1), (1,2), (1,0) and (2,1), and only (1,2) is legal among them.



Figure 6.6: An abstract world tree generated by GLTS in the Extended Cooperative Spies game. Each node represents the set of worlds with the given depths. A dummy node has no world associated with it; dummy nodes only keep the depth values, which are later used to generate new nodes.

However, this depth set generates no mustRule in this game, so no world is associated with it. As a result, the algorithm generates a dummy node to keep the depths. Next, children are generated according to the dummy node's depths; only (1,3) and (2,2) are legal. Figure 6.6 shows the generated world tree for this game. Worlds are generated with mustRules at depths (1,3) and (2,2). So the first singular worlds are worlds with mustRules such as arm_red - tellC and arm_up - tellB, and not ones such as arm_blue - tellB.

The algorithm then combines the two best singular worlds to generate a new composite world. After a few iterations, an optimal composite world with four mustRules is generated.

This example shows how GLTS searches a game tree to generate a world tree by searching different depths for the received and sent percepts. It illustrates the applicability of GLTS to searching different depths in order to discover the optimal mustRules in a game.



Figure 6.7: The paths both robots take to reach their goals if they use either the NEXUSBAUM or the ITS algorithm in the Two Robots game.

6.3.2 Example: Two Robots Game

For the Two Robots game in Figure 6.1, both GGP-II algorithms, ITS and NEXUSBAUM, fail to correctly navigate both robots toward their target cells when both have to go to the same side. Players in both groups consider the best average move when dealing with the uncertainty about the location of their ally's target cell. On average, if a robot does not know the ally's target cell then it is better to move toward its own target. This, however, is not an optimal move in this game. To also show this experimentally, I ran both the NEXUSBAUM and ITS algorithms on the game without language learning. They both failed to successfully navigate the robots when their targets are on the same side. Both robots rushed toward their targets and, at the end, received a reward of 0 for not reaching them. Figure 6.7 shows the paths these two algorithms make the robots take; the robots never reach their targets. When they are at these positions, they keep moving one cell back and forth, trying to give way to each other, and never succeed.6 The optimal strategy in this game is for one player to change its direction according to the target of its ally.

6This behaviour can also be seen among humans and is called Taarof. Taarof is a Persian behaviour which is a form of civility or art of etiquette [65]. An example of such behaviour happens when two people need to enter a room at the same time: both ask the other to enter the room first. This process can last minutes.



(a) When Circle goes toward its target and Square goes around, they need to take 11 steps to reach their targets. (b) When Square goes toward its target and Circle goes around, they need to take 15 steps to reach their targets.

Figure 6.8: Number of steps when Square goes clockwise versus when Square goes anti-clockwise in the Two Robots game.

In this example, the player which needs to change direction based on the target of its ally, for the optimal joint strategy, is the Square robot. If their targets are on the same side then Square needs to change its direction and move around to reach its target cell, but if the targets are in different columns then both can simply move toward their goals. To show why it is the Square robot, not the Circle robot, that needs to change its direction, let us calculate the total penalty for when Circle changes direction and for when Square changes direction. Suppose that in the game depicted in Figure 6.1, the dark circle and the dark square are the actual target cells, and Square moves first. If Circle changes direction and instead moves around, then a total of 9 steps is required, and Square moving toward its target requires an additional 4 steps; this results in a total of 13 steps. Following the same reasoning for Square changing direction, we only need a total of 11 steps. Therefore, it is a better strategy for Square to change direction upon learning about its ally's target. These calculations are shown in Figure 6.8.

Learning can be done through the use of implicit communication, or what I call "communication language learning".


Knowing the Circle's target cell dramatically reduces the complexity of the problem for both players. As both can see each other's moves, each move can be considered a message. Since both staying and moving cost them points, the optimal mustRule must be the one where an informative percept is triggered as early as possible. At the start of the game, players have three moves to choose from, so there are three percepts to be mapped to two targets. In other words, the two targets correspond to the first two percepts sent to the Circle player by the random player, and the three percepts are those sent by either of the robots to each other. In this game, the optimal world is the one with two mustRules, one for each target. For each mustRule we can choose among three moves: stay, move toward the target and move away from the target. Staying costs them 1 extra point and moving away costs them 4 extra points compared to what is calculated above. The optimal mustRules are therefore the ones that have Circle move toward its target. Using the original General Language learning technique from Chapter 5, which uses brute force, may work in theory but will fail in practice, as percepts are sent between the players on each move and hence there is an exponential number of worlds to consider. With the help of the General Language Tree Search technique, the new algorithm should be able to find the optimal strategy in a reasonable time. If both players are rational and try to communicate, they both reach the same common world. In the following section, I provide the results of my experiments.

6.3.3 Applicability and the Bias of GLTS

Each of the mentioned examples demonstrates a unique feature. The Extended Cooperative Spies game has both kinds of percepts, received and sent, occurring at different depths. This illustrates the ability of GLTS to generate a communication language across different depths. The Two Robots game has its first perception token at depth one and its second perception token up to 50 depths deep in the game. Discovering a perfect perception token within such a large game is the practical advantage of GLTS over the original GL.

In theory, both GL and GLTS can be beneficial in any game with a set of cooperative agents in which the agents have asymmetrical knowledge at some states of the game and the agent with more knowledge can trigger different percepts for the others later in the game. Obviously, mustRules limit an agent to choosing a specific move in some parts of the game, but this might benefit the agents more overall by reducing the size of the information sets of the others in later parts of the game. Unlike GL, the GLTS algorithm is an anytime algorithm, meaning that, in principle, it can be applied to problems of any size; the quality of the language merely decreases with problem size and increases with the time given. The main bias of the GLTS algorithm is that it might focus on successful mustRules within early depths. It mainly tries to keep one perception token of a relatively successful mustRule fixed and change the other perception token to find a better mustRule. The balance between exploration and exploitation reduces this impact in theory [52], but it might take some time when a relatively successful mustRule can be generated early in the game while the optimal token occurs later in the game.

6.4 Experimental Results

The main advantage of GLTS over the original GL is in large games like MAPF/DU. Here I demonstrate an experiment to show the success of GLTS. I ran the algorithm on the described MAPF/DU example. The previous technique, General Language Evolution, clearly failed to solve the problem due to its complexity. The GLTS algorithm avoids generating all the possible worlds. It also avoids considering mustRules if it is certain of their failure (lines 66-72, 76 and 81 in Algorithm 5). Other mustRules that are ignored are the ones trying to share information with the random player (lines 75 and 79) or where both sender and receiver are the same player (line 79). This way it prunes the tree search early and saves computational time and resources. Below I show the results of running the algorithm for the first 200 iterations. During these iterations, almost all early worlds are generated due to the balance between exploration and exploitation. The average value of each world node, QS/V, at each iteration is stored. Figure 6.9 shows the values for three world nodes with a single mustRule, and Figure 6.10 shows the values for three composite world nodes.



Figure 6.9: Average self-value QS/V for three singleton worlds.


Circle (2,0) - left & Circle (2,2) - right [Full-optimal]

Circle (2,0) - right & Circle (2,2) - left [Semi-optimal]

Square (3,0) - right & Square (3,2) - left [Least-optimal]

Figure 6.10: Average self-value QS/V for three composite worlds.


The three different singular worlds for which I present the results are: first, the optimal world, in which Circle must go left if the target is located at cell (2,0); second, a similar world, but in which Circle has to move right if its target is located at cell (2,2); and third, the actual game, that is, the world without any mustRules. The UCT player was able to solve the world with no mustRule if and only if the targets were on different sides. Running UCT on the two worlds with mustRules returned a reward of approximately 65, which is 25 points more than in the world without a mustRule. The difference between the top two worlds is due to a move in the wrong direction: when Circle moves to the right to send the signal, it needs to come back left again, which costs an extra 4 points on average.

Figure 6.10 shows the evolution of the returned values, QS/V, for three composite worlds. The three chosen worlds are called the full-optimal, semi-optimal and least-optimal worlds. The full-optimal world has the two optimal mustRules mentioned in Section 6.3. The semi-optimal world has similar mustRules but with the wrong directions. The least-optimal world has mustRules for the wrong player with the wrong moves. As can be seen, the optimal composite world has the highest reward value of 90. The GLTS algorithm will return the optimal combined world since it has the highest average self-value, QS/V. It will then play the actual game according to its mustRules, expecting the ally to play accordingly.

As mentioned earlier, in principle GLTS can be applied to problems of any size because it is an anytime algorithm, meaning the quality of the solution decreases with size and increases with the time given. The GLTS algorithm is well suited to MAPF problems, since the uncertainty is usually created early in the game by random and the players can trigger different percepts later in the game to help each other reduce that uncertainty. MAPF problems are also asymmetrical, so GLTS does not require any additional language learning technique for the agents to reach a common communication language. They are asymmetrical because each move which triggers a percept can cost or benefit the players differently in different circumstances. So there is always a mustRule which benefits the players the most and costs them the least, and GLTS can find this optimal mustRule given enough time.

6.5 Summary

In this chapter, I have introduced an efficient searching algorithm for the General Language Evolution technique in General Game Playing, called General Language Tree Search (GLTS). The new technique is able to search complex games such as Multi-Agent Path Finding with Destination Uncertainty (MAPF/DU). In MAPF/DU games, GLTS allows agents to use their moves to share knowledge with others: moves are visible to the players and can thus be used as an implicit means of communication. The main challenge is to find a correct convention between moves and the knowledge, which I call and encode as mustRules. The GLTS algorithm is able to find the optimal communication convention, the mustRules, by combining the original GLE technique with UCB tree search and early pruning. The analysis and experimental results show the success of my algorithm in large games in which agents need to share knowledge while there is no explicit rule for how to do it. The MAPF/DU example was borrowed from [4]. Implicit communication can also happen among rational human beings and/or robots. When humans know the rules of the game, they can reason. Assuming, for example, a human to be the Square agent and an actual robot to be the Circle agent, the human can calculate that the optimal policy is for Circle to always move toward its goal. So, for example, if the human sees the Circle robot move left, then he can conclude that the location of Circle's target is on the left side and can move accordingly.


Chapter 7

Conclusion

This thesis contributes to the field of General Game Playing with Incomplete Information (GGP-II). GGP-II is concerned with developing a general intelligence that, in principle, learns to solve any multi-agent incomplete information problem without being specifically developed for the problem [37]. In other words, a GGP-II player system should be developed only once and be applicable to multiple problems. One of the most recent works in the same area is the thesis by Michael Schofield, "Playing Imperfect-Information Games in General Game Playing by Maintaining Information Set Samples (HyperPlay)" [85]. In my thesis, I considered the HyperPlay technique as the benchmark to evaluate the success of my techniques. The following are the contributions and the future work of this thesis.

7.1 Contributions

The contributions of this thesis can be categorised into two groups: competitive players (Chapters 3 and 4) and communication language generators (Chapters 5 and 6).

7.1.1 Competitive Players

Chapter 3 introduced the novel Iterative Tree Search (ITS) algorithm. ITS was theoretically and experimentally analysed on different categories of games.

These categories were introduced at the Australasian Joint Conference on Artificial Intelligence 2012 [87]. All of the GGP-II approaches of the time failed to find the optimal strategy in those games [85]. The HyperPlayer with Incomplete Information [87] was later introduced, claiming that it can correctly find the optimal strategies in these games. In Chapter 3, I have shown that HP-II actually fails to find the optimal strategies for three of these games: Banker and Thief, a slightly harder version of the Cutting Wire game, and BattleShip in the Fog. I have also shown, both theoretically and experimentally, that ITS is capable of finding the optimal strategies in almost all of these games. The only game that HP-II claimed to play optimally while ITS failed was the BattleShip in the Fog game. As previously explained in Section 4.1, ITS fails to play this game due to its high branching factor. I have also shown that all the variants of CounterFactual Regret minimization algorithms will fail to play this game optimally. In Chapter 4, I introduced Monte Carlo Iterative Tree Search (MCITS), an online extension of the ITS algorithm combined with Monte Carlo Tree Search. MCITS focuses the search on the section of the game tree which it finds to be more promising. The main goal of Chapter 4 was to show that MCITS can successfully handle large games such as BattleShip in the Fog. I have proved that the "optimal strategy" discovered by HP-II is in fact not the optimal strategy for this game; MCITS discovered a better strategy which can defeat HP-II. To my knowledge, no other algorithm in GGP-II, not even the CFR algorithms, is able to find the optimal strategy for this game.

7.1.2 Communication Language Generators

One limitation of all current GGP-II approaches is their inability to share information through implicit communication [87]. In Chapter 5, I introduced the General Language (GL) learning technique, a general approach for agents to agree on a common language in order to share their information. I introduced the concept of the mustRule as the building block of a simple communication language. The communication language is defined to be a set of mustRules for games described in the Game Description Language.

I have shown that in some examples which require implicit communication, cooperative agents can achieve a higher reward by using the GL technique. The GL technique employs other GGP-II approaches to solve the implicit communication problem. I have also shown in an experiment that using a strong GGP-II approach helps agents to fully communicate with a simpler common language, as they can infer the missing parts of the language. In Chapter 6, I extended this idea to be applicable to large games. The extended version, which I call General Language Tree Search (GLTS), can solve this problem without the need to generate the whole bag of worlds. I used a game of Multi-Agent Path Finding with Destination Uncertainty (MAPF/DU) as an example to demonstrate this technique. MAPF/DU games are relatively complex: even for a simple two-player problem on a board with 12 cells, we have about 10^58 possible worlds to search. This makes it impractical for the original GL technique to solve such a problem. I also showed that, with the help of implicit communication, MAPF/DU problems can be solved more efficiently. Similar to the original GL, the GLTS technique employs an underlying GGP-II approach. GLTS can successfully play cooperative games which require implicit communication, as well as any other problem solvable by the underlying GGP-II approach. This makes it possible for a competitive problem to also be solved by the GLTS technique; as a result, GLTS is a truly general approach. However, there is an overhead in playing a competitive game with GLTS, caused by GLTS trying to solve different worlds of a game in order to find an implicit communication language which does not exist.

7.2 Future Work

The following is a summary of the limitations and potential future research directions to extend this work.

• More formal investigation of ITS, MCITS and past players. In this thesis, ITS had some theoretical assessment and MCITS only had an experimental assessment.


It would be beneficial to focus on the theoretical aspects of GGP-II players in order to define the expected requirements of an incomplete-information general player. This would make GGP-II more accessible to different fields of AI, or even to economics.

• Extending the MCITS algorithm to generate mixed strategies. A main limitation of MCITS, similar to the original MCTS algorithm, is its inability to generate a mixed strategy. For this reason it fails to discover the optimal strategy for the Banker and Thief game.1 Additional work could be done to extend MCITS to generate mixed strategies and thus solve a wider variety of games.

• Applying the MCITS algorithm to real problems. The algorithm in Chapter 4 was only tested on some simple games with no abstraction or game-specific heuristic. General algorithms like the different variants of CFR are used, with the help of abstraction, in real applications such as Texas hold'em [103] or stock trading [102]. Similarly, MCITS may be improved using abstraction for real-world applications.

• Performing further experiments on a larger variety of games using both GL and GLTS. My results are just the first step towards a more general implicit coordination approach. Further experiments are required to demonstrate the relevance and scalability of the GL and GLTS techniques. GLTS paves the way for more general frameworks for implicit coordination: numerous agents, uncertain goals and uncertainty about the behaviour of other agents.

• Implementing General Language Tree Search on embedded systems. GLTS showed promising results on simulated MAPF/DU problems. One interesting extension of GLTS might be to implement it on embedded robotic systems [33, 11]. This would make it possible to test implicit communication between robots, as well as between humans and robots.

1The Banker and Thief game is described in Section 3.2.4.


• Prioritising the worlds based on similarity. As described in Chapter 6, humans place priority according to the similarity of different words when communicating. For example, if we want to convey the information water but we are limited to uttering only a colour, then we would naturally prefer to say blue instead of, say, red, due to the close association of blue with water. A possible future extension of GLTS could be to apply such a priority to its tree search. This way an embedded robot system could learn to easily communicate with humans with no prior arrangement.


Appendix A

Counterfactual Regret Minimization Performance on the PBECW Game

This appendix recapitulates the Counterfactual Regret minimization technique; for a detailed explanation, please refer to [118], [54] and [72]. In this appendix, I also show how this algorithm finds a Nash equilibrium which is not the optimal one in a one-off game of Partially Blind Extended Cutting Wire (PBECW).

A.1 Background: Counterfactual Regret Minimization Algorithm

The idea of regret in games is about how much more a player would have gained if it had played a specific move instead of the mixed strategy that was actually played. Regret matching refers to playing future moves at random with a distribution proportional to the positive regrets [72]. Positive regret means ignoring any move which would have given the player less reward. Regret minimization was originally designed for normal-form games but was later extended to extensive-form games, where it is called CounterFactual Regret (CFR) minimization [118]. Before explaining the algorithm, let me restate some notation from Chapters 2 and 3.

The strategy profile is represented by π, the move probability function by µ, states by s and moves by m. Let π(s) denote the probability of reaching state s. This can be calculated recursively by multiplying the move probabilities from the initial state s0 to the given state s. Accordingly, the probability of reaching an information set I can be calculated as π(I) = Σ_{s∈I} π(s). As described earlier, Z represents the set of all terminal states. The notation s ⊑ z states that s lies on the path from the starting state s0 to the given terminal state z. The value π(s, z) gives the probability of reaching the terminal state z from state s; it can also be calculated recursively by multiplying the move probabilities between state s and terminal state z.

Now let us define the counterfactual value for player r at state s as:

\[ v_r(\pi, s) = \sum_{z \in Z,\; s \sqsubseteq z} \pi_{-r}(s)\, \pi(s, z)\, u_r(z) \]
where \(\pi_{-r}\) means that the probabilities of the moves made by player r are taken to be one.

So accordingly, the counterfactual regret of not taking move m at state s can be defined as:

\[ r(s, m) = v_r(\pi_{I \to m}, s) - v_r(\pi, s) \]
where I is the information set containing state s and \(\pi_{I \to m}\) means assigning probability 1 to move m and 0 to all other moves at information set I.

We can now define the counterfactual regret of not taking move m at information set I as:
\[ r(I, m) = \sum_{s \in I} r(s, m) \]

Accordingly, the cumulative counterfactual regret of not taking move m at information set I at time T is defined as:

\[ R^{T}(I, m) = \sum_{t=1}^{T} r^{t}(I, m) \]


Positive cumulative counterfactual regret is defined as:

\[ R^{T,+}(I, m) = \max\left(R^{T}(I, m),\, 0\right) \]

We have now defined enough notation to describe how to obtain a new strategy using the cumulative counterfactual regret. The strategy update for CFR can be described as:

\[ \pi^{T+1}(I, m) =
\begin{cases}
\dfrac{R^{T,+}(I, m)}{\sum_{m' \in M(I)} R^{T,+}(I, m')} & \text{if } \sum_{m' \in M(I)} R^{T,+}(I, m') > 0 \\[2ex]
\dfrac{1}{|M(I)|} & \text{otherwise}
\end{cases} \tag{A.1} \]
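For illustration, the regret-matching update of Equation (A.1) can be written in a few lines of Python (a sketch, not code from the thesis):

```python
def next_strategy(cumulative_regret):
    """Equation (A.1): regret matching over the positive cumulative regrets,
    falling back to the uniform strategy when none is positive."""
    positive = {m: max(r, 0.0) for m, r in cumulative_regret.items()}
    total = sum(positive.values())
    if total > 0:
        return {m: r / total for m, r in positive.items()}
    n = len(cumulative_regret)
    return {m: 1.0 / n for m in cumulative_regret}
```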

It has been proven that CFR is capable of finding a Nash equilibrium for constant-sum two-player games [118]; however, not every Nash equilibrium strategy is optimal in all cases. Below is an example where it fails to find the optimal one. I created this game by extending the Cutting Wire game [86] in order to show the failure of both the HP-II and CFR algorithms.

A.2 CounterFactual Regret Minimization Failure

I now show how the CFR algorithm fails to find the optimal strategy in the Partially Blind Extended Cutting Wire example. I follow the cumulative counterfactual regret minimization algorithm described in [72]. The algorithm is counterfactual regret minimization with chance sampling, meaning that in each round it picks a chance move at random and updates the values on its branches. For details of the algorithm please refer to [72]. I present four different rounds of CFR on the Partially Blind Extended Cutting Wire example in Figures A.1, A.2, A.3 and A.4. In each figure, I show four variables for each move, presented inside a box next to the move; for the final moves, the boxes are placed beneath the moves. The first row is the name of the move, and the second row is the probability µ of the move at time t. The reach probability π can be calculated from the µ values. The third row shows the counterfactual regret r and the fourth row shows the cumulative counterfactual regret R.


The final row shows the updated probability value for the next round. States are numbered for ease of explanation. A dashed line is used to show the information set; there is only one information set, which contains states 7 and 11. In the first round, all the move probabilities are initialised to 1/|M(s)|, which in this example is 0.5. For the first and third rounds, I chose the arm_red move for the random player, and for the second and fourth rounds, I chose the arm_blue move. There will be no positive regret for state 7, since the regret is calculated according to all the states in the information set. The regret for move wait2 from state 3 is 2.5: the value v(s_3, π^1) is 0.5 ∗ 0.5 ∗ 50 + 0.5 ∗ 0.5 ∗ 50 + 0.5 ∗ 0.5 ∗ 90 = 47.5, and if the player always chooses wait2 at state 3 then v(s_3, π^1_{s3→wait2}) is 0.5 ∗ 50 + 0.5 ∗ 50, which is 50. So the regret of not choosing move wait2 at state 3 is 2.5. Since this is the first round, the cumulative counterfactual regret will also be 2.5. The regret for move tell2 from state 3 is negative, so it will be 0. Similarly, all the other variables for all the states on the left side of the tree can be calculated. For the second round, as mentioned earlier, I pick the arm_blue move from the initial state s0. This round is similar to the first round, with a difference at the moves from state s11. Since π−r(s7) is higher than π−r(s11) in this round, the regret for not choosing cut_r at state s11 will be positive. Following the calculations, the regret for not choosing cut_r at s11 and s7 is 11. This updates µ to 0.67 for cut_r and 0.33 for cut_b. This process continues until it reaches the pure strategy profile shown in Figure A.5, in which the thickened coloured lines are the chosen moves. As I described in Chapter 3 of this thesis, this strategy profile is a Nash equilibrium but is not the optimal one.
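The regret computation quoted above for wait2 at state 3 can be checked with a few lines; the numbers are taken directly from the example:

```python
# Counterfactual value of state 3 under the initial uniform strategy pi^1.
v_pi1 = 0.5 * 0.5 * 50 + 0.5 * 0.5 * 50 + 0.5 * 0.5 * 90   # 47.5
# Counterfactual value when wait2 is always chosen at state 3.
v_wait2 = 0.5 * 50 + 0.5 * 50                               # 50.0
regret_wait2 = v_wait2 - v_pi1                               # 2.5, as stated above
```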


Figure A.1: The Partially Blind Extended Cutting Wire example with the CFR variables shown during the first iteration.

Figure A.2: The Partially Blind Extended Cutting Wire example with the CFR variables shown during the second iteration.


Figure A.3: The Partially Blind Extended Cutting Wire example with the CFR variables shown during the third iteration.

Figure A.4: The Partially Blind Extended Cutting Wire example with the CFR variables shown during the fourth iteration.


Figure A.5: The final pure strategy profile that CFR finds in the Partially Blind Extended Cutting Wire example.


Appendix B

Battle Ships in the Fog: Comparing Different Strategies

As shown in Chapter 4, MCITS's generated strategy is better than HP-II's strategy. In that chapter, I demonstrated how HP-II's strategy loses against MCITS's strategy. Here, I demonstrate how the HP-II and MCITS strategies win against the strategy of sampling approaches such as HP. GGP-II approaches which use sampling techniques pick a possible position of the opponent and play the game like a complete information game. At the end, they average the reward of each move over all samples and play the move which returned the highest reward on average. This technique might work well for games like Poker, but for Battle Ships in the Fog it fails miserably. Each time such a player assumes the opponent is located in a cell, it shoots at that cell, so it thinks shooting is the best move to play. It is completely random which position it finally picks to shoot at, and with high probability it fails to shoot at the correct cell. A failed shot reveals its location, so the opponent can then shoot it back precisely. Sampling techniques, like HP, therefore lose against both the HP-II and MCITS algorithms. Figure B.1 shows the gameplay between the HP-II algorithm and a player which uses a sampling technique, like HP, and Figure B.2 shows the gameplay between MCITS and an algorithm which uses a sampling technique.



Figure B.1: HP-II algorithm vs sampling technique in Battle Ships in the Fog.



Figure B.2: MCITS algorithm vs sampling technique in Battle Ships in the Fog.


Appendix C

Number of Worlds in the Two Robots Example

Here I explain how many worlds there are in the game of Two Robots from Chapter 6. The game is presented in Figure C.1. It is a turn-based game. There are the random player and two robots, Circle and Square. Before the game starts, the robots can see two marked cells for each robot. In the first round, the random player picks one target cell for each robot from its two options. The target cells are marked with a solid black shape inside each. The robots then need to go to their target cells. The crux of the game is that each robot can only see its own picked target cell. So the idea of creating a world is to connect the percept sent to a robot by the random player to a percept that the robot can send to the other robot. In this game, each robot can see the other's moves; this means that moving is a way of sending a message to the other robot. Assume the Circle target at cell (2,0) is marked solid, as in Figure C.1, and Circle wants to share this with Square. At each cell, Circle can move in two directions or stay. Moving costs 2 points, while staying still costs 1 point. Because this is a turn-taking game and there is a total of 100 points in a game, each player has 50 points to spend before the game ends. It is important to remember that even though some percepts have similar names, like left, at different states of the game, they are considered unique and different because games in GGP-II have perfect recall [104].



Figure C.1: The Two Robots game on a loop board. The small solid square and circle show the target cells for the robots.

So we need to pick a move of the Circle player to attach to the Circle (2,0) percept. Each of Circle's moves creates a connection, and each connection creates a world. So we need to count the number of possible moves for Circle in order to know the number of possible worlds created due to this player. Assume only Circle moves in this game. Then each node of the move tree has up to three outgoing edges: move in one of two directions (costing 2 points each) or stay (costing 1 point). To count these edges, we can use dynamic programming. Let edgeCount(p) be the number of edges when there are p points left. The base case is when there are 0 points or 1 point left; for any larger number of points we can either move or stay. This can be summarised as:

\[ \mathrm{edgeCount}(p) =
\begin{cases}
0 & \text{if } p = 0 \\
1 & \text{if } p = 1 \\
2 \times \mathrm{edgeCount}(p-2) + \mathrm{edgeCount}(p-1) & \text{otherwise}
\end{cases} \tag{C.1} \]

Since Circle has only 50 points to use before the game ends, I set p = 50 and get 375,299,968,947,541. That is for only the two percepts of Circle (2,0) and the Circle player's moves. There are two targets for Circle, and there is also another player, Square. This means there are 4 × 3.7 × 10^14 singleton worlds. In regard to composite worlds, there are (3.7 × 10^14)^2 worlds with two mustRules, (3.7 × 10^14)^3 worlds with three mustRules and (3.7 × 10^14)^4 worlds with four mustRules. By adding the singular and composite worlds we get a total of about 1.9 × 10^58 worlds.
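For illustration, the recurrence (C.1) and the totals above can be reproduced with a short script; this is only an arithmetic check under the counting scheme of this appendix, not code from the thesis:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def edge_count(p):
    """Equation (C.1): number of distinct move sequences with p points left."""
    if p == 0:
        return 0
    if p == 1:
        return 1
    return 2 * edge_count(p - 2) + edge_count(p - 1)

per_percept = edge_count(50)        # 375299968947541, as quoted above
singletons = 4 * per_percept        # two targets for each of the two robots
composites = per_percept**2 + per_percept**3 + per_percept**4
total = singletons + composites     # about 1.98e58, the same order as the 1.9 x 10^58 quoted above
```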


Appendix D

How a World Turns into a Terminal World

It is plausible for a world to turn into a terminal one in a world tree. Turning terminal means the world can no longer generate new worlds. As discussed earlier, one case is when both depths reach the end of the game and all the worlds with depths before them have been generated. The other case is when every possible child of a world has already been generated in another part of the tree. Consider Figure D.1 as an example. This figure shows a world tree in which a world has turned into a terminal world. The boxes next to each world show the percept depths of the mustRule inside the world. For ease of explanation, each world is given a number. First world 1 is expanded, then world 2, followed by world 3. Now, if the Select procedure reaches world 4, it finds the world to be terminal. The reason lies in the legality of world 4's possible children.

There are four possible children, with depths (d1: 1, d2: 2), (d1: 2, d2: 1), (d1: 3, d2: 2) and (d1: 2, d2: 3). The world with depths (d1: 1, d2: 2) is already in the world tree as world 2. No worlds can be generated with depths (d1: 2, d2: 1) or (d1: 3, d2: 2) because d1 > d2. The only possible depths for a new child are therefore (d1: 2, d2: 3), but this world cannot be added either, as it is already inside the world tree as world 6. As a result, world 4 is now considered a terminal world. This shows an example of how a non-terminal world turns into a terminal world.
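For illustration, a minimal sketch of this terminality test, assuming a world is identified by its mustRule depths, that the set of depth pairs already generated elsewhere in the tree is known, and ignoring the end-of-game case mentioned above:

```python
def is_terminal(d1, d2, already_generated):
    """A world is terminal when none of its four candidate children
    is both legal (1 <= d1' <= d2') and not yet generated."""
    candidates = [(d1 + 1, d2), (d1 - 1, d2), (d1, d2 + 1), (d1, d2 - 1)]
    return not any(
        1 <= c1 <= c2 and (c1, c2) not in already_generated
        for c1, c2 in candidates
    )

# World 4 in Figure D.1 has depths (2, 2); with (1, 2) and (2, 3) already in
# the tree, is_terminal(2, 2, {(1, 2), (2, 3)}) returns True.
```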



Figure D.1: A world tree in which a non-terminal world turned into a terminal world. The values inside the boxes show the mustRule depths of each world.

Bibliography

[1] Rajeev Agrawal. "Sample mean based index policies by o(log n) regret for the multi-armed bandit problem". In: Advances in Applied Probability 27.4 (1995), pp. 1054–1078.
[2] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. "Finite-time analysis of the multiarmed bandit problem". In: Machine Learning 47.2-3 (2002), pp. 235–256.
[3] Joris Bleys, Martin Loetzsch, Michael Spranger, and Luc Steels. "The grounded colour naming game". In: 2009.
[4] Thomas Bolander, Thorsten Engesser, Robert Mattmüller, and Bernhard Nebel. "Better eager than lazy? How agent types impact the successfulness of implicit coordination". In: International Conference on Principles of Knowledge Representation and Reasoning. 2018, pp. 445–453.
[5] George W Brown. "Iterative solution of games by fictitious play". In: Activity Analysis of Production and Allocation 13.1 (1951), pp. 374–376.
[6] George W Brown. Some notes on computation of games solutions. Tech. rep. RAND Corp Santa Monica, 1949.
[7] Noam Brown, Adam Lerer, Sam Gross, and Tuomas Sandholm. "Deep counterfactual regret minimization". In: arXiv preprint arXiv:1811.00164 (2018).
[8] Cameron B Browne, Edward Powley, Daniel Whitehouse, Simon M Lucas, Peter I Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. "A survey of Monte Carlo tree search methods". In: IEEE Transactions on Computational Intelligence and AI in Games 4.1 (2012), pp. 1–43.
[9] Van Burnham. Supercade: A Visual History of the Videogame Age 1971-1984. Cambridge, MA, USA: MIT Press, 2003.
[10] Murray Campbell, A Joseph Hoane, and Feng-hsiung Hsu. "Deep Blue". In: Artificial Intelligence 134.1 (2002), pp. 57–83.
[11] Giorgio Cannata and Marco Maggiali. "An embedded tactile and force sensor for robotic manipulation and grasping". In: IEEE Robotics and Automation Society International Conference on Humanoid Robots. IEEE. 2005, pp. 80–85.
[12] Jakub Černý. "Playing General Imperfect-Information Games Using Game-Theoretic Algorithms". Bachelor's Thesis. Czech Republic: Czech Technical University in Prague, 2014.


[13] Jakub Černý. "The dark side of the board: advances in chess Kriegspiel". PhD Thesis. Bologna: Alma Mater Studiorum - Università di Bologna, 2010.
[14] Legacy Chinook World Man-Machine Checkers Champion. Arthur Samuel's Legacy. 2016. URL: https://webdocs.cs.ualberta.ca/~chinook/project/legacy.html (visited on May 16, 2013).
[15] Armin Chitizadeh and Michael Thielscher. "General Language Evolution in General Game Playing". In: AI 2018: Advances in Artificial Intelligence - Australasian Joint Conference, Wellington. 2018, pp. 51–64.
[16] Armin Chitizadeh and Michael Thielscher. "Iterative tree search in general game playing with incomplete information". In: Tristan Cazenave, Abdallah Saffidine, Nathan Sturtevant (Editors), Computer Games Workshop, CGW, Held in Conjunction with the International Joint Conference on Artificial Intelligence, Revised Selected Papers. 2018, pp. 98–115.
[17] P Ciancarini, F DallaLibera, and F Maran. "Decision making under uncertainty: a rational approach to Kriegspiel". In: Advances in Computer Chess 8 (1997), pp. 277–298.
[18] Claudiu Ciresan, Ueli Meier, Jonathan Masci, Luca Maria Gambardella, and Jürgen Schmidhuber. "Flexible, high performance convolutional neural networks for image classification". In: International Joint Conference on Artificial Intelligence. 2011, pp. 1237–1242.
[19] Russell Cooper, Douglas V DeJong, Robert Forsythe, and Thomas W Ross. "Communication in the battle of the sexes game: some experimental results". In: The RAND Journal of Economics (1989), pp. 568–587.
[20] Rémi Coulom. "Efficient selectivity and backup operators in Monte-Carlo tree search". In: International Conference on Computers and Games. Springer. 2006, pp. 72–83.
[21] Peter I Cowling, Edward J Powley, and Daniel Whitehouse. "Information set Monte Carlo tree search". In: IEEE Transactions on Computational Intelligence and AI in Games 4.2 (2012), pp. 120–143.
[22] Hiroshi Deguchi. "Multi agent economics and its gaming simulation". In: IFAC Proceedings Volumes 28.7 (1995), pp. 269–274.
[23] Kurt Dresner and Peter Stone. "A multiagent approach to autonomous intersection management". In: Journal of Artificial Intelligence Research 31 (2008), pp. 591–656.
[24] Martin Dufwenberg. "Game Theory". In: Wiley Interdisciplinary Reviews: Cognitive Science 2.2 (2011), pp. 167–173.
[25] Stefan Edelkamp, Tim Federholzner, and Peter Kissmann. "Searching with partial belief states in general games with incomplete information". In: Annual Conference on Artificial Intelligence. Springer. 2012, pp. 25–36.

[26] Thorsten Engesser, Thomas Bolander, Robert Mattmüller, and Bernhard Nebel. “Cooperative epistemic multi-agent planning for implicit coordination”. In: arXiv preprint arXiv:1703.02196 (2017).
[27] Andrés Faíña, Jesús López-Rodríguez, and Laura Varela-Candamio. “Using game theory in computer engineering education through case study methodology: Kodak vs Polaroid in the market for instant cameras”. In: Journal of Mobile Multimedia 10.3-4 (2014), pp. 252–262.
[28] Jakob Foerster, Ioannis Alexandros Assael, Nando de Freitas, and Shimon Whiteson. “Learning to communicate with deep multi-agent reinforcement learning”. In: Advances in Neural Information Processing Systems. 2016, pp. 2137–2145.
[29] Brian M Foss. “Mimicry in mynas (Gracula Religiosa): A test of Mowrer’s theory”. In: British Journal of Psychology 55.1 (1964), pp. 85–88.
[30] Ian Frank and David Basin. “A theoretical and empirical investigation of search in imperfect information games”. In: Theoretical Computer Science 252.1-2 (2001), pp. 217–256.
[31] Ian Frank and David Basin. “Search in games with incomplete information: A case study using Bridge Card play”. In: Artificial Intelligence 100.1-2 (1998), pp. 87–123.
[32] Ian Frank, David A Basin, and Hitoshi Matsubara. “Finding optimal strategies for imperfect information games”. In: National Conference on Artificial Intelligence and Innovative Applications of Artificial Intelligence Conference. 1998, pp. 500–507.
[33] Jun Fu, Yi Jiang, Wei Ren, and Dingxin He. “A hardware and software programmable platform for industrial embedded application”. In: Chinese Automation Congress. IEEE. 2015, pp. 360–365.
[34] Florian Geißer, Thomas Keller, and Robert Mattmüller. “Past, Present, and Future: An Optimal Online Algorithm for Single-Player GDL-II Games”. In: European Conference on Artificial Intelligence. Vol. 14. 2014, pp. 357–362.
[35] Herbert Gelernter, James R Hansen, and Donald W Loveland. “Empirical explorations of the geometry theorem machine”. In: Western Joint Computer Conference of the Institute of Radio Engineers, American Institute of Electrical Engineers, and Association for Computing Machinery. 1960, pp. 143–149.
[36] Michael Genesereth and Nathaniel Love. General Game Playing: Game Description Language specification. Tech. rep. Computer Science Department, Stanford University, Stanford, 2005.
[37] Michael Genesereth and Michael Thielscher. General Game Playing. Morgan & Claypool Publishers, 2014.

[38] W Simon Godalming. A history of computer Chess: from the Mechanical Turk to Deep Blue. 2019. url: https://www.chess.com/blog/Ginger_GM/the-history-of-computer-chess-part-1-the-mechanical-turk (visited on Aug. 12, 2019).
[39] Orly Goldwasser. “Where is metaphor?: Conceptual metaphor and alternative classification in the hieroglyphic script”. In: Metaphor and Symbol 20.2 (2005), pp. 95–113.
[40] Miguel Gonzalez, Richard Watson, and Seth Bullock. “Minimally sufficient conditions for the evolution of social learning and the emergence of non-genetic evolutionary systems”. In: Artificial Life 23.4 (2017), pp. 493–517.
[41] Katja Grace, John Salvatier, Allan Dafoe, Baobao Zhang, and Owain Evans. “When will AI exceed human performance? Evidence from AI experts”. In: Journal of Artificial Intelligence Research 62 (2018), pp. 729–754.
[42] Wolfgang Hatzack and Bernhard Nebel. “The operational traffic control problem: Computational complexity and solutions”. In: European Conference on Planning. 2014, pp. 49–60.
[43] Donald Olding Hebb. The Organization of Behavior: A Neuropsychological Theory. Psychology Press, 2005.
[44] Johannes Heinrich, Marc Lanctot, and David Silver. “Fictitious self-play in extensive-form games”. In: International Conference on Machine Learning. 2015, pp. 805–813.
[45] Johannes Heinrich and David Silver. “Deep reinforcement learning from self-play in imperfect-information games”. In: arXiv preprint arXiv:1603.01121 (2016).
[46] John Henry Holland. Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence. MIT Press, 1992.
[47] Wolfgang Hönig, TK Satish Kumar, Liron Cohen, Hang Ma, Hong Xu, Nora Ayanian, and Sven Koenig. “Multi-agent path finding with kinematic constraints”. In: The International Conference on Automated Planning and Scheduling. 2016, pp. 477–485.
[48] Herbert Jäger, Luc Steels, Andrea Baronchelli, E Briscoe, Morten H Christiansen, Thomas Griffiths, G Jager, Simon Kirby, N Komarova, and Peter J Richerson. “What can mathematical, computational and robotic models tell us about the origins of syntax?” In: Biological Foundations and Origin of Syntax (2009), pp. 385–410.
[49] Michael Johanson, Nolan Bard, Marc Lanctot, Richard Gibson, and Michael Bowling. “Efficient Nash equilibrium approximation through Monte Carlo counterfactual regret minimization”. In: Proceedings of the International Conference on Autonomous Agents and Multiagent Systems - Volume 2. 2012, pp. 837–846.

[50] Michael Bradley Johanson. “Robust strategies and counter-strategies: Building a champion level computer poker player”. MA thesis. Edmonton: University of Alberta, 2007.
[51] Hartmut Jürgens, Heinz-Otto Peitgen, and Dietmar Saupe. “The language of fractals”. In: Scientific American (1990).
[52] Levente Kocsis and Csaba Szepesvári. “Bandit based Monte-Carlo planning”. In: European Conference on Machine Learning. Springer. 2006, pp. 282–293.
[53] Frédéric Koriche, Sylvain Lagrue, Éric Piette, and Sébastien Tabary. “WoodStock: A stochastic constraint-based general game player”.
[54] Marc Lanctot, Kevin Waugh, Martin Zinkevich, and Michael Bowling. “Monte Carlo sampling for regret minimization in extensive games”. In: Advances in Neural Information Processing Systems. 2009, pp. 1078–1086.
[55] Ramon Lawrence and Vadim Bulitko. “Database-driven real-time heuristic search in video-game pathfinding”. In: IEEE Transactions on Computational Intelligence and AI in Games 5.3 (2013), pp. 227–241.
[56] Chang-Shing Lee, Mei-Hui Wang, Guillaume Chaslot, Jean-Baptiste Hoock, Arpad Rimmel, Olivier Teytaud, Shang-Rong Tsai, Shun-Chin Hsu, and Tzung-Pei Hong. “The computational intelligence of MoGo revealed in Taiwan’s tournaments”. In: IEEE Transactions on Computational Intelligence and AI in Games 1.1 (2009), pp. 73–89.
[57] Stephen C Levinson and Judith Holler. “The origin of human multi-modal communication”. In: Philosophical Transactions of the Royal Society B: Biological Sciences 369.1651 (2014), p. 20130302.
[58] Hong Liu and John Hamilton Frazer. “Supporting evolution in a multi-agent cooperative design environment”. In: Advances in Engineering Software 33.6 (2002), pp. 319–328.
[59] Jeffrey Richard Long, Nathan R Sturtevant, Michael Buro, and Timothy Furtak. “Understanding the success of perfect information Monte Carlo sampling in game tree search”. In: Association for the Advancement of Artificial Intelligence Conference on Artificial Intelligence. 2010.
[60] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch. “Multi-agent actor-critic for mixed cooperative-competitive environments”. In: Advances in Neural Information Processing Systems. 2017, pp. 6382–6393.
[61] Hang Ma, Craig A Tovey, Guni Sharon, TK Satish Kumar, and Sven Koenig. “Multi-agent path finding with payload transfers and the package-exchange robot-routing problem”. In: Association for the Advancement of Artificial Intelligence. 2016, pp. 3166–3173.
[62] Hang Ma, Glenn Wagner, Ariel Felner, Jiaoyang Li, TK Kumar, and Sven Koenig. “Multi-agent path finding with deadlines”. In: arXiv preprint arXiv:1806.04216 (2018).

[63] Yousef Majidzadeh. “Jiroft: the earliest oriental civilization”. In: Organization of the Ministry of Culture and Islamic Guidance, Tehran (2003).
[64] James L McClelland, David E Rumelhart, and PDP Research Group. “Parallel distributed processing”. In: Explorations in the Microstructure of Cognition 2 (1986), pp. 216–271.
[65] Corey Miller, Rachel Strong, Mark Vinson, and Claudia M Brugman. “Ritualized Indirectness in Persian: taarof and related strategies of interpersonal management”. In: University of Maryland Center for Advanced Study of Language (2014).
[66] Dov Monderer and Lloyd S Shapley. “Potential games”. In: Games and Economic Behavior 14.1 (1996), pp. 124–143.
[67] Catherine J Mondloch, Terri L Lewis, D Robert Budreau, Daphne Maurer, James L Dannemiller, Benjamin R Stephens, and Kathleen A Kleiner-Gathercoal. “Face perception during early infancy”. In: Psychological Science 10.5 (1999), pp. 419–422.
[68] Virginia Morell. “Why Do Parrots Talk? Venezuelan Site Offers Clues”. In: Science 333.6041 (2011). American Association for the Advancement of Science, pp. 398–400.
[69] Oskar Morgenstern and John Von Neumann. Theory of Games and Economic Behavior. Princeton University Press, 1953.
[70] Tomas Motal. “General Game Playing in Imperfect Information Games”. MA thesis. Prague: Czech Technical University, 2011.
[71] John Nash. “Non-cooperative games”. In: Annals of Mathematics (1951), pp. 286–295.
[72] Todd W Neller and Marc Lanctot. “An introduction to counterfactual regret minimization”. In: Proceedings of Model AI Assignments, Symposium on Educational Advances in Artificial Intelligence. 2013.
[73] John von Neumann and Oskar Morgenstern. Theory of Games and Economic Behavior. Princeton University Press, 1947.
[74] BBC News. Artificial intelligence: Go master Lee Se-dol wins against AlphaGo program. 2016. url: https://www.bbc.com/news/technology-35797102 (visited on May 30, 2019).
[75] Van Nguyen, Philipp Obermeier, Tran Cao Son, Torsten Schaub, and William Yeoh. “Generalized target assignment and path finding using answer set programming”. In: Proceedings of the International Joint Conference on Artificial Intelligence, Melbourne. 2017, pp. 1216–1223.
[76] Guillermo Owen. Game Theory. Emerald Group Publishing Limited, 2013.
[77] Liviu Panait and Sean Luke. “Cooperative multi-agent learning: The state of the art”. In: Autonomous Agents and Multi-Agent Systems 11.3 (2005), pp. 387–434.

[78] Austin Parker, Dana Nau, and VS Subrahmanian. “Overconfidence or paranoia? Search in imperfect-information games”. In: National Conference on Artificial Intelligence. Vol. 21. 2. 2006, pp. 1045–1050.
[79] David L Poole and Alan K Mackworth. Artificial Intelligence: Foundations of Computational Agents. Cambridge University Press, 2010.
[80] William Poundstone. Prisoner’s Dilemma: John von Neumann, Game Theory, and the Puzzle of the Bomb. Anchor, 1993.
[81] Tord Romstad, Marco Costalba, and Joona Kiiski. Stockfish: A strong open source Chess engine. 2018. url: https://stockfishchess.org/about (visited on Sept. 12, 2018).
[82] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. 3rd ed. Upper Saddle River: Prentice Hall Press, 2009.
[83] Jenny R Saffran, Ann Senghas, and John C Trueswell. “The acquisition of language by children”. In: Proceedings of the National Academy of Sciences 98.23 (2001), pp. 12874–12875.
[84] Arthur L Samuel. “Some studies in machine learning using the game of checkers. II—recent progress”. In: Computer Games I. Springer, 1988, pp. 366–400.
[85] Michael Schofield. Playing Imperfect-Information Games in General Game Playing by Maintaining Information Set Samples (HyperPlay). Doctoral dissertation. The University of New South Wales, Sydney, 2017.
[86] Michael John Schofield, Timothy Joseph Cerexhe, and Michael Thielscher. “HyperPlay: A solution to general game playing with imperfect information”. In: Association for the Advancement of Artificial Intelligence Conference on Artificial Intelligence. 2012.
[87] Michael John Schofield and Michael Thielscher. “Lifting model sampling for general game playing to incomplete-information models”. In: Association for the Advancement of Artificial Intelligence. Vol. 15. 2015, pp. 3585–3591.
[88] Michael John Schofield and Michael Thielscher. “The Scalability of the HyperPlay Technique for Imperfect-Information Games”. In: AAAI Workshop: Computer Poker and Imperfect Information Games. 2016.
[89] Jeff S Shamma. Cooperative Control of Distributed Multi-Agent Systems. Wiley Online Library, 2007.
[90] Guni Sharon, Roni Stern, Ariel Felner, and Nathan R Sturtevant. “Conflict-based search for optimal multi-agent pathfinding”. In: Artificial Intelligence 219 (2015), pp. 40–66.
[91] Peng Shi and Qikun Shen. “Cooperative control of multi-agent systems with unknown state-dependent controlling effects”. In: IEEE Transactions on Automation Science and Engineering 12.3 (2015), pp. 827–834.

[92] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy P. Lillicrap, Karen Simonyan, and Demis Hassabis. “Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm”. In: arXiv preprint arXiv:1712.01815 (2017).
[93] Marjorie Skubic, Dennis Perzanowski, Samuel Blisard, Alan Schultz, William Adams, Magda Bugajska, and Derek Brock. “Spatial language for human-robot dialogs”. In: IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 34.2 (2004), pp. 154–167.
[94] Luc Steels. “A self-organizing spatial vocabulary”. In: Artificial Life 2.3 (1995), pp. 319–332.
[95] Luc Steels. “Language games for autonomous robots”. In: IEEE Intelligent Systems 16.5 (2001), pp. 16–22.
[96] Luc Steels and Frederic Kaplan. “AIBO’s first words: The social learning of language and meaning”. In: Evolution of Communication 4.1 (2000), pp. 3–32.
[97] Luc Steels, Frédéric Kaplan, Angus McIntyre, and Joris Van Looveren. “Crucial factors in the origins of word-meaning”. In: The Transition to Language 2.1 (2002), pp. 4–2.
[98] Luc Steels and Martin Loetzsch. “The grounded naming game”. In: Experiments in Cultural Language Evolution 3 (2012), pp. 41–59.
[99] Luc Steels and Angus McIntyre. “Spatially distributed naming games”. In: Advances in Complex Systems 1.04 (1998), pp. 301–323.
[100] Luc Steels and Michael Spranger. “Can body language shape body image?” In: Artificial Life. 2008, pp. 577–584.
[101] Luc Steels and Paul Vogt. “Grounding adaptive language games in robotic agents”. In: Proceedings of the European Conference on Artificial Life. Vol. 97. 1997.
[102] Michal Ann Strahilevitz, Terrance Odean, and Brad M Barber. “Once burned, twice shy: How naive learning, counterfactuals, and regret affect the repurchase of stocks previously sold”. In: Journal of Marketing Research 48 (2011), pp. 102–120.
[103] Oskari Tammelin, Neil Burch, Michael Johanson, and Michael Bowling. “Solving heads-up limit Texas Hold’em”. In: International Joint Conference on Artificial Intelligence. 2015, pp. 645–652.
[104] Michael Thielscher. “A general game description language for incomplete information games”. In: Association for the Advancement of Artificial Intelligence. Vol. 10. 2010, pp. 994–999.
[105] Michael Thielscher. “GDL-III: A description language for epistemic general game playing”. In: The IJCAI-16 Workshop on General Game Playing. 2017, p. 31.

[106] Michael Thielscher. “The general game playing description language is universal”. In: International Joint Conference on Artificial Intelligence. 2011.
[107] Michael Thielscher. “The General Game Playing Description Language is Universal”. In: International Joint Conference on Artificial Intelligence, Barcelona, July 2011, pp. 1107–1112.
[108] Eric Van Damme. “A relation between perfect equilibria in extensive form games and proper equilibria in normal form games”. In: International Journal of Game Theory 13.1 (1984), pp. 1–13.
[109] Manuela M Veloso, Joydeep Biswas, Brian Coltin, and Stephanie Rosenthal. “CoBots: Robust symbiotic autonomous mobile service robots”. In: International Joint Conference on Artificial Intelligence. Citeseer. 2015, p. 4423.
[110] John Von Neumann. “On the theory of games of strategy”. In: Contributions to the Theory of Games 4 (1959), pp. 13–42.
[111] Hao Wang. “Toward mechanical mathematics”. In: IBM Journal of Research and Development 4.1 (1960), pp. 2–22.
[112] Daniel Whitehouse, Peter I Cowling, Edward J Powley, and Jeff Rollason. “Integrating Monte Carlo tree search with knowledge-based methods to create engaging play in a commercial mobile game”. In: Artificial Intelligence and Interactive Digital Entertainment Conference. 2013.
[113] Darrell Whitley. “A genetic algorithm tutorial”. In: Statistics and Computing 4.2 (1994), pp. 65–85.
[114] Bernard Widrow and Marcian E Hoff. Adaptive switching circuits. Tech. rep. Stanford University Stanford Electronics Labs, 1960.
[115] Florian Wisser. “An expert-level card playing agent based on a variant of perfect information Monte Carlo sampling”. In: International Joint Conference on Artificial Intelligence. 2015.
[116] Ian H Witten, Radford M Neal, and John G Cleary. “Arithmetic coding for data compression”. In: Communications of the Association for Computing Machinery 30.6 (1987), pp. 520–540.
[117] Peter R Wurman, Raffaello D’Andrea, and Mick Mountz. “Coordinating hundreds of cooperative, autonomous vehicles in warehouses”. In: AI Magazine 29.1 (2008), p. 9.
[118] Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. “Regret minimization in games with incomplete information”. In: Advances in Neural Information Processing Systems. 2008, pp. 1729–1736.
[119] Philip Zurbuchen. “A model of infant speech perception and learning”. In: arXiv preprint arXiv:1610.06214 (2016).
