Playing Imperfect-Information Games in General Game Playing by Maintaining Information Set Samples (HyperPlay)

Michael Schofield

A thesis in fulfilment of the requirements for the degree of Doctor of Philosophy

School of Computer Science and Engineering, Faculty of Engineering

March 2017

Declaration relating to disposition of project thesis/ dissertation

I hereby grant to the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or in part in the University libraries in all forms of media, now or hereafter known, subject to the provisions of the Copyright Act 1968. I retain all property rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. I also authorise University Microfilms to use the 350 word abstract of my thesis in Dissertation Abstracts International (this is applicable to doctoral theses only).

Signature M Schofield

Witness Signature A London

Date 19 January 2018

The University recognises that there may be exceptional circumstances requiring restrictions on copying or conditions on use. Requests for restriction for a period of up to 2 years must be made in writing. Requests for a longer period of restriction may be considered in exceptional circumstances and require the approval of the Dean of Graduate Research.



COPYRIGHT STATEMENT

‘I hereby grant the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or part in the University libraries in all forms of media, now or hereafter known, subject to the provisions of the Copyright Act 1968. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. I also authorise University Microfilms to use the 350 word abstract of my thesis in Dissertation Abstracts International (this is applicable to doctoral theses only). I have either used no substantial portions of copyright material in my thesis or I have obtained permission to use copyright material; where permission has not been granted I have applied/will apply for a partial restriction of the digital copy of my thesis or dissertation.'

Signed ……………………………………………......

Date ……………………………………………......

AUTHENTICITY STATEMENT

‘I certify that the Library deposit digital copy is a direct equivalent of the final officially approved version of my thesis. No emendation of content has occurred and if there are any minor variations in formatting, they are the result of the conversion to digital format.’

Signed ……………………………………………......

Date ……………………………………………......

Contents

1 Introduction 2
  1.1 Acknowledgements ...... 2
  1.2 Motivation ...... 3
  1.3 Artificially Intelligent Agents ...... 5
  1.4 Artificial Cognitive Systems ...... 7
  1.5 General Game Playing ...... 8
  1.6 Research Objectives ...... 10
  1.7 Publications ...... 12

2 General Game Playing with Imperfect Information 16
  2.1 Introduction ...... 17
  2.2 The Game Description Language ...... 19
  2.3 Game Formalization ...... 22
  2.4 Information Set ...... 27

3 Sampling an Information Set (the HyperPlay Technique) 34
  3.1 Introduction ...... 34
  3.2 Playing Imperfect-Information Games ...... 38
  3.3 Weighted Particle Filter ...... 41
  3.4 The HyperPlay Technique ...... 43
  3.5 Using HyperPlay ...... 53
  3.6 Experimental Results ...... 56
  3.7 Conclusions ...... 61

4 The HyperPlay-II Technique 66
  4.1 Introduction ...... 66
  4.2 Lifting Sampling to Imperfect-Information Models ...... 69
  4.3 Experimental Results ...... 76
  4.4 Conclusions ...... 81

5 Scalability of HyperPlay 86
  5.1 Introduction ...... 86
  5.2 Scalability of HyperPlay ...... 88
  5.3 Experimental Results ...... 92
  5.4 Conclusions ...... 101

6 The Efficiency of the HyperPlay Technique Over Random Sampling 106
  6.1 Introduction ...... 106
  6.2 Sampling an Information Set ...... 108
  6.3 Security Games ...... 112
  6.4 Experimental Results ...... 114
  6.5 Conclusion ...... 120

7 Conclusion 124
  7.1 HyperPlay ...... 125
  7.2 HyperPlay-II ...... 125
  7.3 Scalability of HyperPlay ...... 126
  7.4 Efficiency of HyperPlay ...... 128
  7.5 Future Work ...... 129

A HyperPlayer Experimental Results 132
  A.1 Design of Experiments ...... 133
  A.2 Games ...... 135
  A.3 Experimental Results ...... 136

B HyperPlayer-II Experimental Results 142
  B.1 Design of Experiments ...... 142
  B.2 Games ...... 145
  B.3 Experimental Results ...... 146

C HyperPlayer Scalability Experimental Results 152
  C.1 Design of Experiments ...... 152
  C.2 Experimental Results ...... 155

D HyperPlay Efficiency Experimental Results 168
  D.1 Design of Experiments ...... 169
  D.2 Results ...... 171

E Imperfect Information Game Topologies 178
  E.1 Imperfect-Information Games ...... 179

List of Figures 184

List of Tables 186

References 191

Chapter 1

Introduction

The study of thought processes can be traced from the time of Aristotle, with syllogisms such as "Socrates is a man; all men are mortal; therefore, Socrates is mortal" (Russell et al., 2003), through to the modern era, where the cognitive sciences span both the human and artificial domains. The study of artificial thought processes within cognitive science can be traced from a seminal workshop held at MIT in September 1956, where it was shown that computers can mimic the human processes of memory, language and logical thinking (Russell et al., 2003). From this beginning, research in Artificial Intelligence has blossomed to become as rich and complex as its human counterpart.

1.1 Acknowledgements

This research thesis represents many years of patient work by many dedicated people at the School of Computer Science and Engineering at UNSW Australia. Although it is my name that appears as the author, it is far from an individual effort. I began this research as a mature student finding his way back into the education system after a 25 year absence. The journey back has been very rewarding, mostly because of the support I have had along the way.

My first acknowledgement is to my supervisor Professor Maurice Pagnucco, Head of the School of Computer Science and Engineering, who supported me when I began a Masters by Research and coached me in a variety of research techniques. His patience with my progress back into the education system and his enthusiasm for my ideas were boundless. Looking back on those first few years I can only say what a 'newbie' I was.

As my research became more focused I was drawn to the work of Professor Michael Thielscher, an erudite academic making the transition from Europe to Australia and leading the field of imperfect-information game playing. Michael took over the practical aspects of my supervision and guided me towards the various facets of my research. It was Michael's original challenge to our reading group that began my research journey, along with a talented graduate, Timothy Cerexhe. Our early efforts were rewarded with the acceptance of our first conference paper.

Along the way I have collaborated with many other researchers, including Abdallah Saffidine, who had just obtained his PhD from the Université Paris-Dauphine. Abdallah and I collaborated many times on workshop and conference papers, which gave me a clear picture of the standards I would need to reach in order to succeed in this field. Through Abdallah I came to work with Professor Tristan Cazenave at LAMSADE, Université Paris-Dauphine, who taught me the importance of experimental research, its rigour and its documentation.

Lastly I would like to acknowledge all of the unknown academics who have reviewed my many submissions for conferences and workshops. Your anonymous feedback has greatly improved my writing and presentation style and enabled me to make this final submission.

1.2 Motivation

The motivation for this research comes from a lifelong fascination with computers and their ability to mimic human intellectual processes using both software and hardware. This fascination began with the emergence of commercial computing in the 1960s, when this researcher's father became one of the first software engineers to work in the field; there has thus been a family history spanning 60 years and two generations of computer programming in commercial applications. This work represents a step beyond commercial computing applications and towards a contribution to the body of knowledge of Artificial Intelligence in a practical way. While others make contributions that advance the theory of Artificial Intelligence, this work focuses on the practical implementation of those theoretical frameworks.

1.2.1 Strong and Weak Artificial Intelligence

The dichotomy of Artificial Intelligence (AI) into strong and weak refers to the performance of the artificial system using human capabilities as a benchmark. Strong AI makes the claim that computers can be made to think on a level at least equal to humans; that they are capable of cognitive mental states (Russell et al., 2003). This is often portrayed in popular fiction, where artificial life forms take on the thought processes, and often the physiology, of humans with spectacular results. To a certain extent, this is being accomplished in the real world, with some specialised systems outperforming the average person. This classification is exemplified by the Turing Test, an experiment suggested by Alan Turing in his 1950 paper Computing Machinery and Intelligence. Turing argued that if a machine could convince a knowledgeable observer that it was human, then it should be considered intelligent (Russell et al., 2003). In contrast, weak AI makes no such assertion and acknowledges the limitations of machine-based artificial intelligence. Irrespective of the labels being used, it is worth noting that human capabilities set the standards for measuring the performance of artificially intelligent systems, and that researchers often anthropomorphise their creations by using human physiology as a model for the architecture of artificially intelligent systems.

1.2.2 Strong and Weak Emergence

A similar dichotomy of strong and weak emergence appears in the study of philosophy when referring to the characteristics of natural systems, specifically the synergy that sometimes occurs in complex systems where the whole is greater than the sum of the parts. Aristotle, Metaphysics, Book 8.6.1045a:8-10: "... the totality is not, as it were, a mere heap, but the whole is something besides the parts ..." (Bogaard, 1979). Contemporary authors refer to the structures and patterns that appear when complex systems are capable of organising and re-organising information, as well as re-organising themselves. This is referred to as strong emergence. In the study of AI, emergent systems take on a similar connotation. Such systems are often constructed from elements that, in themselves, have no specific capabilities; yet through a process of self-organisation they become viable within their environment (Vernon, Metta, & Sandini, 2007). This continuous self-reconstruction characterises an emergent AI system and would be classified philosophically as strong emergence.

1.2.3 Artificial General Intelligence

Often an artificially intelligent system will be designed to meet a particular need, and as such its capabilities may well exceed human capabilities in the narrow band it was designed for; a common example is the development of game-playing computers. In contrast, some researchers (Goertzel & Pennachin, 2007; Hutter, 2007) prefer to take a broader view of AI and work on developing Artificial General Intelligence (AGI), a field that is, by their own admission, still in its infancy. This researcher seeks to add to the body of knowledge in AGI by considering the problems associated with the development of a cognitive system, and is guilty of trying to replicate skills that human beings take for granted.

1.3 Artificially Intelligent Agents

An agent, natural or artificial, is seen as a rational and autonomous "being" occurring in an environment; capable of sensing that environment and interacting within that environment. A human agent senses with sight, sound, touch, smell, taste, and interacts through movement and communication.

1.3.1 Artificial Agents

Within the domain of artificially intelligent systems, intelligent agents are distinguished by their abilities to sense their environment and to interact with that environment. The agent's sensors provide percepts (perceptual inputs) as inputs to its program, while its actuators and communication devices generate outputs into its environment. Considering an agent as a black box, and knowing the agent's choice of actions for every combination of percepts, it is possible to mathematically describe the agent function that maps percepts into actions (Russell et al., 2003).
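As an aside for readers who prefer code to prose, the agent function can be pictured as a mapping from the percept history to the next action. The sketch below is a hypothetical Python interface written for illustration only; the class and method names are invented and do not correspond to any implementation described later in this thesis.

from abc import ABC, abstractmethod
from typing import Optional, Sequence

class Agent(ABC):
    """Sketch of an agent function: the history of percepts maps to an action."""

    @abstractmethod
    def act(self, percepts: Sequence[str]) -> str:
        """Map the sequence of percepts received so far to the next action."""

class ReflexAgent(Agent):
    """A trivial agent whose behaviour depends only on the latest percept."""

    def __init__(self, rules: dict, default: str):
        self.rules = rules        # lookup table: percept -> action
        self.default = default    # action taken when no rule matches

    def act(self, percepts: Sequence[str]) -> str:
        last: Optional[str] = percepts[-1] if percepts else None
        return self.rules.get(last, self.default)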

1.3.2 Modelling Intelligent Agents

The traditional models of intelligent agents begin by examining the percepts and considering what actions the agent will take in response to these perceived inputs. Yet most authors represent this as a two-way relationship between the agent and its environment. Some authors (Goertzel & Pennachin, 2007) even challenge the traditional view by suggesting that an agent should be thought of as a set of percepts that occur in response to its actions. Irrespective of the viewpoint, any model of an intelligent agent must involve a cyclic, temporal relationship with its environment.

1.3.3 Evolving Agents

Looking inside an agent and considering its inner workings reveals both software and hardware. Yet this distinction does not do justice to the variety found in the broad range of intelligent agents. To suggest that an agent is simply the sum of its architecture plus its program (Russell et al., 2003) is an over-simplification. In this quickly evolving field, software that once required a high-performance computing platform may later be reduced to firmware on a plug-in card and so become an innate property of the agent's hardware. Similarly, in the human domain, tasks that take years to master may be performed at a subconscious, almost reflex, level. A good example is the use of GPUs on a graphics card as a replacement for CPUs as a more efficient way of processing game states (Schiffel & Björnsson, 2014).

Many authors find it useful to contrast the behaviours of an agent that result from its innate characteristics and will not change with time with the behaviours that are learned from its interaction with its environment. Here the distinction between phylogenic behaviour and ontogenic behaviour is often made (Vernon, 2008). In General Game Playing (GGP) the early agents were instantiated at the beginning of every game and thus were tabula rasa. With time, agents began to evolve through the better understanding of their creators so as to include a greater intuition of the nature of a game1. Eventually agents were able to use learning from previous games to improve their performance. As a species the game-playing agents have evolved, and as individuals they have developed.

1 There is an anecdote about an attempt to identify pieces in a game of draughts so as to improve the agent's performance. On analysis it was found that the agent identified the empty squares as the pieces, but still improved its performance.

1.4 Artificial Cognitive Systems

Cognition is the state of knowing that you know something, as opposed to a state in which you know something without knowing how or why you know it. One refers to a rational thought process while the other deals with a priori knowledge.

1.4.1 Cognitive Functions

For artificially intelligent systems, cognition is used more simply to describe the state of knowing something, even though the system may not be able to reason about the source of that knowledge. For example, more traditional cognitive systems allow the operator to look inside the agent and see a symbolic representation of the knowledge, whereas an emergent system (say, one based on neural networks) achieves cognition through a re-organisation of itself in its dynamic interactions with its environment. Yet both are cognitive systems. Whatever the internal mechanisms, there are some important distinctions that set cognitive systems apart from the general class of artificially intelligent systems. One of the distinguishing features of a cognitive system is its ability to perform certain cognitive functions, similar to the thinking processes that humans often take for granted. While not every cognitive system is able to perform all of these functions, it is commonly assumed that there is a set of cognitive functions that sets a cognitive system apart from other artificially intelligent systems. A typical set of cognitive functions would include: reasoning, deliberating, recognising, adapting, learning, planning, anticipating and predicting (Auer et al., 2005; Vernon, 2008).

1.4.2 Definition

For the purposes of this thesis, a Cognitive System shall be defined as: an artificially intelligent system that has a continuously updated representation of its environment, reasons and learns through its interactions with that environment, and takes goal-oriented actions while considering and verifying the impact of those actions. In this context, the "representation" does not need to be symbolic; it may exist in the internal structure of the agent's cognitive processes. However, symbolic representations are often easier to work with. Throughout this thesis the term "the agent" will be taken to mean an instantiation of such a cognitive system, usually for experimental purposes.

Figure 1.1: Adapted from Maturana and Varela's ideogram of systems that exhibit co-determination and self-development.

1.5 General Game Playing

In many human endeavours we must crawl before we can walk, and so it is with researching cognitive systems in AGI. As parents we teach our children through games because we can focus on particular skills or knowledge within a simplified framework. So this researcher looks to the field of General Game Playing with Imperfect Information as a test bench for ideas and experiments that might eventually be extended to AGI.

1.5.1 Game Environment

The General Game Playing environment provides many of the characteristics of the real world. Games can be constructed with solutions that cannot be easily discovered and therefore require some type of search. Games can have many actors, with sophisticated reward structures. Games can require decision making with some level of uncertainty. Games can be constructed where actions fail and percepts are faulty. When imperfect information and random moves2 are introduced, games allow the full gamut of problem solving and reasoning to be modelled.

1.5.2 Knowing Where to Start

The best place to start is at the beginning. In the field of General Game Playing with Imperfect Information this means building a player capable of all of the cognitive skills previously mentioned.

2 Random moves are made by the random player later defined in Table 2.1

The principal skill is the maintenance of an internal representation of the world (the game) despite the fact that the agent has imperfect or missing sensory perception of that world. This will be the focus of the remaining chapters: how to maintain an internal representation of a multi-agent world when the agent has little or no information about the actions of the other agents, and still be able to take intelligent, goal-oriented action after reasoning about the current state of the world and the impact that such action will have.

1.6 Research Objectives

1.6.1 Background

This research began with a challenge from Prof. Michael Thielscher to the Logic Club Reading Group: how is it possible to sample an information set for a partially completed game of Kriegspiel Chess when the search space is too big to map using current technology? The context was General Game Playing, where each player has a time limit of only a few minutes to evaluate the game play and select a move. The solution was published (Schofield, Cerexhe, & Thielscher, 2012) shortly afterwards. A player was created and entered into a General Game Playing with Imperfect Information competition, but was not very successful: it was good at sampling an information set, but lacked an effective search algorithm to evaluate move choices. The creation of the player and the subsequent attempts to use it to answer some open questions in the field of GGP shaped the objectives of this research.

1.6.2 Related Research

At the time this research began, significant gains were being made in perfect-information players for GGP, which were then being extended into the General Video Game Playing arena (Levine et al., 2013). In particular, GVG players were being developed for the newly founded GVG competitions (Perez-Liebana et al., 2016). But little work was being done to extend the knowledge in GGP with imperfect information. What started out as an interesting challenge soon blossomed into a relevant and necessary research topic, building on the challenges set by Thielscher (2011) and extended by Edelkamp, Federholzner, and Kissmann (2012), as well as challenges set earlier by Frank and Basin (2001).

1.6.3 Objectives

This research seeks to address the following questions:

• Could the information set sampling technique used for one game, Kriegspiel Chess, be developed into a general solution applicable to any and all GGP-II games? The challenge is to develop a general solution that provides a starting point for existing search techniques used by perfect-information game players.

• Could a player based on this sampling technique overcome the Strategy Fusion Error by providing enough information to calculate the probability of a sample being the true game, and hence facilitate the use of a weighted particle filter?

The challenge is to overcome the Strategy Fusion Error with very limited information, as many search spaces prevent the calculation of an exact probability.

• How robust was this technique in the largest search spaces where the scale of the task may make it impossible to implement? The challenge is to make this technique applicable in large search spaces and still meet the limited time budget being used in GGP-II games.

• How efficient was this sampling technique over the use of random samples? The challenge is to develop a technique that is fast and efficient, but effective at sampling an information set.

• Could the technique be applied outside of GGP? This is an open question which speaks to the larger domain of AGI.

Each of these questions is addressed in the chapters of the thesis.

1.7 Publications

This thesis recapitulates and expands on previously published works.

1.7.1 Conference Papers

Papers published at Artificial Intelligence conferences:

1. Schofield, M., Cerexhe, T., Thielscher, M. (2012). HyperPlay: A solution to general game playing with imperfect information. In Proceedings of Twenty Seventh AAAI Conference on Artificial Intelligence (p. 1606-1612).

2. Schofield, M., Thielscher, M. (2015). Lifting hyperplay for general game playing to incomplete-information models. In Proceedings of Twenty Ninth AAAI Conference on Artificial Intelligence (p. 3585-3591).

3. Cazenave, T., Saffidine, A., Schofield, M., Thielscher, M. (2016b). Nested Monte Carlo search for two-player games. In Proceedings of Thirtieth AAAI Conference on Artificial Intelligence.

4. Schofield, M., Thielscher, M. (2017). The efficiency of the hyperplay technique over random sampling. In Proceedings of Thirty First AAAI Conference on Artificial Intelligence.

1.7.2 Conference Workshop Papers

Papers published at Artificial Intelligence conference workshops specialising in game playing:

1. Schofield, M., Cerexhe, T., Thielscher, M. (2013). Lifting hyperplay for general game playing to incomplete-information models. In Proceedings of GIGA 2013 Workshop on General Game Playing (p. 39-45).

2. Schofield, M., Saffidine, A. (2013). High speed forward chaining for general game playing. In Proceedings of GIGA 2013 Workshop on General Game Playing (p. 31-38).

3. Cazenave, T., Saffidine, A., Schofield, M., Thielscher, M. (2016a). Discounting and pruning for nested playouts in general game playing. In Proceedings of The IJCAI-15 Workshop on General Game Playing.

4. Schofield, M., Thielscher, M. (2016b). The scalability of the hyperplay technique for imperfect-information games. In Proceedings of AAAI Workshop on Computer Poker and Imperfect Information Games.

Chapter 2

General Game Playing with Imperfect Information

This chapter formalises the basics of General Game Playing with Imperfect-Information with particular emphasis on nomenclature verging on pedantry. The famous Monty Hall game is used as a running worked example to illustrate each of the technical sections. Finally an imperfect-information game is defined as the mathematical foundation for the proofs and experimental findings presented in later chapters.

2.1 Introduction

2.1.1 Publications

This chapter recapitulates and expands on previous works.

• Schofield, M., Thielscher, M. (2016a). General game playing with incomplete infor- mation. Submitted to Journal of Artificial Intelligence Research. This article was submitted to the Journal of Artificial Intelligence Research in July 2016. It was peer reviewed but not accepted. The recommendation was to make minor changes and resubmit. Out of necessity this thesis has been completed before resubmitting the article.

• This researcher prepared the article, and

• Thielscher, M. supervised the process.

2.1.2 Terminology

New terminology is introduced or more accurately defined.

• imperfect-information is used in the General Game Playing community where general AI uses the terms "incomplete information" or "partially observable". The term "imperfect-information" will be used to mean that parts of the game play may be hidden from one or more players. The initial state of the game and the reward structure are known to all roles, as they are declared in the rules.

• Game Controller, being a piece of software that runs the true game and sends play messages to each of the agents playing the game.

• true game, being the game played out by the Game Controller. The game-playing agent's only access to the true game is via the play messages sent out by the Game Controller.

2.1.3 Motivation

The standardisation of the Game Description Language (Love et al., 2006) and its widespread adoption has led to an increase in General Game Playing research. Competitions are often held at AI conferences1, beginning with the AAAI GGP Competition, with successful players employing a variety of techniques including automatically generated evaluation functions (Clune, 2007; Schiffel & Thielscher, 2007) or some form of Monte Carlo technique such as the modern UCT (Björnsson & Finnsson, 2009; Méhat & Cazenave, 2011).

1 There have been fewer competitions in recent years.

However, these advancements have been in the field of perfect-information games; only a few advances have been made in the field of imperfect-information games. Imperfect-information games were set as a challenge for existing AI systems (Thielscher, 2010; Schiffel & Thielscher, 2014), with a specification for the extension of the Game Description Language (Quenault & Cazenave, 2007; Thielscher, 2011) to encompass imperfect information being accepted by the General Game Playing community. These games present several new challenges for the AI player. Firstly, the search space is often larger than in similar perfect-information games, as the player must search parts of the game tree that would otherwise be known to be inaccessible. Secondly, the player must reason across an information set2 and choose the move that is most likely to give a positive outcome. The motivation for this chapter is to introduce General Game Playing and to present a rigorous mathematical definition of an imperfect-information game that satisfies the constraints of the Game Description Language, along with the induced game tree, play histories, and game play evaluations. The definitions have been constructed to facilitate the implementation of a game playing agent, and so there will be subtle differences to some of the existing literature.

2.1.4 Related Research

The declarative Game Description Language (GDL) has been defined as a formal language that allows an arbitrary game to be specified by a complete set of rules (Love, Hinrichs, Haley, Schkufza, & Genesereth, 2008; Genesereth, Love, & Pell, 2005). Originally developed for Stanford's General Game Playing (GGP) test bed and the AAAI GGP Competitions, it encompassed finite, discrete, deterministic, multi-player games of complete information. It describes the general game playing language and the messages that constitute game play, along with a specification for the game communication language. Originally for perfect-information games, GDL has been extended to GDL-II (for: GDL with imperfect information) to allow for arbitrary, finite games with random3 moves and incomplete information (Thielscher, 2010). Thielscher deals with the limitation of GDL being restricted to deterministic games with perfect information and extends the GDL in a simple but expressive way. This new formalisation of the rules allows an arbitrary, discrete and finite n-player game with randomness and imperfect state information to be expressed. The extended framework gives players all the information they need to reason about their own

2 That is, the set of possible play histories that satisfies all the signals received by the game playing agent. 3 The word "random" is a keyword in GDL-II. It is often referred to as "nature" in other disciplines.

knowledge as well as that of the other players. This work is extended by Schiffel and Thielscher by focusing on the reasoning challenge for general game playing agents using the new language. They present a full embedding of the extended GDL into the Situation Calculus and prove that this is a sound and complete reasoning method for agents' knowledge. From the field of applied theoretical economics, Hart (1992) presents definitions and examples for games in extensive form and games in strategic form, with pure and mixed strategies and equilibrium points. Rasmusen (2007) deals with non-cooperative games and asymmetric information and presents the latest ideas on game theory and information economics, showing how to build simple models using a standard format. In a similar vein, Leyton-Brown and Shoham (2008) present a short introduction to the field of game theory covering the basics that are common to the many disciplines using the theory. They aim to meet a perceived need by providing a summary of the main classes of games, their representations, and the main concepts used to analyse them.

2.2 The Game Description Language

The General Game Playing agent requires a formal language that allows an arbitrary game to be specified by a complete set of rules. The Game Description Language (GDL) is used to specify a set of declarative statements (rules) with a programming-like syntax. Originally designed for games of perfect-information, the GDL was formalized by Love et al. (2008) and later enhanced to include games of imperfect-information (Thielscher, 2010), as characterized by the special keywords listed in Table 2.1. This section presents an overview of the language and some intuition into its use.

(role ?r)          ?r is a player
(init ?f)          ?f holds in the initial state
(true ?f)          ?f holds in the current state (aka a fluent)
(legal ?r ?m)      role ?r can do move ?m in the current state
(does ?r ?m)       role ?r does move ?m
(next ?f)          ?f holds in the next state
(terminal)         the current state is terminal
(goal ?r ?v)       role ?r gets reward ?v
(sees ?r ?p)       role ?r perceives ?p in the next state
random             the random player (aka nature)
(not ?f)           the negation of ?f
(distinct ?a ?b)   ?a and ?b are syntactically different

Table 2.1: GDL-II keywords

2.2.1 GDL-II Keyword Description

GDL-II comes with some syntactic restrictions4 that ensure that every valid game description has a unique interpretation as a state transition system, as follows:

• The players in a game are determined by the derivable instances of (role ?r),

• The initial state is the set of derivable instances of (init ?f),

• For any decision state d, the legal moves of a player ?r are determined by the instances

4 For details, refer to the language specifications (Love et al., 2006; Thielscher, 2010).

20 of (legal ?r ?m) that follow from the game rules augmented by an encoding of the facts in d using the keyword (true ?f),

• Since game play is synchronous in the Game Description Language5, states are updated by joint moves with one move by each player ?r,

• The next position after a joint move is taken in state d is determined by the instances of (next ?f) that follow from the game rules using the keywords (does ?r ?m) and (true ?f), respectively,

• The percepts a player ?r receives as a result of a joint move being taken in decision state d is likewise determined by the derivable instances of (sees ?r ?p) using (does ?r ?m) and (true ?f), and

• Finally, the rules for (terminal) and (goal ?r ?v) determine whether a given state is terminal and what the players’ goal values are in this case.

 1  (role random)
 2  (role contestant)
 3
 4  (init (closed 1))
 5  (init (closed 2))
 6  (init (closed 3))
 7  (init (round 1))
 8
 9  (<= (legal random (hide_car ?d))
10      (true (round 1))
11      (true (closed ?d)))
12  (<= (legal random (open_door ?d))
13      (true (round 2))
14      (true (closed ?d))
15      (not (true (car ?d)))
16      (not (true (chosen ?d))))
17  (<= (legal random noop)
18      (true (round 3)))
19  (<= (legal contestant (choose ?d))
20      (true (round 1))
21      (true (closed ?d)))
22  (<= (legal contestant noop)
23      (true (round 2)))
24  (<= (legal contestant noop)
25      (true (round 3)))
26  (<= (legal contestant switch)
27      (true (round 3)))
28
29  (<= (sees contestant ?d)
30      (does random (open_door ?d)))
31  (<= (sees contestant ?d)
32      (true (round 3))
33      (true (car ?d)))
34  (<= (next (car ?d))
35      (does random (hide_car ?d)))
36  (<= (next (car ?d))
37      (true (car ?d)))
38  (<= (next (closed ?d))
39      (true (closed ?d))
40      (not (does random (open_door ?d))))
41  (<= (next (chosen ?d))
42      (does contestant (choose ?d)))
43  (<= (next (chosen ?d))
44      (true (chosen ?d))
45      (not (does contestant switch)))
46  (<= (next (chosen ?d))
47      (does contestant switch)
48      (true (closed ?d))
49      (not (true (chosen ?d))))
50  (<= (next (round 2))
51      (true (round 1)))
52  (<= (next (round 3))
53      (true (round 2)))
54  (<= (next (round 4))
55      (true (round 3)))
56
57  (<= terminal
58      (true (round 4)))
59
60  (<= (goal contestant 100)
61      (true (chosen ?d))
62      (true (car ?d)))
63  (<= (goal contestant 0)
64      (true (chosen ?d))
65      (not (true (car ?d))))
66  (goal random 0)

Figure 2.1: A GDL-II description of the Monty Hall game.

5 Synchronous means that all players move simultaneously. Turn-taking games are modelled by allowing players only one legal move without effect such as (noop) if it is not their turn.

2.2.2 The Monty Hall Game

A running example based on the GDL-II rules in Figure 2.1 is used to illustrate the theory. It formalises a simple but famous game (Rosenhouse, 2009), adapted from (Thielscher, 2011), where a prize (a car) is hidden behind one of three doors, a goat behind the others, and where a contestant is given two chances to pick a door. This game is especially relevant to this research as it contains a simple but effective trap: it requires the agent to consider the probability of the play histories when making a move selection.
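As a purely illustrative aside (the formal treatment of history probabilities follows in Section 2.4), the trap can be quantified with the standard Monty Hall calculation. The contestant's first choice is correct with probability 1/3, and the host is constrained to open a door hiding a goat, so the reveal carries no information about whether that first choice was correct:

P(win | stick) = P(car is behind the door first chosen) = 1/3
P(win | switch) = P(car is not behind the door first chosen) = 2/3

An agent that weights the two indistinguishable play histories equally after the reveal would wrongly conclude that sticking and switching are equivalent; this is the trap referred to above.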

Example 1. The intuition behind the rules is as follows. Lines 1 - 2 introduce the players’ names (the game host is played by random). Lines 4 - 7 define the four features that comprise the initial game state. Lines 9 - 27 specify the legal moves available to each role:

• In round 1, the random player decides where to place the car (lines 9 - 11),

• Simultaneously the contestant chooses a door (lines 19 - 21),

• In round 2 random opens one of the other doors (lines 12 - 16) to reveal a goat,

• Finally the contestant can stick (noop) or switch (lines 24 - 25, 26 - 27). The contestant's only percepts are:

• The door opened by the host (lines 29 - 30), and

• The location of the car at the end of the game (lines 31 - 33).

The rules specifying the state update (next) are given in lines 34 - 55, the conditions for the game to end (terminal) in lines 57 - 58, and the payoff for the player (goal) in lines 60 - 66.

2.3 Game Formalization

This section presents the basic mathematics of a game that satisfies the restrictions of the GDL-II. While there have been many variations presented in the literature, the differences are mostly superficial in regard to nomenclature. This definition builds upon the semantics of GDL-II presented by Schiffel and Thielscher (2014), where they show that a set of clauses forming a valid GDL-II description6 is a representation of a state transition system, as follows.

2.3.1 Nomenclature

The most common conventions are followed, with a standardisation of the nomenclature with respect to font, case, script, and argument. The rules used in this thesis are:

• Sets and machines are represented by upper-case Roman characters,

• Set elements are represented by the corresponding lower-case Roman character,

• Functions are represented by lower-case Greek characters7 or short words giving some intuition of the functions purpose,

• Vectors are variable size with homogeneous ordered elements8, e.g. ~a = ⟨a1, a2, a3⟩; elements can be addressed directly using ai = ~a[i], and size may be indicated as a superscript, e.g. ~e^n represents a path of n edges.

• Tuples9 are fixed size with heterogeneous typed elements,

e.g. a1 = w̄ = (x, y, z); elements can be addressed directly using x = w̄[ix].

• Vectors and tuples may be nested, and their elements addressed using a multidimensional address, e.g. xn = ~a[n, ix], as shorthand for xn = w̄n[ix] : w̄n = ~a[n],

• Subscripts are used to indicate vector or tuple or set elements, and to denote "belongs to a role" or "as seen by a role", not as pseudo-arguments to a function,

• Functions may be overloaded to return a set, a vector or an element of such, and

• Bijective functions have an inverse relation denoted f⁻¹.

6 Refer to section 2.3 Valid Game Descriptions of (Schiffel & Thielscher, 2014). 7 The most common conventions are preserved. 8 This borrows from the nomenclature in the programming language C. 9 A tuple is shown with a bar above, and a vector is shown with an arrow above.

2.3.2 Imperfect-Information Game

An imperfect-information game may be viewed as a state-transition system (Schiffel & Thielscher, 2014) with a description whose syntax and semantics support such a view. The definition of these games uses the machinery of the extensive-form game, augmented to include imperfect-information signals from the game that may leave the game playing agent with a set of indistinguishable game play histories forming an information set. As with all extensive-form games there is an induced game tree where these indistinguishable histories form simple paths to a set of nodes representing states in the game. The set of states forms the agent's view of the game and provides a basis for reasoning. The following definitions build upon works by Love et al. (2008), Schiffel and Thielscher (2014), Rasmusen (2007) and Leyton-Brown and Shoham (2008) and facilitate the implementation of a game playing agent. The definitions require that any imperfect-information game is well-formed and meets the following restrictions10 (Love et al., 2008; Schiffel & Thielscher, 2014) in order to be expressible in the GDL-II:

• the game is decidable (guaranteed to terminate),

• the game is playable (there is always a legal move for every role in every state),

• the game is winnable (there is always a sequence of moves for every role leading to a terminal state),

• the game rules are stratified (there are no cycles involving negative edges in the rules' dependency graph), and

• the game rules are allowed (each variable in a clause occurs in at least one positive atom in the body and avoids cyclic dependency).

Definition 1. Let G = (S, s0, R, A, λ, P, ρ, υ, δ) be an imperfect-information game that satisfies the restrictions of the GDL-II, as follows.

1. S is a finite set of states:
   • disjoint decision and terminal states S = D ⊎ T,
   • st ∈ S is the true state of the game being played by the Game Controller, and
   • s0 ∈ S is the initial state of the game.

2. R is a finite set of roles in the game:
   • subscript r ∈ R is used to identify vector elements 'belonging to' a role.

3. A is a finite set of all moves (actions) in the game:
   • ar ∈ A is any legal move for role r, e.g. (does r (move)),
   • ~a is a move vector, where ~a = ⟨a1, ..., a|R|⟩ with one move for each role,
   • ⟨~a−r, ar⟩ = ⟨a1, ..., ar, ..., a|R|⟩ is a move vector with specific move ar, and
   • n ∈ ℕ is the number of joint moves enacted in a game, which is equivalent to the number of rounds played11.

4. λ : D × R → 2^A is a function giving a set of legal moves for r ∈ R in d ∈ D:
   • ar ∈ λ(d, r) ⊆ A is any legal move for role r in decision state d, and
   • λ : D → 2^(A^|R|) is an overload of the function λ to return the set of legal move vectors ~a in d.

5. P is a finite set of all percepts in the game:
   • pr ∈ P is the percept received by role r after succession12.

6. ρ : D × A^|R| × R → P is a function giving the percept for role r ∈ R resulting from enacting a joint move ~a in a decision state d ∈ D:
   • pr = ρ(d, ~a, r) is the percept received by role r after succession,
   • ρ : D × A^|R| → P^|R| is an overload of the function ρ to return a vector of percepts, one for each role.

7. υ : T × R → ℝ is the payoff function on termination.

8. δ : D × A^|R| → S is the successor function:
   • s = δ(d, ~a) is the progression of the game in d enacting move vector ~a.

......

10 The terminology used is consistent with the original authors' terminology.

Example 1 continued. In the running worked example shown in the GDL of Figure 2.1, in the first round each role must make a move:

~a = ⟨(does random (hide_car 3)), (does contestant (choose 1))⟩

or, more compactly, ~a = ⟨(hide_car 3), (choose 1)⟩. The succession of the game from the initial state using the move vector produces new fluents and percepts:

s1 = δ(s0, ⟨(hide_car 3), (choose 1)⟩)
   = {(closed 1), (closed 2), (closed 3), (round 2), (car 3), (chosen 1)}

~p = ρ(s0, ⟨(hide_car 3), (choose 1)⟩) = ⟨null, null⟩
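To give some intuition for how Definition 1 might be realised inside a game-playing agent, the following Python sketch mirrors the signature of G = (S, s0, R, A, λ, P, ρ, υ, δ). It is an illustrative interface only, written under the assumption that states are represented as sets of ground fluent terms; the names are not taken from any published HyperPlay implementation.

from abc import ABC, abstractmethod
from typing import Sequence

State = frozenset   # a state is a set of ground fluent terms, e.g. {"(car 3)", "(round 2)"}
Move = str          # a ground move term, e.g. "(hide_car 3)"
Percept = str       # a ground percept term, or "null"

class Game(ABC):
    """Sketch of the state-transition system G = (S, s0, R, A, lambda, P, rho, upsilon, delta)."""

    roles: Sequence[str]    # R, in a fixed order used for joint-move vectors
    initial_state: State    # s0

    @abstractmethod
    def legal(self, state: State, role: str) -> set:
        """lambda(d, r): the legal moves for a role in a decision state."""

    @abstractmethod
    def percept(self, state: State, joint_move: Sequence[Move], role: str) -> Percept:
        """rho(d, a, r): the percept received by a role after the joint move."""

    @abstractmethod
    def next_state(self, state: State, joint_move: Sequence[Move]) -> State:
        """delta(d, a): the successor state after the joint move."""

    @abstractmethod
    def is_terminal(self, state: State) -> bool:
        """True iff the state is in T."""

    @abstractmethod
    def goal(self, state: State, role: str) -> float:
        """upsilon(t, r): the payoff for a role in a terminal state."""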

A game is defined as a multi-agent state transition system, described by a set of clauses in the GDL-II. The terminology of the GDL-II is used to describe the role of the game playing

11When the game is in the initial state and no moves have been made then the round number n = 0. 12For the sake of simplicity, multiple percepts ?p for a player ?r as per the GDL predicate sees are considered to be conjoined into a single element of p and null ∈ P is the empty percept.

agent, the actions of the agent as moves in the game, and signals received by the agent as percepts. The game description sets out the natural progression of the game from one state to its successor, beginning with the initial state and tracing out a simple path in the induced game tree. There is a bijective relationship between the simple paths on the game tree and the perfect-information play histories possible in the game. This is because there is a correspondence between the joint move vectors in the game and the edges on the game tree. When implementing a General Game Playing agent it is often easier to think of the agent as traversing the game tree rather than processing play histories.

Figure 2.2: Monty Hall game tree, as seen by the contestant. Moves are serialized and (does role noop) removed, for the sake of clarity. The moves in the game were (does random (hide_car 3)), (does contestant (choose 1)), (does random (open_door 2)).

Definition 2. A game G = (S, s0, R, A, λ, P, ρ, υ, δ) that satisfies the restrictions of the GDL-II induces a game tree, which is a connected, acyclic graph G = (V, E) with a single root node corresponding to the initial game state and where the edges are determined by the joint legal moves in non-leaf nodes, while leaves correspond to terminal game states. The following definitions apply.

1. state : V → S is a function that maps from a node (vertex) v ∈ V in the game tree to the corresponding game state s ∈ S, with state⁻¹ : S → V as the inverse.

2. move : E → A^|R| is a function that maps from an edge e ∈ E in the game tree to the corresponding joint move ~a that labels the edge.

3. ~e^n is a unique path of edges ⟨e1, e2, ..., en⟩ beginning at the single root node v0, corresponding to the joint moves enacted in a game.

4. node : E^ℕ × ℕ → V is a function that returns the i^th node along a path ~e^n, where the origin is given by the 0^th node, node(anypath, 0) = v0.

......

Definition 2 provides the structures and functions necessary to map from states in the game to nodes on the game tree and vice versa. However, it is the paths formed by the edges on the game tree that are of greatest interest and it is the indistinguishable paths that will form a game playing agent’s information set.

Example 1 continued. The diagram in Figure 2.2 shows the paths in the game tree after the second round. The left-most path on the tree, starting from the root node and extending to termination, corresponds to the move sequence:

~e = ⟨⟨A,H⟩, ⟨E,L⟩, ⟨G,L⟩⟩

The second node along that path is:

node(~e, 2) = v1

The fluents in the state corresponding to the node v1:

state(v1) = {(closed 1), (closed 3), (round 3), (car 3), (chosen 1)}

2.4 Information Set

This section examines the use of the player's information set as a basis for reasoning. It starts with a definition of an information set and of the need to calculate a probability distribution for the player's move selection policy by sampling the current information set; it then uses the running worked example of the Monty Hall game to show that a weighted sample is required, and introduces weighting calculations based on Ockham's Razor.

2.4.1 Play Histories

As the game progresses, there is a state st of the game described by a set of fluents13, as well as an ordered list of play messages forming a history of joint moves and percepts. This history can be traced out on the game tree as a path from the root node v0 to the current node vt, such that st = state(vt).

Definition 3. Let G = (S, s0, R, A, λ, P, ρ, υ, δ) be an imperfect-information game that satisfies the restrictions of the GDL-II, let G = (V, E) be the induced game tree, let a play message be a move and percept tuple associated with an edge on the induced game tree, and let a history be an ordered sequence of such tuples. The following definitions apply.

1. M is a finite set of all play messages:
   • m̄ = (a, p) ∈ M is a play message,
   • m̄r = ⟨ar, pr⟩ is a play message for role r, where ar is the move enacted in decision state d and pr = ρ(d, ~a, r),
   • ~m = ⟨m̄1, ..., m̄|R|⟩ is a play message vector, one message for each role, and
   • ~mr = ⟨null, ..., m̄r, ..., null⟩ is an imperfect-information play message vector as seen by role r.

2. H is a finite set of all histories in the game:
   • h ∈ H is a history of ~m^n 14; being a multidimensional array spanning n rounds and r roles, it can be addressed using the notation in 2.3.1, such that m̄r,n = h[ir, in] and pr,n = m̄r,n[ip], shortened to pr,n = h[r, n, ip],
   • hr = ~mr^n = ⟨~mr,1, ..., ~mr,n⟩ is an imperfect-information history as seen by role r in round n,
   • ht is the perfect-information history for the true game being played by the Game Controller,
   • hv = ~m^n = ⟨~m1, ..., ~mn⟩ is a perfect-information history inducing a simple path to a node v on the game tree at round n,
   • vh is the game tree node at the end of the simple path induced by a perfect-information history, and
   • node : H × ℕ → V is an overload of the function node for histories.

3. ξ : H × R → H gives the imperfect-information history as seen by role r ∈ R.

13 Fluents are things that are (true ?f) in the state.
14 This departs from the convention of a history being only actions ~a^n or edges ~e^n, as the imperfect-information histories contain percepts that provide additional information that is not in the state and aids in identifying information sets that are not consistent with the true game being played by the Game Controller.

......

Section 2.3.2 identifies restrictions that the GDL-II places on traditional definitions of imperfect-information extensive-form games. Additionally, this chapter makes an important distinction in the way signals are used. Often games are described without the explicit use of signals15, as all signals can be derived from the game decision state and the joint actions in that state. However, not all of the moves in the true game are known to the agent, and the signals from the Game Controller provide additional information. This information may allow the agent to ignore some of the play histories in an information set as they are inconsistent with the signals received from the Game Controller16. Such play histories can be removed from the agent's deliberations. Therefore, for practical reasons, the play histories are constructed from messages that include both moves and percepts (signals).

Example 1 continued. Figure 2.2 shows the paths in the game tree labelled with the play message vectors. The history in the true game is given by:

hvt = ⟨⟨⟨A,0⟩, ⟨H,0⟩⟩, ⟨⟨E,0⟩, ⟨L,2⟩⟩⟩

Abbreviated to:

hvt = ⟨⟨A0, H0⟩, ⟨E0, L2⟩⟩

15A traditional definition may express the game as a game tree and an information set. 16A play history is inconsistent when its derived signals disagree with the actual signals received from the Game Controller.

2.4.2 Defining an Information Set

Schiffel and Thielscher (2014) show that a valid GDL-II game can be understood as a partially-observable, extensive-form game and that there will be sets of legal play sequences (histories) that a player cannot distinguish due to imperfect information about the true game being played by the Game Controller. It is easy to think of a board game where different moves lead to identical positions on the game board. The identical game boards are described by the same fluents even though they were the result of different play sequences. This has an impact on the probability of a game being in a particular state. Therefore, information sets for a player are defined as a set of play sequences (histories) and not a set of states.

Definition 4. Let G = (S, s0, R, A, λ, P, ρ, υ, δ) be an imperfect-information game that satisfies the restrictions of the GDL-II and G = (V, E) be the induced game tree, and the information sets for a role be identified with the sets of indistinguishable imperfect-information play histories. Then:

1. Ir,n ⊆ H is the general form for an information set for role r ∈ R in round n:

• hi ∈ Ir,n is a perfect-information history in an information set, defined by ξ(hi, r) = ξ(ht, r), such that the imperfect-information play histories are identical to the imperfect-information play history of the true game being played by the Game Controller, as seen by role r.

......

Example 1 continued. The information set Icontestant,2 is made up of two indistinguishable imperfect-information play histories in the game tree in Figure 2.2, given by:

ξ(hv1, r) = ⟨⟨null, H0⟩, ⟨null, L2⟩⟩

ξ(hv2, r) = ⟨⟨null, H0⟩, ⟨null, L2⟩⟩

where r = contestant.

Now there is a framework to extract information from any node in the game tree about the moves and percepts that correspond to the edges in its path and construct a corresponding play history. It is also possible to construct an imperfect-information history that represents the view of one of the roles17 in the game and defines that role’s information set.

17This is used in later chapters to reason about the deliberations of other players in the game.
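Definition 4 characterises membership of an information set by equality of the projections, ξ(hi, r) = ξ(ht, r). The sketch below gives one naive, enumerative reading of that condition, under the assumption that a history is represented as a list of rounds, each round being a tuple of (move, percept) messages, one per role; all names are illustrative, and later chapters replace this enumeration with sampling.

from typing import NamedTuple, Optional, Sequence

class PlayMessage(NamedTuple):
    move: Optional[str]     # a_r, or None when the move is hidden from the viewing role
    percept: Optional[str]  # p_r, or None for the empty percept

History = Sequence[Sequence[PlayMessage]]   # one tuple of messages per round

def project(history: History, role_index: int) -> list:
    """xi(h, r): keep the viewing role's own messages and blank out the others,
    giving the imperfect-information history as seen by that role."""
    return [
        tuple(msg if i == role_index else PlayMessage(None, None)
              for i, msg in enumerate(round_msgs))
        for round_msgs in history
    ]

def information_set(candidates: Sequence[History],
                    true_history: History,
                    role_index: int) -> list:
    """The candidate histories h_i with xi(h_i, r) = xi(h_t, r), i.e. those the
    role cannot distinguish from the true game played by the Game Controller."""
    target = project(true_history, role_index)
    return [h for h in candidates if project(h, role_index) == target]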

2.4.3 Move Selection Policy

For a game to be played out to termination, a move selection policy is required for each role. This policy represents the role's strategy for choosing a legal move in any decision state, and may include a random element, for example the Monte Carlo technique or a mixed strategy. There are no assumptions made about the nature of the policy or when it is formulated18.

Definition 5. Let G = (S, s0, R, A, λ, P, ρ, υ, δ) be an imperfect-information game that satisfies the restrictions of the GDL-II, let G = (V, E) be the induced game tree, and let each role have a move selection policy expressed as a probability distribution across all moves in all decision states. The following definitions apply.

1. Π is the set of all move selection policies π ∈ Π.

2. select : Π × D × R → φ(A) is a move selection function, as a probability distribution across all moves:
   • ~π = ⟨π1, ..., π|R|⟩ is a vector of move selection policies, one for each role, and
   • φ(A) = select(πr, d, r) is the move selection probability for r in d.

3. play : S × Π^|R| → φ(T) is the playout of a game from any state to termination according to the given move selection policies, expressed as a probability distribution across all terminal states.

......

From Definition 5 a probability distribution can be calculated across all of the terminals:

φ(T) = Σ_{hi ∈ Ir,n} P(hi) · play(state(vhi), ~π)        (2.1)

by performing a weighted sum of the probability distribution of all the playouts from all of the states corresponding to all of the histories in a player's information set. Each play history corresponds to a node on the induced game tree, which can be mapped to a state described by a set of fluents. By application of the move selection policies ~π and the successor function, the game can be played out to termination. The weighting used is the probability that the history hi corresponds to the history of the true game being played by the Game Controller.

18Traditionally this policy is expressed as a probability distribution across all moves in all states and is calculated off-line. In General Game Playing it is common to calculate the policy values on-line across the histories in an information set.

31 The play() function is defined for any state including a terminal state. In such a case the distribution φ(T ) would have one certain value and all other terminals would be unreachable. It is also possible to calculate the expected outcome of the game:

E(Ir) = Σ_{ti ∈ T} P(ti) · υ(ti, r)        (2.2)

by summing the reward at each terminal state. Note that the playout results in a probability distribution across all the terminal states in the game; those terminals that are unreachable from an information set would have zero probability. The calculation in (2.1) suggests it is possible to enumerate all of the elements in an information set. Sadly this may not be possible due to the size of the set and the time restrictions placed on the players, so a sample is required.
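To make the role of the weighted sample concrete, the sketch below approximates (2.1) and (2.2) by Monte Carlo playouts from a finite weighted sample of an information set. It assumes the hypothetical Game interface sketched after Definition 1, and represents each sampled history by the state it leads to together with its particle weight; the names and the calling convention for the policies are assumptions made for illustration only.

def playout(game, state, policies):
    """Play the game out from `state` to termination; `policies` maps each role
    to a callable (game, state, role) -> move. Returns the terminal state."""
    while not game.is_terminal(state):
        joint_move = [policies[r](game, state, r) for r in game.roles]
        state = game.next_state(state, joint_move)
    return state

def expected_outcome(game, weighted_sample, role, policies, playouts_per_history=10):
    """Estimate E(I_r) from a weighted sample of an information set.

    `weighted_sample` is a list of (state, weight) pairs, one per sampled
    history, with the weights (the particle-filter weights) summing to 1.
    The weighted average of playout rewards approximates the sum in (2.1)
    followed by the expectation in (2.2)."""
    total = 0.0
    for state, weight in weighted_sample:
        value = sum(
            game.goal(playout(game, state, policies), role)
            for _ in range(playouts_per_history)
        ) / playouts_per_history
        total += weight * value
    return total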

Chapter 3

Sampling an Information Set (the HyperPlay Technique)

This chapter introduces the HyperPlay technique for sampling an information set using a weighted particle filter. The formal definition is presented with a pseudo code implementation and a proof of soundness and completeness. Experiments were conducted to test an agent using this technique in a General Game Playing environment and the strengths and weaknesses of the technique are identified. Finally a serious limitation of the agent using this technique is presented.

3.1 Introduction

3.1.1 Publications

This chapter recapitulates and expands on previously published works.

• Schofield, M., Cerexhe, T., Thielscher, M. (2012). HyperPlay: A solution to general game playing with imperfect information. In Proceedings of Twenty Seventh AAAI Conference on Artificial Intelligence (p. 1606-1612).

This paper was the origin of the HyperPlay technique, which was invented during discussions between the authors and instantiated into a player for an international competition. This researcher developed the technique from the discussions, instantiated an agent for experiments using the technique1, identified the need for a weighted particle filter, and designed and ran the experiments.

• Schofield, M., Thielscher, M. (2016a). General game playing with incomplete infor- mation. Submitted to Journal of Artificial Intelligence Research.

This article was submitted to the Journal of Artificial Intelligence Research in July 2016. It was peer reviewed but not accepted. The recommendation was to make minor changes and resubmit.

3.1.2 Terminology

New terminology is introduced or more accurately defined.

1. ChoiceFactor, being a representation of the likelihood that a particular play history would be the true play history in any particular game, based on the legal choices available along the play history.

2. HyperPlayer, being a perfect-information game playing agent adapted for imperfect-information games by "bolting on" the HyperPlay sampling technique.

3. HyperGame, being a data structure that contains a model representing one possible state of the true game (sketched informally below). When new information arrives the invalid model is updated and the likelihood of the new model is calculated.

1 The player used for competition was built by Tim Cerexhe.

3.1.3 Motivation

In the field of General Game Playing research there has been significant improvement in the search for optimal moves, both at a tactical and a strategic level. For imperfect-information games such techniques require a deterministic sample of an information set as a starting point for their search. This thesis seeks to improve game playing agents for imperfect-information games by allowing them to employ the search techniques that are successful in perfect-information games. The HyperPlay technique provides existing General Game Playing agents with a bolt-on solution for converting from perfect- to imperfect-information games.

3.1.4 Related Research

The General Game Playing community has been slow to pick up the challenge of imperfect-information games; only a few players have been implemented and even fewer competitions have been conducted. The published approaches to designing general game playing agents for GDL-II, such as Edelkamp et al. (2012), show how existing perfect-information GDL players can be lifted to play general imperfect-information games by using models or samples of an information set as the starting states for a perfect-information search. Earlier research into information set sampling by Richards and Amir (2009) and into particle system techniques by Silver and Veness (2010) showed how a sample of an information set can be either generated from scratch or maintained from one round to the next. The latter is able to search POMDPs that would otherwise be intractable, but relies on a black box to provide the deterministic samples of an information set.

Outside of General Game Playing, the Monte Carlo tree search technique based on a sample of an information set has been applied to a variety of specific perfect-information and imperfect-information games alike (Browne et al., 2012). Frank and Basin (1998) investigated imperfect-information games with a focus on Bridge, presenting a 'game-general' tree search algorithm that exploits a number of imperfect-information heuristics. They later compared their heuristics to Monte Carlo sampling (Frank & Basin, 2001) and showed the heuristics to be superior at identifying play strategies. Throughout their work the emphasis is on maximising the outcome through strategic play choices. For the same game, Ginsberg (2001) has applied perfect-information Monte Carlo sampling.

The Alberta Computer Poker Research Group has developed systems at the forefront of computer Poker players (Billings et al., 2004), a challenging domain combining imperfect and

misleading information, opponent modelling, and a large state space. While not explicitly interested in GGP, they do describe several techniques that could generalise to this field, including miximix, fast opponent modelling, and Nash equilibrium solutions over an abstracted state space. Again the emphasis is on choosing the right play rather than obtaining a deterministic sample of an information set.

Recent work by Lanctot, Lisý, and Bowling (2014) into online search in games has presented Online Outcome Sampling (OOS) and shown that it is guaranteed to converge to an equilibrium strategy in two-player zero-sum games. Of all the recent work, this has the closest parallels to the techniques in this thesis, and it is this researcher's hope that the differences between the two techniques can be overcome. Prima facie, these differences are superficial, but in their implementation there may be some hurdles to overcome. OOS is zero-sum, two-player, turn-taking, whereas this research addresses all game types. OOS maintains a game tree, which may be impossible in some games within the online framework2. OOS ignores percepts3, whereas this research uses percepts to further refine an information set. OOS is unclear on the set of equivalence classes applied in the choice of moves, whereas this research allows for different sets of equivalence classes in each sample of an information set. OOS does not use a weighted particle filter, but does calculate the probability of the reachability of the terminal states, and this will give a similar result. The OOS algorithm returns a full information set and stores it in memory, so it may struggle in large search spaces4.

Another recent idea by Cowling, Powley, and Whitehouse (2012) is Information Set Monte Carlo Tree Search (ISMCTS), currently being applied to many deterministic games of imperfect information. Instead of searching game trees constructed from game states and moves, the ISMCTS algorithms search trees constructed of information sets, providing an alternative analysis of the game. In doing so the technique takes into consideration aspects of game-play that might be outside the domain of other reasoners. ISMCTS relies on randomly guessing the missing information, but considers each sample as an independent game to be analysed and to guide a single iteration of MCTS. Instead of keeping the search trees independent, it creates a conglomerate tree. The paradox is that the search includes portions of the game tree known to be unreachable, yet delivers statistically significant results. There are strong parallels between this approach and the Imperfect Information Simulation in Chapter 4 of this research. Importantly, ISMCTS still requires a deterministic sample of an information set.

2 Pruning the dead branches of the tree should overcome this problem.
3 This is correct as percepts can always be calculated.
4 Of all the differences this seems the most difficult to overcome in an implementation.

The idea of using deterministic samples in a single tree is reinforced by Nijssen and Winands (2012) with their treatment of MCTS applied to the hide-and-seek game Scotland Yard, with positive results. As with ISMCTS there are lessons to be learned from these approaches and hurdles to overcome, in the form of resolving equivalence classes5, calculating weighting factors, traversing large search spaces on-line, and storing large information sets in memory.

5 Not all moves are legal in all elements of an information set in GGP-II.

3.2 Playing Imperfect-Information Games

The motivation for this research is to develop a 'bolt-on solution' to allow a perfect-information player to play imperfect-information games. This allows the player to use all of the techniques that are so successful in General Game Playing; however, it locks the player into taking a deterministic sample of an information set. This section provides an online deterministic sampling technique that converts imperfect-information play histories into such samples.

3.2.1 Providing a Bolt-on Solution

The intuition for a 'bolt-on' solution to allow a perfect-information player to play imperfect-information games is as follows.

1. All imperfect information must be removed, consistent with the game description and the play messages from the Game Controller.
2. Any sample of an information set must be statistically valid, as many perfect-information players use statistical techniques for their forward search.
3. The elements of an information set are indistinguishable imperfect-information play histories. Therefore, any sampling technique must provide a deterministic grounding of the missing information.

3.2.2 Imperfect-Information Play Histories

In the previous chapter, Definition 4 gives an information set to be a set of imperfect-information play histories that are indistinguishable from the imperfect-information play history of the true game being played by the Game Controller. The histories are indistinguishable because they are imperfect-information play histories in the form of imperfect-information play message vectors. A play message is received each round from the Game Controller and contains only the move (does role move) and percepts (sees role percept) for the role receiving the message. An ordered sequence of such play messages forms an imperfect-information play history for the true game being played by the Game Controller, as follows:

• ~m_r = ⟨null, ⟨a_r, p_r⟩, null⟩ being an imperfect-information play message received from the Game Controller by role r, which can be rewritten as ~m_r = ⟨null, ⟨a_r, ρ(d, ~a, r)⟩, null⟩, and

• ~m_r^n = ⟨~m_r1, ..., ~m_rn⟩ being an imperfect-information play history as seen by role r in round n.

Example 1 continued. Consider the situation shown in Figure 2.2. The contestant only knows part of the play history:

~m_c = ⟨⟨⟨ ⟩, ⟨L0⟩⟩, ⟨⟨ ⟩, ⟨H0⟩⟩, ⟨⟨ ⟩, ⟨L2⟩⟩⟩ using the coding in the figure. Expressed in abbreviated form in the GDL:

~m_c = ⟨(does contestant noop), (does contestant choose 1), (does contestant noop) (sees contestant 2)⟩.

3.2.3 Sampling an Information Set

The logic used in the HyperPlay technique for taking a deterministic sample of an information set is as follows (a minimal code sketch of the consistency check in points 3-5 follows the example below).

1. All missing moves and percepts must be instantiated with valid choices consistent with the game description and the known sequence of play.
2. Percepts are determined from the move vector and the decision state, ~p = ρ(d, ~a); therefore, only the missing moves need to be instantiated from the choice of legal moves available in the decision state.
3. Known percepts are used to validate move choices by advancing the state and checking the percepts.
4. Each new decision state is determined using the successor function d′ = δ(d, ~a); therefore, the grounding of missing moves must begin at the initial state of the game and proceed according to the play sequence.
5. At each round, an empty set of valid grounding choices or a prematurely terminal state invalidates the entire history6.

Example 1 continued. The possible groundings for the play histories are:

~m = ⟨⟨⟨A0⟩, ⟨L0⟩⟩, ⟨⟨G0⟩, ⟨H0⟩⟩, ⟨⟨E2⟩, ⟨L2⟩⟩⟩ = ξ(h1), and

~m = ⟨⟨⟨C0⟩, ⟨L0⟩⟩, ⟨⟨G0⟩, ⟨H0⟩⟩, ⟨⟨E2⟩, ⟨L2⟩⟩⟩ = ξ(h2).

All other groundings are inconsistent with the play messages received by the contestant, and so the information set has two elements.

6 It is possible to make a bad choice early in the grounding process that prevents the play history from being completed.
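As foreshadowed above, the consistency check in points 3-5 can be pictured as a single test: advance the candidate state with the chosen joint move and compare the percepts it would generate with the percepts actually received. The Python fragment below is a minimal sketch of that test only (the full maintenance procedure, including backtracking, is given as pseudo code in Figure 3.2); the sees(), next_state() and is_terminal() helpers stand in for the GDL reasoner and are assumptions of this illustration.

    def is_consistent(state, joint_move, known_percepts, role,
                      sees, next_state, is_terminal, is_final_round):
        # A grounding choice is valid only if it reproduces the percepts the role
        # actually received and does not terminate the game prematurely.
        if sees(state, joint_move, role) != known_percepts:
            return False  # percepts disagree: a 'bad' move choice
        if is_terminal(next_state(state, joint_move)) and not is_final_round:
            return False  # premature terminal state invalidates the history
        return True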

3.3 Weighted Particle Filter

The probability distribution across an information set, as calculated in (2.1) and our example, will depend on the move selection policy π of the other roles. Prima facie, the move selection policy for other roles is random, but that does not mean the probability distribution across an information set is uniform. It is possible to use Ockham's Razor and the rules in the GDL to formulate the probability distribution across an information set. The case for a weighted particle filter is best made by looking at our running example, where the well-known rule for the Monty Hall game is to always switch doors.

Example 1 continued. Consider the situation shown in Figure 2.2:

I_c = {h_1, h_2} is the contestant's information set,
φ(T) = Σ_{h_i ∈ I_c} P(h_i) · play(state(v_{h_i}), ~mc) is the result of random playouts assuming a uniform probability distribution across I_c, and
φ(T) = ⟨..., 0.25, 0.25, ..., 0.25, 0.25, ...⟩.

The reader will note that φ(T) is incorrect for the real Monty Hall game; therefore, the assumed probability distribution across I_c must also be incorrect.

And so the challenge is to find a way to calculate the probability that an element of an information set corresponds to the history of the true game being played by the Game Controller, given that the game tree is too large to compute such probabilities directly.

3.3.1 Choice Factor

The original definition focused on the choices being made by a player as the game was being played, remembering that paths in the game tree are induced by play histories. It let the ChoiceFactor of a node v_i be determined by inspecting the choices at each node along the path in the game tree. This translated to a similar calculation using play histories:

ChoiceFactor(h_i) = ∏_{j=0}^{n−1} |λ(state(node(h_i, j)))|    (3.1)

by enumerating the nodes in the path from the root node to the previous round n − 1 and taking the product of the sizes of the sets of legal moves (choices)7 in each of the decision states for each of the nodes.

7 The size is of the set of joint move vectors, not the size of the role’s move choices

The application of Ockham's razor, in so much as the number of choices made along a path represents the likelihood of any one branch of that path being taken, provides a basis for the calculation of a probability. The probability of the play history h_i corresponding to the play history of the true game being played by the Game Controller, h_t, can be expressed in proportion to the likelihood of all of the other play histories sampled from an information set. This requires a normalising factor to be applied:

P(h_i = h_t) = (1/k) · 1/ChoiceFactor(h_i),  where the normalisation factor k = Σ_{h_j ∈ I_{r,n}} 1/ChoiceFactor(h_j)    (3.2)

the probability is inversely proportional to the number of choices, normalised for the sample being taken. This approach has appeal as an information set may be so large8 as to be intractable and the player may only be able to take a sample. Therefore, the only indication of the likelihood of a sampled history corresponding to the true game must be expressed in terms of the likelihood of the other samples. Where there is information about the move selection policy of other roles it is possible to redefine the probability of a move vector more accurately:

P(~a | d_j) = ∏_{r∈R} P(a_r | select(π_r, d_j, r)),  where d_j = state(node(h_i, j))    (3.3)

by including the move selection policy π for the choices being made by all roles it is possible to calculate a probability distribution across all move vectors in a decision state. By compounding the probabilities along the path:

P(h_i = h_t) = ∏_{j=0}^{n−1} P(~a | d_j)    (3.4)

a probability can be calculated for any history in an information set based on the move selection policies of each role.

8 In the Blind Breakthrough case study the upper bound for an information set in the mid game was 6 × 10^23.

Unfortunately, the probability expressed in equation (3.4) is only correct if the whole information set can be sampled and the game tree has a constant branching factor. Therefore, as in the previous case (3.2), this must be treated as a partial probability and a normalising factor used to get an estimate of the true probabilities:

P(h_i = h_t) = (1/k) · ∏_{j=0}^{n−1} P(~a | d_j),  where the normalisation factor k = Σ_{h_i ∈ I_{r,n}} ( ∏_{j=0}^{n−1} P(~a | d_j) )    (3.5)

This final expression represents the probability that an element of the sample does in fact correspond to the state of the true game, relative to the other samples in the bag9.
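The following fragment sketches how the weights of equations (3.2) and (3.5) could be computed for a bag of sampled histories. It is a Python illustration only, not the thesis implementation: choice_factor(h) and move_probability(h, j) are hypothetical helpers standing in for |λ(state(node(h, j)))| and P(~a | d_j) respectively.

    def weights_from_choice_factors(bag, choice_factor):
        # Equation (3.2): weight each sampled history by 1/ChoiceFactor,
        # normalised over the bag of samples.
        raw = [1.0 / choice_factor(h) for h in bag]
        k = sum(raw)
        return [w / k for w in raw]

    def weights_from_policies(bag, move_probability, rounds):
        # Equation (3.5): compound P(~a | d_j) along each history's path,
        # then normalise over the bag of samples.
        raw = []
        for h in bag:
            p = 1.0
            for j in range(rounds):
                p *= move_probability(h, j)
            raw.append(p)
        k = sum(raw)
        return [w / k for w in raw]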

Example 1 continued. Reworking the example of the situation shown in Figure 2.2:

I_c = {h_1, h_2} is the contestant's information set,
φ(T) = Σ_{h_i ∈ I_c} P(h_i) · play(state(v_{h_i}), ~mc) is the result of random playouts,

φ(I_c) = ⟨0.33, 0.67⟩ (from equations (3.3) and (3.5)), and
φ(T) = ⟨..., 0.17, 0.17, ..., 0.33, 0.33, ...⟩,
which is now consistent with the probable outcomes of the true game. The expected outcomes for the two move choices are:

E(noop) = 0.33 · car + 0.67 · goat
E(switch) = 0.67 · car + 0.33 · goat

This suggests that the Contestant should always (switch), unless the goat has more utility for the contestant than the car. The distribution φ(T) is consistent with the known outcomes of the real Monty Hall game.

9 A bag of samples is used instead of a set of samples, as there may be duplication.
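As a worked illustration of where the weights ⟨0.33, 0.67⟩ come from (assuming the usual reading of the game, in which the host's choice of which door to open is the only branching that differs between the two histories; branchings common to both cancel under normalisation): in h_1 the car is behind the contestant's chosen door, so the host could have opened either of two doors, while in h_2 the host had only one legal door to open. Applying (3.2):

P(h_1) = (1/2) / (1/2 + 1/1) = 1/3 ≈ 0.33
P(h_2) = (1/1) / (1/2 + 1/1) = 2/3 ≈ 0.67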

3.4 The HyperPlay Technique

This section introduces the HyperPlay technique, both as a function that completes imperfect- information play messages and as an implementation in a GGP player. There is a proof sketch for soundness and completeness of the technique by examining properties of the game tree. There is sufficient detail for the reader to implement the technique as a bolt-on to an existing perfect-information player.

3.4.1 Description

The technique maintains a bag of models of the true game, each model being a perfect-information game. The models are given the name HyperGames10 as a reminder that they are more than just a grounding of the state variables, but a full instantiation of the game currently being played, with agents in every role. The HyperGame also contains the ability to re-instantiate the model if new information makes the model invalid, referred to as "backtracking". So a HyperGame is more than a model. It is a data structure that contains a model that represents one possible state of the true game. When new information arrives the current model of the HyperGame becomes invalid11. The HyperGame data structure contains additional information that allows a new model to be selected that is consistent with the newly received information, and information that allows the calculation of the likelihood that the new model is the true game. The term HyperGame will be used to mean a self-correcting data structure that contains a model of the current game, instantiated in code for experimental purposes. So the HyperGame is one of many self-correcting instantiations of the game containing a model that is utilised by a HyperPlayer to make move selections in the game. The "bag of models" refers to the collection of models contained within the HyperGames12.

The logic for the technique is as follows.

1. Models are updated from one move to the next using legal moves to replace the missing moves for all of the other roles.
2. Replacement moves are made randomly using a random seed. The same random seed will produce the same move choices.

10 In the original analysis the term HypoGame was used to mean a hypothetical game, but the term was not popular and it morphed into HyperGame, and hence HyperPlay.
11 If nothing else, the model is invalid by virtue of the fact that the game has progressed to the next round.
12 A bag is used as some models may be identical.

3. Bad13 move choices are recorded in memory by each HyperGame and not repeated. Subtrees with no good move choices are discarded.
4. Invalid models14 are replaced with new models by backtracking along the play history until a valid model can be instantiated in the current round.
5. Choice factors are calculated by each HyperGame along the model's play history, and hence a weighting factor is determined for the calculation of expected outcomes.
6. The models are used as a weighted particle filter on an information set and provide a basis for perfect-information evaluation.

Additionally it is possible to place a counter into the update process so that any HyperGame taking too long to update its model can be taken off line. This prevents the player from stalling when one HyperGame becomes hopelessly lost. This merely postpones the update process to a later round; it does not permanently invalidate the model. Time permitting, the model can be brought back on line15. A function is defined, below, that maintains each model by completing a history of imperfect-information play messages using the legal move choices available in each decision state.16 The function must choose randomly from the set of legal moves, and so a random seed x is used, where the same random seed will always return the same legal move choices.

Definition 6. Let G = (S, s_0, R, A, λ, P, ρ, υ, δ) be an imperfect-information game given by a valid GDL-II description and G = (V, E) be the induced game tree. The HyperPlay function is defined as follows.

1. hp : M^n × R → M^n is the function HyperPlay that completes an imperfect-information play history h_r by grounding the missing elements in the play messages consistent with the legal moves a_ri ∈ λ(state(node(h_r, i)), r) ∀ i : 0 ≤ i < n known to the role r. The function takes a random seed x to randomise the grounding choices; the same seed will produce the same grounding choices.

2. Bad Move: ~a is labelled as "bad" in decision state d_k if the percepts do not match for the round k + 1, i.e. ρ(d_k, ~a, r) ≠ h_t[k + 1, r, i_p].


13The term "bad" was coined in the original paper to mean inconsistent with later information. The moves are legal but not consistent with percepts. 14There may not be any move choices consistent with the play messages received. 15In turn taking games the agent often has time to resolve off line models. 16If the full history of the game is known then all of the models converge to the true game.

A formulation can be expressed to sample an element in an information set by completing the imperfect-information play history known to the player:

hi = hp(ξ(ht, r), x) (3.6)

the player receives an imperfect-information play history ξ(ht, r) from the Game Controller and uses a random seed to construct a deterministic sample in the form of a perfect-information play history.

Example 1 continued. The HyperPlay function takes the imperfect-information play history for the Contestant and constructs a complete history to one of the nodes in the current information set.

ξ(h_t, Contestant) = ⟨⟨⟨ ⟩, ⟨L0⟩⟩, ⟨⟨ ⟩, ⟨H0⟩⟩, ⟨⟨ ⟩, ⟨L2⟩⟩⟩

hp(ξ(h_t, Contestant), x) = ⟨⟨⟨A0⟩, ⟨L0⟩⟩, ⟨⟨G0⟩, ⟨H0⟩⟩, ⟨⟨E2⟩, ⟨L2⟩⟩⟩

A different random seed x would produce a different sample.

The union of all such samples, in the limit, is an expression of an information set:

I_{r,n} = lim ⋃_x hp(ξ(h_t, r), x)    (3.7)

as the HyperPlay function can return any and every element of an information set.

3.4.2 Pseudo Code

Below is a presentation of the original process with some alterations to the nomenclature. Previously the procedure was described using mathematical notation, but here it is described using object oriented pseudo code. The procedures forward() and backward() have been combined into Update(). The procedure forward() would replace missing information moving forward to the current round, while the procedure backward() would backtrack invalid paths until a new, untested branch could be found.

1  class Game
2      Node
3      Round
4  class Step
5      MyMove
6      MyPercept
7      Legal
8      Bad
9  class HyperGame
10     Node
11     Round
12     Path
13     RandomSeed
14 procedure Player(GameController)
15     Bag = {HyperGame(InitialNode), ...}    // one HyperGame per model, each at the initial node
16     Round = 0
17     repeat
18         MyMove = SelectMove(Bag)
19         MyPercept = GameController.SubmitMove(MyMove)
20         Round = Round + 1
21         for each HyperGame in Bag
22             HyperGame.Update(MyMove, MyPercept, Round)
23         next HyperGame
24     until IsTerminal(GameController)
25 end

Figure 3.1: An imperfect-information player using the HyperPlay technique to maintain a bag of models of the game.

The HyperPlay algorithm is summarized in Figure 3.2 as part of an imperfect-information game player in Figure 3.1. The code for the player:

• classes are declared for a Game, Step and HyperGame for the operation of the player,

• line 15 shows the initialization of the bag of models (HyperGames), each being equal to the initial node of the game,

• line 18 uses the bag of models as a weighted particle filter to calculate the move with the highest utility,

• line 19 submits the move to the game controller and receives a percept, and

• line 22 updates each model to agree with the most recent move and percept. Each HyperGame randomly completes the imperfect-information history to provide a statistically valid sample of an information set.

1  procedure HyperGame.Update(MyMove, MyPercept, Round)
2      NewStep = Step.New(MyMove, MyPercept)
3      HyperGame.Path.Add(NewStep)
4      // Advance the HyperGame to the current round
5      while HyperGame.Round < Round
6          CurrentStep = HyperGame.Step[HyperGame.Round]
7          // Find a move that is consistent with the play messages
8          for each Move in CurrentStep.Legal and not in CurrentStep.Bad
9              if IsCongruent(Move) then
10                 HyperGame.DoMove(Move)
11                 NextStep = HyperGame.Step[HyperGame.Round]
12                 NextStep.ResetLegalAndBad(RandomSeed)
13                 continue while
14             end if
15             CurrentStep.Bad.Add(Move)
16         next Move
17         // Backtrack the previous move as all of its children are bad
18         BadMove = HyperGame.UndoLastMove()
19         PreviousStep = HyperGame.Step[HyperGame.Round]
20         PreviousStep.Bad.Add(BadMove)
21     end while
22 end

Figure 3.2: The HyperPlay technique used to maintain a model of the game.

Looking at the Update() code in Figure 3.2 in more detail:

• lines 2 & 3 add a new step to the imperfect-information history,

• at first call, the HyperGame will be one round behind the game,

• line 12 clears the array of bad moves and randomizes the array of legal moves,

• line 8 enumerates the legal, as yet untested, moves for the current search round,

• lines 9 - 14 advance the HyperGame with a move that is consistent with the known moves and percepts,

• line 15 records moves inconsistent with the known moves and percepts as ’bad’, and

• if there are no 'good' moves, then lines 18 - 20 backtrack the HyperGame to the previous round, record the 'bad' move, and continue the search17.

17 The Legal and Bad moves are preserved so backtracking picks up where the previous search left off.
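For readers who prefer runnable code to pseudo code, the fragment below is one possible Python rendering of the same Update() logic, kept deliberately close to Figure 3.2. It is a sketch under stated assumptions, not the experimental implementation: the game object with new_step(), is_congruent(), do_move() and undo_last_move(), and the Step records with legal, bad and reset_legal_and_bad(), are hypothetical stand-ins for whatever GDL reasoner the host player provides.

    import random

    class HyperGame:
        def __init__(self, initial_node, seed):
            self.node = initial_node          # the current model (a perfect-information state)
            self.round = 0                    # the round the model has reached
            self.path = []                    # one Step record per round of the true game
            self.rng = random.Random(seed)    # the same seed gives the same grounding choices

        def update(self, my_move, my_percept, current_round, game):
            # Advance the model to current_round, backtracking whenever no untested
            # legal move is consistent with the recorded moves and percepts.
            self.path.append(game.new_step(my_move, my_percept))
            while self.round < current_round:
                step = self.path[self.round]
                advanced = False
                for move in [m for m in step.legal if m not in step.bad]:
                    if game.is_congruent(self, move):
                        game.do_move(self, move)       # advances self.node and self.round
                        if self.round < len(self.path):
                            # fresh, randomly ordered move list for the next search round
                            self.path[self.round].reset_legal_and_bad(self.rng)
                        advanced = True
                        break
                    step.bad.append(move)              # legal but inconsistent with percepts: 'bad'
                if not advanced:
                    # every untested move was bad: backtrack and record the undone move
                    bad_move = game.undo_last_move(self)   # decrements self.round
                    self.path[self.round].bad.append(bad_move)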

3.4.3 Soundness & Completeness

The HyperPlay technique "maintains" a collection of HyperGames which update their models from one round to the next. This proof sketch treats the Update() procedure shown in Figure 3.2 as a logical system and shows that it is both sound and complete, that is:

• everything that the Update() procedure says is a valid model is in fact valid, and
• every valid model can be obtained by the Update() procedure.

In this context a valid model corresponds to a play history from an information set for the current round. As there is a bijective relationship between the play histories and the simple paths in the induced game tree, and it is easier to visualise a tree than a set of histories, the following discussion will make use of the game tree in Figure 3.3 to illustrate an analysis of the maintenance of a sample of an information set.

Figure 3.3: The Game Tree for a GDL-II game at rounds n and n + 1, showing the subtree defined by an information set.

The HyperPlay software was originally developed using the context of the game tree for inspiration, and so memory structures and functions were developed using nodes and paths. In this context the soundness and completeness of the Update() procedure is modified by considering the return value of the procedure as being the game tree node corresponding to the model in the HyperGame, and so the term 'HyperGame.Node' is used to mean the output of the logical system. So the test for soundness and completeness is modified to be:

• Every HyperGame.Node returned by the Update() procedure corresponds to a play history in an information set of the current round; and

• Every node corresponding to a play history in an information set of the current round can be returned by the Update() procedure.

Consider the game tree for an imperfect-information game and identify the subtree G_{I_{r,n}} defined by the player's information set, which is induced by the player's imperfect-information play histories received from the Game Controller. In Figure 3.3 there is a stylized representation of a game in progress. In round n the aqua region defines a subtree G_{I_{r,n}} containing all of the paths that represent possible histories in an information set; no paths pass outside the subtree. The converse is that any path outside the subtree leads to a node that is not in an information set, remembering that the tree is acyclic in its undirected form. In round n + 1 the tan region defines a similar subtree. From Definition 3 it can be seen that the play messages for round n + 1 are built upon the play messages for round n, and hence the tan subtree may not include any nodes outside the aqua subtree, other than those nodes in round n + 1.

Definition 7. Let G = (S, R, A, P, υ, δ) be an imperfect-information game that satisfies the restrictions of the GDL-II and G = (V, E) be a game tree induced by the game. The following definitions apply.

1. ≅ : ξ(h_1, r) ≅ ξ(h_2) ⟺ ξ(h_1, r) = ξ(h_2, r) defines the congruence of a history of imperfect-information play messages with a history of complete play messages.

2. G_{I_{r,n}} is the subtree given by all of the paths ~e^n from the initial node v_0 to a node v_{h_i} : h_i ∈ I_{r,n} induced by an information set.


Theorem 1. All nodes corresponding to an information set I_{r,n+1} succeed nodes corresponding to an information set I_{r,n}.

Sketch. By construction, showing that the simple paths in the subtree G_{I_{r,n+1}} can only be built upon simple paths in the subtree G_{I_{r,n}}. Let G = ⟨S, R, A, P, υ, δ⟩ be an imperfect-information game that satisfies the restrictions of the GDL-II and G = (V, E) be a game tree induced by the game, then:

Base case n = 0:
  h_0 = ∅, v_0 = path(h_0)
    the empty history inducing the root node of the game tree

General case for round n + 1:
  ∀ h_i ∈ I_{r,n+1} : path(hp(ξ(h_i, r), x)) ∈ G_{I_{r,n+1}}
    from Definitions 4, 6 and 7: an information set subtree is a set of paths corresponding to a set of histories
  ξ(h_{n+1}, r) ≅ ξ(h_{t,n+1})
    from equation (3.6) and Definition 7: a history in an information set subtree is congruent with the true game history
  ξ(h_{n+1}) = ⟨m_0, ..., m_n, m_{n+1}⟩ = ⟨ξ(h_n), m_{n+1}⟩
    from Definition 3: each history can be split into an existing history and an extension
  ξ(h_n, r) ≅ ξ(h_{t,n})
    from equation (3.6) and Definition 7: as before, this existing history is congruent with the true game history
  ∀ h_i ∈ I_{r,n} : path(hp(ξ(h_i, r), x)) ∈ G_{I_{r,n}}
    from Definitions 4, 6 and 7: each existing history corresponds to the previous information set subtree

Corollary 2. All paths in an information set subtree for a role must pass through all previous information sets for that role.

Corollary 3. Not all paths in the previous information set subtree for a role are in the information set subtree for the current round.

These corollaries underpin the backtracking process in the HyperPlay algorithm as they facilitate the finding of a "good" path without the need to start from the initial state of the game, and the pruning of "bad" paths at the earliest opportunity. Otherwise, the bad paths could not be pruned until the subtree they initiate was completely checked.

Soundness The tools are in place to show the soundness of the HyperPlay Update() procedure. From Corollary 2 any move can be safely tested at any round from the initial state of the game to the current round. If the imperfect play messages are not congruent with the game18 then this move will never lead to a history in the current information set. In the context of the Update() procedure these moves are labelled as "bad".

Sketch. Let G = (S, R, A, P, υ, δ) be an imperfect-information game that satisfies the restrictions of the GDL-II and G = (V, E) be a game tree induced by the game, then:

Update() labels all moves outside the subtree G_{I_{r,n}} as bad
  line 15 labels incongruent moves as Bad
  line 20 labels moves as Bad if all subsequent moves are Bad
Update() backtracks all Bad moves
  line 18 backtracks Bad moves
Update() only returns nodes in the current round
  line 22 only exits when HyperGame.Round = CurrentRound

Completeness The HyperPlay Update() procedure randomly selects from all of the states in an information set of the current round when returning a value. The random choice does not have a uniform probability distribution. The selected node will be the one that required the least backtracking, as all shallow options are exhausted before backtracking more deeply. This also speaks to the efficiency of the process.

18 Up to and including the round in question.

Sketch. Let G = (S, R, A, P, υ, δ) be an imperfect-information game that satisfies the restrictions of the GDL-II and G = (V, E) be a game tree induced by the game, then:

Update() can return any of v_{h_i} : h_i ∈ I_{r,n}
  line 22 only exits with a valid sample in the current round
  every path in the game tree originates at the initial node
  every path is accessible from every other path
Update() randomizes the evaluation of moves
  line 12 resets the forward search using a RandomSeed

3.5 Using HyperPlay

This section presents a simple imperfect-information player incorporating two basic elements: the HyperPlay algorithm and Monte Carlo sampling as the perfect-information reasoner. A formalism for the move selection policy is presented which allows for the aggregation of move utilities across a player's information set. Some experimental results are presented to validate the player and identify its strengths, weaknesses and limitations.

3.5.1 The HyperPlayer

This technique can be bolted on to any perfect-information game playing agent, as well as any artificially intelligent agent performing a search with imperfect information, provided the search space can be represented as a connected, acyclic graph G = (V, E) with a single root node. A simple Monte Carlo player is used to run a number of random simulations for each joint move in the set of joint move choices; the average of the terminal values is then used as a measure of utility. The evaluation function definition is based on a number of 'playouts' of the game after making a specific move. Since the original experiments were conducted (Schofield et al., 2012) there have been many advancements in the search techniques utilised in General Game Playing. In particular Monte Carlo Tree Search (MCTS) has proved more efficient than the simple Monte Carlo searches originally conducted. It is worth noting the reasons for remaining with the original Monte Carlo search technique.

1. The primary focus is on sampling an information set using historical information, not the search for future terminal states of the game.
2. MCTS and Monte Carlo search converge to the same terminal values for large search spaces.
3. The time constraints would not allow the many MCTS searches that would have to be conducted each round19.

A more recent technique, Information Set Monte Carlo Tree Search (IS-MCTS) (Cowling et al., 2012), still requires a deterministic sampling of an information set, which is the primary focus of this research. A more detailed discussion is presented in Chapter 4.

19 Each HyperGame would need to conduct its own MCTS search.

Definition 8. Let G = (S, R, A, P, υ, δ) be an imperfect-information game that satisfies the restrictions of the GDL-II and G = (V, E) be a game tree induced by the game. The following definitions apply.

1. Br is a bag of models, each model being an element of information set Ir.

2. eval : S × Π^{|R|} × R × ℕ → ℝ is the evaluation function giving an expected value for a state, and hence a model.

The evaluation can be expressed using the play function from Definition 5:

eval(s, ~π, r, n) = (1/n) · Σ_{i=1}^{n} υ(play(s, ~π), r)    (3.8)

and performing some number of playouts according to the move selection policy of the roles. The variable n is not the round number, but a number of playouts used to average eval().

Example 1 continued. There are two elements in the information set for the fourth round in the Monty Hall game and two moves available to the Contestant in each state, (switch) and (noop). The HyperPlayer would:

• For each decision state di = state(vhi ); hi ∈ Ir,

• For each legal move ~aij = λ(di),

• Implement each ~aij in n playouts and average the results,

• Aggregate the weighted results using P (hi) for all ~aij based on syntactic equivalence (switch) or (noop), and

• Choose the move with the highest expected result.

Definition 8 provides for the formulation of an expression for the utility of a specific move in a specific decision state:

utility(d, a_rj) = E_{~a_k ∈ λ(d) : ~a_k ∋ a_rj} [ eval(δ(d, ~a_k), ~mc, r, n) ]    (3.9)

using eval(), the expectation E aggregates all of the outcomes for move vectors that contain the specific move a_rj. Note the expression of the specific move a_rj being present in the move vector ~a_k written in the unusual form ~a_k ∋ a_rj. Also note that this move may be present in more than one move vector, and so an expected value performs an aggregation.

Moves arj and ark from different samples in the bag may be syntactically identical. These moves form an equivalence class based upon their syntax. And so, the agent is looking for the equivalence class that has the highest expected value.

The problem is that not all equivalence classes may be present in each sample. This can be overcome by setting the weighting for that equivalence class in that sample equal to zero. By summing the utility of the move class across all of the samples in the bag:

classUtility(a_rj) = Σ_{h_i ∈ B_r} { P(h_i) · utility(state(v_{h_i}), a_rj)  if a_rj ∈ λ(state(v_{h_i}), r);  0 otherwise }    (3.10)

testing to make sure the move class is legal in the state and applying the probability derived from the ChoiceFactor. Finally, the move selection policy for the agent can be seen as the maximisation of the move class utility:

a_r = argmax_{a_rj} classUtility(a_rj)    (3.11)

∀ h_i ∈ B_r, ∀ a_rj ∈ λ(state(v_{h_i}), r)

finds the move class with the highest expected utility across all the legal moves across all the models in the bag B_r. Note that the samples are traversed twice, once in equation (3.11) and once in equation (3.10). Despite the thoroughness of the analysis it is still possible, though unlikely, to choose a move that is not legal in the true game being played by the Game Controller. This fact has generated much debate within the GGP research community, with some researchers suggesting that a GDL-II game should contain enough information to make illegal moves impossible. This researcher believes that being able to make illegal moves is part of the challenge of imperfect-information game play.
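As a concrete illustration of equations (3.8) to (3.11), the sketch below shows how a HyperPlayer might aggregate Monte Carlo evaluations into move-class utilities and pick the best class. It is a minimal Python sketch, not the experimental implementation: legal_joint_moves(), next_state(), random_playout() and reward() are hypothetical helpers standing in for λ, δ, play and υ, and joint moves are assumed to be dictionaries keyed by role.

    from collections import defaultdict

    def eval_state(state, role, playouts, random_playout, reward):
        # Equation (3.8): the average reward over `playouts` random playouts from `state`.
        return sum(reward(random_playout(state), role) for _ in range(playouts)) / playouts

    def select_move(bag, weights, role, playouts,
                    legal_joint_moves, next_state, random_playout, reward):
        # Equations (3.9)-(3.11): weight each model's move utilities by P(h_i) and choose
        # the move class (syntactically identical moves across models) with the best total.
        class_utility = defaultdict(float)
        for model, weight in zip(bag, weights):
            joint_moves = legal_joint_moves(model)
            for my_move in {jm[role] for jm in joint_moves}:
                # equation (3.9): expected value over the joint moves containing my_move
                values = [eval_state(next_state(model, jm), role, playouts,
                                     random_playout, reward)
                          for jm in joint_moves if jm[role] == my_move]
                class_utility[my_move] += weight * (sum(values) / len(values))
            # a move class absent from this model simply contributes zero, as in (3.10)
        # equation (3.11): the move class with the highest aggregated utility
        return max(class_utility, key=class_utility.get)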

3.6 Experimental Results

This section presents a summary and interpretation of the experiments shown in Appendix A.

3.6.1 Testing the HyperPlayer

Three games are chosen to test the move selection policy outlined in Equation (3.11). The Monty Hall game is used to validate the use of a weighted particle filter, Krieg-TicTacToe to represent two player simultaneous move games, and Blind Breakthrough as an example of a turn taking game with a very large search space20 of more than 10^20 for the 6x6 version. A summary of the experimental set-up presented in Appendix A is given below.

Design of Experiments A series of experiments was designed to test the capabilities of the new technique using the high-throughput computer facilities at the School of Computer Science and Engineering. Games played at a recent Australasian Joint Conference on Artificial Intelligence were used as inspiration as the conference organisers specially designed games that would challenge the state of the art of GDL-II players so as to encourage research and development in this field.

Game Play For two player games an opponent is instantiated who uses the HP technique and is adequately resourced so as to be competitive. Since a comparison is being made between different instantiations of the player, the experiments will not be overly sensitive to the performance of the opponent.

Measuring Performance The number of states visited by the player in playing the game is used as the measure of computational resources, being careful to measure this across the multiple samples of an information set and to include the states visited in the backtracking of invalid samples21. So the measure includes both the states visited in creating the samples plus the states visited in the playouts.

Player Resources In each experiment the player resources were varied to demonstrate the performance compared to resources. Care was taken to ensure that each player had equal

20 Some GGP-II games (Kriegspiel Chess) have more nodes in their game tree than atoms in the universe. Any game whose game tree is too large to map, even in the mid-game, inside the time budget is a very large search space.
21 This may consume half of the time budget in some games.

resources when making a move selection. This was achieved by setting each player's parameters such that they would visit a similar number of states each round.

Standardized Opponent In each of the two player games the HyperPlayer opposed a Cheat: a HyperPlayer given perfect information through access to the true game, maintaining HyperGames that are the true game (instead of models of the true game). The Cheat was fully resourced to make the best move choices within the limits of the move selection process.

3.6.2 Monty Hall

The Game In addition to the rules already presented, the number of initial doors is varied between three, four, and five doors. The host always opens all but two doors.

The Objective This game is used to represent information sets where the elements have different probabilities of representing the true game. If the player does not recognise this and use a weighted particle filter then the result will be less than optimal.

Experimental Results The results in Figure 3.4 show the averages and confidence intervals from a batch of 1000 games played. The agent's score is either 0 or 100 depending on whether it correctly guesses the door with the car behind it.

Figure 3.4: Monty Hall results, validating the weighting used in the particle filter.

Interpretation As expected, the adequately resourced HyperPlayer was able to achieve an average payoff appropriate for the number of doors in the game. The poorly resourced agent does no better than 50%, which is consistent with a random guess.

3.6.3 Particle Filter Weighting

The use of a weighted particle filter can be validated through the weightings applied to the models when playing the Monty Hall game. If the formulation is correct then the maximum long-term score should be equal to 100 · (1 − 1/No_Of_Doors); otherwise the maximum long-term score should be 50. Clearly, the former is true in Figure 3.4.
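For the door counts used in these experiments the formula gives the following long-term scores (a simple evaluation of 100 · (1 − 1/No_Of_Doors)):

3 doors: 100 · (1 − 1/3) ≈ 66.7
4 doors: 100 · (1 − 1/4) = 75
5 doors: 100 · (1 − 1/5) = 80

against a maximum of 50 for a player whose samples are weighted uniformly.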

3.6.4 Krieg-TicTacToe

The Game A variant of traditional TicTacToe where the opponent's pieces are hidden. Players move simultaneously and the winning length was fixed at four in a row. The board size is varied between 4x4, 5x5, and 6x6 squares. Players are given feedback via percepts as to whether their move, and their opponent's move, were successful.

The Objective This game is used to test simultaneous move games. In this game the percepts can cause extensive backtracking, which the HyperPlayer would normally do during its opponent's turn. However, there is no idle time in this game and backtracking comes at the expense of forward searches. The opponent (the Cheat) does not require any backtracking.

Experimental Results The results in Figure 3.5 show the averages and confidence intervals from a batch of 1000 games played. The agent’s score is either 0, 50 or 100 depending on the outcome of the game. The experiments showed steady improvement in performance as HyperPlayer resources increased. Remember that the opponent (the Cheat) has perfect information.

Figure 3.5: Krieg-TicTacToe results

Interpretation The limiting values appear to be well short of the 50% mark, especially on the larger board. On inspection of the play messages, it could be seen that the reduced number of percepts relative to the game duration gave the HyperPlayer very little information to assist in narrowing its search. In fact, the Cheat often won the game before the HyperPlayer could establish an accurate model of the board. There did not appear to be an easy explanation for the unusual shape of the results on the 6x6 board. An examination of the game-play suggested the player gained more of an understanding of the missing information as its resources were increased, but that this did not always translate to a higher score. It is important to remember that the line on the chart is not a continuous function but merely a device to link results for the same game.

3.6.5 Blind Breakthrough

The Game A variant of chess where each player starts with two rows of pawns on their side of the board. Players take turns, trying to ‘break through’ their opponent’s ranks to reach the other side first. Opponent’s pieces are hidden, as with Krieg-TicTacToe. The board size was varied between 5x5, 6x6, and 7x7 squares and players were given equal opportunities to go first.

The Objective This game is used to test player’s ability to handle very large search spaces. There is almost no possibility of mapping the search space in the time allowed for a move, even towards the end of the game. Players do have the ability to perform some analysis22 while their opponent is taking a turn.

Experimental Results The results in Figure 3.6 show the averages and confidence intervals from a batch of 1000 games played. The agent's score is either 0 or 100 depending on the outcome of the game23. The experiments showed steady improvement in performance as HyperPlayer resources increased. Remember that the opponent (the Cheat) has perfect information.

22 The HyperPlayer can complete any outstanding backtracking when it is idle.
23 Blind Breakthrough must have a winner, no draws are allowed.

Figure 3.6: Blind Breakthrough results

Interpretation The results of the Blind Breakthrough experiments show clear improvement in performance as resources increase and approach a limiting value. As the number of models increases and the number of simulations per move increases the HyperPlayer is able to match the Cheat’s performance with neither player having the advantage. In the 5x5 results, the percepts were sufficient for the HyperPlayer to maintain models that were very close to the true game. This allowed the HyperPlayer to perform as if it had perfect-information.

3.7 Conclusions

The HyperPlay technique does provide existing General Game Playing agents with a bolt-on solution to convert from perfect to imperfect-information games. However, its success in playing those games is mixed, depending on the game topology.

3.7.1 Strengths

The maintenance of the imperfect-information path, and of the lists of legal moves and bad moves, clearly works. It also facilitates the calculation of the probability P(v_i = v_t) of the sample being the true game. This was demonstrated in the experimental results. The player operated under a time budget, with the ability for the Update() process for each HyperGame to be taken off line so that slow updates would not slow down the player's move selection. The impact was that the large search space in Blind Breakthrough did not cause the player to stall due to excessive backtracking; the HyperGame in question was simply taken off line until it had completed its backtracking calculations and then returned on line. Whilst the instance used for testing was a simple Monte Carlo player, it was clear from the software development that any perfect-information player could have been used.

3.7.2 Weaknesses

The primary weakness is that the search space may be extremely large, and the enumeration of the possible imperfect-information histories given by ξ(v_t, MyRole) may take so long as to make the Update() process appear to be never-ending. In practical terms, there will always be a few models in the bag that have randomly followed a path that is close to the true game. However, the size of the search space is a genuine concern for any implementation of this technique.

(succ 0 1)
(succ 1 2)
(succ 2 3)
(color red)
(color blue)

(role agent)
(role random)

(init (round 0))

(<= (legal random (arm ?c))
    (color ?c)
    (true (round 0)))
(<= (legal random noop)
    (not (true (round 0))))
(<= (legal agent noop)
    (true (round 0)))
(<= (legal agent ask)
    (true (round 1)))
(<= (legal agent wait)
    (true (round 1)))
(<= (legal agent (cut ?c))
    (color ?c)
    (true (round 2)))

(<= (sees agent ?c)
    (does agent ask)
    (true (armed ?c)))

(<= (explodes)
    (does random (cut ?c))
    (not (true (armed ?c))))
(<= (next (round ?n))
    (true (round ?m))
    (succ ?m ?n))
(<= (next (armed ?c))
    (does random (arm ?c)))
(<= (next (armed ?c))
    (true (armed ?c)))
(<= (next (score 90))
    (does explodes ask))
(<= (next (score 100))
    (does explodes wait))
(<= (next (score ?s))
    (not (explodes)))
(<= (next (score 0))
    (explodes))

(<= (terminal)
    (true (round 3)))
(<= (goal agent ?s)
    (true (score ?s)))

Figure 3.7: GDL-II description of the Exploding Bomb game.

3.7.3 Limitations of the Player

The error that comes from elevating a sample to fact (Frank & Basin, 1998) can be clearly demonstrated with this player, resulting in all information-gathering moves24 being valued at zero utility, as all the information has already been gathered. The HyperPlay based player is unable to correctly value information-gathering moves. This is HyperPlay's Achilles' heel and it motivates the next chapter. This is illustrated in the Exploding Bomb game, in which this technique always fails. Given a choice between doing nothing and asking for information at a small cost, the agent will do nothing, confident it has all of the information it needs to gain the maximum score. This failure comes from the elevation of a sample of an information set to fact. In taking the sample the agent replaces all missing information with randomly selected legal alternatives, then deliberates as if this were the true game. When reasoning about information the agent believes it already has all of the facts. This failure will be the focus of the next chapter.

24 Both information-gathering and information-hiding moves are tested in the next chapter.

Chapter 4

The HyperPlay-II Technique

This chapter introduces the HyperPlay-II technique, a nested technique specifically designed to overcome the limitations of the original technique and able to play a much larger class of games by reasoning on imperfect-information models. The formal definition is presented along with experiments conducted to test an agent using this technique in a General Game Playing environment, with the strengths and weaknesses of the technique identified. Finally a limitation of the agent using this technique is presented.

4.1 Introduction

4.1.1 Publications

This chapter recapitulates and expands on previously published works.

• Schofield, M., Cerexhe, T., Thielscher, M. (2013). Lifting hyperplay for general game playing to incomplete-information models. In Proceedings of GIGA 2013 Workshop on General Game Playing (p. 39-45).

This paper was a workshop paper first defining the HyperPlay-II technique:

• This researcher invented the technique, designed and conducted the experiments and prepared the formalism,

• Cerexhe, T. did the related research and prepared the paper,

• Thielscher, M. supervised the process.

• Schofield, M., Thielscher, M. (2015). Lifting hyperplay for general game playing to incomplete-information models. In Proceedings of Twenty Ninth AAAI Conference on Artificial Intelligence (p. 3585-3591).

This paper was a conference paper built upon the original workshop paper:

• This researcher rewrote the workshop paper, tightened up the math and reworked the examples,

• Thielscher, M. supervised the process and kept this researcher mathematically honest (many times).

• Schofield, M., Thielscher, M. (2016a). General game playing with incomplete infor- mation. Submitted to Journal of Artificial Intelligence Research.

This article was submitted to the Journal of Artificial Intelligence Research in July 2016. It was peer reviewed but not accepted. The recommendation was to make minor changes and resubmit.

4.1.2 Terminology

We introduce some new, or augmented, terminology:

• Imperfect Information Simulation (IIS) is a nested playout that starts from the initial state of the game and passes through a state corresponding to a sample of an information set.

4.1.3 Motivation

In researching games with imperfect information little progress has been made beyond a specification of their rules (Quenault & Cazenave, 2007; Thielscher, 2011). The only published approaches to modelling and playing imperfect-information games (Edelkamp et al., 2012) both use standard model sampling as a partial solution to this challenge. The sampled states are then used to form the root of a separate perfect-information search. These deterministic sampling methods have an immediate and obvious flaw, in that they value information at zero because each individual sample assumes perfect information at all times (Frank & Basin, 2001). This was confirmed at the imperfect-information track of the GGP competition at the Australasian Joint Conference on Artificial Intelligence, where three games, NumberGuessing, BankerAndThief, and BattleshipsInFog, were specifically designed to test the ability of players to value information. None of the competitors were able to do so.1

The motivation for this chapter is the development of HyperPlay-II (for: HyperPlay with imperfect-information models) as an extension of the original sampling technique HyperPlay. This extended technique should be able to play a much larger class of games by reasoning on imperfect-information models. The analysis and experimental results evaluate how the new technique values information correctly according to the expected cost/benefit, how it performs information gathering moves when appropriate, how it is protective of information that should remain discreet, and how much additional resources are required.

4.1.4 Related Research

The GGP-II international competitions have sparked the development of just a few players. Edelkamp et al. (2012) have built a GDL-II player, Nexusbaum, based on complete-information Monte Carlo sampling, which they compared against Fluxii, an imperfect-information extension of Fluxplayer (Schiffel & Thielscher, 2007) that maintains the information set throughout a game. An early development version of a HyperPlayer also competed in the first international competition, coming second out of only three competitors. At the Australasian Joint Conference on Artificial Intelligence there was a GDL-II track encouraging entries by CadiaPlayer, LeJoueur and Nexusbaum. An obvious criticism of many of these players is that they do not correctly value information gathering moves. This Strategy Fusion Error (Frank & Basin, 2001) has been identified as an important issue with straightforward perfect-information model sampling (Frank & Basin,

1 see https://wiki.cse.unsw.edu.au/ai2012/GGP.

1998). Long, Sturtevant, Buro, and Furtak (2010) analyse the impact of this error and present three conditions on game trees under which it does not have a strong adverse effect on the quality of play. For other games, the Strategy Fusion Error has led to variations of perfect-information sampling that have been successfully applied to card games (Wisser, 2015). The root of the problem lies in the deterministic information set sampling (Richards & Amir, 2009) and particle system techniques (Silver & Veness, 2010) that inspired the original HyperPlay technique (Schofield et al., 2012). Recent work by Cowling et al. (2012) into Information Set Monte Carlo Tree Search (ISMCTS) is extremely interesting and presents the best alternative to the work in this chapter. ISMCTS represents an important advancement in imperfect-information game play with the possibility of significantly improving the move selection process for online players. Its primary focus is the forward search of the game tree and the evaluation of the move options. While this thesis is primarily concerned with the deterministic sampling of play histories (a backward looking search), this chapter takes a holistic approach to the game and might benefit from lessons learned researching ISMCTS. Possible future directions are discussed in the conclusion to this chapter.

4.2 Lifting Sampling to Imperfect-Information Models

This section addresses the failure of the previous technique and offers a new technique that reasons with imperfect-information. As before a formalism is presented for a simple player based on the new technique and experiments are designed then conducted to validate the technique and to identify its strengths, weaknesses and failures. The problem being addressed is best characterised by an example.

Example 2. In the Exploding Bomb game in Figure 3.7 we see the following sequence of events:

• A spy secretly arms a bomb using either the Red wire or the Blue wire,

• A second spy must disarm the bomb or both will die,

• The second spy may ask "Which colour?" for a small cost, and

• The second spy cuts one wire.

Using the HyperPlay algorithm and a game tree search, the second spy never asks the question "Which colour?" as all the samples of an information set have perfect information and all reasoning assumes perfect knowledge.

Figure 4.1: The game tree for the Exploding Bomb. At Round 1 the agent has an information set of two nodes; that is, I_agent = {h_{v_1}, h_{v_2}}.

Example 2 continued. Evaluating the legal moves by reasoning on complete information with Monte Carlo based playouts, the agent always selects (wait) and never (ask):

eval(do(s_i, ask), ~mc, agent, 4) = 0.25 · (90 + 0 + 90 + 0) = 45

eval(do(s_i, wait), ~mc, agent, 4) = 0.25 · (100 + 0 + 100 + 0) = 50 (selected)

4.2.1 Imperfect Information Simulations

To address the limitations of the original technique, an extended technique that includes an Imperfect Information Simulation (IIS)2 in the decision making process is presented. The IIS reasons directly with imperfect information, exploring the consequences of every action in the correct imperfect-information context. The new technique reasons across larger subsets of the information partition; in fact, it encompasses the upper bound of the union of the information sets of all roles in the game. The result is that it places the correct value on information and will choose information gathering moves and information protecting moves when it is cost effective to do so.

Example 2 continued. In the context of the Exploding Bomb game the IIS might be characterised by the thought processes of an agent playing the role of the second spy. HyperPlay: "What do I know about the situation?" HyperPlay-II: "What do I know about what the first spy knows about the situation, and how will the first spy behave based on what I know that he could know that I might know?"

The term HyperPlay-II is used to indicate that this is the HyperPlay technique reasoning with imperfect information instead of reasoning with perfect-information models. Everything about the original technique remains in place, with an additional layer added in the form of the Imperfect Information Simulation (IIS). As before, the new technique requires a bag of models of an information set, representing a weighted sample. These models are updated, as before, based on moves and percepts from the game. But unlike before, they are not evaluated directly using perfect-information evaluations3; instead, each model is the mid-point of an imperfect-information playout that starts at the original starting point of the game and passes through the model as if the model were the true game. This concept is critical to the understanding of the new technique. Think of the HyperPlayer-II as sitting above a simulated game being played by HyperPlayers, watching their moves and learning from their mistakes. At every stage of the game the HyperPlayer-II simulates the lower-level game according to the percepts received and learns from it. Because the simulation is based on imperfect information, each role must be furnished with imperfect information at every round in the game, including the rounds that have already been played. The IIS is not like the common GGP approach that only considers future moves; it considers all possible past moves given all the imperfect information available in the current round.

2 In the original paper this was called the Incomplete Information Simulation.

Example 2 continued. In the context of the Exploding Bomb game, the simulation involves both a replay of previous rounds and a playout to termination. The replay is expanded to allow for the following query: HyperPlay-II: "What are the moves that the first spy might have made, and what are the percepts that the first spy might have received, based on my model of the truth?"

The logic for conducting an Imperfect Information Simulation is as follows (a minimal sketch follows the list):

1. Select a perfect-information model from the bag of models.
2. Update the play histories for each role using the HyperPlay function.
3. Select a joint-move in the current round to be evaluated.
4. Instantiate a simulation of the selected model of the game at the initial state with HyperPlayers in each role.
5. Replay the game by passing the imperfect-information play messages from the model to each role, allowing the roles to complete the play messages using the HyperPlay function and choose a move consistent with their move selection policy.
6. Make the selected joint-move in the simulation at the appropriate round.
7. Play out the remainder of the game consistent with the agent's policy.
8. Use the terminal value as a measure of utility of the joint-move chosen for evaluation.
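Steps 3-8 above might be realised, in outline, as in the sketch below. It is a minimal illustration against an assumed GGP interface (initial_state, roles, apply, is_terminal, goal_value, a model's play_messages, an hp_complete completion function and per-role policies); none of these names are taken from the thesis implementation, and steps 1-2 (selecting and updating the model) are assumed to happen outside the function.

    def iis_evaluate(game, model, joint_move, current_round, policies, hp_complete, our_role):
        """One Imperfect Information Simulation rollout (steps 3-8 above)."""
        state = game.initial_state()                     # step 4: fresh simulation from the initial state
        for rnd in range(current_round):                 # step 5: replay the rounds already played
            messages = model.play_messages(rnd)          # imperfect-information play messages from the model
            joint = {}
            for role in game.roles:
                # Each simulated HyperPlayer completes its imperfect view of the
                # history with the hp() function and applies its own policy.
                view = hp_complete(game, role, messages)
                joint[role] = policies[role].select(game, view)
            state = game.apply(state, joint)
        state = game.apply(state, joint_move)            # step 6: the joint move under evaluation
        while not game.is_terminal(state):               # step 7: play out to termination
            # Each role's own sampling of the simulated game is abstracted here;
            # its policy is simply queried on the simulated state.
            joint = {role: policies[role].select(game, state) for role in game.roles}
            state = game.apply(state, joint)
        return game.goal_value(state, our_role)          # step 8: terminal value as the move's utility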

Note that, as each role in the simulation is instantiated as a HyperPlayer, each role will be conducting a perfect-information evaluation4 before choosing a move. In this respect the HyperPlay-II technique might be characterised as a nested evaluation. Also note that some versions of the game will be played out where our role has a play history set by applying the HyperPlay function to their role, which was originally set by applying the HyperPlay function to our role. Such a playout may include nodes that are known to be outside our information set.

3 Often employing perfect-information playouts. 4 This may be any type of perfect-information search.

Figure 4.2: A visual representation of an Imperfect Information Simulation of a two player game, where each role’s beliefs are built on one half of the play history.

4.2.2 Formalism for HyperPlay-II

This formalism is recapitulated adopting the notation for finite games in extensive form. It refers to the original technique and to the definitions in the previous sections. For the correct valuation of information-gathering moves, the agent must be able to evaluate a move using some type of imperfect-information reasoning; in this case, a playout with imperfect information. To do this, the path for a sample of an information set is used for an IIS with a HyperPlayer in each role. This requires a playout starting from the initial state of the game, not just from the state in an information set. The IIS will pass through the state sampled from the current information set on its way to termination. The terminal value is then used as a measure of utility in an imperfect-information context.

Definition 9. Let $G = \langle S, R, A, P, \upsilon, \delta \rangle$ be an imperfect-information game given by a valid GDL-II description, and let $G = (V, E)$ be a game tree induced by the game, then:

1. $replay : S \times H \times \Pi \to S^{|R|}$ is the replay of a game consistent with the history of a state corresponding to a sample of an information set:
   - $replay(s_0, h_r, \vec{\pi})$ is the replay of a game, as if $h_r = hp(\xi(s_t, r), x)$, generating information sets for all roles such that $hp(\xi(h_r, i), x) \in H_i$, where $r$ is our role, $i$ is any other role and $x$ is a random number.

2. $IIS : S \times \Pi \times R \times \mathbb{N} \to \mathbb{R}$ is the imperfect information simulation:
   - $IIS(h_r, \vec{\pi}_{hp}, r, n)$ is an evaluation using an imperfect information simulation, defined as $eval(replay(s_0, h_r, \vec{\pi}_{hp}), \vec{\pi}_{hp}, r, n)$, and
   - $\vec{\pi}_{hp}$ is the move selection policy determined by a HyperPlayer.


Example 2 continued. The IIS generates multiple paths. Row one is what the agent knows; row two is a model from the agent's information set; row three shows two imperfect paths created by the IIS (one for each role); and row four shows models from the information sets of each of the IIS roles.

1   ξ(h_t, agent)                   ⟨H0, _⟩    ⟨E0, _⟩
2   hp(ξ(h_t, agent), x) → h_a      ⟨H0, A0⟩   ⟨E0, C0⟩
3   ξ(h_a, role)                    ⟨H0, _⟩    ⟨_, A0⟩    ⟨E0, _⟩    ⟨_, C0⟩
4   hp(ξ(h_a, role), x)             ⟨H0, B0⟩   ⟨H0, A0⟩   ⟨E0, C0⟩   ⟨D1, C0⟩

It is worth noting that one of the models in row four may represent a state that is inconsistent with the agent's information set. This is to be expected when the agent is considering the basis for its opponent's reasoning.

There is a clear likeness between the internal processes of the IIS and that of DNA replication. Row one in the example is a partial strand of DNA which is completed by the hp() function to give complete DNA. The IIS then splits the DNA into two strands passing each new strand to different internal agents. Each of these internal agents then completes the DNA with the hp() function.

4.2.3 Move Selection Policy

The move selection policy for the HyperPlay-II technique follows the same analytical process as the original technique. Note the similarity with the move selection policy $\vec{\pi}_{hp}$ given in Equation 3.9. In this respect, it is easy to see the characterisation of the new technique as a nested player.

Definition 9 provides for the formulation of an expression for the utility of a specific move in a specific decision state:

$$utility(d, a^r_j) = \mathop{\mathbb{E}}_{\vec{a}_k \in \lambda(d) \,:\, \vec{a}_k \ni a^r_j} \left[ IIS(\delta(d, \vec{a}_k), \vec{\pi}_{hp}, r, n) \right] \qquad (4.1)$$

Using the $IIS()$ from Definition 9, the expectation $\mathbb{E}$ aggregates all of the outcomes for move vectors that contain the specific move $a^r_j$. Note the expression of the specific move $a^r_j$ as being present in the move vector $\vec{a}_k$ in the unusual form $\vec{a}_k \ni a^r_j$. Also note that this move may be present in more than one move vector and so an expected value is taken.

Moves $a^r_j$ and $a^r_k$ from different samples in the bag may be syntactically identical. These moves form an equivalence class based upon their syntax. And so, the agent is looking for the equivalence class that has the highest expected value.

The problem is that not all equivalence classes may be present in each sample. This can be overcome by setting the weighting for that equivalence class in that sample equal to zero.

By summing the utility of the move class across all of the samples in the bag:

$$classUtility(a^r_j) = \sum_{h_i \in B_r} \begin{cases} P(h_i) \cdot utility(state(v_{h_i}), a^r_j) & \text{if } a^r_j \in \lambda(state(v_{h_i}), r) \\ 0 & \text{otherwise} \end{cases} \qquad (4.2)$$

testing to make sure the move class is legal in each of the states and applying the probability derived from the ChoiceFactor.

Finally, the move selection policy for the simple player:

$$a^r = \mathop{\mathrm{argmax}}_{a^r_j}\; classUtility(a^r_j), \qquad \forall\, h_i \in B_r,\ \forall\, a^r_j \in \lambda(state(v_{h_i}), r) \qquad (4.3)$$

finds the move class with the highest expected utility across all the legal moves across all the models in the bag $B_r$. Note that the nodes are traversed twice, once in equation (4.3) and once in equation (4.2).
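Equations (4.2) and (4.3) amount to a weighted vote over the bag of models. The following is a minimal sketch of that computation under assumed data structures: `models` (each carrying a ChoiceFactor weight and a state), `legal_moves` and `utility` are illustrative stand-ins, with `utility` playing the role of Equation (4.1); none of the names come from the thesis implementation.

    from collections import defaultdict

    def select_move(models, legal_moves, utility):
        """Pick the move class with the highest expected utility across the bag.

        models      : sampled models; each has a `weight` (its probability P(h_i)
                      derived from the ChoiceFactor) and a `state`.
        legal_moves : function state -> list of syntactic moves legal for our role.
        utility     : function (state, move) -> expected IIS value, as in Eq. (4.1).
        """
        class_utility = defaultdict(float)           # Eq. (4.2): sum over the bag
        for model in models:
            for move in set(legal_moves(model.state)):
                # Moves are grouped into equivalence classes by their syntax;
                # a class not legal in a sample simply contributes zero there.
                class_utility[move] += model.weight * utility(model.state, move)
        # Eq. (4.3): argmax over all move classes seen in any sample
        return max(class_utility, key=class_utility.get)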

As with the HyperPlay move selection policy, despite the thoroughness of the analysis it is still possible, though unlikely, to choose a move that is not legal in the true game being played by the Game Controller.

Example 2 continued. Reasoning on imperfect information using the move selection policy $\vec{\pi}_{hp}$ gives the following:

$$IIS(do(h_i, \text{ask}), \vec{\pi}_{hp}, agent, 4) = 0.25 \cdot (90 + 90 + 90 + 90) = 90 \quad \text{(selected)}$$

$$IIS(do(h_i, \text{wait}), \vec{\pi}_{hp}, agent, 4) = 0.25 \cdot (100 + 0 + 100 + 0) = 50$$

which now corrects the error experienced when using the original technique and correctly values information gathering in this game.

The use of the IIS extends the domain of reasoning to the least upper bound of the information partition, $\sup I \subseteq D$, as the $hp()$ function generates paths across an information domain that is closed with respect to what the other roles can know, based on an information set of our role. The closure of the HyperPlay function on the state of the true game being played by the Game Controller:

$$I_r = Cn(hp(\xi(h_t, r), x)) \qquad (4.4)$$

is a role's information set. Every history in an information set for one role $r'$ generates an information set for any role $r$ such that:5

$$I_r = Cn(hp(\xi(Cn(hp(\xi(h_t, r'), x)), r), x)) \qquad (4.5)$$

The closure of all of the imperfect-information play histories based on all of the possible play histories significantly extends the domain for reasoning. The union of all such closures:

$$\sup I_{r'} = \bigcup_{r \in R} Cn(hp(\xi(Cn(hp(\xi(s_t, r'), x)), r), x)) \qquad (4.6)$$

provides the least upper bound of the information partition used by the HyperPlay-II process. It is both the expanded domain and the use of imperfect-information reasoning that gives the new technique an advantage over its predecessor. That is to say, there is an improvement in both the quantitative and qualitative aspects of the agent's reasoning.

5 In the case of the one role, the newly generated information set is the same as the original information set.

4.3 Experimental Results

This section presents a summary and interpretation of the experiments shown in Appendix B.

4.3.1 Testing the Player

To validate the claim that HyperPlay-II correctly values information-gathering moves, a version of the new technique was instantiated using the move selection policy outlined in Equation (4.3). Games were selected from a variety of game topologies to cover different aspects of imperfect information. Games played at the Australasian Joint Conference on Artificial Intelligence 2013 were used as inspiration for the experiments. A summary of the experimental set up presented in Appendix B is given below.

Design of Experiments A series of experiments was designed to test the capabilities of the new technique using the high-throughput computer facilities at the School of Computer Science and Engineering. As with previous research, the game server was modelled along with both players in a single thread so it could be parallelised across many CPUs. Each test was repeated one hundred times.

Game Play For two player games an opponent is instantiated who uses the HP technique and is adequately resourced so as to be competitive.

Measuring Performance The number of states visited by the player in playing the game is used as the measure of computational resources, being careful to measure this across the multiple samples of the information set and to include the states visited in the backtracking of invalid samples6. So the measure includes both the states visited in creating the samples plus the states visited in the playouts.

Player Resources In each experiment the player resources were varied to demonstrate the performance compared to resources. Care was taken to ensure that each player had equal resources when making a move selection. This was achieved by setting each player’s parameters such that they would visit a similar number of states each round. Some experiments were conducted with two-player games, pitting the original player against the new player using equal resources. A resource index n = 4 gives the new player four HyperGames, each running an IIS with four HyperGames. The old player would get

6 This may consume half of the resources in some games.

n = 16 HyperGames. A player resource index of zero represents random decision making and serves to provide a basis for improvement.

4.3.2 Exploding Bomb

Experimental Results The results in Table 4.1 show the averages and confidence intervals from a batch of 100 games played. The agent receives a 10 point penalty for asking which wire was used to arm the bomb, and the answering agent always answers honestly. The choices being made by each of the players are clear cut, with the old technique opting for a greedy strategy and never asking, while the new technique applies a cost/benefit analysis to achieve the optimum score.

round   agent does     HyperPlay       HyperPlay-II
1       ask            45.04 ± 0.09    90.00 ± 0.00
1       wait           49.98 ± 0.10    49.91 ± 0.64
2       cut unarmed    49.40 ± 1.19     0.00 ± 0.00
2       cut armed      50.60 ± 1.19    90.00 ± 0.00

Table 4.1: Experimental expected score calculations made by the players during the Exploding Bomb decision-making process. The bold scores indicate the chosen actions.

Interpretation The original player never asks the question in this game since it thinks it already knows the answer due to superficial agreement of its samples and so it believes it can avoid the modest penalty. In contrast, the new player plays out the greedy strategy to see that it does not work and then correctly identifies that asking the question gives the best expected outcome. This is a simple but effective validation of the ability of the new technique to overcome the Strategy Fusion Error.

4.3.3 Spy v Spy

Experimental Results The results in Table 4.2 show the averages and confidence intervals from a batch of 100 games played. The arming agent arms the bomb randomly and then decides whether to honestly tell the other agent which wire was used. There is a cost to keeping the information private.

arming agent does     HyperPlay       HyperPlay-II
arm blue and tell     60.00 ± 0.15    20.00 ± 0.00
arm red and tell      60.04 ± 0.14    20.00 ± 0.00
arm blue and hide     39.98 ± 0.16    40.36 ± 1.22
arm red and hide      39.99 ± 0.14    39.45 ± 1.33

Table 4.2: Expected score calculations for the arming agent in round one of the Spy vs. Spy decision-making process. The bold scores indicate the chosen actions.

Interpretation The original player always tells its opponent which wire was used to arm the bomb to avoid the penalty. The new player recognizes that hiding this information yields a better expected outcome and keeps the information secret.

4.3.4 Number Guessing

Figure 4.3: The Number Guessing results for HyperPlay-II.

Experimental Results The results in Figure 4.3 show the averages and confidence intervals from a batch of 100 games played. The player can ask questions about the range of the number at a cost, then announce it is "ready to guess". Binary search plays perfectly here, guessing correctly after four questions.

Interpretation The original player always announces it is "ready to guess", but then guesses randomly from one of the 16 numbers, giving a 6.25% chance of guessing correctly. The new player only guesses the number when all playouts agree on the result. A binary search means guessing after four questions, and in Figure 4.3 the new player approaches this score. The original player suffers from the Strategy Fusion Error: each HyperGame believes its model is the true game and therefore knows the answer, so all agree that they are ready to guess the answer. The new technique, properly resourced, will conduct a binary search7. There is a clear log-log relationship between the resource index and the CPU resources required to operate the new technique.

4.3.5 Banker and Thief

Experimental Results The results in Figure 4.4 show the averages and confidence intervals from a batch of 200 games played with alternating first roles. The optimal score for the banker is $40, and for the thief is $100, giving an average score of 70. When the player resources are low the players tend to make random choices, giving an average score of 25.

Figure 4.4: The Banker and Thief results.

Interpretation The results show that the original technique adopts a greedy policy and places $100 in its bank, only to have it stolen. The new technique, adequately resourced, will deposit $40 of the money in its bank and create a decoy of $60, leaving the remaining banks empty, relying on a greedy thief to attempt to steal the $60 in the decoy bank. The new technique reaches the optimal average of (40 + 100)/2 = 70 at a resource index of eight, as it correctly models both roles.

7 Watching the game play revealed a first guess that would approximately halve the size of the information set.

4.3.6 Battleships In Fog

Experimental Results The results in Figure 4.5 show the averages and confidence intervals from a batch of 100 games played with alternating first turns. The game is win/lose with a small chance of getting in a lucky first shot. The score for the HyperPlayer is not shown on the graph, but is constant at 9.4 for all resource levels.

Figure 4.5: The Battleships In Fog results.

Interpretation The original player sees no value in scanning as all of the samples "know" where the opponent is. It does not value moving after being scanned as it thinks its opponent always knows where it is. Its only strategy is to randomly fire missiles, giving it a 6.25% chance of a hit on a 4x4 board. The agent using the new technique will scan for the opponent and fire a missile. A resource index of four is sufficient for the new player to dominate the old in this turn-taking game: HyperPlay has a 9.4% chance of winning with a random shot (12.5% if it goes first, half that if it plays second).

4.4 Conclusions

This extended technique is able to play a much larger class of games by reasoning on imperfect-information models. It does value information correctly according to the expected cost/benefit, and it does require additional resources.

4.4.1 Strengths

The experimental results show the value the new technique places on information, and how it correctly values information-gathering moves and secret-keeping, both for itself and its opponents. It is able to collect information when appropriate, withhold information from its opponents, and keep its goals secret. The use of Imperfect Information Simulations is an effective tool for reasoning with imperfect information. A HyperPlayer-II was easily able to outperform an equally resourced HyperPlayer in all of the experiments.

4.4.2 Weaknesses

We observe that the new technique is a "resource pig", as it uses nested playouts to evaluate move selections, and we have genuine concerns about its ability to scale up to larger games. These concerns motivate the next section of this chapter. Also, the IIS is effectively a search and can be influenced by the type of search. It has been observed that a simple search is susceptible to shallow traps; this will be followed up in future work.

4.4.3 Limitations of HyperPlay-II

There is an interesting class of games, requiring what is known as "coordination without communication" (Fenster et al., 1995), that goes beyond what this technique can achieve.

Consider the following cooperative variant of the Spy vs. Spy game. Spy1 sees which wire is used to arm a bomb. They then signal the name of a colour to Spy2, who must try to disarm the bomb. Both win if Spy2 cuts the right wire and lose otherwise. Clearly Spy1 has an incentive to help Spy2, and there is one obvious way to do this: signal the colour of the wire. The crux, however, is that the game rules can be designed such that the colour being signalled is logically independent of the colour of the armed wire. Whilst a human spy would see the syntactic similarity in the colours and hence the semantic link, the logic of the AI sees them as merely labels and does not make the connection.

4.4.4 Future Work

Information Set MCTS (Cowling et al., 2012) was mentioned at the beginning of this chapter. It represents an important advancement in imperfect-information game play with the possibility of significantly improving the move selection process for online players. Its primary focus is the forward search of the game tree and the evaluation of the move options. This thesis focuses on the deterministic sampling of play histories (a backward looking search) and uses simple playouts for move evaluation. Future work in this area would be the next priority and is likely to include:

• Can HyperPlay be coupled with ISMCTS to produce a superior player,

• Can ISMCTS overcome the Strategy Fusion Error and replace HyperPlay-II which is a "resource pig", and

• Can a player based on ISMCTS successfully play games like The Monty Hall Game which requires a weighted particle filter, and if not, can the HyperPlay ChoiceFactor be incorporated into such a player?

The hurdle to be overcome in implementing ISMCTS is the time budget restriction for online decision making in GGP-II. There is a high computational load for any type of MCTS, with successful players being parallelised across many CPUs.

Chapter 5

Scalability of HyperPlay

This chapter explores the claim from the previous chapter that HyperPlay-II is a "resource pig" and does not scale well. As a nested player it is anticipated that it will scale at $O(n^2)$. This is tested experimentally across a number of common GGP-II games and the original claim is supported by the experimental results. Additionally, the games used in the experiments are representative of a set of games that satisfy certain properties, so as to begin a topological examination of imperfect information in all GGP-II games.

5.1 Introduction

5.1.1 Publications

This chapter expands our published work:

• Schofield, M., Thielscher, M. (2016b). The scalability of the hyperplay technique for imperfect-information games. In Proceedings of AAAI Workshop on Computer Poker and Imperfect Information Games.

This was a workshop paper only, not a conference paper.

• This researcher investigated the scalability of both HyperPlay techniques then designed and conducted the experiments, as well as identifying the broad topological classifications of imperfect-information in General Game Playing so as to select suitable games for the experiments,

• Thielscher, M. supervised the process.

• Cazenave, T., Saffidine, A., Schofield, M., Thielscher, M. (2016b). Nested monte carlo search for two-player games. In Proceedings of Thirtieth AAAI Conference on Artificial Intelligence.

This was a joint effort between four authors to investigate the effectiveness of nested players in General Game Playing.

• This researcher contributed the idea of discounting the payoff based on the depth of the playout and using that information to prune the search and was the author of the conference version of the paper,

• Cazenave, T. designed and conducted experiments for most of the games,

• Saffidine, A. was the author of the workshop version and contributed to the theory and special cases, and

• Thielscher, M. contributed to the theory and supervised and coordinated the process.

• Cazenave, T., Saffidine, A., Schofield, M., Thielscher, M. (2016a). Discounting and pruning for nested playouts in general game playing. In Proceedings of The IJCAI-15 Workshop on General Game Playing.

This was a workshop version of the conference paper.

5.1.2 Motivation

Motivation for this research is the notion that the HyperPlay-II technique is a "resource pig" as it uses nested playouts to evaluate move selections.

This chapter focuses on the HyperPlay-II technique and its consumption of computational resources for a particular level of performance, compared to the original process, for a variety of games, as well as measuring the increase in resources used by both techniques when a game is scaled up. Several pruning techniques that have had some success in nested perfect-information players are implemented and the reduction in resources is measured. These experiments seek to examine the cost of the imperfect-information aspects of a player, not the embedded perfect-information search techniques. While the latter is fertile ground for improvement, the focus is on the resources consumed by the imperfect-information algorithms.

5.1.3 Related Research

There has been no previous work published that investigates the scalability of the HyperPlay technique. As part of the experimentation reported in this chapter there is a reference to game topology, which helps to differentiate the various aspects of imperfect-information games. This represents a new topic of research in General Game Playing, but is beyond the scope of this thesis. This chapter introduces variations in game play, reward structures, player percepts, optimal play strategies, game tree, and scalability as dimensions in game topology. Again, no previous work has been done on game topology in General Game Playing.

5.2 Scalability of HyperPlay

5.2.1 States Visited

The new technique utilizes a nested playout for evaluating move choices, which causes a significant increase in the number of states visited during the analysis. However, because of imperfect information the nested playout must start from the initial state of the game, not the current round. This doubles the number of states visited in a game compared to the perfect-information version of a nested playout.

Figure 5.1: Counting the States Visited in a simple search.

Consider the simple game depicted in Figure 5.1 and calculate the order of magnitude of the search space for a simple Monte Carlo search:

\begin{align*}
\text{Game Depth} &= d \quad \text{(including initial state and terminal)}\\
\text{Round Number} &: 1 \le n < d\\
\text{Playout Depth} &= (d - n)\\
\text{MC Playouts} &= bf \cdot (d - n) \text{ states visited}\\
\text{Whole Game} &= \sum_{n=1}^{d-1} bf \cdot (d - n) = bf \cdot d \cdot (d-1)/2 \text{ states visited}\\
&= O(bf \cdot d^2) \qquad (5.1)
\end{align*}

Now consider the search space for an Imperfect Information Simulation where each round of the game involves more states being visited than a whole game of Monte Carlo Playouts:

\begin{align*}
\text{Game Depth} &= d \quad \text{(including initial state and terminal)}\\
\text{Round Number} &: 1 \le n < d\\
\text{Playout Depth} &= d\\
\text{Playout States} &= bf \cdot d \cdot (d-1)/2 \text{ states visited}\\
\text{IIS Playouts} &= [bf \cdot d \cdot (d-1)/2] \cdot bf \cdot (d - n) \text{ states visited}\\
\text{Whole Game} &= \sum_{n=1}^{d-1} [bf \cdot d \cdot (d-1)/2] \cdot bf \cdot (d - n) = bf^2 \cdot [d \cdot (d-1)/2]^2 \text{ states visited}\\
&= O(bf^2 \cdot d^4) \qquad (5.2)
\end{align*}

ignoring the number of roles being modelled in the simulation and the meta game being played out with $O(bf \cdot d^2)$, and also ignoring the number of playouts per move choice and the number of samples taken of an information set. The comparison of (5.1) and (5.2) represents a lower bound on the scalability factor and shows a significant increase in the computational resources required to play a game: from $O(bf \cdot d^2)$ for the HyperPlay based player to $O(bf^2 \cdot d^4)$ for the HyperPlay-II player.
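As a rough cross-check of these bounds, the sketch below evaluates the order-of-magnitude expressions for a given branching factor and depth. It reproduces the roughly 13-fold and 165-fold increases quoted later for up-scaling Hidden Connect from a 3x3 board (bf = 3, d = 9) to a 5x5 board (bf = 5, d = 25); the helper names are illustrative only.

    def hp_order(bf: int, d: int) -> int:
        """Order-of-magnitude state count for the HyperPlay player, O(bf * d^2)."""
        return bf * d ** 2

    def hp2_order(bf: int, d: int) -> int:
        """Order-of-magnitude state count for the HyperPlay-II player, O(bf^2 * d^4)."""
        return bf ** 2 * d ** 4

    if __name__ == "__main__":
        small, large = (3, 9), (5, 25)       # Hidden Connect 3 (3x3) vs Connect 4 (5x5)
        for name, f in (("HP", hp_order), ("HP-II", hp2_order)):
            print(name, round(f(*large) / f(*small)))   # HP ~13, HP-II ~165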

5.2.2 Imperfect-Information Game Topology

In the General Game Playing domain for imperfect-information games, the rules of the game and the reward structure are fully known to each player. What is not automatically known are the moves made by other players in the game. Players receive percepts from the Game Controller according to the rules of the game expressed in GDL-II. In this context there can be many ways that one game can vary from another:

Game Play is it turn-taking or simultaneous or determined by the game play,

Reward Structure is it constant sum or variable,

Player Percepts can vary in an almost infinite way,

Optimal Play Strategy may have an impact on rewards or may be inconsequential,

Game Tree may vary in depth and branching factor, and

Scalability is useful for making a simple game into an impossible search.

To examine the question of scalability of the HyperPlay-II technique these game topologies have been re-characterised by some game variants that focus on the nature of information in the game. These variations are:

• Imperfect move information,

• Imperfect initial information,

• Information purchasing,

• Hidden objectives, and

• Tactical information valuing.

A game representing each type is used in the experiments, with the topologies for each game presented in more detail in Appendix E.

5.2.3 Heuristics and Pruning

Using a heuristic to improve the search and/or pruning the search space are effective ways to improve the computational efficiency of the move selection process. Several techniques are examined by implementing them in a Nested Monte Carlo player (Cazenave, Saffidine, Schofield, & Thielscher, 2016a), as there is a high degree of similarity between the nesting in this player and the nesting in the new technique.

5.2.4 Discounting

Discounting is a way of improving the information extracted from a playout to give a rich set of terminal results instead of the usual 0 or 100. Discounting based on the depth of the playout has been demonstrated (Cazenave et al., 2016a) to improve the search performance and to facilitate search pruning; a minimal sketch is given below. Discounting can only be effective in games with variable playout depth, so an agent's performance will not be improved when playing fixed-depth games.
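By way of illustration only, a depth discount might be applied to a playout result as in the following sketch; the linear rate and the 0-100 reward convention are assumptions made for the example, not the exact scheme of Cazenave et al. (2016a).

    def discounted_result(terminal_value: float, playout_depth: int,
                          rate: float = 1.0) -> float:
        """Discount a terminal value by the depth of the playout.

        A win found quickly keeps most of its value, while a win found deep in
        the playout is worth slightly less; losses are discounted the other way
        so that a long loss scores a little better than a quick one.
        """
        if terminal_value >= 50:                         # treat >= 50 as a winning outcome
            return max(50.0, terminal_value - rate * playout_depth)
        return min(50.0, terminal_value + rate * playout_depth)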

5.2.5 Pruning

Cut on Win This technique works well with a Nested Monte Carlo player in turn-taking two-player win/loss games (Cazenave et al., 2016a), but has problems being implemented in games where players purchase information. The Cut on Win (CoW) technique requires a strict win/loss reward structure to be effective. A variation is explored where the player "knows" the maximum achievable score under optimal play conditions and uses that as a cut-off point for the CoW pruning.

Pruning on Depth This technique also works well with a Nested Monte Carlo player in turn-taking two-player win/loss games (Cazenave et al., 2016a). Pruning on Depth (PoD) is ineffective when the playout depth is fixed.

5.3 Experimental Results

This section presents the experimental results along with comments that explain and highlight without drawing any conclusions. Full details of the experiments are given in Appendix C. A summary of the experimental set up is given below.

Design of Experiments Experiments were designed to answer two basic questions:

• Does HP-II perform better than HP at this type of game, and at what computational cost; and

• What is the impact of up-sizing the game on the computational cost for HP-II to achieve the same level of performance.

Game Play For two player games there needs to be a consistent opponent to make some useful measurements. To this end, an opponent is instantiated who uses the HP technique and is adequately resourced so as to be competitive.

Measuring Performance The number of states visited is used as the measure of computa- tional resources, being careful to measure this across the multiple samples of the information set and to include the states visited in the backtracking of invalid samples. So the measure includes both the states visited in creating the samples plus the states visited in the playouts.

Player Configuration Preliminary experiments were conducted to find the best configuration of both players for each game. The intent was to show the best performance for each player in terms of maximising the score and minimising the number of states visited. Once that configuration was found, multiples of 2 were used to produce characteristic curves presented in the results.

Game Variants The game variants were chosen to cast the strongest light on the experimental aims and show the best possible performance for both players. With many game variations and player variations, configurations were chosen that gave a fair representation of the relative performances in the context of the imperfect information present in each game.

5.3.1 Hidden Connect

Experimental Results Two variants of the game were used: connect 3 in a 3x3 grid and connect 4 in a 5x5 grid. As there are no information-gathering or information-purchasing moves, the HP player performed as well as the HP-II player but consumed considerably fewer resources, with both players improving their performance as they visited more states. Each line connects results from a similar game configuration and each plot point represents a doubling1 of the samples taken and the number of playouts per move choice2.

Figure 5.2: Experimental results showing a comparison of resources used by various configurations of both techniques for the Hidden Connect game. The labels refer to the player and the size of the game: HP,3 refers to HyperPlayer playing Connect 3, etc..

Interpretation The experimental results in Figure 5.2 show that there is a significant increase in resources for each game and player variant. The first plot point for each line represents one playout for each move choice and one sample of each information set, thereby eliminating the resource factor from the scalability equation and giving the best comparison. Also, with such poorly resourced players each game is likely to play out in the same way. The tan numbers show increases for the same game configuration but different players. For Connect 3, the HP-II player is 12 times as expensive and for Connect 4, the HP-II player is 148 times as expensive for the first resource configuration. The green numbers show increases for the same player but different game configurations. The HP player increased its cost 33 times when the game was upscaled and the HP-II player increased its cost 390 times when the game was upscaled.

1 This is generally true, but not exactly true for the first two plot points for the HP-II player. 2 An increase of 4 fold, in the states visited, for the HP player and 16 fold for the HP-II player could be expected if all games played out in exactly the same way.

Equation 5.1 gives the HP player $O(bf \cdot d^2)$ and the HP-II player $O(bf^2 \cdot d^4)$, but the two players will play differently, impacting the values of $bf$ and $d$, so a direct comparison cannot be made. At best it is possible to examine the relative increase for similarly resourced players when the game is upscaled (the tan numbers) and consider that the relative increase should be $O(R^2)$, where $R$ is the ratio of the number of states visited.

The green numbers show a different result, which is less than expected for $O(R^2)$, but an inspection of the second plot point shows a 19 to 320 ratio, which is much closer to the expected result.

Equation 5.1 indicates that the increase from the 3x3 game, with a branching factor of 3 and playout depth of 9, to the 5x5 game, with a branching factor of 5 and depth of 25, would give a predicted 13-fold increase for the HP player and a 165-fold increase for the HP-II player.

5.3.2 Mastermind

Experimental Results Two variants of the game were used, two colours in three positions with three guesses, and three colours in four positions with four guesses. The reward was pro rata for the number of correct positions. The number of guesses was restricted to see if an increase in resources would improve the guessing strategy. This game has no hidden game play, only a hidden initial setting created by the random player. As such, even the simplest HP player was able to solve the puzzle.

Figure 5.3: Experimental results showing a comparison of resources used by various configurations of both techniques for the Mastermind game.

Interpretation The experimental results in Figure 5.3 show a very flat performance from both players. This game is solved more in the backtracking of invalid samples and less in the cleverness of the guesses. Whilst a skilled player might achieve a binary search, even the least resourced player was able to achieve an optimal result3. The results show that there is a significant increase in resources for each game and player variant. The first plot point for each line represents one playout for each move choice and one sample of each information set, thereby eliminating these from the scalability equation and giving the best comparison. The increase for the same game configuration but different players is 11 times for the 2x3 version of the game and 109 times for the 3x4 version of the game. The increases for the same player but different game configurations show the HP player increased its cost 16 times when the game was upscaled and the HP-II player increased its cost 166 times when the game was upscaled.

5.3.3 Number Guessing

Experimental Results Variants of 4, 8 and 16 numbers were used in this experiment. As expected, the HP player was unable to correctly value the information-gathering moves and performed no better than a random player, whereas the HP-II player tended towards optimum play as the resources were increased.

Figure 5.4: Experimental results showing a comparison of resources used by various configurations of both techniques for the Number Guessing game.

3 Given there were limited guesses.

Interpretation The experimental results in Figure 5.4 show the HyperPlay-II player required significantly more resources than its HP counterpart. Also note that a doubling of the game size increased the computational resource requirements by several orders of magnitude for the same level of performance4. Without any increase in game play complexity there is an expected theoretical increase of 10 fold for a doubling of the game size. The results show that there is a smaller increase in resources for each game and player variant than in other games. The first plot point for each line represents one playout for each move choice and one sample of each information set, thereby eliminating the resource factor from the scalability equation and giving the best comparison. The increases for the same game configuration but different players are 10 times for the 4 number version of the game, 25 times for the 8 number version of the game and 53 times for the 16 number version of the game. The increases for the same player but different game configurations show the HP player increased its cost 3.0 and 2.7 times when the game was upscaled and the HP-II player increased its cost 7.2 and 5.8 times when the game was upscaled.

5.3.4 Banker and Thief

Figure 5.5: Experimental results showing a comparison of resources used by various configurations of both techniques for the Banker and Thief game.

4 Optimal play with 4 number scores 80, 8 numbers scores 70 and 16 numbers scores 60.

Experimental Results Variants with 2 and 4 banks and deposits of 10 × $10.00 are used in these experiments. As expected the HP banker uses a greedy strategy when making deposits and falls victim to the thief, whereas the HP-II player tended towards optimum play as the resources were increased5.

Interpretation The experimental results in Figure 5.5 show a relatively small shift from one game variant to the next, as this game has a fixed depth and only a variable branching factor. Theoretically there should be a scaling factor of 2 for the HP player and 4 for the HyperPlay-II player from one game variant to the next. However, backtracking states are being measured as well as playout states. Also note that the HyperPlay-II player requires resources an order of magnitude more than its HP counterpart. The results show that there is a similar increase in resources for each game and player variant. The first plot point for each line represents one playout for each move choice and one sample of each information set, thereby eliminating these from the scalability equation and giving the best comparison. The increases for the same game configuration but different players are 14 times for the 2 bank version of the game and 23 times for the 4 bank version of the game. The increases for the same player but different game configurations show the HP player increased its cost 1.9 times when the game was upscaled and the HP-II player increased its cost 3.0 times when the game was upscaled.

5.3.5 Battleships in Fog

Experimental Results Variants of 3x3, 4x4 and 5x5 grids were used with a game length of 10 moves. This is a tactical game where players must evaluate every round for a tactical advantage. The HP player plays little better than random, with a score just above 50 due to some lucky first shots, whereas the HP-II player tends towards optimum play as the resources are increased.

Interpretation The experimental results in Figure 5.6 show the HyperPlay-II player required significantly more resources than its HP counterpart. Also note that an effective doubling of the game size6 increased the computational resource requirements by an order of magnitude for the same level of performance. Without any increase in game play complexity, and hence the length of the game, the theoretical increase for the HP player would be 2 fold for a doubling of the game size, and for the HP-II player would be 4 fold.

5 Optimum play rewards $40.00 by creating a false target of $60.00. 6 From 3x3=9 to 4x4=16.

Figure 5.6: Experimental results showing a comparison of resources used by various configurations of both techniques for the Battleships in Fog game.

The results show that there is a smaller increase in resources for each game and player variant than in other games. The first plot point for each line represents one playout for each move choice and one sample of each information set, thereby eliminating these from the scalability equation and giving the best comparison. The increases for the same game configuration but different players are 47 times for the 3x3 version of the game, 130 times for the 4x4 version of the game and 307 times for the 5x5 version of the game. The increases for the same player but different game configurations show the HP player increased its cost 3.4 and 3.0 times when the game was upscaled and the HP-II player increased its cost 9.5 and 7.0 times when the game was upscaled.

5.3.6 Discounting and Pruning

Discounting and pruning were implemented for the HP-II player to gauge their effectiveness. A fully resourced version of the player was chosen so as to give the greatest opportunity for improvement.

For each of the games, each of the variations of discounting and pruning was implemented, and the average score and average number of states per game are reported below.

HP-II player

Game                 Enhancement    Score    States
Hidden Connect       None           78.8     768,687
(3x3)                CoW            63.6     548,684
                     Discounting    79.4     753,236
                     PoD            52.7     520,416
Mastermind           None           83.1     307,619
(2x3)                CoW            81.2      68,300
                     Discounting    84.1     303,373
                     PoD            81.6      82,812
Number Guessing      None           78.2     677,314
(4)                  CoW            70.6     677,190
                     Discounting    78.3     672,235
                     PoD            70.5     699,806
Banker & Thief       None           28.3     743,705
(2x10)               CoW            14.9     660,727
                     Discounting    27.1     744,103
                     PoD            26.7     744,071
Battleships in Fog   None           89.1     5,236,047
(3x3)                CoW            42.3     2,482,041
                     Discounting    86.6     5,157,731
                     PoD            48.8     3,027,023

Table 5.1: Results of pruning the search space on player performance

Hidden Connect The 3x3 grid was used and shows only a slight reduction in states visited using pruning, but a significant degradation of performance.

Mastermind The two colour, three position version was used and shows a significant reduction in resources using pruning with no degradation of performance.

Number Guessing The 4 number version of this game was used and shows no real improvement from pruning, in fact pruning extended some games, reducing the score and increasing resources per game.

Banker and Thief The 2 bank, 10 deposit version of this game was used and shows only a small reduction in resources for Cut on Win, but a significant degradation of performance.

Battleships in Fog The 3x3 grid for this game was used and shows a significant reduction in resources using pruning, but a significant degradation of performance.

5.4 Conclusions

5.4.1 HP versus HP-II

When the topology is favourable the HP player performs as well as the HyperPlay-II player, improving its score as resources increase and reaching the same level of optimal play. Therefore, it can be concluded that the HP player is an acceptable choice, except where the game topology makes it ineffective.

5.4.2 Computation Cost of HP-II

The HyperPlay-II player requires significantly more resources to instantiate than the HP player. In each of the games tested, the number of states visited increased by an order of magnitude. The only benefit in using the HyperPlay-II player is that it correctly values information. Therefore, it can be concluded that the HP player should be the first choice, except where the game topology makes it ineffective.

5.4.3 Up-sizing the Game

In all of the games tested there was a significant impact when the game was up-sized. This was consistent with the theoretical analysis that characterised the HP player as $O(bf \cdot d^2)$ and the HyperPlay-II player as $O(bf^2 \cdot d^4)$.

5.4.4 Discounting

Referring to Table 5.1, in all of the games discounting had little impact on the outcome. In games with fixed depth, discounting is known to have no impact. In the other games, discounting did not hasten the win or prolong the loss in any real way.

5.4.5 Pruning

Referring to Table 5.1, there was only one game out of five where pruning had a positive impact. Cut on Win and Pruning on Depth are known to be safe (Cazenave et al., 2016a) for Nested Monte Carlo players with perfect information. The results from Banker and Thief, and Battleships in Fog, suggest they may not be safe in Imperfect Information Simulations, but the reason is not clear7.

7 Samples of an information set may not contain the same legal moves, but to offer this as a reason would be speculation.

5.4.6 General

The HyperPlay-II player will always play as well as the HP player, and will correctly value information in the context of the reward structure and the expected outcome of the game, whereas the HP player falls into the trap of elevating a sample to fact and consequently values information at zero. The player of choice should be the HP player, only utilizing the information-valuing properties of the HyperPlay-II player when the game topology dictates.

Chapter 6

The Efficiency of the HyperPlay Technique Over Random Sampling

This chapter tests the efficiency and effectiveness of the HyperPlay technique as a method of taking a deterministic sample of an information set. The basis for the test is a comparison with a random sample taken by tracing out a play history from the initial state to the current round. Games were chosen as a representative sample of General Game Playing with Imperfect Information and experiments were conducted to make a comparison. A case study was conducted with one of the most common games to show that random sampling becomes impossible to conduct within a competition time-frame, whereas HyperPlay can continue to function adequately. Additionally, this chapter adds two security games, new to General Game Playing, to the basket of test games.

6.1 Introduction

6.1.1 Publications

This chapter expands our published work:

• Schofield, M., Thielscher, M. (2017). The efficiency of the hyperplay technique over random sampling. In Proceedings of Thirty First AAAI Conference on Artificial Intelli- gence.

This paper was the first critique of the efficiency and effectiveness of the HyperPlay process:

• This researcher investigated the ways in which the technique could be worse than a random sample then designed and conducted the experiments, as well as converting the security games to GDL as part of the suite of games used in the experiments,

• Thielscher, M. supervised the process.

6.1.2 Motivation

The motivation for this chapter is proving the efficiency and effectiveness of the HyperPlay technique. Can it provide a sample of an information set more efficiently than a random sampling approach? Are there games where random sampling is impossible? Will the sample be uniformly distributed across an information set, thereby providing an unbiased starting point for evaluation? If the sample is biased, can this be rectified? This chapter explores all of these questions using a basket of games that represent all aspects of "imperfect" information. Experiments are designed to expose the worst aspects of HyperPlay and look at the impact on competitive game-play, as well as remedies for any shortcomings. Additionally, the games available in the GGP community are supplemented by introducing GDL-II versions of two popular security games: the Transit game and the Border Protection game. Each game is reproduced in GDL-II as part of the basket of games used to test queries. Details of the conversion are presented later in the chapter and a sample of the Transit game code is presented.

6.1.3 Related Research

Figure 6.3 shows a partial GDL-II description of the Transit security game adapted for GGP. This game is typical of many of the security games appearing recently in the literature.

Security Games

Security games have been a topic of research recently, with some success in real world applications, for example public transport (Yin et al., 2012). These games have become more accessible due to improvements in solving massive imperfect-information games such as Poker (Bowling, Burch, Johanson, & Tammelin, 2015) and the application of those solution techniques to other games. The challenge is overcoming the size of the search space and the need for a behavioural strategy that maps every action in every state to a probability, which inhibits on-line play. This chapter presents an adaptation of the Transit game and Border Protection (Bošanský, Jiang, Tambe, & Kiekintveld, 2015) and gives an equivalent game in GDL-II.

Transit Game The Transit Game sees an evader travel from left to right in the grid in Figure 6.2 while the defender tries to apprehend them. This replicates the representation by Lisý, Davis, and Bowling (2016), which was solved using counterfactual regret minimization. Signals are added to provide the defender with in-game percepts. This signal overlay allows for in-game response adjustments in two respects: firstly there is the obvious "he's over there", which collapses an information set, and secondly the more subtle "can't see any evaders", which invalidates some models of the game causing re-sampling. It is the second aspect of signalling that is of interest in this chapter.

Border Protection The Border Protection game has an evader attempting to escape detection en-route with the protector covering the arrival points and receiving random signals from locations where no evader was detected. It was also presented by Lisý et al. (2016) and solved using counterfactual regret minimization.

6.2 Sampling an Information Set

This section looks at the issues associated with taking deterministic samples of an information set and compares the HyperPlay technique and random sampling. It explores the weaknesses of both techniques and formulates a means of comparison.

6.2.1 Taking a Sampling of an Information Set

Every game induces a game tree (Schiffel & Thielscher, 2014) whose vertices (nodes) map to states in the game and whose edges map to joint move vectors. The vertices can be uniquely described by the path of edges from the tree root to the particular node. These edges form a play history of joint moves. An agent may have imperfect-information play histories, and therefore may not be able to distinguish one history from another, giving rise to an information set being defined as "the set of indistinguishable histories". Some histories in an information set are more likely than others, and so a weighted sample1 is used, based on the choices made along the path of the history. In order to reason, the agent grounds the unknown variables to create a model of the game, and hence a deterministic sample of an information set. Such a sample is used as the starting point for a forward search of the game tree. In order to generate the perfect-information histories required, each missing move must be substituted with one of the moves that was legal at the time, such that all percepts arising from this substitution agree with the known percepts from the game. This is a deterministic approach, starting from the initial state of the game and working towards the current round. As the game progresses such valid substitutions become more difficult.

6.2.2 Move Selection

The agent seeks to maximize the perceived payoff using a forward search of the game tree originating from each legal move choice. There are many techniques, UCT (Schäfer, Buro, & Hartmann, 2008) and (Clune, 2007) to name a few. In contrast to these, ISMCTS (Cowling et al., 2012) overlays the individual searches into a single tree. However, like the others, ISMCTS still guesses the hidden information. It is this guessing of the hidden information that is the focus of this chapter.

1 This is sometimes called a weighted particle filter.

6.2.3 Random Sampling

The random sampling process starts at the root node and makes substitutions in the imperfect-information play history using the legal move choices in the state corresponding to the node. As each joint move vector is selected, the successor function is applied and percepts are received. If there are no legal move choices2, or the percepts disagree with the percepts received from the Game Controller, then the sample is invalid and the process is started again. The probability of taking a valid sample randomly is the product of the probabilities of randomly choosing a valid legal move in each round of the game. The equations use $h$ for history, $1 \le i \le n$ for round, $\vec{a}$ for joint move vector, $s$ for state, $G$ for game, $\rho()$ for percepts arising from actions, and $\delta()$ as the successor function. At each round a valid sample is determined by the equality of the imperfect-information play histories:

$$Valid(\vec{a}_i) = [\rho(\delta(s_i, \vec{a}_i), r) = \rho(G_{i+1}, r)] \qquad (6.1)$$

If the percepts produced for role $r$ from the enacting of the randomly chosen joint move vector are equal to the actual percepts received from the Game Controller, then this is a valid grounding in that round. The probability of randomly selecting a valid joint move vector is:

$$P(Valid(\vec{a}_i)) = |\{Valid(\vec{a}_i)\}| / |\{\vec{a}_i\}| \qquad (6.2)$$

expressed as the size of the set of valid joint move vectors compared to the size of the set of all joint move vectors. The overall probability of randomly constructing a valid play history is:

$$P(Valid(h)) \le \prod_{i=1}^{n} P(Valid(\vec{a}_i)) \qquad (6.3)$$

less than or equal to the product of all of the probabilities along the path. The inequality comes from the possibility of making a valid choice in an early round that produces a dead end for all subsequent joint moves.
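A minimal sketch of this rejection-style sampler is given below. The game interface (initial_state, legal_joint_moves, apply, percepts) is an assumed stand-in for a GGP reasoner rather than the thesis implementation.

    import random

    def random_sample(game, our_role, observed_percepts, rounds):
        """Trace one play history consistent with our percepts (Eqs. 6.1-6.3).

        Restarts from the root whenever a round has no legal joint moves or
        produces percepts for our role that disagree with those observed.
        """
        while True:                                   # restart on any invalid round
            state, history, ok = game.initial_state(), [], True
            for i in range(rounds):
                candidates = game.legal_joint_moves(state)
                if not candidates:                    # prematurely terminal: invalid
                    ok = False
                    break
                joint_move = random.choice(candidates)
                next_state = game.apply(state, joint_move)
                if game.percepts(next_state, our_role) != observed_percepts[i]:
                    ok = False                        # percepts disagree: invalid (Eq. 6.1)
                    break
                history.append(joint_move)
                state = next_state
            if ok:
                return history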

Experimentally it is prohibitive to measure $P(Valid(\vec{a}_i))$ for every node on the tree, but it is possible to find an estimate of this by measuring the average probability of such a sample being made in each round of the game. When such an average probability is taken across many experimental runs a reasonable value for $P(Valid(h))$ can be obtained.

2 The game may be prematurely terminal.

Figure 6.1: An example of a silo defined by the first move in a game. The black nodes are marked "bad" by the HyperPlay technique.

6.2.4 Biased Samples

The HyperPlay technique advances each model by randomly substituting a legal move for missing information. If the move is invalid then it searches the local subtree for a valid combination. In extreme cases the subtree is expanded beyond the local region until a new model is found. This gives rise to a shortcoming of the technique: the game tree can be divided into a small number of subtrees based on the first legal move substitution. These subtrees are called "silos", as shown in Figure 6.1. Initially there will be an equal number of models in each silo. As the game progresses one silo may have only one viable play history, resulting in an over-sampling as all of the models converge to that play history. Experimentally it is possible to compare sample histories and look for duplicate histories, thereby identifying biased samples; a minimal sketch of such a check follows.
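The following is a small, self-contained sketch of that duplicate check; representing a play history as a tuple of joint moves is an illustrative assumption rather than the thesis data structure.

    from collections import Counter

    def duplicate_report(sample_histories):
        """Count how often each play history appears in the bag of samples.

        A heavily duplicated history is evidence of a biased (over-sampled) silo.
        """
        counts = Counter(tuple(h) for h in sample_histories)
        duplicates = {h: n for h, n in counts.items() if n > 1}
        return len(counts), duplicates

    # Example: 6 samples collapsing onto 2 distinct histories suggests heavy bias.
    samples = [["a", "x"], ["a", "x"], ["a", "x"], ["b", "y"], ["a", "x"], ["b", "y"]]
    print(duplicate_report(samples))   # (2, {('a', 'x'): 4, ('b', 'y'): 2})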

6.2.5 Uniformly Distributed Samples

A weighted particle filter is used3, and so some samples are more likely than others. However, a priori a uniform distribution across an information set must be assumed.

When the sample size is smaller than the size of an information set, |Br| < |Ir|, it is difficult to measure the uniformity of the distribution. But when the sample size is much larger than the size of an information set, |Br| ≫ |Ir|, the task becomes much easier.

3 Weighted particle filters are described in Section 3.3.

By counting the number of times each element of an information set is sampled, it is possible to use Pearson's χ2 test for a uniform distribution as a measure of the probability that the observed distribution matches the expected distribution:

\[
E_v = \frac{|B_{r,i}|}{|I_{r,i}|} \tag{6.4}
\]

as the ratio of the size of the bag of models B_{r,i} to the size of the information set I_{r,i} for role r in round i. Pearson's chi-squared statistic is then calculated:

\[
\chi^2 = \sum_{v \in I_{r,i}} \frac{(O_v - E_v)^2}{E_v} \tag{6.5}
\]

from the sum of the squares of the differences between the observed O_v and expected E_v sampling frequencies. The resulting statistic is converted to a probability via pre-calculated tables.
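A short sketch of this uniformity check in Python, assuming the per-element sampling counts are available. It uses scipy's chisquare in place of the pre-calculated tables mentioned above; that library choice is an assumption, not part of the original implementation.

from scipy.stats import chisquare

def uniformity_probability(observed_counts):
    """Pearson chi-squared test of the observed sampling frequencies against
    a uniform distribution over the information set. Returns the p-value:
    low values indicate a biased (non-uniform) sample."""
    # With no expected frequencies supplied, scipy uses the uniform
    # expectation E_v = |B_{r,i}| / |I_{r,i}| for every element v.
    statistic, p_value = chisquare(observed_counts)
    return p_value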

6.2.6 Counting States Visited

It is common to measure a search of the game tree by counting the states visited or nodes touched. This is because the primary cost in traversing the game tree is the calculation of the successor function; by comparison, storing and retrieving previously visited states from memory is cheap. Each state is counted as its node is touched in the game tree.
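One simple way to obtain this count is to wrap the successor function so that every application is tallied; the sketch below assumes all traversal goes through a single successor call, which is an assumption about the implementation rather than a description of it.

class CountingSuccessor:
    """Wraps a successor function and counts every state visited, since
    computing a successor is the dominant cost of traversing the game tree."""
    def __init__(self, successor):
        self._successor = successor
        self.states_visited = 0

    def __call__(self, state, joint_move):
        self.states_visited += 1          # one new state per application
        return self._successor(state, joint_move)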

Figure 6.2: Left: Transit security game. Right: Border Protection security game.

6.3 Security Games

Figure 6.2 shows representations of the Transit game and the Border Protection game adapted from Lisý et al. (2016). In the Transit game an evader travels from left to right across the grid in Figure 6.2 while the defender starts and finishes at node S. This replicates the representation of Lisý et al. (2016), which was solved using counterfactual regret minimization. An enhancement was added so that the defender can get signals from adjacent nodes about the presence of the evader. This signal overlay allows for in-game response adjustments in two respects. Firstly there is the obvious "he's over there" signal, which collapses an information set, and secondly the more subtle "can't see any evaders" signal, which invalidates some models of the game and causes re-sampling. It is the second aspect of signalling that is of interest in this chapter. Figure 6.3 shows a sample of the encoding of this game in the GDL. The random role makes moves for the evader, and also decides whether the defender's moves are successful. Lines 1-2 show the roles4, lines 3-4 show the initial state of the game, lines 7-8 show the terminal condition, lines 9-11 show one of the goal rules, lines 13-22 show the legal moves, lines 23-28 identify the percepts, and lines 30-44 characterize the successor function. The auxiliary relation (newlocation defender (cell x y)) calculated in lines 30-33 will fail if random chooses (movesuccess 0), mimicking the behaviour in the original game. The Border Protection game, also shown in Figure 6.2, has an evader attempting to escape detection en route, with the protector covering the arrival points and receiving random signals from locations where no evader was detected. Again these negative signals invalidate some models of the game, causing re-sampling and a response adjustment. This version of the game wraps around so as to model the possibilities in global transportation, and it gives the patrol two chances to capture the evader: once when the evader arrives in the patrol zone and the two players are at the same location, for a score of 100, and once after the evader has arrived in the patrol zone, when the defender can move to the same location as the evader, for a score of 50.

4 The role of the evader is played by random.

 1  (role defender)
 2  (role random)
 3  (init (round 1))
 4  (init (turn evader))
 5  (init (location defender (cell 3 1)))
 6
 7  (<= terminal
 8      (true (round 13)))
 9  (<= (goal defender 100)
10      (true (evadercaught))
11      (true (returnedtobase)))
12
13  (<= (legal random (evaderat 1 ?y))
14      (file ?y)
15      (true (round 1)))
16  (<= (legal random (movesuccess ?p))
17      (movesuccess ?p)
18      (true (turn defender)))
19  (<= (legal defender (moveto ?x1 ?y1))
20      (true (location defender (cell ?x ?y)))
21      (adjacent ?x ?y ?x1 ?y1)
22      (true (turn defender)))
23  (<= (sees defender (evaderat ?x ?y))
24      (does random (evaderat ?x ?y))
25      (true (location defender (cell ?x1 ?y1)))
26      (adjacent ?x ?y ?x1 ?y1))
27  (<= (sees defender (movefail))
28      (does random (movesuccess 0)))
29
30  (<= (newlocation defender (cell ?x ?y))
31      (true (turn defender))
32      (not (does random (movesuccess 0)))
33      (does defender (moveto ?x ?y)))
34  (<= (next (turn defender))
35      (true (turn evader)))
36  (<= (next (location evader (cell ?x ?y)))
37      (true (turn evader))
38      (does random (evaderat ?x ?y)))
39  (<= (next (location defender (cell ?x ?y)))
40      (true (turn defender))
41      (newlocation defender (cell ?x ?y)))
42  (<= (next (location defender (cell ?x ?y)))
43      (not (true (turn defender)))
44      (true (location defender (cell ?x ?y))))

Figure 6.3: A sample of the GDL-II description of the Transit security game highlighting the key aspects of the game.

6.4 Experimental Results

This section presents and interprets the experimental results in Appendix D. A summary of the experimental set up is given below.

Design of Experiments A batch of games is played while recording the states visited in each round when updating each of the models. Every round this number is written to a log file, and the files are collated. The statistic is examined and used to calculate the probability of successfully making a random selection, as well as the cost of updating the model. The resources for each role are set so that it plays well below the optimal level, to ensure good variety in the game-play and a broad base for the calculation of the statistic.

Game Play The basket of games chosen for the experiments was drawn from the games available within the GGP community and from the newly converted security games. A variety of information imperfections are represented in the games. Cut-down versions of the games are used, where possible, without loss of generality. For example, Blind Breakthrough would normally be played on an 8x8 board, but a 5x5 version is used to examine sampling efficiency.

Roles The roles were chosen to give meaningful results. In two-player turn-taking games the second player is used for the statistic as they receive imperfect information first.

Measuring Performance The HyperPlay process tests each move substitution in a random order, thus it is possible to gain an accurate estimate of the probability in Equation 6.2 by counting the "first time" successes in each round.
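A sketch of that estimator, assuming a boolean is logged for each model update recording whether its first random substitution was already valid (the function and variable names are illustrative):

def estimate_p_valid(first_try_results):
    """first_try_results: list of booleans, one per model update in a round,
    True when the first randomly chosen substitution was already valid.
    The mean is an estimate of P(Valid(a_i)) (Equation 6.2) for that round."""
    return sum(first_try_results) / len(first_try_results)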

6.4.1 Sampling Efficiency

Experimental Results Figure 6.4 shows the HyperPlay cost of sampling compared to taking a random sample. The lines on the chart do not represent any continuous function; they simply connect results from the same game. The horizontal axis is a measure of completion of the game: when a game is 100% complete it is terminal and no sampling is required. The vertical axis is the ratio of states visited for each technique; a value of 50% means that HyperPlay visits only half the number of states per sample and is therefore twice as efficient as random sampling. The game-play is different for each game, with some being turn-taking and others not, and some games having watershed rounds where percepts collapse an information set. Thus, there is no rhyme or reason to the shape of the curves.

The experimental accuracy is not depicted on the chart. Each game was configured to provide approximately 10,000 observations for each data point on the chart.

Figure 6.4: Cost of sampling using HyperPlay compared to performing a random sample, as measured by states visited.

Interpretation The results show a significant reduction in the cost of sampling when using HyperPlay. The intuition here is that the cost of backtracking the local subtree will always be cheaper than starting each new sample attempt from the root node. While there will always be exceptions5, the general rule is that the longer and larger the game, the more efficient HyperPlay becomes. The lone data point well above the 100% mark in Blind Breakthrough occurs when the black player first encounters enemy pawns; the resulting re-sample is less efficient than taking new random samples.

6.4.2 Uniform Distribution

The Statistic Bias is measured by playing a batch of games and logging the history footprint of each model in each round. The footprints are examined for repetition and a frequency chart is created. A Pearson's χ2 test is then performed on the distribution and a probability value is calculated.

5 The first encounter of enemy pawns in Blind Breakthrough causes significant backtracking beyond the local subtree.


Bias Remedies The first remedy is to inversely weight the results from each model based on its sample frequency. If an element was represented by 5 models then each model contributes only 20% of its outcomes to the evaluation process. The second remedy is to re-balance the sample by replacing a more frequently sampled history with a less frequently sampled history so that each element is sampled the same number of times.
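A minimal sketch of the first remedy: each model's contribution is divided by the number of models that share its play history, so every unique history carries equal weight before the particle-filter weighting is applied. The attribute name play_history is an assumption for illustration.

from collections import Counter

def inverse_frequency_weights(models):
    """Weight each model by 1 / (number of models sharing its history),
    so an element sampled by 5 models contributes 20% per model."""
    counts = Counter(tuple(m.play_history) for m in models)
    return [1.0 / counts[tuple(m.play_history)] for m in models]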

Experimental Results Table 6.1 shows the probability that a sample performed by HyperPlay is uniformly distributed across an information set in the middle of a game. From Figure 6.4 it can be seen that game play is not uniform, so the worst (most biased) result from the mid-game rounds is shown. As one playout of a game is not the same as another it is not possible to average the results, so the median and the upper and lower quartiles of the probability value from a Pearson's χ2 test are shown. The median value of 0.158 for Battleships in Fog indicates that there is a 15.8% probability that the sample is uniformly distributed. Some median values are extremely low at 0.001, or a 0.1% probability of a uniform distribution. Table 6.2 shows a similar statistic for the end-game; again the worst result from the final rounds of the game is shown.

Game                   Round   Q1      Median   Q3
Battleships In Fog     6       0.030   0.158    0.454
Blind Breakthrough     6       0.006   0.156    0.663
Border Protection      4       0.005   0.040    0.262
Guess Who              5       0.001   0.001    0.094
Hidden Connect         4       0.001   0.001    0.001
Krieg TTT              3       0.001   0.001    0.001
Mastermind             3       0.027   0.211    0.520
Transit                5       0.001   0.001    0.116

Table 6.1: Probability of a uniformly distributed sample of an information set created by HyperPlay in mid game.

Interpretation The results show that in every game tested the sample became biased, with the least mid-game bias in Mastermind at a 21% probability of an unbiased sample. However, by the end-game three of the games had given the agent enough percepts to allow it to re-sample in an unbiased way. The biased sample is a genuine concern, as many of the search techniques are mathematically predicated on a uniform random sample of an information set. It is worth noting that some samples are more uniform at the end of the game than they were in the middle of the game, as the information set shrinks towards certainty. Blind Breakthrough becomes a pawn-swapping exercise towards the end game and nears certainty, Mastermind becomes certain as the binary search nears completion, and the Transit game almost always becomes certain at the end when the evader is caught.

Game                   Round   Q1      Median   Q3
Battleships In Fog     12      0.004   0.058    0.150
Blind Breakthrough     11      0.246   0.998    1.000
Border Protection      7       0.001   0.001    0.004
Guess Who              10      0.121   0.399    0.792
Hidden Connect         8       0.001   0.001    0.001
Krieg TTT              5       0.001   0.167    0.992
Mastermind             5       0.413   0.899    1.000
Transit                11      0.996   0.999    1.000

Table 6.2: Probability of a uniformly distributed sample of an information set created by HyperPlay in the end game.

6.4.3 Remedy for Biased Samples

Experimental Results Table 6.3 shows the results of a batch of games played with different player configurations. The base case is two evenly matched players with no attempt to correct biased samples. The weight remedy reduces the weighting of a sample in proportion to its repetition, so that each unique sample is given equal consideration before the particle filter weighting is applied, and the balance remedy re-balances the sample every round by replacing an over-sampled play history with an under-sampled play history. The mean values are shown with a 95% confidence interval.

Interpretation This was the hardest aspect of this research: to find a repeatable, reproducible, realistic, in-game situation where the bias needs to be corrected in order to improve the agent's performance. While it was possible to manipulate the game-play to create scenarios where the agent's choices were significantly compromised by biased samples, these scenarios were so improbable as to have little impact on the average game. By and large, the remedies for biased samples do not improve the agent's performance. However, the cost of both remedies is so small that it is prudent and mathematically reassuring to implement them.

Game                   Base       Weight     Balance
Battleships In Fog     82.0±2.4   85.3±2.2   82.8±2.3
Blind Breakthrough     52.5±3.1   51.4±3.1   53.2±3.1
Border Protection      51.0±3.1   49.2±3.1   51.0±3.1
Guess Who              67.8±2.9   68.6±2.9   69.4±2.9
Hidden Connect         37.2±3.0   36.6±3.0   36.0±3.0
Krieg TTT              51.6±3.1   50.9±3.1   51.7±3.1
Mastermind             92.8±1.6   93.3±1.6   93.8±1.5
Transit                72.2±2.8   74.6±2.7   73.0±2.8

Table 6.3: Average performance of the player using different remedies for unbalanced samples of an information set.

6.4.4 Case Study

The Game In previous experiments the game variants and sizes were chosen to prove (or disprove) the experimental objectives with a minimum of computational resources. However, there is value in extending the experimentation to one of the full-scale versions of a common game in the form of a case study. Blind Breakthrough in the full 8x8 format is used to show how impractical random sampling can be.

The Objectives The intention is to show that random sampling becomes impossible within the normal time constraints, yet the HyperPlayer can successfully maintain a bag of models throughout the entire game. It should be noted that the HyperPlayer is capable of taking a model off line if the backtracking process consumes too many resources; this is one of its design features for managing large search spaces.

Experimental Results Table 6.4 shows the results from the full-sized version of Blind Breakthrough. Both players were resourced just enough to exhibit a variety of game plays without making "stupid" moves. The results show an estimated upper bound on the set of play histories and the probability that a random history is valid in the game being played.


Round   P(Valid(h_R))   sup|{h_R}|   Active Models
1       100%            22           100%
2       100%            22           100%
----
16      1.19%           4.2 E+11     71.9%
17      0.73%           1.5 E+13     70.3%
----
32      < 0.01%         6.0 E+23     50.0%
33      < 0.01%         1.7 E+25     48.7%

Table 6.4: Full sized Blind Breakthrough with the probability of randomly choosing a valid play history, an upper bound on the set size and the models still active.

Interpretation This game is popular in GGP competitions, and so the results are very relevant to this work. In the 32nd round of a game the HyperPlayer could expect 1.3 models of a bag of 100 models to become inactive after each backtracking 100,000 states. The successful models took an average of 745 states to update, giving a total cost of 166,000 states to take the sample. Each valid random sample would cost approximately 32/2/0.01% = 160,000 states6. In this context the random sampler would be completely ineffective, taking only one sample for every 48 samples taken by the HyperPlayer. This result is consistent with the cut-down version of the game reported in Figure 6.4, which shows a long-term cost of 2.2% for HyperPlay relative to random sampling.

6 Each random sample will average 32/2 states before failure in the 32nd round.
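As a worked restatement of the figures quoted above (no new measurements), the expected cost of one valid random sample in the 32nd round follows from the average depth reached per failed attempt and the expected number of attempts:

\[
\mathbb{E}[\text{states per valid random sample}] \approx \frac{32/2}{P(Valid(h))} = \frac{16}{0.0001} = 160{,}000
\]

since each failed attempt reaches, on average, half the depth of the game before its percepts disagree, and roughly 1/P(Valid(h)) attempts are needed.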

6.5 Conclusion

The conclusion is that HyperPlay is generally more efficient than random search, and in some cases an order of magnitude more efficient. HyperPlay samples do become biased, but this is easily remedied. In some cases random sampling is impossible within the constraints of the game. In short, the technique is efficacious in games with imperfect information. The technique is expected to be applicable in Artificial General Intelligence applications wherever an information set of indistinguishable action histories exists. Although it was developed within the context of General Game Playing it is not bound to that domain, or restricted by the Game Description Language. Any search that can be described using a connected, directed graph with a single root node that is acyclic in its undirected form will benefit from this technique.

Chapter 7

Conclusion

In many human endeavours we must crawl before we can walk, and so it is with this research. In the field of General Game Playing with Imperfect Information this means building a player capable of all of the cognitive skills we take for granted as human beings. The principal skill explored in this research is the maintenance of an internal representation of the world (game) despite the fact that the agent has imperfect or missing sensory perception of that world. This research has shown how to maintain an internal representation of a multi-agent world when the agent has little or no information about the actions of the other agents, and still be able to take intelligent goal-oriented action after reasoning about the current state of the world and the impact that such action will have.

7.1 HyperPlay

The HyperPlay technique provides existing General Game Playing agents with a bolt-on solution to convert from perfect to imperfect-information games. However, its success in playing those games is mixed, depending on the game topology.

7.1.1 Strengths

The maintenance of the imperfect-information path, the lists of legal moves and bad moves clearly works. It also facilitates the calculation of the probability P(vi = vt) of the sample being the true game. This was amply demonstrated in the experimental results. The player operated under a time budget, with the ability for the Update() process for each model (HyperGame) to be taken off line so that slow updates would not slow down the player's move selection. The impact was that the large search space in Blind Breakthrough did not cause the player to stall due to excessive backtracking; the HyperGame in question was simply taken off line until it had completed its backtracking calculations and then returned on line. Whilst the instance used for testing was a simple Monte Carlo player, it was clear from the software development that any complete-information player could have been used.

7.1.2 Weaknesses

The primary weakness is that the search space may be extremely large, and the enumeration of the possible imperfect-information histories given by ξ(vt, MyRole) may take so long as to make the Update() process appear never-ending. In practical terms, there will always be a few HyperGames in the bag that have randomly chosen a path that is close to the true game. However, the size of the search space is a genuine concern for any implementation of this technique.

7.2 HyperPlay-II

This extended technique is able to play a much larger class of games by reasoning on imperfect-information models. It does value information correctly according to the expected cost/benefit, and it does require additional resources.

7.2.1 Strengths

The experimental results show the value the new technique places on information, and how it correctly values information-gathering moves by itself and its opponents. It is able to collect information when appropriate, withhold information from its opponents, and keep its goals secret. The use of the imperfect-information Simulations is an effective tool for reasoning with imperfect information. A HyperPlayer-II was easily able to outperform an equally resourced HyperPlayer in all of the experiments.

7.2.2 Weaknesses

We observe that the new technique is a "resource pig", as it uses nested playouts to evaluate move selections, and we have genuine concerns about its ability to scale up for larger games. These concerns motivate the next section of this chapter. Also, the IIS is effectively a search and can be influenced by the type of search. We observed that a simple search is susceptible to shallow traps and will follow this up in future work.

7.2.3 Limitations of HyperPlay-II

There is an interesting type of game requiring what is known as "coordination without communication" (Fenster et al., 1995) that goes beyond what our technique can achieve. Consider the following cooperative variant of the Spy vs. Spy game. Spy1 sees which wire is used to arm a bomb. They then signal the name of a colour to Spy2, who must try to disarm the bomb. Both win if Spy2 cuts the right wire and lose otherwise. Clearly Spy1 has an incentive to help Spy2, and there is one obvious way to do this: signal the colour of the wire. The crux, however, is that the game rules can be designed such that the colour being signalled is logically independent of the colour of the armed wire. Whilst a human spy would see the syntactic similarity in the colours, and hence the semantic link, the logic of the AI sees them as merely labels and does not make the connection.

7.3 Scalability of HyperPlay

7.3.1 HP versus HP-II

When the topology is favourable the HP player performs as well as the HyperPlay-II player, improving its score as resources increase and reaching the same level of optimal play. Therefore, it can be concluded that the HP player is an acceptable choice, except where the game topology makes it ineffective.

7.3.2 Computational Cost of HP-II

The HyperPlay-II player requires significantly more resources to instantiate than the HP player. In each of the games tested, the number of states visited increased by an order of magnitude. The only benefit in using the HyperPlay-II player is that it correctly values information. Therefore, it can be concluded that the HP player should be the first choice, except where the game topology makes it ineffective.

7.3.3 Up-sizing the Game

In all of the games tested there was a significant impact when the game was up-sized. This was consistent with the theoretical analysis, which characterized the HP player as O(bf · d²) and the HyperPlay-II player as O(bf² · d⁴).
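A minimal sketch of what those bounds imply when a game is up-sized; the branching factor (bf) and depth (d) values below are illustrative assumptions, not measurements from this thesis.

def hp_cost(bf, d):
    """Theoretical cost bound for the HP player, O(bf * d^2)."""
    return bf * d ** 2

def hp2_cost(bf, d):
    """Theoretical cost bound for the HP-II player, O(bf^2 * d^4)."""
    return bf ** 2 * d ** 4

# Example: doubling the depth of a game multiplies the HP bound by 4
# and the HP-II bound by 16, so up-sizing hits HP-II much harder.
print(hp_cost(10, 20) / hp_cost(10, 10))    # 4.0
print(hp2_cost(10, 20) / hp2_cost(10, 10))  # 16.0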

7.3.4 Discounting

Referring to Table 5.1, in all of the games discounting had little impact on the outcome. In games with fixed depth, discounting is known to have no impact. In the other games, discounting did not hasten the win or prolong the loss in any real way.

7.3.5 Pruning

Referring to Table 5.1, there was only one game out of five where pruning had a positive impact. Cut on Win and Pruning on Depth are known to be safe (Cazenave et al., 2016a) for Nested Monte Carlo players with perfect information. The results from Banker and Thief, and Battleships in Fog, suggest they may not be safe in imperfect-information Simulations, but the reason is not clear1.

1 Samples of an information set may not contain the same legal moves, but to offer this as the reason would be speculation.

7.3.6 General

The HyperPlay-II player will always play as well as the HP player, and will correctly value information in the context of the reward structure and the expected outcome of the game, whereas the HP player falls into the trap of elevating a sample to fact and consequently values information at zero. The player of choice should be the HP player, utilizing the information-valuing properties of the HyperPlay-II player only when the game topology dictates.

7.4 Efficiency of HyperPlay

HyperPlay is generally more efficient than random search, and in some cases an order of magnitude more efficient. HyperPlay samples do become biased, but this is easily remedied. In some cases random sampling is impossible within the constraints of the game. In short, the technique is efficacious in games with imperfect information. The technique is expected to be applicable in Artificial General Intelligence applications wherever an information set of indistinguishable action histories exists. Although it was developed within the context of General Game Playing it is not bound to that domain, or restricted by the Game Description Language. Any search that can be described using a connected, directed graph with a single root node that is acyclic in its undirected form will benefit from this technique.

7.5 Future Work

There are some interesting directions for future work using the HyperPlay technique, both inside and outside General Game Playing. Several suggestions are presented below.

7.5.1 Non-Locality Error

This occurs when the optimal solution to a game in progress depends on information outside the local game tree. That is, a perfect-information algorithm operating within the confines of the local game tree would make a less than optimal move selection. From this it is reasonable to infer that HyperPlay, as a bolt-on solution, would perpetuate this error. However, it is not clear that HyperPlay-II, as a nested player, would fall into the same trap.

7.5.2 Information Set MCTS

Future work in this area would be likely to include:

• Can HyperPlay be coupled with ISMCTS to produce a superior player,

• Can ISMCTS overcome the Strategy Fusion Error and replace HyperPlay-II which is a "resource pig", and

• Can a player based on ISMCTS successfully play games like The Monty Hall Game which requires a weighted particle filter, and if not, can the HyperPlay ChoiceFactor be incorporated into such a player?

7.5.3 Security Games

Two popular security games were converted to GGP for this thesis to help demonstrate the efficiency of HyperPlay over random sampling of an information set. Additional work could be done to bring more security games into the GGP arena so as to take advantage of the various techniques for their solution.

7.5.4 More Efficient Random Sampling in Tree Searches

It was clearly demonstrated that HyperPlay can sample an information set more efficiently than a random approach. This is true whether the search is for a GGP game or any other imperfect-information search that can be represented as a tree.

Appendix A

HyperPlayer Experimental Results

This appendix recapitulates the experimental results presented in previous works. It expands on the published results and provides the design of the experiments and the equipment used in the experiments.

• Schofield, M., Cerexhe, T., Thielscher, M. (2012). HyperPlay: A solution to general game playing with imperfect information. In Proceedings of Twenty Seventh AAAI Conference on Artificial Intelligence (p. 1606-1612).

A.1 Design of Experiments

A series of experiments was designed to test the capabilities of the new technique using the high-throughput computer facilities at the School of Computer Science and Engineering. Games played at the recent Australasian Joint Conference on Artificial Intelligence were used as inspiration for the experiments to validate the claim that the new technique correctly values moves that seek or protect information. The conference organisers specially designed games that would challenge the state of the art of GDL-II players so as to encourage research and development in this field. As with previous research, the game server was modelled along with both players in a single thread so it could be parallelised across many CPUs. Each test was repeated one hundred times.

A.1.1 Game Play

For single player games there is no issue with game play, but with two player games there needs to be a consistent opponent to make some useful measurements. To this end, an opponent is instantiated who uses the HP technique and is adequately resourced so as to be competitive. Since a comparison is being made between different instantiations of the player, the experiments will not be overly sensitive to the performance of the opponent.

A.1.2 Measuring Performance

It is common in GGP to simulate the game play of a competition by giving each player a time budget for each move. However, there have been very few GGP-II competitions and the idea of a time budget has less meaning. Also a time budget is very dependent on the hardware being used for the computations. Therefore the number of states visited by the player in playing the game is used as the measure of computational resources, being careful to measure this across the multiple samples of the information set and to include the states visited in the backtracking of invalid samples1. So the measure includes both the states visited in creating the samples plus the states visited in the playouts.

1 This may consume half of the resources in some games.

A.1.3 Confidence Level

Each game variant and player configuration was played one thousand times2 and the average results reported. For the two player games a result of zero indicated that the opponent scored the maximum payoff in every game, and a result of 100 indicated that the nominated player scored the maximum payoff in every game. A confidence level of 95% is shown using error bars in a two-tailed calculation of the standard error of the mean for both the states visited and the average payoff. That is to say that there can be 95% confidence that the player would score within the range reported using the computational resources reported. In some cases the error bars are smaller than the marker used to plot the point.

A.1.4 Equipment

The experiments were conducted on the high-throughput computer facilities at the School of Computer Science and Engineering at the University of New South Wales. That is, each experiment was distributed across several hundred laboratory PCs and the results collected and tabulated.

A.1.5 Player Resources

In each experiment the player resources were varied to demonstrate the performance as a function of resources. Care was taken to ensure that each player had equal resources when making a move selection. This was achieved by setting each player’s parameters such that they would visit a similar number of states each round. The new technique has the potential to be a ‘resource pig’ as it is a nested player. For example, a value of n = 4 in eval(s, ~π, r, n) for the original technique in a two player game with a branching factor of eight and a playout length of 10 would result in 1,280 states being visited. The new technique would visit 2,048,000 states. However experiments showed that the new technique can function optimally3 with the same total resources as the original player. That is, where the original player might need n = 16, the new player only requires n = 4. This translates to the same number of states visited by each player.

2 For larger game variants fewer games were played and this is reflected in the error bars shown on the chart.
3 A player is ‘optimal’ when increasing its resources will not lead to better play. This is referred to as ‘adequate resourcing’.

A.1.6 Standardized Opponent

In each of the two player games the HyperPlayer opposed a Cheat: a HyperPlayer with access to the true game, who maintains HyperGames that are the game (instead of models of the game). The Cheat was fully resourced so that it made the best move choices within the limitations of the move selection process.

A.2 Games

A.2.1 Monty Hall

In addition to the rules already presented, the number of initial doors is varied between three, four, and five. The host always opens all but two doors.

A.2.2 Krieg-TicTacToe

A variant of traditional TicTacToe where the opponent’s pieces are hidden. Players move simultaneously and the winning length was fixed at four in a row. The board size is varied between 4x4, 5x5, and 6x6 squares. Players are given feedback via percepts as to whether their move, and their opponent’s move, were successful.

A.2.3 Blind Breakthrough

A variant of chess where each player starts with two rows of pawns on their side of the board. Players take turns, trying to ‘break through’ their opponent’s ranks to reach the other side first. Opponent’s pieces are hidden, as with Krieg-TicTacToe. The board size was varied between 5x5, 6x6, and 7x7 squares and players were given equal opportunities to go first.

A.3 Experimental Results

A.3.1 Particle Filter Weighting

The weightings applied to the HyperGames can be validated by playing the Monty Hall game. If the formulation is correct then the maximum long term score should be equal to 100 · (1 − 1/No_Of_Doors), otherwise the maximum long term score should be 50.
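A quick sketch of this validation criterion in Python, using only the formula quoted above; the function name is illustrative.

def expected_monty_hall_score(num_doors, weighted=True):
    """Maximum long-term score for the Monty Hall game.
    A correctly weighted particle filter should score 100 * (1 - 1/num_doors);
    an unweighted (uniform) sampler should plateau at 50."""
    return 100 * (1 - 1 / num_doors) if weighted else 50

for doors in (3, 4, 5):
    print(doors, expected_monty_hall_score(doors))   # approx. 66.7, 75.0, 80.0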

A.3.2 Monty Hall

As expected, the adequately resourced HyperPlayer was able to achieve an average payoff appropriate for the number of doors in the game.

HyperGames   Sims/Move   Resources   3 Doors   4 Doors   5 Doors
1            1           1           51.1      46.9      51.8
2            2           2           56.7      61.2      66.3
4            4           4           56.0      66.4      75.0
8            8           8           60.6      70.4      75.4
16           16          16          63.6      75.6      79.2
32           32          32          64.7      73.7      80.3
64           64          64          65.8      75.6      81.0

Table A.1: Monty Hall Results showing the mean values for the HyperPlayer.

HyperGames   Sims/Move   Resources   3 Doors   4 Doors   5 Doors
1            1           1           1000      1000      1000
4            4           16          1000      1000      1000
8            8           64          1000      1000      1000
16           16          256         1000      1000      1000
32           32          1024        1000      1000      1000
64           64          4096        1000      1000      1000

Table A.2: Monty Hall Results showing the resources and observations values for the HyperPlayer.

HyperGames   Sims/Move   Resources   3 Doors   4 Doors   5 Doors
1            1           1           3.10      3.09      3.10
2            2           4           3.07      3.02      2.93
4            4           16          3.08      2.93      2.68
8            8           64          3.03      2.83      2.67
16           16          256         2.98      2.66      2.52
32           32          1024        2.96      2.73      2.47
64           64          4096        2.94      2.66      2.43

Table A.3: Monty Hall Results showing the 95% Confidence Interval for the HyperPlayer.

Krieg-TicTacToe

These experiments showed steady improvement in performance as HyperPlayer resources increased. The limiting values appear to be well short of the 50% mark, especially on the larger board. On inspection of the play messages, it could be seen that the reduced number of percepts relative to the game duration gave the HyperPlayer very little information to assist in narrowing its search. In fact, the Cheat often won the game before the HyperPlayer could establish an accurate model of the board.

HyperGames   Sims/Move   Resources   TTT4   TTT5   TTT6
2            2           2           3.1    1.8    1.0
4            4           4           4.8    5.2    6.8
8            8           8           8.2    10.9   14.8
16           16          16          16.0   10.5   12.0
32           32          32          27.7   23.9   8.8
64           64          64          27.1   25.2   16.2
128          128         128         32.6   23.9   16.3

Table A.4: TicTacToe Results showing the mean values for the HyperPlayer.

HyperGames   Sims/Move   Resources   TTT4   TTT5   TTT6
2            2           4           1000   1000   1000
4            4           16          1000   1000   1000
8            8           64          1000   1000   1000
16           16          256         1000   1000   1000
32           32          1024        1000   1000   1000
64           64          4096        1000   1000   1000
128          128         16384       1000   1000   1000

Table A.5: TicTacToe Results showing the resources and observations values for the HyperPlayer.

HyperGames   Sims/Move   Resources   TTT4   TTT5   TTT6
2            2           4           1.07   0.81   0.62
4            4           16          1.32   1.37   1.56
8            8           64          1.70   1.93   2.20
16           16          256         2.27   1.90   2.01
32           32          1024        2.77   2.64   1.75
64           64          4096        2.75   2.69   2.28
128          128         16384       2.90   2.64   2.29

Table A.6: TicTacToe Results showing the 95% Confidence Interval for the HyperPlayer.

Blind Breakthrough

The results of the Blind Breakthrough experiments show clear improvement in performance as resources increase and approach a limiting value. As the number of HyperGames increases and the number of simulations per move increases the HyperPlayer is able to match the Cheat’s performance with neither player having the advantage. In the 5x5 results, the percepts were sufficient for the HyperPlayer to maintain models that were very close to the true game. This allowed the HyperPlayer to perform as if it had complete information.

HyperGames   Sims/Move   Resources   BTC5   BTC6   BTC7
2            2           2x2         21.1   13.2   –
4            4           4x4         35.7   33.0   20.0
8            8           8x8         43.5   41.8   27.0
16           16          16x16       46.5   42.5   32.0
32           32          32x32       50.2   44.6   34.8
64           64          64x64       –      43.2   34.2
128          128         128x128     –      45.5   37.5258

Table A.7: Blind Breakthrough Results showing the mean values for the HyperPlayer.

HyperGames   Sims/Move   Resources   BTC5   BTC6   BTC7
2            2           4           1000   1000   1
4            4           16          1000   1000   1000
8            8           64          1000   1000   1000
16           16          256         1000   1000   1000
32           32          1024        1000   1000   1000
64           64          4096        1      1000   1000
128          128         16384       1      1000   1000

Table A.8: Blind Breakthrough Results showing the resources and observations values for the HyperPlayer.

HyperGames   Sims/Move   Resources   BTC5   BTC6   BTC7
2            2           4           2.52   2.09   0.00
4            4           16          2.95   2.90   2.47
8            8           64          3.06   3.04   2.74
16           16          256         3.08   3.05   2.88
32           32          1024        3.08   3.07   2.94
64           64          4096        0.00   3.05   2.93
128          128         16384       0.00   3.07   2.99

Table A.9: Blind Breakthrough Results showing the 95% Confidence Interval for the HyperPlayer.

Appendix B

HyperPlayer-II Experimental Results

This appendix recapitulates the experimental results presented in previous works. It expands on the published results and provides the design of the experiments and the equipment used in the experiments.

• Schofield, M., Thielscher, M. (2015). Lifting hyperplay for general game playing to incomplete-information models. In Proceedings of Twenty Ninth AAAI Conference on Artificial Intelligence (p. 3585-3591).

B.1 Design of Experiments

A series of experiments was designed to test the capabilities of the new technique using the high-throughput computer facilities at the School of Computer Science and Engineering. Games played at the recent Australasian Joint Conference on Artificial Intelligence were used as inspiration for the experiments to validate the claim that the new technique correctly values moves that seek or protect information. The conference organisers specially designed games that would challenge the state of the art of GDL-II players so as to encourage research and development in this field. As with previous research, the game server was modelled along with both players in a single thread so it could be parallelised across many CPUs. Each test was repeated one hundred times.

B.1.1 Game Play

For single player games there is no issue with game play, but with two player games there needs to be a consistent opponent to make some useful measurements. To this end, an opponent is instantiated who uses the HP technique and is adequately resourced so as to be competitive. Since a comparison is being made between different instantiations of the player, the experiments will not be overly sensitive to the performance of the opponent.

B.1.2 Measuring Performance

It is common in GGP to simulate the game play of a competition by giving each player a time budget for each move. However, there have been very few GGP-II competitions and the idea of a time budget has less meaning. Also a time budget is very dependent on the hardware being used for the computations. Therefore the number of states visited by the player in playing the game is used as the measure of computational resources, being careful to measure this across the multiple samples of the information set and to include the states visited in the backtracking of invalid samples1. So the measure includes both the states visited in creating the samples plus the states visited in the playouts.

1 This may consume half of the resources in some games.

B.1.3 Confidence Level

Each game variant and player configuration was played one thousand times2 and the average results reported. For the two player games a result of zero indicated that the opponent scored the maximum payoff in every game, and a result of 100 indicated that the nominated player scored the maximum payoff in every game. A confidence level of 95% is shown using error bars in a two-tailed calculation of the standard error of the mean for both the states visited and the average payoff. That is to say that there can be 95% confidence that the player would score within the range reported using the computational resources reported. In some cases the error bars are smaller than the marker used to plot the point.

B.1.4 Equipment

The experiments were conducted on the high-throughput computer facilities at the School of Computer Science and Engineering at the University of New South Wales. That is, each experiment was distributed across several hundred laboratory PCs and the results collected and tabulated.

B.1.5 Player Resources

In each experiment the player resources were varied to demonstrate the performance as a function of resources. Care was taken to ensure that each player had equal resources when making a move selection. This was achieved by setting each player’s parameters such that they would visit a similar number of states each round. The new technique has the potential to be a ‘resource pig’ as it is a nested player. For example, a value of n = 4 in eval(s, ~π, r, n) for the original technique in a two player game with a branching factor of eight and a playout length of 10 would result in 1,280 states being visited. The new technique would visit 2,048,000 states. However experiments showed that the new technique can function optimally3 with the same total resources as the original player. That is, where the original player might need n = 16, the new player only requires n = 4. This translates to the same number of states visited by each player.

2 For larger game variants fewer games were played and this is reflected in the error bars shown on the chart.
3 A player is ‘optimal’ when increasing its resources will not lead to better play. This is referred to as ‘adequate resourcing’.

B.1.6 Equal Resources

Some experiments were conducted with two-player games, pitting the original player against the new player using equal resources. A resource index n = 4 gives the new player four hypergames, each running an IIS with four hypergames. The old player would get n = 16 hypergames. A player resource index of zero represents random decision making and serves to provide a basis for improvement.

B.2 Games

Games played at the Australasian Joint Conference on Artificial Intelligence 2013 were used as inspiration for the experiments to validate the claim that the new technique correctly values moves that seek or protect information. The conference organizers specially designed games that would challenge the state of the art of GDL-II players to encourage research and development in this field.

Exploding Bomb This simple game commences with the random player choosing a red or blue wire to arm a bomb. Next, the agent may choose whether to ask which wire was used. Asking carries a cost of 10%. Finally, the agent must cut one wire to either disarm, or detonate, the bomb. This game is used as a running example.

Spy v Spy A simple variant of the Exploding Bomb game reverses the information flow. In this version the arming agent—who chooses which wire arms the bomb—also decides whether to tell the other player which wire to cut. Withholding this information carries a penalty of 20%. This tests the value a player places on giving away information.

Number Guessing The agent must guess a random number between 1 and 16. It can ask if the number is ‘less than X’, or can announce it is ‘ready to guess’, then guess the number. The score is discounted by time after the first 5 moves.

Banker and Thief This game tests a player’s ability to keep secrets, i.e. to value withholding information. There are two banks, a banker and a thief. The banker distributes ten $10 notes between the two banks. The Banker scores all the money left in his bank at the end of the game, except his bank has a faulty alarm system. The thief can steal all the money from the faulty bank, if they can identify it. The challenge for the banker is not to reveal the faulty bank by over-depositing.

Battleships In Fog This turn-taking, zero-sum game was designed to test a player’s ability to gather information and to be aware of information collected by its opponent. Two battleships occupy separate grids. A player can fire a missile to any square on the opponent’s grid, move to an adjacent square, or scan for their opponent. If they scan they will get the exact location, and their opponent will know that they have been scanned.

B.3 Experimental Results

                             Average   Count   Std Dev   95% CI
Round 0
  does agent noop            47.518    100     0.362     0.071
Round 1
  does agent ask             45.042    100     0.346     0.068
  does agent noop            49.979    100     0.373     0.073
Round 2
  does agent cut armed       50.602    100     4.626     0.907
  does agent cut not armed   49.398    100     4.626     0.907

Table B.1: Experimental results for the Exploding Bomb game for the HyperPlay based player.

                             Average   Count   Std Dev   95% CI
Round 0
  does agent noop            69.066    100     2.958     0.580
Round 1
  does agent ask             90.000    100     0.000     0.000
  does agent noop            49.914    100     3.290     0.645
Round 2
  does agent cut armed       90.000    100     0.000     0.000
  does agent cut not armed   0.000     100     0.000     0.000

Table B.2: Experimental results for the Exploding Bomb game for the HyperPlay-II based player.

The original player never asks the question in this game since it thinks it already knows the answer due to superficial agreement of its samples and so it believes it can avoid the modest penalty. In contrast, the new player plays out the greedy strategy to see that it does not work and then correctly identifies that asking the question gives the best expected outcome. This is a simple but effective validation of the ability of the new technique to overcome the Strategy Fusion Error.

B.3.1 Spy v Spy

                                   Average   Count   Std Dev   95% CI
Round 1
  does agent arm blue and tell     59.995    100     0.592     0.116
  does agent arm red and tell      60.042    100     0.529     0.104
  does agent arm blue and hide     39.985    100     0.615     0.120
  does agent arm red and hide      39.991    100     0.557     0.109
Round 2
  does agent cut armed             80.000    100     4.198     0.823
  does agent cut unarmed           0.000     100     4.660     0.913

Table B.3: Experimental results for the Spy v Spy game for the HyperPlay based player.

                                   Average   Count   Std Dev   95% CI
Round 1
  does agent arm blue and tell     20.000    100     0.000     0.000
  does agent arm red and tell      20.000    100     0.000     0.000
  does agent arm blue and hide     40.363    100     4.715     0.924
  does agent arm red and hide      39.450    100     5.123     1.004
Round 2
  does agent cut armed             59.550    100     1.936     0.379
  does agent cut unarmed           60.450    100     2.219     0.435

Table B.4: Experimental results for the Spy v Spy game for the HyperPlay-II based player.

The original player always tells its opponent which wire was used to arm the bomb to avoid the penalty. The new player recognizes that hiding this information yields a better expected outcome and keeps the information secret.

B.3.2 Number Guessing

Resources                   1       2       4       8
Average Score               32.3    49.3    82.1    87.1
Standard Deviation          43.27   43.42   27.09   18.11
Games Played                100     100     100     100
95% Confidence Interval     8.48    8.51    5.31    3.55
Average Rounds per Game     6.51    7.43    6.75    6.89
Total Time (cpu hours)      0.14    1.85    23.3    311

Table B.5: Experimental results for the Number Guessing game for the HyperPlay-II based player.

The original player always announces it is "ready to guess", but then guesses randomly from one of the 16 numbers, resulting in a 6.25% chance of guessing correctly. The new player only guesses the number when all playouts agree on the result. A binary search means guessing after four questions. The original player suffers from the Strategy Fusion Error, as each HyperGame believes it knows the answer and all agree that it is ready to guess the answer. The new technique, properly resourced, will conduct a binary search.

B.3.3 Banker and Thief

Resources                        0      1      2      4      8
Banker Average Score             29.8   25.9   23.1   23.6   32.1
Banker Std Dev                   28.2   29.4   23.4   20.4   6.6
Thief Average Score              24.6   31.5   25.0   19.4   1.0
Thief Std Dev                    29.5   32.0   28.8   24.0   7.1
Games Played                     100    100    100    100    99
Banker 95% Confidence Interval   6.0    6.2    6.0    3.7    0.9
Thief 95% Confidence Interval    6.1    6.3    6.6    4.1    1.0

Table B.6: Experimental results for the Banker and Thief game for the HyperPlay-II based player.

The results show that the original technique adopts a greedy policy and places $100 in its bank, only to have it stolen. The new technique, adequately resourced, will deposit $40 of the money in its bank and create a decoy of $60, leaving the remaining banks empty, relying on a greedy thief to attempt to steal the $60 from the decoy bank. The new technique reaches the optimal avg(40 + 100) at a resource index of eight as it correctly models both roles.

B.3.4 Battleships In Fog

Resources                   0      1      2      4       8
Average Score               46.5   64.0   87.8   89.0    88.9
Standard Deviation          49.0   46.8   32.7   31.4    31.4
Games Played                198    200    200    200     199
95% Confidence Interval     6.95   6.65   4.54   4.34    4.36
Average Rounds per Game     8.1    7.1    4.0    3.4     3.4
Total Time (cpu hours)      0.19   0.47   3.42   46.63   342.14

Table B.7: Experimental results for the Battleships in Fog game for the HyperPlay-II based player.

The original player sees no value in scanning, as all of the samples "know" where the opponent is. It does not value moving after being scanned, as it thinks its opponent always knows where it is. Its only strategy is to randomly fire missiles, giving it a 6.25% chance of a hit on a 4x4 board. The agent using the new technique will scan for the opponent and then fire a missile. A resource index of four is sufficient for the new player to dominate the old in this turn-taking game: HyperPlay has a 9.4% chance of winning with a random shot (12.5% if it goes first, half that if it plays second).

Appendix C

HyperPlayer Scalability Experimental Results

This appendix recapitulates the experimental results presented in previous works. It expands on the published results and provides the design of the experiments and the equipment used in the experiments.

• Schofield, M., Thielscher, M. (2016b). The scalability of the hyperplay technique for imperfect-information games. In Proceedings of AAAI Workshop on Computer Poker and Imperfect Information Games.

C.1 Design of Experiments

Experiments were designed to answer two basic questions:

• Does HP-II perform better than HP at this type of game, and at what computational cost; and

• What is the impact of up-sizing the game on the computational cost for HP-II to achieve the same level of performance?

C.1.1 Game Play

For single player games there is no issue with game play, but with two player games there needs to be a consistent opponent to make some useful measurements. To this end, an opponent is instantiated who uses the HP technique and is adequately resourced so as to be competitive. Since a comparison is being made between different instantiations of the player, the experiments will not be overly sensitive to the performance of the opponent.

C.1.2 Measuring Performance

It is common in GGP to simulate the game play of a competition by giving each player a time budget for each move. However, there have been very few GGP-II competitions and the idea of a time budget has less meaning. Also a time budget is very dependent on the hardware being used for the computations. Therefore the number of states visited by the player in playing the game is used as the measure of computational resources, being careful to measure this across the multiple samples of the information set and to include the states visited in the backtracking of invalid samples1. So the measure includes both the states visited in creating the samples plus the states visited in the playouts.

C.1.3 Player Configuration

For the HP player there are two resource parameters: the size of the bag of models,2 and the number of playouts per legal move. For example, HP(4, 2) maintains 4 models of the game and uses two playouts for each legal move to make a choice. In a game with a branching factor of 10 and a playout depth of 10 that would mean 800 states visited in making a choice, and 4,400 states visited in playing a game.

1 This may consume half of the resources in some games.
2 Models are samples of the information set.

The HP-II player is twice as complex, as it is a level-2 nested player, so there are four resource parameters. For example, HP-II(4, 2, 4, 2) is equivalent to HP(4, 2, HP(4, 2)), which significantly increases the number of states visited for a choice from 800 to 352,0003. Preliminary experiments were conducted to find the best configuration of both players for each game. The intent was to show the best performance for each player in terms of maximising the score and minimising the number of states visited. Once that configuration was found, multiples of 2 were used to produce the characteristic curves presented in the results.
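A sketch of that arithmetic, using only the example figures above (branching factor 10, playout depth 10, and the quoted 4,400-state cost of a full HP(4, 2) game as the inner cost of HP-II); the function names are illustrative assumptions.

def hp_states_per_choice(models, playouts_per_move, branching, depth):
    """HP(models, playouts): one playout of length `depth` for each of the
    `playouts_per_move` playouts of each of the `branching` legal moves,
    repeated for every model in the bag."""
    return models * playouts_per_move * branching * depth

def hp2_states_per_choice(models, playouts_per_move, branching, inner_game_cost):
    """HP-II nests a full imperfect-information simulation (a complete HP
    game) inside every playout, so the inner per-game cost replaces `depth`."""
    return models * playouts_per_move * branching * inner_game_cost

print(hp_states_per_choice(4, 2, 10, 10))      # 800, as quoted for HP(4, 2)
print(hp2_states_per_choice(4, 2, 10, 4400))   # 352000, as quoted for HP-II(4, 2, 4, 2)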

C.1.4 Game Variants

The game variants were chosen to cast the strongest light on the experimental aims and show the best possible performance for both players. With many game variations and player variations, configurations were chosen that gave a fair representation of the relative performances in the context of the imperfect information present in each game.

C.1.5 Confidence Level

Each game variant and player configuration was played one thousand times4 and the average results reported. For the two player games a result of zero indicated that the opponent scored the maximum payoff in every game, and a result of 100 indicated that the nominated player scored the maximum payoff in every game. A confidence level of 99% is shown using error bars in a two-tailed calculation of the standard error of the mean for both the states visited and the average payoff. That is to say that there can be 99% confidence that the player would score within the range reported using the computational resources reported. In some cases the error bars are smaller than the marker used to plot the point.

C.1.6 Equipment

The experiments were conducted on the high-throughput computer facilities at the School of Computer Science and Engineering at the University of New South Wales. That is, each experiment was distributed across several hundred laboratory PCs and the results collected and tabulated.

3 4 x 2 x 10 x 4,400 = 352,000; remember the Imperfect-Information Simulation starts at the beginning of the game, not the next round.
4 For larger game variants fewer games were played and this is reflected in the error bars shown on the chart.


C.2 Experimental Results

Experiments were conducted on various configurations of players for each of the games reported. The lines joining points plotted in each chart do not represent a continuous function, but are used to link results for similar configurations of player and game.

Each point represents a different resourcing of that player in terms of the number of samples taken from the information set, and the number of playouts made when selecting a move. Typically resources were doubled from one configuration to the next, so a log scale is used for the horizontal axis of the chart.

C.2.1 Hidden Connect

Two variants of the game were used: connect 3 in a 3x3 grid and connect 4 in a 5x5 grid. As this game does not have any information-gathering or information-purchasing moves, the HP player performed as well as the HyperPlay-II player but consumed considerably fewer resources, with both players improving their performance as they visited more states.

Resources         1,1    2,2     4,4      8,8      16,16     32,32
Count             980    980     980      980      980       490
States  Average   75.7   319.6   1199.4   4692.6   18483.5   72829.3
        StdDev    13.0   39.4    121.9    415.5    1557.1    6109.4
        99% CI    1.1    3.2     10.0     34.2     128.3     712.1
Score   Average   57.7   60.6    65.6     69.8     74.7      73.7
        StdDev    43.5   43.1    40.7     38.4     34.2      36.0
        99% CI    3.6    3.5     3.4      3.2      2.8       4.2

Table C.1: Hidden Connect 3 game showing the HP player performance.

Resources         1,1,1,1   2,2,1,1   4,4,1,1   8,8,1,1
Count             980       980       980       280
States  Average   939.3     5007.1    62267.1   768686.6
        StdDev    136.9     698.8     6406.7    67649.4
        99% CI    11.3      57.6      528.0     10430.5
Score   Average   57.2      59.9      67.6      78.8
        StdDev    43.9      42.7      40.4      34.2
        99% CI    3.6       3.5       3.3       5.3

Table C.2: Hidden Connect 3 game showing the HP-II player performance.

Resources         1,1       2,2       4,4       8,8
Count             1024      768       512       384
States  Average   2461.8    6290.3    13034.8   39967.1
        StdDev    10041.7   21194.6   20941.3   30625.9
        99% CI    809.6     1973.2    2387.7    4032.2
Score   Average   43.9      51.7      55.3      64.2
        StdDev    47.2      47.9      48.0      45.6
        99% CI    3.8       4.5       5.5       6.0

Table C.3: Hidden Connect 4 game showing the HP player performance.

Resources         1,1,1,1    2,2,1,1
Count             476        194
States  Average   365467.7   1601500.1
        StdDev    288223.5   912538.2
        99% CI    34083.6    169032.4
Score   Average   41.6       47.9
        StdDev    47.0       47.7
        99% CI    5.6        8.8

Table C.4: Hidden Connect 4 game showing the HP-II player performance.

C.2.2 Mastermind

Two variants of the game were used, two colours in three positions with three guesses, and three colours in four positions with four guesses. The reward was pro rata for the number of correct positions. The number of guesses was restricted to see if an increase in resources would improve the guessing strategy. This game has no hidden game play, only a hidden initial setting created by the random player. As such, even the simplest HP player was able to solve the puzzle.

Resources         1,1    2,2     4,4     8,8
Count             1024   1024    1024    1024
States  Average   53.8   195.9   722.9   2779.8
        StdDev    13.8   39.9    132.8   511.6
        99% CI    1.1    3.2     10.7    41.2
Score   Average   80.8   81.6    84.3    82.4
        StdDev    30.3   24.8    24.2    26.5
        99% CI    2.4    2.0     2.0     2.1

Table C.5: Mastermind 2x3 game showing the HP player performance.

Resources       1,1,1,1  2,2,1,1  4,4,1,1  8,8,1,1
Count           1024     1024     1024     512
States Average  581.6    2518.1   25473.7  307619.2
States StdDev   127.4    512.5    4212.6   44149.3
States 99% CI   10.3     41.3     339.6    5033.9
Score Average   80.7     83.1     84.3     83.1
Score StdDev    30.4     23.5     24.3     26.3
Score 99% CI    2.4      1.9      2.0      3.0

Table C.6: Mastermind 2x3 game showing the HP-II player performance.

Resources       1,1    2,2     4,4      8,8
Count           1024   1024    512      512
States Average  882.0  3283.4  12712.5  49596.8
States StdDev   114.8  412.7   1452.5   6957.5
States 99% CI   9.3    33.3    165.6    793.3
Score Average   73.7   76.1    76.4     76.2
Score StdDev    32.8   29.2    30.1     32.2
Score 99% CI    2.6    2.4     3.4      3.7

Table C.7: Mastermind 3x4 game showing the HP player performance.

Resources       1,1,1,1  2,2,1,1   4,4,1,1    8,8,1,1
Count           512      512       256        107
States Average  96334.9  390725.3  5435155.6  15598744.3
States StdDev   5578.8   22442.0   3569596.0  4244471.8
States 99% CI   636.1    2558.9    575597.4   1058647.7
Score Average   73.3     75.3      78.5       91.6
Score StdDev    30.9     30.1      41.2       27.9
Score 99% CI    3.5      3.4       6.6        7.0

Table C.8: Mastermind 3x4 game showing the HP-II player performance.

C.2.3 Number Guessing

Variants of 4, 8 & 16 numbers were used in this experiment. As expected, the HP player was unable to correctly value the information-gathering moves and performed no better than a random player, whereas the HP-II player tended towards optimum play as the resources were increased.

Resources       1,1    2,2    4,4    8,8
Count           1000   1000   1000   1000
States Average  101.4  286.8  895.6  2655.9
States StdDev   66.1   152.3  399.4  521.8
States 99% CI   5.4    12.4   32.6   42.6
Score Average   37.2   36.5   31.7   26.1
Score StdDev    38.9   42.9   43.7   43.5
Score 99% CI    3.2    3.5    3.6    3.6

Table C.9: Number Guessing 4 game showing the HP player performance.

Resources       1,1,1,1  2,2,1,1  4,4,2,2  8,8,4,4
Count           1000     1000     1000     1000
States Average  1058.4   5701.2   58971.7  677313.6
States StdDev   546.1    2024.8   12313.6  129504.0
States 99% CI   44.6     165.2    1004.6   10565.8
Score Average   45.4     62.0     76.8     78.2
Score StdDev    42.4     33.4     12.6     7.1
Score 99% CI    3.5      2.7      1.0      0.6

Table C.10: Number Guessing 4 game showing the HP-II player performance.

Resources       1,1    2,2    4,4     8,8
Count           1000   1000   1000    1000
States Average  308.1  744.9  2233.6  6252.6
States StdDev   169.0  444.5  1140.0  1853.9
States 99% CI   13.8   36.3   93.0    151.3
Score Average   18.6   21.1   17.5    12.8
Score StdDev    30.0   35.4   35.1    32.8
Score 99% CI    2.4    2.9    2.9     2.7

Table C.11: Number Guessing 8 game showing the HP player performance.

Resources       1,1,1,1  2,2,1,1  4,4,2,2   8,8,4,4
Count           1000     1000     500       330
States Average  7611.9   33081.0  364186.1  4144407.1
States StdDev   3342.1   11307.5  91281.9   985596.8
States 99% CI   272.7    922.5    10532.2   139978.7
Score Average   30.6     36.8     57.9      63.8
Score StdDev    36.7     36.3     24.1      14.3
Score 99% CI    3.0      3.0      2.8       2.0

Table C.12: Number Guessing 8 game showing the HP-II player performance.

Resources       1,1    2,2     4,4     8,8
Count           1000   1000    1000    1000
States Average  830.0  2239.8  5507.3  15027.2
States StdDev   372.1  1283.4  2937.3  5999.9
States 99% CI   30.4   104.7   239.6   489.5
Score Average   7.6    10.8    9.4     8.8
Score StdDev    21.0   25.2    26.6    27.5
Score 99% CI    1.7    2.1     2.2     2.2

Table C.13: Number Guessing 16 game showing the HP player performance.

Resources       1,1,1,1  2,2,1,1   4,4,2,2    8,8,4,4
Count           1000     500       231        70
States Average  44053.4  157359.1  2159432.7  22020843.0
States StdDev   16160.8  58863.9   461888.9   5190879.0
States 99% CI   1318.5   6791.8    78406.4    1600706.1
Score Average   17.2     18.5      31.1       48.1
Score StdDev    28.3     30.4      30.4       24.3
Score 99% CI    2.3      3.5       5.2        7.5

Table C.14: Number Guessing 16 game showing the HP-II player performance.

C.2.4 Banker and Thief

Variants with 2 and 4 banks and 10 deposits of $10.00 are used in these experiments. As expected, the HP banker uses a greedy strategy when making deposits and falls victim to the thief, whereas the HP-II player tended towards optimum play as the resources were increased⁵.

Resources       1,1    2,2    4,4     8,8
Count           1024   1024   1024    1024
States Average  153.5  619.0  2382.0  9340.0
States StdDev   0.5    0.7    1.0     1.4
States 99% CI   0.0    0.1    0.1     0.1
Score Average   15.6   12.2   6.2     0.7
Score StdDev    21.2   20.6   16.5    6.3
Score 99% CI    1.7    1.7    1.3     0.5

Table C.15: Banker and Thief 2 game showing the HP player performance.

Resources       1,1,1,1  2,2,2,2  4,4,4,4   8,8,8,8
Count           1024     1024     512       256
States Average  2173.9   15183.4  101943.0  743705.1
States StdDev   7.3      15.0     30.2      64.1
States 99% CI   0.6      1.2      3.4       10.3
Score Average   14.7     21.7     23.8      28.3
Score StdDev    21.2     21.2     20.1      17.2
Score 99% CI    1.7      1.7      2.3       2.8

Table C.16: Banker and Thief 2 game showing the HP-II player performance.

⁵ Optimum play rewards $40.00 by creating a false target of $60.00: with $100.00 deposited in total, placing $60.00 in a decoy bank draws the thief away from the target, leaving the banker the $40.00 deposited there.

Resources       1,1    2,2     4,4     8,8
Count           1024   1024    512     256
States Average  284.5  1141.0  4465.9  17668.2
States StdDev   1.1    1.6     2.2     3.1
States 99% CI   0.1    0.1     0.2     0.5
Score Average   10.3   8.0     3.0     0.0
Score StdDev    17.6   16.7    13.0    0.0
Score 99% CI    1.4    1.3     1.5     0.0

Table C.17: Banker and Thief 4 game showing the HP player performance.

Resources       1,1,1,1  2,2,2,2  4,4,4,4   8,8,8,8
Count           1024     1024     512       256
States Average  6525.3   46184.5  323538.3  2416665.1
States StdDev   23.3     44.9     95.7      197.0
States 99% CI   1.9      3.6      10.9      31.8
Score Average   10.0     14.7     19.0      20.0
Score StdDev    16.8     15.7     14.6      13.9
Score 99% CI    1.4      1.3      1.7       2.2

Table C.18: Banker and Thief 4 game showing the HP-II player performance.

C.2.5 Battleships in Fog

Variants of 3x3, 4x4 and 5x5 grids were used with a game length of 10 moves. This is a tactical game where players must evaluate every round for a tactical advantage. The HP player plays little better than random, with a score just above 50 due to some lucky first shots, whereas the HP-II player tends towards optimum play as the resources are increased.

Resources       1,1    2,2    4,4     8,8
Count           980    980    980     980
States Average  288.6  967.8  3515.2  13180.3
States StdDev   123.0  373.7  1263.6  4996.4
States 99% CI   10.1   30.8   104.1   411.8
Score Average   51.2   50.2   54.0    53.2
Score StdDev    38.6   40.1   40.0    39.9
Score 99% CI    3.2    3.3    3.3     3.3

Table C.19: Battleships in Fog 3x3 game showing the HP player performance.

Resources       1,1,1,1  2,2,1,1  4,4,2,2   8,8,4,4
Count           980      980      490       92
States Average  13493.8  55866.3  467577.0  5236046.9
States StdDev   5556.3   23588.3  153079.4  1149174.6
States 99% CI   457.9    1944.0   17841.8   309109.1
Score Average   51.7     59.6     79.9      89.1
Score StdDev    45.9     47.4     39.9      31.3
Score 99% CI    3.8      3.9      4.7       8.4

Table C.20: Battleships in Fog 3x3 game showing the HP-II player performance.

Resources       1,1    2,2     4,4      8,8
Count           1024   1024    1024     512
States Average  987.2  3210.4  11444.4  45127.0
States StdDev   504.2  1526.3  5107.5   18888.0
States 99% CI   40.6   123.1   411.8    2153.6
Score Average   45.9   52.2    52.0     50.3
Score StdDev    44.2   46.6    47.1     47.6
Score 99% CI    3.6    3.8     3.8      5.4

Table C.21: Battleships in Fog 4x4 game showing the HP player performance.

Resources       1,1,1,1   2,2,1,1   4,4,2,2    8,8,4,4
Count           1024      1024      504        228
States Average  128320.8  490609.2  3515140.9  33246197.0
States StdDev   56323.8   232071.8  1261395.0  6833719.9
States 99% CI   4541.1    18710.8   144962.5   1167641.3
Score Average   50.7      65.0      79.2       95.2
Score StdDev    46.9      46.7      40.7       21.5
Score 99% CI    3.8       3.8       4.7        3.7

Table C.22: Battleships in Fog 4x4 game showing the HP-II player performance.

Resources       1,1     2,2     4,4      8,8
Count           1024    1024    512      512
States Average  2948.8  9776.6  37334.5  139919.1
States StdDev   1529.7  4890.2  17197.6  62957.6
States 99% CI   123.3   394.3   1960.9   7178.5
Score Average   47.4    52.8    50.2     52.5
Score StdDev    45.2    46.4    45.9     47.5
Score 99% CI    3.6     3.7     5.2      5.4

Table C.23: Battleships in Fog 5x5 game showing the HP player performance.

Resources       1,1,1,1   2,2,1,1    4,4,2,2     8,8,4,4
Count           1024      490        256         107
States Average  905531.3  3283660.8  19867908.5  147462381.6
States StdDev   432442.3  1658349.8  8301866.4   24581462.6
States 99% CI   34865.7   193284.8   1338676.0   6131059.6
Score Average   52.1      63.7       78.5        91.6
Score StdDev    47.5      48.0       41.2        27.9
Score 99% CI    3.8       5.6        6.6         7.0

Table C.24: Battleships in Fog 5x5 game showing the HP-II player performance.

Appendix D

HyperPlay Efficiency Experimental Results

This appendix recapitulates the experimental results presented in previous works. It expands on the published results and provides the design of the experiments and the equipment used in the experiments.

• Schofield, M., Thielscher, M. (2017). The efficiency of the hyperplay technique over random sampling. In Proceedings of Thirty First AAAI Conference on Artificial Intelligence.

D.1 Design of Experiments

This appendix outlines the design of experiments to expose the shortcomings of HyperPlay and identify the impact on the agent.

D.1.1 Sampling Efficiency

The efficiency of the sampling process is tested by playing a batch of games and recording the states visited in each round when updating each of the models. Every round this number is written to a log file and the files are collated. The statistic is examined and used to calculate the probability of successfully making a random selection as well as the cost of updating the model. The resources for each role are set well below the optimal level. This ensures good variety in the game-play and a broad base for the calculation of the statistic. The HyperPlay process tests each move substitution in a random order; thus it can gain an accurate estimate of the probability in equation 6.2 by counting the "first time" successes in that round.
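A minimal sketch of how such a per-round statistic could be computed from the logs; the record fields (hp_states, rand_states, first_try) are hypothetical names for the quantities described above, not the thesis implementation.

    def round_statistics(records):
        """For one round's log records: the HyperPlay update cost relative to a fresh
        random sample (the percentage reported in Table D.1), and the proportion of
        first-time successful move substitutions, used to estimate the probability
        of a random selection being valid (equation 6.2)."""
        hp_cost = sum(r["hp_states"] for r in records)        # states visited updating the models
        random_cost = sum(r["rand_states"] for r in records)  # states a fresh random sample would visit
        p_first_try = sum(r["first_try"] for r in records) / len(records)
        return hp_cost / random_cost, p_first_try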

D.1.2 Biased Samples

Bias is measured by playing a batch of games and logging the history footprint of each model for each round. The footprints are examined for repetition and a frequency chart is created. A Pearson's χ² test is then performed on the distribution and a probability value is calculated. The resources for each role are set well below the optimal level. This ensures good variety in the game-play and a broad base for the statistic.
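A sketch of such a test, assuming SciPy is available; footprints would be the logged history footprints of all models for one round, and any elements of the information set that were never sampled would need to be added with zero counts for a faithful test.

    from collections import Counter
    from scipy.stats import chisquare

    def uniformity_p_value(footprints, information_set_size=None):
        """Pearson chi-squared test of the hypothesis that the sampled footprints
        are uniformly distributed over the information set; returns the p-value."""
        counts = Counter(footprints)                  # frequency chart of repeated footprints
        observed = list(counts.values())
        if information_set_size is not None:          # include never-sampled elements as zeros
            observed += [0] * (information_set_size - len(observed))
        return chisquare(observed).pvalue             # expected frequencies default to uniform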

D.1.3 Bias Remedies

These are examined by playing two batches of games, with and without remedies. The final scores for a designated player are averaged and reported along with a confidence interval. The resources for each player are set so that the player is competitive within a realistic time constraint based on the game complexity and the common competition times. In competition an agent would truncate its search to meet the time constraints; this would distort the statistic, so the experiments allow the agent to complete the search for each round. The first remedy was to inversely weight the results from each model based on its sample frequency. This mathematical adjustment gives each element of an information set an equal impact in the evaluation of outcomes. If an element was represented by 5 models then each model contributed only 20% of its outcomes to the evaluation process.

The second remedy was to re-balance the sample by replacing a more frequently sampled history with a less frequently sampled history so that each element was sampled the same number of times. Although this may seem to be the same as the first remedy, it makes better use of the player's resources.
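A sketch of the two remedies, under the assumption that each model exposes a hashable history_footprint attribute; the clone() call stands in for a hypothetical copy operation, not an actual HyperPlay API.

    from collections import Counter

    def inverse_weights(models):
        """First remedy: weight each model by 1/frequency of its history footprint, so
        an element represented by 5 models contributes only 20% of its outcomes per model."""
        freq = Counter(m.history_footprint for m in models)
        return [1.0 / freq[m.history_footprint] for m in models]

    def rebalance(models):
        """Second remedy: replace the most over-represented history with a copy of the
        least represented one, so elements tend towards equal sample counts."""
        freq = Counter(m.history_footprint for m in models)
        most = max(models, key=lambda m: freq[m.history_footprint])
        least = min(models, key=lambda m: freq[m.history_footprint])
        if freq[most.history_footprint] - freq[least.history_footprint] > 1:
            models[models.index(most)] = least.clone()   # hypothetical copy of the model
        return models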

D.1.4 Roles

The roles were chosen to give meaningful results. In two-player turn-taking games the second player is used for the statistic, as that player receives imperfect information first. In two-player simultaneous-play games the choice is arbitrary.

D.1.5 Experimental Procedure

The experimental program serializes the game controller and the players into a single thread to facilitate parallelization across an array. Because of variable run times on different hardware, the number of states visited is reported. When indicative times are given they are for computation performed by a single agent on an Intel Core i7-2600 @ 3.4GHz in a single thread. Where appropriate, a confidence level of 95% is used in a two-tailed calculation using the standard deviation for a binomial distribution. Where an average is statistically meaningless, a median and upper & lower quartiles are reported. Batch sizes were calculated to give statistically meaningful results. Generally each experiment had a batch size of 1000 games, or 10,000 observations.
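The two reporting conventions named above can be sketched as follows; this is only an illustration of the statistics used, not the experimental harness itself.

    import math
    import statistics

    def binomial_ci95(successes, trials):
        """Two-tailed 95% interval for a win rate, using the binomial standard deviation."""
        p = successes / trials
        half_width = 1.96 * math.sqrt(p * (1 - p) / trials)
        return p - half_width, p + half_width

    def quartile_summary(observations):
        """Lower quartile, median and upper quartile, reported where a mean is not meaningful."""
        q1, median, q3 = statistics.quantiles(observations, n=4)
        return q1, median, q3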

D.1.6 Games

The basket of games chosen for the experiments was drawn from the games available within the GGP community and from the newly converted security games. A variety of information imperfections are represented in the games. Cut-down versions of the games are used, where possible, without loss of generality.

Battleships in Fog is a two-player, turn-taking, random-start game with information-gathering moves. It requires a HP-II player to be played effectively.

Blind Breakthrough is a two-player, turn-taking blind variant of the Breakthrough game. There are no information-gathering moves and it can be played effectively by a HyperPlayer using a tree search for move evaluation.

Border Protection is a two-player, turn-taking security game where one role is the random player. Percepts are provided to the defender to facilitate in-game response changes. There are no information-gathering moves and it can be played effectively by a HyperPlayer using a tree search for move evaluation.

Guess Who is a single-player mystery-solving game where all moves are information-gathering moves. It requires a HP-II player to be played effectively.

Hidden Connect is a two-player, turn-taking game that is a blind version of Connect 4. There are no information-gathering moves and it can be played effectively by a HyperPlayer using a tree search for move evaluation.

KriegTTT is a two-player, simultaneous-move game that is a blind version of TicTacToe. There are no information-gathering moves and it can be played effectively by a HyperPlayer using a tree search for move evaluation.

Mastermind is a single-player mystery-solving game where all moves are information-gathering moves. It is solved by the backtracking of invalid models by either a HP or HP-II player. A well-resourced player can achieve a binary search.

Transit is a two-player, turn-taking security game where one role is the random player. Percepts are provided to the defender to facilitate in-game response changes. There are no information-gathering moves and it can be played effectively by a HyperPlayer using a tree search for move evaluation.

D.1.7 Case Study

Game variants and sizes have been chosen to prove (or disprove) the experimental objectives with a minimum of computational resources. However, there is value in extending the experimentation using one of the full-scale versions of a common game in the form of a case study. Blind Breakthrough in the full 8x8 format is used to show how impractical random sampling can be. A sub-optimal HyperPlayer is used to ensure good variety in the game-play and a broad base for the calculation of the probability in equation 6.2. The intention is to show that random sampling would become impossible within the normal time constraints, yet the HyperPlayer could successfully maintain a bag of models throughout the entire game. It should be noted that the HyperPlayer is capable of taking a model off line if the backtracking process consumes too many resources. This is one of its design features for managing large search spaces.

D.2 Results

This section presents the experimental results along with comments that explain and highlight without drawing any conclusions.

D.2.1 Sampling Efficiency

Round  BB    BiF   BP    GW    HC    TTT   Mm    T
1      100%  68%   100%  100%  100%  100%  100%  100%
2      76%   39%   67%   67%   67%   82%   107%  61%
3      123%  12%   50%   58%   49%   71%   88%   73%
4      37%   6%    27%   56%   37%   80%   75%   39%
5      22%   1%    19%   59%   32%   86%   59%   53%
6      18%   1%    15%   57%   18%   31%
7      12%   1%    21%   57%   14%   33%
8      9%    1%    47%   10%   21%
9      8%    0%    33%   42%
10     6%    0%    25%   29%
11     4%    0%    58%
12     3%    0%
13     3%    0%
14     3%    0%
15     2%    0%
16     2%    0%
17     2%
18     3%
19     2%
20     2%
21     2%
22     2%
23     2%
24     2%
25     2%

Table D.1: Cost of sampling using HyperPlay compared to performing a random sample, as measured by states visited.

In Figure 6.4 the HyperPlay cost of sampling is compared to taking a random sample. The game-play is different for each game, with some being turn-taking and others not, and some games have watershed rounds where percepts collapse an information set. Thus, there is no rhyme or reason to the patterns in the results. Each game was configured to provide approximately 10,000 observations for each data point on the chart. The lone data point well above the 100% mark in Blind Breakthrough occurs when the black player initially encounters enemy pawns. The resulting re-sample is less efficient than taking new random samples.

D.2.2 Uniform Distribution

Table D.2 shows the probability that a sample performed by HyperPlay is uniformly distributed across an information set in the middle of a game. From Figure 6.4 it can be seen that game play is not uniform, so the worst result (most biased) is shown from the mid-game rounds.

Game                 Round  Q1     Median  Q3
Battleships In Fog   6      0.030  0.158   0.454
Blind Breakthrough   6      0.006  0.156   0.663
Border Protection    4      0.005  0.040   0.262
Guess Who            5      0.001  0.001   0.094
Hidden Connect       4      0.001  0.001   0.001
Krieg TTT            3      0.001  0.001   0.001
Mastermind           3      0.027  0.211   0.520
Transit              5      0.001  0.001   0.116

Table D.2: Probability of a uniformly distributed sample of an information set created by HyperPlay in mid game

Game                 Round  Q1     Median  Q3
Battleships In Fog   12     0.004  0.058   0.150
Blind Breakthrough   11     0.246  0.998   1.000
Border Protection    7      0.001  0.001   0.004
Guess Who            10     0.121  0.399   0.792
Hidden Connect       8      0.001  0.001   0.001
Krieg TTT            5      0.001  0.167   0.992
Mastermind           5      0.413  0.899   1.000
Transit              11     0.996  0.999   1.000

Table D.3: Probability of a uniformly distributed sample of an information set created by HyperPlay in the end game.

As one playout of a game is not the same as another it is not possible to average the results, so the median and upper and lower quartile readings of the probability value from a Pearson's χ² test are shown. The median value for Battleships in Fog of 0.158 indicates a 15.8% probability that the sample is uniformly distributed. Table D.3 shows a similar statistic for the end game. Again, the worst result is shown, from the final rounds of the game. It is worth noting that some samples become more uniform at the end of the game as the information set shrinks towards certainty. Blind Breakthrough becomes a pawn-swapping exercise towards the end game and nears certainty. Mastermind becomes certain as the binary search nears completion, and the Transit game almost always becomes certain at the end, when the evader is caught.

D.2.3 Remedy for Biased Samples

In Table D.4 the results of a batch of games played with different player configurations are shown. The base case is two evenly matched players with no attempt to correct biased samples. The weight remedy reduces the weighting of a sample in proportion to its repetition, and the balance remedy re-balances the sample every round. The mean values are shown with a 95% confidence interval.

Game                 Base      Weight    Balance
Battleships In Fog   82.0±2.4  85.3±2.2  82.8±2.3
Blind Breakthrough   52.5±3.1  51.4±3.1  53.2±3.1
Border Protection    51.0±3.1  49.2±3.1  51.0±3.1
Guess Who            67.8±2.9  68.6±2.9  69.4±2.9
Hidden Connect       37.2±3.0  36.6±3.0  36.0±3.0
Krieg TTT            51.6±3.1  50.9±3.1  51.7±3.1
Mastermind           92.8±1.6  93.3±1.6  93.8±1.5
Transit              72.2±2.8  74.6±2.7  73.0±2.8

Table D.4: Average performance of player using different remedies to unbalanced samples of an information set.

D.2.4 Case Study

In Table D.5 the results from the full-sized version of Blind Breakthrough are shown. Both players were resourced just enough to exhibit a variety of game plays without making "stupid" moves. The results show an estimated upper bound on the set of play histories and the probability that a random history is valid in the game being played.

Also note that some models are taken off line by the HyperPlayer if they exceed 100,000 states visited in the backtracking stage. This is a design feature to prevent paralysis of the player. Such models can always be brought back on line if time permits.
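One direct way to estimate the reported probability is rejection sampling: draw uniformly random joint-move histories of the current length and count how many are consistent with the percepts received. The helpers random_history and consistent_with below are hypothetical; the thesis derives the probability from the first-time successes of HyperPlay's move substitutions (Section D.1.1) rather than from this brute-force check.

    def estimate_p_valid(game, percepts, trials=10_000):
        """Monte Carlo estimate of P(Valid(h_R)): the fraction of uniformly random
        joint-move histories of length R that are consistent with the percepts."""
        valid = 0
        for _ in range(trials):
            history = game.random_history(len(percepts))   # hypothetical helper
            if game.consistent_with(history, percepts):    # hypothetical helper
                valid += 1
        return valid / trials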

Round  P(Valid(h_R))  sup|{h_R}|  Active Models
1      100%           22          100%
2      100%           22          100%
...    ...            ...         ...
16     1.19%          4.2 E+11    71.9%
17     0.73%          1.5 E+13    70.3%
...    ...            ...         ...
32     < 0.01%        6.0 E+23    50.0%
33     < 0.01%        1.7 E+25    48.7%

Table D.5: Full-sized Blind Breakthrough case study, showing the probability of randomly choosing a valid play history, an upper bound on the set size, and the models still active.

Appendix E

Imperfect Information Game Topologies

This appendix recapitulates previous works. It presents detailed topologies for the imperfect-information games used in experiments throughout this thesis.

• Schofield, M., Thielscher, M. (2016b). The scalability of the hyperplay technique for imperfect-information games. In Proceedings of AAAI Workshop on Computer Poker and Imperfect Information Games.

E.1 Imperfect-Information Games

In the General Game Playing domain for imperfect-information games, the rules of the game and the reward structure are fully known to each player. What is not automatically known are the moves made by the other players in the game. Players receive percepts from the game controller according to the rules of the game expressed in GDL-II. This appendix looks at the variations that can occur in the structure of a game.

E.1.1 Imperfect Move Information

This is perhaps the simplest type of game, where a player must compete with another player whose moves are hidden, receiving only an occasional clue about the game play. Whilst this is true for almost all imperfect-information games, there are some games where the moves are the only thing that is hidden. There are many games that fit this category, including board games that have been adapted for imperfect information. Hidden Connect is used to represent this type of game.

Hidden Connect Is a blind version of the children’s two player board game, Connect 4.

Game Play is turn taking, with players dropping a coloured token into a vertical column of a grid. A player wins by making a row, column or diagonal of their tokens.

Reward Structure is constant sum with win, loss and draw possible for each player.

Player Percepts are received only when a column is full so that illegal moves are not made. Otherwise players get no clue as to their opponent's moves.

Optimal Play Strategy is a very basic modelling of expected outcomes. The only information gathering moves come indirectly by filling (or not filling) a column.

Game Tree is variable depth with a diminishing branching factor as columns fill up. The game tree contains around 10^17 states for the five column version of the game.

Scalability is achieved by changing the number of columns and the number of tokens forming a line.

E.1.2 Imperfect Initial Information

This is also one of the simpler types of game, where a player must search out the solution to a challenge that starts with some missing information. Often the random player (nature) makes some hidden move to begin the challenge. There are many games that fit this category, including many popular card games; Mastermind is used to represent this type of game.

Mastermind Is a guessing game where the random player selects a coloured pattern, and the player must replicate the pattern exactly.

Game Play is single player with the player offering coloured patterns for evaluation against a target.

Reward Structure is diminishing pro rata according to the correctness of the guess in a maximum number of turns.

Player Percepts are received every round about the closeness of the matching pattern, but are ambiguous.

Optimal Play Strategy is to reduce the size of the information set as quickly as possible, and so a move that can halve the information set is an optimal move.

Game Tree is variable depth with a constant branching factor. The game tree contains around 10^9 states for the three colour version of the game.

Scalability is achieved by changing the number of colours and changing the number of tokens forming the pattern.

E.1.3 Information Purchasing

This type of game tests the player's ability to value information-gathering moves. Generally the player can choose between asking a question about the state of the game, or attempting to meet the criteria for success. Information-gathering moves incur a cost through a reduction in the final score. There are fewer of these games played in General Game Playing and in the community. The Number Guessing game is used to represent this type of game.

Number Guessing Is a guessing game where the random player selects a number from 1 to 16, and the player must ask questions about the number or guess the number directly.

Game Play is single player with the player asking questions like "is it less than x?" or guessing the number directly.

Reward Structure is diminishing per turn. As such, every question will cost the player a small portion of their final score.

Player Percepts are received every round in response to the question asked.

Optimal Play Strategy is to reduce the size of the information set as quickly as possible, and so a move that can halve the information set is an optimal move. A binary search is achievable in this game (a short sketch follows this description).

Game Tree is variable depth with a constant branching factor. The game tree contains around 10^12 states for the 16 number version of the game.

Scalability is achieved by changing the number range.
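A minimal sketch of the binary-search strategy for the 16-number variant, assuming the only question form is "is it less than x?": each answer halves the information set, so at most log2(16) = 4 questions are needed to identify the number.

    def binary_search_guess(secret, lo=1, hi=16):
        """Each 'is it less than x?' question halves the candidate set, locating the
        secret number in at most 4 questions for the 1..16 variant."""
        questions = 0
        while lo < hi:
            mid = (lo + hi + 1) // 2
            questions += 1
            if secret < mid:        # percept: "yes, it is less than mid"
                hi = mid - 1
            else:                   # percept: "no"
                lo = mid
        return lo, questions        # lo == secret, questions <= 4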

E.1.4 Hidden Objectives

This type of game tests the player's ability to identify their opponent's success criteria, or to hide their own success criteria. There are very few of these games played in General Game Playing and in the community. Banker and Thief, from the GGP competition at the Australasian Joint Conference on Artificial Intelligence, is used to represent this type of game.

Banker and Thief Is a strategy game where the random player gives the banker a secret target. The banker must avoid giving clues to the thief that will allow the thief to steal from the banker.

Game Play is a two player game with an uneven reward structure. The banker is given a target bank (from many banks) in which to make deposits; however, if the thief can guess the target bank then the banker loses.

Reward Structure is uneven. The banker can keep all of the money deposited in the target bank after the thief has struck. The thief can only keep money stolen if it comes from the target bank.

Player Percepts are received every round about deposits made into each bank.

Optimal Play Strategy is to fool the thief into choosing the wrong target. As such the banker must give up more than half the money. A greedy strategy will always fail in this game.

Game Tree is fixed depth with a constant branching factor. The game tree contains around 10^7 states for the 4 bank, ten deposit version of the game.

Scalability is achieved by changing the number of banks and the number of deposits.

E.1.5 Tactical Information Valuing

This type of game tests the player's ability to correctly value information in terms of a tactical advantage in the game play: both the cost of collecting information and the cost of keeping secrets. There is no direct cost for information-gathering moves, but there is a tactical cost/benefit in terms of the expected outcome of the game. There are very few of these games played in General Game Playing and in the community. Battleships in the Fog, as played in the GGP competition at the Australasian Joint Conference on Artificial Intelligence, is used to represent this type of game.

Battleships in the Fog This is a tactical two player game with a random starting position.

Game Play is turn taking. Players can choose from three basic actions: scan, fire or move. Scanning reveals the location of their opponent, but tells the opponent they have been scanned. Firing upon an adjacent square may lead to victory, but reveals the location of the square being fired upon. Moving changes a player's location on the board.

Reward Structure is win/loss/draw. In practical terms the game is win/loss as a draw takes many moves of avoiding each other.

Player Percepts are received according to the action taken. Scanning gives the location of the scanned vessel to both players, and Firing gives the location of the square being fired upon.

Optimal Play Strategy is to correctly value the risks and rewards of moves that generate a percept. Ultimately this is a game of calculating what information your opponent knows about you based on what you know about your opponent, and hence of correctly valuing moves that reveal information.

Game Tree is variable depth with a constant branching factor. The game tree contains around 10^33 states for the five-by-five grid.

Scalability is achieved by changing the size of the grid.

List of Figures

1.1 Adapted from Maturana and Varela’s ideogram of systems that exhibit co-determination and self-development...... 9

2.1 A GDL-II description of the Monty Hall game...... 21 2.2 Monty Hall game tree, as seen by the contestant. Moves are serialized and (does role noop) removed, for the sake of clarity. The moves in the game were (does random (hide_car 3)), (does contestant (choose 1)), (does random (open_door 2))...... 26

3.1 An imperfect-information player using the HyperPlay technique to maintain a bag of models of the game...... 47 3.2 The HyperPlay technique used to maintain a model of the game...... 48 3.3 The Game Tree for a GDL-II game at rounds n and n + 1, showing the subtree defined by an information set...... 49 3.4 Monty Hall results, validating the weighting used in the particle filter. . . . . 58 3.5 Krieg-TicTacToe results ...... 59 3.6 Blind Breakthrough results ...... 61 3.7 GDL-II description of the Exploding Bomb game...... 63

4.1 The game tree for the Exploding Bomb. At Round 1 the agent has an

information set of two nodes; that is, Iagent = {hv1 , hv2 }...... 70 4.2 A visual representation of an Imperfect Information Simulation of a two player game, where each role’s beliefs are built on one half of the play history. . . . 73 4.3 The NumberGuessing Results for HyperPlay-II...... 79 4.4 The Banker and Thief results...... 80

185 4.5 The Battleships In Fog results...... 81

5.1 Counting the States Visited in a simple search...... 89 5.2 Experimental results showing a comparison of resources used by various configurations of both techniques for the Hidden Connect game. The labels refer to the player and the size of the game: HP,3 refers to HyperPlayer playing Connect 3, etc...... 94 5.3 Experimental results showing a comparison of resources used by various configurations of both techniques for the Mastermind game...... 95 5.4 Experimental results showing a comparison of resources used by various configurations of both techniques for the number Guessing game...... 96 5.5 Experimental results showing a comparison of resources used by various configurations of both techniques for the Banker and Thief game...... 97 5.6 Experimental results showing a comparison of resources used by various configurations of both techniques for the Battleships in Fog game...... 99

6.1 An example of a silo defined by the first move in a game. The black nodes are marked "bad" by the HyperPlay technique...... 111 6.2 Left: Transit security game, Right: Border Protection security game. . . . . 113 6.3 A sample of the GDL-II description of the Transit security game highlighting the key aspects of the game...... 114 6.4 Cost of sampling using HyperPlay compared to performing a random sample, as measured by states visited...... 116

List of Tables

2.1 GDL-II keywords ...... 20

4.1 Experimental expected score calculations made by the players during the Exploding Bomb decision-making process. The bold scores indicate the chosen actions...... 78 4.2 Expected score calculations for the arming agent in round one of the Spy vs. Spy decision-making process. The bold scores indicate the chosen actions...... 79

5.1 Results of pruning the search space on player performance ...... 100

6.1 Probability of a uniformly distributed sample of an information set created by HyperPlay in mid game...... 117 6.2 Probability of a uniformly distributed sample of an information set created by HyperPlay in the end game...... 118 6.3 Average performance of player using different remedies to unbalanced samples of an information set...... 119 6.4 Full sized Blind Breakthrough with the probability of randomly choosing a valid play history, an upper bound on the set size and the models still active. 120

A.1 Monty Hall Results showing the mean values for the HyperPlayer...... 137 A.2 Monty Hall Results showing the resources and observations values for the HyperPlayer...... 137 A.3 Monty Hall Results showing the 95% Confidence Interval for the HyperPlayer. 138 A.4 TicTacToe Results showing the mean values for the HyperPlayer...... 138

187 A.5 TicTacToe Results showing the resources and observations values for the HyperPlayer...... 139 A.6 TicTacToe Results showing the 95% Confidence Interval for the HyperPlayer. 139 A.7 Blind Breakthrough Results showing the mean values for the HyperPlayer. . . 140 A.8 Blind Breakthrough Results showing the resources and observations values for the HyperPlayer...... 140 A.9 Blind Breakthrough Results showing the 95% Confidence Interval for the HyperPlayer...... 140

B.1 Experimental results for the Exploding Bomb game for the HyperPlay based player...... 147 B.2 Experimental results for the Exploding Bomb game for the HyperPlay-II based player...... 147 B.3 Experimental results for the Exploding Bomb game for the HyperPlay based player...... 148 B.4 Experimental results for the Exploding Bomb game for the HyperPlay-II based player...... 148 B.5 Experimental results for the Number Guessing game for the HyperPlay-II based player...... 149 B.6 Experimental results for the Banker and Thief game for the HyperPlay-II based player...... 149 B.7 Experimental results for the Battleships in Fog game for the HyperPlay-II based player...... 150

C.1 Hidden Connect 3 game showing the HP player performance...... 156 C.2 Hidden Connect 3 game showing the HP-II player performance...... 157 C.3 Hidden Connect 4 game showing the HP player performance...... 157 C.4 Hidden Connect 4 game showing the HP-II player performance...... 157 C.5 Mastermind 2x3 game showing the HP player performance...... 158 C.6 Mastermind 2x3 game showing the HP-II player performance...... 158 C.7 Mastermind 3x4 game showing the HP player performance...... 159 C.8 Mastermind 3x4 game showing the HP-II player performance...... 159 C.9 Number Guessing 4 game showing the HP player performance...... 160 C.10 Number Guessing 4 game showing the HP-II player performance...... 160 C.11 Number Guessing 8 game showing the HP player performance...... 160

188 C.12 Number Guessing 8 game showing the HP-II player performance...... 161 C.13 Number Guessing 16 game showing the HP player performance...... 161 C.14 Number Guessing 16 game showing the HP-II player performance...... 161 C.15 Banker and Thief 2 game showing the HP player performance...... 162 C.16 Banker and Thief 2 game showing the HP-II player performance...... 162 C.17 Banker and Thief 4 game showing the HP player performance...... 163 C.18 Banker and Thief 4 game showing the HP-II player performance...... 163 C.19 Battleships in Fog 3x3 game showing the HP player performance...... 164 C.20 Battleships in Fog 3x3 game showing the HP-II player performance...... 164 C.21 Battleships in Fog 4x4 game showing the HP player performance...... 164 C.22 Battleships in Fog 4x4 game showing the HP-II player performance...... 165 C.23 Battleships in Fog 5x5 game showing the HP player performance...... 165 C.24 Battleships in Fog 5x5 game showing the HP-II player performance...... 165

D.1 Cost of sampling using HyperPlay compared to performing a random sample, as measured by states visited...... 172 D.2 Probability of a uniformly distributed sample of an information set created by HyperPlay in mid game ...... 173 D.3 Probability of a uniformly distributed sample of an information set created by HyperPlay in the end game...... 173 D.4 Average performance of player using different remedies to unbalanced samples of an information set...... 174 D.5 Average performance of player using different remedies to unbalanced samples of an information set...... 175

References

Auer, P., Billard, A., Bischof, H., Bloch, I., Boettcher, P., Bülthoff, H., . . . others (2005). A research roadmap of cognitive vision. IST project IST-2001-35454. Billings, D., Davidson, A., Schauenberg, T., Burch, N., Bowling, M., Holte, R., . . . Szafron, D. (2004). Game-tree search with adaptation in stochastic imperfect-information games. In Proceedings of International Conference on Computers and Games (p. 21-34). Björnsson, Y., & Finnsson, H. (2009). CadiaPlayer: A simulation-based general game player. IEEE Transactions on Computational Intelligence and AI in Games, 1, 4-15. Bogaard, P. A. (1979). Heaps or wholes: Aristotle’s explanation of compound bodies. Isis, 11-29. Bošansk`y, B., Jiang, A. X., Tambe, M., & Kiekintveld, C. (2015). Combining compact representation and incremental generation in large games with sequential strategies. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (p. 812- 818). Bowling, M., Burch, N., Johanson, M., & Tammelin, O. (2015). Heads-up limit holdem poker is solved. Science, 347(6218), 145-149. Browne, C., Powley, E., Whitehouse, D., Lucas, S., Cowling, P., Rohlfshagen, P., . . . Colton, S. (2012). A survey of Monte Carlo Tree Search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1), 1-43. Cazenave, T., Saffidine, A., Schofield, M., & Thielscher, M. (2016a). Discounting and pruning for nested playouts in general game playing. In Proceedings of The IJCAI-15 Workshop on General Game Playing. Cazenave, T., Saffidine, A., Schofield, M., & Thielscher, M. (2016b). Nested monte carlo search for two-player games. In Proceedings of Thirtieth AAAI Conference on Artificial Intelligence.

191 Clune, J. (2007). Heuristic evaluation functions for general game playing. In Proceedings of Twenty Second AAAI Conference on Artificial Intelligence (p. 1134-1139). Cowling, P. I., Powley, E. J., & Whitehouse, D. (2012). Information set monte carlo tree search. IEEE Transactions on Computational Intelligence and AI in Games, 4(2). Edelkamp, S., Federholzner, T., & Kissmann, P. (2012). Searching with partial belief states in general games with incomplete information. In Proceedings of Annual Conference on Artificial Intelligence (p. 25-36). Fenster, M., Kraus, S., & Rosenschein, J. S. (1995). Coordination without communi- cation: Experimental validation of focal point techniques. In Proceedings of International Conference on Mechatronics and Automation Science (p. 102-108). Frank, I., & Basin, D. (1998). Search in games with incomplete information: A case study in using Bridge card play. Artificial Intelligence, 100(1-2), 87-123. Frank, I., & Basin, D. (2001). A theoretical and empirical investigation of search in imperfect information games. Theoretical Computer Science, 252(1-2), 217-256. Genesereth, M. R., Love, N., & Pell, B. (2005). General game playing: Overview of the AAAI competition. AI Magazine, 26(2), 62-72. Ginsberg, M. L. (2001). GIB: Imperfect information in a computationally challenging game. Journal of Artificial Intelligence Research, 14, 303-358. Goertzel, B., & Pennachin, C. (2007). Artificial General Intelligence (Vol. 2). Springer. Hart, S. (1992). Games in extensive and strategic forms. Handbook of Game Theory with Economic Applications, 19-40. Hutter, M. (2007). Universal algorithmic intelligence: A mathematical top down approach. In Artificial General Intelligence (p. 227-290). Springer. Lanctot, M., Lis`y, V., & Bowling, M. (2014). Search in imperfect information games using online monte carlo counterfactual regret minimization. In AAAI Workshop on Computer Poker and Imperfect Information. Levine, J., Bates Congdon, C., Ebner, M., Kendall, G., Lucas, S. M., Miikkulainen, R., . . . Thompson, T. (2013). General video game playing. Artificial and Computational Intelligence in Games, 77-84. Leyton-Brown, K., & Shoham, Y. (2008). Essentials of game theory: A concise multidisci- plinary introduction. Synthesis Lectures on Artificial Intelligence and Machine Learning, 2(1), 1-88. Lis`y, V., Davis, T., & Bowling, M. (2016). Counterfactual regret minimization in sequential security games. In Proceedings of Thirtieth AAAI Conference on Artificial Intelligence.

192 Long, J., Sturtevant, N., Buro, M., & Furtak., T. (2010, July). Understanding the success of perfect information M Carlo sampling in game tree search. In Proceedings of Twenty Fourth AAAI Conference on Artificial Intelligence (p. 134-140). Atlanta: AAAI Press. Love, N., Hinrichs, T., Haley, D., Schkufza, E., & Genesereth, M. (2008). General game playing: Game description language specification (No. LG-200801). Love, N., Hinrichs, T., Schkufza, D. H. E., & Genesereth, M. (2006). General game playing: Game description language specification (Tech. Rep. No. LG-2006-01). Stanford Logic Group. Méhat, J., & Cazenave, T. (2011). A parallel general game player. KI-Künstliche Intelligenz, 43-47. Nijssen, P., & Winands, M. H. (2012). Monte carlo tree search for the hide-and-seek game scotland yard. IEEE Transactions on Computational Intelligence and AI in Games, 4(4), 282–294. Perez-Liebana, D., Samothrakis, S., Togelius, J., Schaul, T., Lucas, S. M., Couëtoux, A., . . . Thompson, T. (2016). The 2014 general video game playing competition. IEEE Transactions on Computational Intelligence and AI in Games, 8(3), 229–243. Quenault, M., & Cazenave, T. (2007). Extended general gaming model. In Proceedings of Computer Games Workshop (p. 195-204). Rasmusen, E. (2007). Games and information: an introduction to game theory. Blackwell Publishing. Richards, M., & Amir, E. (2009). Information set sampling for general imperfect information positional games. In Proceedings of GIGA 2009 Workshop on General Game Playing (p. 59-66). Rosenhouse, J. (2009). The : The remarkable story of math’s most contentious brain teaser. Oxford University Press. Russell, S. J., Norvig, P., Canny, J. F., Malik, J. M., & Edwards, D. D. (2003). Artificial intelligence: a modern approach (Vol. 2). Prentice hall Upper Saddle River. Schäfer, J., Buro, M., & Hartmann, K. (2008). The uct algorithm applied to games with imperfect information. Diploma, Otto-Von-Guericke University Magdeburg, Magdeburg, Germany. Schiffel, S., & Björnsson, Y. (2014). Efficiency of gdl reasoners. IEEE Transactions on Computational Intelligence and AI in Games, 6(4), 343-354. Schiffel, S., & Thielscher, M. (2007). Fluxplayer: A successful general game player. In Proceedings of Twenty Second AAAI Conference on Artificial Intelligence (p. 1191-

193 1196). Schiffel, S., & Thielscher, M. (2014). Representing and reasoning about the rules of general games with imperfect information. Journal of Artificial Intelligence Research, 49, 171- 206. doi: doi: doi:10.1613/jair.4115 Schofield, M., Cerexhe, T., & Thielscher, M. (2012). HyperPlay: A solution to general game playing with imperfect information. In Proceedings of Twenty Seventh AAAI Conference on Artificial Intelligence (p. 1606-1612). Schofield, M., Cerexhe, T., & Thielscher, M. (2013). Lifting hyperplay for general game playing to incomplete-information models. In Proceedings of GIGA 2013 Workshop on General Game Playing (p. 39-45). Schofield, M., & Saffidine, A. (2013). High speed forward chaining for general game playing. In Proceedings of GIGA 2013 Workshop on General Game Playing (p. 31-38). Schofield, M., & Thielscher, M. (2015). Lifting hyperplay for general game playing to incomplete-information models. In Proceedings of Twenty Ninth AAAI Conference on Artificial Intelligence (p. 3585-3591). Schofield, M., & Thielscher, M. (2016a). General game playing with incomplete information. Submitted to Journal of Artificial Intelligence Research. Schofield, M., & Thielscher, M. (2016b). The scalability of the hyperplay technique for imperfect-information games. In Proceedings of AAAI Workshop on Computer Poker and Imperfect Information Games. Schofield, M., & Thielscher, M. (2017). The efficiency of the hyperplay technique over random sampling. In Proceedings of Thirty First AAAI Conference on Artificial Intelligence. Silver, D., & Veness, J. (2010). Monte-Carlo planning in large POMDPs. In Advances in Neural Information Processing Systems (p. 2164-2172). Thielscher, M. (2010). A general game description language for incomplete information games. In Proceedings of Twenty Fourth AAAI Conference on Artificial Intelligence (p. 994-999). Thielscher, M. (2011). The general game playing description language is universal. In IJCAI Proceedings - International Joint Conference on Artificial Intelligence (Vol. 22, p. 1107). Vernon, D. (2008). Cognitive vision: The case for embodied perception. Image and Vision Computing, 26(1), 127–140. Vernon, D., Metta, G., & Sandini, G. (2007). A survey of artificial cognitive systems: Implica- tions for the autonomous development of mental capabilities in computational agents. IEEE Transactions on Evolutionary Computation, 11(2), 151.

194 Wisser, F. (2015, July). An expert-level card playing agent based on a variant of perfect information Monte Carlo sampling. In IJCAI Proceedings - International Joint Conference on Artificial Intelligence (p. 125-131). Buenos Aires: AAAI Press. Yin, Z., Jiang, A. X., Johnson, M. P., Kiekintveld, C., Leyton-Brown, K., Sandholm, T., . . . Sullivan, J. P. (2012). Trusts: Scheduling randomized patrols for fare inspection in transit systems. In Proceedings of Conference on Innovative Applications of Artificial Intelligence.
