Simulation-Based Approach to General Game Playing
Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (2008)

Simulation-Based Approach to General Game Playing

Hilmar Finnsson and Yngvi Björnsson
School of Computer Science, Reykjavík University, Iceland
{hif,yngvi}@ru.is

Abstract

The aim of General Game Playing (GGP) is to create intelligent agents that automatically learn how to play many different games at an expert level without any human intervention. The most successful GGP agents in the past have used traditional game-tree search combined with an automatically learned heuristic function for evaluating game states. In this paper we describe a GGP agent that instead uses a Monte Carlo/UCT simulation technique for action selection, an approach recently popularized in computer Go. Our GGP agent has proven its effectiveness by winning last year's AAAI GGP Competition. Furthermore, we introduce and empirically evaluate a new scheme for automatically learning search-control knowledge for guiding the simulation playouts, showing that it offers significant benefits for a variety of games.

Introduction

In General Game Playing (GGP) the goal is to create intelligent agents that can automatically learn how to skillfully play a wide variety of games, provided only the descriptions of the game rules. This requires that the agents learn diverse game-playing strategies without any game-specific knowledge being provided by their developers. A successful realization of this task poses interesting research challenges for artificial-intelligence sub-disciplines such as knowledge representation, agent-based reasoning, heuristic search, and machine learning.

The most successful GGP agents so far have been based on the traditional approach of using game-tree search augmented with an (automatically learned) heuristic evaluation function for encapsulating the domain-specific knowledge (Clune 2007; Schiffel & Thielscher 2007; Kuhlmann, Dresner, & Stone 2006). However, instead of using a set of carefully hand-crafted domain-specific features in their evaluation, as high-performance game-playing programs do, GGP programs typically rely on a small set of generic features (e.g. piece values and mobility) that apply to a wide range of games. The relative importance of the features is then automatically tuned in real time for the game at hand.

There is an inherent risk with that approach, though. In practice, because of how disparate the games and their playing strategies can be, the pre-chosen set of generic features may fail to capture some essential game properties. On top of that, the relative importance of the features can often be only roughly approximated because of strict online time constraints. Consequently, the resulting heuristic evaluations may become highly inaccurate and, in the worst case, even strive for the wrong objectives. Such heuristics are of little (or even detrimental) value for lookahead search.

Here we describe a simulation-based approach to general game playing that does not require any a priori domain knowledge. It is based on Monte-Carlo (MC) simulations and the Upper Confidence-bounds applied to Trees (UCT) algorithm for guiding the simulation playouts (Kocsis & Szepesvári 2006), thus bypassing the need for a heuristic evaluation function. Our GGP agent, CADIAPLAYER, uses such an approach to reason about its actions and has already proven its effectiveness by winning last year's annual GGP competition. The UCT algorithm has recently been used successfully in computer Go programs, dramatically increasing their playing strength (Gelly et al. 2006; Coulom 2006). However, there are additional challenges in applying it to GGP; for example, in Go pre-defined domain knowledge can be used to guide the playout phase, whereas such knowledge must be automatically discovered in GGP.

The main contributions of the paper are as follows. We: (1) describe the design of a state-of-the-art GGP agent and establish the usefulness of simulation-based search approaches in GGP, in particular when used in combination with UCT; (2) empirically evaluate different simulation-based approaches on a wide variety of games; and finally (3) introduce a domain-independent enhancement for automatically learning search-control domain knowledge for guiding simulation playouts. This enhancement provides significant benefits on all games we tried, in the best case resulting in a 90% winning ratio against a standard UCT player.

The paper is structured as follows. In the next section we give a brief overview of GGP, followed by a review of our GGP agent. Thereafter we detail the simulation-based search approach used by the agent and highlight GGP-specific enhancements, including the new technique for improving the playout phase in a domain-independent manner. Finally, we present empirical results and conclude.

Copyright © 2008, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

General Game Playing

Artificial Intelligence (AI) researchers have for decades worked on building game-playing systems capable of matching wits with the strongest humans in the world. The success of such systems has largely been because of improved search algorithms and years of relentless knowledge-engineering effort on behalf of the program developers, manually adding game-specific knowledge to their programs. The long-term aim of GGP is to take that approach to the next level with intelligent agents that can automatically learn to play skillfully without such human intervention.

The GGP Competition (Genesereth, Love, & Pell 2005) was founded as an initiative to facilitate further research into this area. Game-playing agents connect to a game server that conducts the matches. Each match uses two separate time controls: a start-clock and a play-clock. The former is the time the agent gets to analyze the game description until play starts, and the latter is the time the agent has for deliberating over each move decision. The server oversees play, relays moves, keeps time, and scores the game outcome.

Game descriptions are specified in a Game Description Language (GDL) (Love, Hinrichs, & Genesereth 2006), a specialization of KIF (Genesereth & Fikes 1992), a first-order logic-based language for describing and communicating knowledge. It is a variant of Datalog that allows function constants, negation, and recursion (in a restricted form). The expressiveness of GDL allows a large range of deterministic, perfect-information, simultaneous-move games to be described, with any number of adversary or cooperating players. Turn-based games are modeled by having the players who do not have a turn return a no-operation move. A GDL game description specifies the initial game state, as well as rules for detecting and scoring terminal states and for generating and playing legal moves. A game state is defined by the set of propositions that are true in that state.

CadiaPlayer

An agent competing in the GGP competition requires at least three components: an HTTP server to interact with the GGP game server, the ability to reason using GDL, and the AI for strategically playing the games presented to it. In CADIAPLAYER the HTTP server is an external process, whereas the other two components are integrated into one game-playing engine. The HTTP server process is always kept running, monitoring the incoming port. It spawns an instance of the game-playing engine for each new game description it receives from the GGP game server.

[...] are the search algorithms used for making action decisions. For single-agent games the engine uses an improved variation of the Memory Enhanced IDA* (Reinefeld & Marsland 1994) search algorithm. The search starts immediately during the start-clock. If successful in finding at least a partial solution (i.e. a goal with a higher than 0-point reward), it continues to use the algorithm on the play-clock, looking for improved solutions. However, if unsuccessful, the engine falls back on using the UCT algorithm on the play-clock. The single-agent search module is still somewhat rudimentary in our agent (e.g. the lack of a heuristic function), although it ensures that the agent can play at least the somewhat less complex puzzles (and sometimes even optimally).

Apart from the single-agent case, all action decisions are made using UCT/MC simulation searches. The simulation approach applies to both two- and multi-player games, whether they are adversary or cooperative. In two-player games the agent can be set up to either maximize the difference in the players' scores (games are not necessarily zero-sum), or maximize its own score but with tie-breaking towards minimizing the opponent's score. In multi-player games the agent considers only its own score, ignoring those of the other players. This sacrifices the possibility of using elaborate opponent-modeling strategies, but is done for simplification purposes.

A detailed discussion of the architecture and implementation of CADIAPLAYER is found in (Finnsson 2007).

Search

The UCT algorithm (Kocsis & Szepesvári 2006) is the core component of our agent's action-selection scheme. It is a variation of the Upper Confidence Bounds algorithm (UCB1) (Auer, Cesa-Bianchi, & Fischer 2002) applied to trees, and offers a simple, yet sound and effective, way to balance exploration and exploitation.

The UCT algorithm gradually builds a game tree in memory where it keeps track of the average return of each state-action pair played, Q(s, a). During a simulation, when still within the tree, it selects the action to explore by:

    a* = argmax_{a in A(s)} { Q(s, a) + C * sqrt( ln N(s) / N(s, a) ) }

The Q function is the action-value function as in an MC algorithm, but the novelty of UCT is the second term, the so-called UCT bonus. The N function returns the number of visits to a state or the number of times a certain action has