JOURNAL OF COMPUTATIONAL INTELLIGENCE AND AI IN GAMES, VOL. ?, NO. ?, SUBMITTED OCTOBER 2010

Fast Approximate Max-n Monte-Carlo Tree Search for Ms Pac-Man

Spyridon Samothrakis, David Robles, and Simon Lucas, Senior Member, IEEE

Abstract—We present an application of Monte-Carlo Tree release in Japan, the game became immensely popular, rapidly Search (MCTS) for the game of Ms Pac-Man. Contrary to achieving cult status, and various other versions followed. most applications of MCTS to date, Ms Pac-Man requires The best known of these is Ms Pac-Man, released in 1981, almost real-time decision making and does not have a natural end state. We approached the problem by performing Monte- which many see as being a significant improvement over the Carlo tree searches on a 5 player maxn tree representation original. Ms Pac-Man introduced a female character, new maze of the game with limited tree search depth. We performed a designs and several gameplay changes over the original game. number of experiments using both the MCTS game agents (for Probably the most important change was the introduction of pacman and ghosts) and agents used in previous work (for an element of randomness to the ghosts’ behaviour which ghosts). Performance-wise, our approach gets excellent scores, outperforming previous non-MCTS opponent approaches to the eliminates the effectiveness of exploiting set patterns or routes, game by up to two orders of magnitude. making Ms Pac-Man much more of a game of short-term planning and reactive skill than memorisation. While it is Index Terms—pac-man, MCTS, max-n, Monte-Carlo no longer the newest or most advanced example of game development, Pac-Man style games still provide a platform that I.INTRODUCTION is both simple enough for AI research and complex enough With the recent success of Monte-Carlo Tree Search to require intelligent strategies for successful gameplay. The (MCTS) in Computer Go [1], [2] and General Game Playing current Ms Pac-Man’s world record is held by Abdner Ashman (GGP) [3], there has been a growing interest in applying with 921, 360 points. To put this score in perspective, more these techniques to other board-like strategy games such as than 130 stages were completed in an almost perfect fashion1. Hex [4]. However, there has been very limited research in On the other hand, the AI agent who holds the world record the application of MCTS in real time video games. In this is ICE Pambush [5], who won the competition held in the paper we investigate the use of Monte-Carlo Tree Search in Ms Pac-Man Competition 2009 IEEE Symposium on Com- the game Ms Pac-Man, a predator/prey-like game popularised putational Intelligence and Games with a maximum score of in the 80s. From an AI research perspective, there are many 30, 010. There is an enormous difference between the human aspects of the game worthy of study, but in this particular world record and the agent world record, which indicates that paper we are mostly focused on the performance of MCTS in computers are significantly behind the human players and there a real time game setting. The rest of the paper is organised is much room for improvement with AI techniques. Note that as follows: in the next section (section II), we discuss the this score of 30, 010 was achieved in the original Ms Pac- game of Ms Pac-Man and the previous research on this game; Man, using a screen-capture software agent; this will be further in Section III, we describe our Ms Pac-Man simulator and discussed in the next section. 
the experimental setup done in order to perform simulations Since the source code of the original Ms Pac-Man game is on the game; in section IV we discuss MCTS in more depth in Z80 assembly language, most of the research done in Pac- together with some of its previous applications; in section V Man style games is done with simulators. These vary in how we present the methodology and rationale used in order to closely they resemble the original game, both functionally and create the current set of agents. In section VI we perform four cosmetically, but all the ones that do not use the original as- sets of experiments and present the results. Finally, section sembly code have some significant differences when compared VII summarises the contributions of this research, and gives to the original. For this reason, it is hard to make comparisons directions for future developments. Throughout the text as a between agents created by other researchers, who use their convention, we use the word “pac-man” to refer to the pac- own simulators. man agent, while “Ms Pac-Man” and “Pac-Man” (capitalised first letter) refer to the actual game. A. Ms Pac-Man Gameplay The player starts in maze A with three lives, and a single II.MS PAC-MAN extra life is awarded at 10, 000 points. There are four mazes in Pac-Man is a classic arcade game originally developed by the game (A, B, C and D), and each one contains a different for the Company in 1980. Since its layout with pills and power pills placed on specific nodes. The goal of the player is to obtain the highest possible score by Spyridon Samothrakis, David Robles and Simon M. Lucas are with the School of Computer Science and Electronic Engineering, University of eating all the pills and power pills in the maze and continue Essex, Colchester CO4 3SQ, United Kingdom. (emails: [email protected], [email protected], [email protected]). 1http://uk.gamespot.com/arcade/action/mspac-man/news.html?sid=6130815 JOURNAL OF COMPUTATIONAL INTELLIGENCE AND AI IN GAMES, VOL. ?, NO. ?, SUBMITTED OCTOBER 2010 2 to the next stage. The difficulty of clearing the maze comes the player, using a very simple Pac-Man implementation. They from the ghosts. There are four different ghosts in Ms Pac- used a neural network and temporal difference learning (TDL) Man: Blinky (red), Pinky (pink), Inky (cyan) and Sue (yellow). in a 10 x 10 centred window, but using simple mazes with only At the start of each level the ghosts start in their lair in the one ghost and no power pills. Using complex learning tasks, middle and spend some idle time before exiting the lair to they showed basic ghost avoidance. chase the pac-man. The time spent in the lair before joining Gallagher and Ryan [9] used a Pac-Man agent based on the chase varies with the level, with the ghosts leaving the a simple finite-state machine model with a set of rules lair sooner on the harder levels of the game. Each time the to control the movement of Pac-Man. The rules contained pac-man is eaten by a ghost, a life is taken away and pac-man weight parameters which were evolved using the Population- and the ghosts return to their original positions on that maze. Based Incremental Learning (PBIL) algorithm. They ran a There are four power pills in each of the four mazes, which simplified version of Pac-Man with only one ghost and no when eaten, reverse the direction of the ghosts and turn them power pills, which reduces the scoring opportunities in the blue, which mean they can be eaten for extra points. 
The score game and removes most of the complexity. However this for eating each ghost in succession immediately after a power implementation was able to achieve machine learning at a pill (which is worth 50 points) starts at 200 and doubles each minimum level. Gallagher and Ledwich [10] described an time. So, an optimally consumed power pill is worth 3050 (= approach to developing Pac-Man playing agents that learn 50 + 200 + 400 + 800 + 1600). Note that if a second power pill game play based on minimal screen information. The agents is consumed while some ghosts remain edible from the first were based on evolving neural network controllers using power pill consumption, then the ghost score is reset to 200. a simple evolutionary algorithm. Their results showed that The more difficult the level, the shorter that time becomes, neuro-evolution is able to produce agents that display novice until the ghosts do not turn blue at all (but they do still change playing ability with no knowledge of the rules of the game direction). When the ghosts are flashing, they are about to and a minimally informative fitness function. change from blue state back to their normal state, so the player Szita and Lorincz [7] proposed a different approach to (or agent) should be careful in these situations to avoid losing playing Ms Pac-Man. The aim was to develop a simple rule- lives. Another source of points are the extra fruits and prizes based policy, where rules are organised into action modules that appear and bounce around the maze. These bonus items and a decision about which direction to move in is made based include cherries, strawberries, peaches, pretzels, apples, pears on priorities assigned to the modules in the agent. Policies and bananas, and their values increase with increasing levels are built using the cross-entropy optimisation algorithm. The of the game. best-performing agent obtained a score average of 8,186, As previously mentioned, when all the pills and power pills comparable to the performance of a set of five human subjects are cleared in a maze, the next maze appears, though with played on the same version of the game, who averaged 8,064 increased speed and difficulty. Also, ghosts travel at half speed points. More recently, Wirth and Gallagher [11] used an agent through the side escape tunnels which allows pac-man to gain based on an influence map model to play Ms Pac-Man. The some breathing space, at the risk of being trapped in the tunnel. model captures the essentials of the game and showed that the Another detail of the first stages is that pac-man moves faster parameters of the model interact in a fairly simple way and that when not eating pills than when she is. She is also capable of it is reasonably straight-forward to optimise the parameters to turning around corners faster than the ghosts, so should make maximise game playing performance. The performance of the as many turns as possible when the ghosts are on her tail. optimised agents averaged a score of just under 7,000. The work reported in the papers listed above used nearly as many different simulators as there are papers listed. Unfor- B. Previous Research tunately this makes a meaningful comparison of the abilities In recent years, Pac-Man style games have received some of the agents extremely difficult — it is just possible to get a attention in Computational Intelligence research. 
Previous vague idea of the relative playing strengths of these approaches work in Pac-Man has been carried out in different versions of but no more than that. the game (i.e. Pac-Man and Ms Pac-Man simulators), hence For this reason one of the authors (Lucas) has developed a an exact comparison is not possible, but we can have a screen-capture based competition that uses the original rom- general idea of the performances. Most of these works can be code of the Ms Pac-Man game running on an emulator. The divided in two areas: agents that use CI techniques partially or competition has been run at several conferences over the completely, and controllers with hand-coded approaches. We last few years since 2008. Entrants submit a software agent will briefly discuss the most relevant works in both areas. that plays the game by capturing the screen many times per One of the earliest studies with Pac-Man was conducted second, processing the captured pixel map to identify the main by Koza [6] to investigate the effectiveness of genetic pro- game objects (pac-man, ghosts, pills etc.) and then generating gramming for task prioritisation. This work utilised a modified keyboard-events which are sent to the game window. This version of the game, using different score values for the items allows a standard way to measure the performance of the and also a different maze. According to Szita and Lorincz [7], software agents, and has the benefit of testing performance on the only score reported on Koza’s implementation would have the original game rather than some modified (and usually sig- been around 5,000 points in their Pac-Man version. Bonet and nificantly simplified) clone. However, there are a few caveats Stauffer [8] proposed a reinforcement learning technique for to this. The performance of the agent can be significantly JOURNAL OF COMPUTATIONAL INTELLIGENCE AND AI IN GAMES, VOL. ?, NO. ?, SUBMITTED OCTOBER 2010 3 impaired by a poor screen-capture and processing kit, and Lucas [15] used genetic programming (GP) to evolve a wide while a specific kit could be specified, none of the existing variety of pacman agents. A diverse set of behaviours were ones seem to be good enough yet to standardise on. Therefore, evolved using the same GP setup in three different versions of this is currently left open as a choice to the entrants in order the game: 1) single level, 2) four mazes, and 3) unlimited to promote the development of better kits. For this reason we levels. All of them using only one life and the same GP have not yet tested our agent in screen-capture mode, but parameters. The function set was created based on Koza [6], this is a priority for future work; to be done properly we including operations such as IsEdible, IsInDanger, IsEner- are developing a better screen-capture kit that will enable our gizersCleared, etc. On the other hand, the terminal set was MCTS agent to show its true ability. Hence, for the moment divided in two groups: data-terminals and action terminals. we base our result on a simulator developed by the authors and Most of the data terminals returned the current distance of freely available for download for the Agent versus Ghost Team a component from the agent, such as DISPill, DISEnergizer, Ms Pac-Man competition2 (originally developed by Lucas [12] DISGhost, etc. As for the action-terminals, each one moved and then significantly enhanced). Pac-Man one step toward the target, e.g., ToEnergizer, ToPill, The papers described next all use some version of this and ToEdibleGhost. 
The results showed that GP was able to simulator, using the same ghost algorithms (known collectively evolve controllers that are well-matched to the game used as LegacyTeam) unless otherwise stated. The earlier papers for evolution and, in some cases, also generalise well to used a single maze (the first maze) version of the game, with previously unseen mazes. The average score in the experiments the same edible time for each level (hence the game consists was approximately 10, 000 points, with a maximum score of of endless repetitions of the first level until all the lives are approximately 20,000. lost). The later papers (as explained) have also used the full four original mazes, and in some cases reducing the edible III.MS PAC-MAN SIMULATOR time, or (as in the current paper) using no edible time at all. The Ms Pac-Man simulator used for this research is a The simulator is described in some detail in the next section. re-factored and extended version of the one used in Lucas Lucas [12] proposed evolving neural networks to evaluate [12], which is written in object-oriented style in Java with a the moves. He used Ms Pac-Man because the ghosts behave reasonably clean and simple implementation. Compared to the with pseudo-random movement, and that eliminates the possi- original several changes and improvements have been made bility of playing the game using path-following patterns. This in order to perform Monte-Carlo Tree Search and also to work utilised a handcrafted input feature vector consisting of enable much faster path evaluation. This current version of the distances from pac-man to each non-edible ghost, to each the simulator is a more accurate approximation of the original edible ghost, to the nearest pill, to the nearest power pill and game, not only at the functional but also at the cosmetic to the nearest junction. His implementation was able to score level, and includes the four original mazes. Figure 1 shows an average of 4, 780 points over 100 games. a screen shot of each level in action. Nevertheless, there are Robles and Lucas [13] proposed perhaps the first attempt still important differences with respect to the original game: to apply tree search to Ms Pac-Man. The approach taken was to expand a route-tree based on possible moves that the pac- • The speed of our pac-man and the ghosts are identical, man agent can take to depth 40, and evaluated which path and pac-man does not slow down to eat pills. was best using hand-coded heuristics. On their simulator of • Our pac-man cannot cut corners, and so has no speed the game the agent achieved a high score of 40, 000, but only advantage over the ghosts when turning a corner. around 15, 000 on the original game using a screen-capture • Our ghosts do not slow down in the tunnels. interface. However, it should be emphasised that many of the • Bonus fruits are not present, as their inclusion plays a above quoted scores have been obtained on the different Pac- relatively minor contribution to the score of the game (at Man simulators, and therefore only provide the very roughest least at lower levels). idea of relative performance. • No additional life at 10,000 points. This would have been Burrow and Lucas [14] used a previous version of the easy to implement, but has little bearing on MCTS-based simulator to analyse the learning behaviours of TDL and strategies. Evolutionary Algorithms, with a simple TD(0) and a standard • Perhaps the most significant of all: the ghost behaviours ES(15 + 15) respectively. 
Two features were used as input are not the same as in the original game. We developed to the function approximators (interpolated table and MLP) to various ghosts agent controllers with different strategies give an estimate of the value of moving to the candidate node: to use in the simulations, however, they are only an 1) the distance from a candidate node to the nearest escape approximation to the original behaviours. node, and 2) the distance from a candidate node to the nearest The mazes of the game are modelled as graphs of connected pill along the shortest maze path. Their experiments showed nodes. Each node has two, three or four neighbouring nodes that under that experimental configuration evolution performed depending on whether it is in a corridor, L-turn, T-junction significantly better than TDL, although other reward structures or a crossroads. After the mazes have been created, a simple or features were not tested. efficient algorithm is run to compute the shortest-path distance Also using a previous version of the simulator, Alhejali and between every node and every other node in the mazes. These distances are stored in a look-up-table, and allow fast 2http://csee.essex.ac.uk/staff/sml/pacman/kit/AgentVersusGhosts.html computation of the various controller-algorithm input features. JOURNAL OF COMPUTATIONAL INTELLIGENCE AND AI IN GAMES, VOL. ?, NO. ?, SUBMITTED OCTOBER 2010 4
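To illustrate the kind of pre-computation described above, the sketch below fills such a look-up table with one breadth-first search per node of the maze graph. It is only an illustration under assumed names (the MazeNode abstraction and neighbourIndices() are not part of the simulator's actual API):

import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

// Illustrative sketch only: MazeNode and neighbourIndices() are assumed types,
// not the actual classes of the Ms Pac-Man simulator.
class ShortestPathTable {

    /** Hypothetical minimal node abstraction assumed by this sketch. */
    interface MazeNode {
        int[] neighbourIndices(); // indices of the 2-4 neighbouring nodes
    }

    private final int[][] dist; // dist[a][b] = shortest-path distance in maze steps

    ShortestPathTable(List<? extends MazeNode> nodes) {
        int n = nodes.size();
        dist = new int[n][n];
        // One breadth-first search per source node fills one row of the table.
        for (int src = 0; src < n; src++) {
            int[] row = dist[src];
            Arrays.fill(row, Integer.MAX_VALUE);
            row[src] = 0;
            Queue<Integer> queue = new ArrayDeque<>();
            queue.add(src);
            while (!queue.isEmpty()) {
                int current = queue.poll();
                for (int next : nodes.get(current).neighbourIndices()) {
                    if (row[next] == Integer.MAX_VALUE) { // not visited yet
                        row[next] = row[current] + 1;
                        queue.add(next);
                    }
                }
            }
        }
    }

    /** Constant-time distance query, as used by the controller input features. */
    int distance(int a, int b) {
        return dist[a][b];
    }
}

With such a table in memory, every distance query used by the input features becomes a constant-time array access.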

Fig. 1. Snapshots of the four mazes in the Ms Pac-Man Simulator.

Each maze is played twice consecutively, starting in Maze A up to Maze D. When Maze D is cleared, the game goes back to Maze A and continues the same sequence, i.e., (A, A, B, B, C, C, D, D, A, A, ...) until game over.

The new design of the simulator allows us to implement AI controllers for the pac-man and the team of ghosts as well, by implementing their controller interfaces: AgentInterface (Listing 1) and GhostTeamController (Listing 2) respectively. Each interface has a single method which takes the game state as input and returns the desired movement directions.

Listing 1. Agent Interface
public interface AgentInterface {
    int action(GameStateInterface gs);
}

Listing 2. Ghosts Controller
public interface GhostTeamController {
    public int[] getActions(GameStateInterface gs);
}

During the game, the model of the game will send the GameState to both controllers, and they must return the appropriate actions to take. In the case of the AgentInterface (pac-man) it will return the next direction to take, whereas the Ghosts Controller must return an array with the desired directions to move each ghost, and they will be executed as long as they are legal moves for the ghosts. The pac-man is allowed to move in any direction along a corridor at any given time, while the ghosts cannot reverse unless the model of the game determines a ghost reversal. For example, in Figure 3 the state of the game is at time step 25, s_25, in which the action sets for pac-man and Blinky (Blinky is the ghost in the bottom left of the maze in Figure 3) are:

A_P(s_25) = {north, south},  A_B(s_25) = {north}

If one of the agents chooses an action that is not part of its respective set of actions, A_p(s_25), the simulator will take the default behaviour of the agent.

The GameStateInterface (Listing 3) provides an approximation to a Markov state signal that summarises everything important about the preceding moves that contributed to the current state. Some of the information about the previous moves is lost, but most of what really matters for decision-making is retained. Apart from the information about the current state of the game, the GameStateInterface also provides methods that ease the use of Monte-Carlo simulations: copy() and next(). The method copy() returns a copy of the current state of the game. This is helpful while doing game-tree search, since it is necessary to keep copies of the game state in every node. The next() method takes the current game state, s_t ∈ S, where S is the set of possible states, and receives a set of actions to take for each of the five agents (pac-man and the ghosts), a_{p,t} ∈ A_p(s_t), where A_p(s_t) is the set of actions available in state s_t for agent p. One time step later, as a consequence of their actions, the agents find themselves in a new state, s_{t+1}.

Listing 3. Game State Interface that enables MCTS
public interface GameStateInterface {
    GameStateInterface copy();
    void next(int pacDir, int[] ghostDirs);
    Agent getPacman();
    MazeInterface getMaze();
    int getLevel();
    BitSet getPills();
    BitSet getPowers();
    GhostState[] getGhosts();
    int getScore();
    int getGameTick();
    int getEdibleGhostScore();
    int getLivesRemaining();
    boolean agentDeath();
    boolean terminal();
    void reset();
}

The simulator represents the game state in a compact way in order to allow efficient copying of the state, and efficient transmission to remote clients (enabling remote evaluation of ghost teams and agents). The game state contains only references to other immutable objects where possible. For example, the maze object never changes, but the positions and states of the agents change, and the state of the pills changes. Bit-sets are used to store whether each pill is present or has already been eaten, enabling the entire state of the game to be represented in a few tens of bytes.
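As a usage sketch of the interfaces above, the fragment below drives a single fixed-length roll-out through copy() and next(). It assumes the interfaces exactly as given in Listings 1-3; the RandomRollout class itself, the random pac-man controller and the 0-3 direction encoding are assumptions made for the example.

import java.util.Random;

// Sketch only: relies on the AgentInterface, GhostTeamController and
// GameStateInterface declarations of Listings 1-3; everything else is assumed.
class RandomRollout {
    private final Random rng = new Random();

    /** Trivial pac-man controller: a random direction each tick (0-3 encoding assumed). */
    private final AgentInterface randomAgent = gs -> rng.nextInt(4);

    /** Play a copy of the state forward for a fixed number of ticks and report the score gained. */
    int rollout(GameStateInterface state, GhostTeamController ghosts, int maxTicks) {
        GameStateInterface sim = state.copy();        // never mutate the real game state
        int startScore = sim.getScore();
        for (int t = 0; t < maxTicks && !sim.terminal(); t++) {
            int pacDir = randomAgent.action(sim);     // pac-man move for this tick
            int[] ghostDirs = ghosts.getActions(sim); // ghost team moves for this tick
            sim.next(pacDir, ghostDirs);              // advance the simulation one time step
        }
        return sim.getScore() - startScore;
    }
}

Because copy() duplicates the compact state, the roll-out never disturbs the real game, which is what the Monte-Carlo simulations described in Section IV rely on.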

IV. MONTE CARLO TREE SEARCH

A. MCTS and Monte-Carlo Tree Search

The idea behind Monte-Carlo algorithms in AI is that the approximation of future rewards (as they are understood in the Markov Decision Process (MDP) sense [16]) can be achieved through random sampling. What this effectively means is that the agent extrapolates to future states in a random fashion and moves to the state with the highest predicted reward. MCTS tries to rectify some of the issues that come with such an approach by combining it with a tree, effectively creating a stochastic form of best-first search. From a game theoretic perspective, the tree is a subtree of the game tree in extensive form [17].

Upper Confidence Bounds in Trees (UCT) (presented in Algorithm 1) can be seen as simply an implementation of MCTS where the "selection" part of the algorithm is provided by ideas borrowed from the study of multi-armed bandit problems [18]. In UCT, each node in the tree is seen as a multi-armed bandit. The goal of the search is to "push" more towards areas of the search space that seem more promising. Although there are many different versions of the algorithm, the one presented in [19] is quite commonly used³. The algorithm (see Figure 2) can be summarised as follows: starting from the root node, expand the tree by a single node. If the node is a leaf node, estimate its value by performing a roll-out and back-propagate the value to the node's ancestors in the tree. If the node is not a leaf node, keep exploring the tree until one is reached. The most commonly used back-propagation strategy is one that makes direct use of the underlying tree, e.g., for a minmax (negamax) tree, that would include subtracting or adding the result depending on who is the owner of each node in the ancestor list (for an example see Algorithms 1, 2, 3, taken from [19]).

Fig. 2. A sample MCTS search.

In case all possible nodes have been visited, a common way to distinguish which node to explore further is to assign a value to each node based on the Chernoff-Hoeffding bound. This leads to what is known as the UCB1 policy [18]: play arm j that maximises

\bar{x}_j + C \sqrt{\frac{\ln(n)}{n_j}}    (1)

The symbol \bar{x}_j in Equation 1 denotes the average reward from the underlying bandit, and it is the exploitation part of the algorithm. The second part of the above equation is the exploration part. C is a constant (often set to \sqrt{2}) [18], n is the sum of all trials and n_j is the number of trials for the j-th bandit. Finally, \sum_{j=0}^{j_{max}} n_j = n.

The equation to choose which arm to play (in the case of a tree search, which child to follow) can be heavily tuned depending on the underlying distribution. UCB1 (see Algorithm 2 for the pseudocode) should be seen as the "lowest common denominator" policy. If no information about the bandits is available to us, this is probably the correct policy to use. On the other hand, once we have collected enough information, more informed policies should be able to perform better.

Algorithm 1 playOneSequence(rootNode)
{The entry point of MCTS. It traverses through a tree in an asymmetric fashion. End state rewards are provided as an array (rewardsArray), as we can have more than two players.}
nodeArray[0] ← rootNode
i ← 0
repeat
    nodeArray[i+1] ← descendByUCB1(nodeArray[i])
    i ← i + 1
until nodeArray[i] is a terminal node
updateValue(nodeArray, nodeArray[i].rewardsArray)

Note that events in a tree are not independent, so algorithms that would naturally work for bandit problems need some adaptation when applied to tree search. Advantages of MCTS include the fact that it explores the tree asymmetrically and that it gives a natural way to handle uncertainty.

B. MCTS in Games

The popularity of MCTS stems primarily from the fact that it revolutionised Computer Go [19], to the point where computer players actually became competitive against human players (e.g. [20], [21], [22]). Its success led to widespread acclaim of Monte-Carlo methods, eventually reaching popular media [23]. Example implementations include CRAZYSTONE [24], MANGO [25], MOGO [19] and FUEGO [26].

³Although many leading MCTS Go programs now use very different node-selection formulas based on various heuristics.
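For concreteness, the UCB1 rule of Equation 1 can be written as the small selection routine below. This is a generic sketch of the standard formula rather than the code used for the experiments; the Node fields (value, visits, children) are assumed names.

// Generic UCB1 child selection (Equation 1); the Node fields are assumed names.
final class Ucb1 {
    static final double C = Math.sqrt(2.0); // exploration constant, sqrt(2) as in [18]

    static final class Node {
        double value;                        // sum of rewards back-propagated through this node
        int visits;                          // number of times this node has been tried (n_j)
        Node[] children = new Node[0];
    }

    /** Return the child maximising mean reward plus the UCB1 exploration bonus. */
    static Node select(Node parent) {
        int n = 0;                           // total trials over all children
        for (Node child : parent.children) n += child.visits;

        Node best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Node child : parent.children) {
            double score = (child.visits == 0)
                    ? Double.POSITIVE_INFINITY                       // always try unvisited children first
                    : child.value / child.visits                     // exploitation: average reward
                      + C * Math.sqrt(Math.log(n) / child.visits);   // exploration: C * sqrt(ln(n) / n_j)
            if (score > bestScore) {
                bestScore = score;
                best = child;
            }
        }
        return best;
    }
}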

Algorithm 2 descendByUCB1(node)
{We now descend through the tree according to the UCB1 policy, making this a UCT algorithm. This part of the algorithm can be modified to suit the particular problem (e.g. use UCB-TUNED).}
nb ← 0
for i ← 0 to node.childNode.size() - 1 do
    nb ← nb + node.childNode[i].nb
end for
for i ← 0 to node.childNode.size() - 1 do
    if node.childNode[i].nb = 0 OR MAX_DEPTH_REACHED then
        v[i] ← ∞
    else
        v[i] ← 1.0 - node.childNode[i].value / node.childNode[i].nb + sqrt(2 * log(nb) / node.childNode[i].nb)
    end if
end for
index ← argmax(v[j])
return node.childNode[index]

Algorithm 3 updateValue(nodeArray, rewardsArray)
{Finally, we can update the values of each node in the path between the newly added node and the root, stored on nodeArray (i.e. the ancestors of the newly added node). Note that each node has a corresponding rewardId. The vector of rewards should be provided by the game.}
for i ← nodeArray.size() - 2 to 0 do
    value ← rewardsArray[nodeArray[i].rewardId]
    nodeArray[i].value ← nodeArray[i].value + value
    nodeArray[i].nb ← nodeArray[i].nb + 1
end for

As a result, MCTS has already been applied to a large number of games [27], [28], [29]. For the most part, the non-Go papers failed to replicate the burgeoning success of MCTS in Go. The area that was identified for improvement [27] was mostly around the concept of doing good Monte-Carlo simulations. Being a best-first search, MCTS relies heavily on the quality of Monte-Carlo simulations, and its performance is greatly affected by them. For example, Gelly et al. [19] report a big boost compared to purely random simulations (from 1647 to 2200 ELO), which can be increased even further with further heuristics of a more general nature like RAVE [30]. Another big issue with the non-Go implementations of MCTS is the lack of comparison with the state of the art. As a consequence there is no way of understanding how well MCTS did compared to other methods.

MCTS is currently very successful in General Game Playing (GGP) [3], a domain that practically prohibits the use of strong heuristics. While General Game Playing in this sense is not fully general, since it refers to a subset of perfect and complete games, it nevertheless shows that MCTS has the potential of achieving good results in diverse domains.

Finally, there have been some developments for MCTS in real time video games, mainly from the perspective of real time strategy games (e.g. [31], [32], [33]). Most real time strategy games harbour an element of imperfection, which is, however, discarded in these studies; the results are nevertheless exceptionally strong.

V. METHODOLOGY

A. Applying MCTS

There has been no "standard" process for applying MCTS, so we are proposing an empirical four-step process. The first step is to understand the number of agents and the information content of each game and choose the right tree. For games of complete and perfect information (e.g. chess, Go), a min-max tree is commonly used. For games of incomplete but perfect information (e.g. backgammon), expectimax trees should be used. Finally, for games of imperfect and incomplete information (e.g. poker), miximax trees were recently introduced [34]. All the above cases apply naturally to 2 or 2.5 player games. In the case of N-player games, one can easily extend the algorithm to either maxn [35] or one of its possible variations. The basic principle behind these algorithms is that each player tries to maximise its payoffs independently from the rest.

The second step is understanding the underlying distribution of each arm and tuning the policy equation. This can be done in a number of ways, which can range from tuning Equation 1 to completely replacing it with something that captures the underlying probabilities better. In our case we use the algorithm UCB-TUNED [18] (see Equation 2), which seems to be fairly common in the literature, and initial short runs showed that it works better in our case than UCB1 (although we did perform some experiments with UCB1 (see Equation 1) for comparison purposes; see the experiments section).

\bar{x}_j + \sqrt{\frac{\ln(n)}{n_j} \min\left\{\frac{1}{4},\ \overline{x^2}_j - \bar{x}_j^2 + \sqrt{\frac{2\ln(n)}{n_j}}\right\}}    (2)

The third step is to come up with a back-propagation policy. Our strategy is the one used by default in the original Go MCTS implementation and presented in the previous section, with a maxn adaptation [35]. In this case, the end node provides the algorithm with n rewards. Each node in the tree has an associated rewardId and one of the rewards is added accordingly. To put it another way, each end node has an associated vector of rewards r with as many elements as agents in the game (in our case 5). Each node however has just one reward, based on which agent it belongs to.

The final step is to augment the algorithm with knowledge and/or "guide" the Monte-Carlo simulations. In Go this is achieved by using local patterns [19], which significantly improves the quality of the simulations. In our case a set of heuristics is created, and presented in the next subsection.

B. MCTS on Pac-Man: Problems

The first thing one should notice in Pac-Man is that the game (like most video games, and arguably life) does not easily converge into a winning final state for the pac-man player, which creates a problem for all kinds of tree search algorithms. This problem stems from two facts. On one hand, pac-man has an almost infinite number of back-and-forth moves it can do, with no ending apart from it dying. On the other hand, the game can keep going on forever, with pac-man progressing from maze to maze, with no final winning position. The second issue with the game is that it has a strong timing element; whatever move one has to perform at each state, it should be calculated in less than 50-60 ms. Although we do not adhere strictly to real-time solutions in this paper, as it is understood that further improvements in code quality and CPU speed can easily bring performance within bounds, nevertheless we tried to avoid timing settings that can never lead to real time game play.
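The UCB-TUNED rule of Equation 2 replaces the fixed exploration constant with a variance-aware term. The sketch below is a direct transcription of the standard formula of Auer et al. [18] rather than the authors' implementation; it assumes that each child additionally accumulates the sum of squared rewards.

// UCB1-TUNED value of a single child (Equation 2); the per-child statistics are assumed.
final class UcbTuned {
    /**
     * @param rewardSum   sum of rewards seen through this child
     * @param rewardSqSum sum of squared rewards seen through this child
     * @param childVisits n_j, number of times this child has been tried
     * @param totalVisits n, total number of trials at the parent
     */
    static double value(double rewardSum, double rewardSqSum, int childVisits, int totalVisits) {
        double mean = rewardSum / childVisits;                 // average reward of this child
        double meanOfSquares = rewardSqSum / childVisits;      // mean of the squared rewards
        double logTerm = Math.log(totalVisits) / childVisits;  // ln(n) / n_j
        // Variance-style term, capped at 1/4 (the largest possible variance of a reward in [0, 1]).
        double v = meanOfSquares - mean * mean + Math.sqrt(2.0 * logTerm);
        return mean + Math.sqrt(logTerm * Math.min(0.25, v));
    }
}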

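The maxn back-propagation step (Algorithm 3) can likewise be written against a concrete node type. The fragment below is a sketch under assumed names: each node stores the rewardId of the agent it belongs to, and the game supplies one reward per agent at the end of a simulation.

import java.util.List;

// Sketch of the max-n back-propagation of Algorithm 3; the Node type is assumed.
final class MaxNBackprop {

    static final class Node {
        int rewardId;   // which of the 5 agents (0 = pac-man, 1-4 = ghosts) this node belongs to
        double value;   // accumulated reward
        int nb;         // accumulated visit count
    }

    /** Add to every ancestor on the path the reward that belongs to its own agent. */
    static void updateValue(List<Node> nodeArray, double[] rewards) {
        // Walk from the parent of the newly added node back up to the root.
        for (int i = nodeArray.size() - 2; i >= 0; i--) {
            Node node = nodeArray.get(i);
            node.value += rewards[node.rewardId];
            node.nb += 1;
        }
    }
}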
C. Maxn approach to Ms Pac-Man

Ms Pac-Man is a real-time computer game where events in the game occur at the same rate as the events which are being depicted. For instance, one minute of play of the game depicts one minute of character movement. Despite its real-time simultaneous-move nature, the game has simple rules and can be transformed to a turn-based game to apply MCTS directly. There are many ways one can model Pac-Man for MCTS. In our approach, we chose to model Pac-Man as a 5-player game, and base the tree on maxn. Since Pac-Man is a simultaneous move game, at least theoretically speaking, none of the min-max like trees is really applicable, and one should be searching for mixed strategies (at least for the endgames). As seen recently by [36] however, in practice the strategic advantage of taking this into account is trivial and can be safely ignored.

In order to solve the problem of not having a natural end state, we artificially limit the search tree to a fixed depth. We also restrict pac-man's ability to move back and forth as it pleases within a single tree, making its behaviour similar to a ghost. This (implicitly) creates a number of paths that a search can easily evaluate. Just to clarify this last point, our MCTS pac-man agent is able to move back and forth at each time step (for example, while waiting to see which path a ghost will take), but when expanding each MCTS tree these step-by-step direction reversals are not allowed in order to restrict the growth of the tree.

In that respect, an end node can be either the natural end of the game (a ghost eats pac-man) or the end of a tree, with TREE_DEPTH = c. However, intuitively this would result in pac-man running around the maze, trying to avoid ghosts, since finding a natural end state (e.g. eating all the pills in the maze) is beyond the tree depth, at least at the beginning of the game. In order to avoid this scenario, we have created a function called gpn(m, p) (which stands for "game preferred node"). This function assigns the binary value of one to a certain node in the tree path, and leaves the rest of the tree to zero. That way we can set a target for the search (and we will see how we use this target in the next section).

Fig. 3. Example of path creation.

gpn(m, p) = \begin{cases} node_1, & \text{if } c(m, p) = 1 \\ node_2, & \text{if } c(m, p) = 2 \\ node_3, & \text{if } c(m, p) = 3 \end{cases}    (3)

c(m, p) = \begin{cases} 1, & \text{if } \exists g_i : Ed(g_i) \wedge dpg(m, p) < G \\ 2, & \text{if } \nexists g_i : Ed(g_i) \wedge dpn(m, p) < P \\ 3, & \text{otherwise} \end{cases}    (4)

node_1 = \arg\min_{n} d_2(p, g_i), \quad node_2 = \arg\min_{n_p} d_2(p, n_{xy}), \quad node_3 = \arg\min_{n} d_2(n_{xy}, node_2)    (5)

dpg(m, p) = d_2(\arg\min_{n} d_2(p, g_i), p)    (6)

dpn(m, p) = d_2(\arg\min_{n} d_2(p, n_{xy}), p)    (7)

In Equation 4, m is the current maze, p is the pac-man location, g_i is the i-th ghost, n is a node, d_2 is the pre-calculated shortest-path distance, n_p is one of the nodes that surround pac-man and P is the furthest distance MCTS can see given its maximum depth size (usually set to MCTS.TREE_DEPTH/5 - 1). The function Ed(g_i) returns whether a ghost is edible. Equations 6 and 7 specify the distance from the closest ghost and the distance from the closest node respectively. The function c(m, p) returns a different value depending on which of three specific cases applies. Case 1 is when there is an edible ghost in the map and the distance to the ghost is smaller than G = 20. Case 2 is when there are no edible ghosts and the distance of the closest pill/power is smaller than P. Case 3 is the catch-all for when there are no edible ghosts and the nearest pill is outside the search depth of the tree.
In each case, a corresponding node is marked as a

“target” node using Equation 3 (and consequently equation 5). VI.EXPERIMENTS In the first case the nearest ghost is the target, in the second We performed four sets of experiments. The first three case the nearest pill, while in the third case a node en route to experiments are meant to portray the idiosyncrasies of the the nearest pill. Note that case three exists purely because it is specific MCTS implementations and how heuristics, tree frequently the case that pac-man will find itself in a position search depth and the number of simulations affect our agents. where all the natural target nodes are beyond the search depth. The fourth and final experiment is meant to demonstrate This solves the problem by effectively making a node within the strength of the MCTS approach compared to previous the search tree a target node. approaches in this area using a version of the simulator used In the natural endgame scenarios where the end of the maze in previous published research [12], [14]. is reached or pac-man gets eaten, rewards are set up in an In each set of experiments a number of different individual obvious fashion. In the case where the maze is cleared pac- experiments are performed. At first we progressively increase man gets a reward of 1, while all ghosts get a reward of 0. In the number of simulations from the set {100, 200, 300, 400} case pac-man dies, the ghosts get a reward of 1, while pac- and vary the depth of the search tree from 100 to 450 in incre- man gets a reward of 0. In all other cases, the rewards for the ments of 50. We perform two experiments like this, each one ghosts are set proportionally to the inverse distance between with a different UCB function (UCB1 and UCB−TUNED). the ghost and pacman. Thus: We also perform a run where we vary the tree depth from 100 to 950, but this time we fix the time in milliseconds the ( agent has before he has to perform a move. We test for the 0.5/((d2(gi, p) + 1)), if pac-man not eaten r(gi) = (8) set of {20, 30, 40, 50, 60}ms. Here it is important to note that 1, if pac-man eaten it takes almost half a second4 to have an agent performing In Equation 8, first part, the rewards are set so that the closer 400 iterations in a tree of 400, making these kinds of setups a ghost is to pac-man the higher the reward is. This makes prohibitive for real time simulations. However, the results are minimal impact when the ghosts are close, where rewards are still of interest, as CPU speed is increasing by the day and primarily acquired through pac-man’s death, but it makes a a multi-threaded setup can easily bring the results within the difference when the ghosts are too far away and cannot reach bounds of real time game play. pac-man’s node within their tree search. In the first three experiments the edible time of the ghosts Now, for pac-man the reward function is: is set to 3 (almost non-existent). There are no random ghost reversals and the each agent has one life. This way we hope to make it as hard as possible for the pac-man agent, thus 1.0, if last pill in map is eaten  minimising the time for each experiment. In each test run, we 0.8, if preferred node gpn(m, p) is hit only run one game, while keeping fixed all the initial random r(p) = (9) 0.0, if pac-man dies seeds. This way we hope to make scores between different  0.6, otherwise runs comparable, as the agents effectively play the same game. 
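Read as code, Equations 3-7 amount to the target-selection routine sketched below. This is an illustrative re-statement of the three cases only, not the authors' implementation: the parameter names are assumptions, while the thresholds follow the text (G = 20 and the horizon P derived from the tree depth).

// Illustrative re-statement of the target-node cases of Equations 3-7.
// All parameter names are assumed; only the case structure follows the paper.
final class PreferredNode {
    static final int G = 20; // maximum distance at which an edible ghost becomes the target

    enum Target { NEAREST_EDIBLE_GHOST, NEAREST_PILL, NODE_TOWARDS_NEAREST_PILL }

    /**
     * @param treeDepth       MCTS tree depth for the current search
     * @param edibleGhostDist shortest-path distance to the nearest edible ghost (large value if none)
     * @param nearestPillDist shortest-path distance to the nearest pill or power pill
     * @param anyGhostEdible  whether at least one ghost is currently edible
     */
    static Target choose(int treeDepth, int edibleGhostDist, int nearestPillDist, boolean anyGhostEdible) {
        int p = treeDepth / 5 - 1;                    // horizon P: furthest node the 5-agent tree can reach
        if (anyGhostEdible && edibleGhostDist < G) {
            return Target.NEAREST_EDIBLE_GHOST;       // case 1: chase a nearby edible ghost
        }
        if (!anyGhostEdible && nearestPillDist < p) {
            return Target.NEAREST_PILL;               // case 2: the nearest pill is within the search horizon
        }
        return Target.NODE_TOWARDS_NEAREST_PILL;      // case 3: aim for a node en route to the nearest pill
    }
}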
An alternate (and better) approach would be to perform a Hitting the preferred node (Equation 9) means that in the number of runs for each depth-time/simulations combination. current path evaluated by MCTS, the preferred node was However, due to the high scores achieved by some of the accessed and it is part of the search. agents, acquiring descriptive statistics from multiple test runs The above reward equations are used both by the MCTS is prohibitive, as some agents run for days even with a single pac-man agent and the MCTS ghost team. Thus, in all cases life. the reward vector is: A. MCTS pac-man Vs LegacyTeam r = [r(p), r(g1), r(g2), r(g3), r(g4)] (10) For this experiment The Ghost team is set to the Lega- cyTeam. MCTS is fully aware of this, thus not wasting rollouts In the ghost team however, the first ghost does NOT follow in ghost moves that are not possible for this team. In practical the action proposed by MCTS, but rather follows the route terms, this means that for the first three ghosts whose be- that minimises distance between itself and the pac-man agent haviour is deterministic there is only one possible future move d (g , p)) (whatever their model dictates) while for the probabilistic ( 2 i ). n In almost every equation presented above there is some ad ghost we assume its behaviour is unknown, letting max hoc variable. For example pac-man receives a reward of 0.8 search for plausible scenarios. when a preferred node is hit. The values for these variables are In the first experiment with this setup, we play a set of a result of short trial and error experimentation and are part of games using UCB1. The best performance achieved by this the heuristics. There are some obvious criteria involved (e.g. a time is 70K at a tree depth of 450 (see Figure 4(a)). It can reward for a preferred node should be higher than the reward be observed from the experiments that there is no clear, clean for a non-prefered node - 0.8 Vs 0.6 in our case), however relationship between tree depth, number of simulations and there is no “hard” scientific justification for these values, as performance. An increased number of simulations with the there is not for the equations themselves; they are part of the 4All experiments were run on a Intel(R) Core(TM)2 Duo CPU E8600 @ chosen set of heuristics. 3.33GHz machine with java having 512MB of RAM. JOURNAL OF COMPUTATIONAL INTELLIGENCE AND AI IN GAMES, VOL. ?, NO. ?, SUBMITTED OCTOBER 2010 9
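To make the reward scheme of Equations 8-10 concrete, the sketch below assembles the five-element reward vector that updateValue() consumes through each node's rewardId. It paraphrases the equations under assumed method and field names rather than reproducing the authors' code.

// Paraphrase of the reward scheme of Equations 8-10; all names are assumptions for the sketch.
final class MaxNRewards {
    static final double PAC_CLEARED_MAZE = 1.0;   // last pill in the map eaten
    static final double PAC_HIT_PREFERRED = 0.8;  // preferred node gpn(m, p) reached on the evaluated path
    static final double PAC_DIED = 0.0;
    static final double PAC_DEFAULT = 0.6;

    /** Reward for pac-man at an end node, with the cases checked in the order listed in Equation 9. */
    static double pacmanReward(boolean mazeCleared, boolean preferredNodeHit, boolean pacmanDied) {
        if (mazeCleared) return PAC_CLEARED_MAZE;
        if (preferredNodeHit) return PAC_HIT_PREFERRED;
        if (pacmanDied) return PAC_DIED;
        return PAC_DEFAULT;
    }

    /** Reward for one ghost: 1 if pac-man was eaten, otherwise inversely proportional to its distance. */
    static double ghostReward(boolean pacmanEaten, int distanceToPacman) {
        return pacmanEaten ? 1.0 : 0.5 / (distanceToPacman + 1.0);
    }

    /** The reward vector r = [r(p), r(g1), ..., r(g4)] that back-propagation reads via each node's rewardId. */
    static double[] rewardVector(boolean mazeCleared, boolean preferredNodeHit,
                                 boolean pacmanEaten, int[] ghostDistances) {
        double[] r = new double[1 + ghostDistances.length];
        r[0] = pacmanReward(mazeCleared, preferredNodeHit, pacmanEaten);
        for (int i = 0; i < ghostDistances.length; i++) {
            r[i + 1] = ghostReward(pacmanEaten, ghostDistances[i]);
        }
        return r;
    }
}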

Fig. 4 panels: (a) UCB1, number of simulations vs tree search depth; (b) UCB-TUNED, number of simulations vs tree search depth; (c) UCB-TUNED, time spent per move in ms vs tree search depth.

Fig. 4. Scores of pac-man Vs known LegacyTeam (i.e. future states in the tree search take into account the fact that we know where the ghosts are going to move next same tree will not necessarily lead to better performance, In the first setup of this experiment, we see a noticeable contrary to what one would assume. There is however a clear drop in scores, almost an order of magnitude (see Figure 5(a)), and strong trend towards this direction, as is evident from the with a high score of 10K. The same trend continues with experiments. Our second run is the same as with UCB1, but the UCB − TUNED setup as well (Figure 5(b)). There is this time we switched to UCB−TUNED (Figure 4(b)). This a massive drop in score compared with the previous experi- effectively doubles the scores acquired by pac-man by almost mental setup ( Vs. Known team), but UCB − TUNED still two, to 160K. Again one can notice here that although tree outperforms UCB1 with a score of 12K. Finally, in the case depth does play a role in the quality of the results, this role where we fix time (Figure 5(c)), we achieve even better results is not clear cut, as there are “bumps” in the graph above. (18K). Note that this is a much tougher version of the game In the final setup for this experiment, we play a number than previous approaches using the same simulator due to the of games using a fixed amount of time, while we let the extremely low ghost edible time. number of evaluations vary (Figure 4(c)). The varying amount of simulations per move allows even better results, with our C. Experiment 3: MCTS pac-man Vs MCTS Ghosts best score being almost 700K. In this experiment we performed a test of MCTS ghosts Vs MCTS pac-man. Please note here, as explained earlier, that B. MCTS pac-man Vs Unknown Team (LegacyTeam) both pac-man and the ghost play using the same tree. The In this set of experiments, although we play against the MCTS score is evaluated once and followed for all agents, same team as in the previous experiment, the agent treats the with the exception of the first ghost, which blindly follows team as an unknown team. This means that the full maxn tree pac-man. is searched, which should make it much harder for our agent The results in all three setups (see Figures 6(a), 6(b), 6(c)))) to search for a good strategy. show pac-man being clearly overpowered by the ghost team, JOURNAL OF COMPUTATIONAL INTELLIGENCE AND AI IN GAMES, VOL. ?, NO. ?, SUBMITTED OCTOBER 2010 10

Fig. 5 panels: (a) UCB1, number of simulations vs tree search depth; (b) UCB-TUNED, number of simulations vs tree search depth; (c) UCB-TUNED, time spent per move in ms vs tree search depth.

Fig. 5. Scores of pac-man Vs unknown LegacyTeam (i.e. when MCTS is not explicitly aware of which team it is performing against) with very low high scores in all cases (a max of 6K). The We performed three tests; one with a known model, one reason for the not-so-clear curve in pacman’s performance is an unknown ghost-team model, and non-real time one with that the number of iterations changes both for pacman and an unknown ghost model. Since this experiment is the only the ghosts. Thus, a stronger pacman would play with stronger one comparable to previously reported experiments with the ghosts, which would result in similar results no matter the simulator, which makes result robustness important, for each number of iterations (taking into account some variance as agent we performed 100 runs. Results are presented in tables well). If one is to draw a conclusion here, one can say that I and II. the MCTS ghosts can easily overpower pac-man, at least in TABLE I the case of minimum edible time. Admittedly, one can vary DESCRIPTIVE STATISTICS FOR THE TWO PLAYERS.KM ISTHEKNOWN asymmetrically the number of iterations of the ghosts in order MODEL PLAYER AND UM ISTHEUNKNOWNMODELPLAYER.UM-NR to balance power. This could be used in order to create weaker AND UM-R SHOW RESULTS FOR THE UNKNOWN MODEL PLAYER FOR THE REAL-TIMEANDTHENON-REALTIMECASERESPECTIVELY.S STATISTICS “default” MCTS ghost teams for the game, which would AREFORTHENUMBEROFROLLOUTSWHILE T STATISTICS ARE FOR TIME provide an interesting direction if one is interested in creating PERMOVE.TD STANDS FOR TREE DEPTH. progressively harder ghost teams to play against, adding to the fun element of the game. MeanS MinS MaxS MeanT MaxT TD KM 300 300 300 45 222 300 UM-NR 300 300 300 74 398 300 D. Experiment 4: MCTS pac-man Vs Typical LegacyTeam UM-R 150 150 300 35 182 300 Setup / High Score In this experiment we aim purely to show the strength of The first thing worth noting here is the very high score, our agent against previous published work using the same slightly above 2.8M for the known model agent. The best simulator[12], [14]. Thus, in this experiment we only use human player has achieved a score of 933,580 [37], however the first stage and we set the ghost edible time to 100. this is not directly comparable to our version, as the ghost JOURNAL OF COMPUTATIONAL INTELLIGENCE AND AI IN GAMES, VOL. ?, NO. ?, SUBMITTED OCTOBER 2010 11

Fig. 6 panels: (a) UCB1, number of simulations vs tree search depth; (b) UCB-TUNED, number of simulations vs tree search depth; (c) UCB-TUNED, time spent per move in ms vs tree search depth.

Fig. 6. Scores of pac-man Vs MCTS team

TABLE II SCORES FOR EACH PLAYER, SEE FIGURE I FORTHETHREE behaviour, in the unknown model case, all ghost actions have ABBREVIATIONS (KM,UM-NR,UM).THE STATISTICS ARE BASED ON to be guessed. This creates trees which have a higher branching 100 GAMES PLAYED FOR EACH ROW OF THE TABLE. factor, which in turn requires more search time. Finally, the Mean Standard Error SD Min Max huge variability observed is explained by the fact that the agent KM 546,260 55,980 559,802 4,500 2,807,700 performs simulations not in the real version of the game, but UM-NR 81,979 8,909 89,090 1,080 409,730 an abstract one. At some point, inconsistencies in the abstract UM-R 45,389 4,041 40,407 1,400 207,800 model can easily lead to a premature death. In the real version of the game, one has at least three lives, so one has greater opportunity to amass a high score, and also be less affected by behaviours are somewhat different. On the other hand, this fatal errors (which cause the loss of a single life rather than shows that one can aim for such high scores in the real the end of the entire game). game as well, using similar approaches to ours, but this is out of the scope of this paper. For the record, the highest VII.DISCUSSION &CONCLUSION scoring Pac-man player for the real game (using a screen- capture kit for interfacing the agent to the game) we are aware We have shown that MCTS can successfully be used in a of achieved a maximum score of 44, 910 [38], and is also real time game, getting results that are almost two orders of based on Monte-Carlo Tree Search5. In the “unknown model” magnitude better than previous results in the same simulator player tests, we can see that high scores are also achieved (at acquired by evolutionary, reinforcement learning and genetic least 5 times previously reported results in the non-real time programming methods. In order to make a critical appraisal, case). This can easily be explained by understanding that while we first need to concentrate on one simple fact: MCTS in the known model case, only one ghost has unpredictable exploits the model of the game and (in the non-planning tree scenario) models our opponents as being perfect players. This 5The paper [38] is currently only available in Japanese. ruthless exploitation of the forward model gives the agent a JOURNAL OF COMPUTATIONAL INTELLIGENCE AND AI IN GAMES, VOL. ?, NO. ?, SUBMITTED OCTOBER 2010 12 characteristic strength that static/reactive evaluation heuristics [5] H. Matsumoto, T. Ashida, Y. Ozasa, T. Maruyama, and R. Thawonmas, cannot possibly match. What is important to note here is that it “Ice pambush 3,” Ritsumeikan University, Tech. Rep., 2009. [6] J. R. Koza, Genetic Programming: on the Programming of Computers is not the senses that guide our agents behaviour, but rather an by Means of Natural Selection. MIT Press, 1992. internalised model. Senses are practically only used in order [7] I. Szita and A. Lorincz, “Learning to play using low-complexity rule- to infer the current state, but do not influence behaviour in based policies,” Journal of Artificial Intelligence Research, vol. 30, no. 1, pp. 659–684, December 2007. any other form. On top of all that, the plan the agent is to [8] J. S. D. Bonet and C. P. Stauffer, “Learning to follow is re-formed at every timestep, making the need for play Pac-Man using incremental reinforcement learn- feedback corrections redundant. 
The only thing that needs to ing.” 1999, http://www.ai.mit.edu/people/ be accurately measured is the current state and that state needs stauffer/Projects/PacMan/. [9] M. Gallagher and A. Ryan, “Learning to Play Pac-Man: An Evo- to be mapped to the internal model. lutionary, Rule-Based Approach,” in IEEE Congress on Evolutionary Another reason why the on-line exploitation of the model Computation, 2003, pp. 2462–2469. is so successful (compared to creating off-line controllers [10] M. L. Marcus Gallagher, “Evolving Pac-Man Players: Can We Learn from Raw Input?” in IEEE Symposium on Computational Intelligence using evolution or reinforcement learning) is that even in the and Games, 2007. case where a reactive behaviour would perform successfully, [11] N. Wirth and M. Gallagher, “An Influence Map Model for Playing Ms the necessary function approximation would diminish the Pac-Man,” IEEE Symposium on Computational Intelligence and Games, 2008. performance of the agent. The constant re-planning at every [12] S. M. Lucas, “Evolving a Neural Network Location Evaluator to Play Ms time step we do here alleviates the need for any form of Pac-Man,” IEEE Symposium on Computational Intelligence and Games, approximator (such as a neural network). pp. 203–210, 2005. [13] D. Robles and S. Lucas, “A Simple Tree Search Method for Playing An interesting phenomenon which is evident from the Ms Pac-Man,” in Proceedings of the 5th International Conference on experiments is that in our case more computational time, which Computational Intelligence and Games, 2009, pp. 249–255. invariably results in more simulations, does not necessarily [14] P. Burrow and S. Lucas, “Evolution versus temporal difference learning for learning to play Ms Pac-Man,” in Proceedings of the 5th international result in better performance. We think that the primary reason conference on Computational Intelligence and Games. IEEE Press, behind this is may be as follows: unless some critical threshold 2009, pp. 53–60. is crossed, where the agent clearly plays better, an avalanche of [15] A. M. Alhejali and S. Lucas, “Evolving Diverse Ms Pac-Man Playing Agents Using Genetic Programming,” in Proceedings of the UK Work- slightly different decisions leads to totally different games. So shop on Computational Intelligence (UKCI). IEEE Press, 2010. a slightly better agent might find himself in a difficult situation [16] L. Kocsis and C. Szepesvari,´ “Bandit based Monte-Carlo Planning,” in and fail, whereas an inferior agent will never come to see that 15th European Conference on Machine Learning (ECML), 2006, pp. state at all! This kind of behaviour has not been observed in 282–293. [17] E. Rasmusen, Games and information: An introduction to game theory. any MCTS implementations until now as far as we are aware. Blackwell Pub, 2007. However, in most previous efforts to create MCTS agents, the [18] P. Auer, N. Cesa-Bianchi, and P. Fischer, “Finite-time analysis of the simulations involved playing a real, albeit guided, game until multiarmed bandit problem,” Machine Learning, vol. 47, no. 2, pp. 235– 256, 2002. the end. In our case the simulations run an idealised game, [19] S. Gelly and Y. Wang, “Exploration exploitation in Go: UCT for Monte- with a heuristic end function. Carlo Go,” in Twentieth Annual Conference on Neural Information We believe that the approach presented in this paper has Processing Systems (NIPS 2006). Citeseer, 2006. [20] C. S. Lee, M. H. Wang, C. Chaslot, J. Hoock, A. Rimmel, O. 
Teytaud, great potential for creating generic AI agents. One can easily S. R. Tsai, S. C. Hsu, and T. P. Hong, “The computational intelligence envisage a procedure where the most important abstract fea- of MoGo revealed in Taiwan’s computer Go tournaments,” in IEEE tures of a world are modelled (such as the game rules and Transactions on Computational Intelligence and AI in Games, vol. 1, no. 1, 2009, pp. 73–89. “equation” of motion) and given to an agent to reason with. [21] S. Billouet, J. Hoock, C.-S. Lee, O. Teytaud, and S.-J. Yen, “9x9 go as The agent can then use MCTS to produce general intelligent black with komi 7.5: At last some games won against top players in the (or at least sensible) behaviour with a minimum of domain- disadvantageous situation,” in ICGA Journal, vol. 32, no. 3, 2009, pp. 241–246. specific programming. [22] S.-J. Yen, C.-S. Lee, and O. Teytaud, “Human vs. Computer Go Competition in FUZZ-IEEE 2009,” in ICGA Journal, vol. 32, no. 3, ACKNOWLEDGEMENTS 2009, pp. 178–180. [23] R. Blincoe, “Go, going, gone?” The Guardian, 2006. This research is partly funded by a postgraduate studentship [Online]. Available: http://www.guardian.co.uk/technology/2009/apr/30/ from the Engineering and Physical Sciences Research Council games-software-mogo/print [24] R. Coulom, “Computing ELO ratings of move patterns in the game of and from the National Council of Science and Technology of Go,” in Computer Games Workshop. Citeseer, 2007. Mexico (CONACYT). [25] G. Chaslot, M. Winands, H. Herik, J. Uiterwijk, and B. Bouzy, “Pro- gressive strategies for Monte-Carlo Tree Search,” New Mathematics and Natural Computation, vol. 4, no. 3, p. 343, 2008. REFERENCES [26] M. Enzenberger, M. Muller, B. Arneson, and R. Segal, “FUEGO: [1] R. Coulom, “Efficient selectivity and backup operators in Monte-Carlo An Open-Source Framework for Board Games and Go Engine Based Tree Search,” in Proceedings of the 5th International Conference on on Monte-Carlo Tree Search,” IEEE Transactions on Computational Computers and Games (CG2006), 2006, pp. 72–83. Intelligence and AI in Games, vol. 2, no. 4, p. to appear, 2010. [2] S. Gelly, Y. Wang, R. Munos, and O. Teytaud, “Modifications of UCT [27] F. Van Lishout, G. Chaslot, and J. Uiterwijk, “Monte-Carlo Tree Search with Patterns in Monte-Carlo Go,” INRIA, Tech. Rep. 6062, 2006. in Backgammon,” in Computer Games Workshop, 2007, pp. 175–184. [3] Y. Bjornsson and H. Finnsson, “CadiaPlayer: A simulation-based general [28] P. Hingston and M. Masek, “Experiments with Monte Carlo Othello,” in game player,” IEEE Transactions on Computational Intelligence and AI IEEE Congress on Evolutionary Computation, 2007. CEC 2007, 2007, in Games, vol. 1, pp. 4–15, 2009. pp. 4059–4064. [4] B. Arneson, R. Hayward, and P. Henderson, “Monte Carlo Tree Search [29] G. V. den Broeck, K. Driessens, and J. Ramon, “Monte-Carlo Tree in Hex,” IEEE Transactions on Computational Intelligence and AI in Search in Poker Using Expected Reward Distributions,” in ACML, 2009, Games, vol. 2, no. 4, pp. 251 –258, 2010. pp. 367–381. JOURNAL OF COMPUTATIONAL INTELLIGENCE AND AI IN GAMES, VOL. ?, NO. ?, SUBMITTED OCTOBER 2010 13

[30] S. Gelly and D. Silver, “Achieving Master Level Play in 9 x 9 Computer Simon Lucas (SMIEEE) is a professor of computer Go,” in Proceedings of the Twenty-Third AAAI Conference on Artificial science at the University of Essex (UK) where Intelligence, 2008, pp. 1537–1540. he leads the Game Intelligence Group. His main [31] M. Chung, M. Buro, and J. Schaeffer, “Monte Carlo Planning in RTS research interests are games, evolutionary compu- Games,” in IEEE Symposium on Computational Intelligence and Games, tation, and machine learning, and he has published 2005. widely in these fields with over 130 peer-reviewed [32] G. Chaslot, S. Bakkes, I. Szita, and P. Spronck, “Monte-carlo tree search: papers, mostly in leading international conferences A new framework for game AI,” in Proceedings of the Fourth Artificial and journals. He was chair of IAPR Technical Com- Intelligence and Interactive Digital Entertainment Conference, 2008, pp. mittee 5 on Benchmarking and Software (2002 - 216–217. 2006) and is the inventor of the scanning n-tuple [33] R. Balla and A. Fern, “UCT for tactical assault planning in real- classifier, a fast and accurate OCR method. He was time strategy games,” in Proceedings of the 21st International Joint appointed inaugural chair of the IEEE CIS Games Technical Committee in Conference on Artifical intelligence. Morgan Kaufmann Publishers July 2006, has been competitions chair for many international conferences, Inc., 2009, pp. 40–45. and co-chaired the first IEEE Symposium on Computational Intelligence [34] D. Billings, A. Davidson, T. Schauenberg, N. Burch, M. Bowling, and Games in 2005. He was program chair for IEEE CEC 2006, program R. Holte, J. Schaeffer, and D. Szafron, “Game-tree search with adap- co-chair for IEEE CIG 2007, and for PPSN 2008. He is an associated tation in stochastic imperfect-information games,” Lecture Notes in editor of IEEE Transactions on Evolutionary Computation, and the Journal of Computer Science, vol. 3846, pp. 21–34, 2006. Memetic Computing. He has given invited keynote talks and tutorials at many [35] C. Luckhardt and K. Irani, “An algorithmic solution of n-person games,” conferences including IEEE CEC, IEEE CIG, and PPSN. Professor Lucas was in Proceedings of the 5th National Conference on Artificial Intelligence recently appointed as the founding Editor-in-Chief of the IEEE Transactions (AAAI), 1986, pp. 158–162. on Computational Intelligence and AI in Games. [36] S. Samothrakis, D. Robles, and S. Lucas, “A UCT agent for Tron: Initial investigations,” in Proceedings of the 6th International Conference on Computational Intelligence and Games, 2010. [37] T. Galaxies. (2006) Twin Galaxies - Ms Pac-Man High Scores @ONLINE. [Online]. Available: http://www.twingalaxies.com/index. aspx?c=22&pi=2&gi=3162&vi=1386 [38] N. Ikehata and T. Ito, “Monte Carlo Tree Search in Ms Pac-Man,” in The 15th Game Programming Workshop, IPSJ Symposium Series Vol. 2010/12, 2010.

Spyridon Samothrakis is currently pursuing a PhD in Computational Intelligence and Games at the University of Essex. His interests include game theory, machine learning, evolutionary algorithms and consciousness.

David Robles did his MSc in Computer Science at the University of Essex in 2008. He is now pursuing a PhD in Computer Science in the area of artificial intelligence in games.