Writing Stratagus-playing Agents in Concurrent ALisp
Bhaskara Marthi, Stuart Russell, David Latham
Department of Computer Science
University of California
Berkeley, CA 94720
{bhaskara,russell,latham}@cs.berkeley.edu

Abstract

We describe Concurrent ALisp, a language that allows the augmentation of reinforcement learning algorithms with prior knowledge about the structure of policies, and show by example how it can be used to write agents that learn to play a subdomain of the computer game Stratagus.

1 Introduction

Learning algorithms have great potential applicability to the problem of writing artificial agents for complex computer games [Spronck et al., 2003]. In these algorithms, the agent learns how to act optimally in an environment through experience. Standard "flat" reinforcement-learning techniques learn very slowly in environments the size of modern computer games. The field of hierarchical reinforcement learning [Parr and Russell, 1997; Dietterich, 2000; Precup and Sutton, 1998; Andre and Russell, 2002] attempts to scale RL up to larger environments by incorporating prior knowledge about the structure of good policies into the algorithms.

In this paper we focus on writing agents that play the game Stratagus (stratagus.sourceforge.net). In this game, a player must control a medieval army of units and defeat opposing forces. It has high-dimensional state and action spaces, and successfully playing it requires coordinating multiple complex activities, such as gathering resources, constructing buildings, and defending one's base. We will use the following subgame of Stratagus as a running example to illustrate our approach.

[Figure 1: An example subgame of Stratagus. The figure shows footmen, peasants, a barracks, and a goldmine.]

Example 1 In this example domain, shown in Figure 1, the agent must defeat a single ogre (not visible in the figure). It starts out with a single peasant (more may be trained), and must gather resources in order to train other units. Eventually it must build a barracks, and use it to train footman units. Each footman unit is much weaker than the ogre, so multiple footmen will be needed to win. The game dynamics are such that footmen do more damage when attacking as a group, rather than individually. The only evaluation measure is how long it takes to defeat the ogre.

Despite its small size, writing a program that performs well in this domain is not completely straightforward. It is not immediately obvious, for example, how many peasants should be trained, or how many footmen should be trained before attacking the ogre. One way to go about writing an artificial agent that played this game would be to have the program contain free parameters such as num-peasants-to-build-given-single-enemy, and then figure out the optimal setting of the parameters, either "by hand" or in some automated way. A naive implementation of this approach would quickly become infeasible¹ for larger domains, however, since there would be a large number of parameters, which are coupled, and so exponentially many different joint settings would have to be tried. Also, if the game is stochastic, each parameter setting would require many samples to evaluate reliably.

¹A more sophisticated instantiation of this approach, combined with conventional HRL techniques, has been proposed recently [Ghavamzadeh and Mahadevan, 2003].
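To make the parameter-tuning burden concrete, a naive tuner over such free parameters might look like the sketch below. This is not code from the paper: the parameter names, candidate values, and the evaluate-setting helper (which would run the game several times and average the results) are all hypothetical. The point is simply that the number of joint settings grows multiplicatively with each added parameter.

(defparameter *num-peasants-to-build* '(1 2 3 4))        ; hypothetical candidate values
(defparameter *num-footmen-before-attack* '(2 3 4 5 6))
(defparameter *footman-group-size* '(2 3 4))

(defun tune-agent ()
  ;; Exhaustively evaluate every joint setting: already 4 * 5 * 3 = 60
  ;; combinations for three small parameters, each of which would need
  ;; many noisy game runs to score reliably.
  (let ((best nil)
        (best-score nil))
    (dolist (np *num-peasants-to-build*)
      (dolist (nf *num-footmen-before-attack*)
        (dolist (gs *footman-group-size*)
          (let ((score (evaluate-setting np nf gs)))  ; assumed scoring helper
            (when (or (null best-score) (> score best-score))
              (setf best-score score
                    best (list np nf gs)))))))
    best))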
The field of reinforcement learning [Kaelbling, 1996] addresses the problem of learning to act optimally in sequential decision-making problems and would therefore seem to be applicable to our situation. However, standard "flat" RL algorithms scale poorly to domains the size of Stratagus. One reason for this is that these algorithms work at the level of primitive actions such as "move peasant 3 north 1 step". These algorithms also provide no way to incorporate any prior knowledge one may have about the domain.

Hierarchical reinforcement learning (HRL) can be viewed as combining the strengths of the two above approaches, using partial programs. A partial program is like a conventional program except that it may contain choice points, at which there are multiple possible statements to execute next. The idea is that the human designer will provide a partial program that reflects high-level knowledge about what a good policy should look like, but leaves some decisions unspecified, such as how many peasants to build in the example. The system then learns a completion of the partial program that makes these choices in an optimal way in each situation. HRL techniques like MAXQ and ALisp also provide an additive decomposition of the value function of the domain based on the structure of the partial program. Often, each component in this decomposition depends on a small subset of the state variables. This can dramatically reduce the number of parameters to learn.

We found that existing HRL techniques such as ALisp were not directly applicable to Stratagus. This is because an agent playing Stratagus must control several units and buildings, which are engaged in different activities. For example, a peasant may be carrying some gold to the base while one group of footmen defends the base and another group attacks enemy units. The choices made in these activities are correlated, so the problem cannot be solved simply by having a separate ALisp program for each unit. On the other hand, a single ALisp program that controlled all the units would essentially have to implement multiple control stacks to deal with the asynchronously executing activities that the units are engaged in. Also, we would lose the additive decomposition of the value function that was present in the single-threaded case.

We addressed these problems by developing the Concurrent ALisp language. The rest of the paper demonstrates by example how this language can be used to write agents for Stratagus domains. A more precise description of the syntax and semantics can be found in [Marthi et al., 2005].

2 Concurrent ALisp

Suppose we have the following prior knowledge about what a good policy for Example 1 should look like. First train some peasants. Then build a barracks using one of the peasants. Once the barracks is complete, start training footmen. Attack the enemy with groups of footmen. At all times, peasants not engaged in any other activity should gather gold.

We will now explain the syntax of Concurrent ALisp with reference to a partial program that implements this prior knowledge. Readers not familiar with Lisp should still be able to follow the example. The main thing to keep in mind is that in Lisp syntax, a parenthesized expression of the form (f arg1 arg2 arg3) means the application of the function f to the given arguments. Parenthesized expressions may also be nested. In our examples, all operations that are not part of standard Lisp are in boldface.

We will refer to the set of buildings and units in a state as the effectors in that state. In our implementation, each effector must be given a command at each step (time is discretized into one step per 50 cycles of game time). The command may be a no-op. A Concurrent ALisp program can be multithreaded, and at any point, each effector is assigned to some thread.

Execution begins with a single thread, at the function top, shown in Figure 2. In our case, this thread simply creates some other threads using the spawn operation. For example, the second line of the function creates a new thread with ID "allocate-peasants", which begins by calling the function peas-top and is assigned the effector *peas-eff*. Next, examine the peas-top function shown in Figure 3.

(defun top ()
  (spawn "allocate-peasants" #'peas-top nil *peas-eff*)
  (spawn "train-peasants" #'townhall-top *townhall* *townhall-eff*)
  (spawn "allocate-gold" #'alloc-gold nil)
  (spawn "train-footmen" #'barracks-top nil)
  (spawn "tactical-decision" #'tactical nil))

Figure 2: Top-level function

(defun peas-top ()
  (loop
     unless (null (my-effectors))
     do (let ((peas (first (my-effectors))))
          (choose "peas-choice"
                  (spawn (list "gold" peas) #'gather-gold nil peas)
                  (spawn (list "build" peas) #'build-barracks nil peas)))))

Figure 3: Peasant top-level function

This function loops until it has at least one peasant assigned to it. This is checked using the my-effectors operation. It then must make a choice about whether to use this peasant to gather gold or to build the barracks, which is done using the choose statement. The agent must learn how to make such a choice as a function of the environment and program state. For example, it might be better to gather gold if we have no gold, but better to build the barracks if we have plentiful gold reserves.

Figure 4 shows the gather-gold function and the navigate function, which it calls. The navigate function navigates to a location by repeatedly choosing a direction to move in, and then performing the move action in the environment using the action operation. At each step, it checks to see if it has reached its destination using the get-env-state operation.

(defun gather-gold ()
  (call navigate *gold-loc*)
  (action *get-gold*)
  (call navigate *base-loc*)
  (action *dropoff*))
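The navigate function itself does not appear in this excerpt of Figure 4. A minimal sketch consistent with the description above (repeatedly choose a direction, perform the corresponding move with the action operation, and check the environment state with get-env-state) might look like the following; the unit-at-loc-p predicate and the *move-...* action constants are hypothetical names, not the paper's code.

(defun navigate (loc)
  ;; Sketch only: move one step at a time until the controlled unit
  ;; reaches loc.  unit-at-loc-p and the *move-...* constants are
  ;; assumed helpers and action names.
  (loop until (unit-at-loc-p (get-env-state) loc)
        do (choose "nav-choice"
                   (action *move-north*)
                   (action *move-south*)
                   (action *move-east*)
                   (action *move-west*))))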
We will not give the entire partial program here, but Figure 5 summarizes the threads and their interactions. The allocate-gold thread makes decisions about whether the next unit to be trained is a footman or a peasant, and then communicates its decision to the train-footmen and train-peasants threads using shared variables.

Section 6 describes how to fit Stratagus into this framework.

We build on the standard semantics for interleaved execution of multithreaded programs. At each point, there is a set of threads, each having a call stack and a program counter.
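To make this execution model concrete, the state of a single thread as just described might be represented roughly as follows. This is an illustrative sketch, not the paper's implementation; the structure and field names are assumptions based only on the description above.

(defstruct thread-state
  id          ; thread ID, e.g. "allocate-peasants"
  call-stack  ; the thread's call stack within the partial program
  pc          ; program counter: the next statement to execute
  effectors)  ; effectors (units and buildings) currently assigned to this thread

At any point during execution, the overall program state would then include one such record per live thread, together with the state of the game environment.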