Machine Learning for Adversarial Agent Microworlds

J. Scholz1, B. Hengst2, G. Calbert1, A. Antoniades2, P. Smet1, L. Marsh1, H-W. Kwok1, D. Gossink1

1DSTO Command and Control Division, 2NICTA
E-Mail: [email protected]

Keywords: Wargames; Reinforcement Learning

EXTENDED ABSTRACT

We describe models and algorithms to support interactive decision making for national and military strategic courses of action, and our progress towards the development of innovative reinforcement learning algorithms, scaling progressively to more complex and realistic simulations.

Abstract representations or 'microworlds' have been used throughout military history to aid in the conceptualization of terrain, force disposition and movements. With the introduction of digitized systems into military headquarters, the capacity of these representations to degrade decision-making has become a concern. Maps with overlays are a centerpiece of most military headquarters and may attain an authority which exceeds their competency, as humans forget their limitations both as a model of the physical environment and as a means to convey command intent (Miller, 2005). The design of microworlds may help to alleviate human-system integration problems by providing detail where it is available and appropriate, and recognizable abstractions where it is not. Ultimately, Lambert and Scholz (2005, p. 29) take the view that "the division of labor between people and machines should be developed to leave human decision making unfettered by machine interference when it is likely to prove heroic, and enhance human decision making with automated decision aids, or possibly override it with automated decision making, when it is likely to be hazardous." Cognisant of these issues, we state a formalism for microworlds, initially to manage representation complexity for machine learning. This formalism is based on multi-agent stochastic games, abstractions and homomorphism.

According to Kaelbling et al. (1996), "Reinforcement learning (RL) is the problem faced by an agent that must learn behaviour through trial and error interactions with a dynamic environment." As RL is frustrated by the curse of dimensionality, we have pursued methods to exploit temporal abstraction and hierarchical organisation. As noted by Barto and Mahadevan (2003, p. 23), "Although the proposed ideas for hierarchical RL appear promising, to date there has been insufficient experience in experimentally testing the effectiveness of these ideas on large applications." We describe the results of our foray into reinforcement learning approaches for progressively larger-scale applications, with 'grid world' models including modified Chess and Checkers, a military campaign-level game with concurrent moves, and 'fine-grained' models comprising a land and an air game.

We detail and compare results from microworld models for which both human-engineered and machine-learned agents were developed. We conclude with a summary of insights and describe future research directions.

INTRODUCTION

During Carnegie Mellon University's Robotics Institute 25th anniversary, Raj Reddy said, "the biggest barrier is (developing) computers that learn with experience and exhibit goal-directed behaviour. If you can't build a system that can learn with experience, you might as well forget everything else". This involves developing agents characterized by the ability to:

- learn from their own experience and the experience of human commanders,
- accumulate learning over long time periods on various microworld models and use what is learned to cope with new situations,
- decompose problems and formulate their own representations, recognising relevant events among the huge amounts of data in their "experience",
- exhibit adaptive, goal-directed behaviour and prioritise multiple, conflicting and time-varying goals,
- interact with humans and other agents, using language and context to decipher and respond to complex actions, events and language.

Extant agent-based modelling and simulation environments for military appreciation appear to occupy extreme ends of the modelling spectrum. At one end, Agent-Based Distillations (ABD) provide weakly expressive though highly computable (many games per second of execution) models, using attraction-repulsion rules to represent intent in simple automata (Horne and Johnson 2003). At the other end of the spectrum, large-scale systems such as the Joint Synthetic Armed Forces (JSAF) and Joint Tactical Level Simulation (JTLS) require days to months and many staff to set up a single simulation run (Matthews and Davies 2001), and are of limited value to highly interactive decision-making. We propose a microworld-based representation which falls somewhere in between these extremes, yet allows for decision interaction.

INTERACTIVE MICROWORLDS

Microworlds have been used throughout military history to represent terrain, force disposition and movements. Typically such models were built as miniatures on the ground (see figure 1) or on a board grid, but these days they are more usually computer-based.

Figure 1. Example of a Microworld.

Creating a microworld for strategic decisions is primarily a cognitive engineering task. A microworld (e.g. a map with overlaid information) is a form of symbolic language that should represent the necessary objects and dynamics, but not fall into the trap of becoming as complex as the environment it seeks to express (Friman and Brehmer 1999). It should also avoid the potential pitfalls of interaction, including the human desire for "more", when more is not necessarily better. Omodei et al. (2004) have shown that humans under certain conditions display weakness in self-regulation of their own cognitive load. Believing, for example, that "more detailed information is better", individuals can become information overloaded, fail to realise that they are overloaded, and mission effectiveness suffers. Similarly, holding the belief that "more reliable information is better", individuals who are informed or observe that some sources are only partially reliable can fail to pay sufficient attention to the reliable information from those sources, and mission effectiveness suffers.

Thus we seek a formalism which will allow the embodiment of cognitively-austere, adaptable and agreed representations.

Microworld Formalism

We use the formalisation of multi-agent stochastic games (Fudenberg and Tirole, 1995) as our generalised scientific model for microworlds.

A multi-agent stochastic game may be formalised as a tuple $\langle N, S, A, T, R \rangle$, where $N$ is the number of agents, $S$ the states of the microworld, $A = A_1 \times A_2 \times \cdots \times A_N$ the set of concurrent (joint) actions, with $A_i$ the set of actions available to agent $i$, $T$ the stochastic state transition function for each action, $T : S \times A \times S \rightarrow [0,1]$, and $R = (R_1, R_2, \ldots, R_N)$, where $R_i$ is the immediate reward for agent $i$, $R_i : S \times A \rightarrow \mathbb{R}$. The objective is to find an action policy $\pi_i(S)$ for each agent $i$ that maximises the sum of future discounted rewards $R_i$.

In multi-agent stochastic games each agent must choose actions in accordance with the actions of other agents. This makes the environment essentially non-stationary. In attempting to apply learning methods to complex games, researchers have developed a number of algorithms that combine game-theoretic concepts with reinforcement learning algorithms, such as the Nash-Q algorithm (Hu and Wellman 2003).
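To make the formalism concrete, the following minimal sketch shows one way the tuple $\langle N, S, A, T, R \rangle$ and the discounted-return objective could be encoded. The class and function names are illustrative assumptions rather than part of any system described in this paper.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple

# A hypothetical, minimal encoding of a multi-agent stochastic game <N, S, A, T, R>.
@dataclass
class StochasticGame:
    n_agents: int                                         # N
    states: Sequence[str]                                 # S
    actions: Sequence[Sequence[str]]                      # A_i: one action set per agent
    transition: Callable[[str, Tuple], Dict[str, float]]  # T(s, a) -> {s': probability}
    reward: Callable[[str, Tuple], Sequence[float]]       # R(s, a) -> (R_1, ..., R_N)

    def step(self, state: str, joint_action: Tuple) -> Tuple[str, Sequence[float]]:
        """Sample a successor state and return the per-agent immediate rewards."""
        dist = self.transition(state, joint_action)
        next_state = random.choices(list(dist), weights=list(dist.values()))[0]
        return next_state, self.reward(state, joint_action)

def discounted_return(rewards: Sequence[float], gamma: float = 0.95) -> float:
    """Sum of future discounted rewards along one agent's trajectory."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))
```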

Abstractions, Homomorphism and Models

To make learning feasible we need to reduce the complexity of problems to manageable proportions. A challenge is to abstract the huge state-space generated by multi-agent MDPs. To do this we plan to use any structure and constraints in the problem to decompose the MDP.

Models can be thought of as abstractions that generally reduce the complexity and scope of an environment to allow us to focus on specific problems. A good or homomorphic model may be thought of as a many-to-one mapping that preserves operations of interest, as shown in figure 2. If the environment transitions from environmental state $w_1$ to $w_2$ under the environmental dynamics, then we have an accurate model if the model state $m_1$ transitions to $m_2$ under a model dynamic, state $w_1$ maps onto $m_1$, $w_2$ maps onto $m_2$, and the environmental dynamics map onto the model dynamics.

Figure 2. Homomorphic Models.

The purpose of a microworld is to represent a model of the actual military situation. The initial models described here are abstractions provided by commanders to reflect the real situation as accurately as possible.
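As a loose, toy illustration of the homomorphism property described above (the state mapping commutes with the dynamics), the sketch below checks that abstracting a state and then applying the model dynamics gives the same result as applying the environment dynamics and then abstracting. The deterministic dynamics and the particular mapping are assumptions made purely for illustration.

```python
from typing import Callable, Hashable, Iterable

def is_homomorphic(env_states: Iterable[Hashable],
                   env_step: Callable[[Hashable], Hashable],
                   model_step: Callable[[Hashable], Hashable],
                   abstract: Callable[[Hashable], Hashable]) -> bool:
    """True if abstract(env_step(w)) == model_step(abstract(w)) for every state w,
    i.e. w1 -> m1 and w2 -> m2 with the dynamics of interest preserved
    (deterministic case only, for illustration)."""
    return all(abstract(env_step(w)) == model_step(abstract(w)) for w in env_states)

# Toy example: six cells abstracted by parity; moving two cells right preserves parity,
# so the two-state parity model with an identity dynamic is a homomorphic model.
env_states = range(6)
env_step = lambda w: (w + 2) % 6   # environment dynamics: move two cells right (wrapping)
abstract = lambda w: w % 2         # many-to-one mapping onto a two-state model
model_step = lambda m: m           # model dynamics: parity is unchanged

print(is_homomorphic(env_states, env_step, model_step, abstract))  # True
```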

Grid World Examples

Chess and Checkers Variant Microworlds

Our project initially focused on variants of abstract games in discrete time and space, such as Chess and Checkers. We deliberately introduced asymmetries into the games and conducted sensitivity analysis through Monte Carlo simulation at various ply depths up to seven. The six asymmetries were materiel, where one side commences with a materiel deficit; planning, with differing ply search depths; tempo, in which one side was allowed to make a double move at some frequency; stochastic dynamics, where pieces are taken probabilistically; and finally hidden pieces, where one side possessed pieces that could be moved though the opponent could only indirectly infer the piece location (Smet et al 2005).

The TDLeaf algorithm is suitable for learning value functions in these games (Baxter et al, 1998). TDLeaf takes the principal value of a minimax tree (at some depth) as the sample used to update a parameterised value function through on-line temporal differences. Independently discovered by Beal and Smith in 1997, TDLeaf has been applied to classical games such as Backgammon, Chess, Checkers and Othello with impressive results (Schaeffer, 2000).

For each asymmetric variant, we learned a simple evaluation function based on materiel and mobility balance. The combined cycle of evaluation function learning and Monte Carlo simulation allowed us to draw general conclusions about the relative importance of materiel, planning, tempo, stochastic dynamics and partial observability of pieces (Smet et al 2005), described further in the results section below.

Our approach to playing alternate-move partially observable games combined aspects of information theory and value-function-based reinforcement learning. During the game, we stored not only a belief over the distribution of our opponent's pieces but, conditioned on each of these states, a distribution over the possible positions of our own pieces (a belief of our opponent's belief). This enabled the use of 'entropy balance', the balance between the relative uncertainties of opposition states, as one of the basis functions in the evaluation function (Calbert and Kwok 2004). Simulation results are described in the results section below.
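As a rough sketch of the TDLeaf idea described above, the code below uses the evaluation at the principal leaf of a shallow minimax search as the temporal-difference sample for updating a linear evaluation function. The search, feature and learning-rate details are simplifying assumptions; the algorithm of Baxter et al. (1998) additionally uses lambda-weighted traces over the whole game and takes the gradient at the principal-variation leaf.

```python
import numpy as np

def leaf_value(state, depth, weights, features, successors):
    """Depth-limited negamax returning the evaluation backed up from the principal leaf.
    `features(state)` returns a NumPy feature vector (from the side to move's view);
    `successors(state)` returns the child states. No pruning, purely illustrative."""
    children = successors(state)
    if depth == 0 or not children:
        return float(np.dot(weights, features(state)))
    return max(-leaf_value(c, depth - 1, weights, features, successors) for c in children)

def tdleaf_step(weights, s_t, s_t1, depth, features, successors, alpha=0.01):
    """One simplified TDLeaf-style update: nudge the evaluation of the current
    position towards the minimax-backed-up evaluation after the next move."""
    delta = (leaf_value(s_t1, depth, weights, features, successors)
             - leaf_value(s_t, depth, weights, features, successors))
    grad = features(s_t)  # gradient of a linear evaluation (TDLeaf proper uses the PV leaf's features)
    return weights + alpha * delta * grad
```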

TD-Island Microworld

TD-Island stands for 'Temporal Difference Island', indicating the use of learning to control elements in an island-based game. Space and time are discrete in our model. Each of 12 discrete spatial states is either of type land or sea (figure 3).

Figure 3. TD-Island Microworld.

Elements in the game include jetfighters, army brigades and logistics supplies. Unlike Chess, two or more pieces may occupy the same spatial state. Further from Chess, in TD-Island there is concurrency of action, in that all elements may move at each time step. This induces an action-space explosion, making centralized control difficult for games with a large number of elements. The TD-Island agent uses domain symmetries and heuristics to prune the action space, inducing the homomorphic model.

Some symmetries in the TD-Island game are simple to recognise. For example, the set of all possible concurrent actions of jetfighters striking targets is the set of all possible permutations of jetfighters assigned to those targets. One can eliminate this complexity by considering only a single permutation. Considering a single permutation "breaks" the game symmetry, reducing the dimension of the overall concurrent action space and inducing a homomorphic model (Calbert 2004).
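The permutation symmetry just described can be broken very cheaply: rather than enumerating every ordered assignment of interchangeable jetfighters to targets, keep a single canonical assignment for each choice of strikers. The sketch below is a generic illustration of this idea with made-up fighter and target names, not code from the TD-Island agent.

```python
from itertools import combinations, permutations

fighters = ["f1", "f2", "f3"]
targets = ["t1", "t2"]

# Naive concurrent action space: every ordered assignment of two fighters to the targets.
all_assignments = {tuple(zip(p, targets)) for p in permutations(fighters, len(targets))}

# Symmetry-broken space: fighters are interchangeable, so keep one canonical
# permutation per selection of strikers (which fighter hits which target no longer matters).
canonical = {tuple(zip(c, targets)) for c in combinations(fighters, len(targets))}

print(len(all_assignments), len(canonical))  # 6 ordered assignments collapse to 3
```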

The approach to state-space reduction in the TD-Island game is similar to that found in Chess-based reinforcement learning studies, such as the KnightCap game (Baxter et al 1998). A number of features or basis functions are chosen, and the importance of these basis functions forms the set of parameters to be learned. The history of Chess has developed insights into the appropriate choice of basis functions (Furnkranz and Kubat 2001). Fortunately there are also considerable insights into the abstractions important in warfare, generating what is termed 'operational art'. Such abstractions may include the level of simultaneity, protection of forces, force balance and maintenance of supply lines, amongst others (US Joint Chiefs of Staff 2001). Each of these factors may be encoded into a number of different basis functions, which form the heart of the value function approximation.

Continuous Time-Space Example

Tempest Seer

Tempest Seer is a microworld for evaluating strategic decisions for air operations. Capabilities modelled include various fighters, air-to-air refuelling, surface-to-air missiles and ground-based radars, as well as logistical aspects. Airports and storage depots are also included. The model covers a 1000 km square of continuous space, with time advancing in discrete units. The Tempest Seer microworld will enable users to look at the potential outcomes of strategic decisions. Such decisions involve the assignment of aircraft and weapons to missions, selection of targets, defence of assets and patrolling of the air space. While these are high-level decisions, modelling their repercussions accurately requires attention to detail. This microworld is therefore at a much lower level of abstraction than TD-Island.

Figure 4. Tempest Seer Microworld.

Given the complexity of this microworld, it is natural to attempt to systematically decompose its structure into a hierarchically organized set of sub-tasks. The hierarchy would be composed of higher-level strategic tasks such as attack and defend at the top, lower-level tactical tasks near the bottom, and primitive or atomic moves at the base. Such a hierarchy has been composed for the Tempest Seer game, as shown in figure 5.

Figure 5. Hierarchical decomposition of the Tempest Seer game: an 'Effects' root with attacking-force and defending-force branches; sub-tasks such as Target(t), Defend(d), Strike Mission(t), Diversionary, CAP(d) and Intercept(j); intermediate behaviours such as intercept, release weapon, out mission, in mission, avoid threats and target distance; and primitive actions such as tarmac wait, loiter, take off, straight, Nav(p), land and shoot.

Given this decomposition, one can apply the MAXQ-Q algorithm to learn game strategies (Dietterich, 2000). One can think of the game being played through the execution of differing subroutines, these being nodes of the hierarchy, as differing states are encountered. Each node can be considered as a stochastic game focused on a particular aspect of play, such as targeting, or at an elemental level the taking-off or landing of aircraft.

Though the idea of learning with hierarchical decomposition is appealing, several conditions must apply in order to make the application practical. First, our human-designed hierarchy must be valid, in the sense that the nodes of the hierarchy capture the richness of strategic development used by human players. Second, and crucially, one must be able to abstract out irrelevant parts of the state representation at differing parts of the hierarchy. As an example, consider the "take-off" node. Provided all aircraft, including the enemy's, are sufficiently far from the departing aircraft, we can ignore their states: the states of other aircraft can safely be abstracted out without altering the quality of strategic learning. Without such abstraction, hierarchical learning is in fact far more expensive than general "flat" learning (Dietterich, 2000). Finally, the current theory of hierarchical learning must be modified to include explicit adversarial reasoning, for example the randomisation of strategies to keep the opponent "second-guessing", so to speak.
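A minimal sketch of the state abstraction argued for above is given below: a hierarchy node keeps a value table only over the state variables it declares relevant, so that, for instance, a hypothetical "take-off" node ignores distant enemy aircraft. The node, variable and action names are invented for illustration and are not taken from the Tempest Seer implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class TaskNode:
    """A hierarchy node that learns over an abstracted view of the full state."""
    name: str
    relevant_keys: Tuple[str, ...]   # state variables this sub-task cares about
    q_table: Dict[Tuple, Dict[str, float]] = field(default_factory=dict)

    def abstract(self, state: dict) -> Tuple:
        # Project the full state onto the sub-task's relevant variables only.
        return tuple(state[k] for k in self.relevant_keys)

    def update(self, state: dict, action: str, target: float, alpha: float = 0.1) -> None:
        s = self.abstract(state)
        q = self.q_table.setdefault(s, {})
        q[action] = q.get(action, 0.0) + alpha * (target - q.get(action, 0.0))

# Hypothetical "take-off" node: if no enemy is nearby, enemy positions are irrelevant.
take_off = TaskNode("take_off", relevant_keys=("runway_clear", "own_fuel"))
state = {"runway_clear": True, "own_fuel": 0.9, "enemy_positions": ((812, 40),)}
take_off.update(state, action="throttle_up", target=1.0)
print(take_off.q_table)
```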
We have completed research into incorporating adversarial reasoning into hierarchical learning through a vignette in which taxis compete for passengers and are able to "steal" passengers. The hierarchical decomposition for this game is shown in figure 6.

Figure 6. Hierarchical decomposition for the competitive taxi problem: a Root task over Get and Deliver sub-tasks; below these, PickUp, Steal, Navigate(t), Evade and PutDown; and, at the base, the primitive actions North, South, East, West and Wait.

In an approach to combining hierarchy with game theory, one of the authors has combined MAXQ learning with the "win-or-learn-fast" (WoLF) algorithm (Bowling and Veloso, 2002), which adjusts the probabilities of executing a particular strategy according to whether you are doing well or losing in the strategic play.
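The following minimal sketch captures the WoLF idea in isolation: a mixed strategy over high-level options is nudged towards the greedy choice, with a small step when the current strategy is beating its historical average ("winning") and a larger step when it is not. It is a generic illustration under assumed data, not the authors' MAXQ-plus-WoLF combination.

```python
import numpy as np

def wolf_update(policy, q_values, avg_policy, delta_win=0.01, delta_lose=0.04):
    """WoLF-style policy hill-climbing step with two learning rates:
    cautious when winning (current policy outperforms the average policy),
    fast when losing, following the principle of Bowling and Veloso (2002)."""
    winning = np.dot(policy, q_values) > np.dot(avg_policy, q_values)
    delta = delta_win if winning else delta_lose
    target = np.zeros_like(policy)
    target[np.argmax(q_values)] = 1.0     # greedy strategy under current value estimates
    new_policy = policy + delta * (target - policy)
    return new_policy / new_policy.sum()  # keep it a probability distribution

# Two high-level strategies (e.g. "press the attack" vs "defend"), assumed values.
policy = np.array([0.5, 0.5])
avg_policy = np.array([0.6, 0.4])
q_values = np.array([0.2, 0.7])
print(wolf_update(policy, q_values, avg_policy))  # shifts probability towards strategy 2
```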

RESULTS OF A MICROWORLD GAME

We illustrate learning performance for a Checkers-variant microworld. In this variant, each side is given one hidden piece. If this piece reaches the opposition baseline, it becomes a hidden king. After some experience in playing this game, we estimated the relative importance of standard pieces, hidden unkinged pieces and hidden kinged pieces. We then machine-learned our parameter values using the belief-based TDLeaf algorithm described earlier.

Figure 7. Results for Checkers with hidden pieces.

Figure 7 details the probability of wins, losses and draws for three competitions of 5000 games each. In the first competition, an agent with machine-learned evaluation function weights played against an agent whose evaluation function weights were all set as uniformly important. In the second competition, an agent with machine-learned evaluation function weights played against an agent with a human-estimated evaluation function. Finally, an agent using human-estimated weights played against an agent using uniform values. The agent with machine-learned evaluation function weights wins most games. This result points towards our aim of constructing adequate machine-learned representations for new, unique strategic situations.

SUMMARY ISSUES

We have found the use of action and state symmetry, extended actions, time-sharing, deictic representations, and hierarchical task decompositions useful for improving the efficiency of learning in our agents. Action symmetries arise, for example, when actions have identical effects in the value or Q-function (Ravindran 2004, Calbert 2004). Another form of action abstraction is to define actions extended in time; these have been variously referred to as options (Sutton et al 1999), sub-tasks, macro-operators, or extended actions. In multi-agent games the action space $A = A_1 \times A_2 \times \cdots \times A_N$ describes concurrent actions, where $A_i$ is the set of actions available to agent $i$, and the number of concurrent actions is $|A| = \prod_i |A_i|$. We can reduce the problem by constraining the agents so that only one agent can act at each time step. State symmetry arises when behaviour over state sub-spaces is repetitive throughout the state space (Hengst 2002), or when there is some geometric symmetry such that solving the problem in one domain translates to others. Deictic representations consider only the state regions important to the agent, for example the agent's direct field of view. Multi-level task decompositions use a task hierarchy to decompose the problem; whether the task hierarchy is user-specified as in MAXQ (Dietterich 2000) or machine-discovered as in HEXQ (Hengst 2002), they have the potential to simultaneously abstract both actions and states.

Part of our approach to achieving human agreement and trust in the representation has been a move from simple variants of Chess and Checkers to more general microworlds like TD-Island and Tempest Seer. We advocate an open-interface component approach to software development, allowing the representations to be dynamically adaptable as conditions change.

FUTURE WORK AND CONCLUSIONS

We will continue to explore the theme of microworld and agent design, and fuse our experience in games, partially observable planning and extended actions. Though we have identified good models as homomorphic, there is little work describing realistic approaches to constructing these homomorphisms for microworlds or adversarial domains.

We are also interested in capturing human intent and rewards as perceived by military players in microworld games. This may be viewed as an inverse MDP, where one determines rewards and a value function given a policy, as opposed to determining a policy from predetermined rewards (Ng and Russell 2000). We believe that augmentation of human cognitive ability will improve the ability of military commanders to achieve their intent.

ACKNOWLEDGEMENTS

National ICT Australia is funded through the Australian Government's Backing Australia's Ability initiative, in part through the Australian Research Council. We wish to acknowledge the inspiration of Dr Jan Kuylenstierna and his team from the Swedish National Defence College.

REFERENCES

Alberts, D., Garstka, J. and Hayes, R. (2001) Understanding Information Age Warfare, CCRP Publ., Washington, DC.

Barto, A.G. and Mahadevan, S. (2003) Recent Advances in Hierarchical Reinforcement Learning. Discrete Event Dynamic Systems, 13:41-77. Special Issue on Reinforcement Learning.

Baxter, J., Tridgell, A. and Weaver, L. (1998) TDLeaf(λ): Combining Temporal Difference Learning with Game Tree Search. Proceedings of the Ninth Australian Conference on Neural Networks: 168-172.

Beal, D.F. and Smith, M.C. (1997) Learning Piece Values Using Temporal Differences, ICCA Journal 20(3): 147-151.

Bowling, M. and Veloso, M. (2002) Multiagent learning using a variable learning rate. Artificial Intelligence, Vol. 136, pp. 215-250.

Calbert, G. (2004) "Exploiting Action-Space Symmetry for Reinforcement Learning in a Concurrent Dynamic Game", Proceedings of the International Conference on Optimisation Techniques and Applications.

Calbert, G. and Kwok, H-W. (2004) Combining Entropy-Based Heuristics, Minimax Search and Temporal Differences to Play Hidden-State Games. Proceedings of the Florida Artificial Intelligence Research Society Conference, AAAI Press.

Dietterich, T.G. (2000) "Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition", Journal of Artificial Intelligence Research, Vol. 13, pp. 227-303.

Friman, H. and Brehmer, B. (1999) "Using Microworlds To Study Intuitive Battle Dynamics: A Concept For The Future", ICCRTS, Rhode Island, June 29 - July 1.

Fudenberg, D. and Tirole, J. (1995) Game Theory. MIT Press.

Furnkranz, J. and Kubat, M. (eds) (2001) Machines that Learn to Play Games. Advances in Computation: Theory and Practice, Vol. 8. Nova Science Publishers, Inc.

Hengst, B. (2002) "Discovering Hierarchy in Reinforcement Learning with HEXQ", Proceedings of the Nineteenth International Conference on Machine Learning, pp. 243-250, Morgan Kaufmann Publ.

Horne, G. and Johnson, S. (Eds.) (2003) Maneuver Warfare Science, USMC Project Albert Publ.

Hu, J. and Wellman, M.P. (2003) Nash Q-Learning for General-Sum Stochastic Games, Journal of Machine Learning Research, Vol. 4, pp. 1039-1069.

Kaelbling, L.P., Littman, M.L. and Moore, A.W. (1996) Reinforcement Learning: A Survey, Journal of Artificial Intelligence Research, Vol. 4, pp. 237-285.

Kuylenstierna, J., Rydmark, J. and Fahrus, T. (2000) "The Value of Information in War: Some Experimental Findings", Proceedings of the 5th International Command and Control Research and Technology Symposium (ICCRTS), Canberra, Australia.

Lambert, D. and Scholz, J.B. (2005) "A Dialectic for Network Centric Warfare", Proceedings of the 10th International Command and Control Research and Technology Symposium (ICCRTS), McLean, VA, June 13-16. http://www.dodccrp.org/events/2005/10th/CD/papers/016.pdf

Matthews, K. and Davies, M. (2001) Simulations for Joint Military Operations Planning. Proceedings of the SIMTEC Modelling and Simulation Conference.

Miller, C.S. (2005) "A New Perspective for the Military: Looking at Maps Within Centralized Command and Control Systems". Air and Space Power Chronicles, 30 March. http://www.airpower.maxwell.af.mil/airchronicles/cc/miller.html

Ng, A.Y. and Russell, S. (2000) "Algorithms for inverse reinforcement learning", Proceedings of the Seventeenth International Conference on Machine Learning.

Omodei, M.M., Wearing, A.J., McLennan, J., Elliott, G.C. and Clancy, J.M. (2004) More is Better? Problems of self-regulation in naturalistic decision making settings, in Montgomery, H., Lipshitz, R. and Brehmer, B. (Eds.), Proceedings of the 5th Naturalistic Decision Making Conference.

Ravindran, B. (2004) "An Algebraic Approach to Abstraction in Reinforcement Learning". Doctoral Dissertation, Department of Computer Science, University of Massachusetts, Amherst, MA.

Schaeffer, J. (2000) "The games computers (and people) play." Advances in Computers 50, Marvin Zelkowitz (ed.), Academic Press, pp. 189-266.

Smet, P. et al. (2005) "The effects of materiel, tempo and search depth in win-loss ratios in Chess", Proceedings of the Australian Artificial Intelligence Conference.

Sutton, R., Precup, D. and Singh, S. (1999) Between MDPs and semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning. Artificial Intelligence, Vol. 112, pp. 181-211.

US Joint Chiefs of Staff (2001) Joint Doctrine Keystone and Capstone Primer. Appendix A, 10 Sept. http://www.dtic.mil/doctrine/jel/new_pubs/primer.pdf
