Value Function Approximation in Zero-Sum Markov Games
Michail G. Lagoudakis and Ronald Parr
Department of Computer Science, Duke University, Durham, NC 27708
{mgl,parr}@cs.duke.edu

Abstract

This paper investigates value function approximation in the context of zero-sum Markov games, which can be viewed as a generalization of the Markov decision process (MDP) framework to the two-agent case. We generalize error bounds from MDPs to Markov games and describe generalizations of reinforcement learning algorithms to Markov games. We present a generalization of the optimal stopping problem to a two-player simultaneous-move Markov game. For this special problem, we provide stronger bounds and can guarantee convergence for LSTD and temporal difference learning with linear value function approximation. We demonstrate the viability of value function approximation for Markov games by using the least-squares policy iteration (LSPI) algorithm to learn good policies for a soccer domain and a flow control problem.

1 Introduction

Markov games can be viewed as generalizations of both classical game theory and the Markov decision process (MDP) framework.¹ In this paper, we consider the two-player zero-sum case, in which two players make simultaneous decisions in the same environment with shared state information. The reward function and the state transition probabilities depend on the current state and the current agents' joint actions. The reward function in each state is the payoff matrix of a zero-sum game.

Many practical problems can be viewed as Markov games, including military-style pursuit-evasion problems. Moreover, many problems that are solved as classical games by working in strategy space may be represented more effectively as Markov games. A well-known example of a Markov game is Littman's soccer domain (Littman, 1994).

Recent work on learning in games has emphasized accelerating learning and exploiting opponent suboptimalities (Bowling & Veloso, 2001). The problem of learning in games with very large state spaces using function approximation has not been a major focus, possibly due to the perception that a prohibitively large number of linear programs would need to be solved during learning.

This paper contributes to the theory and practice of learning in Markov games. On the theoretical side, we generalize some standard bounds on MDP performance with approximate value functions to Markov games. When presented in their fully general form, these turn out to be straightforward applications of known properties of contraction mappings to Markov games.

We also generalize the optimal stopping problem to the Markov game case. Optimal stopping is a special case of an MDP in which states have only two actions: continue on the current Markov chain, or exit and receive a (possibly state-dependent) reward. Optimal stopping is of interest because it can be used to model the decision problem faced by the holder of stock options (and is purportedly modeled as such by financial institutions). Moreover, linear value function approximation algorithms can be used for these problems with guaranteed convergence and more compelling error bounds than those obtainable for generic MDPs. We extend these results to a Markov game we call the "opt-out", in which two players must decide whether to remain in the game or opt out, with a payoff that is dependent upon the actions of both players. For example, this can be used to model the decision problem faced by two "partners" who can choose to continue participating in a business venture, or force a sale and redistribution of the business assets.

As in the case of MDPs, these theoretical results are somewhat more motivational than practical. They provide reassurance that value function approximation architectures which can represent value functions with reasonable errors will lead to good policies. However, as in the MDP case, they do not provide a priori guarantees of good performance. Therefore we demonstrate, on the practical side of this paper, that with a good value function approximation architecture, we can achieve good performance on large Markov games. We achieve this by extending the least-squares policy iteration (LSPI) algorithm (Lagoudakis & Parr, 2001), originally designed for MDPs, to general two-player zero-sum Markov games. LSPI is an approximate policy iteration algorithm that uses linear approximation architectures, makes efficient use of training samples, and learns faster than conventional learning methods. The advantage of using LSPI for Markov games stems from its generalization ability and its data efficiency. The latter is of even greater importance for large Markov games than for MDPs, since each learning step is not lightweight; it requires solving a linear program.

We demonstrate LSPI with value function approximation in two domains: Littman's soccer domain, in which LSPI learns very good policies using a relatively small number of sample "random games" for training, and a simple server/router flow control problem, where the optimal policy is found easily with a small number of training samples.

¹ Littman (1994) notes that the Markov game framework actually preceded the MDP framework.

2 Markov Games

A two-player zero-sum Markov game is defined as a 6-tuple (S, A, O, P, R, γ), where: S = {s_1, s_2, ..., s_n} is a finite set of game states; A = {a_1, a_2, ..., a_m} and O = {o_1, o_2, ..., o_l} are finite sets of actions, one for each player; P is a Markovian state transition model, where P(s, a, o, s') is the probability that s' will be the next state of the game when the players take actions a and o respectively in state s; R is a reward (or cost) function, where R(s, a, o) is the expected one-step reward for taking actions a and o in state s; and γ ∈ (0, 1] is the discount factor for future rewards. We will refer to the first player as the "agent" and the second player as the "opponent." Note that if the opponent is permitted only a single action, the Markov game becomes an MDP.

A policy π for a player in a Markov game is a mapping, π : S → Ω(A), which yields probability distributions over the agent's actions for each state in S. Unlike MDPs, the optimal policy for a Markov game may be stochastic, i.e., it may define a mixed strategy for every state. By convention, π(s) denotes the probability distribution over actions in state s and π(s, a) denotes the probability of action a in state s.

The agent is interested in maximizing its expected, discounted return in the minimax sense, that is, assuming the worst case of an optimal opponent. Since the underlying rewards are zero-sum, it is sufficient to view the opponent as acting to minimize the agent's return. Thus, the state value function for a zero-sum Markov game can be defined in a manner analogous to the Bellman equation for MDPs:

    V(s) = \max_{\pi(s) \in \Omega(A)} \min_{o \in O} \sum_{a \in A} Q(s, a, o) \pi(s, a) ,    (1)

where the Q-values familiar to MDPs are now defined over states and pairs of agent and opponent actions:

    Q(s, a, o) = R(s, a, o) + \gamma \sum_{s' \in S} P(s, a, o, s') V(s') .    (2)

We will refer to the policy chosen by Eq. (1) as the minimax policy with respect to Q. This policy can be determined in any state s by solving the following linear program:

    Maximize:    V(s)
    Subject to:  \sum_{a \in A} \pi(s, a) = 1
                 \pi(s, a) \ge 0                                  for all a \in A
                 V(s) \le \sum_{a \in A} Q(s, a, o) \pi(s, a)     for all o \in O

In many ways, it is useful to think of the above linear program as a generalization of the max operator for MDPs (this is also the view taken by Szepesvari and Littman (1999)).
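As an illustration of how this per-state program can be handed to an off-the-shelf LP solver, consider the following sketch. It is not from the paper: the function name `minimax_policy`, the representation of the Q-values at a single state as an |A| x |O| NumPy array, and the use of scipy.optimize.linprog are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog


def minimax_policy(Q_s):
    """Solve the minimax LP for one state.

    Q_s is an |A| x |O| array holding Q(s, a, o) for a fixed state s.
    Returns the mixed strategy pi(s, .) and the game value V(s)
    against a worst-case opponent.
    """
    num_a, num_o = Q_s.shape

    # Decision variables: [pi(s, a_1), ..., pi(s, a_m), V(s)].
    # linprog minimizes, so maximize V(s) by minimizing -V(s).
    c = np.zeros(num_a + 1)
    c[-1] = -1.0

    # For every opponent action o:  V(s) - sum_a Q(s, a, o) pi(s, a) <= 0.
    A_ub = np.hstack([-Q_s.T, np.ones((num_o, 1))])
    b_ub = np.zeros(num_o)

    # Probabilities sum to one: sum_a pi(s, a) = 1.
    A_eq = np.append(np.ones(num_a), 0.0).reshape(1, -1)
    b_eq = np.array([1.0])

    # pi(s, a) >= 0; V(s) is a free variable.
    bounds = [(0.0, 1.0)] * num_a + [(None, None)]

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:num_a], res.x[-1]
```

For a game with a single state and γ = 0, Q_s reduces to the payoff matrix of an ordinary matrix game and the program returns its usual minimax solution; with Q-values as in Eq. (2), it returns the minimax policy and value at state s.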
"partners" who can choose to continue participating in a business venture, or force a sale and redistribution of the Many practical problem can be viewed as Markov games, business assets. including military-style pursuit-evasion problems. More over, many problems that are solved as classical games As in the case of MDPs, these theoretical results are some what more motivational than practical. They provide re 1 Littman (1994) notes that the Markov game framework actu assurance that value function approximation architectures ally preceded the MOP framework. 284 LAGOUDAKIS & PARR UAI2002 which can represent value functions with reasonable er value function for a zero-sum Markov game can be defined rors will lead to good policies. However, as in the MDP in a manner analogous to the Bellman equation for MDPs: case, they do not provide a priori guarantees of good per V(s) = max min Q(s,a,o)1r(s,a) , (1) formance. Therefore we demonstrate, on the practical side rr(s)E!"l(A) oEO L of this paper, that with a good value function approximation aEA architecture, we can achieve good performance on large where the Q-values familiar to MDPs are now definedover Markov games. We achieve this by extending the least states and pairs of agent and opponent actions: squares policy iteration (LSPI) algorithm (Lagoudakis & Parr, 2001), originally designed for MDPs, to general two Q(s, a,o) = R(s, a, o) + 'Y L P(s, a, o, s')V(s') . (2) player zero-sum Markov games. LSPI is an approximate s'ES policy iteration algorithm that uses linear approximation We will refer to the policy chosen by Eq. (I) as the minimax architectures, makes efficient use of training samples, and policy with respect to Q. This policy can be determined in learns faster than conventional learning methods. The ad any state s by solving the following linear program: vantage of using LSPI for Markov games stems from its generalization ability and its data efficiency. The latter is Maximize: V(s) of even greater importance for large Markov Games than Subject to: L 1r(s, a) = 1 for MDPs since each learning step is not light weight; it aEA requires solving a linear program. 'riaE A, 1r(s,a) 2: 0 o We demonstrate LSPI with value function approximation 'ri E 0, V(s) � L Q(s,a,o)1r(s,a) in two domains: Littman's soccer domain in which LSPI aEA learns very good policies using a relatively small num ber of sample "random games" for training; and, a simple In many ways, it is useful to think of the above linear pro gram as a generalization of the max operator for MDPs server/router flow control problem where the optimal pol (this is also the view taken by Szepesvari and Littman icy is found easily with a small number of training samples. (1999)). We briefly summarize some of the properties of operators for Markov games which are analogous to oper ators for MDPs. (See Bertsekas and Tsitsiklis (1996) for 2 Markov Games an overview.) Eq. (I) and Eq. (2) have fixed points V* and Q*, respectively, for which the minimax policy is optimal A two-player zero-sum Markov game is defined as a 6- in the minimax sense. The following iteration, which we name with operator converges to tuple (S, A, 0, P, n, 'f'), where: s = { S), 82, ... , Sn } is a T*, Q*: finite set of game states; A = {a 1,a 2, ..., am} and 0 = (i+ll T* : Q (s,a, o) = R(s, a, o) + { o1,o2 , ..., o!} are finitesets of actions, one for each player; ! 
3 Value Function Approximation

Value function approximation has played an important role in extending the classical MDP framework to solve practical problems. While we are not yet able to provide strong a priori guarantees in most cases for the performance of

    \|V\|_\infty = \max_{s \in S} |V(s)|   (Bertsekas & Tsitsiklis, 1996)