Markov Games As a Framework for Multi-Agent Reinforcement Learning
Michael L. Littman
Brown University / Bellcore
Department of Computer Science, Brown University, Providence, RI 02912-1910
[email protected]

Abstract

In the Markov decision process (MDP) formalization of reinforcement learning, a single adaptive agent interacts with an environment defined by a probabilistic transition function. In this solipsistic view, secondary agents can only be part of the environment and are therefore fixed in their behavior. The framework of Markov games allows us to widen this view to include multiple adaptive agents with interacting or competing goals. This paper considers a step in this direction in which exactly two agents with diametrically opposed goals share an environment. It describes a Q-learning-like algorithm for finding optimal policies and demonstrates its application to a simple two-player game in which the optimal policy is probabilistic.

1 INTRODUCTION

No agent lives in a vacuum; it must interact with other agents to achieve its goals. Reinforcement learning is a promising technique for creating agents that co-exist [Tan, 1993, Yanco and Stein, 1993], but the mathematical framework that justifies it is inappropriate for multi-agent environments. The theory of Markov Decision Processes (MDP's) [Barto et al., 1989, Howard, 1960], which underlies much of the recent work on reinforcement learning, assumes that the agent's environment is stationary and as such contains no other adaptive agents.

The theory of games [von Neumann and Morgenstern, 1947] is explicitly designed for reasoning about multi-agent systems. Markov games (see e.g., [Van Der Wal, 1981]) are an extension of game theory to MDP-like environments. This paper considers the consequences of using the Markov game framework in place of MDP's in reinforcement learning. Only the specific case of two-player zero-sum games is addressed, but even in this restricted version there are insights that can be applied to open questions in the field of reinforcement learning.

2 DEFINITIONS

An MDP [Howard, 1960] is defined by a set of states, $S$, and actions, $A$. A transition function, $T : S \times A \to \mathrm{PD}(S)$, defines the effects of the various actions on the state of the environment. ($\mathrm{PD}(S)$ represents the set of discrete probability distributions over the set $S$.) The reward function, $R : S \times A \to \mathbb{R}$, specifies the agent's task.

In broad terms, the agent's objective is to find a policy mapping its interaction history to a current choice of action so as to maximize the expected sum of discounted reward, $E\{\sum_{j=0}^{\infty} \gamma^j r_{t+j}\}$, where $r_{t+j}$ is the reward received $j$ steps into the future. A discount factor, $0 \le \gamma < 1$, controls how much effect future rewards have on the optimal decisions, with small values of $\gamma$ emphasizing near-term gain and larger values giving significant weight to later rewards.

In its general form, a Markov game, sometimes called a stochastic game [Owen, 1982], is defined by a set of states, $S$, and a collection of action sets, $A_1, \ldots, A_k$, one for each agent in the environment. State transitions are controlled by the current state and one action from each agent: $T : S \times A_1 \times \cdots \times A_k \to \mathrm{PD}(S)$. Each agent $i$ also has an associated reward function, $R_i : S \times A_1 \times \cdots \times A_k \to \mathbb{R}$, and attempts to maximize its expected sum of discounted rewards, $E\{\sum_{j=0}^{\infty} \gamma^j r_{i,t+j}\}$, where $r_{i,t+j}$ is the reward received $j$ steps into the future by agent $i$.

In this paper, we consider a well-studied specialization in which there are only two agents and they have diametrically opposed goals. This allows us to use a single reward function that one agent tries to maximize and the other, called the opponent, tries to minimize. In this paper, we use $A$ to denote the agent's action set, $O$ to denote the opponent's action set, and $R(s, a, o)$ to denote the immediate reward to the agent for taking action $a \in A$ in state $s \in S$ when its opponent takes action $o \in O$.

Adopting this specialization, which we call a two-player zero-sum Markov game, simplifies the mathematics but makes it impossible to consider important phenomena such as cooperation. However, it is a first step and can be considered a strict generalization of both MDP's (when $|O| = 1$) and matrix games (when $|S| = 1$).

As in MDP's, the discount factor, $\gamma$, can be thought of as the probability that the game will be allowed to continue after the current move. It is possible to define a notion of undiscounted rewards [Schwartz, 1993], but not all Markov games have optimal strategies in the undiscounted case [Owen, 1982]. This is because, in many games, it is best to postpone risky actions indefinitely. For current purposes, the discount factor has the desirable effect of goading the players into trying to win sooner rather than later.
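To make these definitions concrete, here is a minimal sketch, in Python, of one way a two-player zero-sum Markov game could be represented; the class name, the array layout for $T$ and $R$, and the `discounted_return` helper are illustrative assumptions for this sketch, not constructs from the paper.

```python
import numpy as np

class ZeroSumMarkovGame:
    """Minimal container for a two-player zero-sum Markov game.

    T[s, a, o, s2] is the probability of moving to state s2 when the agent
    plays a and the opponent plays o in state s; R[s, a, o] is the agent's
    immediate reward (the opponent receives -R[s, a, o]).
    """

    def __init__(self, T, R, gamma):
        assert 0.0 <= gamma < 1.0
        self.T, self.R, self.gamma = np.asarray(T), np.asarray(R), gamma
        # Each T[s, a, o, :] must be a probability distribution over states.
        assert np.allclose(self.T.sum(axis=-1), 1.0)


def discounted_return(rewards, gamma):
    """The objective from Section 2: sum_j gamma^j * r_j for a reward sequence."""
    return sum(gamma ** j * r for j, r in enumerate(rewards))
```

Under this encoding, an MDP is the special case in which the opponent axis has a single dummy action ($|O| = 1$), and a matrix game is the case of a single state ($|S| = 1$).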
3 OPTIMAL POLICIES

The previous section defined the agent's objective as maximizing the expected sum of discounted reward. There are subtleties in applying this definition to Markov games, however. First, we consider the parallel scenario in MDP's.

In an MDP, an optimal policy is one that maximizes the expected sum of discounted reward and is undominated, meaning that there is no state from which any other policy can achieve a better expected sum of discounted reward. Every MDP has at least one optimal policy, and of the optimal policies for a given MDP, at least one is stationary and deterministic. This means that, for any MDP, there is a policy $\pi : S \to A$ that is optimal. The policy is called stationary since it does not change as a function of time, and it is called deterministic since the same action is always chosen whenever the agent is in state $s$, for all $s \in S$.

For many Markov games, there is no policy that is undominated because performance depends critically on the choice of opponent. In the game theory literature, the resolution to this dilemma is to eliminate the choice and evaluate each policy with respect to the opponent that makes it look the worst. This performance measure prefers conservative strategies that can force any opponent to a draw over more daring ones that accrue a great deal of reward against some opponents and lose a great deal to others. This is the essence of minimax: behave so as to maximize your reward in the worst case.

Given this definition of optimality, Markov games have several important properties. Like MDP's, every Markov game has a non-empty set of optimal policies, at least one of which is stationary. Unlike MDP's, there need not be a deterministic optimal policy. Instead, the optimal stationary policy is sometimes probabilistic, mapping states to discrete probability distributions over actions, $\pi : S \to \mathrm{PD}(A)$. A classic example is "rock, paper, scissors," in which any deterministic policy can be consistently defeated.

The idea that optimal policies are sometimes stochastic may seem strange to readers familiar with MDP's or games with alternating turns like backgammon or tic-tac-toe, since in these frameworks there is always a deterministic policy that does no worse than the best probabilistic one. The need for probabilistic action choice stems from the agent's uncertainty about its opponent's current move and its requirement to avoid being "second guessed."

4 FINDING OPTIMAL POLICIES

This section reviews methods for finding optimal policies for matrix games, MDP's, and Markov games. It uses a uniform notation that is intended to emphasize the similarities between the three frameworks. To avoid confusion, function names that appear more than once appear with different numbers of arguments each time.

4.1 MATRIX GAMES

At the core of the theory of games is the matrix game defined by a matrix, $R$, of instantaneous rewards. Component $R_{o,a}$ is the reward to the agent for choosing action $a$ when its opponent chooses action $o$. The agent strives to maximize its expected reward while the opponent tries to minimize it. Table 1 gives the matrix game corresponding to "rock, paper, scissors."

                          Agent
                   rock   paper   scissors
          rock       0      1       -1
Opponent  paper     -1      0        1
          scissors   1     -1        0

Table 1: The matrix game for "rock, paper, scissors."

The agent's policy is a probability distribution over actions, $\pi \in \mathrm{PD}(A)$. For "rock, paper, scissors," $\pi$ is made up of 3 components: $\pi_{\mathrm{rock}}$, $\pi_{\mathrm{paper}}$, and $\pi_{\mathrm{scissors}}$. According to the notion of optimality discussed earlier, the optimal agent's minimum expected reward should be as large as possible. How can we find a policy that achieves this? Imagine that we would be satisfied with a policy that is guaranteed an expected score of $V$ no matter which action the opponent chooses. The inequalities in Table 2, with $\pi_a \ge 0$, constrain the components of $\pi$ to represent exactly those policies; any solution to the inequalities would suffice.

$\pi_{\mathrm{paper}} - \pi_{\mathrm{scissors}} \ge V$   (vs. rock)
$\pi_{\mathrm{scissors}} - \pi_{\mathrm{rock}} \ge V$   (vs. paper)
$\pi_{\mathrm{rock}} - \pi_{\mathrm{paper}} \ge V$   (vs. scissors)
$\pi_{\mathrm{rock}} + \pi_{\mathrm{paper}} + \pi_{\mathrm{scissors}} = 1$

Table 2: Linear constraints on the solution to a matrix game.

For $\pi$ to be optimal, we must identify the largest $V$ for which there is some value of $\pi$ that makes the constraints hold. Linear programming (see, e.g., [Strang, 1980]) is a general technique for solving problems of this kind. In this example, linear programming finds a value of 0 for $V$ and (1/3, 1/3, 1/3) for $\pi$. We can abbreviate this linear program as:

$$V = \max_{\pi \in \mathrm{PD}(A)} \min_{o \in O} \sum_{a \in A} R_{o,a}\, \pi_a,$$

where $\sum_{a \in A} R_{o,a}\, \pi_a$ expresses the expected reward to the agent for using policy $\pi$ against the opponent's action $o$.
4.2 MDP'S

There is a host of methods for solving MDP's.
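As one illustration of that standard machinery, here is a minimal value iteration sketch under the MDP definitions of Section 2, assuming $T$ and $R$ are stored as dense arrays; the function name, tolerance, and stopping rule are illustrative choices, not the paper's.

```python
import numpy as np

def value_iteration(T, R, gamma, tol=1e-8):
    """Value iteration for an MDP.

    T[s, a, s2] is the transition probability and R[s, a] the expected
    immediate reward, following the definitions in Section 2.  Returns the
    optimal state values V and a greedy deterministic policy.
    """
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        # Q(s, a) = R(s, a) + gamma * sum_{s2} T(s, a, s2) * V(s2)
        Q = R + gamma * (T @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new
```

The Q values formed inside the loop are the quantities that Q-learning estimates from sampled experience, which is the connection suggested by the Q-learning-like algorithm mentioned in the abstract.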