A Polynomial-time Algorithm for Repeated Stochastic Games

Enrique Munoz de Cote∗
DEI, Politecnico di Milano
piazza Leonardo da Vinci, 32
20133 Milan, Italy
[email protected]

Michael L. Littman†
Dept. of Computer Science
Rutgers University
Piscataway, NJ 08854
[email protected]

∗ Supported by The National Council of Science and Technology (CONACyT), Mexico, under grant No. 196839.
† Supported, in part, by NSF IIS-0325281.

Abstract

We present a polynomial-time algorithm that always finds an (approximate) Nash equilibrium for repeated two-player stochastic games. The algorithm exploits the folk theorem to derive a strategy profile that forms an equilibrium by buttressing mutually beneficial behavior with threats, where possible. One component of our algorithm efficiently searches for an approximation of the egalitarian point, the fairest Pareto-efficient solution. The paper concludes by applying the algorithm to a set of grid games to illustrate typical solutions the algorithm finds. These solutions compare very favorably to those found by competing algorithms, resulting in strategies with higher social welfare, as well as guaranteed computational efficiency.

1 Problem Statement

Stochastic games (Shapley, 1953) are a popular model of multiagent sequential decision making in the machine-learning community (Littman, 1994; Bowling & Veloso, 2001). In the learning setting, these games are often repeated over multiple rounds to allow learning agents a chance to discover beneficial strategies.

Mathematically, a two-player stochastic game is a tuple ⟨S, s0, A1, A2, T, U1, U2, γ⟩; namely, the set of states S, an initial state s0 ∈ S, action sets for the two agents A1 and A2 with joint action space A = A1 × A2, the state-transition function T : S × A → Π(S) (Π(·) is the set of probability distributions over S), the utility functions for the two agents U1, U2 : S × A → ℜ, and the discount factor 0 ≤ γ ≤ 1.

In an infinitely repeated stochastic game, the stochastic game is played an unbounded number of rounds. On each round, a stage game is played, starting in s0 and consisting of a series of state transitions (steps), jointly controlled by the two agents. At each step, both agents simultaneously select their actions, possibly stochastically, via strategies πi (for each agent i). To avoid infinitely long rounds, after each step, the round is allowed to continue with probability γ; otherwise, it is terminated. The payoff for a player in a stage game is the total utility obtained before the stage game is terminated. (Note that the continuation probability γ is equivalent to a discount factor.) Players behave so as to maximize their average stage-game payoffs over the infinite number of rounds.

A strategy profile, π = ⟨π1, π2⟩, is a Nash equilibrium (NE) if each strategy is optimized with respect to the other. In an equilibrium, no agent can do better by changing strategies given that the other agent continues to follow its strategy in the equilibrium. In a repeated game, the construction of equilibrium strategy profiles can involve each player changing strategy from round to round in response to the behavior of the other agent. Note that an ǫ-approximate NE is one in which no agent can do better by more than ǫ by changing strategies given that the other agent continues to follow its strategy in the equilibrium.

Our approach to finding an equilibrium for repeated stochastic games relies on the idea embodied in the folk theorems (Osborne & Rubinstein, 1994). The relevant folk theorem states that if an agent's performance is measured via expected average payoff, then for any strictly enforceable (all agents receive a payoff larger than their minimax values) and feasible (the payoffs can be obtained by adopting some strategy profile) set of average payoffs to the players, there exist equilibrium strategy profiles that achieve these payoffs.
The power of this folk theorem is that communally beneficial play, such as mutual cooperation in the Prisoner's Dilemma, can be justified as an equilibrium. A conceptual drawback is that there may exist infinitely many feasible and enforceable payoffs (and therefore a daunting set of equilibrium strategy profiles to choose from). We focus on the search for a special point inside this (possibly infinite) set of solutions that maximizes the minimum advantage obtained by the players. (The advantage is the improvement a player gets over the payoff it can guarantee by playing defensively.) We call this point the egalitarian point, after Greenwald and Hall (2003). Other points can also be justified, such as the one that maximizes the product of advantages, the Nash bargaining solution (Nash, 1950).

Earlier work (Littman & Stone, 2005) has shown that the folk theorem can be interpreted computationally, resulting in a polynomial-time algorithm for repeated games. In the prior work, the game in each round is represented in matrix form: each strategy for each player is explicitly enumerated in the input representation. This paper considers the analogous problem when each stage game is represented much more compactly as a stochastic game. Representing such games in matrix form would require an infinitely large matrix, since the number of steps per round, and therefore the complexity of the strategies, is unbounded. Even if we limit ourselves to stationary deterministic strategies, there are exponentially many to consider.

Concretely, we address the following computational problem. Given a stochastic game, return, in polynomial time, a strategy profile that is a Nash equilibrium of the average-payoff repeated stochastic game, one whose payoffs match those of the egalitarian point. In fact, because exact Nash equilibria in stochastic games can require unbounded precision, our algorithm returns an arbitrarily accurate approximation.

2 Background

Here, we present background on the problem.

2.1 Minimax Strategies

Minimax strategies guarantee a minimum payoff value, called the security value, that an agent can guarantee itself by playing a defensive strategy. In addition, an agent can be held to this level of payoff if the other agent adopts an aggressive attack strategy (because minimax equals maximin). Given that minimax strategies guarantee a minimum payoff value, no rational player will agree on any strategy in which it obtains a payoff lower than its security value. The pair of security values in a two-player game is called the disagreement point.

The set X ⊆ R² of average payoffs achievable by strategy profiles can be visualized as a region in the x-y plane. This region is convex because any two strategy profiles can be mixed by alternating over successive rounds to achieve joint payoffs that are any convex combination of the joint payoffs of the original strategy profiles. The disagreement point v = (v1, v2) divides the plane into two regions (see Figure 1): a) the region of mutual advantage (all points in X above and to the right of v), which denotes the strictly enforceable payoff profiles; and b) the relative complement of the region of mutual advantage, which contains the payoff profiles that a rational player would reject.

In general-sum bimatrix games, the disagreement point can be computed exactly by solving two zero-sum games (von Neumann & Morgenstern, 1947) to find the attack and defensive strategies and their values. In contrast, the solution to any zero-sum stochastic game can be approximated to any degree of accuracy ǫ > 0 via value iteration (Shapley, 1953). The running time is polynomial in 1/(1 − γ), 1/ǫ, and the magnitude of the largest utility Umax.

2.2 Markov Decision Processes

In this paper, we use Markov decision processes (Puterman, 1994), or MDPs, as a mathematical framework for modeling the problem of the two players working together as a kind of meta-player to maximize a weighted combination of their payoffs. For any weight [w, 1 − w] (0 ≤ w ≤ 1) and point p = (p1, p2), define σw(p) = wp1 + (1 − w)p2.

Note that any strategy profile π for a stochastic game has a value for the two players that can be represented as a point p^π ∈ X. To find the strategy profile π for a stochastic game that maximizes σw(p^π), we can solve MDP(w), which is the MDP derived from replacing the utility r = (r1, r2) in each state with σw(r).
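To make the reduction to MDP(w) concrete, the following is a minimal value-iteration sketch under our own assumptions about the data layout: states, joint actions, transitions, and per-player utilities are stored in plain dictionaries, and the function names (sigma, solve_mdp_w) are ours, not the paper's. It simply replaces the reward pair with σw(r) and treats each joint action as a single action of the meta-player.

    def sigma(w, p):
        """Scalarize a payoff pair p = (p1, p2) with the weight vector [w, 1 - w]."""
        return w * p[0] + (1.0 - w) * p[1]

    def solve_mdp_w(states, joint_actions, T, U1, U2, gamma, w, tol=1e-6):
        """Value iteration on MDP(w).

        T[s][a] maps successor states to probabilities; U1[s][a] and U2[s][a] are
        the two players' utilities for joint action a in state s."""
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                best = max(
                    sigma(w, (U1[s][a], U2[s][a]))
                    + gamma * sum(p * V[s2] for s2, p in T[s][a].items())
                    for a in joint_actions
                )
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < tol:
                break
        # Greedy joint policy of the meta-player with respect to V
        pi = {
            s: max(
                joint_actions,
                key=lambda a: sigma(w, (U1[s][a], U2[s][a]))
                + gamma * sum(p * V[s2] for s2, p in T[s][a].items()),
            )
            for s in states
        }
        return V, pi

Calling this routine with w = 1 and w = 0 yields the two "friend" solutions used as R and L in Section 3.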

2.3 Other Solutions for Stochastic Games

There are several solution concepts that have been considered in the literature. Generally speaking, a Nash equilibrium (NE) is a vector of independent strategies in which all players optimize their independent probability distributions over actions with respect to expected payoff. A correlated equilibrium (CE) allows for dependencies in the agents' randomizations, so a CE is a probability distribution over the joint space of actions. Minimax strategies maximize payoff in the face of their worst opponent. At the other extreme, "friend" strategies maximize behavior assuming the opponents are working to maximize the agent's own utility. Friend strategies are appropriate in purely cooperative settings but can perform very badly in mixed-incentive settings.

A powerful and general algorithmic approach for sequential decision problems is value iteration (Bellman, 1957). Variants of value iteration can be defined for each of the solution concepts described above (Zinkevich et al., 2005).

Figure 1: The convex hull X ⊆ R² of all payoff profiles. Point v ∈ X depicts the disagreement point. Three different convex hulls are illustrated. The diagonal line is the egalitarian line.

3 Algorithm Description

Of all feasible Nash equilibria, we are interested in the one whose payoffs match the egalitarian point, thus maximizing the minimum of the payoffs of the two players. Mathematically, we are searching for a point P = argmax_{x ∈ X} minv(x). Here, minv(x) is the egalitarian value of x, meaning min(x1 − v1, x2 − v2), where x = (x1, x2) and v is the disagreement point. Note that minv(P) ≥ 0 because X is convex and v is the disagreement point: there is a strategy in which both players do at least as well as v.

Define the egalitarian line to be the line corresponding to the payoffs in which both players' payoffs are equally high above the disagreement point. Consider the two "friend" solutions to the game, where L is the value to the two players when maximizing Player 2's payoff and R is the value to the two players when maximizing Player 1's payoff. Because X is convex, the egalitarian point P is either L, R, or the (highest) intersection between X and the egalitarian line.

Figure 1 illustrates these three situations with three different example X sets. The solid L-shaped lines in the figure are the contour lines with equal egalitarian values. The egalitarian point in X is the one that reaches the topmost contour line. In the set filled with diagonal lines, the point A is the egalitarian point and minv(A) = 2. All other points are to the left of A, so A is the point with maximum x coordinate. In the set filled with circles, E is the egalitarian point with minv(E) = 2. All other points are below E, so E is the point with maximum y coordinate.

The intermediate region with the vertical fill lines is a bit more complex. Point F has the largest y coordinate, but its egalitarian value is negative because of the x coordinate. Point D has the largest x coordinate, but its egalitarian value is negative because of the y coordinate. The point C, which is a linear combination of the vertices B and D, is the egalitarian point, again with a value of 2.

Finding points like E and A is easy: it is just a matter of solving the "friend" MDPs derived using weights of [0, 1] and [1, 0] and halting if either R is on the left or L is on the right of the egalitarian line. However, finding point C is harder, since we need to find points B and D and then intersect the line between them with the egalitarian line.

Figure 2 presents the overall FolkEgal algorithm for finding a strategy profile that achieves the egalitarian point in stochastic games. FolkEgal(U1, U2, ǫ) works for utility functions U1 and U2 and seeks an equilibrium with accuracy ǫ. The routine (δi, α−i, vi) := Game(Ui, ǫ) solves the zero-sum game with utility function Ui to accuracy ǫ. It returns vi and δi, which are the value and strategy (respectively) for the maximizing player i, and α−i, which is the attack strategy of i's opponent (−i). We do not provide code for this subroutine, as any zero-sum stochastic game solver can be used. Littman and Stone (2005) provide details on using the attack strategies to stabilize the discovered mutually beneficial payoffs, which we do not repeat here.

Define FolkEgal(U1, U2, ǫ):
    // Find "minimax" strategies
    Let (δ1, α2, v1) := Game(U1, ǫ/2)
    Let (δ2, α1, v2) := Game(U2, ǫ/2)
    // Make the disagreement point the origin
    Let v := (v1, v2)
    Let U1 := U1 − v
    Let U2 := U2 − v
    // Find "friend" strategies
    Let (R0, π2) := MDP([1, 0])
    Let (L0, π1) := MDP([0, 1])
    // Find the egalitarian point and its policy
    If R0 is left of the egalitarian line:
        Let (P, π) := (R0, π2)
    Elseif L0 is to the right of the egalitarian line:
        Let (P, π) := (L0, π1)
    Else:
        Let (P, π) := EgalSearch(L0, R0, T)
    // If game is like zero sum, compete
    If minv(P) ≤ ǫ:
        Return (δ1, δ2)
    // Else, mutual advantage
    Return π, modified to use threat strategies α1 and α2 to enforce the equilibrium

Figure 2: Our approach to finding the egalitarian point and a strategy profile that achieves it.
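The geometric tests in Figure 2 ("is R0 left of the egalitarian line", "minv(P) ≤ ǫ") reduce to comparing the two players' advantages over the disagreement point. The helpers below are our own sketch of those tests, not code from the paper; after the normalization step in FolkEgal the disagreement point is the origin, so the default v = (0, 0) applies.

    def minv(x, v=(0.0, 0.0)):
        """Egalitarian value of payoff pair x relative to the disagreement point v."""
        return min(x[0] - v[0], x[1] - v[1])

    def left_of_egalitarian_line(x, v=(0.0, 0.0)):
        """True when Player 1's advantage is smaller than Player 2's, i.e., x lies
        on Player 2's side of the 45-degree line through v."""
        return (x[0] - v[0]) < (x[1] - v[1])

    def right_of_egalitarian_line(x, v=(0.0, 0.0)):
        """True when Player 1's advantage exceeds Player 2's."""
        return (x[0] - v[0]) > (x[1] - v[1])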

The key missing subroutine is (P, π) := EgalSearch(L, R, T), which finds the intersection between the convex hull of payoffs and the egalitarian line. It returns the egalitarian point P and a strategy profile π that achieves it. It is given a point L ∈ X to the left of the egalitarian line, a point R ∈ X to the right of the egalitarian line, and a bound T on the number of iterations. Section 4 explains how to choose T.

The algorithm is laid out in Figure 3. Its basic structure is a kind of binary search. On each iteration, it solves an MDP to try to find a policy closer to the egalitarian line. It makes use of several support subroutines. The call w := Balance(L, R) returns the weight w for which σw(L) = σw(R). It can be found by solving a linear equation. The payoff of the optimal strategy profile π for w should be an improvement on L and R with respect to the weight w, that is, σw(P^π) ≥ σw(R) = σw(L). If it is not strictly better, the search ends. Otherwise, the new point is used as either L or R and the search continues.

The final strategy profile returned is found via Intersect(L, R), which discovers the right way to alternate between L and R to produce a payoff on the egalitarian line. Again, a simple linear equation suffices to identify this strategy profile.

Define EgalSearch(L, R, T):
    If T = 0:
        Return Intersect(L, R)
    Let w := Balance(L, R)
    Let (P, π) := MDP(w)
    If P · w = L · w:
        Return Intersect(L, R)
    If P is to the left of the egalitarian line:
        Return EgalSearch(P, R, T − 1)
    Else:
        Return EgalSearch(L, P, T − 1)

Figure 3: Our search algorithm for intersecting the convex payoff region with the egalitarian line.

Figure 4: An illustration of the behavior of EgalSearch.

Figure 4 illustrates a step of the algorithm. First, note the disagreement point v and the egalitarian line heading out from it. The algorithm is given points L and R such that L is on the left of the egalitarian line and R is on the right. In the diagram, the line passing through L labeled L̄ is the set of points p such that σwL(p) = σwL(L). Since L was returned as the maximum payoff with respect to some weight wL, none of the points in the convex set can be above this line. Similarly, wR is the weight that was used in the derivation of R, and therefore no payoffs are possible beyond the R̄ line in the figure.

Next, notice that both L and R are the payoffs for some strategy profile, so both lie in the convex set. Furthermore, any payoff on the line between L and R can also be achieved by some strategy profile. The weight w, derived by Balance(L, R), is the weight such that every payoff p along the line between L and R has the same weighted value σw(p). The line is called LR in the figure.

Putting these ideas together, consider what happens when we solve MDP(w). We know the result will be at least as high as the LR line, since we already know there is a strategy profile that can achieve this payoff. However, we also know that it can't go above the R̄ and L̄ lines. So, the solution is constrained to lie inside the gray shaded triangle in the figure.

The point P is the hypothetical solution to MDP(w). Since it is on the right of the egalitarian line, it replaces R in the next iteration. The black triangle represents the region that will be searched in the next iteration. In the next section, we show that each iteration reduces the area of this triangle substantially, and thus that a small number of iterations are needed to reduce its intersection with the egalitarian line to ǫ/2.
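To summarize Section 3 in runnable form, here is a hedged Python sketch of EgalSearch with the Balance and Intersect linear equations written out. It only tracks payoff points: solve_mdp_w(w) is assumed to return the payoff pair of an optimal joint policy for MDP(w), and a full implementation would also carry the policies that achieve L, R, and P so that the final alternating strategy profile can be assembled. All names are ours, not the paper's.

    def balance(L, R):
        """Weight w with w*L[0] + (1 - w)*L[1] = w*R[0] + (1 - w)*R[1] (one linear equation)."""
        denom = (L[0] - R[0]) - (L[1] - R[1])
        return 0.5 if denom == 0 else (R[1] - L[1]) / denom

    def intersect(L, R, v=(0.0, 0.0)):
        """Mixing probability q on L (1 - q on R) whose average payoff lies on the
        egalitarian line through v; returns (q, mixed payoff point)."""
        dL = (L[0] - v[0]) - (L[1] - v[1])   # Player 1 advantage minus Player 2 advantage at L
        dR = (R[0] - v[0]) - (R[1] - v[1])
        q = 0.5 if dR == dL else dR / (dR - dL)
        return q, (q * L[0] + (1 - q) * R[0], q * L[1] + (1 - q) * R[1])

    def egal_search(L, R, T, solve_mdp_w, v=(0.0, 0.0)):
        """Binary-search sketch mirroring Figure 3."""
        if T == 0:
            return intersect(L, R, v)
        w = balance(L, R)
        P = solve_mdp_w(w)
        sw = lambda p: w * p[0] + (1 - w) * p[1]
        if sw(P) <= sw(L) + 1e-12:                 # no strict improvement: stop
            return intersect(L, R, v)
        if (P[0] - v[0]) < (P[1] - v[1]):          # P is to the left of the egalitarian line
            return egal_search(P, R, T - 1, solve_mdp_w, v)
        return egal_search(L, P, T - 1, solve_mdp_w, v)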

4 Algorithm Analysis

The main open issue is to set the parameter T, which controls the maximum number of search iterations in EgalSearch. Since solving MDPs and the various other steps each take polynomial time, the overall runtime of FolkEgal is polynomial if and only if T is bounded by a polynomial.

Let's say we are given a triangle with area ν where the corners are possible joint payoffs. Let point p be the point in the triangle that maximizes minv(p). Let point r be the point along the longest edge of the triangle that maximizes minv(r).

Claim 1: minv(p) − minv(r) ≤ √(2ν).

To see why, let's consider two facts.

1. For points x and y, if ‖x − y‖2 ≤ δ, then y = x + ∆ for some ∆ = (∆1, ∆2) where |∆1| ≤ δ and |∆2| ≤ δ. It follows that minv(y) = minv(x + ∆) ≤ minv(x + δ) = minv(x) + δ. Reversing x and y, we find |minv(x) − minv(y)| ≤ δ.

2. A triangle with longest edge b must have an altitude h where h ≤ b; otherwise, it would not fit inside the triangle. Therefore, its area is ν = (1/2)bh ≥ (1/2)h². Thus, h ≤ √(2ν). This argument shows that for any point x in the triangle and y on the largest side, ‖x − y‖2 ≤ h ≤ √(2ν).

Combining these two facts proves the claim: |minv(p) − minv(r)| ≤ √(2ν).

Claim 2: Figure 4 shows the generic situation in which the algorithm has found points L and R using weights that result in the edges labeled L̄ and R̄. The gray triangle is the remaining region to search. The algorithm then performs an optimization that uncovers a point P inside this region using weight w. The process then repeats with the black triangle. Note:

1. The angle at the "top" of the triangle gets wider each iteration. This fact follows because the new top vertex is interior to the main triangle.

2. The area of the black triangle is less than or equal to half of that of the gray triangle. You can visualize the gray triangle as consisting of three shapes: a top triangle, a trapezoid, and the black triangle. Note that the black triangle and the trapezoid share the same height, but the large base of the trapezoid (the LR line) is longer than the base of the black triangle on the line LR′ (because the gray triangle tapers). Therefore, the black triangle is smaller than the trapezoid and so is less than half of the area of the gray triangle.

Combining the claims, let pT be the point that maximizes minv(pT) in the triangle active in the Tth iteration. Let rT be the point that maximizes minv(rT) on the longest edge of the triangle active in the Tth iteration. Let νT be the area of the triangle active in the Tth iteration.

1. νT ≤ ν0 × (1/2)^T
2. minv(pT) ≤ minv(rT) + √(2νT)

So, minv(pT) ≤ minv(rT) + √(2ν0/2^T).

If we want the difference to be smaller than, say, ǫ, we need √(2ν0/2^T) ≤ ǫ, or T ≥ log(2ν0/ǫ²).

Since ν0 ≤ Umax², the number of iterations is polynomially bounded in the main parameters of the problem and the approximation factor. Each iteration runs in polynomial time.
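As a quick illustration of the bound with hypothetical numbers (not results from the paper): with Umax = 100 and ǫ = 0.1, roughly twenty iterations already suffice.

    import math

    # Illustrative only: plug hypothetical values into T >= log2(2 * nu0 / eps^2).
    Umax, eps = 100.0, 0.1
    nu0 = Umax ** 2                                  # the initial triangle area is at most Umax^2
    T = math.ceil(math.log2(2 * nu0 / eps ** 2))
    print(T)                                         # 21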
5 Experimental Results

To illustrate the kinds of equilibria produced by our algorithm and to compare them to existing algorithmic approaches, we devised a family of stochastic games played on grids.

All are games played by players A and B. The grids differ in structure, but they all use the same dynamics. Both players A and B occupy distinct cells of the grid and can choose one of 5 distinct actions: N, S, E, W and stand. Actions are executed simultaneously, and transitions from one cell to another are deterministic unless a) there is a semi-passable wall in between cells (depicted as a dotted wall in Figure 5(b)), in which case the player transitions to the desired cell with probability 1/2, or b) both players attempt to step into the same cell, in which case the collision is resolved at random by a coin flip, so only one player ends up occupying the desired cell and the other makes no transition. Walls are impassable and players cannot pass through each other; attempts to do so result in no transition.

Goal locations can be specific to some agent X (depicted as a dollar sign with a subindex, for example, Figures 5(a), 5(c), 5(d), 5(e)) or general (depicted as a dollar sign without a subindex, for example, Figures 5(b) and 5(c)). The game ends after any player gets to one of its goals and a goal reward is given. There is a step cost for each of the actions N, S, E, W, but stand has no cost. Note that games are general sum in that it is possible for both players to score by both reaching goals simultaneously.

All games use γ = 0.95, $ = $A = $B = 100 and step cost = −1. One exception is that the step cost = −10 for the asymmetric game. (This example can be reconstructed with a step cost of one, but many more states are needed, so we decided to keep the example small by modifying the step cost.)
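The joint dynamics just described can be sketched as follows. The grid encoding, wall representation, and function name are our own simplification for illustration, not code from the paper; goal detection and rewards are omitted.

    import random

    MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0), "stand": (0, 0)}

    def joint_step(pos_a, pos_b, act_a, act_b, walls, semi_walls, rng=random):
        """One simultaneous move of players A and B; returns their new cells."""
        def target(pos, act):
            dest = (pos[0] + MOVES[act][0], pos[1] + MOVES[act][1])
            if (pos, dest) in walls:                          # impassable wall
                return pos
            if (pos, dest) in semi_walls and rng.random() < 0.5:
                return pos                                    # semi-passable wall blocks half the time
            return dest

        ta, tb = target(pos_a, act_a), target(pos_b, act_b)
        if ta == tb:                                          # both step into the same cell: coin flip
            if rng.random() < 0.5:
                tb = pos_b
            else:
                ta = pos_a
        if ta == pos_b and tb == pos_a:                       # players cannot pass through each other
            return pos_a, pos_b
        if ta == pos_b and tb == pos_b:                       # bumping into a player who stays put
            ta = pos_a
        if tb == pos_a and ta == pos_a:
            tb = pos_b
        return ta, tb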

We now present results with the set of grid games in Figure 5. For each game, for each solution algorithm, we present the expected payoffs for each agent along with an informal description of the returned strategy profiles.

Figure 5: Grid games in their initial state: (a) coordination, (b) chicken, (c) prisoner's dilemma, (d) compromise, (e) asymmetric. $X = goal for agent X, $ = common goal.

In our results, we include runs for four solution algorithms. Security-VI uses minimax for each player. It is guaranteed to find an equilibrium in zero-sum games, but not in general. Friend-VI uses a self-regarding optimization strategy for both players: each behaves under the assumption that the other player is trying to help it (Littman, 2001). Such an approach can work well in identical-payoff games, but since policies are computed independently, it need not. CE-VI computes a correlated equilibrium for the players at each state. (CE-VI stands for all variants (Greenwald & Hall, 2003) of CE, namely uCE, eCE, and rCE, as their results were identical.) As a result, actions are guaranteed to be coordinated, but such an algorithm need not converge to a Nash equilibrium in general. Our FolkEgal algorithm will always find a Nash equilibrium of the repeated game.

5.1 Coordination

    algorithm     agent 1   agent 2
    security-VI     0.0       0.0
    friend-VI      45.7      45.7
    CE-VI          82.8      82.8
    FolkEgal       82.8      82.8

This game is not terribly interesting, but the fact that the players need to pass by each other without colliding makes it relevant to consider their interaction. Friend-VI is unable to coordinate with the opponent and sometimes will collide. Security-VI finds that the worst opponent can always block the goal, so players stay still forever to avoid step costs. Both CE-VI and FolkEgal find a Nash equilibrium by avoiding each other en route to the goal, and both achieve optimal behavior.

5.2 Chicken

    algorithm     agent 1   agent 2
    security-VI    43.7      43.7
    friend-VI      42.7      42.7
    CE-VI          88.3      43.7
    FolkEgal       83.6      83.6

This game has an element of the game "chicken" in that both players prefer taking the center path, but given that the other player is taking the center path, the side path is more attractive. We used a variation of the standard grid game (Hu & Wellman, 2003) in which collisions are resolved by a coin flip (Littman, 1994) and there is no explicit collision cost.

The difference between security-VI and friend-VI in this game is how an agent behaves if it cannot make it to the center square. In the defensive strategy, once a player does not get the center, it will stay put because it assumes (rightly) that the opponent will proceed directly to the goal. Friend-VI, on the other hand, will naively continue moving toward the goal under the assumption that the other player will let it pass. It incurs a small step cost for its overly optimistic outlook.

CE-VI finds an asymmetric solution in which one player is assigned to take the center and the other one uses the side passage, through the semi-passable wall. This policy is a Nash equilibrium in that neither player can improve its reward unilaterally.

FolkEgal finds the solution halfway (0.5) along the edge between vertices (83.14, 84.05) and (84.05, 83.14). These points correspond to strategies in which one player takes the center and continues beside the goal, waiting for the other player to catch up. The two players reach the goal at the same time. The first player to go through the center incurs a slightly higher cost because it must step around the goal before waiting, hence the asymmetric (but close) values. The weight of .5 means that each strategy is played with equal frequency (strict alternation, say) in the equilibrium.

Note that both players score slightly worse than the dominant player found in the CE-VI solution, due to the cost of coordination. However, both the value obtained by the minimum player and the total reward for the two players are better for the FolkEgal algorithm than for CE-VI.

5.3 Prisoner's Dilemma

    algorithm     agent 1   agent 2
    security-VI    46.5      46.5
    friend-VI      46.0      46.0
    CE-VI          46.5      46.5
    FolkEgal       88.8      88.8

This game was designed to mimic the Prisoner's dilemma. The main choice faced by each of the two agents is whether to move toward the shared goal location in the center or whether to attempt to reach the goal location further out on the side. If both players move toward the center, each has a 50–50 chance of making it to the goal in two steps. If both players move toward the sides, each has a 100% chance of reaching the goal in 3 steps. Clearly, moving to the side is better. However, whichever decision its opponent makes (side or center), the player scores higher by moving to the center.

The results closely match what happens when bimatrix-game versions of the algorithms are applied to the Prisoner's dilemma. Security-VI and CE-VI find strategies where both players move to the center (defect). This strategy profile is a Nash equilibrium. Friend-VI is similar, although (as above) once a player is unable to take the center, it continues to try to do so, assuming the other player will voluntarily get out of the way. Again, this behavior results in additional unnecessary step costs and is not an equilibrium.

We had expected FolkEgal to find a solution where both players move to their side goals, reserving the center square as a threat (much like tit-for-tat). In fact, FolkEgal found a slightly better scheme: one agent gets the closer common goal (saving step costs) but waits for the other to get to its private goal before entering. FolkEgal finds points (89.3, 88.3) and (88.3, 89.3), and the players alternate.

5.4 Compromise

    algorithm     agent 1   agent 2
    security-VI     0.0       0.0
    friend-VI     −20.0     −20.0
    CE-VI          68.2      70.1
    FolkEgal       78.7      78.7

This game is much like the coordination game, but with the twist that it is not possible for a player to reach its goal without the other player stepping aside.

Security-VI adopts the worst-case assumption that the other player will not step aside. Both players end up staying still, as a result, to avoid step costs. Friend-VI is actually even worse. Since both players assume the other will step aside, the two players simply ram each other indefinitely.

CE-VI converges to a very interesting strategy. Player A steps into Player B's goal and waits. Player A is blocking Player B from scoring, but it is also allowing Player B to pass. Player B walks to the upper left corner and Player A moves back to its initial position. Note that both players are now 3 steps from their respective goals. At this point, both players move to their goals and arrive simultaneously. One of the more interesting aspects of this strategy profile is that Player B waits in the corner until Player A has stepped out of the goal. The reason is that Player A will not attempt to reach its own goal until it is sure it won't be beaten by Player B. Player B, by keeping a respectful distance, signals to Player A that it is safe to move and both players benefit. Since both players are also choosing actions in their own best interest, the resulting strategy profile is an equilibrium.

This strategy profile, while ingenious (and unexpected to us), does not maximize the value of the minimum player. FolkEgal's solution is to alternate between L = (79.6, 77.7) and R = (77.7, 79.6). The strategy profile, in this case, corresponds to one player moving to the space between the goals, the other moving in front of its goal and waiting, then both players reaching their goals together in 5 steps.

5.5 Asymmetric

    algorithm     agent 1   agent 2
    security-VI     0.0       0.0
    friend-VI    −200.0    −200.0
    CE-VI          32.1      42.1
    FolkEgal       37.2      37.2

This game was designed to show how the algorithms react to an asymmetric starting position. Once again, overly optimistic (friend-VI) and pessimistic (security-VI) assumptions result in very low scores for both players.
Note that the players again need to compromise. Without Player B's cooperation, Player A cannot reach its near goal on the right. It also cannot reach its far goal on the left, because Player B can trail behind it and reach its goal before Player A arrives.

CE-VI discovers that Player B can "offer" Player A a compromise by hanging back exactly one square when Player A moves to the left. As a result, both players reach their goal locations on the left simultaneously. The solution is a Nash equilibrium, although it is not the egalitarian solution.

FolkEgal finds the egalitarian solution as a weighted combination of the points L = (32.13, 42.13) and R = (85, −10) with weight approximately .1. Note that point L corresponds to the solution found by CE-VI. Point R corresponds to the strategy where Player B moves to the right and lets Player A reach the near goal location.

5.6 Summary

There are a few interesting generalizations to make, based on these results. First, although CE-VI is not guaranteed to find a Nash equilibrium, it did so in all 5 games (including games that were designed specifically to thwart it). We were surprised at the robustness of the algorithm.

CE-VI only found the egalitarian solution in one game, however, whereas FolkEgal found it every time. FolkEgal also has guaranteed polynomial runtime bounds, whereas CE-VI is known to fail to converge in some games (Zinkevich et al., 2005). Friend-VI performed uniformly badly, and security-VI, or foe-VI (Littman, 2001), often returned a Nash equilibrium, but one with worse overall performance.

6 Future Work

This paper shows how an egalitarian Nash equilibrium solution can be found efficiently for repeated stochastic games. Future work will attempt to generalize these techniques to repeated games on trees (Littman et al., 2006) or DAGs and perhaps even repeated partial information games (Koller et al., 1996).

References

Bellman, R. (1957). Dynamic programming. Princeton, NJ: Princeton University Press.

Bowling, M., & Veloso, M. (2001). Rational and convergent learning in stochastic games. Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI-01) (pp. 1021–1026). San Francisco, CA: Morgan Kaufmann Publishers, Inc.

Greenwald, A., & Hall, K. (2003). Correlated-Q learning. Proceedings of the Twentieth International Conference on Machine Learning (pp. 242–249).

Hu, J., & Wellman, M. P. (2003). Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research, 4, 1039–1069.

Koller, D., Megiddo, N., & von Stengel, B. (1996). Efficient computation of equilibria for extensive two-person games. Games and Economic Behavior, 14, 247–259.

Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. Proceedings of the Eleventh International Conference on Machine Learning (pp. 157–163).

Littman, M. L. (2001). Friend-or-foe Q-learning in general-sum games. Proceedings of the Eighteenth International Conference on Machine Learning (pp. 322–328). Morgan Kaufmann.

Littman, M. L., Ravi, N., Talwar, A., & Zinkevich, M. (2006). An efficient optimal-equilibrium algorithm for two-player game trees. Twenty-Second Conference on Uncertainty in Artificial Intelligence (UAI-06).

Littman, M. L., & Stone, P. (2005). A polynomial-time Nash equilibrium algorithm for repeated games. Decision Support Systems, 39, 55–66.

Nash, J. F. (1950). The bargaining problem. Econometrica, 18, 155–162.

Osborne, M. J., & Rubinstein, A. (1994). A course in game theory. The MIT Press.

Puterman, M. L. (1994). Markov decision processes: Discrete stochastic dynamic programming. New York, NY: John Wiley & Sons, Inc.

Shapley, L. (1953). Stochastic games. Proceedings of the National Academy of Sciences of the United States of America, 39, 1095–1100.

von Neumann, J., & Morgenstern, O. (1947). Theory of games and economic behavior. Princeton, NJ: Princeton University Press.

Zinkevich, M., Greenwald, A. R., & Littman, M. L. (2005). Cyclic equilibria in Markov games. Advances in Neural Information Processing Systems 18.