
The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Solving Large Extensive-Form Games with Constraints

Trevor Davis,1 Kevin Waugh,2 Michael Bowling2,1
1Department of Computing Science, University of Alberta   2DeepMind
[email protected], {waughk, bowlingm}@google.com

Abstract

Extensive-form games are a common model for multiagent interactions with imperfect information. In two-player zero-sum games, the typical solution concept is a Nash equilibrium over the unconstrained strategy set for each player. In many situations, however, we would like to constrain the set of possible strategies. For example, constraints are a natural way to model limited resources, risk mitigation, safety, consistency with past observations of behavior, or other secondary objectives for an agent. In small games, optimal strategies under linear constraints can be found by solving a linear program; however, state-of-the-art algorithms for solving large games cannot handle general constraints. In this work we introduce a generalized form of Counterfactual Regret Minimization that provably finds optimal strategies under any feasible set of convex constraints. We demonstrate the effectiveness of our algorithm for finding strategies that mitigate risk in security games, and for opponent modeling in poker games when given only partial observations of private information.

1 Introduction

Multiagent interactions are often modeled using extensive-form games (EFGs), a powerful framework that incorporates sequential actions, hidden information, and stochastic events. Recent research has focused on computing approximately optimal strategies in large extensive-form games, resulting in a solution to heads-up limit Texas Hold'em, a game with approximately 10^17 states (Bowling et al. 2015), and in two independent super-human computer agents for the much larger heads-up no-limit Texas Hold'em (Moravčík et al. 2017; Brown and Sandholm 2018).

When modeling an interaction with an EFG, for each outcome we must specify the agents' utility, a cardinal measure of the outcome's desirability. Utility is particularly difficult to specify. Take, for example, situations where an agent has multiple objectives to balance: a defender in a security game with the primary objective of protecting a target and a secondary objective of minimizing expected cost, or a robot operating in a dangerous environment with a primary task to complete and a secondary objective of minimizing damage to itself and others. How these objectives combine into a single value, the agent's utility, is ill-specified and error prone.

One approach for handling multiple objectives is to use a linear combination of per-objective utilities. This approach has been used in EFGs to "tilt" poker agents toward taking specific actions (Johanson et al. 2011), and to mix between cost minimization and risk mitigation in sequential security games (Lisý, Davis, and Bowling 2016). However, objectives are typically measured on incommensurable scales. This leads to dubious combinations of weights, often selected by trial-and-error.

A second approach is to constrain the agents' strategy spaces directly. For example, rather than minimizing the expected cost, we use a hard constraint that disqualifies high-cost strategies. Using such constraints has been extensively studied in single-agent settings (Altman 1999) and partial information settings (Isom, Meyn, and Braatz 2008; Santana, Thiébaux, and Williams 2016), as well as in (non-sequential) security games (Brown et al. 2014).

Incorporating strategy constraints when solving EFGs presents a unique challenge. Nash equilibria can be found by solving a linear program (LP) derived using the sequence-form representation (Koller, Megiddo, and von Stengel 1996). This LP is easily modified to incorporate linear strategy constraints; however, LPs do not scale to large games. Specialized algorithms for efficiently solving large games, such as an instantiation of Nesterov's excessive gap technique (EGT) (Hoda et al. 2010) as well as counterfactual regret minimization (CFR) (Zinkevich et al. 2008) and its variants (Lanctot et al. 2009; Tammelin et al. 2015), cannot integrate arbitrary strategy constraints directly. Currently, the only large-scale approach is restricted to constraints that consider only individual decisions (Farina, Kroer, and Sandholm 2017).

In this work we present the first scalable algorithm for solving EFGs with arbitrary convex strategy constraints. Our algorithm, Constrained CFR, provably converges towards a strategy profile that is optimal under the given constraints. It does this while retaining the O(1/√T) convergence rate of CFR and requiring additional memory proportional to the number of constraints. We demonstrate the empirical effectiveness of Constrained CFR by comparing its solution to that of an LP solver in a security game. We also present a novel constraint-based technique for opponent modeling with partial observations in a small poker game.

2 Background

Formally, an extensive-form game (Osborne and Rubinstein 1994) is defined by:

• A set of players N. This work focuses on games with two players, so N = {1, 2}.

• A set of histories H, the tree's nodes rooted at ∅. The leaves, Z ⊆ H, are terminal histories. For any history h ∈ H, we let h' ⊏ h denote a prefix h' of h, and necessarily h' ∈ H.

• For each h ∈ H \ Z, a set of actions A(h). For any a ∈ A(h), ha ∈ H is a child of h.

• A player function P : H \ Z → N ∪ {c} defining the player to act at h. If P(h) = c then chance acts according to a known probability distribution σ_c(h) ∈ ∆_|A(h)|, where ∆_|A(h)| is the probability simplex of dimension |A(h)|.

• A set of utility functions u_i : Z → R, for each player. Outcome z has utility u_i(z) for player i. We assume the game is zero-sum, i.e., u_1(z) = −u_2(z). Let u(z) = u_1(z).

• For each player i ∈ N, a collection of information sets I_i. I_i partitions H_i, the histories where i acts. Two histories h, h' in an information set I ∈ I_i are indistinguishable to i. Necessarily A(h) = A(h'), which we denote by A(I). When a player acts they do not observe the history, only the information set it belongs to, which we denote as I[h].

We assume a further requirement on the information sets I_i called perfect recall. It requires that players are never forced to forget information they once observed. Mathematically this means that all indistinguishable histories share the same sequence of past information sets and actions for the actor. Although this may seem like a restrictive assumption, some perfect recall-like condition is needed to guarantee that an EFG can be solved in polynomial time, and all sequential games played by humans exhibit perfect recall.
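To make these definitions concrete, the following minimal sketch shows one way the game-tree objects above (histories, actions, the player function, and information sets) might be represented in code. The class and field names are illustrative only; they are not taken from the paper or from any particular library.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

CHANCE = "c"  # the chance "player", matching P(h) = c in the text

@dataclass
class History:
    """A node h in the game tree."""
    player: Optional[str]                                          # member of N, CHANCE, or None if terminal (z in Z)
    actions: List[str] = field(default_factory=list)               # A(h)
    children: Dict[str, "History"] = field(default_factory=dict)   # a -> ha
    chance_probs: Dict[str, float] = field(default_factory=dict)   # sigma_c(h) when player == CHANCE
    utility: float = 0.0                                           # u_1(z) = -u_2(z) at terminals (zero-sum)
    infoset: Optional[str] = None                                  # key of I[h] for the acting player

@dataclass
class InfoSet:
    """An information set I in I_i: histories the actor cannot tell apart."""
    key: str
    player: str
    actions: List[str]                                             # A(I), identical for every h in I
    histories: List[History] = field(default_factory=list)

def group_infosets(root: History) -> Dict[str, InfoSet]:
    """Walk the tree and collect the information-set partition."""
    infosets: Dict[str, InfoSet] = {}
    stack = [root]
    while stack:
        h = stack.pop()
        if h.player not in (None, CHANCE) and h.infoset is not None:
            I = infosets.setdefault(h.infoset, InfoSet(h.infoset, h.player, h.actions))
            I.histories.append(h)
        stack.extend(h.children.values())
    return infosets
```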
2.1 Strategies

A behavioral strategy for player i maps each information set I ∈ I_i to a distribution over actions, σ_i(I) ∈ ∆_|A(I)|. The probability assigned to a ∈ A(I) is σ_i(I, a). A strategy profile, σ = {σ_1, σ_2}, specifies a strategy for each player. We label the strategy of the opponent of player i as σ_{−i}. The sets of behavioral strategies and strategy profiles are Σ_i and Σ respectively.

A strategy profile uniquely defines a reach probability for any history h ∈ H:

    π^σ(h) := ∏_{h'a ⊑ h} σ_{P(h')}(I[h'], a)    (1)

This product decomposes into contributions from each player and chance, π_1^{σ_1}(h) π_2^{σ_2}(h) π_c(h). For a player i ∈ N, we denote the contributions from the opponent and chance as π_{−i}^{σ_{−i}}(h), so that π^σ(h) = π_i^{σ_i}(h) π_{−i}^{σ_{−i}}(h). By perfect recall we have π_i^{σ_i}(h) = π_i^{σ_i}(h') for any h, h' in the same information set I ∈ I_i. We thus also write this probability as π_i^{σ_i}(I).

Given a strategy profile σ = {σ_1, σ_2}, the expected utility for player i is given by

    u_i(σ) = u_i(σ_1, σ_2) := ∑_{z∈Z} π^σ(z) u_i(z).    (2)

A strategy σ_i is an ε-best response to the opponent's strategy σ_{−i} if u_i(σ_i, σ_{−i}) + ε ≥ u_i(σ'_i, σ_{−i}) for any alternative strategy σ'_i ∈ Σ_i. A strategy profile is an ε-Nash equilibrium when each σ_i is an ε-best response to its opponent; such a profile exists for any ε ≥ 0. The exploitability of a strategy profile is the smallest ε = (ε_1 + ε_2)/2 such that each σ_i is an ε_i-best response. Due to the zero-sum property, the game's Nash equilibria are the saddle-points of the minimax problem

    max_{σ_1∈Σ_1} min_{σ_2∈Σ_2} u(σ_1, σ_2) = min_{σ_2∈Σ_2} max_{σ_1∈Σ_1} u(σ_1, σ_2).    (3)

A zero-sum EFG can be represented in sequence form (von Stengel 1996). The sets of sequence-form strategies for players 1 and 2 are X and Y respectively. A sequence-form strategy x ∈ X is a vector indexed by pairs I ∈ I_1, a ∈ A(I). The entry x_(I,a) is the probability of player 1 playing the sequence of actions that reaches I and then playing action a. A special entry, x_∅ = 1, represents the empty sequence. Any behavioral strategy σ_1 ∈ Σ_1 has a corresponding sequence-form strategy SEQ(σ_1) where

    SEQ(σ_1)_(I,a) := π_1^{σ_1}(I) σ_1(I, a)    ∀I ∈ I_1, a ∈ A(I)

Player i has a unique sequence to reach any history h ∈ H and, by perfect recall, any information set I ∈ I_i. Let x_h and x_I denote the corresponding entries in x. Thus, we are free to write the expected utility as u(x, y) = ∑_{z∈Z} π_c(z) x_z y_z u(z). This is bilinear, i.e., there exists a payoff matrix A such that u(x, y) = x^⊤Ay. A consequence of perfect recall and the laws of probability is, for I ∈ I_1, that x_I = ∑_{a∈A(I)} x_(I,a) and that x ≥ 0. These constraints are linear and completely describe the polytope of sequence-form strategies. Using these together, (3) can be expressed as a bilinear saddle point problem over the polytopes X and Y:

    max_{x∈X} min_{y∈Y} x^⊤Ay = min_{y∈Y} max_{x∈X} x^⊤Ay    (4)

For a convex function f : X → R, let ∇f(x) be any element of the subdifferential ∂f(x), and let ∇_(I,a) f(x) be the (I, a) element of this subgradient.

2.2 Counterfactual regret minimization

Counterfactual regret minimization (Zinkevich et al. 2008) is a large-scale equilibrium-finding algorithm that, in self-play, iteratively updates a strategy profile in a fashion that drives its counterfactual regret to zero. This regret is defined in terms of counterfactual values. The counterfactual value of reaching information set I is the expected payoff under the counterfactual that the acting player attempts to reach it:

    v(I, σ) = ∑_{h∈I} π_{−i}^{σ_{−i}}(h) ∑_{z∈Z} π^σ(h, z) u(z)    (5)

Here i = P(h) for any h ∈ I, and π^σ(h, z) is the probability of reaching z from h under σ. Let σ_{I→a} be the profile that plays a at I and otherwise plays according to σ. For a series of profiles σ^1, ..., σ^T, the average counterfactual regret of action a at I is

    R^T(I, a) = (1/T) ∑_{t=1}^T [v(I, σ^t_{I→a}) − v(I, σ^t)].
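As a concrete illustration of the regret-matching rule described next, the sketch below accumulates per-action counterfactual regrets at a single information set and derives both the next strategy and the running average strategy from them. It is a minimal illustration of the update formulas only, under the assumption that counterfactual action values are supplied from outside; the names (`RegretMatcher`, `observe`, and so on) are placeholders, not the paper's implementation.

```python
class RegretMatcher:
    """Regret matching at one information set I (Hart and Mas-Colell 2000),
    as used inside CFR: play each action in proportion to its positive regret."""

    def __init__(self, actions):
        self.actions = list(actions)
        self.cum_regret = {a: 0.0 for a in self.actions}     # T * R^T(I, a)
        self.cum_strategy = {a: 0.0 for a in self.actions}   # weighted sum used for the average strategy

    def strategy(self):
        """sigma^{t+1}(I, a) proportional to (R^t(I, a))^+; uniform if no positive regret."""
        pos = {a: max(r, 0.0) for a, r in self.cum_regret.items()}
        norm = sum(pos.values())
        if norm <= 0.0:
            return {a: 1.0 / len(self.actions) for a in self.actions}
        return {a: p / norm for a, p in pos.items()}

    def observe(self, action_values, reach_weight):
        """Update regrets from counterfactual action values v(I, sigma_{I->a}),
        and accumulate the average strategy weighted by the player's reach probability."""
        sigma = self.strategy()
        v_I = sum(sigma[a] * action_values[a] for a in self.actions)
        for a in self.actions:
            self.cum_regret[a] += action_values[a] - v_I
            self.cum_strategy[a] += reach_weight * sigma[a]

    def average_strategy(self):
        """The average profile, which is what converges toward equilibrium."""
        norm = sum(self.cum_strategy.values())
        if norm <= 0.0:
            return {a: 1.0 / len(self.actions) for a in self.actions}
        return {a: s / norm for a, s in self.cum_strategy.items()}
```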

To minimize counterfactual regret, CFR employs regret matching (Hart and Mas-Colell 2000). In particular, actions are chosen in proportion to positive regret, σ^{t+1}(I, a) ∝ (R^t(I, a))^+ where (x)^+ = max(x, 0). It follows that the average strategy profile σ̄^T, defined by σ̄^T(I, a) ∝ ∑_{t=1}^T π_i^{σ^t}(I) σ_i^t(I, a), is an O(1/√T)-Nash equilibrium (Zinkevich et al. 2008). In sequence form, the average is given by x̄^T = (1/T) ∑_{t=1}^T x^t.

3 Solving games with strategy constraints

We begin by formally introducing the constrained optimization problem for extensive-form games. We specify convex constraints on the set of sequence-form strategies X (without loss of generality, we assume throughout this paper that the constrained player is player 1, i.e. the maximizing player) with a set of k convex functions f_i : X → R, where we require f_i(x) ≤ 0 for each i = 1, ..., k. We use constraints on the sequence form instead of on the behavioral strategies because reach probabilities and utilities are linear functions of a sequence-form strategy, but not of a behavioral strategy. The optimization problem can be stated as:

    max_{x∈X} min_{y∈Y} x^⊤Ay    subject to f_i(x) ≤ 0 for i = 1, ..., k    (6)

The first step toward solving this problem is to incorporate the constraints into the objective with Lagrange multipliers. If the problem is feasible (i.e., there exists a feasible x_f ∈ X such that f_i(x_f) ≤ 0 for each i), then (6) is equivalent to:

    max_{x∈X} min_{y∈Y, λ≥0} x^⊤Ay − ∑_{i=1}^k λ_i f_i(x)    (7)

We will now present intuition as to how CFR can be modified to solve (7), before presenting the algorithm and proving its convergence.

3.1 Intuition

CFR can be seen as doing a saddle point optimization on the objective in (3), using the gradients of g(x, y) = x^⊤Ay given as (for a more complete discussion of the connection between CFR and gradient ascent, see Waugh and Bagnell 2015)

    ∇_x g(x^t, y^t) = Ay^t        ∇_y −g(x^t, y^t) = −(x^t)^⊤A.    (8)

The intuition behind our modified algorithm is to perform the same updates, but with gradients of the modified utility function

    h(x, y, λ) = x^⊤Ay − ∑_{i=1}^k λ_i f_i(x).    (9)

The (sub)gradients we use in the modified CFR update are then

    ∇_x h(x^t, y^t, λ^t) = Ay^t − ∑_{i=1}^k λ_i^t ∇f_i(x^t)
    ∇_y −h(x^t, y^t, λ^t) = −(x^t)^⊤A.    (10)

Note that this leaves the update of the unconstrained player unchanged. In addition, we must update λ^t using the gradients ∇_λ −h(x^t, y^t, λ^t) = ∑_{i=1}^k f_i(x^t) e_i, which is the k-vector with f_i(x^t) at index i. This can be done with any gradient method, e.g. simple gradient ascent with the update rule

    λ_i^{t+1} = max(λ_i^t + α^t f_i(x^t), 0)    (11)

for some step size α^t ∝ 1/√t.

3.2 Constrained counterfactual regret minimization

We give the Constrained CFR (CCFR) procedure in Algorithm 1. The constrained player's strategy is updated with the function CCFR and the unconstrained player's strategy is updated with unmodified CFR. In this instantiation λ^t is updated with gradient ascent, though any regret minimizing update can be used. We clamp each λ_i^t to the interval [0, β] for reasons discussed in the following section. Together, these updates form a full iteration of CCFR.

Algorithm 1 Constrained CFR
 1: function CCFR(σ_i^t, σ_{−i}^t, λ^t)
 2:   for I ∈ I_i do                            ▷ in reverse topological order
 3:     for a ∈ A(I) do
 4:       v^t(I, a) ← ∑_{z∈Z_1[Ia]} π_{−i}^{σ_{−i}^t}(z) u(z) + ∑_{I'∈succ(I,a)} ṽ^t(I')
 5:       ṽ^t(I, a) ← v^t(I, a) − ∑_{i=1}^k λ_i^t ∇_(I,a) f_i(SEQ(σ_i^t))
 6:     end for
 7:     ṽ^t(I) ← ∑_{a∈A(I)} σ_i^t(I, a) ṽ^t(I, a)
 8:     for a ∈ A(I) do
 9:       r̃^t(I, a) ← ṽ^t(I, a) − ṽ^t(I)
10:       R̃^t(I, a) ← R̃^{t−1}(I, a) + r̃^t(I, a)
11:     end for
12:     for a ∈ A(I) do
13:       σ_i^{t+1}(I, a) ← (R̃^t(I, a))^+ / ∑_{b∈A(I)} (R̃^t(I, b))^+
14:     end for
15:   end for
16:   return σ_i^{t+1}
17: end function
18: for t = 1, ..., T do
19:   σ_2^t ← CFR(σ_1^{t−1}, σ_2^{t−1})
20:   for i = 1, ..., k do
21:     λ_i^t ← λ_i^{t−1} + α^t f_i(SEQ(σ_1^{t−1}))
22:     λ_i^t ← CLAMP(λ_i^t, [0, β])
23:   end for
24:   σ_1^t ← CCFR(σ_1^{t−1}, σ_2^t, λ^t)
25:   σ̄_2^t ← ((t−1)/t) σ̄_2^{t−1} + (1/t) σ_2^t
26:   σ̄_1^t ← ((t−1)/t) σ̄_1^{t−1} + (1/t) σ_1^t
27: end for

The CCFR update for the constrained player is the same as the CFR update, with the crucial difference of line 5, which incorporates the second part of the gradient ∇_x h into the counterfactual value v^t(I, a).
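The sketch below illustrates the two modifications that distinguish Algorithm 1 from plain CFR: the penalty applied to the counterfactual values on line 5, and the clamped gradient-ascent step on λ in the main loop (lines 20–23). It assumes each constraint object exposes its value and a per-sequence subgradient; the function and variable names are placeholders chosen for this sketch, not the authors' implementation.

```python
def penalized_action_values(action_values, lam, constraints, seq_strategy):
    """Algorithm 1, line 5: subtract sum_i lambda_i * grad_{(I,a)} f_i(SEQ(sigma))
    from each counterfactual action value before regret matching."""
    v_tilde = {}
    for (I, a), v in action_values.items():
        penalty = sum(
            lam[i] * constraints[i].subgradient(seq_strategy).get((I, a), 0.0)
            for i in range(len(constraints))
        )
        v_tilde[(I, a)] = v - penalty
    return v_tilde

def update_multipliers(lam, constraints, seq_strategy, step_size, beta):
    """Algorithm 1, lines 20-23: gradient ascent on each lambda_i in the direction
    of f_i(SEQ(sigma_1)), then clamp to the interval [0, beta]."""
    new_lam = []
    for i, f in enumerate(constraints):
        value = lam[i] + step_size * f.value(seq_strategy)   # ascent on the constraint violation
        new_lam.append(min(max(value, 0.0), beta))           # CLAMP to [0, beta]
    return new_lam
```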

The loop beginning on line 2 goes through the constrained player's information sets, walking the tree bottom-up from the leaves. The counterfactual value v^t(I, a) is set on line 4 using the values of terminal states Z_1[Ia] which directly follow from action a at I (this corresponds to the Ay^t term of the gradient), as well as the already computed values of successor information sets succ(I, a). Line 7 computes the value of the current information set using the current strategy. Lines 9 and 10 update the stored regrets for each action. Line 13 updates the current strategy with regret matching.

3.3 Theoretical analysis

In order to ensure that the utilities passed to the regret matching update are bounded, we will require λ^t to be bounded from above; in particular, we will choose λ^t ∈ [0, β]^k. We can then evaluate the chosen sequence λ^1, ..., λ^T using its regret in comparison to the optimal λ^* ∈ [0, β]^k:

    R_λ^T(β) := max_{λ^*∈[0,β]^k} (1/T) ∑_{t=1}^T ∑_{i=1}^k [λ_i^* f_i(SEQ(σ_1^t)) − λ_i^t f_i(SEQ(σ_1^t))].    (12)

We can guarantee R_λ^T(β) = O(1/√T), e.g. by choosing λ^t with projected gradient ascent (Zinkevich 2003).

We now present the theorems which show that CCFR can be used to approximately solve (6). In the following theorems we assume that T ∈ N, we have some convex, continuous constraint functions f_1, ..., f_k, and we use some regret-minimizing method to select the vectors λ^1, ..., λ^T, each in [0, β]^k for some β ≥ 0.

First, we show that the exploitability of the average strategies approaches the optimal value:

Theorem 1. If CCFR is used to select the sequence of strategies σ_1^1, ..., σ_1^T and CFR is used to select the sequence of strategies σ_2^1, ..., σ_2^T, then the following holds:

    max_{σ_1^*∈Σ_1 : f_i(σ_1^*)≤0 ∀i} u(σ_1^*, σ̄_2^T) − min_{σ_2^*∈Σ_2} u(σ̄_1^T, σ_2^*) ≤ 4(∆_u + kβF) M √|A| / √T + 2 R_λ^T(β)    (13)

where ∆_u = max_z u(z) − min_z u(z) is the range of possible utilities, |A| is the maximum number of actions at any information set, k is the number of constraints, F = max_{x,i} ||∇f_i(x)||_1 is a bound on the subgradients (such a bound must exist as the strategy sets are compact and the constraint functions are continuous), and M is a game-specific constant.

All proofs are given in the supplementary materials. Theorem 1 guarantees that the constrained exploitability of the final CCFR strategy profile converges to the minimum exploitability possible over the set of feasible profiles, at a rate of O(1/√T) (assuming a suitable regret minimizer is used to select λ^t).

In order to establish that CCFR approximately solves optimization (6), we must also show that the CCFR strategies converge to being feasible. In the case of arbitrary β ≥ 0:

Theorem 2. If CCFR is used to select the sequence of strategies σ_1^1, ..., σ_1^T and CFR is used to select the sequence of strategies σ_2^1, ..., σ_2^T, then the following holds:

    f_i(SEQ(σ̄_1^T)) ≤ R_λ^T(β)/β + (∆_u + 2kβF) M √|A| / (β√T) + ∆_u/β    ∀i ∈ {1, ..., k}    (14)

This theorem guarantees that the CCFR strategy converges to the feasible set at a rate of O(1/√T), up to an approximation error of ∆_u/β induced by the bounding of λ^t.

We can eliminate the approximation error when β is chosen large enough for some optimal λ^* to lie within the bounded set [0, β]^k. In order to establish the existence of such a λ^*, we must assume a constraint qualification such as Slater's condition, which requires the existence of a feasible x which strictly satisfies any nonlinear constraints (f_i(x) ≤ 0 for all i and f_j(x) < 0 for all nonlinear f_j). Then there exists a finite λ^* which is a solution to optimization (7), which we can use to give the bound:

Theorem 3. Assume that f_1, ..., f_k satisfy a constraint qualification such as Slater's condition, and define λ^* to be a finite solution for λ in the resulting optimization (7). Then if β is chosen such that β > λ_i^* for all i, and CCFR and CFR are used to respectively select the strategy sequences σ_1^1, ..., σ_1^T and σ_2^1, ..., σ_2^T, the following holds:

    f_i(SEQ(σ̄_1^T)) ≤ R_λ^T(β)/(β − λ_i^*) + 2(∆_u + kβF) M √|A| / ((β − λ_i^*)√T)    ∀i ∈ {1, ..., k}    (15)

In this case, the CCFR strategy converges fully to the feasible set, at a rate of O(1/√T), given a suitable choice of regret minimizer for λ^t. We provide an explicit example of such a minimizer in the following corollary:

Corollary 3.1. If the conditions of Theorem 3 hold and, in addition, the sequence λ^1, ..., λ^T is chosen using projected gradient descent with constant learning rate α^t = β/(G√T) where G = max_{i,x} f_i(x), then the following hold:

    f_i(SEQ(σ̄_1^T)) ≤ (βG + 2(∆_u + kβF) M √|A|) / ((β − λ_i^*)√T)    ∀i ∈ {1, ..., k}    (16)

    max_{σ_1^*∈Σ_1 : f_i(σ_1^*)≤0 ∀i} u(σ_1^*, σ̄_2^T) − min_{σ_2^*∈Σ_2} u(σ̄_1^T, σ_2^*) ≤ (4(∆_u + kβF) M √|A| + 2βG) / √T    (17)

Proof. This follows from using the projected gradient descent regret bound (Zinkevich 2003) to give R_λ^T(β) ≤ β²/(2αT) + (G²/(2T)) ∑_{t=1}^T α^t ≤ βG/√T.
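The following sketch spells out the multiplier update suggested by Corollary 3.1: projected gradient steps on λ with the constant learning rate α = β/(G√T), projected back onto the box [0, β]. The helper names and the way constraint values are supplied are assumptions of this sketch rather than the paper's implementation.

```python
import math

def corollary_step_size(beta, G, T):
    """Constant learning rate alpha = beta / (G * sqrt(T)) from Corollary 3.1,
    where G bounds max_{i,x} f_i(x) and T is the planned number of iterations."""
    return beta / (G * math.sqrt(T))

def projected_lambda_update(lam, constraint_values, alpha, beta):
    """One projected gradient step: move each lambda_i in the direction of
    f_i(SEQ(sigma_1^t)) and project back onto [0, beta]."""
    return [min(max(l + alpha * f, 0.0), beta) for l, f in zip(lam, constraint_values)]

# Example: k = 2 constraints, one currently violated (f > 0), one satisfied (f < 0).
alpha = corollary_step_size(beta=10.0, G=1.0, T=100_000)
lam = [0.0, 0.0]
lam = projected_lambda_update(lam, constraint_values=[0.3, -0.1], alpha=alpha, beta=10.0)
```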

Finally, we discuss how to choose β. When there is a minimum acceptable constraint violation, β can be selected with Theorem 2 to guarantee that the violation is no more than the specified value, either asymptotically or after a specified number of iterations T. When no amount of constraint violation is acceptable, β should be chosen such that β ≥ λ_i^* by Theorem 3. If λ^* is unknown, CCFR can be run with an arbitrary β for a number of iterations. If the average (1/T) ∑_{t=1}^T λ_i^t is close to β, then β ≤ λ_i^*, so β is doubled and CCFR run again. Otherwise, it is guaranteed that β > λ_i^* and CCFR will converge to a solution with no constraint violation.
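A minimal driver implementing the doubling scheme just described might look as follows. Here `run_ccfr` and its `average_multipliers` field stand in for a CCFR implementation and its bookkeeping; they, the closeness tolerance, and the starting value of β are assumptions of this sketch.

```python
def choose_beta_by_doubling(run_ccfr, beta0=1.0, iterations=100_000, closeness=0.95):
    """Run CCFR with an initial beta; if the average multiplier for any constraint
    presses against the bound (suggesting beta <= lambda_i^*), double beta and rerun."""
    beta = beta0
    while True:
        result = run_ccfr(beta=beta, iterations=iterations)
        avg_lambda = result.average_multipliers           # (1/T) * sum_t lambda_i^t, per constraint
        if all(avg < closeness * beta for avg in avg_lambda):
            return beta, result                           # beta > lambda_i^* is now guaranteed
        beta *= 2.0                                       # bound was likely too tight; double and retry
```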
4 Related Work

To the best of our knowledge, no previous work has proposed a technique for solving either of the optimizations (6) or (7) for general constraints in extensive-form games. Optimization (7) belongs to a general class of saddle point optimizations for which a number of accelerated methods with O(1/T) convergence have been proposed (Nemirovski 2004; Nesterov 2005b; 2005a; Juditsky, Nemirovski, and Tauvel 2011; Chambolle and Pock 2011). These methods have been applied to unconstrained equilibrium computation in extensive-form games using a family of prox functions initially proposed by Hoda et al. (Hoda et al. 2010; Kroer et al. 2015; 2017). Like CFR, these algorithms could be extended to solve the optimization (7).

Despite a worse theoretical dependence on T, CFR is preferred to accelerated methods as our base algorithm for a number of practical reasons.

• CFR can be easily modified with a number of different sampling schemes, adapting to sparsity and achieving greatly improved convergence over the deterministic version (Lanctot et al. 2009). Although the stochastic mirror prox algorithm has been used to combine an accelerated update with sampling in extensive-form games, each of its iterations still requires walking each player's full strategy space to compute the prox functions, and it has poor performance in practice (Kroer et al. 2015).

• CFR has good empirical performance in imperfect recall games (Waugh et al. 2009b) and even provably converges to an equilibrium in certain subclasses of well-formed games (Lanctot et al. 2012; Lisý, Davis, and Bowling 2016), which we will make use of in Section 5.1. The prox function used by the accelerated methods is ill-defined in all imperfect recall games.

• CFR theoretically scales better with game size than do the accelerated techniques. The constant M in the bounds of Theorems 1-3 is at worst |I|, and for many games of interest is closer to |I|^{1/2} (Burch 2017, Section 3.2). The best convergence bound for an accelerated method depends in the worst case on |I|2^{2d} where d is the depth of the game tree, and is at best |I|2^d (Kroer et al. 2017).

• The CFR update can be modified to CFR+ to give a guaranteed bound on tracking regret and greatly improve empirical performance (Tammelin et al. 2015). CFR+ has been shown to converge with initial rate faster than O(1/T) in a variety of games (Burch 2017, Sections 4.3-4.4).

• Finally, CFR is not inherently limited to O(1/√T) worst-case convergence. Regret minimization algorithms can be optimistically modified to give O(1/T) convergence in self-play (Rakhlin and Sridharan 2013). Such a modification has been applied to CFR (Burch 2017, Section 4.4).

We describe CCFR as an extension of deterministic CFR for ease of exposition. All of the CFR modifications described in this section can be applied to CCFR out-of-the-box.

5 Experimental evaluation

We present two domains for experimental evaluation in this paper. In the first, we use constraints to model a secondary objective when generating strategies in a model security game. In the second domain, we use constraints for opponent modeling in a small poker game. We demonstrate that using constraints for modeling data allows us to learn counter-strategies that approach optimal counter-strategies as the amount of data increases. Unlike previous opponent modeling techniques for poker, we do not require our data to contain full observations of the opponent's private cards for this guarantee to hold.

5.1 Transit game

The transit game is a model security game introduced in (Bosansky et al. 2015). With size parameter w, the game is played on an 8-connected grid of size 2w × w (see Figure 1) over d = 2w + 4 time steps. One player, the evader, wishes to cross the grid from left to right while avoiding the other player, the patroller. Actions are movements along the edges of the grid, but each move has a probability 0.1 of failing. The evader receives −1 utils each time he encounters the patroller, 1 util when he escapes on reaching the east end of the grid, and −0.02 utils for each time step that passes without escaping. The patroller receives the negative of the evader's utils, making the game zero-sum. The players observe only their own actions and locations.

[Figure 1 appears here: the 2w × w transit game grid, with the patroller's base s0 and the cell se marked.]
Figure 1: Transit Game

The patroller has a secondary objective of minimizing the risk that it fails to return to its base (s0 in Figure 1) by the end of the game. In the original formulation, this was modeled using a large utility penalty when the patroller doesn't end the game at its base. For the reasons discussed in the introduction, it is more natural to model this objective as a linear constraint on the patroller's strategy, bounding the maximum probability that it doesn't return to base.

For our experiments, we implemented CCFR on top of the NFGSS-CFR algorithm described in (Lisý, Davis, and Bowling 2016). In the NFGSS framework, each information set is defined by only the current grid state and the time step; history is not remembered. This is a case of imperfect recall, but our theory still holds as the game is well-formed.

The constraint on the patroller is defined as

    ∑_{s^d, a} π^{σ_p}(s^d) σ(s^d, a) ∑_{s^{d+1} ≠ s_0} T(s^d, a, s^{d+1}) ≤ b_r

where s^d, a are state-action pairs at time step d, T(s^d, a, s^{d+1}) is the probability that s^{d+1} is the next state given that action a is taken from s^d, and b_r is the chosen risk bound. This is a well-defined linear constraint despite imperfect recall, as π^{σ_p}(s^d) is a linear combination over the sequences that reach s^d. We update the CCFR constraint weights λ using stochastic gradient ascent with constant step size α^t = 1, which we found to work well across a variety of game sizes and risk bounds. In practice, we found that bounding λ was unnecessary for convergence.

Previous work has shown that double oracle (DO) techniques outperform solving the full game linear program (LP) in the unconstrained transit game (Bosansky et al. 2015; Lisý, Davis, and Bowling 2016). However, an efficient best response oracle exists in the unconstrained setting only because a best response is guaranteed to exist in the space of pure strategies, which can be efficiently searched. Conversely, constrained best responses might exist only in the space of mixed strategies, meaning that the best response computation requires solving an LP of comparable size to the LP for the full game Nash equilibrium. This makes DO methods inappropriate for the general constrained setting, so we omit comparison to DO methods in this work.

Results  We first empirically demonstrate that CCFR converges to optimality by comparing its produced strategies with strategies produced by solving the LP representation of the game with the simplex solver in IBM ILOG CPLEX 12.7.1. Figure 2a shows the risk and exploitability for strategies produced by running CCFR for 100,000 iterations on a game of size w = 8, with a variety of values for the risk bound b_r. In each case, the computed strategy had risk within 0.001 of the specified bound b_r, and exploitability within 0.001 of the corresponding LP strategy (not shown because the points are indistinguishable). The convergence over time for one particular case, b_r = 0.1, is shown in Figure 2b, where the plotted value is the difference in exploitability between the average CCFR strategy and the LP strategy, shown with a log-linear scale. The vertical line shows the time used to compute the LP strategy.

Convergence times for the CPLEX LP and CCFR with risk bound b_r = 0.1 are shown on a log scale for a variety of game sizes w in Figure 2c. The time for CCFR is presented for a variety of precisions ε, which bounds both the optimality of the final exploitability and the violation of the risk bound. The points for game size w = 8 are also shown in Figure 2b. The LP time is calculated with default precision ε = 10^{-6}. Changing the precision to a higher value actually results in a slower computation, due to the precision also controlling the size of allowed infeasibility in the Harris ratio test (Klotz and Newman 2013).

Even at w = 6, a game which has relatively small strategy sizes of ∼6,000 values, CCFR can give a significant speedup for a small tradeoff in precision. At w = 8 and larger, the LP is clearly slower than CCFR even for the relatively modest precision of ε = 0.001. For game size w = 10, with strategy sizes of ∼25,000 values, the LP is more than an order of magnitude slower than high precision CCFR.

5.2 Opponent modeling in poker

In multi-agent settings, strategy constraints can serve an additional purpose beyond encoding secondary objectives. Often, when creating a strategy for one agent, we have partial information on how the other agent(s) behave. A way to make use of this information is to solve the game with constraints on the other agents' strategies, enforcing that their strategy in the solution is consistent with their observed behavior. As a motivating example, we consider poker games in which we always observe our opponent's actions, but not necessarily the private card(s) that they hold when making the action.

In poker games, if either player takes the fold action, the other player automatically wins the game. Because the players' private cards are irrelevant to the game outcome in this case, they are typically not revealed. We thus consider the problem of opponent modeling from observing past games, in which the opponent's hand of private card(s) is only revealed when a showdown is reached and the player with the better hand wins. Most previous work in opponent modeling has either assumed full observation of private cards after a fold (Johanson, Zinkevich, and Bowling 2008; Johanson and Bowling 2009) or has ignored observations of opponent actions entirely, instead only using observed utilities (Bard et al. 2013). The only previous work which uses these partial observations has no theoretical guarantees on solution quality (Ganzfried and Sandholm 2011).

We first collect data by playing against the opponent with a probe strategy, which is a uniformly random distribution over the non-fold actions. To model the opponent in an unbiased way, we generate two types of sequence-form constraints from this data. First, for each possible sequence of public actions and for each of our own private hands, we build an unbiased confidence interval on the probability that we are dealt the hand and the public sequence occurs. This probability is a weighted sum of opponent sequence probabilities over their possible private cards, and thus the confidence bounds become linear sequence-form constraints. Second, for each terminal history that is a showdown, we build a confidence interval on the probability that the showdown is reached. In combination, these two sets of constraints guarantee that the CCFR strategy converges to a best response to the opponent strategy as the number of observed games increases. A proof of convergence to a best response and full details of the constraints are provided in the supplementary materials.

Infeasible constraints  Because we construct each constraint separately, there is no guarantee that the full constraint set is simultaneously feasible. In fact, in our experiments the constraints were typically mildly infeasible. However, this is not a problem for CCFR, which doesn't require feasible constraints to have well-defined updates. In fact, because we bound the Lagrange multipliers, CCFR still theoretically converges to a sensible solution, especially when the total infeasibility is small. For more details on how CCFR handles infeasibility, see the supplementary materials.
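To illustrate how observation counts can turn into the linear sequence-form constraints described above, the sketch below builds a lower-bound and an upper-bound constraint for a single (private hand, public action sequence) pair. The paper's exact unbiased confidence interval is given in its supplementary materials; here a simple Hoeffding-style radius is used only as a stand-in, and the weight vector, indexing scheme, and function names are assumptions of this sketch.

```python
import math

def confidence_interval(successes, trials, delta=0.01):
    """Hoeffding-style interval for an observed frequency; a stand-in for the
    paper's unbiased interval, which is detailed in the supplementary material."""
    p_hat = successes / trials
    radius = math.sqrt(math.log(2.0 / delta) / (2.0 * trials))
    return max(p_hat - radius, 0.0), min(p_hat + radius, 1.0)

def make_observation_constraints(successes, trials, weights, delta=0.01):
    """The probability of 'we hold hand h and public sequence p occurs' is a fixed
    weighted sum w . x over the opponent's sequence-form entries (the weights come
    from the chance probabilities of their possible private cards). Return two
    linear constraint functions f(x) <= 0 keeping that sum inside the interval."""
    lower, upper = confidence_interval(successes, trials, delta)

    def f_lower(x):   # lower - w.x <= 0: the observed event cannot be too unlikely
        return lower - sum(w * x[seq] for seq, w in weights.items())

    def f_upper(x):   # w.x - upper <= 0: and it cannot be too likely
        return sum(w * x[seq] for seq, w in weights.items()) - upper

    return f_lower, f_upper
```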

[Figure 2 appears here: three panels — (a) Risk vs Exploitability, (b) CCFR convergence, and (c) Convergence time. Axis labels in the original include exploitability (mb/h), risk of not returning to base, exploitability over LP solution, time (ms), time (s), and game size parameter w; legends compare the LP with CCFR at precisions ε = 0.1, 0.01, 0.001.]

Figure 2: Risk and exploitability of final CCFR strategies with game size w = 8 (a), convergence of CCFR exploitability to LP exploitability with w = 8 and risk bound b_r = 0.1 (b), and dependence of convergence time on game size for LP and CCFR methods (c). All algorithms are deterministic, so times are exact.

Results  We ran our experiments in Leduc Hold'em (Southey et al. 2005), a small poker game played with a six card deck over two betting rounds. To generate a target strategy profile to model, we solved the "JQ.K/pair.nopair" abstracted version of the game (Waugh et al. 2009a). We then played a probe strategy profile against the target profile to generate constraints as described above, and ran CCFR twice to find each half of a counter-profile that is optimal against the set of constrained profiles. We used gradient ascent with step size α^t = 1000/√t to update the λ values, and ran CCFR for 10^6 iterations, which we found to be sufficient for approximate convergence with ε < 0.001.

We evaluate how well the trained counter-profile performs when played against the target profile, and in particular investigate how this performance depends on the number of games we observe to produce the counter-profile, and on the confidence γ used for the confidence interval constraints.

[Figure 3 appears here: expected value (chips/game) versus number of observed games (10^1 to 10^7, log scale) for confidence levels γ = 75%, 95%, and 99%, with the best response and equilibrium values shown as horizontal reference lines.]

Figure 3: Performance of counter-profiles trained with CCFR against a target profile in Leduc Hold'em. Results are averaged over 10 runs, with minimum and maximum values shown as error bars. Values achieved by a Nash equilibrium profile and a best response profile are shown as horizontal lines for reference.

Results are shown in Figure 3, with a log-linear scale. With a high confidence γ = 99% (looser constraints), we obtain an expected value that is better than the equilibrium expected value with fewer than 100 observed games on average, and fewer than 200 observed games consistently. Lower confidence levels (tighter constraints) resulted in more variable performance and poor average value with small numbers of observed games, but also faster learning as the number of observed games increased. For all confidence levels, the expected value converges to the best response value as the number of observed games increases.

6 Conclusion

Strategy constraints are a powerful modeling tool in extensive-form games. Prior to this work, solving games with strategy constraints required solving a linear program, which scaled poorly to many of the very large games of practical interest. We introduced CCFR, the first efficient large-scale algorithm for solving extensive-form games with general strategy constraints. We demonstrated that CCFR is effective at solving sequential security games with bounds on acceptable risk. We also introduced a method of generating strategy constraints from partial observations of poker games, resulting in the first opponent modeling technique that has theoretical guarantees with partial observations. We demonstrated the effectiveness of this technique for opponent modeling in Leduc Hold'em.

7 Acknowledgments

The transit game experiments were implemented with code made publicly available by the group of the Artificial Intelligence Center at Czech Technical University in Prague. This research was supported by Alberta Innovates, Alberta Advanced Education, and the Alberta Machine Intelligence Institute (Amii). Computing resources were provided by Compute Canada and Calcul Québec.

References

Altman, E. 1999. Constrained Markov Decision Processes. Chapman and Hall/CRC.

1867 Bard, N.; Johanson, M.; Burch, N.; and Bowling, M. 2013. Online Kroer, C.; Waugh, K.; Kilinc-Karzan, F.; and Sandholm, T. 2017. implicit agent modeling. In Proceedings of the Twelth International Theoretical and practical advances on smoothing for extensive-form Conference on Autonomous Agents and Multiagent Systems. games. In Proceedings of the Eighteenth ACM Conference on Bosansky, B.; Jiang, A. X.; Tambe, M.; and Kiekintveld, C. 2015. Economics and Computation. Combining compact representation and incremental generation in Lanctot, M.; Waugh, K.; Zinkevich, M.; and Bowling, M. 2009. large games with sequential strategies. In Proceedings of Twenty- Monte Carlo sampling for regret minimization in extensive games. Ninth AAAI Conference on Artificial Intelligence. In Advances in Neural Information Processing Systems 22. Bowling, M.; Burch, N.; Johanson, M.; and Tammelin, O. 2015. Lanctot, M.; Gibson, R.; Burch, N.; and Bowling, M. 2012. No- Heads-up limit hold’em poker is solved. Science 347(6218):145– regret learning in extensive-form games with imperfect recall. In 149. Proceedings of the Twenty-Ninth International Conference on Ma- chine Learning. Brown, N., and Sandholm, T. 2018. Superhuman AI for heads- up no-limit poker: Libratus beats top professionals. Science Lisy,´ V.; Davis, T.; and Bowling, M. 2016. Counterfactual regret 359(6374):418–424. minimization in security games. In Proceedings of Thirtieth AAAI Conference on Artificial Intelligence. Brown, M.; An, B.; Kiekintveld, C.; Ordo´nez,˜ F.; and Tambe, M. 2014. An extended study on multi-objective security games. Au- Moravcˇ´ık, M.; Schmid, M.; Burch, N.; Lisy,´ V.; Morrill, D.; Bard, tonomous Agents and Multi-Agent Systems 28(1):31–71. N.; Davis, T.; Waugh, K.; Johanson, M.; and Bowling, M. H. 2017. Deepstack: Expert-level artificial intelligence in heads-up no-limit Burch, N. 2017. Time and Space: Why Imperfect Information poker. Science 356 6337:508–513. Games are Hard. Ph.D. Dissertation, University of Alberta. Nemirovski, A. 2004. Prox-method with rate of convergence o(1/t) Chambolle, A., and Pock, T. 2011. A first-order primal-dual algo- for variational inequalities with lipschitz continuous monotone op- rithm for convex problems with applications to imaging. Journal of erators and smooth convex-concave saddle point problems. SIAM Mathematical Imaging and Vision 40(1):120–145. Journal on Optimization 15(1):229–251. Farina, G.; Kroer, C.; and Sandholm, T. 2017. Regret minimization Nesterov, Y. 2005a. Excessive gap technique in nonsmooth convex in behaviorally-constrained zero-sum games. In Proceedings of the minimization. SIAM Journal on Optimization 16(1):235–249. 34th International Conference on Machine Learning. Nesterov, Y. 2005b. Smooth minimization of non-smooth functions. Ganzfried, S., and Sandholm, T. 2011. Game theory-based opponent Mathematical Programming 103(1):127–152. modeling in large imperfect-information games. In Proceedings Osborne, M. J., and Rubinstein, A. 1994. A Course in Game Theory. of the Tenth International Conference on Autonomous Agents and The MIT Press. Multiagent Systems. Rakhlin, A., and Sridharan, K. 2013. Online learning with pre- Hart, S., and Mas-Colell, A. 2000. A simple adaptive procedure dictable sequences. In Proceedings of the 26th Annual Conference leading to . Econometrica 5(68):1127–1150. on Learning Theory. Hoda, S.; Gilpin, A.; Pena,˜ J.; and Sandholm, T. 2010. Smooth- Santana, P.; Thiebaux,´ S.; and Williams, B. 2016. 
Rao*: an al- ing techniques for computing nash equilibria of sequential games. gorithm for chance constrained pomdps. In Proceedings of the Mathematics of Operations Research 35(2):494–512. Thirtieth AAAI Conference on Artificial Intelligence. Isom, J. D.; Meyn, S. P.; and Braatz, R. D. 2008. Piecewise linear Southey, F.; Bowling, M.; Larson, B.; Piccione, C.; Burch, N.; dynamic programming for constrained POMDPs. In Proceedings of Billings, D.; and Rayner, C. 2005. Bayes’ bluff: Opponent mod- the Twenty-Third AAAI Conference on Artificial Intelligence. elling in poker. In Proceedings of the Twenty-First Conference on Johanson, M., and Bowling, M. 2009. Data biased robust counter Uncertainty in Artficial Intelligence. strategies. In Proceedings of the Twelfth International Conference Tammelin, O.; Burch, N.; Johanson, M.; and Bowling, M. 2015. on Artificial Intelligence and Statistics. Solving heads-up limit Texas Hold’em. In Proceedings of the 24th Johanson, M.; Bowling, M.; Waugh, K.; and Zinkevich, M. 2011. International Joint Conference on Artificial Intelligence. Accelerating best response calculation in large extensive games. In von Stengel, B. 1996. Efficient computation of behavior strategies. Proceedings of the Twenty-Second International Joint Conference Games and Economic Behavior 14:220–246. on Artificial Intelligence. Waugh, K., and Bagnell, J. A. 2015. A unified view of large-scale Johanson, M.; Zinkevich, M.; and Bowling, M. 2008. Comput- zero-sum equilibrium computation. In AAAI Workshop on Computer ing robust counter-strategies. In Advances in Neural Information Poker and Imperfect Information. Processing Systems 20. Waugh, K.; Schnizlein, D.; Bowling, M.; and Szafron, D. 2009a. Juditsky, A.; Nemirovski, A.; and Tauvel, C. 2011. Solving varia- Abstraction pathologies in extensive games. In Proceedings of the tional inequalities with stochastic mirror-prox algorithm. Stochastic 8th International Conference on Autonomous Agents and Multiagent Systems 1(1):17–58. Systems. Klotz, E., and Newman, A. M. 2013. Practical guidelines for Waugh, K.; Zinkevich, M.; Johanson, M.; Kan, M.; Schnizlein, D.; solving difficult linear programs. Surveys in Operations Research and Bowling, M. 2009b. A practical use of imperfect recall. In Pro- and Management Science 18(1):1–17. ceedings of the Eighth Symposium on Abstraction, Reformulation and Approximation. Koller, D.; Megiddo, N.; and von Stengel, B. 1996. Efficient Zinkevich, M.; Johanson, M.; Bowling, M.; and Piccione, C. 2008. computation of equilibria for extensive two-person games. Games Regret minimization in games with incomplete information. In and Economic Behavior 14(2):247–259. Advances in Neural Information Processing Systems 20. Kroer, C.; Waugh, K.; Kilinc¸-Karzan, F.; and Sandholm, T. 2015. Zinkevich, M. 2003. Online convex programming and general- Faster first-order methods for extensive-form game solving. In ized infinitesimal gradient ascent. In Proceedings of the Twentieth Proceedings of the Sixteenth ACM Conference on Economics and International Conference on Machine Learning. Computation.
