
Computing Quantal Stackelberg Equilibrium in Extensive-Form Games

Jakub Černý¹, Viliam Lisý², Branislav Bošanský², Bo An¹
¹ Nanyang Technological University, Singapore
² AI Center, FEE, Czech Technical University in Prague, Czech Republic
[email protected], {viliam.lisy, branislav.bosansky}@fel.cvut.cz, [email protected]

Abstract

Deployments of game-theoretic solution concepts in the real world have highlighted the necessity to consider human opponents' boundedly rational behavior. If subrationality is not addressed, the system can face significant losses in terms of expected utility. While there exist algorithms for computing optimal strategies to commit to when facing subrational decision-makers in one-shot interactions, these algorithms cannot be generalized for solving sequential scenarios because of the inherent curse of strategy-space dimensionality in sequential games and because humans act subrationally in each decision point separately. We study optimal strategies to commit to against subrational opponents in sequential games for the first time and make the following key contributions: (1) we prove the problem is NP-hard in general; (2) to enable further analysis, we introduce a non-fractional reformulation of the direct non-concave representation of the equilibrium; (3) we identify conditions under which the problem can be approximated in polynomial time in the size of the representation; (4) we show how an MILP can approximate the reformulation with a guaranteed bounded error, and (5) we experimentally demonstrate that our algorithm provides higher quality results several orders of magnitude faster than a baseline method for general non-linear optimization.

Introduction

In recent years, game theory has achieved many ground-breaking advances both in large games (e.g., super-human AI in poker (Moravčík et al. 2017; Brown and Sandholm 2018)), and practical applications (physical security (Sinha et al. 2018), wildlife protection (Fang et al. 2017), AI for social good (Yadav et al. 2016)).
In deployed two-player real-world solutions, the aim is often to compute an optimal strategy to commit to while assuming that the other player plays the best response to this commitment – i.e., to find a Stackelberg Equilibrium (SE). While the traditional theory (e.g., the SE) assumes that all players behave entirely rationally, real-world deployments proved that taking into account the bounded rationality of the human players is necessary to provide higher quality solutions (An et al. 2013; Fang et al. 2017). One of the most commonly used models of bounded rationality is Quantal Response (QR), which assumes that players choose better actions more often than worse actions instead of playing only the best action (McKelvey and Palfrey 1995). When facing a human opponent we aim to find a strategy optimal against QR, termed Quantal Stackelberg Equilibrium (QSE). Relying on SE is not advisable in such situations as the committing player can suffer huge losses when deploying rational strategies against boundedly rational counterparts (Yang et al. 2011).

QSE was studied extensively in Stackelberg Security Games (SSGs) (Sinha et al. 2018), and the developed algorithms were used in many real-world applications. SSG is a widely applicable game model, but it requires formulating the problem in terms of allocating limited resources to a set of targets. To solve real-world problems beyond SSGs, QSE was recently introduced also in a more general model of normal-form games (NFGs) (Černý et al. 2020). However, even NFGs model only one-shot interactions. To the best of our knowledge, QSE was never properly analyzed for extensive-form (i.e., sequential) games (EFGs), despite the fact that many real-world domains are sequential.

In this paper, we develop methods for finding QSE in EFGs. There are two main challenges that prevent us from using techniques from SSGs or NFGs directly. First, while there is only one decision point for the boundedly rational player in NFGs, any non-trivial EFG contains multiple causally linked decision points where the same player acts. Psychological studies show that humans prefer short-term, heuristic decisions (Gigerenzer and Goldstein 1996) rather than long-term, premeditated decisions. This behavior arises especially in conflicts (Gray 1999) or when facing information overload caused by a large decision space (Malhotra 1982). Therefore, a natural assumption is that a subrational player acts according to a QR model in each decision point separately. This behavior cannot be modeled by an equivalent NFG. Second, contrary to NFGs, the number of the rational player's pure strategies in EFGs is exponential in the game size. Therefore, even if the QSE concepts would coincide in NFGs and EFGs, applying the algorithms for NFGs directly would soon run into scalability problems.

We begin our analysis by showing that finding QSE is NP-hard, and that the straightforward formulation of QSE in EFGs is a non-concave fractional problem that is difficult to optimize. Therefore, we derive an equivalent Dinkelbach-type formulation of QSE that does not contain any fraction and represents the rational player's strategy linearly in the size of the game. We use the Dinkelbach formulation to identify sufficient conditions for solving the problem in polynomial time. If the conditions are satisfied, the optimal solution can be found by gradient ascent. For other cases, we formulate a mixed-integer linear program (MILP) approximating the QSE through the QR model's linear relaxation. We provide theoretical guarantees on the solution quality, depending on the number of segments used to linearize the QR function and the arising bilinear terms. In the experiments, we compare the direct formulation solved by non-linear optimization to our MILP reformulation. We show that in 3.5 hours, our algorithm computes solutions that the baseline cannot reach within three days. Moreover, the quality of solutions of our algorithm outmatches the baseline's solutions even for the instances computable by the baseline.
Related Solution Concepts. Several other solution concepts study bounded rationality in EFGs. Perhaps the most well-known one is the model of Quantal Response Equilibrium (QRE) (McKelvey and Palfrey 1998). In QRE, all players act boundedly rationally, and no player has the ability to commit to a strategy. This is in contrast to QSE, where the entirely-rational leader aims to exploit the boundedly rational opponent. A similar approach for EFGs was taken in (Basak et al. 2018), where only one of the players was considered rational. However, their goal was to evaluate the performance against human participants, and the approach lacked theoretical foundations. The computed strategies did not correspond to a well-defined solution concept and were empirically suboptimal. We define and study this concept in our concurrent paper (Milec et al. 2020). Besides rationality, in QSE, the leader also benefits from the power of commitment. It is well known that the ability to commit increases the payoffs achievable by the leader (Von Stengel and Zamir 2010), and it trivially holds also against boundedly rational players. Mathematically, QRE is a fixed point, while QSE is a global optimization problem. Hence, the techniques for computing QRE cannot be applied for finding QSE.

Background

Extensive-form games model sequential interactions between players and can be visually represented as game trees. Formally, a two-player Stackelberg EFG is defined as a tuple $G = (\mathcal{N}, \mathcal{H}, \mathcal{Z}, \mathcal{A}, u, \mathcal{C}, \mathcal{I})$: $\mathcal{N}$ is a set of 2 players: a leader ($l$) and a follower ($f$). We use $i$ to refer to one of the players, and $-i$ to refer to his opponent. $\mathcal{H}$ denotes a finite set of histories in the game tree, with $h_0$ being the root. $\mathcal{H}_i$ denotes the set of histories in which player $i$ acts.
We use $pr(h) = (h', a)$ to identify the pair of the history $h'$ immediately preceding $h$ and the action $a$ connecting $h'$ with $h$. For $h_0$ we set $pr(h_0) = (\emptyset, \emptyset)$. $\mathcal{A}$ denotes the set of all actions, with $\mathcal{A}_i$ denoting the actions of player $i$. $\mathcal{Z} \subseteq \mathcal{H}$ is the set of all terminal histories of the game. We define for each player $i$ a utility function $u_i: \mathcal{Z} \to \mathbb{R}$. The chance player selects actions based on a fixed probability distribution known to all players. Function $\mathcal{C}: \mathcal{A} \to [0,1]$ denotes the probability of taking an action by chance; we write $\mathcal{C}(h)$ for the product of the probabilities of chance actions in history $h$. The set of chance histories is denoted as $\mathcal{H}_c$.

Imperfect observation of player $i$ is modeled via information sets $\mathcal{I}_i$ that form a partition over $h \in \mathcal{H}_i$. Player $i$ cannot distinguish between histories in any information set $I_i \in \mathcal{I}_i$. We overload the notation and use $\mathcal{A}(I_i)$ to denote the actions available in each history in an information set $I_i$. Each action $a$ uniquely identifies the information set where it is available, referred to as $\mathcal{I}(a)$. We assume perfect recall, which means that players remember the history of their own actions and all information gained during the course of the game. As a consequence, all histories in any information set $I_i$ have the same history of actions for player $i$.

Pure strategies $\Pi_i$ assign one action for each $I \in \mathcal{I}_i$, and we assume the actions of a strategy $\pi \in \Pi_i$ can be enumerated as $a \in \pi$. We denote the set of leafs reachable by a strategy $\pi$ as $\mathcal{Z}(\pi)$. A behavioral strategy $\beta_i \in B_i$ is one probability distribution over actions $\mathcal{A}(I)$ for each $I \in \mathcal{I}_i$. For any pair of strategies $\beta = (\beta_l, \beta_f) \in B = (B_l, B_f)$ we use $u_i(\beta) = u_i(\beta_i, \beta_{-i})$ for the expected outcome of the game for player $i$ when players follow strategies $\beta$.

Strategies in EFGs with perfect recall can also be compactly represented using the sequence form (Koller, Megiddo, and von Stengel 1996). A sequence $\sigma_i \in \Sigma_i$ is an ordered list of actions taken by a single player $i$ in history $h$. $\emptyset$ stands for the empty sequence (i.e., a sequence with no actions). A sequence $\sigma_i \in \Sigma_i$ can be extended by a valid action $a$ taken by player $i$, written as $\sigma_i a = \sigma_i'$. We say that $\sigma_i$ is a prefix of $\sigma_i'$ ($\sigma_i \sqsubseteq \sigma_i'$) if $\sigma_i'$ is obtained by a finite number (possibly zero) of extensions of $\sigma_i$. We use $seq(h)$ to denote the sequence of actions of all players leading to $h$. By using a subscript $seq_i$ we refer to the subsequence of actions of player $i$. Because of perfect recall we write $seq_i(I) = seq_i(h)$ for $I \in \mathcal{I}_i, h \in I$. We use the function $inf_i(\sigma_i')$ to denote the information set in which the last action of the sequence $\sigma_i'$ is taken. For an empty sequence, the function $inf_i(\emptyset)$ returns the information set of the root. Using sequences, any behavioral strategy of a player can be represented as a realization plan ($r_i: \Sigma_i \to \mathbb{R}$). A realization plan for a sequence $\sigma_i$ is the probability that player $i$ will play $\sigma_i$ under the assumption that the opponents play to allow the actions specified in $\sigma_i$ to be played. In other words, for a behavioral strategy $\beta_i$, $r_i(\sigma) = \prod_{a \in \sigma} \beta_i(a)$. The set of valid realization plans for player $i$ can be represented using a set of linear network-flow constraints of a size linear in the number of information sets:

  $r_i(\emptyset) = 1, \qquad 0 \le r_i(\sigma) \le 1 \quad \forall \sigma \in \Sigma_i$
  $r_i(seq_i(I)) = \sum_{a \in \mathcal{A}(I)} r_i(seq_i(I)a) \quad \forall I \in \mathcal{I}_i.$    (1)

For a given history $h$, we shorten the notation and write $r_i(seq_i(h))$ as $r_i(h)$.
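To make the sequence-form notation concrete, the following minimal Python sketch derives a realization plan from a behavioral strategy and checks the network-flow constraints (1). The two-level follower tree, the action names, and the probabilities are hypothetical illustrations, not structures taken from the paper.

```python
# Minimal sketch: behavioral strategy -> realization plan, plus a check of the
# network-flow constraints (1). The toy information-set structure below is a
# hypothetical example, not one of the paper's domains.

# information sets keyed by the player's own sequence leading to them
infosets = {
    (): ["a", "b"],        # root information set (empty sequence)
    ("a",): ["c", "d"],    # information set reached after playing a
}

# behavioral strategy: probability of each action within its information set
behavioral = {"a": 0.4, "b": 0.6, "c": 0.7, "d": 0.3}

def realization_plan(behavioral, infosets):
    """r(sigma) is the product of behavioral probabilities along sigma."""
    plan = {(): 1.0}
    for seq, actions in infosets.items():
        for a in actions:
            plan[seq + (a,)] = plan[seq] * behavioral[a]
    return plan

r = realization_plan(behavioral, infosets)

# constraints (1): r(empty) = 1, and the probability mass entering an
# information set equals the mass leaving it over the available actions
assert abs(r[()] - 1.0) < 1e-9
for seq, actions in infosets.items():
    assert abs(r[seq] - sum(r[seq + (a,)] for a in actions)) < 1e-9
print(r)  # {(): 1.0, ('a',): 0.4, ('b',): 0.6, ('a','c'): 0.28, ('a','d'): 0.12}
```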
Quantal Stackelberg Equilibrium in EFGs

We follow the earlier works on boundedly rational Stackelberg equilibria and consider two possible causes of the emergence of subrational behavior that combine into a formal definition of quantal response: (i) a subjective perception of action values and (ii) a proneness to making mistakes when choosing an action to play. The first source of subrationality is the (possibly) imperfect evaluation of the follower's own choices. We assume an evaluation model in the form of a function $e: \mathcal{A}_f \times B_l \to \mathbb{R}$ that describes how the follower values an action given the leader's strategy. An entirely rational player uses the expected utility of the best response as the evaluation function. For boundedly rational players, many examples of evaluation functions can be found in the literature: among others, the subjective utility (Nguyen et al. 2013), past-experience-based learning (Lejarraga, Dutt, and Gonzalez 2012), risk assessment (Yechiam and Hochman 2013; Kahneman and Tversky 2013), preference in simple strategies (Černý, Bošanský, and An 2020) or models from the cognitive hierarchy (Wright and Leyton-Brown 2020). The second cause of subrationality relates to the ability of the player to pick a correct action. Entirely rational players always select the utility-maximizing option. Relaxing this assumption leads to a "statistical version" of the best response, which considers the inevitable error-proneness of humans and allows the players to make systematic errors (McFadden 1976).

Definition 1. Let $e$ be a follower's evaluation function. Function $QR: B_l \to B_f$ is a canonical quantal response function if for each $I \in \mathcal{I}_f$

  $QR(\beta_l)(a) = \dfrac{q(e(a, \beta_l))}{\sum_{a' \in \mathcal{A}(I)} q(e(a', \beta_l))} \qquad \forall a \in \mathcal{A}(I),\ I \in \mathcal{I}_f$    (2)

for all $\beta_l \in B_l$ and some real-valued function $q$.

Note that whenever $q$ is a strictly positive increasing function, the corresponding QR is a valid quantal response function, which gives rise to a valid behavioral strategy of the follower. We call such functions $q$ generators of canonical quantal functions.

Definition 2. Given an extensive-form game $G$, a behavioral strategy $\beta_l^{QS} \in B_l$ and a quantal response function $QR$ of the follower form a Quantal Stackelberg Equilibrium (QSE) if and only if

  $\beta_l^{QS} = \arg\max_{\beta_l \in B_l} u_l(\beta_l, QR(\beta_l)).$    (3)

QSE in an EFG is not equivalent to QSE of the normal-form representation of the EFG, because instead of picking a pure strategy in the whole game according to a given model of bounded rationality, we assume a more natural setting, when a player acts quantally in every information set they encounter separately.

Example 1. Consider an EFG depicted in Figure 1 and two possible behavioral models of the follower. In both models, the follower assumes they act fully rationally and uses the expected utility of a best response as their evaluation function in both information sets. However, they are unaware that when choosing an action to play, they will act quantally according to a generator $q(x) = \exp(\lambda x)$. In the first model, we set $\lambda = 0.33$, while in the second model $\lambda$ is equal to 0.83. On the right of the same figure we can find the non-concave objective functions of both QSEs, and SE. The choice of the behavioral model significantly affects the solution. While with $\lambda = 0.33$ the leader commits to playing action $y$, with $\lambda = 0.83$ her strategy is completely opposite: to play the action $x$ with probability 1. Furthermore, an optimal solution in SE is any strategy with probability of action $y$ lower than 0.6. However, if the leader deploys a strategy close to this threshold against either of the two behavioral models, her utility will be, in fact, close to the global minimum of the corresponding QSE. For $\lambda = 0.33$, the utility is low for all SE strategies. An example with a fixed quantal function and different evaluation functions is in the appendix.

Figure 1: (Left) An example of a general-sum EFG with utilities in form $(u_l, u_f)$, and (Right) objective functions of three equilibria: QSEs with a generator $q = \exp(\lambda x)$, where $\lambda \in \{0.33, 0.83\}$, and SE.

The example verifies the observations made in SSGs: playing SE strategies against boundedly rational opponents may inflict huge losses in utility for the leader (Yang et al. 2011). It is not difficult to design an EFG in which a unique SE is a global minimum of a QSE with arbitrarily low utility.

The following straightforward mathematical program represents QSE using the direct definition of quantal response from Eq. (2):

  $\max_{\beta_l \in B_l} \; v(\emptyset, \emptyset)$    (4a)
  $v(pr(h)) = \sum_{a \in \mathcal{A}(h)} v(h, a)\,\mathcal{C}(a) \qquad \forall h \in \mathcal{H}_c$    (4b)
  $v(pr(h)) = \sum_{a \in \mathcal{A}(h)} v(h, a)\,\beta_l(a) \qquad \forall h \in \mathcal{H}_l$    (4c)
  $v(pr(h)) = \dfrac{\sum_{a \in \mathcal{A}(h)} v(h, a)\, q(e(a, \beta_l))}{\sum_{a \in \mathcal{A}(h)} q(e(a, \beta_l))} \qquad \forall h \in \mathcal{H}_f$    (4d)
  $v(pr(z)) = u_l(z) \qquad \forall z \in \mathcal{Z}.$    (4e)

The variable $v$ is defined for every action interconnecting two consecutive nodes in the game tree (i.e., an edge), and it serves to propagate the leader's utility from the leafs up to the root through both the chance nodes and the nodes of the players. As Example 1 shows, this formulation using behavioral strategies is non-concave, might have multiple local optima, and it contains fractional terms (Eq. (4d)). Unfortunately, similarly to SE (Letchford and Conitzer 2010), computing QSE in EFGs is also an NP-hard problem.

Theorem 1. Let $q$ be an exponential generator of a quantal function of the follower. Computing an optimal strategy of a rational player against the quantal response opponent in two-player imperfect-information EFGs with perfect recall is an NP-hard problem.

Proof (Sketch). We reduce from the 3-SAT problem using the tree structure from (Letchford and Conitzer 2010). We adapt the utilities because while the follower plays the leader's preferred action in case of indifference in SE, the quantal response chooses all actions with the same utility uniformly. We show there exists a threshold probability of reaching a target subtree separating SAT from UNSAT. Full proof is in the appendix.
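As a concrete illustration of Definition 1, the short Python sketch below evaluates the canonical quantal response (2) in a single follower information set with the logit generator $q(x) = \exp(\lambda x)$ used in Example 1. The action names and the evaluation values are hypothetical; only the formula follows the definition.

```python
# Sketch of the canonical quantal response (Eq. (2)) in one information set
# with a logit generator q(x) = exp(lambda * x).
import math

def quantal_response(evaluations, lam):
    """evaluations maps each available action a to e(a, beta_l); returns the
    follower's quantal behavioral strategy in this information set."""
    weights = {a: math.exp(lam * v) for a, v in evaluations.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

# hypothetical evaluations of two actions under some leader strategy
print(quantal_response({"c": 1.2, "d": 0.4}, lam=0.33))
print(quantal_response({"c": 1.2, "d": 0.4}, lam=0.83))  # sharper, closer to the best response
```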
Dinkelbach-Type Formulation of QSE

The non-concavity of formulation (4) makes it difficult to optimize and guarantee global optimality. Therefore, we search for an alternative representation of QSE that would express the problem as a single fractional criterion, instead of a set of equations. Such a representation would allow us to leverage reformulation methods from fractional programming to eliminate the fraction. For this purpose we use the realization plans. Note that in (4), the utility from each leaf (Eq. (4e)) is propagated up through the variables $v$ by multiplying it by the values of a behavioral strategy (Eqs. (4c) and (4d)) and chance (Eq. (4b)) of all actions on the way to the root. Because the product of behavioral probabilities of a sequence is (by definition) equivalent to the realization of the sequence, the criterion (4a) is expressed as

  $\max_{r_l} \sum_{z \in \mathcal{Z}} u_l(z)\,\mathcal{C}(z)\, r_l(z)\, QR(r_l, z),$    (5)

where

  $QR(r_l, z) = \prod_{a \in seq_f(z)} \dfrac{q(e(a, r_l))}{\sum_{a' \in \mathcal{A}(\mathcal{I}(a))} q(e(a', r_l))}.$

Because realization plans are equivalent to behavioral strategies, instead of $e(a, \beta_l)$ we write $e(a, r_l)$. The purpose of this reformulation is that the criterion (3) can now be expressed as a single fraction, similarly to the representation of QSE in mixed strategies in security games (Yang, Ordonez, and Tambe 2012) and normal-form games (Černý et al. 2020). We can hence use Dinkelbach's method for solving nonlinear fractional programming problems (Dinkelbach 1967) and adapt it for finding the QSE. The key idea of Dinkelbach's algorithm is to express a problem of the form $\max_{x \in M} f(x)/g(x)$, $g(x) > 0$, for some convex set $M$ and continuous, real-valued functions $f$ and $g$, as an equivalent problem of finding the unique root of the function $\mathcal{D}(p) = \max_{x \in M} f(x) - p\,g(x)$, $p \in \mathbb{R}$. It holds that $\max_{x \in M} f(x)/g(x) = p^*$ if and only if $\mathcal{D}(p^*) = 0$. Because $\mathcal{D}$ is a maximum of functions affine in $p$, it is convex. Finding the root is hence straightforward (e.g., a binary search method can be used for this purpose), and it can be done efficiently if and only if we are able to effectively determine the value of function $\mathcal{D}$ for any $p$. Most importantly, $\mathcal{D}$ no longer contains any fractional term and is therefore easier to optimize.

Theorem 2. Computing the maximum of the original formulation of QSE (3) is equivalent to finding the unique root of the following function $\mathcal{D}$:

  $\mathcal{D}(p) = \max_{r_l} F(r_l, p),$    (6)

where

  $F(r_l, p) = \sum_{\pi \in \Pi_f} \prod_{a \in \pi} q(e(a, r_l)) \Big( \sum_{z \in \mathcal{Z}(\pi)} u_l(z)\,\mathcal{C}(z)\, r_l(z) - p \Big).$

Proof. We start from the representation of QSE in the form of Eq. (5). Because the follower acts quantally in every information set, the smallest common multiple over all leafs is $\prod_{I \in \mathcal{I}_f} \sum_{a \in \mathcal{A}(I)} q(e(a, r_l))$. The fractional representation of the QSE is hence

  $\max_{r_l} \dfrac{\sum_{z \in \mathcal{Z}} u_l(z)\,\mathcal{C}(z)\, r_l(z)\, Q(z, r_l)}{\prod_{I \in \mathcal{I}_f} \sum_{a \in \mathcal{A}(I)} q(e(a, r_l))},$

where

  $Q(z, r_l) = \prod_{a \in seq_f(z)} q(e(a, r_l)) \prod_{I \in \mathcal{I}_f:\ seq(I) \not\sqsubseteq seq(z)} \sum_{a \in \mathcal{A}(I)} q(e(a, r_l)).$

Because $Q(z, r_l)$ is a sum of products of functions $q$ applied to a fixed path from the root to $z$ and one action in each information set outside this path, it iterates over pure strategies enabling to reach $z$. $Q(z, r_l)$ is therefore equivalent to $\sum_{\pi \in \Pi_f:\ z \in \mathcal{Z}(\pi)} \prod_{a \in \pi} q(e(a, r_l))$. Applying the same idea also to the denominator and swapping the sum over leafs $z$ with the sum over pure strategies in the nominator, we obtain

  $\max_{r_l} \dfrac{\sum_{\pi \in \Pi_f} \prod_{a \in \pi} q(e(a, r_l)) \sum_{z \in \mathcal{Z}(\pi)} u_l(z)\,\mathcal{C}(z)\, r_l(seq_l(z))}{\sum_{\pi \in \Pi_f} \prod_{a \in \pi} q(e(a, r_l))}.$    (7)

By the Dinkelbach reformulation, maximizing this equation is equivalent to finding a root of

  $\max_{r_l} \sum_{\pi \in \Pi_f} \prod_{a \in \pi} q(e(a, r_l)) \sum_{z \in \mathcal{Z}(\pi)} u_l(z)\,\mathcal{C}(z)\, r_l(z) \;-\; p \sum_{\pi \in \Pi_f} \prod_{a \in \pi} q(e(a, r_l)),$

which is the desired equation.

Due to the leader's strategy being represented using a realization plan, the formulation (6) has $|\Sigma_l|$ variables. The expression $e(a, r_l)$ is evaluated for every follower's action in the game tree; the number of evaluations is hence also linear in $|\mathcal{A}_f|$. The outer sum, however, enumerates the follower's pure strategies and is thus exponential in $|\mathcal{I}_f|$. In many real-world applications this fact might not be critical, as the follower's strategy space (i.e., choosing a target to attack) is often much smaller than the combinatorial strategy space of the leader (i.e., deploying/moving multiple units to different locations).
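The following self-contained Python sketch illustrates the Dinkelbach idea that Theorem 2 builds on: the maximum of a ratio $f/g$ with $g > 0$ equals the unique root of the convex, decreasing function $\mathcal{D}(p) = \max_x f(x) - p\,g(x)$, which a binary search can locate. The one-dimensional functions $f$ and $g$ and the grid are toy assumptions standing in for the nominator and denominator of Eq. (7).

```python
# Toy illustration of Dinkelbach's reformulation: max f(x)/g(x) via the root of
# D(p) = max_x [f(x) - p * g(x)], found by binary search (D is decreasing in p).
def D(p, xs, f, g):
    return max(f(x) - p * g(x) for x in xs)

xs = [i / 1000 for i in range(2001)]    # crude grid over the feasible set [0, 2]
f = lambda x: 1.0 + x                   # toy "nominator"
g = lambda x: 1.0 + x * x               # toy "denominator", strictly positive

lb, ub = 0.0, 2.0                       # bounds enclosing the achievable ratio
for _ in range(50):                     # binary search for the root of D
    p = (lb + ub) / 2
    if D(p, xs, f, g) < 0:
        ub = p                          # p is above the optimal ratio
    else:
        lb = p                          # p is still achievable
print(p, max(f(x) / g(x) for x in xs))  # both values are ~1.2071 for this toy choice
```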
Algorithm 1: Dinkelbach-Type Algorithm for QSE
  UB ← max_{z∈Z} u_l(z);  LB ← min_{z∈Z} u_l(z)
  r_l* ← arg max_{r_l} F(r_l, LB)
  repeat
    p ← (UB + LB)/2
    v ← max_{r_l} F(r_l, p);  r_l ← arg max_{r_l} F(r_l, p)
    if v < 0 then UB ← p else LB ← p and r_l* ← r_l
  until UB − LB < ε
  return r_l*

Because function $\mathcal{D}$ is convex, its root can be found using a binary search method, as described in Algorithm 1. We refer to formulation (6) as the Dinkelbach subproblem of the Dinkelbach formulation of QSE. Algorithm 1 iteratively updates the upper bound (UB) and lower bound (LB) on the value of QSE according to a binary search method for finding the root of a function. For running the binary search it is essential to solve the Dinkelbach subproblems (i.e., evaluate $\mathcal{D}(p)$ for any $p$). The following proposition presents conditions under which the subproblem can be efficiently approximated.

Proposition 3. Let $q$ be a twice differentiable generator of a quantal response and $e$ be a twice differentiable evaluation function of the follower. The Dinkelbach subproblem is concave if for any $\pi \in \Pi_f$, $a \in \pi$, $p \in [\min_{z \in \mathcal{Z}} u_l(z), \max_{z \in \mathcal{Z}} u_l(z)]$ and realization plan $r_l$ the matrix

  $\Lambda(\pi, a) = q'(e_a)\big(u_{r_l} e_a'^{\top} + e_a' u_{r_l}^{\top}\big) + \big(u_{r_l}^{\top} r_l - p\big)\Big(e_a''\, q'(e_a) + e_a' e_a'^{\top} q''(e_a) + \sum_{a' \ne a \in \pi} e_a' e_{a'}'^{\top} q'(e_a)\, q'(e_{a'}) \prod_{a'' \ne a' \ne a \in \pi} q(e_{a''})\Big)$

is negative semidefinite, where $e_a = e(a, r_l)$, $e_a' = e'(a, r_l)$, $e_a'' = e''(a, r_l)$, and $u_{r_l}^{\top} r_l = \sum_{z \in \mathcal{Z}(\pi)} u_l(z)\,\mathcal{C}(z)\, r_l(z)$.

Proof. The formulation of the subproblem from Eq. (6) is concave when its Hessian matrix is negative semidefinite (NSD). The Hessian matrix is of the form $\sum_{\pi \in \Pi_f} \sum_{a \in \pi} \Lambda(\pi, a) \prod_{a' \ne a \in \pi} q(e_{a'})$. Because a sum of NSD matrices is NSD, the generator is always positive, and definiteness is preserved under multiplication by a positive number, Eq. (6) is concave if $\Lambda(\pi, a)$ is NSD.

In case the conditions are met, local-optimization algorithms (e.g., projected gradient ascent (Nesterov 2004); Eq. (6) is L-smooth) are guaranteed to reach the optimum. We discuss how useful Proposition 3 is in the appendix.

Approximating the Dinkelbach Subproblem

In case a game does not satisfy the conditions in Proposition 3, the guarantee of convergence is lost. A solution commonly suggested in the literature is then to linearize the criterion (6) and transform the problem into an (MI)LP that can be solved using standard methods. For linearizing the criterion we need to approximate both functions of the behavioral model from Definition 1: (i) the quantal function $q$ and (ii) the utility evaluation function $e$.

We begin by approximating the quantal generator $q$. We focus on logit QR, which is the most commonly studied quantal response in the literature. In case of logit QR, function $q$ is defined as $q(x) = \exp(\lambda x)$, $\lambda \in \mathbb{R}_+$. The player becomes more rational as $\lambda$ approaches infinity. We can express the product of generator functions from Eq. (6) through a substitutional variable $x_\pi$ as

  $\prod_{a \in \pi} \exp(\lambda\, e(a, r_l)) = \exp\Big(\lambda \sum_{a \in \pi} e(a, r_l)\Big) \to x_\pi.$

The exp function is linearizable into $K$ segments as

  $\exp\Big(\lambda \sum_{a \in \pi} e(a, r_l)\Big) \approx \sum_{k=0}^{K} \alpha^k t_\pi^k + \exp(\underline{e}) = x_\pi,$    (8)

where $(\alpha^k)_{k \in [K]}$, $\alpha^k \in \mathbb{R}$, is the slope of the $k$-th segment, subjected to the constraints

  $\sum_{k=0}^{K} t_\pi^k + \underline{e} = \lambda \sum_{a \in \pi} e(a, r_l)$
  $t_\pi^k \ge z_\pi^k\,(\overline{e} - \underline{e})/K$
  $t_\pi^{k+1} \le z_\pi^k\,(\overline{e} - \underline{e})/K$    (9)
  $0 \le t_\pi^k \le (\overline{e} - \underline{e})/K, \qquad z_\pi^k \in \{0, 1\},$

where the binary variables $z$ indicate whether the linear segment is used and the real variables $t$ define what portion of the segment is active. We set $\underline{e} = \lambda\,|\mathcal{I}_f| \min_{a \in \mathcal{A}_f, r_l} e(a, r_l)$ and $\overline{e} = \lambda\,|\mathcal{I}_f| \max_{a \in \mathcal{A}_f, r_l} e(a, r_l)$. Using the bound from (Yano et al. 2013), the maximum difference in values of $\exp$ and its linearization $\widetilde{\exp}$ using $K$ segments on the interval $[\underline{e}, \overline{e}]$ can be bounded as

  $|\exp(x) - \widetilde{\exp}(x)| \le \exp(\overline{e})\,\dfrac{(\overline{e} - \underline{e})^2}{8K^2}, \qquad x \in [\underline{e}, \overline{e}].$    (10)
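The sketch below mirrors the $K$-segment linearization of Eqs. (8)–(10) numerically: the exponent is consumed segment by segment, each segment contributing with its slope, and the deviation from the true exponential stays within the bound of Eq. (10). The segment slopes are chosen as secants, and the interval end points and $K$ are illustrative assumptions rather than values prescribed by the paper.

```python
# Numerical sketch of the K-segment linearization of exp (Eqs. (8)-(10)).
import math

def linearized_exp(x, e_lo, e_hi, K):
    width = (e_hi - e_lo) / K
    knots = [e_lo + k * width for k in range(K + 1)]
    # slope of segment k: the secant of exp over that segment (one possible choice)
    slopes = [(math.exp(knots[k + 1]) - math.exp(knots[k])) / width for k in range(K)]
    value, remaining = math.exp(e_lo), x - e_lo
    for k in range(K):                         # fill the segments one after another,
        t_k = min(max(remaining, 0.0), width)  # as the t^k variables do in (9)
        value += slopes[k] * t_k
        remaining -= t_k
    return value

e_lo, e_hi, K = -1.0, 2.0, 3                   # illustrative interval and segment count
bound = math.exp(e_hi) * (e_hi - e_lo) ** 2 / (8 * K ** 2)   # Eq. (10)
worst = max(abs(math.exp(x) - linearized_exp(x, e_lo, e_hi, K))
            for x in [e_lo + i * (e_hi - e_lo) / 200 for i in range(201)])
print(worst, bound, worst <= bound)            # the observed error respects the bound
```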
With the linearized logit generator, the Dinkelbach subproblem (6) is expressed as

  $\mathcal{D}(p) = \max_{r_l} \sum_{\pi \in \Pi_f} \Big( \sum_{z \in \mathcal{Z}(\pi)} u_l(z)\,\mathcal{C}(z)\, r_l(z) - p \Big) x_\pi.$

Clearly, the criterion contains multiple bilinear terms $x_\pi r_l(z)$. For linearizing the bilinear terms, we use the MDT technique (Kolodziej, Castro, and Grossmann 2013). MDT is a parametrizable method which enables controlling the error in exchange for introducing binary variables. The product $c(\pi, z) = x_\pi r_l(z)$ is expressed using linear equations

  $c(\pi, z) = \sum_{i=0}^{b-1} \sum_{j \in \mathcal{E}} i\, b^j\, \hat{r}_{i,j}, \qquad x_\pi^{\mathcal{E}} = \sum_{i=0}^{b-1} \sum_{j \in \mathcal{E}} i\, b^j\, s_{i,j}$
  $1 = \sum_{i=0}^{b-1} s_{i,j}, \qquad r_l(z) = \sum_{i=0}^{b-1} \hat{r}_{i,j} \qquad \forall j \in \mathcal{E}$    (11)
  $s_{i,j} \in \{0, 1\}, \qquad 0 \le \hat{r}_{i,j} \le s_{i,j} \qquad \forall j \in \mathcal{E},\ \forall i \in [b-1],$

where $\mathcal{E} \subset \mathbb{Z}$ is a finite subset controlling the error of the approximation with basis $b$, and $x_\pi^{\mathcal{E}}$ is a representation of $x_\pi$ in MDT over $\mathcal{E}$. The following lemma identifies the approximation error introduced by the selection of $K$ and $\mathcal{E}$.

Lemma 4. Let $|\mathcal{E}| = L$ and $x_\pi$ be linearized with $K$ segments. Then

  $|x_\pi r_l(z) - x_\pi^{\mathcal{E}} r_l(z)| \le \epsilon_K + \epsilon_{\mathcal{E}},$    (12)

where $\epsilon_K = \exp(\overline{e})\,\frac{(\overline{e} - \underline{e})^2}{8K^2}$, $\epsilon_{\mathcal{E}} = \max(N,\ \exp(\overline{e}) - b^{M+1} + N,\ N - \exp(\underline{e}))$, $M = \lfloor \log_b(\exp(\overline{e})) \rfloor$ and $N = b^{M-L+1}$.

Proof. Let $\mathcal{E} = \{M-L+1, M-L+2, \ldots, M\}$. $\mathcal{E}$ hence defines a discretization of variable $x_\pi$ on the interval $[N, b^{M+1} - N]$ with a step of size $N$. Because $x_\pi$ is defined on the interval $[\exp(\underline{e}), \exp(\overline{e})]$, the maximum difference between $x_\pi$ and $x_\pi^{\mathcal{E}}$ is $\epsilon_{\mathcal{E}}$. As $r_l(z)$ is (by definition) always at most 1, we have

  $|x_\pi r_l(z) - x_\pi^{\mathcal{E}} r_l(z)| \le |x_\pi - \tilde{x}_\pi| + |\tilde{x}_\pi - x_\pi^{\mathcal{E}}|.$

By Eq. (10), $|x_\pi - \tilde{x}_\pi| \le \epsilon_K$, concluding the proof.

Now we move to the second part of the QR definition: the evaluation function $e$. We present a domain-independent formulation of a common situation when the follower is not aware of their subrationality and evaluates the actions in the current information set on the basis of acting rationally in the subsequent information sets, weighted by the probability of reaching the current set. In that case, the evaluation function $e$ can be expressed as

  $e(a, r) = v(\mathcal{I}(a)) - s(seq_f(\mathcal{I}(a))a)$
  $v(inf_f(\sigma_f)) = s(\sigma_f) + \sum_{I:\ seq_f(I) = \sigma_f} v(I) + \sum_{z:\ seq_f(z) = \sigma_f} u_l(z)\,\mathcal{C}(z)\, r_l(z) \qquad \forall \sigma_f \in \Sigma_f$    (13)
  $0 \le s(\sigma) \le M(1 - r_f(\sigma)) \qquad \forall \sigma \in \Sigma_f,$

where $r_f$ is the binary best-response realization plan of the follower, $v$ is the optimal expected utility contribution in an information set, and $s$ is a slack variable compensating the deficiency in an action's suboptimal utility. Now, we can finally state the approximation error for computing the QSE.

Proposition 5. Consider a linear formulation of the Dinkelbach subproblem in the form

  $\mathcal{D}(p) = \max_{r_l} \sum_{\pi \in \Pi_f} \Big( \sum_{z \in \mathcal{Z}(\pi)} u_l(z)\,\mathcal{C}(z)\, c(\pi, z) - p\, x_\pi^{\mathcal{E}} \Big),$

with constraints (1), (9), (11), and (13), with $K$ segments, $|\mathcal{E}| = L$ and substitution (8). Let $r_l^*$ be a realization plan computed by Algorithm 1 with precision $\epsilon_B$, and $r_l^{QS}$ be a realization plan of the leader in QSE. Then for the utility difference $d = |u_l(r_l^*, QR(r_l^*)) - u_l(r_l^{QS}, QR(r_l^{QS}))|$ it holds that

  $d \le \epsilon_B + \dfrac{|u_l|\,|\Pi_f|\,\epsilon_K + |\Pi_f| \max_{z \in \mathcal{Z}} |u_l(z)|\,(\epsilon_K + \epsilon_{\mathcal{E}})}{\exp(\underline{e})},$

where $\epsilon_K$ and $\epsilon_{\mathcal{E}}$ are defined as in Lemma 4.

Proof (Sketch). We derive specific bounds for the Dinkelbach representation of QSE based on Lemma 4 and combine them with bounds on the difference between the linearized and non-linearized nominator and denominator introduced in (Černý et al. 2020), adapted for QSE in the form of Eq. (7). Full proof can be found in the appendix.
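To give a feel for the multiparametric disaggregation in Eq. (11), the sketch below rounds $x_\pi$ to a base-$b$ number supported only on the powers in $\mathcal{E}$ (which is what the digit-selection variables $s_{i,j}$ encode) and compares the exact bilinear product with its MDT counterpart; the discretization error stays within one step $N = b^{M-L+1}$, matching Lemma 4. The values of $x_\pi$, $r_l(z)$, and the assumed upper bound are illustrative only.

```python
# Sketch of the MDT discretization behind Eq. (11) and Lemma 4: x_pi is written
# in base b using only the powers in E, which makes its product with r_l(z) linear.
import math

def mdt_round(x, b, E):
    """Base-b representation of x restricted to the powers listed in E."""
    digits, rest = {}, x
    for j in sorted(E, reverse=True):        # pick digits from the largest power down
        digits[j] = min(int(rest // (b ** j)), b - 1)
        rest -= digits[j] * b ** j
    return sum(i * b ** j for j, i in digits.items()), digits

b, L = 3, 4                                  # basis and |E| = L, the values used in the experiments
x_max = math.exp(2.0)                        # assumed upper bound on x_pi, i.e. exp(e_hi)
M = math.floor(math.log(x_max, b))           # M = floor(log_b(exp(e_hi))), as in Lemma 4
E = list(range(M - L + 1, M + 1))            # E = {M-L+1, ..., M}
N = b ** (M - L + 1)                         # discretization step from Lemma 4

x_pi, r_lz = 7.38, 0.42                      # illustrative values of x_pi and r_l(z)
x_E, digits = mdt_round(x_pi, b, E)
print(digits)                                # chosen digits (the s_{i,j} selections)
print(x_pi * r_lz, x_E * r_lz)               # exact vs. MDT-linearized bilinear product
print(abs(x_pi - x_E) <= N)                  # error within one discretization step
```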
Experimental Evaluation

Finally, we demonstrate the performance of our Dinkelbach-type algorithm (DTA). We compare the DTA to the standard benchmark for nonlinear optimization: the COBYLA algorithm (Powell 2007), implemented in the open-source NLOPT library. COBYLA is a gradient-free algorithm capable of handling the linear network-flow equality constraints induced by realization plans. We opted for COBYLA because the follower's evaluation function is not smooth – it is non-differentiable, possibly in infinitely many points. We apply COBYLA directly to formulation (5). For evaluating the algorithms, we used two domains: a variant of a search game commonly used to evaluate algorithms for SE (Marchesi et al. 2019; Kroer, Farina, and Sandholm 2018; Čermák et al. 2016) and a network game, handcrafted to be difficult for QSE.

Search Game. The game is played on a directed graph, depicted in the middle of Figure 2. The attacker's goal is to reach one of the destination nodes (D0 – D7) from the starting node (S), while the defender aims to catch the attacker with one of the two units operating in the marked areas of the graph (P0 and P1). The attacker receives a different reward for reaching a different destination node (the reward is selected randomly from the interval [0, 2]). While the defender can move freely with unit P1, unit P0 is static – placed by the defender at the beginning of the game. If the attacker evades unit P0, the defender is given N steps to set unit P1. The defender receives a signal if P1 is within 1 step from the attacker. In case the defender captures the attacker, she gets a positive reward of 1 − n/(N + 1), where n ≤ N is the number of taken steps, and the attacker receives 0. We consider a version of the game in which the attacker perceives no information about the whereabouts of the defender's units, and unit P1 starts above D0.

Network Game. The network game is played on a directed graph too. Some nodes in the graph are grouped together into mutually disjunctive areas. In the beginning, the attacker observes their areas of possible infiltration and the area the defender uses to enter the network, and selects one area for further probing. The defender then picks a node from the entering area, while the attacker chooses a node from their area to compromise. The defender is given a maximum number of steps N to survey the network and discover the attacker. There is binary information: if the attacker is located within a given number of steps from the defender, the defender observes it. If she captures the attacker in n ≤ N steps she receives a utility of 1 − n/(N + 1), and 0 otherwise. The attacker is given a reward associated with the compromised node if not found (the reward is chosen randomly from the interval [−2.5, 2.5] – the negative utility represents nodes with compromising costs higher than the data's price or a possible honeypot), and −3 otherwise. We designed three networks, shown on the left in Figure 2. The attacker's areas are depicted in thin-lined rectangles. The defender's entering area is depicted in a thick-lined rectangle in game 03 and selected randomly (4 nodes per seed) in the other games. We set the detection distance to 0 for game 01, and to 1 otherwise. It is a type of ; hence, the vast majority of the attacker's strategies can be best responses to the defender's strategy, which makes computing Stackelberg strategies particularly difficult.

Figure 2: (Left) Graphs for network games, (Middle) Graph for search game, (Right) Mean runtimes in search game.

Figure 3: Mean runtimes of COBYLA and DTA in network games. Every point shows also a standard error.

Experimental Setting. We assume the defender acts as a leader in the games, while the attacker assumes the follower's role. We consider three different exponential generators of canonical logit functions, λ ∈ {0.4, 0.7, 1.0}. The tolerance parameter for the COBYLA algorithm in NLOPT was set to 10^-2, and ε_B = 1% of the leader's utility range for the DTA's binary search. The linearization uses K = 3, the basis of MDT is set to b = 3, and the size of the precision interval E is L = 4. For each combination of game size × generator function, we constructed and solved 20 instances. All implementations were done in C++17. We used NLOPT 2.6.1, and a single-threaded IBM CPLEX 12.8 carried all MILP computations. The experiments were performed on a 3.2GHz CPU with 16GB RAM.

Runtimes. In Figures 2 (right) and 3 we present the runtime results. The x-axis shows the game size, while the y-axis shows the runtimes of the algorithms. Every point in the graph corresponds to the mean over the sampled instances and shows the achieved standard error. We terminated all running seeds after 24h and depict them in the graph with this lower bound on runtime if the computation was still ongoing. As the figures show, despite the overhead of the DTA algorithm on smaller instances, it scales significantly better than COBYLA. For 6 steps, we ran longer jobs, and COBYLA computed no game within 3 days. Interestingly, as in some other NP-hard problems (Cheeseman, Kanefsky, and Taylor 1991), increasing the search game's strategy space enables finding better strategies faster, and the binary search terminates earlier.

Solution Quality. The relative errors of computed solutions are presented in Table 1. The values correspond to the mean ratio of the difference in the defender's expected utility computed using COBYLA and the DTA to the length of the defender's utility range in the game. Due to the linear approximations used by COBYLA, it can find close-to-optimal solutions for smaller instances. In some cases, the solution is even better than the DTA's because of the DTA's approximation parameters. However, as the table reveals, the quality of COBYLA's solutions degrades with increasing game size, reaching an error of 18.24% for 4 steps in the search game.

Table 1: Comparison of solution quality. A positive value signifies DTA returning a better solution than COBYLA.

                       2 steps             3 steps             4 steps
  Network Game 01
    λ = 0.4        0.23% ± 0.03%       1.27% ± 0.20%       2.83% ± 0.46%
    λ = 0.7       -1.99% ± 0.38%       0.42% ± 0.38%       3.84% ± 0.74%
    λ = 1.0       -3.82% ± 0.51%       0.50% ± 0.50%       2.32% ± 0.92%
  Network Game 02
    λ = 0.4        1.17% ± 0.27%       1.16% ± 0.44%      13.32% ± 7.35%
    λ = 0.7        0.20% ± 0.21%       1.80% ± 0.42%       8.87% ± 3.47%
    λ = 1.0       -2.35% ± 0.35%      -0.22% ± 0.49%       3.74% ± 4.05%
  Network Game 03
    λ = 0.4        1.09% ± 0.27%       1.30% ± 0.42%             -
    λ = 0.7        1.02% ± 0.45%       5.73% ± 1.85%             -
    λ = 1.0        0.25% ± 0.68%       8.33% ± 4.94%             -
  Search Game
    λ = 0.4       -2.13% ± 0.29%      -0.20% ± 0.42%      16.23% ± 3.77%
    λ = 0.7       -7.24% ± 1.00%      -4.75% ± 0.79%      18.24% ± 3.14%
    λ = 1.0      -21.24% ± 1.81%     -10.99% ± 1.64%      -1.98% ± 4.91%

Conclusion

We study Quantal Stackelberg Equilibrium (QSE) – a strategy to which the rational player should commit against a boundedly rational player – in extensive-form games (EFGs).
We show that computing this solution concept is NP-hard, so polynomial algorithms are not likely. Still, computing such solutions is useful for evaluating scalable heuristics, or for extending experiments on sequential games against human participants (similar to (Basak et al. 2018)) to compare their expected outcome and thus improve the understanding of human decision making in competitive scenarios. Therefore, we introduce the first practical algorithm for computing QSE in EFGs. We show that contrary to the straightforward mathematical formulation of QSE, our algorithm can solve larger games with a smaller error.

Acknowledgement

This research is supported by the SIMTech-NTU Joint Laboratory on Complex Systems, the Czech Science Foundation (grant no. 18-27483Y and 19-24384Y) and by the OP VVV MEYS funded project CZ.02.1.01/0.0/0.0/16_019/0000765 "Research Center for Informatics". Bo An is partially supported by Singtel Cognitive and Artificial Intelligence Lab for Enterprises (SCALE@NTU), which is a collaboration between Singapore Telecommunications Limited (Singtel) and Nanyang Technological University (NTU) that is funded by the Singapore Government through the Industry Alignment Fund – Industry Collaboration Projects Grant.

References

An, B.; Shieh, E.; Yang, R.; Tambe, M.; Baldwin, C.; DiRenzo, J.; Maule, B.; and Meyer, G. 2013. A deployed quantal response based patrol planning system for the US Coast Guard. Interfaces 43(5): 400–420.

Basak, A.; Černý, J.; Gutierrez, M.; Curtis, S.; Kamhoua, C.; Jones, D.; Bošanský, B.; and Kiekintveld, C. 2018. An initial study of targeted personality models in the FlipIt game. In International Conference on Decision and Game Theory for Security, 623–636. Springer.

Brown, N.; and Sandholm, T. 2018. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science 359(6374): 418–424.

Čermák, J.; Bošanský, B.; Durkota, K.; Lisý, V.; and Kiekintveld, C. 2016. Using correlated strategies for computing Stackelberg equilibria in extensive-form games. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 439–445.

Černý, J.; Bošanský, B.; and An, B. 2020. Finite state machines play extensive-form games. In Proceedings of the 21st ACM Conference on Economics and Computation, EC '20, 509–533. New York, NY, USA: Association for Computing Machinery.

Černý, J.; Lisý, V.; Bošanský, B.; and An, B. 2020. Dinkelbach-type algorithm for computing quantal Stackelberg equilibrium. In Bessiere, C., ed., Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, 246–253. IJCAI.

Cheeseman, P. C.; Kanefsky, B.; and Taylor, W. M. 1991. Where the really hard problems are. In IJCAI, volume 91, 331–337.

Dinkelbach, W. 1967. On nonlinear fractional programming. Management Science 13(7): 492–498.

Fang, F.; Nguyen, T. H.; Pickles, R.; Lam, W. Y.; Clements, G. R.; An, B.; Singh, A.; Schwedock, B. C.; Tambe, M.; and Lemieux, A. 2017. PAWS - a deployed game-theoretic application to combat poaching. AI Magazine.

Gigerenzer, G.; and Goldstein, D. G. 1996. Reasoning the fast and frugal way: models of bounded rationality. Psychological Review 103(4): 650.

Gray, J. R. 1999. A bias toward short-term thinking in threat-related negative emotional states. Personality and Social Psychology Bulletin 25(1): 65–75.

Kahneman, D.; and Tversky, A. 2013. Prospect theory: An analysis of decision under risk. In Handbook of the Fundamentals of Financial Decision Making: Part I, 99–127. World Scientific.

Koller, D.; Megiddo, N.; and von Stengel, B. 1996. Efficient computation of equilibria for extensive two-person games. Games and Economic Behavior 247–259.

Kolodziej, S.; Castro, P. M.; and Grossmann, I. E. 2013. Global optimization of bilinear programs with a multiparametric disaggregation technique. Journal of Global Optimization 57(4): 1039–1063.

Kroer, C.; Farina, G.; and Sandholm, T. 2018. Robust Stackelberg equilibria in extensive-form games and extension to limited lookahead. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence.

Lejarraga, T.; Dutt, V.; and Gonzalez, C. 2012. Instance-based learning: A general model of repeated binary choice. Journal of Behavioral Decision Making 25(2): 143–153.

Letchford, J.; and Conitzer, V. 2010. Computing optimal strategies to commit to in extensive-form games. In Proceedings of the 11th ACM Conference on Electronic Commerce, 83–92.

Malhotra, N. K. 1982. Information load and consumer decision making. Journal of Consumer Research 8(4): 419–430.

Marchesi, A.; Farina, G.; Kroer, C.; Gatti, N.; and Sandholm, T. 2019. Quasi-perfect Stackelberg equilibrium. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 2117–2124.

McFadden, D. L. 1976. Quantal choice analysis: A survey. In Annals of Economic and Social Measurement, Volume 5, number 4, 363–390. NBER.

McKelvey, R. D.; and Palfrey, T. R. 1995. Quantal response equilibria for normal form games. Games and Economic Behavior 10(1): 6–38.

McKelvey, R. D.; and Palfrey, T. R. 1998. Quantal response equilibria for extensive form games. Experimental Economics 1(1): 9–41.

Milec, D.; Černý, J.; Lisý, V.; and An, B. 2020. Complexity and algorithms for exploiting quantal opponents in large two-player games. In Thirty-Fifth AAAI Conference on Artificial Intelligence.

Moravčík, M.; Schmid, M.; Burch, N.; Lisý, V.; Morrill, D.; Bard, N.; Davis, T.; Waugh, K.; Johanson, M.; and Bowling, M. 2017. DeepStack: Expert-level artificial intelligence in no-limit poker. Science.

Nesterov, Y. 2004. Introductory Lectures on Convex Optimization: A Basic Course. Applied Optimization 87. Springer US, 1st edition.

Nguyen, T. H.; Yang, R.; Azaria, A.; Kraus, S.; and Tambe, M. 2013. Analyzing the effectiveness of adversary modeling in security games. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, AAAI'13, 718–724. AAAI Press.

Powell, M. J. 2007. A view of algorithms for optimization without derivatives. Mathematics Today-Bulletin of the Institute of Mathematics and its Applications 43(5): 170–174.

Sinha, A.; Fang, F.; An, B.; Kiekintveld, C.; and Tambe, M. 2018. Stackelberg security games: Looking beyond a decade of success. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI'18), 5494–5501. IJCAI.

Von Stengel, B.; and Zamir, S. 2010. Leadership games with convex strategy sets. Games and Economic Behavior 69(2): 446–457.

Wright, J. R.; and Leyton-Brown, K. 2020. A formal separation between strategic and nonstrategic behavior. In Proceedings of the 21st ACM Conference on Economics and Computation, EC '20, 535–536. New York, NY, USA: Association for Computing Machinery.

Yadav, A.; Chan, H.; Xin Jiang, A.; Xu, H.; Rice, E.; and Tambe, M. 2016. Using social networks to aid homeless shelters: Dynamic influence maximization under uncertainty. In Proceedings of the 2016 International Conference on Autonomous Agents and Multiagent Systems, AAMAS '16, 740–748. Richland, SC: International Foundation for Autonomous Agents and Multiagent Systems.

Yang, R.; Kiekintveld, C.; Ordonez, F.; Tambe, M.; and John, R. 2011. Improving resource allocation strategy against human adversaries in security games. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, IJCAI'11, 458–464. IJCAI.

Yang, R.; Ordonez, F.; and Tambe, M. 2012. Computing optimal strategy against quantal response in security games. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems, 847–854.

Yano, M.; Penn, J. D.; Konidaris, G.; and Patera, A. T. 2013. Math, Numerics & Programming (for Mechanical Engineers). MIT.

Yechiam, E.; and Hochman, G. 2013. Losses as modulators of attention: Review and analysis of the unique effects of losses over gains. Psychological Bulletin 139(2): 497.