PRELIMINARY VERSION: DO NOT CITE. The AAAI Digital Library will contain the published version some time after the conference.

Computing Quantal Stackelberg Equilibrium in Extensive-Form Games

Jakub Černý¹, Viliam Lisý², Branislav Bošanský², Bo An¹
¹ Nanyang Technological University, Singapore
² AI Center, FEE, Czech Technical University in Prague, Czech Republic
[email protected], {viliam.lisy, branislav.bosansky}@fel.cvut.cz, [email protected]

Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Deployments of game-theoretic solution concepts in the real world have highlighted the necessity to consider human opponents' boundedly rational behavior. If subrationality is not addressed, the system can face significant losses in terms of expected utility. While there exist algorithms for computing optimal strategies to commit to when facing subrational decision-makers in one-shot interactions, these algorithms cannot be generalized for solving sequential scenarios because of the inherent curse of strategy-space dimensionality in sequential games and because humans act subrationally in each decision point separately. We study optimal strategies to commit to against subrational opponents in sequential games for the first time and make the following key contributions: (1) we prove the problem is NP-hard in general; (2) to enable further analysis, we introduce a non-fractional reformulation of the direct non-concave representation of the equilibrium; (3) we identify conditions under which the problem can be approximated in polynomial time in the size of the representation; (4) we show how an MILP can approximate the reformulation with a guaranteed bounded error; and (5) we experimentally demonstrate that our algorithm provides higher-quality results several orders of magnitude faster than a baseline method for general non-linear optimization.

Introduction

In recent years, game theory has achieved many groundbreaking advances both in large games (e.g., super-human AI in poker (Moravčík et al. 2017; Brown and Sandholm 2018)) and in practical applications (physical security (Sinha et al. 2018), wildlife protection (Fang et al. 2017), AI for social good (Yadav et al. 2016)). In deployed two-player real-world solutions, the aim is often to compute an optimal strategy to commit to while assuming that the other player plays the best response to this commitment, i.e., to find a Stackelberg Equilibrium (SE). While the traditional theory (e.g., the SE) assumes that all players behave entirely rationally, real-world deployments proved that taking into account the bounded rationality of the human players is necessary to provide higher-quality solutions (An et al. 2013; Fang et al. 2017). One of the most commonly used models of bounded rationality is Quantal Response (QR), which assumes that players choose better actions more often than worse actions instead of playing only the best action (McKelvey and Palfrey 1995). When facing a human opponent, we aim to find a strategy optimal against QR, termed Quantal Stackelberg Equilibrium (QSE). Relying on SE is not advisable in such situations, as the committing player can suffer huge losses when deploying rational strategies against boundedly rational counterparts (Yang et al. 2011).
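The canonical QR model referenced above is the logit form of McKelvey and Palfrey (1995): an action $a$ with expected utility $u(a)$ is chosen with probability proportional to $e^{\lambda u(a)}$, where $\lambda \geq 0$ controls the degree of rationality ($\lambda = 0$ yields uniform random play; $\lambda \to \infty$ approaches a best response). A minimal sketch of this model; the parameter name `lam` and the example utilities are illustrative only:

```python
import numpy as np

def logit_quantal_response(utilities: np.ndarray, lam: float) -> np.ndarray:
    """Logit QR: P(a) is proportional to exp(lam * u(a)).
    lam=0 gives uniform play; large lam approaches best response."""
    z = lam * utilities
    z -= z.max()          # stabilize the softmax numerically
    p = np.exp(z)
    return p / p.sum()

# Example: three actions with utilities 1.0, 0.5, 0.0.
# With lam=2 the best action is favored but not played exclusively.
print(logit_quantal_response(np.array([1.0, 0.5, 0.0]), lam=2.0))
```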
QSE was studied extensively in Stackelberg Security Games (SSGs) (Sinha et al. 2018), and the developed algorithms were used in many real-world applications. The SSG is a widely applicable game model, but it requires formulating the problem in terms of allocating limited resources to a set of targets. To solve real-world problems beyond SSGs, QSE was recently introduced also in the more general model of normal-form games (NFGs) (Černý et al. 2020). However, even NFGs model only one-shot interactions. To the best of our knowledge, QSE was never properly analyzed for extensive-form (i.e., sequential) games (EFGs), despite the fact that many real-world domains are sequential.

In this paper, we develop methods for finding QSE in EFGs. There are two main challenges that prevent us from using techniques from SSGs or NFGs directly. First, while there is only one decision point for the boundedly rational player in NFGs, any non-trivial EFG contains multiple causally linked decision points where the same player acts. Psychological studies show that humans prefer short-term, delayed heuristic decisions (Gigerenzer and Goldstein 1996) rather than long-term, premeditated decisions. This behavior arises especially in conflicts (Gray 1999) or when facing information overload caused by a large decision space (Malhotra 1982). Therefore, a natural assumption is that a subrational player acts according to a QR model in each decision point separately. This behavior cannot be modeled by an equivalent NFG. Second, contrary to NFGs, the number of the rational player's pure strategies in EFGs is exponential in the game size. Therefore, even if the QSE concepts coincided in NFGs and EFGs, applying the algorithms for NFGs directly would soon run into scalability problems.

We begin our analysis by showing that finding QSE is NP-hard and that the straightforward formulation of QSE in EFGs is a non-concave fractional problem that is difficult to optimize. Therefore, we derive an equivalent Dinkelbach-type formulation of QSE that does not contain any fraction and represents the rational player's strategy linearly in the size of the game. We use the Dinkelbach formulation to identify sufficient conditions for solving the problem in polynomial time. If the conditions are satisfied, the optimal solution can be found by gradient ascent. For other cases, we formulate a mixed-integer linear program (MILP) approximating the QSE through a linear relaxation of the QR model. We provide theoretical guarantees on the solution quality, depending on the number of segments used to linearize the QR function and the arising bilinear terms. In the experiments, we compare the direct formulation solved by non-linear optimization to our MILP reformulation. We show that in 3.5 hours, our algorithm computes solutions that the baseline cannot reach within three days. Moreover, the quality of our algorithm's solutions outmatches the baseline's solutions even for the instances computable by the baseline.
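The Dinkelbach-type formulation mentioned above follows the classical scheme for fractional programming: maximizing a ratio $N(x)/D(x)$ with $D(x) > 0$ is equivalent to finding the root $\lambda^*$ of $F(\lambda) = \max_x \{N(x) - \lambda D(x)\}$, which removes the fraction from the objective. The paper's actual objective is specific to QSE in EFGs; the sketch below only illustrates the generic iteration, with the inner solver `maximize_parametric` left as an assumed placeholder:

```python
def dinkelbach(maximize_parametric, tol=1e-9, max_iter=100):
    """Generic Dinkelbach iteration for maximizing N(x)/D(x) with D(x) > 0.

    maximize_parametric(lam) must return (x, N(x), D(x)) for an x that
    maximizes N(x) - lam * D(x); the inner solver is assumed given.
    """
    x, num, den = maximize_parametric(0.0)   # any feasible start works
    lam = num / den
    for _ in range(max_iter):
        x, num, den = maximize_parametric(lam)
        if num - lam * den < tol:            # F(lam) ~ 0: lam is the optimal ratio
            break
        lam = num / den                      # tighten the candidate ratio
    return x, lam
```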
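Similarly, the MILP's linear relaxation of the QR model hinges on replacing the exponential in the logit formula by a piecewise-linear function: more segments give a tighter approximation and thus a smaller error bound. A small sketch of the idea using uniform breakpoints; the interval and segment counts here are arbitrary choices for illustration, not the paper's construction:

```python
import numpy as np

def piecewise_linear_exp(lo: float, hi: float, k: int):
    """K-segment piecewise-linear approximation of exp on [lo, hi],
    interpolating linearly between uniform breakpoints."""
    xs = np.linspace(lo, hi, k + 1)
    ys = np.exp(xs)

    def approx(x):
        return np.interp(x, xs, ys)   # chords overestimate the convex exp

    # Worst-case gap on a fine grid, to see how the error shrinks with k.
    grid = np.linspace(lo, hi, 10_001)
    err = np.max(approx(grid) - np.exp(grid))
    return approx, err

for k in (4, 16, 64):
    _, err = piecewise_linear_exp(0.0, 3.0, k)
    print(f"{k:3d} segments: max error = {err:.5f}")
```

Because exp is convex, the chord interpolation always overestimates it, and the worst-case error shrinks roughly quadratically in the number of segments.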
Related Solution Concepts. Several other solution concepts study bounded rationality in EFGs. Perhaps the most well-known one is the model of Quantal Response Equilibrium (QRE) (McKelvey and Palfrey 1998). In QRE, all players act boundedly rationally, and no player has the ability to commit to a strategy. This is in contrast to QSE, where the entirely rational leader aims to exploit the boundedly rational opponent. A similar approach for EFGs was taken in (Basak et al. 2018), where only one of the players was considered rational. However, their goal was to evaluate the performance against human participants, and the approach lacked theoretical foundations. The computed strategies did not correspond to a well-defined solution concept and were empirically suboptimal. We define and study this concept in our concurrent paper (Milec et al. 2020). Besides rationality, in QSE the leader also benefits from the power of commitment. It is well known that the ability to commit increases the payoffs achievable by the leader (Von Stengel and Zamir 2010), and this trivially holds also against boundedly rational opponents.

[...] of the probabilities of chance actions in history $h$. The set of chance histories is denoted as $\mathcal{H}_c$.

Imperfect observation of player $i$ is modeled via information sets $\mathcal{I}_i$ that form a partition over $\{h \in \mathcal{H}_i\}$. Player $i$ cannot distinguish between histories in any information set $I \in \mathcal{I}_i$. We overload the notation and use $\mathcal{A}(I_i)$ to denote the actions available in each history in an information set $I_i$. Each action $a$ uniquely identifies the information set where it is available, referred to as $\mathcal{I}(a)$. We assume perfect recall, which means that players remember the history of their own actions and all information gained during the course of the game. As a consequence, all histories in any information set $I_i$ have the same history of actions for player $i$.

Pure strategies $\Pi_i$ assign one action for each $I \in \mathcal{I}_i$, and we assume the actions of a strategy $\pi \in \Pi_i$ can be enumerated as $a \in \pi$. We denote the set of leaves reachable by a strategy $\pi$ as $\mathcal{Z}(\pi)$. A behavioral strategy $\beta_i \in B_i$ is one probability distribution over actions $\mathcal{A}(I)$ for each $I \in \mathcal{I}_i$. For any pair of strategies $\beta \in B = (B_l, B_f)$, we use $u_i(\beta) = u_i(\beta_i, \beta_{-i})$ for the expected outcome of the game for player $i$ when players follow strategies $\beta$.

Strategies in EFGs with perfect recall can also be compactly represented using the sequence form (Koller, Megiddo, and von Stengel 1996). A sequence $\sigma_i \in \Sigma_i$ is an ordered list of actions taken by a single player $i$ in history $h$. $\emptyset$ stands for the empty sequence (i.e., a sequence with no actions). A sequence $\sigma_i \in \Sigma_i$ can be extended by a valid action $a$ taken by player $i$, written as $\sigma_i a = \sigma_i'$. We say that $\sigma_i$ is a prefix of $\sigma_i'$ ($\sigma_i \sqsubseteq \sigma_i'$) if $\sigma_i'$ is obtained by a finite number (possibly zero) of extensions of $\sigma_i$. We use $seq(h)$ to denote the sequence of actions of all players leading to $h$. By using a subscript $seq_i$ we refer to the subsequence of actions of player $i$. Because of perfect recall, we write $seq_i(I) = seq_i(h)$ for $I \in \mathcal{I}_i$, $h \in I$. We use the function $\mathrm{inf}_i(\sigma_i')$ to denote the information set in which the last action of the sequence $\sigma_i'$ is taken. For an empty sequence, the function $\mathrm{inf}_i(\emptyset)$ returns the information set of the root.
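To make the sequence-form machinery concrete: in the representation of Koller, Megiddo, and von Stengel (1996), a mixed strategy is encoded as a realization plan $r : \Sigma_i \to [0, 1]$ satisfying $r(\emptyset) = 1$ and, for every $I \in \mathcal{I}_i$, $\sum_{a \in \mathcal{A}(I)} r(seq_i(I) \cdot a) = r(seq_i(I))$. These linear constraints are what allow the rational player's strategy to be represented linearly in the size of the game, as used in the reformulation above. A small validity check under an assumed tuple-based encoding of sequences:

```python
def is_realization_plan(r, infosets, tol=1e-9):
    """Check sequence-form constraints: r[()] == 1 and, for each infoset,
    the probabilities of its extensions sum to the parent sequence's mass.

    r: dict mapping action tuples (sequences) to probabilities.
    infosets: list of (parent_sequence, available_actions) pairs.
    """
    if abs(r[()] - 1.0) > tol:
        return False
    for parent, actions in infosets:
        if abs(sum(r[parent + (a,)] for a in actions) - r[parent]) > tol:
            return False
    return True

# Toy game: one infoset at the root (actions l, r), a second infoset
# reached after l (actions x, y).
r = {(): 1.0, ("l",): 0.6, ("r",): 0.4,
     ("l", "x"): 0.6 * 0.3, ("l", "y"): 0.6 * 0.7}
print(is_realization_plan(r, [((), ["l", "r"]), (("l",), ["x", "y"])]))  # True
```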
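Finally, the assumption from the introduction, that the subrational player quantal-responds in each decision point separately, translates to a behavioral strategy whose distribution at every information set $I$ is a logit QR over $\mathcal{A}(I)$. The sketch below illustrates this per-infoset construction; the action values it consumes are assumed given, since how they arise is precisely the modeling question the paper addresses:

```python
import numpy as np

def qr_behavioral_strategy(infoset_action_values: dict, lam: float) -> dict:
    """Build a behavioral strategy by applying logit QR independently at
    each information set, given a value for every available action.

    infoset_action_values: maps an infoset id to {action: value}; the
    values themselves are an input assumption of this sketch.
    """
    beta = {}
    for infoset, values in infoset_action_values.items():
        acts = list(values)
        z = lam * np.array([values[a] for a in acts])
        z -= z.max()          # numerically stable softmax
        p = np.exp(z)
        beta[infoset] = dict(zip(acts, p / p.sum()))
    return beta

# Two decision points of the follower, each quantal-responded separately.
print(qr_behavioral_strategy({"I1": {"l": 1.0, "r": 0.0},
                              "I2": {"x": 0.2, "y": 0.8}}, lam=3.0))
```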
