Fictitious Self-Play in Extensive-Form Games

Johannes Heinrich [email protected] University College London, UK

Marc Lanctot [email protected] Google DeepMind, London, UK

David Silver [email protected] Google DeepMind, London, UK

Abstract

Fictitious play is a popular game-theoretic model of learning in games. However, it has received little attention in practical applications to large problems. This paper introduces two variants of fictitious play that are implemented in behavioural strategies of an extensive-form game. The first variant is a full-width process that is realization equivalent to its normal-form counterpart and therefore inherits its convergence guarantees. However, its computational requirements are linear in time and space rather than exponential. The second variant, Fictitious Self-Play, is a machine learning framework that implements fictitious play in a sample-based fashion. Experiments in imperfect-information poker games compare our approaches and demonstrate their convergence to approximate Nash equilibria.

1. Introduction

Fictitious play, introduced by Brown (1951), is a popular game-theoretic model of learning in games. In fictitious play, players repeatedly play a game, at each iteration choosing a best response to their opponents' average strategies. The average strategy profile of fictitious players converges to a Nash equilibrium in certain classes of games, e.g. two-player zero-sum and potential games. Fictitious play is a standard tool of game theory and has motivated substantial discussion and research on how Nash equilibria could be realized in practice (Brown, 1951; Fudenberg, 1998; Hofbauer & Sandholm, 2002; Leslie & Collins, 2006). Furthermore, it is a classic example of self-play learning from experience that has inspired artificial intelligence algorithms in games.

Despite the popularity of fictitious play to date, it has seen use in few large-scale applications, e.g. (Lambert III et al., 2005; McMahan & Gordon, 2007; Ganzfried & Sandholm, 2009; Heinrich & Silver, 2015). One possible reason for this is its reliance on a normal-form representation. While any extensive-form game can be converted into a normal-form equivalent (Kuhn, 1953), the resulting number of actions is typically exponential in the number of game states. The extensive form offers a much more efficient representation via behavioural strategies, whose number of parameters is linear in the number of information states. Hendon et al. (1996) introduce two definitions of fictitious play in behavioural strategies and show that each convergence point of their variants is a sequential equilibrium. However, these variants are not guaranteed to converge in imperfect-information games.

The first fictitious play variant that we introduce in this paper is full-width extensive-form fictitious play (XFP). It is realization equivalent to a normal-form fictitious play and therefore inherits its convergence guarantees. However, it can be implemented using only behavioural strategies and therefore its computational complexity per iteration is linear in the number of game states rather than exponential.

XFP and many other current methods of computational game theory (Sandholm, 2010) are full-width approaches and therefore require reasoning about every state in the game at each iteration. Apart from being given state-aggregating abstractions, which are usually hand-crafted from expert knowledge, the algorithms themselves do not generalise between strategically similar states. Leslie & Collins (2006) introduce generalised weakened fictitious play, which explicitly allows certain kinds of approximations in fictitious players' strategies. This motivates the use of approximate techniques like machine learning, which excel at learning and generalising from finite data.
The second variant that we introduce is Fictitious Self-Play (FSP), a machine learning framework that implements generalised weakened fictitious play in behavioural strategies and in a sample-based fashion. In FSP players repeatedly play a game and store their experience in memory. Instead of playing a best response, they act cautiously and mix between their best responses and average strategies. At each iteration players replay their experience of play against their opponents to compute an approximate best response. Similarly, they replay their experience of their own behaviour to learn a model of their average strategy. In more technical terms, FSP iteratively samples episodes of the game from self-play. These episodes constitute data sets that are used by reinforcement learning to compute approximate best responses and by supervised learning to compute perturbed models of average strategies.

1.1. Related work

Efficiently computing Nash equilibria of imperfect-information games has received substantial attention by researchers in computational game theory and artificial intelligence (Sandholm, 2010; Bowling et al., 2015). The most popular modern techniques are either optimization-based (Koller et al., 1996; Gilpin et al., 2007; Miltersen & Sørensen, 2010; Bosansky et al., 2014) or perform regret minimization (Zinkevich et al., 2007). Counterfactual regret minimization (CFR) is the first approach which essentially solved an imperfect-information game of real-world scale (Bowling et al., 2015). Being a self-play approach that uses regret minimization, it has some similarities to the utility-maximizing self-play approaches introduced in this paper.

Similar to full-width CFR, our full-width method's worst-case computational complexity per iteration is linear in the number of game states and it is well-suited for parallelization and distributed computing. Furthermore, given a long-standing conjecture (Karlin, 1959; Daskalakis & Pan, 2014), the convergence rate of fictitious play might be O(n^{-1/2}), which is of the same order as CFR's.

Similar to Monte Carlo CFR (Lanctot et al., 2009), FSP uses sampling to focus learning and computation on selectively sampled trajectories and thus breaks the curse of dimensionality. However, FSP only requires a black box simulator of the game. In particular, agents do not require any explicit knowledge about their opponents or even the game itself, other than what they experience in actual play. A similar property has been suggested possible for a form of outcome-sampling MCCFR, but remains unexplored.
2. Background

In this section we provide a brief overview over common game-theoretic representations of a game, fictitious play and reinforcement learning. For a more detailed exposition we refer the reader to (Myerson, 1991), (Fudenberg, 1998) and (Sutton & Barto, 1998).

2.1. Extensive-Form

Extensive-form games are a model of sequential interaction involving multiple agents. The representation is based on a game tree and consists of the following components: N = {1, ..., n} denotes the set of players. S is a set of states corresponding to nodes in a finite rooted game tree. For each state node s ∈ S the edges to its successor states define a set of actions A(s) available to a player or chance in state s. The player function P : S → N ∪ {c}, with c denoting chance, determines who is to act at a given state. Chance is considered to be a particular player that follows a fixed randomized strategy that determines the distribution of chance events at chance nodes. For each player i there is a corresponding set of information states U^i and an information function I^i : S → U^i that determines which states are indistinguishable for the player by mapping them on the same information state u ∈ U^i. Throughout this paper we assume games with perfect recall, i.e. each player's current information state u^i_k implies knowledge of the sequence of his information states and actions, u^i_1, a^i_1, u^i_2, a^i_2, ..., u^i_k, that led to this information state. Finally, R : S → R^n maps terminal states to a vector whose components correspond to each player's payoff.

A player's behavioural strategy, π^i(u) ∈ ∆(A(u)), ∀u ∈ U^i, determines a probability distribution over actions given an information state, and ∆^i_b is the set of all behavioural strategies of player i. A strategy profile π = (π^1, ..., π^n) is a collection of strategies for all players. π^{-i} refers to all strategies in π except π^i. Based on the game's payoff function R, R^i(π) is the expected payoff of player i if all players follow the strategy profile π. The set of best responses of player i to their opponents' strategies π^{-i} is b^i(π^{-i}) = argmax_{π^i ∈ ∆^i_b} R^i(π^i, π^{-i}). For ε > 0, b^i_ε(π^{-i}) = {π^i ∈ ∆^i_b : R^i(π^i, π^{-i}) ≥ R^i(b^i(π^{-i}), π^{-i}) − ε} defines the set of ε-best responses to the strategy profile π^{-i}. A Nash equilibrium of an extensive-form game is a strategy profile π such that π^i ∈ b^i(π^{-i}) for all i ∈ N. An ε-Nash equilibrium is a strategy profile π such that π^i ∈ b^i_ε(π^{-i}) for all i ∈ N.

2.2. Normal-Form

An extensive-form game induces an equivalent normal-form game as follows. For each player i ∈ N their deterministic strategies, ∆^i_p ⊂ ∆^i_b, define a set of normal-form actions, called pure strategies. Restricting the extensive-form payoff function R to pure strategy profiles yields a payoff function in the normal-form game.

Each pure strategy can be interpreted as a full game plan that specifies deterministic actions for all situations that a player might encounter. Before playing an iteration of the game each player chooses one of their available plans and commits to it for the iteration. A mixed strategy Π^i for player i is a probability distribution over their pure strategies. Let ∆^i denote the set of all mixed strategies available to player i. A mixed strategy profile Π ∈ ×_{i∈N} ∆^i specifies a mixed strategy for each player. Finally, R^i : ×_{i∈N} ∆^i → R determines the expected payoff of player i given a mixed strategy profile.

Throughout this paper, we use small Greek letters for behavioural strategies of the extensive form and large Greek letters for pure and mixed strategies of a game's normal form.
2.3. Realization-equivalence

The sequence-form (Koller et al., 1994; Von Stengel, 1996) of a game decomposes players' strategies into sequences of actions and probabilities of realizing these sequences. These realization probabilities provide a link between behavioural and mixed strategies.

For any player i ∈ N of a perfect-recall extensive-form game, each of their information states u^i ∈ U^i uniquely defines a sequence σ_{u^i} of actions that the player is required to take in order to reach information state u^i. Let Σ^i = {σ_u : u ∈ U^i} denote the set of such sequences of player i. Furthermore, let σ_u a denote the sequence that extends σ_u with action a.

Definition 1. A realization plan of player i ∈ N is a function, x : Σ^i → [0, 1], such that x(∅) = 1 and, for each u ∈ U^i, x(σ_u) = Σ_{a ∈ A(u)} x(σ_u a).

A behavioural strategy π induces a realization plan x_π(σ_u) = Π_{(u', a) ∈ σ_u} π(u', a), where the notation (u', a) disambiguates actions taken at different information states. Similarly, a realization plan induces a behavioural strategy, π(u, a) = x(σ_u a) / x(σ_u), where π is defined arbitrarily at information states that are never visited, i.e. when x(σ_u) = 0. As a pure strategy is just a deterministic behavioural strategy, it has a realization plan with binary values. As a mixed strategy is a convex combination of pure strategies, Π = Σ_i w_i Π_i, its realization plan is a similarly weighted convex combination of the pure strategies' realization plans, x_Π = Σ_i w_i x_{Π_i}.

The following definition and theorems connect an extensive-form game's behavioural strategies with mixed strategies of the equivalent normal-form representation.

Definition 2. Two strategies π_1 and π_2 of a player are realization-equivalent if for any fixed strategy profile of the other players both strategies, π_1 and π_2, define the same probability distribution over the states of the game.

Theorem 3 (compare also (Von Stengel, 1996)). Two strategies are realization-equivalent if and only if they have the same realization plan.

Theorem 4 (Kuhn's Theorem (Kuhn, 1953)). For a player with perfect recall, any mixed strategy is realization-equivalent to a behavioural strategy, and vice versa.

2.4. Fictitious Play

In this work we use a general version of fictitious play that is due to Leslie & Collins (2006) and based on the work of Benaïm et al. (2005). It has similar convergence guarantees as common fictitious play, but allows for approximate best responses and perturbed average strategy updates.

Definition 5. A generalised weakened fictitious play is a process of mixed strategies, {Π_t}, Π_t ∈ ×_{i∈N} ∆^i, s.t.

    Π^i_{t+1} ∈ (1 − α_{t+1})Π^i_t + α_{t+1}(b^i_{ε_t}(Π^{-i}_t) + M^i_{t+1}), ∀i ∈ N,

with α_t → 0 and ε_t → 0 as t → ∞, Σ_{t=1}^∞ α_t = ∞, and {M_t} a sequence of perturbations that satisfies, for any T > 0,

    lim_{t→∞} sup_k { ‖ Σ_{i=t}^{k−1} α_{i+1} M_{i+1} ‖ : Σ_{i=t}^{k−1} α_{i+1} ≤ T } = 0.

Original fictitious play (Brown, 1951; Robinson, 1951) is a generalised weakened fictitious play with stepsize α_t = 1/t, ε_t = 0 and M_t = 0 ∀t. Generalised weakened fictitious play converges in certain classes of games that are said to have the fictitious play property (Leslie & Collins, 2006), e.g. two-player zero-sum and potential games.
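As a simple illustration of Definition 5 in its original form (ε_t = 0, M_t = 0, α_t = 1/t), the following sketch runs normal-form fictitious play in a small zero-sum matrix game. The payoff matrix (Rock-Paper-Scissors) and variable names are illustrative choices, not taken from the paper.

    import numpy as np

    # Original fictitious play (Definition 5 with epsilon_t = 0, M_t = 0,
    # alpha_t = 1/t) in a two-player zero-sum matrix game; illustrative only.
    A = np.array([[0., -1., 1.],
                  [1., 0., -1.],
                  [-1., 1., 0.]])   # Rock-Paper-Scissors payoffs for player 1

    pi1 = np.ones(3) / 3            # average mixed strategy of player 1
    pi2 = np.ones(3) / 3            # average mixed strategy of player 2

    for t in range(1, 10001):
        # Exact best responses to the opponents' current average strategies.
        b1 = np.zeros(3); b1[np.argmax(A @ pi2)] = 1.0
        b2 = np.zeros(3); b2[np.argmax(-A.T @ pi1)] = 1.0
        # Averaging step: Pi_{t+1} = (1 - alpha) Pi_t + alpha B_{t+1}.
        alpha = 1.0 / (t + 1)
        pi1 = (1 - alpha) * pi1 + alpha * b1
        pi2 = (1 - alpha) * pi2 + alpha * b2

    print(pi1, pi2)   # both averages approach the uniform equilibrium (1/3, 1/3, 1/3)

In this zero-sum example the average strategies converge even though the best responses themselves keep cycling, which is exactly the behaviour the averaging in Definition 5 is designed to exploit.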

2.5. Reinforcement Learning

Reinforcement learning (Sutton & Barto, 1998) agents typically learn to maximize their expected future reward from interaction with an environment. The environment is usually modelled as a Markov decision process (MDP). A MDP consists of a set of Markov states S, a set of actions A, a transition function P^a_{ss'} and a reward function R^a_s. The transition function determines the probability of transitioning to state s' after taking action a in state s. The reward function determines an agent's reward after taking action a in state s. An agent behaves according to a policy that specifies a distribution over actions at each state.

Many reinforcement learning algorithms learn from sequential experience in the form of transition tuples, (s_t, a_t, r_{t+1}, s_{t+1}), where s_t is the state at time t, a_t is the action chosen in that state, r_{t+1} the reward received thereafter and s_{t+1} the next state that the agent transitioned to. An agent is learning on-policy if it gathers these transition tuples by following its own policy. In the off-policy setting an agent is learning from experience of another agent or another policy.

Q-learning (Watkins & Dayan, 1992) is a popular off-policy reinforcement learning method that can be used to learn an optimal policy of a MDP. Fitted Q Iteration (FQI) (Ernst et al., 2005) is a batch reinforcement learning method that applies Q-learning to a data set of transition tuples from a MDP.
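For concreteness, the sketch below shows the tabular Q-learning update applied to a batch of transition tuples. It is a generic illustration of the off-policy update described above, with assumed data structures; it is not the calibrated FQI learner used later in the experiments.

    from collections import defaultdict

    # Tabular Q-learning update on a batch of transition tuples (s, a, r, s_next).
    # Terminal transitions are marked with s_next = None. Illustrative only.
    def q_learning_update(Q, transitions, actions, alpha=0.1, gamma=1.0):
        for s, a, r, s_next in transitions:
            # Bootstrap from the best estimated action value at the next state.
            target = r if s_next is None else r + gamma * max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
        return Q

    Q = defaultdict(float)
    batch = [("u0", "call", 0.0, "u1"), ("u1", "fold", -1.0, None)]
    Q = q_learning_update(Q, batch, actions=["fold", "call", "raise"])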
3. Extensive-Form Fictitious Play

In this section, we derive a process in behavioural strategies that is realization equivalent to normal-form fictitious play. The following lemma shows how a mixture of normal-form strategies can be implemented by a weighted combination of their realization equivalent behavioural strategies.

Lemma 6. Let π and β be two behavioural strategies, Π and B two mixed strategies that are realization equivalent to π and β, and λ_1, λ_2 ∈ R_{≥0} with λ_1 + λ_2 = 1. Then for each information state u ∈ U,

    μ(u) = π(u) + (λ_2 x_β(σ_u)) / (λ_1 x_π(σ_u) + λ_2 x_β(σ_u)) · (β(u) − π(u))

defines a behavioural strategy μ at u and μ is realization equivalent to the mixed strategy M = λ_1 Π + λ_2 B.

Theorem 7 presents a fictitious play in behavioural strategies that inherits the convergence results of generalised weakened fictitious play by realization-equivalence.

Theorem 7. Let π_1 be an initial behavioural strategy profile. The extensive-form process

    β^i_{t+1} ∈ b^i_{ε_{t+1}}(π^{-i}_t),

    π^i_{t+1}(u) = π^i_t(u) + (α_{t+1} x_{β^i_{t+1}}(σ_u)) / ((1 − α_{t+1}) x_{π^i_t}(σ_u) + α_{t+1} x_{β^i_{t+1}}(σ_u)) · (β^i_{t+1}(u) − π^i_t(u)),

for all players i ∈ N and all their information states u ∈ U^i, with α_t → 0 and ε_t → 0 as t → ∞, and Σ_{t=1}^∞ α_t = ∞, is realization-equivalent to a generalised weakened fictitious play in the normal-form and therefore the average strategy profile converges to a Nash equilibrium in all games with the fictitious play property.

Algorithm 1 implements XFP, the extensive-form fictitious play of Theorem 7. The initial average strategy profile, π_1, can be defined arbitrarily, e.g. uniform random. At each iteration the algorithm performs two operations. First it computes a best response profile to the current average strategies. Secondly it uses the best response profile to update the average strategy profile. The first operation's computational requirements are linear in the number of game states. For each player the second operation can be performed independently from their opponents and requires work linear in the player's number of information states. Furthermore, if a deterministic best response is used, the realization weights of Theorem 7 allow ignoring all but one subtree at each of the player's decision nodes.

Algorithm 1 Full-width extensive-form fictitious play

    function FICTITIOUSPLAY(Γ)
      Initialize π_1 arbitrarily
      j ← 1
      while within computational budget do
        β_{j+1} ← COMPUTEBRS(π_j)
        π_{j+1} ← UPDATEAVGSTRATEGIES(π_j, β_{j+1})
        j ← j + 1
      end while
      return π_j
    end function

    function COMPUTEBRS(π)
      Recursively parse the game's state tree to compute a best response strategy profile, β ∈ b(π).
      return β
    end function

    function UPDATEAVGSTRATEGIES(π_j, β_{j+1})
      Compute an updated strategy profile π_{j+1} according to Theorem 7.
      return π_{j+1}
    end function
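To make the UPDATEAVGSTRATEGIES step concrete, the sketch below applies the Theorem 7 update at each information state, given precomputed realization weights x_π(σ_u) and x_β(σ_u). The dictionary-based data structures and numbers are illustrative assumptions, not the paper's implementation.

    # Theorem 7 update applied per information state. avg and br map an
    # information state u to a dict of action probabilities; x_avg and x_br
    # hold the realization weights x_pi(sigma_u) and x_beta(sigma_u).
    def update_avg_strategy(avg, br, x_avg, x_br, alpha):
        new_avg = {}
        for u, pi_u in avg.items():
            num = alpha * x_br[u]
            den = (1.0 - alpha) * x_avg[u] + alpha * x_br[u]
            w = num / den if den > 0 else 0.0   # weight of the best response at u
            new_avg[u] = {a: p + w * (br[u][a] - p) for a, p in pi_u.items()}
        return new_avg

    # Example with a single information state "u" and two actions.
    avg = {"u": {"a": 0.5, "b": 0.5}}
    br = {"u": {"a": 1.0, "b": 0.0}}
    print(update_avg_strategy(avg, br, x_avg={"u": 0.5}, x_br={"u": 1.0}, alpha=0.25))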

4. Fictitious Self-Play

FSP is a machine learning framework that implements generalised weakened fictitious play in a sample-based fashion and in behavioural strategies. XFP suffers from the curse of dimensionality. At each iteration, computation needs to be performed at all states of the game irrespective of their relevance. However, generalised weakened fictitious play only requires approximate best responses and even allows some perturbations in the updates.

FSP replaces the two fictitious play operations, best response computation and average strategy updating, with machine learning algorithms. Approximate best responses are learned by reinforcement learning from play against the opponents' average strategies. The average strategy updates can be formulated as a supervised learning task, where each player learns a transition model of their own behaviour. We introduce reinforcement learning-based best response computation in section 4.1 and present supervised learning-based strategy updates in section 4.2.

4.1. Reinforcement Learning

Consider an extensive-form game and some strategy profile π. Then for each player i ∈ N the strategy profile of their opponents, π^{-i}, defines an MDP, M(π^{-i}) (Silver & Veness, 2010; Greenwald et al., 2013). Player i's information states define the states of the MDP. The MDP's dynamics are given by the rules of the extensive-form game, the chance function and the opponents' fixed strategy profile. The rewards are given by the game's payoff function. An ε-optimal policy of the MDP, M(π^{-i}), therefore yields an ε-best response of player i to the strategy profile π^{-i}. Thus the iterative computation of approximate best responses can be formulated as a sequence of MDPs to solve approximately, e.g. by applying reinforcement learning to samples of experience from the respective MDPs. In particular, to approximately solve the MDP M(π^{-i}) we sample player i's experience from their opponents' strategy profile π^{-i}. Player i's strategy should ensure sufficient exploration of the MDP but can otherwise be arbitrary if an off-policy reinforcement learning method is used, e.g. Q-learning (Watkins & Dayan, 1992).

While generalised weakened fictitious play allows ε_k-best responses at iteration k, it requires that the deficit ε_k vanishes asymptotically, i.e. ε_k → 0 as k → ∞. Learning such a valid sequence of ε_k-optimal policies of a sequence of MDPs would be hard if these MDPs were unrelated and knowledge could not be transferred. However, in fictitious play the MDP sequence has a particular structure. The average strategy profile at iteration k is realization-equivalent to a linear combination of two mixed strategies, Π_k = (1 − α_k)Π_{k−1} + α_k B_k. Thus, in a two-player game, the MDP M(π^{-i}_k) is structurally equivalent to an MDP that initially picks between M(π^{-i}_{k−1}) and M(β^{-i}_k) with probability (1 − α_k) and α_k respectively. Due to this similarity between subsequent MDPs it is possible to transfer knowledge. The following corollary bounds the increase of the optimality deficit when transferring an approximate solution between subsequent MDPs in a fictitious play process.

Corollary 8. Let Γ be a two-player zero-sum extensive-form game with maximum payoff range R̄ = max_{π∈∆} R^1(π) − min_{π∈∆} R^1(π). Consider a fictitious play process in this game. Let Π_k be the average strategy profile at iteration k, B_{k+1} a profile of ε_{k+1}-best responses to Π_k, and Π_{k+1} = (1 − α_{k+1})Π_k + α_{k+1}B_{k+1} the usual fictitious play update for some stepsize α_{k+1} ∈ (0, 1). Then for each player i, B^i_{k+1} is an [ε_{k+1} + α_{k+1}(R̄ − ε_{k+1})]-best response to Π_{k+1}.

This bounds the absolute amount by which reinforcement learning needs to improve the best response profile to achieve a monotonic decay of the optimality gap ε_k. However, ε_k only needs to decay asymptotically. Given α_k → 0 as k → ∞, the bound suggests that in practice a finite amount of learning per iteration might be sufficient to achieve asymptotic improvement of best responses.

In this work we use FQI to learn from data sets of sampled experience. At each iteration k, FSP samples episodes of the game from self-play. Each agent i adds its experience to its replay memory, M^i_RL. The data is stored in the form of episodes of transition tuples, (u_t, a_t, r_{t+1}, u_{t+1}). Each episode, E = {(u_t, a_t, r_{t+1}, u_{t+1})}_{0≤t≤T}, T ∈ N, contains a finite number of transitions. We use a finite memory of fixed size. If the memory is full, new episodes replace existing episodes in a first-in-first-out order. Using a finite memory and updating it incrementally can bias the underlying distribution that the memory approximates. We want to achieve a memory composition that approximates the distribution of play against the opponents' average strategy profile. This can be achieved by using a self-play strategy profile that properly mixes between the agents' average and best response strategy profiles.

4.2. Supervised Learning

Consider the point of view of a particular player i who wants to learn a behavioural strategy π that is realization-equivalent to a convex combination of their own normal-form strategies, Π = Σ_{k=1}^n w_k B_k, Σ_{k=1}^n w_k = 1. This task is equivalent to learning a model of the player's behaviour when it is sampled from Π. Lemma 6 describes the behavioural strategy π explicitly, while in a sample-based setting we use samples from the realization-equivalent strategy Π to learn an approximation of π. Recall that we can sample from Π by sampling from each constituent B_k with probability w_k and if B_k itself is a mixed strategy then it is a probability distribution over pure strategies.

Corollary 9. Let {B_k}_{1≤k≤n} be mixed strategies of player i, Π = Σ_{k=1}^n w_k B_k, Σ_{k=1}^n w_k = 1, a convex combination of these mixed strategies and μ^{-i} a completely mixed sampling strategy profile that defines the behaviour of player i's opponents. Then for each information state u ∈ U^i the probability distribution of player i's behaviour at u induced by sampling from the strategy profile (Π, μ^{-i}) defines a behavioural strategy π at u and π is realization-equivalent to Π.

Hence, the behavioural strategy π can be learned approximately from a data set consisting of trajectories sampled from (Π, μ^{-i}). In fictitious play, at each iteration n we want to learn the average mixed strategy profile Π_{n+1} = n/(n+1) Π_n + 1/(n+1) B_{n+1}. Both Π_n and B_{n+1} are available at iteration n and we can therefore apply Corollary 9 to learn for each player i an approximation of a behavioural strategy π^i_{n+1} that is realization-equivalent to Π^i_{n+1}. Let π̃^i_{n+1} be such an approximation and Π̃^i_{n+1} its normal-form equivalent. Then π̃^i_{n+1} is realization-equivalent to a perturbed fictitious play update in normal form, Π^i_{n+1} + 1/(n+1) M^i_{n+1}, where M^i_{n+1} = (n + 1)(Π̃^i_{n+1} − Π^i_{n+1}) is a normal-form perturbation resulting from the estimation error.

In this work we restrict ourselves to simple models that count the number of times an action has been taken at an information state or alternatively accumulate the respective strategies' probabilities of taking each action. These models can be incrementally updated with samples from β_k at each iteration k. A model update requires a set of sampled tuples, (u^i_t, ρ^i_t), where u^i_t is agent i's information state and ρ^i_t is the policy that the agent pursued at this state when this experience was sampled. For each tuple (u_t, ρ_t) the update accumulates each action's weight at the information state,

    ∀a ∈ A(u_t): N(u_t, a) ← N(u_t, a) + ρ_t(a),
    ∀a ∈ A(u_t): π(u_t, a) ← N(u_t, a) / N(u_t).

In order to constitute an unbiased approximation of an average of best responses, (1/k) Σ_{j=1}^k B_j, we need to accumulate the same number of sampled episodes from each B_j and these need to be sampled against the same fixed opponent strategy profile μ^{-i}. However, we suggest using the average strategy profile π^{-i}_k as the sampling distribution μ^{-i}. Sampling against π^{-i}_k has the benefit of focusing the updates on states that are more likely in the current strategy profile. When collecting samples incrementally, the use of a changing sampling distribution π^{-i}_k can introduce bias. However, in fictitious play π^{-i}_k is changing more slowly over time and thus this bias should decay over time.

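The counting model described above amounts to a few lines of bookkeeping. The following sketch shows an incremental update from sampled (information state, policy) tuples; the dictionary-based data structures are assumptions for illustration, not the paper's code.

    from collections import defaultdict

    # Counting model for the average strategy: accumulate each action's
    # weight rho_t(a) at the visited information state and normalise.
    N = defaultdict(float)            # N[(u, a)] = accumulated weight of action a at u
    seen_actions = defaultdict(set)   # actions observed at each information state

    def update_average_model(samples):
        """samples: iterable of (u_t, rho_t), with rho_t mapping actions to probabilities."""
        for u, rho in samples:
            for a, p in rho.items():
                N[(u, a)] += p
                seen_actions[u].add(a)

    def average_strategy(u):
        total = sum(N[(u, a)] for a in seen_actions[u])
        return {a: N[(u, a)] / total for a in seen_actions[u]}

    update_average_model([("u0", {"fold": 0.2, "call": 0.8})])
    print(average_strategy("u0"))   # -> {'fold': 0.2, 'call': 0.8}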
4.3. Algorithm

This section introduces a general algorithm of FSP. Each iteration of the algorithm can be divided into three steps. Firstly, episodes of the game are simulated from the agents' strategies. The resulting experience or data is stored in two types of agent memory. One type stores experience of an agent's opponents' behaviour. The other type stores the agent's own behaviour. Secondly, each agent computes an approximate best response by reinforcement learning off-policy from its memory of its opponents' behaviour. Thirdly, each agent updates its own average strategy by supervised learning from the memory of its own behaviour.

Algorithm 2 presents the general framework of FSP. It does not specify particular off-policy reinforcement learning or supervised learning techniques, as these can be instantiated by a variety of algorithms. However, as discussed in the previous sections, in order to constitute a valid fictitious play process both machine learning operations require data sampled from specific combinations of strategies. The function GENERATEDATA uses a sampling strategy profile σ_k = (1 − η_k)π_{k−1} + η_k β_k, where π_{k−1} is the average strategy profile of iteration k−1 and β_k is the best response profile of iteration k. The parameter η_k mixes between these strategy profiles. In particular, choosing η_k = 1/k results in σ_k matching the average strategy profile π_k of a fictitious play process with stepsize α_k = 1/k. At iteration k, for each player i, we would simulate n episodes of play from (π^i_k, π^{-i}_k) and m episodes from (β^i_k, π^{-i}_k). All episodes can be used by reinforcement learning, as they constitute experience against π^{-i}_k that the agent wants to best respond to. For supervised learning, the sources of data need to be weighted to achieve a correct target distribution. On the one hand sampling from (π^i_k, π^{-i}_k) results in the correct target distribution. On the other hand, when performing incremental updates only episodes from (β^i_k, π^{-i}_k) might be used. Additional details of data generation, e.g. with non-full or finite memories, are discussed in the experiments.

Algorithm 2 General Fictitious Self-Play

    function FICTITIOUSSELFPLAY(Γ, n, m)
      Initialize π_1 completely mixed
      β_2 ← π_1
      j ← 2
      while within computational budget do
        η_j ← MIXINGPARAMETER(j)
        D ← GENERATEDATA(π_{j−1}, β_j, n, m, η_j)
        for each player i ∈ N do
          M^i_RL ← UPDATERLMEMORY(M^i_RL, D^i)
          M^i_SL ← UPDATESLMEMORY(M^i_SL, D^i)
          β^i_{j+1} ← REINFORCEMENTLEARNING(M^i_RL)
          π^i_j ← SUPERVISEDLEARNING(M^i_SL)
        end for
        j ← j + 1
      end while
      return π_{j−1}
    end function

    function GENERATEDATA(π, β, n, m, η)
      σ ← (1 − η)π + ηβ
      D ← n episodes {t_k}_{1≤k≤n}, sampled from strategy profile σ
      for each player i ∈ N do
        D^i ← m episodes {t^i_k}_{1≤k≤m}, sampled from strategy profile (β^i, σ^{-i})
        D^i ← D^i ∪ D
      end for
      return {D^i}_{i∈N}
    end function

For clarity, the algorithm presents data collection from a centralized point of view. In practice, this can be thought of as a self-play process where each agent is responsible to remember its own experience. Also, extensions to an online and on-policy learning setting are possible, but have been omitted as they algorithmically intertwine the reinforcement learning and supervised learning operations.
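As a sketch of the data-generation step, the snippet below mixes the average and best response strategies into the sampling profile σ = (1 − η)π + ηβ and samples episodes from a black-box simulator. The simulator interface (play_episode) and all helper names are assumptions for illustration, not part of the paper.

    # GENERATEDATA under an assumed black-box simulator interface.
    # play_episode(profile) is assumed to return one episode of per-player
    # transition tuples by querying the simulator.
    def mix_strategies(pi, beta, eta):
        # sigma(u) = (1 - eta) * pi(u) + eta * beta(u) at every information state u
        return {u: {a: (1 - eta) * p + eta * beta[u][a] for a, p in dist.items()}
                for u, dist in pi.items()}

    def generate_data(pi, beta, n, m, eta, play_episode, players):
        sigma = {i: mix_strategies(pi[i], beta[i], eta) for i in players}
        shared = [play_episode(sigma) for _ in range(n)]        # n episodes from sigma
        data = {}
        for i in players:
            br_profile = {**sigma, i: beta[i]}                  # (beta^i, sigma^{-i})
            own = [play_episode(br_profile) for _ in range(m)]  # m episodes for player i
            data[i] = shared + own
        return data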

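Putting the pieces together, the following skeleton mirrors the main loop of Algorithm 2, with the learners and data generation passed in as callables since the framework leaves them to be instantiated by arbitrary reinforcement and supervised learning methods. It flattens the per-player loop for brevity and is a sketch under these assumptions, not the authors' implementation.

    # Skeleton of the FSP iteration structure of Algorithm 2 (single-memory view).
    def fictitious_self_play(n_iters, generate_data, update_rl_memory, update_sl_memory,
                             reinforcement_learning, supervised_learning,
                             initial_avg, mixing_parameter):
        avg = initial_avg                 # pi_1, completely mixed
        br = initial_avg                  # beta_2 <- pi_1
        rl_memory, sl_memory = None, None
        for j in range(2, n_iters + 2):
            eta = mixing_parameter(j)
            data = generate_data(avg, br, eta)
            rl_memory = update_rl_memory(rl_memory, data)   # opponents' behaviour
            sl_memory = update_sl_memory(sl_memory, data)   # own behaviour
            br = reinforcement_learning(rl_memory)          # approximate best response
            avg = supervised_learning(sl_memory)            # average strategy model
        return avg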
5. Experiments

We evaluate the introduced algorithms in two parameterized zero-sum imperfect-information games. Leduc Hold'em (Southey et al., 2005) is a small poker variant that is similar to Texas Hold'em. With two betting rounds, a limit of two raises per round and 6 cards in the deck it is however much smaller. River poker is a game that is strategically equivalent to the last betting round of Limit Texas Hold'em. It is parameterized by a probability distribution over possible private holdings, the five publicly shared community cards, the initial potsize and a limit on the number of raises. The distributions over private holdings could be considered the players' beliefs that they have formed in the first three rounds of a Texas Hold'em game. At the beginning of the game, a private holding is sampled for each player from their respective distribution and the game progresses according to the rules of Texas Hold'em.

In a two-player zero-sum game, the exploitability of a strategy profile, π, is defined as δ = R^1(b^1(π^2), π^2) + R^2(π^1, b^2(π^1)). An exploitability of δ yields at least a δ-Nash equilibrium. In our experiments, we used exploitability to measure learning performance.

5.1. Full-Width Extensive-Form Fictitious Play

We compared the effect of information-state dependent stepsizes, λ_{t+1} : U → [0, 1], on full-width extensive-form fictitious play updates, π_{t+1}(u) = π_t(u) + λ_{t+1}(u)(β_{t+1}(u) − π_t(u)), ∀u ∈ U, where β_{t+1} ∈ b(π_t) is a sequential best response and π_t is the iteratively updated average strategy profile. Stepsize λ^1_{t+1}(u) = 1/(t+1) yields the sequential extensive-form fictitious play introduced by Hendon et al. (1996). XFP is implemented by stepsize λ^2_{t+1}(u) = x_{β_{t+1}}(σ_u) / (t x_{π_t}(σ_u) + x_{β_{t+1}}(σ_u)).

The average strategies were initialized as follows. At each information state u, we drew the weight for each action from a uniform distribution and normalized the resulting strategy at u. We trained each algorithm for 400000 iterations and measured the average strategy profiles' exploitability after each iteration. The experiment was repeated 5 times and figure 1 plots the resulting learning curves. The results show noisy behaviour of each fictitious play process that used stepsize λ^1. Each XFP instance reliably reached a much better approximate Nash equilibrium.

Figure 1. Learning curves of extensive-form fictitious play processes in Leduc Hold'em, for stepsizes λ^1 and λ^2 (exploitability versus iterations).
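For intuition about the exploitability metric δ defined above, the following sketch computes the analogous quantity for a two-player zero-sum game in normal form, where best responses reduce to simple maximisations. The extensive-form games in the experiments require a tree-walk best response instead, so this matrix version is only an illustration.

    import numpy as np

    # Exploitability for a zero-sum NORMAL-FORM game with payoff matrix A for
    # player 1 (player 2 receives -A); illustrative only.
    def exploitability(A, pi1, pi2):
        br1_value = np.max(A @ pi2)          # R^1(b^1(pi2), pi2)
        br2_value = np.max(-A.T @ pi1)       # R^2(pi1, b^2(pi1))
        return br1_value + br2_value

    A = np.array([[0., -1., 1.],
                  [1., 0., -1.],
                  [-1., 1., 0.]])
    uniform = np.ones(3) / 3
    print(exploitability(A, uniform, uniform))   # 0.0 at the equilibrium of RPS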
5.2. Fictitious Self-Play

We tested the performance of FSP with a fixed computational budget per iteration and evaluated how it scales to larger games in comparison with XFP.

We instantiated FSP's reinforcement learning method with FQI and updated the average strategy profiles with a simple counting model. We manually calibrated FSP in 6-card Leduc Hold'em and used this calibration in all experiments and games. In particular, at each iteration, k, FQI replayed 30 episodes with learning stepsize 0.05 / (1 + 0.003 √k). It returned a policy that at each information state was determined by a Boltzmann distribution over the estimated Q-values, using temperature (1 + 0.02 √k)^{-1}. The state of FQI was maintained across iterations, i.e. it was initialized with the parameters and learned Q-values of the previous iteration. For each player i, FSP used a replay memory, M^i_RL, with space for 40000 episodes. Once this memory was full, FSP sampled 2 episodes from strategy profile σ and 1 episode from (β^i, σ^{-i}) at each iteration for each player respectively, i.e. we set n = 2 and m = 1 in algorithm 2.

Because we used finite memories and only partial replacement of episodes we had to make some adjustments to approximately correct for the expected target distributions. For a non-full or infinite memory, a correct target distribution can be achieved by accumulating samples from each opponent best response. Thus, for a non-full memory we collected all episodes from profiles (β^i, σ^{-i}) in alternating self-play, where agent i stores these in its supervised learning memory, M^i_SL, and player −i stores them in its non-full reinforcement learning memory, M^{-i}_RL. However, when partially replacing a full reinforcement learning memory, M^i_RL, that is trying to approximate experience against the opponent's Π^{-i}_k = (1 − α_k)Π^{-i}_{k−1} + α_k B^{-i}_k, with samples from Π^{-i}_k, we would underweight the amount of experience against the opponent's recent best response, B^{-i}_k. To approximately correct for this we set the mixing parameter to η_k = α_k / (γ p), where p = (n + m) / MemorySize is the proportion of memory that is replaced and the constant γ controls how many iterations constitute one formal fictitious play iteration. In all our experiments, we used γ = 10.

Both algorithms' average strategy profiles were initialized to a uniform distribution at each information state. Each algorithm trained for 300 seconds. The average strategy profiles' exploitability was measured at regular intervals.

Figure 2. Comparison of XFP and FSP:FQI in 6-card and 60-card Leduc Hold'em (exploitability versus time in seconds). The inset presents the results using a logarithmic scale.

Figure 2 compares both algorithms' performance in Leduc Hold'em. While XFP clearly outperformed FSP in the small 6-card variant, in the larger 60-card Leduc Hold'em it learned more slowly. This might be expected, as the computation per iteration of XFP scales linearly in the squared number of cards. FSP, on the other hand, operates only in information states whose number scales linearly with the number of cards in the game.

We compared the algorithms in two instances of River poker that were initialized with a potsize of 6, a maximum number of raises of 1 and a fixed set of community cards. The first instance assumes uninformed, uniform player beliefs that assign equal probability to each possible holding. The second instance assumes that players have inferred beliefs over their opponents' holdings. An expert poker player provided us with belief distributions that model a real Texas Hold'em scenario. The distributions assume that player 1 holds one of 16% of the possible holdings with probability 0.99 and a uniform random holding with probability 0.01. Similarly, player 2 is likely to hold one of 31% of the holdings. The exact distributions and scenario are provided in the appendix.

Figure 3. Comparison of XFP and FSP:FQI in River poker with uniform and defined beliefs (exploitability versus time in seconds). The inset presents the results using a logarithmic scale for both axes.

According to figure 3, FSP improved its average strategy profile much faster than the full-width variant in both instances of River poker. In River poker with defined beliefs, FSP obtained an exploitability of 0.11 after 30 seconds, whereas after 300 seconds XFP was exploitable by more than 0.26. Furthermore, XFP's performance was similar in both instances of River poker, whereas FSP lowered its exploitability by more than 40%. River poker has about 10 million states but only around 4000 information states. For a similar reason as in the Leduc Hold'em experiments, this might explain the overall better performance of FSP. Furthermore, the structure of the game assigns non-zero probability to each state of the game and thus the computational cost of XFP is the same for both instances of River poker. It performs computation at each state no matter how likely it is to occur. FSP on the other hand is guided by sampling and is therefore able to focus its computation on likely scenarios. This allows it to benefit from the additional structure introduced by the players' beliefs into the game.

6. Conclusion

We have introduced two fictitious play variants for extensive-form games. XFP is the first fictitious play algorithm that is entirely implemented in behavioural strategies while preserving convergence guarantees in games with the fictitious play property. FSP is a sample-based approach that implements generalised weakened fictitious play in a machine learning framework. While converging asymptotically to the correct updates at each iteration, it remains an open question whether guaranteed convergence can be achieved with a finite computational budget per iteration. However, we have presented some intuition why this might be the case and our experiments provide first empirical evidence of its performance in practice.

FSP is a flexible machine learning framework. Its experiential and utility-maximizing nature makes it an ideal domain for reinforcement learning, which provides a plethora of techniques to learn efficiently from sequential experience. Function approximation could provide automated abstraction and generalisation in large extensive-form games. Continuous-action reinforcement learning could learn best responses in continuous action spaces. FSP has therefore a lot of potential to scale to large and even continuous-action game-theoretic applications.

Acknowledgments

We would like to thank Georg Ostrovski, Peter Dayan, Rémi Munos and Joel Veness for insightful discussions and feedback. This research was supported by the UK Centre for Doctoral Training in Financial Computing and Google DeepMind.

References

Benaïm, Michel, Hofbauer, Josef, and Sorin, Sylvain. Stochastic approximations and differential inclusions. SIAM Journal on Control and Optimization, 44(1):328–348, 2005.

Bosansky, B., Kiekintveld, Christopher, Lisy, V., and Pechoucek, Michal. An exact double-oracle algorithm for zero-sum extensive-form games with imperfect information. Journal of Artificial Intelligence Research, pp. 829–866, 2014.

Bowling, Michael, Burch, Neil, Johanson, Michael, and Tammelin, Oskari. Heads-up limit hold'em poker is solved. Science, 347(6218):145–149, 2015.

Brown, George W. Iterative solution of games by fictitious play. Activity Analysis of Production and Allocation, 13(1):374–376, 1951.

Daskalakis, Constantinos and Pan, Qinxuan. A counter-example to Karlin's strong conjecture for fictitious play. In Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on, pp. 11–20. IEEE, 2014.

Ernst, Damien, Geurts, Pierre, and Wehenkel, Louis. Tree-based batch mode reinforcement learning. In Journal of Machine Learning Research, pp. 503–556, 2005.

Fudenberg, Drew. The Theory of Learning in Games, volume 2. MIT Press, 1998.

Ganzfried, Sam and Sandholm, Tuomas. Computing equilibria in multiplayer stochastic games of imperfect information. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, pp. 140–146, 2009.

Gilpin, Andrew, Hoda, Samid, Pena, Javier, and Sandholm, Tuomas. Gradient-based algorithms for finding Nash equilibria in extensive form games. In Internet and Network Economics, pp. 57–69. Springer, 2007.

Greenwald, Amy, Li, Jiacui, Sodomka, Eric, and Littman, Michael. Solving for best responses in extensive-form games using reinforcement learning methods. The 1st Multidisciplinary Conference on Reinforcement Learning and Decision Making (RLDM), 2013.

Heinrich, Johannes and Silver, David. Smooth UCT search in computer poker. In Proceedings of the 24th International Joint Conference on Artificial Intelligence, 2015. In press.

Hendon, Ebbe, Jacobsen, Hans Jørgen, and Sloth, Birgitte. Fictitious play in extensive form games. Games and Economic Behavior, 15(2):177–202, 1996.

Hofbauer, Josef and Sandholm, William H. On the global convergence of stochastic fictitious play. Econometrica, 70(6):2265–2294, 2002.

Karlin, Samuel. Mathematical Methods and Theory in Games, Programming and Economics. Addison-Wesley, 1959.

Koller, Daphne, Megiddo, Nimrod, and Von Stengel, Bernhard. Fast algorithms for finding randomized strategies in game trees. In Proceedings of the 26th ACM Symposium on Theory of Computing, pp. 750–759. ACM, 1994.

Koller, Daphne, Megiddo, Nimrod, and Von Stengel, Bernhard. Efficient computation of equilibria for extensive two-person games. Games and Economic Behavior, 14(2):247–259, 1996.

Kuhn, Harold W. Extensive games and the problem of information. Contributions to the Theory of Games, 2(28):193–216, 1953.

Lambert III, Theodore J., Epelman, Marina A., and Smith, Robert L. A fictitious play approach to large-scale optimization. Operations Research, 53(3):477–489, 2005.

Lanctot, Marc, Waugh, Kevin, Zinkevich, Martin, and Bowling, Michael. Monte Carlo sampling for regret minimization in extensive games. In Advances in Neural Information Processing Systems 22, pp. 1078–1086, 2009.

Leslie, David S. and Collins, Edmund J. Generalised weakened fictitious play. Games and Economic Behavior, 56(2):285–298, 2006.

McMahan, H. Brendan and Gordon, Geoffrey J. A fast bundle-based anytime algorithm for poker and other convex games. In International Conference on Artificial Intelligence and Statistics, pp. 323–330, 2007.

Miltersen, Peter Bro and Sørensen, Troels Bjerre. Computing a quasi-perfect equilibrium of a two-player game. Economic Theory, 42(1):175–192, 2010.

Myerson, Roger B. Game Theory: Analysis of Conflict. Harvard University Press, 1991.

Robinson, Julia. An iterative method of solving a game. Annals of Mathematics, pp. 296–301, 1951.

Sandholm, Tuomas. The state of solving large incomplete-information games, and application to poker. AI Magazine, 31(4):13–32, 2010.

Silver, David and Veness, Joel. Monte-Carlo planning in large POMDPs. In Advances in Neural Information Processing Systems, pp. 2164–2172, 2010.

Southey, Finnegan, Bowling, Michael, Larson, Bryce, Piccione, Carmelo, Burch, Neil, Billings, Darse, and Rayner, Chris. Bayes' bluff: Opponent modelling in poker. In Proceedings of the 21st Annual Conference on Uncertainty in Artificial Intelligence (UAI), pp. 550–558, 2005.

Sutton, Richard S. and Barto, Andrew G. Reinforcement Learning: An Introduction, volume 1. Cambridge Univ Press, 1998.

Von Stengel, Bernhard. Efficient computation of behavior strategies. Games and Economic Behavior, 14(2):220–246, 1996.

Watkins, Christopher J. C. H. and Dayan, Peter. Q-learning. Machine Learning, 8(3-4):279–292, 1992.

Zinkevich, Martin, Johanson, Michael, Bowling, Michael, and Piccione, Carmelo. Regret minimization in games with incomplete information. In Advances in Neural Information Processing Systems, pp. 1729–1736, 2007.