Mean Field Asymptotic of Markov Decision Evolutionary Games and Teams

Hamidou Tembine, Jean-Yves Le Boudec, Rachid El-Azouzi, Eitan Altman ∗†

Abstract

We introduce Markov Decision Evolutionary Games with N players, in which each individual in a large population interacts with other randomly selected players. The states and actions of each player in an interaction together determine the instantaneous payoff for all involved players. They also determine the transition probabilities to move to the next state. Each individual wishes to maximize the total expected discounted payoff over an infinite horizon. We provide a rigorous derivation of the asymptotic behavior of this system as the size of the population grows to infinity. We show that under any Markov strategy, the random process consisting of one specific player and the remaining population converges weakly to a jump process driven by the solution of a system of differential equations. We characterize the solutions to the team and to the game problems at the limit of infinite population and use these to construct almost optimal strategies for the case of a finite, but large, number of players. We show that the large population asymptotic of the microscopic model is equivalent to a (macroscopic) Markov decision evolutionary game in which a local interaction is described by a single player against a population profile. We illustrate our model by deriving the equations for a dynamic evolutionary Hawk and Dove game with energy levels.

1. Introduction

We consider a large population of players in which frequent interactions occur between small numbers of chosen individuals. Each player is thus involved in infinitely many interactions with other randomly selected players. Each interaction in which a player is involved can be described as one stage of a dynamic game. The states and actions of the players at each stage determine an immediate payoff (also called fitness in behavioral ecology) for each player, as well as the transition probabilities of a controlled Markov chain associated with each player. Each player wishes to maximize its expected fitness averaged over time.

This model extends basic evolutionary games by introducing a controlled state that characterizes each player. Stochastic dynamic games at each interaction replace the matrix games, and the objective of maximizing the expected long-term payoff over an infinite time horizon replaces the objective of maximizing the value of a matrix game. Instead of choosing a (possibly mixed) action, a player now faces a choice of decision rules (called strategies) that determine which action to take in a given interaction as a function of present and past observations.

This model with a finite number of players, called a mean field interaction model, is in general difficult to analyze because of the huge state space required to describe all the players. Taking the asymptotics as the number of players grows to infinity, the whole behavior of the population is replaced by a deterministic limit that represents the system's state, namely the fraction of the population in each individual state that uses a given action.

In this paper we study the asymptotic dynamic behavior of the system in which the population profile evolves in time. For large N, under mild assumptions (see Section 3), the mean field converges to a deterministic measure that satisfies a non-linear ordinary differential equation under any stationary strategy. We show that the mean field interaction is asymptotically equivalent to a Markov decision evolutionary game: when the rest of the population uses a fixed strategy u, any given player sees an equivalent game against a collective of players whose state evolves according to an ordinary differential equation (ODE) that we compute explicitly. In addition to providing the exact limiting asymptotic, the ODE approach provides tight approximations for fixed large N. The mean field asymptotic calculations for large N, for given choices of strategies, allow us to compute the equilibrium of the game in the asymptotic regime.

∗ This work was partially supported by the INRIA ARC Program: Populations, Game Theory, and Evolution (POPEYE) and by an EPFL PhD internship grant.
† H. Tembine and R. El-Azouzi are with University of Avignon, LIA/CERI, France. J. Y. Le Boudec is with EPFL, Laboratory for Computer Communications and Applications, Lausanne, Switzerland. E. Altman is with INRIA, MAESTRO Group, Sophia-Antipolis, France.

Related Work. Mean field interaction models have already been used in standard evolutionary games in a completely different context: that of evolutionary game dynamics (such as replicator dynamics); see e.g. [7] and references therein. The paradigm there has been to associate a relative growth rate to actions according to the fitness they achieved, and then to study the asymptotic trajectories of the state of the system, i.e. the fraction of users that adopt the different actions. Non-atomic Markov Decision Evolutionary Games have been applied in [8] to firm idiosyncratic random shocks and in [1] to cellular communications.

Structure. The remainder of this paper is organized as follows. In the next section we present the model assumptions and notations. In Section 3 we present convergence results to the ODE for a random number of interacting players. In Section 4 a resource competition between animals with two types of behaviors and several states is presented. Sketches of all proofs are given in the Appendix. Section 5 concludes the paper.

2. Model description

2.1. Markov Decision Evolutionary Process With N Players

We consider the following model, which we call Markov Decision Evolutionary Game with N players.
• There are N ∈ N players.
• Each player has its own state. A state has two components: the type of the player and the internal state. The type is a constant during the game. The state of player j at time t is denoted by X_j^N(t) = (θ_j, S_j^N(t)), where θ_j is the type. The set of possible states X = {1,...,Θ} × S is finite.
• Time is discrete, taking values in {0, 1/N, 2/N, ...}.
• The global detailed description of the system at time t is X^N(t) = (X_1^N(t),...,X_N^N(t)). Define M^N(t) to be the current population profile, i.e. M_x^N(t) = (1/N) ∑_{j=1}^N 1_{X_j^N(t)=x}. At each time t, M^N(t) is in the finite set {0, 1/N, 2/N, ..., 1}^♯X, and M_{θ,s}^N(t) is the fraction of players who belong to the population of type θ (also called subpopulation θ) and have internal state s. Also let M̄_θ^N = N ∑_{s∈S} M_{θ,s}^N(t) be the size of subpopulation θ (independent of t by hypothesis). We do not make any specific hypothesis on the ratios M̄_θ^N / N as N gets large (they may be constant or not, they may tend to 0 or not).
• Strategies and local interaction: At time slot t, an ordered list B^N(t) of players in {1,2,...,N}, without repetition, is selected randomly as follows. First we draw a random number of players K(t) such that

P(K(t) = k | M^N(t) = ~m) = J_k^N(~m)

where the distribution J^N(~m) is given for any N and ~m ∈ {0, 1/N, 2/N, ..., 1}^♯X. Second, we set B^N(t) to an ordered list of K(t) players drawn uniformly at random among the N(N−1)...(N−K(t)+1) possible ones. By abuse of notation we write j ∈ B^N(t) to mean that j appears in the list B^N(t).

Each player j ∈ B^N(t) takes part in a one-shot event at time t, as follows. First, the player chooses an action a in the finite set A with probability u_θ(a|s), where (θ,s) is the current player state. The stochastic array u is the strategy profile of the population, and u_θ is the strategy of subpopulation θ. A vector of probability distributions u which depends only on the type of the player and its internal state is called a stationary strategy.

Second, say that B^N(t) = (j_1,...,j_k). Given the actions a_{j_1},...,a_{j_k} drawn by the k players, we draw a new set of internal states (s'_{j_1},...,s'_{j_k}) with probability L^N_{θ;s;a;s'}(k, ~m), where θ = (θ_{j_1},...,θ_{j_k}), s = (s_{j_1},...,s_{j_k}), a = (a_{j_1},...,a_{j_k}), s' = (s'_{j_1},...,s'_{j_k}). Then the collection of k players makes one synchronized transition, such that

S^N_{j_i}(t + 1/N) = s'_{j_i},   i = 1,...,k.

Note that S_j^N(t + 1/N) = S_j^N(t) if j is not in B^N(t).

It can easily be shown that this form of interaction has the following properties: (1) X^N is Markov and (2) players can be observed only through their state.

The model is entirely specified by the probability distributions J^N, the Markov transition kernels L^N and the strategy profile u. In this paper we assume that J and L are fixed for all N, but u can be changed and does not depend on N (though it would be trivial to extend our results to strategies that depend on N, this appears to be an unnecessary complication). We are interested in large N.

It follows from our assumptions that

1. M^N(t) is Markov.
2. For any fixed j ∈ {1,...,N}, (X_j^N(t), M^N(t)) is Markov. This means that the evolution of one specific player X_j^N(t) depends on the other players only through the occupancy measure M^N(t).
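To make the dynamics of Section 2.1 concrete, the following sketch simulates one toy instance of the interaction model: a single type, two internal states, pairwise interactions (i.e. J^N puts all its mass on k = 2). The strategy `u`, the kernel `next_state` and all numerical values are illustrative assumptions, not part of the model above; the sketch only shows how B^N(t), the synchronized transition and the occupancy measure M^N fit together.

```python
import random

# Toy instance: one type, internal states {0, 1}, actions {"a", "b"},
# pairwise interactions (K(t) = 2 with probability 1).
N = 50
states = [random.choice([0, 1]) for _ in range(N)]   # S_j^N(0)

def u(s):
    """Stationary strategy: probability of playing "a" in internal state s (illustrative)."""
    return 0.7 if s == 1 else 0.3

def sample_action(s):
    return "a" if random.random() < u(s) else "b"

def next_state(s, a_self, a_other):
    """Illustrative kernel L: playing "a" against "b" moves a player up, otherwise down."""
    if a_self == "a" and a_other == "b":
        return min(s + 1, 1)
    return max(s - 1, 0)

def one_slot(states):
    """One time slot of length 1/N: draw B^N(t) and let the selected players transition."""
    i, j = random.sample(range(N), 2)                 # B^N(t), here always two players
    ai, aj = sample_action(states[i]), sample_action(states[j])
    states[i], states[j] = next_state(states[i], ai, aj), next_state(states[j], aj, ai)

def profile(states):
    """Occupancy measure M^N(t): fraction of players in each internal state."""
    return [states.count(s) / N for s in (0, 1)]

for _ in range(10 * N):                               # 10*N slots = 10 time units
    one_slot(states)
print("M^N after 10 time units:", profile(states))
```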

2.2. Payoffs

We consider two types of instantaneous payoff and one discounted payoff:

• Instant Gain: This is the random gain G_j^N(t) obtained by one player whenever it is involved in an event at time t. We assume that it depends on this player's state just before and just after the event, the chosen action, and on the states and actions of all players involved in this event. Formally, if player j ∈ B^N(t),

G_j^N(t) = g^N(x_j, a_j, x'_j, x_{B^N(t)\j}, a_{B^N(t)\j}, x'_{B^N(t)\j})

where x_j = X_j^N(t), a_j is the action chosen by player j, x'_j = X_j^N(t + 1/N), x_{B^N(t)\j} [resp. x'_{B^N(t)\j}] is the list of states at time t [resp. at time t + 1/N] of the players other than j involved in the event, a_{B^N(t)\j} is the list of their actions, and g^N() is some non-random function defined on the set of appropriate lists. Whenever j is not in B^N(t), G_j^N(t) = 0. We assume that G_j^N(t) is bounded, i.e. there is a non-random number C_0 such that, with probability 1, |G_j^N(t)| ≤ C_0 for all j, t.

• Expected Instant Payoff: It is defined as the expected instant gain of player j, given the state x of j and the population profile ~m. By our indistinguishability assumption, it does not depend on the identity of a player, so we can write it as

r^N(u, x, ~m) := E( G_j^N(t) | X_j^N(t) = x, M^N(t) = ~m ).

Note that this conditional expectation covers the case when j is not in B^N(t), i.e. when G_j^N(t) = 0.

• Discounted Long-Term Payoff: It is defined as the expected discounted long-term payoff of one player, given the initial state of this player and the population:

r̄^N(u; x, ~m) := E( ∑_{t=0, step 1/N}^{∞} e^{−βt} G_j^N(t) | X_j(0) = x, M^N(0) = ~m )

where β is a positive parameter (existence follows from the boundedness of G_j^N). The fact that it does not depend on the identity j of the player, but only on its initial state x and the initial population profile ~m, follows from the indistinguishability assumption.

We defined the Discounted Long-Term Payoff in terms of the instant gain, as this is the most natural definition. The following proposition shows that the alternative definition, by means of the expected instant payoff, is equivalent.

Proposition 2.2.1. For all player states x and population profiles ~m,

r̄^N(u; x, ~m) = E( ∑_{t=0, step 1/N}^{∞} e^{−βt} r^N(u, X_j^N(t), M^N(t)) | X_j(0) = x, M^N(0) = ~m ).
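The equivalence in Proposition 2.2.1 is essentially the tower property of conditional expectation applied slot by slot. The following toy check, on a two-state chain with made-up gains and transition probabilities that are not tied to any particular game, illustrates it numerically: the discounted sum of random gains and the discounted sum of expected instant payoffs have the same expectation.

```python
import random, math

beta, horizon, runs = 0.5, 120, 5000
P = {0: 0.9, 1: 0.4}                      # toy P(stay in state 0 | current state) (illustrative)
def gain(x):                              # random instant gain G(t) in state x
    return random.choice([0.0, 2.0]) if x == 0 else 1.0
r = {0: 1.0, 1: 1.0}                      # r(x) = E[G | X = x] for the toy gains above

def discounted(f):
    """Monte Carlo estimate of E[sum_t e^{-beta t} f(X_t)] starting from X_0 = 0."""
    total = 0.0
    for _ in range(runs):
        x, acc = 0, 0.0
        for t in range(horizon):
            acc += math.exp(-beta * t) * f(x)
            x = 0 if random.random() < P[x] else 1
        total += acc
    return total / runs

print(discounted(gain))               # payoff defined through the random instant gain
print(discounted(lambda x: r[x]))     # payoff defined through the expected instant payoff
# The two estimates agree up to Monte Carlo noise, as Proposition 2.2.1 states.
```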
2.3. Focus on One Single Player

We are interested in the following special case (here we make the dependency on the strategy explicit). There are two types of players, i.e. Θ = 2. There is exactly one player (the player of interest) with type 1. All other players have type 2. In this case we use the notation R^N(u_1, u_2; s, ~m) for the discounted long-term payoff obtained by the player of type 1, when her strategy is u_1 and all other players' strategy is u_2, given that this player's initial internal state is s and the initial type 2 subpopulation profile is ~m. Note that

R^N(u_1, u_2; s, ~m) = r̄^N(u_1, u_2; (1,s), ~m')

with m'_{1,s'} = (1/N) 1_{s=s'} and m'_{2,s'} = m_{2,s'} for all s' ∈ S.

Markov Decision Evolutionary Game

Player j may choose a strategy u_j. We look for a (Nash) equilibrium u such that if all players use u then no player has an incentive to deviate from u. For any finite N one can map this into a standard Markov game. This is true both in the case where the number of players is known and in the case where it is unknown when taking a decision. Therefore we know that a stationary equilibrium exists in the discounted case. A stationary equilibrium is a solution of the fixed point equation

∀ j,  u_{j,θ} ∈ argmax_{v_{j,θ}} R^N(v_{j,θ}, u_{−j}; s, m).

By assuming symmetry per type we can show that a stationary equilibrium exists which is a solution of the fixed point equation

∀ θ,  u_θ ∈ argmax_{v_θ} R^N(v_θ, u; s, m).

Markov Decision Evolutionary Team

We wish to find a stationary u that maximizes R^N averaged over all players:

u = (u_1,...,u_Θ) ∈ argmax_v R^N(v; s, m).

3. Main Results

3.1. Scaling Assumptions

We are interested in the large N regime and obtain that, for any fixed j, (X_j^N, M^N) converges weakly to a simple process. This requires the weak convergence of M^N(0) to some ~m_0.

We assume that the parameters of the model and the payoff per time unit converge as N → ∞, i.e.

J_k^N(~m) → J_k(~m),   L^N_{θ;s;a;s'}(k, ~m) → L_{θ;s;a;s'}(k, ~m),   r^N(u, x, ~m) → r(u, x, ~m).   (1)

Our main scaling assumption is

H1  ∑_k k^2 J_k(~m) < ∞ for all ~m ∈ ∆. This ensures that the second moment of the number of players involved in an event per time slot is bounded.

Note that H1 excludes the case where the number of players involved in an event per time slot scales like N (i.e. synchronous transitions of all players at the same time). There may be large N asymptotic results for such cases [9], but the limit is not given by an ODE. In contrast, H1 is automatically true if the number of players involved in an event per time slot is upper bounded by a non-random constant. We also need some technical assumptions, which are usually true and can be verified by inspection.

H2  ∑_k J_k(~m) > 0 for all ~m ∈ ∆ (∆ is the simplex {~m : m_{θ,s} ≥ 0, ∑_{θ,s} m_{θ,s} = 1}). This ensures that the mean number of players involved in an event per time slot, ∑_{k≥0} k J_k(~m), is non-zero.

Define the drift of M^N(t) as

~f^N(u, ~m) = E( M^N(t + 1/N) − M^N(t) | M^N(t) = ~m ).

Note that we make explicit the dependency on the strategy u but not on J and L, which are assumed to be fixed. It follows from our hypotheses that

lim_{N→∞} N f^N(u, ~m) := f(u, ~m)   (2)

exists.

H3  We assume that the convergence in Equation (2) is uniform in ~m and the limit is Lipschitz-continuous in ~m. This is in particular true if one can write, for every strategy u, f^N(u, ~m) = (1/N) φ_u(1/N, ~m), with φ_u defined on [0,ε] × ∆, where ε > 0 and φ_u is continuously differentiable.

H4  P(X_j^N(t + 1/N) = y | X_j^N(t) = x, M^N(t) = m, M^N(t + 1/N) = m') converges uniformly in ~m, ~m' and the limit is Lipschitz-continuous in ~m, ~m'. This is in particular true if one can write it, for every strategy u, as ξ_{u,x;y}(1/N, m, m'), with ξ_{u,x;y} defined on [0,1] × ∆ × ∆ and continuously differentiable.

Our model satisfies the assumptions in [2], therefore we have the following result:

Theorem 3.1.1 ([2]). Assume that lim_{N→∞} M^N(0) = ~m_0 in probability. For any stationary strategy u, and any time t, the random process M^N(t) = (1/N) ∑_{j=1}^N δ_{X_j^N(t)} converges in distribution to the (non-random) solution of the ODE

~m˙(t) = f(u, ~m(t))   (3)

with initial condition ~m_0.
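Theorem 3.1.1 reduces the population dynamics, for large N, to the deterministic ODE (3). Given any implementation of the limiting drift f(u, m), a simple explicit Euler scheme is enough to trace this deterministic limit numerically. The sketch below is only an illustration: the `drift` function is an arbitrary placeholder, not the drift of any example in this paper.

```python
def drift(u, m):
    """Placeholder for the limiting drift f(u, m); replace with the model's drift.
    Here mass flows from state 0 to state 1 at a rate scaled by u (illustrative)."""
    rate01, rate10 = u * m[0], 0.2 * m[1]
    return [rate10 - rate01, rate01 - rate10]

def integrate(u, m0, t_end, dt=1e-3):
    """Explicit Euler integration of m'(t) = f(u, m(t)), m(0) = m0."""
    m, t, traj = list(m0), 0.0, []
    while t < t_end:
        f = drift(u, m)
        m = [mi + dt * fi for mi, fi in zip(m, f)]
        t += dt
        traj.append((t, list(m)))
    return traj

trajectory = integrate(u=0.5, m0=[0.8, 0.2], t_end=10.0)
print(trajectory[-1])   # state of the deterministic limit at t = 10
```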

3.2. Convergence results

We focus on one player; without loss of generality we can call her player 1, and consider the process (X_1^N, M^N). For any finite N, X_1^N and M^N are not independent; however, in the limit we have the following:

Theorem 3.2.1. Assume that lim_{N→∞} M^N(0) = ~m_0 and lim_{N→∞} X_1^N(0) = x_0 = (θ_1, s_0) in probability. The discrete time process (X_1^N(t), M^N(t)), defined for t ∈ {0, 1/N, 2/N, ...}, converges weakly to the continuous time jump and drift process (X_1(t), ~m(t)), where ~m(t) is the solution of the ODE (3) with initial condition ~m_0 and X_1(t) is a continuous time, non-homogeneous jump process with initial state x_0. The rate of transition of X_1(t) from state x_1 = (θ_1, s_1) to state x'_1 = (θ_1, s'_1) is

A(x_1, x'_1; ~m(t), u) = ∑_{k≥1} J_k(~m) A_k(s_1, s'_1; ~m(t), u)

with

A_k(s_1, s'_1; ~m(t), u) = ∑_{θ; s; a; s'} L_{θ_1,θ; s_1,s; a; s'_1,s'}(k, ~m(t)) ∏_{j=1}^{k} u_{θ_j}(a_j | s_j) ∏_{j=2}^{k} m_{θ_j, s_j}(t)

where θ = (θ_2,...,θ_k), s = (s_2,...,s_k), a = (a_1,...,a_k), s' = (s'_2,...,s'_k).

Note that, contrary to results based on propagation of chaos, we do not assume that the distribution of player states at time 0 is exchangeable. On the contrary, we will use Theorem 3.2.1 precisely in the case where player 1 is different from the other players. Theorem 3.2.1 motivates the following definition.

Definition 3.3. To a game as defined in Section 2.1 we associate a "Macroscopic Markov Decision Evolutionary Game", defined as follows. There is one player (player 1), with state X_1(t), and a population profile ~m(t). The initial condition of the game is X_1(0) = x, ~m(0) = ~m_0. The population profile is the solution to the ODE (3) and X_1(t) evolves as a jump process as in Theorem 3.2.1.

Further, let r̄(u; x, ~m) be the discounted long-term payoff of player 1 in this game, given that X_1(0) = x and ~m(0) = ~m_0, i.e.

r̄(u; x, ~m) = E( ∫_0^∞ e^{−βt} r(u, X_1(t), m(t)) dt | X_1(0) = x, ~m(0) = ~m_0 ).

We also consider, as in Section 2.3, the case with Θ = 2 types, and define by analogy R(u_1, u_2; s, ~m) as the discounted long-term payoff when player 1 starts in state s and the population profile starts in state ~m, with player 1 using strategy u_1 and the other players strategy u_2.

In order to exploit the convergence in distribution of the process focused on one player, we need the payoff to be continuous in the topology of this convergence. This is stated in the next theorem.

Theorem 3.3.1. Let E = S × ∆ and let D_E[0,∞) be the set of cadlag functions from [0,∞) to E, equipped with Skorohod's topology. The mapping

D_E[0,∞) → R,   (s, m) ↦ ∫_0^∞ e^{−βt} r(u, s(t), m(t)) dt

is continuous.

Using Theorem 3.2.1 and Theorem 3.3.1 we obtain the following, which is the main result of this paper. It says that when N goes to infinity, the Markov Decision Evolutionary Game with N players becomes equivalent to the associated Macroscopic Markov Decision Evolutionary Game. This reduces any multi-player problem into an effective one-player problem.

Theorem 3.3.2 (Asymptotically equivalent game). When N goes to infinity:
1. the discrete time process X_1^N converges in distribution to the continuous time process X_1;
2. r̄^N(u; x, ~m) → r̄(u; x, ~m);
3. R^N(u_1, u_2; s, ~m) → R(u_1, u_2; s, ~m).

3.4. Case with Global Attractor

Assume that, for some strategy u, the ODE (3) has a global attractor ~m* (this may or may not hold, depending on the ODE). If in addition the model with N players is irreducible, with stationary probability distribution ϖ^N for M^N, then

lim_{N→∞} ϖ^N = δ_{~m*}

where δ_{~m*} is the Dirac mass at ~m* (this follows from [2]), i.e. the large time distribution of M^N(t) converges, as N → ∞, to the attractor ~m*.

Also, (X_j^N(t), M^N(t)) converges to a continuous time, homogeneous Markov jump process with time-independent transition matrix

A(x_1, x'_1; u) = ∑_{k≥1} J_k(~m*) A_k(s_1, s'_1; ~m*, u).

Assume that the transition matrix A(x_1, x'_1; u) is also irreducible and let π() be its unique stationary probability. Also let π^N be the first marginal of the stationary probability of (X_1^N, M^N). It is natural in this case to replace the definition of the long term payoffs R^N(u_1, u_2; s, ~m) and R(u_1, u_2; s, ~m) by their stationary counterparts

R^N_st(u_1, u_2) := ∑_s π^N(s) R^N(u_1, u_2; s, ~m*),   R_st(u_1, u_2) := ∑_s π(s) R(u_1, u_2; s, ~m*).

3.5. Single player per type selected per time slot

Consider the special case where at each time slot only one player per type among the N is randomly selected and has a chance to change its action, i.e. ♯B^N = 1 with probability 1. Then H1 and H2 are automatically satisfied. The resulting ODE (see [3]) becomes

d/dt m_x(t) = ∑_{x'} m_{x'} L_{x',x}(~m, u, Θ) − m_x ∑_{x'} L_{x,x'}(~m, u, Θ).

The term ∑_{x'} m_{x'} L_{x',x}(~m, u, Θ) is the incoming flow into x, and the outgoing flow from x is m_x ∑_{x'} L_{x,x'}(~m, u, Θ).

We then obtain a large class of state-dependent evolutionary game dynamics. Note that in general the trajectories of the mean dynamics need not converge. In the case of a single player selected in each time slot of length 1/N and of transitions that are linear in m, the time averages under the replicator dynamics converge to its interior rest points or to the boundaries of the simplex.

3.6. Equilibrium and optimality

Let U_s be the set of strategies. Consider the optimal control problems

(OPT_N)  Maximize R^N(u, u; s, ~m_0)  s.t.  u ∈ U_s,
(OPT_∞)  Maximize R(u, u; s, ~m_0)  s.t.  u ∈ U_s.

The strategy u is an ε-optimal strategy for the N-optimal control problem if

R^N(u, u; s, ~m_0) ≥ −ε + sup_v R^N(v, v; s, ~m_0).

Also consider the fixed-point problems

(FIX_N)  find u ∈ U_s such that u ∈ argmax_{v∈U_s} R^N(v, u; s, ~m_0),
(FIX_∞)  find u ∈ U_s such that u ∈ argmax_{v∈U_s} R(v, u; s, ~m_0).

A solution to (FIX_N) or (FIX_∞) is a (Nash) equilibrium. We say that u is an ε-equilibrium for the game with N [resp. N → ∞] players if R^N(u, u; s, ~m_0) ≥ sup_v R^N(v, u; s, ~m_0) − ε [resp. R(u, u; s, ~m_0) ≥ sup_v R(v, u; s, ~m_0) − ε].

Note that the definition of equilibrium and optimal strategy may depend on the initial conditions. If, for any u ∈ U_s, the hypotheses in Section 3.4 hold, then we may relax this dependency.

Theorem 3.6.1 (Finite N). For every discount factor β > 0, the optimal control problem (OPT_N) (resp. the fixed-point problem (FIX_N)) has at least one 0-optimal strategy (resp. 0-equilibrium). In particular, there is an ε_N-optimal strategy (resp. ε_N-equilibrium) with ε_N → 0.

Theorem 3.6.2 (Infinite N). Optimal strategies (resp. equilibrium strategies) exist in the limiting regime when N → ∞ under uniform convergence and continuity of R^N → R. Moreover, if {U^N} is a sequence of ε_N-optimal strategies (resp. ε_N-equilibrium strategies) in the finite regime with ε_N → ε, then any limit of a subsequence U^{φ(N)} → U is an ε-optimal strategy (resp. ε-equilibrium) for the game with infinite N.

4. Illustrating example

We present in this section an example of a dynamic version of the Hawk and Dove problem where each individual has three energy levels. We derive the mean field limit for the case where all users follow a given policy and where possibly one player deviates. We then further simplify the model to only two energy states per player. In that case we are able to fully identify and compute the equilibrium in the limiting MDEG. Interestingly, we show that the ODE converges to a fixed point which depends on the initial condition.

Consider a homogeneous population of N animals. An animal plays the role of a player. Occasionally two animals find themselves in competition over the same piece of food. Each animal has three states x = 0, 1, 2, which represent its energy level. An animal can adopt an aggressive behavior (Hawk) or a peaceful one (Dove, passive attitude). At the state x = 0 there is no action. We describe the fitness of an animal (some arbitrary player) associated with the possible outcomes of the meeting as a function of the decisions taken by each one of the two animals. The fitnesses represent the following:

• An encounter Hawk-Dove or Dove-Hawk results in zero fitness for the Dove and in a value of v̄ for the Hawk, which gets all the food without a fight. The state of the Hawk (the winner) is incremented, a = 1_{x' = min(x_H+1,2)}, and the state of the Dove is decremented, b = 1_{x' = max(x_D−1,0)}.
• An encounter Dove-Dove results in a peaceful, equal sharing of the food, which translates into a fitness of v̄/2 for each animal, and the state of each animal changes with the sum of the two distributions (1/2)a + (1/2)b.
• An encounter Hawk-Hawk results in a fight in which, with p = 1/2 chances, one (resp. the other) animal obtains the food, but also in which there is a positive probability 1/2 for each one of the animals to be wounded. Then the fitness of animal 1 is (1/2)(v̄ − c) + (1/2)(−c) = (1/2)v̄ − c, where the −c term represents the expected loss of fitness due to being injured.

The gains and transitions for an encounter between animals i and j in states x_1 and x_2 are summarized as follows:

i\j    (g_i^N, g_j^N)           X_i^N(t + 1/N), X_j^N(t + 1/N)
D − D  (v̄/2, v̄/2)              (1/2) δ_{min(x_1−1,0), max(x_2+1,2)} + (1/2) δ_{max(x_1+1,2), min(x_2−1,0)}
D − H  (0, v̄)                   (min(x_1−1,0), max(x_2+1,2))
H − H  (v̄/2 − c, v̄/2 − c)      (1/2) δ_{min(x_1−1,0), max(x_2+1,2)} + (1/2) δ_{max(x_1+1,2), min(x_2−1,0)}

The vector of frequencies of states at time t is given by M_x^N(t) = (1/N) ∑_{j=1}^N 1_{X_j^N(t)=x} for x = 0, 1, 2, and the action set is A_x = {H, D} in each state x ≠ 0, A_0 = {}. The assumptions of Section 3 are satisfied (pairwise interaction, ♯B^N(t) = 2) and the occupancy measure M^N(t) converges to m(t).

4.1. ODE and Stationary strategies

Consider the following fixed parameters: μ_1 = L_{0,1}, μ_2 = L_{0,2}. The population profile is denoted by ~m = (m_0, m_1, m_2) and the stationary strategy is described by the parameters v_1, v_2, where v_1 := u(H|1), v_2 := u(H|2):

m˙_2 = m_0 L_{0,2} + m_1 L_{1,2}(u, m) − m_2 L_{2,1}(u, m)
m˙_1 = m_0 L_{0,1} + m_2 L_{2,1}(u, m) − m_1 L_{1,2}(u, m) − m_1 L_{1,0}(u, m)
m˙_0 = m_1 L_{1,0}(u, m) − (μ_1 + μ_2) m_0

where

L_{1,2}(u, m) = v_1 ( m_0 + v_1 m_1 / 2 + (1 − v_1) m_1 + v_2 m_2 / 2 + (1 − v_2) m_2 ) + (1 − v_1) ( (1 − v_1) m_1 / 2 + (1 − v_2) m_2 / 2 ),

L_{2,1}(u, m) = v_2 ( v_1 m_1 / 2 + v_2 m_2 / 2 ) + (1 − v_2) ( (1 − v_1) m_1 / 2 + v_2 m_2 + (1 − v_2) m_2 / 2 ),

L_{1,0}(u, m) = v_1 ( v_1 m_1 / 2 + v_2 m_2 / 2 ) + (1 − v_1) ( v_1 m_1 + (1 − v_1) m_1 / 2 + v_2 m_2 + (1 − v_2) m_2 / 2 ).

For B^N = {j_1, j_2} and x, x'_1, x'_2, x_1, x_2 ∈ {0, 1, 2}, the general pairwise-interaction ODE reads

d/dt m_x = ∑_{x_1, x_2, x'_2} m_{x_1} m_{x_2} L_{x_1,x_2; x,x'_2}(u, ~m) + ∑_{x_1, x_2, x'_1} m_{x_1} m_{x_2} L_{x_1,x_2; x'_1,x}(u, ~m) − m_x ∑_{x_2, x'_1, x'_2} m_{x_2} L_{x,x_2; x'_1,x'_2}(u, ~m) − m_x ∑_{x_1, x'_1, x'_2} m_{x_1} L_{x_1,x; x'_1,x'_2}(u, ~m).
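A direct way to explore the three-state dynamics of Section 4.1 is to integrate the balance equations numerically. The sketch below keeps the structure of the ODE (inflow and outflow between energy levels) as written above, but leaves the interaction rates L12, L21, L10 as user-supplied functions; the constant rates passed in the demo call are placeholders, not the expressions of the example.

```python
def hawk_dove_ode(m, u, mu1, mu2, L12, L21, L10):
    """Balance equations of Section 4.1 for m = (m0, m1, m2).
    L12, L21, L10 are callables (u, m) -> rate; supply the model's expressions."""
    m0, m1, m2 = m
    l12, l21, l10 = L12(u, m), L21(u, m), L10(u, m)
    dm2 = m0 * mu2 + m1 * l12 - m2 * l21
    dm1 = m0 * mu1 + m2 * l21 - m1 * l12 - m1 * l10
    dm0 = m1 * l10 - (mu1 + mu2) * m0
    return [dm0, dm1, dm2]

def euler(m_init, u, steps=10000, dt=1e-3, **rates):
    """Explicit Euler integration of the balance equations."""
    m = list(m_init)
    for _ in range(steps):
        dm = hawk_dove_ode(m, u, **rates)
        m = [mi + dt * di for mi, di in zip(m, dm)]
    return m

# Demo with constant placeholder rates (illustrative only):
m_final = euler([0.2, 0.5, 0.3], u=(0.5, 0.5), mu1=0.3, mu2=0.2,
                L12=lambda u, m: 0.4, L21=lambda u, m: 0.3, L10=lambda u, m: 0.1)
print(m_final)   # approximate rest point of the placeholder dynamics
```

Note that the total mass m0 + m1 + m2 is conserved by construction, since every outflow term appears as an inflow term of another state.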

The optimality is then given by the Hamilton-Jacobi- 1 Bellman equation obtained by maximizing the right-hand r(v,u,1,m[u,m0](t)) = (1 − m2[u,m0](t)u2)v¯ 2 side of the equation (4). where m2[u,m0](t) is given by (7) (resp. (8)) for u2 = 1 (resp. u 6= 1). Now, we can compute explicitly the best 4.3. The case of two energy levels 2 response against u for a given initial m0. Let In order to derive closed form expressions for solutions β(u,2,m0,t) = r(H,u,2,m[u,m0](t)) − r(D,u,2,m[u,m0](t)). of our ODE, we restrict the above example to two states, i.e., each animal has two states x = 1,2 which represents The , BR(x,u,m[u,m0](t)), against u at t is its energy levels. Thus ODE equation can be expressed as ½ follows: play Hawk if β(u,x,m0,t) > 0 BR(x,u,m[u,m0](t)) = play Dove if β(u,x,m0,t) < 0 m˙ 2(t) = (1 − m2(t))L1,2(u,m) − m2(t)L2,1(u,m) (5) which can be rewritten as v¯ γ This implies that it is better to play Hawk for 2c > 1+γ 2 where γ = max(2/3,m0). Since the solution of the ODE is m˙ 2 = a1 + a2m2(t) + a3(m2(t)) (6) strictly monotone in time for each stationary strategy, there u2 1−u2 with a1 = 1, a2 = 2 − 2 < 0, a3 = 2 > 0. is at most one time for which β is zero. It is easy to see that v¯ 2 Let m[u,m0](t) be the solution of the ODE given u and if 2c > 3 then the strategy which to play Hawk in state 2 a initial distribution m(0) = m0. We distinguish two cases: and Dove in state 1 is an equilibrium. 1 Appendix

Figure 1. Global attractor for u_2 = 1 (proportion of Hawks as a function of time).

Figure 2. Global attractor for u_2 = 0.2 (proportion of Hawks as a function of time).

5. Concluding remarks

The goal of this paper has been to develop a mean field asymptotic of interactions with a large number of players using stochastic games. Due to the curse of the size of the population, the applicability of atomic stochastic games has been severely limited. As an alternative, we proposed a method for Markov decision evolutionary games where players make decisions based only on their own state and the global system state. We have shown, under mild assumptions, convergence results where the asymptotics are taken in the number of players. The population state profile satisfies a system of non-linear ordinary differential equations. We have considered a very simple class of strategies that are functions only of the player's own state and the population profile. We applied the framework to a Hawk-Dove interaction with several energy levels and formulated the ODEs. We showed that the best response depends on the initial conditions.

Appendix

Sketch of proof of Proposition 2.2.1

Let τ^N be the first time after t = 0 that X_j^N(t) hits some given state. We show that

r̄^N = E( ∑_{s=0, step 1/N}^{τ^N} e^{−βs} r^N( X_j^N(s), M^N(s) ) ).   (9)

Define, for t ∈ {0, 1/N, 2/N, ...},

Z_t^N = ∑_{s=0, step 1/N}^{t} e^{−βs} ( G_j^N(s) − r^N( X_j^N(s), M^N(s) ) ).

We have, for 0 ≤ s ≤ t,

Q := E( Z_t^N − Z_s^N | F_s^N ) = ∑_{u'=0, step 1/N}^{t} e^{−βu'} E( G_j^N(u') − r^N( X_j^N(u'), M^N(u') ) | F_s^N )

which can be written as

∑_{u'=0, step 1/N}^{t} e^{−βu'} E( E( G_j^N(u') − r^N( X_j^N(u'), M^N(u') ) | F_{u'}^N ) | F_s^N ) = 0,

thus Z_t^N is an F_t^N-martingale. Now τ^N is a stopping time with respect to the filtration F_t^N, thus, by Doob's stopping time theorem, E Z^N_{t∧τ^N} = E Z^N_{0∧τ^N} = 0. Further, Z^N_{t∧τ^N} ≤ K|τ^N| for some constant K. Since τ^N is almost surely finite and has a finite expectation, we can apply dominated convergence (with t → ∞) and obtain E Z^N_{τ^N} = 0.

Sketch of Proof of Theorem 3.2.1

To prove the weak convergence of Z^N = (X^N, M^N), we check the following steps. Without loss of generality, we take the set of internal states to be S = {0, 1, 2, ..., ♯S}. X_j^N has a jump of size r with probability

q^N_{i,i+r}(M^N(k)) = (1/N) L^N_{i,i+r}(M^N(k), u),

and M^N is the continuous process with drift f^N.

• We introduce X̃_j^N by scaling with step size 1/N. Then Z^N = (X^N, M^N) is approximated in some sense by a discrete time process Z̃^N = (X̃^N, m̃^N), where m̃^N(k) = m(⌊Nt⌋/N) with m the solution of the ODE, and X̃_j^N is the discrete time jump process with transition matrix

q^N_{i,i+r}(m̃^N(k)) = (1/N) L^N_{i,i+r}(m(k/N), u).

We show that d(X_j^N, X̃_j^N) → 0 over any compact time interval, where d(X, Y) = ∑_{k≥0} 2^{−k} d(X_k, Y_k) and d(X_k, Y_k) = 1_{X_k ≠ Y_k}. Then d(X^N_{j,|[0,t]}, X̃^N_{j,|[0,t]}) → 0 when N goes to infinity.

• Z̃^N = (X̃^N, m̃^N) ⟹ (X̃, m̃).

• Convergence of the discrete time process: M^N([Nt]) → m(t). We then derive the weak convergence of Z^N to (X, m), where m is deterministic and X is random.

Approximation by a discrete time process. The following lemma follows from Lemmas 1 and 3 in Benaim and Weibull (2003, 2008), in which we incorporate behavioral strategies.

Lemma 5.0.1. For every T > 0 there exists a constant C such that, for every ε > 0 and N large enough,

P( sup_{0≤τ≤T} ||M^N(τ) − m(τ)|| > ε | M^N(0) = m_0, u ) ≤ 2(♯S) e^{−ε²CN}

for all m_0 ∈ ∆_d and every stationary strategy u.

Since C is independent of N and (e^{−ε²CN})_N is summable, we can use the dominated convergence theorem: for all ε > 0,

∑_N P( sup_{0≤τ≤T} ||M^N(τ) − m(τ)||_∞ > ε | M^N(0) = m_0, u ) < ∞.

By the Borel-Cantelli lemma, for every fixed t < ∞, the random variable ν^{N,t} := sup_{0≤τ≤t} ||M^N(τ) − m(τ)||_∞ converges almost completely towards 0. This implies that ν^{N,t} converges almost surely to 0.

Using Lemma 5.0.1 and the uniform Lipschitz continuity of L^N, we obtain that

sup_{i,j} sup_{0≤τ≤t} || q^N_{i,j}(M^N(τ)) − q_{i,j}(m(τ)) || ≤ K ( ε_N + sup_{0≤τ≤t} ||M^N(τ) − m(τ)|| ).

Hence we can write ||M^N(τ) − m(τ)|| ≤ K(ε + 1/N²) on the event Ω_ε = { ||M^N(τ) − m(τ)|| ≤ ε }, and P(Ω_ε) ≥ 1 − 2(♯S) e^{−ε²CN} → 1. Thus

P( X^N_{j,|[0,t]} = X̃^N_{j,|[0,t]} | k transitions ) ≥ E( ε^{Bin(1/N, Nt)} ) = (1 − 1/N + ε/N)^{Nt},   (10)

which converges to e^{−(1−ε)t} as N → ∞, and this holds for any arbitrarily small ε.

Convergence of the discrete time process. To prove the weak convergence of (X̃^N, M̃^N), we check the following steps:

• The discrete time empirical measures M̃^N are tight (this follows from Sznitman for finite state spaces) and converge to a martingale problem. The limit m̃ is a deterministic measure and is a solution of the ODE, which has the unique solution m (given m_0 and u). Thus m̃ = m.

• Conditionally on M̃^N, X̃_j^N converges to a martingale problem. The jump and drift process X̃, with time dependent transitions, is given by the limit of the marginal of A^N(· | M̃^N, m_0, x_0, u). We derive the weak convergence of (X̃_j^N, M̃^N) to (X̃, m̃), where m̃ is deterministic and X̃ is random. For this we use Theorem 17.25 and its discrete time approximation in Theorem 17.28, pages 344-347, in Kallenberg.

Sketch of Proof of Theorem 3.3.1

Since Skorohod's topology is induced by a metric, it is sufficient to show that whenever (X_j^N, m^N) → (x, m) in Skorohod's topology, we have

lim_{N→∞} ∫_0^∞ e^{−βt} r^N(v, X_j^N(t), m^N(t)) dt = ∫_0^∞ e^{−βt} r(v, x(t), m(t)) dt.

By [4], page 117, there is a sequence of increasing bijections λ_n: [0,∞) → [0,∞) such that (λ_n(t) − λ_n(s))/(t − s) → 1 uniformly in t and s, and ||y_n(t) − y(λ_n(t))|| → 0 uniformly in t over compact subsets of [0,∞). Fix an arbitrary ε > 0 and consider

h^N := | ∫_0^∞ e^{−βt} r^N(x^N(t), v, m^N(t)) dt − ∫_0^∞ e^{−βt} r(x(t), v, m(t)) dt | ≤ ∫_0^∞ e^{−βt} | r^N(x^N(t), v, m^N(t)) − r(x(t), v, m(t)) | dt.

First, let K = sup_{x∈S, v, m∈∆} |r(x, v, m)| < ∞ by hypothesis, and pick some time T large enough that e^{−βT} K/β ≤ ε/3. Thus

h^N ≤ ε/3 + ∫_0^T e^{−βt} | r^N(x^N(t), v, m^N(t)) − r(x(t), v, m(t)) | dt.

Second, we use the distance on E defined by

d((x, m), (x', m')) = ||m − m'|| + 1_{x≠x'}.   (11)

Let K' = sup_{x∈S, v, m∈∆_d} |r(x, v, m) − r(x', v, m')| / ||m − m'|| < ∞ by hypothesis. It is easy to see that for all x, x' ∈ S and m, m' ∈ ∆_d,

|| r(x, v, m) − r(x', v, m') || ≤ K' d((x, m), (x', m')).   (12)

Thus, by Equation (10),

h^N ≤ ε/3 + K' ∫_0^T e^{−βt} d( (x^N(t), m^N(t)), (x(t), m(t)) ) dt.   (13)

By [4], page 117, there is a sequence of increasing bijections λ^N: [0,∞) → [0,∞) such that (λ^N(t) − λ^N(s))/(t − s) → 1 uniformly in t and s, and d( (x^N(t), m^N(t)), (x^N(λ^N(t)), m^N(λ^N(t))) ) → 0 uniformly in t over compact subsets of [0,∞). Thus there is some N_0 ∈ N such that for N ≥ N_0 and t ∈ [0, T],

d( (x^N(t), m^N(t)), (x^N(λ^N(t)), m^N(λ^N(t))) ) ≤ ε β e^{βT} / (3 K').   (14)

Thus, by the triangular inequality for d,

h^N ≤ ε/3 + K' ∫_0^T e^{−βt} d( (x^N(t), m^N(t)), (x(λ^N(t)), m(λ^N(t))) ) dt + K' ∫_0^T e^{−βt} d( (x(λ^N(t)), m(λ^N(t))), (x(t), m(t)) ) dt
    ≤ 2ε/3 + K' ∫_0^T e^{−βt} d( (x(λ^N(t)), m(λ^N(t))), (x(t), m(t)) ) dt.   (15)

Third, let D be the set of discontinuity points of (x, m). Since (x, m) is cadlag, D is enumerable, thus it is negligible for the Lebesgue measure and

∫_0^T e^{−βt} d( (x(λ^N(t)), m(λ^N(t))), (x(t), m(t)) ) dt = ∫_0^T e^{−βt} d( (x(λ^N(t)), m(λ^N(t))), (x(t), m(t)) ) 1_{t∉D} dt.

Now lim_{N→∞} λ^N(t) = t, and thus for t ∉ D,

lim_{N→∞} d( (x(λ^N(t)), m(λ^N(t))), (x(t), m(t)) ) = 0,

and thus by dominated convergence

lim_{N→∞} ∫_0^T e^{−βt} d( (x(λ^N(t)), m(λ^N(t))), (x(t), m(t)) ) dt = 0.   (16)

Hence, for N large enough, the second term in the right-hand side of Equation (15) can be made smaller than ε/3. Finally, for N large enough, h^N ≤ ε. This completes the proof.

Sketch of Proof of Theorem 3.3.2

Define the discounted stochastic evolutionary game with a random number of interacting players in each local interaction, in which each player in state x using the mixed action u(·|x) receives r(u, x, m(t)), where m(t) is the population profile at t, which evolves under the dynamical system (3), and in which the transition between states follows the transition kernel L. Then a strategy of a player is the same as in the microscopic case, and the discounted payoff

R(u_1, u_2, s_0, m_0) = ∫_0^∞ e^{−βt} r(s(t), u_1, m[u_2](t)) dt

is the limit of R^N(u_1, u_2, s_0, m_0) when N goes to infinity, where m[u_2] is the solution of the ODE m˙ = f(u_2, m), m(0) = m_0. It follows that the asymptotic regime of the microscopic game and the Markov decision evolutionary game (macroscopic game) are equivalent.

Sketch of Proof of Theorem 3.6.1

We show that for every discount factor β > 0 the optimal control problem (OPT_N) (resp. the fixed-point problem (FIX_N)) has at least one 0-optimal strategy (resp. 0-equilibrium). This follows from the existence of equilibria in stationary strategies for finite stochastic games with discounted payoff: the set of pure strategies is a compact space in the product topology (Tychonoff's theorem). Thus the set of behavioral strategies Σ_j is a compact space, and it is also convex as the set of probability measures on the pure strategies. For every player j and every strategy profile σ, the marginals of the payoff and constraint functions are continuous for any β > 0: α_j ↦ R_j^N(α_j, σ_{−j}, s, m_0). Moreover, the set of stationary strategies is convex and compact, and upper and lower hemi-continuous (as a correspondence). Define

γ_j(s, m_0, σ) = argmax_{α_j ∈ U_s} R_j^N(α_j, σ_{−j}, s, m_0).

Then γ_j(s, m_0, σ) ⊆ Σ_j is a non-empty, convex and compact set, and the product correspondence

γ: σ ↦ ( γ_1(s, m_0, σ), ..., γ_N(s, m_0, σ) )

is upper hemi-continuous (its graph is closed). We now use the Glicksberg generalization of the Kakutani fixed point theorem: there is a stationary strategy profile σ* such that

σ* ∈ γ(s, m_0, σ*).

Moreover, if the game has symmetric payoffs and strategies for each type, there is a symmetric per-type stationary equilibrium. This completes the proof.

Sketch of Proof of Theorem 3.6.2

Let (U^N)_N be a sequence of solutions of (FIX_N), i.e. of equilibria in the system with N players. Let (N_k) be a subsequence such that U^{N_k} converges to some point U as k → ∞. We can write

R^{N_k}(U^{N_k}, U^{N_k}) − R(U, U) = R^{N_k}(U^{N_k}, U^{N_k}) − R^{N_k}(U, U) + R^{N_k}(U, U) − R(U, U).

Since R^N(·,·) is continuous and converges uniformly to R(·,·), the second term R^{N_k}(U, U) − R(U, U) → 0 when N_k → ∞, and the first term R^{N_k}(U^{N_k}, U^{N_k}) − R^{N_k}(U, U) can be rewritten as

R^{N_k}(U^{N_k}, U^{N_k}) − R(U^{N_k}, U^{N_k}) + R(U^{N_k}, U^{N_k}) − R(U, U) + R(U, U) − R^{N_k}(U, U).

Each term goes to zero by continuity of R, convergence of U^{N_k} to U, and uniform convergence of R^N to R. Let U^N be an ε_N-equilibrium. Then R^N(U^N, U^N) ≥ R^N(v, U^N) − ε_N for all v, and any limit U of a subsequence of U^N satisfies R(U, U) ≥ R(v, U) for all v. Similarly, if

R^N(U^N, U^N) ≥ R^N(v, v) − ε_N for all v,

then any omega-limit U of the sequence U^N satisfies R(U, U) ≥ R(v, v) for all v, i.e. U is an optimal strategy.

References

[1] E. Altman, Y. Hayel, H. Tembine, R. El-Azouzi, "Markov Decision Evolutionary Games with Time Average Expected Fitness Criterion", in Proc. of Valuetools, October 2008.
[2] M. Benaim and J. Y. Le Boudec, "A Class of Mean Field Interaction Models for Computer and Communication Systems", Performance Evaluation, 2008.
[3] M. Benaim and J. W. Weibull, "Deterministic Approximation of Stochastic Evolution in Games", Econometrica 71, pp. 873-903, 2003.
[4] S. N. Ethier and T. G. Kurtz, Markov Processes: Characterization and Convergence, Wiley, 2005.
[5] T. G. Kurtz, "Solutions of Ordinary Differential Equations as Limits of Pure Jump Markov Processes", Journal of Applied Probability, Vol. 7, No. 1 (Apr. 1970), pp. 49-58.
[6] T. G. Kurtz, "Limit Theorems for Sequences of Jump Markov Processes Approximating Ordinary Differential Processes", Journal of Applied Probability, Vol. 8, No. 2 (Jun. 1971), pp. 344-356.
[7] Y. Tanabe, "The propagation of chaos for interacting individuals in a large population", Mathematical Social Sciences, Vol. 51, No. 2, pp. 125-152, 2006.
[8] G. Y. Weintraub, L. Benkard, B. Van Roy, "Oblivious Equilibrium: A Mean Field Approximation for Large-Scale Dynamic Games", Advances in Neural Information Processing Systems, Vol. 18, 2006.
[9] J. Y. Le Boudec, D. McDonald, and J. Mundinger, "A Generic Mean Field Convergence Result for Systems of Interacting Objects", in QEST 2007, pp. 3-18.