Structured Perfect Bayesian Equilibrium in Infinite Horizon Dynamic Games with Asymmetric Information

Abhinav Sinha and Achilleas Anastasopoulos

Abstract—In dynamic games with asymmetric information structure, the widely used concept of equilibrium is perfect Bayesian equilibrium (PBE). This is expressed as a strategy and belief pair that simultaneously satisfy sequential rationality and belief consistency. Unlike symmetric information dynamic games, where subgame perfect equilibrium (SPE) is the natural equilibrium concept, to date there does not exist a universal algorithm that decouples the interdependence of strategies and beliefs over time in calculating PBE. In this paper we find a subset of PBE for an infinite horizon discounted reward asymmetric information dynamic game. We refer to it as Structured PBE or SPBE; in SPBE, any agent's strategy depends on the public history only through a common public belief, and on the private history only through that agent's latest private information (his private type). The public belief acts as a summary of all the relevant past information, and its dimension does not increase with time. The motivation for this comes from the common information approach proposed in Nayyar et al. (2013) for solving decentralized team (non-strategic) resource allocation problems with asymmetric information. We calculate SPBE by solving a single-shot fixed-point equation together with a corresponding forward recursive algorithm. We demonstrate our methodology by means of a public goods example.

I. INTRODUCTION

Dynamic games with symmetric information among agents have been studied extensively in the literature [1], [2], [3], [4]. The appropriate equilibrium concept is subgame perfect equilibrium (SPE), which is a refinement of Nash equilibrium. A strategy profile is an SPE if its restriction to any subgame of the original game is also an SPE of the subgame. However, there are models of interest where strategic agents interact repeatedly whilst observing private signals. For instance, relay scheduling with private queue length information at different relays [5]. Another example is that of Bayesian learning games [6], [7], [8], [9], such as when customers post reviews on websites such as Amazon; clearly agents possess their own experience of the object they bought, and this constitutes private information. These examples lead to a model with asymmetric information, where the notion of SPE is ineffective.

Appropriate equilibrium concepts in such a case consist of strategy profiles and beliefs. These include weak perfect Bayesian equilibrium, perfect Bayesian equilibrium (PBE) and sequential equilibrium. PBE is the most commonly used notion. The requirement of PBE is sequential rationality of strategies as well as belief consistency. Due to the fact that future beliefs depend on past strategies and strategies are sequentially rational based on specific beliefs, finding a pair of strategy and belief that satisfies the PBE requirement is a difficult problem and usually reduces to a large fixed-point equation over the entire time horizon. To date, there is no universal algorithm that provides simplification by decomposing the aforementioned fixed-point equation for calculating PBEs.

Our motivation stems from the work of Nayyar et al. [10]. The authors in [10] consider a decentralized dynamic team problem (non-strategic agents) with asymmetric information. Each agent observes a private type which evolves with time as a controlled Markov process. They introduce a common information based approach, whereby each agent calculates a belief on every agent's current private type. This belief is defined in such a manner that, despite having asymmetric information, the agents can agree on it. They proceed to show optimality of policies that depend on the private history only through this belief and the respective agent's current private type.

The common information based approach has been studied for finite horizon dynamic games with asymmetric information in [11], [12], [13], [14], [9]. Of these, [11], [12] consider models where the aforementioned belief from the common information approach is updated independently of the strategy. This implies that the simultaneous requirements of sequential rationality and belief consistency can be decoupled, resulting in a simplification in calculating PBEs. However, these models produce non-signaling PBEs. For the models studied in [13], [14], [9], belief updates depend on strategies. Consequently the resulting PBEs are signaling equilibria. The problem of finding them, though, becomes more complicated.

Other than the common information based approach, Li et al. [15] consider a finite horizon zero-sum dynamic game where at each time only one agent out of the two knows the state of the system. The value of the game is calculated by formulating an appropriate linear program.

All works listed above for dynamic games consider a finite horizon. In this paper, we deal with an infinite horizon discounted reward dynamic game. Cole et al. [16] consider an infinite horizon discounted reward dynamic game where actions are only privately observable. They provide a fixed-point equation for calculating a subset of sequential equilibria, referred to as Markov private equilibria (MPrE). In MPrE, strategies depend on the history only through the latest private observation.

Contributions

In this work we follow the models of [13], [14], [9], where beliefs are updated dependent on the strategy and thus signaling PBEs are considered.

This paper provides a one-shot fixed-point equation and a forward recursive algorithm that together generate a subset of perfect Bayesian equilibria for the infinite horizon discounted reward dynamic game with asymmetric information. The common information based approach of [10] is used, and the strategies are such that they depend on the public history only through the common belief (which summarizes the relevant information) and on the private history only through the respective agent's current private type. The PBE thus generated are referred to as Structured PBE or SPBE and possess the property that the mapping between belief and strategy is stationary (w.r.t. time).

The provided methodology (consisting of solving a one-shot fixed-point equation) decomposes the interdependence between belief and strategy in PBEs and enables a systematic evaluation of SPBEs.

The model allows for signaling amongst agents, as beliefs depend on strategies. This is because the source of asymmetry of information in our model is that each agent has a privately observed type that affects the utility function of all the agents. These types evolve as a controlled Markov process and depend on the actions of all agents (the actions are assumed to be common knowledge).

Our methodology for proving our results is as follows: first we extend the finite horizon model of [13] to include a belief-based terminal reward. This does not change the proofs significantly but helps us simplify the exposition and proofs of the infinite horizon case. In the next step, the infinite horizon results are obtained with the help of the finite horizon ones through continuity arguments as the horizon $T \to \infty$. We demonstrate our methodology of finding SPBEs through a concrete example.

The remainder of this paper is organized as follows. Section II introduces the model for the dynamic game with private types and discounted reward; we consider two versions of the problem, finite and infinite horizon. Section III defines the backward recursive algorithm for calculating SPBEs in the finite horizon model. The results in this section are an adaptation of the finite horizon results from [13]; this is done so that the finite horizon results can be appropriately used in Section IV. Section IV contains the central result of this paper, namely the fixed-point equation that defines the set of SPBE strategies for the infinite horizon game. Finally, Section V discusses a concrete example of a public goods game with two agents. This example is an infinite horizon version of the finite horizon public goods example from [1, Ch. 8, Example 8.3].

II. MODEL FOR DYNAMIC GAME

We consider a dynamical system with $N$ strategic agents, denoted by $\mathcal{N}$. Associated with each agent $i$ at any time $t \ge 1$ is a type $x^i_t \in \mathcal{X}^i$. The set of types $\mathcal{X}^i$ is assumed to be finite. We assume that each agent $i$ can observe their own type $x^i_t$ but not that of others. Denote the profile of types at time $t$ by $x_t = (x^i_t)_{i \in \mathcal{N}}$.

Each agent $i$ at any time $t \ge 1$, after observing their private type $x^i_t$, takes an action $a^i_t \in \mathcal{A}^i$. The set of actions available to agent $i$ is denoted by $\mathcal{A}^i$ and is assumed to be finite. It is assumed that the action profile is publicly observed and is common knowledge.

The type of each agent evolves over time as a controlled Markov process, independent of the other agents given the action profile $a_t = (a^i_t)_{i \in \mathcal{N}}$. We assume a time-homogeneous kernel $Q^i$ for the evolution of agent $i$'s type, i.e., $\mathbb{P}(x_{t+1} \mid x_{1:t}, a_{1:t}) = \mathbb{P}(x_{t+1} \mid x_t, a_t) = \prod_{i=1}^{N} Q^i(x^i_{t+1} \mid x^i_t, a_t)$. The notation we use is $a_{1:t} = (a_k)_{k=1,\ldots,t}$. The initial belief, at time 1, is assumed to be $\mathbb{P}(x_1) = \prod_{i=1}^{N} Q^i_0(x^i_1)$.

Associated with each agent $i$ is a reward function $R^i$ that depends on the type and action profiles, i.e., the reward at time $t$ for agent $i$ is $R^i(x_t, a_t)$. Furthermore, rewards are accumulated over time in a discounted manner with a common discount factor $\delta \in (0,1)$. We consider two versions of the problem: finite horizon ($T$) and infinite horizon. For the finite horizon problem we introduce a terminal reward. We first define the beliefs $\pi_t$, since the terminal reward is defined as a function of beliefs.

Define the public history $h^c_t$ at time $t$ as the set of publicly observed actions, i.e., $h^c_t = a_{1:t-1} \in (\times_{i=1}^{N} \mathcal{A}^i)^{t-1}$. Denote the set of public histories by $\mathcal{H}^c_t$. Define the private history $h^i_t$ of any agent $i$ at time $t$ as the set of privately observed types $x^i_{1:t} \in (\mathcal{X}^i)^t$ of agent $i$ and the publicly observed action profiles $a_{1:t-1}$, i.e., $h^i_t = (x^i_{1:t}, a_{1:t-1})$. Denote the set of private histories by $\mathcal{H}^i_t$. Note that the private history $h^i_t$ contains the public history $h^c_t$. Also denote the overall history at time $t$ by $h_t = (x_{1:t}, a_{1:t-1})$ and the set of all such histories by $\mathcal{H}_t$.

Any strategy used by agent $i$ can be represented as $g^i = (g^i_t)_{t \ge 1}$ with $g^i_t : \mathcal{H}^i_t \to \Delta(\mathcal{A}^i)$. The notation we use is: $\Delta(S)$ is the set of all probability distributions over the elements of the set $S$.

The belief $\pi_t \in \times_{j=1}^{N} \Delta(\mathcal{X}^j)$, at any time $t$, depends on the strategy profile $g_{1:t-1} = (g^j_{1:t-1})_{j \in \mathcal{N}}$. Given a strategy profile $g_{1:t-1}$, we define the belief as follows: $\pi_t(x_t) = (\pi^i_t(x^i_t))_{i \in \mathcal{N}}$, with $\pi^i_t(x^i_t) = \mathbb{P}^g(x^i_t \mid h^c_t)$ if $\mathbb{P}^g(h^c_t) > 0$, and otherwise

$\pi^i_t(x^i_t) = \sum_{x^i_{t-1} \in \mathcal{X}^i} Q^i(x^i_t \mid x^i_{t-1}, a_{t-1})\, \pi^i_{t-1}(x^i_{t-1}).$   (1)

The initial belief is $\pi^i_1(x^i_1) = Q^i_0(x^i_1)$. The above definition ensures that the belief random variable $\Pi_t$, at any time $t$, is measurable, even if the set of histories resulting in a particular instance $\pi_t$ has zero measure under strategy $g$.
Finally, in the finite horizon game, for each agent $i$ there is a terminal reward $G^i(\pi_{T+1}, x^i_{T+1})$ that depends on the terminal type of agent $i$ and the terminal belief. It is assumed that $G^i(\cdot)$ is absolutely bounded.

III. FINITE HORIZON

In this section, we state and prove properties of the finite horizon backwards recursive algorithm for calculating SPBE, adapted from [13] with minor modifications to accommodate terminal rewards (specifically, belief-based ones). The results from this section are used in Section IV to prove the infinite horizon equilibrium result. To distinguish quantities defined in this section from their infinite horizon counterparts, we add a superscript $T$ to all of them.

A. Finite Horizon Problem and Backwards Recursion

Consider a finite horizon, $T > 1$, problem in the above dynamic game model. Define the value functions $\big(V^{i,T}_t : \times_{j \in \mathcal{N}} \Delta(\mathcal{X}^j) \times \mathcal{X}^i \to \mathbb{R}\big)_{i \in \mathcal{N},\, t \in \{1,\ldots,T\}}$ and strategies $\big(\tilde{\gamma}^{i,T}_t\big)_{i \in \mathcal{N},\, t \in \{1,\ldots,T\}}$ backwards inductively as follows.

1) $\forall\, i \in \mathcal{N}$, set $V^{i,T}_{T+1} \equiv G^i$.

2) For any $t \in \{1,\ldots,T\}$ and $\pi_t$, solve the following fixed-point equation in $(\tilde{\gamma}^{i,T}_t)_{i \in \mathcal{N}}$: $\forall\, i \in \mathcal{N},\ x^i_t \in \mathcal{X}^i$,

$\tilde{\gamma}^{i,T}_t(\cdot \mid x^i_t) \in \arg\max_{\gamma^i_t(\cdot\mid x^i_t) \in \Delta(\mathcal{A}^i)} \mathbb{E}^{\gamma^i_t(\cdot\mid x^i_t),\, \tilde{\gamma}^{-i,T}_t,\, \pi^{-i}_t}\Big[ R^i(X_t, A_t) + \delta\, V^{i,T}_{t+1}\big( \big(F(\pi^j_t, \tilde{\gamma}^{j,T}_t, A_t)\big)_{j=1}^{N}, X^i_{t+1} \big) \,\Big|\, \pi_t, x^i_t \Big]$   (2)

(see below for the various quantities involved in the above expression).

3) Then define

$V^{i,T}_t(\pi_t, x^i_t) = \mathbb{E}^{\tilde{\gamma}^{i,T}_t(\cdot\mid x^i_t),\, \tilde{\gamma}^{-i,T}_t,\, \pi^{-i}_t}\Big[ R^i(X_t, A_t) + \delta\, V^{i,T}_{t+1}\big( \big(F(\pi^j_t, \tilde{\gamma}^{j,T}_t, A_t)\big)_{j=1}^{N}, X^i_{t+1} \big) \,\Big|\, \pi_t, x^i_t \Big].$   (3)

Denote by $\theta^{i,T}_t$ the mapping $\pi_t \mapsto \tilde{\gamma}^{i,T}_t$, i.e., $\tilde{\gamma}^{i,T}_t = \theta^{i,T}_t[\pi_t]$.

In (2), the expectation is with respect to the following distribution:

$(X^{-i}_t, A^i_t, A^{-i}_t, X^i_{t+1}) \sim \pi^{-i}_t(x^{-i}_t)\, \gamma^i_t(a^i_t \mid x^i_t)\, \tilde{\gamma}^{-i,T}_t(a^{-i}_t \mid x^{-i}_t)\, Q^i(x^i_{t+1} \mid x^i_t, a_t),$   (4)

and

$F(\pi^j, \gamma^j, a)(x'^j) = \begin{cases} \dfrac{\sum_{x^j \in \mathcal{X}^j} \pi^j(x^j)\, \gamma^j(a^j \mid x^j)\, Q^j(x'^j \mid x^j, a)}{\sum_{\tilde{x}^j \in \mathcal{X}^j} \pi^j(\tilde{x}^j)\, \gamma^j(a^j \mid \tilde{x}^j)} & \text{if the denominator is } > 0, \\ \sum_{x^j \in \mathcal{X}^j} \pi^j(x^j)\, Q^j(x'^j \mid x^j, a) & \text{if the denominator is } 0. \end{cases}$   (5)

Below we define the strategy-belief pair $(\beta^\star, \mu^\star)$,

$\beta^\star = \big(\beta^{i,\star}_t\big)_{t \in \{1,\ldots,T\},\, i \in \mathcal{N}}, \qquad \beta^{i,\star}_t : \mathcal{H}^i_t \to \Delta(\mathcal{A}^i),$   (6a)
$\mu^\star = \big(\mu^{i,\star}_t\big)_{t \in \{1,\ldots,T\},\, i \in \mathcal{N}}, \qquad \mu^{i,\star}_t : \mathcal{H}^i_t \to \Delta(\mathcal{H}_t),$   (6b)

based on the mapping $\theta^T$ produced by the above algorithm. Define the belief $\mu^\star$ inductively as follows: set $\mu^\star_1(x_1) = \prod_{i=1}^{N} Q^i_0(x^i_1)$. Then for $t \in \{1,\ldots,T\}$,

$\mu^{i,\star}_{t+1}\big[h^c_{t+1}\big] = F\big( \mu^{i,\star}_t[h^c_t],\, \theta^{i,T}_t\big[\mu^\star_t[h^c_t]\big],\, a_t \big),$   (7a)
$\mu^\star_{t+1}\big[h^c_{t+1}\big](x_{t+1}) = \prod_{i=1}^{N} \mu^{i,\star}_{t+1}\big[h^c_{t+1}\big](x^i_{t+1}).$   (7b)

Denote the strategy arising out of $\tilde{\gamma}^{i,T}_t$ by $\beta^{i,\star}_t$, i.e.,

$\beta^{i,\star}_t(a^i_t \mid h^i_t) = \theta^{i,T}_t\big[\mu^\star_t[h^c_t]\big](a^i_t \mid x^i_t).$   (8)
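The update $F$ in (5) is a Bayes-rule update of agent $j$'s marginal belief through his prescribed randomized action, followed by propagation through the kernel $Q^j$; when the observed action has zero probability under the prescription, the Bayes step is skipped, as in (1). Below is a minimal Python sketch, with function and variable names of our own choosing.

\begin{verbatim}
import numpy as np

def belief_update_F(pi_j, gamma_j, a_j, Q_j, a):
    """One application of F(pi^j, gamma^j, a) from (5).

    pi_j    : (|X^j|,) current marginal belief on agent j's type
    gamma_j : (|X^j|, |A^j|) prescription gamma^j(a^j | x^j)
    a_j     : agent j's observed action
    Q_j     : (|X^j|, n_profiles, |X^j|) kernel Q^j(x' | x, a), profile flattened
    a       : index of the full observed action profile a_t
    """
    weights = pi_j * gamma_j[:, a_j]                 # pi^j(x) * gamma^j(a^j | x)
    den = weights.sum()
    posterior = weights / den if den > 0 else pi_j   # skip Bayes step if den = 0
    return posterior @ Q_j[:, a, :]                  # propagate through Q^j(. | x, a)

# Tiny usage example with made-up numbers (two types, two actions, static types).
pi_j = np.array([0.5, 0.5])
gamma_j = np.array([[1.0, 0.0],                      # type 0 plays action 0
                    [0.2, 0.8]])                     # type 1 mostly plays action 1
Q_j = np.tile(np.eye(2)[:, None, :], (1, 4, 1))      # Q^j(x'|x,a) = 1{x'=x}
print(belief_update_F(pi_j, gamma_j, a_j=1, Q_j=Q_j, a=0))   # mass moves to type 1
\end{verbatim}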

B. Finite Horizon Result

In this section we present three lemmas. The first two are technical results needed in the proof of the third. The result of the third lemma is used in Section IV.

Define the reward-to-go $W^{i,\beta^i,T}_t$ for any agent $i$ and strategy $\beta^i$ as

$W^{i,\beta^i,T}_t(h^i_t) = \mathbb{E}^{\beta^i,\, \beta^{-i,\star},\, \mu^\star_t[h^c_t]}\Big[ \sum_{n=t}^{T} \delta^{n-t} R^i(X_n, A_n) + \delta^{T+1-t} G^i(\Pi_{T+1}, X^i_{T+1}) \,\Big|\, h^i_t \Big].$   (9)

Here agent $i$'s strategy is $\beta^i$ whereas all other agents use the strategy $\beta^{-i,\star}$ defined above. Since $\mathcal{X}^i, \mathcal{A}^i$ are assumed to be finite and $G^i$ is absolutely bounded, the reward-to-go is finite $\forall\, i, t, \beta^i, h^i_t$.

Lemma 1. For any $t \in \{1,\ldots,T\}$, $i \in \mathcal{N}$, $h^i_t$ and $\beta^i_t$,

$V^{i,T}_t(\mu^\star_t[h^c_t], x^i_t) \ \ge\ \mathbb{E}^{\beta^i_t,\, \beta^{-i,\star}_t,\, \mu^\star_t[h^c_t]}\Big[ R^i(X_t, A_t) + \delta\, V^{i,T}_{t+1}\big( \big(F(\mu^{j,\star}_t[h^c_t], \beta^{j,\star}_t, A_t)\big)_{j=1}^{N}, X^i_{t+1} \big) \,\Big|\, h^i_t \Big].$   (10)

Proof. We use proof by contradiction. Suppose $\exists\, i, t, \hat{h}^i_t, \hat{\beta}^i_t$ such that (10) is violated. Construct the strategy $\hat{\gamma}^i_t$ as

$\hat{\gamma}^i_t(a^i_t \mid x^i_t) = \begin{cases} \hat{\beta}^i_t(a^i_t \mid \hat{h}^i_t) & \text{if } x^i_t = \hat{x}^i_t, \\ \frac{1}{|\mathcal{A}^i|} & \text{if } x^i_t \ne \hat{x}^i_t. \end{cases}$   (11)

Then

$V^{i,T}_t(\mu^\star_t[\hat{h}^c_t], \hat{x}^i_t)$   (12a)
$\ \ge\ \mathbb{E}^{\hat{\gamma}^i_t(\cdot\mid \hat{x}^i_t),\, \beta^{-i,\star}_t,\, \mu^\star_t[\hat{h}^c_t]}\Big[ R^i(X_t, A_t) + \delta\, V^{i,T}_{t+1}\big( \big(F(\mu^{j,\star}_t[\hat{h}^c_t], \beta^{j,\star}_t(\cdot \mid \hat{h}^c_t, \cdot), A_t)\big)_{j=1}^{N}, X^i_{t+1} \big) \,\Big|\, \mu^\star_t[\hat{h}^c_t], \hat{x}^i_t \Big]$   (12b)
$\ =\ \mathbb{E}^{\hat{\beta}^i_t,\, \beta^{-i,\star}_t,\, \mu^\star_t[\hat{h}^c_t]}\Big[ R^i(X_t, A_t) + \delta\, V^{i,T}_{t+1}\big( \big(F(\mu^{j,\star}_t[\hat{h}^c_t], \beta^{j,\star}_t(\cdot \mid \hat{h}^c_t, \cdot), A_t)\big)_{j=1}^{N}, X^i_{t+1} \big) \,\Big|\, \hat{h}^i_t \Big]$   (12c)
$\ >\ V^{i,T}_t(\mu^\star_t[\hat{h}^c_t], \hat{x}^i_t).$   (12d)

The first inequality above follows from the algorithm definition in (2) and (3), the second equality follows from the definition in (11), and the last inequality follows from the assumption at the beginning of this proof. Since the above is clearly a contradiction, the result follows. $\blacksquare$

Lemma 2.

$\mathbb{E}^{\beta^i_{t+1:T},\, \beta^{-i,\star}_{t+1:T},\, \mu^\star_{t+1}[h^c_t, a_t]}\Big[ \sum_{n=t+1}^{T} \delta^{n-(t+1)} R^i(X_n, A_n) + \delta^{T-t} G^i(\Pi_{T+1}, X^i_{T+1}) \,\Big|\, h^i_t, a_t, x^i_{t+1} \Big]$
$\ =\ \mathbb{E}^{\beta^i_{t:T},\, \beta^{-i,\star}_{t:T},\, \mu^\star_t[h^c_t]}\Big[ \sum_{n=t+1}^{T} \delta^{n-(t+1)} R^i(X_n, A_n) + \delta^{T-t} G^i(\Pi_{T+1}, X^i_{T+1}) \,\Big|\, h^i_t, a_t, x^i_{t+1} \Big].$   (13)

Proof. This result relies on the structure of the update in (7), specifically that $\mu^{-i,\star}_{t+1}[h^c_{t+1}]$ is a deterministic function of $\mu^{-i,\star}_t[h^c_t]$, $\beta^{-i,\star}_t$, $a_t$ and does not depend on $\beta^i$. Consider the joint pmf-pdf of the random variables involved in the expectation,

$\mathbb{P}^{\beta^i_{t:T},\, \beta^{-i,\star}_{t:T},\, \mu^\star_t[h^c_t]}\big( x^{-i}_{t+1}, a_{t+1:T}, x_{t+2:T}, x_{T+1}, \pi_{T+1} \,\big|\, h^i_t, a_t, x^i_{t+1} \big).$   (14)

This can be written as $A/B$ with

$A = \sum_{x^{-i}_t} \mathbb{P}^{\beta^i_{t:T},\, \beta^{-i,\star}_{t:T},\, \mu^\star_t[h^c_t]}\big( x^{-i}_t, a_t, x_{t+1}, a_{t+1:T}, x_{t+2:T}, x_{T+1}, \pi_{T+1} \,\big|\, h^i_t \big),$   (15a)

$B = \sum_{\tilde{x}^{-i}_t} \mathbb{P}^{\beta^i_{t:T},\, \beta^{-i,\star}_{t:T},\, \mu^\star_t[h^c_t]}\big( \tilde{x}^{-i}_t, a_t, x^i_{t+1} \,\big|\, h^i_t \big).$   (15b)

Using causal decomposition we can write

$A = \sum_{x^{-i}_t} \mathbb{P}^{\beta^i_{t:T},\, \beta^{-i,\star}_{t:T},\, \mu^\star_t[h^c_t]}(x^{-i}_t \mid h^i_t)\, \beta^i_t(a^i_t \mid h^i_t)\, \beta^{-i,\star}_t(a^{-i}_t \mid h^c_t, x^{-i}_t)\, Q(x_{t+1} \mid x_t, a_t)\, \mathbb{P}^{\beta^i_{t+1:T},\, \beta^{-i,\star}_{t+1:T},\, \mu^\star_{t+1}[h^c_{t+1}]}\big( a_{t+1:T}, x_{t+2:T}, x_{T+1}, \pi_{T+1} \,\big|\, h^i_t, a_t, x^{-i}_t, x_{t+1} \big)$   (16a)
$\ =\ \sum_{x^{-i}_t} \mu^{-i,\star}_t[h^c_t](x^{-i}_t)\, \beta^i_t(a^i_t \mid h^i_t)\, \beta^{-i,\star}_t(a^{-i}_t \mid h^c_t, x^{-i}_t)\, Q^i(x^i_{t+1} \mid x^i_t, a_t)\, Q^{-i}(x^{-i}_{t+1} \mid x^{-i}_t, a_t)\, \mathbb{P}^{\beta^i_{t+1:T},\, \beta^{-i,\star}_{t+1:T},\, \mu^\star_{t+1}[h^c_{t+1}]}\big( a_{t+1:T}, x_{t+2:T}, x_{T+1}, \pi_{T+1} \,\big|\, h^i_t, a_t, x^{-i}_t, x_{t+1} \big),$   (16b)

where the second equality follows from the fact that, given $h^i_t, a_t, x^{-i}_t, x_{t+1}$ and $\mu^\star_t[h^c_t]$, the probability of $(a_{t+1:T}, x_{t+2:T}, x_{T+1}, \pi_{T+1})$ depends on $h^i_t, a_t, x^{-i}_t, x_{t+1}, \mu^\star_{t+1}[h^c_{t+1}]$ only through $\beta^i_{t+1:T}, \beta^{-i,\star}_{t+1:T}$. The second equality also uses the fact that types evolve conditionally independently given the action profile. Performing a similar decomposition of the denominator $B$ and substituting back into the expression from (14) allows us to cancel the terms $\beta^i_t(\cdot)$ and $Q^i(\cdot)$. Using the belief update from (7), this gives that the expression in (14) equals

$\mu^{-i,\star}_{t+1}[h^c_{t+1}](x^{-i}_{t+1})\, \mathbb{P}^{\beta^i_{t+1:T},\, \beta^{-i,\star}_{t+1:T},\, \mu^\star_{t+1}[h^c_{t+1}]}\big( a_{t+1:T}, x_{t+2:T}, x_{T+1}, \pi_{T+1} \,\big|\, h^i_t, a_t, x_{t+1} \big)$
$\ =\ \mathbb{P}^{\beta^i_{t+1:T},\, \beta^{-i,\star}_{t+1:T},\, \mu^\star_{t+1}[h^c_{t+1}]}\big( x^{-i}_{t+1}, a_{t+1:T}, x_{t+2:T}, x_{T+1}, \pi_{T+1} \,\big|\, h^i_t, a_t, x^i_{t+1} \big).$   (17)

The above equality follows directly from the definition. This completes the proof. $\blacksquare$

The result below shows that the value function from the backwards recursive algorithm is higher than any reward-to-go.

Lemma 3. For any $t \in \{1,\ldots,T\}$, $i \in \mathcal{N}$, $h^i_t$ and $\beta^i$,

$V^{i,T}_t(\mu^\star_t[h^c_t], x^i_t) \ \ge\ W^{i,\beta^i,T}_t(h^i_t).$   (18)

Proof. We use backward induction for this. At time $T$, using the maximization property from (2),

$V^{i,T}_T(\mu^\star_T[h^c_T], x^i_T)$   (19a)
$\ =\ \mathbb{E}^{\tilde{\gamma}^{i,T}_T(\cdot\mid x^i_T),\, \tilde{\gamma}^{-i,T}_T,\, \mu^\star_T[h^c_T]}\Big[ R^i(X_T, A_T) + \delta\, G^i\big( \big(F(\mu^{j,\star}_T[h^c_T], \tilde{\gamma}^{j,T}_T, A_T)\big)_{j=1}^{N}, X^i_{T+1} \big) \,\Big|\, \mu^\star_T[h^c_T], x^i_T \Big]$   (19b)
$\ \ge\ \mathbb{E}^{\gamma^i_T(\cdot\mid x^i_T),\, \tilde{\gamma}^{-i,T}_T,\, \mu^\star_T[h^c_T]}\Big[ R^i(X_T, A_T) + \delta\, G^i\big( \big(F(\mu^{j,\star}_T[h^c_T], \tilde{\gamma}^{j,T}_T, A_T)\big)_{j=1}^{N}, X^i_{T+1} \big) \,\Big|\, \mu^\star_T[h^c_T], x^i_T \Big]$   (19c)
$\ =\ W^{i,\beta^i,T}_T(h^i_T).$   (19d)

Here the second inequality follows from (2) and (3), and the final equality is by the definition in (9).

Assume that the result holds for all $n \in \{t+1,\ldots,T\}$; then at time $t$ we have

$V^{i,T}_t(\mu^\star_t[h^c_t], x^i_t)$   (20a)
$\ \ge\ \mathbb{E}^{\beta^i_t,\, \beta^{-i,\star}_t,\, \mu^\star_t[h^c_t]}\Big[ R^i(X_t, A_t) + \delta\, V^{i,T}_{t+1}\big( \big(F(\mu^{j,\star}_t[h^c_t], \beta^{j,\star}_t, A_t)\big)_{j=1}^{N}, X^i_{t+1} \big) \,\Big|\, h^i_t \Big]$   (20b)
$\ \ge\ \mathbb{E}^{\beta^i_t,\, \beta^{-i,\star}_t,\, \mu^\star_t[h^c_t]}\Big[ R^i(X_t, A_t) + \delta\, \mathbb{E}^{\beta^i_{t+1:T},\, \beta^{-i,\star}_{t+1:T},\, \mu^\star_{t+1}[h^c_t, A_t]}\Big[ \sum_{n=t+1}^{T} \delta^{n-(t+1)} R^i(X_n, A_n) + \delta^{T-t} G^i(\Pi_{T+1}, X^i_{T+1}) \,\Big|\, h^i_t, A_t, X^i_{t+1} \Big] \,\Big|\, h^i_t \Big]$   (20c)
$\ =\ \mathbb{E}^{\beta^i_{t:T},\, \beta^{-i,\star}_{t:T},\, \mu^\star_t[h^c_t]}\Big[ \sum_{n=t}^{T} \delta^{n-t} R^i(X_n, A_n) + \delta^{T+1-t} G^i(\Pi_{T+1}, X^i_{T+1}) \,\Big|\, h^i_t \Big]$   (20d)
$\ =\ W^{i,\beta^i,T}_t(h^i_t).$   (20e)

Here the first inequality follows from Lemma 1, the second inequality from the induction hypothesis, the third equality follows from Lemma 2, and the final equality from the definition in (9). $\blacksquare$

IV. SPBE IN INFINITE HORIZON

In this section we consider the infinite horizon dynamic game, which naturally has no terminal reward.

A. Perfect Bayesian Equilibrium

A perfect Bayesian equilibrium is a pair of strategy and belief $(\beta^\star, \mu^\star)$, where

$\beta^\star = \big(\beta^{i,\star}_t\big)_{t \ge 1,\, i \in \mathcal{N}}, \qquad \beta^{i,\star}_t : \mathcal{H}^i_t \to \Delta(\mathcal{A}^i),$   (21a)
$\mu^\star = \big(\mu^{i,\star}_t\big)_{t \ge 1,\, i \in \mathcal{N}}, \qquad \mu^{i,\star}_t : \mathcal{H}^i_t \to \Delta(\mathcal{H}_t),$   (21b)

such that sequential rationality is satisfied: $\forall\, i \in \mathcal{N}$, $\beta^i$, $t \ge 1$, $h^i_t \in \mathcal{H}^i_t$,

$\mathbb{E}^{\beta^{i,\star},\, \beta^{-i,\star},\, \mu^\star_t[h^c_t]}\Big[ \sum_{n=t}^{\infty} \delta^{n-t} R^i(X_n, A_n) \,\Big|\, h^i_t \Big] \ \ge\ \mathbb{E}^{\beta^i,\, \beta^{-i,\star},\, \mu^\star_t[h^c_t]}\Big[ \sum_{n=t}^{\infty} \delta^{n-t} R^i(X_n, A_n) \,\Big|\, h^i_t \Big],$   (22)

and the beliefs satisfy certain consistency conditions (please refer to [1, pp. 331] for the exact conditions).

B. Fixed Point Equation for Infinite Horizon

In this section, we state the fixed-point equation that defines the value function and strategy mapping for the infinite horizon problem. This is analogous to the backwards recursion ((2) and (3)) that defined the value function and the mapping $\theta$ for the finite horizon problem.

Define the set of functions $V^i : \times_{j=1}^{N} \Delta(\mathcal{X}^j) \times \mathcal{X}^i \to \mathbb{R}$ and strategies $\tilde{\gamma}^i : \mathcal{X}^i \to \Delta(\mathcal{A}^i)$ (generated formally as $\tilde{\gamma}^i = \theta^i[\pi]$ for given $\pi$) via the following fixed-point equation: $\forall\, i \in \mathcal{N}$, $x^i \in \mathcal{X}^i$,

$\tilde{\gamma}^i(\cdot \mid x^i) \in \arg\max_{\gamma^i(\cdot\mid x^i) \in \Delta(\mathcal{A}^i)} \mathbb{E}^{\gamma^i(\cdot\mid x^i),\, \tilde{\gamma}^{-i},\, \pi^{-i}}\Big[ R^i(X, A) + \delta\, V^i\big( \big[F(\pi^j, \tilde{\gamma}^j, A)\big]_{j=1}^{N}, X^{i,0} \big) \,\Big|\, \pi, x^i \Big],$   (23a)

$V^i(\pi, x^i) = \mathbb{E}^{\tilde{\gamma}^i(\cdot\mid x^i),\, \tilde{\gamma}^{-i},\, \pi^{-i}}\Big[ R^i(X, A) + \delta\, V^i\big( \big[F(\pi^j, \tilde{\gamma}^j, A)\big]_{j=1}^{N}, X^{i,0} \big) \,\Big|\, \pi, x^i \Big].$   (23b)

Note that the above is a joint fixed-point equation in $(V, \tilde{\gamma})$, unlike the backwards recursive algorithm earlier, which required solving a fixed-point equation only in $\tilde{\gamma}$. Here the unknown quantity is distributed as

$(X^{-i}, A^i, A^{-i}, X^{i,0}) \sim \pi^{-i}(x^{-i})\, \gamma^i(a^i \mid x^i)\, \tilde{\gamma}^{-i}(a^{-i} \mid x^{-i})\, Q(x^{i,0} \mid x^i, a),$   (24)

where $X^{i,0}$ denotes agent $i$'s type at the next time instant, and $F(\cdot)$ is as defined in (5).

Define the belief $\mu^\star$ inductively as follows: set $\mu^\star_1(x_1) = \prod_{i=1}^{N} Q^i_0(x^i_1)$. Then for $t \ge 1$,

$\mu^{i,\star}_{t+1}\big[h^c_{t+1}\big] = F\big( \mu^{i,\star}_t[h^c_t],\, \theta^i\big[\mu^\star_t[h^c_t]\big],\, a_t \big),$   (25a)
$\mu^\star_{t+1}\big[h^c_{t+1}\big](x_{t+1}) = \prod_{i=1}^{N} \mu^{i,\star}_{t+1}\big[h^c_{t+1}\big](x^i_{t+1}).$   (25b)

By construction, the belief defined above satisfies the consistency condition needed for a perfect Bayesian equilibrium. Denote the stationary strategy arising out of $\tilde{\gamma}$ by $\beta^\star$, i.e.,

$\beta^{i,\star}_t(a^i_t \mid h^i_t) = \theta^i\big[\mu^\star_t[h^c_t]\big](a^i_t \mid x^i_t).$   (26)
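The paper states (23) but does not prescribe a numerical method for solving it. One possible scheme, sketched below in Python for two agents with two types and two actions each, is to discretize each marginal belief, project the updated belief $F(\pi, \tilde{\gamma}, a)$ back onto the grid, and alternate an iterated best-response pass for $\tilde{\gamma}$ with a value-update pass for $V$. All of these algorithmic choices, and all names and placeholder model data, are ours and are meant only as a rough illustration.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
nX, nA, delta = 2, 2, 0.5
R = rng.uniform(-1, 1, size=(2, nX, nX, nA, nA))      # R[i, x^1, x^2, a^1, a^2]
Q = rng.dirichlet(np.ones(nX), size=(2, nX, nA, nA))  # Q[i, x^i, a^1, a^2, x^i']
grid = np.linspace(0.0, 1.0, 11)                      # pi^j encoded as pi^j(type 1)

def F_update(p, gamma_j, a_j, Qj, a1, a2):
    """Belief update (5): p = P(x^j = 1); gamma_j[x] = P(a^j = 1 | x)."""
    prior = np.array([1.0 - p, p])
    lik = gamma_j if a_j == 1 else 1.0 - gamma_j
    w = prior * lik
    post = w / w.sum() if w.sum() > 0 else prior
    return float(post @ Qj[:, a1, a2, 1])

def q_value(i, x_i, a_i, p1, p2, gam, V):
    """Expected bracket in (23a)/(23b) for agent i playing a_i at belief (p1, p2)."""
    p_other, total = (p2 if i == 0 else p1), 0.0
    for x_o in range(nX):                              # X^{-i} ~ pi^{-i}
        for a_o in range(nA):                          # A^{-i} ~ gamma^{-i}
            w = ([1 - p_other, p_other][x_o] *
                 (gam[1 - i][x_o] if a_o == 1 else 1 - gam[1 - i][x_o]))
            a1, a2 = (a_i, a_o) if i == 0 else (a_o, a_i)
            x1, x2 = (x_i, x_o) if i == 0 else (x_o, x_i)
            np1 = F_update(p1, gam[0], a1, Q[0], a1, a2)   # next beliefs via F
            np2 = F_update(p2, gam[1], a2, Q[1], a1, a2)
            g1 = int(np.abs(grid - np1).argmin())          # nearest-grid projection
            g2 = int(np.abs(grid - np2).argmin())
            cont = Q[i, x_i, a1, a2] @ V[i][g1, g2, :]     # E over X^{i,0} of V^i
            total += w * (R[i, x1, x2, a1, a2] + delta * cont)
    return total

V = [np.zeros((grid.size, grid.size, nX)) for _ in range(2)]
for _ in range(30):                                    # crude outer iteration on (V, gamma)
    newV = [v.copy() for v in V]
    for gi, p1 in enumerate(grid):
        for gj, p2 in enumerate(grid):
            gam = [np.full(nX, 0.5) for _ in range(2)] # gam[i][x] = P(a^i = 1 | x)
            for _ in range(10):                        # iterated best response for (23a)
                for i in range(2):
                    for x_i in range(nX):
                        qs = [q_value(i, x_i, a, p1, p2, gam, V) for a in range(nA)]
                        gam[i][x_i] = float(np.argmax(qs))
            for i in range(2):
                for x_i in range(nX):                  # value update (23b) at this profile
                    newV[i][gi, gj, x_i] = q_value(i, x_i, int(gam[i][x_i]),
                                                   p1, p2, gam, V)
    V = newV
\end{verbatim}

Note that the pure best-response inner loop used here need not converge and cannot produce mixed prescriptions $\tilde{\gamma}$, so it is only a starting point; the example of Section V is solved over a discretized $\pi$-space with mixed prescriptions allowed.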

C. Relation between Infinite and Finite Horizon Problems

The following result highlights the similarities between the fixed-point equation in the infinite horizon and the backwards recursion in the finite horizon.

Lemma 4. Consider the finite horizon game with $G^i \equiv V^i$. Then $V^{i,T}_t \equiv V^i$, $\forall\, i \in \mathcal{N}$, $t \in \{1,\ldots,T\}$, satisfies the backwards recursive construction (2) and (3).

Proof. We use backward induction for this. Consider the finite horizon algorithm at time $t = T$, noting that $V^{i,T}_{T+1} \equiv G^i \equiv V^i$:

$\tilde{\gamma}^{i,T}_T(\cdot \mid x^i_T) \in \arg\max_{\gamma^i_T(\cdot\mid x^i_T) \in \Delta(\mathcal{A}^i)} \mathbb{E}^{\gamma^i_T(\cdot\mid x^i_T),\, \tilde{\gamma}^{-i,T}_T,\, \pi^{-i}_T}\Big[ R^i(X_T, A_T) + \delta\, V^i\big( \big(F(\pi^j_T, \tilde{\gamma}^{j,T}_T, A_T)\big)_{j=1}^{N}, X^i_{T+1} \big) \,\Big|\, \pi_T, x^i_T \Big],$   (27a)

$V^{i,T}_T(\pi_T, x^i_T) = \mathbb{E}^{\tilde{\gamma}^{i,T}_T(\cdot\mid x^i_T),\, \tilde{\gamma}^{-i,T}_T,\, \pi^{-i}_T}\Big[ R^i(X_T, A_T) + \delta\, V^i\big( \big(F(\pi^j_T, \tilde{\gamma}^{j,T}_T, A_T)\big)_{j=1}^{N}, X^i_{T+1} \big) \,\Big|\, \pi_T, x^i_T \Big].$   (27b)

Comparing the above set of equations with (23), we can see that the pair $(V, \tilde{\gamma})$ arising out of (23) satisfies the above. Now assume that $V^{i,T}_n \equiv V^i$ for all $n \in \{t+1,\ldots,T\}$. At time $t$, substituting $V^i$ in place of $V^{i,T}_{t+1}$ in the finite horizon construction from (2), (3) (using the induction hypothesis), we get the same set of equations as (27). Thus $V^{i,T}_t \equiv V^i$ satisfies it. $\blacksquare$

D. Equilibrium Result

Below we state the central result of this paper. It states that the strategy-belief pair $(\beta^\star, \mu^\star)$ constructed from the solution of the fixed-point equation (23) and the forward recursion of (25) and (26) indeed constitutes a PBE.

Theorem 5. Assuming that the fixed-point equation (23) admits an absolutely bounded solution $V^i$ (for all $i \in \mathcal{N}$), the strategy-belief pair $(\beta^\star, \mu^\star)$ defined in (25) and (26) is a PBE of the infinite horizon discounted reward dynamic game, i.e., $\forall\, i \in \mathcal{N}$, $\beta^i$, $t \ge 1$, $h^i_t \in \mathcal{H}^i_t$,

$\mathbb{E}^{\beta^{i,\star},\, \beta^{-i,\star},\, \mu^\star_t[h^c_t]}\Big[ \sum_{n=t}^{\infty} \delta^{n-t} R^i(X_n, A_n) \,\Big|\, h^i_t \Big] \ \ge\ \mathbb{E}^{\beta^i,\, \beta^{-i,\star},\, \mu^\star_t[h^c_t]}\Big[ \sum_{n=t}^{\infty} \delta^{n-t} R^i(X_n, A_n) \,\Big|\, h^i_t \Big].$   (28)

Remark: Note that by the definition in (7), $\mu^\star$ already satisfies the consistency conditions required for a perfect Bayesian equilibrium.

Proof. We divide the proof into two parts: first we show that the value function $V^i$ is at least as large as any reward-to-go function; second, we show that under the strategy $\beta^{i,\star}$ the reward-to-go is $V^i$.

Part 1: For any $i \in \mathcal{N}$ and $\beta^i$, define the following reward-to-go functions

$W^{i,\beta^i}_t(h^i_t) = \mathbb{E}^{\beta^i,\, \beta^{-i,\star},\, \mu^\star_t[h^c_t]}\Big[ \sum_{n=t}^{\infty} \delta^{n-t} R^i(X_n, A_n) \,\Big|\, h^i_t \Big],$   (29a)

$W^{i,\beta^i,T}_t(h^i_t) = \mathbb{E}^{\beta^i,\, \beta^{-i,\star},\, \mu^\star_t[h^c_t]}\Big[ \sum_{n=t}^{T} \delta^{n-t} R^i(X_n, A_n) + \delta^{T+1-t} V^i(\Pi_{T+1}, X^i_{T+1}) \,\Big|\, h^i_t \Big].$   (29b)

Since $\mathcal{X}^i, \mathcal{A}^i$ are finite sets, the reward $R^i$ is absolutely bounded and the reward-to-go $W^{i,\beta^i}_t(h^i_t)$ is finite $\forall\, i, t, \beta^i, h^i_t$.

For any $i \in \mathcal{N}$, $h^i_t \in \mathcal{H}^i_t$,

$V^i\big(\mu^\star_t[h^c_t], x^i_t\big) - W^{i,\beta^i}_t(h^i_t) = \Big( V^i\big(\mu^\star_t[h^c_t], x^i_t\big) - W^{i,\beta^i,T}_t(h^i_t) \Big) + \Big( W^{i,\beta^i,T}_t(h^i_t) - W^{i,\beta^i}_t(h^i_t) \Big).$   (30)

Combining the results from Lemmas 3 and 4, the term in the first bracket on the RHS of (30) is non-negative. Using (29), the term in the second bracket is

$\delta^{T+1-t}\, \mathbb{E}^{\beta^i,\, \beta^{-i,\star},\, \mu^\star_t[h^c_t]}\Big[ -\sum_{n=T+1}^{\infty} \delta^{n-(T+1)} R^i(X_n, A_n) + V^i(\Pi_{T+1}, X^i_{T+1}) \,\Big|\, h^i_t \Big].$   (31)

The summation in the expression above is bounded by a convergent geometric series. Also, $V^i$ is bounded. Hence the above quantity can be made arbitrarily small by choosing $T$ appropriately large. Since the LHS of (30) does not depend on $T$, this gives

$V^i\big(\mu^\star_t[h^c_t], x^i_t\big) \ \ge\ W^{i,\beta^i}_t(h^i_t).$   (32)
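For concreteness, the geometric-series argument behind (31) and (32) can be written out as follows, where $M$ denotes a uniform bound on $|R^i|$ (notation introduced only for this display and not used elsewhere in the paper):

$\Big| \delta^{T+1-t}\, \mathbb{E}\Big[ -\sum_{n=T+1}^{\infty} \delta^{n-(T+1)} R^i(X_n, A_n) + V^i(\Pi_{T+1}, X^i_{T+1}) \,\Big|\, h^i_t \Big] \Big| \ \le\ \delta^{T+1-t}\Big( \sum_{n=T+1}^{\infty} \delta^{n-(T+1)} M + \|V^i\|_\infty \Big) = \delta^{T+1-t}\Big( \frac{M}{1-\delta} + \|V^i\|_\infty \Big) \to 0 \text{ as } T \to \infty.$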

Part 2: Since the strategy $\beta^\star$ generated in (26) is such that $\beta^{i,\star}_t$ depends on $h^i_t$ only through $\mu^\star_t[h^c_t]$ and $x^i_t$, the reward-to-go $W^{i,\beta^{i,\star}}_t$ at strategy $\beta^\star$ can be written (with abuse of notation) as

$W^{i,\beta^{i,\star}}_t(h^i_t) = W^{i,\beta^{i,\star}}_t\big(\mu^\star_t[h^c_t], x^i_t\big) = \mathbb{E}^{\beta^\star,\, \mu^\star_t[h^c_t]}\Big[ \sum_{n=t}^{\infty} \delta^{n-t} R^i(X_n, A_n) \,\Big|\, \mu^\star_t[h^c_t], x^i_t \Big].$   (33)

For any $h^i_t \in \mathcal{H}^i_t$,

$W^{i,\beta^{i,\star}}_t\big(\mu^\star_t[h^c_t], x^i_t\big) = \mathbb{E}^{\beta^\star,\, \mu^\star_t[h^c_t]}\Big[ R^i(X_t, A_t) + \delta\, W^{i,\beta^{i,\star}}_{t+1}\big( \big(F(\mu^{j,\star}_t[h^c_t], \theta^j[\mu^\star_t[h^c_t]], A_t)\big)_{j=1}^{N}, X^i_{t+1} \big) \,\Big|\, \mu^\star_t[h^c_t], x^i_t \Big],$   (34a)

$V^i\big(\mu^\star_t[h^c_t], x^i_t\big) = \mathbb{E}^{\beta^\star,\, \mu^\star_t[h^c_t]}\Big[ R^i(X_t, A_t) + \delta\, V^i\big( \big(F(\mu^{j,\star}_t[h^c_t], \theta^j[\mu^\star_t[h^c_t]], A_t)\big)_{j=1}^{N}, X^i_{t+1} \big) \,\Big|\, \mu^\star_t[h^c_t], x^i_t \Big].$   (34b)

Repeated application of the above for the first $n$ time periods gives

$W^{i,\beta^{i,\star}}_t\big(\mu^\star_t[h^c_t], x^i_t\big) = \mathbb{E}^{\beta^\star,\, \mu^\star_t[h^c_t]}\Big[ \sum_{m=t}^{t+n-1} \delta^{m-t} R^i(X_m, A_m) + \delta^{n}\, W^{i,\beta^{i,\star}}_{t+n}\big(\Pi_{t+n}, X^i_{t+n}\big) \,\Big|\, \mu^\star_t[h^c_t], x^i_t \Big],$   (35a)

$V^i\big(\mu^\star_t[h^c_t], x^i_t\big) = \mathbb{E}^{\beta^\star,\, \mu^\star_t[h^c_t]}\Big[ \sum_{m=t}^{t+n-1} \delta^{m-t} R^i(X_m, A_m) + \delta^{n}\, V^i\big(\Pi_{t+n}, X^i_{t+n}\big) \,\Big|\, \mu^\star_t[h^c_t], x^i_t \Big].$   (35b)

Here $\Pi_{t+n}$ is the $n$-step belief update under the strategy and belief prescribed by $\beta^\star, \mu^\star$. Taking the difference gives

$W^{i,\beta^{i,\star}}_t\big(\mu^\star_t[h^c_t], x^i_t\big) - V^i\big(\mu^\star_t[h^c_t], x^i_t\big) = \delta^{n}\, \mathbb{E}^{\beta^\star,\, \mu^\star_t[h^c_t]}\Big[ W^{i,\beta^{i,\star}}_{t+n}\big(\Pi_{t+n}, X^i_{t+n}\big) - V^i\big(\Pi_{t+n}, X^i_{t+n}\big) \,\Big|\, \mu^\star_t[h^c_t], x^i_t \Big].$   (36)

Taking the absolute value of both sides, then using Jensen's inequality for $f(x) = |x|$, and finally taking the supremum over $h^i_t$, gives

$\sup_{h^i_t} \big| W^{i,\beta^{i,\star}}_t\big(\mu^\star_t[h^c_t], x^i_t\big) - V^i\big(\mu^\star_t[h^c_t], x^i_t\big) \big| \ \le\ \delta^{n} \sup_{h^i_t} \mathbb{E}^{\beta^\star,\, \mu^\star_t[h^c_t]}\Big[ \big| W^{i,\beta^{i,\star}}_{t+n}\big(\Pi_{t+n}, X^i_{t+n}\big) - V^i\big(\Pi_{t+n}, X^i_{t+n}\big) \big| \,\Big|\, \mu^\star_t[h^c_t], x^i_t \Big].$   (37)

Now, using the fact that $W_{t+n}$ and $V^i$ are bounded and that we can choose $n$ arbitrarily large, we get $\sup_{h^i_t} \big| W^{i,\beta^{i,\star}}_t(\mu^\star_t[h^c_t], x^i_t) - V^i(\mu^\star_t[h^c_t], x^i_t) \big| = 0$. $\blacksquare$

V. A CONCRETE EXAMPLE

In this section, we consider an infinite horizon version of the public goods example from [1, Ch. 8, Example 8.3]. We solve the corresponding fixed-point equation (arising out of (23)) numerically to calculate the mapping $\theta$ (which in turn generates the perfect Bayesian equilibrium $(\beta^\star, \mu^\star)$).

The example consists of two symmetric agents. The type space and action sets are $\mathcal{X}^1 = \mathcal{X}^2 = \{x^H, x^L\}$ and $\mathcal{A}^1 = \mathcal{A}^2 = \{0, 1\}$. Each agent's type is static and does not vary with time.

The action represents whether an agent is willing to contribute to a common public good. If at least one agent contributes then both agents receive utility 1, and the agent(s) that contributed incur a cost equal to their type. If no one contributes then both agents receive utility 0. Thus the reward function is

$R^i(x, a) = \begin{cases} 1 - x^i & \text{if } a^i = 1, \\ a^{-i} & \text{if } a^i = 0, \end{cases}$   (38)

where $a^{-i}$ represents the action taken by the agent other than $i$.

We use the values $x^H = 1.2$, $x^L = 0.2$ and consider three values $\delta = 0, 0.5, 0.95$. Since the type sets have two elements, we can represent the distribution $\pi_1(\cdot) \in \Delta(\mathcal{X}^1)$ by $\pi_1(x^H) \in [0,1]$ alone, and similarly for agent 2. For any $\pi = (\pi_1, \pi_2) \in [0,1]^2$, the mapping $\theta[\pi]$ produces $\tilde{\gamma}^i(\cdot \mid x^i)$ for every $i \in \{1,2\}$ and $x^i \in \mathcal{X}^i = \{x^H, x^L\}$. Since the action space contains two elements, we can represent the distribution $\tilde{\gamma}^i(\cdot \mid x^i)$ by $\tilde{\gamma}^i(a^i = 1 \mid x^i)$, i.e., the probability of taking action 1. We solve the fixed-point equation by discretizing the $\pi$-space $[0,1]^2$, and all solutions that we find are symmetric w.r.t. the agents, i.e., $\tilde{\gamma}^1(\cdot \mid x^L)$ for $\pi = (\pi_1, \pi_2)$ is the same as $\tilde{\gamma}^2(\cdot \mid x^L)$ for $\pi' = (\pi_2, \pi_1)$, and similarly for type $x^H$.

For $\delta = 0$, the game is instantaneous and, for the values considered, we have $1 - x^H = -0.2 < 0$. This implies that whenever agent 1's type is $x^H$, it is instantaneously profitable not to contribute. This gives $\tilde{\gamma}^1(a^1 = 1 \mid x^H) = 0$ for all $\pi$, so we only plot $\tilde{\gamma}^1(a^1 = 1 \mid x^L)$; see Fig. 1. For $\delta = 0$ the fixed-point equation (23) involves only the variable $\tilde{\gamma}$ and not $V$, and can be solved analytically; refer to [13, eq. (20) and Fig. 1], where this solution is stated. There are multiple solutions to the fixed-point equation, and our result in Fig. 1 matches one of the solutions in [13].

Intuitively, with type $x^L$ the only case in which agent 1 would not wish to contribute is when he anticipates agent 2's type to be $x^L$ with high probability and relies on agent 2 to contribute. This is why, for lower values of $\pi_2$ (i.e., agent 2's type likely to be $x^L$), we see $\tilde{\gamma}^1(a^1 = 1 \mid x^L) = 0$ in Fig. 1.
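To illustrate the instantaneous ($\delta = 0$) reasoning above, the following small Python sketch encodes the reward (38) and evaluates agent 1's single-round payoffs against an opponent who contributes with total probability $q_2$. The function names and the specific checks are ours.

\begin{verbatim}
xH, xL = 1.2, 0.2

def reward(i, x, a):
    """Public goods reward (38): x = (x^1, x^2) types, a = (a^1, a^2) in {0,1}^2."""
    return 1.0 - x[i] if a[i] == 1 else float(a[1 - i])

def inst_payoff(x1, a1, q2):
    """Agent 1's expected single-round reward when agent 2 contributes w.p. q2."""
    # Only agent 2's action (not his type) enters agent 1's reward in (38).
    return q2 * reward(0, (x1, xH), (a1, 1)) + (1 - q2) * reward(0, (x1, xH), (a1, 0))

# Type xH: contributing yields 1 - 1.2 = -0.2 regardless of q2, while not
# contributing yields q2 >= 0, so "never contribute" is the best response.
print([inst_payoff(xH, 1, q2) for q2 in (0.0, 0.5, 1.0)])   # [-0.2, -0.2, -0.2]
print([inst_payoff(xH, 0, q2) for q2 in (0.0, 0.5, 1.0)])   # [0.0, 0.5, 1.0]

# Type xL: contributing yields 0.8 while not contributing yields q2, so agent 1
# free-rides only if he expects agent 2 to contribute with probability above 0.8.
print(inst_payoff(xL, 1, 0.5), inst_payoff(xL, 0, 0.5))     # 0.8 0.5
\end{verbatim}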
Now consider $\tilde{\gamma}^1(a^1 = 1 \mid x^L)$, plotted in Figs. 1, 2 and 4. As $\delta$ increases, future rewards attain more priority and signaling comes into play. So while taking an action, agents not only look at their instantaneous reward but also at how their action affects the future public belief $\pi$ about their private type. It is evident in the figures that as $\delta$ increases, at high $\pi_1$, agent 1 chooses not to contribute when his type is $x^L$ up to larger values of $\pi_2$. This way he intends to send a "wrong" signal to agent 2, i.e., that his type is $x^H$, and thereby force agent 2 to invest. In this manner agent 1 can free-ride on agent 2's investment.

Now consider Figs. 3 and 5, where $\tilde{\gamma}^1(a^1 = 1 \mid x^H)$ is plotted. Coordination via signaling is evident here. Although it is instantaneously not profitable to contribute if agent 1's type is $x^H$, by contributing at higher values of $\pi_2$ (i.e., when agent 2's type is likely $x^H$) and low $\pi_1$, agent 1 coordinates with agent 2 to achieve a net profit greater than 0 (the reward when no one contributes). This can be done since the loss from contributing is $-0.2$ whereas the profit from free-riding on the other agent's contribution is 1.

Fig. 1. $\tilde{\gamma}^1(a^1 = 1 \mid x^L)$ vs. $(\pi_1, \pi_2)$ at $\delta = 0$.
Fig. 2. $\tilde{\gamma}^1(a^1 = 1 \mid x^L)$ vs. $(\pi_1, \pi_2)$ at $\delta = 0.5$.
Fig. 3. $\tilde{\gamma}^1(a^1 = 1 \mid x^H)$ vs. $(\pi_1, \pi_2)$ at $\delta = 0.5$.
Fig. 4. $\tilde{\gamma}^1(a^1 = 1 \mid x^L)$ vs. $(\pi_1, \pi_2)$ at $\delta = 0.95$.
Fig. 5. $\tilde{\gamma}^1(a^1 = 1 \mid x^H)$ vs. $(\pi_1, \pi_2)$ at $\delta = 0.95$.

Under the equilibrium strategy, the beliefs $\Pi_t$ form a Markov chain. One can trace this Markov chain to study the signaling effect at equilibrium. On numerically simulating this Markov chain for the above example (at $\delta = 0.95$) we observe that, for almost all initial beliefs, within a few rounds the agents completely learn each other's private type truthfully (or at least with very high probability). In other words, agents manage to reveal their private type via their actions at equilibrium, and to such an extent that it negates any possibly incorrect initial belief about their type.

As a measure of cooperative coordination at equilibrium, one can perform the following calculation. Compare the value function $V^1(\cdot, x)$ of agent 1 arising out of the fixed-point equation, for $\delta = 0.95$ and $x \in \{x^H, x^L\}$ (normalized by multiplying with $1 - \delta$ so that it represents a per-round value), with the best possible attainable single-round reward under a symmetric mixed strategy with a) full coordination and b) no coordination. Note that the two cases need not be equilibria themselves, which is why this results in a bound on the efficiency of the evaluated equilibria.

In case a), assuming both agents have the same type $x$, full coordination can lead to the best possible reward of $\frac{1 + 1 - x}{2} = 1 - \frac{x}{2}$, i.e., each agent contributes with probability 0.5, but in a coordinated manner so that the two contributions never overlap.

In case b), when agents do not coordinate and each invests with probability $p$, the expected single-round reward is $p(1-x) + p(1-p)$. The maximum possible value of this expression is $\big(1 - \frac{x}{2}\big)^2$.

For $x = x^L = 0.2$, the range of values of $V^1(\pi_1, \pi_2, x^L)$ over $(\pi_1, \pi_2) \in [0,1]^2$ is $[0.865, 0.894]$, whereas full coordination produces $0.9$ and no coordination $0.81$. It is thus evident that at equilibrium the agents end up achieving a reward close to the best possible, and gain significantly compared to the strategy of no coordination. Similarly, for $x = x^H = 1.2$ the range is $[0.3, 0.395]$, whereas full coordination produces $0.4$ and no coordination $0.16$. The gain via coordination is evident here too.
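For completeness, the maximization behind the no-coordination benchmark is elementary (the derivation below is ours):

$\frac{d}{dp}\big[ p(1-x) + p(1-p) \big] = (1-x) + 1 - 2p = 0 \ \Rightarrow\ p^\star = 1 - \tfrac{x}{2}, \qquad p^\star(1-x) + p^\star(1-p^\star) = \big(1 - \tfrac{x}{2}\big)^2,$

which evaluates to $0.81$ for $x = x^L = 0.2$ and $0.16$ for $x = x^H = 1.2$.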

VI. CONCLUSION

This paper considers the infinite horizon discounted reward dynamic game with private types, i.e., where each agent can only observe their own type. The types evolve as a controlled Markov process and are conditionally independent across agents given the action profile. Asymmetry of information between agents exists in this model, since each agent only knows their own private type.

To date, there exists no universal algorithm for calculating PBE in models with asymmetry of information that decouples, w.r.t. time, the calculation of strategies. Section IV provides a single-shot fixed-point equation for calculating the equilibrium generating function $\theta$, which in conjunction with the forward recursion in (25) and (26) gives a subset of PBEs $(\beta^\star, \mu^\star)$ of this game. The method proposed in this paper finds PBE of a certain type, namely those in which, for any agent $i$, his strategy at equilibrium depends on the public history only through the common belief $\pi_t$ and on his private history only through his current private type $x^i_t$.

Finally, we demonstrate our methodology on a concrete example of a two-agent symmetric public goods game and observe the signaling effect in the agents' strategies at equilibrium as the discount factor is increased. The signaling effect implies that agents take into account how their actions affect future public beliefs $\pi$ about their private type.

One important direction for future work is the characterization of games for which the proposed SPBE exists. This boils down to the existence of a solution to the fixed-point equation in Section IV, as any SPBE must satisfy this equation.

REFERENCES

[1] D. Fudenberg and J. Tirole, Game Theory. MIT Press, 1991.
[2] T. Başar and G. J. Olsder, Dynamic Noncooperative Game Theory. SIAM, 1999, vol. 23.
[3] G. J. Mailath and L. Samuelson, Repeated Games and Reputations: Long-Run Relationships. Oxford University Press, 2006.
[4] V. Krivan and G. Zaccour, Advances in Dynamic Games. Springer, 2014.
[5] D. Vasal and A. Anastasopoulos, "Stochastic control of relay channels with cooperative and strategic users," IEEE Transactions on Communications, vol. 62, no. 10, pp. 3434–3446, Oct. 2014.
[6] S. Bikhchandani, D. Hirshleifer, and I. Welch, "A theory of fads, fashion, custom, and cultural change as informational cascades," Journal of Political Economy, pp. 992–1026, 1992.
[7] D. Acemoglu, M. A. Dahleh, I. Lobel, and A. Ozdaglar, "Bayesian learning in social networks," The Review of Economic Studies, vol. 78, no. 4, pp. 1201–1236, 2011.
[8] T. N. Le, V. G. Subramanian, and R. A. Berry, "The impact of observation and action errors on informational cascades," in 53rd IEEE Conference on Decision and Control. IEEE, 2014, pp. 1917–1922.
[9] D. Vasal and A. Anastasopoulos, "Decentralized Bayesian learning in dynamic games," submitted to IEEE Allerton Conference, 2016. [Online]. Available: http://web.eecs.umich.edu/~anastas/docs/NetEcon16.pdf
[10] A. Nayyar, A. Mahajan, and D. Teneketzis, "Decentralized stochastic control with partial history sharing: A common information approach," IEEE Transactions on Automatic Control, vol. 58, no. 7, pp. 1644–1658, July 2013.
[11] A. Gupta, A. Nayyar, C. Langbort, and T. Başar, "Common information based Markov perfect equilibria for linear-Gaussian games with asymmetric information," SIAM Journal on Control and Optimization, vol. 52, no. 5, pp. 3228–3260, 2014.
[12] A. Nayyar, A. Gupta, C. Langbort, and T. Başar, "Common information based Markov perfect equilibria for stochastic games with asymmetric information: Finite games," IEEE Transactions on Automatic Control, vol. 59, no. 3, pp. 555–570, March 2014.
[13] D. Vasal and A. Anastasopoulos, "A systematic process for evaluating structured perfect Bayesian equilibria in dynamic games with asymmetric information," in American Control Conference, July 2016. [Online]. Available: http://arxiv.org/abs/1508.06269
[14] D. Vasal and A. Anastasopoulos, "Signaling equilibria for dynamic LQG games with asymmetric information," submitted to IEEE CDC, 2016. [Online]. Available: http://www-personal.umich.edu/~dvasal/papers/cdc16.pdf
[15] L. Li and J. Shamma, "LP formulation of asymmetric zero-sum stochastic games," in 53rd IEEE Conference on Decision and Control, Dec. 2014, pp. 1930–1935.
[16] H. L. Cole and N. Kocherlakota, "Dynamic games with hidden actions and hidden states," Journal of Economic Theory, vol. 98, no. 1, pp. 114–126, 2001.