From Behavior to Sparse Graphical Games: Efficient Recovery of Equilibria

Asish Ghoshal and Jean Honorio
Department of Computer Science, Purdue University
West Lafayette, IN 47907
Email: [email protected], [email protected]

Abstract—In this paper we study the problem of exact recovery of the pure-strategy Nash equilibria (PSNE) set of a graphical game from noisy observations of joint actions of the players alone. We consider sparse linear influence games — a parametric class of graphical games with linear payoffs, represented by directed graphs of n nodes (players) with in-degree of at most k. We present an ℓ1-regularized logistic regression based algorithm for recovering the PSNE set exactly that is both computationally efficient, i.e. it runs in polynomial time, and statistically efficient, i.e. it has logarithmic sample complexity. Specifically, we show that the sufficient number of samples required for exact PSNE recovery scales as O(poly(k) log n). We also validate our theoretical results using synthetic experiments.

I. INTRODUCTION AND RELATED WORK

Non-cooperative game theory is widely regarded as an appropriate mathematical framework for studying strategic behavior in multi-agent scenarios. The concept of Nash equilibrium describes the stable outcome of the overall behavior of self-interested agents — for instance people, companies, governments, groups or autonomous systems — interacting strategically with each other in distributed settings.

Over the past few years, considerable progress has been made in analyzing behavioral data using game-theoretic tools, e.g. computing Nash equilibria [1], [2], [3], the most influential agents [4], the price of anarchy [5] and related concepts in the context of graphical games. In political science, for instance, Irfan and Ortiz [4] identified, from congressional voting records, the most influential senators in the U.S. congress — a small set of senators whose collective behavior forces every other senator to a unique choice of vote. Irfan and Ortiz [4] also observed that the most influential senators were strikingly similar to the "gang-of-six" senators formed during the national debt ceiling negotiations of 2011. Further, using graphical games, Honorio and Ortiz [6] showed that Obama's influence on Republicans increased in the last sessions before candidacy, while McCain's influence on Republicans decreased.

The problems in game theory described above, i.e. computing the Nash equilibria, computing the price of anarchy or finding the most influential agents, require a known graphical game, which is not available a priori in real-world settings. Therefore, Honorio and Ortiz [6] proposed learning graphical games from behavioral data, using maximum likelihood estimation (MLE) and sparsity-promoting methods. Honorio and Ortiz [6] and Irfan and Ortiz [4] have also demonstrated the usefulness of learning sparse graphical games from behavioral data in real-world settings, through their analysis of the voting records of the U.S. congress as well as the U.S. supreme court.

In this paper, we analyze a particular method proposed by Honorio and Ortiz [6] for learning sparse linear influence games, namely, using logistic regression to learn the neighborhood of each player in the graphical game independently. Honorio and Ortiz [6] showed that the method of independent logistic regression is likelihood consistent, i.e. in the infinite-sample limit the likelihood estimate converges to the true achievable likelihood. In this paper we obtain the stronger guarantee of recovering the true PSNE set exactly. Needless to say, our stronger guarantee comes with additional conditions that the true game must satisfy in order to allow for exact PSNE recovery. The most crucial among our conditions is the assumption that the minimum payoff, across all players and all joint actions in the true PSNE set, is strictly positive. We show, through simulation experiments, that our assumptions indeed bear out for large classes of graphical games, where we are able to exactly recover the PSNE set.

Finally, we would like to draw the attention of the reader to the fact that ℓ1-regularized logistic regression has been analyzed by Ravikumar et al. [7] in the context of learning sparse Ising models. Apart from technical differences and differences in proof techniques, our analysis of ℓ1-penalized logistic regression for learning sparse graphical games differs from Ravikumar et al. [7] conceptually, in the sense that we are not interested in recovering the edges of the true game graph, but only the PSNE set. Therefore, we are able to avoid some stronger and non-intuitive conditions required by Ravikumar et al. [7], such as mutual incoherence.

The rest of the paper is organized as follows. In Section II we provide a brief overview of graphical games and, specifically, linear influence games. We then formalize the problem of learning linear influence games in Section III. In Section IV we present our main method and associated theoretical results. Then, in Section V we provide experimental validation of our theoretical guarantees. Finally, in Section VI we conclude by discussing avenues for future work.
II. PRELIMINARIES

In this section we review key concepts behind graphical games, introduced by Kearns et al. [8], and linear influence games (LIGs), introduced by Irfan and Ortiz [4] and Honorio and Ortiz [6].

A. Graphical Games

A normal-form game G in classical game theory is defined by the triple G = (V, A, U) of players, actions and payoffs. V is the set of players, given by V = {1, . . . , n} if there are n players. A is the set of actions or pure strategies, given by the Cartesian product A = ×_{i∈V} A_i, where A_i is the set of pure strategies of the i-th player. Finally, U = {u_i}_{i=1}^n is the set of payoffs, where u_i : A_i × (×_{j∈V\{i}} A_j) → R specifies the payoff for the i-th player given its action and the joint actions of all the remaining players.

A key solution concept in non-cooperative game theory is the Nash equilibrium. For a non-cooperative game, a joint action x* ∈ A is a pure-strategy Nash equilibrium (PSNE) if, for each player i, x*_i ∈ argmax_{x_i ∈ A_i} u_i(x_i, x*_{−i}), where x*_{−i} = {x*_j | j ≠ i}. In other words, x* constitutes the mutual best response for all players, and no player has any incentive to unilaterally deviate from its optimal action x*_i given the joint actions of the remaining players x*_{−i}. The set of all pure-strategy Nash equilibria of a game G is defined as follows:

    NE(G) = { x : (∀ i ∈ V) x_i ∈ argmax_{x_i ∈ A_i} u_i(x_i, x_{−i}) }.    (1)

Graphical games, introduced by Kearns et al. [8], extend the formalism of graphical models to games. That is, a graphical game G is defined by a directed graph G = (V, E) of vertices and directed edges (arcs), where vertices correspond to players and arcs encode "influence" among players, i.e. the payoff of the i-th player only depends on the actions of its (incoming) neighbors.

B. Linear Influence Games

Linear influence games (LIGs), introduced by Irfan and Ortiz [4] and Honorio and Ortiz [6], are graphical games with binary actions, or pure strategies, and parametric (linear) payoff functions. We assume, without loss of generality, that the joint action space is A = {−1, +1}^n. A linear influence game between n players, G(n) = (W, b), is characterized by (i) a matrix of weights W ∈ R^{n×n}, where the entry w_ij indicates the amount of (signed) influence that the j-th player has on the i-th player, and (ii) a bias vector b ∈ R^n, where b_i captures the prior preference of the i-th player for a particular action x_i ∈ {−1, +1}. The payoff of the i-th player given the actions of the remaining players is then given as u_i(x_i, x_{−i}) = x_i (w_{−i}^T x_{−i} − b_i), and the PSNE set is defined as follows:

    NE(G(n)) = { x : (∀ i) x_i (w_{−i}^T x_{−i} − b_i) ≥ 0 },    (2)

where w_{−i} denotes the i-th row of W without the i-th entry, i.e. w_{−i} = {w_ij | j ≠ i}. Note that we have diag(W) = 0. For linear influence games G(n), we can define the neighborhood and signed neighborhood of the i-th vertex as N(i) = {j : |w_ij| > 0} and N±(i) = {sign(w_ij) | j ∈ N(i)} respectively. It is important to note that we do not include the outgoing arcs in our definition of the neighborhood of a vertex. Thus, for linear influence games, the weight matrix W and the bias vector b completely specify the game and the PSNE set induced by the game. Finally, let G(n, k) denote sparse games over n players where the in-degree of any vertex is at most k, i.e. for all i, |N(i)| ≤ k.
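To make the PSNE condition in (2) concrete, the following Python sketch enumerates the equilibria of a given LIG by exhaustively checking every joint action. The function name psne_set and the representation of joint actions as ±1 vectors are our own illustrative choices; the 2^n enumeration is practical only for the small games used in our experiments (Section V).

import itertools
import numpy as np

def psne_set(W, b):
    """Enumerate the PSNE set of an LIG (W, b) per Eq. (2):
    x is a PSNE iff x_i * (w_{-i}^T x_{-i} - b_i) >= 0 for every player i."""
    n = len(b)
    equilibria = []
    for x in itertools.product([-1, 1], repeat=n):
        x = np.array(x)
        # payoff of player i is x_i * ((W @ x)_i - b_i); diag(W) = 0,
        # so (W @ x)_i equals the sum over the neighbors j != i
        payoffs = x * (W @ x - b)
        if np.all(payoffs >= 0):
            equilibria.append(x)
    return equilibria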
III. PROBLEM FORMULATION

Having introduced the necessary definitions, we now introduce the problem of learning sparse linear influence games from observations of joint actions only. We assume that there exists a game G*(n, k) = (W*, b*) from which a "noisy" data set D = {x^(l)}_{l=1}^m of m observations is generated, where each observation x^(l) is sampled independently and identically from the following distribution:

    p(x) = q · 1[x ∈ NE*] / |NE*| + (1 − q) · 1[x ∉ NE*] / (2^n − |NE*|).    (3)

In the above distribution, q is the probability of observing a data point from the set of Nash equilibria and can be thought of as the "signal" level in the data set, while 1 − q can be thought of as the "noise" level in the data set. We use NE* instead of NE(G*(n, k)) to simplify notation. We further assume that the game is non-trivial^1, i.e. |NE*| ∈ {1, . . . , 2^n − 1}, and that q ∈ (|NE*|/2^n, 1). The latter assumption ensures that the signal level in the data set is more than the noise level^2. We define the equality of two games G*(n, k) = (W*, b*) and Ĝ(n, k) = (Ŵ, b̂) as follows:

    G*(n, k) = Ĝ(n, k) iff (∀ i) N*±(i) = N̂±(i) ∧ sign(b*_i) = sign(b̂_i) ∧ NE* = N̂E.

A natural question to ask then is: given only the data set D and no other information, is it possible to recover a weight matrix Ŵ and a bias vector b̂ such that Ĝ(n, k) = G*(n, k)? Honorio and Ortiz [6] showed that it is in general impossible to learn the true game G*(n, k) from observations of joint actions only, because multiple weight matrices W and bias vectors b can induce the same PSNE set and therefore have the same likelihood under the observation model (3) — an issue known as non-identifiability in the statistics literature. It is, however, possible to learn the equivalence class of games that induce the same PSNE set. We define the equivalence of two games G*(n, k) and Ĝ(n, k) simply as:

    G*(n, k) ≡ Ĝ(n, k) iff NE* = N̂E.

Therefore, our goal in this paper is efficient and consistent recovery of the pure-strategy Nash equilibria (PSNE) set from observations of joint actions only; i.e. given a data set D drawn from some game G*(n, k) according to (3), we infer a game Ĝ(n, k) from D such that Ĝ(n, k) ≡ G*(n, k).

^1 This comes from Definition 4 in [6].
^2 See Proposition 5 and Definition 7 in [6] for a justification of this.
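The generative model (3) can be simulated directly, which we use later to validate the theory. Below is a minimal sketch under our own naming conventions; the explicit construction of the complement of the PSNE set is again exponential in n, so this is only meant for small games.

import itertools
import numpy as np

def sample_joint_actions(psne, n, q, m, rng=None):
    """Draw m joint actions i.i.d. from the mixture in Eq. (3): with
    probability q a uniform draw from the PSNE set, otherwise a uniform
    draw from its complement in {-1,+1}^n."""
    rng = rng or np.random.default_rng(0)
    psne_arr = np.array(psne)
    psne_keys = {tuple(x) for x in psne}
    complement = np.array([x for x in itertools.product([-1, 1], repeat=n)
                           if x not in psne_keys])
    samples = np.empty((m, n), dtype=int)
    for l in range(m):
        pool = psne_arr if rng.random() < q else complement
        samples[l] = pool[rng.integers(len(pool))]
    return samples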
IV. METHOD AND RESULTS

Our main method for learning the structure of a sparse LIG, G*(n, k), is based on using ℓ1-regularized logistic regression to learn the parameters (w_{−i}, b_i) of each player i independently. We denote by v_i(W, b) = (w_{−i}, −b_i) the parameter vector for the i-th player, which characterizes its payoff, and by z_i(x) = (x_i x_{−i}, x_i) the "feature" vector. In the rest of the paper we use v_i and z_i instead of v_i(W, b) and z_i(x) respectively, to simplify notation. Then, we learn the parameters for the i-th player as follows:

    v̂_i = argmin_{v_i} ℓ(v_i, D) + λ ||v_i||_1,    (4)

    ℓ(v_i, D) = (1/m) Σ_{l=1}^m log(1 + exp(−v_i^T z_i^(l))).    (5)

We then set ŵ_{−i} = [v̂_i]_{1:(n−1)} and b̂_i = −[v̂_i]_n, where the notation [.]_{i:j} denotes indices i to j of the vector. We show that, under suitable assumptions on the true game G*(n, k) = (W*, b*), the parameters Ŵ and b̂ obtained using (4) induce the same PSNE set as the true game, i.e. NE(W*, b*) = NE(Ŵ, b̂).
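The estimator (4)–(5) decomposes into n independent ℓ1-penalized logistic regressions: since z_i = (x_i x_{−i}, x_i), the loss log(1 + exp(−v_i^T z_i)) is exactly the logistic loss of predicting the label x_i from the features (x_{−i}, 1). A sketch using scikit-learn follows; the mapping C = 1/(mλ) between our penalty and scikit-learn's parameterization, and the trick of appending a constant column so that the bias is penalized together with the weights, are our own implementation choices rather than part of the method's specification.

import numpy as np
from sklearn.linear_model import LogisticRegression

def learn_game(X, lam):
    """Estimate (W_hat, b_hat) by solving Eq. (4)-(5) one player at a time.
    X is the (m, n) array of observed joint actions in {-1,+1}. Assumes both
    actions of every player appear in the data (true w.h.p. under model (3))."""
    m, n = X.shape
    W_hat = np.zeros((n, n))
    b_hat = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # constant column, so the bias entry -b_i is penalized together
        # with the weights, matching the penalty on the full v_i in Eq. (4)
        feats = np.column_stack([X[:, others], np.ones(m)])
        # liblinear minimizes ||v||_1 + C * sum_l log(1 + exp(-y_l v^T z_l));
        # with labels y_l = x_i and C = 1/(m*lam) this matches Eq. (4)-(5)
        clf = LogisticRegression(penalty="l1", C=1.0 / (m * lam),
                                 fit_intercept=False, solver="liblinear")
        clf.fit(feats, X[:, i])
        v = clf.coef_.ravel()
        W_hat[i, others] = v[:-1]
        b_hat[i] = -v[-1]
    return W_hat, b_hat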
Before presenting our main results, however, we state and discuss the assumptions on the true game under which it is possible to recover the true PSNE set using our proposed method.

A. Assumptions

The success of our method hinges on certain assumptions on the structure of the underlying game. These assumptions are in addition to the assumptions imposed on the parameter q and the number of Nash equilibria |NE*|, as required by the definition of an LIG. Since the assumptions are related to the Hessian of the loss function (5), we take a moment to introduce the expressions of the gradient and the Hessian here. The gradient and Hessian of the loss function for any vector v and data set D are given as follows:

    ∇ℓ(v, D) = (1/m) Σ_{l=1}^m [ −z^(l) / (1 + exp(v^T z^(l))) ],    (6)

    ∇²ℓ(v, D) = (1/m) Σ_{l=1}^m η(v^T z^(l)) z^(l) (z^(l))^T,    (7)

where η(x) = 1/(e^{x/2} + e^{−x/2})². Finally, H^m_i denotes the sample Hessian matrix with respect to the i-th player and the true parameter v*_i, and H*_i denotes its expected value, i.e. H*_i = E_D[H^m_i] = E_D[∇²ℓ(v*_i, D)]. In subsequent sections we drop the notational dependence of H_i and z_i on i to simplify notation. However, it should be noted that the assumptions are with respect to each player and must hold individually for each player i. Now we are ready to state and describe the implications of our assumptions.

The following assumption ensures that the expected loss is strongly convex and smooth.

Assumption 1. Let S be the support of the vector v*, i.e. S = {j : |v*_j| > 0}. Then, there exist constants C_min > 0 and D_max ≤ |S| such that

    λ_min(H*_SS) ≥ C_min and λ_max(E_x[z_S z_S^T]) ≤ D_max,

for all i, where λ_min(.) and λ_max(.) denote the minimum and maximum eigenvalues respectively and H*_SS = {H*_jk | j, k ∈ S}.

To better understand the implications of the above assumption, first note that the distribution over joint actions in (3), after some algebraic manipulations, can be written as follows:

    p(x) = [(q − |NE*|/2^n) / (1 − |NE*|/2^n)] · 1[x ∈ NE*]/|NE*| + [(1 − q) / (1 − |NE*|/2^n)] · 1/2^n.    (8)

Thus, the distribution over joint actions is a mixture of two uniform distributions: one over the set of Nash equilibria, and the other the Rademacher distribution over n variables. We denote the latter distribution by R^n, i.e. x ∼ R^n ⟹ x ∈ {−1, +1}^n ∧ p(x) = 1/2^n for all x. Therefore, the expected Hessian matrix H* decomposes as a convex combination of two other Hessian matrices:

    H* = ν H^{NE*} + (1 − ν) H^R,    (9)

where we have defined

    H^{NE*} = (1/|NE*|) Σ_{x∈NE*} η(v*^T z) z z^T,
    H^R = E_{z∼R^n}[ η(v*^T z) z z^T ],
    ν = (q − |NE*|/2^n) / (1 − |NE*|/2^n).

Now, by concavity of λ_min(.) and Jensen's inequality, we have that

    λ_min(H*_SS) ≥ ν λ_min(H^{NE*}_SS) + (1 − ν) λ_min(H^R_SS)
                ≥ ν λ_min(H^{NE*}_SS) + (1 − ν) η(||v*||_1)
                ≥ (1 − ν) η(||v*||_1) > 0,

where the last line follows from the fact that H^{NE*}_SS is positive semi-definite. In fact, λ_min(H^{NE*}_SS) = 0 if |NE*| < n. Thus, our assumption that q ∈ (|NE*|/2^n, 1) automatically implies that λ_min(H*_SS) ≥ C_min > 0. We can also verify that the maximum eigenvalue is bounded as follows:

    D_max ≤ ν λ_max((1/|NE*|) Σ_{x∈NE*} z_S z_S^T) + (1 − ν) λ_max(E_{z∼R^n}[z_S z_S^T])
          ≤ ν |S| + (1 − ν).

The following assumption characterizes the minimum payoff in the Nash equilibria set.

Assumption 2. The minimum payoff ρ_min over the PSNE set is strictly positive; specifically,

    x_i (w*_{−i}^T x_{−i} − b*_i) ≥ ρ_min > 5C_min/D_max  (∀ x ∈ NE*).

Note that as long as the minimum payoff is strictly positive, we can scale the parameters (W*, b*) by the constant 5C_min/D_max to satisfy the condition ρ_min > 5C_min/D_max, without changing the PSNE set or the likelihood of the data. Indeed, the assumption that the minimum payoff is strictly positive is unavoidable for exact recovery of the PSNE set in a noisy setting such as ours, because otherwise the task is akin to exactly recovering the parameters v_i of each player. For example, if x ∈ NE* is such that v*^T z = 0, then it can be shown that even if ||v* − v̂||_∞ = ε, for ε arbitrarily close to 0, we can have v̂^T z < 0 and therefore NE(W*, b*) ≠ NE(Ŵ, b̂). Next, we present our main theoretical results for learning LIGs.

B. Theoretical Guarantees

Our main strategy for obtaining exact PSNE recovery guarantees is to first show, using results from random matrix theory, that given the assumptions on the eigenvalues of the population Hessian matrices, the corresponding finite-sample statements hold with high probability. Then, we exploit the convexity properties of the logistic loss function to show that the weight vectors learned using penalized logistic regression are "close" to the true weight vectors. Finally, by our assumption that the minimum payoff in the PSNE set is strictly greater than zero, we show that the weight vectors inferred from a finite sample of joint actions induce the same PSNE set as the true weight vectors.
1) Minimum and Maximum Eigenvalues of Finite-Sample Hessian and Scatter Matrices: The following technical lemma shows that the assumptions on the eigenvalues of the Hessian matrices hold with high probability in the finite-sample case.

Lemma 1. If λ_min(H*_SS) ≥ C_min and λ_max(E_x[z_S z_S^T]) ≤ D_max, then we have that

    λ_min(H^m_SS) ≥ C_min/2 and λ_max((1/m) Σ_{l=1}^m z_S^(l) (z_S^(l))^T) ≤ 2 D_max,

with probability at least 1 − |S| exp(−m C_min/(2|S|)) and 1 − |S| exp(−m(1 − ν)/(4|S|)) respectively.

Proof. Let µ_min = λ_min(H*_SS) and µ_max = λ_max(E_x[z_S z_S^T]). First note that for all z ∈ {−1, +1}^n:

    λ_max(η(v*_S^T z_S) z_S z_S^T) ≤ |S|/4 =: R and λ_max(z_S z_S^T) ≤ |S| =: R′.

Using the matrix Chernoff bounds from Tropp [9] (Theorem 1.1), we have that

    Pr{ λ_min(H^m_SS) ≤ (1 − δ)µ_min } ≤ |S| [ e^{−δ}/(1 − δ)^{1−δ} ]^{m µ_min / R}.

Setting δ = 1/2 we get

    Pr{ λ_min(H^m_SS) ≤ µ_min/2 } ≤ |S| (√(2/e))^{4 m µ_min / |S|} ≤ |S| exp(−m C_min/(2|S|)).

Therefore, we have

    Pr{ λ_min(H^m_SS) > C_min/2 } > 1 − |S| exp(−m C_min/(2|S|)).

Next, we have that

    µ_max = λ_max(E_x[z_S z_S^T]) ≥ λ_min(E_x[z_S z_S^T]) ≥ ν λ_min((1/|NE*|) Σ_{x∈NE*} z_S z_S^T) + (1 − ν) ≥ 1 − ν.

Once again invoking Theorem 1.1 from [9] and setting δ = 1, and writing λ_max for the maximum eigenvalue of the sample scatter matrix (1/m) Σ_l z_S^(l) (z_S^(l))^T, we have that

    Pr{ λ_max ≥ (1 + δ)µ_max } ≤ |S| [ e^δ/(1 + δ)^{1+δ} ]^{m µ_max / R′}
    Pr{ λ_max ≥ 2µ_max } ≤ |S| (e/4)^{m µ_max / |S|} ≤ |S| exp(−m µ_max/(4|S|)) ≤ |S| exp(−m(1 − ν)/(4|S|)).

Therefore, we have that

    Pr{ λ_max < 2 D_max } > 1 − |S| exp(−m(1 − ν)/(4|S|)). ∎
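As a quick numerical sanity check of Lemma 1 (not part of the paper's argument), the sketch below estimates, under the Rademacher component R^n of the mixture and for an arbitrary small support, how often the minimum eigenvalue of a finite-sample Hessian drops below half its population value. All constants here are illustrative.

import itertools
import numpy as np

def eta(t):
    # the weight function appearing in the Hessian, Eq. (7)
    return 1.0 / (np.exp(t / 2) + np.exp(-t / 2)) ** 2

rng = np.random.default_rng(0)
s, m, trials = 4, 500, 200           # support size |S|, sample size, repetitions
v_star = rng.uniform(-1, 1, size=s)  # arbitrary true parameter on the support

# population Hessian H*_SS under z_S ~ R^|S|, computed by exhaustive expectation
zs = np.array(list(itertools.product([-1, 1], repeat=s)))
w = eta(zs @ v_star) / len(zs)
c_min = np.linalg.eigvalsh((zs * w[:, None]).T @ zs)[0]

# how often does lambda_min of the finite-sample Hessian fall below C_min / 2?
below = 0
for _ in range(trials):
    Z = rng.choice([-1, 1], size=(m, s))
    w = eta(Z @ v_star) / m
    below += np.linalg.eigvalsh((Z * w[:, None]).T @ Z)[0] < c_min / 2
print(f"lambda_min(H*_SS) = {c_min:.3f}, empirical failure rate = {below / trials:.3f}")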

  1   |S| exp((−m(1−ν))/4|S|)). (1 − ν) |S| [z ] EzS ∼R ∗ T Ezj ∼R j 1 + exp((vS) zS) Proof. The proof of this lemma follows the general proof structure of Lemma 3 in [7]. First, we reparameterize the ` - νκ X 1 ≤ x x = νκ regularized loss function |N E∗| i j x∈N E∗ f(vS) = `(vS) + λ kvSk Similarly, 1 f  m as the loss function e, which gives the loss at a point that is (∀ j ∈ S ∧ j 6= n) E uj ∆ distance away from the true parameter v∗ as follows: S S ν z X j f(∆ ) = `(v∗ + ∆ ) − `(v∗ ) + λ(kv∗ + ∆ k − kv∗ k ), = ∗ ∗ T + e S S S S S S 1 S 1 |N E | 1 + exp((v ) zS) x∈N E∗ S ∗   where ∆S = vS − vS. Also note that the loss function fe zj ∗ (1 − ν) n is shifted such that the loss at the true parameter vS is 0, Ez∼R ∗ T 1 + exp((vS) zS) i.e. fe(0) = 0. Further, note that the function fe is convex ∗ and is minimized at ∆b S = vS − v , since vS minimizes f. ν X zj b S b ≤ + |N E∗| 1 + exp((v∗ )T z ) Therefore, clearly fe(∆b S) ≤ 0. Thus, if we can show that the ∗ S S x∈N E function fe is strictly positive on the surface of a ball of radius ∗ b, then the point ∆b S lies inside the ball i.e. kvS − v k ≤ b. (1 − ν) [z ] b S 2 Ezj ∼R j f Using the Taylor’s theorem we expand the first term of e to get the following: ≤ νκ. ∗ T T 2 ∗ fe(∆S) = ∇`(v ) ∆S + ∆ ∇ `(v + θ∆S)∆S Following the same procedure as above, it can be easily shown S S S + λ(kv∗ + ∆ k − kv∗ k ), that the above bounds hold for the case j = n as well. Also S S 1 S 1 (10) m note that uj ≤ 1. Therefore, by using the Hoeffding’s in- for some θ ∈ [0, 1]. Next, we lower bound each of the terms equality [10] and a union bound argument we have that: in (10). Using the Cauchy-Schwartz inequality, the first term

  2 in (10) is bounded as follows: n m  m −mt /2 Pr max uj − uj < t > 1 − 2ne j=1 E ∗ T ∗ ∇`(vS) ∆S ≥ − k∇`(vS)k∞ k∆Sk1 2 =⇒ Pr {kum − [um]k < t} > 1 − 2ne−mt /2 ∗ p E ∞ ≥ − k∇`(vS)k∞ |S| k∆Sk2 2 m m −mt /2 r ! =⇒ Pr {ku k∞ − kE [u ]k∞ < t} > 1 − 2ne p 2 2n 2 ≥ −b |S| νκ + log , (11) m −mt /2 m δ =⇒ Pr {ku k∞ < νκ + t} > 1 − 2ne .

2 Setting 2n exp(−mt /2) = δ, we prove our claim. with probability at least 1 − δ for δ ∈ [0, 1]. It is also easy to upper bound the last term in equation 10, using the reverse A consequence of Lemma 2 is that, even with an infinite triangle inequality as follows: number of samples, the gradient of the loss function at the ∗ ∗ ∗ true vector v doesn’t vanish. Therefore, we cannot hope to λ |kvS + ∆Sk1 − kvSk1| ≤ λ k∆Sk1 . recover the parameters of the true game perfectly even with an Which then implies the following lower bound: infinite number of samples. In the following technical lemma ∗ ∗ we show that the optimal vector vb for the logistic regression λ(kvS + ∆Sk1 − kvSk1) ≥ −λ k∆Sk1 ∗ problem is close to the true vector v in the support set S of ≥ −λp|S| k∆ k ∗ S 2 v . Next, in Lemma 4, we bound the difference between the p ∗ = −λ |S|b. (12) true vector v and the optimal vector vb in the non-support set. 0 Now we turn our attention to computing a lower bound of the where in the second line we used the fact√ that η (.) < 1/10 second term of (10), which is a bit more involved. and in the last line we assumed that (b |S|Dmax)/5 ≤ Cmin/4 — an assumption that we verify momentarily. Having upper ∆T ∇2`(v∗ + θ∆ )∆ ≥ min ∆T ∇2`(v∗ + θ∆ )∆ S S S S S S S S bounded the spectral norm of A(θ), we have k∆S k2=b 2 2 ∗ = b λ (∇ `(v + θ∆ )). Cmin min S S λ ∇2`(v∗ + θ∆ ) ≥ . (13) min S S 4 Now, Plugging back the bounds given by (11), (12) and (13) in (10) 2 ∗ λmin(∇ `(vS + θ∆S)) and equating to zero we get 2 ∗  r ! ≥ min λmin ∇ `(vS + θ∆S) 2 2n b2C θ∈[0,1] −bp|S| νκ + log + min − λp|S|b = 0 m ! m δ 4 1 X ∗ T (l) (l) (l) T = min λmin η((vS + θ∆S) zS )zS (zS ) p r ! θ∈[0,1] m 4 |S| 2 2n l=1 =⇒ b = λ + νκ + log . Cmin m δ Again, using the Taylor’s theorem to expand the function η we get Finally, coming back to our prior assumption we have ! ∗ T (l) 4p|S| r 2 2n 5C η((vS + θ∆S) z ) min S b = λ + νκ + log ≤ p . ∗ T (l) 0 ∗ ¯ T (l) T (l) Cmin m δ 4 |S|Dmax = η((vS) zS ) + η ((vS + θ∆S) zS )(θ∆S) zS The above assumption holds if the regularization parameter λ , where θ¯ ∈ [0, θ]. Continuing from above and from Lemma 1 is bounded as follows: we have, with probability at least 1 − |S| exp((−mCmin)/2|S|): 5C2 r 2 2n 2 ∗  λ ≤ min − log − νκ. λmin ∇ `(vS + θ∆S) 16 |S| Dmax m δ m 1 X ∗ T (l) (l) (l) T ≥ min λmin η((vS) zS )zS (zS ) θ∈[0,1] m l=1 Lemma 4. If the regularization parameter λ satisfies the fol- m ! 1 X (l) (l) (l) (l) lowing condition: + η0((v∗ + θ¯∆ )T z )((θ∆ )T z )z (z )T m S S S S S S S r l=1 2 2n m λ ≥ νκ + log , ≥ λmin(HSS) − max |||A(θ)|||2 m δ θ∈[0,1] then we have that Cmin ≥ − max |||A(θ)|||2, ∗ 5Cmin 2 θ∈[0,1] kvb − v k1 ≤ Dmax where we have defined with probability at least 1 − (δ + |S| exp((−mCmin)/2|S|) + m (−m(1−ν)) . 1 X 0 ∗ T (l) T (l) (l) (l) T |S| exp( /4|S|)). A(θ) = η ((vS + θ∆S) zS )(θ∆S) zS zS (zS ) . m . ∗ l=1 Proof. Define ∆ = vb − v . Also for any vector y let the y y Next, the spectral norm of A(θ) can be bounded as follows: notation S denote the vector with the entries not in the support, S, set to zero, i.e. |||A(θ)|||  2 yi if i ∈ S, ( m [y ] = 1 S i 0 otherwise. X 0 ∗ T (l) T (l) ≤ max η ((vS + θ∆S) zS ) ((θ∆S) zS ) kyk =1 m 2 l=1 Similarly, let the notation ySc denote the vector y with the ) entries not in Sc set to zero, where Sc is the complement of T (l) (l) T × y (zS (zS ) )y S. Having introduced our notation and since, S is the support of the true vector v∗, we have by definition that v∗ = v∗ . 
Lemma 4. If the regularization parameter λ satisfies the condition

    λ ≥ νκ + √((2/m) log(2n/δ)),

then we have that

    ||v̂ − v*||_1 ≤ 5C_min/D_max,

with probability at least 1 − (δ + |S| exp(−m C_min/(2|S|)) + |S| exp(−m(1 − ν)/(4|S|))).

Proof. Define ∆ = v̂ − v*. For any vector y, let y_S denote the vector with the entries not in the support S set to zero, i.e.

    [y_S]_i = y_i if i ∈ S, and 0 otherwise.

Similarly, let y_{S^c} denote the vector y with the entries not in S^c set to zero, where S^c is the complement of S. Having introduced our notation, and since S is the support of the true vector v*, we have by definition that v* = v*_S. We then have, using the reverse triangle inequality,

    ||v̂||_1 = ||v* + ∆||_1 = ||v*_S + ∆_S + ∆_{S^c}||_1
            = ||v*_S − (−∆_S)||_1 + ||∆_{S^c}||_1
            ≥ ||v*||_1 − ||∆_S||_1 + ||∆_{S^c}||_1.    (14)

Also, from the optimality of v̂ for the ℓ1-regularized problem, we have that

    ℓ(v*) + λ||v*||_1 ≥ ℓ(v̂) + λ||v̂||_1
    ⟹ λ(||v*||_1 − ||v̂||_1) ≥ ℓ(v̂) − ℓ(v*).    (15)

Next, from the convexity of ℓ(.) and using the Cauchy-Schwartz inequality, we have that

    ℓ(v̂) − ℓ(v*) ≥ ∇ℓ(v*)^T (v̂ − v*)
                 ≥ −||∇ℓ(v*)||_∞ ||∆||_1
                 ≥ −(λ/2) ||∆||_1,    (16)

where in the last line we used the fact that λ ≥ ||∇ℓ(v*)||_∞, which holds with probability at least 1 − δ by Lemma 2 and the assumed lower bound on λ. Thus, we have from (14), (15) and (16) that

    (1/2)||∆||_1 ≥ ||v̂||_1 − ||v*||_1
    ⟹ (1/2)||∆||_1 ≥ ||∆_{S^c}||_1 − ||∆_S||_1
    ⟹ (1/2)(||∆_{S^c}||_1 + ||∆_S||_1) ≥ ||∆_{S^c}||_1 − ||∆_S||_1
    ⟹ 3||∆_S||_1 ≥ ||∆_{S^c}||_1.    (17)

Finally, from (17) and Lemma 3 we have that

    ||∆||_1 = ||∆_S||_1 + ||∆_{S^c}||_1 ≤ 4||∆_S||_1 ≤ 4√|S| ||∆_S||_2 ≤ 5C_min/D_max. ∎

Now we are ready to present our main result on recovering the true PSNE set.

Theorem 1. If for all i, |S_i| ≤ k, the minimum payoff satisfies ρ_min ≥ 5C_min/D_max, and the regularization parameter and the number of samples satisfy the conditions

    νκ + √((2/m) log(6n²/δ)) ≤ λ ≤ 2K + νκ − √((2/m) log(6n²/δ)),    (18)

    m ≥ max{ (2/K²) log(6n²/δ), (2k/C_min) log(3kn/δ), (4k/(1 − ν)) log(3kn/δ) },    (19)

where K = 5C²_min/(32k D_max) − νκ, then with probability at least 1 − δ, for δ ∈ [0, 1], we recover the true PSNE set, i.e. NE(Ŵ, b̂) = NE(W*, b*).

Proof. From the Cauchy-Schwartz inequality and Lemma 4 we have

    |(v̂_i − v*_i)^T z_i| ≤ ||v̂_i − v*_i||_1 ||z_i||_∞ ≤ 5C_min/D_max.

Therefore, we have that

    (v*_i)^T z_i − 5C_min/D_max ≤ v̂_i^T z_i ≤ (v*_i)^T z_i + 5C_min/D_max.

Now, if for all x ∈ NE* we have (v*_i)^T z_i ≥ 5C_min/D_max, then v̂_i^T z_i ≥ 0. Using a union bound argument over all players i, we can show that the above holds, for all players, with probability at least

    1 − n(δ + k exp(−m C_min/(2k)) + k exp(−m(1 − ν)/(4k))).    (20)

Therefore, we have that NE(Ŵ, b̂) = NE* with high probability. Finally, setting δ = δ′/(3n) for some δ′ ∈ [0, 1], and ensuring that the last two terms in (20) are at most δ′/(3n) each, we prove our claim. ∎

To better understand the implications of the theorem above, we discuss some possible operating regimes for learning sparse linear influence games in the following paragraphs.

Remark 1 (Sample complexity for fixed q). In the theorem above, if q is constant, which in turn makes ν constant, then K = Ω(1/k²), and the sample complexity of learning sparse linear games grows as O(k⁴ log n). However, if q is small enough that ν ≤ 1/k, then the constant D_max is no longer a function of k and hence K = Ω(1/k). Therefore, the sample complexity scales as O(k² log n) for exact PSNE recovery.

The sample complexity of O(poly(k) log n) for exact recovery of the PSNE set can be compared with the sample complexity of O(kn³ log² n) for the maximum likelihood estimate (MLE) of the PSNE set obtained by Honorio [11]. Note that while the MLE procedure is consistent, i.e. the MLE of the PSNE set equals the true PSNE set with probability converging to 1 as the number of samples tends to infinity, it is NP-hard³. In contrast, the logistic regression method is computationally efficient. Further, while the sample complexity of our method seems better than that of the empirical log-likelihood minimizer given by Theorem 3 in [11], in this paper we restrict ourselves to LIGs with strictly positive payoffs in the PSNE set — such games are invariably easier to learn than the general LIGs considered by [11].

Honorio [11] also obtained lower bounds on the number of samples required by any conceivable method for exact PSNE recovery, by scaling the parameter q with the number of players. Therefore, in order to compare our results with the information-theoretic limits of learning LIGs, we consider the regime where the parameter q scales with the number of players n.

Remark 2 (Sample complexity for q varying with the number of players). Consider the regime where the signal level scales as q = (|NE*| + 1)/2^n. Then D_max = O(k/(2^n − |NE*|)), and as a result K = Ω(2^n/k²). Therefore, the sample complexity, which is dominated by the second and third terms in (19), is O(k log(kn)). In general, if q = Θ(exp(−n)), then the sample complexity for recovering the PSNE set exactly is O(k log(kn)).

Once again we observe that even in the regime of q scaling exponentially with n, the sample complexity of O(k log(kn)) is better than the information-theoretic limit of O(kn log² n), by a factor of O(n log n). This can be attributed to the fact that we consider restricted ensembles of games with strictly positive payoffs in the PSNE set, as opposed to general LIGs.

Further, from the aforementioned remarks we see that as the signal level q decreases, the sufficient number of samples needed to recover the PSNE set decreases; in the limiting case of q decreasing exponentially with the number of players, the sample complexity scales as O(k log(kn)). This seems counter-intuitive — with increased signal level, a learning problem should become easier, not harder. To understand this seemingly counter-intuitive behavior, first observe that the constant D_max/C_min can be thought of as the "condition number" of the loss function given by (5). The sample complexity given by Theorem 1 can therefore be written as O(k² (D_max/C_min)² log n). From (9), we see that as the signal level increases, the Hessian of the loss becomes more ill-conditioned, since the data set now comprises many repetitions of the few joint actions that are in the PSNE set, thereby increasing the dependency (D_max) between the actions of players in the sample data set.

³ Irfan and Ortiz [4] showed that counting the number of Nash equilibria is #P-complete. Therefore, computing the log-likelihood is NP-hard.
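For concreteness, the conditions (18)–(19) are easy to evaluate numerically once the problem constants are fixed. The helper below is our own hypothetical utility, with all inputs assumed known or separately estimated; Theorem 1 applies only when the returned λ interval is non-empty and m exceeds the returned sample bound.

import numpy as np

def theorem1_conditions(c_min, d_max, nu, kappa, k, n, delta, m):
    """Evaluate the conditions of Theorem 1: returns K, the admissible
    interval (18) for lambda, and the sample-size bound (19)."""
    K = 5 * c_min**2 / (32 * k * d_max) - nu * kappa
    slack = np.sqrt((2.0 / m) * np.log(6 * n**2 / delta))
    lam_interval = (nu * kappa + slack, 2 * K + nu * kappa - slack)
    m_min = max((2 / K**2) * np.log(6 * n**2 / delta),
                (2 * k / c_min) * np.log(3 * k * n / delta),
                (4 * k / (1 - nu)) * np.log(3 * k * n / delta))
    return K, lam_interval, m_min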

V. EXPERIMENTS

In order to verify that our results and assumptions indeed hold in practice, we performed various simulation experiments. We generated random LIGs over n players with exactly k neighbors per player by first creating a matrix W of all zeros and then setting k off-diagonal entries of each row, chosen uniformly at random, to −1. We set the bias of all players to 0. We found that any odd value of k produces games with strictly positive payoffs in the PSNE set. Therefore, for each value of k in {1, 3, 5} and n in {10, 12, 15, 20}, we generated 40 random LIGs. The parameters q and δ were both set to the constant value 0.01, and the regularization parameter λ was set according to Theorem 1 as a constant multiple of √((2/m) log(2n/δ)). For each experiment, the number of samples was computed as ⌊C(10^c)(k² log(6n²/δ))⌋, where c is the control parameter and the constant C is 10000 for k = 1 and 1000 for k = 3 and k = 5.

Figure 1 shows the probability of successful recovery of the PSNE set for various combinations of (n, k), where the probability was computed as the fraction of the 40 randomly sampled LIGs for which the learned PSNE set matched the true PSNE set exactly. From Figure 1 we see that the sample complexity of O(k² log n) given by Theorem 1 indeed holds in practice — i.e. there exist constants c and c′ such that if the number of samples is less than ck² log n we fail to recover the PSNE set exactly with high probability, while if the number of samples is greater than c′k² log n then we are able to recover the PSNE set exactly, with high probability. Further, the scaling remains consistent as the number of players n is changed from 10 to 20.

[Figure 1 here: three panels, one per k ∈ {1, 3, 5}, plotting the probability of success against the control parameter c, with one curve per n ∈ {10, 12, 15, 20}.]
Fig. 1. The probability of exact recovery of the PSNE set computed across 40 randomly sampled LIGs as the number of samples is scaled as ⌊C(10^c)(k² log(6n²/δ))⌋, where c is the control parameter and the constant C is 10000 for the k = 1 case and 1000 for the remaining two cases.
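A single trial of the above experiment can be reproduced end to end by composing the earlier sketches (psne_set, sample_joint_actions and learn_game); the sketch below follows the setup of this section, with the sample size m and the regularization value lam left as illustrative placeholders rather than the exact schedule described above.

import numpy as np

def run_trial(n=10, k=3, q=0.01, m=20000, lam=0.05, seed=0):
    """One experimental trial: sample a random LIG as in Section V, draw
    data from model (3), fit the per-player logistic regressions and test
    exact PSNE recovery. Assumes the sampled game has a non-empty PSNE set
    (odd k; see the text above)."""
    rng = np.random.default_rng(seed)
    W = np.zeros((n, n))
    for i in range(n):
        others = [j for j in range(n) if j != i]
        W[i, rng.choice(others, size=k, replace=False)] = -1.0  # k neighbors
    b = np.zeros(n)
    true_psne = psne_set(W, b)
    X = sample_joint_actions(true_psne, n, q, m, rng)
    W_hat, b_hat = learn_game(X, lam)
    learned = {tuple(x) for x in psne_set(W_hat, b_hat)}
    return learned == {tuple(x) for x in true_psne}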

VI. CONCLUSION

In this paper, we presented a computationally efficient and statistically consistent method, based on ℓ1-regularized logistic regression, for learning linear influence games — a subclass of parametric graphical games with linear payoffs. Under some mild conditions on the true game, we showed that as long as the number of samples scales as O(poly(k) log n), where n is the number of players and k is the maximum number of neighbors of any player, we can recover the pure-strategy Nash equilibria set of the true game in polynomial time, with probability converging to 1 as the number of samples tends to infinity. An interesting direction for future work would be to consider structured actions — for instance permutations, directed spanning trees, or directed acyclic graphs — thereby extending the formalism of linear influence games to the structured prediction setting. A more technical extension would be to consider a local noise model where the observations are drawn from the PSNE set but each action is independently corrupted by some noise. Other ideas worth pursuing include mixed strategies, correlated equilibria and epsilon-Nash equilibria, as well as incorporating latent or unobserved actions and variables in the model.

REFERENCES

[1] B. Blum, C. R. Shelton, and D. Koller, "A continuation method for Nash equilibria in structured games," Journal of Artificial Intelligence Research, vol. 25, pp. 457–502, 2006.
[2] L. E. Ortiz and M. Kearns, "Nash propagation for loopy graphical games," in Advances in Neural Information Processing Systems, 2002, pp. 793–800.
[3] D. Vickrey and D. Koller, "Multi-agent algorithms for solving graphical games," in AAAI/IAAI, 2002, pp. 345–351.
[4] M. T. Irfan and L. E. Ortiz, "On influence, stable behavior, and the most influential individuals in networks: A game-theoretic approach," Artificial Intelligence, vol. 215, pp. 79–119, 2014.
[5] O. Ben-Zwi and A. Ronen, "Local and global price of anarchy of graphical games," Theoretical Computer Science, vol. 412, no. 12, pp. 1196–1207, 2011.
[6] J. Honorio and L. Ortiz, "Learning the structure and parameters of large-population graphical games from behavioral data," Journal of Machine Learning Research, vol. 16, pp. 1157–1210, 2015.
[7] P. Ravikumar, M. Wainwright, and J. Lafferty, "High-dimensional Ising model selection using L1-regularized logistic regression," The Annals of Statistics, vol. 38, no. 3, pp. 1287–1319, 2010. [Online]. Available: http://arxiv.org/abs/1010.0311
[8] M. Kearns, M. L. Littman, and S. Singh, "Graphical models for game theory," in Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., 2001, pp. 253–260.
[9] J. A. Tropp, "User-friendly tail bounds for sums of random matrices," Foundations of Computational Mathematics, vol. 12, no. 4, pp. 389–434, 2012.
[10] W. Hoeffding, "Probability inequalities for sums of bounded random variables," Journal of the American Statistical Association, vol. 58, no. 301, pp. 13–30, 1963.
[11] J. Honorio, "On the sample complexity of learning sparse graphical games," 2016. [Online]. Available: http://arxiv.org/abs/1601.07243