Learning Graphical Games from Behavioral Data: Sufficient and Necessary Conditions

Asish Ghoshal and Jean Honorio
[email protected], [email protected]
Department of Computer Science, Purdue University, West Lafayette, IN 47907, U.S.A.

Abstract

In this paper we obtain sufficient and necessary conditions on the number of samples required for exact recovery of the pure-strategy Nash equilibria (PSNE) set of a graphical game from noisy observations of joint actions. We consider sparse linear influence games — a parametric class of graphical games with linear payoffs, represented by directed graphs of $n$ nodes (players) with in-degree at most $k$. We show that one can efficiently recover the PSNE set of a linear influence game with $O(k^2 \log n)$ samples, under very general observation models. On the other hand, we show that $\Omega(k \log n)$ samples are necessary for any procedure to recover the PSNE set from observations of joint actions.

1 Introduction and Related Work

Non-cooperative game theory is widely regarded as an appropriate mathematical framework for studying strategic behavior in multi-agent scenarios. In non-cooperative game theory, the solution concept of Nash equilibrium describes the stable outcome of the overall behavior of self-interested agents — for instance people, companies, governments, groups or autonomous systems — interacting strategically with each other and in distributed settings.

Over the past few years, considerable progress has been made in analyzing behavioral data using game-theoretic tools, e.g., computing Nash equilibria [1, 2, 3], the most influential agents [4], the price of anarchy [5], and related concepts in the context of graphical games. In political science, for instance, Irfan and Ortiz [4] identified, from congressional voting records, the most influential senators in the U.S. Congress — a small set of senators whose collective behavior forces every other senator to a unique choice of vote. Irfan and Ortiz [4] also observed that the most influential senators were strikingly similar to the "gang-of-six" senators formed during the national debt-ceiling negotiations of 2011. Further, using graphical games, Honorio and Ortiz [6] showed that Obama's influence on Republicans increased in the last sessions before candidacy, while McCain's influence on Republicans decreased.

The problems in algorithmic game theory described above, i.e., computing the Nash equilibria, computing the price of anarchy, or finding the most influential agents, require a known graphical game, which is not available a priori in real-world settings. Therefore, Honorio and Ortiz [6] proposed learning graphical games from behavioral data, using maximum-likelihood estimation (MLE) and sparsity-promoting methods. Garg and Jaakkola [7] provide a discriminative approach to learn a class of graphical games called potential games. Honorio and Ortiz [6] and Irfan and Ortiz [4] have also demonstrated the usefulness of learning sparse graphical games from behavioral data in real-world settings, through their analysis of the voting records of the U.S. Congress as well as the U.S. Supreme Court.

In this paper, we obtain necessary and sufficient conditions for recovering the PSNE set of a graphical game in polynomial time. We also generalize the observation model of Ghoshal and Honorio [8] to arbitrary distributions that satisfy certain mild conditions. Our polynomial-time method for recovering the PSNE set, which was proposed by Honorio and Ortiz [6], is based on using logistic regression to learn the neighborhood of each player in the graphical game independently.
Honorio and Ortiz [6] showed that the method of independent logistic regression is likelihood consistent; i.e., in the infinite-sample limit, the likelihood estimate converges to the best achievable likelihood. In this paper we obtain the stronger guarantee of recovering the true PSNE set exactly.

Finally, we would like to draw the reader's attention to the fact that $\ell_1$-regularized logistic regression has been analyzed by Ravikumar et al. [9] in the context of learning sparse Ising models. Apart from technical differences and differences in proof techniques, our analysis of $\ell_1$-penalized logistic regression for learning sparse graphical games differs from Ravikumar et al. [9] conceptually — in the sense that we are not interested in recovering the edges of the true game graph, but only the PSNE set. Therefore, we are able to avoid some stronger conditions required by Ravikumar et al. [9], such as mutual incoherence.

2 Preliminaries

In this section we provide some background information on graphical games, introduced by Kearns et al. [10].

2.1 Graphical Games

A normal-form game in classical game theory is defined by the triple $\mathcal{G} = (V, \mathcal{X}, \mathcal{U})$ of players, actions and payoffs. $V$ is the set of players, given by $V = \{1, \dots, n\}$ if there are $n$ players. $\mathcal{X}$ is the set of actions, or pure strategies, given by the Cartesian product $\mathcal{X} = \times_{i \in V} \mathcal{X}_i$, where $\mathcal{X}_i$ is the set of pure strategies of the $i$-th player. Finally, $\mathcal{U} = \{u_i\}_{i=1}^n$ is the set of payoffs, where $u_i : \mathcal{X}_i \times (\times_{j \in V \setminus \{i\}} \mathcal{X}_j) \to \mathbb{R}$ specifies the payoff of the $i$-th player given its action and the joint actions of all the remaining players.

An important solution concept in the theory of non-cooperative games is that of Nash equilibrium. For a non-cooperative game, a joint action $x^* \in \mathcal{X}$ is a pure-strategy Nash equilibrium (PSNE) if, for each player $i$, $x_i^* \in \arg\max_{x_i \in \mathcal{X}_i} u_i(x_i, x^*_{-i})$, where $x^*_{-i} = \{x^*_j \mid j \neq i\}$. In other words, $x^*$ constitutes a mutual best response for all players, and no player has any incentive to unilaterally deviate from its optimal action $x_i^*$ given the joint actions of the remaining players $x^*_{-i}$. The set of all pure-strategy Nash equilibria of a game $\mathcal{G}$ is defined as follows:

$$\mathcal{NE}(\mathcal{G}) = \left\{ x^* \,\middle|\, (\forall i \in V)\; x_i^* \in \arg\max_{x_i \in \mathcal{X}_i} u_i(x_i, x^*_{-i}) \right\}. \quad (1)$$

Graphical games, introduced by Kearns et al. [10], are game-theoretic analogues of graphical models. A graphical game $\mathcal{G}$ is defined by a directed graph $G = (V, E)$ of vertices and directed edges (arcs), where vertices correspond to players and arcs encode "influence" among players; i.e., the payoff of the $i$-th player depends only on the actions of its (incoming) neighbors.

2.2 Linear Influence Games

Irfan and Ortiz [4] and Honorio and Ortiz [6] introduced a specific form of graphical games, called linear influence games, characterized by binary actions, or pure strategies, and linear payoff functions. We assume, without loss of generality, that the joint action space is $\mathcal{X} = \{-1, +1\}^n$. A linear influence game between $n$ players, $\mathcal{G}(n) = (\mathbf{W}, \mathbf{b})$, is characterized by (i) a matrix of weights $\mathbf{W} \in \mathbb{R}^{n \times n}$, where the entry $w_{ij}$ indicates the amount of (signed) influence that the $j$-th player has on the $i$-th player, and (ii) a bias vector $\mathbf{b} \in \mathbb{R}^n$, where $b_i$ captures the prior preference of the $i$-th player for a particular action $x_i \in \{-1, +1\}$. The payoff of the $i$-th player is a linear function of the actions of the remaining players, $u_i(x_i, x_{-i}) = x_i(\mathbf{w}_{-i}^T x_{-i} - b_i)$, and the PSNE set is defined as follows:

$$\mathcal{NE}(\mathcal{G}(n)) = \left\{ x \,\middle|\, (\forall i)\; x_i (\mathbf{w}_{-i}^T x_{-i} - b_i) \ge 0 \right\}, \quad (2)$$

where $\mathbf{w}_{-i}$ denotes the $i$-th row of $\mathbf{W}$ without the $i$-th entry, i.e., $\mathbf{w}_{-i} = \{w_{ij} \mid j \neq i\}$. Note that we have $\mathrm{diag}(\mathbf{W}) = \mathbf{0}$. Thus, for linear influence games, the weight matrix $\mathbf{W}$ and the bias vector $\mathbf{b}$ completely specify the game and the PSNE set induced by the game. Finally, let $\mathcal{G}(n, k)$ denote a sparse game over $n$ players where the in-degree of any vertex is at most $k$.

3 Problem Formulation

Now we turn our attention to the problem of learning graphical games from observations of joint actions only. Let $\mathcal{NE}^* \triangleq \mathcal{NE}(\mathcal{G}^*(n, k))$. We assume that there exists a game $\mathcal{G}^*(n, k) = (\mathbf{W}^*, \mathbf{b}^*)$ from which a "noisy" data set $\mathcal{D} = \{x^{(l)}\}_{l=1}^m$ of $m$ observations is generated, where each observation $x^{(l)}$ is sampled independently and identically from some distribution $\mathcal{P}$. We will use two specific distributions, $\mathcal{P}_g$ and $\mathcal{P}_l$, which we refer to as the global and local noise models, to provide further intuition behind our results. In the global noise model, we assume that a joint action is observed from the PSNE set with probability $q_g \in (|\mathcal{NE}^*|/2^n, 1)$, i.e.,

$$\mathcal{P}_g(x) = \frac{q_g \mathbf{1}[x \in \mathcal{NE}^*]}{|\mathcal{NE}^*|} + \frac{(1 - q_g)\, \mathbf{1}[x \notin \mathcal{NE}^*]}{2^n - |\mathcal{NE}^*|}. \quad (3)$$

In the above distribution, $q_g$ can be thought of as the "signal" level in the data set, while $1 - q_g$ can be thought of as the "noise" level in the data set. In the local noise model we assume that the joint actions are drawn from the PSNE set with the action of each player corrupted independently by some Bernoulli noise. In the local noise model the distribution over joint actions is then given as follows:

$$\mathcal{P}_l(x) = \frac{1}{|\mathcal{NE}^*|} \sum_{y \in \mathcal{NE}^*} \prod_{i=1}^n q_i^{\mathbf{1}[x_i = y_i]} (1 - q_i)^{\mathbf{1}[x_i \neq y_i]}, \quad (4)$$

where $q_i > 0.5$. While these noise models were introduced in [6], we obtain our results with respect to very general observation models, satisfying only some mild conditions.
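As a concrete illustration of the two observation models, the following Python sketch draws joint actions from (3) and (4). This is a minimal toy implementation under stated assumptions; the function names, the rejection-sampling strategy, and the toy PSNE set are ours, not the paper's.

```python
import numpy as np

def sample_global(NE, n, q_g, m, rng):
    """Draw m joint actions from the global noise model (3).

    NE : list of equilibria, each an array in {-1, +1}^n
    q_g: probability of emitting an equilibrium joint action
    """
    out = np.empty((m, n), dtype=int)
    for l in range(m):
        if rng.random() < q_g:
            out[l] = NE[rng.integers(len(NE))]          # uniform over NE*
        else:
            while True:                                  # uniform over X \ NE*
                x = rng.choice([-1, 1], size=n)
                if not any(np.array_equal(x, y) for y in NE):
                    out[l] = x
                    break
    return out

def sample_local(NE, n, q, m, rng):
    """Draw m joint actions from the local noise model (4).

    q : length-n array of per-player probabilities q_i > 0.5
        of keeping the equilibrium action
    """
    out = np.empty((m, n), dtype=int)
    for l in range(m):
        y = NE[rng.integers(len(NE))]                    # pick an equilibrium
        keep = rng.random(n) < q                         # flip action i w.p. 1 - q_i
        out[l] = np.where(keep, y, -y)
    return out

# Illustrative usage on a toy PSNE set:
rng = np.random.default_rng(0)
NE = [np.ones(5, dtype=int), -np.ones(5, dtype=int)]
D = sample_global(NE, n=5, q_g=0.9, m=100, rng=rng)
```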
A natural question to ask, then, is: "Given only the data set $\mathcal{D}$ and no other information, is it possible to recover the game graph?" Honorio and Ortiz [6] showed that it is in general impossible to learn the true game $\mathcal{G}^*(n, k)$ from observations of joint actions only, because multiple weight matrices $\mathbf{W}$ and bias vectors $\mathbf{b}$ can induce the same PSNE set and therefore have the same likelihood under the global noise model (3) — an issue known as non-identifiability in the statistics literature. It is also easy to see that the same holds true for the local noise model. It is, however, possible to learn the equivalence class of games that induce the same PSNE set. We define the equivalence of two games $\mathcal{G}^*(n, k)$ and $\hat{\mathcal{G}}(n, k)$ simply as:

$$\mathcal{G}^*(n, k) \equiv \hat{\mathcal{G}}(n, k) \iff \mathcal{NE}(\mathcal{G}^*(n, k)) = \mathcal{NE}(\hat{\mathcal{G}}(n, k)).$$

Therefore, our goal in this paper is efficient and consistent recovery of the pure-strategy Nash equilibria (PSNE) set from observations of joint actions only; i.e., given a data set $\mathcal{D}$, drawn from some game $\mathcal{G}^*(n, k)$ according to the distribution $\mathcal{P}$, we infer a game $\hat{\mathcal{G}}(n, k)$ from $\mathcal{D}$ such that $\hat{\mathcal{G}}(n, k) \equiv \mathcal{G}^*(n, k)$.

4 Method and Results

In order to efficiently learn games, we make a few assumptions on the probability distribution from which samples are drawn and also on the underlying game.

4.1 Assumptions

The following assumption ensures that the distribution $\mathcal{P}$ assigns non-zero mass to all joint actions in $\mathcal{X}$ and that the signal level in the data set is more than the noise level.

Assumption 1. There exist constants $\tilde{p}_{\min}$, $\tilde{p}_{\max}$ and $p_{\max}$ such that the data distribution $\mathcal{P}$ satisfies the following:

$$0 < \frac{\tilde{p}_{\min}}{2^n - |\mathcal{NE}^*|} \le \mathcal{P}(x) \le \frac{\tilde{p}_{\max}}{2^n - |\mathcal{NE}^*|}, \quad \forall x \in \mathcal{X} \setminus \mathcal{NE}^*,$$
$$\frac{\tilde{p}_{\max}}{2^n - |\mathcal{NE}^*|} < \mathcal{P}(x) \le p_{\max} \le 1, \quad \forall x \in \mathcal{NE}^*.$$

To get some intuition for the above assumption, consider the global noise model. In this case we have that $\tilde{p}_{\min} = \tilde{p}_{\max} = 1 - q_g$, $p_{\max} = q_g/|\mathcal{NE}^*|$, and $\mathcal{P}(x) = p_{\max}$ for all $x \in \mathcal{NE}^*$. For the local noise model, consider, for simplicity, the case when there are only two joint actions in the PSNE set: $\mathcal{NE}^* = \{x^1, x^2\}$, such that $x^1_1 = +1$, $x^2_1 = -1$ and $x^1_i = x^2_i = +1$ for all $i \neq 1$. Then $\tilde{p}_{\min} = 0.5\,(1 - q_2) \times \dots \times (1 - q_n) \times (2^n - 2)$, where $q_i > 0.5$; $\tilde{p}_{\max} = 0.5\,(1 - q_j)\big(\prod_{i \notin \{j, 1\}} q_i\big)(2^n - 2)$, where $q_j = \min\{q_2, \dots, q_n\}$; and $p_{\max} = 0.5\, q_2 \times \dots \times q_n$.

Our next assumption concerns the minimum payoff in the PSNE set.

Assumption 2. The minimum payoff in the PSNE set, $\rho_{\min}$, is strictly positive; specifically:

$$x_i\big(\mathbf{w}^{*T}_{-i} x_{-i} - b^*_i\big) \ge \rho_{\min} > \frac{5 C_{\min}}{D_{\max}} \quad (\forall x \in \mathcal{NE}^*),$$

where $C_{\min} > 0$ and $D_{\max}$ are the minimum and maximum eigenvalues of the expected Hessian and scatter matrices, respectively.

Note that as long as the minimum payoff is strictly positive, we can scale the parameters $(\mathbf{W}^*, \mathbf{b}^*)$ by the constant $\frac{5 C_{\min}}{D_{\max}\, \rho_{\min}}$ to satisfy the condition $\rho_{\min} > 5C_{\min}/D_{\max}$, without changing the PSNE set. Indeed, the assumption that the minimum payoff is strictly positive is unavoidable for exact recovery of the PSNE set in a noisy setting such as ours, because otherwise this would be akin to exactly recovering the parameters $v^*$ for each player $i$. For example, if $x \in \mathcal{NE}^*$ is such that $v^{*T} x = 0$, then it can be shown that even if $\|v^* - v\|_1 = \varepsilon$, for any $\varepsilon$ arbitrarily close to $0$, we can have $v^T x < 0$ and therefore $\mathcal{NE}(\mathbf{W}^*, \mathbf{b}^*) \neq \mathcal{NE}(\mathbf{W}, \mathbf{b})$.

4.2 Method

Our main method for learning the structure of a sparse LIG, $\mathcal{G}^*(n, k)$, is based on using $\ell_1$-regularized logistic regression to learn the parameters $(\mathbf{w}_{-i}, b_i)$ for each player independently. We denote by $v_i(\mathbf{W}, \mathbf{b}) = (\mathbf{w}_{-i}, -b_i)$ the parameter vector for the $i$-th player, which characterizes its payoff, and by $z_i(x) = (x_i x_{-i}, x_i)$ the "feature" vector. In the rest of the paper we use $v_i$ and $z_i$ instead of $v_i(\mathbf{W}, \mathbf{b})$ and $z_i(x)$, respectively, to simplify notation. Then, we learn the parameters for the $i$-th player as follows:

$$\hat{v}_i = \arg\min_{v_i}\; \ell(v_i, \mathcal{D}) + \lambda \|v_i\|_1, \quad (5)$$
$$\ell(v_i, \mathcal{D}) = \frac{1}{m} \sum_{l=1}^m \log\big(1 + \exp(-v_i^T z_i^{(l)})\big). \quad (6)$$

We then set $\hat{\mathbf{w}}_{-i} = [\hat{v}_i]_{1:(n-1)}$ and $\hat{b}_i = -[\hat{v}_i]_n$, where the notation $[\cdot]_{i:j}$ denotes indices $i$ through $j$ of a vector.
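Concretely, the per-player estimation step (5)-(6) can be written in a few lines of Python. The sketch below is illustrative (the function name and toy data are ours): it uses the identity $v_i^T z_i(x) = x_i(\mathbf{w}_{-i}^T x_{-i} - b_i)$, so (5) is an $\ell_1$-penalized logistic regression of $x_i$ on $(x_{-i}, 1)$; appending the constant column with `fit_intercept=False` keeps the bias penalized, as in (5).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_player(X, i, lam):
    """l1-regularized logistic regression for player i, eqs. (5)-(6).

    X   : (m, n) array of joint actions in {-1, +1}
    lam : regularization parameter lambda
    Returns (w_minus_i, b_i), the estimated payoff parameters.
    """
    m, n = X.shape
    y = X[:, i]                                   # player i's action
    feats = np.delete(X, i, axis=1)               # x_{-i}
    feats = np.hstack([feats, np.ones((m, 1))])   # constant column: bias is penalized too
    # liblinear minimizes ||v||_1 + C * sum_l log(1 + exp(-y_l v^T z_l)), which
    # matches (1/m) sum_l log(...) + lam * ||v||_1 when C = 1 / (m * lam).
    clf = LogisticRegression(penalty="l1", C=1.0 / (m * lam),
                             fit_intercept=False, solver="liblinear")
    clf.fit(feats, y)
    v = clf.coef_.ravel()                         # v_i = (w_{-i}, -b_i)
    return v[:-1], -v[-1]

# Illustrative usage on random sign data (not from an actual game):
rng = np.random.default_rng(0)
X = rng.choice([-1, 1], size=(500, 10))
w_minus_i, b_i = fit_player(X, i=0, lam=0.1)
```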
We take a moment to introduce the expressions for the gradient and the Hessian of the loss function (6), which will be useful later. The gradient and Hessian of the loss function for any vector $v$ and data set $\mathcal{D}$ are given as follows:

$$\nabla \ell(v, \mathcal{D}) = -\frac{1}{m} \sum_{l=1}^m \frac{z^{(l)}}{1 + \exp(v^T z^{(l)})}, \quad (7)$$
$$\nabla^2 \ell(v, \mathcal{D}) = \frac{1}{m} \sum_{l=1}^m \eta(v^T z^{(l)})\, z^{(l)} (z^{(l)})^T, \quad (8)$$

where $\eta(x) = 1/(e^{x/2} + e^{-x/2})^2$. Finally, $H^m_i$ denotes the sample Hessian matrix with respect to the $i$-th player and the true parameter $v^*_i$, and $H^*_i$ denotes its expected value, i.e., $H^*_i \triangleq \mathbb{E}_\mathcal{D}[H^m_i] = \mathbb{E}_\mathcal{D}[\nabla^2 \ell(v^*_i, \mathcal{D})]$. In subsequent sections we drop the notational dependence of $H^*_i$ and $z_i$ on $i$ to simplify notation.

We show that, under the aforementioned assumptions on the true game $\mathcal{G}^*(n, k) = (\mathbf{W}^*, \mathbf{b}^*)$, the parameters $\hat{\mathbf{W}}$ and $\hat{\mathbf{b}}$ obtained using (5) induce the same PSNE set as the true game, i.e., $\mathcal{NE}(\mathbf{W}^*, \mathbf{b}^*) = \mathcal{NE}(\hat{\mathbf{W}}, \hat{\mathbf{b}})$.

4.3 Sufficient Conditions

In this section, we derive sufficient conditions on the number of samples for efficiently recovering the PSNE set of graphical games with linear payoffs. To start with, we make the following observation regarding the number of Nash equilibria of a game satisfying Assumption 2. The proof of the following proposition, as well as other missing proofs, can be found in Appendix A.

Proposition 1. The number of Nash equilibria of a non-trivial game ($|\mathcal{NE}^*| \in [1, 2^n - 1]$) satisfying Assumption 2 is at most $2^{n-1}$.

We will denote the fraction of joint actions that are in the PSNE set by $f^*_{\mathcal{NE}} \triangleq |\mathcal{NE}^*|/2^n$. By Proposition 1, $f^*_{\mathcal{NE}} \in (0, 1/2]$. Then, our main strategy for obtaining sufficient conditions for exact PSNE recovery is to first show that, under any data distribution that satisfies Assumption 1, the expected loss is smooth and strongly convex, i.e., the population Hessian matrix is positive definite and the population scatter matrix has eigenvalues bounded by a constant. Then, using tools from random matrix theory, we show that the sample Hessian and scatter matrices are "well behaved", i.e., are positive definite and have bounded eigenvalues, respectively, with high probability. Then, we exploit the convexity properties of the logistic loss function to show that the weight vectors learned using $\ell_1$-penalized logistic regression are "close" to the true weight vectors. By our assumption that the minimum payoff in the PSNE set is strictly greater than zero, we show that the weight vectors inferred from a finite sample of joint actions induce the same PSNE set as the true weight vectors.

The following lemma shows that the expected Hessian matrix for each player is positive definite, and that the maximum eigenvalue of the expected scatter matrix is bounded from above by a constant.

Lemma 1. Let $S$ be the support of the vector $v^*$, i.e., $S = \{i \mid |v^*_i| > 0\}$. There exist constants $C_{\min} \ge \eta(\|v^*\|_1) \frac{2^n \tilde{p}_{\min}}{2^n - |\mathcal{NE}^*|} > 0$ and $D_{\max} \le 2^n p_{\max}$ such that $\lambda_{\min}(H^*_{SS}) = C_{\min}$ and $\lambda_{\max}(\mathbb{E}_x[z_S z_S^T]) = D_{\max}$.

Proof.

$$\lambda_{\min}(H^*_{SS}) = \lambda_{\min}\Big(\mathbb{E}_x\big[\eta(v^{*T} z)\, z_S z_S^T\big]\Big) \ge \eta(\|v^*\|_1)\, \lambda_{\min}\big(\mathbb{E}_x[z_S z_S^T]\big).$$

Let $Z \triangleq \{z_S \mid x \in \mathcal{X}\}$ be the matrix whose rows are the feature vectors of the $i$-th player constrained to the support set $S$, for some $i$, and let $P \triangleq \mathrm{Diag}((\mathcal{P}(x))_{x \in \mathcal{X}})$. Note that $Z \in \{-1, 1\}^{2^n \times |S|}$; $P \in \mathbb{R}^{2^n \times 2^n}$ and is positive definite by our assumption that the minimum probability $\tilde{p}_{\min}/(2^n - |\mathcal{NE}^*|) > 0$. Further note that the columns of $Z$ are orthogonal and $Z^T Z = 2^n I_{|S|}$, where $I_{|S|}$ is the $|S| \times |S|$ identity matrix. Then we have that

$$\lambda_{\min}\big(\mathbb{E}_x[z_S z_S^T]\big) = \min_{\{y \in \mathbb{R}^{|S|} \,:\, \|y\|_2 = 1\}} y^T Z^T P Z y = \min_{\{y' \in \mathbb{R}^{2^n} \,:\, y' = Zy/\sqrt{2^n},\; \|y\|_2 = 1\}} 2^n (y')^T P y' \ge \min_{\{y' \in \mathbb{R}^{2^n} \,:\, \|y'\|_2 = 1\}} 2^n (y')^T P y' = 2^n \lambda_{\min}(P) = \frac{2^n \tilde{p}_{\min}}{2^n - |\mathcal{NE}^*|}.$$

Therefore, the minimum eigenvalue of $H^*_{SS}$ is lower bounded as follows:

$$\lambda_{\min}(H^*_{SS}) = C_{\min} \ge \frac{\eta(\|v^*\|_1)\, 2^n \tilde{p}_{\min}}{2^n - |\mathcal{NE}^*|} > 0.$$

Similarly, the maximum eigenvalue of $\mathbb{E}_x[z_S z_S^T]$ can be bounded as $\lambda_{\max}(\mathbb{E}_x[z_S z_S^T]) = \lambda_{\max}(Z^T P Z) \le 2^n p_{\max}$. $\square$
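Lemma 1 concerns population quantities; on data one can only form their sample analogues. The following is a small illustrative numpy sketch (the random features and helper names are stand-ins, not the paper's code) that builds the sample Hessian (8) and scatter matrix and inspects the extreme eigenvalues that Lemma 2 below controls.

```python
import numpy as np

def eta(t):
    """eta(x) = 1 / (e^{x/2} + e^{-x/2})^2, the curvature weight in (8)."""
    return 1.0 / (np.exp(t / 2.0) + np.exp(-t / 2.0)) ** 2

def sample_hessian_and_scatter(Z, v):
    """Sample Hessian (8) and scatter matrix (1/m) sum z z^T for rows of Z."""
    H = (Z * eta(Z @ v)[:, None]).T @ Z / len(Z)   # (1/m) sum eta(v^T z) z z^T
    scatter = Z.T @ Z / len(Z)                     # (1/m) sum z z^T
    return H, scatter

# Illustrative check on random sign features (stand-ins for the z^(l)):
rng = np.random.default_rng(1)
Z = rng.choice([-1.0, 1.0], size=(1000, 6))
v = rng.normal(size=6)
H, scatter = sample_hessian_and_scatter(Z, v)
print(np.linalg.eigvalsh(H).min(), np.linalg.eigvalsh(scatter).max())
```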

4.3.1 Minimum and Maximum Eigenvalues of Finite-Sample Hessian and Scatter Matrices

The following technical lemma shows that the eigenvalue conditions on the expected Hessian and scatter matrices hold, with high probability, in the finite-sample case.

Lemma 2. If $\lambda_{\min}(H^*_{SS}) \ge C_{\min}$ and $\lambda_{\max}(\mathbb{E}_x[z_S z_S^T]) \le D_{\max}$, then we have that

$$\lambda_{\min}(H^m_{SS}) \ge \frac{C_{\min}}{2} \quad \text{and} \quad \lambda_{\max}\left(\frac{1}{m} \sum_{l=1}^m z_S^{(l)} (z_S^{(l)})^T\right) \le 2 D_{\max}$$

with probability at least $1 - |S| \exp\big({-\frac{m C_{\min}}{2|S|}}\big)$ and $1 - |S| \exp\big({-\frac{m \tilde{p}_{\min}}{4|S|}}\big)$, respectively.

Proof. Let $\mu_{\min} \triangleq \lambda_{\min}(H^*_{SS})$ and $\mu_{\max} \triangleq \lambda_{\max}(\mathbb{E}_x[z_S z_S^T])$. First note that for all $z \in \{-1, +1\}^n$:

$$\lambda_{\max}\big(\eta(v^{*T} z)\, z_S z_S^T\big) \le \frac{|S|}{4} \triangleq R, \qquad \lambda_{\max}\big(z_S z_S^T\big) = |S| \triangleq R'.$$

Using the matrix Chernoff bounds from Tropp [11], we have that

$$\Pr\big\{\lambda_{\min}(H^m_{SS}) \le (1 - \gamma)\mu_{\min}\big\} \le |S| \left[\frac{e^{-\gamma}}{(1 - \gamma)^{1 - \gamma}}\right]^{m \mu_{\min}/R}.$$

Setting $\gamma = 1/2$ we get that

$$\Pr\big\{\lambda_{\min}(H^m_{SS}) \le \mu_{\min}/2\big\} \le |S| \left[\sqrt{\tfrac{2}{e}}\,\right]^{4 m \mu_{\min}/|S|} \le |S| \exp\left(-\frac{m C_{\min}}{2 |S|}\right).$$

Therefore, we have

$$\Pr\big\{\lambda_{\min}(H^m_{SS}) > C_{\min}/2\big\} > 1 - |S| \exp\left(-\frac{m C_{\min}}{2 |S|}\right).$$

Next, we have that

$$\mu_{\max} = \lambda_{\max}\big(\mathbb{E}_x[z_S z_S^T]\big) \ge \lambda_{\min}\big(\mathbb{E}_x[z_S z_S^T]\big) \ge \frac{2^n \tilde{p}_{\min}}{2^n - |\mathcal{NE}^*|}.$$

Once again invoking Theorem 1.1 from [11], and setting $\gamma = 1$, we have that, writing $\hat{\Sigma} = \frac{1}{m}\sum_{l=1}^m z_S^{(l)} (z_S^{(l)})^T$,

$$\Pr\big\{\lambda_{\max}(\hat{\Sigma}) \ge (1 + \gamma)\mu_{\max}\big\} \le |S| \left[\frac{e^{\gamma}}{(1 + \gamma)^{1 + \gamma}}\right]^{m \mu_{\max}/R'} \implies \Pr\big\{\lambda_{\max}(\hat{\Sigma}) \ge 2 \mu_{\max}\big\} \le |S| \left(\frac{e}{4}\right)^{m \mu_{\max}/|S|} \le |S| \exp\left(-\frac{m \mu_{\max}}{4 |S|}\right) \le |S| \exp\left(-\frac{m\, 2^n \tilde{p}_{\min}}{4 |S| (2^n - |\mathcal{NE}^*|)}\right) \le |S| \exp\left(-\frac{m \tilde{p}_{\min}}{4 |S|}\right).$$

Therefore, we have that

$$\Pr\big\{\lambda_{\max}(\hat{\Sigma}) < 2 D_{\max}\big\} > 1 - |S| \exp\left(-\frac{m \tilde{p}_{\min}}{4 |S|}\right). \qquad \square$$

4.3.2 Recovering the Pure-Strategy Nash Equilibria (PSNE) Set

Before presenting our main result on the exact recovery of the PSNE set from noisy observations of joint actions, we first present a few technical lemmas that will be helpful in proving the main result. The following lemma bounds the gradient of the loss function (6) at the true vector $v^*$, for all players.

Lemma 3. With probability at least $1 - \delta$, for $\delta \in (0, 1)$, we have that

$$\|\nabla \ell(v^*, \mathcal{D})\|_\infty < \nu + \sqrt{\frac{2}{m} \log \frac{2n}{\delta}},$$

where $\kappa = 1/(1 + \exp(\rho_{\min}))$, $\rho_{\min} \ge 0$ is the minimum payoff in the PSNE set, $f^*_{\mathcal{NE}} = |\mathcal{NE}^*|/2^n$, and

$$\nu \triangleq \kappa \sum_{x \in \mathcal{NE}^*} \mathcal{P}(x) + \frac{\tilde{p}_{\max} - \tilde{p}_{\min}}{2(1 - f^*_{\mathcal{NE}})} + \frac{f^*_{\mathcal{NE}}\, \tilde{p}_{\min}}{1 - f^*_{\mathcal{NE}}}. \quad (9)$$

Proof. Consider the $i$-th player. Let $u^m \triangleq \nabla \ell(v^*, \mathcal{D})$ and let $u^m_j$ denote the $j$-th entry of $u^m$. For any subset $\mathcal{S}' \subset \mathcal{X}$ such that $|\mathcal{S}'| = 2^{n-1}$, define the function $g(\mathcal{S}')$ as follows:

$$g(\mathcal{S}') \triangleq \sum_{x \in \mathcal{S}'} \mathcal{P}(x) f(x) - \sum_{x \in \mathcal{S}'^c} \mathcal{P}(x) f(x),$$

where $\mathcal{S}'^c$ denotes the complement of the set $\mathcal{S}'$ and $f(x) = 1/(1 + \exp(v_i^{*T} z_i(x)))$. For $x \in \mathcal{NE}^*$, $f(x) \le \kappa$, while for $x \notin \mathcal{NE}^*$ we have $1/2 \le f(x) \le 1$. Lastly, let $\mathcal{S}_{ij} = \{x \in \mathcal{X} \mid x_i x_j = +1\}$ and $\mathcal{S}_i = \{x \in \mathcal{X} \mid x_i = +1\}$. From (7) we have that, for $j \neq n$, $|\mathbb{E}[u^m_j]| = |g(\mathcal{S}_{ij})|$, while for $j = n$, $|\mathbb{E}[u^m_j]| = |g(\mathcal{S}_i)|$. Thus we get

$$\|\mathbb{E}[u^m]\|_\infty \le \max_{\{\mathcal{S}' \subset \mathcal{X} \,:\, |\mathcal{S}'| = 2^{n-1}\}} |g(\mathcal{S}')|. \quad (10)$$

Let $\bar{\mathcal{S}}$ be the set that maximizes (10), $A \triangleq \bar{\mathcal{S}} \setminus \mathcal{NE}^*$ and $B \triangleq \bar{\mathcal{S}}^c \cap \mathcal{NE}^*$. Continuing from above,

$$|g(\bar{\mathcal{S}})| = \left| \sum_{x \in \bar{\mathcal{S}} \setminus A} \mathcal{P}(x) f(x) + \sum_{x \in A} \mathcal{P}(x) f(x) - \sum_{x \in \bar{\mathcal{S}}^c \setminus B} \mathcal{P}(x) f(x) - \sum_{x \in B} \mathcal{P}(x) f(x) \right| \le \kappa \sum_{x \in \mathcal{NE}^*} \mathcal{P}(x) + \left| \sum_{x \in A} \mathcal{P}(x) f(x) - \sum_{x \in \bar{\mathcal{S}}^c \setminus B} \mathcal{P}(x) f(x) \right|.$$

Assume that the first term inside the absolute value above dominates the second term; if not, we can proceed by reversing the two terms. Then,

$$|g(\bar{\mathcal{S}})| \le \kappa \sum_{x \in \mathcal{NE}^*} \mathcal{P}(x) + \frac{2^{n-1} \tilde{p}_{\max} - (2^{n-1} - |\mathcal{NE}^*|)\, \tilde{p}_{\min}}{2^n - |\mathcal{NE}^*|} = \kappa \sum_{x \in \mathcal{NE}^*} \mathcal{P}(x) + \frac{(\tilde{p}_{\max} - \tilde{p}_{\min}) + 2 f^*_{\mathcal{NE}}\, \tilde{p}_{\min}}{2(1 - f^*_{\mathcal{NE}})} = \nu.$$

Also note that $|\mathbb{E}[u^m_j]| \le \nu$ and $|u^m_j| \le 1$. Finally, from Hoeffding's inequality [12] and using a union-bound argument over all players, we have that:

$$\Pr\left\{ \max_{j=1}^n \big|u^m_j - \mathbb{E}[u^m_j]\big| \ge t \right\} \le 2n e^{-mt^2/2} \implies \Pr\big\{ \|u^m - \mathbb{E}[u^m]\|_\infty \ge t \big\} \le 2n e^{-mt^2/2} \implies \Pr\big\{ \|u^m\|_\infty \ge \|\mathbb{E}[u^m]\|_\infty + t \big\} \le 2n e^{-mt^2/2} \implies \Pr\big\{ \|u^m\|_\infty < \nu + t \big\} > 1 - 2n e^{-mt^2/2}.$$

Setting $2n \exp(-mt^2/2) = \delta$, we prove our claim. $\square$

To get some intuition for the lemma above, consider the constant $\nu$ as given in (9). First, note that $\kappa \le 1/2$. Also, as the minimum payoff $\rho_{\min}$ increases, $\kappa$ decays to $0$ exponentially. Similarly, if the probability measure on the non-Nash-equilibria set is close to uniform, meaning $\tilde{p}_{\max} - \tilde{p}_{\min} \approx 0$, then the second term in (9) vanishes. Finally, if the fraction of joint actions that are in the PSNE set ($f^*_{\mathcal{NE}}$) is small, then the third term in (9) is small. Therefore, if the minimum payoff is high, the noise distribution (i.e., the distribution of the non-Nash-equilibria joint actions) is close to uniform, and the fraction of joint actions that are in the PSNE set is small, then the expected gradient vanishes.
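The quantity bounded by Lemma 3 is easy to probe numerically. Below is a small illustrative numpy sketch (`v_star` and `Z` are arbitrary stand-ins, not drawn from an actual game) that evaluates the gradient (7) at a parameter vector and reports its sup-norm:

```python
import numpy as np

def grad_loss(v, Z):
    """Gradient (7) of the logistic loss at v, for feature rows Z (m x d)."""
    # each term is -z / (1 + exp(v^T z)); average over the sample
    return -(Z / (1.0 + np.exp(Z @ v))[:, None]).mean(axis=0)

# Illustrative evaluation of ||grad l(v*, D)||_inf on toy data:
rng = np.random.default_rng(2)
Z = rng.choice([-1.0, 1.0], size=(2000, 8))
v_star = rng.normal(size=8)
print(np.linalg.norm(grad_loss(v_star, Z), ord=np.inf))
```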

In the following technical lemma we show that the optimal vector $\hat{v}$ for the logistic regression problem (5) is close to the true vector $v^*$ on the support set $S$ of $v^*$. Then, in Lemma 5, we bound the difference between the true vector $v^*$ and the optimal vector $\hat{v}$ on the non-support set. Together, the lemmas show that the optimal vector is close to the true vector.

Lemma 4. If the regularization parameter satisfies the following condition:

$$\lambda \le \frac{5 C_{\min}^2}{16 |S| D_{\max}} - \nu - \sqrt{\frac{2}{m} \log \frac{2n}{\delta}},$$

then

$$\|\hat{v}_S - v^*_S\|_2 \le \frac{5 C_{\min}}{4 \sqrt{|S|}\, D_{\max}}$$

with probability at least $1 - \big(\delta + |S| \exp(-\tfrac{m C_{\min}}{2|S|}) + |S| \exp(-\tfrac{m \tilde{p}_{\min}}{4|S|})\big)$.

Proof. The proof of this lemma follows the general proof structure of Lemma 3 in [9]. First, we reparameterize the $\ell_1$-regularized loss function

$$f(v_S) = \ell(v_S) + \lambda \|v_S\|_1$$

as the loss function $\tilde{f}$, which gives the loss at a point that is $\Delta_S$ away from the true parameter $v^*_S$:

$$\tilde{f}(\Delta_S) = \ell(v^*_S + \Delta_S) - \ell(v^*_S) + \lambda \big(\|v^*_S + \Delta_S\|_1 - \|v^*_S\|_1\big),$$

where $\Delta_S = v_S - v^*_S$. Note that the loss function $\tilde{f}$ is shifted such that the loss at the true parameter $v^*_S$ is $0$, i.e., $\tilde{f}(0) = 0$. Further, note that the function $\tilde{f}$ is convex and is minimized at $\hat{\Delta}_S = \hat{v}_S - v^*_S$, since $\hat{v}_S$ minimizes $f$. Therefore, clearly $\tilde{f}(\hat{\Delta}_S) \le 0$. Thus, if we can show that the function $\tilde{f}$ is strictly positive on the surface of a ball of radius $b$, then the point $\hat{\Delta}_S$ lies inside the ball, i.e., $\|\hat{v}_S - v^*_S\|_2 \le b$. Using Taylor's theorem we expand the first term of $\tilde{f}$ to get the following:

$$\tilde{f}(\Delta_S) = \nabla \ell(v^*_S)^T \Delta_S + \Delta_S^T \nabla^2 \ell(v^*_S + \theta \Delta_S) \Delta_S + \lambda \big(\|v^*_S + \Delta_S\|_1 - \|v^*_S\|_1\big) \quad (11)$$

for some $\theta \in [0, 1]$. Next, we lower bound each of the terms in (11). Using the Cauchy-Schwarz inequality, the first term in (11) is bounded as follows:

$$\big|\nabla \ell(v^*_S)^T \Delta_S\big| \le \|\nabla \ell(v^*_S)\|_\infty \|\Delta_S\|_1 \le \|\nabla \ell(v^*_S)\|_\infty \sqrt{|S|}\, \|\Delta_S\|_2 \le b \sqrt{|S|} \left( \nu + \sqrt{\frac{2}{m} \log \frac{2n}{\delta}} \right), \quad (12)$$

with probability at least $1 - \delta$ for $\delta \in [0, 1]$. It is also easy to upper bound the last term in (11), using the reverse triangle inequality: $\big|\|v^*_S + \Delta_S\|_1 - \|v^*_S\|_1\big| \le \|\Delta_S\|_1$, which then implies the following lower bound:

$$\lambda \big(\|v^*_S + \Delta_S\|_1 - \|v^*_S\|_1\big) \ge -\lambda \|\Delta_S\|_1 \ge -\lambda \sqrt{|S|}\, \|\Delta_S\|_2 = -\lambda \sqrt{|S|}\, b. \quad (13)$$

Now we turn our attention to computing a lower bound on the second term of (11), which is a bit more involved. We have

$$\Delta_S^T \nabla^2 \ell(v^*_S + \theta \Delta_S) \Delta_S \ge b^2 \min_{\theta \in [0,1]} \lambda_{\min}\big(\nabla^2 \ell(v^*_S + \theta \Delta_S)\big),$$

where

$$\min_{\theta \in [0,1]} \lambda_{\min}\big(\nabla^2 \ell(v^*_S + \theta \Delta_S)\big) = \min_{\theta \in [0,1]} \lambda_{\min}\left( \frac{1}{m} \sum_{l=1}^m \eta\big((v^*_S + \theta \Delta_S)^T z_S^{(l)}\big)\, z_S^{(l)} (z_S^{(l)})^T \right).$$

Again, using Taylor's theorem to expand the function $\eta$, we get

$$\eta\big((v^*_S + \theta \Delta_S)^T z_S^{(l)}\big) = \eta\big((v^*_S)^T z_S^{(l)}\big) + \eta'\big((v^*_S + \bar{\theta} \Delta_S)^T z_S^{(l)}\big)\, (\theta \Delta_S)^T z_S^{(l)},$$

where $\bar{\theta} \in [0, \theta]$. Therefore,

$$\min_{\theta \in [0,1]} \lambda_{\min}\big(\nabla^2 \ell(v^*_S + \theta \Delta_S)\big) \ge \lambda_{\min}(H^m_{SS}) - \max_{\theta \in [0,1]} |||A(\theta)|||_2,$$

where we have defined

$$A(\theta) \triangleq \frac{1}{m} \sum_{l=1}^m \eta'\big((v^*_S + \bar{\theta} \Delta_S)^T z_S^{(l)}\big) \times \big((\theta \Delta_S)^T z_S^{(l)}\big)\, z_S^{(l)} (z_S^{(l)})^T.$$

From Lemma 2 we have, with probability at least $1 - |S|\exp(-\tfrac{m C_{\min}}{2|S|})$, that $\lambda_{\min}(H^m_{SS}) \ge C_{\min}/2$. Next, the spectral norm of $A(\theta)$ can be bounded as follows:

$$|||A(\theta)|||_2 \le \max_{\|y\|_2 = 1} \frac{1}{m} \sum_{l=1}^m \Big|\eta'\big((v^*_S + \bar{\theta} \Delta_S)^T z_S^{(l)}\big)\Big|\, \Big|(\theta \Delta_S)^T z_S^{(l)}\Big|\; y^T \big(z_S^{(l)} (z_S^{(l)})^T\big) y < \max_{\|y\|_2 = 1} \frac{1}{10 m} \sum_{l=1}^m \|\Delta_S\|_1\, \big\|z_S^{(l)}\big\|_\infty\; y^T \big(z_S^{(l)} (z_S^{(l)})^T\big) y \le \frac{\sqrt{|S|}\, b}{10}\, \lambda_{\max}\left( \frac{1}{m} \sum_{l=1}^m z_S^{(l)} (z_S^{(l)})^T \right) \le \frac{b \sqrt{|S|}\, D_{\max}}{5},$$

where in the second step we used the facts that $|\eta'(\cdot)| < 1/10$ and $|(\theta \Delta_S)^T z_S^{(l)}| \le \|\Delta_S\|_1 \le \sqrt{|S|}\, b$, and in the last step we used Lemma 2. We further assume that $b \sqrt{|S|} D_{\max}/5 \le C_{\min}/4$ — an assumption that we verify momentarily. Having upper bounded the spectral norm of $A(\theta)$, we have

$$\min_{\theta \in [0,1]} \lambda_{\min}\big(\nabla^2 \ell(v^*_S + \theta \Delta_S)\big) \ge \frac{C_{\min}}{2} - \frac{C_{\min}}{4} = \frac{C_{\min}}{4}. \quad (14)$$

Plugging the bounds given by (12), (13) and (14) into (11) and equating to zero, we get

$$-b \sqrt{|S|} \left( \nu + \sqrt{\frac{2}{m} \log \frac{2n}{\delta}} \right) + \frac{C_{\min}}{4} b^2 - \lambda \sqrt{|S|}\, b = 0 \implies b = \frac{4 \sqrt{|S|}}{C_{\min}} \left( \lambda + \nu + \sqrt{\frac{2}{m} \log \frac{2n}{\delta}} \right).$$

Finally, coming back to our prior assumption, we have

$$\frac{b \sqrt{|S|}\, D_{\max}}{5} = \frac{4 |S| D_{\max}}{5 C_{\min}} \left( \lambda + \nu + \sqrt{\frac{2}{m} \log \frac{2n}{\delta}} \right) \le \frac{C_{\min}}{4}.$$

The above assumption holds if the regularization parameter is bounded as follows:

$$\lambda \le \frac{5 C_{\min}^2}{16 |S| D_{\max}} - \sqrt{\frac{2}{m} \log \frac{2n}{\delta}} - \nu.$$

Under this condition, $b \le \frac{4\sqrt{|S|}}{C_{\min}} \cdot \frac{5 C_{\min}^2}{16 |S| D_{\max}} = \frac{5 C_{\min}}{4 \sqrt{|S|}\, D_{\max}}$, which proves the claim. $\square$

Lemma 5. If the regularization parameter satisfies the following condition:

$$\lambda \ge 2 \left( \nu + \sqrt{\frac{2}{m} \log \frac{2n}{\delta}} \right),$$

then we have that

$$\|\hat{v} - v^*\|_1 \le \frac{5 C_{\min}}{D_{\max}}$$

with probability at least $1 - \big(\delta + |S| \exp(-\tfrac{m C_{\min}}{2|S|}) + |S| \exp(-\tfrac{m \tilde{p}_{\min}}{4|S|})\big)$.

Now we are ready to present our main result on recovering the true PSNE set.

Theorem 1. If for all players $i$, $|S_i| \le k$, where $S_i$ is the support of $v^*_i$; the minimum payoff satisfies $\rho_{\min} \ge 5 C_{\min}/D_{\max}$; and the regularization parameter and the number of samples satisfy the following conditions:

$$2\left( \nu + \sqrt{\frac{2}{m} \log \frac{6n^2}{\delta}} \right) \le \lambda \le 2K + \nu - \sqrt{\frac{2}{m} \log \frac{6n^2}{\delta}}, \quad (15)$$

$$m \ge \max\left\{ \frac{2}{K^2} \log \frac{6n^2}{\delta},\; \frac{2k}{C_{\min}} \log \frac{3kn}{\delta},\; \frac{4k}{\tilde{p}_{\min}} \log \frac{3kn}{\delta} \right\}, \quad (16)$$

where $K \triangleq \frac{5 C_{\min}^2}{32 k D_{\max}} - \nu$, then with probability at least $1 - \delta$, for $\delta \in (0, 1)$, we recover the true PSNE set, i.e., $\mathcal{NE}(\hat{\mathbf{W}}, \hat{\mathbf{b}}) = \mathcal{NE}(\mathbf{W}^*, \mathbf{b}^*)$.

Proof. From the Cauchy-Schwarz inequality and Lemma 5 we have

$$\big|(\hat{v}_i - v^*_i)^T z_i\big| \le \|\hat{v}_i - v^*_i\|_1 \|z_i\|_\infty \le \frac{5 C_{\min}}{D_{\max}}.$$

Therefore, we have that

$$(v^*_i)^T z_i - \frac{5 C_{\min}}{D_{\max}} \le \hat{v}_i^T z_i \le (v^*_i)^T z_i + \frac{5 C_{\min}}{D_{\max}}.$$

Now, if for all $x \in \mathcal{NE}^*$, $(v^*_i)^T z_i \ge 5 C_{\min}/D_{\max}$, then $\hat{v}_i^T z_i \ge 0$. Using a union-bound argument over all players $i$, we can show that the above holds, for all players, with probability at least

$$1 - n\left( \delta + k \exp\left(-\frac{m C_{\min}}{2k}\right) + k \exp\left(-\frac{m \tilde{p}_{\min}}{4k}\right) \right). \quad (17)$$

Therefore, we have that $\mathcal{NE}(\hat{\mathbf{W}}, \hat{\mathbf{b}}) = \mathcal{NE}^*$ with high probability. Finally, setting $\delta = \delta'/(3n)$, for some $\delta' \in [0, 1]$, and ensuring that the last two terms in (17) are at most $\delta'/(3n)$ each, we prove our claim. $\square$
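As a toy sanity check of Theorem 1 (not the experiments of Appendix B), one can enumerate the PSNE set of a small linear influence game by brute force via condition (2), sample joint actions under global noise, run the per-player regression of Section 4.2, and compare the two PSNE sets. The sketch below is illustrative: the coordination game, sampler simplification (noise drawn uniformly over all of $\mathcal{X}$ rather than $\mathcal{X} \setminus \mathcal{NE}^*$), and parameter choices are ours.

```python
import itertools
import numpy as np
from sklearn.linear_model import LogisticRegression

def psne(W, b):
    """Enumerate NE(W, b) by brute force using condition (2); diag(W) = 0."""
    n = len(b)
    return {x for x in itertools.product([-1, 1], repeat=n)
            if all(x[i] * (np.dot(W[i], x) - b[i]) >= 0 for i in range(n))}

def fit_game(X, lam):
    """Per-player l1-regularized logistic regression, eqs. (5)-(6)."""
    m, n = X.shape
    W, b = np.zeros((n, n)), np.zeros(n)
    for i in range(n):
        feats = np.hstack([np.delete(X, i, axis=1), np.ones((m, 1))])
        clf = LogisticRegression(penalty="l1", C=1.0 / (m * lam),
                                 fit_intercept=False, solver="liblinear")
        clf.fit(feats, X[:, i])
        v = clf.coef_.ravel()                   # v_i = (w_{-i}, -b_i)
        W[i, np.arange(n) != i] = v[:-1]
        b[i] = -v[-1]
    return W, b

# A toy coordination game whose PSNE set is {+1^n, -1^n}:
rng = np.random.default_rng(3)
n, m, q_g = 6, 3000, 0.9
W_true = 0.5 * (np.ones((n, n)) - np.eye(n))
b_true = np.zeros(n)
NE_true = list(psne(W_true, b_true))

X = rng.choice([-1, 1], size=(m, n))            # "noise" joint actions
for l in np.where(rng.random(m) < q_g)[0]:      # emit equilibria w.p. q_g, as in (3)
    X[l] = NE_true[rng.integers(len(NE_true))]

W_hat, b_hat = fit_game(X, lam=0.05)
print(set(NE_true) == psne(W_hat, b_hat))       # True when the PSNE set is recovered
```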

To better understand the implications of the theorem above, we instantiate it for the global and local noise models.

Remark 1 (Sample complexity under the global noise model). Recall that $D_{\max} \le \min(k, 2^n p_{\max})$, and that for the global noise model given by (3), $p_{\max} = q_g/|\mathcal{NE}^*|$. If $q_g$ is constant, then $D_{\max} = k$. Then $K = \Omega(1/k^2)$, and the sample complexity of learning sparse linear games grows as $O(k^4 \log n)$. However, if $q_g$ is small enough, i.e., $q_g = O(|\mathcal{NE}^*|/2^n)$, then $D_{\max}$ is no longer a function of $k$ and $K = \Omega(1/k)$. Hence, the sample complexity scales as $O(k^2 \log n)$ for exact PSNE recovery.

Next, we consider the implications of Theorem 1 under the local noise model given by (4), in the regime where the parameter $q$ scales with the number of players $n$.

Remark 2 (Sample complexity under local noise). In the local noise model, if the number of Nash equilibria is constant, then $p_{\max} = O(\exp(-n))$, and once again $D_{\max}$ becomes independent of $k$, which results in a sample complexity of $O(k^2 \log n)$.

Also, observe the dependence of the sample complexity on the minimum noise level $\tilde{p}_{\min}$: the number of samples required to recover the PSNE set increases as $\tilde{p}_{\min}$ decreases. From the aforementioned remarks we see that if the noise level is too low, i.e., $\tilde{p}_{\min} \to 0$, then the number of samples needed goes to infinity. This seems counter-intuitive — with a reduced noise level, a learning problem should become easier, not harder. To understand this seemingly counter-intuitive behavior, first observe that the constant $D_{\max}/C_{\min}$ can be thought of as the "condition number" of the loss function (6). The sample complexity given by Theorem 1 can then be written as $O\big(\frac{k^2 D_{\max}^2}{C_{\min}^2} \log n\big)$. Hence, as the noise level gets too low, the Hessian of the loss becomes ill-conditioned, since the data set now comprises many repetitions of the few joint actions that are in the PSNE set, thereby increasing the dependency ($D_{\max}$) between the actions of players in the sample data set.

4.4 Necessary Conditions

In this section we derive necessary conditions on the number of samples required to learn graphical games. Our approach for doing so is information-theoretic: we treat the inference procedure as a communication channel and then use Fano's inequality to lower bound the estimation error. Such techniques have been widely used to obtain necessary conditions for model selection in graphical models, see e.g. [13, 14], sparsity recovery in linear regression [15], and many other problems.

Consider an ensemble $\mathcal{G}_n$ of $n$-player games with the in-degree of each player being at most $k$. Nature picks a true game $\mathcal{G}^* \in \mathcal{G}_n$, and then generates a data set $\mathcal{D}$ of $m$ joint actions. A decoder is any function $\psi : \mathcal{X}^m \to \mathcal{G}_n$ that maps a data set $\mathcal{D}$ to a game $\psi(\mathcal{D})$ in $\mathcal{G}_n$. The minimum estimation error over all decoders $\psi$, for the ensemble $\mathcal{G}_n$, is then given as follows:

$$p_{\mathrm{err}} \triangleq \min_{\psi} \max_{\mathcal{G}^* \in \mathcal{G}_n} \Pr\big\{ \mathcal{NE}(\psi(\mathcal{D})) \neq \mathcal{NE}(\mathcal{G}^*) \big\}, \quad (18)$$

where the probability is computed over the data distribution. Our objective here is to compute the number of samples below which PSNE recovery fails with probability greater than $1/2$.

Theorem 2. The number of samples required to learn graphical games over $n$ players with in-degree at most $k$ is $\Omega(k \log n)$.

Remark 3. From the above theorem and from Theorem 1, we observe that the method of $\ell_1$-regularized logistic regression for learning graphical games operates close to the fundamental limit of $\Omega(k \log n)$ samples.

Results from simulation experiments for both the global and local noise models can be found in Appendix B.
tending the formalism of linear influence games to the To understand this seemingly counter-intuitive behav- structured prediction setting. Other ideas that might D be worth pursuing are: considering mixed strategies, ior, first observe that the constant max/Cmin can be thought of as the “condition number” of the loss func- correlated equilibria and epsilon Nash equilibria, and tion given by (6). Then, the sample complexity as incorporating latent or unobserved actions and vari- 2 2 k Dmax ables in the model. given by Theorem 1 can be written as C2 log n . O min ⇣ ⌘ Ghoshal, Honorio

References

[1] B. Blum, C. R. Shelton, and D. Koller, "A continuation method for Nash equilibria in structured games," Journal of Artificial Intelligence Research, vol. 25, pp. 457-502, 2006.

[2] L. E. Ortiz and M. Kearns, "Nash propagation for loopy graphical games," in Advances in Neural Information Processing Systems, 2002, pp. 793-800.

[3] D. Vickrey and D. Koller, "Multi-agent algorithms for solving graphical games," in AAAI/IAAI, 2002, pp. 345-351.

[4] M. T. Irfan and L. E. Ortiz, "On influence, stable behavior, and the most influential individuals in networks: A game-theoretic approach," Artificial Intelligence, vol. 215, pp. 79-119, 2014.

[5] O. Ben-Zwi and A. Ronen, "Local and global price of anarchy of graphical games," Theoretical Computer Science, vol. 412, no. 12, pp. 1196-1207, 2011.

[6] J. Honorio and L. Ortiz, "Learning the structure and parameters of large-population graphical games from behavioral data," Journal of Machine Learning Research, vol. 16, pp. 1157-1210, 2015.

[7] V. Garg and T. Jaakkola, "Learning tree structured potential games," in Neural Information Processing Systems, vol. 29, 2016.

[8] A. Ghoshal and J. Honorio, "From behavior to sparse graphical games: Efficient recovery of equilibria," arXiv preprint arXiv:1607.02959, 2016.

[9] P. Ravikumar, M. Wainwright, and J. Lafferty, "High-dimensional Ising model selection using L1-regularized logistic regression," The Annals of Statistics, vol. 38, no. 3, pp. 1287-1319, 2010. [Online]. Available: http://arxiv.org/abs/1010.0311

[10] M. Kearns, M. L. Littman, and S. Singh, "Graphical models for game theory," in Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., 2001, pp. 253-260.

[11] J. A. Tropp, "User-friendly tail bounds for sums of random matrices," Foundations of Computational Mathematics, vol. 12, no. 4, pp. 389-434, 2012.

[12] W. Hoeffding, "Probability inequalities for sums of bounded random variables," Journal of the American Statistical Association, vol. 58, no. 301, pp. 13-30, 1963.

[13] N. P. Santhanam and M. J. Wainwright, "Information-theoretic limits of selecting binary graphical models in high dimensions," IEEE Transactions on Information Theory, vol. 58, no. 7, pp. 4117-4134, 2012.

[14] W. Wang, M. J. Wainwright, and K. Ramchandran, "Information-theoretic bounds on model selection for Gaussian Markov random fields," in ISIT, 2010, pp. 1373-1377.

[15] M. J. Wainwright, "Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting," IEEE Transactions on Information Theory, vol. 55, no. 12, pp. 5728-5741, 2009.

[16] B. Yu, "Assouad, Fano, and Le Cam," in Festschrift for Lucien Le Cam: Research Papers in Probability and Statistics. New York, NY: Springer, 1997, pp. 423-435. [Online]. Available: http://dx.doi.org/10.1007/978-1-4612-1880-7_29