

A Technical proofs

A.1 Proof of Lemma 1.

Proof. Since $\widehat\Theta = (\widehat\Theta_1, \dots, \widehat\Theta_K)$ is the global minimum of (5), we have
\[
0 \ge f(\widehat\Theta) - f(\Theta^*) \ge \big\langle \nabla f(\Theta^*), \widehat\Theta - \Theta^* \big\rangle + \frac{\mu_\Theta}{2} \big\| \widehat\Theta - \Theta^* \big\|_F^2 .
\]
We then have
\[
\frac{\mu_\Theta}{2} \big\| \widehat\Theta - \Theta^* \big\|_F^2 \le - \big\langle \nabla f(\Theta^*), \widehat\Theta - \Theta^* \big\rangle \le \big\| \nabla f(\Theta^*) \big\|_F \cdot \big\| \widehat\Theta - \Theta^* \big\|_F ,
\]
and hence
\[
\big\| \widehat\Theta - \Theta^* \big\|_F \le \frac{2}{\mu_\Theta} \big\| \nabla f(\Theta^*) \big\|_F .
\]

A.2 Proof of Theorem 2.

Proof. We apply the non-convex optimization result in [37]. Since the initialization condition and (RSC/RSS) are satisfied for our problem according to Lemma 1, we apply Lemma 3 in [37] and obtain
\[
d^2\big(B^{(t+1)}, B^*\big) \le \xi \Big(1 - \frac{2}{5}\, \eta\, \mu_{\min}\, \sigma_{\max}^2 \Big) \cdot d^2\big(B^{(t)}, B^*\big) + \xi\, \eta\, \frac{L_\Theta + \mu_\Theta}{L_\Theta\, \mu_\Theta} \cdot e_{\mathrm{stat},\Theta}^2 , \qquad (11)
\]
where $\xi = 1 + \frac{2}{\sqrt{c} - 1}$ and $\sigma_{\max} = \max_k \|\Theta_k^*\|_2$. Defining the contraction value
\[
\beta = \xi \Big(1 - \frac{2}{5}\, \eta\, \mu_{\min}\, \sigma_{\max}^2 \Big) < 1 ,
\]
we can iteratively apply (11) for each $t = 1, 2, \dots, T$ and, summing the resulting geometric series $\sum_{t=0}^{T-1} \beta^t \le 1/(1-\beta)$, obtain
\[
d^2\big(B^{(T)}, B^*\big) \le \beta^T \cdot d^2\big(B^{(0)}, B^*\big) + \frac{\xi\, \eta}{1 - \beta} \cdot \frac{L_\Theta + \mu_\Theta}{L_\Theta\, \mu_\Theta} \cdot e_{\mathrm{stat},\Theta}^2 ,
\]
which shows linear convergence up to the statistical error.
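To make the unrolling of (11) concrete, the following minimal numeric sketch (with placeholder values for the contraction value, the error coefficient, and the initial distance, none of which come from the paper) iterates the recursion and checks it against the closed-form geometric-series bound.

```python
# Illustrative only: all constants below are placeholders, not values from the paper.
beta = 0.7        # contraction value, assumed < 1
coef = 2.0        # stands in for xi*eta*(L_Theta + mu_Theta)/(L_Theta*mu_Theta)
e2_stat = 1e-3    # statistical error e^2_{stat,Theta}
d2_0 = 1.0        # initial squared distance d^2(B^(0), B*)

d2 = d2_0
for t in range(1, 51):
    d2 = beta * d2 + coef * e2_stat                          # one application of (11)
    bound = beta ** t * d2_0 + coef * e2_stat / (1 - beta)   # unrolled bound
    assert d2 <= bound + 1e-12                               # iterate never exceeds the bound

print(f"after 50 steps: d2 = {d2:.6f}, error floor = {coef * e2_stat / (1 - beta):.6f}")
```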

A.3 Proof of Lemma 5.

Proof. Since $\widetilde X$ is the best rank-$K$ approximation of $\bar X$, and $\bar X^*$ is also of rank $K$, we have $\|\widetilde X - \bar X\|_F \le \|\bar X^* - \bar X\|_F$ and hence
\[
\big\| \widetilde X - \bar X^* \big\|_F \le \big\| \widetilde X - \bar X \big\|_F + \big\| \bar X^* - \bar X \big\|_F \le 2 \big\| \bar X^* - \bar X \big\|_F = 2 \big\| \bar E \big\|_F . \qquad (12)
\]
By definition we have
\[
\bar X^* = \frac{1}{n} \sum_{k=1}^K \sum_{i=1}^n m_{ik}^* \Theta_k^* = B_1^* A^* B_2^{*\top} = Q_1 R_1 A^* R_2^\top Q_2^\top = Q_1 \big( A_{\mathrm{diag}} + A_{\mathrm{off}} \big) Q_2^\top .
\]
Plugging back into (12) we obtain
\[
\big\| \widetilde X - Q_1 \big( A_{\mathrm{diag}} + A_{\mathrm{off}} \big) Q_2^\top \big\|_F \le 2 \big\| \bar E \big\|_F ,
\]
and hence
\[
\Big\| \sum_{k=1}^K \widetilde\sigma_k \widetilde u_k \widetilde v_k^\top - Q_1 A_{\mathrm{diag}} Q_2^\top \Big\|_F \le 2 \big\| \bar E \big\|_F + \big\| Q_1 A_{\mathrm{off}} Q_2^\top \big\|_F \le 2 \big\| \bar E \big\|_F + \rho_0 . \qquad (13)
\]
Under mild conditions we have $\|\bar E\|_F \lesssim n^{-1/2}$, so this term can be made arbitrarily small with large enough $n$. Moreover, the left hand side of (13) is the difference of two singular value decompositions. According to matrix perturbation theory, for each $k$ we have (up to permutation)
\[
\big\| \widetilde\sigma_k \widetilde u_k \widetilde v_k^\top - q_{1,k} \cdot a_{\mathrm{diag},k} \cdot q_{2,k}^\top \big\|_F \le 2 C \rho_0 ,
\]
and hence
\[
\Big\| \widetilde\sigma_k \widetilde u_k \widetilde v_k^\top - \frac{1}{n} \sum_{i=1}^n m_{ik}^* \, \Theta_k^* \Big\|_F \le 2 C \rho_0 .
\]
Finally we obtain
\[
\big\| K \widetilde\sigma_k \widetilde u_k \widetilde v_k^\top - \Theta_k^* \big\|_F = K \cdot \Big\| \widetilde\sigma_k \widetilde u_k \widetilde v_k^\top - \frac{1}{K} \Theta_k^* \Big\|_F \le K \cdot \Big( 2 C \rho_0 + \Big| \frac{1}{n} \sum_{i=1}^n m_{ik}^* - \frac{1}{K} \Big| \cdot \big\| \Theta_k^* \big\|_F \Big) \le 2 C K \rho_0 + (\eta - 1)\, \sigma_{\max} .
\]

A.4 Proof of Theorem 6

We analyze the two estimation steps in Algorithm 2.

Update on $B_1$ and $B_2$. The update on $B_1$ and $B_2$ is the same as in the case with known $M$. Besides the statistical error defined in (7), we now have an additional error term due to the error in $M$. Recall that $d^2(M, M^*) = \frac{1}{n} \sum_{i=1}^n \sum_{k'=1}^K (m_{ik'} - m_{ik'}^*)^2$. Lemma 7 quantifies the effect of one estimation step on $B$.

Lemma 7. Suppose the conditions in Theorem 2 hold and suppose conditions (DC) and (OC) hold. Then we have
\[
d^2\big(B^{[t]}, B^*\big) \le C_1 \cdot e_{\mathrm{stat},\Theta}^2 + \gamma_1 \cdot d^2\big(M^{[t]}, M^*\big) ,
\]
for some constants $C_1$ and $\gamma_1$.

Update on $M$. Lemma 8 quantifies the effect of one estimation step on $M$.

Lemma 8. Suppose condition (TC) holds. Then we have
\[
d^2\big(M^{[t]}, M^*\big) \le C_2 \cdot e_{\mathrm{stat},M}^2 + \gamma_2 \cdot d^2\big(B^{[t]}, B^*\big) ,
\]
for some constants $C_2$ and $\gamma_2$.

Denote $\gamma_0 = \min\{\gamma_1, \gamma_2\}$. As long as the signal $\sigma_{\max}$ is small and the noise $E_i$ is small enough, we can guarantee that $\gamma_0 < 1$. Combining Lemma 7 and Lemma 8 completes the proof.
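For intuition about the two alternating estimation steps analyzed above, here is a minimal Python sketch of one possible implementation under a squared-loss model $X_i \approx \sum_k m_{ik} b_k^1 b_k^{2\top} + E_i$: a gradient step on $B_1, B_2$ with $M$ fixed, followed by a separable least-squares update of each $m_i$ with $\Theta$ fixed. The function name, dimensions, step size, and iteration count are illustrative assumptions, not the paper's exact Algorithm 2.

```python
import numpy as np

def alternating_estimation(X, B1, B2, M, step=0.01, n_iter=50):
    """X: (n, d1, d2) observations; B1: (d1, K); B2: (d2, K); M: (n, K)."""
    n, d1, d2 = X.shape
    K = B1.shape[1]
    for _ in range(n_iter):
        # Step 1: update B1, B2 with M fixed (one gradient step on the squared loss).
        G1, G2 = np.zeros_like(B1), np.zeros_like(B2)
        for i in range(n):
            R = sum(M[i, k] * np.outer(B1[:, k], B2[:, k]) for k in range(K)) - X[i]
            for k in range(K):
                G1[:, k] += (2.0 / n) * M[i, k] * (R @ B2[:, k])
                G2[:, k] += (2.0 / n) * M[i, k] * (R.T @ B1[:, k])
        B1 -= step * G1
        B2 -= step * G2
        # Step 2: update M with Theta fixed; the problem is separable over
        # observations, so each m_i solves a small K-dimensional least squares.
        Theta = np.stack([np.outer(B1[:, k], B2[:, k]) for k in range(K)])  # (K, d1, d2)
        A = Theta.reshape(K, -1).T                                          # (d1*d2, K)
        for i in range(n):
            M[i], *_ = np.linalg.lstsq(A, X[i].ravel(), rcond=None)
    return B1, B2, M
```

In the paper's setting the $m_i$ update would additionally respect any constraints on the topic weights (e.g., nonnegativity); the unconstrained least-squares step above is only meant to illustrate the separability used in the proof of Lemma 8.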

A.5 Proof of Lemma 7.

Proof. The analysis is exactly the same as in the case where $M$ is known, except that the statistical error is different. Specifically, for each $k$ we have
\[
\nabla_{\Theta_k} f(\Theta^*, M)
= \frac{1}{n} \sum_{i=1}^n \Big( X_i - \sum_{k'=1}^K m_{ik'} \Theta_{k'}^* \Big) \cdot m_{ik}
= \frac{1}{n} \sum_{i=1}^n \Big( E_i + \sum_{k'=1}^K (m_{ik'}^* - m_{ik'}) \Theta_{k'}^* \Big) \cdot m_{ik}
\]
\[
= \underbrace{\frac{1}{n} \sum_{i=1}^n E_i \, m_{ik}^*}_{R_1}
+ \underbrace{\frac{1}{n} \sum_{i=1}^n E_i \, (m_{ik} - m_{ik}^*)}_{R_2}
+ \underbrace{\frac{1}{n} \sum_{i=1}^n \sum_{k'=1}^K (m_{ik'}^* - m_{ik'}) \Theta_{k'}^* \cdot m_{ik}}_{R_3} .
\]
The first term $R_1$ is just the usual statistical error term on $\Theta$. For the term $R_2$, denote $e_0^2 = \frac{1}{n} \sum_{i=1}^n \|E_i\|_F^2$; we have
\[
\| R_2 \|_F^2 \le \frac{1}{n^2} \Big( \sum_{i=1}^n \| E_i \|_F^2 \Big) \cdot \sum_{i=1}^n (m_{ik} - m_{ik}^*)^2 = \frac{e_0^2}{n} \sum_{i=1}^n (m_{ik} - m_{ik}^*)^2 .
\]
For the term $R_3$, we have
\[
\| R_3 \|_F^2 \le \frac{1}{n^2} \Big( \sum_{i=1}^n m_{ik}^2 \Big) \cdot \sum_{i=1}^n \Big\| \sum_{k'=1}^K (m_{ik'}^* - m_{ik'}) \Theta_{k'}^* \Big\|_F^2
\le \frac{K \sigma_{\max}^2}{n^2} \Big( \sum_{i=1}^n m_{ik}^2 \Big) \cdot \sum_{i=1}^n \sum_{k'=1}^K (m_{ik'} - m_{ik'}^*)^2 .
\]
Taking the summation over all $k$, the first term $R_1$ gives the statistical error as before, and the terms $R_2$ and $R_3$ give
\[
\sum_{k=1}^K \big( \| R_2 \|_F^2 + \| R_3 \|_F^2 \big) \le \frac{e_0^2 + K \sigma_{\max}^2}{n} \sum_{i=1}^n \sum_{k=1}^K (m_{ik} - m_{ik}^*)^2 .
\]

A.6 Proof of Lemma 8.

Proof. The estimation of $M$ is separable over the $m_i$. Denote the objective function on observation $i$ as
\[
f_i(\Theta, m_i) = \Big\| X_i - \sum_{k=1}^K m_{ik} \cdot \Theta_k \Big\|_F^2 . \qquad (14)
\]
According to condition (DC), the objective function (14) is $\mu_M$-strongly convex in $m_i$. Similar to the proof of Lemma 1, we obtain
\[
\sum_{k=1}^K (m_{ik} - m_{ik}^*)^2 \le \frac{4}{\mu_M^2} \big\| \nabla_{m_i} f_i(\Theta, m_i^*) \big\|_F^2 = \frac{4}{\mu_M^2} \sum_{k=1}^K \big[ \nabla_{m_{ik}} f_i(\Theta, m_i^*) \big]^2 .
\]
Moreover, we have
\[
\nabla_{m_{ik}} f_i(\Theta, m_i^*) = \Big\langle X_i - \sum_{k'=1}^K m_{ik'}^* \Theta_{k'} , \, \Theta_k \Big\rangle
= \underbrace{\big\langle E_i, \Theta_k^* \big\rangle}_{T_1}
+ \underbrace{\big\langle E_i, \Theta_k - \Theta_k^* \big\rangle}_{T_2}
+ \underbrace{\Big\langle \sum_{k'=1}^K m_{ik'}^* (\Theta_{k'}^* - \Theta_{k'}) , \, \Theta_k \Big\rangle}_{T_3} .
\]
The first term $T_1$ is just the usual statistical error term on $M$. For the term $T_2$, we have
\[
(T_2)^2 \le \| E_i \|_F^2 \cdot \| \Theta_k^* - \Theta_k \|_F^2 ,
\qquad
\sum_{i=1}^n \sum_{k=1}^K (T_2)^2 \le \Big( \sum_{i=1}^n \| E_i \|_F^2 \Big) \cdot \Big( \sum_{k=1}^K \| \Theta_k^* - \Theta_k \|_F^2 \Big) . \qquad (15)
\]

For the term $T_3$ we have
\[
\sum_{k=1}^K (T_3)^2 \le \Big( \sum_{k=1}^K \| \Theta_k \|_F^2 \Big) \Big( \sum_{k'=1}^K (m_{ik'}^*)^2 \Big) \sum_{k'=1}^K \| \Theta_{k'} - \Theta_{k'}^* \|_F^2
\le K \sigma_{\max}^2 \Big( \sum_{k'=1}^K (m_{ik'}^*)^2 \Big) \Big( \sum_{k=1}^K \| \Theta_k^* - \Theta_k \|_F^2 \Big)
\le K \sigma_{\max}^2 \sum_{k=1}^K \| \Theta_k^* - \Theta_k \|_F^2 . \qquad (16)
\]
Moreover, we have
\[
\| \Theta_k^* - \Theta_k \|_F = \big\| b_k^{1*} b_k^{2*\top} - b_k^1 b_k^{2\top} \big\|_F
\le \| b_k^1 \|_2 \, \| b_k^{2*} - b_k^2 \|_2 + \| b_k^{2*} \|_2 \, \| b_k^{1*} - b_k^1 \|_2
\le 2 \sigma_{\max} \big( \| b_k^{2*} - b_k^2 \|_2 + \| b_k^{1*} - b_k^1 \|_2 \big) ,
\]
and hence
\[
\sum_{k=1}^K \| \Theta_k^* - \Theta_k \|_F^2
\le 4 \sigma_{\max}^2 \sum_{k=1}^K \big( \| b_k^{2*} - b_k^2 \|_2 + \| b_k^{1*} - b_k^1 \|_2 \big)^2
\le 8 \sigma_{\max}^2 \, d^2(B, B^*) .
\]
Combining (15) and (16) and taking the summation over $i$, we obtain
\[
\frac{1}{n} \sum_{i=1}^n \sum_{k=1}^K \big[ (T_2)^2 + (T_3)^2 \big]
\le \big( e_0^2 + K \sigma_{\max}^2 \big) \cdot \sum_{k=1}^K \| \Theta_k^* - \Theta_k \|_F^2
\le 8 \sigma_{\max}^2 \big( e_0^2 + K \sigma_{\max}^2 \big) \cdot d^2(B, B^*) .
\]
Combining this with the strong convexity bound above and the statistical error from $T_1$ yields the claim.

B Detailed rationale on initialization and condition (OC) for jointly learning topics

Initialization. Define $\bar X$, $\bar X^*$, $\bar E$ as the sample means of $X_i$, $X_i^*$, $E_i$, respectively. We have
\[
\bar X = \frac{1}{n} \sum_{i=1}^n X_i = \frac{1}{n} \sum_{i=1}^n X_i^* + \frac{1}{n} \sum_{i=1}^n E_i = \frac{1}{n} \sum_{i=1}^n \sum_{k=1}^K m_{ik}^* \Theta_k^* + \bar E = \sum_{k=1}^K \Big( \frac{1}{n} \sum_{i=1}^n m_{ik}^* \Big) \Theta_k^* + \bar E = \bar X^* + \bar E .
\]
We then take the rank-$K$ SVD of $\bar X$ and obtain $[\widetilde U, \widetilde S, \widetilde V] = \text{rank-}K\ \text{SVD of}\ \bar X$. We denote $\widetilde X = \widetilde U \widetilde S \widetilde V^\top = \sum_{k=1}^K \widetilde\sigma_k \widetilde u_k \widetilde v_k^\top$. It is well known that $\widetilde X$ is the best rank-$K$ approximation of $\bar X$. We propose the following initialization method:
\[
\Theta_k^{(0)} = K \, \widetilde\sigma_k \widetilde u_k \widetilde v_k^\top .
\]
To see why this initialization works, we first build intuition in the easiest case, where $E_i = 0$ for each $i$, $\frac{1}{n} \sum_{i=1}^n m_{ik}^* = \frac{1}{K}$ for each $k$, and the columns of $B_1^*$ and $B_2^*$ are orthogonal. In this case it is easy to see that $\bar X = \bar X^* = \sum_{k=1}^K \frac{1}{K} \Theta_k^*$. Note that this expression is a singular value decomposition of $\bar X^*$, since we have $\Theta_k^* = b_k^{1*} b_k^{2*\top}$ and the columns $\{ b_k^{1*} \}_{k=1}^K$ and the columns $\{ b_k^{2*} \}_{k=1}^K$ are orthogonal. Now that $\bar X$ is exactly of rank $K$, its best rank-$K$ approximation is $\bar X$ itself, i.e., $\widetilde X = \bar X = \sum_{k=1}^K \widetilde\sigma_k \widetilde u_k \widetilde v_k^\top$. By the uniqueness of the singular value decomposition, as long as the singular values are distinct, we have (up to permutation) $\frac{1}{K} \Theta_k^* = \widetilde\sigma_k \widetilde u_k \widetilde v_k^\top$ and therefore $\Theta_k^* = K \widetilde\sigma_k \widetilde u_k \widetilde v_k^\top$. This is exactly what we want to estimate.

With this intuition in mind, we relax these restrictions and impose the following condition.

Orthogonal Condition (OC). Let $B_1^* = Q_1 R_1$ and $B_2^* = Q_2 R_2$ be the QR decompositions of $B_1^*$ and $B_2^*$, respectively. Denote by $A^*$ the diagonal matrix with diagonal elements $\frac{1}{n} \sum_{i=1}^n m_{ik}^*$. Write $R_1 A^* R_2^\top = A_{\mathrm{diag}} + A_{\mathrm{off}}$, where $A_{\mathrm{diag}}$ captures the diagonal elements and $A_{\mathrm{off}}$ captures the off-diagonal elements. We require that $\| A_{\mathrm{off}} \|_F \le \rho_0$ for some constant $\rho_0$. Moreover, we require that $\frac{1}{n} \sum_{i=1}^n m_{ik}^* \le \eta / K$ for some $\eta$.

This condition requires that $B_1^*$ and $B_2^*$ are not too far from orthogonal matrices, so that after the QR rotation the off-diagonal values of $R_1$ and $R_2$ are not too large. The condition $\frac{1}{n} \sum_{i=1}^n m_{ik}^* \le \eta / K$ is trivially satisfied with $\eta = K$. However, in general $\eta$ is usually a constant that does not scale with $K$, meaning that the topic distribution among the $n$ observations is closer to being evenly distributed than to having a few topics dominate. Finally, note that condition (OC) is for this specific initialization method only. Since we perform a singular value decomposition, we end up with orthogonal vectors, so we require that $B_1^*$ and $B_2^*$ are not too far from orthogonal; since we do not know the value of $\frac{1}{n} \sum_{i=1}^n m_{ik}^*$ and use $1/K$ to approximate it, we require that the topics are not far from evenly distributed so that this approximation is reasonable. In practice we can also initialize using other methods; for example, we can run alternating gradient descent on $B_1$, $B_2$ and $M$ based on the objective function (10). This method also works reasonably well in practice.
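As a concrete illustration of the initialization described above, the following minimal Python sketch forms the sample mean, takes its rank-$K$ SVD, and sets $\Theta_k^{(0)} = K \widetilde\sigma_k \widetilde u_k \widetilde v_k^\top$. The function name and the balanced splitting of each rank-one $\Theta_k^{(0)}$ into columns of $B_1^{(0)}$ and $B_2^{(0)}$ are illustrative choices, not prescribed by the paper.

```python
import numpy as np

def svd_initialization(X, K):
    """X: (n, d1, d2) array of observations; assumes K <= min(d1, d2)."""
    X_bar = X.mean(axis=0)                          # sample mean of the observations
    U, s, Vt = np.linalg.svd(X_bar, full_matrices=False)
    # Theta_k^(0) = K * sigma_k * u_k v_k^T for the top-K singular triplets.
    Theta0 = np.stack([K * s[k] * np.outer(U[:, k], Vt[k]) for k in range(K)])
    # One (balanced) way to split each rank-one Theta_k^(0) into influence and
    # receptivity columns; sign constraints, if any, are ignored in this sketch.
    B1_0 = np.column_stack([np.sqrt(K * s[k]) * U[:, k] for k in range(K)])
    B2_0 = np.column_stack([np.sqrt(K * s[k]) * Vt[k] for k in range(K)])
    return Theta0, B1_0, B2_0
```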


C Detailed node-topic matrices for the citation dataset

The two detailed node-topic matrices for the citation dataset are given in Table 4 and Table 5.

D Additional figures

Figure 3 and Figure 4 show the comparison results for binary observations in Section 6, with known topics and unknown topics, respectively.

[Figure 3: Prediction error for binary observation, with known topics. The plot shows prediction error versus #samples for the methods "One matrix", "K matrices", and "Ours".]

[Figure 4: Prediction error for binary observation, with unknown topics. The plot shows prediction error versus #samples for the methods "One matrix", "K matrices", and "Ours".]

Table 4: The influence matrix $B_1$ for the citation dataset

black quantum gauge algebra states hole model theory space space theory energy field field group noncommutative chains theory effective structure boundary supersymmetric Christopher Pope 0.359 0.468 0.318 Arkady Tseytlin 0.223 0.565 0.25 Emilio Elizalde 0.109 0.85 0.623 0.679 0.513 0.204 0.795 0.678 1.87 Ashok Das 0.155 0.115 1.07 Sergei Odintsov Sergio Ferrara 0.297 0.889 0.345 0.457 0.453 0.249 0.44 0.512 0.326 0.382 Mirjam Cvetic 0.339 0.173 0.338 Burt A. Ovrut 0.265 0.191 0.127 0.328 0.133 Ergin Sezgin 0.35 0.286 Ian I. Kogan 0.193 Gregory Moore 0.323 0.91 0.325 0.536 I. Antoniadis 0.443 0.485 0.545 0.898 0.342 Mirjam Cvetic 0.152 0.691 0.228 0.187 0.207 0.374 0.467 1.15 Barton Zwiebach 0.16 0.222 0.383 0.236 P.K. Townsend 0.629 0.349 0.1 Robert C. Myers 0.439 0.28 E. Bergshoeff0.357 0.371 Amihay Hanany 0.193 0.327 1.09 0.319 0.523 0.571 Learning Influence-Receptivity Network Structure with Guarantee

Table 5: The receptivity matrix $B_2$ for the citation dataset

black quantum gauge algebra states string hole model theory space space theory energy field field group noncommutative supergravity chains theory effective structure boundary supersymmetric Christopher Pope 0.477 0.794 0.59 Arkady Tseytlin 0.704 1.16 0.312 0.487 0.119 Emilio Elizalde Cumrun Vafa 0.309 0.428 0.844 0.203 0.693 Edward Witten 0.352 0.554 0.585 0.213 0.567 Ashok Das 0.494 0.339 0.172 Sergei Odintsov 0.472 Sergio Ferrara 0.423 0.59 0.664 0.776 Renata Kallosh 0.123 0.625 0.638 0.484 0.347 Mirjam Cvetic 0.47 0.731 0.309 Burt A. Ovrut 0.314 0.217 0.72 0.409 0.137 Ergin Sezgin 0.108 0.161 0.358 Ian I. Kogan 0.357 0.382 0.546 Gregory Moore 0.375 0.178 0.721 0.69 0.455 0.517 I. Antoniadis 0.461 0.699 0.532 0.189 Mirjam Cvetic 0.409 1.11 0.173 0.361 Andrew Strominger 0.718 0.248 0.196 0.133 Barton Zwiebach 0.308 0.204 0.356 P.K. Townsend 0.337 0.225 0.245 0.522 Robert C. Myers 0.364 0.956 0.545 0.139 E. Bergshoeff0.487 0.459 0.174 0.619 Amihay Hanany 0.282 0.237 0.575 0.732 Ashoke Sen 0.214 0.18 0.37