Lecture 12: Concentration for sums of random matrices CSE 525, Spring 2019 Instructor: James R. Lee

1 Random matrices

Consider the graph sparsification problem: Given a graph $G = (V, E)$, we want to approximate $G$ (in a sense to be defined later) by a sparse graph $H = (V, E')$. Generally we would like that $E' \subseteq E$ and moreover $|E'|$ is as small as possible, say $O(n)$ or $O(n \log n)$, where $n = |V|$. We will be able to do this by choosing a (nonuniform) random sample of the edges, but to analyze such a process, we will need a large-deviation inequality for sums of random matrices.

1.1 Symmetric matrices

If $A$ is a $d \times d$ real symmetric matrix, then $A$ has all real eigenvalues, which we can order $\lambda_1(A) \geq \lambda_2(A) \geq \cdots \geq \lambda_d(A)$. The operator norm of $A$ is

$$\|A\| := \max_{\|x\|_2 = 1} \|Ax\|_2 = \max\{|\lambda_i(A)| : i \in \{1, \ldots, d\}\}.$$

The trace of $A$ is $\operatorname{Tr}(A) = \sum_{i=1}^d A_{ii} = \sum_{i=1}^d \lambda_i(A)$. The trace norm of $A$ is $\|A\|_* = \sum_{i=1}^d |\lambda_i(A)|$. $A$ is positive semidefinite (PSD) if all its eigenvalues are nonnegative. Note that for a PSD matrix $A$, we have $\operatorname{Tr}(A) = \|A\|_*$. We also recall the matrix exponential $e^A = \sum_{k=0}^{\infty} \frac{A^k}{k!}$, which is well-defined for all real symmetric $A$ and is itself also a real symmetric matrix.

If $A$ is symmetric, then $e^A$ is always PSD, as the next argument shows. Every real symmetric matrix can be diagonalized, writing $A = U^T D U$, where $U$ is an orthogonal matrix, i.e. $U U^T = U^T U = I$, and $D$ is diagonal. One can easily check that $A^k = U^T D^k U$ for any $k \in \mathbb{N}$, thus $A^k$ and $A$ are simultaneously diagonalizable. It follows that $A$ and $e^A$ are simultaneously diagonalizable. In particular, we have $\lambda_i(e^A) = e^{\lambda_i(A)}$.

Finally, note that for symmetric matrices $A$ and $B$, we have $|\operatorname{Tr}(AB)| \leq \|A\| \cdot \|B\|_*$. To see this, let $\{u_i\}$ be an orthonormal basis of eigenvectors of $B$ with $B u_i = \lambda_i(B)\, u_i$. Then

$$|\operatorname{Tr}(AB)| = \left|\sum_{i=1}^d \langle u_i, A B u_i \rangle\right| = \left|\sum_{i=1}^d \lambda_i(B) \langle u_i, A u_i \rangle\right| \leq \sum_{i=1}^d |\lambda_i(B)| \cdot \|A\| = \|B\|_* \, \|A\|.$$
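These facts are easy to sanity-check numerically. The following is a minimal sketch (not part of the notes), assuming numpy and scipy are available; the random matrices and all variable names are purely illustrative.

```python
import numpy as np
from scipy.linalg import expm  # matrix exponential

rng = np.random.default_rng(0)
d = 5

# Random real symmetric matrices A and B.
M = rng.standard_normal((d, d)); A = (M + M.T) / 2
N = rng.standard_normal((d, d)); B = (N + N.T) / 2

# Operator norm = largest |eigenvalue|; trace norm = sum of |eigenvalues|.
eig_A = np.linalg.eigvalsh(A)
assert np.isclose(np.linalg.norm(A, 2), np.max(np.abs(eig_A)))
trace_norm_B = np.sum(np.abs(np.linalg.eigvalsh(B)))

# e^A is PSD and lambda_i(e^A) = exp(lambda_i(A)).
eig_expA = np.linalg.eigvalsh(expm(A))
assert np.all(eig_expA > 0)
assert np.allclose(eig_expA, np.exp(eig_A))

# |Tr(AB)| <= ||A|| * ||B||_*.
assert abs(np.trace(A @ B)) <= np.linalg.norm(A, 2) * trace_norm_B + 1e-9
```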

Many classical statements are either false or significantly more difficult to prove when translated to the matrix setting. For instance, while $e^{x+y} = e^x e^y = e^y e^x$ is true for arbitrary real numbers $x$ and $y$, it is only the case that $e^{A+B} = e^A e^B$ if $A$ and $B$ are simultaneously diagonalizable. However, somewhat remarkably, a matrix analog does hold, as an inequality, once we take the trace.
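As a quick numerical illustration of this point (a sketch not from the notes, again assuming numpy and scipy), one can check on random symmetric matrices that $e^{A+B}$ and $e^A e^B$ generally differ, while the trace inequality of Theorem 1.1 below still holds:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)
d = 4
M = rng.standard_normal((d, d)); A = (M + M.T) / 2
N = rng.standard_normal((d, d)); B = (N + N.T) / 2

# For non-commuting A and B, e^{A+B} and e^A e^B generally differ...
print(np.allclose(expm(A + B), expm(A) @ expm(B)))           # typically False

# ...but Tr(e^{A+B}) <= Tr(e^A e^B) still holds (Golden-Thompson).
print(np.trace(expm(A + B)) <= np.trace(expm(A) @ expm(B)))  # True
```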

Theorem 1.1 (Golden-Thompson inequality). If $A$ and $B$ are real symmetric matrices, then

$$\operatorname{Tr}(e^{A+B}) \leq \operatorname{Tr}(e^A e^B).$$

Proof. We can prove this using the non-commutative Hölder inequality: For any even integer $p \geq 2$ and real symmetric matrices $A_1, A_2, \ldots, A_p$:

$$\operatorname{Tr}(A_1 A_2 \cdots A_p) \leq \|A_1\|_{S_p} \|A_2\|_{S_p} \cdots \|A_p\|_{S_p},$$

where $\|A\|_{S_p} = \operatorname{Tr}\big((A^T A)^{p/2}\big)^{1/p}$ is the Schatten $p$-norm. Consider real symmetric matrices $U, V$. Applying this with $A_1 = \cdots = A_p = UV$ gives, for every even $p \geq 2$:
$$\operatorname{Tr}\big((UV)^p\big) \leq \|UV\|_{S_p}^p = \operatorname{Tr}\big((V^T U^T U V)^{p/2}\big) = \operatorname{Tr}\big((V U^2 V)^{p/2}\big) = \operatorname{Tr}\big((U^2 V^2)^{p/2}\big),$$
where the last equality uses the cyclic property of the trace. Applying this inequality repeatedly (to $U^2$ and $V^2$ with exponent $p/2$, and so on) now yields, for every even $p$,
$$\operatorname{Tr}\big((UV)^p\big) \leq \operatorname{Tr}(U^p V^p).$$
If we now take $U = e^{A/p}$ and $V = e^{B/p}$, this gives

$$\operatorname{Tr}\big((e^{A/p} e^{B/p})^p\big) \leq \operatorname{Tr}(e^A e^B). \tag{1.1}$$
For $p$ large, we can use the Taylor approximation $e^{A/p} = I + A/p + O(1/p^2)$ and similarly for $e^{B/p}$. Thus $e^{A/p} e^{B/p} = e^{(A+B)/p} + O(1/p^2)$, and hence $(e^{A/p} e^{B/p})^p \to e^{A+B}$. Therefore taking $p \to \infty$ in (1.1) gives
$$\operatorname{Tr}(e^{A+B}) \leq \operatorname{Tr}(e^A e^B). \qquad \square$$

1.2 The Laplace transform for matrices

We will now consider a random $d \times d$ real matrix $X$. The entries $(X_{ij})$ of $X$ are all (not necessarily independent) random variables. We have seen inequalities (like those named after Chernoff and Azuma) which assert that if $X = X_1 + X_2 + \cdots + X_n$ is a sum of independent random numbers, then $X$ is tightly concentrated around its mean. Our goal now is to prove a similar fact for sums of independent random symmetric matrices.

First, observe that the trace is a linear operator; this is easy to see from the fact that it is the sum of the diagonal entries of its argument. If $A$ and $B$ are arbitrary real matrices, then $\operatorname{Tr}(A + B) = \operatorname{Tr}(A) + \operatorname{Tr}(B)$. This implies that if $X$ is a random matrix, then $\mathbb{E}[\operatorname{Tr}(X)] = \operatorname{Tr}(\mathbb{E}[X])$. Note that $\mathbb{E}[X]$ is the matrix defined by $(\mathbb{E}[X])_{ij} = \mathbb{E}[X_{ij}]$.

Suppose that $X_1, X_2, \ldots, X_n$ are independent random real symmetric matrices. Let $X = X_1 + X_2 + \cdots + X_n$. Let $S_k = X_1 + \cdots + X_k$ be the partial sum of the first $k$ terms, so that $X = S_n$. Our first goal will be to bound the probability that $X$ has an eigenvalue bigger than $t$. To do this, we will try to extend the method of exponential moments to work with symmetric matrices, as discovered by Ahlswede and Winter. It is much simpler than previous approaches that only worked for special cases.

Note that for $\beta > 0$, we have $\lambda_i(e^{\beta X}) = e^{\beta \lambda_i(X)}$. Therefore:
$$\mathbb{P}\Big[\max_i \lambda_i(X) > t\Big] = \mathbb{P}\Big[\max_i \lambda_i(e^{\beta X}) > e^{\beta t}\Big] \leq \mathbb{P}\big[\operatorname{Tr}(e^{\beta X}) > e^{\beta t}\big], \tag{1.2}$$
where the last inequality uses the fact that all the eigenvalues of $e^{\beta X}$ are nonnegative, hence $\operatorname{Tr}(e^{\beta X}) = \sum_i \lambda_i(e^{\beta X}) \geq \max_i \lambda_i(e^{\beta X})$.

Now Markov's inequality implies that

$$\mathbb{P}\big[\operatorname{Tr}(e^{\beta X}) > e^{\beta t}\big] \leq \frac{\mathbb{E}[\operatorname{Tr}(e^{\beta X})]}{e^{\beta t}}. \tag{1.3}$$
As in our earlier uses of the Laplace transform, our goal is now to bound $\mathbb{E}[\operatorname{Tr}(e^{\beta X})]$ by a product that has one factor for each term $X_i$.

In the matrix setting, this is more subtle: Using Theorem 1.1,

$$\mathbb{E}[\operatorname{Tr}(e^{\beta X})] = \mathbb{E}\big[\operatorname{Tr}(e^{\beta (S_{n-1} + X_n)})\big] \leq \mathbb{E}\big[\operatorname{Tr}(e^{\beta S_{n-1}} e^{\beta X_n})\big].$$

Now we push the expectation over $X_n$ inside the trace:

$$\mathbb{E}\big[\operatorname{Tr}(e^{\beta S_{n-1}} e^{\beta X_n})\big] = \mathbb{E}\Big[\operatorname{Tr}\big(e^{\beta S_{n-1}}\, \mathbb{E}[e^{\beta X_n} \mid X_1, \ldots, X_{n-1}]\big)\Big] = \mathbb{E}\Big[\operatorname{Tr}\big(e^{\beta S_{n-1}}\, \mathbb{E}[e^{\beta X_n}]\big)\Big],$$
and we have used independence to pull $e^{\beta S_{n-1}}$ outside the expectation and then to remove the conditioning. Finally, we use the fact that $\operatorname{Tr}(AB) \leq \|A\| \cdot \|B\|_*$ and $\|B\|_* = \operatorname{Tr}(B)$ when $B$ has all nonnegative eigenvalues (as is the case for $e^{\beta S_{n-1}}$):

$$\mathbb{E}\Big[\operatorname{Tr}\big(e^{\beta S_{n-1}}\, \mathbb{E}[e^{\beta X_n}]\big)\Big] \leq \big\|\mathbb{E}[e^{\beta X_n}]\big\| \cdot \mathbb{E}\big[\operatorname{Tr}(e^{\beta S_{n-1}})\big].$$
Doing this $n$ times yields

$$\mathbb{E}[\operatorname{Tr}(e^{\beta X})] \leq \operatorname{Tr}(I) \prod_{i=1}^n \big\|\mathbb{E}[e^{\beta X_i}]\big\| = d \prod_{i=1}^n \big\|\mathbb{E}[e^{\beta X_i}]\big\|.$$
Combining this with (1.2) and (1.3) yields

$$\mathbb{P}\Big[\max_i \lambda_i(X) > t\Big] \leq e^{-\beta t}\, d \prod_{i=1}^n \big\|\mathbb{E}[e^{\beta X_i}]\big\|.$$
We can also apply this to $-X$ to get
$$\mathbb{P}\big[\|X\| > t\big] \leq e^{-\beta t}\, d \left( \prod_{i=1}^n \big\|\mathbb{E}[e^{\beta X_i}]\big\| + \prod_{i=1}^n \big\|\mathbb{E}[e^{-\beta X_i}]\big\| \right). \tag{1.4}$$
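To see how (1.4) can be used, here is a minimal numerical sketch (not from the notes; the toy distribution and all names are illustrative). The $X_i$ are diagonal matrices of independent random signs, so $\|\mathbb{E}[e^{\pm\beta X_i}]\| = \cosh(\beta)$ exactly, and the right-hand side of (1.4) can be compared against an empirical estimate of the tail probability:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, beta, t = 10, 400, 0.2, 80.0

# Toy case: X_i = diag(eps_{i,1}, ..., eps_{i,d}) with i.i.d. +/-1 signs, so
# E[e^{beta X_i}] = cosh(beta) * I and its operator norm is cosh(beta).
norm_mgf = np.cosh(beta)

# Right-hand side of (1.4); both products are equal by symmetry of the signs.
rhs = np.exp(-beta * t) * d * 2 * norm_mgf ** n

# Empirical estimate of P[||X_1 + ... + X_n|| > t]; for diagonal matrices the
# operator norm of the sum is the largest coordinate-wise |sum of signs|.
trials = 1000
sums = rng.choice([-1.0, 1.0], size=(trials, n, d)).sum(axis=1)
emp = np.mean(np.max(np.abs(sums), axis=1) > t)
print(emp, "<=", rhs)
```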

1.3 Concentration

Let $Y$ be a random, symmetric, PSD $d \times d$ matrix with $\mathbb{E}[Y] = I$. Suppose that $\|Y\| \leq L$ with probability one.

Theorem 1.2. If $Y_1, Y_2, \ldots, Y_n$ are i.i.d. copies of $Y$, then for any $\varepsilon \in (0, 1)$ the following holds. Let $\lambda_1, \lambda_2, \ldots, \lambda_d$ denote the eigenvalues of $\frac{1}{n} \sum_{i=1}^n Y_i$. Then
$$\mathbb{P}\big[\{\lambda_1, \lambda_2, \ldots, \lambda_d\} \subseteq [1 - \varepsilon, 1 + \varepsilon]\big] \geq 1 - 2d \exp\big(-\varepsilon^2 n/(4L)\big).$$

There is a slightly nicer way to write this using the Löwner ordering of symmetric matrices: We write $A \preceq B$ to denote that the matrix $B - A$ is positive semidefinite. We can rewrite the conclusion of Theorem 1.2 as
$$\mathbb{P}\left[(1 - \varepsilon)\, I \preceq \frac{1}{n} \sum_{i=1}^n Y_i \preceq (1 + \varepsilon)\, I\right] \geq 1 - 2d \exp\big(-\varepsilon^2 n/(4L)\big). \tag{1.5}$$

Proof of Theorem 1.2. Define $X_i := Y_i - \mathbb{E}[Y_i]$ and $X := X_1 + \cdots + X_n$. Then (1.5) is equivalent to
$$\mathbb{P}\big[\|X\| > \varepsilon n\big] \leq 2d \exp\big(-\varepsilon^2 n/(4L)\big).$$

We know from (1.4) that it will suffice to bound $\|\mathbb{E}[e^{\beta X_i}]\|$ for each $i$. To do this, we will use the fact that
$$1 + x \leq e^x \leq 1 + x + x^2 \qquad \forall x \in [-1, 1].$$
Note that if $A$ is a real symmetric matrix, then since $I$, $A$, $A^2$, and $e^A$ are simultaneously diagonalizable, this yields
$$I + A \preceq e^A \preceq I + A + A^2 \tag{1.6}$$
for any $A$ with $\|A\| \leq 1$.

Observe that $\mathbb{E}[X_i] = 0$. To evaluate $\|X_i\|$, let us use the fact that for real symmetric $A$, we have $\|A\| = \max_{\|x\|_2 = 1} |x^T A x|$. So consider some $x \in \mathbb{R}^d$ with $\|x\|_2 = 1$ and write
$$|x^T X_i x| = \big|x^T Y_i x - x^T \mathbb{E}[Y_i] x\big| \leq \max\big\{x^T Y_i x,\; x^T \mathbb{E}[Y_i] x\big\} \leq L,$$

where we have used the fact that since $Y_i$ is PSD, so is $\mathbb{E}[Y_i]$, and thus $x^T \mathbb{E}[Y_i] x$ and $x^T Y_i x$ are both nonnegative. We also used our assumption that $\|Y_i\| \leq L$. We conclude that $\|X_i\| \leq L$.

Moreover, we have

$$\mathbb{E}[X_i^2] = \mathbb{E}\big[(Y_i - \mathbb{E}[Y_i])^2\big] = \mathbb{E}[Y_i^2] - (\mathbb{E}[Y_i])^2 \preceq \mathbb{E}[Y_i^2] \preceq \mathbb{E}\big[\|Y_i\|\, Y_i\big] \preceq L\, \mathbb{E}[Y_i],$$
where in the final step we again used the assumption $\|Y_i\| \leq L$. Finally, since $\mathbb{E}[Y_i] = I$, we conclude that $\|\mathbb{E}[X_i^2]\| \leq L$.

Therefore for any $\beta \leq 1/L$, we can apply (1.6), yielding
$$\mathbb{E}[e^{\beta X_i}] \preceq I + \beta\, \mathbb{E}[X_i] + \beta^2\, \mathbb{E}[X_i^2] = I + \beta^2\, \mathbb{E}[X_i^2] \preceq e^{\beta^2 \mathbb{E}[X_i^2]}.$$
We conclude that for $\beta \leq 1/L$, we have $\|\mathbb{E}[e^{\beta X_i}]\| \leq e^{\beta^2 L}$.

Plugging this into (1.4), we see that

$$\mathbb{P}\big[\|X\| > \varepsilon n\big] \leq 2d\, e^{-\varepsilon n \beta} e^{\beta^2 L n}.$$
Choosing $\beta := \frac{\varepsilon}{2L}$ yields
$$\mathbb{P}\big[\|X\| > \varepsilon n\big] \leq 2d\, e^{-\varepsilon^2 n/(4L)},$$
completing the argument. $\square$
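As a quick Monte Carlo sanity check of Theorem 1.2 (a sketch under illustrative assumptions, not part of the notes): take $Y = d\, e_j e_j^T$ with $j$ uniform in $\{1, \ldots, d\}$, so that $Y$ is PSD, $\mathbb{E}[Y] = I$, and $\|Y\| = d = L$ with probability one. The eigenvalues of the empirical mean are then easy to read off.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 20, 50000
L = d            # ||Y|| = d with probability one for this distribution

# Y = d * e_j e_j^T with j uniform, so E[Y] = I. The empirical mean
# (1/n) * sum_i Y_i is diagonal with entries counts[j] * d / n.
js = rng.integers(d, size=n)
counts = np.bincount(js, minlength=d)
mean_eigs = counts * d / n       # eigenvalues of the empirical mean

eps = 0.2
print(np.all((1 - eps <= mean_eigs) & (mean_eigs <= 1 + eps)))  # typically True
print("failure probability bound:", 2 * d * np.exp(-eps**2 * n / (4 * L)))
```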

Finally, we can prove a generalization of Theorem 1.2 for random matrices whose expectation is not the identity.

Theorem 1.3. Let $Z$ be a $d \times d$ random real, symmetric, PSD matrix. Suppose also that $Z \preceq L \cdot \mathbb{E}[Z]$ for some $L \geq 1$. If $Z_1, Z_2, \ldots, Z_n$ are i.i.d. copies of $Z$, then for any $\varepsilon > 0$, it holds that

" n # 1 Õ ε2n 4L P 1 ε E Z Zi 1 + ε E Z > 1 2de− / . ( − ) [ ]  n  ( ) [ ] − i1

In other words, the empirical mean $\frac{1}{n} \sum_{i=1}^n Z_i$ is very close (in a spectral sense) to the expectation $\mathbb{E}[Z]$.

Proof of Theorem 1.3. This is made difficult only because it may be that $A = \mathbb{E}[Z]$ is not invertible. Suppose that $A = U D U^T$, where $D$ is a diagonal matrix $D = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_k, 0, \ldots, 0)$, $k$ is the rank of $A$, and $\lambda_i \neq 0$ for $i = 1, \ldots, k$. Then the pseudoinverse of $A$ is defined by

$$A^+ = U \operatorname{diag}\big(\lambda_1^{-1}, \lambda_2^{-1}, \ldots, \lambda_k^{-1}, 0, \ldots, 0\big)\, U^T.$$

Note that $A A^+ = A^+ A = I_{\operatorname{im}(A)}$, where $I_{\operatorname{im}(A)}$ denotes the operator that acts by the identity on the image of $A$ and annihilates $\ker(A)$.

Since $A$ is PSD, $A^+$ is also PSD, and we can define $A^{+/2}$ as the square root of $A^+$. One can write this explicitly as
$$A^{+/2} = U \operatorname{diag}\big(\lambda_1^{-1/2}, \lambda_2^{-1/2}, \ldots, \lambda_k^{-1/2}, 0, \ldots, 0\big)\, U^T.$$
Now to prove Theorem 1.3, it suffices to apply Theorem 1.2 to the matrices $Y_i = A^{+/2} Z_i A^{+/2}$. Verification is left as an exercise. $\square$
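Below is a minimal sketch of this reduction (assuming numpy; the toy distribution for $Z$ and all names are illustrative, and the condition $Z \preceq L\, \mathbb{E}[Z]$ holds here with $L = m$). It computes $A^{+/2}$ from an eigendecomposition of $A = \mathbb{E}[Z]$ and checks that the transformed samples $Y_i = A^{+/2} Z_i A^{+/2}$ have empirical mean close to $I_{\operatorname{im}(A)}$.

```python
import numpy as np

rng = np.random.default_rng(4)
d, m = 6, 4

# Fixed vectors b_1, ..., b_m spanning an m-dimensional subspace of R^d, and
# Z = m * b_j b_j^T with j uniform, so E[Z] = sum_j b_j b_j^T = B B^T =: A.
B = rng.standard_normal((d, m))
A = B @ B.T                       # rank m < d, so A is not invertible

# Pseudoinverse square root A^{+/2} via an eigendecomposition of A.
lam, U = np.linalg.eigh(A)
inv_sqrt = np.where(lam > 1e-10, 1.0 / np.sqrt(np.maximum(lam, 1e-10)), 0.0)
A_half_pinv = U @ np.diag(inv_sqrt) @ U.T

# Empirical mean of Y_i = A^{+/2} Z_i A^{+/2}, which should be close to the
# identity on im(A) for large n.
n = 5000
Y_mean = np.zeros((d, d))
for j in rng.integers(m, size=n):
    Z = m * np.outer(B[:, j], B[:, j])
    Y_mean += A_half_pinv @ Z @ A_half_pinv / n

I_imA = A_half_pinv @ A @ A_half_pinv        # identity on the image of A
print(np.linalg.norm(Y_mean - I_imA, 2))     # small for large n
```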
