
Lectures 13-14: Concentration for sums of random matrices
CSE 525, Winter 2015
Instructor: James R. Lee

1 Random matrices

1.1 Symmetric matrices

If $A$ is a $d \times d$ real symmetric matrix, then $A$ has all real eigenvalues, which we can order $\lambda_1(A) \geq \lambda_2(A) \geq \cdots \geq \lambda_d(A)$. The operator norm of $A$ is
\[ \|A\| := \max_{\|x\|_2 = 1} \|Ax\|_2 = \max\left\{ |\lambda_i(A)| : i \in \{1, \ldots, d\} \right\}. \]

The trace of $A$ is $\mathrm{Tr}(A) = \sum_{i=1}^d A_{ii} = \sum_{i=1}^d \lambda_i(A)$. The trace norm of $A$ is $\|A\|_* = \sum_{i=1}^d |\lambda_i(A)|$. $A$ is positive semidefinite (PSD) if all its eigenvalues are nonnegative. Note that for a PSD matrix $A$, we have $\mathrm{Tr}(A) = \|A\|_*$. We also recall the matrix exponential $e^A = \sum_{k=0}^\infty \frac{A^k}{k!}$, which is well-defined for all real symmetric $A$ and is itself also a real symmetric matrix.

If $A$ is symmetric, then $e^A$ is always PSD, as the next argument shows. Every real symmetric matrix can be diagonalized, writing $A = U^T D U$, where $U$ is an orthogonal matrix, i.e. $UU^T = U^T U = I$, and $D$ is diagonal. One can easily check that $A^k = U^T D^k U$ for any $k \in \mathbb{N}$, thus $A^k$ and $A$ are simultaneously diagonalizable. It follows that $A$ and $e^A$ are simultaneously diagonalizable. In particular, we have $\lambda_i(e^A) = e^{\lambda_i(A)}$.

Finally, note that for symmetric matrices $A$ and $B$, we have $\mathrm{Tr}(AB) \leq \|A\| \cdot \|B\|_*$. To see this, let $\{u_i\}$ be an orthonormal basis of eigenvectors of $B$ with $B u_i = \lambda_i(B) u_i$. Then
\[ \mathrm{Tr}(AB) = \sum_{i=1}^d \lambda_i(AB) = \sum_{i=1}^d \langle u_i, AB u_i \rangle = \sum_{i=1}^d \lambda_i(B) \langle u_i, A u_i \rangle \leq \sum_{i=1}^d |\lambda_i(B)| \cdot \|A\| = \|B\|_* \|A\|. \]
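These facts are easy to sanity-check numerically. Below is a minimal sketch (our addition, assuming NumPy and SciPy are available) verifying the spectral characterization of the operator norm, the trace identity, the PSD-ness of $e^A$, and the bound $\mathrm{Tr}(AB) \leq \|A\| \cdot \|B\|_*$ on random symmetric matrices.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
d = 5

# A random d x d real symmetric matrix.
M = rng.standard_normal((d, d))
A = (M + M.T) / 2
eigs_A = np.linalg.eigvalsh(A)          # real eigenvalues, ascending

# Operator norm = largest absolute eigenvalue; trace = sum of eigenvalues.
assert np.isclose(np.linalg.norm(A, 2), np.abs(eigs_A).max())
assert np.isclose(np.trace(A), eigs_A.sum())

# lambda_i(e^A) = e^{lambda_i(A)}, so e^A is PSD for every symmetric A.
assert np.allclose(np.linalg.eigvalsh(expm(A)), np.exp(eigs_A))

# Tr(AB) <= ||A|| * ||B||_* for symmetric A, B.
N = rng.standard_normal((d, d))
B = (N + N.T) / 2
trace_norm_B = np.abs(np.linalg.eigvalsh(B)).sum()
assert np.trace(A @ B) <= np.linalg.norm(A, 2) * trace_norm_B + 1e-12
print("all spectral identities verified")
```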

Many classical statements are either false or significantly more difficult to prove when translated to the matrix setting. For instance, while $e^{x+y} = e^x e^y = e^y e^x$ is true for arbitrary real numbers $x$ and $y$, it is only the case that $e^{A+B} = e^A e^B$ if $A$ and $B$ are simultaneously diagonalizable. However, somewhat remarkably, a matrix analog does hold if we work inside the trace.
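To see concretely that $e^{A+B} = e^A e^B$ can fail, here is a two-dimensional counterexample (our own illustration, not from the notes); any non-commuting pair of symmetric matrices will do.

```python
import numpy as np
from scipy.linalg import expm

# Two symmetric matrices that do not commute.
A = np.array([[0.0, 1.0],
              [1.0, 0.0]])
B = np.array([[1.0, 0.0],
              [0.0, -1.0]])

print(np.allclose(A @ B, B @ A))                    # False: A, B do not commute
print(np.allclose(expm(A + B), expm(A) @ expm(B)))  # False: e^{A+B} != e^A e^B
```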

Theorem 1.1 (Golden-Thompson inequality). If A and B are real symmetric matrices, then

\[ \mathrm{Tr}(e^{A+B}) \leq \mathrm{Tr}(e^A e^B). \]

The proof of this theorem is subtle; see the extra material on the web page.
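Although the proof is subtle, the inequality itself is easy to test numerically. The following sketch (our addition, using scipy.linalg.expm) checks $\mathrm{Tr}(e^{A+B}) \leq \mathrm{Tr}(e^A e^B)$ on random symmetric pairs.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)
d = 6
for _ in range(1000):
    M, N = rng.standard_normal((2, d, d))
    A, B = (M + M.T) / 2, (N + N.T) / 2
    # Golden-Thompson: Tr(e^{A+B}) <= Tr(e^A e^B).
    lhs = np.trace(expm(A + B))
    rhs = np.trace(expm(A) @ expm(B))
    assert lhs <= rhs * (1 + 1e-9)
print("Golden-Thompson held on 1000 random pairs")
```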

1.2 The Laplace transform for matrices

We will now consider a random $d \times d$ real matrix $X$. The entries $X_{ij}$ of $X$ are all (not necessarily independent) random variables. We have seen inequalities (like those named after Chernoff and Azuma) which assert that if $X = X_1 + X_2 + \cdots + X_n$ is a sum of independent random numbers, then $X$ is tightly concentrated around its mean. Our goal now is to prove a similar fact for sums of independent random symmetric matrices.

First, observe that the trace is a linear operator; this is easy to see from the fact that it is the sum of the diagonal entries of its argument. If $A$ and $B$ are arbitrary real matrices, then $\mathrm{Tr}(A + B) = \mathrm{Tr}(A) + \mathrm{Tr}(B)$.

This implies that if $X$ is a random matrix, then $\mathbb{E}[\mathrm{Tr}(X)] = \mathrm{Tr}(\mathbb{E}[X])$. Note that $\mathbb{E}[X]$ is the matrix defined by $(\mathbb{E}[X])_{ij} = \mathbb{E}[X_{ij}]$.

Suppose that $X_1, X_2, \ldots, X_n$ are independent random real symmetric matrices. Let $X = X_1 + X_2 + \cdots + X_n$. Let $S_k = X_1 + \cdots + X_k$ be the partial sum of the first $k$ terms, so that $X = S_n$. Our first goal will be to bound the probability that $X$ has an eigenvalue bigger than $t$. To do this, we will try to extend the method of Laplace transforms to work with symmetric matrices. This method was found by Ahlswede and Winter. It is much simpler than previous approaches that only worked for special cases.

Note that for $\beta > 0$, we have $\lambda_i(e^{\beta X}) = e^{\beta \lambda_i(X)}$. Therefore:
\[ \mathbb{P}\left[\max_i \lambda_i(X) \geq t\right] = \mathbb{P}\left[\max_i \lambda_i(e^{\beta X}) \geq e^{\beta t}\right] \leq \mathbb{P}\left[\mathrm{Tr}(e^{\beta X}) \geq e^{\beta t}\right], \tag{1.1} \]
where the last inequality uses the fact that all the eigenvalues of $e^{\beta X}$ are nonnegative, hence $\mathrm{Tr}(e^{\beta X}) = \sum_i \lambda_i(e^{\beta X}) \geq \max_i \lambda_i(e^{\beta X})$.

Now Markov's inequality implies that
\[ \mathbb{P}\left[\mathrm{Tr}(e^{\beta X}) \geq e^{\beta t}\right] \leq \frac{\mathbb{E}[\mathrm{Tr}(e^{\beta X})]}{e^{\beta t}}. \tag{1.2} \]

As in our earlier uses of the Laplace transform, our goal is now to bound $\mathbb{E}[\mathrm{Tr}(e^{\beta X})]$ by a product that has one factor for each term $X_i$. In the matrix setting, this is more subtle. Using Theorem 1.1,

\[ \mathbb{E}[\mathrm{Tr}(e^{\beta X})] = \mathbb{E}\left[\mathrm{Tr}\left(e^{\beta (S_{n-1} + X_n)}\right)\right] \leq \mathbb{E}\left[\mathrm{Tr}\left(e^{\beta S_{n-1}} e^{\beta X_n}\right)\right]. \]

Now we push the expectation over Xn inside the trace:

\[ \mathbb{E}\left[\mathrm{Tr}\left(e^{\beta S_{n-1}} e^{\beta X_n}\right)\right] = \mathbb{E}\left(\mathrm{Tr}\left(e^{\beta S_{n-1}} \, \mathbb{E}\left[e^{\beta X_n} \mid X_1, \ldots, X_{n-1}\right]\right)\right) = \mathbb{E}\left(\mathrm{Tr}\left(e^{\beta S_{n-1}} \, \mathbb{E}\left[e^{\beta X_n}\right]\right)\right), \]
and we have used independence to pull $e^{\beta S_{n-1}}$ outside the expectation and then to remove the conditioning. Finally, we use the fact that $\mathrm{Tr}(AB) \leq \|A\| \cdot \|B\|_*$ and $\|B\|_* = \mathrm{Tr}(B)$ when $B$ has all nonnegative eigenvalues (as is the case for $e^{\beta S_{n-1}}$):

\[ \mathbb{E}\left(\mathrm{Tr}\left(e^{\beta S_{n-1}} \, \mathbb{E}\left[e^{\beta X_n}\right]\right)\right) \leq \left\| \mathbb{E}\left[e^{\beta X_n}\right] \right\| \cdot \mathbb{E}\left(\mathrm{Tr}\left(e^{\beta S_{n-1}}\right)\right). \]

Doing this $n$ times yields

\[ \mathbb{E}[\mathrm{Tr}(e^{\beta X})] \leq \mathrm{Tr}(I) \prod_{i=1}^n \left\| \mathbb{E}\left[e^{\beta X_i}\right] \right\| = d \prod_{i=1}^n \left\| \mathbb{E}\left[e^{\beta X_i}\right] \right\|. \]

Combining this with (1.1) and (1.2) yields

\[ \mathbb{P}\left[\max_i \lambda_i(X) \geq t\right] \leq e^{-\beta t} \, d \prod_{i=1}^n \left\| \mathbb{E}\left[e^{\beta X_i}\right] \right\|. \]

We can also apply this to $-X$ to get
\[ \mathbb{P}\left[\|X\| \geq t\right] \leq e^{-\beta t} \, d \left( \prod_{i=1}^n \left\| \mathbb{E}\left[e^{\beta X_i}\right] \right\| + \prod_{i=1}^n \left\| \mathbb{E}\left[e^{-\beta X_i}\right] \right\| \right). \tag{1.3} \]
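To make (1.3) concrete, here is a toy instantiation (our own example, not from the notes): take $X_i = \epsilon_i A$ for a fixed symmetric $A$ with $\|A\| = 1$ and independent Rademacher signs $\epsilon_i$, so that $\mathbb{E}[e^{\beta X_i}] = \cosh(\beta A)$ exactly and, by symmetry, both products in (1.3) coincide. The parameters $\beta$, $t$, $n$ below are our choices.

```python
import numpy as np
from scipy.linalg import coshm

rng = np.random.default_rng(2)
d, n = 4, 200
t, beta = 50.0, 0.25   # beta = t/n roughly optimizes the exponent -beta*t + n*beta^2/2

# Fixed symmetric step, normalized so that ||A|| = 1.
M = rng.standard_normal((d, d))
A = (M + M.T) / 2
A /= np.linalg.norm(A, 2)

# For X_i = eps_i * A, E[e^{beta X_i}] = E[e^{-beta X_i}] = cosh(beta A),
# so the right-hand side of (1.3) becomes 2 d e^{-beta t} ||cosh(beta A)||^n.
mgf_norm = np.linalg.norm(coshm(beta * A), 2)
bound = 2 * d * np.exp(-beta * t) * mgf_norm ** n

# Empirical tail: here ||X|| = ||sum_i eps_i A|| = |sum_i eps_i| since ||A|| = 1.
trials = 20000
sums = rng.choice([-1.0, 1.0], size=(trials, n)).sum(axis=1)
empirical = np.mean(np.abs(sums) >= t)

print(f"empirical P[||X|| >= {t}]: {empirical:.5f}")
print(f"bound from (1.3):          {bound:.5f}")
```

The bound is valid though not tight; this example collapses to a scalar sum, but it exercises the matrix quantities appearing in (1.3) exactly.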

1.3 Concentration

Let $Y$ be a random, symmetric, PSD $d \times d$ matrix with $\mathbb{E}[Y] = I$. Suppose that $\|Y\| \leq L$ with probability one.

Theorem 1.2. If $Y_1, Y_2, \ldots, Y_n$ are i.i.d. copies of $Y$, then for any $\varepsilon \in (0, 1)$ the following holds. Let $\lambda_1, \lambda_2, \ldots, \lambda_d$ denote the eigenvalues of $\frac{1}{n} \sum_{i=1}^n Y_i$. Then

\[ \mathbb{P}\left[\{\lambda_1, \lambda_2, \ldots, \lambda_d\} \subseteq [1 - \varepsilon, 1 + \varepsilon]\right] \geq 1 - 2d \exp\left(-\varepsilon^2 n / 4L\right). \]

There is a slightly nicer way to write this using the Löwner ordering of symmetric matrices: We write $A \succeq B$ to denote that the matrix $A - B$ is positive semidefinite. We can rewrite the conclusion of Theorem 1.2 as
\[ \mathbb{P}\left[(1 - \varepsilon) I \preceq \frac{1}{n} \sum_{i=1}^n Y_i \preceq (1 + \varepsilon) I\right] \geq 1 - 2d \exp\left(-\varepsilon^2 n / 4L\right). \tag{1.4} \]

Proof of Theorem 1.2. Define $X_i = (Y_i - \mathbb{E}[Y_i])/L$ and let $X = X_1 + \cdots + X_n$. Then (1.4) is equivalent to
\[ \mathbb{P}\left[\|X\| \geq \frac{\varepsilon n}{L}\right] \leq 2d \exp\left(-\varepsilon^2 n / 4L\right). \]

We know from (1.3) that it will suffice to bound $\left\| \mathbb{E}[e^{\beta X_i}] \right\|$ for each $i$. To do this, we will use the fact that
\[ 1 + x \leq e^x \leq 1 + x + x^2 \qquad \forall x \in [-1, 1]. \]
Note that if $A$ is a real symmetric matrix, then since $I$, $A$, $A^2$, and $e^A$ are simultaneously diagonalizable, this yields
\[ I + A \preceq e^A \preceq I + A + A^2 \tag{1.5} \]
for any $A$ with $\|A\| \leq 1$.
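Since all four matrices in (1.5) are simultaneously diagonalizable, the ordering reduces to the scalar inequality on each eigenvalue, and it can be checked directly. A quick randomized check (our addition) over symmetric matrices with $\|A\| \leq 1$:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(3)
d = 5
for _ in range(1000):
    M = rng.standard_normal((d, d))
    A = (M + M.T) / 2
    A /= max(1.0, np.linalg.norm(A, 2))   # enforce ||A|| <= 1
    E, I = expm(A), np.eye(d)
    # PSD-ness of the differences is exactly the Lowner ordering in (1.5).
    assert np.linalg.eigvalsh(E - (I + A)).min() >= -1e-10
    assert np.linalg.eigvalsh(I + A + A @ A - E).min() >= -1e-10
print("(1.5) held on 1000 random symmetric matrices with ||A|| <= 1")
```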

Observe that $\mathbb{E}[X_i] = 0$. To evaluate $\|X_i\|$, let us use the fact that for real symmetric $A$, we have $\|A\| = \max_{\|x\|_2 = 1} |x^T A x|$. So consider some $x \in \mathbb{R}^d$ with $\|x\|_2 = 1$ and write
\[ |x^T X_i x| = \frac{\left| x^T Y_i x - x^T \mathbb{E}[Y_i] x \right|}{L} \leq \frac{\max\left( x^T Y_i x, \; x^T \mathbb{E}[Y_i] x \right)}{L} \leq 1, \]
where we have used the fact that since $Y_i$ is PSD, so is $\mathbb{E}[Y_i]$, and thus $x^T \mathbb{E}[Y_i] x \geq 0$, so the difference of two nonnegative numbers is at most their maximum. We also used our assumption that $\|Y_i\| \leq L$, together with $x^T \mathbb{E}[Y_i] x = 1 \leq L$ since $\mathbb{E}[Y_i] = I$. We conclude that $\|X_i\| \leq 1$.

Moreover, we have
\[ \mathbb{E}[X_i^2] = \frac{1}{L^2} \mathbb{E}\left[ (Y_i - \mathbb{E}[Y_i])^2 \right] = \frac{1}{L^2} \left( \mathbb{E}[Y_i^2] - (\mathbb{E}[Y_i])^2 \right) \preceq \frac{1}{L^2} \mathbb{E}[Y_i^2] \preceq \frac{1}{L^2} \mathbb{E}\left[ \|Y_i\| \, Y_i \right] \preceq \frac{1}{L} \mathbb{E}[Y_i], \]
where we used that $Y_i^2 \preceq \|Y_i\| Y_i$ for PSD $Y_i$, and in the final step again the assumption $\|Y_i\| \leq L$. Finally, we have $\mathbb{E}[Y_i] = I$, so we conclude that $\left\| \mathbb{E}[X_i^2] \right\| \leq \frac{1}{L}$. Using (1.5), this yields

\[ \mathbb{E}\left[e^{\beta X_i}\right] \preceq I + \beta \, \mathbb{E}[X_i] + \beta^2 \, \mathbb{E}[X_i^2] = I + \beta^2 \, \mathbb{E}[X_i^2] \preceq e^{\beta^2 \mathbb{E}[X_i^2]} \]
(the hypothesis of (1.5) holds since $\|\beta X_i\| \leq \beta \leq 1$ for the value of $\beta$ we will choose). We conclude that $\left\| \mathbb{E}[e^{\beta X_i}] \right\| \leq e^{\beta^2 / L}$. Plugging this into (1.3), we see that

\[ \mathbb{P}\left[\|X\| \geq \frac{\varepsilon n}{L}\right] \leq 2d \, e^{-\varepsilon \beta n / L} \, e^{n \beta^2 / L}. \]

And setting $\beta = \varepsilon/2$ yields
\[ \mathbb{P}\left[\|X\| \geq \frac{\varepsilon n}{L}\right] \leq 2d \, e^{-n \varepsilon^2 / 4L}, \]
completing the argument. $\square$
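As an illustration of Theorem 1.2 (our own example, not from the notes), take $Y = d \cdot vv^T$ with $v$ uniform on the unit sphere in $\mathbb{R}^d$; then $\mathbb{E}[Y] = I$ and $\|Y\| = d$, so $L = d$. The sketch below forms the empirical mean and checks that its spectrum lies in $[1 - \varepsilon, 1 + \varepsilon]$.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 10
L = d                  # Y = d * v v^T has E[Y] = I and ||Y|| = d, so L = d
eps = 0.5
n = 20000              # large enough that 2d exp(-eps^2 n / (4L)) is tiny

# n i.i.d. samples: rows of V are uniform on the unit sphere in R^d.
V = rng.standard_normal((n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)
S = d * (V.T @ V) / n  # empirical mean (1/n) * sum_i d * v_i v_i^T

lam = np.linalg.eigvalsh(S)
print("spectrum of empirical mean:", lam.min(), "...", lam.max())
print("inside [1-eps, 1+eps]?    ", lam.min() >= 1 - eps and lam.max() <= 1 + eps)
print("failure probability bound: ", 2 * d * np.exp(-eps**2 * n / (4 * L)))
```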

Finally, we can prove a generalization of Theorem 1.2 for random matrices whose expectation is not the identity.

Theorem 1.3. Let $Z$ be a $d \times d$ random real, symmetric, PSD matrix. Suppose also that $Z \preceq L \cdot \mathbb{E}[Z]$ for some $L \geq 1$. If $Z_1, Z_2, \ldots, Z_n$ are i.i.d. copies of $Z$, then for any $\varepsilon > 0$, it holds that

\[ \mathbb{P}\left[(1 - \varepsilon) \, \mathbb{E}[Z] \preceq \frac{1}{n} \sum_{i=1}^n Z_i \preceq (1 + \varepsilon) \, \mathbb{E}[Z]\right] \geq 1 - 2d \, e^{-\varepsilon^2 n / 4L}. \]

In other words, the empirical mean $\frac{1}{n} \sum_{i=1}^n Z_i$ is very close (in a spectral sense) to the expectation $\mathbb{E}[Z]$.

Proof of Theorem 1.3. This is made difficult only because it may be that $A = \mathbb{E}[Z]$ is not invertible. Suppose that $A = U D U^T$ where $D$ is a diagonal matrix $D = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_k, 0, \ldots, 0)$, where $k$ is the rank of $A$ and $\lambda_i \neq 0$ for $i = 1, \ldots, k$. Then the pseudoinverse of $A$ is defined by

\[ A^+ = U \, \mathrm{diag}\left(\lambda_1^{-1}, \lambda_2^{-1}, \ldots, \lambda_k^{-1}, 0, \ldots, 0\right) U^T. \]

Note that $A A^+ = A^+ A = I_{\mathrm{im}(A)}$, where $I_{\mathrm{im}(A)}$ denotes the operator that acts by the identity on the image of $A$ and annihilates $\ker(A)$. Since $A$ is PSD, $A^+$ is also PSD, and we can define $A^{+/2}$ as the square root of $A^+$. One can write this explicitly as
\[ A^{+/2} = U \, \mathrm{diag}\left(\lambda_1^{-1/2}, \lambda_2^{-1/2}, \ldots, \lambda_k^{-1/2}, 0, \ldots, 0\right) U^T. \]

Now to prove Theorem 1.3, it suffices to apply Theorem 1.2 to the matrices $Y_i = A^{+/2} Z_i A^{+/2}$. We leave verification of the proof as an exercise. $\square$
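A short sketch (our addition) of this construction in NumPy: it builds $A^{+/2}$ from the eigendecomposition and confirms that $A^{+/2} A A^{+/2} = I_{\mathrm{im}(A)}$, the projection onto the image, which is why the conjugation $Z_i \mapsto A^{+/2} Z_i A^{+/2}$ reduces Theorem 1.3 to Theorem 1.2.

```python
import numpy as np

rng = np.random.default_rng(5)
d, k = 6, 3

# A rank-k PSD matrix playing the role of A = E[Z].
G = rng.standard_normal((d, k))
A = G @ G.T

# Pseudo square root A^{+/2} from the eigendecomposition A = U diag(lam) U^T.
lam, U = np.linalg.eigh(A)
inv_sqrt = np.zeros_like(lam)
pos = lam > 1e-10                     # invert only the nonzero eigenvalues
inv_sqrt[pos] = 1.0 / np.sqrt(lam[pos])
A_pinv_half = U @ np.diag(inv_sqrt) @ U.T

# Conjugation by A^{+/2} maps A to I_{im(A)}: identity on im(A), zero on ker(A).
P = A_pinv_half @ A @ A_pinv_half
assert np.allclose(P @ A, A)          # acts as the identity on the image of A
assert np.allclose(np.linalg.eigvalsh(P), [0, 0, 0, 1, 1, 1])
# This is exactly why Y_i = A^{+/2} Z_i A^{+/2} has E[Y_i] = I_{im(A)}.
```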
