Random Matrix Theory for Sample Covariance Matrices

Narae Lee

May 1, 2014

1 Introduction

This paper investigates the statistical behavior of the eigenvalues of real symmetric random matrices, especially sample covariance matrices. A random matrix is a matrix-valued random variable. Many complex systems in nature and society show chaotic behavior at the microscopic level and order at the macroscopic level. Investigating the distribution of the eigenvalues of random matrices is a way to understand "the order" at the macroscopic level once a physical system is expressed as a random matrix. In many applications of random matrix theory, the eigenvalues are central to understanding how systems with random elements behave.

2 Motivation - Dimensionality reduction

In machine learning and statistics, dimensionality reduction is the process of reducing the number of random variables while preserving the essential information in the data. Principal Component Analysis (PCA), a widely used technique, is an orthogonal linear transformation that maps the data to a new (projected) coordinate system such that the greatest variance is achieved along the first projected coordinate.

2.1 Principal Component Analysis (PCA)

Let X be an n × p data matrix. Typically, one thinks of n observations x_i of a p-dimensional row vector which has covariance matrix Σ. We can assume that X has zero empirical mean without loss of generality by constructing the new data matrix Y = X − X̄ and applying the following properties to Y. Next, the sample covariance matrix is defined as S_n = (1/n) X′X ∈ R^{p×p}. Note that S_n is symmetric and positive semi-definite. If all features of the data are linearly independent, we can assume that S_n has full rank. Let S_n have the ordered sample eigenvalues l_1 ≥ l_2 ≥ ··· ≥ l_p. By the spectral (singular value) decomposition, we can factorize
\[
S_n = \frac{1}{n}X'X = ULU' = \sum_j l_j u_j u_j',
\]
with the eigenvalues {l_j} on the diagonal of L and the orthonormal eigenvectors u_j collected as the columns of U. These eigenvalues are at the heart of PCA, which is also widely known as the Karhunen-Loève transformation.
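As a concrete illustration, here is a minimal NumPy sketch (synthetic data, my own variable names) that centers a data matrix, forms S_n = (1/n) Y′Y, and checks that its eigendecomposition agrees with the singular value decomposition of Y.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 5                      # n observations of a p-dimensional vector
X = rng.normal(size=(n, p))        # synthetic data matrix (rows are observations)

Y = X - X.mean(axis=0)             # remove the empirical mean (zero-mean data)
S_n = Y.T @ Y / n                  # sample covariance matrix, p x p

# Eigendecomposition S_n = U L U'
eigvals, U = np.linalg.eigh(S_n)
eigvals, U = eigvals[::-1], U[:, ::-1]           # sort as l_1 >= l_2 >= ... >= l_p

# The same eigenvalues arise from the singular values s_j of Y: l_j = s_j^2 / n
s = np.linalg.svd(Y, compute_uv=False)
assert np.allclose(eigvals, s**2 / n)

print("ordered sample eigenvalues:", np.round(eigvals, 3))
```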

Remark. The algorithm that successively finds an orthonormal basis of the (d-dimensional) projected coordinate system with the greatest variance is equivalent to finding the eigenvectors corresponding to the d largest eigenvalues l_1, l_2, ..., l_d of S_n. That is,

\[
l_j = \max\{\, u' S_n u \;:\; u \perp u_1, \dots, u_{j-1},\; \|u\| = 1 \,\}
\]

Such a u_j maximizing u′S_n u subject to u ⊥ u_1, ..., u_{j-1} and ‖u‖ = 1 is an eigenvector corresponding to l_j and is called the "j-th principal component". We can derive this remark easily using the method of Lagrange multipliers. For the first principal component u_1, we want to maximize the variance of the projected data, (1/n) Σ_i (u_1′ x_i)²:

\[
\frac{1}{n}\sum_i (u_1' x_i)^2 = \frac{1}{n}\sum_i u_1' x_i x_i' u_1 = u_1'\Big(\frac{1}{n}\sum_i x_i x_i'\Big)u_1 = u_1' S_n u_1
\]

We can express this problem as max_{u_1} u_1′ S_n u_1 subject to ‖u_1‖ = 1. Introducing a Lagrange multiplier λ, let L(u_1) = u_1′ S_n u_1 + λ(1 − ‖u_1‖²). Then
\[
\frac{\partial L}{\partial u_1} = 2 S_n u_1 - 2\lambda u_1 = 0 \;\Longrightarrow\; S_n u_1 = \lambda u_1
\]

So (λ, u_1) is an eigenvalue-eigenvector pair of S_n. To maximize the objective function u_1′ S_n u_1 = λ u_1′ u_1 = λ, we need to choose the largest eigenvalue λ = l_1 and the eigenvector u_1 corresponding to l_1. For the other principal components, we find u_2 such that u_1′ u_2 = 0 and u_2′ S_n u_2 is maximized. This results in S_n u_2 = λ u_2 with u_2′ u_1 = 0, and it is easy to see that the second principal component corresponds to the second largest eigenvalue l_2 and its eigenvector u_2. This remark illustrates the relation between eigenvalues and PCA.
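A minimal numerical check of the remark (same kind of synthetic setup and naming as the sketch above): projecting the data onto the top eigenvector of S_n achieves at least as much variance as projecting onto any other unit direction.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 1000, 4
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))   # correlated synthetic data
Y = X - X.mean(axis=0)
S_n = Y.T @ Y / n

eigvals, U = np.linalg.eigh(S_n)
u1 = U[:, -1]                                # eigenvector of the largest eigenvalue l_1

var_pc1 = np.var(Y @ u1)                     # variance of the data projected onto u1
for _ in range(1000):                        # compare against random unit directions
    u = rng.normal(size=p)
    u /= np.linalg.norm(u)
    assert np.var(Y @ u) <= var_pc1 + 1e-9   # u' S_n u <= l_1 for every unit vector u

print("l_1 =", eigvals[-1], "projected variance =", var_pc1)
```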

3 Global Distribution of Eigenvalues

3.1 Definitions and Notations

Definition 3.1. A Wigner random matrix is an n × n symmetric matrix A = (A_ij) with independent entries (up to the symmetry constraint) satisfying
\[
A_{ij} = A_{ji}, \qquad E(A_{ij}) = 0, \qquad E(A_{ij}^2) = 1 + \delta_{ij},
\]
and all moments of A_ij are finite, for all i, j.

For a Wigner random matrix, if A_ij has the normal distribution N(0, 1 + δ_ij), then we call the matrix A a Gaussian Orthogonal Ensemble (GOE) matrix.

Definition 3.2. A p × p random matrix M is said to have a Wishart distribution with scale matrix Σ and degrees of freedom n if M = X′X where X ~ N_{n×p}(µ, Σ). This is denoted by M ~ W_p(n, Σ).

The Wishart W_p(n, Σ) distribution has a density function only when n ≥ p, and if M ~ W_p(n, Σ), it has the following density function
\[
\frac{1}{2^{np/2}\,\Gamma_p(\tfrac{n}{2})\,(\det\Sigma)^{n/2}}\;
\operatorname{etr}\!\big(-\tfrac{1}{2}\Sigma^{-1}M\big)\,(\det M)^{(n-p-1)/2}
\tag{1}
\]

where etr stands for the exponential of the trace of a matrix. Note that if the data matrix X ~ N_{n×p}(µ, Σ), the sample covariance matrix S_n = (1/n) X′X has the Wishart distribution W_p(n − 1, (1/n)Σ). Hence any general result regarding the eigenvalues of matrices in W_p(n, Σ) can be easily applied to the eigenvalues of sample covariance matrices.
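Both ensembles are straightforward to sample; the following sketch (my own function names, synthetic parameters) draws a GOE matrix and a Wishart matrix from Gaussian data, and similar samplers are reused in the simulation sketches later in the paper.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_goe(n):
    """GOE: symmetric Gaussian matrix with E(A_ij) = 0 and Var(A_ij) = 1 + delta_ij."""
    B = rng.normal(size=(n, n))
    return (B + B.T) / np.sqrt(2)

def sample_wishart(n, p, Sigma):
    """M = X'X with the n rows of X i.i.d. N_p(0, Sigma), i.e. M ~ W_p(n, Sigma)."""
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    return X.T @ X

A = sample_goe(200)
M = sample_wishart(n=100, p=5, Sigma=np.eye(5))
print(np.linalg.eigvalsh(M))       # the five Wishart eigenvalues
```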

Definition 3.3. Let A be a p × p matrix with eigenvalues l_1, ..., l_p. The empirical (cumulative) distribution function for the eigenvalues of A is
\[
l(x) := \frac{1}{p}\sum_{i=1}^{p}\mathbf{1}(l_i \le x).
\]
The corresponding empirical density function is l′(x) = (1/p) Σ_{i=1}^p δ(x − l_i). Now we are ready to investigate the global and local distribution of the eigenvalues of Wigner matrices and sample covariance matrices.

3.2 Wigner Semi-circle Law

Consider a family of Wigner matrices A of dimension n, chosen from some distribution. Like the Central Limit Theorem, the Wigner Semi-circle Law shows that, regardless of the distribution of the individual entries, the empirical distribution function converges to a certain non-random law.

Theorem 3.1. Let A be a Wigner matrix of dimension n with eigenvalues l_1, ..., l_n, and let P_n(x) be the empirical distribution of the eigenvalues of the normalized matrix (1/(2√n))A, so that the eigenvalues l_i/(2√n) lie in the interval [−1, 1]. Then its empirical density
\[
l'_n(x) = \frac{1}{n}\sum_{i=1}^{n}\delta\Big(x - \frac{l_i}{2\sqrt{n}}\Big)
\;\longrightarrow\;
P(x) = \frac{2}{\pi}\sqrt{1 - x^2}
\]
with probability 1 as n → ∞ (see Figure 1).

Proof. The basic idea in proving the semi-circle law is to compare the moments of the distribution of eigenvalues with those of the semi-circle distribution. This works because the actual distribution is determined by its moments, provided that those moments do not increase too rapidly with k. Let U(x^k) be the k-th moment of l'_n(x):
\[
U(x^k) = \int_{-\infty}^{\infty} x^k\, l'_n(x)\, dx = \frac{1}{n}\sum_{j=1}^{n}\Big(\frac{l_j}{2\sqrt{n}}\Big)^{k}.
\]
Compute the expected value of each moment:
\[
E(U(x^1)) = E\Big(\frac{1}{n}\sum_{j=1}^{n}\frac{l_j}{2\sqrt{n}}\Big)
= \frac{1}{2n^{3/2}}E(\operatorname{Tr}(A))
= \frac{1}{2n^{3/2}}\sum_{j=1}^{n}E(A_{jj}) = 0,
\]
\[
E(U(x^2)) = E\Big(\frac{1}{n}\sum_{j=1}^{n}\Big(\frac{l_j}{2\sqrt{n}}\Big)^2\Big)
= \frac{1}{4n^2}E(\operatorname{Tr}(A^2))
= \frac{1}{4n^2}\sum_{j=1}^{n}\sum_{k=1}^{n}E(A_{jk}^2)
= \frac{n^2+n}{4n^2} \;\longrightarrow\; \frac{1}{4}.
\]

And so on, up to higher-order moments. On the other hand, let C(x^k) be the k-th moment of the semicircle density P(x):

\[
C(x^k) = \int_{-1}^{1} x^k\, P(x)\, dx = \frac{2}{\pi}\int_{-1}^{1} x^k\sqrt{1-x^2}\, dx.
\]
Substitute x = sin θ:

\[
C(x^k) = \frac{2}{\pi}\int_{-\pi/2}^{\pi/2}\sin^k\theta\,\cos^2\theta\, d\theta.
\]
These integrals can be evaluated analytically. Define n!! = 2·4···n if n is even, and n!! = 1·3···n if n is odd. Then, for even k,
\[
C(x^k) = \frac{2\,(k-1)!!}{(k+2)!!},
\]
while the odd moments vanish by symmetry.

In particular, we have C(x^1) = 0 and C(x^2) = 1/4. These coincide with the moments of the eigenvalue distribution computed above. By extending this approach to the higher moments, one can prove that the eigenvalue distribution converges asymptotically to the semicircle.

Figure 1: Distribution of eigenvalues for the Gaussian Orthogonal Ensemble: Semi-circle Law
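A quick Monte Carlo check of Theorem 3.1 (a sketch under the conventions above; the GOE sampler mirrors the one in Section 3.1): the histogram of the rescaled eigenvalues of a single large GOE matrix already hugs the semicircle density, and its second moment is close to 1/4.

```python
import numpy as np

rng = np.random.default_rng(7)

n = 2000
B = rng.normal(size=(n, n))
A = (B + B.T) / np.sqrt(2)                        # GOE: Var(A_ij) = 1 + delta_ij
lam = np.linalg.eigvalsh(A) / (2 * np.sqrt(n))    # rescaled eigenvalues, roughly in [-1, 1]

# Compare the empirical density with P(x) = (2/pi) sqrt(1 - x^2) on a few bins
hist, edges = np.histogram(lam, bins=20, range=(-1, 1), density=True)
centers = (edges[:-1] + edges[1:]) / 2
semicircle = (2 / np.pi) * np.sqrt(1 - centers**2)
print(np.round(np.c_[centers, hist, semicircle], 3))

# The second moment should be close to C(x^2) = 1/4
print("empirical second moment:", np.mean(lam**2))
```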

3.3 Marčenko-Pastur Distribution

The Marčenko-Pastur distribution gives the 'semi-circle'-type law for the sample covariance matrix S_n.

Theorem 3.2. Let S_n ∈ R^{p×p} be the sample covariance matrix with ordered eigenvalues l_1 ≥ l_2 ≥ ··· ≥ l_p. Then its empirical density
\[
l'_p(x) = \frac{1}{p}\sum_{i=1}^{p}\delta(x - l_i)
\;\longrightarrow\;
G'(x) = \frac{1}{2\pi\gamma x}\sqrt{(b-x)(x-a)}, \qquad a \le x \le b,
\]
almost surely as n, p → ∞ with p/n → γ, where a = (1 − γ^{1/2})² and b = (1 + γ^{1/2})², provided γ ≤ 1. When γ > 1, there is an additional mass point at x = 0 of weight 1 − γ^{-1}.

Sketch of proof. As in the proof of the Semi-Circle Law, this theorem is also proved by comparing the moments of two distributions: the empirical distribution of the eigenvalues and the Marčenko-Pastur distribution. The k-th moment of the Marčenko-Pastur density f_γ(x) is

\[
\int x^k f_\gamma(x)\, dx = \sum_{r=0}^{k-1}\frac{\gamma^r}{r+1}\binom{k}{r}\binom{k-1}{r}.
\]

It suffices to show that the k-th moment of l'_p(x) converges to the same limit:

\[
E(U(x^k)) = E\Big(\frac{1}{p}\operatorname{Tr}(S_n^k)\Big)
\;\longrightarrow\;
\sum_{r=0}^{k-1}\frac{\gamma^r}{r+1}\binom{k}{r}\binom{k-1}{r}.
\]
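A quick numerical sanity check of this moment formula (a sketch with synthetic Gaussian data; the sizes and names are my own choices) compares (1/p)Tr(S_n^k) with the combinatorial sum for the first few k.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(5)

n, p = 3000, 900
gamma = p / n
X = rng.normal(size=(n, p))
S_n = X.T @ X / n

def mp_moment(k, gamma):
    """k-th Marchenko-Pastur moment: sum_{r=0}^{k-1} gamma^r/(r+1) C(k,r) C(k-1,r)."""
    return sum(gamma**r / (r + 1) * comb(k, r) * comb(k - 1, r) for r in range(k))

for k in (1, 2, 3, 4):
    empirical = np.trace(np.linalg.matrix_power(S_n, k)) / p
    print(k, round(empirical, 4), round(mp_moment(k, gamma), 4))
```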

From this theorem, one can also guess the limits of the largest and smallest eigenvalues l_1 and l_{min{n,p}}. It has been shown that l_1 and l_{min{n,p}} converge almost surely to the edges of the support [a, b] of G(x) [Geman '80 and Silverstein '85]:

\[
l_1 \longrightarrow (1 + \gamma^{1/2})^2 \quad\text{almost surely},
\qquad
l_{\min\{n,p\}} \longrightarrow (1 - \gamma^{1/2})^2 \quad\text{almost surely}.
\]
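The sketch below (synthetic Gaussian data, my own variable names) compares the empirical eigenvalue density of S_n with the Marčenko-Pastur density and checks that the extreme eigenvalues sit near the predicted edges (1 ± √γ)².

```python
import numpy as np

rng = np.random.default_rng(3)

n, p = 4000, 1000                     # aspect ratio gamma = p / n = 0.25
gamma = p / n
X = rng.normal(size=(n, p))
S_n = X.T @ X / n
lam = np.linalg.eigvalsh(S_n)

a, b = (1 - np.sqrt(gamma))**2, (1 + np.sqrt(gamma))**2

def mp_density(x, gamma, a, b):
    """Marchenko-Pastur density on [a, b]."""
    return np.sqrt((b - x) * (x - a)) / (2 * np.pi * gamma * x)

hist, edges = np.histogram(lam, bins=30, range=(a, b), density=True)
centers = (edges[:-1] + edges[1:]) / 2
print(np.round(np.c_[centers, hist, mp_density(centers, gamma, a, b)], 3))

print("largest eigenvalue:", lam.max(), "vs edge b =", b)
print("smallest eigenvalue:", lam.min(), "vs edge a =", a)
```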

4 Local Distribution of Eigenvalues

The rest of this paper aims to demonstrate the local distribution of the eigenvalues of the Wishart ensemble: (i) representation of the joint density function and (ii) extraction of the marginal density of the largest eigenvalue of Wishart matrices.

4.1 Joint probability density function for eigenvalues

Theorem 4.1. If A ~ W_p(n, I) with n ≥ p, then the joint density function of the eigenvalues l_1 > l_2 > ··· > l_p > 0 of A is
\[
\frac{\pi^{p^2/2}}{2^{np/2}\,\Gamma_p(\tfrac{n}{2})\,\Gamma_p(\tfrac{p}{2})}\;
\prod_{i=1}^{p} l_i^{(n-p-1)/2}\,
\prod_{i<j}|l_i - l_j|\;
\exp\!\Big(-\frac{1}{2}\sum_{i=1}^{p} l_i\Big).
\tag{2}
\]

Sketch of proof. The idea is to change variables from the entries of A to its eigenvalues l_1, ..., l_p and the 'angle' variables α_1, ..., α_{p(p−1)/2} that parametrize the eigenvectors. The Jacobian for this change of variables takes the form

\[
J = \det\begin{pmatrix}
\partial A_{11}/\partial l_1 & \cdots & \partial A_{11}/\partial l_p & \partial A_{11}/\partial\alpha_1 & \cdots & \partial A_{11}/\partial\alpha_{p(p-1)/2}\\
\partial A_{12}/\partial l_1 & \cdots & \partial A_{12}/\partial l_p & \partial A_{12}/\partial\alpha_1 & \cdots & \partial A_{12}/\partial\alpha_{p(p-1)/2}\\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots\\
\partial A_{pp}/\partial l_1 & \cdots & \partial A_{pp}/\partial l_p & \partial A_{pp}/\partial\alpha_1 & \cdots & \partial A_{pp}/\partial\alpha_{p(p-1)/2}
\end{pmatrix}
= \prod_{i<j}|l_i - l_j|\cdot h(\alpha_1, \dots, \alpha_{p(p-1)/2}).
\]

\[
P(l_1, \dots, l_p, \alpha_1, \dots, \alpha_{p(p-1)/2}) = P(A)\cdot J
= C\,\prod_{i=1}^{p} l_i^{(n-p-1)/2}\,\exp\!\Big(-\frac{1}{2}\operatorname{Tr}(A)\Big)\,
\prod_{i<j}|l_i - l_j|\cdot h(\alpha_1, \dots, \alpha_{p(p-1)/2}).
\]
Since the density depends on A only through its eigenvalues, integrating out the angle variables gives

\[
P(l_1, l_2, \dots, l_p) = K\,\prod_{i=1}^{p} l_i^{(n-p-1)/2}\,\prod_{i<j}|l_i - l_j|\;
\exp\!\Big(-\frac{1}{2}\sum_{i=1}^{p} l_i\Big).
\]
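As a concrete (and hedged) illustration, the sketch below evaluates the logarithm of the joint density (2) for W_p(n, I) at a given eigenvalue configuration, using SciPy's multivariate gamma function for the normalizing constant; the function name and the test values are my own.

```python
import numpy as np
from scipy.special import multigammaln   # log of the multivariate gamma function

def log_joint_density(l, n):
    """Log of the joint density (2) of the ordered eigenvalues l_1 > ... > l_p > 0 of W_p(n, I)."""
    l = np.asarray(l, dtype=float)
    p = l.size
    log_const = (p**2 / 2) * np.log(np.pi) - (n * p / 2) * np.log(2) \
                - multigammaln(n / 2, p) - multigammaln(p / 2, p)
    diffs = np.abs(l[:, None] - l[None, :])[np.triu_indices(p, 1)]
    return (log_const
            + (n - p - 1) / 2 * np.sum(np.log(l))
            + np.sum(np.log(diffs))
            - 0.5 * np.sum(l))

# Example: an arbitrary ordered configuration for p = 3, n = 10
print(log_joint_density([18.0, 9.0, 3.5], n=10))
```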

Essentially, one takes equation (2) and integrates out each of the variables l_k except for one:

\[
p(l_1) = \int\!\cdots\!\int p(l_1, l_2, \dots, l_p)\, dl_2 \cdots dl_p.
\]
The above integration was first carried out by Mehta and Gaudin, using a variety of techniques, including integration over alternating variables, rewriting the integrand as a determinant, and eventually expressing it in terms of orthogonal polynomials.

We can generalize the above theorem to the matrix S_n ~ W_p(n, Σ) (not W_p(n, I)). The joint density function of the eigenvalues l_1 > l_2 > ··· > l_p > 0 of the sample covariance matrix S_n ~ W_p(n, Σ) (n > p) is of the following form
\[
\frac{\pi^{p^2/2}\,(\det\Sigma)^{-n/2}\,n^{np/2}}{2^{np/2}\,\Gamma_p(\tfrac{n}{2})\,\Gamma_p(\tfrac{p}{2})}\;
\prod_{i=1}^{p} l_i^{(n-p-1)/2}\,
\prod_{j>i}(l_i - l_j)\;
{}_0F_0\!\Big(-\tfrac{n}{2}L,\;\Sigma^{-1}\Big),
\]
where L = diag(l_1, ..., l_p) and {}_0F_0(·, ·) is the (two-matrix) multivariate hypergeometric function, a function of matrix arguments expressible as a zonal polynomial series. There is a steady interest in methods for the efficient evaluation of this hypergeometric function.

4.2 Asymptotic distribution function of the largest eigenvalue

The asymptotic distribution of the largest eigenvalue of Wishart sample covariance matrices was found by Johnstone (2001). The main result of his work is as follows.

Theorem 4.2. Let S ~ W_p(n, I) and let l_1 be the largest eigenvalue of S. Then
\[
\frac{l_1 - \mu_{np}}{\sigma_{np}} \;\longrightarrow\; F_1
\]

in distribution, with centering and scaling constants
\[
\mu_{np} = (\sqrt{n-1} + \sqrt{p})^2,
\qquad
\sigma_{np} = (\sqrt{n-1} + \sqrt{p})\Big(\frac{1}{\sqrt{n-1}} + \frac{1}{\sqrt{p}}\Big)^{1/3},
\]
and F_1 stands for the distribution function of the Tracy-Widom law of order 1.

The limiting distribution function F_1 is a particular member of a family of distributions F_β, and can be computed as follows.

\[
F_2(s) = \exp\!\Big(-\int_s^{\infty}(x-s)\,q^2(x)\,dx\Big),
\]
\[
F_1(s) = \exp\!\Big(-\frac{1}{2}\int_s^{\infty} q(x)\,dx\Big)\,[F_2(s)]^{1/2}.
\]

Here, q(s) is the unique solution to the Painlevé II equation

\[
q'' = sq + 2q^3 + \alpha \quad\text{with } \alpha = 0
\]
with boundary condition q(s) ~ Ai(s) as s → ∞, where Ai denotes the Airy function. Note that the Tracy-Widom distributions can be numerically evaluated and have been tabulated in the software S-Plus. From numerical work, Tracy and Widom report that the F_1 distribution has mean approximately −1.21 and standard deviation 1.27. The density is asymmetric: its left tail decays at the exponential order e^{−|s|³/24}, while its right tail decays at the order e^{−(2/3)s^{3/2}}. In addition, Tracy and Widom also derived expressions for the limiting distribution of the k-th largest eigenvalue.

Figure 2: Density of the Tracy-Widom distribution F1
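A simulation sketch of Theorem 4.2 (synthetic white Wishart data; the quoted mean ≈ −1.21 and standard deviation ≈ 1.27 are the Tracy-Widom F_1 values mentioned above): center and scale the largest sample eigenvalue with Johnstone's constants and inspect its first two moments.

```python
import numpy as np

rng = np.random.default_rng(11)

n, p, reps = 400, 100, 500
mu = (np.sqrt(n - 1) + np.sqrt(p))**2
sigma = (np.sqrt(n - 1) + np.sqrt(p)) * (1/np.sqrt(n - 1) + 1/np.sqrt(p))**(1/3)

samples = np.empty(reps)
for r in range(reps):
    X = rng.normal(size=(n, p))
    l1 = np.linalg.eigvalsh(X.T @ X).max()   # largest eigenvalue of a W_p(n, I) matrix
    samples[r] = (l1 - mu) / sigma           # centered and scaled as in Theorem 4.2

# Should be roughly -1.21 and 1.27 (the Tracy-Widom F_1 mean and standard deviation)
print("mean:", samples.mean(), "std:", samples.std())
```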

5 Universality

Finally, it is very natural to ask what happens if the elements of the data matrix X are i.i.d. from a non-Gaussian distribution, or, even more generally, if the elements are not i.i.d. Many researchers have worked to answer this question. Roughly speaking, it is generally believed, with strong numerical and theoretical evidence, that the global and local distributions are universal, namely that results obtained under the GOE assumption should hold for Wigner matrices (real and complex), or for even more general classes of random matrices. For the global distribution, the limit laws of random matrix theory (semi-circle, circular) are the same for A_ij drawn from different distributions such as Bernoulli or Poisson distributions (Pastur (1973)). This indicates that the behavior of the random variables forming the matrix may differ (and even be unknown), but the macroscopic picture is very similar. Soshnikov (1999) established "universality" of the Tracy-Widom limit for square Wigner matrices. More recently, Tao and Vu (2010) established universality of local spectral statistics of non-Hermitian matrices with independent entries, under the additional hypotheses that the entries of the matrix decay exponentially and match moments with either the real or complex Gaussian ensemble to fourth order. Beyond these, much remarkable research aims to extend the well-established theories of classical random matrix theory to more general matrices.

6 Conclusion

How the eigenvalues of random matrices, especially sample covariance matrices, are distributed has been studied in this paper. Special emphasis was put on the distribution of the largest eigenvalue, motivated by Principal Component Analysis. Under the standard assumption of data normality, many theorems have been established and generalized to the class of covariance matrices which have the Wishart distribution. Nowadays, many algorithms have been developed in S-Plus to evaluate the asymptotic behavior of eigenvalue distributions such as the Tracy-Widom distribution. However, many interesting problems in random matrix theory have not been touched yet. For example, the distributions of random matrices with special structure, such as banded or sparse matrices, have not been fully studied and need further development. This project started from a specific question in machine learning and raised many further questions. It can serve as a base for a more serious investigation in the area of study of the largest eigenvalues of sample covariance matrices.

References

[1] Andrei Bejan, Largest eigenvalues and sample covariance matrices, MSc Dissertation, The University of Warwick, 2005.

[2] Iain M. Johnstone, On the distribution of the largest eigenvalue in principal components analysis, Stanford University, The Annals of Statistics, Vol. 29, No. 2 (2001), 295-327.

[3] Yi-Kai Liu, Statistical behavior of the eigenvalues of random matrices, Mathematics Junior Seminar, Princeton University (2001).

[4] Basor, E.L., Tracy, C.A. and Widom, H., Asymptotics of level-spacing distributions for random matrices, Phys. Rev. Lett. 69 (1992), 5-8.

[5] Greg W. Anderson, Alice Guionnet and Ofer Zeitouni, An Introduction to Random Matrices, Cambridge University Press, 2010.

[6] Anderson, T.W., Asymptotic theory of principal component analysis, The Annals of Mathematical Statistics, Vol. 34 (1963), 122-148.

[7] Baik, J., Silverstein, J.W., Eigenvalues of large sample covariance random matrices, a review, Statist. Sinica, Vol. 9 (2004).

[8] Roman Vershynin, Random Matrices: Invertibility, Structure, and Applications, presentation notes, Canadian Math Society Summer Meeting (2011), http://www-personal.umich.edu/~romanv/slides/2011-Edmonton/2011-Edmonton.pdf

[9] Terence Tao and Van Vu, Random covariance matrices: Universality of local statistics of eigenvalues, The Annals of Probability, Vol. 40, No. 3 (2012), 1285-1325.

[10] G. Ben Arous and S. Péché, Universality of local eigenvalue statistics for some sample covariance matrices, Communications on Pure and Applied Mathematics, Vol. 58 (2005), 1-42.

[11] Fraydoun Rezakhanlou, Lectures on Random Matrices, lecture notes, UC Berkeley (2012).

[12] Alexander Soshnikov, A note on universality of the distribution of the largest eigenvalues in certain sample covariance matrices, Journal of Statistical Physics, Vol. 108 (2001), 1033-1056.
