Random Matrix Theory for the Sample Covariance Matrix
Narae Lee
May 1, 2014
1 Introduction
This paper investigates the statistical behavior of the eigenvalues of real symmetric random matrices, especially sample covariance matrices. A random matrix is a matrix-valued random variable in probability theory. Many complex systems in nature and society show chaotic behavior at the microscopic level and order at the macroscopic level. Investigating the distribution of the eigenvalues of random matrices is a way to understand "the order" at the macroscopic level when physical systems are expressed as random matrices. In many applications of random matrix theory, the eigenvalues are closely related to understanding how systems with random elements behave.
2 Motivation - Dimensionality reduction
In machine learning and statistics, dimensionality reduction is the process of reducing the number of random variables while preserving essential information in the data. Principal Component Analysis (PCA), a widely used technique, is an orthogonal linear transformation that transforms the data to a new (projected) coordinate system such that the greatest variance is achieved along the projected coordinates.
2.1 Principal Component Analysis (PCA)
Let $X$ be an $n \times p$ data matrix. Typically, one thinks of $n$ observations $x_i$ of a $p$-dimensional row vector which has covariance matrix $\Sigma$. We can assume that $X$ has zero empirical mean without loss of generality by constructing the new data $Y = X - \bar{X}$ and applying the following properties to $Y$. Next, the sample covariance matrix is defined as $S_n = \frac{1}{n} X'X \in \mathbb{R}^{p \times p}$. Note that $S_n$ is symmetric and positive semi-definite. If all features of the data are linearly independent, we can assume that $S_n$ has full rank. Let $S_n$ have the ordered sample eigenvalues $l_1 \ge l_2 \ge \cdots \ge l_p$. By the spectral (singular value) decomposition, we can factorize $S_n = \frac{1}{n} X'X = U L U' = \sum_j l_j u_j u_j'$ with the eigenvalues in the diagonal matrix $L$ and the orthonormal eigenvectors $\{u_j\}$ collected as the columns of $U$. Eigenvalues occur in PCA, also widely known as the Karhunen-Loève transformation.
Remark. The algorithm that successively finds an orthonormal basis of the ($d$-dimensional) projected coordinate system having the greatest variance is equivalent to finding the eigenvectors corresponding to the first $d$ largest eigenvalues $l_1, l_2, \ldots, l_d$ of $S_n$. That is,
$$ l_j = \max \{\, u' S_n u : u \perp u_1, \ldots, u_{j-1},\ \|u\| = 1 \,\} $$
Such a $u_k$ maximizing $u' S_n u$ subject to $u \perp u_1, \ldots, u_{k-1}$, $\|u\| = 1$ is an eigenvector corresponding to $l_k$ and is called the "$k$-th principal component." We can derive this remark easily by using the Lagrange multiplier method. For the first principal component $u_1$, we want to maximize the variance of the projected data, $\frac{1}{n} \sum_i (u_1' x_i)^2$:
$$ \frac{1}{n} \sum_i (u_1' x_i)^2 = \frac{1}{n} \sum_i u_1' x_i x_i' u_1 = u_1' \Big( \frac{1}{n} \sum_i x_i x_i' \Big) u_1 = u_1' S_n u_1 $$
We can express this problem as $\max_{u_1} u_1' S_n u_1$ subject to $\|u_1\| = 1$. Using a Lagrange multiplier, $L(u_1) = u_1' S_n u_1 + \lambda (1 - \|u_1\|^2)$, and
$$ \frac{\partial L}{\partial u_1} = 2 S_n u_1 - 2\lambda u_1 = 0 \implies S_n u_1 = \lambda u_1 $$
So $(\lambda, u_1)$ is an eigenvalue-eigenvector pair of $S_n$. To maximize the objective function $u_1' S_n u_1 = \lambda u_1' u_1 = \lambda$, we need to choose the largest eigenvalue $\lambda = l_1$ and the eigenvector $u_1$ corresponding to $l_1$. For the second principal component, find $u_2$ such that $u_1' u_2 = 0$ and $u_2' S_n u_2$ is maximized. This results in $S_n u_2 = \lambda u_2$ with $u_2' u_1 = 0$; it is easy to see that the second principal component corresponds to the second largest eigenvalue $l_2$ and its eigenvector $u_2$. This remark illustrates the relation between eigenvalues and PCA.
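As a quick numerical illustration of this remark (not part of the original derivation), the following Python sketch computes $S_n$ and its eigendecomposition for synthetic data and checks that the projected variance along the leading eigenvector equals $l_1$. The variable names and the synthetic data are arbitrary choices, and numpy is assumed to be available.

import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 5
X = rng.standard_normal((n, p)) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.1])  # synthetic data
X = X - X.mean(axis=0)                  # center the data: Y = X - X_bar

S_n = X.T @ X / n                       # sample covariance matrix S_n = (1/n) X'X
eigvals, eigvecs = np.linalg.eigh(S_n)  # ascending eigenvalues, orthonormal eigenvectors
order = np.argsort(eigvals)[::-1]       # reorder so that l_1 >= l_2 >= ... >= l_p
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

u1 = eigvecs[:, 0]                      # first principal component
projected_var = np.mean((X @ u1) ** 2)  # (1/n) * sum_i (u_1' x_i)^2
print(projected_var, eigvals[0])        # the two values agree: u_1' S_n u_1 = l_1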
3 Global Distribution of Eigenvalues
3.1 Definitions and Notations
Definition 3.1. A Wigner random matrix is defined as an $n \times n$ symmetric matrix $A = (A_{ij})$ with i.i.d. entries satisfying
$$ A_{ij} = A_{ji}, \qquad E(A_{ij}) = 0, \qquad E(A_{ij}^2) = \frac{1 + \delta_{ij}}{2}, $$
and all moments of the $A_{ij}$ are finite for all $i, j$.
For a Wigner random matrix, if $A_{ij}$ has the normal distribution $N(0, \frac{1+\delta_{ij}}{2})$, then we call the matrix $A$ a Gaussian Orthogonal Ensemble (GOE).

Definition 3.2. A $p \times p$ random matrix $M$ is said to have a Wishart distribution with scale matrix $\Sigma$ and degrees of freedom $n$ if $M = X'X$ where $X \sim N_{n \times p}(\mu, \Sigma)$. This is denoted by $M \sim W_p(n, \Sigma)$.

The Wishart $W_p(n, \Sigma)$ distribution has a density function only when $n \ge p$, and if $M \sim W_p(n, \Sigma)$, it has the following density function
$$ \frac{1}{2^{np/2} \, \Gamma_p(\frac{n}{2}) \, (\det \Sigma)^{n/2}} \, \operatorname{etr}\!\Big( -\frac{1}{2} \Sigma^{-1} M \Big) (\det M)^{(n-p-1)/2} \tag{1} $$
where $\operatorname{etr}$ stands for the exponential of the trace of a matrix. Note that if the data matrix $X \sim N_{n \times p}(\mu, \Sigma)$, the sample covariance matrix $S_n = \frac{1}{n} X'X$ has the Wishart distribution $W_p(n-1, \frac{1}{n}\Sigma)$. Hence any general result regarding the eigenvalues of matrices in $W_p(n, \Sigma)$ can be easily applied to the eigenvalues of sample covariance matrices.
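As a small sanity check of this connection (a sketch under illustrative assumptions, not part of the original text), the Python snippet below compares averages of $M = X'X$ for zero-mean Gaussian data with draws from scipy's Wishart distribution; both should be close to $E(M) = n\Sigma$. The matrix Sigma and all sizes are arbitrary; numpy and scipy are assumed.

import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(1)
n, p = 50, 3
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])    # illustrative scale matrix
L = np.linalg.cholesky(Sigma)

# M = X'X with the rows of X drawn i.i.d. from N(0, Sigma) has the W_p(n, Sigma) distribution
reps = 2000
M_avg = np.zeros((p, p))
for _ in range(reps):
    X = rng.standard_normal((n, p)) @ L.T
    M_avg += X.T @ X
M_avg /= reps

W_avg = wishart(df=n, scale=Sigma).rvs(size=reps).mean(axis=0)

print(np.round(M_avg, 2))   # both averages are close to E(M) = n * Sigma
print(np.round(W_avg, 2))
print(np.round(n * Sigma, 2))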
Definition 3.3. Let $A$ be a $p \times p$ matrix with eigenvalues $l_1, \ldots, l_p$. The empirical (cumulative) distribution function for the eigenvalues of $A$ is
$$ \ell(x) := \frac{1}{p} \sum_{i=1}^{p} \mathbf{1}(l_i \le x). $$
Then the empirical density function is $\ell'(x) = \frac{1}{p} \sum_{i=1}^{p} \delta(x - l_i)$. Now we are ready to investigate the global and local distributions of the eigenvalues of Wigner matrices and sample covariance matrices.
3.2 Wigner Semi-circle Law
Consider a family of Wigner matrices $A$ of dimension $n$, chosen from some distribution. Like the Central Limit Theorem, the Wigner Semi-circle Law shows us that, depending on the type of random matrix, the empirical distribution function converges to a certain non-random law.
Theorem 3.1. Let $A$ be a Wigner matrix of dimension $n$. Let $P_n(x)$ be the empirical distribution of the eigenvalues of the normalized matrix $\frac{1}{2\sqrt{n}} A_n$, so that the eigenvalues lie in the interval $[-1, 1]$. Then its empirical density distribution
$$ \ell'_n(x) = \frac{1}{n} \sum_{i=1}^{n} \delta\Big( x - \frac{l_i}{2\sqrt{n}} \Big) \;\to\; P(x) = \frac{2}{\pi} \sqrt{1 - x^2} $$
with probability 1 as $n \to \infty$ (see Figure 1).

Proof. The basic idea of the proof of this semi-circle law is to compare the moments of the distribution of eigenvalues with those of the semi-circle distribution. That is because the actual distribution is determined by its moments, provided that those moments do not increase too rapidly with $k$. Let $U(x^k)$ be the $k$-th moment of $\ell'_n(x)$:
$$ U(x^k) = \int x^k \, \ell'_n(x) \, dx = \frac{1}{n} \sum_{j=1}^{n} \Big( \frac{l_j}{2\sqrt{n}} \Big)^k $$
Compute the expected value of each moment:
$$ E(U(x^1)) = E\Big( \frac{1}{n} \sum_{j=1}^{n} \frac{l_j}{2\sqrt{n}} \Big) = \frac{1}{2n^{3/2}} E(\operatorname{Tr}(A)) = \frac{1}{2n^{3/2}} \sum_{j=1}^{n} E(A_{jj}) = 0 $$
$$ E(U(x^2)) = E\Big( \frac{1}{n} \sum_{j=1}^{n} \Big( \frac{l_j}{2\sqrt{n}} \Big)^2 \Big) = \frac{1}{4n^2} E(\operatorname{Tr}(A^2)) = \frac{1}{4n^2} \sum_{j=1}^{n} \sum_{k=1}^{n} E(A_{jk}^2) \to \frac{1}{4} $$
And so on, up to higher-order moments. On the other hand, let $C(x^k)$ be the $k$-th moment of the semicircle density $P(x)$:
$$ C(x^k) = \int_{-1}^{1} x^k P(x) \, dx = \frac{2}{\pi} \int_{-1}^{1} x^k \sqrt{1 - x^2} \, dx $$
Substitute $x = \sin\theta$:
$$ C(x^k) = \frac{2}{\pi} \int_{-\pi/2}^{\pi/2} \sin^k\theta \, \cos^2\theta \, d\theta $$
This integral can be evaluated analytically. Define $n!! = 2 \cdot 4 \cdots n$ if $n$ is even, and $n!! = 1 \cdot 3 \cdots n$ if $n$ is odd. Then, for even $k$ (the odd moments vanish by symmetry),
$$ C(x^k) = \frac{2\,(k-1)!!}{(k+2)!!} $$
In particular, we have $C(x^1) = 0$ and $C(x^2) = \frac{1}{4}$. These coincide with the moments of the eigenvalue distribution computed above. By extending this approach to higher moments, we can prove that the eigenvalue distribution converges asymptotically to the semicircle.
Figure 1: Distribution of eigenvalues for the Gaussian Orthogonal Ensemble: Semi-circle Law
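Figure 1 can be reproduced numerically. The sketch below (an illustration, assuming numpy; the matrix size and bin count are arbitrary, and the entries are scaled so that the off-diagonal variance is 1, which makes the $\frac{1}{2\sqrt{n}}$ rescaling of Theorem 3.1 place the spectrum on $[-1, 1]$) compares a histogram of the rescaled eigenvalues with the semicircle density $\frac{2}{\pi}\sqrt{1-x^2}$ and with the moments $C(x^2) = \frac{1}{4}$ and $C(x^4) = \frac{1}{8}$.

import numpy as np

rng = np.random.default_rng(2)
n = 2000
B = rng.standard_normal((n, n))
A = (B + B.T) / np.sqrt(2.0)            # symmetric Gaussian matrix (GOE-type)
lam = np.linalg.eigvalsh(A) / (2.0 * np.sqrt(n))   # rescale as in Theorem 3.1

# Compare the eigenvalue histogram with the semicircle density (2/pi) * sqrt(1 - x^2)
hist, edges = np.histogram(lam, bins=40, range=(-1.0, 1.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
semicircle = (2.0 / np.pi) * np.sqrt(np.clip(1.0 - centers**2, 0.0, None))

print(np.max(np.abs(hist - semicircle)))  # small for large n
print(np.mean(lam**2))                    # close to C(x^2) = 1/4
print(np.mean(lam**4))                    # close to C(x^4) = 2*(3!!)/(6!!) = 1/8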
3.3 Marčenko-Pastur Distribution
The Marčenko-Pastur distribution gives the 'semi-circle'-type law for the sample covariance matrix $S_n$.
Theorem 3.2. Let $S_n \in \mathbb{R}^{p \times p}$ be the sample covariance matrix with ordered eigenvalues $l_1 \ge l_2 \ge \cdots \ge l_p$. Then its empirical density distribution
$$ \ell'(x) = \frac{1}{p} \sum_{i=1}^{p} \delta(x - l_i) \;\to\; G'(x) = \frac{\gamma}{2\pi x} \sqrt{(b - x)(x - a)}, \qquad a \le x \le b, $$
almost surely as $\frac{n}{p} \to \gamma$, with $a = (1 - \gamma^{-1/2})^2$ and $b = (1 + \gamma^{-1/2})^2$, if $\gamma \ge 1$. When $\gamma < 1$, there is an additional mass point at $x = 0$ of weight $(1 - \gamma)$.
Sketch of proof. Similar to the proof of the Semi-Circle Law, this theorem is also proved by comparing the moments of two distributions: the empirical distribution of the eigenvalues and the Marčenko-Pastur distribution. The $k$-th moment of the Marčenko-Pastur density $f_\gamma(x)$ is
$$ \int x^k f_\gamma(x) \, dx = \sum_{r=0}^{k-1} \frac{\gamma^{-r}}{r+1} \binom{k}{r} \binom{k-1}{r} $$
It suffices to show that the $k$-th moment of $\ell'(x)$ converges to the same limit:
$$ E(U(x^k)) = E\Big( \frac{1}{p} \operatorname{Tr}(S_n^k) \Big) \;\to\; \sum_{r=0}^{k-1} \frac{\gamma^{-r}}{r+1} \binom{k}{r} \binom{k-1}{r} $$
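The moment comparison above can be checked numerically. In the sketch below (an illustration with arbitrary $n$ and $p$, assuming numpy), $\gamma = n/p$ as in Theorem 3.2, so the ratio entering the limiting moment formula is $p/n = \gamma^{-1}$; the empirical quantities $\frac{1}{p}\operatorname{Tr}(S_n^k)$ are compared with the limiting sums for small $k$.

import numpy as np
from math import comb

rng = np.random.default_rng(4)
n, p = 3000, 1000
gamma = n / p
X = rng.standard_normal((n, p))          # Sigma = I
S_n = X.T @ X / n

for k in (1, 2, 3):
    empirical = np.trace(np.linalg.matrix_power(S_n, k)) / p
    limit = sum(gamma ** (-r) / (r + 1) * comb(k, r) * comb(k - 1, r) for r in range(k))
    print(k, empirical, limit)           # the pairs agree closely for large n and p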
From this theorem, we can guess the convergence of the largest and smallest eigenvalues $l_1$ and $l_{\min\{n,p\}}$. It is shown that $l_1$ and $l_{\min\{n,p\}}$ converge almost surely to the edges of the support $[a, b]$ of $G(x)$ [Geman '80 and Silverstein '85]:
$$ l_1 \to (1 + \gamma^{-1/2})^2 \quad \text{almost surely}, \qquad l_{\min\{n,p\}} \to (1 - \gamma^{-1/2})^2 \quad \text{almost surely}. $$
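The density and the edge convergence can also be checked by simulation. The following sketch (an illustration with arbitrary sizes, assuming numpy) histograms the eigenvalues of $S_n = \frac{1}{n}X'X$ for $\Sigma = I$, compares them with $G'(x)$, and verifies that the extreme eigenvalues sit near $(1 \pm \gamma^{-1/2})^2$.

import numpy as np

rng = np.random.default_rng(3)
n, p = 4000, 1000                        # gamma = n/p = 4
gamma = n / p
X = rng.standard_normal((n, p))          # Sigma = I
S_n = X.T @ X / n
lam = np.linalg.eigvalsh(S_n)

a = (1.0 - gamma ** -0.5) ** 2           # lower edge of the Marchenko-Pastur support
b = (1.0 + gamma ** -0.5) ** 2           # upper edge

hist, edges = np.histogram(lam, bins=50, range=(a, b), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
mp_density = gamma * np.sqrt((b - centers) * (centers - a)) / (2.0 * np.pi * centers)

print(np.max(np.abs(hist - mp_density)))   # small for large n and p
print(lam.max(), b)                        # largest eigenvalue near (1 + gamma^{-1/2})^2
print(lam.min(), a)                        # smallest eigenvalue near (1 - gamma^{-1/2})^2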
4 Local Distribution of Eigenvalues
The rest of this paper aims to demonstrate the local distribution of the eigenvalues of the Wishart ensemble: (i) representation of the joint density function, and (ii) extraction of the marginal density of the largest eigenvalue of Wishart matrices.
4.1 Joint probability density function for eigenvalues
Theorem 4.1. If $A \sim W_p(n, I)$ with $n > p - 1$, then the joint density function of the eigenvalues $l_1 > l_2 > \cdots > l_p > 0$ of $A$ is
$$ \frac{\pi^{p^2/2}}{2^{np/2} \, \Gamma_p(\frac{p}{2}) \, \Gamma_p(\frac{n}{2})} \, \prod_{i=1}^{p} l_i^{(n-p-1)/2} \prod_{i<j} |l_i - l_j| \, \exp\Big( -\frac{1}{2} \sum_i l_i \Big) $$
Sketch of proof. Write $A = H L H'$, where $L = \operatorname{diag}(l_1, \ldots, l_p)$ and $H$ is an orthogonal matrix parameterized by angles $\alpha_1, \ldots, \alpha_{p(p-1)/2}$, and change variables from the $p(p+1)/2$ distinct entries of $A$ to $(l_1, \ldots, l_p, \alpha_1, \ldots, \alpha_{p(p-1)/2})$. The Jacobian for this change of variables takes the form
$$ J = \begin{vmatrix} \partial A_{11}/\partial l_1 & \cdots & \partial A_{11}/\partial l_p & \partial A_{11}/\partial \alpha_1 & \cdots & \partial A_{11}/\partial \alpha_{p(p-1)/2} \\ \partial A_{12}/\partial l_1 & \cdots & \partial A_{12}/\partial l_p & \partial A_{12}/\partial \alpha_1 & \cdots & \partial A_{12}/\partial \alpha_{p(p-1)/2} \\ \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ \partial A_{pp}/\partial l_1 & \cdots & \partial A_{pp}/\partial l_p & \partial A_{pp}/\partial \alpha_1 & \cdots & \partial A_{pp}/\partial \alpha_{p(p-1)/2} \end{vmatrix} = \prod_{i<j} |l_i - l_j| \cdot h(\alpha_1, \ldots, \alpha_{p(p-1)/2}) $$
$$ P(l_1, \ldots, l_p, \alpha_1, \ldots, \alpha_{p(p-1)/2}) = P(A) \, |J| = C \exp\Big( -\frac{1}{2} \operatorname{Tr}(A) \Big) \prod_{i<j} |l_i - l_j| \cdot h(\alpha_1, \ldots, \alpha_{p(p-1)/2}) $$
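As a sanity check on the normalizing constant in Theorem 4.1 (a sketch, not part of the original proof; the choices $n = 5$, $p = 2$ are arbitrary, and scipy is assumed), the snippet below numerically integrates the joint eigenvalue density over the region $l_1 > l_2 > 0$ and verifies that it integrates to 1.

import numpy as np
from scipy.integrate import dblquad
from scipy.special import multigammaln    # log of the multivariate gamma function Gamma_p(a)

n, p = 5, 2                               # A ~ W_2(5, I): small enough for a 2-D integral

# log of the constant pi^{p^2/2} / (2^{np/2} Gamma_p(p/2) Gamma_p(n/2)) in Theorem 4.1
log_const = (p**2 / 2) * np.log(np.pi) - (n * p / 2) * np.log(2.0) \
            - multigammaln(p / 2.0, p) - multigammaln(n / 2.0, p)

def joint_density(l2, l1):
    # joint density of the ordered eigenvalues l1 > l2 > 0 of A ~ W_2(n, I)
    if l2 <= 0.0 or l2 >= l1:
        return 0.0
    log_f = log_const + 0.5 * (n - p - 1) * (np.log(l1) + np.log(l2)) \
            + np.log(l1 - l2) - 0.5 * (l1 + l2)
    return np.exp(log_f)

total, err = dblquad(joint_density, 0.0, np.inf, lambda l1: 0.0, lambda l1: l1)
print(total)                              # close to 1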