
A Tutorial on Multivariate Statistical Analysis

Craig A. Tracy UC Davis

SAMSI September 2006

ELEMENTARY STATISTICS

Collection of (real-valued) data from a sequence of experiments:
$$ X_1, X_2, \ldots, X_n $$
One might assume the underlying law is $N(\mu,\sigma^2)$ with unknown $\mu$ and $\sigma^2$, and want to estimate $\mu$ and $\sigma^2$ from the data.

Sample Mean & Sample Variance:
$$ \bar X = \frac{1}{n}\sum_j X_j, \qquad S = \frac{1}{n-1}\sum_j \left(X_j - \bar X\right)^2 $$
These estimators are "unbiased":
$$ E(\bar X) = \mu, \qquad E(S) = \sigma^2 $$

Theorem: If $X_1, X_2, \ldots$ are independent $N(\mu,\sigma^2)$ random variables, then $\bar X$ and $S$ are independent. Moreover, $\bar X$ is $N(\mu,\sigma^2/n)$ and $(n-1)S/\sigma^2$ is $\chi^2(n-1)$.
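As a quick sanity check of these facts, here is a minimal Monte Carlo sketch (Python with NumPy; the sample size, parameters, and seed are arbitrary choices, not from the original slides):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 2.0, 3.0, 10, 200_000

# reps independent samples of size n from N(mu, sigma^2)
X = rng.normal(mu, sigma, size=(reps, n))
xbar = X.mean(axis=1)
S = X.var(axis=1, ddof=1)                      # sample variance, 1/(n-1) convention

print(xbar.mean(), mu)                          # E(Xbar) ~ mu
print(xbar.var(), sigma**2 / n)                 # Var(Xbar) ~ sigma^2 / n
print(S.mean(), sigma**2)                       # E(S) ~ sigma^2 (unbiasedness)
print(((n - 1) * S / sigma**2).mean(), n - 1)   # chi^2(n-1) has mean n-1
print(np.corrcoef(xbar, S)[0, 1])               # ~ 0, consistent with independence
```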

Recall $\chi^2(d)$ denotes the chi-squared distribution with $d$ degrees of freedom. Its density is
$$ f_{\chi^2}(x) = \frac{1}{2^{d/2}\,\Gamma(d/2)}\, x^{d/2-1} e^{-x/2}, \qquad x \ge 0, $$
where
$$ \Gamma(z) = \int_0^\infty t^{z-1} e^{-t}\, dt, \qquad \Re(z) > 0. $$

MULTIVARIATE GENERALIZATIONS

From the classic textbook of Anderson [1]:

"Multivariate statistical analysis is concerned with data that consists of sets of measurements on a number of individuals or objects. The sample data may be heights and weights of some individuals drawn randomly from a population of school children in a given city, or the statistical treatment may be made on a collection of measurements, such as lengths and widths of petals and lengths and widths of sepals of iris plants taken from two species, or one may study the scores on batteries of mental tests administered to a number of students."

p = # of sets of measurements on a given individual, n = # of observations = sample size

Remarks:
• In the above examples, one can assume that $p \ll n$ since typically many measurements will be taken.
• Today it is common for $p \gg 1$, so $n/p$ is no longer necessarily large.
  – Vehicle Sound Signature Recognition: Vehicle noise is a stochastic signal. The power spectrum is discretized to a vector of length $p = 1200$ with $n \approx 1200$ samples from the same kind of vehicle.
  – Astrophysics: The Sloan Digital Sky Survey typically has many observations (say of quasar spectra), with the spectrum of each quasar binned, resulting in a large $p$.
  – Financial data: S&P 500 stocks observed over monthly intervals for twenty years.

GAUSSIAN DATA MATRICES

The data are now $n$ independent column vectors of length $p$,
$$ \vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n, $$
from which we construct the $n\times p$ data matrix
$$ X = \begin{pmatrix} \longleftarrow \vec{x}_1^{\,T} \longrightarrow \\ \longleftarrow \vec{x}_2^{\,T} \longrightarrow \\ \vdots \\ \longleftarrow \vec{x}_n^{\,T} \longrightarrow \end{pmatrix} $$
The Gaussian assumption is that
$$ \vec{x}_j \sim N_p(\mu, \Sigma). $$
Many applications assume the mean has already been subtracted out of the data, i.e. $\mu = 0$.

Multivariate Gaussian Distribution

If $x$ and $y$ are vectors, the matrix $x\otimes y$ is defined by
$$ (x\otimes y)_{jk} = x_j y_k. $$
If $\mu = E(x)$ is the mean of the random vector $x$, then the covariance matrix of $x$ is the $p\times p$ matrix
$$ \Sigma = E\left[(x-\mu)\otimes(x-\mu)\right]. $$
$\Sigma$ is a symmetric, non-negative definite matrix. If $\Sigma > 0$ (positive definite) and $X\sim N_p(\mu,\Sigma)$, then the density function of $X$ is
$$ f_X(x) = (2\pi)^{-p/2}(\det\Sigma)^{-1/2}\exp\left(-\tfrac12\left\langle x-\mu,\,\Sigma^{-1}(x-\mu)\right\rangle\right), \qquad x\in\mathbb{R}^p. $$
Sample mean:
$$ \bar{x} = \frac{1}{n}\sum_j \vec{x}_j, \qquad E(\bar{x}) = \mu. $$

Sample covariance matrix:
$$ S = \frac{1}{n-1}\sum_{j=1}^{n}(\vec{x}_j - \bar{x})\otimes(\vec{x}_j - \bar{x}) $$
For $\mu = 0$ the sample covariance matrix can be written simply as
$$ \frac{1}{n-1}\,X^T X. $$
Some Notation: If $X$ is an $n\times p$ data matrix formed from the $n$ independent column vectors $x_j$, $\mathrm{cov}(x_j) = \Sigma$, we can form one column vector $\mathrm{vec}(X)$ of length $pn$:
$$ \mathrm{vec}(X) = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} $$
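A minimal numerical illustration of the sample covariance matrix (a sketch in Python/NumPy; the particular $\Sigma$, dimensions, and seed are arbitrary and not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 3, 5000
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])

# n x p data matrix whose rows are x_j^T with x_j ~ N_p(0, Sigma)   (mu = 0)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

S_zero_mean = X.T @ X / (n - 1)        # the mu = 0 form, X^T X / (n-1)
S_centered = np.cov(X, rowvar=False)   # the (x_j - xbar) form

print(np.round(S_zero_mean, 2))        # both approximate Sigma for large n
print(np.round(S_centered, 2))
```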

The covariance of $\mathrm{vec}(X)$ is the $np\times np$ matrix
$$ I_n\otimes\Sigma = \begin{pmatrix} \Sigma & 0 & \cdots & 0 \\ 0 & \Sigma & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \Sigma \end{pmatrix} $$
In this case we say the data matrix $X$ constructed from $n$ independent $\vec{x}_j\sim N_p(\mu,\Sigma)$ has distribution
$$ N_p(M,\, I_n\otimes\Sigma) $$
where $M = E(X) = 1\otimes\mu$ and $1$ is the column vector of all 1's.

Definition: If $A = X^T X$ where the $n\times p$ matrix $X$ is $N_p(0, I_n\otimes\Sigma)$, $\Sigma > 0$, then $A$ is said to have the Wishart distribution with $n$ degrees of freedom and covariance matrix $\Sigma$. We will say $A$ is $W_p(n,\Sigma)$.

Remarks:
• The Wishart distribution is the multivariate generalization of the chi-squared distribution.
• $A\sim W_p(n,\Sigma)$ is positive definite with probability one if and only if $n\ge p$.
• The sample covariance matrix,
$$ S = \frac{1}{n-1}A, $$
is $W_p\!\left(n-1,\ \tfrac{1}{n-1}\Sigma\right)$.
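A small simulation sketch of this definition (Python/NumPy; the choice of $\Sigma$, $n$, and the number of repetitions is mine, for illustration only). It checks two standard facts: $E(A) = n\Sigma$, and positive definiteness when $n \ge p$.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n, reps = 3, 10, 20_000
Sigma = np.array([[1.0, 0.4, 0.0],
                  [0.4, 1.0, 0.2],
                  [0.0, 0.2, 1.0]])
C = np.linalg.cholesky(Sigma)

A_mean = np.zeros((p, p))
min_eig = np.inf
for _ in range(reps):
    X = rng.standard_normal((n, p)) @ C.T    # rows are i.i.d. N_p(0, Sigma)
    A = X.T @ X                              # A ~ W_p(n, Sigma)
    A_mean += A / reps
    min_eig = min(min_eig, np.linalg.eigvalsh(A).min())

print(np.round(A_mean, 2))   # approximately n * Sigma, since E(A) = n Sigma
print(min_eig > 0)           # A is positive definite (a.s.) because n >= p
```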

WISHART DENSITY FUNCTION, $n\ge p$

Let $\mathcal{S}_p$ denote the space of $p\times p$ positive definite (symmetric) matrices. If $A = (a_{jk})\in\mathcal{S}_p$, let
$$ (dA) = \text{volume element of } A = \bigwedge_{j\le k} da_{jk}. $$
The multivariate gamma function is
$$ \Gamma_p(a) = \int_{\mathcal{S}_p} e^{-\mathrm{tr}(A)}(\det A)^{a-(p+1)/2}\,(dA), \qquad \Re(a) > (p-1)/2. $$
Theorem: If $A$ is $W_p(n,\Sigma)$ with $n\ge p$, then the density function of $A$ is
$$ \frac{1}{2^{np/2}\,\Gamma_p(n/2)\,(\det\Sigma)^{n/2}}\; e^{-\frac12\mathrm{tr}(\Sigma^{-1}A)}\,(\det A)^{(n-p-1)/2}. $$

Sketch of Proof:
• The density function for $X$ is the multivariate Gaussian (including the volume element $(dX)$)
$$ (2\pi)^{-np/2}(\det\Sigma)^{-n/2}\, e^{-\frac12\mathrm{tr}(\Sigma^{-1}X^TX)}\,(dX). $$
• Recall the QR factorization [8]: Let $X$ denote an $n\times p$ matrix with $n\ge p$ and full column rank. Then there exist a unique $n\times p$ matrix $Q$ with $Q^TQ = I_p$ and a unique $p\times p$ upper triangular matrix $R$ with positive diagonal elements such that $X = QR$. Note $A = X^TX = R^TR$. (A numerical sketch of this factorization follows the proof outline.)
• A Jacobian calculation [1, 13]: If $A = X^TX$, then
$$ (dX) = 2^{-p}(\det A)^{(n-p-1)/2}\,(dA)\,(Q^T dQ) $$
where
$$ (Q^T dQ) = \bigwedge_{j=1}^{p}\ \bigwedge_{k=j+1}^{n} q_k^T\, dq_j $$
and $Q = (q_1,\ldots,q_p)$ is the column representation of $Q$.
• Thus the joint distribution of $A$ and $Q$ is
$$ (2\pi)^{-np/2}(\det\Sigma)^{-n/2}\, e^{-\frac12\mathrm{tr}(\Sigma^{-1}A)}\ \times\ 2^{-p}(\det A)^{(n-p-1)/2}\,(dA)\,(Q^T dQ). $$
• Now integrate over all $Q$. Use the fact that
$$ \int_{V_{n,p}} (Q^T dQ) = \frac{2^p\,\pi^{np/2}}{\Gamma_p(n/2)} $$
where $V_{n,p}$ is the set of real $n\times p$ matrices $Q$ satisfying $Q^TQ = I_p$. (When $n = p$ this is the orthogonal group.)
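The promised numerical sketch of the QR step (Python/NumPy; dimensions and seed are arbitrary). It normalizes the signs so that $R$ has positive diagonal, which is the uniqueness convention used above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 8, 3
X = rng.standard_normal((n, p))

# Thin QR: Q is n x p with Q^T Q = I_p, R is p x p upper triangular.
Q, R = np.linalg.qr(X, mode="reduced")
# Fix signs so the diagonal of R is positive; with this convention QR is unique.
signs = np.sign(np.diag(R))
Q, R = Q * signs, signs[:, None] * R

A = X.T @ X
print(np.allclose(X, Q @ R))            # X = QR
print(np.allclose(Q.T @ Q, np.eye(p)))  # Q^T Q = I_p
print(np.allclose(A, R.T @ R))          # A = X^T X = R^T R
```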

Remarks regarding the Wishart density function:
• The case $p = 2$ was obtained by R. A. Fisher in 1915.
• The general $p$ case is due to J. Wishart in 1928, by geometrical arguments.
• The proof outlined above came later. (See [1, 13] for a complete proof.)
• When $Q$ is a $p\times p$ orthogonal matrix,
$$ \frac{\Gamma_p(p/2)}{2^p\,\pi^{p^2/2}}\ (Q^T dQ) $$
is normalized Haar measure for the orthogonal group $\mathcal{O}(p)$. We denote this Haar measure by $(dQ)$.
• Siegel proved (see, e.g., [13]; a computational sketch follows these remarks)
$$ \Gamma_p(a) = \pi^{p(p-1)/4}\prod_{j=1}^{p}\Gamma\!\left(a - \tfrac12(j-1)\right). $$
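A short sketch of Siegel's product formula in code (Python standard library only; the function name is mine, for illustration):

```python
import math

def multivariate_gamma_ln(a, p):
    """log Gamma_p(a) via Siegel's product formula (valid for a > (p-1)/2)."""
    return (p * (p - 1) / 4) * math.log(math.pi) + sum(
        math.lgamma(a - 0.5 * (j - 1)) for j in range(1, p + 1)
    )

# Gamma_1(a) reduces to the ordinary gamma function:
print(math.exp(multivariate_gamma_ln(2.5, 1)), math.gamma(2.5))
# Gamma_2(a) = pi^{1/2} * Gamma(a) * Gamma(a - 1/2):
print(math.exp(multivariate_gamma_ln(2.5, 2)),
      math.sqrt(math.pi) * math.gamma(2.5) * math.gamma(2.0))
```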

EIGENVALUES OF A WISHART MATRIX

Theorem: If $A$ is $W_p(n,\Sigma)$ with $n\ge p$, the joint density function for the eigenvalues $\ell_1 > \cdots > \ell_p$ of $A$ is
$$ \frac{\pi^{p^2/2}}{\Gamma_p(p/2)\,\Gamma_p(n/2)}\ 2^{-np/2}(\det\Sigma)^{-n/2}\ \prod_{j=1}^{p}\ell_j^{(n-p-1)/2}\ \prod_{j<k}|\ell_j-\ell_k|\ \times\ \int_{\mathcal{O}(p)} e^{-\frac12\mathrm{tr}(\Sigma^{-1}QLQ^T)}\,(dQ) $$
where $L = \mathrm{diag}(\ell_1,\ldots,\ell_p)$ and $(dQ)$ is normalized Haar measure. Note that $\Delta(\ell) := \prod_{j<k}(\ell_j-\ell_k)$ is the Vandermonde.

For $\Sigma = I_p$ the group integral reduces to $e^{-\frac12\sum_j\ell_j}$.

Proof: Recall that the Wishart density function (times the volume element) is
$$ \frac{1}{2^{np/2}\,\Gamma_p(n/2)\,(\det\Sigma)^{n/2}}\; e^{-\frac12\mathrm{tr}(\Sigma^{-1}A)}\,(\det A)^{(n-p-1)/2}\,(dA). $$
The idea is to diagonalize $A$ by an orthogonal transformation and then integrate over the orthogonal group, thereby giving the density function for the eigenvalues of $A$. Let $\ell_1 > \cdots > \ell_p$ be the ordered eigenvalues of $A$,
$$ A = QLQ^T, \qquad L = \mathrm{diag}(\ell_1,\ldots,\ell_p), \qquad Q\in\mathcal{O}(p). $$
The $j$th column of $Q$ is a normalized eigenvector of $A$. The transformation is not 1–1 since $Q = [\pm q_1,\ldots,\pm q_p]$ works for each fixed $A$. The transformation is made 1–1 by requiring that the first element of each $q_j$ is nonnegative. This restricts $Q$ (as $A$ varies) to a $2^{-p}$ part of $\mathcal{O}(p)$. We compensate for this at the end.

We need an expression for the volume element $(dA)$ in terms of $Q$ and $L$. First we compute the differential of $A$:
$$ dA = dQ\,L\,Q^T + Q\,dL\,Q^T + Q\,L\,dQ^T $$
$$ Q^T dA\,Q = Q^T dQ\,L + dL + L\,dQ^T Q = -\,dQ^T Q\,L + L\,dQ^T Q + dL = \left[L,\, dQ^T Q\right] + dL. $$
(We used $Q^TQ = I$, which implies $Q^T dQ = -\,dQ^T Q$.)

We now use the following fact (see, e.g., page 58 in [13]): If $X = BYB^T$ where $X$ and $Y$ are $p\times p$ symmetric matrices and $B$ is a nonsingular $p\times p$ matrix, then $(dX) = (\det B)^{p+1}(dY)$. In our case $Q$ is orthogonal, so the volume element $(dA)$ equals the volume element $(Q^T dA\,Q)$. The volume element is the exterior product of the diagonal elements of $Q^T dA\,Q$ times the exterior product of the elements above the diagonal.

Since $L$ is diagonal, the commutator $\left[L,\, dQ^T Q\right]$ has zero diagonal elements. Thus the exterior product of the diagonal elements of $Q^T dA\,Q$ is $\bigwedge_j d\ell_j$. The exterior product of the elements coming from the commutator is
$$ \prod_{j<k}(\ell_j-\ell_k)\ \bigwedge_{j<k} q_k^T\, dq_j. $$

One is interested in limit laws as $n, p\to\infty$.
• For $\Sigma = I_p$, Johnstone [11] proved, using RMT methods, that with centering and scaling constants
$$ \mu_{np} = \left(\sqrt{n-1}+\sqrt{p}\right)^2, \qquad \sigma_{np} = \left(\sqrt{n-1}+\sqrt{p}\right)\left(\frac{1}{\sqrt{n-1}}+\frac{1}{\sqrt{p}}\right)^{1/3}, $$
the quantity
$$ \frac{\ell_1 - \mu_{np}}{\sigma_{np}} $$
converges in distribution as $n, p\to\infty$, $n/p\to\gamma < \infty$, to the GOE largest eigenvalue distribution [15]. (A small simulation sketch follows these remarks.)
• El Karoui [6] has extended the result to $\gamma\le\infty$. The case $p\gg n$ appears, for example, in microarray data.
• Soshnikov [14] has lifted the Gaussian assumption under the additional restriction $n - p = O(p^{1/3})$.
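The promised simulation sketch (Python/NumPy; the dimensions, repetition count, and seed are my choices). It centers and scales the largest eigenvalue with Johnstone's constants; the empirical mean should be roughly $-1.2$, in the vicinity of the GOE Tracy–Widom mean:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, reps = 400, 100, 2000

mu_np = (np.sqrt(n - 1) + np.sqrt(p)) ** 2
sigma_np = (np.sqrt(n - 1) + np.sqrt(p)) * (1 / np.sqrt(n - 1) + 1 / np.sqrt(p)) ** (1 / 3)

samples = np.empty(reps)
for i in range(reps):
    X = rng.standard_normal((n, p))            # Sigma = I_p
    ell_1 = np.linalg.eigvalsh(X.T @ X)[-1]    # largest eigenvalue of A = X^T X
    samples[i] = (ell_1 - mu_np) / sigma_np

# The empirical law should approximate the GOE Tracy-Widom distribution
# (whose mean is roughly -1.2).
print(samples.mean(), samples.std())
```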

• For $\Sigma\ne I_p$, the difficulty in establishing limit theorems comes from the integral
$$ \int_{\mathcal{O}(p)} e^{-\frac12\mathrm{tr}(\Sigma^{-1}Q\Lambda Q^T)}\,(dQ). $$
Using zonal polynomials, infinite series expansions have been derived for this integral, but these expansions are difficult to analyze. See Muirhead [13].
• For complex Gaussian data matrices $X$, similar density formulas are known for the eigenvalues of $X^*X$. Limit theorems for $\Sigma\ne I_p$ are known since the analogous group integral, now over the unitary group, is known explicitly: the Harish-Chandra–Itzykson–Zuber (HCIZ) integral (see, e.g., [17]). See the work of Baik, Ben Arous and Péché [2, 3] and El Karoui [7].

PRINCIPAL COMPONENT ANALYSIS (PCA), H. Hotelling, 1933

Population Principal Components: Let $x$ be a $p\times 1$ random vector with $E(x) = \mu$ and $\mathrm{cov}(x) = \Sigma > 0$. Let $\lambda_1\ge\lambda_2\ge\cdots\ge\lambda_p$ denote the eigenvalues of $\Sigma$ and $H$ an orthogonal matrix diagonalizing $\Sigma$: $H^T\Sigma H = \Lambda = \mathrm{diag}(\lambda_1,\ldots,\lambda_p)$. We write $H$ in column vector form
$$ H = [h_1,\ldots,h_p] $$
so that $h_j$ is the $p\times 1$ eigenvector of $\Sigma$ corresponding to eigenvalue $\lambda_j$. Define the $p\times 1$ vector
$$ u = H^T x = (u_1,\ldots,u_p)^T; $$
then
$$ \mathrm{cov}(u) = E\left[(H^Tx - H^T\mu)\otimes(H^Tx - H^T\mu)\right] = H^T E\left[(x-\mu)\otimes(x-\mu)\right] H = H^T\Sigma H = \Lambda, $$
so the $u_j$ are uncorrelated.

Definition: $u_j$ is called the $j$th principal component of $x$. Note $\mathrm{var}(u_j) = \lambda_j$.

Statistical interpretation: The claim is that $u_1$ is the linear combination of components of $x$ that has maximum variance.

Proof: For simplicity of notation, set $\mu = 0$. Let $b$ denote any $p\times 1$ vector, $b^Tb = 1$, and form $b^Tx$. Then
$$ \mathrm{var}(b^Tx) = E\left(b^Tx\cdot b^Tx\right) = E\left(b^Tx\,(b^Tx)^T\right) = b^T E(xx^T)\,b = b^T\Sigma\,b. $$
We want to maximize the right hand side subject to the constraint $b^Tb = 1$. By the method of Lagrange multipliers we maximize
$$ b^T\Sigma\,b - \lambda\,(b^Tb - 1). $$
Since $\Sigma$ is symmetric the vector of partial derivatives is
$$ 2\Sigma b - 2\lambda b. $$
Thus $b$ must be an eigenvector with eigenvalue $\lambda$. The largest variance corresponds to choosing the largest eigenvalue.

The general result is that $u_r$ has maximum variance among all normalized combinations uncorrelated with $u_1,\ldots,u_{r-1}$.

Sample principal components: Let $S$ denote the sample covariance matrix of the data matrix $X$ and let $Q = [q_1,\ldots,q_p]$ be a $p\times p$ orthogonal matrix diagonalizing $S$:
$$ Q^T S\,Q = \mathrm{diag}(\ell_1,\ldots,\ell_p). $$
The $\ell_j$ are the sample variances that are estimates for the $\lambda_j$. The vectors $q_j$ are sample estimates for the vectors $h_j$. If $x$ is the random vector and $u = H^Tx$ is the vector of principal components, then $\hat{u} = Q^Tx$ is the vector of sample principal components.
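A minimal sketch of sample PCA (Python/NumPy; the population $\Sigma$, dimensions, and seed are arbitrary choices). It estimates the $\lambda_j$ by the $\ell_j$ and checks that the sample principal components are (nearly) uncorrelated:

```python
import numpy as np

rng = np.random.default_rng(4)
p, n = 4, 2000
H = np.linalg.qr(rng.standard_normal((p, p)))[0]   # population eigenvectors
lam = np.array([6.0, 3.0, 1.0, 0.5])               # population eigenvalues
Sigma = H @ np.diag(lam) @ H.T

X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
S = np.cov(X, rowvar=False)                         # sample covariance matrix

ell, Q = np.linalg.eigh(S)                          # ascending eigenvalues
ell, Q = ell[::-1], Q[:, ::-1]                      # reorder: ell_1 >= ... >= ell_p
U_hat = X @ Q                                       # sample principal components

print(np.round(ell, 2))                             # estimates of lambda_j
print(np.round(np.cov(U_hat, rowvar=False), 2))     # ~ diag(ell): uncorrelated
```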

SCREE PLOTS

In applications: How many of the ℓj’s are significant?

CANONICAL CORRELATION ANALYSIS (CCA), H. Hotelling, 1936

Suppose a large data set is naturally decomposed into two groups. For example, $p\times 1$ random vectors $\vec{x}_1,\ldots,\vec{x}_n$ make up one set and $q\times 1$ random vectors $\vec{y}_1,\ldots,\vec{y}_m$ the other. We are interested in the correlations between these two data sets. For example, in medicine we might have $n$ measurements of age, height, and weight ($p = 3$) and $m$ measurements of systolic and diastolic blood pressures ($q = 2$). We are interested in what combination of the components of $x$ is most correlated with a combination of the components of $y$.

Population Canonical Correlations: Let $x$ and $y$ be two random vectors of size $p\times 1$ and $q\times 1$, respectively. We assume $E(x) = E(y) = 0$ and $p\le q$. Form the $(p+q)\times 1$ vector
$$ \begin{pmatrix} x \\ y \end{pmatrix} $$
and its $(p+q)\times(p+q)$ covariance matrix

$$ \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}. $$
Let
$$ u := \alpha^T x\in\mathbb{R}, \qquad v := \gamma^T y\in\mathbb{R}, $$
where $\alpha$ and $\gamma$ are vectors to be determined. We want to maximize the correlation
$$ \mathrm{corr}(u,v) = \frac{\mathrm{cov}(u,v)}{\sqrt{\mathrm{var}(u)\,\mathrm{var}(v)}}. $$
The correlation does not change under scale transformations $u\to cu$, etc., so we can maximize this correlation subject to the constraints
$$ E(u^2) = E(\alpha^T x\cdot\alpha^T x) = \alpha^T\Sigma_{11}\alpha = 1 \qquad (1) $$
$$ E(v^2) = \gamma^T\Sigma_{22}\gamma = 1 \qquad (2) $$

Under these constraints $\mathrm{corr}(u,v) = E(\alpha^T x\cdot\gamma^T y) = \alpha^T\Sigma_{12}\gamma$. Let
$$ \psi = \alpha^T\Sigma_{12}\gamma - \tfrac12\rho\,(\alpha^T\Sigma_{11}\alpha - 1) - \tfrac12\lambda\,(\gamma^T\Sigma_{22}\gamma - 1) $$
where $\rho$ and $\lambda$ are Lagrange multipliers. Set the vector of partial derivatives to zero:
$$ \frac{\partial\psi}{\partial\alpha} = \Sigma_{12}\gamma - \rho\,\Sigma_{11}\alpha = 0 \qquad (3) $$
$$ \frac{\partial\psi}{\partial\gamma} = \Sigma_{12}^T\alpha - \lambda\,\Sigma_{22}\gamma = 0 \qquad (4) $$
If we left multiply (3) by $\alpha^T$ and (4) by $\gamma^T$ and use the normalization conditions (1) and (2), we conclude $\lambda = \rho$. Thus (3) and (4) become
$$ \begin{pmatrix} -\rho\Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & -\rho\Sigma_{22} \end{pmatrix}\begin{pmatrix} \alpha \\ \gamma \end{pmatrix} = 0 \qquad (5) $$
with $\mathrm{corr}(u,v) = \rho$, and $\rho\ge 0$ is a solution to
$$ \det\begin{pmatrix} -\rho\Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & -\rho\Sigma_{22} \end{pmatrix} = 0. $$
This is a polynomial in $\rho$ of degree $p+q$. Let $\rho_1$ denote the maximum root and $\alpha_1$, $\gamma_1$ the corresponding solutions to (5).

Definition: $u_1$ and $v_1$ are called the first canonical variables and their correlation $\rho_1 = \mathrm{corr}(u_1, v_1)$ is called the first canonical correlation coefficient.

More generally, the $r$th pair of canonical variables is the pair of linear combinations $u_r = (\alpha^{(r)})^Tx$ and $v_r = (\gamma^{(r)})^Ty$, each of unit variance and uncorrelated with the first $r-1$ pairs of canonical variables and having maximum correlation. The correlation $\mathrm{corr}(u_r, v_r)$ is the $r$th canonical correlation coefficient.

REDUCTION TO AN EIGENVALUE PROBLEM

Since
$$ \begin{pmatrix} -\rho\Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & -\rho\Sigma_{22} \end{pmatrix} = \begin{pmatrix} \Sigma_{11} & 0 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} -\rho I & \Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1} \\ \Sigma_{21} & -\rho I \end{pmatrix}\begin{pmatrix} 1 & 0 \\ 0 & \Sigma_{22} \end{pmatrix}, $$
the determinantal equation becomes
$$ \det\left(\rho^2 I - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\right) = 0. $$
The nonzero roots $\rho_1,\ldots,\rho_k$ are called the population canonical correlation coefficients, where $k = \mathrm{rank}(\Sigma_{12})$. In applications $\Sigma$ is not known. One uses the sample covariance matrix $S$ to obtain sample canonical correlation coefficients.
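A minimal sketch of this eigenvalue computation (Python/NumPy; the helper name canonical_correlations and the toy covariance matrix are mine, for illustration only). The squared canonical correlations are the eigenvalues of $\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$, whose nonzero part coincides with that of the matrix in the determinantal equation above:

```python
import numpy as np
from numpy.linalg import solve, eigvals

def canonical_correlations(S, p, q):
    """Canonical correlations from a (p+q) x (p+q) covariance matrix, p <= q."""
    S11, S12 = S[:p, :p], S[:p, p:]
    S21, S22 = S[p:, :p], S[p:, p:]
    # Squared canonical correlations = eigenvalues of S11^{-1} S12 S22^{-1} S21.
    rho2 = eigvals(solve(S11, S12) @ solve(S22, S21)).real
    return np.sqrt(np.sort(np.clip(rho2, 0.0, 1.0))[::-1])

# Toy population covariance with p = 2, q = 3.
p, q = 2, 3
Sigma = np.eye(p + q)
Sigma[0, p] = Sigma[p, 0] = 0.7          # x_1 correlated with y_1
Sigma[1, p + 1] = Sigma[p + 1, 1] = 0.3  # x_2 correlated with y_2
print(canonical_correlations(Sigma, p, q))   # -> approximately [0.7, 0.3]
```

The same function applied to the sample covariance matrix $S$ gives the sample canonical correlation coefficients.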

AN EXAMPLE

The first application of Hotelling's canonical correlations is by F. Waugh [16] in 1942. He begins his paper with:

"Professor Hotelling's paper, 'Relations between Two Sets of Variates,' should be widely known and his method used by practical statisticians. Yet, few practical statisticians seem to know of the paper, and perhaps those few are inclined to regard it as a mathematical curiosity rather than an important and useful method for analyzing concrete problems. This may be due to ..."

The Practical Problem: Relation of wheat characteristics to flour characteristics.

Guiding principle: The grade of the raw material should give a good indication of the probable grade of the finished product.

The data: 138 samples of Canadian Hard Red Spring wheat and the flour made from each of these samples.*

wheat quality                      flour quality
x1 = kernel texture                y1 = wheat per barrel of flour
x2 = test weight                   y2 = ash in flour
x3 = damaged kernels               y3 = crude protein in flour
x4 = crude protein in wheat        y4 = gluten quality index
x5 = foreign material

$$ u_1 = \alpha_1^T x = \text{index of wheat quality}, \qquad v_1 = \gamma_1^T y = \text{index of flour quality} $$
$$ \mathrm{corr}(u_1, v_1) = 0.909 $$

*In this example $p > q$. The data are normalized to mean 0 and variance 1.

DISTRIBUTION OF SAMPLE CANONICAL CORRELATION COEFFICIENTS

Given a sample of $n$ observations on $\begin{pmatrix} x \\ y \end{pmatrix}$ drawn from $N_{p+q}(\mu,\Sigma)$, let $A$ be the (unnormalized) sample covariance matrix. Then $A$ is $W_{p+q}(n,\Sigma)$. We have [13]:

Theorem (Constantine, 1963): Let $A$ have the $W_{p+q}(n,\Sigma)$ distribution where $p\le q$, $n\ge p+q$, and $\Sigma$ and $A$ are partitioned as above. Then the joint probability density function of $r_1^2,\ldots,r_p^2$, the eigenvalues of $A_{11}^{-1}A_{12}A_{22}^{-1}A_{21}$ (let $\xi_j := r_j^2$), is
$$ c_{p,q,n}\ \prod_{j=1}^{p}(1-\rho_j^2)^{n/2}\ \prod_{j=1}^{p}\left[\xi_j^{(q-p-1)/2}(1-\xi_j)^{(n-p-q-1)/2}\right]\cdot|\Delta(\xi)|\cdot\mathcal{F}(\xi) $$
where
• $c_{p,q,n}$ is a normalization constant;
• $\Delta(\xi)$ = Vandermonde determinant = $\prod_{j<k}(\xi_j-\xi_k)$.

Null Distribution: For $\Sigma_{12} = 0$ ($x$ and $y$ are independent), the above joint density for the sample canonical correlation coefficients reduces to
$$ c_{p,q,n}\ \prod_{j=1}^{p}\left[\xi_j^{(q-p-1)/2}(1-\xi_j)^{(n-p-q-1)/2}\right]\cdot|\Delta(\xi)|. $$
In this case the distribution of the largest sample canonical correlation coefficient $r_1$ can be used for testing the null hypothesis
$$ H:\ \Sigma_{12} = 0. $$
We reject $H$ for large values of $r_1$.
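A Monte Carlo sketch of this test (Python/NumPy; the dimensions, sample size, repetition count, and seed are my choices, not from the slides). It simulates the null distribution of $r_1$ and reads off an approximate 5% critical value:

```python
import numpy as np
from numpy.linalg import solve, eigvals

rng = np.random.default_rng(5)
p, q, n, reps = 2, 3, 50, 5000

def largest_sample_canonical_corr(X, Y):
    S = np.cov(np.hstack([X, Y]), rowvar=False)
    S11, S12 = S[:p, :p], S[:p, p:]
    S21, S22 = S[p:, :p], S[p:, p:]
    rho2 = eigvals(solve(S11, S12) @ solve(S22, S21)).real
    return np.sqrt(np.clip(rho2, 0.0, 1.0).max())

r1_null = np.array([
    largest_sample_canonical_corr(rng.standard_normal((n, p)),
                                  rng.standard_normal((n, q)))
    for _ in range(reps)
])
# Monte Carlo estimate of the 5% critical value for testing H: Sigma_12 = 0.
print(np.quantile(r1_null, 0.95))
```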

ZONAL POLYNOMIALS

Zonal polynomials naturally arise in multivariate analysis when considering group integrals such as
$$ \int_{\mathcal{O}(m)} e^{-\mathrm{tr}(XHYH^T)}\,(dH) $$
where $X$ and $Y$ are $m\times m$ symmetric, positive definite matrices and $(dH)$ is normalized Haar measure. Zonal polynomials can be defined either through the representation theory of $GL(m,\mathbb{R})$ [12] or as eigenfunctions of certain Laplacians [5, 13].

Let $\mathcal{S}_m$ denote the space of $m\times m$ symmetric, positive definite matrices. Zonal polynomials $C_\lambda(X)$, $X\in\mathcal{S}_m$, are certain homogeneous polynomials in the eigenvalues of $X$ that are indexed by partitions $\lambda$. To give the precise definition we need some preliminary definitions.

Definition: A partition $\lambda$ of $n$ is a sequence $\lambda = (\lambda_1,\lambda_2,\ldots,\lambda_\ell)$ where the $\lambda_j\ge 0$ are weakly decreasing and $\sum_j\lambda_j = n$. We denote this by $\lambda\vdash n$.

For example, $\lambda = (5,3,3,1)$ is a partition of 12. The number of nonzero parts of $\lambda$ is called the length of $\lambda$, denoted $\ell(\lambda)$.

If $\lambda$ and $\mu$ are two partitions of $n$, we say $\lambda < \mu$ (lexicographic order) if, for some index $i$, $\lambda_j = \mu_j$ for $j < i$ and $\lambda_i < \mu_i$. For example,
$$ (1,1,1,1) < (2,1,1) < (2,2) < (3,1) < (4). $$
If $\lambda\vdash n$ with $\ell(\lambda) = m$, we define the monomial
$$ x^\lambda = x_1^{\lambda_1}x_2^{\lambda_2}\cdots x_m^{\lambda_m} $$
and say $x^\mu$ is of higher weight than $x^\lambda$ if $\mu > \lambda$.

Metrics and Laplacians

Let $X\in\mathcal{S}_m$ and $dX = (dx_{jk})$ the matrix of differentials of $X$. We define a metric on $\mathcal{S}_m$ by
$$ (ds)^2 = \mathrm{tr}\left(X^{-1}dX\cdot X^{-1}dX\right). $$
A simple computation shows this metric is invariant under
$$ X\longrightarrow LXL^T, \qquad L\in GL(m,\mathbb{R}). $$
Let $n = m(m+1)/2$ = # of independent elements of $X$. We denote by $\mathrm{vec}(X)\in\mathbb{R}^n$ the column vector representation of $X$, e.g.
$$ X = \begin{pmatrix} x_{11} & x_{12} \\ x_{12} & x_{22} \end{pmatrix}, \qquad x := \mathrm{vec}(X) = \begin{pmatrix} x_{11} \\ x_{12} \\ x_{22} \end{pmatrix}. $$
The metric $(ds)^2$ is a quadratic differential in $dx$.

For example,
$$ (ds)^2 = \begin{pmatrix} dx_{11} & dx_{12} & dx_{22} \end{pmatrix} G(x)\begin{pmatrix} dx_{11} \\ dx_{12} \\ dx_{22} \end{pmatrix} $$
where
$$ G(x) = (g_{ij}) = \frac{1}{(x_{11}x_{22}-x_{12}^2)^2}\begin{pmatrix} x_{22}^2 & -2x_{12}x_{22} & x_{12}^2 \\ -2x_{12}x_{22} & 2x_{11}x_{22}+2x_{12}^2 & -2x_{11}x_{12} \\ x_{12}^2 & -2x_{11}x_{12} & x_{11}^2 \end{pmatrix}. $$
Labeling $x = \mathrm{vec}(X)$ with a single index, the differential is in the standard form
$$ (ds)^2 = \sum_{i,j} dx_i\, g_{ij}\, dx_j $$
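This $m = 2$ computation can be reproduced symbolically; here is a minimal sketch (Python with SymPy, my choice of tool). It expands $\mathrm{tr}(X^{-1}dX\,X^{-1}dX)$ and reads off the symmetric coefficient matrix $G$:

```python
import sympy as sp

x11, x12, x22 = sp.symbols('x11 x12 x22', positive=True)
d11, d12, d22 = sp.symbols('d11 d12 d22')   # stand-ins for dx11, dx12, dx22

X = sp.Matrix([[x11, x12], [x12, x22]])
dX = sp.Matrix([[d11, d12], [d12, d22]])

# (ds)^2 = tr(X^{-1} dX X^{-1} dX), a quadratic form in (d11, d12, d22)
ds2 = sp.expand((X.inv() * dX * X.inv() * dX).trace())

dx = [d11, d12, d22]
G = sp.Matrix(3, 3, lambda i, j: sp.Rational(1, 2) * sp.diff(ds2, dx[i], dx[j]))

# Clearing the prefactor 1/(x11*x22 - x12^2)^2 should reproduce the matrix above.
print(sp.simplify(G * (x11 * x22 - x12**2) ** 2))
```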

and the Laplacian associated to this metric is
$$ \Delta_X = (\det G)^{-1/2}\sum_{j=1}^{n}\frac{\partial}{\partial x_j}\left[(\det G)^{1/2}\sum_{i=1}^{n} g^{ij}\frac{\partial}{\partial x_i}\right] $$
where $G^{-1} = (g^{ij})$. More succinctly, if
$$ \nabla_X = \begin{pmatrix} \dfrac{\partial}{\partial x_1} \\ \vdots \\ \dfrac{\partial}{\partial x_n} \end{pmatrix}, $$
then
$$ \Delta_X = (\det G)^{-1/2}\left(\nabla_X,\ (\det G)^{1/2}G^{-1}\nabla_X\right) $$
where $(\cdot\,,\cdot)$ is the standard inner product on $\mathbb{R}^n$.

One can show that $\Delta_X$ is an invariant differential operator:
$$ \Delta_{LXL^T} = \Delta_X, \qquad L\in GL(m,\mathbb{R}). $$
We now diagonalize $X$:
$$ X = HYH^T, \qquad Y = \mathrm{diag}(y_1,\ldots,y_m), \qquad H\in\mathcal{O}(m). $$

The Laplacian ∆X is now expressed in terms of a radial part and an angular part. The radial part of ∆X is the differential operator

$$ \sum_{j=1}^{m} y_j^2\frac{\partial^2}{\partial y_j^2}\ +\ \sum_{j=1}^{m}\sum_{\substack{k=1\\ k\ne j}}^{m}\frac{y_j^2}{y_j-y_k}\frac{\partial}{\partial y_j}\ +\ \sum_{j} y_j\frac{\partial}{\partial y_j} $$

We now let ∆X denote only the radial part. We can now define zonal polynomials!

Definition: Let $X\in\mathcal{S}_m$ with eigenvalues $x_1,\ldots,x_m$ and let $\lambda = (\lambda_1,\ldots,\lambda_m)\vdash k$ be a partition of $k$ into not more than $m$ parts. Then $C_\lambda(X)$ is the symmetric, homogeneous polynomial of degree $k$ in the $x_j$ such that

1. The term of highest weight in $C_\lambda(X)$ is $x^\lambda$.

2. $C_\lambda(X)$ is an eigenfunction of the Laplacian $\Delta_X$.

3. $\displaystyle (\mathrm{tr}(X))^k = (x_1+\cdots+x_m)^k = \sum_{\substack{\lambda\vdash k\\ \ell(\lambda)\le m}} C_\lambda(X)$ (a numerical check of this identity for $k = 2$ follows the remarks below).

Remarks:
• One must show there is a unique polynomial satisfying these requirements.
• The eigenvalue in (2) equals $\alpha_\lambda := \sum_j\lambda_j(\lambda_j - j) + k(m+1)/2$.
• The program MOPS [5] computes zonal polynomials.
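The promised check of property 3 for $k = 2$ (Python/NumPy). The two degree-two zonal polynomials are quoted from the literature (e.g. tables in [13]) and are an assumption here, not derived on these slides:

```python
import numpy as np

def C_2(A):
    """Degree-two zonal polynomial C_(2) (tabulated form, e.g. in [13])."""
    return (np.trace(A) ** 2 + 2 * np.trace(A @ A)) / 3

def C_11(A):
    """Degree-two zonal polynomial C_(1,1)."""
    return 2 * (np.trace(A) ** 2 - np.trace(A @ A)) / 3

rng = np.random.default_rng(6)
B = rng.standard_normal((3, 3))
X = B @ B.T + np.eye(3)              # a random positive definite matrix

# Property 3 for k = 2: (tr X)^2 = C_(2)(X) + C_(1,1)(X)
print(np.trace(X) ** 2, C_2(X) + C_11(X))
```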

The following theorem is at the core of why zonal polynomials appear in multivariate statistical analysis.

Theorem: If $X, Y\in\mathcal{S}_p$, then
$$ \int_{\mathcal{O}(p)} C_\lambda(XHYH^T)\,(dH) = \frac{C_\lambda(X)\,C_\lambda(Y)}{C_\lambda(I_p)} \qquad (6) $$
where $(dH)$ is normalized Haar measure.

Proof: Let $f_\lambda(Y)$ denote the left-hand side of (6) and let $Q\in\mathcal{O}(p)$. Then $f_\lambda(QYQ^T) = f_\lambda(Y)$ (let $H\to HQ$ in the integral and use the invariance of the measure). Thus $f_\lambda$ is a symmetric function of the eigenvalues of $Y$. Since $C_\lambda$ is homogeneous of degree $|\lambda|$, so is $f_\lambda$. Now apply the Laplacian to $f_\lambda$:
$$ \Delta_Y f_\lambda(Y) = \Delta_Y\int_{\mathcal{O}(p)} C_\lambda(XHYH^T)\,(dH) = \Delta_Y\int C_\lambda(X^{1/2}HYH^TX^{1/2})\,(dH) = \Delta_Y\int C_\lambda(LYL^T)\,(dH) \quad (L = X^{1/2}H) $$
$$ = \int\Delta_{LYL^T}\,C_\lambda(LYL^T)\,(dH) \quad\text{(invariance of }\Delta_Y\text{)}\ =\ \alpha_\lambda\int C_\lambda(LYL^T)\,(dH)\ =\ \alpha_\lambda\, f_\lambda(Y). $$
Hence, by the definition of $C_\lambda$, we have $f_\lambda(Y) = d_\lambda\, C_\lambda(Y)$ for some constant $d_\lambda$. Since $f_\lambda(I_p) = C_\lambda(X)$, we find $d_\lambda = C_\lambda(X)/C_\lambda(I_p)$ and the theorem follows.
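Theorem (6) can be checked numerically by averaging over Haar-distributed orthogonal matrices; here is a minimal Monte Carlo sketch (Python with NumPy and SciPy; the degree-two formula for $C_{(2)}$ is again quoted from the literature as an assumption, and the matrices, sample size, and seed are arbitrary):

```python
import numpy as np
from scipy.stats import ortho_group

def C_2(A):
    """Degree-two zonal polynomial C_(2) (tabulated form, e.g. in [13])."""
    return (np.trace(A) ** 2 + 2 * np.trace(A @ A)) / 3

p = 3
X = np.diag([1.0, 2.0, 3.0])
Y = np.diag([0.5, 1.0, 4.0])

# Monte Carlo average over Haar-distributed H in O(p)
H = ortho_group.rvs(dim=p, size=20000, random_state=7)
lhs = np.mean([C_2(X @ h @ Y @ h.T) for h in H])
rhs = C_2(X) * C_2(Y) / C_2(np.eye(p))
print(lhs, rhs)   # should agree up to Monte Carlo error
```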

Using Zonal Polynomials to Evaluate Group Integrals

$$ \int_{\mathcal{O}(p)} e^{-\rho\,\mathrm{tr}(XHYH^T)}\,(dH) = \sum_{k=0}^{\infty}\frac{(-\rho)^k}{k!}\int_{\mathcal{O}(p)}\left(\mathrm{tr}(XHYH^T)\right)^k(dH) $$
$$ = \sum_{k=0}^{\infty}\frac{(-\rho)^k}{k!}\sum_{\substack{\lambda\vdash k\\ \ell(\lambda)\le p}}\int_{\mathcal{O}(p)} C_\lambda(XHYH^T)\,(dH) = \sum_{k=0}^{\infty}\frac{(-\rho)^k}{k!}\sum_{\substack{\lambda\vdash k\\ \ell(\lambda)\le p}}\frac{C_\lambda(X)\,C_\lambda(Y)}{C_\lambda(I_p)}. $$

References

[1] T. W. Anderson, An Introduction to Multivariate Statistical Analysis, third edition, John Wiley & Sons, Inc., 2003.
[2] J. Baik, G. Ben Arous, and S. Péché, Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices, Ann. Probab. 33 (2005), 1643–1697.
[3] J. Baik, Painlevé formulas of the limiting distributions for nonnull complex sample covariance matrices, Duke Math. J. 133 (2006), 205–235.
[4] M. Dieng and C. A. Tracy, Application of random matrix theory to multivariate statistics, arXiv:math.PR/0603543.
[5] I. Dumitriu, A. Edelman, and G. Shuman, MOPS: Multivariate Orthogonal Polynomials (symbolically), arXiv:math-ph/0409066.
[6] N. El Karoui, On the largest eigenvalue of Wishart matrices with identity covariance when n, p and p/n tend to infinity, arXiv:math.ST/0309355.
[7] N. El Karoui, Tracy–Widom limit for the largest eigenvalue of a large class of complex Wishart matrices, arXiv:math.PR/0503109.
[8] G. H. Golub and C. F. Van Loan, Matrix Computations, second edition, Johns Hopkins University Press, 1989.
[9] H. Hotelling, Analysis of a complex of statistical variables into principal components, J. of Educational Psychology 24 (1933), 417–441, 498–520.
[10] H. Hotelling, Relations between two sets of variates, Biometrika 28 (1936), 321–377.
[11] I. M. Johnstone, On the distribution of the largest eigenvalue in principal component analysis, Ann. Stat. 29 (2001), 295–327.
[12] I. G. Macdonald, Symmetric Functions and Hall Polynomials, 2nd ed., Oxford University Press, 1995.
[13] R. J. Muirhead, Aspects of Multivariate Statistical Theory, John Wiley & Sons, Inc., 1982.
[14] A. Soshnikov, A note on universality of the distribution of the largest eigenvalue in certain sample covariance matrices, J. Statistical Physics 108 (2002), 1033–1056.
[15] C. A. Tracy and H. Widom, On orthogonal and symplectic matrix ensembles, Commun. Math. Phys. 177 (1996), 727–754.
[16] F. V. Waugh, Regression between sets of variables, Econometrica 10 (1942), 290–310.
[17] P. Zinn-Justin and J.-B. Zuber, On some integrals over the U(N) unitary group and their large N limit, J. Phys. A: Math. Gen. 36 (2003), 3173–3193.
