
Random Matrices and Multivariate Statistical Analysis

Iain Johnstone, Stanford, [email protected]

SEA’06@MIT – p.1

Agenda
• Classical multivariate techniques
  • Principal Component Analysis
  • Canonical Correlations
  • Multivariate Regression
• Hypothesis Testing: Single and Double Wishart
• Eigenvalue densities
• Linear Statistics
  • Single Wishart
  • Double Wishart
• Largest Eigenvalue
  • Single Wishart
  • Double Wishart
• Concluding Remarks

Classical Multivariate Statistics

Canonical methods are based on spectral decompositions:

One matrix (Wishart):
• Principal Component analysis

Two matrices (independent Wisharts):
• Multivariate Analysis of Variance (MANOVA)
• Multivariate regression
• Discriminant analysis
• Canonical correlation analysis
• Tests of equality of covariance matrices

Gaussian Data Matrices

[Figure: the n × p data matrix X — n cases (rows) by p variables (columns)]

Independent rows: x_i ∼ N_p(0, Σ), i = 1, ..., n; equivalently X ∼ N(0, I_n ⊗ Σ_p).

Zero mean ⇒ no centering in S:

S = (S_kk′) = (1/n) XᵀX,  S_kk′ = (1/n) ∑_{i=1}^{n} x_ik x_ik′,  nS ∼ W_p(n, Σ)
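A minimal sketch of this construction (the sizes and Σ below are illustrative assumptions, not from the talk): form S = (1/n)XᵀX from mean-zero Gaussian rows, so that nS ∼ W_p(n, Σ).

```python
import numpy as np

# Sketch (illustrative n, p, Sigma): sample matrix S = (1/n) X^T X from
# mean-zero Gaussian rows, so nS ~ W_p(n, Sigma).
rng = np.random.default_rng(0)
n, p = 200, 5
Sigma = np.diag([4.0, 2.0, 1.0, 1.0, 1.0])  # assumed population covariance
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
S = X.T @ X / n                             # no centering: the mean is known to be 0

evals = np.linalg.eigvalsh(S)               # sorted ascending: l_p <= ... <= l_1
assert np.allclose(S, S.T)                  # S is symmetric
assert evals.min() > -1e-10                 # and positive semi-definite
```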

Principal Components Analysis

Hotelling, 1933. X_1, ..., X_n ∼ N_p(µ, Σ).

Low-dimensional subspace “explaining most variance”:

l_i = max{uᵀSu : uᵀu = 1, uᵀu_j = 0, j < i}

Eigenvalues of Wishart: A = nS ∼ Wp(n, Σ):

Au_i = l_i u_i,  l_1 ≥ ... ≥ l_p ≥ 0.

Key question: How many li are “significant”?

"scree" plot of singular values of phoneme data 300

250

200

150

100

50 SEA’06@MIT – p.5 0 0 50 100 150 Canonical Correlations X1 X ··· n jointly p + q –variatenormal. Y1 Yn “Most predictable criterion”: (Hotelling, 1935, 1936).

max_{u_i, v_i} Corr(u_iᵀX, v_iᵀY)

⇒ Av_i = r_i²(A + B)v_i,  r_1² ≥ ... ≥ r_p².

Two independent Wishart distributions:

A ∼ W_p(q, Σ),  B ∼ W_p(n − q, Σ).
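The generalized eigenproblem above can be sketched numerically by symmetric whitening with (A + B)^{-1/2} (the Wishart draws and dimensions below are illustrative assumptions):

```python
import numpy as np

# Sketch: solve A v = r^2 (A + B) v by whitening with (A + B)^{-1/2}.
# A, B are illustrative identity-covariance Wishart draws.
rng = np.random.default_rng(1)
p, q, n = 3, 5, 50
Z = rng.standard_normal((q, p)); A = Z.T @ Z          # A ~ W_p(q, I)
W = rng.standard_normal((n - q, p)); B = W.T @ W      # B ~ W_p(n - q, I)

evals, evecs = np.linalg.eigh(A + B)
inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T   # (A + B)^{-1/2}
r2 = np.sort(np.linalg.eigvalsh(inv_sqrt @ A @ inv_sqrt))[::-1]

# squared canonical correlations, sorted r_1^2 >= ... >= r_p^2, lie in [0, 1]
assert np.all(r2 > -1e-10) and np.all(r2 < 1 + 1e-10)
```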

Multivariate Multiple Regression

Y = XB + U,  U ∼ N(0, I_n ⊗ Σ)
(Y: n × p, X: n × q, B: q × p, U: n × p)

n = # observations; p = # response variables; q = # predictor variables.

P = X(XᵀX)⁻¹Xᵀ, the projection onto V = span{cols(X)}.

YᵀY = YᵀPY + Yᵀ(I − P)Y = H + E,
with H the hypothesis SSP and E the error SSP.

H ∼ W_p(q, Σ) independent of E ∼ W_p(n − q, Σ)
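The SSP decomposition above can be checked directly (sizes below are illustrative; Y is drawn under B = 0):

```python
import numpy as np

# Sketch: Y'Y = Y'PY + Y'(I - P)Y = H + E, with P the projection onto
# span{cols(X)}.  Illustrative sizes, data generated with B = 0.
rng = np.random.default_rng(2)
n, p, q = 40, 3, 4
X = rng.standard_normal((n, q))
Y = rng.standard_normal((n, p))

P = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix
H = Y.T @ P @ Y                            # hypothesis SSP
E = Y.T @ (np.eye(n) - P) @ Y              # error SSP

assert np.allclose(P @ P, P)               # P is idempotent
assert np.allclose(H + E, Y.T @ Y)         # the SSP decomposition
```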


Hypothesis Testing

Null hypothesis H_0, nested within alternative hypothesis H_A.

Test statistics: functions of the eigenvalues, T = T(l_1, ..., l_p).

Null hypothesis distribution: P(T > t | H_0 true). RMT offers tools for exact calculation and for approximation based on p → ∞.

Single Wishart: A ∼ W_p(n, I), eigenvalues det(A − l_i I) = 0.
Test H_0: Σ = I (or λI) versus H_A: Σ unrestricted.

Double Wishart: H ∼ W_p(q, Σ), E ∼ W_p(n − q, Σ) independently. Eigenvalues det(H − l_i(E + H)) = 0.
Typical hypothesis test (e.g. from Y = XB + U): H_0: B = 0 versus H_A: B unrestricted.

Likelihood Ratio Test

If X ∼ N(0, I_n ⊗ Σ), the density is

f_Σ(X) = det(2πΣ)^{−n/2} exp{−(n/2) tr Σ⁻¹S}

Log-likelihood: Σ → ℓ(Σ | X) = log f_Σ(X) = c_np − (n/2) log det Σ − (n/2) tr Σ⁻¹S.

Maximum likelihood occurs at Σ̂ = S:

max_Σ ℓ(Σ | X) = c_np − (n/2) log det S

Likelihood ratio test of H_0: Σ = I vs. H_A: Σ unrestricted:

log LR = max_{Σ ∈ H_0} ℓ(Σ | X) − max_{Σ ∈ H_A} ℓ(Σ | X) = c_np + (n/2) ∑_i (log l_i − l_i)

Linear statistics in the eigenvalues of S: ∑_i log l_i, ∑_i l_i.
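A minimal numerical sketch of the LR statistic as a linear statistic in the eigenvalues (sizes illustrative; the statistic is written up to the additive constant c_np):

```python
import numpy as np

# Sketch: log LR for H0: Sigma = I, up to c_np, is (n/2) * sum(log l_i - l_i)
# over the eigenvalues l_i of S.  Data drawn under H0.
rng = np.random.default_rng(3)
n, p = 100, 4
X = rng.standard_normal((n, p))
S = X.T @ X / n
l = np.linalg.eigvalsh(S)

log_lr = 0.5 * n * np.sum(np.log(l) - l)
# log x - x <= -1 for all x > 0, so the statistic is at most -(n/2) p
assert l.min() > 0
assert log_lr <= -0.5 * n * p + 1e-9
```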

(Union-) Intersection Principle

Combine test statistics:

H_0: Σ = I ⇔ ∩_{|a|=1} H_{0a}: aᵀΣa = 1.

Var(aᵀX) = aᵀΣa, so reject H_{0a} if the sample version aᵀSa > c_a.

Reject H_0 ⇔ reject some H_{0a} ⇔ max_a aᵀSa > c_max ⇔ l_max(S) > c_max.

Summary:
• Likelihood ratio principle → linear statistics in eigenvalues
• Intersection principle → extreme eigenvalues

SEA’06@MIT – p.11 Agenda • Classical multivariate techniques • Hypothesis Testing: Single and Double Wishart • Eigenvalue densities • Linear Statistics • Largest Eigenvalue • Concluding Remarks

Eigenvalue densities – single Wishart

Statistics (n, p): f(l) = c ∏_{i=1}^{p} l_i^{(n−p−1)/2} e^{−l_i/2} ∏_{j<k} |l_j − l_k|

RMT (N, α): the same density with N = p, α = n − p.

Notation change has significance! Statistics: no necessary relation between p and n; traditional approximation uses p fixed, n → ∞.

RMT: N → ∞ with α fixed is most natural (in statistics, fixing n − p would be less natural).

Eigenvalue densities – double Wishart

Statistics: If H ∼ W_p(q, I) and E ∼ W_p(n − q, I) are independent, then the joint density of the eigenvalues {u_i} of H(H + E)⁻¹ is

f(u) = c ∏_{i=1}^{p} u_i^{(q−p−1)/2} (1 − u_i)^{(n−q−p−1)/2} ∏_{i<j} (u_i − u_j).

RMT (Jacobi ensemble):

f(x) = c ∏_{i=1}^{N+1} (1 − x_i)^{(α−1)/2} (1 + x_i)^{(β−1)/2} ∏_{i<j} |x_i − x_j|.

Convergence of Empirical Spectra

For eigenvalues {l_i}_{i=1}^{p}: G_p(t) = p⁻¹ #{l_i ≤ t} → G(t), with G(t) = ∫ g(t) dt.

Single Wishart (Marčenko–Pastur, 1967): A ∼ W_p(n, I). If p/n → c > 0,

g^MP(t) = √((b₊ − t)(t − b₋)) / (2πct),  b± = (1 ± √c)².

Double Wishart (Wachter, 1980): det(H − l_i(H + E)) = 0.
If p ≤ q, p/n → c = sin²(γ/2) > 0, q/n → sin²(φ/2),

g^W(t) = √((b₊ − t)(t − b₋)) / (2πct(1 − t)),  b± = sin²((φ ± γ)/2).
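The Marčenko–Pastur limit above can be sketched numerically (the aspect ratio c = 1/4 and sizes are illustrative assumptions):

```python
import numpy as np

# Sketch: compare the spectrum of S = X'X/n with the Marchenko-Pastur
# support [b_-, b_+], b_pm = (1 ± sqrt(c))^2, for c = p/n = 0.25.
rng = np.random.default_rng(4)
n, p = 2000, 500
c = p / n
b_minus, b_plus = (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2

X = rng.standard_normal((n, p))
l = np.linalg.eigvalsh(X.T @ X / n)

t = np.linspace(b_minus + 1e-6, b_plus - 1e-6, 20000)
g = np.sqrt((b_plus - t) * (t - b_minus)) / (2 * np.pi * c * t)
integral = np.sum(g) * (t[1] - t[0])       # Riemann sum of g^MP over its support

assert abs(integral - 1) < 1e-2            # g^MP integrates to 1 when c <= 1
assert np.mean((l > b_minus - 0.1) & (l < b_plus + 0.1)) > 0.99
```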

SEA’06@MIT – p.15 Agenda • Classical multivariate techniques • Hypothesis Testing: Single and Double Wishart • Eigenvalue densities • Linear Statistics • Single Wishart • Double Wishart • Largest Eigenvalue • Concluding Remarks

Linear Statistics: Single Wishart

Approximate distributions.

Statistics:
• Typically p fixed; standard χ² approximation,
• improvements by ’Bartlett correction’.

RMT:
• Central limit theorems (p large) for linear statistics of eigenvalues; large literature.

Jonsson (1982): S ∼ W_p(n, I), p/n → c > 0. With d(c) = (1 − c⁻¹) log(1 − c) − 1,

log det S − p·d(c) →_D N(½ log(1 − c), −2 log(1 − c))   (1)
tr S − p →_D N(0, 2c)

Surprise: the quality of the approximation in (1) even for small p (e.g. p = 2!).
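Jonsson's centering and the normal limit (1) can be sketched directly (c = 1/2 and the sizes are illustrative; the final check is only a loose sanity bound for this seed):

```python
import numpy as np

# Sketch: Jonsson's d(c) and the normal approximation (1) for log det S.
rng = np.random.default_rng(5)
n, p = 200, 100
c = p / n
d = (1 - 1 / c) * np.log(1 - c) - 1        # d(1/2) = log 2 - 1
assert abs(d - (np.log(2) - 1)) < 1e-12

X = rng.standard_normal((n, p))
S = X.T @ X / n
stat = np.linalg.slogdet(S)[1] - p * d     # approx N(mu, sigma^2) by (1)
mu = 0.5 * np.log(1 - c)
sigma = np.sqrt(-2 * np.log(1 - c))
assert abs(stat - mu) < 5 * sigma          # loose sanity check for this seed
```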

Small p asymptotics

[Figure: QQ plot of sample data versus standard normal]

  n      p    qtile   pFix    pBaiS
  100    2    0.90    0.923   0.899
  100    20   0.90    1.000   0.900
  100    60   0.90    1.000   0.902
  1000   20   0.90    0.990   0.900
  100    2    0.95    0.965   0.951
  100    20   0.95    1.000   0.951
  100    60   0.95    1.000   0.949
  1000   20   0.95    0.997   0.950
  100    2    0.99    0.995   0.992
  100    20   0.99    1.000   0.990
  100    60   0.99    1.000   0.990
  1000   20   0.99    1.000   0.990

CLT for Likelihood

Bai–Silverstein (2004):

∑_{i=1}^{p} f(l_i) − p ∫ f(x) g^MP(x) dx →_D X_f ∼ N(EX_f, Cov(X_f)),

Cov(X_f, X_g) = −(1/2π²) ∮_{Γ₁} ∮_{Γ₂} f(z(m₁)) g(z(m₂)) / (m₁ − m₂)² dm₁ dm₂

⇒ CLT for the null distribution of the LR test of H_0: Σ = I:

∑_{i=1}^{p} (log l_i − l_i + 1) →_D N(p·d(c) + ½ log(1 − c), −2[log(1 − c) + c]).

Linear Statistics: Double Wishart

Hypothesis tests based on the eigenvalues u_i of H(H + E)⁻¹, i.e. the eigenvalues w_i = u_i/(1 − u_i) of HE⁻¹.
Many standard tests are linear statistics S_N(g) = ∑_{i=1}^{p} g(u_i):
• Wilks Λ: log Λ = ∑_{i=1}^{p} log(1 − u_i) [likelihood ratio test]
• Pillai’s trace = ∑_{i=1}^{p} u_i
• Hotelling–Lawley trace = ∑ u_i/(1 − u_i) = ∑_{i=1}^{p} w_i

• Roy’s largest root = u(1).
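The four statistics above can be sketched from the eigenvalues u_i (the Wishart draws and sizes are illustrative assumptions):

```python
import numpy as np

# Sketch: classical MANOVA statistics as functions of the eigenvalues u_i
# of H(H + E)^{-1}.  Illustrative identity-covariance draws.
rng = np.random.default_rng(6)
p, q, n = 3, 4, 30
Zh = rng.standard_normal((q, p)); H = Zh.T @ Zh        # H ~ W_p(q, I)
Ze = rng.standard_normal((n - q, p)); E = Ze.T @ Ze    # E ~ W_p(n - q, I)

u = np.linalg.eigvals(H @ np.linalg.inv(H + E)).real
wilks = np.sum(np.log(1 - u))                          # log Lambda
pillai = np.sum(u)
hotelling_lawley = np.sum(u / (1 - u))
roy = u.max()

# log Lambda = log det E - log det(H+E), since prod(1 - u_i) = det E / det(H+E)
assert np.isclose(wilks, np.linalg.slogdet(E)[1] - np.linalg.slogdet(H + E)[1])
assert 0 < roy < 1
```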

Basor–Chen (2005): unitary case, formal; N → ∞, α, β fixed.

S_N(g) − (2N + α + β) a_g →_D N(0, b_{g·g}),

a_g = (1/2π) ∫_{−1}^{1} g(x)/√(1 − x²) dx,
b_{g·g} = (1/π²) ∫_{−1}^{1} [g(x)/√(1 − x²)] · P∫_{−1}^{1} [√(1 − y²)/(y − x)] g′(y) dy dx

SEA’06@MIT – p.20 Agenda • Classical multivariate techniques • Hypothesis Testing: Single and Double Wishart • Eigenvalue densities • Linear Statistics • Largest Eigenvalue • Single Wishart • Double Wishart • Concluding Remarks

Largest Eigenvalue – Single Wishart

’Usual’ approach to maxima is (classically) infeasible: {l_(1) ≤ x} = ∩_{i=1}^{p} {l_i ≤ x}, but the l_i are dependent.

Key role: determinants, not independence:

∏_{i<j} (l_i − l_j) = det[l_i^{k−1}]_{1≤i,k≤p}

··· ⇒ P{max_{1≤i≤p} l_i ≤ t} = det(I − K_p χ_[t,∞))

K_p(x, y) is a (2 × 2 matrix) kernel built from {Laguerre, Jacobi} orthogonal polynomials via Christoffel–Darboux summation.

Tracy–Widom Limit

For real (β = 1, IMJ) or complex (β = 2, Johansson) data, if n/p → c ∈ (0, ∞):

F_p(s) = P{l_1 ≤ µ_np + σ_np s} → F_β(s), with

µ_np = (√n + √p)²,  σ_np = (√n + √p)(1/√n + 1/√p)^{1/3}
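The centering and scaling above can be applied directly to a simulated largest eigenvalue (sizes illustrative; the final bound is a loose sanity check for this seed):

```python
import numpy as np

# Sketch: Tracy-Widom centering mu_np and scaling sigma_np for the largest
# eigenvalue of A = X'X, X an n x p standard Gaussian matrix.
rng = np.random.default_rng(7)
n, p = 400, 100
mu = (np.sqrt(n) + np.sqrt(p)) ** 2
sigma = (np.sqrt(n) + np.sqrt(p)) * (1 / np.sqrt(n) + 1 / np.sqrt(p)) ** (1 / 3)

X = rng.standard_normal((n, p))
l1 = np.linalg.eigvalsh(X.T @ X).max()
s = (l1 - mu) / sigma                      # approximately F_1 distributed

assert mu > 0 and sigma > 0
assert -8 < s < 5                          # TW_1 mass is essentially in [-5, 3]
```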

El Karoui (2004): in the complex case, for refined µ_np, σ_np,

|F_p(s) − F₂(s)| ≤ C e^{−s} p^{−2/3}.

Also, results for
• n → ∞, p → ∞ separately, and
• under alternative hypotheses.

Painlevé II and Tracy–Widom

Painlevé II: q′′ = xq + 2q³

q(x) ∼ Ai(x) as x → ∞

[Figure: the Painlevé II solution q(x)]

Tracy–Widom distributions:

F₂(s) = exp{−∫_s^∞ (x − s) q²(x) dx}
F₁(s) = (F₂(s))^{1/2} exp{−½ ∫_s^∞ q(x) dx}


Largest Root – Double Wishart

Assume p, q(p), n(p) → ∞. Define

γ_p/2 = sin⁻¹ √((p − .5)/(n − 1)),  φ_p/2 = sin⁻¹ √((q − .5)/(n − 1)),

µ_{p±} = sin²((φ_p ± γ_p)/2),  σ³_{p+} = sin⁴(φ_p + γ_p) / [(2n − 2)² sin φ_p sin γ_p].

Simply,

(u_1 − µ_{p+})/σ_{p+} →_D W₁ ∼ F₁

More precisely, on the logit scale, logit(u) = log(u/(1 − u)) (IMJ, PJF):

|P{logit(u_1) ≤ logit(µ_{p+}) + s·σ_{p+}·logit′(µ_{p+})} − F₁(s)| ≤ C e^{−s/4} p^{−2/3}

• the corrections (.5, 1, 2) improve the approximation for small p, q,

• ⇒ error is O(p^{−2/3}) [instead of O(p^{−1/3})]
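The centering and scaling above can be sketched numerically; note the exact constants are as reconstructed here and should be treated as assumptions (p, q, n illustrative):

```python
import numpy as np

# Sketch of the double-Wishart centering/scaling (constants as reconstructed).
p, q, n = 5, 10, 100
gamma = 2 * np.arcsin(np.sqrt((p - 0.5) / (n - 1)))
phi = 2 * np.arcsin(np.sqrt((q - 0.5) / (n - 1)))

mu_plus = np.sin((phi + gamma) / 2) ** 2
sigma_plus = (np.sin(phi + gamma) ** 4
              / ((2 * n - 2) ** 2 * np.sin(phi) * np.sin(gamma))) ** (1 / 3)

assert 0 < mu_plus < 1                     # a squared canonical correlation
assert 0 < sigma_plus < 1
```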

Approximation vs. Tables for p = 5

[Figure: Tracy–Widom approximation vs. table values at the 95th percentile for p = 5; squared correlation plotted against m_c for n_c = 2, 5, 10, 20, 40, 100, 500]

Tables: William Chen, IRS (2002):
m_c = (q − p − 1)/2 ∈ [0, 15],  n_c = (n − q − p − 1)/2 ∈ [1, 1000].

Remarks

• p^{−2/3} scale of variability for u_1
• 95th %tile ≈ µ_{p+} + σ_{p+}, 99th %tile ≈ µ_{p+} + 2σ_{p+}
• if µ_{p+} > .7, the logit scale v_i = log[u_i/(1 − u_i)] is better.

• Smallest eigenvalue: with the previous assumptions and γ₀ < φ₀, let σ³_{p−} = sin⁴(φ_p − γ_p) / [(2n − 2)² sin φ_p sin γ_p]; then

(µ_{p−} − u_p)/σ_{p−} →_D W₁ (W₂)

• Corresponding limit distributions for u_2 ≥ ··· ≥ u_k and u_{p−k} ≥ ··· ≥ u_{p−1}, k fixed.

SEA’06@MIT – p.27 Agenda • Classical multivariate techniques • Hypothesis Testing: Single and Double Wishart • Eigenvalue densities • Linear Statistics • Largest Eigenvalue • Concluding Remarks

Concluding Remarks

Numerous other topics deserve attention:
• distributions under alternative hypotheses: integral representations, matrix hypergeometric functions
• empirical distributions and graphical display (Wachter)
• computational advances (Dumitriu, Edelman, Koev, Rao):
  • operations on random matrices
  • multivariate orthogonal polynomials
  • matrix hypergeometric functions
• estimation and testing for eigenvectors (Paul)
• technical role for RMT in other statistical areas, e.g. via large-deviations results

Back-Up Slides

Upper Bound in SAS

Approximate ((n − q)/q) · u_1/(1 − u_1) by F_{q, n−q}.

[Figure: Tracy–Widom, Chen table (p = 5), and SAS F-approximation at the 95th percentile; squared correlation against m_c for n_c = 2, 5, 10, 20, 40, 100, 500]

Testing Subsequent Correlations

Suppose Σ_XY = [diag(ρ₁², ..., ρ_p²)  0], a p × q matrix, p ≤ q, n − p.

If the largest r correlations are large, test

H_r: ρ_{r+1} = ρ_{r+2} = ··· = ρ_p = 0?

Comparison Lemma (from SVD interlacing):
L(u_{r+1} | p, q, n; Σ_XY ∈ H_r) <_st L(u_1 | p, q − r, n; I)

⇒ conservative P-values for H_r via the TW(p, q − r, n) approximation to the RHS

[Aside: L(u_1 | p − r, q − r, n; I) may be better, but no bounds]