Statistical Inference for Astronomers: Multivariate Analysis
Summer School in Statistics for Astronomers & Physicists, June 5-10, 2005
Center for Astrostatistics, Pennsylvania State University

Statistical Inference for Astronomers: Multivariate Analysis
Thriyambakam Krishnan
Systat Software Asia-Pacific Limited, Bangalore, India

Multivariate Statistical Analysis

- Statistical theory, methods, algorithms, etc. for the simultaneous study of more than one variable
- Descriptive statistics and graphical representation
- Inference problems similar to the univariate case, based on the multivariate normal
- Study of relationships between variables and finding structure
- Problems of combining variables and dimensionality reduction

Multivariate Normal Distribution

Notation: $X \sim N(\mu, \sigma^2)$ univariate normal; for a $p$-column vector $X$, $X \sim N_p(\mu, \Sigma)$ multivariate normal.

Reasons for studying the multivariate normal:
1. $p$-variate generalization of the univariate normal;
2. same reasons as for the univariate normal in univariate analysis;
3. multivariate central limit theorem;
4. robustness of some procedures;
5. theory and methods analogous to the univariate case based on $N$, like $t$ and Hotelling's $T^2$, ANOVA and MANOVA;
6. not many other multivariate models;
7. mathematically tractable and elegant;
8. similar parameters: mean vector $\mu$, covariance matrix $\Sigma$.

Bivariate Normal

Let $X = (X_1, X_2)^T$, $X \sim N_2(\mu, \Sigma)$, with
$$\mu = (\mu_1, \mu_2)^T, \qquad \Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix}, \qquad \sigma_{12} = \sigma_{21}.$$
$\Sigma$ is non-negative definite. Let $\sigma_1^2 = \sigma_{11}$, $\sigma_2^2 = \sigma_{22}$. The correlation coefficient is
$$\rho = \frac{\sigma_{12}}{\sqrt{\sigma_{11}\sigma_{22}}}.$$
The $N_2$ density (if $\Sigma$ is p.d.) is, for $(x_1, x_2) \in \mathbb{R}^2$,
$$f(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left\{ -\frac{1}{2(1-\rho^2)} \left[ \left(\frac{x_1-\mu_1}{\sigma_1}\right)^2 - 2\rho\left(\frac{x_1-\mu_1}{\sigma_1}\right)\left(\frac{x_2-\mu_2}{\sigma_2}\right) + \left(\frac{x_2-\mu_2}{\sigma_2}\right)^2 \right] \right\}.$$
If $(X_1, X_2) \sim N_2$, then $X_1, X_2$ are independent $\iff \rho = 0$. In general $\rho = 0$ does not imply independence.

[Figure: bivariate normal densities with $\mu_1 = \mu_2 = 0$, $\sigma_{11} = \sigma_{22} = 1$ and $\rho = 0.8,\ 0.5,\ 0,\ -0.5,\ -0.8$.]

In the $N_2$ density, the term inside the exponential is
$$-\tfrac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)$$
and the constant is $\dfrac{1}{(\sqrt{2\pi})^p\, |\Sigma|^{1/2}}$, where $p = 2$. This is the form of the $N_p$ density:
$$f(x) = \frac{1}{(\sqrt{2\pi})^p\, |\Sigma|^{1/2}} \exp\left\{ -\tfrac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu) \right\}$$
if $\Sigma$ is strictly p.d. (it has in any case to be n.n.d., being a covariance matrix). Here $x$ and $\mu$ are $p$-vectors and $\Sigma$ is a nonsingular (symmetric) $p \times p$ matrix. The term
$$Q = (x - \mu)^T \Sigma^{-1} (x - \mu)$$
is a positive-definite quadratic form.
- $Q$ is the covariance-matrix-adjusted distance of $x$ from $\mu$
- the larger this distance, the smaller the probability density
- the density decreases exponentially with the square of the distance

One can define $N_p$ by this density and investigate its properties. An alternative and elegant way is to use the following definition: a random $p$-vector $X$ is said to be multivariate normally distributed if, for all $p$-vectors $\ell$, $\ell^T X$ has a univariate normal distribution (or is a constant). This definition makes sense even if $\Sigma$ is singular.
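As a quick numerical illustration of the $N_p$ density formula above (a sketch that is not part of the original notes: it assumes NumPy and SciPy are available, and the particular $\mu$, $\Sigma$ and test point are made up), one can evaluate the quadratic form $Q$ directly and compare against SciPy's built-in multivariate normal density:

    import numpy as np
    from scipy.stats import multivariate_normal

    # Made-up parameters for a 2-dimensional normal (any positive-definite Sigma would do).
    mu = np.array([0.0, 1.0])
    Sigma = np.array([[2.0, 0.8],
                      [0.8, 1.0]])
    x = np.array([0.5, 0.5])

    p = len(mu)
    diff = x - mu
    Q = diff @ np.linalg.inv(Sigma) @ diff     # (x - mu)' Sigma^{-1} (x - mu)
    density = np.exp(-0.5 * Q) / ((2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma)))

    print(density)                                          # density from the formula
    print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))   # SciPy's value; the two should agree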
Properties of $N_p$

1. $\mu$ is the vector of means of $X_1, X_2, \ldots, X_p$.
2. $\Sigma$ is the (symmetric) matrix of variances and covariances of $X_1, X_2, \ldots, X_p$.
3. If the variance-covariance matrix (also called simply the covariance matrix or dispersion matrix) is singular, the above density does not hold, but the alternative definition still holds. For instance, if $X \sim N(0, 4)$, then $(X, 3X+2)$ has the singular covariance matrix $\begin{pmatrix} 4 & 12 \\ 12 & 36 \end{pmatrix}$, but all linear combinations are of the form $a + bX$ for constants $a, b$ and hence are univariate normal, so $(X, 3X+2)$ is bivariate normal by the alternative definition.
4. The covariance matrix is singular (multivariate normal or not), with a linear dependence of columns given by $\Sigma\ell = 0$, iff $\ell^T X$ is a constant (a degenerate random variable), i.e., there is a deterministic linear dependence among the variables. In such cases, by removing the deterministically dependent components, the $\Sigma$ of the remaining components can be made nonsingular. Near-singularity of the covariance matrix is a computational and conceptual problem; some exploratory methods detect this problem, and some methods (e.g., ridge regression) overcome it.
5. Let us deal only with nonsingular $\Sigma$.
6. Let $X \sim N_p(\mu, \Sigma)$, $A$ a $k \times p$ matrix, $c \in \mathbb{R}^k$. Then $Y = AX + c \sim N_k(A\mu + c,\ A\Sigma A^T)$. [If $k > p$, then $A\Sigma A^T$ is singular.]
7. $\Sigma$ diagonal means $X_1, X_2, \ldots, X_p$ are independent random variables.
8. $X \sim N_p(0, I_p)$ means $X_1, X_2, \ldots, X_p$ are independent standard normal variables.
9. Let $X \sim N_p(\mu, \Sigma)$. Then $\Sigma^{-1/2}X \sim N_p(\Sigma^{-1/2}\mu, I_p)$ and $Y = \Sigma^{-1/2}(X - \mu) \sim N_p(0, I_p)$.
10. Marginal distributions: all marginals (1-dimensional and $q < p$-dimensional) are (multivariate) normal. That is, if you partition
$$X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}, \quad \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$$
analogously into $q$- and $(p-q)$-dimensional vectors and matrices (note that $\Sigma_{21} = \Sigma_{12}^T$), then $X_1 \sim N_q(\mu_1, \Sigma_{11})$ and $X_2 \sim N_{p-q}(\mu_2, \Sigma_{22})$.
11. Under the above (multivariate normal) set-up, $X_1$ and $X_2$ are independent iff $\Sigma_{12} = 0$, that is, all covariances are zero.

Conditional Distributions and Regression

12. $(X_2 \mid X_1 = x_1) \sim N(\mu_{2\cdot 1}, \Sigma_{22\cdot 1})$, where
$$\mu_{2\cdot 1} = \mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(x_1 - \mu_1), \qquad \Sigma_{22\cdot 1} = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}.$$
[Notation: $A \mid B$ stands for event $A$ conditional on event $B$; also used as $X \mid Y = y$ for variable $X$ given variable $Y = y$.]
(a) If $\Sigma$ is nonsingular, so is $\Sigma_{11}$.
(b) This conditional expectation is linear in $x_1$.
(c) Regression being defined as conditional expectation, this shows that under multivariate normality the (multiple) regression of any subset of variables on the others is linear.
(d) This linear regression formula is exactly the same as what you obtain by the least-squares criterion (see the simulation sketch after the "More Properties" list below).
(e) For $p = 2$, $(X_2 \mid X_1 = x_1) \sim N(\beta_0 + \beta_1 x_1,\ \sigma_2^2(1 - \rho^2))$, where $\beta_1 = \sigma_{21}/\sigma_{11}$ and $\beta_0 = \mu_2 - \beta_1\mu_1$, the well-known formulas for (least-squares) simple linear regression.
(f) The conditional covariance matrix does not depend on $x_1$.
(g) These results justify the linearity and homoscedasticity (common variance) assumptions in the multiple linear regression model.

[Figure: conditional distributions of the bivariate normal — the conditional means lie on a straight line, and the conditional variances are the same.]

More Properties

1. [We know: if $X \sim N(0,1)$, then $X^2 \sim \chi^2(1)$.] $\Delta^2(X) = (X - \mu)^T\Sigma^{-1}(X - \mu) = Y^T Y \sim \chi^2(p)$, being a sum of squares of $p$ independent $N(0,1)$'s by (9) above.
2. The sample (of size $n$) mean vector $\bar{X}$ and the sample sum of squares and products matrix $S$ are independently distributed.
3. $\bar{X} \sim N(\mu, \tfrac{1}{n}\Sigma)$.
4. $S \sim W_p(n-1, \Sigma)$, called the Wishart distribution, the multivariate analog of the $\chi^2$ distribution; we shall not discuss it here.
5. For $N(\mu, \sigma^2)$, Student's $t$ statistic based on a sample of $n$ with mean $\bar{X}$ and sample (mean-corrected) sum of squares $S^2$ is $t = \dfrac{\bar{X} - \mu}{S/\sqrt{n(n-1)}}$, extended to Hotelling's $T^2 = n(n-1)\,(\bar{X} - \mu)^T S^{-1}(\bar{X} - \mu)$.
6. Analysis of Variance (ANOVA), which decomposes observed variation into its components, is analogously extended to Multivariate Analysis of Variance (MANOVA).
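The following simulation sketch (not part of the original notes: NumPy is assumed, and the example $\mu$, $\Sigma$ and sample size are arbitrary) illustrates items 12(b)-(d) above: the slope matrix $\Sigma_{21}\Sigma_{11}^{-1}$ and intercept $\mu_2 - \Sigma_{21}\Sigma_{11}^{-1}\mu_1$ from the conditional-mean formula should agree with ordinary least squares on a large simulated sample.

    import numpy as np

    rng = np.random.default_rng(0)

    # Arbitrary 3-variate normal; X1 = first two components (q = 2), X2 = third component.
    mu = np.array([1.0, -2.0, 3.0])
    Sigma = np.array([[2.0, 0.6, 0.3],
                      [0.6, 1.0, 0.4],
                      [0.3, 0.4, 1.5]])

    S11, S12 = Sigma[:2, :2], Sigma[:2, 2:]
    S21, S22 = Sigma[2:, :2], Sigma[2:, 2:]
    mu1, mu2 = mu[:2], mu[2:]

    B = S21 @ np.linalg.inv(S11)                   # slope coefficients Sigma_21 Sigma_11^{-1}
    b0 = mu2 - B @ mu1                             # intercept mu_2 - B mu_1
    S22_1 = S22 - S21 @ np.linalg.inv(S11) @ S12   # conditional covariance Sigma_{22.1}

    # Least squares of X2 on X1 over a large simulated sample should reproduce b0 and B.
    X = rng.multivariate_normal(mu, Sigma, size=100_000)
    design = np.column_stack([np.ones(len(X)), X[:, :2]])
    coef, *_ = np.linalg.lstsq(design, X[:, 2], rcond=None)

    print("theory       :", b0[0], B.ravel())
    print("least squares:", coef[0], coef[1:])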
Estimation of the $N_p(\mu, \Sigma)$ Parameters

Random sample $X_1, X_2, \ldots, X_n$ from $N_p$; observed values $x_1, x_2, \ldots, x_n$. The data matrix $U$ is the $n \times p$ matrix whose rows are the observations $X_1, \ldots, X_n$ and whose columns are the variables $Y_1, \ldots, Y_p$:
$$U = \begin{pmatrix} X_{11} & X_{12} & \cdots & X_{1p} \\ X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & & \vdots \\ X_{n1} & X_{n2} & \cdots & X_{np} \end{pmatrix}$$

$\bar{X} = (\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_p)$: sample mean vector. $S = ((S_{ij}))$: sample (mean-corrected) sum of squares and products matrix,
$$S_{ij} = \sum_{\ell=1}^{n} (X_{\ell i} - \bar{X}_i)(X_{\ell j} - \bar{X}_j) = Y_i^T Y_j - n\bar{X}_i\bar{X}_j, \qquad i, j = 1, 2, \ldots, p$$
(where $Y_i$ denotes the $i$-th column of $U$), i.e.,
$$S = \sum_{\ell=1}^{n} (X_\ell - \bar{X})(X_\ell - \bar{X})^T = \sum_{\ell=1}^{n} X_\ell X_\ell^T - n\bar{X}\bar{X}^T = U^T U - n\bar{X}\bar{X}^T.$$

Analogs of the univariate normal:
- $\bar{X}$: unbiased estimate of $\mu$
- $\frac{1}{n-1}S$: unbiased estimate of $\Sigma$
- $\bar{X}, S$: sufficient statistics for $\mu, \Sigma$ [in a sense, these statistics contain all the information in the sample $X_1, X_2, \ldots, X_n$ in respect of $\mu, \Sigma$]

Maximum Likelihood Estimation of $\mu, \Sigma$: Rao (1973)

Density:
$$(2\pi)^{-p/2}\,|\Sigma|^{-1/2}\exp\left[-\tfrac{1}{2}\,\mathrm{tr}\{\Sigma^{-1}(x - \mu)(x - \mu)^T\}\right]$$
The joint density of the observations (but for a constant not involving the parameters), which is the likelihood function in terms of the parameters, is
$$L = |\Sigma^{-1}|^{n/2}\exp\left[-\tfrac{1}{2}\sum_{\ell=1}^{n}\mathrm{tr}\{\Sigma^{-1}(x_\ell - \mu)(x_\ell - \mu)^T\}\right].$$
Now
$$\sum_{\ell=1}^{n}(x_\ell - \mu)(x_\ell - \mu)^T = \sum_{\ell=1}^{n}(x_\ell - \bar{x})(x_\ell - \bar{x})^T + n(\bar{x} - \mu)(\bar{x} - \mu)^T = S + n(\bar{x} - \mu)(\bar{x} - \mu)^T,$$
so that
$$\mathrm{tr}\{\Sigma^{-1}\textstyle\sum_\ell(x_\ell - \mu)(x_\ell - \mu)^T\} = \mathrm{tr}\{\Sigma^{-1}S\} + n\,\mathrm{tr}\{\Sigma^{-1}(\bar{x} - \mu)(\bar{x} - \mu)^T\} = \sum_{i=1}^{p}\sum_{j=1}^{p}\sigma^{ij}S_{ij} + n(\bar{x} - \mu)^T\Sigma^{-1}(\bar{x} - \mu),$$
where $\Sigma^{-1} = ((\sigma^{ij}))$. Hence
$$\log L = \frac{n}{2}\log|\Sigma^{-1}| - \frac{1}{2}\sum_{i=1}^{p}\sum_{j=1}^{p}\sigma^{ij}S_{ij} - \frac{n}{2}(\bar{x} - \mu)^T\Sigma^{-1}(\bar{x} - \mu) \qquad (A)$$
$$= \frac{n}{2}\log|\Sigma^{-1}| - \frac{1}{2}\sum_{i=1}^{p}\sum_{j=1}^{p}\sigma^{ij}\left[S_{ij} + n(\bar{x}_i - \mu_i)(\bar{x}_j - \mu_j)\right]. \qquad (B)$$
Differentiating (A) w.r.t. $\mu$ leads to
$$\Sigma^{-1}(\bar{x} - \mu) = 0 \;\Rightarrow\; \bar{x} = \mu \;\Rightarrow\; \hat{\mu} = \bar{x}. \qquad (C)$$
Differentiating (B) w.r.t. $\sigma^{ij}$ leads to
$$\frac{n}{|\Sigma^{-1}|}\,\frac{\partial|\Sigma^{-1}|}{\partial\sigma^{ij}} = S_{ij} + n(\bar{x}_i - \mu_i)(\bar{x}_j - \mu_j). \qquad (D)$$
Since $\partial|\Sigma^{-1}|/\partial\sigma^{ij}$ is the cofactor of $\sigma^{ij}$ in $|\Sigma^{-1}|$,
$$\frac{n}{|\Sigma^{-1}|}\,\frac{\partial|\Sigma^{-1}|}{\partial\sigma^{ij}} = n\sigma_{ij}. \qquad (E)$$
Equations (C), (D) and (E) lead to
$$\hat{\Sigma} = \frac{1}{n}S,$$
a slightly biased estimate, as in the univariate case.
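A small numerical sketch of the estimation formulas above (again not from the original notes; NumPy is assumed and the data are simulated): it computes $\bar{X}$, the SSP matrix $S$, the unbiased estimate $S/(n-1)$ and the maximum likelihood estimate $S/n$, and checks the unbiased version against np.cov, which uses the divisor $n-1$ by default.

    import numpy as np

    rng = np.random.default_rng(1)

    # Simulated data matrix U: n observations (rows) on p = 2 variables (columns).
    mu = np.array([0.0, 2.0])
    Sigma = np.array([[1.0, 0.5],
                      [0.5, 2.0]])
    n = 500
    U = rng.multivariate_normal(mu, Sigma, size=n)

    xbar = U.mean(axis=0)                    # sample mean vector
    S = (U - xbar).T @ (U - xbar)            # mean-corrected SSP matrix, = U'U - n xbar xbar'
    Sigma_unbiased = S / (n - 1)             # unbiased estimate of Sigma
    Sigma_mle = S / n                        # maximum likelihood estimate (slightly biased)

    print(xbar, Sigma_mle, sep="\n")
    print(np.allclose(Sigma_unbiased, np.cov(U, rowvar=False)))   # True: np.cov uses divisor n-1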