
Advanced Review

Estimation of variances and covariances for high-dimensional data: a selective review

Tiejun Tong,¹ Cheng Wang¹ and Yuedong Wang²*

Estimation of variances and covariances is required for many statistical methods such as the t-test, principal component analysis, and linear discriminant analysis. High-dimensional data such as gene expression microarray data and financial data pose challenges to traditional statistical and computational methods. In this paper, we review some recent developments in the estimation of variances, the covariance matrix, and the precision matrix, with emphasis on the applications to microarray data analysis. © 2014 Wiley Periodicals, Inc.

How to cite this article: WIREs Comput Stat 2014, 6:255–264. doi: 10.1002/wics.1308

Keywords: covariance matrix; high-dimensional data; microarray data; precision matrix; shrinkage estimation; sparse covariance matrix

*Correspondence to: [email protected]

¹Department of Mathematics, Hong Kong Baptist University, Hong Kong, Hong Kong
²Department of Statistics and Applied Probability, University of California, Santa Barbara, CA, USA

Conflict of interest: The authors have declared no conflicts of interest for this article.

INTRODUCTION

Variances and covariances are involved in the construction of many statistical methods, including the t-test, Hotelling's T²-test, principal component analysis, and linear discriminant analysis. Therefore, the estimation of these quantities is of critical importance and has been well studied over the years. The recent flood of high-dimensional data, however, poses new challenges to traditional statistical and computational methods. For example, the microarray technology allows simultaneous monitoring of the whole genome. Due to the cost and other experimental difficulties such as the availabilities of biological materials, microarray data are usually collected in a limited number of samples. These kinds of data are often referred to as high-dimensional small sample size data, or 'large p small n' data, where p is the number of genes and n is the number of samples. Due to the small sample size, there is a large amount of uncertainty associated with standard estimates of parameters such as the sample mean and covariance. As a consequence, statistical analyses based on such estimates are usually unreliable.

Let Y_i = (Y_{i1}, …, Y_{ip})^T be independent random samples from a multivariate normal distribution,¹,²

Y_i = \Sigma^{1/2} X_i + \mu, \quad i = 1, \ldots, n, \quad (1)

where μ = (μ_1, …, μ_p)^T is a p-dimensional mean vector, Σ is a p × p positive definite covariance matrix, X_i = (X_{i1}, …, X_{ip})^T, and the X_{ij} are independent and identically distributed (i.i.d.) random variables from the standard normal distribution. For microarray data, Y_{ij} represents the normalized gene expression level of gene j in the ith sample. In two-sample cDNA arrays, Y_{ij} may also represent the normalized log ratio of two-channel intensities.

In multivariate statistical analysis, one often needs to estimate the covariance matrix Σ or the inverse covariance matrix Σ⁻¹. The inverse covariance matrix is also called the precision matrix, Ω = Σ⁻¹. The estimation of the covariance matrix or its inverse has applications in many statistical problems including linear discriminant analysis,³ Hotelling's T²-test,⁴ and Markowitz mean-variance analysis.⁵

We write the sample covariance matrix as

S_n = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})(Y_i - \bar{Y})^T,

where Ȳ = ∑_{i=1}^n Y_i/n is the sample mean. When p < n, (n − 1)S_n follows a Wishart distribution and S_n⁻¹/(n − 1) follows an inverse Wishart distribution. In addition, E(S_n⁻¹) = (n − 1)Ω/(n − p − 2). A common practice is to estimate Σ by the sample covariance matrix S_n and estimate Ω by the scaled inverse covariance matrix (n − p − 2)S_n⁻¹/(n − 1). These two estimators are consistent estimators of Σ and Ω when p is fixed and n goes to infinity.

For high-dimensional data such as microarray data, however, p can be as large as or even larger than n. As a consequence, the sample covariance matrix S_n is close to or is a singular matrix. This brings new challenges to the estimation of the covariance matrix and the precision matrix. In this paper, we review some recent developments in the estimation of variances and covariances. Specifically, we review (1) the estimation of variances, i.e., the diagonal matrix of Σ, (2) the estimation of the covariance matrix Σ, and (3) the estimation of the precision matrix Ω.

ESTIMATION OF VARIANCES

As reviewed in Cui and Churchill⁶ and Ayroles and Gibson,⁷ one commonly used method to identify differentially expressed genes is the analysis of variance (ANOVA). ANOVA is a very flexible approach for microarray experiments to compare more than two conditions. When there are only two conditions, the t-test may be used for detecting differential expression. Throughout the paper, for simplicity of illustration, we consider only the two-color arrays with one factor at two levels, in which a paired t-test may be employed. Let D = diag(σ_1², …, σ_p²), where the σ_j² are gene-specific variances for j = 1, …, p, respectively. When the factor has more than two levels or the experiment involves more than one factor, the variances σ_j² correspond to residual variances in ANOVA or regression models.

In microarray data analysis, rather than the whole covariance matrix Σ, there are many situations where only the estimation of gene-specific variances is required. We now provide several examples of these situations. The first example is a multiple testing problem in microarray data analysis. To identify differentially expressed genes, we test the hypotheses H_{j0}: μ_j = 0 against H_{j1}: μ_j ≠ 0 for each gene j. Consider the test statistic T_j = √n Ȳ_j/s_j, where Ȳ_j is the gene-specific sample mean and s_j² is the gene-specific sample variance. Then, only an estimate of D is needed rather than the whole covariance matrix Σ. The second example is the class prediction (or classification) problem. If we use Diagonal Linear Discriminant Analysis (DLDA)⁸ for class prediction, then again we need to estimate D rather than Σ. For more details about DLDA and its variants, see Bickel and Levina,⁹ Lee et al.,¹⁰ Pang et al.,¹¹ and Huang et al.¹² The third example is the multivariate testing problem. To overcome the singularity problem, several researchers proposed diagonal Hotelling's T²-tests where only an estimate of D is required. For more details, see, for example, Wu et al.,¹³ Srivastava and Du,¹⁴ Srivastava,¹⁵ Park and Ayyala,¹⁶ and Srivastava et al.¹⁷

Due to the small sample size n, however, the standard gene-specific sample variance s_j² is usually unstable. Consequently, the standard t-tests in the first example, the diagonal discriminant rules in the second example, and the diagonal Hotelling tests in the third example may not be reliable in practice. Various methods have been proposed for improving the estimation of gene-specific variances. Some of these methods are reviewed in the remainder of this section.
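As a concrete illustration of this instability, the following toy Python simulation (ours, added for exposition and not taken from the works cited) draws null data with n = 4 replicates and computes the statistics T_j defined above; the genes with the smallest s_j² dominate the extremes of |T_j| even though no gene is differentially expressed.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 4, 10000                       # few replicates, many genes
Y = rng.standard_normal((n, p))       # null data: every gene has true variance 1
s2 = Y.var(axis=0, ddof=1)            # gene-specific sample variances s_j^2
T = np.sqrt(n) * Y.mean(axis=0) / np.sqrt(s2)   # T_j = sqrt(n) * Ybar_j / s_j

# With nu = n - 1 = 3 degrees of freedom, the s_j^2 spread over several
# orders of magnitude, and the smallest ones inflate |T_j|.
print(s2.min(), s2.max(), np.abs(T).max())
```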
Shrinkage Estimators

A key to improving the variance estimation is to borrow information across genes, implicitly or explicitly, locally or globally. One of the earliest methods to stabilize the variance estimation was proposed by Tusher et al.¹⁸ in 2001. In order to avoid the undue influence of the small variance estimates, Tusher et al.¹⁸ proposed to estimate σ_j by (s_j + c)/2 in their SAM test, where c is a constant acting as a shrinkage factor. For the choice of the constant c, Efron et al.¹⁹ suggested to use the 90th percentile of all estimated standard deviations, whereas Cui and Churchill⁶ suggested to use the pooled sample variance.
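In code, the shrunk statistic might look as follows (a sketch with hypothetical function and argument names; the published SAM statistic differs in further details):

```python
import numpy as np

def sam_like_stat(ybar, s, n, c=None):
    """t-type statistic with the SAM-style shrunk standard deviation (s_j + c) / 2.

    ybar : gene-specific sample means
    s    : gene-specific sample standard deviations
    c    : shrinkage constant; defaults to the 90th percentile of the s_j,
           the choice suggested by Efron et al.
    """
    if c is None:
        c = np.percentile(s, 90)
    return np.sqrt(n) * ybar / ((s + c) / 2.0)
```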


In 2005, Cui et al.²⁰ proposed a James–Stein shrinkage estimator for the variances. For microarray data with Y_{ij} ~ i.i.d. N(μ_j, σ_j²), we have s_j² = σ_j² χ²_{j,ν}/ν, where, for ease of notation, the χ²_{j,ν} denote i.i.d. random variables which have a chi-squared distribution with ν = n − 1 degrees of freedom. Taking the log transformation leads to

Z_j = \ln \sigma_j^2 + \varepsilon_j, \quad (2)

where Z_j = ln(s_j²) − m, ε_j = ln(χ²_{j,ν}/ν) − m, and m = E{ln(χ²_{j,ν}/ν)}. Treating the Z_j in Eq. (2) as normal random variables, the James–Stein shrinkage method²¹ can be applied to derive a shrinkage estimate for ln σ_j². Transforming back to the original scale, the final estimates of the variances are as follows:

\tilde{\sigma}_j^2 = B \Big( \prod_{j=1}^{p} s_j^2 \Big)^{1/p} \exp\Big\{ \Big[ 1 - \frac{(p-3)V}{\sum_{j=1}^{p} \big( \ln s_j^2 - \overline{\ln s^2} \big)^2} \Big]_+ \big( \ln s_j^2 - \overline{\ln s^2} \big) \Big\}, \quad (3)

where V = var(ε_j), \overline{\ln s^2} = \sum_{j=1}^{p} \ln s_j^2 / p, and B = exp(−m) is the bias correction factor such that B(∏_{j=1}^p s_j²)^{1/p} gives an unbiased estimator of σ² when σ_j² = σ² for all j.
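Eq. (3) is straightforward to implement. The sketch below is our illustration; it uses the standard chi-squared identities m = ψ(ν/2) + ln(2/ν) and V = ψ′(ν/2), with ψ the digamma function, which are not spelled out in the text above.

```python
import numpy as np
from scipy.special import digamma, polygamma

def js_log_shrink(s2, nu):
    """James-Stein shrinkage of log variances, Eq. (3).

    s2 : length-p array of gene-specific sample variances s_j^2
    nu : degrees of freedom, nu = n - 1
    """
    p = len(s2)
    m = digamma(nu / 2.0) + np.log(2.0 / nu)   # m = E{ln(chi2_nu / nu)}
    V = polygamma(1, nu / 2.0)                 # V = var{ln(chi2_nu / nu)}
    B = np.exp(-m)                             # bias-correction factor
    z = np.log(s2)
    zbar = z.mean()                            # exp(zbar) = (prod s_j^2)^(1/p)
    w = max(0.0, 1.0 - (p - 3) * V / np.sum((z - zbar) ** 2))  # positive part
    return B * np.exp(zbar + w * (z - zbar))
```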

Note that Z_j in Eq. (2) can be far from normal when ν is small. Therefore, the shrinkage variance estimates in Eq. (3) can be suboptimal. Note also that the variance estimates appear in the denominator of the t-tests. Tong and Wang²² showed that using direct estimates of 1/σ_j leads to a more powerful and robust test than using the reciprocal of the estimates of σ_j. Consequently, they considered the general estimation of (σ_j²)^t for any power t ≠ 0. Note that σ_j and 1/σ_j are special cases with t = 1/2 and t = −1/2. Let s_j^{2t} = (s_j²)^t, s_{pool}^{2t} = ∏_{j=1}^p (s_j²)^{t/p}, and

h_n(t) = \Big( \frac{\Gamma(\nu/2)}{\Gamma(\nu/2 + t/n)} \Big)^{n} \Big( \frac{\nu}{2} \Big)^{t}, \quad (4)

where Γ(·) is the Gamma function. Tong and Wang²² proposed the following family of shrinkage estimators for (σ_j²)^t:

\hat{\sigma}_j^{2t} = \big( h_p(t)\, s_{pool}^{2t} \big)^{\alpha} \big( h_1(t)\, s_j^{2t} \big)^{1-\alpha}, \quad 0 \le \alpha \le 1, \quad (5)

where h_1(t)s_j^{2t} is an unbiased estimator of σ_j^{2t}, and h_p(t)s_{pool}^{2t} is an unbiased estimator of σ^{2t} when σ_j² = σ² for all j. When t = 1, σ̂_j² is a simple modification of the estimator in Cui et al.²⁰ The shrinkage parameter α controls the degree of shrinkage from the gene-specific variance estimate h_1(t)s_j^{2t} toward the bias-corrected geometric mean h_p(t)s_{pool}^{2t}. There is no shrinkage when α = 0, and all variance estimates are shrunken to the pooled variance when α = 1. More recently, Tong et al.²³ proposed another James–Stein shrinkage estimator for the variances that shrunk the individual sample variances toward the arithmetic mean. For both the shrinkage to the geometric mean and the shrinkage to the arithmetic mean estimators, optimal shrinkage parameters were derived under both the Stein and squared loss functions. Asymptotic properties were investigated under the two schemes when either the number of degrees of freedom of each individual estimate or the number of individuals approaches infinity.

Bayesian Estimators

Baldi and Long²⁴ applied a Bayesian method to improve the estimation of variances. Specifically, they assumed the following conjugate prior for (μ_j, σ_j²):

p(\mu_j, \sigma_j^2 \mid \alpha) = N(\mu_j;\, \mu_0, \sigma_j^2/\lambda_0)\; IG(\sigma_j^2;\, \nu_0, \sigma_0^2),

where α = (μ_0, λ_0, ν_0, σ_0²) are unknown hyperparameters, N(x; a, b) represents the normal density function with mean a and variance b, and IG(x; a, b) represents the scaled inverse gamma density with degrees of freedom a and scale b. The posterior density for (μ_j, σ_j²) has the same functional form as the prior density,

p(\mu_j, \sigma_j^2 \mid Y_{1j}, \ldots, Y_{nj}) = N(\mu_j;\, \mu_n, \sigma_j^2/\lambda_n)\; IG(\sigma_j^2;\, \nu_n, \sigma_n^2),

where ν_n = ν_0 + n, λ_n = λ_0 + n, and

\mu_n = \frac{\lambda_0}{\lambda_0 + n}\, \mu_0 + \frac{n}{\lambda_0 + n}\, \bar{Y}_j,

\nu_n \sigma_n^2 = \nu_0 \sigma_0^2 + (n-1)\, s_j^2 + \frac{\lambda_0 n}{\lambda_0 + n} \big( \bar{Y}_j - \mu_0 \big)^2.

Note that the posterior mean μ_n is a weighted average of the prior mean μ_0 and the sample mean Ȳ_j. Baldi and Long²⁴ suggested to use μ_0 = Ȳ_j. This leads to the posterior means of μ_j and σ_j² as

\hat{\mu}_j = \bar{Y}_j \quad \text{and} \quad \hat{\sigma}_j^2 = \frac{\nu_0 \sigma_0^2 + (n-2)\, s_j^2}{\nu_0 + n - 2}.

The posterior modes have the same form as above with n − 2 replaced by n − 1. It is clear that both the posterior mean and mode of σ_j² are shrinkage estimators. The background variance σ_0² is estimated by pooling together all the neighboring genes contained in a window of a certain size. The parameter ν_0 represents the degree of confidence in the background variance σ_0² versus the gene-specific sample variance.
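The resulting estimator is easy to code. The sketch below is ours; it follows the posterior-mean formula displayed above and, lacking the details of Baldi and Long's windowing scheme, substitutes a simple running mean over neighboring genes for the background variance σ_0².

```python
import numpy as np

def baldi_long_var(s2, n, nu0, window=101):
    """Posterior-mean variance estimates following the formula displayed above:
    (nu0 * sigma0_j^2 + (n - 2) * s_j^2) / (nu0 + n - 2).

    s2 is assumed sorted by expression level; sigma0_j^2 is approximated by a
    running mean of s_j^2 over `window` neighboring genes (edge effects ignored).
    """
    kernel = np.ones(window) / window
    sigma0_sq = np.convolve(s2, kernel, mode="same")
    return (nu0 * sigma0_sq + (n - 2) * s2) / (nu0 + n - 2)
```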


Other methods under the Bayesian framework are summarized as follows. Lönnstedt and Speed²⁵ proposed a posterior odds of differential expression in a replicated two-color experiment using an empirical Bayes approach that combines information across genes. Kendziorski et al.²⁶ extended the empirical Bayes method using the hierarchical gamma–gamma and lognormal–normal models. Smyth²⁷ developed hierarchical models in the context of general linear models; see also Wright and Simon.²⁸ Hwang and Liu²⁹ and Zhao³⁰ applied some empirical Bayes approaches that shrunk both means and variances. Ji et al.³¹ developed an empirical Bayes estimator for the variances by borrowing information across both genes and experiments.

Regression Estimators

It has been observed for microarray data that the variance increases proportionally with the intensity level.³²⁻³⁵ One possible remedy to this problem is to transform the data and eliminate the dependence of the variance on the mean. See, for example, Durbin et al.,³⁶ Huber et al.,³⁷ Rocke,³⁸ Rocke and Durbin,³⁹ and Durbin and Rocke.⁴⁰

Another remedy is to apply the regression method. Specifically, we assume a functional relationship between the mean and the variance: σ_i² = g(μ_i). The goal of the regression approach is then to estimate the variance–mean function g. Depending on prior knowledge, the function g may be modeled parametrically or non-parametrically. When modeled parametrically, we denote g(μ, θ) as the variance function with parameter θ. Parametric models for microarray data include the constant coefficient of variation model,³² g(μ) = θμ², and the quadratic model,³³,⁴¹ g(μ) = θ_1 + θ_2μ². Often it is difficult, if not impossible, to specify a parametric model for g. A non-parametric regression approach may be used in these situations. Any one of the non-parametric regression approaches such as smoothing splines and local polynomials could be used to model g non-parametrically.

Estimation of the parameter θ or the non-parametric function g needs to take several subtle issues into account. Note that the means μ_j represent a large number of unknown nuisance parameters in the estimation of the variance function. This is a Neyman–Scott type problem where care needs to be taken to derive consistent estimates. It is usually not difficult to construct consistent estimators for θ or g if μ_j is known. Denote θ̂(μ) and ĝ(μ) as consistent estimators for θ and g, respectively, where the dependence on μ = (μ_1, …, μ_p)^T is expressed explicitly. In practice μ is unknown. The sample mean Ȳ = (Ȳ_1, …, Ȳ_p)^T is a natural estimate of μ. Then, a direct approach is to replace μ by Ȳ, which leads to the estimates θ̂(Ȳ) and ĝ(Ȳ) for θ and g. Unfortunately, these naive estimates θ̂(Ȳ) and ĝ(Ȳ) are in general inconsistent as the sampling error in Ȳ is ignored.⁴²,⁴³

Regarding Ȳ as an error-prone unbiased measure of μ, the problem can be cast in a general framework of heteroscedastic measurement error. Therefore, the SIMEX (simulation extrapolation) method in Carroll et al.⁴⁴ may be applied to derive estimates of θ and g. However, due to correlation between the measurement error and the response, the naive application of the SIMEX method still does not lead to consistent estimates.⁴²,⁴³ To overcome this problem, Carroll and Wang⁴² and Wang et al.⁴³ proposed permutation SIMEX methods that lead to consistent estimates of θ and g. Other regression methods include Fan et al.⁴⁵ and Fang and Zhu.⁴⁶

ESTIMATION OF THE COVARIANCE MATRIX

When p is fixed and n is large, the sample covariance matrix S_n is an unbiased and consistent estimator of the covariance matrix Σ. However, for high-dimensional data in which p is close to or even larger than n, S_n may no longer be a good estimator of Σ. In particular, S_n will be a singular matrix, or close to it, in such settings.

Many methods have been proposed in the literature to improve the estimation of Σ. In essence, these methods can be classified into three categories corresponding to (1) p < n, (2) p ≥ n, and (3) p ≫ n, respectively. For category (1), as S_n is invertible with the eigenvalues being non-zero, attempts are often made to stabilize the estimation of eigenvalues. This was first proposed by Stein⁴⁷ and will be referred to as Stein-type estimators. For category (2), S_n is a singular matrix. To overcome this problem, one may consider an estimator like λS_n + (1 − λ)I_p, where I_p is an identity matrix of size p × p and λ ∈ [0, 1) is a shrinkage parameter. Note that this type of method can be derived under the Bayes or empirical Bayes framework. We refer to the estimators of this type as the ridge-type estimators. Note that the covariance matrix associated with high-dimensional data such as microarrays can be sparse. In such settings, the above-mentioned shrinkage estimators for non-sparse Σ may no longer be applicable or the improvement may be negligible. This motivates researchers to consider new estimation methods that are specifically for sparse covariance matrices. We classify these estimators into category (3) and refer to them as the sparse estimators.
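The singularity in category (2), and the ridge remedy, can be seen in a few lines (a toy Python illustration of ours, with an arbitrary shrinkage parameter):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                            # 'large p small n'
Y = rng.standard_normal((n, p))
S = np.cov(Y, rowvar=False)               # p x p sample covariance, rank <= n - 1

print(np.linalg.matrix_rank(S))           # 19: singular, hence not invertible
lam = 0.5                                 # shrinkage parameter (arbitrary here)
S_ridge = lam * S + (1 - lam) * np.eye(p)
print(np.linalg.matrix_rank(S_ridge))     # 100: positive definite and invertible
```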

Stein-Type Estimators

When p < n, S_n is an unbiased estimator of Σ. However, it is known that the eigenvalues of S_n tend to be more spread out than the eigenvalues of Σ, especially when p is close to n. As a consequence, S_n may be unstable, with the smallest estimated eigenvalues being too small and the largest too large.⁴⁸ For ease of exposition, we write

S_n = U_n \Lambda_n U_n^T,

where U_n is an orthogonal matrix, Λ_n = diag{λ_1, …, λ_p}, and λ_1 ≥ … ≥ λ_p are the eigenvalues of S_n. Stein⁴⁷ proposed to shrink the eigenvalues of the sample covariance matrix to avoid the extreme eigenvalues. Specifically, he suggested to estimate the eigenvalues by

\hat{\lambda}_j = \frac{n \lambda_j}{\,n - p + 1 + 2 \lambda_j \sum_{i \neq j} (\lambda_j - \lambda_i)^{-1}\,}, \quad j = 1, \ldots, p.

Letting Λ̂_n = diag{λ̂_1, …, λ̂_p}, the resulting estimator of Σ is

\hat{\Sigma} = U_n \hat{\Lambda}_n U_n^T. \quad (6)

The estimator (6) was derived by minimizing an unbiased estimate of the Stein loss function.⁴⁸ We refer to this estimator as the Stein estimator. Note that the Stein estimator does not preserve the order of the eigenvalues and the resulting eigenvalues can even be negative. Much research has been devoted to improve the Stein estimator. In particular, Haff⁴⁹ derived an estimator of Σ under the constraint that the order of the sample eigenvalues is maintained. Other Stein-type estimators can be found, for example, in Efron and Morris,⁵⁰ Dey and Srinivasan,⁵¹ Yang and Berger,⁵² Daniels and Kass,⁵³ Daniels and Kass,⁴⁸ and references therein.
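A bare-bones implementation of the Stein estimator is sketched below (ours; it assumes distinct sample eigenvalues and applies none of the isotonizing corrections used in practice, so it inherits the defects just described):

```python
import numpy as np

def stein_covariance(S, n):
    """Stein estimator of Eq. (6): keep the eigenvectors of S_n and replace each
    eigenvalue lambda_j by
      n * lambda_j / (n - p + 1 + 2 * lambda_j * sum_{i != j} 1 / (lambda_j - lambda_i)).
    The raw output need not preserve the eigenvalue order or positivity."""
    lam, U = np.linalg.eigh(S)
    lam, U = lam[::-1], U[:, ::-1]            # eigenvalues in decreasing order
    p = lam.size
    lam_hat = np.empty(p)
    for j in range(p):
        denom = n - p + 1 + 2.0 * lam[j] * np.sum(1.0 / (lam[j] - np.delete(lam, j)))
        lam_hat[j] = n * lam[j] / denom
    return U @ np.diag(lam_hat) @ U.T
```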

Ridge-Type Estimators

When p ≥ n, S_n is a singular matrix with the smallest eigenvalues being zero. In such settings, the Stein-type estimators are no longer applicable. To achieve an invertible estimate for the covariance matrix, Ledoit and Wolf⁵⁴ proposed to estimate Σ by the following ridge-type estimator,

\tilde{\Sigma} = \lambda_1 S_n + \lambda_2 I_p, \quad (7)

where I_p is the identity matrix of size p, and λ_1 and λ_2 are shrinkage parameters. Under the squared loss function, they derived the optimal coefficients λ_1 and λ_2 and also proposed data-driven estimators for these coefficients. More recently, Fisher and Sun⁵⁵ considered a general convex combination of S_n and some target matrix T,

\breve{\Sigma} = \lambda S_n + (1 - \lambda) T, \quad (8)

where λ ∈ (0, 1) is the shrinkage parameter. The target matrix T is often chosen to be positive definite (and therefore non-singular) and well-conditioned. Consequently, the final estimator is also positive definite and well-conditioned for any dimensionality. Similar approaches can be found, for example, in Schäfer and Strimmer,⁵⁶ Warton,⁵⁷ Chen et al.,⁵⁸ Warton,⁵⁹ and references therein.

Sparse Estimators

For high-dimensional data with p ≫ n, to have a good estimate of Σ one may have to rely on some sparsity assumptions about the covariance matrix. In such settings, the above-mentioned Stein-type and ridge-type estimators are either no longer applicable or the improvement is nearly negligible. This suggests that new estimation methods are required for very large p. Let the covariance matrix be Σ = {σ_ij}_{p×p} and the sample covariance matrix be S_n = {s_ij}_{p×p}. Under the sparsity conditions that most of the σ_ij are zero or close to zero, Bickel and Levina⁶⁰ proposed to estimate Σ by a thresholding method. Specifically, their estimator is

T_s(S_n) = \{ s_{ij} I(|s_{ij}| \ge s) \}_{p \times p}, \quad (9)

where s is a tuning parameter serving as a threshold. The asymptotic properties of the proposed threshold estimator were established under some regularity conditions. Note that the threshold estimator in Bickel and Levina⁶⁰ can be regarded as a hard-thresholding estimator. Rothman et al.⁶¹ considered a generalized thresholding rule that includes hard and soft thresholding as in Donoho and Johnstone,⁶² the Smoothly Clipped Absolute Deviation (SCAD) method in Fan and Li,⁶³ and the adaptive Least Absolute Shrinkage and Selection Operator (LASSO) method in Zou.⁶⁴ We note that a single threshold level was used for all the entries of the sample covariance matrix in the above approaches.
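The hard-thresholding estimator in Eq. (9) is essentially a one-liner. In the sketch below (ours), we exempt the diagonal, as implementations commonly do, although Eq. (9) as written thresholds every entry:

```python
import numpy as np

def hard_threshold_cov(S, s):
    """Hard-thresholding estimator T_s(S_n) of Eq. (9):
    entries with |s_ij| < s are set to zero; the diagonal is kept."""
    T = np.where(np.abs(S) >= s, S, 0.0)
    np.fill_diagonal(T, np.diag(S))
    return T
```

For the ridge-type approach of Eq. (7), a data-driven implementation is available in scikit-learn as sklearn.covariance.LedoitWolf.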

More recently, Cai and Liu⁶⁵ proposed an adaptive thresholding method where the threshold level is entry-based and so the resulting estimator is more flexible. Other thresholding methods in the literature include, for example, Bickel and Levina,⁶⁶ Karoui,⁶⁷ Lam and Fan,⁶⁸ Cai and Zhou,⁶⁹ Cai and Yuan,⁷⁰ and references therein.

ESTIMATION OF THE PRECISION MATRIX

In many statistical analyses, we need an estimate of the precision matrix Ω = Σ⁻¹ rather than an estimate of the covariance matrix. Examples include linear discriminant analysis,³ Hotelling's T²-test,⁴ and Markowitz mean-variance analysis.⁵

In the special case when Σ = diag(σ_1², …, σ_p²), we have the diagonal precision matrix Ω = diag(σ_1⁻², …, σ_p⁻²). Some methods for estimating the diagonal precision matrix have been proposed in the literature, e.g., the shrinkage estimators in Tong and Wang²² and Tong et al.²³ For a general non-diagonal Ω, we can accordingly classify the existing estimators into three categories: (1) the Stein-type estimators, (2) the ridge-type estimators, and (3) the sparse estimators. The Stein-type estimators can be found, for example, in Dey⁷¹ and Tsukuma and Konno,⁷² and references therein. Due to space limitations, we will only provide a brief review of the ridge-type and sparse estimators.

Ridge-Type Estimators

Recall that an unbiased estimator of Ω is given by Ω̂ = (n − p − 2)S_n⁻¹/(n − 1). Efron and Morris⁵⁰ proposed an empirical Bayesian estimator as

\hat{\Omega}_{EM} = \frac{n - p - 2}{n - 1}\, S_n^{-1} + \frac{p^2 + p - 2}{(n - 1)\, \mathrm{tr}(S_n)}\, I_p. \quad (10)

Similar methods can be found, for example, in Haff,⁷³ Haff,⁷⁴ Krishnamoorthy and Gupta,⁷⁵ Bodnar et al.,⁷⁶ and references therein. Note that all these estimators involve the term S_n⁻¹ and so they apply to the situation when p < n only.

When p ≥ n, to overcome the singularity problem, Kubokawa and Srivastava⁷⁷ considered the following ridge-type estimator for the precision matrix,

\hat{\Omega}_{ridge} = (\alpha S_n + \beta I_p)^{-1}, \quad (11)

where α and β are two shrinkage coefficients. In their paper, an empirical Bayes approach was applied to estimate α and β. More recently, Wang et al.⁷⁸ proposed a data-driven estimator for the shrinkage coefficients using random matrix theory. Note that their proposed method is distribution-free. They further demonstrated in numerical studies that the proposed estimator performs better than the existing competitors in a wide range of settings.
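Both ridge-type precision estimators are direct to code (a sketch of ours; the choice of α and β is the crux and is left to the caller, per the empirical Bayes or random matrix theory calibrations cited above):

```python
import numpy as np

def efron_morris_precision(S, n):
    """Empirical Bayes precision estimator of Eq. (10); requires p < n - 2."""
    p = S.shape[0]
    return ((n - p - 2) / (n - 1)) * np.linalg.inv(S) \
        + ((p**2 + p - 2) / ((n - 1) * np.trace(S))) * np.eye(p)

def ridge_precision(S, alpha, beta):
    """Ridge-type precision estimator of Eq. (11): (alpha * S_n + beta * I_p)^{-1}.
    Well defined even when p >= n, provided beta > 0."""
    p = S.shape[0]
    return np.linalg.inv(alpha * S + beta * np.eye(p))
```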
Sparse Estimators

Let a = (a_1, …, a_p)^T be a vector and A = {a_ij}_{p×q} be a matrix. We define the element-wise l_1 norms as |a|_1 = ∑_j |a_j| and |A|_1 = ∑_{ij} |a_ij|, and the l_∞ norms as |a|_∞ = max_j |a_j| and |A|_∞ = max_{ij} |a_ij|. Various sparse estimators for Ω have been proposed in the recent literature. Most of them are based on a regularization approach. Banerjee et al.⁷⁹ proposed an l_1 penalized likelihood method:

\hat{\Omega} = \arg\min_{\Omega > 0} \big\{ \mathrm{tr}(S_n \Omega) - \log|\Omega| + \lambda_n |\Omega|_1 \big\}. \quad (12)

Fan et al.⁸⁰ replaced the l_1 penalty in Eq. (12) by the SCAD penalty.⁶³ More recently, Cai et al.⁸¹ considered another regularization method:

\text{minimize } |\Omega|_1 \quad \text{subject to } |S_n \Omega - I_p|_\infty \le \lambda_n, \quad (13)

where λ_n is a tuning parameter. Other regularization methods include, for example, Yuan and Lin,⁸² d'Aspremont et al.,⁸³ Friedman et al.,⁸⁴ Ravikumar et al.,⁸⁵ and references therein.
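Eq. (12) is the objective solved by the graphical lasso of Friedman et al.⁸⁴ A minimal usage sketch with scikit-learn (toy data, cross-validated penalty; implementation details such as the treatment of the diagonal in the penalty may differ from Eq. (12) as displayed) is:

```python
import numpy as np
from sklearn.covariance import GraphicalLassoCV

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 20))     # toy data: n = 200 samples, p = 20 variables

# Solves the l1-penalized likelihood problem over a grid of penalties
# lambda_n chosen by cross-validation.
model = GraphicalLassoCV().fit(X)
Omega_hat = model.precision_           # sparse estimate of the precision matrix
print((Omega_hat != 0).sum())
```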

CONCLUSION

With the advent of high-throughput data such as microarrays, we are in an era of biotechnology innovation. Instead of working on a gene-by-gene basis, the microarray technology allows simultaneous monitoring of the whole genome. These data have motivated the development of reliable biomarkers for disease subtype classification and diagnosis, and for the identification of novel targets for drug treatment. Due to the cost and other experimental difficulties such as the availabilities of biological materials, microarray data are usually collected on a limited number of samples.

High-dimensional data such as microarray gene expression pose great challenges to traditional statistical and computational methods. In particular, the standard estimates of variances and covariances are usually unreliable. In this paper, we review some recent developments in the estimation of variances and covariances for high-dimensional data. The estimation of variances and covariances plays an important role in statistical analyses including the t-test, Hotelling's T²-test, principal component analysis, and discriminant analysis. For instance, to test whether two gene sets are equal, Chen et al.⁸⁶ proposed a regularized Hotelling's test for both scenarios of p < n and p ≥ n; and Cai et al.⁸⁷ proposed another test statistic based on a linear transformation of the data by the precision matrix. In discriminant analysis, Friedman⁸⁸ proposed to use the ridge-type estimators of the covariance matrix. More recent works in this area include Guo et al.,⁸⁹ Shao et al.,⁹⁰ Fan et al.,⁹¹ among others. We have emphasized the applications to microarray data analysis. Nevertheless, the methods reviewed in this paper have a wide variety of applications. For example, the estimation of the precision matrix may be applied to graphical models.⁴⁵,⁶⁵,⁸¹ We note that the review in this paper is selective and many other important approaches are not included due to space limitations.

There are many remaining challenges in the estimation of variances and covariances for high-dimensional data. For instance, the estimation usually involves unknown tuning parameters. Cross-validation and bootstrap methods have been proposed to select the tuning parameters. Most guidelines are based on simulation studies without theoretical justification.⁹² Assumptions and evaluation criteria used in the estimation are somewhat arbitrary in the existing literature. This makes it difficult to have a fair comparison among estimation methods.

ACKNOWLEDGMENTS Tiejun Tong’s research was supported in part by Hong Kong Research Grant HKBU202711 and Hong Kong Baptist University FRG Grants FRG2/11-12/110 and FRG1/13-14/018. Cheng Wang’s research was supported by NSF of China Grants 11101397, 71001095, and 11271347. Yuedong Wang’s research was supported by NSF Grant DMS-07-06886. The authors thank the editor, the associate editor, and two referees for their constructive comments that led to a substantial improvement of this review article.

REFERENCES

1. Bai Z, Saranadasa H. Effect of high dimension: by an example of a two sample problem. Stat Sin 1996, 6:311–330.
2. Chen S, Zhang L, Zhong P. Tests for high-dimensional covariance matrices. J Am Stat Assoc 2010, 105:810–819.
3. Anderson TW. An Introduction to Multivariate Statistical Analysis. 3rd ed. New York: Wiley; 2003.
4. Hotelling H. The generalization of Student's ratio. Ann Math Stat 1931, 2:360–378.
5. Markowitz H. Portfolio selection. J Finance 1952, 7:77–91.
6. Cui X, Churchill GA. Statistical tests for differential expression in cDNA microarray experiments. Genome Biol 2003, 4:210.
7. Ayroles JF, Gibson G. Analysis of variance of microarray data. Methods Enzymol 2006, 411:214–233.
8. Dudoit S, Fridlyand J, Speed TP. Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc 2002, 97:77–87.
9. Bickel PJ, Levina E. Some theory of Fisher's linear discriminant function, 'naive Bayes', and some alternatives when there are many more variables than observations. Bernoulli 2004, 10:989–1010.
10. Lee JW, Lee JB, Park M, Song SH. An extensive comparison of recent classification tools applied to microarray data. Comput Stat Data Anal 2005, 48:869–885.
11. Pang H, Tong T, Zhao H. Shrinkage-based diagonal discriminant analysis and its applications in high-dimensional data. Biometrics 2009, 65:1021–1029.
12. Huang S, Tong T, Zhao H. Bias-corrected diagonal discriminant rules for high-dimensional classification. Biometrics 2010, 66:1096–1106.
13. Wu Y, Genton MG, Stefanski LA. A multivariate two-sample mean test for small sample size and missing data. Biometrics 2006, 62:877–885.
14. Srivastava MS, Du M. A test for the mean vector with fewer observations than the dimension. J Multivariate Anal 2008, 99:386–402.
15. Srivastava MS. A test for the mean vector with fewer observations than the dimension under non-normality. J Multivariate Anal 2009, 100:518–532.
16. Park J, Ayyala DN. A test for the mean vector in large dimension and small samples. J Stat Plann Inference 2013, 143:929–943.
17. Srivastava MS, Katayama S, Kano Y. A two sample test in high dimensional data. J Multivariate Anal 2013, 114:349–358.


18. Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA 2001, 98:5116–5121.
19. Efron B, Tibshirani R, Storey JD, Tusher VG. Empirical Bayes analysis of a microarray experiment. J Am Stat Assoc 2001, 96:1151–1160.
20. Cui X, Hwang JTG, Qiu J, Blades NJ, Churchill GA. Improved statistical tests for differential gene expression by shrinking variance components estimates. Biostatistics 2005, 6:59–75.
21. James W, Stein C. Estimation with quadratic loss. In: Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1. Berkeley, CA: University of California Press; 1961, 361–379.
22. Tong T, Wang Y. Optimal shrinkage estimation of variances with applications to microarray data analysis. J Am Stat Assoc 2007, 102:113–122.
23. Tong T, Jang H, Wang Y. James–Stein type estimators of variances. J Multivariate Anal 2012, 107:232–243.
24. Baldi P, Long AD. A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics 2001, 17:509–519.
25. Lönnstedt I, Speed T. Replicated microarray data. Stat Sin 2002, 12:31–46.
26. Kendziorski CM, Newton MA, Lan H, Gould MN. On parametric empirical Bayes methods for comparing multiple groups using replicated gene expression profiles. Stat Med 2003, 22:3899–3914.
27. Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 2004, 3:1.
28. Wright GW, Simon RM. A random variance model for detection of differential gene expression in small microarray experiments. Bioinformatics 2003, 19:2448–2455.
29. Hwang JTG, Liu P. Optimal tests shrinking both means and variances applicable to microarray data analysis. Stat Appl Genet Mol Biol 2010, 9:36.
30. Zhao Z. Double shrinkage empirical Bayesian estimation for unknown and unequal variances. Stat Interface 2010, 3:533–541.
31. Ji T, Liu P, Nettleton D. Borrowing information across genes and experiments for improved error variance estimation in microarray data analysis. Stat Appl Genet Mol Biol 2012, 11:12.
32. Chen Y, Dougherty ER, Bittner ML. Ratio-based decisions and the quantitative analysis of cDNA microarray images. J Biomed Opt 1997, 2:364–374.
33. Rocke DM, Durbin B. A model for measurement error for gene expression arrays. J Comput Biol 2001, 8:557–569.
34. Strimmer K. Modeling gene expression measurement error: a quasi-likelihood approach. BMC Bioinformatics 2003, 4:10.
35. Weng L, Dai H, Zhan Y, He Y, Stepaniants SB, Bassett DE. Rosetta error model for gene expression analysis. Bioinformatics 2006, 22:1111–1121.
36. Durbin B, Hardin J, Hawkins D, Rocke DM. A variance-stabilizing transformation for gene-expression microarray data. Bioinformatics 2002, 18:S105–S110.
37. Huber W, von Heydebreck A, Sültmann H, Poustka A, Vingron M. Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 2002, 18:S96–S104.
38. Rocke DM. Design and analysis of experiments with high throughput biological assay data. Semin Cell Dev Biol 2004, 15:708–713.
39. Rocke DM, Durbin B. Approximate variance-stabilizing transformations for gene-expression microarray data. Bioinformatics 2003, 19:966–972.
40. Durbin B, Rocke DM. Estimation of transformation parameters for microarray data. Bioinformatics 2003, 19:1360–1367.
41. Chen Y, Kamat V, Dougherty ER, Bittner ML, Meltzer PS, Trent JM. Ratio statistics of gene expression levels and applications to microarray data analysis. Bioinformatics 2002, 18:1207–1215.
42. Carroll RJ, Wang Y. Nonparametric variance estimation in the analysis of microarray data: a measurement error approach. Biometrika 2008, 95:437–449.
43. Wang Y, Ma Y, Carroll RJ. Variance estimation in the analysis of microarray data. J R Stat Soc B 2009, 71:425–445.
44. Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu C. Measurement Error in Nonlinear Models: A Modern Perspective. 2nd ed. New York: Chapman & Hall; 2006.
45. Fan J, Feng Y, Niu YS. Nonparametric estimation of genewise variance for microarray data. Ann Stat 2010, 38:2723–2750.
46. Fang Y, Zhu LX. Asymptotics of SIMEX-based variance estimation. Metrika 2012, 75:329–345.
47. Stein C. Estimation of a covariance matrix. Rietz Lecture, 39th IMS Annual Meeting, Atlanta, GA, 1975.
48. Daniels MJ, Kass RE. Shrinkage estimators for covariance matrices. Biometrics 2001, 57:1173–1184.
49. Haff LR. The variational form of certain Bayes estimators. Ann Stat 1991, 19:1163–1190.
50. Efron B, Morris C. Multivariate empirical Bayes and estimation of covariance matrices. Ann Stat 1976, 4:22–32.
51. Dey DK, Srinivasan C. Estimation of a covariance matrix under Stein's loss. Ann Stat 1985, 13:1581–1591.


52. Yang R, Berger JO. Estimation of a covariance matrix using the reference prior. Ann Stat 1994, 22:1195–1211.
53. Daniels MJ, Kass RE. Nonconjugate Bayesian estimation of covariance matrices and its use in hierarchical models. J Am Stat Assoc 1999, 94:1254–1263.
54. Ledoit O, Wolf M. A well-conditioned estimator for large-dimensional covariance matrices. J Multivariate Anal 2004, 88:365–411.
55. Fisher TJ, Sun X. Improved Stein-type shrinkage estimators for the high-dimensional multivariate normal covariance matrix. Comput Stat Data Anal 2011, 55:1909–1918.
56. Schäfer J, Strimmer K. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Stat Appl Genet Mol Biol 2005, 4:32.
57. Warton DI. Penalized normal likelihood and ridge regularization of correlation and covariance matrices. J Am Stat Assoc 2008, 103:340–349.
58. Chen Y, Wiesel A, Hero AO. Shrinkage estimation of high dimensional covariance matrices. In: IEEE International Conference on Acoustics, Speech and Signal Processing, 2009 (ICASSP 2009). Taipei, Taiwan: IEEE; 2009, 2937–2940.
59. Warton DI. Regularized sandwich estimators for analysis of high-dimensional data using generalized estimating equations. Biometrics 2011, 67:116–123.
60. Bickel PJ, Levina E. Covariance regularization by thresholding. Ann Stat 2008, 36:2577–2604.
61. Rothman AJ, Levina E, Zhu J. Generalized thresholding of large covariance matrices. J Am Stat Assoc 2009, 104:177–186.
62. Donoho DL, Johnstone IM. Ideal spatial adaptation by wavelet shrinkage. Biometrika 1994, 81:425–455.
63. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 2001, 96:1348–1360.
64. Zou H. The adaptive lasso and its oracle properties. J Am Stat Assoc 2006, 101:1418–1429.
65. Cai T, Liu W. Adaptive thresholding for sparse covariance matrix estimation. J Am Stat Assoc 2011, 106:672–684.
66. Bickel PJ, Levina E. Regularized estimation of large covariance matrices. Ann Stat 2008, 36:199–227.
67. Karoui NE. Operator norm consistent estimation of large-dimensional sparse covariance matrices. Ann Stat 2008, 36:2717–2756.
68. Lam C, Fan J. Sparsistency and rates of convergence in large covariance matrix estimation. Ann Stat 2009, 37:4254–4278.
69. Cai T, Zhou H. Optimal rates of convergence for sparse covariance matrix estimation. Ann Stat 2012, 40:2389–2420.
70. Cai T, Yuan M. Adaptive covariance matrix estimation through block thresholding. Ann Stat 2012, 40:2014–2042.
71. Dey DK. Improved estimation of a multinormal precision matrix. Stat Probab Lett 1987, 6:125–128.
72. Tsukuma H, Konno Y. On improved estimation of normal precision matrix and discriminant coefficients. J Multivariate Anal 2006, 97:1477–1500.
73. Haff LR. Minimax estimators for a multinormal precision matrix. J Multivariate Anal 1977, 7:374–385.
74. Haff LR. An identity for the Wishart distribution with applications. J Multivariate Anal 1979, 9:531–544.
75. Krishnamoorthy K, Gupta A. Improved minimax estimation of a normal precision matrix. Can J Stat 1989, 17:91–102.
76. Bodnar T, Gupta AK, Parolya N. Optimal linear shrinkage estimator for large dimensional precision matrix. arXiv preprint arXiv:1308.0931.
77. Kubokawa T, Srivastava MS. Estimation of the precision matrix of a singular Wishart distribution and its application in high-dimensional data. J Multivariate Anal 2008, 99:1906–1928.
78. Wang C, Pan G, Tong T, Zhu L. Shrinkage estimation of large dimensional precision matrix using random matrix theory. Stat Sin, in press.
79. Banerjee O, El Ghaoui L, d'Aspremont A. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J Mach Learn Res 2008, 9:485–516.
80. Fan J, Feng Y, Wu Y. Network exploration via the adaptive LASSO and SCAD penalties. Ann Appl Stat 2009, 3:521–541.
81. Cai T, Liu W, Luo X. A constrained L1 minimization approach to sparse precision matrix estimation. J Am Stat Assoc 2011, 106:594–607.
82. Yuan M, Lin Y. Model selection and estimation in the Gaussian graphical model. Biometrika 2007, 94:19–35.
83. d'Aspremont A, Banerjee O, El Ghaoui L. First-order methods for sparse covariance selection. SIAM J Matrix Anal Appl 2008, 30:56–66.
84. Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 2008, 9:432–441.
85. Ravikumar P, Wainwright MJ, Raskutti G, Yu B. High-dimensional covariance estimation by minimizing l1-penalized log-determinant divergence. Electron J Stat 2011, 5:935–980.
86. Chen LS, Paul D, Prentice RL, Wang P. A regularized Hotelling's T2 test for pathway analysis in proteomic studies. J Am Stat Assoc 2011, 106:1345–1360.
87. Cai T, Liu W, Xia Y. Two-sample test of high dimensional means under dependence. J R Stat Soc B 2014, 76:349–372.
88. Friedman J. Regularized discriminant analysis. J Am Stat Assoc 1989, 84:165–175.


89. Guo Y, Hastie T, Tibshirani R. Regularized linear discriminant analysis and its application in microarrays. Biostatistics 2007, 8:86–100.
90. Shao J, Wang Y, Deng X, Wang S. Sparse linear discriminant analysis by thresholding for high dimensional data. Ann Stat 2011, 39:1241–1265.
91. Fan Y, Jin J, Yao Z. Optimal classification in sparse Gaussian graphic model. Ann Stat 2013, 41:2537–2571.
92. Fang Y, Wang B, Feng Y. Tuning parameter selection in regularized estimations of large covariance matrices. arXiv preprint arXiv:1308.3416, 2013.
