Maximum Likelihood for Variance Estimation in High-Dimensional Linear Models

Lee H. Dicker                      Murat A. Erdogdu
Rutgers University                 Stanford University

Abstract

We study maximum likelihood estimators (MLEs) for the residual variance, the signal-to-noise ratio, and other variance parameters in high-dimensional linear models. These parameters are essential in many statistical applications involving regression diagnostics, inference, tuning parameter selection for high-dimensional regression, and other applications, including genetics. The estimators that we study are not new, and have been widely used for variance component estimation in linear random-effects models. However, our analysis is new and it implies that the MLEs, which were devised for random-effects models, may also perform very well in high-dimensional linear models with fixed-effects, which are more commonly studied in some areas of high-dimensional statistics. The MLEs are shown to be consistent and asymptotically normal in fixed-effects models with random design, in asymptotic settings where the number of predictors (p) is proportional to the number of observations (n). Moreover, the estimators' asymptotic variance can be given explicitly in terms of moments of the Marčenko-Pastur distribution. A variety of analytical and empirical results show that the MLEs outperform other, previously proposed estimators for variance parameters in high-dimensional linear models with fixed-effects. More broadly, the results in this paper illustrate a strategy for drawing connections between fixed- and random-effects models in high dimensions, which may be useful in other applications.

Appearing in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS) 2016, Cadiz, Spain. JMLR: W&CP volume 41. Copyright 2016 by the authors.

1 INTRODUCTION

Variance parameters — including the residual variance, the proportion of explained variation, and, under some interpretations (e.g. Dicker, 2014; Hartley & Rao, 1967; Janson et al., 2015), the signal-to-noise ratio — are important parameters in many statistical models. While they are often not the primary focus of a given data analysis or prediction routine, variance parameters are fundamental for a variety of tasks, including:

(i) Regression diagnostics, e.g. risk estimation (Bayati et al., 2013; Mallows, 1973).
(ii) Inference, e.g. hypothesis testing and constructing confidence intervals (Fan et al., 2012; Javanmard & Montanari, 2014; Reid et al., 2013).
(iii) Optimally tuning other methods, e.g. tuning parameter selection for high-dimensional regression (Sun & Zhang, 2012).
(iv) Other applications, e.g. genetics (De los Campos et al., 2015; Listgarten et al., 2012; Loh et al., 2015; Yang et al., 2010).

This paper focuses on variance estimation in high-dimensional linear models. We study maximum likelihood-based estimators for the residual variance and other related variance parameters. The results in this paper show that — under appropriate conditions — the maximum likelihood estimators (MLEs) outperform several other previously proposed variance estimators for high-dimensional linear models, in theory and in numerical examples, with both real and simulated data.

The MLEs studied here are not new. Indeed, they are widely used for variance components estimation in linear random-effects models (e.g. Searle et al., 1992). However, the analysis in this paper is primarily concerned with the performance of the MLEs in fixed-effects models, which are more commonly studied in some areas of modern high-dimensional statistics (e.g. Bühlmann & Van de Geer, 2011). Thus, from one perspective, this paper may be viewed as an article on model misspecification: We show that a classical estimator for random-effects models may also be used effectively in fixed-effects models. More broadly, the results in this paper illustrate a strategy for drawing connections between fixed- and random-effects models in high dimensions, which may be useful in other applications; for instance, a similar strategy has been employed in Dicker's (2016) analysis of ridge regression.

1.1 Related Work

There is an abundant literature on variance components estimation for random-effects models, going back to at least the 1940s (Crump, 1946; Satterthwaite, 1946). More recently, a substantial literature has developed on the use of random-effects models in high-throughput genetics (e.g. De los Campos et al., 2015; Listgarten et al., 2012; Loh et al., 2015; Yang et al., 2010). Maximum likelihood estimators for variance parameters, like those studied in this paper, are widely used in this research area. A variety of questions about the use of MLEs for variance parameters in modern genetics remain unanswered, including fundamental questions about whether fixed- or random-effects models are more appropriate (De los Campos et al., 2015). Our work here focuses primarily on fixed-effects models, and thus differs from much of this recent work on genetics. However, Theorem 2 (see, in particular, the discussion in Section 4.2) and the data analysis in Section 6 below suggest that fixed-effects analyses could possibly offer improved power over random-effects approaches to variance estimation, in some settings.

Switching focus to high-dimensional linear models with fixed-effects, previous work on variance estimation has typically been built on one of two sets of assumptions: Either (i) the underlying regression parameter [β in the model (1) below] is highly structured (e.g. sparse) or (ii) the design matrix X = (x_{ij}) is highly structured [e.g. the x_{ij} are iid N(0, 1)]. Notable work on variance estimation under the "structured β" assumption includes (Chatterjee & Jafarov, 2015; Fan et al., 2012; Sun & Zhang, 2012). Bayati et al. (2013), Dicker (2014), and Janson et al. (2015) have studied variance estimation in the "structured X" setting. This paper also focuses on the structured X setting. As is characteristic of the structured X setting, many of the results in this paper require strong assumptions on X; however, none of our results require any sparsity assumptions on β.

Figure 1 illustrates some potential advantages of the structured X approach. In particular, it shows that structured β methods for estimating the residual variance σ₀² can be biased, if β is not extremely sparse, while the structured X methods are effective regardless of sparsity. Figure 1 was generated from datasets where the design matrix X was highly structured, i.e. the x_{ij} ∼ N(0, 1) were all iid (a detailed description of the numerical simulations used to generate Figure 1 may be found in the Supplementary Material). In general, structured X methods work best when the entries of X are uncorrelated. Indeed, by generating X with highly correlated columns and taking β to be very sparse, it is easy to generate plots similar to those in Figure 1, where the structured X methods are badly biased and the structured β methods perform well. On the other hand, if the design matrix X is correlated, it may be possible to improve the performance of structured X methods by "decorrelating" X (i.e. right-multiplying X by a suitable positive definite matrix); this is not pursued in detail here, but may be an interesting area for future research.

The method labeled "MLE" in Figure 1 is the main focus of this paper. The other methods depicted in Figure 1 that rely on structured X are "MM," "EigenPrism," and "AMP." MM is a method-of-moments estimator for σ₀² proposed in (Dicker, 2014); EigenPrism was proposed in (Janson et al., 2015) and is the solution to a convex optimization problem; AMP is a method for estimating σ₀² based on approximate message passing algorithms and was proposed in (Bayati et al., 2013). In Figure 1, each of the structured X methods is evidently unbiased. Additionally, it is clear that the variability of MLE is uniformly smaller than that of MM and EigenPrism (detailed results are reported in Table S1 of the Supplementary Material). The AMP method is unique among the four structured X methods, because it is the only one that can adapt to sparsity in β; consequently, while the variability of MLE is smaller than that of AMP when β is not sparse (e.g. when sparsity is 40%), the AMP estimator has smaller variance when β is sparse. This is certainly an attractive property of the AMP estimator. However, little is known about the asymptotic distribution of the AMP estimator. By contrast, one important feature of the MLE is that it is asymptotically normal in high dimensions and its asymptotic variance is given by the simple formula in Theorem 2. This is a useful property for performing inference on σ₀² and related parameters, and for better understanding the asymptotic performance of the MLE.

1.2 Overview of the Paper

Section 2 covers preliminaries. We introduce the statistical model and define a bivariate MLE for the residual variance and the signal-to-noise ratio — this is the main estimator of interest. We also give some additional background on the "structured X" model that is studied here.

In Section 3, we present a coupling argument, which


Figure 1: Estimates of the residual variance σ₀² from 500 independent datasets. Structured X simulations: x_{ij} ∼ N(0, 1) are iid. Parameter values: σ₀² = 1, η₀² = 4, n = 500, p = 1000. "α%-sparse" indicates that (α/100) × p coordinates of β ∈ ℝᵖ are set equal to 0. Structured X estimators: MLE, MM (Dicker, 2014), EigenPrism (Janson et al., 2015), AMP (Bayati et al., 2013). Structured β estimators: Scaled-Lasso (Sun & Zhang, 2012), RCV-Lasso (Fan et al., 2012). Detailed description of settings and results may be found in the Supplementary Material.
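As a concrete reference for the simulation design described in the caption, the following sketch generates one such dataset. This is our own illustration, not the authors' simulation code; in particular, spreading the signal equally over the nonzero coordinates of β is an assumption here (the exact configuration used in the paper is in the Supplementary Material).

```python
import numpy as np

def make_dataset(n=500, p=1000, sigma2=1.0, eta2=4.0, sparsity=0.4, seed=0):
    """Generate one dataset from the model y = X beta + eps with iid
    N(0, 1) design; 'sparsity' is the fraction of coordinates of beta
    set to zero, matching the "alpha%-sparse" convention in Figure 1."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    tau2 = eta2 * sigma2                 # signal strength tau_0^2
    k = int(round((1.0 - sparsity) * p)) # number of nonzero coordinates
    beta = np.zeros(p)
    beta[:k] = np.sqrt(tau2 / k)         # equal-magnitude nonzeros (assumed)
    eps = rng.standard_normal(n) * np.sqrt(sigma2)
    y = X @ beta + eps
    return y, X, beta
```

With the defaults above, ‖β‖² equals the caption's τ₀² = η₀²σ₀² = 4 regardless of the sparsity level.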

is the key to relating the fixed-effects model of interest and the random-effects model that motivates the MLEs studied here.

Basic asymptotic properties of the MLEs are described in Section 4. Consistency and asymptotic normality results (given in Section 4.1) rely heavily on the coupling argument from Section 3: Theorems 1–2 follow by coupling the fixed-effects model with random predictors (the "structured X" model) to a random-effects model and then appealing to existing results for random-effects models and quadratic forms.

In Section 5, we use results from random matrix theory to derive explicit formulas for the asymptotic variance of the MLEs, which are determined by moments of the Marčenko-Pastur distribution. Using these formulas, we analytically compare the asymptotic variance of the MLEs to that of other estimators.

The results of a real data analysis involving gene expression data and single nucleotide polymorphism (SNP) data are described in Section 6. The results illustrate some potential benefits of our results in an important practical example.

A concluding discussion may be found in Section 7.

2 PRELIMINARIES

Consider a linear model, where outcomes y = (y₁, ..., yₙ)⊤ ∈ ℝⁿ are related to an n × p matrix of predictors X = (x_{ij})_{1≤i≤n; 1≤j≤p} via the equation

    y = Xβ + ε,  (1)

where ε = (ε₁, ..., εₙ)⊤ ∈ ℝⁿ is a random error vector and β = (β₁, ..., βₚ)⊤ ∈ ℝᵖ is an unknown p-dimensional parameter. The observed data consist of (y, X). In this paper, we make the following distributional assumptions:

    εᵢ ∼ N(0, σ₀²),  x_{ij} ∼ N(0, 1),  1 ≤ i ≤ n, 1 ≤ j ≤ p, are all independent.  (2)

The parameter σ₀² > 0 is the residual variance. Estimating σ₀² when p and n are large is one of the main objectives of the paper.¹

The distributional assumptions (2) are strong. However, others following the "structured X" approach to variance estimation have made the same assumptions (e.g. Bayati et al., 2013; Dicker, 2014; Janson et al., 2015). There is also a substantial literature on estimating β in high dimensions under similar distributional assumptions; see, for instance, research on approximate message passing (AMP) algorithms (Donoho et al., 2009; Rangan, 2011; Vila & Schniter, 2013). In the AMP literature, there has been some success at relaxing the Gaussian assumption (2) (e.g. Korada & Montanari, 2011; Rangan et al., 2014). Relaxing (2) for structured X approaches to variance estimation is an important topic for future research. (See also the discussion in Section 7.)

¹ If p ≤ n and X has full rank, then the residual sum of squares σ̂²_OLS = (n − p)⁻¹ ‖y − Xβ̂_OLS‖² may be used to estimate σ₀², where β̂_OLS = (X⊤X)⁻¹X⊤y. Thus, estimating σ₀² when p > n is more interesting and may be considered the main focus of this work (however, the results in Section 5.2 imply that the MLE σ̂² outperforms σ̂²_OLS even when p < n).
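The baseline estimator from Footnote 1 is straightforward to compute; a minimal sketch of our own (for the p < n, full-rank case only):

```python
import numpy as np

def sigma2_ols(y, X):
    """Residual-variance estimator from Footnote 1:
    sigma^2_OLS = ||y - X beta_OLS||^2 / (n - p), requiring p < n."""
    n, p = X.shape
    assert p < n, "OLS residual variance needs more observations than predictors"
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_ols
    return float(resid @ resid) / (n - p)
```

When p > n this estimator is undefined, which is one motivation for the likelihood-based approach studied in the paper.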

To estimate the residual variance, we introduce two auxiliary parameters that are closely related to σ₀²: the signal strength, τ₀² = ‖β‖², and the signal-to-noise ratio, η₀² = τ₀²/σ₀² (here and throughout the paper, ‖·‖ denotes the ℓ²-norm). Now define the bivariate parameter θ₀ = (σ₀², η₀²). Consider the estimator

    θ̂ = argmax_{σ², η² ≥ 0} ℓ(θ),  (3)

where θ = (σ², η²) and

    ℓ(θ) = −(1/2) log(σ²) − (1/(2n)) log det((η²/p) XX⊤ + I) − (1/(2σ²n)) y⊤ ((η²/p) XX⊤ + I)⁻¹ y.  (4)

The estimator θ̂ is the main focus of this paper.² It is the MLE for θ₀ under a linear random-effects model, where the regression parameter β ∼ N{0, (τ₀²/p)I} is Gaussian and independent of ε and X; indeed, under this random-effects model, the distribution of y | X ∼ N{0, (τ₀²/p)XX⊤ + σ₀²I} is Gaussian.

² If ℓ(θ) has multiple maximizers, then use any pre-determined rule to select θ̂. Maximizing ℓ(θ) is a nonconvex problem; however, it has been extremely well-studied and is straightforward, e.g. (Demidenko, 2013).

3 A COUPLING ARGUMENT

Theoretical properties of θ̂ have already been studied extensively in the literature, within the context of random-effects models (e.g. Dicker & Erdogdu, 2016; Hartley & Rao, 1967; Jiang, 1996; Searle et al., 1992). In this paper we are primarily concerned with the performance of θ̂ in the fixed-effects model (1)–(2), where β is non-random. The main objective of this section is to explain why one might expect θ̂ to perform well in the fixed-effects model. To do this, we show that the model (1)–(2) can be coupled to a random-effects model. The key points of the coupling argument are that the distribution of the design matrix X and the estimator θ̂ are both invariant under orthogonal transformations, i.e.

    X =_D XU⊤,  (5)
    θ̂(y, X) = θ̂(y, XU⊤)  (6)

for any p × p orthogonal matrix U ("=_D" denotes equality in distribution). The argument is given in more detail below.

Let U be an independent uniformly distributed p × p orthogonal matrix, define β̃ = Uβ, and let

    ỹ = Xβ̃ + ε.  (7)

Then (ỹ, X) may be viewed as data drawn from a random-effects linear model, with the regression parameter β̃ ∼ uniform{S^{p−1}(τ₀²)}, the uniform distribution on the sphere in ℝᵖ of radius τ₀, S^{p−1}(τ₀²) = {u ∈ ℝᵖ; ‖u‖² = τ₀²}. Next define the MLE based on (7),

    θ̃ = argmax_{σ², η² ≥ 0} ℓ̃(θ),

where ℓ̃(θ) is defined just as in (4), except that y is replaced by ỹ. Then the distribution of θ̂ = θ̂(y, X) is the exact same as the distribution of θ̃ = θ̃(ỹ, X); more precisely, for any Borel set B ⊆ ℝ², we have

    P{θ̂(y, X) ∈ B} = P{θ̂(XU⊤β̃ + ε, X) ∈ B}
                    = P{θ̂(XU⊤β̃ + ε, XU⊤) ∈ B}
                    = P{θ̂(Xβ̃ + ε, X) ∈ B}
                    = P{θ̃(ỹ, X) ∈ B},

where the second equality holds because of (6) and the third equality follows from (5). Thus, the distributional properties of θ̂ — the estimator based on the original data from the fixed-effects model — are the exact same as the distributional properties of θ̃ — the estimator based on the random-effects data, i.e.

    θ̂ =_D θ̃.  (8)

The distributional identity (8) links the fixed-effects model (1)–(2) with the random-effects model (7). However, one remaining issue is that the random-effects vector β̃ = (β̃₁, ..., β̃ₚ)⊤ ∈ ℝᵖ has correlated components. While most of the existing work on random-effects models applies to models with independent random-effects, Dicker & Erdogdu (2016) recently derived concentration bounds for variance components estimators in models with correlated random-effects, which can be applied in the present setting. Dicker & Erdogdu's bounds rely on the existence of a tight independent coupling; this means finding a random vector β* with independent components, such that ‖β̃ − β*‖ is small with high probability. Such a coupling is described in the following paragraph.

Let z ∼ N(0, I) be an independent p-dimensional Gaussian random vector. Without loss of generality, we may assume that

    β̃ = τ₀z/‖z‖ ∼ uniform{S^{p−1}(τ₀²)}.

Now define β* = (τ₀/p^{1/2})z ∼ N{0, (τ₀²/p)I}. Then β* has independent components and, when p is large, ‖z‖ ≈ p^{1/2} and β̃ ≈ β*. This is the required coupling. More precisely, applying Chernoff's bound to the chi-squared random variable ‖z‖² implies that if 0 ≤ r < τ₀, then

    P{‖β̃ − β*‖ > r} = P{τ₀ |‖z‖/p^{1/2} − 1| > r}
                     = P{‖z‖² > p(1 + r/τ₀)²} + P{‖z‖² < p(1 − r/τ₀)²}
                     ≤ 2 exp(−pr²/(2τ₀²)),  (9)

for 0 ≤ r < τ₀.

4 ASYMPTOTIC PROPERTIES

4.1 Consistency and Asymptotic Normality

In this section, we derive consistency and asymptotic normality results for θ̂, under the assumption that p/n → ρ ∈ (0, ∞) \ {1}.³ Consistency, i.e. convergence in probability, is a direct consequence of (9) and Proposition 2 from (Dicker & Erdogdu, 2016).

Theorem 1. Assume that (1)–(2) holds. Let K ⊆ (0, ∞) be a compact set and suppose that σ₀², η₀² ∈ K. Additionally suppose that ρ ∈ (0, ∞) \ {1}. Then θ̂ → θ₀ in probability, as p/n → ρ.

³ If ρ = 1, then inf λ_min(p⁻¹XX⊤) = 0, where λ_min(p⁻¹XX⊤) is the smallest nonzero eigenvalue of p⁻¹XX⊤. This causes some technical difficulties, which arise frequently in random matrix theory, but can often be overcome with some additional work, e.g. (Bai et al., 2003).

It takes a little more work to show that θ̂ is asymptotically normal, but the approach is similar to the standard argument for asymptotic normality of maximum likelihood and M-estimators (e.g. Chapter 5 of Van der Vaart, 2000). The key fact is that θ̂ solves the score equation S(θ̂) = 0, where

    S(θ) = (∂ℓ/∂θ)(θ) = ((∂ℓ/∂σ²)(θ), (∂ℓ/∂η²)(θ))⊤,
    (∂ℓ/∂σ²)(θ) = (1/(2σ⁴n)) y⊤ ((η²/p) XX⊤ + I)⁻¹ y − 1/(2σ²),
    (∂ℓ/∂η²)(θ) = (1/(2σ²n)) y⊤ (p⁻¹XX⊤) ((η²/p) XX⊤ + I)⁻² y − (1/(2n)) tr{(p⁻¹XX⊤) ((η²/p) XX⊤ + I)⁻¹}.

Taylor expanding S(θ) about θ₀, we obtain

    n^{1/2}(θ̂ − θ₀) ≈ −n^{1/2} {(∂S/∂θ)(θ₀)}⁻¹ S(θ₀).  (10)

The right-hand side of (10) (in particular, S(θ₀)) involves a quadratic form in y. If the data were drawn from a random-effects model with independent random-effects, then standard theory (for random-effects models or quadratic forms) could be used to argue that n^{1/2}(θ̂ − θ₀) is approximately normal. However, since we are interested in the fixed-effects model (1)–(2), there is an additional step: We must go through a variant of the coupling argument from Section 3 again.

Define S̃(θ) to be the same as S(θ), except replace y with ỹ. Then, using (8),

    n^{1/2}(θ̂ − θ₀) =_D n^{1/2}(θ̃ − θ₀) ≈ −n^{1/2} {(∂S̃/∂θ)(θ₀)}⁻¹ S̃(θ₀).  (11)

The entries of S̃(θ) are quadratic forms in

    ỹ = Xβ̃ + ε = X(τ₀z/‖z‖) + ε.  (12)

Thus, the approximation (11) allows us to shift focus to the random-effects model (7) and quadratic forms in ỹ. This is an important step, because of the relative simplicity of random-effects models. However, the coordinates of β̃ are correlated, which is an additional obstacle that must be addressed. To overcome this issue, we use the fact that ỹ can be represented in terms of z and ε, as in (12), and apply the delta method to S̃(θ₀) (that is, we use another Taylor expansion; cf. Ch. 3 of Van der Vaart, 2000). Following this strategy, S̃(θ₀) can be approximated by quadratic forms in the random vectors z and ε, which have independent components. Asymptotic normality for S̃(θ₀) then follows from existing normal approximation results for quadratic forms in independent random variables.

Theorem 2. Assume that the conditions of Theorem 1 hold. Additionally, define

    I_N(θ₀) = [ ι₂(θ₀)  ι₃(θ₀) ; ι₃(θ₀)  ι₄(θ₀) ],

where

    ι_k(θ₀) = (1/(2nσ₀^{2(4−k)})) tr{ (p⁻¹XX⊤)^{k−2} ((η₀²/p) XX⊤ + I)^{2−k} },

for k = 2, 3, 4, and

    J(θ₀)⁻¹ = I_N(θ₀)⁻¹ − (2η₀⁴/(p/n)) [ 0  0 ; 0  1 ].  (13)

Then n^{1/2} J(θ₀)^{1/2} (θ̂ − θ₀) → N(0, I) in distribution, as p/n → ρ.
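For context, θ̂ in (3)–(4) can be computed from a single eigendecomposition of XX⊤/p: setting the score in σ² to zero shows that, for fixed η², (4) is maximized by σ²(η²) = y⊤((η²/p)XX⊤ + I)⁻¹y/n, leaving a one-dimensional profile likelihood in η². The following sketch is our own simple grid search over that profile, not the authors' implementation (standard mixed-model software would use Newton-type iterations instead).

```python
import numpy as np

def mle_theta(y, X, eta2_grid=None):
    """Profile-likelihood sketch of the bivariate MLE (3)-(4).

    If s_i, u_i are the eigenpairs of X X'/p, then
    log det((eta^2/p) X X' + I) = sum_i log(eta^2 s_i + 1) and the
    quadratic form in (4) is sum_i (u_i' y)^2 / (eta^2 s_i + 1)."""
    n, p = X.shape
    s, U = np.linalg.eigh(X @ X.T / p)       # eigenvalues s_i >= 0
    z2 = (U.T @ y) ** 2                      # squared coordinates of y
    if eta2_grid is None:
        eta2_grid = np.linspace(0.0, 20.0, 2001)
    best = (-np.inf, None, None)
    for eta2 in eta2_grid:
        d = eta2 * s + 1.0
        sigma2 = np.sum(z2 / d) / n          # profiled-out sigma^2
        ll = -0.5 * np.log(sigma2) - np.sum(np.log(d)) / (2 * n) - 0.5
        if ll > best[0]:
            best = (ll, sigma2, eta2)
    return best[1], best[2]                  # (sigma2_hat, eta2_hat)
```

Substituting the profiled σ² back into (4) turns the final term into the constant −1/2, which is why it appears in the code.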

As described above, the proof of Theorem 2 relies on three steps: (i) the distributional identity and Taylor approximation (11), (ii) applying the delta method to handle the correlated coordinates of β̃, and (iii) normal approximation results for quadratic forms in z and ε. These steps can be rigorously justified using Theorem 2 and Proposition 2 from (Dicker & Erdogdu, 2016).

4.2 Random-Effects Model

The matrix I_N(θ₀) defined in Theorem 2 is the Fisher information matrix for θ₀ under the Gaussian random-effects linear model, where β ∼ N{0, (τ₀²/p)I}. Well-known results on random-effects models (Jiang, 1996) and standard likelihood theory (e.g. Ch. 6 of Lehmann & Casella, 1998) imply that under this random-effects model, θ̂ is asymptotically efficient with asymptotic variance I_N(θ₀)⁻¹. Thus, Theorem 2 implies that η̂² has smaller asymptotic variance under the fixed-effects model than under the random-effects model (on the other hand, σ̂² has the same asymptotic variance in both models). As a consequence, confidence intervals for variance parameters related to η₀² tend to be smaller under the fixed-effects model than under the random-effects model, and corresponding tests may have more power in the fixed-effects model. A practical illustration of this observation may be found in the data analysis in Section 6.

5 MARČENKO-PASTUR APPROXIMATIONS

5.1 Formulas for the Asymptotic Variance of θ̂

Let F = F_{n,p} denote the empirical distribution of the eigenvalues of p⁻¹XX⊤. Marčenko & Pastur (1967) showed that under (2), if p/n → ρ ∈ (0, ∞), then F_{n,p} → F_ρ, where F_ρ is the Marčenko-Pastur distribution with density

    f_ρ(s) = max{1 − ρ, 0} δ₀(s) + (ρ/(2π)) √((b − s)(s − a)) / s,

for a ≤ s ≤ b, where a = (1 − ρ^{−1/2})² and b = (1 + ρ^{−1/2})².

In Theorem 2, observe that the entries of I_N(θ₀) can be reexpressed as

    ι_k(θ₀) = (1/(2σ₀^{2(4−k)})) ∫ (s/(η₀²s + 1))^{k−2} dF(s).

Thus, it is evident that if ρ ∈ (0, ∞), then lim_{p/n→ρ} ι_k(θ₀) = ι_k^ρ(θ₀), where

    ι_k^ρ(θ₀) = (1/(2σ₀^{2(4−k)})) ∫_a^b (s/(η₀²s + 1))^{k−2} f_ρ(s) ds.  (14)

The quantities ι_k^ρ(θ₀) — and, consequently, the asymptotic variance of θ̂, given in Theorem 2 — can be computed explicitly using properties of the Stieltjes transform of the Marčenko-Pastur distribution, which is defined by

    m_ρ(z) = ∫_a^b (1/(s + z)) f_ρ(s) ds  (15)
           = (1/(2z)) [1 − ρz − ρ + {(1 − ρz − ρ)² + 4ρz}^{1/2}],

where the second equality holds for z > 0 (see, e.g., Ch. 3 of Bai & Silverstein, 2010). Comparing (14)–(15) and differentiating m_ρ(z) as necessary implies that

    ι₂^ρ(θ₀) = 1/(2σ₀⁴),  (16)
    ι₃^ρ(θ₀) = (1/(2σ₀²)) {1/η₀² − (1/η₀⁴) m_ρ(1/η₀²)},  (17)
    ι₄^ρ(θ₀) = (1/2) {1/η₀⁴ − (2/η₀⁶) m_ρ(1/η₀²) − (1/η₀⁸) m_ρ′(1/η₀²)}.  (18)

To compute the asymptotic variance of θ̂ in terms of m_ρ(z), we use (13) and the expressions for ι_k^ρ(θ₀) given above. In particular, let ψ_{k+l}^ρ(θ₀) denote the kl-element of lim_{p/n→ρ} J(θ₀)⁻¹. It follows from basic matrix algebra that

    ψ₂^ρ(θ₀) = 2σ₀⁴ [1 − {m(η₀⁻²) − η₀²}² / {m′(η₀⁻²) + m(η₀⁻²)²}],  (19)
    ψ₃^ρ(θ₀) = −2σ₀²η₀⁴ {m(η₀⁻²) − η₀²} / {m′(η₀⁻²) + m(η₀⁻²)²},  (20)
    ψ₄^ρ(θ₀) = −2η₀⁸ / {m′(η₀⁻²) + m(η₀⁻²)²} − 2η₀⁴/ρ,  (21)

where m = m_ρ. From (15) and (19)–(21), it is apparent that the asymptotic variance of θ̂ is an easily computed algebraic function of σ₀², η₀², and ρ.

5.2 Comparison to Previously Proposed Estimators

Define the proportion of explained variation⁴ r₀² = τ₀²/(τ₀² + σ₀²) = η₀²/(η₀² + 1) and the estimator r̂² = η̂²/(η̂² + 1). Consistency and asymptotic normality for r̂² follow easily from Theorems 1–2 and the delta method. In this section, we compare the asymptotic variance of σ̂² and r̂² to that of other previously proposed estimators for σ₀² and r₀².

⁴ The proportion of explained variation is an important parameter for regression diagnostics. It is also important in genetics; see Section 6.
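The closed form in (15) is easy to check numerically against the integral definition. The sketch below is our own: it integrates the continuous part of f_ρ with a simple midpoint rule and adds the point mass at 0 when ρ < 1.

```python
import numpy as np

def m_rho(z, rho):
    """Closed form (15) of the Marchenko-Pastur Stieltjes transform, z > 0."""
    t = 1.0 - rho * z - rho
    return (t + np.sqrt(t * t + 4.0 * rho * z)) / (2.0 * z)

def m_rho_numeric(z, rho, num=200_000):
    """Integrate f_rho(s)/(s + z) directly as a check on the closed form."""
    a = (1.0 - rho ** -0.5) ** 2
    b = (1.0 + rho ** -0.5) ** 2
    ds = (b - a) / num
    s = a + (np.arange(num) + 0.5) * ds       # midpoints avoid the endpoints
    dens = rho * np.sqrt((b - s) * (s - a)) / (2.0 * np.pi * s)
    atom = max(1.0 - rho, 0.0) / z            # contribution of delta_0
    return float(np.sum(dens / (s + z)) * ds + atom)
```

The atom at 0 matters for ρ < 1 (when p⁻¹XX⊤ has n − p zero eigenvalues); for ρ > 1 the continuous part carries all the mass.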

Figure 2: Plots of the asymptotic variance of various estimators for σ₀² and r₀², as functions of η₀² (the signal-to-noise ratio), for ρ = 0.5, 1, 5. Top row: asymptotic variances of estimators for σ² (v_MLE, v_MM, and v_OLS). Bottom row: asymptotic variances of estimators for r² (w_MLE and w_MM). Theorem 2 does not apply to the case where ρ = 1; the asymptotic variances for the MLEs in this case are conjectured to be v_MLE(η₀², 1) and w_MLE(η₀², 1). The values v_MM, v_OLS, and w_MM are given explicitly in the Supplementary Material.
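The MLE curves in the top row of Figure 2 can be evaluated directly from the formula for v_MLE in Section 5.2, given the closed-form Stieltjes transform. The sketch below is our own; as an implementation shortcut, m′ is approximated by a central finite difference rather than differentiated in closed form.

```python
import numpy as np

def v_mle(eta2, rho, h=1e-6):
    """Standardized asymptotic variance v_MLE(eta^2, rho) of sigma^2-hat:
    1 - (m(1/eta^2) - eta^2)^2 / (m'(1/eta^2) + m(1/eta^2)^2)."""
    def m(z):  # closed form (15) of the MP Stieltjes transform
        t = 1.0 - rho * z - rho
        return (t + np.sqrt(t * t + 4.0 * rho * z)) / (2.0 * z)
    z0 = 1.0 / eta2
    mp = (m(z0 + h) - m(z0 - h)) / (2.0 * h)  # finite-difference m'(1/eta^2)
    return 1.0 - (m(z0) - eta2) ** 2 / (mp + m(z0) ** 2)
```

Values above 1 quantify the high-dimensional penalty relative to the classical rate n·Var(σ̂²) ≈ 2σ₀⁴; note the finite-difference step loses accuracy for very small η² (where 1/η² is large), so extreme inputs would need the analytic derivative.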

First define the standardized asymptotic variance of p > n (i.e. ρ > 1) and has infinite variance when σˆ2, p = n; furthermore, even for p < n, it is unclear how a 2 corresponding estimate of r0 should be defined. From 2 n 2 1 ρ vMLE(η0, ρ) = 4 lim Var(ˆσ ) = 4 ψ2 (θ0) Figure 2, it is clear that the MLEs have smaller asymp- 2σ p/n→ρ 2σ 0 0 totic variance (i.e. they are more efficient) than the {m(η−2) − η2}2 = 1 − 0 0 . MLE and OLS estimators. m0(η−2) + m(η−2)2 0 0 In other work on variance estimation within the “struc- The asymptotic variance ofr ˆ2 can be derived using tured X” paradigm, Janson et al. (2015) proposed 2 2 (21) and the delta method; assuming p/n → ρ ∈ the EigenPrism method for estimating σ0 and η0. (0, ∞) \{1}, it is given by While (Janson et al., 2015) do describe methods for constructing confidence intervals for σ2, they do not 2 0 wMLE(η0, ρ) give an explicit formula for the asymptotic variance of  4 EigenPrism; instead, they gives bounds on the asymp- n 2 1 1 ρ = lim Var(ˆr ) = 2 ψ4 (θ0) totic variance in terms of the solution to a convex op- 2 p/n→ρ 2 η + 1 0 timization problem. Given this discrepancy, we omit  η2 4  1 1  = − 0 + . the EigenPrism estimator from Figure 2. However, the 2 0 −2 −2 2 2 η0 + 1 m (η0 ) + m(η0 ) ρη0 performance of EigenPrism relative to the MLE and 2 the MM estimator for σ0 is depicted in Figure 1 (see The asymptotic variances vMLE and wMLE are plot- also Table S1); in these settings, the MLE appears to ted in Figure 2, along with the asymptotic variances have significantly smaller variance than EigenPrism. 2 2 of method-of-moments (MM) estimators for σ0 and r0 (Dicker, 2014), and, in the top-left plot, the asymp- 2 −1 ˆ 2 6 DATA ANALYSIS: ESTIMATING totic variance ofσ ˆOLS = (n−p) ky−XβOLSk , which was defined in Footnote 1. 
Explicit formulas for the HERITABILITY asymptotic variances of the MM and OLS estimators are given in the Supplementary Material (these for- To illustrate how the results from this paper could mulas have been derived elsewhere previously). We be used in a practical application and to further in- note that the OLS estimator appears only in the top- vestigate the MLE’s performance vis-`a-visother pre- 2 left plot of Figure 2 becauseσ ˆOLS is undefined when viously proposed methods for variance estimation, we Running heading title breaks the line conducted a real data analysis involving publicly avail- that the genes under consideration were already iden- able gene expression data and single nucelotide poly- tified as significant in another study, we take shorter 2 morphism (SNP) data. CIs to be an indicator of more efficient inference on r0. 2 For each gene, we calculated two estimates for r0: (i) 6.1 Background The MLEr ˆ2 and (ii) the method-of-moments esti- 2 mater ˆMM proposed in (Dicker, 2014). After comput- In genetics, heritability measures the fraction of vari- ing these estimates, we then constructed three Wald- ability in an observed trait (the phenotype) that can 2 2 type 95% CIs for the r0 of each gene using: (i)r ˆ and be explained by genetic variability (the genotype). In the asymptotic standard error suggested by Theorem the context of linear models like (1)–(2), it is natu- 2 (referred to as MLE-FE); (ii)r ˆ2 and the asymptotic 2 ral to identify heritability with the parameter r0 = standard errors corresponding to a Gaussian random- 2 2 η0/(η0 + 1) (De los Campos et al., 2015; Yang et al., 2 effects model, where β ∼ N{0, (τ0 /p)I} (MLE-RE); 2010). More specifically, if yi is the i-th subject’s phe- 2 and (iii)r ˆMM and the asymptotic standard error given notype and xij ∈ {0, 1, 2} is a measure of their geno- by Proposition 2 of (Dicker, 2014) (MM). Summary type at location j (i.e. 
the minor allele count at a given location in the i-th subject's DNA), then r^2 measures the fraction of variability in the phenotype y_i that is explained by the genotype (x_ij), j = 1, ..., p. Heritability is often studied in relation to easily observable phenotypes, such as human height (Yang et al., 2010). However, other important work has focused on more basic phenotypes, such as gene expression levels, which may be measured by mRNA abundance (Stranger et al., 2007, 2012).

6.2 Analysis and Results

This data analysis was based on publicly available SNP data from the International HapMap Project and gene expression data collected by Stranger and coauthors^5 on n = 80 individuals from the Han Chinese HapMap population. For each of 100 different genes, we estimated the heritability r_0^2 of the gene's expression level using genotype data from nearby SNPs, and then computed corresponding confidence intervals (CIs). In other words, we estimated r_0^2 separately for 100 different genes, based on a different collection of SNPs (ranging from p = 17 to p = 190) for each gene, and then computed CIs for each r_0^2. The 100 genes were selected from genes that Stranger et al. (2007) had previously identified as having significant association between gene expression levels and nearby SNPs in the Japanese HapMap population. More detailed preprocessing steps are described in the Supplementary Material.^6 The main objective of this data analysis is to compare the lengths of the various CIs for r_0^2.

Statistics from our analysis are reported in Table 1. The results in Table 1 indicate that the MLE-FE intervals are typically the shortest, which potentially points towards the improved power of this approach.

Table 1: Mean and empirical SD (in parentheses) of 95% CI length and corresponding estimates r̂^2, computed over 100 genes of interest.

              MLE-FE        MLE-RE        MM
  CI length   0.43 (0.13)   0.47 (0.14)   0.45 (0.26)
  r̂^2        0.21 (0.21)   0.21 (0.21)   0.22 (0.23)

^5 Accessed at http://hapmap.ncbi.nlm.nih.gov/ and http://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-264/

^6 We emphasize that the entries of X are discrete and thus violate the normality assumptions required for most of the theoretical results in this paper, even after preprocessing. However, additional experiments with simulated data (not reported here) suggest that many of these results remain valid for iid discrete x_ij. Pursuing theoretical support for these observations is of great interest.

7 DISCUSSION

The MLEs studied in this paper outperform a variety of other previously proposed methods for variance parameter estimation in high-dimensional linear models. Additionally, our methods highlight connections between fixed- and random-effects models in high dimensions, which may form the basis for further research in this area. Investigating the extent to which the distributional assumptions (2) may be relaxed will be important for determining the breadth of applicability of the results in this paper, and that of other "structured X" methods for high-dimensional data analysis. In this paper, the key implication of assumption (2) is the invariance property (5). Two possible directions for relaxing (5) are: (i) replacing the identity (5) with an approximate identity (as could be satisfied by certain classes of sub-Gaussian random vectors (x_i1, ..., x_ip)^T ∈ R^p) and (ii) assuming that (5) holds only for U belonging to certain subsets G of p × p orthogonal matrices, such as G = {p × p permutation matrices} (this relaxation seems relevant for iid, but not necessarily Gaussian, x_ij).

Acknowledgements

L.H. Dicker's work on this project was supported by NSF Grant DMS-1208785 and NSF CAREER Award DMS-1454817.
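The contrast drawn in the Discussion between full orthogonal invariance and permutation invariance can be illustrated numerically. The sketch below is hypothetical illustration, not code from the paper; it takes the invariance property (5) to be the distributional identity Ux =_d x for orthogonal U, consistent with the surrounding text. An iid Gaussian vector is invariant under any orthogonal rotation, while an iid Rademacher (±1) vector is invariant only under permutations:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n_draws = 50, 20000

# A random p x p orthogonal matrix U, obtained from a QR decomposition.
U, _ = np.linalg.qr(rng.standard_normal((p, p)))

# iid Gaussian coordinates: x and Ux have the same distribution, so any
# distributional summary (here, the mean absolute coordinate) should agree.
x_gauss = rng.standard_normal((n_draws, p))
x_gauss_rot = x_gauss @ U.T
print(np.abs(x_gauss).mean(), np.abs(x_gauss_rot).mean())  # both near sqrt(2/pi) ~ 0.80

# iid Rademacher (+/-1) coordinates: a general rotation changes the
# coordinate distribution (|x_j| = 1 exactly before rotation, not after) ...
x_rad = rng.choice([-1.0, 1.0], size=(n_draws, p))
x_rad_rot = x_rad @ U.T
print(np.abs(x_rad).std(), np.abs(x_rad_rot).std())  # 0.0 vs. roughly 0.6

# ... while a permutation of the coordinates preserves the iid law exactly,
# matching the relaxation G = {p x p permutation matrices}.
x_rad_perm = x_rad[:, rng.permutation(p)]
print(np.abs(x_rad_perm).std())  # exactly 0.0
```

The rotated Rademacher coordinates are approximately Gaussian by the central limit theorem, which is one way to see why only an approximate version of (5) could hold for general sub-Gaussian designs.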

References

Bai, Z. and Silverstein, J. W. Spectral Analysis of Large Dimensional Random Matrices. Springer, 2010.

Bai, Z. D., Miao, B., and Yao, J. F. Convergence rates of spectral distributions of large sample covariance matrices. SIAM J. Matrix Anal. A., 25:105–127, 2003.

Bayati, M., Erdogdu, M. A., and Montanari, A. Estimating the LASSO risk and noise level. NIPS 2013, 2013.

Bühlmann, P. and Van de Geer, S. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, 2011.

Chatterjee, S. and Jafarov, J. Prediction error of cross-validated Lasso. arXiv preprint arXiv:1502.06291, 2015.

Crump, S. L. The estimation of variance components in analysis of variance. Biometrics Bull., 2:7–11, 1946.

De los Campos, G., Sorensen, D., and Gianola, D. Genomic heritability: What is it? PLoS Genet., 11:e1005048, 2015.

Demidenko, E. Mixed Models: Theory and Applications with R. Wiley, 2013.

Dicker, L. H. Variance estimation in high-dimensional linear models. Biometrika, 101:269–284, 2014.

Dicker, L. H. Ridge regression and asymptotic minimax estimation over spheres of growing dimension. Bernoulli, 22(1):1–37, 2016.

Dicker, L. H. and Erdogdu, M. A. Flexible results for quadratic forms with applications to variance components estimation. Ann. Stat., 2016. To appear.

Donoho, D. L., Maleki, A., and Montanari, A. Message-passing algorithms for compressed sensing. P. Natl. Acad. Sci. USA, 106:18914–18919, 2009.

Fan, J., Guo, S., and Hao, N. Variance estimation using refitted cross-validation in ultrahigh dimensional regression. J. Roy. Stat. Soc. B, 74:37–65, 2012.

Hartley, H. O. and Rao, J. N. K. Maximum-likelihood estimation for the mixed analysis of variance model. Biometrika, 54:93–108, 1967.

Janson, L., Barber, R. F., and Candès, E. EigenPrism: Inference for high-dimensional signal-to-noise ratios. arXiv preprint arXiv:1505.02097, 2015.

Javanmard, A. and Montanari, A. Confidence intervals and hypothesis testing for high-dimensional regression. J. Mach. Learn. Res., 15:2869–2909, 2014.

Jiang, J. REML estimation: Asymptotic behavior and related topics. Ann. Stat., 24:255–286, 1996.

Korada, S. B. and Montanari, A. Applications of the Lindeberg principle in communications and statistical learning. IEEE T. Inform. Theory, 57:2440–2450, 2011.

Lehmann, E. L. and Casella, G. Theory of Point Estimation. Springer, 2nd edition, 1998.

Listgarten, J., Lippert, C., Kadie, C. M., Davidson, R. I., Eskin, E., and Heckerman, D. Improved linear mixed models for genome-wide association studies. Nat. Methods, 9:525–526, 2012.

Loh, P.-R., Tucker, G., Bulik-Sullivan, B. K., Vilhjalmsson, B. J., Finucane, H. K., Salem, R. M., Chasman, D. I., Ridker, P. M., Neale, B. M., Berger, B., Patterson, N., and Price, A. L. Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat. Genet., 47:285–290, 2015.

Mallows, C. L. Some comments on Cp. Technometrics, 15:661–675, 1973.

Marčenko, V. A. and Pastur, L. A. Distribution of eigenvalues for some sets of random matrices. Math. U.S.S.R. Sb., 1:457–483, 1967.

Rangan, S. Generalized approximate message passing for estimation with random linear mixing. In Proc. IEEE Int. Symp. Inf. Theory (ISIT), pp. 2168–2172. IEEE, 2011.

Rangan, S., Schniter, P., and Fletcher, A. On the convergence of approximate message passing with arbitrary matrices. In Proc. IEEE Int. Symp. Inf. Theory (ISIT), pp. 236–240. IEEE, 2014.

Reid, S., Tibshirani, R., and Friedman, J. A study of error variance estimation in lasso regression. arXiv preprint arXiv:1311.5274, 2013.

Satterthwaite, F. E. An approximate distribution of estimates of variance components. Biometrics Bull., 2:110–114, 1946.

Searle, S. R., Casella, G., and McCulloch, C. E. Variance Components. Wiley, 1992.

Stranger, B. E., Nica, A. C., Forrest, M. S., Dimas, A. S., Bird, C. P., Beazley, C., Ingle, C. E., Dunning, M., Flicek, P., Koller, D., Montgomery, S., Tavaré, S., Deloukas, P., and Dermitzakis, E. T. Population genomics of human gene expression. Nat. Genet., 39:1217–1224, 2007.

Stranger, B. E., Montgomery, S. B., Dimas, A. S., Parts, L., Stegle, O., Ingle, C. E., Sekowska, M., Smith, G. D., Evans, D., Gutierrez-Arcelus, M., Price, A. L., Raj, T., Nisbett, J., Nica, A. C., Beazley, C., Durbin, R., Deloukas, P., and Dermitzakis, E. T. Patterns of cis regulatory variation in diverse human populations. PLoS Genet., 8:e1002639, 2012.

Sun, T. and Zhang, C.-H. Scaled sparse linear regression. Biometrika, 99:879–898, 2012.

Van der Vaart, A. W. Asymptotic Statistics. Cambridge University Press, 2000.

Vila, J. P. and Schniter, P. Expectation-maximization Gaussian-mixture approximate message passing. IEEE T. Signal Proces., 61:4658–4672, 2013.

Yang, J., Benyamin, B., McEvoy, B. P., Gordon, S., Henders, A. K., Nyholt, D. R., Madden, P. A., Heath, A. C., Martin, N. G., Montgomery, G. W., Goddard, M. E., and Visscher, P. M. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet., 42:565–569, 2010.