Maximum Likelihood for Variance Estimation in High-Dimensional Linear Models
Lee H. Dicker (Rutgers University), Murat A. Erdogdu (Stanford University)

Abstract

We study maximum likelihood estimators (MLEs) for the residual variance, the signal-to-noise ratio, and other variance parameters in high-dimensional linear models. These parameters are essential in many statistical applications involving regression diagnostics, inference, tuning parameter selection for high-dimensional regression, and other applications, including genetics. The estimators that we study are not new, and have been widely used for variance component estimation in linear random-effects models. However, our analysis is new and it implies that the MLEs, which were devised for random-effects models, may also perform very well in high-dimensional linear models with fixed-effects, which are more commonly studied in some areas of high-dimensional statistics. The MLEs are shown to be consistent and asymptotically normal in fixed-effects models with random design, in asymptotic settings where the number of predictors (p) is proportional to the number of observations (n). Moreover, the estimators' asymptotic variance can be given explicitly in terms of moments of the Marčenko-Pastur distribution. A variety of analytical and empirical results show that the MLEs outperform other, previously proposed estimators for variance parameters in high-dimensional linear models with fixed-effects. More broadly, the results in this paper illustrate a strategy for drawing connections between fixed- and random-effects models in high dimensions, which may be useful in other applications.

Appearing in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS) 2016, Cadiz, Spain. JMLR: W&CP volume 41. Copyright 2016 by the authors.

1 INTRODUCTION

Variance parameters, including the residual variance, the proportion of explained variation, and, under some interpretations (e.g. Dicker, 2014; Hartley & Rao, 1967; Janson et al., 2015), the signal-to-noise ratio, are important parameters in many statistical models. While they are often not the primary focus of a given data analysis or prediction routine, variance parameters are fundamental for a variety of tasks, including:

(i) Regression diagnostics, e.g. risk estimation (Bayati et al., 2013; Mallows, 1973).
(ii) Inference, e.g. hypothesis testing and constructing confidence intervals (Fan et al., 2012; Javanmard & Montanari, 2014; Reid et al., 2013).
(iii) Optimally tuning other methods, e.g. tuning parameter selection for high-dimensional regression (Sun & Zhang, 2012).
(iv) Other applications, e.g. genetics (De los Campos et al., 2015; Listgarten et al., 2012; Loh et al., 2015; Yang et al., 2010).

This paper focuses on variance estimation in high-dimensional linear models. We study maximum likelihood-based estimators for the residual variance and other related variance parameters. The results in this paper show that, under appropriate conditions, the maximum likelihood estimators (MLEs) outperform several other previously proposed variance estimators for high-dimensional linear models, in theory and in numerical examples, with both real and simulated data.

The MLEs studied here are not new. Indeed, they are widely used for variance components estimation in linear random-effects models (e.g. Searle et al., 1992). However, the analysis in this paper is primarily concerned with the performance of the MLEs in fixed-effects models, which are more commonly studied in some areas of modern high-dimensional statistics (e.g. Bühlmann & Van de Geer, 2011). Thus, from one perspective, this paper may be viewed as an article on model misspecification: We show that a classical estimator for random-effects models may also be used effectively in fixed-effects models. More broadly, the results in this paper illustrate a strategy for drawing connections between fixed- and random-effects models in high dimensions, which may be useful in other applications; for instance, a similar strategy has been employed in Dicker's (2016) analysis of ridge regression.

1.1 Related Work

There is an abundant literature on variance components estimation for random-effects models, going back to at least the 1940s (Crump, 1946; Satterthwaite, 1946). More recently, a substantial literature has developed on the use of random-effects models in high-throughput genetics (e.g. De los Campos et al., 2015; Listgarten et al., 2012; Loh et al., 2015; Yang et al., 2010). Maximum likelihood estimators for variance parameters, like those studied in this paper, are widely used in this research area. A variety of questions about the use of MLEs for variance parameters in modern genetics remain unanswered, including fundamental questions about whether fixed- or random-effects models are more appropriate (De los Campos et al., 2015). Our work here focuses primarily on fixed-effects models, and thus differs from much of this recent work on genetics. However, Theorem 2 (see, in particular, the discussion in Section 4.2) and the data analysis in Section 6 below suggest that fixed-effects analyses could possibly offer improved power over random-effects approaches to variance estimation, in some settings.

Switching focus to high-dimensional linear models with fixed-effects, previous work on variance estimation has typically been built on one of two sets of assumptions: either (i) the underlying regression parameter [β in the model (1) below] is highly structured (e.g. sparse) or (ii) the design matrix $X = (x_{ij})$ is highly structured [e.g. $x_{ij}$ are iid $N(0, 1)$]. Notable work on variance estimation under the "structured β" assumption includes Chatterjee & Jafarov (2015), Fan et al. (2012), and Sun & Zhang (2012). Bayati et al. (2013), Dicker (2014), and Janson et al. (2015) have studied variance estimation in the "structured X" setting. This paper also focuses on the structured X setting. As is characteristic of the structured X setting, many of the results in this paper require strong assumptions on X; however, none of our results require any sparsity assumptions on β.

Figure 1 illustrates some potential advantages of the structured X approach. In particular, it shows that structured β methods for estimating the residual variance $\sigma_0^2$ can be biased if β is not extremely sparse, while the structured X methods are effective regardless of sparsity. Figure 1 was generated from datasets where the design matrix X was highly structured, i.e. $x_{ij} \sim N(0, 1)$ were all iid (a detailed description of the numerical simulations used to generate Figure 1 may be found in the Supplementary Material). In general, structured X methods work best when the entries of X are uncorrelated. Indeed, by generating X with highly correlated columns and taking β to be very sparse, it is easy to generate plots similar to those in Figure 1, where the structured X methods are badly biased and the structured β methods perform well. On the other hand, if the design matrix X is correlated, it may be possible to improve the performance of structured X methods by "decorrelating" X (i.e. right-multiplying X by a suitable positive definite matrix); this is not pursued in detail here, but may be an interesting area for future research.

The method labeled "MLE" in Figure 1 is the main focus of this paper. The other methods depicted in Figure 1 that rely on structured X are "MM," "EigenPrism," and "AMP." MM is a method-of-moments estimator for $\sigma_0^2$ proposed by Dicker (2014); EigenPrism was proposed by Janson et al. (2015) and is the solution to a convex optimization problem; AMP is a method for estimating $\sigma_0^2$ based on approximate message passing algorithms and was proposed by Bayati et al. (2013). In Figure 1, each of the structured X methods is evidently unbiased. Additionally, it is clear that the variability of MLE is uniformly smaller than that of MM and EigenPrism (detailed results are reported in Table S1 of the Supplementary Material). The AMP method is unique among the four structured X methods, because it is the only one that can adapt to sparsity in β; consequently, while the variability of MLE is smaller than that of AMP when β is not sparse (e.g. when sparsity is 40%), the AMP estimator has smaller variance when β is sparse. This is certainly an attractive property of the AMP estimator. However, little is known about the asymptotic distribution of the AMP estimator. By contrast, one important feature of the MLE is that it is asymptotically normal in high dimensions and its asymptotic variance is given by the simple formula in Theorem 2. This is a useful property for performing inference on $\sigma_0^2$ and related parameters, and for better understanding the asymptotic performance of the MLE.

1.2 Overview of the Paper

Section 2 covers preliminaries. We introduce the statistical model and define a bivariate MLE for the residual variance and the signal-to-noise ratio; this is the main estimator of interest. We also give some additional background on the "structured X" model that is studied here.

In Section 3, we present a coupling argument, which

[Figure 1 placeholder: four panels titled "10%-sparse β", "40%-sparse β", "90%-sparse β", and "99.8%-sparse β". In each panel the vertical axis ("Methods") lists MLE, MM, EigenPrism, AMP, Scaled-Lasso, and RCV-Lasso, and the horizontal axis shows the estimates $\hat{\sigma}^2$ on a scale from 0.0 to 5.0.]

Figure 1: Estimates of the residual variance $\sigma_0^2$ from 500 independent datasets.
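The simulation design described for Figure 1 (iid N(0, 1) design entries and a fixed β with a given sparsity level) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' simulation code, which is described in the Supplementary Material. The sketch assumes the standard Gaussian random-effects working likelihood, y | X ~ N(0, σ²(η²/p · XXᵀ + I)), where η² plays the role of the signal-to-noise ratio; the paper's own definition of the MLE is given in Section 2 and may differ in detail. The function names `simulate` and `random_effects_mle`, the equal-magnitude nonzero coefficients, and the grid search over η² are all hypothetical choices made for illustration.

```python
import numpy as np


def simulate(n, p, sparsity, sigma2, snr, rng):
    """Draw one dataset y = X beta + eps with iid N(0, 1) design entries.

    beta is fixed (not random): a fraction `sparsity` of its entries is zero,
    and it is scaled so that ||beta||^2 / sigma2 equals `snr` (an assumption
    about how the signal-to-noise ratio is defined).
    """
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    k = max(1, int(round((1.0 - sparsity) * p)))  # number of nonzero entries
    beta[:k] = 1.0
    beta *= np.sqrt(snr * sigma2) / np.linalg.norm(beta)
    y = X @ beta + np.sqrt(sigma2) * rng.standard_normal(n)
    return X, y


def random_effects_mle(X, y, eta2_grid):
    """Maximize the assumed random-effects likelihood over (sigma^2, eta^2).

    For each eta^2 on the grid, sigma^2 is profiled out in closed form;
    the eta^2 with the largest profile log-likelihood is returned.
    """
    n, p = X.shape
    lam, U = np.linalg.eigh(X @ X.T / p)   # eigen-decomposition of X X^T / p
    lam = np.clip(lam, 0.0, None)          # guard against tiny negative values
    z2 = (U.T @ y) ** 2                    # squared rotated responses
    d = 1.0 + np.outer(eta2_grid, lam)     # shape (len(eta2_grid), n)
    sigma2_hat = (z2 / d).sum(axis=1) / n  # profiled sigma^2 for each eta^2
    profile = -0.5 * np.log(sigma2_hat) - 0.5 * np.log(d).mean(axis=1)
    best = int(np.argmax(profile))
    return sigma2_hat[best], eta2_grid[best]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, y = simulate(n=500, p=1000, sparsity=0.4, sigma2=1.0, snr=1.0, rng=rng)
    s2, e2 = random_effects_mle(X, y, np.linspace(0.01, 10.0, 400))
    print(f"sigma^2 estimate: {s2:.3f}, eta^2 estimate: {e2:.3f}")
```

Working in the eigenbasis of XXᵀ/p reduces each likelihood evaluation to elementwise operations, so a grid search over η² is cheap; a one-dimensional numerical optimizer could be substituted for the grid.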