Maximum Likelihood for Variance Estimation in High-Dimensional Linear Models

Lee H. Dicker                      Murat A. Erdogdu
Rutgers University                 Stanford University

Abstract

We study maximum likelihood estimators (MLEs) for the residual variance, the signal-to-noise ratio, and other variance parameters in high-dimensional linear models. These parameters are essential in many statistical applications involving regression diagnostics, inference, tuning parameter selection for high-dimensional regression, and other applications, including genetics. The estimators that we study are not new, and have been widely used for variance component estimation in linear random-effects models. However, our analysis is new and it implies that the MLEs, which were devised for random-effects models, may also perform very well in high-dimensional linear models with fixed-effects, which are more commonly studied in some areas of high-dimensional statistics. The MLEs are shown to be consistent and asymptotically normal in fixed-effects models with random design, in asymptotic settings where the number of predictors (p) is proportional to the number of observations (n). Moreover, the estimators' asymptotic variance can be given explicitly in terms of moments of the Marčenko-Pastur distribution. A variety of analytical and empirical results show that the MLEs outperform other, previously proposed estimators for variance parameters in high-dimensional linear models with fixed-effects. More broadly, the results in this paper illustrate a strategy for drawing connections between fixed- and random-effects models in high dimensions, which may be useful in other applications.

Appearing in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS) 2016, Cadiz, Spain. JMLR: W&CP volume 41. Copyright 2016 by the authors.

1 INTRODUCTION

Variance parameters — including the residual variance, the proportion of explained variation, and, under some interpretations (e.g. Dicker, 2014; Hartley & Rao, 1967; Janson et al., 2015), the signal-to-noise ratio — are important parameters in many statistical models. While they are often not the primary focus of a given data analysis or prediction routine, variance parameters are fundamental for a variety of tasks, including:

(i) Regression diagnostics, e.g. risk estimation (Bayati et al., 2013; Mallows, 1973).
(ii) Inference, e.g. hypothesis testing and constructing confidence intervals (Fan et al., 2012; Javanmard & Montanari, 2014; Reid et al., 2013).
(iii) Optimally tuning other methods, e.g. tuning parameter selection for high-dimensional regression (Sun & Zhang, 2012).
(iv) Other applications, e.g. genetics (De los Campos et al., 2015; Listgarten et al., 2012; Loh et al., 2015; Yang et al., 2010).

This paper focuses on variance estimation in high-dimensional linear models. We study maximum likelihood-based estimators for the residual variance and other related variance parameters. The results in this paper show that — under appropriate conditions — the maximum likelihood estimators (MLEs) outperform several other previously proposed variance estimators for high-dimensional linear models, in theory and in numerical examples, with both real and simulated data.

The MLEs studied here are not new. Indeed, they are widely used for variance components estimation in linear random-effects models (e.g. Searle et al., 1992). However, the analysis in this paper is primarily concerned with the performance of the MLEs in fixed-effects models, which are more commonly studied in some areas of modern high-dimensional statistics (e.g. Bühlmann & Van de Geer, 2011). Thus, from one perspective, this paper may be viewed as an article on model misspecification: We show that a classical estimator for random-effects models may also be used effectively in fixed-effects models. More broadly, the results in this paper illustrate a strategy for drawing connections between fixed- and random-effects models in high dimensions, which may be useful in other applications; for instance, a similar strategy has been employed in Dicker's (2016) analysis of ridge regression.

1.1 Related Work

There is an abundant literature on variance components estimation for random-effects models, going back to at least the 1940s (Crump, 1946; Satterthwaite, 1946). More recently, a substantial literature has developed on the use of random-effects models in high-throughput genetics (e.g. De los Campos et al., 2015; Listgarten et al., 2012; Loh et al., 2015; Yang et al., 2010). Maximum likelihood estimators for variance parameters, like those studied in this paper, are widely used in this research area. A variety of questions about the use of MLEs for variance parameters in modern genetics remain unanswered, including fundamental questions about whether fixed- or random-effects models are more appropriate (De los Campos et al., 2015). Our work here focuses primarily on fixed-effects models, and thus differs from much of this recent work on genetics. However, Theorem 2 (see, in particular, the discussion in Section 4.2) and the data analysis in Section 6 below suggest that fixed-effects analyses could possibly offer improved power over random-effects approaches to variance estimation, in some settings.

Switching focus to high-dimensional linear models with fixed-effects, previous work on variance estimation has typically been built on one of two sets of assumptions: Either (i) the underlying regression parameter [β in the model (1) below] is highly structured (e.g. sparse) or (ii) the design matrix X = (x_{ij}) is highly structured [e.g. the x_{ij} are iid N(0, 1)]. Notable work on variance estimation under the "structured β" assumption includes (Chatterjee & Jafarov, 2015; Fan et al., 2012; Sun & Zhang, 2012). Bayati et al. (2013), Dicker (2014), and Janson et al. (2015) have studied variance estimation in the "structured X" setting. This paper also focuses on the structured X setting. As is characteristic of the structured X setting, many of the results in this paper require strong assumptions on X; however, none of our results require any sparsity assumptions on β.

Figure 1 illustrates some potential advantages of the structured X approach. In particular, it shows that structured β methods for estimating the residual variance σ₀² can be biased, if β is not extremely sparse, while the structured X methods are effective regardless of sparsity. Figure 1 was generated from datasets where the design matrix X was highly structured, i.e. the x_{ij} ∼ N(0, 1) were all iid (a detailed description of the numerical simulations used to generate Figure 1 may be found in the Supplementary Material). In general, structured X methods work best when the entries of X are uncorrelated. Indeed, by generating X with highly correlated columns and taking β to be very sparse, it is easy to generate plots similar to those in Figure 1, where the structured X methods are badly biased and the structured β methods perform well. On the other hand, if the design matrix X is correlated, it may be possible to improve the performance of structured X methods by "decorrelating" X (i.e. right-multiplying X by a suitable positive definite matrix); this is not pursued in detail here, but may be an interesting area for future research.

The method labeled "MLE" in Figure 1 is the main focus of this paper. The other methods depicted in Figure 1 that rely on structured X are "MM," "EigenPrism," and "AMP." MM is a method-of-moments estimator for σ₀² proposed in (Dicker, 2014); EigenPrism was proposed in (Janson et al., 2015) and is the solution to a convex optimization problem; AMP is a method for estimating σ₀² based on approximate message passing algorithms and was proposed in (Bayati et al., 2013). In Figure 1, each of the structured X methods is evidently unbiased. Additionally, it is clear that the variability of MLE is uniformly smaller than that of MM and EigenPrism (detailed results are reported in Table S1 of the Supplementary Material). The AMP method is unique among the four structured X methods, because it is the only one that can adapt to sparsity in β; consequently, while the variability of MLE is smaller than that of AMP when β is not sparse (e.g. when sparsity is 40%), the AMP estimator has smaller variance when β is sparse. This is certainly an attractive property of the AMP estimator. However, little is known about the asymptotic distribution of the AMP estimator. By contrast, one important feature of the MLE is that it is asymptotically normal in high dimensions and its asymptotic variance is given by the simple formula in Theorem 2. This is a useful property for performing inference on σ₀² and related parameters, and for better understanding the asymptotic performance of the MLE.

1.2 Overview of the Paper

Section 2 covers preliminaries. We introduce the statistical model and define a bivariate MLE for the residual variance and the signal-to-noise ratio — this is the main estimator of interest. We also give some additional background on the "structured X" model that is studied here.

In Section 3, we present a coupling argument, which


Figure 1: Estimates of the residual variance σ₀² from 500 independent datasets. Structured X simulations: x_{ij} ∼ N(0, 1) are iid. Parameter values: σ₀² = 1, η₀² = 4, n = 500, p = 1000. "α%-sparse" indicates that (α/100) × p coordinates of β ∈ ℝᵖ are set equal to 0. Structured X estimators: MLE, MM (Dicker, 2014), EigenPrism (Janson et al., 2015), AMP (Bayati et al., 2013). Structured β estimators: Scaled-Lasso (Sun & Zhang, 2012), RCV-Lasso (Fan et al., 2012). Detailed description of settings and results may be found in the Supplementary Material.
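As a concrete reference for the simulation design described in the caption, the following sketch generates one such dataset. This is our own illustration, not the authors' simulation code; in particular, spreading the signal equally over the nonzero coordinates of β is an assumption here (the exact configuration used in the paper is in the Supplementary Material).

```python
import numpy as np

def make_dataset(n=500, p=1000, sigma2=1.0, eta2=4.0, sparsity=0.4, seed=0):
    """Generate one dataset from the model y = X beta + eps with iid
    N(0, 1) design; 'sparsity' is the fraction of coordinates of beta
    set to zero, matching the "alpha%-sparse" convention in Figure 1."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    tau2 = eta2 * sigma2                 # signal strength tau_0^2
    k = int(round((1.0 - sparsity) * p)) # number of nonzero coordinates
    beta = np.zeros(p)
    beta[:k] = np.sqrt(tau2 / k)         # equal-magnitude nonzeros (assumed)
    eps = rng.standard_normal(n) * np.sqrt(sigma2)
    y = X @ beta + eps
    return y, X, beta
```

With the defaults above, ‖β‖² equals the caption's τ₀² = η₀²σ₀² = 4 regardless of the sparsity level.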

is the key to relating the fixed-effects model of interest and the random-effects model that motivates the MLEs studied here.

Basic asymptotic properties of the MLEs are described in Section 4. Consistency and asymptotic normality results (given in Section 4.1) rely heavily on the coupling argument from Section 3: Theorems 1–2 follow by coupling the fixed-effects model with random predictors (the "structured X" model) to a random-effects model and then appealing to existing results for random-effects models and quadratic forms.

In Section 5, we use results from random matrix theory to derive explicit formulas for the asymptotic variance of the MLEs, which are determined by moments of the Marčenko-Pastur distribution. Using these formulas, we analytically compare the asymptotic variance of the MLEs to that of other estimators.

The results of a real data analysis involving gene expression data and single nucleotide polymorphism (SNP) data are described in Section 6. The results illustrate some potential benefits of our results in an important practical example.

A concluding discussion may be found in Section 7.

2 PRELIMINARIES

Consider a linear model, where outcomes y = (y₁, ..., yₙ)⊤ ∈ ℝⁿ are related to an n × p matrix of predictors X = (x_{ij})_{1≤i≤n; 1≤j≤p} via the equation

    y = Xβ + ε,  (1)

where ε = (ε₁, ..., εₙ)⊤ ∈ ℝⁿ is a random error vector and β = (β₁, ..., βₚ)⊤ ∈ ℝᵖ is an unknown p-dimensional parameter. The observed data consist of (y, X). In this paper, we make the following distributional assumptions:

    εᵢ ∼ N(0, σ₀²),  x_{ij} ∼ N(0, 1),  1 ≤ i ≤ n, 1 ≤ j ≤ p, are all independent.  (2)

The parameter σ₀² > 0 is the residual variance. Estimating σ₀² when p and n are large is one of the main objectives of the paper.¹

The distributional assumptions (2) are strong. However, others following the "structured X" approach to variance estimation have made the same assumptions (e.g. Bayati et al., 2013; Dicker, 2014; Janson et al., 2015). There is also a substantial literature on estimating β in high dimensions under similar distributional assumptions; see, for instance, research on approximate message passing (AMP) algorithms (Donoho et al., 2009; Rangan, 2011; Vila & Schniter, 2013). In the AMP literature, there has been some success at relaxing the Gaussian assumption (2) (e.g. Korada & Montanari, 2011; Rangan et al., 2014). Relaxing (2) for structured X approaches to variance estimation is an important topic for future research. (See also the discussion in Section 7.)

¹ If p ≤ n and X has full rank, then the residual sum of squares σ̂²_OLS = (n − p)⁻¹ ‖y − Xβ̂_OLS‖² may be used to estimate σ₀², where β̂_OLS = (X⊤X)⁻¹X⊤y. Thus, estimating σ₀² when p > n is more interesting and may be considered the main focus of this work (however, the results in Section 5.2 imply that the MLE σ̂² outperforms σ̂²_OLS even when p < n).
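The baseline estimator from Footnote 1 is straightforward to compute; a minimal sketch of our own (for the p < n, full-rank case only):

```python
import numpy as np

def sigma2_ols(y, X):
    """Residual-variance estimator from Footnote 1:
    sigma^2_OLS = ||y - X beta_OLS||^2 / (n - p), requiring p < n."""
    n, p = X.shape
    assert p < n, "OLS residual variance needs more observations than predictors"
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_ols
    return float(resid @ resid) / (n - p)
```

When p > n this estimator is undefined, which is one motivation for the likelihood-based approach studied in the paper.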

To estimate the residual variance, we introduce two auxiliary parameters that are closely related to σ₀²: the signal strength, τ₀² = ‖β‖², and the signal-to-noise ratio, η₀² = τ₀²/σ₀² (here and throughout the paper, ‖·‖ denotes the ℓ²-norm). Now define the bivariate parameter θ₀ = (σ₀², η₀²). Consider the estimator

    θ̂ = argmax_{σ², η² ≥ 0} ℓ(θ),  (3)

where θ = (σ², η²) and

    ℓ(θ) = −(1/2) log(σ²) − (1/(2n)) log det((η²/p) XX⊤ + I) − (1/(2σ²n)) y⊤ ((η²/p) XX⊤ + I)⁻¹ y.  (4)

The estimator θ̂ is the main focus of this paper.² It is the MLE for θ₀ under a linear random-effects model, where the regression parameter β ∼ N{0, (τ₀²/p)I} is Gaussian and independent of ε and X; indeed, under this random-effects model, the distribution of y | X ∼ N{0, (τ₀²/p)XX⊤ + σ₀²I} is Gaussian.

² If ℓ(θ) has multiple maximizers, then use any pre-determined rule to select θ̂. Maximizing ℓ(θ) is a nonconvex problem; however, it has been extremely well-studied and is straightforward, e.g. (Demidenko, 2013).

3 A COUPLING ARGUMENT

Theoretical properties of θ̂ have already been studied extensively in the literature, within the context of random-effects models (e.g. Dicker & Erdogdu, 2016; Hartley & Rao, 1967; Jiang, 1996; Searle et al., 1992). In this paper we are primarily concerned with the performance of θ̂ in the fixed-effects model (1)–(2), where β is non-random. The main objective of this section is to explain why one might expect θ̂ to perform well in the fixed-effects model. To do this, we show that the model (1)–(2) can be coupled to a random-effects model. The key points of the coupling argument are that the distribution of the design matrix X and the estimator θ̂ are both invariant under orthogonal transformations, i.e.

    X =_D XU⊤,  (5)
    θ̂(y, X) = θ̂(y, XU⊤)  (6)

for any p × p orthogonal matrix U ("=_D" denotes equality in distribution). The argument is given in more detail below.

Let U be an independent uniformly distributed p × p orthogonal matrix, define β̃ = Uβ, and let

    ỹ = Xβ̃ + ε.  (7)

Then (ỹ, X) may be viewed as data drawn from a random-effects linear model, with the regression parameter β̃ ∼ uniform{S^{p−1}(τ₀²)}, the uniform distribution on the sphere in ℝᵖ of radius τ₀, S^{p−1}(τ₀²) = {u ∈ ℝᵖ; ‖u‖² = τ₀²}. Next define the MLE based on (7),

    θ̃ = argmax_{σ², η² ≥ 0} ℓ̃(θ),

where ℓ̃(θ) is defined just as in (4), except that y is replaced by ỹ. Then the distribution of θ̂ = θ̂(y, X) is the exact same as the distribution of θ̃ = θ̃(ỹ, X); more precisely, for any Borel set B ⊆ ℝ², we have

    P{θ̂(y, X) ∈ B} = P{θ̂(XU⊤β̃ + ε, X) ∈ B}
                    = P{θ̂(XU⊤β̃ + ε, XU⊤) ∈ B}
                    = P{θ̂(Xβ̃ + ε, X) ∈ B}
                    = P{θ̃(ỹ, X) ∈ B},

where the second equality holds because of (6) and the third equality follows from (5). Thus, the distributional properties of θ̂ — the estimator based on the original data from the fixed-effects model — are the exact same as the distributional properties of θ̃ — the estimator based on the random-effects data, i.e.

    θ̂ =_D θ̃.  (8)

The distributional identity (8) links the fixed-effects model (1)–(2) with the random-effects model (7). However, one remaining issue is that the random-effects vector β̃ = (β̃₁, ..., β̃ₚ)⊤ ∈ ℝᵖ has correlated components. While most of the existing work on random-effects models applies to models with independent random-effects, Dicker & Erdogdu (2016) recently derived concentration bounds for variance components estimators in models with correlated random-effects, which can be applied in the present setting. Dicker & Erdogdu's bounds rely on the existence of a tight independent coupling; this means finding a random vector β* with independent components, such that ‖β̃ − β*‖ is small with high probability. Such a coupling is described in the following paragraph.

Let z ∼ N(0, I) be an independent p-dimensional Gaussian random vector. Without loss of generality, we may assume that

    β̃ = τ₀z/‖z‖ ∼ uniform{S^{p−1}(τ₀²)}.

Now define β* = (τ₀/p^{1/2})z ∼ N{0, (τ₀²/p)I}. Then β* has independent components and, when p is large, ‖z‖ ≈ p^{1/2} and β̃ ≈ β*. This is the required coupling. More precisely, applying Chernoff's bound to the chi-squared random variable ‖z‖² implies that if 0 ≤ r < τ₀, then

    P{‖β̃ − β*‖ > r} = P{τ₀ |‖z‖/p^{1/2} − 1| > r}
                     = P{‖z‖² > p(1 + r/τ₀)²} + P{‖z‖² < p(1 − r/τ₀)²}
                     ≤ 2 exp(−pr²/(2τ₀²)),  (9)

for 0 ≤ r < τ₀.

4 ASYMPTOTIC PROPERTIES

4.1 Consistency and Asymptotic Normality

In this section, we derive consistency and asymptotic normality results for θ̂, under the assumption that p/n → ρ ∈ (0, ∞) \ {1}.³ Consistency, i.e. convergence in probability, is a direct consequence of (9) and Proposition 2 from (Dicker & Erdogdu, 2016).

Theorem 1. Assume that (1)–(2) holds. Let K ⊆ (0, ∞) be a compact set and suppose that σ₀², η₀² ∈ K. Additionally suppose that ρ ∈ (0, ∞) \ {1}. Then θ̂ → θ₀ in probability, as p/n → ρ.

³ If ρ = 1, then inf λ_min(p⁻¹XX⊤) = 0, where λ_min(p⁻¹XX⊤) is the smallest nonzero eigenvalue of p⁻¹XX⊤. This causes some technical difficulties, which arise frequently in random matrix theory, but can often be overcome with some additional work, e.g. (Bai et al., 2003).

It takes a little more work to show that θ̂ is asymptotically normal, but the approach is similar to the standard argument for asymptotic normality of maximum likelihood and M-estimators (e.g. Chapter 5 of Van der Vaart, 2000). The key fact is that θ̂ solves the score equation S(θ̂) = 0, where

    S(θ) = (∂ℓ/∂θ)(θ) = ((∂ℓ/∂σ²)(θ), (∂ℓ/∂η²)(θ))⊤,
    (∂ℓ/∂σ²)(θ) = (1/(2σ⁴n)) y⊤ ((η²/p) XX⊤ + I)⁻¹ y − 1/(2σ²),
    (∂ℓ/∂η²)(θ) = (1/(2σ²n)) y⊤ (p⁻¹XX⊤) ((η²/p) XX⊤ + I)⁻² y − (1/(2n)) tr{(p⁻¹XX⊤) ((η²/p) XX⊤ + I)⁻¹}.

Taylor expanding S(θ) about θ₀, we obtain

    n^{1/2}(θ̂ − θ₀) ≈ −n^{1/2} {(∂S/∂θ)(θ₀)}⁻¹ S(θ₀).  (10)

The right-hand side of (10) (in particular, S(θ₀)) involves a quadratic form in y. If the data were drawn from a random-effects model with independent random-effects, then standard theory (for random-effects models or quadratic forms) could be used to argue that n^{1/2}(θ̂ − θ₀) is approximately normal. However, since we are interested in the fixed-effects model (1)–(2), there is an additional step: We must go through a variant of the coupling argument from Section 3 again.

Define S̃(θ) to be the same as S(θ), except replace y with ỹ. Then, using (8),

    n^{1/2}(θ̂ − θ₀) =_D n^{1/2}(θ̃ − θ₀) ≈ −n^{1/2} {(∂S̃/∂θ)(θ₀)}⁻¹ S̃(θ₀).  (11)

The entries of S̃(θ) are quadratic forms in

    ỹ = Xβ̃ + ε = X(τ₀z/‖z‖) + ε.  (12)

Thus, the approximation (11) allows us to shift focus to the random-effects model (7) and quadratic forms in ỹ. This is an important step, because of the relative simplicity of random-effects models. However, the coordinates of β̃ are correlated, which is an additional obstacle that must be addressed. To overcome this issue, we use the fact that ỹ can be represented in terms of z and ε, as in (12), and apply the delta method to S̃(θ₀) (that is, we use another Taylor expansion; cf. Ch. 3 of Van der Vaart, 2000). Following this strategy, S̃(θ₀) can be approximated by quadratic forms in the random vectors z and ε, which have independent components. Asymptotic normality for S̃(θ₀) then follows from existing normal approximation results for quadratic forms in independent random variables.

Theorem 2. Assume that the conditions of Theorem 1 hold. Additionally, define

    I_N(θ₀) = [ ι₂(θ₀)  ι₃(θ₀) ; ι₃(θ₀)  ι₄(θ₀) ],

where

    ι_k(θ₀) = (1/(2nσ₀^{2(4−k)})) tr{ (p⁻¹XX⊤)^{k−2} ((η₀²/p) XX⊤ + I)^{2−k} },

for k = 2, 3, 4, and

    J(θ₀)⁻¹ = I_N(θ₀)⁻¹ − (2η₀⁴/(p/n)) [ 0  0 ; 0  1 ].  (13)

Then n^{1/2} J(θ₀)^{1/2} (θ̂ − θ₀) → N(0, I) in distribution, as p/n → ρ.
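For context, θ̂ in (3)–(4) can be computed from a single eigendecomposition of XX⊤/p: setting the score in σ² to zero shows that, for fixed η², (4) is maximized by σ²(η²) = y⊤((η²/p)XX⊤ + I)⁻¹y/n, leaving a one-dimensional profile likelihood in η². The following sketch is our own simple grid search over that profile, not the authors' implementation (standard mixed-model software would use Newton-type iterations instead).

```python
import numpy as np

def mle_theta(y, X, eta2_grid=None):
    """Profile-likelihood sketch of the bivariate MLE (3)-(4).

    If s_i, u_i are the eigenpairs of X X'/p, then
    log det((eta^2/p) X X' + I) = sum_i log(eta^2 s_i + 1) and the
    quadratic form in (4) is sum_i (u_i' y)^2 / (eta^2 s_i + 1)."""
    n, p = X.shape
    s, U = np.linalg.eigh(X @ X.T / p)       # eigenvalues s_i >= 0
    z2 = (U.T @ y) ** 2                      # squared coordinates of y
    if eta2_grid is None:
        eta2_grid = np.linspace(0.0, 20.0, 2001)
    best = (-np.inf, None, None)
    for eta2 in eta2_grid:
        d = eta2 * s + 1.0
        sigma2 = np.sum(z2 / d) / n          # profiled-out sigma^2
        ll = -0.5 * np.log(sigma2) - np.sum(np.log(d)) / (2 * n) - 0.5
        if ll > best[0]:
            best = (ll, sigma2, eta2)
    return best[1], best[2]                  # (sigma2_hat, eta2_hat)
```

Substituting the profiled σ² back into (4) turns the final term into the constant −1/2, which is why it appears in the code.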

As described above, the proof of Theorem 2 relies on three steps: (i) the distributional identity and Taylor approximation (11), (ii) applying the delta method to handle the correlated coordinates of β̃, and (iii) normal approximation results for quadratic forms in z and ε. These steps can be rigorously justified using Theorem 2 and Proposition 2 from (Dicker & Erdogdu, 2016).

4.2 Random-Effects Model

The matrix I_N(θ₀) defined in Theorem 2 is the Fisher information matrix for θ₀ under the Gaussian random-effects linear model, where β ∼ N{0, (τ₀²/p)I}. Well-known results on random-effects models (Jiang, 1996) and standard likelihood theory (e.g. Ch. 6 of Lehmann & Casella, 1998) imply that under this random-effects model, θ̂ is asymptotically efficient with asymptotic variance I_N(θ₀)⁻¹. Thus, Theorem 2 implies that η̂² has smaller asymptotic variance under the fixed-effects model than under the random-effects model (on the other hand, σ̂² has the same asymptotic variance in both models). As a consequence, confidence intervals for variance parameters related to η₀² tend to be smaller under the fixed-effects model than under the random-effects model, and corresponding tests may have more power in the fixed-effects model. A practical illustration of this observation may be found in the data analysis in Section 6.

5 MARČENKO-PASTUR APPROXIMATIONS

5.1 Formulas for the Asymptotic Variance of θ̂

Let F = F_{n,p} denote the empirical distribution of the eigenvalues of p⁻¹XX⊤. Marčenko & Pastur (1967) showed that under (2), if p/n → ρ ∈ (0, ∞), then F_{n,p} → F_ρ, where F_ρ is the Marčenko-Pastur distribution with density

    f_ρ(s) = max{1 − ρ, 0} δ₀(s) + (ρ/(2π)) √((b − s)(s − a)) / s,

for a ≤ s ≤ b, where a = (1 − ρ^{−1/2})² and b = (1 + ρ^{−1/2})².

In Theorem 2, observe that the entries of I_N(θ₀) can be reexpressed as

    ι_k(θ₀) = (1/(2σ₀^{2(4−k)})) ∫ (s/(η₀²s + 1))^{k−2} dF(s).

Thus, it is evident that if ρ ∈ (0, ∞), then lim_{p/n→ρ} ι_k(θ₀) = ι_k^ρ(θ₀), where

    ι_k^ρ(θ₀) = (1/(2σ₀^{2(4−k)})) ∫_a^b (s/(η₀²s + 1))^{k−2} f_ρ(s) ds.  (14)

The quantities ι_k^ρ(θ₀) — and, consequently, the asymptotic variance of θ̂, given in Theorem 2 — can be computed explicitly using properties of the Stieltjes transform of the Marčenko-Pastur distribution, which is defined by

    m_ρ(z) = ∫_a^b (1/(s + z)) f_ρ(s) ds  (15)
           = (1/(2z)) [1 − ρz − ρ + {(1 − ρz − ρ)² + 4ρz}^{1/2}],

where the second equality holds for z > 0 (see, e.g., Ch. 3 of Bai & Silverstein, 2010). Comparing (14)–(15) and differentiating m_ρ(z) as necessary implies that

    ι₂^ρ(θ₀) = 1/(2σ₀⁴),  (16)
    ι₃^ρ(θ₀) = (1/(2σ₀²)) {1/η₀² − (1/η₀⁴) m_ρ(1/η₀²)},  (17)
    ι₄^ρ(θ₀) = (1/2) {1/η₀⁴ − (2/η₀⁶) m_ρ(1/η₀²) − (1/η₀⁸) m_ρ′(1/η₀²)}.  (18)

To compute the asymptotic variance of θ̂ in terms of m_ρ(z), we use (13) and the expressions for ι_k^ρ(θ₀) given above. In particular, let ψ_{k+l}^ρ(θ₀) denote the kl-element of lim_{p/n→ρ} J(θ₀)⁻¹. It follows from basic matrix algebra that

    ψ₂^ρ(θ₀) = 2σ₀⁴ [1 − {m(η₀⁻²) − η₀²}² / {m′(η₀⁻²) + m(η₀⁻²)²}],  (19)
    ψ₃^ρ(θ₀) = −2σ₀²η₀⁴ {m(η₀⁻²) − η₀²} / {m′(η₀⁻²) + m(η₀⁻²)²},  (20)
    ψ₄^ρ(θ₀) = −2η₀⁸ / {m′(η₀⁻²) + m(η₀⁻²)²} − 2η₀⁴/ρ,  (21)

where m = m_ρ. From (15) and (19)–(21), it is apparent that the asymptotic variance of θ̂ is an easily computed algebraic function of σ₀², η₀², and ρ.

5.2 Comparison to Previously Proposed Estimators

Define the proportion of explained variation⁴ r₀² = τ₀²/(τ₀² + σ₀²) = η₀²/(η₀² + 1) and the estimator r̂² = η̂²/(η̂² + 1). Consistency and asymptotic normality for r̂² follow easily from Theorems 1–2 and the delta method. In this section, we compare the asymptotic variance of σ̂² and r̂² to that of other previously proposed estimators for σ₀² and r₀².

⁴ The proportion of explained variation is an important parameter for regression diagnostics. It is also important in genetics; see Section 6.
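The closed form in (15) is easy to check numerically against the integral definition. The sketch below is our own: it integrates the continuous part of f_ρ with a simple midpoint rule and adds the point mass at 0 when ρ < 1.

```python
import numpy as np

def m_rho(z, rho):
    """Closed form (15) of the Marchenko-Pastur Stieltjes transform, z > 0."""
    t = 1.0 - rho * z - rho
    return (t + np.sqrt(t * t + 4.0 * rho * z)) / (2.0 * z)

def m_rho_numeric(z, rho, num=200_000):
    """Integrate f_rho(s)/(s + z) directly as a check on the closed form."""
    a = (1.0 - rho ** -0.5) ** 2
    b = (1.0 + rho ** -0.5) ** 2
    ds = (b - a) / num
    s = a + (np.arange(num) + 0.5) * ds       # midpoints avoid the endpoints
    dens = rho * np.sqrt((b - s) * (s - a)) / (2.0 * np.pi * s)
    atom = max(1.0 - rho, 0.0) / z            # contribution of delta_0
    return float(np.sum(dens / (s + z)) * ds + atom)
```

The atom at 0 matters for ρ < 1 (when p⁻¹XX⊤ has n − p zero eigenvalues); for ρ > 1 the continuous part carries all the mass.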

Figure 2: Plots of the asymptotic variance of various estimators for σ₀² and r₀², as functions of η₀² (the signal-to-noise ratio), for ρ = 0.5, 1, 5. Top row: asymptotic variances of estimators for σ² (v_MLE, v_MM, and v_OLS). Bottom row: asymptotic variances of estimators for r² (w_MLE and w_MM). Theorem 2 does not apply to the case where ρ = 1; the asymptotic variances for the MLEs in this case are conjectured to be v_MLE(η₀², 1) and w_MLE(η₀², 1). The values v_MM, v_OLS, and w_MM are given explicitly in the Supplementary Material.
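The MLE curves in the top row of Figure 2 can be evaluated directly from the formula for v_MLE in Section 5.2, given the closed-form Stieltjes transform. The sketch below is our own; as an implementation shortcut, m′ is approximated by a central finite difference rather than differentiated in closed form.

```python
import numpy as np

def v_mle(eta2, rho, h=1e-6):
    """Standardized asymptotic variance v_MLE(eta^2, rho) of sigma^2-hat:
    1 - (m(1/eta^2) - eta^2)^2 / (m'(1/eta^2) + m(1/eta^2)^2)."""
    def m(z):  # closed form (15) of the MP Stieltjes transform
        t = 1.0 - rho * z - rho
        return (t + np.sqrt(t * t + 4.0 * rho * z)) / (2.0 * z)
    z0 = 1.0 / eta2
    mp = (m(z0 + h) - m(z0 - h)) / (2.0 * h)  # finite-difference m'(1/eta^2)
    return 1.0 - (m(z0) - eta2) ** 2 / (mp + m(z0) ** 2)
```

Values above 1 quantify the high-dimensional penalty relative to the classical rate n·Var(σ̂²) ≈ 2σ₀⁴; note the finite-difference step loses accuracy for very small η² (where 1/η² is large), so extreme inputs would need the analytic derivative.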

First define the standardized asymptotic variance of p > n (i.e. ρ > 1) and has infinite variance when σˆ2, p = n; furthermore, even for p < n, it is unclear how a 2 corresponding estimate of r0 should be defined. From 2 n 2 1 ρ vMLE(η0, ρ) = 4 lim Var(ˆσ ) = 4 ψ2 (θ0) Figure 2, it is clear that the MLEs have smaller asymp- 2σ p/n→ρ 2σ 0 0 totic variance (i.e. they are more efficient) than the {m(η−2) − η2}2 = 1 − 0 0 . MLE and OLS estimators. m0(η−2) + m(η−2)2 0 0 In other work on variance estimation within the “struc- The asymptotic variance ofr ˆ2 can be derived using tured X” paradigm, Janson et al. (2015) proposed 2 2 (21) and the delta method; assuming p/n → ρ ∈ the EigenPrism method for estimating σ0 and η0. (0, ∞) \{1}, it is given by While (Janson et al., 2015) do describe methods for constructing confidence intervals for σ2, they do not 2 0 wMLE(η0, ρ) give an explicit formula for the asymptotic variance of  4 EigenPrism; instead, they gives bounds on the asymp- n 2 1 1 ρ = lim Var(ˆr ) = 2 ψ4 (θ0) totic variance in terms of the solution to a convex op- 2 p/n→ρ 2 η + 1 0 timization problem. Given this discrepancy, we omit  η2 4  1 1  = − 0 + . the EigenPrism estimator from Figure 2. However, the 2 0 −2 −2 2 2 η0 + 1 m (η0 ) + m(η0 ) ρη0 performance of EigenPrism relative to the MLE and 2 the MM estimator for σ0 is depicted in Figure 1 (see The asymptotic variances vMLE and wMLE are plot- also Table S1); in these settings, the MLE appears to ted in Figure 2, along with the asymptotic variances have significantly smaller variance than EigenPrism. 2 2 of method-of-moments (MM) estimators for σ0 and r0 (Dicker, 2014), and, in the top-left plot, the asymp- 2 −1 ˆ 2 6 DATA ANALYSIS: ESTIMATING totic variance ofσ ˆOLS = (n−p) ky−XβOLSk , which was defined in Footnote 1. 
Explicit formulas for the HERITABILITY asymptotic variances of the MM and OLS estimators are given in the Supplementary Material (these for- To illustrate how the results from this paper could mulas have been derived elsewhere previously). We be used in a practical application and to further in- note that the OLS estimator appears only in the top- vestigate the MLE’s performance vis-`a-visother pre- 2 left plot of Figure 2 becauseσ ˆOLS is undefined when viously proposed methods for variance estimation, we Running heading title breaks the line conducted a real data analysis involving publicly avail- that the genes under consideration were already iden- able gene expression data and single nucelotide poly- tified as significant in another study, we take shorter 2 morphism (SNP) data. CIs to be an indicator of more efficient inference on r0. 2 For each gene, we calculated two estimates for r0: (i) 6.1 Background The MLEr ˆ2 and (ii) the method-of-moments esti- 2 mater ˆMM proposed in (Dicker, 2014). After comput- In genetics, heritability measures the fraction of vari- ing these estimates, we then constructed three Wald- ability in an observed trait (the phenotype) that can 2 2 type 95% CIs for the r0 of each gene using: (i)r ˆ and be explained by genetic variability (the genotype). In the asymptotic standard error suggested by Theorem the context of linear models like (1)–(2), it is natu- 2 (referred to as MLE-FE); (ii)r ˆ2 and the asymptotic 2 ral to identify heritability with the parameter r0 = standard errors corresponding to a Gaussian random- 2 2 η0/(η0 + 1) (De los Campos et al., 2015; Yang et al., 2 effects model, where β ∼ N{0, (τ0 /p)I} (MLE-RE); 2010). More specifically, if yi is the i-th subject’s phe- 2 and (iii)r ˆMM and the asymptotic standard error given notype and xij ∈ {0, 1, 2} is a measure of their geno- by Proposition 2 of (Dicker, 2014) (MM). Summary type at location j (i.e. 
the minor allele count at a given location in the i-th subject's DNA), then r^2 measures the fraction of variability in the phenotype y_i that is explained by the genotype (x_ij), j = 1, ..., p. Heritability is often studied in relation to easily observable phenotypes, such as human height (Yang et al., 2010). However, other important work has focused on more basic phenotypes, such as gene expression levels, which may be measured by mRNA abundance (Stranger et al., 2007, 2012).

6.2 Analysis and Results

This data analysis was based on publicly available SNP data from the International HapMap Project and gene expression data collected by Stranger and coauthors^5 on n = 80 individuals from the Han Chinese HapMap population. For each of 100 different genes, we estimated the heritability r_0^2 of the gene's expression level using genotype data from nearby SNPs, and then computed corresponding confidence intervals (CIs). In other words, we estimated r_0^2 separately for 100 different genes, based on a different collection of SNPs (ranging from p = 17 to p = 190) for each gene, and then computed CIs for each r_0^2. The 100 genes were selected from genes that Stranger et al. (2007) had previously identified as having significant association between gene expression levels and nearby SNPs in the Japanese HapMap population. More detailed preprocessing steps are described in the Supplementary Material.^6 The main objective of this data analysis is to compare the lengths of the various CIs for r_0^2.

Statistics from our analysis are reported in Table 1. The results in Table 1 indicate that the MLE-FE intervals are typically the shortest, which potentially points towards the improved power of this approach.

Table 1: Mean and empirical SD (in parentheses) of 95% CI length and corresponding estimates r̂^2, computed over 100 genes of interest.

              MLE-FE        MLE-RE        MM
  CI length   0.43 (0.13)   0.47 (0.14)   0.45 (0.26)
  r̂^2        0.21 (0.21)   0.21 (0.21)   0.22 (0.23)

^5 Accessed at http://hapmap.ncbi.nlm.nih.gov/ and http://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-264/

^6 We emphasize that the entries of X are discrete and thus violate the normality assumptions required for most of the theoretical results in this paper, even after preprocessing. However, additional experiments with simulated data (not reported here) suggest that many of these results remain valid for iid discrete x_ij. Pursuing theoretical support for these observations is of great interest.

7 DISCUSSION

The MLEs studied in this paper outperform a variety of other previously proposed methods for variance parameter estimation in high-dimensional linear models. Additionally, our methods highlight connections between fixed- and random-effects models in high dimensions, which may form the basis for further research in this area. Investigating the extent to which the distributional assumptions (2) may be relaxed will be important for determining the breadth of applicability of the results in this paper, and that of other "structured X" methods for high-dimensional data analysis. In this paper, the key implication of assumption (2) is the invariance property (5). Two possible directions for relaxing (5) are: (i) replacing the identity (5) with an approximate identity (as could be satisfied by certain classes of sub-Gaussian random vectors (x_i1, ..., x_ip)^T ∈ R^p) and (ii) assuming that (5) holds only for U belonging to certain subsets G of p × p orthogonal matrices, such as G = {p × p permutation matrices} (this relaxation seems relevant for iid, but not necessarily Gaussian, x_ij).

Acknowledgements

L.H. Dicker's work on this project was supported by NSF Grant DMS-1208785 and NSF CAREER Award DMS-1454817.
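The contrast drawn in the Discussion between full orthogonal invariance and permutation invariance can be illustrated numerically. The sketch below is hypothetical illustration, not code from the paper; it takes the invariance property (5) to be the distributional identity Ux =_d x for orthogonal U, consistent with the surrounding text. An iid Gaussian vector is invariant under any orthogonal rotation, while an iid Rademacher (±1) vector is invariant only under permutations:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n_draws = 50, 20000

# A random p x p orthogonal matrix U, obtained from a QR decomposition.
U, _ = np.linalg.qr(rng.standard_normal((p, p)))

# iid Gaussian coordinates: x and Ux have the same distribution, so any
# distributional summary (here, the mean absolute coordinate) should agree.
x_gauss = rng.standard_normal((n_draws, p))
x_gauss_rot = x_gauss @ U.T
print(np.abs(x_gauss).mean(), np.abs(x_gauss_rot).mean())  # both near sqrt(2/pi) ~ 0.80

# iid Rademacher (+/-1) coordinates: a general rotation changes the
# coordinate distribution (|x_j| = 1 exactly before rotation, not after) ...
x_rad = rng.choice([-1.0, 1.0], size=(n_draws, p))
x_rad_rot = x_rad @ U.T
print(np.abs(x_rad).std(), np.abs(x_rad_rot).std())  # 0.0 vs. roughly 0.6

# ... while a permutation of the coordinates preserves the iid law exactly,
# matching the relaxation G = {p x p permutation matrices}.
x_rad_perm = x_rad[:, rng.permutation(p)]
print(np.abs(x_rad_perm).std())  # exactly 0.0
```

The rotated Rademacher coordinates are approximately Gaussian by the central limit theorem, which is one way to see why only an approximate version of (5) could hold for general sub-Gaussian designs.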

References

Bai, Z. and Silverstein, J. W. Spectral Analysis of Large Dimensional Random Matrices. Springer, 2010.

Bai, Z. D., Miao, B., and Yao, J. F. Convergence rates of spectral distributions of large sample covariance matrices. SIAM J. Matrix Anal. A., 25:105–127, 2003.

Bayati, M., Erdogdu, M. A., and Montanari, A. Estimating the LASSO risk and noise level. NIPS 2013, 2013.

Bühlmann, P. and Van de Geer, S. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, 2011.

Chatterjee, S. and Jafarov, J. Prediction error of cross-validated Lasso. arXiv preprint arXiv:1502.06291, 2015.

Crump, S. L. The estimation of variance components in analysis of variance. Biometrics Bull., 2:7–11, 1946.

De los Campos, G., Sorensen, D., and Gianola, D. Genomic heritability: What is it? PLoS Genet., 11:e1005048, 2015.

Demidenko, E. Mixed Models: Theory and Applications with R. Wiley, 2013.

Dicker, L. H. Variance estimation in high-dimensional linear models. Biometrika, 101:269–284, 2014.

Dicker, L. H. Ridge regression and asymptotic minimax estimation over spheres of growing dimension. Bernoulli, 22(1):1–37, 2016.

Dicker, L. H. and Erdogdu, M. A. Flexible results for quadratic forms with applications to variance components estimation. Ann. Stat., 2016. To appear.

Donoho, D. L., Maleki, A., and Montanari, A. Message-passing algorithms for compressed sensing. P. Natl. Acad. Sci. USA, 106:18914–18919, 2009.

Fan, J., Guo, S., and Hao, N. Variance estimation using refitted cross-validation in ultrahigh dimensional regression. J. Roy. Stat. Soc. B, 74:37–65, 2012.

Hartley, H. O. and Rao, J. N. K. Maximum-likelihood estimation for the mixed analysis of variance model. Biometrika, 54:93–108, 1967.

Janson, L., Barber, R. F., and Candès, E. EigenPrism: Inference for high-dimensional signal-to-noise ratios. arXiv preprint arXiv:1505.02097, 2015.

Javanmard, A. and Montanari, A. Confidence intervals and hypothesis testing for high-dimensional regression. J. Mach. Learn. Res., 15:2869–2909, 2014.

Jiang, J. REML estimation: Asymptotic behavior and related topics. Ann. Stat., 24:255–286, 1996.

Korada, S. B. and Montanari, A. Applications of the Lindeberg principle in communications and statistical learning. IEEE T. Inform. Theory, 57:2440–2450, 2011.

Lehmann, E. L. and Casella, G. Theory of Point Estimation. Springer, 2nd edition, 1998.

Listgarten, J., Lippert, C., Kadie, C. M., Davidson, R. I., Eskin, E., and Heckerman, D. Improved linear mixed models for genome-wide association studies. Nat. Methods, 9:525–526, 2012.

Loh, P.-R., Tucker, G., Bulik-Sullivan, B. K., Vilhjalmsson, B. J., Finucane, H. K., Salem, R. M., Chasman, D. I., Ridker, P. M., Neale, B. M., Berger, B., Patterson, N., and Price, A. L. Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat. Genet., 47:285–290, 2015.

Mallows, C. L. Some comments on Cp. Technometrics, 15:661–675, 1973.

Marčenko, V. A. and Pastur, L. A. Distribution of eigenvalues for some sets of random matrices. Math. U.S.S.R. Sb., 1:457–483, 1967.

Rangan, S. Generalized approximate message passing for estimation with random linear mixing. In Proc. IEEE Int. Symp. Inf. Theory (ISIT), pp. 2168–2172. IEEE, 2011.

Rangan, S., Schniter, P., and Fletcher, A. On the convergence of approximate message passing with arbitrary matrices. In Proc. IEEE Int. Symp. Inf. Theory (ISIT), pp. 236–240. IEEE, 2014.

Reid, S., Tibshirani, R., and Friedman, J. A study of error variance estimation in lasso regression. arXiv preprint arXiv:1311.5274, 2013.

Satterthwaite, F. E. An approximate distribution of estimates of variance components. Biometrics Bull., 2:110–114, 1946.

Searle, S. R., Casella, G., and McCulloch, C. E. Variance Components. Wiley, 1992.

Stranger, B. E., Nica, A. C., Forrest, M. S., Dimas, A. S., Bird, C. P., Beazley, C., Ingle, C. E., Dunning, M., Flicek, P., Koller, D., Montgomery, S., Tavaré, S., Deloukas, P., and Dermitzakis, E. T. Population genomics of human gene expression. Nat. Genet., 39:1217–1224, 2007.

Stranger, B. E., Montgomery, S. B., Dimas, A. S., Parts, L., Stegle, O., Ingle, C. E., Sekowska, M., Smith, G. D., Evans, D., Gutierrez-Arcelus, M., Price, A. L., Raj, T., Nisbett, J., Nica, A. C., Beazley, C., Durbin, R., Deloukas, P., and Dermitzakis, E. T. Patterns of cis regulatory variation in diverse human populations. PLoS Genet., 8:e1002639, 2012.

Sun, T. and Zhang, C.-H. Scaled sparse linear regression. Biometrika, 99:879–898, 2012.

Van der Vaart, A. W. Asymptotic Statistics. Cambridge University Press, 2000.

Vila, J. P. and Schniter, P. Expectation-maximization Gaussian-mixture approximate message passing. IEEE T. Signal Proces., 61:4658–4672, 2013.

Yang, J., Benyamin, B., McEvoy, B. P., Gordon, S., Henders, A. K., Nyholt, D. R., Madden, P. A., Heath, A. C., Martin, N. G., Montgomery, G. W., Goddard, M. E., and Visscher, P. M. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet., 42:565–569, 2010.