
Variance function estimation in high-dimensions

Mladen Kolar [email protected]
James Sharpnack [email protected]
Machine Learning Department, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213 USA

Abstract

We consider the high-dimensional heteroscedastic regression model, where the mean and the log variance are modeled as a linear combination of input variables. Existing literature on high-dimensional linear regression models has largely ignored non-constant error variances, even though they commonly occur in a variety of applications ranging from biostatistics to finance. In this paper we study a class of non-convex penalized pseudolikelihood estimators for both the mean and variance parameters. We show that the Heteroscedastic Iterative Penalized Pseudolikelihood Optimizer (HIPPO) achieves the oracle property, that is, we prove that the rates of convergence are the same as if the true model were known. We demonstrate numerical properties of the procedure on a simulation study and real world data.

MK is partially supported through the grants NIH R01GM087694 and AFOSR FA9550010247. JS is partially supported by AFOSR under grant FA95501010382.

1. Introduction

High-dimensional regression models have been studied extensively in both the machine learning and statistical literature. Estimation in high-dimensions, where the sample size n is smaller than the ambient dimension p, is impossible without assumptions. As the concept of parsimony is important in many scientific domains, most of the research in the area of high-dimensional statistical inference is done under the assumption that the underlying model is sparse, in the sense that the number of relevant parameters is much smaller than p, or that it can be well approximated by a sparse model.

Penalization of the empirical loss by the ℓ1 norm has become a popular tool for obtaining sparse models, and a huge amount of literature exists on theoretical properties of estimation procedures (see, e.g., Zhao & Yu, 2006; Wainwright, 2009; Zhang, 2009; Zhang & Huang, 2008, and references therein) and on efficient algorithms that numerically find estimates (see Bach et al., 2011, for an extensive literature review). Due to limitations of the ℓ1 norm penalization, high-dimensional inference methods based on the class of concave penalties have been proposed that have better theoretical and numerical properties (see, e.g., Fan & Li, 2001; Fan & Lv, 2009; Lv & Fan, 2009; Zhang & Zhang, 2011).

In all of the above cited work, the main focus is on mean parameter estimation. Only a few papers deal with estimation of the variance in high-dimensions (Sun & Zhang, 2011; Fan et al., 2012), although it is a fundamental problem in statistics. Variance appears in the confidence bounds on estimated regression coefficients and is important for variable selection as it appears in Akaike's information criterion (AIC) and the Bayesian information criterion (BIC). Furthermore, it provides confidence on the predictive performance of a forecaster.

In applied regression it is often the case that the error variance is non-constant. Although the assumption of a constant variance can sometimes be achieved by transforming the dependent variable, e.g., by using a Box-Cox transformation, in many cases transformation does not produce a constant error variance (Carroll & Ruppert, 1988). Another approach is to ignore the heterogeneous variance and use standard estimation techniques, but such estimators are less efficient. Aside from its use in reweighting schemes, estimating the variance is important because the resulting prediction intervals become more accurate and it is often important to explore which input variables drive the variance. In this paper, we will model the variance directly as a parametric function of the explanatory variables.
Heteroscedastic regression models are used in a variety of fields ranging from biostatistics to econometrics, finance and quality control in manufacturing. In this paper, we study penalized estimation in high-dimensional heteroscedastic models, where the mean and the log variance are modeled as a linear combination of explanatory variables. Modeling the log variance as a linear combination of the explanatory variables is a common choice as it guarantees positivity and is also capable of capturing variance that may vary over several orders of magnitude (Carroll & Ruppert, 1988; Harvey, 1976). The main contributions of this paper are as follows. First, we propose HIPPO (Heteroscedastic Iterative Penalized Pseudolikelihood Optimizer) for estimation of both the mean and variance parameters. Second, we establish the oracle property (in the sense of Fan & Lv (2009)) for the estimated mean and variance parameters. Finally, we demonstrate numerical properties of the proposed procedure on a simulation study, where it is shown that HIPPO outperforms other methods, and analyze a real data set.

1.1. Problem Setup and Notation

Consider the usual heteroscedastic linear model,

    y_i = x_i'β + σ(x_i, θ) ε_i,   i = 1, . . . , n,    (1)

where X = (x_1, . . . , x_n)' = (X_1, . . . , X_p) is an n × p matrix of predictors with i.i.d. rows x_1, . . . , x_n, y = (y_1, . . . , y_n)' is an n-vector of responses, the vectors β ∈ R^p and θ ∈ R^p are p-vectors of mean and variance parameters, respectively, and ε = (ε_1, . . . , ε_n)' is an n-vector of i.i.d. random noise with mean 0 and variance 1. We assume that the noise ε is independent of the predictors X. The function σ(x, θ) has a known parametric form and, for simplicity of presentation, we assume that it takes the particular form σ(x_i, θ) = exp(x_i'θ/2).

Throughout the paper we use [n] to denote the set {1, . . . , n}. For any index set S ⊆ [p], we denote by β_S the subvector containing the components of the vector β indexed by the set S, and X_S denotes the submatrix containing the columns of X indexed by S. For a vector a ∈ R^n, we denote by supp(a) = {j : a_j ≠ 0} the support set and by ||a||_q, q ∈ (0, ∞), the ℓ_q-norm defined as ||a||_q = (Σ_{i∈[n]} a_i^q)^{1/q}, with the usual extensions for q ∈ {0, ∞}, that is, ||a||_0 = |supp(a)| and ||a||_∞ = max_{i∈[n]} |a_i|. For notational simplicity, we denote by || · || = || · ||_2 the ℓ_2 norm. For a matrix A ∈ R^{n×p} we denote by |||A|||_2 the operator norm and by ||A||_F the Frobenius norm, and Λ_min(A) and Λ_max(A) denote the smallest and largest eigenvalues, respectively.

Under the model in (1), we are interested in estimating both β and θ. In high-dimensions, when p ≫ n, it is common to assume that the support of β is small, that is, S = supp(β) and |S| ≪ n. Similarly, we assume that the support T = supp(θ) is small.
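To make the data-generating process in (1) concrete, here is a minimal numpy sketch that draws one sample from the model with the exponential variance function; the dimensions, supports and coefficient values are illustrative choices, not settings used later in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 1000                                  # illustrative sizes

X = rng.standard_normal((n, p))                   # rows x_i of the design matrix
beta = np.zeros(p); beta[:3] = [2.0, -1.5, 1.0]   # sparse mean parameter, S = {1, 2, 3}
theta = np.zeros(p); theta[:2] = [1.0, 0.5]       # sparse log-variance parameter, T = {1, 2}

sigma = np.exp(X @ theta / 2.0)                   # sigma(x_i, theta) = exp(x_i' theta / 2)
eps = rng.standard_normal(n)                      # i.i.d. noise with mean 0 and variance 1
y = X @ beta + sigma * eps                        # responses from model (1)
```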
1.2. Related Work

Consider the model (1) with constant variance, i.e., σ(x, θ) ≡ σ_0. Most of the existing high-dimensional literature is focused on estimation of the mean parameter β in this homoscedastic regression model. Under a variety of assumptions and regularity conditions, any penalized estimation procedure mentioned in the introduction can, in theory, select the correct sparse model with probability tending to 1. Literature on variance estimation is not as developed. Fan et al. (2012) proposed a two step procedure for estimation of the unknown variance σ_0, while Sun & Zhang (2011) proposed an estimation procedure that jointly estimates the model and the variance.

The problem of estimation in heteroscedastic linear regression models has been studied extensively in the classical setting with p fixed; however, the problem of estimation under the model (1) when p ≫ n has not been adequately studied. Jia et al. (2010) assume that σ(x, θ) = |x'β| and show that the Lasso is sign consistent for the mean parameter β under certain conditions. Their study shows limitations of the lasso, for which many highly scalable solvers exist. However, no new methodology is developed, as the authors acknowledge that the log-likelihood is highly non-convex. Dette & Wagener (2011) study the adaptive lasso under the model in (1). Under certain regularity conditions, they show that the adaptive lasso is consistent, with suboptimal asymptotic variance. However, the weighted adaptive lasso is both consistent and achieves the optimal asymptotic variance, under the assumption that the variance function is consistently estimated. However, they do not discuss how to obtain an estimator of the variance function in a principled way and resort to an ad-hoc fitting of the residuals. Daye et al. (2011) develop the HHR procedure, which optimizes the penalized log-likelihood under (1) with the ℓ1-norm penalty on both the mean and variance parameters. As the objective is not convex, HHR estimates β with θ fixed and then estimates θ with β fixed, until convergence. Since the objective is biconvex, HHR converges to a stationary point. However, no theory is provided for the final estimates.

2. Methodology

In this paper, we propose HIPPO (Heteroscedastic Iterative Penalized Pseudolikelihood Optimizer) for estimating β and θ under model (1).

In the first step, HIPPO finds the penalized pseudolikelihood maximizer of β by solving the following objective

    β̂ = argmin_{β∈R^p} ||y − Xβ||² + 2n Σ_{j∈[p]} ρ_{λ_S}(|β_j|),    (2)

where ρ_{λ_S} is the penalty function and the tuning parameter λ_S controls the sparsity of the solution β̂.

In the second step, HIPPO forms the penalized pseudolikelihood estimate for θ by solving

    θ̂ = argmin_{θ∈R^p} Σ_{i∈[n]} x_i'θ + Σ_{i∈[n]} η̂_i² exp(−x_i'θ) + 4n Σ_{j∈[p]} ρ_{λ_T}(|θ_j|),    (3)

where η̂ = y − Xβ̂ is the vector of residuals.

Finally, HIPPO computes the reweighted estimator of the mean by solving

    β̂_w = argmin_{β∈R^p} Σ_{i∈[n]} (y_i − x_i'β)²/σ̂_i² + 2n Σ_{j∈[p]} ρ_{λ_S}(|β_j|),    (4)

where σ̂_i = exp(x_i'θ̂/2) are the weights.
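For readers who find code easier to parse than displayed objectives, the following sketch simply evaluates (2), (3) and (4) for a generic penalty ρ_λ (the concave penalty used by HIPPO is introduced below); the function names and the array-based interface are ours, not the authors'.

```python
import numpy as np

def mean_objective(beta, X, y, lam_S, rho):
    """Objective (2): ||y - X b||^2 + 2n * sum_j rho_{lam_S}(|b_j|)."""
    n = X.shape[0]
    return np.sum((y - X @ beta) ** 2) + 2 * n * np.sum(rho(np.abs(beta), lam_S))

def variance_objective(theta, X, resid, lam_T, rho):
    """Objective (3): sum_i [x_i' t + eta_i^2 exp(-x_i' t)] + 4n * sum_j rho_{lam_T}(|t_j|)."""
    n = X.shape[0]
    xt = X @ theta
    return np.sum(xt + resid ** 2 * np.exp(-xt)) + 4 * n * np.sum(rho(np.abs(theta), lam_T))

def reweighted_mean_objective(beta, X, y, theta_hat, lam_S, rho):
    """Objective (4): residuals weighted by sigma_i^2 = exp(x_i' theta_hat)."""
    n = X.shape[0]
    sigma2 = np.exp(X @ theta_hat)
    return np.sum((y - X @ beta) ** 2 / sigma2) + 2 * n * np.sum(rho(np.abs(beta), lam_S))
```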

In the classical literature, estimation under heteroscedastic models is achieved by employing a pseudolikelihood objective. The pseudolikelihood maximization principle prescribes the scientist to maximize a surrogate likelihood, i.e. one that is believed to be similar to the likelihood with the true unknown fixed variances (or means, alternatively). In classical theory, central limit theorems are derived for many pseudo-maximum likelihood (PML) estimators using generalized estimating equations (Ziegler, 2011). HIPPO fits neatly into the pseudolikelihood framework because the first step is a regularized PML where only the mean structure needs to be correctly specified. The second and third steps may be similarly cast as PML estimators. Indeed, all our theoretical results are due to the fact that in each step we are optimizing a pseudolikelihood that is similar to the true unknown likelihoods (with alternating free parameters). Moreover, it is known that if the surrogate variances in the mean PML are more similar to the true variances then the resulting estimates will be more asymptotically efficient. With this in mind, we recommend a third reweighting procedure with the variance estimates from the second step.

Fan & Li (2001) advocate the usage of penalty functions that result in estimates satisfying three properties: unbiasedness, sparsity and continuity. A reasonable estimator should correctly identify the support of the true parameter with probability converging to one. Furthermore, on this support, the estimated coefficients should have the same asymptotic distribution as if an estimator that knew the true support was used. Such an estimator satisfies the oracle property. A number of concave penalties result in estimates that satisfy this property: the SCAD penalty (Fan & Li, 2001), the MCP penalty (Zhang, 2010) and a class of folded concave penalties (Lv & Fan, 2009). For concreteness, we choose to use the SCAD penalty, which is defined by its derivative

    ρ'_λ(t) = λ { I{t ≤ λ} + (aλ − t)_+ / ((a − 1)λ) · I{t > λ} },    (5)

where often a = 3.7 is used. Note that estimates produced by the ℓ1-norm penalty are biased, and hence this penalty does not achieve the oracle property.

HIPPO is related to the iterative HHR algorithm of Daye et al. (2011). In particular, the first two iterations of HHR are equivalent to HIPPO with the SCAD penalty replaced by the ℓ1 norm penalty. In practice, one can continue iterating between solving (3) and (4); however, establishing theoretical properties for those iterates is a non-trivial task. From our numerical studies, we observe that HIPPO performs well when stopped after the first two iterations.

2.1. Tuning Parameter Selection

As described in the previous section, HIPPO requires the selection of the tuning parameters λ_S and λ_T, which balance the complexity of the estimated model and the fit to the data. A common approach is to form a grid of candidate values for the tuning parameters λ_S and λ_T and choose those that minimize the AIC or BIC criterion

    AIC(λ_S, λ_T) = Σ_{i∈[n]} ℓ(y_i, x_i; β̂, θ̂) + 2 d̂f,    (6)

    BIC(λ_S, λ_T) = Σ_{i∈[n]} ℓ(y_i, x_i; β̂, θ̂) + d̂f log n,    (7)

where, up to constants,

    ℓ(y, x; β, θ) = x'θ + (y − x'β)² exp(−x'θ)

is the negative log-likelihood and

    d̂f = |supp(β̂)| + |supp(θ̂)|

is the estimated degrees of freedom. In Section 4, we compare the performance of the AIC and the BIC for HIPPO in a simulation study.
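A possible implementation of this selection rule is sketched below: it loops over a grid of (λ_S, λ_T) values and keeps the pair minimizing the BIC in (7). The helper fit_hippo, which should return (β̂, θ̂) for given tuning parameters, is hypothetical; only the scoring logic follows (6)–(7).

```python
import numpy as np

def neg_loglik(y, X, beta, theta):
    """Per-observation negative log-likelihood l(y, x; beta, theta), up to constants."""
    xt = X @ theta
    return xt + (y - X @ beta) ** 2 * np.exp(-xt)

def select_by_bic(y, X, lam_S_grid, lam_T_grid, fit_hippo):
    """Return the (lam_S, lam_T) pair minimizing BIC in (7)."""
    best_bic, best_pair = np.inf, None
    for lam_S in lam_S_grid:
        for lam_T in lam_T_grid:
            beta, theta = fit_hippo(y, X, lam_S, lam_T)        # hypothetical fitting routine
            df = np.count_nonzero(beta) + np.count_nonzero(theta)
            bic = np.sum(neg_loglik(y, X, beta, theta)) + df * np.log(len(y))
            if bic < best_bic:
                best_bic, best_pair = bic, (lam_S, lam_T)
    return best_pair
```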

2.2. Optimization Procedure

In this section, we describe the numerical procedures used to solve the optimization problems in (2), (3) and (4). Our procedures are based on the local linear approximation for the SCAD penalty developed in (Zou & Li, 2008), which gives

    ρ_λ(|β_j|) ≈ ρ_λ(|β_j^{(k)}|) + ρ'_λ(|β_j^{(k)}|)(|β_j| − |β_j^{(k)}|),   for β_j ≈ β_j^{(k)}.

This approximation allows us to substitute the SCAD penalty Σ_{j∈[p]} ρ_λ(|β_j|) in (2), (3) and (4) with

    Σ_{j∈[p]} ρ'_λ(|β_j^{(k)}|) |β_j|,    (8)

and iteratively solve each objective until convergence of {β̂^{(k)}}_k. We set the initial estimates β̂^{(0)} and θ̂^{(0)} to be the solutions of the ℓ1-norm penalized problems. The convergence of these iterative approximations follows from the convergence of MM (minorize-maximize) algorithms (Zou & Li, 2008).

With the approximation of the SCAD penalty given in (8), we can solve (2) and (4) using standard lasso solvers; e.g., we use the proximal method of Beck & Teboulle (2009). The objective in (3) is minimized using a coordinate descent algorithm, which is detailed in Daye et al. (2011).
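The sketch below illustrates this scheme for the mean objective (2): after rescaling by 1/(2n), each local-linear-approximation step is a weighted ℓ1 problem with weights ρ'_λ(|β_j^{(k)}|), which we solve here with plain ISTA rather than the accelerated method of Beck & Teboulle (2009); the function names and the fixed iteration counts are illustrative.

```python
import numpy as np

def scad_derivative(t, lam, a=3.7):
    """SCAD derivative from (5), applied elementwise to t >= 0."""
    t = np.abs(t)
    return lam * ((t <= lam) + np.maximum(a * lam - t, 0.0) / ((a - 1) * lam) * (t > lam))

def weighted_lasso_ista(X, y, weights, n_iter=500):
    """Minimize (1/(2n))||y - X b||^2 + sum_j weights_j |b_j| by proximal gradient (ISTA)."""
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2          # 1 / Lipschitz constant of the gradient
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        z = beta - step * grad
        beta = np.sign(z) * np.maximum(np.abs(z) - step * weights, 0.0)
    return beta

def lla_scad_mean(X, y, lam_S, n_outer=3):
    """Local linear approximation for (2): start from the l1 solution, then reweight via (8)."""
    p = X.shape[1]
    beta = weighted_lasso_ista(X, y, np.full(p, lam_S))   # beta_hat^(0): l1-penalized solution
    for _ in range(n_outer):
        w = scad_derivative(beta, lam_S)                  # weights rho'_lam(|beta_j^(k)|)
        beta = weighted_lasso_ista(X, y, w)
    return beta
```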

3. Theoretical Properties of HIPPO

In this section, we present theoretical properties of HIPPO. In particular, we show that HIPPO achieves the oracle property for estimating the mean and variance under the model (1). All the proofs are deferred to the Appendix.

We will analyze HIPPO under the following assumptions, which are standard in the literature on high-dimensional statistical learning (see, e.g., Fan et al., 2012).

Assumption 1. The matrix X = (x_1, . . . , x_n)' ∈ R^{n×p} has independent rows that satisfy x_i = Σ^{1/2} z_i, where {z_i}_i are i.i.d. subgaussian random vectors with E z_i = 0, E z_i z_i' = I and parameter K (see the Appendix for more details on subgaussian random variables). Furthermore, there exist two constants C_min, C_max > 0 such that

    0 < C_min ≤ Λ_min(Σ) ≤ Λ_max(Σ) ≤ C_max < ∞.

Assumption 2. The errors ε_1, . . . , ε_n are i.i.d. subgaussian with zero mean and parameter 1.

Assumption 3. There are two constants β̄ and θ̄ such that ||β|| ≤ β̄ < ∞ and ||θ|| ≤ θ̄ < ∞.

Assumption 4. |S| = C_S n^{α_S} and |T| = C_T n^{α_T} for some α_S ∈ (0, 1), α_T ∈ (0, 1/3) and constants C_S, C_T > 0.

The following assumption will be needed for showing the consistency of the weighted estimator β̂_w in (4).

Assumption 5. Define

    D_SS = n^{−1} X_S' diag(exp(−Xθ)) X_S.

There exist constants 0 < D_min, D_max < ∞ such that

    lim_{n→∞} P[Λ_max(D_SS) ≤ D_max] = 1,  and  lim_{n→∞} P[Λ_min(D_SS) ≥ D_min] = 1.

Furthermore, |||D_SS − E D_SS|||_2 = o_P(1) as n → ∞.

With these assumptions, we state our first result, regarding the estimator β̂ in (2).

Theorem 1. Suppose that assumptions (1)-(4) are satisfied. Furthermore, assume that λ_S ≥ c_1 √(log(p) exp(√(c_2 log n))/n), min_{j∈S} |β_j| ≫ λ_S ≫ c_3 √(log(s) exp(√(c_2 log n))/n) and log(p) = O(n^{α_0}) for some α_0 ∈ (0, 1). Then there is a strict local minimizer β̂ = (β̂_S', 0'_{S^C})' of (2) that satisfies

    ||β̂_S − β_S||_∞ ≤ c_3 √( exp(√(c_2 log n)) log(s) / n )    (9)

for some positive constants c_1, c_2, and c_3 and sufficiently large n.

In addition, if we suppose that assumption (5) is satisfied, then for any fixed a ∈ R^s with ||a||_2 = 1 the following weak convergence holds

    (√n / ζ) a'(β̂_S − β_S) →_D N(0, 1),    (10)

where ζ² = a' Σ_SS^{−1} (E D_SS) Σ_SS^{−1} a.

The first result stated in Theorem 1 establishes that β̂ achieves the weak oracle property in the sense of (Lv & Fan, 2009). The extra term exp(√(log n)) is subpolynomial in n and appears in the bound (9) due to the heteroscedastic nature of the errors. The second result establishes the strong oracle property of the estimator β̂ in the sense of (Fan & Lv, 2009); that is, we establish the asymptotic normality on the true support S.

The asymptotic normality shows that β̂_S has the same asymptotic variance as the ordinary least squares (OLS) estimator on the true support. However, in the case of a heteroscedastic model the OLS estimator is dominated by the generalized least squares estimator. Later in this section, we will demonstrate that β̂_w has a better asymptotic variance. Note that β̂ correctly selects the mean model and estimates the parameters at the correct rate. From the upper and lower bounds on λ_S, we see how the rate at which p can grow and the minimum coefficient size are related. The larger the ambient dimension p gets, the larger the size of λ_S, which lower bounds the size of the minimum coefficient.

Our next result establishes correct model selection for the variance parameter θ.

Theorem 2. Suppose that assumptions (1)-(5) are satisfied. Suppose further that λ_T ≥ n^{α_T − 1/2} log(p) log(n) and min_{j∈[T]} |θ_j| ≥ λ_T. Then there is a strict local minimizer θ̂ = (θ̂_T', 0'_{T^C})' with the strong oracle property,

    ||n^{(1−α_T)/2} (θ̂ − θ)|| = O_P(1).

Moreover, for any fixed a ∈ R^t with ||a||_2 = 1 the following weak convergence holds

    (√n / ζ) a'(θ̂_T − θ_T) →_D N(0, 1),    (11)

where ζ² = a' Σ_TT^{−1} a.

With the convergence result for θ̂ we can prove consistency and asymptotic normality of the weighted estimator β̂_w in (4).

Theorem 3. Suppose that assumptions (1)-(5) are satisfied and that there exists an estimator θ̂ satisfying ||θ̂ − θ||_2 = O(r_n), for a sequence r_n → 0, and supp(θ̂) = supp(θ). Furthermore, assume that λ_S ≥ c_1 √(log(p) exp(√(c_2 log n))/n), min_{j∈[S]} |β_j| ≫ λ_S ≫ c_3 r_n exp(√(c_2 log n)) log(n) and log(p) = O(n^{α_0}) for some α_0 ∈ (0, 1). Then there is a strict local minimizer β̂_w = (β̂_{w,S}', 0'_{S^C})' of (4) that satisfies

    ||β̂_{w,S} − β_S||_∞ ≤ c_3 r_n exp(√(c_2 log n)) log(n)    (12)

for some positive constants c_1, c_2, and c_3 and sufficiently large n.

Furthermore, for any fixed a ∈ R^s with ||a||_2 = 1 the following weak convergence holds

    (√n / ζ_w) a'(β̂_{w,S} − β_S) →_D N(0, 1),    (13)

where ζ_w² = a' (E D_SS)^{−1} a.

Theorem 3 establishes convergence of the weighted estimator β̂_w in (4) and its model selection consistency. The rate of convergence depends on the rate of convergence of the variance estimator, r_n. From Theorem 2, we obtain the rate of convergence for θ̂. The second result of Theorem 3 states that the weighted estimator β̂_{w,S} is asymptotically normal, with the same asymptotic variance as the generalized least squares estimator which knows the true model and variance function σ(x, θ).

Table 1. Mean (sd) performance of HHR and HIPPO under the model in Example 1 (averaged over 100 independent runs). The mean parameter β is assumed to be known.

              ||θ̂ − θ||_2   Pre_θ         Rec_θ
  ρ = 0
  HHR-AIC     0.59 (0.13)   0.40 (0.17)   1.00 (0.00)
  HIPPO-AIC   0.26 (0.15)   0.60 (0.22)   1.00 (0.00)
  HHR-BIC     0.59 (0.13)   0.39 (0.16)   1.00 (0.00)
  HIPPO-BIC   0.26 (0.15)   0.59 (0.22)   1.00 (0.00)
  ρ = 0.5
  HHR-AIC     0.32 (0.12)   0.68 (0.21)   1.00 (0.00)
  HIPPO-AIC   0.38 (0.22)   0.69 (0.25)   1.00 (0.03)
  HHR-BIC     0.32 (0.12)   0.68 (0.21)   1.00 (0.00)
  HIPPO-BIC   0.38 (0.22)   0.69 (0.25)   0.99 (0.03)

4. Monte-Carlo Simulations

In this section, we conduct two small scale simulation studies to demonstrate the finite sample performance of HIPPO. We compare it to the HHR procedure (Daye et al., 2011).

Convergence of the parameters is measured in the ℓ_2 norm, ||β̂ − β|| and ||θ̂ − θ||. We measure the identification of the supports of β and θ using precision and recall. Let Ŝ denote the estimated set of non-zero coefficients of β; then the precision is calculated as Pre_β := |Ŝ ∩ S|/|Ŝ| and the recall as Rec_β := |Ŝ ∩ S|/|S|. Similarly, we can define precision and recall for the variance coefficients. We report results averaged over 100 independent runs.
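These support-recovery metrics are straightforward to compute from an estimated coefficient vector, as in the short sketch below (the function name and the zero-threshold argument are ours).

```python
import numpy as np

def support_precision_recall(estimate, truth, tol=0.0):
    """Pre = |S_hat ∩ S| / |S_hat| and Rec = |S_hat ∩ S| / |S| for estimated supports."""
    S_hat = set(np.flatnonzero(np.abs(estimate) > tol))
    S = set(np.flatnonzero(np.abs(truth) > tol))
    hits = len(S_hat & S)
    precision = hits / len(S_hat) if S_hat else 1.0   # empty supports treated as perfect
    recall = hits / len(S) if S else 1.0
    return precision, recall

# e.g. Pre_beta, Rec_beta = support_precision_recall(beta_hat, beta)
```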
4.1. Example 1

Assume that the data is generated i.i.d. from the following model: Y = σ(X)ε, where ε follows a standard normal distribution and the logarithm of the variance is given by

    log σ²(X) = X_1 + X_2 + X_3.

Table 2. Mean (sd) performance of HHR and HIPPO under the model in Example 2 (averaged over 100 independent runs). We report estimated models after the first and second iteration.

              #it   ||β̂ − β||_2   Pre_β         Rec_β         ||θ̂ − θ||_2   Pre_θ         Rec_θ
  n = 200
  HHR-AIC     1st   0.78 (0.52)   0.44 (0.22)   1.00 (0.00)   2.10 (0.11)   0.25 (0.10)   0.54 (0.16)
              2nd   0.31 (0.13)   0.88 (0.15)   1.00 (0.00)   1.80 (0.16)   0.29 (0.07)   0.71 (0.14)
  HIPPO-AIC   1st   0.66 (0.84)   0.75 (0.29)   1.00 (0.02)   2.00 (0.16)   0.20 (0.10)   0.52 (0.16)
              2nd   0.08 (0.07)   0.84 (0.24)   1.00 (0.00)   1.50 (0.30)   0.30 (0.11)   0.75 (0.12)
  HHR-BIC     1st   0.77 (0.48)   0.58 (0.17)   1.00 (0.00)   2.10 (0.10)   0.41 (0.18)   0.45 (0.14)
              2nd   0.31 (0.13)   0.89 (0.13)   1.00 (0.00)   1.90 (0.16)   0.38 (0.15)   0.65 (0.17)
  HIPPO-BIC   1st   0.70 (0.83)   0.80 (0.25)   0.99 (0.03)   2.00 (0.14)   0.39 (0.18)   0.50 (0.17)
              2nd   0.08 (0.06)   0.97 (0.07)   1.00 (0.00)   1.60 (0.28)   0.44 (0.16)   0.72 (0.14)
  n = 400
  HHR-AIC     1st   0.59 (0.37)   0.58 (0.26)   1.00 (0.00)   1.90 (0.11)   0.36 (0.14)   0.72 (0.18)
              2nd   0.30 (0.24)   0.98 (0.06)   1.00 (0.00)   1.70 (0.16)   0.43 (0.13)   0.81 (0.16)
  HIPPO-AIC   1st   0.44 (0.54)   0.87 (0.22)   1.00 (0.00)   1.80 (0.18)   0.28 (0.10)   0.67 (0.15)
              2nd   0.06 (0.29)   0.97 (0.12)   1.00 (0.02)   1.00 (0.31)   0.56 (0.18)   0.93 (0.09)
  HHR-BIC     1st   0.59 (0.37)   0.66 (0.20)   1.00 (0.00)   1.90 (0.11)   0.46 (0.18)   0.66 (0.20)
              2nd   0.30 (0.23)   0.98 (0.06)   1.00 (0.00)   1.70 (0.17)   0.46 (0.13)   0.80 (0.17)
  HIPPO-BIC   1st   0.46 (0.58)   0.89 (0.19)   1.00 (0.01)   1.80 (0.18)   0.39 (0.17)   0.65 (0.17)
              2nd   0.06 (0.29)   0.99 (0.06)   1.00 (0.02)   1.00 (0.31)   0.63 (0.20)   0.92 (0.09)

The covariates associated with the variance are jointly normal with equal correlation ρ, and marginally N(0, 1). The remaining covariates, X_4, . . . , X_p, are i.i.d. random variables following the standard normal distribution and are independent from (X_1, X_2, X_3). We set (n, p) = (200, 2000) and use ρ = 0 and ρ = 0.5. The estimation procedures know that β = 0 and we only estimate the variance parameter θ. This example is provided to illustrate the performance of the penalized pseudolikelihood estimators in an idealized situation. When the mean parameter needs to be estimated as well, we expect the performance of the procedures only to get worse. Since the mean is known, both HHR and HIPPO only solve the optimization procedure in (3), HHR with the ℓ1-norm penalty and HIPPO with the SCAD penalty, without iterating between (4) and (3).

Table 1 summarizes the results. Under this toy model, we observe that HIPPO performs better than HHR when the correlation between the relevant predictors is ρ = 0. However, we do not observe a difference between the two procedures when ρ = 0.5. The difference between the AIC and BIC is already visible in this example when ρ = 0. The AIC tends to pick more complex models, while the BIC is more conservative and selects a model with fewer variables.

4.2. Example 2

The following non-trivial model is borrowed from Daye et al. (2011). The response variable Y satisfies

    Y = β_0 + Σ_{j∈[p]} X_j β_j + exp(θ_0 + Σ_{j∈[p]} X_j θ_j) ε

with p = 600, β_0 = 2, θ_0 = 1,

    β_{[12]} = (3, 3, 3, 1.5, 1.5, 1.5, 0, 0, 0, 2, 2, 2)',
    θ_{[15]} = (1, 1, 1, 0, 0, 0, 0.5, 0.5, 0.5, 0, 0, 0, 0.75, 0.75, 0.75)',

and the remainder of the coefficients are 0. The covariates are jointly normal with cov(X_i, X_j) = 0.5^{|i−j|} and the error ε follows the standard normal distribution. This is a more realistic model than the one described in the previous example. We set the number of samples to n = 200 and n = 400.

Table 2 summarizes the results of the simulation. We observe that HIPPO consistently outperforms HHR in all scenarios. Again, a general observation is that the AIC selects more complex models, although the difference is less pronounced when the sample size is n = 400. Furthermore, we note that the estimation error is significantly reduced after the first iteration, which demonstrates the finite sample benefits of estimating the variance.
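For concreteness, data from the model of Example 2 could be simulated along the following lines (a numpy sketch under our reading of the model; the Cholesky construction of the 0.5^{|i−j|} covariance and the seed are illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 600

beta = np.zeros(p); beta[:12] = [3, 3, 3, 1.5, 1.5, 1.5, 0, 0, 0, 2, 2, 2]
theta = np.zeros(p); theta[:15] = [1, 1, 1, 0, 0, 0, 0.5, 0.5, 0.5, 0, 0, 0, 0.75, 0.75, 0.75]
beta0, theta0 = 2.0, 1.0

idx = np.arange(p)
Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])        # cov(X_i, X_j) = 0.5^{|i-j|}
X = rng.standard_normal((n, p)) @ np.linalg.cholesky(Sigma).T

eps = rng.standard_normal(n)                              # standard normal errors
y = beta0 + X @ beta + np.exp(theta0 + X @ theta) * eps
```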

Recall that Theorem 1 proves that the estimate β̂ consistently estimates the true parameter β. However, it is important to estimate the variance parameter θ well, both in theory (see Theorem 3) and in practice.

5. Real Data Application

Forecasting the gross domestic product (GDP) of a country based on macroeconomic indicators is of significant interest to the economic community. We obtain both the country GDP figures (specifically, we use the GDP per capita using current prices in units of a 'national currency') and macroeconomic variables from the International Monetary Fund's World Economic Outlook (WEO) database. The WEO database contains records for macroeconomic variables from 1980 to 2016 (with forecasts).

To form our response variable, y_{i,t}, we form log-returns of the GDP for each country (i) for each time point (t) after records began and before the forecasting commenced (each country had a different year at which forecasting began). After removing missing values, we obtained 31 variables that can be grouped into a few broad categories: balance of payments, government finance and debt, inflation, and demographics. We apply various transformations, including lagging and logarithms, forming the vectors x_{i,t}. We fit the heteroscedastic AR(1) model

    y_{i,t} = x_{i,t−1}'β + exp(x_{i,t−1}'θ) ε_{i,t}

with HIPPO.

In order to initially assess the heteroscedasticity of the data, we form the LASSO estimator with the LARS package in R, selecting with BIC. It is common practice when diagnosing heteroscedasticity to plot the studentized residuals against the fitted values. We bin the bulk of the samples into three groups by fitted values, and observe the box-plot of each bin by residuals (Figure 2). It is apparent that there is a difference of variances between these bins, which is corroborated by performing an F-test of equal variances across the second and third bins (p-value of 4 × 10^{−6}). We further observe differences of variance between country GDP log-returns. We analyzed the distribution of responses separated by countries: Canada, Finland, Greece and Italy (Figure 1). The p-value from the F-test for equality of variances between the countries Canada and Greece is 0.008, which is below even the pairwise Bonferroni correction of 0.0083 at the 0.05 significance level. This demonstrates heteroscedasticity in the WEO dataset, and we are justified in fitting a non-constant variance.
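The binning-and-F-test diagnostic described above can be reproduced roughly as follows, assuming vectors fitted and resid from a preliminary lasso fit; the break points and the use of scipy's F distribution for the p-value are our illustrative assumptions, not the authors' exact pipeline.

```python
import numpy as np
from scipy.stats import f as f_dist

def variance_ratio_test(r1, r2):
    """F-test of equal variances between two groups of residuals (two-sided p-value)."""
    v1, v2 = np.var(r1, ddof=1), np.var(r2, ddof=1)
    F, d1, d2 = v1 / v2, len(r1) - 1, len(r2) - 1
    p_one_sided = f_dist.sf(F, d1, d2) if F > 1 else f_dist.cdf(F, d1, d2)
    return F, min(1.0, 2 * p_one_sided)

def bin_and_test(fitted, resid, breaks=(0.04, 0.06, 0.08, 0.10)):
    """Bin residuals by fitted value and compare the variances of the second and third bins."""
    bins = np.digitize(fitted, breaks)
    groups = [resid[bins == b] for b in np.unique(bins)]
    return variance_ratio_test(groups[1], groups[2])   # assumes at least three non-empty bins
```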
We compare the results from HIPPO and HHR when applied to the WEO data set. The tuning parameters were selected with BIC over a grid for λ_S and λ_T. The metrics used to compare the algorithms are the mean square error (MSE), defined by (1/n) Σ_i (y_{i,t} − ŷ_{i,t})², the partial prediction score, defined as the average of the negative log likelihoods, and the number of selected mean parameters and variance parameters. We perform 10-fold cross validation to obtain unbiased estimates of these metrics. In Table 3 we observe that HIPPO outperforms HHR in terms of MSE and partial prediction score.

Table 3. Performance of HIPPO and HHR on WEO data averaged over 10 folds.

                       HIPPO    HHR
  MSE                  0.0089   0.0091
  −ℓ(y, X; β̂, θ̂)       0.4953   0.6783
  |Ŝ|                  5.4      8.9
  |T̂|                  8.2      5.1

6. Discussion

We have addressed the problem of statistical inference in high-dimensional linear regression models with heteroscedastic errors. Heteroscedastic errors arise in many applications and industrial settings, including biostatistics, finance and quality control in manufacturing. We have proposed HIPPO for model selection and estimation of both the mean and variance parameters under a heteroscedastic model. HIPPO can be deployed naturally into an existing data analysis workflow. Specifically, as a first step, a statistician performs penalized estimation of the mean parameters and then, as a second step, tests for heteroscedasticity by running the second step of HIPPO. If heteroscedasticity is discovered, HIPPO can then be used to solve the penalized generalized least squares objective. Furthermore, HIPPO is well motivated from the penalized pseudolikelihood maximization perspective and achieves the oracle property in high-dimensional problems.

Throughout the paper, we focus on a specific parametric form of the variance function for simplicity of presentation. Our method can be extended to any parametric form; however, the assumptions will become more cumbersome and the particular numerical procedure would change. It is of interest to develop a general unified framework for estimation of an arbitrary parametric form of the variance function. Another open research direction is non-parametric estimation of the variance function in high-dimensions, which could be achieved with sparse additive models (see Ravikumar et al., 2009).

Figure 1. A box-plot of the GDP log-returns for the 4 countries with the most observed time points (Canada, Finland, Greece, and Italy).

Figure 2. A box-plot of the studentized residuals binned by LASSO-predicted y_{i,t}. Only the segment of the predicted response with the bulk of the samples was binned; the breaks in the bins are at 0.04, 0.06, 0.08, and 0.1.

References

Bach, F., Jenatton, R., Mairal, J., and Obozinski, G. Optimization with sparsity-inducing penalties. ArXiv e-prints, 2011.

Beck, A. and Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

Carroll, R. J. and Ruppert, D. Transformation and Weighting in Regression, volume 30. Chapman & Hall/CRC, 1988.

Daye, Z. J., Chen, J., and Li, H. High-dimensional heteroscedastic regression with an application to eQTL data analysis. Biometrics, 2011.

Dette, H. and Wagener, J. The adaptive lasso in high-dimensional sparse heteroscedastic models. 2011.

Fan, J. and Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96:1348–1360, December 2001.

Fan, J. and Lv, J. Non-concave penalized likelihood with NP-dimensionality. ArXiv e-prints, October 2009.

Fan, J., Guo, S., and Hao, N. Variance estimation using refitted cross-validation in ultrahigh dimensional regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(1):37–65, 2012.

Harvey, A. C. Estimating regression models with multiplicative heteroscedasticity. Econometrica, 44(3):461–466, 1976.

Jia, J., Rohe, K., and Yu, B. The lasso under heteroscedasticity. arXiv preprint arXiv:1011.1026, 2010.

Lv, J. and Fan, Y. A unified approach to model selection and sparse recovery using regularized least squares. Annals of Statistics, 37(6A):3498–3528, 2009.

Ravikumar, P., Lafferty, J., Liu, H., and Wasserman, L. Sparse additive models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(5):1009–1030, 2009.

Sun, T. and Zhang, C.-H. Scaled sparse linear regression. ArXiv e-prints, April 2011.

Wainwright, M. J. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (lasso). IEEE Transactions on Information Theory, 55(5):2183–2202, May 2009.

Zhang, C.-H. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942, 2010.

Zhang, C.-H. and Huang, J. The sparsity and bias of the lasso selection in high-dimensional linear regression. Annals of Statistics, 36(4):1567–1594, 2008.

Zhang, C.-H. and Zhang, T. A general theory of concave regularization for high dimensional sparse estimation problems. ArXiv e-prints, 2011.

Zhang, T. Some sharp performance bounds for least squares regression with L1 regularization. Annals of Statistics, 37(5A):2109–2144, 2009.

Zhao, P. and Yu, B. On model selection consistency of lasso. Journal of Machine Learning Research, 7:2541–2563, 2006.

Ziegler, A. Generalized Estimating Equations. Number 204 in Lecture Notes in Statistics. Springer, 2011.

Zou, H. and Li, R. One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36(4):1509–1533, 2008.