
Variance function estimation in high-dimensions

Mladen Kolar [email protected]
James Sharpnack [email protected]
Machine Learning Department, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213 USA

Abstract

We consider the high-dimensional heteroscedastic regression model, where the mean and the log variance are modeled as a linear combination of input variables. Existing literature on high-dimensional linear regression models has largely ignored non-constant error variances, even though they commonly occur in a variety of applications ranging from biostatistics to finance. In this paper we study a class of non-convex penalized pseudolikelihood estimators for both the mean and variance parameters. We show that the Heteroscedastic Iterative Penalized Pseudolikelihood Optimizer (HIPPO) achieves the oracle property, that is, we prove that the rates of convergence are the same as if the true model were known. We demonstrate numerical properties of the procedure on a simulation study and real world data.

MK is partially supported through the grants NIH R01GM087694 and AFOSR FA9550010247. JS is partially supported by AFOSR under grant FA95501010382.

1. Introduction

High-dimensional regression models have been studied extensively in both the machine learning and statistical literature. Estimation in high-dimensions, where the sample size n is smaller than the ambient dimension p, is impossible without assumptions. As the concept of parsimony is important in many scientific domains, most of the research in the area of high-dimensional statistical inference is done under the assumption that the underlying model is sparse, in the sense that the number of relevant parameters is much smaller than p, or that it can be well approximated by a sparse model.

Penalization of the empirical loss by the ℓ1 norm has become a popular tool for obtaining sparse models, and a huge amount of literature exists on theoretical properties of estimation procedures (see, e.g., Zhao & Yu, 2006; Wainwright, 2009; Zhang, 2009; Zhang & Huang, 2008, and references therein) and on efficient algorithms that numerically find estimates (see Bach et al., 2011, for an extensive literature review). Due to limitations of the ℓ1 norm penalization, high-dimensional inference methods based on the class of concave penalties have been proposed that have better theoretical and numerical properties (see, e.g., Fan & Li, 2001; Fan & Lv, 2009; Lv & Fan, 2009; Zhang & Zhang, 2011).

In all of the above cited work, the main focus is on mean parameter estimation. Only a few papers deal with estimation of the variance in high-dimensions (Sun & Zhang, 2011; Fan et al., 2012), although it is a fundamental problem in statistics. Variance appears in the confidence bounds on estimated regression coefficients and is important for variable selection as it appears in Akaike's information criterion (AIC) and the Bayesian information criterion (BIC). Furthermore, it provides confidence on the predictive performance of a forecaster.

In applied regression it is often the case that the error variance is non-constant. Although the assumption of a constant variance can sometimes be achieved by transforming the dependent variable, e.g., by using a Box-Cox transformation, in many cases transformation does not produce a constant error variance (Carroll & Ruppert, 1988). Another approach is to ignore the heterogeneous variance and use standard estimation techniques, but such estimators are less efficient. Aside from its use in reweighting schemes, estimating the variance is important because the resulting prediction intervals become more accurate and it is often important to explore which input variables drive the variance. In this paper, we will model the variance directly as a parametric function of the explanatory variables.
Heteroscedastic regression models are used in a variety of fields ranging from biostatistics to econometrics, finance and quality control in manufacturing. In this paper, we study penalized estimation in high-dimensional heteroscedastic models, where the mean and the log variance are modeled as a linear combination of explanatory variables. Modeling the log variance as a linear combination of the explanatory variables is a common choice as it guarantees positivity and is also capable of capturing variance that may vary over several orders of magnitude (Carroll & Ruppert, 1988; Harvey, 1976). The main contributions of this paper are as follows. First, we propose HIPPO (Heteroscedastic Iterative Penalized Pseudolikelihood Optimizer) for estimation of both the mean and variance parameters. Second, we establish the oracle property (in the sense of Fan & Lv (2009)) for the estimated mean and variance parameters. Finally, we demonstrate numerical properties of the proposed procedure on a simulation study, where it is shown that HIPPO outperforms other methods, and analyze a real data set.

1.1. Problem Setup and Notation

Consider the usual heteroscedastic linear model,

    y_i = x_i'β + σ(x_i, θ) ε_i,   i = 1, . . . , n,    (1)

where X = (x_1, . . . , x_n)' = (X_1, . . . , X_p) is an n × p matrix of predictors with i.i.d. rows x_1, . . . , x_n, y = (y_1, . . . , y_n)' is an n-vector of responses, the vectors β ∈ R^p and θ ∈ R^p are p-vectors of mean and variance parameters, respectively, and ε = (ε_1, . . . , ε_n)' is an n-vector of i.i.d. random noise with mean 0 and variance 1. We assume that the noise ε is independent of the predictors X. The function σ(x, θ) has a known parametric form and, for simplicity of presentation, we assume that it takes the particular form σ(x_i, θ) = exp(x_i'θ/2).

Throughout the paper we use [n] to denote the set {1, . . . , n}. For any index set S ⊆ [p], we denote by β_S the subvector containing the components of the vector β indexed by the set S, and X_S denotes the submatrix containing the columns of X indexed by S. For a vector a ∈ R^n, we denote by supp(a) = {j : a_j ≠ 0} the support set and by ||a||_q, q ∈ (0, ∞), the ℓ_q-norm defined as ||a||_q = (Σ_{i∈[n]} a_i^q)^{1/q}, with the usual extensions for q ∈ {0, ∞}, that is, ||a||_0 = |supp(a)| and ||a||_∞ = max_{i∈[n]} |a_i|. For notational simplicity, we denote by || · || = || · ||_2 the ℓ_2 norm. For a matrix A ∈ R^{n×p} we denote by |||A|||_2 the operator norm and by ||A||_F the Frobenius norm, and Λ_min(A) and Λ_max(A) denote the smallest and largest eigenvalues, respectively.

Under the model in (1), we are interested in estimating both β and θ. In high-dimensions, when p ≫ n, it is common to assume that the support of β is small, that is, S = supp(β) and |S| ≪ n. Similarly, we assume that the support T = supp(θ) is small.
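To make the data-generating process in (1) concrete, here is a minimal numpy sketch that draws one sample from the model with the exponential variance function; the dimensions, supports and coefficient values are illustrative choices, not settings used later in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 1000                                  # illustrative sizes

X = rng.standard_normal((n, p))                   # rows x_i of the design matrix
beta = np.zeros(p); beta[:3] = [2.0, -1.5, 1.0]   # sparse mean parameter, S = {1, 2, 3}
theta = np.zeros(p); theta[:2] = [1.0, 0.5]       # sparse log-variance parameter, T = {1, 2}

sigma = np.exp(X @ theta / 2.0)                   # sigma(x_i, theta) = exp(x_i' theta / 2)
eps = rng.standard_normal(n)                      # i.i.d. noise with mean 0 and variance 1
y = X @ beta + sigma * eps                        # responses from model (1)
```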
1.2. Related Work

Consider the model (1) with constant variance, i.e., σ(x, θ) ≡ σ_0. Most of the existing high-dimensional literature is focused on estimation of the mean parameter β in this homoscedastic regression model. Under a variety of assumptions and regularity conditions, any penalized estimation procedure mentioned in the introduction can, in theory, select the correct sparse model with probability tending to 1. Literature on variance estimation is not as developed. Fan et al. (2012) proposed a two step procedure for estimation of the unknown variance σ_0, while Sun & Zhang (2011) proposed an estimation procedure that jointly estimates the model and the variance.

The problem of estimation in heteroscedastic linear regression models has been studied extensively in the classical setting with p fixed; however, the problem of estimation under the model (1) when p ≫ n has not been adequately studied. Jia et al. (2010) assume that σ(x, θ) = |x'β| and show that the Lasso is sign consistent for the mean parameter β under certain conditions. Their study shows limitations of the lasso, for which many highly scalable solvers exist. However, no new methodology is developed, as the authors acknowledge that the log-likelihood is highly non-convex. Dette & Wagener (2011) study the adaptive lasso under the model in (1). Under certain regularity conditions, they show that the adaptive lasso is consistent, with suboptimal asymptotic variance. However, the weighted adaptive lasso is both consistent and achieves the optimal asymptotic variance, under the assumption that the variance function is consistently estimated. However, they do not discuss how to obtain an estimator of the variance function in a principled way and resort to an ad-hoc fitting of the residuals. Daye et al. (2011) develop the HHR procedure, which optimizes the penalized log-likelihood under (1) with the ℓ1-norm penalty on both the mean and variance parameters. As the objective is not convex, HHR estimates β with θ fixed and then estimates θ with β fixed, until convergence. Since the objective is biconvex, HHR converges to a stationary point. However, no theory is provided for the final estimates.

2. Methodology

In this paper, we propose HIPPO (Heteroscedastic Iterative Penalized Pseudolikelihood Optimizer) for estimating β and θ under model (1).

In the first step, HIPPO finds the penalized pseudolikelihood maximizer of β by solving the following objective

    β̂ = argmin_{β∈R^p} ||y − Xβ||² + 2n Σ_{j∈[p]} ρ_{λ_S}(|β_j|),    (2)

where ρ_{λ_S} is the penalty function and the tuning parameter λ_S controls the sparsity of the solution β̂.

In the second step, HIPPO forms the penalized pseudolikelihood estimate for θ by solving

    θ̂ = argmin_{θ∈R^p} Σ_{i∈[n]} x_i'θ + Σ_{i∈[n]} η̂_i² exp(−x_i'θ) + 4n Σ_{j∈[p]} ρ_{λ_T}(|θ_j|),    (3)

where η̂ = y − Xβ̂ is the vector of residuals.

Finally, HIPPO computes the reweighted estimator of the mean by solving

    β̂_w = argmin_{β∈R^p} Σ_{i∈[n]} (y_i − x_i'β)²/σ̂_i² + 2n Σ_{j∈[p]} ρ_{λ_S}(|β_j|),    (4)

where σ̂_i = exp(x_i'θ̂/2) are the weights.
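For readers who find code easier to parse than displayed objectives, the following sketch simply evaluates (2), (3) and (4) for a generic penalty ρ_λ (the concave penalty used by HIPPO is introduced below); the function names and the array-based interface are ours, not the authors'.

```python
import numpy as np

def mean_objective(beta, X, y, lam_S, rho):
    """Objective (2): ||y - X b||^2 + 2n * sum_j rho_{lam_S}(|b_j|)."""
    n = X.shape[0]
    return np.sum((y - X @ beta) ** 2) + 2 * n * np.sum(rho(np.abs(beta), lam_S))

def variance_objective(theta, X, resid, lam_T, rho):
    """Objective (3): sum_i [x_i' t + eta_i^2 exp(-x_i' t)] + 4n * sum_j rho_{lam_T}(|t_j|)."""
    n = X.shape[0]
    xt = X @ theta
    return np.sum(xt + resid ** 2 * np.exp(-xt)) + 4 * n * np.sum(rho(np.abs(theta), lam_T))

def reweighted_mean_objective(beta, X, y, theta_hat, lam_S, rho):
    """Objective (4): residuals weighted by sigma_i^2 = exp(x_i' theta_hat)."""
    n = X.shape[0]
    sigma2 = np.exp(X @ theta_hat)
    return np.sum((y - X @ beta) ** 2 / sigma2) + 2 * n * np.sum(rho(np.abs(beta), lam_S))
```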

In the classical literature, estimation under heteroscedastic models is achieved by employing a pseudolikelihood objective. The pseudolikelihood maximization principle prescribes the scientist to maximize a surrogate likelihood, i.e. one that is believed to be similar to the likelihood with the true unknown fixed variances (or means, alternatively). In classical theory, central limit theorems are derived for many pseudo-maximum likelihood (PML) estimators using generalized estimating equations (Ziegler, 2011). HIPPO fits neatly into the pseudolikelihood framework because the first step is a regularized PML where only the mean structure needs to be correctly specified. The second and third steps may be similarly cast as PML estimators. Indeed, all our theoretical results are due to the fact that in each step we are optimizing a pseudolikelihood that is similar to the true unknown likelihoods (with alternating free parameters). Moreover, it is known that if the surrogate variances in the mean PML are more similar to the true variances then the resulting estimates will be more asymptotically efficient. With this in mind, we recommend a third reweighting procedure with the variance estimates from the second step.

Fan & Li (2001) advocate the usage of penalty functions that result in estimates satisfying three properties: unbiasedness, sparsity and continuity. A reasonable estimator should correctly identify the support of the true parameter with probability converging to one. Furthermore, on this support, the estimated coefficients should have the same asymptotic distribution as if an estimator that knew the true support was used. Such an estimator satisfies the oracle property. A number of concave penalties result in estimates that satisfy this property: the SCAD penalty (Fan & Li, 2001), the MCP penalty (Zhang, 2010) and a class of folded concave penalties (Lv & Fan, 2009). For concreteness, we choose to use the SCAD penalty, which is defined by its derivative

    ρ'_λ(t) = λ { I{t ≤ λ} + (aλ − t)_+ / ((a − 1)λ) · I{t > λ} },    (5)

where often a = 3.7 is used. Note that estimates produced by the ℓ1-norm penalty are biased, and hence this penalty does not achieve the oracle property.

HIPPO is related to the iterative HHR algorithm of Daye et al. (2011). In particular, the first two iterations of HHR are equivalent to HIPPO with the SCAD penalty replaced by the ℓ1 norm penalty. In practice, one can continue iterating between solving (3) and (4); however, establishing theoretical properties for those iterates is a non-trivial task. From our numerical studies, we observe that HIPPO performs well when stopped after the first two iterations.

2.1. Tuning Parameter Selection

As described in the previous section, HIPPO requires the selection of the tuning parameters λ_S and λ_T, which balance the complexity of the estimated model and the fit to the data. A common approach is to form a grid of candidate values for the tuning parameters λ_S and λ_T and choose those that minimize the AIC or BIC criterion

    AIC(λ_S, λ_T) = Σ_{i∈[n]} ℓ(y_i, x_i; β̂, θ̂) + 2 d̂f,    (6)

    BIC(λ_S, λ_T) = Σ_{i∈[n]} ℓ(y_i, x_i; β̂, θ̂) + d̂f log n,    (7)

where, up to constants,

    ℓ(y, x; β, θ) = x'θ + (y − x'β)² exp(−x'θ)

is the negative log-likelihood and

    d̂f = |supp(β̂)| + |supp(θ̂)|

is the estimated degrees of freedom. In Section 4, we compare the performance of the AIC and the BIC for HIPPO in a simulation study.
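A possible implementation of this selection rule is sketched below: it loops over a grid of (λ_S, λ_T) values and keeps the pair minimizing the BIC in (7). The helper fit_hippo, which should return (β̂, θ̂) for given tuning parameters, is hypothetical; only the scoring logic follows (6)–(7).

```python
import numpy as np

def neg_loglik(y, X, beta, theta):
    """Per-observation negative log-likelihood l(y, x; beta, theta), up to constants."""
    xt = X @ theta
    return xt + (y - X @ beta) ** 2 * np.exp(-xt)

def select_by_bic(y, X, lam_S_grid, lam_T_grid, fit_hippo):
    """Return the (lam_S, lam_T) pair minimizing BIC in (7)."""
    best_bic, best_pair = np.inf, None
    for lam_S in lam_S_grid:
        for lam_T in lam_T_grid:
            beta, theta = fit_hippo(y, X, lam_S, lam_T)        # hypothetical fitting routine
            df = np.count_nonzero(beta) + np.count_nonzero(theta)
            bic = np.sum(neg_loglik(y, X, beta, theta)) + df * np.log(len(y))
            if bic < best_bic:
                best_bic, best_pair = bic, (lam_S, lam_T)
    return best_pair
```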

2.2. Optimization Procedure

In this section, we describe the numerical procedures used to solve the optimization problems in (2), (3) and (4). Our procedures are based on the local linear approximation for the SCAD penalty developed in (Zou & Li, 2008), which gives

    ρ_λ(|β_j|) ≈ ρ_λ(|β_j^{(k)}|) + ρ'_λ(|β_j^{(k)}|)(|β_j| − |β_j^{(k)}|),   for β_j ≈ β_j^{(k)}.

This approximation allows us to substitute the SCAD penalty Σ_{j∈[p]} ρ_λ(|β_j|) in (2), (3) and (4) with

    Σ_{j∈[p]} ρ'_λ(|β_j^{(k)}|) |β_j|,    (8)

and iteratively solve each objective until convergence of {β̂^{(k)}}_k. We set the initial estimates β̂^{(0)} and θ̂^{(0)} to be the solutions of the ℓ1-norm penalized problems. The convergence of these iterative approximations follows from the convergence of MM (minorize-maximize) algorithms (Zou & Li, 2008).

With the approximation of the SCAD penalty given in (8), we can solve (2) and (4) using standard lasso solvers; e.g., we use the proximal method of Beck & Teboulle (2009). The objective in (3) is minimized using a coordinate descent algorithm, which is detailed in Daye et al. (2011).
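The sketch below illustrates this scheme for the mean objective (2): after rescaling by 1/(2n), each local-linear-approximation step is a weighted ℓ1 problem with weights ρ'_λ(|β_j^{(k)}|), which we solve here with plain ISTA rather than the accelerated method of Beck & Teboulle (2009); the function names and the fixed iteration counts are illustrative.

```python
import numpy as np

def scad_derivative(t, lam, a=3.7):
    """SCAD derivative from (5), applied elementwise to t >= 0."""
    t = np.abs(t)
    return lam * ((t <= lam) + np.maximum(a * lam - t, 0.0) / ((a - 1) * lam) * (t > lam))

def weighted_lasso_ista(X, y, weights, n_iter=500):
    """Minimize (1/(2n))||y - X b||^2 + sum_j weights_j |b_j| by proximal gradient (ISTA)."""
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2          # 1 / Lipschitz constant of the gradient
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        z = beta - step * grad
        beta = np.sign(z) * np.maximum(np.abs(z) - step * weights, 0.0)
    return beta

def lla_scad_mean(X, y, lam_S, n_outer=3):
    """Local linear approximation for (2): start from the l1 solution, then reweight via (8)."""
    p = X.shape[1]
    beta = weighted_lasso_ista(X, y, np.full(p, lam_S))   # beta_hat^(0): l1-penalized solution
    for _ in range(n_outer):
        w = scad_derivative(beta, lam_S)                  # weights rho'_lam(|beta_j^(k)|)
        beta = weighted_lasso_ista(X, y, w)
    return beta
```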

3. Theoretical Properties of HIPPO

In this section, we present theoretical properties of HIPPO. In particular, we show that HIPPO achieves the oracle property for estimating the mean and variance under the model (1). All the proofs are deferred to the Appendix.

We will analyze HIPPO under the following assumptions, which are standard in the literature on high-dimensional statistical learning (see, e.g., Fan et al., 2012).

Assumption 1. The matrix X = (x_1, . . . , x_n)' ∈ R^{n×p} has independent rows that satisfy x_i = Σ^{1/2} z_i, where {z_i}_i are i.i.d. subgaussian random vectors with E z_i = 0, E z_i z_i' = I and parameter K (see the Appendix for more details on subgaussian random variables). Furthermore, there exist two constants C_min, C_max > 0 such that

    0 < C_min ≤ Λ_min(Σ) ≤ Λ_max(Σ) ≤ C_max < ∞.

Assumption 2. The errors ε_1, . . . , ε_n are i.i.d. subgaussian with zero mean and parameter 1.

Assumption 3. There are two constants β̄ and θ̄ such that ||β|| ≤ β̄ < ∞ and ||θ|| ≤ θ̄ < ∞.

Assumption 4. |S| = C_S n^{α_S} and |T| = C_T n^{α_T} for some α_S ∈ (0, 1), α_T ∈ (0, 1/3) and constants C_S, C_T > 0.

The following assumption will be needed for showing the consistency of the weighted estimator β̂_w in (4).

Assumption 5. Define

    D_SS = n^{−1} X_S' diag(exp(−Xθ)) X_S.

There exist constants 0 < D_min, D_max < ∞ such that

    lim_{n→∞} P[Λ_max(D_SS) ≤ D_max] = 1,  and  lim_{n→∞} P[Λ_min(D_SS) ≥ D_min] = 1.

Furthermore, |||D_SS − E D_SS|||_2 = o_P(1) as n → ∞.

With these assumptions, we state our first result, regarding the estimator β̂ in (2).

Theorem 1. Suppose that assumptions (1)-(4) are satisfied. Furthermore, assume that λ_S ≥ c_1 √(log(p) exp(√(c_2 log n))/n), min_{j∈S} |β_j| ≫ λ_S ≫ c_3 √(log(s) exp(√(c_2 log n))/n) and log(p) = O(n^{α_0}) for some α_0 ∈ (0, 1). Then there is a strict local minimizer β̂ = (β̂_S', 0'_{S^C})' of (2) that satisfies

    ||β̂_S − β_S||_∞ ≤ c_3 √( exp(√(c_2 log n)) log(s) / n )    (9)

for some positive constants c_1, c_2, and c_3 and sufficiently large n.

In addition, if we suppose that assumption (5) is satisfied, then for any fixed a ∈ R^s with ||a||_2 = 1 the following weak convergence holds

    (√n / ζ) a'(β̂_S − β_S) →_D N(0, 1),    (10)

where ζ² = a' Σ_SS^{−1} (E D_SS) Σ_SS^{−1} a.

The first result stated in Theorem 1 establishes that β̂ achieves the weak oracle property in the sense of (Lv & Fan, 2009). The extra term exp(√(log n)) is subpolynomial in n and appears in the bound (9) due to the heteroscedastic nature of the errors. The second result establishes the strong oracle property of the estimator β̂ in the sense of (Fan & Lv, 2009); that is, we establish the asymptotic normality on the true support S.

The asymptotic normality shows that β̂_S has the same asymptotic variance as the ordinary least squares (OLS) estimator on the true support. However, in the case of a heteroscedastic model the OLS estimator is dominated by the generalized least squares estimator. Later in this section, we will demonstrate that β̂_w has a better asymptotic variance. Note that β̂ correctly selects the mean model and estimates the parameters at the correct rate. From the upper and lower bounds on λ_S, we see how the rate at which p can grow and the minimum coefficient size are related. The larger the ambient dimension p gets, the larger the size of λ_S, which lower bounds the size of the minimum coefficient.

Our next result establishes correct model selection for the variance parameter θ.

Theorem 2. Suppose that assumptions (1)-(5) are satisfied. Suppose further that λ_T ≥ n^{α_T − 1/2} log(p) log(n) and min_{j∈[T]} |θ_j| ≥ λ_T. Then there is a strict local minimizer θ̂ = (θ̂_T', 0'_{T^C})' with the strong oracle property,

    ||n^{(1−α_T)/2} (θ̂ − θ)|| = O_P(1).

Moreover, for any fixed a ∈ R^t with ||a||_2 = 1 the following weak convergence holds

    (√n / ζ) a'(θ̂_T − θ_T) →_D N(0, 1),    (11)

where ζ² = a' Σ_TT^{−1} a.

With the convergence result for θ̂ we can prove consistency and asymptotic normality of the weighted estimator β̂_w in (4).

Theorem 3. Suppose that assumptions (1)-(5) are satisfied and that there exists an estimator θ̂ satisfying ||θ̂ − θ||_2 = O(r_n), for a sequence r_n → 0, and supp(θ̂) = supp(θ). Furthermore, assume that λ_S ≥ c_1 √(log(p) exp(√(c_2 log n))/n), min_{j∈[S]} |β_j| ≫ λ_S ≫ c_3 r_n exp(√(c_2 log n)) log(n) and log(p) = O(n^{α_0}) for some α_0 ∈ (0, 1). Then there is a strict local minimizer β̂_w = (β̂_{w,S}', 0'_{S^C})' of (4) that satisfies

    ||β̂_{w,S} − β_S||_∞ ≤ c_3 r_n exp(√(c_2 log n)) log(n)    (12)

for some positive constants c_1, c_2, and c_3 and sufficiently large n.

Furthermore, for any fixed a ∈ R^s with ||a||_2 = 1 the following weak convergence holds

    (√n / ζ_w) a'(β̂_{w,S} − β_S) →_D N(0, 1),    (13)

where ζ_w² = a' (E D_SS)^{−1} a.

Theorem 3 establishes convergence of the weighted estimator β̂_w in (4) and its model selection consistency. The rate of convergence depends on the rate of convergence of the variance estimator, r_n. From Theorem 2, we obtain the rate of convergence for θ̂. The second result of Theorem 3 states that the weighted estimator β̂_{w,S} is asymptotically normal, with the same asymptotic variance as the generalized least squares estimator which knows the true model and variance function σ(x, θ).

Table 1. Mean (sd) performance of HHR and HIPPO under the model in Example 1 (averaged over 100 independent runs). The mean parameter β is assumed to be known.

              ||θ̂ − θ||_2   Pre_θ         Rec_θ
  ρ = 0
  HHR-AIC     0.59 (0.13)   0.40 (0.17)   1.00 (0.00)
  HIPPO-AIC   0.26 (0.15)   0.60 (0.22)   1.00 (0.00)
  HHR-BIC     0.59 (0.13)   0.39 (0.16)   1.00 (0.00)
  HIPPO-BIC   0.26 (0.15)   0.59 (0.22)   1.00 (0.00)
  ρ = 0.5
  HHR-AIC     0.32 (0.12)   0.68 (0.21)   1.00 (0.00)
  HIPPO-AIC   0.38 (0.22)   0.69 (0.25)   1.00 (0.03)
  HHR-BIC     0.32 (0.12)   0.68 (0.21)   1.00 (0.00)
  HIPPO-BIC   0.38 (0.22)   0.69 (0.25)   0.99 (0.03)

4. Monte-Carlo Simulations

In this section, we conduct two small scale simulation studies to demonstrate the finite sample performance of HIPPO. We compare it to the HHR procedure (Daye et al., 2011).

Convergence of the parameters is measured in the ℓ_2 norm, ||β̂ − β|| and ||θ̂ − θ||. We measure the identification of the supports of β and θ using precision and recall. Let Ŝ denote the estimated set of non-zero coefficients of β; then the precision is calculated as Pre_β := |Ŝ ∩ S|/|Ŝ| and the recall as Rec_β := |Ŝ ∩ S|/|S|. Similarly, we can define precision and recall for the variance coefficients. We report results averaged over 100 independent runs.
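These support-recovery metrics are straightforward to compute from an estimated coefficient vector, as in the short sketch below (the function name and the zero-threshold argument are ours).

```python
import numpy as np

def support_precision_recall(estimate, truth, tol=0.0):
    """Pre = |S_hat ∩ S| / |S_hat| and Rec = |S_hat ∩ S| / |S| for estimated supports."""
    S_hat = set(np.flatnonzero(np.abs(estimate) > tol))
    S = set(np.flatnonzero(np.abs(truth) > tol))
    hits = len(S_hat & S)
    precision = hits / len(S_hat) if S_hat else 1.0   # empty supports treated as perfect
    recall = hits / len(S) if S else 1.0
    return precision, recall

# e.g. Pre_beta, Rec_beta = support_precision_recall(beta_hat, beta)
```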
4.1. Example 1

Assume that the data is generated i.i.d. from the following model: Y = σ(X)ε, where ε follows a standard normal distribution and the logarithm of the variance is given by

    log σ²(X) = X_1 + X_2 + X_3.

Table 2. Mean (sd) performance of HHR and HIPPO under the model in Example 2 (averaged over 100 independent runs). We report estimated models after the first and second iteration.

              #it   ||β̂ − β||_2   Pre_β         Rec_β         ||θ̂ − θ||_2   Pre_θ         Rec_θ
  n = 200
  HHR-AIC     1st   0.78 (0.52)   0.44 (0.22)   1.00 (0.00)   2.10 (0.11)   0.25 (0.10)   0.54 (0.16)
              2nd   0.31 (0.13)   0.88 (0.15)   1.00 (0.00)   1.80 (0.16)   0.29 (0.07)   0.71 (0.14)
  HIPPO-AIC   1st   0.66 (0.84)   0.75 (0.29)   1.00 (0.02)   2.00 (0.16)   0.20 (0.10)   0.52 (0.16)
              2nd   0.08 (0.07)   0.84 (0.24)   1.00 (0.00)   1.50 (0.30)   0.30 (0.11)   0.75 (0.12)
  HHR-BIC     1st   0.77 (0.48)   0.58 (0.17)   1.00 (0.00)   2.10 (0.10)   0.41 (0.18)   0.45 (0.14)
              2nd   0.31 (0.13)   0.89 (0.13)   1.00 (0.00)   1.90 (0.16)   0.38 (0.15)   0.65 (0.17)
  HIPPO-BIC   1st   0.70 (0.83)   0.80 (0.25)   0.99 (0.03)   2.00 (0.14)   0.39 (0.18)   0.50 (0.17)
              2nd   0.08 (0.06)   0.97 (0.07)   1.00 (0.00)   1.60 (0.28)   0.44 (0.16)   0.72 (0.14)
  n = 400
  HHR-AIC     1st   0.59 (0.37)   0.58 (0.26)   1.00 (0.00)   1.90 (0.11)   0.36 (0.14)   0.72 (0.18)
              2nd   0.30 (0.24)   0.98 (0.06)   1.00 (0.00)   1.70 (0.16)   0.43 (0.13)   0.81 (0.16)
  HIPPO-AIC   1st   0.44 (0.54)   0.87 (0.22)   1.00 (0.00)   1.80 (0.18)   0.28 (0.10)   0.67 (0.15)
              2nd   0.06 (0.29)   0.97 (0.12)   1.00 (0.02)   1.00 (0.31)   0.56 (0.18)   0.93 (0.09)
  HHR-BIC     1st   0.59 (0.37)   0.66 (0.20)   1.00 (0.00)   1.90 (0.11)   0.46 (0.18)   0.66 (0.20)
              2nd   0.30 (0.23)   0.98 (0.06)   1.00 (0.00)   1.70 (0.17)   0.46 (0.13)   0.80 (0.17)
  HIPPO-BIC   1st   0.46 (0.58)   0.89 (0.19)   1.00 (0.01)   1.80 (0.18)   0.39 (0.17)   0.65 (0.17)
              2nd   0.06 (0.29)   0.99 (0.06)   1.00 (0.02)   1.00 (0.31)   0.63 (0.20)   0.92 (0.09)

The covariates associated with the variance are jointly normal with equal correlation ρ, and marginally N(0, 1). The remaining covariates, X_4, . . . , X_p, are i.i.d. random variables following the standard normal distribution and are independent from (X_1, X_2, X_3). We set (n, p) = (200, 2000) and use ρ = 0 and ρ = 0.5. The estimation procedures know that β = 0 and we only estimate the variance parameter θ. This example is provided to illustrate the performance of the penalized pseudolikelihood estimators in an idealized situation. When the mean parameter needs to be estimated as well, we expect the performance of the procedures only to get worse. Since the mean is known, both HHR and HIPPO only solve the optimization procedure in (3), HHR with the ℓ1-norm penalty and HIPPO with the SCAD penalty, without iterating between (4) and (3).

Table 1 summarizes the results. Under this toy model, we observe that HIPPO performs better than HHR when the correlation between the relevant predictors is ρ = 0. However, we do not observe a difference between the two procedures when ρ = 0.5. The difference between the AIC and BIC is already visible in this example when ρ = 0. The AIC tends to pick more complex models, while the BIC is more conservative and selects a model with fewer variables.

4.2. Example 2

The following non-trivial model is borrowed from Daye et al. (2011). The response variable Y satisfies

    Y = β_0 + Σ_{j∈[p]} X_j β_j + exp(θ_0 + Σ_{j∈[p]} X_j θ_j) ε

with p = 600, β_0 = 2, θ_0 = 1,

    β_{[12]} = (3, 3, 3, 1.5, 1.5, 1.5, 0, 0, 0, 2, 2, 2)',
    θ_{[15]} = (1, 1, 1, 0, 0, 0, 0.5, 0.5, 0.5, 0, 0, 0, 0.75, 0.75, 0.75)',

and the remainder of the coefficients are 0. The covariates are jointly normal with cov(X_i, X_j) = 0.5^{|i−j|} and the error ε follows the standard normal distribution. This is a more realistic model than the one described in the previous example. We set the number of samples to n = 200 and n = 400.

Table 2 summarizes the results of the simulation. We observe that HIPPO consistently outperforms HHR in all scenarios. Again, a general observation is that the AIC selects more complex models, although the difference is less pronounced when the sample size is n = 400. Furthermore, we note that the estimation error is significantly reduced after the first iteration, which demonstrates the finite sample benefits of estimating the variance.
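For concreteness, data from the model of Example 2 could be simulated along the following lines (a numpy sketch under our reading of the model; the Cholesky construction of the 0.5^{|i−j|} covariance and the seed are illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 600

beta = np.zeros(p); beta[:12] = [3, 3, 3, 1.5, 1.5, 1.5, 0, 0, 0, 2, 2, 2]
theta = np.zeros(p); theta[:15] = [1, 1, 1, 0, 0, 0, 0.5, 0.5, 0.5, 0, 0, 0, 0.75, 0.75, 0.75]
beta0, theta0 = 2.0, 1.0

idx = np.arange(p)
Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])        # cov(X_i, X_j) = 0.5^{|i-j|}
X = rng.standard_normal((n, p)) @ np.linalg.cholesky(Sigma).T

eps = rng.standard_normal(n)                              # standard normal errors
y = beta0 + X @ beta + np.exp(theta0 + X @ theta) * eps
```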

Recall that Theorem 1 proves that the estimate β̂ consistently estimates the true parameter β. However, it is important to estimate the variance parameter θ well, both in theory (see Theorem 3) and in practice.

5. Real Data Application

Forecasting the gross domestic product (GDP) of a country based on macroeconomic indicators is of significant interest to the economic community. We obtain both the country GDP figures (specifically, we use the GDP per capita using current prices in units of a 'national currency') and macroeconomic variables from the International Monetary Fund's World Economic Outlook (WEO) database. The WEO database contains records for macroeconomic variables from 1980 to 2016 (with forecasts).

To form our response variable, y_{i,t}, we form log-returns of the GDP for each country (i) for each time point (t) after records began and before the forecasting commenced (each country had a different year at which forecasting began). After removing missing values, we obtained 31 variables that can be grouped into a few broad categories: balance of payments, government finance and debt, inflation, and demographics. We apply various transformations, including lagging and logarithms, forming the vectors x_{i,t}. We fit the heteroscedastic AR(1) model

    y_{i,t} = x_{i,t−1}'β + exp(x_{i,t−1}'θ) ε_{i,t}

with HIPPO.

In order to initially assess the heteroscedasticity of the data, we form the LASSO estimator with the LARS package in R, selecting with BIC. It is common practice when diagnosing heteroscedasticity to plot the studentized residuals against the fitted values. We bin the bulk of the samples into three groups by fitted values, and observe the box-plot of each bin by residuals (Figure 2). It is apparent that there is a difference of variances between these bins, which is corroborated by performing an F-test of equal variances across the second and third bins (p-value of 4 × 10^{−6}). We further observe differences of variance between country GDP log-returns. We analyzed the distribution of responses separated by countries: Canada, Finland, Greece and Italy (Figure 1). The p-value from the F-test for equality of variances between the countries Canada and Greece is 0.008, which is below even the pairwise Bonferroni correction of 0.0083 at the 0.05 significance level. This demonstrates heteroscedasticity in the WEO dataset, and we are justified in fitting a non-constant variance.
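The binning-and-F-test diagnostic described above can be reproduced roughly as follows, assuming vectors fitted and resid from a preliminary lasso fit; the break points and the use of scipy's F distribution for the p-value are our illustrative assumptions, not the authors' exact pipeline.

```python
import numpy as np
from scipy.stats import f as f_dist

def variance_ratio_test(r1, r2):
    """F-test of equal variances between two groups of residuals (two-sided p-value)."""
    v1, v2 = np.var(r1, ddof=1), np.var(r2, ddof=1)
    F, d1, d2 = v1 / v2, len(r1) - 1, len(r2) - 1
    p_one_sided = f_dist.sf(F, d1, d2) if F > 1 else f_dist.cdf(F, d1, d2)
    return F, min(1.0, 2 * p_one_sided)

def bin_and_test(fitted, resid, breaks=(0.04, 0.06, 0.08, 0.10)):
    """Bin residuals by fitted value and compare the variances of the second and third bins."""
    bins = np.digitize(fitted, breaks)
    groups = [resid[bins == b] for b in np.unique(bins)]
    return variance_ratio_test(groups[1], groups[2])   # assumes at least three non-empty bins
```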
We compare the results from HIPPO and HHR when applied to the WEO data set. The tuning parameters were selected with BIC over a grid for λ_S and λ_T. The metrics used to compare the algorithms are the mean square error (MSE), defined by (1/n) Σ_i (y_{i,t} − ŷ_{i,t})², the partial prediction score, defined as the average of the negative log likelihoods, and the number of selected mean parameters and variance parameters. We perform 10-fold cross validation to obtain unbiased estimates of these metrics. In Table 3 we observe that HIPPO outperforms HHR in terms of MSE and partial prediction score.

Table 3. Performance of HIPPO and HHR on WEO data averaged over 10 folds.

                       HIPPO    HHR
  MSE                  0.0089   0.0091
  −ℓ(y, X; β̂, θ̂)       0.4953   0.6783
  |Ŝ|                  5.4      8.9
  |T̂|                  8.2      5.1

6. Discussion

We have addressed the problem of statistical inference in high-dimensional linear regression models with heteroscedastic errors. Heteroscedastic errors arise in many applications and industrial settings, including biostatistics, finance and quality control in manufacturing. We have proposed HIPPO for model selection and estimation of both the mean and variance parameters under a heteroscedastic model. HIPPO can be deployed naturally into an existing data analysis workflow. Specifically, as a first step, a statistician performs penalized estimation of the mean parameters and then, as a second step, tests for heteroscedasticity by running the second step of HIPPO. If heteroscedasticity is discovered, HIPPO can then be used to solve the penalized generalized least squares objective. Furthermore, HIPPO is well motivated from the penalized pseudolikelihood maximization perspective and achieves the oracle property in high-dimensional problems.

Throughout the paper, we focus on a specific parametric form of the variance function for simplicity of presentation. Our method can be extended to any parametric form; however, the assumptions will become more cumbersome and the particular numerical procedure would change. It is of interest to develop a general unified framework for estimation of an arbitrary parametric form of the variance function. Another open research direction is non-parametric estimation of the variance function in high-dimensions, which could be achieved with sparse additive models (see Ravikumar et al., 2009).

Figure 1. A box-plot of the GDP log-returns for the 4 countries with the most observed time points (Canada, Finland, Greece, and Italy).

Figure 2. A box-plot of the studentized residuals binned by LASSO-predicted y_{i,t}. Only the segment of the predicted response with the bulk of the samples was binned; the breaks in the bins are at 0.04, 0.06, 0.08, and 0.1.

References

Bach, F., Jenatton, R., Mairal, J., and Obozinski, G. Optimization with sparsity-inducing penalties. ArXiv e-prints, 2011.

Beck, A. and Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

Carroll, R. J. and Ruppert, D. Transformation and Weighting in Regression, volume 30. Chapman & Hall/CRC, 1988.

Daye, Z. J., Chen, J., and Li, H. High-dimensional heteroscedastic regression with an application to eQTL data analysis. Biometrics, 2011.

Dette, H. and Wagener, J. The adaptive lasso in high-dimensional sparse heteroscedastic models. 2011.

Fan, J. and Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96:1348–1360, December 2001.

Fan, J. and Lv, J. Non-concave penalized likelihood with NP-dimensionality. ArXiv e-prints, October 2009.

Fan, J., Guo, S., and Hao, N. Variance estimation using refitted cross-validation in ultrahigh dimensional regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(1):37–65, 2012.

Harvey, A. C. Estimating regression models with multiplicative heteroscedasticity. Econometrica, 44(3):461–466, 1976.

Jia, J., Rohe, K., and Yu, B. The lasso under heteroscedasticity. arXiv preprint arXiv:1011.1026, 2010.

Lv, J. and Fan, Y. A unified approach to model selection and sparse recovery using regularized least squares. Annals of Statistics, 37(6A):3498–3528, 2009.

Ravikumar, P., Lafferty, J., Liu, H., and Wasserman, L. Sparse additive models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(5):1009–1030, 2009.

Sun, T. and Zhang, C.-H. Scaled sparse linear regression. ArXiv e-prints, April 2011.

Wainwright, M. J. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (lasso). IEEE Transactions on Information Theory, 55(5):2183–2202, May 2009.

Zhang, C.-H. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942, 2010.

Zhang, C.-H. and Huang, J. The sparsity and bias of the lasso selection in high-dimensional linear regression. Annals of Statistics, 36(4):1567–1594, 2008.

Zhang, C.-H. and Zhang, T. A general theory of concave regularization for high dimensional sparse estimation problems. ArXiv e-prints, 2011.

Zhang, T. Some sharp performance bounds for least squares regression with L1 regularization. Annals of Statistics, 37(5A):2109–2144, 2009.

Zhao, P. and Yu, B. On model selection consistency of lasso. Journal of Machine Learning Research, 7:2541–2563, 2006.

Ziegler, A. Generalized Estimating Equations. Number 204 in Lecture Notes in Statistics. Springer, 2011.

Zou, H. and Li, R. One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36(4):1509–1533, 2008.