<<

JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION , VOL. , NO. , –, Theory and Methods https://doi.org/./..

Inference in Models with Many Covariates and

Matias D. Cattaneoa, Michael Janssonb,c, and Whitney K. Neweyd aDepartment of Economics and Department of , University of Michigan, Ann Arbor, MI; bDepartment of Economics, University of California Berkeley, CA; cCREATES, Aarhus University, Denmark; dDepartment of Economics, MIT, Cambridge, MA

ABSTRACT ARTICLE HISTORY The linear regression model is widely used in empirical work in economics, statistics, and many other disci- Received September  plines. Researchers often include many covariates in their specification in an attempt to con- Revised April  trol for confounders. We give inference methods that allow for many covariates and heteroscedasticity. Our KEYWORDS results are obtained using high-dimensional approximations, where the number of included covariates is Heteroscedasticity; allowed to grow as fast as the size. We find that all of the usual versions of Eicker–White heteroscedas- High-dimensional models; ticity consistent for linear models are inconsistent under this asymptotics. We then Linear regression; Many propose a new heteroscedasticity consistent standard error formula that is fully automatic and robust to regressors; Standard errors both (conditional) heteroscedasticity of unknown form and the inclusion of possibly many covariates. We apply our findings to three settings: parametric linear models with many covariates, linear panel models with many fixed effects, and semiparametric semi-linear models with many technical regressors. Simulation evidence consistent with our theoretical results is provided, and the proposed methods are also illustrated with an empirical application. Supplementary materials for this article are available online.

1. Introduction example, Angrist and Hahn (2004) discussed when to include many covariates in treatment effect models. A key goal in empirical work is to estimate the structural, We present valid inference methods that explicitly account causal,ortreatmenteffectofsomevariableonanoutcomeof for the presence of possibly many controls in linear regression interest, such as the impact of a labor market policy on out- models under (conditional) heteroscedasticity. We consider the comes like earnings or employment. Since many variables mea- setting where the object of interest is β in a model of the form suring policies or interventions are not exogenous, researchers = βx + γ w + , = ,..., , often employ observational methods to estimate their effects. yi,n i,n n i,n ui,n i 1 n (1) One important method is based on assuming that the vari- x able of interest can be taken as exogenous after controlling where yi,n is a scalar outcome variable, i,n is a regressor of small , w for a sufficiently large set of other factors or covariates. A (i.e., fixed) dimension d i,n isavectorofcovariatesofpossi- , major problem that empirical researchers face when employing bly “large” dimension Kn and ui,n is an unobserved error term. selection-on-observables methods to estimate structural effects Two important cases, discussed in more detail below, are “flexi- is the availability of many potential covariates. This problem ble” parametric modeling of controls via basis expansions such has become even more pronounced in recent years because of as higher-order powers and interactions (i.e., a series-based for- the widespread availability of large (or high-dimensional) new mulation of the partially linear regression model), and models datasets. with many dummy variables such as multi-way fixed effects and Notonlyisitoftenthecasethatsubstantivediscipline- interactions thereof in panel models. In both cases conduct- specific theory (or intuition) will suggest a large set of variables ing OLS-based inference on β in (1) is straightforward when the that might be important, but also researchers usually prefer error ui,n is homoscedastic and/or the dimension Kn of the nui- to include additional “technical” controls constructed using sance covariates is modeled as a vanishing fraction of the sample indicator variables, interactions, and other nonlinear transfor- size. The latter modeling assumption, however, is inappropriate mations of those and other variables. Therefore, many empirical in applications with many dummy variables and does not deliver studies include very many covariates to control for as broad an a good distributional approximation when many covariates are array of confounders as possible. For example, it is common included. practice to include dummy variables for many potentially Motivated by the above observations, this article studies the overlapping groups based on age, cohort, geographic location, consequences of allowing the error ui,n in (1)tobe(condition- w etc. Even when some controls are dropped after valid covariate ally) heteroscedastic in a setting, where the covariate i,n is per- selection (Belloni, Chernozhukov, and Hansen 2014), many mitted to be high-dimensional in the sense that Kn is allowed, controls usually may remain in the final model specification. For butnotrequired,tobeanonvanishingfractionofthesample

CONTACT Matias D. Cattaneo [email protected] Department of Economics and Department of Statistics, University of Michigan,  Tappan St.,  Lorch Hall, Ann Arbor, MI -. Supplementary materials for this article are available online. Please go to www.tandfonline.com/r/JASA. ©  American Statistical Association JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION 1351 size. Our main purpose is to investigate the possibility of con- is assumed, then certain estimated coefficients and contrasts structing heteroscedasticity consistent estimators for in linear models are asymptotically normal when the number the OLS of β in (1) without (necessarily) assuming any of regressors grow as fast as sample size, but do not discuss specialstructureonthepartofthecovariatewi,n. Wepresent two inference results (even under ). Our result main results. First, we provide high-level sufficient conditions in Theorem 1 below shows that certain contrasts of OLS esti- guaranteeing a valid Gaussian distributional approximation to mators in high-dimensional linear models are asymptotically the finite sample distribution of the OLS estimator of β,allowing normal under fairly general regularity conditions. Intuitively, for the dimension of the nuisance covariates to be “large” rela- we circumvent the problems associated with the lack of asymp- tivetothesamplesize(i.e.,Kn/n → 0). Second, we characterize totic Gaussianity in general high-dimensional linear models the large-sample properties of a class of variance estimators, and by focusing exclusively on a small subset of regressors when use this characterization to obtain both negative and positive the number of covariates gets large. We give inference results results. The negative finding is that the Eicker–White estimator by constructing heteroscedasticity consistent standard errors is inconsistent in general, as are popular variants of this estima- without imposing any distributional assumption or other very tor. The positive result gives conditions under which an alterna- specific restrictions on the regressors. In particular, we do not tive heteroscedasticity robust variance estimator (described in require the coefficients γn to be consistently estimated; in fact, more detail below) is consistent. The main condition needed theywillnotbeinmostofourexamplesdiscussedbelow. for our constructive results is a high-level assumption on the Our high-level conditions allow for Kn ∝ n and restrict the nuisance covariates requiring in particular that their number data-generating process in fairly general and intuitive ways. be strictly less than half of the sample size. As a by-product, In particular, our generic sufficient condition on the nuisance we also find that among the popular HCk class of standard covariates wi,n covers several special cases of interest for empir- errors estimators for linear models, a variant of the HC3 esti- ical work. For example, our results encompass (and weakens mator delivers standard errors that are asymptotically upward in a certain sense) those reported in Stock and Watson (2008), biased in general. Thus, standard OLS inference employing HC3 who investigated the one-way fixed effects panel data regres- standard errors will be asymptotically valid, albeit conserva- sion model and showed that the conventional Eicker–White tive, even in high-dimensional settings where the number of heteroscedasticity-robust variance estimator is inconsistent, covariate wi,n islargerelativetothesamplesize,thatis,when being plagued by a nonnegligible bias problem attributable to Kn/n → 0. the presence of many covariates (i.e., the fixed effects). The very Our results contribute to the already sizeable literature on special structure of the covariates in the one-way fixed effects heteroscedasticity robust variance estimators for linear regres- estimator enables an explicit characterization of this bias, and sion models, a recent review of which is given by MacKinnon also leads to a direct plug-in consistent bias-corrected version (2012). Important papers whose results are related to ours of the Eicker–White variance estimator. The generic variance include (White 1980;MacKinnonandWhite1985;Wu1986; estimator proposed herein essentially reduces to this bias- Chesher and Jewitt 1987;ShaoandWu1987;Chesher1989; correctedvarianceestimatorinthespecialcaseoftheone-way Cribari-Neto, Ferrari, and Cordeiro 2000;Kauermannand fixed effects model, even though our results are derived froma Carroll 2001; Bera, Suprayitno, and Premaratne 2002;Stockand different perspective and generalize to other settings. Watson 2008; Cribari-Neto, da Gloria, and Lima 2011;Müller Furthermore, our general inference results can be used when 2013; Abadie, Imbens, and Zheng 2014). In particular, Bera, many multi-way fixed effects and similar discrete covariates are Suprayitno, and Premaratne (2002)analyzedsomefinitesample introduced in a linear regression model, as is usually the case in properties of a variance estimator similar to the one whose social and network settings. For example, in a recent asymptotic properties are studied herein. They use unbiased- contribution Verdier (2017)developednewresultsfortwo-way ness or minimum norm quadratic unbiasedness to motivate a fixed effect design and projection matrices, and use themto variance estimator that is similar in structure to ours, but their verify our high-level conditions in linear models with two-way results are obtained for fixed Kn and n and are silent about the unobserved heterogeneity and sparsely matched data (which extent to which consistent variance estimation is even possible can also be interpreted as a network setting). These results pro- when Kn/n → 0. vide another interesting and empirically relevant illustration This article also adds to the literature on high-dimensional of our generic theory. Verdier (2017)alsodevelopedinference linear regression where the number of regressors grow with results able to handle dependence in his specific the sample size; see, for example, Huber (1973), Koenker context,whicharenotcoveredbyourassumptionsbecausewe (1988), Mammen (1993), Anatolyev (2012), El Karoui et al. impose independence in the cross-sectional dimension of the (2013), Zheng et al. (2014), Cattaneo, Jansson, and Newey (possibly grouped) data. (2018), and Li and Müller (2017), and references therein. In The rest of this article is organized as follows. Section 2 particular, Huber (1973) showed that fitted regression values presents the variance estimators we study and gives a heuris- are not asymptotically normal when the number of regressors tic description of their main properties. Section 3 introduces growsasfastassamplesize,while(Mammen1993)obtained our general framework, discusses high-level assumptions and asymptotic normality for arbitrary contrasts of OLS estima- illustrates the applicability of our methods using three lead- tors in linear regression models where the dimension of the ing examples. Section 4 givesthemainresultsofthearticle. covariates is at most a vanishing fraction of the sample size. Section 5 reports the results of a Monte Carlo , while More recently, El Karoui et al. (2013) showed that, if a Gaussian Section 6 illustratesourmethodsusinganempiricalapplication. distributional assumption on regressors and homoscedasticity Section 7 concludes. Proofs as well as additional methodological 1352 M. D. CATTANEO, M. JANSSON, AND W. K. NEWEY

( , x , w ) . and numerical results are reported in the online supplemental To fix ideas, suppose yi,n i,n i,n are iid over i It turns out Appendix. that, under certain regularity conditions,

n n ˆ EW 1 2 2 2. Overview of Results  = M , vˆ , vˆ , E u , |x , , w , + o (1), n n ij n i n i n j n j n j n p i=1 j=1 For the purposes of discussing distribution theory and variance βˆ β estimators associated with the OLS estimator n of in (1), ˆ whereas a requirement for (2)toholdisthattheestimatorn when the Kn-dimensional nuisance covariate wi,n is of possibly satisfies “large” dimension and/or the parameter γn cannot be estimated consistently, it is convenient to write the estimator in “partialled n ˆ 1 2 out” form as  = vˆ , vˆ , E u , |x , , w , + o (1). (3) n n i n i n i n i n i n p − i=1 n 1 n n ˆ β = vˆ , vˆ vˆ , y , , vˆ , = M , x , , n i n i,n i n i n i n ij n j n The difference between the leading terms in the expansions is i=1 i=1 j=1 nonnegligible in general unless K /n → 0. In recognition of this n n − ˆ EW, , = 1( = ) − w ( w w ) 1w , , 1(·) problem with we study the more general class of estimators where Mij n i j i,n k=1 k,n k,n j n n denotes the indicator function, and the relevant inverses are of the form ˆ n assumed to exist. Defining  = vˆ , vˆ , /n, the objective n i=1 i n i n n n is to establish a valid Gaussian distributional approximation ˆ 1 2  (κ ) = κ , vˆ , vˆ , uˆ , , βˆ , n n n ij n i n i n j n to the finite sample distribution of the OLS estimator n and i=1 j=1 ˆ then find an√ estimator n of the (conditional) variance of n vˆ , , / i=1 i nui n n such that where κij,n denotes element (i, j) of a symmetric matrix κn = √ κn(w1,n,...,wn,n ). Estimators that can be written in this fash- ˆ − / ˆ ˆ ˆ − ˆ ˆ −  1 2 n(β − β) → N (0, I),  =  1  1, (2) ˆ EW κ = I n n d n n n n ion include n (which corresponds to n )aswellasvari- ants of the so-called HCk estimators, k ∈{1, 2, 3, 4},reviewed β in which case asymptotic valid inference on can be conducted by Long and Ervin (2000) and MacKinnon (2012), among in the usual way by employing the distributional approximation many others. To be specific, a natural variant of HCk is a ˆ ˆ −ξ , β ∼ N (β, n/n). κ κ = ϒ i n n obtained by choosing n to be diagonal with ii,n i,nMii,n , Our first result, Theorem 1 below, gives sufficient condi- where (ϒi,n,ξi,n) = (1, 0) for HC0 (and corresponding to tions for asymptotic standard normality of the infeasible statis- ˆ EW (ϒ ,ξ ) = ( /( − ), ) (ϒ ,ξ ) = √ n ), i,n i,n n n Kn 0 for HC1, i,n i,n −1/2 (βˆ − β),  = ˆ −1 ˆ −1  tic n n n where n n n √n and n denotes (1, 1) for HC2, (ϒi,n,ξi,n) = (1, 2) for HC3, and (ϒi,n,ξi,n) = n vˆ , , / . the (conditional) variance of i=1 i nui n n The assump- (1, min(4, nMii,n/Kn)) for HC4. See Sections 4.3 for more / → tions of Theorem 1 allow for both Kn n 0 and conditional details. heteroscedasticity. Under the assumptions of this theorem, we In Theorem 3,weshowthatalloftheHCk-type estimators, show in the supplemental appendix that  = O (1), so as a by- κ , √ n p which correspond to a diagonal choice of n have the shortcom- βˆ / . product we find that n remains n-consistent also in the high- ing that they do not satisfy (3)whenKn n 0 On the other dimensional setting allowed for in this article. More importantly, hand, it turns out that a certain nondiagonal choice of κn makes Theorem 1 is a useful starting point for discussing valid variance it possible to satisfy ( 3)evenifKn is a nonvanishing fraction of estimation in high-dimensional linear regression models. Defin- n. To be specific, it turns out that under (regularity conditions ˆ n ˆ ˆ and) mild conditions on the weights κij,n, thevarianceestimator ing ui,n = = Mij,n(y j,n − β x j,n), standard choices of n in j 1 n ˆ (κ ) the fixed-Kn case include the homoscedasticity-only estimator n n satisfies

n n n n HO 1 ˆ 1 2 2 ˆ =ˆσ 2ˆ , σˆ 2 = ˆ2 ,  (κ ) = κ , M vˆ , vˆ E u |x , , w , + o (1), n n n n ui,n n n ik n kj,n i n i,n j,n j n j n p n − d − K n = = = n i=1 i 1 j 1 k 1 (4) ˆ ˆ and the Eicker–White-type estimator suggesting that (3)holdswithn = n(κn) provided κn is cho- sen in such a way that n ˆ EW = 1 vˆ vˆ ˆ2 . i,n , u , n n n i n i n i=1 κ 2 = 1( = ), ≤ , ≤ . ik,nMkj,n i j 1 i j n (5) k=1 Perhaps not too surprisingly, Theorem 2 below finds that ˆ HO consistency of n under homoscedasticity holds quite gener- Accordingly, we define ally even for models with many covariates. In contrast, construc-  tion of a heteroscedasticity-robust estimator of n is more chal- n n ˆ EW ˆ HC = ˆ κHC = 1 κHC vˆ vˆ ˆ2 , lenging, as it turns out that consistency of n generally requires n n n ij,n i,n i,nu j,n n = = Kn to be a vanishing fraction of n. i 1 j 1 JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION 1353

where, with Mn denoting the matrix with element (i, j) given by wi,n, define the constants M , and denoting the Hadamard product, ij n n 1 2 ⎛ ⎞ ⎛ ⎞−  = E , , = E , |X , W , HC HC 2 2 1 n [Ri,n] Ri n ui n n n κ , ··· κ , M , ··· M , n ⎜ 11 n 1n n ⎟ ⎜ 11 n 1n n ⎟ i=1 κHC = ⎝ . . . ⎠ = ⎝ . . . ⎠ n n ...... ρ = 1 E 2 , = E |W , κHC ··· κHC 2 ··· 2 n ri,n ri,n [ui,n n] n1,n nn,n Mn1,n Mnn,n n i=1 = (M M )−1. n n n 1 2 χ = E[ Q , ], Q , = E[v , |W ], ˆ HC M M n n i n i n i n n The estimator n is well-defined whenever n n is invert- i=1 ible, a simple sufficient condition for which is that Mn < 1/2, v = x − ( n E x w )( n E w w )−1 where where i,n i,n j=1 [ j,n j,n] j=1 [ j,n j,n] wi,n is the population counterpart of vˆi,n.Also, letting λmin(·) Mn = 1 − min Mii,n. 1≤i≤n denote the minimum eigenvalue of its argument, define 4 4 C = max{E U , |X , W + E[ V , |W ] The fact that Mn < 1/2impliesinvertibilityofMn Mn is a n ≤ ≤ i n n n i n n 1 i n consequence of the Gerschgorin circle theorem. For details, see + /E 2 |X , W }+ /λ (E ˜ |W ), Section 3 in the supplemental Appendix. More importantly, a 1 Ui,n n n 1 min [ n n] M < / slight strengthening of the condition n 1 2willbeshown = − E |X , W ,V = x − E x |W , whereUi,n yi,n [yi,n n n] i,n i,n [ i,n n] ˆ = ˆ HC. n n to be sufficient for (2)and(3)toholdwith n n Our final ˜ = V˜ V˜ / , V˜ = V . n i=1 i,n i,n n and i,n j=1 Mij,n j,n result, Theorem 4, formalizes this finding. We impose the following three high-level conditions. Let The key intuition underlying our variance estimation result lim →∞a = lim sup →∞ a for any sequence a . E 2 |x , w n n n n n is that, even though each conditional variance [ui,n i,n i,n] cannot be well estimated due to the curse of dimensionality, an Assumption 1 (). C[Ui,n,Uj,n|Xn, Wn] = 0fori = j ≤ ≤ T , = ( ), T , averaged version such as the leading term in (3)canbeestimated and max1 i Nn # i n O 1 where # i n is the cardinality of  2 n ˆ2 T {T ≤ ≤ } { ,..., } consistently. Thus, taking E[u , |xi,n, wi,n] = = κik,nu , as i,n and where i,n :1 i Nn is a partition of 1 n i n k 1 k n 2 {( , , V ) ∈ T , } an estimator of E[u , |xi,n, wi,n], plugging it into the leading such that Ut n t,n : t i n are independent over i condi- i n W . term in (3), and computing conditional expectations, we obtain tional on n the leading term in (4). To make this leading term equal to P λ ( n w w )> → , n vˆ vˆ E 2 |x , w , Assumption 2 (Design). [ min i=1 i,n i,n 0] 1 the desired target i=1 i,n i,n [ui,n i,n i,n] it is natural to / < , C = ( ). require limn→∞Kn n 1 and n Op 1 χ = ( ),  + ( − n n Assumption 3 (Approximations). n O√1 n n n κ 2 E 2 |x , w ρn) + nχnn = o(1), and max1≤i≤n vˆi,n / n = op(1). ik,nMkj,n u j,n j,n j,n j=1 k=1 = E 2 |x , w ≤ ≤ . [ui,n i,n i,n]1 i n 3.2. Discussion of Assumptions E 2 |x , w Assumptions 1–3 are meant to be high-level and general, allow- Since [ui,n i,n i,n]areunknown,ourvarianceestimator solves (5), which generates enough equations to solve for all ing for different linear-in-parameters regression models. We ( − )/ κHC. now discuss the main restrictions imposed by these assump- n n 1 2 possibly distinct elements in n tions. We further illustrate them in the following subsection using more specific examples. 3. Setup

This section introduces a general framework encompassing sev- ... Assumption  eral special cases of linear-in-parameters regression models of This assumption concerns the sampling properties of the the form (1). We first present generic high-level assumptions, observed data. It generalizes classical iid sampling by allowing and then discuss their implications as well as some easier to ver- for groups or “clusters” of finite but possibly heterogeneous size ify sufficient conditions. Finally, to close this setup section, we with arbitrary intra-group dependence, which is very common briefly discuss three motivating leading examples: linear regres- in the context of fixed effects linear regression models. As sion models with increasing dimension, multi-way fixed effect currently stated, this assumption does not allow for correlation linear models, and semiparametric semi-linear regression. Tech- in the error terms across units, and therefore excludes clustered, nical details and related results for these examples are given in spacial or time series dependence in the sample. We conjecture the supplemental Appendix. our main results extend to the latter cases, though here we focus on i.n.i.d. (conditionally) heteroscedastic models only, and hence relegate the extension to errors exhibiting clustered, 3.1. Framework spacial or time series dependence for future work. Assumption {( , x , w ) ≤ ≤ } = , T ={} Suppose yi,n i,n i,n :1 i n is generated by (1). Let 1 reduces to classical iid sampling when Nn n i,n i · X = (x ,...,x ), T = denote the Euclidean norm, set n 1,n n,n and [implying max1≤i≤Nn # i,n 1], and all observations have the for a collection Wn of random variables satisfying E[wi,n|Wn] = same distribution. 1354 M. D. CATTANEO, M. JANSSON, AND W. K. NEWEY

M ≥ / ... Assumption  Each of these conditions is interpretable. First, n Kn n n = − This assumption concerns basic design features of the linear because i=1 Mii,n n Kn and a necessary condition for (i) regression model. The first two restrictions are mild and reflect is therefore that Kn/n → 0. Conversely, because the main goal of this article, that is, analyzing linear regression models with many nuisance covariates wi,n. In practice, the first Kn 1 − min1≤i≤n Mii,n Mn ≤ , restriction regarding the minimum eigenvalue of the design n 1 − max1≤i≤n Mii,n n w w matrix i=1 i,n i,n is always imposed by removing redundant (i.e., linearly dependent) covariates; from a theoretical perspec- the condition Kn/n → 0 is sufficient for (i) whenever the tive, this condition requires either restrictions on the distribu- design is “approximately balanced” in the sense that (1 − tional relationship of such covariates or some form of trimming min1≤i≤n Mii,n)/(1 − max1≤i≤n Mii,n) = Op(1). In other words, leading to selection of included covariates (e.g., most software (i) requires and effectively covers the case, where it is assumed packages remove covariates leading to “too” small eigenvalues that Kn is a vanishing fraction of n. In contrast, conditions (ii) of the design matrix by of some hard-thresholding and (iii) can also hold when Kn is a nonvanishing fraction of n, rule). which is the case of primary interest in this article. C = ( ), On the other hand, the last condition, n Op 1 may be Because(ii)isarequirementontheaccuracyofthe restrictive in some settings: for example, if the covariates have E x |w ≈ δ w δ = E w w −1 approximation [ i,n i,n] n i,n with n [ i,n i,n] unbounded support (e.g., they are normally distributed) and E w x , [ i,n i,n] primitive conditions for it are available when, for heteroscedasticity is unbounded (e.g., unbounded multiplica- example, the elements of wi,n are approximating functions. tive heteroscedasticity), then the assumption may fail. Simple Indeed, in such cases one typically has χ = O(K−α ) for some C = ( ) n n sufficient conditions for n Op 1 canbeformulatedwhen α>0, so condition (ii) not only accommodates Kn/n 0, the covariates have compact support, or the heteroscedasticity but actually places no upper bound on the magnitude of Kn in is multiplicative and bounded, because in these cases it is important special cases. This condition also holds when wi,n are easy to bound the conditional moments of the error terms. dummy variables or discrete covariates, as we discuss in more Nevertheless, it would be useful to know whether the condi- detail below. C = ( ) tion n Op 1 canberelaxedtoaversioninvolvingonly Finally, condition (iii), and its underlying higher-level condi- unconditional moments. tion described in the supplemental Appendix, is useful to handle cases where wi,n cannot be interpreted as approximating func- tions, but rather just as many different covariates included in the ... Assumption  linear model specification. This condition is a “sparsity” con- This assumption requires two basic approximations to hold. dition on the projection matrix M , which allows for K /n  n n First, concerning bias, conditions on n are related to the 0. The condition is easy to verify in certain cases, including approximation quality of the linear-in-parameters model those where “locally bounded” approximating functions or fixed E |X , W . (1) for the “long” conditional expectation [yi,n n n] effects are used (see below for concrete examples). Similarly, conditions on ρn and χn are related to linear- in-parameters approximations for the “short” conditional E |W E x |W expectations [yi,n n]and [ i,n n], respectively. All these 3.3. Motivating Examples approximations are measured in terms of population square error, and are at the heart of empirical work employing We briefly mention three motivating examples of linear-in- linear-in-parameters regression models. Depending on the parameter regression models covered by our results. All tech- model of interest, different sufficient conditions can be given nical details are given in the supplemental Appendix. for these assumptions. Here, we briefly mention the most simple one: (a)ifE[ui,n|Xn, Wn] = 0foralli and n,which can be interpreted as exogeneity (e.g., no misspecification ... Linear Regression Model with Increasing Dimension bias), then 0 = ρn = n(n − ρn) + nχnn for all n; and (b)if This leading example has a long tradition in statistics and econo- 2 E[ xi,n ] = O(1) for all i,thenχn = O(1).Othersufficient metrics. The model takes (1) as the data-generating process conditions are discussed below. √ (DGP), typically with i.i.d. data and the exogeneity condition Second, the high-level condition max1≤i≤n vˆi,n / n = E[ui,n|xi,n, wi,n] = 0. However, our assumptions only require 2 op(1) restricts the distributional relationship between the finite- nE[(E[ui,n|xi,n, wi,n]) ] = o(1), and hence (1)canbeinter- dimensional covariate of interest xi,n and the high-dimensional preted as a linear-in-parameters mean-square approximation to nuisance covariate wi,n. This condition can be interpreted as a the unknown conditional expectation E[yi,n|xi,n, wi,n]. Either βˆ negligibility condition and thus comes close to minimal for the way, n is the standard OLS estimator. to hold. At the present level of general- Setting Wn = (w1,n,...,wn,n ), Nn = n, and Ti,n ={i}, ity it seems difficult to formulate primitive sufficient conditions Assumptions 1 and 2 are standard, while Assumption 2 for this restriction that cover all cases of interest, but for com- 3 is satisfied provided that E[ xi,n ] = O(1) [imply- χ = ( ) E (E |x , w )2 = ( ) pleteness we mention that Lemma SA-7 in the supplemental ing n O 1 ], n [ [ui,n i,n i,n] ] o 1 [implying√ Appendix shows that under mild conditions it suffices n(n − ρn) + nχnn = o(1)], and max1≤i≤n vˆi,n / n = M = to require that one of the following conditions hold: (i) n op(1). Primitive sufficient conditions for the latter negligibility ( ), χ = ( ), n ( = ) = op 1 or (ii) n o 1 or (iii) max1≤i≤n j=1 I Mij,n 0 condition were discussed above. For example, under regularity 1/3 op(n ). conditions, χn = o(1) if either (a) E[xi,n|wi,n] = δ wi,n, (b)the JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION 1355 nuisance covariates are discrete and a saturated dummy vari- Gaussian approximation for the distribution of the least-squares w , βˆ γ ables model is used, or (c) i n are constructed using sieve func- estimator n is valid (see Theorem 1), despite the coefficients n n 1( = ) = ( 1/3) tions. Alternatively, max1≤i≤n j=1 Mij,n 0 op n is not being consistently estimated. On the other hand, consistency satisfied provided the distribution of the nuisance covariates of our generic variance estimator requires T ≥ 3 [implying w , M i n generates a projection matrix n that is approximately a Kn/n < 1/2]; see Theorems 3 and 4. Further details are given in band matrix (see below for concrete examples). Precise regular- Section 4.2 of the supplemental Appendix, where we also discuss ity conditions for this example, including a detailed discussion a case-specific consistent variance estimator when T = 2. (x , w ) ∼ N (0, I), of the special case where i,n i,n are given in Our generic results go beyond one-way fixed effect linear Section 4.1 of the supplemental Appendix. regression models, as they can be used to obtain valid infer- ence in other contexts, where multi-way fixed effects or similar ... Fixed Effects Panel Data Regression Model discrete regressors are included. For a second concrete exam- ple, consider the recent work of (Verdier 2017, and references A second class of examples covered by our results are linear therein) in the context of linear models with two-way unob- panel data models with multi-way fixed effects and related mod- served heterogeneity and sparsely matched data. This model is elssuchasthoseencounteredinnetworks,spillovers,orsocial isomorphic to a network model, where students and teacher (or interactions settings. A common feature in these examples is workers and firms, for another example) are “matched” or “con- thepresenceofpossiblymanydummyvariablesinw , , cap- i n nected” over time, but potential unobserved heterogeneity at turing unobserved heterogeneity or other unobserved effects both levels is a concern. In this setting, under random sampling, across units (e.g., network link or effect). In many appli- Verdier (2017) offerred primitive conditions for our high-level cations, the number of distinct dummy-type variables is large assumptions when two-way fixed effect models are used for esti- because researchers often include multi-group indicators, inter- mation and inference. To give one concrete example, he finds actions thereof, and similar regressors obtained from factor vari- that if T ≥ 5 and for any pair of teachers (firms), the number of ables. In these complicated models, the nuisance covariates need students (workers) assigned to both teachers (firms) in the pair to be estimated explicitly, even in prob- is either zero or greater than three, then our key high-level con- lems, because it is not possible to difference out the multi-way dition in Theorem 4 belowisverified. indicator variables for estimation and inference. Stock and Watson (2008) considered heteroscedasticity- robust inference for the one-way fixed effect panel data regres- ... Semiparametric sion model Another model covered by our results is the partially linear model Yit = αi + β Xit + Uit , i = 1,...,N, t = 1,...,T, (6) yi = β xi + g(zi) + εi, i = 1,...,n, (7) where αi ∈ R is an individual-specific intercept, Xit is a x z ε regressor of dimension d, and Uit is an scalar error term. where i and i are explanatory variables, i is an error term sat- To map this model into our framework, suppose that isfying E[εi|xi, zi] = 0, the function g(z) is unknown, and iid {( ,..., , X ,...,X ) ≤ ≤ } { k(z) = , ,...} Ui1 UiT i1 iT :1 i n are independent over sampling is assumed. Suppose p : k 1 2 are func- i, E[Uit |Xi1,...,XiT ] = 0, and E[UitUis|Xi1,...,XiT ] = 0for tions having the property that linear combinations can approxi- z (z ) ≈ t = s. Then, setting n = NT, Kn = N,γn = (α1,...,αN ) , mate square-integrable functions of well, in which case g i ≤ ≤ ≤ ≤ , = , γ p (z ) γ , p (z) = ( 1(z),..., Kn (z)) . and, for 1 i N and 1 t T y(i−1)T+t,n Yit n n i for some n where n p p = , x = x , w = p (z ), = ε + x(i−1)T+t,n = Xit , u(i−1)T+t,n = Uit , and w(i−1)T+t,n equal to Defining yi,n yi i,n i i,n n i and ui,n i , (z ) − γ w , βˆ the ith unit vector of dimension N the model (6)isalsoofthe g i n i,n the model (7)isoftheform(1), and n is βˆ β. β form (1)and n is the fixed effects estimator of In general, the series estimator of ; see, for example, Donald and Newey this model does not satisfy an iid assumption, but Assumption 1 (1994), Cattaneo, Jansson, and Newey (2018), and references enables us to employ results for independent random variables therein. when developing asymptotics. In particular, unlike (Stock and Constructing the basis pn(zi) in applications may require ( ,..., , X ...,X ) Watson 2008), we do not require Ui1 UiT i1 iT using a large Kn, either because the underlying functions are to be i.i.d. over i, nor do we require any kind of stationarity not smooth enough or because dim(zi) is large. For exam- ( , X ). on the part of Uit it The amount of variance hetero- ple, if a cubic polynomial expansion is used, also known geneity permitted is quite large, since it suffices to require as a power series of order p = 3, then dim(wi) = (p + V |X ,...,X = E 2|X ,...,X (z )) /(p (z ) ) = (z ) = , [Yit i1 iT ] [Uit i1 iT ]tobeboundedand dim i ! !dim i ! 286 when dim i 10 and there- bounded away from zero. (On the other hand, serial corre- fore flexible estimation and inference using the semi-linear lation is assumed away because our assumptions imply that model (7)withasamplesizeofn = 1000 gives Kn/n = 0.286. C[Yit ,Yis|Xi1,...,XiT ] = 0fort = s.)Inotherrespects,this For further technical details on series-based methods see, for model is in fact quite tractable due to the special nature of example, Newey (1997), Chen (2007), Cattaneo and Farrell the covariates wi,n,thatis,adummyvariableforeachunit (2013), and Belloni et al. (2015), and references therein. For i = 1,...,N. another example, when the basis functions pn(z) are con- In this one-way fixed effects example, Kn/n = 1/T and structed using partitioning estimators, the OLS estimator of β therefore a high-dimensional model corresponds to a short becomes a subclassification estimator, a method that has been n 1( = ) = panel model: max1≤i≤n j=1 Mij,n 0 T and hence the proposedintheliteratureonprogramevaluationandtreatment negligibility condition holds easily. If T ≥ 2, our asymptotic effects; see, for example, Cochran (1968), Rosenbaum and Rubin 1356 M. D. CATTANEO, M. JANSSON, AND W. K. NEWEY

(1983), and Cattaneo and Farrell (2011), and references therein. from results obtained under the assumption Kn/n → 0, but can When a partitioning estimator of order 0 is used, the semi-linear be obtained with the help of Theorem 1. In addition, it is worth model becomes a one-way fixed effects linear regression model, mentioning that Theorem 1 is a substantial improvement over where each dummy variable corresponds to one (disjoint) par- (Cattaneo, Jansson, and Newey 2018, Theorem 1) because here tition on the support of zi; in this case, Kn is to the number of it is not required that Kn →∞nor χn = o(1). To achieve this partitions or fixed effects included in the estimation. improvement, a different method of proof is used. This improve- Our primitive regularity conditions for this example include ment applies not only to the partially linear model example, but

2 more generally to linear models with many covariates, because n = min E[|g(zi) − γ pn(zi)| ] = o(1), γ ∈RKn Theorem 1 applies to quite general form of nuisance covariate 2 wi,n beyond specific approximating basis functions. In the spe- χn = min E[ E[xi|zi] − δ pn(zi) ] = O(1), δ∈RKn×d cific case of the partially linear model, this implies that we are  χ = ( ), able to weaken smoothness assumptions (or the curse of dimen- n n n√ o 1 and the negligibility condition max1≤i≤n sionality), otherwise required to satisfy the condition χn = o(1). vˆi,n / n = op(1). A key finding implied by these regularity conditions is that we only require mild smoothness conditions Remark 1. Theorem 1 does not require nor imply consistency of (z ) E x |z . on g i and [ i i] The negligibility condition is automati- the (implicit) estimate of γ , as in fact such a result χ = ( ), n cally satisfied if n o 1 as discussed above, but in fact our will not be true in most applications with many nuisance covari- results do not require any approximation of E[xi|zi], as usually ates wn,i. For example, in a partially linear model (7)theapprox- assumed in the literature, provided a “locally supported” basis imating coefficients γn will not be consistently estimated unless is used; that is, any basis pn(z) that generates an approximately Kn/n → 0, or in a one-way fixed effect panel data model6 ( )the band projection matrix Mn; examples of such basis include par- unit-specific coefficients in γn will not be consistently estimated titioning and spline estimators. See Section 4.3 in the supple- / = / → . unless Kn n √1 T 0 Nevertheless, Theorem 1 shows that mental Appendix for further discussion and technical details. βˆ n can still be n-normal under fairly general conditions; this result is due to the intrinsic linearity and additive separability of 4. Results the model (1). This section presents our main theoretical results for infer- ence in linear regression models with many covariates and het- 4.2. Variance Estimation eroscedasticity. Mathematical proofs, and other technical results that may be of independent interest, are given in the supplemen- Achieving (2), the counterpart of (8)inwhichtheunknown  ˆ , tal Appendix. matrix n is replaced by the estimator n requires additional assumptions. One possibility is to impose homoscedasticity.

4.1. Asymptotic Normality Theorem 2. Suppose the assumptions of Theorem 1 hold. If E 2 |X , W = σ 2, ˆ = ˆ HO. As a means to the end of establishing (2), we give an asymptotic [Ui,n n n] n then (2)holdswith n n βˆ normality result for n which may be of interest in its own right. This result shows in some generality that homoscedastic Theorem 1. Suppose Assumptions 1–3 hold. Then, inference in linear models remains valid even when Kn is pro- √ portional to n, provided the variance estimator incorporates a − / − −  1 2 (βˆ − β) → N (0, I),  = ˆ 1 ˆ 1, ˆ HO n n n d n n n n (8) degrees-of-freedom correction, as  does. n n 2 Establishing (2)isalsopossiblewhenKn is assumed to be where n = = vˆi,nvˆ , E[U , |Xn, Wn]/n. i 1 i n i n a vanishing fraction of n, as is of course the case in the usual In the literature on high-dimensional linear models, fixed-Kn linear regression model setup. The following theorem Mammen (1993) obtained a similar asymptotic normality establishes consistency of the conventional standard error esti- 1+δ/ → ˆ EW result as in Theorem 1 but under the condition Kn n 0 mator  under the assumption Mn →p 0, and also derives δ> n for 0 restricted by certain moment condition on the covari- an asymptotic representation for estimators of the form ˆ (κ ) / < , n n ates. In contrast, our result only requires limn→∞Kn n 1 without imposing this assumption, which is useful to study the but imposes a different restriction on the high-dimensional asymptotic properties of other members of the HCk class of covariates (e.g., condition (i), (ii), or (iii) discussed previously) standard error estimators. and furthermore exploits the fact that the parameter of interest (β,γ ) Theorem 3. Suppose the assumptions of Theorem 1 hold. (a) is given by the first d coordinates of the vector n (i.e., c = (b, 0 ) M → , ˆ = ˆ EW. κ = in Mammen (1993) notation, it considers the case If n p 0 then (2)holdswith n n (b) If n ∞ b 0 n |κ |= ( ), with denoting any d-dimensional vector and denoting a max1≤i≤n j=1 ij,n Op 1 then K -dimensional vector of zeros). n n n n In isolation, the fact that Theorem 1 removes the require- ˆ 1 2 2  (κ ) = κ , M , vˆ , vˆ , E U , |X , W + o (1). / → n n n ik n kj n i n i n j n n n p ment Kn n 0 may seem like little more than a subtle tech- i=1 j=1 k=1 nical improvement over results currently available. It should be recognized, however, that conducting inference turns out to be The conclusion of part (a) typically fails when the condition considerably harder when Kn/n → 0. The latter is an important Kn/n → 0isdropped.Forexample,whenspecializedtoκn = I insight about large-dimensional models that cannot be deduced part (b) implies that in the homoscedastic case (i.e., when the JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION 1357 assumptions of Theorem 2 aresatisfied)thestandardestima- under which these estimators are consistent in the sense that ˆ EW tor n is asymptotically downward biased in general (unless n / → ˆ 1 2 Kn n 0).Inthefollowingsection,wemakethisresultpre-  (κ ) =  + o (1),  = vˆ , vˆ , E U , |X , W . n n n p n n i n i n i n n n cise and discuss similar results for other popular variants of the i=1 HCk estimators mentioned above. κ = 1( = κHC 2 = 1( = ) More generally, Theorem 3(b) shows that, if ij,n i On the other hand, because 1≤k≤n ik,nMkj,n i j by −ξ )ϒ i,n ˆ HC j i,nMii,n ,then construction, part (b) implies that n is consistent provided κHC = ( ). n ∞ Op 1 A simple condition for this to occur can be ˆ (κ ) = ¯ (κ ) + ( ), HC n n n n op 1 stated in terms of Mn. Indeed, if Mn < 1/2, then κ is diago- n n n −ξ nally dominant and it follows from Theorem 1 of Varah (1975) ¯ 1 i,n 2 2  (κ ) = ϒ , M , M , vˆ , vˆ , E[U , |X , W ]. that n n n i n ii n ij n i n i n j n n n i=1 j=1   κHC ≤ 1 . We therefore obtain the following (mostly negative) results n ∞ / − M 1 2 n about the properties of HCk estimators when Kn/n 0; that is, when potentially many covariates are included. (ϒ ,ξ ) = ( , ). E 2 |X , W = σ 2, As a consequence, we obtain the following theorem, whose con- HC0: i,n i,n 1 0 If [Uj,n n n] n then ditions can hold even if Kn/n 0. σ 2 n ¯ n  (κ ) =  − (1 − M , )vˆ , vˆ , ≤  , Theorem 4. Suppose the assumptions of Theorem 1 hold. If n n n n ii n i n i n n P[M < 1/2] → 1andif1/(1/2 − M ) = O (1),then(2) i=1 n n p ˆ = ˆ HC. −1 n ( − )vˆ vˆ = ( ) holds with n n with n i=1 1 Mii,n i,n i,n op 1 in general / → ˆ (κ ) = ˆ EW (unless Kn n 0). Thus, n n n is inconsistent M ≥ / , ˆ Because n Kn n anecessaryconditionforTheorem 4 in general. In particular, inference based on EW is / < / . n to be applicable is that limn→∞Kn n 1 2 When the design asymptotically liberal (even) under homoscedasticity. is balanced, that is, when M11,n =···=Mnn,n (as occurs in (ϒ ,ξ ) = ( /( − ), ). E 2 |X , W = σ 2 HC1: i,n i,n n n Kn 0 If [Uj,n n n] n the panel data model (6)), the condition limn→∞Kn/n < 1/2is ¯ and if M11,n =··· =Mnn,n, then n(κn) = n, but in also sufficient. It follows from Section 4.1.1 of the supplemental / / < / general this estimator is inconsistent when Kn n 0 Appendix that the condition limn→∞Kn n 1 2isalsosuffi- ˆ EW (and so is any other scalar multiple of n ). cient in the special case where wi,n is iid with a zero-mean nor- HC2: (ϒ , ,ξ, ) = (1, 1). If E[U 2 |X , W ] = σ 2, then mal distribution, but in general it seems difficult to formulate i n i n j,n n n n ¯ (κ ) =  , primitive sufficient conditions for the assumption made about n n n but in general this estimator is incon- / . Mn in Theorem 4. In practice, the fact that Mn is observed sistent under heteroscedasticity when Kn n 0 For = E 2 |X , W = vˆ 2 , means that the condition Mn < 1/2 is verifiable, and therefore instance, if d 1andif [Uj,n n n] j,n then unless Mn is found to be “close” to 1/2 there is reason to expect  n n 2 ˆ HC M ,  to perform well. ¯ 1 ij n −1 −1 n  (κ ) −  = M , + M , n n n n 2 ii n jj n i=1 j=1 Remark 2. Our main results for linear models concern large- − 1(i = j) vˆ 2 vˆ 2 = o (1) sample approximations for the finite-sample distribution of the i,n j,n p usual t -statistics. An alternative, equally automatic approach in general (unless Kn/n → 0). is to employ the bootstrap and closely related pro- HC3: (ϒi,n,ξi,n) = (1, 2). Inference based on this estimator is cedures (see, among others, Freedman (1981), Mammen 1993; asymptotically conservative because Gonçalves and White 2005; Kline and Santos 2012). Assum- ing K /n 0, Bickel and Freedman (1983) demonstrated an n n n ¯ (κ ) −  = 1 −2 2 vˆ vˆ invalidity result for the bootstrap in the context of high- n n n Mii,nMij,n i,n i,n n = = , = dimensional linear regression. Following the recommendation i 1 j 1 j i × E 2 |X , W ≥ , of a reviewer, we explored the numerical performance of the Uj,n n n 0 standard nonparametric bootstrap in our simulation study, −1 n n −2 2 vˆ vˆ E 2 |X , where we found that indeed bootstrap validity seems to fail in where n i=1 j=1, j=i Mii,nMij,n i,n i,n [Uj,n n W = ( ) / → the high-dimensional settings we considered. n] op 1 in general (unless Kn n 0). HC4: (ϒi,n,ξi,n) = (1, min(4, nMii,n/Kn)). If M11,n =···= Mnn,n = 2/3 (as occurs when T = 3 in the fixed effects 4.3. HCk Standard Errors with Many Covariates panel data model), then HC4 reduces to HC3, so this esti- mator is also inconsistent in general. The HCk variance estimators are very popular in empirical Among other things these results show that (asymptotically) ˆ (κ ) κ = work,andinourcontextareoftheform n n with ij,n conservative inference in linear models with many covariates −ξ 1( = )ϒ i,n (ϒ ,ξ ). / → i j i,nMii,n for some choice of i,n i,n See Long and (i.e., even when K n 0) can be conducted using standard lin- Ervin (2000) and MacKinnon (2012) for reviews. Theorem 3(b) ear methods (and software), provided the HC3 standard errors canbeusedtoformulateconditions,includingKn/n → 0, are used. 1358 M. D. CATTANEO, M. JANSSON, AND W. K. NEWEY

In the numerical work reported in the following sections Appendix, we also present results for more sparse dummy vari- and the supplemental Appendix, we present evidence compar- ables in the context of one-way and two-way linear panel data ing all these variance estimators. In line with the theory, we find regression models, and for nonbinary covariates wi,n in both that OLS-based confidence intervals employing HC3 standard increasing dimension linear regression settings and semipara- errors are conservative while our proposed variance estimator metric partially linear regression settings (where γn = 0 and ˆ HC w , n delivers confidence intervals with close-to-correct empiri- i n is constructed using power series expansions). Further- cal coverage. more, we also considered an asymmetric and a bimodal dis- tribution for the unobservable error terms. In all cases, the 5. Simulations numerical results are qualitatively similar to those discussed herein. For each DGP, we investigated both homoscedastic as We conducted a simulation study to assess the finite sample well as (conditional on xi,n and/or wi,n) heteroscedastic mod- properties of our proposed inference methods as well as those els, following closely the specifications in Stock and Watson of other standard inference methods available in the literature. (2008) and MacKinnon (2012). In particular, our heteroscedas- Basedonthegenericlinearregressionmodel(1), we consider ticmodeltakestheform:V[ui,n|xi,n, wi,n] = κu(1 + (t(xi,n) + 2 2 15 distinct data-generating processes (DGPs) motivated by the ι wi,n) ) and V[xi,n|wi,n] = κv (1 + (ι wi,n) ),wherethecon- three examples discussed previously. To conserve space, here stants κu and κv are chosen so that V[ui,n] = V[xi,n] = 1, we only discuss results from Model 1, a representative case, but t(a) = a1(−2 ≤ a ≤ 2) + 2sgn(a)(1 − 1(−2 ≤ a ≤ 2)),andι the supplemental Appendix contains the full set of results and denotes a conformable vector of ones. further details (see Table 1 in the supplement for a synopsis of We conducted S = 5,000 simulations to study the finite sam- the DGPs used). ple performance of 16 confidence intervals: eight based ona We present results for a linear model (1) with i.i.d. data, Gaussian approximation and eight based on a bootstrap approx- n = 700, d = 1andxi,n ∼ N (0, 1), wi,n = 1(vi,n ≥ 2.5) with imation. Our article offers theory for Gaussian-based inference vi,n ∼ N (0, I),andui,n ∼ N (0, 1), all independent of each methods, but we also included bootstrap-based inference meth- other. Thus, this design considers (possibly overlapping) sparse ods for completeness (as discussed in Remark 2,thebootstrap w ; dummy variables entering i,n each column assigns a value is invalid when Kn ∝ n in linear regression models). For each of 1 to approximately five units out of n = 700. We set β = 1 inference method, we report both average coverage and γn = 0, and considered five different model dimensions: and interval length of 95% nominal confidence intervals; the lat- dim(wi,n) = Kn ∈{1, 71, 141, 211, 281}. In the supplemental ter provides a summary of /power for each inference

Table . Simulation results (model  in supplemental appendix).

Gaussian distributional approximation Bootstrap distributional approximation HO HO HC HC HC HC HC HCK HO HO HC HC HC HC HC HCK

(a) Empirical coverage

Homoscedastic model K/n = 0.001 . . . . . . . . . . . . . . . . K/n = 0.101 . . . . . . . . . . . . . . . . K/n = 0.201 . . . . . . . . . . . . . . . . K/n = 0.301 . . . . . . . . . . . . . . . . K/n = 0.401 . . . . . . . . . . . . . . . . Heteroscedastic model K/n = 0.001 . . . . . . . . . . . . . . . . K/n = 0.101 . . . . . . . . . . . . . . . . K/n = 0.201 . . . . . . . . . . . . . . . . K/n = 0.301 . . . . . . . . . . . . . . . . K/n = 0.401 . . . . . . . . . . . . . . . .

(b) Interval length

Homoscedastic model K/n = 0.001 . . . . . . . . . . . . . . . . K/n = 0.101 . . . . . . . . . . . . . . . . K/n = 0.201 . . . . . . . . . . . . . . . . K/n = 0.301 . . . . . . . . . . . . . . . . K/n = 0.401 . . . . . . . . . . . . . . . . Heteroscedastic model K/n = 0.001 . . . . . . . . . . . . . . . . K/n = 0.101 . . . . . . . . . . . . . . . . K/n = 0.201 . . . . . . . . . . . . . . . . K/n = 0.301 . . . . . . . . . . . . . . . . K/n = 0.401 . . . . . . . . . . . . . . . .

NOTES: (i) DGP is Model  from the supplemental appendix, sample size is n = 700, number of bootstrap replications is B = 500, and number of simulation replications is S = 5000; (ii) Columns HO and HO correspond to confidence intervals using homoscedasticity consistent standard errors without and with degrees of freedom correction, respectively, columns HC–HC correspond to confidence intervals using the heteroscedasticity consistent standard errors discussed in Sections  and ., and columns HCK correspond to confidence intervals using our proposed standard errors estimator. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION 1359 method. To be more specific, for α = 0.05, the confidence inter- Thesampleiscomposedofwhitemalesofagesbetween vals take the form: 28 and 34 years old in 1991, with at most 5 siblings and at ⎡   ⎤ least incomplete secondary education. We split the sample into ˆ ˆ individuals with high school dropouts and high school gradu- = ⎣βˆ − −1( − α/ ) · n, , βˆ − −1(α/ ) · n, ⎦ , I n q 1 2 n q 2 ates, and individuals with college dropouts, college graduates, n n and postgraduates. For each subsample, we consider the linear ˆ ˆ −1 ˆ ˆ −1 = ( )  , =   , , regression model (1)withyi,n log wagesi ,wherewagesi n n n n , = is the log wage in 1991 of unit i xi,n afqti denotes the −1 (adjusted) standardized AFQT score for unit i, and w , collects where q denotes the inverse of a cumulative distribution func- i n ˆ several survey, geographic and dummy variables for unit i. In tion q, and n, with  ∈{HO0, HO1, HC0, HC1, HC2, HC3, particular, w , includes the 14 covariates described in CHV HC4, HCK} corresponds to the variance estimators discussed i n (Table 2, p. 2763), a dummy variable for wether the education in Sections 2 and 4.3. Gaussian-based methods set q equal level was completed, eight cohort fixed effects, county fixed to the standard normal cdf for all , while bootstrap-based effects,andcohort-countyfixedeffects.Forourillustration, methodsarebasedonthenonparametricbootstrapdistribu- = (βˆ − we further restrict the sample to units in counties with at tional approximation to the distribution of the t-test T n least three survey respondents, giving a total of Kn = 122 and β)/ ˆ / . n, n The empirical coverage of these 16 confidence n = 436 (Kn/n = 0.280; Mn = 0.422) for the high school edu- intervals are reported in Panel (a) of Table 1. In addition, Panel cation subsample and Kn = 123 and n = 452 (Kn/n = 0.272; (b) of Table 1 reports the average interval length of each con- Mn = 0.411) for the college education subsample. = −1( − α/ ) − fidence intervals, which is computed as L [q 1 2 The empirical findings are reported in Table 2.Forhigh −1(α/ ) · ˆ / , school educated individuals, we find an estimated returns q 2 ] n, n and thus offers a summary of finite sam- βˆ = . . ple power/efficiency of each inference method. to ability of 0 060 The of this The main findings from the simulation study are in line effect, however, depends on the inference method employed. with our theoretical results. We find that the confidence inter- If homoscedastic consistent standard errors are used, then the val estimators constructed using our proposed standard errors effect is statistically significant at conventional levelsp ( -values are 0.010 and 0.029 for unadjusted and degrees-of-freedom formula ˆ HC, denoted HCK, offer close-to-correct empirical n adjusted standard errors, respectively). If heteroscedasticity coverage. The alternative heteroscedasticity consistent standard consistent standard errors are used, the default method in most errors currently available in the literature lead to confidence empirical studies, then the statistical significance depends on intervals that could deliver substantial under or over coverage depending on the design and degree of heteroscedasticity con- sidered. We also find that inference based on HC3 standard Table . Empirical application (returns to ability, afqt score). errors is conservative, a general asymptotic result that is for- Outcome: log(wages) mally established in this article. Bootstrap-based methods seem to perform better than their Gaussian-based counterparts, but (a) Secondary education they never outperform our proposed Gaussian-based inference βˆ . procedure nor do they provide close-to-correct empirical cov- Std.Err. p-value erage across all cases. Finally, our proposed confidence intervals HO . . also exhibit very good average interval length. HO . . HC . . HC . . HC . . 6. Empirical Illustration HC . . HC . . We illustrate the different linear regression inference meth- HCK . . Kn  ods discussed in this article using a real dataset to study the n  effectofabilityonearnings.Inparticular,weemploythe Kn/n . M dataset constructed by (Carneiro, Heckman, and Vytlacil n . 2011,CHV,hereafter).[Thedatasetisavailableathttps://www. (B) College education aeaweb.org/articles?id=10.1257/aer.101.6.2754.]. The data ˆ come from the 1979 National Longitudinal Survey of Youth β . Std.Err. p-value (NLSY79), which surveys individuals born in 1957–1964 HO . . and includes basic demographic, economic and educational HO . . information for each individual. It also includes a well-known HC . . HC . . proxy for ability (beyond schooling and work experience): HC . . the Armed Forces Qualification Test (AFQT), which gives a HC . . measure usually understood as a proxy for the “intrinsic ability” HC . . HCK . . of the respondent. This data has been used repeatedly to either Kn  control for or estimate the effects of ability in empirical studies n  K /n in economics and other disciplines. See CHV for further details n . M . and references. n 1360 M. D. CATTANEO, M. JANSSON, AND W. K. NEWEY thewhichinferencemethodisused;seeSection 4.3.Inpar- results obtained in this article is that the latter assumption ticular, HC0 also gives a statistically significant resultp ( -value cannot be dropped if post-covariate-selection inference is is 0.020), while HC1 and HC2 deliver marginal significance basedonconventionalstandarderrors.Itwouldthereforebe (both p-values are 0.048). On the other hand, HC3 and HC4 of interest to investigate whether the methods proposed herein give p-values of 0.092 and 0.122, respectively, and hence suggest can be applied also for inference post-covariate-selection in that the point estimate is not statistically distinguishable from ultra-high-dimensional settings, which would allow for weaker zero. Finally, our proposed standard error, HCK, gives a p-value forms of sparsity because more covariates could be selected for of 0.058, also making βˆ = 0.060 statistically insignificant at inference. the conventional 5% level. In contrast, for college educated individuals, we find an effect of βˆ = 0.091, and all inference Supplement Materials methods indicate that this estimated returns to ability is statis- tically significant at conventional levels. In particular, HC3 and The supplemental appendix gives proofs of the main theorems presented in our proposed standard errors HCK give p-values of 0.037 and the article, contains other related technical results that may be of indepen- dent interest, discusses specific examples of linear regression models cov- 0.017, respectively. ered by our general framework, and reports complete results from a simu- This illustrative empirical application showcases the role of lation study. our proposed inference method for empirical work employing linear regression with possibly many covariates; in this appli- cation, Kn large relative to n is quite natural due to the pres- Acknowledgments ence of many county and cohort fixed effects (i.e., K /n ≈ 0.3 n We thank Xinwei Ma, Ulrich Müller and Andres Santos for very thought- in this empirical illustration). Specifically, when studying the ful discussions regarding this project. We also thank Silvia Gonçalves, Bruce effect of ability on earnings for high school educated individuals, Hansen,LutzKilian,PatKline,andJamesMacKinnon.Inaddition,anAsso- the statistical significance of the results crucially depend on the ciate Editor and three reviewers offered excellent recommendations that inference method used: as predicted by our theoretical findings, improved this article. inference methods that are not robust to the inclusion of many covariates tend to deliver statistically significant results, while Funding methods that are robust (HC3 is asymptotically conservative and HCK is asymptotically correct) do not deliver statistically sig- The first author gratefully acknowledges financial support From the nificant results, giving an example, where the empirical conclu- National Science Foundation (SES 1459931). The second author grate- sion may change depending on whether the presence of many fully acknowledges financial support From the National Science Founda- tion (SES 1459967) and the research support of CREATES (funded by the covariates is taken into account when conducting inference. In Danish National Research Foundation Under grant no. DNRF78). The third contrast, the empirical findings for college educated individuals author gratefully acknowledges financial support From the National Sci- appear to be statistically significant and robust across all infer- ence Foundation (SES 1132399). ence methods.

References

7. Conclusion Abadie,A.,Imbens,G.W.,andZheng,F.(2014), “Inference for Misspeci- fied Models With Fixed Regressors,” Journal of the American Statistical We established asymptotic normality of the OLS estimator of Association, 109, 1601–1614. [1351] a subset of coefficients in high-dimensional linear regression Anatolyev, S. (2012), “Inference in Regression Models With Many Regres- models with many nuisance covariates, and investigated the sors,” Journal of , 170, 368–382. [1351] properties of several popular heteroscedasticity-robust standard Angrist, J., and Hahn, J. (2004), “When to Control for Covariates? Panel Asymptotics for Estimates of Treatment Effects,” Review of Economics error estimators in this high-dimensional context. We showed and Statistics, 86, 58–72. [1350] that none of the usual formulas deliver consistent standard Belloni, A., Chernozhukov, V., Chetverikov, D., and Kato, K. (2015), “On errors when the number of covariates is not a vanishing propor- the Asymptotic Theory for Least Squares Series: Pointwise and Uni- tion of the sample size. We also proposed a new standard error form Results,” , 186, 345–366. [1355] formula that is consistent under (conditional) heteroscedastic- Belloni,A.,Chernozhukov,V.,andHansen,C.(2014), “Inference on Treat- ment Effects After Selection Among High-Dimensional Controls,” ity and many covariates, which is fully automatic and does not Review of Economic Studies, 81, 608–650. [1350,1360] assume a restrictive, special structure on the regressors. Belloni,A.,Chernozhukov,V.,Hansen,C.,andFernandez-Val,I.(2017), Our results concern high-dimensional models where the “Program Evaluation and Causal Inference With High-Dimensional number of covariates is at most a nonvanishing fraction of the Data,” , 85, 233–298. [1360] sample size. A quite recent related literature concerns ultra- Bera, A. K., Suprayitno, T., and Premaratne, G. (2002), “On Some Heteroscedasticity-Robust Estimators of Variance- Matrix high-dimensional models where the number of covariates is of the Least-Squares Estimators,” Journal of Statistical Planning and much larger than the sample size, but some form of (approxi- Inference, 108, 121–136. [1351] mate) sparsity is imposed in the model; see, for example, Belloni, Bickel, P.J., and Freedman, D. A. (1983), “ Regression Models Chernozhukov, and Hansen (2014), Farrell (2015), Belloni et al. With Many Parameters,” in A Festschrift for Erich L. Lehmann,eds.P. (2017), and references therein. In that setting, inference is Bickel,K.Doksum,andJ.Hodges,BocaRaton,FL:ChapmanandHall. [1357] conducted after covariate selection, where the resulting number Carneiro, P., Heckman, J. J., and Vytlacil, E. J. (2011), “Estimating Marginal of selected covariates is at most a vanishing fraction of the Returns to Education,” American Economic Review, 101, 2754–2781. sample size (usually much smaller). An implication of the [1359] JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION 1361

Cattaneo, M. D., and Farrell, M. H. (2011), “Efficient Estimation of the Kline, P., and Santos, A. (2012), “Higher Order Properties of the Wild Dose-Response Function Under Ignorability Using Subclassification Bootstrap Under Misspecification,” Journal of Econometrics, 171, on the Covariates,” in Missing-Data Methods: Cross-sectional Methods 54–70. [1357] and Applications (Advances in Econometrics) (vol. 27), ed. D. Drukker, Koenker, R. (1988), “Asymptotic Theory and Econometric Practice,” Jour- Bingley, UK: Emerald Group Publishing, pp. 93–127. [1356] nal of Applied Econometrics, 3, 139–147. [1351] ——— (2013), “Optimal Convergence Rates, Bahadur Representation, and Li, C., and Müller, U. K. (2017), “Linear Regression With Many Asymptotic Normality of Partitioning Estimators,” Journal of Econo- Controls of Limited Explanatory Power,” working paper, Princeton metrics, 174, 127–143. [1355] University.[1351] Cattaneo,M.D.,Jansson,M.,andNewey,W.K.(2018), “Alternative Asymp- Long,J.S.,andErvin,L.H.(2000), “Using Heteroscedasticity Consistent totics and the Partially Linear Model With Many Regressors,” Econo- Standard Errors in the Linear Regression Model,” The American Statis- metric Theory, 34, 277–301. [1351,1355,1356] tician, 54, 217–224. [1352,1357] Chen, X. (2007), “Large Sample Sieve Estimation of Semi-Nonparametric MacKinnon, J., and White, H. (1985), “Some Heteroscedasticity-Consistent Models,” in Handbook of Econometrics (vol.VI),eds.J.J.Heckman,and Estimators With Improved Finite Sample Proper- E. E. Leamer, Amsterdam, Netherlands: Elsevier Science B.V., pp. 5549– ties,” Journal of Econometrics, 29, 305–325. [1351] 5632. [1355] MacKinnon, J. G. (2012), “Thirty Years of Heteroscedasticity-Robust Infer- Chesher, A. (1989), “Hájek Inequalities, Measures of Leverage and the Size ence,”in Recent Advances and Future Directions in Causality, Prediction, of Heteroscedasticity Robust Wald Tests,” Econometrica, 57, 971–977. and Specification Analysis,eds.X.Chen,andN.R.Swanson,NewYork: [1351] Springer. [1351,1352,1357,1358] Chesher,A.,andJewitt,I.(1987), “The Bias of a Heteroscedasticity Con- Mammen, E. (1993),“BootstrapandWildBootstrapforHighDimensional sistent Covariance Matrix Estimator,” Econometrica, 55, 1217–1222. Linear Models,” , 21, 255–285. [1351,1356,1357] [1351] Müller,U.K.(2013), “Risk of in Misspecified Models, Cochran, W. G. (1968), “The Effectiveness of Adjustment by Subclassi- and the Sandwich Covariance Matrix,” Econometrica, 81, 1805–1849. fication in Removing Bias in Observational Studies,” Biometrics, 24, [1351] 295–313. [1355] Newey, W. K. (1997), “Convergence Rates and Asymptotic Normal- Cribari-Neto,F.,daGloriaA.,andLima,M.(2011), “A Sequence of ity for Series Estimators,” Journal of Econometrics, 79, 147–168. Improved Standard Errors Under Heteroscedasticity of Unknown [1355] Form,” Journal of Statistical Planning and Inference, 141, 3617–3627. Rosenbaum, P.R., and Rubin, D. B. (1983), “The Central Role of the Propen- [1351] sity Score in Observational Studies for Causal Effects,” , 70, Cribari-Neto,F.,Ferrari,S.L.P.,andCordeiro,G.M.(2000), “Improved 41–55. [1356] Heteroscedasticity-Consistent Covariance Matrix Estimators,” Shao, J., and Wu, C. F. J. (1987), “Heteroscedasticity-Robustness of Jack- Biometrika, 87, 907–918. [1351] knife Variance Estimators in Linear Models,” Annals of Statistics, 15, Donald, S. G., and Newey, W. K. (1994), “Series Estimation of Semilinear 1563–1579. [1351] Models,” Journal of Multivariate Analysis, 50, 30–40. [1355] Stock, J. H., and Watson, M. W. (2008), “Heteroscedasticity-Robust Stan- El Karoui, N., Bean, D., Bickel, P. J., Lim, C., and Yu, B. (2013), “On dard Errors for Fixed Effects Panel Data Regression,” Econometrica, 76, With High-Dimensional Predictors,” Proceedings of 155–174. [1351,1355,1358] the National Academy of Sciences, 110, 14557–14562. [1351] Varah, J. M. (1975), “A Lower Bound for the Smallest Singular Value of a Farrell, M. H. (2015), “Robust Inference on Average Treatment Effects With Matrix,” Linear Algebra and its Applications, 11, 3–5. [1357] Possibly More Covariates than Observations,” Journal of Econometrics, Verdier, V. (2017), “Estimation and Inference for Linear Models With Two- 189, 1–23. [1360] Way Fixed Effects and Sparsely Matched Data,” Working paper, UNC. Freedman, D. A. (1981), “Bootstrapping Regression Models,” Annals of [1351,1355] Statistics, 9, 1218–1228. [1357] White, H. (1980), “A Heteroscedasticity-Consistent Covariance Matrix Gonçalves, S., and White, H. (2005), “Bootstrap Standard Error Estimates EstimatorandaDirectTestforHeteroscedasticity,”Econometrica, 48, for Linear Regression,” Journal of the American Statistical Association, 817–838. [1351] 100, 970–979. [1357] Wu,C.F.J.(1986), “Jackknife, Bootstrap and Other Resampling Meth- Huber, P. J. (1973), “Robust Regression: Asymptotics, Conjectures, and ods in ,” Annals of Statistics, 14, 1261–1295. Monte Carlo,” Annals of Stastistics, 1, 799–821. [1351] [1351] Kauermann, G., and Carroll, R. J. (2001), “A Note on the Efficiency of Sand- Zheng, S., Jiang, D., Bai, Z., and He, X. (2014), “Inference on Multiple wich Covariance Matrix Estimation,” Journal of the American Statistical Correlation Coefficients With Moderately High Dimensional Data,” Association, 96, 1387–1396. [1351] Biometrika, 101, 748–754. [1351]