t-Statistic Based Correlation and Heterogeneity Robust Inference

Rustam IBRAGIMOV Economics Department, Harvard University, 1875 Cambridge Street, Cambridge, MA 02138

Ulrich K. MÜLLER Economics Department, Princeton University, Fisher Hall, Princeton, NJ 08544 ([email protected])

We develop a general approach to robust inference about a scalar parameter of interest when the data is potentially heterogeneous and correlated in a largely unknown way. The key ingredient is the following result of Bakirov and Székely (2005) concerning the small sample properties of the standard t-test: For a significance level of 5% or lower, the t-test remains conservative for underlying observations that are independent and Gaussian with heterogeneous variances. One might thus conduct robust large sample inference as follows: partition the data into q ≥ 2 groups, estimate the model for each group, and conduct a standard t-test with the resulting q parameter estimators of interest. This results in valid and in some sense efficient inference when the groups are chosen in a way that ensures that the parameter estimators are asymptotically independent, unbiased and Gaussian with possibly different variances. We provide examples of how to apply this approach to time series, panel, clustered and spatially correlated data.

KEY WORDS: Dependence; Fama–MacBeth method; Least favorable distribution; t-test; Variance estimation.

1. INTRODUCTION

Empirical analyses in economics often face the difficulty that the data is correlated and heterogeneous in some unknown fashion. Many estimators of parameters of interest remain valid and interesting even under the presence of correlation and heterogeneity, but it becomes considerably more challenging to correctly estimate their variability.

The typical approach is to invoke a law of large numbers to justify inference based on consistent variance estimators: For an OLS regression with independent but not identically distributed disturbances, see White (1980). In the context of time series, popular heteroscedasticity and autocorrelation consistent ("long-run") variance estimators were derived by Newey and West (1987) and Andrews (1991). For clustered data, which includes panel data as a special case, Rogers' (1993) clustered standard errors provide a consistent variance estimator. Conley (1999) derives consistent nonparametric standard errors for datasets that exhibit spatial correlations.

While quite general, the consistency of the variance estimator is obtained through an assumption that asymptotically, an infinite number of observable entities are essentially uncorrelated: heteroscedasticity robust estimators achieve consistency by averaging over an infinite number of uncorrelated disturbances; clustered standard errors achieve consistency by averaging over an infinite number of uncorrelated clusters; long-run variance estimators achieve consistency by averaging over an infinite number of (essentially uncorrelated) low frequency periodogram ordinates; and so forth. Also, block bootstrap and subsampling techniques derive their asymptotic validity from averaging over an infinite number of essentially uncorrelated blocks. When correlations are pervasive and pronounced enough, these methods are inapplicable or yield poor results.

This paper develops a general strategy for conducting inference about a scalar parameter with potentially heterogenous and correlated data, when relatively little is known about the precise property of the correlations. The key ingredient to the strategy is a result by Bakirov and Székely (2005) concerning the small sample properties of the usual t-test used for inference on the mean of independent normal variables: For significance levels of 8.3% or lower, the usual t-test remains conservative when the variances of the underlying independent Gaussian observations are not identical. This insight allows the construction of asymptotically valid tests for general correlated and heterogenous data in the following way: Assume that the data can be classified in a finite number q of groups that allow asymptotically independent normal inference about the scalar parameter of interest β. This means that there exist estimators β̂_j, estimated from groups j = 1,...,q, so that approximately β̂_j ~ N(β, v_j²), and β̂_j is approximately independent of β̂_i for j ≠ i. Typically, the estimator β̂_j will simply be the element of interest of the vector θ̂_j, where θ̂_j is the estimator of the model's parameter vector using group j data only. The observations β̂_1,...,β̂_q can then approximately be treated as independent normal observations with common mean β (but not necessarily equal variance), and the usual t-test concerning β constructed from β̂_1,...,β̂_q (with q − 1 degrees of freedom) is conservative. If the number of observations is reasonably large in all groups, the approximate normality β̂_j ~ N(β, v_j²) is of course a standard result for most models and estimators, linear or nonlinear.

© 2010 American Statistical Association
Journal of Business & Economic Statistics
October 2010, Vol. 28, No. 4
DOI: 10.1198/jbes.2009.08046

Knowledge about the correlation structure in the data is embodied in the assumption that β̂_1,...,β̂_q are (approximately) independent. In contrast to consistent variance estimators, only this finite amount of uncorrelatedness is directly required for

the validity of the t-statistic approach. What is more, by invoking the results of Müller (2008), we show that the t-statistic approach in some sense efficiently exploits the information contained in the assumption of asymptotically independent and Gaussian estimators β̂_1,...,β̂_q. Of course, a stronger (correct) assumption, that is, larger q, will typically lead to more powerful inference, so that one faces the usual trade-off between robustness and power in choosing the number of groups. In the benchmark case of underlying iid observations, a 5% level t-statistic based test with q = 16 equal sized groups loses at most 5.8 percentage points of asymptotic local power compared to inference with known (or correctly consistently estimated) asymptotic variance in an exactly identified GMM problem. The robustness versus power trade-off is thus especially acute only for datasets where an even coarser partition is required to yield independent information about the parameter of interest.

In applications of the t-statistic approach, the key question will nevertheless often be the adequate number and composition of groups. Some examples and Monte Carlo evidence for group choices are discussed in Section 3 below. The efficiency result mentioned above formally shows that one cannot delegate the decision about the adequate group choice to the data. On a fundamental level, some a priori knowledge about the correlation structure is required in order to be able to learn from the data. This is also true of other approaches to inference, although the assumed regularity tends to be more implicit. For instance, consider the problem of conducting inference about the mean real exchange rate in 40 years of quarterly data. It seems challenging to have a substantive discussion about the appropriateness of, say, a confidence interval based on Andrews' (1991) consistent long-run variance estimator, whose formal validity is based on primitive conditions involving mixing conditions and the like. [Or, for that matter, on Kiefer and Vogelsang's (2005) approach with a bandwidth of, say, 30% of the sample size.] At the same time, it seems at least conceivable to debate whether averages from, say, 8 year blocks provide approximately independent information; business cycle frequency fluctuations of the real exchange rate, for instance, would rule out the appropriateness of 4 year blocks. In our view, it is a strength of the t-statistic approach that its validity is based on such a fairly explicit regularity condition. At the end of the day, inference requires some assumption about potential correlations, and empirical researchers should agonize about the appropriate amount of regularity that is imposed on the data.

Our paper is related to previous work on inference procedures that do not rely on consistency of the variance estimator. In a time series context, Kiefer, Vogelsang, and Bunzel (2000) show that it is possible to conduct asymptotically justified inference in a linear time series regression based on long-run variance estimators with a nondegenerate limiting distribution. These results were extended and scrutinized by Kiefer and Vogelsang (2002, 2005) and Jansson (2004). Müller (2007) shows that all consistent long-run variance estimators lack robustness in a certain sense, and determines a class of inconsistent long-run variance estimators with some optimal trade-off between robustness and efficiency. Donald and Lang (2007) point out that inference in a setting with a finite number of clusters may be based on Student-t distributions with a finite number of degrees of freedom under an assumption that both the random effects and cluster averages of the individual disturbances are approximately iid Gaussian across clusters. Hansen (2007) finds that the asymptotic null distribution of test statistics based on the standard clustered error formula for a panel with one fixed dimension and one dimension tending to infinity becomes that of a Student-t with a finite number of degrees of freedom (suitably scaled), as long as the fixed dimension is "asymptotically homogeneous." Recent work by Bester, Conley, and Hansen (2009) builds on our paper and proposes inference about both scalar and vector valued parameters based on the usual full sample estimator, using clustered standard errors with groups chosen as suggested here and critical values derived in Hansen (2007). This approach requires the group design to be (asymptotically) identical across groups. Under this homogeneity, their procedure for inference about a scalar parameter is asymptotically numerically identical to the t-statistic approach under both the null and local alternatives. At the same time, the t-statistic approach remains valid even without this homogeneity, so from the usual first-order asymptotic point of view, the t-statistic approach is strictly preferable.

The t-statistic approach has an important precursor in the work of Fama and MacBeth (1973). Their work on empirical tests of the CAPM has motivated the following widespread approach to inference in panel regressions with firms or stocks as individuals: Estimate the regression separately for each year, and then test hypotheses about the coefficient of interest by using the t-statistic of the resulting yearly coefficient estimates. The Fama–MacBeth approach is thus a special case of our suggested method, where observations of the same year are collected in a group. While this approach is routinely applied, we are not aware of a formal justification. One contribution of this paper is to provide such a justification, and we find that as long as yearly coefficient estimators are approximately normal (or scale mixtures of normals) and independent, the Fama–MacBeth method results in valid inference even for a short panel that is heterogenous over time.

The t-statistic approach generalizes the previous literature on large sample inference without consistent variance estimation to a generic strategy that can be employed in different settings, such as in time series data, panel data, or spatially correlated data. Due to the small sample conservativeness result, the approach allows for unknown and unmodeled heterogeneity. In a time series context, for instance, this means that unlike Kiefer and Vogelsang (2005), we can allow for low frequency variability in second moments, and in a panel context, we do not require the asymptotic homogeneity as in Hansen (2007). Also, the t-statistic approach is very easy to implement, and does not require any new tables of critical values. The crucial regularity condition—the assumption that β̂_1,...,β̂_q are approximately independent and distributed N(β, v_j²)—is more explicit and may be easier to interpret than, say, the primitive conditions underlying consistent long-run variance estimators, or the value of the bandwidth as a fraction of the sample size in Kiefer and Vogelsang (2005). Perhaps most importantly from an econometric theory perspective, the t-statistic approach in some sense efficiently exploits the information contained in this regularity condition; to the best of our knowledge, this is the first general large sample efficiency claim about the test of a parameter value that does not involve consistent estimation of the asymptotic variance.

The rest of the paper is organized as follows: Section 2 reviews the small sample result by Bakirov and Székely (2005), and discusses the large sample validity and consistency of the t-statistic approach. Section 3 gives examples of group choices and provides Monte Carlo evidence for time series, panel, clustered, and spatially correlated data. We discuss the efficiency properties in Section 4, followed by concluding remarks in Section 5.

2. VALIDITY OF t-STATISTIC BASED INFERENCE

2.1 Small Sample Result

Let X_j, j = 1,...,q, with q ≥ 2, be independent Gaussian random variables with common mean E[X_j] = μ and variances V[X_j] = σ_j². The usual t-statistic for the hypothesis test

H_0: μ = 0 against H_1: μ ≠ 0    (1)

is given by

t = √q X̄/s_X,    (2)

where X̄ = q⁻¹ Σ_{j=1}^q X_j and s_X² = (q − 1)⁻¹ Σ_{j=1}^q (X_j − X̄)², and the null hypothesis is rejected for large values of |t|. [To be precise, we define t in (2) to be equal to zero if s_X = 0.] Note that |t| is a scale invariant statistic, that is, a replacement of {X_j}_{j=1}^q by {cX_j}_{j=1}^q for any c ≠ 0 leaves |t| unchanged. If σ_j² = σ² for all j, by definition, the critical value cv of |t| is given by the appropriate percentile of the distribution of a Student-t distributed random variable T_{q−1} with q − 1 degrees of freedom.

In a recent paper, Bakirov and Székely (2005) show that for a given critical value, the rejection probability under the null hypothesis of a test based on |t| is maximized when σ_1² = ··· = σ_k² and σ_{k+1}² = ··· = σ_q² = 0 for some 1 ≤ k ≤ q. Their results imply the following theorem:

Theorem 1 (Bakirov and Székely 2005). Let cv_q(α) be the critical value of the usual two-sided t-test based on (2) of level α, that is, P(|T_{q−1}| > cv_q(α)) = α, and let Φ denote the cumulative distribution function of a standard normal random variable.

(i) If α ≤ 2Φ(−√3) = 0.08326..., then for all q ≥ 2,

sup_{σ_1²,...,σ_q²} P(|t| > cv_q(α) | H_0) = P(|T_{q−1}| > cv_q(α)) = α.    (3)

(ii) Equation (3) also holds true for 2 ≤ q ≤ 14 if α ≤ α_1 = 0.1, and for q ∈ {2, 3} if α ≤ α_2 = 0.2. Moreover, define

cv̄_q(α_i) = [k_i(q − 1) cv_{k_i}(α_i)² / (q(k_i − 1) + (q − k_i) cv_{k_i}(α_i)²)]^{1/2},  i ∈ {1, 2},

where k_1 = 14 and k_2 = 3. Then for q ≥ k_i + 1, sup_{σ_1²,...,σ_q²} P(|t| > cv̄_q(α_i) | H_0) = α_i.

The usual 5% level two-sided test of (1) based on the usual t-test thus remains valid for all values of {σ_1²,...,σ_q²}, and all q ≥ 2. Also, by symmetry of the t-statistic under the null hypothesis, Theorem 1(ii) implies conservativeness of the usual one-sided t-test of significance level 5% or lower as long as q ≤ 14. For q ≥ 15, however, the rejection probability of a 10% level two-sided test (or a 5% level one-sided test) under the null hypothesis is maximized at σ_1² = ··· = σ_14² and σ_15² = ··· = σ_q² = 0. So usual two-sided t-tests of level 10% are not automatically conservative for large q, and the appropriate critical value of a robust test is a function of the critical value of the usual t-test when q = 14. In the following, our focus is on the empirically most relevant case of two-sided tests of level 5% or lower.

One immediate application of Theorem 1 concerns the construction of confidence intervals for μ: a confidence interval for μ of level C ≥ 95% based on the usual formulas for iid Gaussian observations has effective coverage level of at least C for all values of {σ_1²,...,σ_q²}. As long as the realized value of |t| is larger than the smallest cv_q(α) for which (3) holds, also p-values constructed from the cumulative distribution function of the Student-t distribution maintain their usual interpretation as the lowest significance level at which the test still rejects.

As is stressed by Bakirov and Székely (2005), a further implication of Theorem 1 is the conservativeness of the usual t-test against iid observations that are scale mixtures of Gaussian variates: Let Y_j = μ + Z_j V_j, where Z_j ~ iid N(0, 1) and V_j is iid and independent of {Z_j}_{j=1}^q. Then by Theorem 1, the usual t-test based on {Y_j}_{j=1}^q of the null hypothesis (1) of level 5% or lower is conservative conditional on {V_j}_{j=1}^q, and hence also unconditionally. The usual t-test of level 5% or lower thus yields a valid test for the median (which is equal to the mean, if it exists) of iid observations with a distribution that can be written as a scale mixture of normals. This is a rather large class of distributions: it includes, for instance, the Student-t distribution with arbitrary degrees of freedom (including the Cauchy distribution), the double exponential distribution, the logistic distribution, and all symmetric stable distributions. More generally, as long as {V_j}_{j=1}^q is independent of {Z_j}_{j=1}^q, Theorem 1 and the conditioning argument above imply conservativeness of the usual t-test of significance level 5% or lower, with an arbitrary joint distribution of {V_j}_{j=1}^q.

Figure 1 depicts the null rejection probability of the 5% level two-sided t-test for q = 4, 8, and 16 when (i) there are two equal sized groups of iid Gaussian observations, and the ratio of their variances is equal to a²: for i, j ≤ q/2, σ_i² = σ_j², σ_{q+1−i}² = σ_{q+1−j}², and σ_1²/σ_q² = a², and (ii) all observations except one are of the same variance, that is, for i, j ≥ 2, σ_i² = σ_j², and σ_1²/σ_q² = a². Due to the scale invariance, the description in terms of the ratio of variances is without loss of generality. Rejection probabilities in Figure 1 (and Figures 2 and 3 in Sections 2.3 and 4.3, respectively) were computed by numeric inversion of the characteristic function of the appropriate Gaussian quadratic form; see Imhof (1961). As can be seen from Figure 1, for small q, the null rejection probability can be much lower than the nominal level, but for q = 16, it does not drop much below 4% in either scenario.

2.2 Asymptotic Validity and Consistency

Our main interest in the small sample results on the t-statistic stems from the following application: Suppose we want to do inference on a scalar parameter β of an econometric model in

Figure 1. Null rejection probabilities of 5% level t-tests with q independent observations.

a large dataset with n observations. For a wide range of models and estimators β̂, it is known that √n(β̂ − β) ⇒ N(0, σ²) as n → ∞, where "⇒" denotes convergence in distribution. Suppose further that the observations exhibit correlations of largely unknown form. If such correlations are pervasive and pronounced enough, then it will be very challenging to consistently estimate σ², and inference procedures for β that ignore the sampling variability of a candidate estimator σ̂² will have poor small sample properties.

Now consider a partition of the original dataset into q ≥ 2 groups, with n_j observations in group j, and Σ_{j=1}^q n_j = n. Denote by β̂_j the estimator of β using observations in group j only. Suppose the groups are chosen such that √n(β̂_j − β) ⇒ N(0, σ_j²) for all j, and, crucially, such that √n(β̂_j − β) and √n(β̂_i − β) are asymptotically independent for i ≠ j—this amounts to the convergence in distribution

√n(β̂_1 − β, ..., β̂_q − β)′ ⇒ N(0, diag(σ_1², ..., σ_q²)),  max_{1≤j≤q} σ_j² > 0.    (4)

The asymptotic Gaussianity of √n(β̂_j − β), j = 1,...,q, typically follows from the same reasoning as the asymptotic Gaussianity of the full sample estimator β̂. The argument for an asymptotic independence of β̂_j and β̂_i for i ≠ j, on the other hand, depends on the choice of groups and the details of the application.

Under (4), for large n, the q estimators β̂_j, j = 1,...,q, are approximately independent Gaussian random variables with common mean β and variances σ_j²/n. Thus, by Theorem 1 above, one can perform an asymptotically valid test of level α, α ≤ 0.083, of H_0: β = β_0 against H_1: β ≠ β_0 by rejecting H_0 when |t_β| exceeds the (1 − α/2) percentile of the Student-t distribution with q − 1 degrees of freedom, where t_β is the usual t-statistic

t_β = √q (β̂ − β_0)/s_β̂    (5)

with β̂ = q⁻¹ Σ_{j=1}^q β̂_j and s_β̂² = (q − 1)⁻¹ Σ_{j=1}^q (β̂_j − β̂)². By Theorem 1 and the Continuous Mapping Theorem, this inference is asymptotically valid whenever (4) holds, irrespective of the values of σ_j², j = 1,...,q. Also, by implication, the confidence interval β̂ ± cv s_β̂/√q, where cv is the usual (1 + C)/2 percentile of the Student-t distribution with q − 1 degrees of freedom, has asymptotic coverage of at least C for all C ≥ 0.917.

An important class of models that typically induce (4) with an appropriate choice of groups are Hansen's (1982) Generalized Method of Moments (GMM) models. Suppose the moment condition is E[g(θ, y_i)] = 0, where g is a known k × 1 vector valued function, θ is a l × 1 vector of parameters (l ≤ k) and y_i, i = 1,...,n, are possibly vector-valued observations. Without loss of generality, assume that the first element of θ is the parameter of interest β, so that the last l − 1 elements of θ are nuisance parameters. Denote by G_j the set of indices of group j observations, such that y_i is in group j if and only if i ∈ G_j. Assume that the GMM estimator θ̂_j based on group j observations G_j satisfies

√n(θ̂_j − θ) = (Γ_j′Φ_jΓ_j)⁻¹Γ_j′Φ_jQ_j + o_p(1),    (6)

where n⁻¹ Σ_{i∈G_j} ∂g(a, y_i)/∂a′ |_{a=θ̂_j} →_p Γ_j, with Γ_j of full rank and nonstochastic for all j, Φ_j is the nonstochastic full rank limit of the weighting matrix for the GMM estimator θ̂_j, and Q_j = n^{−1/2} Σ_{i∈G_j} g(θ, y_i) ⇒ N(0, Λ_j). In addition, suppose that the GMM estimators are asymptotically independent, which requires (Q_1′, ..., Q_q′)′ ⇒ N(0, diag(Λ_1, ..., Λ_q)). These assumptions follow from the usual linearization arguments under appropriate conditions. As a consequence, (4) holds, so that the t-statistic approach yields valid inference about β.

For some applications, a slightly more general regularity condition than (4) is useful: Suppose

{m_n(β̂_j − β)}_{j=1}^q ⇒ {Z_j V_j}_{j=1}^q    (7)

for some positive sequence m_n → ∞, where Z_j ~ iid N(0, 1), the random variables {V_j}_{j=1}^q are independent of {Z_j}_{j=1}^q and max_j |V_j| > 0 almost surely. As discussed in Section 2.1, (7) accommodates convergences (at an arbitrarily slow rate) to independent but potentially heterogeneous scale mixtures of standard normal distributions, such as the family of strictly stable symmetric distributions, and also convergences to conditionally normal variates which are unconditionally dependent through their second moments. Under (7), inference based on t_β remains asymptotically valid conditionally on {V_j}_{j=1}^q by the Continuous Mapping Theorem and an application of Theorem 1, and thus also unconditionally.

Under the fixed alternative β ≠ β_0, (4) or (7) imply that s_β̂ = o_p(1) and β̂ − β = o_p(1), so that P(|t_β| > cv) → 1 for all cv > 0 and a test based on |t_β| is consistent at any level of significance. Under fixed heterogeneous alternatives of the null hypothesis β = β_0, with the true value of β in group j given by β_j (and β_j ≠ β_i for some j and i), β̂_j →_p β_j for j = 1,...,q, and a test based on |t_β| with critical value cv is consistent if

q(β̄ − β_0)² / [(q − 1)⁻¹ Σ_{j=1}^q (β_j − β̄)²] > cv²,    (8)

where β̄ = q⁻¹ Σ_{j=1}^q β_j. Especially for small q and large cv, (8) might not be satisfied when {β_j − β_0}_{j=1}^q are very heterogenous, even when all β_j − β_0 are of the same sign. On the other hand, a calculation shows that for q ≥ 7, a 5% level test is consistent for all alternatives {β_j − β_0}_{j=1}^q of equal sign that are no more heterogeneous (in the majorization sense, see Marshall and Olkin 1979) than β_1 − β_0 = ··· = β_⌊q/2⌋ − β_0 = 0 and β_⌊q/2⌋+1 − β_0 = ··· = β_q − β_0 ≠ 0, where ⌊·⌋ denotes the greatest lesser integer function.

2.3 Size Control Under Dependence

Tests of level 5% or lower based on t_β are asymptotically valid whenever (4) holds. As usual, when applying this result in small samples, one will incur an approximation error, as the distribution of {β̂_j}_{j=1}^q will not be exactly that of a sequence of independent normals with common mean β. In particular, depending on the application, the estimators from different groups β̂_j might not be exactly independent. We now briefly investigate what kind of correlations are necessary to grossly distort the size of tests based on t_β, while maintaining the assumption of multivariate Gaussianity.

Specifically, we consider two correlation structures for {β̂_j}_{j=1}^q: (i) β̂_j are a strictly stationary autoregressive process of order one [AR(1)], that is, the correlation between β̂_i and β̂_j is ρ^|i−j|; (ii) {β̂_j}_{j=1}^q has the correlation structure of a random effects model, that is, the correlation between β̂_i and β̂_j is ρ for i ≠ j. For both cases, we consider the two types of variance heterogeneity discussed above, with either two equal-sized identical variance groups of relative variance a², or all observations of equal variance except for one of relative variance a². Figure 2

Figure 2. Null rejection probabilities of 5% level t-tests with q correlated observations.

depicts the effective size of 5% level two-sided t-tests under these four scenarios. As might be expected, a negative ρ leads to underrejections throughout. More interestingly, t-tests for small q are somewhat robust against correlations in the underlying observations. This effect becomes especially pronounced if combined with strong heterogeneity in the variances: with a = 5, ρ needs to be larger than 0.4 before the null rejection probability of a t-test based on q = 4 observations exceeds the nominal level in both the AR(1) and the random effects model for both types of variance heterogeneity. But even in the case of equal variances, the size of a test based on q = 4 observations exceeds 7.5% only when ρ is larger than 0.18 in the AR(1) model. So while (4) is the essential assumption of the approach suggested here, inference based on t_β continues to have quite reasonable properties as long as the dependence in {β̂_j}_{j=1}^q is weak, especially when q is small.

3. APPLICATIONS

We now discuss applications of the t-statistic approach, and provide some Monte Carlo evidence on its performance relative to alternative approaches. Specifically, we consider time series data, panel data, data where observations are categorized in clusters, and spatially correlated data. The Monte Carlo evidence focuses on inference about OLS linear regression coefficients. This is for convenience and comparability to other simulation studies in the literature, since the t-statistic approach is also applicable to instrumental variable regressions and nonlinear models, as noted above. Also, we mostly consider data generating processes where the variances of the β̂_j are similar. This is again to ensure comparability with other simulation studies, and it also represents the case where the theoretical results above predict size control to be most difficult for the t-statistic approach.

3.1 Time Series Data

With observations ordered in time, the default assumption driving most of time series inference is that the further apart the observations, the weaker their potential correlation. For the t-statistic approach, in absence of more specific information regarding the potential time series correlation, this suggests dividing the sample of size T into q (approximately) equal sized groups of consecutive observations: the observation indexed by t, t = 1,...,T, is element of group j if t ∈ G_j = {s: (j − 1)T/q < s ≤ jT/q} for j = 1,...,q. The smaller q, the less approximate independence in time is imposed by this group choice.

Consider a GMM setup, as discussed in Section 2.2 above, with a k × 1 moment condition, a l × 1 parameter vector θ, and the scalar parameter of interest β is the first element of θ. Under a wide range of assumptions on the underlying model and observations, the partial-sample GMM estimator θ̂(r, s) computed from the observations t with rT < t ≤ sT satisfies (see, e.g., Andrews 1993 and Hansen 2000)

√T(θ̂(r, s) − θ) ⇒ (∫_r^s ϒ(λ) dλ)⁻¹ ∫_r^s h(λ) dW(λ)    (9)

for all 0 ≤ r < s ≤ 1, where ϒ(·) is a positive definite l × l nonstochastic function, h(·) is a l × k nonstochastic function, and W is a k × 1 standard Wiener process [cf. Equation (6) above]. For the groups chosen as above, by the Continuous Mapping Theorem, we obtain

√T (θ̂_1 − θ, θ̂_2 − θ, ..., θ̂_q − θ)′ ⇒
( (∫_0^{1/q} ϒ(λ) dλ)⁻¹ ∫_0^{1/q} h(λ) dW(λ),
  (∫_{1/q}^{2/q} ϒ(λ) dλ)⁻¹ ∫_{1/q}^{2/q} h(λ) dW(λ),
  ...,
  (∫_{(q−1)/q}^{1} ϒ(λ) dλ)⁻¹ ∫_{(q−1)/q}^{1} h(λ) dW(λ) )′

so that also {√T(β̂_j − β)}_{j=1}^q are asymptotically independent and Gaussian. Therefore, whenever (9) holds, t-statistic based inference is asymptotically valid for any q ≥ 2. The t-statistic approach can hence allow for asymptotically time varying information [nonconstant ϒ(·)] and pronounced variability in second moments [nonconstant h(·)]. In fact, using (7), the t-statistic approach remains valid even if ϒ(·) and h(·) are stochastic, as long as they are independent of W. In contrast, the approach of Kiefer and Vogelsang (2002, 2005) requires ϒ(·) and h(·) to be constant. There is substantial empirical evidence for persistent instabilities in the second moment of macroeconomic and financial time series: see, for instance, Bollerslev, Engle, and Nelson (1994), Kim and Nelson (1999), McConnell and Perez-Quiros (2000), and Müller and Watson (2008). The additional robustness of the t-statistic approach is thus arguably of practical relevance.

In fact, the t-statistic based approach suggested here is, to the best knowledge of the authors, the only known way of conducting asymptotically valid inference whenever (9) holds, at least under double-array asymptotics: Müller (2007) demonstrates that in the scalar location model, for any equivariant variance estimator that is consistent for the variance of Gaussian white noise, there exists a double array that satisfies a functional central limit theorem which induces the "consistent" variance estimator to converge in probability to an arbitrary positive value. Since all usual consistent long-run variance estimators are both scale equivariant and consistent for the variance of Gaussian white noise, none of these estimators yields generally valid inference under (9). The general validity of the t-statistic approach under (9) is thus an analytical reflection of its robustness.

Table 1 reports small sample properties of various approaches to inference. The small sample design is the one considered in Andrews (1991), Andrews and Monahan (1992), and Kiefer, Vogelsang, and Bunzel (2000) and concerns inference in a linear regression with 5 regressors. In addition to t-statistic based inference described above with q = 2, 4, 8, and 16 and groups G_j = {s: (j − 1)T/q < s ≤ jT/q}, we include in our study the approach developed by Kiefer and Vogelsang (2005) and usual inference based on two standard consistent long-run variance estimators. Specifically, we follow Kiefer and Vogelsang (2005) and focus on the quadratic spectral kernel estimator ω̂²_QS(b) and Bartlett kernel estimator ω̂²_BT(b) with bandwidths equal to a fixed fraction b ≤ 1 of the sample size.

Table 1. Small sample results in a time series regression with T = 128

                t-statistic (q)                           ω̂²QS(b)            ω̂²BT(b)
             2     4     8    16   ω̂²QA   ω̂²PW    0.05   0.3     1   0.05   0.3     1

Size, AR(1) ρ
 −0.5      4.7   4.7   5.0   5.1   10.1    9.4     8.5   6.9   6.1    9.0   7.7   7.5
  0        4.9   4.7   4.6   4.8    7.1    8.1     7.3   5.5   5.2    6.7   6.0   6.2
  0.5      4.8   4.6   4.6   4.9   10.4    9.9     9.0   6.0   6.1    9.4   7.5   7.0
  0.9      4.9   5.1   6.1   7.8   28.9   25.4    26.4  15.2  11.5   29.9  20.5  18.8
  0.95     5.1   5.3   7.0  10.2   37.8   32.4    36.3  21.2  14.7   40.3  28.2  25.5

Size, MA(1) φ
 −0.5      4.5   5.0   4.8   4.9    8.4    8.3     7.7   6.0   5.7    7.6   6.9   6.8
  0.5      5.0   5.1   5.2   5.4    8.9    8.6     7.9   6.2   6.1    8.1   6.9   6.6
  0.9      5.0   4.8   5.0   5.1    9.1    8.3     8.1   6.4   6.1    8.3   6.9   6.8
  0.95     4.9   4.8   5.0   5.1    9.1    8.3     8.1   6.4   6.0    8.4   7.0   6.8

Size adjusted power, AR(1) ρ
 −0.5     14.6  37.3  54.0  56.1   56.5   55.2    55.6  37.3  24.3   56.1  46.4  42.3
  0       15.1  38.4  53.7  50.0   62.7   60.6    59.0  41.3  27.2   60.7  51.9  47.2
  0.5     14.5  38.2  55.9  54.3   57.0   56.2    54.4  40.3  24.9   56.0  48.4  44.2
  0.9     17.2  56.7  77.6  78.3   57.5   54.6    58.3  42.7  27.7   58.7  51.4  46.6
  0.95    22.1  72.0  88.5  90.6   70.0   65.5    71.5  56.0  35.1   72.0  63.3  57.5

Size adjusted power, MA(1) φ
 −0.5     17.6  45.2  64.9  62.4   71.4   70.2    68.9  49.0  31.6   70.5  59.7  53.6
  0.5     16.5  44.5  63.3  64.8   69.3   67.2    67.1  47.1  28.0   68.5  57.7  53.2
  0.9     15.5  42.8  60.4  63.3   65.0   63.6    63.1  43.3  26.8   64.1  53.6  49.4
  0.95    15.7  42.8  60.4  63.3   64.8   63.6    63.0  43.3  27.0   64.0  53.4  49.1

NOTE: The entries are rejection probabilities of nominal 5% level two-sided t-tests about the coefficient β of the first element of Xt in the linear regression yt = Xt′θ + ut, t = 1, ..., T, where Xt = (xt′, 1)′, xt = (T⁻¹ Σ_{s=1}^T x̄s x̄s′)^{−1/2} x̄t, x̄t = x̃t − T⁻¹ Σ_{s=1}^T x̃s, and the elements of x̃t are four independent draws from a mean-zero, Gaussian, stationary AR(1) and MA(1) process of unit variance and common coefficients ρ and φ, respectively. The disturbances ut are an independent draw from the same model as the (pretransformed) regressors, multiplied by the first element of Xt. Unreported simulations for the other forms of heteroscedasticity considered by Andrews (1991) yield qualitatively similar results. Under the alternative, the difference between the true and hypothesized coefficient of interest was chosen as 4/√(T(1 − ρ²)) in the AR(1) model and as 5/√T in the MA(1) model. See text for a description of the test statistics. Based on 10,000 replications.

the sample size, with asymptotic critical values as provided by Kiefer and Vogelsang (2005) in their table 1. For standard inference based on consistent long-run variance estimators, we include the quadratic spectral estimator ω̂²QA with an automatic bandwidth selection using an AR(1) model for the bandwidth determination as suggested by Andrews (1991), and an AR(1) prewhitened long-run variance estimator ω̂²PW with a second stage automatic bandwidth quadratic spectral kernel estimator as described in Andrews and Monahan (1992), where the critical values are those from a standard normal distribution.

As can be seen from Table 1, the t-statistic approach is remarkably successful at controlling size; the only instance of a moderate size distortion occurs in the AR(1) model with ρ ≥ 0.9 and q ≥ 8. This performance may be understood by observing that strong correlations (which induce overrejection) co-occur with strong heterogeneity in the group design matrices (which induces underrejection) for the considered data generating process. In contrast, the tests based on the consistent estimators and the fixed-b asymptotic approach lead to much more severe overrejections.

For the computations of size adjusted power, the magnitude of the alternative was chosen to highlight differences. For moderate degrees of dependence, tests based on ω̂²QA and ω̂²PW, as well as on ω̂²QS(b) and ω̂²BT(b) with b small, have larger size corrected power than the t-statistic, with especially large differences for q small. On the other hand, the t-statistic approach can be substantially more powerful than any of the other tests in highly dependent scenarios.

The group estimators β̂j, j = 1, ..., q, may fail to be approximately normal because the underlying random variables are too heavy tailed, so that the limiting law is instead an α-stable distribution with α < 2. McElroy and Politis (2002) stress that few options are available for inference in time series models with such innovations, even for inference about a mean. But as long as, for some sequence mT, the mT(β̂j − β), j = 1, ..., q, are asymptotically independent with strictly α-stable symmetric limiting distributions, the discussion around Equation (7) implies that the t-statistic approach remains applicable. Note that this approach does not require knowledge of mT, which typically depends on the tail index α. We provide some Monte Carlo evidence on the favorable properties of the t-statistic approach relative to subsampling studied by McElroy and Politis (2002) in the supplementary materials.

460 Journal of Business & Economic Statistics, October 2010
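The group construction Gj = {s : (j − 1)T/q < s ≤ jT/q} and the resulting t-test are straightforward to implement. Below is a minimal Python sketch for inference about a scalar mean (illustrative code written for this note, not taken from the paper; the critical values are standard Student-t table values):

```python
import math

# two-sided 5% critical values of the Student-t distribution with q - 1
# degrees of freedom (standard table values)
CV_5PCT = {2: 12.706, 4: 3.182, 8: 2.365, 16: 2.131}

def group_t_stat(y, q, beta0=0.0):
    """Split the T observations into q consecutive groups
    Gj = {s : (j - 1)T/q < s <= jT/q}, estimate the mean in each group,
    and form the usual t-statistic from the q group estimates."""
    T = len(y)
    betas = []
    for j in range(1, q + 1):
        lo, hi = (j - 1) * T // q, j * T // q  # 0-based slice bounds for group j
        betas.append(sum(y[lo:hi]) / (hi - lo))
    bbar = sum(betas) / q
    s2 = sum((b - bbar) ** 2 for b in betas) / (q - 1)
    return math.sqrt(q) * (bbar - beta0) / math.sqrt(s2)

t = group_t_stat(list(range(1, 9)), q=4)  # group means 1.5, 3.5, 5.5, 7.5
reject = abs(t) > CV_5PCT[4]              # 5% level test of beta = 0
```

The last two lines carry over unchanged to any setting in which β̂j can be computed group by group; only the first step, how the q estimates are obtained, changes with the application.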

Table 2. Small sample results in a panel with N = 10, T = 50, and time series correlation

                                Homoskedastic                        Heteroskedastic
ρx     ρu    (β − β0)√nT   t-stat   clust   cl.FE   (β − β0)√nT   t-stat   clust   cl.FE

Size
0      0          0          5.0     5.2     5.1        0           4.6     4.9     4.9
0.5    0.5        0          4.9     5.3     5.1        0           4.6     5.5     5.3
0.9    0.5        0          5.3     6.5     6.2        0           4.4     7.7     6.4
0.9    0.9        0          5.0     7.2     6.2        0           4.0     7.9     6.2
1      0.5        0          4.4     8.7     6.8        0           3.8    14.7     8.6

Size adjusted power
0      0          2.5       58.7    60.8    59.7        5          54.6    51.9    52.0
0.5    0.5        3         53.9    55.0    55.1        8          64.7    56.6    58.2
0.9    0.5        8         56.7    71.5    60.8        8          62.0    49.1    48.9
0.9    0.9        0.8       39.4    31.9    38.3       25          62.9    33.6    46.2
1      0.5       12         42.5    83.6    50.8      180          73.7    52.1    45.1

NOTE: The entries are rejection probabilities of nominal 5% level two-sided t-tests about the coefficient β of xi,t in the linear regression yi,t = Xi,t′θ + ui,t, i = 1, ..., N, t = 1, ..., T, where Xi,t = (xi,t, 1)′, xi,t = ρx xi,t−1 + εi,t, xi,0 = 0, εi,t ∼ iid N(0, 1), ui,t = ρu ui,t−1 + ηi,t, ui,0 = 0, where under homoscedasticity, ηi,t ∼ iid N(0, 1) independent of {εi,t}, and under heteroscedasticity, ηi,t = (0.5 + 0.5x²i,t)η̃i,t and η̃i,t ∼ iid N(0, 1) independent of {εi,t}. The considered tests are the t-statistic approach with groups defined by individuals ("t-stat"); OLS coefficient based tests with Rogers (1993) standard errors ("clust"); and an OLS coefficient based test which includes individual fixed effects and Arellano (1987) standard errors ("cl.FE"). The critical value for the clustered t-statistics was chosen from the appropriate quantile of a Student-t distribution with N − 1 degrees of freedom, scaled by √(N/(N − 1)). Based on 10,000 replications.
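The "t-stat" entries of Table 2 combine per-individual OLS estimates through the usual t-statistic; with time periods as groups instead, the same second step gives the Fama–MacBeth t-statistic discussed in Section 3.2. A minimal illustrative sketch (our own code, with made-up data):

```python
import math

def ols_slope(x, y):
    """OLS slope of y on x (with intercept) from one group's observations."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / sxx

def group_t_stat(estimates, beta0=0.0):
    """Usual t-statistic from q asymptotically independent group estimates;
    compare |t| with the Student-t critical value with q - 1 degrees of freedom."""
    q = len(estimates)
    bbar = sum(estimates) / q
    s2 = sum((b - bbar) ** 2 for b in estimates) / (q - 1)
    return math.sqrt(q) * (bbar - beta0) / math.sqrt(s2)

# one slope per individual (q = N groups); made-up panel data
panel_x = [[0, 1, 2, 3], [0, 2, 4, 6], [1, 2, 3, 5]]
panel_y = [[1, 3, 5, 7], [0, 4, 8, 13], [2, 4, 6, 10]]
slopes = [ols_slope(x, y) for x, y in zip(panel_x, panel_y)]  # 2.0, 2.15, 2.0
t = group_t_stat(slopes, beta0=2.0)  # about 1.0: no rejection of beta = 2
```

Note that the second step never pools the raw data across groups, which is exactly why no homogeneity of the within-group variances is needed.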

3.2 Panel Data

Many empirical studies in economics are based on observing N individuals repeatedly over T time periods, and correlations are possible in either (or both) dimensions. In applications, it is typically assumed that, possibly after the inclusion of fixed effects, one of the dimensions is uncorrelated, and inference is based on consistent standard errors that allow for arbitrary correlation in the other dimension (Arellano 1987 and Rogers 1993). The asymptotic validity of these procedures stems from an application of a law of large numbers across the uncorrelated dimension. So if the uncorrelated dimension is small, one would expect these procedures to have poor finite sample properties, and our approach to inference is potentially attractive.

To fix ideas, consider a linear regression for the case where N is small and T is large,

    yi,t = Xi,t′θ + ui,t,  i = 1, ..., N, t = 1, ..., T,    (10)

where {Xi,t, ui,t}t=1,...,T are independent across i and E[Xi,t ui,t] = 0 for all i, t. Suppose that under T → ∞ asymptotics with N fixed, T⁻¹ Σ_{t=1}^T Xi,t Xi,t′ →p Ωi and T^{−1/2} Σ_{t=1}^T Xi,t ui,t ⇒ N(0, Σi) for all i, for some full rank matrices Ωi and Σi. These assumptions are enough to guarantee that the OLS coefficient estimators β̂i using data from individual i only are asymptotically independent and Gaussian, so the t-statistic approach with q = N groups is valid. Hansen (2007) derives a closely related result under "asymptotic homogeneity across i," that is, if Ωi = Ω and Σi = Σ for all i: in that case, the standard t-statistic for β based on the usual Rogers (1993) standard errors converges in distribution to a t-distributed random variable with N − 1 degrees of freedom, scaled by √(N/(N − 1)), under the null hypothesis. In fact, it is not hard to see that under asymptotic homogeneity across i, β̂ and s_β̂ in (5) of our approach are first-order asymptotically equivalent to β̂ and the appropriately scaled Rogers (1993) standard error under the null and local alternatives, so both approaches have the same asymptotic local power. The advantage of our approach is that it does not require asymptotic homogeneity to yield valid inference.

Table 2 provides some small sample evidence for the performance of these two approaches, with the same data generating process as considered by Kézdi (2004), with an AR(1) structure in both the regressor and the disturbances. Since β̂i, conditionally on {Xi,t}, is Gaussian with mean β, the t-statistic approach is exactly small sample conservative for this DGP. Hansen's (2007) asymptotic result is formally applicable for |ρx| < 1 and |ρu| < 1, as this DGP then is asymptotically homogeneous in the sense defined above. With a unit root in the regressors, however, T⁻² Σ_{t=1}^T Xi,t Xi,t′ does not converge to the same limit across i, so that despite the iid sampling across i, asymptotic homogeneity fails. These asymptotic considerations successfully explain the size results in Table 2. The t-statistic approach has higher size adjusted power for heteroscedastic disturbances, but this is not true under homoscedasticity.

For panel applications in finance with individuals that are firms, it is often the cross-section dimension for which uncorrelatedness is an unattractive assumption (see Petersen (2009) for an overview of popular corrections in finance). As noted in the Introduction, if one is willing to assume that there is no time series correlation, which is empirically plausible at least for stock returns, then our approach with time periods as groups becomes the so-called Fama–MacBeth approach: Estimate the model of interest for each time period j cross sectionally to obtain β̂j, and compute the usual t-statistic for the resulting q = T coefficient estimates. Our results formally justify this approach for T small and possible heterogeneity in the variances of β̂j. Note that the variances may be stochastic and dependent even in the limit, as in (7), which would typically arise when regression errors follow a stochastic volatility model with some common volatility component.

In corporate finance applications, or with overlapping long-term returns as dependent variable, one would typically not want to rule out additional dependence in the time dimension. Under the assumption that the correlation dies out over time, one could try to nonparametrically estimate the long-run variance of the sequence {β̂j}j=1,...,T using, say, the Newey and West (1987) estimator. However, this will require a long panel (T large) to yield reasonable inference. Our results suggest an alternative approach: Divide the data into fewer groups that span several consecutive time periods. For instance, with T = 24 at a yearly sampling frequency, one might form 8 groups of 3 year blocks, or, more conservatively, 4 groups of 6 year blocks. If the time series correlation is not too pronounced, then parameter estimators from different groups will have little correlation, and the t-statistic approach yields approximately valid inference. We conducted a Monte Carlo study of the performance of this approach relative to clustering and Fama–MacBeth standard errors in a panel with both cross sectional and time series dependence. We found substantially better size control of the t-statistic approach, but also somewhat smaller size-adjusted power. See the supplementary materials for details.

If a panel is very short and potential autocorrelations are large, then it might be more appealing to assume some independence in the cross section. For instance, in finance applications, one might be willing to assume that there is little correlation between firms of different industries, as in Froot (1989). Under this assumption, one could collect all firms of the same industry in the same group to obtain as many groups as there are different industries. If the parameter of interest is a regression coefficient of a regressor that varies within industry, then one could add time fixed effects in each group to guard against interindustry correlation from a yearly common shock that is independent of the other regressors. Alternatively, one can also combine independence assumptions in both dimensions by, say, forming twice as many groups as there are industries by splitting each industry group into two depending on whether t < T/2 or not. The theoretical results in Section 4.3 below suggest that there are substantial gains in power (more than 10% for 5% level tests) from such an additional independence assumption as long as q ≤ 8. Similar possibilities of group formation might be attractive for long-run performance evaluations in finance (see, e.g., Jegadeesh and Karceski (2009) for a discussion of inference based on consistent variance estimation), and panel analyses with individuals as countries and trade blocks or continents as one group dimension.

Recently, Bertrand, Duflo, and Mullainathan (2004) have also stressed the importance of allowing for time series correlation in panel difference-in-difference applications. This technique is popular to estimate causal effects, and it is usually implemented by a linear regression (10) with fixed effects in both dimensions. In a typical application, the individuals i = 1, ..., N are U.S. states, and the coefficient of interest β multiplies a binary regressor that describes some area specific intervention, such as the passage of a law. Donald and Lang (2007) show that if ui,t has an iid Gaussian random effect structure for each (potential) preintervention and postintervention area group, then correct inference is obtained for fixed N by a two-stage inference procedure using a Student-t critical value with an appropriate degrees of freedom correction. See Wooldridge (2003) for further discussion, and Conley and Taber (2005) for a possible approach when only few states were subject to the intervention, but many others were not. With the time fixed effects, it is obviously not possible to apply the t-statistic approach with groups defined as states. However, by collecting states into groups defined as larger geographical areas so that at least one of the states in each group was subject to the intervention, it again becomes possible to obtain estimators β̂j, j = 1, ..., q, from each group and to apply the t-statistic approach. This leads to a loss of degrees of freedom, but it has the advantage of yielding correct inference when the preintervention and postintervention specific random effects in ui,t are independent, but not necessarily identically distributed, scale mixtures of standard normals. This is a considerable weakening of the homogeneous Gaussian assumption required for the approach of Donald and Lang (2007). Also, if larger geographical areas are formed by collecting neighboring states, the t-statistic approach becomes at least partially robust to moderate spatial correlations. When the number of states in each of these larger areas is not too small (which for the U.S. then implies a relatively small q), one might appeal to the central limit theorem to justify the t-statistic approach to inference when the underlying random effects cannot be written as scale mixtures of standard normals.

3.3 Clustered Data

A further potential application of our approach is to draw inferences about a population based on a two-stage (or multistage) sampling design with a small number of independently sampled primary sampling units (PSUs). PSUs could be villages in a development study (see, e.g., Deaton 1997, chapters 1.4 and 2.2), or a small number of, say, city blocks in a large metropolitan area. One would typically expect that observations from the same PSU are more similar than those from different PSUs, which necessitates a correction of the standard errors. Note that PSUs are independent by sample design, so with PSUs as groups j = 1, ..., q, the only additional requirement of our approach is that the parameter of interest can be estimated by an approximately Gaussian and unbiased estimator β̂j from each PSU, j = 1, ..., q. Of course, this will only be possible if the parameter of interest is identified in each PSU; in a regression context, a coefficient on a regressor that only varies across PSUs cannot be estimated from one PSU only, at least as long as the regression contains a constant. In such cases, our approach is still applicable by collecting more than one PSU in each group.

As a stylized example, imagine a world where the only spatial correlation between household characteristics in the population arises through the fact that households in the same neighborhood are very similar to each other, and villages consist of, say, 30–80 neighborhoods. Consider a two-stage sample design with a simple random sample of 400 households within 12 villages as PSUs. Sample means β̂j of household characteristics of a single PSU are then approximately Gaussian with a mean that is equal to the national average β, and a variance that is a function of the number of neighborhoods. This variance is larger than that of a national simple random sample of the same size, so ignoring the clustering leads to incorrect inference, while our approach is approximately correct.

In some instances, it will be more appropriate to assume that all individuals from the same PSU are similar—think of the extreme case where all households in the same village are identical. In this case, there is no equivalent to the averaging over the neighborhoods, and one cannot appeal to the central limit theorem to argue for the approximation β̂j ∼ N(β, v²j). This setup would naturally lead to a random parameter model, where the household characteristic βj in PSU j is a random draw from the national distribution. In a slightly more general regression context, this leads to the random coefficient regression model (cf., for instance, Swamy 1970)

    Yi,j = Xi,j′θj + ui,j = Xi,j′θ + Xi,j′(θj − θ) + ui,j

for individual i = 1, ..., nj in PSU j = 1, ..., q, where E[Xi,j uj] = 0 and the θj are iid draws from some population with mean θ. Thought of as part of the disturbance term, Xi,j′(θj − θ) induces intra-PSU correlations. Now under sufficient regularity conditions, θ̂j − θj →p 0 as nj → ∞, and the t-statistic approach for inference about β (the first element of θ) remains valid as long as the distribution of βj − β can be written as a scale mixture of standard normals. This is a wide class of distributions, as noted in Section 2.1 above. If nj is not large enough to make θ̂j − θj →p 0 a good approximation, but instead β̂j|βj ∼ iid N(βj, v²j), then the t-statistic approach remains valid as long as βj − β is a scale mixture of standard normals, as the induced unconditional distribution of β̂j − β then is a scale mixture of standard normals, too.

The need for clustering might arise in a more subtle way depending on the relationship between the sampling scheme and the population of interest. For example, suppose we want to study labor supply based on a large iid sample from U.S. households, which are located in, say, 12 different regions. Similar to the example above, assume that each region consists of, say, 30–80 different metropolitan and rural areas, and that the characteristics of these areas induce similar behavior of households, so that there are effectively about 500 different types of households. Of course, in a large sample, we will have many observations from the same area, which are quite similar to each other. Nevertheless, the usual (small) standard errors, based on the total number of observations, are applicable by definition of an iid sample for statements about labor supply in the current U.S. population. But if the study's results are to be understood as generic statements about labor supply, then the relevant population becomes households in all kinds of circumstances, and the iid sample from U.S. households is no longer iid in this larger population. Instead, it makes sense to think of the 12 regions as independently sampled PSUs of this superpopulation, and apply our approach with the regions as groups. As pointed out by Moulton (1990), ignoring this clustering often leads to very different results.

3.4 Spatially Correlated Data

Inference with spatially correlated data is usually justified by a similar reasoning as with time series observations: more distant observations are less correlated. With enough assumptions on the decay of correlations between distant observations, consistent parametric and nonparametric variance estimators of spatially correlated data can be derived—see Case (1991) and Conley (1999). "Distance" here can mean physical distance between geographical units (country, county, city, and so forth), but may also be thought of as distance in some economic sense. Conley and Dupor (2003), for instance, use metrics based on input-output relations to measure the distance between different sectors of the U.S. economy.

For the t-statistic approach suggested here, an assumption of correlations decaying as a function of distance suggests constructing the q groups out of blocks of neighboring observations. If the groups are carefully chosen, then under asymptotics where there are more and more observations in each of the q groups, most observations are sufficiently far away from the "borders." The variability of the group estimators is thus dominated by observations that are essentially uncorrelated with observations from other groups. Furthermore, the averaging within each group yields asymptotic Gaussianity for each β̂j, so that under sufficiently strong regularity conditions, the t-statistic based inference is valid. Bester, Conley, and Hansen (2009) provide such sufficient primitive conditions for linear regressions.

We investigate the relative performance of the t-statistic approach and inference based on consistent variance estimators in a Monte Carlo exercise as follows: We are interested in conducting inference about the mean β of n = 128 observations which are located on a rectangular array of unit squares with 8 rows and 16 columns (two checker boards side by side). The observations are generated such that in the Gaussian case, the correlation of two observations is given by exp(−φd) for some φ > 0, where d is the Euclidian distance between the two observations. We also consider disturbances with a mean corrected chi-squared distribution with one degree of freedom. As can be seen from Table 3, the t-statistic approach is more successful at controlling size than inference based on the consistent variance estimators. The asymmetry in the error distribution has only a relatively minor impact on size control. Size corrected power of the t-statistics increases in q, but is always smaller than the size corrected power of tests based on the nonparametric spatial consistent variance estimators suggested by Conley (1999) with a small bandwidth b ≤ 2, which includes the OLS variance estimator as a special case.

4. EFFICIENCY OF t-STATISTIC BASED INFERENCE

We now turn to a discussion of the efficiency properties of the t-statistic based approach to large sample inference. We start by establishing a small sample optimality result for the t-statistic in Section 4.1. This in turn yields a corresponding large sample "efficiency under robustness" result by an application of the recent results in Müller (2008), as discussed in Section 4.2. In particular, these results imply the impossibility of using data dependent methods to automatically select the number or composition of groups while maintaining robustness. Finally, in Section 4.3, we compare the efficiency of the t-statistic based approach to inference to the benchmark case of known (or, equivalently, correctly consistently estimated) asymptotic variance.

4.1 Small Sample Result

Theorem 1 provides conditions under which the usual small sample t-test remains a valid test. We now turn to a discussion of the small sample optimality of the t-statistic (2) when the underlying Gaussian variates Xj ∼ N(μ, σ²j) are not necessarily

Table 3. Small sample results in a location problem with spatial correlation, n = 128

              t-statistic (q)                 ω̂²SA(b)                 ω̂²WA(b)
           2     4     8    16        0     2     4     8        2     4     8

Size, Gaussian errors
φ = ∞    5.0   5.0   5.1   5.1      5.5   6.2   7.9  13.2      8.1  14.9  19.1
φ = 2    5.1   5.4   5.9   7.5     15.1  11.0  10.4  15.8      8.0  14.9  21.0
φ = 1    5.6   7.5  10.6  16.9     39.8  26.4  19.6  22.8     16.5  17.4  25.0

Size, mean corrected chi-squared errors
φ = ∞    5.0   5.4   5.7   6.3      6.5   7.1   8.4  13.4      8.8  14.7  19.1
φ = 2    5.5   6.5   7.0   8.0     13.5  10.9  11.1  16.2      9.5  16.0  21.7
φ = 1    5.7   9.5  12.8  17.9     35.3  25.8  20.6  23.8     17.9  19.5  26.9

Size adjusted power, Gaussian errors
φ = ∞   15.4  40.0  56.8  64.1     68.8  67.7  65.1  60.8     62.5  41.1  31.3
φ = 2   15.6  43.0  59.0  67.5     71.3  70.8  67.8  62.4     68.1  46.2  31.8
φ = 1   15.4  41.1  57.1  64.0     69.6  67.7  63.9  57.8     66.4  49.7  30.8

Size adjusted power, mean corrected chi-squared errors
φ = ∞   15.5  34.8  52.0  60.3     67.8  67.0  63.0  58.8     59.7  36.2  29.4
φ = 2   14.1  36.9  59.8  69.5     76.9  75.3  70.8  63.3     70.3  43.0  31.3
φ = 1   15.2  35.7  52.3  64.7     79.3  72.4  62.2  53.3     65.0  39.4  28.4

NOTE: The entries are rejection probabilities of nominal 5% level two-sided t-tests about β in the model yi,j = β + ui,j, i = 1, ..., 8, j = 1, ..., 16. Under Gaussian errors, the ui,j are multivariate mean zero unit variance Gaussian with correlation between ui,j and ul,k given by exp(−φ√((i − l)² + (j − k)²)), and the mean corrected chi-squared errors were generated by ui,j = Φ⁻¹_{χ²−1}(Φ(ũi,j)), where ũi,j are the Gaussian model disturbances, Φ is the cdf of a standard normal and Φ⁻¹_{χ²−1} is the inverse of the cdf of a mean corrected chi-squared random variable. The considered tests are the t-statistic approach with groups of spatial dimension 8 × 8, 8 × 4, 4 × 4, and 2 × 4, at the obvious locations; and inference based on ȳ = n⁻¹ Σ_{i=1}^8 Σ_{j=1}^16 yi,j with two versions of Conley's (1999) nonparametric spatial consistent variance estimators of bandwidth b: a simple average ω̂²SA(b) of all cross products (yi,j − ȳ)(yk,l − ȳ), i, k = 1, ..., 8, j, l = 1, ..., 16, of Euclidian distance d ≤ b, and a weighted average ω̂²WA(b) of these cross products, with weights w(i, j, k, l) = 1[w̃(i, j, k, l) > 0] w̃(i, j, k, l) and w̃(i, j, k, l) = (1 − |i − k|/b)(1 − |j − l|/b) [cf. equation (3.14) of Conley (1999)]. Alternatives were chosen as β − β0 = c/√n with c = 2.5, 3.4, 5.7 under Gaussian disturbances and c = 3.5, 4.7, 8 under chi-squared errors for φ = ∞, 2, 1, respectively. Based on 10,000 replications.

of equal variance. Recall that if the variances are identical, then the usual two-sided t-test is not only the uniformly most powerful unbiased test of (1), but also the uniformly most powerful scale invariant test (see Ferguson 1967, p. 246). For a significance level of 5% or lower, Theorem 1 shows that the null rejection probability for the t-test never exceeds the nominal level for heterogeneous variances. So if we consider the hypothesis test

    H0: μ = 0 and {σ²j}j=1,...,q arbitrary  against
    H1: μ ≠ 0 and σ²j = σ² for all j    (11)

and restrict attention to scale invariant tests, then the least favorable distribution for the q dimensional nuisance parameter {σ²j}j=1,...,q is the case of equal variances. In other words, the usual t-test is the optimal scale invariant test of (11) for any given alternative μ ≠ 0 when the level constraint is most difficult to satisfy. By theorem 7 of Lehmann (1986, pp. 104–105), we thus have the following result.

Theorem 2. Let α and q be such that (3) holds. A test that rejects the null hypothesis for |t| > cvq(α) is the uniformly most powerful scale invariant level α test of (11).

If one is uncertain about the actual variances of Xj, and considers the case of equal variances a plausible benchmark, then the usual 5% level t-test maximizes power against such benchmark alternatives in the class of all scale invariant tests. Since the one-sided t-test is also known to be the uniformly most powerful invariant test under the (sign-preserving) scale transformations {Xj}j=1,...,q → {cXj}j=1,...,q for c > 0 (Ferguson 1967, p. 246), the analogous result also holds for the one-sided t-test of small enough level.

Note that this optimality result is driven by the conservativeness of the usual t-test. For α = 10% and q = 20, say, according to Theorem 1, the critical value of the t-statistic must be amended to induce conservativeness. The resulting test is thus not optimal when σ²j = σ² for all j under both H0 and H1. It is also not optimal against the worst case alternative with 14 variances identical and 6 variances zero—the optimal test against such an alternative would certainly exploit that if 6 equal realizations of Xj are observed, they are known to be equal to μ.

4.2 Asymptotic Efficiency and the Choice of Groups

Suppose the n observations in the potentially correlated large data set are of dimension l × 1, so that the overall data Yn is an element of R^{ln}. In general, tests ϕn of H0: β = β0 are sequences of (measurable) functions from R^{ln} to the unit interval, where ϕn(yn) ∈ [0, 1] indicates the probability of rejection conditional on observing Yn = yn. If ϕn takes on values strictly between zero and one then ϕn is a randomized test. As usual in large sample testing problems, consider the sequence of local alternatives β = βn = β0 + μ/√n, so that the null hypothesis becomes H0: μ = 0. Under such local alternatives, (4) implies

    {√n(β̂j − β0)}j=1,...,q ⇒ {Xj}j=1,...,q    (12)

where Xj, j = 1, ..., q, are as in the small sample Section 4.1 above, that is, Xj are independent and distributed N(μ, σ²j). Furthermore, by the Continuous Mapping Theorem, it also holds

$$\left\{\frac{\hat\beta_j-\beta_0}{\hat\beta-\beta_0}\right\}_{j=1}^{q}\;\Rightarrow\;\{R_j\}_{j=1}^{q}=\left\{\frac{X_j}{\bar X}\right\}_{j=1}^{q},\tag{13}$$

where X̄ = q⁻¹∑_{j=1}^q X_j. The interest of (13) over (12) is that {R_j}_{j=1}^q is a maximal invariant of the group of transformations {X_j}_{j=1}^q → {cX_j}_{j=1}^q for c ≠ 0, so that in the "limiting problem" with {R_j}_{j=1}^q observed, Theorem 2 implies that a level-α test based on |t| = √q R̄/s_R = √q X̄/s_X, with R̄ = q⁻¹∑_{j=1}^q R_j and s_R² = (q−1)⁻¹∑_{j=1}^q (R_j − R̄)², maximizes power against alternatives with μ ≠ 0 and σ_j² = σ² for j = 1,...,q as long as α ≤ 0.083.

Let F_n(m, μ, {σ_j²}_{j=1}^q) be the distribution of Y_n in a specific model m with parameter β = β₀ + μ/√n and asymptotic variance of β̂_j equal to σ_j²; think of m as describing all aspects of the data generating mechanism beyond the parameters μ and {σ_j²}_{j=1}^q, such as the correlation structure of Y_n. The unconditional rejection probability of a test ϕ_n then is ∫ϕ_n dF_n(m, μ, {σ_j²}_{j=1}^q), and the asymptotic null rejection probability is limsup_{n→∞} ∫ϕ_n dF_n(m, 0, {σ_j²}_{j=1}^q).

The weak convergences (12) and (13) obviously only hold for some sequences of underlying distributions F_n(m, μ, {σ_j²}_{j=1}^q) of Y_n, that is, for some models m. The assumption (12) is an asymptotic regularity condition that restricts the dependence in Y_n in such a way that in large samples, each β̂_j provides independent and Gaussian information about the parameter of interest β. The plausibility of this assumption in any given application depends on the functional relationship between {β̂_j}_{j=1}^q and the data Y_n, and on the properties of Y_n. The convergence (13) is a very similar, but slightly weaker regularity condition. Denote by M_0^X and M_0^R the sets of models m for which (12) and (13), respectively, hold under the null hypothesis of μ = 0 (so that M_0^X ⊂ M_0^R), and analogously denote by M_1^X and M_1^R the sets of models m for which (12) and (13) hold pointwise for every μ ≠ 0.

A concern about strong and pervasive correlations in Y_n of largely unknown form means that little is known about the properties of F_n(m, μ, {σ_j²}_{j=1}^q). In an effort to obtain robust inference for a large set of possible data generating processes m, one might want to impose that level-α tests ϕ_n are asymptotically valid for all m ∈ M_0^X or m ∈ M_0^R, that is,

$$\limsup_{n\to\infty}\int\varphi_n\,dF_n\big(m,0,\{\sigma_j^2\}_{j=1}^{q}\big)\le\alpha\quad\text{for all }m\in\mathcal M_0\text{ and }\{\sigma_j^2\}_{j=1}^{q}\text{ with }\max_j\sigma_j^2>0\tag{14}$$

for M_0 = M_0^X or M_0 = M_0^R. The robustness constraint (14) is strong, as the asymptotic size requirement is imposed for all m ∈ M_0, which is attractive only if (12) or (13) summarizes all knowledge about properties of Y_n that are relevant for inference about β.

Denote by ϕ_n*(α) = 1[|t_β| > cv_q(α)] the ℝⁿ → {0, 1} test of asymptotic size α that rejects for large values of |t_β| as defined in (5), and note that, by scale invariance, t_β can also be computed from the observations {(β̂_j − β₀)/(β̂ − β₀)}_{j=1}^q. As discussed in Section 2.2, the test ϕ_n*(α) satisfies (14) for M_0 = M_0^X as long as α ≤ 0.083, and scale invariance implies that (14) also holds for M_0 = M_0^R. What is more, for any data generating process satisfying (13) under the alternative with μ ≠ 0, that is, for any m ∈ M_1^R or m ∈ M_1^X, ϕ_n*(α) has local asymptotic power lim_{n→∞} ∫ϕ_n*(α) dF_n(m, μ, {σ_j²}_{j=1}^q) for μ ≠ 0 equal to the power of the small sample t-test 1[|t| > cv(α)] in the "limiting problem" (1) with {X_j}_{j=1}^q (or {R_j}_{j=1}^q) observed.

An asymptotic efficiency claim about the t-statistic approach now amounts to the statement that no other test ϕ_n satisfying (14) has higher local asymptotic power. The following theorem, which follows straightforwardly from the general results in Müller (2008) and Theorem 2 above, provides such a statement for the case of equal asymptotic variances.

Theorem 3. (i) For any test ϕ_n that satisfies (14) for M_0 = M_0^R and α ≤ 0.083,

$$\limsup_{n\to\infty}\int\varphi_n\,dF_n\big(m,\mu,\{\sigma^2\}_{j=1}^{q}\big)\le\lim_{n\to\infty}\int\varphi_n^*(\alpha)\,dF_n\big(m,\mu,\{\sigma^2\}_{j=1}^{q}\big)$$

for all μ ≠ 0, σ² > 0, and m ∈ M_1^R.

(ii) Suppose there exists a group of transformations G_n(c) of Y_n that induces the transformations {β̂_j − β}_{j=1}^q → {c(β̂_j − β)}_{j=1}^q for c ≠ 0. For any test ϕ_n that is invariant to G_n and that satisfies (14) for M_0 = M_0^X and α ≤ 0.083, limsup_{n→∞} ∫ϕ_n dF_n(m, μ, {σ²}_{j=1}^q) ≤ lim_{n→∞} ∫ϕ_n*(α) dF_n(m, μ, {σ²}_{j=1}^q) for all μ ≠ 0, σ² > 0, and m ∈ M_1^X.

Part (i) of Theorem 3 shows the t-statistic approach to be asymptotically most powerful against the benchmark alternative of equal asymptotic variances among all tests that provide asymptotically valid inference under the regularity condition (13). Part (ii) contains the same claim under the slightly more natural condition (12) when attention is restricted to tests that are appropriately invariant. For example, in a regression context, an adequate underlying group of transformations is the multiplication of the dependent variable by c. Also, the analogous asymptotic efficiency statements hold for one-sided tests based on t_β of asymptotic level smaller than 4.1%.

Note that this asymptotic optimality of the t-statistic approach holds for all models in M_1^X and M_1^R, that is, whenever (12) and (13) hold with μ ≠ 0. In other words, for any test that has higher asymptotic power for some data generating process for which (12) and (13) hold with μ ≠ 0 and equal asymptotic variances, there exists a data generating process satisfying (12) and (13) with μ = 0 for which the test has asymptotic rejection probability larger than α.

In particular, this implies that it is not possible to use data-dependent methods to determine an appropriate q: Suppose one is conservatively only willing to assume (13) to hold for some small q = q₀, but the actual data is much more regular in the sense that (13) also holds for q = 2q₀, with each group divided into two subgroups. Then any data dependent method that leads to higher asymptotic local power for this more regular data necessarily lacks robustness, in the sense that there exists some data generating process for which (13) holds with q = q₀ and for which this method overrejects asymptotically. Thus, with (13) viewed as a regularity condition on the underlying large dataset, the t-statistic approach efficiently exploits the available information, with highest possible power in the benchmark case of equal asymptotic variances.

Ibragimov and Müller: t-Statistic Based Correlation and Heterogeneity Robust Inference 465
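To make the mechanics concrete, here is a minimal sketch of the t-statistic approach (our own illustration, not code from the article; the function name and toy numbers are hypothetical): the q group estimators are treated as q approximately independent Gaussian observations of β, and the resulting t-statistic is compared with the usual Student-t critical value with q − 1 degrees of freedom.

```python
import numpy as np
from scipy import stats

def group_t_test(beta_hats, beta0=0.0, alpha=0.05):
    """t-statistic approach: one-sample t-test on the q group estimators."""
    b = np.asarray(beta_hats, dtype=float)
    q = b.size
    t = np.sqrt(q) * (b.mean() - beta0) / b.std(ddof=1)
    cv = stats.t.ppf(1 - alpha / 2, df=q - 1)  # two-sided critical value
    return t, abs(t) > cv

# q = 4 hypothetical group estimates of beta, testing beta0 = 0
t, reject = group_t_test([1.0, 2.0, 3.0, 4.0])

# Scale invariance: |t| is unchanged when all deviations beta_hat_j - beta0
# are multiplied by the same c != 0, which is why t can equivalently be
# computed from the ratios R_j.
t2, _ = group_t_test([2.0, 4.0, 6.0, 8.0])
assert abs(abs(t) - abs(t2)) < 1e-12
```

For two-sided levels α ≤ 8.3% (and one-sided levels below 4.1%), the Bakirov and Székely (2005) result quoted above guarantees that such a test remains conservative when the group variances σ_j² differ.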

4.3 Comparison With Inference Under Known Variance

We now turn to a discussion of the relative performance of the t-statistic approach as outlined in Section 2.2 and inference based on the full sample estimator β̂ with √n(β̂ − β) ⇒ N(0, σ²) and known σ² > 0. When (12) or (13) summarizes the amount of regularity that one is willing to impose, then this is a purely theoretical exercise. On the other hand, one might be willing to consider stronger assumptions that enable consistent estimation of σ², and it is interesting to explore the relative gain in power.

With σ̂² →_p σ², the standard approach to testing the null hypothesis β = β₀ is to reject when |z_β| exceeds the critical value for a standard normal, where z_β is given by

$$z_\beta=\sqrt n\,\frac{\hat\beta-\beta_0}{\hat\sigma}=\sqrt n\,\frac{\hat\beta-\beta_0}{\sigma}+o_p(1)\tag{15}$$

under the null and local alternatives. In this case, a comparison of the asymptotic power of tests based on t_β with the asymptotic power of a test based on z_β approximates the efficiency cost of the higher robustness of inference based on t_β.

To investigate this issue, we consider the class of GMM models as discussed in Section 2.2 under exact identification (k = l). Under the assumptions made there, the simple average of the q group estimators θ̂ = q⁻¹∑_{j=1}^q θ̂_j satisfies

$$\sqrt n(\hat\theta-\theta)=q^{-1}\sum_{j=1}^{q}\Gamma_j^{-1}Q_j+o_p(1)\;\Rightarrow\;\mathcal N(0,\bar\Sigma_q),\tag{16}$$

where Σ̄_q = q⁻²∑_{j=1}^q Γ_j⁻¹Λ_j(Γ_j′)⁻¹. In contrast, the full sample GMM estimator θ̃ (whose first element is the full sample estimator β̂), which solves n⁻¹∑_{i=1}^n g(θ̃, y_i) = 0, satisfies under the same assumptions

$$\sqrt n(\tilde\theta-\theta)=\Big(\sum_{j=1}^{q}\Gamma_j\Big)^{-1}\sum_{j=1}^{q}Q_j+o_p(1)\;\Rightarrow\;\mathcal N(0,\Sigma_q),\tag{17}$$

where Σ_q = (∑_{j=1}^q Γ_j)⁻¹(∑_{j=1}^q Λ_j)(∑_{j=1}^q Γ_j′)⁻¹. In general, this full sample GMM estimator is not efficient: with heterogeneous groups, it would be more efficient to compute the optimal GMM estimator of the q moment conditions E[g(θ, y_i)] = 0 for i ∈ G_j, j = 1,...,q. But this efficient full-sample estimator requires the consistent estimation of the optimal weighting matrix, which involves Λ_j, j = 1,...,q. This is unlikely to be feasible or appropriate in applications with pronounced correlations and heterogeneity, so that the relevant comparison for θ̂ is with θ̃ as characterized in (17).

Comparing Σ̄_q with Σ_q, we find that, while √n-consistent and asymptotically Gaussian, the estimators θ̂ and θ̃ are not asymptotically equivalent. The asymptotic powers of tests based on t_β and z_β thus differ not only through differences in the denominator, but also through their numerators. The relationship between Σ̄_q and Σ_q is summarized in the following theorem, whose proof is given in the Appendix:

Theorem 4. Let Q_k be the set of full rank k × k matrices, and let P_k ⊂ Q_k denote the set of symmetric and positive definite k × k matrices. For any q ≥ 2:

(i) Let ι be the k × 1 vector with 1 in the first row and zeros elsewhere. Then

$$\inf_{\{\Gamma_j\}_{j=1}^{q}\in Q_1^{q},\,\{\Lambda_j\}_{j=1}^{q}\in P_1^{q}}\bar\Sigma_q/\Sigma_q=0,\qquad \inf_{\{\Gamma_j\}_{j=1}^{q}\in P_k^{q},\,\{\Lambda_j\}_{j=1}^{q}\in P_k^{q}}\frac{\iota'\Sigma_q\iota}{\iota'\bar\Sigma_q\iota}=0$$

and

$$\inf_{\{\Gamma_j\}_{j=1}^{q}\in P_k^{q},\,\{\Lambda_j\}_{j=1}^{q}\in P_k^{q}}\frac{\iota'\bar\Sigma_q\iota}{\iota'\Sigma_q\iota}=\begin{cases}1/q^2&\text{if }k=1\\0&\text{if }k\ge 2.\end{cases}$$

(ii) For any {Γ_j}_{j=1}^q ∈ Q_k^q there exists {Λ̄_j}_{j=1}^q ∈ P_k^q so that Σ_q − Σ̄_q is positive semidefinite for {Λ_j}_{j=1}^q = {Λ̄_j}_{j=1}^q, and for any {Γ_j}_{j=1}^q ∈ P_k^q there exists {Λ̃_j}_{j=1}^q ∈ P_k^q so that Σ_q − Σ̄_q is negative semidefinite for {Λ_j}_{j=1}^q = {Λ̃_j}_{j=1}^q.

(iii) If Γ_j = Γ for j = 1,...,q, then Σ_q = Σ̄_q for all {Λ_j}_{j=1}^q.

Part (i) of Theorem 4 shows that very little can be said in general about the relative magnitudes of the asymptotic variances ι′Σ̄_qι and ι′Σ_qι of the two implied estimators of β. Only for k = 1 and Γ_j restricted to positive numbers does there exist a bound on the relative asymptotic variances, and this bound is so weak that even for q as small as q = 4, one can construct an example where the local asymptotic power of a two-sided 5% level test based on |t_β| greatly exceeds the local asymptotic power of a test based on |z_β| for almost all alternatives, despite the much larger critical value for |t_β| (which is equal to 3.18 for q = 4, compared to 1.96 for |z_β|). What is more, as shown in part (ii), it is not possible to determine whether θ̂ is more efficient than θ̃ without knowledge of {Λ_j}_{j=1}^q, and vice versa in the important special case where the Γ_j are symmetric and positive definite. (There exist {Γ_j}_{j=1}^q ∉ P_k^q that make θ̂ the more efficient estimator for all possible values of {Λ_j}_{j=1}^q; for instance, for k = 1 and q = 2, let Γ_1 = 1 and Γ_2 = −1/2.)

When Γ_j = Γ for all j, however, the two estimators become asymptotically equivalent. This special case naturally arises when the groups have an equal number of observations n/q and the average of the derivative of the moment condition is homogeneous across groups. One important setup with this feature is the case of underlying iid observations. With Γ_j = Γ, √n(θ̂ − θ) = √n(θ̃ − θ) + o_p(1), so that the two implied estimators of β are asymptotically equivalent (up to order √n) under the null and local alternatives. There is thus no asymptotic efficiency cost for basing inference about β on the β̂_j associated with the reestimation of the last k − 1 elements of θ in each of the q groups. The asymptotic local power of tests based on t_β and z_β then simply reduces to the small sample power of the t-statistic (2) and the z-statistic z = √q X̄/σ̄_q in the hypothesis test (1), where σ_j² is the (1, 1) element of Γ⁻¹Λ_j(Γ′)⁻¹ and σ̄_q² = q⁻¹∑_{j=1}^q σ_j². Figure 3 depicts the power of such 5% level tests for various q and the two scenarios for the variances considered in Figures 1 and 2 above. The scale of the variances is normalized to ensure σ̄_q² = 1, and the magnitude of the alternative μ is the value on the abscissa divided by √q, so that the power of the z-statistic is the same for all q.

When all variances are identical (a = 1), the differences in power between the t-statistic and z-statistic are substantial for small q, but become quite small for moderate q: The largest difference in power is 32 percentage points for q = 4, is 13 for

466 Journal of Business & Economic Statistics, October 2010

Figure 3. Power of 5% level t-tests and z-tests with q independent observations.

q = 8, and is 5.8 for q = 16. In both scenarios and all considered values of a ≠ 1, the maximal difference in power between the z-statistic and t-statistic is smaller than this equal variance benchmark, despite the fact that the t-statistic underrejects under the null hypothesis when variances are heterogeneous. When Γ_j = Γ for all j, the loss in local asymptotic power of inference based on t_β compared to z_β is thus approximately bounded above by the largest loss of power of a small sample t-statistic over the z-statistic in an iid Gaussian setup. Interestingly, for very unequal variances with a = 5, the t-statistic is sometimes even more powerful than the z-statistic. This is possible because the z-statistic is not optimal in the case of unequal variances. Intuitively, for small realizations of the high variance observation, s_X² is much smaller than σ̄_q², and the t-statistic exceeds the (larger) critical value more often under moderate alternatives.

To sum up, in an exactly identified GMM framework, tests based on t_β and z_β compare as follows: Both tests are consistent and have power against the same local alternatives. Without additional assumptions on Γ_j (the sample average of the derivative of the moment condition in group j), little can be said about their local asymptotic power, as either procedure may be the more powerful one, depending on the values of Λ_j, the group j asymptotic covariance matrices of the moment conditions. In the important special case where Γ_j = Γ for all j, the largest gain in power of 5% level two-sided inference based on z_β over t_β is typically no larger than the largest power advantage of a small sample z-statistic over a t-statistic for iid Gaussian observations. By implication, as soon as q is moderately large (say, q = 16), there exist only modest gains in terms of local asymptotic power (less than 6 percentage points for 5% level tests) from efforts to consistently estimate the asymptotic variance σ².

5. CONCLUSION

This paper develops a general strategy to deal with inference about a scalar parameter in data with pronounced correlations of largely unknown form. The key assumption is that it is possible to partition the data into q groups, such that estimators based on data from group j, j = 1,...,q, are approximately independent, unbiased and Gaussian, but not necessarily of equal variance. The t-statistic approach to inference provides in some sense efficient inference under this regularity condition. What is more, this inference remains valid also when the group estimators have a joint distribution that can be written as a mixture of independent Gaussian distributions with means equal to the parameter of interest. The t-statistic approach may therefore be used also for group estimators that are approximately symmetric stable or of statistically dependent stochastic variances. Despite its simplicity, the proposed method is thus applicable in a wide range of models and settings.

ACKNOWLEDGMENTS

The authors would like to thank the Editor, Arthur Lewbel, an anonymous Associate Editor, a referee, and seminar and conference participants at Berkeley, Brown, Cambridge, N. G. Chebotarev Research Institute of Mathematics and Mechanics, Chicago GSB, Columbia, Harvard/MIT, Institute of Mathematics of
Uzbek Academy of Sciences, LSE, Ohio State, Penn State, Princeton, Rice, Risk Metrics Group London, UC Riverside, Rochester, Stanford, Tashkent State University, Tashkent State University of Economics, Queen Mary, Vanderbilt, the Greater New York Metropolitan Area Econometrics Colloquium 2006, the NBER Summer Institute 2006, the CIREQ Time Series Conference 2006, the 62nd European Meeting of the Econometric Society 2007 and the 2008 Far Eastern and South Asian Meeting of the Econometric Society for valuable comments and discussions. Ibragimov gratefully acknowledges partial support by the NSF grant SES-0820124, a Harvard Academy Junior Faculty Development grant and the Warburg Research Funds (Department of Economics, Harvard University), and Müller gratefully acknowledges support by the NSF via grant SES-0518036.

[Received February 2008. Revised April 2009.]

APPENDIX

Proof of Theorem 4

(i) For the first claim, let Γ_1 = ξ − (q − 1), Λ_1 = 1 and Γ_j = Λ_j = 1 for j = 2,...,q, for some ξ > 0. Then

$$\bar\Sigma_q/\Sigma_q=\xi^2\big(q-1+(1-q+\xi)^{-2}\big)/q^3,$$

so that Σ̄_q/Σ_q → 0 as ξ → 0.

For the second claim, let Γ_1 = Λ_1 = I_k, and Γ_j = ςI_k, Λ_j = ξI_k for j = 2,...,q, for some ς > 0, ξ > 0, so that ∑_{j=1}^q Γ_j = ((q − 1)ς + 1)I_k. Then

$$\Sigma_q=\frac{\xi(q-1)+1}{((q-1)\varsigma+1)^2}\,I_k\quad\text{and}\quad\bar\Sigma_q=\frac{1+(q-1)\xi/\varsigma^2}{q^2}\,I_k.$$

Letting ξ = 1 and ς → 0 proves the second claim, and with ξ = ς⁴ and ς → 0 we find ι′Σ̄_qι/ι′Σ_qι → 1/q².

Also, for k ≥ 2, let Γ_1 = diag(A, I_{k−2}) ∈ P_k with A = ((1, 1/2)′, (1/2, 1)′)′, Λ_1 = Γ_1 diag(1, ξ, I_{k−2})Γ_1 and Γ_j = Λ_j = I_k for j = 2,...,q. Then ι′Σ̄_qι = 1/q and

$$\iota'\Sigma_q\iota=\frac{-3-4q+16q^3+4\xi(q-1)^2}{(1-4q^2)^2},$$

so that ι′Σ̄_qι/ι′Σ_qι → 0 as ξ → ∞.

We are thus left to show that for k = 1, Σ̄_q/Σ_q ≥ 1/q² for all positive numbers {Γ_j}_{j=1}^q and nonnegative numbers {Λ_j}_{j=1}^q. But

$$\Sigma_q=\Big(\sum_{j=1}^{q}\Gamma_j\Big)^{-2}\sum_{j=1}^{q}\Lambda_j\le\Big(\sum_{j=1}^{q}\Gamma_j^2\Big)^{-1}\sum_{j=1}^{q}\Lambda_j\le\sum_{j=1}^{q}\Lambda_j\Gamma_j^{-2}=q^2\bar\Sigma_q.$$

(ii) Note that for any real full-column rank matrix X, X(X′X)⁻¹X′ is idempotent, so that I − X(X′X)⁻¹X′ is positive semidefinite. Therefore, for any real matrix Y of suitable dimension, Y′Y − Y′X(X′X)⁻¹X′Y is positive semidefinite.

For the first claim, let Λ̄_j = Γ_jΓ_j′. Then Σ̄_q = q⁻¹I_k, and Σ_q = (∑_{j=1}^q Γ_j)⁻¹(∑_{j=1}^q Γ_jΓ_j′)(∑_{j=1}^q Γ_j′)⁻¹. It suffices to show that Σ_q⁻¹ − Σ̄_q⁻¹ is negative semidefinite, and this follows from the above result with Y = (I_k,...,I_k)′ and X = (Γ_1′,...,Γ_q′)′.

For the second claim, let Λ̃_j = Γ_j. Then Σ̄_q = q⁻²∑_{j=1}^q Γ_j⁻¹ and Σ_q = (∑_{j=1}^q Γ_j)⁻¹, and the result follows by setting Y = (Γ_1^{−1/2},...,Γ_q^{−1/2})′ and X = (Γ_1^{1/2},...,Γ_q^{1/2})′.

(iii) Immediate from Σ̄_q = q⁻²∑_{j=1}^q Γ⁻¹Λ_j(Γ′)⁻¹ and ∑_{j=1}^q Γ_j = qΓ.

REFERENCES

Andrews, D. W. K. (1991), "Heteroskedasticity and Autocorrelation Consistent Covariance Matrix Estimation," Econometrica, 59, 817–858. [453,454,458,459]
——— (1993), "Tests for Parameter Instability and Structural Change With Unknown Change Point," Econometrica, 61, 821–856. [458]
Andrews, D. W. K., and Monahan, J. C. (1992), "An Improved Heteroskedasticity and Autocorrelation Consistent Covariance Matrix Estimator," Econometrica, 60, 953–966. [458,459]
Arellano, M. (1987), "Computing Robust Standard Errors for Within-Groups Estimators," Oxford Bulletin of Economics and Statistics, 49, 431–434. [460]
Bakirov, N. K., and Székely, G. J. (2005), "Student's t-Test for Gaussian Scale Mixtures," Zapiski Nauchnyh Seminarov POMI, 328, 5–19. [453,455]
Bertrand, M., Duflo, E., and Mullainathan, S. (2004), "How Much Should We Trust Differences-in-Differences Estimates?" The Quarterly Journal of Economics, 119, 249–275. [461]
Bester, C. A., Conley, T. G., and Hansen, C. B. (2009), "Inference With Dependent Data Using Cluster Covariance Estimators," working paper, Chicago GSB. [454,462]
Bollerslev, T., Engle, R. F., and Nelson, D. B. (1994), "ARCH Models," in Handbook of Econometrics, Vol. IV, eds. R. F. Engle and D. McFadden, Amsterdam: Elsevier Science. [458]
Case, A. C. (1991), "Spatial Patterns in Household Demand," Econometrica, 59, 953–965. [462]
Conley, T. G. (1999), "GMM Estimation With Cross Sectional Dependence," Journal of Econometrics, 92, 1–45. [453,462,463]
Conley, T. G., and Dupor, B. (2003), "A Spatial Analysis of Sectoral Complementarity," Journal of Political Economy, 111, 311–352. [462]
Conley, T. G., and Taber, C. (2005), "Inference With 'Difference in Differences' With a Small Number of Policy Changes," Technical Working Paper 312, NBER. [461]
Deaton, A. (1997), The Analysis of Household Surveys: A Microeconometric Approach to Development Policy, Baltimore: Johns Hopkins University Press. [461]
Donald, S. G., and Lang, K. (2007), "Inference With Difference-in-Differences and Other Panel Data," The Review of Economics and Statistics, 89, 221–233. [454,461]
Fama, E. F., and MacBeth, J. (1973), "Risk, Return and Equilibrium: Empirical Tests," Journal of Political Economy, 81, 607–636. [454]
Ferguson, T. S. (1967), Mathematical Statistics—A Decision Theoretic Approach, New York and London: Academic Press. [463]

Froot, K. A. (1989), "Consistent Covariance Matrix Estimation With Cross-Sectional Dependence and Heteroskedasticity in Financial Data," The Journal of Financial and Quantitative Analysis, 24, 333–355. [461]
Hansen, B. E. (2000), "Testing for Structural Change in Conditional Models," Journal of Econometrics, 97, 93–115. [458]
Hansen, C. B. (2007), "Asymptotic Properties of a Robust Variance Matrix Estimator for Panel Data When T Is Large," Journal of Econometrics, 141, 597–620. [454,460]
Hansen, L. P. (1982), "Large Sample Properties of Generalized Method of Moments Estimators," Econometrica, 50, 1029–1054. [456]
Imhof, J. P. (1961), "Computing the Distribution of Quadratic Forms in Normal Variables," Biometrika, 48, 419–426. [455]
Jansson, M. (2004), "The Error in Rejection Probability of Simple Autocorrelation Robust Tests," Econometrica, 72, 937–946. [454]
Jegadeesh, N., and Karceski, J. (2009), "Long-Run Performance Evaluation: Correlation and Heteroskedasticity-Consistent Tests," Journal of Empirical Finance, 16, 101–111. [461]
Kézdi, G. (2004), "Robust Standard Error Estimation in Fixed-Effects Panel Models," Hungarian Statistical Review, 9, 95–116. [460]
Kiefer, N., and Vogelsang, T. J. (2002), "Heteroskedasticity-Autocorrelation Robust Testing Using Bandwidth Equal to Sample Size," Econometric Theory, 18, 1350–1366. [454,458]
——— (2005), "A New Asymptotic Theory for Heteroskedasticity-Autocorrelation Robust Tests," Econometric Theory, 21, 1130–1164. [454,458,459]
Kiefer, N. M., Vogelsang, T. J., and Bunzel, H. (2000), "Simple Robust Testing of Regression Hypotheses," Econometrica, 68, 695–714. [454,458]
Kim, C.-J., and Nelson, C. R. (1999), "Has the Economy Become More Stable? A Bayesian Approach Based on a Markov-Switching Model of the Business Cycle," The Review of Economics and Statistics, 81, 608–616. [458]
Lehmann, E. L. (1986), Testing Statistical Hypotheses (2nd ed.), New York: Wiley. [463]
Marshall, A. W., and Olkin, I. (1979), Inequalities: Theory of Majorization and Its Applications, New York: Academic Press. [457]
McConnell, M. M., and Perez-Quiros, G. (2000), "Output Fluctuations in the United States: What Has Changed Since the Early 1980's," American Economic Review, 90, 1464–1476. [458]
McElroy, T., and Politis, D. N. (2002), "Robust Inference for the Mean in the Presence of Serial Correlation and Heavy-Tailed Distributions," Econometric Theory, 18, 1019–1039. [459]
Moulton, B. R. (1990), "An Illustration of a Pitfall in Estimating the Effects of Aggregate Variables on Micro Units," Review of Economics and Statistics, 72, 334–338. [462]
Müller, U. K. (2007), "A Theory of Robust Long-Run Variance Estimation," Journal of Econometrics, 141, 1331–1352. [454,458]
——— (2008), "Efficient Tests Under a Weak Convergence Assumption," working paper, Princeton University. [454,462,464]
Müller, U. K., and Watson, M. W. (2008), "Testing Models of Low-Frequency Variability," Econometrica, 76, 979–1016. [458]
Newey, W. K., and West, K. D. (1987), "A Simple, Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix," Econometrica, 55, 703–708. [453,461]
Petersen, M. A. (2009), "Estimating Standard Errors in Finance Panel Data Sets: Comparing Approaches," The Review of Financial Studies, 22, 435–480. [460]
Rogers, W. H. (1993), "Regression Standard Errors in Clustered Samples," Stata Technical Bulletin, 13, 19–23. [453,460]
Swamy, P. A. V. B. (1970), "Efficient Inference in a Random Coefficient Regression Model," Econometrica, 38, 311–323. [462]
White, H. (1980), "A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity," Econometrica, 48, 817–830. [453]
Wooldridge, J. M. (2003), "Cluster-Sample Methods in Applied Econometrics," American Economic Review, 93, 133–138. [461]
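As a numerical companion to the Proof of Theorem 4 above, the following sketch (our own illustration; the function name, parameter values, and random-draw design are not from the article) checks the closed-form ratio used in the first claim of part (i), the k = 1 lower bound Σ̄_q/Σ_q ≥ 1/q², and the k = 1, q = 2 example with Γ_1 = 1, Γ_2 = −1/2 from Section 4.3 in which the group-average estimator dominates.

```python
import numpy as np

def variances(gammas, lams):
    """k = 1 case of (16)-(17): asymptotic variances of the group-average
    estimator (Sigma_bar_q) and the full-sample GMM estimator (Sigma_q)."""
    g, l = np.asarray(gammas, float), np.asarray(lams, float)
    q = g.size
    sigma_bar = (l / g ** 2).sum() / q ** 2
    sigma = l.sum() / g.sum() ** 2
    return sigma_bar, sigma

# Closed-form ratio in the proof of part (i), first claim:
q, xi = 5, 1e-4
sb, s = variances(np.r_[xi - (q - 1), np.ones(q - 1)], np.ones(q))
closed_form = xi ** 2 * (q - 1 + (1 - q + xi) ** -2) / q ** 3
assert np.isclose(sb / s, closed_form)
assert closed_form < 1e-8  # the ratio tends to 0 as xi -> 0

# k = 1 lower bound: Sigma_bar/Sigma >= 1/q^2 whenever all Gamma_j > 0.
rng = np.random.default_rng(1)
for _ in range(1000):
    sb, s = variances(rng.uniform(0.01, 10, q), rng.uniform(0, 10, q))
    assert sb / s >= 1 / q ** 2 - 1e-12

# Section 4.3 example (k = 1, q = 2, Gamma = (1, -1/2), Lambda_j = 1):
# the group average has the smaller asymptotic variance, 1.25 versus 8.
sb, s = variances([1.0, -0.5], [1.0, 1.0])
assert (sb, s) == (1.25, 8.0)
```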