
Journal of Econometrics 185 (2015) 409–419

Bayesian regression with nonparametric heteroskedasticity

Andriy Norets

Department of Economics, Brown University, United States
E-mail address: [email protected]

Article history: Received 3 March 2014; Received in revised form 1 September 2014; Accepted 19 December 2014; Available online 7 January 2015.

Abstract

This paper studies large sample properties of a semiparametric Bayesian approach to inference in a linear regression model. The approach is to model the distribution of the regression error term by a normal distribution with the variance that is a flexible function of covariates. The main result of the paper is a semiparametric Bernstein–von Mises theorem under misspecification: even when the distribution of the regression error term is not normal, the posterior distribution of the properly recentered and rescaled regression coefficients converges to a normal distribution with the zero mean and the variance equal to the semiparametric efficiency bound.

Keywords: Bayesian; Heteroskedasticity; Misspecification; Posterior consistency; Semiparametric Bernstein–von Mises theorem; Semiparametric efficiency; Gaussian process priors; Multivariate Bernstein polynomials

© 2014 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.jeconom.2014.12.006

1. Introduction

A regression model $Y_i = X_i'\beta_0 + \epsilon_i$ with the conditional moment restriction $E(\epsilon_i|X_i) = 0$ is a standard model, which is widely used in statistics and econometrics. This paper analyzes asymptotic properties of a Bayesian semiparametric approach to estimation of this model. The approach is to model the distribution of the error term by a normal distribution with the variance that is a flexible function of covariates. For example, Gaussian process priors, splines, or polynomials can be used to build a prior for the variance. Normality of the error term guarantees that the Kullback–Leibler (KL) distance between the model and the data generating process (DGP), which does not necessarily satisfy the normality assumption, is minimized at the data generating values of the linear coefficients and the conditional variance of the error term. Thus, one can expect that the posterior asymptotically concentrates around the true values for these two parameters. The normality assumption can also be justified by appealing to the principle of maximum entropy of Jaynes (1957) when only the first two conditional moments are of interest.

The main result of the paper is a semiparametric Bernstein–von Mises theorem under misspecification: even when the distribution of the regression error term is not normal in the DGP, the posterior distribution of the properly recentered and rescaled regression coefficients converges to a normal distribution with the zero mean and the variance equal to the semiparametric efficiency bound. The equality of the variance to the semiparametric efficiency bound suggests that the inference about the linear coefficients based on this model is conservative in the following sense: the posterior variance in a correctly specified model is likely to be smaller than the posterior variance in a model that postulates normally distributed errors with the flexibly modeled variance. With carefully specified priors, Bayesian procedures usually behave well in small samples. Thus, the Bayesian normal linear regression with nonparametric heteroskedasticity can also be an attractive alternative to classical semiparametrically efficient estimators from Carroll (1982) and Robinson (1987). At the same time, the results of the paper provide a Bayesian interpretation to these classical estimators.

Several different approaches to inference in a regression model have been proposed in the Bayesian framework. In a standard textbook linear regression model, normality of the error terms is assumed. More recent literature relaxed the normality assumption by using mixtures of normal or Student t distributions. However, if the shape of the error distribution depends on covariates then the posterior may not concentrate around the data generating values of the linear coefficients (Müller, 2013). Lancaster (2003) and Poirier (2011) do not assume linearity of the regression function and treat the linear projection coefficients as the parameters of interest. They use the Bayesian bootstrap (Rubin, 1981) to justify from the Bayesian perspective the use of the ordinary least squares estimator with a heteroskedasticity robust covariance matrix. Pelenis (2014) demonstrates posterior consistency in a semiparametric model with a parametric specification for the regression function and a nonparametric specification for the conditional distribution of the regression error term. It is also possible to estimate a fully nonparametric model for the distribution of the response conditional on covariates; see, for example, Peng et al. (1996), Wood et al. (2002), Geweke and Keane (2007), Villani et al. (2009), and Norets (2010) for Bayesian models based on smoothly mixing regressions or mixtures of experts, and MacEachern (1999), De Iorio et al. (2004), Griffin and Steel (2006), Dunson and Park (2008), Chung and Dunson (2009), Norets and Pelenis (2014), and Pati et al. (2013) for models based on dependent Dirichlet processes. These fully nonparametric models require a lot of data for reliable estimation results and prior specification is nontrivial. The model considered in the present paper is more parsimonious. Nevertheless, it delivers consistent estimation of the first two conditional moments, conservative inference about the regression coefficients, and it is robust to misspecification of the regression error distribution. Thus, it can be thought of as a useful intermediate step between fully nonparametric and oversimplistic models.

Bayesian Markov chain Monte Carlo (MCMC) estimation algorithms for the normal regression with flexibly modeled variance have been developed in the literature; see, for example, Yau and Kohn (2003) and Chib and Greenberg (2013), who use transformed splines, or Goldberg et al. (1998), who use a transformed Gaussian process prior for modeling the variance. In those papers, the models with flexibly modeled variance were shown to perform well in simulation studies. Thus, the present paper considers only the theoretical properties of the model.

The rest of the paper is organized as follows. Section 2 describes the DGP. The model is described in Section 3. The Bernstein–von Mises theorem is presented in Section 4. The assumptions of the theorem are verified in Section 5 for priors based on truncated Gaussian processes and multivariate Bernstein polynomials. Section 6 concludes. Proofs are delegated to Section 7.

2. Data generating process

The data are assumed to include $n$ observations on a response variable and covariates, $(Y^n, X^n) = (Y_1,\ldots,Y_n, X_1,\ldots,X_n)$, where $Y_i \in Y \subset R$ and $X_i \in X \subset R^d$, $i \in \{1,\ldots,n\}$. $X$ is assumed to be a convex and bounded set with a nonempty interior. The observations are independently identically distributed (iid), $(Y_i, X_i) \sim F_0$. The joint DGP distribution $F_0$ is assumed to have a conditional density $f_0(Y_i|X_i)$ with respect to (w.r.t.) the Lebesgue measure. The distribution of the infinite sequence of observations, $(Y^\infty, X^\infty)$, is denoted by $F_0^\infty$. Hereafter, the expectations $E(\cdot)$ and $E(\cdot|\cdot)$ are taken w.r.t. the DGP $F_0^\infty$.

Let us make the following assumptions about the data generating process. First, $E(Y_i|X_i) = X_i'\beta_0$. Thus, for $\epsilon_i = Y_i - X_i'\beta_0$, $E(\epsilon_i|X_i) = 0$. Second, $E(X_iX_i')$ is invertible. Third, $\sigma_0^2(x) = E(\epsilon_i^2|X_i = x)$ is well defined for any $x \in X$, and for some $0 < \underline\sigma < \bar\sigma < \infty$,

$\sigma_0(x) \in (\underline\sigma, \bar\sigma), \quad \forall x \in X.$ (1)
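To make the DGP assumptions concrete, the following small Python sketch (our illustration, not part of the paper) draws data that satisfy the three assumptions with skewed, non-normal errors, so that the normal model of the next section is deliberately misspecified; all names are ours.

```python
import numpy as np

def simulate_dgp(n, rng):
    """Draw (Y, X) with E(Y|X) = X' beta0, E(eps|X) = 0, and conditional
    standard deviation sigma0(X) bounded away from 0 and infinity."""
    X = np.column_stack([np.ones(n), rng.uniform(size=n)])  # bounded, convex support
    beta0 = np.array([1.0, -2.0])
    sigma0 = 0.5 + X[:, 1]                                  # sigma0(x) in (0.5, 1.5)
    # centered exponential errors: mean 0, variance 1, but skewed (non-normal)
    eps = sigma0 * (rng.exponential(size=n) - 1.0)
    return X @ beta0 + eps, X, beta0, sigma0

Y, X, beta0, sigma0 = simulate_dgp(10_000, np.random.default_rng(0))
```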

3. Model, prior, and pseudo true parameter values

In the model, it is assumed that the regression error term is normally distributed conditional on covariates, $\epsilon_i|X_i, \beta, \sigma \sim N(0, \sigma^2(X_i))$. Note that this normality assumption is not made for the DGP and the model can be misspecified.

Let $S \subset \{\sigma: X \to [\underline\sigma, \bar\sigma]\}$ be a complete separable metric space and $A$ be a Borel $\sigma$-field on $R^d \times S$. Let $\Pi$ denote a prior distribution for $(\beta, \sigma)$ on $(R^d \times S, A)$. $\Pi$ is a product of a normal $N(\underline\beta, H^{-1})$ prior for $\beta$ truncated to $[-B,B]^d$, where a lower bound on a finite constant $B$ is specified below, and a prior distribution for $\sigma$ on $S$. It is also assumed that $\sigma_0 \in S$.

The prior on $S$ will be assumed to put a sufficiently large probability on the following class of smooth functions

$S_{M,\alpha} = \{\sigma: X \to [\underline\sigma,\bar\sigma]:\ \max_{k_1+\cdots+k_d \le \underline\alpha} \sup_{x\in X} |\partial^k \sigma(x)| + \max_{k_1+\cdots+k_d = \underline\alpha} \sup_{x\ne z\in X} \frac{|\partial^k\sigma(x) - \partial^k\sigma(z)|}{\|x-z\|^{\alpha-\underline\alpha}} \le M\},$

where $k = (k_1,\ldots,k_d)$ is a multi-index, $\partial^k = \partial^{k_1+\cdots+k_d}/\partial x_1^{k_1}\cdots\partial x_d^{k_d}$ is a partial derivative operator, and $\underline\alpha$ is the greatest integer strictly smaller than $\alpha > 0$.

The distribution of covariates is assumed to be ancillary and it is not modeled. The likelihood is given by

$p(Y^n|X^n, \beta, \sigma) = \prod_{i=1}^n p_{\beta,\sigma}(Y_i|X_i), \quad p_{\beta,\sigma}(Y_i|X_i) = \frac{1}{\sqrt{2\pi\sigma^2(X_i)}} \exp\left(-\frac{(Y_i - X_i'\beta)^2}{2\sigma^2(X_i)}\right).$

For $A \in A$, the posterior is given by

$\Pi(A|Y^n, X^n) = \frac{\int_A p(Y^n|X^n, \beta, \sigma)\, d\Pi(\beta, \sigma)}{\int_{R^d\times S} p(Y^n|X^n, \beta, \sigma)\, d\Pi(\beta, \sigma)}.$

In misspecified models, parameter values minimizing the KL distance between the model and the DGP are called pseudo true parameter values. It is well known that in models with finite dimensional parameters the maximum likelihood and Bayesian estimators are consistent for the pseudo true parameter values under weak regularity conditions (see Huber (1967), White (1982), and Gourieroux et al. (1984) for classical results and Geweke (2005) and Kleijn and van der Vaart (2012) for Bayesian results). Analogous results for misspecified infinite dimensional models are obtained in Kleijn and van der Vaart (2006).(1) Thus, the following lemma suggests that in the regression model described above the posterior concentrates around $(\beta_0, \sigma_0)$ in large samples.

Lemma 1. Consider the DGP and the model described above. Suppose $E(|\log f_0(Y_i|X_i)|) < \infty$. Then,

$(\beta_0, \sigma_0) \in \operatorname*{argmin}_{\beta\in R^d,\ \sigma: X\to[\underline\sigma,\bar\sigma]} E\left[\log\frac{f_0(Y_i|X_i)}{p_{\beta,\sigma}(Y_i|X_i)}\right].$

If $E(X_iX_i')$ is positive definite, then the minimizer is $F_0$ almost surely unique.

The lemma is proved in Section 7.

(1) There is a considerable body of research on posterior consistency in correctly specified nonparametric models. A general weak posterior consistency theorem was established by Schwartz (1965). Barron (1988), Barron et al. (1999), and Ghosal et al. (1999) developed theory of strong posterior consistency. An alternative approach to consistency was introduced by Walker (2004). Posterior convergence rates were studied in Ghosal et al. (2000) and Shen and Wasserman (2001). Belitser and Ghosal (2003), Ghosal et al. (2003), Huang (2004), Scricciolo (2006), Ghosal et al. (2008), van der Vaart and van Zanten (2009), Kruijer et al. (2010), and Gine and Nickl (2011) among others analyzed adaptation of posterior convergence rates.
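For intuition about how the likelihood above combines with the normal prior for $\beta$, here is a minimal Python sketch (ours; the paper contains no code) of the conditional posterior of $\beta$ given $\sigma$, which is normal with precision $H + \sum_i X_iX_i'/\sigma^2(X_i)$ before truncation to $[-B,B]^d$; these formulas also appear in the proof of Theorem 1.

```python
import numpy as np

def log_likelihood(beta, sigma2_x, Y, X):
    """Heteroskedastic normal log-likelihood: Y_i | X_i ~ N(X_i' beta, sigma^2(X_i)).
    sigma2_x is the vector (sigma^2(X_1), ..., sigma^2(X_n))."""
    resid = Y - X @ beta
    return -0.5 * np.sum(np.log(2 * np.pi * sigma2_x) + resid**2 / sigma2_x)

def conditional_posterior_beta(sigma2_x, Y, X, beta_prior_mean, H):
    """Conditional posterior of beta given sigma: N(beta_bar, H_bar^{-1})
    before truncation to [-B, B]^d."""
    H_bar = H + (X.T / sigma2_x) @ X                 # H + sum_i X_i X_i' / sigma^2(X_i)
    b = H @ beta_prior_mean + X.T @ (Y / sigma2_x)   # H beta_prior + sum_i X_i Y_i / sigma^2(X_i)
    beta_bar = np.linalg.solve(H_bar, b)
    return beta_bar, H_bar
```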
Verification of Assumption 2 can −1 given by E X X ′ X −2  . This is the asymptotic variance of the be simplified when the distribution of ϵi has subexponential tails: ( i i σ0( i) ) ′ −Dϵ when f0(x β0 ± ϵ|x) ≤ e for some D > 0, all x ∈ X, and all generalized least squares estimator under known σ0, sufficiently large ϵ, the assumption holds for −1  n ′  n X X X Y − ˆ  i i  i i M = nγ1 = n γ2 βGLS = . n , δn , σ (X )2 σ (X )2 i=1 0 i i=1 0 i 1 (3) γ2 < , γ1 < [2α/d − 1]γ2. It follows from Carroll(1982) and Robinson(1987) that if σ0 is 6 + d/α estimated by kernel smoothing or nearest neighbor methods and ˆ Assumption 3 is easy to verify for priors that provide explicit dis- plugged in the formula for βGLS the resulting estimator attains the tributions for derivatives. The following section verifies the as- efficiency bound. The following Bernstein–von Mises theorem is a sumptions of the theorem for priors based on truncated Gaussian Bayesian analog of these results. processes and multivariate Bernstein polynomials. Theorem 1. For the DGP from Section 2 and the prior from Section 3, 5. Examples of priors let us make the following additional assumptions. 1. E[ 4] ∞. ϵi < 5.1. Truncated Gaussian process 2. There exist α > d/2 and positive sequences δn and Mn satisfying 4 → ∞ 2 d/α 4 → Priors based on Gaussian processes are extensively used in δn n and (Mn/δn ) /(δn n) 0, √ √ Bayesian nonparametrics; see, for example, Tokdar and Ghosh | | {| | (1+d/(2α)) −d/(2α)} → nE Xijϵi 1 Xijϵi > C nδn Mn 0, (2007), Tokdar(2007), van der Vaart and van Zanten(2008), Liang ∀ C ∈ (0, ∞), j = 1,..., d, et al.(2009), and Tokdar et al.(2010). In this section, I consider − δ(1 d/(2α))Md/(2α) → 0. a prior for σ based on integrated Brownian motion. Relevant n n technical background can be found in Section 4 of van der Vaart 3. For any constant A > 0, Π(σ : σ ∈ Sc ) exp{An} → 0 as Mn,α and van Zanten(2008). n → ∞. 2 Let X = [0, 1], W (x) be a Brownian motion on X with con- 4. The truncation bound in the prior for β satisfies B > (σ /σ 2) x tinuous sample paths, (I1W )(x) =  W (t)dt, and (IjW )(x) = ∥E|X Y |∥ e E X X ′ , where e E X X ′ is the smallest 0 i i 2/ min( ( i i )) min( ( i i ))  x j−1 ′ (I W )(t)dt for j ≥ 2. Suppose σ0 ∈ SM,α for some (M, α0) eigenvalue of E(XiXi ). 0 0 5. For any C > 0 there exists r ∈ (0, 1) such that for all sufficiently and σ < σ0 < σ . For a positive integer J, let us model σ (x) by J + J j ! [ ] large n, (I W )(x) j=0 Zjx /j truncated to σ, σ , where Zj’s are i.i.d.     N (0, 1) independent of W . More formally, let W denote the prob- pβ ,σ (Yi|Xi) Π σ , β : E log 0 0 ≤ Cδ2 ≥ exp{−nrCδ2}. ability measure for W and Z ’s, which is defined on C([0, 1]) and its | n n j pβ,σ (Yi Xi) Borel σ -field. Define an event, Then, the total variation distance  J  J  j √ T = (I W )(x) + Zjx /j! ∈ [σ, σ ], ∀x ∈ [0, 1]  n n  ′ −2 −1 [ − ˆ | ]  [ ] j=0 dTV Π n(β βGLS ) Y , X , N 0, E XiXi σ0(Xi) (2) and for a measurable set E, ∞ ˆ converge to 0 in F0 probability. This result also holds if βGLS in (2) is  J    replaced by the posterior mean  βdΠ(β|Yn, Xn). k  j  Π(σ ∈ E) = W (I W )(x) + Zjx /j! ∈ E T W(T ). = The theorem is proved in Section√7. The proof exploits the fact j 0 ˆ that the conditional posterior for n(β − βGLS ) given σ is a   [ ′ Proposition 1. Assume that the tails of ϵi are subexponential. 
5. Examples of priors

5.1. Truncated Gaussian process

Priors based on Gaussian processes are extensively used in Bayesian nonparametrics; see, for example, Tokdar and Ghosh (2007), Tokdar (2007), van der Vaart and van Zanten (2008), Liang et al. (2009), and Tokdar et al. (2010). In this section, I consider a prior for $\sigma$ based on integrated Brownian motion. Relevant technical background can be found in Section 4 of van der Vaart and van Zanten (2008).

Let $X = [0,1]$, $W(x)$ be a Brownian motion on $X$ with continuous sample paths, $(I^1 W)(x) = \int_0^x W(t)\,dt$, and $(I^j W)(x) = \int_0^x (I^{j-1}W)(t)\,dt$ for $j \ge 2$. Suppose $\sigma_0 \in S_{M,\alpha_0}$ for some $(M, \alpha_0)$ and $\underline\sigma < \sigma_0 < \bar\sigma$. For a positive integer $J$, let us model $\sigma(x)$ by $(I^J W)(x) + \sum_{j=0}^J Z_j x^j/j!$ truncated to $[\underline\sigma, \bar\sigma]$, where the $Z_j$'s are i.i.d. $N(0,1)$ independent of $W$. More formally, let $\mathcal W$ denote the probability measure for $W$ and the $Z_j$'s, which is defined on $C([0,1])$ and its Borel $\sigma$-field. Define an event,

$T = \left\{(I^J W)(x) + \sum_{j=0}^J Z_j x^j/j! \in [\underline\sigma, \bar\sigma],\ \forall x \in [0,1]\right\},$

and for a measurable set $E$,

$\Pi(\sigma \in E) = \mathcal W\left(\left\{(I^J W)(x) + \sum_{j=0}^J Z_j x^j/j! \in E\right\} \cap T\right) / \mathcal W(T).$

Proposition 1. Assume that the tails of $\epsilon_i$ are subexponential. Then, for any $J \ge 3$ and $\alpha_0 \in ((J+1)/(2J-1), J + 1/2]$, Assumptions 2, 3 and 5 of Theorem 1 hold.
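To visualize draws from this prior, the following Python sketch (ours; a discretized illustration, not the paper's implementation) simulates the $J$-fold integrated Brownian motion plus the random polynomial on a grid and applies rejection sampling for the truncation event $T$.

```python
import math
import numpy as np

def draw_sigma(J=3, sig_lo=0.5, sig_hi=2.0, grid=501, rng=None, max_tries=100_000):
    """Rejection sampler on a grid for the Section 5.1 prior:
    sigma(x) = (I^J W)(x) + sum_{j<=J} Z_j x^j / j!, accepted only on the
    event T that the whole path stays in [sig_lo, sig_hi]."""
    rng = rng or np.random.default_rng()
    x = np.linspace(0.0, 1.0, grid)
    dx = x[1] - x[0]
    for _ in range(max_tries):
        W = np.concatenate([[0.0], np.cumsum(rng.normal(scale=np.sqrt(dx), size=grid - 1))])
        path = W
        for _ in range(J):          # J-fold (trapezoidal) integration of the Brownian path
            path = np.concatenate([[0.0], np.cumsum(0.5 * (path[1:] + path[:-1]) * dx)])
        Z = rng.normal(size=J + 1)
        sigma = path + sum(Z[j] * x**j / math.factorial(j) for j in range(J + 1))
        if sig_lo <= sigma.min() and sigma.max() <= sig_hi:   # the event T
            return x, sigma
    raise RuntimeError("event T not hit; widen the bounds")
```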

5.2. Multivariate Bernstein polynomials

Priors based on Bernstein polynomials and related mixtures of beta models were considered in Petrone (1999), Ghosal (2001), Petrone and Wasserman (2002), Kruijer and van der Vaart (2008), Rousseau (2010), and Burda and Prokhorov (2013) among others. Bernstein polynomials play a prominent role in the function approximation literature, especially when it comes to shape preserving approximation; see a monograph by Lorentz (1986). As shown below, multivariate Bernstein polynomials are also convenient for setting up priors that respect bounds on partial derivatives and thus satisfy sufficient conditions in Theorem 1.

In this subsection, $X = [0,1]^d$. For functions $f: [0,1]^d \to R$, a Bernstein polynomial operator of order $m_i$ w.r.t. coordinate $i$ is defined as follows,

$(B_i^{m_i} f)(x) = \sum_{j_i=0}^{m_i} f\left(\frac{j_i}{m_i}, x_{-i}\right) \binom{m_i}{j_i} x_i^{j_i}(1-x_i)^{m_i-j_i},$

where $x = (x_i, x_{-i})$, $x_i \in [0,1]$, and $x_{-i} \in [0,1]^{d-1}$. Also let $\partial_i^{\lambda_i}$ stand for a partial derivative of order $\lambda_i$ w.r.t. coordinate $i$ and $D_i^{m_i}$ stand for a first difference operator w.r.t. coordinate $i$,(3)

$D_i^{m_i} f(x) = \frac{f((1-1/m_i)x_i + 1/m_i, x_{-i}) - f((1-1/m_i)x_i, x_{-i})}{1/m_i}.$

For a given $m_i$, define the $\lambda_i$th difference by $D_i^{m_i,\lambda_i} = D_i^{m_i-\lambda_i+1} D_i^{m_i-\lambda_i+2} \cdots D_i^{m_i}$. For multi-indices $m = (m_1,\ldots,m_k)$ and $\lambda = (\lambda_1,\ldots,\lambda_k)$, $m \ge \lambda$, denote a multivariate Bernstein polynomial operator of order $m$ by $B^m = B_1^{m_1}\cdots B_k^{m_k}$, a multivariate difference operator by $D^{m,\lambda} = D_1^{m_1,\lambda_1}\cdots D_k^{m_k,\lambda_k}$, and a partial derivative of order $\lambda$ by $\partial^\lambda = \partial_1^{\lambda_1}\cdots\partial_k^{\lambda_k}$. Part (v) of Lemma 11 shows that

$\partial_i B_i^{m_i} = B_i^{m_i-1} D_i^{m_i}.$ (4)

From a repeated application of (4) it follows that $\partial_i^{\lambda_i} B_i^{m_i} = B_i^{m_i-\lambda_i} D_i^{m_i,\lambda_i}$. Since for any $m_i$, $q$, and $i \ne k$, we have $B_i^{m_i} D_k^q = D_k^q B_i^{m_i}$, it follows that

$\partial^\lambda B^m = B^{m-\lambda} D^{m,\lambda}.$ (5)

For $m \in N^d$, let us define a collection of points (a grid on $[0,1]^d$), $G^m = \{x_j \in [0,1]^d: x_j = j/m = (j_1/m_1,\ldots,j_k/m_k),\ o \le j \le m\}$, where the inequalities hold coordinate-wise and $o$ is a vector of zeros. Let us denote a multivariate Bernstein polynomial of order $m$ with parameter $g = \{g_j, o \le j \le m\}$ by

$B^m(x; g) = \sum_{o\le j\le m} g_j \prod_{i=1}^d \binom{m_i}{j_i} x_i^{j_i}(1-x_i)^{m_i-j_i}.$

For any given parameter $g$ and a function $f: [0,1]^d \to R$ such that $g = f(G^m)$, $B^m(x; g) = (B^m f)(x)$, $\forall x \in [0,1]^d$, and we can also define

$D^{m,\lambda}(x; g) = (D^{m,\lambda} f)(x), \quad \forall x \in G^{m-\lambda}.$ (6)

Let us consider a sample size dependent prior $\Pi_n$(4) that puts probability 1 on $m$ such that $m_1 = \cdots = m_d = \lfloor n^{\gamma_3} \rfloor$, where

$\gamma_3 = \frac{4 + d/\alpha}{d[6 + d/\alpha]}, \quad \alpha = \lfloor d/2 \rfloor + 1,$ (7)

and the dependence of $m$ on $n$ is not reflected in the notation for brevity. Conditional on $m$, $\sigma(x) = B^m(x; g^m)$ and the prior for $g^m$ is assumed to have a positive density w.r.t. the Lebesgue measure on $R^{(m_1+1)^d}$. This density is assumed to be bounded from below by $\pi_g$ for some $\pi_g > 0$ on the restrictions defined in (8) and equal to zero elsewhere,

$g_j^m \in [\underline\sigma, \bar\sigma]$ for $o \le j \le m$; $\quad |D^{m,\lambda}(x; g^m)| \le M_n$ for $\sum_{i=1}^d \lambda_i \le \lfloor d/2 \rfloor + 1,\ x \in G^{m-\lambda},$ (8)

$M_n = n^{\gamma_1}, \quad \gamma_1 = \frac{[2\alpha/d - 1](1-s)}{d[6 + d/\alpha]},$ (9)

and $s$ is a constant in $(0,1)$.

Lemma 11 provides explicit formulas for $D^{m,\lambda}(G^{m-\lambda}; g^m)$ that are linear in $g^m$. Thus, the model with the sample size dependent prior can be implemented by truncating a positive density for the Bernstein polynomial coefficients to a set of linear inequality restrictions in (8) and using Markov chain Monte Carlo methods for estimation.

Proposition 2. Assume that the tails of $\epsilon_i$ are subexponential and $\sigma_0$ has uniformly bounded partial derivatives up to order $\lfloor d/2\rfloor + 2$. Then, for the prior specified above, Assumptions 2, 3 and 5 of Theorem 1 hold.

(3) This definition is from Majer (2012), who briefly outlines an argument for why partial derivatives of multivariate Bernstein polynomials for $f$ approximate corresponding partial derivatives of $f$; the details of the argument are worked out here in Lemma 11.

(4) The proof of Theorem 1 does not require any changes when the prior is indexed by $n$.
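To illustrate the parameterization $\sigma(x) = B^m(x; g^m)$, here is a short Python sketch (ours) that evaluates a bivariate Bernstein polynomial from its coefficients on the grid $G^m$; since Bernstein operators reproduce affine functions exactly, the printed value matches $\sigma_0$ at the test point.

```python
import math
import numpy as np

def bernstein_basis(m, x):
    """Univariate Bernstein basis: C(m,j) x^j (1-x)^(m-j), j = 0..m."""
    j = np.arange(m + 1)
    comb = np.array([math.comb(m, int(k)) for k in j], dtype=float)
    return comb * x**j * (1.0 - x)**(m - j)

def bernstein_eval(g, x):
    """Evaluate B^m(x; g) for d = 2, with g[j1, j2] the coefficients on G^m."""
    m1, m2 = g.shape[0] - 1, g.shape[1] - 1
    b1, b2 = bernstein_basis(m1, x[0]), bernstein_basis(m2, x[1])
    return b1 @ g @ b2

# If g = sigma0(G^m) on the grid, B^m(x; g) approximates sigma0(x):
m = 20
grid = np.linspace(0, 1, m + 1)
g = 1.0 + 0.25 * np.add.outer(grid, grid)      # sigma0(x) = 1 + 0.25 (x1 + x2)
print(bernstein_eval(g, np.array([0.3, 0.7]))) # 1.25, exactly (affine sigma0)
```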
5.3. Other possible priors

It should be possible to verify the assumptions of Theorem 1 for other common nonparametric prior distributions. First, a prior in the spirit of Gine and Nickl (2011) (p. 5, display (5)) is likely to satisfy the assumptions as it imposes an explicit uniform bound on the Hölder norm. Such a prior might be easier to implement than the prior of Section 5.1 since it does not involve truncation (the advantage of the prior of Section 5.1 is that it does not require uniform bounds on derivatives). Second, the sufficient conditions for the truncated Gaussian process in Proposition 1 show that an increase in the smoothness of the process, $J$, only weakens the sufficient conditions. Thus, truncated Gaussian processes with analytical sample paths, such as a process with the squared-exponential covariance kernel, are likely to satisfy the assumptions as well. Finally, a verification of the assumptions for a prior based on splines was present in an earlier version of the paper and is omitted from the current version for brevity.

6. Conclusion

The paper proves asymptotic normality of the posterior of the coefficients in a possibly misspecified heteroskedastic linear regression model. The model is shown to be robust to misspecification in the distribution of the regression error term. Thus, this model should be a more prominent part of the Bayesian toolbox for regression analysis.

7. Proofs

Proof (Lemma 1).

$\int \log\frac{f_0(y|x)}{(2\pi)^{-0.5}\sigma(x)^{-1}\exp\{-0.5(y - x'\beta)^2/\sigma(x)^2\}}\, dF_0(y,x) = \int \left[\log f_0(y|x) + 0.5\log(2\pi\sigma(x)^2) + \frac{(y - x'\beta)^2}{2\sigma(x)^2}\right] dF_0(y,x).$

Since $E[(Y_i - X_i'\beta)^2|X_i] = \sigma_0^2(X_i) + [X_i'(\beta - \beta_0)]^2$, $\beta = \beta_0$ is a minimizer of the KL distance for any $\sigma: X \to [\underline\sigma, \bar\sigma]$. It is a unique minimizer if $E(X_iX_i') > 0$. For any fixed $x$, $\log\sigma(x)^2 + \sigma_0^2(x)/\sigma(x)^2$ is uniquely minimized at $\sigma(x) = \sigma_0(x)$. □

Proof (Theorem 1). Conditional on $\sigma$, the posterior of $\beta$, $\Pi(\beta|\sigma, Y^n, X^n)$, is $N(\bar\beta, \bar H^{-1})$ truncated to $[-B,B]^d$, where

$\bar H = H + \sum_i \frac{X_iX_i'}{\sigma^2(X_i)} \quad \text{and} \quad \bar\beta = \bar H^{-1}\left(H\underline\beta + \sum_i \frac{X_iY_i}{\sigma^2(X_i)}\right).$

A derivation of the conditional posteriors in linear regression models can be found, for example, in Geweke (2005). The (marginal) posterior of $\beta$ can be expressed as

$\Pi(\beta|Y^n, X^n) = \int \Pi(\beta|\sigma, Y^n, X^n)\, d\Pi(\sigma|Y^n, X^n).$

The conditional posterior of the normalized coefficients $z = \sqrt n(\beta - \hat\beta_{GLS})$ is

$\Pi(z|\sigma, Y^n, X^n) \propto \phi\left(z, \sqrt n(\bar\beta - \hat\beta_{GLS}), (\bar H/n)^{-1}\right)\, 1_{\sqrt n([-B,B]^d - \hat\beta_{GLS})}(z),$

where $\phi(\cdot,\cdot,\cdot)$ denotes the density of the normal distribution and $1_\cdot(\cdot)$ is an indicator function.

The total variation distance of interest can be expressed as follows

$d_{TV}\left(\Pi[z|Y^n,X^n], N(0, E[X_iX_i'\sigma_0(X_i)^{-2}]^{-1})\right) = \int\left|\int \Pi(z|\sigma,Y^n,X^n)\, d\Pi(\sigma|Y^n,X^n) - \phi(z, 0, E[X_iX_i'\sigma_0(X_i)^{-2}]^{-1})\right| dz \le \int\int \left|\Pi(z|\sigma,Y^n,X^n) - \phi(z, 0, E[X_iX_i'\sigma_0(X_i)^{-2}]^{-1})\right| dz\, d\Pi(\sigma|Y^n,X^n).$ (10)

At this point in the proof, let us derive a bound on the total variation distance between two normal distributions and introduce notation for various matrix norms that will be used below. By Kemperman (1969) (or Proposition 1.2.2 in Ghosh and Ramamoorthi (2003)), the total variation distance is bounded by $\sqrt 2$ times the square root of the KL distance. The KL distance between two normal distributions $N(\mu_1, \Sigma_1)$ and $N(\mu_2, \Sigma_2)$ is equal to

$\frac12\left[\log\frac{|\Sigma_2|}{|\Sigma_1|} + \mathrm{tr}(\Sigma_2^{-1}\Sigma_1 - I) + (\mu_1-\mu_2)'\Sigma_2^{-1}(\mu_1-\mu_2)\right] \le \frac{\left||\Sigma_2^{-1}| - |\Sigma_1^{-1}|\right|}{\min(|\Sigma_2^{-1}|, |\Sigma_1^{-1}|)} + d\,\|\Sigma_2^{-1} - \Sigma_1^{-1}\|_\infty \|\Sigma_1\|_\infty + \|\mu_1 - \mu_2\|_2^2\, \|\Sigma_2^{-1}\|_2,$ (11)

where $|\Sigma|$ denotes the determinant of $\Sigma$, the matrix norm $\|\Sigma\|_\infty = \max_{ij}|[\Sigma]_{ij}|$ is the largest element of $\Sigma$ in the absolute value, and $\|\Sigma\|_2 = \sup_\mu \|\Sigma\mu\|_2/\|\mu\|_2$ is the matrix norm induced by the standard norm on $R^d$, $\|\mu\|_2^2 = \sum_{i=1}^d \mu_i^2$.

By Lemma 2 and inequality (11),

$d_{TV}\left(\Pi[\sqrt n(\beta - \hat\beta_{GLS})|Y^n,X^n], N(0, E[X_iX_i'\sigma_0(X_i)^{-2}]^{-1})\right) \le \int \left[\sqrt 2\,(A_n + B_n + C_n)^{1/2}/D_n + (1 - D_n)/D_n + 1 - E_n\right] d\Pi(\sigma|Y^n,X^n),$

where

$A_n = \frac{\left||\bar H/n| - |E[X_iX_i'\sigma_0(X_i)^{-2}]|\right|}{\min(|\bar H/n|, |E[X_iX_i'\sigma_0(X_i)^{-2}]|)},$

$B_n = d\, \|\bar H/n - E[X_iX_i'\sigma_0(X_i)^{-2}]\|_\infty\, \|E[X_iX_i'\sigma_0(X_i)^{-2}]^{-1}\|_\infty,$

$C_n = \|\bar H/n\|_2 \left\| \left[\frac1n\sum_i \frac{X_iX_i'}{\sigma_0(X_i)^2}\right]^{-1} \frac{1}{\sqrt n}\sum_i \frac{X_iY_i}{\sigma_0(X_i)^2} - (\bar H/n)^{-1}\left[\frac{H\underline\beta}{\sqrt n} + \frac{1}{\sqrt n}\sum_i \frac{X_iY_i}{\sigma(X_i)^2}\right]\right\|_2,$

$D_n = \int_{\sqrt n([-B,B]^d - \hat\beta_{GLS})} \phi\left(z, \sqrt n(\bar\beta - \hat\beta_{GLS}), (\bar H/n)^{-1}\right) dz,$

$E_n = \int_{\sqrt n([-B,B]^d - \hat\beta_{GLS})} \phi\left(z, 0, E[X_iX_i'\sigma_0(X_i)^{-2}]^{-1}\right) dz.$

It is shown in Lemma 3 that $E_n$ and $D_n$ have lower bounds that do not depend on $\sigma$ and converge to 1 in $F_0^\infty$ probability. By Lemmas 4–6, which bound $(A_n, B_n, C_n)$, and $\sqrt{a+b} \le \sqrt a + \sqrt b$ for any nonnegative $a$ and $b$, to prove the theorem it suffices to show that $\int \|\sum_i X_i\epsilon_i(\sigma_0(X_i)^{-2} - \sigma(X_i)^{-2})/\sqrt n\|_2\, d\Pi(\sigma|Y^n,X^n)$ and $\int d_2(\sigma_0^{-2}, \sigma^{-2})\, d\Pi(\sigma|Y^n,X^n)$ converge to zero in $F_0^\infty$ probability, where $d_2(\sigma^{-2}, \sigma_0^{-2}) = (\int [\sigma(x)^{-2} - \sigma_0(x)^{-2}]^2\, dF_0(x))^{0.5}$. First,

$\int \left\|\sum_i X_i\epsilon_i(\sigma_0(X_i)^{-2} - \sigma(X_i)^{-2})/\sqrt n\right\|_2 d\Pi(\sigma|Y^n,X^n) \le \sup_{\sigma\in S}\left\|\sum_i X_i\epsilon_i(\sigma_0^{-2}(X_i) - \sigma^{-2}(X_i))/\sqrt n\right\|_2 \left[\Pi(\{d_2(\sigma_0^{-2},\sigma^{-2}) > \delta_n\} \cap S_{M_n,\alpha}|Y^n,X^n) + \Pi(S^c_{M_n,\alpha}|Y^n,X^n)\right] + \sup_{\sigma\in S_{M_n,\alpha}:\ d_2(\sigma_0^{-2},\sigma^{-2})\le\delta_n}\left\|\sum_i X_i\epsilon_i(\sigma_0^{-2}(X_i) - \sigma^{-2}(X_i))/\sqrt n\right\|_2.$ (12)

The right hand side of the above display converges to zero in $F_0^\infty$ outer probability(5) if

$\sqrt n\, \Pi(\{d_2(\sigma_0^{-2},\sigma^{-2}) > \delta_n\} \cap S_{M_n,\alpha}|Y^n,X^n) \to 0$ in $F_0^\infty$ probability, (13)

$\sqrt n\, \Pi(S^c_{M_n,\alpha}|Y^n,X^n) \to 0$ in $F_0^\infty$ probability, (14)

$\sup_{\{\sigma\in S_{M_n,\alpha}:\ d_2(\sigma_0^{-2},\sigma^{-2})^2 \le \delta_n^2\}} \left|\sum_i X_{ij}\epsilon_i(\sigma_0^{-2}(X_i) - \sigma^{-2}(X_i))/\sqrt n\right| \to 0$ in $F_0^\infty$ outer probability. (15)

Lemmas 7–9 show that conditions (13)–(15) hold under the assumptions of the theorem. Finally, let us consider $\int d_2(\sigma_0^{-2}, \sigma^{-2})\, d\Pi(\sigma|Y^n,X^n)$. Since $d_2(\sigma_0^{-2}, \sigma^{-2}) \le \underline\sigma^{-2}$,

$F_0^\infty\left(\int d_2(\sigma_0^{-2},\sigma^{-2})\, d\Pi(\sigma|Y^n,X^n) > \epsilon\right) \le F_0^\infty\left(\underline\sigma^{-2}\Pi(d_2(\sigma_0^{-2},\sigma^{-2}) > \epsilon/2|Y^n,X^n) + \epsilon/2 > \epsilon\right) = F_0^\infty\left(\Pi(d_2(\sigma_0^{-2},\sigma^{-2}) > \epsilon/2|Y^n,X^n) > \epsilon/(2\underline\sigma^{-2})\right),$ (16)

which converges to zero by conditions (13) and (14). This completes the proof of the theorem when $\hat\beta_{GLS}$ is used for centering.

To see that the theorem also holds when the posterior mean is used for centering, note that

$\int \beta\, d\Pi(\beta|Y^n,X^n) = \int\left[\int \beta\, d\Pi(\beta|\sigma,Y^n,X^n)\right] d\Pi(\sigma|Y^n,X^n),$

$\sqrt n \left\|\hat\beta_{GLS} - \int \beta\, d\Pi(\beta|Y^n,X^n)\right\|_2 \le \int \sqrt n\left\|\hat\beta_{GLS} - \int\beta\, d\Pi(\beta|\sigma,Y^n,X^n)\right\|_2 d\Pi(\sigma|Y^n,X^n),$

and that $\sqrt n\|\hat\beta_{GLS} - \bar\beta\|_2$ is bounded in Lemma 6. Thus, under the posterior mean centering, the term analogous to $C_n$ above is handled by adding and subtracting $\hat\beta_{GLS}$ to/from the posterior mean, and applying the triangle inequality and the proof of Lemma 6. The use of the posterior mean instead of $\hat\beta_{GLS}$ in $D_n$ and $E_n$ does not require any changes in the proof of Lemma 3. The rest of the proof is not affected by the centering. □

(5) For standard notions and definitions used in empirical processes theory, such as outer probability and metric and bracketing entropy, see, for example, van der Vaart and Wellner (1996).

Lemma 2. For two distributions $P_1$ and $P_2$ with densities $p_1$ and $p_2$ w.r.t. a measure $\mu$, the total variation distance between $P_2$ truncated to a set $E$ and $P_1$ can be bounded as follows

$\int \left|p_1 - \frac{1_E\, p_2}{P_2(E)}\right| d\mu \le P_1(E^c) + \frac{1 - P_2(E)}{P_2(E)} + \frac{\int |p_1 - p_2|\, d\mu}{P_2(E)}.$

Proof.

$\int \left|p_1 - \frac{1_E\, p_2}{P_2(E)}\right| d\mu = \int_E \frac{|p_1 P_2(E) - p_2|}{P_2(E)}\, d\mu + P_1(E^c) \le \int_E \frac{|p_1(P_2(E) - 1) + p_1 - p_2|}{P_2(E)}\, d\mu + P_1(E^c) \le P_1(E^c) + \frac{1 - P_2(E)}{P_2(E)} + \frac{\int|p_1 - p_2|\, d\mu}{P_2(E)}.$ □

Lemma 3. Under the assumptions of Theorem 1, $E_n$ and $D_n$ defined in its proof have lower bounds that do not depend on $\sigma$ and converge to 1 in $F_0^\infty$ probability.

Proof.

$1 - \int_{\sqrt n([-B,B]^d - \hat\beta_{GLS})} \phi\left(z, \sqrt n(\bar\beta - \hat\beta_{GLS}), (\bar H/n)^{-1}\right) dz = 1 - \int_{\sqrt n([-B,B]^d - \bar\beta)} \phi(z, 0, (\bar H/n)^{-1})\, dz \le \sum_{i=1}^d \int_{z_i \notin \sqrt n([-B,B] - \bar\beta_i)} \phi(z, 0, (\bar H/n)^{-1})\, dz.$

Next, note that

$z'(\bar H/n)z \ge z'\left[\left(H + \sum_i X_iX_i'/\bar\sigma^2\right)/n\right] z \ge z'z\, e_m^n,$ (17)

where $e_m^n$ is the smallest eigenvalue of $(H + \sum_i X_iX_i'/\bar\sigma^2)/n$. As in the proof of Lemma 5,

$|\bar H/n| \le \left|\left(H + \sum_i X_iX_i'/\underline\sigma^2\right)/n\right|.$ (18)

Using the bound on $\|(\bar H/n)^{-1}\|_2$ from Lemma 6, we get

$|\bar\beta_i| \le \|\bar\beta\|_2 \le \|(\bar H/n)^{-1}\|_2 \cdot \left\|\left(H\underline\beta + \sum_i |X_iY_i|/\underline\sigma^2\right)/n\right\|_2 \le \frac{\bar\sigma^2\left\|\left(H\underline\beta + \sum_i |X_iY_i|/\underline\sigma^2\right)/n\right\|_2}{e_{\min}\left(\left(H\bar\sigma^2 + \sum_i X_iX_i'\right)/n\right)} = F_n \xrightarrow{F_0^\infty} F = \frac{\bar\sigma^2 \|E|X_iY_i|/\underline\sigma^2\|_2}{e_{\min}(E(X_iX_i'))}.$ (19)

Note that by Assumption 4 of Theorem 1, $B > F$. From (17)–(19),

$\int_{z_i \notin \sqrt n([-B,B] - \bar\beta_i)} \phi(z, 0, (\bar H/n)^{-1})\, dz \le 2\int_{z_i \ge \sqrt n(B - |\bar\beta_i|)} \phi(z, 0, (\bar H/n)^{-1})\, dz \le 2\left|\left(H + \sum_i X_iX_i'/\underline\sigma^2\right)/n\right|^{0.5} \int_{z_i \ge \sqrt n(B - F_n)} \exp\{-0.5 z'z\, e_m^n\}(2\pi)^{-d/2}\, dz \le 2\left|\left(H + \sum_i X_iX_i'/\underline\sigma^2\right)/n\right|^{0.5} (e_m^n)^{-d/2} \int_{z_i \ge \sqrt n(B - F_n)e_m^n} \exp\{-0.5 z_i^2\}(2\pi)^{-1/2}\, dz_i.$

For $z \ge 1$, the normal CDF can be bounded as follows, $1 - \Phi(z) \le \exp(-z^2)$. Thus, the integral in the last display is bounded by

$\exp\{-n(B - F_n)^2 (e_m^n)^2\} + 1\{\sqrt n(B - F_n)e_m^n < 1\} \xrightarrow{F_0^\infty} 0,$

where the convergence in probability follows from the convergence of $F_n$ and $e_m^n$. This completes the proof for $D_n$. The proof of the analogous result for $E_n$ is similar. □

Lemma 4. Expression $B_n$ from the proof of Theorem 1 can be bounded above by $B_n^1 + B_n^2\, d_2(\sigma^{-2}, \sigma_0^{-2})$, where $B_n^1 \xrightarrow{F_0^\infty} 0$ and $B_n^2 \xrightarrow{F_0^\infty} B^2$, $B^2$ is a constant, and $(B_n^1, B_n^2)$ do not depend on $\sigma$.

Proof.

$\|\bar H/n - E[X_iX_i'\sigma_0(X_i)^{-2}]\|_\infty \le \|H/n\|_\infty + \sup_{\sigma\in S}\left\|\frac1n\sum_i \frac{X_iX_i'}{\sigma^2(X_i)} - E\left[\frac{X_iX_i'}{\sigma^2(X_i)}\right]\right\|_\infty + \left\|E\left[X_iX_i'\left(\frac{1}{\sigma^2(X_i)} - \frac{1}{\sigma_0^2(X_i)}\right)\right]\right\|_\infty.$

The first term on the right hand side converges to zero. The second term converges to zero in outer probability by the assumed $F_0$-Glivenko–Cantelli property of the class $\{X_iX_i'\sigma^{-2}(X_i), \sigma \in S\}$. By the Cauchy–Schwarz inequality and the finiteness of the fourth moments of $X_i$, the last term is bounded by a constant multiple of $d_2(\sigma^{-2}, \sigma_0^{-2})$. □

Lemma 5. Expression $A_n$ from the proof of Theorem 1 is bounded above by $A_n^1 + A_n^2\, d_2(\sigma^{-2}, \sigma_0^{-2})$, where $A_n^1 \xrightarrow{F_0^\infty} 0$ and $A_n^2 \xrightarrow{F_0^\infty} A^2$, $A^2$ is a constant, and $(A_n^1, A_n^2)$ do not depend on $\sigma$.

Proof. It follows by the definition of the determinant and induction that for two $d\times d$ matrices $A$ and $B$, $|A| - |B| \le d!\cdot d\max(\|A\|_\infty, \|B\|_\infty)^{d-1}\cdot\|A - B\|_\infty$. Thus, the numerator of $A_n$ is bounded by a multiple of the bound on $B_n$ derived in Lemma 4 times $\max(\|\bar H/n\|_\infty, \|E[X_iX_i'\sigma_0(X_i)^{-2}]\|_\infty)^{d-1}$. Since $\|\bar H/n\|_\infty \le \|H/n\|_\infty + \|\sum_i X_iX_i'/n\|_\infty/\underline\sigma^2$, the numerator of $A_n$ is bounded above as desired. To bound the denominator of $A_n$ below, note that for symmetric positive semidefinite matrices $A$ and $B$, $A \ge B$ implies $|A| \ge |B|$ (see, for example, Lemma 1.4 in Zi-Zong (2009)). Thus, $|\bar H/n| \ge |\sum_i X_iX_i'/n|/\bar\sigma^{2d}$. Since $|\sum_i X_iX_i'/n| \xrightarrow{F_0^\infty} |E[X_iX_i']| > 0$, the claim of the lemma follows. □

Lemma 6. The following inequality holds for $C_n$ defined in the proof of Theorem 1,

$C_n \le C_n^1 + C_n^2 \left\|\frac{1}{\sqrt n}\sum_i X_i\epsilon_i\left(\frac{1}{\sigma_0(X_i)^2} - \frac{1}{\sigma(X_i)^2}\right)\right\|_2 + C_n^3\, d_2(\sigma_0^{-2}, \sigma^{-2}),$

where $(C_n^1, C_n^2, C_n^3)$ do not depend on $\sigma$, $C_n^1 \xrightarrow{F_0^\infty} 0$, $C_n^2$ converges in $F_0^\infty$ probability to a constant, and $C_n^3$ converges weakly to a random variable.

Proof. Plugging $Y_i = X_i'\beta_0 + \epsilon_i$ into the definition of $C_n$ results in

$C_n/\|\bar H/n\|_2 = \left\|(\bar H/n)^{-1} H(\underline\beta - \beta_0)/\sqrt n + (\bar H/n)^{-1}\frac{1}{\sqrt n}\sum_i \frac{X_i\epsilon_i}{\sigma(X_i)^2} - \left[\frac1n\sum_i \frac{X_iX_i'}{\sigma_0(X_i)^2}\right]^{-1}\frac{1}{\sqrt n}\sum_i \frac{X_i\epsilon_i}{\sigma_0(X_i)^2}\right\|_2.$ (20)

The first expression on the right hand side of (20) converges to zero in probability because $\|(\bar H/n)^{-1}\|_2$ is bounded above by a sequence converging in probability as it is shown below (see (23)). The norm of the remaining expressions can be bounded by(6)

$\|(\bar H/n)^{-1}\|_2\left\|\frac{1}{\sqrt n}\sum_i X_i\epsilon_i\left(\frac{1}{\sigma_0(X_i)^2} - \frac{1}{\sigma(X_i)^2}\right)\right\|_2 + \left\|\left[\frac1n\sum_i \frac{X_iX_i'}{\sigma_0(X_i)^2}\right]^{-1} - (\bar H/n)^{-1}\right\|_2 \left\|\frac{1}{\sqrt n}\sum_i \frac{X_i\epsilon_i}{\sigma_0(X_i)^2}\right\|_2.$ (21)

The norm of the difference of the inverses in the second line of (21) is bounded by(7)

$\left\|\left[\frac1n\sum_i \frac{X_iX_i'}{\sigma_0(X_i)^2}\right]^{-1}\right\|_2 \cdot \|(\bar H/n)^{-1}\|_2 \cdot \left\|\sum_i X_iX_i'\left(\frac{1}{\sigma_0(X_i)^2} - \frac{1}{\sigma(X_i)^2}\right)/n - H/n\right\|_2.$ (22)

Next, we separately consider the three parts of the product in (22). The first part converges to $\|E(X_iX_i'\sigma_0(X_i)^{-2})^{-1}\|_2$ in probability. The second part,

$\|(\bar H/n)^{-1}\|_2 = \sup_y \frac{\|(\bar H/n)^{-1}y\|_2}{\|y\|_2} = \left[\inf_y \frac{\|(\bar H/n)y\|_2}{\|y\|_2}\right]^{-1} \le \left[\inf_y \frac{|y'(\bar H/n)y|}{\|y\|_2^2}\right]^{-1} \le \left[\inf_y \frac{y'\left(H\bar\sigma^2 + \sum_i X_iX_i'\right)y}{n\,\bar\sigma^2\,\|y\|_2^2}\right]^{-1} = \frac{\bar\sigma^2}{e_{\min}\left(\left(H\bar\sigma^2 + \sum_i X_iX_i'\right)/n\right)} \xrightarrow{F_0^\infty} \frac{\bar\sigma^2}{e_{\min}(E(X_iX_i'))},$ (23)

where $e_{\min}(\cdot)$ stands for the smallest eigenvalue. In the preceding display, the first inequality follows by the Cauchy–Schwarz inequality, the second inequality follows by the positive semidefiniteness of $X_iX_i'$, and the last equality follows from the eigenvalue decomposition for symmetric matrices.(8)

The third part of the product in (22) is bounded above by

$\left\|\frac Hn\right\|_2 + \left\|\frac1n\sum_i \frac{X_iX_i'}{\sigma_0(X_i)^2} - E\left[\frac{X_iX_i'}{\sigma_0(X_i)^2}\right]\right\|_2 + \sup_{\sigma\in S}\left\|\frac1n\sum_i \frac{X_iX_i'}{\sigma(X_i)^2} - E\left[\frac{X_iX_i'}{\sigma(X_i)^2}\right]\right\|_2 + \left\|E\left[X_iX_i'\left(\frac{1}{\sigma_0(X_i)^2} - \frac{1}{\sigma(X_i)^2}\right)\right]\right\|_2,$

which can be bounded as in Lemma 4 ($\|A\|_2 \le \dim(A)\|A\|_\infty$). The bounds derived above and the Slutsky theorem imply the claim of the lemma. □

(6) $\|A^{-1}a - B^{-1}b\| \le \|A^{-1}(a-b)\| + \|(A^{-1} - B^{-1})b\| \le \|A^{-1}\|\,\|a-b\| + \|A^{-1} - B^{-1}\|\,\|b\|$.

(7) $\|A^{-1} - B^{-1}\| = \|A^{-1}(A - B)B^{-1}\| \le \|A^{-1}\|\,\|A - B\|\,\|B^{-1}\|$.

(8) $A = Q\Lambda Q'$, $QQ' = I$ and $\Lambda$ is a diagonal matrix with the eigenvalues of $A$ on the diagonal.

Lemma 7. Under the assumptions of Theorem 1, (14) holds.

Proof. Note that $\sqrt n\, \Pi(S^c_{M_n,\alpha}|Y^n, X^n)$ is bounded above by

$\sqrt n\, \Pi(S^c_{M_n,\alpha}) \exp\left\{n\left[\sup_{\beta\in[-B,B]^d,\sigma\in S}\frac1n\sum_i \log p_{\beta,\sigma}(Y_i|X_i) - \inf_{\beta\in[-B,B]^d,\sigma\in S}\frac1n\sum_i \log p_{\beta,\sigma}(Y_i|X_i)\right]\right\}.$

Since $\log p_{\beta,\sigma}(Y_i|X_i) = -\log(\sqrt{2\pi}\sigma(X_i)) - (Y_i^2 - 2\beta'X_iY_i + \beta'X_iX_i'\beta)/(2\sigma^2(X_i))$, $\beta \in [-B,B]^d$, and $0 < \underline\sigma \le \sigma \le \bar\sigma < \infty$, the expression in the square brackets is bounded above by a sufficiently large constant a.s. $F_0^\infty$ by the strong law of large numbers. This constant can be increased to some $A > 0$ so that $\sqrt n\, \Pi(S^c_{M_n,\alpha}|Y^n, X^n) \le \Pi(S^c_{M_n,\alpha})\exp\{An\}$ for all sufficiently large $n$ a.s. $F_0^\infty$. Together with Assumption 3 in Theorem 1, this implies the claim of the lemma. □

Lemma 8. Under the assumptions of Theorem 1, (13) holds.

Proof. Let $Z_n = \sup_{\beta\in[-B,B]^d, \sigma\in S_{M_n,\alpha}} \left|\frac1n\sum_i \log\frac{p_{\beta_0,\sigma_0}(Y_i|X_i)}{p_{\beta,\sigma}(Y_i|X_i)} - E\log\frac{p_{\beta_0,\sigma_0}(Y_i|X_i)}{p_{\beta,\sigma}(Y_i|X_i)}\right|$. By Lemma 10, $d_2(\sigma_0^{-2}, \sigma^{-2})^2\, C \le E\log(p_{\beta_0,\sigma_0}/p_{\beta,\sigma})$, where $C$ is a constant in $(0,\infty)$. Thus,

$\Pi\left(S_{M_n,\alpha} \cap \{d_2(\sigma_0^{-2},\sigma^{-2})^2 > \delta_n^2\}\middle|Y^n, X^n\right) \le \Pi\left(S_{M_n,\alpha} \cap \left\{E\log\frac{p_{\beta_0,\sigma_0}}{p_{\beta,\sigma}} > C\delta_n^2\right\}\middle|Y^n, X^n\right)$

$\le \frac{\int_{S_{M_n,\alpha}\cap\{E\log(p_{\beta_0,\sigma_0}/p_{\beta,\sigma}) > C\delta_n^2\}} \exp\{n[C\delta_n^2/2 - E\log(p_{\beta_0,\sigma_0}/p_{\beta,\sigma}) + Z_n]\}\, d\Pi}{\int_{\{E\log(p_{\beta_0,\sigma_0}/p_{\beta,\sigma}) \le C\delta_n^2/2\}} \exp\{n[C\delta_n^2/2 - E\log(p_{\beta_0,\sigma_0}/p_{\beta,\sigma}) - Z_n]\}\, d\Pi} \le \frac{\exp\{n[-C\delta_n^2/2 + 2Z_n]\}}{\Pi\left(S_{M_n,\alpha} \cap \{E\log(p_{\beta_0,\sigma_0}/p_{\beta,\sigma}) \le C\delta_n^2/2\}\right)} \le \exp\{-n[(1-r)C\delta_n^2/2 - 2Z_n]\},$

where the last inequality follows from Assumptions 3 and 5 of the theorem. For $\epsilon > 0$,

$F_0^{\infty*}\left(\sqrt n\, \Pi(\{d_2(\sigma_0^{-2},\sigma^{-2}) > \delta_n\} \cap S_{M_n,\alpha}|Y^n,X^n) > \epsilon\right) \le F_0^{\infty*}\left(Z_n > \frac{(1-r)C\delta_n^2}{4} - \frac{\log[\sqrt n/\epsilon]}{2n}\right) \le F_0^{\infty*}\left(Z_n > \frac{(1-r)C\delta_n^2}{8}\right),$ (24)

where the last inequality holds for all sufficiently large $n$ by $\delta_n^4 n \to \infty$ (Assumption 2 of the theorem). It remains to prove that the last bound in (24) converges to zero. Since

$\log\frac{p_{\beta_0,\sigma_0}(Y_i|X_i)}{p_{\beta,\sigma}(Y_i|X_i)} = \frac12\left[\log\sigma^2(X_i) - \log\sigma_0^2(X_i) - \frac{\epsilon_i^2}{\sigma_0^2(X_i)} + \frac{\epsilon_i^2}{\sigma^2(X_i)}\right] + \frac{\epsilon_iX_i'}{\sigma^2(X_i)}(\beta_0 - \beta) + (\beta - \beta_0)'\frac{X_iX_i'}{2\sigma^2(X_i)}(\beta - \beta_0)$ (25)

and $\beta, \beta_0 \in [-B,B]^d$, it suffices to prove that

$F_0^{\infty*}\left(\sup_{f\in F_n}\left|\frac1n\sum_i [f(\epsilon_i, X_i) - Ef(\epsilon_i, X_i)]\right| \ge \tilde C\delta_n^2\right) \to 0, \quad \forall \tilde C > 0,$ (26)

for the following four classes $F_n$: $\{\log\sigma(X_i), \sigma\in S_{M_n,\alpha}\}$, $\{\epsilon_i^2/\sigma^2(X_i), \sigma\in S_{M_n,\alpha}\}$, $\{\epsilon_iX_{ik}/\sigma^2(X_i), \sigma\in S_{M_n,\alpha}\}$, and $\{X_{ij}X_{ik}/\sigma^2(X_i), \sigma\in S_{M_n,\alpha}\}$. Convergence at rate $\delta_n^2$ for the expressions in (25) involving $\sigma_0$ follows from the Chebyshev's inequality, the existence of second moments (Assumption 1), and $\delta_n^4 n \to \infty$.

It follows from inequalities (30) and (31) on p. 31 in Pollard (1984) that the probability in (26) is bounded above by

$4E^*\left[\min\left(1,\ 2N_1\left(\frac{\tilde C\delta_n^2}{8}, F_n, L_1(F_{0,n})\right) \exp\left\{-\frac12\, n\left[\frac{\tilde C\delta_n^2}{8}\right]^2 \middle/ \max_{j=1,\ldots,N_1}\frac1n\sum_i g_j(\epsilon_i, X_i)^2\right\}\right)\right],$

where $N_1 = N_1(\cdot, F_n, L_1(F_{0,n}))$ is the covering number w.r.t. the $L_1(F_{0,n})$ distance, $F_{0,n}$ is the empirical measure corresponding to $F_0$, and the $g_j$'s are the centers of the covering balls. Lemma 2.3.7 in van der Vaart and Wellner (1996) on p. 112 gives a proof of Eq. (30) in Pollard (1984) using outer probabilities and expectations to avoid measurability problems.

As $X$ is assumed to be bounded and convex, it follows from lemma 2.7.1 in van der Vaart and Wellner (1996) that the metric entropy in the sup norm of $S_{M,\alpha}$ is bounded by

$\log N(\epsilon, S_{M,\alpha}, \|\cdot\|_\infty) \le K(M/\epsilon)^{d/\alpha},$ (27)

where $K > 0$ is a constant that does not depend on $(M, \epsilon)$. For $F_n = \{\epsilon_i^2/\sigma^2(X_i), \sigma\in S_{M_n,\alpha}\}$,

$\frac1n\sum_i |f_1(\epsilon_i,X_i) - f_2(\epsilon_i,X_i)| \le \sup|\sigma_1^{-2} - \sigma_2^{-2}|\sum_i \epsilon_i^2/n.$

Then, the bound on the uniform metric entropy for $S_{M_n,\alpha}$ in (27) implies

$N_1(\epsilon, F_n, L_1(F_{0,n})) \le \exp\{C_1(Q_nM_n/\epsilon)^{d/\alpha}\},$

where $C_1$ is a constant and $Q_n = \sum_i \epsilon_i^2/n$ converges a.s. to a constant. Also note that

$\max_{j=1,\ldots,N_1}\frac1n\sum_i g_j^2(\epsilon_i, X_i) \le \underline\sigma^{-4}R_n,$

where $R_n = \sum_i \epsilon_i^4/n$ converges a.s. to a constant. Thus, the bound on (26) is bounded above by

$4E^*\left[\min\left(1,\ 2\exp\left\{C_2 Q_n^{d/\alpha} M_n^{d/\alpha} \delta_n^{-2d/\alpha} - C_3 n\delta_n^4/R_n\right\}\right)\right],$

where $C_2$ and $C_3$ are positive constants. Since the integrand is uniformly bounded by 1, the limsup version of Fatou's lemma implies that the previous display converges to zero under Assumption 2 of Theorem 1. The claim in (26) for the other three classes $F_n$ follows by the same argument with modified $R_n$ and $Q_n$. □

Lemma 9. Under the assumptions of Theorem 1, (15) holds.

Proof. The proof is based on lemma 19.34 in van der Vaart (1998). Define

$F = \{X_{ij}\epsilon_i[\sigma_0^{-2}(X_i) - \sigma^{-2}(X_i)]:\ d_2(\sigma_0^{-2},\sigma^{-2})^2 \le \delta_n^2,\ \sigma \in S_{M_n,\alpha}\}.$

For any $f \in F$, $Ef^2 \le C_1\delta_n^2 = \delta^2$, where $C_1$ is a positive constant and $\delta$ is defined by the last equality. For $f_1, f_2 \in F$, $E(f_1 - f_2)^2 \le C_2\sup|\sigma_1 - \sigma_2|^2$, where $C_2$ is a positive constant. Thus, by (27), the bracketing $L_2(F_0)$ entropy $\log N_{[]}(\delta, F, L_2(F_0)) \le C_3[M_n/\delta]^{d/\alpha}$ and

$a(\delta) = \delta/\sqrt{\log N_{[]}(\delta, F, L_2(F_0))} \ge C_4\, \delta_n^{1+d/(2\alpha)} M_n^{-d/(2\alpha)}.$

A bound on the bracketing integral is proportional to $\delta_n^{1-d/(2\alpha)} M_n^{d/(2\alpha)}$:

$J_{[]}(\delta, F, L_2(F_0)) \le C_5\, \delta_n^{1-d/(2\alpha)} M_n^{d/(2\alpha)}.$

By lemma 19.34 in van der Vaart (1998),

$E^*\sup_{f\in F}\left|\sum_i f(\epsilon_i,X_i)/\sqrt n\right| \le J_{[]}(\delta, F, L_2(F_0)) + \sqrt n\, E(|X_{ij}\epsilon_i|1\{|X_{ij}\epsilon_i| > \sqrt n\, a(\delta)\}).$

By Assumption 2 of Theorem 1, this display converges to zero. Thus, the claim of the lemma is proved. □

Lemma 10. For some positive constants $\underline C$ and $\bar C$,

$E(\log(p_{\beta_0,\sigma_0}/p_{\beta,\sigma})) \ge \underline C\,[\|\beta - \beta_0\|_2^2 + E(\sigma_0^2 - \sigma^2)^2]$ (28)

$E(\log(p_{\beta_0,\sigma_0}/p_{\beta,\sigma})) \le \bar C\,[\|\beta - \beta_0\|_2^2 + E(\sigma_0^2 - \sigma^2)^2].$ (29)

Proof. The law of iterated expectations implies

$E\log\frac{p_{\beta_0,\sigma_0}}{p_{\beta,\sigma}} = E\left\{\frac12\left[\log\frac{\sigma^2(X_i)}{\sigma_0^2(X_i)} + \frac{\sigma_0^2(X_i) - \sigma^2(X_i)}{\sigma^2(X_i)}\right] + \frac12(\beta - \beta_0)'\frac{X_iX_i'}{\sigma^2(X_i)}(\beta - \beta_0)\right\}.$

First, for $e_{\min}$ and $e_{\max}$ denoting the smallest and largest eigenvalues, note that

$\frac{e_{\min}(E(X_iX_i'))}{\bar\sigma^2}\|\beta - \beta_0\|_2^2 \le E\left[(\beta - \beta_0)'\frac{X_iX_i'}{\sigma^2(X_i)}(\beta - \beta_0)\right] \le \frac{e_{\max}(E(X_iX_i'))}{\underline\sigma^2}\|\beta - \beta_0\|_2^2.$

Second, let $\sigma_0^2/\sigma^2 = z$ and $q(z) = (z - 1 - \log z)/(z - 1)^2$. Note that $q(z)$ is well defined, positive, and monotonically decreasing on $(0,\infty)$.

Thus, for any $z \in [\underline\sigma^2/\bar\sigma^2, \bar\sigma^2/\underline\sigma^2]$, $0 < q(\bar\sigma^2/\underline\sigma^2) \le q(z) \le q(\underline\sigma^2/\bar\sigma^2) < \infty$. From this inequality,

$\frac{E(\sigma_0^2 - \sigma^2)^2}{\bar\sigma^4}\, q(\bar\sigma^2/\underline\sigma^2) \le E\left[\log\frac{\sigma^2(X_i)}{\sigma_0^2(X_i)} + \frac{\sigma_0^2(X_i) - \sigma^2(X_i)}{\sigma^2(X_i)}\right] \le \frac{E(\sigma_0^2 - \sigma^2)^2}{\underline\sigma^4}\, q(\underline\sigma^2/\bar\sigma^2).$ (30)

Thus, inequalities (28) and (29) are proved. □

Proof (Proposition 1). First, consider Assumption 5. By Lemma 10 and $E(\sigma^2 - \sigma_0^2)^2 \le [2\bar\sigma\sup|\sigma - \sigma_0|]^2$,

$\Pi\left(E\log\frac{p_{\beta_0,\sigma_0}}{p_{\beta,\sigma}} \le C\delta_n^2\right) \ge \Pi\left(\|\beta - \beta_0\|_2^2 \le \frac{C\delta_n^2}{2\bar C}\right) \cdot \Pi\left(\sup|\sigma - \sigma_0| \le \frac{C^{1/2}\delta_n}{2\bar\sigma(2\bar C)^{1/2}}\right).$ (31)

A lower bound on the first term on the right hand side is proportional to $\delta_n^d$, which is polynomial in $n$ and, thus, can be ignored. Since $\underline\sigma < \sigma_0 < \bar\sigma$, for all sufficiently small $\varepsilon_n$,

$\Pi\left(\sup_x|\sigma(x) - \sigma_0(x)| \le 2\varepsilon_n\right) = \mathcal W\left(\sup_x\left|(I^J W)(x) + \sum_{j=0}^J Z_jx^j/j! - \sigma_0(x)\right| \le 2\varepsilon_n\right)\Big/\mathcal W(T).$

Theorems 2.1 and 4.1 in van der Vaart and van Zanten (2008) imply the following lower bound for the prior concentration probability around $\sigma_0$,

$\Pi\left(\sup_x|\sigma(x) - \sigma_0(x)| \le 2\varepsilon_n\right) \ge \mathcal W(T)^{-1}\exp\{-n\varepsilon_n^2\}, \quad \varepsilon_n \propto n^{-\alpha_0/(2J+2)},$

when $\alpha_0 \le J + 1/2$. Thus, Assumption 5 of Theorem 1 holds as long as

$n^{-\alpha_0/(2J+2)}/\delta_n \to 0, \quad \alpha_0 \le J + 1/2.$ (32)

Next, let us consider Assumption 3 of Theorem 1 with $\alpha = J$. Note that

$\left\{(I^JW)(x) + \sum_{j=0}^J Z_jx^j/j! \in S^c_{M_n,\alpha}\right\} \subset \left\{\sup_x|W(x)| \ge M_n/4\right\} \cup \left\{\exists j,\ |Z_j| \ge M_n/(2(J+1))\right\}.$

Thus, $\Pi(S^c_{M_n,\alpha}) \le \mathcal W(T)^{-1}\left[\mathcal W(\sup_x|W(x)| \ge M_n/4) + (J+1)\mathcal W(|Z_j| \ge M_n/(2(J+1)))\right]$. Note that $\mathcal W(\sup_x|W(x)| \ge M_n/4) \le 4[1 - \Phi(M_n/4)]$ by the standard results on barrier crossing probabilities for $W$; see, for example, theorem 7.1 on p. 314 in Shorack (2000). Then, Assumption 3 of Theorem 1 holds if

$M_n \ge n^{(1+r_1)/2}, \quad \text{for some } r_1 > 0.$ (33)

Thus, to prove the proposition it suffices to find $\delta_n$ and $M_n$ that satisfy conditions (3), (32) and (33). These conditions are satisfied if

$\gamma_2 < \frac{\alpha_0}{2J+2}, \quad \gamma_2 < \frac{1}{6 + J^{-1}}, \quad \alpha_0 \le J + 1/2, \quad \text{and} \quad \gamma_1 = \frac{1+r_1}{2} < (2J-1)\gamma_2.$

Values of $(\gamma_1, \gamma_2, r_1)$ satisfying these inequalities can be chosen as long as

$\frac{1}{2(2J-1)} < \min\left(\frac{1}{6+J^{-1}}, \frac{\alpha_0}{2J+2}\right),$

which holds for any $J \ge 3$ and $\alpha_0 > (J+1)/(2J-1)$. □

Proof (Proposition 2). For $\alpha = \lfloor d/2\rfloor + 1$, $\gamma_1$ from (9), $\gamma_3$ from (7), $\delta_n = n^{-\gamma_2}$, and $\gamma_2 = \frac{1-s}{d[6 + d/(\lfloor d/2\rfloor + 1)]}$, it can be verified by a direct calculation that the inequalities in (3) hold. Thus, Assumption 2 of Theorem 1 holds.

Simple algebraic manipulations deliver the following convergence results that are used below,

$\frac{M_n}{m_1^{\lfloor d/2\rfloor + 1/2}} \to 0, \quad \delta_n m_1^{1/2} \to \infty, \quad \frac{(m_1+1)^d\log m_1}{n\delta_n^2} \to 0 \quad \text{as } n \to \infty.$ (34)

By Part (iv) of Lemma 11, the assumed existence and uniform boundedness of partial derivatives up to order $\lfloor d/2\rfloor + 2$ for $\sigma_0$ implies that for some $N \in \mathbb N$ and a constant $M > 0$

$|D^{m,\lambda}\sigma_0(x)| \le M$ (35)

for all $x \in [0,1]^d$, all $m_i \ge N$, $i = 1,\ldots,d$, and all $\lambda$ with $|\lambda| = \sum_{j=1}^d \lambda_j \le \lfloor d/2\rfloor + 1$. Constant $M$ can be chosen sufficiently large so that it is also a Lipschitz constant for $\sigma_0$: $|\sigma_0(x_1) - \sigma_0(x_2)| \le M\|x_1 - x_2\|_2$.

First, let us verify that $\Pi_n$ puts probability 1 on $S_{M_n,\lfloor d/2\rfloor+1}$. For any $x \in [0,1]^d$, $\sum_{j_i=0}^{m_i}\binom{m_i}{j_i}x_i^{j_i}(1-x_i)^{m_i-j_i} = (x_i + 1 - x_i)^{m_i} = 1$ and $\sum_{o\le j\le m}\prod_{i=1}^d \binom{m_i}{j_i}x_i^{j_i}(1-x_i)^{m_i-j_i} = 1$. Thus, for $g^m$ satisfying (8),

$\underline\sigma \le \min_{o\le j\le m} g_j^m \le B^m(x; g^m) \le \max_{o\le j\le m} g_j^m \le \bar\sigma.$ (36)

Next, let us consider partial derivatives of $B^m(x; g^m)$. By (5), (6), (8) and (36),

$|\partial^\lambda B^m(x, g^m)| = |B^{m-\lambda}(x; D^{m,\lambda}(G^{m-\lambda}; g^m))| \le \max|D^{m,\lambda}(G^{m-\lambda}; g^m)| \le M_n.$

Thus, Assumption 3 of Theorem 1 holds.

Next, consider Assumption 5 of Theorem 1. By (31), it suffices to show that

$\Pi_n\left(g^m: \sup_x|B^m(x; g^m) - \sigma_0(x)| \le C^{1/2}\delta_n/(2\bar\sigma(2\bar C)^{0.5})\right) \ge \exp\{-nrC\delta_n^2\}.$

By theorem B.7 in Appendix B of Heitzinger (2002), $|B^m(x; g^{\star m}) - \sigma_0(x)| \le \frac{Md^{0.5}}{2m_1^{0.5}}$ for $g^{\star m} = \sigma_0(G^m)$. For $g^m$ satisfying $\max_j|g_j^m - g_j^{\star m}| \le (M_n - M)/((2m_1)^{\lfloor d/2\rfloor+1})$,

$\sup_x|B^m(x; g^m) - \sigma_0(x)| \le \sup_x|B^m(x; g^{\star m}) - \sigma_0(x)| + \sup_x|B^m(x; g^{\star m}) - B^m(x; g^m)| \le \frac{[M/2]d^{0.5}}{m_1^{0.5}} + \frac{M_n - M}{(2m_1)^{\lfloor d/2\rfloor+1}} \le \frac{Md^{0.5}}{m_1^{0.5}},$

where the last inequality holds for all sufficiently large $m_1$ by the first convergence result in (34). Also, note that for such $g^m$, the prior truncation constraint (8) holds (for $|\lambda| \le \lfloor d/2\rfloor + 1$ and $x \in G^{m-\lambda}$, $|D^{m,\lambda}(x; g^{\star m})| = |D^{m,\lambda}(\sigma_0)(x)| \le M$, and, thus, by an analog of (39) in Lemma 11, $|D^{m,\lambda}(x; g^m)| \le |D^{m,\lambda}(x; g^m) - D^{m,\lambda}(x; g^{\star m})| + M \le M_n$). Therefore,

$\Pi_n\left(\sup_x|B^m(x; g^m) - \sigma_0(x)| \le \frac{C^{1/2}\delta_n}{2\bar\sigma(2\bar C)^{0.5}}\right) \ge \Pi_n\left(\sup_x|B^m(x; g^m) - \sigma_0(x)| \le \frac{Md^{0.5}}{m_1^{0.5}}\right) \ge \Pi_n\left(\max_j|g_j^m - g_j^{\star m}| \le \frac{M_n - M}{(2m_1)^{\lfloor d/2\rfloor+1}}\right) \ge \left[\pi_g \cdot \frac{M_n - M}{(2m_1)^{\lfloor d/2\rfloor+1}}\right]^{(m_1+1)^d},$

where the first inequality holds because $\delta_n m_1^{0.5} \to \infty$ and the last inequality follows because the prior density is assumed to be bounded from below by $\pi_g$ on its support (8). The last bound in the above inequality combined with $[(m(n)+1)^d\log m(n)]/(n\delta_n^2) \to 0$ implies Assumption 5 of Theorem 1. □

Lemma 11. (i) For $x = (x_i, x_{-i}) \in [0,1]^d$, where $x_i \in [0,1]$,

$(D_i^{m_i,\lambda_i}f)(x_i, x_{-i}) = \prod_{j_i=1}^{\lambda_i}(m_i - \lambda_i + j_i)\sum_{j_i=0}^{\lambda_i} f\left(x_i\frac{m_i-\lambda_i}{m_i} + \frac{j_i}{m_i}, x_{-i}\right)(-1)^{\lambda_i-j_i}\binom{\lambda_i}{j_i}.$ (37)

(ii) For multi-indices $m \ge \lambda$,

$(D^{m,\lambda}f)(x) = \prod_{i=1}^d\prod_{j_i=1}^{\lambda_i}(m_i - \lambda_i + j_i)\sum_{0\le j\le\lambda} f\left(x\frac{m-\lambda}{m} + \frac jm\right)\prod_{i=1}^d(-1)^{\lambda_i-j_i}\binom{\lambda_i}{j_i},$ (38)

where operations on vectors $(x, \lambda, m, j)$ inside $f$ are coordinate-wise.

(iii) For functions $f_1$ and $f_2$ from $[0,1]^d$ to $R$,

$|(D^{m,\lambda}f_1)(x) - (D^{m,\lambda}f_2)(x)| \le (2\max_i m_i)^{|\lambda|}\sup_x|f_1(x) - f_2(x)|.$ (39)

(iv) If all components of $m$ are the same ($m_i = m_1$) and $f$ has bounded partial derivatives up to order $|\lambda| + 1$, then

$\sup_x\left|(D^{m,\lambda}f)(x) - (\partial^\lambda f)\left(x\frac{m-\lambda}{m}\right)\right| \to 0 \quad \text{as } m_1 \to \infty.$ (40)

(v) Eq. (4) holds.

Proof. Formula (37) follows by induction on $\lambda_i$: for $\lambda_i = 1$, the formula holds by the definition of $D_i^{m_i,\lambda_i}$; assume it holds for $\lambda_i$. Then $D_i^{m_i,\lambda_i+1}f(x)/\prod_{j_i=1}^{\lambda_i+1}(m_i - \lambda_i - 1 + j_i)$ is equal to

$\sum_{j_i=0}^{\lambda_i}\left[f\left(x_i\frac{m_i-\lambda_i-1}{m_i} + \frac{j_i+1}{m_i}, x_{-i}\right) - f\left(x_i\frac{m_i-\lambda_i-1}{m_i} + \frac{j_i}{m_i}, x_{-i}\right)\right](-1)^{\lambda_i-j_i}\binom{\lambda_i}{j_i} = \sum_{j_i=0}^{\lambda_i+1} f\left(x_i\frac{m_i-\lambda_i-1}{m_i} + \frac{j_i}{m_i}, x_{-i}\right)\left[(-1)^{\lambda_i-(j_i-1)}\binom{\lambda_i}{j_i-1} - (-1)^{\lambda_i-j_i}\binom{\lambda_i}{j_i}\right],$

and (37) is proved since the last square bracket term is equal to $(-1)^{\lambda_i+1-j_i}\binom{\lambda_i+1}{j_i}$.

Formula (38) follows from repeated application of (37). Inequality (39) follows immediately from (38), $\prod_{j_i=1}^{\lambda_i}(m_i - \lambda_i + j_i) \le m_i^{\lambda_i}$, and $\sum_{j_i=0}^{\lambda_i}\binom{\lambda_i}{j_i} = 2^{\lambda_i}$.

To prove (40), plug the following Taylor expansion,

$f\left(x\frac{m-\lambda}{m} + \frac jm\right) = \sum_{|l|\le|\lambda|}\partial^l f\left(x\frac{m-\lambda}{m}\right)\frac{1}{l!}\prod_{i=1}^d\left(\frac{j_i}{m_1}\right)^{l_i} + R(x,\lambda,m,j)\cdot(1/m_1)^{|\lambda|+1},$

where $R(x,\lambda,m,j)$ is bounded by a constant independent of $(x,\lambda,m,j)$ and $l! = l_1!\cdots l_k!$, into (38) to obtain

$(D^{m,\lambda}f)(x) = \prod_{i=1}^d\prod_{j_i=1}^{\lambda_i}(m_1 - \lambda_i + j_i)\left[\sum_{|l|\le|\lambda|}\partial^l f\left(x\frac{m-\lambda}{m}\right)\frac{1}{l!\, m_1^{|l|}}\prod_{i=1}^d\sum_{j_i=0}^{\lambda_i} j_i^{l_i}(-1)^{\lambda_i-j_i}\binom{\lambda_i}{j_i} + \left(\frac{1}{m_1}\right)^{|\lambda|+1}\sum_{0\le j\le\lambda} R(x,\lambda,m,j)\prod_{i=1}^d(-1)^{\lambda_i-j_i}\binom{\lambda_i}{j_i}\right].$

By properties of binomial coefficients, $\sum_{j_i=0}^{\lambda_i}(-1)^{\lambda_i-j_i}\binom{\lambda_i}{j_i}j_i^{l_i}$ is equal to zero if $l_i < \lambda_i$ and $\lambda_i!$ if $l_i = \lambda_i$ (Ruiz, 1996). Thus,

$(D^{m,\lambda}f)(x) = \prod_{i=1}^d\prod_{j_i=1}^{\lambda_i}\frac{m_1 - \lambda_i + j_i}{m_1}\cdot\partial^\lambda f\left(x\frac{m-\lambda}{m}\right) + \prod_{i=1}^d\prod_{j_i=1}^{\lambda_i}(m_1 - \lambda_i + j_i)\left(\frac{1}{m_1}\right)^{|\lambda|+1}\sum_{0\le j\le\lambda} R(x,\lambda,m,j)\prod_{i=1}^d(-1)^{\lambda_i-j_i}\binom{\lambda_i}{j_i},$

and (40) follows as $\partial^\lambda f$ is uniformly bounded and the leading product converges to 1.

Eq. (4) follows by a direct calculation,

$\frac{\partial}{\partial x_i}(B_i^{m_i}f)(x) = \sum_{j_i=1}^{m_i} f(j_i/m_i, x_{-i})\binom{m_i}{j_i}j_ix_i^{j_i-1}(1-x_i)^{m_i-j_i} - \sum_{j_i=0}^{m_i-1} f(j_i/m_i, x_{-i})\binom{m_i}{j_i}(m_i - j_i)x_i^{j_i}(1-x_i)^{m_i-1-j_i} = \sum_{j_i=0}^{m_i-1} m_if((j_i+1)/m_i, x_{-i})\binom{m_i-1}{j_i}x_i^{j_i}(1-x_i)^{m_i-1-j_i} - \sum_{j_i=0}^{m_i-1} m_if(j_i/m_i, x_{-i})\binom{m_i-1}{j_i}x_i^{j_i}(1-x_i)^{m_i-1-j_i} = \sum_{j_i=0}^{m_i-1}(D_i^{m_i}f)(j_i/(m_i-1), x_{-i})\binom{m_i-1}{j_i}x_i^{j_i}(1-x_i)^{m_i-1-j_i} = (B_i^{m_i-1}D_i^{m_i}f)(x).$ □

Acknowledgments

The author is grateful to Ulrich Mueller, Justinas Pelenis, and Chris Sims for helpful discussions. He also thanks anonymous referees for useful suggestions. This paper is based upon work partially supported by the National Science Foundation under Grant No. SES-1260861.

References

Andrews, D.W., 1994. Empirical process methods in econometrics. In: Engle, R.F., McFadden, D. (Eds.), Handbook of Econometrics, vol. 4. Elsevier, pp. 2247–2294 (Chapter 37).
Barron, A., 1988. The Exponential Convergence of Posterior Probabilities with Implications for Bayes Estimators of Density Functions. University of Illinois, Dept. of Statistics.
Barron, A., Schervish, M.J., Wasserman, L., 1999. The consistency of posterior distributions in nonparametric problems. Ann. Statist. 27, 536–561.
Belitser, E., Ghosal, S., 2003. Adaptive Bayesian inference on the mean of an infinite-dimensional normal distribution. Ann. Statist. 31, 536–559. Dedicated to the memory of Herbert E. Robbins.
Bickel, P.J., Kleijn, B.J.K., 2012. The semiparametric Bernstein–von Mises theorem. Ann. Statist. 40, 206–237.
Burda, M., Prokhorov, A., 2013. Copula based factorization in Bayesian multivariate infinite mixture models. Working Papers, University of Toronto, Department of Economics.
Carroll, R.J., 1982. Adapting for heteroscedasticity in linear models. Ann. Statist. 10, 1224–1233.
Castillo, I., 2012. A semiparametric Bernstein–von Mises theorem for Gaussian process priors. Probab. Theory Related Fields 152, 53–99.
Castillo, I., Nickl, R., 2013. Nonparametric Bernstein–von Mises theorems in Gaussian white noise. Ann. Statist. 41, 1999–2028.
Castillo, I., Nickl, R., 2014. On the Bernstein–von Mises phenomenon for nonparametric Bayes procedures. Ann. Statist. 42, 1941–1969.
Castillo, I., Rousseau, J., 2013. A general Bernstein–von Mises theorem in semiparametric models. ArXiv:1305.4482.
Chamberlain, G., 1987. Asymptotic efficiency in estimation with conditional moment restrictions. J. Econometrics 34, 305–334.
Chernozhukov, V., Hong, H., 2003. An MCMC approach to classical estimation. J. Econometrics 115, 293–346.
Chib, S., Greenberg, E., 2013. On conditional variance estimation in nonparametric regression. Stat. Comput. 23, 261–270.
Chung, Y., Dunson, D.B., 2009. Nonparametric Bayes conditional distribution modeling with variable selection. J. Amer. Statist. Assoc. 104, 1646–1660.
De Iorio, M., Muller, P., Rosner, G.L., MacEachern, S.N., 2004. An ANOVA model for dependent random measures. J. Amer. Statist. Assoc. 99, 205–215.
Dunson, D.B., Park, J.-H., 2008. Kernel stick-breaking processes. Biometrika 95, 307–323.
Geweke, J., 2005. Contemporary Bayesian Econometrics and Statistics. Wiley-Interscience.
Geweke, J., Keane, M., 2007. Smoothly mixing regressions. J. Econometrics 138, 252–290.
Ghosal, S., 2001. Convergence rates for density estimation with Bernstein polynomials. Ann. Statist. 29, 1264–1280.
Ghosal, S., Ghosh, J.K., Ramamoorthi, R.V., 1999. Posterior consistency of Dirichlet mixtures in density estimation. Ann. Statist. 27, 143–158.
Ghosal, S., Ghosh, J.K., van der Vaart, A.W., 2000. Convergence rates of posterior distributions. Ann. Statist. 28, 500–531.
Ghosal, S., Lember, J., van der Vaart, A., 2003. On Bayesian adaptation. In: Proceedings of the Eighth Vilnius Conference on Probability Theory and Mathematical Statistics, Part II (2002), vol. 79. pp. 165–175.
Ghosal, S., Lember, J., van der Vaart, A., 2008. Nonparametric Bayesian model selection and averaging. Electron. J. Stat. 2, 63–89.
Ghosh, J., Ramamoorthi, R., 2003. Bayesian Nonparametrics, first ed. Springer.
Gine, E., Nickl, R., 2011. Rates of contraction for posterior distributions in Lr-metrics, 1 ≤ r ≤ ∞. Ann. Statist. 39, 2883–2911.
Goldberg, P.W., Williams, C.K.I., Bishop, C.M., 1998. Regression with input-dependent noise: a Gaussian process treatment. In: Advances in Neural Information Processing Systems, vol. 10. MIT Press, pp. 493–499.
Gourieroux, C., Monfort, A., Trognon, A., 1984. Pseudo maximum likelihood methods: theory. Econometrica 52, 681–700.
Griffin, J.E., Steel, M.F.J., 2006. Order-based dependent Dirichlet processes. J. Amer. Statist. Assoc. 101, 179–194.
Heitzinger, C., 2002. Simulation and Inverse Modeling of Semiconductor Manufacturing Processes. Technische Universität Wien. http://www.iue.tuwien.ac.at/phd/heitzinger/node130.html.
Huang, T.-M., 2004. Convergence rates for posterior distributions and adaptive estimation. Ann. Statist. 32, 1556–1593.
Huber, P., 1967. The behavior of maximum likelihood estimates under nonstandard conditions. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1. University of California Press, Berkeley, pp. 221–233.
Jaynes, E.T., 1957. Information theory and statistical mechanics. Phys. Rev. 106, 620–630.
Kato, K., 2013. Quasi-Bayesian analysis of nonparametric instrumental variables models. Ann. Statist. 41, 2359–2390.
Kemperman, J.H.B., 1969. On the optimum rate of transmitting information. Ann. Math. Statist. 40, 2156–2177.
Kleijn, B., Knapik, B., 2012. Semiparametric posterior limits under local asymptotic exponentiality. ArXiv:1210.6204.
Kleijn, B., van der Vaart, A., 2006. Misspecification in infinite-dimensional Bayesian statistics. Ann. Statist. 34, 837–877.
Kleijn, B., van der Vaart, A., 2012. The Bernstein–von Mises theorem under misspecification. Electron. J. Stat. 6, 354–381.
Kruijer, W., Rousseau, J., van der Vaart, A., 2010. Adaptive Bayesian density estimation with location-scale mixtures. Electron. J. Stat. 4, 1225–1257.
Kruijer, W., van der Vaart, A., 2008. Posterior convergence rates for Dirichlet mixtures of beta densities. J. Statist. Plann. Inference 138, 1981–1992.
Lancaster, T., 2003. A note on bootstraps and robustness.
Liang, S., Carlin, B.P., Gelfand, A.E., 2009. Analysis of Minnesota colon and rectum cancer point patterns with spatial and nonspatial covariate information. Ann. Appl. Statist. 3, 943–962.
Lorentz, G.G., 1986. Bernstein Polynomials. Chelsea Pub. Co., New York, NY.
MacEachern, S.N., 1999. Dependent nonparametric processes. In: ASA Proceedings of the Section on Bayesian Statistical Science.
Majer, P., 2012. Multivariate Bernstein polynomials for approximation of derivatives. MathOverflow. http://mathoverflow.net/questions/111257 (version: 2012-11-03).
Müller, U.K., 2013. Risk of Bayesian inference in misspecified models, and the sandwich covariance matrix. Econometrica 81, 1805–1849.
Norets, A., 2010. Approximation of conditional densities by smooth mixtures of regressions. Ann. Statist. 38, 1733–1766.
Norets, A., Pelenis, J., 2014. Posterior consistency in conditional density estimation by covariate dependent mixtures. Econometric Theory 30, 606–646.
Panov, M., Spokoiny, V., 2013. Finite sample Bernstein–von Mises theorem for semiparametric problems. ArXiv:1310.7796.
Pati, D., Dunson, D.B., Tokdar, S.T., 2013. Posterior consistency in conditional distribution estimation. J. Multivariate Anal. 116, 456–472.
Pelenis, J., 2014. Bayesian regression with heteroscedastic error density and parametric mean function. J. Econometrics 178 (3), 624–638.
Peng, F., Jacobs, R.A., Tanner, M.A., 1996. Bayesian inference in mixtures-of-experts and hierarchical mixtures-of-experts models with an application to speech recognition. J. Amer. Statist. Assoc. 91, 953–960.
Petrone, S., 1999. Bayesian density estimation using Bernstein polynomials. Canad. J. Statist. 27, 105–126.
Petrone, S., Wasserman, L., 2002. Consistency of Bernstein polynomial posteriors. J. R. Stat. Soc. Ser. B Stat. Methodol. 64, 79–100.
Poirier, D., 2011. Bayesian interpretations of heteroskedastic consistent covariance estimators using the informed Bayesian bootstrap. Econometric Rev. 30, 457–468.
Pollard, D., 1984. Convergence of Stochastic Processes. Springer Series in Statistics. Springer Verlag GMBH.
Rivoirard, V., Rousseau, J., 2012. Bernstein–von Mises theorem for linear functionals of the density. Ann. Statist. 40, 1489–1523.
Robinson, P.M., 1987. Asymptotically efficient estimation in the presence of heteroskedasticity of unknown form. Econometrica 55, 875–891.
Rousseau, J., 2010. Rates of convergence for the posterior distributions of mixtures of betas and adaptive nonparametric estimation of the density. Ann. Statist. 38, 146–180.
Rubin, D., 1981. The Bayesian bootstrap. Ann. Statist. 9, 130–134.
Ruiz, S.M., 1996. An algebraic identity leading to Wilson's theorem. Math. Gaz. 80, 579–582.
Schwartz, L., 1965. On Bayes procedures. Z. Wahrscheinlichkeitstheor. Verwandte Geb. 4, 10–26.
Scricciolo, C., 2006. Convergence rates for Bayesian density estimation of infinite-dimensional exponential families. Ann. Statist. 34, 2897–2920.
Shen, X., 2002. Asymptotic normality of semiparametric and nonparametric posterior distributions. J. Amer. Statist. Assoc. 97, 222–235.
Shen, X., Wasserman, L., 2001. Rates of convergence of posterior distributions. Ann. Statist. 29, 687–714.
Shorack, G., 2000. Probability for Statisticians. Springer, New York.
Tokdar, S., 2007. Towards a faster implementation of density estimation with logistic Gaussian process priors. J. Comput. Graph. Statist. 16, 633–655.
Tokdar, S.T., Ghosh, J.K., 2007. Posterior consistency of logistic Gaussian process priors in density estimation. J. Statist. Plann. Inference 137, 34–42.
Tokdar, S., Zhu, Y., Ghosh, J., 2010. Bayesian density regression with logistic Gaussian process and subspace projection. Bayesian Anal. 5, 319–344.
van der Vaart, A., 1998. Asymptotic Statistics. Cambridge University Press.
van der Vaart, A.W., van Zanten, J.H., 2008. Rates of contraction of posterior distributions based on Gaussian process priors. Ann. Statist. 36, 1435–1463.
van der Vaart, A.W., van Zanten, J.H., 2009. Adaptive Bayesian estimation using a Gaussian random field with inverse gamma bandwidth. Ann. Statist. 37, 2655–2675.
van der Vaart, A., Wellner, J., 1996. Weak Convergence and Empirical Processes: with Applications to Statistics. Springer Series in Statistics. Springer.
Villani, M., Kohn, R., Giordani, P., 2009. Regression density estimation using smooth adaptive Gaussian mixtures. J. Econometrics 153, 155–173.
Walker, S.G., 2004. New approaches to Bayesian consistency. Ann. Statist. 32, 2028–2043.
White, H., 1982. Maximum likelihood estimation of misspecified models. Econometrica 50, 1–25.
Wood, S., Jiang, W., Tanner, M., 2002. Bayesian mixture of splines for spatially adaptive nonparametric regression. Biometrika 89, 513–528.
Yau, P., Kohn, R., 2003. Estimation and variable selection in nonparametric heteroscedastic regression. Stat. Comput. 13, 191–208.
Zi-Zong, Y., 2009. Schur complements and determinant inequalities. J. Math. Inequal. 3, 161–167.