arXiv:1503.04493v1 [math.ST] 16 Mar 2015 a nsbaia tPictnwiepr fti okwsfina was work this of part hospitality. while its and Princeton for DMS-1418042 department at DMS-1151692, sabbatical Award on CAREER was NSF by Sponsored BM hoe seSe 20) ikladKej 21) Ca (2012); of Kleijn sem and distribution the Bickel posterior by (2001); marginal supported Shen the be (see distribu to theorem known posterior (BvM) is marginal procedures the Bayesian from these sampling MCMC through Θ space [email protected]. nnt-iesoa usnefunction nuisance infinite-dimensional rprinlhzrsmodel, hazards proportional eiaaercmdli nee yaEcienprmtro parameter Euclidean a by indexed is model semiparametric A ai,while ratio, h nes fteecetFse information: estim Fisher efficient efficient semiparametric the of a to inverse at proven the centered is limit it precisely, Gaussian More a efficiency. semiparametric of ∗ ‡ † eiaaercBrsenvnMssTerm eodOrder Second Theorem: Mises Bernstein-von Semiparametric eateto ttsia cec,Dk nvriy o 90 Box University, Duke Science, Statistical of Department eateto ttsia cec,Dk nvriy E-mai University, Duke Science, Statistical of Department eateto ttsis udeUiest,Ws Lafayet West University, Purdue Statistics, of Department he motn lse fsmprmti oesaeeaie,an examined, are models provided. str semiparametric also some of unless desira classes case a adaptive important this produce Three in to component fail parametric even the bu orde may for second efficient priors the independent only to used not up commonly component) are nonparametric component u of priors parametric smoothness dependent the p of the class for new on procedures a accuracy propose inferential we contribution, frequentist second higher Bay to accurate leads pheno more interference function accuracy: Mis interesting inferential Bernstein-von frequentist an and semiparametric discover mation to order is second contribution sem a in first particular, component parametric in the els, of distribution posterior marginal H × e words Key frequen order second the study to is paper this of goal major The η ecnmk aeinifrne o h aaee fintere of parameter the for inferences Bayesian make can we , sup sacmltv aadfnto.B nrdcn on prio joint a introducing By function. hazard cumulative a is A
enti-o ie hoe;scn re smttc;semip asymptotics; order second theorem; Mises Bernstein-von : Π( θ u Yang Yun ∈ A | X 1 θ X , . . . , sargeso oait etrcrepnigt h o ha log the to corresponding vector covariate regression a is un Cheng Guang , ∗ n ) η θ − eogn oaBnc space Banach a to belonging .Introduction 1. saypoial omladstse rqets criteri frequentist satisfies and normal asymptotically is ac 7 2015 17, March N Studies p Abstract θ 0 1 + † n − :[email protected] l: n ai .Dunson B. David and e N497 mi:ceg@udeeu Research [email protected]. Email: 47907, IN te, 1 / ie;h ol iet hn h rneo ORFE Princeton the thank to like would he lized; 2 ∆ e 5,N 70,Dra,UA mi:dun- Email: USA, Durham, 27708, NC 251, n , ( ovre(nttlvrainnr)to norm) variation total (in converge n t,wt oainemti qa to equal matrix covariance with ate, iosFudto 026 un Cheng Guang 305266. Foundation Simons I e sa siaino h nuisance the on estimation esian θ rqets aiiy However, validity. frequentist r tlo(02),wihsae that states which (2012b)), stillo 0 in h rqets aiiyof validity frequentist The tion. eo ewe aeinesti- Bayesian between menon drwihBysa inference Bayesian which nder rmti opnn.A the As component. arametric netasmto simposed. is assumption ingent prmti enti-o Mises Bernstein-von iparametric ,η interest f 0 loaatv wrt the (w.r.t. adaptive also t prmti aeinmod- Bayesian iparametric xesv iuain are simulations extensive d ) l otncnrcinrate contraction root-n ble − s(v)Term Our Theorem. (BvM) es H 1 o xml,i h Cox the in example, For . ( itpoete fthe of properties tist A )
‡ θ rmti model. arametric P t .. rdbeset, credible e.g., st, −→ nteproduct the on Π r θ ∈ 0 ,η 0 Θ 0 , ⊂ R p n an and (1.1) zard a where A is any measurable subset of Θ and
n P 1 1 θ0,η0 1 ∆n := I− ℓθ ,η (Xi) Np(0, I− ) (1.2) √n θ0,η0 0 0 θ0,η0 Xi=1 e e e e reflects a random fluctuation on the center of the posterior distribution. Here, Pθ0,η0 denotes the true underlying distribution that generates the data (X1,...,Xn), where θ0 and η0 are true pa- P P rameter values; “ ” and “ ” denote the weak convergence and convergence in probability P , → respectively. In the expression displayed above, Iθ,η (ℓθ,η) represents the efficient Fisher informa- tion matrix (efficient score function) evaluated at (θ, η). Please see Section 2.1 for a brief review on semiparametric efficiency theory. We call (1.2)e ase the first order version of semiparametric BvM theorem. The recent studies of BvM theorem in the nonparametric context can be found in Shang and Cheng (2014); Castillo and Nickl (2014). The major goal of this paper is to conduct second order studies of semiparametric BvM theorem by characterizing the decay rate of the remainder term in (1.1), which we name as the second order Bayesian efficiency. This efficiency consideration is crucial for us to understand the influence of the nonparametric prior on the semiparametric Bayesian inferential accuracy, and further provide guidance in choosing an appropriate nonparametric prior (or more generally, a joint prior Π). We remark that our second order result is radically different from those in Cheng and Kosorok (2008a,b, 2009) where the nuisance function is profiled out, and thus no nonparametric prior needs to be assigned. Therefore, as far as we are aware, our work is the first study on the second order semiparametric BvM theorem in a fully Bayesian framework. Our main conclusion is that the second order efficiency in (1.1) is of the order O (√nρ2 ) Pθ0,η0 n (upto a logarithmic term), where ρn refers to the posterior contraction rate of the nuisance function throughout the paper. For example, in the partially linear models, the posterior contraction rate α/(2α+d) γ achieves ρn = n− (log n) for some γ > 0, which is known to be (almost) minimax optimal, when an appropriate Gaussian process (GP) prior is assigned to the d-dimensional nonparametric function of α-smoothness. Our general result implies an interesting interference phenomenon be- tween Bayesian estimation and frequentist inferential accuracy: more accurate Bayesian estimation of the nuisance function leads to higher frequentist inferential accuracy on the parametric part. For example, we show that the credible set for θ possesses a second order frequentist validity that is determined by ρn. Please see Section 3.2 for more discussions. Therefore, it is desirable to construct a nonparametric prior under which an optimal contraction rate can be achieved. For example, it is desirable to match the smoothness of the reproducing kernel Hilbert space (RKHS) induced by the assigned GP with that of true regression function. None of the aforementioned interesting conclusions can be inferred from the first order semiparametric BvM theorem. We also remark that our second order Bayesian efficiency is consistent with that derived in Cheng and Kosorok (2008a,b, 2009) (up to a logarithmic term) where no nonparametric prior is assigned. Note that Cheng and Kosorok (2008a,b, 2009) is not a fully Bayesian framework, and did not cover the adap- tive case considered in this paper. In the end, we point out that the above second order results are derived only under two intuitively appealing conditions: one is on the posterior concentration; an- other is on the integrated local asymptotic normality (Bickel and Kleijn, 2012). Interestingly, these two conditions (together with a set of sufficient conditions in Section 3.3) are not stronger than those imposed in the literature for the first order result, e.g., Bickel and Kleijn (2012); Castillo (2012b). On the contrary, we even relax a stringent root-n convergence condition in Bickel and Kleijn (2012) to a set of commonly used conditions; see Lemma 3.4. We further apply our general theory to two classes of priors varying by whether θ and η are dependent or not. Surprisingly, we find that the commonly used independent prior is not the best
2 choice for the second order semiparametric BvM theorem in the sense it requires a slightly strong condition (A6), and might even break down for the first order consistency when the smooth- ness of the nonparametric function is unknown; see Section A.1 for more explanations. This failure is mainly due to the existence of a semiparametric bias term defined in (2.1); also see Rivoirard and Rousseau (2012). Interestingly, we show that the semiparametric bias can be easily eliminated through shifting the center of a nonparametric prior (by a θ-dependent quantity), which naturally leads to a general class of nonparametric priors. This re-centering idea is rather different from, and perhaps easier to implement than, the prior under-smoothing procedure proposed in Castillo (2012a). Moreover, our dependent priors can be easily made adaptive with respect to the unknown smoothness of the nuisance function by re-centering a nonparametric adaptive prior. The rest of the paper is organized as follows. Section 2 provides necessary background on the semiparametric efficiency theory, and describes several semiparametric models including partially linear models and Cox proportional hazards models. Our main theorem, together with the related Bayesian inference results, is presented in Section 3. The classes of independent and dependent priors are extensively discussed in Section 4. Section 5 illustrates the applicability of our general theory in three examples. All the technical proofs are postponed to the Appendix.
2. Preliminaries
In this section, we briefly review semiparametric efficiency theory (Bickel et al., 1998), and describe several semiparametric models considered in the paper.
2.1 Review on Semiparametric Efficiency
An estimator θn is semiparametric efficient if it achieves the minimal asymptotic variance V ∗ P over all regular semiparametric estimators θ that satisfy √n(θ θ ) N (0,V ) for some non- n n − 0 p degenerate asymptoticb variance V . It can be shown that the minimal V ∗ exists and corresponds to the largest asymptotic variance over all the parametrice submodelse P : θ Θ with η(θ )= η { θ,η(θ) ∈ } 0 0 (Bickel et al., 1998). The submodel achieving V ∗ is called the least favorable submodel, and denoted as P ∗ : θ Θ , where η (θ) is the so-called least favorable curve. We define the semiparametric { θ,η (θ) ∈ } ∗ bias as
∆η(θ)= η∗(θ) η∗(θ )= η∗(θ) η , (2.1) − 0 − 0 which will be frequently mentioned hereafter. Let ℓθ0,η0 be the score function of the least favorable 1 T submodel P ∗ : θ Θ at θ = θ . Hence, we have V = I− , where I = E ℓ ℓ . θ,η (θ) 0 ∗ θ0,η0 θ0,η0 0 θ0,η0 θ0,η0 { ∈ } e Note that ℓθ0,η0 and Iθ0,η0 are also known as the efficient score function and efficient information e e e e matrix in the semiparametric literature. For simplicity, denote Pθ0,η0 , ℓθ0,η0 and Iθ0,η0 as P0, ℓ0 and e e I0 from now on. Severini and Wong (1992) discovered that η∗(θ) is essentially evaluatede as thee unique minimizere ofe Kullback-Leibler (KL) divergence in with the parametric part θ being fixed, i.e. H
η∗(θ) = arg inf K(P0, Pθ,η), (2.2) η ∈H where K(P,Q) = log(dP/dQ)dP denotes the KL divergence between two measures P and Q. In the Bayesian regime, the least favorable curve η (θ) can be understood as the function to- R ∗ wards which the conditional posterior distribution of the nuisance parameter η given θ contracts
3 (Kleijn and van der Vaart, 2006). Therefore, the posterior distribution of (θ, η) tends to concen- trate around the true value (θ0, η0) under a well chosen prior Π. We use the following examples to illustrate the above concepts, in particular η∗(θ).
2.2 Generalized Partially Linear Model (GPLM)
Suppose that the data Xi = (Ui,Vi,Yi), i = 1,...,n, are i.i.d. copies of X = (U, V, Y ), where Y R is the response variable and T = (U, V ) [0, 1]p [0, 1]d is the covariate variable. Consider ∈ ∈ × a general class of semiparametric regression models with the following partially linear structure: m (t) E (Y T = t)= F (g (t)), g (t)= θT u + η (v), t = (u, v), 0 ≡ 0 | 0 0 0 0 where F : R R is some known link function and η is some unknown smooth function. Here the → 0 notation E0 means the expectation under the true data generating probability measure P0 = Pθ0,η0 . The above class of semiparametric models is called generalized partially linear models (GPLM) (Boente et al., 2006). Interestingly, we can derive an explicit expression of the least favorable curve (see Lemma A.1) when the log-likelihood is written in the following form1: log p(y; m) = m(y s)/ (s)ds, where (m (T )) = V ar(Y T ). We next apply Lemma A.1 to three concrete y − V V 0 | models. R Example 2.1 (Partially linear models). Consider a partially linear regression model T Y = U θ0 + η0(V )+ w, (2.3) where w N(0, 1) is assumed to be independent of (U, V ) and η belongs to a H¨older function class ∼ 0 Cα([0, 1]d) with smoothness index α. In this case, F (t) = t and (s) = 1. Based on Lemma A.1, V we obtain the least favorable curve as T η∗(θ)(v)= η (v) (θ θ ) E[U V = v]. (2.4) 0 − − 0 | For identifiability, we assume that E(U E[U V ]) 2 is invertible. − | ⊗ Example 2.2 (Partially linear exponential models). In the partially linear exponential model, the conditional density of Y given (U, V ) is p (y u, v)= λ (u, v) exp( λ (u, v)y), y> 0, (2.5) 0 | 0 − 0 with λ (u, v) = exp (uT θ +η (v)) . In this case, F (t)= et and (s)= s2. Therefore, by Lemma 0 {− 0 0 } V A.1, we have T 2 η∗(θ)(v)= η (v) (θ θ ) E[U V = v]+ O( θ θ ). (2.6) 0 − − 0 | | − 0| Example 2.3 (Partially linear logistic models). In the partially linear logistic model, we observe binary Y 0, 1 and model the data as i ∈ { } P (Y = 1 U, V ) log 0 | = U T θ + η (V ). (2.7) P (Y = 0 U, V ) 0 0 0 | In this case, F (t)= et/(1 + et) and (s)= s(1 s). Again, Lemma A.1 implies the least favorable V − curve as
T E[Uf0(U, V ) V = v] 2 η∗(θ)(v)= η (v) (θ θ ) | + O( θ θ ), (2.8) 0 − − 0 E[f (U, V ) V = v] | − 0| 0 | T T 2 where f0(u, v) = exp(u θ0 + η0(v))/(1 + exp(u θ0 + η0(v))) . 1This form is also called as quasi-likelihood in Wedderburn (1974)
4 2.3 Cox Proportional Hazards Model Let Z Rp be a vector of covariates, T the survival time that follows a Cox model, and C a ∈ random observation time. The Cox model assumes that the conditional hazard function given Z satisfies P (T [t,t + dt] T t, Z) = exp(θT Z)λ (t)dt, where λ is an unknown baseline hazard ∈ | ≥ 0 0 0 function. Assume that T is independent of C given Z and that there exists a real τ > 0 such that P (T >τ) > 0 and P (C τ) = P (T = τ) > 0. In the Cox model with current status data, the 0 0 ≥ 0 observed data are n i.i.d. realizations of X = (C, δ, Z), where δ = I(T C). The density of X ≤ relative to the product of the marginal joint density of (C,Z) and counting measure on 0, 1 is { } given by
δ 1 δ T T − p (x)= 1 exp eθ zΛ(c) exp eθ zΛ(c) , θ,Λ − − − c where Λ(c) = 0 λ(t)dt is considered as the nuisance parameter. By the derivations in Section 25.11.1 of van der Vaart (1998), the least favorable curve is given by R E[ZQ2 (X) C = c] T θ0,Λ0 2 Λ∗(θ)(c) = Λ (c) (θ θ ) Λ (c) | + O( θ θ ), (2.9) 0 − − 0 0 E[Q2 (X) C = c] | − 0| θ0,Λ0 | for the function Qθ,Λ given by
θT z T exp( e Λ(c)) Qθ,Λ(x) = exp(θ z) δ − (1 δ) . 1 exp( eθT zΛ(c)) − − − − 3. Main Results
3.1 Second Order Semiparametric BvM Theorem For a general class of semiparametric models = P : θ Θ, η , we consider a joint P { θ,η ∈ ∈ H} prior distribution Π over the product space Θ for the parameter pair (θ, η). In the sequel, we θ ×H use notation Π (η) and ΠΘ(θ) to denote the conditional prior distribution of η given θ and the marginal priorH distribution of θ, respectively. Our main theorem is based on two primary assumptions, which we will revisit in Section 3.3. The first one is a convergence condition for (θ, η). It allows us to focus on the posterior mass in a suitable neighborhood of (θ0, η0): (θ, η) : θ θ0 ǫn, η n , where denotes the Euclidean norm. { | − | ≤ ∈ H } | · | P0 Here, n is a sequence of subsets of the nuisance space that satisfies Π(η n X1,...,Xn) 1. H H η ∈Hη | −→ For example, can be defined as η : η η Mρ n, where n is a sieve sequence for Hn { k − 0kn ≤ n} ∩ F F the nuisance parameter defined after Lemma 3.4 and f = n 1 n f 2(X ) 1/2 is an empirical k kn − i=1 i L -norm. Recall that ρ denotes the contraction rate of marginal posterior distribution of η. 2 n P Assumption 1 (Localization condition). There exists a sequence ǫ 0 satisfying nǫ2 and n → n →∞ a sequence of subsets , such that as n , {Hn}⊂H →∞ Π θ θ ǫ , η X ,...,X = 1 O (δ ) | − 0|≤ n ∈Hn 1 n − P0 n for some δ 0. n → In a general setup, Lemma 3.4 in Section 3.3 provides a set of sufficient conditions for Assump- tion 1 with ǫn = ρn. Throughout the paper, we always choose ǫn to be ρn.
5 The second assumption extends the concept of local asymptotic normality (LAN) (required for the parametric BvM theorem in LeCam (1953)) to the semiparametric context. Denote the log-likelihood by ln(θ, η). By Fubini’s theorem, the marginal posterior for θ can be written as
θ Π(θ A X1,...,Xn)= exp ln(θ, η) ln(θ0, η0) dΠH (η) dΠΘ(θ) ∈ | A − Z ZH (3.1) θ exp ln(θ, η) ln(θ0, η0) dΠ (η) dΠΘ(θ). − H ZΘ Z H Therefore, the integrated likelihood ratio Sn(θ) defined by the map
θ Sn(θ)= exp ln(θ, η) ln(θ0, η0) dΠ (η), (3.2) − H ZH plays a similar role as the likelihood ratio in the parametric model. To prove the first order semiparametric BvM theorem, Bickel and Kleijn (2012) assume that for every random sequence h of order O (1), { n} P0 1/2 Sn(θ0 + n− hn) T 1 T log = hn gn hn I0hn + oP0 (1), (3.3) ( Sn(θ0) ) − 2
n P0 e e where gn = (1/√n) i=1 ℓ0(Xi) Np(0, I0). Accompanied with (3.3), Bickel and Kleijn (2012) further require the marginal posterior of θ to P convergee at root-n rate. Ine many cases, ite may require significant effort to verify this parametric- rate condition. To avoid such a stringent assumption as well as keep track of the higher-order remainder, we introduce the notion of the localized integral likelihood ratio as follows:
θ Sn(θ)= exp ln(θ, η) ln(θ0, η0) dΠ (η). (3.4) n − H ZH η The information in the localizatione sequence , e.g., η η Mρ and η n, will be Hn k − 0kn ≤ n ∈ F utilized in the application of the maximal inequality (van der Vaart and Wellner, 1996, Corollary 2.2.5) to provide a uniform bound. More importantly, when these conditions are combined with Assumption 1, we no longer need to assume the root-n marginal convergence rate for θ.
Assumption 2 (Second order integrated LAN). There exists a nondecreasing function Rn( ) : 2 · R R −1/2 satisfying supt [n ,Mǫn] Rn(t)/nt 0 for each M > 0 such that for every sequence θn → ∈ → satisfying θn = θ0 + oP0 (1),
Sn(θn) T n T log =√n(θn θ0) gn (θn θ0) I0(θn θ0)+ OP0 (Rn θn θ0 ). (3.5) Sn(θ0) − − 2 − − | − | e 1/2 Note that (3.5) can be writtene in the form of (3.3)e by re-parameterizing θn as θ0 + n− hn. In e 2 2 the sequel, we name (3.5) as ILAN. A typical Rn(t) is dominated by √nt +√nρn; see the examples in Section 5 and their proofs. Now, we are ready to present the main theorem in this paper. Theorem 3.1. We assume the prior for θ has a Lebesgue density that is continuous and strictly positive at θ0 and the efficient information matrix I0 is invertible. Suppose X1,...,Xn are i.i.d. observations sampled from P0. Under Assumptions 1 and 2, we have e1/2 1 sup Π(θ A X1,...,Xn) Np θ0 + n− ∆n, (nI0)− (A) = OP0 (Sn), (3.6) A ∈ | − 1/2 where A ranges over all measurable subsets of Θ and Sn e= Rn(en− log n)+ δn.
6 We remark that Theorem 3.1 can be easily adapted to a non-asymptotic version (by invoking (A.14) in Lemma A.2) if Assumptions 1 and 2 are stated in a non-asymptotic manner. The log n term in Sn is not essential and does not affect the polynomial order of Sn, which is our main interest. We need it in the proof such that the posterior probability of the event θ : Mn 1/2 log n θ θ ǫ decays at a faster rate than S for a sufficiently large M. { − ≤ | − 0|≤ n} n We comment that Assumption 2 is implied by the following conditions: (A1) on the semi- parametric model; and (A2) on the prior. Specifically, Lemma 3.5 in Section 3.3 shows that R ( ) = G ( )+ G ( ), where G and G are given in (A1) and (A2), respectively. Note that n · n · n · n n Hn in Conditions (A1) and (A2) is the same as that in Assumption 1. (A1) (Stochastice LAN) There existse an increasing function G : R [0, ), such that for every n → ∞ sequence θ satisfying θ = θ + o (1), { n} n 0 P0 n sup l θ , η +∆η(θ ) l (θ , η) (θ θ )T ℓ (X ) n n n − n 0 − n − 0 0 i η n i=1 ∈H X (3.7) 1 e + n(θ θ )T I (θ θ ) = O G θ θ . 2 n − 0 0 n − 0 P0 n | n − 0|