<<

arXiv:1712.05731v3 [math.ST] 26 Apr 2019

A Theoretical Framework for Bayesian Nonparametric Regression

Fangzheng Xie, Wei Jin, and Yanxun Xu*

Department of Applied Mathematics and Statistics, Johns Hopkins University

*Correspondence should be addressed to e-mail: [email protected]

Abstract: We develop a unifying framework for Bayesian nonparametric regression to study the rates of contraction with respect to the integrated L2-distance without assuming the regression function space to be uniformly bounded. The framework is very flexible and can be applied to a wide class of nonparametric prior models; in particular, it allows the regression functions to be unbounded and includes the renowned Gaussian process priors as special examples. Three non-trivial applications of the proposed framework are provided: the finite random series regression of an α-Hölder function, with adaptive rates of contraction up to a logarithmic factor; the block prior regression of an α-Sobolev function, with adaptive-and-exact rates of contraction; and the Gaussian spline regression of an α-Hölder function, with the near optimal rate of posterior contraction. These applications serve as generalization or complement of their respective results in the literature. Extensions to the fixed-design regression problem and sparse additive models in high dimensions are discussed as well.

Keywords and phrases: Bayesian nonparametric regression, integrated L2-distance, orthonormal random series, rate of contraction.

1. Introduction

Consider the standard nonparametric regression problem y_i = f(x_i) + e_i, i = 1, ..., n, where the predictors (x_i)_{i=1}^n ⊂ [0,1]^p are referred to as design points, the y_i's are responses, and the e_i's are independent and identically distributed (i.i.d.) mean-zero Gaussian noises with var(e_i) = σ^2. We follow the popular Bayesian approach by assigning f a prior distribution Π, and perform inference tasks by finding the posterior distribution of f given the observations (x_i, y_i)_{i=1}^n. We propose a theoretical framework to study the rates of contraction with respect to the integrated L2-distance ‖f − g‖_2 = {∫_{[0,1]^p} [f(x) − g(x)]^2 dx}^{1/2}, without assuming the regression function space to be uniformly bounded.

Rates of contraction of posterior distributions for Bayesian nonparametric priors have been studied extensively. Following the earliest framework on generic rates of contraction theorems with i.i.d. observations proposed by [12], specific examples for density estimation via Dirichlet process mixture models [3, 13, 16, 37] and location-scale mixture models [22, 46] are discussed. For nonparametric regression, the rates of contraction had not been discussed until [14], who develop a generic framework for fixed-design nonparametric regression to study rates of contraction with respect to the empirical L2-distance.
There are extensive studies of various priors that fall into this framework, including location-scale mixture priors [9], conditional Gaussian tensor-product splines [10], and Gaussian processes [42, 44], among which adaptive rates are obtained in [9, 10, 44]. Although it is interesting to achieve adaptive rates of contraction with respect to the empirical L2-distance for nonparametric regression, this might be restrictive, since the empirical L2-distance quantifies the convergence of functions only at the given design points. In nonparametric regression, one also expects that the error between the estimated function and the true function is globally small over the whole design space [47], i.e., that the mean-squared error for out-of-sample prediction is small. Therefore the integrated L2-distance is a natural choice. For Gaussian processes, [40, 48] provide contraction rates for nonparametric regression with respect to the integrated L2- and L∞-distances, respectively. A novel spike-and-slab series prior is constructed in [51] to achieve adaptive contraction with respect to the stronger L∞-distance. These examples, however, take advantage of their respective prior structures and may not be easily generalized. A closely related reference is [25], which discusses the rates of contraction of the rescaled-Gaussian process prior for nonparametric random-design regression with respect to the integrated L1-distance, which is weaker than the integrated L2-distance. Although a generic framework for the integrated L2-distance is presented in [20], the prior there is imposed on a uniformly bounded function space and hence rules out some popular choices, e.g., the Gaussian process prior [28].
It is therefore natural to ask the following fundamental question: for Bayesian nonparametric regression, can one build a unifying framework to study rates of contraction for various priors with respect to the integrated L2-distance without assuming the uniform boundedness of the regression function space? In this paper we provide a positive answer to this question. The major contribution of this work is that we prove the existence of an ad-hoc test function that is required in the generic rates of contraction framework in [12] by leveraging Bernstein's inequality and imposing a certain structural assumption on the sieves with large prior probabilities. This is made clear in Section 2. Furthermore, we do not require the prior to be supported on a uniformly bounded function space. Consequently, we are able to establish a general rate of contraction theorem with respect to the integrated L2-distance for Bayesian nonparametric regression. Examples of applications falling into this framework include the finite random series prior [29, 35], the (un-modified) block prior [11], and the Gaussian splines prior [10]. In particular, for the block prior regression, rather than modifying the block prior by conditioning on a truncated function space as in [11] with a known upper bound for the unknown true regression function, we prove that the un-modified block prior automatically yields rate-exact Bayesian adaptation for nonparametric regression without such a truncation. We further extend the proposed framework to the fixed-design regression and sparse additive models in high dimensions. The analyses of the above applications and extensions under the proposed framework also generalize their respective results in the literature. These improvements and generalizations are made clear in Sections 3 and 4. The layout of this paper is as follows.
In Section 2 we introduce the random series framework for Bayesian nonparametric regression and present the main result concerning rates of contraction. As applications of the main result, we derive the rates of contraction of various renowned priors for nonparametric regression in the literature with substantial improvements in Section 3. Section 4 elaborates on extensions of the proposed framework to the fixed-design regression problem and sparse additive models in high dimensions. The technical proofs of the main results are deferred to Section 5.

Notations

For 1 ≤ r ≤ ∞, we use ‖·‖_r to denote both the ℓ_r-norm on any finite dimensional Euclidean space and the integrated L_r-norm of a measurable function (with respect to the Lebesgue measure). In particular, for any function f ∈ L2([0,1]^p), we use ‖f‖_2 to denote the integrated L2-norm, defined by ‖f‖_2^2 = ∫_{[0,1]^p} f^2(x) dx. We follow the convention that when r = 2 the subscript 2 is omitted, i.e., ‖·‖_2 = ‖·‖. The Hilbert space ℓ^2 denotes the space of sequences that are squared-summable. We use ⌊x⌋ to denote the maximal integer no greater than x, and ⌈x⌉ to denote the minimum integer no less than x. The notations a ≲ b and a ≳ b denote inequalities up to a positive multiplicative constant, and we write a ≍ b if a ≲ b and a ≳ b. Throughout, capital letters C, C_1, C̃, C′, D, D_1, D̃, D′, ... are used to denote generic positive constants whose values might change from line to line unless particularly specified, but are universal and unimportant for the analysis. We refer to P as a statistical model if it consists of a class of densities on a sample space X with respect to some underlying σ-finite measure. Given a (frequentist) statistical model P and i.i.d. data (w_i)_{i=1}^n from some P ∈ P, the prior and the posterior distribution on P are always denoted by Π(·) and Π(· | w_1, ..., w_n), respectively. Given a function f: X → R, we use P_n f = n^{-1} Σ_{i=1}^n f(x_i) to denote the empirical measure of f, and G_n f = n^{-1/2} Σ_{i=1}^n [f(x_i) − E f(x_i)] to denote the empirical process of f, given the i.i.d. data (x_i)_{i=1}^n. With a slight abuse of notation, when applying to a set of design points (x_i)_{i=1}^n, we also denote P_n f = n^{-1} Σ_{i=1}^n f(x_i) and G_n f = n^{-1/2} Σ_{i=1}^n [f(x_i) − E f(x_i)] as the empirical measure and the empirical process, even when the design points (x_i)_{i=1}^n are deterministic.
In particular, φ denotes the probability density function of the (univariate) standard normal distribution, and we use the shorthand notation φ_σ(y) = φ(y/σ)/σ to denote the density of N(0, σ^2). For a metric space (F, d) and any ǫ > 0, the ǫ-covering number of (F, d), denoted by N(ǫ, F, d), is defined to be the minimum number of ǫ-balls of the form {g ∈ F: d(f, g) < ǫ} that are needed to cover F.

2. The framework and main results

Consider the nonparametric regression model y_i = f(x_i) + e_i, where (e_i)_{i=1}^n are i.i.d. mean-zero Gaussian noises with var(e_i) = σ^2, and (x_i)_{i=1}^n are design points taking values in [0,1]^p. Unless otherwise stated, the design points (x_i)_{i=1}^n are assumed to be independently and uniformly sampled for simplicity throughout the paper. Our framework naturally adapts to the case where the design points are independently sampled from a density function that is bounded away from 0 and ∞. We assume that the responses y_i's are generated from y_i = f_0(x_i) + e_i for some unknown f_0 ∈ L2([0,1]^p); thus the data D_n = (x_i, y_i)_{i=1}^n can be regarded as i.i.d. samples from a distribution P_0 with joint density p_0(x, y) = φ_σ(y − f_0(x)). Throughout we assume that the variance σ^2 of the noises is known, but our framework can be easily extended to the case where σ is unknown by placing a prior on σ that is supported on a compact interval contained in (0, ∞) with a density bounded away from 0 and ∞ (see, for example, Section 2.2.1 in [9] and Theorem 3.3 in [42]).

Before presenting the main result, let us first introduce the basic framework for studying convergence of Bayesian nonparametric regression. In the context of the aforementioned nonparametric regression, by assigning a prior Π on the regression function f, one obtains the posterior distribution Π(f ∈ · | D_n) defined through

Π(f ∈ A | D_n) = ∫_A ∏_{i=1}^n [p_f(x_i, y_i)/p_0(x_i, y_i)] Π(df) / ∫ ∏_{i=1}^n [p_f(x_i, y_i)/p_0(x_i, y_i)] Π(df)

for any measurable function class A, where p_f(x, y) = φ_σ(y − f(x)). In order that the posterior distribution Π(· | D_n) contracts to f_0 at rate ǫ_n with respect to a distance d, i.e., Π(d(f, f_0) > Mǫ_n | D_n) → 0 in P_0-probability for some large constant M > 0, the authors of [12] proposed the following renowned sufficient conditions, referred to as the prior-concentration-and-testing framework: there exist some constants D, D′ > 0 such that for sufficiently large n:

1. The prior concentration condition holds:

Π( E_0[log(p_0/p_f)] ≤ ǫ_n^2, E_0[log(p_0/p_f)]^2 ≤ ǫ_n^2 ) ≥ e^{−Dnǫ_n^2}.   (2.1)

2. There exists a sequence (F_n)_{n=1}^∞ of subsets of L2([0,1]^p) (often referred to as the sieves) and test functions (φ_n)_{n=1}^∞ such that Π(F_n^c) ≤ e^{−(D+4)nǫ_n^2}, E_0 φ_n → 0, and

sup_{f ∈ F_n ∩ {d(f, f_0) > Mǫ_n}} E_f(1 − φ_n) ≤ e^{−D′Mnǫ_n^2}.
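For intuition, the posterior defined by the display above can be approximated by plain Monte Carlo whenever the prior can be sampled directly: draw functions from Π and weight each draw by its likelihood ∏_i p_f(x_i, y_i) (the p_0 factors cancel after normalization). A minimal sketch with a toy truncated-series prior of our own choosing, not a prior analyzed in this paper:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n = 0.5, 200

# Toy truth and uniform random design on [0, 1].
f0 = lambda x: np.sin(2 * np.pi * x)
x = rng.uniform(0, 1, n)
y = f0(x) + sigma * rng.normal(size=n)

# Monte Carlo draws from a toy prior: f(x) = sum_k beta_k sqrt(2) sin(k pi x)
# with beta_k ~ N(0, k^{-2}).  (Illustrative choice only.)
K, S = 10, 5000
betas = rng.normal(size=(S, K)) / np.arange(1, K + 1)
phi = np.sqrt(2) * np.sin(np.pi * np.outer(x, np.arange(1, K + 1)))  # n x K
f_at_x = betas @ phi.T                                               # S x n

# Log-likelihood of each draw; weights proportional to prod_i p_f(x_i, y_i).
loglik = -0.5 * np.sum((y - f_at_x) ** 2, axis=1) / sigma ** 2
w = np.exp(loglik - loglik.max())
w /= w.sum()

# Posterior probability of {||f - f0||_2 > eps}, with the integrated
# L2-distance computed on a fine grid.
grid = np.linspace(0, 1, 512)
phi_g = np.sqrt(2) * np.sin(np.pi * np.outer(grid, np.arange(1, K + 1)))
dist = np.sqrt(np.mean((betas @ phi_g.T - f0(grid)) ** 2, axis=1))
print("posterior mass of {||f - f0||_2 > 0.5}:", float(w @ (dist > 0.5)))
```
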

However, the above framework is not instructive for constructing the appropriate sieves (F_n)_{n=1}^∞ nor the desired test functions (φ_n)_{n=1}^∞ for studying the rates of contraction for nonparametric regression with respect to ‖·‖_2. Specifically, it does not provide guidance on how to construct the desired sieves, or what their structural features should be. The major contribution of this work, in contrast, is that we impose a certain structural assumption on the sieves to construct the desired test functions. By doing so, we are able to modify the entire framework so that it can be applied to a variety of nonparametric regression priors for us to derive the corresponding posterior contraction rates. The following local testing lemma is the first technical contribution of this work. It also serves as a building block to construct the desired test functions required in the prior-concentration-and-testing framework.

Lemma 2.1. For any m ∈ N_+ and δ > 0, assume the class of functions F_m(δ) satisfies

f ∈ F_m(δ) ⇒ ‖f − f_0‖_∞^2 ≤ η(m‖f − f_0‖_2^2 + δ^2)   (2.2)

for some constant η > 0. Then for any f_1 ∈ F_m(δ) with √n ‖f_1 − f_0‖_2 > 1, there exists a test function φ_n: (X × Y)^n → [0,1] such that

E_0 φ_n ≤ exp( −Cn‖f_1 − f_0‖_2^2 ),

sup_{f ∈ F_m(δ): ‖f − f_1‖_2 ≤ ξ‖f_0 − f_1‖_2} E_f(1 − φ_n) ≤ exp( −Cn‖f_1 − f_0‖_2^2 ) + 2 exp( − Cn‖f_1 − f_0‖_2^2 / (m‖f_1 − f_0‖_2^2 + δ^2) )

for some constant C > 0 and ξ ∈ (0, 1).

The key ingredient of Lemma 2.1 is the assumption (2.2) on the sieve F_m(δ). By requiring that functions in F_m(δ) cannot explode in ‖f − f_0‖_∞ when ‖f − f_0‖_2 is small, we can utilize Bernstein's inequality and obtain exponentially small type I and type II error probability bounds. Based on Lemma 2.1, we are able to establish the following global testing lemma.

Lemma 2.2. Suppose that F_m(δ) satisfies (2.2) for an m ∈ N_+ and a δ > 0. Let (ǫ_n)_{n=1}^∞ be a sequence with nǫ_n^2 → ∞. Then for any M ≥ 0, there exists a sequence of test functions (φ_n)_{n=1}^∞ such that

E_0 φ_n ≤ Σ_{j=M}^∞ N_nj exp( −Cnj^2 ǫ_n^2 ),

sup_{f ∈ F_m(δ): ‖f − f_0‖_2 > Mǫ_n} E_f(1 − φ_n) ≤ exp( −CM^2 nǫ_n^2 ) + 2 exp( − CM^2 nǫ_n^2 / (mM^2 ǫ_n^2 + δ^2) ),

where N_nj = N(ξjǫ_n, S_nj(ǫ_n), ‖·‖_2) is the covering number of

S_nj(ǫ_n) = { f ∈ F_m(δ): jǫ_n < ‖f − f_0‖_2 ≤ (j + 1)ǫ_n },

and C is some positive constant.

The prior concentration condition (2.1) is very important in the study of Bayes theory. It guarantees that the denominator appearing in the posterior

distribution, ∫ ∏_{i=1}^n [p_f(x_i, y_i)/p_0(x_i, y_i)] Π(df), can be lower bounded by e^{−D′nǫ_n^2} for some constant D′ > 0 with large probability (see, for example, Lemma 8.1 in [12]). In the context of normal regression, the Kullback-Leibler divergence is proportional to the squared integrated L2-distance between two regression functions. Motivated by this observation, we establish the following lemma that yields an exponential lower bound for the denominator

∫ ∏_{i=1}^n [p_f(x_i, y_i)/p_0(x_i, y_i)] Π(df)

in the posterior distribution under the proposed framework.

Lemma 2.3. Denote

B(m, ǫ, ω) = { f: ‖f − f_0‖_2 < ǫ, ‖f − f_0‖_∞^2 ≤ η′(m‖f − f_0‖_2^2 + ω^2) }

for any ǫ, ω > 0 and m ∈ N_+, where η′ > 0 is some constant. Suppose the sequences (ǫ_n)_{n=1}^∞ and (k_n)_{n=1}^∞ satisfy ǫ_n → 0, nǫ_n^2 → ∞, and k_nǫ_n^2 = O(1), and that ω is some constant. Then for any constant C > 0,

P_0( ∫ ∏_{i=1}^n [p_f(x_i, y_i)/p_0(x_i, y_i)] Π(df) ≤ Π(B(k_n, ǫ_n, ω)) exp[ −(C + 1/σ^2) nǫ_n^2 ] ) → 0.

In some cases it is also straightforward to consider the prior concentration with respect to the stronger ‖·‖_∞-norm. For example, for a wide class of Gaussian process priors, the prior concentration Π(‖f − f_0‖_∞ < ǫ) has been extensively studied (see, for example, [15, 42, 44] for more details).

Lemma 2.4. Suppose the sequence (ǫ_n)_{n=1}^∞ satisfies ǫ_n → 0 and nǫ_n^2 → ∞. Then for any constant C > 0,

P_0( ∫ ∏_{i=1}^n [p_f(x_i, y_i)/p_0(x_i, y_i)] Π(df) ≤ Π(‖f − f_0‖_∞ < ǫ_n) exp[ −(C + 1/σ^2) nǫ_n^2 ] ) → 0.

Now we present the main result regarding the rates of contraction for Bayesian nonparametric regression. The proof is based on a modification of the prior-concentration-and-testing procedure. We also remark that the prior Π on f is not necessarily supported on a uniformly bounded function space, which is one of the major contributions of this work.

Theorem 2.1 (Generic Contraction). Let (ǫ_n)_{n=1}^∞ and (ǭ_n)_{n=1}^∞ be sequences such that min(nǫ_n^2, nǭ_n^2) → ∞ as n → ∞, with 0 ≤ ǭ_n ≤ ǫ_n → 0. Assume that the sieves (F_{m_n}(δ))_{n=1}^∞ satisfy (2.2) with m_n ǫ_n^2 → 0 for some constant δ. In addition, assume that there exists another sequence (k_n)_{n=1}^∞ ⊂ N_+ such that k_n ǭ_n^2 = O(1). Suppose the following conditions hold for some constants ω, D > 0 and sufficiently large n and M:

Σ_{j=M}^∞ N_nj exp( −Dnj^2 ǫ_n^2 ) → 0,   (2.3)

Π( F_{m_n}(δ)^c ) ≲ exp[ −(2D + 1/σ^2) nǭ_n^2 ],   (2.4)

Π( B(k_n, ǭ_n, ω) ) ≥ exp( −Dnǭ_n^2 ),   (2.5)

where N_nj = N(ξjǫ_n, S_nj(ǫ_n), ‖·‖_2) is the covering number of

S_nj(ǫ_n) = { f ∈ F_{m_n}(δ): jǫ_n < ‖f − f_0‖_2 ≤ (j + 1)ǫ_n },

and B(k_n, ǭ_n, ω) is defined in Lemma 2.3. Then E_0[ Π( ‖f − f_0‖_2 > Mǫ_n | D_n ) ] → 0.

Remark 2.1. In light of Lemma 2.4, by exploiting the proof of Theorem 2.1 we remark that when the assumptions and conditions in Theorem 2.1 hold with (2.5) replaced by Π(‖f − f_0‖_∞ < ǭ_n) ≥ exp(−Dnǭ_n^2), the same rate of contraction also holds: E_0[ Π( ‖f − f_0‖_2 > Mǫ_n | D_n ) ] → 0 for sufficiently large M > 0.
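The role of the sup-norm control in (2.2) is to make Bernstein's inequality applicable to the variables (f − f_0)^2(x_i), so that the empirical squared distance P_n(f − f_0)^2 concentrates around ‖f − f_0‖_2^2. A quick Monte Carlo illustration of this concentration under a uniform design, with a toy bounded difference f − f_0 of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(5)

# For bounded g = (f - f0)^2, the empirical measure P_n g concentrates
# around ||f - f0||_2^2 at the 1/sqrt(n) scale under a uniform design.
f_minus_f0 = lambda x: np.sqrt(2) * np.sin(3 * np.pi * x)  # toy difference
true_sq_norm = 1.0   # ||sqrt(2) sin(3 pi x)||_2^2 = 1 on [0, 1]

devs = {}
for n in (100, 1000, 10000):
    reps = np.array([np.mean(f_minus_f0(rng.uniform(0, 1, n)) ** 2)
                     for _ in range(200)])
    devs[n] = np.mean(np.abs(reps - true_sq_norm))
    print(n, devs[n])
```

The average deviation shrinks roughly like 1/√n as n grows, matching the exponential error bounds in Lemmas 2.1 and 2.2.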

3. Applications

In this section we consider three concrete priors on f for the nonparametric regression problem y_i = f(x_i) + e_i, i = 1, ..., n. For simplicity the design points are assumed to independently follow the one-dimensional uniform distribution Unif(0, 1). For some of the examples, the results can be easily generalized to the case where the design points are multi-dimensional by considering tensor-product basis functions. The results under the proposed framework also generalize their respective counterparts in the literature.

3.1. Finite random series regression with adaptive rate

The finite random series prior [1, 29, 35, 45] is popular in the literature of Bayesian nonparametric theory. It is a class of hierarchical priors that first draw an integer-valued random variable serving as the number of "terms" to be used in a finite sum, and then sample the "term-wise" parameters given the number of "terms". The finite random series prior typically does not depend on the smoothness level of the true function, and often yields minimax-optimal rates of contraction (up to a logarithmic factor) in many nonparametric problems (e.g., density estimation [29, 35] and fixed-design regression [1]). However, the adaptive rates of contraction for the finite random series prior in random-design regression with respect to the integrated L2-distance have not been established. In this subsection we address this issue by leveraging the framework proposed in Section 2.

We first introduce the finite random series prior. Let (ψ_k)_{k=1}^∞ be the Fourier basis of L2([0,1]), i.e., ψ_1(x) = 1, ψ_{2k}(x) = √2 sin(kπx), and ψ_{2k+1}(x) = √2 cos(kπx),

k ∈ N_+. Writing f in terms of the Fourier series expansion f(x) = Σ_{k=1}^∞ β_k ψ_k(x), we then assign the finite random series prior Π on f by considering the following prior distribution on the coefficients (β_k)_{k=1}^∞: first sample an integer-valued random variable N from a density function π_N (with respect to the counting measure on N_+), and then, given N = m, sample the coefficients β_k independently according to

Π(dβ_k | N = m) = g(β_k)dβ_k if 1 ≤ k ≤ m, and Π(dβ_k | N = m) = δ_0(dβ_k) if k > m,

where g is an exponential power density, g(x) ∝ exp(−τ_0 |x|^τ) for some τ, τ_0 > 0 [36]. We further require that

π_N(m) ≥ exp(−b_0 m log m) and Σ_{N=m+1}^∞ π_N(N) ≤ exp(−b_1 m log m)   (3.1)

for some constants b_0, b_1 > 0. The zero-truncated Poisson distribution π_N(m) = (e^λ − 1)^{-1} λ^m/m! 𝟙(m ≥ 1) satisfies condition (3.1) [46].

The true regression function f_0 is assumed to yield a Fourier expansion f_0(x) = Σ_{k=1}^∞ β_{0k} ψ_k(x) with regard to (ψ_k)_{k=1}^∞, and to be in the α-Hölder ball

C_α(Q) = { f(x) = Σ_{k=1}^∞ β_k ψ_k(x): Σ_{k=1}^∞ k^α |β_k| ≤ Q },

where α > 1/2 is the smoothness level and Q > 0 is the α-Hölder radius. Note that the construction of the aforementioned finite random series prior does not require knowledge of the smoothness level α. In the literature of Bayes theory, such a procedure is referred to as adaptive. The following theorem shows that the constructed finite random series prior is adaptive and that the rate of contraction n^{−α/(2α+1)}(log n)^t with respect to the integrated L2-distance is minimax-optimal up to a logarithmic factor [38].

Theorem 3.1. Suppose the true regression function f_0 ∈ C_α(Q) for some α > 1/2 and Q > 0, and f is imposed the prior Π given above. Then there exists some sufficiently large constant M > 0 such that

E_0[ Π( ‖f − f_0‖_2 > Mn^{−α/(2α+1)}(log n)^t | D_n ) ] → 0

for any t > α/(2α + 1).
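A draw from this hierarchical prior is easy to simulate. The sketch below takes the zero-truncated Poisson for π_N and τ = 1 in g (a Laplace density), both consistent with the conditions above; the function name and the values of λ, τ_0 are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_from_prior(lam=5.0, tau0=1.0):
    """One draw of f from the finite random series prior.

    pi_N: zero-truncated Poisson(lam); g: Laplace density, i.e. the
    exponential power density with tau = 1 (illustrative choices).
    """
    N = 0
    while N < 1:                       # zero-truncated Poisson via rejection
        N = rng.poisson(lam)
    beta = rng.laplace(scale=1.0 / tau0, size=N)

    def f(x):
        # Fourier basis: psi_1 = 1, psi_{2k} = sqrt(2) sin(k pi x),
        # psi_{2k+1} = sqrt(2) cos(k pi x).
        psi = np.ones((len(x), N))
        for j in range(2, N + 1):      # basis index j = 2, ..., N
            m = j // 2
            psi[:, j - 1] = (np.sqrt(2) * np.sin(m * np.pi * x) if j % 2 == 0
                             else np.sqrt(2) * np.cos(m * np.pi * x))
        return psi @ beta

    return N, f

N, f = draw_from_prior()
x = np.linspace(0, 1, 5)
print("N =", N, " f(x) =", f(x))
```
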

3.2. Block prior regression with adaptive and exact rate

In the literature of adaptive Bayesian procedures, the minimax-optimal rates of contraction are often obtained with an extra logarithmic factor, and it typically requires extra work to obtain the exact minimax-optimal rate. Gao and Zhou [11] elegantly construct a modified block prior that yields rate-adaptive (i.e., the prior does not depend on the smoothness level) and rate-exact (i.e., the contraction rate does not involve an extra logarithmic factor) contraction for a wide class of nonparametric problems. Nevertheless, for nonparametric regression, [11] modifies the block prior by conditioning on a space of uniformly bounded functions. Requiring a known upper bound for the unknown f_0 when constructing the prior is restrictive, since it eliminates the popular Gaussian process priors. Besides this theoretical concern, the block prior itself is a conditional Gaussian process, and such a modification is inconvenient for implementation. In this section, we address this issue by showing that for nonparametric regression such a modification is not necessary.

Recall that in Section 3.1 we introduced the Fourier basis functions (ψ_k)_{k=1}^∞ of L2([0,1]) with ψ_1(x) = 1, ψ_{2k}(x) = √2 sin(kπx), and ψ_{2k+1}(x) = √2 cos(kπx), k ∈ N_+. The true regression function f_0 is also assumed to yield a Fourier expansion f_0(x) = Σ_{k=1}^∞ β_{0k} ψ_k(x), and to be in the α-Sobolev ball

H_α(Q) = { f(x) = Σ_{k=1}^∞ β_k ψ_k(x): Σ_{k=1}^∞ k^{2α} β_k^2 ≤ Q^2 }

with radius Q > 0. Write f(x) = Σ_{k=1}^∞ β_k ψ_k(x) in terms of the Fourier expansion. Similar to the finite random series prior, the block prior is constructed by assigning a prior distribution on the coefficients (β_k)_{k=1}^∞ as follows. Given a sequence β = (β_1, β_2, ...) in the squared-summable sequence space ℓ^2, define the ℓth block B_ℓ to be the integer index set B_ℓ = {k_ℓ, ..., k_{ℓ+1} − 1} and n_ℓ = |B_ℓ| = k_{ℓ+1} − k_ℓ, where k_ℓ = ⌈e^ℓ⌉.
We use β_ℓ = (β_j: j ∈ B_ℓ) ∈ R^{n_ℓ} to denote the coefficients with indices lying in the ℓth block B_ℓ. The block prior then assigns the following distribution on the coefficients (β_k)_{k=1}^∞:

β_ℓ | A_ℓ ∼ N(0, A_ℓ I_{n_ℓ}), A_ℓ ∼ g_ℓ, independently for each ℓ,

where (g_ℓ)_{ℓ=0}^∞ is a sequence of densities satisfying the following properties:

1. There exists c_1 > 0 such that for any ℓ and t ∈ [e^{−ℓ^2}, e^{−ℓ}],

g_ℓ(t) ≥ exp(−c_1 e^ℓ).   (3.2)

2. There exists c_2 > 0 such that for any ℓ,

∫_0^∞ t g_ℓ(t) dt ≤ exp(−c_2 ℓ^2).   (3.3)

3. There exists c_3 > 0 such that for any ℓ,

∫_{e^{−ℓ^2}}^∞ g_ℓ(t) dt ≤ exp(−c_3 e^ℓ).   (3.4)

The existence of a sequence of densities (g_ℓ)_{ℓ=0}^∞ satisfying (3.2), (3.3), and (3.4) is verified in [11] (see Proposition 2.1 there). Our major improvement for the block prior regression is the following theorem, which shows that the (un-modified) block prior yields rate-exact Bayesian adaptation for nonparametric regression.

Theorem 3.2. Suppose the true regression function f_0 ∈ H_α(Q) for some α > 1/2 and Q > 0, and f(x) = Σ_{k=1}^∞ β_k ψ_k(x) is imposed the block prior Π as described above. Then

E_0[ Π( ‖f − f_0‖_2 > Mn^{−α/(2α+1)} | D_n ) ] → 0

for some sufficiently large constant M > 0.
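The block structure itself is straightforward to simulate. Since a density g_ℓ meeting (3.2)-(3.4) is delicate to construct, the sketch below substitutes an exponential density with mean e^{−ℓ²} as a stand-in for g_ℓ (our own illustrative choice, mimicking the rapid decay forced by (3.3), and not the construction of [11]):

```python
import numpy as np

rng = np.random.default_rng(2)

# Block structure: B_l = {k_l, ..., k_{l+1} - 1} with k_l = ceil(e^l).
L = 6
k = [int(np.ceil(np.exp(l))) for l in range(L + 1)]
blocks = [range(k[l], k[l + 1]) for l in range(L)]

# Draw the coefficients: beta_l | A_l ~ N(0, A_l I), A_l ~ g_l.  Stand-in
# for g_l: an exponential density with mean e^{-l^2} (illustrative only).
beta = {}
for l, B in enumerate(blocks):
    A_l = rng.exponential(scale=np.exp(-l ** 2))
    for j in B:
        beta[j] = rng.normal(scale=np.sqrt(A_l))

# Evaluate the random function on a grid using the Fourier basis.
def psi(j, x):
    if j == 1:
        return np.ones_like(x)
    m = j // 2
    return (np.sqrt(2) * np.sin(m * np.pi * x) if j % 2 == 0
            else np.sqrt(2) * np.cos(m * np.pi * x))

x = np.linspace(0, 1, 200)
f = sum(b * psi(j, x) for j, b in beta.items())
print(len(beta), f[:3])
```

Because the block variances A_ℓ collapse super-exponentially in ℓ, a typical draw is dominated by its first few blocks, which is the mechanism behind the adaptation.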

Rather than using the sieve F_n proposed in Theorem 2.1 of [11], which does not necessarily satisfy (2.2), we construct F_{m_n}(δ) in a slightly different fashion:

F_{m_n}(Q) = { f(x) = Σ_{k=1}^∞ β_k ψ_k(x): Σ_{k=1}^∞ (β_k − β_{0k})^2 k^{2α} ≤ Q^2 }

with m_n ≍ n^{1/(2α+1)} and δ = Q. The covering number N_nj can be bounded using the metric entropy of Sobolev balls (see, for example, Lemma 6.4 in [2]), and the remaining conditions in Theorem 2.1 can be verified using similar techniques as in [11].

As discussed in Section 4.2 of [11], the block prior can be easily extended to wavelet basis functions and wavelet series. The wavelet basis functions are another widely-used class of orthonormal basis functions for L2([0,1]). Let

(ψ_{jk})_{j∈N, k∈I_j} be an orthonormal basis of compactly supported wavelets for L2([0,1]), with j referring to the so-called "resolution level", and k to the "dilation" (see, for example, Section E.3 in [15]). We adopt the convention that the index set I_j for the jth resolution level runs through {0, 1, ..., 2^j − 1}. The exact definition and specific formulas for the wavelet basis are not of great interest to us, and for a complete and thorough review of wavelets from a statistical perspective we refer to [17]. We shall assume that the wavelet basis ψ_{jk}'s are appropriately selected such that for any f(x) = Σ_{j=0}^∞ Σ_{k∈I_j} β_{jk} ψ_{jk}(x), the following inequalities hold [5, 6, 19]:

‖f‖_2 ≤ Σ_{j=0}^∞ ( Σ_{k∈I_j} β_{jk}^2 )^{1/2} and ‖f‖_∞ ≤ Σ_{j=0}^∞ 2^{j/2} max_{k∈I_j} |β_{jk}|.

Write f in terms of the wavelet series expansion f(x) = Σ_{j=0}^∞ Σ_{k∈I_j} β_{jk} ψ_{jk}(x). The block prior for the wavelet series is then introduced through the wavelet coefficients β_{jk}'s as follows:

β_j | A_j ∼ N(0, A_j I_{n_j}), A_j ∼ g_j, independently for each j,

where β_j = (β_{jk}: k ∈ I_j), n_j = |I_j| = 2^j, and g_j is an explicit piecewise density with normalizing constant T_j, constructed in Section 4.2 of [11] so that the wavelet analogues of conditions (3.2), (3.3), and (3.4) hold with e^ℓ replaced by 2^j; we refer to [11] for its exact formula. We further assume that f_0 is an α-Sobolev function. For the block prior regression via wavelet series, rate-exact Bayesian adaptation also holds.

Theorem 3.3. Suppose the true regression function f_0 ∈ H_α(Q) for some α > 1/2 and Q > 0, and f(x) = Σ_{j=0}^∞ Σ_{k∈I_j} β_{jk} ψ_{jk}(x) is imposed the block prior for wavelet series Π as described above. Then there exists some sufficiently large constant M > 0 such that

E_0[ Π( ‖f − f_0‖_2 > Mn^{−α/(2α+1)} | D_n ) ] → 0.

3.3. Beyond Fourier series: Gaussian spline regression

The previous two examples show that the proposed framework can be applied to priors built on Fourier series expansions. The framework also works for prior distributions beyond the Fourier basis, and we provide such an example in this subsection.

Spline functions, or splines, are defined piecewise by polynomials on subintervals [t_0, t_1), [t_1, t_2), ..., [t_{K−1}, t_K] and are globally smooth on the entire domain [t_0, t_K], where K is the number of subintervals. Without loss of generality, we may further require that t_0 = 0 and t_K = 1. A spline function is said to be of order q for some positive integer q if the involved polynomials are of degrees at most q − 1. Given the order q and the number of subintervals K, the space of spline functions forms a linear space with dimension m = q + K − 1, and a basis for this linear space is also a set of spline functions, referred to as B-splines and denoted by (B_k)_{k=1}^m. We refer the readers to [8] and [34] for a systematic introduction to spline functions. We present the following facts regarding the approximation property of splines and the norm equivalence between spline functions and the coefficients of the corresponding B-splines. These results can be found in [8].

Lemma 3.1. Let f_0 be an α-Hölder function with α > 0, and q ≥ α. Then there exist a constant C depending on f_0, q, and α, and coefficients (β_{0k})_{k=1}^m ⊂ R with m = q + K − 1, such that

‖ Σ_{k=1}^m β_{0k} B_k(x) − f_0(x) ‖_∞ ≤ C m^{−α}.

Furthermore, for any β_1, ..., β_m ∈ R,

max_{1≤k≤m} |β_k| ≍ ‖ Σ_{k=1}^m β_k B_k(x) ‖_∞ and ( Σ_{k=1}^m β_k^2 )^{1/2} ≍ √m ‖ Σ_{k=1}^m β_k B_k(x) ‖_2.

Assume that the true regression function f_0 is an α-Hölder function with α > 1/2. We now present the Gaussian spline prior, which simplifies the version presented in [10]. Assume that f is a spline function of order q ≥ α on [0, 1], and that the number of subintervals is K. Then we write f in terms of a linear combination of the B-splines, f(x) = Σ_{k=1}^m β_k B_k(x). The Gaussian spline prior assigns a prior distribution on f by letting the coefficients β_1, ..., β_m independently follow the standard normal distribution N(0, 1). Allowing m to grow moderately with the sample size n, we show in the following theorem that the posterior contraction rate with respect to ‖·‖_2 is minimax-optimal up to a logarithmic factor.

Theorem 3.4. Suppose the true regression function f_0 ∈ C_α(Q) for some α > 1/2 and Q > 0, and f(x) is assigned the Gaussian spline prior Π as described above with order q ≥ α and number of subintervals K. If m = q + K − 1 ≍ n^{1/(2α+1)}, then there exists some sufficiently large constant M > 0 such that

E_0[ Π( ‖f − f_0‖_2 > Mn^{−α/(2α+1)}(log n)^{1/2} | D_n ) ] → 0.

We briefly compare the result of Theorem 3.4 with that in [10], which considered the empirical L2-distance and obtained the minimax-optimal rate up to a logarithmic factor in the fixed-design regression problem. In contrast, we put the prior model in the context of random-design regression and use the integrated L2-distance, so Theorem 3.4 can be viewed as a complement of the contraction result presented in [10].
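Since both the coefficient prior and the noise are Gaussian, the Gaussian spline posterior is available in closed form: with design matrix B = (B_k(x_i)), one has β | D_n ∼ N(μ, Σ) with Σ = (BᵀB/σ² + I)⁻¹ and μ = Σ Bᵀy/σ². A sketch using SciPy's B-splines (the truth f_0 and the choices of q, K, and the clamped knot layout are our own illustrations):

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(3)
sigma, n = 0.3, 300

# Data from a smooth truth on a uniform random design.
f0 = lambda x: np.sin(2 * np.pi * x) + x
x = rng.uniform(0, 1, n)
y = f0(x) + sigma * rng.normal(size=n)

# Order-q B-spline basis with K subintervals: m = q + K - 1 basis functions.
q, K = 4, 10                       # q = 4: cubic splines (degree q - 1 = 3)
m = q + K - 1
knots = np.concatenate([np.zeros(q - 1), np.linspace(0, 1, K + 1),
                        np.ones(q - 1)])
def design(pts):
    return np.column_stack([
        BSpline(knots, np.eye(m)[k], q - 1)(pts) for k in range(m)])

# Conjugate update: beta ~ N(0, I) plus Gaussian noise gives a Gaussian
# posterior beta | D_n ~ N(mu, Sigma) in closed form.
B = design(x)
Sigma = np.linalg.inv(B.T @ B / sigma ** 2 + np.eye(m))
mu = Sigma @ (B.T @ y) / sigma ** 2

grid = np.linspace(0, 1, 400)
post_mean = design(grid) @ mu
print("integrated L2 error of posterior mean:",
      np.sqrt(np.mean((post_mean - f0(grid)) ** 2)))
```

This conjugacy is what makes the Gaussian spline prior convenient in practice, in contrast to the hierarchical priors of Sections 3.1 and 3.2.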

4. Extensions

4.1. Extension to the fixed-design regression

So far, the design points (x_i)_{i=1}^n in this paper have been assumed to be randomly sampled from [0,1]^p. This is referred to as the random-design regression problem. There are, however, many cases where the design points (x_i)_{i=1}^n are fixed and can be controlled. One example is the design and analysis of computer experiments [7, 33]: to emulate a computer model, the design points are typically manipulated so that they are reasonably spread. In some physical experiments the design points can also be required to be fixed [39]. In this subsection we show that by slightly extending the framework in Section 2, contraction in the integrated L2-distance is also obtainable under similar conditions when the design points are reasonably selected.

Suppose that the design points (x_i)_{i=1}^n ⊂ [0, 1] are one-dimensional and fixed. Intuitively, the design points need to be relatively "spread" so that the global behavior of the true signal f_0 can be recovered as much as possible. Formally, we require that the design points satisfy

sup_{x∈[0,1]} | n^{-1} Σ_{i=1}^n 𝟙(x_i ≤ x) − x | = O(1/n).   (4.1)
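As a numerical check of (4.1), the helper below (names our own) evaluates the supremum exactly by inspecting the jump points of the empirical distribution function; for the equidistant design x_i = (i − 1/2)/n it returns exactly 1/(2n):

```python
import numpy as np

def discrepancy(x):
    """Evaluate sup_{x in [0,1]} | n^{-1} sum_i 1(x_i <= x) - x |.

    The empirical CDF F_n is a step function, so the supremum is attained
    at a design point (from one of the two sides); checking both sides
    of every jump suffices.
    """
    x = np.sort(np.asarray(x))
    n = len(x)
    right = np.abs(np.arange(1, n + 1) / n - x)   # |F_n(x_i) - x_i|
    left = np.abs(x - np.arange(0, n) / n)        # |x_i - F_n(x_i^-)|
    return max(right.max(), left.max())

# Equidistant design x_i = (i - 1/2)/n: discrepancy equals 1/(2n),
# so condition (4.1) holds.
for n in (10, 100, 1000):
    x = (np.arange(1, n + 1) - 0.5) / n
    print(n, discrepancy(x), 1 / (2 * n))
```
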

A simple example of such a design is the univariate equidistant design, i.e., x_i = (i − 1/2)/n (see, for example, [50, 51]).

Now we extend the framework in Section 2 to the (one-dimensional) fixed-design regression problem. Specifically, this amounts to modifying the requirement for the sieve in (2.2):

\[
f, f_1 \in \mathcal{F}_m(\delta) \;\Rightarrow\; \left| P_n(f - f_0)^2 - \|f - f_0\|_2^2 \right| \le \eta\left(\frac{m}{n}\|f - f_0\|_2^2 + \frac{\delta}{\sqrt{n}}\|f - f_0\|_2\right),
\]
\[
\text{and}\quad \left| P_n(f - f_1)^2 - \|f - f_1\|_2^2 \right| \le \eta\left(\frac{m}{n}\|f - f_1\|_2^2 + \frac{\delta}{\sqrt{n}}\|f - f_1\|_2\right). \tag{4.2}
\]

With the above ingredients, we present the following modification of Theorem 2.1 for fixed-design regression, which may be of independent interest as well.

Theorem 4.1 (Generic Contraction, Fixed-design). Suppose the design points $(x_i)_{i=1}^n$ are fixed and satisfy (4.1). Let $(\epsilon_n)_{n=1}^\infty$ and $(\bar\epsilon_n)_{n=1}^\infty$ be sequences such that $\min(n\epsilon_n^2, n\bar\epsilon_n^2) \to \infty$ as $n \to \infty$ with $0 \le \epsilon_n \le \bar\epsilon_n \to 0$. Let the sieves $(\mathcal{F}_{m_n}(\delta))_{n=1}^\infty$ satisfy (4.2) for some constant $\delta > 0$, where $m_n \to \infty$ and $m_n/n \to 0$. Suppose the conditions (2.3), (2.4), and $\Pi(\|f - f_0\|_\infty < \epsilon_n) \ge \exp(-Dn\epsilon_n^2)$ hold for some constant $D > 0$ and sufficiently large $n$ and $M$. Then $\mathbb{E}_0[\Pi(\|f - f_0\|_2 > M\bar\epsilon_n \mid \mathcal{D}_n)] \to 0$.

As a sample application of the above framework, we consider one of the most popular Gaussian processes, $\mathrm{GP}(0, K)$ with the squared-exponential covariance function $K(x, x') = \exp[-(x - x')^2]$ [28]. We show that optimal rates of contraction with respect to the integrated $L_2$-distance are also attainable when the design points are reasonably selected, in contrast to most of the Bayesian literature, which obtains rates of contraction with respect to the empirical $L_2$-distance. Given constants $c, Q > 0$, we assume that the underlying true regression function $f_0$ lies in the following function class:

\[
\mathcal{A}_c(Q) = \left\{ f(x) = \sum_{k=1}^\infty \beta_k \psi_k(x) : \sum_{k=1}^\infty \beta_k^2 \exp\left(\frac{k^2}{c}\right) \le Q^2 \right\}.
\]

The function class $\mathcal{A}_c(Q)$ is closely related to the reproducing kernel Hilbert space (RKHS) associated with $\mathrm{GP}(0, K)$. For a complete and thorough review of RKHSs from a Bayesian perspective, we refer to [43]. A key feature of the functions in $\mathcal{A}_c(Q)$ is that they are "supersmooth", i.e., they are infinitely differentiable. For squared-exponential Gaussian process regression, the following property regarding the corresponding RKHS is available by applying Theorem 4.1 in [43].

Lemma 4.1. Let $\mathbb{H}$ be the RKHS associated with the squared-exponential Gaussian process $\mathrm{GP}(0, K)$, where $K(x, x') = \exp[-(x - x')^2]$. Then $\mathcal{A}_4(Q) \subset \mathbb{H}$ for any $Q > 0$.

Under the squared-exponential Gaussian process prior $\Pi$, the rate of contraction for a supersmooth $f_0 \in \mathcal{A}_4(Q)$ is $1/\sqrt{n}$ up to a logarithmic factor.

Theorem 4.2. Assume that the design points $(x_i)_{i=1}^n$ are fixed and satisfy (4.1). Suppose the true regression function $f_0 \in \mathcal{A}_4(Q)$ for some $Q > 0$, and $f$ follows the squared-exponential Gaussian process prior $\Pi$. Then there exists some sufficiently large constant $M > 0$ such that

\[
\mathbb{E}_0\left[\Pi\left(\|f - f_0\|_2 > M n^{-1/2} \log n \,\middle|\, \mathcal{D}_n\right)\right] \to 0.
\]

Remark 4.1. For squared-exponential Gaussian process regression with random design, the rate of contraction with respect to the integrated $L_2$-distance for $f_0 \in \mathcal{A}_4(Q)$ has been studied in the literature. In contrast, we remark that for the fixed-design regression problem, Theorem 4.2 is new and original, and provides a stronger result compared to the existing literature (see, for example, Theorem 10 in [40]).
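For intuition, one can simulate a draw from the squared-exponential Gaussian process prior at a fixed equidistant design directly from its finite-dimensional Gaussian law. The sketch below is ours; the $10^{-8}$ jitter is an arbitrary numerical-stability choice, not part of the theory.

```python
import numpy as np

def se_kernel(x, y):
    """Squared-exponential covariance K(x, x') = exp(-(x - x')^2)."""
    return np.exp(-np.subtract.outer(x, y) ** 2)

n = 50
x = (np.arange(1, n + 1) - 0.5) / n        # equidistant fixed design satisfying (4.1)
K = se_kernel(x, x) + 1e-8 * np.eye(n)     # jitter: the SE kernel matrix is severely ill-conditioned
L = np.linalg.cholesky(K)                  # K = L L^T
f = L @ np.random.default_rng(0).standard_normal(n)   # one prior draw at the design
print(f.shape)  # (50,)
```

The near-singularity of $K$ that forces the jitter is the finite-sample face of the "supersmoothness" of the RKHS: draws are extremely regular, which is why the contraction rate in Theorem 4.2 is nearly parametric.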

4.2. Extension to sparse additive models in high dimensions

We have so far considered design spaces of low dimension with fixed $p$. Nonetheless, the rapid development of technology has been enabling scientists to collect data with high-dimensional covariates, where the number of covariates $p$ can be much larger than the sample size $n$, to explore the potentially nonlinear relationship between these covariates and certain outcomes of interest. The emergence of high-dimensional prediction problems naturally motivates the study of nonparametric regression in high dimensions [49]. In this section, we focus on one class of high-dimensional nonparametric regression problems, known as sparse additive models, and illustrate that, with a suitable prior specification, the framework for low-dimensional Bayesian nonparametric regression naturally extends to such a high-dimensional scenario.

We first review some background regarding sparse additive models. Consider the additive regression model $y_i = f(x_i) + e_i$, where the regression function $f$ has an additive structure in the covariates: $f(x_i) = \mu + \sum_{j=1}^p f_j(x_{ij})$. Without loss of generality, one can assume that each component $f_j(x_j)$ is centered: $\int_0^1 f_j(x_j)\,\mathrm{d}x_j = 0$, $j = 1, \ldots, p$. For sparse additive models in high dimensions, the number of covariates $p$ is typically much larger than the sample size $n$, and the underlying true regression function $f_0$ depends only on a small number of covariates, say $x_{j_1}, \ldots, x_{j_q}$; i.e., $f_0$ has a sparse additive structure $f_0(x_i) = \mu_0 + \sum_{r=1}^q f_{0j_r}(x_{ij_r})$, where each $f_{0j_r}: [0,1] \to \mathbb{R}$ is a univariate function, and $q$, the number of active covariates, does not change with the sample size. Furthermore, the indices $\{j_1, \ldots, j_q\}$ of these active covariates and $q$ itself are unknown. This is referred to as the sparse additive model in high dimensions in the literature [18, 21, 24, 26, 27]. There have also been several works regarding Bayesian modeling of sparse additive models in high dimensions; see, for example, [23, 32, 49].
To model the sparsity occurring in the high-dimensional additive regression model, we consider the following parameterization of $f$ by introducing the binary covariate-selection variables $z_1, \ldots, z_p \in \{0, 1\}$:

\[
f(x) = \mu + \sum_{j=1}^p z_j f_j(x_j), \quad z_j \in \{0, 1\}, \quad j = 1, \ldots, p, \tag{4.3}
\]

where $z_j = 1$ indicates that the $j$th covariate is active and $z_j = 0$ otherwise. Following the strategy in Section 2, each component function $f_j$ is assigned a prior distribution independently across $j = 1, \ldots, p$. We complete the prior distribution $\Pi$ by imposing on the selection variables $z_j$ a Bernoulli distribution: $z_j \sim \mathrm{Bernoulli}(1/p)$. The Bernoulli prior for sparsity has been widely adopted in other high-dimensional Bayesian models with variable-selection structures (see, for example, [4, 30, 31]).

We now extend Theorem 2.1 to sparse additive models by modifying the sieve property (2.2). Denote $\mathbf{z} = [z_1, \ldots, z_p]^{\mathrm{T}} \in \{0, 1\}^p$ and let $A$ be a positive integer. We consider the sieve $\mathcal{G}_m^A(\delta) = \bigcup_{\|\mathbf{z}\|_1 \le Aq} \mathcal{G}_m(\delta, \mathbf{z})$, where each $\mathcal{G}_m(\delta, \mathbf{z})$ with $\|\mathbf{z}\|_1 \le Aq$ satisfies the following condition: there exists some constant $\eta > 0$ such that

\[
f \in \mathcal{G}_m(\delta, \mathbf{z}) \;\Rightarrow\; \|f - f_0\|_\infty^2 \le \eta\left(A^2 m \|f - f_0\|_2^2 + \delta^2\right). \tag{4.4}
\]

Theorem 4.3 (Generic Contraction, Sparse Additive Models). Consider the aforementioned sparse additive model in high dimensions. Let $(\epsilon_n)_{n=1}^\infty$ and $(\bar\epsilon_n)_{n=1}^\infty$ be sequences such that $\min(n\epsilon_n^2, n\bar\epsilon_n^2) \to \infty$ as $n \to \infty$, with $0 \le \epsilon_n \le \bar\epsilon_n \to 0$. Assume that there exist sieves of the form $\mathcal{G}_{m_n}^{A_n}(\delta) = \bigcup_{\|\mathbf{z}\|_1 \le A_n q} \mathcal{G}_{m_n}(\delta, \mathbf{z})$, where $\mathcal{G}_{m_n}(\delta, \mathbf{z})$ satisfies (4.4), $(m_n)_{n=1}^\infty$ and $(A_n)_{n=1}^\infty$ are sequences such that $A_n m_n \bar\epsilon_n^2 \to 0$, and $\delta$ is some constant. Let $(k_n)_{n=1}^\infty$ be another sequence such that $k_n \epsilon_n^2 = O(1)$. Suppose the following conditions hold for some constants $\omega, D > 0$ and sufficiently large $n$ and $M$:

\[
\sum_{j=M}^\infty N_{nj}^{A_n} \exp\left(-Dnj^2\bar\epsilon_n^2\right) \to 0, \tag{4.5}
\]
\[
\Pi\left(\mathcal{G}_{m_n}^{A_n}(\delta)^c\right) \lesssim \exp\left[-\left(2D + \frac{1}{\sigma^2}\right)n\epsilon_n^2\right], \tag{4.6}
\]
\[
\Pi\left(\widetilde{B}(k_n, \epsilon_n, \omega)\right) \ge \exp\left(-Dn\epsilon_n^2\right), \tag{4.7}
\]

where for any $m \in \mathbb{N}_+$ and $\epsilon, \omega > 0$,

\[
\widetilde{B}(m, \epsilon, \omega) = \left\{ f : \|f - f_0\|_2 \le \epsilon,\ \|f - f_0\|_\infty^2 \le \eta'\left(m\|f - f_0\|_2^2 + \omega^2\right) \right\}
\]

for some constant $\eta' > 0$, and $N_{nj}^{A_n} = \mathcal{N}(\xi j\bar\epsilon_n, \mathcal{S}_{nj}^{A_n}(\bar\epsilon_n), \|\cdot\|_2)$ is the covering number of

\[
\mathcal{S}_{nj}^{A_n} = \left\{ f \in \mathcal{G}_{m_n}^{A_n}(\delta) : j\bar\epsilon_n < \|f - f_0\|_2 \le (j+1)\bar\epsilon_n \right\}.
\]

Then $\mathbb{E}_0[\Pi(\|f - f_0\|_2 > M\bar\epsilon_n \mid \mathcal{D}_n)] \to 0$.

The above framework is quite flexible and can be applied to a variety of prior models. For example, let us extend the finite random series prior discussed in Section 3.1 to sparse additive models as follows. We model each component $f_j(x_j)$ via a Fourier series $f_j(x_j) = \sum_{k=1}^\infty \beta_{jk}\psi_k(x_j)$, where $(\psi_k)_{k=1}^\infty$ is the Fourier basis introduced in Section 3.1. The coefficients $(\beta_{jk} : j = 1, \ldots, p,\ k = 2, 3, \ldots)$ are assigned the following prior distribution: first sample an integer-valued random variable $N$ with density $\pi_N$ satisfying (3.1), and then, given $N = m$, sample the coefficients $(\beta_{jk} : j = 1, \ldots, p,\ k = 2, 3, \ldots)$ with

\[
\Pi(\mathrm{d}\beta_{jk} \mid N = m) = \begin{cases} g(\beta_{jk})\,\mathrm{d}\beta_{jk}, & \text{if } 2 \le k \le m, \\ \delta_0(\mathrm{d}\beta_{jk}), & \text{if } k > m, \end{cases}
\]

independently for all $j = 1, \ldots, p$ and $k = 2, 3, \ldots$, where $g(x) \propto \exp(-\tau_0|x|^\tau)$ for some $\tau, \tau_0 > 0$. Finally, let $\mu$ follow a prior distribution with a density supported on $\mathbb{R}$, and set $\beta_{j1} = -\sum_{k=2}^m \beta_{jk}\int_0^1 \psi_k(x_j)\,\mathrm{d}x_j$, $j = 1, \ldots, p$, to ensure that each component $f_j$ integrates to 0. The prior specification is completed by combining the aforementioned Bernoulli prior on $\mathbf{z}$.

Theorem 4.4. Consider the sparse additive model (4.3) with the above prior distribution, and the dimension of the design space $p \ge 2$ possibly growing with the sample size $n$. Suppose the true regression function has an additive structure $f_0(x) = \mu_0 + \sum_{r=1}^q f_{0j_r}(x_{j_r})$, where $f_{0j} \in C_\alpha(Q)$ for some $\alpha > 1/2$ and $Q > 0$ for all $j \in \{j_1, \ldots, j_q\}$, and $q$ does not change with $n$. Assume the dimension $p$ satisfies $\log p \lesssim \log n$. Let $f$ be imposed the prior $\Pi$ given above. Then there exists some sufficiently large constant $M > 0$ such that

\[
\mathbb{E}_0\left\{\Pi\left[\|f - f_0\|_2 > M n^{-\alpha/(2\alpha+1)}(\log n)^t \,\middle|\, \mathcal{D}_n\right]\right\} \to 0
\]

for any $t > \alpha/(2\alpha+1)$.

Remark 4.2. The minimax rate of convergence with respect to $\|\cdot\|_2$ for sparse additive models is $n^{-\alpha/(2\alpha+1)} + \sqrt{(\log p)/n}$, provided that $\log p \lesssim n^c$ for some $c < 1$ [49]. The first term $n^{-\alpha/(2\alpha+1)}$ is the usual rate for estimating a one-dimensional $\alpha$-smooth function, and the second term $\sqrt{(\log p)/n}$ comes from the complexity of selecting the $q$ active covariates $x_{j_1}, \ldots, x_{j_q}$ among $x_1, \ldots, x_p$. Under the assumption that $\log p \lesssim \log n$, the minimax rate of convergence is dominated by the first term $n^{-\alpha/(2\alpha+1)}$. Thus Theorem 4.4 states that, with the aforementioned finite random series prior for sparse additive models in high dimensions, the rate of contraction is also adaptive and minimax-optimal modulo a logarithmic factor, generalizing the result in Section 3.1.
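To make the prior specification of this subsection concrete, the sketch below (ours) draws one regression function from a version of the sparse additive prior. Several ingredients are stand-ins chosen for illustration only: a geometric distribution replaces the generic $\pi_N$, $g$ is taken to be Laplace (the case $\tau = 1$ above), and a standard normal density is used for $\mu$; with the cosine/sine Fourier basis used here, $\int_0^1 \psi_k = 0$ for $k \ge 2$, so the centering coefficients $\beta_{j1}$ vanish.

```python
import numpy as np

def fourier_basis(k, x):
    """Cosine/sine Fourier basis on [0, 1]; psi_1 is constant (an assumed convention)."""
    if k == 1:
        return np.ones_like(x)
    j = k // 2
    return np.sqrt(2) * (np.cos(2*np.pi*j*x) if k % 2 == 0 else np.sin(2*np.pi*j*x))

def sample_sparse_additive_prior(p, x, tau0=1.0, rng=None):
    """One draw f(x) from a version of the sparse additive prior of Section 4.2."""
    rng = np.random.default_rng(rng)
    z = rng.random(p) < 1.0 / p             # z_j ~ Bernoulli(1/p), covariate selection
    m = rng.geometric(0.5) + 1              # truncation level: stand-in for N ~ pi_N
    mu = rng.standard_normal()              # stand-in prior on the intercept mu
    f = np.full(len(x), mu)
    for j in range(p):
        if z[j]:
            beta = rng.laplace(scale=1.0/tau0, size=m-1)         # beta_{jk}, k = 2..m
            f += sum(b * fourier_basis(k, x)
                     for k, b in zip(range(2, m + 1), beta))     # each term integrates to 0
    return f

f = sample_sparse_additive_prior(200, np.linspace(0.0, 1.0, 101), rng=7)
print(f.shape)  # (101,)
```

With $p = 200$ most draws activate no covariate at all, reflecting the strong sparsity induced by the Bernoulli$(1/p)$ selection prior.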

5. Proofs of the main results

Proof of Lemma 2.1. Recall the assumption

\[
\|f - f_0\|_\infty^2 \lesssim m\|f - f_0\|_2^2 + \delta^2. \tag{5.1}
\]

Let us take $\xi = 1/(4\sqrt{2})$. Define the test function to be $\phi_n = \mathbb{1}\{T_n > 0\}$, where

\[
T_n = \sum_{i=1}^n y_i[f_1(x_i) - f_0(x_i)] - \frac{n}{2}P_n(f_1^2 - f_0^2) - \frac{\sqrt{n}}{8\sqrt{2}}\|f_1 - f_0\|_2\sqrt{nP_n(f_1 - f_0)^2}.
\]

Before proceeding to the analysis of the type I and type II error probabilities, we introduce the motivation for the choice of $T_n$ as the test statistic. In fact, by the well-known Neyman–Pearson lemma, the most powerful test is the likelihood ratio test. In the context of random-design normal regression, the likelihood ratio test for testing $H_0: f = f_0$ against $H_A: f = f_1$ is equivalent to rejecting $f_0$ for large values of

\[
\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n[y_i - f_1(x_i)]^2 + \frac{1}{2\sigma^2}\sum_{i=1}^n[y_i - f_0(x_i)]^2\right\} = \exp\left\{\frac{1}{\sigma^2}\sum_{i=1}^n y_i[f_1(x_i) - f_0(x_i)] - \frac{n}{2\sigma^2}P_n(f_1^2 - f_0^2)\right\},
\]

which is also equivalent to rejecting $f_0$ for large values of $T_n$.

We first consider the type I error probability. Under $\mathbb{P}_0$, we have $y_i = f_0(x_i) + e_i$, where the $e_i$'s are i.i.d. $N(0, \sigma^2)$ noises. Therefore, there exists a constant $C_1 > 0$ such that $\mathbb{P}_0(e_i > t) \le \exp(-4C_1t^2)$ for all $t > 0$. Then for a sequence $(a_i)_{i=1}^n \subset \mathbb{R}^n$, the Chernoff bound yields $\mathbb{P}_0\left(\sum_{i=1}^n a_ie_i \ge t\right) \le \exp\left(-4C_1t^2/\sum_{i=1}^n a_i^2\right)$. Now we set $a_i = f_1(x_i) - f_0(x_i)$ and

\[
t = \frac{1}{2}nP_n(f_1 - f_0)^2 + \frac{\sqrt{n}}{8\sqrt{2}}\|f_1 - f_0\|_2\sqrt{nP_n(f_1 - f_0)^2}.
\]

Clearly,

\[
t^2 \ge nP_n(f_1 - f_0)^2\left[\frac{1}{4}nP_n(f_1 - f_0)^2 + \frac{1}{128}n\|f_1 - f_0\|_2^2\right] \ge nP_n(f_1 - f_0)^2\left[\frac{1}{128}n\|f_1 - f_0\|_2^2\right].
\]

Then under $\mathbb{P}_0(\cdot \mid x_1, \ldots, x_n)$, we have

\[
\mathbb{E}_0(\phi_n \mid x_1, \ldots, x_n) \le \exp\left(-\frac{C_1}{32}n\|f_1 - f_0\|_2^2\right).
\]

It follows that the unconditional error can be bounded:

\[
\mathbb{E}_0\phi_n \le \exp\left(-\frac{C_1}{32}n\|f_1 - f_0\|_2^2\right).
\]

We next consider the type II error probability. Under $\mathbb{P}_f$, we have $y_i = f(x_i) + e_i$ with the $e_i$'s being i.i.d. mean-zero Gaussian. Then for any $f$ with $\|f - f_1\|_2 \le \|f_0 - f_1\|_2/(4\sqrt{2}) \le \|f_0 - f_1\|_2/4$, we have

\[
\mathbb{E}_f(1 - \phi_n) \le \mathbb{E}\left[\mathbb{1}\left\{P_n(f - f_1)^2 \le \frac{1}{16}P_n(f_1 - f_0)^2\right\}\mathbb{E}_f(1 - \phi_n \mid x_1, \ldots, x_n)\right] + \mathbb{P}\left(P_n(f - f_1)^2 > \frac{1}{16}P_n(f_1 - f_0)^2\right).
\]

When $P_n(f - f_1)^2 \le P_n(f_1 - f_0)^2/16$, we have

\[
T_n + \frac{\sqrt{n}}{8\sqrt{2}}\|f_1 - f_0\|_2\sqrt{nP_n(f_1 - f_0)^2} = \sum_{i=1}^n e_i[f_1(x_i) - f_0(x_i)] + nP_n[(f - f_1)(f_1 - f_0)] + \frac{n}{2}P_n(f_1 - f_0)^2 \ge \sum_{i=1}^n e_i[f_1(x_i) - f_0(x_i)] + \frac{n}{4}P_n(f_1 - f_0)^2.
\]

Now set

\[
R = R(x_1, \ldots, x_n) = \frac{1}{4}nP_n(f_1 - f_0)^2 - \frac{\sqrt{n}}{8\sqrt{2}}\|f_1 - f_0\|_2\sqrt{nP_n(f_1 - f_0)^2}.
\]

Then given $R \ge \frac{\sqrt{n}}{8\sqrt{2}}\|f_1 - f_0\|_2\sqrt{nP_n(f_1 - f_0)^2}$, we use the Chernoff bound to obtain

\[
\mathbb{P}_f(T_n < 0 \mid x_1, \ldots, x_n) \le \mathbb{P}\left(\sum_{i=1}^n e_i[f_1(x_i) - f_0(x_i)] \le -R \,\middle|\, x_1, \ldots, x_n\right) \le \exp\left[-\frac{4C_1R^2}{nP_n(f_1 - f_0)^2}\right] \le \exp\left(-\frac{C_1}{32}n\|f_1 - f_0\|_2^2\right).
\]

On the other hand,

\[
\mathbb{P}\left(R < \frac{\sqrt{n}}{8\sqrt{2}}\|f_1 - f_0\|_2\sqrt{nP_n(f_1 - f_0)^2}\right) = \mathbb{P}\left(\mathbb{G}_n(f_1 - f_0)^2 < -\frac{\sqrt{n}}{2}\|f_1 - f_0\|_2^2\right).
\]

It follows that

\[
\begin{aligned}
&\mathbb{E}\left[\mathbb{1}\left\{P_n(f - f_1)^2 \le \frac{1}{16}P_n(f_1 - f_0)^2\right\}\mathbb{E}_f(1 - \phi_n \mid x_1, \ldots, x_n)\right] \\
&\quad\le \mathbb{E}\left[\mathbb{1}\left\{R \ge \frac{\sqrt{n}}{8\sqrt{2}}\|f_1 - f_0\|_2\sqrt{nP_n(f_1 - f_0)^2},\ P_n(f - f_1)^2 \le \frac{1}{16}P_n(f_1 - f_0)^2\right\}\mathbb{P}_f(T_n < 0 \mid x_1, \ldots, x_n)\right] \\
&\qquad + \mathbb{P}\left(R < \frac{\sqrt{n}}{8\sqrt{2}}\|f_1 - f_0\|_2\sqrt{nP_n(f_1 - f_0)^2}\right) \\
&\quad\le \exp\left(-\frac{C_1}{32}n\|f_1 - f_0\|_2^2\right) + \mathbb{P}\left(\mathbb{G}_n(f_1 - f_0)^2 < -\frac{\sqrt{n}}{2}\|f_1 - f_0\|_2^2\right).
\end{aligned}
\]

Using Bernstein’s inequality (Lemma 19.32 in [41]), we obtain the tail probabil- ity of the empirical process G (f f )2 n 1 − 0 √n P G (f f )2 < f f 2 n 1 − 0 − 2 k 1 − 0k2   1 n f f 4/4 exp k 1 − 0k2 ≤ −4 E(f f )4 + f f 2 f f 2 /2  1 − 0 k 1 − 0k2k 1 − 0k∞  C′n f f 2 exp k 1 − 0k2 , ≤ −m f f 2 + δ2  k 1 − 0k2  for some constant C′ > 0, where we use the relation (5.1). On the other hand, 2 2 when Pn(f f1) > Pn(f1 f0) /16, we again use Bernstein’s inequality and the fact that−f f (δ−): f f 2 2−5 f f 2 to compute ∈{ ∈Fm k − 1k2 ≤ k 0 − 1k2} 1 1 n f f 4/1024 P P (f f )2 > P (f f )2 exp k 1 − 0k2 , n − 1 16 n 1 − 0 ≤ −4 g 2 + f f 2 g /32    k k2 k 1 − 0k2k k∞  where g = (f f )2 (f f )2/16. We further compute − 1 − 1 − 0 1 2 g 2 (f f )2 + (f f )2 k k2 ≤ k − 1 k2 16k 1 − 0 k2   1 2 f f f f + f f f f ≤ k − 1k∞k − 1k2 16k 1 − 0k∞k 1 − 0k2   . f f 2 f f 2 + f f 2 f f 2 k − 1k∞k − 1k2 k 1 − 0k∞k 1 − 0k2 . (m f f 2 + δ2) f f 2, k 1 − 0k2 k 0 − 1k2 where we use (5.1), the fact that f f . f f , and that k − 1k2 k 0 − 1k2 f f 2 2 f f 2 +2 f f 2 . m f f 2 + δ2. k − 1k∞ ≤ k − 0k∞ k 0 − 1k∞ k 1 − 0k2 Similarly, we obtain on the other hand, 1 g = f f 2 + f f 2 . m f f 2 + δ2. k k∞ k − 1k∞ 16k 1 − 0k∞ k 0 − 1k2 Therefore, we end up with 1 C˜ n f f 2 P P (f f )2 > P (f f )2 exp 2 k 1 − 0k2 , n − 1 16 n 1 − 0 ≤ −m f f 2 + δ2   k 1 − 0k2 ! where C˜2 > 0 is some constant. Hence we obtain the following exponential bound for type I and type II error probabilities: E φ exp( Cn f f 2), 0 n ≤ − k 1 − 0k2 Cn f f 2 E(1 φ ) exp( Cn f f 2)+2exp k 1 − 0k2 − n ≤ − k 1 − 0k2 −m f f 2 + δ2  k 1 − 0k2  2 2 for some constant C > 0 whenever f f1 2 f1 f0 2/32. Taking the k − k ≤ k − 2k 2 supremum of the type II error over f f m(δ): f f1 2 f1 f0 2/32 completes the proof. ∈{ ∈F k − k ≤k − k } F. Xie, W. Jin and Y. Xu/A Theoretical Framework for Bayesian Regression 20

Proof of Lemma 2.2. We partition the alternative set into a union of shells:

\[
\{f \in \mathcal{F}_m(\delta) : \|f - f_0\|_2 > M\epsilon_n\} \subset \bigcup_{j=M}^\infty \{f \in \mathcal{F}_m(\delta) : j\epsilon_n < \|f - f_0\|_2 \le (j+1)\epsilon_n\} := \bigcup_{j=M}^\infty \mathcal{S}_{nj}(\epsilon_n).
\]

For each $\mathcal{S}_{nj}(\epsilon_n)$, we can find $N_{nj} = \mathcal{N}(\xi j\epsilon_n, \mathcal{S}_{nj}(\epsilon_n), \|\cdot\|_2)$-many functions $f_{njl} \in \mathcal{S}_{nj}(\epsilon_n)$ such that

\[
\mathcal{S}_{nj}(\epsilon_n) \subset \bigcup_{l=1}^{N_{nj}} \{f \in \mathcal{F}_m(\delta) : \|f - f_{njl}\|_2 \le \xi j\epsilon_n\}.
\]

Since each $f_{njl} \in \mathcal{S}_{nj}(\epsilon_n)$, implying that $\|f_{njl} - f_0\|_2 > j\epsilon_n$, we obtain the final decomposition of the alternative:

\[
\mathcal{S}_{nj}(\epsilon_n) \subset \bigcup_{l=1}^{N_{nj}} \{f \in \mathcal{F}_m(\delta) : \|f - f_{njl}\|_2 \le \xi\|f_0 - f_{njl}\|_2\}.
\]

Now we apply Lemma 2.1 to construct an individual test function $\phi_{njl}$ for each $f_{njl}$ satisfying

\[
\mathbb{E}_0\phi_{njl} \le \exp(-Cnj^2\epsilon_n^2), \qquad \sup_{\{f \in \mathcal{F}_m(\delta) : \|f - f_{njl}\|_2^2 \le \xi^2\|f_0 - f_{njl}\|_2^2\}} \mathbb{E}_f(1 - \phi_{njl}) \le \exp(-Cnj^2\epsilon_n^2) + 2\exp\left(-\frac{Cnj^2\epsilon_n^2}{mj^2\epsilon_n^2 + \delta^2}\right),
\]

where we have used the fact that $\|f_{njl} - f_0\|_2 > j\epsilon_n$. Now define the global test function $\phi_n = \sup_{j \ge M}\max_{1 \le l \le N_{nj}}\phi_{njl}$. Then the type I error probability can be upper bounded using the union bound:

\[
\mathbb{E}_0\phi_n \le \sum_{j=M}^\infty\sum_{l=1}^{N_{nj}}\mathbb{E}_0\phi_{njl} \le \sum_{j=M}^\infty\sum_{l=1}^{N_{nj}}\exp(-Cnj^2\epsilon_n^2) = \sum_{j=M}^\infty N_{nj}\exp(-Cnj^2\epsilon_n^2).
\]

The type II error probability can also be upper bounded:

\[
\begin{aligned}
\sup_{\{f \in \mathcal{F}_m(\delta) : \|f - f_0\|_2 > M\epsilon_n\}}\mathbb{E}_f(1 - \phi_n)
&\le \sup_{j \ge M}\ \sup_{l=1,\ldots,N_{nj}}\ \sup_{\{f \in \mathcal{F}_m(\delta) : \|f - f_{njl}\|_2 \le \xi\|f_0 - f_{njl}\|_2\}}\mathbb{E}_f(1 - \phi_{njl}) \\
&\le \sup_{j \ge M}\left[\exp(-Cj^2n\epsilon_n^2) + 2\exp\left(-\frac{Cnj^2\epsilon_n^2}{mj^2\epsilon_n^2 + \delta^2}\right)\right] \\
&\le \exp(-CM^2n\epsilon_n^2) + 2\exp\left(-\frac{CnM^2\epsilon_n^2}{mM^2\epsilon_n^2 + \delta^2}\right).
\end{aligned}
\]

The proof is thus completed.

Proof of Lemma 2.3. Denote by $\Pi(\cdot \mid B_n)$ the re-normalized restriction of $\Pi$ to $B_n = B(k_n, \epsilon_n, \omega)$, and define the random variables $(V_{ni})_{i=1}^n$ and $(W_{ni})_{i=1}^n$ by

\[
V_{ni} = \int[f_0(x_i) - f(x_i)]\,\Pi(\mathrm{d}f \mid B_n), \qquad W_{ni} = \frac{1}{2}\int(f(x_i) - f_0(x_i))^2\,\Pi(\mathrm{d}f \mid B_n).
\]

Let

\[
\mathcal{H}_n := \left\{\int\prod_{i=1}^n\frac{p_f(x_i, y_i)}{p_0(x_i, y_i)}\,\Pi(\mathrm{d}f) > \Pi(B_n)\exp\left[-\left(C + \frac{1}{\sigma^2}\right)n\epsilon_n^2\right]\right\}.
\]

Then by Jensen's inequality,

\[
\mathcal{H}_n^c \subset \left\{\int\prod_{i=1}^n\frac{p_f(x_i, y_i)}{p_0(x_i, y_i)}\,\Pi(\mathrm{d}f \mid B_n) \le \exp\left[-\left(C + \frac{1}{\sigma^2}\right)n\epsilon_n^2\right]\right\} \subset \left\{\frac{1}{\sigma^2}\sum_{i=1}^n(e_iV_{ni} + W_{ni}) \ge \left(C + \frac{1}{\sigma^2}\right)n\epsilon_n^2\right\} \subset \left\{\frac{1}{\sigma^2}\sum_{i=1}^n e_iV_{ni} \ge Cn\epsilon_n^2\right\} \cup \left\{\sum_{i=1}^n W_{ni} \ge n\epsilon_n^2\right\}.
\]

Now we use the Chernoff bound for Gaussian random variables to obtain a conditional probability bound for the first event given the design points $(x_i)_{i=1}^n$:

\[
\mathbb{P}_0\left(\sum_{i=1}^n e_iV_{ni} \ge C\sigma^2n\epsilon_n^2 \,\middle|\, x_1, \ldots, x_n\right) \le \exp\left(-\frac{C^2\sigma^4n\epsilon_n^4}{P_nV_{ni}^2}\right).
\]

Since over the function class $B_n$ we have $\|f - f_0\|_2 \le \epsilon_n$, $k_n\epsilon_n^2 = O(1)$, and

\[
\|f - f_0\|_\infty^2 \lesssim k_n\|f - f_0\|_2^2 + \omega^2 \le k_n\epsilon_n^2 + \omega^2 = O(1),
\]

it follows from Fubini's theorem that

\[
\mathbb{E}(V_{ni}^2) \le \int\|f_0 - f\|_2^2\,\Pi(\mathrm{d}f \mid B_n) \le \epsilon_n^2, \qquad \mathbb{E}(V_{ni}^4) \le \mathbb{E}\left[\int(f_0(x) - f(x))^4\,\Pi(\mathrm{d}f \mid B_n)\right] \le \int\|f - f_0\|_\infty^2\|f - f_0\|_2^2\,\Pi(\mathrm{d}f \mid B_n) \lesssim \epsilon_n^2.
\]

Hence by Chebyshev's inequality,

\[
\mathbb{P}\left(\left|P_nV_{ni}^2 - \mathbb{E}(V_{ni}^2)\right| > \epsilon\epsilon_n^2\right) \le \frac{1}{n\epsilon^2\epsilon_n^4}\mathrm{var}(V_{ni}^2) \le \frac{1}{n\epsilon^2\epsilon_n^4}\mathbb{E}(V_{ni}^4) \lesssim \frac{1}{n\epsilon_n^2} \to 0
\]

for any $\epsilon > 0$, i.e., $P_nV_{ni}^2 \le \mathbb{E}(V_{ni}^2) + o_P(\epsilon_n^2) \le \epsilon_n^2(1 + o_P(1))$, and hence,

\[
\exp\left(-\frac{C^2\sigma^4n\epsilon_n^4}{P_nV_{ni}^2}\right) = \exp\left(-\frac{C^2\sigma^4n\epsilon_n^2}{1 + o_P(1)}\right) \to 0
\]

in probability. Therefore, by the dominated convergence theorem, the unconditional probability goes to 0:

\[
\mathbb{P}_0\left(\sum_{i=1}^n e_iV_{ni} \ge C\sigma^2n\epsilon_n^2\right) = \mathbb{E}\left[\mathbb{P}_0\left(\sum_{i=1}^n e_iV_{ni} \ge C\sigma^2n\epsilon_n^2 \,\middle|\, x_1, \ldots, x_n\right)\right] \le \mathbb{E}\left[\exp\left(-\frac{C^2\sigma^4n\epsilon_n^4}{P_nV_{ni}^2}\right)\right] \to 0.
\]

For the second event we use Bernstein's inequality. Since

\[
\mathbb{E}(W_{ni}) = \frac{1}{2}\int\|f - f_0\|_2^2\,\Pi(\mathrm{d}f \mid B_n) \le \frac{1}{2}\epsilon_n^2, \qquad \mathbb{E}(W_{ni}^2) \le \frac{1}{4}\mathbb{E}\left[\int(f(x) - f_0(x))^4\,\Pi(\mathrm{d}f \mid B_n)\right] \le \frac{1}{4}\int\|f - f_0\|_2^2\|f - f_0\|_\infty^2\,\Pi(\mathrm{d}f \mid B_n) \lesssim \epsilon_n^2,
\]

it follows that

\[
\mathbb{P}\left(\sum_{i=1}^n W_{ni} > n\epsilon_n^2\right) \le \mathbb{P}\left(\sum_{i=1}^n(W_{ni} - \mathbb{E}W_{ni}) > \frac{1}{2}n\epsilon_n^2\right) \le \exp\left[-\frac{1}{4}\frac{n\epsilon_n^4/4}{\mathbb{E}(W_{ni}^2) + \epsilon_n^2\|W_{ni}\|_\infty/2}\right] \le \exp\left(-\frac{\widehat{C}_1n\epsilon_n^2}{1 + \|W_{ni}\|_\infty}\right),
\]

where $\|W_{ni}\|_\infty = \sup_{x \in [0,1]^p}(1/2)\int(f(x) - f_0(x))^2\,\Pi(\mathrm{d}f \mid B_n)$. Since $\|f - f_0\|_\infty = O(1)$ for any $f \in B_n$, it follows that $\|W_{ni}\|_\infty = O(1)$, and hence $\mathbb{P}\left(\sum_{i=1}^n W_{ni} > n\epsilon_n^2\right) \to 0$. To sum up, we conclude that

\[
\mathbb{P}(\mathcal{H}_n^c) \le \mathbb{P}_0\left(\sum_{i=1}^n e_iV_{ni} \ge C\sigma^2n\epsilon_n^2\right) + \mathbb{P}\left(\sum_{i=1}^n W_{ni} > n\epsilon_n^2\right) \to 0.
\]

Proof of Lemma 2.4. Denote by $\Pi(\cdot \mid B_n) = \Pi(\cdot \cap B_n)/\Pi(B_n)$ the re-normalized restriction of $\Pi$ to $B_n = \{\|f - f_0\|_\infty < \epsilon_n\}$, and let

\[
V_{ni} = \int[f_0(x_i) - f(x_i)]\,\Pi(\mathrm{d}f \mid B_n), \qquad W_{ni} = \frac{1}{2}\int(f(x_i) - f_0(x_i))^2\,\Pi(\mathrm{d}f \mid B_n).
\]

Similar to the proof of Lemma 2.3, we obtain

\[
\mathcal{H}_n^c = \left\{\int\prod_{i=1}^n\frac{p_f(x_i, y_i)}{p_0(x_i, y_i)}\,\Pi(\mathrm{d}f) \le \Pi(B_n)\exp\left[-\left(C + \frac{1}{\sigma^2}\right)n\epsilon_n^2\right]\right\} \subset \left\{\frac{1}{\sigma^2}\sum_{i=1}^n(e_iV_{ni} + W_{ni}) \ge \left(C + \frac{1}{\sigma^2}\right)n\epsilon_n^2\right\} \subset \left\{\frac{1}{\sigma^2}\sum_{i=1}^n e_iV_{ni} \ge \left(C + \frac{1}{2\sigma^2}\right)n\epsilon_n^2\right\},
\]

where we use the fact that $W_{ni} \le (1/2)\int\|f - f_0\|_\infty^2\,\Pi(\mathrm{d}f \mid B_n) \le \epsilon_n^2/2$ for all $f \in B_n$ in the last step. Conditioning on the design points $(x_i)_{i=1}^n$, we have

\[
\mathbb{P}_0(\mathcal{H}_n^c \mid x_1, \ldots, x_n) \le \exp\left[-\left(C + \frac{1}{2\sigma^2}\right)^2\frac{\sigma^4n\epsilon_n^4}{P_nV_{ni}^2}\right] \le \exp\left\{-\left(C + \frac{1}{2\sigma^2}\right)^2\sigma^4n\epsilon_n^4\left[\int\|f - f_0\|_\infty^2\,\Pi(\mathrm{d}f \mid B_n)\right]^{-1}\right\} \le \exp\left[-\left(C + \frac{1}{2\sigma^2}\right)^2\sigma^4n\epsilon_n^2\right] \to 0.
\]

The proof is completed by applying the dominated convergence theorem.

Proof of Theorem 2.1. For convenience, denote the log-likelihood ratio function

\[
\Lambda_n(f) = \sum_{i=1}^n[\log p_f(x_i, y_i) - \log p_0(x_i, y_i)].
\]

Let $\phi_n$ be the test function given by Lemma 2.2 and

\[
\mathcal{H}_n = \left\{\int\exp(\Lambda_n(f))\,\Pi(\mathrm{d}f) \ge \exp\left[-\left(\frac{3D}{2} + \frac{1}{\sigma^2}\right)n\epsilon_n^2\right]\right\}.
\]

It follows from condition (2.5) that

\[
\mathcal{H}_n^c \subset \left\{\int\exp(\Lambda_n(f))\,\Pi(\mathrm{d}f) < \Pi(B(k_n, \epsilon_n, \omega))\exp\left[-\left(\frac{D}{2} + \frac{1}{\sigma^2}\right)n\epsilon_n^2\right]\right\},
\]

and hence, $\mathbb{P}_0(\mathcal{H}_n^c) = o(1)$ by Lemma 2.3. Now we decompose the expected value of the posterior probability:

\[
\begin{aligned}
\mathbb{E}_0[\Pi(\|f - f_0\|_2 > M\epsilon_n \mid \mathcal{D}_n)]
&\le \mathbb{E}_0[(1 - \phi_n)\mathbb{1}(\mathcal{H}_n)\Pi(\|f - f_0\|_2 > M\epsilon_n \mid \mathcal{D}_n)] + \mathbb{E}_0\phi_n + \mathbb{E}_0[(1 - \phi_n)\mathbb{1}(\mathcal{H}_n^c)] \\
&\le \mathbb{E}_0\left[(1 - \phi_n)\mathbb{1}(\mathcal{H}_n)\frac{\int_{\{\|f - f_0\|_2 > M\epsilon_n\}}\exp(\Lambda_n(f))\,\Pi(\mathrm{d}f)}{\int\exp(\Lambda_n(f))\,\Pi(\mathrm{d}f)}\right] + \mathbb{E}_0\phi_n + \mathbb{P}_0(\mathcal{H}_n^c).
\end{aligned}
\]

By (2.3) and Lemma 2.2, the type I error probability $\mathbb{E}_0\phi_n \to 0$. It suffices to bound the first term. Observe that on the event $\mathcal{H}_n$, the denominator in the square bracket can be lower bounded:

\[
\begin{aligned}
&\mathbb{E}_0\left[(1 - \phi_n)\mathbb{1}(\mathcal{H}_n)\frac{\int_{\{\|f - f_0\|_2 > M\epsilon_n\}}\exp(\Lambda_n(f))\,\Pi(\mathrm{d}f)}{\int\exp(\Lambda_n(f))\,\Pi(\mathrm{d}f)}\right] \\
&\quad\le \exp\left[\left(\frac{3D}{2} + \frac{1}{\sigma^2}\right)n\epsilon_n^2\right]\mathbb{E}_0\left[(1 - \phi_n)\int_{\mathcal{F}_{m_n}(\delta)\cap\{\|f - f_0\|_2 > M\epsilon_n\}}\exp(\Lambda_n(f))\,\Pi(\mathrm{d}f)\right] \\
&\qquad + \exp\left[\left(\frac{3D}{2} + \frac{1}{\sigma^2}\right)n\epsilon_n^2\right]\mathbb{E}_0\left[\int_{\mathcal{F}_{m_n}^c(\delta)}\exp(\Lambda_n(f))\,\Pi(\mathrm{d}f)\right].
\end{aligned}
\]

By Fubini's theorem and Lemma 2.2, we have

\[
\begin{aligned}
\mathbb{E}_0\left[(1 - \phi_n)\int_{\mathcal{F}_{m_n}(\delta)\cap\{\|f - f_0\|_2 > M\epsilon_n\}}\exp(\Lambda_n(f))\,\Pi(\mathrm{d}f)\right]
&\le \int_{\mathcal{F}_{m_n}(\delta)\cap\{\|f - f_0\|_2 > M\epsilon_n\}}\mathbb{E}_0[(1 - \phi_n)\exp(\Lambda_n(f))]\,\Pi(\mathrm{d}f) \\
&\le \sup_{f \in \mathcal{F}_{m_n}(\delta)\cap\{\|f - f_0\|_2 > M\epsilon_n\}}\mathbb{E}_f(1 - \phi_n) \\
&\le \exp(-CM^2n\epsilon_n^2) + 2\exp\left(-\frac{CM^2n\epsilon_n^2}{m_n\epsilon_n^2M^2 + \delta^2}\right) \le \exp(-\widetilde{C}M^2n\epsilon_n^2)
\end{aligned}
\]

for some constant $\widetilde{C} > 0$ and sufficiently large $n$, since $m_n\epsilon_n^2 \to 0$ and $\delta = O(1)$ by assumption. For the integral over $\mathcal{F}_{m_n}^c(\delta)$, we apply Fubini's theorem to obtain

\[
\mathbb{E}_0\left[\int_{\mathcal{F}_{m_n}^c(\delta)}\exp(\Lambda_n(f))\,\Pi(\mathrm{d}f)\right] = \int_{\mathcal{F}_{m_n}^c(\delta)}\mathbb{E}_0\left[\prod_{i=1}^n\frac{p_f(x_i, y_i)}{p_0(x_i, y_i)}\right]\Pi(\mathrm{d}f) \le \Pi(\mathcal{F}_{m_n}^c(\delta)).
\]

Hence we proceed to compute

\[
\begin{aligned}
\mathbb{E}_0\left[(1 - \phi_n)\mathbb{1}(\mathcal{H}_n)\frac{\int_{\{\|f - f_0\|_2 > M\epsilon_n\}}\exp(\Lambda_n(f))\,\Pi(\mathrm{d}f)}{\int\exp(\Lambda_n(f))\,\Pi(\mathrm{d}f)}\right]
&\lesssim \exp\left[\left(\frac{3D}{2} + \frac{1}{\sigma^2}\right)n\epsilon_n^2 - \widetilde{C}M^2n\epsilon_n^2\right] \\
&\qquad + \exp\left[\left(\frac{3D}{2} + \frac{1}{\sigma^2}\right)n\epsilon_n^2 - \left(2D + \frac{1}{\sigma^2}\right)n\epsilon_n^2\right] \to 0
\end{aligned}
\]

as long as $M$ is sufficiently large, where (2.4) is applied.

Supplementary Material

Supplement to "A theoretical framework for Bayesian nonparametric regression". The supplementary material contains the remaining proofs and additional technical results.

References

[1] Arbel, J., Gayraud, G. and Rousseau, J. (2013). Bayesian optimal adaptive estimation using a sieve prior. Scandinavian Journal of Statistics 40 549–570.
[2] Belitser, E. and Ghosal, S. (2003). Adaptive Bayesian inference on the mean of an infinite-dimensional normal distribution. The Annals of Statistics 31 536–559.
[3] Canale, A. and De Blasi, P. (2017). Posterior asymptotics of nonparametric location-scale mixtures for multivariate density estimation. Bernoulli 23 379–404.
[4] Castillo, I. and van der Vaart, A. (2012). Needles and straw in a haystack: posterior concentration for possibly sparse sequences. The Annals of Statistics 40 2069–2101.
[5] Cohen, A. (2003). Numerical Analysis of Wavelet Methods 32. Elsevier.
[6] Cohen, A., Daubechies, I. and Vial, P. (1993). Wavelets on the interval and fast wavelet transforms. Applied and Computational Harmonic Analysis 1 54–81.
[7] Currin, C., Mitchell, T., Morris, M. and Ylvisaker, D. (1991). Bayesian prediction of deterministic functions, with applications to the design and analysis of computer experiments. Journal of the American Statistical Association 86 953–963.
[8] de Boor, C. (1978). A Practical Guide to Splines. Applied Mathematical Sciences. Springer, New York.
[9] De Jonge, R. and Van Zanten, J. (2010). Adaptive nonparametric Bayesian inference using location-scale mixture priors. The Annals of Statistics 38 3300–3320.
[10] De Jonge, R. and Van Zanten, J. (2012). Adaptive estimation of multivariate functions using conditionally Gaussian tensor-product spline priors. Electronic Journal of Statistics 6 1984–2001.
[11] Gao, C. and Zhou, H. H. (2016). Rate exact Bayesian adaptation with modified block priors. The Annals of Statistics 44 318–345.
[12] Ghosal, S., Ghosh, J. K. and van der Vaart, A. W. (2000). Convergence rates of posterior distributions. The Annals of Statistics 28 500–531.
[13] Ghosal, S. and van der Vaart, A. (2007). Posterior convergence rates of Dirichlet mixtures at smooth densities. The Annals of Statistics 35 697–723.
[14] Ghosal, S. and van der Vaart, A. (2007). Convergence rates of posterior distributions for noniid observations. The Annals of Statistics 35 192–223.
[15] Ghosal, S. and van der Vaart, A. (2017). Fundamentals of Nonparametric Bayesian Inference 44. Cambridge University Press.
[16] Ghosal, S. and van der Vaart, A. W. (2001). Entropies and rates of convergence for maximum likelihood and Bayes estimation for mixtures of normal densities. The Annals of Statistics 29 1233–1263.
[17] Giné, E. and Nickl, R. (2015). Mathematical Foundations of Infinite-Dimensional Statistical Models 40. Cambridge University Press.
[18] Hastie, T. J. (2017). Generalized additive models. In Statistical Models in S 249–307. Routledge.
[19] Hoffmann, M., Rousseau, J. and Schmidt-Hieber, J. (2015). On adaptive posterior concentration rates. The Annals of Statistics 43 2259–2295.
[20] Huang, T.-M. (2004). Convergence rates for posterior distributions and adaptive estimation. The Annals of Statistics 32 1556–1593.
[21] Koltchinskii, V. and Yuan, M. (2010). Sparsity in multiple kernel learning. The Annals of Statistics 38 3660–3695.
[22] Kruijer, W., Rousseau, J. and van der Vaart, A. (2010). Adaptive Bayesian density estimation with location-scale mixtures. Electronic Journal of Statistics 4 1225–1257.
[23] Linero, A. R. and Yang, Y. (2017). Bayesian regression tree ensembles that adapt to smoothness and sparsity. arXiv preprint arXiv:1707.09461.
[24] Meier, L., van de Geer, S. and Bühlmann, P. (2009). High-dimensional additive modeling. The Annals of Statistics 37 3779–3821.
[25] Pati, D., Bhattacharya, A. and Cheng, G. (2015). Optimal Bayesian estimation in random covariate design with a rescaled Gaussian process prior. Journal of Machine Learning Research 16 2837–2851.
[26] Ravikumar, P., Lafferty, J., Liu, H. and Wasserman, L. (2009). Sparse additive models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71 1009–1030.
[27] Raskutti, G., Wainwright, M. J. and Yu, B. (2012). Minimax-optimal rates for sparse additive models over kernel classes via convex programming. Journal of Machine Learning Research 13 389–427.
[28] Rasmussen, C. E. and Williams, C. K. (2006). Gaussian Processes for Machine Learning 1. MIT Press, Cambridge.
[29] Rivoirard, V. and Rousseau, J. (2012). Posterior concentration rates for infinite dimensional exponential families. Bayesian Analysis 7 311–334.
[30] Ročková, V. (2018). Bayesian estimation of sparse signals with a continuous spike-and-slab prior. The Annals of Statistics 46 401–437.
[31] Ročková, V. and George, E. I. (2018). The spike-and-slab LASSO. Journal of the American Statistical Association 113 431–444.
[32] Ročková, V. and van der Pas, S. (2017). Posterior concentration for Bayesian regression trees and their ensembles. arXiv preprint arXiv:1708.08734.
[33] Sacks, J., Welch, W. J., Mitchell, T. J. and Wynn, H. P. (1989). Design and analysis of computer experiments. Statistical Science 4 409–423.
[34] Schumaker, L. (2007). Spline Functions: Basic Theory. Cambridge University Press.
[35] Scricciolo, C. (2006). Convergence rates for Bayesian density estimation of infinite-dimensional exponential families. The Annals of Statistics 34 2897–2920.
[36] Scricciolo, C. (2011). Posterior rates of convergence for Dirichlet mixtures of exponential power densities. Electronic Journal of Statistics 5 270–308.
[37] Shen, W., Tokdar, S. T. and Ghosal, S. (2013). Adaptive Bayesian multivariate density estimation with Dirichlet mixtures. Biometrika 100 623–640.
[38] Stone, C. J. (1982). Optimal global rates of convergence for nonparametric regression. The Annals of Statistics 10 1040–1053.
[39] Tuo, R. and Wu, C. F. J. (2016). A theoretical framework for calibration in computer models: parametrization, estimation and convergence properties. SIAM/ASA Journal on Uncertainty Quantification 4 767–795.
[40] van der Vaart, A. W. and van Zanten, J. H. (2011). Information rates of nonparametric Gaussian process methods. Journal of Machine Learning Research 12 2095–2119.
[41] van der Vaart, A. W. (2000). Asymptotic Statistics 3. Cambridge University Press.
[42] van der Vaart, A. W. and van Zanten, J. H. (2008). Rates of contraction of posterior distributions based on Gaussian process priors. The Annals of Statistics 36 1435–1463.
[43] van der Vaart, A. W. and van Zanten, J. H. (2008). Reproducing kernel Hilbert spaces of Gaussian priors. In Pushing the Limits of Contemporary Statistics: Contributions in Honor of Jayanta K. Ghosh 200–222. Institute of Mathematical Statistics.
[44] van der Vaart, A. W. and van Zanten, J. H. (2009). Adaptive Bayesian estimation using a Gaussian random field with inverse Gamma bandwidth. The Annals of Statistics 37 2655–2675.
[45] Shen, W. and Ghosal, S. (2015). Adaptive Bayesian procedures using random series priors. Scandinavian Journal of Statistics 42 1194–1213.
[46] Xie, F. and Xu, Y. (2017). Bayesian repulsive Gaussian mixture model. arXiv preprint arXiv:1703.09061.
[47] Xie, F. and Xu, Y. (2018). Adaptive Bayesian nonparametric regression using a kernel mixture of polynomials with application to partial linear models. Bayesian Analysis. Advance publication.
[48] Yang, Y., Bhattacharya, A. and Pati, D. (2017). Frequentist coverage and sup-norm convergence rate in Gaussian process regression. arXiv preprint arXiv:1708.04753.
[49] Yang, Y. and Tokdar, S. T. (2015). Minimax-optimal nonparametric regression in high dimensions. The Annals of Statistics 43 652–674.
[50] Yoo, W. W. and Ghosal, S. (2016). Supremum norm posterior contraction and credible sets for nonparametric multivariate regression. The Annals of Statistics 44 1069–1102.
[51] Yoo, W. W., Rousseau, J. and Rivoirard, V. (2017). Adaptive supremum norm posterior contraction: spike-and-slab priors and anisotropic Besov spaces. arXiv preprint arXiv:1708.01909.