An Invitation to Bayesian Nonparametrics
Subhashis Ghoshal
North Carolina State University
A short course on the theory and methods of nonparametric Bayesian inference, given at EURANDOM, Summer 2011
Objectives
The goal is to develop sound Bayesian methods for nonparametric and semiparametric problems. This means constructing sensible priors, computing posterior distributions (possibly with support from computational devices like Markov chain Monte Carlo), and showing good (frequentist) behavior of the posterior distribution, usually in an asymptotic sense.
Nonparametric and semiparametric modeling is the way to go, since parametric models are arguably too restrictive.
Bayesian methods quantify uncertainty in a direct way and are straightforward in approach. They can also incorporate additional information in a very natural way.
Since both nonparametric modeling and the Bayesian approach are sensible, the nonparametric Bayesian approach must be sensible too.
Examples of nonparametric models
X1, ..., Xn ∼ P i.i.d., c.d.f. F completely unknown.
X1, ..., Xn ∼ P i.i.d., p.d.f. f not known to be a member of a parametric family.
Yi = f(Xi) + εi, regression function f unknown.
Yi | Xi ∼ Bin(1, p(Xi)) independently, p(x) ∈ [0, 1] unknown.
X1, ..., Xn are survival times, subject to censoring, and have cumulative hazard function H, unknown.
dXt = f (t)dt + σdBt , unknown signal f corrupted by white noise dBt .
X0, X1, ..., Xn stationary time series with unknown spectral density f .
Yi |Xi = xi has conditional density f (·; xi ), f unknown.
Examples of semiparametric models
Yi = β′Xi + εi, εi ∼ P i.i.d., P unknown, but β is the parameter of interest.
Yi | Xi ∼ ψ(·, f(β′Xi)) independently, ψ an exponential family, f unknown, but β is the parameter of interest.
X1, ..., Xn are survival times, subject to censoring, and have cumulative hazard functions Hi respectively, where Hi(x) = e^{β′Zi} H0(x), Z1, ..., Zn are associated covariates, the baseline hazard H0 is unknown, but β is the parameter of interest.
Functional data analysis examples
Functional regression with Euclidean predictor: Yi(t) = β(t)′Xi + εi(t).
Functional regression with functional predictor: Yi(t) = ∫ β(s, t) Xi(s) ds + εi(t).
Functional principal component analysis: X1(t), ..., Xn(t) i.i.d. mean-zero Gaussian processes with covariance kernel K(s, t) = Σ_{j=1}^∞ λj φj(s) φj(t).
Issues
Need to construct probability measures on infinite dimensional spaces, where there is no Lebesgue-type dominating σ-finite measure.
Fully subjective priors are not possible, since that would require specifying infinitely many details about the unknown. The prior should be largely developed through an automatic mechanism; only a few key parameters may be chosen using prior information.
Automatic priors should spread out mass all over the parameter space — big support.
Computational feasibility and good asymptotic behavior of the posterior distribution/Bayes estimates should be kept in mind.
Random basis expansion
f(x) = Σ_{j=1}^∞ θj Bj(x), where {Bj : j ∈ ℕ} is a suitable basis — polynomials, trigonometric functions, splines, wavelets, etc.
The coefficients θj are given some suitable prior, often independent normal (with quickly decreasing variances for convergence).
A (more pragmatic) variation is to consider the finite expansion f(x) = Σ_{j=1}^K θj Bj(x), where K is chosen (usually depending on n) to keep the approximation error of the expansion under control, or by some model selection technique, or rather by imposing a suitable prior on K (infinitely supported, with an appropriate tail). The latter is considered more sensible, but involves more challenging computation because of the changing dimension — typically reversible jump Markov chain Monte Carlo techniques.
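To make the finite-expansion idea concrete, here is a minimal sketch of one draw from such a prior (the shifted-Poisson prior on K, the 1/j decay of the coefficient standard deviations and the cosine basis are illustrative assumptions, not prescriptions from the course):

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_random_series(x, tau=1.0, mean_K=10):
    """Draw f(x) = sum_{j=1}^K theta_j B_j(x) with a random number of terms.

    Illustrative choices: K ~ 1 + Poisson(mean_K), theta_j ~ N(0, tau^2/j^2)
    so higher frequencies are damped, and B_j a cosine basis on [0, 1].
    """
    K = 1 + rng.poisson(mean_K)                           # prior on the truncation level
    theta = rng.normal(0.0, tau / np.arange(1, K + 1))    # decaying coefficient scales
    B = np.cos(np.pi * np.outer(x, np.arange(1, K + 1)))  # basis matrix B_j(x_i)
    return B @ theta

x = np.linspace(0, 1, 200)
f = draw_random_series(x)   # one sample path from the prior
```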
Use of link functions
Sometimes functions have natural constraints on their assumed values.
The intensity of a Poisson process, or the spectral density of a time series, takes only positive values. Then a basis expansion for log f is more natural; the advantage is that there is no restriction on the coefficients. Thus f(x) = exp[Σ_{j=1}^∞ θj Bj(x)].
In binary regression, the response probability p(x) takes values in the unit interval. Then a logit, probit or some other link can convert a basis expansion into a function taking values in [0, 1]: p(x) = H(Σ_{j=1}^∞ θj Bj(x)). The function H can (and should) be chosen increasing and continuous, i.e., a continuous c.d.f. There need not be any further prior put on H, since the free basis expansion already gives large support.
Probability density functions, on the other hand, have two constraints: f(x) ≥ 0 and ∫ f(x) dx = 1. One way to ensure these constraints is to exponentiate and renormalize —
f(x) = exp[Σ_{j=1}^∞ θj Bj(x)] / ∫ exp[Σ_{j=1}^∞ θj Bj(u)] du.
The renormalization is meaningful on compact domains with continuous basis functions. This is often called an infinite dimensional exponential family. With finitely many terms and a spline basis, it is often called a log-spline prior.
Gaussian processes
A stochastic process Wt is called a Gaussian process if all finite dimensional distributions are multivariate normal. The distribution is completely characterized by the mean function µ(t) = E(Wt) and the covariance kernel K(s, t) = cov(Ws, Wt).
Common examples of Gaussian processes include Brownian motion (covariance kernel min(s, t)), integrated Brownian motion, and the squared exponential process (covariance kernel e^{−c‖s−t‖²}).
Gaussian process sample paths are flexible enough to model an arbitrary function. With the help of a link function, restrictions can be addressed too.
Smoothness of sample paths is related to smoothness of the covariance kernel. For instance, Brownian motion has smoothness of order < 1/2, while the squared exponential process has infinite order smoothness. The k-fold integrated Brownian motion has smoothness of order < k + 1/2. A Gaussian process can be chosen to match prior knowledge of the smoothness of the function to be modeled.
A (finite or infinite) basis expansion with normal coefficients gives a Gaussian process. Conversely, by the Karhunen–Loève expansion, every Gaussian process has such a basis expansion.
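A minimal simulation sketch contrasting a rough and a smooth Gaussian process through their covariance kernels (the grid, the constant c = 25 and the jitter term are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(1e-3, 1, 300)

# Covariance matrices for two classic Gaussian processes on [0, 1]:
K_bm = np.minimum.outer(t, t)                       # Brownian motion: min(s, t)
K_se = np.exp(-25.0 * np.subtract.outer(t, t)**2)   # squared exponential, c = 25

def sample_gp(K, jitter=1e-6):
    """One mean-zero sample path via the Cholesky factor of the covariance."""
    L = np.linalg.cholesky(K + jitter * np.eye(len(K)))
    return L @ rng.standard_normal(len(K))

rough_path = sample_gp(K_bm)    # nowhere differentiable
smooth_path = sample_gp(K_se)   # infinitely smooth
```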
Completely random measures
A completely random measure (CRM) [Kingman, 1975] on X is a measure-valued process Φ such that for any finite collection of disjoint sets A1, ..., Ak, the random variables Φ(A1), ..., Φ(Ak) are mutually independent.
Because of additivity of measures and independence of increments, the marginal distribution of Φ(A) must be infinitely divisible. A CRM has the Lévy–Khinchine representation E(e^{−tΦ(A)}) = exp[−β(A)t − ∫ (1 − e^{−tz}) ν(A, dz)].
A random measure with given mean measure µ(A) = EΦ(A) can be constructed from the points of a Poisson process N on the product space X × [0, ∞) by the relation Φ(A) = ∫_0^∞ ∫_A z N(dx, dz) (assume β = 0 w.l.o.g.). The representation shows that Φ is a.s. discrete.
A CRM can be used as a prior on a space of measures. Finiteness/infiniteness can be controlled through the mean measure. A gamma random measure is particularly important in Bayesian nonparametrics. Here the distribution of Φ(A) is gamma with scale parameter 1 and shape parameter α(A), with α being a measure.
Lévy processes
Essentially, a (subordinator) Lévy process is a process indexed by the half-line with increasing, right-continuous sample paths. (This differs from the definition used in probability theory, where stationarity is imposed as a requirement.) It can have either finite or infinite total mass; every finite interval has finite random mass. An infinite mass Lévy process can model a prior for a cumulative hazard function.
Clearly Lévy processes are pure jump processes, having countably many jump points. Most of these jumps are very small; in fact, the number of jumps of size bigger than any ε > 0 is a.s. finite.
Random discrete distributions
P = Σ_{j=1}^N Wj δ_{θj}, 1 ≤ N ≤ ∞, possibly random if not identically infinity; (W1, ..., WN) has some distribution on the unit N-simplex, (θ1, ..., θN) have some joint distribution in X^N, both of which can possibly depend on the assumed value of N.
Even though the realized measures are always discrete, there would be an enormous variety if the distributions of the W's and θ's are flexible. Since any probability measure can be weakly approximated by finite/discrete probability measures, it is easy to maintain large weak support of the prior.
It is natural to consider the θj's i.i.d. with a non-singular distribution. If N is a.s. finite, then a Dirichlet distribution (which includes the uniform) on (W1, ..., WN) may be considered as an automatic choice. [Actually, there is also a countable dimensional Dirichlet.]
Stick breaking
Consider a random discrete distribution with countably many points. What prior can be put on (W1, W2, ...)?
The stick-breaking technique proceeds as follows. The total mass 1 is to be distributed sequentially to the support points, which have already been arranged in a sequence θ1, θ2, .... Consider random variables V1, V2, ... taking values in [0, 1] with some joint distribution (typically independent, or even i.i.d. with a fixed distribution like a beta). Then induce a distribution on (W1, W2, ...) through the relations Wj = [∏_{k=1}^{j−1}(1 − Vk)] Vj.
The interpretation is that in the beginning there was a stick of length 1. It is broken in the proportion V1 : 1 − V1, and the initial piece is assigned to θ1. Next, the leftover is broken in the proportion V2 : 1 − V2, the initial piece is assigned to θ2, the leftover is broken again in the proportion V3 : 1 − V3, and it continues like this.
Clearly, the leftover mass after j steps is ∏_{k=1}^j (1 − Vk). The total mass will be exhausted in countably many steps if ∏_{k=1}^j (1 − Vk) → 0 a.s., which, under independence, holds if Σ_{k=1}^∞ log E(1 − Vk) = −∞. This always holds if the Vj's are i.i.d. with P(V > 0) > 0.
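A quick numerical sketch of the stick-breaking construction (the i.i.d. Beta(1, 5) distribution for the Vk is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(2)

def stick_breaking_weights(n_terms, a=1.0, b=5.0):
    """First n_terms stick-breaking weights with V_k iid Beta(a, b).

    (Beta(1, M) recovers the Dirichlet-process weights discussed later.)
    """
    V = rng.beta(a, b, size=n_terms)
    leftover = np.concatenate(([1.0], np.cumprod(1 - V)[:-1]))  # prod_{k<j}(1 - V_k)
    return leftover * V

W = stick_breaking_weights(500)
print(W.sum())   # close to 1: the stick is exhausted since E log(1 - V) < 0
```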
Normalized Lévy processes
One way to ensure that a random measure is a probability measure is to normalize, provided the random measure is a.s. finite.
The most familiar example is to take a gamma process with finite mean measure α. The normalized measure is called a Dirichlet process, to be studied in detail later. All finite dimensional distributions are Dirichlet by the connection between independent gammas and the Dirichlet.
Another example which stands out as tractable is the normalized inverse-Gaussian process. Like the gamma, the inverse Gaussian is also closed under convolution, and hence is infinitely divisible. The normalized process has fairly tractable joint distributions and has many similarities with the Dirichlet process.
Partitioning method
A natural method of distributing total mass 1 to various subsets is to use a recursive partitioning scheme.
Start with a sequence of refining partitions, for simplicity binary partitions. This means each set at a given stage is split in two in the next stage, the pieces to be called its offspring. Distribute the mass of a set at the current stage to its offspring according to a random proportion. The probability of a set at any stage of the partition is then given by a finite product of these proportions.
The random proportions are given some joint distribution. If the union of all refining partitions generates the Borel σ-field on the sample space, then a random probability measure P is automatically obtained. A slight control on the proportions ensures countable additivity of P. Usually P has no fixed atoms, but can have random atoms.
[Figure: binary partition tree — the root R splits into B0 and B1 with proportions V0 and V1, and each B_ε splits into B_{ε0} and B_{ε1} with proportions V_{ε0} and V_{ε1}.]
Tail-free processes
It is natural to impose some independence on the random proportions used in a partitioning method.
Let {V0}, {V00, V10}, {V000, V010, V100, V110}, ... be mutually independent collections. Note that we have assumed independence across levels, but not necessarily within levels. Such a random probability is called tail-free with respect to the given sequence of partitions [Freedman, 1963, Ann. Math. Statist.].
Interestingly, P(B_{ε1···εk}) is a product of independent random variables. This allows calculation of moments of P(B_{ε1···εk}) and log P(B_{ε1···εk}) in terms of those of the V_ε's.
Conjugacy
There is a structural conjugacy: the posterior distribution of P given X1, ..., Xn | P ∼ P i.i.d. is also tail-free with respect to the same sequence of partitions.
The proof of the last fact is intriguing. One first looks at the counts of the sets at a given level of the partition and the corresponding cell probabilities. The posterior distribution given the counts at this level can be obtained. On the other hand, a similar calculation can be done at any finer level of the partition, and can be marginalized to the previous level. Independence in the likelihood and the tail-free prior imply that these two posterior distributions are the same. Making the second stage of the partition finer and finer, the posterior given the observations is obtained. The independence across levels is built in.
The name tail-free comes from the fact that the posterior distributions of cell probabilities at any finite stage are unaffected by the counts in the finer stages, i.e., the tail.
A practical method of choosing a sequence of partitions is to follow the binary quantiles of a target probability measure λ. If E(V) = 1/2 at every stage, then E(P) = λ is ensured. Such a partition is called a canonical partition.
Support
Support, loosely speaking, is the subset where the prior assigns mass. Technically, it is the smallest closed set with prior probability 1. A tail-free process in R has large weak support, provided the endpoints of the partitioning sets become dense and all proportions have non-singular joint distributions. By denseness, closeness to a distribution P0 in the weak topology can be reduced to closeness of finite stage cell probabilities, which can be achieved with positive probability by non-singularity. Thus any P0 is in the weak support of the tail-free process.
If P0 is continuous, then P0 is also in the Kolmogorov-Smirnov support of the tail-free process by Polya’s theorem — a general fact irrespective of the prior.
Absolute continuity
Theorem (Kraft, 1964, J. Appl. Probab.)
Let Π be a tail-free prior with respect to the sequence of partitions {T_m} and let λ be a probability measure such that λ(B) > 0 for all B ∈ ∪_{m=1}^∞ T_m. If
sup_{m∈ℕ} max_{ε∈E^m} [∏_{j=1}^m E(V²_{ε1···εj})] / λ²(B_{ε1···εm}) < ∞,
then P is absolutely continuous with respect to λ a.s.
In particular, if the canonical partition is chosen so that λ(B_{ε1···εm}) = 2^{−m}, |E(V_{ε1···εm}) − 1/2| ≤ δ_m and var(V_{ε1···εm}) ≤ γ_m for all ε ∈ E^m and m ∈ ℕ, where Σ_{m=1}^∞ δ_m < ∞ and Σ_{m=1}^∞ γ_m < ∞, then the conclusion holds.
Zero-one laws
A tail-free process has interesting dichotomies. If the V's at all levels are strictly in (0, 1), then with respect to any measure λ with all cell probabilities positive, the sample realizations of the tail-free process are absolutely continuous with respect to λ with probability either zero or one.
This is actually a simple consequence of Kolmogorov's zero-one law, in view of the independence in the tail-free structure.
Absolute continuity is not special in this respect: mutual singularity also holds with probability either zero or one. Discreteness and continuity (i.e., non-atomicity) each also hold with probability either zero or one.
Polya tree process
The tail-free property is an abstract concept. For actual Bayesian analysis, we need to make some definite choices for the partitions and the distributions of the V_ε's.
Call V_{ε0} = Y_ε, and so V_{ε1} = 1 − Y_ε. Make Y_ε ∼ Be(α_{ε0}, α_{ε1}) independently. Note that all these allocation proportions are independent, across or within levels.
Expressions for the mean and moments are more concrete. In particular, E[P(B_{ε1···εk})] = ∏_{j=1}^k α_{ε1···εj} / (α_{ε1···ε_{j−1}0} + α_{ε1···ε_{j−1}1}).
Using a canonical partition from a center measure λ, and letting all α_ε = a_m depend only on the length m = |ε| of the string, Kraft's sufficient condition for the existence of a density can be simplified to Σ_{m=1}^∞ a_m^{−1} < ∞.
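As an illustration, the following sketch simulates a random density from such a canonical Polya tree (the center measure λ = Uniform[0, 1], the truncation depth, and the choice a_m = m², which satisfies Σ a_m^{−1} < ∞, are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

def polya_tree_density(depth=10, a=lambda m: m**2):
    """Draw a random density from a canonical Polya tree on [0, 1].

    Center measure lambda = Uniform[0, 1] (dyadic partition), alpha_eps =
    a(|eps|).  Returns the density, piecewise constant on the 2^depth
    dyadic cells at the final level.
    """
    cell_probs = np.array([1.0])
    for m in range(1, depth + 1):
        Y = rng.beta(a(m), a(m), size=cell_probs.size)  # proportion to the left child
        # each parent's mass splits as (Y, 1 - Y) between its two children
        cell_probs = np.column_stack((cell_probs * Y,
                                      cell_probs * (1 - Y))).ravel()
    return cell_probs * 2**depth   # divide by the cell width 2^{-depth}

f = polya_tree_density()
print(f.mean())   # equals 1: the density integrates to one over equal cells
```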
Conjugacy
If P follows a Polya tree prior, then given i.i.d. observations from P, the posterior for P is again a Polya tree with respect to the same partition, with α_ε ↦ α_ε + Σ_{i=1}^n 1{Xi ∈ B_ε}. In particular, assuming a canonical Polya tree with parameters {a_m} admitting a density, this leads to a simple expression for the posterior mean of the density
E[f(x) | X1, ..., Xn] = λ(x) ∏_{j=1}^∞ (2a_j + 2N(B_{ε1···εj})) / (2a_j + N(B_{ε1···ε_{j−1}0}) + N(B_{ε1···ε_{j−1}1})),
where ε1 ε2 ··· identifies the sequence of cells containing x and N(B) = #{i : Xi ∈ B}.
The infinite product actually terminates (the factors equal 1 once a cell contains no observations), so numerical computation is simple, giving a simple Bayesian density estimate. The posterior variance can also be computed fairly easily.
The Dirichlet process
Consider the problem of estimating the distribution P from Xi | P ∼ P i.i.d. How can we construct a conjugate prior for P?
Had the data been grouped using a partition A1,..., Ak , then the likelihood would be multinomial — the closest finite dimensional relative of the general nonparametric model.
The corresponding conjugate prior for (P(A1), ..., P(Ak)) would be a Dirichlet distribution. But the way grouping can be done is completely arbitrary, so the Dirichlet distribution would be needed for every choice of partition. This motivates the following:
The Dirichlet process
Definition (Ferguson, 1973) A random probability measure P is said to follow a Dirichlet process prior Dα with base measure α if for every finite measurable partition {A1,..., Ak } of X,
(P(A1),..., P(Ak )) ∼ Dir(k; α(A1), . . . , α(Ak )),
where α is a finite positive Borel measure.
Why a measure? It allows the unambiguous specification essential for existence: if some sets in the partition are merged together, the resulting probabilities follow a lower dimensional Dirichlet with the group sums as parameters.
Moments
Marginal: (P(A), P(A^c)) ∼ Dir(2; α(A), α(A^c)), that is, P(A) ∼ Be(α(A), α(A^c)).
Mean: E(P(A)) = α(A)/α(X) =: ᾱ(A). Implication: if X | P ∼ P and P ∼ D_α, then marginally X ∼ ᾱ.
Variance: var(P(A)) = ᾱ(A) ᾱ(A^c)/(1 + M), M = |α| := α(X).
Comment: ᾱ is called the center measure and M is called the precision parameter. We often denote D_α by DP(M, G), where G is the c.d.f. of ᾱ.
More generally, E(∫ ψ dP) = ∫ ψ dᾱ.
Transformation
Theorem
P ∼ D_α on X and f : X → Y implies P ∘ f^{−1} ∼ D_{α∘f^{−1}}.
The proof is obvious from the definition of induced measures. The property, in particular, is useful in carrying a Dirichlet process from one sample space to another, and will be useful when we discuss the construction of a Dirichlet process.
Posterior distribution
Theorem
P | X1, ..., Xn ∼ D_{α*}, where α* = α + Σ_{i=1}^n δ_{Xi}, gives a version of the posterior distribution.
Sketch of the proof: Can just do n = 1 and iterate. Reduce to a finite measurable partition {A1, ..., Ak }. To show its posterior is Dir(k; α(A1), . . . , α(Ai−1), α(Ai ) + 1, α(Ai+1), . . . , α(Ak )), when X ∈ Ai . Clearly true if only that much were known. Refine partition and corresponding information. The posterior does not change. Apply the martingale convergence theorem to pass to the limit.
Posterior moments and convergence
E(P(A) | X1, ..., Xn) = (M/(M+n)) ᾱ(A) + (n/(M+n)) Pn(A) — a convex combination with relative weights M and n.
M can be interpreted as the prior sample size; M → 0 means a “noninformative limit”.
Asymptotically equivalent to the sample mean up to O(n^{−1}); converges to the true P0 at the n^{−1/2} rate.
var(P(A) | X1, ..., Xn) ≤ 1/(4n) → 0.
Π(P : |P(A) − P0(A)| ≥ M_n n^{−1/2} | X1, ..., Xn) is, by Chebyshev, bounded by M_n^{−2} n E((P(A) − P0(A))² | X1, ..., Xn) → 0 a.s. under P0, for any M_n → ∞.
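A minimal sketch of the posterior mean formula above (the uniform center measure, M = 2 and the simulated data are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)

def dp_posterior_mean_cdf(x, data, M=2.0, G=lambda t: np.clip(t, 0, 1)):
    """Posterior mean c.d.f. under a DP(M, G) prior given i.i.d. data.

    Illustrative choices: G = Uniform[0, 1] c.d.f., M = 2.  The posterior
    mean is the convex combination (M G + n F_n) / (M + n).
    """
    n = len(data)
    Fn = (data[None, :] <= x[:, None]).mean(axis=1)   # empirical c.d.f. at x
    return (M * G(x) + n * Fn) / (M + n)

data = rng.beta(2, 5, size=50)
x = np.linspace(0, 1, 5)
print(dp_posterior_mean_cdf(x, data))
```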
Discreteness
Theorem
Dα(D) := Dα(P : P is discrete) = 1.
Sketch of the proof: P ∈ D iff P{x : P({x}) > 0} = 1. D is a measurable subset of M. Assertion holds iff (Dα × P){(P, X ): P{X } > 0} = 1.
Equivalent to (ᾱ × D_{α+δ_X}){(X, P) : P{X} > 0} = 1. True by Fubini, since P has positive mass at X under the “posterior” D_{α+δ_X}.
Note: null sets do matter.
Self-similarity
Theorem
P(A), P|_A and P|_{A^c} are mutually independent, and P|_A ∼ D_{α|_A}.
Can be shown using the relations between finite dimensional Dirichlet and independent gamma variables. If the Dirichlet process is localized to A and Ac , then both are Dirichlet processes and are independent of each other, and also independent of the “macro level” variable P(A).
Weak support
Theorem
supp(Dα) = {P : supp(P) ⊂ supp(α)}.
Sketch of the proof: no P which supports points outside supp(α) can be in the support, since the corresponding first beta parameter would be zero. Any compliant P is in the weak support by fine partitioning and non-singularity of the corresponding finite dimensional Dirichlet distribution.
Further, if P is in the weak support and is continuous, then the assertion automatically upgrades to Kolmogorov-Smirnov support by Polya's theorem.
Convergence
Theorem
Let α_m be such that ᾱ_m →_w ᾱ. Then
(i) if |α_m| → M, 0 < M < ∞, then D_{α_m} →_w D_{Mᾱ};
(ii) if |α_m| → 0, then D_{α_m} →_w D*_α := L(δ_X : X ∼ ᾱ);
(iii) if |α_m| → ∞, then D_{α_m} →_w δ_ᾱ.
Sketch of the proof: a random measure is tight iff its expectation measure is tight, so tightness holds here. It remains to check finite dimensional convergence; work with a finite partition. For (i), Dirichlet goes to Dirichlet by Scheffé. For (ii) and (iii), check convergence of all mixed moments.
Convergence
Corollary
If Xi ∼ P0 i.i.d., then D_{α+Σ_{i=1}^n δ_{Xi}} converges weakly to
(i) δ_{P0} if n → ∞ (consistency of the posterior);
(ii) D_{Σ_{i=1}^n δ_{Xi}} (the Bayesian bootstrap) if M → 0.
By (i), the posterior distribution of P as n → ∞ concentrates near the true distribution P0 in the weak topology (and in the Kolmogorov-Smirnov distance) irrespective of the base measure α. In particular, it is not required that the prior support the true distribution. The property is driven by the behavior of the empirical distribution and the structure of the Dirichlet process, rather than by likelihood ratios, which are non-existent in the present case.
(ii) can be regarded as the posterior based on a “noninformative Dirichlet prior”, and is a sensible alternative to the bootstrap method, to be discussed later.
Convergence
If ᾱ_m also diffuses in the limit, like N(0, σ²) with σ → ∞ along with M → ∞, then some other non-trivial limits may be obtained, provided the growth of M is appropriately linked with the growth of σ. The resulting limit has sometimes been called the limdir process [Bush, Lee and MacEachern (2011), J. Roy. Statist. Soc.]. The process has been suggested as a noninformative choice in nonparametric mixture modeling, like the Dirichlet process mixture, to be discussed later.
Dirichlet-multinomial process
Definition
The distribution Π_N of P = Σ_{k=1}^N Wk δ_{θk}, where θ1, ..., θN ∼ G i.i.d. and (W1, ..., WN) ∼ Dir(N; α1, ..., αN), is called the Dirichlet-multinomial process of order N with parameters G and (α1, ..., αN).
If Y1, ..., Yn ∼ P i.i.d. and P ∼ Π_N, then Yi = θ_{Ki}, where (K1, ..., Kn) ∼ MN(n, N; W1, ..., WN) and (W1, ..., WN) ∼ Dir(N; α1, ..., αN).
For a given (θ1, ..., θN), P ∼ D_{Σ_{k=1}^N αk δ_{θk}}.
For any ψ ∈ L1(G), E(∫ ψ dP) = ∫ ψ dG irrespective of the values of α1, ..., αN.
Dirichlet-multinomial process
Theorem (Ishwaran and Zarepour, 2002, Statistica Sinica)
Let P_N = Σ_{k=1}^N W_{k,N} δ_{θk}, where θ1, θ2, ... ∼ G i.i.d. and independently (W_{1,N}, ..., W_{N,N}) ∼ Dir(N; α_{1,N}, ..., α_{N,N}); P ∼ DP(M, G).
1(a) If Σ_{k=1}^N α_{k,N} → ∞ and max_{1≤k≤N} α_{k,N}/(Σ_{k=1}^N α_{k,N}) → 0, then ∫ ψ dP_N →_p ∫ ψ dG, ψ ∈ L2(G). In particular, this holds for α_{k,N} = λ_N with Nλ_N → ∞.
1(b) If α_{k,N} = λ_k, Σ_{k=1}^∞ λ_k²/k² < ∞ and N^{−1} Σ_{k=1}^N λ_k → λ > 0, then ∫ ψ dP_N → ∫ ψ dG a.s., ψ ∈ L2(G).
2(a) If Σ_{k=1}^N α_{k,N} → M and max_{1≤k≤N} α_{k,N} → 0, then ∫ ψ dP_N →_d ∫ ψ dP, ψ bounded continuous.
2(b) If α_{k,N} = M/N, then ∫ ψ dP_N →_d ∫ ψ dP, ψ ∈ L1(G).
3 If Σ_{k=1}^N α_{k,N} → 0 and max_{1≤k≤N} α_{k,N}/(Σ_{k=1}^N α_{k,N}) → 0, then for any bounded continuous ψ, ∫ ψ dP_N →_d ψ(θ), where θ ∼ G. In particular, this holds if α_{k,N} = λ_N with Nλ_N → 0.
Stick-breaking representation
Theorem (Sethuraman, 1994, Statistica Sinica)
Let θ1, θ2, ... ∼ ᾱ i.i.d. and V1, V2, ... ∼ Be(1, M) i.i.d., all mutually independent, Wi = Vi ∏_{j=1}^{i−1}(1 − Vj). Then
P = Σ_{i=1}^∞ Wi δ_{θi} ∼ D_α.
Distributional equation:
Wi = (1 − V1)[∏_{j=2}^{i−1}(1 − Vj)] Vi = (1 − V1) W′_{i−1},
P = V δ_θ + (1 − V) Σ_{i=1}^∞ W′_i δ_{θ′_i} =_d V δ_θ + (1 − V) P.
Can do “functional MCMC”.
Stick-breaking representation
Intuition: since the prior is the posterior averaged w.r.t. the marginal distribution of the observation, D_α is also ∫ D_{α+δ_x} dᾱ(x). Conditionally on an observed x, P{x} ∼ Be(1, M) and P|_{{x}^c} ∼ D_α (assuming non-atomicity of α), independently, by the self-similarity property.
Steps in the formal proof: the DP is a solution of the distributional equation, and the solution is unique.
ε-Dirichlet process
Definition
Let P = Σ_{i=1}^∞ Wi δ_{θi} ∼ DP(M, G). For a given ε > 0, define N_ε = inf{m ≥ 1 : Σ_{i=1}^m Wi > 1 − ε} and P_ε = Σ_{i=1}^{N_ε} Wi δ_{θi} + W̄ δ_{θ0}, where W̄ = 1 − Σ_{i=1}^{N_ε} Wi = ∏_{i=1}^{N_ε}(1 − Vi) and θ0 ∼ G independent of everything else. The distribution of P_ε is called an ε-Dirichlet process D_{α,ε}.
Theorem
d_TV(P, P_ε) ≤ ε a.s.
d_L(D_α, D_{α,ε}) ≤ ε, d_L the Lévy distance.
N_ε − 1 ∼ Poi(M log(1/ε)). Consequently, E(N_ε) = 1 + M log(1/ε) and var(N_ε) = M log(1/ε).
∫ ψ dP_ε → ∫ ψ dP a.s. as ε → 0.
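The definition suggests a direct sampler; here is a minimal sketch (the standard normal base measure is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(5)

def sample_eps_dp(M=5.0, eps=0.01, base_sampler=rng.standard_normal):
    """Draw an epsilon-Dirichlet process approximation of DP(M, G).

    A sketch assuming G = N(0, 1).  Stick-breaking runs until the leftover
    mass drops below eps; the remainder goes on one extra atom theta_0 ~ G,
    so the total variation error is at most eps.
    """
    weights, leftover = [], 1.0
    while leftover > eps:
        V = rng.beta(1.0, M)           # Sethuraman: V_i ~ Be(1, M)
        weights.append(leftover * V)
        leftover *= 1.0 - V
    weights.append(leftover)            # remainder atom W-bar
    atoms = base_sampler(len(weights))  # i.i.d. draws from the base measure G
    return np.array(weights), atoms

W, theta = sample_eps_dp()
print(len(W), W.sum())   # about 1 + M log(1/eps) atoms; weights sum to 1
```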
Generalized Polya urn scheme
Suppose that Xi | P ∼ P i.i.d. and P ∼ D_α, often called a sample from a Dirichlet process. Then
X1 ∼ ᾱ;
X2 | P, X1 ∼ P and P | X1 ∼ D_{α+δ_{X1}}, so X2 | X1 ∼ (M/(M+1)) ᾱ + (1/(M+1)) δ_{X1}, that is,
X2 | X1 ∼ δ_{X1} w.p. 1/(M+1), and ∼ ᾱ w.p. M/(M+1).
Generalized Polya urn scheme
More generally, Xi | P, X1, ..., X_{i−1} ∼ P and P | X1, ..., X_{i−1} ∼ D_{α+Σ_{j=1}^{i−1} δ_{Xj}}, so Xi | X1, ..., X_{i−1} ∼ (M/(M+i−1)) ᾱ + Σ_{j=1}^{i−1} (1/(M+i−1)) δ_{Xj}, that is,
Xi | X1, ..., X_{i−1} ∼ δ_{Xj} w.p. 1/(M+i−1), j = 1, ..., i−1, and ∼ ᾱ w.p. M/(M+i−1).
By exchangeability of (X1,..., Xn),
Xi | Xj, j ≠ i ∼ δ_{Xj} w.p. 1/(M+n−1) for each j ≠ i, and ∼ ᾱ w.p. M/(M+n−1).
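The urn scheme gives a direct way to simulate (X1, ..., Xn) without ever constructing P; a minimal sketch, assuming G = N(0, 1) for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)

def polya_urn_sample(n, M=2.0, base_sampler=rng.standard_normal):
    """Draw (X_1, ..., X_n) from a DP(M, G) via the generalized Polya urn.

    Each new X_i copies a previously drawn value with probability
    proportional to 1, or is a fresh draw from G with probability
    proportional to M.
    """
    X = [base_sampler()]
    for i in range(1, n):
        if rng.random() < M / (M + i):    # fresh draw w.p. M/(M+i)
            X.append(base_sampler())
        else:                             # copy a uniformly chosen old value
            X.append(X[rng.integers(i)])
    return np.array(X)

X = polya_urn_sample(1000)
print(len(np.unique(X)))   # K_n, roughly M log(1 + n/M) distinct values
```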
Joint distribution
The joint distribution of (X1, ..., Xn) is not absolutely continuous with respect to the Lebesgue measure. A realization of (X1, ..., Xn) partitions {1, 2, ..., n} by identifying components that have identical values. Corresponding to each partition, consider a lower dimensional Lebesgue measure on the “diagonals” obtained by setting the values of the components within the same partitioning set identical. For instance, if n = 3 and {{1, 2}, {3}} is a partition, then the corresponding diagonal is D_{12;3} = {(x1, x2, x3) ∈ ℝ³ : x1 = x2}, with the corresponding lower dimensional Lebesgue measure λ_{12;3}.
Then the joint distribution is absolutely continuous with respect to the sum µ of such mutually orthogonal measures. The joint distribution has several components, each of which is absolutely continuous with respect to exactly one component of µ. The density with respect to µ is the sum of the densities of each component of the joint distribution with respect to the corresponding component of µ.
Clustering
A lot of ties are produced by Dirichlet samples; hence they induce random partitions of {1, ..., n}, creating clusters.
Expected number of distinct elements Kn grows like E(Kn) ∼ M log(n/M), which arises from the partial sum of a harmonic series.
var(Kn) also grows at the same rate.
K_n/log n → M, using Kolmogorov's strong law.
(K_n − M log(1 + n/M)) / √(M log(1 + n/M)) →_d N(0, 1).
A Poisson approximation holds with parameter M log(1 + n/M).
Exact distribution: P(K_n = k | M, n) = C_n(k) n! M^k Γ(M)/Γ(M + n).
Mutual singularity
Theorem (Korwar and Hollander, 1973, Ann. Probab.)
If α1 ≠ α2 are two non-atomic measures on X, then D_{α1} and D_{α2} are mutually singular.
More generally, if αi is decomposed into continuous part α_{i,c} and atomic part α_{i,a}, i = 1, 2, and α_{1,c} ≠ α_{2,c}, then D_{α1} and D_{α2} are mutually singular.
If α_{1,c} = α_{2,c} but supp(α_{1,a}) ≠ supp(α_{2,a}), then D_{α1} and D_{α2} are mutually singular.
If α_{1,c} = α_{2,c} and supp(α_{1,a}) = supp(α_{2,a}), then D_{α1} and D_{α2} need not be mutually singular.
Tails
Theorem (Doss and Sellke, 1982, Ann. Statist.)
Let F be the c.d.f. of a random P following DP(M, G). Let h be a real-valued function on [0, 1] which is strictly increasing and convex on (0, ε) for some ε > 0. Then a.s.
lim_{x→−∞} F(x)/h(MG(x)) = 0 if ∫_0 log⁻ h(t) dt < ∞, and = ∞ if ∫_0 log⁻ h(t) dt = ∞;
lim_{x→∞} F̄(x)/h(MḠ(x)) = 0 if ∫_0 log⁻ h(t) dt < ∞, and = ∞ if ∫_0 log⁻ h(t) dt = ∞.
Tails
The proof is almost an immediate corollary of a similar tail representation for gamma processes.
The most interesting conclusion is that the tails of the random F are much thinner than those of the center measure. For instance, if the center measure is normal, F has exponential-of-exponential tails. If the center measure is Cauchy, the random F has a finite moment generating function.
In particular, ∫ ψ dF can be a.s. finite even if ∫ ψ dG = ∞. Using c.f. based techniques, it can be shown that ∫ |ψ| dF < ∞ a.s. iff ∫ log(1 + |ψ|) dG < ∞.
Distribution of median
Theorem
Let F be the c.d.f. of a random P following DP(M, G). Let m_F stand for any choice of median of F and H the distribution of m_F. Then
the c.d.f. H of m_F is given by
H(x) = [Γ(M) / (Γ(MG(x)) Γ(MḠ(x)))] ∫_{1/2}^1 u^{MG(x)−1} (1 − u)^{MḠ(x)−1} du;
H is continuous if G is so; any median of H is a median of G and conversely.
The distribution above, called the median-Dirichlet distribution, can be evaluated numerically. Proof is related to monotone likelihood ratio property of the beta family.
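The expression for H can be evaluated with a beta survival function, since H(x) = P(F(x) ≥ 1/2) and F(x) ∼ Be(MG(x), MḠ(x)); a minimal numerical sketch, assuming a standard normal center measure:

```python
import numpy as np
from scipy.stats import beta, norm

def median_cdf(x, M=5.0, G=norm.cdf):
    """c.d.f. H of the median of F ~ DP(M, G); sketch with G = N(0, 1).

    H(x) = P(F(x) >= 1/2) = P(Beta(M G(x), M (1 - G(x))) >= 1/2).
    """
    g = G(x)
    return beta.sf(0.5, M * g, M * (1 - g))

x = np.linspace(-2, 2, 5)
print(median_cdf(x))
```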
Distribution of mean
Theorem
The distribution of L = ∫ x dF(x), F ∼ DP(M, G), is given by
P(L ≤ s) = 1/2 + (1/π) ∫_0^∞ t^{−1} exp[−(M/2) ∫_{−∞}^∞ log(1 + t²(s − x)²) dG(x)] sin[M ∫_ℝ tan^{−1}(t(s − x)) dG(x)] dt.
Uses an identity [Cifarelli and Regazzini, 1990, Ann. Statist.]
E[exp(it ∫ ψ dS)] = exp[−∫_ℝ log(1 − itψ(x)) dα(x)]
for S following a gamma process with mean measure α — use the c.f. and independent increments of S.
One interesting case: the distribution of the mean is Cauchy iff the base measure is Cauchy — the only such case.
Characterization
The Dirichlet process can be roughly characterized as the only process satisfying any of the following properties:
Tail-free with respect to any sequence of refining partitions;
Neutral, that is P(B1) ⊥ P(B2)/P(B1) ⊥ P(B3)/P(B2) ⊥ · · · for all B1 ⊃ B2 ⊃ · · · ; Posterior distribution of P(A) depends on Pn(A) only for all A; Polya tree process with αε0 + αε1 = αε for every finite string ε.
Construction of a Dirichlet process
Finite dimensional distributions are self-consistent, so there is a joint distribution on [0, 1]^∞. But measure theoretic difficulties can arise: for instance, the space of measures is not a measurable subset of the product space. The difficulties can be overcome by using a countable generator and the countable additivity of the mean measure.
Existence of the gamma process also gives the Dirichlet process upon normalization.
The stick-breaking representation also gives a direct construction.
Invariant Dirichlet process
Sometimes we need to put a shape restriction on the random P; for instance, if we need to model an error distribution, symmetry is often imposed.
One can consider a symmetrized probability P̃(A) = ½(P(A) + P(−A)), where P follows a Dirichlet process [Dalal, 1982, Stoch. Process. Appl.]. If the center measure G is symmetric, then the posterior mean of the symmetrized Dirichlet is obtained as
E(F(x) | X1, ..., Xn) = [MG(x) + ½ Σ_{i=1}^n {1(Xi ≤ x) + 1(Xi ≥ −x)}] / (M + n).
More generally, invariance under the action of a compact group of transformations can be considered.
The invariant Dirichlet process is not tail-free.
Constrained Dirichlet process
Sometimes a restriction is imposed on quantile(s), such as the median set to zero. Applications include quantile regression, regression with asymmetric errors, etc.
More generally, one can express this as finitely many restrictions P(Bj) = vj, where the Bj's form a finite partition, called a control set.
By the self-similarity of a Dirichlet process, such a prior can be expressed as a finite mixture of independent Dirichlet processes with orthogonal supports.
A restriction on moments is also sensible, but is less tractable.
Mixture of Dirichlet process
It is typically hard to specify the base measure exactly, but letting it be a member of a (parametric) family is more easily comprehensible. For instance, it may be hard to say that the center measure should be N(0, 1), but it is sensible to choose the center measure as N(µ, σ²) with µ and σ undetermined.
In other words, the center measure G and/or the precision M may involve an additional parameter θ. Then a further prior π is imposed on θ, often a very diffuse prior. The resulting hierarchical prior is called a mixture of Dirichlet process (MDP) [Antoniak, 1974, Ann. Statist.].
The invariant Dirichlet and constrained Dirichlet are special cases of MDP.
The precision parameter of a Dirichlet process has high sensitivity, so it is a good idea to impose a prior on it rather than choosing it directly.
Properties
An MDP is a.s. discrete, but is not tail-free.
Mean: E(∫ ψ dP) = ∫∫ ψ dG_θ dπ(θ) =: ψ̄.
Variance: var(∫ ψ dP) = ∫ [∫ (ψ − ∫ ψ dG_θ)² dG_θ / (1 + M_θ)] dπ(θ) + ∫ (∫ ψ dG_θ − ψ̄)² dπ(θ).
The posterior of an MDP is again an MDP: M_θ ↦ M_θ + n, G_θ ↦ (M_θ G_θ + n P_n)/(M_θ + n), and θ ∼ π(· | X1, ..., Xn).
To find π(· | X1, ..., Xn), assume that G_θ has a density g_θ. The joint distribution of X1, ..., Xn | θ is given by the Polya urn scheme for a Dirichlet process. Then π(· | X1, ..., Xn) is obtained by the Bayes theorem.
In particular, if all the Xi's are distinct, the joint density of X1, ..., Xn | θ is the parametric likelihood ∏_{i=1}^n g_θ(Xi).
Dirichlet process mixtures
Since Dirichlet samples are a.s. discrete, the Dirichlet process is not usable for density estimation. Ferguson (1983) and Lo (1984, Ann. Statist.) suggested convoluting with a parametric kernel: f(x) = ∫ ψ(x, θ, ϕ) dP(θ).
This is equivalent to the parametric mixture model: Xi | θi ∼ ψ(·, θi, ϕ), θi | P ∼ P i.i.d., P ∼ DP(M, G).
Posterior distribution
Formally, the posterior for P is an MDP: L(P | θ, X) = L(P | θ) = D_{α+Σ_{i=1}^n δ_{θi}}.
The posterior distribution of θ | X can be calculated by the Bayes theorem because p(X | θ) = ∏_{i=1}^n ψ(Xi, θi), and the joint (marginal) distribution of θ is that of a sample of size n from a Dirichlet process (P being integrated out).
Thus the posterior mean of a function ∫ h(x, θ) dP(θ) can be written analytically, but it has too many terms — impossible to compute even for moderate n.
MCMC algorithms
The idea is based on Gibbs sampling, by describing the posterior distribution of θi | θj, j ≠ i, X using the Bayes theorem.
The conditional prior distribution of θ1, ..., θn is given by the Polya urn scheme.
The likelihood is ψ(Xi, θi) — the posterior distribution of θi | θj, j ≠ i, X depends on X only through Xi. All this information can be summarized in the following theorem:
Theorem
θi | (θ_{−i}, X) = θi | (θ_{−i}, Xi) ∼ δ_{θj} w.p. q_{i,j}, and ∼ G_{b,i} w.p. q_{i,0}, where
q_{i,j} = c ψ(Xi; θj), j ≠ i, q_{i,0} = c M ∫ ψ(Xi; θ) dG(θ),
c is chosen to satisfy q_{i,0} + Σ_{j≠i} q_{i,j} = 1, and G_{b,i} is the “baseline posterior measure” given by
dG_{b,i}(θ) = ψ(Xi; θ) dG(θ) / ∫ ψ(Xi; t) dG(t).
MCMC algorithms
The theorem above describes the algorithm: Treat θ1, . . . , θn as your hidden parameters.
However, many of the θi ’s are tied with each other, so a reduction in number of parameters is possible, keeping track of the labels and the number of distinct values. This usually improves the speed. Calculation of the posterior weights and sampling from the baseline posterior is essential for the algorithm. If the base measure is conjugate with the likelihood, both of these are easily done. When there is no such conjugacy, specialized algorithms using acceptance-rejection strategies can overcome the problems.
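A minimal sketch of such a Gibbs sampler for a normal location mixture with a conjugate normal base measure (the kernel N(θ, σ²) with σ known, the base measure G = N(0, τ²) and all numerical settings are illustrative assumptions, not the course's prescription):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)

def dp_mixture_gibbs(X, n_iter=500, M=1.0, sigma=0.5, tau=2.0):
    """Marginal Gibbs sampler for a DP mixture of N(theta_i, sigma^2) kernels.

    With G = N(0, tau^2) conjugate to the kernel, the weight q_{i,0} and
    the baseline posterior G_{b,i} are available in closed form.
    """
    n = len(X)
    theta = X.copy()                       # initialize each theta_i at X_i
    for _ in range(n_iter):
        for i in range(n):
            others = np.delete(theta, i)
            # q_{i,j} propto psi(X_i; theta_j); q_{i,0} propto M * prior predictive
            q = norm.pdf(X[i], loc=others, scale=sigma)
            q0 = M * norm.pdf(X[i], loc=0.0, scale=np.sqrt(sigma**2 + tau**2))
            probs = np.append(q, q0)
            probs /= probs.sum()
            k = rng.choice(n, p=probs)     # n-1 old values plus one "new" slot
            if k < n - 1:
                theta[i] = others[k]       # tie with an existing theta_j
            else:                          # fresh draw from the baseline posterior
                var = 1.0 / (1.0 / sigma**2 + 1.0 / tau**2)
                theta[i] = rng.normal(var * X[i] / sigma**2, np.sqrt(var))
    return theta

X = np.concatenate([rng.normal(-2, 0.5, 40), rng.normal(2, 0.5, 60)])
theta = dp_mixture_gibbs(X, n_iter=200)
print(len(np.unique(theta)))   # number of clusters found
```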
Variational method
MCMC methods are slow in complex problems. A variational method involves deterministic iterative optimization in the space of distributions. It was first used in exponential families.
One assumes that the actual posterior is closely approximated by some very flexible but tractable family of distributions of the product type, called the variational distributions. The idea is then to pick the distribution in this family which is the Kullback-Leibler projection of the true posterior. The Kullback-Leibler divergence is bounded below using Jensen's inequality, and then one minimizes the lower bound iteratively in each parameter. The corrections from the previous value are called variational updates.
For Dirichlet process mixture models, one uses an approximate fixed-truncation stick-breaking representation to convert the Dirichlet model to a parametric class, and uses a beta family on the stick-breaking variables.
Choice of kernel
For density estimation on the line, a location-scale kernel like the normal, t or logistic is used. The scale parameter is either mixed over or treated separately. One should allow small values of the scale for the bias to be small.
For densities on the unit interval, beta mixtures may be considered. In fact, very special beta mixtures given by Bernstein polynomials already have good approximation properties.
Densities on the half-line can be treated by mixtures of gamma, log-normal, Weibull, inverse gamma, inverse Gaussian, etc.
Sometimes shape restriction is an important issue. For instance, normal scale mixtures produce strongly unimodal densities, scale mixtures of uniforms produce decreasing densities, etc.
Feller sampling scheme
More generally, the approximation property of a mixture can often be connected with a sampling scheme: E[Z_{k,x}] → x, var(Z_{k,x}) → 0, so f(x) is approximated by E[f(Z_{k,x})].
On ℝ, Z_{k,x} ∼ N(x, k^{−1}) gives the normal kernel naturally — a sort of canonical choice.
On [0, 1], Z_{k,x} ∼ k^{−1} Bin(k, x) gives the Bernstein polynomial kernel.
On (0, ∞), Z_{k,x} ∼ k^{−1} Poi(kx) gives the gamma kernel.
On (0, ∞), Z_{k,x} ∼ Ga(k, k/x) gives the inverse-gamma kernel.
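A small numerical sketch of the Bernstein polynomial case of this scheme (the test function is an arbitrary illustrative choice):

```python
import numpy as np
from scipy.stats import binom

def bernstein_approx(f, x, k):
    """Feller/Bernstein approximation E[f(Z_{k,x})] with Z_{k,x} = Bin(k, x)/k.

    As k grows, E f(Z_{k,x}) -> f(x) for continuous f on [0, 1].
    """
    j = np.arange(k + 1)
    weights = binom.pmf(j, k, x[:, None])   # P(Bin(k, x) = j) for each x
    return weights @ f(j / k)

f = lambda t: np.sin(2 * np.pi * t)
x = np.linspace(0, 1, 5)
for k in (10, 100, 1000):
    print(k, np.max(np.abs(bernstein_approx(f, x, k) - f(x))))
```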
The Bayesian bootstrap
D*_n = D_{Σ_{i=1}^n δ_{Xi}}: P = Σ_{i=1}^n Wi δ_{Xi}, where (W1, ..., Wn) ∼ Dir(n; 1, ..., 1).
D*_n is actually a resampling distribution like Efron's bootstrap. The main difference from a general Dirichlet posterior is that only finitely many atoms are involved, so simulation is particularly easy. In fact, Wi = Yi / Σ_{j=1}^n Yj, where Yi ∼ Ex(1) i.i.d.
The density of the mean functional µ = ∫ ψ dP under the BB may be obtained analytically:
p(µ | X1, ..., Xn) = (n − 1) Σ_{i=1}^n (ψ(Xi) − µ)_+^{n−2} / ∏_{j≠i}(ψ(Xi) − ψ(Xj)),
a B-spline of order (n − 2) with knots at ψ(X1), ..., ψ(Xn).
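A minimal simulation sketch of the Bayesian bootstrap for the mean functional (the gamma-distributed data are an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(8)

def bayesian_bootstrap_means(data, n_draws=2000):
    """Posterior draws of the mean functional under the Bayesian bootstrap.

    Each draw weights the observed points by Dir(n; 1, ..., 1) weights,
    generated as normalized standard exponentials.
    """
    n = len(data)
    Y = rng.exponential(size=(n_draws, n))
    W = Y / Y.sum(axis=1, keepdims=True)   # each row ~ Dir(n; 1, ..., 1)
    return W @ data

data = rng.gamma(2.0, 1.0, size=100)
draws = bayesian_bootstrap_means(data)
print(data.mean(), draws.mean(), draws.std())   # centered near the sample mean
```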
Consistency
A “what if” study — if the data are generated from a model with true parameter θ0, then the posterior Π(θ | X^(n)) should approach the perfect knowledge δ_{θ0}.
Equivalent to: “For any neighborhood U of θ0, Π(U^c | X^(n)) → 0”.
The prior Π must support θ0; otherwise very little chance of consistency (Dirichlet is an exception). We tend to think (based on experience with the parametric situation) that if the prior puts positive probability in the neighborhood of θ0, we must have consistency, at least when the data are i.i.d. Not quite true in infinite dimension.
Examples of inconsistency
Example (Freedman, 1963, Ann. Math. Statist.)
Infinite dimensional multinomial; p unknown p.m.f.; true p.m.f. p0 is Geo(1/4). We can construct a prior Π which gives positive mass to every neighborhood of p0 but the posterior concentrates at p1 := Geo(3/4).
Example is actually generic: The collection of (p, Π) which leads to consistency is topologically very small.
Examples of inconsistency
Example (Diaconis and Freedman, 1986, Ann. Statist.) To estimate the point of symmetry θ of a symmetric density. Put normal prior on θ and symmetrized DP with Cauchy base measure on the rest of the density. Then there is a trimodal symmetric true density under which the posterior concentrates at two wrong values of θ. Doss (1985, Ann. Statist.) has a similar result for constrained Dirichlet with median zero.
Examples of inconsistency
Example (Kim and Lee, 2001, Ann. Statist.)
Consider estimating the cumulative hazard function H. There are two priors Π1, Π2, both having prior mean equal to the true hazard H0, with Π1 having a uniformly smaller prior variance than Π2, such that the posterior for Π2 is consistent but the posterior for Π1 is inconsistent.
Why consistency matters
It would be embarrassing if the Bayesian method were unable to uncover the truth even with infinitely rich resources.
Consistency makes the estimator more acceptable to other people.
Merging of opinion: two Bayesians with different priors eventually agree iff consistency holds.
Doob's consistency result
Theorem (Doob, 1948)
Consider any sequence of models with observations X^(n) guided by parameter θ, and θ having prior Π. Suppose that θ is a “function of all observations”: θ = f(X^(∞)). Then for any bounded h(θ), the posterior mean converges to the true value, i.e., E[h(θ) | X^(n)] → h(θ0) a.s. under θ0, for almost all θ0 w.r.t. Π.
For i.i.d. observations from an identifiable model, or more generally whenever there is a consistent estimator, Doob's condition holds, so the result is extremely general, and the conclusion is also much stronger than consistency.
However, the null set can spoil the party. Generally, the conclusion is really useful when the parameter space is countable, or the prior is equivalent to some standard measure like the Lebesgue measure.
Notions of consistency
Consistency depends on the choice of neighborhoods, i.e., on the topology or metric.
The weak topology is generated by closeness of expectations of bounded, continuous functions: {P : |∫ ψi dP − ∫ ψi dP0| < ε, i = 1, ..., k}.
On densities, stronger distances are more relevant. The L1 distance between two densities p, q is defined by ‖p − q‖_1 = ∫ |p − q|. The Hellinger distance is ‖√p − √q‖_2 = (∫ (√p − √q)²)^{1/2}. These two distances give the same topology, sometimes called the strong topology.
Consistency for tail-free process
Theorem (Freedman, 1963, Ann. Math. Statist.)
Let Xi | P ∼ P i.i.d., where P is given a tail-free process prior Π. Then the posterior for P is consistent with respect to the weak topology at any P0 in the weak support of Π.
In particular, this means that the posteriors based on the Dirichlet (of course) and Polya tree processes are consistent in the weak topology whenever the true P0 is in the support of the prior.
This holds because of the strong connection between tail-free priors and finite dimensional priors. Because of the weak topology, the calculation can be reduced to a finite dimensional multinomial model, where everything is fine.
The problem is that tail-freeness is a very fragile property, easily destroyed by mixtures and other operations.
Schwartz's theory
Initiated by Schwartz (1965), extending Freedman's work in the discrete case.
Consider i.i.d. observations Xi ∼ p_θ, θ ∈ Θ. Assume the family {p(x, θ) : θ ∈ Θ} is dominated, so densities exist and the Bayes theorem is applicable. Let θ0 be the true value, U a neighborhood of θ0 and Π a prior for θ. To show that
Π(θ ∈ U^c | X^(n)) = [∫_{U^c} ∏_{i=1}^n (p(Xi, θ)/p(Xi, θ0)) dΠ(θ)] / [∫_Θ ∏_{i=1}^n (p(Xi, θ)/p(Xi, θ0)) dΠ(θ)] → 0.
To bound, we can replace Θ in the integral in the denominator by any subset. Since the observations are i.i.d., we can parameterize by the density itself, i.e., θ = p_θ = p, and the true density is p0. We write F for Θ and U for U.
Schwartz's theorem
Theorem
Suppose that (i) the family is “statistically separable”, i.e., there exists a test function Φ_n = Φ_n(X1, ..., Xn) for testing H0 : p = p0 against H : p ∈ U^c such that
P0(Φ_n) ≤ B e^{−bn}, sup_{p∈U^c} P(1 − Φ_n) ≤ B e^{−bn};
(ii) p0 belongs to the Kullback-Leibler support of Π, i.e., Π(p : K(p0, p) < ε) > 0 for all ε > 0, where K(p, q) = ∫ p log(p/q).
Then Π(U^c | X1, ..., Xn) → 0 exponentially fast a.s. [P0^∞].
Idea of the proof
We show that, for some c > 0, ∫_{U^c} ∏_{i=1}^n (p(Xi)/p0(Xi)) dΠ(p) ≤ e^{−nc}, and that for all c > 0, e^{nc} ∫_{{K(p0,p)<ε}} ∏_{i=1}^n (p(Xi)/p0(Xi)) dΠ(p) → ∞.
The first assertion follows from
E[(1 − Φ_n) ∫_{U^c} ∏_{i=1}^n (p(Xi)/p0(Xi)) dΠ(p)] = ∫_{U^c} P(1 − Φ_n) dΠ(p) ≤ e^{−nc}
by the testing condition.
The second assertion is a consequence of Fatou's lemma, since e^{nc} ∏_{i=1}^n (p(Xi)/p0(Xi)) ≈ e^{nc − nK(p0,p)} → ∞ for K(p0, p) < ε < c/2.
Comments on tests
Uniformity in the testing condition is the biggest challenge. Otherwise exponentially powerful tests separating two distributions are always there. If there is a uniformly consistent test, the exponential consistency is obtained automatically from the i.i.d. structure. For the weak topology on distributions, separating tests exist by Hoeffding’s inequality.
Building up tests
The weak topology is too weak for density estimation and many other problems.
For stronger metrics like the Hellinger distance, a test can be built up from more basic tests. From known work of Le Cam, Birgé etc., it is known that exponentially powerful tests exist for any pair of convex sets separated by a positive distance.
Sets like U^c are not convex. But one can cover U^c with small convex balls, each separated from p0, and get a test for each such ball. Consider the maximum of these tests. If the number of balls needed is finite, then the maximum of those tests is a test satisfying the required conditions, with the number appearing as a multiplicative constant in the exponential bound for P0(Φ_n).
Unfortunately, the number is not finite unless the class of densities is compact — a strong condition.
Role of sieves
The class F can be replaced by a subset F_n in the testing condition, provided we can show that Π(F_n^c | X1, ..., Xn) → 0. A sufficient condition to ensure this is that Π(F_n^c) is exponentially small.
This F_n, called a sieve, may be taken to be compact. But now, as it is not fixed, just knowing that finitely many balls cover F_n will not be enough. Its growth has to be slower than exponential to be absorbed in the bound. Thus the number of small balls needed to cover the sieve needs to be estimated.
Entropy
The covering number N(ε, Θ, d) of a metric space (Θ, d) is the minimal number of balls of radius ε needed to cover Θ.
[Figure: covering a set by balls — few balls suffice for big ε, many are needed for small ε.]
Entropy is the logarithm log N(ε, Θ, d).
Consistency in L1
Theorem
Let X1, X2, ... be X-valued i.i.d. random elements with density p ∈ F. Let the true density p0 ∈ KL(Π). Suppose that given any ε > 0, there exist δ < ε/4, c1, c2 > 0, β < ε²/32 and sieves F_n such that
(i) Π(F_n^c) ≤ c1 e^{−c2 n};
(ii) log N(δ, F_n, ‖·‖_1) ≤ nβ, where N stands for the covering number.
Then for all ε > 0, Π(p ∈ F : ‖p − p0‖_1 > ε | X1, ..., Xn) → 0 a.s. [P0^∞].
Kullback-Leibler positivity property
Needs closeness and control of likelihood ratios to make the KL divergence small.
Random p should come close to true p0 with positive probability.
The ratios p0/p need to be bounded above — typically needs lower bounds on p and some integrability of p0.
Preservation of Kullback-Leibler property
The KL property is very stable, unlike the tail-freeness. It is preserved under mixtures, projection on a co-ordinate, taking products, symmetrization, small distortions of the true density like small location shift etc.
Common priors with Kullback-Leibler property
The Polya tree prior with the canonical partition and a parameter sequence a_m growing to meet Σ_{m=1}^∞ a_m^{−1/2} < ∞ satisfies the KL property at p0 under certain integrability conditions.
For the infinite dimensional exponential family or a finite random series prior on the unit interval, only continuity of p0 is needed. Gaussian process priors are included here, but they have a much richer structural property, so much stronger conclusions will hold, to be discussed later.
Kernel mixture prior
For kernel mixture priors, to meet the KL property, one needs full weak support of the prior on the mixing distribution and some integrability conditions on the true p0. Depending on the kernel used, the integrability conditions vary.
For a location-scale kernel, this usually means some moment condition on p0. For the normal kernel, the condition is finiteness of some 2 + δ moment. For thicker tailed kernels, like the logistic or double exponential, only finiteness of a 1 + δ moment is needed. For even thicker tailed kernels like the t-distribution, only a log-moment is needed.
For a kernel on the unit interval like a Bernstein polynomial, only continuity of p0 is needed.
Other kernels can be treated [Wu and Ghosal, 2008, Electron. J. Statist.].
Density estimation
Once the KL property holds, it is only a matter of defining an appropriate sieve F_n, controlling its entropy, and controlling Π(F_n^c). It is a trade-off — neither too big nor too small sieves will work.
For priors with more regular support, the random p's have less complexity, so meeting the conditions on the sieve is easier. This translates into milder conditions on the parameters of the prior and the true density.
For instance, for a Dirichlet mixture of normal prior with a normal base measure, the prior on the bandwidth can be reasonably flexible. In particular, commonly used priors like the inverse-gamma are allowed [Ghosal, Ghosh and Ramamoorthi (1999), Ann. Statist.; Tokdar (2006), Sankhya]. Multivariate normal kernel: [Wu and Ghosal (2010), J. Mult. Anal.].
Semiparametric applications
Schwartz theorem with weak topology on the nuisance nonparametric part is an appropriate tool for dealing with semiparametric problems. KL property plays a vital role. In many cases, even a sieve is not required.
Location problem: X ∼ p_θ = p(· − θ). The Dirichlet process does not admit densities; use a Polya tree, Dirichlet mixtures or any other prior with the KL property.
Interesting observation: (θ, p) ↦ p_θ is bi-continuous, so we just need to get weak consistency for the density p_θ based on i.i.d. observations. The KL property is essentially preserved under location change.
Many other applications are possible: linear regression (using a non-i.i.d. generalization of Schwartz's theory), the exponential frailty model, the Cox proportional hazard model, etc. [Wu and Ghosal, 2008, Sankhya].
Non-i.i.d. generalizations of Schwartz's theorem
The handling of the numerator in Schwartz's theorem only uses properties of the tests, not independence or identical distribution of the observations. Thus if testing conditions are assumed as in the i.i.d. case, the rest goes through.
The denominator is handled through a strong law for log-likelihood ratios, so some condition on the dependence and/or distributions of the observations will be needed.
For independent but not identically distributed (i.n.i.d.) observations, a strong law is available. The KL property is then described by average KL divergences and second moments of log-likelihood ratios [Amewou-Atisso et al., 2003, Bernoulli; Choudhuri et al., 2004, J. Amer. Statist. Assoc.; Wu and Ghosal, 2008, Sankhya].
Tests in such situations are studied by Birgé, or can be constructed directly in specific applications.
Dependent generalizations of Schwartz's theorem
When observations are not independent, Schwartz's theorem can still be extended in certain situations. The dependence must allow a strong law, and the required tests need to be constructed.
A favorable setting is that of ergodic Markov processes. A strong law holds in this case. Tests have been constructed by Birgé for (an analog of) the Hellinger distance. Under specific modeling assumptions on the transition densities, such tests may be constructed directly as well.
Under mild conditions, consistency can be verified for certain Dirichlet process mixture models for conditional densities [Tang and Ghosal, 2007, J. Statist. Plan. Inf.].
Convergence rate of posterior
For a statistical model P_θ^(n) with observations X^(n) and prior Π, the posterior convergence rate is ε_n → 0 at the true value θ0 w.r.t. a metric d if Π(d(θ, θ0) ≥ M_n ε_n | X^(n)) → 0 for every M_n → ∞.
ε_n is the smallest size of a ball around the true θ0 that holds nearly all the posterior mass; it thus measures the speed of convergence — a quantitative form of consistency.
If the posterior converges at rate ε_n, then there is a point estimator converging at the same rate: define it as the center of the ball of radius ε_n containing maximum posterior probability. Thus the posterior convergence rate is restricted by the minimax rate of convergence over all estimators — the optimal rate.
Convergence rate of posterior
In some cases the convergence rate can be obtained by direct calculations. For instance, if the posterior mean and variance can be calculated, then $\Pi(d(\theta, \theta_0) \ge M_n \epsilon_n \mid X^{(n)})$ can be estimated by Chebyshev's inequality. The infinite-dimensional normal model $X_i \stackrel{\mathrm{ind}}{\sim} N(\theta_i, n^{-1})$, $\theta_i \stackrel{\mathrm{ind}}{\sim} N(0, \tau_i^2)$, can be treated in this way. Survival analysis provides other examples. A theory of posterior convergence rates can be built as a quantitative analog of Schwartz's theory (in particular, assuming that the family is dominated). First we consider i.i.d. observations.
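In the infinite-dimensional normal model the posterior is conjugate coordinatewise, so the mean and variance needed for the Chebyshev bound are explicit. A minimal sketch, where the decay $\tau_i^2 = i^{-(2q+1)}$ and the particular true mean are illustrative choices:

    import numpy as np

    rng = np.random.default_rng(1)
    n, dim, q = 1000, 200, 1.0
    i = np.arange(1, dim + 1)
    tau2 = i ** -(2 * q + 1)          # prior variances: theta_i ~ N(0, tau_i^2)
    theta0 = i ** -(q + 1.0)          # a true mean in the matching smoothness class
    X = theta0 + rng.normal(scale=1 / np.sqrt(n), size=dim)

    # Coordinatewise conjugacy: theta_i | X_i ~ N(shrink_i * X_i, shrink_i / n)
    shrink = tau2 / (tau2 + 1.0 / n)
    post_mean = shrink * X
    post_var = shrink / n

    # E||theta - theta0||^2 under the posterior = squared bias + total variance;
    # Chebyshev: Pi(||theta - theta0|| >= M * sqrt(r2) | X) <= 1 / M^2
    r2 = np.sum((post_mean - theta0) ** 2) + np.sum(post_var)
    print("posterior RMS distance from truth:", np.sqrt(r2))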
Convergence rate theorem
Theorem (Ghosal, Ghosh and van der Vaart, 2000, Ann. Statist.)
Let $\Pi_n$ be a sequence of priors on $\mathcal{P}$. Suppose $\bar\epsilon_n, \tilde\epsilon_n \to 0$ with $n \min(\bar\epsilon_n^2, \tilde\epsilon_n^2) \to \infty$, and that there are constants $c_1, c_2, c_3, c_4 > 0$ and sets $\mathcal{P}_n \subset \mathcal{P}$ such that
$$\log N(\bar\epsilon_n, \mathcal{P}_n, d) \le c_1 n \bar\epsilon_n^2, \qquad \Pi_n\big(K(p_0, p) \le \tilde\epsilon_n^2,\ V(p_0, p) \le \tilde\epsilon_n^2\big) \ge c_3 e^{-c_2 n \tilde\epsilon_n^2};$$
then for $\epsilon_n = \max(\bar\epsilon_n, \tilde\epsilon_n)$ and a sufficiently large $M > 0$,
$$\Pi_n\big(p \in \mathcal{P}_n : d(p, p_0) > M\epsilon_n \mid X_1, \ldots, X_n\big) \to 0 \quad \text{in } P_0^n\text{-probability}.$$
If further
$$\Pi_n(\mathcal{P}_n^c) \le c_4 e^{-(c_2 + 4) n \tilde\epsilon_n^2},$$
then $\Pi_n\big(p \in \mathcal{P} : d(p, p_0) > M\epsilon_n \mid X_1, \ldots, X_n\big) \to 0$ in $P_0^n$-probability.
Proof of the theorem
Follow the same decomposition as in the proof of the consistency theorem. The numerator needs tests with error probabilities like $e^{-c n \epsilon_n^2}$. By Le Cam's and Birgé's work, such tests exist when the alternative is a small ball. The $e^{c n \epsilon_n^2}$-type bound on the covering number then ensures that the combined test satisfies the requirements. To bound the denominator, the following lemma is used.
Lemma
$$P_0^n\left( \int \prod_{i=1}^n \frac{p}{p_0}(X_i)\, d\Pi(p) \le \Pi(B(\epsilon; p_0))\, e^{-(1+C) n \epsilon^2} \right) \le \frac{1}{C^2 n \epsilon^2}.$$
Optimal rate by bracketing
Cover a space of densities by $N_{[\,]}(\epsilon_j, \mathcal{P})$ many brackets. Normalize the upper brackets to obtain a discrete approximation and let $\Pi_j$ be uniform on the collection. Take a convex combination of these as the final prior. Then the convergence rate is given by
$$\epsilon_n : \quad \log N_{[\,]}(\epsilon_n, \mathcal{P}) \le n \epsilon_n^2.$$
Often bracketing numbers are essentially equal to the usual covering numbers, so the device produces optimal rates; for instance, for the Hölder $\alpha$-smooth class of densities the entropy is $\epsilon^{-1/\alpha}$, so $\epsilon_n = n^{-\alpha/(1+2\alpha)}$; for monotone densities the entropy is $\epsilon^{-1}$, so $\epsilon_n = n^{-1/3}$.
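For the reader's convenience, the calculation behind the Hölder case (a routine verification, not spelled out on the slide): equating the entropy to $n\epsilon_n^2$,
$$\epsilon_n^{-1/\alpha} \asymp n \epsilon_n^2 \iff \epsilon_n^{-(1+2\alpha)/\alpha} \asymp n \iff \epsilon_n \asymp n^{-\alpha/(1+2\alpha)};$$
the monotone case is the instance $1/\alpha = 1$, giving $\epsilon_n \asymp n^{-1/3}$.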
Optimal rate by log-splines
Density estimation on [0, 1].
Split $[0, 1]$ into $K_n$ equal intervals.
Form an exponential family using the corresponding $J_n$ many B-splines and put a uniform (say) prior on the coefficients. If $p_0$ is $C^\alpha$, the spline density approximates $p_0$ up to $J_n^{-2\alpha}$ in KL. Hellinger and Euclidean metrics are comparable here, so calculations reduce to Euclidean ones.
The (local) entropy grows like $J_n$, while the prior concentrates like $e^{-J_n(c + \log(n\sqrt{J_n}))}$. This leads to the rate equations $n\epsilon_n^2 \sim J_n$ and $\epsilon_n \sim J_n^{-\alpha}$, giving the optimal rate $\epsilon_n = n^{-\alpha/(1+2\alpha)}$.
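A minimal Python sketch of the construction; the knot placement, spline degree, coefficient values, and Riemann-sum normalization are illustrative choices, not part of the references.

    import numpy as np
    from scipy.interpolate import BSpline

    def bspline_design(x, K=10, degree=3):
        """Evaluate all B-spline basis functions on K equal intervals of [0,1] at points x."""
        inner = np.linspace(0, 1, K + 1)
        knots = np.r_[[0.0] * degree, inner, [1.0] * degree]   # clamped knot vector
        J = len(knots) - degree - 1                            # J_n = K + degree basis functions
        B = np.empty((len(x), J))
        for j in range(J):
            c = np.zeros(J); c[j] = 1.0
            B[:, j] = BSpline(knots, c, degree, extrapolate=False)(x)
        return np.nan_to_num(B)

    def logspline_density(theta, grid=np.linspace(0, 1, 501)):
        """Exponential-family density p(x) = exp(sum_j theta_j B_j(x)) / normalizer."""
        unnorm = np.exp(bspline_design(grid, K=len(theta) - 3) @ theta)
        return unnorm / unnorm.mean()      # Riemann normalization over [0,1]

    theta = np.random.default_rng(2).normal(size=13)   # K=10 intervals, cubic: J = 13
    p = logspline_density(theta)
    print(p.mean())                                    # ~= 1: the density integrates to one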
Dirichlet mixture of normal
Reference: Ghosal and van der Vaart, 2001, Ann. Statist.
True density $p_0$ is itself a normal mixture, $p_0(x) = \int \phi_{\sigma_0}(x - z)\, dF_0(z)$ with $\underline\sigma \le \sigma_0 \le \bar\sigma$: the supersmooth case. The basic technique in the calculation of entropy and prior concentration is finding a discrete mixture approximation with only $N = O(\log(1/\epsilon))$ support points, reducing the calculation to $N$ dimensions. The entropy grows like $(\log(1/\epsilon))^2$ and the prior concentration rate is $e^{-c(\log(1/\epsilon))^2}$, leading to the rate $\epsilon_n \sim n^{-1/2} \log n$ in the most favorable situation.
Dirichlet mixture of normal
Reference: Ghosal and van der Vaart, 2007a, Ann. Statist.
Take a prior and scale it by a sequence $\sigma_n$ like $n^{-1/5}$. Approximate $p_0$ by a normal mixture $p_0^*$ with bandwidth $\sigma_n$ up to $\sigma_n^2$, and work with $p_0^*$ as the target. The strategy is similar to before, but the number of support points increases to $N = \sigma_n^{-1} \log\frac{1}{\epsilon}$. The rate equations $n\epsilon_n^2 = \sigma_n^{-1}(\log \epsilon_n^{-1})^2$ and $\epsilon_n = \sigma_n^2$ lead to $\epsilon_n = n^{-2/5}(\log n)^{4/5}$. The usual sieve selection does not give good results; one needs to use structural properties of Dirichlet mixtures to bound the posterior probability of $F[-a, a]^c > \epsilon$, and then construct a much smaller sieve. More recent work of Kruijer et al. (2011, Electron. J. Statist.) improves this by allowing a fixed prior on $\sigma$ and different smoothness levels of $p_0$.
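The solution of these two rate equations, spelled out for convenience (a routine calculation, not on the slide): substituting $\epsilon_n = \sigma_n^2$,
$$n \sigma_n^4 \asymp \sigma_n^{-1} (\log n)^2 \iff \sigma_n \asymp n^{-1/5} (\log n)^{2/5} \implies \epsilon_n = \sigma_n^2 \asymp n^{-2/5} (\log n)^{4/5}.$$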
Bernstein polynomials
Reference: Ghosal (2001), Ann. Statist.; Kruijer and van der Vaart (2008), J. Statist. Plan. Inf. If the true density is itself a Bernstein polynomial, then a nearly parametric rate is obtained. If the true density is $\alpha$-smooth, $0 < \alpha \le 2$, and the prior on the order $k$ has an exponential tail, then the posterior converges at the rate $n^{-\alpha/(2+2\alpha)}$ up to logarithmic factors. The sieve used is a set of Bernstein polynomials of a certain order growing with $n$; the entropy calculation can be related to that of the unit simplex. The rate is far from optimal. A technique called coarsening can improve the rate to the optimal order $n^{-\alpha/(1+2\alpha)}$ up to logarithmic factors, provided $\alpha \le 1$.
Automatic adaptation of kernel mixture priors
In obtaining the posterior convergence rate of a kernel mixture prior, an important step is to find an approximation $p^*$ of the true density $p_0$ within the model $\mathcal{F}$. A natural candidate for $p^*$ is $p_0$ convolved with the kernel. For smoothness level $\alpha$ up to 2 this works well, since the bias (the distance between $p_0$ and $p^*$) is of the order $\sigma^\alpha$. Unfortunately, this does not improve beyond $\sigma^2$ if $p_0$ is actually smoother ($\alpha > 2$). Recently Rousseau (2010, Ann. Statist.) and Kruijer et al. (2011, Electron. J. Statist.) introduced a better approximation of $p_0$ within the model when $p_0$ is smoother.
For the normal mixture model, the idea is to consider $p_1 = p_0 * \phi_\sigma$ if $p_0$ is $C^2$, but for smoothness $\alpha$ between 2 and 4, to replace $p_1$ by $p_2 = \phi_\sigma * (p_0 - (p_1 - p_0))$. Then the KL approximation order improves to $\sigma^{2\alpha}$.
However, the new $p_2$ is not a probability density, so some modification of the approximation is actually needed. The process can be repeated to obtain the correct approximation for a given smoothness level: $p_{j+1} = \phi_\sigma * (p_0 - (p_j - p_{j-1}))$. The resulting high-quality approximation allows a much larger bandwidth than before for smoother densities, and consequently gives a better convergence rate: the optimal rate for the correct smoothness level (up to a logarithmic factor).
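A minimal numerical sketch of the first correction step (the grid, bandwidths, and test density are illustrative choices of mine): note that $p_2 = \phi_\sigma * (2p_0 - p_1)$, so for a smooth $p_0$ its bias is $O(\sigma^4)$ rather than $O(\sigma^2)$.

    import numpy as np

    # Grid on which densities are tabulated; p0 is a smooth test density.
    x = np.linspace(-8, 8, 2001)
    p0 = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

    def smooth(f, sigma):
        """Discrete convolution f * phi_sigma on the grid."""
        kern = np.exp(-0.5 * (x / sigma) ** 2)
        kern /= kern.sum()
        return np.convolve(f, kern, mode="same")

    for sigma in (0.4, 0.2, 0.1):
        p1 = smooth(p0, sigma)              # plain kernel smoothing: bias O(sigma^2)
        p2 = smooth(2 * p0 - p1, sigma)     # corrected approximation: bias O(sigma^4)
        print(sigma, np.abs(p1 - p0).max(), np.abs(p2 - p0).max())
    # Halving sigma divides the first error by about 4 and the second by about 16.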
Note that the approximation is a technical device, unrelated to the method of analysis. The bandwidths are also automatically selected from the appropriate range by the prior. Thus the posterior converges at the right rate near $p_0$ without actually knowing the smoothness of $p_0$ and without using that knowledge in defining the prior on the bandwidth. For technical reasons, however, Dirichlet mixtures are not covered; they worked with finite but arbitrary order mixtures.
Compare this with the classical kernel density estimator. It really tries to estimate $p_0 * \phi_\sigma$ by an empirical average, allowing the bias, which is made small by using a small bandwidth. The order of the bias remains the same even if $p_0$ is smoother: the target does not change, so the rate does not improve. One can reduce the bias using a higher-order kernel (a nuisance, since density estimates can then be negative), but one has to know the correct smoothness order of $p_0$ and use that knowledge in choosing the kernel and the bandwidth. The Bayesian estimator, on the other hand, picks up the right target within the model automatically, without knowing the smoothness of $p_0$, and converges around it at the correct rate. Thus the Bayesian estimator is smarter, automatically adapting to the given situation: the bandwidth problem is solved!
Non-i.i.d. observations
Theorem (Ghosal and van der Vaart, 2007b, Ann. Statist.)
Let $d_n$ and $e_n$ be semimetrics on $\Theta$ for which there exist tests $\phi_n$ such that, for every $\theta_1$ with $d_n(\theta_1, \theta_0) > \epsilon$,
$$P_{\theta_0}^{(n)} \phi_n \le e^{-K n \epsilon^2}, \qquad \sup_{\theta \in \Theta : e_n(\theta, \theta_1) < \epsilon \xi} P_\theta^{(n)} (1 - \phi_n) \le e^{-K n \epsilon^2}.$$
Let $\epsilon_n > 0$, $\epsilon_n \to 0$, with $n\epsilon_n^2$ bounded away from $0$, let $k > 1$, and let $\Theta_n \subset \Theta$ be such that
$$\sup_{\epsilon > \epsilon_n} \log N\Big( \frac{\epsilon \xi}{2},\, \{\theta \in \Theta_n : d_n(\theta, \theta_0) < \epsilon\},\, e_n \Big) \le n \epsilon_n^2,$$
and, for every sufficiently large $j \in \mathbb{N}$,
$$\frac{\Pi_n\big(\theta \in \Theta_n : j\epsilon_n < d_n(\theta, \theta_0) \le 2j\epsilon_n\big)}{\Pi_n\big( K(p_{\theta_0}^{(n)}; p_\theta^{(n)}) \le n\epsilon_n^2,\ V_{k,0}(p_{\theta_0}^{(n)}; p_\theta^{(n)}) \le n^{k/2} \epsilon_n^k \big)} \le e^{K n \epsilon_n^2 j^2 / 2}.$$
Then for every $M_n \to \infty$,
$$P_{\theta_0}^{(n)} \Pi_n\big(\theta \in \Theta_n : d_n(\theta, \theta_0) \ge M_n \epsilon_n \mid X^{(n)}\big) \to 0.$$
I.N.I.D. observations
Theorem (Ghosal and van der Vaart, 2007b, Ann. Statist.)
Let $P_\theta^{(n)}$ be product measures and $d_n^2$ the average squared Hellinger distance. Suppose that there exist $\epsilon_n \to 0$ with $n\epsilon_n^2$ bounded away from $0$, $k > 1$, and sets $\Theta_n \subset \Theta$ such that
$$\sup_{\epsilon > \epsilon_n} \log N\big( \epsilon/36,\, \{\theta \in \Theta_n : d_n(\theta, \theta_0) < \epsilon\},\, d_n \big) \le n \epsilon_n^2,$$
$$\frac{\Pi_n(\Theta_n^c)}{\Pi_n\big(B_n^*(\theta_0, \epsilon_n; k)\big)} = o\big(e^{-2 n \epsilon_n^2}\big),$$
$$\frac{\Pi_n\big(\theta \in \Theta_n : j\epsilon_n < d_n(\theta, \theta_0) \le 2j\epsilon_n\big)}{\Pi_n\big( n^{-1} \sum_{i=1}^n \max\{K(p_{\theta_0,i}, p_{\theta,i}), V(p_{\theta_0,i}, p_{\theta,i})\} \le \epsilon_n^2 \big)} \le e^{n \epsilon_n^2 j^2 / 4}.$$
Then $P_{\theta_0}^{(n)} \Pi_n\big(\theta : d_n(\theta, \theta_0) \ge M_n \epsilon_n \mid X^{(n)}\big) \to 0$ for every $M_n \to \infty$.
Examples
Markov chains
White noise signal model
Gaussian time series
Nonlinear autoregression: $X_i = f(X_{i-1}) + \epsilon_i$
Regression using spline basis expansion
Binary regression with unknown link
Interval censoring
Estimating the spectral density of a stationary time series using the Whittle likelihood
Convergence rates under misspecification
Kleijn and van der Vaart (2006, Ann. Statist.) studied the posterior convergence rate when the true density can lie outside the model. They showed that under certain conditions, the posterior concentrates around the KL projection $p^*$ of the true density $p_0$ on the model $\mathcal{F}_n$ at a rate roughly given by the rate equations $\log N(\epsilon_n, \mathcal{F}_n) \lesssim n\epsilon_n^2$ and $-\log \Pi(K(p^*, \epsilon_n)) \lesssim n\epsilon_n^2$. Technically, the analysis requires a suitable modification of entropy, a weighted version of the Hellinger distance, a modification of KL around $p^*$, and testing against finite measures.
Bayesian adaptation
Commonly, priors appropriate for obtaining the optimal (or at least a good) posterior convergence rate need knowledge of the smoothness class; priors based on brackets or log-splines, for instance, use the smoothness information. Can a single prior give the optimal rate for all classes? If so, the prior is called rate adaptive. Obtaining a procedure that works for all smoothness classes is the classical problem of adaptation. Typically the models are nested, with different convergence rates, especially if they are indexed by a smoothness level.
Natural Bayesian approach: if $\Pi_\alpha$ gives the optimal rate for the class $\mathcal{C}_\alpha$, then a hierarchical mixture prior $\Pi = \int \Pi_\alpha\, d\lambda(\alpha)$ may give the right rate for every class. The strategy works in many cases. One also gets an adaptive point estimator as a by-product.
Infinite dimensional normal
Theorem (Belitser and Ghosal, 2003, Ann. Statist.)
In the infinite-dimensional normal model $X_i \stackrel{\mathrm{ind}}{\sim} N(\theta_i, n^{-1})$, let the true mean satisfy $\sum_{i=1}^\infty i^{2q} \theta_{0i}^2 < \infty$, with $q$ unknown but belonging to a discrete set $Q$. Let $\Pi_q : \theta_i \stackrel{\mathrm{ind}}{\sim} N(0, i^{-(2q+1)})$, $q \sim \lambda_q$, and $\Pi = \sum_q \lambda_q \Pi_q$. Then the posterior converges at the rate $n^{-q_0/(2q_0+1)}$ corresponding to the true value $q_0$ of $q$.
Sketch of the proof
Treat the correct model $q = q_0$, the smoother cases $q > q_0$, and the coarser cases $q < q_0$ separately. Selection step: coarser models have higher complexity; we show $\Pi(q < q_0 \mid X) \to 0$, i.e., the posterior probability of selecting a coarser model is small. This effectively reduces the prior to the form $\sum_{q \ge q_0} \lambda_q \Pi_q$. Compactification step: in the smoother models, the parameter lies inside a compact set $\{\theta : \sum_{i=1}^\infty i^{2q_0} \theta_i^2 \le B,\ q > q_0\}$ for a sufficiently large $B$ with high posterior probability. This allows one to control the covering numbers of the effective part of the smoother models.
The correct model $q = q_0$ can be handled by direct conjugacy calculations. Finally, in the reduced parameter space $\{\theta : \sum_{i=1}^\infty i^{2q_0} \theta_i^2 \le B\}$, the general theory applies.
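Because the model is conjugate, the posterior weights of the component priors $\Pi_q$ are explicit: under model $q$ the marginal likelihood is a product of $N(0, i^{-(2q+1)} + n^{-1})$ densities. A minimal sketch, where the finite truncation of the sequence, the candidate set $Q$, and the flat weights are illustrative choices:

    import numpy as np

    rng = np.random.default_rng(3)
    n, dim = 2000, 300
    i = np.arange(1, dim + 1)
    Q = np.array([0.5, 1.0, 1.5, 2.0])    # candidate smoothness levels (illustrative)
    q0 = 1.0
    theta0 = i ** -(q0 + 0.51)            # true mean: sum_i i^{2 q0} theta0_i^2 < infinity
    X = theta0 + rng.normal(scale=1 / np.sqrt(n), size=dim)

    def log_marginal(q):
        """Log marginal likelihood of model q: X_i ~ N(0, i^{-(2q+1)} + 1/n) independently."""
        var = i ** -(2 * q + 1) + 1.0 / n
        return -0.5 * np.sum(np.log(2 * np.pi * var) + X**2 / var)

    logm = np.array([log_marginal(q) for q in Q])          # flat weights lambda_q
    post = np.exp(logm - logm.max()); post /= post.sum()   # posterior over models
    print(dict(zip(Q, np.round(post, 3))))                 # mass should concentrate near q0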
Adaptation in density estimation
Reference: Ghosal, Lember and van der Vaart, 2008, Electron. J. Statist.
Consider a setting with countably many models $\mathcal{P}_{n,\alpha}$; in model $\alpha$, the entropy-matching rate is $\epsilon_{n,\alpha}$.
Let $B_{n,\alpha}(\epsilon)$ and $C_{n,\alpha}(\epsilon)$ respectively be KL and ordinary neighborhoods.
Let $\beta_n$ be the best model index for the true density $p_0$. Let $\mathcal{A}_{n,\ge\beta_n}$ and $\mathcal{A}_{n,<\beta_n}$ respectively denote the smoother and coarser models.
Main theorem
Theorem. Assume that
$$\frac{\lambda_{n,\alpha}\, \Pi_{n,\alpha}\big(C_{n,\alpha}(i\epsilon_{n,\alpha})\big)}{\lambda_{n,\beta_n}\, \Pi_{n,\beta_n}\big(B_{n,\beta_n}(\epsilon_{n,\beta_n})\big)} \le \mu_{n,\alpha}\, e^{L i^2 n \epsilon_{n,\alpha}^2}, \qquad \alpha < \beta_n,\ i \ge I,$$
$$\frac{\lambda_{n,\alpha}\, \Pi_{n,\alpha}\big(C_{n,\alpha}(i\epsilon_{n,\beta_n})\big)}{\lambda_{n,\beta_n}\, \Pi_{n,\beta_n}\big(B_{n,\beta_n}(\epsilon_{n,\beta_n})\big)} \le \mu_{n,\alpha}\, e^{L i^2 n \epsilon_{n,\beta_n}^2}, \qquad \alpha \ge \beta_n,\ i \ge I,$$
$$\sum_{\alpha \in A_n : \alpha < \beta_n} \frac{\lambda_{n,\alpha}\, \Pi_{n,\alpha}\big(C_{n,\alpha}(I B \epsilon_{n,\alpha})\big)}{\lambda_{n,\beta_n}\, \Pi_{n,\beta_n}\big(B_{n,\beta_n}(\epsilon_{n,\beta_n})\big)} = o\big(e^{-2 n \epsilon_{n,\beta_n}^2}\big),$$
and $\sum_{\alpha \in A_n} \sqrt{\mu_{n,\alpha}} \le e^{n \epsilon_{n,\beta_n}^2}$. If $n \epsilon_{n,\beta_n}^2 \to \infty$, then under some restrictions on the constants, the posterior converges at the rate $\epsilon_{n,\beta_n}$, i.e., is rate adaptive.
Two models context
It is somewhat easier to comprehend what is going on by looking at a two-model family.
Corollary
Assume that the entropy condition holds for $\alpha \in A_n = \{1, 2\}$ and sequences $\epsilon_{n,1} > \epsilon_{n,2}$.
(i) If $\Pi_{n,1}\big(B_{n,1}(\epsilon_{n,1})\big) \ge e^{-n\epsilon_{n,1}^2}$ and $\lambda_{n,2}/\lambda_{n,1} \le e^{n\epsilon_{n,1}^2}$, then the posterior converges at the rate $\epsilon_{n,1}$.
(ii) If $\Pi_{n,2}\big(B_{n,2}(\epsilon_{n,2})\big) \ge e^{-n\epsilon_{n,2}^2}$, $\lambda_{n,2}/\lambda_{n,1} \ge e^{-n\epsilon_{n,1}^2}$, and $\Pi_{n,1}\big(C_{n,1}(I\epsilon_{n,1})\big) \le (\lambda_{n,2}/\lambda_{n,1})\, o\big(e^{-3n\epsilon_{n,2}^2}\big)$ for every $I$, then the posterior converges at the rate $\epsilon_{n,2}$.
The range of relative weights allowed is very broad:
$$e^{-n\epsilon_{n,1}^2} \le \frac{\lambda_{n,2}}{\lambda_{n,1}} \le e^{n\epsilon_{n,1}^2}.$$
Log-spline prior
Universal weight scheme:
$$\lambda_{n,\alpha} = \frac{\lambda_\alpha\, e^{-C n \epsilon_{n,\alpha}^2}}{\sum_{\gamma \in A_n} \lambda_\gamma\, e^{-C n \epsilon_{n,\gamma}^2}}\, \mathbb{1}_{A_n}(\alpha).$$
Flat priors: $\lambda_{n,\alpha} = \mu_\alpha$; adaptation to the correct rate $n^{-\beta/(2\beta+1)}$ up to a $\sqrt{\log n}$ factor. Finitely many classes, decreasing weights: $\lambda_{n,\alpha} \propto \prod_{\gamma \in A : \gamma < \alpha} (C\epsilon_{n,\gamma})^{J_{n,\gamma}}$. This choice can remove the log factor.
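A minimal sketch of computing the universal weights stably in Python; the candidate smoothness levels, the constant $C$, and the flat base weights $\lambda_\alpha$ are illustrative assumptions.

    import numpy as np

    def universal_weights(n, eps, lam, C=1.0):
        """Universal scheme: lambda_{n,alpha} proportional to lambda_alpha * exp(-C n eps_{n,alpha}^2).

        `eps`: array of rates eps_{n,alpha} over the models in A_n; worked on the
        log scale to avoid underflow of exp(-C n eps^2) for large n.
        """
        logw = np.log(lam) - C * n * eps**2
        logw -= logw.max()                 # stabilize before exponentiating
        w = np.exp(logw)
        return w / w.sum()

    alphas = np.array([0.5, 1.0, 2.0])                  # smoothness levels (illustrative)
    n = 10_000
    eps = n ** (-alphas / (1 + 2 * alphas))             # model-wise optimal rates
    print(universal_weights(n, eps, lam=np.ones(3)))    # weights increase with smoothness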
Bracketing-induced log-splines: first find an $\epsilon_{n,\alpha}$-bracketing, and then find the closest log-spline approximations. The corresponding $\theta$'s are given uniform weights, forming a log-spline prior $\Pi_{n,\alpha}$. Then the universal weights, which increase with smoothness, are used. The resulting prior is rate adaptive.
Model selection
Can the Bayesian device automatically select the correct model in large-sample situations? If one model is simple (a singleton family) and the KL property holds in the other model, then this is a consequence of Doob's theorem and Schwartz's arguments [Dass and Lee, 2004, J. Stat. Plan. Inf.]: because of the point mass at the simple model, Doob's theorem implies that its posterior probability goes to one whenever the simple model holds; the other side follows from consistency. If one model has the KL property and the other does not, then the model with the KL property is chosen [Walker et al., 2005, J. Amer. Statist. Assoc.] under some conditions. If the models have different prior concentration rates and orders of entropy, then the correct model is chosen [Ghosal, Lember and van der Vaart].
Model selection
Theorem (Ghosal, Lember and van der Vaart, 2008). Under the assumptions of the adaptive rate theorem,
$$P_0^n\, \Pi_n\big(\mathcal{A}_{n,<\beta_n} \mid X_1, \ldots, X_n\big) \to 0,$$
$$P_0^n\, \Pi_n\big(\alpha \in \mathcal{A}_{n,\ge\beta_n} : d(p_0, \mathcal{P}_{n,\alpha}) > I B \epsilon_{n,\beta_n} \mid X_1, \ldots, X_n\big) \to 0.$$
In testing a parametric model against a nonparametric alternative, the required conditions, which essentially need weak concentration of the nonparametric prior around the parametric family, hold. Examples:
Testing a parametric family of densities on the unit interval against a Bernstein polynomial prior on the nonparametric alternative.
Testing a parametric family of densities on the unit interval against a log-spline prior on the nonparametric alternative.
Finite-dimensional normal against infinite-dimensional normal.
Gaussian process
When using a Gaussian process as a prior, convergence rates are determined by the complexity of the space the process spans and by the probability of concentration around the true function. Both properties are governed by the geometry of an associated Hilbert space, known as the reproducing kernel Hilbert space (RKHS). To understand the idea of the RKHS, think about the simplest nontrivial Gaussian process, namely a multivariate normal distribution. Only vectors in the linear span of the covariance matrix $\Sigma$ are in the support of the distribution. Further, the elliptical contours described by $\Sigma$ determine the variation in different directions. Thus an intrinsic way to measure variability relative to the inherent variability of the process is $\|h\|_\Sigma^2 = h' \Sigma^{-1} h$.
Reproducing kernel Hilbert space
Consider a Gaussian process $W$ as a random element of a Banach space $(\mathbb{B}, \|\cdot\|)$. A notion of expectation is available for $\mathbb{B}$-valued random variables, known as the Pettis integral. For every element $b^*$ of the dual space $\mathbb{B}^*$, define $S b^* = \mathrm{E}[W b^*(W)]$. On the range, define an inner product $\langle S b_1^*, S b_2^* \rangle = \mathrm{E}[b_1^*(W)\, b_2^*(W)]$. This is analogous to rescaling the usual inner product by the inverse of the covariance matrix in the finite-dimensional case. The corresponding norm is stronger than the Banach space norm. $S\mathbb{B}^*$ is not complete (w.r.t. the new norm) in infinite-dimensional spaces; its completion is a Hilbert space, which we call the RKHS and denote by $\mathbb{H}$. The reproducing property refers to $b_2^*(S b_1^*) = \mathrm{E}[b_1^*(W)\, b_2^*(W)] = \langle S b_1^*, S b_2^* \rangle$.
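In the finite-dimensional case $S b^* = \Sigma b^*$, so the RKHS inner product of $h_1 = \Sigma a$ and $h_2 = \Sigma b$ is $a' \Sigma b$ and the reproducing property can be checked directly. A minimal numerical sketch (the covariance matrix is an arbitrary illustrative choice):

    import numpy as np

    rng = np.random.default_rng(4)
    A = rng.standard_normal((3, 3))
    Sigma = A @ A.T                       # covariance of W ~ N(0, Sigma); S b* = Sigma b*
    a, b = rng.standard_normal(3), rng.standard_normal(3)

    # RKHS inner product of h1 = Sigma a and h2 = Sigma b: <h1, h2>_H = E[(a'W)(b'W)] = a' Sigma b
    h1, h2 = Sigma @ a, Sigma @ b
    inner_H = a @ Sigma @ b
    # Reproducing property: b*(S a*) = b'(Sigma a) equals the RKHS inner product
    print(np.isclose(b @ h1, inner_H))    # True
    # The intrinsic norm ||h||_Sigma^2 = h' Sigma^{-1} h matches <h, h>_H for h = Sigma a
    print(np.isclose(h1 @ np.linalg.solve(Sigma, h1), a @ Sigma @ a))   # True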
Gaussian processes of various smoothness
Sample paths of Brownian motion $B_t$ are smooth of order $< \frac{1}{2}$. Any process whose paths are related to Brownian motion, like the Brownian bridge or the Ornstein-Uhlenbeck process, has the same smoothness. The $k$-fold integrated Brownian motion $W = I^k B$ is smooth of order $< \alpha := k + \frac{1}{2}$. There is a notion of fractional-order integration, allowing $k$ to be a non-integer; the corresponding process, called the Riemann-Liouville process, has smoothness $< \alpha$. Another process, the fractional Brownian motion, which has covariance kernel $\frac{1}{2}(s^{2\alpha} + t^{2\alpha} - |t - s|^{2\alpha})$, also has smoothness of order $< \alpha$. The squared exponential process, with covariance kernel $e^{-|s-t|^2}$, has analytic sample paths.
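A quick way to see the smoothness hierarchy is to simulate the paths; a minimal sketch (Euler discretization on a uniform grid, an illustrative device only):

    import numpy as np

    rng = np.random.default_rng(5)
    m = 2000
    dt = 1.0 / m
    dB = rng.normal(scale=np.sqrt(dt), size=m)
    B = np.cumsum(dB)            # Brownian motion: Holder-smooth of order < 1/2
    IB = np.cumsum(B) * dt       # once-integrated BM: order < 3/2, visibly smoother
    I2B = np.cumsum(IB) * dt     # twice-integrated BM: order < 5/2
    # Crude roughness check: the maximal increment over the mesh shrinks with smoothness
    for path, name in [(B, "B"), (IB, "IB"), (I2B, "I2B")]:
        print(name, np.abs(np.diff(path)).max())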
Role of RKHS
The support of a mean-zero Gaussian process is the norm closure of the RKHS. Small ball probability: $\mathrm{P}(\|W\| < \epsilon)$. Concentration function:
$$\varphi_w(\epsilon) = \inf\{\|h\|_{\mathbb{H}}^2 : \|h - w\| \le \epsilon\} - \log \mathrm{P}(\|W\| < \epsilon).$$
It describes the concentration of $W$ near an element $w \in \bar{\mathbb{H}}$. Borell's inequality: