An Invitation to Bayesian Nonparametrics
Subhashis Ghoshal
North Carolina State University
A short course on the theory and methods of nonparametric Bayesian inference, given at EURANDOM, Summer 2011
Objectives
The goal is to develop sound Bayesian methods for nonparametric and semiparametric problems. This means constructing sensible priors, computing posterior distributions (possibly with support from computational devices like Markov chain Monte Carlo), and showing good (frequentist) behavior of the posterior distribution, usually in an asymptotic sense.
Nonparametric and semiparametric modeling is the way to go, since parametric models are arguably too restrictive.
Bayesian methods quantify uncertainty in a direct way and are straightforward in approach. They can also incorporate additional information in a very natural way.
Since both nonparametric modeling and the Bayesian approach are sensible, the nonparametric Bayesian approach must be sensible too.
Examples of nonparametric models
X1, ..., Xn ∼ P i.i.d., c.d.f. F completely unknown.
X1, ..., Xn ∼ P i.i.d., p.d.f. f not known to be a member of a parametric family.
Yi = f(Xi) + εi, regression function f unknown.
Yi | Xi ∼ Bin(1, p(Xi)) independently, p(x) ∈ [0, 1] unknown.
X1, ..., Xn are survival times, subject to censoring, and have cumulative hazard function H, unknown.
dXt = f (t)dt + σdBt , unknown signal f corrupted by white noise dBt .
X0, X1, ..., Xn stationary time series with unknown spectral density f .
Yi |Xi = xi has conditional density f (·; xi ), f unknown.
Examples of semiparametric models
Yi = β′Xi + εi, εi ∼ P i.i.d., P unknown, but β is the parameter of interest.
Yi | Xi ∼ ψ(·, f(β′Xi)) independently, ψ an exponential family, f unknown, but β is the parameter of interest.
X1, ..., Xn are survival times, subject to censoring, and have cumulative hazard functions Hi respectively, where Hi(x) = e^{β′Zi} H0(x), Z1, ..., Zn are associated covariates, the baseline hazard H0 is unknown, but β is the parameter of interest.
Functional data analysis examples
Functional regression with Euclidean predictor: Yi(t) = β(t)′Xi + εi(t).
Functional regression with functional predictor: Yi(t) = ∫ β(s, t) Xi(s) ds + εi(t).
Functional principal component analysis: X1(t), ..., Xn(t) i.i.d. mean-zero Gaussian processes with covariance kernel K(s, t) = Σ_{j=1}^∞ λj φj(s) φj(t).
Issues
Need to construct probability measures on infinite dimensional spaces, where there is no Lebesgue-type dominating σ-finite measure.
Fully subjective priors are not possible, since that would require specifying infinitely many details about the unknown. The prior should be largely developed through an automatic mechanism; only a few key parameters may be chosen using prior information.
Automatic priors should spread out mass all over the parameter space — big support.
Computational feasibility and good asymptotic behavior of the posterior distribution/Bayes estimates should be kept in mind.
Random basis expansion
f(x) = Σ_{j=1}^∞ θj Bj(x), where {Bj : j ∈ ℕ} is a suitable basis — polynomials, trigonometric functions, splines, wavelets, etc.
The coefficients θj are given some suitable prior, often independent normal (with quickly decreasing variances for convergence).
A (more pragmatic) variation is to consider the finite expansion f(x) = Σ_{j=1}^K θj Bj(x), where K is chosen (usually depending on n) to keep the approximation error of the expansion under control, or by some model selection technique, or rather by imposing a suitable prior on K (infinitely supported, with an appropriate tail). The latter is considered more sensible, but involves more challenging computation because of the changing dimension — typically reversible jump Markov chain Monte Carlo techniques.
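To make the finite-expansion idea concrete, here is a minimal sketch of one draw from such a prior (the shifted-Poisson prior on K, the 1/j decay of the coefficient standard deviations and the cosine basis are illustrative assumptions, not prescriptions from the course):

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_random_series(x, tau=1.0, mean_K=10):
    """Draw f(x) = sum_{j=1}^K theta_j B_j(x) with a random number of terms.

    Illustrative choices: K ~ 1 + Poisson(mean_K), theta_j ~ N(0, tau^2/j^2)
    so higher frequencies are damped, and B_j a cosine basis on [0, 1].
    """
    K = 1 + rng.poisson(mean_K)                           # prior on the truncation level
    theta = rng.normal(0.0, tau / np.arange(1, K + 1))    # decaying coefficient scales
    B = np.cos(np.pi * np.outer(x, np.arange(1, K + 1)))  # basis matrix B_j(x_i)
    return B @ theta

x = np.linspace(0, 1, 200)
f = draw_random_series(x)   # one sample path from the prior
```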
Use of link functions
Sometimes functions have natural constraints on their assumed values.
The intensity of a Poisson process, or the spectral density of a time series, takes only positive values. Then a basis expansion for log f is more natural; the advantage is that there is no restriction on the coefficients. Thus f(x) = exp[Σ_{j=1}^∞ θj Bj(x)].
In binary regression, the response probability p(x) takes values in the unit interval. Then a logit, probit or some other link can convert a basis expansion into a function taking values in [0, 1]: p(x) = H(Σ_{j=1}^∞ θj Bj(x)). The function H can (and should) be chosen increasing and continuous, i.e., a continuous c.d.f. There need not be any further prior put on H, since the free basis expansion already gives large support.
Probability density functions, on the other hand, have two constraints: f(x) ≥ 0 and ∫ f(x) dx = 1. One way to ensure these constraints is to exponentiate and renormalize —
f(x) = exp[Σ_{j=1}^∞ θj Bj(x)] / ∫ exp[Σ_{j=1}^∞ θj Bj(u)] du.
The renormalization is meaningful on compact domains with continuous basis functions. This is often called an infinite dimensional exponential family. With finitely many terms and a spline basis, it is often called a log-spline prior.
Gaussian processes
A stochastic process Wt is called a Gaussian process if all finite dimensional distributions are multivariate normal. The distribution is completely characterized by the mean function µ(t) = E(Wt) and the covariance kernel K(s, t) = cov(Ws, Wt).
Common examples of Gaussian processes include Brownian motion (covariance kernel min(s, t)), integrated Brownian motion, and the squared exponential process (covariance kernel e^{−c‖s−t‖²}).
Gaussian process sample paths are flexible enough to model an arbitrary function. With the help of a link function, restrictions can be addressed too.
Smoothness of sample paths is related to smoothness of the covariance kernel. For instance, Brownian motion has smoothness of order < 1/2, while the squared exponential process has infinite order smoothness. The k-fold integrated Brownian motion has smoothness of order < k + 1/2. A Gaussian process can be chosen to match prior knowledge of the smoothness of the function to be modeled.
A (finite or infinite) basis expansion with normal coefficients gives a Gaussian process. Conversely, by the Karhunen–Loève expansion, every Gaussian process has such a basis expansion.
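A minimal simulation sketch contrasting a rough and a smooth Gaussian process through their covariance kernels (the grid, the constant c = 25 and the jitter term are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(1e-3, 1, 300)

# Covariance matrices for two classic Gaussian processes on [0, 1]:
K_bm = np.minimum.outer(t, t)                       # Brownian motion: min(s, t)
K_se = np.exp(-25.0 * np.subtract.outer(t, t)**2)   # squared exponential, c = 25

def sample_gp(K, jitter=1e-6):
    """One mean-zero sample path via the Cholesky factor of the covariance."""
    L = np.linalg.cholesky(K + jitter * np.eye(len(K)))
    return L @ rng.standard_normal(len(K))

rough_path = sample_gp(K_bm)    # nowhere differentiable
smooth_path = sample_gp(K_se)   # infinitely smooth
```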
Completely random measures
A completely random measure (CRM) [Kingman, 1975] on X is a measure-valued process Φ such that for any finite collection of disjoint sets A1, ..., Ak, the random variables Φ(A1), ..., Φ(Ak) are mutually independent.
Because of additivity of measures and independence of increments, the marginal distribution of Φ(A) must be infinitely divisible. A CRM has the Lévy–Khinchine representation E(e^{−tΦ(A)}) = exp[−β(A)t − ∫ (1 − e^{−tz}) ν(A, dz)].
A random measure with given mean measure µ(A) = EΦ(A) can be constructed from the points of a Poisson process N on the product space X × [0, ∞) by the relation Φ(A) = ∫_0^∞ ∫_A z N(dx, dz) (assume β = 0 w.l.o.g.). The representation shows that Φ is a.s. discrete.
A CRM can be used as a prior on a space of measures. Finiteness/infiniteness can be controlled through the mean measure. A gamma random measure is particularly important in Bayesian nonparametrics. Here the distribution of Φ(A) is gamma with scale parameter 1 and shape parameter α(A), with α being a measure.
Lévy processes
Essentially, a (subordinator) Lévy process is a process indexed by the half-line with increasing, right-continuous sample paths. (This differs from the definition used in probability theory, where stationarity is imposed as a requirement.) It can have either finite or infinite total mass; every finite interval has finite random mass. An infinite mass Lévy process can model a prior for a cumulative hazard function.
Clearly Lévy processes are pure jump processes, having countably many jump points. Most of these jumps are very small; in fact, the number of jumps of size bigger than any ε > 0 is a.s. finite.
Random discrete distributions
P = Σ_{j=1}^N Wj δ_{θj}, 1 ≤ N ≤ ∞, possibly random if not identically infinity; (W1, ..., WN) has some distribution on the unit N-simplex, (θ1, ..., θN) have some joint distribution in X^N, both of which can possibly depend on the assumed value of N.
Even though the realized measures are always discrete, there would be an enormous variety if the distributions of the W's and θ's are flexible. Since any probability measure can be weakly approximated by finite/discrete probability measures, it is easy to maintain large weak support of the prior.
It is natural to consider the θj's i.i.d. with a non-singular distribution. If N is a.s. finite, then a Dirichlet distribution (which includes the uniform) on (W1, ..., WN) may be considered as an automatic choice. [Actually, there is also a countable dimensional Dirichlet.]
Stick breaking
Consider a random discrete distribution with countably many points. What prior can be put on (W1, W2, ...)?
The stick-breaking technique proceeds as follows. The total mass 1 is to be distributed sequentially to the support points, which have already been arranged in a sequence θ1, θ2, .... Consider random variables V1, V2, ... taking values in [0, 1] with some joint distribution (typically independent, or even i.i.d. with a fixed distribution like a beta). Then induce a distribution on (W1, W2, ...) through the relations Wj = [∏_{k=1}^{j−1}(1 − Vk)] Vj.
The interpretation is that in the beginning there was a stick of length 1. It is broken in the proportion V1 : 1 − V1, and the initial piece is assigned to θ1. Next, the leftover is broken in the proportion V2 : 1 − V2, the initial piece is assigned to θ2, the leftover is broken again in the proportion V3 : 1 − V3, and it continues like this.
Clearly, the leftover mass after j steps is ∏_{k=1}^j (1 − Vk). The total mass will be exhausted in countably many steps if ∏_{k=1}^j (1 − Vk) → 0 a.s., which, under independence, holds if Σ_{k=1}^∞ log E(1 − Vk) = −∞. This always holds if the Vj's are i.i.d. with P(V > 0) > 0.
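A quick numerical sketch of the stick-breaking construction (the i.i.d. Beta(1, 5) distribution for the Vk is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(2)

def stick_breaking_weights(n_terms, a=1.0, b=5.0):
    """First n_terms stick-breaking weights with V_k iid Beta(a, b).

    (Beta(1, M) recovers the Dirichlet-process weights discussed later.)
    """
    V = rng.beta(a, b, size=n_terms)
    leftover = np.concatenate(([1.0], np.cumprod(1 - V)[:-1]))  # prod_{k<j}(1 - V_k)
    return leftover * V

W = stick_breaking_weights(500)
print(W.sum())   # close to 1: the stick is exhausted since E log(1 - V) < 0
```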
Normalized Lévy processes
One way to ensure that a random measure is a probability measure is to normalize, provided the random measure is a.s. finite.
The most familiar example is to take a gamma process with finite mean measure α. The normalized measure is called a Dirichlet process, to be studied in detail later. All finite dimensional distributions are Dirichlet by the connection between independent gammas and the Dirichlet.
Another example which stands out as tractable is the normalized inverse-Gaussian process. Like the gamma, the inverse Gaussian is also closed under convolution, and hence is infinitely divisible. The normalized process has fairly tractable joint distributions and has many similarities with the Dirichlet process.
Partitioning method
A natural method of distributing total mass 1 to various subsets is to use a recursive partitioning scheme.
Start with a sequence of refining partitions, for simplicity binary partitions. This means each set at a given stage is split in two in the next stage, the pieces to be called its offspring. Distribute the mass of a set at the current stage to its offspring according to a random proportion. The probability of a set at any stage of the partition is then given by a finite product of these proportions.
The random proportions are given some joint distribution. If the union of all refining partitions generates the Borel σ-field on the sample space, then a random probability measure P is automatically obtained. A slight control on the proportions ensures countable additivity of P. Usually P has no fixed atoms, but can have random atoms.
[Figure: binary partition tree — the root R splits into B0 and B1 with proportions V0 and V1, and each B_ε splits into B_{ε0} and B_{ε1} with proportions V_{ε0} and V_{ε1}.]
Tail-free processes
It is natural to impose some independence on the random proportions used in a partitioning method.
Let {V0}, {V00, V10}, {V000, V010, V100, V110}, ... be mutually independent collections. Note that we have assumed independence across levels, but not necessarily within levels. Such a random probability is called tail-free with respect to the given sequence of partitions [Freedman, 1963, Ann. Math. Statist.].
Interestingly, P(B_{ε1···εk}) is a product of independent random variables. This allows calculation of moments of P(B_{ε1···εk}) and log P(B_{ε1···εk}) in terms of those of the V_ε's.
Conjugacy
There is a structural conjugacy: the posterior distribution of P given X1, ..., Xn | P ∼ P i.i.d. is also tail-free with respect to the same sequence of partitions.
The proof of the last fact is intriguing. One first looks at the counts of the sets at a given level of the partition and the corresponding cell probabilities. The posterior distribution given the counts at this level can be obtained. On the other hand, a similar calculation can be done at any finer level of the partition, and can be marginalized to the previous level. Independence in the likelihood and the tail-free prior imply that these two posterior distributions are the same. Making the second stage of the partition finer and finer, the posterior given the observations is obtained. The independence across levels is built in.
The name tail-free comes from the fact that the posterior distributions of cell probabilities at any finite stage are unaffected by the counts in the finer stages, i.e., the tail.
A practical method of choosing a sequence of partitions is to follow the binary quantiles of a target probability measure λ. If E(V) = 1/2 at every stage, then E(P) = λ is ensured. Such a partition is called a canonical partition.
Support
Support, loosely speaking, is the subset where the prior assigns mass. Technically, it is the smallest closed set with prior probability 1. A tail-free process in R has large weak support, provided the endpoints of the partitioning sets become dense and all proportions have non-singular joint distributions. By denseness, closeness to a distribution P0 in the weak topology can be reduced to closeness of finite stage cell probabilities, which can be achieved with positive probability by non-singularity. Thus any P0 is in the weak support of the tail-free process.
If P0 is continuous, then P0 is also in the Kolmogorov-Smirnov support of the tail-free process by Polya’s theorem — a general fact irrespective of the prior.
Absolute continuity
Theorem (Kraft, 1964, J. Appl. Probab.)
Let Π be a tail-free prior with respect to the sequence of partitions {T_m} and let λ be a probability measure such that λ(B) > 0 for all B ∈ ∪_{m=1}^∞ T_m. If
sup_{m∈ℕ} max_{ε∈E^m} [∏_{j=1}^m E(V²_{ε1···εj})] / λ²(B_{ε1···εm}) < ∞,
then P is absolutely continuous with respect to λ a.s.
In particular, if the canonical partition is chosen so that λ(B_{ε1···εm}) = 2^{−m}, |E(V_{ε1···εm}) − 1/2| ≤ δ_m and var(V_{ε1···εm}) ≤ γ_m for all ε ∈ E^m and m ∈ ℕ, where Σ_{m=1}^∞ δ_m < ∞ and Σ_{m=1}^∞ γ_m < ∞, then the conclusion holds.
Zero-one laws
A tail-free process has interesting dichotomies. If the V's at all levels are strictly in (0, 1), then with respect to any measure λ with all cell probabilities positive, the sample realizations of the tail-free process are absolutely continuous with respect to λ with probability either zero or one.
This is actually a simple consequence of Kolmogorov's zero-one law, in view of the independence in the tail-free structure.
Absolute continuity is not special in this respect: mutual singularity also holds with probability either zero or one. Discreteness and continuity (i.e., non-atomicity) each also hold with probability either zero or one.
Polya tree process
The tail-free property is an abstract concept. For actual Bayesian analysis, we need to make some definite choices for the partitions and the distributions of the V_ε's.
Call V_{ε0} = Y_ε, and so V_{ε1} = 1 − Y_ε. Make Y_ε ∼ Be(α_{ε0}, α_{ε1}) independently. Note that all these allocation proportions are independent, across or within levels.
Expressions for the mean and moments are more concrete. In particular, E[P(B_{ε1···εk})] = ∏_{j=1}^k α_{ε1···εj} / (α_{ε1···ε_{j−1}0} + α_{ε1···ε_{j−1}1}).
Using a canonical partition from a center measure λ, and letting all α_ε = a_m depend only on the length m = |ε| of the string, Kraft's sufficient condition for the existence of a density can be simplified to Σ_{m=1}^∞ a_m^{−1} < ∞.
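As an illustration, the following sketch simulates a random density from such a canonical Polya tree (the center measure λ = Uniform[0, 1], the truncation depth, and the choice a_m = m², which satisfies Σ a_m^{−1} < ∞, are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

def polya_tree_density(depth=10, a=lambda m: m**2):
    """Draw a random density from a canonical Polya tree on [0, 1].

    Center measure lambda = Uniform[0, 1] (dyadic partition), alpha_eps =
    a(|eps|).  Returns the density, piecewise constant on the 2^depth
    dyadic cells at the final level.
    """
    cell_probs = np.array([1.0])
    for m in range(1, depth + 1):
        Y = rng.beta(a(m), a(m), size=cell_probs.size)  # proportion to the left child
        # each parent's mass splits as (Y, 1 - Y) between its two children
        cell_probs = np.column_stack((cell_probs * Y,
                                      cell_probs * (1 - Y))).ravel()
    return cell_probs * 2**depth   # divide by the cell width 2^{-depth}

f = polya_tree_density()
print(f.mean())   # equals 1: the density integrates to one over equal cells
```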
Conjugacy
If P follows a Polya tree prior, then given i.i.d. observations from P, the posterior for P is again a Polya tree with respect to the same partition, with α_ε ↦ α_ε + Σ_{i=1}^n 1{Xi ∈ B_ε}. In particular, assuming a canonical Polya tree with parameters {a_m} admitting a density, this leads to a simple expression for the posterior mean of the density
E[f(x) | X1, ..., Xn] = λ(x) ∏_{j=1}^∞ (2a_j + 2N(B_{ε1···εj})) / (2a_j + N(B_{ε1···ε_{j−1}0}) + N(B_{ε1···ε_{j−1}1})),
where ε1 ε2 ··· identifies the sequence of cells containing x and N(B) = #{i : Xi ∈ B}.
The infinite product actually terminates (the factors equal 1 once a cell contains no observations), so numerical computation is simple, giving a simple Bayesian density estimate. The posterior variance can also be computed fairly easily.
The Dirichlet process
Consider the problem of estimating the distribution P from Xi | P ∼ P i.i.d. How can we construct a conjugate prior for P?
Had the data been grouped using a partition A1,..., Ak , then the likelihood would be multinomial — the closest finite dimensional relative of the general nonparametric model.
The corresponding conjugate prior for (P(A1), ..., P(Ak)) would be a Dirichlet distribution. But the way grouping can be done is completely arbitrary, so the Dirichlet distribution would be needed for every choice of partition. This motivates the following:
The Dirichlet process
Definition (Ferguson, 1973) A random probability measure P is said to follow a Dirichlet process prior Dα with base measure α if for every finite measurable partition {A1,..., Ak } of X,
(P(A1),..., P(Ak )) ∼ Dir(k; α(A1), . . . , α(Ak )),
where α is a finite positive Borel measure.
Why a measure? It allows the unambiguous specification essential for existence: if some sets in the partition are merged together, the resulting probabilities follow a lower dimensional Dirichlet with the group sums as parameters.
Moments
Marginal: (P(A), P(A^c)) ∼ Dir(2; α(A), α(A^c)), that is, P(A) ∼ Be(α(A), α(A^c)).
Mean: E(P(A)) = α(A)/α(X) =: ᾱ(A). Implication: if X | P ∼ P and P ∼ D_α, then marginally X ∼ ᾱ.
Variance: var(P(A)) = ᾱ(A) ᾱ(A^c)/(1 + M), M = |α| := α(X).
Comment: ᾱ is called the center measure and M is called the precision parameter. We often denote D_α by DP(M, G), where G is the c.d.f. of ᾱ.
More generally, E(∫ ψ dP) = ∫ ψ dᾱ.
Transformation
Theorem
P ∼ D_α on X and f : X → Y implies P ∘ f^{−1} ∼ D_{α∘f^{−1}}.
The proof is obvious from the definition of induced measures. The property, in particular, is useful in carrying a Dirichlet process from one sample space to another, and will be useful when we discuss the construction of a Dirichlet process.
Posterior distribution
Theorem
P | X1, ..., Xn ∼ D_{α*}, where α* = α + Σ_{i=1}^n δ_{Xi}, gives a version of the posterior distribution.
Sketch of the proof: Can just do n = 1 and iterate. Reduce to a finite measurable partition {A1, ..., Ak }. To show its posterior is Dir(k; α(A1), . . . , α(Ai−1), α(Ai ) + 1, α(Ai+1), . . . , α(Ak )), when X ∈ Ai . Clearly true if only that much were known. Refine partition and corresponding information. The posterior does not change. Apply the martingale convergence theorem to pass to the limit.
Posterior moments and convergence
E(P(A) | X1, ..., Xn) = (M/(M+n)) ᾱ(A) + (n/(M+n)) Pn(A) — a convex combination with relative weights M and n.
M can be interpreted as the prior sample size; M → 0 means a “noninformative limit”.
Asymptotically equivalent to the sample mean up to O(n^{−1}); converges to the true P0 at the n^{−1/2} rate.
var(P(A) | X1, ..., Xn) ≤ 1/(4n) → 0.
Π(P : |P(A) − P0(A)| ≥ M_n n^{−1/2} | X1, ..., Xn) is, by Chebyshev, bounded by M_n^{−2} n E((P(A) − P0(A))² | X1, ..., Xn) → 0 a.s. under P0, for any M_n → ∞.
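A minimal sketch of the posterior mean formula above (the uniform center measure, M = 2 and the simulated data are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)

def dp_posterior_mean_cdf(x, data, M=2.0, G=lambda t: np.clip(t, 0, 1)):
    """Posterior mean c.d.f. under a DP(M, G) prior given i.i.d. data.

    Illustrative choices: G = Uniform[0, 1] c.d.f., M = 2.  The posterior
    mean is the convex combination (M G + n F_n) / (M + n).
    """
    n = len(data)
    Fn = (data[None, :] <= x[:, None]).mean(axis=1)   # empirical c.d.f. at x
    return (M * G(x) + n * Fn) / (M + n)

data = rng.beta(2, 5, size=50)
x = np.linspace(0, 1, 5)
print(dp_posterior_mean_cdf(x, data))
```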
Discreteness
Theorem
Dα(D) := Dα(P : P is discrete) = 1.
Sketch of the proof: P ∈ D iff P{x : P({x}) > 0} = 1. D is a measurable subset of M. Assertion holds iff (Dα × P){(P, X ): P{X } > 0} = 1.
Equivalent to (ᾱ × D_{α+δ_X}){(X, P) : P{X} > 0} = 1. True by Fubini, since P has positive mass at X under the “posterior” D_{α+δ_X}.
Note: null sets do matter.
Self-similarity
Theorem
P(A), P|_A and P|_{A^c} are mutually independent, and P|_A ∼ D_{α|_A}.
Can be shown using the relations between finite dimensional Dirichlet and independent gamma variables. If the Dirichlet process is localized to A and Ac , then both are Dirichlet processes and are independent of each other, and also independent of the “macro level” variable P(A).
Weak support
Theorem
supp(Dα) = {P : supp(P) ⊂ supp(α)}.
Sketch of the proof: no P which supports points outside supp(α) can be in the support, since the corresponding first beta parameter would be zero. Any compliant P is in the weak support by fine partitioning and non-singularity of the corresponding finite dimensional Dirichlet distribution.
Further, if P is in the weak support and is continuous, then the assertion automatically upgrades to Kolmogorov-Smirnov support by Polya's theorem.
Convergence
Theorem
Let α_m be such that ᾱ_m →_w ᾱ. Then
(i) if |α_m| → M, 0 < M < ∞, then D_{α_m} →_w D_{Mᾱ};
(ii) if |α_m| → 0, then D_{α_m} →_w D*_α := L(δ_X : X ∼ ᾱ);
(iii) if |α_m| → ∞, then D_{α_m} →_w δ_ᾱ.
Sketch of the proof: a random measure is tight iff its expectation measure is tight, so tightness holds here. It remains to check finite dimensional convergence; work with a finite partition. For (i), Dirichlet goes to Dirichlet by Scheffé. For (ii) and (iii), check convergence of all mixed moments.
Convergence
Corollary
If Xi ∼ P0 i.i.d., then D_{α+Σ_{i=1}^n δ_{Xi}} converges weakly to
(i) δ_{P0} if n → ∞ (consistency of the posterior);
(ii) D_{Σ_{i=1}^n δ_{Xi}} (the Bayesian bootstrap) if M → 0.
By (i), the posterior distribution of P as n → ∞ concentrates near the true distribution P0 in the weak topology (and in the Kolmogorov-Smirnov distance) irrespective of the base measure α. In particular, it is not required that the prior support the true distribution. The property is driven by the behavior of the empirical distribution and the structure of the Dirichlet process, rather than by likelihood ratios, which are non-existent in the present case.
(ii) can be regarded as the posterior based on a “noninformative Dirichlet prior”, and is a sensible alternative to the bootstrap method, to be discussed later.
Convergence
If ᾱ_m also diffuses in the limit, like N(0, σ²) with σ → ∞ along with M → ∞, then some other non-trivial limits may be obtained, provided the growth of M is appropriately linked with the growth of σ. The resulting limit has sometimes been called the limdir process [Bush, Lee and MacEachern (2011), J. Roy. Statist. Soc.]. The process has been suggested as a noninformative choice in nonparametric mixture modeling, like the Dirichlet process mixture, to be discussed later.
Dirichlet-multinomial process
Definition
The distribution Π_N of P = Σ_{k=1}^N Wk δ_{θk}, where θ1, ..., θN ∼ G i.i.d. and (W1, ..., WN) ∼ Dir(N; α1, ..., αN), is called the Dirichlet-multinomial process of order N with parameters G and (α1, ..., αN).
If Y1, ..., Yn ∼ P i.i.d. and P ∼ Π_N, then Yi = θ_{Ki}, where (K1, ..., Kn) ∼ MN(n, N; W1, ..., WN) and (W1, ..., WN) ∼ Dir(N; α1, ..., αN).
For a given (θ1, ..., θN), P ∼ D_{Σ_{k=1}^N αk δ_{θk}}.
For any ψ ∈ L1(G), E(∫ ψ dP) = ∫ ψ dG irrespective of the values of α1, ..., αN.
Dirichlet-multinomial process
Theorem (Ishwaran and Zarepour, 2002, Statistica Sinica)
Let P_N = Σ_{k=1}^N W_{k,N} δ_{θk}, where θ1, θ2, ... ∼ G i.i.d. and independently (W_{1,N}, ..., W_{N,N}) ∼ Dir(N; α_{1,N}, ..., α_{N,N}); P ∼ DP(M, G).
1(a) If Σ_{k=1}^N α_{k,N} → ∞ and max_{1≤k≤N} α_{k,N}/(Σ_{k=1}^N α_{k,N}) → 0, then ∫ ψ dP_N →_p ∫ ψ dG, ψ ∈ L2(G). In particular, this holds for α_{k,N} = λ_N with Nλ_N → ∞.
1(b) If α_{k,N} = λ_k, Σ_{k=1}^∞ λ_k²/k² < ∞ and N^{−1} Σ_{k=1}^N λ_k → λ > 0, then ∫ ψ dP_N → ∫ ψ dG a.s., ψ ∈ L2(G).
2(a) If Σ_{k=1}^N α_{k,N} → M and max_{1≤k≤N} α_{k,N} → 0, then ∫ ψ dP_N →_d ∫ ψ dP, ψ bounded continuous.
2(b) If α_{k,N} = M/N, then ∫ ψ dP_N →_d ∫ ψ dP, ψ ∈ L1(G).
3 If Σ_{k=1}^N α_{k,N} → 0 and max_{1≤k≤N} α_{k,N}/(Σ_{k=1}^N α_{k,N}) → 0, then for any bounded continuous ψ, ∫ ψ dP_N →_d ψ(θ), where θ ∼ G. In particular, this holds if α_{k,N} = λ_N with Nλ_N → 0.
Stick-breaking representation
Theorem (Sethuraman, 1994, Statistica Sinica)
Let θ1, θ2, ... ∼ ᾱ i.i.d. and V1, V2, ... ∼ Be(1, M) i.i.d., all mutually independent, Wi = Vi ∏_{j=1}^{i−1}(1 − Vj). Then
P = Σ_{i=1}^∞ Wi δ_{θi} ∼ D_α.
Distributional equation:
Wi = (1 − V1)[∏_{j=2}^{i−1}(1 − Vj)] Vi = (1 − V1) W′_{i−1},
P = V δ_θ + (1 − V) Σ_{i=1}^∞ W′_i δ_{θ′_i} =_d V δ_θ + (1 − V) P.
Can do “functional MCMC”.
Stick-breaking representation
Intuition: since the prior is the posterior averaged w.r.t. the marginal distribution of the observation, D_α is also ∫ D_{α+δ_x} dᾱ(x). Conditionally on an observed x, P{x} ∼ Be(1, M) and P|_{{x}^c} ∼ D_α (assuming non-atomicity of α), independently, by the self-similarity property.
Steps in the formal proof: the DP is a solution of the distributional equation, and the solution is unique.
ε-Dirichlet process
Definition
Let P = Σ_{i=1}^∞ Wi δ_{θi} ∼ DP(M, G). For a given ε > 0, define N_ε = inf{m ≥ 1 : Σ_{i=1}^m Wi > 1 − ε} and P_ε = Σ_{i=1}^{N_ε} Wi δ_{θi} + W̄ δ_{θ0}, where W̄ = 1 − Σ_{i=1}^{N_ε} Wi = ∏_{i=1}^{N_ε}(1 − Vi) and θ0 ∼ G independent of everything else. The distribution of P_ε is called an ε-Dirichlet process D_{α,ε}.
Theorem
d_TV(P, P_ε) ≤ ε a.s.
d_L(D_α, D_{α,ε}) ≤ ε, d_L the Lévy distance.
N_ε − 1 ∼ Poi(M log(1/ε)). Consequently, E(N_ε) = 1 + M log(1/ε) and var(N_ε) = M log(1/ε).
∫ ψ dP_ε → ∫ ψ dP a.s. as ε → 0.
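The definition suggests a direct sampler; here is a minimal sketch (the standard normal base measure is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(5)

def sample_eps_dp(M=5.0, eps=0.01, base_sampler=rng.standard_normal):
    """Draw an epsilon-Dirichlet process approximation of DP(M, G).

    A sketch assuming G = N(0, 1).  Stick-breaking runs until the leftover
    mass drops below eps; the remainder goes on one extra atom theta_0 ~ G,
    so the total variation error is at most eps.
    """
    weights, leftover = [], 1.0
    while leftover > eps:
        V = rng.beta(1.0, M)           # Sethuraman: V_i ~ Be(1, M)
        weights.append(leftover * V)
        leftover *= 1.0 - V
    weights.append(leftover)            # remainder atom W-bar
    atoms = base_sampler(len(weights))  # i.i.d. draws from the base measure G
    return np.array(weights), atoms

W, theta = sample_eps_dp()
print(len(W), W.sum())   # about 1 + M log(1/eps) atoms; weights sum to 1
```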
Generalized Polya urn scheme
Suppose that Xi | P ∼ P i.i.d. and P ∼ D_α, often called a sample from a Dirichlet process. Then
X1 ∼ ᾱ;
X2 | P, X1 ∼ P and P | X1 ∼ D_{α+δ_{X1}}, so X2 | X1 ∼ (M/(M+1)) ᾱ + (1/(M+1)) δ_{X1}, that is,
X2 | X1 ∼ δ_{X1} w.p. 1/(M+1), and ∼ ᾱ w.p. M/(M+1).
Generalized Polya urn scheme
More generally, Xi | P, X1, ..., X_{i−1} ∼ P and P | X1, ..., X_{i−1} ∼ D_{α+Σ_{j=1}^{i−1} δ_{Xj}}, so Xi | X1, ..., X_{i−1} ∼ (M/(M+i−1)) ᾱ + Σ_{j=1}^{i−1} (1/(M+i−1)) δ_{Xj}, that is,
Xi | X1, ..., X_{i−1} ∼ δ_{Xj} w.p. 1/(M+i−1), j = 1, ..., i−1, and ∼ ᾱ w.p. M/(M+i−1).
By exchangeability of (X1,..., Xn),
Xi | Xj, j ≠ i ∼ δ_{Xj} w.p. 1/(M+n−1) for each j ≠ i, and ∼ ᾱ w.p. M/(M+n−1).
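The urn scheme gives a direct way to simulate (X1, ..., Xn) without ever constructing P; a minimal sketch, assuming G = N(0, 1) for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)

def polya_urn_sample(n, M=2.0, base_sampler=rng.standard_normal):
    """Draw (X_1, ..., X_n) from a DP(M, G) via the generalized Polya urn.

    Each new X_i copies a previously drawn value with probability
    proportional to 1, or is a fresh draw from G with probability
    proportional to M.
    """
    X = [base_sampler()]
    for i in range(1, n):
        if rng.random() < M / (M + i):    # fresh draw w.p. M/(M+i)
            X.append(base_sampler())
        else:                             # copy a uniformly chosen old value
            X.append(X[rng.integers(i)])
    return np.array(X)

X = polya_urn_sample(1000)
print(len(np.unique(X)))   # K_n, roughly M log(1 + n/M) distinct values
```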
Joint distribution
The joint distribution of (X1, ..., Xn) is not absolutely continuous with respect to the Lebesgue measure. A realization of (X1, ..., Xn) partitions {1, 2, ..., n} by identifying components that have identical values. Corresponding to each partition, consider a lower dimensional Lebesgue measure on the “diagonals” obtained by setting the values of the components within the same partitioning set identical. For instance, if n = 3 and {{1, 2}, {3}} is a partition, then the corresponding diagonal is D_{12;3} = {(x1, x2, x3) ∈ ℝ³ : x1 = x2}, with the corresponding lower dimensional Lebesgue measure λ_{12;3}.
Then the joint distribution is absolutely continuous with respect to the sum µ of such mutually orthogonal measures. The joint distribution has several components, each of which is absolutely continuous with respect to exactly one component of µ. The density with respect to µ is the sum of the densities of each component of the joint distribution with respect to the corresponding component of µ.
Clustering
A lot of ties are produced by Dirichlet samples; hence they induce random partitions of {1, ..., n}, creating clusters.
Expected number of distinct elements Kn grows like E(Kn) ∼ M log(n/M), which arises from the partial sum of a harmonic series.
var(Kn) also grows at the same rate.
K_n/log n → M, using Kolmogorov's strong law.
(K_n − M log(1 + n/M)) / √(M log(1 + n/M)) →_d N(0, 1).
A Poisson approximation holds with parameter M log(1 + n/M).
Exact distribution: P(K_n = k | M, n) = C_n(k) n! M^k Γ(M)/Γ(M + n).
Mutual singularity
Theorem (Korwar and Hollander, 1973, Ann. Probab.)
If α1 ≠ α2 are two non-atomic measures on X, then D_{α1} and D_{α2} are mutually singular.
More generally, if αi is decomposed into continuous part α_{i,c} and atomic part α_{i,a}, i = 1, 2, and α_{1,c} ≠ α_{2,c}, then D_{α1} and D_{α2} are mutually singular.
If α_{1,c} = α_{2,c} but supp(α_{1,a}) ≠ supp(α_{2,a}), then D_{α1} and D_{α2} are mutually singular.
If α_{1,c} = α_{2,c} and supp(α_{1,a}) = supp(α_{2,a}), then D_{α1} and D_{α2} need not be mutually singular.
Tails
Theorem (Doss and Sellke, 1982, Ann. Statist.)
Let F be the c.d.f. of a random P following DP(M, G). Let h be a real-valued function on [0, 1] which is strictly increasing and convex on (0, ε) for some ε > 0. Then a.s.
lim_{x→−∞} F(x)/h(MG(x)) = 0 if ∫_0 log⁻ h(t) dt < ∞, and = ∞ if ∫_0 log⁻ h(t) dt = ∞;
lim_{x→∞} F̄(x)/h(MḠ(x)) = 0 if ∫_0 log⁻ h(t) dt < ∞, and = ∞ if ∫_0 log⁻ h(t) dt = ∞.
Tails
The proof is almost an immediate corollary of a similar tail representation for gamma processes.
The most interesting conclusion is that the tails of the random F are much thinner than those of the center measure. For instance, if the center measure is normal, F has exponential-of-exponential tails. If the center measure is Cauchy, the random F has a finite moment generating function.
In particular, ∫ ψ dF can be a.s. finite even if ∫ ψ dG = ∞. Using c.f. based techniques, it can be shown that ∫ |ψ| dF < ∞ a.s. iff ∫ log(1 + |ψ|) dG < ∞.
Distribution of median
Theorem
Let F be the c.d.f. of a random P following DP(M, G). Let m_F stand for any choice of median of F and H the distribution of m_F. Then
the c.d.f. H of m_F is given by
H(x) = [Γ(M) / (Γ(MG(x)) Γ(MḠ(x)))] ∫_{1/2}^1 u^{MG(x)−1} (1 − u)^{MḠ(x)−1} du;
H is continuous if G is so; any median of H is a median of G and conversely.
The distribution above, called the median-Dirichlet distribution, can be evaluated numerically. Proof is related to monotone likelihood ratio property of the beta family.
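The expression for H can be evaluated with a beta survival function, since H(x) = P(F(x) ≥ 1/2) and F(x) ∼ Be(MG(x), MḠ(x)); a minimal numerical sketch, assuming a standard normal center measure:

```python
import numpy as np
from scipy.stats import beta, norm

def median_cdf(x, M=5.0, G=norm.cdf):
    """c.d.f. H of the median of F ~ DP(M, G); sketch with G = N(0, 1).

    H(x) = P(F(x) >= 1/2) = P(Beta(M G(x), M (1 - G(x))) >= 1/2).
    """
    g = G(x)
    return beta.sf(0.5, M * g, M * (1 - g))

x = np.linspace(-2, 2, 5)
print(median_cdf(x))
```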
Distribution of mean
Theorem
The distribution of L = ∫ x dF(x), F ∼ DP(M, G), is given by
P(L ≤ s) = 1/2 + (1/π) ∫_0^∞ t^{−1} exp[−(M/2) ∫_{−∞}^∞ log(1 + t²(s − x)²) dG(x)] sin[M ∫_ℝ tan^{−1}(t(s − x)) dG(x)] dt.
Uses an identity [Cifarelli and Regazzini, 1990, Ann. Statist.]
E[exp(it ∫ ψ dS)] = exp[−∫_ℝ log(1 − itψ(x)) dα(x)]
for S following a gamma process with mean measure α — use the c.f. and independent increments of S.
One interesting case: the distribution of the mean is Cauchy iff the base measure is Cauchy — the only such case.
Characterization
The Dirichlet process can be roughly characterized as the only process satisfying any of the following properties:
Tail-free with respect to any sequence of refining partitions;
Neutral, that is P(B1) ⊥ P(B2)/P(B1) ⊥ P(B3)/P(B2) ⊥ · · · for all B1 ⊃ B2 ⊃ · · · ; Posterior distribution of P(A) depends on Pn(A) only for all A; Polya tree process with αε0 + αε1 = αε for every finite string ε.
Construction of a Dirichlet process
Finite dimensional distributions are self-consistent, so there is a joint distribution on [0, 1]^∞. But measure theoretic difficulties can arise: for instance, the space of measures is not a measurable subset of the product space. The difficulties can be overcome by using a countable generator and the countable additivity of the mean measure.
Existence of the gamma process also gives the Dirichlet process upon normalization.
The stick-breaking representation also gives a direct construction.
Invariant Dirichlet process
Sometimes we need to put a shape restriction on the random P; for instance, if we need to model an error distribution, symmetry is often imposed.
One can consider a symmetrized probability P̃(A) = ½(P(A) + P(−A)), where P follows a Dirichlet process [Dalal, 1982, Stoch. Process. Appl.]. If the center measure G is symmetric, then the posterior mean of the symmetrized Dirichlet is obtained as
E(F(x) | X1, ..., Xn) = [MG(x) + ½ Σ_{i=1}^n {1(Xi ≤ x) + 1(Xi ≥ −x)}] / (M + n).
More generally, invariance under the action of a compact group of transformations can be considered.
The invariant Dirichlet process is not tail-free.
Constrained Dirichlet process
Sometimes a restriction is imposed on quantile(s), such as the median set to zero. Applications include quantile regression, regression with asymmetric errors, etc.
More generally, one can express this as finitely many restrictions P(Bj) = vj, where the Bj's form a finite partition, called a control set.
By the self-similarity of a Dirichlet process, such a prior can be expressed as a finite mixture of independent Dirichlet processes with orthogonal supports.
A restriction on moments is also sensible, but is less tractable.
Mixture of Dirichlet process
It is typically hard to specify the base measure exactly, but letting it be a member of a (parametric) family is more easily comprehensible. For instance, it may be hard to say that the center measure should be N(0, 1), but it is sensible to choose the center measure as N(µ, σ²) with µ and σ undetermined.
In other words, the center measure G and/or the precision M may involve an additional parameter θ. Then a further prior π is imposed on θ, often a very diffuse prior. The resulting hierarchical prior is called a mixture of Dirichlet process (MDP) [Antoniak, 1974, Ann. Statist.].
The invariant Dirichlet and constrained Dirichlet are special cases of MDP.
The precision parameter of a Dirichlet process has high sensitivity, so it is a good idea to impose a prior on it rather than choosing it directly.
Properties
An MDP is a.s. discrete, but is not tail-free.
Mean: E(∫ ψ dP) = ∫∫ ψ dG_θ dπ(θ) =: ψ̄.
Variance: var(∫ ψ dP) = ∫ [∫ (ψ − ∫ ψ dG_θ)² dG_θ / (1 + M_θ)] dπ(θ) + ∫ (∫ ψ dG_θ − ψ̄)² dπ(θ).
The posterior of an MDP is again an MDP: M_θ ↦ M_θ + n, G_θ ↦ (M_θ G_θ + n P_n)/(M_θ + n), and θ ∼ π(· | X1, ..., Xn).
To find π(· | X1, ..., Xn), assume that G_θ has a density g_θ. The joint distribution of X1, ..., Xn | θ is given by the Polya urn scheme for a Dirichlet process. Then π(· | X1, ..., Xn) is obtained by the Bayes theorem.
In particular, if all the Xi's are distinct, the joint density of X1, ..., Xn | θ is the parametric likelihood ∏_{i=1}^n g_θ(Xi).
Dirichlet process mixtures
Since Dirichlet samples are a.s. discrete, the Dirichlet process is not usable for density estimation. Ferguson (1983) and Lo (1984, Ann. Statist.) suggested convoluting with a parametric kernel: f(x) = ∫ ψ(x, θ, ϕ) dP(θ).
This is equivalent to the parametric mixture model: Xi | θi ∼ ψ(·, θi, ϕ), θi | P ∼ P i.i.d., P ∼ DP(M, G).
Posterior distribution
Formally, the posterior for P is an MDP: L(P | θ, X) = L(P | θ) = D_{α+Σ_{i=1}^n δ_{θi}}.
The posterior distribution of θ | X can be calculated by the Bayes theorem because p(X | θ) = ∏_{i=1}^n ψ(Xi, θi), and the joint (marginal) distribution of θ is that of a sample of size n from a Dirichlet process (P being integrated out).
Thus the posterior mean of a function ∫ h(x, θ) dP(θ) can be written analytically, but it has too many terms — impossible to compute even for moderate n.
MCMC algorithms
The idea is based on Gibbs sampling, by describing the posterior distribution of θi | θj, j ≠ i, X using the Bayes theorem.
The conditional prior distribution of θ1, ..., θn is given by the Polya urn scheme.
The likelihood is ψ(Xi, θi) — the posterior distribution of θi | θj, j ≠ i, X depends on X only through Xi. All this information can be summarized in the following theorem:
Theorem
θi | (θ_{−i}, X) = θi | (θ_{−i}, Xi) ∼ δ_{θj} w.p. q_{i,j}, and ∼ G_{b,i} w.p. q_{i,0}, where
q_{i,j} = c ψ(Xi; θj), j ≠ i, q_{i,0} = c M ∫ ψ(Xi; θ) dG(θ),
c is chosen to satisfy q_{i,0} + Σ_{j≠i} q_{i,j} = 1, and G_{b,i} is the “baseline posterior measure” given by
dG_{b,i}(θ) = ψ(Xi; θ) dG(θ) / ∫ ψ(Xi; t) dG(t).
MCMC algorithms
The theorem above describes the algorithm: Treat θ1, . . . , θn as your hidden parameters.
However, many of the θi ’s are tied with each other, so a reduction in number of parameters is possible, keeping track of the labels and the number of distinct values. This usually improves the speed. Calculation of the posterior weights and sampling from the baseline posterior is essential for the algorithm. If the base measure is conjugate with the likelihood, both of these are easily done. When there is no such conjugacy, specialized algorithms using acceptance-rejection strategies can overcome the problems.
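A minimal sketch of such a Gibbs sampler for a normal location mixture with a conjugate normal base measure (the kernel N(θ, σ²) with σ known, the base measure G = N(0, τ²) and all numerical settings are illustrative assumptions, not the course's prescription):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)

def dp_mixture_gibbs(X, n_iter=500, M=1.0, sigma=0.5, tau=2.0):
    """Marginal Gibbs sampler for a DP mixture of N(theta_i, sigma^2) kernels.

    With G = N(0, tau^2) conjugate to the kernel, the weight q_{i,0} and
    the baseline posterior G_{b,i} are available in closed form.
    """
    n = len(X)
    theta = X.copy()                       # initialize each theta_i at X_i
    for _ in range(n_iter):
        for i in range(n):
            others = np.delete(theta, i)
            # q_{i,j} propto psi(X_i; theta_j); q_{i,0} propto M * prior predictive
            q = norm.pdf(X[i], loc=others, scale=sigma)
            q0 = M * norm.pdf(X[i], loc=0.0, scale=np.sqrt(sigma**2 + tau**2))
            probs = np.append(q, q0)
            probs /= probs.sum()
            k = rng.choice(n, p=probs)     # n-1 old values plus one "new" slot
            if k < n - 1:
                theta[i] = others[k]       # tie with an existing theta_j
            else:                          # fresh draw from the baseline posterior
                var = 1.0 / (1.0 / sigma**2 + 1.0 / tau**2)
                theta[i] = rng.normal(var * X[i] / sigma**2, np.sqrt(var))
    return theta

X = np.concatenate([rng.normal(-2, 0.5, 40), rng.normal(2, 0.5, 60)])
theta = dp_mixture_gibbs(X, n_iter=200)
print(len(np.unique(theta)))   # number of clusters found
```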
Variational method
MCMC methods are slow in complex problems. A variational method involves deterministic iterative optimization in the space of distributions. It was first used in exponential families.
One assumes that the actual posterior is closely approximated by some very flexible but tractable family of distributions of the product type, called the variational distributions. The idea is then to pick the distribution in this family which is the Kullback-Leibler projection of the true posterior. The Kullback-Leibler divergence is bounded below using Jensen's inequality, and then one minimizes the lower bound iteratively in each parameter. The corrections from the previous value are called variational updates.
For Dirichlet process mixture models, one uses an approximate fixed-truncation stick-breaking representation to convert the Dirichlet model to a parametric class, and uses a beta family on the stick-breaking variables.
Choice of kernel
For density estimation on the line, a location-scale kernel like the normal, t or logistic is used. The scale parameter is either mixed over or treated separately. One should allow small values of the scale for the bias to be small.
For densities on the unit interval, beta mixtures may be considered. In fact, very special beta mixtures given by Bernstein polynomials already have good approximation properties.
Densities on the half-line can be treated by mixtures of gamma, log-normal, Weibull, inverse gamma, inverse Gaussian, etc.
Sometimes shape restriction is an important issue. For instance, normal scale mixtures produce strongly unimodal densities, scale mixtures of uniforms produce decreasing densities, etc.
Feller sampling scheme
More generally, the approximation property of a mixture can often be connected with a sampling scheme: E[Z_{k,x}] → x, var(Z_{k,x}) → 0, so f(x) is approximated by E[f(Z_{k,x})].
On ℝ, Z_{k,x} ∼ N(x, k^{−1}) gives the normal kernel naturally — a sort of canonical choice.
On [0, 1], Z_{k,x} ∼ k^{−1} Bin(k, x) gives the Bernstein polynomial kernel.
On (0, ∞), Z_{k,x} ∼ k^{−1} Poi(kx) gives the gamma kernel.
On (0, ∞), Z_{k,x} ∼ Ga(k, k/x) gives the inverse-gamma kernel.
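A small numerical sketch of the Bernstein polynomial case of this scheme (the test function is an arbitrary illustrative choice):

```python
import numpy as np
from scipy.stats import binom

def bernstein_approx(f, x, k):
    """Feller/Bernstein approximation E[f(Z_{k,x})] with Z_{k,x} = Bin(k, x)/k.

    As k grows, E f(Z_{k,x}) -> f(x) for continuous f on [0, 1].
    """
    j = np.arange(k + 1)
    weights = binom.pmf(j, k, x[:, None])   # P(Bin(k, x) = j) for each x
    return weights @ f(j / k)

f = lambda t: np.sin(2 * np.pi * t)
x = np.linspace(0, 1, 5)
for k in (10, 100, 1000):
    print(k, np.max(np.abs(bernstein_approx(f, x, k) - f(x))))
```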
The Bayesian bootstrap
D*_n = D_{Σ_{i=1}^n δ_{Xi}}: P = Σ_{i=1}^n Wi δ_{Xi}, where (W1, ..., Wn) ∼ Dir(n; 1, ..., 1).
D*_n is actually a resampling distribution like Efron's bootstrap. The main difference from a general Dirichlet posterior is that only finitely many atoms are involved, so simulation is particularly easy. In fact, Wi = Yi / Σ_{j=1}^n Yj, where Yi ∼ Ex(1) i.i.d.
The density of the mean functional µ = ∫ ψ dP under the BB may be obtained analytically:
p(µ | X1, ..., Xn) = (n − 1) Σ_{i=1}^n (ψ(Xi) − µ)_+^{n−2} / ∏_{j≠i}(ψ(Xi) − ψ(Xj)),
a B-spline of order (n − 2) with knots at ψ(X1), ..., ψ(Xn).
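A minimal simulation sketch of the Bayesian bootstrap for the mean functional (the gamma-distributed data are an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(8)

def bayesian_bootstrap_means(data, n_draws=2000):
    """Posterior draws of the mean functional under the Bayesian bootstrap.

    Each draw weights the observed points by Dir(n; 1, ..., 1) weights,
    generated as normalized standard exponentials.
    """
    n = len(data)
    Y = rng.exponential(size=(n_draws, n))
    W = Y / Y.sum(axis=1, keepdims=True)   # each row ~ Dir(n; 1, ..., 1)
    return W @ data

data = rng.gamma(2.0, 1.0, size=100)
draws = bayesian_bootstrap_means(data)
print(data.mean(), draws.mean(), draws.std())   # centered near the sample mean
```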
Consistency
A “what if” study — if the data are generated from a model with true parameter θ0, then the posterior Π(θ | X^(n)) should approach the perfect knowledge δ_{θ0}.
Equivalent to: “For any neighborhood U of θ0, Π(U^c | X^(n)) → 0”.
The prior Π must support θ0; otherwise very little chance of consistency (Dirichlet is an exception). We tend to think (based on experience with the parametric situation) that if the prior puts positive probability in the neighborhood of θ0, we must have consistency, at least when the data are i.i.d. Not quite true in infinite dimension.
Examples of inconsistency
Example (Freedman, 1963, Ann. Math. Statist.)
Infinite dimensional multinomial; p unknown p.m.f.; true p.m.f. p0 is Geo(1/4). We can construct a prior Π which gives positive mass to every neighborhood of p0 but the posterior concentrates at p1 := Geo(3/4).
Example is actually generic: The collection of (p, Π) which leads to consistency is topologically very small.
Examples of inconsistency
Example (Diaconis and Freedman, 1986, Ann. Statist.) To estimate the point of symmetry θ of a symmetric density. Put normal prior on θ and symmetrized DP with Cauchy base measure on the rest of the density. Then there is a trimodal symmetric true density under which the posterior concentrates at two wrong values of θ. Doss (1985, Ann. Statist.) has a similar result for constrained Dirichlet with median zero.
Examples of inconsistency
Example (Kim and Lee, 2001, Ann. Statist.)
Consider estimating the cumulative hazard function H. There are two priors Π1, Π2, both having prior mean equal to the true hazard H0, with Π1 having a uniformly smaller prior variance than Π2, such that the posterior for Π2 is consistent but the posterior for Π1 is inconsistent.
Why consistency matters
It would be embarrassing if the Bayesian method were unable to uncover the truth even with infinitely rich resources.
Consistency makes the estimator more acceptable to other people.
Merging of opinion: two Bayesians with different priors eventually agree iff consistency holds.
Doob's consistency result
Theorem (Doob, 1948)
Consider any sequence of models with observations X^(n) guided by parameter θ, and θ having prior Π. Suppose that θ is a “function of all observations”: θ = f(X^(∞)). Then for any bounded h(θ), the posterior mean converges to the true value, i.e., E[h(θ) | X^(n)] → h(θ0) a.s. under θ0, for almost all θ0 w.r.t. Π.
For i.i.d. observations from an identifiable model, or more generally whenever there is a consistent estimator, Doob's condition holds, so the result is extremely general, and the conclusion is also much stronger than consistency.
However, the null set can spoil the party. Generally, the conclusion is really useful when the parameter space is countable, or the prior is equivalent to some standard measure like the Lebesgue measure.
Notions of consistency
Consistency depends on the choice of neighborhoods, i.e., on the topology or metric.
The weak topology is generated by closeness of expectations of bounded, continuous functions: {P : |∫ ψi dP − ∫ ψi dP0| < ε, i = 1, ..., k}.
On densities, stronger distances are more relevant. The L1 distance between two densities p, q is defined by ‖p − q‖_1 = ∫ |p − q|. The Hellinger distance is ‖√p − √q‖_2 = (∫ (√p − √q)²)^{1/2}. These two distances give the same topology, sometimes called the strong topology.
Consistency for tail-free process
Theorem (Freedman, 1963, Ann. Math. Statist.)
Let Xi | P ∼ P i.i.d., where P is given a tail-free process prior Π. Then the posterior for P is consistent with respect to the weak topology at any P0 in the weak support of Π.
In particular, this means that the posteriors based on the Dirichlet (of course) and Polya tree processes are consistent in the weak topology whenever the true P0 is in the support of the prior.
This holds because of the strong connection between tail-free priors and finite dimensional priors. Because of the weak topology, the calculation can be reduced to a finite dimensional multinomial model, where everything is fine.
The problem is that tail-freeness is a very fragile property, easily destroyed by mixtures and other operations.
Schwartz's theory
Initiated by Schwartz (1965), extending Freedman's work in the discrete case.
Consider i.i.d. observations Xi ∼ p_θ, θ ∈ Θ. Assume the family {p(x, θ) : θ ∈ Θ} is dominated, so densities exist and the Bayes theorem is applicable. Let θ0 be the true value, U a neighborhood of θ0 and Π a prior for θ. To show that
Π(θ ∈ U^c | X^(n)) = [∫_{U^c} ∏_{i=1}^n (p(Xi, θ)/p(Xi, θ0)) dΠ(θ)] / [∫_Θ ∏_{i=1}^n (p(Xi, θ)/p(Xi, θ0)) dΠ(θ)] → 0.
To bound, we can replace Θ in the integral in the denominator by any subset. Since the observations are i.i.d., we can parameterize by the density itself, i.e., θ = p_θ = p, and the true density is p0. We write F for Θ and U for U.
Schwartz's theorem
Theorem
Suppose that (i) the family is “statistically separable”, i.e., there exists a test function Φ_n = Φ_n(X1, ..., Xn) for testing H0 : p = p0 against H : p ∈ U^c such that
P0(Φ_n) ≤ B e^{−bn}, sup_{p∈U^c} P(1 − Φ_n) ≤ B e^{−bn};
(ii) p0 belongs to the Kullback-Leibler support of Π, i.e., Π(p : K(p0, p) < ε) > 0 for all ε > 0, where K(p, q) = ∫ p log(p/q).
Then Π(U^c | X1, ..., Xn) → 0 exponentially fast a.s. [P0^∞].
Idea of the proof
We show that, for some c > 0, ∫_{U^c} ∏_{i=1}^n (p(Xi)/p0(Xi)) dΠ(p) ≤ e^{−nc}, and that for all c > 0, e^{nc} ∫_{{K(p0,p)<ε}} ∏_{i=1}^n (p(Xi)/p0(Xi)) dΠ(p) → ∞.
The first assertion follows from
E[(1 − Φ_n) ∫_{U^c} ∏_{i=1}^n (p(Xi)/p0(Xi)) dΠ(p)] = ∫_{U^c} P(1 − Φ_n) dΠ(p) ≤ e^{−nc}
by the testing condition.
The second assertion is a consequence of Fatou's lemma, since e^{nc} ∏_{i=1}^n (p(Xi)/p0(Xi)) ≈ e^{nc − nK(p0,p)} → ∞ for K(p0, p) < ε < c/2.
Comments on tests
Uniformity in the testing condition is the biggest challenge. Otherwise exponentially powerful tests separating two distributions are always there. If there is a uniformly consistent test, the exponential consistency is obtained automatically from the i.i.d. structure. For the weak topology on distributions, separating tests exist by Hoeffding’s inequality.
Building up tests
The weak topology is too weak for density estimation and many other problems.
For stronger metrics like the Hellinger distance, a test can be built up from more basic tests. From known work of Le Cam, Birgé etc., it is known that exponentially powerful tests exist for any pair of convex sets separated by a positive distance.
Sets like U^c are not convex. But one can cover U^c with small convex balls, each separated from p0, and get a test for each such ball. Consider the maximum of these tests. If the number of balls needed is finite, then the maximum of those tests is a test satisfying the required conditions, with the number appearing as a multiplicative constant in the exponential bound for P0(Φ_n).
Unfortunately, the number is not finite unless the class of densities is compact — a strong condition.
Role of sieves
The class F can be replaced by a subset F_n in the testing condition, provided we can show that Π(F_n^c | X1, ..., Xn) → 0. A sufficient condition to ensure this is that Π(F_n^c) is exponentially small.
This F_n, called a sieve, may be taken to be compact. But now, as it is not fixed, just knowing that finitely many balls cover F_n will not be enough. Its growth has to be slower than exponential to be absorbed in the bound. Thus the number of small balls needed to cover the sieve needs to be estimated.
Entropy
The covering number N(ε, Θ, d) of a metric space (Θ, d) is the minimal number of balls of radius ε needed to cover Θ.
[Figure: covering a set by balls — few balls suffice for big ε, many are needed for small ε.]
Entropy is the logarithm log N(ε, Θ, d).
Consistency in L1
Theorem
Let X1, X2, ... be X-valued i.i.d. random elements with density p ∈ F. Let the true density p0 ∈ KL(Π). Suppose that given any ε > 0, there exist δ < ε/4, c1, c2 > 0, β < ε²/32 and sieves F_n such that
(i) Π(F_n^c) ≤ c1 e^{−c2 n};
(ii) log N(δ, F_n, ‖·‖_1) ≤ nβ, where N stands for the covering number.
Then for all ε > 0, Π(p ∈ F : ‖p − p0‖_1 > ε | X1, ..., Xn) → 0 a.s. [P0^∞].
Kullback-Leibler positivity property
Needs closeness and control of likelihood ratios to make the KL divergence small.
Random p should come close to true p0 with positive probability.
The ratios p0/p need to be bounded above — typically needs lower bounds on p and some integrability of p0.
Preservation of Kullback-Leibler property
The KL property is very stable, unlike the tail-freeness. It is preserved under mixtures, projection on a co-ordinate, taking products, symmetrization, small distortions of the true density like small location shift etc.
Common priors with Kullback-Leibler property
The Polya tree prior with the canonical partition and a parameter sequence a_m growing to meet Σ_{m=1}^∞ a_m^{−1/2} < ∞ satisfies the KL property at p0 under certain integrability conditions.
For the infinite dimensional exponential family or a finite random series prior on the unit interval, only continuity of p0 is needed. Gaussian process priors are included here, but they have a much richer structural property, so much stronger conclusions will hold, to be discussed later.
Kernel mixture prior
For kernel mixture priors, to meet the KL property, one needs full weak support of the prior on the mixing distribution and some integrability conditions on the true p0. Depending on the kernel used, the integrability conditions vary.
For a location-scale kernel, this usually means some moment condition on p0. For the normal kernel, the condition is finiteness of some 2 + δ moment. For thicker tailed kernels, like the logistic or double exponential, only finiteness of a 1 + δ moment is needed. For even thicker tailed kernels like the t-distribution, only a log-moment is needed.
For a kernel on the unit interval like a Bernstein polynomial, only continuity of p0 is needed.
Other kernels can be treated [Wu and Ghosal, 2008, Electron. J. Statist.].
Density estimation
Once the KL property holds, it is only a matter of defining an appropriate sieve F_n, controlling its entropy, and controlling Π(F_n^c). It is a trade-off — neither too big nor too small sieves will work.
For priors with more regular support, the random p's have less complexity, so meeting the conditions on the sieve is easier. This translates into milder conditions on the parameters of the prior and the true density.
For instance, for a Dirichlet mixture of normal prior with a normal base measure, the prior on the bandwidth can be reasonably flexible. In particular, commonly used priors like the inverse-gamma are allowed [Ghosal, Ghosh and Ramamoorthi (1999), Ann. Statist.; Tokdar (2006), Sankhya]. Multivariate normal kernel: [Wu and Ghosal (2010), J. Mult. Anal.].
Semiparametric applications
Schwartz theorem with weak topology on the nuisance nonparametric part is an appropriate tool for dealing with semiparametric problems. KL property plays a vital role. In many cases, even a sieve is not required.
Location problem: X ∼ p_θ = p(· − θ). The Dirichlet process does not admit densities; use a Polya tree, Dirichlet mixtures or any other prior with the KL property.
Interesting observation: (θ, p) ↦ p_θ is bi-continuous, so we just need to get weak consistency for the density p_θ based on i.i.d. observations. The KL property is essentially preserved under location change.
Many other applications are possible: linear regression (using a non-i.i.d. generalization of Schwartz's theory), the exponential frailty model, the Cox proportional hazard model, etc. [Wu and Ghosal, 2008, Sankhya].
Non-i.i.d. generalizations of Schwartz's theorem
The handling of the numerator in Schwartz's theorem only uses properties of the tests, not independence or identical distribution of the observations. Thus if testing conditions are assumed as in the i.i.d. case, the rest goes through.
The denominator is handled through a strong law for log-likelihood ratios, so some condition on the dependence and/or distributions of the observations will be needed.
For independent but not identically distributed (i.n.i.d.) observations, a strong law is available. The KL property is then described by average KL divergences and second moments of log-likelihood ratios [Amewou-Atisso et al., 2003, Bernoulli; Choudhuri et al., 2004, J. Amer. Statist. Assoc.; Wu and Ghosal, 2008, Sankhya].
Tests in such situations are studied by Birgé, or can be constructed directly in specific applications.
Dependent generalizations of Schwartz's theorem
When observations are not independent, Schwartz's theorem can still be extended in certain situations. The dependence must allow a strong law, and the required tests need to be constructed.
A favorable setting is that of ergodic Markov processes. A strong law holds in this case. Tests have been constructed by Birgé for (an analog of) the Hellinger distance. Under specific modeling assumptions on the transition densities, such tests may be constructed directly as well.
Under mild conditions, consistency can be verified for certain Dirichlet process mixture models for conditional densities [Tang and Ghosal, 2007, J. Statist. Plan. Inf.].
Convergence rate of posterior
For a statistical model P_θ^(n) with observations X^(n) and prior Π, the posterior convergence rate is ε_n → 0 at the true value θ0 w.r.t. a metric d if Π(d(θ, θ0) ≥ M_n ε_n | X^(n)) → 0 for every M_n → ∞.
ε_n is the smallest size of a ball around the true θ0 that holds nearly all the posterior mass; it thus measures the speed of convergence — a quantitative form of consistency.
If the posterior converges at rate ε_n, then there is a point estimator converging at the same rate: define it as the center of the ball of radius ε_n containing maximum posterior probability. Thus the posterior convergence rate is restricted by the minimax rate of convergence over all estimators — the optimal rate.
Convergence rate of posterior
In some cases the convergence rate can be obtained by direct calculations. For instance, if the posterior mean and variance can be calculated, then $\Pi(d(\theta, \theta_0) \ge M_n \epsilon_n \mid X^{(n)})$ can be estimated by Chebyshev's inequality. The infinite-dimensional normal model $X_i \stackrel{\mathrm{ind}}{\sim} N(\theta_i, n^{-1})$, $\theta_i \stackrel{\mathrm{ind}}{\sim} N(0, \tau_i^2)$, can be treated in this way. Survival analysis provides other examples. A theory of posterior convergence rates can be built as a quantitative analog of Schwartz's theory (in particular, assuming that the family is dominated). First we consider i.i.d. observations.
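In the infinite-dimensional normal model the posterior is conjugate coordinatewise, so the mean and variance needed for the Chebyshev bound are explicit. A minimal sketch, where the decay $\tau_i^2 = i^{-(2q+1)}$ and the particular true mean are illustrative choices:

    import numpy as np

    rng = np.random.default_rng(1)
    n, dim, q = 1000, 200, 1.0
    i = np.arange(1, dim + 1)
    tau2 = i ** -(2 * q + 1)          # prior variances: theta_i ~ N(0, tau_i^2)
    theta0 = i ** -(q + 1.0)          # a true mean in the matching smoothness class
    X = theta0 + rng.normal(scale=1 / np.sqrt(n), size=dim)

    # Coordinatewise conjugacy: theta_i | X_i ~ N(shrink_i * X_i, shrink_i / n)
    shrink = tau2 / (tau2 + 1.0 / n)
    post_mean = shrink * X
    post_var = shrink / n

    # E||theta - theta0||^2 under the posterior = squared bias + total variance;
    # Chebyshev: Pi(||theta - theta0|| >= M * sqrt(r2) | X) <= 1 / M^2
    r2 = np.sum((post_mean - theta0) ** 2) + np.sum(post_var)
    print("posterior RMS distance from truth:", np.sqrt(r2))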
Convergence rate theorem
Theorem (Ghosal, Ghosh and van der Vaart, 2000, Ann. Statist.)
Let $\Pi_n$ be a sequence of priors on $\mathcal{P}$. Suppose $\bar\epsilon_n, \tilde\epsilon_n \to 0$ with $n \min(\bar\epsilon_n^2, \tilde\epsilon_n^2) \to \infty$, and that there are constants $c_1, c_2, c_3, c_4 > 0$ and sets $\mathcal{P}_n \subset \mathcal{P}$ such that
$$\log N(\bar\epsilon_n, \mathcal{P}_n, d) \le c_1 n \bar\epsilon_n^2, \qquad \Pi_n\big(K(p_0, p) \le \tilde\epsilon_n^2,\ V(p_0, p) \le \tilde\epsilon_n^2\big) \ge c_3 e^{-c_2 n \tilde\epsilon_n^2};$$
then for $\epsilon_n = \max(\bar\epsilon_n, \tilde\epsilon_n)$ and a sufficiently large $M > 0$,
$$\Pi_n\big(p \in \mathcal{P}_n : d(p, p_0) > M\epsilon_n \mid X_1, \ldots, X_n\big) \to 0 \quad \text{in } P_0^n\text{-probability}.$$
If further
$$\Pi_n(\mathcal{P}_n^c) \le c_4 e^{-(c_2 + 4) n \tilde\epsilon_n^2},$$
then $\Pi_n\big(p \in \mathcal{P} : d(p, p_0) > M\epsilon_n \mid X_1, \ldots, X_n\big) \to 0$ in $P_0^n$-probability.
Proof of the theorem
Follow the same decomposition as in the proof of the consistency theorem. The numerator needs tests with error probabilities like $e^{-c n \epsilon_n^2}$. By Le Cam's and Birgé's work, such tests exist when the alternative is a small ball. The $e^{c n \epsilon_n^2}$-type bound on the covering number then ensures that the combined test satisfies the requirements. To bound the denominator, the following lemma is used.
Lemma
$$P_0^n\left( \int \prod_{i=1}^n \frac{p}{p_0}(X_i)\, d\Pi(p) \le \Pi(B(\epsilon; p_0))\, e^{-(1+C) n \epsilon^2} \right) \le \frac{1}{C^2 n \epsilon^2}.$$
Optimal rate by bracketing
Cover a space of densities by $N_{[\,]}(\epsilon_j, \mathcal{P})$ many brackets. Normalize the upper brackets to obtain a discrete approximation and let $\Pi_j$ be uniform on the collection. Take a convex combination of these as the final prior. Then the convergence rate is given by
$$\epsilon_n : \quad \log N_{[\,]}(\epsilon_n, \mathcal{P}) \le n \epsilon_n^2.$$
Often bracketing numbers are essentially equal to the usual covering numbers, so the device produces optimal rates; for instance, for the Hölder $\alpha$-smooth class of densities the entropy is $\epsilon^{-1/\alpha}$, so $\epsilon_n = n^{-\alpha/(1+2\alpha)}$; for monotone densities the entropy is $\epsilon^{-1}$, so $\epsilon_n = n^{-1/3}$.
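For the reader's convenience, the calculation behind the Hölder case (a routine verification, not spelled out on the slide): equating the entropy to $n\epsilon_n^2$,
$$\epsilon_n^{-1/\alpha} \asymp n \epsilon_n^2 \iff \epsilon_n^{-(1+2\alpha)/\alpha} \asymp n \iff \epsilon_n \asymp n^{-\alpha/(1+2\alpha)};$$
the monotone case is the instance $1/\alpha = 1$, giving $\epsilon_n \asymp n^{-1/3}$.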
Optimal rate by log-splines
Density estimation on [0, 1].
Split $[0, 1]$ into $K_n$ equal intervals.
Form an exponential family using the corresponding $J_n$ many B-splines and put a uniform (say) prior on the coefficients. If $p_0$ is $C^\alpha$, the spline density approximates $p_0$ up to $J_n^{-2\alpha}$ in KL. Hellinger and Euclidean metrics are comparable here, so calculations reduce to Euclidean ones.
The (local) entropy grows like $J_n$, while the prior concentrates like $e^{-J_n(c + \log(n\sqrt{J_n}))}$. This leads to the rate equations $n\epsilon_n^2 \sim J_n$ and $\epsilon_n \sim J_n^{-\alpha}$, giving the optimal rate $\epsilon_n = n^{-\alpha/(1+2\alpha)}$.
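A minimal Python sketch of the construction; the knot placement, spline degree, coefficient values, and Riemann-sum normalization are illustrative choices, not part of the references.

    import numpy as np
    from scipy.interpolate import BSpline

    def bspline_design(x, K=10, degree=3):
        """Evaluate all B-spline basis functions on K equal intervals of [0,1] at points x."""
        inner = np.linspace(0, 1, K + 1)
        knots = np.r_[[0.0] * degree, inner, [1.0] * degree]   # clamped knot vector
        J = len(knots) - degree - 1                            # J_n = K + degree basis functions
        B = np.empty((len(x), J))
        for j in range(J):
            c = np.zeros(J); c[j] = 1.0
            B[:, j] = BSpline(knots, c, degree, extrapolate=False)(x)
        return np.nan_to_num(B)

    def logspline_density(theta, grid=np.linspace(0, 1, 501)):
        """Exponential-family density p(x) = exp(sum_j theta_j B_j(x)) / normalizer."""
        unnorm = np.exp(bspline_design(grid, K=len(theta) - 3) @ theta)
        return unnorm / unnorm.mean()      # Riemann normalization over [0,1]

    theta = np.random.default_rng(2).normal(size=13)   # K=10 intervals, cubic: J = 13
    p = logspline_density(theta)
    print(p.mean())                                    # ~= 1: the density integrates to one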
Dirichlet mixture of normal
Reference: Ghosal and van der Vaart, 2001, Ann. Statist.
True density $p_0$ is itself a normal mixture, $p_0(x) = \int \phi_{\sigma_0}(x - z)\, dF_0(z)$ with $\underline\sigma \le \sigma_0 \le \bar\sigma$: the supersmooth case. The basic technique in the calculation of entropy and prior concentration is finding a discrete mixture approximation with only $N = O(\log(1/\epsilon))$ support points, reducing the calculation to $N$ dimensions. The entropy grows like $(\log(1/\epsilon))^2$ and the prior concentration rate is $e^{-c(\log(1/\epsilon))^2}$, leading to the rate $\epsilon_n \sim n^{-1/2} \log n$ in the most favorable situation.
Dirichlet mixture of normal
Reference: Ghosal and van der Vaart, 2007a, Ann. Statist.
Take a prior and scale it by a sequence $\sigma_n$ like $n^{-1/5}$. Approximate $p_0$ by a normal mixture $p_0^*$ with bandwidth $\sigma_n$ up to $\sigma_n^2$, and work with $p_0^*$ as the target. The strategy is similar to before, but the number of support points increases to $N = \sigma_n^{-1} \log\frac{1}{\epsilon}$. The rate equations $n\epsilon_n^2 = \sigma_n^{-1}(\log \epsilon_n^{-1})^2$ and $\epsilon_n = \sigma_n^2$ lead to $\epsilon_n = n^{-2/5}(\log n)^{4/5}$. The usual sieve selection does not give good results; one needs to use structural properties of Dirichlet mixtures to bound the posterior probability of $F[-a, a]^c > \epsilon$, and then construct a much smaller sieve. More recent work of Kruijer et al. (2011, Electron. J. Statist.) improves this by allowing a fixed prior on $\sigma$ and different smoothness levels of $p_0$.
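The solution of these two rate equations, spelled out for convenience (a routine calculation, not on the slide): substituting $\epsilon_n = \sigma_n^2$,
$$n \sigma_n^4 \asymp \sigma_n^{-1} (\log n)^2 \iff \sigma_n \asymp n^{-1/5} (\log n)^{2/5} \implies \epsilon_n = \sigma_n^2 \asymp n^{-2/5} (\log n)^{4/5}.$$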
Bernstein polynomials
Reference: Ghosal (2001), Ann. Statist.; Kruijer and van der Vaart (2008), J. Statist. Plan. Inf. If the true density is itself a Bernstein polynomial, then a nearly parametric rate is obtained. If the true density is $\alpha$-smooth, $0 < \alpha \le 2$, and the prior on the order $k$ has an exponential tail, then the posterior converges at the rate $n^{-\alpha/(2+2\alpha)}$ up to logarithmic factors. The sieve used is a set of Bernstein polynomials of a certain order growing with $n$; the entropy calculation can be related to that of the unit simplex. The rate is far from optimal. A technique called coarsening can improve the rate to the optimal order $n^{-\alpha/(1+2\alpha)}$ up to logarithmic factors, provided $\alpha \le 1$.
Automatic adaptation of kernel mixture priors
In obtaining the posterior convergence rate of a kernel mixture prior, an important step is to find an approximation $p^*$ of the true density $p_0$ within the model $\mathcal{F}$. A natural candidate for $p^*$ is $p_0$ convolved with the kernel. For smoothness level $\alpha$ up to 2 this works well, since the bias (the distance between $p_0$ and $p^*$) is of the order $\sigma^\alpha$. Unfortunately, this does not improve beyond $\sigma^2$ if $p_0$ is actually smoother ($\alpha > 2$). Recently Rousseau (2010, Ann. Statist.) and Kruijer et al. (2011, Electron. J. Statist.) introduced a better approximation of $p_0$ within the model when $p_0$ is smoother.
For the normal mixture model, the idea is to consider $p_1 = p_0 * \phi_\sigma$ if $p_0$ is $C^2$, but for smoothness $\alpha$ between 2 and 4, to replace $p_1$ by $p_2 = \phi_\sigma * (p_0 - (p_1 - p_0))$. Then the KL approximation order improves to $\sigma^{2\alpha}$.
However, the new $p_2$ is not a probability density, so some modification of the approximation is actually needed. The process can be repeated to obtain the correct approximation for a given smoothness level: $p_{j+1} = \phi_\sigma * (p_0 - (p_j - p_{j-1}))$. The resulting high-quality approximation allows a much larger bandwidth than before for smoother densities, and consequently gives a better convergence rate: the optimal rate for the correct smoothness level (up to a logarithmic factor).
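A minimal numerical sketch of the first correction step (the grid, bandwidths, and test density are illustrative choices of mine): note that $p_2 = \phi_\sigma * (2p_0 - p_1)$, so for a smooth $p_0$ its bias is $O(\sigma^4)$ rather than $O(\sigma^2)$.

    import numpy as np

    # Grid on which densities are tabulated; p0 is a smooth test density.
    x = np.linspace(-8, 8, 2001)
    p0 = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

    def smooth(f, sigma):
        """Discrete convolution f * phi_sigma on the grid."""
        kern = np.exp(-0.5 * (x / sigma) ** 2)
        kern /= kern.sum()
        return np.convolve(f, kern, mode="same")

    for sigma in (0.4, 0.2, 0.1):
        p1 = smooth(p0, sigma)              # plain kernel smoothing: bias O(sigma^2)
        p2 = smooth(2 * p0 - p1, sigma)     # corrected approximation: bias O(sigma^4)
        print(sigma, np.abs(p1 - p0).max(), np.abs(p2 - p0).max())
    # Halving sigma divides the first error by about 4 and the second by about 16.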
Note that the approximation is a technical device, unrelated to the method of analysis. The bandwidths are also automatically selected from the appropriate range by the prior. Thus the posterior converges at the right rate near $p_0$ without actually knowing the smoothness of $p_0$ and without using that knowledge in defining the prior on the bandwidth. For technical reasons, however, Dirichlet mixtures are not covered; they worked with finite but arbitrary order mixtures.
Compare this with the classical kernel density estimator. It really tries to estimate $p_0 * \phi_\sigma$ by an empirical average, allowing the bias, which is made small by using a small bandwidth. The order of the bias remains the same even if $p_0$ is smoother: the target does not change, so the rate does not improve. One can reduce the bias using a higher-order kernel (a nuisance, since density estimates can then be negative), but one has to know the correct smoothness order of $p_0$ and use that knowledge in choosing the kernel and the bandwidth. The Bayesian estimator, on the other hand, picks up the right target within the model automatically, without knowing the smoothness of $p_0$, and converges around it at the correct rate. Thus the Bayesian estimator is smarter, automatically adapting to the given situation: the bandwidth problem is solved!
Non-i.i.d. observations
Theorem (Ghosal and van der Vaart, 2007b, Ann. Statist.)
Let $d_n$ and $e_n$ be semimetrics on $\Theta$ for which there exist tests $\phi_n$ such that, for every $\theta_1$ with $d_n(\theta_1, \theta_0) > \epsilon$,
$$P_{\theta_0}^{(n)} \phi_n \le e^{-K n \epsilon^2}, \qquad \sup_{\theta \in \Theta : e_n(\theta, \theta_1) < \epsilon \xi} P_\theta^{(n)} (1 - \phi_n) \le e^{-K n \epsilon^2}.$$
Let $\epsilon_n > 0$, $\epsilon_n \to 0$, with $n\epsilon_n^2$ bounded away from $0$, let $k > 1$, and let $\Theta_n \subset \Theta$ be such that
$$\sup_{\epsilon > \epsilon_n} \log N\Big( \frac{\epsilon \xi}{2},\, \{\theta \in \Theta_n : d_n(\theta, \theta_0) < \epsilon\},\, e_n \Big) \le n \epsilon_n^2,$$
and, for every sufficiently large $j \in \mathbb{N}$,
$$\frac{\Pi_n\big(\theta \in \Theta_n : j\epsilon_n < d_n(\theta, \theta_0) \le 2j\epsilon_n\big)}{\Pi_n\big( K(p_{\theta_0}^{(n)}; p_\theta^{(n)}) \le n\epsilon_n^2,\ V_{k,0}(p_{\theta_0}^{(n)}; p_\theta^{(n)}) \le n^{k/2} \epsilon_n^k \big)} \le e^{K n \epsilon_n^2 j^2 / 2}.$$
Then for every $M_n \to \infty$,
$$P_{\theta_0}^{(n)} \Pi_n\big(\theta \in \Theta_n : d_n(\theta, \theta_0) \ge M_n \epsilon_n \mid X^{(n)}\big) \to 0.$$
I.N.I.D. observations
Theorem (Ghosal and van der Vaart, 2007b, Ann. Statist.)
Let $P_\theta^{(n)}$ be product measures and $d_n^2$ the average squared Hellinger distance. Suppose that there exist $\epsilon_n \to 0$ with $n\epsilon_n^2$ bounded away from $0$, $k > 1$, and sets $\Theta_n \subset \Theta$ such that
$$\sup_{\epsilon > \epsilon_n} \log N\big( \epsilon/36,\, \{\theta \in \Theta_n : d_n(\theta, \theta_0) < \epsilon\},\, d_n \big) \le n \epsilon_n^2,$$
$$\frac{\Pi_n(\Theta_n^c)}{\Pi_n\big(B_n^*(\theta_0, \epsilon_n; k)\big)} = o\big(e^{-2 n \epsilon_n^2}\big),$$
$$\frac{\Pi_n\big(\theta \in \Theta_n : j\epsilon_n < d_n(\theta, \theta_0) \le 2j\epsilon_n\big)}{\Pi_n\big( n^{-1} \sum_{i=1}^n \max\{K(p_{\theta_0,i}, p_{\theta,i}), V(p_{\theta_0,i}, p_{\theta,i})\} \le \epsilon_n^2 \big)} \le e^{n \epsilon_n^2 j^2 / 4}.$$
Then $P_{\theta_0}^{(n)} \Pi_n\big(\theta : d_n(\theta, \theta_0) \ge M_n \epsilon_n \mid X^{(n)}\big) \to 0$ for every $M_n \to \infty$.
Examples
Markov chains
White noise signal model
Gaussian time series
Nonlinear autoregression: $X_i = f(X_{i-1}) + \epsilon_i$
Regression using spline basis expansion
Binary regression with unknown link
Interval censoring
Estimating the spectral density of a stationary time series using the Whittle likelihood
Convergence rates under misspecification
Kleijn and van der Vaart (2006, Ann. Statist.) studied the posterior convergence rate when the true density can lie outside the model. They showed that under certain conditions, the posterior concentrates around the KL projection $p^*$ of the true density $p_0$ on the model $\mathcal{F}_n$ at a rate roughly given by the rate equations $\log N(\epsilon_n, \mathcal{F}_n) \lesssim n\epsilon_n^2$ and $-\log \Pi(K(p^*, \epsilon_n)) \lesssim n\epsilon_n^2$. Technically, the analysis requires a suitable modification of entropy, a weighted version of the Hellinger distance, a modification of KL around $p^*$, and testing against finite measures.
Bayesian adaptation
Commonly, priors appropriate for obtaining the optimal (or at least a good) posterior convergence rate need knowledge of the smoothness class; priors based on brackets or log-splines, for instance, use the smoothness information. Can a single prior give the optimal rate for all classes? If so, the prior is called rate adaptive. Obtaining a procedure that works for all smoothness classes is the classical problem of adaptation. Typically the models are nested, with different convergence rates, especially if they are indexed by a smoothness level.
Natural Bayesian approach: if $\Pi_\alpha$ gives the optimal rate for the class $\mathcal{C}_\alpha$, then a hierarchical mixture prior $\Pi = \int \Pi_\alpha\, d\lambda(\alpha)$ may give the right rate for every class. The strategy works in many cases. One also gets an adaptive point estimator as a by-product.
Infinite dimensional normal
Theorem (Belitser and Ghosal, 2003, Ann. Statist.)
In the infinite-dimensional normal model $X_i \stackrel{\mathrm{ind}}{\sim} N(\theta_i, n^{-1})$, let the true mean satisfy $\sum_{i=1}^\infty i^{2q} \theta_{0i}^2 < \infty$, with $q$ unknown but belonging to a discrete set $Q$. Let $\Pi_q : \theta_i \stackrel{\mathrm{ind}}{\sim} N(0, i^{-(2q+1)})$, $q \sim \lambda_q$, and $\Pi = \sum_q \lambda_q \Pi_q$. Then the posterior converges at the rate $n^{-q_0/(2q_0+1)}$ corresponding to the true value $q_0$ of $q$.
Sketch of the proof
Treat the correct model $q = q_0$, the smoother cases $q > q_0$, and the coarser cases $q < q_0$ separately. Selection step: coarser models have higher complexity; we show $\Pi(q < q_0 \mid X) \to 0$, i.e., the posterior probability of selecting a coarser model is small. This effectively reduces the prior to the form $\sum_{q \ge q_0} \lambda_q \Pi_q$. Compactification step: in the smoother models, the parameter lies inside a compact set $\{\theta : \sum_{i=1}^\infty i^{2q_0} \theta_i^2 \le B,\ q > q_0\}$ for a sufficiently large $B$ with high posterior probability. This allows one to control the covering numbers of the effective part of the smoother models.
The correct model $q = q_0$ can be handled by direct conjugacy calculations. Finally, in the reduced parameter space $\{\theta : \sum_{i=1}^\infty i^{2q_0} \theta_i^2 \le B\}$, the general theory applies.
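Because the model is conjugate, the posterior weights of the component priors $\Pi_q$ are explicit: under model $q$ the marginal likelihood is a product of $N(0, i^{-(2q+1)} + n^{-1})$ densities. A minimal sketch, where the finite truncation of the sequence, the candidate set $Q$, and the flat weights are illustrative choices:

    import numpy as np

    rng = np.random.default_rng(3)
    n, dim = 2000, 300
    i = np.arange(1, dim + 1)
    Q = np.array([0.5, 1.0, 1.5, 2.0])    # candidate smoothness levels (illustrative)
    q0 = 1.0
    theta0 = i ** -(q0 + 0.51)            # true mean: sum_i i^{2 q0} theta0_i^2 < infinity
    X = theta0 + rng.normal(scale=1 / np.sqrt(n), size=dim)

    def log_marginal(q):
        """Log marginal likelihood of model q: X_i ~ N(0, i^{-(2q+1)} + 1/n) independently."""
        var = i ** -(2 * q + 1) + 1.0 / n
        return -0.5 * np.sum(np.log(2 * np.pi * var) + X**2 / var)

    logm = np.array([log_marginal(q) for q in Q])          # flat weights lambda_q
    post = np.exp(logm - logm.max()); post /= post.sum()   # posterior over models
    print(dict(zip(Q, np.round(post, 3))))                 # mass should concentrate near q0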
Adaptation in density estimation
Reference: Ghosal, Lember and van der Vaart, 2008, Electron. J. Statist.
Consider a setting with countably many models $\mathcal{P}_{n,\alpha}$; in model $\alpha$, the entropy-matching rate is $\epsilon_{n,\alpha}$.
Let $B_{n,\alpha}(\epsilon)$ and $C_{n,\alpha}(\epsilon)$ respectively be KL and ordinary neighborhoods.
Let $\beta_n$ be the best model index for the true density $p_0$. Let $\mathcal{A}_{n,\ge\beta_n}$ and $\mathcal{A}_{n,<\beta_n}$ respectively denote the smoother and coarser models.
Main theorem
Theorem. Assume that
$$\frac{\lambda_{n,\alpha}\, \Pi_{n,\alpha}\big(C_{n,\alpha}(i\epsilon_{n,\alpha})\big)}{\lambda_{n,\beta_n}\, \Pi_{n,\beta_n}\big(B_{n,\beta_n}(\epsilon_{n,\beta_n})\big)} \le \mu_{n,\alpha}\, e^{L i^2 n \epsilon_{n,\alpha}^2}, \qquad \alpha < \beta_n,\ i \ge I,$$
$$\frac{\lambda_{n,\alpha}\, \Pi_{n,\alpha}\big(C_{n,\alpha}(i\epsilon_{n,\beta_n})\big)}{\lambda_{n,\beta_n}\, \Pi_{n,\beta_n}\big(B_{n,\beta_n}(\epsilon_{n,\beta_n})\big)} \le \mu_{n,\alpha}\, e^{L i^2 n \epsilon_{n,\beta_n}^2}, \qquad \alpha \ge \beta_n,\ i \ge I,$$
$$\sum_{\alpha \in A_n : \alpha < \beta_n} \frac{\lambda_{n,\alpha}\, \Pi_{n,\alpha}\big(C_{n,\alpha}(I B \epsilon_{n,\alpha})\big)}{\lambda_{n,\beta_n}\, \Pi_{n,\beta_n}\big(B_{n,\beta_n}(\epsilon_{n,\beta_n})\big)} = o\big(e^{-2 n \epsilon_{n,\beta_n}^2}\big),$$
and $\sum_{\alpha \in A_n} \sqrt{\mu_{n,\alpha}} \le e^{n \epsilon_{n,\beta_n}^2}$. If $n \epsilon_{n,\beta_n}^2 \to \infty$, then under some restrictions on the constants, the posterior converges at the rate $\epsilon_{n,\beta_n}$, i.e., is rate adaptive.
Two models context
It is somewhat easier to comprehend what is going on by looking at a two-model family.
Corollary
Assume that the entropy condition holds for $\alpha \in A_n = \{1, 2\}$ and sequences $\epsilon_{n,1} > \epsilon_{n,2}$.
(i) If $\Pi_{n,1}\big(B_{n,1}(\epsilon_{n,1})\big) \ge e^{-n\epsilon_{n,1}^2}$ and $\lambda_{n,2}/\lambda_{n,1} \le e^{n\epsilon_{n,1}^2}$, then the posterior converges at the rate $\epsilon_{n,1}$.
(ii) If $\Pi_{n,2}\big(B_{n,2}(\epsilon_{n,2})\big) \ge e^{-n\epsilon_{n,2}^2}$, $\lambda_{n,2}/\lambda_{n,1} \ge e^{-n\epsilon_{n,1}^2}$, and $\Pi_{n,1}\big(C_{n,1}(I\epsilon_{n,1})\big) \le (\lambda_{n,2}/\lambda_{n,1})\, o\big(e^{-3n\epsilon_{n,2}^2}\big)$ for every $I$, then the posterior converges at the rate $\epsilon_{n,2}$.
The range of relative weights allowed is very broad:
$$e^{-n\epsilon_{n,1}^2} \le \frac{\lambda_{n,2}}{\lambda_{n,1}} \le e^{n\epsilon_{n,1}^2}.$$
Log-spline prior
Universal weight scheme:
$$\lambda_{n,\alpha} = \frac{\lambda_\alpha\, e^{-C n \epsilon_{n,\alpha}^2}}{\sum_{\gamma \in A_n} \lambda_\gamma\, e^{-C n \epsilon_{n,\gamma}^2}}\, \mathbb{1}_{A_n}(\alpha).$$
Flat priors: $\lambda_{n,\alpha} = \mu_\alpha$; adaptation to the correct rate $n^{-\beta/(2\beta+1)}$ up to a $\sqrt{\log n}$ factor. Finitely many classes, decreasing weights: $\lambda_{n,\alpha} \propto \prod_{\gamma \in A : \gamma < \alpha} (C\epsilon_{n,\gamma})^{J_{n,\gamma}}$. This choice can remove the log factor.
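A minimal sketch of computing the universal weights stably in Python; the candidate smoothness levels, the constant $C$, and the flat base weights $\lambda_\alpha$ are illustrative assumptions.

    import numpy as np

    def universal_weights(n, eps, lam, C=1.0):
        """Universal scheme: lambda_{n,alpha} proportional to lambda_alpha * exp(-C n eps_{n,alpha}^2).

        `eps`: array of rates eps_{n,alpha} over the models in A_n; worked on the
        log scale to avoid underflow of exp(-C n eps^2) for large n.
        """
        logw = np.log(lam) - C * n * eps**2
        logw -= logw.max()                 # stabilize before exponentiating
        w = np.exp(logw)
        return w / w.sum()

    alphas = np.array([0.5, 1.0, 2.0])                  # smoothness levels (illustrative)
    n = 10_000
    eps = n ** (-alphas / (1 + 2 * alphas))             # model-wise optimal rates
    print(universal_weights(n, eps, lam=np.ones(3)))    # weights increase with smoothness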
Bracketing-induced log-splines: first find an $\epsilon_{n,\alpha}$-bracketing, and then find the closest log-spline approximations. The corresponding $\theta$'s are given uniform weights, forming a log-spline prior $\Pi_{n,\alpha}$. Then the universal weights, which increase with smoothness, are used. The resulting prior is rate adaptive.
Model selection
Can the Bayesian device automatically select the correct model in large-sample situations? If one model is simple (a singleton family) and the KL property holds in the other model, then this is a consequence of Doob's theorem and Schwartz's arguments [Dass and Lee, 2004, J. Stat. Plan. Inf.]: because of the point mass at the simple model, Doob's theorem implies that its posterior probability goes to one whenever the simple model holds; the other side follows from consistency. If one model has the KL property and the other does not, then the model with the KL property is chosen [Walker et al., 2005, J. Amer. Statist. Assoc.] under some conditions. If the models have different prior concentration rates and orders of entropy, then the correct model is chosen [Ghosal, Lember and van der Vaart].
Model selection
Theorem (Ghosal, Lember and van der Vaart, 2008). Under the assumptions of the adaptive rate theorem,
$$P_0^n\, \Pi_n\big(\mathcal{A}_{n,<\beta_n} \mid X_1, \ldots, X_n\big) \to 0,$$
$$P_0^n\, \Pi_n\big(\alpha \in \mathcal{A}_{n,\ge\beta_n} : d(p_0, \mathcal{P}_{n,\alpha}) > I B \epsilon_{n,\beta_n} \mid X_1, \ldots, X_n\big) \to 0.$$
In testing a parametric model against a nonparametric alternative, the required conditions, which essentially need weak concentration of the nonparametric prior around the parametric family, hold. Examples:
Testing a parametric family of densities on the unit interval against a Bernstein polynomial prior on the nonparametric alternative.
Testing a parametric family of densities on the unit interval against a log-spline prior on the nonparametric alternative.
Finite-dimensional normal against infinite-dimensional normal.
Gaussian process
When using a Gaussian process as a prior, convergence rates are determined by the complexity of the space the process spans and by the probability of concentration around the true function. Both properties are governed by the geometry of an associated Hilbert space, known as the reproducing kernel Hilbert space (RKHS). To understand the idea of the RKHS, think about the simplest nontrivial Gaussian process, namely a multivariate normal distribution. Only vectors in the linear span of the covariance matrix $\Sigma$ are in the support of the distribution. Further, the elliptical contours described by $\Sigma$ determine the variation in different directions. Thus an intrinsic way to measure variability relative to the inherent variability of the process is $\|h\|_\Sigma^2 = h' \Sigma^{-1} h$.
Reproducing kernel Hilbert space
Consider a Gaussian process $W$ as a random element of a Banach space $(\mathbb{B}, \|\cdot\|)$. A notion of expectation is available for $\mathbb{B}$-valued random variables, known as the Pettis integral. For every element $b^*$ of the dual space $\mathbb{B}^*$, define $S b^* = \mathrm{E}[W b^*(W)]$. On the range, define an inner product $\langle S b_1^*, S b_2^* \rangle = \mathrm{E}[b_1^*(W)\, b_2^*(W)]$. This is analogous to rescaling the usual inner product by the inverse of the covariance matrix in the finite-dimensional case. The corresponding norm is stronger than the Banach space norm. $S\mathbb{B}^*$ is not complete (w.r.t. the new norm) in infinite-dimensional spaces; its completion is a Hilbert space, which we call the RKHS and denote by $\mathbb{H}$. The reproducing property refers to $b_2^*(S b_1^*) = \mathrm{E}[b_1^*(W)\, b_2^*(W)] = \langle S b_1^*, S b_2^* \rangle$.
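In the finite-dimensional case $S b^* = \Sigma b^*$, so the RKHS inner product of $h_1 = \Sigma a$ and $h_2 = \Sigma b$ is $a' \Sigma b$ and the reproducing property can be checked directly. A minimal numerical sketch (the covariance matrix is an arbitrary illustrative choice):

    import numpy as np

    rng = np.random.default_rng(4)
    A = rng.standard_normal((3, 3))
    Sigma = A @ A.T                       # covariance of W ~ N(0, Sigma); S b* = Sigma b*
    a, b = rng.standard_normal(3), rng.standard_normal(3)

    # RKHS inner product of h1 = Sigma a and h2 = Sigma b: <h1, h2>_H = E[(a'W)(b'W)] = a' Sigma b
    h1, h2 = Sigma @ a, Sigma @ b
    inner_H = a @ Sigma @ b
    # Reproducing property: b*(S a*) = b'(Sigma a) equals the RKHS inner product
    print(np.isclose(b @ h1, inner_H))    # True
    # The intrinsic norm ||h||_Sigma^2 = h' Sigma^{-1} h matches <h, h>_H for h = Sigma a
    print(np.isclose(h1 @ np.linalg.solve(Sigma, h1), a @ Sigma @ a))   # True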
Gaussian processes of various smoothness
Sample paths of Brownian motion $B_t$ are smooth of order $< \frac{1}{2}$. Any process whose paths are related to Brownian motion, like the Brownian bridge or the Ornstein-Uhlenbeck process, has the same smoothness. The $k$-fold integrated Brownian motion $W = I^k B$ is smooth of order $< \alpha := k + \frac{1}{2}$. There is a notion of fractional-order integration, allowing $k$ to be a non-integer; the corresponding process, called the Riemann-Liouville process, has smoothness $< \alpha$. Another process, the fractional Brownian motion, which has covariance kernel $\frac{1}{2}(s^{2\alpha} + t^{2\alpha} - |t - s|^{2\alpha})$, also has smoothness of order $< \alpha$. The squared exponential process, with covariance kernel $e^{-|s-t|^2}$, has analytic sample paths.
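A quick way to see the smoothness hierarchy is to simulate the paths; a minimal sketch (Euler discretization on a uniform grid, an illustrative device only):

    import numpy as np

    rng = np.random.default_rng(5)
    m = 2000
    dt = 1.0 / m
    dB = rng.normal(scale=np.sqrt(dt), size=m)
    B = np.cumsum(dB)            # Brownian motion: Holder-smooth of order < 1/2
    IB = np.cumsum(B) * dt       # once-integrated BM: order < 3/2, visibly smoother
    I2B = np.cumsum(IB) * dt     # twice-integrated BM: order < 5/2
    # Crude roughness check: the maximal increment over the mesh shrinks with smoothness
    for path, name in [(B, "B"), (IB, "IB"), (I2B, "I2B")]:
        print(name, np.abs(np.diff(path)).max())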
Role of RKHS
The support of a mean-zero Gaussian process is the norm closure of the RKHS. Small ball probability: $\mathrm{P}(\|W\| < \epsilon)$. Concentration function:
$$\varphi_w(\epsilon) = \inf\{\|h\|_{\mathbb{H}}^2 : \|h - w\| \le \epsilon\} - \log \mathrm{P}(\|W\| < \epsilon).$$
It describes the concentration of $W$ near an element $w \in \bar{\mathbb{H}}$. Borell's inequality: