
Copyright © 2008–2010 John Lafferty, Han Liu, and Larry Wasserman. Do Not Distribute.

Chapter 27 Nonparametric Bayesian Methods


Most of this book emphasizes frequentist methods, especially for nonparametric problems. However, there are Bayesian approaches to many nonparametric problems. In this chapter we present some of the most commonly used nonparametric Bayesian methods. These methods place priors on infinite dimensional spaces. The priors are based on certain stochastic processes called Dirichlet processes and Gaussian processes. In many cases, we cannot write down explicit formulas for the priors. Instead, we give explicit algorithms for drawing from the prior and the posterior.

27.1 What is Nonparametric Bayes?

In parametric Bayesian inference we have a model $\mathcal{M} = \{ f(y \mid \theta):\ \theta \in \Theta \}$ and data $Y_1, \ldots, Y_n \sim f(y \mid \theta)$. We put a prior distribution $\pi(\theta)$ on the parameter $\theta$ and compute the posterior distribution using Bayes' rule:

$$
\pi(\theta \mid Y) = \frac{\mathcal{L}_n(\theta)\,\pi(\theta)}{m(Y)} \tag{27.1}
$$

where $Y = (Y_1, \ldots, Y_n)$, $\mathcal{L}_n(\theta) = \prod_i f(Y_i \mid \theta)$ is the likelihood function, and

$$
m(y) = m(y_1, \ldots, y_n) = \int f(y_1, \ldots, y_n \mid \theta)\,\pi(\theta)\, d\theta = \int \prod_{i=1}^n f(y_i \mid \theta)\,\pi(\theta)\, d\theta
$$

is the marginal distribution for the data induced by the prior and the model. We call $m$ the induced marginal. The model may be summarized as:

$$
\begin{aligned}
\theta &\sim \pi \\
Y_1, \ldots, Y_n \mid \theta &\sim f(y \mid \theta).
\end{aligned}
$$
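A minimal sketch of this setup in the conjugate Normal case, where the posterior is available in closed form; the model, prior parameters, and numbers below are ours, chosen only for illustration.

```python
import numpy as np

# Sketch (not from the text): Y_1,...,Y_n ~ N(theta, 1) with a conjugate
# prior theta ~ N(0, 10), so the posterior is Normal and can be summarized
# by its mean or by drawing samples from it.
rng = np.random.default_rng(0)
y = rng.normal(2.0, 1.0, size=50)                # simulated data
prior_mean, prior_var, sigma2 = 0.0, 10.0, 1.0

post_var = 1.0 / (1.0 / prior_var + len(y) / sigma2)
post_mean = post_var * (prior_mean / prior_var + y.sum() / sigma2)

theta_samples = rng.normal(post_mean, np.sqrt(post_var), size=5000)
print(post_mean, np.quantile(theta_samples, [0.025, 0.975]))
```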

We use the posterior to compute a point estimator such as the posterior mean of $\theta$. We can also summarize the posterior by drawing a large sample $\theta_1, \ldots, \theta_N$ from the posterior $\pi(\theta \mid Y)$ and plotting the samples.

In nonparametric Bayesian inference, we replace the finite dimensional model $\{ f(y \mid \theta):\ \theta \in \Theta \}$ with an infinite dimensional model such as

$$
\mathcal{F} = \left\{ f:\ \int (f''(y))^2\, dy < \infty \right\}. \tag{27.2}
$$

Typically, neither the prior nor the posterior has a density function with respect to a dominating measure. But the posterior is still well defined. On the other hand, if there is a dominating measure for a set of densities $\mathcal{F}$, then the posterior can be found by Bayes' theorem:

$$
\pi_n(A) \equiv \mathbb{P}(f \in A \mid Y) = \frac{\int_A \mathcal{L}_n(f)\, d\pi(f)}{\int_{\mathcal{F}} \mathcal{L}_n(f)\, d\pi(f)} \tag{27.3}
$$

where $A \subset \mathcal{F}$, $\mathcal{L}_n(f) = \prod_i f(Y_i)$ is the likelihood function and $\pi$ is a prior on $\mathcal{F}$. If there is no dominating measure for $\mathcal{F}$ then the posterior still exists but cannot be obtained by simply applying Bayes' theorem. An estimate of $f$ is the posterior mean

$$
\hat{f}(y) = \int f(y)\, d\pi_n(f). \tag{27.4}
$$

A posterior $1-\alpha$ region is any set $A$ such that $\pi_n(A) = 1 - \alpha$.

Several questions arise:

1. How do we construct a prior $\pi$ on an infinite dimensional set $\mathcal{F}$?

2. How do we compute the posterior? How do we draw random samples from the posterior?

3. What are the properties of the posterior?

The answers to the third question are subtle. In finite dimensional models, the inferences provided by Bayesian methods usually are similar to the inferences provided by frequentist methods. Hence, Bayesian methods inherit many properties of frequentist methods: consistency, optimal rates of convergence, frequentist coverage of interval estimates, etc. In infinite dimensional models, this is no longer true. The inferences provided by Bayesian methods do not necessarily coincide with frequentist methods and they do not necessarily have properties like consistency, optimal rates of convergence, or coverage guarantees.

27.2 Distributions on Infinite Dimensional Spaces

To use nonparametric Bayesian inference, we will need to put a prior $\pi$ on an infinite dimensional space. For example, suppose we observe $X_1, \ldots, X_n \sim F$ where $F$ is an unknown distribution. We will put a prior $\pi$ on the set $\mathcal{F}$ of all distributions. In many cases, we cannot explicitly write down a formula for $\pi$ as we can in a parametric model. This leads to the following problem: how do we describe a distribution $\pi$ on an infinite dimensional space? One way to describe such a distribution is to give an explicit algorithm for drawing from the distribution $\pi$. In a certain sense, "knowing how to draw from $\pi$" takes the place of "having a formula for $\pi$." The Bayesian model can be written as

$$
\begin{aligned}
F &\sim \pi \\
X_1, \ldots, X_n \mid F &\sim F.
\end{aligned}
$$

The model and the prior induce a marginal distribution m for (X1,...,Xn),

$$
m(A) = \int P_F(A)\, d\pi(F)
$$

where

$$
P_F(A) = \int I_A(x_1, \ldots, x_n)\, dF(x_1) \cdots dF(x_n).
$$

We call $m$ the induced marginal. Another aspect of describing our Bayesian model will be to give an algorithm for drawing $X = (X_1, \ldots, X_n)$ from $m$.

After we observe the data $X = (X_1, \ldots, X_n)$, we are interested in the posterior distribution

$$
\pi_n(A) \equiv \pi(F \in A \mid X_1, \ldots, X_n). \tag{27.5}
$$

Once again, we will describe the posterior by giving an algorithm for drawing randomly from it. To summarize: in some nonparametric Bayesian models, we describe the prior $\pi$, the induced marginal $m$, and the posterior $\pi_n$ by giving algorithms for sampling from each of them.

27.3 Four Nonparametric Problems

We will focus on four specific problems. The four problems and their most common frequentist and Bayesian solutions are:

Statistical Problem                        Frequentist Approach    Bayesian Approach
Estimating a cdf                           empirical cdf           Dirichlet process
Estimating a density                       kernel smoother         Dirichlet process mixture
Estimating a regression function           kernel smoother         Gaussian process
Estimating several sparse multinomials     empirical Bayes         hierarchical Dirichlet process mixture

27.4 Estimating a cdf

Let $X_1, \ldots, X_n$ be a sample from an unknown cdf (cumulative distribution function) $F$ where $X_i \in \mathbb{R}$. The usual frequentist estimate of $F$ is the empirical distribution function

$$
F_n(x) = \frac{1}{n} \sum_{i=1}^n I(X_i \le x). \tag{27.6}
$$

Recall from equation (6.46) of Chapter 6 that for every $\epsilon > 0$ and every $F$,

$$
P_F\Bigl( \sup_x |F_n(x) - F(x)| > \epsilon \Bigr) \le 2 e^{-2n\epsilon^2}. \tag{27.7}
$$

Setting $\epsilon_n = \sqrt{\frac{1}{2n} \log\bigl(\frac{2}{\alpha}\bigr)}$ we have

$$
\inf_F P_F\Bigl( F_n(x) - \epsilon_n \le F(x) \le F_n(x) + \epsilon_n \ \text{for all } x \Bigr) \ge 1 - \alpha \tag{27.8}
$$

where the infimum is over all cdf's $F$. Thus, $\bigl( F_n(x) - \epsilon_n,\ F_n(x) + \epsilon_n \bigr)$ is a $1-\alpha$ confidence band for $F$; a small numerical sketch of this band follows.
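A minimal sketch of the empirical cdf and the band (27.8), using only NumPy; the simulated data and function name are ours.

```python
import numpy as np

def ecdf_with_dkw_band(x, alpha=0.05):
    """Empirical cdf F_n evaluated at the sorted data, with the 1 - alpha
    DKW band F_n(x) +/- eps_n, where eps_n = sqrt(log(2/alpha) / (2n))."""
    x = np.sort(x)
    n = len(x)
    Fn = np.arange(1, n + 1) / n
    eps = np.sqrt(np.log(2.0 / alpha) / (2.0 * n))
    return x, Fn, np.clip(Fn - eps, 0.0, 1.0), np.clip(Fn + eps, 0.0, 1.0)

grid, Fn, lower, upper = ecdf_with_dkw_band(np.random.default_rng(0).normal(size=100))
```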

To estimate $F$ from a Bayesian perspective we put a prior $\pi$ on the set $\mathcal{F}$ of all cdf's and then compute the posterior distribution on $\mathcal{F}$ given $X = (X_1, \ldots, X_n)$. The most commonly used prior is the Dirichlet process prior, which was invented by the statistician Thomas Ferguson in 1973. The distribution $\pi$ has two parameters, $F_0$ and $\alpha$, and is denoted by $\mathrm{DP}(\alpha, F_0)$. The parameter $F_0$ is a distribution function and should be thought of as a prior guess at $F$. The number $\alpha$ controls how tightly concentrated the prior is around $F_0$. The model may be summarized as:

$$
\begin{aligned}
F &\sim \pi \\
X_1, \ldots, X_n \mid F &\sim F
\end{aligned}
$$

where $\pi = \mathrm{DP}(\alpha, F_0)$.

How to Draw From the Prior. To draw a single random distribution $F$ from $\mathrm{DP}(\alpha, F_0)$ we do the following steps:

1. Draw $s_1, s_2, \ldots$ independently from $F_0$.

2. Draw $V_1, V_2, \ldots \sim \mathrm{Beta}(1, \alpha)$.

3. Let $w_1 = V_1$ and $w_j = V_j \prod_{i=1}^{j-1} (1 - V_i)$ for $j = 2, 3, \ldots$.

4. Let $F$ be the discrete distribution that puts mass $w_j$ at $s_j$, that is, $F = \sum_{j=1}^{\infty} w_j \delta_{s_j}$ where $\delta_{s_j}$ is a point mass at $s_j$.

Figure 27.2. The Chinese restaurant process. A new person arrives and either sits at a table with people or sits at a new table. The probability of sitting at an occupied table is proportional to the number of people at the table.

It is clear from this description that $F$ is discrete with probability one. The construction of the weights $w_1, w_2, \ldots$ is often called the stick breaking process. Imagine we have a stick of unit length. Then $w_1$ is obtained by breaking the stick at the random point $V_1$. The stick now has length $1 - V_1$. The second weight $w_2$ is obtained by breaking a proportion $V_2$ from the remaining stick. The process continues and generates the whole sequence of weights $w_1, w_2, \ldots$. See Figure 27.1. It can be shown that if $F \sim \mathrm{DP}(\alpha, F_0)$ then the mean is $E(F) = F_0$.

You might wonder why this distribution is called a Dirichlet process. The reason is this. Recall that a random vector $P = (P_1, \ldots, P_k)$ has a Dirichlet distribution with parameters $(\alpha, g_1, \ldots, g_k)$ (with $\sum_j g_j = 1$) if the distribution of $P$ has density

$$
f(p_1, \ldots, p_k) = \frac{\Gamma(\alpha)}{\prod_{j=1}^k \Gamma(\alpha g_j)} \prod_{j=1}^k p_j^{\alpha g_j - 1}
$$

over the simplex $\{ p = (p_1, \ldots, p_k):\ p_j \ge 0,\ \sum_j p_j = 1 \}$. Let $(A_1, \ldots, A_k)$ be any partition of $\mathbb{R}$ and let $F \sim \mathrm{DP}(\alpha, F_0)$ be a random draw from the Dirichlet process. Let $F(A_j)$ be the amount of mass that $F$ puts on the set $A_j$. Then $(F(A_1), \ldots, F(A_k))$ has a Dirichlet distribution with parameters $(\alpha, F_0(A_1), \ldots, F_0(A_k))$. In fact, this property characterizes the Dirichlet process.

Figure 27.1. The stick breaking process shows how the weights $w_1, w_2, \ldots$ from the Dirichlet process are constructed. First we draw $V_1, V_2, \ldots \sim \mathrm{Beta}(1, \alpha)$. Then we set $w_1 = V_1$, $w_2 = V_2(1 - V_1)$, $w_3 = V_3(1 - V_1)(1 - V_2)$, $\ldots$.
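The stick breaking construction translates directly into a few lines of code. A minimal sketch; the truncation level `n_atoms` is a numerical convenience rather than part of the model, and the function names are ours.

```python
import numpy as np

def draw_dp_prior(alpha, f0_sampler, n_atoms=500, rng=None):
    """Draw one (truncated) random distribution F ~ DP(alpha, F0),
    returned as atoms s_j and stick-breaking weights w_j."""
    rng = np.random.default_rng(rng)
    s = f0_sampler(n_atoms, rng)                   # atoms s_j drawn i.i.d. from F0
    v = rng.beta(1.0, alpha, size=n_atoms)         # V_j ~ Beta(1, alpha)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))  # w_j = V_j prod_{i<j}(1 - V_i)
    return s, w

# Example: alpha = 10, F0 = N(0, 1); the truncated weights sum to nearly one.
atoms, weights = draw_dp_prior(10.0, lambda m, rng: rng.normal(0.0, 1.0, m))
print(weights.sum())
```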


How to Sample From the Marginal. One way to draw from the induced marginal $m$ is to sample $F \sim \pi$ (as described above) and then draw $X_1, \ldots, X_n$ from $F$. But there is an alternative method, called the Chinese Restaurant Process or infinite Pólya urn (Blackwell and MacQueen, 1973). The algorithm is as follows.

1. Draw $X_1 \sim F_0$.

2. For $i = 2, \ldots, n$: draw

$$
X_i \mid X_1, \ldots, X_{i-1} =
\begin{cases}
X \sim F_{i-1} & \text{with probability } \frac{i-1}{i + \alpha - 1} \\
X \sim F_0 & \text{with probability } \frac{\alpha}{i + \alpha - 1}
\end{cases}
$$

where $F_{i-1}$ is the empirical distribution of $X_1, \ldots, X_{i-1}$.

The sample $X_1, \ldots, X_n$ is likely to have ties since $F$ is discrete. Let $X_1^*, X_2^*, \ldots$ denote the unique values of $X_1, \ldots, X_n$. Define cluster assignment variables $c_1, \ldots, c_n$ where $c_i = j$ means that $X_i$ takes the value $X_j^*$. Let $n_j = |\{ i:\ c_i = j \}|$. Then we can write

$$
X_n =
\begin{cases}
X_j^* & \text{with probability } \frac{n_j}{n + \alpha - 1} \\
X \sim F_0 & \text{with probability } \frac{\alpha}{n + \alpha - 1}.
\end{cases}
$$

In the metaphor of the Chinese restaurant process, when the $n$th customer walks into the restaurant, he sits at table $j$ with probability $n_j/(n + \alpha - 1)$, and occupies a new table with probability $\alpha/(n + \alpha - 1)$. The $j$th table is associated with a "dish" $X_j^* \sim F_0$. Since the process is exchangeable, it induces (by ignoring the $X_j^*$) a partition over the integers $\{1, \ldots, n\}$, which corresponds to a clustering of the indices. See Figure 27.2.
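The Chinese restaurant process gives a direct way to sample $X_1, \ldots, X_n$ from the induced marginal $m$ without ever representing $F$. A minimal sketch, with names of our choosing:

```python
import numpy as np

def crp_sample(n, alpha, f0_sampler, rng=None):
    """Draw X_1,...,X_n from the marginal m: each new X_i copies a previous
    value with probability proportional to its count, or is a fresh draw
    from F0 with probability proportional to alpha."""
    rng = np.random.default_rng(rng)
    x = [f0_sampler(rng)]                        # X_1 ~ F0
    for i in range(1, n):
        if rng.random() < alpha / (i + alpha):   # new "table": fresh F0 draw
            x.append(f0_sampler(rng))
        else:                                    # old "table": copy a previous value
            x.append(x[rng.integers(i)])
    return np.array(x)

draws = crp_sample(20, alpha=2.0, f0_sampler=lambda rng: rng.normal())
print(np.unique(draws).size, "distinct values among", draws.size)
```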

How to Sample From the Posterior. Now suppose that $X_1, \ldots, X_n \sim F$ and that we place a $\mathrm{DP}(\alpha, F_0)$ prior on $F$.

27.9 Theorem. Let $X_1, \ldots, X_n \sim F$ and let $F$ have prior $\pi = \mathrm{DP}(\alpha, F_0)$. Then the posterior $\pi_n$ for $F$ given $X_1, \ldots, X_n$ is $\mathrm{DP}\bigl( \alpha + n,\ \overline{F}_n \bigr)$ where

$$
\overline{F}_n = \frac{n}{n + \alpha}\, F_n + \frac{\alpha}{n + \alpha}\, F_0. \tag{27.10}
$$

Since the posterior is again a Dirichlet process, we can sample from it as we did from the prior, but we replace $\alpha$ with $\alpha + n$ and we replace $F_0$ with $\overline{F}_n$. Thus the posterior mean $\overline{F}_n$ is a convex combination of the empirical distribution and the prior guess $F_0$. Also, the predictive distribution for a new observation $X_{n+1}$ is given by $\overline{F}_n$. To explore the posterior distribution, we could draw many random distribution functions from the posterior. We could then numerically construct two functions $L_n$ and $U_n$ such that

$$
\pi\bigl( L_n(x) \le F(x) \le U_n(x) \ \text{for all } x \mid X_1, \ldots, X_n \bigr) = 1 - \alpha.
$$

This is a $1-\alpha$ Bayesian confidence band for $F$. Keep in mind that this is not a frequentist confidence band. It does not guarantee that

$$
\inf_F P_F\bigl( L_n(x) \le F(x) \le U_n(x) \ \text{for all } x \bigr) = 1 - \alpha.
$$

When $n$ is large, $\overline{F}_n \approx F_n$, in which case there is little difference between the Bayesian and frequentist approaches. The advantage of the frequentist approach is that it does not require specifying $\alpha$ or $F_0$.
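Theorem 27.9 makes posterior simulation as easy as prior simulation: stick-break with concentration $\alpha + n$ and draw atoms from $\overline{F}_n$. A minimal truncated sketch; the function names and truncation level are ours.

```python
import numpy as np

def draw_dp_posterior(data, alpha, f0_sampler, n_atoms=500, rng=None):
    """Draw one F from the posterior DP(alpha + n, F_bar_n) given the data.
    Atoms come from the base F_bar_n: a data point with probability
    n/(n+alpha), a fresh F0 draw with probability alpha/(n+alpha)."""
    rng = np.random.default_rng(rng)
    n = len(data)
    from_data = rng.random(n_atoms) < n / (n + alpha)
    atoms = np.where(from_data,
                     rng.choice(data, size=n_atoms),
                     f0_sampler(n_atoms, rng))
    v = rng.beta(1.0, alpha + n, size=n_atoms)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return atoms, w

# One posterior draw given 25 points from N(5, 1), prior DP(10, N(0, 1)).
rng = np.random.default_rng(0)
x = rng.normal(5.0, 1.0, 25)
atoms, w = draw_dp_posterior(x, 10.0, lambda m, r: r.normal(0.0, 1.0, m), rng=rng)
```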


Figure 27.3. The top left plot shows the discrete probability function resulting from a single draw from the prior, which is a $\mathrm{DP}(\alpha, F_0)$ with $\alpha = 10$ and $F_0 = N(0, 1)$. The top right plot shows the resulting cdf along with $F_0$. The bottom left plot shows a few draws from the posterior based on $n = 25$ observations from a $N(5, 1)$ distribution. The blue line is the posterior mean and the red line is the true $F$. The posterior is biased because of the prior. The bottom right plot shows the empirical distribution function (solid black), the true $F$ (red), the Bayesian posterior mean (blue), and a 95 percent frequentist confidence band.

27.11 Example. Figure 27.3 shows a simple example. The prior is $\mathrm{DP}(\alpha, F_0)$ with $\alpha = 10$ and $F_0 = N(0, 1)$. The top left plot shows the discrete probability function resulting from a single draw from the prior. The top right plot shows the resulting cdf along with $F_0$. The bottom left plot shows a few draws from the posterior based on $n = 25$ observations from a $N(5, 1)$ distribution. The blue line is the posterior mean and the red line is the true $F$. The posterior is biased because of the prior. The bottom right plot shows the empirical distribution function (solid black), the true $F$ (red), the Bayesian posterior mean (blue), and a 95 percent frequentist confidence band. □

27.5 Density Estimation

Let $X_1, \ldots, X_n \sim F$ where $F$ has density $f$ and $X_i \in \mathbb{R}$. Our goal is to estimate $f$. The Dirichlet process is not a useful prior for this problem since it produces discrete distributions, which do not even have densities. Instead, we use a modification of the Dirichlet process. But first, let us review the frequentist approach.

The most common frequentist estimator is the kernel estimator

$$
\hat{f}(x) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h} K\!\left( \frac{x - X_i}{h} \right)
$$

where $K$ is a kernel and $h$ is the bandwidth. A related method for estimating a density is to use a mixture model

$$
f(x) = \sum_{j=1}^{k} w_j f(x; \theta_j).
$$

For example, if $f(x; \theta)$ is Normal then $\theta = (\mu, \sigma)$. The kernel estimator can be thought of as a mixture with $n$ components. In the Bayesian approach we would put a prior on $\theta_1, \ldots, \theta_k$, a prior on $w_1, \ldots, w_k$, and a prior on $k$. We could be more ambitious and use an infinite mixture

$$
f(x) = \sum_{j=1}^{\infty} w_j f(x; \theta_j).
$$

As a prior for the parameters we could take $\theta_1, \theta_2, \ldots$ to be drawn from some $F_0$ and we could take $w_1, w_2, \ldots$ to be drawn from the stick breaking prior. ($F_0$ typically has parameters that require further priors.) This infinite mixture model is known as the Dirichlet process mixture (DPM) model. This infinite mixture is the same as the random distribution $F \sim \mathrm{DP}(\alpha, F_0)$, which had the form $F = \sum_{j=1}^{\infty} w_j \delta_{\theta_j}$, except that the point mass distributions $\delta_{\theta_j}$ are replaced by smooth densities $f(x \mid \theta_j)$. The model may be re-expressed as:

$$
\begin{aligned}
F &\sim \mathrm{DP}(\alpha, F_0) && (27.12) \\
\theta_1, \ldots, \theta_n \mid F &\sim F && (27.13) \\
X_i \mid \theta_i &\sim f(x \mid \theta_i), \quad i = 1, \ldots, n. && (27.14)
\end{aligned}
$$

(In practice, $F_0$ itself has free parameters which also require priors.) Note that in the DPM, the parameters $\theta_i$ of the mixture are sampled from a Dirichlet process. The data $X_i$ are not sampled from a Dirichlet process. Because $F$ is sampled from a Dirichlet process, it will be discrete. Hence there will be ties among the $\theta_i$'s. (Recall our earlier discussion of the Chinese Restaurant Process.) The $k$ distinct values among the $\theta_i$'s define clusters of the observations.

How to Sample From the Prior. Draw $\theta_1, \theta_2, \ldots \sim F_0$ and draw $w_1, w_2, \ldots$ from the stick breaking process. Set $f(x) = \sum_{j=1}^{\infty} w_j f(x; \theta_j)$. The density $f$ is a random draw from the prior. We could repeat this process many times, resulting in many randomly drawn densities from the prior. Plotting these densities could give some intuition about the structure of the prior.
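A minimal sketch of drawing one random density from the DPM prior, assuming a Normal kernel with a fixed illustrative $\sigma$ and a truncated stick; the kernel choice, truncation level, and names are ours.

```python
import numpy as np
from scipy.stats import norm

def draw_dpm_density(alpha, f0_sampler, sigma=0.5, n_atoms=200, rng=None):
    """Draw f(x) = sum_j w_j N(x; theta_j, sigma^2) from a truncated DPM prior."""
    rng = np.random.default_rng(rng)
    theta = f0_sampler(n_atoms, rng)                            # theta_j ~ F0
    v = rng.beta(1.0, alpha, size=n_atoms)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))   # stick-breaking weights
    return lambda x: (w[:, None] * norm.pdf(x[None, :], theta[:, None], sigma)).sum(axis=0)

f = draw_dpm_density(2.0, lambda m, rng: rng.normal(0.0, 2.0, m), rng=1)
grid = np.linspace(-6.0, 6.0, 200)
values = f(grid)          # one random density from the prior, evaluated on a grid
```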

Figure 27.4. Samples from a Dirichlet process mixture model with Gaussian generator, n = 500.

How to Sample From the Prior Marginal. The prior marginal $m$ is

$$
m(x_1, x_2, \ldots, x_n) = \int \prod_{i=1}^n f(x_i \mid F)\, d\pi(F) \tag{27.15}
$$
$$
= \int \prod_{i=1}^n \left( \int f(x_i \mid \theta)\, dF(\theta) \right) d\pi(F). \tag{27.16}
$$

If we want to draw a sample from $m$, we first draw $F$ from a Dirichlet process with parameters $\alpha$ and $F_0$, and then generate the $\theta_i$ independently from this realization. Then we sample $X_i \sim f(x \mid \theta_i)$.

As before, we can also use the Chinese restaurant representation to draw the $\theta_i$'s sequentially. Given $\theta_1, \ldots, \theta_{i-1}$ we draw $\theta_i$ from the distribution proportional to

$$
\alpha F_0(\cdot) + \sum_{j=1}^{i-1} \delta_{\theta_j}(\cdot). \tag{27.17}
$$

Let $\theta_j^*$ denote the unique values among the $\theta_i$, with $n_j$ denoting the number of elements in the cluster for parameter $\theta_j^*$; that is, if $c_1, c_2, \ldots, c_{n-1}$ denote the cluster assignments, $\theta_i = \theta^*_{c_i}$, then $n_j = |\{ i:\ c_i = j \}|$. Then we can write

$$
\theta_n =
\begin{cases}
\theta_j^* & \text{with probability } \frac{n_j}{n + \alpha - 1} \\
\theta \sim F_0 & \text{with probability } \frac{\alpha}{n + \alpha - 1}.
\end{cases} \tag{27.18}
$$

How to Sample From the Posterior. We sample from the posterior by Gibbs sampling (reference to simulation chapter xxxx). Our ultimate goal is to approximate the predictive distribution of a new observation $x_{n+1}$:

$$
\hat{f}(x_{n+1}) \equiv f(x_{n+1} \mid x_1, \ldots, x_n).
$$

This density is our Bayesian density estimator. The Gibbs sampler for the DP mixture is straightforward in the case where the base distribution $F_0$ is conjugate to the data model $f(x \mid \theta)$. Recall that if $f(x \mid \theta)$ is in the exponential family, it can be written in the (canonical) natural parameterization as

$$
f(x \mid \theta) = h(x) \exp\bigl\{ \theta^T x - a(\theta) \bigr\}. \tag{27.19}
$$

The conjugate prior for this model takes the form

$$
p(\theta \mid \lambda = (\lambda_1, \lambda_2)) = g(\theta) \exp\bigl\{ \lambda_1^T \theta - \lambda_2 a(\theta) - b(\lambda_1, \lambda_2) \bigr\}. \tag{27.20}
$$

Here $a(\theta)$ is the moment generating function (log partition function) for the original model, and $b(\lambda)$ is the moment generating function for the prior. The parameter $\lambda$ of the prior has two parts, corresponding to the two components of the vector of sufficient statistics $(\theta, -a(\theta))$. The parameter $\lambda_1$ has the same dimension as the parameter $\theta$ of the model, and $\lambda_2$ is a scalar. To verify conjugacy, note that

$$
\begin{aligned}
p(\theta \mid x, \lambda) &\propto p(x \mid \theta)\, p(\theta \mid \lambda) && (27.21) \\
&\propto h(x) \exp(\theta^T x - a(\theta))\, g(\theta) \exp(\lambda_1^T \theta - \lambda_2 a(\theta) - b(\lambda_1, \lambda_2)) && (27.22) \\
&\propto g(\theta) \exp\bigl( (x + \lambda_1)^T \theta - (\lambda_2 + 1)\, a(\theta) \bigr). && (27.23)
\end{aligned}
$$

The factor $h(x)$ drops out in the normalization. Thus, the parameters of the posterior are $\lambda' = (\lambda_1 + x,\ \lambda_2 + 1)$.

27.24 Example. Take $p(\cdot \mid \mu)$ to be Normal with known variance $\sigma^2$. Thus,

$$
p(x \mid \mu) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) \tag{27.25}
$$
$$
= \underbrace{\frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{x^2}{2\sigma^2} \right)}_{h(x)} \exp\left( \frac{x\mu}{\sigma^2} - \frac{\mu^2}{2\sigma^2} \right). \tag{27.26}
$$

Let $\nu = \mu/\sigma^2$ be the natural parameter. Then

$$
p(x \mid \nu) = \underbrace{\frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{x^2}{2\sigma^2} \right)}_{h(x)} \exp\left( x\nu - \frac{\nu^2 \sigma^2}{2} \right). \tag{27.27}
$$

Thus, $a(\nu) = \nu^2 \sigma^2 / 2$. The conjugate prior then takes the form

$$
p(\mu \mid \lambda_1, \lambda_2) = g(\mu) \exp\left( \lambda_1 \mu - \lambda_2 \frac{\mu^2 \sigma^2}{2} - b(\lambda_1, \lambda_2) \right) \tag{27.28}
$$

where $b(\lambda_1, \lambda_2)$ is chosen so that the prior integrates to one. □

Under conjugacy, the parameters $\theta_1, \ldots, \theta_n$ can be integrated out, and Gibbs sampling is carried out with respect to the cluster assignments $c_1, \ldots, c_n$. Let $c_{-i}$ denote the vector of the $n-1$ cluster assignments for all data points other than $i$. The Gibbs sampler cycles through the indices $i$ according to some schedule, and sets $c_i = k$ according to the conditional probability

$$
p(c_i = k \mid x_{1:n}, c_{-i}, \lambda). \tag{27.29}
$$

This either assigns ci to one of the existing clusters, or starts a new cluster. By the chain rule, we can factor this conditional probability as

$$
p(c_i = k \mid x_{1:n}, c_{-i}, \lambda) = p(c_i = k \mid c_{-i})\; p(x_i \mid x_{-i}, c_{-i}, c_i = k, \lambda). \tag{27.30}
$$

The class assignment probability $p(c_i = k \mid c_{-i})$ is governed by the Pólya urn scheme:

$$
p(c_i = k \mid c_{-i}) \propto
\begin{cases}
\#\{ j:\ c_j = k,\ j \ne i \} & \text{if } k \text{ is an existing cluster} \\
\alpha & \text{if } k \text{ is a new cluster.}
\end{cases} \tag{27.31}
$$

The conditional probability of xi is, by conjugacy, given by

$$
\begin{aligned}
p(x_i \mid x_{-i}, c_{-i}, c_i = k, \lambda) \quad && (27.32) \\
&= p(x_i \mid \text{other } x_j \text{ in cluster } k,\ \lambda) && (27.33) \\
&= \int p(x_i \mid \theta)\, p(\theta \mid \text{other } x_j \text{ in cluster } k,\ \lambda)\, d\theta && (27.34) \\
&= \frac{\exp\Bigl( b\bigl( \lambda_1 + \sum_{j \ne i} 1[c_j = k]\, x_j + x_i,\ \lambda_2 + \sum_{j \ne i} 1[c_j = k] + 1 \bigr) \Bigr)}
        {\exp\Bigl( b\bigl( \lambda_1 + \sum_{j \ne i} 1[c_j = k]\, x_j,\ \lambda_2 + \sum_{j \ne i} 1[c_j = k] \bigr) \Bigr)}. && (27.35)
\end{aligned}
$$

$$
\begin{aligned}
p(x_i \mid F_0) &= \int p(x_i \mid \theta)\, dF_0(\theta) && (27.36) \\
&= \exp\bigl( b(\lambda_1 + x_i,\ \lambda_2 + 1) \bigr). && (27.37)
\end{aligned}
$$

The algorithm iteratively updates the cluster assignments in this manner, until convergence. After appropriate convergence has been determined, the approximation procedure is to collect a set of partitions c(b), for b =1,...,B. The predictive distribution is then approximated as

$$
p(x_{n+1} \mid x_{1:n}, \lambda, \alpha, F_0) \approx \frac{1}{B} \sum_{b=1}^{B} p\bigl( x_{n+1} \mid c^{(b)}_{1:n}, x_{1:n}, \lambda, \alpha, F_0 \bigr) \tag{27.38}
$$

where the probabilities are computed just as in the Gibbs sampling procedure, as described above.
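The conjugate case is concrete enough to sketch in code. The following collapsed Gibbs sampler assumes a Normal data model with known $\sigma$ and a Normal base measure $N(\mu_0, \tau_0^2)$, so the cluster predictive densities are available in closed form; all names, defaults, and the example data are ours, and this is a rough sketch rather than a tuned implementation.

```python
import numpy as np
from scipy.stats import norm

def dpm_collapsed_gibbs(x, alpha, sigma, mu0, tau0, n_iter=200, rng=None):
    """Collapsed Gibbs for a DP mixture of N(theta, sigma^2) kernels with
    conjugate base F0 = N(mu0, tau0^2): only cluster assignments c_i are
    sampled; theta is integrated out."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    n = len(x)
    c = np.zeros(n, dtype=int)                     # start with one big cluster

    def predictive(xi, members):
        # p(x_i | data in the cluster), Normal-Normal conjugacy
        prec = len(members) / sigma**2 + 1.0 / tau0**2
        mean = (x[members].sum() / sigma**2 + mu0 / tau0**2) / prec
        return norm.pdf(xi, mean, np.sqrt(1.0 / prec + sigma**2))

    for _ in range(n_iter):
        for i in range(n):
            c[i] = -1                              # hold x_i out of its cluster
            labels = [k for k in np.unique(c) if k >= 0]
            weights = [np.sum(c == k) * predictive(x[i], np.where(c == k)[0])
                       for k in labels]
            weights.append(alpha * norm.pdf(x[i], mu0, np.sqrt(tau0**2 + sigma**2)))
            probs = np.array(weights) / np.sum(weights)
            j = rng.choice(len(probs), p=probs)
            c[i] = labels[j] if j < len(labels) else max(labels, default=-1) + 1
    return c

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 0.5, 30), rng.normal(3, 0.5, 30)])
print(np.unique(dpm_collapsed_gibbs(data, 1.0, 0.5, 0.0, 3.0)).size, "clusters")
```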

If the base measure $F_0$ is not conjugate, MCMC is significantly more complicated and problematic in high dimensions. See Neal (2000) for a discussion of MCMC algorithms for this case.

The Mean Field Approximation. An alternative to sampling is to use an approximation. Recall that in the mean field variational approximation, we treat all of the variables as independent, and assume a fully factorized variational approximation $q$. The strategy is then to maximize a lower bound on the data likelihood, or equivalently to minimize the KL divergence $D(q \,\|\, p)$ with respect to the variational parameters that determine $q$.

In this setting, the variables we are integrating over are $\theta_j^*$ and $V_j$, for the infinite sequence $j = 1, 2, \ldots$, together with the mixture component indicator variables $Z_i$, for $i = 1, 2, \ldots, n$. Since it is of course not possible to implement an infinite model explicitly, we take a finite variational approximation that corresponds to breaking the stick into $T$ pieces. Thus, we take

$$
q(V_{1:T}, \theta^*_{1:T}, Z_{1:n}) = \prod_{t=1}^{T-1} q_{\gamma_t}(V_t)\; \prod_{t=1}^{T} q_{\tau_t}(\theta^*_t)\; \prod_{i=1}^{n} q_{\phi_i}(Z_i) \tag{27.39}
$$

where each factor has its own variational parameter. Each $q_{\gamma_t}$ is a beta distribution, each $q_{\tau_t}$ is a conjugate distribution over $\theta^*_t$, and each $q_{\phi_i}$ is a $(T-1)$-dimensional multinomial distribution. Note that while there are $T$ mixture components in the variational approximation, the model itself is not truncated.

Let $\lambda$ denote the parameters of the conjugate distribution $F_0$, as we did above for the Gibbs sampler. According to the standard variational procedure, we then bound the log marginal probability of the data from below as

$$
\log p(x_{1:n} \mid \alpha, \lambda) \;\ge\;
\mathbb{E}_q[\log p(V \mid \alpha)] + \mathbb{E}_q[\log p(\theta^* \mid \lambda)]
+ \sum_{i=1}^{n} \Bigl( \mathbb{E}_q[\log \pi_{Z_i}] + \mathbb{E}_q[\log p(x_i \mid \theta^*_{Z_i})] \Bigr) \tag{27.40}
$$
$$
\qquad + \sum_{t=1}^{T-1} H(q_{\gamma_t}) + \sum_{t=1}^{T} H(q_{\tau_t}) + \sum_{i=1}^{n} H(q_{\phi_i}) \tag{27.41}
$$

where $H$ denotes entropy. For details on a coordinate ascent algorithm to optimize this lower bound as a function of the variational parameters, see Blei and Jordan (2005). To estimate the predictive distribution, note first that the true predictive distribution under the stick breaking representation is given by

$$
p(x_{n+1} \mid x_{1:n}, \alpha, \lambda) = \int \sum_{t=1}^{\infty} \pi_t(v)\, p(x_{n+1} \mid \theta^*_t)\; dP(v, \theta^* \mid x, \lambda, \alpha). \tag{27.42}
$$

We approximate this by replacing the true stick breaking distribution with the variational distribution. Since, under the variational approximation, the mixture is truncated and the $V$

and $\theta^*$ variables are conditionally independent, the approximated predictive distribution is thus

$$
p(x_{n+1} \mid x_{1:n}, \alpha, \lambda) \approx \sum_{t=1}^{T} \mathbb{E}_q[\pi_t(V)]\; \mathbb{E}_q\bigl[ p(x_{n+1} \mid \theta^*_t) \bigr]. \tag{27.43}
$$

27.5.1 A Detailed Implementation

To understand better how to use the model, we consider how to use the DPM for estimating a density using a mixture of Normals. There are numerous implementations. We consider the one in Ishwaran and James (2002), because it is very clear and explicit.

The first step is to replace the infinite mixture with a large but finite mixture. Thus we replace the stick-breaking process with $V_1, \ldots, V_{N-1} \sim \mathrm{Beta}(1, \alpha)$ and $w_1 = V_1$, $w_2 = V_2(1 - V_1)$, $\ldots$. This generates $w_1, \ldots, w_N$ which sum to 1. Replacing the infinite mixture with the finite mixture is a numerical trick, not an inferential step, and has little numerical effect as long as $N$ is large. For example, they show that when $n = 1{,}000$ it suffices to use $N = 50$. A full specification of the resulting model, including priors on the hyperparameters, is:

$$
\begin{aligned}
\theta &\sim N(0, A) \\
\alpha &\sim \mathrm{Gamma}(\eta_1, \eta_2) \\
\mu_1, \ldots, \mu_N &\sim N(\theta, B^2) \\
\sigma_1^{-2}, \ldots, \sigma_N^{-2} &\sim \mathrm{Gamma}(\nu_1, \nu_2) \\
K_1, \ldots, K_n &\sim \sum_{j=1}^{N} w_j \delta_j \\
X_i &\sim N(\mu_{K_i}, \sigma_{K_i}^2), \quad i = 1, \ldots, n.
\end{aligned}
$$

The hyperparameters $A, B, \eta_1, \eta_2, \nu_1, \nu_2$ still need to be set. Compare this to kernel density estimation, which requires only the single bandwidth $h$. Ishwaran and James (2002) use $A = 1000$, $\nu_1 = \nu_2 = \eta_1 = \eta_2 = 2$, and they take $B$ to be 4 times the standard deviation of the data. It is now possible to write down a Gibbs sampling algorithm for sampling from the posterior; see Ishwaran and James (2002) for details. The authors apply the method to the problem of estimating the density of thicknesses of stamps issued in Mexico between 1872 and 1874. The final Bayesian density estimator is similar to the kernel density estimator. Of course, this raises the question of whether the Bayesian method is worth all the extra effort.
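A generative sketch of this finite specification, sampling one data set from the model top down. The hyperparameter defaults mirror the text, except that $B$ is passed in rather than set from the data; the Gamma parameterization (shape, rate) and all names are our assumptions.

```python
import numpy as np

def sample_truncated_dpm(n, N=50, A=1000.0, B=2.0, eta=(2.0, 2.0), nu=(2.0, 2.0), rng=None):
    """Sample X_1,...,X_n from the truncated Normal DP mixture specified above."""
    rng = np.random.default_rng(rng)
    theta = rng.normal(0.0, np.sqrt(A))                    # theta ~ N(0, A)
    alpha = rng.gamma(eta[0], 1.0 / eta[1])                # alpha ~ Gamma(eta1, eta2), rate form
    mu = rng.normal(theta, B, size=N)                      # mu_j ~ N(theta, B^2)
    sigma2 = 1.0 / rng.gamma(nu[0], 1.0 / nu[1], size=N)   # 1/sigma_j^2 ~ Gamma(nu1, nu2)
    v = rng.beta(1.0, alpha, size=N)
    v[-1] = 1.0                                            # close the truncated stick
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    k = rng.choice(N, size=n, p=w)                         # component labels K_i
    return rng.normal(mu[k], np.sqrt(sigma2[k]))           # X_i ~ N(mu_{K_i}, sigma_{K_i}^2)

x = sample_truncated_dpm(1000, rng=0)
```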

27.6 Nonparametric Regression

Consider the nonparametric regression model

$$
Y_i = m(X_i) + \epsilon_i, \quad i = 1, \ldots, n \tag{27.44}
$$

where $E(\epsilon_i) = 0$. The frequentist kernel estimator for $m$ is

$$
\hat{m}(x) = \frac{\sum_{i=1}^n Y_i\, K\!\left( \frac{\|x - X_i\|}{h} \right)}{\sum_{i=1}^n K\!\left( \frac{\|x - X_i\|}{h} \right)} \tag{27.45}
$$

where $K$ is a kernel and $h$ is a bandwidth. The Bayesian version requires a prior $\pi$ on the set $\mathcal{M}$ of regression functions. A common choice is the Gaussian process prior.

A stochastic process $m(x)$ indexed by $x \in \mathcal{X} \subset \mathbb{R}^d$ is a Gaussian process if for each $x_1, \ldots, x_n \in \mathcal{X}$ the vector $(m(x_1), m(x_2), \ldots, m(x_n))$ is Normally distributed:

$$
(m(x_1), m(x_2), \ldots, m(x_n)) \sim N(\mu(x), K(x)) \tag{27.46}
$$

where $K_{ij}(x) = K(x_i, x_j)$ is a Mercer kernel (reference xxxx).

Let’s assume that µ =0. Then for given x1,x2,...,xn the density of the Gaussian process prior of m =(m(x1),...,m(xn)) is

$$
\pi(m) = (2\pi)^{-n/2} |K|^{-1/2} \exp\left( -\frac{1}{2} m^T K^{-1} m \right). \tag{27.47}
$$

Under the change of variables $m = K\alpha$, we have that $\alpha \sim N(0, K^{-1})$ and thus

$$
\pi(\alpha) = (2\pi)^{-n/2} |K|^{1/2} \exp\left( -\frac{1}{2} \alpha^T K \alpha \right). \tag{27.48}
$$

Under the additive Gaussian noise model, we observe $Y_i = m(x_i) + \epsilon_i$ where $\epsilon_i \sim N(0, \sigma^2)$. Thus, the log-likelihood is

$$
\log p(y \mid m) = -\frac{1}{2\sigma^2} \sum_i (y_i - m(x_i))^2 + \text{const} \tag{27.49}
$$

and the log-posterior is

$$
\begin{aligned}
\log p(y \mid m) + \log \pi(m) &= -\frac{1}{2\sigma^2} \| y - K\alpha \|_2^2 - \frac{1}{2} \alpha^T K \alpha + \text{const} && (27.50) \\
&= -\frac{1}{2\sigma^2} \| y - K\alpha \|_2^2 - \frac{1}{2} \| \alpha \|_K^2 + \text{const.} && (27.51)
\end{aligned}
$$

What functions have high probability according to the Gaussian process prior? The prior favors $\alpha^T K \alpha$ being small; in terms of $m = K\alpha$ this is $m^T K^{-1} m$. Suppose we consider an eigenvector $v$ of $K$, with eigenvalue $\lambda$, so that $Kv = \lambda v$. Then we have that

$$
v^T K^{-1} v = \frac{1}{\lambda}. \tag{27.52}
$$

Thus, eigenfunctions with large eigenvalues are favored by the prior. These correspond to smooth functions; the eigenfunctions that are very wiggly correspond to small eigenvalues.

In this Bayesian setup, MAP estimation corresponds to Mercer kernel regression, which regularizes the squared error by the RKHS norm $\|\alpha\|_K^2$. The posterior mean is

$$
E(\alpha \mid Y) = \bigl( K + \sigma^2 I \bigr)^{-1} Y \tag{27.53}
$$

and thus

$$
\hat{m} = E(m \mid Y) = K \bigl( K + \sigma^2 I \bigr)^{-1} Y. \tag{27.54}
$$

We see that $\hat{m}$ is nothing but a linear smoother and is, in fact, very similar to the frequentist kernel smoother. Unlike kernel regression, where we just need to choose a bandwidth $h$, here we need to choose the function $K(x, y)$. This is a delicate matter; see xxxx.

Now, to compute the predictive distribution for a new point $Y_{n+1} = m(x_{n+1}) + \epsilon_{n+1}$, we note that $(Y_1, \ldots, Y_n) \sim N(0, K + \sigma^2 I)$. Let $k$ be the vector

$$
k = \bigl( K(x_1, x_{n+1}), \ldots, K(x_n, x_{n+1}) \bigr). \tag{27.55}
$$

Then (Y1,...,Yn+1) is jointly Gaussian with covariance

$$
\begin{pmatrix}
K + \sigma^2 I & k \\
k^T & K(x_{n+1}, x_{n+1}) + \sigma^2
\end{pmatrix}. \tag{27.56}
$$

Therefore, the conditional distribution of $Y_{n+1}$ is

$$
Y_{n+1} \mid Y_{1:n}, x_{1:n} \sim N\Bigl( k^T (K + \sigma^2 I)^{-1} Y,\ \ K(x_{n+1}, x_{n+1}) + \sigma^2 - k^T (K + \sigma^2 I)^{-1} k \Bigr). \tag{27.57}
$$

Note that the above variance differs from the variance estimated using the frequentist method. However, Bayesian Gaussian process regression and kernel regression often lead to similar results. The advantage of kernel regression is that it requires a single parameter $h$ that can be chosen by cross-validation, and its theoretical properties are simple and well understood.
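The posterior mean (27.54) and predictive distribution (27.57) amount to a few lines of linear algebra. A minimal sketch, assuming a squared-exponential kernel as one common choice of $K$ (the text leaves the choice open); data and names are ours.

```python
import numpy as np

def gp_regression(x_train, y_train, x_test, kernel, noise_var):
    """Zero-mean GP regression: predictive mean and variance of Y at x_test
    given noisy observations, following (27.54) and (27.57)."""
    K = kernel(x_train, x_train)
    k_star = kernel(x_train, x_test)                       # n x n_test
    K_noise = K + noise_var * np.eye(len(x_train))
    mean = k_star.T @ np.linalg.solve(K_noise, y_train)    # k^T (K + s^2 I)^{-1} Y
    cov = (kernel(x_test, x_test) + noise_var * np.eye(len(x_test))
           - k_star.T @ np.linalg.solve(K_noise, k_star))
    return mean, np.diag(cov)

def rbf(a, b, scale=1.0):
    # squared-exponential (RBF) kernel, an illustrative choice of K(x, y)
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / scale**2)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 5, 30))
y = np.sin(x) + rng.normal(0, 0.1, 30)
x_new = np.linspace(0, 5, 100)
m_hat, var_hat = gp_regression(x, y, x_new, rbf, noise_var=0.01)
```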

27.7 Estimating Many Multinomials

In many domains, the data naturally fall into groups. In such cases, we may want to model each group using a mixture model, while sharing the mixing components from group to group. For instance, text documents are naturally viewed as groups of words. Each document might be modeled as being generated from a mixture of topics, where a topic assigns high probability to words from a particular semantic theme, with the topics shared across documents; a finance topic might assign high probability to words such as "earnings," "dividend," and "report." Other examples arise in genetics, where each individual has a genetic profile exhibiting a pattern of haplotypes, which can be modeled as arising from a mixture of several ancestral populations.

As we've seen, Dirichlet process mixtures enable the use of mixture models with a potentially infinite number of mixture components, allowing the number of components to be selected adaptively from the data.


Figure 27.5. Mean of a Gaussian process.

A hierarchical Dirichlet process mixture is an extension of the Dirichlet process mixture to grouped data, where the mixing components are shared between groups. Hierarchical modeling is an important method for "borrowing strength" across different populations. In Chapter 14 we discussed a simple hierarchical model that allows for different disease rates in $m$ different cities, where $n_i$ people are selected from the $i$th city, and we observe how many people $X_i$ have the disease being studied. We can think of the probability $\theta_i$ of the disease as a random draw from some distribution $G_0$, so that the hierarchical model can be written as

For each $j = 1, \ldots, m$: (27.58)

$$
\begin{aligned}
\theta_j &\sim G_0 && (27.59) \\
X_j \mid n_j, \theta_j &\sim \mathrm{Binomial}(n_j, \theta_j). && (27.60)
\end{aligned}
$$

It is then of interest to estimate the parameters $\theta_i$ for each city, or the overall disease rate $\int \theta\, d\pi(\theta)$, tasks that can be carried out using Gibbs sampling. This hierarchical model is shown in Figure 27.6.

To apply such a hierarchical model to grouped data, suppose that $F(\theta)$ is a family of distributions for $\theta \in \Theta$, and $G = \sum_{i=1}^{\infty} \pi_i \delta_{\theta_i}$ is a (potentially) infinite mixture of the distributions $\{ F(\theta_i) \}$, where $\sum_i \pi_i = 1$ and $\theta_i \in \Theta$. We denote sampling from this mixture as $X \sim \mathrm{Mix}(G, F)$, meaning the two-step process

$$
\begin{aligned}
Z \mid \pi &\sim \mathrm{Mult}(\pi) && (27.61) \\
X \mid Z &\sim F(\theta_Z). && (27.62)
\end{aligned}
$$

Here's a first effort at forming a nonparametric Bayesian model for grouped data. For each group, draw $G_j$ from a Dirichlet process $\mathrm{DP}(\gamma, H)$. Then, sample the data within group $j$ from the mixture model specified by $G_j$. Thus:


Figure 27.6. A hierarchical model. The parameters $\theta_i$ are sampled conditionally independently from $G_0$, and the observations $X_i$ are made within the $i$th group. The hierarchical structure statistically couples together the groups.

For each $j = 1, \ldots, m$:

(a) Sample $G_j \mid \gamma, H \sim \mathrm{DP}(\gamma, H)$.

(b) For each $i = 1, \ldots, n_j$: sample $X_{ji} \mid G_j \sim \mathrm{Mix}(G_j, F)$.

This process, however, does not satisfy the goal of statistically tying together the groups: each $G_j$ is discrete, and for $j \ne k$, the mixtures $G_j$ and $G_k$ will not share any atoms, with probability one. A simple and elegant solution, proposed by Teh et al. (2006), is to add a layer to the hierarchy, by first sampling $G_0$ from a Dirichlet process $\mathrm{DP}(\gamma, H)$, and then sampling each $G_j$ from the Dirichlet process $\mathrm{DP}(\alpha_0, G_0)$. Drawing $G_0$ from a Dirichlet process ensures (with probability one) that it is a discrete measure, and therefore that the discrete measures $G_j \sim \mathrm{DP}(\alpha_0, G_0)$ have the opportunity to share atoms. This leads to the following procedure. The model is shown graphically in Figure 27.7.

Generative process for a hierarchical Dirichlet process mixture:

1. Sample $G_0 \mid \gamma, H \sim \mathrm{DP}(\gamma, H)$.

2. For each $j = 1, \ldots, m$:

(a) Sample $G_j \mid \alpha_0, G_0 \sim \mathrm{DP}(\alpha_0, G_0)$.

(b) For each $i = 1, \ldots, n_j$: sample $X_{ij} \mid G_j \sim \mathrm{Mix}(G_j, F)$.


Figure 27.7. Left: A Dirichlet process mixture. Right: A hierarchical Dirichlet process mixture. An extra layer is added to the hierarchy to ensure that the mixtures Gj share atoms, by forcing the measure G0 to be discrete.

The hierarchical Dirichlet process can be viewed as a hierarchical $K$-component mixture model, in the limit as $K \to \infty$, in the following way. Let

$$
\begin{aligned}
\theta_k \mid H &\sim H, \quad k = 1, \ldots, K && (27.63) \\
\beta \mid \gamma &\sim \mathrm{Dir}(\gamma/K, \ldots, \gamma/K) && (27.64) \\
\pi_j \mid \alpha_0, \beta &\sim \mathrm{Dir}(\alpha_0 \beta_1, \ldots, \alpha_0 \beta_K), \quad j = 1, \ldots, m && (27.65)
\end{aligned}
$$

$$
X_{ji} \mid \pi_j, \theta \sim \mathrm{Mix}\left( \sum_{k=1}^{K} \pi_{jk}\, \delta_{\theta_k},\ F \right), \quad i = 1, \ldots, n_j. \tag{27.66}
$$

The marginal distribution of $X$ converges to the hierarchical Dirichlet process as $K \to \infty$. This finite version was used for statistical language modeling by MacKay and Peto (1994).
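A generative sketch of this finite approximation, assuming for illustration a Normal data model $F(\theta) = N(\theta, 0.5^2)$ and $H = N(0, 3^2)$; all function names and the example values are ours.

```python
import numpy as np

def sample_finite_hdp(group_sizes, K, gamma, alpha0, h_sampler, f_sampler, rng=None):
    """Sample grouped data from the finite K-component approximation
    (27.63)-(27.66): shared atoms theta_k ~ H, global weights beta, and
    per-group weights pi_j ~ Dir(alpha0 * beta)."""
    rng = np.random.default_rng(rng)
    theta = h_sampler(K, rng)                              # shared atoms from H
    beta = rng.dirichlet(np.full(K, gamma / K))            # global weights
    data = []
    for n_j in group_sizes:
        pi_j = rng.dirichlet(alpha0 * beta)                # group-level weights
        z = rng.choice(K, size=n_j, p=pi_j)                # component indicators Z_ji
        data.append(f_sampler(theta[z], rng))              # X_ji ~ F(theta_{Z_ji})
    return data

groups = sample_finite_hdp(
    group_sizes=[50, 80], K=20, gamma=5.0, alpha0=2.0,
    h_sampler=lambda K, rng: rng.normal(0.0, 3.0, K),
    f_sampler=lambda th, rng: rng.normal(th, 0.5),
)
```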

27.7.1 Gibbs Sampling for the HDP

Suppose that the distribution $H$, which generates the model parameters $\theta$, is conjugate to the data distribution $F$; this allows $\theta$ to be integrated out, circumventing the need to directly sample it. In this case, it is possible to derive several efficient Gibbs sampling algorithms for the hierarchical Dirichlet process mixture. We summarize one such algorithm here; see Teh et al. (2006) for alternatives and further details.

The Gibbs sampler iteratively samples the variables $\beta = (\beta_1, \beta_2, \ldots)$, which are the weights in the stick breaking representation of $G_0$, and the variables $z_j = (z_{j1}, z_{j2}, \ldots, z_{jn_j})$ indicating which mixture components generate the $j$th data group $x_j = (x_{j1}, x_{j2}, \ldots, x_{jn_j})$.

In addition, note that in the mixture $G_j = \sum_{k=1}^{\infty} \pi_{jk}\, \delta_{\theta_{jk}}$ for the $j$th group, an atom $\theta_k$ of $G_0$ can appear multiple times. The variable $m_{jk}$ indicates the number of times component $k$ appears in $G_j$; this is also stochastically sampled in the Gibbs sampler. A dot is used

to denote a marginal count; thus, $m_{\cdot k} = \sum_{j=1}^{m} m_{jk}$ is the number of times component $k$ appears in the mixtures $G_1, \ldots, G_m$. We denote

$$
n_{j \cdot k} = \sum_{i=1}^{n_j} 1[z_{ji} = k] \tag{27.67}
$$

which is the number of times component $k$ is used in generating group $j$. These variables can be given mnemonic interpretations in terms of the "Chinese restaurant franchise" (Teh et al., 2006), an extension of the Chinese restaurant process metaphor to grouped data. Finally, the superscript $\backslash ji$ denotes that the $i$th element of the $j$th group is held out of a calculation. In particular,

$$
n_{j \cdot k}^{\backslash ji} = \sum_{i' \ne i} 1[z_{ji'} = k]. \tag{27.68}
$$

We also use the notation

$$
f_k^{\backslash ji}(x_{ji}) = \frac{\int f(x_{ji} \mid \theta_k) \prod_{la \ne ji,\, z_{la} = k} f(x_{la} \mid \theta_k)\, h(\theta_k)\, d\theta_k}
                                {\int \prod_{la \ne ji,\, z_{la} = k} f(x_{la} \mid \theta_k)\, h(\theta_k)\, d\theta_k} \tag{27.69}
$$

to denote the conditional density of $x_{ji}$ under component $k$, given all of the other data generated from this component. Here $f(\cdot \mid \theta)$ is the density of $F(\theta)$ and $h(\theta)$ is the density of $H(\theta)$. Under the conjugacy assumption, the integrals have closed form expressions.

Using this notation, the Gibbs sampler can be expressed as follows. At each point in the algorithm, a (random) number $K$ of mixture components are active, with weights $\beta_1, \ldots, \beta_K$ satisfying $\sum_{j=1}^{K} \beta_j \le 1$. A weight $\beta_u \ge 0$ is left for an as yet "unassigned" component $k_{\mathrm{new}}$. These weights are updated according to

$$
(\beta_1, \ldots, \beta_K, \beta_u) \sim \mathrm{Dir}(m_{\cdot 1}, \ldots, m_{\cdot K}, \gamma). \tag{27.70}
$$

With $\beta$ fixed, the latent variable $z_{ji}$ for the $i$th data point in group $j$ is sampled according to

$$
p(z_{ji} = k \mid z^{\backslash ji}, \beta) \propto
\begin{cases}
\bigl( n_{j \cdot k}^{\backslash ji} + \alpha_0 \beta_k \bigr)\, f_k^{\backslash ji}(x_{ji}) & \text{if component } k \text{ was previously used} \\
\alpha_0 \beta_u\, f_{k_{\mathrm{new}}}^{\backslash ji}(x_{ji}) & \text{if } k = k_{\mathrm{new}}.
\end{cases} \tag{27.71}
$$

Finally, the variable mjk is updated according to the conditional distribution

$$
p(m_{jk} = m \mid z, \beta) = \frac{\Gamma(\alpha_0 \beta_k)}{\Gamma(\alpha_0 \beta_k + n_{j \cdot k})}\; s(n_{j \cdot k}, m)\, (\alpha_0 \beta_k)^m \tag{27.72}
$$

where $s(n, m)$ are unsigned Stirling numbers of the first kind, which count the number of permutations of $n$ elements having $m$ disjoint cycles.

27.8 The Infinite Hidden Markov Model

27.9 Theoretical Properties of Nonparametric Bayes

In this section we briefly discuss some theoretical properties of nonparametric Bayesian methods. We will focus on density estimation. In frequentist nonparametric inference, procedures are required to have certain guarantees such as consistency and minimaxity. Similar reasoning can be applied to Bayesian procedures. It is desirable, for example, that the posterior distribution $\pi_n$ has mass that is concentrated near the true density function $f$. More specifically, we can ask three specific questions:

1. Is the posterior consistent?

2. Does the posterior concentrate at the optimal rate?

3. Does the posterior have correct coverage?

27.9.1 Consistency

Let $f_0$ denote the true density. By consistency we mean that, when $f_0 \in A$, $\pi_n(A)$ should converge, in some sense, to 1. According to Doob's theorem, consistency holds under very weak conditions.

To state Doob's theorem we need some notation. The prior $\pi$ and the model define a joint distribution $\mu_n$ on sequences $Y^n = (Y_1, \ldots, Y_n)$; namely, for any $B \subset \mathbb{R}^n$,^{25}

$$
\mu_n(Y^n \in B) = \int \mathbb{P}(Y^n \in B \mid f)\, d\pi(f) = \int \int_B f(y_1) \cdots f(y_n)\, dy_1 \cdots dy_n\, d\pi(f). \tag{27.73}
$$

In fact, the model and prior determine a joint distribution $\mu$ on the set of infinite sequences^{26} $\mathcal{Y}_\infty = \{ Y = (y_1, y_2, \ldots) \}$.

27.74 Theorem (Doob 1949). For every measurable $A$,

$$
\mu\Bigl( \lim_{n \to \infty} \pi_n(A) = I(f_0 \in A) \Bigr) = 1. \tag{27.75}
$$

By Doob’s theorem, consistency holds except on a set of probability zero. This sounds good but it isn’t; consider the following example.

27.76 Example. Let $Y_1, \ldots, Y_n \sim N(\theta, 1)$. Let the prior $\pi$ be a point mass at $\theta = 0$. Then the posterior is a point mass at $\theta = 0$. This posterior is inconsistent on the set $N = \mathbb{R} \setminus \{0\}$. This set has probability 0 under the prior, so this does not contradict Doob's theorem. But clearly the posterior is useless.

^{25} More precisely, for any Borel set $B$.
^{26} More precisely, on an appropriate $\sigma$-field over the set of infinite sequences.

Doob's theorem is useless for our purposes because it is solipsistic. The result is with respect to the Bayesian's own distribution $\mu$. Instead, we want to say that the posterior is consistent with respect to $P_0$, the distribution generating the data. To continue, let us define three types of neighborhoods. Let $f$ be a density and let $P_f$ be the corresponding probability measure. A Kullback-Leibler neighborhood around $P_f$ is

$$
B_K(f, \epsilon) = \left\{ P_g:\ \int f(x) \log\left( \frac{f(x)}{g(x)} \right) dx \le \epsilon \right\}. \tag{27.77}
$$

A Hellinger neighborhood around Pf is

$$
B_H(f, \epsilon) = \left\{ P_g:\ \int \bigl( \sqrt{f(x)} - \sqrt{g(x)} \bigr)^2\, dx \le \epsilon^2 \right\}. \tag{27.78}
$$

A weak neighborhood around $P_f$ is

$$
B_W(P, \epsilon) = \Bigl\{ Q:\ d_W(P, Q) \le \epsilon \Bigr\} \tag{27.79}
$$

where $d_W$ is the Prohorov metric

$$
d_W(P, Q) = \inf\Bigl\{ \epsilon > 0:\ P(B) \le Q(B^\epsilon) + \epsilon \ \text{for all } B \Bigr\} \tag{27.80}
$$

where $B^\epsilon = \{ x:\ \inf_{y \in B} \|x - y\| \le \epsilon \}$. Weak neighborhoods are indeed very weak: if $P_g \in B_W(P_f, \epsilon)$, it does not imply that $g$ resembles $f$.

27.81 Theorem (Schwartz 1963). If

$$
\pi(B_K(f_0, \epsilon)) > 0 \quad \text{for all } \epsilon > 0, \tag{27.82}
$$

then, for any $\delta > 0$,

$$
\pi_n(B_W(P_0, \delta)) \xrightarrow{\text{a.s.}} 1 \tag{27.83}
$$

with respect to $P_0$.

This is still unsatisfactory since weak neighborhoods are large. Let $N(\mathcal{M}, \epsilon)$ denote the smallest number of functions $f_1, \ldots, f_N$ such that, for each $f \in \mathcal{M}$, there is an $f_j$ such that $f(x) \le f_j(x)$ for all $x$ and such that $\sup_x (f_j(x) - f(x)) \le \epsilon$. Let $H(\mathcal{M}, \epsilon) = \log N(\mathcal{M}, \epsilon)$.

27.84 Theorem (Barron, Schervish and Wasserman (1999) and Ghosal, Ghosh and Ramamoorthi (1999)). Suppose that

$$
\pi(B_K(f_0, \epsilon)) > 0 \quad \text{for all } \epsilon > 0. \tag{27.85}
$$

Further, suppose there exist $\mathcal{M}_1, \mathcal{M}_2, \ldots$ such that $\pi(\mathcal{M}_j^c) \le c_1 e^{-j c_2}$ and $H(\mathcal{M}_j, \delta) \le c_3 j$ for all large $j$. Then, for any $\delta > 0$,

$$
\pi_n(B_H(P_0, \delta)) \xrightarrow{\text{a.s.}} 1 \tag{27.86}
$$

with respect to $P_0$.

27.87 Example. Recall the Normal means model

$$
Y_i = \theta_i + \frac{1}{\sqrt{n}}\, \epsilon_i, \quad i = 1, 2, \ldots \tag{27.88}
$$

where $\epsilon_i \sim N(0, \sigma^2)$. We want to infer $\theta = (\theta_1, \theta_2, \ldots)$. Assume that $\theta$ is contained in the Sobolev space

$$
\theta \in \Theta = \left\{ \theta:\ \sum_i \theta_i^2\, i^{2p} < \infty \right\}. \tag{27.89}
$$

Recall that the estimator $\hat{\theta}_i = b_i Y_i$ is minimax for this Sobolev space, where $b_i$ was given in Chapter ??. In fact the Efromovich-Pinsker estimator is adaptive minimax over the smoothness index $p$. A simple Bayesian analysis is to use the prior $\pi$ that treats the $\theta_i$ as independent random variables with $\theta_i \sim N(0, \tau_i^2)$ where $\tau_i^2 = i^{-2q}$. Have we really defined a prior on $\Theta$? We need to make sure that $\pi(\Theta) = 1$. Fix $K > 0$. Then,

$$
\pi\left( \sum_i \theta_i^2\, i^{2p} > K \right) \le \frac{\sum_i E_\pi(\theta_i^2)\, i^{2p}}{K} = \frac{\sum_i \tau_i^2\, i^{2p}}{K} = \frac{\sum_i i^{-2(q-p)}}{K}. \tag{27.90}
$$

The numerator is finite as long as $q > p + (1/2)$. Assuming $q > p + (1/2)$ we then see that $\pi\bigl( \sum_i \theta_i^2\, i^{2p} > K \bigr) \to 0$ as $K \to \infty$, which shows that $\pi$ puts all its mass on $\Theta$. But, as we shall see below, the condition $q > p + (1/2)$ is guaranteed to yield a posterior with a suboptimal rate of convergence. □

27.9.2 Rates of Convergence

Here the situation is more complicated. Recall the Normal means model

$$
Y_i = \theta_i + \frac{1}{\sqrt{n}}\, \epsilon_i, \quad i = 1, 2, \ldots \tag{27.91}
$$

where $\epsilon_i \sim N(0, \sigma^2)$. We want to infer $\theta = (\theta_1, \theta_2, \ldots) \in \Theta$ from $Y = (Y_1, Y_2, \ldots)$. Assume that $\theta$ is contained in the Sobolev space

$$
\theta \in \Theta = \left\{ \theta:\ \sum_i \theta_i^2\, i^{2p} < \infty \right\}. \tag{27.92}
$$

The following results are from Zhao (2000), Shen and Wasserman (2001), and Ghosal, Ghosh and van der Vaart (2000).

27.93 Theorem. Put independent Normal priors $\theta_i \sim N(0, \tau_i^2)$ where $\tau_i^2 = i^{-2q}$. The posterior attains the optimal rate only when $q = p + (1/2)$. But then:

$$
\pi(\Theta) = 0 \quad \text{and} \quad \pi(\Theta \mid Y) = 0. \tag{27.94}
$$

27.9.3 Coverage

Suppose $\pi_n(A) = 1 - \alpha$. Does this imply that

$$
P^n_{f_0}(f_0 \in A) \approx 1 - \alpha? \tag{27.95}
$$

or even

$$
\liminf_{n \to \infty} \inf_{f_0} P^n_{f_0}(f_0 \in A) \ge 1 - \alpha? \tag{27.96}
$$

Recall what happens for parametric models: if $A = (-\infty, a]$ and

$$
\mathbb{P}(\theta \in A \mid \text{data}) = 1 - \alpha \tag{27.97}
$$

then

$$
P_\theta(\theta \in A) = 1 - \alpha + O\left( \frac{1}{\sqrt{n}} \right) \tag{27.98}
$$

and, moreover, if we use the Jeffreys prior then

$$
P_\theta(\theta \in A) = 1 - \alpha + O\left( \frac{1}{n} \right). \tag{27.99}
$$

Is the same true for nonparametric models? Unfortunately, no; counterexamples are given by Cox (1993) and Freedman (1999). In their examples, one has:

$$
\pi_n(A) = 1 - \alpha \tag{27.100}
$$

but

$$
\liminf_{n \to \infty} \inf_{f_0} P_{f_0}(f_0 \in A) = 0! \tag{27.101}
$$

27.10 Bibliographic Remarks

Nonparametric Bayes began with Ferguson (1973), who invented the Dirichlet process. The Dirichlet process mixture is due to Escobar and West (1995). Hierarchical Dirichlet models were developed in Teh et al. (2004), Blei et al. (2004), Blei and Jordan (2004) and Blei and Jordan (2005). For Gaussian process priors see, for example, Mackay (1997) and Altun et al. (2004). Theoretical properties of nonparametric Bayesian procedures are studied in numerous places, including Barron et al. (1999); Diaconis and Freedman (1986, 1993, 1997); Freedman (1999); Shen and Wasserman (2001); Ghosal et al. (1999, 2000); Zhao (2000).

Exercises

27.1 Let $w_1, w_2, \ldots$ be the weights generated from the stick-breaking process. Show that $\sum_{j=1}^{\infty} w_j = 1$ with probability 1.

27.2 Let $F \sim \mathrm{DP}(\alpha, F_0)$. Show that $E(F) = F_0$. Show that the prior gets more concentrated around $F_0$ as $\alpha \to \infty$.

27.3 Find a bound on

$$
P\Bigl( \sup_x |\overline{F}_n(x) - F(x)| > \epsilon \Bigr) \tag{27.102}
$$

where $\overline{F}_n$ is defined by (27.10).

27.4 Consider the Dirichlet process $\mathrm{DP}(\alpha, F_0)$.

(a) Set $F_0 = N(0, 1)$. Draw 100 random distributions from the prior and plot them. Try several different values of $\alpha$.

(b) Draw $X_1, \ldots, X_n \sim F$ where $F = N(5, 3)$. Compute and plot the empirical distribution function and plot a 95 percent confidence band. Now compute the Bayesian posterior using a $\mathrm{DP}(\alpha, F_0)$ prior with $F_0 = N(0, 1)$. Note that, to make this realistic, we are assuming that the prior guess $F_0$ is not equal to the true (but unknown) $F$. Plot the Bayes estimator $\overline{F}_n$. (Try a few different values of $\alpha$.) Compute a 95 percent Bayesian confidence band. Repeat the entire process many times and see how often the Bayesian confidence bands actually contain $F$.

27.5 In the hierarchical Dirichlet process, we first draw $G_0 \sim \mathrm{DP}(\gamma, H)$. The stick breaking representation allows us to write this as

$$
G_0 = \sum_{i=1}^{\infty} \beta_i\, \delta_{\theta_i}. \tag{27.103}
$$

The next level in the hierarchy samples $G_j \sim \mathrm{DP}(\alpha_0, G_0)$.

1. Show that, under the stick breaking representation for $G_j$,

$$
G_j = \sum_{k=1}^{\infty} \pi_{jk}\, \delta_{\theta_k} \tag{27.104}
$$

where $\pi_j = (\pi_{j1}, \pi_{j2}, \ldots) \sim \mathrm{DP}(\alpha_0, \beta)$.

2. Show that $\pi_j$ can equivalently be constructed as

$$
V_{jk} \sim \mathrm{Beta}\left( \alpha_0 \beta_k,\ \alpha_0 \Bigl( 1 - \sum_{i=1}^{k} \beta_i \Bigr) \right) \tag{27.105}
$$
$$
\pi_{jk} = V_{jk} \prod_{i=1}^{k-1} (1 - V_{ji}). \tag{27.106}
$$