Lecture Notes on Bayesian Nonparametrics

Peter Orbanz

Version: May 16, 2014

These are class notes for a PhD level course on Bayesian nonparametrics, taught at Columbia University in Fall 2013. This text is a draft. I apologize for the many typos, false statements and poor explanations no doubt still left in the text; I will post corrections as they become available.

Contents

Chapter 1. Terminology
1.1. Models
1.2. Parameters and patterns
1.3. Bayesian and nonparametric Bayesian models

Chapter 2. Clustering and the Dirichlet Process
2.1. Mixture models
2.2. Bayesian mixtures
2.3. Dirichlet processes and stick-breaking
2.4. The posterior of a Dirichlet process
2.5. Gibbs-sampling Bayesian mixtures
2.6. Random partitions
2.7. The Chinese restaurant process
2.8. Power laws and the Pitman-Yor process
2.9. The number of components in a mixture
2.10. Historical references

Chapter 3. Latent features and the Indian buffet process
3.1. Latent feature models
3.2. The Indian buffet process
3.3. Exchangeability in the IBP
3.4. Random measure representation of latent feature models

Chapter 4. Regression and the Gaussian process
4.1. Gaussian processes
4.2. Gaussian process priors and posteriors
4.3. Is the definition meaningful?

Chapter 5. Models as building blocks
5.1. Mixture models
5.2. Hierarchical models
5.3. Covariate-dependent models

Chapter 6. Exchangeability
6.1. Bayesian models and conditional independence
6.2. Prediction and exchangeability
6.3. de Finetti's theorem
6.4. Exchangeable partitions
6.5. Exchangeable arrays
6.6. Applications in Bayesian statistics


Chapter 7. Posterior distributions
7.1. Existence of posteriors
7.2. Bayes' theorem
7.3. Dominated models
7.4. Conjugacy
7.5. Gibbs measures and exponential families
7.6. Conjugacy in exponential families
7.7. Posterior asymptotics, in a cartoon overview

Chapter 8. Random measures
8.1. Sampling models for random measure priors
8.2. Random discrete measures and point processes
8.3. Poisson processes
8.4. Total mass and random CDFs
8.5. Infinite divisibility and subordinators
8.6. Poisson random measures
8.7. Completely random measures
8.8. Normalization
8.9. Beyond the discrete case: General random measures
8.10. Further references

Appendix A. Poisson, gamma and stable distributions
A.1. The Poisson
A.2. The gamma
A.3. The Dirichlet
A.4. The stable

Appendix B. Nice spaces
B.1. Polish spaces
B.2. Standard Borel spaces
B.3. Locally compact spaces

Appendix C. Conditioning
C.1. Probability kernels
C.2. Conditional probability
C.3. Conditional random variables
C.4. Conditional densities

Appendix. Index of definitions
Appendix. Bibliography

CHAPTER 1

Terminology

1.1. Models

The term "model" will be thrown around a lot in the following. By a statistical model on a sample space X, we mean a set of probability measures on X. If we write PM(X) for the space of all probability measures on X, a model is a subset M ⊂ PM(X). The elements of M are indexed by a parameter θ with values in a parameter space T, that is,

M = {Pθ | θ ∈ T} , (1.1)

where each Pθ is an element of PM(X). (We require of course that the set M is measurable in PM(X), and that the assignment θ ↦ Pθ is bijective and measurable.) We call a model parametric if T has finite dimension (which usually means T ⊂ R^d for some d ∈ N). If T has infinite dimension, M is called a nonparametric model. To formulate statistical problems, we assume that n observations x1, . . . , xn with values in X are recorded, which we model as random variables X1, ..., Xn. In classical statistics, we assume that these random variables are generated i.i.d. from a measure in the model, i.e.

X1, ..., Xn ∼iid Pθ for some θ ∈ T . (1.2)

The objective of statistical inference is then to draw conclusions about the value of θ (and hence about the distribution Pθ of the data) from the observations.

Example 1.1 (Parametric and Nonparametric density estimation). There is nothing Bayesian to this example: We merely try to illustrate the difference between parametric and nonparametric methods. Suppose we observe data X1, X2, ... in R and would like to get an estimate of the underlying density. Consider the following two estimators:

Figure 1.1. Density estimation with Gaussians: Maximum likelihood estimation (left) and kernel density estimation (right).


(1) Gaussian fit. We fit a Gaussian density to the data by maximum likelihood estimation.
(2) Kernel density estimator. We again use a Gaussian density function g, in this case as a kernel: For each observation Xi = xi, we add one Gaussian density with mean xi to our model. The density estimate is then the density

pn(x) := (1/n) Σi=1..n g(x | xi, σ) .

Intuitively, we are "smoothing" the data by convolution with a Gaussian. (Kernel estimates usually also decrease the variance with increasing n, but we skip details.)

Figure 1.1 illustrates the two estimators. Now compare the number of parameters used by each of the two estimators:

• The Gaussian maximum likelihood estimate has 2 degrees of freedom (mean and standard deviation), regardless of the sample size n. This model is parametric.
• The kernel estimate requires an additional mean parameter for each additional data point. Thus, the number of degrees of freedom grows linearly with the sample size n. Asymptotically, the number of scalar parameters required is infinite, and to summarize them as a vector in a parameter space T, we need an infinite-dimensional space. /
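To make the contrast concrete, here is a minimal Python sketch of the two estimators. The bandwidth sigma, the function names, and the use of NumPy are illustrative choices, not prescribed by the example.

import numpy as np

def gaussian_mle_density(data):
    # parametric fit: 2 parameters (mean, standard deviation), regardless of n
    mu, sd = data.mean(), data.std()
    return lambda x: np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def kernel_density(data, sigma=0.3):
    # nonparametric estimate: one Gaussian bump per data point, so n parameters
    def p_n(x):
        z = (np.asarray(x)[..., None] - data) / sigma
        return np.exp(-0.5 * z ** 2).sum(-1) / (len(data) * sigma * np.sqrt(2 * np.pi))
    return p_n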

1.2. Parameters and patterns

A helpful intuition, especially for Bayesian nonparametrics, is to think of θ as a pattern that explains the data. Figure 1.2 (left) shows a simple example, a linear regression problem. The dots are the observed data, which show a clear linear trend. The line is the pattern we use to explain the data; in this case, simply a linear function. Think of θ as this function. The parameter space T is hence the set of linear functions on R. Given θ, the distribution Pθ explains how the dots scatter around the line. Since a linear function on R can be specified using two scalars, an offset and a slope, T can equivalently be expressed as R². Comparing to our definitions above, we see that this linear regression model is parametric.

Now suppose the trend in the data is clearly nonlinear. We could then use the set of all functions on R as our parameter space, rather than just linear ones. Of course, we would usually want a regression function to be continuous and reasonably smooth, so we could choose T as, say, the set of all twice continuously differentiable functions on R. An example function θ with data generated from it could then look

like the function in Figure 1.2 (right). The space T is now infinite-dimensional, which means the model is nonparametric.

Figure 1.2. Regression problems: Linear (left) and nonlinear (right). In either case, we regard the regression function (plotted in blue) as the model parameter.

1.3. Bayesian and nonparametric Bayesian models

In Bayesian statistics, we model the parameter as a random variable: The value of the parameter is unknown, and a basic principle of Bayesian statistics is that all forms of uncertainty should be expressed as randomness. We therefore have to consider a random variable Θ with values in T. We make a modeling assumption on how Θ is distributed, by choosing a specific distribution Q and assuming Q = L(Θ). The distribution Q is called the prior distribution (or prior for short) of the model. A Bayesian model therefore consists of a model M as above, called the observation model, and a prior Q. Under a Bayesian model, data is generated in two stages, as

Θ ∼ Q
X1, X2, ... | Θ ∼iid PΘ . (1.3)

This means the data is conditionally i.i.d. rather than i.i.d. Our objective is then to determine the posterior distribution, the conditional distribution of Θ given the data,

Q[Θ ∈ • | X1 = x1, ..., Xn = xn] . (1.4)

This is the counterpart to parameter estimation in the classical approach. The value of the parameter remains uncertain given a finite number of observations, and Bayesian statistics uses the posterior distribution to express this uncertainty. A nonparametric Bayesian model is a Bayesian model whose parameter space has infinite dimension. To define a nonparametric Bayesian model, we have to define a distribution (the prior) on an infinite-dimensional space. A distribution on an infinite-dimensional space T is the law of a stochastic process with paths in T. Such distributions are typically harder to define than distributions on R^d, but we can draw on a large arsenal of tools from stochastic process theory and applied probability.
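As a minimal illustration of the two-stage sampling scheme (1.3), here is a Python sketch for the simplest parametric case. The choice of a Beta prior Q and Bernoulli observations is an assumption made purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
theta = rng.beta(2.0, 2.0)                 # Theta ~ Q                 (stage 1)
x = rng.binomial(1, theta, size=100)       # X_i | Theta ~iid P_Theta  (stage 2)
# Given x, Bayesian inference asks for the posterior of Theta; here it is
# available in closed form by conjugacy: Beta(2 + sum(x), 2 + 100 - sum(x)).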

CHAPTER 2

Clustering and the Dirichlet Process

The first of the basic models we consider is the Dirichlet process, which is used in particular in data clustering. In a clustering problem, we are given observations x1, . . . , xn, and the objective is to subdivide the sample into subsets, the clusters. The observations within each cluster should be mutually similar, in some sense we have to specify. For example, here is a sample containing n = 1000 observations in R²:

It is not hard to believe that this data may consist of three groups, and the objective of a clustering method would be to assign to each observation a cluster label 1, 2 or 3. Such an assignment defines a partition of the index set {1,..., 1000} into three disjoint sets.

2.1. Mixture models

The basic assumption of clustering is that each observation Xi belongs to a single cluster k. We can express the cluster assignment as a random variable Li, that is, Li = k means Xi belongs to cluster k. Since the cluster assignments are not known, this variable is unobserved. We can then obtain the distribution characterizing a single cluster k by conditioning on L,

Pk( • ) := P[X ∈ • | L = k] . (2.1)

Additionally, we can define the probability for a newly generated observation to be in cluster k,

ck := P{L = k} . (2.2)

Clearly, Σk ck = 1, since the ck are probabilities of mutually exclusive events. The distribution of X is then necessarily of the form

P( • ) = Σk∈N ck Pk( • ) . (2.3)

A model of this form is called a mixture distribution. If the number of clusters is finite, i.e. if there is only a finite number K of non-zero probabilities ck, the mixture is called a finite mixture.



Figure 2.1. The simplex ∆3. Each point in the set can be interpreted as a probability measure on three disjoint events. For any finite K, the simplex ∆K can be regarded as a subset of the Euclidean space R^K.

Sequences of the form (ck) are so important in the following that they warrant their own notation: The set of all such sequences is called the simplex, and we denote it as

∆ := { (ck)k∈N | ck ≥ 0 and Σk ck = 1 } . (2.4)

Additionally, we write ∆K for the subset of sequences in which at most the first K entries are non-zero. We now make a second assumption, namely that all Pk are distributions in a parametric model {Pφ | φ ∈ Ωφ} whose elements have a conditional density p(x|φ). If so, we can represent Pk by the density p(x|φk), and P in (2.3) has density

p(x) = Σk∈N ck p(x|φk) . (2.5)

A very useful way to represent this distribution is as follows: Let θ be a discrete probability measure on Ωφ. Such a measure is always of the form¹

θ( • ) = Σk∈N ck δφk( • ) , (2.6)

for some (ck) ∈ ∆ and a sequence of points φk ∈ Ωφ. The Dirac measures δφk are also called the atoms of θ, and the values φk the atom locations. Now if (ck) and (φk) are in particular the same sequences as in (2.5), we can write the mixture density p as

p(x) = Σk∈N ck p(x|φk) = ∫ p(x|φ) θ(dφ) . (2.7)

The measure θ is called the mixing measure; this representation accounts for the name (see Section 5.1 for more on mixtures). Equation (2.7) shows that all model parameters—the sequences (ck) and (φk)—are summarized in the mixing measure.

¹ Recall that the Dirac measure or point mass δφ is the probability measure which assigns mass 1 to the singleton (the one-point set) {φ}. Its most important properties are

δφ(A) = 1 if φ ∈ A, 0 if φ ∉ A,   and   ∫ h(τ) δφ(dτ) = h(φ) (2.8)

for any measurable set A and any measurable function h.

Figure 2.2. Gaussian mixture models. Left: The two-component model with density f(x) = (1/2) g(x|0, 1) + (1/2) g(x|2, 0.5). The red, filled curve is the mixture density; the individual components are plotted for comparison. Middle: Mixture with components identical to the left, but weights changed to c1 = 0.8 and c2 = 0.2. Right: Gaussian mixture with K = 3 components on R². A sample of size n = 1000 from this model is shown in the introduction of Chapter 2.

In the sense of our definition of a model in Section 1.1, we can regard (2.7) as the density of a measure Pθ. If T is a set of discrete probability measures on Ωφ, then M = {Pθ | θ ∈ T} is a model in the sense of Equation (1.1), and we call M a mixture model. To be very clear: All mixture models used in clustering can be parametrized by discrete probability measures. Without further qualification, the term mixture model is often meant to imply that T is the set of all discrete probabilities on the parameter space Ωφ defined by p(x|φ). A finite mixture model of order K is a mixture model with T restricted to no more than K non-zero coefficients.
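As a minimal sketch of (2.7): a finite mixing measure θ, represented by weights c and atom locations phi, turns a Gaussian component density p(x|φ) into a mixture density. The concrete weights, locations, and variance below are illustrative assumptions.

import numpy as np

c   = np.array([0.5, 0.3, 0.2])     # weights (c_k), a point in the simplex
phi = np.array([-2.0, 0.0, 3.0])    # atom locations (phi_k)

def p(x, phi, sigma=1.0):           # component density p(x|phi)
    return np.exp(-0.5 * ((x - phi) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def mixture_density(x):             # p(x) = sum_k c_k p(x|phi_k)
    return sum(ck * p(x, pk) for ck, pk in zip(c, phi))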

2.2. Bayesian mixtures

We have already identified the parameter space T for a mixture model: The set of discrete probability measures on Ωφ, or a suitable subspace thereof. A Bayesian mixture model is therefore a mixture model with a random mixing measure

Θ = Σk∈N Ck δΦk , (2.9)

where I have capitalized the variables Ck and Φk to emphasize that they are now random. The prior of a Bayesian mixture is the law Q of Θ, which is again worth emphasizing: The prior of a Bayesian mixture model is the distribution of a random mixing measure Θ. To define a Bayesian mixture model, we have to choose the component densities p(x|φ) (which also defines Ωφ), and we have to find a way to generate a random probability measure on Ωφ as in (2.9). To do so, we note that to generate Θ, we only have to generate two suitable random sequences (Ck) and (Φk). The easiest way to generate random sequences is to sample their elements i.i.d. from a given distribution, so we begin by choosing a probability measure G on Ωφ and sample

Φ1, Φ2,... ∼iid G. (2.10)

A random measure with this property—i.e. (Φk) is i.i.d. and independent of (Ck)—is called homogeneous.

Figure 2.3. Left: A discrete probability measure θ on Ωφ = R, with K = 3 atoms. The heights of the bars correspond to the weights ci. Right: A Gaussian mixture model with mixing measure θ. Each parametric density p(x|φ) is normal with fixed variance σ² = 0.2 and mean φ.

The weights (Ck) cannot be i.i.d.: We can of course sample i.i.d. from a distribution on [0, 1], but the resulting variables will not add up to 1. In terms of simplicity, the next-best thing to i.i.d. sampling is to normalize an i.i.d. sequence. For a finite mixture model with K components, we can sample K i.i.d. random variables V1, ..., VK in [0, ∞) and define

Ck := Vk / T where T := V1 + ... + VK . (2.11)

This clearly defines a distribution on ∆K. The simplest example of such a distribution is the Dirichlet distribution, which we obtain if the variables Vk have a gamma distribution (cf. Appendix A.3).
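As a sketch of (2.11) with gamma variables: normalizing K i.i.d. gamma draws yields a Dirichlet-distributed weight vector on the simplex (cf. Appendix A.3). The shape parameter choice alpha/K, giving a symmetric Dirichlet, is an illustrative assumption.

import numpy as np

rng = np.random.default_rng(0)
K, alpha = 5, 2.0
V = rng.gamma(alpha / K, 1.0, size=K)   # V_k i.i.d. gamma
C = V / V.sum()                          # C_k = V_k / T, a point in Delta_K
print(C, C.sum())                        # random weights, summing to 1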

2.3. Dirichlet processes and stick-breaking

If the number K of mixture components is infinite, normalizing i.i.d. variables as above fails: An infinite sum of strictly positive i.i.d. variables has to diverge, so we would have T = ∞ almost surely. Nonetheless, there is again a simple solution: We can certainly sample C1 from a probability distribution H on [0, 1]. Once we have observed C1, though, C2 is no longer distributed on [0, 1]—it can only take values in [0, 1 − C1]. Recall that the Ck represent probabilities; we can think of Ik := [0, 1 − (C1 + ... + Ck)] as the remaining probability mass after the first k probabilities Ck have been determined, e.g. for k = 2:

|—— C1 ——|—— C2 ——|———— I2 ————|    (2.12)

Clearly, the distribution of Ck+1, conditionally on the first k values, must be a distribution on the interval Ik. Although this means we cannot use H as the distribution of Ck+1, we see that all we have to do is to scale H to Ik. To generate samples from this scaled distribution, we can first sample Vk from the original H, and then scale Vk as

Ck := |Ik−1| · Vk . (2.13)

Since Ik itself scales from step to step as |Ik| = (1 − Vk)|Ik−1|, we can generate the sequence C1:∞ as

V1, V2, ... ∼iid H   and   Ck := Vk (1 − V1) · · · (1 − Vk−1) . (2.14)

Figure 2.4. Two random measures with K = 10. In both cases, the atom locations Φk are sampled from a N(0, 1) distribution. The weights Ck are drawn from Dirichlet distributions on ∆10 with uniform expected distribution. If the Dirichlet concentration parameter is small (α = 1, left), the variance of the weights is large. A larger concentration parameter (α = 10, right) yields more evenly distributed weights.

More generally, we can sample the variables Vk each from their own distribution Hk on [0, 1], as long as we keep them independent,

V1 ∼ H1, V2 ∼ H2, ... (independently)   and   Ck := Vk (1 − V1) · · · (1 − Vk−1) . (2.15)

The sampling procedure (2.15) is called stick-breaking (think of the interval as a stick from which a fraction Vk of the remainder is repeatedly broken off). Provided E[Vk] > 0, it is not hard to see that (Ck) generated by (2.14) is indeed in ∆. We can now generate a homogeneous random measure with K = ∞ by choosing a specific distribution G in (2.10) and a specific sequence of distributions Hk on [0, 1] in (2.15), and defining

Θ := Σk∈N Ck δΦk . (2.16)

The basic parametric distribution on [0, 1] is the beta distribution. The homogeneous random probability measure defined by choosing H1 = H2 = ... as a beta distribution is the Dirichlet process.

Definition 2.1. If α > 0 and if G is a probability measure on Ωφ, the random discrete probability measure Θ in (2.16) generated by

V1, V2, ... ∼iid Beta(1, α)   and   Ck := Vk (1 − V1) · · · (1 − Vk−1) (2.17)

Φ1, Φ2,... ∼iid G is called a Dirichlet process (DP) with base measure G and concentration α, and we denote its law by DP (α, G). /
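Definition 2.1 translates directly into simulation code. Here is a minimal Python sketch of a truncated stick-breaking draw from DP(α, G), with G = N(0, 1) as an illustrative base measure; the truncation level K is an assumption made for simulation, since the actual random measure has infinitely many atoms.

import numpy as np

def sample_dp(alpha, K=1000, rng=None):
    rng = rng or np.random.default_rng(0)
    V = rng.beta(1.0, alpha, size=K)                        # V_k ~ Beta(1, alpha), i.i.d.
    stick = np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    C = V * stick                                           # C_k = V_k (1-V_1)...(1-V_{k-1})
    Phi = rng.normal(0.0, 1.0, size=K)                      # atom locations Phi_k ~ G
    return C, Phi

C, Phi = sample_dp(alpha=2.0)
print(C.sum())   # close to 1; the remaining mass is lost to truncation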

If we integrate a parametric density p(x|φ) against a random measure Θ generated by a Dirichlet process, we obtain a mixture model

p(x) = Σk∈N Ck p(x|Φk) , (2.18)

Figure 2.5. Random measures sampled from a Dirichlet process with normal base measure. Left: For concentration α = 1, the atom sizes exhibit high variance. Middle: For larger values of the concentration (here α = 10), the atom sizes become more even. Compare this to the behavior of the Dirichlet distribution. Right: Decreasing the variance of the normal base measure changes the distribution of the atoms; the DP concentration is again α = 10.

called a Dirichlet process mixture. Observations X1,X2,... are generated from a DP mixture according to

Θ ∼ DP (α, G0)

Φ1, Φ2,... |Θ ∼iid Θ (2.19)

Xi ∼ p(x|Φi)

The number of non-zero coefficients Ck is now infinite, and the model therefore represents a population subdivided into an infinite number of clusters, although, for a finite sample X1 = x1, ..., Xn = xn, we can of course observe at most n of these clusters.

Remark 2.2. You will have noticed that I have motivated several definitions in this section by choosing them to be as simple and "close to i.i.d." as possible. For Bayesian models, this is important for two reasons:

(1) Dependencies in the prior (such as coupling between the variables Ck and Φk) make it much harder to compute posterior distributions—both computationally (in terms of mathematical complexity and computer time) and statistically (in terms of the amount of data required).
(2) If we choose to use dependent variables, we cannot simply make them "not independent"; rather, we have to choose one specific form of dependency. Any specific form of dependence we choose is a modeling assumption, which we should only impose for good reason. /

2.4. The posterior of a Dirichlet process

So far, we have considered how to generate an instance of a random measure Θ. To use Θ as the parameter variable in a Bayesian model, we have to define how observations are generated in this model, and we then have to determine the posterior distribution. Before we discuss posteriors of mixtures, we first consider a simpler model where observations are generated directly from Θ. That is, we sample:

Θ ∼ DP (α, G0) (2.20) Φ1, Φ2,... |Θ ∼iid Θ

Each sample Φi almost surely coincides with an atom of Θ.

Under the model (2.20), we never actually observe Θ, only the variables Φi. What can we say about their distribution? From the definition of the DP in (2.17), we can see that the first observation Φ1 is simply distributed according to G. That is not the case for Φ2, given Φ1: If we have observed Φ1 = φ, we know Θ must have an atom at φ, and we now could observe either Φ2 = φ again, or another atom.

Theorem 2.3 (Ferguson [12, Theorem 1]). Suppose Θ has a DP(α, G0) distribution and that observations Φ1 = φ1, ..., Φn = φn are generated as in (2.20). Then the posterior distribution of Θ is

P[Θ ∈ • | Φ1, ..., Φn] = DP( αG0 + Σi=1..n δφi ) , (2.21)

and the next observation Φn+1 has conditional distribution

P[Φn+1 ∈ • | Φ1, ..., Φn] = (α / (α + n)) G0 + (1 / (α + n)) Σi=1..n δφi . (2.22)

/

The result shows in particular that the Dirichlet process prior has a conjugate posterior, that is, the posterior is again a Dirichlet process, and its parameters can be computed from the data by a simple formula; we will discuss conjugate posteriors in more detail in Section 7.4. I should stress again that (2.21) is the posterior distribution of a DP under the sampling model (2.20) in which we observe the variables Φi, not under the DP mixture, in which the variables Φi are unobserved. To work with DP mixtures, we address this problem using latent variable algorithms.
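The predictive rule (2.22) can be sampled directly, with Θ integrated out (a generalized Pólya urn). Here is a minimal sketch; the base measure G0 = N(0, 1) and all names are illustrative assumptions.

import numpy as np

def sample_predictive(n, alpha, rng=None):
    rng = rng or np.random.default_rng(0)
    phis = []
    for i in range(n):
        if rng.random() < alpha / (alpha + i):
            phis.append(rng.normal())           # new atom from G0, prob alpha/(alpha+i)
        else:
            phis.append(phis[rng.integers(i)])  # each old value with prob 1/(alpha+i)
    return phis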

2.5. Gibbs-sampling Bayesian mixtures

Random variables like the Φi, which form an "intermediate" unobserved layer of the model, are called latent variables. If a model contains latent variables, we can usually not compute the posterior analytically, since that would involve conditioning on the latent information. There are inference algorithms for latent variable models, however, and they are usually based on either of two different strategies:

(1) Variational algorithms, which upper- or lower-bound the effect of the additional uncertainty introduced by the latent variables, and optimize these bounds instead of optimizing the actual, unknown solution. The EM algorithm for finite mixtures is a (non-obvious) example of a variational method, although a finite mixture is usually not interpreted as a Bayesian model.
(2) Imputation methods, which sample the latent variables and condition on the sampled values. This is typically done using MCMC.

Inference in DP mixtures and other Bayesian mixtures is based on sampling algorithms that use imputation. I would like to stress that in most Bayesian models, we use sampling algorithms because the model contains latent variables. (Most introductory texts on MCMC will motivate sampling algorithms by pointing out that it is often only possible to evaluate a probability density up to scaling, and that such an unnormalized distribution can still be sampled—which is perfectly true, but really beside the point when it comes to latent variable models.)

Gibbs sampling. Suppose we want to simulate samples from a multivariate distribution Q on Ωφ, where we assume Ωφ = R^D, so random draws are of the form Φ = (Φ1, ..., ΦD). A Gibbs sampler loops over the dimensions d = 1, ..., D and samples Φd conditionally on the remaining dimensions. The conditional probability

Q[Φd ∈ • | Φ1 = φ1, ..., Φd−1 = φd−1, Φd+1 = φd+1, ..., ΦD = φD] (2.23)

is called the full conditional distribution of Φd. Recall that the Gibbs sampler for Q is the algorithm which, in its (j + 1)st iteration, samples

Φ1^(j+1) ∼ Q[Φ1 ∈ • | Φ2 = φ2^(j), ..., ΦD = φD^(j)]
...
Φd^(j+1) ∼ Q[Φd ∈ • | Φ1 = φ1^(j+1), ..., Φd−1 = φd−1^(j+1), Φd+1 = φd+1^(j), ..., ΦD = φD^(j)]
...
ΦD^(j+1) ∼ Q[ΦD ∈ • | Φ1 = φ1^(j+1), ..., ΦD−1 = φD−1^(j+1)]

Note that, at each dimension d, the values φ1^(j+1), ..., φd−1^(j+1) generated so far in the current iteration are already used in the conditional, whereas the remaining dimensions φd+1^(j), ..., φD^(j) are filled in from the previous iteration. Since this removal of a single dimension makes notation cumbersome, it is common to write

φ−d := {φ1, . . . , φd−1, φd+1, . . . , φD} , (2.24) so the full conditional of Φd is Q[Φd ∈ • |Φ−d = φ−d] et cetera.
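Before specializing to DP mixtures, it may help to see the generic sampler in code. The following minimal sketch targets a bivariate normal with unit marginals and correlation rho, for which both full conditionals in (2.23) are one-dimensional normals in closed form; the choice of target and all names are illustrative.

import numpy as np

def gibbs_bivariate_normal(rho, iters=5000, rng=None):
    rng = rng or np.random.default_rng(0)
    phi = np.zeros(2)
    samples = []
    for _ in range(iters):
        # full conditional of phi_1 given phi_2: N(rho*phi_2, 1 - rho^2)
        phi[0] = rng.normal(rho * phi[1], np.sqrt(1 - rho ** 2))
        # full conditional of phi_2 given the *updated* phi_1
        phi[1] = rng.normal(rho * phi[0], np.sqrt(1 - rho ** 2))
        samples.append(phi.copy())
    return np.array(samples)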

A naive Gibbs sampler for DP mixtures. In the Dirichlet process mixture model, we generate n observations X1, ..., Xn by generating a latent random measure Θ = Σk∈N Ck δΦk and sampling from Θ:

Θ ∼ DP (αG0)

Φ1,..., Φn ∼ Θ (2.25)

Xi ∼ p( • |Φi) .

Recall that reoccurrences of atom locations are possible: Xi and Xj belong to the same cluster if Φi = Φj. We observe that, under this model:

• The variables Φi are conditionally independent of each other given Θ.
• Φi is conditionally independent of Xj given Φj if j ≠ i. In other words, the conditional variable Φi|Φj is independent of Xj.

To derive a Gibbs sampler, we regard the variables Φi as n coordinate variables. The joint conditional distribution of (Φ1,..., Φn) given the data is complicated, but we can indeed derive the full conditionals

L(Φi|Φ−i,X1:n) (2.26) with relative ease:

• We choose one Φi and condition on the remaining variables Φ−i. Since the Φi are conditionally independent and hence exchangeable, we can choose variables in any arbitrary order.
• If we know Φ−i, we can compute the DP posterior L(Θ|Φ−i).

We know from Theorem 2.3 that

P[Θ ∈ • | Φ−i] = DP( αG0 + Σj≠i δΦj ) . (2.27)

We also know that, if we sample Θ ∼ DP(αG) and Φ ∼ Θ, then Φ has marginal distribution L(Φ) = G. Combined, that yields

P[Φi ∈ • | Φ−i] = (α / (α + (n − 1))) G0 + (1 / (α + (n − 1))) Σj≠i δΦj . (2.28)

To account for the observed data, we additionally have to condition on X1:n. Since Φi|Φj is independent of Xj,

L(Φi|Φ−i,X1:n) = L(Φi|Φ−i,Xi) . (2.29)

To obtain the full conditional of Φi, we therefore only have to condition (2.28) on Xi. To do so, we can think of P[Φi ∈ • | Φ−i] as a prior for Φi, and compute its posterior under a single observation Xi = xi, with likelihood p(x|φ) as given by the mixture model. Since p(x|φ) is typically a parametric model, we can apply Bayes' theorem, and obtain the full conditional

P[Φi ∈ dφ | Φ−i = φ−i, Xi = xi] = ( α p(xi|φ) G0(dφ) + Σj≠i p(xi|φ) δφj(dφ) ) / normalization . (2.30)

By substituting the full conditionals into the definition of the Gibbs sampler, we obtain:

Algorithm 2.4. For iteration l = 1,...,L:

• For i = 1, . . . , n, sample Φi|Φ−i,Xi according to (2.30).

Although this algorithm is a valid sampler, it has extremely slow mixing behavior. The reason is, roughly speaking:

• The Xi are grouped into K ≤ n clusters; hence, there are only K distinct values Φ*1, ..., Φ*K within the set Φ1, ..., Φn.
• The algorithm cannot change the values Φ*k from one step to the next. To change the parameter of a cluster, it has to (1) create a new cluster and (2) move points from the old to the new cluster one by one. Whenever such a new parameter value is generated, it is sampled from a full conditional (2.30), which means it is based only on a single data point.

In terms of the state space of the sampler, this means that in order to move from the old to the new cluster configuration—even if it differs only in the value of a single cluster parameter Φ*k—the sampler has to move through a region of the state space with very low probability.

MacEachern's algorithm. The issues of the naive sampler can be addressed easily by grouping data points by cluster and generating updates of the cluster parameters given the entire data in the cluster. The resulting algorithm is the standard sampler for DP mixtures, and is due to MacEachern [40]. Recall that Xi and Xj are considered to be in the same cluster iff Φi = Φj. We express the assignments of observations to clusters using additional variables

B1,...,Bn, with

Bi = k ⇔ Xi in cluster k . (2.31)

We must also be able to express that Φi is not contained in any of the current clusters defined by the remaining variables Φ−i, which we do by setting

Bi = 0 ⇔ xi not in any current cluster k ∈ {1,...,K} . (2.32)

The posterior of a DP mixture given data x1, . . . , xn can then be sampled as follows:

Algorithm 2.5. In each iteration l, execute the following steps: (1) For i = 1, . . . , n, sample

Bi ∼ Multinomial(ai0, ai1, . . . , aiK ) .

(2) For k = 1, ..., Kl, sample

Φ*k ∼ ( ∏i|Bi=k p(xi|φ) ) G0(dφ) / ∫Ωφ ( ∏i|Bi=k p(xi|φ) ) G0(dφ) .

To convince ourselves that the algorithm is a valid sampler for the posterior, observe that conditioning on the variables Bi permits us to subdivide the data into clusters, and then compute the posterior of the cluster parameter Φ*k given the entire cluster:

P[Φ*k ∈ dφ | B1:n, X1:n] = ( ∏i|Bi=k p(xi|φ) ) G0(dφ) / normalization . (2.33)

Since Φ*k is, conditionally on B1:n and X1:n, independent of the other cluster parameters, this is indeed the full conditional distribution of Φ*k. Conversely, we can compute the full conditionals of the variables Bi given all remaining variables: Since

P[Φi ∈ dφ | B−i, Φ*1:K, Xi = xi] = ( α p(xi|φ) G0(dφ) + Σj≠i p(xi|φ) δφ*Bj(dφ) ) / N ,

where N denotes the normalization, we have for any cluster k:

P[Bi = k | B−i, Φ*1:K, Xi = xi] = P[Φi = φ*k | B−i, Φ*1:K, Xi = xi]
   = ∫{φ*k} P[Φi ∈ dφ | B−i, Φ*1:K, Xi = xi]
   = ∫{φ*k} ( Σj≠i p(xi|φ) δφ*Bj(dφ) ) / N (2.34)
   = n−i,k · p(xi|φ*k) / N =: aik ,

where n−i,k := #{j ≠ i | Bj = k} is the number of observations currently assigned to cluster k, with xi excluded. (The G0 term vanishes when integrating over the singleton {φ*k}, since G0 is non-atomic.)

The probability that xi is not in any of the current clusters is the complement of the cluster probabilities:

P[Bi = 0 | B−i, Φ*1:K, Xi = xi] = P[Φi ∈ Ωφ ∖ {φ*1, ..., φ*K} | B−i, Φ*1:K, Xi = xi]
   = ∫Ωφ∖{φ*1:K} P[Φi ∈ dφ | B−i, Φ*1:K, Xi = xi]
   = (α / N) ∫Ωφ p(xi|φ) G0(dφ) =: ai0 . (2.35)

If these two types of full conditionals are again substituted into the definition of the Gibbs sampler, we obtain precisely Algorithm 2.5.

Remark 2.6. MacEachern's algorithm is easy to implement if p(x|φ) is chosen as an exponential family density and the Dirichlet process base measure G0 as a natural conjugate prior for p. In this case, Φ*k in the algorithm is simply drawn from a conjugate posterior. If p and G0 are not conjugate, computing the distribution of Φ*k may require numerical integration (which moreover has to be solved K times in every iteration of the algorithm). There are more sophisticated samplers available for the non-conjugate case which negotiate this problem. The de-facto standard is Neal's "Algorithm 8", see [45]. /
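To make the conjugate case of Remark 2.6 concrete, here is a minimal Python sketch of Algorithm 2.5 for a Gaussian DP mixture, assuming p(x|φ) = N(x|φ, σ²) with known σ² and base measure G0 = N(0, τ²). All names and hyperparameter values are illustrative; the assignment weights follow (2.34)–(2.35).

import numpy as np

def normal_pdf(x, mean, var):
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def dp_mixture_gibbs(x, alpha=1.0, sigma2=0.5, tau2=4.0, iters=100, rng=None):
    # x: 1-d numpy array of observations
    rng = rng or np.random.default_rng(0)
    n = len(x)
    B = np.zeros(n, dtype=int)     # cluster assignments B_i; start with one cluster
    phi = [0.0]                    # cluster parameters phi*_k

    def sample_phi(idx):
        # normal-normal conjugacy: posterior of phi*_k given the points in cluster k
        var = 1.0 / (len(idx) / sigma2 + 1.0 / tau2)
        mean = var * x[idx].sum() / sigma2
        return rng.normal(mean, np.sqrt(var))

    for _ in range(iters):
        # step (1): resample each assignment B_i from its full conditional
        for i in range(n):
            counts = np.bincount(np.delete(B, i), minlength=len(phi))
            # existing cluster k: a_ik proportional to n_{-i,k} p(x_i | phi*_k)
            w = [counts[k] * normal_pdf(x[i], phi[k], sigma2) for k in range(len(phi))]
            # new cluster: a_i0 proportional to alpha * integral of p(x_i|phi) G0(dphi)
            w.append(alpha * normal_pdf(x[i], 0.0, sigma2 + tau2))
            w = np.array(w) / sum(w)
            k = rng.choice(len(w), p=w)
            if k == len(phi):                    # open a new cluster for x_i
                phi.append(sample_phi(np.array([i])))
            B[i] = k
        # drop empty clusters and relabel contiguously
        used = np.unique(B)
        phi = [phi[k] for k in used]
        B = np.searchsorted(used, B)
        # step (2): resample each cluster parameter given all points in its cluster
        phi = [sample_phi(np.where(B == k)[0]) for k in range(len(phi))]
    return B, phi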

2.6. Random partitions

A clustering solution can be encoded as a partition of the index set of the sample: Suppose we record observations X1, ..., X10 and compute a clustering solution that subdivides the data into three clusters,

({X1,X2,X4,X7,X10}, {X3,X5}, {X6,X8,X9}) . (2.36) We can encode this solution as a partition of the index set [10] = {1,..., 10}: ({1, 2, 4, 7, 10}, {3, 5}, {6, 8, 9}) . (2.37)

Since we always regard the elements of a finite sample X1,...,Xn as the initial n elements of an infinite sequence X1,X2,..., we must in general consider partitions of N rather than [n]. To make things more precise, a partition

ψ = (ψ1, ψ2, ...) (2.38)

of N is a subdivision of N into a (possibly infinite) number of subsets ψk ⊂ N, such that each i ∈ N is contained in exactly one set ψk. The sets ψk are called the blocks of the partition. A partition ψn of [n] is defined analogously. The blocks can be ordered according to their smallest element, as we have done in (2.37). It hence makes sense to refer to ψk as the kth block of ψ.

Suppose we have some method to compute a clustering solution from a given data set. Even if our clustering algorithm is deterministic, the observations X1, X2, ... are random, and the result is hence a random partition

Ψ = (Ψ1, Ψ2, ...) , (2.39)

that is, a partition-valued random variable. Recall how we used the variables Li to encode cluster assignments (by setting Li = k if Xi is in cluster k). In terms of partitions, this means

Li = k ⇔ i ∈ Ψk , (2.40)

and a random sequence (L1, L2, ...) is hence precisely equivalent to a random partition (Ψ1, Ψ2, ...). Given a discrete probability measure θ = Σk ck δφk, we can generate a random partition Ψ of N by sampling the variables L1, L2, ... in (2.40) with probabilities

P(Li = k) = ck . (2.41)

Any discrete probability measure θ hence parametrizes a distribution Pθ(Ψ ∈ • ) on random partitions. If Θ is a random discrete probability measure with distribution Q, we can define a distribution on partitions by integrating out Θ,

P(Ψ ∈ • ) := ∫T Pθ(Ψ ∈ • ) Q(dθ) . (2.42)

We can sample from this distribution in two steps, as Θ ∼ Q and Ψ|Θ ∼ PΘ.

2.7. The Chinese restaurant process

Recall our discussion in Section 1.2, where we interpreted the parameter of a Bayesian model as a representation of the solution that we are trying to extract from the data. A clustering solution is a partition of the sample index set, whereas the parameter in a Bayesian mixture is a discrete probability measure—the partition represents the actual subdivision of the sample, the random discrete measure the mixture model we are fitting to the sample. We could hence argue that a more appropriate choice for the model parameter would be a partition, in which case the prior should be a distribution on partitions. As we have seen above, a discrete random measure induces a distribution on partitions. The Chinese restaurant process with concentration α is the distribution P(Ψ ∈ • ) on partitions that we obtain if we choose Q as a Dirichlet process with parameters (α, G0) in (2.42). The choice of base measure G0 has no effect on Ψ (provided G0 is non-atomic), and the CRP hence has only a single parameter. According to (2.42), we can sample a CRP partition by sampling a random measure Θ from a Dirichlet process, throwing away its atom locations, and then sampling the block assignment variables Li from the distribution given by the weights Ck. That is of course the case for all partitions induced by random measures, but in the specific case of the CRP, we can greatly simplify this procedure and sample Ψ using the following procedure:

Sampling scheme 2.7. For n = 1, 2,...,

(1) insert n into an existing block Ψk with probability |Ψk| / (α + (n − 1)).
(2) create a new block containing only n with probability α / (α + (n − 1)).

The algorithm can be rewritten as a distribution,

Ln+1 ∼ Σk=1..Kn ( Σi=1..n I{Li = k} / (α + n) ) δk + ( α / (α + n) ) δKn+1 . (2.43)

Asymptotically (i.e. for n → ∞), a partition generated by the formula is distributed according to CRP(α). Compare this to the predictive sampling distribution of the Dirichlet process.²

² The name "Chinese restaurant process" derives from the idea that the blocks of the partition are tables in a restaurant and the numbers are customers who join the tables. Some colleagues find such "culinary analogies" very useful; I do not.

More generally, we can substitute a random partition prior for a random measure prior whenever the random measure is homogeneous. Homogeneity is required because it makes the weights of Θ, and hence the induced partition Ψ, independent of the atom locations—if Θ is inhomogeneous, we have to sample the atoms as well in order to determine Ψ. In the homogeneous case, we can sample Ψ, and then fill in independent atoms later if needed. For clustering applications, we do of course need the atom locations as parameters for the parametric components p(x|φ), but if Θ is homogeneous, we can organize the sampling process as follows. First, generate a partition:

(1) Sample (Ci). (2) Sample Ψ given (Ci). Then, given the partition, generate the actual observations:

(3) For each block Ψk, sample a parameter value Φk ∼ G0.
(4) For all i ∈ Ψk, sample Xi|Φk ∼ p( • |Φk).

This representation neatly separates the information generated by the model into the clustering solution (the partition Ψ) and information pertaining to the generation of observations (the parameters Φk and the observations Xi themselves). Random partitions have been thoroughly studied in applied probability; Pitman's monograph [51] is the authoritative reference.
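Sampling scheme 2.7 translates directly into code. The following minimal sketch generates a CRP(α) partition of {1, ..., n}; representing the partition as a list of blocks is an illustrative choice.

import numpy as np

def sample_crp(n, alpha, rng=None):
    rng = rng or np.random.default_rng(0)
    blocks = []
    for i in range(1, n + 1):
        probs = [len(b) for b in blocks] + [alpha]    # existing blocks, then new block
        probs = np.array(probs) / (alpha + i - 1)
        k = rng.choice(len(probs), p=probs)
        if k == len(blocks):
            blocks.append([i])                        # new block, prob alpha/(alpha+i-1)
        else:
            blocks[k].append(i)                       # join block k, prob |block|/(alpha+i-1)
    return blocks

print(sample_crp(10, alpha=2.0))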

2.8. Power laws and the Pitman-Yor process

If we generate n observations X1, ..., Xn from a discrete random measure, we can record the number Kn of distinct clusters in this data (by comparing the latent variables Φ1, ..., Φn). If the random measure is a Dirichlet process, then Kn grows logarithmically in n. For a large range of real-world problems, this kind of logarithmic growth is not a realistic assumption, since many important statistics are known to follow so-called power laws.

Definition 2.8. We say that a probability distribution P on positive numbers is a power law distribution if its density with respect to Lebesgue or counting measure is of the form

p(x) = c · x^(−a) (2.44)

for constants c, a ∈ R+. /

Examples of statistics with a power law distribution include the frequencies of words in the English language, the number of followers per user on Twitter, the sizes of cities, the sizes of bodies of water (from puddles to lakes and oceans) in nature, the distribution of wealth, etc. Power laws typically arise as the outcome of aggregation processes. It is possible to obtain a clustering model which generates power-law distributed clusters by tweaking the Dirichlet process. The result is one of the most fascinating objects used in Bayesian nonparametrics, the Pitman-Yor process. Recall that

the predictive distribution of observation Xn+1 given X1, ..., Xn under a Dirichlet process mixture is

p(xn+1 | x1, ..., xn) = Σk=1..Kn ( nk / (n + α) ) p(xn+1|φ*k) + ( α / (n + α) ) ∫ p(xn+1|φ) G0(dφ) .

Informally, to obtain a power law, we should have more very small and very large clusters, and fewer clusters of medium size. A possible way to obtain such a law would be to modify the DP in a way that makes it harder for a very small cluster to transition to medium size. A very simple solution is to apply a "discount" d ∈ [0, 1] as follows:

p(xn+1 | x1, ..., xn) = Σk=1..Kn ( (nk − d) / (n + α) ) p(xn+1|φ*k) + ( (α + Kn · d) / (n + α) ) ∫ p(xn+1|φ) G0(dφ) .

To understand the influence of d, think of sampling observations from the model consecutively. Whenever a new cluster k is generated, it initially contains nk = 1 points. Under the DP (where d = 0), the probability of observing a second point in the same cluster is proportional to nk. If d is close to 1, however, it becomes much less likely for the cluster to grow, so many clusters stay small. Some will grow, though, and once they contain a few points, d has little effect any more, so new observations tend to accumulate in these clusters which have outgrown the effect of the discount. For d > 0, the model hence tends to produce a few large and many very small clusters. We can define a random measure prior with this predictive distribution by modifying the stick-breaking construction of the Dirichlet process:

Definition 2.9. The homogeneous random measure ξ = Σk∈N Ck δΦk defined by weights

Vk ∼ Beta(1 − d, α + kd)   and   Ck := Vk (1 − V1) · · · (1 − Vk−1) (2.45)

and atom locations

Φ1, Φ2, ... ∼iid G0 (2.46)

is called a Pitman-Yor process with concentration α, diversity parameter d, and base measure G0. /

For a non-technical introduction to the Pitman-Yor process, have a look at Yee Whye Teh's article on Kneser-Ney smoothing, which applies the Pitman-Yor process to an illustrative problem in language processing [61]. In the applied probability literature, it is common to denote the concentration parameter (our α) as θ, and the second parameter (our d) as α. In the Bayesian literature, using α for the concentration is well-established. The two conventions can be traced back to the classic papers by Kingman [31] and Ferguson [12], respectively.

Remark 2.10. Several different names float around the literature referring to a number of very closely related objects: The name Dirichlet process refers to the random measure Θ = Σk Ck δΦk in Definition 2.1. We have already noted above that, since the DP is homogeneous, its only non-trivial aspect is really the distribution of the weight sequence (Ck), which is completely controlled by the concentration

α. The weights can be represented in two ways: Either, we can keep them in the order they are generated by the stick-breaking construction (2.14). In this case,

L((Ck)k∈N) is usually called the GEM(α) distribution, named rather obscurely for the authors of [20], [10] and [42]. Alternatively, we can rank the Ck by size to obtain an ordered sequence (C′k), whose distribution was dubbed the Poisson-Dirichlet distribution by Kingman [31], usually abbreviated pd(α). Since the induced random partition does not depend on G0 either, its law, the Chinese restaurant process, also has just one parameter, and is commonly denoted CRP(α), as we have done above. We can generalize all of these distributions to the case described by the Pitman-Yor process above, but now have to include the additional parameter d. Thus, we obtain the GEM(α, d) distribution of the weights sampled in (2.45), the two-parameter Poisson-Dirichlet distribution pd(α, d) for the ranked weights, the two-parameter Chinese restaurant process CRP(α, d) for the induced random partition of N, and of course the Pitman-Yor process with parameters (α, d, G0) for the random measure Θ. Historically, the two-parameter case was first introduced as the two-parameter Poisson-Dirichlet in [50] and [52]. Ishwaran and James [24] turned it into a random measure and coined the term Pitman-Yor process. /
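For comparison with the Dirichlet process sketch in Section 2.3, here is a minimal sketch of the Pitman-Yor weights (2.45); it differs from the DP sampler only in the Beta parameters, which now drift with k. Truncation at K atoms is again an assumption made for simulation.

import numpy as np

def sample_py_weights(alpha, d, K=1000, rng=None):
    rng = rng or np.random.default_rng(0)
    k = np.arange(1, K + 1)
    V = rng.beta(1.0 - d, alpha + k * d)                    # V_k ~ Beta(1-d, alpha+kd)
    stick = np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    return V * stick                                        # C_k, heavier-tailed for d > 0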

2.9. The number of components in a mixture

An old and very important problem in mixture modeling is how to select the number of mixture components, i.e. the order K of the mixture. Whether and how DP mixtures and similar models provide a solution to this problem is one of the most commonly misunderstood aspects of Bayesian nonparametrics: Bayesian nonparametric mixtures are not a tool for automatically selecting the number of components in a finite mixture. If we assume that the number of clusters exhibited in the underlying population is finite, Bayesian nonparametric mixtures are misspecified models.

I will try to argue this point in more detail: If we use a DP mixture on a sample of size n, then in any clustering solution supported by the posterior, a random, finite number Kn ≤ n of clusters is present. Hence, we obtain a posterior distribution on the number of clusters. Although this is not technically model selection—since there is just one model, under which the different possible values of Kn are mutually exclusive events—we indeed have obtained a solution for the number of clusters. However, a DP random measure has an infinite number of atoms almost surely. Hence, the modeling assumption implicit in a DP mixture is that as n → ∞, we will inevitably (with probability 1) observe an infinite number of clusters. To choose an adequate strategy for determining the number of components, we have to distinguish three types of problems:

(1) K is finite and known, which is the assumption expressed by a finite mixture model of order K.
(2) K is finite but unknown. In this case, the appropriate model would be a finite mixture model of unknown order. This problem should not be modeled by a DP or other infinite mixture.
(3) K is infinite, as assumed by the Dirichlet process/CRP or the Pitman-Yor process.

In many clustering problems, (3) is really the appropriate assumption: In topic modeling problems, for example, we assume that a given text corpus contains a finite number of topics, but that is only because the corpus itself is finite. If the corpus size increases, we would expect new topics to emerge, and it would usually be very difficult to argue that the number of topics eventually runs up against some finite bound and remains fixed.

What if we really have reason to assume that (2) is adequate? In a classical mixture-model setup—estimate a finite mixture by approximate maximum likelihood using an EM algorithm, say—choosing K in (2) is a model selection problem, since different K result in different models whose likelihoods are not comparable. This problem is also called the model order selection problem, and popular solutions include penalty methods (AIC or BIC), stability, etc. [see e.g. 43]. We can also take a Bayesian mixture approach and assume K to be unknown, but if we believe K is finite, we should choose a model that generates a random finite number K. For example, we could define a prior on K as

K := K0 + 1 where K0 ∼ Poisson(λ) , (2.47)

(where we add 1 since a Poisson variable may be 0) and then sample from a finite Bayesian mixture with K components.

There are of course situations in which it can be beneficial to deliberately misspecify a model, and we can ask whether, even though an infinite mixture model is misspecified for case (2), it may perhaps still yield the right answer for K. Miller and Harrison [44] have recently clarified that this is not the case for DP or Pitman-Yor process mixtures using a wide range of mixture components (such as Gaussians, multinomials, etc.): If we generate a sample from a finite mixture, and then compute the posterior under an infinite DP or PYP mixture model, then asymptotically, the posterior concentrates on solutions with an infinite number of clusters. That is exactly what the model assumes: An infinitely large sample exhibits an infinite number of clusters almost surely.

Remark 2.11. To obtain an infinite (nonparametric) mixture, it is by no means necessary to take a Bayesian approach—the mixing distribution can be estimated by nonparametric maximum likelihood, similar to a Kaplan-Meier estimator [43, §1.2]. Bayesian finite mixtures are less common; see [54] for an overview. /

2.10. Historical references

The Dirichlet process (and the corresponding approach to priors) was introduced by Ferguson [12], who attributes the idea to David Blackwell. Kingman [31] introduced the closely related Poisson-Dirichlet distribution; his paper is full of insights and still a very worthwhile read. Blackwell [4] showed that a DP random measure is discrete; see [32, Chapter 8.3] for an accessible and more general derivation of his result. Almost simultaneously with Ferguson's paper, [5] proposed an interpretation of the Dirichlet process as a generalized Pólya urn. Urn models offer a third perspective on partition models, in addition to random discrete measures and random partitions—from an urn, we sample balls (= observations) of different colors (= cluster labels). This is the perspective taken by species sampling models (where each color is called a species). See [8] for more on species sampling, and [19, 49] for more on urn models.

Antoniak [1] proposed a model called a mixture of Dirichlet processes (MDP), which is sometimes mistaken for a Dirichlet process mixture. The MDP puts a prior on the parameters of the DP base measure. A draw from an MDP is discrete almost surely, just as for the DP. Steven MacEachern has pointed out to me that Antoniak's paper also contains a Dirichlet process mixture: Antoniak introduces the idea of using a parametric likelihood with a DP or MDP, which he refers to as "random noise" (cf. his Theorem 3) and as a sampling distribution (cf. Example 4). If this is used with a DP, the resulting distribution is identical to a Dirichlet process mixture model. However, Lo [36] was the first author to study models of this form from a mixture perspective.

Initially, interest focused primarily on overcoming the discreteness property of the DP, with models such as the DP mixture model, tailfree processes [9] and Pólya trees [13]. NTR processes were introduced by [9], and the idea was taken up by [14] in the context of survival analysis. The Dirichlet process was applied in this context by [60], who apply it to right-censored data and obtain a Kaplan-Meier estimator in the limit α → 0.

The name "Chinese restaurant process" is due to L. E. Dubins and J. Pitman [see 51], who introduced it as a distribution on infinite permutations, in which case each "table" is a cycle of a permutation, rather than a block in a partition, and the order in which elements are inserted at a table matters. By projecting out the order within cycles, one obtains a partition of N, and the induced distribution on partitions is the CRP as commonly used in Bayesian nonparametrics. The two-parameter Poisson-Dirichlet is due to Pitman and Yor [52], partly in joint work with Perman [50]. Its random measure form, the Pitman-Yor process, was introduced by Ishwaran and James [24].

CHAPTER 3

Latent features and the Indian buffet process

I am including the Indian buffet process, or IBP, in these notes as one of the three basic models, along with the Dirichlet and Gaussian process. That drastically overstates its importance, at least in terms of how widely these three models are used. The IBP is of conceptual interest, though, and neatly illustrates some fundamental ideas.

In clustering, we have looked at partitions of data, where each data point or object belongs to one (and only one) group. There is a range of problems in which we are more interested in solutions in which clusters can overlap, i.e. each data point can belong to multiple groups. Instead of a partition, say,

({1, 2, 4, 7}, {3}, {5}, {6, 9}, {8}) , (3.1)

we would hence consider something like

({1, 2, 3, 4, 7}, {3}, {4, 5}, {3, 6, 9}, {8, 9}) . (3.2)

Such a "relaxed partition" is, technically speaking, a family of sets. The term family implies that elements can reoccur—there may be two or more groups containing the same objects (an equivalent term is multiset). Just as for a partition, we will refer to the elements of the family as blocks. A convenient way to encode both partitions and families of sets is as a binary matrix z, where zik = 1 iff i is in block k. To illustrate:

             block #                       block #
           1 0 0 0 0                     1 0 0 0 0
           1 0 0 0 0                     1 0 0 0 0
           0 1 0 0 0                     1 1 0 1 0
           1 0 0 0 0                     1 0 1 0 0
 object #  0 0 1 0 0          object #   0 0 1 0 0
           0 0 0 1 0                     0 0 0 1 0
           1 0 0 0 0                     1 0 0 0 0
           0 0 0 0 1                     0 0 0 0 1
           0 0 0 1 0                     0 0 0 1 1

         partition (3.1)             family of sets (3.2)

Clearly, a partition is obtained as a special case of a family of sets—namely, if the sum of every row of z is 1. For families of sets, we allow multiple occurrences, but for the problems discussed below, each number should occur in at most a finite number of blocks, i.e. we require the sum of each row in z to be finite, even if z has an infinite number of columns.


3.1. Latent feature models

The use of overlapping blocks was motivated by a type of model that resembles linear factor analyzers: We observe data X1, X2, ... in, say, R^d, and think of each vector Xi as a list of measurements describing an object i. To explain the data, we assume that there is an underlying (unobserved) set of properties, called features, which each object may or may not possess. The measurement Xi depends on which features object i exhibits.

Definition 3.1. Let X1, ..., Xn be observations represented as d-dimensional vectors. A latent feature model assumes that there is a fixed d × K matrix φ and a binary matrix z ∈ {0, 1}^(n×K) which parametrize the distribution of X as

Xij ∼ P_bij where bij = (zi1 φj1, ..., ziK φjK) . (3.3)

Each dimension k is called a feature. The number K ∈ N ∪ {∞} of features is called the order of the model. /

If we interpret the matrix z as the model parameter, the model is clearly nonparametric (regardless of whether or not K is finite). Equation (3.3) says that the distribution of Xij is completely determined by the effects φjk; the entries of z act as switches which turn effects on or off. To obtain a tractable model, we impose two further assumptions:

(1) The effects φjk are scalars which combine linearly, i.e. the law of Xij is parameterized by the sum Σk zik φjk.
(2) All additional randomness is an additive noise contribution.

The model (3.3) then becomes

Xij = Σk zik φjk + εij = (zφ)ij + εij . (3.4)

(1) The effects φjk are scalars which combine linearly, i.e. the law of Xij is param- P eterized by the sum k zikφjk. (2) All additional randomness is an additive noise contribution. The model (3.3) then becomes X Xij = zikφjk + εij = (zφ)ij + εij . (3.4) k Example 3.2 (Collaborative filtering). A particular application is a prediction problem that arises in marketing: Each vector Xi represents an individual (a “user”), and each dimension j a product—movies are the prototypical example. We read Xij as a rating, i.e. a value that expresses how much user i likes movie j. Observed data in this setting is usually sparse: Each user has on average only seen and rated a small subset of movies. The basic approach is to identify other users with similar tastes as user i, i.e. who have rated a large fraction of movies similarly. We then use the ratings these users have assigned to movie j as a predictor for Xij. This approach is called collaborative filtering. The perhaps simplest implementation would be to identify users with similar preferences and then simply average their ratings for movie j. Since the data is sparse, this estimate may well be volatile (or not even well-defined, if no similar user has rated the movie). A more robust approach is to group movies into types—for illustration, think of dramas, documentaries, etc—and use these types as summary variables. Collaborative filtering can be implemented as the latent feature model (3.4) by interpreting the components as follows:

feature k    movie type k
zik          user i likes movies of type k (iff zik = 1)
φkj          contribution of feature k to rating of movie j
εij          remaining randomness in Xij

The illustration of movie types as dramas, documentaries etc has to be taken with a grain of salt, since the types are not predefined, but rather latent classes that are effectively estimated from data. /

3.2. The Indian buffet process

If we regard the binary matrix z as the model parameter, a Bayesian approach has to define a prior law for a random binary matrix Z. The basic model of this type is a generalization of the Chinese restaurant process from partitions to overlapping blocks. The model samples a matrix Z as follows:

Sampling scheme 3.3. For n = 1, 2,...,

(1) insert n into each block Ψk separately with probability |Ψk| / n.
(2) create Poisson(α / n) new blocks, each containing only n.

If we interpret the ith row of the matrix as a "preference profile" for user i, the distribution of this profile should obviously not depend on i. In other words, the law of Z should be invariant under permutations of the rows (or row-exchangeable). That requirement is clearly not satisfied by the matrix generated by Algorithm 3.3: the positions of the 1s in the ith row depend on i, since blocks created later in the process always appear further to the right (see Section 3.3). This problem is addressed as follows:

• We generate a matrix using Algorithm 3.3.
• We order the matrix in a unique way that eliminates this dependence on the order: Any two matrices which differ only by the order in which the columns occur map to the same ordered matrix.
• The ordered matrix is regarded as a representative of an equivalence class.

Thus, we do not eliminate the dependence of the distribution on the order of rows by defining a different distribution on matrices, but rather by defining a distribution on equivalence classes of matrices. The ordering method we use is called left-ordering:

(Figure: an example binary matrix before (unordered) and after left-ordering.)

It sorts all columns (= features) exhibited by object 1 to the very left of the matrix; the next block of columns are all features not exhibited by object 1, but by object 2; etc. Each of the resulting blocks is again left-ordered. A more concise way to express this is: Read each column k of Z as a binary number, with z1k the most significant bit, and order these numbers decreasingly from left to right. We hence augment the sampling scheme 3.3 by an additional step:

Sampling scheme 3.4.
• For n = 1, 2, ...,
(1) insert n into each block Ψk separately with probability |Ψk| / n.
(2) create Poisson(α / n) new blocks, each containing only n.
• Left-order the matrix and output the resulting matrix Z.

The distribution on (equivalence classes of) binary matrices defined by Algorithm 3.4 is called the Indian buffet process (IBP), due to Griffiths and Ghahramani [21]. Compare this to the analogous CRP sampling procedure in Algorithm 2.7.
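Here is a minimal Python sketch of Sampling scheme 3.4, with left-ordering implemented by reading each column as a binary number as described above. Representing Z as a dense NumPy array, and all names, are illustrative choices.

import numpy as np

def sample_ibp(n, alpha, rng=None):
    rng = rng or np.random.default_rng(0)
    Z = np.zeros((0, 0), dtype=int)
    for i in range(1, n + 1):
        counts = Z.sum(axis=0)                                # block sizes |Psi_k|
        old = (rng.random(Z.shape[1]) < counts / i).astype(int)
        new = np.ones(rng.poisson(alpha / i), dtype=int)      # Poisson(alpha/i) new blocks
        row = np.concatenate([old, new])
        Z = np.pad(Z, ((0, 0), (0, len(new))))                # widen matrix for new blocks
        Z = np.vstack([Z, row])
    # left-ordering: sort columns decreasingly by their value as binary numbers,
    # with row 1 as the most significant bit
    keys = (Z * (2.0 ** np.arange(n - 1, -1, -1))[:, None]).sum(axis=0)
    return Z[:, np.argsort(-keys)]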

3.3. Exchangeability in the IBP

In a random matrix sampled from an IBP, both the rows and the columns are exchangeable, which means that the distribution of Z is invariant under rearrangement of the rows or the columns. More precisely, if we sample a matrix Z of size n × K, then for any permutations π of {1, ..., n} and π′ of {1, ..., K}, we have

    (Z_{ij}) =^d (Z_{π(i)π′(j)}) .    (3.5)

It is easy to see that, due to the left-ordering step, the columns are exchangeable. However, left-ordering also makes the rows exchangeable, which is a little less obvious—the left-ordering illustration above seems to suggest that rows further down the matrix tend to contain more 1s. That is not actually the case: the row sums all have the same distribution Poisson(α). To understand how that happens, recall the additivity and thinning properties of the Poisson distribution (see (A.3) and (A.4) in Appendix A). Now consider the IBP sampling scheme in Algorithm 3.4:
• In the first step (n = 1), we generate Poisson(α) blocks by definition.
• For the second row (n = 2), we have already generated Poisson(α) blocks in the previous step. Each of these blocks has size 1, so the second row contains a 1 in each of these columns independently with probability 1/2. By (A.4), the number of 1s generated in this way is Poisson(α/2). Additionally, Poisson(α/2) new blocks are generated by step 2 of the algorithm. The total number of 1s in the second row is hence Poisson(α/2 + α/2) = Poisson(α) by (A.3).
• For n = 3, 4, ..., the argument requires more book-keeping, because blocks can now have sizes larger than one, but if we work out the details, we again find that the row sum has distribution Poisson(α).
Thus, each row sum marginally has a Poisson(α) distribution, and that is true regardless of left-ordering. However, before left-ordering, the rows are not exchangeable, because of the order in which 1s occur in each row: the new blocks always appear on the right, so in the nth row, we see Poisson((n−1)α/n) 1s in previously created blocks, with gaps in between, followed by Poisson(α/n) new blocks without gaps. Left-ordering removes this pattern.
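The row-sum claim is easy to check empirically. A quick sketch (reusing the hypothetical sample_ibp function from above):

```python
import numpy as np

alpha, n, reps = 2.0, 5, 2000
row_sums = np.array([sample_ibp(n, alpha).sum(axis=1) for _ in range(reps)])
print(row_sums.mean(axis=0))   # every entry should be close to alpha = 2.0
```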

3.4. Random measure representation of latent feature models

The latent feature model (3.4) is parametrized by a pair (z, φ), consisting of a binary matrix z and a vector φ. A Bayesian formulation of the model hence has to define a prior on random pairs (Z, Φ). A very elegant way to generate this information is using a discrete random measure

    Θ = Σ_{k∈N} C_k δ_{Φ_k} ,    (3.6)

where the weights C_k take values in [0, 1]. We then generate the matrix Z by sampling

    Z_{ik} | Θ ∼ Bernoulli(C_k) ,    (3.7)

and model the observations as

    X_{ij} | Θ = (ZΦ)_{ij} + ε_{ij} .    (3.8)

Thus, the i.i.d. noise ε aside, all information in the model is summarized in Θ. This representation is due to [66], who also showed that for a specific random measure called the beta process, this representation yields the IBP. More precisely, if Θ is generated by a beta process, the random matrix Z in (3.7) is, after left-ordering, distributed according to the IBP. This definition of the IBP via the beta process is the precise analogue of the way the CRP can be defined via the Dirichlet process. We will discuss the beta process in Chapter 8. I should emphasize that the weights C_k of the random measure do not have to sum to 1, since they represent mutually independent probabilities. Thus, Θ is a random measure, but not a random probability measure. Recall, however, that we have constrained the matrix Z to finite row sums (each object only exhibits a finite number of features). Since E[Z_{ik}] = C_k, we see immediately (by Borel-Cantelli) that this requires

    Θ(Ω_φ) = Σ_{k∈N} C_k < ∞    a.s.    (3.9)

Table 3.1 compares models assuming disjoint and overlapping blocks.

                        Partition                    Family of sets
random matrix Z         exchangeable rows            exchangeable rows
constraint on Z         each row sums to 1 a.s.      each row sum is finite a.s.
random measure          weights sum to 1 a.s.        weight sum finite a.s.
probabilities C_i       mutually exclusive events    disjoint events
basic model             CRP                          IBP

Table 3.1. Partitions vs families of sets.
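A minimal sketch of the sampling step (3.7): given weights C_k with finite sum, each row of Z is an independent vector of Bernoulli(C_k) draws. The geometric weights below are an arbitrary illustrative choice—they are not drawn from a beta process, which is only introduced in Chapter 8:

```python
import numpy as np

rng = np.random.default_rng(0)
C = 0.5 ** np.arange(1, 21)        # summable weights, so row sums are finite a.s.
Z = (rng.random((10, C.size)) < C).astype(int)   # Z_ik | Theta ~ Bernoulli(C_k)
print(Z.sum(axis=1))               # finite number of features per object
```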

CHAPTER 4

Regression and the Gaussian process

Consider a simple regression problem, where observations consist of covariates x_i ∈ R_+ and responses y_i ∈ R. The solution of such a problem is a regression function θ : R_+ → R. Given the regression function, we predict the response at an unobserved location x to be y = θ(x). In terms of Bayesian nonparametrics, this means that the parameter variable Θ is a random regression function (recall the discussion in Section 1.2). To define a prior distribution, we have to define a probability distribution on a suitable space T of functions which we consider viable solutions. The Gaussian process is such a distribution on functions; in many ways, it is the simplest distribution we can hope to define on continuous functions. The space of all functions between two spaces X → Y is the product space Y^X. (Think of each element of Y^X as an infinitely long list, with one entry for each element x ∈ X; the entry specifies the function value at x. The finite-dimensional analogue would be to think of a vector (x_1, ..., x_d) in the Euclidean space R^d = R^{{1,...,d}} as a function i ↦ x_i.) In the case of our regression problem, this would be the uncountable-dimensional space R^{R_+}. This space is not a good choice for T—almost all functions in this space jump almost everywhere, and we would like a regression solution to be reasonably smooth.¹ As more suitable function spaces, we will consider the Hilbert space T := L_2(R_+, R) of Lebesgue square-integrable functions and the space T := C(R_+, R) of continuous functions R_+ → R.

4.1. Gaussian processes

Let T be a space of functions from a set S ⊂ R^d to R. If Θ is a random element of T, and we fix a point s ∈ S, then Θ(s) is a random variable in R. More generally, if we fix n points s_1, ..., s_n ∈ S, then (Θ(s_1), ..., Θ(s_n)) is a random vector in R^n.

Definition 4.1. Let µ be a probability measure on the function space T. The distributions

    μ_{s_1,...,s_n} := L(Θ(s_1), ..., Θ(s_n)),    (4.1)

defined by μ are called the finite-dimensional marginals or finite-dimensional distributions of μ. If μ_{s_1,...,s_n} is an n-dimensional Gaussian for each finite set s_1, ..., s_n ∈ S of points, μ is called a Gaussian process (GP) on T. /
This definition is standard in the literature, but it can cause a lot of confusion, and I would like to issue a dire warning:

¹ The idea of almost all and almost everywhere, which seems vague without having defined a probability measure first, can be made precise topologically: the functions which are at least piece-wise continuous, meaning well-behaved, form a topologically meager subset.


[Figure 4.1. A random function Θ : R_+ → R defines a random scalar Θ(s) for every point s ∈ R_+.]

Dire warning 4.2. The definition of the GP implicitly assumes (a) that the measure μ on T exists, and (b) that it is uniquely defined by the finite-dimensional marginals, which is neither obvious nor universally true. /
Roughly speaking, uniqueness is usually guaranteed, but existence is a much more subtle question; it depends on the space T and on which finite-dimensional marginals we would like μ to have. For now, we will assume that μ exists and is uniquely specified by Definition 4.1. We define functions m : S → R and k : S × S → R as

    m(s) := E[Θ(s)]    and    k(s_1, s_2) := Cov[Θ(s_1), Θ(s_2)]    (4.2)

and call them the mean function and covariance function of μ. If μ is a

Gaussian process, its definition says that each finite-dimensional marginal μ_{s_1,...,s_n} is the Gaussian distribution with mean vector and covariance matrix

    \begin{pmatrix} m(s_1) \\ \vdots \\ m(s_n) \end{pmatrix}    and    \begin{pmatrix} k(s_1,s_1) & \cdots & k(s_1,s_n) \\ \vdots & & \vdots \\ k(s_n,s_1) & \cdots & k(s_n,s_n) \end{pmatrix} .    (4.3)

Provided a GP μ is uniquely defined by Definition 4.1, it is hence completely defined by its mean and covariance functions, and we can parametrize Gaussian processes on a given space T as GP(m, k).
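To make (4.3) concrete, here is a minimal sketch (Python; the notes contain no code) that draws one sample of the finite-dimensional marginal of GP(m, k) at n locations. The squared-exponential covariance in the usage lines is my own illustrative choice, not one prescribed by the text:

```python
import numpy as np

def gp_marginal_sample(m, k, s, rng=None):
    """Draw (Theta(s_1), ..., Theta(s_n)) from the Gaussian marginal (4.3)."""
    rng = np.random.default_rng() if rng is None else rng
    s = np.asarray(s, dtype=float)
    mean = np.array([m(si) for si in s])
    cov = np.array([[k(si, sj) for sj in s] for si in s])
    cov += 1e-9 * np.eye(len(s))          # jitter for numerical stability
    return rng.multivariate_normal(mean, cov)

# one random function, evaluated on a grid
s = np.linspace(0.0, 5.0, 200)
theta = gp_marginal_sample(lambda t: 0.0,
                           lambda a, b: np.exp(-0.5 * (a - b) ** 2), s)
```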

4.2. Gaussian process priors and posteriors

In a regression problem, the solution explaining the data is a function, so we can in principle use a Gaussian process as a prior in a Bayesian regression model. To make this feasible, however, we have to be able to compute a posterior distribution. We first have to define an observation model: suppose our data is of the form (s_1, x_1), ..., (s_n, x_n), where s_i ∈ S are observation points (covariates) and x_i ∈ R is the observed value at s_i (the response). We assume that there is a function θ : S → R, the regression function, from which the x_i are generated as noisy observations:

    X_i = Θ(s_i) + ε_i    where ε_i ∼ N(0, σ²) .    (4.4)

We assume, of course, that the noise contributions ε1, ε2,... are i.i.d. The posterior distribution we are looking for is the distribution

    P[Θ ∈ • | X_1 = x_1, ..., X_n = x_n] ,    (4.5)

and hence a measure on the function space T. Since we know (or for now pretend to know) that a distribution on T is uniquely determined by the finite-dimensional marginal distributions, it is sufficient to determine the distributions

L(Θ(sn+1),..., Θ(sn+m)|X1 = x1,...,Xn = xn) , (4.6) for any finite set of new observation locations {sn+1, . . . , sn+m}. To keep notation sane, we abbreviate A := {n + 1, . . . , n + m} and B := {1, . . . , n} (4.7) and write

Θ(sA) := (Θ(sn+1),..., Θ(sn+m)) and XB := (X1,...,Xn) . (4.8)

To condition on the variables X_i, we first have to take a closer look at their distribution: in (4.4), X_i is the sum of the two independent Gaussian variables Θ(s_i) and ε_i, with variances k(s_i, s_i) and σ², and is hence again Gaussian, with variance k(s_i, s_i) + σ². For i ≠ j, only the contributions Θ(s_i) and Θ(s_j) couple (since the noise is independent). Hence, X_B has covariance matrix

    Σ_BB := \begin{pmatrix} k(s_1,s_1)+σ² & \cdots & k(s_1,s_n) \\ \vdots & \ddots & \vdots \\ k(s_n,s_1) & \cdots & k(s_n,s_n)+σ² \end{pmatrix} .    (4.9)

The joint covariance of the (n + m)-dimensional vector (Θ(s_A), X_B) is then

    Cov[(Θ(s_A), X_B)] = \begin{pmatrix} Σ_AA & Σ_AB \\ Σ_AB^t & Σ_BB \end{pmatrix} .    (4.10)

Again, since the noise is independent, the only contributions to the covariance come from the GP, and hence

    Σ_AB = (k(s_i, s_j))_{i∈A, j∈B}    and    Σ_AA = (k(s_i, s_j))_{i,j∈A} .    (4.11)

Determining the GP posterior hence comes down to conditioning in a multidimensional Gaussian. How that works is explained by the following simple lemma:

Lemma 4.3 (Conditioning in Gaussian distributions). Let (A, B) be a partition of the set {1, ..., d} and let X = (X_A, X_B) be a Gaussian random vector in R^d = R^A × R^B, with

    E[X] = \begin{pmatrix} μ_A \\ μ_B \end{pmatrix}    and    Cov[X] = \begin{pmatrix} Σ_AA & Σ_AB \\ Σ_AB^t & Σ_BB \end{pmatrix} .    (4.12)

Then the conditional distribution of X_A | (X_B = x_B) is again Gaussian, with mean

    E[X_A | X_B = x_B] = μ_A − Σ_AB Σ_BB^{−1} (μ_B − x_B)    (4.13)

and covariance

    Cov[X_A | X_B = x_B] = Σ_AA − Σ_AB Σ_BB^{−1} Σ_AB^t .    (4.14)    /

We can now read off the posterior of the Gaussian process, simply by substituting into the lemma. Since the finite-dimensional marginal distributions are all Gaussian, the posterior is a GP.

Theorem 4.4. The posterior of a GP(0, k) prior under the observation model (4.4) is again a Gaussian process. Its finite-dimensional marginal distribution at any finite set {s_{n+1}, ..., s_{n+m}} of locations is the Gaussian with mean vector

    E[Θ(s_A) | X_B = x_B] = Σ_AB (Σ_BB + σ²I)^{−1} x_B    (4.15)

and covariance matrix

    Cov[Θ(s_A) | X_B = x_B] = Σ_AA − Σ_AB (Σ_BB + σ²I)^{−1} Σ_AB^t .    (4.16)    /

(Here Σ_BB denotes the noise-free kernel matrix (k(s_i, s_j))_{i,j∈B}, so that Σ_BB + σ²I is precisely the covariance matrix of X_B in (4.9).) What we have left to do is to give a proof of Lemma 4.3, which is a "disclaimer proof": no deep insights, it simply clarifies that there is no black magic involved.
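Theorem 4.4 is a two-line computation once the kernel matrices are built. A minimal sketch (Python, with hypothetical function names) of the posterior mean and covariance (4.15)–(4.16):

```python
import numpy as np

def gp_posterior(k, sB, xB, sA, sigma2):
    """Posterior mean and covariance of a GP(0, k) prior at new locations
    sA, given noisy observations xB at sB; transcribes (4.15)-(4.16)."""
    gram = lambda s, t: np.array([[k(a, b) for b in t] for a in s])
    S_AB = gram(sA, sB)
    S_BB = gram(sB, sB) + sigma2 * np.eye(len(sB))   # Sigma_BB + sigma^2 I
    S_AA = gram(sA, sA)
    mean = S_AB @ np.linalg.solve(S_BB, np.asarray(xB))
    cov = S_AA - S_AB @ np.linalg.solve(S_BB, S_AB.T)
    return mean, cov

# usage with a squared-exponential kernel (an illustrative assumption):
k = lambda a, b: np.exp(-0.5 * (a - b) ** 2)
mean, cov = gp_posterior(k, sB=[0.0, 1.0, 2.5], xB=[0.1, 0.9, -0.3],
                         sA=np.linspace(0.0, 3.0, 50), sigma2=0.01)
```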

Proof of Lemma 4.3. The conditional density g(xA|xB) is given by

    g(x_A, x_B) = g(x_A | x_B) g(x_B) .    (4.17)

It is useful to think of X̃_A := (X_A | X_B = x_B) as a separate, "conditional" random variable. This variable is independent of X_B (which is exactly what the product in (4.17) says), and the joint covariance of X̃_A and X_B is hence block-diagonal,

    Cov[(X̃_A, X_B)] = \begin{pmatrix} Σ̃_AA & 0 \\ 0 & Σ_BB \end{pmatrix} .    (4.18)

We have to determine Σ̃_AA. We can do so by decomposing the quadratic form

    f(x) := (x − E[X])^t Cov[X]^{−1} (x − E[X]) .    (4.19)

Since g(x) ∝ e^{−f(x)/2}, the multiplicative decomposition (4.17) corresponds to an additive decomposition of f into components corresponding to X̃_A and X_B. We know from linear algebra² that any symmetric block matrix with invertible diagonal blocks A and B and off-diagonal block Z can be inverted as

    \begin{pmatrix} A & Z \\ Z^t & B \end{pmatrix}^{−1} = \begin{pmatrix} 1 & 0 \\ −B^{−1}Z^t & 1 \end{pmatrix} \begin{pmatrix} (A − ZB^{−1}Z^t)^{−1} & 0 \\ 0 & B^{−1} \end{pmatrix} \begin{pmatrix} 1 & −ZB^{−1} \\ 0 & 1 \end{pmatrix} .    (4.20)

Substituting in the components of Cov[X], we have

    Cov[X]^{−1} = \begin{pmatrix} 1 & 0 \\ −Σ_BB^{−1}Σ_BA & 1 \end{pmatrix} \begin{pmatrix} (Σ_AA − Σ_AB Σ_BB^{−1} Σ_BA)^{−1} & 0 \\ 0 & Σ_BB^{−1} \end{pmatrix} \begin{pmatrix} 1 & −Σ_AB Σ_BB^{−1} \\ 0 & 1 \end{pmatrix}    (4.21)

(the two outer, triangular factors modify μ_A; the middle factor is the inverse of the block-diagonal covariance (4.18)). We substitute into the quadratic form (4.19), multiply out the terms, and obtain

    E[X̃_A] = μ_A − Σ_AB Σ_BB^{−1} (μ_B − x_B)    and    Cov[X̃_A] = Σ_AA − Σ_AB Σ_BB^{−1} Σ_AB^t    (4.22)

as claimed. □

² If you would like to read up further on this, look for the term Schur complement in the linear algebra literature. A concise reference is Appendix A.5.5 in Boyd/Vandenberghe, "Convex Optimization".

4.3. Is the definition meaningful?

To ensure that the definition of the Gaussian process makes sense, we have to answer two separate questions:
(1) Does a measure satisfying the definition of GP(m, k) exist on T?
(2) Given T and functions m and k, is GP(m, k) unique?
We first answer question (2), which is easier, and the answer very generally is "yes". I will not state this result rigorously, since the precise meaning of a probability measure on T depends on which topology we choose on the function space, but for all practical purposes, we can always rest assured:

Theorem sketch 4.5. Let T be defined as above. Any distribution μ on T is uniquely determined by its finite-dimensional marginals. /

This approach to the construction of a stochastic process, which uses an infinite number of finite-dimensional distributions to define a single infinite-dimensional distribution, is called a projective limit construction. It is the most general (and most powerful) technique available for the construction of stochastic processes; I will not elaborate further here since the details can get fairly technical. The question whether μ exists has many different answers, depending on the choice of T. The answer is short and crisp if T is a Hilbert space:

Theorem 4.6 (Prokhorov). Let T = L_2(S, R). Then the Gaussian process GP(m, k) on T exists if and only if m ∈ T and

    ∫_S k(s, s) ds < ∞ .    (4.23)    /

The integral in (4.23) is called the trace of the covariance function.³ The simplicity of the criterion (4.23) is not so surprising if we note that k is by definition a positive definite function, and hence a Mercer kernel. Since we know that Mercer kernels are inherently related to Hilbert spaces, we would expect a simple answer. If we want T to be a space of continuous functions, existence results become more cumbersome. Here is an example: recall that a function θ is called locally Lipschitz-continuous (of order r) if every s ∈ S has an open neighborhood U_ε(s) on which

    |θ(s) − θ(t)| ≤ C |s − t|^r    for all t ∈ U_ε(s) .    (4.24)

The Lipschitz constant C is independent of s, but the definition weakens Lipschitz continuity, which requires the inequality to hold for all t in S (rather than just all t in some neighborhood). The following criterion is sufficient (though not necessary) for the existence of a GP:

Theorem 4.7 (Kolmogorov, Chentsov). Let T be the set of functions θ : R^d → R which are locally Lipschitz of order r. Then the Gaussian process μ on T exists if

there are constants α > 0 and C > 0 such that

    E[|Θ(s) − Θ(t)|^α] ≤ C |s − t|^{1+rα} .    (4.25)    /

³ A few keywords, in case you would like to read up further details in the literature: the function k defines a linear operator K (a linear mapping from T to itself) by (Kf)(s) := ∫ k(s, t)f(t) dt. In this context, k is called an integral kernel. Operators defined in this form are called Hilbert-Schmidt integral operators, and are a specific example of Hilbert-Schmidt operators (which means they are bounded and have finite Hilbert-Schmidt norm). An operator satisfying (4.23) is called a trace class operator.

CHAPTER 5

Models as building blocks

The basic Bayesian nonparametric models we have seen so far can be combined to obtain more complicated models. Two popular combination strategies are (1) hierarchical models, which define "layers" of hidden variables, where each layer is generated by a basic model conditionally on the previous one, and (2) covariate-dependent models, which are families of basic models indexed by a covariate such as time or space (e.g. a time-indexed family of Dirichlet process mixtures). These types of models arguably account for the lion's share of research in Bayesian nonparametrics, particularly in machine learning applications, where such model combinations have yielded a number of natural and compelling solutions.

5.1. Mixture models

In the context of clustering, we have specifically considered mixture models with a finite or countable number of "components". In general, a mixture model on X is a probability distribution

    μ( • ) = ∫_{Ω_φ} p( • , φ) m(dφ) ,    (5.1)

where p is a probability kernel Ω_φ → PM(X). The probability measure m on Ω_φ is called the mixing measure. The mixtures introduced in Chapter 2 are the special case where the mixing measure m is discrete (and possibly generated at random as m = Θ). If the probability p in (5.1) has a conditional density p(x|φ) with respect to some σ-finite measure ν on X, then μ has a density f with respect to the same ν, given by

    f(x) = ∫_{Ω_φ} p(x|φ) m(dφ) .    (5.2)

Integrals of the form (5.1) correspond to a two-stage sampling procedure: If we generate X1,X2,... as

(1) sample Φ_i ∼ m,
(2) sample X_i | Φ_i ∼ p( • | Φ_i),
then each sample is distributed as X_i ∼ μ.

Example 5.1. An illustrative example of a mixture with continuous mixing measure is Student's t-distribution: if we choose p(x|φ) in (5.2) as a Gaussian density with mean μ and variance σ², where we fix μ and set φ := σ², and choose the mixing measure m as an inverse-gamma distribution on σ², then f is a Student t-density. By mixing over all possible widths of the Gaussian, we obtain a distribution with heavy tails; see Figure 5.1. /


[Figure 5.1. Continuous mixtures: if the probability kernel p( • , σ²) is a zero-mean Gaussian distribution with variance σ² (left), and the mixing measure m is chosen as a gamma distribution (middle), the resulting mixture distribution is Student's t-distribution (right). Note how mixing over all scales of the Gaussian generates heavy tails in the mixture.]
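The two-stage sampling procedure makes Example 5.1 easy to simulate. A minimal sketch (Python); the parameterization InverseGamma(ν/2, ν/2) for the variance, which makes the marginal of X_i a Student t with ν degrees of freedom, is my own choice for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
nu, n = 3.0, 100_000
# stage (1): Phi_i = sigma_i^2 ~ InverseGamma(nu/2, nu/2)
# (reciprocal of a Gamma(shape=nu/2, rate=nu/2) draw; numpy's scale = 1/rate)
phi = 1.0 / rng.gamma(shape=nu / 2, scale=2.0 / nu, size=n)
# stage (2): X_i | Phi_i ~ N(0, Phi_i)
x = rng.normal(loc=0.0, scale=np.sqrt(phi))
# x has heavy tails: compare np.mean(np.abs(x) > 3) against a standard Gaussian
```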

Remark 5.2 (Bayesian models as mixtures). Compare the mixture sampling scheme above to a Bayesian model for an exchangeable sequence X_{1:∞}:
(1) Θ ∼ Q
(2) X_1, X_2, ... | Θ ∼ P_Θ
We see that any such Bayesian model is a mixture, with the prior Q as its mixing measure, but we have to be careful: in a Bayesian model, we sample one realization of the parameter, and then generate the entire sample given this realization, whereas in a mixture, each X_i is generated by a separate value Φ_i. Thus, if each X_i in the Bayesian model takes values in X, the model is a mixture on the sample space X^∞, with mixture components of the form P_θ^∞. /

5.2. Hierarchical models

Suppose we specify a Bayesian model for a data source, which we represent as the joint distribution L(X_{1:∞}) of an infinite random sequence. The standard Bayesian modeling assumption (1.3) decomposes the joint distribution of the sequence as

    P(X_{1:∞} ∈ • ) = ∫_T P_θ^∞( • ) Q(dθ) .    (5.3)

We observe that we can apply the same idea recursively, and split up Q as, say,

    Q(Θ ∈ dθ) = ∫_{T′} Q[Θ ∈ dθ | Θ′ = θ′] Q′(Θ′ ∈ dθ′) ,    (5.4)

for some additional random variable Θ′ with law Q′ and values in a space T′. Why should we? If the random object Θ is very simple (e.g. a random scalar), there is indeed no good reason to do so, but if Θ is more complicated, then the recursive decomposition above can be used to simplify Θ by imposing hierarchical structure. By imposing a hierarchical structure, we introduce layers of random variables that are not observed, i.e. latent variables.¹ A useful rule of thumb to keep in mind is: hierarchical models are latent variable models which impose conditional independence structure between separate layers of latent variables.

In fact, we have already seen such a structure in Bayesian mixture models, which becomes more apparent if we define the model backwards:

Example 5.3 (DP mixture). Start with the joint distribution P(X_{1:∞} ∈ • ) of the data, and assume that it is a mixture as in (5.1), with a smooth component density p in (5.2). Then each observation X_i is explained by a separate value Φ_i sampled from the mixing measure, so the model parameter would be Θ = (Φ_{1:∞}), and we have no hope to recover it from data, since we would have to estimate each Φ_i from a single data point. Since the Φ_i are exchangeable, we can model them by a random mixing measure Θ′ with distribution Q′. If we choose Q′ in (5.4) as a Dirichlet process, Q[Θ|Θ′] as the joint distribution of Θ = (Φ_i) given the DP random measure Θ′, and P_θ as a measure with density p( • |θ), we obtain precisely the Dirichlet process mixture model. (Compared to our notation in Chapter 2, the random measure has now moved up one step in the hierarchy and is denoted Θ′ rather than Θ.) /

Hierarchical nonparametric models are particularly popular in machine learning, for a number of reasons:
• In many machine learning models (such as HMMs), the layers of the hierarchy have a natural interpretation.
• We can use both the basic models of Bayesian nonparametrics, such as DPs and GPs, and standard models from machine learning, as building blocks. They can be combined into hierarchies to represent more complex models.
• If we can Gibbs-sample each layer in a hierarchy, we can Gibbs-sample the hierarchy.
Perhaps the most prominent examples are two closely related models known as the infinite hidden Markov model and the hierarchical Dirichlet process.

Example 5.4 (Bayesian HMM). Recall that a hidden Markov model is a model for time-series data X_1, X_2, ..., where each observation X_i is indexed by a discrete time index i. The model generates observations by first generating a sequence (Φ_1, Φ_2, ...) of parameters with values in a space Ω_φ, which in this context is called the state space. An HMM specifically imposes the assumption that the sequence Φ_{1:∞} is a Markov chain. We additionally define a parametric model with density p(x|φ) and parameter space Ω_φ, and explain observations X_i as

Xi|Φi ∼ p(x|Φi) . (5.5) The parametric model defined by p is also called the emission model. Depicted as a graphical model, an HMM looks like this:

¹ Most authors motivate hierarchical models differently: if we define a prior controlled by a hyperparameter, and a suitable setting for that hyperparameter is not known, we can "be Bayesian about it" and define a hyperprior on the hyperparameter. If that hyperprior is controlled by yet another hyperparameter, we can add a further hyperprior, etc. That may well be how many hierarchical models in the literature have come about, but conceptually, I have personally never found this explanation very satisfactory: by consecutively defining hyper-priors, hyper-hyper-priors etc., we cannot escape the fact that the entire hierarchy ultimately still represents a single measure Q on T, so with each additional layer, we are only putting off the inevitable. The utility of hierarchies is that they impose a hierarchical structure on the random object Θ.

[Graphical model: a Markov chain Φ_1 → Φ_2 → ··· → Φ_n of hidden states, with an emission X_i attached to each Φ_i.]

Recall further that a Markov chain on a finite state space Ω_φ is completely characterized by two parameters, (1) the distribution ν_0 of the first state Φ_1, and (2) the transition matrix t, which specifies the probabilities to move from state i to state j:

    t = (t_ij)    where    t_ij := P[Φ_{n+1} = φ^j | Φ_n = φ^i]    for all φ^i, φ^j ∈ Ω_φ ,

where I am using superscripts to enumerate the elements of the state space as Ω_φ = {φ^1, ..., φ^K}. Now observe that, if we fix a specific time index n, the observation X_n is distributed according to a finite mixture model, with component distribution p(x|φ) and mixing measure m( • ) = Σ_{k=1}^{|Ω_φ|} c_k δ_{φ^k}( • ). What are the mixture weights c_k? If we consider the distribution of X_n conditionally on the previous state Φ_{n−1} = φ^i, then clearly

    c_k := P[Φ_n = φ^k | Φ_{n−1} = φ^i] = t_ik .    (5.6)

If we instead marginalize out the first (n − 1) states, then c_k is the probability for the Markov chain (ν, t) to end up in state k after n steps. Hence, we can regard an HMM with finite state space of size K as a sequence of mixture models with K components. The mixtures are tied together by the Markov chain. We obtain a Bayesian HMM by defining a prior distribution on the parameters (ν, t) of the Markov chain. The initial distribution ν is a probability on K events, so we could use a Dirichlet distribution on the simplex △_K as a prior. Similarly, each row of the transition matrix t is a probability distribution (again on K events), so we could e.g. sample ν and each row of t independently from one and the same Dirichlet distribution. /

Beal, Ghahramani, and Rasmussen [2] noticed that an HMM with (countably) infinite state space Ω_φ = {φ^1, φ^2, ...} can be obtained by making the Bayesian HMM nonparametric:

Example 5.5 (Infinite HMM [2]). Suppose we sample the distribution defining the ith row of t from a Dirichlet process. More precisely, we sample a random measure Θ_i ∼ DP(α, G) and define

    T_ij := C_j^i    for    Θ_i = Σ_{k∈N} C_k^i δ_{Φ_k^i} .    (5.7)

Then each atom Φ_k^i describes a separate state and there is a countably infinite number of atoms, so the random matrix T is infinite, and the state space is indeed countably infinite. There is, however, one complication: if we sample each Θ_i independently from a Dirichlet process with continuous base measure, the sets of atoms defined by any two such random measures Θ_i and Θ_j are almost surely disjoint. (Defining the base measure on a countably infinite space does not solve this problem, it just introduces additional counting problems.) The solution is to instead make the Θ_i conditionally independent by tying them together in a hierarchy: first, we sample a random measure Θ′ from a Dirichlet process. Then, conditionally on Θ′, each Θ_i is sampled from another Dirichlet process, with Θ′ as its base measure:

    Θ′ ∼ Q′ = DP(α, G)
    Θ_1, Θ_2, ... | Θ′ ∼_iid Q[ • | Θ′] = DP(α, Θ′) .    (5.8)

Beal et al. [2] called this hierarchy a hierarchical Dirichlet process or HDP. Suppose a given atom Φ′_k of Θ′ has a large weight C′_k. A look at the DP sampling formula (2.22) shows that such atoms tend to occur early in the sequential sampling of Θ_i. Hence, atoms with large weight C′_k also tend to have large weights C_k^i in the measures Θ_i. The HMM defined by the resulting infinite transition matrix T therefore concentrates much of its probability mass on a small subset of states (which is precisely what makes this model interesting). Beal et al. [2] called this model an infinite hidden Markov model.
From a machine learning perspective, the infinite state space means the model can "add new states" as more data becomes available. See [2] for more details and for inference, and [15] for an interesting theoretical perspective. /

In [2], the HDP was considered specifically as a device to generate the random transition matrix T. Teh, Jordan, Beal, and Blei [65] observed that this model is much more widely applicable, by regarding it as a generic hierarchical representation of a family of discrete random measures {Θ_1, Θ_2, ...} which all share the same atoms. The name hierarchical Dirichlet process is generally used to refer to their version of the model.

Example 5.6 (Hierarchical Dirichlet process [65]). Consider the clustering setup again, where we model an exchangeable sequence X_{1:∞} e.g. by a DP mixture. The parameter is a random measure Θ. Now suppose that the observed data is naturally subdivided into (known) subsets, so in fact we observe multiple sample sequences X^1_{1:n_1}, X^2_{1:n_2}, .... We split up the prior and hierarchically decompose Θ into {Θ_1, Θ_2, ...}, generated conditionally independently from a single Θ′ as in (5.8); a stick-breaking sketch of (5.8) is given below, after Remark 5.7. Each Θ_k is interpreted as the parameter explaining one set X^k_{1:n_k} of observations. A popular example are text document models (topic models), where each X^k_{1:n_k} represents a single text document, and the individual observations X^k_i individual words. The basic assumption in such models is that a topic is a distribution over the terms in a given vocabulary, represented as the parameter vector of a multinomial distribution with one category per term, and that text documents are mixtures of topics. Since each document may mix topics according to its own proportions, we need to estimate one mixture for each document k (represented by Θ_k), but all of these mixtures share the same topics (the atoms of the measure Θ′), and overall, some topics are more common in text documents than others (expressed by the weights C′_i of Θ′). See [65, 64] for more details. /

Remark 5.7 (Gibbs-sampling hierarchies). I have already mentioned above that one of the appealing aspects of hierarchical models is that they can be constructed by using basic models as components, and that we can Gibbs-sample the hierarchy if we can Gibbs-sample each component. I would like to complement this with a word of caution: each time we add an additional layer in our hierarchy, the size of the state space which has to be explored by the Gibbs sampler is multiplied by the number of states added in the additional layer. Thus, as an admittedly imprecise rule of thumb, the effective dimension of the state space grows roughly exponentially in the depth of the hierarchy. A hierarchical structure of course somewhat constrains complexity. However, for many of the more complex models we have seen in the machine learning literature in recent years, it seems unlikely that any sampler for the model posterior can actually be run long enough to have mixed. I believe we should ask ourselves seriously how much we really know about these models. /
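As promised, here is a minimal stick-breaking sketch of the two-level hierarchy (5.8), truncated at a finite number of sticks. The truncation, the Gaussian base measure G, and the function names are my own choices for illustration; the point the sketch demonstrates is that the bottom-level measures draw their atoms from Θ′ and therefore all share the same atoms:

```python
import numpy as np

def stick_weights(alpha, K, rng):
    """Truncated stick-breaking weights: C_k = V_k * prod_{j<k} (1 - V_j)."""
    V = rng.beta(1.0, alpha, size=K)
    return V * np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))

rng = np.random.default_rng(1)
K, alpha = 200, 2.0
# top level: Theta' ~ DP(alpha, G), with G = N(0, 1) as base measure
w_top = stick_weights(alpha, K, rng)
w_top /= w_top.sum()                    # renormalize the truncated weights
atoms_top = rng.normal(size=K)
# bottom level: Theta_1 | Theta' ~ DP(alpha, Theta'); since Theta' is
# discrete, the new atoms are drawn from Theta' and are therefore shared
w_1 = stick_weights(alpha, K, rng)
atoms_1 = atoms_top[rng.choice(K, size=K, p=w_top)]
```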

5.3. Covariate-dependent models

Observational data may involve covariates, i.e. observed variables that we do not bother to model as random—for example, because we are not trying to predict their values—but rather condition upon. In Bayesian nonparametrics, this problem was first addressed systematically by MacEachern [41], and although his work focussed on the Dirichlet process, I hope the (brief) discussion below clarifies that his ideas are much more generally applicable. For more on covariate-dependent models, see [16]. I will generically denote the covariate information as a (non-random) variable z with values in a space Z. An intuitive example is time-dependence, where Z is a set of time points. Say we are modeling some effect X over time. If we try to predict both the effect X and the time at which it occurs, we would include a time-valued random variable Z in the model, and attempt to predict (X, Z). If we are instead interested in predicting X at given times z, we would regard time as a covariate, and try to predict X(z). Now suppose we have a model M = {P_θ | θ ∈ T} that we consider adequate for X(z) at a fixed covariate value z. Since X(z) is effectively a function of z, we have to replace M by a z-indexed family of models M(Z). That means the parameter θ becomes a function θ(z) of z. As a parameter for the complete covariate-dependent model, we hence have to consider functions

    θ( • ) : Z → T .    (5.9)

At the very least, we will want this function to be measurable, so in its most general form, the covariate-dependent model defined by M would be

    M(Z) = {P_{θ( • )} | θ( • ) ∈ B(Z, T)} ,    (5.10)

where B(Z, T) is the space of Borel functions Z → T. We can always think of this as a T-valued regression problem, and as in any regression problem, we will have to impose additional smoothness properties on the function θ( • ).

Definition 5.8. Let Z be a standard Borel space and B(Z, T) the set of Borel-measurable functions Z → T. Let M = {P_t | t ∈ T} be a model on X. Let Q be a prior distribution on B(Z, T). We call a Bayesian model with prior Q and sample space B(Z, T) a covariate-dependent model with covariate space Z if

    L(X(z) | Θ( • )) = P_{Θ(z)}    for all z ∈ Z .    (5.11)    /

This definition is by no means carved in stone—I have just made it up here in the hope of making ideas precise. In most applications, the definition would have to be extended to include some form of censoring, since we rarely observe samples X_i(z) at every point z. MacEachern [41] introduced the specific case in which Θ(z) is a Dirichlet process for every z ∈ Z.

Definition 5.9. Let T = PM(V) and let Φ : Ω → B(Z, PM(V)) be a covariate-dependent parameter. If the law of Φ^z is a Dirichlet process for all z ∈ Z, i.e. if there are measurable mappings α : Z → R_{>0} and G_0 : Z → PM(V) such that

    L(Φ^z) = DP(α(z), G_0(z))    for all z ∈ Z ,    (5.12)

then L(Φ) is called a dependent Dirichlet process. /

In general, it is far from trivial to specify a prior distribution on random measurable functions Z → T. (Recall how difficult it is to turn a Gaussian distribution into the simplest distribution on continuous functions, the GP.) For a specific problem, we can start with a prior on T and try to find a representation that can be reformulated as a function of the covariate. MacEachern [41] noticed that this is possible for the Dirichlet process using the stick-breaking construction: a DP random measure is of the form Θ = Σ_k C_k δ_{Φ_k}. We can hence turn it into a covariate-dependent parameter Θ( • ) by making its components covariate-dependent,

    Θ(z) = Σ_{k∈N} C_k(z) δ_{Φ_k(z)} .    (5.13)

For the component parameters Φ_k, that "simply" means we have to define random functions Z → Ω_φ; if Ω_φ is Euclidean, we can do so using a Gaussian process. The C_k require a bit more thought: the stick-breaking construction (2.17) shows that we can generate the C_k from i.i.d. variables V_k that are marginally beta-distributed, so we need to generate random functions V_k : Z → [0, 1] such that marginally, L(V_k(z)) = Beta(1, α(z)), where α may now also be a function of z. The arguably most widely used solution is to transform a Gaussian process using cumulative distribution functions: suppose Y is a real-valued random variable and F its CDF. Then the random variable F(Y) is uniformly distributed on [0, 1]. If

    F^{−1}(w) := inf{y ∈ R | F(y) > w}    (5.14)

is the right-continuous inverse of F and U ∼ Uniform[0, 1], then F^{−1}(U) =^d Y. Hence, if Ṽ is a random function Z → R sampled from a Gaussian process, we can define F_z as the CDF of the (Gaussian) marginal distribution of Ṽ(z) and G_z as the CDF of Beta(1, α(z)). Then

    V(z) := G_z^{−1} ∘ F_z(Ṽ(z))    (5.15)

is a random function Z → [0, 1] with marginals L(V(z)) = Beta(1, α(z)). We can then obtain a dependent Dirichlet process as

    Θ( • , z) := Σ_{n=1}^∞ V_n(z) ∏_{j=1}^{n−1} (1 − V_j(z)) δ_{Φ_n(z)}( • ) .    (5.16)

Intriguingly, Θ is a random measurable function Z → PM(T), and hence a random probability kernel (or random conditional probability).
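A minimal sketch of the weight construction (5.15) for a scalar covariate (Python; the squared-exponential GP covariance, the lengthscale, and the function name are assumptions made for illustration, not part of [41]):

```python
import numpy as np
from scipy.stats import beta, norm

def dependent_stick_weights(z, alpha, n_sticks, ell=1.0, rng=None):
    """Covariate-dependent weights C_k(z) via (5.15)-(5.16), truncated."""
    rng = np.random.default_rng() if rng is None else rng
    z = np.asarray(z, dtype=float)
    K = np.exp(-0.5 * (z[:, None] - z[None, :]) ** 2 / ell ** 2)
    L = np.linalg.cholesky(K + 1e-9 * np.eye(len(z)))
    V = np.empty((n_sticks, len(z)))
    for k in range(n_sticks):
        v_tilde = L @ rng.standard_normal(len(z))  # GP draw, N(0,1) marginals
        u = norm.cdf(v_tilde)                      # F_z: uniform marginals
        V[k] = beta.ppf(u, 1.0, alpha)             # G_z^{-1}: Beta(1, alpha)
    # stick-breaking: C_k(z) = V_k(z) * prod_{j<k} (1 - V_j(z))
    C = V * np.vstack([np.ones(len(z)), np.cumprod(1.0 - V, axis=0)[:-1]])
    return C
```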

If modeling dependence of the weights C_k on z is not considered important, we can greatly simplify this model by setting

    Θ( • , z) := Σ_{n=1}^∞ V_n ∏_{j=1}^{n−1} (1 − V_j) δ_{Φ_n(z)}( • ) ,    (5.17)

called the single-p model in [41]. Note that (5.17) is simply a Dirichlet process on a function space, where each atom location Φ_k( • ) is a function, with a Gaussian process as its base measure.

CHAPTER 6

Exchangeability

Recall how we informally described statistical inference in Section 1.2 as the process of extracting an underlying pattern (represented by the model parameter) from observational data. A Bayesian approach models the unknown pattern as a random variable. Again (very) informally, the idea is that we decompose the randomness in the data source into two parts, as

    data = underlying pattern + sample randomness    (6.1)

(where I would ask you to read the "+" symbolically, not as an arithmetic operator). In a specific Bayesian model, these two parts correspond to the prior and the observation model. We can only hope to extract an underlying pattern from observations if (1) a common pattern exists and (2) it is not completely obfuscated by the sample randomness. Exchangeability properties provide criteria for when this is possible. The best-known result of this type is of course de Finetti's theorem, but it is actually just the basic (and historically first) example of a larger class of theorems, which explain the consequences of exchangeability for a wide range of random structures. In this chapter, I will briefly discuss three important results: the theorems of de Finetti, of Kingman, and of Aldous and Hoover. For more details, see [48].

6.1. Bayesian models and conditional independence

In applications of Bayesian modeling, we typically see Bayes' theorem used in a way that looks more or less like this:

    Q[dθ | X_{1:n} = x_{1:n}] =_{a.s.} ( ∏_{i=1}^n p_θ(x_i) / ∫_T ∏_{i=1}^n p_{θ′}(x_i) Q(dθ′) ) Q(dθ) ,    (6.2)

where Q is a prior distribution and p a likelihood density. A closer look shows that this setup implies a substantial modeling assumption beyond the choice of Q and p: we have assumed that, for a fixed instance θ of Θ, the joint likelihood of the sample X_{1:n} factorizes, i.e. that

    p_{n,θ}(x_1, ..., x_n) = ∏_{i=1}^n p_θ(x_i) .    (6.3)

Since Θ is a random variable, this means we assume that observations X_1, X_2, ... are conditionally independent (and identically distributed) given Θ. More formally, we call random variables X_1, ..., X_n conditionally independent given a random variable Θ if

    P[X_{1:n} ∈ dx_1 × ... × dx_n | Θ] =_{a.s.} ∏_{i=1}^n P[X_i ∈ dx_i | Θ] .    (6.4)


Conditional independence of X and X′ given Θ is often also denoted X ⫫_Θ X′. If the conditional distribution P[X_i ∈ dx_i | Θ] on the right-hand side above is identical for all X_i, we say that the X_i are conditionally i.i.d. given Θ. In this case, we can obviously write

    P[X_{1:n} ∈ dx_1 × ... × dx_n | Θ] =_{a.s.} ∏_{i=1}^n P_Θ(dx_i)    (6.5)

for some family of distributions P_θ on X. It is really important to understand that conditional independence is not simply an arbitrary modeling assumption—arguably, it is the heart and soul of Bayesian modeling. In terms of our "data = pattern + sample randomness" idea above, conditional independence means that, given the pattern Θ, all remaining randomness in the sample completely decouples (is stochastically independent between samples). In other words: if observations are conditionally independent given Θ, all joint information in data sampled from the source is contained in Θ. From a Bayesian statistics perspective, this means Θ (and only Θ) is the information we want to extract from data. The fundamental question we ask in this chapter is: under which conditions can we assume conditional independence of observations given some random quantity?

Example 6.1. To illustrate the difference between known and unknown Θ, suppose we sample data from a Gaussian with known, fixed covariance, and Θ is simply the Gaussian's mean:

[Figure: samples from a Gaussian with unknown mean Θ, and two shaded regions A and B.]

What does the observed data tell us about the probability of whether the next observation will occur in the shaded area A or B, respectively? That depends on whether or not we know where Θ is: (1) If we do not know the location of Θ, the data indicates the next sample is more likely to be located in A than in B. That means the observed sample X1:n carries information about the next sample point Xn+1, and X1:n and Xn+1 are hence stochastically dependent—they couple through the unknown location Θ. (2) If we do know Θ, we can precisely compute the probability of A and B, and the observed sample provides no further information. All information X1:n can possibly provide about Xn+1 is contained in Θ, so knowing Θ decouples X1:n and Xn+1. Similarly, in the regression example in Figure 1.2, the observations are conditionally independent given the regression function. /

6.2. Prediction and exchangeability

We can approach the problem from a different angle by taking a predictive perspective, as we have done in Example 6.1: under what conditions can we predict observation X_{n+1} from a recorded sample X_{1:n}? For simplicity, we first consider two samples of equal size n:

    X_1, ..., X_n  (already observed)        X_{n+1}, ..., X_{2n}  (future observations)    (6.6)

Suppose we use some statistical tool to extract information from X1:n. If we hope to use this information to make predictions about Xn+1:2n, then whatever it is that we have extracted must still be valid for Xn+1:2n. The most general form of “information” extractable from X1:n is of course the joint distribution L(X1:n). We see that there are two possible cases in which we may be able to predict Xn+1:2n:

• X1:n and Xn+1:2n contain—up to finite-sample effects—the same information. In terms of distributions, this would mean

L(X1:n) = L(Xn+1:2n) . (6.7)

• L(X_{1:n}) and L(X_{n+1:2n}) differ, but the difference can be estimated from X_{1:n}—for example, if X_{1:∞} represents a time series with a drift, the two laws would differ, but we may be able to estimate the drift and correct for it.
Since we are looking for a general and reasonably simple result, we only consider the first case; the second one would be a mess of assumptions and special cases. We note that the first case implies we could swap the two blocks of samples in (6.6), and predict X_{1:n} from X_{n+1:2n} just as well as the other way around: since any two random variables satisfy L(Y|Z)L(Z) = L(Z|Y)L(Y), and since conditional probabilities are a.s. unique, (6.7) implies

L(Xn+1:2n|X1:n) =a.s. L(X1:n|Xn+1:2n) . (6.8) Again combined with (6.7), this additionally means that the joint distribution of X1:2n does not change if we swap the blocks:

L(Xn+1:2n,X1:n) = L(X1:n,Xn+1:2n) (6.9) For general prediction problems, the block structure assumed in (6.6) is rather arbitrary. Instead, we would rather ask under which conditions we can predict any one observation in a sample from the remaining ones. In terms of swapping variables around, that means: If we consider prediction of Xn+1 given X1:n, we can swap Xn+1 with any element of X1:n:

    X_1, ..., X_m, ..., X_n, X_{n+1}            X_1, ..., X_{n+1}, ..., X_n, X_m
        (observed)        (predict)                 (observed)        (predict)

As observations come in one by one, each element of the sequence is at some point the most recent one. The assumption hence implies we can swap any variable with any other one to its left. By making such swaps repeatedly, we can generate any possible rearrangement of the sample (since the set of transpositions—of permutations which only swap two elements—forms a generator of the symmetric group). The counterpart of (6.9) is then

    L(X_1, ..., X_{n+1}) = L(X_{π(1)}, ..., X_{π(n+1)})    (6.10)

for any permutation π of {1, ..., n + 1}.

This should of course hold for any n—unless we have reason to assume that out-of-sample prediction is only possible up to a certain sample size (e.g. if a drift kicks in after some fixed number of samples, or in similarly exotic settings). We hence consider an infinite sequence X1:∞, and demand that (6.10) holds for any permutation that affects at most the first n+1 elements, for any n. In other words, the joint distribution has to be invariant under permutations of N that exchange an arbitrary but finite number of elements. The set of all such permutations is called the infinite symmetric group and denoted S∞.

Definition 6.2. The random sequence X_{1:∞} = (X_1, X_2, ...) is called exchangeable if its joint distribution does not depend on the order in which the values X_i are observed. More formally, if

    (X_1, X_2, ...) =^d (X_{π(1)}, X_{π(2)}, ...)    for all π ∈ S_∞ ,    (6.11)

or, expressed in terms of the joint distribution, π(L(X)) = L(X) for every π. /

From a statistical modeling perspective, we can paraphrase the definition as:

    exchangeability = the order of observations does not carry relevant information

Remark 6.3. It can be shown that we can alternatively define exchangeability in terms of all bijections π of N (i.e. we additionally include those permutations which change an infinite number of elements); the two definitions are equivalent. /

We set out originally to find a criterion for whether a sequence is conditionally i.i.d. Which sequences are exchangeable? An i.i.d. sequence clearly is, simply because its distribution is a product and products commute. It is easy to see that the same is true for a sequence which is conditionally i.i.d. given some Θ, since its conditional distribution given Θ is a product as in (6.5), and there is just a single Θ for all X_i. Thus,

    { conditionally i.i.d. sequences } ⊂ { exchangeable sequences } .

To obtain a criterion, we hence have to ask which exchangeable sequences are not conditionally i.i.d. (and which additional conditions we may have to impose to exclude such troublemakers). The rather amazing answer, given by de Finetti's theorem, is that there are no troublemakers: a sequence X_{1:∞} is exchangeable if and only if it is conditionally i.i.d. given some Θ. Exchangeability is hence precisely the criterion we are looking for. Zabell [68] gives a very insightful account of prediction and exchangeability.

6.3. de Finetti’s theorem

If an infinite sequence X_{1:∞} is conditionally i.i.d. given Θ, then by definition,

    P[X_{1:∞} ∈ dx_1 × dx_2 × ... | Θ] = ∏_{i∈N} P[X_i ∈ dx_i | Θ] .    (6.12)

If we define P_θ( • ) := P[X_i ∈ • | Θ = θ], we obtain a family of measures

M := {Pθ|θ ∈ T} . (6.13)

Since Θ is random, P_Θ is a random variable with values in PM(X), and hence a random probability measure. We abbreviate the factorial distribution as

    P_θ^∞(dx_1 × dx_2 × ...) = ∏_{i∈N} P_θ(dx_i) .    (6.14)

Now, if we choose any family M, and any random variable Θ with values in T, then P_Θ^∞ is the joint distribution of a conditionally i.i.d. sequence, and we already know that any such sequence is exchangeable. In general, we therefore have to permit any measure in PM(X), and hence M = PM(X). Therefore, we simply choose T = PM(X), so any parameter value θ is a probability measure, and P_θ = θ. We can now interpret Θ as a random probability measure.

Theorem 6.4 (de Finetti). An infinite random sequence X_{1:∞} is exchangeable if and only if there is a random probability measure Θ on X such that

    P[X_{1:∞} ∈ • | Θ] =_{a.s.} Θ^∞( • ) .    (6.15)    /

If you have seen de Finetti's theorem before, you are probably more used to the form (6.16) below, which is a direct consequence of (6.15): the quantities on both sides of (6.15) are random variables, and the randomness in both is given by the randomness in Θ. If we integrate both sides of the equation against L(Θ)—that is, if we compute expectations with respect to Θ—we obtain:

Corollary 6.5. A random sequence X is exchangeable if and only if

    P(X ∈ • ) = ∫_{PM(X)} θ^∞( • ) ν(dθ)    (6.16)

for some distribution ν on PM(X). /

We have to be a bit careful here: (6.16) states that two random variables are equal in expectation, whereas (6.15) says that the same two variables are equal almost surely, which is a much stronger statement. The argument above, that we integrate both sides of (6.15), therefore only establishes that exchangeability implies (6.16), but not the converse. However, the right-hand side of (6.16) is a mixture, and we know that mixtures can be sampled in two stages, which here means sampling Θ ∼ ν and then sampling X_{1:n} | Θ from Θ^∞, so X_{1:∞} is conditionally i.i.d. We already know that conditionally i.i.d. sequences are exchangeable.

The theorem has another important implication: since X_{1:∞} is conditionally i.i.d., we can apply the law of large numbers [e.g. 28, Theorem 4.23] conditionally on Θ. For any measurable set A, it tells us that, almost surely given Θ, the probability of A under the empirical distribution defined by X_{1:n} converges to the probability Θ(A). We therefore have, as the second direct consequence of Theorem 6.4:

Corollary 6.6. If the random sequence X is exchangeable, then

    (1/n) Σ_{i=1}^n δ_{X_i(ω)}( • )  →  Θ(ω)    weakly, ν-a.s., as n → ∞,    (6.17)

where Θ is the random measure in (6.15). /

To make the step from the argument above (almost sure convergence for every A) to the actual corollary (weak convergence a.s.), we have tacitly used the Polish topology of X. Finally, I would like to point out that we can alternatively state de Finetti's theorem directly in terms of random variables, rather than in terms of their distributions: suppose θ ∈ PM(X) is a probability measure on X. We denote the i.i.d. random sequence sampled from this measure as

    X°_θ := (X_1, X_2, ...)    where X_1, X_2, ... ∼_iid θ ,    (6.18)

that is, L(X°_θ) = θ^∞. We then get an exchangeable sequence by randomizing θ, i.e. if Θ is a random probability measure on X, then X°_Θ is an exchangeable sequence. de Finetti's theorem can then be restated as

    X_{1:∞} exchangeable  ⇔  X_{1:∞} =_{a.s.} X°_Θ for some Θ ∈ RV(PM(X)) .    (6.19)

This perspective will be very useful for the more advanced representation theorems discussed below, which are much more elegantly stated in terms of random variables than in terms of distributions.

6.4. Exchangeable partitions

Not all problems are naturally represented by random sequences—for some problems, a random partition or random matrix, for example, may be a better fit for the data. For such problems, we can still draw on exchangeability properties for Bayesian modeling; we just have to substitute the representation theorem for exchangeable sequences (de Finetti) by a suitable representation for another exchangeable structure. In Chapter 2, we have considered random partitions of N as solutions of clustering problems, and exchangeable partitions will be the first example of a more intricate exchangeable structure we consider. We already noted in Chapter 2 that, given any method that maps a sample to a clustering solution, any random sample X_1, ..., X_n induces a random partition of [n]. If the X_i are exchangeable, the induced random partition also has an exchangeability property: suppose we permute the sequence (X_i) and obtain (X_{π(i)}). If the clustering solution for, say, X_1, ..., X_5 is

    ({1, 2, 4}, {3, 5}) ,    (6.20)

the solution for X_{π(1)}, ..., X_{π(5)} would be

    ({π(1), π(2), π(4)}, {π(3), π(5)}) .    (6.21)

Since (X_i) and (X_{π(i)}) are equally distributed, the two partitions above have the same probability of occurrence, and are hence equivalent for statistical purposes.

Remark 6.7. In particular, the permutation may change the order of the blocks (e.g. if π would swap 1 and 3 in the example above). The enumeration of the blocks by their index k is hence completely arbitrary. This fact is known in the clustering literature as the label switching problem, and can be rather inconvenient, since it implies that any statistical method using the clustering solution as input has to be invariant to permutation of the cluster labels. As our discussion above shows, it is a direct consequence of exchangeability of the observations. /

To formalize what we mean by exchangeability of random partitions, recall from Section 2.6 how we encoded a random partition Ψ = (Ψ_1, Ψ_2, ...) of N by a random sequence (L_i), with L_i = k if i ∈ Ψ_k.

Definition 6.8. An exchangeable random partition Ψ is a random partition of N whose law is invariant under the action of any permutation π on N. That is,

    (L_1, L_2, ...) =^d (L_{π(1)}, L_{π(2)}, ...)    (6.22)

for all π ∈ S_∞. /

The counterpart to de Finetti's theorem for partitions is Kingman's representation theorem. We will state this theorem in a form similar to (6.19). To do so, we have to define a specific type of random partitions Ψ°_θ which play a role analogous to that of i.i.d. sequences in de Finetti's theorem. These random partitions were introduced by Kingman, who called them "paint-box partitions". The natural parameter space for these partitions turns out to be the set of a specific type of sequences which are called mass partitions in some parts of the literature [see e.g. 3]. By a mass partition θ, we mean a partition of the unit interval of the form

    θ = (θ_1, θ_2, ..., θ̄)    with    θ̄ := 1 − Σ_{i∈N} θ_i    (6.23)

which satisfies

    θ_1 ≥ θ_2 ≥ ... ≥ 0    and    Σ_i θ_i ≤ 1 .    (6.24)

A mass partition might look like this:

    [— θ_1 —|— θ_2 —| ··· |— θ̄ —]    (6.25)

This is just the kind of partition we generated using the stick-breaking construction in (2.15), with the difference that the interval lengths generated by stick-breaking need not be monotonically decreasing (although they decrease in expectation). The Poisson-Dirichlet distribution (cf. Remark 2.10) is a distribution on mass partitions. Given a mass partition θ, we can sample a partition of N by throwing uniform random variables U_1, U_2, ... on the unit interval (think of the indices of these variables as the elements of N). Since the mass partition subdivides [0, 1] into subintervals, we just have to record which U_i end up in the same subinterval, and regard their indices as elements of the same block:

Definition 6.9. Let θ be a mass partition, and define a random partition Ψ°_θ of N as follows: let U_{1:∞} be a sequence of i.i.d. uniform random variables in [0, 1] and let

    L_i := k    ⇔    U_i ∈ [ Σ_{j=1}^{k−1} θ_j , Σ_{j=1}^k θ_j ) .    (6.26)

Then Ψ°_θ is called the paint-box partition with parameter θ. /

Note that U_i may be larger than Σ_{j∈N} θ_j, i.e. in the example in (6.25), it would end up in the rightmost interval of length θ̄. If so, the definition implies that i forms a block of its own in Ψ°_θ, since the probability of inserting a second number into the same block is zero. These singleton blocks are called dust. Clearly, the partition Ψ°_θ is exchangeable. If the paint-box partitions play a role analogous to i.i.d. sequences, then the set of all mass partitions plays a role analogous to that of the parameter space PM(X) in the de Finetti representation. We denote this set as

    △ := {θ ∈ [0, 1]^N | θ is a mass partition} .    (6.27)

So far, this is just a set of points; to turn it into a space, we need to define a topology, and we have already noted several times that we like our spaces to be Polish. We can metrize △ by defining the metric d(θ, θ′) := max_{k∈N} |θ_k − θ′_k|. The metric space (△, d) is compact [3, Proposition 2.1], and hence Polish (since all compact metric spaces are Polish).

Theorem 6.10 (Representation theorem, Kingman). A random partition Ψ is exchangeable if and only if

    Ψ =_{a.s.} Ψ°_Θ    (6.28)

for some random mass partition Θ ∈ RV(△). /

As in the sequence case, this result immediately implies an integral decomposition:

Corollary 6.11 (Integral decomposition). A random partition Ψ is exchangeable if and only if

    P(Ψ ∈ • ) = ∫_△ p( • , θ) ν(dθ) ,    (6.29)

where p denotes the paintbox distribution p( • , θ) := L(Ψ°_θ)( • ). /

Perhaps more important is the fact that the mass partition θ—the model parameter, from a statistical perspective—can asymptotically be recovered from observations:

Corollary 6.12 (Law of large numbers). If Ψ is an exchangeable random partition, then

    (1/n) Σ_{i=1}^n 𝟙{L_i = k}  →  θ_k    as n → ∞ .    (6.30)    /

For this reason, the elements θ_k of θ are also called the asymptotic frequencies of Ψ.
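Definition 6.9 is a one-line simulation once the subinterval boundaries are computed. A minimal sketch (Python, hypothetical function name), with dust elements given fresh negative labels to mark their singleton blocks:

```python
import numpy as np

def paintbox(theta, n, rng=None):
    """Sample the first n labels L_1..L_n of a paint-box partition (6.26).
    theta: mass partition (nonincreasing, sums to <= 1); leftover mass
    theta_bar produces singleton "dust" blocks."""
    rng = np.random.default_rng() if rng is None else rng
    edges = np.cumsum(theta)          # right endpoints of the subintervals
    labels = np.empty(n, dtype=int)
    dust = 0
    for i in range(n):
        u = rng.random()
        k = np.searchsorted(edges, u, side="right")
        if k < len(edges):
            labels[i] = k + 1         # U_i fell into the k-th subinterval
        else:
            dust -= 1                 # U_i fell into the dust interval:
            labels[i] = dust          # i forms a block of its own
    return labels

print(paintbox([0.5, 0.3], 10))
```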

6.5. Exchangeable arrays

The next type of structure we consider consists of (infinite) collections of variables indexed by d indices,

    x := (x_{i_1,...,i_d})_{i_1,...,i_d ∈ N}    where x_{i_1,...,i_d} ∈ X_0 .    (6.31)

Such a structure x is called a d-array. Clearly, sequences are 1-arrays, but d-arrays are much more versatile: a matrix, for example, is a 2-array where X_0 is an algebraic field (so that adding and multiplying entries of x, and hence matrix multiplication, is well-defined). A simple graph—a graph without multiple edges—is represented by a 2-array with X_0 = {0, 1}, the adjacency matrix of the graph. If the matrix is symmetric, the graph is undirected. To keep things simple, I will only discuss 2-arrays in this section. Although results for general d-arrays are a straightforward generalization, they are notationally cumbersome. See [48, Section 6] for more on d-arrays. Suppose X is a random 2-array. To define exchangeability for arrays, we have to decide which components of X we are going to permute. In sequences, we simply permuted individual elements. If data is represented as a 2-array X, however, this typically implies that the row-column structure carries some form of meaning—otherwise we could just write the entries into a sequence—and an adequate notion of exchangeability should therefore preserve rows and columns. That is, if two entries are in the same row, they should still be in the same row after the permutation has been applied, even if the order of elements within the row may have changed (and similarly for columns). Hence, rather than permuting entries of X, we permute only its rows and columns. There are two ways of doing so: we could either apply the same permutation π to the rows and to the columns, or we could use one permutation π_1 on the rows and another π_2 on the columns.

Definition 6.13. A random 2-array $X = (X_{ij})_{i,j\in\mathbb{N}}$ is jointly exchangeable if
\[ (X_{ij}) \overset{d}{=} (X_{\pi(i)\pi(j)}) \quad\text{for every } \pi \in \mathbb{S}_\infty . \tag{6.32} \]
X is separately exchangeable if
\[ (X_{ij}) \overset{d}{=} (X_{\pi_1(i)\pi_2(j)}) \quad\text{for every pair } \pi_1, \pi_2 \in \mathbb{S}_\infty . \tag{6.33} \]
/

[Figure: for a binary matrix (with 1s encoded as black dots), joint exchangeability applies a single permutation π to rows and columns, while separate exchangeability applies two permutations π₁ (rows) and π₂ (columns).]

A simple way to generate a random matrix would be to define a function f with values in $\mathbf{X}_0$, say with two arguments. Now sample two sequences of random variables $(U_1, U_2, \ldots)$ and $(U'_1, U'_2, \ldots)$. If we set $X_{ij} := f(U_i, U'_j)$, we get a random 2-array. If the sequences $(U_i)$ and $(U'_j)$ have independent elements and are independent of each other, X is clearly separately exchangeable. If we set $X_{ij} := f(U_i, U_j)$ instead, it is jointly exchangeable. It is not hard to see that this cannot be all jointly or separately exchangeable arrays, though: If we start with either of the arrays above and randomize each entry independently—that is, if we include a third argument f( • , • , $U_{ij}$), where the random variables $U_{ij}$ are independent—we do not break exchangeability. Changing the distribution of the random variables $U_i$ etc. does not give us any additional expressive power, since we can always equivalently change the function f. Hence, we can simply choose all variables as i.i.d. uniform (or some other convenient, simple distribution).

Definition 6.14. Let $\mathbf{F}(\mathbf{X}_0)$ be the space of measurable functions $\theta : [0,1]^3 \to \mathbf{X}_0$. Let $(U_i)$ and $(V_i)$ be two i.i.d. sequences and $(U_{ij})$ an i.i.d. 2-array, all consisting of Uniform[0,1] random variables. For any $\theta \in \mathbf{F}$, define two random arrays $J^\circ_\theta$ and $S^\circ_\theta$ as
\[ J^\circ_\theta := (J_{ij}) \quad\text{with}\quad J_{ij} := \theta(U_i, U_j, U_{ij}) \tag{6.34} \]
and
\[ S^\circ_\theta := (S_{ij}) \quad\text{with}\quad S_{ij} := \theta(U_i, V_j, U_{ij}) . \tag{6.35} \]
/

Rather amazingly, it turns out that these arrays $J^\circ_\theta$ and $S^\circ_\theta$ play a role analogous to that of i.i.d. sequences and paintbox partitions—that is, any exchangeable array can be obtained by making the function θ random; a small sampling sketch follows below.
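Here is the sketch of Definition 6.14 in Python with NumPy; the particular function θ (and the helper name) are illustrative choices of mine, not part of the text:

```python
import numpy as np

def sample_arrays(theta, n, rng=None):
    """Sample the upper-left n x n corners of J°_theta (6.34) and S°_theta (6.35)."""
    rng = rng or np.random.default_rng(1)
    U = rng.uniform(size=n)            # row variables U_i
    V = rng.uniform(size=n)            # column variables V_j (separate case only)
    Uij = rng.uniform(size=(n, n))     # entrywise randomization U_ij
    J = theta(U[:, None], U[None, :], Uij)   # jointly exchangeable: rows and columns share U
    S = theta(U[:, None], V[None, :], Uij)   # separately exchangeable: fresh V_j for columns
    return J, S

# An illustrative binary theta (my choice): the entry is 1 when the product
# of the latent labels exceeds the entrywise noise term.
theta = lambda u, v, w: (u * v > w).astype(int)
J, S = sample_arrays(theta, n=5)
print(J)
print(S)
```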

Figure 6.1. Functions w (“graphons”) representing different types of random graph models. Left to right: Undirected graph with linear edge density (the standard example in [37]), nonparametric block model for separately exchangeable data [29], Mondrian process model for separately exchangeable data [55], graphon with Gaussian process distribution for undirected graphs [35]. Figure from [48].

Theorem 6.15 (Aldous, Hoover). A random 2-array $X = (X_{ij})$ with entries in a Polish space $\mathbf{X}_0$ is jointly exchangeable if and only if
\[ X \overset{d}{=} J^\circ_\Theta \quad\text{for some } \Theta \in \mathrm{RV}(\mathbf{F}(\mathbf{X}_0)) . \tag{6.36} \]
It is separately exchangeable if and only if
\[ X \overset{d}{=} S^\circ_\Theta \quad\text{for some } \Theta \in \mathrm{RV}(\mathbf{F}(\mathbf{X}_0)) . \tag{6.37} \]
/

Remark 6.16 (Exchangeable random graphs and graph limits). Suppose X is in particular the adjacency matrix of a random simple graph (where simple means there is at most one edge between two vertices if the graph is undirected, or at most one edge in each direction in the directed case). Then X is a random binary matrix, which is symmetric iff the graph is undirected. Because X is binary, we can simplify the random function Θ from three to two arguments: For a fixed function $\theta \in \mathbf{F}(\{0,1\})$ and a uniform random variable U, θ(x, y, U) is a random element of {0, 1}. If we define
\[ w(x,y) := \mathbb{P}[\theta(x,y,U) = 1] , \tag{6.38} \]
then w is an element of the set $\mathbf{W}$ of measurable functions $[0,1]^2 \to [0,1]$. For a fixed function $w \in \mathbf{W}$, we can sample the adjacency matrix X of an exchangeable random graph $G^\circ_w$ as:

\[ U_1, U_2, \ldots \sim_{iid} \text{Uniform}[0,1] \qquad X_{ij} \sim \text{Bernoulli}(w(U_i, U_j)) \tag{6.39} \]
Theorem 6.15 then implies that any exchangeable random graph G can be obtained by mixing over w: G is exchangeable if and only if
\[ G \overset{d}{=} G^\circ_W \quad\text{for some } W \in \mathrm{RV}(\mathbf{W}) . \tag{6.40} \]
The functions w are also called graphons or graph limits in random graph theory, see [37]. I will not go into further details, but rather refer to [48]. /
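The sampling scheme (6.39) translates directly into code; a minimal sketch in Python with NumPy, where the particular graphon w is an arbitrary illustrative choice:

```python
import numpy as np

def sample_graphon_graph(w, n, rng=None):
    """Sample the adjacency matrix of the first n vertices of G°_w via (6.39)."""
    rng = rng or np.random.default_rng(2)
    U = rng.uniform(size=n)                    # one uniform label per vertex
    P = w(U[:, None], U[None, :])              # edge probabilities w(U_i, U_j)
    E = rng.uniform(size=(n, n)) < P           # Bernoulli draws
    X = np.triu(E, k=1)                        # keep i < j: no self-loops
    return (X | X.T).astype(int)               # symmetrize: undirected simple graph

# Illustrative graphon; any measurable symmetric w : [0,1]^2 -> [0,1] works here.
w = lambda x, y: 0.5 * (x + y)
A = sample_graphon_graph(w, n=200)
print(A.sum() / (200 * 199))                   # empirical edge density, ≈ E[w(U,U')] = 0.5
```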

6.6. Applications in Bayesian statistics

You will have noticed that all exchangeability theorems discussed above—de Finetti, Kingman, Aldous-Hoover, the special case of Aldous-Hoover for graphs—had a common structure: We consider a random structure X (a sequence, a graph, a partition, ...). We assume that this structure has an exchangeability property.

\begin{tabular}{llll}
exchangeable structure & ergodic structures & representation & $\mathbf{T}$ \\
sequences in $\mathbf{X}$ & i.i.d. sequences & de Finetti & $\mathrm{PM}(\mathbf{X})$ \\
partitions of $\mathbb{N}$ & paintbox partitions $\Psi^\circ_\theta$ & Kingman & $\triangle$ \\
graphs & graphs $G^\circ_w$ in (6.39) & Aldous-Hoover & $\mathbf{W}$ \\
arrays (jointly) & arrays $J^\circ_\theta$ in (6.34) & Aldous-Hoover & $\mathbf{F}$ \\
arrays (separately) & arrays $S^\circ_\theta$ in (6.35) & Aldous-Hoover & $\mathbf{F}$
\end{tabular}
Table 6.1.

If so, the relevant representation theorem specifies some space T and a family of random variables $X^\circ_\theta$, parametrized by elements θ ∈ T. The representation result then states that X is exchangeable if and only if it is of the form

\[ X \overset{d}{=} X^\circ_\Theta \quad\text{for some random element } \Theta \in \mathrm{RV}(\mathbf{T}) . \tag{6.41} \]

The special structures $X^\circ_\theta$ are also called the ergodic structures. Here is one of my favorite examples of how to apply representation results in Bayesian modeling:

Example 6.17 (Priors for graph-valued data [35]). Suppose we consider data represented by a single, large graph (a network, say). As more data is observed, the graph grows. Can we define a Bayesian model for such data? We can interpret the observed graph $G_n$ (with n vertices) as a small snapshot from an infinite graph G (just as we would interpret n sequential observations as the first n elements in an infinite sequence). If we assume G to be exchangeable, by Theorem 6.15, there is a random function W with $G \overset{d}{=} G^\circ_W$. In other words, we explain our observed graph as the first n vertices sampled from W according to (6.39).
Thus, in order to define a prior distribution for data modeled as an exchangeable graph, we have to define a prior distribution on the function space $\mathbf{W}$. We could define a parametric model (by choosing a finite-dimensional subspace of $\mathbf{W}$), or a nonparametric one; in terms of the nonparametric models we have already discussed, we could generate W using a Gaussian process, as in [35]. We can also consider models for graph data defined in the literature; if these models implicitly assume exchangeability, we can categorize them according to what type of function W characterizes the particular model. Figure 6.1 shows examples. /

Let me try to sketch the bigger picture: For a given type of exchangeable structure X, the ergodic random structures $X^\circ_\theta$ define a family of distributions
\[ P_\theta := \mathcal{L}(X^\circ_\theta) . \tag{6.42} \]
These distributions are generic—each representation theorem tells us how to sample from $P_\theta$ for a given θ (e.g. by the paint-box sampling scheme in Kingman’s theorem if X is an exchangeable partition). Thus, any statistical model of exchangeable X is of the form

\[ \mathcal{M} = \{P_\theta \mid \theta \in \mathbf{T}_0\} , \tag{6.43} \]
where the parameter space of the model is some subset $\mathbf{T}_0 \subset \mathbf{T}$ of the space characterized by the relevant representation theorem. Defining a Bayesian model then means defining a prior Q on $\mathbf{T}_0$. Table 6.1 summarizes the examples we have seen in this chapter.

In particular, we can consider the Dirichlet process again: To define a prior distribution for an exchangeable partition, we can invoke Kingman’s theorem, and hence have to define a prior distribution on the space $\mathbf{T} = \triangle$ of mass partitions. Clearly, a stick-breaking distribution as in (2.14) does just that, if we subsequently order the weights generated by stick-breaking by decreasing size, as assumed in the definition of $\triangle$. (From a sampling perspective, ranking by size makes no actual difference, since it clearly leaves the paint-box distribution invariant; it is simply a device to enforce uniqueness of the parametrization.) If we specifically choose the stick-breaking construction of the DP as our prior Q, the resulting exchangeable random partition is the CRP; the sketch below illustrates this pipeline.
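A minimal sketch of the pipeline in Python with NumPy; truncation level, concentration parameter, and function names are illustrative choices of mine:

```python
import numpy as np

def dp_stick_breaking(alpha, K, rng):
    """Truncated DP stick-breaking: C_k = V_k * prod_{j<k} (1 - V_j), V_j ~ Beta(1, alpha)."""
    V = rng.beta(1.0, alpha, size=K)
    C = V * np.cumprod(np.concatenate(([1.0], 1.0 - V[:-1])))
    return np.sort(C)[::-1]        # rank by decreasing size: a (truncated) mass partition

rng = np.random.default_rng(3)
theta = dp_stick_breaking(alpha=2.0, K=1000, rng=rng)

# Feeding theta into the paint-box scheme of Definition 6.9 yields (up to truncation)
# an exchangeable partition distributed like the CRP with concentration alpha = 2.
U = rng.uniform(size=50)
labels = np.searchsorted(np.cumsum(theta), U)
```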

CHAPTER 7

Posterior distributions

This chapter discusses theoretical properties of posterior distributions. To motivate the questions we will try to address, let me run through a drastically oversimplified example of Bayesian inference:

Example 7.1 (Unknown Gaussian mean). We assume that the data is generated from a Gaussian on R with fixed variance σ2; the mean θ is unknown. Hence, the observation model is p(x|θ, σ) = g(x|θ, σ) (where g is the Gaussian density on the line). We assume that the mean Θ is random, but since we do not know its distribution, we have to make a modeling assumption for Q. Suppose we have reason to believe that Θ is roughly distributed according to a Gaussian with density g(θ|µ0 = 2, σ0 = 5):

[Figure: two panels — the prior distribution (with the most probable model under the prior marked) and the sampling distribution (the actual distribution of the data).]

Since circa 68% of the mass of a Gaussian is located within one standard deviation of the mean, this prior distribution expresses the assumption that µ₀ = 2 is the most probable value of Θ, and that Θ ∈ [−3, 7] with probability ≈ 0.68. Our objective is now to compute the posterior distribution. In a simple model like this, the posterior can be computed using Bayes’ theorem—which we have not yet discussed in detail; we will do so in Section 7.2 below. Using the theorem, we find that the posterior under n observations with values x₁, ..., xₙ is again a Gaussian, with density g(θ|µₙ, σₙ) and parameters

\[ \mu_n := \frac{\sigma^2\mu_0 + \sigma_0^2\sum_{i=1}^{n} x_i}{\sigma^2 + n\sigma_0^2} \quad\text{and}\quad \sigma_n^2 := \frac{\sigma^2\sigma_0^2}{\sigma^2 + n\sigma_0^2} . \tag{7.1} \]
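As a quick numerical check of (7.1), here is a minimal sketch in Python with NumPy; the observation standard deviation σ = 1 is my own assumption, since the example fixes only µ₀, σ₀ and θ:

```python
import numpy as np

def gaussian_posterior(x, mu0, sigma0, sigma):
    """Posterior parameters (7.1) for a Gaussian mean with known variance sigma^2."""
    n = len(x)
    denom = sigma**2 + n * sigma0**2
    mu_n = (sigma**2 * mu0 + sigma0**2 * np.sum(x)) / denom
    sigma_n = np.sqrt(sigma**2 * sigma0**2 / denom)
    return mu_n, sigma_n

rng = np.random.default_rng(4)
x = rng.normal(loc=6.0, scale=1.0, size=10)      # data from p( . , 6); sigma = 1 assumed
for n in (1, 2, 10):
    print(n, gaussian_posterior(x[:n], mu0=2.0, sigma0=5.0, sigma=1.0))
# The posterior mean moves quickly from the prior mean 2 toward theta = 6,
# and the posterior standard deviation shrinks roughly like sigma / sqrt(n).
```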

To plot these posteriors and compare them to the prior density, I have sampled n = 10 data points from the observation model with parameter θ = 6, i.e. from p( • , 6). The posteriors after the first few observations look like this:

[Figure: prior density and posterior densities for n = 1, 2, 10; the observations are plotted as blue dots on the line.]

Although we know of course that we should never try to perform statistical estimation from 10 observations, we see that the posterior concentrates very rapidly at the actual parameter value. /

We see that Example 7.1 raises a list of implicit questions, which all happen to have an easy answer in the case above, but require more effort in general:
(I) How should we choose the prior? (We will see that this question cannot be answered in isolation from choosing the observation model; this is hence the modeling problem of Bayesian statistics, analogous to the modeling problem in the classical case.)
(II) Does the posterior always exist? (As long as we work on a parameter space with reasonable topological properties: Yes.)
(III) If the posterior exists, how do we determine it? (There are several tools, including Bayes’ theorem, conjugacy, and sampling, and each is only applicable in certain settings. There is no universally applicable answer.)
(IV) Asymptotically, will we find the right answer? How much data do we need to get a reasonable approximation? (The first step here is to carefully define what we mean by “right answer”. Once we have done so, this leads to an asymptotic theory for Bayesian models, which is every bit as rich and faceted as in the classical case.)
The existence problem (II) is discussed in Section 7.1 below, Bayes’ theorem in Section 7.2, and conjugacy in Section 7.4. Section 7.7 provides a very brief overview and further references on the asymptotics problem.

7.1. Existence of posteriors

The question which can be answered in the most general terms is that of the existence of a posterior: Suppose we have defined a parameter random variable Θ with law Q and values in T, and an observation model M = {P_θ | θ ∈ T} consisting of measures on some space X. We generate an observation as
\[ \Theta \sim Q \qquad X \mid \Theta \sim P_\Theta . \tag{7.2} \]
The observation variable X may be pretty much anything; it may describe a single point, a sequence of length n, a random structure, or an infinitely large sample. The posterior is the probability kernel
\[ q(\, • \,, x) := \mathbb{P}[\Theta \in \, • \, \mid X = x] , \tag{7.3} \]
and our question is simply under which conditions the object q exists and has the properties of a probability kernel (a regular conditional probability, see Appendix C.1). The existence result then follows directly from the standard result on the existence of regular conditional probabilities (Theorem C.2). This depends only on the topological properties of the parameter space; the choice of the model, the prior, and even of the sample space X are irrelevant.

Proposition 7.2. If T is a standard Borel space, X a measurable space, and a Bayesian model is specified as in (7.2), the posterior (7.3) exists. /

See Appendix B.2 for more on standard Borel spaces.

7.2. Bayes’ theorem

The posterior guaranteed by the existence result above is an abstract mathematical object; the proof of existence is not constructive in any practically feasible sense. To actually use a Bayesian model, we have to find a way to compute the posterior from the prior, the observation model and the observed sample. The first such way we discuss is Bayes’ theorem, which is rather generally applicable for parametric Bayesian models, though often not for nonparametric ones.
Suppose we observe a sequence $X_{1:n}$ of observations. If we assume that the sequence is exchangeable, de Finetti’s theorem tells us that the $X_i$ are conditionally i.i.d. given some random probability measure Θ on X. Suppose we know that Θ takes its values in some subset T ⊂ PM(X). To use a more familiar modeling notation, we can define Q := L(Θ) and

\[ P_\theta(\, • \,) := \theta(\, • \,) \quad\text{for all } \theta \in \mathbf{T} . \tag{7.4} \]
The data is then explained as
\[ \Theta \sim Q \qquad X_1, \ldots, X_n \mid \Theta \sim_{iid} P_\Theta . \tag{7.5} \]

If we know Q and $P_\theta$, how do we determine the posterior P[Θ ∈ • |X₁:ₙ]? Bayes’ theorem is the density-based approach to this problem. Recall the basic textbook result on existence of conditional densities (included in the appendix as Lemma C.4): Suppose suitable densities $p(x_{1:n}, \theta)$ of the joint distribution and $p(x_{1:n})$ of the marginal distribution exist. By Lemma C.4, the conditional distribution P[Θ ∈ • |X₁:ₙ] is determined by the density

\[ p(\theta \mid x_{1:n}) = \frac{p(x_{1:n}, \theta)}{p(x_{1:n})} . \tag{7.6} \]

The trick is to specify the density p(x1:n, θ) in its second argument θ with respect to the prior measure Q. Then p(θ|x1:n) is also a density with respect to the prior, and we have

\[ \mathbb{P}[\Theta \in d\theta \mid X_{1:n}] = p(\theta \mid x_{1:n})\, Q(d\theta) . \tag{7.7} \]
Thus, we have obtained a density that allows us to start with the one measure on T that we know—the prior Q—and transform it into the posterior, using a transformation parametrized by the data. Bayes’ theorem, formally stated as Theorem 7.3 below, provides:
(1) A sufficient condition for the relevant densities to exist.
(2) A specific expression for the density $p(\theta|x_{1:n})$ in terms of the density of the model $P_\theta$.

More precisely, it shows that a sufficient condition for the existence of the density (7.7) is the existence of a conditional density p(x|θ) of the observation model: There must be some σ-finite measure µ on X such that

\[ P_\theta(X \in dx) = p(x|\theta)\,\mu(dx) \quad\text{for all } \theta \in \mathbf{T} . \tag{7.8} \]
(This is really rather remarkable, since it provides a condition for the absolute continuity of the posterior with respect to the prior, both measures on T, purely in terms of the observation model on X.) If so, the theorem also shows that
\[ p(\theta \mid x_{1:n}) = \frac{\prod_{i=1}^{n} p(x_i|\theta)}{p(x_1, \ldots, x_n)} , \tag{7.9} \]
and our transformation rule for turning the prior into the posterior is hence
\[ Q[d\theta \mid X_1 = x_1, \ldots, X_n = x_n] = \frac{\prod_{i=1}^{n} p(x_i|\theta)}{p(x_1, \ldots, x_n)}\, Q(d\theta) . \tag{7.10} \]
Identity (7.10) is known as the Bayes equation. Formally stated, Bayes’ theorem looks like this:

Theorem 7.3 (Bayes’ Theorem). Let M = p( • , T) be an observation model and Q ∈ PM(T) a prior. Require that there is a σ-finite measure µ on X such that p( • , θ) ≪ µ for every θ ∈ T. Then the posterior under conditionally i.i.d. observations X₁, ..., Xₙ as in (7.5) is given by (7.10), and P{p(X₁,...,Xₙ) ∈ {0, ∞}} = 0. /

The proof is straightforward: We know from Lemma C.4 that the conditional density is of the form p(x|θ)/p(x). Two things might go wrong:
• We have to verify that the prior dominates the posterior.
• We have to make sure that the quotient p(x|θ)/p(x) exists. Since p(x|θ) is given, we only have to verify that p(x) ∉ {0, ∞} with probability 1.

Proof. It is sufficient to prove the result for n = 1. The probability that p(x|θ)/p(x) does not exist is
\[ \mathbb{P}\{p(X) \in \{0, \infty\}\} = \int_{p^{-1}\{0\}} p(x)\,\mu(dx) + \int_{p^{-1}\{\infty\}} p(x)\,\mu(dx) . \tag{7.11} \]
The first term is simply ∫ 0 dµ = 0. Since p(x|θ) is a conditional density, p(x) is a µ-density of L(X). As the µ-density of a finite measure, it can take infinite values at most on a µ-null set, which means the second term also vanishes. On the other hand,
\[ \mathbb{P}(X \in dx, \Theta \in d\theta) = p(dx, \theta)\,Q(d\theta) = p(x|\theta)\,\mu(dx)\,Q(d\theta) . \tag{7.12} \]
This means that p(x|θ) is a joint density of L(X, Θ) with respect to µ ⊗ Q, which implies L(X, Θ) ≪ µ ⊗ Q. By Lemma C.4, this is sufficient for the existence of a conditional density, which according to (C.10) is given by (7.10). □

7.3. Dominated models

If you have read papers on Bayesian nonparametrics before, you may have noticed that the Bayes equation is not used very frequently in the nonparametric context. The problem is that many Bayesian nonparametric models do not satisfy the conditions of Bayes’ theorem: If P[dθ|X₁:ₙ] is the posterior of a Dirichlet process, for example, then there is no σ-finite measure ν which satisfies P[dθ|X₁:ₙ = x₁:ₙ] ≪ ν for all x₁:ₙ. In particular, the prior does not, and so there is no density p(θ|x₁:ₙ).

Since this type of problem is fairly fundamental to Bayesian nonparametrics, I will discuss it in some more detail in this section, even though it is rather technical by nature. A set M of probability measures is called dominated if there is a σ-finite measure µ on X such that P ≪ µ for every P ∈ M. We then also say that M is dominated by µ. Since the posterior is a conditional, it defines a family of distributions
\[ \mathcal{Q}_n := \{\, \mathbb{P}[\Theta \in \, • \, \mid X_{1:n} = x_{1:n}] \mid x_{1:n} \in \mathbf{X}^n \,\} \tag{7.13} \]
on T. The condition for the Bayes equation to exist is hence that $\mathcal{Q}_n$ is dominated by the prior Q for all sample sizes n, and we can paraphrase Bayes’ theorem as:

If the observation model M is dominated, then $\mathcal{Q}_n$ is dominated by the prior for all n.
To understand the concept of a dominated model better, it is useful to define
\[ \mathcal{N}(\mu) := \{ A \in \mathcal{B}(\mathbf{X}) \mid \mu(A) = 0 \} \quad\text{and}\quad \mathcal{N}(\mathcal{M}) := \bigcap_{\mu \in \mathcal{M}} \mathcal{N}(\mu) . \tag{7.14} \]
N(µ) is the set of all null sets of µ. Informally, think of N(µ) as a pattern in the set of Borel sets; in general, this pattern differs for different measures. Absolute continuity can now be stated as
\[ \nu \ll \mu \quad\Leftrightarrow\quad \mathcal{N}(\nu) \supset \mathcal{N}(\mu) . \tag{7.15} \]
Hence, M is dominated iff there is a σ-finite µ such that N(M) ⊃ N(µ). A dominating measure µ for M has to assign positive mass to every measurable set not contained in N(M). Since N(M) becomes smaller the more different null set patterns M contains, the number of distinct null set patterns in a dominated model must be limited—simply because a σ-finite measure does not have that much mass to spread around. At most, the number of distinct patterns can be countable:

Lemma 7.4 (Halmos and Savage [22]). Every dominated set M ⊂ M(X) of measures has a countable subset M′ such that every µ ∈ M satisfies N(µ) = N(µ′) for some µ′ ∈ M′. /

The lemma does not imply any restrictions on the size of a dominated set, as the next example illustrates.

Example 7.5. Let Gx be the unit-variance Gaussian measure on R with mean x, and δx the Dirac measure at x. The two models

\[ \mathcal{M}_G := \{G_x \mid x \in \mathbb{R}\} \quad\text{and}\quad \mathcal{M}_D := \{\delta_x \mid x \in \mathbb{R}\} \tag{7.16} \]
contain exactly the same number of measures. Lemma 7.4 shows, however, that $\mathcal{M}_D$ is not dominated. The set is an extreme example, since $\mathcal{N}(\mathcal{M}_D) = \emptyset$. A dominating measure thus would have to assign positive mass to every non-empty set and hence could not possibly be σ-finite. In contrast, all measures in $\mathcal{M}_G$ have the same null sets, and $\mathcal{N}(\mathcal{M}_G)$ are precisely the null sets of Lebesgue measure. /

As we have already mentioned above, the Dirichlet process posterior is not dominated by the prior, nor in fact by any σ-finite measure. More generally, that is the case for all priors based on random discrete probability measures. I will explain the argument in two stages, first for the prior, then for any σ-finite measure.

Proposition 7.6. Let $\Theta = \sum_k C_k \delta_{\Phi_k}$ be any random discrete probability measure on $\Omega_\phi$ and require that the joint distribution $\mathcal{L}(\Phi_{1:\infty})$ is non-atomic. Then the posterior of Θ is not dominated by the prior, that is,

\[ \mathcal{Q}_n \not\ll Q \quad\text{for all } n \in \mathbb{N} . \tag{7.17} \]
/

Proof. Suppose the first observation takes value Φ1 = φ. Then we know that Θ takes its value in the (measurable) subset

\[ M_\phi := \{\theta \in \mathrm{PM}(\mathbf{X}) \mid \theta \text{ has an atom at } \phi\} . \tag{7.18} \]

Since $\mathcal{L}(\Phi_{1:n})$ is non-atomic, the prior assigns probability zero to $M_\phi$. We hence have
\[ \mathcal{L}(\Theta)(M_\phi) = 0 \quad\text{and}\quad \mathcal{L}(\Theta \mid \Phi_1 = \phi)(M_\phi) = 1 \tag{7.19} \]
and hence $\mathcal{L}(\Theta \mid \Phi_1 = \phi) \not\ll \mathcal{L}(\Theta)$. For n > 1, we can simply replace φ by $\phi_{1:n}$ in the argument. □

We can generalize this result from the prior to any σ-finite measure on T with only marginally more effort:

Proposition 7.7. Let $\Theta = \sum_k C_k \delta_{\Phi_k}$ be any random discrete probability measure on $\Omega_\phi$ and require that (1) the joint distribution $\mathcal{L}(\Phi_{1:\infty})$ is non-atomic and (2) Θ has an infinite number of non-zero weights $C_k$. Then the family
\[ \mathcal{Q}_\infty := \bigcup_{n \in \mathbb{N}} \mathcal{Q}_n \tag{7.20} \]
of posteriors of Θ for all sample sizes is not dominated by any σ-finite measure, i.e.
\[ \mathcal{Q}_\infty \not\ll \nu \quad\text{for all } \nu \in \mathrm{PM}(\mathbf{T}) . \tag{7.21} \]
/

Proof. Let S be a countable subset of $\Omega_\phi$. Similarly to the above, let
\[ M_S := \Bigl\{ \theta = \sum_k C_k \delta_{\phi_k} \,\Bigm|\, \{\phi_1, \phi_2, \ldots\} = S \Bigr\} . \tag{7.22} \]

For any two distinct countable sets S₁ ≠ S₂ of points in $\Omega_\phi$, the sets $M_{S_1}$ and $M_{S_2}$ of measures are disjoint (unlike the sets $M_\phi$ above, which is why we have changed the definition). For any two sets S₁ ≠ S₂, there exists a finite sequence $\phi_{1:n}$ which can be extended to a sequence in S₁, but not to one in S₂. Hence, $\mathcal{L}(\Theta \mid \Phi_{1:n} = \phi_{1:n})(M_{S_2}) = 0$. Since $\Omega_\phi$ is uncountable, there is an uncountable number of distinct sets S, and by Lemma 7.4, $\mathcal{Q}_\infty$ cannot be dominated. □

7.4. Conjugacy

The most important alternative to Bayes’ theorem for computing posterior distributions is conjugacy. Suppose M is an observation model, and we now consider a family Q ⊂ PM(T) of prior distributions, rather than an individual prior. We assume that the family Q is indexed by a parameter space Y, that is, Q = {Q_y | y ∈ Y}. Many important Bayesian models have the following two properties:
(i) The posterior under any prior in Q is again an element of Q; hence, for any specific set of observations, there is a y′ ∈ Y such that the posterior is $Q_{y'}$.

(ii) The posterior parameter y′ can be computed from the data by a simple, tractable formula.
This is basically what we mean by conjugacy, although—for historical reasons—the terminology is a bit clunky, and conjugacy is usually defined as property (i), even though property (ii) is the one that really matters for inference. For Bayesian nonparametrics, conjugacy is almost all-important: We have already seen in Theorem 2.3 that the DP is conjugate, but most Bayesian nonparametric models at some level rely on a conjugate posterior.
To make things more precise, we require as always that Q is measurable as a subset of PM(T), and that the parametrization y ↦ $Q_y$ is bijective and measurable. The prior parameter y is often called a hyperparameter, and although it can be randomized, we may simply consider it a non-random control parameter.

Definition 7.8. An observation model M ⊂ PM(X) and the family of priors Q are called conjugate if, for any sample size n and any observation sequence $x_{1:n} \in \mathbf{X}^n$, the posterior under any prior Q ∈ Q is again an element of Q. /

The definition captures precisely property (i) above. Since y ↦ $Q_y$ is measurable, we can represent the family Q as a probability kernel

\[ q : \mathbf{Y} \to \mathrm{PM}(\mathbf{T}) \quad\text{where}\quad q(\, • \,, y) = Q_y . \tag{7.23} \]
We can now think of the model parameter—our usual Θ—as a parameterized random variable $\Theta^y$, with law $\mathcal{L}(\Theta^y) = q(\, • \,, y)$ (see Theorem C.3 in the appendix, which also shows that $\Theta^y$ depends measurably on y). If the model is conjugate in the sense of Definition 7.8, then for every prior $Q_y$ in Q and every observation sequence $x_{1:n}$, there is a y′ ∈ Y such that
\[ \mathbb{P}[\Theta^y \in d\theta \mid X_{1:n} = x_{1:n}] = q(d\theta, y') . \tag{7.24} \]
If we define a mapping as $T_n(y, x_{1:n}) := y'$, for the value y′ in (7.24), the posterior is given by
\[ \mathbb{P}[\Theta^y \in d\theta \mid X_{1:n} = x_{1:n}] = q(d\theta, T_n(y, x_{1:n})) . \tag{7.25} \]
Since the posterior depends measurably on y and $x_{1:n}$, and
\[ T_n(y, x_{1:n}) = q^{-1}\bigl(\mathbb{P}[\Theta^y \in \, • \, \mid X_{1:n} = x_{1:n}]\bigr) , \]
the mapping $T_n$ is always measurable. I will refer to the sequence $(T_n)$ of mappings as a posterior index for the model; there is no standard terminology.
All these definitions leave plenty of room for triviality: For any observation model M, the family Q := PM(T) is trivially conjugate, and the identity map on $\mathbf{Y} \times \mathbf{X}^n$ is always a posterior index. The definitions are only meaningful if property (ii) above is satisfied, which means that $T_n$ should be of known, easily tractable form for all n.
In parametric models, the only class of models with interesting conjugacy properties, i.e. for which (i) holds nontrivially and (ii) also holds, are exponential family models and their natural conjugate priors (up to some borderline exceptions). I will define exponential families in detail in the next section; for now, it suffices to say that if M is an exponential family model with sufficient statistic S, and Q the natural conjugate family of priors for M with parameters (λ, y), then the posterior index is well-known to be
\[ T_n(y, x_{1:n}) = \Bigl(\lambda + n,\; y + \sum_{i \le n} S(x_i)\Bigr) \quad\text{for any } n . \tag{7.26} \]

For lack of a better name, I will refer to models with posterior index of the general form (7.26) as linearly conjugate. Linear conjugacy is not restricted to parametric models: If the prior is DP(α, G), we choose y in (7.26) as y := α · G and λ := y($\Omega_\phi$) = α. By Theorem 2.3, the posterior is computed by updating the prior parameters as
\[ (\alpha, G, \phi_1, \ldots, \phi_n) \;\mapsto\; \frac{1}{n+\alpha}\Bigl(\alpha G + \sum_{i=1}^{n} \delta_{\phi_i}\Bigr) , \tag{7.27} \]
and the model is hence linearly conjugate with posterior index
\[ T_n(\alpha, \alpha G, \phi_1, \ldots, \phi_n) := \Bigl(\alpha + n,\; \alpha G + \sum_{i=1}^{n} \delta_{\phi_i}\Bigr) . \tag{7.28} \]
Similarly, it can be shown (although it requires a bit of thought) that the posterior of the GP in Theorem 4.4 is linearly conjugate.
The somewhat confusing nomenclature used for conjugacy in the literature—where conjugacy is defined as property (i), but the desired property is (ii)—is due to the fact that, in the parametric case, (i) and (ii) coincide: If M and Q are parametric families and satisfy (i), then they are (more or less) exponential families, in which case they are linearly conjugate and hence satisfy (ii). This is no longer true in nonparametric models:

Example 7.9. In Chapter 2, we had seen how a homogeneous random probability measure Θ can be generated by generating weights $C_k$ from the general stick-breaking construction (2.15), and then attaching atom locations $\Phi_k$ sampled i.i.d. from a measure G on $\Omega_\phi$. If in particular $\Omega_\phi = \mathbb{R}_+$ or $\Omega_\phi = [a, b]$ (so that it is totally ordered with a smallest element), Θ is called a neutral-to-the-right (NTTR) process¹. It can be shown that the posterior of a NTTR process Θ under observations Φ₁, Φ₂, ... |Θ ∼iid Θ is again a NTTR process [9, Theorem 4.2]. The model is hence conjugate in the sense of Definition 7.8, that is, it is closed under sampling. There is, however, in general no known explicit form for the posterior index, except in specific special cases (in particular the Dirichlet process). /

There are many things we still do not know about conjugacy in the nonparametric case, but basically all important models used in the literature—Dirichlet and Pitman-Yor process models, Gaussian process regression models, beta processes used with IBPs, etc.—are linearly conjugate. It can be shown that linearly conjugate nonparametric models are closely related to exponential family models [47]: Roughly speaking, a nonparametric (and hence infinite-dimensional) prior can be represented by an infinite family of finite-dimensional priors—these are the finite-dimensional marginals that we discussed in Chapter 4 for the Gaussian process (in which case they are multivariate Gaussians). For the Dirichlet process, the finite-dimensional marginals are, unsurprisingly, Dirichlet distributions (see Section 8.9). A nonparametric model is linearly conjugate if and only if the finite-dimensional marginals are linearly conjugate, so linearly conjugate Bayesian nonparametric models can be constructed by assembling suitable families of exponential family distributions into an infinite-dimensional model. I will omit the details here, which are rather technical, and refer to [47].

Remark 7.10 (“Non-conjugate” DP mixtures). You will find references in the literature on DP mixtures and clustering referring to the “non-conjugate” case.

Recall from Section 2.5 that, in order to derive the update equations in a simple Gibbs sampler for DP mixtures, it was convenient to choose the parametric mixture components p(x|φ) as exponential family models and the base measure G of the DP as a conjugate prior. There are hence two levels of conjugacy in this model: Between the DP random measure and its posterior, and between p and G. References to non-conjugate DP mixtures always refer to the case where p and G are not conjugate; samplers for this case exist, but they still rely on conjugacy of the DP posterior. /

7.5. Gibbs measures and exponential families

Arguably the most important class of models in parametric Bayesian statistics are exponential families. These models are special in a variety of ways: They are the only parametric models that admit finite-dimensional sufficient statistics, they are maximum entropy models, and they admit conjugate priors. Exponential family models are specific classes of Gibbs measures, which are defined using the concept of entropy.
We start with a σ-finite measure µ on X, and write P(µ) for the set of all probability measures which have a density with respect to µ,
\[ \mathcal{P}(\mu) := \{ P \in \mathrm{PM}(\mathbf{X}) \mid P \ll \mu \} . \tag{7.29} \]
The elements of P(µ) are often called the probability measures generated by µ. Choose a measure P ∈ P(µ) with density f under µ. The entropy of P is
\[ H(P) := \mathbb{E}_P[-\log f(X)] = -\int_{\mathbf{X}} f(x)\log f(x)\,\mu(dx) . \tag{7.30} \]
Regarded as a functional $H : \mathcal{P}(\mu) \to \mathbb{R}_+$, the entropy is concave.
Gibbs measures are measures which maximize the entropy under an expectation constraint. More precisely, let $S : \mathbf{X} \to \mathbf{S}$ be a measurable function with values in a Banach space S. We fix a value s ∈ S, and ask: Which distribution among all P that satisfy $\mathbb{E}_P[S] = s$ has the highest entropy? This is an equality-constrained optimization problem,
\[ \max_{P \in \mathcal{P}(\mu)} H(P) \quad\text{s.t.}\quad \mathbb{E}_P[S] = s . \tag{7.31} \]
Since H is concave, the problem has a unique solution in P(µ) for every value of s, provided the constraint is satisfiable. We can reformulate the optimization as a Lagrange problem:
\[ \max_{P \in \mathcal{P}(\mu)} H(P) - \langle \theta, \mathbb{E}_P[S] \rangle . \tag{7.32} \]
If $\mathbf{S} = \mathbb{R}^d$, we can simply read this as $H(P) - (\theta_1\mathbb{E}_P[S_1] + \ldots + \theta_d\mathbb{E}_P[S_d])$, i.e. we have d equality constraints and d Lagrange multipliers $\theta_i$. There is nothing particularly special about the finite-dimensional case, though, and θ may be an element

¹ The stick-breaking construction (2.15) was introduced in the early 1970s by Doksum [9], who also showed that the DP is a special case of a NTTR prior. The fact that the DP is obtained by choosing H in (2.14) as a beta distribution, however, was only pointed out several years later by Sethuraman and Tiwari [58], and not published in detail until much later [57]. Doksum [9, Theorem 3.1] also gives a precise characterization of random measures for which stick-breaking constructions of the form (2.15) exist: Θ is NTTR if and only if there is a subordinator (a positive Lévy process) $(Y_t)$ such that $\Theta([0,t]) \overset{d}{=} 1 - \exp(-Y_t)$ for all $t \in \Omega_\phi$.

of an infinite-dimensional Hilbert space with inner product ⟨ • , • ⟩. The Banach space S need not even have a scalar product, however, and in general, θ is an element of the norm dual S* of S, and ⟨ • , • ⟩ in (7.32) is the evaluation functional ⟨x, x*⟩ = x*(x).² Denote by T ⊂ S* the set of θ for which (7.32) has a solution. For every θ ∈ T, the solution is a unique measure $\mu^\theta \in \mathcal{P}(\mu)$. These measures $\mu^\theta$ are called Gibbs measures, and we write
\[ \mathcal{G}(S, \mu) := \{ \mu^\theta \mid \theta \in \mathbf{T} \} . \tag{7.33} \]
Solving the problem explicitly is not completely straightforward, since the entropy is, concavity aside, not a particularly nice functional.³ If $\mathbf{S} = \mathbb{R}^d$ (and hence $\mathbf{S}^* = \mathbb{R}^d$ as well), it can be shown under mild technical conditions that Gibbs measures are given by the densities
\[ \mu^\theta(dx) = Z_\theta^{-1} e^{\langle S(x), \theta\rangle}\,\mu(dx) \quad\text{with}\quad Z_\theta := \mu\bigl(e^{\langle S(\, • \,), \theta\rangle}\bigr) , \tag{7.34} \]
where I am using the “French” notation µ(f) = ∫ f(x) µ(dx). The normalization term $Z_\theta$ is called the partition function in physics.
Recall that a statistic S is called a sufficient statistic for a set M of probability measures if all measures in M have identical conditional distribution given S.

Proposition 7.11. If G(S, µ) has the density representation (7.34), then S is a sufficient statistic for G(S, µ). /

In this case, the model G(S, µ) is also called an exponential family with respect to µ with sufficient statistic S. (In the general Banach space case, terminology is less well established.) The sufficiency result above is of course beaten to death in every introductory textbook, but usually proven in an unnecessarily technical fashion (e.g. invoking Neyman-Pearson), and I spell out a proof here only to clarify that the argument is really elementary.

Proof. Suppose we condition $\mu^\theta$ on S(X) = s (i.e. someone observes X = x but only reports the value s to us). The only remaining uncertainty in X is then which specific point $x \in S^{-1}\{s\}$ has been observed. A look at (7.34) shows that the distribution of X within $S^{-1}\{s\}$ depends on the choice of µ, but not at all on the value of θ. Since µ is the same for all $\mu^\theta$, this means that $\mu^\theta[\, • \, | S(X) = s]$ does not depend on θ, so S is sufficient for G(S, µ). □

² Recall that a Banach space is a vector space X with a norm ‖·‖, which is complete (it contains all limits of sequences when convergence is defined by ‖·‖). Associated with every Banach space is its norm dual X*. Formally, X* is the space of linear mappings X → ℝ, and equipped with the generic norm $\|x^*\| := \sup_{\|x\|=1} x^*(x)$, it is again a Banach space.
The intuition is roughly this: The simplest Banach spaces are Hilbert spaces, such as Euclidean space. We know that, on $\mathbb{R}^d$, every linear functional $\mathbb{R}^d \to \mathbb{R}$ can be represented as a scalar product ⟨x, x*⟩ for some fixed element x* of $\mathbb{R}^d$ itself. This defines a duality between linear functionals on X and points in X, and we can thus identify X* and X. For non-Hilbert X, there is no notion of a scalar product, but we can verify that the duality between elements of X and X* still holds. Even though we can no longer regard X and X* as one and the same space, they are always twins with regard to their analytic properties (dimensionality, separability, etc). It is therefore useful (and customary) to write the mapping x* as x*(x) =: ⟨x, x*⟩, and to think of the operation ⟨ • , • ⟩ as something like a scalar product between elements of two different spaces. (If X is in particular a Hilbert space, it is its own dual, and ⟨ • , • ⟩ is precisely the scalar product.) In this sense, the pair consisting of a Banach space and its dual is almost a Hilbert space, albeit with a slightly split personality.

7.6. Conjugacy in exponential families

Suppose we choose any exponential family model M = G(S, µ) on X. If we sample n observations $x_{1:n}$ i.i.d. from $\mu^\theta$, the joint density is the product of the densities (7.34). It is useful to define
\[ S(x_{1:n}) := S(x_1) + \cdots + S(x_n) \quad\text{and}\quad f_n(s, \theta) := Z_\theta^{-n} e^{\langle s, \theta\rangle} . \tag{7.35} \]

We can then write the joint density of $x_{1:n}$ concisely as
\[ \prod_{i=1}^{n} Z_\theta^{-1} e^{\langle S(x_i), \theta\rangle} = f_n(S(x_{1:n}), \theta) . \tag{7.36} \]
If we choose any prior Q on T, Bayes’ theorem 7.3 is applicable, since G(S, µ) is by definition dominated by µ. Substituting into the Bayes equation (7.10), we see that the posterior is

\[ Q[d\theta \mid X_{1:n} = x_{1:n}] = \frac{f_n(S(x_{1:n}), \theta)}{Q\bigl(f_n(S(x_{1:n}), \, • \,)\bigr)}\, Q(d\theta) . \tag{7.37} \]

The definition of $f_{\, • \,}(\, • \,)$ is not just a shorthand, though, but rather emphasizes a key property of Gibbs measures:
\[ f_n(s, \theta) \cdot f_{n'}(s', \theta) = f_{n+n'}(s + s', \theta) . \tag{7.38} \]
That means in particular that we can define a conjugate prior by using a density with the shape $f_{n'}(s', \theta)$ (i.e. which is equal to f up to scaling, since we still have to normalize f to make it a density with respect to θ). In more detail: If
\[ Q(d\theta) \propto f_{n'}(s', \theta)\,\nu(d\theta) , \tag{7.39} \]
the posterior is, up to normalization,
\[ \mathbb{P}[d\theta \mid X_{1:n} = x_{1:n}] \propto f_n(S(x_{1:n}), \theta)\, f_{n'}(s', \theta)\,\nu(d\theta) = f_{n'+n}(s' + S(x_{1:n}), \theta)\,\nu(d\theta) . \]
Note that this still works if we replace the integer n′ by any positive scalar λ. If we substitute in the definition of f and summarize, we obtain:

Proposition 7.12. Let T be the parameter space of an exponential family model G(S, µ), and ν any σ-finite measure on T. The measures

\[ Q_{\lambda,\gamma}(d\theta) := \frac{e^{\langle\gamma,\theta\rangle - \lambda\log Z_\theta}}{\nu\bigl(e^{\langle\gamma,\, • \,\rangle - \lambda\log Z_{\, • \,}}\bigr)}\;\nu(d\theta) \quad\text{for } \lambda > 0,\ \gamma \in \mathbf{S} \tag{7.40} \]
form a conjugate family of priors for G(S, µ) with posterior index

\[ T_n((\lambda, \gamma), x_{1:n}) = (\lambda + n,\ \gamma + S(x_{1:n})) . \tag{7.41} \]
/
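To see the posterior index (7.41) at work, consider the Bernoulli family, for which S(x) = x. A short computation (mine, not spelled out in the text) identifies the natural conjugate prior $Q_{\lambda,\gamma}$ with a Beta(γ, λ − γ) distribution on the mean parameter, so (7.41) reproduces the familiar Beta update. A sketch in Python:

```python
import numpy as np

def posterior_index(lam, gamma, x):
    """Linear conjugate update (7.41): (lambda, gamma) -> (lambda + n, gamma + sum S(x_i))."""
    return lam + len(x), gamma + np.sum(x)   # Bernoulli sufficient statistic: S(x) = x

# Assumption (my bookkeeping): Q_{lambda,gamma} corresponds to Beta(gamma, lambda - gamma).
lam, gamma = 2.0, 1.0                        # i.e. Beta(1, 1): the uniform prior
rng = np.random.default_rng(5)
x = rng.binomial(1, 0.7, size=100)
lam_n, gamma_n = posterior_index(lam, gamma, x)
a, b = gamma_n, lam_n - gamma_n              # posterior Beta(a, b)
print("posterior mean:", a / (a + b))        # close to 0.7 for large n
```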

³ The straightforward way to solve the Lagrange problem would be to maximize (7.32) with respect to P. Again, there is nothing special about the finite-dimensional case, since we can define directional derivatives in a general Banach space just as in $\mathbb{R}^d$, so analytic minimization is in principle possible. (The direct generalization of the vector space derivative from $\mathbb{R}^d$ to a Banach space is called the Fréchet derivative, see [38, Chapter 7 & 8] if you want to learn more.) That does not work for the entropy, though: Even on a simple domain (say, distributions on [0, 1]), the entropy is concave and upper semi-continuous [e.g. 59, Chapter I.9], but nowhere continuous in the weak topology. In particular, it is not differentiable. There is to the best of my knowledge also no generally known form for the Fenchel-Legendre conjugate, and variational calculus has to be invoked as a crutch to solve the optimization problem where possible.

The measures $Q_{\lambda,\gamma}$ are called the natural conjugate priors for the model G(S, µ), and we write
\[ \mathcal{G}^\circ(S, \mu, \nu) := \{ Q_{\lambda,\gamma} \mid \lambda > 0, \gamma \in \mathbf{S} \} . \tag{7.42} \]
Clearly, $\mathcal{G}^\circ(S, \mu, \nu)$ is itself an exponential family. More specifically,
\[ \mathcal{G}^\circ(S, \mu, \nu) = \mathcal{G}(S^\circ, \nu) \quad\text{where}\quad S^\circ(\theta) := (\theta, \log Z_\theta) . \tag{7.43} \]
Useful tables of the most important models and their conjugate priors—Gaussian and Gauss-Wishart, multinomial and Dirichlet, Poisson and gamma, etc.—can be found in many textbooks, or simply on Wikipedia.

Remark 7.13. Another way to describe the derivation above is as follows: We start with the sampling distribution $\mathcal{L}_\theta(X)$ and assume there is a sufficient statistic S. Since a sufficient statistic completely determines the posterior, we can forget all information in X not contained in S(X) and pass to the image measure $\mathcal{L}_\theta(S(X_{1:n}))$. Suppose $p_n(s|\theta)$ is a density of this image measure under some measure µ(dx). A natural conjugate prior is then obtained by choosing some measure ν(dθ) and re-normalizing $p_n(s|\theta)$ such that it becomes a density in θ, viz.

\[ q(\theta \mid n, s) := \frac{p_n(s|\theta)}{\int p_n(s|\theta)\,\nu(d\theta)} . \tag{7.44} \]
This program only works out nicely in the exponential family case, since exponential family models are basically the only models which admit a finite-dimensional sufficient statistic (“basically” because there are some borderline cases which are almost, but not quite, exponential families). Since the resulting prior family is of the form q(θ|n, s), it is parametric only if $\mathcal{L}_\theta$ is an exponential family. /

The argument in Remark 7.13 is in fact how natural conjugate priors were originally defined [53]. This perspective is helpful to understand the relationship between conjugate pairs: For example, the conjugate prior for the variance of a univariate Gaussian is a gamma distribution. The relevant sufficient statistic for the variance is S(x) := x², and a simple application of the integral transformation theorem shows that the density of S(X) has the shape of a gamma if X is Gaussian. By regarding this function as a density for θ (i.e. by normalizing with respect to θ), we obtain the actual gamma density. Remarkably, the standard sufficient statistic of the Poisson also yields a gamma distribution, see Section A.2.

7.7. Posterior asymptotics, in a cartoon overview

The theory of posterior asymptotics and posterior concentration is one of the few parts of Bayesian nonparametrics on which there is a fairly coherent literature, and I will not attempt to cover this topic in any detail, but rather refer to better and more competent descriptions than I could possibly produce. The forthcoming book [18], once published, will no doubt be the authoritative reference for years to come. In the meantime, [17] and the lecture notes [33] may be good places to start. In a nutshell, the two main questions addressed by the theory of posterior asymptotics are:
(1) Consistency: In the limit of infinite sample size, does the posterior concentrate at the correct value of Θ?
(2) Convergence rates: How rapidly does the posterior concentrate, i.e. how much data do we need in order to obtain a reliable answer?

The meaning of consistency depends crucially on how we define what the “correct value” of Θ is. Recall that in frequentist statistics, we assume there exists a true distribution P0 on the sample space that accounts for the data. If our model includes this distribution, i.e. if P0 ∈ M, we say that the model is correctly specified. If not, M is misspecified. Here is my stick figure depiction of a model M = {Pθ|θ ∈ T} as a subset of the space of probability measures PM(X):

[Figure: the model M drawn as a subset of PM(X), with one point P₀ = P_{θ₀} inside the model (correctly specified) and another P₀ outside the model (misspecified).]

Still in the frequentist context, an estimator for the model parameter θ is a function $\hat\theta_n(x_{1:n})$ of the sample (or, more precisely, a family of functions indexed by n, since the number of arguments changes with sample size). If the model is correctly specified, there is some true parameter value θ₀ satisfying $P_{\theta_0} = P_0$. We say that $\hat\theta_n$ is a consistent estimator for the model M if, for every θ₀ ∈ T,
\[ \hat\theta_n(X_{1:n}) \;\xrightarrow{n\to\infty}\; \theta_0 \quad\text{almost surely} \tag{7.45} \]
if $X_1, X_2, \ldots \sim_{iid} P_{\theta_0}$.

Consistency of Bayesian models. In the Bayesian setup, we model Θ as a random variable, which complicates the definition of consistency in two ways:
• The obvious complication is that, instead of an estimator with values in T, we are now dealing with a posterior distribution on T, and we hence have to define convergence in terms of where on T the posterior concentrates as the sample size grows.
• A somewhat more subtle point is that, in order to ask whether the posterior concentrates at the correct value of Θ, we have to define what we mean by “correct value”. It turns out that the precise choice of this definition has a huge impact on the resulting consistency properties.

Suppose we assume the observations X₁, X₂, ..., from which the posterior is computed, are actually sampled from a given Bayesian model. A natural consistency requirement then seems to be that the posterior should concentrate at that value of the parameter which generated the data:

Definition 7.14. A Bayesian model with parameter space T, observation model M = {P_θ | θ ∈ T} and prior distribution Q is consistent in the Bayesian sense if, for observations generated as
\[ \Theta \sim Q \qquad X_1, X_2, \ldots \mid \Theta \sim_{iid} P_\Theta , \tag{7.46} \]
the posterior satisfies

\[ Q[\, • \mid X_{1:n}] \;\xrightarrow[n\to\infty]{\text{weakly}}\; \delta_{\Theta(\omega)} \quad Q\text{-almost surely} . \tag{7.47} \]
/

It is tempting to call this “weak consistency”, but unfortunately, some authors call it “strong consistency”. The terminology “consistent in the Bayesian sense” is a crutch I invented here—there does not seem to be a universally accepted nomenclature yet.
There is an almost universal consistency result, known as Doob’s theorem: If a Bayesian model on a standard Borel space is identifiable, i.e. if the parametrization mapping θ ↦ $P_\theta$ is bimeasurable, then the model is consistent in the sense of Definition 7.14. (See e.g. [67, Theorem 10.10] for a precise statement.)
The problem with this notion of consistency is that its statement holds only up to a null set under the prior distribution. At second glance, this assumption seems almost brutal: Suppose we choose our prior as a point mass, i.e. we pick a single distribution P₀ on X and set Q := $\delta_{P_0}$. The model is then M = {P₀}. The posterior is again always $\delta_{P_0}$; it does not even take into account what data we observe. Yet, the model is identifiable and hence consistent in the sense of Definition 7.14. Thus, we can obtain a universal consistency result for any model—if the given form of the model is not identifiable, we only have to reparametrize it in a sensible manner—but the price to pay is to explain away the rest of the universe as a null set.
Another way of defining consistency is by disentangling the data source and the prior: We assume the data is generated by a data source described by an unknown distribution P₀. The data analyst specifies a prior Q, and we ask whether the corresponding posterior will asymptotically converge to the correct distribution P₀ almost surely under the distribution of the data.

Definition 7.15. A Bayesian model with parameter space T, observation model M = {P_θ | θ ∈ T} and prior distribution Q is consistent in the frequentist sense if, for every θ₀ ∈ T and observations generated as

X1,X2,... ∼iid Pθ0 , (7.48) the posterior satisfies

\[ Q[\, • \mid X_{1:n}] \;\xrightarrow[n\to\infty]{\text{weakly}}\; \delta_{\theta_0} \quad P_{\theta_0}\text{-almost surely} . \tag{7.49} \]
/

With this notion of consistency, Bayesian models can be inconsistent and converge to completely wrong solutions. Since there is no longer a one-size-fits-all result, the asymptotic theory of Bayesian models becomes much richer, and actually establishing that a model is consistent for a given problem is a much stronger statement than consistency in the sense of Doob’s theorem. See [33] for example results.

Convergence rates. Consistency is a purely asymptotic property, and instead of asking only whether we would find the right solution in the infinite limit of an asymptotically large sample, we can additionally ask how much data is required to get reasonably close to that solution—in other words, how rapidly the posterior distribution concentrates with increasing sample size. 7.7. POSTERIOR ASYMPTOTICS, IN A CARTOON OVERVIEW 69

Quantifying concentration is a somewhat technical problem, but the basic idea is very simple: To measure how tightly the posterior Q[ • |X₁,...,Xₙ] concentrates around θ₀, we place a ball $B_{\varepsilon_n}(\theta_0)$ of radius εₙ around θ₀. Basically, we want the posterior mass to concentrate inside this ball, but of course even a posterior that concentrates more and more tightly around θ₀ may still spread some small fraction of its mass over the entire space. We therefore permit a small error τ > 0, and require only that

Q[Bεn (θ0)|X1,...,Xn] > 1 − τ Pθ0 -almost surely . (7.50) If we do not want the choice of τ to be a liability, we have to require that this holds for any τ ∈ (0, 1), although we can of course choose εn according to τ. For a given τ, we can then write εn(τ) for the smallest radius satisfying (7.50). If the posterior indeed concentrates as sample size grows, we obtain a sequence of shrinking radii

εn(τ) > εn+1(τ) > . . . (7.51) describing a sequence of shrinking, concentric balls around θ0:

[Figure: concentric balls of shrinking radii εₙ > εₙ₊₁ around θ₀.]

We have thus reduced the problem from quantifying the convergence of a sequence of probability measures (the posteriors for different values of n) to the simpler problem of quantifying convergence of a sequence of numbers (the radii εₙ). A typical convergence rate result expresses the rate of convergence of the sequence (εₙ) by means of an upper bound, formulated as a function of the sample size:

∀τ > 0 : εn(τ) < c(τ)f(n) . (7.52) The function f is called a rate. It would of course be possible to derive results in this form for one specific Bayesian model (i.e. a specific prior distribution Q). It is usually more interesting, however, to instead obtain results for an entire class of models—both because it makes the result more generally applicable, and because it explicitly shows how the convergence rate depends on the complexity of the model. As we know from other statistical models, we have to expect that fitting complicated models requires more data than in simple ones. The function f hence has to take into account the model complexity, and convergence rate results are of the form

\[ \forall \tau > 0 : \quad \varepsilon_n(\tau) < c(\tau)\, f(n, \text{model complexity}) . \tag{7.53} \]
Quantifying model complexity is another non-trivial question: In parametric models, we can simply count the number of degrees of freedom of the model (the number of effective dimensions of the parameter space), but for infinite-dimensional parameter spaces, more sophisticated tools are required. Empirical process theory and statistical learning theory provide an arsenal of such complexity measures (such as covering numbers, metric entropies, VC dimensions, etc.), and many of these tools are also applied to measure model complexity in Bayesian nonparametrics. Once again, I refer to [33] for more details.

CHAPTER 8

Random measures

In this chapter, we will discuss random measures in more detail—both random probability measures and general random measures. Random measures play a fundamental role in Bayesian statistics: Whenever a data source is modeled as an exchangeable sequence, de Finetti’s theorem tells us that there is some random measure Θ such that the observations are explained as
\[ \Theta \sim Q \qquad X_1, \ldots, X_n \mid \Theta \sim_{iid} \Theta . \tag{8.1} \]
Although we can assume a parametric model, Θ is in general an infinite-dimensional quantity, and historically, Bayesian nonparametric priors were originally conceived to model Θ directly—the Dirichlet process was proposed in [12] as a prior distribution for Θ.
That is not quite how Bayesian nonparametric models are used today. Instead, one usually uses the more familiar approach of splitting Θ into a “likelihood” component (a sampling distribution) $P_t$ and a random variable T with law $Q^T$, viz.

\[ \Theta = P_T , \tag{8.2} \]
so that the generative model is
\[ T \sim Q^T \qquad X_1, \ldots, X_n \mid T \sim_{iid} P_T . \tag{8.3} \]
There are arguably two main reasons why this approach is preferred:
(1) Tractable random measure priors (like the Dirichlet process) turn out to generate discrete measures, which are not useful as models of Θ (Theorem 2.3 says that if the DP is used directly in this way, the posterior simply interpolates the empirical distribution of the data with a fixed distribution G).
(2) Splitting Θ into a likelihood $P_t$ and a random pattern T—and possibly splitting T further into a hierarchical model—has proven much more useful than modeling Θ “monolithically”. One reason is tractability; another is that a suitably chosen T often provides a more useful summary of the data source than Θ.
We will briefly discuss some ways in which random measures are actually deployed in Section 8.1. The lion’s share of this chapter is then devoted to random discrete measures; the theory of such measures revolves around the Poisson process. Constructing priors on smooth random measures is technically much more challenging: Smoothness requires long-range dependencies, which make the mathematical structure of such models much more intricate, and introduce coupling in the posterior that usually defies our mathematical toolkit. Non-discrete random measures are briefly discussed in Section 8.9.

71 72 8. RANDOM MEASURES

8.1. Sampling models for random measure priors

When random discrete measures are used in Bayesian models, it is usually in a model of the form (8.3). Important examples are:
(i) In clustering models, T itself is a random probability measure; the weights parametrize a random partition, the atom locations serve as parameters for the (parametric) distributions of individual clusters.
(ii) In latent feature models, a (non-normalized) random measure can be used to generate the latent binary matrix describing assignments to overlapping clusters (as described in Section 3.4).
(iii) Dirichlet process mixtures (and other nonparametric mixtures) can be used in density estimation problems. In this case, the models (8.3) and (8.1) coincide: T is the random mixture density p(x) in (2.18), which is interpreted as the density of Θ in (8.1).
(iv) In survival analysis, T is a hazard rate. A random measure on ℝ₊ can be integrated to obtain random cumulative distribution functions, which in turn can be used to model hazard rates. I will not discuss this approach in detail, see e.g. [23, 34].
For most of the models we consider, including cases (i)–(iii) above, we need to consider two ways of sampling from a random measure (see the sketch at the end of this section):
• The multinomial sampling model: $\hat\xi$ is a random probability measure and we sample $\Phi_1, \Phi_2, \ldots \mid \hat\xi \sim_{iid} \hat\xi$.
• The Bernoulli sampling model: $\xi = \sum_{k\in\mathbb{N}} C_k \delta_{\Phi_k}$ is a (usually non-normalized) random measure with $C_k \in [0,1]$, and we sample a random matrix Z, for each column k, as $Z_{1k}, Z_{2k}, \ldots \mid \xi \sim \text{Bernoulli}(C_k)$.
Again, I am making up terminology here—the multinomial model, by far the most important case, is usually not discussed explicitly. The Bernoulli model was named and explicitly described by Thibaux and Jordan [66], who called it a Bernoulli process.

Multinomial sampling. The sampling model used with basically all random probability measure priors—the Dirichlet process, normalized completely random measures, the Pitman-Yor process, etc.—in one way or another is of the form
\[ \hat\xi \sim Q^\xi \qquad \Phi_1, \Phi_2, \ldots \mid \hat\xi \sim_{iid} \hat\xi . \tag{8.4} \]
Different applications of this sampling scheme may look very different, depending on which level of the hierarchy the random measure occurs at—if the $\Phi_i$ are observed directly, we have a model of the form (8.1); in a DP mixture clustering model, the $\Phi_i$ are unobserved cluster parameters, etc.

Notation 8.1. If ν is a measure on some space $\Omega_\phi$, and I := (A₁,...,Aₙ) a partition of $\Omega_\phi$ into measurable sets, I will use the notation

ν(I) := (ν(A1), . . . , ν(An)) (8.5) throughout the remainder of this chapter. /

I call the sampling scheme (8.4) “multinomial” since, if I := (A₁,...,Aₙ) is a finite partition of $\Omega_\phi$, then
\[ \mathbb{P}[\Phi \in A_j \mid \hat\xi] = \hat\xi(A_j) , \tag{8.6} \]
and the random index J ∈ [n] defined by Φ ∈ $A_J$ is multinomially distributed with parameter $\hat\xi(I)$. It can in fact be shown, with a bit more effort, that sampling Φ ∼ ν, for any probability distribution ν on $\Omega_\phi$, can be regarded as a limit of the sampling procedure J ∼ Multinomial(ν(I))—roughly speaking, for n → ∞ and |Aᵢ| ↘ 0 (see [47] for details).
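Here is the sketch announced above, contrasting the multinomial and Bernoulli sampling models for a truncated random discrete measure (Python with NumPy; reusing stick-breaking weights for the Bernoulli model is purely illustrative—in practice those weights would come e.g. from a beta process):

```python
import numpy as np

rng = np.random.default_rng(6)

# A truncated random discrete measure xi = sum_k C_k delta_{Phi_k}:
# weights from DP stick-breaking, atoms i.i.d. from a standard normal base measure.
V = rng.beta(1.0, 2.0, size=50)
C = V * np.cumprod(np.concatenate(([1.0], 1.0 - V[:-1])))
Phi = rng.normal(size=50)

# Multinomial sampling model (8.4): Phi_1, Phi_2, ... | xi ~iid xi
# (normalizing the truncated weights so they sum to one).
idx = rng.choice(50, size=10, p=C / C.sum())
samples = Phi[idx]

# Bernoulli sampling model: row i switches feature k on with probability C_k.
Z = (rng.uniform(size=(10, 50)) < C[None, :]).astype(int)
```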

8.2. Random discrete measures and point processes

We now focus on random discrete measures, i.e. ξ is of the form
\[ \xi(\, • \,) = \sum_{k\in\mathbb{N}} C_k \delta_{\Phi_k}(\, • \,) , \tag{8.7} \]
where the weights $C_k \in \mathbb{R}_+$ may or may not sum to 1. We will use three different representations of such random measures: In terms of two sequences $(C_k)$ and $(\Phi_k)$, as we have done so far; in terms of a random cumulative distribution function representing the weights $(C_k)$; and in terms of a point process.
Recall that a point process on a space X is a random, countable collection of points in X. In general, points can occur multiple times in a sample, and a general point process is hence a random countable multiset of points in X. If the multiset is a set, i.e. if any two points in a sample are distinct almost surely, the point process is called simple. In other words, a simple point process is a random variable
\[ \Pi : \Omega \to 2^{\mathbf{X}} \quad\text{where}\quad \Pi(\omega) \text{ is countable a.s.} \tag{8.8} \]
We are only interested in the simple case in the following. If Π is a point process, we can define a random measure by simply counting the number of points that Π places in a given set A:
\[ \xi(A) := |\Pi \cap A| = \sum_{X \in \Pi} \delta_X(A) . \tag{8.9} \]

If we enumerate the points in the random set Π as X₁, X₂, ..., we can read this as a random measure $\xi = \sum_k C_k \delta_{X_k}$ with weights $C_k = 1$ (since the point process is simple). Such a measure is called a random counting measure.
The random discrete measures we have used as priors have scalar weights $C_k$. We can generate such scalar weights using a point process by throwing points onto ℝ₊. A discrete random measure on $\Omega_\phi$ with non-trivial weights can then be defined using a simple point process Π on $\mathbf{X} := \mathbb{R}_{\ge 0} \times \Omega_\phi$ as
\[ \xi(\, • \,) := \sum_{(C,\Phi)\in\Pi} C \cdot \delta_\Phi(\, • \,) . \tag{8.10} \]

8.3. Poisson processes

By far the most important point process is the Poisson process, for which the number of points in any fixed set is Poisson-distributed. (Recall the Poisson distribution (A.2) from Appendix A.)

Definition 8.2. Let µ be a measure on a Polish space X. A point process $\Pi^\mu$ on X is called a Poisson process with parameter µ if
\[ |\Pi^\mu \cap A| \;\perp\!\!\!\perp\; |\Pi^\mu \cap (\mathbf{X}\smallsetminus A)| \tag{8.11} \]

The Poisson process is explained for sets A of infinite measure µ(A) by “slicing up” µ into a countable number of finite components: We require X µ = µn for some measures µn with µn(X) < ∞ . (8.13) n∈N Perhaps the best way to illustrate the definition of the Poisson process is to provide an explicit sampling scheme:

Theorem 8.3 (Sampling a Poisson process). Let µ be a measure on a standard Borel space X. If µ is non-atomic and satisfies (8.13), the Poisson process Π^µ on X exists, and can be sampled as follows:

(1) If µ(X) < ∞, then Π^µ =_d {X1, ..., XN}, where

    N ∼ Poisson(µ(X))   and   X1, ..., XN ∼iid µ/µ(X) .   (8.14)

(2) If µ(X) = ∞, then

    Π^µ = ∪_{n∈N} Π^{µn} .   (8.15)
/
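Part (1) of Theorem 8.3 translates directly into code. The following sketch assumes a finite parameter measure, given by its total mass together with a routine for sampling from the normalized measure µ/µ(X); the uniform example at the end is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_poisson_process(total_mass, sample_normalized, rng):
    """Sample a Poisson process with finite parameter measure mu, as in (8.14):
    N ~ Poisson(mu(X)), then N i.i.d. points from mu / mu(X)."""
    N = rng.poisson(total_mass)
    return sample_normalized(N)

# Example: mu = 4 * Uniform[0, 1], so mu(X) = 4 and mu/mu(X) = Uniform[0, 1].
points = sample_poisson_process(4.0, lambda n: rng.random(n), rng)
```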

The theorem immediately implies several useful properties:

Corollary 8.4. Let µ be a measure and (νn) a sequence of measures, all of which are non-atomic and satisfy (8.13). Let φ : X → X be a measurable mapping. Then the following holds:

    φ(Π^µ) = Π^{φ(µ)}   if µ is σ-finite .   (8.16)
    Π^µ ∩ A = Π^{µ( • ∩ A)}   for any set A ∈ B(X) .   (8.17)
    ∪_{n∈N} Π^{νn} = Π^{Σ_n νn} .   (8.18)
/

Informally speaking, Theorem 8.3 shows that sampling from a Poisson process with parameter µ is "almost" i.i.d. sampling from µ, but sampling remains well-defined even if µ is infinite and cannot be normalized to a probability measure. To obtain a coherent generalization of sampling to the case µ(X) = ∞, we consider the entire point set as a single draw, rather than drawing points individually. To substitute for the independence property between separate i.i.d. draws, we now need an independence property that holds within the sample; this is given by the independence (8.11) of the process between disjoint sets. Note that (8.11) implies the total number of points must be random: If we were to posit a fixed number n of samples, then observing k samples in X∖A would imply |Π ∩ A| = n − k, so the numbers of points in A and X∖A would not be independent.

Complete randomness. Since (8.11) holds for any measurable set A, it implies the numbers of points in any two disjoint sets are independent. In the point process literature, this property is known as complete randomness or pure randomness.

Definition 8.5. A point process Π is called completely random if

    (Π ∩ A) ⊥⊥ (Π ∩ A′)   (8.19)

for every pair of disjoint measurable sets A, A′ ∈ B(X). /

The Poisson process is completely random by definition. A rather baffling fact, however, is that only the Poisson process is completely random.

Proposition 8.6. If a simple point process Π on an uncountable standard Borel space X is completely random, the set function defined by the expected number of points per set,

    µ(A) := E[|Π ∩ A|]   for all Borel sets A,   (8.20)

is a non-atomic measure on X. If µ is σ-finite, Π is a Poisson process with parameter µ. /

This illustrates why the Poisson process is of such fundamental importance: As we discussed above, we can think of complete randomness as the point process analogue of i.i.d. sampling. In this sense, it is perhaps more accurate to think of the Poisson process as a sampling paradigm (similar to i.i.d. sampling from µ), rather than as a model with parameter µ. I will give a proof of Proposition 8.6 here, but the main purpose of doing so is to clarify that it is absolutely elementary.

Proof. For the proof, I will abbreviate the (random) number of points in a set A as N(A) := |Π ∩ A|. The first step is to show that µ in (8.20) is a measure. Let A1, A2, ... be a (possibly infinite) sequence of mutually disjoint sets. Disjointness implies

    N(∪i Ai) = Σi N(Ai) .   (8.21)

Since the sets are disjoint and Π is completely random, the random variables N(Ai) are mutually independent and their expectations hence additive, so

    µ(∪i Ai) = E[N(∪i Ai)] = Σi E[N(Ai)] = Σi µ(Ai) .   (8.22)

Thus, µ is countably additive. Since also µ(∅) = 0 and µ(A) ≥ 0, it is indeed a measure on X. Since points are distinct almost surely, µ({x}) = 0 for any x ∈ X, i.e. µ is non-atomic.

To show that Π is a Poisson process with parameter µ, we have to show that N(A) ∼ Poisson(µ(A)) holds for any Borel set A. What is the probability of observing precisely k points in A? Since µ is non-atomic, we can subdivide A into a partition (A1, ..., An) of measurable sets with

    µ(A1) = ... = µ(An) = µ(A)/n ,   (8.23)

for any n ∈ N. Rather than counting points in each set Ai, we can greatly simplify matters by only distinguishing whether a set contains points or not, by defining

    I_{Ai} := 1 if N(Ai) > 0 ,   I_{Ai} := 0 if N(Ai) = 0 .   (8.24)

Since each I_{Ai} is binary, it is a Bernoulli variable with some success probability p_{in}. As the points of Π are distinct, each point is contained in a separate set Ai for n sufficiently large, and so

    lim_n Σ_{i=1}^n I_{Ai} = N(A) .   (8.25)

For a Bernoulli variable, the success probability p_{in} is just the expectation E[I_{Ai}]. If n is large enough that points are in separate intervals, we have E[I_{Ai}] → E[N(Ai)], or more formally,

    lim_n p_{in}/E[N(Ai)] = lim_n p_{in}/(µ(A)/n) = 1 ,   (8.26)

so for sufficiently large n, we can assume that I_{Ai} has success probability µ(A)/n.

By complete randomness, the variables I_{Ai} are independent, so the number of successes is binomially distributed,

    P{I_{A1} + ... + I_{An} = k} = (n choose k) (µ(A)/n)^k (1 − µ(A)/n)^{n−k} .   (8.27)

In the limit n → ∞, as we have argued above, this converges to the probability of observing k points of Π in A. A few lines of arithmetic show that

    lim_{n→∞} (n choose k) (µ(A)/n)^k (1 − µ(A)/n)^{n−k} = (µ(A)^k / k!) lim_n (1 − µ(A)/n)^n .   (8.28)

We know from basic calculus that (1 − a/n)^n → e^{−a}, so we obtain

    P{N(A) = k} = e^{−µ(A)} µ(A)^k / k! ,   (8.29)

and Π is indeed a Poisson process. □
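The binomial-to-Poisson limit at the heart of this proof is easy to check numerically; the following sketch compares (8.27) for growing n with the Poisson limit (8.29), using scipy's pmf routines.

```python
from scipy.stats import binom, poisson

mu_A, k = 3.0, 2
for n in [10, 100, 10_000]:
    # P{ I_{A_1} + ... + I_{A_n} = k } with success probability mu(A)/n
    print(n, binom.pmf(k, n, mu_A / n))
print("limit", poisson.pmf(k, mu_A))  # e^{-mu(A)} mu(A)^k / k!
```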

8.4. Total mass and random CDFs

If the random measure ξ is not normalized, the mass ξ(Ωφ) it assigns to the entire space is in general a non-negative random variable. We denote this variable

    Tξ := ξ(Ωφ) = Σ_k Ck   (8.30)

and call it the total mass of ξ. This variable may carry a lot of information about ξ; indeed, for the most important class of non-normalized random discrete measures—homogeneous Poisson random measures, which we define below—the distribution of Tξ completely determines the distribution of the weights Ck. For the purposes of Bayesian nonparametrics, we are only interested in the case where Tξ is almost surely finite. Recall the applications of random measures we have seen so far; they suggest three possible uses for ξ:

(1) ξ is a random probability measure (so Tξ = 1 almost surely).
(2) ξ is not normalized, but we use it to define a random probability measure by dividing by its total mass. That requires Tξ < ∞ a.s.
(3) ξ is used in a latent feature model as in Sec. 3.4. We already argued in Section 3.4 that this also requires Tξ < ∞ a.s.

We are also generally only interested in homogeneous random measures (recall: the atoms Φi are i.i.d. and independent of the weights). For all that follows, we hence make the following

General assumption 8.7. All random measures we discuss are assumed to be homogeneous and to have finite total mass almost surely. If ξ is homogeneous, we can assume without loss of generality that

    Ωφ = [0, 1]   and   Φ1, Φ2, ... ∼iid Uniform[0, 1] .   (8.31)

That is possible because the weights do not depend on the atom locations—to turn ξ into a random measure on an arbitrary Polish space Ωφ instead of [0, 1], we can simply replace the uniform atoms by other i.i.d. random variables. All non-trivial structure in ξ is encoded in the weight sequence. One great advantage of this representation is that measures on [0, 1] can be represented by their cumulative distribution functions. A homogeneous random measure ξ on [0, 1] can hence be represented by the random CDF

Fξ(φ) := ξ([0, φ]) . (8.32) The total mass of ξ is then

Tξ =a.s. Fξ(1) . (8.33)

Since ξ is discrete, the random function Fξ is piece-wise constant and, as a CDF, non-decreasing, for example:

[Figure: a sample path of the random CDF Fξ(φ) on [0, 1]—a piece-wise constant, non-decreasing step function with jumps of heights C1, C2, C3 at the atom locations Φ1, Φ2, Φ3, and terminal value Fξ(1) = Tξ.]

We can alternatively regard this as the path of a real-valued, non-decreasing, piece-wise constant stochastic process on [0, 1]. Those are a lot of adjectives, but processes of this type are particularly easy to handle mathematically—roughly speaking, because all action happens at a countable number of points (the jumps), and since non-decreasingness can be enforced in a completely local manner by requiring the jumps to be non-negative, without introducing complicated long-range dependencies between different points of the path.
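A minimal sketch of this representation: with uniform atoms and an arbitrary weight sequence (here gamma-distributed, purely for illustration), the random CDF (8.32) is just a weighted count of the atoms to the left of φ.

```python
import numpy as np

rng = np.random.default_rng(2)

# Truncated homogeneous random measure on [0, 1]: i.i.d. uniform atoms,
# weights drawn from a gamma distribution as a stand-in.
K = 50
Phi = rng.random(K)                 # atom locations, i.i.d. Uniform[0, 1]
C = rng.gamma(0.1, 1.0, size=K)     # stand-in weights

def F_xi(phi):
    """Random CDF F_xi(phi) = xi([0, phi]): total weight of atoms <= phi."""
    return C[Phi <= phi].sum()

T_xi = F_xi(1.0)                    # total mass, cf. (8.33)
```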

8.5. Infinite divisibility and subordinators

Suppose we want to define a homogeneous random measure ξ; as we have noted in the previous section, we are interested in measures with finite total mass Tξ. Recall our basic design principle that components of our random objects should be as independent from each other as possible. Since ξ is homogeneous, the atom locations are i.i.d., but we cannot sample the weights Ck i.i.d.; if we do, Tξ = ∞ holds almost surely (infinite sums of positive i.i.d. variables have to diverge, by Borel-Cantelli). How much independence between the weights Ck can we get away with? Arguably the next best thing to i.i.d. weights Ck would be to make Tξ infinitely divisible: Recall that a scalar random variable T is called infinitely divisible if, for every n ∈ N, there is a probability measure µn such that

    L(T) = µn ∗ ... ∗ µn   (n-fold convolution).

In other words, for every n, there exists a random variable T^(n) with L(T^(n)) = µn such that T can be represented as the sum of n i.i.d. copies of T^(n),

    T =_d Σ_{i=1}^n T_i^(n) .   (8.34)

A random variable can be infinitely divisible and almost surely finite—gamma and Gaussian variables are both infinitely divisible. Thus, although we cannot represent T as an infinite sum of i.i.d. variables, we can subdivide it into any finite number of i.i.d. components.

It turns out that, if we assume the total mass Tξ of a homogeneous random measure is infinitely divisible, then its CDF Fξ is necessarily a special type of Lévy process called a subordinator. Recall that we can think of a real-valued stochastic process F on [0, 1] as a random function F : [0, 1] → R. We have already discussed random functions in Chapter 4, where we were particularly interested in distributions on continuous functions. The CDFs of discrete measures jump at a countable number of points, so we now require instead that F is almost surely a so-called right-continuous function with left-hand limits (rcll function). That means simply that F is piece-wise continuous with an at most countable number of jumps, and if it jumps at a point φ, the function value F(φ) at the jump location already belongs to the right branch of the graph.

If we choose a sub-interval I = [a, b) in [0, 1], then F(b) − F(a), i.e. the (random) amount by which F increases between a and b, is called the increment of F on I. Recall that F is called a Lévy process if:
(1) It is almost surely an rcll function.
(2) It has independent increments: If I and J are two disjoint intervals, the increments of F on I and on J are independent random variables.

There is a direct correspondence between Lévy processes and (scalar) infinitely divisible variables: If F is a Lévy process, the scalar variable F(1) (or indeed F(φ) for any φ ∈ (0, 1]) is infinitely divisible, and the converse is also true:

Proposition 8.8 (e.g. [56, Theorem 7.10]). A scalar random variable T is infinitely divisible iff there is a real-valued Lévy process F(φ) on [0, 1] such that F(1) =_d T. /

Informally, we can "smear out" the scalar variable T into a stochastic process on [0, 1], and this process can always be chosen as a Lévy process. We can then use the special structure of Lévy processes to simplify further: The Lévy-Khinchine theorem [28, Theorem 15.4] tells us that any Lévy process can be decomposed into three, mutually independent components as

    Lévy process path = non-random linear function + centered Brownian motion + Poisson process jumps .

Since we know our random function F is non-decreasing, it cannot have a Brownian motion component: The increment of a centered Brownian motion on any interval I ⊂ [0, 1] is positive or negative with equal probability. Without Brownian motion, the effect of the non-random linear function would be to turn the piece-wise constant path in the figure above into a piece-wise linear one, where each segment has the same slope. For a discrete random measure, the CDF should not change between jumps, so the linear component has to vanish. What we are left with is a Poisson process on [0, 1]. This process is given by a Poisson process Π^µ on R+ × [0, 1] as

    F(φ) = Σ_{(C,Φ)∈Π^µ : Φ ≤ φ} C .   (8.35)

If the mean measure of the Poisson process has product structure µ = µc × µφ on R+ × [0, 1], a stochastic process F of the form (8.35) is called a subordinator. The product structure of µ means that F has independent increments. In summary:

    homogeneous random discrete measures with infinitely divisible total mass
    ⟷
    CDFs generated by a subordinator

We also see that there is a direct correspondence between the random CDF representation and the point process representation of such measures through (8.35).
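For the gamma case, which appears repeatedly below, the subordinator is easy to simulate, since its increments over disjoint intervals are independent gamma variables. The sketch below assumes the standard gamma subordinator with F(φ) ∼ Gamma(αφ, 1), so that F(1) ∼ Gamma(α, 1) is infinitely divisible as in (8.34); the grid resolution m is chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(3)

# Gamma subordinator on [0, 1]: independent Gamma(alpha * dt, 1) increments
# on a grid, giving a non-decreasing, piece-wise constant-looking path.
alpha, m = 2.0, 1000
dt = 1.0 / m
increments = rng.gamma(alpha * dt, 1.0, size=m)
F = np.cumsum(increments)           # F[-1] is a draw of F(1) ~ Gamma(alpha, 1)
```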

8.6. Poisson random measures

Suppose we define a random measure ξ using a point process as in (8.36). Our discussion in the previous section shows that, in order for the total mass Tξ to be infinitely divisible, we need to choose the point process specifically as a Poisson process. If Π^µ is a Poisson process on R+ × Ωφ, then

    ξ( • ) := Σ_{(C,Φ)∈Π^µ} C · δ_Φ( • )   (8.36)

is called a Poisson random measure. If ξ is also homogeneous, the parameter measure µ of the Poisson process factorizes as

    µ = µC ⊗ µφ   (8.37)

into non-atomic measures µC on R+ and µφ on Ωφ. This is precisely the case in which the CDF of ξ is a subordinator, if we choose Ωφ = [0, 1].

Although the class of Poisson random measures is huge, there are only a few which play an actual role in Bayesian nonparametrics. The most important examples are arguably gamma and stable random measures, which can be used to derive Dirichlet and Pitman-Yor processes. Both are standard models in applied probability. A model which was hand-tailored for Bayesian nonparametrics is the beta process or beta random measure [23], which has applications in survival analysis and can be used to represent the IBP.

A family of homogeneous Poisson random measures is defined by specifying the measure µC in (8.37), which controls the distribution of the weights. If we sample a weight sequence (Ck) from a Poisson process with parameter µC, we can then choose a probability measure on some space Ωφ, sample a sequence of atoms, and attach them to (Ck) to obtain a random measure on Ωφ.

Definition 8.9. A homogeneous Poisson random measure ξ as in (8.36) is called a gamma, stable, or beta process, respectively, if µC in (8.37) is of the form:

    Random measure     Parameter measure µC
    gamma process      µC(dc) = α c^{−1} e^{−βc} dc
    stable process     µC(dc) = γ c^{−α−1} dc
    beta process       µC(dc) = γ c^{−1} (1 − c)^{α−1} dc
/

Example 8.10 (beta process). The beta process was originally introduced by Hjort [23] as a prior for survival models. Thibaux and Jordan [66] pointed out that the IBP can be generated by first sampling a random measure ξ from a beta process, and then sampling a binary matrix Z from ξ according to the Bernoulli sampling scheme in Section 8.1. Teh and Görür [62] showed how this perspective can be used to generalize the IBP to distributions where the column sums of Z—i.e. the sizes of groups or clusters—follow a power law distribution, in analogy to the cluster sizes generated by a Pitman-Yor process. If ξ is a homogeneous Poisson random measure with

    µC(dc) = γ (Γ(1 + α) / (Γ(1 − d)Γ(α + d))) c^{−1−d} (1 − c)^{α+d−1} dc ,   (8.38)

and if Z is a random binary matrix generated from ξ according to the Bernoulli sampling model, then:
• For d = 0, Z is distributed according to an Indian buffet process.
• For d ∈ (0, 1], the column sums of Z follow a power law distribution.
The random measure (8.38) is obtained by modifying a beta process according to the intuition

    beta random measure + stable random measure → power law ,

and Teh and Görür [62] refer to it as a beta-stable random measure. For more background on beta processes, I recommend the survey [63]. /
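Exact samples from a beta process are not straightforward to generate, but a standard finite approximation (due to Griffiths and Ghahramani [21]) makes the connection to the IBP easy to simulate: with K features, π_k ∼ Beta(α/K, 1) and Z_{ik} ∼ Bernoulli(π_k), the law of Z (with empty columns removed) approaches an IBP with parameter α as K → ∞. A sketch:

```python
import numpy as np

rng = np.random.default_rng(4)

# Finite approximation to the beta process / IBP: pi_k ~ Beta(alpha/K, 1),
# Z[i, k] ~ Bernoulli(pi_k); large K approximates IBP(alpha).
alpha, K, n = 5.0, 1000, 20
pi = rng.beta(alpha / K, 1.0, size=K)
Z = (rng.random((n, K)) < pi).astype(int)
Z = Z[:, Z.sum(axis=0) > 0]   # keep only the features that were realized
```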

8.7. Completely random measures

Suppose we use a Poisson random measure ξ to derive a prior for a Bayesian nonparametric model. That may mean that we normalize ξ to obtain a random probability measure and then sample from it, that ξ is a beta process sampled with a Bernoulli sampling model, etc. Regardless of the specific sampling model we use, we usually observe atom locations of ξ. Suppose Φ1, ..., Φn are observed atom locations. If we compute the posterior of ξ given Φ1, ..., Φn, we know that a random measure ξn sampled from this posterior has atoms at Φ1, ..., Φn. Hence, ξn is no longer a Poisson random measure (since the parameter measure µ of a Poisson process must be atomless).

Poisson random measures can be generalized to a very natural class of measures, called completely random measures, which includes both Poisson random measures and measures with fixed atom locations: Recall the complete randomness property (8.19) for point processes. Now consider the analogous property for random measures: If ξ is a random measure on Ωφ, we require that for any two measurable sets A and A′

    ξ(A) ⊥⊥ ξ(A′)   whenever A, A′ are disjoint.   (8.39)

In analogy to point processes, we call a random measure satisfying (8.39) a completely random measure, or CRM for short.

If ξ has finite total mass almost surely—and in fact under much more general conditions not relevant for us, see [32]—a CRM can always be represented as follows: If ξ is completely random, then

    ξ =_a.s. ξn + ξf + ξr   (8.40)

where ξn is a non-random measure on Ωφ, ξf is a random measure

    ξf =_d Σ_{Φi∈A} Ci δ_{Φi}   (8.41)

with a fixed, countable set A ⊂ Ωφ of atoms, and ξr is a Poisson random measure [30, Theorem 1]. In particular, a CRM does not have to be discrete, but only the non-random component ξn can be smooth: Informally, suppose B is a set and we know what the smooth component looks like on B. Then, by smoothness, that provides information on how it behaves on Ωφ∖B, at least close to the boundary; so if the smooth component is random, its restrictions to B and Ωφ∖B are stochastically dependent, and (8.39) is violated. For Bayesian nonparametrics, the non-random component is not of interest, so we always assume ξn = 0. In the prior, we usually have no reason to assume atoms at specific locations. Hence, ξr in (8.40) is sampled from the prior, and ξf appears in the posterior. A very readable derivation of completely random measures is given by Kingman [32, Chapter 8], but it is useful to keep in mind that the only CRMs of interest to Bayesian nonparametrics are usually those which satisfy Assumption 8.7 and are of the form

    ξ =_a.s. ξf + ξr ,   (8.42)

where ξf appears in the posterior and ξr is sampled from the prior. Even if the prior is a completely random measure, though, the posterior need not be a CRM. One example of a CRM prior whose posterior is indeed of the form (8.42) is the beta process—or, more generally, the beta-stable process—combined with a Bernoulli sampling model as in Example 8.10.

8.8. Normalization

So far in this chapter, we have only discussed unnormalized random measures. We will now turn to the arguably more important case of random probability measures. Suppose ξ is a random measure with a.s. finite total mass Tξ. We can define a random probability measure ξ̂ from ξ by normalization, as

    ξ̂( • ) := ξ( • ) / Tξ .   (8.43)

If ξ is a completely random measure as above, ξ̂ is also known as a normalized completely random measure. Before we can define ξ̂, we have to verify that its total mass is finite. That may not be a trivial matter in general; if ξ is in particular a Poisson random measure with parameter measure µ, a sufficient condition for Tξ < ∞ is

    ∫_{R+} ∫_{Ωφ} min{1, c} µ(dφ, dc) < ∞ .   (8.44)

Example 8.11. The most important example is the gamma random measure (cf. Definition 8.9). In this case, it can be shown that the random mass ξ(A) assigned by ξ to a Borel set A is a gamma variable with parameters (αG0(A), 1). In particular, Tξ is Gamma(α, 1), and hence finite a.s., so we do not have to invoke condition (8.44). The random measure obtained by normalizing ξ is the Dirichlet process,

    ξ̂ ∼ DP(αG0) .   (8.45)

We can easily see that this is the case by subdividing the space Ωφ into a finite partition I := (A1, ..., An) of Borel sets. Then each entry ξ(Ai) of the random vector ξ(I) (using notation (8.5)) is a Gamma(αG0(Ai), 1) random variable. A normalized vector of such gamma variables has the Dirichlet distribution with concentration α and expectation G0(I) (see Appendix A.3). Thus, for any partition of Ωφ, the vector ξ̂(I) is Dirichlet-distributed, and ξ̂ is indeed a DP. /

Although it is perhaps not completely obvious at first glance, a normalized discrete random measure ξ̂ in (8.43) is in general not independent of Tξ. For illustration, suppose we sample a finite random measure ξ = Σ_{k=1}^3 Ck δ_{Φk}, where the weights Ck are generated i.i.d. from a distribution consisting of two point masses of mass 1/2 each at, say, 1 and 10:

[Figure: the weight distribution p(c), two point masses of height 1/2 at c = 1 and c = 10.]

We sample three weights, compute their sum Tξ, and normalize to obtain ξ̂. Even after we have normalized, if we are told that Tξ = 12, we can precisely read off the values of the Ck, and hence of the weights Ĉk of ξ̂, except for the order in which they occur. Thus, Tξ and ξ̂ are clearly dependent. The distribution of ξ̂ does not have to be degenerate—we could replace the mixture of point masses with a mixture of, say, two narrow Gaussians, and we could still tell if Tξ ≈ 12 that one of the weights Ĉk is about ten times as large as all others. The same can happen for an infinite number of atoms in a Poisson random measure, if e.g. the Poisson mean measure peaks sharply at two points. In fact, ξ̂ and Tξ are always dependent, except in one special case:

Proposition 8.12. If ξ is a completely random measure, then ξ̂ ⊥⊥ Tξ holds if and only if ξ is a gamma random measure, that is, if ξ̂ is a Dirichlet process. /

This characterization of the gamma and Dirichlet processes is a direct consequence of the special properties of the gamma distribution; see Theorem A.1 in Appendix A. In Remark 2.2, we argued that we want our random measures to be as simple as possible, in the sense that we try to limit coupling between their components. In this sense, Proposition 8.12 says that the Dirichlet process is the simplest object we can possibly hope to obtain—unless we keep the number of atoms finite, in which case it would be the Dirichlet distribution, again due to Theorem A.1. We also argued that restricting coupling between components of the random measure in the prior keeps the posterior tractable, which suggests that the Dirichlet process posterior should be particularly simple. Indeed:

Theorem 8.13 (James, Lijoi, and Prünster [25]). Suppose Θ is a homogeneous normalized completely random measure. If observations are generated as X1, X2, ... ∼iid Θ, the posterior distribution L(Θ|X1, ..., Xn) is the law of a homogeneous normalized completely random measure if and only if Θ is a Dirichlet process. /

The Pitman-Yor process is not a normalized completely random measure: Note that the stick-breaking variables Vk are not i.i.d., since L(Vk) depends on k.
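The dependence between ξ̂ and Tξ in the two-point-mass example above is easy to see in simulation; conditionally on Tξ = 12, the largest normalized weight is completely determined. A sketch:

```python
import numpy as np

rng = np.random.default_rng(5)

# Weights i.i.d. from (1/2) delta_1 + (1/2) delta_10; three atoms per draw.
samples = rng.choice([1.0, 10.0], size=(100_000, 3))
T = samples.sum(axis=1)
C_hat_max = samples.max(axis=1) / T   # largest normalized weight

# Conditioning on T = 12 (weights {1, 1, 10}) pins C_hat_max down to 10/12:
print(np.unique(np.round(C_hat_max[np.isclose(T, 12.0)], 6)))
```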

8.9. Beyond the discrete case: General random measures

Random discrete measures are comparatively easy to define, since it is sufficient to define the two random sequences (Ck) and (Φk). Defining general, possibly smooth random measures is a much harder problem: Informally speaking, in order to enforce smoothness, we have to introduce sufficiently strong stochastic dependencies between points that are close in Ωφ. There is a generally applicable way to define arbitrary random measures on a Polish space Ωφ; I will only discuss the case of random probability measures, since the general case is a bit more technical.

Here is the basic idea: Suppose ξ is any random probability measure on Ωφ, i.e. a random variable with values in PM(Ωφ), and let P := L(ξ) be its distribution. Now choose a partition I = (A1, ..., Ad) of Ωφ into a finite number d of measurable sets. If we evaluate ξ on each set Ai, we obtain a vector

    ξ(I) = (ξ(A1), ..., ξ(Ad))   (8.46)

with d non-negative entries summing to 1, and hence a random element of the simplex △d ⊂ R^d. We can define a different random vector ξ(I) for each possible partition of Ωφ into finitely many sets. Let I be the set of all such partitions. If we denote the distribution of ξ(I) by P_I := L(ξ(I)), we obtain a family of distributions

    P := {P_I | I ∈ I} .   (8.47)

Theorem 8.14 below implies that P completely determines P. This means we can construct P (and hence ξ) by positing a suitable family of distributions P. These distributions P_I are much easier to specify than P, since they live on the finite-dimensional sets △d, whereas P lives on the infinite-dimensional space PM(Ωφ). What I have described so far is a uniqueness statement, not a construction result: I have started with the object ξ we want to construct, and told you that we can derive distributions from ξ which then define ξ. What we really need is a set of criteria for whether a given family P of distributions defines a random probability measure ξ.

One property is clearly necessary: If all PI are supposed to be derived from the same P , they must cohere over different partitions. For example, suppose I is obtained from another partition J = (A1,...,Ad,Ad+1) by merging two sets in J, say I = (A1,...,Ad ∪ Ad+1). If ξ is a random measure on Ωφ, it must satisfy

    ξ(I) = (ξ(A1), ..., ξ(Ad) + ξ(Ad+1)) .   (8.48)

Hence, if ξI ∼ PI and ξJ ∼ PJ, they must accordingly satisfy

    (ξ_{I,1}, ..., ξ_{I,d}) =_d (ξ_{J,1}, ..., ξ_{J,d} + ξ_{J,d+1}) .   (8.49)

More generally, we have to consider the case where we obtain I from J by merging an arbitrary number of sets in J, which is notationally a bit more cumbersome: Suppose J = (A1, ..., Ad), and I is a coarsening of J, i.e. there is a partition ψ = (ψ1, ..., ψk) of [d] such that

    I = (∪_{i∈ψ1} Ai, ..., ∪_{i∈ψk} Ai) .   (8.50)

We then define a mapping pr_JI : △_J → △_I as

    pr_JI(p1, ..., pd) := (Σ_{i∈ψ1} pi, ..., Σ_{i∈ψk} pi) .   (8.51)

The general form of requirement (8.49) is then

    pr_JI(ξ_J) =_d ξ_I   or equivalently   pr_JI(P_J) = P_I .   (8.52)

The family (8.47) of distributions is called projective if it satisfies (8.52) whenever I is a coarsening (8.50) of J. A second obvious requirement is the following: If the expected measure (i.e. the expected value) of a random measure ξ on Ωφ is E[ξ] = G, then E[ξ(I)] = G(I). Hence, if the family P is supposed to define a random measure, there must be a probability measure G on Ωφ such that

    E[ξ_I] = G(I)   for all I ∈ I .   (8.53)

Remarkably, it turns out that conditions (8.52) and (8.53) are all we need to construct random measures:

Theorem 8.14 (Orbanz [46]). Let Ωφ be a Polish space. A family {P_I | I ∈ I} of distributions P_I ∈ PM(△_I) uniquely defines the distribution P of a random measure ξ on Ωφ if and only if (i) it is projective and (ii) there is some probability measure G on Ωφ such that

    E_{P_I}[ξ_I] = G(I) .   (8.54)

If so, P satisfies E_P[ξ] = G. /

Example 8.15 (Dirichlet process). We can in particular choose each distribution P_I as a Dirichlet distribution on △_I (see Section A.3). We choose the same concentration parameter α for all P_I, fix a probability measure G on Ωφ, and define the expectation parameter of each P_I as g_I := G(I). It is then not hard to show that the resulting family P is projective, and by definition, it satisfies (8.53). The resulting random measure ξ on Ωφ is a Dirichlet process with concentration α and base measure G. The Dirichlet process was originally constructed (roughly) in this way by Ferguson [12], and then shown to be almost surely discrete by Blackwell [4]. Our definition of the Dirichlet process, via the stick-breaking construction—where discreteness is obvious—is informed by hindsight. See [46] for details. /

Although Theorem 8.14 in principle allows us to construct arbitrary random measures, there are very few examples of continuous random measures used in Bayesian nonparametrics. One is the Dirichlet process mixture: The DP mixture model (2.18) with a smooth parametric component density p defines a smooth random measure. We have only used such mixtures for clustering, but they are also used for density estimation problems, where the random density (2.18) is used to fit a target distribution. Another example is the Pólya tree prior [13], which contains the DP as a special case, but for certain parameter settings generates continuous distributions almost surely—with the caveat that continuous in this case only means absolutely continuous with respect to Lebesgue measure, and a Pólya tree measure actually is piece-wise smooth rather than smooth.

I would name two main reasons why there is little work on priors on smooth measures:
• Tractability: Forcing a random measure to be smooth requires stochastic dependence. Distributions on such measures are hard to construct to begin with, but to use them as priors, we have to condition on data, and dependencies typically become much more complicated in the posterior. Comparing to those random measures which are actually used—in particular the CRMs and normalized CRMs discussed above—shows that they are defined precisely to keep dependencies between different subsets of Ωφ at an absolute minimum.
• More bespoke models have proven more useful: As discussed already in the introduction to this chapter, the main motivation to construct general, smooth random measure priors was originally to model the unknown random measure in de Finetti's theorem directly. Three decades of research in Bayesian nonparametrics show rather compellingly that this approach seems to be too brute-force for most problems; modelling a more specific pattern turns out to be more useful.
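The projectivity in Example 8.15 boils down to the aggregation property of the Dirichlet distribution: merging coordinates of a Dirichlet vector via pr_JI in (8.51) again yields a Dirichlet vector, with the parameters summed over each block of ψ. A quick numerical check, with a hypothetical parameter vector a and partition psi:

```python
import numpy as np

rng = np.random.default_rng(6)

def pr(p, psi):
    """Coarsening map pr_JI of (8.51): sum the entries of p over each block."""
    return np.array([p[list(block)].sum() for block in psi])

a = np.array([0.5, 1.0, 2.0, 1.5])
psi = [(0, 1), (2, 3)]

p = rng.dirichlet(a, size=50_000)
merged = np.apply_along_axis(pr, 1, p, psi)
direct = rng.dirichlet([a[0] + a[1], a[2] + a[3]], size=50_000)
print(merged.mean(axis=0), direct.mean(axis=0))   # empirical means agree
```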

8.10. Further references

The book to read on Poisson processes is [32]; if you have any interest in random discrete measures, I recommend reading at least Chapters 2, 5.1, 8 and 9. A very accessible exposition of point processes, Lévy processes and related topics is given by Cinlar [6]. For more general point process theory, I have also found the two volumes by Daley and Vere-Jones [7] useful. For posterior properties of normalized CRMs, see [26] and [34]. Mixture models based on normalized CRMs can be sampled with an algorithm similar to Neal's Algorithm 8 [11].

APPENDIX A

Poisson, gamma and stable distributions

The distributions we encounter most commonly when working with random discrete measures are the Poisson, gamma and Dirichlet distributions. The next few pages collect their most important properties for the purposes of Bayesian nonparametrics.

A.1. The Poisson

Recall that the Poisson distribution is the distribution we obtain from the series expansion

    e^λ = Σ_{k=0}^∞ λ^k / k!   (A.1)

of the exponential function: If we normalize by multiplication with e^{−λ} and multiply in a point mass δk at each k, we obtain a probability measure

    P_λ( • ) := e^{−λ} Σ_{k=0}^∞ (λ^k / k!) δk( • )   (A.2)

on N ∪ {0}, called the Poisson distribution with parameter λ. The Poisson distribution is usually defined for λ > 0, but our definition includes the case λ = 0, for which P_λ = δ0. The Poisson distribution has two very useful properties that we use at various points in these notes:

(1) Additivity: If K1 ∼ Poisson(α1) and K2 ∼ Poisson(α2) then

    (K1 + K2) ∼ Poisson(α1 + α2) .   (A.3)

(2) Thinning: The number of successes in a Poisson number of coin flips is Poisson; namely, if K ∼ Poisson(α) and X1, ..., XK ∼iid Bernoulli(p), then

    Σ_{i=1}^K Xi ∼ Poisson(pα) .   (A.4)
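The thinning property is easy to verify by simulation; the following sketch compares the empirical mean and variance of the thinned counts with pα (for a Poisson variable, the two coincide):

```python
import numpy as np

rng = np.random.default_rng(7)

# Thinning (A.4): flip a Poisson(alpha) number of p-coins; the number of
# heads should be Poisson(p * alpha).
alpha, p, m = 6.0, 0.3, 100_000
K = rng.poisson(alpha, size=m)
heads = rng.binomial(K, p)          # sum of K i.i.d. Bernoulli(p) flips
print(heads.mean(), heads.var())    # both approx p * alpha = 1.8
```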

A.2. The gamma

The gamma distribution is the distribution on R+ with Lebesgue density

    f_{α,β}(λ) = (β^α / Γ(α)) λ^{α−1} e^{−βλ} ,   α, β > 0 .   (A.5)

The gamma has a "magic" property which makes it unique among all distributions on positive scalars, and which directly accounts for many special properties of the Dirichlet process:


Theorem A.1 (Lukacs [39]). Let X and Y be two non-degenerate and positive random variables, and suppose that they are independently distributed. The random variables U := X + Y and V := X/Y are independently distributed if and only if both X and Y have gamma distributions with the same β. /

Remark A.2 (Gamma and Poisson). The gamma can be obtained as the natural conjugate prior of the Poisson: If the variables K1, ..., Kn are i.i.d. Poisson(λ), the sum Sn(K1:n) = Σi Ki is a sufficient statistic for λ. By additivity (A.3) of the Poisson, the sum is distributed as Sn ∼ Poisson(nλ). Following Remark 7.13, we can obtain the natural conjugate prior by passing to the image measure under Sn—which is again Poisson(nλ), since Sn is the identity—and renormalizing it as a density with respect to λ. Since

    P_{nλ}(k) = e^{−nλ} (nλ)^k / k! = e^{−nλ} (nλ)^k / Γ(k + 1) ,   (A.6)

the normalization constant is given by

    ∫_{R+} e^{−nλ} (nλ)^k dλ = (1/n) ∫_{R+} e^{−nλ} (nλ)^k d(nλ) = Γ(k + 1)/n .   (A.7)

(The first equality is a change of variables, the second the definition of the gamma function.) The conjugate prior density is hence

    e^{−nλ} (nλ)^k / (Γ(k + 1)/n) = e^{−nλ} n^{k+1} λ^k / Γ(k + 1) = f_{k+1,n}(λ) ,   (A.8)

that is, a gamma density with parameters α = k + 1 and β = n. /

The additivity of the Poisson is inherited by the gamma. Since the first parameter α corresponds to the value of the Poisson draw and the gamma variable λ to the Poisson parameter, (A.3) translates into the following property:

(1) Additivity: If λ1 ∼ Gamma(α1, β) and λ2 ∼ Gamma(α2, β) then

    (λ1 + λ2) ∼ Gamma(α1 + α2, β) .   (A.9)

We have already seen above that the sum of n i.i.d. Poisson(λ) variables has distribution Poisson(nλ). If we double the number of samples, the Poisson parameter doubles to 2nλ. Since n corresponds to the second parameter of the gamma, Poisson additivity induces another useful property:

(2) Scaling: If λ ∼ Gamma(α, β), then

    c · λ ∼ Gamma(α, β/c)   for any constant c > 0 .   (A.10)

A.3. The Dirichlet

Suppose we sample K positive random variables λk. Since the λk take positive scalar values, the random vector

    C_{1:K} := (λ1/Σ_k λk, ..., λK/Σ_k λk)   (A.11)

is a random element of the simplex △_K, that is, a random probability distribution on K events. If the λk are independent gamma variables λk ∼ Gamma(αk, β), the distribution of C_{1:K} is the Dirichlet distribution, given by the density

    f(c_{1:K} | α, g_{1:K}) := (1/K(α, g_{1:K})) exp( Σ_{k=1}^K (αgk − 1) log(ck) )

with respect to Lebesgue measure (restricted to the subset △_K ⊂ R^K). The Dirichlet is parametrized by its mean g_{1:K} = E[C_{1:K}] ∈ △_K and a concentration parameter α > 0. These parameters are derived from the parameters of the gamma variables λk as

    α (in the Dirichlet) = Σ_k αk (in the gamma)   (A.12)

and

    g_{1:K} := (α1/Σ_k αk, ..., αK/Σ_k αk) .   (A.13)

The two different uses of α are a bit unfortunate, but are so common in the literature that I will not meddle. Figure A.1 illustrates the effect of the concentration parameter in the case of uniform expectation g_{1:3} = (1/3, 1/3, 1/3) on △_3.

[Figure A.1: Dirichlet densities on △_3 with uniform expectation, for α = 0.8 (large density values at the extreme points of △_K), α = 1 (uniform distribution), α = 1.8 (density peaks around its mean) and α = 10 (the peak sharpens with increasing α); dark colors = small density values.]

For any random element C_{1:K} of △_K, the individual entries Ck are dependent random variables, since they couple through the normalization constraint Σ_k Ck = 1. If C_{1:K} is defined by normalizing a vector of positive variables as in (A.11), the variables Ck additionally couple through the total mass: If T := Σ_k λk in (A.11), and hence Ck = λk/T, then Ck and T are in general stochastically dependent, which introduces additional stochastic dependence between any two Ci and Cj through T. Theorem A.1 above shows that the only exception to this rule is the Dirichlet distribution: In the Dirichlet, the Ck couple only through normalization. In this sense, Dirichlet random variables have the simplest structure among all random variables with values in △_K.
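The normalization construction (A.11) is also how a Dirichlet vector is sampled in practice. The sketch below compares the empirical mean of normalized gamma variables with (A.13); the parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(8)

# Construction (A.11): normalizing independent Gamma(alpha_k, beta) variables
# yields a Dirichlet vector with mean (A.13) and concentration sum(alpha_k).
alpha = np.array([1.0, 2.0, 3.0])
lam = rng.gamma(alpha, 1.0, size=(100_000, 3))
C = lam / lam.sum(axis=1, keepdims=True)
print(C.mean(axis=0), alpha / alpha.sum())  # empirical mean vs. (A.13)
```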

A.4. The stable

The additivity property (A.3) shows that sums of Poisson random variables are again Poisson; since summation changes the Poisson parameter, the sum and the individual summands differ in their means and variances (both of which are controlled by λ). Can we obtain a similar property for R-valued random variables? Instead of positing a specific parametric model like the Poisson and demanding that we remain within the model when we take sums, though, we now demand simply that sum and summands differ only in terms of mean and variance. Two random variables X and Y differ only in their mean and variance iff they satisfy X =_d aY + b for some constants a and b. Hence, we ask how i.i.d. variables X0, X1, ..., Xn have to be distributed to satisfy

    Σ_{i=1}^n Xi =_d an X0 + bn ,   (A.14)

for two suitable sequences (an) and (bn) of constants. The additive constant is less interesting: Clearly, if we center the Xi, then the sum is also centered, so we can always eliminate bn. It can be shown that the constants an must always be of the form an = n^{1/α} for some α ∈ (0, 2]. We hence define our class of random variables as follows: A distribution P_{α,(bn)} ∈ PM(R) is called stable or α-stable if

    Σ_{i=1}^n Xi =_d n^{1/α} X0 + bn   whenever X0, X1, ... ∼iid P_{α,(bn)}   (A.15)

for some α ∈ (0, 2] and b1, b2, ... ∈ R. The constant α is called the index. If bn = 0 for all n, P_α = P_{α,(bn)} is called strictly stable.

The stable distribution does not in general have a Lebesgue density, except in some special cases, notably for α = 2, in which case P_α is a Gaussian. However, (A.15) clearly implies that a strictly stable distribution P_α is infinitely divisible, and it hence has a Lévy-Khinchine representation: A random variable is α-stable if and only if it (i) is normal (if α = 2) or (ii) has Lévy measure

    ρ(x) = c_⊕ x^{−α−1}    if x > 0 ,
    ρ(x) = c_⊖ |x|^{−α−1}  if x < 0 ,   (A.16)

where at least one of the constants c_⊕, c_⊖ ≥ 0 is non-zero [28, Proposition 15.9].

Reading up on the stable is a bit of a mess, since definitions, parametrization and naming are not uniform in the literature; almost every author treats the stable slightly differently. Perhaps the closest thing to a standard reference is [69]. As a concise reference available online, I recommend [27].

APPENDIX B

Nice spaces

Non-trivial probability models require a modicum of topological assumptions: The most relevant spaces for probability and statistics are Polish spaces and standard Borel spaces, which are essentially two sides of the same coin. Also relevant are locally compact spaces, which are the spaces on which we can properly work with density representations of probability measures.

B.1. Polish spaces

A topological space X is called a Polish space if it is complete, separable and metrizable. Complete means every Cauchy sequence of points in X converges to a limit in X. Separable means X has a dense subset that is countable. X is metrizable if there exists a metric that generates the topology (it is possible to define topologies that cannot be generated by any metric).¹

Roughly speaking, a Polish topology is the minimum structural requirement necessary to ensure that real and functional analysis are applicable on a space. A Polish structure ensures, for example, that:
• Conditional probabilities are well-behaved.
• All probability measures are Radon measures, i.e. their value on any set can be approximated to arbitrary precision by their values on compact sets—which at first glance may not seem to be a big deal, but for almost all purposes of statistics and most purposes of probability theory, non-Radon measures are basically useless.

As we trade off generality against nice properties, Polish spaces emerge as the golden mean for most applications of probability and statistics. They have most of the pleasant analytical properties of Euclidean spaces (though not necessarily the geometric ones, such as a vector space structure and a scalar product). The class of Polish spaces is much larger, though, and contains practically all spaces of interest for the purposes of statistics; examples include:

(1) All finite spaces (in the discrete topology).
(2) Euclidean space.

¹If you are not used to working with topologies, think of X as a set of points. We choose some metric d on X. A metric defines a notion of convergence, so now we can ask whether X is complete in this metric. It also defines whether a subset A is dense—it is if we can get arbitrarily close to any point in X by choosing an appropriate point in A, where closeness is measured by d. If the space is separable and complete, we have a complete, separable metric space. The open sets in this space are called the topology of X, and the σ-algebra they generate are the Borel sets. There may be many different metrics d, though, which all generate the same topology; if so, many analytical properties (such as continuity) and all measure-theoretic properties of X are independent of the specific choice of d. We hence do not fix a specific d, and say that X, with the topology we have chosen, is a metrizable space.


(3) Any separable Banach space, in particular all separable Hilbert spaces and Lp spaces.
(4) The space C(R+, R) of continuous functions (in the topology of compact convergence).
(5) The space D(R+, R) of càdlàg functions (in the Skorohod topology).
(6) The set PM(X) of probability measures, which is Polish in the topology of weak convergence if and only if X is Polish.
(7) Cantor space, i.e. the set {0, 1}^∞ (in the product topology).
(8) Any countable product of Polish spaces, such as R^N, in the product topology. (Uncountable products of Polish spaces, such as R^R, are not Polish. They are not even metrizable, except in the trivial case where all but countably many factors are singletons.)

B.2. Standard Borel spaces

To reap the benefits of a Polish topology for most measure-theoretic purposes, the space we use does not actually have to be Polish—rather, it is sufficient if the measurable sets we use are generated by some topology which is Polish. This topology need not be the same we use for analytic purposes—to define convergence of sequences, continuity of functions, etc.—the two topologies only have to generate the same measurable sets.

Definition B.1. A measurable space (X, A) is called a standard Borel space² if there is a Polish topology T on X that generates the σ-algebra A. /

Clearly, if X is a Polish space and B(X) are the Borel sets on X, then (X, B(X)) is standard Borel. But the definition is considerably more general: The system of measurable sets generated by a topology is much larger than the topology itself, and we can often make a topology considerably finer or coarser (i.e. considerably increase or decrease the number of open sets) without changing the Borel sets. The finer the topology, the fewer sequences converge, and the fewer sets are hence dense. If the topology on a separable (uncountable) space is made finer and finer, the space will cease to be separable at some point.

We have already seen that standard Borel spaces guarantee the existence of conditionals. Another important property is that they are, roughly speaking, spaces of countable complexity:

Lemma B.2. Let X be a standard Borel space. Then there exists a countable system of measurable sets A1, A2, ... which separates points in X, that is, for any two distinct points x1, x2 ∈ X, there is a set Ai in the system such that x1 ∈ Ai and x2 ∉ Ai. /

If A1,A2,... is a separating sequence, we can uniquely characterize each element x ∈ X by the sequence

    I(x) := (I_{A1}(x), I_{A2}(x), ...)   (B.1)

of indicator functions. Thus, each element of a standard Borel space is determined by a countable sequence of scalars, a property which we can informally think of as a countable dimension, even though the entries of the sequence do not correspond to axes in a vector space.

² Some authors call standard Borel spaces simply "Borel spaces", a very sensible terminology which I would use here if not for its ambiguity: Other authors use the term Borel space for any measurable space generated by a topology, or simply for any measurable space, or specifically only for uncountable spaces.

B.3. Locally compact spaces

Locally compact spaces are relevant for statistical modeling purposes since they are basically the spaces on which we can use densities to represent distributions. The reason why we do not use density representations for the Gaussian process or Dirichlet process, for example, is precisely because these distributions live on infinite-dimensional spaces that are not locally compact.

The Radon-Nikodym theorem does of course tell us that densities of probability measures exist on any measurable space. The problem is that a density p is always the representation of one measure P with respect to another measure, say µ:

    P(dx) = p(x) µ(dx)   (B.2)

The measure µ is often called the carrier measure. For modeling purposes, this representation is only useful if the measure µ is "flat": If µ is, say, a Gaussian, then a large value of p in a given region could indicate either that P puts a lot of mass in the region, or that µ is small in the region. That means p by itself is not informative; to make it informative, we have to integrate against µ, and have not simplified the representation of P at all.

More formally, for µ to be a useful carrier measure, we need it to be translation invariant. To formulate translation invariance, we first need a translation operation, which we denote + (since on most standard spaces it coincides with addition). A translation should be reversible, so we require (X, +) to be a group. Since we regard X as a space with a topological structure, rather than just a set of points, + has to be compatible with that structure, i.e. the +-operation must be continuous in the topology of X. A group whose operation is continuous is called a topological group. We call a measure µ on (X, +) translation invariant if µ(A + x) = µ(A) for every Borel set A and point x in X. Informally, this means that, if we shift a set A around in X by means of the operation A + x, the mass of A under µ does not change. Thus, µ(A) depends on the shape and size of A, but not on where in the space A is located. The general class of spaces on which such measures exist are locally compact spaces.

Definition B.3. A space X is called locally compact if every point has a compact neighborhood, i.e. if for every x ∈ X, there is a compact subset of X that contains an open neighborhood of x. /

An attempt at explanation: Very informally, there are different ways to create a non-compact space from compact ones. We could, say, glue together a countable number of unit intervals (which are compact) to produce a space like the real line (which is not compact). That means we are changing the global structure of the space, but the local structure around an individual point remains the same—the neighborhood of any given point is still a line. Obviously each point is enclosed in a compact set, so the resulting non-compact space is still locally compact. Alternatively, we could multiply the intervals into a Cartesian product. In this case, the local structure around each point changes—it is now something infinite-dimensional—and the resulting space is not locally compact.

Theorem B.4. Let (X, +) be a topological group. If X is locally compact and separable, there is a measure λ on X, unique up to scaling by a positive constant, which satisfies λ(A + x) = λ(A) for all Borel sets A and points x ∈ X. /

The measure λ is called the Haar measure on the group (X, +). Lebesgue measure is Haar measure on the group (R^d, +), scaled to satisfy λ([0, 1]^d) = 1.

Most nonparametric priors have no useful density representations, because the parameter spaces we encounter in Bayesian nonparametrics are infinite-dimensional, and infinite-dimensional spaces are not locally compact. We have to be careful with this statement, because only vector spaces have a straightforward definition of dimension (by counting basis elements), and spaces such as PM(X) have no natural vector space structure. In the case of vector spaces, however, we can make the statement precise:

Lemma B.5. A topological vector space is locally compact if and only if it is finite-dimensional. /

APPENDIX C

Conditioning

Bayesian statistics involves a lot of conditional probabilities, and in Bayesian nonparametrics, we cannot just get away with using conditional densities. Measure-theoretic conditioning is, unfortunately, the problem child of most probability textbooks, and I am frankly at a loss to provide a single concise reference. Below, I have summarized some basic properties for reference.

C.1. Probability kernels

A probability kernel is a measurable, measure-valued mapping. That is, if X and Y are Borel spaces, a measurable mapping p : Y → PM(X) is a probability kernel. For each y ∈ Y, p(y) is a probability measure on X, and it is hence useful to write p as a function of two arguments p(A, y), where A is a measurable set in X and y a point in Y.

Our two most important uses for probability kernels are conditional probabilities and statistical models. A conditional probability of X ∈ RV(X) given Y ∈ RV(Y) is a probability kernel Y → PM(X), namely p( • , y) := P[X ∈ • |Y = y] . (C.1)

Similarly, a model M = {Pθ|θ ∈ T} can be regarded as a probability kernel by defining p( • , θ) := Pθ( • ) . (C.2) We recover the set M as M = p( • , T).
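As a concrete toy example (hypothetical, not taken from the text): the kernel y ↦ N(y, 1), written as a function of two arguments p(A, y) with A an interval.

```python
from scipy.stats import norm

def p(A, y):
    """Probability kernel p(A, y) for the conditional X | Y = y ~ N(y, 1):
    the probability of the interval A = (a, b) under that distribution."""
    a, b = A
    return norm.cdf(b, loc=y) - norm.cdf(a, loc=y)

print(p((-1.0, 1.0), 0.0))   # P[X in (-1, 1) | Y = 0]
```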

C.2. Conditional probability

A conditional probability of X given Y is formally a probability kernel p with the interpretation

    P[ • | Y = y] := p( • , y) .   (C.3)

The intuitive notion of a conditional distribution implies some technical requirements which the mapping p must satisfy to be of any use:

Definition C.1. Let X and Y be two random variables with values in X and Y respectively. A measurable mapping p : Y → PM(X) (i.e. a probability kernel) is called a conditional probability of X given Y if it satisfies

    ∫_B p(A, y) Y[P](dy) = P{X ∈ A, Y ∈ B}   (C.4)

for all A ∈ B(X) and B ∈ B(Y), and

    P[{Y = y} | Y = y] = 1   Y[P]-a.s.   (C.5)


/

The definition makes three requirements on the mapping p, namely (C.4), measurability, and (C.5). Each of these has an intuitive meaning:

(1) Equation (C.4) simply says that p(A, y) is the probability of X ∈ A given that Y = y, although the statement is disguised as an integral: If the event {Y = y} has non-zero measure, we would state this in the form

    P(X ∈ A | Y = y) = P({X ∈ A} ∩ {Y = y}) / P({Y = y}) =: p(A, y) .

Since in general {Y = y} may very well be a null set—say, if Y is a Gaussian variable and y a point on the line—(C.4) instead requires that p integrates as if it were an expression of the form above. That such an implicit definition is meaningful is not at all obvious: We have to prove that (C.4) actually determines p for almost all y; see Theorem C.2 below.

(2) If p is not measurable, elementary statements about conditional probabilities become meaningless. For example, the probability (under the distribution of Y) that the conditional probability p(A, y) has value t ∈ [0, 1] is P(Y^{−1}(p(A, • )^{−1}{t})), which is only defined if p is measurable.

(3) Equation (C.5) simply states that, if we already know that Y = y, any event in X that would imply otherwise has zero probability. Although this requirement is semantically simple and clearly necessary, it is a bit harder to formulate and prove in detail.¹

We have to make sure that p exists when we need it. The next theorem shows that this is the case whenever the sample space of X is a standard Borel space, which is pretty much everything we ever need to know about the existence of conditionals.

Theorem C.2 (conditional probability). Let X and Y be random variables with values in measurable spaces X and Y. If X is a standard Borel space, there is a probability kernel p : Y → PM(X) which is a conditional probability of X given Y in the sense of Definition C.1. The kernel p is uniquely determined up to modification on a Y[P]-null set in Y. /

C.3. Conditional random variables

In modeling problems, we frequently encounter "conditionally" distributed random variables of the form X|Y. A fact very useful for technical purposes is that X|Y = y can indeed be regarded as a random variable "parameterized" by y.

Theorem C.3 ([28, Lemma 3.22]). Let X be a Polish space and Y a measurable space. Let X : Ω → X be a random variable and p : Y → PM(X) a probability kernel. Then there is a measurable mapping X′ : Ω × Y → X such that

    P(X′( • , y) ∈ A) = p(A, y)   (C.6)

for Y[P]-almost all y. /

¹ Technically, (C.5) means that (a) in the abstract probability space Ω, the fibres Y^{−1}(y) of the mapping Y are X^{−1}B(X)-measurable and (b) for almost all y ∈ Y, the pullback measure X#p( • , y) concentrates on the fibre Y^{−1}(y).

Depending on the context, it can be useful to denote X′ as X^y(ω) := X′(ω, y). Note that X^y depends measurably on y. What the theorem above says is, in other words, that given a conditional probability of the form P[X ∈ • | Y = y] on a Polish space X, there are random variables X^y such that

    L(X^y) = P[X ∈ • | Y = y] .   (C.7)

We can hence interpret X^y as the "conditional random variable" X|Y = y.

C.4. Conditional densities

Let X and Y be random variables with values in standard Borel spaces X and Y. Now choose a σ-finite measure µ on X. Since the conditional probability of X given Y is a probability measure on X for every y ∈ Y, we can ask whether it has a density with respect to µ, i.e. whether there is a measurable function p such that

    P[X ∈ dx | Y = y] = p(x|y) µ(dx)   L(Y)-a.s.   (C.8)

If so, p is called a conditional density of X given Y. As a probability measure, each distribution P[X ∈ dx | Y = y] is of course absolutely continuous with respect to some σ-finite measure, but the question is whether a single µ can be found for all values of y. (It hence comes down to the question whether the family {P[ • | Y = y] | y ∈ Y} is dominated, cf. Sec. 7.2.) It will therefore not come as a surprise that a sufficient condition can be formulated based on absolute continuity of the joint distribution:

Lemma C.4. Require that the joint distribution P := L(X, Y) satisfies

    P ≪ µ ⊗ ν   and define   p(x, y) := P(dx × dy) / (µ(dx)ν(dy)) .   (C.9)

Then P[X ∈ dx | Y = y] ≪ µ(dx) holds L(Y)-a.s., i.e. the conditional density p(x|y) exists. It is given by

    p(x|y) = p(x, y) / p(y)   where   p(y) := ∫_X p(x, y) µ(dx) .   (C.10)

Additionally, the function p(y) is a density of L(Y) with respect to ν. /

Index of definitions

α-stable, 90; additivity of gamma variables, 88; additivity of Poisson variables, 87; asymptotic frequencies, 50; atom locations, 6; atoms, 6; Banach space, 64; base measure, 9; Bayes equation, 58; Bayesian mixture model, 7; Bayesian model, 3; Bernoulli process, 72; Bernoulli sampling model, 72; beta process, 80; beta-stable random measure, 80; blocks of a family of sets, 23; blocks of a partition, 15; carrier measure, 93; Chinese restaurant process, 16; clustering problem, 5; clusters, 5; collaborative filtering, 24; complete, 64; complete space, 91; completely random, 75; completely random measure, 80; concentration, 9; conditional density, 97; conditionally i.i.d., 44; conditionally independent, 43; conjugate, 61; consistent, 67; consistent in the Bayesian sense, 67; consistent in the frequentist sense, 68; covariance function, 30; covariate-dependent model, 40; covariates, 40; CRM, 80; d-array, 50; dependent Dirichlet process, 41; Dirac measure, 6; Dirichlet process, 9; Dirichlet process mixture, 10; discrete probability measure, 6; dominated model, 59; Doob's theorem, 68; dust, 49; emission model, 37; entropy, 63; ergodic structures, 53; exchangeable, 46; exchangeable random partition, 48; exponential family, 64; family of sets, 23; feature, 24; features, 24; finite mixture, 6; finite-dimensional distributions, 29; finite-dimensional marginals, 29; Fréchet derivative, 65; gamma process, 80; Gaussian process, 29; GEM(α) distribution, 19; Gibbs measures, 64; Gibbs sampler, 12; graph limits, 52; graphons, 52; Haar measure, 94; HDP, 39; hidden Markov model, 37; hierarchical Dirichlet process, 39; Hilbert-Schmidt integral operators, 33; homogeneous, 7; hyperparameter, 61; imputation, 11; increment, 78; index of a stable distribution, 90; Indian buffet process, 26; infinite hidden Markov model, 39; infinite symmetric group, 46; infinitely divisible, 78; integral kernel, 33; jointly exchangeable, 51; kernel density estimator, 2; latent feature model, 24; latent variables, 11; left-ordering, 25; Lévy process, 78; linearly conjugate, 62; locally compact, 93; locally Lipschitz-continuous, 33; mass partition, 49; mean function, 30; misspecified, 67; mixing measure, 6, 35; mixture distribution, 5; mixture model, 7, 35; model order selection, 20; multinomial sampling model, 72; natural conjugate priors, 66; neutral-to-the-right (NTTR) process, 62; nonparametric Bayesian model, 3; nonparametric model, 1; norm dual, 64; normalized completely random measure, 81; observation model, 3; order, 24; paint-box partition, 49; parametric, 1; partition, 15; partition function, 64; Pitman-Yor process, 18; point mass, 6; point process, 73; Poisson distribution, 87; Poisson process, 73; Poisson random measure, 79; Poisson-Dirichlet distribution, 19; Polish space, 91; posterior distribution, 3; posterior index, 61; power law distribution, 17; prior, 3; prior distribution, 3; probability kernel, 95; probability measures generated by a measure, 63; projective family, 84; projective limit, 33; random counting measure, 73; random partition, 15; rate of posterior convergence, 69; scaling of gamma variables, 88; separable space, 91; separately exchangeable, 51; separating sequence of sets, 92; simple point process, 73; simplex, 6; single-p model, 42; species sampling models, 20; stable distribution, 90; stable process, 80; standard Borel space, 92; state space, 37; statistical model, 1; stick-breaking, 9; strictly stable, 90; subordinator, 79; sufficient, 64; thinning of Poisson variables, 87; topic, 39; topological group, 93; total mass, 76; trace, 33; trace class, 33; transition matrix, 38; translation invariant, 93; two-parameter Chinese restaurant process, 19; two-parameter Poisson-Dirichlet, 19; variational algorithm, 11

[1] Antoniak, C. E. (1974). Mixtures of Dirichlet processes with applications to Bayesian nonparametric estimation. Ann. Statist., 2, 1152–1174.
[2] Beal, M. J., Ghahramani, Z., and Rasmussen, C. (2002). The infinite hidden Markov model. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems, volume 14, pages 577–584.
[3] Bertoin, J. (2006). Random Fragmentation and Coagulation Processes. Cambridge University Press.
[4] Blackwell, D. (1973). Discreteness of Ferguson selections. Ann. Statist., 1(2), 356–358.
[5] Blackwell, D. and MacQueen, J. B. (1973). Ferguson distributions via Pólya urn schemes. Ann. Statist., 1, 353–355.
[6] Çinlar, E. (2011). Probability and Stochastics. Springer.
[7] Daley, D. and Vere-Jones, D. (2008). An Introduction to the Theory of Point Processes, volumes I and II. Springer, 2nd edition.
[8] De Blasi, P., Favaro, S., Lijoi, A., Mena, R., Prünster, I., and Ruggiero, M. (2014). Are Gibbs-type priors the most natural generalization of the Dirichlet process? IEEE Trans. on Pattern Analysis and Machine Intelligence.
[9] Doksum, K. A. (1974). Tailfree and neutral random probabilities and their posterior distributions. Ann. Probab., 2, 183–201.
[10] Engen, S. (1978). Stochastic Abundance Models. Chapman and Hall, London; Halsted Press [John Wiley & Sons], New York. With emphasis on biological communities and species diversity, Monographs on Applied Probability and Statistics.
[11] Favaro, S. and Teh, Y. W. (2013). MCMC for normalized random measure mixture models. Statist. Sci., 28(3), 335–359.
[12] Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist., 1(2), 209–230.
[13] Ferguson, T. S. (1974). Prior distributions on spaces of probability measures. Ann. Statist., 2(4), 615–629.
[14] Ferguson, T. S. and Phadia, E. G. (1979). Bayesian nonparametric estimation based on censored data. Ann. Statist., 7, 163–186.
[15] Fortini, S. and Petrone, S. (2012). Predictive construction of priors in Bayesian nonparametrics. Braz. J. Probab. Stat., 26(4), 423–449.
[16] Foti, N. J. and Williamson, S. A. (2014). A survey of non-exchangeable priors for Bayesian nonparametric models. IEEE Trans. on Pattern Analysis and Machine Intelligence.
[17] Ghosal, S. (2010). Dirichlet process, related priors and posterior asymptotics. In N. L. Hjort et al., editors, Bayesian Nonparametrics, pages 36–83. Cambridge University Press.


[18] Ghosal, S. and van der Vaart, A. (forthcoming). Fundamentals of Nonparametric Bayesian Inference.
[19] Gnedin, A. V., Hansen, B., and Pitman, J. (2007). Notes on the occupancy problem with infinitely many boxes: General asymptotics and power laws. Probab. Surv., 4, 146–171.
[20] Griffiths, R. C. (1979). Exact sampling distributions from the infinite neutral alleles model. Adv. in Appl. Probab., 11(2), 326–354.
[21] Griffiths, T. L. and Ghahramani, Z. (2006). Infinite latent feature models and the Indian buffet process. In Adv. Neural Inf. Process. Syst.
[22] Halmos, P. R. and Savage, L. J. (1949). Application of the Radon-Nikodym theorem to the theory of sufficient statistics. Ann. Math. Stat., 20, 225–241.
[23] Hjort, N. L. (1990). Nonparametric Bayes estimators based on beta processes in models for life history data. Ann. Statist., 18, 1259–1294.
[24] Ishwaran, H. and James, L. F. (2001). Gibbs sampling methods for stick-breaking priors. J. Amer. Statist. Assoc., 96(453), 161–173.
[25] James, L. F., Lijoi, A., and Prünster, I. (2005). Conjugacy as a distinctive feature of the Dirichlet process. Scand. J. Stat., 33, 105–120.
[26] James, L. F., Lijoi, A., and Prünster, I. (2009). Posterior analysis for normalized random measures with independent increments. Scand. J. Stat., 36, 76–97.
[27] Janson, S. (2011). Stable distributions.
[28] Kallenberg, O. (2001). Foundations of Modern Probability. Springer, 2nd edition.
[29] Kemp, C., Tenenbaum, J., Griffiths, T., Yamada, T., and Ueda, N. (2006). Learning systems of concepts with an infinite relational model. In Proc. of the Nat. Conf. on Artificial Intelligence, volume 21, page 381.
[30] Kingman, J. F. C. (1967). Completely random measures. Pacific Journal of Mathematics, 21(1), 59–78.
[31] Kingman, J. F. C. (1975). Random discrete distributions. J. R. Stat. Soc. Ser. B Stat. Methodol., 37, 1–22.
[32] Kingman, J. F. C. (1993). Poisson Processes. Oxford University Press.
[33] Kleijn, B., van der Vaart, A. W., and van Zanten, J. H. (2012). Lectures on nonparametric Bayesian statistics.
[34] Lijoi, A. and Prünster, I. (2010). Models beyond the Dirichlet process. In N. L. Hjort, C. Holmes, P. Müller, and S. G. Walker, editors, Bayesian Nonparametrics. Cambridge University Press.
[35] Lloyd, J. R., Orbanz, P., Ghahramani, Z., and Roy, D. M. (2012). Random function priors for exchangeable arrays. In Adv. in Neural Inform. Processing Syst. 25, pages 1007–1015.
[36] Lo, A. Y. (1984). On a class of Bayesian nonparametric estimates. I. Density estimates. Ann. Statist., 12(1), 351–357.
[37] Lovász, L. (2013). Large Networks and Graph Limits. American Mathematical Society.
[38] Luenberger, D. G. (1969). Optimization by Vector Space Methods. J. Wiley & Sons.
[39] Lukacs, E. (1955). A characterization of the gamma distribution. Ann. Math. Stat., 26(2).

[40] MacEachern, S. N. (1994). Estimating normal means with a conjugate style Dirichlet process prior. Communications in Statistics: Simulation and Computation, 23, 727–741.
[41] MacEachern, S. N. (2000). Dependent Dirichlet processes. Technical report, Ohio State University.
[42] McCloskey, J. W. T. (1965). A model for the distribution of individuals by species in an environment. Thesis (Ph.D.), Michigan State University. ProQuest LLC, Ann Arbor, MI.
[43] McLachlan, G. and Peel, D. (2000). Finite Mixture Models. John Wiley & Sons.
[44] Miller, J. W. and Harrison, M. T. (2014). Inconsistency of Pitman-Yor process mixtures for the number of components. To appear in JMLR.
[45] Neal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9, 249–265.
[46] Orbanz, P. (2011). Projective limit random probabilities on Polish spaces. Electron. J. Stat., 5, 1354–1373.
[47] Orbanz, P. (2012). Nonparametric priors on complete separable metric spaces. Preprint.
[48] Orbanz, P. and Roy, D. M. (2013). Bayesian models of graphs, arrays and other exchangeable random structures. IEEE Trans. on Pattern Analysis and Machine Intelligence, to appear.
[49] Pemantle, R. (2007). A survey of random processes with reinforcement. Probab. Surv., 4, 1–79.
[50] Perman, M., Pitman, J., and Yor, M. (1992). Size-biased sampling of Poisson point processes and excursions. Probab. Theory Related Fields, 92(1), 21–39.
[51] Pitman, J. (2006). Combinatorial Stochastic Processes, volume 1875 of Lecture Notes in Mathematics. Springer.
[52] Pitman, J. and Yor, M. (1997). The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. Ann. Probab., 25(2), 855–900.
[53] Raiffa, H. and Schlaifer, R. (1961). Applied Statistical Decision Theory. Harvard University Press.
[54] Robert, C. P. (1995). Mixtures of distributions: inference and estimation. In W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, editors, Markov Chain Monte Carlo in Practice. Chapman & Hall.
[55] Roy, D. M. and Teh, Y. W. (2009). The Mondrian process. In Advances in Neural Information Processing Systems, volume 21, page 27.
[56] Sato, K.-I. (1999). Lévy Processes and Infinitely Divisible Distributions. Cambridge University Press.
[57] Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statist. Sinica, 4, 639–650.
[58] Sethuraman, J. and Tiwari, R. C. (1982). Convergence of Dirichlet measures and the interpretation of their parameter. In Statistical Decision Theory and Related Topics, III, Vol. 2 (West Lafayette, Ind., 1981), pages 305–315. Academic Press, New York.
[59] Shields, P. C. (1996). The Ergodic Theory of Discrete Sample Paths. American Mathematical Society.
[60] Susarla, V. and Van Ryzin, J. (1976). Nonparametric Bayesian estimation of survival curves from incomplete observations. J. Amer. Statist. Assoc., 71(356), 897–902.
[61] Teh, Y. W. (2006). A Bayesian interpretation of interpolated Kneser-Ney. Technical report, School of Computing, National University of Singapore.
[62] Teh, Y. W. and Görür, D. (2007). Indian buffet processes with power-law behavior. In Adv. Neural Inf. Process. Syst.
[63] Teh, Y. W. and Jordan, M. (2014). A gentle introduction to the Dirichlet process, the beta process and Bayesian nonparametrics. To appear.
[64] Teh, Y. W. and Jordan, M. I. (2010). Hierarchical Bayesian nonparametric models with applications. In N. L. Hjort, C. Holmes, P. Müller, and S. G. Walker, editors, Bayesian Nonparametrics. Cambridge University Press.
[65] Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006). Hierarchical Dirichlet processes. J. Amer. Statist. Assoc., 101(476), 1566–1581.
[66] Thibaux, R. and Jordan, M. I. (2007). Hierarchical beta processes and the Indian buffet process. In J. Mach. Learn. Res. Proceedings, volume 2, pages 564–571.
[67] van der Vaart, A. (1998). Asymptotic Statistics. Cambridge University Press.
[68] Zabell, S. L. (2005). Symmetry and its Discontents. Cambridge Studies in Probability, Induction, and Decision Theory. Cambridge University Press, New York. Essays on the history of inductive probability, with a preface by Brian Skyrms.
[69] Zolotarev, V. M. (1986). One-dimensional Stable Distributions. American Mathematical Society.