
Clusters and Features from Combinatorial Stochastic Processes

by

Tamara Ann Broderick

A dissertation submitted in partial satisfaction of the

requirements for the degree of

Doctor of Philosophy

in

Statistics

in the

Graduate Division

of the

University of California, Berkeley

Committee in charge:

Professor Michael I. Jordan, Chair
Professor Thomas L. Griffiths
Professor Cari Kaufman
Professor James W. Pitman

Fall 2014

Clusters and Features from Combinatorial Stochastic Processes

Copyright 2014 by Tamara Ann Broderick

Abstract

Clusters and Features from Combinatorial Stochastic Processes

by

Tamara Ann Broderick

Doctor of Philosophy in Statistics

University of California, Berkeley

Professor Michael I. Jordan, Chair

Clustering involves placing entities into mutually exclusive categories. We wish to relax the requirement of mutual exclusivity, allowing objects to belong simultaneously to multiple classes, a formulation that we refer to as “feature allocation.” The first step is a theoretical one. In the case of clustering, the class of distributions over exchangeable partitions of a dataset has been characterized (via exchangeable partition probability functions and the Kingman paintbox). These characterizations support an elegant nonparametric Bayesian framework for clustering in which the number of clusters is not assumed to be known a priori. We establish an analogous characterization for feature allocation; we define notions of “exchangeable feature probability functions” and “feature paintboxes” that lead to a Bayesian framework that does not require the number of features to be fixed a priori. We focus on particular models within this framework that are both practical for inference and provide desirable modeling properties. And we explore a further generalization to feature allocations where objects may exhibit any non-negative integer number of features, or traits. The second step is a computational one. Rather than appealing to Monte Carlo for posterior approximation, we develop a method to transform Bayesian methods for feature allocation (and other latent structure problems) into optimization problems with objective functions analogous to K-means in the clustering setting. These yield approximations to Bayesian inference that are scalable to large inference problems.

Contents

Contents i

List of Figures iv

List of Tables vi

List of Algorithms vii

1 Introduction 1

I Models, connections, and inference 5

2 Cluster and feature modeling from combinatorial stochastic processes 6
   2.1 Introduction ...... 6
   2.2 Partitions and feature allocations ...... 8
   2.3 Exchangeable probability functions ...... 10
   2.4 Stick lengths ...... 22
   2.5 Subordinators ...... 28
   2.6 Completely random measures ...... 37
   2.7 Discussion ...... 41

3 Beta processes, stick-breaking, and power laws 43
   3.1 Introduction ...... 43
   3.2 The beta process and the Indian buffet process ...... 46
   3.3 Stick-breaking for the Dirichlet process ...... 49
   3.4 Power law behavior ...... 51
   3.5 Stick-breaking for the beta process ...... 53
   3.6 Power law derivations ...... 57
   3.7 Simulation ...... 65
   3.8 Experimental results ...... 68
   3.9 Discussion ...... 72
   3.A A algorithm ...... 72

4 Combinatorial clustering and the beta negative binomial process 78
   4.1 Introduction ...... 78
   4.2 Completely random measures ...... 81
   4.3 Conjugacy and combinatorial clustering ...... 85
   4.4 Mixtures and admixtures ...... 89
   4.5 Asymptotics ...... 92
   4.6 Simulation ...... 94
   4.7 Posterior inference ...... 95
   4.8 Document topic modeling ...... 99
   4.9 Image segmentation and object recognition ...... 101
   4.10 Discussion ...... 106
   4.A Connections ...... 106
   4.B Proofs for Appendix 4.A ...... 109
   4.C Full results for Section 4.5 ...... 111
   4.D Proofs for Appendix 4.C ...... 113
   4.E Conjugacy proofs ...... 120
   4.F Posterior inference details ...... 136

5 Feature allocations, probability functions, and paintboxes 140
   5.1 Introduction ...... 140
   5.2 Feature allocations ...... 142
   5.3 Labeling features ...... 144
   5.4 Exchangeable feature probability function ...... 150
   5.5 The Kingman paintbox and feature paintbox ...... 155
   5.6 Feature frequency models ...... 158
   5.7 Discussion ...... 168
   5.A Intermediate lemmas leading to Lemma 5.6.6 ...... 169
   5.B Proof of Lemma 5.6.6 ...... 170
   5.C Proof of Lemma 5.A.1 ...... 171
   5.D Proof of Lemma 5.A.2 ...... 172

6 Posteriors, conjugacy, and exponential families for CRMs 175
   6.1 Introduction ...... 175
   6.2 Bayesian models based on completely random measures ...... 178
   6.3 Posteriors ...... 184
   6.4 Exponential families ...... 188
   6.5 Size-biased representations ...... 195
   6.6 Marginal processes ...... 202
   6.7 Discussion ...... 209
   6.A Further automatic conjugate priors ...... 209
   6.B Further size-biased representations ...... 211
   6.C Further marginals ...... 212

II Scaling inference 214

7 Streaming variational Bayes 215
   7.1 Introduction ...... 215
   7.2 Streaming, distributed, asynchronous Bayesian updating ...... 216
   7.3 Case study: latent Dirichlet allocation ...... 219
   7.4 Evaluation ...... 223
   7.5 Discussion ...... 226
   7.A Variational Bayes ...... 228
   7.B Expectation propagation ...... 234

8 MAP-based asymptotic derivations from Bayes 239
   8.1 Introduction ...... 239
   8.2 MAP asymptotics for clusters ...... 241
   8.3 MAP asymptotics for features ...... 243
   8.4 Extensions ...... 245
   8.5 Experiments ...... 247
   8.6 Discussion ...... 252
   8.A DP-means objective derivation ...... 252
   8.B BP-means objective derivation ...... 253
   8.C Collapsed DP-means objective derivation ...... 254
   8.D Collapsed BP-means objective derivation ...... 256
   8.E Parametric objectives ...... 256
   8.F General multivariate Gaussian likelihood ...... 258
   8.G Proof of BP-means local convergence ...... 259

Bibliography 261

List of Figures

2.1 A Chinese restaurant process seating arrangement. ...... 11
2.2 An Indian buffet process sample. ...... 15
2.3 Stick-breaking. ...... 24
2.4 A Pólya urn interpretation of the Chinese restaurant process. ...... 25
2.5 A Pólya urn interpretation of the Indian buffet process. ...... 27
2.6 A subordinator. ...... 30
2.7 Poisson thinning. ...... 32
2.8 The Dirichlet process. ...... 39
2.9 The beta process. ...... 40

3.1 Poisson with beta density. ...... 47
3.2 Bernoulli process. ...... 48
3.3 Stick-breaking process. ...... 50
3.4 Poissonization. ...... 59
3.5 Simulated feature count growth as a function of number of data points. ...... 66
3.6 Features of size j as a function of j and lack of Type III power laws. ...... 67
3.7 Simulated and theoretical feature frequencies. ...... 68
3.8 Latent feature cardinality as a function of MCMC iteration. ...... 70
3.9 Hyperparameter values as a function of MCMC iteration. ...... 70
3.10 Autocorrelation of hyperparameters. ...... 71
3.11 Feature means learned for MNIST handwritten digits data. ...... 72

4.1 Simulated and asymptotic number of clusters. ...... 96
4.2 Number of admixture components found by the finite and exact samplers. ...... 98
4.3 Document length distributions and word frequencies by WITS perpetrator. ...... 102
4.4 MSRC-v1 test image segmentations inferred by the HBNBP admixture model. ...... 105

5.1 A summary of the relations between classes of cluster and feature models. ...... 142
5.2 Example order-of-appearance binary matrix representations. ...... 147
5.3 Illustration of uniform random feature labeling. ...... 148
5.4 Example binary matrix from uniform random feature labels. ...... 150
5.5 Order-of-appearance Indian buffet process example. ...... 152

5.6 A Kingman paintbox. ...... 156
5.7 A feature paintbox. ...... 157
5.8 A feature paintbox for the two-feature allocation. ...... 158
5.9 A feature paintbox for a feature frequency model. ...... 159

7.1 SDA-Bayes performance as a function of number of threads. ...... 225
7.2 Parameter sensitivity of stochastic variational inference and SDA-Bayes. ...... 227

8.1 Run time and number of features by algorithm and IBP Gibbs sampler feature cardinality histogram. ...... 249
8.2 Example tabletop images, feature assignments, and feature means. ...... 249
8.3 Example face images, feature and cluster assignments, and feature and cluster means. ...... 251

List of Tables

4.1 Theoretical asymptotic number of clusters for the DP, PYP, BNBP, and 3BNBP. ...... 94
4.2 Number of claimed incidents by WITS perpetrator. ...... 100
4.3 Confusion matrices for WITS perpetrator identification. ...... 101
4.4 HBNBP: Top topic per WITS perpetrator. ...... 102
4.5 LDA and HBNBP confusion matrices for image segmentation. ...... 105
4.6 HBNBP hyperparameter sensitivity. ...... 106
4.7 Generative model comparison of BNBP and ΓPLP. ...... 107

7.1 Comparison of SDA-Bayes performance with stochastic variational inference and sufficient statistics updates. ...... 224

List of Algorithms

7.1 Update for the local variational parameters. ...... 220
7.2 Variational Bayes for LDA. ...... 221
7.3 Sufficient statistics update algorithm for LDA. ...... 221
7.4 Stochastic variational inference for LDA. ...... 221
7.5 Streaming VB for LDA. ...... 234
7.6 SDA-Bayes with VB primitive for LDA. ...... 234
7.7 Expectation propagation for LDA. ...... 237
7.8 Streaming expectation propagation for LDA. ...... 238
7.9 SDA-Bayes with expectation propagation primitive for LDA. ...... 238

8.1 BP-means algorithm. ...... 244
8.2 Collapsed DP-means algorithm. ...... 246
8.3 Collapsed BP-means algorithm. ...... 246
8.4 K-features algorithm. ...... 247

Acknowledgments

First and foremost, I thank my advisor, Michael I. Jordan—for his enthusiasm, encouragement, inspiration, support, and mentorship. I am also particularly grateful to Jim Pitman, whose open door has sparked a number of fascinating discussions and collaborations. I thank my dissertation committee, further including Tom Griffiths and Cari Kaufman, for a number of useful discussions over the years. Indeed, my thanks go out to all of the other colleagues with whom I have had the great pleasure of collaborating on various parts of this work: Nick Boyd, Brian Kulis, Lester Mackey, John Paisley, Andre Wibisono, Ashia Wilson. I am fortunate to have worked with a number of other great researchers on projects during this time that, although not appearing here, have been formative in my doctoral experience: Josh Bloom, David Duvenaud, Jan Luts, Ryan Giordano, Joey Gonzalez, Stefanie Jegelka, Mason Liang, James Long, Ron Minnich, Adam Morgan, Xinghao Pan, Barret Rhoden, Joey Richards, Purnamrita Sarkar, Rebecca Steorts, Matt Wand.

I wish to thank everyone in my research communities, especially the Bayesian nonparametrics community, who has been so supportive. I am particularly indebted to David Banks, David Blei, Zoubin Ghahramani, Peter Müller, and Peter Orbanz for their invaluable mentorship. Thanks also to: the SAIL group at UC Berkeley, Ryan Adams, David Aldous, Kevin Canini, Sam Clifford, Finale Doshi, Barbara Engelhardt, Steve Evans, Noah Forman, Nick Foti, Emily Fox, Alexander Gnedin, Lauren Hannah, Katherine Heller, Philipp Hennig, Nils Hjort, Lancelot James, Wayne Tai Lee, Mason Liang, Steve MacEachern, Chris Melgaard, Douglas Rizzolo, Dan Roy, Josh Schraiber, Yun Song, Erik Sudderth, Yee Whye Teh, Hanna Wallach, Sinead Williamson, Bin Yu.

This research has benefited from a number of other sources of support. I thank Matt Hoffman, Tom Griffiths, and Chong Wang for generously sharing their code. I also thank Ryan Lovett—our SCF Systems Manager—for his patient troubleshooting and La Shana Porlaris—our Graduate Student Services Advisor—for going above and beyond in helping me navigate bureaucracy. I am grateful to have been funded during my doctorate by the National Science Foundation Graduate Student Fellowship and the Berkeley Fellowship. Further travel support during this time has come from NSF VIGRE, the NIPS Foundation, WiML, and SFASA.

Finally, I thank my family and friends; I couldn’t have done it without you.

Chapter 1

Introduction

The continued growth of Bayesian statistics may be attributed to the flexibility of hierarchical modeling and the coherent treatment of uncertainty this paradigm facilitates. Despite the successes of Bayesian statistics, its classical use is situated firmly in a small data world; Markov chain Monte Carlo sampling is often too slow for posterior calculation on modern, massive data sets, and finite-dimensional prior distributions are not able to accommodate the inevitable growth in complexity that comes with significant growth in data size. Bayesian nonparametrics is the area of Bayesian analysis in which the finite-dimensional prior distributions of classical Bayesian analysis are replaced with stochastic processes. One may view Bayesian nonparametrics as supplying modelers with a richer collection of distributions with which to express prior belief. In practice, however, the field has been dominated by two stochastic processes—the Gaussian process and the Dirichlet process—and thus the flexibility promised by the nonparametric approach has arguably not yet been delivered. This manuscript provides a broader perspective on the kinds of stochastic processes that populate a toolbox for Bayesian nonparametric analysis. In Part I, we see how combinatorial stochastic processes embody mathematical structure that is useful for both model specification and inference. For instance, a significant body of literature develops prior distributions for clustering, where each data point can belong to one and only one group, called a cluster. Chapter 2 introduces an extension of clustering called a feature allocation, where each data point can belong to any non-negative integer number of groups, now called features. There are a number of practical examples: each member of a social network might belong to multiple friend groups; each document in a corpus might best be described by multiple themes, or topics; and a customer’s purchases might correspond to multiple interests. With flexible, nonparametric models for these problems in place (Part I), we can focus on computationally scalable inference. Part II demonstrates how to retain the strengths of the Bayesian paradigm and nonparametric analysis while simultaneously enabling fast, and even streaming, inference on large data sets. We begin in Chapter 2 by noting that a number of disparate representations of clustering models have been used in proofs and inference. A key concept in our analysis is exchangeability, which expresses the assumption that any particular order in which data points are seen

is irrelevant for inference. Kingman (1978) has shown that the distribution of any random, exchangeable clustering is equivalent to the distribution of a construction of the following type: first draw a random partition (potentially countably infinite) of the unit interval, then draw cluster membership for each data point with distribution given by this partition, called the Kingman paintbox. Pitman (1995) has further shown that—given a random, exchangeable clustering—the probability of any configuration depends only on the cluster sizes, via a function called the exchangeable partition probability function (EPPF). Both representations are useful in defining and evaluating new clustering models. We review how these representations can be viewed as sequential augmentations from simple distributions over partitions (EPPFs) to cluster frequencies (the sizes of the partition intervals in Kingman’s paintbox) to subordinators and completely random measures, which associate a general class of labels with the stick lengths and whose labels we typically use as cluster-specific parameters in likelihood models (Broderick, Jordan, and Pitman, 2013). We show how an analogous augmentation regime can be built up for feature frequency models: from simple distributions over feature allocations to feature frequencies to a (different) class of subordinators and completely random measures. We discuss implications of each representation for tractable inference and provide running examples of the Dirichlet process (for clustering) and the beta process (for feature modeling). In Chapter 3, we focus more deeply on the beta process model. Choosing a Bayesian prior often amounts to choosing the most tractable model for inference that satisfies known properties of the model. The beta process provides a prior on an unbounded and unknown number of feature frequencies while also allowing tractable inference. However, power laws are often observed in real-world data sets; e.g., we expect, from empirical evidence, that the number of features in a data set will grow as a fixed (but typically unknown) power of the size of the data set. We define a three-parameter beta process (3BP) as a generalization of the beta process (Broderick, Jordan, and Pitman, 2012). We show that the 3BP exhibits many of the same traits that allow straightforward inference in the beta process. We further prove, and demonstrate via simulation, that the 3BP exhibits desired power laws. We show the usefulness of this construction by developing a Markov chain Monte Carlo (MCMC) inference scheme and learning a factor model in an experiment on handwritten digit data. Chapter 4 takes the generalization from clustering models to feature models one step further. We define an extension to the feature modeling framework where each data point may belong with any non-negative integer multiplicity to any non-negative integer number of features (not just 0 or 1 as in vanilla feature modeling) (Broderick, Mackey, et al., 2014). For instance, we might want to assign multiple words in a document to a topic (e.g., economics, the arts, sports) or multiple patches in an image to an object class (e.g., grass, sky, car). We propose a beta process model with negative binomial likelihood. We prove the conjugacy of these stochastic processes and provide a power-law extension. We develop MCMC inference and demonstrate the model’s applicability in segmenting images and analyzing documents through topic identification.
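The Kingman paintbox construction described above is easy to simulate. The following is a minimal sketch, not code from this dissertation: it assumes a Dirichlet-process-style stick-breaking prior over the paintbox interval lengths, truncated for computation, and all function names and the truncation level are illustrative choices.

```python
import numpy as np

def kingman_paintbox_sample(n_points, alpha=1.0, truncation=1000, seed=0):
    """Sample cluster assignments from a (truncated) Kingman paintbox.

    The paintbox is built from stick-breaking weights beta_k ~ Beta(1, alpha),
    p_k = beta_k * prod_{j<k} (1 - beta_j). Each data point independently
    falls into the interval (cluster) containing a uniform draw on [0, 1].
    """
    rng = np.random.default_rng(seed)
    betas = rng.beta(1.0, alpha, size=truncation)
    sticks = betas * np.cumprod(np.concatenate(([1.0], 1.0 - betas[:-1])))
    edges = np.cumsum(sticks)  # right endpoints of the paintbox intervals
    u = rng.uniform(size=n_points)
    # Points falling beyond the truncation are lumped into one final cluster.
    return np.searchsorted(edges, u)

labels = kingman_paintbox_sample(20, alpha=2.0)
print(labels)  # ties among labels induce the exchangeable partition
```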
In the remaining chapters of Part I, we examine the space of potential Bayesian nonparametric models in more depth. In Chapter 5, we show that all exchangeable feature

allocations have distribution equivalent to a feature paintbox construction. Moreover, we define both (1) an exchangeable feature probability function (EFPF) and (2) feature frequency models (Broderick, Jordan, and Pitman, 2013; Broderick, Pitman, and Jordan, 2013). The EFPF is similar to the EPPF though now with an explicit dependence on the data set size. A feature frequency model is characterized as having distribution equivalent to the following construction: draw each feature for each data point as an independent Bernoulli draw conditioned on some underlying, random, feature-specific frequency. We show that the distributions with EFPFs are exactly the feature frequency models. While one might initially think of feature models as analogous to cluster models, our results situate feature frequency models and clusterings as analogous subclasses of feature models. With these results, we bring the same completeness to feature allocation characterizations as has been achieved for clustering characterizations. In Chapter 6, we focus on completely random measures (CRMs) as a particular way to generate feature frequencies, not just for feature allocations but more generally for the case where each data point may exhibit features with a non-negative integer multiplicity (cf. Chapter 4) (Broderick, Wilson, and Jordan, 2014). We demonstrate how to calculate posteriors for general CRM-based priors and likelihoods for Bayesian nonparametric models. Motivated by conjugate priors based on exponential family representations of likelihoods, we introduce a notion of exponential families for CRMs, which we call exponential CRMs. This construction allows us to specify automatic Bayesian nonparametric conjugate priors for exponential CRM likelihoods. We demonstrate that our exponential CRMs allow particularly straightforward recipes for size-biased and marginal representations of Bayesian nonparametric models. The focus of Part I is elucidation of a wide range of models, reflecting known desiderata for various generalizations of clustering. Part II aims more specifically at performing scalable inference in the modern Big Data context while maintaining strengths of the Bayesian paradigm, such as flexible hierarchical modeling. One particular challenge arising from large datasets is streaming data, where we assume computer memory can hold only fixed-size data subsets and that data, once processed, cannot be revisited. We address this challenge while taking advantage of modern distributed computing architectures in Chapter 7 by developing SDA-Bayes, a framework for (S)treaming, (D)istributed, (A)synchronous computation of a Bayesian posterior (Broderick, Boyd, et al., 2013). The framework takes advantage of the naturally sequential nature of iterative Bayesian posterior calculation to make streaming updates to the estimated posterior according to a user-specified batch approximation primitive. We demonstrate the usefulness of our framework on an unsupervised topic learning task with two corpora: Wikipedia (over 3M documents) and the journal Nature (over 300K documents). We use latent Dirichlet allocation (LDA) as a model for assigning documents to topics. We use variational Bayes (VB), a popular and fast posterior approximation method, as the primitive. We demonstrate that our algorithm, though taking only streaming data, performs as well as a popular non-streaming algorithm for learning LDA with VB.
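The streaming recursion underlying this approach (the posterior after one batch serves as the prior for the next) can be illustrated in a conjugate toy model where each update is exact. The sketch below is only an illustration of that recursion, not the SDA-Bayes framework itself, which substitutes an approximate primitive such as VB and runs asynchronously across workers; the model and all names are illustrative assumptions.

```python
import numpy as np

def streaming_beta_bernoulli(batches, a0=1.0, b0=1.0):
    """Streaming Bayesian updating in a conjugate toy model.

    Each batch of binary observations updates a Beta posterior, which then
    serves as the prior for the next batch: the same prior-becomes-posterior
    recursion that a streaming framework exploits batch by batch.
    """
    a, b = a0, b0
    for batch in batches:
        batch = np.asarray(batch)
        a += batch.sum()               # successes observed in this batch
        b += batch.size - batch.sum()  # failures observed in this batch
    return a, b  # parameters of the (here, exact) streaming posterior

rng = np.random.default_rng(1)
stream = [rng.binomial(1, 0.3, size=50) for _ in range(10)]
print(streaming_beta_bernoulli(stream))
```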
There are certain trade-offs involved in using VB—and further our streaming, distributed approximation—to approximate a Bayesian posterior. In our MAD-Bayes (MAP-based Asymptotic Derivation from Bayes) framework (Chapter 8), though, we consider a more radical trade-off (Broderick, Kulis, and Jordan, 2013). Recognizing the scalability and ease-of-use of K-means, we take limits of Bayesian posteriors to invent novel K-means-like objective functions and algorithms. In particular, the classical mixture of Gaussians model is related to K-means via small-variance asymptotics: as the covariances of the Gaussians tend to zero, the negative log-likelihood of the mixture of Gaussians model approaches the K-means objective, and the EM algorithm approaches the K-means algorithm. We instead consider applying small-variance asymptotics directly to the posterior in Bayesian nonparametric models. This framework is independent of any specific Bayesian inference algorithm and generalizes to a range of models. To illustrate, we apply our framework to feature learning, where the beta process provides an appropriate Bayesian nonparametric prior. We obtain novel objective functions and algorithms, all of which are scalable and simple to implement. Empirical results in computer vision demonstrate the benefits of the new framework.
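To make the small-variance-asymptotics idea concrete, the following sketch implements a DP-means-style clustering loop in the spirit of the objectives derived in Chapter 8 (the clustering analog of BP-means). It is an illustrative sketch under assumed conventions, not the dissertation's algorithm verbatim; the penalty parameter `lam` and all names are hypothetical choices.

```python
import numpy as np

def dp_means(X, lam, max_iter=100, seed=0):
    """DP-means-style clustering via small-variance asymptotics (a sketch).

    A point opens a new cluster whenever its squared distance to every
    existing center exceeds the penalty lam; otherwise it joins the nearest
    center. Centers are then updated to cluster means, as in K-means.
    """
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))].copy()]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        changed = False
        for i, x in enumerate(X):  # assignment step (may create clusters)
            d2 = np.array([np.sum((x - c) ** 2) for c in centers])
            k = int(np.argmin(d2))
            if d2[k] > lam:
                centers.append(x.copy())
                k = len(centers) - 1
            changed |= bool(labels[i] != k)
            labels[i] = k
        # Update step: recompute means; empty clusters keep their old center.
        centers = [X[labels == k].mean(axis=0) if np.any(labels == k) else c
                   for k, c in enumerate(centers)]
        if not changed:
            break
    return np.array(centers), labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.1, size=(30, 2)) for m in ([0, 0], [3, 3], [0, 4])])
centers, labels = dp_means(X, lam=1.0)
print(len(np.unique(labels)), "clusters found")
```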

Part I

Models, connections, and inference

Chapter 2

Cluster and feature modeling from combinatorial stochastic processes

One of the focal points of the modern literature on Bayesian nonparametrics has been the problem of clustering, or partitioning, where each data point is modeled as being associated with one and only one of some collection of groups called clusters or partition blocks. Underlying these Bayesian nonparametric models are a set of interrelated stochastic processes, most notably the Dirichlet process and the Chinese restaurant process. In this chapter we provide a formal development of an analogous problem, called feature modeling, for associating data points with arbitrary non-negative integer numbers of groups, now called features or topics. We review the existing combinatorial representations for the clustering problem and develop analogous representations for the feature modeling problem. These representations include the beta process and the Indian buffet process as well as new representations that provide insight into the connections between these processes. We thereby bring the same level of completeness to the treatment of Bayesian nonparametric feature modeling that has previously been achieved for Bayesian nonparametric clustering.

2.1 Introduction

Bayesian nonparametrics is the area of Bayesian analysis in which the finite-dimensional prior distributions of classical Bayesian analysis are replaced with stochastic processes. While the rationale for allowing infinite collections of random variables into Bayesian inference is often taken to be that of diminishing the role of prior assumptions, it is also possible to view the move to nonparametrics as supplying the Bayesian paradigm with a richer collection of distributions with which to express prior belief, thus in some sense emphasizing the role of the prior. In practice, however, the field has been dominated by two stochastic processes—the Gaussian process and the Dirichlet process—and thus the flexibility promised by the nonparametric approach has arguably not yet been delivered. In the current chapter we aim to provide a broader perspective on the kinds of stochastic processes that can provide a useful toolbox for Bayesian nonparametric analysis. Specifically, we focus on combinatorial stochastic processes as embodying mathematical structure that is useful for both model specification and inference. The phrase “combinatorial stochastic process” comes from (Pitman, 2006), where it refers to connections between stochastic processes and the mathematical field of combinatorics. Indeed, the focus in this area of probability theory is on random versions of classical combinatorial objects such as partitions, trees, and graphs—and on the role of combinatorial analysis in establishing properties of these processes. As we wish to argue, this connection is also fruitful in a statistical setting. Roughly speaking, in statistics it is often natural to model observed data as arising from a combination of underlying factors. In the Bayesian setting, such models are often embodied as latent variable models in which the latent variable has a compositional structure. Making explicit use of ideas from combinatorics in latent variable modeling can not only suggest new modeling ideas but can also provide essential help with calculations of marginal and conditional distributions. The Dirichlet process already serves as one interesting exhibit of the connections between Bayesian nonparametrics and combinatorial stochastic processes. On the one hand, the Dirichlet process is classically defined in terms of a partition of a probability space (Ferguson, 1973), and there are many well-known connections between the Dirichlet process and urn models (Blackwell and MacQueen, 1973; Hoppe, 1984). In the current chapter, we will review and expand upon some of these connections, beginning our treatment (non-traditionally) with the notion of an exchangeable partition probability function (EPPF) and, from there, discussing related urn models, stick-breaking representations, subordinators, and random measures. On the other hand, the Dirichlet process is limited in terms of the statistical notion of “combination of underlying factors” that we referred to above. Indeed, the Dirichlet process is generally used in a statistical setting to express the idea that each data point is associated with one and only one underlying factor. In contrast to such clustering models, we wish to also consider featural models, where each data point is associated with a set of underlying features and it is the interaction among these features that gives rise to an observed data point.
Focusing on the case in which these features are binary, we develop some of the combinatorial stochastic process machinery needed to specify featural priors. Specifically, we develop a counterpart to the EPPF, which we refer to as the exchangeable feature probability function (EFPF), that characterizes the combinatorial structure of certain featural models. We again develop connections between this combinatorial function and a suite of related stochastic processes, including urn models, stick-breaking representations, subordinators, and random measures. As we will discuss, a particular underlying random measure in this case is the beta process, originally studied by Hjort (1990) as a model of random hazard functions in survival analysis, but adapted by Thibaux and Jordan (2007) for applications in featural modeling. For statistical applications it is not enough to develop expressive prior specifications, but it is also essential that inferential computations involving the posterior distribution are tractable. One of the reasons for the popularity of the Dirichlet process is that the associated urn models and stick-breaking representations yield a variety of useful inference algorithms (Neal, 2000).

2.2 Partitions and feature allocations

While we have some intuitive ideas about what constitutes a cluster or feature model, we want to formalize these ideas before proceeding. We begin with the underlying combinatorial structure on the data indices. We think of $[N] := \{1, \ldots, N\}$ as representing the indices of the first $N$ data points. There are different groupings that we apply in the cluster case (partitions) and feature case (feature allocations); we describe these below. First, we wish to describe the space of partitions over the indices $[N]$. In particular, a partition $\pi_N$ of $[N]$ is defined to be a collection of mutually exclusive, exhaustive, non-empty subsets of $[N]$ called blocks; that is, $\pi_N = \{A_1, \ldots, A_K\}$ for some number of partition blocks $K$. An example partition of $[6]$ is $\pi_6 = \{\{1,3,4\}, \{2\}, \{5,6\}\}$. Similarly, a partition of $\mathbb{N} = \{1, 2, \ldots\}$ is a collection of mutually exclusive, exhaustive, non-empty subsets of $\mathbb{N}$. In this case, the number of blocks may be infinite, and we have $\pi_{\mathbb{N}} = \{A_1, A_2, \ldots\}$. An example partition of $\mathbb{N}$ into two blocks is $\{\{n : n \text{ is even}\}, \{n : n \text{ is odd}\}\}$. We introduce a generalization of a partition called a feature allocation that relaxes both the mutually exclusive and exhaustive restrictions. In particular, a feature allocation $f_N$ of

$[N]$ is defined to be a multiset of non-empty subsets of $[N]$, again called blocks, such that each index $n$ can belong to any finite number of blocks. Note that the constraint that no index belong to infinitely many blocks coincides with our intuition for the meaning of these blocks as groups to which the index belongs. Consider an example where the data points are images expressed as pixel arrays, and the latent features represent animals that may or may not appear in each picture. It is impossible to display an infinite number of animals in a picture with finitely many pixels. We write $f_N = \{A_1, \ldots, A_K\}$ for some number of feature allocation blocks $K$. An example feature allocation of $[6]$ is $f_6 = \{\{2,3\}, \{2,4,6\}, \{3\}, \{3\}, \{3\}\}$. Just as the blocks of a partition are sometimes called clusters, so are the blocks of a feature allocation sometimes called features. We note that a partition is always a feature allocation, but the converse statement does not hold in general; for instance, $f_6$ given above is not a partition. In the remainder of this section, we continue our development in terms of feature allocations since partitions are a special case of the former object. We note that we can extend the idea of random partitions (Aldous, 1985) to consider random feature allocations. If $\mathcal{F}_N$ is the space of all feature allocations of $[N]$, then a random feature allocation $F_N$ of $[N]$ is a random element of this space. We next introduce a few useful assumptions on our random feature allocation. Just as exchangeability of observations is often a central assumption in statistical modeling, so will we make use of exchangeable feature allocations. To rigorously define such feature allocations, we introduce the following notation. Let $\sigma : \mathbb{N} \to \mathbb{N}$ be a finite permutation. That is, for some finite value $N_\sigma$, we have $\sigma(n) = n$ for all $n > N_\sigma$. Further, for any block $A \subset \mathbb{N}$, denote the permutation applied to the block as follows: $\sigma(A) := \{\sigma(n) : n \in A\}$. For any feature allocation $F_N$, denote the permutation applied to the feature allocation as follows: $\sigma(F_N) := \{\sigma(A) : A \in F_N\}$. Finally, let $F_N$ be a random feature allocation of $[N]$. Then we say that $F_N$ is exchangeable if $F_N \stackrel{d}{=} \sigma(F_N)$ for every finite permutation $\sigma$. Our second assumption in what follows will be that we are dealing with a consistent feature allocation. We often implicitly imagine the indices arriving one at a time: first 1, then 2, up to $N$ or beyond. We will find it useful, similarly, in defining random feature allocations to suppose that the randomness at stage $n$ somehow agrees with the randomness at stage $n+1$. More formally, we say that a feature allocation $f_M$ of $[M]$ is a restriction of a feature allocation $f_N$ of $[N]$ for $M < N$ if

\[ f_M = \{A \cap [M] : A \in f_N\}. \]

Let $\mathcal{R}_N(f_M)$ be the set of all feature allocations of $[N]$ whose restriction to $[M]$ is $f_M$. Then we say that the sequence of random feature allocations $(F_n)$ is consistent if for all $M$ and $N$ such that $M < N$, we have that

\[ F_N \in \mathcal{R}_N(F_M) \quad \text{a.s.} \tag{2.1} \]

With this consistency condition in hand, we can define a random feature allocation $F_\infty$ of $\mathbb{N}$. In particular, such a feature allocation is characterized by the sequence of consistent

finite restrictions $F_N$ to $[N]$: $F_N := \{A \cap [N] : A \in F_\infty\}$. Then $F_\infty$ is equivalent to a consistent sequence of finite feature allocations and may be thought of as a random element of the space of such sequences: $F_\infty = (F_n)_n$. We let $\mathcal{F}_\infty$ denote the space of consistent feature allocations, of which each random feature allocation is a random element, and we see that the sigma-algebra associated with this space is generated by the finite-dimensional sigma-algebras of the restricted random feature allocations $F_n$. We say that $F_\infty$ is exchangeable if $F_\infty \stackrel{d}{=} \sigma(F_\infty)$ for every finite permutation $\sigma$. That is, when the permutation $\sigma$ changes no indices above $N$, we require $F_N \stackrel{d}{=} \sigma(F_N)$, where $F_N$ is the restriction of $F_\infty$ to $[N]$. A characterization of distributions for $F_\infty$ is provided by Broderick, Pitman, and Jordan (2013), where a similar treatment of the introductory ideas of this section also appears. In what follows, we consider particular useful ways of representing distributions for exchangeable, consistent random feature allocations with emphasis on partitions as a special case.
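As a concrete illustration of the restriction operation used in the consistency condition above, the following sketch restricts a finite feature allocation to a smaller index set. The list-of-sets representation and function name are illustrative choices, not from the dissertation.

```python
def restrict(feature_allocation, M):
    """Restrict a feature allocation of [N] to [M], dropping emptied blocks.

    A feature allocation is represented here as a list of sets of indices
    (a list so that repeated blocks, a multiset, are preserved). This mirrors
    the definition f_M = {A ∩ [M] : A ∈ f_N}.
    """
    restricted = [block & set(range(1, M + 1)) for block in feature_allocation]
    return [block for block in restricted if block]

f6 = [{2, 3}, {2, 4, 6}, {3}, {3}, {3}]
print(restrict(f6, 3))  # [{2, 3}, {2}, {3}, {3}, {3}]
```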

2.3 Exchangeable probability functions

Once we know that we can construct (exchangeable and consistent) random partitions and feature allocations, it remains to find useful representations of distributions over these objects.

Exchangeable partition probability function

Consider first an exchangeable, consistent, random partition $(\Pi_n)$. By the exchangeability assumption, the distribution of the partition should depend only on the (unordered) sizes of the blocks. Therefore, there exists a function $p$ that is symmetric in its arguments such that, for any specific partition assignment $\pi_n = \{A_1, \ldots, A_K\}$, we have
\[ P(\Pi_n = \pi_n) = p(|A_1|, \ldots, |A_K|). \tag{2.2} \]
The function $p$ is called the exchangeable partition probability function (EPPF) (Pitman, 1995).

Example 2.3.1 (Chinese restaurant process). The Chinese restaurant process (CRP) (Blackwell and MacQueen, 1973) is an iterative description of a partition via the conditional distributions of the partition blocks to which increasing data indices belong. The Chinese restaurant metaphor forms an equivalence between customers entering a Chinese restaurant and data indices; customers who share a table at the restaurant represent indices belonging to the same partition block. To generate the label for the first index, the first customer enters the restaurant and sits down at some table, necessarily unoccupied since no one else is in the restaurant. A “dish” is set out at the new table; call the dish “1” since it is the first dish. The customer is assigned


Figure 2.1: The diagram represents a possible CRP seating arrangement after 11 customers have entered a restaurant with parameter θ. Each large white circle is a table, and the smaller gray circles are customers sitting at those tables. If a 12th customer enters, the expressions in the middle of each table give the probability of the new customer sitting there. In particular, the probability of the 12th customer sitting at the first table is 5/(11+θ), and the probability of the 12th customer forming a new table is θ/(11 + θ).

the label of the dish at her table: $Z_1 = 1$. Recursively, for a restaurant with concentration parameter $\theta$, the $n$th customer sits at an occupied table with probability in proportion to the number of people at the table and at a new table with probability proportional to $\theta$. In the former case, $Z_n$ takes the value of the existing dish at the table, and in the latter case, the next available dish $k$ (equal to the number of existing tables plus one) appears at the new table, and $Z_n = k$. By summing over all possibilities when the $n$th customer arrives, one obtains the normalizing constant for the distribution across potential occupied tables: $(n - 1 + \theta)^{-1}$. An example of the distribution over tables for the $n$th customer is shown in Figure 2.1. To summarize, if we let $K_n := \max\{Z_1, \ldots, Z_n\}$, then the distribution of table assignments for the $n$th customer is

\[
P(Z_n = k \mid Z_1, \ldots, Z_{n-1}) = (n - 1 + \theta)^{-1} \begin{cases} \#\{m : m < n, Z_m = k\} & \text{for } k \le K_{n-1} \\ \theta & \text{for } k = K_{n-1} + 1. \end{cases} \tag{2.3}
\]
We note that an equivalent generative description follows a Pólya urn style in specifying that each incoming customer sits next to an existing customer with probability proportional to 1 and forms a new table with probability proportional to $\theta$ (Hoppe, 1984). Next, we find the probability of the partition induced by considering the collection of indices sitting at each table as a block in the partition. Suppose that $N_k$ individuals sit at table $k$ so that the set of cardinalities of non-zero table occupancies is $\{N_1, \ldots, N_K\}$ with $N := \sum_{k=1}^K N_k$. That is, we are considering the case when $N$ customers have entered the restaurant and sat at $K$ different tables in the specified configuration. We can see from Eq. (2.3) that when the $n$th customer enters ($n > 1$), we obtain a factor of $n - 1 + \theta$ in the denominator. Using the following notation for the rising and falling factorial,
\[
x_{M \uparrow a} := \prod_{m=0}^{M-1} (x + ma), \qquad x_{M \downarrow a} := \prod_{m=0}^{M-1} (x - ma),
\]

we find a factor of $(\theta + 1)_{N-1 \uparrow 1}$ must occur in the denominator of the probability of the partition of $[N]$. Similarly, each time a customer forms a new table except for the first table, we obtain a factor of $\theta$ in the numerator. Combining these factors, we find a factor of $\theta^{K-1}$ in the numerator. Finally, each time a customer sits at an existing table with $n$ occupants, we obtain a factor of $n$ in the numerator. Thus, for each table $k$, we have a factor of $(N_k - 1)!$ once all customers have entered the restaurant. Having collected all terms in the process, we see that the probability of the resulting configuration is:
\[
P(\Pi_N = \pi_N) = \frac{\theta^{K-1} \prod_{k=1}^K (N_k - 1)!}{(\theta + 1)_{N-1 \uparrow 1}}. \tag{2.4}
\]
We first note that Eq. (2.4) depends only on the block sizes and not on the order of arrival of the customers or dishes at the tables. We conclude that the partition generated according to the CRP scheme is exchangeable. Moreover, as the partition $\Pi_M$ is the restriction of $\Pi_N$ to $[M]$ for any $N > M$ by construction, we have that Eq. (2.4) satisfies the consistency condition. It follows that Eq. (2.4) is, in fact, an EPPF. □
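The prediction rule (2.3) and the EPPF (2.4) are straightforward to code. The sketch below, with illustrative function names, samples table assignments from the CRP and evaluates the log EPPF of the induced block sizes.

```python
import numpy as np
from math import lgamma
from collections import Counter

def crp_sample(n, theta, rng):
    """Sample table assignments Z_1, ..., Z_n from the CRP prediction rule (2.3)."""
    z = [1]
    for i in range(2, n + 1):
        counts = Counter(z)
        tables = sorted(counts)
        probs = np.array([counts[t] for t in tables] + [theta], dtype=float)
        probs /= i - 1 + theta
        z.append(int(rng.choice(tables + [len(tables) + 1], p=probs)))
    return z

def crp_log_eppf(block_sizes, theta):
    """Log probability of a partition with the given block sizes, Eq. (2.4)."""
    K, N = len(block_sizes), sum(block_sizes)
    log_num = (K - 1) * np.log(theta) + sum(lgamma(nk) for nk in block_sizes)
    log_den = lgamma(theta + N) - lgamma(theta + 1)  # log (theta+1)_{N-1 up 1}
    return log_num - log_den

rng = np.random.default_rng(0)
z = crp_sample(10, theta=1.5, rng=rng)
sizes = list(Counter(z).values())
print(z, np.exp(crp_log_eppf(sizes, 1.5)))
```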

Exchangeable feature probability function

Just as we considered an exchangeable, consistent, random partition above, so we now turn to an exchangeable, consistent, random feature allocation $(F_n)$. Let $f_N = \{A_1, \ldots, A_K\}$ be any particular feature allocation. In calculating $P(F_N = f_N)$, we start by demonstrating in the next example that this probability in some sense undercounts features when they contain exactly the same indices: e.g., $A_j = A_k$ for some $j \neq k$. For instance, consider the following example.

Example 2.3.2 (A two-block, Bernoulli feature allocation). Let $q_A, q_B \in (0, 1)$ represent the frequencies of features $A$ and $B$. Draw $Z_{A,n} \stackrel{iid}{\sim} \mathrm{Bern}(q_A)$ and $Z_{B,n} \stackrel{iid}{\sim} \mathrm{Bern}(q_B)$, independently. Construct the random feature allocation by collecting those indices with successful draws:
\[ F_N := \{\{n : n \le N, Z_{A,n} = 1\}, \{n : n \le N, Z_{B,n} = 1\}\}. \]
Then the probability of the feature allocation $F_5 = f_5 := \{\{2,3\}, \{2,3\}\}$ is
\[ q_A^2 (1 - q_A)^3 q_B^2 (1 - q_B)^3, \]
but the probability of the feature allocation $F_5 = f_5' := \{\{2,3\}, \{2,5\}\}$ is
\[ 2 q_A^2 (1 - q_A)^3 q_B^2 (1 - q_B)^3. \]
The difference is that in the latter case the features can be distinguished, and so we must account for the two possible pairings of features to frequencies $\{q_A, q_B\}$. Now, instead, let $\tilde F_N$ be $F_N$ with a uniform random ordering on the features. There is just a single possible ordering of $f_5$, so the probability of $\tilde F_5 = \tilde f_5 := (\{2,3\}, \{2,3\})$ is again
\[ q_A^2 (1 - q_A)^3 q_B^2 (1 - q_B)^3. \]

However, there are two orderings of $f_5'$, so the probability of $\tilde F_5 = \tilde f_5' := (\{2,5\}, \{2,3\})$ is
\[ q_A^2 (1 - q_A)^3 q_B^2 (1 - q_B)^3, \]
and the same holds for the other ordering. □

For reasons suggested by the previous example, we will find it useful to work with the random feature allocation after uniform random ordering, $\tilde F_N$. One way to achieve such an ordering and maintain consistency across different $N$ is to associate some independent, continuous random variable with each feature; e.g., assign a uniform random variable on $[0,1]$ to each feature and order the features according to the order of the assigned random variables. When we view feature allocations constructed as marginals of a subordinator in Section 2.5, we will see that this construction is natural. In general, given a probability of a random feature allocation, $P(F_N = f_N)$, we can find the probability of a random ordered feature allocation, $P(\tilde F_N = \tilde f_N)$, as follows. Let $H$ be the number of unique elements of $\tilde F_N$, and let $(\tilde K_1, \ldots, \tilde K_H)$ be the multiplicities of these unique elements in decreasing size. Then

\[
P(\tilde F_N = \tilde f_N) = \binom{K}{\tilde K_1, \ldots, \tilde K_H}^{-1} P(F_N = f_N), \tag{2.5}
\]
where
\[
\binom{K}{\tilde K_1, \ldots, \tilde K_H} := \frac{K!}{\tilde K_1! \cdots \tilde K_H!}.
\]
We will see in Section 2.5 that augmentation of an exchangeable partition with a random ordering is also natural. However, the probability of an ordered random partition is not substantively different from the probability of an unordered version since the factor contributed by ordering a partition is always $1/K!$, where $K$ here is the number of partition blocks. With this framework in place, we can see that some ordered feature allocations have a probability function $p$ nearly as in Eq. (2.2) that is, moreover, symmetric in its block-size arguments. Consider again the previous example.

Example 2.3.3 (A two-block, Bernoulli feature allocation (continued)). Consider any $F_N$ with block sizes $N_1$ and $N_2$ constructed as in Example 2.3.2. Then
\[
\begin{aligned}
P(\tilde F_N = \tilde f_N) &= \frac{1}{2} q_A^{N_1} (1 - q_A)^{N - N_1} q_B^{N_2} (1 - q_B)^{N - N_2} + \frac{1}{2} q_A^{N_2} (1 - q_A)^{N - N_2} q_B^{N_1} (1 - q_B)^{N - N_1} \\
&= p(N, N_1, N_2), \tag{2.6}
\end{aligned}
\]
where $p$ is some function of the number of indices $N$ and the block sizes $(N_1, N_2)$ that we note is symmetric in all arguments after the first. In particular, we see that the order of $N_1$ and $N_2$ was immaterial. □

We note that in the partition case, $\sum_{k=1}^K |A_k| = N$, so $N$ is implicitly an argument to the EPPF. In the feature case, this summation condition no longer holds, so we make the argument $N$ explicit in Eq. (2.6). However, it is not necessarily the case that such a function, much less a symmetric one, exists for exchangeable feature models—in contrast to the case of exchangeable partitions and the EPPF.
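For concreteness, the symmetric function $p$ of Eq. (2.6) can be written down directly; the sketch below (with illustrative names) checks symmetry in the block-size arguments.

```python
def two_block_efpf(N, N1, N2, qA, qB):
    """Probability function p(N, N1, N2) of Eq. (2.6) for the two-feature Bernoulli model.

    Averaging over the two pairings of blocks to frequencies (qA, qB)
    makes the function symmetric in the block sizes N1 and N2.
    """
    term = lambda a, b: (qA**a * (1 - qA)**(N - a)) * (qB**b * (1 - qB)**(N - b))
    return 0.5 * term(N1, N2) + 0.5 * term(N2, N1)

# Symmetry in the block-size arguments:
print(two_block_efpf(5, 2, 3, 0.4, 0.7) == two_block_efpf(5, 3, 2, 0.4, 0.7))
```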

Example 2.3.4 (A general two-block feature allocation). We here describe an exchangeable, consistent random feature allocation whose (ordered) distribution does not depend only on the number of indices $N$ and the sizes of the blocks of the feature allocation. Let $p_1, p_2, p_3, p_4$ be fixed frequencies that sum to one. Let $Y_n$ represent the collection of features to which index $n$ belongs. For $n \in \{1, 2\}$, choose $Y_n$ independently and identically according to:
\[
Y_n = \begin{cases} \{A\} & \text{with probability } p_1 \\ \{B\} & \text{with probability } p_2 \\ \{A, B\} & \text{with probability } p_3 \\ \emptyset & \text{with probability } p_4. \end{cases}
\]
We form a feature allocation from these labels as follows. For each label ($A$ or $B$), collect those indices $n$ with the given label appearing in $Y_n$ to form a feature. Now consider two possible feature allocations: $f_2 = \{\{2\}, \{2\}\}$ and $f_2' = \{\{1\}, \{2\}\}$. The likelihood of any random ordering $\tilde f_2$ of $f_2$ under this model is

˜0 0 The likelihood of any ordering f2 of f is ˜ ˜0 1 1 0 0 P(F2 = f2) = p1 p2 p3 p4.

It follows from these two likelihoods that we can choose values of $p_1, p_2, p_3, p_4$ such that $P(\tilde F_2 = \tilde f_2) \neq P(\tilde F_2 = \tilde f_2')$. But $\tilde f_2$ and $\tilde f_2'$ have the same block counts and $N$ value ($N = 2$). So there can be no such symmetric function $p$, as in Eq. (2.6), for this model. □

When a function $p$ exists in the form
\[
P(\tilde F_N = \tilde f_N) = p(N, |A_1|, \ldots, |A_K|) \tag{2.7}
\]
for some random ordered feature allocation $\tilde f_N = (A_1, \ldots, A_K)$ such that $p$ is symmetric in all arguments after the first, we call it the exchangeable feature probability function (EFPF). Note that the EPPF is not a special case of the EFPF. The EPPF assigns zero probability to any multiset in which an index occurs in more than one element of the multiset; only the sizes of the multiset blocks are relevant in the EFPF case. We next consider a more complex example of an EFPF.


Figure 2.2: Illustration of an Indian buffet process. The buffet (top) consists of a vector of dishes, corresponding to features. Each customer—corresponding to a data point—who enters first decides whether or not to eat dishes that the other customers have already sampled and then tries a random number of new dishes, not previously sampled by any customer. A gray box in position (n, k) indicates customer n has sampled dish k, and a white box indicates the customer has not sampled the dish. In the example, the second customer has sampled exactly those dishes indexed by 2, 4, and 5: Y2 = 2, 4, 5 . { }

Example 2.3.5 (Indian buffet process). The Indian buffet process (IBP) (Griffiths and Ghahramani, 2006) is a generative model for a random feature allocation that is specified recursively like the Chinese restaurant process. Also like the CRP, this culinary metaphor forms an equivalence between customers and the indices $n$ that will be partitioned: $n \in \mathbb{N}$. Here, “dishes” again correspond to feature labels just as they corresponded to partition labels for the CRP. But in the IBP case, a customer can sample multiple dishes. In particular, we start with a single customer, who enters the buffet and chooses $K_1^+ \sim \mathrm{Poisson}(\gamma)$ dishes. Here, $\gamma > 0$ is called the mass parameter, and we will also see the concentration parameter $\theta > 0$ below. None of the dishes have been sampled by any other customers since no other customers have yet entered the restaurant. We label the dishes $1, \ldots, K_1^+$ if $K_1^+ > 0$. Recursively, the $n$th customer chooses which dishes to sample in two parts. First, for each dish $k$ that has previously been sampled by any customer in $1, \ldots, n-1$, customer $n$ samples dish $k$ with probability $N_{n-1,k}/(\theta + n - 1)$, for $N_{n,k}$ equal to the number of customers indexed $1, \ldots, n$ who have tried dish $k$. As each dish represents a feature, and sampling a dish represents that the customer index $n$ belongs to that feature, $N_{n,k}$ is the size of the block of the feature labeled $k$ in the feature allocation of $[n]$. Next, customer $n$ chooses $K_n^+ \sim \mathrm{Poisson}(\theta\gamma/(\theta + n - 1))$ new dishes to try. If $K_n^+ > 0$, then the dishes receive unique labels $K_{n-1} + 1, \ldots, K_n$. Here, $K_n$ represents the number of sampled dishes after $n$ customers: $K_n = K_{n-1} + K_n^+$. An example of the first few steps in the Indian buffet process is shown in Figure 2.2. With this generative model in hand, we can find the probability of a particular feature allocation. We discover its form by enumeration as for the CRP EPPF in Example 2.3.1. At each round $n$, we have a Poisson number of new features, $K_n^+$, represented. The probability factor associated with these choices is a product of Poisson densities:

\[
\prod_{n=1}^N \frac{1}{K_n^+!} \left( \frac{\theta\gamma}{\theta + n - 1} \right)^{K_n^+} \exp\left( -\frac{\theta\gamma}{\theta + n - 1} \right).
\]
Let $M_k$ be the round on which the $k$th dish, in order of appearance, is first chosen. Then the denominators for future dish choice are the factors in the product $(\theta + M_k)(\theta + M_k + 1) \cdots (\theta + N - 1)$. The numerators for the times when the dish is chosen are the factors in the product $1 \cdot 2 \cdots (N_{N,k} - 1)$. The numerators for the times when the dish is not chosen yield $(\theta + M_k - 1) \cdots (\theta + N - 1 - N_{N,k})$. Let $A_{n,k}$ represent the collection of indices in the feature with label $k$ after $n$ customers have entered the restaurant. Then $N_{n,k} = |A_{n,k}|$. Finally, let $\tilde K_1, \ldots, \tilde K_H$ be the multiplicities of unique feature blocks formed by this model. We note that there are
\[
\left[ \prod_{n=1}^N K_n^+! \right] \Big/ \left[ \prod_{h=1}^H \tilde K_h! \right]
\]
rearrangements of the features generated by this process that all yield the same feature allocation. Since they all have the same generating probability, we simply multiply by this factor to find the feature allocation probability. Multiplying all factors together and taking $f_N = \{A_{N,1}, \ldots, A_{N,K_N}\}$ yields
\[
\begin{aligned}
P(F_N = f_N) &= \frac{\prod_{n=1}^N K_n^+!}{\prod_{h=1}^H \tilde K_h!} \cdot \left[ \prod_{n=1}^N \frac{1}{K_n^+!} \left( \frac{\theta\gamma}{\theta + n - 1} \right)^{K_n^+} \exp\left( -\frac{\theta\gamma}{\theta + n - 1} \right) \right] \\
&\qquad \cdot \left[ \prod_{k=1}^{K_N} \Gamma(N_{N,k}) \, \frac{\Gamma(\theta + N - N_{N,k})}{\Gamma(\theta + N)} \cdot \frac{\Gamma(\theta + M_k)}{\Gamma(\theta + M_k - 1)} \right] \\
&= \left( \prod_{h=1}^H \tilde K_h! \right)^{-1} (\theta\gamma)^{K_N} \exp\left( -\theta\gamma \sum_{n=1}^N (\theta + n - 1)^{-1} \right) \cdot \frac{\prod_{k=1}^{K_N} (\theta + M_k - 1)}{\prod_{n=1}^N (\theta + n - 1)^{K_n^+}} \\
&\qquad \cdot \left[ \prod_{k=1}^{K_N} \frac{\Gamma(N_{N,k}) \, \Gamma(\theta + N - N_{N,k})}{\Gamma(\theta + N)} \right] \\
&= \left( \prod_{h=1}^H \tilde K_h! \right)^{-1} (\theta\gamma)^{K_N} \exp\left( -\theta\gamma \sum_{n=1}^N (\theta + n - 1)^{-1} \right) \prod_{k=1}^{K_N} \frac{\Gamma(N_{N,k}) \, \Gamma(N - N_{N,k} + \theta)}{\Gamma(N + \theta)}.
\end{aligned}
\]
It follows from Eq. (2.5) that the probability of a uniform random ordering of the feature allocation is
\[
P(\tilde F_N = \tilde f_N) = \frac{1}{K_N!} (\theta\gamma)^{K_N} \exp\left( -\theta\gamma \sum_{n=1}^N (\theta + n - 1)^{-1} \right) \prod_{k=1}^{K_N} \frac{\Gamma(N_{N,k}) \, \Gamma(N - N_{N,k} + \theta)}{\Gamma(N + \theta)}. \tag{2.8}
\]

The distribution of $\tilde F_N$ has no dependence on the ordering of the indices in $[N]$. Hence, the distribution of $F_N$ depends only on the same quantities—the number of indices and the feature block sizes—and the feature multiplicities. So we see that the IBP construction yields an exchangeable random feature allocation. Consistency follows from the recursive construction and exchangeability. Therefore, Eq. (2.8) is seen to be in EFPF form (cf. Eq. (2.7)). □

Above, we have seen two examples of how specifying a conditional distribution for the block membership of index $n$ given the block membership of indices in $[n-1]$ yields an exchangeable probability function: e.g., the EPPF in the CRP case (Example 2.3.1) and the EFPF in the IBP case (Example 2.3.5). This conditional distribution is often called a prediction rule, and study of the prediction rule in the clustering case may be referred to as species sampling (Pitman, 1996; Hansen and Pitman, 1998; Lee et al., 2008). We will see next that the prediction rule can conversely be recovered from the exchangeable probability function specification and therefore the two are equivalent.
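The IBP generative scheme of Example 2.3.5 is also simple to simulate. The sketch below uses illustrative names and a list-of-sets representation (not code from the dissertation) and returns the sampled feature blocks.

```python
import numpy as np

def ibp_sample(n_customers, gamma, theta, seed=0):
    """Sample a feature allocation from the IBP of Example 2.3.5.

    Returns a list of sets; set k holds the customers who sampled dish k.
    gamma is the mass parameter and theta the concentration parameter.
    """
    rng = np.random.default_rng(seed)
    dishes = []  # dishes[k] = set of customers who have sampled dish k
    for n in range(1, n_customers + 1):
        # Sample previously seen dishes with probability N_{n-1,k} / (theta + n - 1).
        for block in dishes:
            if rng.uniform() < len(block) / (theta + n - 1):
                block.add(n)
        # Try a Poisson number of brand-new dishes.
        for _ in range(rng.poisson(theta * gamma / (theta + n - 1))):
            dishes.append({n})
    return dishes

print(ibp_sample(5, gamma=2.0, theta=1.0))
```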

Induced allocations and block labeling

In Examples 2.3.1 and 2.3.5 above, we formed partitions and feature allocations in the following way. For partitions, we assigned labels $Z_n$ to each index $n$. Then we generated a partition of $[N]$ from the sequence $(Z_n)_{n=1}^N$ by saying that indices $m$ and $n$ are in the same partition block ($m \sim n$) if and only if $Z_n = Z_m$. The resulting partition is called the induced partition given the labels $(Z_n)_{n=1}^N$. Similarly, given labels $(Z_n)_{n=1}^\infty$, we can form an induced partition of $\mathbb{N}$. It is easy to check that, given a sequence $(Z_n)_{n=1}^\infty$, the induced partitions of the subsequences $(Z_n)_{n=1}^N$ will be consistent. In the feature case, we first assigned label collections $Y_n$ to each index $n$. $Y_n$ is interpreted as a set containing the labels of the features to which $n$ belongs. It must have finite cardinality by our definition of a feature allocation. In this case, we generate a feature allocation on $[N]$ from the sequence $(Y_n)_{n=1}^N$ by first letting $\{\phi_k\}_{k=1}^K$ be the set of unique values in $\bigcup_{n=1}^N Y_n$. Then the features are the collections of indices with shared labels: $f_N = \{\{n : \phi_k \in Y_n\} : k = 1, \ldots, K\}$. The resulting feature allocation $f_N$ is called the induced feature allocation given the labels $(Y_n)_{n=1}^N$. Similarly, given label collections $(Y_n)_{n=1}^\infty$, where each $Y_n$ has finite cardinality, we can form an induced feature allocation of $\mathbb{N}$. As in the partition case, given a sequence $(Y_n)_{n=1}^\infty$, we can see that the induced feature allocations of the subsequences $(Y_n)_{n=1}^N$ will be consistent. In reducing to a partition or feature allocation from a set of labels, we shed the information concerning the labels for each partition block or feature. Conversely, we introduce order-of-appearance labeling schemes to give partition blocks or features labels when we have, respectively, a partition or feature allocation.
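The induced constructions just described amount to grouping indices by shared labels. A minimal sketch follows, with an illustrative representation (labels as a list, label collections as sets) that is not prescribed by the dissertation.

```python
def induced_partition(labels):
    """Induced partition of [N] from labels Z_1, ..., Z_N (indices 1-based)."""
    blocks = {}
    for n, z in enumerate(labels, start=1):
        blocks.setdefault(z, set()).add(n)
    return list(blocks.values())

def induced_feature_allocation(label_sets):
    """Induced feature allocation of [N] from label collections Y_1, ..., Y_N."""
    features = {}
    for n, ys in enumerate(label_sets, start=1):
        for y in ys:
            features.setdefault(y, set()).add(n)
    return list(features.values())

print(induced_partition([1, 2, 1, 3]))                         # [{1, 3}, {2}, {4}]
print(induced_feature_allocation([{'A'}, {'A', 'B'}, set()]))  # [{1, 2}, {2}]
```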

In the partition case, the order-of-appearance labeling scheme assigns the label 1 to the partition block containing index 1. Recursively, suppose we have seen $n$ indices in $k$ different blocks with labels $\{1, \ldots, k\}$. And suppose the $n+1$st index does not belong to an existing block. Then we assign its block the label $k+1$. In the feature allocation case, we note that index 1 belongs to $K_1^+$ features. If $K_1^+ = 0$, there are no features to label yet. If $K_1^+ > 0$, we assign these $K_1^+$ features labels in $\{1, \ldots, K_1^+\}$. Unless otherwise specified, we suppose that the labels are chosen uniformly at random. Let $K_1 = K_1^+$. Recursively, suppose we have seen $n$ indices and $K_n$ different features with labels $\{1, \ldots, K_n\}$. Suppose the $n+1$st index belongs to $K_{n+1}^+$ features that have not yet been labeled. Let $K_{n+1} = K_n + K_{n+1}^+$. If $K_{n+1}^+ = 0$, there are no new features to label. If $K_{n+1}^+ > 0$, assign these $K_{n+1}^+$ features labels in $\{K_n + 1, \ldots, K_{n+1}\}$, e.g., uniformly at random. We can use these labeling schemes to find the prediction rule, which makes use of partition block and feature labels, from the EPPF or EFPF as appropriate. First, consider a partition with EPPF $p$. Then, given labels $(Z_n)_{n=1}^N$ with $K_N = \max\{Z_1, \ldots, Z_N\}$, we wish to find the distribution of the label $Z_{N+1}$. Using an order-of-appearance labeling, we know that either

$Z_{N+1} \in \{Z_1, \ldots, Z_N\}$ or $Z_{N+1} = K_N + 1$. Let $\pi_N = \{A_{N,1}, \ldots, A_{N,K_N}\}$ be the partition induced by $(Z_n)_{n=1}^N$. Let $N_{N,k} = |A_{N,k}|$. Let $\mathbb{1}(A)$ be the indicator of event $A$; that is, $\mathbb{1}(A)$ equals 1 if $A$ holds and 0 otherwise. Let $N_{N+1,k} = N_{N,k} + \mathbb{1}\{Z_{N+1} = k\}$ for $k = 1, \ldots, K_{N+1}$, and set $N_{N,K_N+1} = 0$ for completeness. $K_{N+1} = K_N + \mathbb{1}\{Z_{N+1} > K_N\}$ is the number of partition blocks in the partition of $[N+1]$. Then the conditional distribution satisfies

    P(Z_{N+1} = z | Z_1, ..., Z_N) = P(Z_1, ..., Z_N, Z_{N+1} = z) / P(Z_1, ..., Z_N).

But the probability of a certain labeling is just the probability of the underlying partition in this construction, so

    P(Z_{N+1} = z | Z_1, ..., Z_N) = p(N_{N+1,1}, ..., N_{N+1,K_{N+1}}) / p(N_{N,1}, ..., N_{N,K_N}).

Example 2.3.6 (Chinese restaurant process). We continue our Chinese restaurant process example by deriving the Chinese restaurant table assignment scheme from the EPPF in Eq. (2.4). Substituting in the EPPF for the CRP, we find

    P(Z_{N+1} = z | Z_1, ..., Z_N)
        = p(N_{N+1,1}, ..., N_{N+1,K_{N+1}}) / p(N_{N,1}, ..., N_{N,K_N})
        = [θ^{K_{N+1}−1} ∏_{k=1}^{K_{N+1}} (N_{N+1,k} − 1)! / (θ + 1)_{(N+1)−1↑1}] / [θ^{K_N−1} ∏_{k=1}^{K_N} (N_{N,k} − 1)! / (θ + 1)_{N−1↑1}]
        = (N + θ)^{−1} · { N_{N,k}  for z = k ≤ K_N;   θ  for z = K_N + 1 },          (2.9)

just as in Eq. (2.3).

To find the feature allocation prediction rule, we now imagine a feature allocation with EFPF p. Here we must be slightly more careful about counting due to feature multiplicities. Suppose that after N indices have been seen, we have label collections (Y_n)_{n=1}^N, containing a total of K_N features, labeled {1, ..., K_N}. We wish to find the distribution of Y_{N+1}. Suppose N + 1 belongs to K_{N+1}^+ features that do not contain any index in [N]. Using an order-of-appearance labeling, we know that, if K_{N+1}^+ > 0, the K_{N+1}^+ new features have labels K_N + 1, ..., K_N + K_{N+1}^+. Let f_N = {A_{N,1}, ..., A_{N,K_N}} be the feature allocation induced by (Y_n)_{n=1}^N. Let N_{N,k} = |A_{N,k}| be the size of the kth feature. So N_{N+1,k} = N_{N,k} + 1{k ∈ Y_{N+1}}, where we let N_{N,K_N+j} = 0 for all of the features that are first exhibited by index N + 1: j ∈ {1, ..., K_{N+1}^+}. Further, let the number of features, including new ones, be written K_{N+1} = K_N + K_{N+1}^+. Then the conditional distribution satisfies

    P(Y_{N+1} = y | Y_1, ..., Y_N) = P(Y_1, ..., Y_N, Y_{N+1} = y) / P(Y_1, ..., Y_N).

As we assume that the labels Y are consistent across N, the probability of a certain labeling is just the probability of the underlying ordered feature allocation times a combinatorial term. The combinatorial term accounts first for the uniform ordering of the new features amongst themselves for labeling and then for the uniform ordering of the new features amongst the old features in the overall uniform random ordering.

    P(Y_{N+1} = y | Y_1, ..., Y_N)
        = (1 / K_{N+1}^+!) · [(K_N + 1)(K_N + 2) ··· K_{N+1}] · p(N + 1, N_{N+1,1}, ..., N_{N+1,K_{N+1}}) / p(N, N_{N,1}, ..., N_{N,K_N})
        = (1 / K_{N+1}^+!) · (K_{N+1}! / K_N!) · p(N + 1, N_{N+1,1}, ..., N_{N+1,K_{N+1}}) / p(N, N_{N,1}, ..., N_{N,K_N}).          (2.10)

Example 2.3.7 (Indian buffet process). Just as we derived the Chinese restaurant process prediction rule (Eq. (2.9)) from its EPPF (Eq. (2.4)) in Example 2.3.6, so can we derive the Indian buffet process prediction rule from its EFPF (Eq. (2.8)) by using Eq. (2.10). Substituting the IBP EFPF into Eq. (2.10), we find

    P(Y_{N+1} = y | Y_1, ..., Y_N)
        = (1 / K_{N+1}^+!) · (K_{N+1}! / K_N!)
          · [ (θγ)^{K_{N+1}} / K_{N+1}! · exp(−θγ Σ_{n=1}^{N+1} (θ + n − 1)^{−1}) · ∏_{k=1}^{K_{N+1}} Γ(N_{N+1,k}) Γ((N+1) − N_{N+1,k} + θ) / Γ((N+1) + θ) ]
          / [ (θγ)^{K_N} / K_N! · exp(−θγ Σ_{n=1}^{N} (θ + n − 1)^{−1}) · ∏_{k=1}^{K_N} Γ(N_{N,k}) Γ(N − N_{N,k} + θ) / Γ(N + θ) ]
        = (1 / K_{N+1}^+!) · exp(−θγ / (θ + (N+1) − 1)) · (θγ / (θ + (N+1) − 1))^{K_{N+1}^+}
          · ∏_{k=1}^{K_N} (N_{N,k} / (N + θ))^{1{k ∈ y}} ((N − N_{N,k} + θ) / (N + θ))^{1{k ∉ y}}
        = Poisson(K_{N+1}^+ | θγ / (θ + (N+1) − 1)) · ∏_{k=1}^{K_N} Bern(1{k ∈ y} | N_{N,k} / (N + θ)).

The final line is exactly the Poisson distribution for the number of new features times the Bernoulli distributions for the draws of existing features, as described in Example 2.3.5.
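Both prediction rules are straightforward to simulate. The following sketch (Python with numpy; the function names and seeds are illustrative assumptions, not part of the original text) draws a partition of [N] from the CRP rule in Eq. (2.9) and a feature allocation from the IBP rule of Example 2.3.5, using order-of-appearance labels.

```python
import numpy as np

def sample_crp(N, theta, seed=0):
    """Draw cluster labels Z_1, ..., Z_N from the CRP prediction rule, Eq. (2.9)."""
    rng = np.random.default_rng(seed)
    labels, counts = [], []              # counts[k-1] = N_{n,k}, order-of-appearance labels
    for n in range(N):
        probs = np.array(counts + [theta]) / (n + theta)
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):             # new block gets label K_n + 1
            counts.append(0)
        counts[k] += 1
        labels.append(k + 1)
    return labels

def sample_ibp(N, gamma, theta, seed=0):
    """Draw feature label sets Y_1, ..., Y_N from the IBP prediction rule of Example 2.3.5."""
    rng = np.random.default_rng(seed)
    counts, Y = [], []                   # counts[k-1] = N_{n-1,k}
    for n in range(1, N + 1):
        yn = [k + 1 for k, c in enumerate(counts) if rng.random() < c / (theta + n - 1)]
        for k in yn:
            counts[k - 1] += 1           # existing features chosen by index n
        for _ in range(rng.poisson(gamma * theta / (theta + n - 1))):
            counts.append(1)             # new features, each containing index n
            yn.append(len(counts))
        Y.append(set(yn))
    return Y

print(sample_crp(10, theta=1.0))
print(sample_ibp(10, gamma=2.0, theta=1.0))
```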

Inference

The prediction rule formulation of the EPPF or EFPF is particularly useful in providing a means of inferring partitions and feature allocations from a data set. In particular, we assume that we have data points X_1, ..., X_N generated in the following manner. In the partition case, we generate an exchangeable, consistent, random partition Π_N according to the distribution specified by some EPPF p. Next, we assign each partition block a random parameter that characterizes that block. To be precise, for the kth partition block to appear according to an order-of-appearance labeling scheme, give this block a new random label φ_k ∼ H, for some continuous distribution H. For each n, let Z_n = φ_k where k is the order-of-appearance label of index n. Finally, let

    X_n ∼indep L(Z_n),          (2.11)

for some distribution L with parameter Z_n. The choices of both H and L are specific to the problem domain.

Without attempting to survey the vast literature on clustering, we describe a stylized example to provide intuition for the preceding generative model. In this example, let n index an animal observed in the wild; Z_n = Z_m indicates that animals n and m belong to the same (latent, unobserved) species; Z_n = Z_m = φ_k is a vector describing the (latent, unobserved) height and weight for that species; and X_n is the observed height and weight of the nth animal. X_n need not even be directly observed, but Eq. (2.11) together with an EPPF might be part of a larger generative model. In a generalization of the previous stylized example, Z_n indicates the dominant species in the nth geographical region; Z_n = φ_k indicates some overall species height and weight parameters (for the kth species); X_n indicates the height and weight parameters for species k in the nth region. That is, the height and weight for the species may vary by region. We measure and observe the height and weight (E_{n,j})_{j=1}^J of some J animals in the nth region, believed to be iid draws from a distribution depending on X_n.
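To make Eq. (2.11) concrete, here is a sketch of one instance of this generative model under illustrative assumptions (not the only possible choices): the CRP as the EPPF, H a normal distribution on a block's mean, and L a unit-variance normal likelihood around that mean.

```python
import numpy as np

def generate_data(N, theta=1.0, seed=0):
    """Partition from the CRP; block parameters phi_k ~ H = Normal(0, 10^2); X_n ~ Normal(Z_n, 1)."""
    rng = np.random.default_rng(seed)
    counts, phi, Z, X = [], [], [], []
    for n in range(N):
        probs = np.array(counts + [theta]) / (n + theta)   # CRP prediction rule
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):                               # new block: draw its parameter from H
            counts.append(0)
            phi.append(rng.normal(0.0, 10.0))
        counts[k] += 1
        Z.append(phi[k])                                   # Z_n = phi_k for n's block
        X.append(rng.normal(Z[-1], 1.0))                   # Eq. (2.11) with L = Normal(Z_n, 1)
    return np.array(Z), np.array(X)

Z, X = generate_data(20)
print(np.unique(Z).size, "blocks among 20 data points")
```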

Note that the sequence (Z_n)_{n=1}^N is sufficient to describe the partition Π_N since Π_N is the collection of blocks of [N] with the same label values Z_n. The continuity of H is necessary to guarantee the a.s. uniqueness of the block values. So, if we can describe the posterior distribution of (Z_n)_{n=1}^N, we can in principle describe the posterior distribution of Π_N.

The posterior distribution of (Z_n)_{n=1}^N conditional on (X_n)_{n=1}^N cannot typically be solved for in closed form, so we turn to a method that approximates this posterior. We will see that prediction rules facilitate the design of a Markov Chain Monte Carlo (MCMC) sampler, in which we approximate the desired posterior distribution by a Markov chain of random samples proven to have the true posterior as its equilibrium distribution.

In the Gibbs sampler formulation of MCMC (S. Geman and D. Geman, 1984), we sample each parameter in turn and conditional on all other parameters in the model. In our case, we will sequentially sample each element of (Z_n)_{n=1}^N. The key observation here is that (Z_n)_{n=1}^N is an exchangeable sequence. This observation follows by noting that the partition is exchangeable by assumption, and the sequence (φ_k) is exchangeable since it is iid; (Z_n) is an exchangeable sequence since it is a function of (Π_n) and (φ_k). Therefore, the distribution of Z_n given the remaining elements Z_{−n} := (Z_1, ..., Z_{n−1}, Z_{n+1}, ..., Z_N) is the same as if we thought of Z_n as the final, Nth element in a sequence with N − 1 preceding values given by Z_{−n}. And the distribution of Z_N given Z_{−N} is provided by the prediction rule. The full details of the Gibbs sampler for the CRP in Examples 2.3.1 and 2.3.6 were introduced by Escobar (1994); S. N. MacEachern (1994); Escobar and West (1995) and are covered in fuller generality by Neal (2000).

It is worth noting that the sequence of order-of-appearance labels is not exchangeable; for instance, the first label is always 1. However, the prediction rule for Z_N given (Z_1, ..., Z_{N−1}) breaks into two parts: (1) the probability of Z_N taking a value either in {Z_1, ..., Z_{N−1}} or a new value and (2) the distribution of Z_N when it takes a new value. When programming such a sampler, it is often useful to simply encode the sets of unique values, which may be done by retaining any set of labels that induce the correct partition (e.g. integer labels) and separately retaining the set of unique parameter values. Indeed, updating the parameter values and partition block assignments separately can lead to improved mixing of the sampler (S. N. MacEachern, 1994).

Similarly, in the feature case, we imagine the following generative model for our data. First, let F_N be a random feature allocation generated according to the EFPF p. For the kth feature block in an order-of-appearance labeling scheme, assign a random label φ_k ∼ H to this block for some continuous distribution H. For each n, let Y_n = {φ_k : k ∈ J_n}, where J_n is here the set of order-of-appearance labels of the features to which n belongs. Finally, as above,

    X_n ∼indep L(Y_n),

where the likelihood L and parameter distribution H are again application-specific and where L now depends on the variable-size collection of parameters in Y_n.

Griffiths and Ghahramani (2011) provide a review of likelihoods used in practice for feature models. To motivate some of these modeling choices, let us consider some stylized examples that provide helpful intuition.
For example, let n index customers at a book-selling website; φ_k describes a book topic such as economics, modern art, or science fiction. If φ_k describes science fiction books, φ_k ∈ Y_n indicates that the nth customer likes to buy science fiction books. But Y_n might have cardinality greater than one (the customer is interested in multiple book topics) or cardinality zero (the customer never buys books). Finally, X_n is a set of book sales for customer n on the book-selling site.

As a second example, let n index pictures in a database; φ_k describes a pictorial element such as a train or grass or a cow; φ_k ∈ Y_n indicates that picture n contains, e.g., a train; finally, the observed array of pixels X_n that form the picture is generated to contain the pictorial elements in Y_n. As in the clustering case, X_n might not even be directly observed but might serve as a random effect in a deeper hierarchical model.

We observe that although the order-of-appearance label sets are not exchangeable, the sequence (Y_n) is. This fact allows the formulation of a Gibbs sampler via the observation that the distribution of Y_n given the remaining elements Y_{−n} := (Y_1, ..., Y_{n−1}, Y_{n+1}, ..., Y_N) is the same as if we thought of Y_n as the final, Nth element in a sequence with N − 1 preceding values given by Y_{−n}. The full details of such a sampler for the case of the IBP (Examples 2.3.5 and 2.3.7) are given by Griffiths and Ghahramani (2006). As in the partition case, in practice when programming the sampler, it is useful to separate the feature allocation encoding from the feature parameter values. Griffiths and Ghahramani (2006) describe how left order form matrices give a convenient representation of the feature allocation in this context.
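As a concrete illustration of the Gibbs step described above (a minimal sketch, not the samplers of the references just cited), the following code runs collapsed Gibbs sampling for a CRP mixture of one-dimensional Gaussians, assuming a conjugate normal prior on the cluster means and a known unit observation variance; the hyperparameter values and helper names are illustrative.

```python
import numpy as np
from collections import defaultdict

def crp_gibbs(x, theta=1.0, mu0=0.0, tau2=4.0, sigma2=1.0, iters=50, seed=0):
    """Collapsed Gibbs for a CRP mixture of Gaussians (cluster means integrated out)."""
    rng = np.random.default_rng(seed)
    N = len(x)
    z = np.zeros(N, dtype=int)                   # start with all points in one block

    def predictive(xn, members):
        """Marginal density of xn given the points currently in a block (empty list = new block)."""
        m = len(members)
        prec = 1.0 / tau2 + m / sigma2
        mean = (mu0 / tau2 + np.sum(members) / sigma2) / prec
        var = sigma2 + 1.0 / prec
        return np.exp(-0.5 * (xn - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

    for _ in range(iters):
        for n in range(N):
            # By exchangeability, treat x[n] as the last observation in the sequence.
            blocks = defaultdict(list)
            for m in range(N):
                if m != n:
                    blocks[z[m]].append(x[m])
            labels = list(blocks)
            # Prediction rule: existing block k with weight N_k * predictive,
            # new block with weight theta * prior predictive.
            w = [len(blocks[k]) * predictive(x[n], blocks[k]) for k in labels]
            w.append(theta * predictive(x[n], []))
            w = np.array(w) / np.sum(w)
            c = rng.choice(len(w), p=w)
            z[n] = labels[c] if c < len(labels) else z.max() + 1
    return z

x = np.concatenate([np.random.default_rng(1).normal(-3, 1, 30),
                    np.random.default_rng(2).normal(3, 1, 30)])
print(np.unique(crp_gibbs(x), return_counts=True))
```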

2.4 Stick lengths

Not every symmetric function defined for an arbitrary number of arguments with values in the unit interval is an EPPF (Pitman, 1995), and not every symmetric function with an additional positive integer argument is an EFPF. For instance, the consistency property in Eq. (2.1) implies certain additivity requirements for the function p.

Example 2.4.1 (Not an EPPF). Consider the function p defined with

    p(1) = 1,  p(1, 1) = 0.1,  p(2) = 0.8, ...          (2.12)

From the information in Eq. (2.12), p may be further defined so as to be symmetric in its arguments for any number of arguments, but since it does not satisfy p(1) = p(1, 1) + p(2), it cannot be an EPPF.

Example 2.4.2 (Not an EFPF). Consider the function p defined with

    p(N = 1) = 0.9,  p(N = 1, 1) = 0.9,  p(N = 1, 1, 1) = 0.9, ...          (2.13)

From the information in Eq. (2.13), p may be further defined so as to be symmetric in its arguments for any number of arguments after the initial N argument, but since p(N = 1) + p(N = 1, 1) + p(N = 1, 1, 1) > 1, it cannot be an EFPF.

It therefore requires some care to define a suitable distribution over consistent, exchangeable random feature allocations or partitions using the exchangeable probability function framework. Since we are working with exchangeable sequences of random variables, it is natural to turn to de Finetti's theorem (De Finetti, 1931; Hewitt and Savage, 1955) for clues as to how to proceed. De Finetti's theorem tells us that any exchangeable sequence of random variables can be expressed as an independent and identically distributed sequence when conditioned on an underlying random mixing measure. While this theorem may seem difficult to apply directly to, e.g., exchangeable partitions, it may be applied more naturally to an exchangeable sequence of numbers derived from a sequence of partitions.

The argument below is due to Aldous (1985). Suppose that (Π_n) is an exchangeable, consistent sequence of random partitions. Consider the kth partition block to appear according to an order-of-appearance labeling scheme, and give this block a new random label, φ_k ∼ Unif([0, 1]), such that each random label is drawn independently from the rest. This construction is the same as the one used for parameter generation in Section 2.3, and (Π_n) is exchangeable by the same arguments used there. Let Z_n equal φ_k exactly when n belongs to the partition block with this label.

If we apply de Finetti's theorem to the sequence (Z_n) and note that (Z_n) has at most countably many different values, we see that there exists some random sequence (ρ_k) such that ρ_k ∈ (0, 1] for all k and, conditioned on the frequencies (ρ_k), (Z_n) has the same distribution as iid draws from (ρ_k). In this description, we have brushed over technicalities associated with partition blocks that contain only one index even as N → ∞ (which may imply Σ_k ρ_k < 1).

But if we assume that every partition block eventually contains at least two indices, we can achieve an exchangeable partition of [N] as follows. Let (ρ_k) represent a sequence of values in (0, 1] such that Σ_{k=1}^∞ ρ_k = 1 a.s. Draw Z_n ∼iid Discrete((ρ_k)_k). Let Π_N be the induced partition given (Z_n)_{n=1}^N. Exchangeability follows from the iid draws, and consistency follows from the induced partition construction.

When the frequencies (ρ_k) are thought of as subintervals of the unit interval, i.e. a partition of the unit interval, they are collectively called Kingman's paintbox (Kingman, 1978). As another naming convention, we may think of the unit interval as a stick (Ishwaran and James, 2001). We partition the unit interval by breaking it into various stick lengths, which represent the frequencies of each partition block.

A similar construction can be seen to yield exchangeable, consistent random feature allocations. In this case, let (ξ_k) represent a sequence of values in (0, 1] such that Σ_{k=1}^∞ ξ_k < ∞ a.s. We generate feature collections independently for each index as follows. Start with Y_n = ∅. For each feature k, add k to the set Y_n, independently from all other features, with probability ξ_k. Let F_N be the induced feature allocation given (Y_n)_{n=1}^N. Exchangeability of F_N follows from the iid draws of Y_n, and consistency follows from the induced feature allocation construction. The finite sum constraint ensures each index belongs to a finite number of features a.s.

Figure 2.3: An illustration of how stick-breaking divides the unit interval into a sequence of probabilities (Broderick, Jordan, and Pitman, 2012). The stick proportions (V_1, V_2, ...) determine what fraction of the remaining stick is appended to the probability sequence at each round.

It remains to specify a distribution on the partition or feature frequencies. The frequencies cannot be iid due to the finite summation constraint in both cases. In the partition case, any infinite set of frequencies cannot even be independent since the summation is fixed to one.

One scheme to ensure summation to unity is called stick-breaking (McCloskey, 1965; Patil and Taillie, 1977; Sethuraman, 1994; Ishwaran and James, 2001). In stick-breaking, the stick lengths are obtained by recursively breaking off parts of the unit interval to return as the atoms ρ_1, ρ_2, ... (cf. Figure 2.3). In particular, we generate stick-breaking proportions V_1, V_2, ... as [0, 1]-valued random variables. Then ρ_1 is the first proportion V_1 times the initial stick length 1; hence ρ_1 = V_1. Recursively, after k breaks, the remaining length of the initial unit interval is ∏_{j=1}^k (1 − V_j). And ρ_{k+1} is the proportion V_{k+1} of the remaining stick; hence ρ_{k+1} = V_{k+1} ∏_{j=1}^k (1 − V_j).

The stick-breaking construction yields ρ_1, ρ_2, ... such that ρ_k ∈ [0, 1] for each k and Σ_{k=1}^∞ ρ_k ≤ 1. If the V_k do not decay too rapidly, we will have Σ_{k=1}^∞ ρ_k = 1 a.s. In particular, the partition block proportions ρ_k sum to unity a.s. iff there is no remaining stick mass: ∏_{k=1}^∞ (1 − V_k) = 0 a.s.

We often make the additional, convenient assumption that the V_k are independent. In this case, a necessary and sufficient condition for Σ_{k=1}^∞ ρ_k = 1 a.s. is Σ_{k=1}^∞ E[log(1 − V_k)] = −∞ (Ishwaran and James, 2001). When the V_k are independent and of a canonical distribution, they are easily simulated. Moreover, if we assume that the V_k are such that the ρ_k decay sufficiently rapidly in k, one strategy for simulating a stick-breaking model is to ignore all k > K for some fixed, finite K. This approximation is known as truncation (Ishwaran and James, 2001). It is fortuitously the case that in some models of particular interest, such useful assumptions fall out naturally from the model construction (e.g. Examples 2.4.3 and 2.4.4).
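In code, the recursion is a short loop. The sketch below is a truncated stick-breaking simulation under illustrative assumptions (independent Beta(1, θ) proportions, as will appear in Example 2.4.3, and a fixed truncation level K); it returns the first K stick lengths and the mass left unassigned beyond the truncation.

```python
import numpy as np

def stick_breaking(K, theta, seed=0):
    """Truncated stick-breaking: rho_k = V_k * prod_{j<k} (1 - V_j) with V_j ~ Beta(1, theta)."""
    rng = np.random.default_rng(seed)
    V = rng.beta(1.0, theta, size=K)                               # stick proportions
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - V)[:-1]])  # prod_{j<k} (1 - V_j)
    rho = V * remaining                                            # stick lengths
    return rho, np.prod(1.0 - V)                                   # lengths and leftover mass

rho, leftover = stick_breaking(K=100, theta=2.0)
print(rho[:5], rho.sum() + leftover)   # the sum of all sticks plus the leftover is exactly 1
```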

Example 2.4.3 (Chinese restaurant process). In the original exchangeability result due to de Finetti (De Finetti, 1931), the exchangeable random variables were zero/one-valued, and the mixing measure was a distribution on a single frequency so that the outcomes were conditionally Bernoulli. We will find a similar result in obtaining the stick-breaking proportions associated with the Chinese restaurant process.

Figure 2.4: An illustration of the proof based on the Pólya urn that Dirichlet process stick-breaking gives the underlying partition block frequencies for a Chinese restaurant process model. The kth column in the central matrix corresponds to a tallying of when the kth table is chosen (gray), when a table of index larger than k is chosen (white), and when an index smaller than k is chosen (×). If we ignore the × tallies, the gray and white tallies in each column (after the first) can be modeled as balls drawn from a Pólya urn. The limiting frequency of gray balls in each column is shown below the matrix.

We can construct a sequence of binary-valued random variables by dividing the customers in the CRP who are sitting at the first table from the rest; color the former collection of customers gray and the latter collection of customers white. Then, we see that the first customer must be colored gray. And thus we begin with a single gray customer and no white customers. This binary valuation for the first table in the CRP is illustrated by the first column in the matrix in Figure 2.4.

At this point, it is useful to recall the Pólya urn construction (Pólya, 1930; D. A. Freedman, 1965), whereby an urn starts with G_0 gray balls and W_0 white balls. At each round N, we draw a ball from the urn, replace it, and add κ of the same color of ball to the urn. At the end of the round, we have G_N gray balls and W_N white balls. Despite the urn metaphor, the number of balls need not be an integer at any time. By checking Eq. (2.3), which defines the CRP, we can see that the coloring of the gray/white customer matrix assignments starting with the second customer has the same distribution as a sequence of balls drawn from a Pólya urn with G_{1,0} = 1 initial gray balls, W_{1,0} = θ initial white balls, and κ_1 = 1 replacement balls. Let G_{1,N} and W_{1,N} represent the numbers of gray and white balls, respectively, in the urn after N rounds. The important fact about the Pólya urn we use here is that there exists some V ∼ Beta(G_0/κ, W_0/κ) such that κ^{−1}(G_{N+1} − G_N) ∼iid Bern(V) for all N. In this particular case of the CRP, then, G_{1,N+1} − G_{1,N} is one if a customer sits at the first table (or zero otherwise), and G_{1,N+1} − G_{1,N} ∼iid Bern(V_1) with V_1 ∼ Beta(1, θ).

We now look at the sequence of customers who sit at the second and subsequent tables. That is, we condition on customers not sitting at the first table, or equivalently on the sequence with G_{1,N+1} − G_{1,N} = 0. Again, we have that the first customer sits at the second table, by the CRP construction. Now let customers at the second table be colored gray and customers at the third and later tables be colored white. This valuation is illustrated in the second column in Figure 2.4; each × in the figure denotes a data point where the first partition block is chosen and therefore the current Pólya urn is not in play. As before, we begin with one gray customer and no white customers. We can check Eq. (2.3) to see that customer coloring once more proceeds according to a Pólya urn scheme with G_{2,0} = 1 initial gray balls, W_{2,0} = θ initial white balls, and κ_2 = 1 replacement balls. Thus, contingent on a customer not sitting at the first table, the Nth customer sits at the second table with iid distribution Bern(V_2) with V_2 ∼ Beta(1, θ). Since the sequence of individuals sitting at the second table has no other dependence on the sequence of individuals sitting at the first table, we have that V_2 is independent of V_1.

The argument just outlined proceeds recursively to show us that the Nth customer, conditional on not sitting at the first K − 1 tables for K ≥ 1, sits at the Kth table with iid distribution Bern(V_K) and V_K ∼ Beta(1, θ) with V_K independent of the previous (V_1, ..., V_{K−1}).

Combining these results, we see that we have the following construction for the customer seating patterns. The V_k are distributed independently and identically according to Beta(1, θ). The probability ρ_K of sitting at the Kth table is the probability of not sitting at the first K − 1 tables, each conditional on not sitting at the previous tables, times the conditional probability of sitting at the Kth table: ρ_K = [∏_{k=1}^{K−1} (1 − V_k)] · V_K. Finally, with the vector of table frequencies (ρ_k), each customer sits independently and identically at the corresponding vector of tables according to these frequencies. This process is summarized here:

    V_k ∼iid Beta(1, θ)
    ρ_K := V_K ∏_{k=1}^{K−1} (1 − V_k)
    Z_n ∼iid Discrete((ρ_k)_k).          (2.14)

To see that this process is well-defined, first note that E[log(1 − V_k)] exists, is negative, and is the same for all k values. It follows that Σ_{k=1}^∞ E[log(1 − V_k)] = −∞, so by the discussion before this example, we must have Σ_{k=1}^∞ ρ_k = 1 a.s.

The feature case is easier. Since it does not require the frequencies to sum to one, the random frequencies can be independent so long as they have an a.s. finite sum.

Example 2.4.4 (Indian buffet process). As in the case of the CRP, we can recover the stick lengths for the Indian buffet process using an argument based on an urn model.

Recall that on the first round of the Indian buffet process, K_1^+ ∼ Poisson(γ) features are chosen to contain index 1. Consider one of the features, labeled k. By construction, each future data point N belongs to this feature with probability N_{N−1,k}/(θ + N − 1). Thus, we can model the sequence after the first data point as a Pólya urn of the sort encountered

Figure 2.5: Illustration of the proof that the frequencies of features in the Indian buffet process are given by beta random variables. For each feature, we can construct a sequence of zero/one variables by tallying whether (gray, one) or not (white, zero) that feature is represented by the given data point. Before the first time a feature is chosen, we mark it with an ×. Each column sequence of gray and white tallies, where we ignore the × marks, forms a Pólya urn with limiting frequencies shown below the matrix.

in Example 2.4.3 with G_{k,0} = 1 initial gray balls, W_{k,0} = θ initial white balls, and κ_k = 1 replacement balls. As we have seen, there exists a random variable V_k ∼ Beta(1, θ) such that representation of this feature by data point N is chosen, iid across all N, as Bern(V_k). Since the Bernoulli draws conditional on previous draws are independent across all k, the V_k are likewise independent of each other; this fact is also true for k in future rounds. Draws according to such an urn are illustrated in each of the first four columns of the matrix in Figure 2.5.

Now consider any round n. According to the IBP construction, K_n^+ ∼ Poisson(γθ/(θ + n − 1)) new features are chosen to include index n. Each future data point N (with N > n) represents feature k among these features with probability N_{N−1,k}/(θ + N − 1). In this case, we can model the sequence after the nth data point as a Pólya urn with G_{k,0} = 1 initial gray balls, W_{k,0} = θ + n − 1 initial white balls, and κ_k = 1 replacement balls. So there exists a random variable V_k ∼ Beta(1, θ + n − 1) such that representation of feature k by data point N is chosen, iid across all N, as Bern(V_k).

Finally, then, we have the following generative model for the feature allocation by iterating across n = 1, ..., N (Thibaux and Jordan, 2007):

    K_n^+ ∼indep Poisson(γθ / (θ + n − 1))          (2.15)
    K_n = K_{n−1} + K_n^+
    V_k ∼indep Beta(1, θ + n − 1),  k = K_{n−1} + 1, ..., K_n          (2.16)
    I_{n,k} ∼indep Bern(V_k),  k = 1, ..., K_n

I_{n,k} is an indicator random variable for whether feature k contains index n. The collection of features to which index n belongs, Y_n, is the collection of features k with I_{n,k} = 1.
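The display above translates directly into a simulation. The following sketch (numpy assumed; names and parameter values are illustrative) iterates Eqs. (2.15) and (2.16) across n = 1, ..., N and records, for each index, the features whose indicators come up one.

```python
import numpy as np

def ibp_sticks(N, gamma, theta, seed=0):
    """Stick lengths and feature allocation via Eqs. (2.15)-(2.16)."""
    rng = np.random.default_rng(seed)
    V, Y = [], []     # V: stick lengths in order of appearance; Y[n-1]: features of index n
    for n in range(1, N + 1):
        K_plus = rng.poisson(gamma * theta / (theta + n - 1))      # Eq. (2.15)
        V.extend(rng.beta(1.0, theta + n - 1, size=K_plus))        # Eq. (2.16)
        # I_{n,k} ~ Bern(V_k) for k = 1, ..., K_n; index n belongs to feature k iff I_{n,k} = 1.
        Y.append({k + 1 for k, v in enumerate(V) if rng.random() < v})
    return V, Y

V, Y = ibp_sticks(N=8, gamma=2.0, theta=1.0)
print(len(V), [sorted(y) for y in Y])
```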

Inference

As we have seen above, the exchangeable probability functions of Section 2.3 are the marginal distributions of the partitions or feature allocations generated according to stick-length models with the stick lengths integrated out. It has been proposed that including the stick lengths in MCMC samplers of these models will improve mixing (Ishwaran and Zarepour, 2000). While it is impossible to sample the countably infinite set of partition block or feature frequencies in these models (cf. Examples 2.4.3 and 2.4.4), a number of ways of getting around this difficulty have been investigated. Ishwaran and Zarepour (2000) examine two separate finite approximations to the full CRP stick length model; one uses a parametric approximation to the full infinite model, and the other creates a truncation by setting the stick break at some fixed size K to be 1: V_K = 1. There also exist techniques that avoid any approximations and deal instead directly with the full model: in particular, retrospective sampling (Papaspiliopoulos and Roberts, 2008) and slice sampling (Walker, 2007).

While our discussion thus far has focused on MCMC sampling as a means of approximating the posterior distribution of either the block assignments or both the block assignments and stick lengths, including the stick lengths in a posterior analysis facilitates a different posterior approximation; in particular, variational methods can also be used to approximate the posterior. These methods minimize some notion of distance to the posterior over a family of potential approximating distributions (Jordan et al., 1999). The practicality and, indeed, speed of these methods in the case of stick-breaking for the CRP (Example 2.4.3) have been demonstrated by Blei and Jordan (2006).

A number of different models for the stick lengths corresponding to the features of an IBP (Example 2.4.4) have been discovered. The distributions described in Example 2.4.4 are covered by Thibaux and Jordan (2007), who build on work from Hjort (1990); Kim (1999b). A special case of the IBP is examined by Teh, Görür, and Ghahramani (2007), who detail a slice sampling algorithm for sampling from the posterior of the stick lengths and feature assignments. Yet another stick length model for the IBP is explored by Paisley, Zaas, et al. (2010), who show how to apply variational methods to approximate the posterior of their model. Stick length modeling has the further advantage of allowing inference in cases where it is not straightforward to integrate out the underlying stick lengths to obtain a tractable exchangeable probability function.

2.5 Subordinators

An important point to reiterate about the labels Z_n and label collections Y_n is that when we use the order-of-appearance labeling scheme for partition or feature blocks described above, the random sequences (Z_n) and (Y_n) are not exchangeable. Often, however, we would like to make use of special properties of exchangeability when dealing with these sequences. For instance, if we use Markov Chain Monte Carlo to sample from the posterior distribution of a partition (cf. Section 2.3), we might want to Gibbs sample the cluster assignment of data point n given the assignments of the remaining data points: that is, Z_n given {Z_m}_{m=1}^N \ {Z_n}. This sampling is particularly easy in some cases (Neal, 2000) if we can treat Z_n as the last random variable in the sequence, but this treatment requires exchangeability.

A way to get around this dilemma was suggested by Aldous (1985) and appeared above in our motivation for using stick lengths. Namely, we assign to the kth partition block a uniform random label φ_k ∼ Unif([0, 1]); analogously, we assign to the kth feature a uniform random label φ_k ∼ Unif([0, 1]). We can see that in both cases, all of the labels are a.s. distinct. Now, in the partition case, let Z_n be the uniform random label of the partition block to which n belongs. And in the feature case, let Y_n be the (finite) set of uniform random feature labels for the features to which n belongs. We can recover the partition or feature allocation as the induced partition or feature allocation by grouping indices assigned to the same label. Moreover, as discussed above, we now have that each of (Z_n) and (Y_n) is an exchangeable sequence.

If we form partitions or features according to the stick length constructions detailed in Section 2.4, we know that each unique partition or feature label φ_k is associated with a frequency ξ_k. We can use this association to form a random measure:

    µ = Σ_{k=1}^∞ ξ_k δ_{φ_k},          (2.17)

where δ_{φ_k} is a unit point mass located at φ_k. In the partition case, Σ_k ξ_k = 1, so the random measure is a random probability measure, and we may draw Z_n ∼iid µ. In the feature case, the weights have a finite sum but do not necessarily sum to one; we then draw Y_n by including each φ_k for which Bern(ξ_k) yields a draw of 1.

Another way to codify the random measure in Eq. (2.17) is as a monotone increasing stochastic process on [0, 1]. Let

    T_s = Σ_{k=1}^∞ ξ_k 1{φ_k ≤ s}.

Then the atoms of µ are in one-to-one correspondence with the jumps of the process T. This increasing random function construction gives us another means of choosing distributions for the weights ξ_k. We have already seen that these cannot be iid due to the finite summation condition. However, we will see that if we require that the increments of a monotone, increasing stochastic process are independent and stationary, then we can use the jumps of that function as the atoms in our random measure for partitions or features.

Definition 2.5.1. A subordinator (Bochner, 1955; Bertoin, 1998; Bertoin, 2004) is a stochastic process (T_s, s ≥ 0) that has

• Non-negative, non-decreasing paths (a.s.),
• Paths that are right-continuous with left limits, and
• Stationary, independent increments.

Figure 2.6: Left: The sample path (T_s) of a subordinator. T_{s̃−} is the limit from the left of (T_s) at s = s̃. Right: The right-continuous inverse (S_t) of a subordinator: S_t := inf{s : T_s > t}. The open intervals along the t axis correspond to the jumps of the subordinator (T_s).

For our purposes, wherein the subordinator values will ultimately correspond to (perhaps scaled) probabilities, we will assume the subordinator takes values in [0, ∞) though alternative ranges with a sense of ordering are possible.

Subordinators are of interest to us because not only do they exhibit the stationary, independent increments property but they can always be decomposed into two components: a deterministic drift component and a Poisson point process component. Recall that a Poisson point process on space S with rate measure ν(dx), where x ∈ S, yields a countable subset of points of S. Let N(A) be the number of points of the process in set A for A ⊆ S. The process is characterized by the fact that, first, N(A) ∼ Poisson(ν(A)) for any A and, second, for any disjoint A_1, ..., A_K, we have that N(A_1), ..., N(A_K) are independent random variables. See Kingman (1993) for a thorough treatment of these processes. An example subordinator with both drift and jump components is shown on the lefthand side of Figure 2.6. The subordinator decomposition is detailed in the following result (Bertoin, 1998).

Theorem 2.5.2. Every subordinator (T_s, s ≥ 0) can be written as

    T_s = c s + Σ_{k=1}^∞ ξ_k 1{φ_k ≤ s},          (2.18)

for some constant c ≥ 0 and where {(ξ_k, φ_k)}_k is the countable set of points of a Poisson point process with intensity Λ(dξ) dφ, where Λ is a Lévy measure, i.e.

    ∫_0^∞ (1 ∧ ξ) Λ(dξ) < ∞.
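Theorem 2.5.2 suggests a simple approximate simulation of a driftless subordinator on [0, s]: discard jumps smaller than some ε > 0, draw the number of remaining jumps from a Poisson distribution with mean s Λ((ε, ∞)), and draw their sizes from the normalized restriction of Λ. The sketch below is an ε-truncation approximation, not an exact sampler; numpy and scipy are assumed, the helper name is ours, and the gamma Lévy density that will appear in Example 2.5.7 is used purely as an illustration.

```python
import numpy as np
from scipy.integrate import quad

def subordinator_jumps(levy_density, s=1.0, eps=1e-3, xmax=50.0, seed=0):
    """Approximate jumps of a driftless subordinator on [0, s]: keep only jumps
    larger than eps, so the restricted rate measure Lambda((eps, xmax)) is finite."""
    rng = np.random.default_rng(seed)
    total, _ = quad(levy_density, eps, xmax)          # ~ Lambda((eps, infinity))
    n_jumps = rng.poisson(s * total)
    # Inverse-CDF sampling of jump sizes on a log-spaced grid over (eps, xmax).
    grid = np.geomspace(eps, xmax, 20000)
    weights = levy_density(grid) * np.gradient(grid)
    cdf = np.cumsum(weights)
    cdf /= cdf[-1]
    sizes = np.interp(rng.random(n_jumps), cdf, grid)
    times = rng.uniform(0.0, s, size=n_jumps)         # jump locations phi_k in [0, s]
    return sizes, times

theta, b = 2.0, 1.0
gamma_levy = lambda x: theta * np.exp(-b * x) / x     # Levy density used in Example 2.5.7
sizes, times = subordinator_jumps(gamma_levy, s=1.0)
print(len(sizes), sizes.sum())   # total jump mass; roughly Gamma(theta, b) in distribution
```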

In particular, then, if a subordinator is finite at time t, the jumps of the subordinator up to t may be used as feature block frequencies if they have support in [0, 1]. Or, in general, the normalized jumps may be used as partition block frequencies. We can see from the righthand side of Figure 2.6 that the jumps of a subordinator partition intervals of the form [0, t), as long as the subordinator has no drift component. In either the feature or cluster case, we have replaced the condition of independent and identical distribution for the partition or feature frequencies (i.e., the jumps) with a more natural continuous-time analogue: independent, stationary intervals.

Just as the Laplace transform of a positive random variable characterizes the distribution of that random variable, so does the Laplace transform of the subordinator—which is a positive random variable at any fixed time point—describe this stochastic process (Bertoin, 1998; Bertoin, 2004).

Theorem 2.5.3 (Lévy–Khinchin formula for subordinators). If (T_s, s ≥ 0) is a subordinator, then for λ ≥ 0 we have

    E(e^{−λ T_s}) = e^{−Ψ(λ) s}          (2.19)

with

    Ψ(λ) = c λ + ∫_0^∞ (1 − e^{−λξ}) Λ(dξ),          (2.20)

where c ≥ 0 is called the drift constant and Λ is a non-negative Lévy measure on (0, ∞).

The function Ψ(λ) is called the Laplace exponent in this context. We note that a subordinator is characterized by its drift constant and Lévy measure.

Using subordinators for feature allocation modeling is particularly easy; since the jumps of the subordinators are formed by a Poisson point process, we can use Poisson process methodology to find the stick lengths and EFPF. To set up this derivation, suppose we generate feature membership from a subordinator by taking Bernoulli draws at each of its jumps with success probability equal to the jump size. Since every jump has strictly positive size, the feature associated with each jump will eventually see a Bernoulli success for some index n with probability one. Therefore, we can enumerate all jumps of the process in order of appearance; that is, we first enumerate all features in which index 1 appears, then all features in which index 2 appears but not index 1, and so on. At the nth iteration, we enumerate all features in which index n appears but not previous indices. Let K_n^+ represent the number of jumps so chosen on the nth round. Let K_0 = 0 so that recursively K_n := K_{n−1} + K_n^+ is the number of subordinator jumps seen by round n, inclusive. Let ξ_k for k = K_{n−1} + 1, ..., K_n denote the sizes of the subordinator jumps seen on round n. We now turn to connecting the subordinator perspective to the earlier derivation of stick lengths in Section 2.4.

Figure 2.7: An illustration of Poisson thinning. The x-axis values of the filled black circles, emphasized by dotted lines, are generated according to a Poisson process. The [0, 1]-valued function h(x) is arbitrary. The vertical axis values of the points are uniform draws in [0, 1]. The "thinned" points are the collection of x-axis values corresponding to vertical axis values below h(x) and are denoted with a × symbol.

Example 2.5.4 (Indian buffet process). In our earlier discussion, we found a collection of stick lengths to represent the featural frequencies for the IBP (Eq. (2.16) of Example 2.4.4 in Section 2.4). To see the connection to subordinators, we start from the beta process subordinator (Kim, 1999b) with zero drift (c = 0) and Lévy measure

    Λ(dξ) = γθ ξ^{−1} (1 − ξ)^{θ−1} dξ.          (2.21)

We will see that the mass parameter γ > 0 and concentration parameter θ > 0 are the same as those introduced in Example 2.3.5 and continued in Example 2.4.4.

Theorem 2.5.5. Generate a feature allocation from a beta process subordinator with Lévy measure given by Eq. (2.21). Then the sequence of subordinator jumps (ξ_k), indexed in order of appearance, has the same distribution as the sequence of IBP stick lengths (V_k) described by Eqs. (2.15) and (2.16).

Proof. Recall the following fact about Poisson thinning (Kingman, 1993), illustrated in Figure 2.7. Suppose that a Poisson point process with rate measure λ generates points with values x. Then suppose that, for each such point x, we keep it with probability h(x) ∈ [0, 1]. The resulting set of points is also a Poisson point process, now with rate measure λ′(A) = ∫_A h(x) λ(dx).

We prove Theorem 2.5.5 recursively. Define the measure

    µ_n(dξ) := γθ ξ^{−1} (1 − ξ)^{θ+n−1} dξ,

so that µ_0 is the beta process Lévy measure Λ in Eq. (2.21). We make the recursive assumption that µ_n is distributed as the beta process measure without atoms corresponding to features chosen on the first n iterations.

There are two parts to proving Theorem 2.5.5. First, we show that, on the nth iteration, the number of features chosen and the distribution of the corresponding atom weights agree with Eqs. (2.15) and (2.16), respectively. Second, we check that the recursion assumption holds.

For the first part, note that on the nth round we choose features with probability equal to their atom weight. So we form a thinned Poisson process with rate measure ξ · µ_{n−1}(dξ). This rate measure has total mass

    ∫_0^1 ξ · µ_{n−1}(dξ) = γθ / (θ + n − 1) =: γ_{n−1}.

So the number of features chosen is Poisson-distributed with mean γθ(θ + n − 1)^{−1}, as desired (cf. Eq. (2.15)). And the atom weights have distribution equal to the normalized rate measure

    γ_{n−1}^{−1} · ξ · γθ ξ^{−1} (1 − ξ)^{θ+(n−1)−1} dξ = Beta(ξ | 1, θ + n − 1) dξ,

as desired (cf. Eq. (2.16)).

Finally, to check the recursion assumption, we note that those sticks that remain were chosen for having Bernoulli failure draws; i.e., they were chosen with probability equal to one minus their atom weight. So the thinned rate measure for the next round is

    (1 − ξ) · γθ ξ^{−1} (1 − ξ)^{θ+(n−1)−1} dξ,

which is just µ_n.

The form of the EFPF of the feature allocation generated from the beta process subordinator follows immediately from the stick length distributions we have just derived by the discussion in Example 2.4.4 in Section 2.4.

We see from the previous example that feature allocation stick lengths and EFPFs can be obtained in a straightforward manner using the Poisson process representation of the jumps of the subordinator. Partitions, however, are not as easy to analyze, principally due to the fact that the subordinator jumps must first be normalized to obtain a probability measure on [0, 1]; a random measure with finite total mass is not sufficient in the partition case. Hence we must compute the stick lengths and EPPF using partition block frequencies from these normalized jumps instead of directly from the subordinator jumps.

In the EPPF case, we make use of a result that gives us the exchangeable probability function as a function of the Laplace exponent. Though we do not derive this formula here, its derivation can be found in Pitman (2003); the proof relies on, first, calculating the joint distribution of the subordinator jumps and partition generated from the normalized jumps and, second, integrating out the subordinator jumps to find the partition marginal.

Theorem 2.5.6. Form a probability measure µ by normalizing the jumps of the subordinator with Laplace exponent Ψ. Let (Π_n) be a consistent set of exchangeable partitions induced by iid draws from µ. For each exchangeable partition π_N = {A_1, ..., A_K} of [N] with N_k := |A_k| for each k,

    P(Π_N = π_N) = p(N_1, ..., N_K)
                 = [(−1)^{N−K} / (N − 1)!] ∫_0^∞ λ^{N−1} e^{−Ψ(λ)} ∏_{k=1}^K Ψ^{(N_k)}(λ) dλ,          (2.22)

where Ψ^{(N_k)}(λ) is the N_kth derivative of the Laplace exponent Ψ evaluated at λ.

Example 2.5.7 (Chinese restaurant process). We start by introducing the gamma process, a subordinator that we will see below generates the Chinese restaurant process EPPF. The gamma process has Laplace exponent Ψ(λ) (Eq. (2.19)) characterized by

    c = 0  and  Λ(dξ) = θ ξ^{−1} e^{−bξ} dξ          (2.23)

for θ > 0 and b > 0 (cf. Eq. (2.20) in Theorem 2.5.3). We will see that θ corresponds to the CRP concentration parameter and that b is arbitrary and does not affect the partition model. We calculate the EPPF using Theorem 2.5.6.

Theorem 2.5.8. The EPPF for partition block membership chosen according to the normalized jumps (ρ_k) of the gamma subordinator with parameter θ is the CRP EPPF (Eq. (2.4)).

Proof. By Theorem 2.5.6, if we can find the derivatives of all orders of the Laplace exponent Ψ, we can calculate the EPPF for the partitions generated with frequencies equal to the normalized jumps of this subordinator. The derivatives of Ψ, which are known to always exist (Bertoin, 2000; Rogers and Williams, 2000), are straightforward to calculate if we begin by noting that, from Eq. (2.20) in Theorem 2.5.3, we have in general that

    Ψ′(λ) = c + ∫_0^∞ ξ e^{−λξ} Λ(dξ).

Hence, for the gamma process subordinator,

    Ψ′(λ) = ∫_0^∞ e^{−λξ} θ e^{−bξ} dξ = θ / (λ + b).

Then simple integration and differentiation yield

    Ψ(λ) = θ log(λ + b) − θ log(b),

since Ψ(0) = 0, and

    Ψ^{(n)}(λ) = (−1)^{n−1} (n − 1)! θ / (λ + b)^n,  n ≥ 1.

We can substitute these quantities into the general EPPF formula in Eq. (2.22) of Theorem 2.5.6 to obtain

    p(N_1, ..., N_K)
        = [(−1)^{N−K} / (N − 1)!] ∫_0^∞ λ^{N−1} (λ + b)^{−θ} b^θ ∏_{k=1}^K (−1)^{N_k−1} (N_k − 1)! θ / (λ + b)^{N_k} dλ
        = [θ^K / (N − 1)!] [∏_{k=1}^K (N_k − 1)!] b^θ b^{N−1−N−θ+1} ∫_0^∞ x^{N−1} (x + 1)^{−N−θ} dx          (for x = λ/b)
        = [θ^K / (N − 1)!] [∏_{k=1}^K (N_k − 1)!] Γ(N) Γ(θ) / Γ(N + θ)
        = θ^K [∏_{k=1}^K (N_k − 1)!] · 1 / (θ (θ + 1)_{N−1↑1}).

The penultimate line follows from the form of the beta prime distribution. The final line is the CRP EPPF from Eq. (2.4), as desired. We note in particular that the parameter b does not appear in the final EPPF.
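The derivation can be checked numerically by plugging the gamma-process Laplace exponent and its derivatives into Eq. (2.22) and comparing against the CRP EPPF. A sketch (scipy quadrature; the block sizes, θ, and b below are arbitrary illustrative values):

```python
import math
import numpy as np
from scipy.integrate import quad
from scipy.special import gammaln

def eppf_from_laplace_exponent(block_sizes, theta, b=1.0):
    """Evaluate Eq. (2.22) with the gamma-subordinator Laplace exponent
    Psi(lam) = theta * log((lam + b) / b) and its derivatives."""
    Ns = list(block_sizes)
    N, K = sum(Ns), len(Ns)

    def integrand(lam):
        psi = theta * np.log((lam + b) / b)
        derivs = 1.0
        for n in Ns:  # Psi^{(n)}(lam) = (-1)^{n-1} (n-1)! theta / (lam + b)^n
            derivs *= (-1.0) ** (n - 1) * math.factorial(n - 1) * theta / (lam + b) ** n
        return lam ** (N - 1) * np.exp(-psi) * derivs

    integral, _ = quad(integrand, 0.0, np.inf)
    return (-1.0) ** (N - K) / math.factorial(N - 1) * integral

def crp_eppf(block_sizes, theta):
    """CRP EPPF, Eq. (2.4): theta^K * prod (N_k - 1)! * Gamma(theta) / Gamma(theta + N)."""
    Ns = np.array(block_sizes)
    N, K = Ns.sum(), len(Ns)
    return np.exp(K * np.log(theta) + gammaln(Ns).sum() + gammaln(theta) - gammaln(theta + N))

print(eppf_from_laplace_exponent([3, 2, 1], theta=1.5))
print(crp_eppf([3, 2, 1], theta=1.5))   # the two should agree up to quadrature error
```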

Whenever the Laplace exponent of a subordinator is known, Theorem 2.5.6 can similarly be applied to quickly find the EPPF of the partition generated by sampling from the normalized subordinator jumps.

To find the distributions of the stick lengths, i.e., the partition block frequencies, from the subordinator representation for a partition, we must find the distributions of the normalized subordinator jumps. As in the feature case, we may enumerate the jumps of a subordinator used for partitioning in the order of their appearance. That is, let ρ_1 be the normalized subordinator jump size corresponding to the cluster of the first data point. Recursively, suppose index n joins a cluster to which none of the indices in [n − 1] belong, and suppose there are k clusters among [n − 1]. Then let ρ_{k+1} be the normalized subordinator jump size corresponding to the cluster containing n.

Example 2.5.9 (Chinese restaurant process). We continue with the CRP example.

Theorem 2.5.10. The normalized subordinator jumps (ρk) in order of appearance of the gamma subordinator with concentration parameter θ (and arbitrary parameter b > 0) have the same distribution as the CRP stick lengths (Eq. (2.14) of Example 2.4.3 in Section 2.4).

Proof. First, we introduce some notation. Let τ = Σ_k ξ_k, the sum over all of the jumps of the subordinator. Second, let τ_k = τ − Σ_{j=1}^k ξ_j, the total sum minus the first k elements (in order of appearance). Note that τ = τ_0. Finally, let W_k = τ_k/τ_{k−1} and V_k = 1 − W_k. Then a simple telescoping of factors shows that ρ_k = V_k ∏_{j=1}^{k−1} (1 − V_j):

    V_k ∏_{j=1}^{k−1} (1 − V_j) = (1 − τ_k/τ_{k−1}) ∏_{j=1}^{k−1} τ_j/τ_{j−1} = [(τ_{k−1} − τ_k)/τ_{k−1}] · (τ_{k−1}/τ_0) = ξ_k/τ = ρ_k.

It remains to show that the V_k have the desired distribution. To that end, it is easier to work with the W_k. We will find the following lemma (Pitman, 2006) useful.

Lemma 2.5.11. Consider a subordinator with Lévy measure Λ, and suppose τ equals the sum of all jumps of the subordinator. Let ρ be the density of Λ with respect to Lebesgue measure. And let f be the density of the distribution of τ with respect to Lebesgue measure. Then

    P(τ_0 ∈ dt_0, ..., τ_k ∈ dt_k) = f(t_k) dt_k ∏_{j=0}^{k−1} [(t_j − t_{j+1}) ρ(t_j − t_{j+1}) / t_j] dt_j.

With this lemma in hand, the result follows from a change of variables calculation; we use a bijection between {W_1, ..., W_k, τ} and {τ_0, ..., τ_k} defined by τ_k = τ ∏_{j=1}^k W_j. The determinant of the Jacobian for the transformation to the former variables from the latter is

    J = ∏_{j=1}^k [τ ∏_{i=1}^{j−1} W_i] = ∏_{j=0}^{k−1} τ_j(τ, W_1, ..., W_k).

In the derivation that follows, we start by expressing results in terms of the τ_j terms with the dependence on {τ, W_1, ..., W_k} suppressed to avoid notational clutter: e.g., J = ∏_{j=0}^{k−1} τ_j. At the end, we will evaluate the τ_j terms as functions of {τ, W_1, ..., W_k}. For now, then, we have

    P(W_1 ∈ dw_1, ..., W_k ∈ dw_k, τ ∈ dt_0)
        = P(τ_0 ∈ dt_0, ..., τ_k ∈ dt_k) · J
        = f(t_k) dt_k ∏_{j=0}^{k−1} (t_j − t_{j+1}) ρ(t_j − t_{j+1}).

In the case of the gamma process, we can read ρ(ξ) = θ ξ^{−1} e^{−bξ} from Eq. (2.23). The function f is determined by ρ and in this case (Pitman, 2006):

    f(t) = Gamma(t | θ, b) = b^θ Γ(θ)^{−1} t^{θ−1} e^{−bt}.

So

    P(W_1 ∈ dw_1, ..., W_k ∈ dw_k, τ ∈ dt_0) ∝ t_k^{θ−1} e^{−bt_0} = t_0^{θ−1} e^{−bt_0} ∏_{j=1}^k w_j^{θ−1}.

Since the distribution factorizes, the {W_k} are independent of each other and of τ. Second, we can read off the distributional kernel of each W_k to establish W_k ∼iid Beta(θ, 1), from whence it follows that V_k ∼iid Beta(1, θ).

Inference

In some sense, we skipped ahead in describing inference in Sections 2.3 and 2.4. There, we made use of the fact that random labels for partitions and features imply exchangeability of the data partition block assignments (Z_n) and data feature assignments (Y_n). In the discussion above, we study the object that associates random uniformly distributed labels with each partition or feature. Assuming the labels come from a uniform distribution rather than a general continuous distribution is a special case of the discussion in Section 2.3, and we defer the general case to the next section (Section 2.6).

We have seen above that it is particularly straightforward to obtain an EPPF or EFPF formulation, which yields sampling steps as described in Section 2.3, when the stick lengths are generated according to a normalized Poisson process in the partition case or a Poisson process in the feature case. Examples 2.5.4 and 2.5.7 illustrate how to find such exchangeable probability functions. Further, we have already seen the usefulness of the stick representation in inference, and Examples 2.5.4 and 2.5.9 illustrate how stick length distributions may be recovered from the subordinator framework.

2.6 Completely random measures

In our discussion of subordinators, the jump sizes of the subordinator corresponded to the feature frequencies or unnormalized partition frequencies and were the quantities of interest. By contrast, the locations of the jumps mainly served as convenient labels for the frequencies. These locations were chosen uniformly at random from the unit interval. This choice guaranteed the a.s. uniqueness of the labels and the exchangeability of the sequence of index assignments: (Z_n) in the clustering case or (Y_n) in the feature case. However, a labeling retains exchangeability and a.s. uniqueness as long as the labels are chosen iid from any continuous distribution (not just the uniform distribution).

Moreover, in typical applications, we wish to associate some parameter, often referred to as a "random effect," with each partition block or feature. In the partition case, we usually model the nth data point X_n as being generated according to some likelihood depending on the parameter corresponding to its block assignment. E.g., an individual animal's height and weight, X_n, varies randomly around the height and weight of its species, Z_n. Likewise, in the feature case, we typically model the observed data point X_n as being generated according to some likelihood depending on the collection of parameters corresponding to its collection of feature block assignments (cf. Eq. (2.11)). E.g., the book-buying pattern of an online consumer, X_n, varies with some noise based on the topics this person likes to read about: Y_n is a collection, possibly empty, of such topics.

In these cases, it can be useful to suppose that the partition block labels (or feature labels) φ_k are not necessarily R_+-valued but rather are generated iid according to some continuous distribution H on a general space Φ. Then, whenever k is the order-of-appearance partition block label of index n, we let Z_n = φ_k. Similarly, whenever k is the order-of-appearance feature label for some feature to which index n belongs, φ_k ∈ Y_n. Finally, then, we complete the generative model in the partition case by letting X_n ∼indep L(Z_n) for some distribution function L depending on parameter Z_n. And in the feature case, X_n ∼indep L(Y_n), where now the distribution function L depends on the collection of parameters Y_n.

When we take the jump sizes (ξ_k) of a subordinator as the weights of atoms with locations (φ_k) drawn iid according to H as described above, we find ourselves with a completely random measure µ:

    µ = Σ_{k=1}^∞ ξ_k δ_{φ_k}.          (2.24)

A completely random measure is a random measure µ such that whenever A and A′ are disjoint sets, we have that µ(A) and µ(A′) are independent random variables. To see that associating these more general atom locations to the jumps of a subordinator yields a completely random measure, note that Theorem 2.5.2 tells us that the subordinator jump sizes are generated according to a Poisson point process, with some intensity measure ν(dξ). The Marking Theorem for Poisson point processes (Kingman, 1993) in turn yields that the tuples {(ξ_k, φ_k)}_k are generated according to a Poisson point process with intensity measure ν(dξ) H(dφ). By Kingman (1967), whenever the tuples {(ξ_k, φ_k)}_k are drawn according to a Poisson point process, the measure in Eq. (2.24) is completely random.

Example 2.6.1 (Dirichlet process). We can form a completely random measure from the gamma process subordinator and a random labeling of the partition blocks. Specifically, suppose that the labels come from a continuous measure H. Then we generate a completely random measure G called a gamma process (Ferguson, 1973) in the following way:

    ν(dξ × dφ) = θ ξ^{−1} e^{−bξ} dξ · H(dφ)          (2.25)
    {(ξ_k, φ_k)}_k ∼ PPP(ν)          (2.26)
    G = Σ_{k=1}^∞ ξ_k δ_{φ_k}          (2.27)

Here, PPP(ν) denotes a draw from a Poisson point process with intensity measure ν. The parameters θ > 0 and b > 0 are the same as for the gamma process subordinator.

Figure 2.8: The gray manifold depicts the Poisson point process intensity measure ν in Eq. (2.25) for the choice Φ = [0, 1] and H the uniform distribution on [0, 1]. The endpoints of the line segments are points drawn from the Poisson point process as in Eq. (2.26). Taking the positive real-valued coordinate (leftmost axis) as the atom weights, we find the random measure G (a gamma process) on Φ from Eq. (2.27) in the bottom plane.

A gamma process draw, along with its generating Poisson point process intensity measure, is illustrated in Figure 2.8.

The Dirichlet process (DP) is the random measure formed by normalizing the gamma process (Ferguson, 1973). Since the Dirichlet process atom weights sum to one, it cannot be completely random. We can write the Dirichlet process D generated from the gamma process G above as:

    τ = Σ_{k=1}^∞ ξ_k
    ρ_k = ξ_k / τ
    D = Σ_{k=1}^∞ ρ_k δ_{φ_k}.

The random variables ρ_k have the same distribution as the Dirichlet process sticks (Eq. (2.14)) or normalized gamma process subordinator jump lengths, as we have seen above (Example 2.5.7).

Consider sampling points from a Dirichlet process and forming the induced partition of the data indices. Theorem 2.5.8 shows us that the distribution of the induced partition is the Chinese restaurant process EPPF.
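A rough simulation of this construction (an ε-truncation sketch, with H taken to be Unif[0, 1] purely for illustration; numpy and scipy assumed): approximate the gamma process by its atoms of weight above ε, attach iid locations from H, normalize to obtain an approximate Dirichlet process, and draw cluster assignments from it.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, b, eps, xmax = 2.0, 1.0, 1e-4, 50.0

# Approximate gamma process: keep jumps above eps from the Poisson point process
# with intensity theta * xi^{-1} * exp(-b * xi) d(xi) * H(d(phi)), H = Unif[0, 1].
grid = np.geomspace(eps, xmax, 20000)
weights = (theta * np.exp(-b * grid) / grid) * np.gradient(grid)
K = rng.poisson(weights.sum())                 # number of retained atoms
cdf = np.cumsum(weights); cdf /= cdf[-1]
xi = np.interp(rng.random(K), cdf, grid)       # atom weights
phi = rng.uniform(0.0, 1.0, size=K)            # atom locations drawn iid from H

# Normalize to an (approximate) Dirichlet process and sample Z_n iid from D.
rho = xi / xi.sum()
Z = rng.choice(phi, p=rho, size=20)
_, block_sizes = np.unique(Z, return_counts=True)
print(sorted(block_sizes, reverse=True))       # induced partition of [20], by block size
```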

Figure 2.9: The gray manifold depicts the Poisson point process intensity measure ν in Eq. (2.28) for the choice Φ = [0, 1] and H the uniform distribution on [0, 1]. The endpoints of the line segments are points drawn from the Poisson point process as in Eq. (2.29). Taking the [0, 1]-valued coordinate (leftmost axis) as the atom weights, we find the measure B (a beta process) on Φ from Eq. (2.30) in the bottom plane.

Example 2.6.2 (Beta process). We can form a completely random measure from the beta process subordinator and a random labeling of the feature blocks. If the labels are generated iid from a continuous measure H, then we say the completely random measure B, generated as follows, is called a beta process:

    ν(dξ × dφ) = γθ ξ^{−1} (1 − ξ)^{θ−1} dξ · H(dφ)          (2.28)
    {(ξ_k, φ_k)}_k ∼ PPP(ν)          (2.29)
    B = Σ_{k=1}^∞ ξ_k δ_{φ_k}.          (2.30)

The beta process, along with its generating intensity measure, is depicted in Figure 2.9. The (ξ_k) have the same distribution as the beta process sticks (Eq. (2.16)) or the beta process subordinator jump lengths (Example 2.5.4).

Now consider sampling a collection of atom locations according to Bernoulli draws from the atom weights of a beta process and forming the induced feature allocation of the data indices. Theorem 2.5.5 shows us that the distribution of the induced feature allocation is given by the Indian buffet process EFPF.
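The beta process plus Bernoulli draws can be sketched the same way (again an ε-truncation approximation, with H = Unif[0, 1] as an illustrative choice): simulate the atoms of B from its Lévy intensity, then include each atom's location in Y_n with probability equal to its weight. Up to truncation error, the number of features per index is then approximately Poisson(γ), matching Example 2.3.5.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma_, theta, eps, N = 2.0, 1.0, 1e-4, 10

# Approximate beta process: keep atoms with weight above eps from the Poisson point
# process with intensity gamma_ * theta * xi^{-1} * (1 - xi)^{theta - 1} d(xi) * H(d(phi)).
grid = np.geomspace(eps, 1.0 - 1e-12, 20000)
weights = gamma_ * theta * (1.0 - grid) ** (theta - 1.0) / grid * np.gradient(grid)
K = rng.poisson(weights.sum())                 # number of retained atoms
cdf = np.cumsum(weights); cdf /= cdf[-1]
xi = np.interp(rng.random(K), cdf, grid)       # atom weights in (0, 1)
phi = rng.uniform(0.0, 1.0, size=K)            # atom locations drawn iid from H = Unif[0, 1]

# Bernoulli process draws: index n contains feature k with probability xi[k].
Y = [set(phi[rng.random(K) < xi]) for _ in range(N)]
print([len(y) for y in Y])   # feature counts per index; roughly Poisson(gamma_) each
```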

Inference

In this section, we finally study the full model first outlined in the context of inference of partition and feature structures in Section 2.3. The partition or feature labels described in this section are the same as the block-specific parameters first described in Section 2.3. Since this section focuses on a generalization of the partition or feature labeling scheme beyond the uniform distribution option encoded in subordinators, inference for the atom weights remains unchanged from Sections 2.3, 2.4, and 2.5.

However, we note that, in the course of inferring underlying partition or feature structures, we are often also interested in inferring the parameters of the generative model of the data given the partition block or the feature labels. Conditional on the partition or feature structure, such inference is handled as in a normal hierarchical model with fixed dependencies. Namely, the parameter within a particular block may be inferred from the data points that depend on this block as well as the prior distribution for the parameters. Details for the Dirichlet process example inferred via MCMC sampling are provided by S. N. MacEachern (1994), Escobar and West (1995), and Neal (2000); Blei and Jordan (2006) work out details for the Dirichlet process using variational methods. In the beta process case, Griffiths and Ghahramani (2006), Teh, Görür, and Ghahramani (2007), and Thibaux and Jordan (2007) describe MCMC sampling, and Paisley, Zaas, et al. (2010) describe a variational approach.

2.7 Discussion

We have pursued a progressive augmentation from (1) simple distributions over partitions and feature allocations in the form of exchangeable probability functions to (2) the representation of stick lengths encoding frequencies of the partition block and feature occurrences to (3) subordinators, which associate random R_+-valued labels with each partition block or feature, and finally to (4) completely random measures, which associate a general class of labels with the stick lengths and whose labels we generally use as parameters in likelihood models built from the partition or feature allocation representation.

Along the way, we have focused primarily on two vignettes. We have shown, via these successive augmentations, that the Chinese restaurant process specifies the marginal distribution of the induced partition formed from iid draws from a Dirichlet process, which is in turn a normalized completely random measure. And we have shown that the Indian buffet process specifies the marginal distribution of the induced feature allocation formed by iid Bernoulli draws across the weights of a beta process.

There are many extensions of these ideas that lie beyond the scope of this chapter. A number of extensions of the CRP and Dirichlet process exist—in either the EPPF form (Pitman, 1996; Blei and Frazier, 2010), the stick length form (Dunson and Park, 2008), or the random measure form (Pitman and Yor, 1997). Likewise, extensions of the IBP and beta process have been explored (Teh, Görür, and Ghahramani, 2007; Paisley, Zaas, et al., 2010; Broderick, Jordan, and Pitman, 2012).

More generally, the framework above demonstrates how alternative partition and feature allocation models may be constructed—either by introducing different EPPFs (Pitman, 1996; Gnedin and Pitman, 2006) or EFPFs, different stick length distributions (Ishwaran and James, 2001), or different random measures (Wolpert and Ickstadt, 2004).

Finally, we note that expanding the set of combinatorial structures with useful Bayesian priors from partitions to the superset of feature allocations suggests that further such structures might be usefully examined. For instance, the beta negative binomial process (Broderick, Mackey, et al., 2014; Zhou et al., 2012) provides a prior on a generalization of a feature allocation where we allow the features themselves to be multisets; i.e., each index may have non-negative integer multiplicities of features. Models on trees (Adams, Ghahramani, and Jordan, 2010; McCullagh, Pitman, and Winkel, 2008; Blei, Griffiths, and Jordan, 2010), graphs (W. Li and McCallum, 2006), and permutations (Pitman, 1996) provide avenues for future exploration. And there likely remain further structures to be fitted out with useful Bayesian priors.

Chapter 3

Beta processes, stick-breaking, and power laws

The beta-Bernoulli process provides a Bayesian nonparametric prior for models involving collections of binary-valued features. A draw from the beta process yields an infinite collection of probabilities in the unit interval, and a draw from the Bernoulli process turns these into binary-valued features. Recent work has provided stick-breaking representations for the beta process analogous to the well-known stick-breaking representation for the Dirichlet process. We derive one such stick-breaking representation directly from the characterization of the beta process as a completely random measure. This approach motivates a three-parameter generalization of the beta process, and we study the power laws that can be obtained from this generalized beta process. We present a posterior inference algorithm for the beta-Bernoulli process that exploits the stick-breaking representation, and we present experimental results for a discrete factor-analysis model.

3.1 Introduction

Large data sets are often heterogeneous, arising as amalgams from underlying sub-populations. The analysis of large data sets thus often involves some form of stratification in which groupings are identified that are more homogeneous than the original data. While this can sometimes be done on the basis of explicit covariates, it is also commonly the case that the groupings are captured via discrete latent variables that are to be inferred as part of the analysis. Within a Bayesian framework, there are two widely employed modeling motifs for problems of this kind. The first is the Dirichlet-multinomial motif, which is based on the assumption that there are K "clusters" that are mutually exclusive and exhaustive, such that allocations of data to clusters can be modeled via a multinomial random variable whose parameter vector is drawn from a Dirichlet distribution. A second motif is the beta-Bernoulli motif, where a collection of K binary "features" are used to describe the data, and where each feature is modeled as a Bernoulli random variable whose parameter is obtained from a beta distribution. The latter motif can be converted to the former in principle—we can view particular patterns of ones and zeros as defining a cluster, thus obtaining $M = 2^K$ clusters in total. But in practice models based on the Dirichlet-multinomial motif typically require O(M) additional parameters in the likelihood, whereas those based on the beta-Bernoulli motif typically require only O(K) additional parameters. Thus, if the combinatorial structure encoded by the binary features captures real structure in the data, then the beta-Bernoulli motif can make more efficient usage of its parameters.

The Dirichlet-multinomial motif can be extended to a stochastic process known as the Dirichlet process. A draw from a Dirichlet process is a random probability measure that can be represented as follows (McCloskey, 1965; Patil and Taillie, 1977; Ferguson, 1973; Sethuraman, 1994):

$$G = \sum_{i=1}^{\infty} \pi_i \delta_{\psi_i}, \qquad (3.1)$$
where $\delta_{\psi_i}$ represents an atomic measure at location $\psi_i$, where both the $\{\pi_i\}$ and the $\{\psi_i\}$ are random, and where the $\{\pi_i\}$ are nonnegative and sum to one (with probability one). Conditioning on G and drawing N values independently from G yields a collection of M distinct values, where $M \le N$ is random and grows (in expectation) at rate $O(\log N)$. Treating these distinct values as indices of clusters, we obtain a model in which the number of clusters is random and subject to posterior inference.

A great deal is known about the Dirichlet process—there are direct connections between properties of G as a random measure (e.g., it can be obtained from a Poisson point process), properties of the sequence of values $\{\pi_i\}$ (they can be obtained from a "stick-breaking process"), and properties of the collection of distinct values obtained by sampling from G (they are characterized by a stochastic process known as the Chinese restaurant process). These connections have helped to place the Dirichlet process at the center of Bayesian nonparametrics, driving the development of a wide variety of inference algorithms for models based on Dirichlet process priors and suggesting a range of generalizations (e.g. S. MacEachern, 1999; Ishwaran and James, 2001; Walker, 2007; Kalli, Griffin, and Walker, 2011).

It is also possible to extend the beta-Bernoulli motif to a Bayesian nonparametric framework, and there is a growing literature on this topic. The underlying stochastic process is the beta process, which is an instance of a family of random measures known as completely random measures (Kingman, 1967). The beta process was first studied in the context of survival analysis by Hjort (1990), where the focus is on modeling hazard functions via the random cumulative distribution function obtained by integrating the beta process. Thibaux and Jordan (2007) focused instead on the beta process realization itself, which can be represented as
$$G = \sum_{i=1}^{\infty} q_i \delta_{\psi_i},$$
where both the $\{q_i\}$ and the $\{\psi_i\}$ are random and where the $q_i$ are contained in the interval (0, 1). This random measure can be viewed as furnishing an infinite collection of coins, which, when tossed repeatedly, yield a binary featural description of a set of entities in which the number of features with non-zero values is random. Thus, the resulting beta-Bernoulli process can be viewed as an infinite-dimensional version of the beta-Bernoulli motif. Indeed, Thibaux and Jordan (2007) showed that by integrating out the random $q_i$ and $\psi_i$ one obtains—by analogy to the derivation of the Chinese restaurant process from the Dirichlet process—a combinatorial stochastic process known as the Indian buffet process, previously studied by Griffiths and Ghahramani (2006), who derived it via a limiting process involving random binary matrices obtained by sampling finite collections of beta-Bernoulli variables.

Stick-breaking representations of the Dirichlet process have been particularly important both for algorithmic development and for exploring generalizations of the Dirichlet process. These representations yield explicit recursive formulas for obtaining the weights $\{\pi_i\}$ in Eq. (3.1). In the case of the beta process, explicit non-recursive representations can be obtained for the weights $\{q_i\}$, based on size-biased sampling (Thibaux and Jordan, 2007) and inverse Lévy measure (Wolpert and Ickstadt, 2004; Teh, Görür, and Ghahramani, 2007). Recent work has also yielded recursive constructions that are more closely related to the stick-breaking representation of the Dirichlet process (Teh, Görür, and Ghahramani, 2007; Paisley, Zaas, et al., 2010).

Stick-breaking representations of the Dirichlet process permit ready generalizations to stochastic processes that yield power-law behavior (which the Dirichlet process does not), notably the Pitman-Yor process (Ishwaran and James, 2001; Pitman, 2006). Power-law generalizations of the beta process have also been studied (Teh and Görür, 2009) and stick-breaking-like representations derived. These latter representations are, however, based on the non-recursive size-biased sampling and inverse-Lévy methods rather than the recursive representations of Teh, Görür, and Ghahramani (2007) and Paisley, Zaas, et al. (2010).

Teh, Görür, and Ghahramani (2007) and Paisley, Zaas, et al. (2010) derived their stick-breaking representations of the beta process as limiting processes, making use of the derivation of the Indian buffet process by Griffiths and Ghahramani (2006) as a limit of finite-dimensional random matrices. In the current chapter we show how to derive stick-breaking for the beta process directly from the underlying random measure. This approach not only has the advantage of conceptual clarity (our derivation is elementary), but it also permits a unified perspective on various generalizations of the beta process that yield power-law behavior.¹ We show in particular that it yields a power-law generalization of the stick-breaking representation of Paisley, Zaas, et al. (2010).

To illustrate our results in the context of a concrete application, we study a discrete factor analysis model previously considered by Griffiths and Ghahramani (2006) and Paisley, Zaas, et al. (2010). The model is of the form
$$X = Z\Phi + E, \qquad (3.2)$$
where $X \in \mathbb{R}^{N \times P}$ is the data and $E \in \mathbb{R}^{N \times P}$ is an error matrix. The matrix $\Phi \in \mathbb{R}^{K \times P}$ is a matrix of factors, and $Z \in \mathbb{R}^{N \times K}$ is a binary matrix of factor loadings.
The dimension K is infinite, and thus the rows of Φ comprise an infinite collection of factors. The matrix Z is obtained via a draw from a beta-Bernoulli process; its nth row is an infinite binary vector of features (i.e., factor loadings) encoding which of the infinite collection of factors are used in modeling the nth data point.

The remainder of the chapter is organized as follows. We introduce the beta process, and its conjugate measure the Bernoulli process, in Section 3.2. In order to consider stick-breaking and power law behavior in the beta-Bernoulli framework, we first review stick-breaking for the Dirichlet process in Section 3.3 and power laws in clustering models in Section 3.4. We also consider potential power laws that might exist in featural models in Section 3.4. Our main theoretical results come in the following two sections. First, in Section 3.5, we provide a proof that the stick-breaking representation of Paisley, Zaas, et al. (2010), expanded to include a third parameter, holds for a three-parameter extension of the beta process. Our proof takes a measure-theoretic approach based on a Poisson process. We then make use of the Poisson process framework to establish asymptotic power laws, with exact constants, for the three-parameter beta process in Section 3.6. We also show, in Section 3.6, that there are aspects of the beta-Bernoulli framework that cannot exhibit a power law. We illustrate the asymptotic power laws on a simulated data set in Section 3.7. We present experimental results in Section 3.8, and we present an MCMC algorithm for posterior inference in Appendix 3.A.

¹A similar measure-theoretic derivation has been presented recently by Paisley, Blei, and Jordan (2012), who focus on applications to truncations of the beta process.

3.2 The beta process and the Bernoulli process

The beta process and the Bernoulli process are instances of the general family of random measures known as completely random measures (Kingman, 1967). A completely random measure H on a probability space $(\Psi, \mathcal{S})$ is a random measure such that, for any disjoint measurable sets $A_1, \ldots, A_n \in \mathcal{S}$, the random variables $H(A_1), \ldots, H(A_n)$ are independent. Completely random measures can be obtained from an underlying Poisson point process. Let ν(dψ, du) denote a σ-finite measure² on the product space $\Psi \times \mathbb{R}$. Draw a realization from a Poisson point process with rate measure ν(dψ, du). This yields a set of points $\Pi = \{(\psi_i, U_i)\}_i$, where the index i may range over a countable infinity. Finally, construct a random measure as follows:

$$B = \sum_{i=1}^{\infty} U_i \delta_{\psi_i}, \qquad (3.3)$$
where $\delta_{\psi_i}$ denotes an atom at $\psi_i$. This discrete random measure is such that for any measurable set $T \in \mathcal{S}$,
$$B(T) = \sum_{i : \psi_i \in T} U_i.$$
That B is completely random follows from the Poisson point process construction.

²The measure ν need not necessarily be σ-finite to generate a completely random measure, though we consider only σ-finite measures in this work.

Figure 3.1: The gray surface illustrates the rate density in Eq. (3.4) corresponding to the beta process. The base measure B0 is taken to be uniform on Ψ. The non-zero endpoints of the line segments plotted below the surface are a particular realization of the Poisson process, and the line segments themselves represent a realization of the beta process.

In addition to the representation obtained from a Poisson process, completely random measures may include a deterministic measure and a set of atoms at fixed locations. The component of the completely random measure generated from a Poisson point process as described above is called the ordinary component. As shown by Kingman (1967), completely random measures are essentially characterized by this representation. An example is shown in Figure 3.1.

The beta process, denoted $B \sim \mathrm{BP}(\theta, B_0)$, is an example of a completely random measure. As long as the base measure $B_0$ is continuous, which is our assumption here, B has only an ordinary component with rate measure

$$\nu_{\mathrm{BP}}(d\psi, du) = \theta(\psi)\, u^{-1} (1 - u)^{\theta(\psi) - 1}\, du\, B_0(d\psi), \qquad \psi \in \Psi,\ u \in [0, 1], \qquad (3.4)$$
where θ is a positive function on Ψ. The function θ is called the concentration function (Hjort, 1990). In the remainder we follow Thibaux and Jordan (2007) in taking θ to be a real-valued constant and refer to it as the concentration parameter. We assume $B_0$ is nonnegative and fixed. The total mass of $B_0$, $\gamma := B_0(\Psi)$, is called the mass parameter. We assume γ is strictly positive and finite. The density in Eq. (3.4), with the choice of $B_0$ uniform over [0, 1], is illustrated in Figure 3.1.

The beta process can be viewed as providing an infinite collection of coin-tossing probabilities. Tossing these coins corresponds to a draw from the Bernoulli process, yielding an infinite binary vector that we will treat as a latent feature vector.


Figure 3.2: Upper left: A draw B from the beta process. Lower left: 50 draws from the Bernoulli process BeP(B). The vertical axis indexes the draw number among the 50 exchangeable draws. A point indicates a one at the corresponding location on the horizontal axis, $\psi \in \Psi$. Right: We can form a matrix from the lower left plot by including only those ψ values with a non-zero number of Bernoulli successes among the 50 draws from the Bernoulli process. Then, the number of columns K is the number of such ψ, and the number of rows N is the number of draws made. A black square indicates a one at the corresponding matrix position; a white square indicates a zero.

More formally, a Bernoulli process $Y \sim \mathrm{BeP}(B)$ is a completely random measure with potentially both fixed atomic and ordinary components. In defining the Bernoulli process we consider only the case in which B is discrete, i.e., of the form in Eq. (3.3), though not necessarily a beta process draw or even random for the moment. Then Y has only a fixed atomic component and has the form

$$Y = \sum_{i=1}^{\infty} b_i \delta_{\psi_i}, \qquad (3.5)$$

where $b_i \sim \mathrm{Bern}(u_i)$ for $u_i$ the corresponding atomic mass in the measure B. We can see that $\mathbb{E}(Y(\Psi) \mid B) = B(\Psi)$ from the mean of the Bernoulli distribution, so the number of non-zero points in any realization of the Bernoulli process is finite when B is a finite measure.

We can link the beta process and N Bernoulli process draws to generate a random feature matrix Z. To that end, first draw $B \sim \mathrm{BP}(\theta, B_0)$ for fixed hyperparameters θ and $B_0$ and then draw $Y_n \overset{iid}{\sim} \mathrm{BeP}(B)$ for $n \in \{1, \ldots, N\}$. Note that since B is discrete, each $Y_n$ will be discrete as in Eq. (3.5), with point masses only at the atoms $\{\psi_i\}$ of the beta process B. Note also that $\mathbb{E}B(\Psi) = \gamma < \infty$, so B is a finite measure, and it follows that the number of non-zero point masses in any draw $Y_n$ from the Bernoulli process will be finite. Therefore, the total number of non-zero point masses K across N such Bernoulli process draws is finite.

Now reorder the $\{\psi_i\}$ so that the first K are exactly those locations where some Bernoulli process in $\{Y_n\}_{n=1}^N$ has a non-zero point mass. We can form a matrix $Z \in \{0, 1\}^{N \times K}$ as a function of the $\{Y_n\}_{n=1}^N$ by letting the (n, k) entry equal one when $Y_n$ has a non-zero point mass at $\psi_k$ and zero otherwise. If we wish to think of Z as having an infinite number of columns, the remaining columns represent the point masses of the $\{Y_n\}_{n=1}^N$ at $\{\psi_k\}_{k > K}$, which we know to be zero by construction. We refer to the overall procedure of drawing Z according to, first, a beta process and then repeated Bernoulli process draws in this way as a beta-Bernoulli process, and we write $Z \sim \mathrm{BP{-}BeP}(N, \gamma, \theta)$. Note that we have implicitly integrated out the $\{\psi_k\}$, and the distribution of the matrix Z depends on $B_0$ only through its total mass, γ. As shown by Thibaux and Jordan (2007), this process yields the same distribution on row-exchangeable, infinite-column matrices as the Indian buffet process (Griffiths and Ghahramani, 2006), which describes a stochastic process directly on (equivalence classes of) binary matrices. That is, the Indian buffet process is obtained as an exchangeable distribution on binary matrices when the underlying beta process measure is integrated out. This result is analogous to the derivation of the Chinese restaurant process as the exchangeable distribution on partitions obtained when the underlying Dirichlet process is integrated out. The beta-Bernoulli process is illustrated in Figure 3.2.
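As a concrete illustration of this construction, the sketch below forms the matrix Z from repeated Bernoulli draws, assuming the beta process has already been approximated by a finite set of atom weights q (for instance via the stick-breaking construction of Section 3.5); the function name and the finite truncation are illustrative conveniences, not part of the model itself.

```python
import numpy as np

def beta_bernoulli_matrix(q, N, rng=None):
    """Given (truncated) beta process atom weights q in (0,1), draw N Bernoulli
    process realizations and keep only the columns with at least one success,
    yielding the N x K binary feature matrix Z described in the text."""
    rng = np.random.default_rng(rng)
    q = np.asarray(q)
    draws = rng.random((N, len(q))) < q      # one row per Bernoulli process draw
    keep = draws.any(axis=0)                 # atoms with a non-zero point mass
    return draws[:, keep].astype(int)

Z = beta_bernoulli_matrix(q=[0.7, 0.05, 0.3, 0.01], N=10, rng=0)
print(Z.shape, Z.sum(axis=0))
```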

3.3 Stick-breaking for the Dirichlet process

The stick-breaking representation of the Dirichlet process (McCloskey, 1965; Patil and Taillie, 1977; Sethuraman, 1994) provides a simple recursive procedure for obtaining the weights $\{\pi_i\}$ in Eq. (3.1). This procedure provides an explicit representation of a draw G from the Dirichlet process, one which can be usefully instantiated and updated in posterior inference algorithms.


Figure 3.3: A stick-breaking process starts with the unit interval (far left). First, a random fraction $V_1$ of the unit interval is broken off; the remaining stick has length $1 - V_1$ (middle left). Next, a random fraction $V_2$ of the remaining stick is broken off, i.e., a fragment of size $V_2(1 - V_1)$; the remaining stick has length $(1 - V_1)(1 - V_2)$. This process proceeds recursively and generates stick fragments $V_1,\ V_2(1 - V_1),\ \ldots,\ V_i \prod_{j<i}(1 - V_j),\ \ldots$.

The representation is as follows:
$$G = \sum_{i=1}^{\infty} V_i \left[ \prod_{j=1}^{i-1} (1 - V_j) \right] \delta_{\psi_i}$$
$$V_i \overset{iid}{\sim} \mathrm{Beta}(1, \theta)$$
$$\psi_i \overset{iid}{\sim} G_0, \qquad (3.6)$$
where $G_0$ is referred to as the base measure and θ is referred to as the concentration parameter.
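For concreteness, the following sketch draws a truncated set of Dirichlet process weights by the recursion in Eq. (3.6); the truncation level is an approximation introduced here only for illustration.

```python
import numpy as np

def dp_stick_weights(theta, truncation, rng=None):
    """Truncated stick-breaking weights pi_i = V_i * prod_{j<i}(1 - V_j)
    with V_i ~ Beta(1, theta), as in Eq. (3.6)."""
    rng = np.random.default_rng(rng)
    V = rng.beta(1.0, theta, size=truncation)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))
    return V * remaining

pi = dp_stick_weights(theta=2.0, truncation=1000, rng=0)
print(pi[:5], pi.sum())  # the sum approaches 1 as the truncation grows
```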

3.4 Power law behavior

Consider the process of sampling a random measure G from a Dirichlet process and subsequently drawing independently N times from G. The number of unique atoms sampled according to this process will grow as a function of N. The growth associated with the Dirichlet process is relatively slow, however, and when the Dirichlet process is used as a prior in a clustering model one does not obtain the heavy-tailed behavior commonly referred to as a "power law." In this section we first provide a brief exposition of the different kinds of power law that we might wish to obtain in a clustering model and discuss how these laws can be obtained via an extension of the stick-breaking representation. We then discuss analogous laws for featural models.

Power laws in clustering models

First, we establish some notation. Given a number N of draws from a discrete random probability measure G (where G is not necessarily a draw from the Dirichlet process), let $(N_1, N_2, \ldots)$ denote the sequence of counts associated with the unique values obtained among the N draws, where we view these unique values as "clusters." Let
$$K_{N,j} = \sum_{i=1}^{\infty} \mathbb{1}(N_i = j), \qquad (3.7)$$
and let
$$K_N = \sum_{i=1}^{\infty} \mathbb{1}(N_i > 0). \qquad (3.8)$$
That is, $K_{N,j}$ is the number of clusters that are drawn exactly j times, and $K_N$ is the total number of clusters.

There are two types of power-law behavior that a clustering model might exhibit. First, there is the type of power law behavior reminiscent of Heaps' law (Heaps, 1978; Gnedin, Hansen, and Pitman, 2007) and describing the asymptotic behavior of the number of clusters:

$$K_N \overset{a.s.}{\sim} c N^a, \qquad N \to \infty \qquad (3.9)$$
for some constants $c > 0$, $a \in (0, 1)$. Here, $\sim$ means that the limit of the ratio of the left-hand and right-hand sides, when they are both real-valued and non-random, is one as the number of data points N grows large. We denote a power law in the form of Eq. (3.9) as Type I. Second, there is the type of power law behavior reminiscent of Zipf's law (Zipf, 1949; Gnedin, Hansen, and Pitman, 2007) and describing the asymptotic behavior of the number of clusters of size j:
$$K_{N,j} \overset{a.s.}{\sim} \frac{a \Gamma(j - a)}{j!\, \Gamma(1 - a)}\, c N^a, \qquad N \to \infty \qquad (3.10)$$
again for some constants $c > 0$, $a \in (0, 1)$. We refer to the power law in Eq. (3.10) as Type II. Note that Gnedin, Hansen, and Pitman (2007) have shown, and we will see further below, that this particular way of writing the proportionality constant is natural. Sometimes in the case of Eq. (3.10), we are interested in the behavior in j; therefore we recall $j! = \Gamma(j + 1)$ and note the following fact about the Γ-function ratio in Eq. (3.10) (cf. Tricomi and Erdélyi, 1951):
$$\frac{\Gamma(j - a)}{\Gamma(j + 1)} \sim j^{-1-a}, \qquad j \to \infty. \qquad (3.11)$$
Again, we see behavior in the form of a power law at work.

Power-law behavior of Types I and II (and equivalent formulations; see Gnedin, Hansen, and Pitman, 2007) has been observed in a variety of real-world clustering problems including, but not limited to: the number of species per plant genus, the in-degree or out-degree of a graph constructed from hyperlinks on the Internet, the number of people in cities, the number of words in documents, the number of papers published by scientists, and the amount each person earns in income (Mitzenmacher, 2004; Goldwater, Griffiths, and M. Johnson, 2006). Bayesians modeling these situations will prefer a prior that reflects this distributional attribute. While the Dirichlet process exhibits neither type of power-law behavior, the Pitman-Yor process yields both kinds of power law (Pitman and Yor, 1997; Goldwater, Griffiths, and M. Johnson, 2006), though we note that in this case c is a random variable (still with no dependence on N or j). The Pitman-Yor process, denoted $G \sim \mathrm{PY}(\theta, \alpha, G_0)$, is defined via the following stick-breaking representation:

$$G = \sum_{i=1}^{\infty} V_i \left[ \prod_{j=1}^{i-1} (1 - V_j) \right] \delta_{\psi_i}$$
$$V_i \overset{indep}{\sim} \mathrm{Beta}(1 - \alpha, \theta + i\alpha)$$
$$\psi_i \overset{iid}{\sim} G_0, \qquad (3.12)$$
where α is known as a discount parameter. The case α = 0 returns the Dirichlet process (cf. Eq. (3.6)).

Note that in both the Dirichlet process and the Pitman-Yor process, the weights $\{V_i \prod_{j=1}^{i-1}(1 - V_j)\}$ are the weights of the process in size-biased order (Pitman, 2006). In the Pitman-Yor case, the $\{V_i\}$ are no longer identically distributed.
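As an empirical illustration of the Type I power law (and of its absence when α = 0), the sketch below simulates cluster assignments using the standard Pitman-Yor urn (predictive) scheme, a well-known characterization of the process that is not derived in this chapter: a new cluster is created with probability $(\theta + K\alpha)/(n + \theta)$, and an existing cluster of size $n_k$ is joined with probability $(n_k - \alpha)/(n + \theta)$.

```python
import numpy as np

def pitman_yor_cluster_counts(N, theta, alpha, rng=None):
    """Simulate N draws via the Pitman-Yor urn scheme and return the number
    of distinct clusters K_n after each draw."""
    rng = np.random.default_rng(rng)
    counts, K_trace = [], []
    for n in range(N):
        # Probabilities: existing clusters get (n_k - alpha), a new one gets (theta + K*alpha).
        probs = np.array(counts + [theta + alpha * len(counts)], dtype=float)
        probs[:-1] -= alpha
        probs /= (n + theta)
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)       # open a new cluster
        else:
            counts[k] += 1         # join cluster k
        K_trace.append(len(counts))
    return np.array(K_trace)

K = pitman_yor_cluster_counts(N=10_000, theta=1.0, alpha=0.5, rng=0)
print(K[999], K[-1])  # K_N grows roughly like N^alpha when alpha > 0
```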

Power laws in featural models

The beta-Bernoulli process provides a specific kind of feature-based representation of entities. In this section we study general featural models and consider the power laws that might arise for such models. In the clustering framework, we considered N draws from a process that put exactly one mass of size one on some value in Ψ and mass zero elsewhere. In the featural framework we consider N draws from a process that places some non-negative integer number of masses, each of size one, on an almost surely finite set of values in Ψ and mass zero elsewhere. As $N_i$ was the sum of masses at a point labeled $\psi_i \in \Psi$ in the clustering framework, so do we now let $N_i$ be the sum of masses at a point labeled $\psi_i \in \Psi$. We use the same notation as in Section 3.4 to define the number of features $K_N$ (Eq. (3.8)) and the number of features represented by j data points $K_{N,j}$ (Eq. (3.7)). But now we note that the counts $N_i$ no longer sum to N in general.

In the case of featural models, we can still talk about Type I and II power laws, both of which have the same interpretation as in the case of clustering models: asymptotic power law behavior of the number of features and asymptotic power law behavior in the number of features of cardinality j, both as $N \to \infty$. In the featural case, however, it is also possible to consider a third type of power law. If we let $k_n$ denote the number of features present in the nth draw, we say that $k_n$ shows power law behavior if
$$P(k_n > M) \sim c M^{-a}$$
for positive constants c and a. We call this last type of power law Type III.
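To make the featural notation concrete, the following bookkeeping sketch computes $K_N$, $K_{N,j}$, and $k_n$ from a binary feature matrix of the kind produced by the beta-Bernoulli process; the function name and the small random example matrix are illustrative only.

```python
import numpy as np

def feature_counts(Z):
    """Z is an N x K binary matrix (rows are data points, columns features).
    Returns K_N (features appearing at least once), a dict K_{N,j} giving the
    number of features appearing in exactly j data points, and the vector k_n
    of feature counts per data point (the Type III quantity)."""
    N_i = Z.sum(axis=0)                     # number of data points per feature
    K_N = int(np.sum(N_i > 0))
    K_Nj = {j: int(np.sum(N_i == j)) for j in range(1, int(N_i.max()) + 1)}
    k_n = Z.sum(axis=1)
    return K_N, K_Nj, k_n

Z = (np.random.default_rng(1).random((100, 20)) < 0.2).astype(int)
K_N, K_Nj, k_n = feature_counts(Z)
print(K_N, K_Nj.get(1, 0), k_n.mean())
```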

3.5 Stick-breaking for the beta process

The weights $\{q_i\}$ for the beta process can be derived by a variety of procedures, including size-biased sampling (Thibaux and Jordan, 2007) and inverse Lévy measure (Wolpert and Ickstadt, 2004; Teh, Görür, and Ghahramani, 2007). The procedures that are closest in spirit to the stick-breaking representation for the Dirichlet process are those due to Paisley, Zaas, et al. (2010) and Teh, Görür, and Ghahramani (2007). Our point of departure is the former, which has the following form:

$$B = \sum_{i=1}^{\infty} \sum_{j=1}^{C_i} V_{i,j}^{(i)} \prod_{l=1}^{i-1} \left(1 - V_{i,j}^{(l)}\right) \delta_{\psi_{i,j}}$$
$$C_i \overset{iid}{\sim} \mathrm{Poisson}(\gamma)$$
$$V_{i,j}^{(l)} \overset{iid}{\sim} \mathrm{Beta}(1, \theta)$$
$$\psi_{i,j} \overset{iid}{\sim} \frac{1}{\gamma} B_0. \qquad (3.13)$$

This representation is analogous to the stick-breaking representation of the Dirichlet process in that it represents a draw from the beta process as a sum over independently drawn atoms, with the weights obtained by a recursive procedure. However, it is worth noting that for every (i, j) tuple subscript for $V_{i,j}^{(l)}$, a different stick exists and is broken across the superscript l. Thus, there are no special additive properties across weights in the sum in Eq. (3.13); by contrast, the weights in Eq. (3.12) sum to one almost surely.

The generalization of the one-parameter Dirichlet process to the two-parameter Pitman-Yor process suggests that we might consider generalizing the stick-breaking representation of the beta process in Eq. (3.13) as follows:

$$B = \sum_{i=1}^{\infty} \sum_{j=1}^{C_i} V_{i,j}^{(i)} \prod_{l=1}^{i-1} \left(1 - V_{i,j}^{(l)}\right) \delta_{\psi_{i,j}}$$
$$C_i \overset{iid}{\sim} \mathrm{Poisson}(\gamma)$$
$$V_{i,j}^{(l)} \overset{indep}{\sim} \mathrm{Beta}(1 - \alpha, \theta + l\alpha)$$
$$\psi_{i,j} \overset{iid}{\sim} \frac{1}{\gamma} B_0. \qquad (3.14)$$
In Section 3.6 we will show that introducing the additional parameter α indeed yields Type I and II power law behavior (but not Type III).

In the remainder of this section we present a proof that these stick-breaking representations arise from the beta process. In contradistinction to the proof of Eq. (3.13) by Paisley, Zaas, et al. (2010), which used a limiting process defined on sequences of finite binary matrices, our approach makes a direct connection to the Poisson process characterization of the beta process. Our proof has several virtues: (1) it relies on no asymptotic arguments and instead comes entirely from the Poisson process representation; (2) it is, as a result, simpler and shorter; and (3) it demonstrates clearly the ease of incorporating a third parameter analogous to the discount parameter of the Pitman-Yor process and thereby provides a strong motivation for the extended stick-breaking representation in Eq. (3.14).

Aiming toward the general stick-breaking representation in Eq. (3.14), we begin by defining a three-parameter generalization of the beta process.³ We say that $B \sim \mathrm{BP}(\theta, \alpha, B_0)$, where we call α a discount parameter, if, for $\psi \in \Psi$, $u \in [0, 1]$, we have
$$\nu_{\mathrm{BP}}(d\psi, du) = \frac{\Gamma(1 + \theta)}{\Gamma(1 - \alpha)\Gamma(\theta + \alpha)} u^{-1-\alpha} (1 - u)^{\theta + \alpha - 1}\, du\, B_0(d\psi). \qquad (3.15)$$
It is straightforward to show that this three-parameter density has similar properties to that of the two-parameter beta process. For instance, choosing $\alpha \in (0, 1)$ and $\theta > -\alpha$ is necessary for the beta process to have finite total mass almost surely; in this case,

$$\int_{\Psi \times \mathbb{R}_+} u\, \nu_{\mathrm{BP}}(d\psi, du) = \gamma < \infty. \qquad (3.16)$$

³See also Teh and Görür (2009) or Kim and Lee (2001), with θ(t) ≡ 1 − α, β(t) ≡ θ + α, where the left-hand sides are in the notation of Kim and Lee (2001).
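A minimal simulation sketch of the construction in Eq. (3.14), truncated to a finite number of rounds (the truncation is an approximation introduced here purely for illustration): round i contributes $C_i \sim \mathrm{Poisson}(\gamma)$ atoms, and each atom's weight is a product of independent beta-distributed stick proportions.

```python
import numpy as np

def three_param_bp_sticks(gamma, theta, alpha, rounds, rng=None):
    """Truncated simulation of the atom weights in Eq. (3.14). In round i we
    draw C_i ~ Poisson(gamma) atoms; each atom's weight is
    V^{(i)} * prod_{l<i}(1 - V^{(l)}) with V^{(l)} ~ Beta(1 - alpha, theta + l*alpha).
    Atom locations psi are omitted; only the weights q are returned."""
    rng = np.random.default_rng(rng)
    weights = []
    for i in range(1, rounds + 1):
        for _ in range(rng.poisson(gamma)):
            V = rng.beta(1.0 - alpha, theta + alpha * np.arange(1, i + 1))
            weights.append(V[-1] * np.prod(1.0 - V[:-1]))
    return np.array(weights)

q = three_param_bp_sticks(gamma=3.0, theta=1.0, alpha=0.6, rounds=200, rng=0)
print(len(q), q.sum())  # the total mass approaches gamma in expectation
```

Setting α = 0 recovers the two-parameter construction of Eq. (3.13).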

We now turn to the main result of this section.

Proposition 3.5.1. B can be represented according to the process described in Eq. (3.14) if and only if $B \sim \mathrm{BP}(\theta, \alpha, B_0)$.

Proof. First note that the points in the set

$$P_1 := \left\{ \left(\psi_{1,1}, V_{1,1}^{(1)}\right), \left(\psi_{1,2}, V_{1,2}^{(1)}\right), \ldots, \left(\psi_{1,C_1}, V_{1,C_1}^{(1)}\right) \right\}$$
are by construction independent and identically distributed conditioned on $C_1$. Since $C_1$ is Poisson-distributed, $P_1$ is a Poisson point process. The same logic gives that in general, for

$$P_i := \left\{ \left( \psi_{i,1},\ V_{i,1}^{(i)} \prod_{l=1}^{i-1} \left(1 - V_{i,1}^{(l)}\right) \right), \ldots, \left( \psi_{i,C_i},\ V_{i,C_i}^{(i)} \prod_{l=1}^{i-1} \left(1 - V_{i,C_i}^{(l)}\right) \right) \right\},$$
$P_i$ is a Poisson point process. Next, define

$$P := \bigcup_{i=1}^{\infty} P_i.$$
As the countable union of Poisson processes with finite rate measures, P is itself a Poisson point process. Notice that we can write B in Eq. (3.14) as the completely random measure $B = \sum_{(\psi, U) \in P} U \delta_{\psi}$. Also, for any $B' \sim \mathrm{BP}(\theta, \alpha, B_0)$, we can write $B' = \sum_{(\psi', U') \in \Pi} U' \delta_{\psi'}$, where Π is a Poisson point process with rate measure $\nu_{\mathrm{BP}} = B_0 \times \mu_{\mathrm{BP}}$, and $\mu_{\mathrm{BP}}$ is a σ-finite measure with density
$$\frac{\Gamma(1 + \theta)}{\Gamma(1 - \alpha)\Gamma(\theta + \alpha)} u^{-1-\alpha} (1 - u)^{\theta + \alpha - 1}\, du. \qquad (3.17)$$
Therefore, to show that B has the same distribution as B′, it is enough to show that P and Π have the same rate measures.

To that end, let ν denote the rate measure of P. Let #S indicate the number of elements in set S, and let $\mathbb{1}_E$ denote the indicator of the event E; $\mathbb{1}_E$ is equal to one when E is true and equal to zero when E is false. Then we have
$$\begin{aligned}
\nu(A \times \tilde{A}) &= \mathbb{E}\,\#\left\{ (\psi_i, U_i) \in A \times \tilde{A} \right\} \\
&= \frac{1}{\gamma} B_0(A) \cdot \mathbb{E} \sum_{i=1}^{\infty} \sum_{j=1}^{C_i} \mathbb{1}\left\{ V_{ij}^{(i)} \prod_{l=1}^{i-1} \left(1 - V_{ij}^{(l)}\right) \in \tilde{A} \right\} \\
&= \frac{1}{\gamma} B_0(A) \cdot \sum_{i=1}^{\infty} \mathbb{E} \sum_{j=1}^{C_i} \mathbb{1}\left\{ V_{ij}^{(i)} \prod_{l=1}^{i-1} \left(1 - V_{ij}^{(l)}\right) \in \tilde{A} \right\}, \qquad (3.18)
\end{aligned}$$
where the last line follows by monotone convergence. Each term in the outer sum can be further decomposed as

$$\begin{aligned}
\mathbb{E} \sum_{j=1}^{C_i} \mathbb{1}\left\{ V_{ij}^{(i)} \prod_{l=1}^{i-1} \left(1 - V_{ij}^{(l)}\right) \in \tilde{A} \right\}
&= \mathbb{E}\left[ \mathbb{E}\left[ \sum_{j=1}^{C_i} \mathbb{1}\left\{ V_{ij}^{(i)} \prod_{l=1}^{i-1} \left(1 - V_{ij}^{(l)}\right) \in \tilde{A} \right\} \,\middle|\, C_i \right] \right] \\
&= \mathbb{E}[C_i]\, \mathbb{E}\left[ \mathbb{1}\left\{ V_{i1}^{(i)} \prod_{l=1}^{i-1} \left(1 - V_{i1}^{(l)}\right) \in \tilde{A} \right\} \right] \\
&\qquad \text{since the } V_{ij}^{(l)} \text{ are iid across } j \text{ and independent of } C_i \\
&= \gamma\, \mathbb{E}\, \mathbb{1}\left\{ V_i \prod_{l=1}^{i-1} (1 - V_l) \in \tilde{A} \right\} \qquad (3.19) \\
&\qquad \text{for } V_i \overset{indep}{\sim} \mathrm{Beta}(1 - \alpha, \theta + i\alpha),
\end{aligned}$$
where the last equality follows since the choice of $\{V_i\}$ gives
$$V_i \prod_{l=1}^{i-1} (1 - V_l) \overset{d}{=} V_{i1}^{(i)} \prod_{l=1}^{i-1} \left(1 - V_{i1}^{(l)}\right).$$
Substituting Eq. (3.19) back into Eq. (3.18), canceling γ factors, and applying monotone convergence again yields

$$\nu(A \times \tilde{A}) = B_0(A) \cdot \sum_{i=1}^{\infty} \mathbb{E}\, \mathbb{1}\left\{ V_i \prod_{l=1}^{i-1} (1 - V_l) \in \tilde{A} \right\}.$$
We note that both of the measures ν and $\nu_{\mathrm{BP}}$ factorize:
$$\nu(A \times \tilde{A}) = B_0(A) \cdot \sum_{i=1}^{\infty} \mathbb{E}\, \mathbb{1}\left\{ V_i \prod_{l=1}^{i-1} (1 - V_l) \in \tilde{A} \right\}$$
$$\nu_{\mathrm{BP}}(A \times \tilde{A}) = B_0(A)\, \mu_{\mathrm{BP}}(\tilde{A}),$$
so it is enough to show that $\mu = \mu_{\mathrm{BP}}$ for the measure µ defined by
$$\mu(\tilde{A}) := \sum_{i=1}^{\infty} \mathbb{E}\, \mathbb{1}\left\{ V_i \prod_{l=1}^{i-1} (1 - V_l) \in \tilde{A} \right\}. \qquad (3.20)$$
At this point and later in proving Proposition 3.6.1, we will make use of part of Campbell's theorem, which we copy here from Kingman (1993) for completeness.

Theorem 3.5.2 (Part of Campbell's Theorem). Let Π be a Poisson process on S with rate measure µ, and let $f : S \to \mathbb{R}$ be measurable. If $\int_S \min(|f(x)|, 1)\, \mu(dx) < \infty$, then
$$\mathbb{E}\left[ \sum_{X \in \Pi} f(X) \right] = \int_S f(x)\, \mu(dx). \qquad (3.21)$$

Now let $\tilde{U}$ be a size-biased pick from $\left\{ V_i \prod_{l=1}^{i-1} (1 - V_l) \right\}_{i=1}^{\infty}$. By construction, for any bounded, measurable function g, we have
$$\mathbb{E}\left[ g(\tilde{U}) \,\middle|\, \{V_i\} \right] = \sum_{i=1}^{\infty} V_i \prod_{l=1}^{i-1} (1 - V_l) \cdot g\left( V_i \prod_{l=1}^{i-1} (1 - V_l) \right).$$
Taking expectations yields

$$\mathbb{E}\, g(\tilde{U}) = \mathbb{E}\left[ \sum_{i=1}^{\infty} V_i \prod_{l=1}^{i-1} (1 - V_l)\, g\left( V_i \prod_{l=1}^{i-1} (1 - V_l) \right) \right] = \int u\, g(u)\, \mu(du),$$
where the final equality follows by Campbell's theorem with the choice $f(u) = u g(u)$. Since this result holds for all bounded, measurable g, we have that

$$P(\tilde{U} \in du) = u\, \mu(du). \qquad (3.22)$$
Finally, we note that, by Eq. (3.20), $\tilde{U}$ is a size-biased sample from probabilities generated by stick-breaking with proportions $\{\mathrm{Beta}(1 - \alpha, \theta + i\alpha)\}$. Such a sample is then distributed $\mathrm{Beta}(1 - \alpha, \theta + \alpha)$ since, as mentioned above, the Pitman-Yor stick-breaking construction gives the size-biased frequencies in order. So, rearranging Eq. (3.22), we can write

$$\begin{aligned}
\mu(du) &= u^{-1}\, P(\tilde{U} \in du) \\
&= u^{-1}\, \frac{\Gamma(1 + \theta)}{\Gamma(1 - \alpha)\Gamma(\theta + \alpha)} u^{(1-\alpha)-1} (1 - u)^{(\theta+\alpha)-1}\, du \qquad \text{using the } \mathrm{Beta}(1 - \alpha, \theta + \alpha) \text{ density} \\
&= \mu_{\mathrm{BP}}(du),
\end{aligned}$$

as was to be shown.
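As an informal numerical check of the size-biased argument above, the sketch below repeatedly generates (truncated) Pitman-Yor stick-breaking weights, draws a size-biased pick, and compares its sample mean to the Beta(1 − α, θ + α) mean $(1 - \alpha)/(1 + \theta)$; the truncation and renormalization are approximations introduced only for this illustration.

```python
import numpy as np

def py_sticks(theta, alpha, truncation, rng):
    """Truncated Pitman-Yor stick-breaking weights V_i * prod_{l<i}(1 - V_l)
    with V_l ~ Beta(1 - alpha, theta + l*alpha)."""
    V = rng.beta(1.0 - alpha, theta + alpha * np.arange(1, truncation + 1))
    return V * np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))

rng = np.random.default_rng(0)
theta, alpha, trunc = 1.0, 0.6, 2000
samples = []
for _ in range(2000):
    w = py_sticks(theta, alpha, trunc, rng)
    idx = rng.choice(len(w), p=w / w.sum())   # size-biased pick (renormalized)
    samples.append(w[idx] / w.sum())
print(np.mean(samples), (1.0 - alpha) / (1.0 + theta))  # should be close
```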

3.6 Power law derivations

By linking the three-parameter stick-breaking representation to the power-law beta process in Eq. (3.15), we can use the results of the following section to conclude that the feature assignments in the three-parameter model follow both Type I and Type II power laws and that they do not follow a Type III power law (Section 3.4). We note that Teh and Görür (2009) found big-O behavior for Types I and II in the three-parameter beta process and Poisson tail behavior in the Type III case. We can strengthen these results to obtain exact asymptotic behavior with constants in the first two cases and also conclude that Type III power laws can never hold in the featural framework whenever the sum of the feature probabilities is almost surely finite, an assumption that would appear to be a necessary component of any physically realistic model.

Type I and II power laws

Our subsequent derivation expands upon the work of Gnedin, Hansen, and Pitman (2007). In that paper, the main thrust of the argument applies to the case in which the feature probabilities are fixed rather than random. In what follows, we obtain power laws of Type I and II in the case in which the feature probabilities are random, in particular when the probabilities are generated from a Poisson process. We will see that this last assumption becomes convenient in the course of the proof. Finally, we apply our results to the specific example of the beta-Bernoulli process.

Recall that we defined $K_N$, the number of represented clusters in the first N data points, and $K_{N,j}$, the number of clusters represented j times in the first N data points, in Eqs. (3.8) and (3.7), respectively. In Section 3.4, we noted that the same definitions in Eqs. (3.8) and (3.7) hold for featural models if we now let $N_i$ denote the number of data points at time N in which feature i is represented. In terms of the Bernoulli process, $N_i$ would be the number of Bernoulli process draws, out of N, where the ith atom has unit (i.e., nonzero) weight. Thus, $K_N$ is now the number of represented features in the first N data points, and $K_{N,j}$ is the number of features represented j times. It need not be the case that the $N_i$ sum to N here.

Working directly to find power laws in $K_N$ and $K_{N,j}$ as N increases is challenging in part due to N being an integer. A useful technique to surmount this difficulty is called Poissonization. In Poissonizing $K_N$ and $K_{N,j}$, we consider new functions K(t) and $K_j(t)$ where the argument t is continuous, in contrast to the integer argument N. We will define K(t) and $K_j(t)$ such that K(N) and $K_j(N)$ have the same asymptotic behavior as $K_N$ and $K_{N,j}$, respectively. In particular, our derivation of the asymptotic behavior of $K_N$ and $K_{N,j}$ will consist of three parts and will involve working extensively with the mean feature counts

$$\Phi_N := \mathbb{E}[K_N] \quad \text{and} \quad \Phi_{N,j} := \mathbb{E}[K_{N,j}] \ (j \ge 1)$$
with $N \in \{1, 2, \ldots\}$ and the Poissonized mean feature counts
$$\Phi(t) := \mathbb{E}[K(t)] \quad \text{and} \quad \Phi_j(t) := \mathbb{E}[K_j(t)] \ (j \ge 1)$$
with t > 0. First, we will take advantage of Poissonization to find power laws in Φ(t) and $\Phi_j(t)$ as $t \to \infty$ (Proposition 3.6.1). Then, in order to relate these results back to the original process, we will show that $\Phi_N$ and Φ(N) have the same asymptotic behavior and also that $\Phi_{N,j}$ and $\Phi_j(N)$ have the same asymptotic behavior as $N \to \infty$ (Lemma 3.6.3). Finally, to obtain results for the random process values $K_N$ and $K_{N,j}$, we will conclude by showing

that $K_N$ almost surely has the same asymptotic behavior as $\Phi_N$ and that $K_{N,j}$ almost surely has the same asymptotic behavior as $\Phi_{N,j}$ (Proposition 3.6.4).


Figure 3.4: The first five sets of points, starting from the top of the figure, illustrate Poisson processes on the positive half-line in the range $t \in [0, 5]$ with respective rates $q_1, \ldots, q_5$. The bottom set of points illustrates the union of all points from the preceding Poisson point processes and is, therefore, itself a Poisson process with rate $\sum_i q_i$. In this example, we have for instance that K(1) = 2, K(4) = 5, and $K_2(4) = 1$.

We associate with each feature a Poisson process on the positive real line. K(t) will be the number of these Poisson processes with at least one point in the interval [0, t]; similarly, $K_j(t)$ will be the number of Poisson processes with exactly j points in the interval [0, t]. This construction is illustrated in Figure 3.4.

It remains to specify the rates of these Poisson processes. Let $(q_1, q_2, \ldots)$ be a countably infinite vector of feature probabilities. We begin by putting minimal restrictions on the $q_i$. We assume that they are strictly positive, decreasing real numbers. They need not necessarily sum to one, and they may be random. Indeed, we will eventually consider the case where the $q_i$ are the (random) atom weights of a beta process, and then we will have $\sum_i q_i \neq 1$ with probability one.

Let $\Pi_i$ be a standard Poisson process on the positive real line generated with rate $q_i$ (see, e.g., the top five lines in Figure 3.4). Then $\Pi := \bigcup_i \Pi_i$ is a standard Poisson process on the positive real line with rate $\sum_i q_i$ (see, e.g., the lowermost line in Figure 3.4), where we henceforth assume $\sum_i q_i < \infty$ a.s.

Finally, as mentioned above, we define K(t) to be the number of Poisson processes $\Pi_i$ with any points in [0, t]:

$$K(t) := \sum_i \mathbb{1}\left\{ |\Pi_i \cap [0, t]| > 0 \right\}.$$

And we define Kj(t) to be the number of Poisson processes Πi with exactly j points in [0, t]:

$$K_j(t) := \sum_i \mathbb{1}\left\{ |\Pi_i \cap [0, t]| = j \right\}.$$

These definitions are very similar to the definitions of $K_N$ and $K_{N,j}$ in Eqs. (3.8) and (3.7), respectively. The principal difference is that the $K_N$ are incremented only at integer N whereas the K(t) can have jumps at any $t \in \mathbb{R}_+$. The same observation holds for the $K_{N,j}$ and $K_j(t)$.

In addition to Poissonizing $K_N$ and $K_{N,j}$ to define K(t) and $K_j(t)$, we will also find it convenient to assume that the $\{q_i\}$ themselves are derived from a Poisson process with rate measure ν. We note that Poissonizing from a discrete index N to a continuous time index t is an approximation and separate from our assumption that the $\{q_i\}$ are generated from a Poisson process, though both are fundamentally tied to the ease of working with Poisson processes.

We are now able to write out the mean feature counts in both the Poissonized and original cases. First, the Poissonized definitions of Φ and K allow us to write

$$\Phi(t) := \mathbb{E}[K(t)] = \mathbb{E}\big[ \mathbb{E}[K(t) \mid q] \big] = \mathbb{E}\Big[ \mathbb{E}\Big[ \sum_i \mathbb{1}\left\{ |\Pi_i \cap [0, t]| > 0 \right\} \,\Big|\, q \Big] \Big].$$

With a similar approach for Φj(t), we find

$$\Phi(t) = \mathbb{E}\Big[ \sum_i \left(1 - e^{-t q_i}\right) \Big], \qquad \Phi_j(t) = \mathbb{E}\Big[ \sum_i \frac{(t q_i)^j}{j!} e^{-t q_i} \Big].$$

With the assumption that the $\{q_i\}$ are drawn from a Poisson process with measure ν, we can apply Campbell's theorem (Theorem 3.5.2) to both the original and Poissonized versions of the process to derive the final equality in each of the following lines

$$\Phi(t) = \mathbb{E}\Big[ \sum_i \left(1 - e^{-t q_i}\right) \Big] = \int_0^1 \left(1 - e^{-t x}\right) \nu(dx) \qquad (3.23)$$
$$\Phi_N = \mathbb{E}\Big[ \sum_i \left(1 - (1 - q_i)^N\right) \Big] = \int_0^1 \left(1 - (1 - x)^N\right) \nu(dx) \qquad (3.24)$$
$$\Phi_j(t) = \mathbb{E}\Big[ \sum_i \frac{(t q_i)^j}{j!} e^{-t q_i} \Big] = \frac{t^j}{j!} \int_0^1 x^j e^{-t x}\, \nu(dx) \qquad (3.25)$$
$$\Phi_{N,j} = \binom{N}{j} \mathbb{E}\Big[ \sum_i q_i^j (1 - q_i)^{N-j} \Big] = \binom{N}{j} \int_0^1 x^j (1 - x)^{N-j}\, \nu(dx). \qquad (3.26)$$
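The integrals in Eqs. (3.24) and (3.26) can be evaluated numerically for a given rate measure. The sketch below does so for the three-parameter beta process rate measure of Eq. (3.15) and, for comparison, also evaluates the Type I asymptote $\Gamma(1 - \alpha) C N^{\alpha}$ derived later in this section (Eqs. (3.28) and (3.29)); the quadrature treats the singularity at x = 0 naively, which is an approximation made only for illustration.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gammaln, comb, gamma as gamma_fn

def bp_levy_density(x, gamma, theta, alpha):
    """Density of the three-parameter beta process rate measure in the weight
    coordinate (Eq. (3.15)), with the location coordinate integrated out."""
    const = gamma * np.exp(gammaln(1 + theta) - gammaln(1 - alpha) - gammaln(theta + alpha))
    return const * x ** (-1 - alpha) * (1 - x) ** (theta + alpha - 1)

def Phi_N(N, gamma, theta, alpha):
    """Expected number of represented features, Eq. (3.24), by quadrature."""
    f = lambda x: (1 - (1 - x) ** N) * bp_levy_density(x, gamma, theta, alpha)
    return quad(f, 0.0, 1.0)[0]

def Phi_Nj(N, j, gamma, theta, alpha):
    """Expected number of features seen exactly j times, Eq. (3.26)."""
    f = lambda x: x ** j * (1 - x) ** (N - j) * bp_levy_density(x, gamma, theta, alpha)
    return comb(N, j) * quad(f, 0.0, 1.0)[0]

g, th, al = 3.0, 1.0, 0.6
C = (g / al) * np.exp(gammaln(1 + th) - gammaln(1 - al) - gammaln(th + al))  # Eq. (3.28)
for N in (10, 100, 1000):
    print(N, Phi_N(N, g, th, al), gamma_fn(1 - al) * C * N ** al)
```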

Now we establish our first result, which gives a power law in Φ(t) and $\Phi_j(t)$ when the Poisson process rate measure ν has corresponding power law properties.

Proposition 3.6.1. Asymptotic behavior of the integral of ν of the following form
$$\nu_1[0, x] := \int_0^x u\, \nu(du) \sim \frac{\alpha}{1 - \alpha}\, x^{1-\alpha} l(1/x), \qquad x \to 0 \qquad (3.27)$$
where l is a regularly varying function and $\alpha \in (0, 1)$ implies
$$\Phi(t) \sim \Gamma(1 - \alpha)\, t^{\alpha} l(t), \qquad t \to \infty$$
$$\Phi_j(t) \sim \frac{\alpha \Gamma(j - \alpha)}{j!}\, t^{\alpha} l(t), \qquad t \to \infty \ (j \ge 1).$$

Proof. The key to this result is in the repeated use of Abelian or Tauberian theorems. Let A be a map $A : F \to G$ from one function space to another: e.g., an integral or a Laplace transform. For $f \in F$, an Abelian theorem gives us the asymptotic behavior of A(f) from the asymptotic behavior of f, and a Tauberian theorem gives us the asymptotic behavior of f from that of A(f).

First, integrating by parts yields
$$\nu_1[0, x] = -x\, \bar{\nu}(x) + \int_0^x \bar{\nu}(u)\, du, \qquad \bar{\nu}(x) := \int_x^{\infty} \nu(du),$$
so the stated asymptotic behavior in $\nu_1$ yields $\bar{\nu}(x) \sim l(1/x)\, x^{-\alpha}$ ($x \to 0$) by a Tauberian theorem (Feller, 1966; Gnedin, Hansen, and Pitman, 2007) where the map A is an integral. Second, another integration by parts yields
$$\Phi(t) = t \int_0^{\infty} e^{-t x}\, \bar{\nu}(x)\, dx.$$
The desired asymptotic behavior in Φ follows from the asymptotic behavior in $\bar{\nu}$ and an Abelian theorem (Feller, 1966; Gnedin, Hansen, and Pitman, 2007) where the map A is a Laplace transform. The result for $\Phi_j(t)$ follows from a similar argument when we note that repeated integration by parts of Eq. (3.25) also yields a Laplace transform.

The importance of assuming that the $q_i$ are distributed according to a Poisson process is that this assumption allowed us to write Φ as an integral and thereby make use of classic Abelian and Tauberian theorems. The importance of Poissonizing the processes $K_N$ and $K_{N,j}$ is that we can write their means as in Eqs. (3.23) and (3.25), which are—up to integration by parts—in the form of Laplace transforms.

Proposition 3.6.1 is the most significant link in the chain of results needed to show asymptotic behavior of the feature counts $K_N$ and $K_{N,j}$ in that it relates power laws in the known feature probability rate measure ν to power laws in the mean behavior of the Poissonized version of these processes. It remains to show this mean behavior translates back to $K_N$ and $K_{N,j}$, first by relating the means of the original and Poissonized processes and then by relating the means to the almost sure behavior of the counts. The next two lemmas address the former concern. Together they establish that the mean feature counts $\Phi_N$ and $\Phi_{N,j}$ have the same asymptotic behavior as the corresponding Poissonized mean feature counts Φ(N) and $\Phi_j(N)$.

Lemma 3.6.2. Let ν be σ-finite with $\int_0^{\infty} \nu(du) = \infty$ and $\int_0^{\infty} u\, \nu(du) < \infty$. Then the number of represented features has unbounded growth almost surely. The expected number of represented features has unbounded growth, and the expected number of features has sublinear growth. That is,
$$K(t) \uparrow \infty \ \text{a.s.}, \qquad \Phi(t) \uparrow \infty, \qquad \Phi(t) \ll t.$$

Proof. As in Gnedin, Hansen, and Pitman (2007), the first statement follows from the fact that q is countably infinite and each $q_i$ is strictly positive. The second statement follows from monotone convergence. The final statement is a consequence of $\sum_i q_i < \infty$ a.s.

Lemma 3.6.3. Suppose the $\{q_i\}$ are generated according to a Poisson process with rate measure as in Lemma 3.6.2. Then, for $N \to \infty$,
$$|\Phi_N - \Phi(N)| < \frac{2}{N} \Phi_2(N) \to 0$$
$$|\Phi_{N,j} - \Phi_j(N)| < \frac{c_j}{N} \max\{ \Phi_j(N), \Phi_{j+2}(N) \} \to 0$$

for some constants cj. Proof. The proof is the same as that of Lemma 1 of Gnedin, Hansen, and Pitman (2007). Establishing the inequalities results from algebraic manipulations. The convergence to zero is a consequence of Lemma 3.6.2. Finally, before considering the specific case of the three-parameter beta process, we wish to show that power laws in the means ΦN and ΦN,j extend to almost sure power laws in the number of represented features.

Proposition 3.6.4. Suppose the $\{q_i\}$ are generated from a Poisson process with rate measure as in Lemma 3.6.2. Suppose that $\Phi(t) \sim C t^{\alpha} l(t)$ and $\Phi_j(t) \sim C' t^{\alpha} l'(t)$ for $\alpha \in (0, 1)$, $C, C' > 0$, and l and l′ slowly varying as $t \to \infty$. Then, for $N \to \infty$,
$$K_N \overset{a.s.}{\sim} \Phi_N, \qquad K_{N,k} \overset{a.s.}{\sim} \Phi_{N,k}.$$

Proof. By the Borel–Cantelli lemma, it is enough to show that, for any $\varepsilon > 0$,
$$\sum_N P\left( \left| \frac{K_N}{\Phi_N} - 1 \right| > \varepsilon \right) < \infty.$$
To that end, note

$$P\left( |K_N - \Phi_N| > \varepsilon \Phi_N \right) \le P\left( \Phi_N > \varepsilon \Phi_N + K_N \right) + P\left( K_N > \varepsilon \Phi_N + \Phi_N \right).$$
The note after Theorem 4 in D. Freedman (1973) gives that

$$P\left( \Phi_N > \varepsilon \Phi_N + K_N \right) \le \exp\left( -\frac{\varepsilon^2}{2}\, \Phi_N \right)$$

$$P\left( K_N > \varepsilon \Phi_N + \Phi_N \right) \le \exp\left( -\frac{\varepsilon^2}{2(1 + \varepsilon)}\, \Phi_N \right).$$
So

$$P\left( \left| \frac{K_N}{\Phi_N} - 1 \right| > \varepsilon \right) \le 2 \exp\left( -\frac{\varepsilon^2}{2(1 + \varepsilon)}\, \Phi_N \right) \le 2 \exp\left( -c\, \varepsilon^2 N^{\alpha} l(N) \right)$$
for some constant c > 0 and sufficiently large N by Lemma 3.6.3 and the assumption on Φ(t). The last expression is summable in N, and Borel–Cantelli holds. The proof that $K_{N,k} \overset{a.s.}{\sim} \Phi_{N,k}$ follows the same argument.

We now apply these results to the three-parameter beta process. For the rate measure in Eq. (3.15), we have
$$\begin{aligned}
\nu_1[0, x] &= \int_{\Psi \times (0, x]} u\, \nu_{\mathrm{BP}}(d\psi, du) \\
&= \gamma \cdot \frac{\Gamma(1 + \theta)}{\Gamma(1 - \alpha)\Gamma(\theta + \alpha)} \int_0^x u^{-\alpha} (1 - u)^{\theta + \alpha - 1}\, du \\
&\sim \gamma \cdot \frac{\Gamma(1 + \theta)}{\Gamma(1 - \alpha)\Gamma(\theta + \alpha)} \int_0^x u^{-\alpha}\, du, \qquad x \downarrow 0 \\
&= \gamma \cdot \frac{\Gamma(1 + \theta)}{\Gamma(1 - \alpha)\Gamma(\theta + \alpha)} \cdot \frac{1}{1 - \alpha}\, x^{1-\alpha}.
\end{aligned}$$
The final line is exactly the form required by Eq. (3.27) in Proposition 3.6.1, with l(y) equal to the constant function of value
$$C := \frac{\gamma}{\alpha} \cdot \frac{\Gamma(1 + \theta)}{\Gamma(1 - \alpha)\Gamma(\theta + \alpha)}. \qquad (3.28)$$
Then Proposition 3.6.1 implies that the following power laws hold for the mean of the Poissonized process:

$$\Phi(t) \overset{a.s.}{\sim} \Gamma(1 - \alpha)\, C t^{\alpha}, \qquad t \to \infty$$
$$\Phi_j(t) \overset{a.s.}{\sim} \frac{\alpha \Gamma(j - \alpha)}{j!}\, C t^{\alpha}, \qquad t \to \infty \ (j \ge 1).$$
Lemma 3.6.3 further yields

$$\Phi_N \overset{a.s.}{\sim} \Gamma(1 - \alpha)\, C N^{\alpha}, \qquad N \to \infty$$

$$\Phi_{N,j} \overset{a.s.}{\sim} \frac{\alpha \Gamma(j - \alpha)}{j!}\, C N^{\alpha}, \qquad N \to \infty \ (j \ge 1),$$
and finally Proposition 3.6.4 implies

$$K_N \overset{a.s.}{\sim} \Gamma(1 - \alpha)\, C N^{\alpha}, \qquad N \to \infty \qquad (3.29)$$
$$K_{N,j} \overset{a.s.}{\sim} \frac{\alpha \Gamma(j - \alpha)}{j!}\, C N^{\alpha}, \qquad N \to \infty \ (j \ge 1). \qquad (3.30)$$
These are exactly the desired Type I and II power laws (Eqs. (3.9) and (3.10)) for appropriate choices of the constants.

Exponential decay in the number of features

Next we consider a single data point and the number of features that are expressed for that data point in the featural model. We prove results for the general case where the ith feature has probability $q_i \ge 0$ such that $\sum_i q_i < \infty$. Let $Z_i$ be a Bernoulli random variable with success probability $q_i$ and such that all the $Z_i$ are independent. Then $\mathbb{E}[\sum_i Z_i] = \sum_i q_i =: Q$. In this case, a Chernoff bound (Chernoff, 1952; Hagerup and Rub, 1990) tells us that, for any δ > 0, we have
$$P\Big[ \sum_i Z_i \ge (1 + \delta) Q \Big] \le e^{\delta Q} (1 + \delta)^{-(1+\delta)Q}.$$
When M is large enough such that M > Q, we can choose δ such that (1 + δ)Q = M. Then this inequality becomes

$$P\Big[ \sum_i Z_i \ge M \Big] \le e^{M - Q} Q^M M^{-M} \qquad \text{for } M > Q. \qquad (3.31)$$

We see from Eq. (3.31) that the number of features $\sum_i Z_i$ that are expressed for a data point exhibits super-exponential tail decay and therefore cannot have a power law when the sum of feature probabilities $\sum_i q_i$ is finite. For comparison, let $Z \sim \mathrm{Poisson}(Q)$. Then (Franceschetti et al., 2007)
$$P[Z \ge M] \le e^{M - Q} Q^M M^{-M} \qquad \text{for } M > Q,$$
the same tail bound as in Eq. (3.31).

To apply the tail-behavior result of Eq. (3.31) to the beta process (with two or three parameters), we note that the total feature probability mass is a.s. finite by Eq. (3.16). Since the same set of feature probabilities is used in all subsequent Bernoulli process draws for the beta-Bernoulli process, the result holds.
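The bound in Eq. (3.31) is easy to check by simulation; the sketch below uses an arbitrary fixed probability vector q with finite sum (any such weights, e.g., truncated beta process weights, behave similarly) and compares the Monte Carlo tail of $\sum_i Z_i$ with the bound.

```python
import numpy as np

def chernoff_tail_check(q, M_values, n_sims=100_000, rng=None):
    """Monte Carlo estimate of P[sum_i Z_i >= M] for independent Z_i ~ Bern(q_i),
    compared with the bound e^{M-Q} Q^M M^{-M} from Eq. (3.31), valid for M > Q."""
    rng = np.random.default_rng(rng)
    q = np.asarray(q)
    Q = q.sum()
    counts = (rng.random((n_sims, len(q))) < q).sum(axis=1)
    for M in M_values:
        empirical = np.mean(counts >= M)
        bound = np.exp(M - Q) * Q ** M * M ** (-M) if M > Q else np.inf
        print(f"M={M}: empirical={empirical:.2e}, bound={bound:.2e}")

chernoff_tail_check(q=np.full(50, 0.04), M_values=[4, 6, 8, 10], rng=0)
```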

3.7 Simulation

To illustrate the three types of power laws discussed above, we simulated beta process atom weights under three different choices of the discount parameter α, namely α = 0 (the classic, two-parameter beta process), α = 0.3, and α = 0.6. In all three simulations, the remaining beta process parameters were kept constant at total mass parameter value γ = 3 and concentration parameter value θ = 1. The simulations were carried out using our extension of the Paisley, Zaas, et al. (2010) stick-breaking construction in Eq. (3.14). We generated 2,000 rounds of feature probabilities; that is, we generated 2,000 random variables $C_i$ and $\sum_{i=1}^{2{,}000} C_i$ feature probabilities. With these probabilities, we generated N = 1,000 data points, i.e., 1,000 vectors of $\big(\sum_{i=1}^{2{,}000} C_i\big)$ independent Bernoulli random variables with these probabilities. With these simulated data, we were able to perform an empirical evaluation of our theoretical results.

Figure 3.5 illustrates power laws in the number of represented features $K_N$ on the left (Type I power law) and the number of features represented by exactly one data point $K_{N,1}$ on the right (Type II power law). Both of these quantities are plotted as functions of the increasing number of data points N. The blue points show the simulated values for the classic, two-parameter beta process case with α = 0. The center set of black points in each case corresponds to α = 0.3, and the upper set of black points in each case corresponds to α = 0.6.

We also plot curves obtained from our theoretical results in order to compare them to the simulation. Recall that in our theoretical development, we noted that there are two steps to establishing the asymptotic behavior of $K_N$ and $K_{N,j}$ as N increases. First, we compare the random quantities $K_N$ and $K_{N,j}$ to their respective means, $\Phi_N$ and $\Phi_{N,j}$. These means, as computed via numerical quadrature from Eq. (3.24) and directly from Eq. (3.26), are shown by red curves in the plots. Second, we compare the means to their own asymptotic behavior. This asymptotic behavior, which we ultimately proved was shared with the respective $K_N$ or $K_{N,j}$ in Eqs. (3.29) and (3.30), is shown by green curves in the plots.

We can see in both plots that the α = 0 behavior is distinctly different from the straight-line behavior of the α > 0 examples. In both cases, we can see that the growth in the α = 0 case is slower than can be described by straight-line growth. In particular, when α = 0, the expected number of features is
$$\Phi_N = \mathbb{E}[K_N] = \mathbb{E}\left[ \sum_{n=1}^{N} \mathrm{Poisson}\left( \frac{\gamma \theta}{n + \theta} \right) \right] = \gamma \sum_{n=1}^{N} \frac{\theta}{n + \theta} \sim \gamma \theta \log(N). \qquad (3.32)$$
Similarly, when α = 0, the expected number of features represented by exactly one data point, $K_{N,1}$, is (by Eq. (3.26))

$$\begin{aligned}
\Phi_{N,1} = \mathbb{E}[K_{N,1}] &= \binom{N}{1} \int_0^1 x (1 - x)^{N-1} \cdot \gamma \theta x^{-1} (1 - x)^{\theta - 1}\, dx \\
&= N \gamma \theta \cdot \frac{\Gamma(1)\Gamma(N - 1 + \theta)}{\Gamma(N + \theta)} = \gamma \theta\, \frac{N}{N - 1 + \theta} \sim \gamma \theta,
\end{aligned}$$
where the second line follows from using the normalization constant of the (proper) beta distribution.


Figure 3.5: Growth in the number of represented features $K_N$ (left) and the number of features represented by exactly one data point $K_{N,1}$ (right) as the total number of data points N grows. The points in the scatterplot are derived by simulation; blue for α = 0, center black for α = 0.3, and upper black for α = 0.6. The red lines in the left plot show the theoretical mean $\Phi_N$ (Eq. (3.24)); in the right plot, they show the theoretical mean $\Phi_{N,1}$ (Eq. (3.26)). The green lines show the theoretical asymptotic behavior, Eq. (3.29) on the left (Type I power law) and Eq. (3.30) on the right (Type II power law).

Interestingly, while $K_{N,1}$ grows as a power law when α > 0, its expectation is constant when α = 0. While many new features are instantiated as N increases in the α = 0 case, it seems that they are quickly represented by more data points than just the first one.

Type I and II power laws are somewhat easy to visualize since we have one point in our plots for each data point simulated. The behaviors of $K_{N,j}$ as a function of j for fixed N and Type III power laws (or lack thereof) are somewhat more difficult to visualize. In the case of $K_{N,j}$ as a function of j, we might expect that a large number of data points N is necessary to see many groups of size j for j much greater than one. In the Type III case, we have seen that in fact power laws do not hold for any value of α in the beta process. Rather, the number of data points exhibiting more than M features decreases more quickly in M than a power law would predict; therefore, we cannot plot many values of M before this number effectively goes to zero. Nonetheless, Figure 3.6 compares our simulated data to the approximation of Eq. (3.10) with Eq. (3.11) (left) and Type III power laws (right). On the left, blue points as usual denote simulated data under α = 0; middle black points show α = 0.3, and upper black points show α = 0.6. Here, we use connecting lines between plotted points to clarify α values.


Figure 3.6: Left: Change in the number of features with exactly j representatives among N data points for fixed N as a function of j. The blue points, with connecting lines, are for α = 0; middle black are for α = 0.3, upper black for α = 0.6. The green lines show the theoretical asymptotic behavior in j (Eqs. (3.10) and (3.11)) for the two α > 0 cases. Right: Change in the number of data points, indexed by n, with number of feature assignments $k_n$ greater than some positive, real-valued M as M increases. Neither the α = 0 case (blue) nor the α > 0 cases (black) exhibit Type III power laws.

The green lines for the α > 0 case illustrate the approximation of Eq. (3.11). Around j = 10, we see that the number of features exhibited by exactly j data points, $K_{N,j}$, degenerates to mainly zero and one values. However, for smaller values of j we can still distinguish the power law trend. On the right-hand side of Figure 3.6, we display the number of data points exhibiting more than M features for various values of M across the three values of α. Unlike the previous plots in Figure 3.5 and Figure 3.6, there is no power-law behavior for the cases α > 0, as predicted in Section 3.6. We also note that here the α = 0.3 curve does not lie between the α = 0 and α = 0.6 curves. Such an occurrence is not unusual in this case since, as we saw in Eq. (3.31), the rate of decrease is modulated by the total mass of the feature probabilities drawn from the beta process, which is random and not necessarily smaller when α is smaller.

Finally, since our simulation involves generating the underlying feature probabilities from the beta process as well as the actual feature assignments from repeated draws from the Bernoulli process, we may examine the feature probabilities themselves; see Figure 3.7. As usual, the blue points represent the classic, two-parameter (α = 0) beta process. Black points represent α = 0.3 (center) and α = 0.6 (upper).


Figure 3.7: Feature probabilities from the beta process plotted in decreasing size order. Blue points represent probabilities from the α = 0 case; center black points show α = 0.3, and upper black points show α = 0.6. The green lines show theoretical asymptotic behavior of the ranked probabilities (Eq. (3.33)).

Perhaps due to the fact that there is only the beta process noise to contend with in this aspect of the simulation (and not the combined randomness due to the beta process and Bernoulli process), we see the most striking demonstration of both power law behavior in the α > 0 cases and faster decay in the α = 0 case in this figure. The two α > 0 cases clearly adhere to a power law that may be predicted from our results above and the Gnedin, Hansen, and Pitman (2007) results with C as in Eq. (3.28):
$$\#\{ i : q_i \ge x \} \overset{a.s.}{\sim} C x^{-\alpha}, \qquad x \downarrow 0. \qquad (3.33)$$
Note that ranking the probabilities merely inverts the plot that would be created with x on the horizontal axis and $\#\{ i : q_i \ge x \}$ on the vertical axis. The simulation demonstrates little noise about these power laws beyond the 100th ranked probability. The decay for α = 0 is markedly faster than the other cases.

3.8 Experimental results

We have seen that the Poisson process formulation allows for an easy extension of the beta process to a three-parameter model. In this section we study this model empirically in the setting of the modeling of handwritten digits. Paisley, Zaas, et al. (2010) present results for this problem using a two-parameter beta process coupled with a discrete factor analysis model; we repeat those experiments with the three-parameter beta process. The data consists of 3,000 examples of handwritten digits, in particular 1,000 handwriting samples of each of the digits 3, 5, and 8 from the MNIST Handwritten Digits database (LeCun and Cortes, 1998; Roweis, 2007). Each handwritten digit is represented by a matrix of 28 × 28 pixels; we project these matrices into 50 dimensions using principal components analysis. Thus, our data takes the form $X \in \mathbb{R}^{50 \times 3000}$, and we may apply the beta process factor model from Eq. (3.2) with P = 50 and N = 3,000 to discover latent structure in this data. The generative model for X that we use is as follows (see Paisley, Zaas, et al., 2010):
$$X = (W \circ Z)\Phi + E$$
$$Z \sim \mathrm{BP{-}BeP}(N, \gamma, \theta, \alpha)$$
$$\Phi_{k,p} \overset{iid}{\sim} N(0, \rho_p)$$
$$W_{n,k} \overset{iid}{\sim} N(0, \zeta)$$
$$E_{n,p} \overset{iid}{\sim} N(0, \eta), \qquad (3.34)$$
with familiar beta process hyperparameters θ, α, and $\gamma = B_0(\Psi)$ and new (positive) variance hyperparameters $\{\rho_p\}_{p=1}^{P}$, ζ, η. Recall from Eq. (3.2) that $X \in \mathbb{R}^{N \times P}$ is the data, $\Phi \in \mathbb{R}^{K \times P}$ is a matrix of factors, and $E \in \mathbb{R}^{N \times P}$ is an error matrix. Here, we introduce the weight matrix $W \in \mathbb{R}^{N \times K}$, which modulates the binary factor loadings $Z \in \mathbb{R}^{N \times K}$. In Eq. (3.34), ∘ denotes elementwise multiplication, and the indices have ranges $n \in \{1, \ldots, N\}$, $k \in \{1, \ldots, K\}$, $p \in \{1, \ldots, P\}$. Since we draw Z from a beta-Bernoulli process, the dimension K is theoretically infinite in the generative model notation of Eq. (3.34). However, we have seen that the number of columns of Z with nonzero entries is a.s. finite. We use K to denote this number.

We initialized both the two-parameter and the three-parameter models with the same number of latent features, K = 200, and the same values for all shared parameters (i.e., every variable except the new discount parameter α). We ran the experiment for 2,000 MCMC iterations, noting that the MCMC runs in both models seem to have reached equilibrium by 500 iterations (see Figures 3.8 and 3.9).

Figures 3.8 and 3.9 show the sampled values of various parameters as a function of MCMC iteration. In particular, we see how the number of features K (Figure 3.8), the concentration parameter θ, and the discount parameter α (Figure 3.9) change over time. All three graphs illustrate that the three-parameter model takes a longer time to reach equilibrium than the two-parameter model (approximately 500 iterations vs. approximately 100 iterations). However, once at equilibrium, the samples associated with the three-parameter iterations exhibit lower autocorrelation than the samples associated with the two-parameter iterations (Figure 3.10). In the implementation of both the original two-parameter model and the three-parameter model, the range for θ is considered to be bounded above by approximately 100 for computational reasons (in accordance with the original methodology of Paisley, Zaas, et al. (2010)). As shown in Figure 3.9, this bound affects sampling in the two-parameter experiment whereas, after burn-in, the effect is not noticeable in the three-parameter experiment.
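To make the generative model in Eq. (3.34) concrete, the following sketch simulates synthetic data X given a binary loading matrix Z. It is an illustrative simulator under the stated priors, not the inference code used for the experiments reported here; ρ, ζ, and η are treated as variances (matching the text's description of them as variance hyperparameters), and the small random Z in the usage line merely stands in for a beta-Bernoulli draw.

```python
import numpy as np

def simulate_factor_model(Z, P, rho=1.0, zeta=1.0, eta=0.1, rng=None):
    """Generate X = (W o Z) Phi + E as in Eq. (3.34), given an N x K binary
    loading matrix Z. rho may be a scalar or a length-P vector of variances."""
    rng = np.random.default_rng(rng)
    N, K = Z.shape
    rho = np.broadcast_to(np.asarray(rho, dtype=float), (P,))
    Phi = rng.normal(0.0, np.sqrt(rho), size=(K, P))   # factors
    W = rng.normal(0.0, np.sqrt(zeta), size=(N, K))    # real-valued weights
    E = rng.normal(0.0, np.sqrt(eta), size=(N, P))     # noise
    return (W * Z) @ Phi + E

# Example usage with a small placeholder binary loading matrix.
Z = (np.random.default_rng(0).random((6, 4)) < 0.3).astype(int)
X = simulate_factor_model(Z, P=5, rng=1)
print(X.shape)
```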
While the discount parameter α also comes close to the lower boundary of its discretization (Figure 3.9)—which cannot be exactly zero due to computational concerns—the samples nonetheless seem to explore the space well. We can see from Figure 3.10 that the estimated value of the concentration parameter θ is much lower when the discount parameter α is also estimated. This behavior may be seen to result from the fact that the power law growth of the expected number of represented CHAPTER 3. BETA PROCESSES, STICK-BREAKING, AND POWER LAWS 70

Two−parameter model Three−parameter model 50 50

40 40

30 30 K K 20 20

10 10

0 0 0 500 1000 1500 2000 0 500 1000 1500 2000 Iteration number Iteration number

Figure 3.8: The number of latent features K as a function of the MCMC iteration. Results for the original, two-parameter model are represented on the left, and results for the new, three-parameter model are illustrated on the right.

Two−parameter model Three−parameter model Three−parameter model 100 100 1

θ 50 θ 50 α 0.5

0 0 0 0 1000 2000 0 1000 2000 0 1000 2000 Iteration number Iteration number Iteration number

Figure 3.9: The random values drawn for the hyperparameters as a function of the MCMC iteration. Draws for the concentration parameter θ under the two-parameter model are shown on the left, and draws for θ under the three-parameter model are shown in the middle. On the right are draws of the new discount parameter α under the three-parameter model. CHAPTER 3. BETA PROCESSES, STICK-BREAKING, AND POWER LAWS 71

Two−parameter model Three−parameter model 1 1 K K θ θ α 0.5 0.5

0 0 Autocorrelation Autocorrelation −0.5 −0.5

−1 −1 0 50 100 0 50 100 Lag Lag

Figure 3.10: Autocorrelation of the number of factors K, concentration parameter θ, and discount parameter α for the MCMC samples after burn-in (where burn-in is taken to end at 500 iterations) under the two-parameter model (left) and three-parameter model (right).

features ΦN in the α > 0 case yields a generally higher expected number of features than in the α = 0 case for a fixed concentration parameter θ. Further, we see from Eq. (3.32) that the expected number of features when α = 0 is linear in θ. Therefore, if we instead fix the number of features, the α = 0 model can compensate by increasing θ over the α > 0 model. Indeed, we see in Figure 3.8 that the number of features discovered by both models is roughly equal; in order to achieve this number of features, the α = 0 model seems to be compensating by overestimating the concentration parameter θ. To get a sense of the actual output of the model, we can look at some of the learned features. In particular, we collected the set of features from the last MCMC iteration in each model. The kth feature is expressed or not for the nth data point according to whether Znk is one or zero. Therefore, we can find the most-expressed features across the data set using the set of features on this iteration as well as the sampled Z matrix on this iteration. We plot the nine most-expressed features under each model in Figure 3.11. In both models, we can see how the features have captured distinguishing features of the 3, 5, and 8 digits. Finally, we note that the three-parameter version of the algorithm is competitive with the two-parameter version in running time once equilibrium is reached. After the burn-in regime of 500 iterations, the average running time per iteration under the three-parameter model is 14.5 seconds, compared with 11.7 seconds average running time per iteration under the two-parameter model. CHAPTER 3. BETA PROCESSES, STICK-BREAKING, AND POWER LAWS 72

Two-parameter model

Three-parameter model

Figure 3.11: Upper: The top nine features by sampled representation across the data set on the final MCMC iteration for the original, two-parameter model. Lower: The top nine features determined in the same way for the new, three-parameter model.

3.9 Discussion

We have shown that the stick-breaking representation of the beta process due to Paisley, Zaas, et al. (2010) can be obtained directly from the representation of the beta process as a completely random measure. With this result in hand the set of connections between the beta process, stick-breaking, and the Indian buffet process are essentially as complete as those linking the Dirichlet process, stick-breaking, and the Chinese restaurant process. We have also shown that this approach motivates a three-parameter generalization of the stick-breaking representation of Paisley, Zaas, et al. (2010), which is the analog of the Pitman-Yor generalization of the stick-breaking representation for the Dirichlet process. We have shown that Type I and Type II power laws follow from this three-parameter model. We have also shown that Type III power laws cannot be obtained within this framework. It is an open problem to discover useful classes of stochastic processes that provide such power laws.

3.A A Markov chain Monte Carlo algorithm

Posterior inference under the three-parameter model can be performed with a Markov chain Monte Carlo (MCMC) algorithm. Many conditionals have simple forms that allow Gibbs sampling although others require further approximation. Most of our sampling steps are as in Paisley, Zaas, et al. (2010) with the notable exceptions of a new sampling step for the discount parameter α and integration of the discount parameter α into the existing framework. We describe the full algorithm here. CHAPTER 3. BETA PROCESSES, STICK-BREAKING, AND POWER LAWS 73

Notation and auxiliary variables

Call the index i in Eq. (3.14) the round. Then introduce the round-indicator variables rk such

that rk = i exactly when the kth atom, where k indexes the sequence (ψ1,1, . . . , ψ1,C1 , ψ2,1, . . . , ψ2,C2 ,...), occurs in round i. We may write

∞ i

rk := 1 + 1 Cj < k . i=1 ( j=1 ) X X

To recover the round lengths C from r = (r1, r2,...), note that

Ci = 1(rk = i). (3.35) k=1 X With the definition of the round indicators r in hand, we can rewrite the beta process B as ∞ rk−1

B = Vk,r (1 Vk,j)δψ , k − k k=1 j=1 X Y iid iid −1 where Vk,j Beta(1 α, θ + iα) and ψk γ B0 as usual although the indexing is not the same as in Eq.∼ (3.14).− It follows that the expression∼ of the kth feature for the nth data point is given by rk−1

Zn,k Bern (πk) , πk := Vk,rk (1 Vk,j). ∼ j=1 − Y We also introduce notation for the number of data points in which the kth feature is, respectively, expressed and not expressed:

N N

m1,k := 1(Zn,k = 1), m0,k := 1(Zn,k = 0) n=1 n=1 X X

Finally, let K be the number of represented features; i.e., K := # k : m1,k > 0 . Without loss of generality, we assume the represented features are the first{K features in} the index k. The new quantities rk , m1,k , m0,k , and K will be used in describing the sampler steps below. { } { } { }

Latent indicators

First, we describe the sampling of the round indicators rk and the latent feature indicators { } Zn,k . In these and other steps in the MCMC algorithm, we integrate out the stick-breaking { } proportions Vi . { } CHAPTER 3. BETA PROCESSES, STICK-BREAKING, AND POWER LAWS 74

Round indicator variables

We wish to sample the round indicator rk for each feature k with 1 k K. We can write ≤ ≤ the conditional for rk as

k−1 N p(rk = i rl l=1 , Zn,k n=1, θ, α, γ) |{ } N { } k−1 p( Zn,k rk = i, θ, α)p(rk = i rl ). (3.36) ∝ { }n=1| |{ }l=1 It remains to calculate the two factors in the product. For the first factor in Eq. (3.36), we write out the integration over stick-breaking propor- tions and approximate with a Monte Carlo integral:

N m1,k m0,k p( Zn,k n=1 rk = i, θ, α) = πk (1 πk) dV { } | i − Z[0,1] S 1 (s) (s) (π )m1,k (1 π )m0,k . (3.37) S k k ≈ s=1 − X indep Here, π(s) := V (s) rk−1(1 V (s)), and V (s) Beta(1 α, θ + jα). Also, S is the k k,rk j=1 k,j k,j number of samples in the sum− approximation. Note∼ that the− computational trick employed Q in Paisley, Zaas, et al. (2010) for sampling the Vi more efficiently than directly using the approximation above relies on the first parameter{ } of the beta distribution being equal to one; therefore, the sampling described above, without further tricks, is exactly the sampling that must be used in this more general parameterization. For the second factor in Eq. (3.36), there is no dependence on the α parameter, so the k 1 draws are the same as in Paisley, Zaas, et al. (2010). For Rk := j=1 (rj = rk), we have

k−1 P p(rk = r γ, rl ) | { }l=1 0 r < rk−1 R 1−P k−1 Poisson(i|γ) i=1 r = r  PRk−1−1 k−1 = 1− i=1 Poisson(i|γ)  R  1−P k−1 Poisson(i|γ)  1 i=1 (1 Poisson(0 γ)) Poisson(0 γ)h−1 r = r + h PRk−1−1 k−1 − 1− i=1 Poisson(i|γ) − | |    for each h 1. Note that these draws make the approximation that the first K features correspond≥ to the first K tuples (i, j) in the double sum of Eq. (3.14); these orderings do not in general agree. To complete the calculation of the posterior for rk, we need to sum over all values of i to k−1 N normalize p(rk = i rl l=1 , Zn,k n=1, θ, α, γ). Since this is not computationally feasible, an alternative method|{ is} to calculate{ } Eq. (3.36) for increasing values of i until the result falls below a pre-determined threshold. CHAPTER 3. BETA PROCESSES, STICK-BREAKING, AND POWER LAWS 75

Factor indicators

In finding the posterior for the kth feature indicator in the nth latent factor, Zn,k, we can integrate out both Vi and the weight variables Wn,k . The conditional for Zn,k is { } { } p(Zn,k Xn,·, Φ,Zn,−k, r, θ, α, η, ζ) | = p(Xn,· Zn,·, Φ, η, ζ)p(Zn,k r, θ, α, Zn,−k). (3.38) | | First, we consider the likelihood. For this factor, we integrate out W explicitly:

p(Xn,· Zn,·, Φ, η, ζ) | = p(Xn,· Zn,·, Φ, W, η)p(W ζ) | | ZW = N(Xn,· Wn,I ΦI,·, ηIP )N(Wn,I 0|I|, ζI|I|)dWn,I | | ZWn,I where I = i : Zn,i = 1 { } −1 −1 −2 −1 > −1 −1 > = N Xn,· 0P , η IP η ΦI,· η Φ ΦI,· + ζ I|I| Φ | − I,· I,·  h > i  = N Xn,· 0P , ηIP + ζΦI,·Φ ,  | I,· where the final step follows from the Sherman-Morrison-Woodbury lemma. For the second factor in Eq. (3.38), we can write

p(Zn r, θ, α) p(Zn,k r, θ, α, Zn,−k) = | , | p(Zn,−k r, θ, α) | and the numerator and denominator can both be estimated as integrals over V using the same Monte Carlo integration trick as in Eq. (3.37).

Hyperparameters Next, we describe sampling for the three parameters of the beta process. The mass and concentration parameters are shared by the two-parameter process; the discount parameter is unique to the three-parameter beta process.

Mass parameter

With the round indicators rk in hand as from Appendix 3.A above, we can recover the { } round lengths Ci with Eq. (3.35). Assuming an improper gamma prior on γ—with both shape and inverse{ } scale parameters equal to zero—and recalling the iid Poisson generation of the Ci , the posterior for γ is { } rK

p(γ r, Z, θ, α) = Gamma(γ Ci, rK ). | | i=1 X CHAPTER 3. BETA PROCESSES, STICK-BREAKING, AND POWER LAWS 76

Note that it is necessary to sample γ since it occurs in, e.g., the conditional for the round indicator variables (Appendix 3.A).

Concentration parameter The conditional for θ is K p(θ Z, r, α) p(θ) p(Z r, θ, α). | ∝ | k=1 Y Again, we calculate the likelihood factors p(Z r, θ, α) with a Monte Carlo approximation as in Eq. (3.37). In order to find the conditional| over θ from the likelihood and prior, we further approximate the space of θ > 0 by a discretization around the previous value of t=T θ in the Monte Carlo sampler: θprev + t∆θ t=S , where S and T are chosen so that all potential new θ values are nonnegative{ and so} that the tails of the distribution fall below a pre-determined threshold. To complete the description, we choose the improper prior p(θ) 1. ∝ Discount parameter We sample the discount parameter α in a similar manner to θ. The conditional for α is

K p(α Z, r, θ) p(α) p(Z r, θ, α). | ∝ | k=1 Y As usual, we calculate the likelihood factors p(Z r, θ, α) with a Monte Carlo approximation as in Eq. (3.37). While we discretize the sampling| of α as we did for θ, note that sampling α is more straightforward since α must lie in [0, 1]. Therefore, the choice of ∆α completely characterizes the discretization of the interval. In particular, to avoid endpoint behavior, we (∆α)−1−1 consider new values of α among ∆α/2 + t∆α t=0 . Moreover, the choice of p(α) 1 is, in this case, a proper prior for{α. } ∝

Factor analysis components In order to use the beta process as a prior in the factor analysis model described in Eq. (3.2), we must also describe samplers for the feature matrix Φ and weight matrix W .

Feature matrix The conditional for the feature matrix Φ is

p(Φ·,p X, W, Z, η, ρp) p(X·,p Φ·,p, W, Z, ηIN )p(Φ·,p ρp) | ∝ | | = N(X·,p (W Z)Φ·,p, ηIN )N(Φ·,p 0K , ρpIK ) | ◦ | N (Φ·,p µ, Σ) , ∝ | CHAPTER 3. BETA PROCESSES, STICK-BREAKING, AND POWER LAWS 77 where, in the final line, the variance is defined as follows:

−1 > −1 −1 Σ := η (W Z) (W Z) + ρ IK , ◦ ◦ p and similarly for the mean:  −1 > µ := Ση (W Z) X·,p. ◦ Weight matrix

Let I = i : Zn,i = 1 . Then the conditional for the weight matrix W is { }

p(Wn,I X,Z, Φ, η) p(Xn,· ΦI,·,Wn,I , η)p(Wn,I ζ) | ∝ | | = N(Xn,· Wn,I ΦI,·, ηIp)N(Wn,I 0|I|, ζI|I|) | | ˜ N(Wn,I µ,˜ Σ), ∝ | ˜ −1 > −1 −1 where, in the final line, the variance is defined as Σ := η ΦI,·ΦI,· + ζ I|I| , and the mean is defined asµ ˜ := Σ˜η−1X Φ> . n,· I,·  78

Chapter 4

Combinatorial clustering and the beta negative binomial process

We develop a Bayesian nonparametric approach to a general family of latent class problems in which individuals can belong simultaneously to multiple classes and where each class can be exhibited multiple times by an individual. We introduce a combinatorial stochastic process known as the negative binomial process (NBP) as an infinite-dimensional prior appropriate for such problems. We show that the NBP is conjugate to the beta process, and we characterize the posterior distribution under the beta-negative binomial process (BNBP) and hierarchical models based on the BNBP (the HBNBP). We study the asymptotic properties of the BNBP and develop a three-parameter extension of the BNBP that exhibits power-law behavior. We derive MCMC algorithms for posterior inference under the HBNBP, and we present experiments using these algorithms in the domains of image segmentation, object recognition, and document analysis.

4.1 Introduction

In traditional clustering problems the goal is to induce a set of latent classes and to assign each data point to one and only one class. This problem has been approached within a model-based framework via the use of finite mixture models, where the mixture components characterize the distributions associated with the classes, and the mixing proportions capture the mutual exclusivity of the classes (Fraley and Raftery, 2002; McLachlan and Basford, 1988). In many domains in which the notion of latent classes is natural, however, it is unrealistic to assign each individual to a single class. For example, in genetics, while it may be reasonable to assume the existence of underlying ancestral populations that define distributions on observed alleles, each individual in an existing population is likely to be a blend of the patterns associated with the ancestral populations. Such a genetic blend is known as an admixture (Pritchard, Stephens, and Donnelly, 2000). A significant literature on model-based approaches to admixture has arisen in recent years (Blei, Ng, and Jordan, 2003; CHAPTER 4. COMBINATORIAL CLUSTERING AND THE BNBP 79

Erosheva and Fienberg, 2005; Pritchard, Stephens, and Donnelly, 2000), with applications to a wide variety of domains in genetics and beyond, including document modeling and image analysis.1 Model-based approaches to admixture are generally built on the foundation of mixture modeling. The basic idea is to treat each individual as a collection of data, with an exchange- ability assumption imposed for the data within an individual but not between individuals. For example, in the genetics domain the intra-individual data might be a set of genetic markers, with marker probabilities varying across ancestral populations. In the document domain the intra-individual data might be the set of words in a given document, with each document (the individual) obtained as a blend across a set of underlying “topics” that encode probabilities for the words. In the image domain, the intra-individual data might be visual characteristics like edges, hue, and location extracted from image patches. Each image is then a blend of object classes (e.g., grass, sky, or car), each defining a distinct distribution over visual characteristics. In general, this blending is achieved by making use of the prob- abilistic structure of a finite mixture but using a different sampling pattern. In particular, mixing proportions are treated as random effects that are drawn once per individual, and the data associated with that individual are obtained by repeated draws from a having that fixed set of mixing proportions. The overall model is a hierarchical model, in which mixture components are shared among individuals and mixing proportions are treated as random effects. Although the literature has focused on using finite mixture models in this context, there has also been a growing literature on Bayesian nonparametric approaches to admixture models, notably the hierarchical Dirichlet process (HDP) (Teh, Jordan, et al., 2006), where the number of shared mixture components is infinite. Our focus in the current chapter is also on nonparametric methods, given the open-ended nature of the inferential objects with which real-world admixture modeling is generally concerned. Although viewing an admixture as a set of repeated draws from a mixture model is natural in many situations, it is also natural to take a different perspective, akin to latent trait modeling, in which the individual (e.g., a document or a genotype) is characterized by the set of “traits” or “features” that it possesses, and where there is no assumption of mutual exclusivity. Here the focus is on the individual and not on the “data” associated with an individual. Indeed, under the exchangeability assumption alluded to above it is natural to reduce the repeated draws from a mixture model to the counts of the numbers of times that each mixture component is selected, and we may wish to model these counts directly. We may further wish to consider hierarchical models in which there is a linkage among the counts for different individuals. This idea has been made explicit in a recent line of work based on the beta process. Orig- inally developed for survival analysis, where an integrated form of the beta process was used as a model for random hazard functions (Hjort, 1990), more recently it has been observed

1While we refer to such models generically as “admixture models,” we note that they are also often referred to as topic models or mixed membership models. CHAPTER 4. COMBINATORIAL CLUSTERING AND THE BNBP 80 that the beta process also provides a natural framework for latent feature modeling (Thibaux and Jordan, 2007). In particular, as we discuss in detail in Section 4.2, a draw from the beta process yields an infinite collection of coin-tossing probabilities. Tossing these coins—a draw from a Bernoulli process—one obtains a set of binary features that can be viewed as a description of an admixed individual. A key advantage of this approach is the conjugacy be- tween the beta and Bernoulli processes: this property allows for tractable inference, despite the countable infinitude of coin-tossing probabilities. A limitation of this approach, however, is its restriction to binary features; indeed, one of the virtues of the mixture-model-based approach is that a given mixture component can be selected more than once, with the total number of selections being random. We develop a model for admixture that meets all of the desiderata outlined thus far. Unlike the Bernoulli process likelihood, our featural model allows each feature to be exhibited any non-negative integer number of times by an individual. Unlike admixture models based on the HDP, our model cohesively includes a random total number of features (e.g., words or traits) per individual (e.g., a document or genotype). As inspiration, we note that in the setting of classical random variables, beta-Bernoulli conjugacy is not the only form of conjugacy involving the beta distribution—the negative bi- nomial distribution is also conjugate to the beta. Anticipating the value of conjugacy in the setting of nonparametric models, we define and develop a stochastic process analogue of the negative , which we refer to as the negative binomial process (NBP),2 and provide a rigorous proof of its conjugacy to the beta process. We use this process as part of a new model—the hierarchical beta negative binomial process (HBNBP)—based on the NBP and the hierarchical beta process (Thibaux and Jordan, 2007). Our theoretical and experimental development focus on the usefulness of the HBNBP in the admixture setting, where flexible modeling of feature totals can lead to improved inferential accuracy (see Fig- ure 4.8 and the surrounding discussion). However, the utility of the HBNBP is not limited to the admixture setting and should extend readily to the modeling of latent factors and the identification of more general latent features. Moreover, the negative binomial component of our model offers addtional flexibility in the form of a new parameter unavailable in either the Bernoulli or multinomial likelihoods traditionally explored in Bayesian nonparametrics. The remainder of the chapter is organized as follows. In Section 4.2 we present the framework of completely random measures that provides the formal underpinnings for our work. We discuss the Bernoulli process, introduce the NBP, and demonstrate the conjugacy of both to the beta process in Section 4.3. Section 4.4 focuses on the problem of modeling admixture and on general hierarchical modeling based on the negative binomial process. Section 4.5 and Section 4.6 are devoted to a study of the asymptotic behavior of the NBP with a beta process prior, which we call the beta-negative binomial process (BNBP). We describe algorithms for posterior inference in Section 4.7. 
Finally, we present experimental results. First, we use the BNBP to define a generative model for summaries of terrorist

2 Zhou et al. (2012) have independently investigated negative binomial processes in the context of integer matrix factorization. We discuss their concurrent contributions in more detail in Section 4.4. CHAPTER 4. COMBINATORIAL CLUSTERING AND THE BNBP 81 incidents with the goal of identifying the perpetrator of a given terrorist attack in Section 4.8. Second, we demonstrate the utility of a finite approximation to the BNBP in the domain of automatic image segmentation in Section 4.9. Section 4.10 presents our conclusions.

4.2 Completely random measures

In this section we review the notion of a completely random measure (CRM), a general construction that yields random measures that are closely tied to classical constructions in- volving sets of independent random variables. We present CRM-based constructions of sev- eral of the stochastic processes used in Bayesian nonparametrics, including the beta process, gamma process, and Dirichlet process. In the following section we build on the foundations presented here to consider additional stochastic processes. Consider a probability space (Ψ, , P). A random measure is a random element µ such that µ(A) is a non-negative random variableF for any A in the sigma algebra .A completely random measure (CRM) µ is a random measure such that, for any disjoint,F measurable sets A, A0 , we have that µ(A) and µ(A0) are independent random variables (Kingman, 1967). Completely∈ F random measures can be shown to be composed of at most three components:

1.A deterministic measure. For deterministic µdet, it is trivially the case that µdet(A) 0 0 and µdet(A ) are independent for disjoint A, A . L 2.A set of fixed atoms. Let (u1, . . . , uL) Ψ be a collection of deterministic locations, L ∈ and let (η1, . . . , ηL) R+ be a collection of independent random weights for the atoms. The collection may∈ be countably infinite, in which case we say L = . Then let L ∞ µfix = l=1 ηlδul . The independence of the ηl implies the complete randomness of the measure. P 3. An ordinary component. Let νPP be a Poisson process intensity on the space Ψ × R+. Let (v1, ξ1), (v2, ξ2),... be a draw from the Poisson process with intensity νPP. { } ∞ Then the ordinary component is the measure µord = j=1 ξjδvj . Here, the complete randomness follows from properties of the Poisson process. P One observation from this componentwise breakdown of CRMs is that we can obtain a countably infinite collection of random variables, the ξj, from the Poisson process component if νPP has infinite total mass (but is still sigma-finite). Consider again the criterion that a CRM µ yield independent random variables when applied to disjoint sets. In light of the observation about the collection ξj , this criterion may now be seen as an extension of an independence assumption in the{ case} of a finite set of random variables. We cover specific examples next.

Beta process The beta process (Hjort, 1990; Kim, 1999a; Thibaux and Jordan, 2007) is an example of a CRM. It has the following parameters: a mass parameter γ > 0, a concentration parameter CHAPTER 4. COMBINATORIAL CLUSTERING AND THE BNBP 82

θ > 0, a purely atomic measure Hfix = ρlδu with γρl (0, 1) for all l a.s., and a purely l l ∈ continuous probability measure Hord on Ψ. Note that we have explicitly separated out the P mass parameter γ so that, e.g., Hord is a probability measure; in Thibaux and Jordan (2007), these two parameters are expressed as a single measure with total mass equal to γ. Typically, though, the normalized measure Hord is used separately from the mass parameter γ (as we will see below), so the notational separation is convenient. Often the final two measure parameters are abbreviated as their sum: H = Hfix + Hord. Given these parameters, the beta process has the following description as a CRM: 1. The deterministic measure is uniformly zero. L 2. The fixed atoms have locations (u1, . . . , uL) Ψ , where L is potentially infinite though ∈ typically finite. Atom weight ηl has distribution

indep ηl Beta (θγρl, θ(1 γρl)) , (4.1) ∼ −

where the ρl parameters are the weights in the purely atomic measure Hfix.

3. The ordinary component has Poisson process intensity Hord ν, where ν is the measure × ν(db) = γθb−1(1 b)θ−1 db, (4.2) − which is sigma-finite with finite mean. It follows that the number of atoms in this component will be countably infinite, but the atom weights will have finite sum. As in the original specification of Hjort (1990) and Kim (1999a), Eq. (4.2) can be generalized by allowing θ to depend on the Ψ coordinate. The homogeneous intensity in Eq. (4.2) seems to be used predominantly in practice (Thibaux and Jordan, 2007; Fox et al., 2009) though, and we focus on it here for ease of exposition. Nonetheless, we note that our results below extend easily to the non-homogeneous case. The CRM is the sum of its components. Therefore, we may write a draw from the beta process as ∞ L ∞

B = bkδψk , ηlδul + ξjδvj , (4.3) k=1 l=1 j=1 X X X with atom locations equal to the union of the fixed atom and ordinary component atom L ∞ locations ψk k = ul l=1 vj j=1. Notably, B is a.s. discrete. We denote a draw from the beta process{ as} B { BP(} θ,∪ γ, { H}). The provenance of the name “beta process” is now clear; each atom weight in∼ the fixed atomic component is beta-distributed, and the Poisson process intensity generating the ordinary component is that of an improper beta distribution. From the above description, the beta process provides a prior on a potentially infinite vector of weights, each in (0, 1) and each associated with a corresponding parameter ψ Ψ. The potential countable infinity comes from the Poisson process component. The weights∈ in (0, 1) may be interpreted as probabilities, though not as a distribution across the indices as we note that they need not sum to one. We will see in Section 4.4 that the beta process CHAPTER 4. COMBINATORIAL CLUSTERING AND THE BNBP 83 is appropriate for feature modeling (Thibaux and Jordan, 2007; Griffiths and Ghahramani, 2006). In this context, each atom, indexed by k, of B corresponds to a feature. The atom weights bk , which are each in [0, 1] a.s., can be viewed as representing the frequency with { } which each feature occurs in the dataset. The atom locations ψk represent parameters associated with the features that can be used in forming a likelihood.{ } In Section 4.5, we will show that an extension to the beta process called the three- parameter beta process has certain desirable properties beyond the classic beta process, in particular its ability to generate power-law behavior (Teh and G¨or¨ur,2009; Broderick, Jor- dan, and Pitman, 2012), which roughly says that the number of features grows as a power of the number of data points. In the three-parameter case, we introduce a discount parameter α (0, 1) with θ > α and γ > 0 such that: ∈ − 1. There is again no deterministic component. L 2. The fixed atoms have locations (u1, . . . , uL) Ψ , with L potentially infinite but ∈ indep typically finite. Atom weight ηl has distribution ηl Beta (θγρl α, θ(1 γρl) + α), ∼ − − where the ρl parameters are the weights in the purely atomic measure Hfix and we now have the constraints θγρl α, θ(1 γρl) + α 0. − − ≥ 3. The ordinary component has Poisson process intensity Hord ν, where ν is the measure: × Γ(1 + θ) ν(db) = γ b−1−α(1 b)θ+α−1 db. Γ(1 α)Γ(θ + α) − − Again, we focus on the homogeneous intensity ν as in the beta process case though it is straightforward to allow θ to depend on coordinates in Ψ. In this case, we again have the full process draw B as in Eq. (4.3), and we say B 3BP(α, θ, γ, H). ∼

Full beta process

The specification that the atom parameters in the beta process be of the form θγρl and θ(1 γρl) can be unnecessarily constraining; θγρl α and θ(1 γρl)+α are even more unwieldy in the− power-law case. Indeed, the classical beta− distribution− has two free parameters. Yet, in the beta process as described above, θ and γ are determined as part of the Poisson process intensity, so there is essentially one free parameter for each of the beta-distributed weights associated with the atoms (Eq. (4.1)). A related problematic issue is that the beta process forces the two parameters in the beta distribution associated with each atom to sum to θ, which is constant across all of the atoms. One way to remove these restrictions is to allow θ = θ(ψ), a function of the position ψ Ψ as mentioned above. However, we demonstrate in Appendix 4.A that there are reasons∈ to prefer a fixed concentration parameter θ for the ordinary component; there is a fundamental relation between this parameter and similar parameters in other common CRMs (e.g., the Dirichlet process, which we describe in Section 4.2). Moreover, the concern CHAPTER 4. COMBINATORIAL CLUSTERING AND THE BNBP 84 here is entirely centered on the behavior of the fixed atoms of the process, and letting θ depend on ψ retains the unusual—from a classical parametric perspective—form of the beta distribution in Eq. (4.1). As an alternative, we provide a generalization of the beta process that more closely aligns with the classical perspective in which we allow two general beta parameters for each atom. As we will see, this generalization is natural, and indeed necessary, in considering conjugacy. We thus define the full beta process (RBP) as having the following parameterization: a mass parameter γ > 0, a concentration parameter θ > 0, a number of fixed atoms L L ∈ 0, 1, 2,... with locations (u1, . . . , uL) Ψ , two sets of strictly positive atom weight { }∪{∞}L L ∈ parameters ρl and σl , and a purely continuous measure Hord on Ψ. In this case, the { }l=1 { }l=1 atom weight parameters satisfy the simple condition ρl, σl > 0 for all l 1,...,L . This specification is the same as the beta process specification introduced above∈ { with the} sole exception of a more general parameterization for the fixed atoms. We obtain the following CRM:

1. There is no deterministic measure. L 2. There are L fixed atoms with locations (u1, . . . , uL) Ψ and corresponding weights indep ∈ ηl Beta (ρl, σl) . ∼ 3. The ordinary component has Poisson process intensity Hord ν, where ν is the measure ν(db) = γθb−1(1 b)θ−1 db. × − As discussed above, we favor the homogeneous intensity ν in exposition but note the straight- forward extension to allow θ to depend on Ψ location. We denote this CRM by B RBP(θ, γ, u, ρ, σ,Hord). ∼ Gamma process While the beta process provides a countably infinite vector of frequencies in (0, 1] with associated parameters ψk, it is sometimes useful to have a countably infinite vector of positive, real-valued quantities that can be used as rates rather than frequencies for features. We can obtain such a prior with the gamma process (Ferguson, 1973), a CRM with the following parameters: a concentration parameter θ > 0, a c > 0, a purely atomic

measure Hfix = l ρlδul with l, ρl > 0, and a purely continuous measure Hord with support on Ψ. Its description as a CRM∀ is as follows (Thibaux, 2008): P 1. There is no deterministic measure. L 2. The fixed atoms have locations (u1, . . . , uL) Ψ , where L is potentially infinite but ∈ indep typically finite. Atom weight ηl has distribution ηl Gamma(θρl, c), where we use ∼ the shape-inverse-scale parameterization of the and where the ρl parameters are the weights in the purely atomic measure Hfix. CHAPTER 4. COMBINATORIAL CLUSTERING AND THE BNBP 85

3. The ordinary component has Poisson process intensity Hord ν, where ν is the measure: × ν(dg˜) = θg˜−1 exp ( cg˜) dg.˜ (4.4) − As in the case of the beta process, the gamma process can be expressed as the sum ˜ L ˜ of its components: G = g˜kδψ ηlδu + ξjδv . We denote this CRM as G k k , l=1 l j j ∼ ΓP(θ, c, H), for H = Hfix + Hord. P P P Dirichlet process While the beta process has been used as a prior in featural models, the Dirichlet process is the classic Bayesian nonparametric prior for clustering models (Ferguson, 1973; S. N. MacEachern and M¨uller,1998; McCloskey, 1965; Neal, 2000; West, 1992). The Dirichlet process itself is not a CRM; its atom weights, which represent cluster frequencies, must sum to one and are therefore correlated. But it can be obtained by normalizing the gamma process (Ferguson, 1973). In particular, using facts about the Poisson process (Kingman, 1993), one can check that, when there are finitely many fixed atoms, we have G˜(Ψ) < a.s.; that is, the total mass of the gamma process is almost surely finite despite having infinitely∞ many atoms from the ordinary component. Therefore, normalizing the process by dividing its weights by its total mass is well-defined. We thus can define a Dirichlet process as

˜ ˜ G = gkδψk , G/G(Ψ), k X where G˜ ΓP(θ, 1,H), and where there are two parameters: a concentration parameter θ and a base∼ measure H with finitely many fixed atoms. Note that while we have chosen the scale parameter c = 1 in this construction, the choice is in fact arbitrary for c > 0 and does not affect the G distribution (Eq. 4.15 and p. 83 of Pitman (2006)). From this construction, we see immediately that the Dirichlet process is almost surely atomic, a property inherited from the gamma process. Moreover, not only are the weights of the Dirichlet process all contained in (0, 1) but they further sum to one. Thus, the Dirichlet process may be seen as providing a probability distribution on a countable set. In particular, this countable set is often viewed as a countable number of clusters, with cluster parameters ψk.

4.3 Conjugacy and combinatorial clustering

In Section 4.2, we introduced CRMs and showed how a number of classical Bayesian nonpara- metric priors can be derived from CRMs. These priors provide infinite-dimensional vectors of real values, which can be interpreted as feature frequencies, feature rates, or cluster fre- quencies. To flesh out such interpretations we need to couple these real-valued processes CHAPTER 4. COMBINATORIAL CLUSTERING AND THE BNBP 86 with discrete-valued processes that capture combinatorial structure. In particular, viewing the weights of the beta process as feature frequencies, it is natural to consider binomial and negative binomial models that transform these frequencies into binary values or non- negative integer counts. In this section we describe stochastic processes that achieve such transformations, again relying on the CRM framework. The use of a Bernoulli likelihood whose frequency parameter is obtained from the weights of the beta process has been explored in the context of survival models by Hjort (1990) and Kim (1999a) and in the context of feature modeling by Thibaux and Jordan (2007). After reviewing the latter construction, we discuss a similar construction based on the negative binomial process. Moreover, recalling that Thibaux and Jordan (2007), building on work of Hjort (1990) and Kim (1999a), have shown that the Bernoulli likelihood is conjugate to the beta process, we demonstrate an analogous conjugacy result for the negative binomial process.

Bernoulli process One way to make use of the beta process is to couple it to a Bernoulli process (Thibaux and Jordan, 2007). The Bernoulli process, denoted BeP(H˜ ), has a single parameter, a base measure H˜ ; H˜ is any discrete measure with atom weights in (0, 1]. Although our focus will be on models in which H˜ is a draw from a beta process, as a matter of the general definition of the Bernoulli process the base measure H˜ need not be a CRM or even random—just as the Poisson distribution is defined relative to a parameter that may or may not be random in general but which is sometimes given a gamma distribution prior. Since H˜ is discrete by assumption, we may write ∞ ˜ H = bkδψk (4.5) k=1 X with bk (0, 1]. We say that the random measure I is drawn from a Bernoulli process, ∈ ˜ ∞ indep I BeP(H), if I = k=1 ikδψk with ik Bern(bk) for k = 1, 2,.... That is, to form the Bernoulli∼ process, we simply make a Bernoulli∼ random variable draw for every one of the (potentially countable)P atoms of the base measure. This definition of the Bernoulli process was proposed by Thibaux and Jordan (2007); it differs from a precursor introduced by Hjort (1990) in the context of survival analysis. One interpretation for this construction is that the atoms of the base measure H˜ represent potential features of an individual, with feature frequencies equal to the atom weights and feature characteristics defined by the atom locations. The Bernoulli process draw can be viewed as characterizing the individual by the set of features that have weights equal to one. Suppose H˜ is derived from a Poisson process as the ordinary component of a completely random measure and has finite mass; then the number of features exhibited by the Bernoulli process, i.e. the total mass of the Bernoulli process draw, is a.s. finite. Thus the Bernoulli process can be viewed as providing a Bayesian nonparametric model of sparse binary feature vectors. CHAPTER 4. COMBINATORIAL CLUSTERING AND THE BNBP 87

Now suppose that the base measure parameter is a draw from a beta process with pa- rameters θ > 0, γ > 0, and base measure H. That is, B BP(θ, γ, H) and I BeP(B). We refer to the overall process as the beta-Bernoulli process∼(BBeP). Suppose that∼ the beta process B has a finite number of fixed atoms. Then we note that the finite mass of the ordinary component of B implies that I has support on a finite set. That is, even though B has a countable infinity of atoms, I has only a finite number of atoms. This observation is important since, in any practical model, we will want an individual to exhibit only finitely many features. Hjort (1990) and Kim (1999a) originally established that the posterior distribution of B under a constrained form of the BBeP was also a beta process with known parameters. Thibaux and Jordan (2007) went on to extend this analysis to the full BBeP. We cite the result by Thibaux and Jordan, 2007 here, using the completely random measure notation established above.

Theorem 4.3.1 (The beta process prior is conjugate to the Bernoulli process likelihood). Let L H be a measure with atomic component Hfix = l=1 ρlδul and continuous component Hord. Let θ and γ be strictly positive scalars. Consider N conditionally-independent draws from L PJ iid the Bernoulli process: In = l=1 ifix,n,lδul + j=1 iord,n,jδvj BeP(B), for n = 1,...,N with B BP(θ, γ, H). That is, the Bernoulli process draws∼ have J atoms that are not ∼ P P located at the atoms of Hfix. Then, B I1,...,IN BP(θpost, γpost,Hpost) with θpost = θ + N, θ | ∼ L J γpost = γ θ+N , and Hpost,ord = Hord. Further, Hpost,fix = l=1 ρpost,lδul + j=1 ξpost,jδvj , where ρ = ρ + (θ γ )−1 N i and ξ = (θ γ )−1 N i . post,l l post post n=1 fix,n,l post,j postP post n=1 Pord,n,j

Note that the posterior beta-distributedP fixed atoms are well-definedP since ξpost,j > 0 N follows from n=1 iord,n,j > 0, which holds by construction. As shown by Thibaux and Jordan (2007), if the underlying beta process is integrated out in the BBeP, we recover the Indian buffetP process of Griffiths and Ghahramani, 2006. Since the RBP and BP differ only in the fixed atoms, where conjugacy reduces to the finite-dimensional case, Theorem 4.3.1 immediately implies the following.

Corollary 4.3.2 (The RBP prior is conjugate to the Bernoulli process likelihood). Assume the conditions of Theorem 4.3.1, and consider N conditionally-independent Bernoulli L J iid process draws: In = l=1 ifix,n,lδul + j=1 iord,n,jδvj BeP(B), for n = 1,...,N with B L L ∼ ∼ RBP(θ, γ, u, ρ, σ,Hord) and ρl l=1 and σl l=1 strictly positive scalars. Then, B I1,...,IN P { } P{ } θ | ∼ RBP(θpost, γpost, upost, ρpost, σpost,Hpost,ord), for θpost = θ +N, γpost = γ θ+N , Hpost,ord = Hord, L J and L + J fixed atoms, upost,l0 = ul l=1 vj j=1. The ρpost and σpost parameters sat- N { } { } ∪ { } N isfy ρpost,l = ρl + n=1 ifix,n,l and σpost,l = σl + N n=1 ifix,n,l for l 1,...,L and N N − ∈ { } ρpost,L+j = iord,n,j and σpost,L+j = θ + N iord,n,j for j 1,...,J . n=1 P − n=1 P ∈ { } The usefulnessP of the RBP becomes apparentP in the posterior parameterization; the distributions associated with the fixed atoms more closely mirror the classical parametric conjugacy between the Bernoulli distribution and the beta distribution. This is an issue of CHAPTER 4. COMBINATORIAL CLUSTERING AND THE BNBP 88 convenience in the case of the BBeP, but it is more significant in the case of the negative binomial process, as we show in the following section, where conjugacy is preserved only in the RBP case (and not for the traditional, more constrained BP).

Negative binomial process The Bernoulli distribution is not the only distribution that yields conjugacy when coupled to the beta distribution in the classical parametric setting; conjugacy holds for the negative binomial distribution as well. As we show in this section, this result can be extended to stochastic processes via the CRM framework. We define the negative binomial process as a CRM with two parameters: a shape param- ˜ eter r > 0 and a discrete base measure H = k bkδψk whose weights bk take values in (0, 1]. As in the case of the Bernoulli process, H˜ need not be random at this point. Since H˜ is discrete, we again have a representation for HP˜ as in Eq. (4.5), and we say that the random ˜ ∞ measure I is drawn from a negative binomial process, I NBP(r, H), if I = k=1 ikδψk with indep ∼ ik NegBin(r, bk) for k = 1, 2,.... That is, the negative binomial processP is formed by simply∼ making a single draw from a negative binomial distribution at each of the (potentially countably infinite) atoms of H˜ . This construction generalizes the studied by Thibaux (2008). As a Bernoulli process draw can be interpreted as assigning a set of features to a data point, so can we interpret a draw from the negative binomial process as assigning a set of feature counts to a data point. In particular, as for the Bernoulli process, we assume that each data point has its own draw from the negative binomial process. Every atom with strictly positive mass in this draw corresponds to a feature that is exhibited by this data point. Moreover, the size of the atom, which is a positive integer by construction, dictates how many times the feature is exhibited by the data point. For example, if the data point is a document, and each feature represents a particular word, then the negative binomial process draw would tell us how many occurrences of each word there are in the document. If the base measure for a negative binomial process is a beta process, we say that the combined process is a beta-negative binomial process (BNBP). If the base measure is a three- parameter beta process, we say that the combined process is a three-parameter beta-negative binomial process (3BNBP). When either the BP or 3BP has a finite number of fixed atoms, the ordinary component of the BP or 3BP still has an infinite number of atoms, but the number of atoms in the negative binomial process is a.s. finite. We prove this fact and more in Section 4.5. We now suppose that the base measure for the negative binomial process is a draw B L L L from an RBP with parameters θ > 0, γ > 0, ul , ρl , σl , and Hord. The overall { }l=1 { }l=1 { }l=1 specification is B RBP(θ, γ, u, ρ, σ,Hord) and I NBP(r, B). The following theorem characterizes the posterior∼ distribution for this model.∼ The proof is given in Appendix 4.E.

Theorem 4.3.3 (The RBP prior is conjugate to the negative binomial process likelihood). L L Let θ and γ be strictly positive scalars. Let (u1, . . . , uL) Ψ . Let the members of ρl ∈ { }l=1 CHAPTER 4. COMBINATORIAL CLUSTERING AND THE BNBP 89

L and σl l=1 be strictly positive scalars. Let Hord be a continuous measure on Ψ. Consider { } L the following model for N draws from a negative binomial process: In = l=1 ifix,n,lδul + J iid iord,n,jδv NBP(B), for n = 1,...,N with B RBP(θ, γ, u, ρ, σ,Hord). That is, the j=1 j ∼ ∼ P negative binomial process draws have J atoms that are not located at the atoms of Hfix. Then, P θ B I1,...,IN RBP(θpost, γpost, upost, ρpost, σpost,Hpost,ord) for θpost = θ + Nr, γpost = γ θ+Nr , | ∼ L J Hpost,ord = Hord, and L + J fixed atoms, upost,l = ul l=1 vj j=1. The ρpost and σpost N { } { } ∪ { } parameters satisfy ρpost,l = ρl + n=1 ifix,n,l and σpost,l = σl + Nr for l 1,...,L and N ∈ { } ρpost,L+j = iord,n,j and σpost,L+j = θ + Nr for j 1,...,J . n=1 P ∈ { } For theP posterior measure to be a BP, we must have ρpost,k + σpost,k = θpost for all k, but this equality can fail to hold even when the prior is a BP. For instance, whenever there are new fixed atom locations in the posterior relative to the prior, this equality will fail. So the BP, by contrast to the RBP, is not conjugate to the negative binomial process likelihood.

4.4 Mixtures and admixtures

We now assemble the pieces that we have introduced and consider Bayesian nonparametric models of admixture. Recall that the basic idea of an admixture is that an individual (e.g., an organism, a document, or an image) can belong simultaneously to multiple classes. This can be represented by associating a binary-valued vector with each individual; the vector has value one in components corresponding to classes to which the individual belongs and zero in components corresponding to classes to which the individual does not belong. More generally, we wish to remove the restriction to binary values and consider a general notion of admixture in which an individual is represented by a nonnegative, integer-valued vector. We refer to such vectors as feature vectors, and view the components of such vectors as counts representing the number of times the corresponding feature is exhibited by a given individual. For example, a document may exhibit a given word zero or more times. As we discussed in Section 4.1, the standard approach to modeling an admixture is to assume that there is an exchangeable set of data associated with each individual and to assume that these data are drawn from a finite mixture model with individual-specific mixing proportions. There is another way to view this process, however, that opens the door to a variety of extensions. Note that to draw a set of data from a mixture, we can first choose the number of data points to be associated with each mixture component (a vector of counts) and then draw the data point values independently from each selected mixture component. That is, we randomly draw nonnegative integers ik for each mixture component (or cluster) k. Then, for each k and each n = 1, . . . , ik, we draw a data point xk,n F (ψk), where ψk is the parameter associated with mixture component k. The overall collection∼ of data for this individual is xk,n k,n, with N = ik total points. One way to generate data according to { } k this decomposition is to make use of the NBP. We draw I = k ikδψk NBP(r, B), where B is drawn from a beta process, PB BP(θ, γ, H). The overall model∼ is a BNBP mixture ∼ P CHAPTER 4. COMBINATORIAL CLUSTERING AND THE BNBP 90 model for the counts, coupled to a conditionally independent set of draws for the individual’s data points xk,n k,n. An alternative{ } approach in the same spirit is to make use of a gamma process (to obtain a set of rates) that is coupled to a Poisson likelihood process (PLP)3 to convert the rates ˜ into counts (Titsias, 2008). In particular, given a base measure G = k g˜kδψk , let I ˜ ∼ PLP(G) denote I = k ikδψk , with ik Poisson(˜gk). We then consider a gamma Poisson ˜∼ P ˜ likelihood process (ΓPLP) as follows: G ΓP(θ, c, H), I = ikδψ PLP(G), and xk,n P ∼ k k ∼ ∼ F (ψk), for n = 1, . . . , ik and each k. Both the BNBP approach and the ΓPLP approach deliverP a random measure, I = 4 k ikδψk , as a representation of an admixed individual. While the atom locations, (ψk), are subsequently used to generate data points, the pattern of admixture inheres in the vector P of weights (ik). It is thus natural to view this vector as the representation of an admixed individual. Indeed, in some problems such a weight vector might itself be the observed data. In other problems, the weights may be used to generate data in some more complex way that does not simply involve conditionally i.i.d. draws. 
This perspective on admixture—focusing on the vector of weights (ik) rather than the data associated with an individual—is also natural when we consider multiple individuals. The main issue becomes that of linking these vectors among multiple individuals; this linking can readily be achieved in the Bayesian formalism via a hierarchical model. In the remain- der of this section we consider examples of such hierarchies in the Bayesian nonparametric setting. Let us first consider the standard approach to admixture in which an individual is repre- sented by a set of draws from a mixture model. For each individual we need to draw a set of mixing proportions, and these mixing proportions need to be coupled among the individuals. This can be achieved via a prior known as the hierarchical Dirichlet process (HDP) (Teh, Jordan, et al., 2006):

G0 DP(θ, H) ∼ indep Gd = gd,kδψ DP(θd,G0), d = 1, 2,..., k ∼ k X where the index d ranges over the individuals. Note that the global measure G0 is a discrete random probability measure, given that it is drawn from a Dirichlet process. In drawing the individual-specific random measure Gd at the second level, we therefore resample from among the atoms of G0 and do so according to the weights of these atoms in G0. This shares atoms among the individuals and couples the individual-specific mixing proportions gd,k. We complete the model specification as follows:

iid zd,n (gd,k)k for n = 1,...,Nd ∼ 3 We use the terminology “Poisson likelihood process” to distinguish a particular process with Poisson distributions affixed to each atom of some base distribution from the more general Poisson point process of Kingman (1993). 4We elaborate on the parallels and deep connections between the BNBP and ΓPLP in Appendix 4.A. CHAPTER 4. COMBINATORIAL CLUSTERING AND THE BNBP 91

indep xd,n F (ψz ), ∼ d,n which draws an index zd,n from the discrete distribution (gd,k)k and then draws a data point xd,n from a distribution indexed by zd,n. For instance, (gd,k) might represent topic proportions

in document d; ψzd,n might represent a topic, i.e. a distribution over words; and xd,n might represent the nth word in the dth document. In the HDP, Nd is known for each d and is part of the model specification. We propose to instead take the featural approach as follows; we draw an individual-specific set of counts from an appropriate stochastic process and then generate the appropriate number of data points for each individual. Then the number of data points for each individual is itself a random variable and potentially coupled across individuals. In particular, one might consider the following hierarchy involving the NBP:

B0 BP(θ, γ, H) (4.6) ∼ indep Id = id,kδψ NBP(rd,B0), k ∼ k X where we first draw a random measure B0 from the beta process and then draw multiple times from an NBP with base measure given by B0. Although this conditional independence hierarchy does couple count vectors across mul- tiple individuals, it uses a single collection of mixing proportions, the atom weights of B0, for all individuals. By contrast, the HDP draws individual-specific mixing proportions from an underlying set of population-wide mixing proportions—and then converts these mixing proportions into counts. We can model individual-specific, but coupled, mixing proportions within an NBP-based framework by simply extending the hierarchy by one level:

B0 BP(θ, γ, H) (4.7) ∼ indep Bd BP(θd, γd,B0/B0(Ψ)) ∼ indep Id = id,kδψ NBP(rd,Bd). k ∼ k X Since B0 is almost surely an atomic measure, the atoms of each Bd will coincide with those of B0 almost surely. The weights associated with these atoms can be viewed as individual- specific feature probability vectors. We refer to this prior as the hierarchical beta-negative binomial process (HBNBP). We also note that it is possible to consider additional levels of structure in which a pop- ulation is decomposed into subpopulations and further decomposed into subsubpopulations and so on, bottoming out in a set of individuals. This tree structure can be captured by repeated draws from a set of beta processes at each level of the tree, conditioning on the beta process at the next highest level of the tree. Hierarchies of this form have previously been explored for beta-Bernoulli processes by Thibaux and Jordan (2007). CHAPTER 4. COMBINATORIAL CLUSTERING AND THE BNBP 92

Comparison with Zhou et al. (2012). Zhou et al. (2012) have independently proposed a (non-hierarchical) beta-negative binomial process prior

B0 = bkδr ,ψ BP(θ, γ, R H) k k ∼ × k X indep Id = id,kδψ where id,k NegBin(rk, bk), k ∼ k X where R is a continuous finite measure over R+ used to associate a distinct failure parameter rk with each beta process atom. Note that each individual is restricted to use the same failure parameters and the same beta process weights under this model. In contrast, our BNBP formulation (4.6) offers the flexibility of differentiating individuals by assigning each its own failure parameter rd. Our HBNBP formulation (4.7) further introduces heterogeneity in the individual-specific beta process weights by leveraging the hierarchical beta process. We will see that these modeling choices are particularly well-suited for admixture modeling in the coming sections. Zhou et al. (2012) use their prior to develop a Poisson factor analysis model for integer matrix factorization, while our primary motivation is mixture and admixture modeling. Our differing models and motivating applications have led to different challenges and algorithms for posterior inference. While Zhou et al. (2012) develop an inexact inference scheme based on a finite approximation to the beta process, we develop both an exact Markov chain Monte Carlo sampler and a finite approximation sampler for posterior inference under the HBNBP (see Section 4.7). Finally, unlike Zhou et al. (2012), we provide an extensive theoretical analysis of our priors including a proof of the conjugacy of the full beta process and the NBP (given in Section 4.3) and an asymptotic analysis of the BNBP (see Section 4.5).

4.5 Asymptotics

An important component of choosing a Bayesian prior is verifying that its behavior aligns with our beliefs about the behavior of the data-generating mechanism. In models of clus- tering, a particular measure of interest is the diversity—the dependence of the number of clusters on the number of data points. In speaking of the diversity, we typically assume a finite number of fixed atoms in a process derived from a CRM, so that asymptotic behavior is dominated by the ordinary component. It has been observed in a variety of different contexts that the number of clusters in a dataset grows as a power law of the size of the data; that is, the number of clusters is asymptotically proportional to the number of data points raised to some positive power (Gnedin, Hansen, and Pitman, 2007). Real-world examples of such behavior are provided by M. E. Newman (2005) and Mitzenmacher (2004). The diversity has been characterized for the Dirichlet process (DP) and a two-parameter extension to the Dirichlet process known as the Pitman-Yor process (PYP) (Pitman and Yor, 1997), with extra parameter α (0, 1) and concentration parameter θ > α. We will ∈ − CHAPTER 4. COMBINATORIAL CLUSTERING AND THE BNBP 93 see that while the number of clusters generated according to a DP grows as a logarithm of the size of the data, the number of clusters generated according to a PYP grows as a power of the size of the data. Indeed, the popularity of the Pitman-Yor process—as an alternative prior to the Dirichlet process in the clustering domain—can be attributed to this power-law growth (Goldwater, Griffiths, and M. Johnson, 2006; Teh, 2006; Wood et al., 2009). In this section, we derive analogous asymptotic results for the BNBP treated as a clustering model. We first highlight a subtle difference between our model and the Dirichlet process. For a Dirichlet process, the number of data points N is known a priori and fixed. An advantage of our model is that it models the number of data points N as a random variable and therefore has potentially more predictive power in modeling multiple populations. We note that a similar effect can be achieved for the Dirichlet process by using the gamma process for feature modeling as described in Section 4.4 rather than normalizing away the mass that determines the number of observations. However, there is no such unnormalized completely random measure for the PYP (Pitman and Yor, 1997). We thus treat N as fixed for the DP and PYP, in which case the number of clusters K(N) is a function of N. On the other hand, the number of data points N(r) depends on r in the case of the BNBP, and the number of clusters K(r) does as well. We also define Kj(N) to be the number of clusters with exactly j elements in the case of the DP and PYP, and we define Kj(r) to be the number of clusters with exactly j elements in the BNBP case. For the DP and PYP, K(N) and Kj(N) are random even though N is fixed, so it will be useful to also define their expectations:

Φ(N) ≜ E[K(N)],   Φ_j(N) ≜ E[K_j(N)].   (4.8)

In the BNBP and 3BNBP cases, all of K(r), K_j(r), and N(r) are random. So we further define

Φ(r) ≜ E[K(r)],   Φ_j(r) ≜ E[K_j(r)],   ξ(r) ≜ E[N(r)].   (4.9)

We summarize the results that we establish in this section in Table 4.1, where we also include comparisons to existing results for the DP and PYP.5 The full statements of our results, from which the table is derived, can be found in Appendix 4.C, and proofs are given in Appendix 4.D. The table shows, for example, that for the DP, Φ(N) ∼ θ log(N) as N → ∞, and, for the BNBP, Φ_j(r) ∼ γθ j^{−1} as r → ∞ (i.e., constant in r). The result for the expected number of clusters for the DP can be found in Korwar and Hollander (1973); results for the expected number of clusters for both the DP and PYP can be found in Pitman (2006, Eq. 3.24 on p. 69 and Eq. 3.47 on p. 73). Note that in all cases the expected counts of clusters of size j are asymptotic expansions in terms of r for fixed j and should not be interpreted as asymptotic expansions in terms of j.

5 The reader interested in power laws may also note that the generalized gamma process is a completely random measure that, when normalized, provides a probability measure for clusters that has asymptotic behavior similar to the PYP; in particular, the expected number of clusters grows almost surely as a power of the size of the data (Lijoi, Mena, and Prünster, 2007).

Table 4.1: Let N be the number of data points when this number is fixed and ξ(r) be the expected number of data points when N is random. Let Φ(N), Φ_j(N), Φ(r), and Φ_j(r) be the expected number of clusters under various scenarios, defined as in Eqs. (4.8) and (4.9). The upper part of the table gives the asymptotic behavior of Φ up to a multiplicative constant, and the bottom part of the table gives the multiplicative constants. For the DP, θ > 0. For the PYP, α ∈ (0, 1) and θ > −α. For the BNBP, θ > 1. For the 3BNBP, α ∈ (0, 1) and θ > 1 − α.

Asymptotic behavior, as a function of N or ξ(r):
  Process   Expected number of clusters   Expected number of clusters of size j
  DP        log(N)                        1
  PYP       N^α                           N^α
  BNBP      log(ξ(r))                     1
  3BNBP     (ξ(r))^α                      (ξ(r))^α

Multiplicative constants:
  Process   Expected number of clusters                                   Expected number of clusters of size j
  DP        θ                                                             θ j^{−1}
  PYP       Γ(θ+1) / (α Γ(θ+α))                                           [Γ(θ+1)/(Γ(1−α)Γ(θ+α))] Γ(j−α)/Γ(j+1)
  BNBP      γθ                                                            γθ j^{−1}
  3BNBP     (γ^{1−α}/α) [Γ(θ+1)/Γ(θ+α)] [(θ+α−1)/θ]^α                     γ^{1−α} [Γ(θ+1)/(Γ(1−α)Γ(θ+α))] [Γ(j−α)/Γ(j+1)] [(θ+α−1)/θ]^α

We conclude that, just as for the Dirichlet process, the BNBP can achieve both logarithmic cluster number growth in the basic model and power law cluster number growth in the expanded, three-parameter model.

4.6 Simulation

Our theoretical results in Section 4.5 are supported by simulation results, summarized in Figure 4.1; in particular, our simulation corroborated the existence of power laws in the three-parameter beta process case examined in Section 4.5. The simulation was performed as follows. For values of the negative binomial parameter r evenly spaced between 1 and 1,001, we generated beta process weights according to a beta process (or three-parameter beta process) using a stick-breaking representation (Paisley, Zaas, et al., 2010; Broderick, Jordan, and Pitman, 2012). For each of the resulting atoms, we simulated negative binomial draws to arrive at a sample from a BNBP. For each such BNBP, we can count the resulting total number of data points N and total number of clusters K. Thus, each r gives us an (r, N, K) triple. In the simulation, we set the mass parameter γ = 3. We set the concentration parameter θ = 3; in particular, we note that the analysis in Section 4.5 implies that we should always have θ > 1. Finally, we ran the simulation for both the α = 0 case, where we expect no power law behavior, and the α = 0.5 case, where we do expect power law behavior. The results are shown in Figure 4.1. In this figure, we scatter plot the (r, K) tuples from the generated (r, N, K) triples on the left and plot the (N, K) tuples on the right. In the left plot, the upper black points represent the simulation with α = 0.5, and the lower blue data points represent the α = 0 case. The lower red line illustrates the asymptotic theoretical result corresponding to the α = 0 case (Lemma 4.C.6 in Appendix 4.C), and we can see that the anticipated logarithmic growth behavior agrees with our simulation. The upper red line illustrates the theoretical result for the α = 0.5 case (Lemma 4.C.7 in Appendix 4.C). The agreement between simulation and theory here demonstrates that, in contrast to the α = 0 case, the α = 0.5 case exhibits power law growth in the number of clusters K as a function of the negative binomial parameter r. Our simulations also bear out that the expectation of the random number of data points N increases linearly with r (Lemmas 4.C.4 and 4.C.5 in Appendix 4.C). We see, then, on the right side of Figure 4.1 the behavior of the number of clusters K now plotted as a function of N. As expected given the asymptotics of the expectation of N, the behavior in the right plot largely mirrors the behavior in the left plot. Just as in the left plot, the lower red line (Theorem 4.C.10 in Appendix 4.C) shows the anticipated logarithmic growth of K and N when α = 0. And the upper red line (Theorem 4.C.11 in Appendix 4.C) shows the anticipated power law growth of K and N when α = 0.5. We can see the parallels with the DP and PYP here. Clusters generated from the Dirichlet process (i.e., Pitman-Yor process with α = 0) exhibit logarithmic growth of the expected number of clusters K as the (deterministic) number of data points N grows. And clusters generated from the Pitman-Yor process with α ∈ (0, 1) exhibit power law behavior in the expectation of K as a function of (fixed) N. So too do we see that the BNBP, when applied to clustering problems, yields asymptotic growth similar to the DP and that the 3BNBP yields asymptotic growth similar to the PYP.
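The simulation just described is straightforward to reproduce approximately. The sketch below, written here for illustration rather than taken from the simulation code behind Figure 4.1, covers only the α = 0 case; it truncates the size-biased beta process construction of Thibaux and Jordan (2007), recalled in Section 4.7, at a finite number of rounds M (our choice) and draws the negative binomial counts through the gamma-Poisson representation used in Section 4.7.

    import numpy as np

    rng = np.random.default_rng(0)
    gamma, theta = 3.0, 3.0      # mass and concentration, as in Section 4.6
    M = 50_000                   # truncation level for the size-biased rounds (our choice)

    def beta_process_weights():
        # Truncated size-biased construction (alpha = 0): round m contributes
        # C_m ~ Poisson(theta*gamma/(theta+m)) atoms with weights Beta(1, theta+m).
        weights = []
        for m in range(M):
            c_m = rng.poisson(theta * gamma / (theta + m))
            if c_m > 0:
                weights.append(rng.beta(1.0, theta + m, size=c_m))
        return np.concatenate(weights)

    def bnbp_sample(r):
        # Counts i_k ~ NegBin(r, b_k) at each atom, drawn as a gamma-Poisson mixture:
        # lambda_k ~ Gamma(r, rate (1-b_k)/b_k), then i_k ~ Poisson(lambda_k).
        b = beta_process_weights()
        lam = rng.gamma(shape=r, scale=b / (1.0 - b))
        return rng.poisson(lam)

    for r in [10, 100, 1000]:
        i = bnbp_sample(r)
        N, K = int(i.sum()), int((i > 0).sum())
        # Compare with Lemma 4.C.4 (E[N] ~ gamma*theta/(theta-1)*r) and
        # Lemma 4.C.6 (E[K] ~ gamma*theta*log r).
        print(r, N, K, gamma * theta / (theta - 1) * r, gamma * theta * np.log(r))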

4.7 Posterior inference

In this section we present posterior inference algorithms for the HBNBP. We focus on the setting in which, for each individual d, there is an associated exchangeable sequence of N_d observations (x_{d,n})_{n=1}^{N_d}. We seek to infer both the admixture component responsible for each observation and the parameter ψ_k associated with each component. Hereafter, we let z_{d,n} denote the unknown component index associated with x_{d,n}, so that x_{d,n} ∼ F(ψ_{z_{d,n}}). Under the HBNBP admixture model introduced in Section 4.4, the posterior over component indices and parameters has the form

p(z_{·,·}, ψ_· | x_{·,·}, Θ) ∝ p(z_{·,·}, ψ_·, b_{0,·}, b_{·,·} | x_{·,·}, Θ),

where Θ ≜ (F, H, γ_0, θ_0, γ_·, θ_·, r_·) is the collection of all fixed hyperparameters. As is the case with HDP admixtures (Teh, Jordan, et al., 2006) and earlier hierarchical beta process featural models (Thibaux and Jordan, 2007), the posterior of the HBNBP admixture cannot be obtained in analytical form due to complex couplings in the marginal p(x_{·,·} | Θ).


Figure 4.1: For each r evenly spaced between 1 and 1,001, we simulate (random) values of the number of data points N and number of clusters K from the BNBP and 3BNBP. In both plots, we have mass parameter γ = 3 and concentration parameter θ = 3. On the left, we see the number of clusters K as a function of the negative binomial parameter r (see Lemma 4.C.6 and Lemma 4.C.7 in Appendix 4.C); on the right, we see the number of clusters K as a function of the (random) number of data points N (see Theorem 4.C.10 and Theorem 4.C.11 in Appendix 4.C). In both plots, the upper black points show simulation results for the case α = 0.5, and the lower blue points show α = 0. Red lines indicate the theoretical asymptotic mean behavior we expect from Section 4.5.

We therefore develop Gibbs sampling algorithms (S. Geman and D. Geman, 1984) to draw samples of the relevant latent variables from their joint posterior. A challenging aspect of inference in the nonparametric setting is the countable infinitude of component parameters and the countably infinite support of the component indices. We develop two sampling algorithms that cope with this issue in different ways. In Section 4.7, we use slice sampling to control the number of components that need be considered on a given round of sampling and thereby derive an exact Gibbs sampler for posterior inference under the HBNBP admixture model. In Section 4.7, we describe an efficient alternative sampler that makes use of a finite approximation to the beta process. Throughout we assume that the base measure H is continuous. We note that neither procedure requires conjugacy between the base distribution H and the data-generating distribution F.

Exact Gibbs slice sampler

Slice sampling (Damien, Wakefield, and Walker, 1999; Neal, 2003) has been successfully employed in several Bayesian nonparametric contexts, including Dirichlet process mixture modeling (Walker, 2007; Papaspiliopoulos, 2008; Kalli, Griffin, and Walker, 2011) and beta process feature modeling (Teh, Görür, and Ghahramani, 2007). The key to its success lies in the introduction of one or more auxiliary variables that serve as adaptive truncation levels for an infinite sum representation of the stochastic process. This adaptive truncation procedure proceeds as follows. For each observation associated with individual d, we introduce an auxiliary variable u_{d,n} with conditional distribution

u_{d,n} ∼ Unif(0, ζ_{d,z_{d,n}}),

where (ζ_{d,k})_{k=1}^∞ is a fixed positive sequence with lim_{k→∞} ζ_{d,k} = 0. To sample the component indices, we recall that a negative binomial draw i_{d,k} ∼ NegBin(r_d, b_{d,k}) may be represented as a gamma-Poisson mixture:

λ_{d,k} ∼ Gamma(r_d, (1 − b_{d,k})/b_{d,k}),
i_{d,k} ∼ Poisson(λ_{d,k}).

We first sample λ_{d,k} from its full conditional. By gamma-Poisson conjugacy, this has the simple form

λ_{d,k} ∼ Gamma(r_d + i_{d,k}, 1/b_{d,k}).

We next note that, given λ_{d,·} and the total number of observations associated with individual d, the cluster sizes i_{d,k} may be constructed by sampling each z_{d,n} independently from λ_{d,·}/∑_k λ_{d,k} and setting i_{d,k} = ∑_n I(z_{d,n} = k). Hence, conditioned on the number of data points N_d, the component parameters ψ_k, the auxiliary variables λ_{d,k}, and the slice-sampling variable u_{d,n}, we sample the index z_{d,n} from a discrete distribution with

P(z_{d,n} = k) ∝ [I(u_{d,n} ≤ ζ_{d,k}) / ζ_{d,k}] λ_{d,k} F(dx_{d,n} | ψ_k),

so that only the finite set of component indices {k : ζ_{d,k} ≥ u_{d,n}} need be considered when sampling z_{d,n}. Let K_d ≜ max{k : ∃ n s.t. ζ_{d,k} ≥ u_{d,n}} and K ≜ max_d K_d. Then, on a given round of sampling, we need only explicitly represent λ_{d,k} and b_{d,k} for k ≤ K_d and ψ_k and b_{0,k} for k ≤ K. The simple Gibbs conditionals for b_{d,k} and ψ_k can be found in Appendix 4.F. To sample the shared beta process weights b_{0,k}, we leverage the size-biased construction of the beta process introduced by Thibaux and Jordan, 2007:

B_0 = ∑_{m=0}^∞ ∑_{i=1}^{C_m} b_{0,m,i} δ_{ψ_{m,i,·}},

where

C_m ∼indep Poisson(θ_0 γ_0 / (θ_0 + m)),   b_{0,m,i} ∼indep Beta(1, θ_0 + m),   and   ψ_{m,i,·} ∼iid H,

and we develop a Gibbs slice sampler for generating samples from its posterior. The details are deferred to Appendix 4.F.
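To make the preceding updates concrete, the following sketch (ours, not the implementation evaluated later in this chapter) performs the u, λ, and z steps for a single individual d. It assumes current values of the per-individual weights b_{d,k} and a log-likelihood function log F(x_{d,n} | ψ_k) are supplied by the remaining Gibbs conditionals; those conditionals, and the update for the shared weights b_{0,k}, are deferred to Appendix 4.F as in the text.

    import numpy as np

    rng = np.random.default_rng(1)

    def slice_update_one_individual(z, b, loglik, r_d, zeta=lambda k: 1.5 ** -(k + 1)):
        # z      : current component indices z_{d,n}, shape (N_d,), values in 0..K_rep-1
        # b      : current weights b_{d,k} for the represented components, shape (K_rep,)
        # loglik : callable (k, n) -> log F(x_{d,n} | psi_k); assumed supplied elsewhere
        # zeta   : slice decay sequence zeta_{d,k} (0-based k here); 1.5^{-k} as in Section 4.8
        N_d, K_rep = z.shape[0], b.shape[0]
        zetas = np.array([zeta(k) for k in range(K_rep)])

        # Cluster sizes i_{d,k} implied by the current indices.
        i = np.bincount(z, minlength=K_rep)

        # lambda_{d,k} ~ Gamma(r_d + i_{d,k}, 1/b_{d,k}), i.e. rate 1/b, scale b.
        lam = rng.gamma(shape=r_d + i, scale=b)

        # Slice variables u_{d,n} ~ Unif(0, zeta_{d, z_{d,n}}).
        u = rng.uniform(0.0, zetas[z])

        # Resample each z_{d,n} over the finite set {k : zeta_{d,k} >= u_{d,n}}.
        for n in range(N_d):
            active = np.flatnonzero(zetas >= u[n])
            logw = (np.log(lam[active]) - np.log(zetas[active])
                    + np.array([loglik(k, n) for k in active]))
            w = np.exp(logw - logw.max())
            z[n] = active[rng.choice(active.size, p=w / w.sum())]
        return z, lam, u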

Figure 4.2: Number of admixture components used by the finite approximation sampler with K = 100 (left) and the exact Gibbs slice sampler (right) on each iteration of HBNBP admixture model posterior inference. We use a standard “toy bars” dataset with ten underlying admixture components (cf. Griffiths and Steyvers, 2004). We declare a component to be used by a sample if the sampled beta process weight, b_{0,k}, exceeds a small threshold. Both the exact and the finite approximation sampler find the correct underlying structure, while the finite sampler attempts to innovate more because of the larger number of proposal components available to the data in each iteration.

Finite approximation Gibbs sampler

An alternative to the size-biased construction of B0 is a finite approximation to the beta process with a fixed number of components, K:

b_{0,k} ∼iid Beta(θ_0 γ_0/K, θ_0(1 − γ_0/K)),   ψ_k ∼iid H,   k ∈ {1, . . . , K}.   (4.10)

It is known that, when H is continuous, the distribution of ∑_{k=1}^K b_{0,k} δ_{ψ_k} converges to BP(θ_0, γ_0, H) as the number of components K → ∞ (see the proof of Theorem 3.1 by Hjort (1990) with the choice A_0(t) = γ). Hence, we may leverage the beta process approximation (4.10) to develop an approximate posterior sampler for the HBNBP admixture model with an approximation level K that trades off between computational efficiency and fidelity to the true posterior. We defer the detailed conditionals of the resulting Gibbs sampler to Appendix 4.F and briefly compare the behavior of the finite and exact samplers on a toy dataset in Figure 4.2. We note finally that the beta process approximation in Eq. (4.10) also gives rise to a new finite admixture model that may be of interest in its own right; we explore the utility of this HBNBP approximation in Section 4.9.
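A minimal illustration of the approximation in Eq. (4.10), added here for concreteness: for any K (with K > γ_0), the expected total mass of the K weights equals γ_0, matching the beta process mass, while an increasing fraction of the individual weights become negligible as K grows; this is the trade-off that the approximation level controls.

    import numpy as np

    rng = np.random.default_rng(2)
    gamma0, theta0 = 3.0, 3.0

    def finite_bp_weights(K):
        # K-component approximation of Eq. (4.10); atom locations psi_k ~ H are
        # omitted since only the weights matter for this check.
        return rng.beta(theta0 * gamma0 / K, theta0 * (1.0 - gamma0 / K), size=K)

    for K in [10, 100, 1000]:
        draws = np.array([finite_bp_weights(K) for _ in range(2000)])
        print(K,
              draws.sum(axis=1).mean(),            # close to gamma0 for every K
              (draws > 0.01).sum(axis=1).mean())   # non-negligible weights (0.01 is arbitrary)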

4.8 Document topic modeling

In the next two sections, we show how the HBNBP admixture model and its finite approximation can be used as practical building blocks for more complex supervised and unsupervised inferential tasks. We first consider the unsupervised task of document topic modeling, in which each individual d is a document containing N_d observations (words) and each word x_{d,n} belongs to a vocabulary of size V. The topic modeling framework is an instance of admixture modeling in which we assume that each word of each document is generated from a latent admixture component or topic, and our goal is to infer the topic underlying each word. In our experiments, we let H_ord, the Ψ dimension of the ordinary component intensity measure, be a Dirichlet distribution with parameter η1 for η = 0.1 and 1 a V-dimensional vector of ones, and we let F(ψ_k) be Multinomial(1, ψ_k). We use the setting (γ_0, θ_0, γ_d, θ_d) = (3, 3, 1, 10) for the global and document-specific mass and concentration parameters and set the document-specific negative binomial shape parameter according to the heuristic r_d = N_d(θ_0 − 1)/(θ_0 γ_0). We arrive at this heuristic by matching N_d to its expectation under a non-hierarchical BNBP model and solving for r_d:

E[N_d] = r_d E[∑_{k=1}^∞ b_{d,k}/(1 − b_{d,k})] = r_d γ_0 θ_0/(θ_0 − 1).

When applying the exact Gibbs slice sampler, we let the slice sampling decay sequence follow the same pattern across all documents: ζ_{d,k} = 1.5^{−k}.
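For reference, a hypothetical helper implementing this heuristic and the decay sequence (the function names and the worked value are ours):

    def negbin_shape_heuristic(N_d, gamma0=3.0, theta0=3.0):
        # Match N_d to its expectation r_d * gamma0 * theta0 / (theta0 - 1) and solve for r_d.
        return N_d * (theta0 - 1.0) / (theta0 * gamma0)

    def slice_decay(k):
        # zeta_{d,k} = 1.5^{-k}, shared across all documents d.
        return 1.5 ** -k

    # e.g. a 500-word document gets r_d = 500 * 2 / 9, roughly 111.1
    print(negbin_shape_heuristic(500), slice_decay(1))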

Worldwide Incidents Tracking System

We report results on the Worldwide Incidents Tracking System (WITS) dataset.6 This dataset consists of reports on 79,754 terrorist attacks from the years 2004 through 2010. Each event contains a written summary of the incident, location information, victim statistics, and various binary fields such as “assassination,” “IED,” and “suicide.” We transformed each incident into a text document by concatenating the summary and location fields and then adding further words to account for other, categorical fields: e.g., an incident with seven hostages would have the word “hostage” added to the document seven times. We used a vocabulary size of V = 1,048 words.

Perpetrator Identification. Our experiment assesses the ability of the HBNBP admixture model to discriminate among incidents perpetrated by different organizations. We first grouped documents according to the organization claiming responsibility for the reported incident. We considered 5,390 claimed documents in total distributed across the ten organizations listed in Table 4.2. We removed all organization identifiers from all documents and randomly set aside 10% of the documents in each group as test data. Next, for each group, we trained an independent, organization-specific HBNBP model on the remaining documents in that group by drawing 10,000 MCMC samples. We proceeded to classify each test document by measuring the likelihood of the document under each trained HBNBP model and assigning the label associated with the largest likelihood.

6 https://wits.nctc.gov

Table 4.2: The number of incidents claimed by each organization in the WITS perpetrator identification experiment.

Group ID   Perpetrator                           # Claimed Incidents
 1         taliban                               2647
 2         al-aqsa                               417
 3         farc                                  76
 4         izz al-din al-qassam                  478
 5         hizballah                             89
 6         al-shabaab al-islamiya                426
 7         al-quds                               505
 8         abu ali mustafa                       249
 9         al-nasser salah al-din                212
 10        communist party of nepal (maoist)     291

The resulting confusion matrix across the ten candidate organizations is displayed in Table 4.3. Results are reported for the exact Gibbs slice sampler; performance under the finite approximation sampler is nearly identical. For comparison, we carried out the same experiment using the more standard HDP admixture model in place of the HBNBP. For posterior inference, we used the HDP block sampler code of Yee Whye Teh7 and initialized the sampler with 100 topics and topic hyperparameter η = 0.1 (all remaining parameters were set to their default values). For each organization, we drew 250,000 MCMC samples and kept every twenty-fifth sample for evaluation. The confusion matrix obtained through HDP modeling is displayed in Table 4.3. We see that, overall, HBNBP modeling leads to more accurate identification of perpetrators than its HDP counterpart. Most notably, the HDP wrongly attributes more than half of all documents from group 1 (taliban) to group 3 (farc) or group 6 (al-shabaab al-islamiya). We hypothesize that the HBNBP’s superior discriminative power stems from its ability to distinguish between documents both on the basis of word frequency and on the basis of document length. We would expect the HBNBP to have greatest difficulty discriminating among perpetrators when both word usage frequencies and document length distributions are similar across groups. To evaluate the extent to which this occurs in our perpetrator identification experiment, for each organization, we plotted the density histogram of document lengths and the heat map displaying word usage frequency across all associated documents in Figure 4.3. We find that the word frequency patterns are nearly identical across groups 2, 7, 8, and 9 (al-aqsa, al-quds, abu ali mustafa, and al-nasser salah al-din, respectively) and that the document length distributions of these four groups are all well aligned. As expected, the majority of classification errors made by our HBNBP models result from misattribution among these same four groups.

7 http://www.gatsby.ucl.ac.uk/~ywteh/research/npbayes/npbayes-r1.tgz

Table 4.3: Confusion matrices for WITS perpetrator identification. Rows give the actual group and columns the predicted group; see Table 4.2 for the organization names matching each group ID.

HBNBP confusion matrix:
       1     2     3     4     5     6     7     8     9     10
 1    1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
 2    0.00  0.38  0.00  0.02  0.00  0.00  0.29  0.29  0.02  0.00
 3    0.00  0.00  1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
 4    0.00  0.00  0.00  0.54  0.00  0.00  0.15  0.27  0.04  0.00
 5    0.11  0.33  0.00  0.11  0.44  0.00  0.00  0.00  0.00  0.00
 6    0.02  0.00  0.00  0.00  0.00  0.98  0.00  0.00  0.00  0.00
 7    0.00  0.10  0.00  0.06  0.02  0.00  0.48  0.30  0.04  0.00
 8    0.00  0.04  0.00  0.00  0.00  0.00  0.16  0.76  0.04  0.00
 9    0.00  0.10  0.00  0.05  0.10  0.00  0.29  0.43  0.05  0.00
 10   0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00

HDP confusion matrix:
       1     2     3     4     5     6     7     8     9     10
 1    0.46  0.00  0.26  0.00  0.03  0.23  0.00  0.00  0.00  0.01
 2    0.00  0.31  0.02  0.02  0.00  0.00  0.29  0.36  0.00  0.00
 3    0.00  0.00  1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
 4    0.00  0.00  0.00  0.52  0.04  0.00  0.06  0.31  0.06  0.00
 5    0.11  0.00  0.00  0.00  0.44  0.00  0.11  0.11  0.11  0.11
 6    0.00  0.00  0.00  0.00  0.00  1.00  0.00  0.00  0.00  0.00
 7    0.00  0.10  0.00  0.04  0.00  0.00  0.38  0.42  0.06  0.00
 8    0.00  0.04  0.00  0.00  0.00  0.00  0.08  0.84  0.04  0.00
 9    0.00  0.05  0.00  0.10  0.00  0.00  0.24  0.62  0.00  0.00
 10   0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00

The same group similarity structure is evidenced in a display of the ten most probable words from the most probable HBNBP topic for each group, Table 4.4. There, we also find an intuitive summary of the salient regional and methodological vocabulary associated with each organization.

4.9 Image segmentation and object recognition

Two problems of enduring interest in the computer vision community are image segmentation, dividing an image into its distinct, semantically meaningful regions, and object recognition, labeling the regions of images according to their semantic object classes. Solutions to these problems are at the core of applications such as content-based image retrieval, video surveying, and object tracking. Here we will take an admixture modeling approach to jointly recognizing and localizing objects within images (Cao and F. Li, 2007; Russell et al., 2006; Sivic et al., 2005; Verbeek and Bill Triggs, 2007).

Table 4.4: The ten most probable words from the most probable topic in the final MCMC sample of each group in the WITS perpetrator identification experiment. The topic probability is given in parentheses. See Table 4.2 for the organization names matching each group ID.

HBNBP: Top topic per organization
group 1 (0.29)  afghanistan, assailants, claimed, responsibility, armedattack, fired, police, victims, armed, upon
group 2 (0.77)  israel, assailants, armedattack, responsibility, fired, claimed, district, causing, southern, damage
group 3 (0.95)  colombia, victims, facility, wounded, armed, claimed, forces, revolutionary, responsibility, assailants
group 4 (0.87)  israel, fired, responsibility, claimed, armedattack, causing, injuries, district, southern, assailants
group 5 (0.95)  victims, wounded, facility, israel, responsibility, claimed, armedattack, fired, rockets, katyusha
group 6 (0.54)  wounded, victims, somalia, civilians, wounding, facility, killing, mortars, armedattack, several
group 7 (0.83)  israel, district, southern, responsibility, claimed, fired, armedattack, assailants, causing, injuries
group 8 (0.94)  israel, district, southern, armedattack, claimed, fired, responsibility, assailants, causing, injuries
group 9 (0.88)  israel, district, southern, fired, responsibility, claimed, armedattack, assailants, causing, injuries
group 10 (0.80)  nepal, victims, hostage, assailants, party, communist, claimed, front, maoist/united, responsibility

[Density histograms of document lengths for each organization.]

[Heat map of word frequencies for the 200 most common words across all documents (best viewed in color).]

Figure 4.3: Document length distributions and word frequencies for each organization in the WITS perpetrator identification experiment.

Each individual d is an image comprised of N_d image patches (observations), and each patch x_{d,n} is assumed to be generated by an unknown object class (a latent component of the admixture). Given a series of training images with image patches labeled, the problem of recognizing and localizing objects in a new image reduces to inferring the latent class associated with each new image patch. Since the number of object classes is typically known a priori, we will tackle this inferential task with the finite approximation to the HBNBP admixture model given in Section 4.7 and compare its performance with that of a more standard model of admixture, latent Dirichlet allocation (LDA) (Blei, Ng, and Jordan, 2003).

Representing an Image Patch

We will represent each image patch as a vector of visual descriptors drawn from multiple modalities. Verbeek and Bill Triggs (2007) suggest three complementary modalities: texture, hue, and location. Here, we introduce a fourth: opponent angle. To describe hue, we use the robust hue descriptor of Van De Weijer and Schmid, 2006, which grants invariance to illuminant variations, lighting geometry, and specularities. For texture description we use “dense SIFT” features (Lowe, 2004; Dalal and B. Triggs, 2005), histograms of oriented gradients computed not at local keypoints but rather at a single scale over each patch. To describe coarse location, we cover each image with a regular c × c grid of cells (for a total of V^loc = c^2 cells) and assign each patch the index of the covering cell. The opponent angle descriptor of Van De Weijer and Schmid, 2006 captures a second characterization of image patch color. These features are invariant to specularities, illuminant variations, and diffuse lighting conditions. To build a discrete visual vocabulary from these raw descriptors, we vector quantize the dense SIFT, hue, and opponent angle descriptors using k-means, producing V^sift, V^hue, and V^opp clusters respectively. Finally, we form the observation associated with a patch by concatenating the four modality components into a single vector, x_{d,n} = (x_{d,n}^sift, x_{d,n}^hue, x_{d,n}^loc, x_{d,n}^opp). As in Verbeek and Bill Triggs, 2007, we assume that the descriptors from disparate modalities are conditionally independent given the latent object class of the patch. Hence, we define our data generating distribution and our base distribution over parameters ψ_k = (ψ_k^sift, ψ_k^hue, ψ_k^loc, ψ_k^opp) via

ψ_k^m ∼indep Dirichlet(η 1_{V^m})   for m ∈ {sift, hue, loc, opp},
x_{d,n}^m | z_{d,n}, ψ_· ∼indep Multinomial(1, ψ_{z_{d,n}}^m)   for m ∈ {sift, hue, loc, opp},

for a hyperparameter η ∈ R and 1_{V^m} a V^m-dimensional vector of ones.

Experimental Setup

We use the Microsoft Research Cambridge pixel-wise labeled image database v1 in our experiments.8 The dataset consists of 240 images, each of size 213 × 320 pixels.

8 http://research.microsoft.com/vision/cambridge/recognition/

Each image has an associated pixel-wise ground truth labeling, with each pixel labeled as belonging to one of 13 semantic classes or to the void class. Pixels have a ground truth label of void when they do not belong to any semantic class or when they lie on the boundaries between classes in an image. The dataset provider notes that there are insufficiently many instances of horse, mountain, sheep, or water to learn these classes, so, as in Verbeek and Bill Triggs, 2007, we treat these ground truth labels as void as well. Thus, our general task is to learn and segment the remaining nine semantic object classes. From each image, we extract 20 × 20 pixel patches spaced at 10 pixel intervals across the image. We choose the visual vocabulary sizes (V^sift, V^hue, V^loc, V^opp) = (1000, 100, 100, 100) and fix the hyperparameter η = 0.1. As in Verbeek and Bill Triggs, 2007, we assign each patch a ground truth label z_{d,n} representing the most frequent pixel label within the patch. When performing posterior inference, we divide the dataset into training and test images. We allow the inference algorithm to observe the labels of the training image patches, and we evaluate the algorithm’s ability to correctly infer the label associated with each test image patch. Since the number of object classes is known a priori, we employ the HBNBP finite approximation Gibbs sampler of Section 4.7 to conduct posterior inference. We again use the hyperparameters (γ_0, θ_0, γ_d, θ_d) = (3, 3, 1, 10) for all documents d and set r_d according to the heuristic r_d = N_d(θ_0 − 1)/(θ_0 γ_0). We draw 10,000 samples and, for each test patch, predict the label with the highest posterior probability across the samples. We compare HBNBP performance with that of LDA using the standard variational inference algorithm of Blei, Ng, and Jordan, 2003 and maximum a posteriori prediction of patch labels. For each model, we set K = 10, allowing for the nine semantic classes plus void, and, following Verbeek and Bill Triggs, 2007, we ensure that the void class remains generic by fixing ψ_10^m = (1/V^m, . . . , 1/V^m) for each modality m.
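The factorization over modalities makes test-time scoring simple: for a point estimate of the ψ parameters, the log-likelihood of a patch under class k is a sum of per-modality multinomial log-probabilities. The sketch below is our own illustration of this structure with hypothetical array names, not the inference code used in the experiments.

    import numpy as np

    MODALITIES = ("sift", "hue", "loc", "opp")

    def patch_log_likelihoods(x, psi):
        # x   : dict, modality name -> observed codeword index for one patch
        # psi : dict, modality name -> array of shape (K, V_m) of per-class codeword
        #       probabilities psi_k^m (a point estimate; both inputs are hypothetical
        #       stand-ins for quantities produced elsewhere)
        K = next(iter(psi.values())).shape[0]
        scores = np.zeros(K)
        for m in MODALITIES:
            scores += np.log(psi[m][:, x[m]])
        return scores

    # Toy usage with random parameters: K = 10 classes (nine semantic classes plus void).
    rng = np.random.default_rng(3)
    sizes = {"sift": 1000, "hue": 100, "loc": 100, "opp": 100}
    psi = {m: rng.dirichlet(np.ones(V), size=10) for m, V in sizes.items()}
    x = {m: int(rng.integers(V)) for m, V in sizes.items()}
    print(np.argmax(patch_log_likelihoods(x, psi)))   # predicted class for this patch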

Results

Figure 4.4 displays sample test image segmentations obtained using the HBNBP admixture model. Each pixel is given the predicted label of its closest patch center. Test patch classification accuracies for the HBNBP admixture model and LDA are reported in Table 4.5. All results are averaged over twenty randomly generated 90% training / 10% test divisions of the dataset. The two methods perform comparably, with the HBNBP admixture model outperforming LDA in the prediction of every object class save building. Indeed, the mean object class accuracy is 0.79 for the HBNBP model versus 0.76 for LDA, showing that the HBNBP provides a viable alternative to more classical approaches to admixture.

Parameter Sensitivity

To test the sensitivity of the HBNBP admixture model to misspecification of the mass, concentration, and likelihood hyperparameters, we measure the fluctuation in test set performance as each hyperparameter deviates from its default value (with the remainder held fixed). The results of this study are summarized in Table 4.6.

Figure 4.4: MSRC-v1 test image segmentations inferred by the HBNBP admixture model (best viewed in color).

Table 4.5: Confusion matrices for patch-level image segmentation and object recognition on the MSRC-v1 database. We report test image patch inference accuracy averaged over twenty randomly generated 90% training / 10% test divisions. Rows give the actual class label and columns the predicted class label.

HBNBP confusion matrix:
             building  grass  tree  cow   sky   aeroplane  face  car   bicycle
 building    0.66      0.01   0.05  0.00  0.03  0.09       0.01  0.03  0.09
 grass       0.00      0.89   0.06  0.02  0.00  0.01       0.00  0.00  0.00
 tree        0.01      0.08   0.75  0.01  0.04  0.03       0.00  0.00  0.07
 cow         0.00      0.10   0.04  0.72  0.00  0.00       0.05  0.01  0.01
 sky         0.04      0.00   0.01  0.00  0.93  0.01       0.00  0.00  0.00
 aeroplane   0.10      0.04   0.01  0.00  0.02  0.81       0.00  0.02  0.00
 face        0.04      0.00   0.01  0.04  0.00  0.00       0.84  0.00  0.00
 car         0.20      0.00   0.01  0.00  0.01  0.01       0.00  0.73  0.02
 bicycle     0.16      0.00   0.04  0.00  0.00  0.00       0.00  0.02  0.73

LDA confusion matrix:
             building  grass  tree  cow   sky   aeroplane  face  car   bicycle
 building    0.69      0.01   0.04  0.01  0.03  0.07       0.01  0.03  0.08
 grass       0.00      0.88   0.05  0.02  0.00  0.01       0.00  0.00  0.00
 tree        0.02      0.08   0.75  0.01  0.04  0.02       0.00  0.00  0.05
 cow         0.00      0.10   0.03  0.70  0.00  0.00       0.05  0.01  0.01
 sky         0.05      0.00   0.02  0.00  0.91  0.01       0.00  0.00  0.00
 aeroplane   0.12      0.04   0.01  0.00  0.02  0.75       0.00  0.03  0.00
 face        0.04      0.00   0.01  0.05  0.00  0.00       0.80  0.00  0.00
 car         0.19      0.00   0.01  0.00  0.01  0.01       0.00  0.71  0.03
 bicycle     0.19      0.00   0.04  0.01  0.00  0.00       0.00  0.02  0.68

Table 4.6: Sensitivity of HBNBP admixture model to hyperparameter specification for joint image segmentation and object recognition on the MSRC-v1 database. Each hyperparameter is varied across the specified range while the remaining parameters are held fixed to the default values reported in Section 4.9. We report test patch inference accuracy averaged across object classes and over twenty randomly generated 90% training / 10% test divisions. For each test patch, we predict the label with the highest posterior probability across 2,000 samples.

Hyperparameter   Parameter range       Minimum accuracy   Maximum accuracy
 γ_0             [0.3, 30]             0.786              0.787
 θ_0             [1.5, 30]             0.786              0.786
 η               [2 × 10^{−16}, 1]     0.778              0.788

We find that the HBNBP model is rather robust to changes in the hyperparameters and maintains nearly constant predictive performance, even as the parameters vary over several orders of magnitude.

4.10 Discussion

Motivated by problems of admixture, in which individuals are represented multiple times in multiple latent classes, we introduced the negative binomial process, an infinite-dimensional prior for vectors of counts. We developed new nonparametric admixture models based on the NBP and its conjugate prior, the beta process, and characterized the relationship between the BNBP and preexisting models for admixture. We also analyzed the asymptotics of our new priors, derived MCMC procedures for posterior inference, and demonstrated the effectiveness of our models in the domains of image segmentation and document analysis. There are many other problem domains in which latent vectors of counts provide a natural modeling framework and where we believe that the HBNBP can prove useful. These include the computer vision task of multiple object recognition, where one aims to discover which and how many objects are present in a given image (Titsias, 2008), and the problem of modeling copy number variation in genomic regions, where one seeks to infer the underlying events responsible for large repetitions or deletions in segments of DNA (H. Chen, Xing, and Zhang, 2011).

4.A Connections

In Section 4.4 we noted that both the beta-negative binomial process (BNBP) and the gamma Poisson likelihood process (ΓPLP) provide nonparametric models for the count vectors arising in admixture models. In this section, we will elucidate some of the deeper connections between these two stochastic processes.

Table 4.7: A comparison of two Bayesian nonparametric constructions of clusterings such that the clusters have conditionally independent, random sizes; hence the dataset size itself is random. PP indicates a Poisson point process draw with the given intensity.

Beta negative binomial process:
  ν(dψ, db) = γθ b^{−1} (1 − b)^{θ−1} db H(dψ)
  (ψ_k, b_k) ∼ PP(ν(dψ, db))
  B = ∑_k b_k δ_{ψ_k}
  λ_k ∼indep Gamma(r, (1 − b_k)/b_k)
  i_k ∼indep Poisson(λ_k)

Gamma Poisson likelihood process:
  ν(dψ, dg̃) = θ g̃^{−1} e^{−c g̃} dg̃ H(dψ)
  (ψ_k, g̃_k) ∼ PP(ν(dψ, dg̃))
  G̃ = ∑_k g̃_k δ_{ψ_k}
  i_k ∼indep Poisson(g̃_k)

We will see that understanding these connections can not only inspire new stochastic process constructions but also lead to novel inference algorithms. We are motivated by Table 4.7, which indicates a strong parallel between the BNBP and ΓPLP constructions for clusterings where the size of each cluster is independent and random conditioned on some underlying process. The former requires an additional random stage consisting of a draw from a gamma distribution. Here, we use the representation of the negative binomial distribution, i ∼ NegBin(r, b), as a gamma mixture of Poisson distributions: b̃ ∼ Gamma(r, (1 − b)/b) and i ∼ Poisson(b̃). However, this table mostly highlights the parallel on the level of the likelihood process and therefore on the level of classic, one-dimensional distributions. The relations between such distributions are well-studied. Noting that many classic, one-dimensional distributions are easily obtained from each other by a simple change of variables, we aim to find new, analogous transformations in the stochastic process setting. In particular, all of our results in this section, which apply to nonparametric Bayesian priors derived from Poisson point processes, have direct analogues in the setting of one-dimensional distributions. We start by reviewing these known distributional relations. First, consider a beta distributed random variable x ∼ Beta(a, b). Then the variable x/(1 − x) has a beta prime distribution with parameters a and b; specifically, β′(a, b) denotes the beta prime distribution with density

β′(z | a, b) = [Γ(a + b)/(Γ(a)Γ(b))] z^{a−1} (1 + z)^{−a−b}.

The beta prime distribution can alternatively be derived from a gamma distribution. Namely, if x ∼ Gamma(a, c) and y ∼ Gamma(b, c) are independent, then x/y ∼ β′(a, b). This connection is not the only one between the beta and gamma distributions though. Let

x ∼ Gamma(a, c),   y ∼ Gamma(b, c).   (4.11)

Then

x/(x + y) ∼ Beta(a, b).   (4.12)
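These one-dimensional relations are easy to confirm by Monte Carlo. The following check, included here purely for illustration, relies on scipy's betaprime distribution, whose density is exactly the β′ density displayed above; the rate-versus-scale convention for the gamma draws is immaterial because the identities are invariant to a common scale.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    a, b, c, n = 2.5, 4.0, 1.3, 200_000

    # x ~ Beta(a, b)  =>  x / (1 - x) ~ beta prime(a, b).
    x = rng.beta(a, b, size=n)
    print(stats.kstest(x / (1 - x), stats.betaprime(a, b).cdf).statistic)

    # x ~ Gamma(a, c), y ~ Gamma(b, c) independent  =>  x / y ~ beta prime(a, b)
    # and x / (x + y) ~ Beta(a, b), as in Eqs. (4.11)-(4.12).
    x = rng.gamma(a, scale=1.0 / c, size=n)
    y = rng.gamma(b, scale=1.0 / c, size=n)
    print(stats.kstest(x / y, stats.betaprime(a, b).cdf).statistic)     # small KS statistic
    print(stats.kstest(x / (x + y), stats.beta(a, b).cdf).statistic)    # small KS statistic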

In the rest of this section, we present similar results but now for the process case—the beta process, gamma process, and a new process we call the beta prime process. The proofs of these results appear in Appendix 4.B. We start by defining a new completely random measure with nonnegative, real-valued feature weights. First, we note that, as for the processes defined in Section 4.2, there is no deterministic measure. Second, we specify that the fixed atoms have distribution

η_l ∼indep β′(θγρ_l, θ(1 − γρ_l))

at locations (u_l). Here, θ > 0, γ > 0, (ρ_l)_{l=1}^∞, and (u_l) are parameters. As usual, while the number of fixed atoms L may be countably infinite, it is typically finite. Finally, the ordinary component has Poisson process intensity H_ord × ν, where

ν(db̃) = γθ b̃^{−1} (1 + b̃)^{−θ} db̃,   (4.13)

which we note is sigma-finite with finite mean, guaranteeing that the number of atoms generated from the ordinary component will be countably infinite with finite sum. We abbreviate by defining H = ∑_{l=1}^L ρ_l δ_{u_l} + H_ord and say that the resulting CRM B̃ ≜ ∑_k b̃_k δ_{ψ_k} is a draw from a beta prime process (BPP) with base distribution H: B̃ ∼ BPP(θ, γ, H). The name “beta prime process” reflects the fact that the underlying intensity is an improper beta prime distribution as well as the beta prime distribution of the fixed atoms. With this definition in hand, we can find the stochastic process analogues of the distributional results above (with proofs in Appendix 4.B). Just as a beta prime distribution can be derived from a beta random variable, we have the following result that a similar transformation of the atom weights of a beta process yields a beta prime process.

Proposition 4.A.1. Suppose B = ∑_k b_k δ_{ψ_k} ∼ BP(θ, γ, H). Then ∑_k [b_k/(1 − b_k)] δ_{ψ_k} ∼ BPP(θ, γ, H).

Further, just as a beta prime random variable can be derived as the ratio of gamma random variables, we find that the atoms of the beta prime process can be constructed by taking ratios of gamma random variables and the atoms of a gamma process.

Proposition 4.A.2. Suppose G̃ = ∑_k g̃_k δ_{ψ_k} ∼ ΓP(γθ, c, H) and τ_k ∼ Gamma(θ(1 − γH({ψ_k})), c) independently for each k. Then ∑_k [g̃_k/τ_k] δ_{ψ_k} ∼ BPP(θ, γ, H).

And, finally, the analogue to constructing a beta random variable from two gamma random variables is the construction of a beta process from a gamma process and an infinite vector of independent gamma random variables.

Proposition 4.A.3. Suppose G̃ = ∑_k g̃_k δ_{ψ_k} ∼ ΓP(γθ, c, H) and τ_k ∼ Gamma(θ(1 − γH({ψ_k})), c) independently for each k. Then ∑_k [g̃_k/(τ_k + g̃_k)] δ_{ψ_k} ∼ BP(θ, γ, H).

The key to the manipulations above is the Poisson process framework of the ordinary component, with the BPP providing a convenient stepping stone between the BP and ΓP. We discover, in particular, that the BP can be derived from the ΓP, elucidating a new parallel, at the prior level, between the BNBP (which we see may be thought of as a ΓPLP augmented with two additional Gamma draws per atom) and the ΓPLP. Moreover, Proposition 4.A.3 reduces sampling from a BP to sampling from a ΓP, thus allowing us to leverage any sampler for the ΓP in carrying out BNBP and HBNBP posterior inference. Inference algorithms built upon existing mature and efficient ΓP samplers (see, e.g., Thibaux, 2008) could provide promising alternatives to the methods derived in Section 4.7.
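At the level of a single fixed atom, Proposition 4.A.3 can be checked directly; the sketch below, written here with arbitrary parameter values, transforms two gamma draws into a beta process atom weight and compares the result against the target Beta distribution. Simulating the full ordinary component would additionally require a gamma process sampler, which is not attempted here.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    theta, gamma_mass, c, rho = 4.0, 0.8, 2.0, 0.3   # arbitrary atom parameters
    n = 200_000

    # Fixed-atom case of Proposition 4.A.3: with a common rate c,
    # g ~ Gamma(theta*gamma*rho, c) and tau ~ Gamma(theta*(1 - gamma*rho), c)
    # give g / (tau + g) ~ Beta(theta*gamma*rho, theta*(1 - gamma*rho)),
    # i.e. exactly the beta process fixed-atom weight distribution.
    g = rng.gamma(theta * gamma_mass * rho, scale=1.0 / c, size=n)
    tau = rng.gamma(theta * (1.0 - gamma_mass * rho), scale=1.0 / c, size=n)
    w = g / (tau + g)

    target = stats.beta(theta * gamma_mass * rho, theta * (1.0 - gamma_mass * rho))
    print(stats.kstest(w, target.cdf).statistic)     # should be close to zero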

4.B Proofs for Appendix 4.A

Proof of Proposition 4.A.1: First, consider the ordinary component of a beta process. The Mapping Theorem of Kingman (1993) tells us that if the collection of tuples (ψ_k, b_k) comes from a Poisson process with intensity H_ord × ν_beta, where ν_beta is the beta process intensity of Eq. (4.2), then the collection of tuples (ψ_k, b_k/(1 − b_k)) is a draw from a Poisson process with intensity H_ord × ν, where we apply a change of variables to find

ν(db̃) = γθ [b̃/(1 + b̃)]^{−1} [1 − b̃/(1 + b̃)]^{θ−1} (1 + b̃)^{−2} db̃
       = γθ b̃^{−1} (1 + b̃)^{−θ} db̃,

which matches Eq. (4.13). For any particular atom where b_k ∼ Beta(θγρ_k, θ(1 − γρ_k)) and ρ_k = H({ψ_k}) > 0, we simply quote the well-known, one-dimensional change of variables b_k/(1 − b_k) ∼ β′(θγρ_k, θ(1 − γρ_k)). Since there is no deterministic component, we have considered all components of the completely random measure.

Proof of Proposition 4.A.2: We again start with the ordinary component of a completely random measure. In particular, we assume the collection of tuples (ψ_k, g̃_k) is generated according to a Poisson process with intensity H_ord × ν_gamma, where ν_gamma is the gamma process intensity of Eq. (4.4). Consider a random variable τ_k ∼ Gamma(θ, c) associated with each such tuple. Then 1/τ_k ∼ IG(θ, c). We consider a marked Poisson process with mark b̃_k ≜ g̃_k/τ_k at tuple (ψ_k, g̃_k) of the original process. By the scaling property of the inverse gamma distribution, we note b̃_k ∼ IG(θ, c g̃_k) given g̃_k. So the Marking Theorem (Kingman, 1993) implies that the collection of tuples (ψ_k, b̃_k) is itself a draw from a Poisson point process with intensity H_ord × ν, where

ν(db̃) = db̃ ∫ p(b̃ | θ, c, g̃) ν(dg̃)
      = db̃ ∫ (1/Γ(θ)) (c g̃)^θ b̃^{−θ−1} exp(−c g̃/b̃) · γθ g̃^{−1} exp(−c g̃) dg̃
      = γθ c^θ b̃^{−θ−1} db̃ (1/Γ(θ)) ∫ g̃^{θ−1} exp(−g̃ c(1 + b̃)/b̃) dg̃
      = γθ c^θ b̃^{−θ−1} (1/Γ(θ)) Γ(θ) [b̃/(c(1 + b̃))]^θ db̃
      = γθ b̃^{−1} (1 + b̃)^{−θ} db̃,

which matches the beta prime process ordinary component intensity of Eq. (4.13). For any particular atom of the gamma process, g̃_k ∼ Gamma(θγρ_k, c) with ρ_k = H({ψ_k}) > 0, we have τ_k ∼ Gamma(θ(1 − γρ_k), c) by construction. Then it is well known that g̃_k/τ_k has the β′(θγρ_k, θ(1 − γρ_k)) distribution, as desired. There is no deterministic component of the gamma process.

Proof of Proposition 4.A.3: Proposition 4.A.3 follows from Proposition 4.A.2 once we reverse the relationship established in Proposition 4.A.1. For completeness, we provide a more direct, self-contained proof paralleling those of Propositions 4.A.1 and 4.A.2 above. We begin with the ordinary component of the gamma process, so that the collection of tuples (ψ_k, g̃_k) is generated according to a Poisson process with intensity H_ord × ν_gamma, where ν_gamma is the gamma process intensity of Eq. (4.4). The Marking Theorem (Kingman, 1993) tells us that the marked Poisson process with points (ψ_k, g̃_k, τ_k) has intensity H_ord × ν, where

ν(dg̃, dτ) = γθ g̃^{−1} e^{−c g̃} · (Γ(θ))^{−1} c^θ τ^{θ−1} exp(−c τ) dg̃ dτ.

Now consider the change of variables u = g̃/(g̃ + τ), v = g̃ + τ. The reverse transformation is g̃ = uv, τ = (1 − u)v, with Jacobian v. Then the Poisson point process with points (ψ_k, u_k, v_k) has intensity H_ord × ν, where

ν(du, dv) = (Γ(θ))^{−1} γθ c^θ u^{−1} v^{−1} (1 − u)^{θ−1} v^{θ−1} e^{−cv} · v du dv.

So the Poisson point process with points (ψ_k, u_k) has intensity H_ord × ν, with

ν(du) = ∫_v ν(du, dv)
      = (Γ(θ))^{−1} γθ c^θ u^{−1} (1 − u)^{θ−1} du ∫_v v^{θ−1} e^{−cv} dv
      = (Γ(θ))^{−1} γθ c^θ u^{−1} (1 − u)^{θ−1} Γ(θ) c^{−θ} du
      = γθ u^{−1} (1 − u)^{θ−1} du,

which is the known beta process intensity. In the discrete case with H({ψ_k}) = ρ_k > 0, we have by construction

g̃_k ∼ Gamma(θγρ_k, c)   and   τ_k ∼ Gamma(θ(1 − γρ_k), c).

From classic finite distributional results, we have

g̃_k/(τ_k + g̃_k) ∼ Beta(θγρ_k, θ(1 − γρ_k)),

exactly as in the case of the beta process. As the gamma process and beta process each have no deterministic components, this completes the proof.

4.C Full results for Section 4.5

In order to fill in Table 4.1, we start by briefly establishing the results for the expected number of clusters of size j for the DP and PYP; the results for the expected total number of clusters are cited in the main text. We then move on to full results for the BNBP and 3BNBP. Proofs for all results in this section appear in Appendix 4.D. Theorem 4.C.1. Assume that the concentration parameter for the DP satisfies θ > 0. Then the expected number of data clusters of size j, Φj(N), has asymptotic growth

Φ_j(N) ∼ θ j^{−1},   N → ∞.

Theorem 4.C.2. Assume that the discount parameter for the PYP satisfies α ∈ (0, 1) and the concentration parameter satisfies θ > 1 − α. Then the expected number of data clusters of size j, Φ_j(N), has asymptotic growth

Φ_j(N) ∼ [Γ(θ + 1)/(Γ(1 − α)Γ(θ + α))] [Γ(j − α)/Γ(j + 1)] N^α,   N → ∞.

Next we establish how the expected number of data points, ξ(r), grows asymptotically with r in the BNBP case (in Lemma 4.C.4) and the 3BNBP case (in Lemma 4.C.5). We begin by showing that the expected number of data points is infinite for the concentration parameter range θ ≤ 1 − α in both the BNBP (α = 0) and 3BNBP models.

Lemma 4.C.3. Assume that the discount parameter for the three-parameter beta process satisfies α ∈ [0, 1) (the beta process is the special case when α = 0), the concentration parameter satisfies θ ≤ 1 − α, and the mass parameter satisfies γ > 0. Then the expected number of data points, ξ(r) = E[∑_k i_k], from a BNBP or 3BNBP, as appropriate, is infinite.

Lemma 4.C.4. Assume that the concentration parameter for the beta process satisfies θ > 1 and the mass parameter satisfies γ > 0. Then the expected number of data points ξ(r) = E[∑_k i_k] from a BNBP has asymptotic growth

ξ(r) ∼ γ [θ/(θ − 1)] r,   r → ∞.

Lemma 4.C.5. Assume that a three-parameter beta process has discount parameter α ∈ (0, 1) and concentration parameter θ > 1 − α. Then the expected number of data points ξ(r) = E[∑_k i_k] from a 3BNBP has asymptotic growth

ξ(r) ∼ γ [θ/(θ + α − 1)] r,   r → ∞.

Next, we establish how the expected number of clusters, Φ(r), grows asymptotically as r → ∞ in the BNBP case (in Lemma 4.C.6) and in the 3BNBP case (in Lemma 4.C.7).

Lemma 4.C.6. Let θ > 0. Then the expected number of clusters Φ(r) = E[∑_k 1{i_k > 0}] from a BNBP has asymptotic growth

Φ(r) ∼ γθ log r,   r → ∞.

Lemma 4.C.7. Consider a three-parameter beta process. Let the discount parameter satisfy α > 0 and the concentration parameter satisfy θ > −α. Then the number of clusters K(r) = ∑_k 1{i_k > 0} from a 3BNBP has almost sure asymptotic growth

K(r) ∼ (γ/α) [Γ(θ + 1)/Γ(θ + α)] r^α   almost surely,   r → ∞.

We are also interested in how the expected number of clusters of size j, Φ_j(r), grows as r → ∞. To that end, we establish this asymptotic growth in the BNBP case in Lemma 4.C.8 and in the 3BNBP case in Lemma 4.C.9 below.

Lemma 4.C.8. Let θ > 0. Then the expected number of clusters of size j, Φ_j(r) = E[∑_k 1{i_k = j}], from a BNBP has asymptotic growth

Φ_j(r) ∼ γθ j^{−1},   r → ∞.

That is, the number is asymptotically constant in r.

Lemma 4.C.9. Let θ > −α and α ∈ (0, 1). Then the expected number of clusters of size j, Φ_j(r) = E[∑_k 1{i_k = j}], from a 3BNBP has asymptotic growth

Φ_j(r) ∼ γ [Γ(1 + θ)/(Γ(1 − α)Γ(θ + α))] [Γ(j − α)/Γ(j + 1)] r^α,   r → ∞.

Finally, we wish to combine these results to establish asymptotic results for the diversity, i.e., the expected number of clusters (or clusters of size j) as the expected number of data points varies. We find the asymptotic growth in the number of clusters for the BNBP in Theorem 4.C.10 and for the 3BNBP in Theorem 4.C.11. We find the asymptotic growth in the number of clusters of size j for the BNBP (in fact, the result has already been shown in Lemma 4.C.8) and for the 3BNBP in Theorem 4.C.12.

Theorem 4.C.10. Let θ > 1. Then the expected number of clusters Φ grows asymptotically as the log of the expected number of data points ξ:

Φ(r) ∼ γθ log(ξ(r)),   r → ∞.

Theorem 4.C.11. Let θ + α > 1 and α ∈ (0, 1). Then the number of clusters K grows asymptotically as a power of the expected number of data points ξ:

K(r) ∼ (γ^{1−α}/α) [Γ(θ + 1)/Γ(θ + α)] [(θ + α − 1)/θ]^α (ξ(r))^α   almost surely,   r → ∞.

Theorem 4.C.12. Let θ + α > 1 and α ∈ (0, 1). Then the expected number of clusters of size j, Φ_j, grows asymptotically as a power of the expected number of data points ξ:

Φ_j(r) ∼ γ^{1−α} [Γ(θ + 1)/(Γ(1 − α)Γ(θ + α))] [Γ(j − α)/Γ(j + 1)] [(θ + α − 1)/θ]^α (ξ(r))^α,   r → ∞.

4.D Proofs for Appendix 4.C

Proof of Theorem 4.C.1: When cluster proportions are generated according to a Dirichlet process and cluster belonging is generated according to draws from the resulting random measure, the joint distribution of (K_1(N), . . . , K_N(N)) is described by the Ewens sampling formula, which appears as Eq. 2.9 in (Watterson, 1974). It follows that Eq. 2.22 in (Watterson, 1974) gives Φ_j(N) = E[K_j(N)]:

Φ_j(N) = (θ/j) · C(θ + N − j − 1, N − j) · [C(θ + N − 1, N)]^{−1},

where C(a, b) denotes the binomial coefficient. Therefore,

Φ_j(N) = (θ/j) · [Γ(θ + N − j)/(Γ(N − j + 1)Γ(θ))] · [Γ(N + 1)Γ(θ)/Γ(N + θ)]
        = (θ/j) · [Γ(N + θ − j)/Γ(N + θ)] · [Γ(N + 1)/Γ(N + 1 − j)]
        ∼ (θ/j) · (N + θ)^{−j} · (N + 1)^{j},   N → ∞

∼ θ/j,   N → ∞,

where the asymptotics for the ratios of gamma functions follow from Tricomi and Erdélyi (1951).
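As an aside, the limiting value θ/j is easy to corroborate numerically: partitions drawn from a Chinese restaurant process with concentration θ have the Ewens distribution, so averaging the number of size-j clusters over simulated partitions should approach θ/j for large N. The sketch below, with parameter values chosen by us, is illustrative only and plays no role in the proof.

    import numpy as np

    rng = np.random.default_rng(6)
    theta, N, j, trials = 2.0, 2000, 3, 200

    def crp_cluster_sizes(N, theta):
        # Chinese restaurant process: customer n joins an existing cluster with
        # probability size/(n + theta) or starts a new one with probability theta/(n + theta).
        sizes = []
        for n in range(N):
            probs = np.array(sizes + [theta], dtype=float)
            k = rng.choice(len(probs), p=probs / probs.sum())
            if k == len(sizes):
                sizes.append(1)
            else:
                sizes[k] += 1
        return np.array(sizes)

    est = np.mean([(crp_cluster_sizes(N, theta) == j).sum() for _ in range(trials)])
    print(est, theta / j)   # rough empirical Phi_j(N) versus the asymptotic value theta/j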

Proof of Theorem 4.C.2: Pitman (2006) establishes that, for the PYP with parameters θ and α given in the result statement, we have Φ(N) ∼ [Γ(θ + 1)/(α Γ(θ + α))] N^α as N → ∞. Note that Φ(N) is in the form of Eq. 48 on p. 167 of (Gnedin, Hansen, and Pitman, 2007). The desired result follows by applying Eq. 51 on p. 167 of (Gnedin, Hansen, and Pitman, 2007).

Proof of Lemma 4.C.3: In this case, we have

E[∑_k i_k] = E[E[∑_k i_k | b_·]]   (by the tower property)
           = E[∑_k E[i_k | b_·]]   (by monotonicity)
           = E[∑_k b_k r/(1 − b_k)]   (using the mean of the negative binomial distribution)
           = ∫_0^1 [b r/(1 − b)] ν(db)   (by Campbell’s Theorem (Kingman, 1993))
           = r [Γ(1 + θ)/(Γ(1 − α)Γ(θ + α))] ∫_0^1 b^{−α} (1 − b)^{θ+α−2} db.

The final line is finite iff 1 − α > 0 and θ + α − 1 > 0. Equivalently, the final line is finite iff α < 1 and θ > 1 − α.

Proof of Lemma 4.C.4: Let B = ∑_k b_k δ_{ψ_k} be beta process distributed, and let i_k ∼ NegBin(r, b_k) independently across k. By the Marking theorem (Kingman, 1993), the Poisson process {(ψ_k, b_k, i_k)} has intensity

ν(dψ, db, i) = γθ b^{−1} (1 − b)^{θ−1} C(i + r − 1, i) (1 − b)^r b^i db H_ord(dψ).   (4.14)

So the Poisson process {i_k} has intensity

ν(i) = γθ [Γ(i + r)/(Γ(i + 1)Γ(r))] [Γ(i)Γ(r + θ)/Γ(i + r + θ)].

Thus, by Campbell’s theorem (Kingman, 1993),

E[∑_k i_k] = ∑_{i=1}^∞ i ν(i) = γθ [Γ(r + θ)/Γ(r)] ∑_{i=1}^∞ Γ(i + r)/Γ(i + r + θ).

To evaluate the sum ∑_{i=1}^∞ Γ(i + r)/Γ(i + r + θ), we appeal to a result from Tricomi and Erdélyi (1951):

Γ(x + a)/Γ(x + b) = x^{a−b} [1 + (a − b)(a + b − 1)/(2x) + O(x^{−2})],   x → ∞.   (4.15)

In particular,

Γ(i + r)/Γ(i + r + θ) ≤ (i + r)^{−θ} [1 − θ(θ − 1)/(2(i + r)) + C(i + r)^{−2}]   for some constant C

and

Γ(i + r)/Γ(i + r + θ) ≥ (i + r)^{−θ} [1 − θ(θ − 1)/(2(i + r)) − C′(i + r)^{−2}]   for some constant C′.

Before proceeding, we establish for a > 1,

∑_{i=1}^∞ (i + r)^{−a} ≤ ∫_{x=0}^∞ (x + r)^{−a} dx = (a − 1)^{−1} r^{1−a}

and

∑_{i=1}^∞ (i + r)^{−a} ≥ ∫_{x=1}^∞ (x + r)^{−a} dx = (a − 1)^{−1} (r + 1)^{1−a}.

So

∑_{i=1}^∞ Γ(i + r)/Γ(i + r + θ) ≤ (θ − 1)^{−1} r^{1−θ} − [(θ − 1)/2] (r + 1)^{−θ} + C(θ + 1)^{−1} r^{−θ−1}

and

∑_{i=1}^∞ Γ(i + r)/Γ(i + r + θ) ≥ (θ − 1)^{−1} (r + 1)^{1−θ} − [(θ − 1)/2] r^{−θ} − C(θ + 1)^{−1} (r + 1)^{−θ−1}.

Since, for θ > 1, we have

r^{1−θ}/(r + 1)^{1−θ} → 1,   r → ∞,   (4.16)

it follows that

∑_{i=1}^∞ Γ(i + r)/Γ(i + r + θ) ∼ (θ − 1)^{−1} r^{1−θ}.   (4.17)

From Eq. (4.15), we also have Γ(r + θ)/Γ(r) ∼ r^θ as r → ∞. So we conclude that

E[∑_k i_k] ∼ γ [θ/(θ − 1)] r,   r → ∞,

as desired.

Proof of Lemma 4.C.5: The proof proceeds as above. In this case, we have that the Poisson process {(ψ_k, b_k, i_k)} has intensity

ν(dψ, db, i) = γ [Γ(1 + θ)/(Γ(1 − α)Γ(θ + α))] b^{−1−α} (1 − b)^{θ+α−1} [Γ(i + r)/(Γ(i + 1)Γ(r))] (1 − b)^r b^i db H(dψ).

So the Poisson process {i_k} has intensity

ν(i) = γ [Γ(1 + θ)/(Γ(1 − α)Γ(θ + α))] [Γ(i + r)/(Γ(i + 1)Γ(r))] [Γ(i − α)Γ(r + θ + α)/Γ(i + r + θ)].

By Campbell’s theorem,

E[∑_k i_k] = ∑_{i=1}^∞ i ν(i) = γ [Γ(1 + θ)/(Γ(1 − α)Γ(θ + α))] [Γ(r + θ + α)/Γ(r)] ∑_{i=1}^∞ [Γ(i + r)/Γ(i + r + θ)] [Γ(i − α)/Γ(i)].

We will find the following inequalities, with i ≥ 1 and α ∈ (0, 1), useful (cf. Eq. 2.8 in Qi and Losonczi, 2010):

(i − α)^{−α} ≤ Γ(i − α)/Γ(i) ≤ (i − 1)^{−α}.   (4.18)

We will also find the following integrals useful. Let a > 1.

∑_{i=2}^∞ (i + r)^{−a} (i − α)^{−α} ≤ ∑_{i=2}^∞ (i + r)^{−a} (i − 1)^{−α}
                                   ≤ ∫_{x=0}^∞ (x + r)^{−a} x^{−α} dx
                                   = r^{−a−α+1} ∫_{y=0}^∞ (y + 1)^{−a} y^{−α} dy
                                   = r^{−a−α+1} Γ(1 − α)Γ(a + α − 1)/Γ(a).   (4.19)

Similarly,

∑_{i=2}^∞ (i + r)^{−a} (i − 1)^{−α} ≥ ∑_{i=2}^∞ (i + r)^{−a} (i − α)^{−α}
                                    ≥ ∫_{x=2}^∞ (x + r)^{−a} x^{−α} dx
                                    = ∫_{x=0}^∞ (x + r)^{−a} x^{−α} dx − ∫_0^2 (x + r)^{−a} x^{−α} dx
                                    ≥ r^{−a−α+1} Γ(1 − α)Γ(a + α − 1)/Γ(a) − r^{−a} (1 − α)^{−1} 2^{1−α}.   (4.20)

First, we consider an upper bound. To that end,

∑_{i=2}^∞ [Γ(i + r)/Γ(i + r + θ)] [Γ(i − α)/Γ(i)]
  ≤ ∑_{i=2}^∞ (i + r)^{−θ} [1 − θ(θ + 1)/(2(i + r)) + C(i + r)^{−2}] (i − 1)^{−α}   for some constant C
  ≤ r^{−θ−α+1} Γ(1 − α)Γ(θ + α − 1)/Γ(θ)
    − [θ(θ + 1)/2] [r^{−θ−α} Γ(1 − α)Γ(θ + 1 + α − 1)/Γ(θ + 1) − r^{−θ−1} (1 − α)^{−1} 2^{1−α}]
    + C r^{−θ−α−1} Γ(1 − α)Γ(θ + α + 1)/Γ(θ + 2).

For the lower bound,

∑_{i=2}^∞ [Γ(i + r)/Γ(i + r + θ)] [Γ(i − α)/Γ(i)]
  ≥ ∑_{i=2}^∞ (i + r)^{−θ} [1 − θ(θ + 1)/(2(i + r)) − C′(i + r)^{−2}] (i − α)^{−α}   for some constant C′
  ≥ r^{−θ−α+1} Γ(1 − α)Γ(θ + α − 1)/Γ(θ) − r^{−θ} (1 − α)^{−1} 2^{1−α}
    − [θ(θ + 1)/2] r^{−θ−α} Γ(1 − α)Γ(θ + α)/Γ(θ + 1)
    − C′ r^{−θ−α−1} Γ(1 − α)Γ(θ + α + 1)/Γ(θ + 2).

It follows from the two bounds above that

∑_{i=2}^∞ [Γ(i + r)/Γ(i + r + θ)] [Γ(i − α)/Γ(i)] ∼ [Γ(1 − α)Γ(θ + α − 1)/Γ(θ)] r^{−θ−α+1}.

Since

Γ(r + θ + α)/Γ(r) ∼ r^{θ+α},

it follows that

E[∑_k i_k] ∼ γ [Γ(1 + θ)/(Γ(1 − α)Γ(θ + α))] [Γ(1 − α)Γ(θ + α − 1)/Γ(θ)] r = γ [θ/(θ + α − 1)] r,

as was to be shown.

Proof of Lemma 4.C.6: Given an atom b_k of the beta process, the probability that the associated negative binomial count i_k is non-zero is 1 − (1 − b_k)^r. It follows that

E[∑_k 1{i_k > 0}] = E[E[∑_k 1{i_k > 0} | b_k]] = E[∑_k (1 − (1 − b_k)^r)] = ∫_b (1 − (1 − b)^r) ν_BP(db),

where ν_BP is the intensity of the beta process atoms {b_k}. For integer r, this integral was calculated by Broderick, Jordan, and Pitman (2012) to be ∼ γθ log(r). Note that, in applying the result of Broderick, Jordan, and Pitman (2012), we are using the form of the negative binomial distribution to reinterpret the desired expectation as the expected number of features represented in a beta-Bernoulli process with r draws from the same underlying base measure. Now consider general r > 1. Let r^(0) = ⌊r⌋ and r^(1) = ⌈r⌉. Then

[∫_b (1 − (1 − b)^{r^(0)}) ν_BP(db)] / [γθ log(r^(1))] ≤ [∫_b (1 − (1 − b)^r) ν_BP(db)] / [γθ log(r)] ≤ [∫_b (1 − (1 − b)^{r^(1)}) ν_BP(db)] / [γθ log(r^(0))]   (4.21)

by monotonicity. Moreover,

[∫_b (1 − (1 − b)^{r^(0)}) ν_BP(db)] / [γθ log(r^(1))] = [∫_b (1 − (1 − b)^{r^(0)}) ν_BP(db)] / [γθ log(r^(0))] · [γθ log(r^(0))] / [γθ log(r^(1))] → 1,   r → ∞.

Similarly,

[∫_b (1 − (1 − b)^{r^(1)}) ν_BP(db)] / [γθ log(r^(0))] → 1,   r → ∞,   and hence   [∫_b (1 − (1 − b)^r) ν_BP(db)] / [γθ log(r)] → 1,   r → ∞,

as was to be shown.

Proof of Lemma 4.C.7: By the discussion in the previous proposition, this result follows from the results in Broderick, Jordan, and Pitman (2012).

Proof of Lemma 4.C.8: Given an atom b_k of the beta process, the probability that the associated negative binomial count i_k is equal to j is NegBin(j | r, b_k). It follows that

E[∑_k 1{i_k = j}] = E[E[∑_k 1{i_k = j} | b_·]] = E[∑_k NegBin(j | r, b_k)] = ν(j) = γθ [Γ(j + r)/(Γ(j + 1)Γ(r))] [Γ(j)Γ(r + θ)/Γ(j + r + θ)],

as above. Now we use Γ(r + θ)/Γ(r) ∼ r^θ and Γ(j + r)/Γ(j + r + θ) ∼ r^{−θ} to obtain E[∑_k 1{i_k = j}] ∼ γθ j^{−1}.

Proof of Lemma 4.C.9: As in the BNBP case, we have

E[∑_k 1{i_k = j}] = ν(j) = γ [Γ(1 + θ)/(Γ(1 − α)Γ(θ + α))] [Γ(j + r)/(Γ(j + 1)Γ(r))] [Γ(j − α)Γ(r + θ + α)/Γ(j + r + θ)].

Now we use Γ(r + θ + α)/Γ(r) ∼ r^{θ+α} and Γ(j + r)/Γ(j + r + θ) ∼ r^{−θ} to obtain E[∑_k 1{i_k = j}] ∼ γ [Γ(1 + θ)/(Γ(1 − α)Γ(θ + α))] [Γ(j − α)/Γ(j + 1)] r^α.

Proof of Theorem 4.C.10: Assume θ > 1. We have from the previous discussion that lim_{r→∞} ξ(r)/[γ θ/(θ − 1) r] = 1. So

lim_{r→∞} [log(ξ(r)) − log(r)] = log[γ θ/(θ − 1)].

Hence lim_{r→∞} log(ξ(r))/log(r) = 1 since log(r) → ∞ as r → ∞. From Lemma 4.C.6, we also have lim_{r→∞} Φ(r)/[γθ log(r)] = 1. Finally, then,

lim_{r→∞} Φ(r)/[γθ log(ξ(r))] = 1.

Proof of Theorem 4.C.11: From above, we have

lim_{r→∞} ξ(r)/[γ θ/(θ + α − 1) r] = 1   and hence   lim_{r→∞} (ξ(r))^α/[(γ θ/(θ + α − 1))^α r^α] = 1.

From Lemma 4.C.7, we also have

lim_{r→∞} K(r)/[(γ/α) (Γ(θ + 1)/Γ(θ + α)) r^α] = 1   almost surely,

and hence

lim_{r→∞} (ξ(r))^α (γ/α) (Γ(θ + 1)/Γ(θ + α)) / [(γ θ/(θ + α − 1))^α K(r)] = 1   almost surely.

Proof of Theorem 4.C.12: As above, we have from Lemma 4.C.9 that

α Γ(1+θ) Γ(j−α) Φj(r) (ξ(r)) γ Γ(1−α)Γ(θ+α) Γ(j+1) lim = 1 and hence lim α = 1, r→∞ Γ(1+θ) Γ(j−α) α r→∞ γ θ Φ (r) γ Γ(1−α)Γ(θ+α) Γ(j+1) r θ+α−1 j yielding the desired result. 

4.E Conjugacy proofs

Full beta process and negative binomial process Theorem 4.3.3 in the main text is a corollary of Theorems 4.E.1 and 4.E.2 below. In par- ticular, Theorems 4.E.1 and 4.E.2 give us the form of the posterior process when we have a general CRM prior with a Poisson process intensity with finite mean. Choosing the partic- ular Poisson process intensity for the RBP and choosing the distributions of the prior fixed weights yields the result.

Finite Poisson process intensity

Theorem 4.E.1. Let Bprior be a discrete, completely random measure on [0, 1] with atom locations in [0, 1]. Suppose it has the following components.

The ordinary component is generated from a Poisson point process with intensity • ν(db) dψ such that ν is continuous and ν[0, 1] < . In particular, the weights are in the b axis, and the atom locations are in the ψ axis.∞

There are L fixed atoms at locations u1, . . . , uL [0, 1]. The weight of the lth fixed • ∈ atom is a random variable with distribution hl. CHAPTER 4. COMBINATORIAL CLUSTERING AND THE BNBP 121

There is no deterministic measure component. • Draw a negative binomial process I with shape parameter r and input measure Bprior. Let K K be the number of (nonzero) atoms of I. Let Π = (ik, sk) k=1 be the pairs of observed nonzero counts and corresponding atom locations. { } Then the posterior process for the input measure to the negative binomial process given I is a completely random measure Bpost with the following components. The ordinary component is generated from a Poisson point process with intensity • (1 b)rν(db) dψ. − There are three sets of fixed atoms. • 1. There are the old, repeated fixed atoms. If ul = sk for some k, there is a fixed atom at ul with weight density

−1 r ik c (1 b) b hl(db), or −

where cor is the normalizing constant:

1 r ik cor = (1 b) b hl(db). − Zb=0

2. There are the old, unrepeated fixed atoms. If ul / s1, . . . , sK , there is a fixed ∈ { } atom at ul with weight density

−1 r c (1 b) hl(db), ou −

where cor is the normalizing constant:

1 r cou = (1 b) hl(db). − Zb=0

3. There are the new fixed atoms. If sk / u1, . . . , uL , there is a fixed atom at sk with weight density ∈ { } c−1 (1 b)rbik ν(db), new − where cnew is the normalizing constant:

1 r ik cnew = (1 b) b ν(db). − Zb=0 There is no deterministic measure component. • CHAPTER 4. COMBINATORIAL CLUSTERING AND THE BNBP 122

Proof of Theorem 4.E.1: Our proof follows the proof of beta-Bernoulli process conjugacy of Kim (1999a). Let (M, ΣM) be the set of completely random measures on [0, 1] with weights in [0, 1] and its associated sigma algebra. Let (G, ΣG) be the set of completely random measures on [0, 1] with atom weights in 1, 2,... and its associated sigma algebra. For any { } sets M ΣM and G ΣG, let Pprior(M G) be the probability distribution induced on ∈ ∈ × such sets by the construction of the prior measure Bprior and the negative binomial process I. Let Q(M : G) be the probability distribution induced on measures in M by the proposed posterior distribution. Finally, let Pmarg(G) be the prior marginal distribution on counting measures in G. To prove the theorem, it is enough to show that, for any such sets M and G, we have

Pprior(M G) = Q(M : I) Pmarg(I). (4.22) × ZI∈G The remainder of the proof will proceed as follows. We start by introducing some further notation. Then we will note that it is enough to prove Eq. (4.22) for certain, restricted forms of the sets M and G. Next, we will in turn find the form of each of (1) the prior distribution Pprior, (2) the proposed posterior distribution Q, and (3) the marginal count process distribution Pmarg for our special sets of interest. Finally, we will show that we can integrate out the posterior with respect to the marginal in order to obtain the prior, as in Eq. (4.22). Start by noting that we can write Bprior as J L

Bprior(dψ) = ξjδvj (dψ) + ηlδul (dψ). (4.23) j=1 l=1 X X Here, J is the number of atoms in the ordinary component of Bprior. So the total number of atoms in Bprior is J + L, and the total number of atoms in the counting measure with parameter Bprior is K J + L. The atom locations of the ordinary component are vj , ≤ { } and the fixed atom locations are at ul . We will assume these location collections are each { } respectively in increasing order: v1 v2 vJ and u1 u2 uL. Note that the vj ≤ ≤ · · · ≤ ≤ · · · order is well-defined since the density of the vj is continuous. Together, we have that the full set of atoms of the counting measure is some subset of the disjoint union of the two types K J L of prior atoms: sk vj ul . The atom weight at the fixed ul location is ηl, { }k=1 ⊆ { }j=1 ∪ { }l=1 and the atom weight at the ordinary component location vj is ξj. Let λ = ν[0, 1], which we know to be finite by assumption. Then the number of atoms in the ordinary component is Poisson-distributed: J Poisson(λ). ∼ J The ξj j=1 are independent and identically distributed random variables with values in [0, 1]{ such} that each has density ν(db)/λ. Next, we note that instead of general sets M and G, we can restrict to sets of the form

Jˆ L 0 ˆ ˆ Jˆ M = J = J vj vˆj, ξj ξj ηl ηˆl . (4.24) { } ∩ { ≤ ≤ }j=1 ∩ { ≤ } j=1 l=1 \ \ CHAPTER 4. COMBINATORIAL CLUSTERING AND THE BNBP 123

0 G = K = 1 i1 = ˆi1, s1 sˆ1 . (4.25) { } ∩ { ≤ } ˆ That is, in the random measure Bprior case, we consider a set with a fixed number J of ˆ ordinary component atoms and with fixed upper boundsv ˆj, ξj, orη ˆl on, respectively, the location of the jth ordinary component atom, the weight of the jth ordinary component atom, and the weight of the lth fixed atom. In the counting measure I case, we can restrict to a single atom with location bounded bys ˆ1 and count equal to ˆi1 1, 2,... . With this notation and restriction in hand, we proceed to compute∈ { the prior,} marginal, and posterior so that we may check whether Eq. (4.22) holds. Prior. We first calculate the prior measure of set M 0. Recall that the number of atoms is Poisson-distributed: Jˆ ˆ λ −λ Pprior(J = J) = e . (4.26) Jˆ! Also, the locations of these atoms, given their number, are distributed as

ˆ ˆ J vˆ1 vˆ2 vˆ ˆ J ˆ ˆ J Pprior( vj vˆj J = J) = J! dψj . (4.27) { ≤ }| ···   j=1 ψ1=0 ψ2=ψ1 ψJˆ=ψJˆ−1 j=1 \ Z Z Z Y   ˆ The J! term results from the fact that the vj are, by construction, the order statistics of a collection of uniformly distributed random variables. Finally, the sizes of the atoms, given their location and number, have the distribution

ˆ ˆ ˆ J L J J ξj L ηˆl ˆ ˆ ν(db) Pprior ξj ξj ηl ηˆl J = J, vj vˆj = hl(db) .  { ≤ } ∩ { ≤ }| { ≤ }  λ  · j=1 l=1 j=1 j=1 b=0 "l=1 b=0 # \ \ \ Y Z Y Z     (4.28)

Together, Eqs. (4.26), (4.27), and (4.28) yield the of the set M 0 (Eq. (4.24)) describing the random measure Bprior. Next, we turn to the prior probability of the set G0 describing the counting measure I. In this case, we condition on a particular measure µ M 0. Now, in G0, each counting measure I has exactly one atom. This atom can occur either∈ at an atom in the ordinary component J L of µ, located at one of vj , or at a fixed atom of µ, located at one of ul . We take { }j=1 { }l=1 advantage of the fact that the ul are unique by assumption and that the vj are a.s. unique and distinct from the ul by the assumption that the distribution on locations is continuous. We also note that on the set s1 sˆ1 , we need only consider those atoms with locations at { ≤ } mosts ˆ1. Thus, we break into these two special cases as follows:

J ˆ ˆ 1 Pprior(K = 1, i1 = i1, s1 sˆ1 µ) = Pprior(K = 1, i1 = i1, s1 = vj µ) vj sˆ1 ≤ | j=1 | { ≤ } X CHAPTER 4. COMBINATORIAL CLUSTERING AND THE BNBP 124

L ˆ 1 + Pprior(K = 1, i1 = i1, s1 = ul µ) ul sˆ1 . | { ≤ } l=1 X The probability that the single nonzero count occurs at a particular atom is the probability that a nonzero count appears at this atom and zero counts appear at all other atoms. To express this probability, we first define a new function:

J 1{vj 6=s} 1{vj =s} Φ(J, L, v, ξ, η, i1, s) = [NegBin(0 r, ξj)] [NegBin(i1 r, ξj)] (j=1 | | ) Y L 1{ul6=s} 1{ul=s} [NegBin(0 r, ηj)] [NegBin(i1 r, ηl)] . · | | (l=1 ) Y Here, NegBin(x a, b) is the negative binomial density. A notable special case is NegBin(0 a, b) = (1 b)a. We can| write the single-atom probabilities with the Φ notation: | − ˆ Pprior(K = 1, i1 = i1, s1 = vj µ) = Φ(J, L, v, ξ, η, i1, vj) | ˆ Pprior(K = 1, i1 = i1, s1 = ul µ) = Φ(J, L, v, ξ, η, i1, ul). | We can combine the likelihood of the counting process I given the random measure Bprior with the prior of the random measure Bprior to find the joint prior probability of the set M 0 G0. If we use the following notation to express the sets over which we will integrate, × J J R(ˆv,J) , ψ : ψ [0, 1] , ψ1 ψJ ψ : ψj vˆj { ∈ ≤ · · · ≤ } ∩ j=1{ ≤ } \ r(T = (t1, . . . , tJ ),J) [0, t1] [0, tJ ], , × · · · × then we may write

0 0 0 Pprior(M G ) = Pprior(G B) dPprior(B) × 0 | ZB∈M Jˆ −λ = e 1 vj sˆ1  ˆ ˆ ˆ j=1 v∈R(ˆv,J),ξ∈r(ξ,J),η∈r(ηˆ,L) { ≤ } X Z Jˆ Jˆ L ˆ  Φ(J, L, v, ξ, η, i1, vj) dvj ν(dξ) hl(dηl) · ·   ·   ·  j=1 j=1 l=1 ! Y Y Y L      + 1 ul sˆ1 ˆ ˆ ˆ { ≤ } l=1 v∈R(ˆv,J),ξ∈r(ξ,J),η∈r(ηˆ,L) X Z CHAPTER 4. COMBINATORIAL CLUSTERING AND THE BNBP 125

Jˆ Jˆ L ˆ Φ(J, L, v, ξ, η,ˆi1, ul) dvj ν(dξ) hl(dηl) . · ·   ·   ·  j=1 j=1 l=1 ! Y Y Y       (4.29)  This equation completes our prior calculation for now. We will return to it when we evaluate Eq. (4.22) for sets M 0 and G0. Proposed posterior. Next we consider the proposed posterior distribution Q. Just as we calculated the probability of M 0 G0 under the measure induced by our prior generative × model, we can analogously calculate the quantity Q(M 0 : I) for some I G0 according to ∈ the definition of Q. In the theorem statement, we specified a construction of a completely random measure to induce the proposed posterior. In this case, the completely random measure has an ordinary component and a set of fixed atoms. Given the specific set G0 we are considering (Eq. (4.25)), the set of locations of the fixed atoms is u1, . . . , uL sˆ1 , where the union is not necessarily disjoint. So there are two cases we must{ examine:}∪{ either} the counting process atom is at the same location as a fixed atom of the prior random measure (ˆs1 = ul for some l 1,...,L ), ∈ { } or it is at a different location (ˆs1 / u1, . . . , uL ). ∈ { } First, we consider the case where the counting process atom locations ˆ1 is the same as that of a fixed atom of the prior random measure, say ul∗ . As before, the number of atoms in the ordinary component is Poisson-distributed with mean equal to the total Poisson point process mass 1 r λpost (1 b) ν(db). , − Zb=0 So we have (cf. Eq. (4.26))

λJˆ ˆ ˆ post −λpost Q(J = J : K = 1, i1 = i1, s1 = ul∗ ) = e . (4.30) Jˆ! Also, as in the case of Eq. (4.27), we can calculate the distribution of the locations of the ordinary component atoms:

ˆ ˆ J vˆ1 vˆ2 vˆ ˆ J ˆ ˆ ˆ J Q( vj vˆj J = J : K = 1, i1 = i1, s1 = ul∗ ) = J! dψj . { ≤ }| ···   j=1 ψ1=0 ψ2=ψ1 ψJˆ=ψJˆ−1 j=1 \ Z Z Z Y  (4.31)

And again, as in Eq. (4.28), the sizes of the atoms, given their location and number, have the distribution

J L Jˆ ˆ ˆ ˆ Q ξj ξj ηl ηˆl J = J, vj vˆj : K = 1, i1 = i1, s1 = ul∗  { ≤ } ∩ { ≤ }| { ≤ }  j=1 l=1 j=1 \ \ \   CHAPTER 4. COMBINATORIAL CLUSTERING AND THE BNBP 126

ˆ ηˆ ∗ ∗ J ξˆj L l 1{l=l } 1{l6=l } NegBin(0 r, b)ν(db) [NegBin(ˆi1 r, b)] [NegBin(0 r, b)] hl(db) = b=0 | | . | 1 1 ∗ 1 ∗  λpost  ˆ {l=l } {l6=l } j=1 b=0 "l=1 Rb=0[NegBin(i1 r, b)] [NegBin(0 r, b)] hl(db)# Y Z Y | |   R (4.32) Putting together Eqs. (4.30), (4.31), and (4.32), we can find the proposed measure of the set 0 0 M given I G for the cases ˆ1 = ul∗ : ∈ Jˆ J L 0 ˆ ˆ ˆ Q(M : I) = Q J = J, vj vˆj , ξj ξj ηl ηˆl : K = 1, i1 = i1, s1 = ul∗  { ≤ } { ≤ } ∩ { ≤ }  j=1 j=1 l=1 \ \ \   −1 −λpost ˆ = Cfixed,l∗ e Φ(J, L, v, ξ, η, i1, ul∗ ) (4.33) ˆ ˆ ˆ Zv∈R(ˆv,J),ξ∈r(ξ,J),η∈r(ηˆ,L) Jˆ Jˆ L

dvj ν(dξ) hl(dηl) , (4.34) ·   ·   · j=1 j=1 l=1 ! Y Y Y     where L 1 1{l=l∗} 1{l6=l∗} Cfixed,l∗ [NegBin(ˆi1 r, b)] [NegBin(0 r, b)] hl(db). , | | l=1 b=0 Y Z ∗ Second, we consider the cases ˆ1 / u1, . . . , uL . Thens ˆ1 = vj∗ for some j 1,...,J . ∗ ∈ { } ∈ { } Recall that vj∗ is the j th smallest element of v1, . . . , vJ . We proceed as above and start { } by noting that the number of atoms on either side of the location vj∗ is Poisson-distributed:

ˆ ˆ Q J = J : K = 1, i1 = i1, s1 = vj∗  j∗−1  (Jˆ−j∗) (λpostvj∗ ) (λpost(1 vj∗ )) = e−(λpostvj∗ ) − e−(λpost(1−vj∗ )). (4.35) (j∗ 1)! · (Jˆ j∗)! − −

Further, we have the usual distribution for the atom locations on either side of vj∗ :

Jˆ ˆ ˆ Q vj vˆj J = J : K = 1, i1 = i1, s1 = vj∗   j=1{ ≤ }| \ ∗  ∗ j −1  vˆ1 vˆ2 vˆj dψ = (j∗ 1)! j − ··· vj∗ ψ1=0 ψ2=ψ1 ψj∗ =ψj∗−1 j=1 ! Z Z Z Y ˆ vˆ ∗ vˆ J j +1 Jˆ dψ (Jˆ j∗)! j . (4.36) · − ···  1 vj∗  ψj∗+1=ˆvj∗ ψJˆ=ψJˆ−1 j=j∗+1 Z Z Y −   CHAPTER 4. COMBINATORIAL CLUSTERING AND THE BNBP 127

As usual, the third step identifies the conditional distribution of the atom weights:

J L Jˆ ˆ ˆ ˆ Q ξj ξj ηl ηˆl J = J, vj vˆj : K = 1, i1 = i1, s1 = vj∗ (4.37)  { ≤ } ∩ { ≤ }| { ≤ }  j=1 l=1 j=1 \ \ \ 1 ∗ {j=j } ∗  ξˆj 1{j6=j }  Jˆ ˆ L ηˆl b=0 NegBin(i1 r, b) [NegBin(0 r, b)] ν(db) NegBin(0 r, b)h (db) | | b=0 l = 1{j=j∗} 1 | .  1 h i 1{j6=j∗}  j=1 R ˆ "l=1 Rb=0 NegBin(0 r, b)hl(db)# b=0 NegBin(i1 r, b) [NegBin(0 r, b)] ν(db) | Y | |  Y  R h i  R (4.38) So, combining Eqs. (4.35), (4.36), and (4.38), we find that the proposed posterior distribution in the cases ˆ1 = vj∗ is

Jˆ J L 0 ˆ ˆ Q(M : I) = Q J = J, vj vˆj , ξj ξj ηl ηˆl : K = 1,N1 = n1,S1 = ξj∗  { ≤ } { ≤ } ∩ { ≤ }  j=1 j=1 l=1 \ \ \   −1 −λpost ˆ ˆ = Corde Φ(J, L, v, ξ, η, i1, vj∗ ) ˆ ˆ ˆ Zv∈R(ˆv,J),ξ∈r(ξ,J),η∈r(ηˆ,L) Jˆ Jˆ L

dvj ν(dξ) hl(dηl) , (4.39) ·   ·   · j=1 j=1 l=1 ! Y Y Y where     1 L 1 Cord NegBin(ˆi1 r, b)ν(db) NegBin(0 r, b)hl(db) . , | · | Zb=0  l=1 Zb=0 ! Y ∗ Putting together the casess ˆ1 = ul∗ for some l (Eq. (4.34)) ands ˆ1 / u1, . . . , uL (Eq. (4.39)), we obtain the full proposed posterior distribution: ∈ { }

Jˆ J L 0 ˆ ˆ ˆ Q(M : I) = Q J = J, vj vˆj , ξj ξj ηl ηˆl : K = 1, i1 = i1, s1 =s ˆ1  { ≤ } { ≤ } ∩ { ≤ }  j=1 j=1 l=1 \ \ \ L  1 −1 −λpost ˆ ˆ = sˆ1 = ul∗ Cfixed,l∗ e Φ(J, L, v, ξ, η, i1, ul∗ ) { } ˆ ˆ ˆ l∗=1 v∈R(ˆv,J),ξ∈r(ξ,J),η∈r(ηˆ,L) X Z Jˆ Jˆ L

dvj ν(dξ) hl(dηl) ·   ·   · j=1 j=1 l=1 ! Y Y Y     1 −1 −λpost ˆ ˆ + sˆ1 / u1, . . . , uL Corde Φ(J, L, v, ξ, η, i1, vj∗ ) { ∈ { }} ˆ ˆ ˆ Zv∈R(ˆv,J),ξ∈r(ξ,J),η∈r(ηˆ,L) Jˆ Jˆ L

dvj ν(dξ) hl(dηl) . (4.40) ·   ·   · j=1 j=1 l=1 ! Y Y Y     CHAPTER 4. COMBINATORIAL CLUSTERING AND THE BNBP 128

Counting process marginal. With the prior and proposed posterior in hand, it remains to calculate the marginal distribution of the counting process. Then we may integrate out the proposed posterior with respect to the counting process marginal in order to obtain the prior (Eq. (4.22)). Since we are focusing on counting process sets G0 of the form in Eq. (4.25), we aim to calculate ˆ Pmarg(K = 1, i1 = i1, s1 sˆ1). ≤ In our calculations above, we also worked with a set of prior measure µ M 0 and therefore worked with a set of locations for the ordinary component atoms. In∈ this case, we will need to calculate the probability of zero counts in an interval where the number and location of the ordinary component atoms is integrated out. Let I0 ψ be the counting process that includes exactly those counts at ordinary component atoms{ and} not the counts at fixed atoms; we can see, e.g., that I0 ψ I ψ at all ψ. Further, similar to Eq. (4.23), { } ≤ { } let Bord be the random measure composed only of those atoms in the ordinary component of Bprior: J

Bord = ξjδvj . j=1 X Then we are interested in the quantity:

1 0 r r E [ t (ψ1, ψ2),I t = 0 ] = E (1 Bord t ) = (1 E [1 (1 Bord t ) ]) , {∀ ∈ { } }  − { }  − − − { } t∈(Yψ1,ψ2) t∈(Yψ1,ψ2)   where the last equality follows from the independence of Bprior across increments. 0 r 0 Now define a new process B 1 (1 Bord) . This process has intensity ν , which can , − − be obtained by a change of variables from the Poisson process intensity ν of Bord. We will find it notationally useful to refer to ν0 though we do not calculate it here. Also, let B¯0 be 0 ¯0 0 the mean process of B : B (dψ) , E [B (dψ)]. With this notation in hand, we can write 1 0 E [ t (ψ1, ψ2),I t = 0 ] {∀ ∈ { } } ψ2 ψ2 1 = 1 B¯0 t = exp B¯0 t = exp b0 ν0(db0) − { } − t=ψ1 { } − t=ψ1 b=0 t∈(Yψ1,ψ2)  Z   Z Z   1 r = exp (ψ2 ψ1) (1 (1 b) ) ν(db) . − − − −  Zb=0  ∗ As usual, we consider two separate cases. First, suppose s1 = ul∗ for some l 1,...,L . Then using the result above we find ∈ { }

ˆ ˆ ∗ 0 Pmarg(K = 1, i1 = i1, s1 = ul∗ ) = Pmarg(I ul∗ = i1) Pmarg( l = l ,I ul = 0) Pmarg( t (0, 1),I t = 0) { } ∀ 6 { } ∀ ∈ { } CHAPTER 4. COMBINATORIAL CLUSTERING AND THE BNBP 129

L 1 1{l=l∗} 1{l6=l∗} = [NegBin(0 r, b)] NegBin(ˆi1 r, b) hl(db) | | l=1 Zb=0 ! Y 1 h i exp (1 0) (1 (1 b)r) ν(db) · − − − −  Zb=0  −λ+λpost = e Cfixed,l∗ . (4.41)

Next, suppose s1 / u1, . . . , uL . Then ∈ { } ˆ Pmarg(K = 1, i1 = i1, s1 / u1, . . . , uL ) ∈ { } 0 ˆ 0 = Pmarg( l, I(ul) = 0) Pmarg( ψ : I ψ = i1 and t (0, 1) ψ ,I t = 0) ∀ · ∃ { } ∀ ∈ \{ } { } 1 1 L 1 1 ˆ (λ λ ) ψ=0 b=0 NegBin(i1 r, b)ν(db) dψ −(λ−λpost) post | = NegBin(0 r, b)hl(db) e − | · 1! · 1 1 ∞  "l=1 b=0 # R R NegBin(i r, b)ν(db) dψ Y Z ψ=0 b=0 i=1 | L 1 1 R R P  −(λ−λpost) −λ+λpost = NegBin(0 r, b)hl(db) e NegBin(ˆi1 r, b)ν(db) = e Cord. | · · | "l=1 Zb=0 # Zb=0  Y (4.42)

Checking integration. The final step is to note that we may integrate out the proposed posterior in Eq. (4.40) with respect to the marginal described by Eqs. (4.41) and (4.42) to obtain the joint prior in Eq. (4.29). This integration is exactly the one we desired from Eq. (4.22) in the special case of sets of the form M 0 in Eq. (4.24) and G0 in Eq. (4.25), as was to be shown.

Infinite Poisson process intensity Theorem 4.E.2. Theorem 4.E.1 still applies when the intensity measure ν does not neces- sarily have a finite integral ν[0, 1] but satisfies the (weaker) condition

1 b ν(db) < . (4.43) ∞ Zb=0 Proof of Theorem 4.E.2: Note that Eq. (4.43) implies

 > 0, ν[, ) < . (4.44) ∀ ∞ ∞ The main idea behind the proof of Theorem 4.E.2 is to take advantage of the finiteness condition in Eq. (4.44) to construct a sequence of finite intensity measures tending to the true intensity measure of the process. We will use the known form of the posterior in the CHAPTER 4. COMBINATORIAL CLUSTERING AND THE BNBP 130

finite case from Theorem 4.E.1 to deduce the form of the posterior in the case where ν merely satisfies the weaker condition in Eq. (4.44). We therefore start by defining the sequence of (finite) measures νn by

νn(A) 1 b > 1/n ν(db), for all measurable A [0, 1]. (4.45) , { } ⊂ Zb∈A Further, we may generate a random measure Bprior,n as described by the prior in Theo- rem 4.E.1 with Poisson point process intensity νn. And we may generate a counting process In with parameters r and Bprior,n as described in Theorem 4.E.1. As before, let Pprior be the prior distribution on the prior random measure Bprior and the counting process I. Let Eprior denote the expectation with respect to this distribution. Further, let Pmarg represent the marginal distribution on the counting process from Pprior. And let Q(M : G) represent the proposed posterior distribution on sets M M given any ∈ set G ΣG. We use the same notation, but with n subscripts, to denote the case with finite ∈ intensity νn. Our proof will take advantage of Laplacian-style characterizations of distributions. In particular, we note that in order to prove Theorem 4.E.2, it is enough to show that, for arbitrary continuous and nonnegative functions f and g (i.e., f, g C+[0, 1]), we have ∈ 1 exp (g(ψ)B ψ + f(ψ)I ψ ) dQ(B : I) dPmarg(I) − { } { } B∈M I∈G  ψ=0  Z Z Z 1 = Eprior exp (g(ψ)B ψ + f(ψ)I ψ ) . (4.46) − { } { }   Zψ=0  By Lemma 4.E.3, we have the following limit for all f, g C+[0, 1] as n : ∈ → ∞ 1 Eprior,n exp (g(ψ)B ψ + f(ψ)I ψ )) − { } { }   ψ=0  Z 1 Eprior exp (g(ψ)B ψ + f(ψ)I ψ ) . → − { } { }   Zψ=0  Therefore, by Eq. (4.46) and the observation that Theorem 4.E.1 holds under the finite intensity νn, we see that it is enough to show that 1 exp (g(ψ)B ψ + f(ψ)I ψ ) dQn(B : I) dPmarg,n(I) − { } { } B∈Mn I∈G  ψ=0  Z Z Z 1 exp (g(ψ)B ψ + f(ψ)I ψ ) dQ(B : I) dPmarg(I), n . → − { } { } → ∞ ZB∈M ZI∈G  Zψ=0  (4.47) Define 1 Ψn(I) exp g(ψ)B ψ dQn(B : I) (4.48) , − { } ZB∈Mn  Zψ=0  CHAPTER 4. COMBINATORIAL CLUSTERING AND THE BNBP 131

1 Ψ(I) exp g(ψ)B ψ dQ(B : I). (4.49) , − { } ZB∈M  Zψ=0  By Lemma 4.E.4, we have

1 exp f(ψ)I ψ (Ψn(I) Ψ(I))dPmarg,n(I) 0. (4.50) − { } − → ZI∈G  Zψ=0  And Lemma 4.E.3 together with the fact that exp 1 f(ψ)I ψ Ψ(I) is a bounded − ψ=0 { } function of I yields n R o 1 exp f(ψ)I ψ Ψ(I)(dPmarg,n(I) dPmarg(I)) 0. (4.51) − { } − → ZI∈G  Zψ=0  Combining Eqs. (4.50) and (4.51) yields the desired limit in Eq. (4.47).

Lemma 4.E.3. Let Bprior,n be a completely random measure with a finite set of fixed atoms in [0, 1] and with the Poisson process intensity νn in Eq. (4.45), where ν satisfies Eq. (4.43). Let In be drawn as a negative binomial process with parameters r and Bprior,n. Similarly, let Bprior be a completely random measure with Poisson process intensity ν, and let I be drawn as a negative binomial process with parameters r and Bprior. Then

d (Bprior,n,In) (Bprior,I) → Proof of Lemma 4.E.3: It is enough to show that, for all f, g C+[0, 1], we have ∈ 1 Eprior,n exp (g(ψ)Bprior,n ψ + f(ψ)In ψ ) − { } { }   ψ=0  Z 1 Eprior exp (g(ψ)Bprior ψ + f(ψ)I ψ ) , n . → − { } { } → ∞   Zψ=0  ˆ We can construct a new completely random measure, Bn, by keeping only those jumps from Bprior (generated with intensity ν) that are either at the fixed atom locations or have ˆ d ˆ height at least 1/n. Then Bn = Bprior,n for Bprior,n generated with intensity νn. Let In be ˆ the counting process generated with parameters r and Bn. Then it is enough to show

1 ˆ ˆ Eprior exp (g(ψ)Bn ψ + f(ψ)In ψ ) − { } { }   ψ=0  Z 1 Eprior exp (g(ψ)Bprior ψ + f(ψ)I ψ ) , n . → − { } { } → ∞   Zψ=0  CHAPTER 4. COMBINATORIAL CLUSTERING AND THE BNBP 132

ˆ− ˆ Let Bn = Bprior Bn be the completely random measure consisting only of an ordinary − ˆ− component with jumps of size less than 1/n. Let In be a counting process with parameters ˆ− ˆ ˆ− r and Bn . Then, using the independence of Bn and Bn , we have 1 Eprior exp (g(ψ)Bprior ψ + f(ψ)I ψ ) − { } { }   ψ=0  Z 1 1 ˆ ˆ ˆ− ˆ− = Eprior exp (g(ψ)Bn ψ + f(ψ)In ψ ) Eprior exp (g(ψ)Bn ψ + f(ψ)In ψ ) . − { } { } · − { } { }   Zψ=0    Zψ=0  So it is enough to show that

1 ˆ− ˆ− Eprior exp (g(ψ)Bn ψ + f(ψ)In ψ ) 1, n . (4.52) − { } { } → → ∞   Zψ=0  In order to show Eq. (4.52) holds, we establish the following upper bounds:

1 ˆ− ˆ− Eprior exp (g(ψ)Bn ψ + f(ψ)In ψ ) 1, (4.53) − { } { } ≤   Zψ=0  and 1 ˆ− ˆ− ˆ− ˆ− (g(ψ)Bn ψ + f(ψ)In ψ ) (max g(ψ))Bn [0, 1] + (max f(ψ))In [0, 1]. { } { } ≤ ψ ψ Zψ=0 0 Henceforth we use the shorthand c , (maxψ g(ψ)) and c , (maxψ f(ψ)). These quanti- ties are finite by the assumptions on g and f. Choose  > 0. Further define the events ˆ− ˆ− AB B [0, 1] >  and AI I [0, 1] >  . , { n } , { n } By Chebyshev’s inequality,

ˆ− ˆ− P(AB,n) < E Bn [0, 1] / and P(AI,n) < E In [0, 1] /. h i h i Using these definitions, we can write

1 ˆ− ˆ− Eprior exp (g(ψ)Bn ψ + f(ψ)In ψ ) − { } { }   Zψ=0  ˆ− 0 ˆ− Eprior exp cBn [0, 1] c In [0, 1] ≥ − − h1 nC C ˆ− oi 0 ˆ− Eprior (AB,n AI,n) exp cBn [0, 1] c In [0, 1] ≥ ∩ − − h C C n 0 oi Pprior(AB,n AI,n) exp c c  . (4.54) ≥ ∩ · {− − } C C Now Pprior(AB,n AI,n) = 1 Pprior(AB,n AI,n). And ∩ − ∪ Pprior(AB,n AI,n) Pprior(AB,n) + Pprior(AI,n) ∪ ≤ CHAPTER 4. COMBINATORIAL CLUSTERING AND THE BNBP 133

−1 ˆ− ˆ−  E Bn [0, 1] + E In [0, 1] 0, n , ≤ → → ∞ n h i h io where the last line follows by noting

1/n ˆ− E Bn [0, 1] = bν(db) 0, n , b=0 → → ∞ h i Z since ν is continuous and 1 bν(db) < by assumption, and b=0 ∞ ∞R 1/n m + r 1 Iˆ−[0, 1] = Cb−1(1 b)θ−1 − (1 b)rbm db E n m m=1 b=0 − − h i X Z   where C is a constant in n (cf. Eq. (4.14)) 1/2 m + r 1 = C (2/n)m (˜b)m−1(1 (2/n)˜b)r+θ−1 − d˜b m m 0 − X Z   1/2 m + r 1 C2r+θ−1 (2/n)m ˜bm−1(1 ˜b)r+θ−1 − d˜b m ≤ m 0 − X Z   ∞ 1 m + r 1 2r+θ(1/n) C ˜bm−1(1 ˜b)r+θ−1 − d˜b m ≤ m=1 0 − X Z   = 2r+θ(1/n)C0,

where C0 is a constant in n (by Lemma 4.C.4). The final line goes to zero as n . C C → ∞ So Pprior(AB,n AI,n) 1 as n , and the bound in Eq. (4.54) yields: ∩ → → ∞ 1 ˆ− ˆ− 0 lim Eprior exp (g(ψ)Bn ψ + f(ψ)In ψ ) exp c c  . n→∞ − { } { } ≥ {− − }   Zψ=0  Since this result is true for every  > 0, we must have

1 ˆ− ˆ− lim Eprior exp (g(ψ)Bn ψ + f(ψ)In ψ ) 1. n→∞ − { } { } ≥   Zψ=0  Together with Eq. (4.53), this equation gives the desired result.

Lemma 4.E.4. For Φn and Φ defined in, respectively, Eqs. (4.48) and (4.49), we have the limit in Eq. (4.50):

1 exp f(ψ)I ψ (Ψn(I) Ψ(I))dPmarg,n(I) 0. (4.55) − { } − → ZI∈G  Zψ=0  CHAPTER 4. COMBINATORIAL CLUSTERING AND THE BNBP 134

Proof of Lemma 4.E.4: We start by choosing n large enough so that (1) the difference between the ordinary components in the truncated case and the non-truncated case are, in some sense, small enough and (2) the number of atoms in the truncated case is bounded with high probability. Under these two conditions, we will then show that Ψn(I) and Ψ(I) are sufficiently close in value by examining in turn each of the various types of atoms in the proposed posterior. Therefore, choose  > 0. First note that by the assumption of finite integration of ν (Eq. (4.43)) we can choose n0 such that for all n > n0 we have

1/n bν(db) < . (4.56) Zb=0

This choice implies the existence of n1 such that for all n > n1 and all i 1 we have Eq. (4.56) as well as ≥ 1/n bi(1 b)rν(db) < . (4.57) − Zb=0 Second, since I Pmarg,n approaches I Pmarg in distribution by Lemma 4.E.3, there 0 ∼ ∼ exist constants K and n2 such that the number of atoms Kn of In satisfies

0 Pmarg,n(Kn > K ) <  for all n > n2. (4.58)

0 ˜ Moreover, conditional on Kn K , there exists a constant i such that, under any Pmarg,n, all counts in I are bounded above≤ by ˜i with probability at least 1 . It remains to use these conditions to bound −

1 exp f(ψ)I ψ (Ψn(I) Ψ(I))dPmarg,n(I). − { } − ZI∈G  Zψ=0 

For instance, since Ψn(I) and Ψ(I) are both bounded between zero and one, we have that

1 exp f(ψ)I ψ (Ψn(I) Ψ(I))dPmarg,n(I) − { } − ZI∈G  Zψ=0 

2 + 2 + I∈G Ψn(I) Ψ(I) dPmarg,n(I). (4.59) 0 ≤ Kn≤K | − | Zatoms bounded by ˜i

Next, we need to bound the second term on the righthand side of Eq. (4.59). To that end, we break Ψn and Ψ into their three constituent parts: the fixed atoms from the prior, the new fixed atoms in the proposed posterior, and the ordinary component in the proposed posterior. For Ψn, we have

1 Ψn(I) = exp g(ψ)B ψ dQ(B : I) − { } ZB∈Mn  Zψ=0  CHAPTER 4. COMBINATORIAL CLUSTERING AND THE BNBP 135

L 1 = exp g(ψ)B ψ g(ul)B ul g(ψ)Bord ψ dQ(B : I) − { } − { } − { } B∈Mn l=1 ψ=0 Z  ψ:I{ψ}≥1X,ψ∈{ / u1,...,uL} X Z    = exp g(ψ)B ψ dQ(B : I)  B∈Mn {− { }}  ψ:I{ψ}≥1,ψY∈{ / u1,...,uL} Z  L  1 exp g(ul)B ul dQ(B : I) exp g(ψ)Bord ψ dQ(B : I) · {− { }} − { } "l=1 B∈Mn # B∈Mn ψ=0 Y Z Z  Z   by the independence of these components under Q(B : I)

1 −1 I{ψ} r = cnew,n exp g(ψ)b b (1 b) νn(db)  b=0 {− } −  ψ:I{ψ}≥1,ψY∈{ / u1,...,uL} Z  L 1 1  −g(ψ)b r exp g(ul)B ul dQ(B : I) exp 1 e (1 b) dψ νn(db) . · {− { }} − − − "l=1 ZB∈Mn #   Zb=0 Zψ=0  Y  The final factor results from Campbell’s theorem. The analogous formula holds for Ψ by removing the n subscripts. With the formulas for Ψn and Ψ in hand, we turn again to our desired bound. We follow Lemma 3 of Kim (1999a) in using the following fact: for x1, . . . , xM , y1, . . . , yM R and ∈ xm , ym 1 for all m, we have | | | | ≤ M M M

xm ym xm ym . − ≤ | − | m=1 m=1 m=1 Y Y X

In particular, we apply this inequality to transform the difference in Ψn and Ψ into separate differences in each component, where we note that the prior fixed atom component is shared and therefore disappears. First, for notational convenience, define

1 I{ψ} r Cn(I, ψ) , exp g(ψ)b b (1 b) νn(db) b=0 {− } − Z 1 C(I, ψ) exp g(ψ)b bI{ψ}(1 b)rν(db) , {− } − Zb=0 Then

I∈G Ψn(I) Ψ(I) dPmarg(I) 0 Kn≤K | − | Zatoms bounded by ˜i

−1 −1 I∈G [cnew,n(I ψ )] Cn(I, ψ) [cnew(I ψ )] C(I, ψ) 0 ≤ Kn≤K  { } − { }  Z ψ:I{ψ}≥1,ψ∈{ / u1,...,uL} atoms bounded by ˜i  X    CHAPTER 4. COMBINATORIAL CLUSTERING AND THE BNBP 136

1 1 1 1 −g(ψ)b r −g(ψ)b r + exp 1 e (1 b) dψ νn(db) exp 1 e (1 b) dψ ν(db) . − − − − − − −  Zb=0 Zψ=0   Zb=0 Zψ=0    (4.60)

From Eq. (4.57), we can conclude both that cnew,n(I ψ ) cnew(I ψ )  and that | { } − { } | ≤ Cn(I, ψ) C(I, ψ) . Also cnew,n(I ψ ) Cn(I, ψ) and likewise without the n subscript. |So − | ≤ { } ≥

−1 −1 [cnew,n(I ψ )] Cn(I, ψ) [cnew(I ψ )] C(I, ψ) { } −1 − { } −1 −1 −1 [cnew,n(I ψ )] Cn(I, ψ) [cnew,n(I ψ )] C(I, ψ) + [cnew,n(I ψ )] C(I, ψ) [cnew(I ψ )] C(I, ψ) ≤ { } −1 − { } −1 −1 { } − { } [cnew,n(I ψ )]  + C(I, ψ)[cnew,n(I ψ )] [cnew(I ψ )]  ≤ { } { } { } −1 −1 2 2 [cnew,n(I ψ )] 2 [cnew(I ψ )] + 2 . ≤ { } ≤ { } The difference in the two exponential terms in Eq. (4.60) is similarly at most . So for large enough n and hence small enough  we have

I∈G Ψn(I) Ψ(I) dPmarg(I) 0 Kn≤K | − | Zatoms bounded by ˜i 0 −1 2 K 2 cnew(˜i) + 2 +  ≤ h i Together with Eq. (4.59), this bound completes  the proof.

4.F Posterior inference details

Exact Gibbs slice sampler

We sample bd,k and ψk from their Gibbs conditionals as follows: Sample ψk. The conditional posterior of ψk given z·,· and x·,· is proportional to

D Nd I(zd,n=k) H(dψk) F (dxd,n ψk) . | d=1 n=1 Y Y This has a closed form when H is conjugate to F (ψk) and may otherwise be sampled using a generic univariate sampling procedure (e.g., random-walk Metropolis-Hastings or slice sampling). Sample bd,k. By beta-negative binomial conjugacy, the conditional posterior of bd,k given zd,. and b0,k is a beta distribution:

bd,k Beta(γdθdb0,k + Nd,k, θd(1 γdb0,k) + rd), ∼ − CHAPTER 4. COMBINATORIAL CLUSTERING AND THE BNBP 137

where Nd,k , n I(zd,n = k). Sample b0,k. To sample the shared beta process weights b0,k, we turn to the size-biased construction ofP the beta process introduced by Thibaux and Jordan, 2007:

∞ Cm

B0 = b0,m,iδψm,i,· , m=0 i=1 X X where

indep θ0γ0 indep iid Cm Poisson , b0,m,i Beta(1, θ0 + m), and ψm,i,· H. ∼ θ + m ∼ ∼  0  If we order the atoms by the rounds in which they were drawn, then the kth atom overall was drawn in round mk, where

m mk , min m : Cj k . ( j=0 ≥ ) X ∞ Conditional on the round indices (mk)k=1, we have ∞

B0 = b0,kδψk k=1 X for indep iid b0,k Beta(1, θ0 + mk) and ψk H. ∼ ∼ The conditional density of b0,k given the remaining variables is therefore proportional to

D 1 b γdθdb0,k θ0+mk−1 d,k (1 b0,k) (4.61) − Γ(γdθdb0,k)Γ(θd(1 γdb0,k)) 1 bd,k d=1 Y −  −  and may be sampled using random-walk Metropolis-Hastings. It remains then to sample the latent round indices mk or, equivalently, their differences hk , mk mk−1, where m0 , 0 for notational convenience. Let fm and Fm denote the pmf − θ0γ0 j and cdf of the Poisson( ) distribution respectively, and define Cm,j (mk = m). θ0+m , k=1 I ∞ θ0γ0 Since Cm = I(mk = m) Poisson( ), it follows that k=1 ∼ θ0+m P k−1 P P(hk < 0 (hj)j=1 ) = 0, | k−1 1 Fmk−1 (Cmk−1,k−1) P(hk = 0 (hj)j=1 ) = − | 1 Fm (Cm ,k−1 1) − k−1 k−1 − k−1 for mk−1 = j=1 hj, and

P f (C ) h−1 (h = h (h )k−1) = mk−1 mk−1,k−1 (1 f (0)) f (0) P k j j=1 1 F (C 1) mk−1+h mk−1+g | mk−1 mk−1,k−1 − g=1 − − Y CHAPTER 4. COMBINATORIAL CLUSTERING AND THE BNBP 138

k−1 for all h N. The conditional distribution of hk given (hj)j=1 and b0,k is then ∈

k−1 hk k−1 p(hk (hj) , b0,k) (1 b0,k) (θ0 + hk + mk−1)p(hk (hj) ), | j=1 ∝ − | j=1 which cannot be normalized in closed form due to the infinite summation. To permit poste- rior sampling of hk, we introduce an auxiliary variable vk with conditional distribution

hk vk Unif(0, ζ0,h (1 b0,k) ), ∼ k − ∞ where (ζ0,h)h=1 is a fixed positive sequence with limh→∞ ζ0,h = 0. Given vk, we may slice sample hk from the finite distribution

hk k−1 I(vk ζ0,hk (1 b0,k) ) k−1 p(hk (hj)j=1 , b0,k) ≤ − (θ0 + hk + mk−1)p(hk (hj)j=1 ). | ∝ ζ0,hk |

Collapsed sampling

In Eq. (4.61), we sampled b0,k conditional on b·,k. A more efficient alternative is to integrate b·,k out of this conditional. We exploit the conjugacy of the beta and negative binomial distributions to derive the conditional distribution of Nd,k given b0,k, γd, θd, and rd:

p(Nd,k b0,k, γd, θd, rd) = p(Nd,k bd,k, rd)p(bd,k b0,k, γd, θd)dbd,k | | | Z N +γ θ b −1 d,k d d 0,k rd+θd(1−γdb0,k)−1 Γ(Nd,k + rd) Γ(θd)bd,k (1 bd,k) = − dbd,k Nd,k! Γ(rd) Γ(γdθdb0,k) Γ(θd(1 γdb0,k)) Z − Γ(Nd,k + rd) Γ(θd) Γ(Nd,k + γdθdb0,k) Γ(rd + θd(1 γdb0,k)) = − . Nd,k! Γ(rd) Γ(Nd,k + rd + θd) Γ(γdθdb0,k) Γ(θd(1 γdb0,k)) −

The conditional density of b0,k with b·,k integrated out now takes the form

D Γ(N + γ θ b ) Γ(r + θ (1 γ b )) θ0+mk−1 d,k d d 0,k d d d 0,k (1 b0,k) − − Γ(γdθdb0,k) Γ(θd(1 γdb0,k)) d=1 Y − and may be sampled using random-walk Metropolis-Hastings.

Finite approximation Gibbs sampler

The full conditional distribution of b0,k under the finite approximation of Eq. (4.10) is pro- portional to

D 1 b γdθdb0,k θ0γ0/K−1 θ0(1−γ0/K)−1 d,k b0,k (1 b0,k) , − Γ(γdθdb0,k)Γ(θd(1 γdb0,k)) 1 bd,k d=1 Y −  −  CHAPTER 4. COMBINATORIAL CLUSTERING AND THE BNBP 139

while the conditional density with b·,k integrated out is proportional to

D Γ(N + γ θ b ) Γ(r + θ (1 γ b )) θ0γ0/K−1 θ0(1−γ0/K)−1 d,k d d 0,k d d d 0,k b0,k (1 b0,k) − . − Γ(γdθdb0,k) Γ(θd(1 γdb0,k)) d=1 Y − Random-walk Metropolis-Hastings may be used to sample b0,k from either distribution. With this approximation in hand, we sample λd,k, bd,k, and ψk precisely as described in Section 4.7. Since the number of components is finite, no auxiliary slice variables are needed to sample the component indices. Hence, we may sample zd,n from its discrete conditional distribution

P(zd,n = k) F (dxd,n ψk)λd,k ∝ | given the remaining variables. 140

Chapter 5

Feature allocations, probability functions, and paintboxes

The problem of inferring a clustering of a data set has been the subject of much research in Bayesian analysis, and there currently exists a solid mathematical foundation for Bayesian approaches to clustering. In particular, the class of probability distributions over partitions of a data set has been characterized in a number of ways, including via exchangeable partition probability functions (EPPFs) and the Kingman paintbox. Here, we develop a generalization of the clustering problem, called feature allocation, where we allow each data point to belong to an arbitrary, non-negative integer number of groups, now called features or topics. We define and study an “exchangeable feature probability function” (EFPF)—analogous to the EPPF in the clustering setting—for certain types of feature models. Moreover, we introduce a “feature paintbox” characterization—analogous to the Kingman paintbox for clustering— of the class of exchangeable feature models. We provide a further characterization of the subclass of feature allocations that have EFPF representations.

5.1 Introduction

Exchangeability has played a key role in the development of Bayesian analysis in general and Bayesian nonparametric analysis in particular. Exchangeability can be viewed as asserting that the indices used to label the data points are irrelevant for inference, and as such is often a natural modeling assumption. Under such an assumption, one is licensed by de Finetti’s theorem (De Finetti, 1931; Hewitt and Savage, 1955) to propose the existence of an under- lying parameter that renders the data conditionally independent and identically distributed (iid) and to place a prior distribution on that parameter. Moreover, the theory of infinitely exchangeable sequences has advantages of simplicity over the theory of finite exchangeability, encouraging modelers to take a nonparametric stance in which the underlying “parameter” is infinite dimensional. Finally, the development of algorithms for posterior inference is often greatly simplified by the assumption of exchangeability, most notably in the case of Bayesian CHAPTER 5. FEATURE ALLOCATIONS, PROBABILITY FUNCTIONS, AND PAINTBOXES 141 nonparametrics, where models based on the Dirichlet process and other combinatorial priors became useful tools in practice only when it was realized how to exploit exchangeability to develop inference procedures (Escobar, 1994). The connection of exchangeability to Bayesian nonparametric modeling is well established in the case of models for clustering. The goal of a clustering procedure is to infer a partition of the data points. In the Bayesian setting, one works with random partitions, and, under an exchangeability assumption, the distribution on partitions should be invariant to a relabeling of the data points. The notion of an exchangeable random partition has been formalized by Kingman, Aldous, and others (Kingman, 1978; Aldous, 1985), and has led to the definition of an exchangeable partition probability function (EPPF) (Pitman, 1995). The EPPF is a mathematical function of the cardinalities of the groups in a partition. Exchangeability of the random partition is captured by the requirement that the EPPF be a symmetric function of these cardinalities. Furthermore, the exchangeability of a partition can be related to the exchangeability of a sequence of random variables representing the assignments of data points to clusters, for which a de Finetti mixing measure necessarily exists. This de Finetti measure is known as the Kingman paintbox (Kingman, 1978). The relationships among this circle of ideas are well understood: it is known that there is an equivalence among the class of exchangeable random partitions, the class of random partitions that possess an EPPF, and the class of random partitions generated by a Kingman paintbox; see Pitman (2006) for an overview of these relations. A specific example of these relationships is given by the Chinese restaurant process and the Dirichlet process, but several other examples are known and have proven useful in Bayesian nonparametrics. Our focus in the current chapter is on an alternative to clustering models that we refer to as feature allocation models. While in a clustering model each data point is assigned to one and only one class, in a feature allocation model each data point can belong to multiple groups. It is often natural to view the groups as corresponding to traits or features, such that the notion that a data point belongs to multiple groups corresponds to the point ex- hibiting multiple traits or features. 
A Bayesian feature allocation model treats the feature assignments for a given data point as random and subject to posterior inference. A nonpara- metric Bayesian feature allocation model takes the number of features to also be random and subject to inference. Research on nonparametric Bayesian feature allocation has been based around a single prior distribution, the Indian buffet process of Griffiths and Ghahramani (2006), which is known to have the beta process as its underlying de Finetti measure (Thibaux and Jordan, 2007). There does not yet exist a general definition of exchangeability for feature allocation models, nor counterparts of the EPPF or the Kingman paintbox. In this chapter we supply these missing constructions. We provide a rigorous treatment of exchangeable feature allocations (in Section 8.3 and Section 5.3). In Section 5.4 we define a notion of exchangeable feature probability function (EFPF) that is the analogue for feature allocations of the EPPF for clustering. We then proceed to define a feature paintbox in Section 5.5. Finally, in Section 5.6 we discuss a class of models that we refer to as feature frequency models for which the construction of the feature paintbox is particularly CHAPTER 5. FEATURE ALLOCATIONS, PROBABILITY FUNCTIONS, AND PAINTBOXES 142 Exchangeable FAs

Exchangeable RPs = RPs with EPPFs = Kingman paintbox models

FAs with EFPFs = Frequency models plus singletons Regular FAs = Feature paintbox models

IBP Two-feature exampleCRP

Figure 5.1: A summary of the relations described in this chapter. Rounded rectangles represent classes with the following abbreviations: RP for random partition, FA for ran- dom feature allocation, EPPF for exchangeable partition probability function, EFPF for exchangeable feature probability function. The large black dots represent particular models with the following abbreviations: CRP for Chinese restaurant process, IBP for Indian buffet process. The two-feature example refers to Example 5.4.4 with the choice p11p00 = p10p01. 6 straightforward, and we discuss the important role that feature frequency models play in the general theory of feature allocations. The shown in Figure 5.1 is a useful guide for understanding our results, and the reader may wish to consult this diagram in working through the chapter. As shown in the diagram, random partitions (RPs) are a special case of random feature allocations (FAs), and previous work on random partitions can be placed within our framework. Thus, in the diagram, we have depicted the equivalence already noted of exchangeable RPs, RPs that possess an EPPF, and Kingman paintboxes. We also see that random feature allocations have a somewhat richer structure: the class of FAs with EFPFs is not the same as those having an underlying feature paintbox. But the class of EFPFs is characterized in a different way; we will see that the class of feature allocations with EFPFs is equivalent to the class of FAs obtained from feature frequency models together with singletons of a certain distribution. Indeed, we will find that the class of clusterings with EPPFs is, in this way, analogous to the class of feature allocations with EFPFs when both are considered as subclasses of the general class of feature allocations. The diagram also shows several examples that we use to illustrate and develop our theory.

5.2 Feature allocations

We consider data sets with N points and let the points be indexed by the integers [N] := 1, 2,...,N . We also explicitly allow N = , in which case the index set is N = {1, 2, 3,... .} For our discussion of feature allocations∞ and partitioning it is sufficient to focus{ on the} indices rather than the data points; thus, we will be discussing models for CHAPTER 5. FEATURE ALLOCATIONS, PROBABILITY FUNCTIONS, AND PAINTBOXES 143

collections of subsets of [N] and N. Our introduction to feature allocations follows Broderick, Jordan, and Pitman (2013). We define a feature allocation fN of [N] to be a multiset of non-empty subsets of [N] called features, such that no index n belongs to infinitely many features. We write fN = A1,...,AK , where K is the number of features. An example feature allocation of [6] is { } f6 = 2, 3 , 2, 4, 6 , 3 , 3 , 3 . Similarly, a feature allocation f∞ of N is a multiset of {{ } { } { } { } { }} non-empty subsets of N such that no index n belongs to infinitely many features. The total number of features in this case may be infinite, in which case we write f∞ = A1,A2,... . An { } example feature allocation of N is f∞ = n : n is prime , n : n is not divisible by two . {{ } { }} Finally, we may have K = 0, and f∞ = is a valid feature allocation. A partition is a special case of a feature∅ allocation for which the features are restricted to be mutually exclusive and exhaustive. The features of a partition are often referred to as blocks or clusters. We note that a partition is always a feature allocation, but the converse statement does not hold in general; neither of the examples given above (f6 and f∞) are partitions. We now turn to the problem of defining exchangeable feature allocations, extending previous work on exchangeable random partitions (Aldous, 1985). Let N be the space of F all feature allocations of [N]. A random feature allocation FN of [N] is a random element of N . Let σ : N N be a finite permutation. That is, for some finite value Nσ, we have F → σ(n) = n for all n > Nσ. Further, for any feature A N, denote the permutation applied ⊂ to the feature as follows: σ(A) := σ(n): n A . For any feature allocation FN , denote { ∈ } the permutation applied to the feature allocation as follows: σ(FN ) := σ(A): A FN . { ∈ } Finally, let FN be a random feature allocation of [N]. Then we say that a random feature d allocation FN is exchangeable if FN = σ(FN ) for every permutation of [N]. In addition to exchangeability, we also require our distributions on feature allocations to exhibit a notion of coherence across different ranges of the index. Intuitively, we often imagine the indices as denoting time, and it is natural to suppose that the randomness at time n is coherent with the randomness at time n + 1. More formally, we say that a feature allocation fM of [M] is the restriction of a feature allocation fN of [N] for M < N if

fM = A [M]: A fN ,A [M] = . { ∩ ∈ ∩ 6 ∅}

Let N (fM ) be the set of all feature allocations of [N] whose restriction to [M] is fM . R Let P denote a probability measure on some probability space supporting (Fn). We say that the sequence of random feature allocations (Fn) is consistent in distribution if for all M and N such that M < N, we have

P(FM = fM ) = P(FN = fN ). fN ∈RXN (fM )

We say that the sequence (Fn) is strongly consistent if for all M and N such that M < N, we have a.s. FN N (FM ). ∈ R CHAPTER 5. FEATURE ALLOCATIONS, PROBABILITY FUNCTIONS, AND PAINTBOXES 144

Given any (Fn) that is consistent in distribution, the Kolmogorov extension theorem implies that we can construct a sequence of random feature allocations that is strongly consistent and has the same finite dimensional distributions. So henceforth we simply use the term “consistency” to refer to strong consistency. With this consistency condition, we can define a random feature allocation F∞ of N as a consistent sequence of finite feature allocations. Thus F∞ may be thought of as a random ∞ element of the space of such sequences: F∞ = (Fn)n=1. We say that FN is a restriction of F∞ to [N] when it is the Nth element in this sequence. We let ∞ denote the space of consistent feature allocation sequences, of which each random featureF allocation is a random element. The sigma field associated with this space is generated by the finite-dimensional sigma fields of the restricted random feature allocations Fn. d We say that F∞ is exchangeable if F∞ = σ(F∞) for every finite permutation σ. That is, for every permutation σ that changes no indices above N for some N < , we require d ∞ FN = σ(FN ), where FN is the restriction of F∞ to [N].

5.3 Labeling features

Now that we have defined consistent, exchangeable random feature allocations, we want to characterize the class of all distributions on these allocations. We begin by considering some alternative representations of the feature allocation that are not merely useful, but indeed key to some of our later results. A number of authors have made use of matrices as a way of representing feature alloca- tions (Griffiths and Ghahramani, 2006; Thibaux and Jordan, 2007; Doshi et al., 2009). This representation, while a boon for intuition in some regards, requires care because a matrix presupposes an order on the features, which is not a part of the feature allocation a priori. We cover this distinction in some detail next. ˆ We start by defining an a priori labeled feature allocation. Let FN,1 be the collection ˆ of indices in [N] with feature 1, let FN,2 be the collection of indices in [N] with feature 2, etc. Here, we think of a priori labels as being the ordered, positive natural numbers. This specification is different from (a priori unlabeled) feature allocations as defined above since there is nothing to distinguish the features in a feature allocation other than, potentially, the members of a feature. Consider the following analogy: an a priori labeled feature allocation is to a feature allocation as a classification is to a clustering. Indeed, when each index n belongs to exactly one feature in an a priori feature allocation, feature 1 is just class 1, feature 2 is class 2, and so on. Another way to think of an a priori labeled feature allocation of [N] is as a matrix of N rows filled with zeros and ones. Each column is associated with a feature. The (n, k) entry in the matrix is one if index n is in feature k and zero otherwise. However, just as—contrary to the classification case—we do not know the ordering of clusters in a clustering a priori, we do not a priori know the ordering of features in a feature allocation. To make use of a matrix representation for a feature allocation, we will need to introduce or find such an order. CHAPTER 5. FEATURE ALLOCATIONS, PROBABILITY FUNCTIONS, AND PAINTBOXES 145 The reasoning above suggests that introducing an order for features in a feature allocation would be useful. The next example illustrates that the probability P(FN = fN ) in some sense undercounts features when they contain exactly the same indices: e.g., Aj = Ak for some j = k. This fact will suggest to us that it is not merely useful, but indeed a key point of our theoretical6 development, to introduce an ordering on features.

iid Example 5.3.1 (A Bernoulli, two-feature allocation). Given qA, qB (0, 1), draw Zn,A iid ∈ ∼ Bern(qA) and Zn,B Bern(qB), independently, and construct the random feature allocation by collecting those∼ indices with successful draws:

FN := n : n N,Zn,A = 1 , n : n N,Zn,B = 1 . {{ ≤ } { ≤ }}

One caveat here is that if either of the two sets in the multiset FN is empty, we do not include it in the allocation. Note that calling the features A and B was merely for the purposes of construction, and in defining FN , we have lost all feature labels. So FN is a feature allocation, not an a priori labeled feature allocation. Then the probability of the feature allocation F5 = f5 := 2, 3 , 2, 3 is {{ } { }} 2 3 2 3 q (1 qA) q (1 qB) , A − B − 0 but the probability of the feature allocation F5 = f := 2, 3 , 2, 5 is 5 {{ } { }} 2 3 2 3 2q (1 qA) q (1 qB) . A − B − The difference is that in the latter case the features can be distinguished, and so we must account for the two possible pairings of features to frequencies qA, qB . ˜ { } Now, instead, let FN be FN with the features ordered uniformly at random amongst all possible feature orderings. There is just a single possible ordering of f5, so the probability ˜ ˜ of F5 = f5 := ( 2, 3 , 2, 3 ) is again { } { } 2 3 2 3 q (1 qA) q (1 qB) . A − B − 0 However, there are two orderings of f5, each of which is equally likely. The probability of ˜ ˜0 FN = f5 := ( 2, 5 , 2, 3 ) is { } { } 2 3 2 3 q (1 qA) q (1 qB) . A − B − The same holds for the other ordering.  This example suggests that there are combinatorial factors that must be taken into account when working with the distribution of FN directly. The example also suggests that we can avoid the need to specify such factors by instead working with a suitable randomized ordering of the random feature allocation FN . We achieve this ordering in two steps. The first step involves ordering the features via a procedure that we refer to as order-of- appearance labeling. The basic idea is that we consider data indices n = 1, 2, 3, and so on CHAPTER 5. FEATURE ALLOCATIONS, PROBABILITY FUNCTIONS, AND PAINTBOXES 146 in order. Each time a new data point arrives, we examine the features associated with that data point. Each time we see a new feature, we label it with the lowest available feature label from k = 1, 2,.... In practice, the order-of-appearance scheme requires some auxiliary randomness since each index n may belong to zero, one, or many different features (though the number must be finite). When multiple features first appear for index n, we order them uniformly at random. That simple idea is explained in full detail as follows. Recursively suppose that there are K features among the indices [N 1]. Trivially there are zero features when no indices have been seen yet. Moreover, we− suppose that we have features with labels 1 through K if K 1, and if K = 0, we have no features. If features remain without labels, ≥ K there exists some minimum index n in the data indices such that n / k=1 Ak, where the union is if K = 0. It is possible that no features contain n. So∈ we further note that ∅ K S there exists some minimum index m such that m / Aj but m is contained in some ∈ j=1 feature of the allocation. By construction, we must have m N. Let Km be the number S ≥ of features containing m; Km is finite by definition of a feature allocation. Let (Uk) denote a sequence of iid uniform random variables, independent of the random feature allocation.

Assign UK+1,...,UK+Km to these new features and determine their order of appearance by the order of these random variables. While features remain to be labeled, continue the recursion with N now equal to m and K now equal to K + Km. Example 5.3.2 (Feature labeling schemes). Consider the feature allocation

f6 = 2, 5, 4 , 3, 4 , 6, 4 , 3 , 3 . (5.1) {{ } { } { } { } { }} And consider the random variables

iid U1,U2,U3,U4,U5 Unif[0, 1]. ∼

We see from f6 that index 1 has no features. Index 2 has exactly one feature, so we assign this feature, 2, 5, 4 , to have order-of-appearance label 1. While U1 is associated with this feature, we do{ not need} to break any ties at this point, so it has no effect. Index 3 is associated with three features. We associate each feature with exactly one of U2,U3, and U4 (the next three available Uk). For instance, pair 3, 4 with U2, 3 with U3, { } { } and the other 3 with U4. Suppose it happens that U3 < U2 < U4. Then the feature 3 { } { } paired with U3 receives label 2 (the next available order-of-appearance label). The feature 3, 4 receives label 3. And the feature 3 paired with U4 receives label 4. { Index} 4 has three features, but 2,{5,}4 and 3, 4 are already labeled. So the only { } { } remaining feature, 6, 4 , receives the next available order-of-appearance label: 5. U5 is associated with this{ feature,} but since we do not need to break ties here, it has no effect. Indices 5 and 6 belong to already-labeled features. So the features can be listed with order-of-appearance indices as

A1 = 2, 5, 4 ,A2 = 3 ,A3 = 3, 4 ,A4 = 3 ,A5 = 6, 4 . (5.2) { } { } { } { } { } CHAPTER 5. FEATURE ALLOCATIONS, PROBABILITY FUNCTIONS, AND PAINTBOXES 147 k 1 1234 51234 51234 51234 1 1 1 1 1 n 2 2 2 2 2 3 3 3 3 4 4 4 5 5 6 Figure 5.2: Order-of-appearance binary matrix representations of the sequence of feature allocations on [2], [3], [4], [5], and [6] found by restricting f6 in Example 5.3.2. Rows corre- spond to indices n, and columns correspond to order-of-appearance feature labels k. A gray ◦ square indicates a 1 entry, and a white square indicates a 0 entry. Yn , the set of order-of- appearance feature assignments of index n, is easily read off from the matrix as the set of columns with entry in row n equal to 1.

◦ Let Yn indicate the set of order-of-appearance feature labels for the features to which index n belongs; i.e., if the features are labeled according to order of appearance as in Eq. (5.2), then ◦ ◦ Yn = k : n Ak . By definition of a feature allocation, Yn must have finite cardinality. { ∈ } ◦ ◦ ◦ ◦ ◦ The order-of-appearance labeling gives Y1 = ,Y2 = 1 ,Y3 = 2, 3, 4 ,Y4 = 1, 3, 5 ,Y5 = ◦ ∅ { } { } { } 1 ,Y6 = 5 . { }Order-of-appearance{ } labeling is well-suited for matrix representations of feature alloca- tions. The rows of the matrix correspond to indices n and the columns correspond to features with order-of-appearance labels k. The matrix representation of the order-of-appearance la- beling and resulting feature assignments (Y ◦) for n [6] is depicted in Figure 5.2. n ∈  Note that when the feature allocation is a partition, there is exactly one feature containing any m, so this scheme reduces to the order-of-appearance scheme for cluster labeling. Consider an exchangeable feature allocation F∞. Give order-of-appearance labels to the ◦ features of this allocation, and let Yn be the set of feature labels for features containing n. ◦ So Yn is a random finite subset of N. It can be thought of as a simple point process on N; a discussion of measurability of such processes may be found in Kallenberg (2002, p. 178). Our process is even simpler than a simple point process as it is globally finite rather than merely locally finite. ◦ ∞ Note that (Yn )n=1 is not necessarily exchangeable. For instance, consider again Exam- ◦ ◦ ◦ ple 5.3.1. If Y1 is non-empty, 1 Y1 with probability one. If Y2 is non-empty, with positive probability it may not contain 1.∈ To restore exchangeability we extend an idea due to Aldous (1985) in the setting of random partitions; in our feature allocation extension, we associate to each feature a draw from a uniform random variable on [0, 1]. Drawing these random variables independently we maintain consistency across different values of N. We refer to these random variables as uniform random feature labels. Note that the use of a uniform distribution is for convenience; we simply require that CHAPTER 5. FEATURE ALLOCATIONS, PROBABILITY FUNCTIONS, AND PAINTBOXES 148

φ4 φ5 φ2 φ1 φ3

1 2 3 4 5 6

Figure 5.3: An illustration of the uniform random feature labeling in Example 5.3.3. The top rectangle is the unit interval. The uniform random labels are depicted along the interval with vertical dotted lines at their locations. The indices [6] are shown to the left. A black circle shows appears when an index occurs in the feature with a given label. The matrix representations of this feature allocation in Figure 5.4 can be recovered from this plot.

features receive distinct labels with probability one, so any other continuous distribution would suffice. We also note that in a full-fledged model based on random feature allocations these labels often play the role of parameters and are used in defining the likelihood. For further discussion of such constructions, see Broderick, Jordan, and Pitman (2013). Thus, let (φk) be a sequence of iid uniform random variables, independent of both (Uk) and F∞. Construct a new feature labeling by taking the feature labeled k in the order-of- † appearance labeling and now label it φk. In this case, let Yn denote the set of feature labels † for features to which n belongs. Call this a uniform random labeling. Yn can be thought of as a (globally finite) simple point process on [0, 1]. Again, we refer the reader to Kallenberg (2002, p. 178) for a discussion of measurability.

Example 5.3.3 (Feature labeling schemes (continued)). Again consider the feature alloca- tion f6 = 2, 5, 4 , 3, 4 , 6, 4 , 3 , 3 . {{ } { } { } { } { }} Now consider the random variables

iid U1,U2,U3,U4,U5, φ1, φ2, φ3, φ4, φ5 Unif[0, 1]. ∼

Recall from Example 5.3.2 that U1,...,U5 gave us the order-of-appearance labeling of the features. This labeling allowed us to index the features as in Eq. (5.2), copied here:

A1 = 2, 5, 4 ,A2 = 3 ,A3 = 3, 4 ,A4 = 3 ,A5 = 6, 4 . { } { } { } { } { } With this order-of-appearance labeling in hand, we can assign a uniform random label to each feature. In particular, we assign the uniform random label φk to the feature with order-of-appearance label k: A1 = 2, 5, 4 gets label φ1, A2 = 3 gets label φ2, A3 = 3, 4 { } { } { } CHAPTER 5. FEATURE ALLOCATIONS, PROBABILITY FUNCTIONS, AND PAINTBOXES 149

† gets label φ3, A4 = 3 gets label φ4, and A5 = 6, 4 gets label φ5. Let Yn indicate the set of uniform random{ feature} labels for the features{ to} which index n belongs. The uniform random labeling gives

† † † † † † Y = ,Y = φ1 ,Y = φ2, φ3, φ4 ,Y = φ1, φ3, φ5 ,Y = φ1 ,Y = φ5 . (5.3) 1 ∅ 2 { } 3 { } 4 { } 5 { } 6 { } 

Lemma 5.3.4. Give the features of an exchangeable feature allocation F∞ uniform random † † labels, and let Yn be the set of feature labels for features containing n. So Yn is a random † ∞ finite subset of [0, 1]. Then the sequence (Yn )n=1 is exchangeable.

† ∞ Proof. Note that (Yn )n=1 = g((φk)k, (Uk)k,F∞) for some measurable function g. Consider any finite permutation σ that does not change any index n with n > N for some fixed, finite N. Let K represent the (potentially random but finite) number of features in FN . If we construct order-of-appearance labels using the same (Uk)k as above and now σ(F∞) instead of F∞, the labels will not differ from the original order-of-appearance labels after the first K features. Therefore, there exists some finite permutation τ—which may be a function of K † (Uk)k=1, σ, and FN and hence random—such that (Yσ(n))n = g((φτ(k))k, (Uk)k, σ(F∞)). Now d ((φτ(k))k, (Uk)k, σ(F∞)) = ((φk)k, (Uk)k, σ(F∞))

since the iid sequence (φk)k, the iid sequence (Uk)k, and F∞ are independent by construction and d ((φk)k, (Uk)k, σ(F∞)) = ((φk)k, (Uk)k,F∞) since the feature allocation is exchangeable and the independence used above still holds. So

d g((φτ(k))k, (Uk)k, σ(F∞)) = g((φk)k, (Uk)k,F∞).

† It follows that the sequence (Yn )n is exchangeable. † † We can recover the full feature allocation F∞ from the sequence Y1 ,Y2 ,.... In particular, † † † if x1, x2,... are the unique values in Y1 ,Y2 ,... , then the features are n : xk Yn : k ={ 1, 2,... }. The feature allocation can{ similarly be} recovered from the order-of-appearance{{ ∈ } } ◦ label collections (Yn ). ˜ † We can also recover a new random ordered feature allocation FN from the sequence (Yn ). ˜ † In particular, FN is the sequence—rather than the collection—of features n : xk Y such { ∈ n } that the feature with smallest label φk occurs first, and so on. This construction achieves our goal of avoiding the combinatorial factors needed to work with the distribution of FN , while retaining exchangeability and consistency.

Example 5.3.5 (Feature labeling schemes (continued)). Once more, consider the feature allocation f6 = 2, 5, 4 , 3, 4 , 6, 4 , 3 , 3 . {{ } { } { } { } { }} CHAPTER 5. FEATURE ALLOCATIONS, PROBABILITY FUNCTIONS, AND PAINTBOXES 150 k 1 1234 1234 5 1234 5 1234 5 1 1 1 1 1 n 2 2 2 2 2 3 3 3 3 4 4 4 5 5 6 Figure 5.4: The same consistent sequence of feature allocations in Figure 5.2 but now with the uniform random order of Example 5.3.5 instead of the order of appearance illustrated in Figure 5.2.

and the uniform random labeling in Eq. (5.3). If it happens that φ4 < φ5 < φ2 < φ1 < φ3, then the random ordered feature allocation is ˜ f6 = ( 3 , 6, 4 , 3 , 2, 5, 4 , 3, 4 ). { } { } { } { } { }  Recall that we were motivated by Example 5.3.1 to produce such a random ordering scheme to avoid obfuscating combinatorial factors in the probability of a feature allocation. From another perspective, these factors arise because the random labeling is in some sense more natural than alternative labelings; again, consider random labels as iid parameters for each feature. While order-of-appearance labeling is common due to its pleasant aesthetic representation in matrix form (compare Figures 5.2 and 5.4), one must be careful to remem- ◦ ber that the order-of-appearance label sets (Yn ) are not exchangeable. We will use random labeling extensively below since, among other nice properties, it preserves exchangeability of the sets of feature labels associated with the indices.

5.4 Exchangeable feature probability function

In general, given a probability of a random feature allocation, P(FN = fN ), we can find the ˜ ˜ probability of a random ordered feature allocation P(FN = fN ) as follows. Let H be the ˜ ˜ number of distinct features of FN , and let (K1,..., KH ) be the multiplicities of these distinct features in decreasing order. Then

K −1 (F˜ = f˜ ) = (F = f ), (5.4) P N N K˜ ,..., K˜ P N N  1 H  where K K! := . ˜ ˜ ˜ ˜ K1,..., KH K1! KH !   ··· CHAPTER 5. FEATURE ALLOCATIONS, PROBABILITY FUNCTIONS, AND PAINTBOXES 151 For partitions, the effect of this multiplicative factor is the same across all partitions with the same number of clusters; for some number of clusters K, it is just 1/K!. In the general feature case, the multiplicative factor may be different for different feature configurations with the same number of features.

Example 5.4.1 (A Bernoulli, two-feature allocation (continued)). Consider FN constructed as in Example 5.3.1. Denote the sizes of the two features by MN,1 and MN,2. Then

1 M M ˜ ˜ N,1 N−MN,1 N,2 N−MN,2 P(FN = fN ) = qA (1 qA) qB (1 qB) 2 − − 1 MN,2 N−MN,2 MN,1 N−MN,1 + q (1 qA) q (1 qB) 2 A − B − = p(N,MN,1,MN,2). (5.5)

Here, p is some function of the number of indices N and the feature sizes (MN,1,MN,2) that we note is symmetric in (MN,1,MN,2); i.e., p(N,MN,1,MN,2) = p(N,MN,2,MN,1).  When the feature allocation probability admits the representation ˜ ˜ P(FN = fN ) = p(N, A1 ,..., AK ) (5.6) | | | | ˜ for every ordered feature allocation fN = (A1,...,AK ) and some function p that is symmetric in all arguments after the first, we call p the exchangeable feature probability function (EFPF). We take care to note that the exchangeable partition probability function (EPPF), which always exists for partitions, is not a special case of the EFPF. Indeed, the EPPF assigns zero probability to any multiset in which an index occurs in more than one feature of the multiset; e.g., 1 , 2 is a valid partition and a valid feature allocation of [2], but 1 , 1 is a valid{{ feature} { }} allocation but not a valid partition of [2]. Thus, the EPPF{{ must} { examine}} the feature indices of a feature allocation to judge their exclusivity and thereby assign a probability. By contrast, the indices in the multiset provide no such information to the EFPF; only the sizes of the multiset features are relevant in the EFPF case.

Proposition 5.4.2. The class of exchangeable feature allocations with EFPFs is a strict but non-empty subclass of the class of exchangeable feature allocations.

Proof. Example 5.4.3 below shows that the class of feature allocations with EFPFs is non- empty, and Example 5.4.4 below establishes that there exist simple exchangeable feature allocations without EFPFs.

Example 5.4.3 (Three-parameter Indian buffet process). The Indian buffet process (IBP) (Griffiths and Ghahramani, 2006) is a generative model for a random feature allocation that is specified recursively in a manner akin to the Chinese restaurant process (Aldous, 1985) in the case of partitions. The metaphor involves a set of “customers” that enter a restaurant and sample a set of “dishes.” Order the customers by placing them in one-to-one CHAPTER 5. FEATURE ALLOCATIONS, PROBABILITY FUNCTIONS, AND PAINTBOXES 152 Buffet dishes Plates 1234567 ··· Customers 1 n = 1 2 3 4 4 Y2 ∈ 2 4 n = 2 5 5 / Y3 ∈ 3 4 n = 3 6 7

Figure 5.5: Illustration of an Indian buffet process in the order-of-appearance representation of Figure 5.2. The buffet (top) consists of a vector of dishes, corresponding to features. Each customer—corresponding to a data point—who enters the restaurant first decides whether or not to choose dishes that the other customers have already sampled. The customer then selects a random number of new dishes, not previously sampled by any customer. A gray box in position (n, k) indicates customer n has sampled dish k, and a white box indicates the customer has not sampled the dish. In the example, the second customer has sampled exactly those dishes indexed by 2, 4, and 5: Y ◦ = 2, 4, 5 . 2 { } correspondence with the indices n N. The dishes in the restaurant correspond to feature labels. Customers in the Indian buffet∈ can sample any non-negative integer number of dishes. ◦ The set of dishes chosen by a customer n is just Yn , the collection of feature labels for the features to which n belongs, and the procedure described below provides a way to construct ◦ Yn recursively. We describe an extended version (Teh and G¨or¨ur, 2009; Broderick, Jordan, and Pitman, 2012) of the Indian buffet that includes two extra parameters beyond the single mass pa- rameter γ (γ > 0) originally specified by Griffiths and Ghahramani (2006); in particular, we include a concentration parameter θ (θ > 0) and a discount parameter α (α [0, 1)). We abbreviate this three-parameter IBP as “3IBP.” The single-parameter IBP may∈ be recovered by setting θ = 1 and α = 0. + We start with a single customer, who enters the buffet and chooses K1 Poisson(γ) dishes. None of the dishes have been sampled by any other customers since∼ no other cus- tomers have yet entered the restaurant. An order-of-appearance labeling gives the dishes + + labels 1,...,K1 if K1 > 0. Recursively, the nth customer chooses which dishes to sample in two phases. First, for each dish k that has previously been sampled by any customer in 1, . . . , n 1, customer n samples dish k with probability − Mn−1,k α − , θ + n 1 − CHAPTER 5. FEATURE ALLOCATIONS, PROBABILITY FUNCTIONS, AND PAINTBOXES 153

for Mn,k equal to the number of customers indexed 1, . . . , n who have tried dish k. As each dish represents a feature, sampling a dish represents that the customer index n belongs to that feature. And Mn,k is the size of the feature labeled k in the feature allocation of [n]. Next, customer n chooses Γ(θ + 1) Γ(θ + α 1 + n) K+ Poisson γ − n ∼ Γ(θ + n) · Γ(θ + α)   + new dishes to try. If Kn > 0, then the dishes receive unique order-of-appearance labels Kn−1 + 1,...,Kn. Here, Kn represents the number of sampled dishes after n customers: + Kn = Kn−1 + Kn (with base case K0 = 0). With this generative model in hand, we can find the probability of a particular feature allocation. We discover its form by enumeration. At each round n, we have a Poisson number + of new features, Kn , represented. The probability factor associated with these choices is a product of Poisson densities:

N 1 + [C(n, γ, θ, α)]Kn exp ( C(n, γ, θ, α)) , K+! n=1 n − Y where Γ(θ + 1) Γ(θ + α 1 + n) C(n, γ, θ, α) := γ − . Γ(θ + n) · Γ(θ + α)

Let Rk be the round on which the kth dish, in order of appearance, is first chosen. Then the denominators for future dish choice probabilities are the factors in the product (θ + Rk) (θ + Rk + 1) (θ + N 1). The numerators for the times when the dish is chosen · ··· − are the factors in the product (1 α) (2 α) (MN,k 1 α). The numerators for the − · − ··· − − times when the dish is not chosen yield (θ + Rk 1 + α) (θ + N 1 MN,k + α). Let An,k represent the collection of indices in the feature− with label··· k after −n customers− have entered the restaurant. Then Mn,k = An,k . ˜ ˜ | | Finally, let K1,..., KH be the multiplicities of distinct features formed by this model. We note that there are N H + ˜ Kn ! / Kh! "n=1 # "h=1 # Y Y rearrangements of the features generated by this process that all yield the same feature allocation. Since they all have the same generating probability, we simply multiply by this factor to find the feature allocation probability. 1 Multiplying all factors together and taking fn = AN,1,...,AN,K yields { N } P(FN = fN ) 1 Readers curious about how the Rk terms disappear may observe that

+ KN N K Γ(θ + R ) Γ(θ + n) N k = . Γ(θ + R + α 1) Γ(θ + n + α 1) k n=1 kY=1 − Y  −  CHAPTER 5. FEATURE ALLOCATIONS, PROBABILITY FUNCTIONS, AND PAINTBOXES 154 −1 H KN N ˜ Γ(θ + 1) Γ(θ + 1) Γ(θ + α 1 + n) = Kh! γ exp γ − Γ(θ + α) − Γ(θ + n) · Γ(θ + α) h=1 ! n=1 ! Y   X KN Γ(MN,k α) Γ(θ + N MN,k + α) − − . · Γ(1 α) · Γ(θ + N) "k=1 # Y − It follows from Eq. (5.4) that the probability of a uniform random ordering of the feature allocation is ˜ ˜ P(FN = fN ) 1 Γ(θ + 1) KN N Γ(θ + 1) Γ(θ + α 1 + n) = γ exp γ − K ! Γ(θ + α) Γ(θ + n) Γ(θ + α) N − n=1 · !   X KN Γ(MN,k α) Γ(θ + N MN,k + α) − − . (5.7) · Γ(1 α) · Γ(θ + N) "k=1 # Y − ˜ The distribution of FN has no dependence on the ordering of the indices in [N]. Hence, the distribution of FN depends only on the same quantities—the number of indices and the feature sizes—and the feature multiplicities. So we see that the 3IBP construction yields an exchangeable random feature allocation. Consistency follows from the recursive construction and exchangeability. Therefore, Eq. (5.7) is seen to be in EFPF form given by Eq. (5.6).  The three-parameter Indian buffet process has an EFPF representation, but the following simple model does not.

Example 5.4.4 (A general two-feature allocation). We here describe an exchangeable, con- sistent random feature allocation whose (randomly ordered) distribution does not depend only on the number of indices N and the sizes of the features of the allocation. Let p10, p01, p11, p00 be fixed frequencies that sum to one. Let Yn represent the collection of features to which index n belongs. For n 1, 2 , choose Yn independently and identically according to: ∈ { } 1 with probability p10 { } 2 with probability p01 Yn =  { } 1, 2 with probability p11  { }  with probability p00. ∅  We form a feature allocation from these labels as follows. For each label (1 or 2), collect those indices n with the given label appearing in Yn to form a feature. 0 Now consider two possible outcome feature allocations: f2 = 2 , 2 , and f2 = ˜ {{ } { }} 1 , 2 . The probability of any ordering f2 of f2 under this model is {{ } { }} ˜ ˜ 0 0 1 1 P(F2 = f2) = p10 p01 p11 p00. CHAPTER 5. FEATURE ALLOCATIONS, PROBABILITY FUNCTIONS, AND PAINTBOXES 155 To see this result, note the distinction between indices 1, 2 and the feature labels 1, 2 { } ˜0 0 { } used in an intermediate step above. Likewise, the probability of any ordering f2 of f2 is ˜ ˜0 1 1 0 0 P(F2 = f2) = p10 p01 p11 p00.

It follows from these two probabilities that we can choose values of p10, p01, p11, p00 such that ˜ ˜ ˜ ˜0 ˜ ˜0 P(F2 = f2) = P(F2 = f2). But f2 and f2 have the same feature counts and N value (N = 2). 6 So there can be no such symmetric function p, as in Eq. (5.5), for this model. 

5.5 The Kingman paintbox and feature paintbox

Since the class of exchangeable feature models with EFPFs is a strict subclass of the class of exchangeable feature models, it remains to find a characterization of the latter class. Noting † that the sequence of feature collections Yn is an exchangeable sequence when the uniform random labeling of features is used, we might turn to the de Finetti mixing measure of this exchangeable sequence for such a characterization. Indeed, in the partition case, the Kingman paintbox (Kingman, 1978; Aldous, 1985) provides just such a characterization.

∞ Theorem 5.5.1 (Kingman paintbox). Let Π∞ := (Πn)n=1 be an exchangeable random par- ↓ tition of N, and let (Mn,k, k 1) be the decreasing rearrangement of cluster sizes of Πn ↓ ≥ ↓ ↓ with Mn,k = 0 if Πn has fewer than k clusters. Then Mn,k/n has an almost sure limit ρk ↓ as n for each k. Moreover, the conditional distribution of Π∞ given (ρ , k 1) is → ∞ k ≥ as if Π∞ were generated by random sampling from a random distribution with ranked atoms (ρ↓, k 1). k ≥ When the partition clusters are labeled with uniform random labels rather than by the ranking in the statement of the theorem above, Kingman’s paintbox provides the de Finetti mixing measure for the sequence of partition labels of each index n. Two representations of an example Kingman paintbox are illustrated in Figure 5.6. The Kingman paintbox is so named since we imagine each subinterval of the unit interval as containing paint of a certain color; the colors have a one-to-one mapping with the uniform random cluster labels. A random draw from the unit interval is painted with the color of the Kingman paintbox subinterval into which it falls. While Figure 5.6 depicts just four subintervals and hence at most four clusters, the Kingman paintbox may in general have a countable number of subintervals and hence clusters. Moreover, these subintervals may themselves be random. Note that the ranked atoms need not sum to one; in general, ρ↓ 1. When random k k ≤ sampling from the Kingman paintbox does not select some atom k with ρ↓ > 0, a new cluster P k is formed but it is necessarily never selected again for another index. In particular, then, a corollary of the Kingman paintbox theorem is that there are two types of clusters: those with unbounded size as the number of indices N grows to infinity and those with exactly one member as N grows to infinity; the latter are sometimes referred to as singletons or CHAPTER 5. FEATURE ALLOCATIONS, PROBABILITY FUNCTIONS, AND PAINTBOXES 156 1745362

1745362

φ2 φ1 φ4 φ3

Figure 5.6: Left: An example Kingman paintbox. The upper rectangle represents the unit interval. The lower rectangles represent a partition of the unit interval into four subin- tervals corresponding to four clusters. The horizontal locations of the seven vertical lines represent seven uniform random draws from the unit interval. The resulting partition of [7] is 3, 5 , 7, 1, 2 , 6 , 4 . Right: An alternate representation of the same Kingman paintbox,{{ now} { with} each{ } subinterval{ }} separated out into its own vertical level. To the right of each cluster subinterval is a uniform random label (with index determined by order of appearance) for the cluster.

collectively as Kingman dust. In the feature case, we impose one further regularity condition that essentially rules out dust. Consider any feature allocation F∞. Recall that we use the † notation Yn to indicate the set of features to which index n belongs. We assume that, for each † † n, with probability one there exists some m with m = n such that Ym = Yn . Equivalently, with probability one there is no index with a unique6 feature collection. We call a random feature allocation that obeys this condition a regular feature allocation. We can prove the following theorem for the feature case, analogous to the Kingman paintbox construction for partitions.

Theorem 5.5.2 (Feature paintbox). Let F∞ := (Fn) be an exchangeable, consistent, regular ∞ random feature allocation of N. There exists a random sequence (Ck)k=1 such that Ck is a countable union of subintervals of [0, 1] (and may be empty) and such that F∞ has the same 0 0 0 distribution as F∞ where F∞ is generated as follows. Randomly sample (Un)n iid uniform 0 in [0, 1]. Let Yn := k : Un Ck represent a collection of feature labels for index n, and let 0 { ∈ } F∞ be the induced feature allocation from these label collections.

† ∞ Proof. Given F∞ as in the theorem statement, we can construct (Yn )n=1 as in Lemma 5.3.4. † ∞ † Then, according to Lemma 5.3.4, (Yn )n=1 is an exchangeable sequence. Note that Yn defines a partition: n m (i.e., n and m belong to the same cluster of the partition) if and only † † ∼ if Yn = Ym. This partition is exchangeable since the feature allocation is. Moreover, since we assume there are no singletons in the induced partition (by regularity), the Kingman paintbox theorem implies that the Kingman paintbox atoms sum to one. By de Finetti’s theorem (Aldous, 1985), there exists α such that α is the directing random † ∞ measure for (Y ). Condition on α = µ. Write µ = qjδx , where the qj satisfy qj (0, 1] n j=1 j ∈ and are written in monotone decreasing order: q1 q2 . The condition that the atoms ≥P ≥ · · · CHAPTER 5. FEATURE ALLOCATIONS, PROBABILITY FUNCTIONS, AND PAINTBOXES 157 1745362

φ2 φ3 φ1 φ4 φ5

Figure 5.7: An example feature paintbox. The top rectangle represents the unit interval. Each vertical level below the top rectangle represents a subset of the unit interval corre- sponding to a feature. To the right of each subset is a uniform random label for the feature. For example, using the notation of Theorem 5.5.2, the topmost subset is C2 correspond- ing to feature label φ2. The vertical dashed lines represent uniform random draws; i.e., 0 Un for index n. The resulting feature allocation of [7] for this realization of the construc- tion is 3, 5, 7, 1 , 5, 7 , 7, 1 , 6 , 6 . The collection of feature labels for index 7 is {{ } { } { } { } { }} Y7 = φ2, φ3, φ1 . The collection of feature labels for index 4 is Y4 = . { } ∅

∞ of the paintbox sum to one translates to j=1 qj = 1. The (xj) are the (countable) unique values of Y †, ordered to agree with the q . The strong yields n jP −1 † N # n : n N,Y = xj qj,N . { ≤ n } → → ∞ ∞ Since j=1 qj = 1, we can partition the unit interval into subintervals of length qj. The jth such subinterval starts at s := j−1 q and ends at e := s . For k = 1, 2,..., define P j l=1 l j j+1 C := [s , e ). We call the (C )∞ the feature paintbox. k j:φk∈xj j j k k=1 P 0 0 Then F∞ has the same distribution as the following construction. Let (U1,U2,...) be an S 0 iid sequence of uniform random variables. For each n, define Yn = k : Un Ck to be the { ∈ } 0 collection of features, now labeled by positive integers, to which n belongs. Let F∞ be the feature allocation induced by the (Yn). A point to note about this feature paintbox construction is that the ordering of the feature paintbox subsets Ck in the proof is given by the order of appearance of features in the original feature allocation F∞. This ordering stands in contrast to the ordering of atoms by size in the Kingman paintbox. Making use of such a size-ordering would be more difficult in the feature case due to the non-trivial intersections of feature subsets. A particularly important implication is that the conditional distribution of F∞ given (Ck)k is not the same 0 as that of F∞ given (Ck)k (cf. Pitman (1995) for similar ordering issues in the partition case). An example feature paintbox is illustrated in Figure 5.7. Again, we may think of each feature paintbox subset as containing paint of a certain color (where these colors have a one- to-one mapping with the uniform random labels). Draws from the unit interval to determine the feature allocation may now be painted with some subset of these colors rather than just a single color. CHAPTER 5. FEATURE ALLOCATIONS, PROBABILITY FUNCTIONS, AND PAINTBOXES 158

p10 p11 p01 p00

Figure 5.8: A feature paintbox for the two-feature allocation in Example 5.4.4. The top rectangle is the unit interval. The middle rectangle is the feature paintbox subset for feature 1. The lower rectangle is the feature paintbox subset for feature 2.

Next, we revisit earlier examples to find their feature paintbox representations. Example 5.5.3 (A general two-feature allocation (continued)). The feature paintbox for the random feature allocation in Example 5.4.4 consists of two features. The total measure of the paintbox subset for feature 1 is p10 + p11. The total measure of the paintbox subset for feature 2 is p01 + p11. The total measure of the intersection of these two subsets is p11. A depiction of this paintbox appears in Figure 5.8.  Example 5.5.4 (Three-parameter Indian buffet process (continued)). The 3IBP turns out to be an instance of a general class of exchangeable feature models that we refer to as feature frequency models. This class of models not only provides a straightforward way to construct feature paintbox representations in general, but also plays a key role in our general theory, providing a link between feature paintboxes and EFPFs. In the following section, we define feature frequency models, develop the general construction of paintboxes from feature frequency models, and then return to the construction of the feature paintbox for the 3IBP as an example. We subsequently turn to the general theoretical characterization of feature frequency models. 

5.6 Feature frequency models

We now discuss a general class of exchangeable feature models for which it is straightforward to describe the feature paintbox. Let (Vk) be a sequence of (not necessarily independent) ∞ iid random variables with values in [0, 1] such that Vk < almost surely. Let φk k=1 ∞ ∼ Unif[0, 1] and independent of the (Vk). A feature frequency model is built around a random ∞ P measure B = k=1 Vkδφk . We may draw a feature allocation given B as follows. For each data point n, independently draw its features like so: for each feature indexed by k, P independently make a Bernoulli draw with success probability Vk. If the draw is a success, n belongs to the feature indexed by k (i.e., the feature with label φk). If the draw is a failure, n does not belong to the feature indexed by k. The feature allocation is induced in the usual way from these labels. The condition that the frequencies have an almost surely finite sum guarantees, by the Borel-Cantelli lemma, that the number of features exhibited by any index n is almost surely CHAPTER 5. FEATURE ALLOCATIONS, PROBABILITY FUNCTIONS, AND PAINTBOXES 159

φ1 φ2 φ3 .

Figure 5.9: An example feature paintbox for a feature frequency model (Section 5.6). One such model is the 3IBP (Example 5.6.1).

finite, as required in the definition of a feature allocation. We obtain exchangeable feature allocations simply by virtue of the fact that the feature allocations are independently and identically distributed given B. The Bernoulli draws from the feature frequencies guarantee that the feature allocation is regular. Before constructing the feature paintbox for such a model, we note that Vk is the total length of the paintbox subset for the feature indexed by k. In this sense, it is the frequency of this feature (hence the name “feature frequency model”). And φk is the uniform random feature label for the feature with frequency Vk. Finally, to achieve the independent Bernoulli draws across k required by the feature allocation specification, we need for the intersection of any two paintbox subsets to have length equal to the product of the two paintbox subset lengths. This desideratum can be achieved with a recursive construction. First, divide the unit interval into one subset (call it I1) of length V1 and another subset (call it I0) of length 1 V1. Then I1 is the paintbox subset for the feature indexed by 1. Recursively, suppose we− have paintbox subsets for features indexed 1 to K 1. Let e be a − binary string of length K 1. Suppose that Ie is the intersection of (a) all paintbox subsets for features indexed by k −(k < K) where the kth digit of e is 1 and (b) all paintbox subset complements for features indexed by k (k < K) where the kth digit of e is 0. For every e, we construct I(e,1) to be a subset of Ie with total length equal to VK times the length of Ie. We construct I(e,0) to be Ie I(e,1). \ 0 Finally, the paintbox subset for the feature indexed by K is the union of all Ie0 with e a binary string of length K such that the final digit of e0 is 1. An example of such a paintbox is illustrated in Figure 5.9.

Example 5.6.1 (Three-parameter Indian buffet process (continued)). We show that the three-parameter Indian buffet process is an example of a feature frequency model, and thus its feature paintbox can be constructed according to the general recipe that we have just presented. The underlying random measure for the three-parameter Indian buffet process is known as the three-parameter beta process (Teh and G¨or¨ur,2009; Broderick, Jordan, and Pitman, 2012). This random measure, denoted B, can be constructed explicitly via the following recursion (with K0 = 0 and n = 1, 2,...), which extends the results of Thibaux and Jordan CHAPTER 5. FEATURE ALLOCATIONS, PROBABILITY FUNCTIONS, AND PAINTBOXES 160 (2007):

Γ(θ + 1) Γ(θ + α 1 + n) K+ Poisson γ − , n ∼ Γ(θ + n) · Γ(θ + α)  +  Kn = Kn−1 + Kn

Vk Beta(1 α, θ + n 1 + α), k = Kn−1 + 1,...,Kn ∼ − − φk Unif[0, 1] ∼ ∞

B = Vkδφk , k=1 X where we recall that the φk are assumed to be drawn from the uniform distribution for simplicity in this chapter, but in general they may be drawn from a continuous distribution that serves as a prior for the parameters defining a likelihood. ∞ Given B = k=1 Vkδφk , the feature allocation is drawn according to the procedure out- lined for feature frequency models conditioned on the underlying random measure. Teh and G¨or¨ur(2009) demonstrateP that the distribution of the resulting feature allocation is the same as if it were generated according to a three-parameter Indian buffet process. An alterna- tive proof proceeds as in the two-parameter case covered by Broderick, Jordan, and Pitman (2013).  We have seen that the 3IBP can be represented as a feature frequency model. It is straightforward to observe that the two-feature model in Examples 5.4.4 and 5.5.3 cannot be represented as a feature frequency model unless the intersection of the feature subsets has length p11 equal to the product of the feature subset lengths (p10 + p11 and p01 + p11); i.e., unless (p10 + p11)(p01 + p11) = p11 (cf. Figure 5.8). Therefore, we have the following result similar to Proposition 5.4.2.

Proposition 5.6.2. The class of feature frequency models is a strict but non-empty subclass of the class of exchangeable feature allocations.

In proving Propositions 5.6.2 and 5.4.2, we used the 3IBP as an example that belongs to both the class of feature models with EFPFs and the class of feature frequency models. Moreover, in both cases we used two-feature models as an example of exchangeable feature models that do not belong to these subclasses; in particular, we used two-feature models in which the feature combination probabilities p10, p01, p11, p00 are not in the necessary propor- tions. These observations suggest that feature frequency models and EFPFs may be linked. We flesh out the relationship between the two representations in the next few results. We start with a priori labeled features. Recall from Section 5.3 that an a priori labeled feature allocation is to a feature allocation what a classification is to a clustering; that is, the feature labels are known in advance. The case where we know the feature order in advance is somewhat easier and gives intuition for the type of result we would like in the true feature allocation case. In particular, we prove the results for the case of two a priori labeled features CHAPTER 5. FEATURE ALLOCATIONS, PROBABILITY FUNCTIONS, AND PAINTBOXES 161 in Theorem 5.6.3 and then the case of an unbounded number of a priori labeled features in Theorem 5.6.4. From there, we move on to the (a priori) unlabeled case that is the focus of the chapter and prove the equivalence of EFPFs and a slight extension of feature frequency models in Theorem 5.6.5. Theorem 5.6.3. Consider a model with two a priori labeled features: feature 1 and feature 2. If the two features are generated from labeled feature frequencies, the probability of an a priori labeled feature allocation of [N] with MN,1 occurrences of feature 1 and MN,2 occurrences of feature 2 takes the form pˇ(N; MN,1,MN,2), where we make no symmetry assumptions about pˇ here and also allow any of MN,1 and MN,2 to be zero. Conversely, if the probability of any a priori labeled feature allocation can be written as pˇ(N; MN,1,MN,2), then the feature allocation has the same distribution as if it were generated from labeled feature frequencies. Proof. Note that throughout this proof we consider the probability of a particular labeled feature allocation of [N] with MN,1 occurrences of feature 1 and MN,2 occurrences of feature 2, as distinct from the probability of all labeled feature allocations of [N] with MN,1 occurrences of feature 1 and MN,2 occurrences of feature 2. The latter, which is not addressed here, would be the sum over instances of the former. In particular, recalling the matrix representation from Section 5.3, there are N N M M  N,1 N,2 possible N 2 matrices with MN,1 ones in the first column and MN,2 ones in the second column. × The reader may feel there is some similarity in this setup to the two-feature allocation of Examples 5.4.4 and 5.5.3. We note that the quantities p10, p01, p11, p00—which retain essentially the same meaning as in Figure 5.8—may now be random and that their order is pre-specified and non-random. First, we calculate the probability of a certain labeled feature configuration under this 0 0 model. Let Mn,10 be the number of indices in [n] with feature 1 but not feature 2. Let Mn,01 0 be the number of indices in [n] with feature 2 but not feature 1. Let Mn,00 count the indices 0 with neither feature, and let Mn,11 count the indices with both features. Then

M 0 M 0 M 0 M 0 ˆ ˆ ˆ ˆ N,10 N,01 N,11 N,00 P(FN,1 = fN,1, FN,2 = fN,2) = E(p10 p01 p11 p00 ). (5.8)

Denote the total probabilities of features 1 and 2 as, respectively, q1 = p10 + p11 and q2 = p01 + p11. Suppose that we have a feature frequency model. This assumption implies that

a.s. a.s. a.s. a.s. p10 = q1(1 q2), p01 = (1 q1)q2, p11 = q1q2, p00 = (1 q1)(1 q2), (5.9) − − − − where any one of the equalities in Eq. (5.9) implies the others. It follows that

M M ˆ ˆ ˆ ˆ N,1 N−MN,1 N,2 N−MN,2 P(FN,1 = fN,1, FN,2 = fN,2) = E[q1 (1 q1) q2 (1 q2) ], (5.10) − − CHAPTER 5. FEATURE ALLOCATIONS, PROBABILITY FUNCTIONS, AND PAINTBOXES 162

0 0 where Mn,1 = Mn,10 + Mn,11 is the total number of indices with feature 1, and likewise 0 0 Mn,2 = Mn,01 + Mn,11 is the total number of indices with feature 2. So we see that making a feature frequency model assumption yields a feature allocation probability in Eq. (5.10) that depends only on N,MN,1,MN,2. Since we retain the known labeling in this example, the probability is not symmetric in MN,1 and MN,2. In the other direction, suppose we know that ˆ ˆ ˆ ˆ P(FN,1 = fN,1, FN,2 = fN,2) =p ˇ(N,MN,1,MN,2) (5.11) for some functionp ˇ. Again, we make no symmetry assumptions aboutp ˇ here, and any of MN,1 and MN,2 may be zero. Then frequencies p10, p01, p11, p00 must exist by the law of large numbers; we note they may be random. The assumption in Eq. (5.11) implies that the configurations

0 0 0 0 (M4,10,M4,01,M4,00,M4,11) = (2, 2, 0, 0) 0 0 0 0 (M4,10,M4,01,M4,00,M4,11) = (0, 0, 2, 2) 0 0 0 0 (M4,10,M4,01,M4,00,M4,11) = (1, 1, 1, 1) have the same probability. That is, by Eq. (5.8),

2 2 2 2 E[p10p01] = E[p11p00] = E[p10p01p11p00]. It follows that

2 2 2 2 2 E[(p10p01 p11p00) ] = E[p10p01 + p11p00 2p10p01p11p00] = 0. − − a.s. So it must be that p10p01 = p11p00. Recall that this condition is familiar from Example 5.4.4. Adding p10p11 to both sides of the almost sure equality and then further adding p11(p01 + p11) to both sides yields

a.s. (p10 + p11)(p01 + p11) = p11(p10 + p01 + p11 + p00), which reduces to a.s. q1q2 = p11 from the definitions of q1 and q2 and from the fact that p10 + p01 + p11 + p00 = 1. By Eq. (5.9) and surrounding text, we see that Eq. (5.11) implies our model is a feature frequency model. Thus, the equivalence between models with a priori labeled EFPFs and a priori labeled feature frequency models in the case of two features results from simple algebraic manipulations. Extending the argument above becomes more tedious when more than two features are involved. In the case of multiple, or even countably many, labeled features, a more elegant proof exists. CHAPTER 5. FEATURE ALLOCATIONS, PROBABILITY FUNCTIONS, AND PAINTBOXES 163 Theorem 5.6.4. Consider a model with features a priori labeled 1, 2, 3,.... If the features are generated from labeled feature frequencies, the probability of an a priori labeled feature al- location of [N] with K or fewer features and MN,k occurrences of feature k for k 1,...,K ∈ { } takes the form pˇ(N; MN,1,...,MN,K ), where we make no symmetry assumptions about pˇ here and note that any of MN,1,...,MN,K may be zero. Call pˇ a labeled EFPF. Conversely, if the probability of any a priori labeled feature allocation can be written as pˇ(N; MN,1,...,MN,K ), then the feature allocation has the same distribution as if it were generated from labeled feature frequencies.

Proof. First, consider the claim that every labeled feature frequency model has a labeled EFPF. This claim is intuitively clear since the independent Bernoulli draws at each atom of ∞ the (potentially random) measure B = k=1 Vkδφk result in a probability that depends only on the number of occurrences of the corresponding feature and not any interactions between features. P ˆ To show this direction formally, we consider a fixed, labeled feature allocation fN = (AN,1,AN,2,...,AN,K ) with MN,k := AN,k and note that | | ˆ ˆ P(FN = fN ) ˆ ˆ = E P(FN = fN B) | h K i ∞ MN,k N−MN,k N = E Vk (1 Vk) (1 Vk) . − · − " k=1 ! k=K+1 !# Y Y ˆ ˆ It follows that P(FN = fN ) hasp ˇ form. Now consider the other direction. We start with a labeled feature allocation F∞. In this case, we know that for every labeled feature allocation of [N], ˆ fN = (AN,1,...,AN,K ),

we have that a functionp ˇ exists in the form ˆ ˆ P(FN = fN ) =p ˇ(N, AN,1 ,..., AN,K ), (5.12) | | | |

with no additional symmetry assumptions forp ˇ and where the block sizes MN,k = AN,k may be zero. | | Let Zn,k be one if n belongs to the kth feature (i.e., n AN,k) or zero otherwise. Let ∈ b1, . . . , bk be values in 0, 1 . Our goal is to show that conditional on some (as yet unknown) labeled feature frequencies,{ } the probability of feature presence factorizes as independent Bernoulli draws:

K bk 1−bk P(Z1,1 = b1,...,Z1,K = bK V1,...,VK ) = Vk (1 Vk) . (5.13) | − k=1 Y CHAPTER 5. FEATURE ALLOCATIONS, PROBABILITY FUNCTIONS, AND PAINTBOXES 164

By the assumption onp ˇ, the labeled feature sizes MN,1,...,MN,K are sufficient for the distribution of the labeled feature allocation. Let ξN be the sigma-field of events invariant under permutations of the first N indices. We note that MN,1,...,MN,K are measurable with respect to ξN and start by considering

P(Z1,1 = b1,...,Z1,K = bK ξN ) | K = P(Z1,k = bk Z1,1 = b1,...,Z1,k−1 = bk−1, ξN ). (5.14) | k=1 Y Then since the feature sizes are sufficient for the feature allocation distribution, we have

P(Z1,k = bk Z1,1 = b1,...,Z1,k−1 = bk−1, ξN ) | = P(Z1,k = bk ξN ) | 1 N = (Z = b ξ ) N P n,k k N n=1 | X 1 N = 1 Z = b ξ E N n,k k N " n=1 { }| # X 1 N = 1 Z = b . N n,k k n=1 { } X The last line follows since the sum is measurable in ξN . By the strong law of large numbers, the final sum converges almost surely as N to some potentially random value in [0, 1]; → ∞ call it Vk if bk = 1. By Eq. (5.14), then, we have

K a.s. bk 1−bk P(Z1,1 = b1,...,Z1,K = bK ξN ) Vk (1 Vk) . (5.15) | −→ − k=1 Y We next observe that the lefthand side of Eq. (5.15) is a reverse martingale. (ξN ) is a reversed filtration since ξN ξN+1 for all N. Moreover, (1) P(Z1,1 = b1,...,Z1,K = bK ξN ) ⊇ | is measurable with respect to ξN ; (2) the same quantity is integrable; and (3) by the tower law,

E [P(Z1,1 = b1,...,Z1,K = bK ξN ) ξN+1] = P(Z1,1 = b1,...,Z1,K = bK ξN+1). | | | Since P(Z1,1 = b1,...,Z1,K = bK ξN ) is a reverse martingale, we have that | a.s. P(Z1,1 = b1,...,Z1,K = bK ξN ) P(Z1,1 = b1,...,Z1,K = bK ξ∞) | −→ | ∞ for ξ∞ = n=1 ξn by reverse martingale convergence. Together with Eq. (5.15), this conver- gence implies that T K bk 1−bk P(Z1,1 = b1,...,Z1,K = bK ξ∞) = Vk (1 Vk) , | − k=1 Y CHAPTER 5. FEATURE ALLOCATIONS, PROBABILITY FUNCTIONS, AND PAINTBOXES 165

and since the Vk are measurable with respect to ξ∞, the tower law yields Eq. (5.13), as was to be shown. While illustrative, the two previous results do not directly deal with feature allocations as defined earlier in this chapter; namely, they do not show any equivalence between EFPFs and feature frequency models in the case where the features are unlabeled (which is exactly the case where EFPFs are defined). We will show in the unlabeled case that every feature frequency model has an EFPF and that every regular feature allocation with an EFPF is an feature frequency model. In fact, we can consider a general—i.e., not necessarily regular— feature allocation and characterize the EFPF representation in this case. Theorem 5.6.5. Let λ be a non-negative random variable (which may have some arbitrary joint law with the feature frequencies in a feature frequency model). We can obtain an exchangeable feature allocation by generating a feature allocation from a feature frequency model and then, for each index n, including an independent Poisson(λ)-distributed number of features of the form n in addition to those features previously generated (which may also include index n). A feature{ } allocation of this type has an EFPF. Conversely, every feature allocation with an EFPF has the same distribution as one generated by this construction for some joint distribution of λ and the feature frequencies. Proof. Suppose a feature allocation f˜ is generated as described by the construction in The- ∞ orem 5.6.5 with (potentially random) measure B = k=1 Vkδφk giving the frequencies in the feature frequency model component. We wish to show that the feature allocation has an EFPF. We will make use of the fact that an equivalentP way to generate the Poisson component of the feature allocation is to draw Poisson (Nλ) singletons and then assign each uniformly at random to an index in [N]. ˜ Consider fN = (A1,A2,...,AK ). Let S = k : Ak = 1 represent the feature indices of the singletons of the feature allocation. These{ features| | may} have been generated either from the feature frequency model or from the Poisson component. To find the probability of the feature allocation, we consider each possible association of singletons to one of these components. For any such association, let S˜ represent those singletons assigned to the Poisson component; that is, S˜ S. Let K˜ = K S˜ represent the number of remaining features, which we denote by ⊆ − | | ˜ ˜ (A1,..., AK˜ ). Then the probability of this feature allocation satisfies ˜ ˜ P(FN = fN ) ˜ ˜ = E P(FN = fN B, λ) | h i

−S˜ = E  N Poisson S˜ Nλ | ˜ ˜ (i ,...,i ) SX:S⊆S   1XK˜  distinct  CHAPTER 5. FEATURE ALLOCATIONS, PROBABILITY FUNCTIONS, AND PAINTBOXES 166

1 |A˜ | ˜ |A˜ ˜ | ˜ 1 N−|A1| K N−|AK˜ | N Vi (1 Vi1 ) Vi (1 Vi ˜ ) (1 Vl)  . K! 1 − ··· K˜ − K − l∈N  l∈{ / iY1,...,i ˜ }   K    The final expression depends only on the number of data points N and feature sizes and is symmetric in the feature sizes. So it has EFPF form. In the other direction, we sidestep the issue of feature ordering by looking at the number of features to which each data index belongs. The advantage of this approach is that this number does not depend on the feature order. The following result is the key to making use of this observation.

Lemma 5.6.6. Let Kn be a sequence of positive integers. For each n, suppose we have (constants) 1 pn,1 pn,2 ... pn,K > 0. ≥ ≥ ≥ ≥ n And, for completeness, suppose pn,k = 0 for k > Kn. Let Xn,k Bern(pn,k), independently Kn ∼ across n and k and with k = 1 : Kn. Define #n := k=1 Xn,k. Then the following are equivalent. P d 1. #n # for some finite-valued random variable # on 0, 1, 2,... . → { } ∞ 2. There exist (constants) pk k=1 and λ such that pk [0, 1] and λ > 0 and further such that, k = 1, 2,..., { } ∈ ∀ pn,k pk, n (5.16) → → ∞ and Kn ∞

pn,k pk + λ, n . (5.17) → → ∞ k=1 k=1 X X In this case, we further have 1 p1 p2 , (5.18) ≥ ≥ ≥ · · · and ∞ d # = Y + Xk, (5.19) k=1 X where Xk Bern(pk), independently across k, and Y Poisson(λ). ∼ ∼ The proof of Lemma 5.6.6 appears in Appendix 5.B; this lemma is essentially a special case of a more general result in Appendix 5.A. In this direction of the proof of Theorem 5.6.5, we want to show that if we assume that the probability of a feature allocation takes EFPF form, then the allocation has the same distribution as if it were generated according to a feature frequency model with a Poisson- distributed number of features for each n. To see how Lemma 5.6.6 may be useful, we let #ˆ be the number of features in which index 1 occurs. Recall that in order to use CHAPTER 5. FEATURE ALLOCATIONS, PROBABILITY FUNCTIONS, AND PAINTBOXES 167 the EFPF, we apply a uniform random ordering to the features of our feature allocation. Examining #ˆ is advantageous since it is invariant to the ordering of the features, and we can thereby avoid complicated considerations that may arise related to the feature ordering and consistency of ordering across feature allocations of increasing index sets. Indeed, recall that once we have chosen a uniform random ordering for the features, the EFPF assumption tells us that any feature allocation with the requisite feature sizes and number of indices has the same probability. Let KN be the number of features containing indices [N]. If MN,k is the size of the kth feature (under the uniform random ordering) after N indices, then there are N N M ··· M  N,1  N,KN  such configurations. MN,1/N have index 1 in the first feature. For each such allocation, there are equally many configurations of the remaining features. So, for each such allocation, MN,2/N have index 1 in the second feature. And so on. That is, we have that, conditionally on the feature sizes, the number of features with index 1 has the same distribution as a sum of Bernoulli random variables:

KN ˜ ˜ indep XN,k, XN,k Bern(MN,k/N). (5.20) ∼ k=1 X First, we note that the feature sizes are sufficient for the distribution by the EFPF assump- tion. So we may, in fact, condition on ξN , which we define to be the sigma-field of events ˆ invariant under permutations of the indices n = 1,...,N. That is, # ξN has the same distribution as the sum in Eq. (5.20). | Second, we note that the sum in Eq. (5.20) has no dependence on the ordering of the features. In particular, then, let 1 pN,1 pN,2 pN,KN be the sizes of the features divided by N and ordered so as to≥ be monotonically≥ ≥ · · · decreasing. ≥ Again, note that we are only considering those features including some data index in [N]. It follows that

KN ˆ d ˜ ˜ indep # ξN = XN,k, XN,k Bern(pN,k). (5.21) | ∼ k=1 X So we see that we have circumvented ordering concerns and can simply use a size ordering in what follows. ˆ At this point, it seems natural to apply Lemma 5.6.6 to # ξN . To do so, we need to ˆ | show that # ξN converges in distribution to some random variable with non-negative integer | values as N . To that end, we note that (ξN ) is a reversed filtration: ξN ξN+1 → ∞ ˆ ˆ ⊇ for all N. And further P(# = j ξN ) is a reversed martingale since (1) P(# = j ξN ) is | ˆ | measurable with respect to ξN ; (2) P(# = j ξN ) is integrable; and (3) by the tower law, ˆ ˆ | E[P(# = j ξN ) ξN+1] = P(# = j ξN+1). It follows that | | | ˆ a.s. ˆ P(# = j ξN ) P(# = j ξ∞) | −→ | CHAPTER 5. FEATURE ALLOCATIONS, PROBABILITY FUNCTIONS, AND PAINTBOXES 168 and hence ˆ d ˆ # ξN # ξ∞ a.s. | −→ | ∞ for ξ∞ = n=1 ξn by reverse martingale convergence. So we may apply Lemma 5.6.6 conditional on ξ∞. By the lemma, we have that, condi- T tional on ξ∞,

∞ ˆ d # = Y + Xk k=1 X Y Poisson(λ) ∼ indep Xk Bern(pk) ∼ for some λ 0 and some 1 p1 p2 . The conditioning on ξ∞ means that, in general, ≥ ≥ ≥ ≥ · · · λ and the frequencies 1 p1 p2 may be positive random variables, as was to be shown. ≥ ≥ ≥ · · ·

5.7 Discussion

It has been known for some time that the class of exchangeable partitions is the same as the class of partitions generated by the Kingman paintbox, which is in turn the same as the class of partitions with exchangeable partition probability functions (EPPFs). In this chapter, we have developed an analogous set of concepts for the feature allocation problem. We defined a feature allocation as an extension of partitions in which indices may belong to multiple groups, now called features. We have developed analogues of the EPPF and the Kingman paintbox, which we refer to as the exchangeable feature partition function (EFPF) and the feature paintbox, respectively. The feature paintbox allows us to construct a feature allocation via iid draws from an underlying collection of sets in the unit interval. In the special cases of partitions and feature frequency models the construction of these sets is particularly straightforward. The Venn diagram presented earlier in Figure 5.1 summarizes our results and also sug- gests a number of open areas for further investigation. In particular it would be useful to develop a fuller understanding of the regularity condition on feature allocations that allows the connection to the feature paintbox. It would also be of interest to carry the program fur- ther by exploring generalizations of the partition and feature allocation framework to other combinatorial representations, such as the setting in which we allow multiplicity within, as well as across, features (Broderick, Mackey, et al., 2014; Zhou et al., 2012). CHAPTER 5. FEATURE ALLOCATIONS, PROBABILITY FUNCTIONS, AND PAINTBOXES 169 5.A Intermediate lemmas leading to Lemma 5.6.6

To prove Lemma 5.6.6, we will make use of a few definitions and lemmas. We start with two definitions. First, suppose we have constants p1, p2, p3,... such that

1 p1 p2 p3 ... 0 ≥ ≥ ≥ ≥ ≥ and a constant λ such that 0 λ < . Then we say that the random variable # has the ≤ ∞ extended Poisson-binomial distribution with parameters (λ, p1, p2,...) if there exist indepen- dent random variables X0,X1,X2,... with

X0 Poisson(λ) ∼ Xk Bern(pk), k = 1, 2,... ∼ such that ∞

# = X0 + Xk. k=1 X The terminology “extended Poisson-binomial distribution” is motivated by the familiar Poisson-binomial distribution (Y. H. Wang, 1993; S. X. Chen and Liu, 1997; O. Johnson, Kontoyiannis, and Madiman, 2011), which describes the special case of the above where λ = 0 and pk = 0 for all k > K for some finite K. Second, we say that µ is the spike size-location measure with parameters (λ, p1, p2,...) if µ puts mass λ at 0 and mass pk at pk for k = 1, 2,.... With these definitions in hand, we can state the following lemmas. Lemma 5.A.1. Let # have the extended Poisson-binomial distribution with parameters (λ, p1, p2,...). Then

∞ 1. # is a.s. finite if and only if pk < . k=1 ∞ 2. If # is a.s. finite, then the parametersP (λ, p1, p2,...) are uniquely determined by the distribution of #.

In particular, since the parameters (λ, p1, p2,...) uniquely determine the distribution of #, Lemma 5.A.1 tells us that there is a bijection between the distribution of # and the pa- rameters (λ, p1, p2,...) when # is a.s. finite. See Appendix 5.C for the proof of Lemma 5.A.1. The next lemma tells us that this correspondence between distributions and parameters is also continuous in a sense.

Lemma 5.A.2. For n = 1, 2,..., let #n have the extended Poisson-binomial distribution with parameters (λn, pn,1, pn,2,...). Let µn be the spike size-location measure with parameters (λn, pn,1, pn,2,...). Then the following two statements are equivalent: CHAPTER 5. FEATURE ALLOCATIONS, PROBABILITY FUNCTIONS, AND PAINTBOXES 170

1. #n converges in distribution to a finite-valued limit random variable.

2. µn converges weakly to some finite measure on [0, 1]. If the convergence holds, the limiting random variable (call it #) has an extended Poisson- binomial distribution, and the limiting measure (call it µ) is a spike size-location measure. In this case, # and µ have the same parameters; call the parameters (λ, p1, p2,...). This lemma is suggested by, and provides an extension to, previous results on triangular arrays of random variables with row sums converging in distribution; cf. Kallenberg (2002). See Appendix 5.D for the proof of Lemma 5.A.2. Lemma 5.6.6 highlights a special case of Lemmas 5.A.1 and 5.A.2 that we use to prove the equivalence in Theorem 5.6.5.

5.B Proof of Lemma 5.6.6

We can rephrase the statement of Lemma 5.6.6 in terms of the terminology introduced in Appendix 5.A. In particular, we are given a sequence of random variables #n, where #n has an extended Poisson-binomial distribution with parameters

(0, pn,1, pn,2, . . . , pn,Kn , 0, 0,...).

Then we see that Lemma 5.6.6 is essentially a special case of Lemma 5.A.2 where λn and all but finitely many of the pn,k are equal to zero; this special case is exactly the usual Poisson-binomial distribution.

(1) (2). We assume that #n converges in distribution to some finite-valued random ⇒ variable #, and we wish to show that the pn,k converge to some limiting pk as n Kn ∞ → ∞ for each k, and likewise that k=1 pn,k converges to k=1 pk + λ for some non-negative constant λ. The pn,k are just the ordered atom sizes of the spike size-location measures µn in P P Lemma 5.A.2. By Lemma 5.A.2, the µn converge weakly to some spike size-location measure µ. Denote the parameters of µ by (λ, p1, p2,...). The convergence of µn to µ yields both the desired convergence of the atom sizes (Eq. (5.16), repeated here)

pn,k pk, n → → ∞ and the desired convergence of the total mass of µn (Eq. (5.17), repeated here)

Kn ∞

pn,k pk + λ, n . → → ∞ k=1 k=1 X X CHAPTER 5. FEATURE ALLOCATIONS, PROBABILITY FUNCTIONS, AND PAINTBOXES 171

(2) (1). Now we assume that the pn,k converge to some limiting pk as n for ⇒ Kn ∞ → ∞ each k, and likewise that k=1 pn,k converges to k=1 pk + λ for some appropriate positive constants pk , λ. We wish to show that #n converges in distribution to some finite-valued random variable{ } #. P P The assumed convergences guarantee the weak convergence of the spike size-location measures µn to some finite measure on [0, 1]. Lemma 5.A.2 then guarantees that #n converges in distribution to some finite-valued random variable #.

Assume (1) and (2). We wish to show that 1 p1 p2 ... (Eq. (5.18)), but this ≥ ≥ ≥ result follows from the monotonicity of the pn,k. Eq. (5.19) in the original lemma statement can be rephrased as wanting to show that # has the extended Poisson-binomial distribution with parameters (λ, p1, p2,...). This follows directly from the final statement in Lemma 5.A.2 and our identification of the limiting spike size-location measure µ as having parameters (λ, p1, p2,...) in a previous part of this proof (“(1) (2)”). ⇒ 5.C Proof of Lemma 5.A.1

Throughout we assume that # has the extended Poisson-binomial distribution with param- eters (λ, p1, p2,...).

∞ (1). We want to show that # is a.s. finite if and only if k=1 pk < . Since # is extended ∞ ∞ Poisson-binomially distributed, we can write # = X0 + k=1 Xk for independent X0 P ∞ ∞ ∼ Poisson(λ) and Xk Bern(pk) for k = 1, 2,.... First suppose k=1 pk < . Then k=1 Xk ∼ P ∞ ∞ ∞ is a.s. finite by the Borel-Cantelli lemma. Second, suppose pk = . Then Xk is Pk=1 ∞ Pk=1 a.s. infinite by the second Borel-Cantelli lemma. Since X0 is a.s. finite by construction, the result follows. P P

(2). We want to show that if # is a.s. finite, then the parameters (λ, p1, p2,...) are uniquely determined by the distribution of #. To that end, let µ be the spike size-location measure with parameters (λ, p1, p2,...) . Note that µ need not be a probability measure but is finite by the assumption that # is a.s. finite together with part (1) of the lemma. To better understand the distribution of #, we write the probability generating function of #. For s with 0 s 1, we have ≤ ≤ ∞ # −λ(1−s) Es = e [1 (1 s)pk] , − − k=1 Y which implies that for s with 0 < s 1 we have ≤ ∞ # log Es = λ(1 s) log [1 (1 s)pk] (5.22) − − − − − k=1 X CHAPTER 5. FEATURE ALLOCATIONS, PROBABILITY FUNCTIONS, AND PAINTBOXES 172 ∞ ∞ 1 = λ(1 s) + (1 s)jpj − j − k k=1 j=1 X X from the Taylor series expansion of the logarithm ∞ 1 ∞ = λ(1 s) + (1 s)j pj − j − k j=1 k=1 X X interchanging the order of summation since the summands are non-negative ∞ 1 = (1 s)µ 0 + (1 s)j xj−1µ(dx) (5.23) j − { } j=1 − (0,1] X Z ∞ 1 = (1 s)jm , (5.24) j j−1 j=1 − X where j mj := x µ(dx) Z[0,1] is the jth moment of the measure µ. Now the distribution of # uniquely determines the probability generating function of #, which by Eq. (5.24) uniquely determines the sequence of moments of the measure µ. In turn, µ is a bounded measure on [0, 1] and hence uniquely determined by its moments. And the parameters (λ, p1, p2,...) are uniquely determined by µ.

5.D Proof of Lemma 5.A.2

For n = 1, 2,..., we assume #n has the extended Poisson-binomial distribution with pa- rameters (λn, pn,1, pn,2,...). We further assume µn has the spike size-location measure with parameters (λn, pn,1, pn,2,...).

(2) (1). Suppose the µn converge weakly to some finite measure µ on [0, 1]. We want ⇒ to show that #n converges in distribution to a finite-valued limit random variable. In Appendix 5.C, we noted that we can express the probability generating function of an extended Poisson-binomial distribution in terms of a spike size-location measure with the same parameters. In particular, by Eq. (5.23), we can write the negative log of the probability generating function of #n as

#n log Es = fs(x) µn(dx), − Z[0,1] where ∞ −1 1 j j−1 x log [1 (1 s)x] x > 0 fs(x) := (1 s) x = − − − . (5.25) j 1 s x = 0 j=1 − X  − CHAPTER 5. FEATURE ALLOCATIONS, PROBABILITY FUNCTIONS, AND PAINTBOXES 173

Since fs(x) is bounded in x for each fixed s with 0 < s 1, we have by the assumption of ≤ weak convergence of µn that

#n lim log Es = fs(x) µ(dx). n→∞ − Z[0,1] Moreover, since µ is finite by assumption, we have that the result is finite for each s with 0 < s 1. It follows that #n converges in distribution to a finite random variable #, with probability≤ generating function given by

# Es = exp fs(x) µ(dx) . (5.26) −  Z[0,1] 

Assume (1). Now suppose the #n converge in distribution to a finite random variable #. The next two parts of the proof will rely on an intermediate step: showing that µn has bounded total mass in this case. To show that µn has bounded total mass, first note that E#n is exactly the total mass of µn:

∞ E#n = λn + pn,k =: Σn, k=1 X∞

and Var#n = λn + pn,k(1 pn,k). − k=1 X Noting that Var#n Σn allows us to apply Chebyshev’s inequality to find ≤

1/4 P( #n E#n 2 Var#n) ≥ | − | ≥ 3/4 P( #n Σn 2 pVar#n) ≤ | − | ≤ P( #n Σn 2pΣn) ≤ | − | ≤ P(#n Σn 2 pΣn). ≤ ≥ − p Since #n converges in distribution by assumption, the sequence #n is tight. Choose  such that 1/2 >  > 0. Then there exists some N such that, for all n 1, we have ≥ P(#n N) > 1  > 1/2. It follows that, for all n 1, ≤ − ≥

1/4 P(N Σn 2 Σn). ≤ ≥ − p Since Σn is non-random, it must be that P(N Σn 2√Σn) = 1. That is, the total mass ≥ − of µn is bounded. CHAPTER 5. FEATURE ALLOCATIONS, PROBABILITY FUNCTIONS, AND PAINTBOXES 174

Assume (1) and (2). Suppose #n converges in distribution to some finite-valued limit random variable # and that µn converges weakly to some finite measure µ. We want to show that # has an extended Poisson-binomial distribution, that µ is a spike size-location measure, and that # and µ have the same parameters. We start by showing that µ is discrete. Choose any  > 0. Since the mass of µn is bounded across n by the previous part of the proof (“Assume (1)”), the number of atoms of µn greater than  is bounded across n. It follows that the number of atoms of µ has the same bound. So µ is discrete. Since µn converges weakly to µ, we see that µ must have atoms with sizes and locations p1, p2,... such that

1 p1 p2 ... ≥ ≥ ≥ as well as a potential atom, with size we denote by λ, at zero. That is, µ is a spike size- location measure with parameters (λ, p1, p2,...). In a previous part of the proof (“(2) (1)”), we expressed the probability generating function of # as a function of µ (Eq. (5.26)).⇒ With this relation in hand, we can reverse the series of equations presented in Appendix 5.C and ending in Eq. (5.23) to find the form of the probability generating function for # (Eq. (5.22)). In particular, Eq. (5.22) tells us that # is an extended Poisson-binomial random variable with parameters (λ, p1, p2,...). In particular, we emphasize that # has the same parameters as µ, which we have already shown above is a spike size-location measure.

(1) (2) Now step back and assume that #n converges in distribution to a finite-valued ⇒ limit random variable; call it #. We wish to show that µn converges weakly to some finite measure on [0, 1]. By a previous part of this proof (“Assume (1)”), the mass of µn is bounded across n. Moreover, by construction, all of the mass for each µn is concentrated on [0, 1]. So it must be that the sequence µn is tight. It follows that if every weakly convergent subsequence µnj has the same limit µ, then µn converges weakly to µ.

Consider a subsequence (nj)j of N. We know #nj converges in distribution to # by the assumption that #n converges in distribution to #. The previous part of this proof (“Assume

(1) and (2)”) gives that the form of the limit of µnj is determined by #; namely, the limit is a spike size-location measure with parameters shared by #. In particular, then, the limit µ must be the same for every subsequence, and the desired result is shown. 175

Chapter 6

Posteriors, conjugacy, and exponential families for completely random measures

We demonstrate how to calculate posteriors for general CRM-based priors and likelihoods for Bayesian nonparametric models. We further show how to represent Bayesian nonparametric priors as a sequence of finite draws using a size-biasing approach—and how to represent full Bayesian nonparametric models via finite marginals. Motivated by conjugate priors based on exponential family representations of likelihoods, we introduce a notion of exponential families for CRMs, which we call exponential CRMs. This construction allows us to specify automatic Bayesian nonparametric conjugate priors for exponential CRM likelihoods. We demonstrate that our exponential CRMs allow particularly straightforward recipes for size- biased and marginal representations of Bayesian nonparametric models. Along the way, we prove that the gamma process is a conjugate prior for the Poisson likelihood process and the beta prime process is a conjugate prior for a process we call the odds Bernoulli process. We deliver a size-biased representation of the gamma process and a marginal representation of the gamma process coupled with a Poisson likelihood process.

6.1 Introduction

An important milestone in Bayesian analysis was the development of a general strategy for obtaining conjugate priors based on exponential family representations of likelihoods (De- Groot, 1970). While slavish adherence to exponential-family conjugacy can be criticized, conjugacy continues to occupy an important place in Bayesian analysis, for its computa- tional tractability in high-dimensional problems and for its role in inspiring investigations into broader classes of priors (e.g., via mixtures, limits, or augmentations). The exponential family is, however, a parametric class of models, and it is of interest to consider whether similar general notions of conjugacy can be developed for Bayesian nonparametric models. CHAPTER 6. POSTERIORS, CONJUGACY, AND EXPONENTIAL FAMILIES 176

Indeed, the nonparametric literature is replete with nomenclature that suggests the exponen- tial family, including familiar names such as “Dirichlet,” “beta,” “gamma,” and “Poisson.” These names refer to aspects of the random measures underlying Bayesian nonparametrics, either the L´evymeasure used in constructing certain classes of random measures or prop- erties of marginals obtained from random measures. In some cases, conjugacy results have been established that parallel results from classical exponential families; in particular, the Dirichlet process is known to be conjugate to a multinomial process likelihood (Ferguson, 1973), the beta process is conjugate to a Bernoulli process (Kim, 1999a; Thibaux and Jordan, 2007) and to a negative binomial process (Broderick, Mackey, et al., 2014). Moreover, vari- ous useful representations for marginal distributions, including stick-breaking and size-biased representations, have been obtained by making use of properties that derive from exponen- tial families. It is striking, however, that these results have been obtained separately, and with significant effort; a general formalism has not yet emerged. In this chapter, we provide the single, holistic framework so strongly suggested by the nomenclature. Within this single framework, we show that it is straightforward to calculate posteriors and establish conjugacy. Our framework includes the specification of a Bayesian nonparametric analog of the finite exponential family, which allows us to provide automatic and constructive nonparametric conjugate priors given a likelihood specification as well as general recipes for marginal and size-biased representations. A broad class of Bayesian nonparametric priors—including those built on the Dirichlet process (Ferguson, 1973), the beta process (Hjort, 1990), the gamma process (Titsias, 2008), and the negative binomial process (Zhou et al., 2012; Broderick, Mackey, et al., 2014)—can be viewed as models for the allocation of data points to traits. These processes give us pairs of traits together with rates or frequencies with which the traits occur in some population. Corresponding likelihoods assign each data point in the population to some finite subset of traits conditioned on the trait frequencies. What makes these models nonparametric is that the number of traits in the prior is countably infinite. Then the (typically random) number of traits to which any individual data point is allocated is unbounded, but also there are always new traits to which as-yet-unseen data points may be allocated. That is, such a model allows the number of traits in any data set to grow with the size of that data set. A principal challenge of working with such models arises in posterior inference. There is a countable infinity of trait frequencies in the prior which we must integrate over to calculate the posterior of trait frequencies given allocations of data points to traits. Bayesian nonparametric models sidestep the full infinite-dimensional integration in three principal ways: conjugacy, size-biased representations, and marginalization. In its most general form, conjugacy simply asserts that the prior is in the same family of distributions as the posterior. When the prior and likelihood are in finite-dimensional conju- gate exponential families, conjugacy can turn posterior calculation into, effectively, vector ad- dition. As a simple example, consider a model with beta-distributed prior, θ Beta(θ α, β), ∼ | for some fixed hyperparameters α and β. 
For the likelihood, let each observation xn with iid n 1,...,N be iid Bernoulli-distributed conditional on parameter θ: xn Bern(x θ). ∈ { } ∼ | CHAPTER 6. POSTERIORS, CONJUGACY, AND EXPONENTIAL FAMILIES 177

Then the posterior is simply another beta distribution, Beta(θ αpost, βpost), with parameters N | N updated via addition: αpost := α + n=1 xn and βpost := β + N n=1 xn. While conjugacy is certainly useful and popular in the case of finite parameter cardinality,− there is arguably a stronger computational imperativeP for its use in the infinite-parameterP case. Indeed, the core prior-likelihood pairs of Bayesian nonparametrics are generally proven (Hjort, 1990; Kim, 1999a; Thibaux and Jordan, 2007; Broderick, Mackey, et al., 2014), or assumed to be (Tit- sias, 2008; Thibaux, 2008), conjugate. When such proofs exist, though, thus far they have been specialized to specific pairs of processes. In what follows, we demonstrate a general way to calculate posteriors for a class of distributions that includes all of these classical Bayesian nonparametric models. We also define a notion of exponential family representation for the infinite-dimensional case and show that, given a Bayesian nonparametric exponential family likelihood, we can readily construct a Bayesian nonparametric conjugate prior. Size-biased sampling provides a finite-dimensional distribution for each of the individual prior trait frequencies (Thibaux and Jordan, 2007; Paisley, Zaas, et al., 2010). Such a representation has played an important role in Bayesian nonparametrics in recent years, allowing for either exact inference via slice sampling (Damien, Wakefield, and Walker, 1999; Neal, 2003)—as demonstrated by Teh, G¨or¨ur,and Ghahramani (2007); Broderick, Mackey, et al. (2014)—or approximate inference via truncation (Doshi et al., 2009; Paisley, Carin, and Blei, 2011). This representation is particularly useful for building hierarchical models (Thibaux and Jordan, 2007). We show that our framework yields such representations in general, and we show that our construction is especially straightforward to use in the exponential family framework that we develop. Marginal processes avoid directly representing the infinite-dimensional prior and poste- rior altogether by integrating out the trait frequencies. Since the trait allocations are finite for each data point, the marginal processes are finite for any finite set of data points. Again, thus far, such processes have been shown to exist separately in special cases; for example, the Indian buffet process (Griffiths and Ghahramani, 2006) is the marginal process for the beta process prior paired with a Bernoulli process likelihood (Thibaux and Jordan, 2007). We show that the integration that generates the marginal process from the full Bayesian model can be generally applied in Bayesian nonparametrics and takes a particularly straightforward form when using conjugate exponential family priors and likelihoods. We further demon- strate that, in this case, a basic, constructive recipe exists for the general marginal process in terms of only finite-dimensional distributions. Our results are built on the general class of stochastic processes known as completely random measures (CRMs) (Kingman, 1967). We review CRMs in Section 6.2 and we discuss what assumptions are needed to form a full Bayesian nonparametric model from CRMs in Section 6.2. Given a general Bayesian nonparametric prior and likelihood (Section 6.2), we demonstrate in Section 6.3 how to calculate the posterior. Although the development up to this point is more general, we next introduce a concept of exponential families for CRMs (Section 6.4) and call such models exponential CRMs. 
We show that we can generate automatic conjugate priors given exponential CRM likelihoods in Section 6.4. Finally, we CHAPTER 6. POSTERIORS, CONJUGACY, AND EXPONENTIAL FAMILIES 178 show how we can generate recipes for size-biased representations (Section 6.5) and marginal processes (Section 6.6), which are particularly straightforward in the exponential CRM case (Corollary 6.5.2 in Section 6.5 and Corollary 6.6.2 in Section 6.6). We illustrate our results on a number of examples and derive new conjugacy results, size-biased representations, and marginal processes along the way.

6.2 Bayesian models based on completely random measures

As we have discussed, we view Bayesian nonparametric models as being composed of two parts: (1) a collection of pairs of traits together with their frequencies or rates and (2) for each data point, an allocation to different traits. Both parts can be expressed as random measures. Recall that a random measure is a random element whose values are measures. We represent each trait by a point ψ in some space Ψ of traits. Further, let θk be the frequency, or rate, of the trait represented by ψk, where k indexes the countably many traits. In particular, θk R+. Then (θk, ψk) is a tuple consisting of the frequency of the kth trait together with its∈ trait descriptor. We can represent the full collection of pairs of traits with their frequencies by the discrete measure on Ψ that places weight θk at location ψk:

K

Θ = θkδψk , (6.1) k=1 X where the cardinality K may be finite or infinity. Next, we form data point Xn for the nth individual. The data point Xn is viewed as a discrete measure. Each atom of Xn represents a pair consisting of (1) a trait to which the nth individual is allocated and (2) a degree to which the nth individual is allocated to this particular trait. That is, Kn

Xn = xn,kδψn,k , (6.2) k=1 X where again ψn,k Ψ represents a trait and now xn,k R+ represents the degree to which ∈ ∈ the nth data point belongs to trait ψn,k. Kn is the total number of traits to which the nth data point belongs. Here and in what follows, we treat X1:N = Xn : n [N] as our observed data points { ∈ } for [N] := 1, 2, 3,...,N . In practice X1:N is often incorporated into a more complex { } Bayesian hierarchical model. For instance, in topic modeling, ψk represents a topic; that is, ψk is a distribution over words in a vocabulary (Blei, Ng, and Jordan, 2003; Teh, Jordan, et al., 2006). θk might represent the frequency with which the topic ψk occurs in a corpus of documents. xn,k might be a positive integer and represent the number of words in topic Kn ψn,k that occur in the nth document. So the nth document has a total length of k=1 xn,k words. In this case, the actual observation consists of the words in each document, and the P CHAPTER 6. POSTERIORS, CONJUGACY, AND EXPONENTIAL FAMILIES 179 topics are latent. Not only are the results concerning posteriors, conjugacy, and exponential family representations that we develop below useful for inference in such models, but in fact our results are especially useful in such models—where the traits and any ordering on the traits are not known in advance. Next, we want to specify a full Bayesian model for our data points X1:N . To do so, we must first define a prior distribution for the random measure Θ as well as a likelihood for each random measure Xn conditioned on Θ. We let ΣΨ be a σ-algebra of subsets of Ψ, where we assume all singletons are in ΣΨ. Then we consider random measures Θ and Xn whose values are measures on Ψ. Note that for any random measure Θ and any measurable set A ΣΨ, Θ(A) is a random variable. ∈ Completely random measures We can see from Eqs. (6.1) and (6.2) that we desire a distribution on random measures that yields discrete measures almost surely. A particularly simple form of random measure called a completely random measure has been shown to have this property (Kingman, 1967). A completely random measure Θ is defined as a random measure that satisfies one ad- ditional property; for any disjoint, measurable sets A1,A2,...,AK ΣΨ, we require that ∈ Θ(A1), Θ(A2),..., Θ(AK ) be independent random variables. Kingman (1967) showed that a completely random measure can always be decomposed into a sum of three independent parts: Θ = Θdet + Θfix + Θord. (6.3)

Here, Θdet is the deterministic component, Θfix is the fixed-location component, and Θord is the ordinary component. In particular, Θdet is any deterministic measure. It is straightfor- ward to include a deterministic measure in a , so—without loss of generality in our treatment and according to the prevailing norm in using models based on CRMs—in what follows we will set Θdet 0. We define the remaining two parts next. The fixed-location component≡ is called the “fixed component” by Kingman (1967), but we change the name slightly here to emphasize that Θfix is defined to be constructed from a set of random weights at fixed (i.e., deterministic) locations. That is,

Kfix

Θfix = θfix,kδψfix,k , (6.4) k=1 X where the number of fixed-location atoms, Kfix, may be either finite or infinity; ψfix,k is deterministic, and θfix,k is a non-negative, real-valued random variable (since Φ is a measure). Without loss of generality, we assume that the locations ψfix,k are all distinct. Then, by the independence assumption of CRMs, we must have that θfix,k are independent random variables across k. Although the fixed-location atoms are often ignored in the Bayesian nonparametrics literature, we will see that the fixed-location component has a key role to play in establishing Bayesian nonparametric conjugacy and in the CRM representations we present. CHAPTER 6. POSTERIORS, CONJUGACY, AND EXPONENTIAL FAMILIES 180

The third and final component is the ordinary component. Let #(A) denote the cardi- nality of some countable set A. Let µ be any σ-finite, deterministic measure on R+ Ψ, × where R+ is equipped with the Borel σ-algebra and ΣR+×Ψ is the resulting product σ-algebra given ΣΨ. Recall that a Poisson point process with rate measure µ on R+ Ψ is a random × countable subset Π of R+ Ψ such that two properties hold (Kingman, 1993): × 1. For any A Σ ×Ψ, #(Π A) Poisson(µ(A)). ∈ R+ ∩ ∼

2. For any disjoint A1,A2,...,AK ΣR+×Ψ, #(Π A1), #(Π A2), , #(Π AK ) are independent random variables. ∈ ∩ ∩ ··· ∩

To generate an ordinary component, start with a Poisson point process on R+ Ψ, charac- terized by its rate measure µ(dθ dψ). This process yields Π, a random and countable× set Kord× of points: Π = (θord,k, ψord,k) , where Kord may be finite or infinity. Form the ordinary { }k=1 component measure by letting θord,k be the weight of the atom located at ψord,k:

Kord

Θord = θord,kδψord,k . (6.5) k=1 X Recall that we stated at the start of Section 6.2 that CRMs yield a.s. discrete random measures. To check this assertion, note that Θfix is a.s. discrete by construction (Eq. (6.4)) and Θord is a.s. discrete by construction (Eq. (6.5)). When we set Θdet 0 as above, we are ≡ left with Θ = Θfix + Θord by Eq. (6.3). So Θ is also discrete, as desired.

Prior and likelihood The prior that we place on Θ will be a fully general CRM (minus any deterministic com- ponent) with one additional assumption on the rate measure of the ordinary component. That is, before incorporating the additional assumption, we say that Θ has a fixed-location indep component with Kfix atoms, where the kth atom has arbitrary distribution Ffix,k: θfix,k ∼ Ffix,k(dθ). Kfix may be finite or infinity, and Θ has an ordinary component characterized by rate measure µ(dθ dψ). The additional assumption we make is that the distribution on the weights in the ordinary× component is assumed to be decoupled from the distribution on the locations. The locations are typically more interesting in other parts of a full Bayesian model hierarchy and have been discussed extensively elsewhere (Neal, 2000; C. Wang and Blei, 2013). Moreover, it is the weights that affect the allocation of data points to traits. So henceforth we assume that the rate measure decomposes as µ(dθ dψ) = ν(dθ) G(dψ), (6.6) × · where ν is any σ-finite, deterministic measure on R+ and G is any proper distribution on Ψ. Given the factorization of µ in Eq. (6.6), an the ordinary component of Θ can be generated Kord 1 by letting θfix,k k=1 be the points of a Poisson point process generated on R+ with rate ν. { } 1 Recall that Kord may be finite or infinity depending on ν and is random when taking finite values. CHAPTER 6. POSTERIORS, CONJUGACY, AND EXPONENTIAL FAMILIES 181

Kord iid We then draw the locations ψfix,k iid according to G(dψ): ψfix,k G(dψ). Finally, for { }k=1 ∼ each k, θfix,kδψfix,k is an atom in Θord. This factorization will allow us to focus our attention on the trait frequencies, and not the trait locations, in what follows. Moreover, going forward, we will assume G is diffuse (i.e., G has no atoms) so that the ordinary component atoms are all at a.s. distinct locations, which are further a.s. distinct from the fixed locations. Since we have seen that Θ is an a.s. discrete random measure, we can write it as

K

Θ = θkδψk , (6.7) k=1 X where K := Kfix +Kord may be finite or infinity and every ψk is a.s. unique. That is, we will sometimes find it helpful notationally to use Eq. (6.7) instead of separating the fixed and ordinary components. At this point, we have specified the prior for Θ in our general model. Next, we specify the likelihood; i.e., we specify how to generate the data points Xn given Θ. We will assume each Xn is generated iid given Θ across the data indices n. We will let Xn be a CRM with only a fixed-location component given Θ. In particular, the atoms of Xn will be located at the atom locations of Θ, which are fixed when we condition on Θ:

K

Xn := xn,kδψk . k=1 X Here, xn,k is drawn according to some distribution H that may take θk, the weight of Θ at location ψk, as a parameter; i.e.,

indep xn,k H(dx θk) independently across n and k. (6.8) ∼ |

Note that while every atom of Xn is located at an atom of Θ, it is not necessarily the case that every atom of Θ has a corresponding atom in Xn. In particular, if xn,k is zero for any k, there is no atom in Xn at ψk.

Bayesian nonparametrics So far we have described a prior and likelihood that may be used to form a Bayesian model. We have already stated above that forming a Bayesian nonparametric model imposes some restrictions on the prior and likelihood. We formalize these restrictions in Assumptions A0, A1, and A2 below. Recall that the premise of Bayesian nonparametrics is that the number of traits repre- sented in a collection of data can grow with the number of data points. More explicitly, we achieve the desideratum that the number of traits is unbounded, and may always grow as new data points are collected, by modeling a countable infinity of traits. This assumption requires that the prior have a countable infinity of atoms. These must either be fixed-location atoms or ordinary component atoms. Fixed-location atoms represent known traits in some CHAPTER 6. POSTERIORS, CONJUGACY, AND EXPONENTIAL FAMILIES 182

sense since we must know the fixed locations of the atoms in advance. Conversely, ordi- nary component atoms represent unknown traits, as yet to be discovered, since both their locations and associated rates are unknown a priori. Since we cannot know (or represent) a countable infinity of traits a priori, we cannot start with a countable infinity of fixed-location atoms.

A0. The number of fixed-location atoms in Θ is finite.

Since we require a countable infinity of traits in total and they cannot come from the fixed- location atoms by Assumption A0, the ordinary component must contain a countable infinity of atoms. This assumption will be true if and only if the rate measure on the trait frequencies has infinite mass.

A1. ν(R+) = . ∞ Finally, an implicit part of the starting premise is that each data point be allocated to only a finite number of traits; we do not expect to glean an infinite amount of information from finitely represented data. Thus, we require that the number of atoms in every Xn be finite. By Assumption A0, the number of atoms in Xn that correspond to fixed-location atoms in Θ is finite. But by Assumption A1, the number of atoms in Θ from the ordinary component is infinite. So there must be some restriction on the distribution of values of X at the atoms of Θ (that is, some restriction on H in Eq. (6.8)) such that only finitely many of these values are nonzero. In particular, note that if H(dx θ) does not contain an atom at zero for any θ, then a.s. every one of the countable infinity| of atoms of X will be nonzero. One consequence of this observation is that H(dx θ) cannot be purely continuous for all θ. Though this line of reasoning does not necessarily| preclude a mixed continuous and discrete H, we henceforth assume that H(dx θ) is discrete, with support Z∗ = 0, 1, 2,... , for all θ. In what follows,| we write h(x θ) for the probability{ mass function} of x given θ. So our requirement that each data point be| allocated to only a finite number of traits translates into a requirement that the number of atoms of Xn with values in Z+ = 1, 2,... be finite. Note Kord { } that, by construction, the pairs (θord,k, xord,k) form a marked Poisson point process { }k=1 with rate measure µmark(dθ dx) := ν(dθ)h(x θ). And the pairs with xord,k equal to any × | particular value x Z+ further form a thinned Poisson point process with rate measure ∈ νx(dθ) := ν(dθ)h(x θ). In particular, the number of atoms of X with weight x is Poisson- | distributed with mean νx(R+). So the number of atoms of X is finite if and only if the following assumption holds.

∞ A2. νx(R+) < for νx := ν(dθ)h(x θ). x=1 ∞ | Thus AssumptionsP A0, A1, and A2 capture our Bayesian nonparametric desiderata. We illustrate the development so far with an example. CHAPTER 6. POSTERIORS, CONJUGACY, AND EXPONENTIAL FAMILIES 183

Example 6.2.1. The beta process (Hjort, 1990) provides an example distribution for Θ. In its most general form, sometimes called the three-parameter beta process (Teh and G¨or¨ur, 2009; Broderick, Jordan, and Pitman, 2012), the beta process has an ordinary component whose weight rate measure has a beta distribution kernel,

ν(dθ) = γθ−α−1(1 θ)c+α−1dθ, (6.9) − with support on (0, 1]. Here, the three fixed hyperparameters are γ, the mass parameter; 2 c, the concentration parameter; and α, the discount parameter. Moreover, each of its Kfix

fixed-location atoms, θkδψk , has a beta-distributed weight (Broderick, Mackey, et al., 2014):

θfix,k Beta(θ ρfix,k, σfix,k), (6.10) ∼ | where ρfix,k, σfix,k > 0 are fixed hyperparameters of the model. By Assumption A0, Kfix is finite. By Assumption A1, ν(R+) = . To achieve this infinite-mass restriction, the beta kernel in Eq. (6.9) must be improper;∞ i.e., either α 0 or c + α 0. Also, note that we must have γ > 0 since ν is a measure (and the case−γ ≤= 0 would be≤ trivial). Often the beta process is used as a prior paired with a Bernoulli process likelihood ∞ (Thibaux and Jordan, 2007). The Bernoulli process specifies that, given Θ = k=1 θkδψk , we draw indep P xn,k Bern(x θk), ∼ | which is well-defined since every atom weight θk of Θ is in (0, 1] by the beta process con- struction. Thus, ∞

Xn = xn,kδψk . k=1 X Finally, then, we may apply Assumption A2, which specifies that the number of atoms in each observation Xn is finite; in this case, the assumption means

∞ ν(dθ) h(x θ) x=1 θ∈R+ · | X Z = ν(dθ) h(1 θ) · | Zθ∈(0,1] since θ is supported on (0, 1] and x is supported on 0, 1 { } = γθ−α−1(1 θ)c+α−1dθ θ − · Zθ∈(0,1] 2 In (Teh and G¨or¨ur,2009; Broderick, Jordan, and Pitman, 2012), the ordinary component features the beta distribution kernel in Eq. (6.9) multiplied not only by γ but also by a more complex, positive, real-valued expression in c and α. Since all of γ, c, and α are fixed hyperparameters, and γ is an arbitrary positive real value, any other constant factors containing the hyperparameters can be absorbed into γ, as in the main text here. CHAPTER 6. POSTERIORS, CONJUGACY, AND EXPONENTIAL FAMILIES 184

= γ θ1−α−1(1 θ)c+α−1dθ − Zθ∈(0,1] < . ∞ The integral here is finite if and only if 1 α and c + α are the parameters of a proper beta distribution: i.e., if and only if α < 1 and− c > α. Together with the restrictions above, these restrictions imply the following allowable parameter− ranges for the beta process fixed hyperparameters:

γ > 0 α [0, 1) ∈ (6.11) c > α − ρfix,k, σfix,k > 0 for all k [Kfix]. ∈ These correspond to the hyperparameter ranges previously found in (Teh and G¨or¨ur,2009; Broderick, Jordan, and Pitman, 2012). 

6.3 Posteriors

In Section 6.2, we defined a full Bayesian model consisting of a CRM prior for Θ and a CRM likelihood for an observation X conditional on Θ. Now we would like to calculate the posterior distribution of Θ X. | Theorem 6.3.1 (Bayesian nonparametric posteriors). Let Θ be a completely random mea- sure that satisfies Assumptions A0 and A1; that is, Θ is a CRM with Kfix fixed atoms such that Kfix < and such that the kth atom can be written θfix,kδψ with ∞ fix,k indep θfix,k Ffix,k(dθ) ∼ for proper distribution Ffix,k and deterministic ψfix,k. The ordinary component of Θ has rate measure µ(dθ dψ) = ν(dθ) G(dψ), × · ∞ where G is a proper distribution and ν(R+) = . Write Θ = k=1 θkδψk , and let X be ∞ indep generated conditional on Θ according to X = ∞ x δ with x h(x θ ) for proper, k=1 k ψk Pk k discrete probability mass function h. And suppose X and Θ jointly satisfy∼ Assumption| A2 so that P ∞ ν(dθ)h(x θ) < . x=1 θ∈R+ | ∞ X Z Then let Θpost be a random measure with the distribution of Θ X. Θpost is a completely random measure with three parts. | CHAPTER 6. POSTERIORS, CONJUGACY, AND EXPONENTIAL FAMILIES 185

1. For each k [Kfix], Θpost has a fixed-location atom at ψfix,k with weight θpost,fix,k ∈ distributed according to the finite-dimensional posterior Fpost,fix,k(dθ) that comes from prior Ffix,k, likelihood h, and observation X( ψfix,k ). { } 2. Let xnew,kδψ : k [Knew] be the atoms of X that are not at fixed-locations in the { new,k ∈ } prior of Θ. Knew is finite by Assumption A2. Then Θpost has a fixed-location atom at xnew,k with random weight θpost,new,k, whose distribution Fpost,new,k(dθ) is proportional to ν(dθ)h(xnew,k θ). |

3. The ordinary component of Θpost has rate measure

νpost(dθ) := ν(dθ)h(0 θ). | Proof. To prove the theorem, we consider in turn each of the two parts of the prior: the fixed-location component and the ordinary component. First, consider any fixed-location atom, θfix,kδψfix,k , in the prior. All of the other fixed-location atoms in the prior, as well as the prior ordinary component, are drawn independently from the weight θfix,k. So it follows that all of X except xfix,k := X( ψfix,k ) is independent of θfix,k. Thus the posterior has a { } fixed atom located at ψfix,k whose weight, which we denote θpost,fix,k, has distribution

Fpost,fix,k(dθ) Ffix,k(dθ)h(xfix,k θ), ∝ | which follows from the usual finite Bayes Theorem. Next, consider the ordinary component in the prior. Let

Ψfix = ψfix,1, . . . , ψfix,K { fix }

be the set of fixed-location atoms in the prior. Recall that Ψfix is deterministic, and since G is continuous, all of the fixed-location atoms and ordinary component atoms of Θ are at a.s. distinct locations. So the measure Xfix defined by

Xfix(A) := X(A Ψfix) ∩

can be derived purely from X, without knowledge of Θ. It follows that the measure Xord defined by Xord(A) := X(A (Ψ Ψfix)) ∩ \ can be derived purely from X without knowledge of Θ. Xord is the same as the observed data point X but with atoms only at atoms of the ordinary component of Θ and not at the fixed-location atoms of Θ. Now for any value x Z+, let ∈

ψnew,x,1, . . . , ψnew,x,K { new,x } CHAPTER 6. POSTERIORS, CONJUGACY, AND EXPONENTIAL FAMILIES 186

be all of the locations of atoms of size x in Xord. By Assumption A2, the number of such Knew,x atoms, Knew,x, is finite. Further let θnew,x,k := Θ( ψnew,x,k ). Then the values θnew,x,k k=1 are generated from a thinned Poisson point process{ with} rate measure { }

νx(dθ) := ν(dθ)h(x θ). (6.12) | And since νx(R+) < by assumption, each θnew,x,k has distribution equal to the normalized ∞ rate measure in Eq. (6.12). Note that θnew,x,kδψnew,x,k is a fixed-location atom in the posterior now that its location is known from the observed Xord. By contrast, if a likelihood draw at an ordinary component atom in the prior returned a zero, that atom is not observed in Xord. Such atom weights in Θpost thus formed a marked Poisson point process with rate measure

ν(dθ)h(0 θ), | as was to be shown. In Theorem 6.3.1, we consider generating Θ and then a single data point X conditional on Θ. Now suppose we generate Θ and then N data points, X1,...,XN , iid conditional on Θ. In this case, Theorem 6.3.1 may be iterated to find the posterior Θ X1:N . We now illustrate the results of the theorem with an example. | Example 6.3.2. Suppose we again start with a beta process prior for Θ as in Example 6.2.1. This time we consider a negative binomial process likelihood (Zhou et al., 2012; Broderick, ∞ Mackey, et al., 2014). The negative binomial process specifies that, given Θ = k=1 θkδψk , ∞ we draw X = xkδψ with k=1 k P P indep xk NegBin(x r, θk), ∼ | for some fixed hyperparameter r > 0. So

Xn = xn,kδψk . k=1 X In this case, Assumption A2 translates into the following restriction.

∞ ν(dθ) h(x θ) x=1 θ∈R+ · | X Z = ν(dθ) [1 h(0 θ)] · − | Zθ∈R+ = γθ−α−1(1 θ)c+α−1dθ [1 (1 θ)r] − · − − Zθ∈(0,1] since the support of ν(dθ) is (0, 1] CHAPTER 6. POSTERIORS, CONJUGACY, AND EXPONENTIAL FAMILIES 187

< . ∞ Since 1 (1 θ)r is asymptotically equivalent to rθ as θ 0, we require − − → θ1−α−1(1 θ)c+α−1dθ < , − ∞ Zθ∈(0,1] which is satisfied if and only if 1 α and c+α are the parameters of a proper beta distribution. Thus, we have the same parameter− restrictions as in Eq. (6.11). Now we calculate the posterior given the beta process prior on Θ and the negative bino- mial process likelihood for X conditional on Θ. In particular, the posterior has the distribu- tion of Θpost, a CRM with three parts given by Theorem 6.3.1. First, at each fixed atom ψfix,k of the prior with weight θfix,k given by Eq. (6.10), there is a fixed atom in the posterior with weight θpost,fix,k. Let xpost,fix,k := X( ψfix,k ). Then { } θpost,fix,k has distribution

Fpost,fix,k(dθ) Ffix(dθ) h(xpost,fix,k θ) ∝ · | = Beta(θ ρfix,k, σfix,k) dθ NegBin(xpost,fix,k r, θ) | · | (6.13) θρfix,k−1(1 θ)σfix,k−1 dθ θxpost,fix,k (1 θ)r ∝ − · − Beta (θ ρfix,k + xpost,fix,k, σfix,k + r) dθ. ∝ |

Second, for any atom xnew,kδψnew,k in X that is not at a fixed location in the prior, Θpost has a fixed atom at ψnew,k whose weight θpost,new,k has distribution

Fpost,new,k(dθ) ν(dθ) h(xnew,k θ) ∝ · | = ν(dθ) NegBin(xnew,k r, θ) · | (6.14) θ−α−1(1 θ)c+α−1 dθ θxnew,k (1 θ)r ∝ − · − Beta (θ α + xnew,k, c + α + r) dθ, ∝ |− which is a proper distribution since we have the following restrictions on its parameters. For one, by assumption, xnew,k 1. And further, by Eq. (6.11), we have α [0, 1) as well as c + α > 0 and r > 0. ≥ ∈ Third, the ordinary component of Θpost has rate measure

ν(dθ)h(0 θ) = γθ−α−1(1 θ)c+α−1 dθ (1 θ)r | − · − = γθ−α−1(1 θ)c+r+α−1 dθ. −

Not only have we found the posterior distribution Θpost above, but now we can note that the posterior is in the same form as the prior with updated ordinary component hyperpa- rameters:

γpost = γ

αpost = α CHAPTER 6. POSTERIORS, CONJUGACY, AND EXPONENTIAL FAMILIES 188

cpost = c + r.

The posterior also has old and new beta-distributed fixed atoms with beta distribution hyperparameters given in Eq. (6.13) and Eq. (6.14), respectively. Thus, we have proven that the beta process is, in fact, conjugate to the negative binomial process. An alternative proof was first given by Broderick, Mackey, et al. (2014).  As in Example 6.3.2, we can use Theorem 6.3.1 not only to calculate posteriors but also, once those posteriors are calculated, to check for conjugacy. This approach unifies existing disparate approaches to Bayesian nonparametric conjugacy. However, it still requires the practitioner to guess the right conjugate prior for a given likelihood. In the next section, we define a notion of exponential families for CRMs, and we show how to automatically construct a conjugate prior for any exponential family likelihood.

6.4 Exponential families

Exponential families are what typically make conjugacy so powerful in the finite case. For one, when a finite likelihood belongs to an exponential family, then existing results give an automatic conjugate, exponential family prior for that likelihood. In this section, we review finite exponential families, define exponential CRMs, and show that analogous automatic conjugacy results can be obtained for exponential CRMs. Our development of exponen- tial CRMs will also allow particularly straightforward results for size-biased representations (Corollary 6.5.2 in Section 6.5) and marginal processes (Corollary 6.6.2 in Section 6.6). In the finite-dimensional case, suppose we have some (random) parameter θ and some (random) observation x whose distribution is conditioned on θ. We say the distribution Hexp,like of x conditional on θ is in an exponential family if

Hexp,like(dx θ) = hexp,like(x θ) dx | | (6.15) = κ(x) exp η(θ), φ(x) A(θ) µ(dx), {h i − } where η(θ) is the natural parameter, φ(x) is the sufficient statistic, κ(x) is the base density, and A(θ) is the log partition function. We denote the density of Hexp,like here, which exists by definition, by hexp,like. The measure µ—with respect to which the density hexp,like exists—is typically Lebesgue measure when Hexp,like is diffuse or counting measure when Hexp,like is atomic. A(θ) is determined by the condition that Hexp,like(dx θ) have unit total mass on its support. | It is a classic result that the following distribution for θ RD constitutes a conjugate prior: ∈

F (dθ) = f (θ) dθ exp,prior exp,prior (6.16) = exp ξ, η(θ) + λ [ A(θ)] B(ξ, λ) dθ. {h i − − } CHAPTER 6. POSTERIORS, CONJUGACY, AND EXPONENTIAL FAMILIES 189

0 0 Fexp,prior is another exponential family distribution, now with natural parameter (ξ , λ) , suf- ficient statistic (η(θ)0, A(θ))0, and log partition function B(ξ, λ). Note that the logarithms of the densities in both− Eq. (6.15) and Eq. (6.16) are linear in η(θ) and A(θ). So, by Bayes − Theorem, the posterior Fexp,post also has these quantities as sufficient statistics in θ, and we can see Fexp,post must have the following form.

Fexp,post(dθ x) | = fexp,post(θ x) dθ (6.17) | = exp ξ + φ(x), η(θ) + (λ + 1) [ A(θ)] B(ξ + φ(x), λ + 1) dθ. {h i − − }

Thus we see that Fexp,post belongs to the same exponential family as Fexp,prior in Eq. (6.16), and hence Fexp,prior is a conjugate prior for Hexp,like in Eq. (6.15).

Exponential families for completely random measures In the finite-dimensional case, we saw that for any exponential family likelihood, as in Eq. (6.15), we can always construct a conjugate exponential family prior, given by Eq. (6.16). In order to prove a similar result for CRMs, we start by defining a notion of exponential families for CRMs.

Definition 6.4.1. We say that a CRM Θ is an exponential CRM if it has the following two parts. First, let Θ have Kfix fixed-location atoms, where Kfix may be finite or infinite. The kth fixed-location atom is located at any ψfix,k, unique from the other fixed locations, and has random weight θfix,k, whose distribution has density ffix,k:

ffix,k(θ) = κ(θ) exp η(ζk), φ(θ) A(ζk) , {h i − } for some base density κ, natural parameter function η, sufficient statistic φ, and log partition function A shared across atoms. Here, ζk is an atom-specific parameter. Second, let Θ have an ordinary component with rate measure µ(dθ dψ) = ν(dθ) G(dψ) for some proper distribution G and weight rate measure ν of the form× ·

ν(dθ) = γ exp η(ζ), φ(θ) . {h i} In particular, η and φ are shared with the fixed-location atoms, and fixed hyperparameters γ and ζ are unique to the ordinary component.

Automatic conjugacy for completely random measures With Definition 6.4.1 in hand, we can specify an automatic Bayesian nonparametric conju- gate prior for an exponential CRM likelihood. CHAPTER 6. POSTERIORS, CONJUGACY, AND EXPONENTIAL FAMILIES 190

∞ Theorem 6.4.2 (Automatic conjugacy). Let Θ = k=1 θkδψk , in accordance with Assump- tion A1. Let X be generated conditional on Θ according to an exponential CRM with fixed- ∞ P location atoms at ψk and no ordinary component. In particular, the distribution of the { }k=1 weight xk at ψk of X has the following density conditional on the weight θk at ψk of Θ:

h(x θk) = κ(x) exp η(θk), φ(x) A(θk) . | {h i − } Then a conjugate prior for Θ is the following exponential CRM distribution. First, let Θ have Kprior,fix fixed-location atoms, in accordance with Assumption A0. The kth such atom has random weight θfix,k with proper density

fprior,fix,k(θ) = exp ξfix,k, η(θ) + λfix,k [ A(θ)] B(ξfix,k, λfix,k) , {h i − − } 0 0 where (η , A) here is the sufficient statistic and B is the log partition function. ξfix,k and − λfix,k are fixed hyperparameters for this atom weight. Second, let Θ have ordinary component characterized by any proper distribution G and weight rate measure ν(dθ) = γ exp ξ, η(θ) + λ [ A(θ)] , {h i − } where γ, ξ, and λ are fixed hyperparameters of the weight rate measure chosen to satisfy Assumptions A1 and A2. Proof. To prove the conjugacy of the prior for Θ with the likelihood for X, we calculate the posterior distribution of Θ X using Theorem 6.3.1. Let Θpost be a CRM with the distribution | of Θ X. Then, by Theorem 6.3.1, Θpost has the following three parts. | First, at any fixed location ψfix,k in the prior, let xfix,k be the value of X at that location. Then Θpost has a fixed-location atom at ψfix,k, and its weight θpost,fix,k has distribution

Fpost,fix,k(dθ)

fprior,fix,k(θ) dθ h(xfix,k θ) ∝ · | exp ξfix,k, η(θ) + λfix,k [ A(θ)] dθ exp η(θ), φ(xfix,k) A(θ) dθ ∝ {h i − } · {h i − } = exp ξfix,k + φ(xfix,k), η(θ) + (λfix,k + 1) [ A(θ)] dθ. {h i − }

It follows, from putting in the normalizing constant, that the distribution of θpost,fix,k has density

fpost,fix,k(θ) = exp ξfix,k + φ(xfix,k), η(θ) + (λfix,k + 1) [ A(θ)] {h i − B(ξfix,k + φ(xfix,k), λfix,k + 1) . − }

Second, for any atom xnew,kδψnew,k in X that is not at a fixed location in the prior, Θpost has a fixed atom at ψnew,k whose weight θpost,new,k has distribution

Fpost,new,k(θ) ν(dθ) h(xnew,k θ) ∝ · | exp ξ, η(θ) + λ [ A(θ)] exp η(θ), φ(xnew,k) A(θ) dθ ∝ {h i − }· {h i − } CHAPTER 6. POSTERIORS, CONJUGACY, AND EXPONENTIAL FAMILIES 191

= exp ξ + φ(xnew,k), η(θ) + (λ + 1) [ A(θ)] dθ {h i − } and hence density

fpost,new,k(θ) = exp ξ + φ(xnew,k), η(θ) + (λ + 1) [ A(θ)] {h i − B(ξ + φ(xnew,k), λ + 1) . − }

Third, the ordinary component of Θpost has weight rate measure

ν(dθ) h(0 θ) · | = γ exp ξ, η(θ) + λ [ A(θ)] κ(0) exp η(θ), φ(0) A(θ) {h i − }· {h i − } = γκ(0) exp ξ + φ(0), η(θ) + (λ + 1) [ A(θ)] . · {h i − } Thus, the posterior rate measure is in the same exponential CRM form as the prior rate measure with updated hyperparameters:

γpost = γκ(0)

ξpost = ξ + φ(0)

λpost = λ + 1.

Since we see that the posterior fixed-location atoms are likewise in the same exponential CRM form as the prior, we have shown that conjugacy holds, as desired. We next use Theorem 6.4.2 to give proofs of conjugacy in cases where conjugacy has not previously been established in the Bayesian nonparametrics literature. Example 6.4.3. Let X be generated according to a Poisson likelihood process3 conditional ∞ ∞ on Θ. That is, X = k=1 xkδψk conditional on Θ = k=1 θkδψk has an exponential CRM distribution with only a fixed-location component. The weight xk at location ψk has support P P on Z∗ and has a Poisson density with parameter θk R+: ∈ 1 x −θk h(x θk) = θk e | x! (6.18) 1 = exp x log(θk) θk . x! { − } The final line is rewritten to emphasize the exponential family form of this density, with 1 κ(x) = x! φ(x) = x η(θ) = log(θ)

3We use the term “Poisson likelihood process” to distinguish this specific Bayesian nonparametric like- lihood from the Poisson point process. CHAPTER 6. POSTERIORS, CONJUGACY, AND EXPONENTIAL FAMILIES 192

A(θ) = θ.

By Theorem 6.4.2, this Poisson likelihood process has a Bayesian nonparametric conjugate prior for Θ with two parts. First, Θ has a set of Kprior,fix fixed-location atoms, where Kprior,fix < by Assump- ∞ tion A0. The kth such atom has random weight θfix,k with density

fprior,fix,k(θ) = exp ξfix,k, η(θ) + λfix,k [ A(θ)] B(ξfix,k, λfix,k) {h i − − } ξfix,k −λfix,kθ = θ e exp B(ξfix,k, λfix,k) {− } = Gamma(θ ξfix,k + 1, λfix,k ), (6.19) | where Gamma(θ a, b) denotes the gamma density with shape parameter a > 0 and rate | parameter b > 0. So we must have fixed hyperparameters ξfix,k > 1 and λfix,k > 0. Further, − ξfix,k+1 exp B(ξfix,k, λfix,k) = λ /Γ(ξfix,k + 1) {− } fix,k to ensure normalization. Second, Θ has an ordinary component characterized by any proper distribution G and weight rate measure

ν(dθ) = γ exp ξ, η(θ) + λ [ A(θ)] dθ {h i − } = γθξe−λθ dθ. (6.20)

Note that Theorem 6.4.2 guarantees that the weight rate measure will have the same distri- butional kernel in θ as the fixed-location atoms. Finally, we need to choose the allowable hyperparameter ranges for γ, ξ, and λ. γ > 0 to ensure ν is a measure. By Assumption A1, we must have ν(R+) = , so ν must represent an improper gamma distribution. As such, we require either ξ +∞ 1 0 or λ 0. By Assumption A2, we must have ≤ ≤

∞ ν(dθ) h(x θ) x=1 θ∈R+ · | X Z = ν(dθ) [1 h(0 θ)] · − | Zθ∈R+ = γθξe−λθ dθ 1 e−θ · − Zθ∈R+ < .   ∞ To ensure the integral over [1, ) is finite, we must have λ > 0. To ensure the integral over (0, 1) is finite, we note that 1 ∞e−θ is asymptotically equivalent to θ as θ 0. So we require − → γθξ+1e−λθ dθ < , ∞ Zθ∈(0,1) CHAPTER 6. POSTERIORS, CONJUGACY, AND EXPONENTIAL FAMILIES 193

which is satisfied if and only if ξ + 2 > 0. Finally, then the hyperparameter restrictions can be summarized as:

γ > 0 ξ ( 2, 1] ∈ − − λ > 0

ξfix,k > 1 and λfix,k > 0 for all k [Kprior,fix]. − ∈ The ordinary component of the conjugate prior for Θ discovered in this example is typ- ically called a gamma process. Here, we have for the first time specified the distribution of the fixed-location atoms of the gamma process and, also for the first time, proved that the gamma process is conjugate to the Poisson likelihood process. We highlight this result as a corollary to Theorem 6.4.2. Corollary 6.4.4. Let the Poisson likelihood process be a CRM with fixed-location atom weight distributions as in Eq. (6.18). Let the gamma process be a CRM with fixed-location atom weight distributions as in Eq. (6.19) and ordinary component weight measure as in Eq. (6.20). Then the gamma process is a conjugate Bayesian nonparametric prior for the Poisson likelihood process.

 Example 6.4.5. Next, let X be generated according to a new process we call an odds Bernoulli process. We have previously seen a typical Bernoulli process likelihood in Exam- ple 6.2.1. In the odds Bernoulli process, we say that X, conditional on Θ, has an exponential CRM distribution. In this case, the weight of the kth atom, xk, conditional on θk has support on 0, 1 and has a Bernoulli density with odds parameter θk R+: { } ∈ x −1 h(x θk) = θ (1 + θk) | k (6.21) = exp x log(θk) log(1 + θk) . { − } That is, if ρ is the probability of a successful Bernoulli draw, then θ = ρ/(1 ρ) represents the odds ratio of the probability of success over the probability of failure. − The final line of Eq. (6.21) is written to emphasize the exponential family form of this density, with

κ(x) = 1 φ(x) = x η(θ) = log(θ) A(θ) = log(1 + θ).

By Theorem 6.4.2, the likelihood for X has a Bayesian nonparametric conjugate prior for Θ. This conjugate prior has two parts. CHAPTER 6. POSTERIORS, CONJUGACY, AND EXPONENTIAL FAMILIES 194

First, Θ has a set of Kprior,fix fixed-location atoms. The kth such atom has random weight θfix,k with density

fprior,fix,k(θ) = exp ξfix,k, η(θ) + λfix,k [ A(θ)] B(ξfix,k, λfix,k) {h i − − } ξfix,k −λfix,k = θ (1 + θ) exp B(ξfix,k, λfix,k) {− } = BetaPrime (θ ξfix,k + 1, λfix,k ξfix,k 1) , (6.22) | − − where BetaPrime(θ a, b) denotes the beta prime density with shape parameters a > 0 and b > 0. Further, |

Γ(λfix,k) exp B(ξfix,k, λfix,k) = {− } Γ(ξfix,k + 1)Γ(λfix,k ξfix,k 1) − − to ensure normalization. Second, Θ has an ordinary component characterized by any proper distribution G and weight rate measure

ν(dθ) = γ exp ξ, η(θ) + λ [ A(θ)] dθ {h i − } = γθξ(1 + θ)−λ dθ. (6.23)

We need to choose the allowable hyperparameter ranges for γ, ξ, and λ. γ > 0 to ensure ν is a measure. By Assumption A1, we must have ν(R+) = , so ν must represent an improper beta prime distribution. As such, we require either ξ +∞ 1 0 or λ ξ 1 0. By Assumption A2, we must have ≤ − − ≤

∞ ν(dθ) h(x θ) x=1 θ∈R+ · | X Z = ν(dθ) h(1 θ) · | Zθ∈R+ since the support of x is 0, 1 { } = γθξ(1 + θ)−λ dθ θ1(1 + θ)−1 · Zθ∈R+ = γ θξ+1(1 + θ)−λ−1 dθ Zθ∈R+ < . ∞ Since the integrand is the kernel of a beta prime distribution, we simply require that this distribution be proper; i.e., ξ + 2 > 0 and λ ξ 1 > 0. The hyperparameter restrictions can be summarized− − as:

γ > 0 CHAPTER 6. POSTERIORS, CONJUGACY, AND EXPONENTIAL FAMILIES 195

ξ ( 2, 1] ∈ − − λ > ξ + 1

ξfix,k > 1 and λfix,k > ξfix,k + 1 for all k [Kprior,fix]. − ∈ We call the distribution for Θ described in this example the beta prime process. Its ordinary component has previously been defined by Broderick, Mackey, et al. (2014). But this result represents the first time the beta prime process is described in full, including parameter restrictions and fixed-location atoms, as well as the first proof of its conjugacy with the odds Bernoulli process. We highlight the latter result as a corollary to Theorem 6.4.2 below. Corollary 6.4.6. Let the odds Bernoulli process be a CRM with fixed-location atom weight distributions as in Eq. (6.21). Let the beta process be a CRM with fixed-location atom weight distributions as in Eq. (6.22) and ordinary component weight measure as in Eq. (6.23). Then the beta process is a conjugate Bayesian nonparametric prior for the odds Bernoulli process.



6.5 Size-biased representations

We have shown in Section 6.4 that our exponential CRM (Definition 6.4.1) is useful in that we can find an automatic Bayesian nonparametric conjugate prior given an exponential CRM likelihood. We will see in this section and the next that exponential CRMs allow us to build representations that allow tractable inference despite the infinite-dimensional nature of the models we are using. The best-known size-biased representation of a random measure in Bayesian nonparamet- rics is the stick-breaking representation of the Dirichlet process ΘDP (Sethuraman, 1994): ∞

ΘDP = θDP,kδψk k=1 Xk−1 θDP,k = βk (1 βj) for k Z∗ (6.24) j=1 − ∈ Y iid βk Beta(1, c) ∼ iid ψk G, ∼ where c is a fixed hyperparameter satisfying c > 0. The name stick-breaking originates from thinking of the unit interval as a stick of length one. At each round k, only some of the stick remains; βk describes the proportion of the remaining stick that is broken off in round k, and θDP,k describes the total amount of re- maining stick that is broken off in round k. By construction, not only is each θDP,k (0, 1) ∈ but in fact the θDP,k add to one (the total stick length) and thus describe a distribution. CHAPTER 6. POSTERIORS, CONJUGACY, AND EXPONENTIAL FAMILIES 196

Eq. (6.24) is called a size-biased representation for the following reason. Since the weights ∞ θDP,k k=1 describe a distribution, we can make draws from this distribution; each such draw is{ sometimes} thought of as a multinomial draw with a single trial. In that vein, typically we imagine that our data points Xmult,n are described as iid draws conditioned on ΘDP , where Xmult,n is a random measure with just a single atom:

Xmult,n = δψ mult,n (6.25) ψmult,n = ψk with probability θDP,k.

Then the limiting proportion of data points Xmult,n with atom at ψmult,1 (the first atom location chosen) is θDP,1. The limiting proportion of data points with atom at the next unique atom location chosen will have size θDP,2, and so on (Broderick, Jordan, and Pitman, 2013). The representation in Eq. (6.24) is so useful because there is a familiar, finite-dimensional distribution for each of the atom weights θDP,k of the random measure ΘDP . This repre- sentation allows approximate inference via truncation (Ishwaran and James, 2001) or exact inference via slice sampling (Walker, 2007; Kalli, Griffin, and Walker, 2011). ∞ Since the weights θDP,k k=1 are constrained to sum to one, the Dirichlet process is not a CRM.4 However, size-biased{ } representations have been explored in the past for particular CRM examples, notably the beta process (Paisley, Zaas, et al., 2010; Broderick, Jordan, and Pitman, 2012). And even though there is no interpretation of these representations in terms of a single stick representing all probability mass, they are sometimes referred to as stick-breaking representations as a nod to the popularity of Dirichlet process stick-breaking. In the beta process case, such size-biased representations have already been shown to allow approximate inference via truncation (Doshi et al., 2009; Paisley, Carin, and Blei, 2011) or exact inference via slice sampling (Teh, G¨or¨ur,and Ghahramani, 2007; Broderick, Mackey, et al., 2014). Here we provide general recipes for the creation of these representations and illustrate our recipes by discovering previously unknown size-biased representations. We have seen that a general CRM Θ takes the form of an a.s. discrete random measure:

θkδψk . (6.26) k=1 X The fixed-location atoms are straightforward to simulate; there are finitely many by As- sumption A0, their locations are fixed, and their weights are assumed to come from finite- dimensional distributions. The infinite-dimensionality of the Bayesian nonparametric CRM comes from the ordinary component (cf. Section 6.2 and Assumption A1). So far the only description we have of the ordinary component is its generation from the countable infinity of points in a Poisson point process. The next result constructively demonstrates that we ∞ can represent the distributions of the CRM weights θk k=1 in Eq. (6.26) as a sequence of finite-dimensional distributions, much as in the familiar{ } Dirichlet process case.

4In fact, the Dirichlet process is a normalized gamma process (cf. Example 6.4.3) (Ferguson, 1973). CHAPTER 6. POSTERIORS, CONJUGACY, AND EXPONENTIAL FAMILIES 197

Theorem 6.5.1 (Size-biased representations). Let Θ be a completely random measure that satisfies Assumptions A0 and A1; that is, Θ is a CRM with Kfix fixed atoms such that

$K_{fix} < \infty$ and such that the $k$th atom can be written $\theta_{fix,k}\,\delta_{\psi_{fix,k}}$. The ordinary component of $\Theta$ has rate measure
$$\mu(d\theta \times d\psi) = \nu(d\theta)\cdot G(d\psi),$$
where $G$ is a proper distribution and $\nu(\mathbb{R}_+) = \infty$. Write $\Theta = \sum_{k=1}^{\infty}\theta_k\delta_{\psi_k}$, and let $X_n$ be generated iid given $\Theta$ according to $X_n = \sum_{k=1}^{\infty} x_{n,k}\delta_{\psi_k}$ with $x_{n,k} \overset{indep}{\sim} h(x\mid\theta_k)$ for proper, discrete probability mass function $h$. And suppose $X_n$ and $\Theta$ jointly satisfy Assumption A2 so that
$$\sum_{x=1}^{\infty}\int_{\theta\in\mathbb{R}_+}\nu(d\theta)h(x\mid\theta) < \infty.$$
Then we can write
$$\Theta = \sum_{m=1}^{\infty}\sum_{x=1}^{\infty}\sum_{j=1}^{\rho_{m,x}} \theta_{m,x,j}\,\delta_{\psi_{m,x,j}}$$
$$\psi_{m,x,j} \overset{iid}{\sim} G \quad \text{iid across } m, x, j$$
$$\rho_{m,x} \overset{indep}{\sim} \text{Poisson}\left(\rho \,\Big|\, \int_{\theta}\nu(d\theta)h(0\mid\theta)^{m-1}h(x\mid\theta)\right) \quad \text{across } m, x \qquad (6.27)$$
$$\theta_{m,x,j} \overset{indep}{\sim} F_{size,m,x}(d\theta) \propto \nu(d\theta)h(0\mid\theta)^{m-1}h(x\mid\theta)$$
iid across $j$ and independently across $m, x$.
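The construction in Eq. (6.27) can be turned into a simulator for the ordinary component of any CRM–likelihood pair satisfying Assumption A2. The sketch below is illustrative only: it truncates the infinite sums over m and x, discretizes ν on a finite grid (both our own numerical devices, not part of the theorem), and the gamma-process/Poisson usage at the end anticipates Example 6.5.3.

import math
import numpy as np

def size_biased_crm_sample(nu_density, h_pmf, x_max, m_max, theta_grid, rng):
    """Simulate the ordinary-component weights of a CRM via the rounds of Eq. (6.27).

    nu_density(theta): density of the (infinite-mass) rate measure nu w.r.t. Lebesgue measure
    h_pmf(x, theta)  : proper discrete likelihood h(x | theta), vectorized in theta
    x_max, m_max     : truncation of the x- and m-sums (illustrative)
    theta_grid       : equally spaced grid on which nu is discretized (illustrative)
    Returns sampled weights theta_{m,x,j}; locations would be iid draws from G.
    """
    d_theta = theta_grid[1] - theta_grid[0]
    nu_vals = nu_density(theta_grid) * d_theta               # discretized nu(d theta)
    weights = []
    for m in range(1, m_max + 1):
        miss = h_pmf(0, theta_grid) ** (m - 1)               # prob. of weight 0 in rounds 1..m-1
        for x in range(1, x_max + 1):
            rate_vals = nu_vals * miss * h_pmf(x, theta_grid)
            M_mx = rate_vals.sum()                           # Poisson mean for round (m, x)
            n_atoms = rng.poisson(M_mx)                      # rho_{m,x}
            if n_atoms > 0:
                probs = rate_vals / M_mx                     # F_size,m,x on the grid
                weights.extend(rng.choice(theta_grid, size=n_atoms, p=probs))
    return weights

# Illustrative usage with a gamma-process rate measure and Poisson likelihood.
rng = np.random.default_rng(1)
grid = np.linspace(1e-3, 30.0, 5000)
gamma_, xi, lam = 2.0, -1.0, 1.0
w = size_biased_crm_sample(
    nu_density=lambda t: gamma_ * t**xi * np.exp(-lam * t),
    h_pmf=lambda x, t: np.exp(-t) * t**x / math.factorial(x),
    x_max=10, m_max=20, theta_grid=grid, rng=rng)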

Proof. By construction, $\Theta$ is an a.s. discrete random measure with a countable infinity of atoms. Without loss of generality, suppose that for every (non-zero) value of an atom weight $\theta$, there is a non-zero probability of generating an atom with non-zero weight $x$ in the likelihood. Now suppose we generate $X_1, X_2, \ldots$. Then, for every atom $\theta\delta_{\psi}$ of $\Theta$, there exists some finite $n$ such that $X_n$ has an atom at $\psi$. Therefore, we can enumerate all of the atoms of $\Theta$ by enumerating

• Each atom $\theta\delta_{\psi}$ such that there is an atom in $X_1$ at $\psi$.
• Each atom $\theta\delta_{\psi}$ such that there is an atom in $X_2$ at $\psi$ but there is not an atom in $X_1$ at $\psi$.
...
• Each atom $\theta\delta_{\psi}$ such that there is an atom in $X_m$ at $\psi$ but there is not an atom in any of $X_1, X_2, \ldots, X_{m-1}$ at $\psi$.
...

Moreover, on the mth round of this enumeration, we can further break down the enumeration by the value of the observation Xm at the atom location:

• Each atom $\theta\delta_{\psi}$ such that there is an atom in $X_m$ of weight 1 at $\psi$ but there is not an atom in any of $X_1, X_2, \ldots, X_{m-1}$ at $\psi$.
• Each atom $\theta\delta_{\psi}$ such that there is an atom in $X_m$ of weight 2 at $\psi$ but there is not an atom in any of $X_1, X_2, \ldots, X_{m-1}$ at $\psi$.
...
• Each atom $\theta\delta_{\psi}$ such that there is an atom in $X_m$ of weight $x$ at $\psi$ but there is not an atom in any of $X_1, X_2, \ldots, X_{m-1}$ at $\psi$.
...

Recall that the values $\theta_k$ that form the weights of $\Theta$ are generated according to a Poisson point process with rate measure $\nu(d\theta)$. So, on the first round, the values of $\theta_k$ such that $x_{1,k} = x$ also holds are generated according to a thinned Poisson point process with rate measure $\nu(d\theta)h(x\mid\theta)$. In particular, since the rate measure has finite total mass by Assumption A2, we can define

$$M_{1,x} := \int_{\theta}\nu(d\theta)h(x\mid\theta),$$

which will be finite. Then the number of atoms θk for which x1,k = x is

$$\rho_{1,x} \sim \text{Poisson}(\rho \mid M_{1,x}).$$

And each such θk has weight with distribution

$$F_{size,1,x}(d\theta) \propto \nu(d\theta)h(x\mid\theta).$$

Finally, note from Theorem 6.3.1 that the posterior $\Theta \mid X_1$ has weight rate measure

$$\nu_1(d\theta) := \nu(d\theta)h(0\mid\theta).$$
Now take any $m > 1$. Suppose, inductively, that the ordinary component of the posterior $\Theta \mid X_1, \ldots, X_{m-1}$ has weight rate measure
$$\nu_{m-1}(d\theta) := \nu(d\theta)h(0\mid\theta)^{m-1}.$$
The atoms in this ordinary component have been selected precisely because they have not appeared in any of $X_1, \ldots, X_{m-1}$. As for $m = 1$, we have that the atoms $\theta_k$ in this ordinary component with corresponding weight in $X_m$ equal to $x$ are formed by a thinned Poisson point process, with rate measure

$$\nu_{m-1}(d\theta)h(x\mid\theta) = \nu(d\theta)h(0\mid\theta)^{m-1}h(x\mid\theta).$$

Since the rate measure has finite total mass by Assumption A2, we can define

$$M_{m,x} := \int_{\theta}\nu(d\theta)h(0\mid\theta)^{m-1}h(x\mid\theta),$$
which will be finite. Then the number of atoms $\theta_k$ for which $x_{m,k} = x$ is

$$\rho_{m,x} \sim \text{Poisson}(\rho \mid M_{m,x}).$$

And each such θk has weight

$$F_{size,m,x}(d\theta) \propto \nu(d\theta)h(0\mid\theta)^{m-1}h(x\mid\theta).$$

Finally, note from Theorem 6.3.1 that the posterior $\Theta \mid X_{1:m}$, which can be thought of as generated by prior $\Theta \mid X_{1:(m-1)}$ and likelihood $X_m \mid \Theta$, has weight rate measure
$$\nu(d\theta)h(0\mid\theta)^{m-1}h(0\mid\theta) = \nu_m(d\theta),$$
confirming the inductive hypothesis.
Recall that every atom of $\Theta$ is found in exactly one of these rounds and that $x \in \mathbb{Z}_+$. Also recall that the atom locations may be generated independently and identically across atoms, and independently from all the weights, according to proper distribution $G$ (Section 6.2). To summarize, we have then

$$\Theta = \sum_{m=1}^{\infty}\sum_{x=1}^{\infty}\sum_{j=1}^{\rho_{m,x}} \theta_{m,x,j}\,\delta_{\psi_{m,x,j}},$$
where
$$\psi_{m,x,j} \overset{iid}{\sim} G \quad \text{iid across } m, x, j$$
$$M_{m,x} = \int_{\theta}\nu(d\theta)h(0\mid\theta)^{m-1}h(x\mid\theta) \quad \text{across } m, x$$
$$\rho_{m,x} \overset{indep}{\sim} \text{Poisson}(\rho \mid M_{m,x}) \quad \text{across } m, x$$
$$F_{size,m,x}(d\theta) \propto \nu(d\theta)h(0\mid\theta)^{m-1}h(x\mid\theta) \quad \text{across } m, x$$
$$\theta_{m,x,j} \overset{indep}{\sim} F_{size,m,x}(d\theta) \quad \text{iid across } j \text{ and independently across } m, x,$$
as was to be shown.

The following corollary gives a more detailed recipe for the calculations in Theorem 6.5.1 when the prior is in a conjugate exponential CRM to the likelihood.

Corollary 6.5.2 (Exponential CRM size-biased representations). Let $\Theta$ be an exponential CRM with no fixed-location atoms (thereby trivially satisfying Assumption A0) such that Assumption A1 holds.

Let $X$ be generated conditional on $\Theta$ according to an exponential CRM with fixed-location atoms at $\{\psi_k\}_{k=1}^{\infty}$ and no ordinary component. Let the distribution of the weight $x_{n,k}$ at $\psi_k$ have probability mass function

$$h(x\mid\theta_k) = \kappa(x)\exp\{\langle\eta(\theta_k), \phi(x)\rangle - A(\theta_k)\}.$$
Suppose that $\Theta$ and $X$ jointly satisfy Assumption A2. And let $\Theta$ be conjugate to $X$ as in Theorem 6.4.2. Then we can write

$$\Theta = \sum_{m=1}^{\infty}\sum_{x=1}^{\infty}\sum_{j=1}^{\rho_{m,x}} \theta_{m,x,j}\,\delta_{\psi_{m,x,j}}$$
$$\psi_{m,x,j} \overset{iid}{\sim} G \quad \text{iid across } m, x, j$$
$$M_{m,x} = \gamma\cdot\kappa(0)^{m-1}\cdot\kappa(x)\exp\{B(\xi + (m-1)\phi(0) + \phi(x), \lambda + m)\}$$
$$\rho_{m,x} \overset{indep}{\sim} \text{Poisson}(\rho \mid M_{m,x}) \quad \text{independently across } m, x \qquad (6.28)$$
$$\theta_{m,x,j} \overset{indep}{\sim} f_{size,m,x}(\theta)\,d\theta = \exp\{\langle\xi + (m-1)\phi(0) + \phi(x), \eta(\theta)\rangle + (\lambda + m)[-A(\theta)] - B(\xi + (m-1)\phi(0) + \phi(x), \lambda + m)\}\,d\theta$$
iid across $j$ and independently across $m, x$.

Proof. The corollary follows from Theorem 6.5.1 by plugging in the particular forms for $\nu(d\theta)$ and $h(x\mid\theta)$. In particular,

$$M_{m,x} = \int_{\theta\in\mathbb{R}_+}\nu(d\theta)h(0\mid\theta)^{m-1}h(x\mid\theta)$$
$$= \int_{\theta\in\mathbb{R}_+}\gamma\exp\{\langle\xi,\eta(\theta)\rangle + \lambda[-A(\theta)]\}\cdot\left[\kappa(0)\exp\{\langle\eta(\theta),\phi(0)\rangle - A(\theta)\}\right]^{m-1}\cdot\kappa(x)\exp\{\langle\eta(\theta),\phi(x)\rangle - A(\theta)\}\,d\theta$$
$$= \gamma\,\kappa(0)^{m-1}\kappa(x)\exp\{B(\xi + (m-1)\phi(0) + \phi(x), \lambda + m)\},$$
and $f_{size,m,x}$ is the normalization of $\nu(d\theta)h(0\mid\theta)^{m-1}h(x\mid\theta)$, which has the stated exponential family form.

Corollary 6.5.2 can be used to find the known size-biased representation of the beta process (Thibaux and Jordan, 2007); we demonstrate this derivation in detail in Example 6.B.1 in Appendix 6.B. Here we use Corollary 6.5.2 to discover a new size-biased representation of the gamma process.

Example 6.5.3. Let $\Theta$ be a gamma process, and let $X_n$ be iid Poisson likelihood processes conditioned on $\Theta$ for each $n$ as in Example 6.4.3. That is, we have
$$\nu(d\theta) = \gamma\theta^{\xi}e^{-\lambda\theta}\,d\theta \qquad\text{and}\qquad h(x\mid\theta_k) = \frac{1}{x!}\theta_k^x e^{-\theta_k}$$
with
$$\gamma > 0, \quad \xi \in (-2, -1], \quad \lambda > 0, \quad \xi_{fix,k} > -1 \text{ and } \lambda_{fix,k} > 0 \text{ for all } k \in [K_{prior,fix}]$$
by Example 6.4.3. We can pick out the following components of $h$:
$$\kappa(x) = \frac{1}{x!}, \quad \phi(x) = x, \quad \eta(\theta) = \log(\theta), \quad A(\theta) = \theta.$$
Thus, by Corollary 6.5.2, we have
$$f_{size,m,x}(\theta) \propto \theta^{\xi+x}e^{-(\lambda+m)\theta} \propto \text{Gamma}(\theta \mid \xi + x + 1, \lambda + m).$$
We summarize the representation that follows from Corollary 6.5.2 in the following result.

Corollary 6.5.4. Let the gamma process be a CRM $\Theta$ with fixed-location atom weight distributions as in Eq. (6.19) and ordinary component weight measure as in Eq. (6.20). Then we may write

$$\Theta = \sum_{m=1}^{\infty}\sum_{x=1}^{\infty}\sum_{j=1}^{\rho_{m,x}} \theta_{m,x,j}\,\delta_{\psi_{m,x,j}}$$
$$\psi_{m,x,j} \overset{iid}{\sim} G \quad \text{iid across } m, x, j$$
$$M_{m,x} = \gamma\cdot\frac{1}{x!}\cdot\Gamma(\xi + x + 1)\cdot(\lambda + m)^{-(\xi+x+1)} \quad \text{across } m, x$$
$$\rho_{m,x} \overset{indep}{\sim} \text{Poisson}(\rho \mid M_{m,x}) \quad \text{across } m, x$$
$$\theta_{m,x,j} \overset{indep}{\sim} \text{Gamma}(\theta \mid \xi + x + 1, \lambda + m) \quad \text{iid across } j \text{ and independently across } m, x.$$
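Every distribution in Corollary 6.5.4 is a standard one, so the recipe is straightforward to simulate. Below is a minimal sketch; the Uniform(0, 1) base measure G and the truncation levels m_max and x_max are illustrative assumptions, and the Gamma distribution is parameterized by shape and rate as in the corollary.

import math
import numpy as np

def gamma_process_size_biased(gamma_, xi, lam, m_max, x_max, rng):
    """Sample ordinary-component atoms of a gamma process via Corollary 6.5.4.

    gamma_ > 0, xi in (-2, -1], lam > 0 are the rate-measure hyperparameters;
    m_max and x_max truncate the (infinite) sums over rounds m and weights x.
    The base measure G is Uniform(0, 1) purely for illustration.
    """
    weights, locations = [], []
    for m in range(1, m_max + 1):
        for x in range(1, x_max + 1):
            # M_{m,x} = gamma * (1/x!) * Gamma(xi + x + 1) * (lam + m)^{-(xi + x + 1)}
            M_mx = gamma_ / math.factorial(x) * math.gamma(xi + x + 1) \
                   * (lam + m) ** (-(xi + x + 1))
            n_atoms = rng.poisson(M_mx)                                   # rho_{m,x}
            # theta_{m,x,j} ~ Gamma(shape = xi + x + 1, rate = lam + m)
            weights.extend(rng.gamma(shape=xi + x + 1, scale=1.0 / (lam + m), size=n_atoms))
            locations.extend(rng.uniform(size=n_atoms))                   # psi_{m,x,j} ~ G
    return np.array(weights), np.array(locations)

rng = np.random.default_rng(0)
theta, psi = gamma_process_size_biased(gamma_=3.0, xi=-1.0, lam=1.0,
                                        m_max=50, x_max=20, rng=rng)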


6.6 Marginal processes

In Section 6.5, although we conceptually made use of the observations $\{X_1, X_2, \ldots\}$, we focused on a representation of the prior $\Theta$: cf. Eqs. (6.27) and (6.28). In this section, we provide a representation of the marginal of $X_{1:N}$, with $\Theta$ integrated out.
The canonical example of a marginal process again comes from the Dirichlet process (DP). In this case, the full model consists of the DP-distributed prior on $\Theta_{DP}$ (as in Eq. (6.24)) together with the likelihood for $X_{mult,n}$ conditional on $\Theta_{DP}$ (iid across $n$) described by Eq. (6.25). Then the marginal distribution of $X_{mult,1:N}$ is described by the Chinese restaurant process. This marginal takes the following form. For each $n = 1, 2, \ldots, N$,

1. Let $\{\psi_k\}_{k=1}^{K_{n-1}}$ be the union of atom locations in $X_{mult,1},\ldots,X_{mult,n-1}$. Then

$X_{mult,n} \mid X_{mult,1},\ldots,X_{mult,n-1}$ has a single atom at $\psi$, where

$$\psi = \begin{cases} \psi_k & \text{with probability} \propto \sum_{m=1}^{n-1} X_{mult,m}(\{\psi_k\}), \quad k \in [K_{n-1}] \\ \psi_{new} & \text{with probability} \propto c, \end{cases} \qquad \psi_{new} \sim G.$$
In the case of CRMs, the canonical example of a marginal process is the Indian buffet process (Griffiths and Ghahramani, 2006). Both the Chinese restaurant process and Indian buffet process have proven popular for inference since the underlying infinite-dimensional prior is integrated out in these processes and only the finite-dimensional marginal remains. Indeed, by Assumption A2, we know that this will generally be the case for our CRM Bayesian models. And thus we have the following general marginal representations for such models.
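Before the general statement, the Chinese restaurant process marginal just described can be sketched in a few lines. The code below is illustrative only: the base measure G is taken to be a standard normal, and the function name is our own.

import numpy as np

def chinese_restaurant_process(N, c, rng):
    """Sample atom locations psi for X_mult,1..N from the DP marginal described above.

    Each data point reuses an existing location psi_k with probability proportional to
    the number of previous data points at psi_k, or draws a new location from G with
    probability proportional to c.  G = N(0, 1) is an illustrative choice.
    """
    locations, counts, assignments = [], [], []
    for n in range(N):
        probs = np.array(counts + [c], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):                      # new atom: psi_new ~ G
            locations.append(rng.normal())
            counts.append(1)
        else:                                     # existing atom psi_k
            counts[k] += 1
        assignments.append(int(k))
    return locations, assignments

rng = np.random.default_rng(0)
psis, z = chinese_restaurant_process(N=100, c=2.0, rng=rng)
print(len(psis), "distinct atoms among 100 draws")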

Theorem 6.6.1 (Marginal representations). Let Θ be a completely random measure that satisfies Assumptions A0 and A1; that is, Θ is a CRM with Kfix fixed atoms such that

$K_{fix} < \infty$ and such that the $k$th atom can be written $\theta_{fix,k}\,\delta_{\psi_{fix,k}}$. The ordinary component of $\Theta$ has rate measure
$$\mu(d\theta \times d\psi) = \nu(d\theta)\cdot G(d\psi),$$
where $G$ is a proper distribution and $\nu(\mathbb{R}_+) = \infty$. Write $\Theta = \sum_{k=1}^{\infty}\theta_k\delta_{\psi_k}$, and let $X_n$ be generated iid given $\Theta$ according to $X_n = \sum_{k=1}^{\infty} x_{n,k}\delta_{\psi_k}$ with $x_{n,k} \overset{indep}{\sim} h(x\mid\theta_k)$ for proper, discrete probability mass function $h$. And suppose $X_n$ and $\Theta$ jointly satisfy Assumption A2 so that
$$\sum_{x=1}^{\infty}\int_{\theta\in\mathbb{R}_+}\nu(d\theta)h(x\mid\theta) < \infty.$$

Then the marginal distribution of X1:N is the same as that provided by the following construction. For each n = 1, 2,...,N,

1. Let $\{\psi_k\}_{k=1}^{K_{n-1}}$ be the union of atom locations in $X_1,\ldots,X_{n-1}$. Let $x_{m,k} := X_m(\{\psi_k\})$. Let $x_{n,k}$ denote the weight of $X_n \mid X_1,\ldots,X_{n-1}$ at $\psi_k$. Then $x_{n,k}$ has distribution described by the following probability mass function:

$$h_{cond}(x_{n,k} = x \mid x_{1:(n-1),k}) = \frac{\int_{\theta\in\mathbb{R}_+}\nu(d\theta)h(x\mid\theta)\prod_{m=1}^{n-1}h(x_{m,k}\mid\theta)}{\int_{\theta\in\mathbb{R}_+}\nu(d\theta)\prod_{m=1}^{n-1}h(x_{m,k}\mid\theta)}.$$
2. For each $x = 1, 2, \ldots$

• $X_n$ has $\rho_{n,x}$ new atoms. That is, $X_n$ has atoms at locations $\{\psi_{n,x,j}\}_{j=1}^{\rho_{n,x}}$, where
$$\{\psi_{n,x,j}\}_{j=1}^{\rho_{n,x}} \cap \{\psi_k\}_{k=1}^{K_{n-1}} = \emptyset \quad \text{a.s.}$$
Moreover,
$$\rho_{n,x} \overset{indep}{\sim} \text{Poisson}\left(\rho \,\Big|\, \int_{\theta}\nu(d\theta)h(0\mid\theta)^{n-1}h(x\mid\theta)\right) \quad \text{across } n, x$$
$$\psi_{n,x,j} \overset{iid}{\sim} G(d\psi) \quad \text{across } n, x, j.$$

Proof. We saw in the proof of Theorem 6.5.1 that the marginal for $X_1$ can be expressed as follows. For each $x \in \mathbb{Z}_+$, there are $\rho_{1,x}$ atoms of $X_1$ with weight $x$, where
$$\rho_{1,x} \overset{indep}{\sim} \text{Poisson}\left(\int_{\theta}\nu(d\theta)h(x\mid\theta)\right) \quad \text{across } x.$$
These atoms have locations $\{\psi_{1,x,j}\}_{j=1}^{\rho_{1,x}}$, where
$$\psi_{1,x,j} \overset{iid}{\sim} G(d\psi) \quad \text{across } x, j.$$
For the upcoming induction, let $K_1 := \sum_{x=1}^{\infty}\rho_{1,x}$. And let $\{\psi_k\}_{k=1}^{K_1}$ be the (a.s. disjoint by assumption) union of the sets $\{\psi_{1,x,j}\}_{j=1}^{\rho_{1,x}}$ across $x$. Note that $K_1$ is finite by Assumption A2.
We will also find it useful in the upcoming induction to let $\Theta_{post,1}$ have the distribution of $\Theta \mid X_1$. Let $\theta_{post,1,x,j} = \Theta_{post,1}(\{\psi_{1,x,j}\})$. By Theorem 6.3.1 or the proof of Theorem 6.5.1, we have that

$$\theta_{post,1,x,j} \overset{indep}{\sim} F_{post,1,x,j}(d\theta) \propto \nu(d\theta)h(x\mid\theta)$$
independently across $x$ and iid across $j$.

Now take any $n > 1$. Inductively, we assume $\{\psi_{n-1,k}\}_{k=1}^{K_{n-1}}$ is the union of all the atom locations of $X_1,\ldots,X_{n-1}$. Further assume $K_{n-1}$ is finite. Let $\Theta_{post,n-1}$ have the distribution

of $\Theta \mid X_1,\ldots,X_{n-1}$. Let $\theta_{n-1,k}$ be the weight of $\Theta_{post,n-1}$ at $\psi_{n-1,k}$. And, for any $m \in [n-1]$, let $x_{m,k}$ be the weight of $X_m$ at $\psi_{n-1,k}$. We inductively assume that

$$\theta_{n-1,k} \overset{indep}{\sim} F_{n-1,k}(d\theta) \propto \nu(d\theta)\prod_{m=1}^{n-1}h(x_{m,k}\mid\theta) \qquad (6.29)$$
independently across $k$.

Now let $\psi_{n,k}$ equal $\psi_{n-1,k}$ for $k \in [K_{n-1}]$. Let $x_{n,k}$ denote the weight of $X_n$ at $\psi_{n,k}$ for $k \in [K_{n-1}]$. Conditional on the atom weight of $\Theta$ at $\psi_{n,k}$, the atom weights of $X_1,\ldots,X_{n-1},X_n$ are independent. Since the atom weights of $\Theta$ are independent as well, we have that $x_{n,k} \mid X_1,\ldots,X_{n-1}$ has the same distribution as $x_{n,k} \mid x_{1,k},\ldots,x_{n-1,k}$. We can write the probability mass function of this distribution as follows.

$$h_{cond}(x_{n,k} = x \mid x_{1,k},\ldots,x_{n-1,k})$$

$$= \int_{\theta\in\mathbb{R}_+}F_{n-1,k}(d\theta)h(x\mid\theta) = \frac{\int_{\theta\in\mathbb{R}_+}\nu(d\theta)\left[\prod_{m=1}^{n-1}h(x_{m,k}\mid\theta)\right]\cdot h(x\mid\theta)}{\int_{\theta\in\mathbb{R}_+}\nu(d\theta)\prod_{m=1}^{n-1}h(x_{m,k}\mid\theta)},$$
where the last line follows from Eq. (6.29).
We next show the inductive hypothesis in Eq. (6.29) holds for $n$ and $k \in [K_{n-1}]$. Let $x_{n,k}$ denote the weight of $X_n$ at $\psi_{n,k}$ for $k \in [K_{n-1}]$. Let $F_{n,k}(d\theta)$ denote the distribution of $\theta_{n,k}$, the weight of $\Theta_{post,n}$ at $\psi_{n,k}$, and note that

$$F_{n,k}(d\theta) \propto F_{n-1,k}(d\theta)\cdot h(x_{n,k}\mid\theta) \propto \nu(d\theta)\prod_{m=1}^{n}h(x_{m,k}\mid\theta),$$
which agrees with Eq. (6.29) for $n$ when we assume the result for $n-1$.
The previous development covers atoms that are present in at least one of $X_1,\ldots,X_{n-1}$. Next we consider new atoms in $X_n$; that is, we consider atoms in $X_n$ for which there are no atoms at the same location in any of $X_1,\ldots,X_{n-1}$. We saw in the proof of Theorem 6.5.1 that, for each $x \in \mathbb{Z}_+$, there are $\rho_{n,x}$ new atoms of $X_n$ with weight $x$ such that

$$\rho_{n,x} \overset{indep}{\sim} \text{Poisson}\left(\rho \,\Big|\, \int_{\theta}\nu(d\theta)h(0\mid\theta)^{n-1}h(x\mid\theta)\right) \quad \text{across } x.$$
These new atoms have locations $\{\psi_{n,x,j}\}_{j=1}^{\rho_{n,x}}$ with
$$\psi_{n,x,j} \overset{iid}{\sim} G(d\psi) \quad \text{across } x, j.$$

By Assumption A2, $\sum_{x=1}^{\infty}\rho_{n,x} < \infty$. So
$$K_n := K_{n-1} + \sum_{x=1}^{\infty}\rho_{n,x}$$
remains finite. Let $\psi_{n,k}$ for $k \in \{K_{n-1}+1,\ldots,K_n\}$ index these new locations. Let $\theta_{n,k}$ be the weight of $\Theta_{post,n}$ at $\psi_{n,k}$ for $k \in \{K_{n-1}+1,\ldots,K_n\}$. And let $x_{n,k}$ be the value of $X_n$ at $\psi_{n,k}$. We check that the inductive hypothesis holds.
By repeated application of Theorem 6.3.1, the ordinary component of $\Theta \mid X_1,\ldots,X_{n-1}$ has rate measure $\nu(d\theta)h(0\mid\theta)^{n-1}$. So, again by Theorem 6.3.1, we have that

$$\theta_{n,k} \overset{indep}{\sim} F_{n,k}(d\theta) \propto \nu(d\theta)h(0\mid\theta)^{n-1}h(x_{n,k}\mid\theta).$$

Since $X_m$ has value 0 at $\psi_{n,k}$ for $m \in \{1,\ldots,n-1\}$ by construction, we have that the inductive hypothesis holds.
As in the case of size-biased representations (Section 6.5 and Corollary 6.5.2), we can find a more detailed recipe when the prior is in a conjugate exponential CRM to the likelihood.

Corollary 6.6.2 (Exponential CRM marginal representations). Let $\Theta$ be an exponential CRM with no fixed-location atoms (thereby trivially satisfying Assumption A0) such that Assumption A1 holds. Let $X$ be generated conditional on $\Theta$ according to an exponential CRM with fixed-location atoms at $\{\psi_k\}_{k=1}^{\infty}$ and no ordinary component. Let the distribution of the weight $x_{n,k}$ at $\psi_k$ have probability mass function

$$h(x\mid\theta_k) = \kappa(x)\exp\{\langle\eta(\theta_k),\phi(x)\rangle - A(\theta_k)\}.$$
Suppose that $\Theta$ and $X$ jointly satisfy Assumption A2. And let $\Theta$ be conjugate to $X$ as in Theorem 6.4.2. Then the marginal distribution of $X_{1:N}$ is the same as that provided by the following construction. For each $n = 1, 2, \ldots, N$,

1. Let $\{\psi_k\}_{k=1}^{K_{n-1}}$ be the union of atom locations in $X_1,\ldots,X_{n-1}$. Let $x_{m,k} := X_m(\{\psi_k\})$. Let $x_{n,k}$ denote the weight of $X_n \mid X_1,\ldots,X_{n-1}$ at $\psi_k$. Then $x_{n,k}$ has distribution described by the following probability mass function:

$$h_{cond}(x_{n,k} = x \mid x_{1:(n-1),k}) = \kappa(x)\exp\left\{-B\Big(\xi + \sum_{m=1}^{n-1}\phi(x_{m,k}),\ \lambda + n - 1\Big) + B\Big(\xi + \sum_{m=1}^{n-1}\phi(x_{m,k}) + \phi(x),\ \lambda + n\Big)\right\}.$$

2. For each x = 1, 2,...

• $X_n$ has $\rho_{n,x}$ new atoms. That is, $X_n$ has atoms at locations $\{\psi_{n,x,j}\}_{j=1}^{\rho_{n,x}}$, where
$$\{\psi_{n,x,j}\}_{j=1}^{\rho_{n,x}} \cap \{\psi_k\}_{k=1}^{K_{n-1}} = \emptyset \quad \text{a.s.}$$
Moreover,
$$M_{n,x} := \gamma\cdot\kappa(0)^{n-1}\cdot\kappa(x)\exp\{B(\xi + (n-1)\phi(0) + \phi(x), \lambda + n)\} \quad \text{across } n, x$$
$$\rho_{n,x} \overset{indep}{\sim} \text{Poisson}(\rho \mid M_{n,x}) \quad \text{across } n, x$$
$$\psi_{n,x,j} \overset{iid}{\sim} G(d\psi) \quad \text{across } n, x, j.$$

Proof. The corollary follows from Theorem 6.6.1 by plugging in the forms for $\nu(d\theta)$ and $h(x\mid\theta)$. In particular,

$$\int_{\theta\in\mathbb{R}_+}\nu(d\theta)\prod_{m=1}^{n}h(x_{m,k}\mid\theta)$$
$$= \int_{\theta\in\mathbb{R}_+}\gamma\exp\{\langle\xi,\eta(\theta)\rangle + \lambda[-A(\theta)]\}\cdot\left[\prod_{m=1}^{n}\kappa(x_{m,k})\exp\{\langle\eta(\theta),\phi(x_{m,k})\rangle - A(\theta)\}\right]d\theta$$
$$= \gamma\left[\prod_{m=1}^{n}\kappa(x_{m,k})\right]\exp\left\{B\Big(\xi + \sum_{m=1}^{n}\phi(x_{m,k}),\ \lambda + n\Big)\right\}.$$
So

$$h_{cond}(x_{n,k} = x \mid x_{1:(n-1),k}) = \frac{\int_{\theta\in\mathbb{R}_+}\nu(d\theta)h(x\mid\theta)\prod_{m=1}^{n-1}h(x_{m,k}\mid\theta)}{\int_{\theta\in\mathbb{R}_+}\nu(d\theta)\prod_{m=1}^{n-1}h(x_{m,k}\mid\theta)}$$
$$= \kappa(x)\exp\left\{-B\Big(\xi + \sum_{m=1}^{n-1}\phi(x_{m,k}),\ \lambda + n - 1\Big) + B\Big(\xi + \sum_{m=1}^{n-1}\phi(x_{m,k}) + \phi(x),\ \lambda + n\Big)\right\}.$$

In Example 6.C.1 in Appendix 6.C we show that Corollary 6.6.2 can be used to recover the Indian buffet process marginal from a beta process prior together with a Bernoulli process likelihood. In the following example, we discover a new marginal for the Poisson likelihood process with gamma process prior.

Example 6.6.3. Let Θ be a gamma process, and let Xn be iid Poisson likelihood processes conditioned on Θ for each n as in Example 6.4.3. That is, we have

$$\nu(d\theta) = \gamma\theta^{\xi}e^{-\lambda\theta}\,d\theta \qquad\text{and}\qquad h(x\mid\theta_k) = \frac{1}{x!}\theta_k^x e^{-\theta_k}$$
with
$$\gamma > 0, \quad \xi \in (-2, -1], \quad \lambda > 0, \quad \xi_{fix,k} > -1 \text{ and } \lambda_{fix,k} > 0 \text{ for all } k \in [K_{prior,fix}]$$
by Example 6.4.3. We can pick out the following components of $h$:
$$\kappa(x) = \frac{1}{x!}, \quad \phi(x) = x, \quad \eta(\theta) = \log(\theta), \quad A(\theta) = \theta.$$

And we calculate

$$\exp\{B(\xi, \lambda)\} = \int_{\theta\in\mathbb{R}_+}\exp\{\langle\xi,\eta(\theta)\rangle + \lambda[-A(\theta)]\}\,d\theta = \int_{\theta\in\mathbb{R}_+}\theta^{\xi}e^{-\lambda\theta}\,d\theta = \Gamma(\xi+1)\,\lambda^{-(\xi+1)}.$$

So, for $k \in \mathbb{Z}_*$, we have
$$P(x_n = x) = \kappa(x)\exp\left\{-B\Big(\xi + \sum_{m=1}^{n-1}x_m,\ \lambda + n - 1\Big) + B\Big(\xi + \sum_{m=1}^{n-1}x_m + x,\ \lambda + n\Big)\right\}$$
$$= \frac{1}{x!}\cdot\frac{(\lambda + n - 1)^{\xi + \sum_{m=1}^{n-1}x_m + 1}}{\Gamma(\xi + \sum_{m=1}^{n-1}x_m + 1)}\cdot\frac{\Gamma(\xi + \sum_{m=1}^{n-1}x_m + x + 1)}{(\lambda + n)^{\xi + \sum_{m=1}^{n-1}x_m + x + 1}}$$
$$= \frac{\Gamma(\xi + \sum_{m=1}^{n-1}x_m + x + 1)}{\Gamma(x + 1)\,\Gamma(\xi + \sum_{m=1}^{n-1}x_m + 1)}\left(\frac{\lambda + n - 1}{\lambda + n}\right)^{\xi + \sum_{m=1}^{n-1}x_m + 1}\left(\frac{1}{\lambda + n}\right)^{x}$$
$$= \text{NegBin}\left(x \,\Big|\, \xi + \sum_{m=1}^{n-1}x_m + 1,\ (\lambda + n)^{-1}\right).$$
And
$$M_{n,x} := \gamma\cdot\kappa(0)^{n-1}\cdot\kappa(x)\exp\{B(\xi + (n-1)\phi(0) + \phi(x), \lambda + n)\} = \gamma\cdot\frac{1}{x!}\cdot\Gamma(\xi + x + 1)\,(\lambda + n)^{-(\xi+x+1)}.$$

We summarize the marginal distribution representation of $X_{1:N}$ that follows from Corollary 6.6.2 in the following result.

Corollary 6.6.4. Let $\Theta$ be a gamma process with fixed-location atom weight distributions as in Eq. (6.19) and ordinary component weight measure as in Eq. (6.20). Let $X_n$ be drawn, iid across $n$, conditional on $\Theta$ according to a Poisson likelihood process with fixed-location atom weight distributions as in Eq. (6.18). Then $X_{1:N}$ has the same distribution as the following construction. For each $n = 1, 2, \ldots, N$,

1. Let $\{\psi_k\}_{k=1}^{K_{n-1}}$ be the union of atom locations in $X_1,\ldots,X_{n-1}$. Let $x_{m,k} := X_m(\{\psi_k\})$. Let $x_{n,k}$ denote the weight of $X_n \mid X_1,\ldots,X_{n-1}$ at $\psi_k$. Then $x_{n,k}$ has distribution described by the following probability mass function:
$$h_{cond}(x_{n,k} = x \mid x_{1:(n-1),k}) = \text{NegBin}\left(x \,\Big|\, \xi + \sum_{m=1}^{n-1}x_{m,k} + 1,\ (\lambda + n)^{-1}\right).$$

2. For each x = 1, 2,...

• $X_n$ has $\rho_{n,x}$ new atoms. That is, $X_n$ has atoms at locations $\{\psi_{n,x,j}\}_{j=1}^{\rho_{n,x}}$, where
$$\{\psi_{n,x,j}\}_{j=1}^{\rho_{n,x}} \cap \{\psi_k\}_{k=1}^{K_{n-1}} = \emptyset \quad \text{a.s.}$$
Moreover,
$$M_{n,x} := \gamma\cdot\frac{1}{x!}\cdot\frac{\Gamma(\xi + x + 1)}{(\lambda + n)^{\xi + x + 1}} \quad \text{across } n, x$$
$$\rho_{n,x} \overset{indep}{\sim} \text{Poisson}(\rho \mid M_{n,x}) \quad \text{independently across } n, x$$
$$\psi_{n,x,j} \overset{iid}{\sim} G(d\psi) \quad \text{independently across } n, x \text{ and iid across } j.$$
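Corollary 6.6.4 can be simulated directly: existing atoms are updated with negative binomial draws, and new atoms of each weight x arrive Poisson-many at a time. The following minimal sketch uses a Uniform(0, 1) base measure G and truncates the sum over new-atom weights at x_max, both illustrative assumptions; note that NumPy's negative_binomial counts failures with a given success probability, so its parameter is 1 − p when p is the success probability in the corollary's NegBin(x | r, p).

import math
import numpy as np

def gamma_poisson_marginal(N, gamma_, xi, lam, x_max, rng):
    """Sample X_1,...,X_N from the marginal construction of Corollary 6.6.4.

    Returns, for each n, a dict mapping atom locations to observed (positive) weights.
    x_max truncates the sum over new-atom weights x (illustrative); G = Uniform(0, 1).
    """
    atoms = {}                           # location -> list of weights seen so far (zeros included)
    observations = []
    for n in range(1, N + 1):
        X_n = {}
        # Step 1: weights at previously seen atoms follow a negative binomial.
        for psi, past in atoms.items():
            r = xi + sum(past) + 1                      # NegBin size parameter (> 0 here)
            p = 1.0 / (lam + n)                         # success probability in the corollary
            x = rng.negative_binomial(r, 1.0 - p)       # NumPy's parameter is 1 - p
            past.append(x)
            if x > 0:
                X_n[psi] = x
        # Step 2: new atoms of weight x arrive as Poisson(M_{n,x}) many.
        for x in range(1, x_max + 1):
            M_nx = gamma_ / math.factorial(x) * math.gamma(xi + x + 1) \
                   * (lam + n) ** (-(xi + x + 1))
            for _ in range(rng.poisson(M_nx)):
                psi = rng.uniform()                     # psi ~ G
                atoms[psi] = [x]
                X_n[psi] = x
        observations.append(X_n)
    return observations

rng = np.random.default_rng(0)
obs = gamma_poisson_marginal(N=5, gamma_=3.0, xi=-1.0, lam=1.0, x_max=20, rng=rng)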

6.7 Discussion

In the preceding sections, we have shown how to calculate posteriors for general CRM-based priors and likelihoods for Bayesian nonparametric models. We have also shown how to represent Bayesian nonparametric priors as a sequence of finite draws, and full Bayesian nonparametric models via finite marginals. We have introduced a notion of exponential families for CRMs, which we call exponential CRMs, that has allowed us to specify automatic Bayesian nonparametric conjugate priors for exponential CRM likelihoods. And we have demonstrated that our exponential CRMs allow particularly straightforward recipes for size-biased and marginal representations of Bayesian nonparametric models. Along the way, we have proved that the gamma process is a conjugate prior for the Poisson likelihood process and the beta prime process is a conjugate prior for the odds Bernoulli process. We have discovered a size-biased representation of the gamma process and a marginal representation of the gamma process coupled with a Poisson likelihood process.
All of this work has relied heavily on the description of Bayesian nonparametric models in terms of completely random measures. As such, we have worked very particularly with pairings of real values—the CRM atom weights, which we have interpreted as trait frequencies or rates—together with trait descriptors—the CRM atom locations. However, all of our proofs broke into essentially two parts: the fixed-location atom part and the ordinary component part. The fixed-location atom development essentially translated into the usual finite version of Bayes Theorem and could easily be extended to full Bayesian models where the prior describes a random element that need not be real-valued. Moreover, the ordinary component development relied entirely on its generation as a Poisson point process over a product space. It seems reasonable to expect that our development might carry through when the first element in this tuple need not be real-valued. And thus we believe our results are suggestive of broader results over more general spaces.

6.A Further automatic conjugate priors

We use Theorem 6.4.2 to calculate automatic conjugate priors for further exponential CRMs.

Example 6.A.1. Let $X$ be generated according to a Bernoulli process as in Example 6.2.1. That is, $X$ has an exponential CRM distribution with $K_{like,fix}$ fixed-location atoms, where $K_{like,fix} < \infty$ in accordance with Assumption A0:

$$X = \sum_{k=1}^{K_{like,fix}} x_{like,k}\,\delta_{\psi_{like,k}}.$$
The weight of the $k$th atom, $x_{like,k}$, has support on $\{0, 1\}$ and has a Bernoulli density with parameter $\theta_k \in (0, 1]$:
$$h(x\mid\theta_k) = \theta_k^x(1-\theta_k)^{1-x}$$

$$= \exp\{x\log(\theta_k/(1-\theta_k)) + \log(1-\theta_k)\}.$$
The final line is rewritten to emphasize the exponential family form of this density, with

$$\kappa(x) = 1, \quad \phi(x) = x, \quad \eta(\theta) = \log\frac{\theta}{1-\theta}, \quad A(\theta) = -\log(1-\theta).$$
Then, by Theorem 6.4.2, $X$ has a Bayesian nonparametric conjugate prior for

$$\Theta := \sum_{k=1}^{K_{like,fix}} \theta_k\,\delta_{\psi_k}.$$
This conjugate prior has two parts. First, $\Theta$ has a set of $K_{prior,fix}$ fixed-location atoms at some subset of the $K_{like,fix}$ fixed locations of $X$. The $k$th such atom has random weight $\theta_{fix,k}$ with density

$$f_{prior,fix,k}(\theta) = \exp\{\langle\xi_{fix,k},\eta(\theta)\rangle + \lambda_{fix,k}[-A(\theta)] - B(\xi_{fix,k},\lambda_{fix,k})\}$$
$$= \theta^{\xi_{fix,k}}(1-\theta)^{\lambda_{fix,k}-\xi_{fix,k}}\exp\{-B(\xi_{fix,k},\lambda_{fix,k})\}$$
$$= \text{Beta}(\theta \mid \xi_{fix,k} + 1,\ \lambda_{fix,k} - \xi_{fix,k} + 1),$$
where $\text{Beta}(\theta\mid a, b)$ denotes the beta density with shape parameters $a > 0$ and $b > 0$. So we must have fixed hyperparameters $\xi_{fix,k} > -1$ and $\lambda_{fix,k} > \xi_{fix,k} - 1$. Further,
$$\exp\{-B(\xi_{fix,k},\lambda_{fix,k})\} = \frac{\Gamma(\lambda_{fix,k}+2)}{\Gamma(\xi_{fix,k}+1)\,\Gamma(\lambda_{fix,k}-\xi_{fix,k}+1)}$$
to ensure normalization.
Second, $\Theta$ has an ordinary component characterized by any proper distribution $G$ and weight rate measure

$$\nu(d\theta) = \gamma\exp\{\langle\xi,\eta(\theta)\rangle + \lambda[-A(\theta)]\}\,d\theta = \gamma\theta^{\xi}(1-\theta)^{\lambda-\xi}\,d\theta.$$
Finally, we need to choose the allowable hyperparameter ranges for $\gamma$, $\xi$, and $\lambda$. $\gamma > 0$ ensures $\nu$ is a measure. By Assumption A1, we must have $\nu(\mathbb{R}_+) = \infty$, so $\nu$ must represent an improper beta distribution. As such, we require either $\xi + 1 \le 0$ or $\lambda - \xi + 1 \le 0$. By Assumption A2, we must have

$$\sum_{x=1}^{\infty}\int_{\theta\in\mathbb{R}_+}\nu(d\theta)\cdot h(x\mid\theta) = \int_{\theta\in(0,1]}\nu(d\theta)h(1\mid\theta) \quad\text{since the support of } x \text{ is } \{0,1\} \text{ and the support of } \theta \text{ is } (0,1]$$
$$= \gamma\int_{\theta\in(0,1]}\theta^{\xi}(1-\theta)^{\lambda-\xi}\cdot\theta\,d\theta < \infty.$$
Since the integrand is the kernel of a beta distribution, the integral is finite if and only if $\xi + 2 > 0$ and $\lambda - \xi + 1 > 0$.
Finally, then, the hyperparameter restrictions can be summarized as:

$$\gamma > 0, \quad \xi \in (-2, -1], \quad \lambda > \xi - 1, \quad \xi_{fix,k} > -1 \text{ and } \lambda_{fix,k} > \xi_{fix,k} - 1 \text{ for all } k \in [K_{prior,fix}].$$

By setting $\alpha = \xi + 1$, $c = \lambda + 2$, $\rho_{fix,k} = \xi_{fix,k} + 1$, and $\sigma_{fix,k} = \lambda_{fix,k} - \xi_{fix,k} + 1$, we recover the hyperparameters of Eq. (6.11) in Example 6.2.1. Here, by contrast to Example 6.2.1, we found the conjugate prior and its hyperparameter settings given just the Bernoulli process likelihood. Henceforth, we use the parameterization of the beta process above.

6.B Further size-biased representations

Example 6.B.1. Let Θ be a beta process, and let Xn be iid Bernoulli processes conditioned on Θ for each n as in Example 6.A.1. That is, we have

$$\nu(d\theta) = \gamma\theta^{\xi}(1-\theta)^{\lambda-\xi}\,d\theta \qquad\text{and}\qquad h(x\mid\theta_k) = \theta_k^x(1-\theta_k)^{1-x}$$
with
$$\gamma > 0, \quad \xi \in (-2, -1], \quad \lambda > \xi - 1, \quad \xi_{fix,k} > -1 \text{ and } \lambda_{fix,k} > \xi_{fix,k} - 1 \text{ for all } k \in [K_{prior,fix}]$$
by Example 6.A.1. We can pick out the following components of $h$:

$$\kappa(x) = 1, \quad \phi(x) = x, \quad \eta(\theta) = \log\frac{\theta}{1-\theta}, \quad A(\theta) = -\log(1-\theta).$$
Thus, by Corollary 6.5.2,

$$\Theta = \sum_{m=1}^{\infty}\sum_{x=1}^{\infty}\sum_{j=1}^{\rho_{m,x}} \theta_{m,x,j}\,\delta_{\psi_{m,x,j}}$$
$$\psi_{m,x,j} \overset{iid}{\sim} G \quad \text{iid across } m, x, j$$
$$\theta_{m,x,j} \overset{indep}{\sim} f_{size,m,x}(\theta)\,d\theta \propto \theta^{\xi+x}(1-\theta)^{\lambda+m-\xi-x}\,d\theta \propto \text{Beta}(\theta \mid \xi + x + 1,\ \lambda - \xi + m - x + 1)\,d\theta$$
iid across $j$ and independently across $m, x$
$$M_{m,x} := \gamma\cdot\frac{\Gamma(\xi + x + 1)\,\Gamma(\lambda - \xi + m - x + 1)}{\Gamma(\lambda + m + 2)}$$
$$\rho_{m,x} \overset{indep}{\sim} \text{Poisson}(M_{m,x}) \quad \text{across } m, x.$$
Broderick, Jordan, and Pitman (2012) and Paisley, Blei, and Jordan (2012) have previously noted that this size-biased representation of the beta process arises from the Poisson point process.
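For comparison with the gamma process, the beta process size-biased representation above is just as easy to simulate, since each round draws Poisson-many Beta-distributed weights. The sketch below uses the Beta(ξ + x + 1, λ − ξ + m − x + 1) weight distribution written above with x = 1 (the only positive weight the Bernoulli likelihood allows); the Uniform(0, 1) base measure G and the round truncation m_max are illustrative assumptions.

import math
import numpy as np

def beta_process_size_biased(gamma_, xi, lam, m_max, rng):
    """Sample ordinary-component atoms of a beta process via Example 6.B.1.

    Because the Bernoulli likelihood puts mass only on x in {0, 1}, only the x = 1
    rounds contribute atoms.  m_max truncates the sum over rounds (illustrative).
    """
    weights, locations = [], []
    x = 1
    for m in range(1, m_max + 1):
        a = xi + x + 1                    # first Beta shape parameter
        b = lam - xi + m - x + 1          # second Beta shape parameter
        M_mx = gamma_ * math.gamma(a) * math.gamma(b) / math.gamma(lam + m + 2)
        n_atoms = rng.poisson(M_mx)       # rho_{m,1}
        weights.extend(rng.beta(a, b, size=n_atoms))
        locations.extend(rng.uniform(size=n_atoms))   # psi_{m,1,j} ~ G
    return np.array(weights), np.array(locations)

rng = np.random.default_rng(0)
theta, psi = beta_process_size_biased(gamma_=3.0, xi=-1.0, lam=0.5, m_max=200, rng=rng)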

6.C Further marginals

Example 6.C.1. Let $\Theta$ be a beta process, and let $X_n$ be iid Bernoulli processes conditioned on $\Theta$ for each $n$ as in Examples 6.A.1 and 6.B.1. We calculate the main components of Corollary 6.6.2 for this pair of processes. In particular, we have

$$P(x_n = 1) = \kappa(1)\exp\left\{-B\Big(\xi + \sum_{m=1}^{n-1}x_m,\ \lambda + n - 1\Big) + B\Big(\xi + \sum_{m=1}^{n-1}x_m + 1,\ \lambda + n\Big)\right\}$$
$$= \frac{\Gamma(\lambda + n - 1 + 2)}{\Gamma(\xi + \sum_{m=1}^{n-1}x_m + 1)\,\Gamma(\lambda + n - 1 - \xi - \sum_{m=1}^{n-1}x_m + 1)}\cdot\frac{\Gamma(\xi + \sum_{m=1}^{n-1}x_m + 1 + 1)\,\Gamma(\lambda + n - \xi - \sum_{m=1}^{n-1}x_m - 1 + 1)}{\Gamma(\lambda + n + 2)}$$
$$= \frac{\xi + \sum_{m=1}^{n-1}x_m + 1}{\lambda + n + 1}.$$

And

$$M_{n,1} := \gamma\cdot\kappa(0)^{n-1}\cdot\kappa(1)\exp\{B(\xi + (n-1)\phi(0) + \phi(1), \lambda + n)\} = \gamma\cdot\frac{\Gamma(\xi + 1 + 1)\,\Gamma(\lambda + n - \xi - 1 + 1)}{\Gamma(\lambda + n + 2)}.$$

Thus, the marginal distribution of X1:N is the same as that provided by the following construction. For each n = 1, 2,...,N,

1. At any location $\psi$ for which there is some atom in $X_1,\ldots,X_{n-1}$, let $x_m$ be the weight of $X_m$ at $\psi$ for $m \in [n-1]$. Then we have that $X_n \mid X_1,\ldots,X_{n-1}$ has weight $x_n$ at $\psi$, where

$$P(dx_n) = \text{Bern}\left(x_n \,\Big|\, \frac{\xi + \sum_{m=1}^{n-1}x_m + 1}{\lambda + n + 1}\right).$$

2. $X_n$ has $\rho_{n,1}$ atoms at locations $\{\psi_{n,1,j}\}$ with $j \in [\rho_{n,1}]$ where there have not yet been atoms in any of $X_1,\ldots,X_{n-1}$. Moreover,
$$M_{n,1} := \gamma\cdot\frac{\Gamma(\xi + 1 + 1)\,\Gamma(\lambda + n - \xi - 1 + 1)}{\Gamma(\lambda + n + 2)} \quad \text{across } n$$
$$\rho_{n,1} \overset{indep}{\sim} \text{Poisson}(M_{n,1}) \quad \text{across } n$$
$$\psi_{n,1,j} \overset{iid}{\sim} G(d\psi) \quad \text{across } n, j.$$
Here, we have recovered the three-parameter extension of the Indian buffet process (Teh and Görür, 2009; Broderick, Jordan, and Pitman, 2013).
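The Indian-buffet-style construction just recovered can be sketched directly. The code below records only the binary allocation matrix, since the locations are iid draws from G that play no further role; the function name and the choice to return a matrix are our own.

import math
import numpy as np

def beta_bernoulli_marginal(N, gamma_, xi, lam, rng):
    """Sample X_1,...,X_N from the construction of Example 6.C.1.

    Returns a binary matrix Z with Z[n, k] = 1 if the (n+1)th process has an atom at the
    kth location generated so far.
    """
    Z_rows, counts = [], []                               # counts[k] = sum_m x_m at atom k
    for n in range(1, N + 1):
        row = []
        # Step 1: revisit existing atoms with a Bernoulli draw.
        for k in range(len(counts)):
            p = (xi + counts[k] + 1) / (lam + n + 1)
            x = rng.binomial(1, p)
            counts[k] += x
            row.append(x)
        # Step 2: new atoms arrive as Poisson(M_{n,1}) many, each with weight 1.
        M_n1 = gamma_ * math.gamma(xi + 2) * math.gamma(lam + n - xi) / math.gamma(lam + n + 2)
        for _ in range(rng.poisson(M_n1)):
            counts.append(1)
            row.append(1)
        Z_rows.append(row)
    Z = np.zeros((N, len(counts)), dtype=int)
    for n, row in enumerate(Z_rows):
        Z[n, :len(row)] = row
    return Z

rng = np.random.default_rng(0)
Z = beta_bernoulli_marginal(N=10, gamma_=3.0, xi=-1.0, lam=0.5, rng=rng)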

Part II

Scaling inference

Chapter 7

Streaming variational Bayes

We present SDA-Bayes, a framework for (S)treaming, (D)istributed, (A)synchronous computation of a Bayesian posterior. The framework makes streaming updates to the estimated posterior according to a user-specified approximation batch primitive. We demonstrate the usefulness of our framework, with variational Bayes (VB) as the primitive, by fitting the latent Dirichlet allocation model to two large-scale document collections. We demonstrate the advantages of our algorithm over stochastic variational inference (SVI) by comparing the two after a single pass through a known amount of data—a case where SVI may be applied—and in the streaming setting, where SVI does not apply.

7.1 Introduction

Large, streaming data sets are increasingly the norm in science and technology. Simple descriptive statistics can often be readily computed with a constant number of operations for each data point in the streaming setting, without the need to revisit past data or have advance knowledge of future data. But these time and memory restrictions are not generally available for the complex, hierarchical models that practitioners often have in mind when they collect large data sets. Significant progress on scalable learning procedures has been made in recent years (e.g., Niu et al., 2011; Kleiner et al., 2012). But the underlying models remain simple, and the inferential framework is generally non-Bayesian. The advantages of the Bayesian paradigm (e.g., hierarchical modeling, coherent treatment of uncertainty) currently seem out of reach in the Big Data setting.
An exception to this statement is provided by (Hoffman, Blei, and Bach, 2010; Hoffman, Blei, Paisley, et al., 2013; C. Wang, Paisley, and Blei, 2011), who have shown that a class of approximation methods known as variational Bayes (VB) (Wainwright and Jordan, 2008) can be usefully deployed for large-scale data sets. They have applied their approach, referred to as stochastic variational inference (SVI), to the domain of topic modeling of document collections, an area with a major need for scalable inference algorithms. VB traditionally uses the variational lower bound on the marginal likelihood as an objective function, and the idea of SVI is to apply a variant of stochastic gradient descent to this objective. Notably, this objective is based on the conceptual existence of a full data set involving $D$ data points (i.e., documents in the topic model setting), for a fixed value of $D$. Although the stochastic gradient is computed for a single, small subset of data points (documents) at a time, the posterior being targeted is a posterior for $D$ data points. This value of $D$ must be specified in advance and is used by the algorithm at each step. Posteriors for $D'$ data points, for $D' \neq D$, are not obtained as part of the analysis.
We view this lack of a link between the number of documents that have been processed thus far and the posterior that is being targeted as undesirable in many settings involving streaming data. In this chapter we aim at an approximate Bayesian inference algorithm that is scalable like SVI but is also truly a streaming procedure, in that it yields an approximate posterior for each processed collection of $D'$ data points—and not just a pre-specified "final" number of data points $D$. To that end, we return to the classical perspective of Bayesian updating, where the recursive application of Bayes theorem provides a sequence of posteriors, not a sequence of approximations to a fixed posterior. To this classical recursive perspective we bring the VB framework; our updates need not be exact Bayesian updates but rather may be approximations such as VB. This approach is similar in spirit to assumed density filtering or expectation propagation (Minka, 2001b; Minka, 2001a; Opper, 1998), but each step of those methods involves a moment-matching step that can be computationally costly for models such as topic models. We are able to avoid the moment-matching step via the use of VB.
We also note other related work in this general vein: MCMC approximations have been explored by (Canini, Shi, and Griffiths, 2009), and VB or VB-like approximations have also been explored by (Honkela and Valpola, 2003; Luts, Broderick, and Wand, 2012). Although the empirical success of SVI is the main motivation for our work, we are also motivated by recent developments in computer architectures, which permit distributed and asynchronous computations in addition to streaming computations. As we will show, a streaming VB algorithm naturally lends itself to distributed and asynchronous implementations.

7.2 Streaming, distributed, asynchronous Bayesian updating

Streaming Bayesian updating. Consider data $x_1, x_2, \ldots$ generated iid according to a distribution $p(x \mid \Theta)$ given parameter(s) $\Theta$. Assume that a prior $p(\Theta)$ has also been specified. Then Bayes theorem gives us the posterior distribution of $\Theta$ given a collection of $S$ data points, $C_1 := (x_1, \ldots, x_S)$:
$$p(\Theta \mid C_1) = p(C_1)^{-1}\, p(C_1 \mid \Theta)\, p(\Theta),$$
where $p(C_1 \mid \Theta) = p(x_1, \ldots, x_S \mid \Theta) = \prod_{s=1}^{S} p(x_s \mid \Theta)$.
Suppose we have seen and processed $b-1$ collections, sometimes called minibatches, of data. Given the posterior $p(\Theta \mid C_1, \ldots, C_{b-1})$, we can calculate the posterior after the $b$th minibatch:
$$p(\Theta \mid C_1, \ldots, C_b) \propto p(C_b \mid \Theta)\, p(\Theta \mid C_1, \ldots, C_{b-1}). \qquad (7.1)$$
That is, we treat the posterior after $b-1$ minibatches as the new prior for the incoming data points. If we can save the posterior from $b-1$ minibatches and calculate the normalizing constant for the $b$th posterior, repeated application of Eq. (7.1) is streaming; it automatically gives us the new posterior without needing to revisit old data points.
In complex models, it is often infeasible to calculate the posterior exactly, and an approximation must be used. Suppose that, given a prior $p(\Theta)$ and data minibatch $C$, we have an approximation algorithm $\mathcal{A}$ that calculates an approximate posterior $q$: $q(\Theta) = \mathcal{A}(C, p(\Theta))$. Then, setting $q_0(\Theta) = p(\Theta)$, one way to recursively calculate an approximation to the posterior is
$$p(\Theta \mid C_1, \ldots, C_b) \approx q_b(\Theta) = \mathcal{A}(C_b, q_{b-1}(\Theta)). \qquad (7.2)$$
When $\mathcal{A}$ yields the posterior from Bayes theorem, this calculation is exact. This approach already differs from that of (Hoffman, Blei, and Bach, 2010; C. Wang, Paisley, and Blei, 2011; Hoffman, Blei, Paisley, et al., 2013), which we will see (Section 7.3) directly approximates $p(\Theta \mid C_1, \ldots, C_B)$ for fixed $B$ without making intermediate approximations for $b$ strictly between 1 and $B$.
Distributed Bayesian updating. The sequential updates in Eq. (7.2) handle streaming data in theory, but in practice, the $\mathcal{A}$ calculation might take longer than the time interval between minibatch arrivals or simply take longer than desired. Parallelizing computations increases algorithm throughput. And posterior calculations need not be sequential. Indeed, Bayes theorem yields

$$p(\Theta \mid C_1, \ldots, C_B) \propto \left[\prod_{b=1}^{B} p(C_b \mid \Theta)\right] p(\Theta) \propto \left[\prod_{b=1}^{B} p(\Theta \mid C_b)\, p(\Theta)^{-1}\right] p(\Theta). \qquad (7.3)$$

That is, we can calculate the individual minibatch posteriors $p(\Theta \mid C_b)$, perhaps in parallel, and then combine them to find the full posterior $p(\Theta \mid C_1, \ldots, C_B)$. Given an approximating algorithm $\mathcal{A}$ as above, the corresponding approximate update would be

$$p(\Theta \mid C_1, \ldots, C_B) \approx q(\Theta) \propto \left[\prod_{b=1}^{B} \mathcal{A}(C_b, p(\Theta))\, p(\Theta)^{-1}\right] p(\Theta), \qquad (7.4)$$
for some approximating distribution $q$, provided the normalizing constant for the right-hand side of Eq. (7.4) can be computed.
Variational inference methods are generally based on exponential family representations (Wainwright and Jordan, 2008), and we will make that assumption here. In particular, we suppose $p(\Theta) \propto \exp\{\xi_0 \cdot T(\Theta)\}$; that is, $p(\Theta)$ is an exponential family distribution for $\Theta$ with sufficient statistic $T(\Theta)$ and natural parameter $\xi_0$. We suppose further that $\mathcal{A}$ always

returns a distribution in the same exponential family; in particular, we suppose that there exists some parameter ξb such that

$$q_b(\Theta) \propto \exp\{\xi_b \cdot T(\Theta)\} \quad\text{for } q_b(\Theta) = \mathcal{A}(C_b, p(\Theta)). \qquad (7.5)$$

B

$$p(\Theta \mid C_1, \ldots, C_B) \approx q(\Theta) \propto \exp\left\{\left[\xi_0 + \sum_{b=1}^{B}(\xi_b - \xi_0)\right]\cdot T(\Theta)\right\}, \qquad (7.6)$$
where the normalizing constant is readily obtained from the exponential family form. In what follows we use the shorthand $\xi \leftarrow \mathcal{A}(C, \xi_0)$ to denote that $\mathcal{A}$ takes as input a minibatch $C$ and a prior with exponential family parameter $\xi_0$ and that it returns a distribution in the same exponential family with parameter $\xi$.
So, to approximate $p(\Theta \mid C_1, \ldots, C_B)$, we first calculate $\xi_b$ via the approximation primitive $\mathcal{A}$ for each minibatch $C_b$; note that these calculations may be performed in parallel. Then we sum together the quantities $\xi_b - \xi_0$ across $b$, along with the initial $\xi_0$ from the prior, to find the final exponential family parameter to the full posterior approximation $q$. We previously saw that the general Bayes sequential update can be made streaming by iterating with the old posterior as the new prior (Eq. (7.2)). Similarly, here we see that the full posterior approximation $q$ is in the same exponential family as the prior, so one may iterate these parallel computations to arrive at a parallelized algorithm for streaming posterior computation. We emphasize that while these updates are reminiscent of prior-posterior conjugacy, it is actually the approximate posteriors and single, original prior that we assume belong to the same exponential family. It is not necessary to assume any conjugacy in the generative model itself nor that any true intermediate or final posterior take any particular limited form.
Asynchronous Bayesian updating. Performing $B$ computations in parallel can in theory speed up algorithm running time by a factor of $B$, but in practice it is often the case that a single computation thread takes longer than the rest. Waiting for this thread to finish diminishes potential gains from distributing the computations. This problem can be ameliorated by making computations asynchronous. In this case, processors known as workers each solve a subproblem. When a worker finishes, it reports its solution to a single master processor. If the master gives the worker a new subproblem without waiting for the other workers to finish, it can decrease downtime in the system.
Our asynchronous algorithm is in the spirit of Hogwild! (Niu et al., 2011). To present the algorithm we first describe an asynchronous computation that we will not use in practice, but which will serve as a conceptual stepping stone. Note in particular that the following scheme makes the computations in Eq. (7.6) asynchronous. Have each worker continuously iterate between three steps: (1) collect a new minibatch $C$, (2) compute the local approximate posterior $\xi \leftarrow \mathcal{A}(C, \xi_0)$, and (3) return $\Delta\xi := \xi - \xi_0$ to the master. The master, in turn,

starts by assigning the posterior to equal the prior: $\xi^{(post)} \leftarrow \xi_0$. Each time the master receives a quantity $\Delta\xi$ from any worker, it updates the posterior synchronously: $\xi^{(post)} \leftarrow \xi^{(post)} + \Delta\xi$. If $\mathcal{A}$ returns the exponential family parameter of the true posterior (rather than an approximation), then the posterior at the master is exact by Eq. (7.4).
A preferred asynchronous computation works as follows. The master initializes its posterior estimate to the prior: $\xi^{(post)} \leftarrow \xi_0$. Each worker continuously iterates between four steps: (1) collect a new minibatch $C$, (2) copy the master posterior value locally $\xi^{(local)} \leftarrow \xi^{(post)}$, (3) compute the local approximate posterior $\xi \leftarrow \mathcal{A}(C, \xi^{(local)})$, and (4) return $\Delta\xi := \xi - \xi^{(local)}$ to the master. Each time the master receives a quantity $\Delta\xi$ from any worker, it updates the posterior synchronously: $\xi^{(post)} \leftarrow \xi^{(post)} + \Delta\xi$.
The key difference between the first and second frameworks proposed above is that, in the second, the latest posterior is used as a prior. This latter framework is more in line with the streaming update of Eq. (7.2) but introduces a new layer of approximation. Since $\xi^{(post)}$ might change at the master while the worker is computing $\Delta\xi$, it is no longer the case that the posterior at the master is exact when $\mathcal{A}$ returns the exponential family parameter of the true posterior. Nonetheless we find that the latter framework performs better in practice, so we focus on it exclusively in what follows.
We refer to our overall framework as SDA-Bayes, which stands for (S)treaming, (D)istributed, (A)synchronous Bayes. The framework is intended to be general enough to allow a variety of local approximations $\mathcal{A}$. Indeed, SDA-Bayes works out of the box once an implementation of $\mathcal{A}$—and a prior on the global parameter(s) $\Theta$—is provided. In the current chapter our preferred local approximation will be VB.
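To make the natural-parameter bookkeeping of Eq. (7.6) and the preferred asynchronous scheme concrete, here is a minimal sketch, not the chapter's implementation. The class and function names are our own, and approx_posterior stands in for the primitive A; a conjugate beta–Bernoulli update is used purely for illustration, so the "approximation" here happens to be exact.

import numpy as np

class SDABayesMaster:
    """Master state for SDA-Bayes: one natural parameter updated by worker deltas."""

    def __init__(self, xi0):
        self.xi_post = np.array(xi0, dtype=float)   # xi^(post) <- xi_0

    def snapshot(self):
        return self.xi_post.copy()                  # worker copies xi^(local) <- xi^(post)

    def apply_delta(self, delta_xi):
        self.xi_post += delta_xi                    # xi^(post) <- xi^(post) + delta

def approx_posterior(minibatch, xi_local):
    """Stand-in for the primitive A.  Illustration: exact conjugate update for a
    Bernoulli likelihood with a Beta prior in natural-parameter form
    xi = (alpha - 1, beta - 1)."""
    ones = np.sum(minibatch)
    return xi_local + np.array([ones, len(minibatch) - ones], dtype=float)

def worker_step(master, minibatch):
    """One pass of the preferred asynchronous scheme: copy, compute, return a delta."""
    xi_local = master.snapshot()
    xi_new = approx_posterior(minibatch, xi_local)
    master.apply_delta(xi_new - xi_local)

rng = np.random.default_rng(0)
master = SDABayesMaster(xi0=[0.0, 0.0])              # Beta(1, 1) prior
for _ in range(20):                                   # minibatches arriving in a stream
    worker_step(master, rng.binomial(1, 0.3, size=50))
alpha, beta = master.xi_post + 1.0
print("approximate posterior: Beta(%.0f, %.0f)" % (alpha, beta))

In a real deployment the worker_step calls would run on separate threads or machines; the master's additive update is all that needs to be serialized.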

7.3 Case study: latent Dirichlet allocation

In what follows, we consider examples of the choices for the $\Theta$ prior and the primitive $\mathcal{A}$ in the context of latent Dirichlet allocation (LDA) (Blei, Ng, and Jordan, 2003). LDA models the content of $D$ documents in a corpus. Themes potentially shared by multiple documents are described by topics. The problem is to learn the topics as well as discover which topics occur in which documents.
More formally, each topic (of $K$ total topics) is a distribution over the $V$ words in the vocabulary: $\beta_k = (\beta_{kv})_{v=1}^{V}$. Each document is an admixture of topics. The words in document $d$ are assumed to be exchangeable. Each word $w_{dn}$ belongs to a latent topic $z_{dn}$ chosen according to a document-specific distribution of topics $\theta_d = (\theta_{dk})_{k=1}^{K}$. The full generative model, with Dirichlet priors for $\beta_k$ and $\theta_d$ conditioned on respective parameters $\eta_k$ and $\alpha$, appears in (Blei, Ng, and Jordan, 2003).
To see that this model fits our specification in Section 7.2, consider the set of global parameters $\Theta = \beta$. Each document $w_d = (w_{dn})_{n=1}^{N_d}$ is distributed iid conditioned on the global topics. The full collection of data is a corpus $C = w = (w_d)_{d=1}^{D}$ of documents. The posterior

for LDA, $p(\beta, \theta, z \mid C, \eta, \alpha)$, is equal to the following expression up to proportionality:

$$\propto \left[\prod_{k=1}^{K}\text{Dirichlet}(\beta_k\mid\eta_k)\right]\cdot\left[\prod_{d=1}^{D}\text{Dirichlet}(\theta_d\mid\alpha)\right]\cdot\left[\prod_{d=1}^{D}\prod_{n=1}^{N_d}\theta_{d z_{dn}}\,\beta_{z_{dn} w_{dn}}\right]. \qquad (7.7)$$
The posterior for just the global parameters $p(\beta \mid C, \eta, \alpha)$ can be obtained from $p(\beta, \theta, z \mid C, \eta, \alpha)$ by integrating out the local, document-specific parameters $\theta, z$. As is common in complex models, the normalizing constant for Eq. (7.7) is intractable to compute, so the posterior must be approximated.

Posterior-approximation algorithms

To apply SDA-Bayes to LDA, we use the prior specified by the generative model. It remains to choose a posterior-approximation algorithm $\mathcal{A}$. We consider two possibilities here: variational Bayes (VB) and expectation propagation (EP). Both primitives take Dirichlet distributions as priors for $\beta$ and both return Dirichlet distributions for the approximate posterior of the topic parameters $\beta$; thus the prior and approximate posterior are in the same exponential family. Hence both VB and EP can be utilized as a choice for $\mathcal{A}$ in the SDA-Bayes framework.

Subroutine LocalVB(d, λ)
Output: (γ_d, φ_d)
  Initialize γ_d
  while (γ_d, φ_d) not converged do
    ∀(k, v), set φ_dvk ∝ exp(E_q[log θ_dk] + E_q[log β_kv]) (normalized across k)
    ∀k, γ_dk ← α_k + Σ_{v=1}^{V} φ_dvk n_dv
  end
Algorithm 7.1: Subroutine LocalVB(d, λ), used by the global variational algorithms. Here, n_dv represents the number of words v in document d.
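For concreteness, the subroutine above can be written out directly. The following is a minimal sketch, not the chapter's implementation: it represents document d by a length-V count vector n_d and the topic parameters by a K × V matrix lam, both representational choices of ours.

import numpy as np
from scipy.special import digamma

def local_vb(n_d, lam, alpha, max_iters=100, tol=1e-6):
    """Minimal sketch of Alg. 7.1: coordinate updates for (gamma_d, phi_d) in one document.

    n_d   : length-V array of word counts for document d
    lam   : K x V array of variational Dirichlet parameters for the topics
    alpha : length-K array of document-topic hyperparameters
    Returns (gamma_d, phi_d), with phi_d of shape V x K.
    """
    n_d = np.asarray(n_d, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    K, V = lam.shape
    E_log_beta = digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))   # E_q[log beta_kv]
    gamma_d = alpha + n_d.sum() / K                                       # simple initialization
    phi_d = np.full((V, K), 1.0 / K)
    for _ in range(max_iters):
        E_log_theta = digamma(gamma_d) - digamma(gamma_d.sum())           # E_q[log theta_dk]
        log_phi = E_log_theta[None, :] + E_log_beta.T                     # shape V x K
        log_phi -= log_phi.max(axis=1, keepdims=True)                     # numerical stability
        phi_d = np.exp(log_phi)
        phi_d /= phi_d.sum(axis=1, keepdims=True)                         # normalize across k
        gamma_new = alpha + (phi_d * n_d[:, None]).sum(axis=0)            # alpha_k + sum_v phi_dvk n_dv
        if np.max(np.abs(gamma_new - gamma_d)) < tol:
            gamma_d = gamma_new
            break
        gamma_d = gamma_new
    return gamma_d, phi_d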

Mean-field variational Bayes. We use the shorthand pD for Eq. (7.7), the posterior given D documents. We assume the approximating distribution, written qD for shorthand, takes the form

$$q_D(\beta,\theta,z\mid\lambda,\gamma,\phi) = \left[\prod_{k=1}^{K} q_D(\beta_k\mid\lambda_k)\right]\cdot\left[\prod_{d=1}^{D} q_D(\theta_d\mid\gamma_d)\right]\cdot\left[\prod_{d=1}^{D}\prod_{n=1}^{N_d} q_D(z_{dn}\mid\phi_{d w_{dn}})\right] \qquad (7.8)$$
for parameters $(\lambda_{kv})$, $(\gamma_{dk})$, $(\phi_{dvk})$ with $k \in \{1,\ldots,K\}$, $v \in \{1,\ldots,V\}$, $d \in \{1,\ldots,D\}$. Moreover, we set $q_D(\beta_k\mid\lambda_k) = \text{Dirichlet}_V(\beta_k\mid\lambda_k)$, $q_D(\theta_d\mid\gamma_d) = \text{Dirichlet}_K(\theta_d\mid\gamma_d)$, and $q_D(z_{dn}\mid\phi_{d w_{dn}}) = \text{Categorical}_K(z_{dn}\mid\phi_{d w_{dn}})$. The subscripts on Dirichlet and Categorical indicate the dimensions of the distributions (and of the parameters).

Input: Data (n_d)_{d=1}^{D}; hyperparameters η, α
Output: λ
  Initialize λ
  while (λ, γ, φ) not converged do
    for d = 1, ..., D do
      (γ_d, φ_d) ← LocalVB(d, λ)
    end
    ∀(k, v), λ_kv ← η_kv + Σ_{d=1}^{D} φ_dvk n_dv
  end
Algorithm 7.2: VB for LDA. Iterates multiple times through the data. Here, n_dv represents the number of words v in document d. See Alg. 7.1 for the subroutine LocalVB.

Input: Hyperparameters η, α
Output: A sequence λ^(1), λ^(2), ...
  Initialize ∀(k, v), λ_kv^(0) ← η_kv
  for b = 1, 2, ... do
    Collect new data minibatch C
    foreach document indexed d in C do
      (γ_d, φ_d) ← LocalVB(d, λ^(b−1))
    end
    ∀(k, v), λ_kv^(b) ← λ_kv^(b−1) + Σ_{d in C} φ_dvk n_dv
  end
Algorithm 7.3: SSU for LDA (streaming). Here, n_dv represents the number of words v in document d. See Alg. 7.1 for the subroutine LocalVB.

Input: Hyperparameters η, α, D, (ρ_t)_{t=1}^{T}
Output: λ
  Initialize λ
  for t = 1, ..., T do
    Collect new data minibatch C
    foreach document indexed d in C do
      (γ_d, φ_d) ← LocalVB(d, λ)
    end
    ∀(k, v), λ̃_kv ← η_kv + (D/|C|) Σ_{d in C} φ_dvk n_dv
    ∀(k, v), λ_kv ← (1 − ρ_t) λ_kv + ρ_t λ̃_kv
  end
Algorithm 7.4: SVI for LDA (single-pass). Here, n_dv represents the number of words v in document d. See Alg. 7.1 for the subroutine LocalVB.

The problem of VB is to find the best approximating $q_D$, defined as the collection of variational parameters $\lambda, \gamma, \phi$ that minimize the KL divergence from the true posterior: $\text{KL}(q_D \,\|\, p_D)$. Even finding the minimizing parameters is a difficult optimization problem. Typically the solution is approximated by coordinate descent in each parameter (Blei, Ng, and Jordan, 2003; Wainwright and Jordan, 2008) as in Alg. 7.2. The derivation of VB for LDA can be found in (Blei, Ng, and Jordan, 2003; Hoffman, Blei, Paisley, et al., 2013) and Appendix 7.A.
Expectation propagation. An EP (Minka, 2001b) algorithm for approximating the LDA posterior appears in Alg. 7.7 of Appendix 7.B. Alg. 7.7 differs from (Minka and Lafferty, 2002), which does not provide an approximate posterior for the topic parameters, and is instead our own derivation. Our version of EP, like VB, learns factorized Dirichlet distributions over topics.

Other single-pass algorithms for approximate LDA posteriors

The algorithms in Section 7.3 pass through the data multiple times and require storing the data set in memory—but are useful as primitives for SDA-Bayes in the context of the processing of minibatches of data. Next, we consider two algorithms that can pass through a data set just one time (single pass) and to which we compare in the evaluations (Section 7.4).
Stochastic variational inference. VB uses coordinate descent to find a value of $q_D$, Eq. (7.8), that locally minimizes the KL divergence, $\text{KL}(q_D \,\|\, p_D)$. Stochastic variational inference (SVI) (Hoffman, Blei, and Bach, 2010; Hoffman, Blei, Paisley, et al., 2013) is exactly the application of a particular version of stochastic gradient descent to the same optimization problem. While stochastic gradient descent can often be viewed as a streaming algorithm, the optimization problem itself here depends on $D$ via $p_D$, the posterior on $D$ data points. We see that, as a result, $D$ must be specified in advance, appears in each step of SVI (see Alg. 7.4), and is independent of the number of data points actually processed by the algorithm. Nonetheless, while one may choose to visit $D' \neq D$ data points or revisit data points when using SVI to estimate $p_D$ (Hoffman, Blei, and Bach, 2010; Hoffman, Blei, Paisley, et al., 2013), SVI can be made single-pass by visiting each of $D$ data points exactly once and then has constant memory requirements. We also note that two new parameters, $\tau_0 > 0$ and $\kappa \in (0.5, 1]$, appear in SVI, beyond those in VB, to determine a learning rate $\rho_t$ as a function of iteration $t$: $\rho_t := (\tau_0 + t)^{-\kappa}$.
Sufficient statistics. On each round of VB (Alg. 7.2), we update the local parameters for all documents and then compute $\lambda_{kv} \leftarrow \eta_{kv} + \sum_{d=1}^{D}\phi_{dvk}n_{dv}$. An alternative single-pass (and indeed streaming) option would be to update the local parameters for each minibatch of documents as they arrive and then add the corresponding terms $\phi_{dvk}n_{dv}$ to the current estimate of $\lambda$ for each document $d$ in the minibatch. This essential idea has been proposed previously for models other than LDA by (Honkela and Valpola, 2003; Luts, Broderick, and Wand, 2012) and forms the basis of what we call the sufficient statistics update algorithm (SSU): Alg. 7.3. This algorithm is equivalent to SDA-Bayes with $\mathcal{A}$ chosen to be a single

iteration over the global variable λ of VB (i.e., updating λ exactly once instead of iterating until convergence).
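A minimal sketch of the streaming sufficient statistics update of Alg. 7.3, reusing the local_vb sketch given after Alg. 7.1, is as follows; the minibatch_stream iterable and the count-vector representation are assumptions of the sketch.

import numpy as np

def ssu_for_lda(minibatch_stream, eta, alpha, local_vb):
    """Sketch of Alg. 7.3: stream minibatches and accumulate sufficient statistics.

    minibatch_stream : iterable of minibatches, each a list of length-V count vectors n_d
    eta              : K x V array of topic hyperparameters (this is lambda^(0))
    alpha            : length-K array of document-topic hyperparameters
    local_vb         : the per-document subroutine sketched after Alg. 7.1
    Yields lambda^(b) after each minibatch b, so a posterior estimate is always available.
    """
    lam = np.array(eta, dtype=float)                   # lambda^(0) <- eta
    for minibatch in minibatch_stream:
        delta = np.zeros_like(lam)
        for n_d in minibatch:
            n_d = np.asarray(n_d, dtype=float)
            _, phi_d = local_vb(n_d, lam, alpha)       # local steps all use lambda^(b-1)
            delta += (phi_d * n_d[:, None]).T          # accumulate phi_dvk * n_dv
        lam += delta                                   # lambda^(b) = lambda^(b-1) + sum over C
        yield lam.copy()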

7.4 Evaluation

We follow (Hoffman, Blei, Paisley, et al., 2013) (and further (Teh, D. Newman, and Welling, 2006; Asuncion et al., 2009)) in evaluating our algorithms by computing (approximate) predictive probability. Under this metric, a higher score is better, as a better model will assign a higher probability to the held-out words.
We calculate predictive probability by first setting aside held-out testing documents $C^{(test)}$ from the full corpus and then further setting aside a subset of held-out testing words $W_{d,test}$ in each testing document $d$. The remaining (training) documents $C^{(train)}$ are used to estimate the global parameter posterior $q(\beta)$, and the remaining (training) words $W_{d,train}$ within the $d$th testing document are used to estimate the document-specific parameter posterior $q(\theta_d)$.¹
To calculate predictive probability, an approximation is necessary since we do not know the predictive distribution—just as we seek to learn the posterior distribution. Specifically, we calculate the normalized predictive distribution and report "log predictive probability" as

$$\frac{\sum_{d\in C^{(test)}} \log p(W_{d,test}\mid C^{(train)}, W_{d,train})}{\sum_{d\in C^{(test)}} |W_{d,test}|} = \frac{\sum_{d\in C^{(test)}}\sum_{w_{test}\in W_{d,test}} \log p(w_{test}\mid C^{(train)}, W_{d,train})}{\sum_{d\in C^{(test)}} |W_{d,test}|},$$
where we use the approximation

$$p(w_{test}\mid C^{(train)}, W_{d,train}) = \int_\beta\int_{\theta_d}\left(\sum_{k=1}^{K}\theta_{dk}\beta_{k w_{test}}\right) p(\theta_d\mid W_{d,train},\beta)\, p(\beta\mid C^{(train)})\,d\theta_d\,d\beta$$
$$\approx \int_\beta\int_{\theta_d}\left(\sum_{k=1}^{K}\theta_{dk}\beta_{k w_{test}}\right) q(\theta_d)\, q(\beta)\,d\theta_d\,d\beta = \sum_{k=1}^{K}\mathbb{E}_q[\theta_{dk}]\,\mathbb{E}_q[\beta_{k w_{test}}].$$
To facilitate comparison with SVI, we use the Wikipedia and Nature corpora of (Hoffman, Blei, and Bach, 2010; C. Wang, Paisley, and Blei, 2011) in our experiments. These two corpora represent a range of sizes (3,611,558 training documents for Wikipedia and 351,525 for Nature) as well as different types of topics. We expect words in Wikipedia to represent an extremely broad range of topics whereas we expect words in Nature to focus more on the sciences. We further use the vocabularies of (Hoffman, Blei, and Bach, 2010; C. Wang, Paisley, and Blei, 2011) and SVI code available online at (Hoffman, 2010). We hold out 10,000 Wikipedia documents and 1,024 Nature documents (not included in the counts above) for testing. In the results presented in the main text, we follow (Hoffman, Blei, and Bach, 2010; Hoffman, Blei, Paisley, et al., 2013) in fitting an LDA model with K = 100 topics

1 In all cases, we estimate q(θd) for evaluative purposes using VB since direct EP estimation takes prohibitively long.

                         Wikipedia                                  Nature
                 32-SDA   1-SDA    SVI     SSU          32-SDA   1-SDA    SVI     SSU
Log pred prob     -7.31   -7.43   -7.32   -7.91          -7.11   -7.19   -7.08   -7.82
Time (hours)       2.09   43.93    7.87    8.28           0.55   10.02    1.22    1.27

Table 7.1: A comparison of (1) log predictive probability of held-out data and (2) running time of four algorithms: SDA-Bayes with 32 threads, SDA-Bayes with 1 thread, SVI, and SSU.

and hyperparameters chosen as: $\forall k$, $\alpha_k = 1/K$; $\forall(k, v)$, $\eta_{kv} = 0.01$. For both Wikipedia and Nature, we set the parameters in SVI according to the optimal values of the parameters described in Table 1 of (Hoffman, Blei, and Bach, 2010) (number of documents $D$ correctly set in advance, step size parameters $\kappa = 0.5$ and $\tau_0 = 64$).
Figures 7.2(a) and 7.2(b) demonstrate that both SVI and SDA are sensitive to minibatch size when $\eta_{kv} = 0.01$, with generally superior performance at larger batch sizes. Interestingly, both SVI and SDA performance improve and are steady across batch size when $\eta_{kv} = 1$ (Figures 7.2(a) and 7.2(b)). Nonetheless, we use $\eta_{kv} = 0.01$ in what follows in the interest of consistency with (Hoffman, Blei, and Bach, 2010; Hoffman, Blei, Paisley, et al., 2013). Moreover, in the remaining experiments, we use a large minibatch size of $2^{15} = 32{,}768$. This size is the largest before SVI performance degrades in the Nature data set (Figure 7.2(b)).
Performance and timing results are shown in Table 7.1. One would expect that with additional streaming capabilities, SDA-Bayes should show a performance loss relative to SVI. We see from Table 7.1 that such loss is small in the single-thread case, while SSU performs much worse. SVI is faster than single-thread SDA-Bayes in this single-pass setting. Full SDA-Bayes improves run time with no performance cost.
We handicap SDA-Bayes in the above comparisons by utilizing just a single thread. In Table 7.1, we also report performance of SDA-Bayes with 32 threads and the same minibatch size. In the synchronous case, we consider minibatch size to equal the total number of data points processed per round; therefore, the minibatch size equals the number of data points sent to each thread per round times the total number of threads. In the asynchronous case, we analogously report minibatch size as this product.
Figure 7.1 shows the performance of SDA-Bayes when we run with $\{1, 2, 4, 8, 16, 32\}$ threads while keeping the minibatch size constant. The goal in such a distributed context is to improve run time while not hurting performance. Indeed, we see dramatic run time improvement as the number of threads grows and in fact some slight performance improvement as well. We tried both a parallel version and a full distributed, asynchronous version of the algorithm; Figure 7.1 indicates that the speedup and performance improvements we see here come from parallelizing—which is theoretically justified by Eq. (7.3) when $\mathcal{A}$ is Bayes rule. Our experiments indicate that our Hogwild!-style asynchrony does not hurt performance. In our experiments, the processing time at each thread seems to be approximately equal across

[Figure 7.1 consists of four panels. Panels (a) Wikipedia and (b) Nature plot log predictive probability against number of threads (1, 2, 4, 8, 16, 32) for the synchronous and asynchronous versions of SDA-Bayes; panels (c) Wikipedia and (d) Nature plot run time (hours) against number of threads for the same two versions.]

Figure 7.1: SDA-Bayes log predictive probability (two upper plots) and run time (two lower plots) as a function of number of threads.

threads and dominate any communication time at the master, so synchronous and asynchronous performance and running time are essentially identical. In general, a practitioner might prefer asynchrony since it is more robust to node failures.
SVI is sensitive to the choice of total data size D. The evaluations above are for a single posterior over $D$ data points. Of greater concern to us in this work is the evaluation of algorithms in the streaming setting. We have seen that SVI is designed to find the posterior for a particular, pre-chosen number of data points $D$. In practice, when we run SVI on the full data set but change the input value of $D$ in the algorithm, we can see degradations in performance. In particular, we try values of $D$ equal to $\{0.01, 0.1, 1, 10, 100\}$ times the true

$D$ in Figure 7.2(c) for the Wikipedia data set and in Figure 7.2(d) for the Nature data set. A practitioner in the streaming setting will typically not know $D$ in advance, or multiple values of $D$ may be of interest. Figures 7.2(c) and 7.2(d) illustrate that an estimate may not be sufficient. Even in the case where $D$ is known in advance, it is reasonable to imagine a new influx of further data. One might need to run SVI again from the start (and, in so doing, revisit the first data set) to obtain the desired performance.
SVI is sensitive to learning step size. (Hoffman, Blei, and Bach, 2010; C. Wang, Paisley, and Blei, 2011) use cross-validation to tune step-size parameters $(\tau_0, \kappa)$ in the stochastic gradient descent component of the SVI algorithm. This cross-validation requires multiple runs over the data and thus is not suited to the streaming setting. Figures 7.2(e) and 7.2(f) demonstrate that the parameter choice does indeed affect algorithm performance. In these figures, we keep $D$ at the true training data size. (Hoffman, Blei, and Bach, 2010) have observed that the optimal $(\tau_0, \kappa)$ may interact with minibatch size, and we further observe that the optimal values may vary with $D$ as well. We also note that recent work has suggested a way to update $(\tau_0, \kappa)$ adaptively during an SVI run (Ranganath et al., 2013).
EP is not suited to LDA. Earlier attempts to apply EP to the LDA model in the non-streaming setting have had mixed success, with (Buntine and Jakulin, 2004) in particular finding that EP performance can be poor for LDA and, moreover, that EP requires "unrealistic intermediate storage requirements." We found this to also be true in the streaming setting. We were not able to obtain competitive results with EP; based on an 8-thread implementation of SDA-Bayes with an EP primitive², after over 91 hours on Wikipedia (and $6.7 \times 10^4$ data points), log predictive probability had stabilized at around $-7.95$ and, after over 97 hours on Nature (and $9.7 \times 10^4$ data points), log predictive probability had stabilized at around $-8.02$. Although SDA-Bayes with the EP primitive is not effective for LDA, it remains to be seen whether this combination may be useful in other domains where EP is known to be effective.

7.5 Discussion

We have introduced SDA-Bayes, a framework for streaming, distributed, asynchronous computation of an approximate Bayesian posterior. Our framework makes streaming updates to the estimated posterior according to a user-specified approximation primitive. We have demonstrated the usefulness of our framework, with variational Bayes as the primitive, by fitting the latent Dirichlet allocation topic model to the Wikipedia and Nature corpora. We have demonstrated the advantages of our algorithm over stochastic variational inference and the sufficient statistics update algorithm, particularly with respect to the key issue of obtaining approximations to posterior probabilities based on the number of documents seen thus far, not posterior probabilities for a fixed number of documents.

2We chose 8 threads since any fewer was too slow to get results and anything larger created too high of a memory demand on our system.

[Figure 7.2 appears here. Panels: (a) sensitivity to minibatch size on Wikipedia; (b) sensitivity to minibatch size on Nature; (c) SVI sensitivity to D on Wikipedia; (d) SVI sensitivity to D on Nature; (e) SVI sensitivity to stepsize parameters on Wikipedia; (f) SVI sensitivity to stepsize parameters on Nature. Each panel plots log predictive probability against log batch size (base 2) in (a)-(b) or against the number of examples seen in (c)-(f).]

Figure 7.2: Sensitivity of SVI and SDA-Bayes to some respective parameters. Legends have the same top-to-bottom order as the rightmost curve points.

7.A Variational Bayes

Batch VB

As described in the main text, the idea of VB is to find the distribution q_D that best approximates the true posterior, p_D. More specifically, the optimization problem of VB is defined as finding a q_D to minimize the KL divergence between the approximating distribution and the posterior:

KL(q_D \| p_D) := E_{q_D}[\log(q_D / p_D)].

Typically q_D takes a particular, constrained form, and finding the optimal q_D amounts to finding the optimal parameters for q_D. Moreover, the optimal parameters usually cannot be expressed in closed form, so often a coordinate descent algorithm is used. For the LDA model, we have q_D in the form of Eq. (7.8) and p_D defined by Eq. (7.7). We wish to find the following variational parameters (i.e., parameters to q_D): λ (describing each topic), γ (describing the topic proportions in each document), and φ (describing the assignment of each word in each document to a topic).

Evidence lower bound

Finding q_D to minimize the KL divergence between q_D and p_D is equivalent to finding q_D to maximize the evidence lower bound (ELBO),

ELBO := E_{q_D}[\log p(Θ, x_{1:D})] - E_{q_D}[\log q_D]
      = E_{q_D}[\log p_D] + \log p(x_{1:D}) - E_{q_D}[\log q_D]
      = -KL(q_D \| p_D) + \log p(x_{1:D}),

since \log p(x_{1:D}) is constant in q_D. The VB optimization problem is often phrased in terms of the ELBO instead of the KL divergence. The ELBO for LDA can be written as follows, where the model parameters are β, θ, z and the data is w; η and α are fixed hyperparameters.

ELBO(λ, γ, φ) = E_q[\log p(β, θ, z, w \mid η, α)] - E_q[\log q(β, θ, z \mid λ, γ, φ)]
= \sum_{k=1}^K E_q[\log \text{Dirichlet}(β_k \mid η_k)] + \sum_{d=1}^D E_q[\log \text{Dirichlet}(θ_d \mid α)]
+ \sum_{d=1}^D \sum_{n=1}^{N_d} E_q[\log \text{Multinomial}(z_{dn} \mid θ_d)]
+ \sum_{d=1}^D \sum_{n=1}^{N_d} E_q[\log \text{Multinomial}(w_{dn} \mid β_{z_{dn}})]
- \sum_{k=1}^K E_q[\log \text{Dirichlet}(β_k \mid λ_k)] - \sum_{d=1}^D E_q[\log \text{Dirichlet}(θ_d \mid γ_d)]
- \sum_{d=1}^D \sum_{n=1}^{N_d} E_q[\log \text{Multinomial}(z_{dn} \mid φ_{d w_{dn}})].

The expectations in q in the previous equation can be evaluated as follows. The equations below make use of the digamma function ψ and trigamma function ψ_1. Here,

ψ(x) = \frac{d}{dx} \log Γ(x) = \left( \frac{d}{dx} Γ(x) \right) / Γ(x),
ψ_1(x) = \frac{d^2}{dx^2} \log Γ(x) = \frac{d}{dx} ψ(x).

Then,

E_q[\log \text{Dirichlet}(β_k \mid η_k)]
= \log Γ(\sum_{v=1}^V η_{kv}) - \sum_{v=1}^V \log Γ(η_{kv}) + \sum_{v=1}^V (η_{kv} - 1) E_q[\log β_{kv}]
= \log Γ(\sum_{v=1}^V η_{kv}) - \sum_{v=1}^V \log Γ(η_{kv}) + \sum_{v=1}^V (η_{kv} - 1) ( ψ(λ_{kv}) - ψ(\sum_{u=1}^V λ_{ku}) )

E_q[\log \text{Dirichlet}(θ_d \mid α)]
= \log Γ(\sum_{k=1}^K α_k) - \sum_{k=1}^K \log Γ(α_k) + \sum_{k=1}^K (α_k - 1) E_q[\log θ_{dk}]
= \log Γ(\sum_{k=1}^K α_k) - \sum_{k=1}^K \log Γ(α_k) + \sum_{k=1}^K (α_k - 1) ( ψ(γ_{dk}) - ψ(\sum_{j=1}^K γ_{dj}) )

E_q[\log \text{Multinomial}(z_{dn} \mid θ_d)]
= \sum_{k=1}^K φ_{d w_{dn} k} E_q[\log θ_{dk}]
= \sum_{k=1}^K φ_{d w_{dn} k} ( ψ(γ_{dk}) - ψ(\sum_{j=1}^K γ_{dj}) )

E_q[\log \text{Multinomial}(w_{dn} \mid β_{z_{dn}})]
= \sum_{v=1}^V 1\{w_{dn} = v\} E_q[\log β_{z_{dn}, v}]
= \sum_{v=1}^V 1\{w_{dn} = v\} \sum_{k=1}^K φ_{d w_{dn} k} E_q[\log β_{kv}]
= \sum_{v=1}^V \sum_{k=1}^K 1\{w_{dn} = v\} φ_{d w_{dn} k} ( ψ(λ_{kv}) - ψ(\sum_{u=1}^V λ_{ku}) )

E_q[\log \text{Dirichlet}(β_k \mid λ_k)]
= \log Γ(\sum_{v=1}^V λ_{kv}) - \sum_{v=1}^V \log Γ(λ_{kv}) + \sum_{v=1}^V (λ_{kv} - 1) ( ψ(λ_{kv}) - ψ(\sum_{u=1}^V λ_{ku}) )

E_q[\log \text{Dirichlet}(θ_d \mid γ_d)]
= \log Γ(\sum_{k=1}^K γ_{dk}) - \sum_{k=1}^K \log Γ(γ_{dk}) + \sum_{k=1}^K (γ_{dk} - 1) ( ψ(γ_{dk}) - ψ(\sum_{j=1}^K γ_{dj}) )

E_q[\log \text{Multinomial}(z_{dn} \mid φ_{dn})]
= \sum_{k=1}^K φ_{d w_{dn} k} \log φ_{d w_{dn} k}.

Coordinate ascent

We maximize the ELBO via coordinate ascent in each dimension of the variational parameters: λ, γ, and φ.

Variational parameter λ. Choose a topic index k. Fix γ, φ, and each λ_j for j ≠ k. Then we can write the ELBO's functional dependence on λ_k as follows, where "const" is a constant in λ_k.

ELBO(λ_k) = \sum_{v=1}^V (η_{kv} - 1) ( ψ(λ_{kv}) - ψ(\sum_{u=1}^V λ_{ku}) )
+ \sum_{d=1}^D \sum_{n=1}^{N_d} \sum_{v=1}^V 1\{w_{dn} = v\} φ_{d w_{dn} k} ( ψ(λ_{kv}) - ψ(\sum_{u=1}^V λ_{ku}) )
- \log Γ(\sum_{v=1}^V λ_{kv}) + \sum_{v=1}^V \log Γ(λ_{kv})
- \sum_{v=1}^V (λ_{kv} - 1) ( ψ(λ_{kv}) - ψ(\sum_{u=1}^V λ_{ku}) ) + const
= \sum_{v=1}^V ( η_{kv} - λ_{kv} + \sum_{d=1}^D \sum_{n=1}^{N_d} 1\{w_{dn} = v\} φ_{d w_{dn} k} ) ( ψ(λ_{kv}) - ψ(\sum_{u=1}^V λ_{ku}) )
- \log Γ(\sum_{v=1}^V λ_{kv}) + \sum_{v=1}^V \log Γ(λ_{kv}) + const.

The partial derivative of ELBO(λ_k) with respect to one of the dimensions of λ_k, say λ_{kv}, is

∂ ELBO(λ_k) / ∂λ_{kv}
= -( ψ(λ_{kv}) - ψ(\sum_{u=1}^V λ_{ku}) )
+ ( η_{kv} - λ_{kv} + \sum_{d=1}^D \sum_{n=1}^{N_d} 1\{w_{dn} = v\} φ_{d w_{dn} k} ) ( ψ_1(λ_{kv}) - ψ_1(\sum_{u=1}^V λ_{ku}) )
- \sum_{t: t ≠ v} ( η_{kt} - λ_{kt} + \sum_{d=1}^D \sum_{n=1}^{N_d} 1\{w_{dn} = t\} φ_{d w_{dn} k} ) ψ_1(\sum_{u=1}^V λ_{ku}) - ψ(\sum_{u=1}^V λ_{ku}) + ψ(λ_{kv})
= ψ_1(λ_{kv}) ( η_{kv} - λ_{kv} + \sum_{d=1}^D \sum_{n=1}^{N_d} 1\{w_{dn} = v\} φ_{d w_{dn} k} )
- ψ_1(\sum_{u=1}^V λ_{ku}) \sum_{u=1}^V ( η_{ku} - λ_{ku} + \sum_{d=1}^D \sum_{n=1}^{N_d} 1\{w_{dn} = u\} φ_{d w_{dn} k} ).

From the last line of the previous equation, we see that one can set the gradient of ELBO(λ_k) to zero by setting

λ_{kv} ← η_{kv} + \sum_{d=1}^D \sum_{n=1}^{N_d} 1\{w_{dn} = v\} φ_{d w_{dn} k}   for v = 1, ..., V.

Equivalently, if n_{dv} is the number of occurrences (tokens) of word type v in document d, then the update may be written

λ_{kv} ← η_{kv} + \sum_{d=1}^D n_{dv} φ_{dvk}   for v = 1, ..., V.

Variational parameter γ. Now choose a document d. Fix λ, φ, and γ_c for c ≠ d. Then we can express the functional dependence of the ELBO on γ_d as follows.

ELBO(γ_d) = \sum_{k=1}^K (α_k - 1) ( ψ(γ_{dk}) - ψ(\sum_{j=1}^K γ_{dj}) ) + \sum_{n=1}^{N_d} \sum_{k=1}^K φ_{d w_{dn} k} ( ψ(γ_{dk}) - ψ(\sum_{j=1}^K γ_{dj}) )
- \log Γ(\sum_{k=1}^K γ_{dk}) + \sum_{k=1}^K \log Γ(γ_{dk}) - \sum_{k=1}^K (γ_{dk} - 1) ( ψ(γ_{dk}) - ψ(\sum_{j=1}^K γ_{dj}) ) + const
= \sum_{k=1}^K ( α_k - γ_{dk} + \sum_{n=1}^{N_d} φ_{d w_{dn} k} ) ( ψ(γ_{dk}) - ψ(\sum_{j=1}^K γ_{dj}) )
- \log Γ(\sum_{k=1}^K γ_{dk}) + \sum_{k=1}^K \log Γ(γ_{dk}) + const.

The partial derivative of ELBO(γ_d) with respect to one of the dimensions of γ_d, say γ_{dk}, is

∂ ELBO(γ_d) / ∂γ_{dk}
= -( ψ(γ_{dk}) - ψ(\sum_{j=1}^K γ_{dj}) ) + ( α_k - γ_{dk} + \sum_{n=1}^{N_d} φ_{d w_{dn} k} ) ( ψ_1(γ_{dk}) - ψ_1(\sum_{j=1}^K γ_{dj}) )
- \sum_{i: i ≠ k} ( α_i - γ_{di} + \sum_{n=1}^{N_d} φ_{d w_{dn} i} ) ψ_1(\sum_{j=1}^K γ_{dj}) - ψ(\sum_{j=1}^K γ_{dj}) + ψ(γ_{dk})
= ψ_1(γ_{dk}) ( α_k - γ_{dk} + \sum_{n=1}^{N_d} φ_{d w_{dn} k} ) - ψ_1(\sum_{j=1}^K γ_{dj}) \sum_{j=1}^K ( α_j - γ_{dj} + \sum_{n=1}^{N_d} φ_{d w_{dn} j} ).

As for the λ case above, one obvious way to achieve a gradient of ELBO(γ_d) equal to zero is to set

γ_{dk} ← α_k + \sum_{n=1}^{N_d} φ_{d w_{dn} k}   for k = 1, ..., K.

Equivalently,

γ_{dk} ← α_k + \sum_{v=1}^V n_{dv} φ_{dvk}   for k = 1, ..., K.

Variational parameter φ. Finally, consider fixing λ, γ, and φ_{cu} for (c, u) ≠ (d, v). In this case, the dependence of the ELBO on φ_{dv} can be written as follows.

ELBO(φ_{dv}) = \sum_{k=1}^K n_{dv} φ_{dvk} ( ψ(γ_{dk}) - ψ(\sum_{j=1}^K γ_{dj}) )
+ \sum_{k=1}^K n_{dv} φ_{dvk} ( ψ(λ_{kv}) - ψ(\sum_{u=1}^V λ_{ku}) ) - \sum_{k=1}^K n_{dv} φ_{dvk} \log φ_{dvk} + const
= n_{dv} \sum_{k=1}^K φ_{dvk} ( -\log φ_{dvk} + ψ(γ_{dk}) - ψ(\sum_{j=1}^K γ_{dj}) + ψ(λ_{kv}) - ψ(\sum_{u=1}^V λ_{ku}) ) + const.

The partial derivative of ELBO(φ_{dv}) with respect to one of the dimensions of φ_{dv}, say φ_{dvk}, is

∂ ELBO(φ_{dv}) / ∂φ_{dvk} = n_{dv} ( -\log φ_{dvk} + ψ(γ_{dk}) - ψ(\sum_{j=1}^K γ_{dj}) + ψ(λ_{kv}) - ψ(\sum_{u=1}^V λ_{ku}) - 1 ).

Using the method of Lagrange multipliers to incorporate the constraint that \sum_{k=1}^K φ_{dvk} = 1, we wish to find ρ and φ_{dvk} such that

0 = \frac{∂}{∂φ_{dvk}} [ ELBO(φ_{dv}) - ρ ( \sum_{k=1}^K φ_{dvk} - 1 ) ].   (7.9)

Setting

φ_{dvk} ∝_k \exp( ψ(γ_{dk}) - ψ(\sum_{j=1}^K γ_{dj}) + ψ(λ_{kv}) - ψ(\sum_{u=1}^V λ_{ku}) )

achieves the desired outcome in Eq. (7.9). Here ∝_k indicates that the proportionality is across k. The optimal choice of ρ is expressed via this proportionality. The above assignment may also be written as

φ_{dvk} ∝_k \exp( E_q[\log θ_{dk}] + E_q[\log β_{kv}] ).

The coordinate-ascent algorithm iteratively updates the parameters λ, γ, and φ. In practice, we usually iterate the updates for the "local" parameters φ and γ until they converge, then update the "global" parameter λ, and repeat. The resulting batch variational Bayes algorithm is presented in Alg. 7.2.
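As a concrete illustration of how these coordinate-ascent updates fit together, the following Python sketch implements the λ, γ, and φ updates for LDA. It is illustrative only (the experiments in this chapter were run with separate code), and the interface, the initialization, and the iteration counts are assumptions rather than details fixed in the text.

import numpy as np
from scipy.special import digamma

def batch_vb_lda(n_dv, alpha, eta, n_iters=50, local_iters=20):
    """Sketch of batch VB coordinate ascent for LDA.

    n_dv  : (D, V) word-type counts per document
    alpha : (K,)   Dirichlet hyperparameter on topic proportions
    eta   : (K, V) Dirichlet hyperparameter on topics
    Returns the global variational parameter lambda of shape (K, V).
    """
    D, V = n_dv.shape
    K = eta.shape[0]
    lam = eta + np.random.gamma(1.0, 1.0, size=(K, V))   # initialize lambda
    for _ in range(n_iters):
        E_log_beta = digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))   # (K, V)
        lam_new = eta.copy()
        for d in range(D):
            gamma_d = alpha + n_dv[d].sum() / K           # initialize gamma_d
            for _ in range(local_iters):
                E_log_theta = digamma(gamma_d) - digamma(gamma_d.sum())       # (K,)
                # phi_dvk is proportional (over k) to exp(E[log theta_dk] + E[log beta_kv])
                log_phi = E_log_theta[:, None] + E_log_beta                   # (K, V)
                phi = np.exp(log_phi - log_phi.max(axis=0))
                phi /= phi.sum(axis=0)
                gamma_d = alpha + phi @ n_dv[d]           # gamma_dk = alpha_k + sum_v n_dv phi_dvk
            lam_new += phi * n_dv[d][None, :]             # lambda_kv = eta_kv + sum_d n_dv phi_dvk
        lam = lam_new
    return lam

Iterating the local φ and γ updates before refreshing λ mirrors the local/global split described above.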

SDA-Bayes VB

For a fixed hyperparameter α, we can think of BatchVB as an algorithm that takes input in the form of a prior on topic parameters β and a minibatch of documents. In particular, let C_b be the bth minibatch of documents; for documents with indices in D_b, these documents can be summarized by the word counts (n_d)_{d ∈ D_b}. Then, in the notation of Eq. (7.2), we have Θ = β, 𝒜 = BatchVB, and

q_0(β) = \prod_{k=1}^K \text{Dirichlet}(β_k \mid η_k).

In general, the bth posterior takes the same form and therefore can be summarized by its parameters λ^{(b)}:

q_b(β) = \prod_{k=1}^K \text{Dirichlet}(β_k \mid λ_k^{(b)}).

In this case, if we set the prior parameters to λ_k^{(0)} := η_k, Eq. (7.2) becomes Alg. 7.5.

Input: Hyperparameter η
Initialize λ^{(0)} ← η
foreach Minibatch C_b of documents do
    λ^{(b)} ← BatchVB(C_b, λ^{(b-1)})
    q_b(β) = \prod_{k=1}^K \text{Dirichlet}(β_k \mid λ_k^{(b)})
end
Algorithm 7.5: Streaming VB for LDA.

Next, we apply the asynchronous, distributed updates described in the "Asynchronous Bayesian updating" portion of Section 7.2 to the batch VB primitive and LDA model. In this case, λ^{(post)} is the posterior parameter estimate maintained at the master, and each worker updates this value after a local computation. The posterior after seeing a collection of minibatches is q(β) = \prod_{k=1}^K \text{Dirichlet}(β_k \mid λ_k^{(post)}). The resulting algorithm is Alg. 7.6.

Input: Hyperparameter η
Initialize λ^{(post)} ← η
foreach Minibatch C_b of documents, at a worker do
    Copy the master value locally: λ^{(local)} ← λ^{(post)}
    λ ← BatchVB(C_b, λ^{(local)})
    ∆λ ← λ − λ^{(local)}
    Update the master value synchronously: λ^{(post)} ← λ^{(post)} + ∆λ
end
Algorithm 7.6: SDA-Bayes with VB primitive for LDA.
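The master/worker pattern in Alg. 7.6 can be sketched in a few lines of Python. The threading-based mock-up below is purely illustrative (it is not the multi-threaded implementation used in the experiments); batch_primitive stands in for BatchVB or any other primitive, minibatches is an assumed iterable of document minibatches, and the prior parameter is assumed to be a NumPy array.

import threading
import queue

def sda_bayes(batch_primitive, lam_prior, minibatches, n_workers=4):
    """Sketch of SDA-Bayes: workers compute posterior increments on minibatches
    and add them into the master parameter. batch_primitive(C_b, lam) should
    return an updated copy of lam."""
    lam_post = lam_prior.copy()          # master value, lambda^(post)
    lock = threading.Lock()              # makes the increment step atomic
    work = queue.Queue()
    for C_b in minibatches:
        work.put(C_b)

    def worker():
        nonlocal lam_post
        while True:
            try:
                C_b = work.get_nowait()
            except queue.Empty:
                return
            with lock:
                lam_local = lam_post.copy()      # copy the master value locally
            lam_new = batch_primitive(C_b, lam_local)
            delta = lam_new - lam_local          # posterior increment
            with lock:
                lam_post += delta                # lambda^(post) <- lambda^(post) + delta

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return lam_post

Because the increment is additive, the order in which workers apply their ∆λ does not change the final value, which is what makes the asynchronous scheme well defined.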

7.B Expectation propagation

Batch EP

Our batch expectation propagation (EP) algorithm for LDA learns a posterior for both the document-specific topic mixing proportions (θ_d)_{d=1}^D and the topic distributions over words (β_k)_{k=1}^K. By contrast, the algorithm of Minka and Lafferty (2002) learns only the former and so is not appropriate to the model in Section 7.3. For consistency, we also follow Minka and Lafferty (2002) in making a distinction between token and type word updates, where a token refers to a particular word instance and a type refers to all words with the same vocabulary value. Let C = (w_d)_{d=1}^D denote the set of documents that we observe, and for each word v in the vocabulary, let n_{dv} denote the number of times v appears in document d.

Collapsed posterior. We begin by collapsing (i.e., integrating out) the word assignments z in the posterior (7.7) of LDA. We can express the collapsed posterior as

p(β, θ \mid C, η, α) ∝ [ \prod_{k=1}^K \text{Dirichlet}_V(β_k \mid η_k) ] \cdot \prod_{d=1}^D [ \text{Dirichlet}_K(θ_d \mid α) \cdot \prod_{v=1}^V ( \sum_{k=1}^K θ_{dk} β_{kv} )^{n_{dv}} ].

For each document-word pair (d, v), consider approximating the term \sum_{k=1}^K θ_{dk} β_{kv} above by

[ \prod_{k=1}^K \text{Dirichlet}_V(β_k \mid χ_{kdv} + 1_V) ] \cdot \text{Dirichlet}_K(θ_d \mid ζ_{dv} + 1_K),

where χ_{kdv} ∈ R^V, ζ_{dv} ∈ R^K, and 1_M is a vector of all ones of length M. This proposal serves as inspiration for taking the approximating variational distribution for p(β, θ \mid C, η, α) to be of the form

q(β, θ \mid λ, γ) := [ \prod_{k=1}^K q(β_k \mid λ_k) ] \cdot \prod_{d=1}^D q(θ_d \mid γ_d),   (7.10)

where q(β_k \mid λ_k) = \text{Dirichlet}(β_k \mid λ_k) and q(θ_d \mid γ_d) = \text{Dirichlet}(θ_d \mid γ_d), with the parameters

λ_k = η_k + \sum_{d=1}^D \sum_{v=1}^V n_{dv} χ_{kdv},   γ_d = α + \sum_{v=1}^V n_{dv} ζ_{dv},   (7.11)

and the constraints λ_k ∈ R_+^V and γ_d ∈ R_+^K for each k and d. We assume this form in the remainder of the analysis and write q(β, θ \mid χ, ζ) for q(β, θ \mid λ, γ), where χ = (χ_{kdv}), ζ = (ζ_{dv}).

Optimization problem. We seek to find the optimal parameters (χ, ζ) by minimizing the (reverse) KL divergence:

\min_{χ, ζ} KL( p(β, θ \mid C, η, α) \| q(β, θ \mid χ, ζ) ).

This joint minimization problem is not tractable, and the idea of EP is to proceed iteratively by fixing most of the factors in Eq. (7.10) and minimizing the KL divergence over the parameters related to a single word.

More formally, suppose we already have a set of parameters (χ, ζ). Consider a document d and a word v that occurs in document d (i.e., n_{dv} ≥ 1). We start by removing the component of q related to (d, v) in Eq. (7.10). Following Minka (2001b), we subtract out the effect of one occurrence of word v in document d, but at the end of this process we update the distribution on the type level. In doing so, we use the following shorthand for the remaining global parameters:

λ_k^{\setminus(d,v)} = λ_k - χ_{kdv} = η_k + (n_{dv} - 1) χ_{kdv} + \sum_{(d',v') ≠ (d,v)} n_{d'v'} χ_{kd'v'}
γ_d^{\setminus(d,v)} = γ_d - ζ_{dv} = α + (n_{dv} - 1) ζ_{dv} + \sum_{v' ≠ v} n_{dv'} ζ_{dv'}.

We replace this removed part of q by the term \sum_{k=1}^K θ_{dk} β_{kv}, which corresponds to the contribution of one occurrence of word v in document d to the true posterior p. Call the resulting normalized distribution \tilde{q}_{dv}, so \tilde{q}_{dv}(β, θ \mid λ^{\setminus(d,v)}, γ_{\setminus d}, γ_d^{\setminus(d,v)}) satisfies

\tilde{q}_{dv} ∝ [ \prod_{k=1}^K \text{Dirichlet}(β_k \mid λ_k^{\setminus(d,v)}) ] \cdot [ \prod_{d' ≠ d} \text{Dirichlet}(θ_{d'} \mid γ_{d'}) ] \cdot \text{Dirichlet}(θ_d \mid γ_d^{\setminus(d,v)}) \cdot \sum_{k=1}^K θ_{dk} β_{kv}.

We obtain an improved estimate of the posterior q by updating the parameters from (λ, γ) to (\hat{λ}, \hat{γ}), where

(\hat{λ}, \hat{γ}) = \arg\min_{λ', γ'} KL( \tilde{q}_{dv}(β, θ \mid λ^{\setminus(d,v)}, γ_{\setminus d}, γ_d^{\setminus(d,v)}) \| q(β, θ \mid λ', γ') ).   (7.12)

Solution to the optimization problem. First, note that for d' ≠ d, we have \hat{γ}_{d'} = γ_{d'}. Now consider the index d chosen on this iteration. Since β and θ are Dirichlet-distributed under q, the minimization problem in Eq. (7.12) reduces to solving the moment-matching equations (Minka, 2001b; Seeger, 2005)

E_{\tilde{q}_{dv}}[\log β_{ku}] = E_{\hat{λ}_k}[\log β_{ku}]   for 1 ≤ k ≤ K, 1 ≤ u ≤ V,
E_{\tilde{q}_{dv}}[\log θ_{dk}] = E_{\hat{γ}_d}[\log θ_{dk}]   for 1 ≤ k ≤ K.

These can be solved via Newton's method, though Minka (2001b) recommends solving exactly for the first and "average second" moments of β_{ku} and θ_{dk}, respectively, instead. We choose the latter approach for consistency with Minka (2001b); our own experiments also suggested that taking the approach of Minka (2001b) was faster than Newton's method with no noticeable performance loss. The resulting moment updates are

\hat{λ}_{ku} = \frac{\sum_{y=1}^V ( E_{\tilde{q}_{dv}}[β_{ky}^2] - E_{\tilde{q}_{dv}}[β_{ky}] )}{\sum_{y=1}^V ( E_{\tilde{q}_{dv}}[β_{ky}]^2 - E_{\tilde{q}_{dv}}[β_{ky}^2] )} \cdot E_{\tilde{q}_{dv}}[β_{ku}]   (7.13)

\hat{γ}_{dk} = \frac{\sum_{j=1}^K ( E_{\tilde{q}_{dv}}[θ_{dj}^2] - E_{\tilde{q}_{dv}}[θ_{dj}] )}{\sum_{j=1}^K ( E_{\tilde{q}_{dv}}[θ_{dj}]^2 - E_{\tilde{q}_{dv}}[θ_{dj}^2] )} \cdot E_{\tilde{q}_{dv}}[θ_{dk}].   (7.14)

We then set (χ_{kdv})_{k=1}^K and ζ_{dv} such that the new global parameters (λ_k)_{k=1}^K and γ_d are equal to the optimal parameters (\hat{λ}_k)_{k=1}^K and \hat{γ}_d. The resulting algorithm is presented in Alg. 7.7.
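Before turning to the full algorithm, the moment-matching step in Eqs. (7.13)-(7.14) can be made concrete with a small Python helper. The function name and interface below are illustrative assumptions: given the coordinate-wise first and second moments of a probability vector under \tilde{q}_{dv}, it returns the matched Dirichlet parameters.

import numpy as np

def match_dirichlet(m1, m2):
    """Moment-match a Dirichlet to given coordinate-wise moments.

    m1 : array of first moments  E[beta_u]   (entries sum to approximately 1)
    m2 : array of second moments E[beta_u^2]
    Returns matched Dirichlet parameters lam with
      lam_u = m1_u * sum_y(m2_y - m1_y) / sum_y(m1_y**2 - m2_y),
    the same expression as Eqs. (7.13)-(7.14).
    """
    m1 = np.asarray(m1, dtype=float)
    m2 = np.asarray(m2, dtype=float)
    scale = (m2 - m1).sum() / (m1 ** 2 - m2).sum()
    return scale * m1

The ratio of sums is the matched Dirichlet concentration, and multiplying by the mean vector distributes it across coordinates.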

Input: Data C = (w_d)_{d=1}^D; hyperparameters η, α
Output: λ
Initialize ∀(k, d, v), χ_{kdv} ← 0 and ζ_{dv} ← 0
while (χ, ζ) not converged do
    foreach (d, v) with n_{dv} ≥ 1 do
        /* Variational distribution without the word token (d, v) */
        ∀k, λ_k^{\setminus(d,v)} ← η_k + (n_{dv} - 1) χ_{kdv} + \sum_{(d',v') ≠ (d,v)} n_{d'v'} χ_{kd'v'}
        γ_d^{\setminus(d,v)} ← α + (n_{dv} - 1) ζ_{dv} + \sum_{v' ≠ v} n_{dv'} ζ_{dv'}
        If any of λ_{ku}^{\setminus(d,v)} or γ_{dk}^{\setminus(d,v)} are non-positive, skip updating this (d, v) (†)
        /* Variational parameters from moment-matching */
        ∀(k, u), compute \hat{λ}_{ku} from Eq. (7.13)
        ∀k, compute \hat{γ}_{dk} from Eq. (7.14)
        /* Type-level updates to parameter values */
        ∀k, χ_{kdv} ← n_{dv}^{-1} (\hat{λ}_k - λ_k^{\setminus(d,v)}) + (1 - n_{dv}^{-1}) χ_{kdv}
        ζ_{dv} ← n_{dv}^{-1} (\hat{γ}_d - γ_d^{\setminus(d,v)}) + (1 - n_{dv}^{-1}) ζ_{dv}
        Other χ, ζ remain unchanged
    end
end
/* Global variational parameters */
∀k, λ_k ← η_k + \sum_{d=1}^D \sum_{v=1}^V n_{dv} χ_{kdv}
Algorithm 7.7: EP for LDA.

The results in the main text (Section 7.4) are reported for Alg. 7.7. We also tried a slightly modified EP algorithm that makes token-level updates to parameter values, rather than type-level updates. This modified version iterates through each word placeholder in document d; that is, through pairs (d, n) rather than pairs (d, v) corresponding to word values. Since there are always at least as many (d, n) pairs as (d, v) pairs with n_{dv} ≥ 1 (and usually many more of the former), the modified algorithm requires many more iterations. In practice, we find better experimental performance for the modified EP algorithm in terms of log predictive probability as a function of the number of data points in the training set seen so far: e.g., leveling off at about −7.96 for Nature vs. −8.02. However, the modified algorithm

is also much slower, and still returns much worse results than SDA-Bayes or SVI, so we do not report these results in the main text.3

SDA-Bayes EP

Putting a batch EP algorithm for LDA into the SDA-Bayes framework is almost identical to putting a batch VB algorithm for LDA into the SDA-Bayes framework. This similarity is to be expected since SDA-Bayes works out of the box with a batch approximation algorithm in the correct form. For a fixed hyperparameter α, we can think of BatchEP as an algorithm (just like BatchVB) that takes input in the form of a prior on topic parameters β and a minibatch of documents. The same setup and notation from Appendix 7.A applies. In this case, Eq. (7.2) becomes Alg. 7.8. This algorithm is exactly the same as Alg. 7.5 but with a batch EP primitive instead of a batch VB primitive.

Input: Hyperparameter η
Initialize λ^{(0)} ← η
foreach Minibatch C_b of documents do
    λ^{(b)} ← BatchEP(C_b, λ^{(b-1)})
    q_b(β) = \prod_{k=1}^K \text{Dirichlet}(β_k \mid λ_k^{(b)})
end
Algorithm 7.8: Streaming EP for LDA.

Next, we apply the asynchronous, distributed updates described in the “Asynchronous Bayesian updating” portion of Section 7.2 to the batch EP primitive and LDA model. Again, the setup and notation from Appendix 7.A applies, and we find Alg. 7.9. Indeed, the recipe outlined here applies more generally to other primitives besides EP and VB.

Input: Hyperparameter η
Initialize λ^{(post)} ← η
foreach Minibatch C_b of documents, at a worker do
    Copy the master value locally: λ^{(local)} ← λ^{(post)}
    λ ← BatchEP(C_b, λ^{(local)})
    ∆λ ← λ − λ^{(local)}
    Update the master value synchronously: λ^{(post)} ← λ^{(post)} + ∆λ
end
Algorithm 7.9: SDA-Bayes with EP primitive for LDA.

3Here and in the main text we run EP with η = 1. We also tried EP with η = 0.01, but the positivity check for λ_{ku}^{\setminus(d,v)} and γ_{dk}^{\setminus(d,v)} on line (†) in Alg. 7.7 always failed and as a result none of the parameters were updated.

Chapter 8

MAD-Bayes: MAP-based asymptotic derivations from Bayes

The classical mixture of Gaussians model is related to K-means via small-variance asymptotics: as the covariances of the Gaussians tend to zero, the negative log-likelihood of the mixture of Gaussians model approaches the K-means objective, and the EM algorithm approaches the K-means algorithm. Kulis and Jordan (2012) used this observation to obtain a novel K-means-like algorithm from a Gibbs sampler for the Dirichlet process (DP) mixture. We instead consider applying small-variance asymptotics directly to the posterior in Bayesian nonparametric models. This framework is independent of any specific Bayesian inference algorithm, and it has the major advantage that it generalizes immediately to a range of models beyond the DP mixture. To illustrate, we apply our framework to the feature learning setting, where the beta process and Indian buffet process provide an appropriate Bayesian nonparametric prior. We obtain a novel objective function that goes beyond clustering to learn (and penalize new) groupings for which we relax the mutual exclusivity and exhaustivity assumptions of clustering. We demonstrate several other algorithms, all of which are scalable and simple to implement. Empirical results demonstrate the benefits of the new framework.

8.1 Introduction

Clustering is a canonical learning problem and arguably the dominant application of unsupervised learning. Much of the popularity of clustering revolves around the K-means algorithm; its simplicity and scalability make it the preferred choice in many large-scale unsupervised learning problems, even though a wide variety of more flexible algorithms, including those from Bayesian nonparametrics, have been developed since the advent of K-means (Steinley, 2006; Jain, 2010). Indeed, Berkhin (2006) writes that K-means is "by far the most popular clustering tool used nowadays in scientific and industrial applications."

K-means does have several known drawbacks. For one, the K-means algorithm clusters data into mutually exclusive and exhaustive clusters, which may not always be the optimal or desired form of latent structure for a data set. For example, pictures on a photo-sharing website might each be described by multiple tags, or social network users might be described by multiple interests. In these examples, a feature allocation, in which each data point can belong to any nonnegative integer number of groups (now called features), is a more appropriate description of the data (Griffiths and Ghahramani, 2006; Broderick, Jordan, and Pitman, 2013). Second, the K-means algorithm requires advance knowledge of the number of clusters, which may be unknown or grow with the number of data points in some applications. A vast literature exists just on how to choose a number of clusters using heuristics or extensions of K-means (Steinley, 2006; Jain, 2010).

A recent algorithm called DP-means (Kulis and Jordan, 2012) provides another perspective on the choice of cluster cardinality. Recalling the small-variance asymptotic argument that takes the EM algorithm for mixtures of Gaussians and yields the K-means algorithm, the authors apply this argument to a Gibbs sampler for a Dirichlet process (DP) mixture (Antoniak, 1974; Escobar, 1994; Escobar and West, 1995) and obtain a K-means-like algorithm that does not fix the number of clusters upfront. Notably, this derivation of DP-means is specific to the choice of the sampling algorithm and is also not immediately amenable to the feature learning setting.

In this chapter, we provide a more general perspective on these small-variance asymptotics. We show that one can obtain the objective function for DP-means (independent of any algorithm) by applying asymptotics directly to the MAP estimation problem of a Gaussian mixture model with a Chinese Restaurant Process (CRP) prior (Blackwell and MacQueen, 1973; Aldous, 1985) on the latent clustering. The key is to express the posterior in terms of the exchangeable partition probability function (EPPF) of the CRP (Pitman, 1995).

A critical advantage of this more general view of small-variance asymptotics is that it provides a framework for extending beyond the DP mixture. The Bayesian nonparametric toolbox contains many models that may yield, via small-variance asymptotics, a range of new algorithms that to the best of our knowledge have not been discovered in the K-means literature. We thus view our major contribution as providing new directions for researchers working on K-means and related discrete optimization problems. To highlight this generality, we show how the framework may be used in the feature learning setting.
We take as our point of departure the beta process (BP) (Hjort, 1990; Thibaux and Jordan, 2007), which is the feature learning counterpart of the DP, and the Indian Buffet Process (IBP) (Griffiths and Ghahramani, 2006), which is the feature learning counterpart of the CRP. We show how to express the corresponding MAP inference problem via an analogue of the EPPF that we refer to as an "exchangeable feature probability function" (EFPF) (Broderick, Pitman, and Jordan, 2013). Taking an asymptotic limit we obtain a novel objective function for feature learning, as well as a simple and scalable algorithm for learning features in a data set. The resulting algorithm, which we call BP-means, is similar to the DP-means algorithm, but allows each data point to be assigned to more than one feature. We also use our framework to derive several additional algorithms, including algorithms based on the Dirichlet-multinomial prior as well as extensions to the marginal MAP

problem in which the cluster/feature means are integrated out. We compare our algorithms to existing Gibbs sampling methods as well as existing hard clustering methods in order to highlight the benefits of our approach.

8.2 MAP asymptotics for clusters

We begin with the problem setting of Kulis and Jordan (2012) but diverge in our treatment of the small-variance asymptotics. We consider a Bayesian nonparametric framework for generating data via a prior on clusterings and a likelihood that depends on the (random) clustering. Prior and likelihood yield a posterior distribution. A point estimate of the clustering (i.e., a hard clustering) may be achieved by choosing a clustering that maximizes the posterior; the result is a maximum a posteriori (MAP) estimate.

Consider a data set x_1, ..., x_N, where x_n is a D-component vector. Let K^+ denote the (random) number of clusters. Let z_{nk} equal one if data index n belongs to cluster k and 0 otherwise, so there is exactly one value of k for each n such that z_{nk} = 1. We can order the cluster labels k so that the first K^+ clusters are non-empty (i.e., z_{nk} = 1 for some n for each such k). Together K^+ and z_{1:N,1:K^+} describe a clustering.

The Chinese restaurant process (CRP) (Blackwell and MacQueen, 1973; Aldous, 1985) gives a prior on K^+ and z_{1:N,1:K^+} as follows. Let θ > 0 be a hyperparameter of the model. The first customer (data index 1) starts a new table in the restaurant; i.e., z_{1,1} = 1. Recursively, the nth customer (data index n) sits at an existing table k with probability in proportion to the number of people sitting there (i.e., in proportion to S_{n-1,k} := \sum_{m=1}^{n-1} z_{mk}) and at a new table with probability proportional to θ. Suppose the final restaurant has K^+ tables with N total customers sitting according to z_{1:N,1:K^+}. Then the probability of this clustering is found from the above recursion:

P(z_{1:N,1:K^+}) = θ^{K^+ - 1} \frac{Γ(θ + 1)}{Γ(θ + N)} \prod_{k=1}^{K^+} (S_{N,k} - 1)!,   (8.1)

a formula that is known as an exchangeable partition probability function (EPPF) (Pitman, 1995).

A common choice for the likelihood is to assume that data in cluster k are Gaussian with cluster-specific mean µ_k and shared variance σ^2 I_D (where I_D is the D × D identity matrix and σ^2 > 0). Then the likelihood of data x = x_{1:N} given clustering z = z_{1:N,1:K^+} and means µ = µ_{1:K^+} is:

P(x \mid z, µ) = \prod_{k=1}^{K^+} \prod_{n: z_{n,k} = 1} N(x_n \mid µ_k, σ^2 I_D).

Further suppose the µ_k are drawn iid Gaussian from a prior with mean 0 in every dimension and variance ρ^2 I_D for hyperparameter ρ^2 > 0: P(µ_{1:K^+}) = \prod_{k=1}^{K^+} N(µ_k \mid 0, ρ^2 I_D).
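As a small illustration, the log of the EPPF in Eq. (8.1) can be evaluated from the cluster sizes alone. The Python sketch below (with an assumed interface taking the list of sizes S_{N,1}, ..., S_{N,K^+} and the concentration θ) is not part of the derivation, just a concrete rendering of the formula.

import numpy as np
from scipy.special import gammaln

def crp_log_eppf(sizes, theta):
    """Log of the CRP EPPF in Eq. (8.1) for cluster sizes S_{N,1..K+}."""
    sizes = np.asarray(sizes)
    N, K = sizes.sum(), len(sizes)
    return ((K - 1) * np.log(theta)
            + gammaln(theta + 1) - gammaln(theta + N)
            + gammaln(sizes).sum())   # sum_k log (S_{N,k} - 1)!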

The posterior distribution over the clustering given the observed data, P(z, µ \mid x), is calculated from the prior and likelihood using Bayes theorem: P(z, µ \mid x) ∝ P(x \mid z, µ) P(µ) P(z). We find the MAP point estimate for the clustering and cluster means by maximizing the posterior: \operatorname*{argmax}_{K^+, z, µ} P(z, µ \mid x). Note that the point estimate will be the same if we instead minimize the negative log joint likelihood: \operatorname*{argmin}_{K^+, z, µ} -\log P(z, µ, x).

In general, calculating the posterior or MAP estimate is difficult and usually requires approximation, e.g. via Markov chain Monte Carlo or a variational method. A different approximation can be obtained by taking the limit of the objective function above as the cluster variances decrease to zero: σ^2 → 0. Since the prior allows an unbounded number of clusters, taking this limit will result in each data point being assigned to its own cluster in the MAP. To arrive at a limiting objective function that favors a non-trivial cluster assignment, we modulate the number of clusters via the hyperparameter θ, which varies linearly with the expected number of clusters in the prior. In particular, we choose some constant λ^2 > 0 and let θ = \exp(-λ^2 / (2σ^2)), so that, e.g., θ → 0 as σ^2 → 0.

Substituting θ as a function of σ^2 and letting σ^2 → 0, we find that -2σ^2 \log P(z, µ, x) satisfies

-2σ^2 \log P(z, µ, x) \sim \sum_{k=1}^{K^+} \sum_{n: z_{nk} = 1} \| x_n - µ_k \|^2 + (K^+ - 1) λ^2,   (8.2)

where f(σ^2) \sim g(σ^2) here denotes f(σ^2) / g(σ^2) → 1 as σ^2 → 0. The double sum originates from the exponential function in the Gaussian data likelihood, and the penalty term, reminiscent of an AIC penalty (Akaike, 1974), originates from the CRP prior (Appendix 8.A). From Eq. (8.2), we see that finding the MAP estimate of the CRP Gaussian mixture model is asymptotically equivalent to the following optimization problem:

\operatorname*{argmin}_{K^+, z, µ} \sum_{k=1}^{K^+} \sum_{n: z_{nk} = 1} \| x_n - µ_k \|^2 + (K^+ - 1) λ^2.   (8.3)

Kulis and Jordan (2012) derived a similar objective function, which they called the DP-means objective function (a name we retain for Eq. (8.3)), by first deriving a K-means-style algorithm from a DP Gibbs sampler. Here, by contrast, we have found this objective function directly from the MAP problem, with no reference to any particular inference algorithm and thereby demonstrating a more fundamental link between the MAP problem and Eq. (8.3). In the following, we show that this focus on limits of a MAP estimate can yield useful optimization problems in diverse domains.

Notably, the objective in Eq. (8.3) takes the form of the K-means objective function (the double sum) plus a penalty of λ^2 for each cluster after the first; this offset penalty is natural since any partition of a non-empty set must have at least one cluster.1 Once we

1The objective of Kulis and Jordan (2012) penalizes all K^+ clusters; the optimal arguments are the same in each case.

have Eq. (8.3), we may consider efficient solution methods; one candidate is the DP-means algorithm of Kulis and Jordan (2012).
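For concreteness, here is a minimal Python sketch of that DP-means algorithm as a local solver for Eq. (8.3). The function signature, the single-cluster initialization, and the iteration cap are illustrative assumptions rather than details fixed by the text; X is the N × D data matrix and lam2 plays the role of λ^2.

import numpy as np

def dp_means(X, lam2, max_iters=100):
    """Sketch of DP-means: local minimization of the objective in Eq. (8.3)."""
    N, D = X.shape
    mu = [X.mean(axis=0)]                      # start with a single cluster
    z = np.zeros(N, dtype=int)
    for _ in range(max_iters):
        changed = False
        for n in range(N):
            d2 = np.array([np.sum((X[n] - m) ** 2) for m in mu])
            k = int(d2.argmin())
            if d2[k] > lam2:                   # opening a new cluster is cheaper than lam2
                mu.append(X[n].copy())
                k = len(mu) - 1
            if k != z[n]:
                z[n], changed = k, True
        # recompute means of non-empty clusters and relabel accordingly
        mu = [X[z == k].mean(axis=0) for k in range(len(mu)) if np.any(z == k)]
        labels = [k for k in range(int(z.max()) + 1) if np.any(z == k)]
        remap = {old: new for new, old in enumerate(labels)}
        z = np.array([remap[k] for k in z])
        if not changed:
            break
    return z, np.array(mu)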

8.3 MAP asymptotics for features

Once more consider a data set x_{1:N}, where x_n is a D-component vector. Now let K^+ denote the (random) number of features. Let z_{nk} equal one if data index n is in feature k and zero otherwise. In the feature case, while there must be a finite number of k values such that z_{nk} = 1 for any n, it is not required that there be exactly a single such k or even any such k. We order the feature labels k so that the first K^+ features are non-empty; i.e., we have z_{nk} = 1 for some n for each such k. Together K^+ and z_{1:N,1:K^+} describe a feature allocation.

The Indian buffet process (IBP) (Griffiths and Ghahramani, 2006) is a prior on z_{1:N,1:K^+} that places strictly positive probability on any finite, nonnegative value of K^+. Like the CRP, it is based on an analogy between the customers in a restaurant and the data indices. In the IBP, the dishes in the buffet correspond to features. Let γ > 0 be a hyperparameter of the model. The first customer (data index 1) samples K_1^+ \sim \text{Poisson}(γ) dishes from the buffet. Recursively, when the nth customer (data index n) arrives at the buffet, \sum_{m=1}^{n-1} K_m^+ dishes have been sampled by the previous customers. Suppose dish k of these dishes has been sampled S_{n-1,k} times by the first n - 1 customers. The nth customer samples dish k with probability S_{n-1,k}/n. The nth customer also samples K_n^+ \sim \text{Poisson}(γ/n) new dishes.

Suppose the buffet has been visited by N customers who sampled a total of K^+ dishes. Let z = z_{1:N,1:K^+} represent the resulting feature allocation. Let H be the number of unique values of the z_{1:N,k} vector across k; let \tilde{K}_h be the number of k with the hth unique value of this vector. We calculate an "exchangeable feature probability function" (EFPF) (Broderick, Pitman, and Jordan, 2013) by multiplying together the probabilities from the N steps in the description and find that P(z) equals (Griffiths and Ghahramani, 2006)

\frac{γ^{K^+}}{\prod_{h=1}^H \tilde{K}_h!} \exp\left( -γ \sum_{n=1}^N n^{-1} \right) \prod_{k=1}^{K^+} \binom{N}{S_{N,k}}^{-1} S_{N,k}^{-1}.   (8.4)

It remains to specify a probability for the observed data x given the latent feature allocation z. The linear Gaussian model of Griffiths and Ghahramani (2006) is a natural extension of the Gaussian mixture model to the feature case. As previously, we specify a prior on feature means µ_k \sim_{iid} N(0, ρ^2 I_D) for some hyperparameter ρ^2 > 0. Now data point n is drawn independently with mean equal to the sum of its feature means, \sum_{k=1}^{K^+} z_{nk} µ_k, and variance σ^2 I_D for some hyperparameter σ^2 > 0. In the case where each data point belongs to exactly one feature, this model is just a Gaussian mixture. We often write the means as a K × D matrix A with kth row µ_k. Writing Z for the N × K matrix with (n, k) element z_{nk} and X for the N × D matrix with nth row x_n, we have P(X \mid Z, A) equal to

\frac{1}{(2πσ^2)^{ND/2}} \exp\left( -\frac{\operatorname{tr}((X - ZA)'(X - ZA))}{2σ^2} \right).   (8.5)
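Before moving to the MAP objective, the EFPF in Eq. (8.4) can also be evaluated directly from a binary feature matrix. The Python sketch below is purely illustrative (the function name and interface are assumptions); it requires every retained feature column to be non-empty, matching the convention above.

import numpy as np
from scipy.special import gammaln
from collections import Counter

def ibp_log_efpf(Z, gamma):
    """Log of the IBP EFPF in Eq. (8.4) for a binary feature matrix Z (N x K+),
    all of whose columns are assumed non-empty."""
    N, K = Z.shape
    sizes = Z.sum(axis=0)                                 # S_{N,k}
    # multiplicities K_tilde_h of identical columns z_{1:N,k}
    col_counts = Counter(map(tuple, Z.T.astype(int)))
    log_Kh_fact = sum(gammaln(c + 1) for c in col_counts.values())
    harmonic = np.sum(1.0 / np.arange(1, N + 1))
    # log [ binom(N, S)^{-1} S^{-1} ] = log (S-1)! + log (N-S)! - log N!
    per_feature = (gammaln(sizes) + gammaln(N - sizes + 1) - gammaln(N + 1)).sum()
    return K * np.log(gamma) - log_Kh_fact - gamma * harmonic + per_feature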

As in the clustering case, we wish to find the joint MAP estimate of the structural component Z and group-specific parameters A. It is equivalent to find the values of Z and A that minimize -\log P(X, Z, A). Finally, we wish to take the limit of this objective as σ^2 → 0. Lest every data point be assigned to its own separate feature, we modulate the number of features in the small-σ^2 limit by choosing some constant λ^2 > 0 and setting γ = \exp(-λ^2/(2σ^2)). Letting σ^2 → 0, we find that asymptotically (Appendix 8.B)

-2σ^2 \log P(X, Z, A) \sim \operatorname{tr}[(X - ZA)'(X - ZA)] + K^+ λ^2.

The trace originates from the matrix Gaussian, and the penalty term originates from the IBP prior. It follows that finding the MAP estimate for the feature learning problem is asymptotically equivalent to solving:

\operatorname*{argmin}_{K^+, Z, A} \operatorname{tr}[(X - ZA)'(X - ZA)] + K^+ λ^2.   (8.6)

We follow Kulis and Jordan (2012) in referring to the underlying random measure when naming objective functions derived from Bayesian nonparametric priors. Recalling that the beta process (BP) (Hjort, 1990; Thibaux and Jordan, 2007) is the random measure underlying the IBP, we call the objective in Eq. (8.6) the BP-means objective. The trace term in Eq. (8.6) forms a K-means-style objective on a feature matrix Z and feature means A when the number of features (i.e., the number of columns of Z or rows of A) is fixed. The second term enforces a penalty of λ^2 for each feature. In contrast to the DP-means objective, even the first feature is penalized since K^+ = 0 is allowed here.

We formulate a BP-means algorithm to solve the optimization problem in Eq. (8.6) and discuss its convergence properties. In Alg. 8.1, note that Z'Z is invertible so long as no two features have the same collection of indices. If that is not the case, we simply combine the two features into a single feature before performing the inversion.

Iterate until no changes are made:
1. For n = 1, ..., N
   • For k = 1, ..., K^+, choose the optimal value (0 or 1) of z_{nk}.
   • Let Z' equal Z but with one new feature (labeled K^+ + 1) containing only data index n. Set A' = A but with one new row: A'_{K^+ + 1, ·} ← X_{n,·} − Z_{n,·} A.
   • If the triplet (K^+ + 1, Z', A') lowers the objective from the triplet (K^+, Z, A), replace the latter triplet with the former.
2. Set A ← (Z'Z)^{-1} Z'X.
Algorithm 8.1: BP-means.
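A minimal Python sketch of Alg. 8.1 follows; it is illustrative only (not the Matlab implementation used in Section 8.5). X is N × D, lam2 is λ^2, empty and duplicate feature columns are removed before the final step, and the closed-form (Z'Z)^{-1}Z'X update is computed via a least-squares solve.

import numpy as np

def bp_means(X, lam2, max_iters=100):
    """Sketch of BP-means (Alg. 8.1): local search on the objective in Eq. (8.6)."""
    N, D = X.shape
    Z = np.zeros((N, 0))                       # start with no features
    A = np.zeros((0, D))

    def objective(Z, A):
        R = X - Z @ A
        return float(np.trace(R.T @ R)) + lam2 * Z.shape[1]

    for _ in range(max_iters):
        before = objective(Z, A)
        for n in range(N):
            for k in range(Z.shape[1]):        # choose the optimal 0/1 value of z_nk
                best_val, best_obj = Z[n, k], objective(Z, A)
                for val in (0.0, 1.0):
                    Z[n, k] = val
                    obj = objective(Z, A)
                    if obj < best_obj:
                        best_val, best_obj = val, obj
                Z[n, k] = best_val
            # propose one new feature containing only data index n
            Z_new = np.hstack([Z, np.zeros((N, 1))]); Z_new[n, -1] = 1.0
            A_new = np.vstack([A, X[n] - Z[n] @ A])
            if objective(Z_new, A_new) < objective(Z, A):
                Z, A = Z_new, A_new
        # drop empty features, merge duplicates, then update A by least squares
        Z = Z[:, Z.any(axis=0)]
        if Z.shape[1] > 1:
            Z = np.unique(Z, axis=1)
        A = np.linalg.lstsq(Z, X, rcond=None)[0] if Z.shape[1] else np.zeros((0, D))
        if objective(Z, A) >= before - 1e-12:
            break
    return Z, A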

Proposition 8.3.1. The BP-means algorithm converges after a finite number of iterations to a local minimum of the BP-means objective in Eq. (8.6).

See Appendix 8.G for the proof. Though the proposition guarantees convergence, it does not guarantee convergence to the global optimum—an analogous result to those available for the K-means and DP-means algorithms (Kulis and Jordan, 2012). Many authors have noted the problem of local optima in the clustering literature (Steinley, 2006; Jain, 2010). One expects that the issue of local optima is only exacerbated in the feature domain, where the combinatorial landscape is much more complex. In clustering, this issue is often addressed by multiple random restarts and careful choice of cluster initialization; in Section 8.5 below, we also make use of random algorithm restarts and propose a feature initialization akin to one with provable guarantees for K-means clustering (Arthur and Vassilvitskii, 2007).

8.4 Extensions

We demonstrate our methodology using different priors on Z below and using different likelihoods in Appendix 8.F.

Collapsed objectives. It is believed that collapsing out the cluster or feature means from a Bayesian model by calculating instead the marginal structural posterior can improve MCMC sampler mixing in many scenarios (Liu, 1994). In the clustering case, collapsing translates to forming the posterior P(z \mid x) = \int_µ P(z, µ \mid x). Note that even in the cluster case, we may use the matrix representations Z, X, and A so long as we make the additional assumption that \sum_{k=1}^{K^+} z_{nk} = 1 for each n. Finding the MAP estimate \operatorname*{argmax}_Z P(Z \mid X) may, as usual, be accomplished by minimizing the negative log joint distribution with respect to Z. P(Z) is given by the CRP (Eq. (8.1)). P(X \mid Z) takes the form:

\frac{\exp\left( -\frac{1}{2σ^2} \operatorname{tr}\left( X'( I_N - Z( Z'Z + \frac{σ^2}{ρ^2} I_{K^+} )^{-1} Z' ) X \right) \right)}{(2πσ^2)^{ND/2} (ρ^2/σ^2)^{K^+ D/2} | Z'Z + \frac{σ^2}{ρ^2} I_{K^+} |^{D/2}}.   (8.7)

Eq. (8.7) was derived by Griffiths and Ghahramani (2006) for linear-Gaussian features but applies to Gaussian clusters when Z encodes a clustering. Using the same asymptotics in σ^2 and θ as before, we find the limiting optimization problem (Appendix 8.C):

\operatorname*{argmin}_{K^+, Z} \operatorname{tr}\left( X'( I_N - Z(Z'Z)^{-1} Z' ) X \right) + (K^+ - 1) λ^2.   (8.8)

The first term in this objective was proposed, via independent considerations, by Gordon and Henderson (1977). Simple algebraic manipulations allow us to rewrite the objective in a more intuitive format (Appendix 8.C):

\operatorname*{argmin}_{K^+, Z} \sum_{k=1}^{K^+} \sum_{n: z_{nk} = 1} \| x_{n,·} - \bar{x}^{(k)} \|_2^2 + (K^+ - 1) λ^2,   (8.9)

where \bar{x}^{(k)} := S_{N,k}^{-1} \sum_{m: z_{mk} = 1} x_{m,·} is the kth empirical cluster mean, i.e., the mean of all data points assigned to cluster k. This collapsed DP-means objective is just the original DP-means objective in Eq. (8.3) with the cluster means replaced by empirical cluster means. A corresponding optimization algorithm appears in Alg. 8.2. A similar proof to that of Kulis and Jordan (2012) shows that this algorithm converges in a finite number of iterations to a local minimum of the objective.

Iterate until no changes are made:
1. For n = 1, ..., N
   • Assign x_n to the closest cluster if the contribution to the objective in Eq. (8.9) from the squared distance is at most λ^2.
   • Otherwise, form a new cluster with just x_n.
Algorithm 8.2: Collapsed DP-means.
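A short Python sketch of Alg. 8.2 is given below; it is illustrative only, and the contribution of x_n to the first term of Eq. (8.9) is approximated here by its squared distance to the current empirical cluster mean.

import numpy as np

def collapsed_dp_means(X, lam2, max_iters=100):
    """Sketch of collapsed DP-means (Alg. 8.2) using empirical cluster means."""
    N, D = X.shape
    z = np.zeros(N, dtype=int)                         # start with one big cluster
    for _ in range(max_iters):
        changed = False
        for n in range(N):
            labels = list(np.unique(z))
            means = [X[z == k].mean(axis=0) for k in labels]
            d2 = np.array([np.sum((X[n] - m) ** 2) for m in means])
            k = int(d2.argmin())
            new = labels[k] if d2[k] <= lam2 else max(labels) + 1   # else open a new cluster
            if new != z[n]:
                z[n], changed = new, True
        if not changed:
            break
    _, z = np.unique(z, return_inverse=True)           # relabel clusters to 0..K+-1
    return z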

We have already noted that the likelihood associated with the Gaussian mixture model conditioned on a clustering is just a special case of the linear Gaussian model conditioned on a feature matrix. Therefore, it is not surprising that Eq. (8.7) also describes P(X \mid Z) when Z is a feature matrix. Now, P(Z) is given by the IBP (Eq. (8.4)). Using the same asymptotics in σ^2 and γ as in the joint MAP case, the MAP problem for feature allocation Z asymptotically becomes (Appendix 8.D):

\operatorname*{argmin}_{K^+, Z} \operatorname{tr}\left( X'( I_N - Z(Z'Z)^{-1} Z' ) X \right) + K^+ λ^2.   (8.10)

The key difference with Eq. (8.8) is that here Z may have any finite number of ones in each row. We call the objective in Eq. (8.10) the collapsed BP-means objective.

Repeat the following step until no changes are made:
1. For n = 1, ..., N
   • Choose z_{n,1:K^+} to minimize the objective in Eq. (8.10). Delete any redundant features.
   • Add a new feature (indexed K^+ + 1) with only data index n if doing so decreases the objective and if the feature would not be redundant.
Algorithm 8.3: Collapsed BP-means.

Just as the collapsed DP-means objective has an empirical cluster means interpretation, so does the collapsed BP-means objective have an interpretation in which the feature means matrix A in Eq. (8.6) is replaced by its empirical estimate (Z'Z)^{-1}Z'X (cf. Appendix 8.G). In particular, we can rewrite the objective in Eq. (8.10) as \operatorname{tr}[(X - Z(Z'Z)^{-1}Z'X)'(X - Z(Z'Z)^{-1}Z'X)] + K^+ λ^2. A corresponding optimization algorithm appears in Alg. 8.3. A similar proof to that of Proposition 8.3.1 shows that this algorithm converges in a finite number of iterations to a local minimum of the objective.

Parametric objectives. The generative models studied so far are nonparametric in the usual Bayesian sense; there is no a priori bound on the number of cluster or feature parameters. The objectives above are similarly nonparametric. Parametric models, with a fixed bound on the number of clusters or features, are often useful as well. See Appendix 8.E for derivations of objectives for clustering and feature learning in the parametric case. Since below we apply the parametric version for the feature learning setting, which we call K-features (analogous to K-means but for feature learning), we include its description in Alg. 8.4.

Repeat until no changes are made:
1. For n = 1, ..., N
   • For k = 1, ..., K, set z_{n,k} to minimize \| x_{n,·} − z_{n,1:K} A \|^2.
2. Set A = (Z'Z)^{-1} Z'X.
Algorithm 8.4: K-features.
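A Python sketch of Alg. 8.4 for a fixed K appears below. It is illustrative only: the interface and initial feature means A_init are assumptions, the per-coordinate z update is replaced here by an exhaustive search over the 2^K binary vectors (reasonable only for small K), and the (Z'Z)^{-1}Z'X step is computed as a least-squares solve.

import numpy as np
from itertools import product

def k_features(X, K, A_init, max_iters=100):
    """Sketch of K-features (Alg. 8.4) for a fixed number of features K."""
    N, D = X.shape
    A = A_init.copy()                                   # (K, D) feature means
    Z = np.zeros((N, K))
    for _ in range(max_iters):
        Z_old = Z.copy()
        for n in range(N):
            # choose the best binary z_{n,1:K} by exhaustive search (small K assumed)
            best = min(product((0.0, 1.0), repeat=K),
                       key=lambda z: np.sum((X[n] - np.array(z) @ A) ** 2))
            Z[n] = best
        A = np.linalg.lstsq(Z, X, rcond=None)[0]        # least-squares update of A
        if np.array_equal(Z, Z_old):
            break
    return Z, A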

8.5 Experiments

We examine collections of unlabeled data to discover latent shared features. We have already seen the BP-means and collapsed BP-means algorithms when the number of features is unknown. A third algorithm that we evaluate here involves running the K-features algorithm for different values of K and choosing the joint values of K, Z, A that minimize the BP-means objective in Eq. (8.6); we call this the stepwise K-features algorithm. If we assume the plot of the minimized K-features objective (Eq. (8.14)) as a function of K has increasing increments (i.e., decreasing negative increments), then we need only run the K-features algorithm for increasing K until the objective increases.

It is well known that the K-means algorithm is sensitive to the choice of cluster initialization (Peña, Lozano, and Larrañaga, 1999). Potential methods of addressing this issue include multiple random initializations and choosing initial, random cluster centers according to the K-means++ algorithm (Arthur and Vassilvitskii, 2007). In the style of K-means++, we introduce a similar feature means initialization. We first consider fixed K. In K-means++, the initial cluster center is chosen uniformly at random from the data set. However, we note that empirically, the various feature algorithms discussed tend to prefer the creation of a base feature, shared amongst all the data. So start by assigning every data index to the first feature, and let the first feature mean be the mean of all the data points. Recursively, for feature k with k > 1, calculate the distance from each data point x_{n,·} to its feature representation z_{n,·}A for the construction thus far. Choose a data index n with probability proportional to this distance squared. Assign A_{k,·} to be the nth such residual, x_{n,·} − z_{n,·}A. Assign z_{m,k} for all m = 1, ..., N to optimize the K-features objective. In the case where K is not known in advance, we repeat the recursive step as long as doing so decreases the objective.

Another important consideration in running these algorithms without a fixed number of clusters or features is choosing the relative penalty effect λ^2. One option is to solve for λ^2 from a proposed K value via a heuristic (Kulis and Jordan, 2012) or validation on a data subset. Rather than assume K and return to it in this roundabout way, in the following we aim merely to demonstrate that there exist reasonable values of λ^2 that return meaningful results. More carefully examining the translation from a discrete (K) to continuous (λ^2) parameter space may be a promising direction for future work.

Tabletop data. Using a Logitech digital webcam, Griffiths and Ghahramani (2006) took 100 pictures of four objects (a prehistoric handaxe, a Klein bottle, a cellular phone, and a $20 bill) placed on a tabletop. The images are in JPEG format with 240 pixel height, 320 pixel width, and 3 color channels. Each object may or may not appear in a given picture; the experimenters endeavored to place each object (by hand) in a respective fixed location across pictures. This setup lends itself naturally to the feature allocation domain. We expect to find a base feature depicting the tabletop and four more features, respectively corresponding to each of the four distinct objects. Conversely, clustering on this data set would yield either a cluster for each distinct feature combination (a much less parsimonious and less informative representation than the feature allocation) or some averages over feature combinations. The latter case again fails to capture the combinatorial nature of the data.

We emphasize a further point about identifiability within this combinatorial structure. One "true" feature allocation for this data is the one described above. But an equally valid allocation, from a combinatorial perspective, is one in which the base feature contains all four objects and the tabletop. There are four further features, each of which deletes an object and replaces it with tabletop so that every possible combination of objects on the tabletop can be constructed from the features. Indeed, any combination of objects on the tabletop could equally well serve as a base feature; the four remaining features serve to add or delete objects as necessary.

We run PCA on the data and keep the first D = 100 principal components to form the data vector for each image. This pre-processing is the same as that performed by Griffiths and Ghahramani (2006), except the authors in that case first average the three color channels of the images. We consider the Gibbs sampling algorithm of Griffiths and Ghahramani (2006) with initialization (mass parameter 1 and feature mean variance 0.5) and number of sampling steps (1000) determined by the authors; we explore alternative initializations below. We compare to the three feature means algorithms described above, all with λ^2 = 1. Each of the final three algorithms uses the appropriate variant of greedy initialization analogous to K-means++. We run 1000 random initializations of the collapsed and BP-means algorithms to mitigate local minima. We run 300 random initializations of K-features for each value of K and note that K = 2, ..., 6 are (dynamically) explored by the algorithm. All code was run in Matlab on the same computer. Timing and feature count results are on the left of Figure 8.1.

While it is notoriously difficult to compare computation times for deterministic,

Alg      Per run (s)   Total (s)     # features
Gibbs    8.5 · 10^3    —             10
Collap   11            1.1 · 10^4    5
BP-m     0.36          3.6 · 10^2    6
FeatK    0.10          1.55 · 10^2   5

Figure 8.1: Left: A comparison of results for the IBP Gibbs sampler (Griffiths and Ghahramani, 2006), the collapsed BP-means algorithm, the basic BP-means algorithm, and the stepwise K-features algorithm. The first column shows the time for each run of the algorithm in seconds; the second column shows the total running time of the algorithm (i.e., over multiple repeated runs for the final three); and the third column shows the final number of features learned (the IBP # is stable for > 900 final iterations). Right: A histogram of collections of the final K values found by the IBP for a variety of initializations and parameter starting values.

[Figure 8.2 appears here. The text rows of the figure give, for the four example images, the feature assignments 10111, 11111, 11010, 10000 and, for the four non-base features, the labels (subtract), (subtract), (add), (add).]

Figure 8.2: Upper row: Four example images in the tabletop data set. Second row: Feature assignments of each image. The first feature is the base feature, which depicts the Klein bottle and $20 bill on a tabletop and is almost identical to the fourth picture in the first row. The remaining four features are shown in order in the third row. The fourth row indicates whether the picture is added or subtracted when the feature is present.

hard-assignment algorithms such as K-means to stochastic algorithms such as Gibbs sampling, particularly given the practical need for reinitialization to avoid local minima in the former, and difficult-to-assess convergence in the latter, it should be clear from the first column in the lefthand table of Figure 8.1 that there is a major difference in computation time between Gibbs sampling and the new algorithms. Even when the BP-means algorithm is run 1000 times in a reinitialization procedure, the total time consumed is still an order of magnitude less than that for a single run of Gibbs sampling. Stepwise K-features is the fastest of the new algorithms. We further note that if we were to take advantage of parallelism, additional drastic advantages could be obtained for the new algorithms. The Gibbs sampler requires each Gibbs iteration to be performed sequentially, whereas the random initializations of the various feature means algorithms can be performed in parallel. A certain level of parallelism may even be exploited for the steps within each iteration of the collapsed and BP-means algorithms, while the z_{n,1:K} optimizations of K-features may all be performed in parallel across n.

Another difficulty in comparing algorithms is that there is no clear single criterion with which to measure accuracy of the final model in unsupervised learning problems such as these. We do note, however, that theoretical considerations suggest that the IBP is not designed to find either a fixed number of features as N varies or roughly equal sizes in those features it does find (Broderick, Jordan, and Pitman, 2012). This observation may help explain the distribution of observed feature counts over a variety of IBP runs with the given data. To obtain feature counts from the IBP, we tried running in a variety of different scenarios, combining different initializations (one shared feature, 5 random features, 10 random features, initialization with the BP-means result) and different starting parameter values2 (mass parameter values ranging logarithmically from 0.01 to 1 and mean-noise parameter values ranging logarithmically from 0.1 to 10). The final 100 K draws for each of these combinations are aggregated and summarized in a histogram on the right of Figure 8.1. Feature counts lower than 7 were not obtained in our experiments, which suggests these values are, at least, difficult to obtain using the IBP with the given hyperpriors. On the other hand, the feature counts for the new K-means-style algorithms suggest parsimony is more easily achieved in this case.

The lower picture and text rows of Figure 8.2 show the features (after the base feature) found by stepwise K-features: as desired, there is one feature per tabletop object. The upper text row of Figure 8.2 shows the features to which each of the example images in the top row are assigned by the optimal feature allocation. For comparison, the collapsed algorithm also finds an optimal feature encoding. The BP-means algorithm adds an extra, superfluous feature containing both the Klein bottle and $20 bill.

Faces data. Next, we analyze the FEI face database, consisting of 400 pre-aligned images of faces (Thomaz and Giraldi, 2010). 200 different individuals are pictured, each with one smiling and one neutral expression. Each picture has height 300 pixels, width 250 pixels, and one grayscale channel. Four example pictures appear in the first row of Figure 8.3. This

2We found convergence failed for some parameter initializations outside this range.

[Figure 8.3 appears here. The righthand text rows of the figure read: feature assignments 100, 110, 101, 111 for the 3-feature model; cluster assignments 1, 2, 3, 2 for the 3-cluster model; and cluster assignments 1, 2, 3, 2 for the 4-cluster model.]

Figure 8.3: 1st row: Four sample faces. 2nd row: The base feature (left) and other 2 features returned by stepwise K-features with λ^2 = 5. The final pictures are the cluster means from K-means with K = 3 (3rd row) and K = 4 (4th row). The righthand text shows how the sample pictures are assigned to features/clusters by each algorithm.

time, we compare the stepwise K-features algorithm to classic K-means. We keep the top 100 principal components to form the data vectors for both algorithms. Given λ^2 = 5, stepwise K-features chooses one base feature (lefthand picture in the second row of Figure 8.3) plus two additional features as optimal; the central and righthand pictures in the second row of Figure 8.3 depict the sum of the base feature plus the corresponding feature. The second feature codes for longer hair and a shorter chin relative to the base feature. The third feature codes for darker skin and slightly different facial features. The feature combinations of each picture in the first row appear in the first text row on the right; all four possible combinations are represented.

K-means with 2 clusters and K-features with 2 features both encode exactly 2 distinct, disjoint groups. For larger numbers of groups, though, the two representations diverge. For instance, consider a 3-cluster model of the face data, which has the same number of parameters as the 3-feature model. The resulting cluster means appear in the third row of Figure 8.3. While the cluster means appear similar to the feature means, the assignment of faces to clusters is quite different. The second righthand text row in Figure 8.3 shows to which cluster each of the four first-row faces is assigned. The feature allocation of the fourth picture in the top row tells us that the subject has long hair and certain facial features, roughly, whereas the clustering tells us that the subject's hair is more dominant than facial structure in determining grouping. Globally, the counts of faces for clusters (1,2,3) are (154,151,95) while the counts of faces for feature combinations (100,110,101,111) are (139,106,80,75).

We might also consider a clustering of size 4 since there are 4 groups specified by the 3-feature model. The resulting cluster means are in the bottom row of Figure 8.3, and the cluster assignments of the sample pictures are in the bottom, righthand text row. None of the sample pictures falls in cluster 4. Again, the groupings provided by the feature allocation and the clustering are quite different. Notably, the clustering has divided up the pictures with shorter hair into 3 separate clusters. In this case, the counts of faces for clusters (1,2,3,4) are (121,150,74,55). The feature allocation here seems to provide a sparser and more interpretable representation relative to both cluster cardinalities.

8.6 Discussion

We have developed a general methodology for obtaining hard-assignment objective functions from Bayesian MAP problems. The key idea is to include the structural variables explicitly in the posterior using combinatorial functions such as the EPPF and the EFPF. We apply this methodology to a number of generative models for unsupervised learning, with particular emphasis on latent feature models. We show that the resulting algorithms are capable of modeling latent structure out of reach of clustering algorithms but are also much faster than existing feature allocation learners from Bayesian nonparametrics.

8.A DP-means objective derivation

First consider the generative model in Section 8.2. The joint distribution of the observed data x, cluster indicators z, and cluster means µ can be written as follows.

P(x, z, µ) = P(x \mid z, µ) P(z) P(µ)
= [ \prod_{k=1}^{K^+} \prod_{n: z_{n,k} = 1} N(x_n \mid µ_k, σ^2 I_D) ]
  \cdot [ θ^{K^+ - 1} \frac{Γ(θ + 1)}{Γ(θ + N)} \prod_{k=1}^{K^+} (S_{N,k} - 1)! ]
  \cdot \prod_{k=1}^{K^+} N(µ_k \mid 0, ρ^2 I_D).

Then set θ := \exp(-λ^2/(2σ^2)) and consider the limit σ^2 → 0. In the following, f(σ^2) = O(g(σ^2)) denotes that there exist some constants c, s^2 > 0 such that |f(σ^2)| ≤ c |g(σ^2)| for all σ^2 < s^2.

-\log P(x, z, µ) = O(\log σ^2) + \sum_{k=1}^{K^+} \sum_{n: z_{n,k} = 1} \frac{1}{2σ^2} \| x_n - µ_k \|^2 + (K^+ - 1) \frac{λ^2}{2σ^2} + O(1).

It follows that

-2σ^2 \log P(x, z, µ) = \sum_{k=1}^{K^+} \sum_{n: z_{n,k} = 1} \| x_n - µ_k \|^2 + (K^+ - 1) λ^2 + O(σ^2 \log(σ^2)).

But since σ^2 \log(σ^2) → 0 as σ^2 → 0, we have that the remainder of the righthand side is asymptotically equivalent (as σ^2 → 0) to the lefthand side (Eq. (8.2)).

8.B BP-means objective derivation

The recipe is the same as in Appendix 8.A. This time we start with the generative model in Section 8.3. The joint distribution of the observed data X, feature indicators Z, and feature means A can be written as follows.

P(X, Z, A) = P(X \mid Z, A) P(Z) P(A)
= \frac{1}{(2πσ^2)^{ND/2}} \exp\left( -\frac{1}{2σ^2} \operatorname{tr}((X - ZA)'(X - ZA)) \right)
  \cdot \frac{γ^{K^+} \exp\left( -γ \sum_{n=1}^N n^{-1} \right)}{\prod_{h=1}^H \tilde{K}_h!} \prod_{k=1}^{K^+} \frac{(S_{N,k} - 1)! (N - S_{N,k})!}{N!}
  \cdot \frac{1}{(2πρ^2)^{K^+ D/2}} \exp\left( -\frac{1}{2ρ^2} \operatorname{tr}(A'A) \right).

Now set γ := \exp(-λ^2/(2σ^2)) and consider the limit σ^2 → 0. Then

-\log P(X, Z, A) = O(\log σ^2) + \frac{1}{2σ^2} \operatorname{tr}((X - ZA)'(X - ZA)) + K^+ \frac{λ^2}{2σ^2} + \exp(-λ^2/(2σ^2)) \sum_{n=1}^N n^{-1} + O(1).

It follows that

$$
-2\sigma^2 \log P(X, Z, A)
= \operatorname{tr}[(X - ZA)'(X - ZA)] + K^+ \lambda^2
+ O\bigl( \sigma^2 \exp(-\lambda^2/(2\sigma^2)) \bigr) + O(\sigma^2 \log(\sigma^2)).
$$

But since exp(−λ²/(2σ²)) → 0 and σ² log(σ²) → 0 as σ² → 0, we have that −2σ² log P(X, Z, A) ∼ tr[(X − ZA)'(X − ZA)] + K⁺λ².
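For comparison with the clustering case, a minimal sketch of the limiting BP-means objective follows (Python/NumPy; the toy Z, A, and the function name are illustrative assumptions): it is the squared reconstruction error of X by the binary feature matrix Z and feature means A, plus λ² per feature.

```python
import numpy as np

def bp_means_objective(X, Z, A, lam2):
    """Limiting BP-means objective: tr[(X - ZA)'(X - ZA)] + K_plus * lam2."""
    resid = X - Z @ A
    return np.trace(resid.T @ resid) + Z.shape[1] * lam2

# Toy data: four observations in R^3 generated from two latent features.
rng = np.random.default_rng(0)
A_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 2.0, 0.0]])
Z = np.array([[1, 0], [0, 1], [1, 1], [0, 0]], dtype=float)
X = Z @ A_true + 0.01 * rng.standard_normal((4, 3))
print(bp_means_objective(X, Z, A_true, lam2=1.0))
```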

8.C Collapsed DP-means objective derivation

We apply the usual recipe as in Appendix 8.A. The generative model for collapsed DP-means is described in Section 8.4. The joint distribution of the observed data X and cluster indicators Z can be written as follows:

$$
\begin{aligned}
P(X, Z) &= P(X \mid Z)\, P(Z) \\
&= \left[ (2\pi)^{ND/2} (\sigma^2)^{(N - K^+) D/2} (\rho^2)^{K^+ D/2} \left| Z'Z + \frac{\sigma^2}{\rho^2} I_{K^+} \right|^{D/2} \right]^{-1}
\exp\left\{ -\frac{1}{2\sigma^2} \operatorname{tr}\left( X' \Bigl( I_N - Z \bigl( Z'Z + \tfrac{\sigma^2}{\rho^2} I_{K^+} \bigr)^{-1} Z' \Bigr) X \right) \right\} \\
&\quad \cdot \frac{\Gamma(\theta + 1)}{\Gamma(\theta + N)}\, \theta^{K^+ - 1} \prod_{k=1}^{K^+} (S_{N,k} - 1)!.
\end{aligned}
$$

Now set θ := exp(−λ²/(2σ²)) and consider the limit σ² → 0. Then

$$
-\log P(X, Z)
= O(\log(\sigma^2))
+ \frac{1}{2\sigma^2} \operatorname{tr}\left( X' \Bigl( I_N - Z \bigl( Z'Z + \tfrac{\sigma^2}{\rho^2} I_{K^+} \bigr)^{-1} Z' \Bigr) X \right)
+ (K^+ - 1) \frac{\lambda^2}{2\sigma^2}
+ O(1).
$$

It follows that

$$
-2\sigma^2 \log P(X, Z)
= \operatorname{tr}\left[ X' \Bigl( I_N - Z \bigl( Z'Z + \tfrac{\sigma^2}{\rho^2} I_{K^+} \bigr)^{-1} Z' \Bigr) X \right]
+ (K^+ - 1)\lambda^2
+ O(\sigma^2 \log(\sigma^2)).
$$

We note that σ² log(σ²) → 0 as σ² → 0. Further note that Z'Z is a diagonal K⁺ × K⁺ matrix with (k, k) entry (call it S_{N,k}) equal to the number of indices in cluster k. Z'Z is invertible since we assume no empty clusters are represented in Z. Then

$$
-2\sigma^2 \log P(X, Z)
\sim \operatorname{tr}\left[ X' \bigl( I_N - Z (Z'Z)^{-1} Z' \bigr) X \right] + (K^+ - 1)\lambda^2
\qquad \text{as } \sigma^2 \to 0.
$$

More interpretable objective

The objective for the collapsed Dirichlet process is more interpretable after some algebraic manipulation. We describe here how the opaque tr(X'(I_N − Z(Z'Z)^{-1}Z')X) term can be written in a form more reminiscent of the $\sum_{k=1}^{K^+} \sum_{n : z_{n,k} = 1} \| x_n - \mu_k \|^2$ term in the uncollapsed objective. First, recall that C := Z'Z is a K⁺ × K⁺ matrix with C_{k,k} = S_{N,k} and C_{j,k} = 0 for j ≠ k. Then Z(Z'Z)^{-1}Z' is an N × N matrix whose (n, m) entry equals S_{N,k}^{-1} if and only if z_{n,k} = z_{m,k} = 1 and equals 0 if z_{n,k} ≠ z_{m,k}.

$$
\begin{aligned}
\operatorname{tr}\bigl( X' ( I_N - Z(Z'Z)^{-1}Z' ) X \bigr)
&= \operatorname{tr}(X'X) - \operatorname{tr}\bigl( X' Z (Z'Z)^{-1} Z' X \bigr) \\
&= \operatorname{tr}(XX') - \sum_{d=1}^{D} \sum_{k=1}^{K^+} S_{N,k}^{-1} \sum_{n : z_{n,k}=1} \sum_{m : z_{m,k}=1} X_{n,d} X_{m,d} \\
&= \sum_{k=1}^{K^+} \left[ \sum_{n : z_{n,k}=1} x_n x_n' - 2 S_{N,k}^{-1} \sum_{n : z_{n,k}=1} \sum_{m : z_{m,k}=1} x_n x_m' + S_{N,k}^{-1} \sum_{n : z_{n,k}=1} \sum_{m : z_{m,k}=1} x_n x_m' \right] \\
&= \sum_{k=1}^{K^+} \sum_{n : z_{n,k}=1} \Bigl\| x_n - S_{N,k}^{-1} \sum_{m : z_{m,k}=1} x_m \Bigr\|^2 \\
&= \sum_{k=1}^{K^+} \sum_{n : z_{n,k}=1} \bigl\| x_n - \bar{x}^{(k)} \bigr\|^2,
\end{aligned}
$$

for the cluster-specific empirical mean defined as $\bar{x}^{(k)} := S_{N,k}^{-1} \sum_{m : z_{m,k}=1} x_m$, as in the main text.
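The chain of equalities above can be checked numerically; the short sketch below (Python/NumPy; the toy labels and data are illustrative) confirms that tr(X'(I_N − Z(Z'Z)^{-1}Z')X) agrees with the within-cluster sum of squared deviations from the empirical cluster means.

```python
import numpy as np

N, D, K = 8, 3, 2
labels = np.array([0, 0, 1, 0, 1, 1, 0, 1])    # cluster label per data point
Z = np.eye(K)[labels]                          # N x K binary indicator matrix
X = np.random.default_rng(0).standard_normal((N, D))

# Collapsed-objective trace term.
proj = Z @ np.linalg.inv(Z.T @ Z) @ Z.T
lhs = np.trace(X.T @ (np.eye(N) - proj) @ X)

# Within-cluster sum of squared deviations from the cluster empirical means.
rhs = sum(np.sum((X[labels == k] - X[labels == k].mean(axis=0)) ** 2)
          for k in range(K))

print(np.isclose(lhs, rhs))                    # expected: True
```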

8.D Collapsed BP-means objective derivation

We continue to apply the usual recipe as in Appendix 8.A. The generative model for collapsed BP-means is described in Section 8.4. The joint distribution of the observed data X and feature indicators Z can be written as follows:

$$
\begin{aligned}
P(X, Z) &= P(X \mid Z)\, P(Z) \\
&= \left[ (2\pi)^{ND/2} (\sigma^2)^{(N - K^+) D/2} (\rho^2)^{K^+ D/2} \left| Z'Z + \frac{\sigma^2}{\rho^2} I_{K^+} \right|^{D/2} \right]^{-1}
\exp\left\{ -\frac{1}{2\sigma^2} \operatorname{tr}\left( X' \Bigl( I_N - Z \bigl( Z'Z + \tfrac{\sigma^2}{\rho^2} I_{K^+} \bigr)^{-1} Z' \Bigr) X \right) \right\} \\
&\quad \cdot \frac{\gamma^{K^+} \exp\left\{ -\gamma \sum_{n=1}^{N} n^{-1} \right\}}{\prod_{h=1}^{H} \tilde{K}_h!}
\prod_{k=1}^{K^+} \frac{(S_{N,k} - 1)!\,(N - S_{N,k})!}{N!}.
\end{aligned}
$$

Now set γ := exp(−λ²/(2σ²)) and consider the limit σ² → 0. Then

$$
-\log P(X, Z)
= O(\log(\sigma^2))
+ \frac{1}{2\sigma^2} \operatorname{tr}\left( X' \Bigl( I_N - Z \bigl( Z'Z + \tfrac{\sigma^2}{\rho^2} I_{K^+} \bigr)^{-1} Z' \Bigr) X \right)
+ K^+ \frac{\lambda^2}{2\sigma^2}
+ \exp(-\lambda^2/(2\sigma^2)) \sum_{n=1}^{N} n^{-1}
+ O(1).
$$

It follows that

$$
-2\sigma^2 \log P(X, Z)
= \operatorname{tr}\left[ X' \Bigl( I_N - Z \bigl( Z'Z + \tfrac{\sigma^2}{\rho^2} I_{K^+} \bigr)^{-1} Z' \Bigr) X \right]
+ K^+ \lambda^2
+ O\bigl( \sigma^2 \exp(-\lambda^2/(2\sigma^2)) \bigr) + O(\sigma^2 \log(\sigma^2)).
$$

But exp(−λ²/(2σ²)) → 0 and σ² log(σ²) → 0 as σ² → 0. And Z'Z is invertible so long as two features do not have identical membership (in which case we collect them into a single feature). So we have that −2σ² log P(X, Z) ∼ tr(X'(I_N − Z(Z'Z)^{-1}Z')X) + K⁺λ².
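As a quick consistency check on this collapsed form, the following sketch (Python/NumPy; the toy Z and X are illustrative) verifies that substituting the minimizing A = (Z'Z)^{-1}Z'X into the uncollapsed BP-means objective reproduces the trace term plus the K⁺λ² penalty.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, lam2 = 6, 4, 2.0
Z = np.array([[1, 0], [0, 1], [1, 1], [1, 0], [0, 1], [1, 1]], dtype=float)
X = rng.standard_normal((N, D))

A_hat = np.linalg.solve(Z.T @ Z, Z.T @ X)              # minimizer of the trace term
uncollapsed = np.trace((X - Z @ A_hat).T @ (X - Z @ A_hat)) + Z.shape[1] * lam2
collapsed = (np.trace(X.T @ (np.eye(N) - Z @ np.linalg.inv(Z.T @ Z) @ Z.T) @ X)
             + Z.shape[1] * lam2)
print(np.isclose(uncollapsed, collapsed))              # expected: True
```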

8.E Parametric objectives

First, consider a clustering prior with some fixed maximum number of clusters K. Let q_{1:K} represent a distribution over clusters. Suppose q_{1:K} is drawn from a finite Dirichlet distribution with size K > 1 and parameter θ > 0. Further, suppose the cluster for each data point is drawn iid according to q_{1:K}. Then, integrating out q, the marginal distribution of the clustering is Dirichlet-multinomial:

$$
P(z) = \frac{\Gamma(K\theta)}{\Gamma(N + K\theta)} \prod_{k=1}^{K} \frac{\Gamma(S_{N,k} + \theta)}{\Gamma(\theta)}. \tag{8.11}
$$

We again assume a Gaussian mixture likelihood, only now the number of cluster means µ_k has an upper bound of K. We can find the MAP estimate of z and µ under this model in the limit σ² → 0. With θ fixed, the clustering prior has no effect, and the resulting optimization problem is

$$
\operatorname*{argmin}_{z, \mu} \sum_{k=1}^{K} \sum_{n : z_{n,k} = 1} \| x_n - \mu_k \|^2,
$$

which is just the usual K-means optimization problem. We can also try scaling θ = exp(−λ²/(2σ²)) for some constant λ² > 0 as in the unbounded cardinality case. Then taking the σ² → 0 limit of the log joint likelihood yields a term of λ² for each cluster containing at least one data index in the product in Eq. (8.11)—except for one such cluster. Call the number of such activated clusters K⁺. The resulting optimization problem is

$$
\operatorname*{argmin}_{K^+, z, \mu} \sum_{k=1}^{K} \sum_{n : z_{n,k} = 1} \| x_n - \mu_k \|^2 + (K \wedge K^+ - 1)\lambda^2. \tag{8.12}
$$

This objective caps the number of clusters at K but contains a penalty for each new cluster up to K.

A similar story holds in the feature case. Imagine that we have a fixed maximum of K features. In this finite case, we now let q_{1:K} represent frequencies of each feature and let q_k ∼ Beta(γ, 1) iid across k. We draw z_{n,k} ∼ Bern(q_k) iid across n and independently across k. The linear Gaussian likelihood model is as in Eq. (8.5) except that now the number of features is bounded. If we integrate out the q_{1:K}, the resulting marginal prior on Z is

$$
\prod_{k=1}^{K} \frac{\Gamma(S_{N,k} + \gamma)\, \Gamma(N - S_{N,k} + 1)}{\Gamma(N + \gamma + 1)} \cdot \frac{\Gamma(\gamma + 1)}{\Gamma(\gamma)\, \Gamma(1)}. \tag{8.13}
$$

Then the limiting MAP problem as σ² → 0 is

$$
\operatorname*{argmin}_{Z, A} \operatorname{tr}[(X - ZA)'(X - ZA)]. \tag{8.14}
$$

This objective is analogous to the K-means objective but holds for the more general problem of feature allocations. Eq. (8.14) can be solved according to the K-features algorithm (Alg. 8.4). Notably, all of the optimizations over n in the first step of the algorithm may be performed in parallel. We can further set γ = exp(−λ²/(2σ²)) as for the unbounded cardinality case before taking the limit σ² → 0. Then a λ² term contributes to the limiting objective for each non-empty feature from the product in Eq. (8.13):

$$
\operatorname*{argmin}_{K^+, Z, A} \operatorname{tr}[(X - ZA)'(X - ZA)] + (K \wedge K^+)\lambda^2, \tag{8.15}
$$

reminiscent of the BP-means objective but with a cap of K possible features.
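A minimal sketch of the alternating minimization suggested for Eq. (8.14) appears below (Python/NumPy; the update schedule, the exhaustive search over the 2^K binary codes, and the function name are illustrative assumptions rather than a transcription of Alg. 8.4). Each data point's binary feature vector is chosen by enumeration—these N searches are independent and could run in parallel—and A is then set to the least-squares solution.

```python
import numpy as np
from itertools import product

def k_features(X, K, n_iter=20, seed=0):
    """Alternating minimization of tr[(X - ZA)'(X - ZA)] over binary Z and real A."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    A = rng.standard_normal((K, D))
    codes = np.array(list(product([0.0, 1.0], repeat=K)))     # all 2^K binary rows
    Z = np.zeros((N, K))
    for _ in range(n_iter):
        # Step 1: best binary code for each data point (parallelizable over n).
        errs = ((X[:, None, :] - codes @ A) ** 2).sum(axis=2)  # N x 2^K
        Z = codes[errs.argmin(axis=1)]
        # Step 2: least-squares update of the feature means given Z.
        A = np.linalg.lstsq(Z, X, rcond=None)[0]
    return Z, A

Z, A = k_features(np.random.default_rng(3).standard_normal((20, 5)), K=3)
print(Z.shape, A.shape)    # (20, 3) (3, 5)
```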

8.F General multivariate Gaussian likelihood

Above, we assumed a multivariate spherical Gaussian likelihood for each cluster. This assumption can be generalized in a number of ways. For instance, assume a general covariance matrix σ²Σ_k for positive scalar σ² and positive definite D × D matrix Σ_k. Then we assume the following likelihood model for data points assigned to the kth cluster (z_{n,k} = 1): x_n ∼ N(µ_k, σ²Σ_k). Moreover, assume an inverse Wishart prior on the positive definite matrix Σ_k: Σ_k ∼ W^{-1}(Φ, ν) for Φ a positive definite matrix and ν > D − 1. Assume a prior P(µ) on µ that puts strictly positive density on all real-valued D-length vectors µ. For now we assume K is fixed and that the prior P(z) puts strictly positive mass on all valid clusterings of the data points. Then

$$
\begin{aligned}
P(x, z, \mu, \sigma^2\Sigma)
&= P(x \mid z, \mu, \sigma^2\Sigma)\, P(z)\, P(\mu)\, P(\Sigma) \\
&= \left[ \prod_{k=1}^{K^+} \prod_{n : z_{n,k} = 1} \mathcal{N}(x_n \mid \mu_k, \sigma^2\Sigma_k) \right]
\cdot P(z)\, P(\mu)
\cdot \prod_{k=1}^{K} \frac{|\Phi|^{\nu/2}}{2^{\nu D/2}\, \Gamma_D(\nu/2)}\, |\Sigma_k|^{-\frac{\nu + D + 1}{2}}
\exp\left\{ -\frac{1}{2} \operatorname{tr}(\Phi \Sigma_k^{-1}) \right\},
\end{aligned}
$$

where Γ_D is a multivariate gamma function. Consider the limit σ² → 0. Set ν = λ²/σ² for some constant λ² > 0. Then

$$
\begin{aligned}
-\log P(x, z, \mu, \sigma^2\Sigma)
= O(\log \sigma^2)
&+ \frac{1}{2\sigma^2} \sum_{k=1}^{K} \sum_{n : z_{n,k} = 1} (x_n - \mu_k)' \Sigma_k^{-1} (x_n - \mu_k)
+ O(1) \\
&+ \sum_{k=1}^{K} \left[ -\frac{\lambda^2}{2\sigma^2} \log |\Phi| + \frac{\lambda^2 D}{2\sigma^2} \log 2
+ \log \Gamma_D(\lambda^2/(2\sigma^2))
+ \left( \frac{\lambda^2}{2\sigma^2} + \frac{D + 1}{2} \right) \log |\Sigma_k|
+ O(1) \right].
\end{aligned}
$$

So we find

$$
-2\sigma^2 \left[ \log P(x, z, \mu, \sigma^2\Sigma) + K \log \Gamma_D(\lambda^2/(2\sigma^2)) \right]
\sim \sum_{k=1}^{K} \sum_{n : z_{n,k} = 1} (x_n - \mu_k)' \Sigma_k^{-1} (x_n - \mu_k)
+ \lambda^2 \sum_{k=1}^{K} \log |\Sigma_k|
+ c
+ O(\sigma^2 \log(\sigma^2)),
$$

where c is a constant in z, µ, σ², Σ. Letting σ² → 0, the righthand side becomes

$$
\sum_{k=1}^{K} \sum_{n : z_{n,k} = 1} (x_n - \mu_k)' \Sigma_k^{-1} (x_n - \mu_k) + \lambda^2 \sum_{k=1}^{K} \log |\Sigma_k| + c.
$$

It is equivalent to optimize the same quantity without c. If the Σ_k are known, they may be inputted and the objective may be optimized over the cluster means and cluster assignments. For unknown Σ_k, though, the resulting optimization problem is

$$
\min_{z, \mu, \Sigma} \sum_{k=1}^{K} \left[ \sum_{n : z_{n,k} = 1} (x_n - \mu_k)' \Sigma_k^{-1} (x_n - \mu_k) + \lambda^2 \log |\Sigma_k| \right].
$$

That is, the squared Euclidean distance in the classic K-means objective function has been replaced with a Mahalanobis distance, and we have added a penalty term on the size of the Σ_k matrices (with λ² modulating the penalty as in previous examples). This objective is reminiscent of that proposed by Sung and Poggio (1998).
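For concreteness, here is a minimal sketch of this generalized objective (Python/NumPy; the toy data, the fixed example covariances, and the function name are illustrative assumptions): the squared Euclidean distance is replaced by a Mahalanobis distance, and each cluster pays a λ² log-determinant penalty on its covariance.

```python
import numpy as np

def mahalanobis_objective(X, labels, mus, Sigmas, lam2):
    """Sum over clusters of Mahalanobis distances to the cluster mean
    plus a lam2 * log|Sigma_k| penalty for each cluster covariance."""
    total = 0.0
    for k, (mu, Sigma) in enumerate(zip(mus, Sigmas)):
        diffs = X[labels == k] - mu                     # points assigned to cluster k
        total += np.sum((diffs @ np.linalg.inv(Sigma)) * diffs)
        total += lam2 * np.log(np.linalg.det(Sigma))
    return total

X = np.array([[0.0, 0.0], [0.2, -0.1], [3.0, 3.0], [3.1, 2.8]])
labels = np.array([0, 0, 1, 1])
mus = [X[labels == k].mean(axis=0) for k in range(2)]
Sigmas = [np.eye(2), np.diag([2.0, 0.5])]
print(mahalanobis_objective(X, labels, mus, Sigmas, lam2=1.0))
```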

8.G Proof of BP-means local convergence

The proof of Proposition 8.3.1 is as follows.

Proof. By construction, the first step in any iteration does not increase the objective. The second step starts by deleting any features that have the same index collection as an existing feature. Suppose there are m such features with indices J and we keep feature k. By setting $A_{k,\cdot} \leftarrow \sum_{j \in J} A_{j,\cdot}$, the objective is unchanged.

Next, let $\|Y\|_F = \sqrt{\operatorname{tr}(Y'Y)}$ denote the Frobenius norm of a matrix Y. Then $\|Y\|_F^2$ is a convex function. We check that f(A) = tr[(X − ZA)'(X − ZA)] is convex. Take λ ∈ [0, 1], and let A and B be K × D matrices; then

$$
\begin{aligned}
f(\lambda A + (1 - \lambda) B)
&= \| Z[\lambda A + (1 - \lambda) B] - X \|_F^2 \\
&= \| \lambda (ZA - X) + (1 - \lambda)(ZB - X) \|_F^2 \\
&\le \lambda \| ZA - X \|_F^2 + (1 - \lambda) \| ZB - X \|_F^2 \\
&= \lambda f(A) + (1 - \lambda) f(B).
\end{aligned}
$$

We conclude that f(A) is convex. With this result in hand, note

$$
\nabla_A \operatorname{tr}[(X - ZA)'(X - ZA)] = -2 Z'(X - ZA). \tag{8.16}
$$

Setting the gradient to zero, we find that A = (Z'Z)^{-1}Z'X solves the equation for A and therefore minimizes the objective with respect to A when Z'Z is invertible, as we have already guaranteed. Finally, since there are only finitely many feature allocations in which each data point has at most one feature unique to only that data point and no features containing identical indices (any extra such features would only increase the objective due to the penalty), the algorithm cannot visit more than this many configurations and must finish in a finite number of iterations.
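The closed-form update used in the proof can be checked directly. The short sketch below (Python/NumPy; the toy Z and X are illustrative) confirms that A = (Z'Z)^{-1}Z'X zeroes the gradient in Eq. (8.16) and agrees with a generic least-squares solver.

```python
import numpy as np

Z = np.array([[1, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
X = np.random.default_rng(2).standard_normal((4, 3))

A = np.linalg.solve(Z.T @ Z, Z.T @ X)        # closed-form minimizer (Z'Z)^{-1} Z'X
grad = -2 * Z.T @ (X - Z @ A)                # gradient from Eq. (8.16)
print(np.allclose(grad, 0.0))                                  # expected: True
print(np.allclose(A, np.linalg.lstsq(Z, X, rcond=None)[0]))    # expected: True
```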

Bibliography

Adams, R. P., Z. Ghahramani, and M. I. Jordan (2010). “Tree-structured stick breaking for hierarchical data”. In: Advances in Neural Information Processing Systems 23, pp. 19–27.
Akaike, H. (1974). “A new look at the statistical model identification”. In: IEEE Transactions on Automatic Control 19.6, pp. 716–723.
Aldous, D. (1985). “Exchangeability and related topics”. In: École d'Été de Probabilités de Saint-Flour 1983, pp. 1–198.
Antoniak, C. E. (1974). “Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems”. In: The Annals of Statistics, pp. 1152–1174.
Arthur, D. and S. Vassilvitskii (2007). “k-means++: The advantages of careful seeding”. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, pp. 1027–1035.
Asuncion, A., M. Welling, P. Smyth, and Y. Teh (2009). “On smoothing and inference for topic models”. In: Uncertainty in Artificial Intelligence.
Berkhin, P. (2006). “A survey of clustering techniques”. In: Grouping Multidimensional Data, pp. 25–71.
Bertoin, J. (1998). Lévy Processes. Vol. 121. Cambridge University Press.
— (2000). Subordinators, Lévy processes with no negative jumps, and branching processes. Centre for Mathematical Physics and Stochastics, University of Aarhus.
— (2004). “Subordinators: examples and applications”. In: Lectures on Probability Theory and Statistics, pp. 1–91.
Blackwell, D. and J. B. MacQueen (1973). “Ferguson distributions via Pólya urn schemes”. In: The Annals of Statistics 1.2, pp. 353–355.
Blei, D. M. and P. Frazier (2010). “Distance dependent Chinese restaurant processes”. In: International Conference on Machine Learning.
Blei, D. M., T. L. Griffiths, and M. I. Jordan (2010). “The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies”. In: Journal of the ACM 57.2, p. 7.
Blei, D. M. and M. I. Jordan (2006). “Variational inference for Dirichlet process mixtures”. In: Bayesian Analysis 1.1, pp. 121–144.
Blei, D. M., A. Y. Ng, and M. I. Jordan (2003). “Latent Dirichlet allocation”. In: The Journal of Machine Learning Research 3, pp. 993–1022.

Bochner, S. (1955). Harmonic Analysis and the Theory of Probability. University of California Press.
Broderick, T., N. Boyd, A. Wibisono, A. C. Wilson, and M. I. Jordan (2013). “Streaming variational Bayes”. In: Neural Information Processing Systems.
Broderick, T., M. I. Jordan, and J. Pitman (2012). “Beta processes, stick-breaking, and power laws”. In: Bayesian Analysis 7.2, pp. 439–476.
— (2013). “Cluster and feature modeling from combinatorial stochastic processes”. In: Statistical Science 28.3, pp. 289–312.
Broderick, T., B. Kulis, and M. I. Jordan (2013). “MAD-Bayes: MAP-based asymptotic derivations from Bayes”. In: International Conference on Machine Learning.
Broderick, T., L. Mackey, J. W. Paisley, and M. I. Jordan (2014). “Combinatorial clustering and the beta negative binomial process”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence.
Broderick, T., J. Pitman, and M. I. Jordan (2013). “Feature allocations, probability functions, and paintboxes”. In: Bayesian Analysis 8.4, pp. 801–836.
Broderick, T., A. C. Wilson, and M. I. Jordan (2014). “Posteriors, conjugacy, and exponential families for completely random measures”. In: Submitted.
Buntine, W. L. and A. Jakulin (2004). “Applying Discrete PCA in Data Analysis”. In: Uncertainty in Artificial Intelligence.
Canini, K. R., L. Shi, and T. L. Griffiths (2009). “Online inference of topics with latent Dirichlet allocation”. In: Artificial Intelligence and Statistics. Vol. 5.
Cao, L. and F. Li (2007). “Spatially coherent latent topic model for concurrent segmentation and classification of objects and scenes”. In: IEEE International Conference on Computer Vision, pp. 1–8.
Chen, H., H. Xing, and N. R. Zhang (2011). “Stochastic segmentation models for allele-specific copy number estimation with SNP-array data”. In: PLoS Computational Biology 7, e1001060.
Chen, S. X. and J. S. Liu (1997). “Statistical applications of the Poisson-binomial and conditional Bernoulli distributions”. In: Statistica Sinica 7, pp. 875–892.
Chernoff, H. (1952). “A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations”. In: The Annals of Mathematical Statistics, pp. 493–507.
Dalal, N. and B. Triggs (2005). “Histograms of oriented gradients for human detection”. In: Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, pp. 886–893.
Damien, P., J. Wakefield, and S. G. Walker (1999). “Gibbs sampling for Bayesian non-conjugate and hierarchical models by using auxiliary variables”. In: Journal of the Royal Statistical Society: Series B 61.2, pp. 331–344.
De Finetti, B. (1931). “Funzione caratteristica di un fenomeno aleatorio”. In: Atti della R. Academia Nazionale dei Lincei, Serie 6. 4, pp. 251–299.
DeGroot, M. H. (1970). Optimal Statistical Decisions. John Wiley & Sons, Inc.
Doshi, F., K. T. Miller, J. Van Gael, and Y. W. Teh (2009). “Variational inference for the Indian buffet process”. In: AISTATS.

Dunson, D. B. and J. H. Park (2008). “Kernel stick-breaking processes”. In: Biometrika 95.2, pp. 307–323.
Erosheva, E. A. and S. E. Fienberg (2005). “Bayesian mixed membership models for soft clustering and classification”. In: Classification–The Ubiquitous Challenge. New York: Springer, pp. 11–26.
Escobar, M. D. (1994). “Estimating normal means with a Dirichlet process prior”. In: Journal of the American Statistical Association, pp. 268–277.
Escobar, M. D. and M. West (1995). “Bayesian density estimation and inference using mixtures”. In: Journal of the American Statistical Association, pp. 577–588.
Feller, W. (1966). An Introduction to Probability Theory and Its Applications, Vol. II. New York: John Wiley.
Ferguson, T. S. (1973). “A Bayesian analysis of some nonparametric problems”. In: The Annals of Statistics 1.2, pp. 209–230. issn: 0090-5364.
Fox, E. B., E. B. Sudderth, M. I. Jordan, and A. S. Willsky (2009). “Sharing features among dynamical systems with beta processes”. In: Advances in Neural Information Processing Systems 22, pp. 549–557.
Fraley, C. and A. E. Raftery (2002). “Model-based clustering, discriminant analysis and density estimation”. In: Journal of the American Statistical Association 97, pp. 611–631.
Franceschetti, M., O. Dousse, D. N. C. Tse, and P. Thiran (2007). “Closing the gap in the capacity of wireless networks via percolation theory”. In: Information Theory, IEEE Transactions on 53.3, pp. 1009–1018.
Freedman, D. (1973). “Another note on the Borel-Cantelli lemma and the strong law, with the Poisson approximation as a by-product”. In: The Annals of Probability 1.6, pp. 910–925.
Freedman, D. A. (1965). “Bernard Friedman's urn”. In: The Annals of Mathematical Statistics 36.3, pp. 956–970.
Geman, S. and D. Geman (1984). “Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 6, pp. 721–741.
Gnedin, A., B. Hansen, and J. Pitman (2007). “Notes on the occupancy problem with infinitely many boxes: General asymptotics and power laws”. In: Probability Surveys 4, pp. 146–171.
Gnedin, A. and J. Pitman (2006). “Exchangeable Gibbs partitions and Stirling triangles”. In: Journal of Mathematical Sciences 138.3, pp. 5674–5685.
Goldwater, S., T. L. Griffiths, and M. Johnson (2006). “Interpolating between types and tokens by estimating power-law generators”. In: Advances in Neural Information Processing Systems 18. Cambridge, MA: MIT Press.
Gordon, A. D. and J. T. Henderson (1977). “An algorithm for Euclidean sum of squares classification”. In: Biometrics, pp. 355–362.
Griffiths, T. L. and Z. Ghahramani (2006). “Infinite latent feature models and the Indian buffet process”. In: Advances in Neural Information Processing Systems 18. Ed. by Y. Weiss, B. Schölkopf, and J. Platt. Cambridge, MA: MIT Press, pp. 475–482.

Griffiths, T. L. and Z. Ghahramani (2011). “The Indian buffet process: An introduction and review”. In: Journal of Machine Learning Research 12, pp. 1185–1224.
Griffiths, T. L. and M. Steyvers (2004). “Finding scientific topics”. In: Proceedings of the National Academy of Sciences 101.Suppl. 1, pp. 5228–5235.
Hagerup, T. and C. Rub (1990). “A guided tour of Chernoff bounds”. In: Information Processing Letters 33.6, pp. 305–308.
Hansen, B. and J. Pitman (May 1998). Prediction rules for exchangeable sequences related to species sampling. Tech. rep. 520. University of California, Berkeley.
Heaps, H. S. (1978). Information Retrieval: Computational and Theoretical Aspects. Orlando, FL: Academic Press. isbn: 0123357500.
Hewitt, E. and L. J. Savage (1955). “Symmetric measures on Cartesian products”. In: Transactions of the American Mathematical Society 80.2, pp. 470–501.
Hjort, N. L. (1990). “Nonparametric Bayes estimators based on beta processes in models for life history data”. In: Annals of Statistics 18.3, pp. 1259–1294.
Hoffman, M. (2010). Online inference for LDA (Python code) at http://www.cs.princeton.edu/~blei/downloads/onlineldavb.tar.
Hoffman, M., D. M. Blei, and F. Bach (2010). “Online learning for latent Dirichlet allocation”. In: Neural Information Processing Systems. Vol. 23, pp. 856–864.
Hoffman, M., D. M. Blei, J. W. Paisley, and C. Wang (2013). “Stochastic variational inference”. In: Journal of Machine Learning Research 14, pp. 1303–1347.
Honkela, A. and H. Valpola (2003). “On-line variational Bayesian learning”. In: International Symposium on Independent Component Analysis and Blind Signal Separation, pp. 803–808.
Hoppe, F. M. (1984). “Pólya-like urns and the Ewens' sampling formula”. In: Journal of Mathematical Biology 20.1, pp. 91–94.
Ishwaran, H. and L. F. James (2001). “Gibbs sampling methods for stick-breaking priors”. In: Journal of the American Statistical Association 96.453, pp. 161–173.
Ishwaran, H. and M. Zarepour (2000). “Markov chain Monte Carlo in approximate Dirichlet and beta two-parameter process hierarchical models”. In: Biometrika 87.2, pp. 371–390.
Jain, A. K. (2010). “Data clustering: 50 years beyond K-means”. In: Pattern Recognition Letters 31.8, pp. 651–666.
Johnson, O., I. Kontoyiannis, and M. Madiman (2011). “Log-concavity, ultra-log-concavity, and a maximum entropy property of discrete compound Poisson measures”. In: Discrete Applied Mathematics.
Jordan, M. I., Z. Ghahramani, T. S. Jaakkola, and L. K. Saul (1999). “An introduction to variational methods for graphical models”. In: Machine Learning 37.2, pp. 183–233.
Kallenberg, O. (2002). Foundations of Modern Probability. Springer.
Kalli, M., J. E. Griffin, and S. G. Walker (2011). “Slice sampling mixture models”. In: Statistics and Computing 21.1, pp. 93–105.
Kim, Y. (1999a). “Nonparametric Bayesian estimators for counting processes”. In: Annals of Statistics, pp. 562–588.

Kim, Y. (1999b). “Nonparametric Bayesian estimators for counting processes”. In: Annals of Statistics 27.2, pp. 562–588.
Kim, Y. and J. Lee (2001). “On posterior consistency of survival models”. In: Annals of Statistics, pp. 666–686.
Kingman, J. F. C. (1967). “Completely random measures”. In: Pacific Journal of Mathematics 21.1, pp. 59–78.
— (1978). “The representation of partition structures”. In: Journal of the London Mathematical Society 2.2, p. 374.
— (1993). Poisson Processes. Oxford University Press.
Kleiner, A., A. Talwalkar, P. Sarkar, and M. Jordan (2012). “The big data bootstrap”. In: International Conference on Machine Learning.
Korwar, R. M. and M. Hollander (1973). “Contributions to the theory of Dirichlet processes”. In: The Annals of Probability, pp. 705–711.
Kulis, B. and M. I. Jordan (2012). “Revisiting k-means: New algorithms via Bayesian nonparametrics”. In: Proceedings of the 23rd International Conference on Machine Learning.
LeCun, Y. and C. Cortes (1998). The MNIST database of handwritten digits. url: http://yann.lecun.com/exdb/mnist/.
Lee, J., F. A. Quintana, P. Müller, and L. Trippa (2008). Defining predictive probability functions for species sampling models. Tech. rep. Working Paper.
Li, W. and A. McCallum (2006). “Pachinko allocation: DAG-structured mixture models of topic correlations”. In: Proceedings of the 23rd International Conference on Machine Learning. ACM, pp. 577–584.
Lijoi, A., R. H. Mena, and I. Prünster (2007). “Controlling the reinforcement in Bayesian non-parametric mixture models”. In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69.4, pp. 715–740.
Liu, J. S. (1994). “The collapsed Gibbs sampler in Bayesian computations with applications to a gene regulation problem”. In: Journal of the American Statistical Association 89, pp. 958–966.
Lowe, D. G. (2004). “Distinctive image features from scale-invariant keypoints”. In: International Journal of Computer Vision 60.2, pp. 91–110.
Luts, J., T. Broderick, and M. P. Wand (2012). “Real-time semiparametric regression”. In: Journal of Computational and Graphical Statistics, to appear. Preprint arXiv:1209.3550.
MacEachern, S. N. (1994). “Estimating normal means with a conjugate style Dirichlet process prior”. In: Communications in Statistics-Simulation and Computation 23.3, pp. 727–741.
MacEachern, S. N. and P. Müller (1998). “Estimating mixture of Dirichlet process models”. In: Journal of Computational and Graphical Statistics 7, pp. 223–238.
MacEachern, S. N. (1999). “Dependent nonparametric processes”. In: Proceedings of the Section on Bayesian Statistical Science, pp. 50–55.
McCloskey, J. W. (1965). “A model for the distribution of individuals by species in an environment”. PhD thesis. Michigan State University.
McCullagh, P., J. Pitman, and M. Winkel (2008). “Gibbs fragmentation trees”. In: Bernoulli 14.4, pp. 988–1002.

McLachlan, G. and K. Basford (1988). Mixture Models: Inference and Applications to Clustering. New York: Dekker.
Minka, T. P. (2001a). “A family of algorithms for approximate Bayesian inference”. PhD thesis. Massachusetts Institute of Technology.
— (2001b). “Expectation propagation for approximate Bayesian inference”. In: Uncertainty in Artificial Intelligence. Morgan Kaufmann, pp. 362–369.
Minka, T. P. and J. Lafferty (2002). “Expectation-propagation for the generative aspect model”. In: Uncertainty in Artificial Intelligence. Morgan Kaufmann, pp. 352–359.
Mitzenmacher, M. (2004). “A brief history of generative models for power law and lognormal distributions”. In: Internet Mathematics 1.2, pp. 226–251.
Neal, R. M. (2000). “Markov chain sampling methods for Dirichlet process mixture models”. In: Journal of Computational and Graphical Statistics 9.2, pp. 249–265.
— (2003). “Slice sampling”. In: Annals of Statistics, pp. 705–741.
Newman, M. E. J. (2005). “Power laws, Pareto distributions and Zipf's law”. In: Contemporary Physics 46.5, pp. 323–351.
Niu, F., B. Recht, C. Ré, and S. J. Wright (2011). “Hogwild!: A lock-free approach to parallelizing stochastic gradient descent”. In: Neural Information Processing Systems.
Opper, M. (1998). “A Bayesian approach to on-line learning”. In: ed. by D. Saad, pp. 363–378.
Paisley, J. W., D. M. Blei, and M. I. Jordan (2012). “Stick-breaking beta processes and the Poisson process”. In: International Conference on Artificial Intelligence and Statistics, pp. 850–858.
Paisley, J. W., L. Carin, and D. M. Blei (2011). “Variational inference for stick-breaking beta process priors”. In: ICML, pp. 889–896.
Paisley, J. W., A. Zaas, C. W. Woods, G. S. Ginsburg, and L. Carin (2010). “A stick-breaking construction of the beta process”. In: International Conference on Machine Learning. Haifa, Israel.
Papaspiliopoulos, O. (2008). A note on posterior sampling from Dirichlet mixture models. Tech. rep. 8. University of Warwick, Centre for Research in Statistical Methodology.
Papaspiliopoulos, O. and G. O. Roberts (2008). “Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models”. In: Biometrika 95.1, pp. 169–186.
Patil, G. P. and C. Taillie (1977). “Diversity as a concept and its implications for random communities”. In: Proceedings of the 41st Session of the International Statistical Institute. New Delhi, pp. 497–515.
Peña, J. M., J. A. Lozano, and P. Larrañaga (1999). “An empirical comparison of four initialization methods for the K-Means algorithm”. In: Pattern Recognition Letters 20.10, pp. 1027–1040.
Pitman, J. (1995). “Exchangeable and partially exchangeable random partitions”. In: Probability Theory and Related Fields 102.2, pp. 145–158.
— (1996). “Some developments of the Blackwell-MacQueen urn scheme”. In: Lecture Notes-Monograph Series, pp. 245–267.

Pitman, J. (2003). “Poisson-Kingman partitions”. In: Lecture Notes-Monograph Series 40, pp. 1–34. issn: 0749-2170.
— (2006). Combinatorial Stochastic Processes. Vol. 1875. Lecture Notes in Mathematics. Berlin: Springer-Verlag.
Pitman, J. and M. Yor (1997). “The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator”. In: Annals of Probability 25, pp. 855–900.
Pólya, G. (1930). “Sur quelques points de la théorie des probabilités”. In: Annales de l'I.H.P. 1.2, pp. 117–161.
Pritchard, J. K., M. Stephens, and P. Donnelly (2000). “Inference of population structure using multilocus genotype data”. In: Genetics 155.2, pp. 945–959.
Qi, F. and L. Losonczi (2010). “Bounds for the ratio of two gamma functions”. In: Journal of Inequalities and Applications 2010, p. 204.
Ranganath, R., C. Wang, D. M. Blei, and E. P. Xing (2013). “An adaptive learning rate for stochastic variational inference”. In: International Conference on Machine Learning.
Rogers, L. C. G. and D. Williams (2000). Diffusions, Markov Processes, and Martingales. Vol. 1. Cambridge: Cambridge University Press.
Roweis, S. (2007). MNIST handwritten digits. url: http://www.cs.nyu.edu/~roweis/data.html.
Russell, B. C., W. T. Freeman, A. A. Efros, J. Sivic, and A. Zisserman (2006). “Using multiple segmentations to discover objects and their extent in image collections”. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1605–1614.
Seeger, M. (2005). Expectation propagation for exponential families. Tech. rep. University of California at Berkeley.
Sethuraman, J. (1994). “A constructive definition of Dirichlet priors”. In: Statistica Sinica 4.2, pp. 639–650.
Sivic, J., B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman (2005). Discovering object categories in image collections. Tech. rep. AIM-2005-005. MIT.
Steinley, D. (2006). “K-means clustering: A half-century synthesis”. In: British Journal of Mathematical and Statistical Psychology 59.1, pp. 1–34.
Sung, K. and T. Poggio (1998). “Example-based learning for view-based human face detection”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 20.1, pp. 39–51.
Teh, Y. W. (2006). “A hierarchical Bayesian language model based on Pitman-Yor processes”. In: Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, pp. 985–992.
Teh, Y. W. and D. Görür (2009). “Indian buffet processes with power-law behavior”. In: Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press.
Teh, Y. W., D. Görür, and Z. Ghahramani (2007). “Stick-breaking construction for the Indian buffet process”. In: Proceedings of the International Conference on Artificial Intelligence and Statistics. Vol. 11.

Teh, Y. W., M. I. Jordan, M. J. Beal, and D. M. Blei (2006). “Hierarchical Dirichlet processes”. In: Journal of the American Statistical Association 101.476, pp. 1566–1581. issn: 0162-1459.
Teh, Y. W., D. Newman, and M. Welling (2006). “A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation”. In: Neural Information Processing Systems.
Thibaux, R. J. (2008). “Nonparametric Bayesian Models for Machine Learning”. PhD thesis. UC Berkeley.
Thibaux, R. J. and M. I. Jordan (2007). “Hierarchical beta processes and the Indian buffet process”. In: Proceedings of the International Conference on Artificial Intelligence and Statistics. San Juan, Puerto Rico.
Thomaz, C. E. and G. A. Giraldi (June 2010). “A new ranking method for principal components analysis and its application to face image analysis”. In: Image and Vision Computing 28.6. We use files http://fei.edu.br/~cet/frontalimages_spatiallynormalized_partX.zip with X=1,2., pp. 902–913.
Titsias, M. K. (2008). “The infinite gamma-Poisson feature model”. In: NIPS, pp. 1513–1520.
Tricomi, F. G. and A. Erdélyi (1951). “The asymptotic expansion of a ratio of gamma functions.” In: Pacific Journal of Mathematics 1.1, pp. 133–142.
Van De Weijer, J. and C. Schmid (2006). “Coloring local feature extraction”. In: Proceedings of the European Conference on Computer Vision, pp. 334–348.
Verbeek, J. J. and B. Triggs (2007). “Region classification with Markov field aspect models”. In: Conference on Computer Vision and Pattern Recognition. IEEE Computer Society.
Wainwright, M. J. and M. I. Jordan (2008). “Graphical models, exponential families, and variational inference”. In: Foundations and Trends in Machine Learning 1.1-2, pp. 1–305.
Walker, S. G. (2007). “Sampling the Dirichlet mixture model with slices”. In: Communications in Statistics—Simulation and Computation 36.1, pp. 45–54.
Wang, C. and D. M. Blei (2013). “Variational inference in nonconjugate models”. In: The Journal of Machine Learning Research 14.1, pp. 1005–1031.
Wang, C., J. W. Paisley, and D. M. Blei (2011). “Online variational inference for the hierarchical Dirichlet process”. In: Artificial Intelligence and Statistics.
Wang, Y. H. (1993). “On the number of successes in independent trials”. In: Statistica Sinica 3.2, pp. 295–312.
Watterson, G. A. (1974). “The sampling theory of selectively neutral alleles”. In: Advances in Applied Probability, pp. 463–488.
West, M. (1992). Hyperparameter estimation in Dirichlet process mixture models. Tech. rep. 92-A03. Institute of Statistics and Decision Sciences Discussion Paper.
Wolpert, R. L. and K. Ickstadt (2004). “Reflecting uncertainty in inverse problems: A Bayesian solution using Lévy processes”. In: Inverse Problems 20, p. 1759.
Wood, F., C. Archambeau, J. Gasthaus, L. James, and Y. W. Teh (2009). “A stochastic memoizer for sequence data”. In: International Conference on Machine Learning. ACM, pp. 1129–1136.

Zhou, M., L. Hannah, D. Dunson, and L. Carin (2012). “Beta-negative binomial process and Poisson factor analysis”. In: International Conference on Artificial Intelligence and Statistics.
Zipf, G. K. (1949). Human Behaviour and the Principle of Least-Effort. Addison-Wesley.