
Dirichlet Processes: Tutorial and Practical Course

Yee Whye Teh

Gatsby Computational Neuroscience Unit, University College London

August 2007 / MLSS

Dirichlet Processes

Dirichlet processes (DPs) are a class of Bayesian nonparametric models. They are used for:
- Density estimation.
- Semiparametric modelling.
- Sidestepping model selection/averaging.

I will give a tutorial on DPs, followed by a practical course on implementing DP mixture models in MATLAB.

Prerequisites: an understanding of the Bayesian paradigm (graphical models, mixture models, exponential families, Gaussian processes); you should know these from Zoubin's and Carl's lectures.

Other tutorials on DPs: Zoubin Ghahramani, UAI 2005; Michael Jordan, NIPS 2005; Volker Tresp, ICML nonparametric Bayes workshop 2006.

Outline

1 Applications

2 Dirichlet Processes

3 Representations of Dirichlet Processes

4 Modelling Data with Dirichlet Processes

5 Practical Course


Function Estimation

Parametric function estimation (e.g. regression, classification)

Data: $\mathbf{x} = \{x_1, x_2, \ldots\}$, $\mathbf{y} = \{y_1, y_2, \ldots\}$

Model: $y_i = f(x_i|w) + \mathcal{N}(0, \sigma^2)$

Prior over parameters: $p(w)$

Posterior over parameters:
$$p(w|\mathbf{x}, \mathbf{y}) = \frac{p(w)\,p(\mathbf{y}|\mathbf{x}, w)}{p(\mathbf{y}|\mathbf{x})}$$

Prediction with posteriors:
$$p(y_\star|x_\star, \mathbf{x}, \mathbf{y}) = \int p(y_\star|x_\star, w)\,p(w|\mathbf{x}, \mathbf{y})\,dw$$


Bayesian nonparametric function estimation with Gaussian processes

Data: $\mathbf{x} = \{x_1, x_2, \ldots\}$, $\mathbf{y} = \{y_1, y_2, \ldots\}$

Model: $y_i = f(x_i) + \mathcal{N}(0, \sigma^2)$

Prior over functions: $f \sim \mathrm{GP}(\mu, \Sigma)$

Posterior over functions:
$$p(f|\mathbf{x}, \mathbf{y}) = \frac{p(f)\,p(\mathbf{y}|\mathbf{x}, f)}{p(\mathbf{y}|\mathbf{x})}$$

Prediction with posteriors:
$$p(y_\star|x_\star, \mathbf{x}, \mathbf{y}) = \int p(y_\star|x_\star, f)\,p(f|\mathbf{x}, \mathbf{y})\,df$$


Figure from Carl’s lecture.

Density Estimation

Parametric density estimation (e.g. mixture models)

Data: $\mathbf{x} = \{x_1, x_2, \ldots\}$

Model: $x_i|w \sim F(\cdot|w)$

Prior over parameters: $p(w)$

Posterior over parameters:
$$p(w|\mathbf{x}) = \frac{p(w)\,p(\mathbf{x}|w)}{p(\mathbf{x})}$$

Prediction with posteriors:
$$p(x_\star|\mathbf{x}) = \int p(x_\star|w)\,p(w|\mathbf{x})\,dw$$


Bayesian nonparametric density estimation with Dirichlet processes

Data: $\mathbf{x} = \{x_1, x_2, \ldots\}$

Model: $x_i \sim F$

Prior over distributions: $F \sim \mathrm{DP}(\alpha, H)$

Posterior over distributions:
$$p(F|\mathbf{x}) = \frac{p(F)\,p(\mathbf{x}|F)}{p(\mathbf{x})}$$

Prediction with posteriors:
$$p(x_\star|\mathbf{x}) = \int p(x_\star|F)\,p(F|\mathbf{x})\,dF = \int F'(x_\star)\,p(F|\mathbf{x})\,dF$$

(Not quite correct; see later.)


Prior: [figure omitted]

Red: mean density. Blue: median density. Grey: 5-95 quantile. Others: draws.


Posterior: [figure omitted]

Red: mean density. Blue: median density. Grey: 5-95 quantile. Black: data. Others: draws.

Semiparametric Modelling

Linear regression model for inferring effectiveness of new medical treatments.

$$y_{ij} = \beta^\top x_{ij} + b_i^\top z_{ij} + \epsilon_{ij}$$

$y_{ij}$ is the outcome of the $j$th trial on the $i$th subject.
$x_{ij}$, $z_{ij}$ are predictors (treatment, dosage, age, health, ...).
$\beta$ are fixed-effects coefficients.
$b_i$ are random-effects subject-specific coefficients.
$\epsilon_{ij}$ are noise terms.

We care about inferring $\beta$. If $x_{ij}$ is the treatment, we want to determine $p(\beta > 0\,|\,\mathbf{x}, \mathbf{y})$.


$$y_{ij} = \beta^\top x_{ij} + b_i^\top z_{ij} + \epsilon_{ij}$$

Usually we assume Gaussian noise, $\epsilon_{ij} \sim \mathcal{N}(0, \sigma^2)$. Is this a sensible prior? Over-dispersion, skewness, ...

May be better to model the noise nonparametrically,
$$\epsilon_{ij} \sim F \qquad F \sim \mathrm{DP}$$

Also possible to model the subject-specific random effects nonparametrically,
$$b_i \sim G \qquad G \sim \mathrm{DP}$$

Model Selection/Averaging

Data: $\mathbf{x} = \{x_1, x_2, \ldots\}$

Models: $p(\theta_k|M_k)$, $p(\mathbf{x}|\theta_k, M_k)$

Marginal likelihood:
$$p(\mathbf{x}|M_k) = \int p(\mathbf{x}|\theta_k, M_k)\,p(\theta_k|M_k)\,d\theta_k$$

Model selection:
$$M = \operatorname*{argmax}_{M_k}\; p(\mathbf{x}|M_k)$$

Model averaging:
$$p(x_\star|\mathbf{x}) = \sum_{M_k} p(x_\star|M_k)\,p(M_k|\mathbf{x}) = \sum_{M_k} p(x_\star|M_k)\,\frac{p(\mathbf{x}|M_k)\,p(M_k)}{p(\mathbf{x})}$$

But: is this computationally feasible?


The marginal likelihood is usually extremely hard to compute:
$$p(\mathbf{x}|M_k) = \int p(\mathbf{x}|\theta_k, M_k)\,p(\theta_k|M_k)\,d\theta_k$$

Model selection/averaging is meant to prevent underfitting and overfitting. But reasonable and proper Bayesian methods should not overfit [Rasmussen and Ghahramani 2001].

Use a really large model $M_\infty$ instead, and let the data speak for themselves.

Model Selection/Averaging: Clustering

How many clusters are there?

Model Selection/Averaging: Spike Sorting

How many neurons are there?

[Görür 2007, Wood et al. 2006]

Model Selection/Averaging: Topic Modelling

How many topics are there?

[Blei et al. 2004, Teh et al. 2006]

Model Selection/Averaging: Grammar Induction

How many grammar symbols are there?

[Liang et al. 2007, Finkel et al. 2007]

Model Selection/Averaging: Visual Scene Analysis

How many objects, parts, features?

Figure from Sudderth. [Sudderth et al. 2007]

Dirichlet Processes

Finite Mixture Models

A finite mixture model is defined as follows:
$$\phi_k \sim H \qquad \pi \sim \mathrm{Dirichlet}(\alpha/K, \ldots, \alpha/K)$$
$$z_i\,|\,\pi \sim \mathrm{Discrete}(\pi) \qquad x_i\,|\,\phi_{z_i} \sim F(\cdot|\phi_{z_i})$$

[Graphical model: $\alpha \to \pi \to z_i \to x_i \leftarrow \phi_k \leftarrow H$, with plates over $k = 1, \ldots, K$ and $i = 1, \ldots, n$.]

Model selection/averaging over:
- Hyperparameters in H.
- Dirichlet parameter α.
- Number of components K.

Determining K is hardest.

Infinite Mixture Models

Imagine that K ≫ 0 is really large.

If the parameters $\phi_k$ and proportions $\pi$ are integrated out, the number of latent variables left does not grow with K, so there is no overfitting.

At most n components will be associated with data, aka "active". Usually, the number of active components is much less than n.

This gives an infinite mixture model. Demo: dpm_demo2d

Issue 1: can we take this limit K → ∞?
Issue 2: what is the corresponding limiting model?

[Rasmussen 2000]

Gaussian Processes: What are they?

A Gaussian process (GP) is a distribution over functions
$$f : \mathbb{X} \to \mathbb{R}$$

Denote $f \sim \mathrm{GP}$ if $f$ is a GP-distributed random function.

For any finite set of input points $x_1, \ldots, x_n$, we require $(f(x_1), \ldots, f(x_n))$ to be multivariate Gaussian.


The GP is parametrized by its mean $m(x)$ and covariance $c(x, y)$ functions:
$$\begin{pmatrix} f(x_1) \\ \vdots \\ f(x_n) \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} m(x_1) \\ \vdots \\ m(x_n) \end{pmatrix}, \begin{pmatrix} c(x_1, x_1) & \cdots & c(x_1, x_n) \\ \vdots & \ddots & \vdots \\ c(x_n, x_1) & \cdots & c(x_n, x_n) \end{pmatrix} \right)$$

The above are finite-dimensional marginal distributions of the GP. A salient property of these marginal distributions is that they are consistent: integrating out variables preserves the parametric form of the marginal distributions above.

Gaussian Processes: Visualizing Gaussian Processes

A sequence of input points $x_1, x_2, x_3, \ldots$ dense in $\mathbb{X}$. Draw
$$f(x_1)$$
$$f(x_2) \mid f(x_1)$$
$$f(x_3) \mid f(x_1), f(x_2)$$
$$\vdots$$

Each conditional distribution is Gaussian since $(f(x_1), \ldots, f(x_n))$ is Gaussian. Demo: GPgenerate
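To make the visualization concrete, here is a minimal MATLAB sketch (not the GPgenerate demo itself) that draws functions from a GP prior on a dense grid. It samples all grid values jointly via a Cholesky factor, which is equivalent in distribution to the sequential conditioning above; the zero mean, squared-exponential covariance, lengthscale and jitter level are arbitrary illustrative choices.

% Draw three functions from a zero-mean GP prior with squared-exponential
% covariance c(x, y) = exp(-0.5 (x - y)^2 / ell^2) on a dense grid.
xs = linspace(-5, 5, 200)';                  % dense input points x_1, ..., x_n
ell = 1;                                     % lengthscale (illustrative choice)
[X1, X2] = meshgrid(xs, xs);
K = exp(-0.5 * (X1 - X2).^2 / ell^2);        % covariance matrix c(x_i, x_j)
L = chol(K + 1e-8 * eye(numel(xs)))';        % small jitter on the diagonal for stability
f = L * randn(numel(xs), 3);                 % joint Gaussian draws: f = L * standard normals
plot(xs, f); xlabel('x'); ylabel('f(x)');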

Dirichlet Processes: Start with Dirichlet distributions

A Dirichlet distribution is a distribution over the K-dimensional probability simplex:
$$\Delta_K = \Big\{ (\pi_1, \ldots, \pi_K) : \pi_k \ge 0, \textstyle\sum_k \pi_k = 1 \Big\}$$

We say $(\pi_1, \ldots, \pi_K)$ is Dirichlet distributed,
$$(\pi_1, \ldots, \pi_K) \sim \mathrm{Dirichlet}(\alpha_1, \ldots, \alpha_K)$$
with parameters $(\alpha_1, \ldots, \alpha_K)$, if
$$p(\pi_1, \ldots, \pi_K) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_{k=1}^{K} \pi_k^{\alpha_k - 1}$$

Dirichlet Processes: Examples of Dirichlet distributions

[Figures omitted.]

Dirichlet Processes: Agglomerative property of Dirichlet distributions

Combining entries of probability vectors preserves the Dirichlet property, for example:
$$(\pi_1, \ldots, \pi_K) \sim \mathrm{Dirichlet}(\alpha_1, \ldots, \alpha_K)$$
$$\Rightarrow\quad (\pi_1 + \pi_2, \pi_3, \ldots, \pi_K) \sim \mathrm{Dirichlet}(\alpha_1 + \alpha_2, \alpha_3, \ldots, \alpha_K)$$

Generally, if $(I_1, \ldots, I_j)$ is a partition of $(1, \ldots, K)$:
$$\Big( \sum_{i \in I_1} \pi_i, \ldots, \sum_{i \in I_j} \pi_i \Big) \sim \mathrm{Dirichlet}\Big( \sum_{i \in I_1} \alpha_i, \ldots, \sum_{i \in I_j} \alpha_i \Big)$$

Dirichlet Processes: Decimative property of Dirichlet distributions

The converse of the agglomerative property is also true, for example if:
$$(\pi_1, \ldots, \pi_K) \sim \mathrm{Dirichlet}(\alpha_1, \ldots, \alpha_K)$$
$$(\tau_1, \tau_2) \sim \mathrm{Dirichlet}(\alpha_1\beta_1, \alpha_1\beta_2)$$
with $\beta_1 + \beta_2 = 1$, then
$$(\pi_1\tau_1, \pi_1\tau_2, \pi_2, \ldots, \pi_K) \sim \mathrm{Dirichlet}(\alpha_1\beta_1, \alpha_1\beta_2, \alpha_2, \ldots, \alpha_K)$$

Dirichlet Processes: Visualizing Dirichlet Processes

A Dirichlet process (DP) is an "infinitely decimated" Dirichlet distribution:
$$1 \sim \mathrm{Dirichlet}(\alpha)$$
$$(\pi_1, \pi_2) \sim \mathrm{Dirichlet}(\alpha/2, \alpha/2) \qquad \pi_1 + \pi_2 = 1$$
$$(\pi_{11}, \pi_{12}, \pi_{21}, \pi_{22}) \sim \mathrm{Dirichlet}(\alpha/4, \alpha/4, \alpha/4, \alpha/4) \qquad \pi_{i1} + \pi_{i2} = \pi_i$$
$$\vdots$$

Each decimation step involves drawing from a Beta distribution (a Dirichlet with 2 components) and multiplying into the relevant entry. Demo: DPgenerate
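A minimal MATLAB sketch of this decimation process (in the spirit of the DPgenerate demo, not the demo itself): starting from a single bin of mass 1, every bin is repeatedly split by a symmetric Beta draw. It assumes betarnd from the Statistics Toolbox is available; alpha and the number of levels are arbitrary choices.

% "Infinitely decimated" Dirichlet, truncated after nlevels splits.
% After level L the bin masses are marginally Dirichlet(alpha/2^L, ..., alpha/2^L).
alpha = 5; nlevels = 10;
masses = 1;                                   % level 0: all mass in one bin
for level = 1:nlevels
    a = alpha / 2^level;
    tau = betarnd(a, a, size(masses));        % one Beta(a, a) split per existing bin
    masses = reshape([masses .* tau; masses .* (1 - tau)], 1, []);
end
bar(masses);                                  % masses of the 2^nlevels bins (they sum to 1)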

Dirichlet Processes: A Proper but Non-Constructive Definition

A probability measure is a function from subsets of a space $\mathbb{X}$ to $[0, 1]$ satisfying certain properties. A Dirichlet process (DP) is a distribution over probability measures. Denote $G \sim \mathrm{DP}$ if $G$ is a DP-distributed random probability measure.

For any finite partition $A_1 \dot\cup \cdots \dot\cup A_K = \mathbb{X}$, we require $(G(A_1), \ldots, G(A_K))$ to be Dirichlet distributed.

[Figure: a partition $A_1, \ldots, A_6$ of $\mathbb{X}$.]

Dirichlet Processes: Parameters of the Dirichlet Process

A DP has two parameters:
- Base distribution H, which is like the mean of the DP.
- Strength parameter α, which is like an inverse-variance of the DP.

We write
$$G \sim \mathrm{DP}(\alpha, H)$$
if for any partition $(A_1, \ldots, A_K)$ of $\mathbb{X}$:
$$(G(A_1), \ldots, G(A_K)) \sim \mathrm{Dirichlet}(\alpha H(A_1), \ldots, \alpha H(A_K))$$


The first two cumulants of the DP:
$$\mathbb{E}[G(A)] = H(A) \qquad \mathbb{V}[G(A)] = \frac{H(A)(1 - H(A))}{\alpha + 1}$$
where $A$ is any measurable subset of $\mathbb{X}$.

Dirichlet Processes: Existence of Dirichlet Processes

A probability measure is a function from subsets of a space $\mathbb{X}$ to $[0, 1]$ satisfying certain properties. A DP is a distribution over probability measures such that marginals on finite partitions are Dirichlet distributed. How do we know that such an object exists?!?
- Kolmogorov Consistency Theorem: [Ferguson 1973].
- de Finetti's Theorem: Blackwell-MacQueen urn scheme, Chinese restaurant process [Blackwell and MacQueen 1973, Aldous 1985].
- Stick-breaking Construction: [Sethuraman 1994].
- Gamma Process: [Ferguson 1973].


Dirichlet Processes: Representations of Dirichlet Processes

Suppose $G \sim \mathrm{DP}(\alpha, H)$. $G$ is a (random) probability measure over $\mathbb{X}$. We can treat it as a distribution over $\mathbb{X}$. Let
$$\theta_1, \ldots, \theta_n \sim G$$
be random variables with distribution $G$.

We saw in the demo that draws from a Dirichlet process seem to be discrete distributions. If so, then
$$G = \sum_{k=1}^{\infty} \pi_k \delta_{\theta^*_k}$$
and there is positive probability that the $\theta_i$'s can take on the same value $\theta^*_k$ for some $k$, i.e. the $\theta_i$'s cluster together.

In this section we are concerned with representations of Dirichlet processes based upon both the clustering property and the sum of point masses.

Posterior Dirichlet Processes: Sampling from a Dirichlet Process

Suppose G is Dirichlet process distributed:
$$G \sim \mathrm{DP}(\alpha, H)$$
$G$ is a (random) probability measure over $\mathbb{X}$. We can treat it as a distribution over $\mathbb{X}$. Let $\theta \sim G$ be a random variable with distribution $G$. We are interested in:
$$p(\theta) = \int p(\theta|G)\,p(G)\,dG \qquad\qquad p(G|\theta) = \frac{p(\theta|G)\,p(G)}{p(\theta)}$$

Posterior Dirichlet Processes: Conjugacy between Dirichlet Distribution and Multinomial

Consider:
$$(\pi_1, \ldots, \pi_K) \sim \mathrm{Dirichlet}(\alpha_1, \ldots, \alpha_K)$$
$$z\,|\,(\pi_1, \ldots, \pi_K) \sim \mathrm{Discrete}(\pi_1, \ldots, \pi_K)$$
$z$ is a multinomial variate, taking on value $i \in \{1, \ldots, K\}$ with probability $\pi_i$. Then:
$$z \sim \mathrm{Discrete}\Big( \frac{\alpha_1}{\sum_i \alpha_i}, \ldots, \frac{\alpha_K}{\sum_i \alpha_i} \Big)$$
$$(\pi_1, \ldots, \pi_K)\,|\,z \sim \mathrm{Dirichlet}(\alpha_1 + \delta_1(z), \ldots, \alpha_K + \delta_K(z))$$
where $\delta_i(z) = 1$ if $z$ takes on value $i$, and $0$ otherwise. The converse is also true.

Posterior Dirichlet Processes

Fix a partition $(A_1, \ldots, A_K)$ of $\mathbb{X}$. Then
$$(G(A_1), \ldots, G(A_K)) \sim \mathrm{Dirichlet}(\alpha H(A_1), \ldots, \alpha H(A_K))$$
$$P(\theta \in A_i\,|\,G) = G(A_i)$$
Using Dirichlet-multinomial conjugacy,
$$P(\theta \in A_i) = H(A_i)$$
$$(G(A_1), \ldots, G(A_K))\,|\,\theta \sim \mathrm{Dirichlet}(\alpha H(A_1) + \delta_\theta(A_1), \ldots, \alpha H(A_K) + \delta_\theta(A_K))$$
The above is true for every finite partition of $\mathbb{X}$. In particular, taking a really fine partition,
$$p(\theta)\,d\theta = H(d\theta)$$
Also, the posterior $G|\theta$ is again a Dirichlet process:
$$G\,|\,\theta \sim \mathrm{DP}\Big( \alpha + 1, \frac{\alpha H + \delta_\theta}{\alpha + 1} \Big)$$


$$\begin{aligned} G &\sim \mathrm{DP}(\alpha, H) \\ \theta\,|\,G &\sim G \end{aligned} \qquad\Longleftrightarrow\qquad \begin{aligned} \theta &\sim H \\ G\,|\,\theta &\sim \mathrm{DP}\Big(\alpha + 1, \tfrac{\alpha H + \delta_\theta}{\alpha + 1}\Big) \end{aligned}$$

Blackwell-MacQueen Urn Scheme

First sample:
$$\theta_1\,|\,G \sim G, \quad G \sim \mathrm{DP}(\alpha, H) \qquad\Longleftrightarrow\qquad \theta_1 \sim H, \quad G\,|\,\theta_1 \sim \mathrm{DP}\Big(\alpha+1, \tfrac{\alpha H + \delta_{\theta_1}}{\alpha+1}\Big)$$

Second sample:
$$\theta_2\,|\,\theta_1, G \sim G, \quad G\,|\,\theta_1 \sim \mathrm{DP}\Big(\alpha+1, \tfrac{\alpha H + \delta_{\theta_1}}{\alpha+1}\Big) \qquad\Longleftrightarrow\qquad \theta_2\,|\,\theta_1 \sim \tfrac{\alpha H + \delta_{\theta_1}}{\alpha+1}, \quad G\,|\,\theta_1, \theta_2 \sim \mathrm{DP}\Big(\alpha+2, \tfrac{\alpha H + \delta_{\theta_1} + \delta_{\theta_2}}{\alpha+2}\Big)$$

$n$th sample:
$$\theta_n\,|\,\theta_{1:n-1}, G \sim G, \quad G\,|\,\theta_{1:n-1} \sim \mathrm{DP}\Big(\alpha+n-1, \tfrac{\alpha H + \sum_{i=1}^{n-1}\delta_{\theta_i}}{\alpha+n-1}\Big) \qquad\Longleftrightarrow\qquad \theta_n\,|\,\theta_{1:n-1} \sim \tfrac{\alpha H + \sum_{i=1}^{n-1}\delta_{\theta_i}}{\alpha+n-1}, \quad G\,|\,\theta_{1:n} \sim \mathrm{DP}\Big(\alpha+n, \tfrac{\alpha H + \sum_{i=1}^{n}\delta_{\theta_i}}{\alpha+n}\Big)$$


The Blackwell-MacQueen urn scheme produces a sequence $\theta_1, \theta_2, \ldots$ with the following conditionals:
$$\theta_n\,|\,\theta_{1:n-1} \sim \frac{\alpha H + \sum_{i=1}^{n-1}\delta_{\theta_i}}{\alpha + n - 1}$$

Picking balls of different colors from an urn:
- Start with no balls in the urn.
- With probability ∝ α, draw $\theta_n \sim H$, and add a ball of that color into the urn.
- With probability ∝ n − 1, pick a ball at random from the urn, record $\theta_n$ to be its color, then return the ball into the urn and place a second ball of the same color into the urn.

The Blackwell-MacQueen urn scheme is like a "representer" for the DP: a finite projection of an infinite object.
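A minimal MATLAB sketch of the urn scheme, taking H to be a standard normal and an arbitrary strength alpha; it draws theta_1, ..., theta_n directly from the conditionals above.

% Blackwell-MacQueen urn scheme: sample theta_1, ..., theta_n sequentially.
alpha = 2; n = 100;
theta = zeros(n, 1);
for i = 1:n
    if rand < alpha / (alpha + i - 1)
        theta(i) = randn;                     % new color: draw theta_i ~ H = N(0, 1)
    else
        j = ceil(rand * (i - 1));             % pick one of the i-1 existing balls uniformly
        theta(i) = theta(j);                  % reuse its color
    end
end
fprintf('Drew %d values, %d of them distinct.\n', n, numel(unique(theta)));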

Exchangeability and de Finetti's Theorem

Starting with a DP, we constructed the Blackwell-MacQueen urn scheme. The reverse is possible using de Finetti's Theorem.

Since the $\theta_i$ are iid $\sim G$, their joint distribution is invariant to permutations; thus $\theta_1, \theta_2, \ldots$ are exchangeable. By de Finetti's Theorem, a distribution over measures must therefore exist making them iid. This is the DP.

Chinese Restaurant Process

Draw $\theta_1, \ldots, \theta_n$ from a Blackwell-MacQueen urn scheme. They take on $K \le n$ distinct values, say $\theta^*_1, \ldots, \theta^*_K$. This defines a partition of $1, \ldots, n$ into $K$ clusters, such that if $i$ is in cluster $k$, then $\theta_i = \theta^*_k$.

Random draws $\theta_1, \ldots, \theta_n$ from a Blackwell-MacQueen urn scheme thus induce a random partition of $1, \ldots, n$. The induced distribution over partitions is a Chinese restaurant process (CRP).


Generating from the CRP:
- First customer sits at the first table.
- Customer n sits at:
  - Table k with probability $\frac{n_k}{\alpha + n - 1}$, where $n_k$ is the number of customers at table k.
  - A new table K + 1 with probability $\frac{\alpha}{\alpha + n - 1}$.

Customers ⇔ integers, tables ⇔ clusters. The CRP exhibits the clustering property of the DP.
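To make the clustering property concrete, the conditionals above can be multiplied together to give the probability of an entire seating arrangement. A small MATLAB sketch; the example cluster sizes are arbitrary.

% Probability of a seating arrangement under the CRP:
% p(partition) = alpha^K * prod_k (n_k - 1)! / prod_{i=1}^n (alpha + i - 1),
% obtained by multiplying the conditional seating probabilities.
alpha = 1;
nk = [4, 2, 1];                               % cluster sizes n_k (so n = 7, K = 3)
n = sum(nk); K = numel(nk);
logp = K * log(alpha) + sum(gammaln(nk)) - sum(log(alpha + (0:n-1)));
fprintf('log p(partition) = %.4f\n', logp);   % gammaln(n_k) = log((n_k - 1)!)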


To get back from the CRP to the Blackwell-MacQueen urn scheme, simply draw
$$\theta^*_k \sim H$$
for $k = 1, \ldots, K$, then for $i = 1, \ldots, n$ set
$$\theta_i = \theta^*_{z_i}$$
where $z_i$ is the table that customer $i$ sat at. The CRP teases apart the clustering property of the DP from the base distribution.

Stick-breaking Construction

Returning to the posterior process:
$$\begin{aligned} G &\sim \mathrm{DP}(\alpha, H) \\ \theta\,|\,G &\sim G \end{aligned} \qquad\Longleftrightarrow\qquad \begin{aligned} \theta &\sim H \\ G\,|\,\theta &\sim \mathrm{DP}\Big(\alpha+1, \tfrac{\alpha H + \delta_\theta}{\alpha+1}\Big) \end{aligned}$$

Consider a partition $(\theta, \mathbb{X}\setminus\theta)$ of $\mathbb{X}$. We have:
$$(G(\theta), G(\mathbb{X}\setminus\theta)) \sim \mathrm{Dirichlet}\Big( (\alpha+1)\tfrac{\alpha H + \delta_\theta}{\alpha+1}(\theta),\; (\alpha+1)\tfrac{\alpha H + \delta_\theta}{\alpha+1}(\mathbb{X}\setminus\theta) \Big) = \mathrm{Dirichlet}(1, \alpha)$$

$G$ has a point mass located at $\theta$:
$$G = \beta \delta_\theta + (1-\beta) G' \qquad \text{with } \beta \sim \mathrm{Beta}(1, \alpha)$$
and $G'$ is the (renormalized) probability measure with the point mass removed.

What is $G'$?


Currently, we have:
$$G \sim \mathrm{DP}(\alpha, H), \quad \theta\,|\,G \sim G \qquad\Longrightarrow\qquad G\,|\,\theta \sim \mathrm{DP}\Big(\alpha+1, \tfrac{\alpha H + \delta_\theta}{\alpha+1}\Big), \quad G = \beta\delta_\theta + (1-\beta)G', \quad \theta \sim H, \quad \beta \sim \mathrm{Beta}(1, \alpha)$$

Consider a further partition $(\theta, A_1, \ldots, A_K)$ of $\mathbb{X}$:
$$(G(\theta), G(A_1), \ldots, G(A_K)) = (\beta, (1-\beta)G'(A_1), \ldots, (1-\beta)G'(A_K)) \sim \mathrm{Dirichlet}(1, \alpha H(A_1), \ldots, \alpha H(A_K))$$

The agglomerative/decimative property of the Dirichlet implies:
$$(G'(A_1), \ldots, G'(A_K)) \sim \mathrm{Dirichlet}(\alpha H(A_1), \ldots, \alpha H(A_K)) \qquad\Longrightarrow\qquad G' \sim \mathrm{DP}(\alpha, H)$$


We have:
$$G \sim \mathrm{DP}(\alpha, H)$$
$$G = \beta_1 \delta_{\theta^*_1} + (1-\beta_1) G_1$$
$$G = \beta_1 \delta_{\theta^*_1} + (1-\beta_1)\big( \beta_2 \delta_{\theta^*_2} + (1-\beta_2) G_2 \big)$$
$$\vdots$$
$$G = \sum_{k=1}^{\infty} \pi_k \delta_{\theta^*_k}$$
where
$$\pi_k = \beta_k \prod_{i=1}^{k-1} (1 - \beta_i) \qquad \beta_k \sim \mathrm{Beta}(1, \alpha) \qquad \theta^*_k \sim H$$

This is the stick-breaking construction. Demo: SBgenerate
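A minimal MATLAB sketch in the spirit of the SBgenerate demo (not the demo itself), truncating the construction at K sticks and taking H to be a standard normal; Beta(1, alpha) variables are drawn by inversion so no toolbox is needed.

% Truncated stick-breaking draw from DP(alpha, H) with H = N(0, 1).
alpha = 5; K = 1000;                          % truncation level (arbitrary)
b = 1 - rand(K, 1).^(1 / alpha);              % beta_k ~ Beta(1, alpha) by inversion
piw = b .* cumprod([1; 1 - b(1:end-1)]);      % pi_k = beta_k * prod_{i<k} (1 - beta_i)
theta = randn(K, 1);                          % atoms theta*_k ~ H
stem(theta, piw);                             % G ~ sum_k pi_k delta_{theta*_k}
xlabel('\theta^*_k'); ylabel('\pi_k');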


Starting with a DP, we showed that draws from the DP look like a sum of point masses, with masses drawn from a stick-breaking construction.

The steps above are limited by assumptions of regularity on $\mathbb{X}$ and smoothness on $H$. [Sethuraman 1994] started with the stick-breaking construction and showed that draws are indeed DP distributed, under very general conditions.

Dirichlet Processes: Representations of Dirichlet Processes

Posterior Dirichlet process:
$$\begin{aligned} G &\sim \mathrm{DP}(\alpha, H) \\ \theta\,|\,G &\sim G \end{aligned} \qquad\Longleftrightarrow\qquad \begin{aligned} \theta &\sim H \\ G\,|\,\theta &\sim \mathrm{DP}\Big(\alpha+1, \tfrac{\alpha H + \delta_\theta}{\alpha+1}\Big) \end{aligned}$$

Blackwell-MacQueen urn scheme:
$$\theta_n\,|\,\theta_{1:n-1} \sim \frac{\alpha H + \sum_{i=1}^{n-1}\delta_{\theta_i}}{\alpha + n - 1}$$

Chinese restaurant process:
$$p(\text{customer } n \text{ sat at table } k\,|\,\text{past}) = \begin{cases} \frac{n_k}{n-1+\alpha} & \text{if table } k \text{ is occupied} \\ \frac{\alpha}{n-1+\alpha} & \text{if table } k \text{ is new} \end{cases}$$

Stick-breaking construction:
$$\pi_k = \beta_k \prod_{i=1}^{k-1}(1-\beta_i) \qquad \beta_k \sim \mathrm{Beta}(1, \alpha) \qquad \theta^*_k \sim H \qquad G = \sum_{k=1}^{\infty} \pi_k \delta_{\theta^*_k}$$

Modelling Data with Dirichlet Processes

Density Estimation

Recall our approach to density estimation with Dirichlet processes:
$$G \sim \mathrm{DP}(\alpha, H) \qquad x_i \sim G$$

The above does not work. Why? Problem: G is a discrete distribution; in particular it has no density!

Solution: convolve the DP with a smooth distribution:
$$G \sim \mathrm{DP}(\alpha, H) \qquad F_x(\cdot) = \int F(\cdot|\theta)\,dG(\theta) \qquad x_i \sim F_x$$
$$\Longrightarrow\qquad G = \sum_{k=1}^{\infty} \pi_k \delta_{\theta^*_k} \qquad F_x(\cdot) = \sum_{k=1}^{\infty} \pi_k F(\cdot|\theta^*_k) \qquad x_i \sim F_x$$
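A minimal MATLAB sketch that draws one random density F_x from this prior, using a truncated stick-breaking draw of G and Gaussian kernels F(.|theta) of fixed width; all numerical settings are illustrative.

% One random density from a DP mixture of Gaussians:
% draw G by truncated stick-breaking, then F_x(x) = sum_k pi_k N(x | mu_k, sigma^2).
alpha = 2; K = 500; sigma = 1;
b = 1 - rand(K, 1).^(1 / alpha);              % beta_k ~ Beta(1, alpha)
piw = b .* cumprod([1; 1 - b(1:end-1)]);      % stick-breaking weights pi_k
mu = 3 * randn(K, 1);                         % component means theta*_k ~ H = N(0, 3^2)
xs = linspace(-15, 15, 500);
fx = zeros(size(xs));
for k = 1:K
    fx = fx + piw(k) * exp(-0.5 * (xs - mu(k)).^2 / sigma^2) / sqrt(2 * pi * sigma^2);
end
plot(xs, fx); xlabel('x'); ylabel('F_x(x)');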


[Figure: draws from the DP mixture prior, as in the earlier prior density figure.]

$F(\cdot|\mu, \Sigma)$ is Gaussian with mean $\mu$ and covariance $\Sigma$. $H(\mu, \Sigma)$ is the Gaussian-inverse-Wishart conjugate prior.

[Figure: the corresponding posterior draws given data, as in the earlier posterior density figure; same $F$ and $H$ as above.]

Clustering

Recall our approach to density estimation:
$$G = \sum_{k=1}^{\infty} \pi_k \delta_{\theta^*_k} \sim \mathrm{DP}(\alpha, H) \qquad F_x(\cdot) = \sum_{k=1}^{\infty} \pi_k F(\cdot|\theta^*_k) \qquad x_i \sim F_x$$

The above model is equivalent to:
$$z_i \sim \mathrm{Discrete}(\pi) \qquad \theta_i = \theta^*_{z_i} \qquad x_i\,|\,z_i \sim F(\cdot|\theta_i) = F(\cdot|\theta^*_{z_i})$$

This is simply a mixture model with an infinite number of components. It is called a DP mixture model.


DP mixture models are used in a variety of clustering applications, where the number of clusters is not known a priori. They are also used in applications in which we believe the number of clusters grows without bound as the amount of data grows.

DPs have also found uses in applications beyond clustering, where the number of latent objects is not known or unbounded:
- Nonparametric probabilistic context-free grammars.
- Visual scene analysis.
- Infinite hidden Markov models/trees.
- Haplotype inference.
- ...

In many such applications it is important to be able to model the same set of objects in different contexts. This corresponds to the problem of grouped clustering and can be tackled using hierarchical Dirichlet processes. [Teh et al. 2006]

Grouped Clustering

[Figures omitted.]

Hierarchical Dirichlet Processes

Hierarchical Dirichlet process:
$$G_0\,|\,\gamma, H \sim \mathrm{DP}(\gamma, H)$$
$$G_j\,|\,\alpha, G_0 \sim \mathrm{DP}(\alpha, G_0)$$
$$\theta_{ji}\,|\,G_j \sim G_j$$

[Graphical model: $H, \gamma \to G_0$; $\alpha, G_0 \to G_j \to \theta_{ji}$, with plates over $i = 1, \ldots, n$ and $j = 1, \ldots, J$.]


$$G_0\,|\,\gamma, H \sim \mathrm{DP}(\gamma, H) \qquad G_0 = \sum_{k=1}^{\infty} \beta_k \delta_{\phi_k} \qquad \beta\,|\,\gamma \sim \mathrm{Stick}(\gamma)$$
$$G_j\,|\,\alpha, G_0 \sim \mathrm{DP}(\alpha, G_0) \qquad G_j = \sum_{k=1}^{\infty} \pi_{jk} \delta_{\phi_k} \qquad \pi_j\,|\,\alpha, \beta \sim \mathrm{DP}(\alpha, \beta)$$
$$\phi_k\,|\,H \sim H$$

[Figure: draws of $G_0$, $G_1$, $G_2$ sharing the same atoms $\phi_k$.]

Summary

- The Dirichlet process is "just" a glorified Dirichlet distribution.
- Draws from a DP are probability measures consisting of a weighted sum of point masses.
- Many representations: Blackwell-MacQueen urn scheme, Chinese restaurant process, stick-breaking construction.
- DP mixture models are mixture models with a countably infinite number of components.

I have not delved into:
- Applications.
- Generalizations, extensions, other nonparametric processes.
- Inference: MCMC sampling, variational approximation.

Also see the tutorial material from Ghahramani, Jordan and Tresp.

Bayesian Nonparametrics

- Parametric models can only capture a bounded amount of information from data, since they have bounded complexity.
- Real data is often complex, and the parametric assumption is often wrong.
- Nonparametric models allow relaxation of the parametric assumption, bringing significant flexibility to our models of the world.
- Nonparametric models can also often lead to model selection/averaging behaviours without the cost of actually doing model selection/averaging.
- Nonparametric models are gaining popularity, spurred by growth in computational resources and inference algorithms.
- In addition to DPs, HDPs and their generalizations, other nonparametric models include Indian buffet processes, beta processes, tree processes, ...


Practical Course

Exploring the Dirichlet Process

Before using DPs, it is important to understand their properties, so that we understand what prior assumptions we are imposing on our models. In this practical course we shall work towards implementing a DP mixture model to cluster NIPS papers, so the relevant properties are the clustering properties of the DP.

Consider the Chinese restaurant process representation of DPs:
- First customer sits at the first table.
- Customer n sits at:
  - Table k with probability $\frac{n_k}{\alpha + n - 1}$, where $n_k$ is the number of customers at table k.
  - A new table K + 1 with probability $\frac{\alpha}{\alpha + n - 1}$.

How does the number of clusters K scale as a function of α and of n (on average)? (See the sketch below.)

How does the number $n_k$ of customers sitting around table k depend on k and n (on average)?
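As a first check on the K question, the expected number of occupied tables can be computed exactly from the conditionals above, since a new table is opened at step i with probability alpha/(alpha + i - 1). A small MATLAB sketch; the alpha and n values are arbitrary.

% Expected number of tables: E[K] = sum_{i=1}^n alpha / (alpha + i - 1),
% which grows roughly like alpha * log(n) for large n.
for alpha = [0.5, 1, 5]
    for n = [10, 100, 1000, 10000]
        EK = sum(alpha ./ (alpha + (1:n) - 1));
        fprintf('alpha = %4.1f  n = %6d  E[K] = %7.2f  (alpha*log n = %6.2f)\n', ...
                alpha, n, EK, alpha * log(n));
    end
end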

Exploring the Pitman-Yor Process

Sometimes the assumptions embedded in using DPs to model data are inappropriate. The Pitman-Yor process is a generalization of the DP that often has more appropriate properties. It has two parameters, d and α, with 0 ≤ d < 1 and α > −d. When d = 0 the Pitman-Yor process reduces to a DP.

It also has a Chinese restaurant process representation:
- First customer sits at the first table.
- Customer n sits at:
  - Table k with probability $\frac{n_k - d}{\alpha + n - 1}$, where $n_k$ is the number of customers at table k.
  - A new table K + 1 with probability $\frac{\alpha + Kd}{\alpha + n - 1}$.

How does K scale as a function of n, α and d (on average)?
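A minimal MATLAB sketch that simulates the Pitman-Yor seating rule above once, for one arbitrary setting of d and alpha, and prints the number of tables; for 0 < d < 1 the number of tables grows like a power law in n rather than logarithmically.

% One run of the Pitman-Yor Chinese restaurant process.
d = 0.5; alpha = 1; n = 10000;
counts = []; K = 0;
for i = 1:n
    p = [counts - d, alpha + K * d];          % n_k - d for old tables, alpha + K*d for a new one
    p = p / sum(p);                           % the unnormalized weights sum to alpha + i - 1
    k = find(rand < cumsum(p), 1);
    if k > K
        K = K + 1; counts(K) = 1;             % open a new table
    else
        counts(k) = counts(k) + 1;
    end
end
fprintf('d = %.1f, alpha = %.1f, n = %d: K = %d tables\n', d, alpha, n, K);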

Dirichlet Process Mixture Models

We model a data set $x_1, \ldots, x_n$ using the following model:
$$G \sim \mathrm{DP}(\alpha, H)$$
$$\theta_i\,|\,G \sim G$$
$$x_i\,|\,\theta_i \sim F(\cdot|\theta_i) \qquad \text{for } i = 1, \ldots, n$$

Each $\theta_i$ is a latent parameter modelling $x_i$, while $G$ is the unknown distribution over parameters, modelled using a DP. This is the basic DP mixture model.

Implement a DP mixture model.
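Before doing inference, it helps to forward-sample from the model. A minimal MATLAB sketch in the spirit of the dpm_demo2d demo (not the demo itself), using the CRP representation with 1-D Gaussian components; H is taken to be N(0, 5^2) over component means and sigma is fixed, both arbitrary choices.

% Forward-sample n points from a DP mixture of 1-D Gaussians via the CRP.
alpha = 1; n = 200; sigma = 0.5;
x = zeros(n, 1); z = zeros(n, 1);
counts = []; mu = [];
for i = 1:n
    p = [counts, alpha]; p = p / sum(p);      % CRP seating probabilities
    k = find(rand < cumsum(p), 1);
    if k > numel(counts)
        counts(k) = 1; mu(k) = 5 * randn;     % new cluster: draw theta*_k ~ H
    else
        counts(k) = counts(k) + 1;
    end
    z(i) = k;
    x(i) = mu(k) + sigma * randn;             % x_i ~ F(.|theta*_{z_i}) = N(mu_k, sigma^2)
end
hist(x, 40); fprintf('%d points in %d clusters\n', n, numel(counts));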

Dirichlet Process Mixture Models: Infinite Limit of Finite Mixture Models

Different representations lead to different inference algorithms for DP mixture models. The most common are based on the Chinese restaurant process and on the stick-breaking construction. Here we shall work with the Chinese restaurant process representation, which, incidentally, can also be derived as the infinite limit of finite mixture models.

A finite mixture model is defined as follows:
$$\theta^*_k \sim H \quad \text{for } k = 1, \ldots, K \qquad \pi \sim \mathrm{Dirichlet}(\alpha/K, \ldots, \alpha/K)$$
$$z_i\,|\,\pi \sim \mathrm{Discrete}(\pi) \quad \text{for } i = 1, \ldots, n \qquad x_i\,|\,\theta^*_{z_i} \sim F(\cdot|\theta^*_{z_i})$$

Yee Whye Teh (Gatsby) DP August 2007 / MLSS 66 / 80 Dirichlet Process Mixture Models Infinite Limit of Finite Mixture Models

Different representations lead to different inference algorithms for DP mixture models. The most common are based on the Chinese restaurant process and on the stick-breaking construction. Here we shall work with the Chinese restaurant process representation, which, incidentally, can also be derived as the infinite limit of finite mixture models. A finite mixture model is defined as follows:

∗ θk ∼ H for k = 1,..., K π ∼ Dirichlet(α/K , . . . , α/K )

zi |π ∼ Discrete(π) for i = 1,..., n x |θ∗ ∼ F(·|θ∗ ) i zi zi university-logo

Yee Whye Teh (Gatsby) DP August 2007 / MLSS 66 / 80 Dirichlet Process Mixture Models Infinite Limit of Finite Mixture Models

Different representations lead to different inference algorithms for DP mixture models. The most common are based on the Chinese restaurant process and on the stick-breaking construction. Here we shall work with the Chinese restaurant process representation, which, incidentally, can also be derived as the infinite limit of finite mixture models. A finite mixture model is defined as follows:

∗ θk ∼ H for k = 1,..., K π ∼ Dirichlet(α/K , . . . , α/K )

zi |π ∼ Discrete(π) for i = 1,..., n x |θ∗ ∼ F(·|θ∗ ) i zi zi university-logo

Yee Whye Teh (Gatsby) DP August 2007 / MLSS 66 / 80 Dirichlet Process Mixture Models Infinite Limit of Finite Mixture Models

Different representations lead to different inference algorithms for DP mixture models. The most common are based on the Chinese restaurant process and on the stick-breaking construction. Here we shall work with the Chinese restaurant process representation, which, incidentally, can also be derived as the infinite limit of finite mixture models. A finite mixture model is defined as follows:

∗ θk ∼ H for k = 1,..., K π ∼ Dirichlet(α/K , . . . , α/K )

zi |π ∼ Discrete(π) for i = 1,..., n x |θ∗ ∼ F(·|θ∗ ) i zi zi university-logo

Yee Whye Teh (Gatsby) DP August 2007 / MLSS 66 / 80 Dirichlet Process Mixture Models Collapsed in Finite Mixture Models

A finite mixture model is defined as follows:
$$\theta^*_k \sim H \quad \text{for } k = 1, \ldots, K \qquad \pi \sim \mathrm{Dirichlet}(\alpha/K, \ldots, \alpha/K)$$
$$z_i\,|\,\pi \sim \mathrm{Discrete}(\pi) \quad \text{for } i = 1, \ldots, n \qquad x_i\,|\,\theta^*_{z_i} \sim F(\cdot|\theta^*_{z_i})$$

Assuming $H$ is conjugate to $F(\cdot|\theta)$, we can integrate out both $\pi$ and the $\theta^*_k$'s, leaving us with the $z_i$'s only.

The simplest MCMC algorithm is to Gibbs sample the $z_i$'s (collapsed Gibbs sampling):
$$p(z_i = k\,|\,z^{\neg i}, \mathbf{x}) \propto p(z_i = k\,|\,z^{\neg i})\,p(x_i\,|\,z^{\neg i}, \mathbf{x}_k^{\neg i})$$
$$p(x_i\,|\,z^{\neg i}, \mathbf{x}_k^{\neg i}) = \int p(x_i\,|\,\theta^*_k)\,p(\theta^*_k\,|\,\{x_j : j \ne i, z_j = k\})\,d\theta^*_k$$

Aside: Markov Chain Monte Carlo Sampling

Markov chain Monte Carlo (MCMC) sampling is a dominant and diverse family of inference algorithms for probabilistic models. Here we are interested in obtaining samples from the posterior:
$$z^{(s)} \sim p(z|\mathbf{x}) = \int p(z, \theta^*, \pi|\mathbf{x})\,d\theta^*\,d\pi$$

The basic idea is to construct a sequence $z^{(1)}, z^{(2)}, \ldots$ so that for large enough $t$, $z^{(t)}$ will be an (approximate) sample from the posterior $p(z|\mathbf{x})$. Convergence to the posterior is guaranteed, but (most of the time) there are no convergence diagnostics, only heuristics. We won't worry about this here.

Given the previous state $z^{(t-1)}$, we construct $z^{(t)}$ by making a small (stochastic) alteration to $z^{(t-1)}$ so that $z^{(t)}$ is "closer" to the posterior. In Gibbs sampling, this alteration is achieved by taking an entry, say $z_i$, and sampling it from its conditional:
$$z_i^{(t)} \sim p(z_i\,|\,z_{\neg i}^{(t-1)}, \mathbf{x}) \qquad z_{\neg i}^{(t)} = z_{\neg i}^{(t-1)}$$

[Neal 1993]

Aside: Exponential Families

An exponential family of distributions is parametrized as:
$$p(x|\theta) = \exp\big( t(\theta)^\top s(x) - \phi(x) - \psi(\theta) \big)$$
- $s(x)$ = sufficient statistics vector.
- $t(\theta)$ = natural parameter vector.
- $\psi(\theta) = \log \sum_{x'} \exp\big( t(\theta)^\top s(x') - \phi(x') \big)$ (log normalization).

The conjugate prior is an exponential family distribution over $\theta$:
$$p(\theta) = \exp\big( t(\theta)^\top \nu - \eta\,\psi(\theta) - \xi(\nu, \eta) \big)$$

The posterior given observations $x_1, \ldots, x_n$ is in the same family:
$$p(\theta|\mathbf{x}) = \exp\Big( t(\theta)^\top \big(\nu + \textstyle\sum_i s(x_i)\big) - (\eta + n)\,\psi(\theta) - \xi\big(\nu + \textstyle\sum_i s(x_i), \eta + n\big) \Big)$$

The marginal probability is:
$$p(\mathbf{x}) = \exp\Big( \xi\big(\nu + \textstyle\sum_i s(x_i), \eta + n\big) - \xi(\nu, \eta) - \textstyle\sum_i \phi(x_i) \Big)$$

Yee Whye Teh (Gatsby) DP August 2007 / MLSS 69 / 80 Aside: Exponential Families
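
As an illustrative instance of these identities (not from the slides), take the Bernoulli distribution, whose conjugate prior in this parametrization is a Beta distribution:
$$s(x) = x, \quad t(\theta) = \log\tfrac{\theta}{1-\theta}, \quad \phi(x) = 0, \quad \psi(\theta) = \log\big(1 + e^{t(\theta)}\big) = -\log(1-\theta),$$
$$p(\theta) \propto \exp\!\big(\nu \log\tfrac{\theta}{1-\theta} + \eta \log(1-\theta)\big) = \theta^{\nu}(1-\theta)^{\eta-\nu}, \qquad \xi(\nu, \eta) = \log B(\nu+1,\ \eta-\nu+1),$$
$$p(x_1, \ldots, x_n) = \exp\!\big(\xi(\nu + \textstyle\sum_i x_i,\ \eta + n) - \xi(\nu, \eta)\big) = \frac{B\big(\nu + \sum_i x_i + 1,\ \eta + n - \nu - \sum_i x_i + 1\big)}{B(\nu+1,\ \eta-\nu+1)},$$
which recovers the familiar Beta posterior update and Beta-Bernoulli marginal likelihood.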


Dirichlet Process Mixture Models: Back to Collapsed Gibbs Sampling in Finite Mixture Models

Finite mixture model:
$$\theta^*_k \sim H \quad \text{for } k = 1, \ldots, K$$
$$\pi \sim \mathrm{Dirichlet}(\alpha/K, \ldots, \alpha/K)$$
$$z_i \mid \pi \sim \mathrm{Discrete}(\pi) \quad \text{for } i = 1, \ldots, n$$
$$x_i \mid z_i \sim F(\theta^*_{z_i})$$

Integrating out both $\pi$ and the $\theta^*_k$'s, the Gibbs sampling conditional distributions for $z$ are:
$$p(z_i = k \mid z^{\neg i}, x) \propto (n_k^{\neg i} + \alpha/K)\, p(x_i \mid z^{\neg i}, x_k^{\neg i})$$
$$p(x_i \mid z^{\neg i}, x_k^{\neg i}) = \int p(x_i \mid \theta^*_k)\, p(\theta^*_k \mid \{x_j : j \neq i,\ z_j = k\})\, d\theta^*_k$$
$$= \exp\!\Big(\xi\big(\nu + s(x_i) + \textstyle\sum_{j \neq i:\, z_j = k} s(x_j),\ \eta + 1 + n_k^{\neg i}\big) - \xi\big(\nu + \textstyle\sum_{j \neq i:\, z_j = k} s(x_j),\ \eta + n_k^{\neg i}\big) - \phi(x_i)\Big)$$

Demo: fm_demo2d

Yee Whye Teh (Gatsby) DP August 2007 / MLSS 70 / 80
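
The sketch below evaluates this collapsed conditional for a single data point. The handles s, phi and xi_fn stand for the sufficient statistics, the phi(x) term, and the log partition function xi(nu, eta) of whatever conjugate exponential family is being used; these names and the interface are illustrative assumptions, not the code of fm_demo2d.

% Collapsed Gibbs conditional p(z_i = k | z^{\neg i}, x) for a finite mixture.
function p = collapsed_conditional(i, z, X, K, alpha, nu0, eta0, s, phi, xi_fn)
  logp = zeros(1, K);
  for k = 1:K
    members = find(z == k);
    members(members == i) = [];                 % exclude point i itself
    nk = numel(members);                        % n_k^{\neg i}
    nu = nu0;
    for j = members(:).'                        % accumulate sufficient statistics
      nu = nu + s(X(j, :));
    end
    eta = eta0 + nk;
    logpred = xi_fn(nu + s(X(i, :)), eta + 1) - xi_fn(nu, eta) - phi(X(i, :));
    logp(k) = log(nk + alpha / K) + logpred;
  end
  p = exp(logp - max(logp));                    % normalise stably
  p = p / sum(p);
end

Sampling z(i) from the returned vector p gives one step of the collapsed Gibbs sampler.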


Dirichlet Process Mixture Models: Taking the Infinite Limit

Imagine that $K \gg 0$ is really large.

Only a few components will be "active" (i.e. with $n_k > 0$), while most are "inactive".
$$p(z_i = k \mid z^{\neg i}, x) \propto \begin{cases} (n_k^{\neg i} + \alpha/K)\, p(x_i \mid z^{\neg i}, x_k^{\neg i}) & \text{if } k \text{ active;}\\ (\alpha/K)\, p(x_i) & \text{if } k \text{ inactive.} \end{cases}$$
$$p(z_i = k \text{ active} \mid z^{\neg i}, x) \propto (n_k^{\neg i} + \alpha/K)\, p(x_i \mid z^{\neg i}, x_k^{\neg i}) \approx n_k^{\neg i}\, p(x_i \mid z^{\neg i}, x_k^{\neg i})$$
$$p(z_i \text{ inactive} \mid z^{\neg i}, x) \propto \big(\alpha (K - K_{\text{active}})/K\big)\, p(x_i) \approx \alpha\, p(x_i)$$

This gives an inference algorithm for DP mixture models in the Chinese restaurant process representation.

Yee Whye Teh (Gatsby) DP August 2007 / MLSS 71 / 80


Dirichlet Process Mixture Models: Further Details

Rearrange the mixture component indices so that $1, \ldots, K_{\text{active}}$ are active and the rest are inactive.
$$p(z_i = k \le K_{\text{active}} \mid z^{\neg i}, x) \propto n_k^{\neg i}\, p(x_i \mid z^{\neg i}, x_k^{\neg i})$$
$$p(z_i > K_{\text{active}} \mid z^{\neg i}, x) \propto \alpha\, p(x_i)$$

If $z_i$ takes on an inactive value, instantiate a new active component and increment $K_{\text{active}}$.

If $n_k = 0$ for some $k$ during sampling, delete that active component and decrement $K_{\text{active}}$.

Yee Whye Teh (Gatsby) DP August 2007 / MLSS 72 / 80
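
A minimal MATLAB sketch of one such update, including the bookkeeping just described, might look as follows. The helpers log_pred(xi, idx) (log predictive probability of x_i given the data currently in a cluster) and log_prior_pred(xi) (log prior predictive p(x_i) for a brand-new cluster) are hypothetical names assumed for illustration, not functions from the provided demos.

% One Gibbs update for z(i) in the Chinese restaurant process representation.
function [z, Kact] = crp_gibbs_update(i, z, X, Kact, alpha, log_pred, log_prior_pred)
  z(i) = 0;                                     % remove point i from its cluster
  counts = zeros(1, Kact);
  for k = 1:Kact, counts(k) = sum(z == k); end  % n_k^{\neg i}
  k0 = find(counts == 0, 1);                    % at most one cluster can have emptied
  if ~isempty(k0)                               % delete it: relabel the last cluster as k0
    z(z == Kact) = k0;
    counts(k0) = counts(Kact);
    counts(Kact) = [];
    Kact = Kact - 1;
  end
  logp = zeros(1, Kact + 1);
  for k = 1:Kact
    logp(k) = log(counts(k)) + log_pred(X(i, :), find(z == k));
  end
  logp(Kact + 1) = log(alpha) + log_prior_pred(X(i, :));   % start a new cluster
  p = exp(logp - max(logp));
  p = p / sum(p);
  z(i) = find(rand() <= cumsum(p), 1, 'first');
  if z(i) == Kact + 1, Kact = Kact + 1; end     % instantiated a new active component
end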


Clustering NIPS Papers

I have prepared a small subset of NIPS papers for you to try clustering. We concentrate on a small subset of papers and a small subset of "informative" words. Each paper is represented as a bag of words: paper $i$ is represented by a vector $x_i = (x_{i1}, \ldots, x_{iW})$, where
$$x_{iw} = c \quad \text{if word } w \text{ occurs } c \text{ times in paper } i.$$

Model papers in cluster $k$ using a multinomial distribution:
$$p(x_i \mid \theta^*_k) = \frac{\big(\sum_w x_{iw}\big)!}{\prod_w x_{iw}!}\ \prod_w (\theta^*_{kw})^{x_{iw}}$$

The conjugate prior for $\theta^*_k$ is a Dirichlet:
$$p(\theta^*_k \mid b) = \frac{\Gamma\big(\sum_w b_w\big)}{\prod_w \Gamma(b_w)}\ \prod_w (\theta^*_{kw})^{b_w - 1}$$

Yee Whye Teh (Gatsby) DP August 2007 / MLSS 73 / 80
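
For this multinomial-Dirichlet model the predictive probability p(x_i | z^{\neg i}, x_k^{\neg i}) needed by the Gibbs sampler has a closed form (the Dirichlet-multinomial marginal). The sketch below computes its logarithm from a cluster's current word counts; it is an illustration written for this section, not the code of the provided demo.

% Log predictive probability of a bag-of-words row vector x (1 x W) under a
% cluster with current word counts `counts` (1 x W), using a symmetric
% Dirichlet(b/W, ..., b/W) prior on the cluster's word distribution.
function lp = log_dirmult_pred(x, counts, b)
  W = numel(x);
  a = b / W + counts;                               % posterior Dirichlet parameters
  m = sum(x);                                       % number of words in the paper
  lp = gammaln(m + 1) - sum(gammaln(x + 1)) ...     % multinomial coefficient
     + gammaln(sum(a)) - gammaln(sum(a) + m) ...
     + sum(gammaln(a + x) - gammaln(a));
end

Calling it with counts = zeros(1, W) gives log p(x_i) under the prior predictive, which is the term that multiplies alpha when considering a new cluster.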


Clustering NIPS Papers: Specifying the Priors

We shall use a symmetric Dirichlet prior for the cluster parameters $\theta$, specifically $b_w = b/W$ for some $b > 0$. The model:
$$H = \mathrm{Dirichlet}(b/W, \ldots, b/W)$$
$$G \sim \mathrm{DP}(\alpha, H)$$
$$\theta_i \sim G$$
$$x_i \sim \mathrm{Multinomial}(n_i, \theta_i)$$

Only two numbers to set: $\alpha$ and $b$. $\alpha$ controls the a priori expected number of clusters; $b$ controls the number of words assigned to each cluster.

What are reasonable values for $\alpha$ and $b$?

Yee Whye Teh (Gatsby) DP August 2007 / MLSS 74 / 80
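
One standard way to reason about alpha (a Chinese restaurant process identity, stated here as background rather than taken from the slides) is through the prior expected number of clusters for n papers, E[K] = sum_{i=1}^{n} alpha / (alpha + i - 1). A few lines of MATLAB tabulate this for candidate values; the choice n = 250 is just an assumed example size.

% Prior expected number of clusters under a DP/CRP with concentration alpha.
n = 250;                                  % assumed number of papers
for alpha = [0.1 1 5 10]
  EK = sum(alpha ./ (alpha + (0:n-1)));   % E[K] = sum_i alpha/(alpha+i-1)
  fprintf('alpha = %5.1f   E[number of clusters] = %.1f\n', alpha, EK);
end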


Clustering NIPS Papers: Sensitivity to Priors

When building models and making inferences, if one does not "trust" one's prior very much, then it is important to perform a sensitivity analysis. Sensitivity analysis is about determining how much our inference conclusions depend on the setting of the model priors. If our conclusions depend strongly on priors which we do not trust very much, then we cannot trust our conclusions either. If our conclusions do not depend strongly on the priors, then we can trust our conclusions more. Which part of our model should we worry about?
$$H = \mathrm{Dirichlet}(b/W, \ldots, b/W)$$
$$G \sim \mathrm{DP}(\alpha, H)$$
$$\theta_i \sim G$$
$$x_i \sim \mathrm{Multinomial}(n_i, \theta_i)$$

Yee Whye Teh (Gatsby) DP August 2007 / MLSS 75 / 80
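
In practice a sensitivity analysis here can be as simple as rerunning the sampler over a small grid of hyperparameter values and comparing a posterior summary. The sketch below assumes a hypothetical helper run_dpmm_sampler(X, alpha, b) that runs the DP mixture sampler and returns cluster assignments after burn-in; it is not part of the provided demos.

% Crude sensitivity check: how does the inferred number of clusters vary
% with the hyperparameters alpha and b?
alphas = [0.1 1 10];
bs     = [1 10 100];
nclust = zeros(numel(alphas), numel(bs));
for a = 1:numel(alphas)
  for c = 1:numel(bs)
    z = run_dpmm_sampler(X, alphas(a), bs(c));   % X: papers-by-words count matrix (assumed)
    nclust(a, c) = numel(unique(z));
  end
end
disp(nclust)                                     % rows index alpha, columns index b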


Summary

We explored some properties of the Dirichlet process. We implemented a Dirichlet process mixture model. We applied a Dirichlet process mixture model to clustering NIPS papers. We considered ways of specifying the hyperparameters of the model, and explored the sensitivity to these hyperparameters.

Dirichlet processes are not that mysterious or hard.

Yee Whye Teh (Gatsby) DP August 2007 / MLSS 76 / 80


References

Yee Whye Teh (Gatsby) DP August 2007 / MLSS 77 / 80

References I

Aldous, D. (1985). Exchangeability and related topics. In École d’Été de Probabilités de Saint-Flour XIII–1983, pages 1–198. Springer, Berlin.

Blackwell, D. and MacQueen, J. B. (1973). Ferguson distributions via Pólya urn schemes. Annals of Statistics, 1:353–355.

Blei, D., Griffiths, T., Jordan, M., and Tenenbaum, J. (2004). Hierarchical topic models and the nested Chinese restaurant process. In Advances in Neural Information Processing Systems, volume 16.

Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1(2):209–230.

Finkel, J. R., Grenager, T., and Manning, C. D. (2007). The infinite tree. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Görür, D. (2007). Nonparametric Bayesian Discrete Latent Variable Models for Unsupervised Learning. PhD thesis, Technische Universität Berlin.

Yee Whye Teh (Gatsby) DP August 2007 / MLSS 78 / 80

References II

Liang, P., Petrov, S., Jordan, M. I., and Klein, D. (2007). The infinite PCFG using hierarchical Dirichlet processes. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Neal, R. M. (1993). Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto.

Rasmussen, C. E. (2000). The infinite Gaussian mixture model. In Advances in Neural Information Processing Systems, volume 12.

Rasmussen, C. E. and Ghahramani, Z. (2001). Occam’s razor. In Advances in Neural Information Processing Systems, volume 13.

Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650.

Sudderth, E., Torralba, A., Freeman, W., and Willsky, A. (2007). Describing visual scenes using transformed objects and parts. To appear in the International Journal of Computer Vision.

Yee Whye Teh (Gatsby) DP August 2007 / MLSS 79 / 80

References III

Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581.

Wood, F., Goldwater, S., and Black, M. J. (2006). A non-parametric Bayesian approach to spike sorting. In Proceedings of the IEEE Conference on Engineering in Medicine and Biological Systems, volume 28.


Yee Whye Teh (Gatsby) DP August 2007 / MLSS 80 / 80