
Dirichlet Processes: Tutorial and Practical Course
Yee Whye Teh
Gatsby Computational Neuroscience Unit, University College London
August 2007 / MLSS

Dirichlet Processes
Dirichlet processes (DPs) are a class of Bayesian nonparametric models. Dirichlet processes are used for:
  Density estimation.
  Semiparametric modelling.
  Sidestepping model selection/averaging.
I will give a tutorial on DPs, followed by a practical course on implementing DP mixture models in MATLAB.
Prerequisites: an understanding of the Bayesian paradigm (graphical models, mixture models, exponential families, Gaussian processes); you should know these from Zoubin and Carl.
Other tutorials on DPs: Zoubin Ghahramani, UAI 2005; Michael Jordan, NIPS 2005; Volker Tresp, ICML nonparametric Bayes workshop 2006.

Outline
  1. Applications
  2. Dirichlet Processes
  3. Representations of Dirichlet Processes
  4. Modelling Data with Dirichlet Processes
  5. Practical Course

Function Estimation
Parametric function estimation (e.g. regression, classification):
  Data: x = {x1, x2, ...}, y = {y1, y2, ...}
  Model: yi = f(xi | w) + N(0, σ²)
  Prior over parameters: p(w)
  Posterior over parameters: p(w | x, y) = p(w) p(y | x, w) / p(y | x)
  Prediction with posteriors: p(y⋆ | x⋆, x, y) = ∫ p(y⋆ | x⋆, w) p(w | x, y) dw

Function Estimation
Bayesian nonparametric function estimation with Gaussian processes:
  Data: x = {x1, x2, ...}, y = {y1, y2, ...}
  Model: yi = f(xi) + N(0, σ²)
  Prior over functions: f ∼ GP(µ, Σ)
  Posterior over functions: p(f | x, y) = p(f) p(y | x, f) / p(y | x)
  Prediction with posteriors: p(y⋆ | x⋆, x, y) = ∫ p(y⋆ | x⋆, f) p(f | x, y) df

Function Estimation
[Figure from Carl's lecture.]

Density Estimation
Parametric density estimation (e.g. mixture models):
  Data: x = {x1, x2, ...}
  Model: xi | w ∼ F(· | w)
  Prior over parameters: p(w)
  Posterior over parameters: p(w | x) = p(w) p(x | w) / p(x)
  Prediction with posteriors: p(x⋆ | x) = ∫ p(x⋆ | w) p(w | x) dw

Density Estimation
Bayesian nonparametric density estimation with Dirichlet processes:
  Data: x = {x1, x2, ...}
  Model: xi ∼ F
  Prior over distributions: F ∼ DP(α, H)
  Posterior over distributions: p(F | x) = p(F) p(x | F) / p(x)
  Prediction with posteriors: p(x⋆ | x) = ∫ p(x⋆ | F) p(F | x) dF = ∫ F′(x⋆) p(F | x) dF, where F′ denotes the density of F.
  (Not quite correct; see later.)
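For the practical course, here is a minimal MATLAB sketch of what a draw F ∼ DP(α, H) looks like, using a truncated stick-breaking construction (the representation itself is introduced later in the tutorial). The concentration α = 2, base distribution H = N(0, 4²), and truncation level T are illustrative assumptions, not values from the slides; betarnd and randsample require the Statistics Toolbox.

```matlab
% Approximate draw F ~ DP(alpha, H) via truncated stick-breaking.
% Assumed for illustration: H = N(0, 4^2), alpha = 2, truncation level T.
alpha = 2; T = 1000;
v     = betarnd(1, alpha, T, 1);               % stick-breaking proportions v_k ~ Beta(1, alpha)
pi_k  = v .* cumprod([1; 1 - v(1:end-1)]);     % mixing weights (sum to ~1 for large T)
theta = 4 * randn(T, 1);                       % atom locations theta_k ~ H

% F is (approximately) sum_k pi_k * delta_{theta_k}: a discrete distribution.
% Draw n observations x_i ~ F by sampling atoms in proportion to their weights.
n   = 20;
idx = randsample(T, n, true, pi_k);
x   = theta(idx);
```

Since each draw F is discrete with probability one, it has no density in the usual sense, which is presumably why the predictive formula above is flagged as "not quite correct".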
Density Estimation
Prior:
[Figure: density draws from the prior. Red: mean density. Blue: median density. Grey: 5-95 quantile. Others: draws.]

Density Estimation
Posterior:
[Figure: density draws from the posterior. Red: mean density. Blue: median density. Grey: 5-95 quantile. Black: data. Others: draws.]

Semiparametric Modelling
Linear regression model for inferring the effectiveness of new medical treatments:
  yij = βᵀxij + biᵀzij + εij
  yij is the outcome of the jth trial on the ith subject.
  xij, zij are predictors (treatment, dosage, age, health...).
  β are fixed-effects coefficients.
  bi are random-effects subject-specific coefficients.
  εij are noise terms.
We care about inferring β. If xij is the treatment, we want to determine p(β > 0 | x, y).

Semiparametric Modelling
  yij = βᵀxij + biᵀzij + εij
Usually we assume Gaussian noise εij ∼ N(0, σ²). Is this a sensible prior? Over-dispersion, skewness, ...
It may be better to model the noise nonparametrically:
  εij ∼ F,  F ∼ DP
It is also possible to model the subject-specific random effects nonparametrically:
  bi ∼ G,  G ∼ DP

Model Selection/Averaging
  Data: x = {x1, x2, ...}
  Models: p(θk | Mk), p(x | θk, Mk)
  Marginal likelihood: p(x | Mk) = ∫ p(x | θk, Mk) p(θk | Mk) dθk
  Model selection: M = argmax_{Mk} p(x | Mk)
  Model averaging: p(x⋆ | x) = Σ_{Mk} p(x⋆ | Mk) p(Mk | x) = Σ_{Mk} p(x⋆ | Mk) p(x | Mk) p(Mk) / p(x)
But: is this computationally feasible? (A small numerical sketch of the averaging computation follows the clustering slide below.)

Model Selection/Averaging
The marginal likelihood is usually extremely hard to compute:
  p(x | Mk) = ∫ p(x | θk, Mk) p(θk | Mk) dθk
Model selection/averaging is meant to prevent underfitting and overfitting.
But reasonable and proper Bayesian methods should not overfit [Rasmussen and Ghahramani 2001].
Use a really large model M∞ instead, and let the data speak for themselves.

Model Selection/Averaging
Clustering: How many clusters are there?
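As referenced above, a minimal MATLAB sketch of the model averaging formula, with hypothetical numbers standing in for log marginal likelihoods and predictive densities (none of these values come from the slides):

```matlab
% Bayesian model averaging over three models, given (hypothetical) log marginal
% likelihoods log p(x | M_k) and predictive densities p(x* | M_k) at a test point.
logZ  = [-102.3; -98.7; -99.5];   % hypothetical log p(x | M_k)
pred  = [ 0.012;  0.031;  0.024]; % hypothetical p(x* | M_k)
prior = [1; 1; 1] / 3;            % p(M_k), uniform

logpost = logZ + log(prior);
logpost = logpost - max(logpost);           % subtract the max for numerical stability
postM   = exp(logpost) / sum(exp(logpost)); % posterior model probabilities p(M_k | x)

pred_avg = sum(pred .* postM);              % p(x* | x) = sum_k p(x* | M_k) p(M_k | x)
```

The hard part, as the slides note, is computing logZ in the first place; the nonparametric alternative sidesteps this by working with one really large model.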
Model Selection/Averaging
Spike Sorting: How many neurons are there? [Görür 2007, Wood et al. 2006]

Model Selection/Averaging
Topic Modelling: How many topics are there? [Blei et al. 2004, Teh et al. 2006]

Model Selection/Averaging
Grammar Induction: How many grammar symbols are there? [Liang et al. 2007, Finkel et al. 2007]

Model Selection/Averaging
Visual Scene Analysis: How many objects, parts, features? [Figure from Sudderth; Sudderth et al. 2007]

Finite Mixture Models
A finite mixture model is defined as follows:
  π ∼ Dirichlet(α/K, ..., α/K)
  zi | π ∼ Discrete(π)
  φk ∼ H,  k = 1, ..., K
  xi | zi, {φk} ∼ F(· | φ_{zi}),  i = 1, ..., n
Model selection/averaging over:
  Hyperparameters in H.
  Dirichlet parameter α.
  Number of components K.
Determining K is hardest. (A generative sketch of this model appears at the end of this excerpt.)

Infinite Mixture Models
Imagine that K is really large.
If the parameters φk and mixing proportions π are integrated out, the number of remaining latent variables does not grow with K, so there is no overfitting.
At most n components will be associated with data, aka "active". Usually, the number of active components is much less than n.
This gives an infinite mixture model.
Demo: dpm_demo2d
Issue 1: can we take this limit K → ∞?
Issue 2: what is the corresponding limiting model?
[Rasmussen 2000]

Gaussian Processes
What are they?
A Gaussian process (GP) is a distribution over functions f : X → R.
Denote f ∼ GP if f is a GP-distributed random function.
For any finite set of input points x1, ..., xn, we require (f(x1), ..., f(xn)) to be multivariate Gaussian.

Gaussian Processes
What are they?
The GP is parametrized by its mean m(x) and covariance c(x, y) functions:
  (f(x1), ..., f(xn)) ∼ N((m(x1), ..., m(xn)), C),  where Cij = c(xi, xj) for i, j = 1, ..., n.
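A minimal MATLAB sketch of this finite-dimensional view: drawing one random function at a grid of inputs from the multivariate Gaussian above. The zero mean function and squared-exponential covariance are illustrative assumptions, since the slide leaves m and c unspecified; mvnrnd requires the Statistics Toolbox.

```matlab
% Draw (f(x_1), ..., f(x_n)) ~ N(m, C) at a grid of inputs.
% Assumed for illustration: m(x) = 0 and c(x, y) = exp(-(x - y)^2 / 2).
xs = linspace(-5, 5, 100)';
[X1, X2] = meshgrid(xs, xs);
m = zeros(size(xs));
C = exp(-0.5 * (X1 - X2).^2);          % covariance matrix C_ij = c(x_i, x_j)
C = C + 1e-8 * eye(numel(xs));         % small jitter for numerical stability
f = mvnrnd(m', C)';                    % one draw of the random function at xs
plot(xs, f);
```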
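Finally, as referenced on the Finite Mixture Models slide, a minimal MATLAB sketch of that generative model. Gaussian components F(· | φ) = N(φ, 1) and base distribution H = N(0, 10²) are illustrative assumptions; the slides leave F and H abstract. gamrnd and randsample require the Statistics Toolbox.

```matlab
% Generate data from the finite mixture model:
%   pi ~ Dirichlet(alpha/K, ..., alpha/K),  z_i | pi ~ Discrete(pi),
%   phi_k ~ H,  x_i | z_i, phi ~ F(. | phi_{z_i}).
% Assumed for illustration: H = N(0, 10^2), F(. | phi) = N(phi, 1).
K = 5; n = 200; alpha = 1;

g    = gamrnd(alpha / K * ones(K, 1), 1); % Dirichlet draw via normalised Gamma draws
pi_k = g / sum(g);
phi  = 10 * randn(K, 1);                  % component parameters phi_k ~ H

z = randsample(K, n, true, pi_k);         % component assignments z_i | pi ~ Discrete(pi)
x = phi(z) + randn(n, 1);                 % observations x_i ~ N(phi_{z_i}, 1)
```

With α held fixed and K taken large, most of the weights π_k are tiny and only a few components end up "active", which is the intuition behind the Infinite Mixture Models slide.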