A tutorial on Dirichlet process mixture modeling

Yuelin Li (a, b), Elizabeth Schofield (a), Mithat Gönen (b)

a Department of Psychiatry & Behavioral Sciences, Memorial Sloan Kettering Cancer Center, New York, NY 10022, USA
b Department of Epidemiology & Biostatistics, Memorial Sloan Kettering Cancer Center, New York, NY 10017, USA

Journal of Mathematical Psychology 91 (2019) 128–144

Highlights

• Mathematical derivations essential to the development of the Dirichlet Process are provided.
• An accessible pedagogy for the complex and abstract mathematics in DP modeling.
• A publicly accessible computer program in R is explained line-by-line.

Keywords: Bayesian nonparametric; Dirichlet process; Mixture model; Gibbs sampling; Chinese Restaurant Process

Abstract

Bayesian nonparametric (BNP) models are becoming increasingly important in psychology, both as theoretical models of cognition and as analytic tools. However, existing tutorials tend to be at a level of abstraction largely impenetrable by non-technicians. This tutorial aims to help beginners understand key concepts by working through important but often omitted derivations carefully and explicitly, with a focus on linking the mathematics with a practical computational solution for the Dirichlet Process Mixture Model (DPMM), one of the most widely used BNP methods. Abstract concepts are made explicit and concrete for non-technical readers by working through the theory that gives rise to them. A publicly accessible computer program written in the statistical language R is explained line-by-line to help readers understand the computational algorithm. The algorithm is also linked to a construction method called the Chinese Restaurant Process, covered in an accessible tutorial in this journal (Gershman and Blei, 2012). The overall goals are to help readers understand the theory and application more fully, so that they may apply BNP methods in their own work and leverage the technical details in this tutorial to develop novel methods.

1. Introduction

Bayesian nonparametric (BNP) methods address model complexity in fundamentally different ways than conventional methods. For example, in conventional cluster analysis (e.g., k-means) the analyst fits several models with varying degrees of complexity (k = 1, 2, ..., K clusters) and then chooses the desired model based on model comparison metrics. The number of parameters is always fixed in these conventional analyses. BNP methods, however, are based on a statistical framework that contains in principle an infinite number of parameters. One such statistical framework is a stochastic process called the Dirichlet Process (DP). This tutorial aims to explain the DP to curious non-technicians.
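To make this conventional workflow concrete, here is a minimal R sketch (ours, not the paper's) that fits k-means solutions of increasing complexity and compares them. The placeholder data matrix y, the candidate range k = 1 to 6, and the total within-cluster sum of squares criterion are illustrative assumptions; the paper names no specific comparison metric.

    # Conventional cluster analysis: fit several models, each with a
    # fixed number of parameters, then choose k by a comparison metric
    # (here, the "elbow" in the total within-cluster sum of squares).
    set.seed(123)
    y <- matrix(rnorm(200), ncol = 2)   # placeholder data: 100 x 2 matrix
    wss <- sapply(1:6, function(k) kmeans(y, centers = k, nstart = 25)$tot.withinss)
    plot(1:6, wss, type = "b", xlab = "k (number of clusters)",
         ylab = "Total within-cluster sum of squares")
    # The analyst must pick k after the fact; contrast this with the BNP
    # framework described next, where complexity can grow with the data.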
We write primarily for a student who wants to learn BNP but finds most academic papers at a level of abstraction beyond his or her current reach. The DP is often described abstractly in the literature, as a stochastic process that generalizes the Dirichlet distribution from being the conjugate prior for a fixed number of categories into a prior for infinitely many categories. The term nonparametric in this broad context basically means that the model has in principle an infinite number of parameters; Austerweil, Gershman, Tenenbaum, and Griffiths (2015) describe this definition. A BNP model defined in this framework is not restricted to a specific parametric class, so there are minimal assumptions about what models are allowed. An emergent property of this flexibility in model parameterization is that the structural complexity of a model can grow as data accrue. The DP is also characterized as a distribution over distributions (e.g., Escobar & West, 1995; Jara, 2017; Jara, Hanson, Quintana, Müller, & Rosner, 2011; Teh, 2017). A student reading these materials may absorb all these concepts perfectly well, yet have only a vague impression of where these somewhat abstract concepts come from. This tutorial aims to make these concepts more concrete, explicit, and transparent.
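As a concrete illustration (ours, not the paper's) of the finite building block the DP generalizes, the sketch below draws category probabilities from a Dirichlet prior over K = 4 categories, using the standard construction that normalizes independent Gamma draws. The concentration parameters alpha = c(1, 1, 1, 1) are arbitrary illustrative values, and the helper rdirichlet is written out here rather than taken from a package.

    # A draw from a Dirichlet(alpha_1, ..., alpha_K) prior is a probability
    # vector over K categories; it can be generated by normalizing
    # independent Gamma(alpha_k, 1) draws.
    rdirichlet <- function(n, alpha) {
      k <- length(alpha)
      g <- matrix(rgamma(n * k, shape = alpha), ncol = k, byrow = TRUE)
      g / rowSums(g)   # each row now sums to 1: one category distribution
    }
    set.seed(1)
    rdirichlet(5, alpha = c(1, 1, 1, 1))  # 5 draws over K = 4 categories
    # Each row is itself a probability distribution over the categories,
    # one concrete sense in which the Dirichlet (and, with infinitely
    # many categories, the DP) is a distribution over distributions.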
Articles with a focus on applications of BNP methods offer limited technical details (e.g., Karabatsos, 2017; Karabatsos & Walker, 2009; Navarro, Griffiths, Steyvers, & Lee, 2006). More technical papers, such as several recent sources (Ghoshal & van der Vaart, 2015; Hjort, Holmes, Müller, & Walker, 2012; Jara, 2017; Müller, Quintana, Jara, & Hanson, 2015; Orbanz & Teh, 2011), explain the DP through Measure Theory, a subject that non-technicians are probably not familiar with. Classic papers (e.g., Antoniak, 1974; Blackwell & MacQueen, 1973; Ferguson, 1973) and a recent overview (Jara, 2017) are also abstract. To minimize these abstract concepts, the DP is often explained through construction methods, metaphors that help explain how the computation is actually implemented at a more intuitive level (Gershman & Blei, 2012). However, important details and derivations are typically not covered. There is, we believe, a need to bridge the gaps between these approaches in order to help non-technical readers understand more fully how the DP is actually implemented, in a self-contained how-to guide.

There has been rapid growth in applications involving BNP methods, including in human cognition (Anderson, 1991; Austerweil & Griffiths, 2013; Collins & Frank, 2013), modeling individual variability (Liew, Howe, & Little, 2016; Navarro et al., 2006), and psychometrics (Karabatsos & Walker, 2008, 2009). The DP is one of the simplest and most widely used methods that support BNP. There are already many tutorials and how-to guides (e.g., Gershman & Blei, 2012; Houpt, 2018; Kourouklides, 2018; Orbanz, 2018). This tutorial offers something different: we aim to make the technical details more accessible to non-technicians. Thus, the details are described much more explicitly than what is typically available in elementary introductions. We aim to describe in great detail exactly how a DP cluster analysis spawns a new cluster. Does the DP somehow calculate a fit between an observation and its cluster? These omitted technical details are the focus of this tutorial, covered in an illustrative cluster analysis using DP mixture modeling. As we proceed through the tutorial, we will interject explanations of the omitted derivations. The intended readers are advanced graduate students in a quantitative psychology program or a related field: someone who is familiar with probability theory and conditional probability but unfamiliar with Measure Theory. Familiarity with R programming and Bayesian statistics is needed (e.g., at the introductory level in Lee, 2012). However, the required programming skills are no more complicated than ...

This tutorial is organized as follows. To establish a basic framework, we begin with a straightforward finite Gaussian mixture model with the number of clusters K fixed and known. Then we explain in detail how the DP extends it to an infinite mixture model. Next, the construction method called the Chinese Restaurant Process (CRP) is explained (Aldous, 1985; Blackwell & MacQueen, 1973). We then introduce a computer program written in the statistical language R to demonstrate what the DP "looks like" in action, as it spawns new clusters while data accrue. The R program is adapted from the tutorial by Broderick (2016), an implementation of the Gibbs sampling algorithm in Neal (2000, algorithm 3). Our exposition may also be useful for technical experts as a refresher on the derivations omitted in Neal (2000) and other academic papers.

2. Finite mixture model

2.1. Hypothetical data

Fig. 1. A hypothetical example of psychological assessment scores from 240 individuals divided into four known clusters. From each cluster, a random sample of 60 individuals is drawn from a bivariate Gaussian distribution with a specified mean and covariance matrix. Also plotted are the 95th percentile ellipsoids for the underlying distributions.

The hypothetical dataset in Fig. 1 will be used throughout this tutorial to make the explanations more explicit and concrete. Each observation consists of two assessments, both standardized onto a z-scale. Each dot represents one participant's two measurements, e.g., on anxiety (y1) and depressive symptoms (y2), respectively. The observations are simulated from K = 4 known clusters, each represented by a bivariate Gaussian distribution with mean µ_k and covariance matrix σ_k². The presumed known parameters for all 4 clusters are shown in Fig. 1. The DPMM is expected to find a 4-cluster solution without being explicitly told so.

2.2. Model likelihood function in a finite mixture model

We begin with a straightforward finite mixture model with the number of clusters K fixed and known, to establish the basic framework. The likelihood of the model is defined as a mixture of Gaussian distributions:

p(y_i \mid \mu_1, \ldots, \mu_K, \sigma_1^2, \ldots, \sigma_K^2, \pi_1, \ldots, \pi_K) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(y_i; \mu_k, \sigma_k^2)
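To connect this likelihood with the hypothetical data of Section 2.1, the sketch below (ours, not the paper's) simulates four bivariate Gaussian clusters of 60 observations each, in the spirit of Fig. 1, and evaluates the mixture likelihood of one observation. The cluster means, the shared covariance matrix, and the equal weights π_k = 1/4 are invented stand-ins, since the paper specifies its parameter values only in the figure; the MASS and mvtnorm packages are assumed.

    library(MASS)     # mvrnorm(): samples from a multivariate Gaussian
    library(mvtnorm)  # dmvnorm(): multivariate Gaussian density

    set.seed(2019)
    K <- 4
    # Hypothetical cluster parameters in the spirit of Fig. 1 (the paper's
    # exact values appear only in the figure, so these are stand-ins).
    mu    <- list(c(-2, -2), c(-2, 2), c(2, -2), c(2, 2))
    Sigma <- 0.5 * diag(2)             # shared covariance, for simplicity
    y     <- do.call(rbind, lapply(mu, function(m) mvrnorm(60, m, Sigma)))

    # Bivariate version of the likelihood above:
    # p(y_i | ...) = sum_k pi_k * N(y_i; mu_k, Sigma_k), with equal weights.
    pi_k <- rep(1 / K, K)
    mixture_lik <- function(y_i) {
      sum(sapply(1:K, function(k) pi_k[k] * dmvnorm(y_i, mean = mu[[k]], sigma = Sigma)))
    }
    mixture_lik(y[1, ])  # likelihood of the first simulated observation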