Gaussian Mixture Models (GMM) and ML Estimation Examples 3


Let us look at the log likelihood function

$$
\begin{aligned}
l(\theta) = \log L(\theta) &= \sum_{i=1}^{n} \log P(X_i \mid \theta) \\
&= 2\left(\log\tfrac{2}{3} + \log\theta\right) + 3\left(\log\tfrac{1}{3} + \log\theta\right) + 3\left(\log\tfrac{2}{3} + \log(1-\theta)\right) + 2\left(\log\tfrac{1}{3} + \log(1-\theta)\right) \\
&= C + 5\log\theta + 5\log(1-\theta),
\end{aligned}
$$

where $C$ is a constant which does not depend on $\theta$. The log likelihood function is easier to maximize than the likelihood function itself. Set the derivative of $l(\theta)$ with respect to $\theta$ to zero:

$$
\frac{dl(\theta)}{d\theta} = \frac{5}{\theta} - \frac{5}{1-\theta} = 0,
$$

and the solution gives us the MLE, $\hat\theta = 0.5$. Recall that the method of moments estimate is $\hat\theta = 5/12$, which is different from the MLE.

Example 2: Suppose $X_1, X_2, \ldots, X_n$ are i.i.d. random variables with density function

$$
f(x \mid \sigma) = \frac{1}{2\sigma}\exp\left(-\frac{|x|}{\sigma}\right).
$$

Find the maximum likelihood estimate of $\sigma$.

Solution: The log-likelihood function is

$$
l(\sigma) = \sum_{i=1}^{n}\left[-\log 2 - \log\sigma - \frac{|X_i|}{\sigma}\right].
$$

Set the derivative with respect to $\sigma$ to zero:

$$
l'(\sigma) = \sum_{i=1}^{n}\left[-\frac{1}{\sigma} + \frac{|X_i|}{\sigma^2}\right] = -\frac{n}{\sigma} + \frac{\sum_{i=1}^{n}|X_i|}{\sigma^2} = 0,
$$

and this gives us the MLE for $\sigma$ as

$$
\hat\sigma = \frac{\sum_{i=1}^{n}|X_i|}{n}.
$$

Again this is different from the method of moments estimate, which is

$$
\hat\sigma = \sqrt{\frac{\sum_{i=1}^{n}X_i^2}{2n}}.
$$

Mean and Variance of Gaussian

Example 3: Use maximum likelihood to estimate the parameters $\mu$ and $\sigma$ for the normal density

$$
f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\},
$$

based on a random sample $X_1, \ldots, X_n$.

Solution: In this example we have two unknown parameters, $\mu$ and $\sigma$, so the parameter $\theta = (\mu, \sigma)$ is a vector. We first write out the log likelihood function as

$$
l(\mu, \sigma) = \sum_{i=1}^{n}\left[-\log\sigma - \frac{1}{2}\log 2\pi - \frac{1}{2\sigma^2}(X_i-\mu)^2\right] = -n\log\sigma - \frac{n}{2}\log 2\pi - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(X_i-\mu)^2.
$$

Setting the partial derivatives to 0, we have

$$
\frac{\partial l(\mu,\sigma)}{\partial\mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(X_i-\mu) = 0,
$$

$$
\frac{\partial l(\mu,\sigma)}{\partial\sigma} = -\frac{n}{\sigma} + \sigma^{-3}\sum_{i=1}^{n}(X_i-\mu)^2 = 0.
$$

Solving these equations gives the MLE for $\mu$ and $\sigma$ (the sample mean and the square root of the sample variance with divisor $n$):

$$
\hat\mu = \bar X \quad\text{and}\quad \hat\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(X_i-\bar X)^2}.
$$

This time the MLE is the same as the result of the method of moments.

From these examples, we can see that the maximum likelihood result may or may not be the same as the result of the method of moments.

Example 4: The Pareto distribution has been used in economics as a model for a density function with a slowly decaying tail:

$$
f(x \mid x_0, \theta) = \theta x_0^{\theta} x^{-\theta-1}, \quad x \ge x_0, \quad \theta > 1.
$$

Assume that $x_0 > 0$ is given and that $X_1, X_2, \ldots, X_n$ is an i.i.d. sample. Find the MLE of $\theta$.

Solution: The log-likelihood function is

$$
l(\theta) = \sum_{i=1}^{n}\log f(X_i \mid \theta) = \sum_{i=1}^{n}\left(\log\theta + \theta\log x_0 - (\theta+1)\log X_i\right) = n\log\theta + n\theta\log x_0 - (\theta+1)\sum_{i=1}^{n}\log X_i.
$$

Set the derivative with respect to $\theta$ to zero:

$$
\frac{dl(\theta)}{d\theta} = \frac{n}{\theta} + n\log x_0 - \sum_{i=1}^{n}\log X_i = 0.
$$
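The closed-form estimates in Examples 1 to 4 can be checked numerically. The sketch below is only an illustration, not part of the original notes: the function names are hypothetical, the Example 1 maximizer is found by a simple grid search, and the Pareto estimate uses $\hat\theta = n / (\sum_i \log X_i - n\log x_0)$, which is what one gets by solving the last equation of Example 4 for $\theta$ (a step the text does not write out).

```python
import math

def example1_loglik(theta):
    """Example 1 log likelihood without the constant C: 5 log(theta) + 5 log(1 - theta)."""
    return 5 * math.log(theta) + 5 * math.log(1 - theta)

def laplace_mle(xs):
    """Example 2: sigma_hat = (1/n) * sum |X_i|."""
    return sum(abs(x) for x in xs) / len(xs)

def normal_mle(xs):
    """Example 3: mu_hat = sample mean, sigma_hat = sqrt of the divisor-n variance."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    return mu, sigma

def pareto_score(theta, xs, x0):
    """Derivative of the Pareto log likelihood: n/theta + n log x0 - sum log X_i."""
    n = len(xs)
    return n / theta + n * math.log(x0) - sum(math.log(x) for x in xs)

def pareto_mle(xs, x0):
    """Example 4: root of pareto_score, assuming every x >= x0 and not all equal to x0."""
    n = len(xs)
    return n / (sum(math.log(x) for x in xs) - n * math.log(x0))

# Example 1: maximize over a grid of theta values in (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
theta_hat = max(grid, key=example1_loglik)
print("Example 1 grid maximizer:", theta_hat)

# Example 4: the score should vanish at the closed-form estimate (made-up data).
sample = [1.5, 2.0, 4.0, 1.2]
print("Pareto score at MLE:", pareto_score(pareto_mle(sample, 1.0), sample, 1.0))
```

A grid search is used for Example 1 only to keep the check independent of the calculus; any one-dimensional optimizer would serve equally well.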
Gaussian Mixture Model

• Probabilistic story: Each cluster is associated with a Gaussian distribution. To generate data, randomly choose a cluster $k$ with probability $\pi_k$ and sample from its distribution.
• Likelihood:

$$
\Pr(x) = \sum_{k=1}^{K}\pi_k\,\mathcal{N}(x \mid \mu_k, \Sigma_k), \quad\text{where}\quad \sum_{k=1}^{K}\pi_k = 1, \quad 0 \le \pi_k \le 1.
$$

$x$ is multidimensional.

Ref: Sriram Sankararaman, Clustering, https://people.eecs.berkeley.edu/~jordan/courses/294-fall09/lectures/clustering/slides.pdf

Can we use the ML estimation method to estimate the unknown parameters $\pi_k, \mu_k, \Sigma_k$? It is not easy:

• The loss function is the negative log likelihood

$$
-\log\Pr(x \mid \pi, \mu, \Sigma) = -\sum_{i=1}^{n}\log\left\{\sum_{k=1}^{K}\pi_k\,\mathcal{N}(x_i \mid \mu_k, \Sigma_k)\right\}.
$$

• Why is this function difficult to optimize?
• Notice that the sum over the components appears inside the log, thus coupling all the parameters.

However, it is possible to obtain an iterative solution!

• We need a procedure that optimizes the log likelihood by working with the (easier) complete log likelihood.
• "Fill in" the latent variables using the current estimate of the parameters.
• Adjust the parameters based on the filled-in variables.

We can estimate the parameters using the iterative Expectation-Maximization (EM) algorithm.

• We are given the observations $x_i$, $i = 1, 2, \ldots, n$.
• Each $x_i$ is associated with a latent variable $z_i = (z_{i1}, \ldots, z_{iK})$.
• Given the complete data $(x, z) = (x_i, z_i)$, $i = 1, \ldots, n$, we can estimate the parameters by maximizing the complete data log likelihood

$$
\log\Pr(x, z \mid \pi, \mu, \Sigma) = \sum_{i=1}^{n}\sum_{k=1}^{K} z_{ik}\left\{\log\pi_k + \log\mathcal{N}(x_i \mid \mu_k, \Sigma_k)\right\}.
$$

• Notice that the $\pi_k$ and $(\mu_k, \Sigma_k)$ decouple. A trivial closed-form solution exists.
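To make the probabilistic story and the two log likelihoods concrete, here is a minimal sketch under simplifying assumptions that are not in the slides: the data are one-dimensional, so each covariance $\Sigma_k$ reduces to a scalar variance, and the latent variable is stored as a component index rather than a one-hot vector $z_i$. All function names are hypothetical.

```python
import math
import random

def log_normal_pdf(x, mu, var):
    """Log density of a 1-D Gaussian N(x | mu, var)."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def sample_gmm(n, pis, mus, variances, seed=0):
    """Probabilistic story: pick cluster k with probability pi_k, then sample from it."""
    rng = random.Random(seed)
    xs, zs = [], []
    for _ in range(n):
        k = rng.choices(range(len(pis)), weights=pis)[0]
        zs.append(k)
        xs.append(rng.gauss(mus[k], math.sqrt(variances[k])))
    return xs, zs

def neg_log_likelihood(xs, pis, mus, variances):
    """-sum_i log sum_k pi_k N(x_i | mu_k, var_k): the sum over k sits inside the log."""
    total = 0.0
    for x in xs:
        mix = sum(p * math.exp(log_normal_pdf(x, m, v))
                  for p, m, v in zip(pis, mus, variances))
        total -= math.log(mix)
    return total

def complete_log_likelihood(xs, zs, pis, mus, variances):
    """sum_i { log pi_{z_i} + log N(x_i | mu_{z_i}, var_{z_i}) }: the terms decouple."""
    return sum(math.log(pis[k]) + log_normal_pdf(x, mus[k], variances[k])
               for x, k in zip(xs, zs))

# Illustrative parameter values, chosen arbitrarily for this sketch.
pis, mus, variances = [0.3, 0.7], [-2.0, 3.0], [1.0, 0.5]
xs, zs = sample_gmm(50, pis, mus, variances)
print("negative log likelihood:", neg_log_likelihood(xs, pis, mus, variances))
print("complete log likelihood:", complete_log_likelihood(xs, zs, pis, mus, variances))
```

Because $\Pr(x_i, z_i) \le \Pr(x_i)$ for every observation, the complete log likelihood can never exceed the (marginal) log likelihood, which gives a quick sanity check on both functions.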
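The latent variable $z_{ik}$ represents the contribution of the $k$-th Gaussian to $x_i$. Take the derivative of the log-likelihood with respect to $\pi_k, \mu_k, \Sigma_k$ and set it to zero to get the equations used in the EM algorithm.

The Expectation-Maximization (EM) algorithm

• Initialize the parameters $\pi_k, \mu_k, \Sigma_k$ with starting values.
• Iterate for $t = 1, 2, \ldots$ (here $t$ indexes the iteration, to keep it distinct from the component index $k$). Update equations at the $t$-th iteration:
• E-step: Given the current parameters, compute

$$
r_{ik} \triangleq E(z_{ik}) = \frac{\pi_k\,\mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{k'=1}^{K}\pi_{k'}\,\mathcal{N}(x_i \mid \mu_{k'}, \Sigma_{k'})}.
$$

• M-step: Maximize the expected complete log likelihood

$$
E\left[\log\Pr(x, z \mid \pi, \mu, \Sigma)\right] = \sum_{i=1}^{n}\sum_{k=1}^{K} r_{ik}\left\{\log\pi_k + \log\mathcal{N}(x_i \mid \mu_k, \Sigma_k)\right\}
$$

by updating the parameters:

$$
\pi_k^{(t+1)} = \frac{\sum_i r_{ik}}{n}, \qquad
\mu_k^{(t+1)} = \frac{\sum_i r_{ik}\,x_i}{\sum_i r_{ik}}, \qquad
\Sigma_k^{(t+1)} = \frac{\sum_i r_{ik}\,(x_i - \mu_k^{(t+1)})(x_i - \mu_k^{(t+1)})^T}{\sum_i r_{ik}}.
$$

• Iterate until the likelihood converges.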
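The E-step and M-step can be sketched in a few lines. This is an illustrative sketch, not the slides' implementation: it assumes one-dimensional data so that each $\Sigma_k$ is a scalar variance, it has no safeguard against a component's variance collapsing to zero, and the function names, starting values and toy data are all made up for the example.

```python
import math

def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian N(x | mu, var)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def log_likelihood(xs, pis, mus, variances):
    """sum_i log sum_k pi_k N(x_i | mu_k, var_k)."""
    return sum(math.log(sum(p * normal_pdf(x, m, v)
                            for p, m, v in zip(pis, mus, variances)))
               for x in xs)

def e_step(xs, pis, mus, variances):
    """Responsibilities r_ik = pi_k N(x_i|k) / sum_k' pi_k' N(x_i|k')."""
    resp = []
    for x in xs:
        w = [p * normal_pdf(x, m, v) for p, m, v in zip(pis, mus, variances)]
        total = sum(w)
        resp.append([wk / total for wk in w])
    return resp

def m_step(xs, resp):
    """Weighted updates of pi_k, mu_k and var_k from the responsibilities."""
    n, K = len(xs), len(resp[0])
    pis, mus, variances = [], [], []
    for k in range(K):
        nk = sum(r[k] for r in resp)
        mu = sum(r[k] * x for r, x in zip(resp, xs)) / nk
        var = sum(r[k] * (x - mu) ** 2 for r, x in zip(resp, xs)) / nk
        pis.append(nk / n)
        mus.append(mu)
        variances.append(var)
    return pis, mus, variances

def em(xs, pis, mus, variances, max_iter=200, tol=1e-9):
    """Alternate E- and M-steps until the log likelihood stops improving."""
    history = [log_likelihood(xs, pis, mus, variances)]
    for _ in range(max_iter):
        resp = e_step(xs, pis, mus, variances)
        pis, mus, variances = m_step(xs, resp)
        history.append(log_likelihood(xs, pis, mus, variances))
        if history[-1] - history[-2] < tol:
            break
    return pis, mus, variances, history

# Toy data with two well-separated groups (made up for this sketch).
data = [-2.1, -1.9, -2.0, -2.2, -1.8, 3.0, 3.1, 2.9, 3.2, 2.8]
pis, mus, variances, history = em(data, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
print("weights:", pis)
print("means:", mus)
print("variances:", variances)
```

Tracking the log likelihood in `history` gives a useful diagnostic: each EM iteration should leave it unchanged or higher, so a decrease usually signals a bug in one of the two steps.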