Gaussian Mixture Models
David S. Rosenberg, New York University
DS-GA 1003 / CSCI-GA 2567, April 24, 2018

Contents

1. Gaussian Mixture Models
2. Mixture Models
3. Learning in Gaussian Mixture Models
4. Issues with MLE for GMM
5. The EM Algorithm for GMM

Gaussian Mixture Models

Probabilistic Model for Clustering

Let's consider the following generative model (i.e. a way to generate data). Suppose
1. there are k clusters (or "mixture components"), and
2. we have a probability density for each cluster.

Generate a point as follows:
1. Choose a random cluster z \in \{1, 2, ..., k\}.
2. Choose a point from the distribution for cluster z.

Data generated in this way is said to have a mixture distribution.

Gaussian Mixture Model (k = 3)

1. Choose z \in \{1, 2, 3\} with p(1) = p(2) = p(3) = 1/3.
2. Choose x \mid z \sim N(x \mid \mu_z, \Sigma_z).

Gaussian Mixture Model Parameters (k Components)

Cluster probabilities:        \pi = (\pi_1, ..., \pi_k)
Cluster means:                \mu = (\mu_1, ..., \mu_k)
Cluster covariance matrices:  \Sigma = (\Sigma_1, ..., \Sigma_k)

For now, suppose all these parameters are known. We'll discuss how to learn or estimate them later.

Gaussian Mixture Model: Joint Distribution

Factorize the joint density:

    p(x, z) = p(z)\, p(x \mid z) = \pi_z \, N(x \mid \mu_z, \Sigma_z)

- \pi_z is the probability of choosing cluster z.
- x \mid z has distribution N(\mu_z, \Sigma_z).
- The z corresponding to x is the true cluster assignment.

Suppose we know the model parameters \pi_z, \mu_z, \Sigma_z. Then we can easily evaluate the joint density p(x, z).

Latent Variable Model

We observe x. We don't observe z (the cluster assignment). The cluster assignment z is called a hidden variable or latent variable.

Definition: A latent variable model is a probability model for which certain variables are never observed.

e.g. The Gaussian mixture model is a latent variable model.

The GMM "Inference" Problem

We observe x, and we want to know its cluster assignment z. The conditional probability for cluster z given x is

    p(z \mid x) = p(x, z) / p(x).

This conditional distribution is a soft assignment to clusters. A hard assignment is

    z^* = \arg\max_{z \in \{1, ..., k\}} p(z \mid x).

So if we know the model parameters, clustering is trivial.

Mixture Models

General Mixture Models: Generative Construction

Let S be a set of k probability distributions ("mixture components"), and let \pi = (\pi_1, ..., \pi_k) be a distribution on \{1, ..., k\} ("mixture weights"). Suppose we generate x with the following procedure:
1. Choose a distribution randomly from S according to \pi.
2. Sample x from the chosen distribution.

Then we say x has a mixture distribution.
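Since the parameters fully determine both the generative procedure and the soft assignment p(z | x), both steps take only a few lines of code. Below is a minimal sketch in Python (NumPy/SciPy); the specific parameter values (k = 3 with equal weights, and the particular means and covariances) are invented purely for illustration, and the helper names `sample_gmm` and `soft_assignment` are hypothetical.

```python
# Minimal sketch of the GMM generative procedure and the "inference" step.
# The parameter values below are illustrative, not taken from the lecture.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Assumed example parameters for a 2-D GMM with k = 3 components.
pi = np.array([1/3, 1/3, 1/3])                       # mixture weights
mus = [np.array([0., 0.]), np.array([4., 0.]), np.array([0., 4.])]
Sigmas = [np.eye(2), np.eye(2), np.array([[1.0, 0.5], [0.5, 1.0]])]

def sample_gmm(n):
    """Generate n points: draw z ~ pi, then x | z ~ N(mu_z, Sigma_z)."""
    zs = rng.choice(len(pi), size=n, p=pi)
    xs = np.array([rng.multivariate_normal(mus[z], Sigmas[z]) for z in zs])
    return xs, zs

def soft_assignment(x):
    """Return p(z | x), proportional to pi_z * N(x | mu_z, Sigma_z)."""
    joint = np.array([pi[z] * multivariate_normal.pdf(x, mus[z], Sigmas[z])
                      for z in range(len(pi))])
    return joint / joint.sum()                       # divide by p(x)

xs, zs = sample_gmm(200)
print(soft_assignment(xs[0]))                        # soft assignment for one point
print(np.argmax(soft_assignment(xs[0])))             # hard assignment z*
```

The last line is exactly the hard assignment z^* = argmax_z p(z | x) described above; the normalization by `joint.sum()` is the division by p(x).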
Mixture Densities

Suppose we have a mixture distribution with mixture components represented as densities p_1, ..., p_k and mixture weights \pi = (\pi_1, ..., \pi_k). Then the corresponding probability density for x is

    p(x) = \sum_{i=1}^{k} \pi_i \, p_i(x).

Note that p is a convex combination of the mixture component densities. p(x) is called a mixture density. Conversely, if x has a density of this form, then x has a mixture distribution.

Gaussian Mixture Model (GMM): Marginal Distribution

For example, the marginal distribution for a single observation x in a GMM is

    p(x) = \sum_{z=1}^{k} p(x, z) = \sum_{z=1}^{k} \pi_z \, N(x \mid \mu_z, \Sigma_z).

Learning in Gaussian Mixture Models

The GMM "Learning" Problem

Given data x_1, ..., x_n drawn from a GMM, estimate the parameters:

Cluster probabilities:        \pi = (\pi_1, ..., \pi_k)
Cluster means:                \mu = (\mu_1, ..., \mu_k)
Cluster covariance matrices:  \Sigma = (\Sigma_1, ..., \Sigma_k)

Once we have the parameters, we're done: just do "inference" to get cluster assignments.

Estimating/Learning the Gaussian Mixture Model

One approach to learning is maximum likelihood: find the parameter values with the highest likelihood for the observed data. The model likelihood for D = (x_1, ..., x_n) sampled i.i.d. from a GMM is

    L(\pi, \mu, \Sigma) = \prod_{i=1}^{n} p(x_i) = \prod_{i=1}^{n} \sum_{z=1}^{k} \pi_z \, N(x_i \mid \mu_z, \Sigma_z).

As usual, we'll take our objective function to be the log of this:

    J(\pi, \mu, \Sigma) = \sum_{i=1}^{n} \log \left( \sum_{z=1}^{k} \pi_z \, N(x_i \mid \mu_z, \Sigma_z) \right).

Review: Estimating a Gaussian Distribution

Recall that the density for x \sim N(\mu, \Sigma) is

    p(x \mid \mu, \Sigma) = \frac{1}{\sqrt{|2\pi\Sigma|}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right),

and the log-density is

    \log p(x \mid \mu, \Sigma) = -\frac{1}{2} \log |2\pi\Sigma| - \frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu).

To estimate \mu and \Sigma from a sample x_1, ..., x_n i.i.d. N(\mu, \Sigma), we'll maximize the log joint density:

    J(\mu, \Sigma) = \sum_{i=1}^{n} \log p(x_i \mid \mu, \Sigma) = -\frac{n}{2} \log |2\pi\Sigma| - \frac{1}{2} \sum_{i=1}^{n} (x_i - \mu)^T \Sigma^{-1} (x_i - \mu).

This is a solid exercise in vector and matrix differentiation: find \hat{\mu} and \hat{\Sigma} satisfying

    \nabla_{\mu} J(\mu, \Sigma) = 0 \quad \text{and} \quad \nabla_{\Sigma} J(\mu, \Sigma) = 0.

We get a closed-form solution:

    \hat{\mu}_{MLE} = \frac{1}{n} \sum_{i=1}^{n} x_i,
    \qquad
    \hat{\Sigma}_{MLE} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu}_{MLE})(x_i - \hat{\mu}_{MLE})^T.

Properties of the GMM Log-Likelihood

The GMM log-likelihood is

    J(\pi, \mu, \Sigma) = \sum_{i=1}^{n} \log \left( \sum_{z=1}^{k} \frac{\pi_z}{\sqrt{|2\pi\Sigma_z|}} \exp\left( -\frac{1}{2} (x_i - \mu_z)^T \Sigma_z^{-1} (x_i - \mu_z) \right) \right).

Let's compare to the log-likelihood for a single Gaussian:

    J(\mu, \Sigma) = -\frac{n}{2} \log |2\pi\Sigma| - \frac{1}{2} \sum_{i=1}^{n} (x_i - \mu)^T \Sigma^{-1} (x_i - \mu).

For a single Gaussian, the log cancels the exp in the Gaussian density, so things simplify a lot. For the GMM, the sum inside the log prevents this cancellation, so the expression is more complicated and there is no closed-form expression for the MLE.
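The closed-form single-Gaussian MLE from the review above is a one-liner in code. Here is a minimal sketch; the helper name `gaussian_mle` is hypothetical, and `xs` is any (n, d) array of samples, such as the one drawn in the earlier snippet.

```python
# Closed-form MLE for a single Gaussian, as in the review above:
# mu_hat = (1/n) sum_i x_i,  Sigma_hat = (1/n) sum_i (x_i - mu_hat)(x_i - mu_hat)^T.
import numpy as np

def gaussian_mle(xs):
    n = xs.shape[0]
    mu_hat = xs.mean(axis=0)
    centered = xs - mu_hat
    Sigma_hat = centered.T @ centered / n   # note: 1/n, not the unbiased 1/(n-1)
    return mu_hat, Sigma_hat

mu_hat, Sigma_hat = gaussian_mle(xs)        # xs from the earlier sampling snippet
```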
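Although the GMM MLE has no closed form, the objective J(\pi, \mu, \Sigma) itself is straightforward to evaluate numerically. Because the sum over z sits inside the log, a numerically stable implementation works in log space with log-sum-exp. A minimal sketch follows; `gmm_log_likelihood` is a hypothetical helper, and `pi`, `mus`, `Sigmas`, and `xs` are the illustrative values defined in the earlier snippet, not quantities from the lecture.

```python
# Evaluate J(pi, mu, Sigma) = sum_i log sum_z pi_z N(x_i | mu_z, Sigma_z).
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(xs, pi, mus, Sigmas):
    # log_probs[i, z] = log pi_z + log N(x_i | mu_z, Sigma_z)
    log_probs = np.column_stack([
        np.log(pi[z]) + multivariate_normal.logpdf(xs, mus[z], Sigmas[z])
        for z in range(len(pi))
    ])
    # The sum over z is inside the log, so use log-sum-exp for stability.
    return logsumexp(log_probs, axis=1).sum()

print(gmm_log_likelihood(xs, pi, mus, Sigmas))
```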
Issues with MLE for GMM

Identifiability Issues for GMM

Suppose we have found parameters

Cluster probabilities:        \pi = (\pi_1, ..., \pi_k)
Cluster means:                \mu = (\mu_1, ..., \mu_k)
Cluster covariance matrices:  \Sigma = (\Sigma_1, ..., \Sigma_k)

that are at a local optimum. What happens if we shuffle the clusters, e.g. switch the labels for clusters 1 and 2? We'll get the same likelihood. How many such equivalent settings are there? Assuming all clusters are distinct, there are k! equivalent solutions. This is not a problem per se, but something to be aware of.

Singularities for GMM

Consider a GMM fit to 7 data points in which one component is very "skinny," concentrated around a single data point (see the figure from Bishop's Pattern Recognition and Machine Learning, Figure 9.7). Let \sigma^2 be the variance of the skinny component. What happens to the likelihood as \sigma^2 \to 0? It diverges to infinity, since that component's density at its own data point grows without bound. In practice, we end up in local optima that do not have this problem, or we keep restarting the optimization until we do. A Bayesian approach or regularization will also solve the problem.

Gradient Descent / SGD for GMM

What about running gradient descent or SGD on

    J(\pi, \mu, \Sigma) = -\sum_{i=1}^{n} \log \left( \sum_{z=1}^{k} \pi_z \, N(x_i \mid \mu_z, \Sigma_z) \right)?

This can be done in principle, but we need to be clever about it. Each matrix \Sigma_1, ..., \Sigma_k has to be positive semidefinite. How can we maintain that constraint? Rewrite \Sigma_i = M_i M_i^T, where M_i is an unconstrained matrix.
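Below is a hedged sketch of that reparameterization idea in NumPy/SciPy. It is an illustration under several assumptions, not the lecture's method: the mixture weights are held fixed (the constraint on \pi is not discussed here), the names `unpack` and `neg_log_likelihood` are hypothetical, `xs` and `rng` are reused from the first snippet, and instead of hand-coded SGD it uses scipy.optimize.minimize with numerical gradients. The point it demonstrates is only that optimizing over unconstrained matrices M_i with \Sigma_i = M_i M_i^T keeps every covariance positive semidefinite automatically.

```python
# Sketch: unconstrained optimization of the GMM objective via Sigma_i = M_i M_i^T.
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def unpack(theta, k, d):
    """Split a flat unconstrained vector into means and covariance factors M."""
    mus = theta[:k * d].reshape(k, d)
    Ms = theta[k * d:].reshape(k, d, d)
    return mus, Ms

def neg_log_likelihood(theta, xs, pi, k):
    d = xs.shape[1]
    mus, Ms = unpack(theta, k, d)
    log_probs = []
    for z in range(k):
        Sigma = Ms[z] @ Ms[z].T + 1e-6 * np.eye(d)   # PSD by construction (+ jitter)
        log_probs.append(np.log(pi[z]) + multivariate_normal.logpdf(xs, mus[z], Sigma))
    return -logsumexp(np.column_stack(log_probs), axis=1).sum()

# Hypothetical usage with the data xs sampled in the first snippet (d = 2).
k, d = 3, 2
theta0 = np.concatenate([rng.standard_normal(k * d),        # random initial means
                         np.tile(np.eye(d).ravel(), k)])    # M_i initialized to identity
result = minimize(neg_log_likelihood, theta0, args=(xs, np.ones(k) / k, k))
print(result.fun)   # final negative log-likelihood
```

The small jitter added to M M^T simply keeps the covariance invertible when M happens to be rank-deficient; it is a convenience for this sketch, not part of the reparameterization itself.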