
K-Means and Gaussian Mixture Models

David Rosenberg

New York University

June 15, 2015

K-Means Clustering Example: Old Faithful Geyser

Looks like two clusters. How to find these clusters algorithmically?

K-Means Clustering k-Means: By Example

Standardize the data. Choose two cluster centers.

From Bishop’s Pattern recognition and machine learning, Figure 9.1(a).

K-Means Clustering k-means: by example

Assign each point to closest center.

From Bishop’s Pattern recognition and machine learning, Figure 9.1(b).

K-Means Clustering k-means: by example

Compute new class centers.

From Bishop’s Pattern recognition and machine learning, Figure 9.1(c).

K-Means Clustering k-means: by example

Assign points to closest center.

From Bishop’s Pattern recognition and machine learning, Figure 9.1(d).

K-Means Clustering k-means: by example

Compute cluster centers.

From Bishop’s Pattern recognition and machine learning, Figure 9.1(e).

K-Means Clustering k-means: by example

Iterate until convergence.

From Bishop’s Pattern recognition and machine learning, Figure 9.1(i).

K-Means Clustering k-means: formalization

Dataset D = {x_1, ..., x_n} with x_i ∈ R^d.
Goal (version 1): Partition the data into k clusters.
Goal (version 2): Partition R^d into k regions.

Let µ1,...,µk denote cluster centers.

K-Means Clustering k-means: formalization

For each xi , use a one-hot encoding to designate membership:

r_i = (0, 0, ..., 0, 0, 1, 0, 0) ∈ R^k

Let r_{ic} = 1(x_i assigned to cluster c). Then r_i = (r_{i1}, r_{i2}, ..., r_{ik}).

K-Means Clustering k-means: objective function

Find cluster centers and cluster assignments minimizing

J(r, \mu) = \sum_{i=1}^n \sum_{c=1}^k r_{ic} \|x_i - \mu_c\|^2

Is the objective function convex? What is the domain of J?
r ∈ {0,1}^{n×k}, which is not a convex set, so the domain of J is not convex ⟹ J is not a convex function.
We should expect local minima.
We could replace ‖·‖² with something else: e.g. using ‖·‖ (or any distance metric) gives k-medoids.
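To make the objective concrete, here is a minimal NumPy sketch (not from the slides) that evaluates J given a data matrix, a one-hot assignment matrix, and the centers; the names X, r, and mu are my own.

```python
import numpy as np

def kmeans_objective(X, r, mu):
    """J(r, mu) = sum_i sum_c r_ic * ||x_i - mu_c||^2.

    X:  (n, d) data matrix
    r:  (n, k) one-hot assignment matrix
    mu: (k, d) cluster centers
    """
    # Squared distances from every point to every center, shape (n, k).
    sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    return (r * sq_dists).sum()
```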

K-Means Clustering k-means algorithm

For fixed r (cluster assignments), minimizing over µ is easy:

J(r, \mu) = \sum_{i=1}^n \sum_{c=1}^k r_{ic} \|x_i - \mu_c\|^2 = \sum_{c=1}^k J_c(\mu_c),

where

J_c(\mu_c) = \sum_{\{i \,:\, x_i \text{ belongs to cluster } c\}} \|x_i - \mu_c\|^2.

J_c is minimized by taking µ_c to be the mean of the points assigned to cluster c:

\mu_c = \text{mean}(\{x_i \mid x_i \text{ belongs to cluster } c\})

K-Means Clustering k-means algorithm

For fixed µ (cluster centers), minimizing over r is easy:

J(r, \mu) = \sum_{i=1}^n \sum_{c=1}^k r_{ic} \|x_i - \mu_c\|^2

For each i, exactly one of the following terms is nonzero:

r_{i1} \|x_i - \mu_1\|^2, \; r_{i2} \|x_i - \mu_2\|^2, \; \dots, \; r_{ik} \|x_i - \mu_k\|^2

Take

r_{ic} = 1(c = \arg\min_j \|x_i - \mu_j\|^2)

That is, assign x_i to the cluster c with minimum distance ‖x_i − µ_c‖².

K-Means Clustering k-means algorithm (summary)

We will use an alternating minimization algorithm:

1. Choose initial cluster centers µ = (µ_1, ..., µ_k), e.g. k randomly chosen data points.

2. Repeat:

   1. For the given cluster centers, find the optimal cluster assignments:

      r_{ic}^{\text{new}} = 1(c = \arg\min_j \|x_i - \mu_j\|^2)

   2. Given the cluster assignments, find the optimal cluster centers:

      \mu_c^{\text{new}} = \arg\min_{m \in \mathbb{R}^d} \sum_{\{i \,:\, r_{ic} = 1\}} \|x_i - m\|^2
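A minimal NumPy sketch of this alternating minimization, not from the slides; the function name k_means and its defaults (100 iterations, random-data-point initialization) are my own choices.

```python
import numpy as np

def k_means(X, k, n_iters=100, rng=None):
    """Alternating minimization for the k-means objective J(r, mu)."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    # Step 1: initialize centers at k randomly chosen data points.
    mu = X[rng.choice(n, size=k, replace=False)].copy()
    labels = np.full(n, -1)
    for _ in range(n_iters):
        # Step 2.1: assign each point to its closest center.
        sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (n, k)
        new_labels = sq_dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing, so the centers won't change either
        labels = new_labels
        # Step 2.2: recompute each center as the mean of its assigned points.
        for c in range(k):
            if np.any(labels == c):        # guard against empty clusters
                mu[c] = X[labels == c].mean(axis=0)
    return mu, labels
```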

K-Means Clustering k-Means Algorithm: Convergence

Note: Objective value never increases in an update. (Obvious: worst case, everything stays the same)

Consider the sequence of objective values J_1, J_2, J_3, ...: it is monotonically non-increasing and bounded below by zero.

Therefore, the k-means objective value converges to inf_t J_t. Reminder: this is convergence to a local minimum of the objective, not necessarily the global minimum. It is best to repeat k-means several times, with different starting points.
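As a quick illustration of the restart strategy, assuming scikit-learn is available: its KMeans runs n_init random initializations and keeps the run with the lowest objective value (exposed as inertia_). The data here is a placeholder.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))      # placeholder data
# n_init=10 runs ten random initializations and keeps the one
# with the lowest objective value.
km = KMeans(n_clusters=2, init="random", n_init=10, random_state=0).fit(X)
print(km.inertia_, km.cluster_centers_)
```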

K-Means Clustering k-Means: Objective Function Convergence

Blue circles: objective after each “E” step (assigning each point to a cluster).
Red circles: objective after each “M” step (recomputing the cluster centers).

From Bishop’s Pattern recognition and machine learning, Figure 9.2.

K-Means Clustering k-Means Algorithm: Standardizing the data

With standardizing:

K-Means Clustering k-Means Algorithm: Standardizing the data

Without standardizing:

k-Means: Failure Cases k-Means: Suboptimal Local Minimum

The clustering for k = 3 below is a local minimum, but suboptimal:

From Sontag’s DS-GA 1003, 2014, Lecture 8.

Gaussian Mixture Models Probabilistic Model for Clustering

Let’s consider a probabilistic model for the data. Suppose:

1. There are k clusters.
2. We have a probability density for each cluster.

Generate a point as follows (a sampling sketch follows these steps):

1. Choose a random cluster Z ∈ {1, 2, ..., k}, with Z ∼ Multi(π_1, ..., π_k).

2. Choose a point from the distribution for cluster Z: X | Z = z ∼ p(x | z).
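A minimal NumPy sketch of this two-step generative process. The parameter values (pi, mus, Sigmas) are purely illustrative, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters for a 2-D mixture with k = 3 components.
pi = np.array([0.3, 0.5, 0.2])                         # cluster probabilities
mus = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])   # cluster means
Sigmas = np.stack([np.eye(2), 0.5 * np.eye(2), 2.0 * np.eye(2)])  # covariances

def sample_point():
    z = rng.choice(len(pi), p=pi)                       # step 1: pick a cluster
    x = rng.multivariate_normal(mus[z], Sigmas[z])      # step 2: draw from that cluster's Gaussian
    return z, x

samples = [sample_point() for _ in range(5)]
print(samples)
```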

Gaussian Mixture Models Gaussian Mixture Model (k = 3)

1. Choose Z ∈ {1, 2, 3} with Z ∼ Multi(1/3, 1/3, 1/3).
2. Choose X | Z = z ∼ N(X | µ_z, Σ_z).

Gaussian Mixture Models Gaussian Mixture Model: Joint Distribution

Factorize joint according to Bayes net:

p(x,z) = p(z)p(x | z)

= πz N(x | µz ,Σz )

πz is probability of choosing cluster z.

X | Z = z has distribution N(µz ,Σz ). z corresponding to x is the true cluster assignment.

Gaussian Mixture Models Latent Variable Model

Back in reality, we observe X , not (X ,Z). Cluster assignment Z is called a hidden variable.

Definition: A latent variable model is a probability model for which certain variables are never observed.

e.g. The Gaussian mixture model is a latent variable model.

Gaussian Mixture Models Model-Based Clustering

We observe X = x. The conditional distribution of the cluster Z given X = x is

p(z | X = x) = p(x,z)/p(x)

The conditional distribution is a soft assignment to clusters. A hard assignment is

z^* = \arg\max_{z \in \{1, \dots, k\}} P(Z = z \mid X = x).

So if we have the model, clustering is trivial.

Gaussian Mixture Models Estimating/Learning the Gaussian Mixture Model

We’ll use the common acronym GMM. What does it mean to “have” or “know” the GMM? It means knowing the parameters

Cluster probabilities: π = (π_1, ..., π_k)

Cluster means: µ = (µ_1, ..., µ_k)

Cluster covariance matrices: Σ = (Σ_1, ..., Σ_k)

We have a probability model: let’s find the MLE.

Suppose we have data D = {x1,...,xn}. We need the model likelihood for D.

Gaussian Mixture Models Gaussian Mixture Model: Marginal Distribution

Since we only observe X , we need the marginal distribution:

p(x) = \sum_{z=1}^k p(x, z) = \sum_{z=1}^k \pi_z N(x \mid \mu_z, \Sigma_z)

Note that p(x) is a convex combination of probability densities. This is a common form for a probability model...
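To make this concrete, here is a small sketch (not from the slides) that evaluates such a convex combination of Gaussian densities with SciPy; the parameter values are made up.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x, pi, mus, Sigmas):
    """p(x) = sum_z pi_z * N(x | mu_z, Sigma_z)."""
    return sum(pi_z * multivariate_normal.pdf(x, mean=mu_z, cov=Sigma_z)
               for pi_z, mu_z, Sigma_z in zip(pi, mus, Sigmas))

# Example with made-up 1-D parameters:
print(mixture_density(0.5, pi=[0.4, 0.6], mus=[0.0, 3.0], Sigmas=[1.0, 2.0]))
```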

Gaussian Mixture Models Mixture Distributions (or Mixture Models)

Definition: A probability density p(x) represents a mixture distribution (or mixture model) if we can write it as a convex combination of probability densities. That is,

p(x) = \sum_{i=1}^k w_i \, p_i(x),

where w_i > 0, \sum_{i=1}^k w_i = 1, and each p_i is a probability density.

In our Gaussian mixture model, X has a mixture distribution. More constructively, let S be a set of probability distributions:

1. Choose a distribution randomly from S.
2. Sample X from the chosen distribution.

Then X has a mixture distribution.

Gaussian Mixture Models Estimating/Learning the Gaussian Mixture Model

The model likelihood for D = {x1,...,xn} is

L(\pi, \mu, \Sigma) = \prod_{i=1}^n p(x_i) = \prod_{i=1}^n \sum_{z=1}^k \pi_z N(x_i \mid \mu_z, \Sigma_z).

As usual, we’ll take our objective function to be the log of this:

J(\pi, \mu, \Sigma) = \sum_{i=1}^n \log ( \sum_{z=1}^k \pi_z N(x_i \mid \mu_z, \Sigma_z) )
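A numerically stable way to compute this objective is via log-sum-exp; here is a sketch assuming SciPy is available (the function name gmm_log_likelihood and the array layout of pi, mus, Sigmas are my own).

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pi, mus, Sigmas):
    """J(pi, mu, Sigma) = sum_i log( sum_z pi_z N(x_i | mu_z, Sigma_z) )."""
    k = len(pi)
    # log( pi_z * N(x_i | mu_z, Sigma_z) ) for every point i and component z
    log_terms = np.column_stack([
        np.log(pi[z]) + multivariate_normal.logpdf(X, mean=mus[z], cov=Sigmas[z])
        for z in range(k)
    ])                                # shape (n, k)
    return logsumexp(log_terms, axis=1).sum()
```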

Gaussian Mixture Models Properties of the GMM Log-Likelihood

GMM log-likelihood:

J(\pi, \mu, \Sigma) = \sum_{i=1}^n \log ( \sum_{z=1}^k \pi_z N(x_i \mid \mu_z, \Sigma_z) )

Let’s compare to the log-likelihood for a single Gaussian:

\sum_{i=1}^n \log N(x_i \mid \mu, \Sigma) = -\frac{nd}{2}\log(2\pi) - \frac{n}{2}\log|\Sigma| - \frac{1}{2}\sum_{i=1}^n (x_i - \mu)'\Sigma^{-1}(x_i - \mu)

For a single Gaussian, the log cancels the exp in the Gaussian density ⟹ things simplify a lot.
For the GMM, the sum inside the log prevents this cancellation ⟹ the expression is more complicated, and there is no closed-form expression for the MLE.

Issues with MLE for GMM Identifiability Issues for GMM

Suppose we have found parameters

Cluster probabilities: π = (π_1, ..., π_k)

Cluster means: µ = (µ_1, ..., µ_k)

Cluster covariance matrices: Σ = (Σ_1, ..., Σ_k)

that are at a local minimum.

What happens if we shuffle the clusters? e.g. Switch the labels for clusters 1 and 2.

We’ll get the same likelihood. How many such equivalent settings are there?

Assuming all clusters are distinct, there are k! equivalent solutions.

Not a problem per se, but something to be aware of.

Issues with MLE for GMM Singularities for GMM

Consider the following GMM for 7 data points:

Let σ² be the variance of the skinny component. What happens to the likelihood as σ² → 0? (It blows up to +∞ as that component collapses onto a single data point.) In practice, we end up in local minima that do not have this problem, or we keep restarting the optimization until we do. A Bayesian approach or regularization will also solve the problem.

From Bishop’s Pattern recognition and machine learning, Figure 9.7.

Issues with MLE for GMM Gradient Descent / SGD for GMM

What about running gradient descent or SGD on

J(\pi, \mu, \Sigma) = -\sum_{i=1}^n \log ( \sum_{z=1}^k \pi_z N(x_i \mid \mu_z, \Sigma_z) ) ?

It can be done, but we need to be clever about it.

Each matrix Σ_1, ..., Σ_k has to be positive semidefinite. How do we maintain that constraint? Rewrite Σ_i = M_i M_i^T, where M_i is an unconstrained matrix. Then Σ_i is positive semidefinite. But we actually prefer positive definite, to avoid singularities.
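A tiny NumPy illustration of this reparameterization; the small ridge term added for strict positive definiteness is my own choice, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(3, 3))                # unconstrained matrix
Sigma = M @ M.T                            # symmetric positive semidefinite by construction
Sigma_pd = M @ M.T + 1e-6 * np.eye(3)      # small ridge keeps it strictly positive definite

print(np.linalg.eigvalsh(Sigma))           # eigenvalues are all >= 0 (up to rounding)
```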

Issues with MLE for GMM Cholesky Decomposition for SPD Matrices

Theorem: Every symmetric positive definite matrix A ∈ R^{d×d} has a unique Cholesky decomposition A = LL^T, where L is a lower triangular matrix with positive diagonal elements.

A lower triangular matrix has half the number of parameters of a full matrix. Symmetric positive definite is better than merely semidefinite because it avoids singularities. This requires a non-negativity constraint on the diagonal elements, e.g. enforced with a projected SGD method like the one we used for the Lasso.
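For illustration, NumPy's standard Cholesky routine applied to a small SPD matrix (this is a generic library call, not specific to these slides):

```python
import numpy as np

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])           # symmetric positive definite
L = np.linalg.cholesky(A)            # lower triangular with positive diagonal
print(L)
print(np.allclose(L @ L.T, A))       # True: A = L L^T
```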

The EM Algorithm for GMM MLE for Gaussian Model

Let’s start by considering the MLE for the Gaussian model.

For data D = {x1,...,xn}, the log likelihood is given by

\sum_{i=1}^n \log N(x_i \mid \mu, \Sigma) = -\frac{nd}{2}\log(2\pi) - \frac{n}{2}\log|\Sigma| - \frac{1}{2}\sum_{i=1}^n (x_i - \mu)'\Sigma^{-1}(x_i - \mu).

With some calculus, we find that the MLE parameters are

\mu_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n x_i
\qquad
\Sigma_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n (x_i - \mu_{\text{MLE}})(x_i - \mu_{\text{MLE}})^T

For the GMM, if we knew the cluster assignment z_i for each x_i, we could compute the MLEs for each cluster.
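A short NumPy sketch of these two estimators on made-up data; note the 1/n normalization of the MLE covariance, rather than the unbiased 1/(n−1).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=[1.0, -2.0], scale=1.5, size=(500, 2))   # toy data

mu_mle = X.mean(axis=0)
centered = X - mu_mle
Sigma_mle = centered.T @ centered / X.shape[0]   # 1/n, not 1/(n-1)

print(mu_mle)
print(Sigma_mle)
```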

The EM Algorithm for GMM Cluster Responsibilities: Some New Notation

Denote the probability that observed value xi comes from cluster j by

\gamma_i^j = P(Z = j \mid X = x_i).

The responsibility that cluster j takes for observation xi . Computationally,

\gamma_i^j = P(Z = j \mid X = x_i) = \frac{p(Z = j, X = x_i)}{p(x_i)} = \frac{\pi_j N(x_i \mid \mu_j, \Sigma_j)}{\sum_{c=1}^k \pi_c N(x_i \mid \mu_c, \Sigma_c)}

The vector (\gamma_i^1, ..., \gamma_i^k) is exactly the soft assignment for x_i.
Let n_c = \sum_{i=1}^n \gamma_i^c be the number of points “soft assigned” to cluster c.

The EM Algorithm for GMM EM Algorithm for GMM: Overview

1. Initialize the parameters µ, Σ, π.

2. “E step”. Evaluate the responsibilities using the current parameters:

   \gamma_i^j = \frac{\pi_j N(x_i \mid \mu_j, \Sigma_j)}{\sum_{c=1}^k \pi_c N(x_i \mid \mu_c, \Sigma_c)},

   for i = 1, ..., n and j = 1, ..., k.

3. “M step”. Re-estimate the parameters using the responsibilities:

   \mu_c^{\text{new}} = \frac{1}{n_c}\sum_{i=1}^n \gamma_i^c \, x_i

   \Sigma_c^{\text{new}} = \frac{1}{n_c}\sum_{i=1}^n \gamma_i^c (x_i - \mu_c^{\text{new}})(x_i - \mu_c^{\text{new}})^T

   \pi_c^{\text{new}} = \frac{n_c}{n}

4. Repeat from Step 2 until the log-likelihood converges.
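A bare-bones NumPy/SciPy sketch of this loop; the function name em_gmm and the initialization choices are my own, and it omits convergence checks, restarts, and safeguards against the singularities discussed earlier.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iters=100, rng=None):
    """Minimal EM loop for a GMM (no convergence check or safeguards)."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    # Initialization: random data points as means, identity covariances, uniform weights.
    mus = X[rng.choice(n, size=k, replace=False)].copy()
    Sigmas = np.stack([np.eye(d) for _ in range(k)])
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iters):
        # E step: responsibilities gamma[i, j] = P(Z = j | X = x_i).
        gamma = np.column_stack([
            pi[j] * multivariate_normal.pdf(X, mean=mus[j], cov=Sigmas[j])
            for j in range(k)
        ])
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M step: re-estimate parameters from the responsibilities.
        n_c = gamma.sum(axis=0)                 # soft counts
        mus = (gamma.T @ X) / n_c[:, None]
        for c in range(k):
            diff = X - mus[c]
            Sigmas[c] = (gamma[:, c, None] * diff).T @ diff / n_c[c]
        pi = n_c / n
    return pi, mus, Sigmas
```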

The EM Algorithm for GMM EM for GMM

Initialization

From Bishop’s Pattern recognition and machine learning, Figure 9.8.

The EM Algorithm for GMM EM for GMM

First soft assignment:

From Bishop’s Pattern recognition and machine learning, Figure 9.8.

The EM Algorithm for GMM EM for GMM

First soft assignment:

From Bishop’s Pattern recognition and machine learning, Figure 9.8.

The EM Algorithm for GMM EM for GMM

After 5 rounds of EM:

From Bishop’s Pattern recognition and machine learning, Figure 9.8.

The EM Algorithm for GMM EM for GMM

After 20 rounds of EM:

From Bishop’s Pattern recognition and machine learning, Figure 9.8.

The EM Algorithm for GMM Relation to K-Means

EM for GMM seems a little like k-means. In fact, there is a precise correspondence. First, fix each cluster covariance matrix to be σ²I. As we take σ² → 0, the update equations converge to doing k-means. If you do a quick experiment yourself (see the sketch below), you’ll find the soft assignments converge to hard assignments. This has to do with the tail behavior (exponential decay) of the Gaussian.
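A quick numerical illustration of this hardening effect, assuming two centers, equal mixing weights, and covariance σ²I; the specific numbers are made up.

```python
import numpy as np

# Responsibilities for one point x with two centers, covariance sigma^2 * I,
# equal mixing weights. As sigma^2 shrinks, the soft assignment hardens.
x = np.array([0.4, 0.0])
mus = np.array([[0.0, 0.0], [1.0, 0.0]])

for sigma2 in [1.0, 0.1, 0.01]:
    sq_dists = ((x - mus) ** 2).sum(axis=1)
    logits = -sq_dists / (2 * sigma2)          # log of unnormalized responsibilities
    gamma = np.exp(logits - logits.max())
    gamma /= gamma.sum()
    print(sigma2, gamma)                       # approaches a one-hot vector
```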

The EM Algorithm for GMM Possible Topics for Next Time

In the last lecture, we will give a high-level view of several topics. Possibilities:
General EM algorithm
Bandit problems
LDA / topic models
Ranking problems
Collaborative filtering
Generalization bounds
Sequence models (maximum entropy Markov models, HMMs)
