
K-Means and Gaussian Mixture Models
David Rosenberg, New York University
DS-GA 1003, June 15, 2015

K-Means Clustering

Example: Old Faithful Geyser
Looks like two clusters. How do we find these clusters algorithmically?

k-Means by Example
Standardize the data and choose two cluster centers. (Bishop, Pattern Recognition and Machine Learning, Figure 9.1(a))
Assign each point to the closest center. (Figure 9.1(b))
Compute new cluster centers. (Figure 9.1(c))
Assign points to the closest center. (Figure 9.1(d))
Compute cluster centers. (Figure 9.1(e))
Iterate until convergence. (Figure 9.1(i))

k-Means: Formalization
Dataset D = {x_1, ..., x_n} \subset R^d.
Goal (version 1): Partition the data into k clusters.
Goal (version 2): Partition R^d into k regions.
Let \mu_1, ..., \mu_k denote the cluster centers.

For each x_i, use a one-hot encoding to designate cluster membership:
    r_i = (0, 0, ..., 0, 1, 0, ..., 0) \in R^k.
Let r_{ic} = 1(x_i assigned to cluster c), so r_i = (r_{i1}, r_{i2}, ..., r_{ik}).

k-Means: Objective Function
Find cluster centers and cluster assignments minimizing
    J(r, \mu) = \sum_{i=1}^n \sum_{c=1}^k r_{ic} \| x_i - \mu_c \|^2.
Is the objective function convex? What is the domain of J?
The assignments satisfy r \in \{0,1\}^{n \times k}, which is not a convex set, so the domain of J is not convex, and hence J is not a convex function.
We should expect local minima.
We could replace \|\cdot\|^2 with something else: e.g., using \|\cdot\| (or any distance metric) gives k-medoids.
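To make the objective concrete, here is a minimal NumPy sketch (not from the slides; the function name and array layout are my own) that evaluates J(r, \mu) for a given hard assignment and set of centers:

```python
import numpy as np

def kmeans_objective(X, assignments, centers):
    """Evaluate J(r, mu) = sum_i ||x_i - mu_{c(i)}||^2.

    X:           (n, d) data matrix
    assignments: (n,) array of cluster indices in {0, ..., k-1}
                 (equivalent to the one-hot matrix r)
    centers:     (k, d) matrix of cluster centers mu_1, ..., mu_k
    """
    diffs = X - centers[assignments]   # x_i - mu_{c(i)} for each point
    return np.sum(diffs ** 2)          # total squared distance to assigned centers

# Tiny usage example with two obvious clusters
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
centers = np.array([[0.05, 0.1], [5.1, 4.95]])
assignments = np.array([0, 0, 1, 1])
print(kmeans_objective(X, assignments, centers))  # small, since points sit near their centers
```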
k-Means Algorithm
For fixed r (cluster assignments), minimizing over \mu is easy:
    J(r, \mu) = \sum_{i=1}^n \sum_{c=1}^k r_{ic} \| x_i - \mu_c \|^2 = \sum_{c=1}^k J_c(\mu_c),
where
    J_c(\mu_c) = \sum_{i=1}^n r_{ic} \| x_i - \mu_c \|^2 = \sum_{\{i : x_i \text{ belongs to cluster } c\}} \| x_i - \mu_c \|^2.
J_c is minimized by \mu_c = mean({x_i | x_i belongs to cluster c}).

For fixed \mu (cluster centers), minimizing over r is easy:
    J(r, \mu) = \sum_{i=1}^n \sum_{c=1}^k r_{ic} \| x_i - \mu_c \|^2.
For each i, exactly one of the terms
    r_{i1} \| x_i - \mu_1 \|^2, r_{i2} \| x_i - \mu_2 \|^2, ..., r_{ik} \| x_i - \mu_k \|^2
is nonzero. Take
    r_{ic} = 1(c = \arg\min_j \| x_i - \mu_j \|^2),
that is, assign x_i to the cluster c with minimum distance \| x_i - \mu_c \|^2.

k-Means Algorithm (summary)
We use an alternating minimization algorithm:
1. Choose initial cluster centers \mu = (\mu_1, ..., \mu_k), e.g., k randomly chosen data points.
2. Repeat:
   a. Given the cluster centers, find the optimal cluster assignments:
          r_{ic}^{new} = 1(c = \arg\min_j \| x_i - \mu_j \|^2).
   b. Given the cluster assignments, find the optimal cluster centers:
          \mu_c^{new} = \arg\min_{m \in R^d} \sum_{\{i : r_{ic} = 1\}} \| x_i - m \|^2.

k-Means Algorithm: Convergence
Note: the objective value never increases in an update (in the worst case, everything stays the same).
Consider the sequence of objective values J_1, J_2, J_3, ...: it is monotonically decreasing and bounded below by zero.
Therefore the k-means objective value converges to \inf_t J_t.
Reminder: this is convergence to a local minimum.
It is best to repeat k-means several times with different starting points.

k-Means: Objective Function Convergence
Blue circles: after the "E" step (assigning each point to a cluster).
Red circles: after the "M" step (recomputing the cluster centers).
(Bishop, Pattern Recognition and Machine Learning, Figure 9.2)

k-Means Algorithm: Standardizing the Data
With standardizing: (figure)
Without standardizing: (figure)

k-Means: Failure Cases

k-Means: Suboptimal Local Minimum
The clustering for k = 3 below is a local minimum, but suboptimal. (From Sontag's DS-GA 1003, 2014, Lecture 8.)

Gaussian Mixture Models

Probabilistic Model for Clustering
Let's consider a generative model for the data. Suppose:
1. There are k clusters.
2. We have a probability density for each cluster.
Generate a point as follows:
1. Choose a random cluster z \in \{1, 2, ..., k\}: Z \sim Multi(\pi_1, ..., \pi_k).
2. Choose a point from the distribution for cluster Z: X | Z = z \sim p(x | z).

Gaussian Mixture Model (k = 3)
1. Choose Z \in \{1, 2, 3\} \sim Multi(1/3, 1/3, 1/3).
2. Choose X | Z = z \sim N(x | \mu_z, \Sigma_z).

Gaussian Mixture Model: Joint Distribution
Factorize the joint distribution according to the Bayes net:
    p(x, z) = p(z) p(x | z) = \pi_z N(x | \mu_z, \Sigma_z).
\pi_z is the probability of choosing cluster z.
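As an illustration of this generative story, here is a small NumPy sketch (my own, not from the slides; the mixture parameters are placeholders) that samples points from a k = 3 Gaussian mixture by first drawing the cluster Z and then drawing X | Z = z:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative mixture parameters (not from the slides)
pi = np.array([1/3, 1/3, 1/3])                          # cluster probabilities
mus = np.array([[0.0, 0.0], [4.0, 4.0], [-4.0, 4.0]])   # cluster means
Sigmas = np.array([np.eye(2), 0.5 * np.eye(2), 2.0 * np.eye(2)])  # cluster covariances

def sample_gmm(n):
    """Draw n points: first Z ~ Multi(pi), then X | Z = z ~ N(mu_z, Sigma_z)."""
    z = rng.choice(len(pi), size=n, p=pi)                # latent cluster assignments
    x = np.array([rng.multivariate_normal(mus[zi], Sigmas[zi]) for zi in z])
    return x, z

X, Z = sample_gmm(500)
print(X.shape, np.bincount(Z))   # (500, 2) and roughly equal cluster counts
```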
X | Z = z has distribution N(\mu_z, \Sigma_z).
The z corresponding to x is the true cluster assignment.

Latent Variable Model
Back in reality, we observe X, not (X, Z). The cluster assignment Z is called a hidden variable.
Definition: A latent variable model is a probability model in which certain variables are never observed.
For example, the Gaussian mixture model is a latent variable model.

Model-Based Clustering
We observe X = x. The conditional distribution of the cluster Z given X = x is
    p(z | X = x) = p(x, z) / p(x).
The conditional distribution is a soft assignment to clusters. A hard assignment is
    z^* = \arg\max_{z \in \{1, ..., k\}} P(Z = z | X = x).
So if we have the model, clustering is trivial.

Estimating/Learning the Gaussian Mixture Model
We'll use the common acronym GMM.
What does it mean to "have" or "know" the GMM? It means knowing the parameters:
    Cluster probabilities:        \pi = (\pi_1, ..., \pi_k)
    Cluster means:                \mu = (\mu_1, ..., \mu_k)
    Cluster covariance matrices:  \Sigma = (\Sigma_1, ..., \Sigma_k)
We have a probability model: let's find the MLE.
Suppose we have data D = {x_1, ..., x_n}. We need the model likelihood for D.

Gaussian Mixture Model: Marginal Distribution
Since we only observe X, we need the marginal distribution:
    p(x) = \sum_{z=1}^k p(x, z) = \sum_{z=1}^k \pi_z N(x | \mu_z, \Sigma_z).
Note that p(x) is a convex combination of probability densities. This is a common form for a probability model...

Mixture Distributions (or Mixture Models)
Definition: A probability density p(x) represents a mixture distribution or mixture model if we can write it as a convex combination of probability densities, that is,
    p(x) = \sum_{i=1}^k w_i p_i(x),
where w_i > 0, \sum_{i=1}^k w_i = 1, and each p_i is a probability density.
In our Gaussian mixture model, X has a mixture distribution.
More constructively, let S be a set of probability distributions:
1. Choose a distribution randomly from S.
2. Sample X from the chosen distribution.
Then X has a mixture distribution.

Estimating/Learning the Gaussian Mixture Model
The model likelihood for D = {x_1, ..., x_n} is
    L(\pi, \mu, \Sigma) = \prod_{i=1}^n p(x_i) = \prod_{i=1}^n \sum_{z=1}^k \pi_z N(x_i | \mu_z, \Sigma_z).
As usual, we take our objective function to be the log of this:
    J(\pi, \mu, \Sigma) = \sum_{i=1}^n \log\left( \sum_{z=1}^k \pi_z N(x_i | \mu_z, \Sigma_z) \right).

Properties of the GMM Log-Likelihood
GMM log-likelihood:
    J(\pi, \mu, \Sigma) = \sum_{i=1}^n \log\left( \sum_{z=1}^k \pi_z N(x_i | \mu_z, \Sigma_z) \right).
Compare this to the log-likelihood for a single Gaussian:
    \sum_{i=1}^n \log N(x_i | \mu, \Sigma) = -\frac{nd}{2} \log(2\pi) - \frac{n}{2} \log|\Sigma| - \frac{1}{2} \sum_{i=1}^n (x_i - \mu)^T \Sigma^{-1} (x_i - \mu).
For a single Gaussian, the log cancels the exp in the Gaussian density.
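To connect the formula for J(\pi, \mu, \Sigma) to something computable, here is a small sketch (my own, not from the slides; the parameter values are placeholders) that evaluates the GMM log-likelihood with SciPy, using logsumexp over components for numerical stability:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pi, mus, Sigmas):
    """Compute J(pi, mu, Sigma) = sum_i log sum_z pi_z N(x_i | mu_z, Sigma_z)."""
    n, k = X.shape[0], len(pi)
    # log_probs[i, z] = log pi_z + log N(x_i | mu_z, Sigma_z)
    log_probs = np.empty((n, k))
    for z in range(k):
        log_probs[:, z] = np.log(pi[z]) + multivariate_normal.logpdf(X, mean=mus[z], cov=Sigmas[z])
    # logsumexp over z gives log p(x_i); summing over i gives the log-likelihood
    return np.sum(logsumexp(log_probs, axis=1))

# Placeholder parameters for a 2-component mixture in R^2
pi = np.array([0.5, 0.5])
mus = [np.zeros(2), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.eye(2)]
X = np.array([[0.1, -0.2], [3.2, 2.9], [1.5, 1.5]])
print(gmm_log_likelihood(X, pi, mus, Sigmas))
```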