Mixture Models & EM
Nicholas Ruozzi, University of Texas at Dallas
Based on the slides of Vibhav Gogate

Previously…
• We looked at $k$-means and hierarchical clustering as mechanisms for unsupervised learning
  – $k$-means was simply a block coordinate descent scheme for a specific objective function
• Today: how to learn probabilistic models for unsupervised learning problems
EM: Soft Clustering
• Clustering (e.g., $k$-means) typically assumes that each instance is given a "hard" assignment to exactly one cluster
• Does not allow uncertainty in class membership or for an instance to belong to more than one cluster
  – Problematic because data points that lie roughly midway between cluster centers are assigned to one cluster
• Soft clustering gives probabilities that an instance belongs to each of a set of clusters
Probabilistic Clustering
• Try a probabilistic model!
• Allows overlaps, clusters of different size, etc.
• Can tell a generative story for the data: $p(x \mid y)\,p(y)$
• Challenge: we need to estimate model parameters without labeled $y$'s (i.e., in the unsupervised setting)

Example data with unknown cluster assignments:

  Z     X1     X2
  ??    0.1    2.1
  ??    0.5   -1.1
  ??    0.0    3.0
  ??   -0.1   -2.0
  ??    0.2    1.5
  …     …      …
Probabilistic Clustering
• Clusters of different shapes and sizes
• Clusters can overlap! ($k$-means doesn't allow this)
Finite Mixture Models
• Given a dataset: $x^{(1)}, \dots, x^{(N)}$
• Mixture model: $\Theta = \{\lambda_1, \dots, \lambda_k, \theta_1, \dots, \theta_k\}$

$$p(x \mid \Theta) = \sum_{y=1}^{k} \lambda_y \, p_y(x \mid \theta_y)$$

where $p_y(x \mid \theta_y)$ is a mixture component from some family of probability distributions parameterized by $\theta_y$, and the mixture weights satisfy $\lambda_y \ge 0$ and $\sum_y \lambda_y = 1$

– We can think of $\lambda_y = p(Y = y \mid \Theta)$ for some random variable $Y$ that takes values in $\{1, \dots, k\}$
Finite Mixture Models

[Figure: a uniform mixture of 3 Gaussians]
Multivariate Gaussian
• A $d$-dimensional multivariate Gaussian distribution is defined by a $d \times d$ covariance matrix $\Sigma$ and a mean vector $\mu$:

$$p(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d \det \Sigma}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$$

• The covariance matrix describes the degree to which pairs of variables vary together
  – The diagonal elements correspond to the variances of the individual variables
Multivariate Gaussian
• A $d$-dimensional multivariate Gaussian distribution is defined by a $d \times d$ covariance matrix $\Sigma$ and a mean vector $\mu$:

$$p(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d \det \Sigma}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$$

• The covariance matrix must be symmetric positive definite for the above to make sense
  – Positive definite: all eigenvalues are positive, and the matrix is invertible
  – Ensures that the exponent $-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)$ is a concave function of $x$
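As a sketch (not from the slides), the density above can be computed in NumPy; using a Cholesky factorization doubles as a positive-definiteness check, since it fails unless $\Sigma$ is symmetric positive definite. The parameter values are illustrative:

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Density of a d-dimensional Gaussian N(mu, Sigma) evaluated at x."""
    d = len(mu)
    L = np.linalg.cholesky(Sigma)              # raises LinAlgError unless Sigma is PD
    diff = x - mu
    z = np.linalg.solve(L, diff)               # avoids forming Sigma^{-1} explicitly
    quad = z @ z                               # equals (x-mu)^T Sigma^{-1} (x-mu)
    log_det = 2 * np.sum(np.log(np.diag(L)))   # log det(Sigma)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.exp(log_det))

# Illustrative parameters (not from the slides)
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5], [0.5, 2.0]])
p = mvn_pdf(np.array([0.1, -0.2]), mu, Sigma)
```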
Gaussian Mixture Models (GMMs)
• We can define a GMM by choosing the $k$-th component of the mixture to be a Gaussian density with parameters $\theta_k = (\mu_k, \Sigma_k)$:

$$p(x \mid \mu_k, \Sigma_k) = \frac{1}{\sqrt{(2\pi)^d \det \Sigma_k}} \exp\left( -\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \right)$$

We could cluster by fitting a mixture of $k$ Gaussians to our data

How do we learn these kinds of models?
Learning Gaussian Parameters
• MLE for the supervised univariate Gaussian:

$$\mu_{MLE} = \frac{1}{N} \sum_{i=1}^{N} x^{(i)} \qquad \sigma^2_{MLE} = \frac{1}{N} \sum_{i=1}^{N} \left( x^{(i)} - \mu_{MLE} \right)^2$$

• MLE for the supervised multivariate Gaussian:

$$\mu_{MLE} = \frac{1}{N} \sum_{i=1}^{N} x^{(i)} \qquad \Sigma_{MLE} = \frac{1}{N} \sum_{i=1}^{N} \left( x^{(i)} - \mu_{MLE} \right) \left( x^{(i)} - \mu_{MLE} \right)^T$$
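The multivariate MLE formulas translate directly into NumPy; as a sanity check (my own illustrative setup, not from the slides), the estimates recover known parameters from a large sample:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw supervised samples from a known 2D Gaussian (parameters are illustrative)
true_mu = np.array([1.0, -2.0])
true_Sigma = np.array([[2.0, 0.3], [0.3, 0.5]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=50_000)

# MLE estimates exactly as in the formulas above
N = X.shape[0]
mu_mle = X.mean(axis=0)                  # (1/N) sum_i x^(i)
centered = X - mu_mle
Sigma_mle = (centered.T @ centered) / N  # (1/N) sum_i (x - mu)(x - mu)^T
```

Note the $1/N$ normalization: the MLE covariance is biased, unlike the $1/(N-1)$ sample covariance.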
Learning Gaussian Parameters
• MLE for a supervised mixture of $k$ multivariate Gaussian distributions:

$$\mu^{k}_{MLE} = \frac{1}{|M_k|} \sum_{i \in M_k} x^{(i)}$$

$$\Sigma^{k}_{MLE} = \frac{1}{|M_k|} \sum_{i \in M_k} \left( x^{(i)} - \mu^{k}_{MLE} \right) \left( x^{(i)} - \mu^{k}_{MLE} \right)^T$$

The sums are over $M_k$, the set of observations generated by the $k$-th mixture component (this requires that we know which points were generated by which distribution!)
The Unsupervised Case
• What if our observations do not include information about which of the $k$ mixture components generated them?

• Consider a joint probability distribution over data points, $x^{(i)}$, and mixture assignments, $y \in \{1, \dots, k\}$:

$$\arg\max_{\Theta} \prod_{i=1}^{N} p(x^{(i)} \mid \Theta) = \arg\max_{\Theta} \prod_{i=1}^{N} \sum_{y=1}^{k} p(x^{(i)}, Y = y \mid \Theta)$$

$$= \arg\max_{\Theta} \prod_{i=1}^{N} \sum_{y=1}^{k} p(x^{(i)} \mid Y = y, \Theta) \, p(Y = y \mid \Theta)$$
We only know how to compute the probabilities for each mixture component
The Unsupervised Case
• In the case of a Gaussian mixture model
$$p(x^{(i)} \mid Y = y, \Theta) = \frac{1}{\sqrt{(2\pi)^d \det \Sigma_y}} \exp\left( -\frac{1}{2} (x^{(i)} - \mu_y)^T \Sigma_y^{-1} (x^{(i)} - \mu_y) \right)$$

$$p(Y = y \mid \Theta) = \lambda_y$$

• Differentiating the MLE objective yields a system of equations that is difficult to solve in general
– The solution: modify the objective to make the optimization easier
Expectation Maximization
Jensen's Inequality
For a convex function $f: \mathbb{R}^n \to \mathbb{R}$, any $a_1, \dots, a_k \in [0,1]$ such that $a_1 + \cdots + a_k = 1$, and any $x^{(1)}, \dots, x^{(k)} \in \mathbb{R}^n$,

$$a_1 f(x^{(1)}) + \cdots + a_k f(x^{(k)}) \ge f\left( a_1 x^{(1)} + \cdots + a_k x^{(k)} \right)$$

(For a concave function such as $\log$, the inequality is reversed; this is the form EM uses.)
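A quick numerical illustration (my own, with an arbitrary convex function and random weights) of both directions of the inequality:

```python
import numpy as np

rng = np.random.default_rng(1)

# Random convex-combination weights a_1..a_k (nonnegative, summing to 1)
k = 5
a = rng.random(k)
a /= a.sum()
x = rng.uniform(0.5, 3.0, size=k)   # arbitrary positive points, so log applies

f = np.square                       # a convex function
lhs = np.sum(a * f(x))              # a_1 f(x_1) + ... + a_k f(x_k)
rhs = f(np.sum(a * x))              # f(a_1 x_1 + ... + a_k x_k)

# For the concave log (as in EM's derivation), the inequality flips
log_lhs = np.sum(a * np.log(x))
log_rhs = np.log(np.sum(a * x))
```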
EM Algorithm
$$\log \ell(\Theta) = \sum_{i=1}^{N} \log \sum_{y=1}^{k} p(x^{(i)}, Y = y \mid \Theta)$$

$$= \sum_{i=1}^{N} \log \sum_{y=1}^{k} \frac{q_i(y)}{q_i(y)} \, p(x^{(i)}, Y = y \mid \Theta)$$

$$\ge \sum_{i=1}^{N} \sum_{y=1}^{k} q_i(y) \log \frac{p(x^{(i)}, Y = y \mid \Theta)}{q_i(y)} \equiv F(\Theta, q)$$

where $q_i(y)$ is an arbitrary positive probability distribution, and the last step follows from Jensen's inequality
EM Algorithm
$$\arg\max_{\Theta, q_1, \dots, q_N} \sum_{i=1}^{N} \sum_{y=1}^{k} q_i(y) \log \frac{p(x^{(i)}, Y = y \mid \Theta)}{q_i(y)}$$
• This objective is not jointly concave in Θ and 푞1, … , 푞푁
– The best we can hope for is a local maximum (and there could be A LOT of them)
• The EM algorithm is a block coordinate ascent scheme that finds a local optimum of this objective
– Start from an initialization $\Theta^0$ and $q_1^0, \dots, q_N^0$
EM Algorithm
• E-step: with the $\theta$'s fixed, maximize the objective over $q$:

$$q^{t+1} \in \arg\max_{q_1, \dots, q_N} \sum_{i=1}^{N} \sum_{y=1}^{k} q_i(y) \log \frac{p(x^{(i)}, Y = y \mid \Theta^t)}{q_i(y)}$$

• Using the method of Lagrange multipliers for the constraint that $\sum_y q_i(y) = 1$ gives

$$q_i^{t+1}(y) = p\left(Y = y \mid X = x^{(i)}, \Theta^t\right)$$
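For a GMM, this posterior follows from Bayes' rule: $q_i(y)$ is proportional to $\lambda_y$ times the component density at $x^{(i)}$, normalized over $y$. A NumPy sketch (the parameters and points below are hypothetical):

```python
import numpy as np

def gauss_pdf(X, mu, Sigma):
    """Multivariate normal density evaluated at each row of X."""
    d = len(mu)
    diff = X - mu
    quad = np.einsum('nd,de,ne->n', diff, np.linalg.inv(Sigma), diff)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

def e_step(X, lambdas, mus, Sigmas):
    """q_i(y) = p(Y = y | x^(i), Theta): by Bayes' rule, proportional to
    lambda_y * p(x^(i) | mu_y, Sigma_y), then normalized over y."""
    q = np.column_stack([lam * gauss_pdf(X, mu, Sig)
                         for lam, mu, Sig in zip(lambdas, mus, Sigmas)])
    return q / q.sum(axis=1, keepdims=True)

# Illustrative 2-component model and a few points (not from the slides)
lambdas = [0.4, 0.6]
mus = [np.array([-2.0, 0.0]), np.array([2.0, 0.0])]
Sigmas = [np.eye(2), np.eye(2)]
X = np.array([[-2.1, 0.1], [2.0, -0.2], [0.0, 0.0]])
q = e_step(X, lambdas, mus, Sigmas)
```

Points near a component mean get responsibility near 1 for that component; a point equidistant from both means falls back to the prior weights $(\lambda_1, \lambda_2)$.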
EM Algorithm
• M-step: with the $q$'s fixed, maximize the objective over $\Theta$:

$$\Theta^{t+1} \in \arg\max_{\Theta} \sum_{i=1}^{N} \sum_{y=1}^{k} q_i^{t+1}(y) \log \frac{p(x^{(i)}, Y = y \mid \Theta)}{q_i^{t+1}(y)}$$
• For the case of a GMM, we can compute this update in closed form
  – This is not necessarily the case for every model
  – It may require gradient ascent
EM Algorithm
• Start with random parameters
• The E-step maximizes a lower bound on the log-sum for fixed parameters
• The M-step solves an MLE estimation problem for fixed probabilities
• Iterate between the E-step and M-step until convergence
EM for Gaussian Mixtures
• E-step:

$$q_i^t(y) = \frac{\lambda_y^t \, p\left(x^{(i)} \mid \mu_y^t, \Sigma_y^t\right)}{\sum_{y'} \lambda_{y'}^t \, p\left(x^{(i)} \mid \mu_{y'}^t, \Sigma_{y'}^t\right)}$$

The numerator weights the probability of $x^{(i)}$ under the corresponding multivariate normal distribution; the denominator is the probability of $x^{(i)}$ under the mixture model.

• M-step:

$$\mu_y^{t+1} = \frac{\sum_{i=1}^{N} q_i^t(y) \, x^{(i)}}{\sum_{i=1}^{N} q_i^t(y)}$$

$$\Sigma_y^{t+1} = \frac{\sum_{i=1}^{N} q_i^t(y) \left( x^{(i)} - \mu_y^{t+1} \right) \left( x^{(i)} - \mu_y^{t+1} \right)^T}{\sum_{i=1}^{N} q_i^t(y)}$$

$$\lambda_y^{t+1} = \frac{1}{N} \sum_{i=1}^{N} q_i^t(y)$$
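The E- and M-step updates can be combined into a complete, if minimal, EM loop. This is a sketch in NumPy: the synthetic data, the extreme-point initialization, and the small ridge added to each covariance are my own illustrative choices, not from the slides:

```python
import numpy as np

def gauss_pdf(X, mu, Sigma):
    """Multivariate normal density evaluated at each row of X."""
    d = len(mu)
    diff = X - mu
    quad = np.einsum('nd,de,ne->n', diff, np.linalg.inv(Sigma), diff)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

def em_gmm(X, mus_init, iters=50):
    """Minimal EM for a GMM, alternating the E- and M-step updates above."""
    N, d = X.shape
    mus = np.array(mus_init, dtype=float)
    k = len(mus)
    lambdas = np.full(k, 1.0 / k)
    Sigmas = np.stack([np.cov(X.T) for _ in range(k)])
    for _ in range(iters):
        # E-step: responsibilities q_i(y), normalized over components
        q = np.column_stack([lambdas[y] * gauss_pdf(X, mus[y], Sigmas[y])
                             for y in range(k)])
        q /= q.sum(axis=1, keepdims=True)
        # M-step: weighted MLE updates for each component
        Nk = q.sum(axis=0)
        mus = (q.T @ X) / Nk[:, None]
        for y in range(k):
            diff = X - mus[y]
            Sigmas[y] = (q[:, y, None] * diff).T @ diff / Nk[y] \
                        + 1e-6 * np.eye(d)   # small ridge for numerical stability
        lambdas = Nk / N
    return lambdas, mus, Sigmas

# Synthetic data from two well-separated Gaussians (all values illustrative)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal([-4.0, 0.0], 1.0, size=(300, 2)),
               rng.normal([4.0, 0.0], 1.0, size=(700, 2))])
# Initialize the means at the leftmost and rightmost points
init = [X[np.argmin(X[:, 0])], X[np.argmax(X[:, 0])]]
lambdas, mus, Sigmas = em_gmm(X, init)
```

On well-separated data like this, the recovered means approach $(\pm 4, 0)$ and the weights approach $(0.3, 0.7)$; with poor initialization, EM may instead settle into one of the many local optima mentioned earlier.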
Gaussian Mixture Example

[Figures: EM for a mixture of Gaussians, showing the fit at the start and after iterations 1 through 6 and iteration 20]
Properties of EM
• EM converges to a local optimum
  – This is because each iteration improves the log-likelihood
  – The proof is the same as for $k$-means (just block coordinate ascent)
• The E-step can never decrease the likelihood
• The M-step can never decrease the likelihood
• If we make hard assignments instead of soft ones, the algorithm is equivalent to $k$-means!