
Mixture Models & EM

Nicholas Ruozzi, University of Texas at Dallas

Based on the slides of Vibhav Gogate

Previously…

• We looked at 푘-means and hierarchical clustering as mechanisms for unsupervised learning
  – 푘-means was simply a block coordinate descent scheme for a specific objective function

• Today: how to learn probabilistic models for unsupervised learning problems

EM: Soft Clustering

• Clustering (e.g., 푘-means) typically assumes that each instance is given a “hard” assignment to exactly one cluster

• Does not allow uncertainty in class membership or for an instance to belong to more than one cluster
  – Problematic because data points that lie roughly midway between cluster centers are assigned to one cluster

• Soft clustering gives probabilities that an instance belongs to each of a set of clusters

Probabilistic Clustering

• Try a probabilistic model!

• Allows overlaps, clusters of different size, etc.

• Can tell a generative story for data
  – 푝(푥|푦) 푝(푦)

• Challenge: we need to estimate model parameters without labeled 푦’s (i.e., in the unsupervised setting)

Example data (cluster assignments 푍 unknown):

  Z     X1     X2
  ??    0.1    2.1
  ??    0.5   -1.1
  ??    0.0    3.0
  ??   -0.1   -2.0
  ??    0.2    1.5
  …     …      …
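To make the generative story concrete, here is a minimal sketch in Python/NumPy of sampling from a two-component Gaussian mixture: draw a hidden label 푦 from the mixture weights, then draw 푥 from the corresponding Gaussian. All parameter values below are made up for illustration; in the unsupervised setting we observe only the 푥’s.

```python
# A sketch of the generative story p(y) p(x|y): sample a component, then a point.
# All parameter values here are illustrative, not from the slides.
import numpy as np

rng = np.random.default_rng(0)

lambdas = np.array([0.6, 0.4])                       # mixture weights (sum to 1)
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]   # component means
Sigmas = [np.eye(2), 0.5 * np.eye(2)]                # component covariances

def sample_mixture(n):
    """Draw n points (and their hidden labels) from the mixture."""
    ys = rng.choice(len(lambdas), size=n, p=lambdas)           # y ~ Categorical(lambda)
    xs = np.array([rng.multivariate_normal(mus[y], Sigmas[y])  # x ~ N(mu_y, Sigma_y)
                   for y in ys])
    return xs, ys

X, Y = sample_mixture(5)   # in the unsupervised setting we would observe X but not Y
```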

Probabilistic Clustering

• Clusters of different shapes and sizes

• Clusters can overlap! (푘-means doesn’t allow this)

Finite Mixture Models

• Given a dataset: 푥^(1), …, 푥^(푁)

• A finite mixture model with parameters Θ = {휆_1, …, 휆_푘, 휃_1, …, 휃_푘} defines the distribution

  p(x | \Theta) = \sum_{y=1}^{k} \lambda_y \, p_y(x | \theta_y)

  where 푝_푦(푥|휃_푦) is a mixture component from some family of probability distributions parameterized by 휃_푦, and the mixture weights satisfy 휆_푦 ≥ 0 and ∑_푦 휆_푦 = 1

  – We can think of 휆_푦 = 푝(푌 = 푦 | Θ) for some random variable 푌 that takes values in {1, …, 푘}
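As a small illustration (a sketch, not part of the slides), the mixture density is just a weighted sum of the component densities; here with univariate Gaussian components via SciPy and made-up parameter values:

```python
# Sketch: evaluate p(x | Theta) = sum_y lambda_y * p_y(x | theta_y)
# for a toy 1-D Gaussian mixture (parameter values are illustrative).
import numpy as np
from scipy.stats import norm

lambdas = np.array([0.5, 0.3, 0.2])        # mixture weights (sum to 1)
means = np.array([-2.0, 0.0, 3.0])         # component means
stds = np.array([0.5, 1.0, 0.8])           # component standard deviations

def mixture_pdf(x):
    # weighted sum of the k component densities evaluated at x
    return sum(lam * norm.pdf(x, loc=mu, scale=sd)
               for lam, mu, sd in zip(lambdas, means, stds))

print(mixture_pdf(0.0))
```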

Finite Mixture Models

Uniform mixture of 3 Gaussians

Multivariate Gaussian

• A 푑-dimensional multivariate Gaussian distribution is defined by a 푑 × 푑 covariance matrix Σ and a mean vector 휇

  p(x | \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d \det \Sigma}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)

• The covariance matrix describes the degree to which pairs of variables vary together
  – The diagonal elements correspond to the variances of the individual variables

• The covariance matrix must be a symmetric positive definite matrix in order for the above to make sense
  – Positive definite: all eigenvalues are positive & the matrix is invertible
  – Ensures that the quadratic form in the exponent is concave
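A quick sanity-check sketch, assuming NumPy/SciPy and arbitrary example values: the density can be computed directly from the formula above or with scipy.stats.multivariate_normal.

```python
# Sketch: evaluate the d-dimensional Gaussian density two ways (values are illustrative).
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])   # symmetric positive definite
x = np.array([0.5, 0.5])

d = len(mu)
diff = x - mu
# Direct implementation of the formula
quad = diff @ np.linalg.solve(Sigma, diff)            # (x - mu)^T Sigma^{-1} (x - mu)
pdf_manual = np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

# Library version for comparison
pdf_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(x)

assert np.isclose(pdf_manual, pdf_scipy)
```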

Gaussian Mixture Models (GMMs)

• We can define a GMM by choosing the 푘th component of the mixture to be a Gaussian density with parameters 휃_푘 = (휇_푘, Σ_푘):

  p(x | \mu_k, \Sigma_k) = \frac{1}{\sqrt{(2\pi)^d \det \Sigma_k}} \exp\left( -\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \right)

We could cluster by fitting a mixture of 풌 Gaussians to our data

How do we learn these kinds of models?

Learning Gaussian Parameters

• MLE for a supervised univariate Gaussian:

  \mu_{MLE} = \frac{1}{N} \sum_{i=1}^{N} x^{(i)}

  \sigma^2_{MLE} = \frac{1}{N} \sum_{i=1}^{N} \left( x^{(i)} - \mu_{MLE} \right)^2

• MLE for a supervised multivariate Gaussian:

  \mu_{MLE} = \frac{1}{N} \sum_{i=1}^{N} x^{(i)}

  \Sigma_{MLE} = \frac{1}{N} \sum_{i=1}^{N} \left( x^{(i)} - \mu_{MLE} \right) \left( x^{(i)} - \mu_{MLE} \right)^T
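A minimal sketch of these estimators in NumPy, assuming the data are stacked as rows of a matrix X (note the 1/N normalization, not 1/(N−1)):

```python
# Sketch: MLE for a multivariate Gaussian from data X of shape (N, d).
import numpy as np

def gaussian_mle(X):
    N = X.shape[0]
    mu = X.mean(axis=0)                       # (1/N) sum_i x^(i)
    centered = X - mu
    Sigma = (centered.T @ centered) / N       # (1/N) sum_i (x^(i) - mu)(x^(i) - mu)^T
    return mu, Sigma

X = np.random.default_rng(0).normal(size=(500, 2))
mu_hat, Sigma_hat = gaussian_mle(X)
```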

Learning Gaussian Parameters

• MLE for a supervised mixture of 푘 multivariate Gaussian distributions:

  \mu^k_{MLE} = \frac{1}{|M_k|} \sum_{i \in M_k} x^{(i)}

  \Sigma^k_{MLE} = \frac{1}{|M_k|} \sum_{i \in M_k} \left( x^{(i)} - \mu^k_{MLE} \right) \left( x^{(i)} - \mu^k_{MLE} \right)^T

  – The sums are over 푀_푘, the set of observations that were generated by the 푘th mixture component (this requires that we know which points were generated by which distribution!)
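A sketch of this fully supervised case, assuming X holds the observations and y holds the known component labels 0, …, k−1 (the empirical weight estimate is included for completeness):

```python
# Sketch: per-component MLE when the generating component of each point is known.
import numpy as np

def mixture_mle_supervised(X, y, k):
    """X: (N, d) data, y: (N,) integer labels in {0, ..., k-1}."""
    params = []
    for c in range(k):
        Xc = X[y == c]                          # observations in M_c
        mu = Xc.mean(axis=0)
        centered = Xc - mu
        Sigma = (centered.T @ centered) / len(Xc)
        lam = len(Xc) / len(X)                  # empirical mixture weight |M_c| / N
        params.append((lam, mu, Sigma))
    return params
```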

The Unsupervised Case

• What if our observations do not include information about which of the 푘 mixture components generated them?

• Consider a joint probability distribution over data points, 푥^(푖), and mixture assignments, 푦 ∈ {1, …, 푘}:

  \arg\max_{\Theta} \prod_{i=1}^{N} p(x^{(i)} | \Theta) = \arg\max_{\Theta} \prod_{i=1}^{N} \sum_{y=1}^{k} p(x^{(i)}, Y = y | \Theta)

                                                        = \arg\max_{\Theta} \prod_{i=1}^{N} \sum_{y=1}^{k} p(x^{(i)} | Y = y, \Theta) \, p(Y = y | \Theta)

• We only know how to compute the probabilities for each mixture component

The Unsupervised Case

• In the case of a Gaussian mixture model

  p(x^{(i)} | Y = y, \Theta) = \frac{1}{\sqrt{(2\pi)^d \det \Sigma_y}} \exp\left( -\frac{1}{2} (x^{(i)} - \mu_y)^T \Sigma_y^{-1} (x^{(i)} - \mu_y) \right)

  p(Y = y | \Theta) = \lambda_y

• Differentiating the MLE objective yields a system of equations that is difficult to solve in general

– The solution: modify the objective to make the optimization easier
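Although maximizing it directly is hard, the log-likelihood itself is cheap to evaluate (useful later for monitoring convergence); a sketch assuming SciPy, using log-sum-exp for numerical stability:

```python
# Sketch: evaluate the GMM log-likelihood sum_i log sum_y lambda_y N(x^(i); mu_y, Sigma_y).
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, lambdas, mus, Sigmas):
    # log_p[i, y] = log lambda_y + log N(x^(i); mu_y, Sigma_y)
    log_p = np.column_stack([
        np.log(lam) + multivariate_normal(mean=mu, cov=Sig).logpdf(X)
        for lam, mu, Sig in zip(lambdas, mus, Sigmas)
    ])
    return logsumexp(log_p, axis=1).sum()
```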

Expectation Maximization

Jensen’s Inequality

For a convex function 푓: ℝ^푛 → ℝ, any 푎_1, …, 푎_푘 ∈ [0,1] such that 푎_1 + ⋯ + 푎_푘 = 1, and any 푥^(1), …, 푥^(푘) ∈ ℝ^푛,

  a_1 f(x^{(1)}) + \cdots + a_k f(x^{(k)}) \ge f\left( a_1 x^{(1)} + \cdots + a_k x^{(k)} \right)
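A quick numerical sanity check of the inequality (a sketch using the convex function f(x) = ‖x‖² and random weights/points):

```python
# Sketch: check Jensen's inequality for the convex f(x) = ||x||^2.
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return float(np.dot(x, x))             # a convex function R^n -> R

a = rng.random(5)
a /= a.sum()                               # weights a_1..a_k summing to 1
xs = rng.normal(size=(5, 3))               # points x^(1)..x^(k) in R^3

lhs = sum(ai * f(xi) for ai, xi in zip(a, xs))   # sum_i a_i f(x^(i))
rhs = f(sum(ai * xi for ai, xi in zip(a, xs)))   # f(sum_i a_i x^(i))
assert lhs >= rhs
```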

EM Algorithm

  \log \ell(\Theta) = \sum_{i=1}^{N} \log \sum_{y=1}^{k} p(x^{(i)}, Y = y | \Theta)

                    = \sum_{i=1}^{N} \log \sum_{y=1}^{k} \frac{q_i(y)}{q_i(y)} \, p(x^{(i)}, Y = y | \Theta)

                    \ge \sum_{i=1}^{N} \sum_{y=1}^{k} q_i(y) \log \frac{p(x^{(i)}, Y = y | \Theta)}{q_i(y)}

                    \equiv F(\Theta, q)

  where 푞_푖(푦) is an arbitrary positive probability distribution, and the inequality follows from Jensen’s inequality

EM Algorithm

  \arg\max_{\Theta, q_1, \ldots, q_N} \sum_{i=1}^{N} \sum_{y=1}^{k} q_i(y) \log \frac{p(x^{(i)}, Y = y | \Theta)}{q_i(y)}

• This objective is not jointly concave in Θ and 푞_1, …, 푞_푁

  – The best we can hope for is a local maximum (and there could be A LOT of them)

• The EM algorithm is a block coordinate ascent scheme that finds a local optimum of this objective

  – Start from an initialization Θ^0 and 푞_1^0, …, 푞_푁^0

EM Algorithm

• E step: with the 휃’s fixed, maximize the objective over 푞

  q^{t+1} \in \arg\max_{q_1, \ldots, q_N} \sum_{i=1}^{N} \sum_{y=1}^{k} q_i(y) \log \frac{p(x^{(i)}, Y = y | \Theta^t)}{q_i(y)}

• Using the method of Lagrange multipliers for the constraint that ∑_푦 푞_푖(푦) = 1 gives

  q_i^{t+1}(y) = p(Y = y | X = x^{(i)}, \Theta^t)
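In other words, the optimal 푞_푖 is the posterior over mixture components given 푥^(푖). A sketch of this computation for a GMM (the function and variable names are mine, for illustration only):

```python
# Sketch: E-step posterior q_i(y) = p(Y = y | x^(i), Theta) for a GMM, via Bayes rule.
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, lambdas, mus, Sigmas):
    # unnormalized: lambda_y * N(x^(i); mu_y, Sigma_y)
    weighted = np.column_stack([
        lam * multivariate_normal(mean=mu, cov=Sig).pdf(X)
        for lam, mu, Sig in zip(lambdas, mus, Sigmas)
    ])
    # normalize each row so that sum_y q_i(y) = 1
    return weighted / weighted.sum(axis=1, keepdims=True)
```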

EM Algorithm

• M step: with the 푞’s fixed, maximize the objective over Θ

  \Theta^{t+1} \in \arg\max_{\Theta} \sum_{i=1}^{N} \sum_{y=1}^{k} q_i^{t+1}(y) \log \frac{p(x^{(i)}, Y = y | \Theta)}{q_i^{t+1}(y)}

• For the case of a GMM, we can compute this update in closed form
  – This is not necessarily the case for every model
  – May require gradient ascent

EM Algorithm

• Start with random parameters

• E-step maximizes a lower bound on the log-sum for fixed parameters

• M-step solves an MLE estimation problem for fixed probabilities

• Iterate between the E-step and M-step until convergence

EM for Gaussian Mixtures

• E-step:

  q_i^t(y) = \frac{\lambda_y^t \, p(x^{(i)} | \mu_y^t, \Sigma_y^t)}{\sum_{y'} \lambda_{y'}^t \, p(x^{(i)} | \mu_{y'}^t, \Sigma_{y'}^t)}

  – The numerator is the probability of 푥^(푖) under the appropriate multivariate normal distribution (scaled by its weight); the denominator is the probability of 푥^(푖) under the mixture model

• M-step:

  \mu_y^{t+1} = \frac{\sum_{i=1}^{N} q_i^t(y) \, x^{(i)}}{\sum_{i=1}^{N} q_i^t(y)}

  \Sigma_y^{t+1} = \frac{\sum_{i=1}^{N} q_i^t(y) \left( x^{(i)} - \mu_y^{t+1} \right) \left( x^{(i)} - \mu_y^{t+1} \right)^T}{\sum_{i=1}^{N} q_i^t(y)}

  \lambda_y^{t+1} = \frac{1}{N} \sum_{i=1}^{N} q_i^t(y)
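Putting the two steps together, here is a compact sketch of EM for a GMM in NumPy/SciPy. It is a minimal illustration rather than the lecture's reference implementation: initialization is random, the number of iterations is fixed instead of checked for convergence, and no covariance regularization is applied.

```python
# Sketch: EM for a Gaussian mixture model (simplified; no covariance regularization,
# fixed number of iterations instead of a convergence test).
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape

    # Initialization: uniform weights, random data points as means, identity covariances
    lambdas = np.full(k, 1.0 / k)
    mus = X[rng.choice(N, size=k, replace=False)]
    Sigmas = np.stack([np.eye(d) for _ in range(k)])

    for _ in range(n_iter):
        # E-step: responsibilities q[i, y] proportional to lambda_y * N(x^(i); mu_y, Sigma_y)
        q = np.column_stack([
            lambdas[y] * multivariate_normal(mean=mus[y], cov=Sigmas[y]).pdf(X)
            for y in range(k)
        ])
        q /= q.sum(axis=1, keepdims=True)

        # M-step: weighted MLE updates
        Nk = q.sum(axis=0)                        # effective counts per component
        mus = (q.T @ X) / Nk[:, None]
        for y in range(k):
            centered = X - mus[y]
            Sigmas[y] = (q[:, y, None] * centered).T @ centered / Nk[y]
        lambdas = Nk / N

    return lambdas, mus, Sigmas

# Example usage on synthetic data drawn from two Gaussians
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(4.0, 0.5, size=(200, 2))])
weights, means, covs = em_gmm(X, k=2)
```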

Gaussian Mixture Example: Start

After first iteration

After 2nd iteration

After 3rd iteration

After 4th iteration

After 5th iteration

After 6th iteration

After 20th iteration

Properties of EM

• EM converges to a local optimum
  – This is because each iteration improves the log-likelihood
  – Proof is the same as for 푘-means (just block coordinate ascent)

• The E-step can never decrease the likelihood

• The M-step can never decrease the likelihood

• If we make hard assignments instead of soft ones, the algorithm is equivalent to 푘-means!
