Lecture 12: Gaussian Mixture Models Shuai Li John Hopcroft Center, Shanghai Jiao Tong University https://shuaili8.github.io

https://shuaili8.github.io/Teaching/VE445/index.html

1 Outline

• Generative models
• GMM
• EM

2 Generative Models (review)

3 Discriminative / Generative Models

• Discriminative models
  • Modeling the dependence of unobserved variables on observed ones
  • Also called conditional models
  • Deterministic: $y = f(x)$
  • Probabilistic: $\mathbb{P}(y \mid x)$
• Generative models
  • Modeling the joint probabilistic distribution of data
  • Given some hidden parameters or variables

• Then do the conditional inference

4 Discriminative Models

• Discriminative models
  • Modeling the dependence of unobserved variables on observed ones
  • Also called conditional models
  • Deterministic: $y = f(x)$
  • Probabilistic: $\mathbb{P}(y \mid x)$

• Directly model the dependence for label prediction
• Easy to define dependence on specific features and models
• Practically yielding higher prediction performance
• E.g. linear regression, logistic regression, k-nearest neighbors, SVMs, (multi-layer) perceptrons, decision trees, random forests

5 Generative Models

• Generative models
  • Modeling the joint probabilistic distribution of data
  • Given some hidden parameters or variables

• Then do the conditional inference

• Recover the data distribution [essence of data science]
• Benefit from hidden variable modeling
• E.g. Naive Bayes, Gaussian mixture models, Markov Random Fields, Latent Dirichlet Allocation

6 Discriminative Models vs Generative Models

• In general
  • A discriminative model models the decision boundary between the classes
  • A generative model explicitly models the actual distribution of each class

• Example: our training set is a bag of fruits, in which only apples and oranges are labeled (imagine a post-it note stuck to each fruit)
  • A generative model will model various attributes of the fruits, such as color, weight, shape, etc.
  • A discriminative model might model color alone, should that suffice to distinguish apples from oranges

8 Gaussian Mixture Models

9 Gaussian Mixture Models

• GMM is a clustering algorithm
• Differences with K-means:
  • K-means outputs the label of a sample
  • GMM outputs the probability that a sample belongs to a certain class
• GMM can also be used to generate new samples!

10 K-means vs GMM

11 Gaussian distribution

• Very common in probability theory and important in statistics
• Often used in the natural and social sciences to represent real-valued random variables whose distributions are not known
• Useful because of the central limit theorem:
  • Averages of samples independently drawn from the same distribution converge in distribution to the normal with the true mean and variance, that is, they become normally distributed when the number of observations is sufficiently large
• Physical quantities that are expected to be the sum of many independent processes often have distributions that are nearly normal
• The probability density of the Gaussian distribution is
$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

12 High-dimensional Gaussian distribution

• The probability density of the Gaussian distribution on $x = (x_1, \dots, x_d)^\top$ is
$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d\, |\Sigma|}} \exp\left(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right)$$
  • where $\mu$ is the mean vector
  • $\Sigma$ is the symmetric covariance matrix (positive semi-definite)
• E.g. the Gaussian distribution with a given mean and covariance (figure)
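To make the density formula concrete, here is a minimal NumPy sketch that evaluates the multivariate Gaussian density; the function name `gaussian_pdf` and the example values are illustrative, not from the lecture.

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Density N(x | mu, Sigma) of a d-dimensional Gaussian at point x."""
    d = mu.shape[0]
    diff = x - mu
    # Solve Sigma^{-1} (x - mu) without forming the explicit inverse
    sol = np.linalg.solve(Sigma, diff)
    norm_const = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ sol) / norm_const

# Example: standard 2-D Gaussian evaluated at the origin (should be 1/(2*pi) ~= 0.159)
print(gaussian_pdf(np.zeros(2), np.zeros(2), np.eye(2)))
```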

13 Mixture of Gaussian

• The probability density of a mixture of $K$ Gaussians is
$$p(x) = \sum_{j=1}^{K} w_j \cdot \mathcal{N}(x \mid \mu_j, \Sigma_j)$$
where $w_j$ is the mixing weight of the $j$-th Gaussian
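A small sketch of evaluating the mixture density above; it uses `scipy.stats.multivariate_normal` for the component densities, and all names and example parameters are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_pdf(x, weights, means, covs):
    """Density of a K-component Gaussian mixture at point x."""
    return sum(w * multivariate_normal.pdf(x, mean=mu, cov=S)
               for w, mu, S in zip(weights, means, covs))

# Two 1-D components with weights 0.3 and 0.7
weights = [0.3, 0.7]
means = [np.array([0.0]), np.array([4.0])]
covs = [np.array([[1.0]]), np.array([[2.0]])]
print(mixture_pdf(np.array([1.0]), weights, means, covs))
```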

• Example

14 Examples

(Figure: observed data points)

15 Data generation

• Let the parameter set be $\theta = \{w_j, \mu_j, \Sigma_j : j\}$; the probability density of the Gaussian mixture can then be written as
$$p(x \mid \theta) = \sum_{j=1}^{K} w_j \cdot \mathcal{N}(x \mid \mu_j, \Sigma_j)$$
• This is equivalent to generating data points in two steps (see the sampling sketch below):
  • Select which component $j$ the data point belongs to, according to the categorical (or multinoulli) distribution with parameters $w_1, \dots, w_K$
  • Generate the data point according to the density of the $j$-th component
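A minimal sketch of the two-step generative process just described: draw a component index from the categorical distribution over the weights, then draw the point from that Gaussian. Function and variable names are illustrative assumptions.

```python
import numpy as np

def sample_gmm(n, weights, means, covs, rng=None):
    """Draw n samples from a Gaussian mixture by the two-step process."""
    rng = np.random.default_rng() if rng is None else rng
    K = len(weights)
    # Step 1: pick a component index for each sample (categorical over w_1..w_K)
    z = rng.choice(K, size=n, p=weights)
    # Step 2: draw each point from its selected Gaussian component
    X = np.stack([rng.multivariate_normal(means[k], covs[k]) for k in z])
    return X, z

weights = [0.5, 0.5]
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), 0.5 * np.eye(2)]
X, z = sample_gmm(500, weights, means, covs)
print(X.shape, np.bincount(z))
```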

16 Learning task

• Given a dataset $X = \{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}$, train the GMM model

• Find the best $\theta$ that maximizes the probability $\mathbb{P}(X \mid \theta)$
  • Maximum likelihood estimation (MLE)

17 Introduce latent variable

• For data points $x^{(i)}, i = 1, \dots, N$, write the probability as
$$\mathbb{P}(x^{(i)} \mid \theta) = \sum_{j=1}^{K} w_j \cdot \mathcal{N}(x^{(i)} \mid \mu_j, \Sigma_j)$$
where $\sum_{j=1}^{K} w_j = 1$
• Introduce a latent variable $z^{(i)}$
  • $z^{(i)}$ is the Gaussian cluster ID, indicating which Gaussian $x^{(i)}$ comes from
  • $\mathbb{P}(z^{(i)} = j) = w_j$ ("prior")
  • $\mathbb{P}(x^{(i)} \mid z^{(i)} = j; \theta) = \mathcal{N}(x^{(i)} \mid \mu_j, \Sigma_j)$
  • $\mathbb{P}(x^{(i)} \mid \theta) = \sum_{j=1}^{K} \mathbb{P}(z^{(i)} = j) \cdot \mathbb{P}(x^{(i)} \mid z^{(i)} = j; \theta)$

18 Maximum likelihood

• Let $l(\theta) = \sum_{i=1}^{N} \log \mathbb{P}(x^{(i)}; \theta) = \sum_{i=1}^{N} \log \sum_{j} \mathbb{P}(x^{(i)}, z^{(i)} = j; \theta)$ be the log-likelihood
• We want to solve
$$\operatorname{argmax}_{\theta}\, l(\theta) = \operatorname{argmax}_{\theta} \sum_{i=1}^{N} \log \sum_{j=1}^{K} \mathbb{P}(z^{(i)} = j) \cdot \mathbb{P}(x^{(i)} \mid z^{(i)} = j; \theta) = \operatorname{argmax}_{\theta} \sum_{i=1}^{N} \log \sum_{j=1}^{K} w_j \cdot \mathcal{N}(x^{(i)} \mid \mu_j, \Sigma_j)$$
• No closed-form solution by solving
$$\frac{\partial\, \mathbb{P}(X \mid \theta)}{\partial w} = 0, \quad \frac{\partial\, \mathbb{P}(X \mid \theta)}{\partial \mu} = 0, \quad \frac{\partial\, \mathbb{P}(X \mid \theta)}{\partial \Sigma} = 0$$

19 Likelihood maximization

• If we know $z^{(i)}$ for all $i$, the problem becomes
$$\operatorname{argmax}_{\theta} \sum_{i=1}^{N} \log \mathbb{P}(x^{(i)}, z^{(i)} \mid \theta) = \operatorname{argmax}_{\theta} \sum_{i=1}^{N} \left[ \log \mathbb{P}(x^{(i)} \mid \theta, z^{(i)}) + \log \mathbb{P}(z^{(i)} \mid \theta) \right] = \operatorname{argmax}_{\theta} \sum_{i=1}^{N} \left[ \log \mathcal{N}(x^{(i)} \mid \mu_{z^{(i)}}, \Sigma_{z^{(i)}}) + \log w_{z^{(i)}} \right]$$
• The solution (averaging over each cluster, as in the sketch below) is
  • $w_j = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{z^{(i)} = j\}$
  • $\mu_j = \frac{\sum_{i=1}^{N} \mathbf{1}\{z^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{N} \mathbf{1}\{z^{(i)} = j\}}$
  • $\Sigma_j = \frac{\sum_{i=1}^{N} \mathbf{1}\{z^{(i)} = j\}\, (x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^\top}{\sum_{i=1}^{N} \mathbf{1}\{z^{(i)} = j\}}$
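If the assignments $z^{(i)}$ were observed, the closed-form estimates above reduce to per-cluster frequencies, means, and covariances. A sketch with assumed names, not the lecture's code:

```python
import numpy as np

def fit_gmm_with_labels(X, z, K):
    """Closed-form MLE of (w_j, mu_j, Sigma_j) when the cluster labels z are known."""
    N, d = X.shape
    w = np.zeros(K)
    mu = np.zeros((K, d))
    Sigma = np.zeros((K, d, d))
    for j in range(K):
        mask = (z == j)
        Nj = mask.sum()
        w[j] = Nj / N                    # fraction of points in cluster j
        mu[j] = X[mask].mean(axis=0)     # cluster mean
        diff = X[mask] - mu[j]
        Sigma[j] = diff.T @ diff / Nj    # cluster covariance
    return w, mu, Sigma
```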

20 Likelihood maximization (cont.)

• Given the parameters $\theta = \{w_j, \mu_j, \Sigma_j : j\}$, the posterior distribution of each latent variable $z^{(i)}$ can be inferred:
$$\mathbb{P}(z^{(i)} = j \mid x^{(i)}; \theta) = \frac{\mathbb{P}(x^{(i)}, z^{(i)} = j \mid \theta)}{\mathbb{P}(x^{(i)} \mid \theta)} = \frac{\mathbb{P}(x^{(i)} \mid z^{(i)} = j; \mu_j, \Sigma_j)\; \mathbb{P}(z^{(i)} = j \mid w)}{\sum_{j'=1}^{K} \mathbb{P}(x^{(i)} \mid z^{(i)} = j'; \mu_{j'}, \Sigma_{j'})\; \mathbb{P}(z^{(i)} = j' \mid w)}$$

• Or, writing $w_j^{(i)} = \mathbb{P}(z^{(i)} = j \mid x^{(i)}; \theta)$, we have $w_j^{(i)} \propto \mathbb{P}(x^{(i)} \mid z^{(i)} = j; \mu_j, \Sigma_j)\; \mathbb{P}(z^{(i)} = j \mid w)$
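The posterior (responsibility) computation above is what the E-step implements; a sketch with illustrative names, using SciPy for the component densities:

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, w, mu, Sigma):
    """E-step: posterior probability that each point came from each component."""
    N, K = X.shape[0], len(w)
    R = np.zeros((N, K))
    for j in range(K):
        R[:, j] = w[j] * multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j])
    R /= R.sum(axis=1, keepdims=True)  # normalize over the K components
    return R  # R[i, j] = P(z_i = j | x_i; theta)
```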

21 Likelihood maximization (cont.)

Write $w_j^{(i)} = \mathbb{P}(z^{(i)} = j \mid x^{(i)}; \theta)$.

• For every possible value of the $z^{(i)}$'s, the solution
  • $w_j = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{z^{(i)} = j\}$, $\quad \mu_j = \frac{\sum_{i=1}^{N} \mathbf{1}\{z^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{N} \mathbf{1}\{z^{(i)} = j\}}$, $\quad \Sigma_j = \frac{\sum_{i=1}^{N} \mathbf{1}\{z^{(i)} = j\}\, (x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^\top}{\sum_{i=1}^{N} \mathbf{1}\{z^{(i)} = j\}}$
• is equivalent to
  • $w_j = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{z^{(i)} = j\}$, $\quad \left(\sum_{i=1}^{N} \mathbf{1}\{z^{(i)} = j\}\right) \mu_j = \sum_{i=1}^{N} \mathbf{1}\{z^{(i)} = j\}\, x^{(i)}$, $\quad \left(\sum_{i=1}^{N} \mathbf{1}\{z^{(i)} = j\}\right) \Sigma_j = \sum_{i=1}^{N} \mathbf{1}\{z^{(i)} = j\}\, (x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^\top$
• Taking the expectation over the probability of $z^{(i)}$ on both sides gives
  • $w_j = \frac{1}{N} \sum_{i=1}^{N} \mathbb{P}(z^{(i)} = j)$, $\quad \left(\sum_{i=1}^{N} \mathbb{P}(z^{(i)} = j)\right) \mu_j = \sum_{i=1}^{N} \mathbb{P}(z^{(i)} = j)\, x^{(i)}$, $\quad \left(\sum_{i=1}^{N} \mathbb{P}(z^{(i)} = j)\right) \Sigma_j = \sum_{i=1}^{N} \mathbb{P}(z^{(i)} = j)\, (x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^\top$
• So the solution is
  • $w_j = \frac{1}{N} \sum_{i=1}^{N} w_j^{(i)}$, $\quad \mu_j = \frac{\sum_{i=1}^{N} w_j^{(i)}\, x^{(i)}}{\sum_{i=1}^{N} w_j^{(i)}}$, $\quad \Sigma_j = \frac{\sum_{i=1}^{N} w_j^{(i)}\, (x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^\top}{\sum_{i=1}^{N} w_j^{(i)}}$

22 Likelihood maximization (cont.)

• If we know the distribution of the $z^{(i)}$'s, then it is equivalent to maximize
$$l(\mathbb{P}_{z^{(i)}}, \theta) = \sum_{i=1}^{N} \sum_{j=1}^{K} \mathbb{P}(z^{(i)} = j) \log \frac{\mathbb{P}(x^{(i)}, z^{(i)} = j \mid \theta)}{\mathbb{P}(z^{(i)} = j)}$$
  • Compare with the objective $\operatorname{argmax}_{\theta} \sum_{i=1}^{N} \log \mathbb{P}(x^{(i)}, z^{(i)} \mid \theta)$ used previously when the values of $z^{(i)}$ are known; $l(\mathbb{P}_{z^{(i)}}, \theta)$ is a lower bound of the marginal log-likelihood $l(\theta)$ below
• Note that it is hard to solve if we directly maximize
$$l(\theta) = \sum_{i=1}^{N} \log \mathbb{P}(x^{(i)} \mid \theta) = \sum_{i=1}^{N} \log \sum_{j=1}^{K} \mathbb{P}(z^{(i)} = j \mid \theta)\, \mathbb{P}(x^{(i)} \mid z^{(i)} = j; \theta)$$

23 Expectation maximization methods

• E-step:
  • Infer the posterior distribution of the latent variables given the model parameters
• M-step:
  • Tune the parameters to maximize the data likelihood given the latent variable distribution
• EM methods:
  • Iteratively execute the E-step and M-step until convergence

24 EM for GMM

• Repeat until convergence: {
  (E-step) For each $i, j$, set
  $$w_j^{(i)} = \mathbb{P}(z^{(i)} = j \mid x^{(i)}; w, \mu, \Sigma)$$
  (M-step) Update the parameters
  $$w_j = \frac{1}{N} \sum_{i=1}^{N} w_j^{(i)}, \quad \mu_j = \frac{\sum_{i=1}^{N} w_j^{(i)}\, x^{(i)}}{\sum_{i=1}^{N} w_j^{(i)}}, \quad \Sigma_j = \frac{\sum_{i=1}^{N} w_j^{(i)}\, (x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^\top}{\sum_{i=1}^{N} w_j^{(i)}}$$
  } (see the code sketch below)

25 Example

$$p(x \mid \theta) = \sum_{j=1}^{K} w_j \cdot \mathcal{N}(x \mid \mu_j, \Sigma_j)$$

• Hidden variable: for each point, which Gaussian generated it?
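Putting the E-step and M-step updates above together gives the EM loop for a GMM. The sketch below is one possible implementation under simple assumptions (random initialization from data points, a fixed number of iterations, and a small ridge added to the covariances for numerical stability); it is not the lecture's reference code.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    """Fit a K-component GMM with EM. Returns weights, means, covariances, responsibilities."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Initialization: uniform weights, K random data points as means, data covariance per component
    w = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)]
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * K)

    for _ in range(n_iter):
        # E-step: responsibilities R[i, j] = P(z_i = j | x_i; theta)
        R = np.zeros((N, K))
        for j in range(K):
            R[:, j] = w[j] * multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j])
        R /= R.sum(axis=1, keepdims=True)

        # M-step: re-estimate parameters from the soft assignments
        Nj = R.sum(axis=0)                 # effective cluster sizes
        w = Nj / N
        mu = (R.T @ X) / Nj[:, None]
        for j in range(K):
            diff = X - mu[j]
            Sigma[j] = (R[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(d)
    return w, mu, Sigma, R
```

Usage would look like `w, mu, Sigma, R = em_gmm(X, K=3)`, with `R.argmax(axis=1)` giving hard cluster labels if needed.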

26 Example (cont.)

• E-step: for each point, estimate the probability that each Gaussian component generated it

27 Example (cont.)

• M-Step: modify the parameters according to the hidden variable to maximize the likelihood of the data

28 Example 2

29 Remaining issues

• Initialization:
  • GMM is fit with an EM algorithm, which is very sensitive to initial conditions

• Number of Gaussians:
  • Use information-theoretic criteria to obtain the optimal $K$
  • E.g. minimal description length (MDL)

30 Minimal description length (MDL)

• All statistical learning is about finding regularities in data, and the best hypothesis to describe the regularities in the data is also the one that is able to compress the data most
• In an MDL framework, a good model is one that allows an efficient (i.e. short) encoding of the data, but whose own description is also efficient
• For the GMM, the MDL score combines two terms: the (negative) log-likelihood under the mixture (the cost of coding the data) plus a penalty growing with the number of free parameters (the cost of coding the GMM itself); a code sketch appears after the example below
  • $X$ is the vector of data elements
  • $p(x)$ is the probability given by the GMM
  • $P$ is the number of free parameters needed to describe the GMM:
$$P = K\left(d + \frac{d(d+1)}{2}\right) + (K - 1)$$

31 Example: $K = 2$ vs $K = 16$

400 new data points generated from the 16-component GMM
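To make the criterion concrete, one common form of the MDL score is the negative log-likelihood plus $\tfrac{1}{2} P \log N$; the slide's exact penalty term did not survive extraction, so treat the $\tfrac{1}{2} P \log N$ term below as an assumption. The free-parameter count $P$ follows the formula above.

```python
import numpy as np

def gmm_mdl(log_likelihood, K, d, N):
    """MDL score for a K-component GMM on N d-dimensional points (lower is better)."""
    # Free parameters: K means (d each), K covariances (d(d+1)/2 each), K-1 independent weights
    P = K * (d + d * (d + 1) // 2) + (K - 1)
    return -log_likelihood + 0.5 * P * np.log(N)

# Example usage: fit a GMM for each candidate K (e.g. with the em_gmm sketch above),
# compute its data log-likelihood ll_K, and pick the K with the smallest gmm_mdl(ll_K, K, d, N)
```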

32 Remaining issues (cont.)

• Simplification of the covariance matrices

33 Application in computer vision

• Image segmentation

34 Application in computer vision (cont.)

• Object tracking: knowing the moving object distribution in the first frame, we can localize the object in the next frames by tracking its distribution

35 Expectation and Maximization

36 Background: Latent variable models

• Some of the variables in the model are not observed
• Examples: mixture models, Hidden Markov Models, LDA, etc.

37 Background: Marginal likelihood

• Joint model $\mathbb{P}(x, z \mid \theta)$, where $\theta$ is the model parameter
• With $z$ unobserved, we marginalize out $z$ and use the marginal log-likelihood for learning:
$$\mathbb{P}(x \mid \theta) = \sum_{z} \mathbb{P}(x, z \mid \theta)$$
• Example: mixture model
$$\mathbb{P}(x \mid \theta) = \sum_{k} \mathbb{P}(x \mid z = k, \theta_k)\, \mathbb{P}(z = k \mid \theta_k) = \sum_{k} \pi_k\, \mathbb{P}(x \mid z = k, \theta_k)$$
where $\pi_k$ are the mixing proportions

38 Examples

• Mixture of Bernoulli

• Mixture of Gaussians

• Hidden Markov Model

39 Learning the hidden variable 푍

• If all $z$ are observed, the likelihood factorizes, and learning is relatively easy

• If $z$ is not observed, we have to handle a sum inside the log
  • Idea 1: ignore this problem, simply take the derivative and follow the gradient
  • Idea 2: use the current $\theta$ to estimate $z$, fill the values in, and do fully-observed learning

• Is there a better way?

40 Example

• Let events be “grades in a class”

• Assume we want to estimate $\mu$ from data. In a class, suppose
  • $a$ students get an A, $b$ students get a B, $c$ students get a C, and $d$ students get a D

• What’s the maximum likelihood estimate of 휇 given 푎, 푏, 푐, 푑 ?

41 Example (cont.)

• For the MLE of $\mu$, set the derivative of the log-likelihood with respect to $\mu$ to zero

• Solve for $\mu$

• So, for the observed class counts, we obtain the maximum likelihood estimate of $\mu$

42 Example (cont.)

• What if the observed information is

• What is the maximum likelihood estimator of $\mu$ now?

43 Example (cont.)

• What is the MLE of $\mu$ now?
• We can answer this question circularly:
  • Expectation: if we know the value of $\mu$, we could compute the expected values of $a$ and $b$
  • Maximization: if we know the expected values of $a$ and $b$, we could compute the maximum likelihood value of $\mu$

• Since the ratio $a : b$ should be the same as the ratio $\tfrac{1}{2} : \mu$

44 Example – Algorithm

• We begin with a guess for $\mu$, then iterate between EXPECTATION and MAXIMIZATION to improve our estimates of $\mu$ and $a, b$
• Define
  • $\mu(t)$: the estimate of $\mu$ on the $t$-th iteration
  • $b(t)$: the estimate of $b$ on the $t$-th iteration

45 Example – Algorithm (cont.)

• The iteration will converge
• But it is only assured to converge to a local optimum

46 General EM methods

• Repeat until convergence: {
  (E-step) For each data point $i$, set
  $$q_i(j) = \mathbb{P}(z^{(i)} = j \mid x^{(i)}; \theta)$$
  (M-step) Update the parameters
  $$\theta = \operatorname{argmax}_{\theta}\, l(q_i, \theta) = \operatorname{argmax}_{\theta} \sum_{i=1}^{N} \sum_{j} q_i(j) \log \frac{\mathbb{P}(x^{(i)}, z^{(i)} = j; \theta)}{q_i(j)}$$
  }
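The general E/M alternation can be captured in a small template; the callables `e_step` and `m_step` below are placeholders introduced here for illustration, not part of the lecture.

```python
def em(theta0, e_step, m_step, max_iter=100, tol=1e-6):
    """Generic EM loop: alternate posterior inference (E) and parameter updates (M).

    e_step(theta) -> (q, log_likelihood): posterior over latent variables and current log-likelihood
    m_step(q)     -> theta: parameters maximizing the expected complete-data log-likelihood
    """
    theta, prev_ll = theta0, -float("inf")
    for _ in range(max_iter):
        q, ll = e_step(theta)      # E-step: q_i(j) = P(z_i = j | x_i; theta)
        theta = m_step(q)          # M-step: argmax_theta of the lower bound
        if ll - prev_ll < tol:     # the log-likelihood is monotonically non-decreasing
            break
        prev_ll = ll
    return theta
```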

47 Convergence of EM

• Denote by $\theta^{(t)}$ and $\theta^{(t+1)}$ the parameters of two successive iterations of EM; we want to prove that
$$l(\theta^{(t)}) \le l(\theta^{(t+1)})$$
where
$$l(\theta) = \sum_{i=1}^{N} \log \mathbb{P}(x^{(i)}; \theta) = \sum_{i=1}^{N} \log \sum_{j} \mathbb{P}(x^{(i)}, z^{(i)} = j; \theta)$$
• This shows EM always monotonically improves the log-likelihood, which ensures EM will at least converge to a local optimum

48 Proof

• Starting from $\theta^{(t)}$, we choose the posterior of the latent variable
$$q_i^{(t)}(j) = p(z^{(i)} = j \mid x^{(i)}; \theta^{(t)})$$
• By Jensen's inequality on the (concave) log function, for any distribution $q_i$,
$$l(q, \theta^{(t)}) = \sum_{i=1}^{N} \sum_{j} q_i(j) \log \frac{p(x^{(i)}, z^{(i)} = j; \theta^{(t)})}{q_i(j)} \le \sum_{i=1}^{N} \log \sum_{j} q_i(j) \cdot \frac{p(x^{(i)}, z^{(i)} = j; \theta^{(t)})}{q_i(j)} = \sum_{i=1}^{N} \log \sum_{j} p(x^{(i)}, z^{(i)} = j; \theta^{(t)}) = l(\theta^{(t)})$$
• When $q_i(j) = q_i^{(t)}(j) = p(z^{(i)} = j \mid x^{(i)}; \theta^{(t)})$, the inequality above holds with equality, so
$$l(\theta^{(t)}) = l(q^{(t)}, \theta^{(t)}) \ge l(q, \theta^{(t)})$$

49 Proof (cont.)

• Then the parameter 휃(푡+1) is obtained by maximizing

$$l(q_i^{(t)}, \theta) = \sum_{i=1}^{N} \sum_{j} q_i^{(t)}(j) \log \frac{p(x^{(i)}, z^{(i)} = j; \theta)}{q_i^{(t)}(j)}$$
• Thus
$$l(\theta^{(t+1)}) \ge l(q_i^{(t)}, \theta^{(t+1)}) \ge l(q_i^{(t)}, \theta^{(t)}) = l(\theta^{(t)})$$
• Since the likelihood is bounded above (at most 1 for discrete data), the monotone sequence of likelihood values converges, so EM will converge (to a local optimum)

50 Example

• Following the previous example, suppose $h = 20$, $c = 10$, $d = 10$, $\mu(0) = 0$

• Then the iterations proceed as sketched below

• Convergence is generally linear: error decreases by a constant factor each time step
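For the grades example, a sketch of the iteration, assuming the standard setup that usually accompanies this example (the slide's own probability model did not survive extraction): $P(A) = \tfrac{1}{2}$, $P(B) = \mu$, $P(C) = 2\mu$, $P(D) = \tfrac{1}{2} - 3\mu$, and the observed information is $h = a + b$ together with $c$ and $d$.

```python
def em_grades(h, c, d, mu0=0.0, n_iter=10):
    """EM for the grades example (assumed model: P(A)=1/2, P(B)=mu, P(C)=2mu, P(D)=1/2-3mu)."""
    mu = mu0
    for t in range(n_iter):
        # E-step: expected number of B grades among the h students known to have A or B
        b = h * mu / (0.5 + mu)
        # M-step: MLE of mu given the (expected) counts b, c, d
        mu = (b + c) / (6.0 * (b + c + d))
        print(f"t={t + 1}: b={b:.4f}, mu={mu:.4f}")
    return mu

em_grades(h=20, c=10, d=10, mu0=0.0)   # under these assumptions, converges to roughly mu ~ 0.095
```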

51 Advantages

• EM is often applied to NP-hard problems
• Widely used
• Often effective

52 K-means and EM

• After initializing the positions of the centers, K-means iteratively executes the following two operations (see the sketch below):
  • Assign the label of each point based on its distance to the centers
    • A specific (hard) posterior distribution of the latent variable
    • E-step
  • Update the positions of the centers based on the labeling results
    • Given the labels, optimize the model parameters
    • M-step
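To make the correspondence concrete, here is K-means written in E/M form (hard assignments in the E-step, mean updates in the M-step); a simplified sketch with assumed names, not the lecture's code.

```python
import numpy as np

def kmeans_em(X, K, n_iter=50, seed=0):
    """K-means phrased as EM with hard assignments."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # "E-step": assign each point to its nearest center (a degenerate, one-hot posterior)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # "M-step": move each center to the mean of its assigned points (keep it if the cluster is empty)
        centers = np.stack([X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
                            for k in range(K)])
    return centers, labels
```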
