Lecture 12: Gaussian Mixture Models
Shuai Li
John Hopcroft Center, Shanghai Jiao Tong University
https://shuaili8.github.io
https://shuaili8.github.io/Teaching/VE445/index.html
Outline
• Generative models
• GMM
• EM
Generative Models (review)
Discriminative / Generative Models
• Discriminative models
  • Model the dependence of unobserved variables on observed ones
  • Also called conditional models
  • Can be deterministic or probabilistic
• Generative models
  • Model the joint probability distribution of the data
  • Given some hidden parameters or variables
  • Then do conditional inference
Discriminative Models
• Model the dependence of unobserved variables on observed ones
  • Also called conditional models
  • Deterministic: e.g. linear regression
  • Probabilistic: e.g. logistic regression
• Directly model the dependence for label prediction
• Easy to define dependence on specific features and models
• Practically yield higher prediction performance
• E.g. linear regression, logistic regression, k-nearest neighbors, SVMs, (multi-layer) perceptrons, decision trees, random forests

Generative Models
• Generative models
  • Model the joint probability distribution of the data
  • Given some hidden parameters or variables
  • Then do conditional inference
• Recover the data distribution [essence of data science]
• Benefit from hidden-variable modeling
• E.g. Naive Bayes, Hidden Markov Models, mixtures of Gaussians, Markov Random Fields, Latent Dirichlet Allocation
Discriminative Models vs Generative Models
• In general
  • A discriminative model models the decision boundary between the classes
  • A generative model explicitly models the actual distribution of each class
• Example: our training set is a bag of fruits, where only apples and oranges are labeled (imagine a post-it note stuck to each fruit)
  • A generative model will model various attributes of the fruits, such as color, weight, shape, etc.
  • A discriminative model might model color alone, should that suffice to distinguish apples from oranges
Gaussian Mixture Models
Gaussian Mixture Models
• GMM is a clustering algorithm
• Differences from K-means:
  • K-means outputs the label of a sample
  • GMM outputs the probability that a sample belongs to each cluster
• GMM can also be used to generate new samples!
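The contrast can be made concrete with a small sketch. All parameter values below are illustrative assumptions, not from the slides; the point is the hard label a K-means-style assignment gives versus the soft membership probabilities a GMM gives:

```python
import numpy as np

# Toy 1-D setting (assumed values): two components and one query point
mu = np.array([0.0, 4.0])        # cluster centers
sigma = np.array([1.0, 1.0])     # component standard deviations
w = np.array([0.5, 0.5])         # mixing weights
x = 1.5                          # a point closer to the first center

# K-means-style output: a single hard label (nearest center)
hard_label = np.argmin(np.abs(x - mu))

# GMM-style output: a probability of membership in each cluster
pdf = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
prob = w * pdf / np.sum(w * pdf)

print(hard_label)  # 0
print(prob)        # soft membership, roughly [0.88, 0.12], instead of one label
```

The soft probabilities also make generation possible: sampling a component from `prob`-like weights and then sampling from that Gaussian produces new data points.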
K-means vs GMM
Gaussian distribution
• Very common in probability theory and important in statistics
• Often used in the natural and social sciences to represent real-valued random variables whose distributions are not known
• Useful because of the central limit theorem
  • Averages of samples independently drawn from the same distribution converge in distribution to the normal with the true mean and variance, i.e. they become normally distributed when the number of observations is sufficiently large
• Physical quantities that are expected to be the sum of many independent processes often have distributions that are nearly normal
• The probability density of the Gaussian distribution is
  $$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

High-dimensional Gaussian distribution
• The probability density of the Gaussian distribution on $x = (x_1, \ldots, x_d)^\top$ is
  $$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d \,|\Sigma|}} \exp\left(-\frac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)\right)$$
  • where $\mu$ is the mean vector
  • $\Sigma$ is the symmetric covariance matrix (positive semi-definite)
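As a quick check of the multivariate density, here is a minimal NumPy sketch that evaluates it directly from the formula (the test point and parameters are illustrative):

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Density of a d-dimensional Gaussian N(x | mu, Sigma)."""
    d = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    # solve() applies Sigma^{-1} without forming the explicit inverse
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

# Standard 2-D Gaussian at the origin: density is 1 / (2*pi)
p = gaussian_pdf(np.zeros(2), np.zeros(2), np.eye(2))
print(p)  # ≈ 0.1591549
```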
Mixture of Gaussians
• The probability density of a mixture of $K$ Gaussians is
  $$p(x) = \sum_{j=1}^{K} w_j \cdot \mathcal{N}(x \mid \mu_j, \Sigma_j)$$
  where $w_j$ is the prior probability (mixing weight) of the $j$-th Gaussian
• Example
Examples
(figure: observation data drawn from Gaussian mixtures)

Data generation
• Let the parameter set be $\theta = \{w_j, \mu_j, \Sigma_j : j\}$; then the probability density of the Gaussian mixture can be written as
  $$p(x \mid \theta) = \sum_{j=1}^{K} w_j \cdot \mathcal{N}(x \mid \mu_j, \Sigma_j)$$
• Equivalent to generating data points in two steps:
  • Select the component $j$ the data point belongs to according to the categorical (or multinoulli) distribution with parameters $w_1, \ldots, w_K$
  • Generate the data point according to the density of the $j$-th component
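The two-step generation can be sketched in a few lines of NumPy (the weights, means, and standard deviations are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1-D mixture: weights, means, standard deviations (assumed values)
w = np.array([0.3, 0.7])
mu = np.array([-2.0, 3.0])
sigma = np.array([0.5, 1.0])

def sample_gmm(n):
    """Two-step generation: pick a component, then draw from its Gaussian."""
    z = rng.choice(len(w), size=n, p=w)      # step 1: categorical over components
    return rng.normal(mu[z], sigma[z]), z    # step 2: sample from component z

x, z = sample_gmm(10000)
print(np.mean(z == 1))  # fraction drawn from component 1, close to w[1] = 0.7
```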
Learning task
• Given a dataset $X = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$, train the GMM model
• Find the best $\theta$ that maximizes the probability $\mathbb{P}(X \mid \theta)$
  • Maximum likelihood estimation (MLE)
Introduce a latent variable
• For data points $x^{(i)}$, $i = 1, \ldots, N$, write the probability as
  $$\mathbb{P}(x^{(i)} \mid \theta) = \sum_{j=1}^{K} w_j \cdot \mathcal{N}(x^{(i)} \mid \mu_j, \Sigma_j)$$
  where $\sum_{j=1}^{K} w_j = 1$
• Introduce a latent variable
  • $z^{(i)}$ is the Gaussian cluster ID indicating which Gaussian $x^{(i)}$ comes from
  • $\mathbb{P}(z^{(i)} = j) = w_j$  ("prior")
  • $\mathbb{P}(x^{(i)} \mid z^{(i)} = j; \theta) = \mathcal{N}(x^{(i)} \mid \mu_j, \Sigma_j)$
  • $\mathbb{P}(x^{(i)} \mid \theta) = \sum_{j=1}^{K} \mathbb{P}(z^{(i)} = j) \cdot \mathbb{P}(x^{(i)} \mid z^{(i)} = j; \theta)$
Maximum likelihood
• Let $l(\theta) = \sum_{i=1}^{N} \log \mathbb{P}(x^{(i)}; \theta) = \sum_{i=1}^{N} \log \sum_{j} \mathbb{P}(x^{(i)}, z^{(i)} = j; \theta)$ be the log-likelihood
• We want to solve
  $$\arg\max_\theta l(\theta) = \arg\max_\theta \sum_{i=1}^{N} \log \sum_{j=1}^{K} \mathbb{P}(z^{(i)} = j) \cdot \mathbb{P}(x^{(i)} \mid z^{(i)} = j; \theta) = \arg\max_\theta \sum_{i=1}^{N} \log \sum_{j=1}^{K} w_j \cdot \mathcal{N}(x^{(i)} \mid \mu_j, \Sigma_j)$$
• There is no closed-form solution from setting
  $$\frac{\partial \mathbb{P}(X \mid \theta)}{\partial w} = 0, \quad \frac{\partial \mathbb{P}(X \mid \theta)}{\partial \mu} = 0, \quad \frac{\partial \mathbb{P}(X \mid \theta)}{\partial \Sigma} = 0$$
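The log-likelihood can be evaluated stably with the log-sum-exp trick, since the inner sum of weighted densities easily underflows. A 1-D sketch with illustrative parameters:

```python
import numpy as np

def gmm_log_likelihood(X, w, mu, sigma):
    """l(theta) = sum_i log sum_j w_j * N(x_i | mu_j, sigma_j^2) for a 1-D
    mixture, computed in log space with the log-sum-exp trick."""
    # log N(x_i | mu_j, sigma_j^2), shape (N, K)
    log_pdf = (-0.5 * ((X[:, None] - mu[None, :]) / sigma) ** 2
               - np.log(sigma * np.sqrt(2 * np.pi)))
    a = np.log(w)[None, :] + log_pdf          # log(w_j * N(...)), shape (N, K)
    m = a.max(axis=1, keepdims=True)          # subtract the max before exp
    return (m[:, 0] + np.log(np.exp(a - m).sum(axis=1))).sum()

X = np.array([-1.0, 0.0, 1.0])
ll = gmm_log_likelihood(X, np.array([0.5, 0.5]),
                        np.array([0.0, 1.0]), np.array([1.0, 1.0]))
print(ll)
```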
Likelihood maximization
• If we knew $z^{(i)}$ for all $i$, the problem would become
  $$\arg\max_\theta l(\theta) = \arg\max_\theta \sum_{i=1}^{N} \log \mathbb{P}(x^{(i)}, z^{(i)} \mid \theta) = \arg\max_\theta \sum_{i=1}^{N} \left[ \log \mathbb{P}(x^{(i)} \mid \theta, z^{(i)}) + \log \mathbb{P}(z^{(i)} \mid \theta) \right] = \arg\max_\theta \sum_{i=1}^{N} \left[ \log \mathcal{N}(x^{(i)} \mid \mu_{z^{(i)}}, \Sigma_{z^{(i)}}) + \log w_{z^{(i)}} \right]$$
• The solution averages over each cluster:
  $$w_j = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{z^{(i)} = j\}, \qquad \mu_j = \frac{\sum_{i=1}^{N} \mathbf{1}\{z^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{N} \mathbf{1}\{z^{(i)} = j\}}, \qquad \Sigma_j = \frac{\sum_{i=1}^{N} \mathbf{1}\{z^{(i)} = j\}\, (x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^\top}{\sum_{i=1}^{N} \mathbf{1}\{z^{(i)} = j\}}$$
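The closed-form solution with known labels can be sketched as follows (a minimal NumPy version; the helper name `mle_known_z` is ours, not from the slides):

```python
import numpy as np

def mle_known_z(X, z, K):
    """Closed-form MLE of (w, mu, Sigma) when cluster labels z are observed.
    X: (N, d) data matrix, z: (N,) integer labels in {0, ..., K-1}."""
    N, d = X.shape
    w = np.array([np.mean(z == j) for j in range(K)])              # cluster fractions
    mu = np.array([X[z == j].mean(axis=0) for j in range(K)])      # cluster means
    # bias=True divides by the cluster size (the MLE), not size - 1
    Sigma = np.array([np.cov(X[z == j].T, bias=True).reshape(d, d)
                      for j in range(K)])
    return w, mu, Sigma

X = np.array([[0.0], [2.0], [10.0], [12.0]])
z = np.array([0, 0, 1, 1])
w, mu, Sigma = mle_known_z(X, z, K=2)
print(w, mu.ravel(), Sigma.ravel())  # [0.5 0.5] [ 1. 11.] [1. 1.]
```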
Likelihood maximization (cont.)
• Given the parameters $\theta = \{w_j, \mu_j, \Sigma_j : j\}$, the posterior distribution of each latent variable $z^{(i)}$ can be inferred:
  $$\mathbb{P}(z^{(i)} = j \mid x^{(i)}; \theta) = \frac{\mathbb{P}(x^{(i)}, z^{(i)} = j \mid \theta)}{\mathbb{P}(x^{(i)} \mid \theta)} = \frac{\mathbb{P}(x^{(i)} \mid z^{(i)} = j; \mu_j, \Sigma_j)\, \mathbb{P}(z^{(i)} = j \mid w)}{\sum_{j'=1}^{K} \mathbb{P}(x^{(i)} \mid z^{(i)} = j'; \mu_{j'}, \Sigma_{j'})\, \mathbb{P}(z^{(i)} = j' \mid w)}$$
• Or $w_j^{(i)} = \mathbb{P}(z^{(i)} = j \mid x^{(i)}; \theta) \propto \mathbb{P}(x^{(i)} \mid z^{(i)} = j; \mu_j, \Sigma_j)\, \mathbb{P}(z^{(i)} = j \mid w)$
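The posterior computation can be sketched for the 1-D case (the function name and test values are illustrative):

```python
import numpy as np

def responsibilities(X, w, mu, sigma):
    """Posterior P(z = j | x_i) for a 1-D GMM: proportional to
    w_j * N(x_i | mu_j, sigma_j^2), normalized over components j."""
    pdf = (np.exp(-0.5 * ((X[:, None] - mu[None, :]) / sigma) ** 2)
           / (sigma * np.sqrt(2 * np.pi)))
    unnorm = w[None, :] * pdf                      # prior times likelihood
    return unnorm / unnorm.sum(axis=1, keepdims=True)

# A point exactly between two symmetric components gets a 50/50 posterior
r = responsibilities(np.array([0.0]), np.array([0.5, 0.5]),
                     np.array([-1.0, 1.0]), np.array([1.0, 1.0]))
print(r)  # [[0.5 0.5]]
```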
Likelihood maximization (cont.)
• Recall $w_j^{(i)} = \mathbb{P}(z^{(i)} = j \mid x^{(i)}; \theta)$
• For every possible value of the $z^{(i)}$'s, the hard-assignment solution can be written equivalently as
  $$w_j = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{z^{(i)} = j\}, \qquad \left(\sum_{i=1}^{N} \mathbf{1}\{z^{(i)} = j\}\right) \mu_j = \sum_{i=1}^{N} \mathbf{1}\{z^{(i)} = j\}\, x^{(i)}, \qquad \left(\sum_{i=1}^{N} \mathbf{1}\{z^{(i)} = j\}\right) \Sigma_j = \sum_{i=1}^{N} \mathbf{1}\{z^{(i)} = j\}\, (x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^\top$$
• Taking expectations on both sides with respect to the distribution of $z^{(i)}$ replaces each indicator $\mathbf{1}\{z^{(i)} = j\}$ by the probability $\mathbb{P}(z^{(i)} = j) = w_j^{(i)}$, giving
  $$w_j = \frac{1}{N} \sum_{i=1}^{N} w_j^{(i)}, \qquad \mu_j = \frac{\sum_{i=1}^{N} w_j^{(i)} x^{(i)}}{\sum_{i=1}^{N} w_j^{(i)}}, \qquad \Sigma_j = \frac{\sum_{i=1}^{N} w_j^{(i)} (x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^\top}{\sum_{i=1}^{N} w_j^{(i)}}$$

Likelihood maximization (cont.)
• If we know the distribution of the $z^{(i)}$'s, then it is equivalent to maximize
  $$l(\mathbb{P}_{z^{(i)}}, \theta) = \sum_{i=1}^{N} \sum_{j=1}^{K} \mathbb{P}(z^{(i)} = j) \log \frac{\mathbb{P}(x^{(i)}, z^{(i)} = j \mid \theta)}{\mathbb{P}(z^{(i)} = j)}$$
• Compared with the previous case where the values of $z^{(i)}$ are known, this is a lower bound of $\arg\max_\theta \sum_{i=1}^{N} \log \mathbb{P}(x^{(i)}, z^{(i)} \mid \theta)$
• Note that it is hard to solve if we directly maximize
  $$l(\theta) = \sum_{i=1}^{N} \log \mathbb{P}(x^{(i)} \mid \theta) = \sum_{i=1}^{N} \log \sum_{j=1}^{K} \mathbb{P}(z^{(i)} = j \mid \theta)\, \mathbb{P}(x^{(i)} \mid z^{(i)} = j; \theta)$$
Expectation maximization methods
• E-step: infer the posterior distribution of the latent variables given the model parameters
• M-step: tune the parameters to maximize the data likelihood given the latent variable distribution
• EM methods iteratively execute the E-step and M-step until convergence
EM for GMM
• Repeat until convergence: {
  (E-step) For each $i, j$, set
  $$w_j^{(i)} = \mathbb{P}(z^{(i)} = j \mid x^{(i)}; w, \mu, \Sigma)$$
  (M-step) Update the parameters
  $$w_j = \frac{1}{N} \sum_{i=1}^{N} w_j^{(i)}, \qquad \mu_j = \frac{\sum_{i=1}^{N} w_j^{(i)} x^{(i)}}{\sum_{i=1}^{N} w_j^{(i)}}, \qquad \Sigma_j = \frac{\sum_{i=1}^{N} w_j^{(i)} (x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^\top}{\sum_{i=1}^{N} w_j^{(i)}}$$
  }

Example
$$p(x \mid \theta) = \sum_{j=1}^{K} w_j \cdot \mathcal{N}(x \mid \mu_j, \Sigma_j)$$
• Hidden variable: for each point, which Gaussian generated it?
Example (cont.)
• E-step: for each point, estimate the probability that each Gaussian component generated it
Example (cont.)
• M-Step: modify the parameters according to the hidden variable to maximize the likelihood of the data
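The E- and M-steps described above can be put together into a minimal 1-D EM loop. This is a sketch: the quantile-based initialization and the test data are illustrative choices, not taken from the slides.

```python
import numpy as np

def em_gmm(X, K, n_iter=100):
    """EM for a 1-D Gaussian mixture; returns (weights, means, variances)."""
    w = np.full(K, 1.0 / K)
    mu = np.quantile(X, (np.arange(K) + 0.5) / K)  # spread initial means over the data
    var = np.full(K, X.var())
    for _ in range(n_iter):
        # E-step: responsibilities r[i, j] = P(z_i = j | x_i; theta)
        pdf = np.exp(-0.5 * (X[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = w * pdf
        r /= r.sum(axis=1, keepdims=True)
        # M-step: soft-count updates for weights, means, variances
        Nj = r.sum(axis=0)
        w = Nj / len(X)
        mu = (r * X[:, None]).sum(axis=0) / Nj
        var = (r * (X[:, None] - mu) ** 2).sum(axis=0) / Nj
    return w, mu, var

# Two well-separated clusters around -5 and +5
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-5, 1, 500), rng.normal(5, 1, 500)])
w, mu, var = em_gmm(X, K=2)
print(np.sort(mu))  # close to [-5, 5]
```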
Example 2
Remaining issues
• Initialization:
  • GMM is a kind of EM algorithm, which is very sensitive to initial conditions
• Number of Gaussians:
  • Use an information-theoretic criterion to obtain the optimal K, e.g. minimum description length (MDL)
Minimum description length (MDL)
• All statistical learning is about finding regularities in data, and the best hypothesis to describe the regularities in the data is also the one that is able to compress the data most
• In an MDL framework, a good model is one that allows an efficient (i.e. short) encoding of the data, but whose own description is also efficient
• For the GMM, the MDL score is the cost of coding the data (the negative log-likelihood) plus the cost of coding the GMM itself
  • $X$ is the vector of data elements
  • $p(x)$ is the probability density given by the GMM
  • $P$ is the number of free parameters needed to describe the GMM: $P = K\left(d + \frac{d(d+1)}{2}\right) + (K - 1)$

Example
(figures: mixtures fitted with $K = 2$ and $K = 16$)
400 new data points generated from the 16-GMM
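The parameter count $P$ and the MDL trade-off can be sketched as follows. The exact penalty constant varies across MDL formulations, so `mdl_score` below is one common illustrative variant (penalty $\frac{P}{2}\log N$), not necessarily the one the slides use:

```python
import math

def gmm_free_params(K, d):
    """P = K*(d + d*(d+1)/2) + (K - 1): K means, K covariances, K-1 free weights."""
    return K * (d + d * (d + 1) // 2) + (K - 1)

def mdl_score(neg_log_likelihood, K, d, N):
    """One common MDL variant: data-coding cost plus a model-coding penalty
    proportional to the number of free parameters."""
    return neg_log_likelihood + 0.5 * gmm_free_params(K, d) * math.log(N)

print(gmm_free_params(2, 1), gmm_free_params(16, 2))  # 5 95
```

For a fixed data-coding cost, a larger $K$ always pays a larger model-coding cost, so extra components are only accepted when they improve the likelihood enough.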
Remaining issues (cont.)
• Simplification of the covariance matrices
Application in computer vision
• Image segmentation
Application in computer vision (cont.)
• Object tracking: knowing the moving-object distribution in the first frame, we can localize the object in the next frames by tracking its distribution
Expectation and Maximization
Background: Latent variable models
• Some of the variables in the model are not observed
• Examples: mixture models, Hidden Markov Models, LDA, etc.
Background: Marginal likelihood
• Joint model $\mathbb{P}(x, z \mid \theta)$, where $\theta$ is the model parameter
• With $z$ unobserved, we marginalize out $z$ and use the marginal log-likelihood for learning:
  $$\mathbb{P}(x \mid \theta) = \sum_z \mathbb{P}(x, z \mid \theta)$$
• Example: mixture model
  $$\mathbb{P}(x \mid \theta) = \sum_k \mathbb{P}(x \mid z = k, \theta_k)\, \mathbb{P}(z = k \mid \theta_k) = \sum_k \pi_k\, \mathbb{P}(x \mid z = k, \theta_k)$$
  where $\pi_k$ are the mixing proportions
Examples
• Mixture of Bernoulli
• Mixture of Gaussians
• Hidden Markov Model
Learning the hidden variable $Z$
• If all $z$ are observed, the likelihood factorizes and learning is relatively easy
• If $z$ is not observed, we have to handle a sum inside the log
  • Idea 1: ignore this problem, simply take derivatives and follow the gradient
  • Idea 2: use the current $\theta$ to estimate $z$, fill the values in, and do fully-observed learning
• Is there a better way?
Example
• Let the events be "grades in a class", with the grade probabilities parameterized by $\mu$
• Assume we want to estimate $\mu$ from data. In a class, suppose
  • $a$ students get A, $b$ students get B, $c$ students get C, and $d$ students get D
• What is the maximum likelihood estimate of $\mu$ given $a, b, c, d$?
Example (cont.)
• For the MLE of $\mu$, set the derivative of the log-likelihood with respect to $\mu$ to zero
• Solve the resulting equation for $\mu$
• This gives the value of $\mu$ under which the observed class counts are most likely

Example (cont.)
• What if the observed information is incomplete: we only observe $h$, the combined number of students who got A or B, together with $c$ and $d$?
• What is the maximum likelihood estimator of $\mu$ now?
Example (cont.)
• What is the MLE of $\mu$ now?
• We can answer this question circularly:
  • Expectation: if we know the value of $\mu$, we can compute the expected values of $a$ and $b$, since the ratio $a : b$ should be the same as the ratio $\frac{1}{2} : \mu$
  • Maximization: if we know the expected values of $a$ and $b$, we can compute the maximum likelihood value of $\mu$

Example – Algorithm
• We begin with a guess for $\mu$, and then iterate between EXPECTATION and MAXIMIZATION to improve our estimates of $\mu$ and $a, b$
• Define
  • $\mu(t)$: the estimate of $\mu$ on the $t$-th iteration
  • $b(t)$: the estimate of $b$ on the $t$-th iteration
Example – Algorithm (cont.)
• The iteration will converge
• But it is only assured to converge to a local optimum
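The iteration can be sketched numerically. The grade probabilities $P(A)=\frac{1}{2}$, $P(B)=\mu$, $P(C)=2\mu$, $P(D)=\frac{1}{2}-3\mu$ are an assumption: they are the standard version of this textbook example, reconstructed from the ratio $\frac{1}{2} : \mu$ used above, and they are consistent with the numbers $h = 20$, $c = 10$, $d = 10$ used in the later example. Under that assumption, the full-data MLE is $\mu = \frac{b + c}{6(b + c + d)}$:

```python
# Hedged sketch: assumes P(A)=1/2, P(B)=mu, P(C)=2*mu, P(D)=1/2-3*mu,
# with observed counts h = a + b, c, d (a and b not observed separately).
def em_grades(h, c, d, mu0=0.0, n_iter=50):
    mu = mu0
    for _ in range(n_iter):
        # E-step: expected b out of h = a + b, since a : b = 1/2 : mu
        b = h * mu / (0.5 + mu)
        # M-step: MLE of mu given the (expected) counts b, c, d
        mu = (b + c) / (6 * (b + c + d))
    return mu

print(em_grades(h=20, c=10, d=10, mu0=0.0))  # converges to about 0.0948
```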
General EM methods
• Repeat until convergence: {
  (E-step) For each data point $i$, set
  $$q_i(j) = \mathbb{P}(z^{(i)} = j \mid x^{(i)}; \theta)$$
  (M-step) Update the parameters
  $$\theta = \arg\max_\theta l(q_i, \theta) = \arg\max_\theta \sum_{i=1}^{N} \sum_j q_i(j) \log \frac{\mathbb{P}(x^{(i)}, z^{(i)} = j; \theta)}{q_i(j)}$$
  }
Convergence of EM
• Denote by $\theta^{(t)}$ and $\theta^{(t+1)}$ the parameters of two successive iterations of EM; we want to prove that
  $$l(\theta^{(t)}) \leq l(\theta^{(t+1)})$$
  where
  $$l(\theta) = \sum_{i=1}^{N} \log \mathbb{P}(x^{(i)}; \theta) = \sum_{i=1}^{N} \log \sum_j \mathbb{P}(x^{(i)}, z^{(i)} = j; \theta)$$
• This shows that EM monotonically improves the log-likelihood, and thus ensures EM will at least converge to a local optimum
Proof
• Starting from $\theta^{(t)}$, we choose the posterior of the latent variable
  $$q_i^{(t)}(j) = \mathbb{P}(z^{(i)} = j \mid x^{(i)}; \theta^{(t)})$$
• By Jensen's inequality on the (concave) log function, for any distribution $q$,
  $$l(q, \theta^{(t)}) = \sum_{i=1}^{N} \sum_j q_i(j) \log \frac{\mathbb{P}(x^{(i)}, z^{(i)} = j; \theta^{(t)})}{q_i(j)} \leq \sum_{i=1}^{N} \log \sum_j q_i(j) \cdot \frac{\mathbb{P}(x^{(i)}, z^{(i)} = j; \theta^{(t)})}{q_i(j)} = \sum_{i=1}^{N} \log \sum_j \mathbb{P}(x^{(i)}, z^{(i)} = j; \theta^{(t)}) = l(\theta^{(t)})$$
• When $q_i^{(t)}(j) = \mathbb{P}(z^{(i)} = j \mid x^{(i)}; \theta^{(t)})$, the inequality above holds with equality, i.e. $l(\theta^{(t)}) = l(q^{(t)}, \theta^{(t)}) \geq l(q, \theta^{(t)})$

Proof (cont.)
• The parameter $\theta^{(t+1)}$ is then obtained by maximizing
  $$l(q_i^{(t)}, \theta) = \sum_{i=1}^{N} \sum_j q_i^{(t)}(j) \log \frac{\mathbb{P}(x^{(i)}, z^{(i)} = j; \theta)}{q_i^{(t)}(j)}$$
• Thus
  $$l(\theta^{(t+1)}) \geq l(q_i^{(t)}, \theta^{(t+1)}) \geq l(q_i^{(t)}, \theta^{(t)}) = l(\theta^{(t)})$$
• Since the log-likelihood is non-decreasing and bounded above, EM will converge (to a local optimum)
Example
• Following the previous example, suppose $h = 20$, $c = 10$, $d = 10$, $\mu^{(0)} = 0$
• Then the iterates $\mu^{(t)}$ and $b^{(t)}$ rapidly approach a fixed point
• Convergence is generally linear: the error decreases by a constant factor at each time step
Advantages
• EM is often applied to NP-hard problems
• Widely used
• Often effective
K-means and EM
• After initializing the positions of the centers, K-means iteratively executes the following two operations:
  • Assign the label of each point based on its distance to the centers
    • A specific (degenerate) posterior distribution of the latent variable
    • E-step
  • Update the positions of the centers based on the labeling results
    • Given the labels, optimize the model parameters
    • M-step
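This reading of K-means as "hard" EM can be sketched as follows (the deterministic initialization and the toy data are illustrative choices):

```python
import numpy as np

def kmeans(X, K, n_iter=50):
    """K-means as hard EM: the E-step assigns each point to its nearest center
    (a point-mass posterior over z), the M-step recomputes each center as the
    mean of its assigned points."""
    centers = X[np.linspace(0, len(X) - 1, K, dtype=int)]  # simple deterministic init
    for _ in range(n_iter):
        # E-step: hard assignment to the nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        z = d.argmin(axis=1)
        # M-step: each center moves to the mean of its assigned points
        centers = np.array([X[z == j].mean(axis=0) for j in range(K)])
    return centers, z

X = np.array([[0.0, 0.0], [0.5, 0.0], [10.0, 10.0], [10.5, 10.0]])
centers, z = kmeans(X, K=2)
print(centers)  # [[ 0.25  0.  ] [10.25 10.  ]]
```

Compared with GMM's soft responsibilities, the hard argmin assignment is what makes K-means a degenerate special case of EM.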