
The Expectation-Maximization Algorithm

Mihaela van der Schaar

Department of Engineering Science, University of Oxford


MLE for Latent Variable Models
Latent Variables and Marginal Likelihoods

Many probabilistic models have hidden variables that are not observable in the dataset D: these models are known as latent variable models. Examples: Hidden Markov Models and Mixture Models. How would MLE be carried out for such models? Each data point is drawn from a joint distribution Pθ(X, Z). For a realization ((X1, Z1), ..., (Xn, Zn)), we only observe the variables in the dataset D = (X1, ..., Xn).

Complete-data likelihood:
$$P_\theta((X_1, Z_1), \dots, (X_n, Z_n)) = \prod_{i=1}^{n} P_\theta(X_i, Z_i).$$

Marginal likelihood:
$$P_\theta(X_1, \dots, X_n) = \prod_{i=1}^{n} \sum_{z} P_\theta(X_i, Z_i = z).$$

MLE for Latent Variable Models
The Hardness of Maximizing Marginal Likelihoods (I)

The MLE is obtained by maximizing the marginal likelihood:
$$\hat{\theta}_n^* = \arg\max_{\theta \in \Theta} \sum_{i=1}^{n} \log\Big(\sum_{z} P_\theta(X_i, Z_i = z)\Big).$$
Solving this optimization problem is often a hard task! Non-convex. Many local maxima. No analytic solution.

Figure: Complete-data log-likelihood (left) and marginal log-likelihood (right) as functions of θ.

MLE for Latent Variable Models
The Hardness of Maximizing Marginal Likelihoods (II)

The MLE for θ is obtained by maximizing the marginal log-likelihood function:
$$\hat{\theta}_n^* = \arg\max_{\theta \in \Theta} \sum_{i=1}^{n} \log\Big(\sum_{z} P_\theta(X_i, Z_i = z)\Big).$$
Solving this optimization problem is often a hard task! The methods used in the previous lecture would not work. We need a simpler, approximate procedure! The Expectation-Maximization (EM) algorithm is an iterative algorithm that computes an approximate solution to the MLE optimization problem.


MLE for Latent Variable Models
Exponential Families (I)

The EM algorithm is well-suited for exponential family distributions.

Exponential Family: a single-parameter exponential family is a set of probability distributions that can be expressed in the form

$$P_\theta(X) = h(X) \cdot \exp\big(\eta(\theta) \cdot T(X) - A(\theta)\big),$$

where h(X), η(θ), T(X), and A(θ) are known functions. An alternative, equivalent form is often given as

$$P_\theta(X) = h(X) \cdot g(\theta) \cdot \exp\big(\eta(\theta) \cdot T(X)\big).$$

The variable θ is called the parameter of the family.

MLE for Latent Variable Models
Exponential Families (II)

Exponential family distributions:

$$P_\theta(X) = h(X) \cdot \exp\big(\eta(\theta) \cdot T(X) - A(\theta)\big).$$
T(X) is a sufficient statistic of the distribution. The sufficient statistic is a function of the data that fully summarizes the data X within the density function Pθ(X). This means that for any data sets D1 and D2, the density function is the same if T(D1) = T(D2). This is true even if D1 and D2 are quite different. The sufficient statistic of a set of independent identically distributed data observations is simply the sum of the individual sufficient statistics, i.e. $T(\mathcal{D}) = \sum_{i=1}^{n} T(X_i)$.
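As a small illustration of that last point, here is a minimal sketch (assuming NumPy; the Gaussian sufficient statistic T(X) = [X, X²]ᵀ is taken from the worked example a few slides ahead) showing that T(D) is just a running sum over the observations:

```python
import numpy as np

def T(x):
    # Sufficient statistic of a single Gaussian observation: T(x) = [x, x^2].
    return np.array([x, x ** 2])

rng = np.random.default_rng(0)
D = rng.normal(loc=1.0, scale=2.0, size=1000)   # i.i.d. observations

# The sufficient statistic of the whole dataset is the sum of the individual ones.
T_D = sum(T(x) for x in D)
print(T_D)   # [sum of x_i, sum of x_i^2]: everything the density ever needs from D
```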


MLE for Latent Variable Models
Exponential Families (III)

Exponential family distributions:

$$P_\theta(X) = h(X) \cdot \exp\big(\eta(\theta) \cdot T(X) - A(\theta)\big).$$
η(θ) is called the natural parameter.

The set of values of η(θ) for which the density Pθ(X) is finite (normalizable) is called the natural parameter space. A(θ) is called the log-partition function. The mean, variance, and other moments of the sufficient statistic T(X) can be derived by differentiating A(θ).
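For instance, using the natural parameterization of the Gaussian worked out on the next slide ($\eta_1 = \mu/\sigma^2$, $\eta_2 = -1/(2\sigma^2)$, $T(X) = [X, X^2]^T$), differentiating the log-partition function with respect to the natural parameters recovers the first two moments:

$$A(\eta) = -\frac{\eta_1^2}{4\eta_2} - \frac{1}{2}\log(-2\eta_2), \qquad \frac{\partial A}{\partial \eta_1} = -\frac{\eta_1}{2\eta_2} = \mu = \mathbb{E}[X], \qquad \frac{\partial A}{\partial \eta_2} = \frac{\eta_1^2}{4\eta_2^2} - \frac{1}{2\eta_2} = \mu^2 + \sigma^2 = \mathbb{E}[X^2].$$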


MLE for Latent Variable Models
Exponential Families (IV)

Exponential Family Example (univariate Gaussian):
$$P_\theta(X) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(\frac{-(X-\mu)^2}{2\sigma^2}\right)
= \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{X^2 - 2X\mu + \mu^2}{2\sigma^2} - \log\sigma\right)
= \frac{1}{\sqrt{2\pi}} \exp\left(\left\langle \left[\tfrac{\mu}{\sigma^2},\, \tfrac{-1}{2\sigma^2}\right]^T, \left[X,\, X^2\right]^T \right\rangle - \left(\tfrac{\mu^2}{2\sigma^2} + \log\sigma\right)\right).$$
$$\eta(\theta) = \left[\tfrac{\mu}{\sigma^2},\, \tfrac{-1}{2\sigma^2}\right]^T, \qquad h(X) = (2\pi)^{-\frac{1}{2}}, \qquad T(X) = \left[X,\, X^2\right]^T, \qquad A(\theta) = \frac{\mu^2}{2\sigma^2} + \log\sigma.$$
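As a quick numerical sanity check of this identification, the sketch below (assuming NumPy; the helper names are illustrative) evaluates both the standard Gaussian density and its exponential-family form and confirms they agree:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    # Standard N(mu, sigma^2) density.
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def gaussian_expfam_pdf(x, mu, sigma):
    # Exponential-family form: h(x) * exp(<eta, T(x)> - A(theta)).
    eta = np.array([mu / sigma ** 2, -1.0 / (2 * sigma ** 2)])
    T = np.array([x, x ** 2])
    A = mu ** 2 / (2 * sigma ** 2) + np.log(sigma)
    h = 1.0 / np.sqrt(2 * np.pi)
    return h * np.exp(eta @ T - A)

x, mu, sigma = 0.7, -1.4, 2.0
print(gaussian_pdf(x, mu, sigma), gaussian_expfam_pdf(x, mu, sigma))
# The two values should match up to floating-point error.
```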


MLE for Latent Variable Models
Exponential Families (V)

Properties of Exponential Families: exponential families have sufficient statistics that can summarize arbitrary amounts of independent identically distributed data using a fixed number of values. Exponential families have conjugate priors (an important property in Bayesian statistics). The posterior predictive distribution of an exponential-family random variable with a conjugate prior can always be written in closed form.


MLE for Latent Variable Models
Exponential Families (VI)

The Canonical Form of Exponential Families: if η(θ) = θ, then the exponential family is said to be in canonical form. The canonical form is non-unique, since η(θ) can be multiplied by any nonzero constant, provided that T(X) is multiplied by that constant's reciprocal, or a constant c can be added to η(θ) and h(X) multiplied by exp(−c · T(X)) to offset it.


EM: The Algorithm
Expectation-Maximization (I)

Two unknowns: the latent variables Z = (Z1, ..., Zn) and the parameter θ. Complications arise because we do not know the latent variables (Z1, ..., Zn). If the latent variables were known, maximizing Pθ((X1, Z1), ..., (Xn, Zn)) would often be a simpler task! Recall that maximizing the complete-data likelihood is often simpler than maximizing the marginal likelihood!

Figure: Complete-data log-likelihood (left) and marginal log-likelihood (right) as functions of θ.

EM: The Algorithm
Expectation-Maximization (II)

The EM Algorithm
1. Start with an initial guess $\hat{\theta}^{(0)}$ for θ. For every iteration t, do the following:
2. E-step: compute
$$Q(\theta, \hat{\theta}^{(t)}) = \sum_{z} \log\big(P_\theta(Z = z, \mathcal{D})\big) \cdot P(Z = z \mid \mathcal{D}, \hat{\theta}^{(t)}).$$
3. M-step: set
$$\hat{\theta}^{(t+1)} = \arg\max_{\theta \in \Theta} Q(\theta, \hat{\theta}^{(t)}).$$
4. Go to step 2 if the stopping criterion is not met.
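To make the two steps concrete, here is a minimal, self-contained sketch of the E-step/M-step loop for a simple latent-variable model, a mixture of two Bernoulli "coins" observed through head counts; the model and all names are illustrative, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: 200 groups of m=10 coin flips, each group generated by
# one of two coins (biases 0.3 and 0.8); only the head counts are observed.
m = 10
true_w, true_p = 0.6, np.array([0.3, 0.8])
z = rng.random(200) < true_w                  # latent coin choice (hidden)
x = rng.binomial(m, np.where(z, true_p[1], true_p[0]))

# Initial guesses.
w, p = 0.5, np.array([0.4, 0.6])

for t in range(100):
    # E-step: responsibilities P(Z_i = 1 | x_i, current parameters).
    lik1 = w * p[1] ** x * (1 - p[1]) ** (m - x)
    lik0 = (1 - w) * p[0] ** x * (1 - p[0]) ** (m - x)
    r = lik1 / (lik1 + lik0)
    # M-step: maximize the expected complete-data log-likelihood in closed form.
    w = r.mean()
    p = np.array([((1 - r) * x).sum() / ((1 - r).sum() * m),
                  (r * x).sum() / (r.sum() * m)])

print("estimated mixing weight:", w)
print("estimated coin biases:", p)
```

For this toy model the M-step has a closed form, which is exactly what makes the EM decomposition attractive.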


EM: The Algorithm
Expectation-Maximization (III)

Two unknowns:

The latent variables Z = (Z1, ..., Zn). The parameter θ.
"Expected" likelihood:
$$\sum_{z} \log\big(P_\theta(Z = z, \mathcal{D})\big) \cdot P(Z = z \mid \mathcal{D}, \theta).$$

Here the logarithm acts directly on the complete-data likelihood, so the corresponding M-step will be tractable.

Figure: Complete-data log-likelihood (left) and marginal log-likelihood (right) as functions of θ.


EM: The Algorithm
Expectation-Maximization (III)

Two unknowns: the latent variables Z = (Z1, ..., Zn) and the parameter θ.
"Expected" likelihood:
$$\sum_{z} \log\big(P_\theta(Z = z, \mathcal{D})\big) \cdot P(Z = z \mid \mathcal{D}, \theta).$$

Here the logarithm acts directly on the complete-data likelihood, so the corresponding M-step will be tractable. But we still have two terms, log(Pθ(Z = z, D)) and P(Z | D, θ), that depend on the two unknowns Z and θ. The EM algorithm:
E-step: fix the posterior P(Z | D, θ) by conditioning on the current guess for θ, i.e. use $P(Z \mid \mathcal{D}, \hat{\theta}^{(t)})$.
M-step: update the guess for θ by solving a tractable optimization problem.
The EM algorithm breaks down the intractable MLE optimization problem into simpler, tractable iterative steps.

EM: The Algorithm
EM for Exponential Family (I)

The critical points of the marginal log-likelihood function satisfy
$$\frac{\partial \log P_\theta(\mathcal{D})}{\partial \theta} = \frac{1}{P_\theta(\mathcal{D})} \sum_{z} \frac{\partial P_\theta(\mathcal{D}, Z = z)}{\partial \theta} = 0.$$
For the complete-data likelihood,
$$\frac{\partial \log P_\theta(\mathcal{D}, Z)}{\partial \theta} = \frac{\partial}{\partial \theta} \log\Big(\underbrace{\exp\big(\langle \eta(\theta), T(\mathcal{D}, Z) \rangle - A(\theta)\big) \cdot h(\mathcal{D}, Z)}_{\text{canonical form of exponential family}}\Big).$$

For η(θ) = θ, we have that
$$\frac{\partial P_\theta(\mathcal{D}, Z)}{\partial \theta} = \Big(T(\mathcal{D}, Z) - \frac{\partial}{\partial \theta} A(\theta)\Big)\, P_\theta(\mathcal{D}, Z).$$


EM: The Algorithm
EM for Exponential Family (II)

For exponential families, $\mathbb{E}_\theta[T(\mathcal{D}, Z)] = \frac{\partial}{\partial \theta} A(\theta)$, so
$$\frac{\partial P_\theta(\mathcal{D}, Z)}{\partial \theta} = \big(T(\mathcal{D}, Z) - \mathbb{E}_\theta[T(\mathcal{D}, Z)]\big)\, P_\theta(\mathcal{D}, Z).$$
Since $\frac{1}{P_\theta(\mathcal{D})} \sum_{z} \frac{\partial P_\theta(\mathcal{D}, Z = z)}{\partial \theta} = 0$, we have that
$$\frac{1}{P_\theta(\mathcal{D})} \sum_{z} \big(T(\mathcal{D}, Z = z) - \mathbb{E}_\theta[T(\mathcal{D}, Z)]\big)\, P_\theta(\mathcal{D}, Z = z) = 0,$$
$$\sum_{z} T(\mathcal{D}, Z = z)\, \frac{P_\theta(\mathcal{D}, Z = z)}{P_\theta(\mathcal{D})} - \mathbb{E}_\theta[T(\mathcal{D}, Z)] = 0,$$
$$\mathbb{E}_\theta[T(\mathcal{D}, Z) \mid \mathcal{D}] - \mathbb{E}_\theta[T(\mathcal{D}, Z)] = 0.$$


EM: The Algorithm
EM for Exponential Family (III)

For the critical values of θ, the following condition is satisfied:

$$\mathbb{E}_\theta[T(\mathcal{D}, Z) \mid \mathcal{D}] = \mathbb{E}_\theta[T(\mathcal{D}, Z)].$$
How is this related to the EM objective $Q(\theta, \hat{\theta}^{(t)})$? For a canonical exponential family,
$$Q(\theta, \hat{\theta}^{(t)}) = \sum_{z} \log\big(P_\theta(Z = z, \mathcal{D})\big) \cdot P_{\hat{\theta}^{(t)}}(Z = z \mid \mathcal{D}) = \big\langle \theta,\ \mathbb{E}_{\hat{\theta}^{(t)}}[T(\mathcal{D}, Z) \mid \mathcal{D}] \big\rangle - A(\theta) + \text{constant},$$
so that
$$\frac{\partial Q(\theta, \hat{\theta}^{(t)})}{\partial \theta} = 0 \;\Rightarrow\; \mathbb{E}_{\hat{\theta}^{(t)}}[T(\mathcal{D}, Z) \mid \mathcal{D}] = \mathbb{E}_\theta[T(\mathcal{D}, Z)].$$
Since it is difficult to solve the above equation analytically, the EM algorithm solves for θ via successive approximations, i.e. it solves the following for $\hat{\theta}^{(t+1)}$:
$$\mathbb{E}_{\hat{\theta}^{(t)}}[T(\mathcal{D}, Z) \mid \mathcal{D}] = \mathbb{E}_{\hat{\theta}^{(t+1)}}[T(\mathcal{D}, Z)].$$

Multivariate Gaussian Mixture Models
Example: Multivariate Gaussian Mixtures

Parameters for a mixture of K Gaussians: mixture proportions $\{\pi_k\}_{k=1}^{K}$, and mean vectors and covariance matrices $\{(\mu_k, \Sigma_k)\}_{k=1}^{K}$.

Figure: Contour plot for the density of a mixture of 3 bivariate Gaussian distributions (K = 3).

Multivariate Gaussian Mixture Models
The Generative Process

$Z_i \sim \mathrm{Categorical}(\pi_1, \dots, \pi_K)$, and given $Z_i = z$, $X_i \sim \mathcal{N}(\mu_z, \Sigma_z)$.
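A minimal sketch of this generative process (assuming NumPy; the component parameters below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gmm(n, pis, mus, Sigmas):
    """Draw n points from a Gaussian mixture; returns (X, Z) with Z the latent labels."""
    K = len(pis)
    Z = rng.choice(K, size=n, p=pis)          # Z_i ~ Categorical(pi_1, ..., pi_K)
    X = np.stack([rng.multivariate_normal(mus[z], Sigmas[z]) for z in Z])  # X_i ~ N(mu_z, Sigma_z)
    return X, Z

# Illustrative 2-D example with K = 2 components.
pis = [0.4, 0.6]
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.5]])]
X, Z = sample_gmm(500, pis, mus, Sigmas)
print(X.shape, np.bincount(Z))
```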

Figure: A sample from a Gaussian mixture model: every data point is colored according to its component membership.


Multivariate Gaussian Mixture Models
The Dataset

We need to learn the parameters $\{(\pi_k, \mu_k, \Sigma_k)\}_{k=1}^{K}$ from the data points D = (X1, ..., Xn), which are not "colored" by the component memberships, i.e. we do not observe the latent variables Z = (Z1, ..., Zn).

Figure: (a) (D, Z): the data points and their component memberships. (b) D: the dataset with only the observed data points (component memberships are latent).

EM for Gaussian Mixture Models
MLE for the Gaussian Mixture Models

The complete-data likelihood function is given by
$$P_\theta(\mathcal{D}, Z) = \prod_{i=1}^{n} \pi_{z_i}\, \mathcal{N}(X_i \mid \mu_{z_i}, \Sigma_{z_i}).$$
The marginal likelihood function is
$$P_\theta(\mathcal{D}) = \prod_{i=1}^{n} \sum_{k=1}^{K} \pi_k\, \mathcal{N}(X_i \mid \mu_k, \Sigma_k).$$
The MLE can be obtained by maximizing the marginal log-likelihood function:
$$\hat{\theta}_n^* = \arg\max_{\theta \in \Theta} \sum_{i=1}^{n} \log\Big(\sum_{k=1}^{K} \pi_k\, \mathcal{N}(X_i \mid \mu_k, \Sigma_k)\Big).$$

Exercise: Is the objective function above concave?
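As a side note, the marginal log-likelihood above is easy to evaluate numerically; here is a minimal sketch (assuming NumPy and SciPy) that uses a log-sum-exp for numerical stability:

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_marginal_likelihood(X, pis, mus, Sigmas):
    """Sum over i of log sum_k pi_k N(X_i | mu_k, Sigma_k), computed stably."""
    # log_component[i, k] = log pi_k + log N(X_i | mu_k, Sigma_k)
    log_component = np.column_stack([
        np.log(pi_k) + multivariate_normal.logpdf(X, mean=mu_k, cov=Sigma_k)
        for pi_k, mu_k, Sigma_k in zip(pis, mus, Sigmas)
    ])
    return logsumexp(log_component, axis=1).sum()

# Illustrative evaluation on random 2-D data with K = 2 components.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
pis = [0.5, 0.5]
mus = [np.zeros(2), np.ones(2)]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]
print(gmm_log_marginal_likelihood(X, pis, mus, Sigmas))
```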

EM for Gaussian Mixture Models
Implementing EM for the Gaussian Mixture Model (I)

The expected complete-data log likelihood function is

$$\mathbb{E}_Z[\log P_\theta(\mathcal{D}, Z)] = \sum_{i=1}^{n} \sum_{k=1}^{K} \gamma(k, X_i \mid \theta)\, \big(\log(\pi_k) + \log\big(\mathcal{N}(X_i \mid \mu_k, \Sigma_k)\big)\big),$$
where $\gamma(k, X_i \mid \theta) = P_\theta(Z_i = k \mid X_i)$.

γ(k, Xi | θ) is called the responsibility of component k towards data point Xi:
$$\gamma(k, X_i \mid \theta) = \frac{\pi_k\, \mathcal{N}(X_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(X_i \mid \mu_j, \Sigma_j)}.$$
Try to work out the derivation above yourself!
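A minimal sketch of the responsibility computation (assuming NumPy and SciPy; the function name is illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pis, mus, Sigmas):
    """gamma[i, k] = pi_k N(X_i | mu_k, Sigma_k) / sum_j pi_j N(X_i | mu_j, Sigma_j)."""
    weighted = np.column_stack([
        pi_k * multivariate_normal.pdf(X, mean=mu_k, cov=Sigma_k)
        for pi_k, mu_k, Sigma_k in zip(pis, mus, Sigmas)
    ])                                                   # shape (n, K)
    return weighted / weighted.sum(axis=1, keepdims=True)
```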


EM for Gaussian Mixture Models
Implementing EM for the Gaussian Mixture Model (II)

(E-step) Approximate the expected complete-data log-likelihood by fixing the responsibilities γ(k, Xi | θ) using the parameter estimates obtained from the previous iteration:
$$Q(\theta, \hat{\theta}^{(t)}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \gamma(k, X_i \mid \hat{\theta}^{(t)})\, \big(\log(\pi_k) + \log\big(\mathcal{N}(X_i \mid \mu_k, \Sigma_k)\big)\big),$$
$$\gamma(k, X_i \mid \hat{\theta}^{(t)}) = \frac{\hat{\pi}_k^{(t)}\, \mathcal{N}(X_i \mid \hat{\mu}_k^{(t)}, \hat{\Sigma}_k^{(t)})}{\sum_{j=1}^{K} \hat{\pi}_j^{(t)}\, \mathcal{N}(X_i \mid \hat{\mu}_j^{(t)}, \hat{\Sigma}_j^{(t)})}.$$
(M-step) Solve a tractable optimization problem:
$$(\hat{\pi}^{(t+1)}, \hat{\mu}^{(t+1)}, \hat{\Sigma}^{(t+1)}) = \arg\max_{(\pi, \mu, \Sigma)} \sum_{i=1}^{n} \sum_{k=1}^{K} \gamma(k, X_i \mid \hat{\theta}^{(t)})\, \big(\log(\pi_k) + \log\big(\mathcal{N}(X_i \mid \mu_k, \Sigma_k)\big)\big).$$

EM for Gaussian Mixture Models
Implementing EM for the Gaussian Mixture Model (III)

The (M-step) yields the following parameter updating equations

$$\hat{\pi}_k^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} \gamma(k, X_i \mid \hat{\theta}^{(t)}),$$
$$\hat{\mu}_k^{(t+1)} = \frac{\sum_{i=1}^{n} \gamma(k, X_i \mid \hat{\theta}^{(t)})\, X_i}{\sum_{j=1}^{n} \gamma(k, X_j \mid \hat{\theta}^{(t)})},$$
$$\hat{\Sigma}_k^{(t+1)} = \frac{\sum_{i=1}^{n} \gamma(k, X_i \mid \hat{\theta}^{(t)})\, (X_i - \hat{\mu}_k^{(t+1)})(X_i - \hat{\mu}_k^{(t+1)})^T}{\sum_{j=1}^{n} \gamma(k, X_j \mid \hat{\theta}^{(t)})}.$$
Try to work out the updating equations by yourself!
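The same updates in code, as a hedged sketch (assuming NumPy; `gamma` is the matrix of responsibilities from the E-step):

```python
import numpy as np

def m_step(X, gamma):
    """GMM M-step: given responsibilities gamma (shape (n, K)), return updated parameters."""
    n, K = gamma.shape
    Nk = gamma.sum(axis=0)                  # effective number of points per component
    pis = Nk / n                            # updated mixture proportions
    mus = (gamma.T @ X) / Nk[:, None]       # updated means, shape (K, d)
    Sigmas = []
    for k in range(K):
        diff = X - mus[k]                   # (n, d) deviations from the new mean
        Sigmas.append((gamma[:, k, None] * diff).T @ diff / Nk[k])  # weighted covariance
    return pis, mus, np.array(Sigmas)
```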


EM for Gaussian Mixture Models
EM in Practice

Consider a Gaussian mixture model with K = 3, and the following parameters:

$$\pi_1 = 0.6, \quad \pi_2 = 0.05, \quad \pi_3 = 0.35,$$
$$\mu_1 = [-1.4,\ 1.8]^T, \quad \mu_2 = [-1.4,\ -2.8]^T, \quad \mu_3 = [-1.9,\ 0.55]^T,$$
$$\Sigma_1 = \begin{bmatrix} 0.8 & -0.8 \\ -0.8 & 4 \end{bmatrix}, \quad \Sigma_2 = \begin{bmatrix} 1.2 & 2.3 \\ 2.3 & 5.2 \end{bmatrix}, \quad \Sigma_3 = \begin{bmatrix} 0.4 & -0.01 \\ -0.01 & 0.35 \end{bmatrix}.$$

Try writing MATLAB code that generates a random dataset of 5000 data points drawn from the model specified above, then implement the EM algorithm to learn the model parameters from this dataset.
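For readers who prefer Python, here is a minimal NumPy/SciPy sketch of the whole exercise (data generation plus the EM loop); it is an illustrative implementation rather than a reference solution, and it uses a crude random initialization:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# True parameters from the slide.
pis_true = np.array([0.6, 0.05, 0.35])
mus_true = np.array([[-1.4, 1.8], [-1.4, -2.8], [-1.9, 0.55]])
Sigmas_true = np.array([[[0.8, -0.8], [-0.8, 4.0]],
                        [[1.2, 2.3], [2.3, 5.2]],
                        [[0.4, -0.01], [-0.01, 0.35]]])

# Generate 5000 points from the mixture.
n, K = 5000, 3
Z = rng.choice(K, size=n, p=pis_true)
X = np.stack([rng.multivariate_normal(mus_true[z], Sigmas_true[z]) for z in Z])

# Crude initialization: uniform weights, random data points as means, global covariance.
pis = np.full(K, 1.0 / K)
mus = X[rng.choice(n, size=K, replace=False)]
Sigmas = np.array([np.cov(X.T) for _ in range(K)])

for t in range(100):
    # E-step: responsibilities gamma[i, k].
    weighted = np.column_stack([pis[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
                                for k in range(K)])
    gamma = weighted / weighted.sum(axis=1, keepdims=True)
    # M-step: closed-form parameter updates.
    Nk = gamma.sum(axis=0)
    pis = Nk / n
    mus = (gamma.T @ X) / Nk[:, None]
    Sigmas = np.array([((gamma[:, k, None] * (X - mus[k])).T @ (X - mus[k])) / Nk[k]
                       for k in range(K)])

print("estimated mixture proportions:", np.round(pis, 3))
print("estimated means:\n", np.round(mus, 2))
```

With 5000 points the estimates typically land close to the true parameters, although which estimated component corresponds to which true component depends on the initialization.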


EM for Gaussian Mixture Models
EM in Practice

The marginal log-likelihood increases after every EM iteration! This means that every new iteration finds a "better" estimate!

Figure: Log-likelihood as a function of the EM iteration.

EM for Gaussian Mixture Models
EM in Practice

Compare the true density function with the estimated one.

Figure: Contour plots for the true density function (left) and the estimated density function (right).


EM Performance Guarantees
What Does EM Guarantee?

The EM algorithm does not guarantee that $\hat{\theta}^{(t)}$ will converge to $\hat{\theta}^*_n$. EM guarantees the following: $\hat{\theta}^{(t)}$ always converges (to a local optimum), and every iteration improves the marginal likelihood $P_{\hat{\theta}^{(t)}}(\mathcal{D})$.

Does the Initial Value Matter?

1. The initial value $\hat{\theta}^{(0)}$ affects the speed of convergence and the value of $\hat{\theta}^{(\infty)}$! Smart initialization methods are often needed.

2. The K-means algorithm is often used to initialize the parameters in a Gaussian mixture model before applying the EM algorithm, as in the sketch below.
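A sketch of that initialization strategy, assuming scikit-learn is available (any K-means implementation would do):

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_init(X, K, seed=0):
    """Initialize GMM parameters from a K-means clustering of X."""
    labels = KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(X)
    pis = np.array([(labels == k).mean() for k in range(K)])         # cluster fractions
    mus = np.array([X[labels == k].mean(axis=0) for k in range(K)])  # cluster centroids
    Sigmas = np.array([np.cov(X[labels == k].T) for k in range(K)])  # within-cluster covariances
    return pis, mus, Sigmas
```

Each EM component is then started from one cluster's empirical fraction, centroid, and covariance.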

