
The Expectation-Maximization Algorithm

Mihaela van der Schaar

Department of Engineering Science, University of Oxford


MLE for Latent Variable Models
Latent Variables and Marginal Likelihoods

Many probabilistic models have hidden variables that are not observable in the dataset D: these models are known as latent variable models. Examples: Hidden Markov Models and Mixture Models. How would MLE be carried out for such models? Each data point is drawn from a joint distribution Pθ(X, Z). For a realization ((X1, Z1), ..., (Xn, Zn)), we only observe the variables in the dataset D = (X1, ..., Xn).

Complete-data likelihood:
$$P_\theta((X_1, Z_1), \dots, (X_n, Z_n)) = \prod_{i=1}^{n} P_\theta(X_i, Z_i).$$

Marginal likelihood:
$$P_\theta(X_1, \dots, X_n) = \prod_{i=1}^{n} \sum_{z} P_\theta(X_i, Z_i = z).$$

MLE for Latent Variable Models
The Hardness of Maximizing Marginal Likelihoods (I)

The MLE is obtained by maximizing the marginal likelihood:
$$\hat{\theta}_n^* = \arg\max_{\theta \in \Theta} \sum_{i=1}^{n} \log\Big(\sum_{z} P_\theta(X_i, Z_i = z)\Big).$$
Solving this optimization problem is often a hard task! Non-convex. Many local maxima. No analytic solution.

Figure: Complete-data log-likelihood (left) and marginal log-likelihood (right) as functions of θ.

MLE for Latent Variable Models
The Hardness of Maximizing Marginal Likelihoods (II)

The MLE for θ is obtained by maximizing the marginal log-likelihood function:
$$\hat{\theta}_n^* = \arg\max_{\theta \in \Theta} \sum_{i=1}^{n} \log\Big(\sum_{z} P_\theta(X_i, Z_i = z)\Big).$$
Solving this optimization problem is often a hard task! The methods used in the previous lecture would not work. We need a simpler, approximate procedure! The Expectation-Maximization (EM) algorithm is an iterative algorithm that computes an approximate solution to the MLE optimization problem.


MLE for Latent Variable Models
Exponential Families (I)

The EM algorithm is well-suited for exponential family distributions.

Exponential Family: a single-parameter exponential family is a set of probability distributions that can be expressed in the form

$$P_\theta(X) = h(X) \cdot \exp\big(\eta(\theta) \cdot T(X) - A(\theta)\big),$$

where h(X), η(θ), T(X), and A(θ) are known functions. An alternative, equivalent form is often given as

$$P_\theta(X) = h(X) \cdot g(\theta) \cdot \exp\big(\eta(\theta) \cdot T(X)\big).$$

The variable θ is called the parameter of the family.

MLE for Latent Variable Models
Exponential Families (II)

Exponential family distributions:

$$P_\theta(X) = h(X) \cdot \exp\big(\eta(\theta) \cdot T(X) - A(\theta)\big).$$
T(X) is a sufficient statistic of the distribution. The sufficient statistic is a function of the data that fully summarizes the data X within the density function Pθ(X). This means that for any data sets D1 and D2, the density function is the same if T(D1) = T(D2). This is true even if D1 and D2 are quite different. The sufficient statistic of a set of independent identically distributed data observations is simply the sum of the individual sufficient statistics, i.e. $T(\mathcal{D}) = \sum_{i=1}^{n} T(X_i)$.
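As a small illustration of that last point, here is a minimal sketch (assuming NumPy; the Gaussian sufficient statistic T(X) = [X, X²]ᵀ is taken from the worked example a few slides ahead) showing that T(D) is just a running sum over the observations:

```python
import numpy as np

def T(x):
    # Sufficient statistic of a single Gaussian observation: T(x) = [x, x^2].
    return np.array([x, x ** 2])

rng = np.random.default_rng(0)
D = rng.normal(loc=1.0, scale=2.0, size=1000)   # i.i.d. observations

# The sufficient statistic of the whole dataset is the sum of the individual ones.
T_D = sum(T(x) for x in D)
print(T_D)   # [sum of x_i, sum of x_i^2]: everything the density ever needs from D
```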


MLE for Latent Variable Models
Exponential Families (III)

Exponential family distributions:

$$P_\theta(X) = h(X) \cdot \exp\big(\eta(\theta) \cdot T(X) - A(\theta)\big).$$
η(θ) is called the natural parameter.

The set of values of η(θ) for which the density Pθ(X) is finite (normalizable) is called the natural parameter space. A(θ) is called the log-partition function. The mean, variance, and other moments of the sufficient statistic T(X) can be derived by differentiating A(θ).
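For instance, using the natural parameterization of the Gaussian worked out on the next slide ($\eta_1 = \mu/\sigma^2$, $\eta_2 = -1/(2\sigma^2)$, $T(X) = [X, X^2]^T$), differentiating the log-partition function with respect to the natural parameters recovers the first two moments:

$$A(\eta) = -\frac{\eta_1^2}{4\eta_2} - \frac{1}{2}\log(-2\eta_2), \qquad \frac{\partial A}{\partial \eta_1} = -\frac{\eta_1}{2\eta_2} = \mu = \mathbb{E}[X], \qquad \frac{\partial A}{\partial \eta_2} = \frac{\eta_1^2}{4\eta_2^2} - \frac{1}{2\eta_2} = \mu^2 + \sigma^2 = \mathbb{E}[X^2].$$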


MLE for Latent Variable Models
Exponential Families (IV)

Exponential Family Example (univariate Gaussian):
$$P_\theta(X) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(\frac{-(X-\mu)^2}{2\sigma^2}\right)
= \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{X^2 - 2X\mu + \mu^2}{2\sigma^2} - \log\sigma\right)
= \frac{1}{\sqrt{2\pi}} \exp\left(\left\langle \left[\tfrac{\mu}{\sigma^2},\, \tfrac{-1}{2\sigma^2}\right]^T, \left[X,\, X^2\right]^T \right\rangle - \left(\tfrac{\mu^2}{2\sigma^2} + \log\sigma\right)\right).$$
$$\eta(\theta) = \left[\tfrac{\mu}{\sigma^2},\, \tfrac{-1}{2\sigma^2}\right]^T, \qquad h(X) = (2\pi)^{-\frac{1}{2}}, \qquad T(X) = \left[X,\, X^2\right]^T, \qquad A(\theta) = \frac{\mu^2}{2\sigma^2} + \log\sigma.$$
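As a quick numerical sanity check of this identification, the sketch below (assuming NumPy; the helper names are illustrative) evaluates both the standard Gaussian density and its exponential-family form and confirms they agree:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    # Standard N(mu, sigma^2) density.
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def gaussian_expfam_pdf(x, mu, sigma):
    # Exponential-family form: h(x) * exp(<eta, T(x)> - A(theta)).
    eta = np.array([mu / sigma ** 2, -1.0 / (2 * sigma ** 2)])
    T = np.array([x, x ** 2])
    A = mu ** 2 / (2 * sigma ** 2) + np.log(sigma)
    h = 1.0 / np.sqrt(2 * np.pi)
    return h * np.exp(eta @ T - A)

x, mu, sigma = 0.7, -1.4, 2.0
print(gaussian_pdf(x, mu, sigma), gaussian_expfam_pdf(x, mu, sigma))
# The two values should match up to floating-point error.
```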


MLE for Latent Variable Models
Exponential Families (V)

Properties of Exponential Families: exponential families have sufficient statistics that can summarize arbitrary amounts of independent identically distributed data using a fixed number of values. Exponential families have conjugate priors (an important property in Bayesian statistics). The posterior predictive distribution of an exponential-family random variable with a conjugate prior can always be written in closed form.


MLE for Latent Variable Models
Exponential Families (VI)

The Canonical Form of Exponential Families: if η(θ) = θ, then the exponential family is said to be in canonical form. The canonical form is non-unique, since η(θ) can be multiplied by any nonzero constant, provided that T(X) is multiplied by that constant's reciprocal, or a constant c can be added to η(θ) and h(X) multiplied by exp(−c · T(X)) to offset it.


EM: The Algorithm
Expectation-Maximization (I)

Two unknowns: the latent variables Z = (Z1, ..., Zn) and the parameter θ. Complications arise because we do not know the latent variables (Z1, ..., Zn). If the latent variables were known, maximizing Pθ((X1, Z1), ..., (Xn, Zn)) would often be a simpler task! Recall that maximizing the complete-data likelihood is often simpler than maximizing the marginal likelihood!

Figure: Complete-data log-likelihood (left) and marginal log-likelihood (right) as functions of θ.

EM: The Algorithm
Expectation-Maximization (II)

The EM Algorithm
1. Start with an initial guess $\hat{\theta}^{(0)}$ for θ. For every iteration t, do the following:
2. E-step: compute
$$Q(\theta, \hat{\theta}^{(t)}) = \sum_{z} \log\big(P_\theta(Z = z, \mathcal{D})\big) \cdot P(Z = z \mid \mathcal{D}, \hat{\theta}^{(t)}).$$
3. M-step: set
$$\hat{\theta}^{(t+1)} = \arg\max_{\theta \in \Theta} Q(\theta, \hat{\theta}^{(t)}).$$
4. Go to step 2 if the stopping criterion is not met.
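To make the two steps concrete, here is a minimal, self-contained sketch of the E-step/M-step loop for a simple latent-variable model, a mixture of two Bernoulli "coins" observed through head counts; the model and all names are illustrative, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: 200 groups of m=10 coin flips, each group generated by
# one of two coins (biases 0.3 and 0.8); only the head counts are observed.
m = 10
true_w, true_p = 0.6, np.array([0.3, 0.8])
z = rng.random(200) < true_w                  # latent coin choice (hidden)
x = rng.binomial(m, np.where(z, true_p[1], true_p[0]))

# Initial guesses.
w, p = 0.5, np.array([0.4, 0.6])

for t in range(100):
    # E-step: responsibilities P(Z_i = 1 | x_i, current parameters).
    lik1 = w * p[1] ** x * (1 - p[1]) ** (m - x)
    lik0 = (1 - w) * p[0] ** x * (1 - p[0]) ** (m - x)
    r = lik1 / (lik1 + lik0)
    # M-step: maximize the expected complete-data log-likelihood in closed form.
    w = r.mean()
    p = np.array([((1 - r) * x).sum() / ((1 - r).sum() * m),
                  (r * x).sum() / (r.sum() * m)])

print("estimated mixing weight:", w)
print("estimated coin biases:", p)
```

For this toy model the M-step has a closed form, which is exactly what makes the EM decomposition attractive.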


EM: The Algorithm
Expectation-Maximization (III)

Two unknowns:

The latent variables Z = (Z1, ..., Zn). The parameter θ.
"Expected" likelihood:
$$\sum_{z} \log\big(P_\theta(Z = z, \mathcal{D})\big) \cdot P(Z = z \mid \mathcal{D}, \theta).$$

Here the logarithm acts directly on the complete-data likelihood, so the corresponding M-step will be tractable.

Figure: Complete-data log-likelihood (left) and marginal log-likelihood (right) as functions of θ.


EM: The Algorithm
Expectation-Maximization (III)

Two unknowns: the latent variables Z = (Z1, ..., Zn) and the parameter θ.
"Expected" likelihood:
$$\sum_{z} \log\big(P_\theta(Z = z, \mathcal{D})\big) \cdot P(Z = z \mid \mathcal{D}, \theta).$$

Here the logarithm acts directly on the complete-data likelihood, so the corresponding M-step will be tractable. But we still have two terms, log(Pθ(Z = z, D)) and P(Z | D, θ), that depend on the two unknowns Z and θ. The EM algorithm:
E-step: fix the posterior P(Z | D, θ) by conditioning on the current guess for θ, i.e. use $P(Z \mid \mathcal{D}, \hat{\theta}^{(t)})$.
M-step: update the guess for θ by solving a tractable optimization problem.
The EM algorithm breaks down the intractable MLE optimization problem into simpler, tractable iterative steps.

EM: The Algorithm
EM for Exponential Family (I)

The critical points of the marginal log-likelihood function satisfy
$$\frac{\partial \log P_\theta(\mathcal{D})}{\partial \theta} = \frac{1}{P_\theta(\mathcal{D})} \sum_{z} \frac{\partial P_\theta(\mathcal{D}, Z = z)}{\partial \theta} = 0.$$
For the complete-data likelihood,
$$\frac{\partial \log P_\theta(\mathcal{D}, Z)}{\partial \theta} = \frac{\partial}{\partial \theta} \log\Big(\underbrace{\exp\big(\langle \eta(\theta), T(\mathcal{D}, Z) \rangle - A(\theta)\big) \cdot h(\mathcal{D}, Z)}_{\text{canonical form of exponential family}}\Big).$$

For η(θ) = θ, we have that
$$\frac{\partial P_\theta(\mathcal{D}, Z)}{\partial \theta} = \Big(T(\mathcal{D}, Z) - \frac{\partial}{\partial \theta} A(\theta)\Big)\, P_\theta(\mathcal{D}, Z).$$


EM: The Algorithm
EM for Exponential Family (II)

For exponential families, $\mathbb{E}_\theta[T(\mathcal{D}, Z)] = \frac{\partial}{\partial \theta} A(\theta)$, so
$$\frac{\partial P_\theta(\mathcal{D}, Z)}{\partial \theta} = \big(T(\mathcal{D}, Z) - \mathbb{E}_\theta[T(\mathcal{D}, Z)]\big)\, P_\theta(\mathcal{D}, Z).$$
Since $\frac{1}{P_\theta(\mathcal{D})} \sum_{z} \frac{\partial P_\theta(\mathcal{D}, Z = z)}{\partial \theta} = 0$, we have that
$$\frac{1}{P_\theta(\mathcal{D})} \sum_{z} \big(T(\mathcal{D}, Z = z) - \mathbb{E}_\theta[T(\mathcal{D}, Z)]\big)\, P_\theta(\mathcal{D}, Z = z) = 0,$$
$$\sum_{z} T(\mathcal{D}, Z = z)\, \frac{P_\theta(\mathcal{D}, Z = z)}{P_\theta(\mathcal{D})} - \mathbb{E}_\theta[T(\mathcal{D}, Z)] = 0,$$
$$\mathbb{E}_\theta[T(\mathcal{D}, Z) \mid \mathcal{D}] - \mathbb{E}_\theta[T(\mathcal{D}, Z)] = 0.$$


EM: The Algorithm
EM for Exponential Family (III)

For the critical values of θ, the following condition is satisfied:

$$\mathbb{E}_\theta[T(\mathcal{D}, Z) \mid \mathcal{D}] = \mathbb{E}_\theta[T(\mathcal{D}, Z)].$$
How is this related to the EM objective $Q(\theta, \hat{\theta}^{(t)})$? For a canonical exponential family,
$$Q(\theta, \hat{\theta}^{(t)}) = \sum_{z} \log\big(P_\theta(Z = z, \mathcal{D})\big) \cdot P_{\hat{\theta}^{(t)}}(Z = z \mid \mathcal{D}) = \big\langle \theta,\ \mathbb{E}_{\hat{\theta}^{(t)}}[T(\mathcal{D}, Z) \mid \mathcal{D}] \big\rangle - A(\theta) + \text{constant},$$
so that
$$\frac{\partial Q(\theta, \hat{\theta}^{(t)})}{\partial \theta} = 0 \;\Rightarrow\; \mathbb{E}_{\hat{\theta}^{(t)}}[T(\mathcal{D}, Z) \mid \mathcal{D}] = \mathbb{E}_\theta[T(\mathcal{D}, Z)].$$
Since it is difficult to solve the above equation analytically, the EM algorithm solves for θ via successive approximations, i.e. it solves the following for $\hat{\theta}^{(t+1)}$:
$$\mathbb{E}_{\hat{\theta}^{(t)}}[T(\mathcal{D}, Z) \mid \mathcal{D}] = \mathbb{E}_{\hat{\theta}^{(t+1)}}[T(\mathcal{D}, Z)].$$

Multivariate Gaussian Mixture Models
Example: Multivariate Gaussian Mixtures

Parameters for a mixture of K Gaussians: mixture proportions $\{\pi_k\}_{k=1}^{K}$, and mean vectors and covariance matrices $\{(\mu_k, \Sigma_k)\}_{k=1}^{K}$.

Figure: Contour plot for the density of a mixture of 3 bivariate Gaussian distributions (K = 3).

Multivariate Gaussian Mixture Models
The Generative Process

$Z_i \sim \mathrm{Categorical}(\pi_1, \dots, \pi_K)$, and given $Z_i = z$, $X_i \sim \mathcal{N}(\mu_z, \Sigma_z)$.
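A minimal sketch of this generative process (assuming NumPy; the component parameters below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gmm(n, pis, mus, Sigmas):
    """Draw n points from a Gaussian mixture; returns (X, Z) with Z the latent labels."""
    K = len(pis)
    Z = rng.choice(K, size=n, p=pis)          # Z_i ~ Categorical(pi_1, ..., pi_K)
    X = np.stack([rng.multivariate_normal(mus[z], Sigmas[z]) for z in Z])  # X_i ~ N(mu_z, Sigma_z)
    return X, Z

# Illustrative 2-D example with K = 2 components.
pis = [0.4, 0.6]
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.5]])]
X, Z = sample_gmm(500, pis, mus, Sigmas)
print(X.shape, np.bincount(Z))
```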

Figure: A sample from a Gaussian mixture model: every data point is colored according to its component membership.


Multivariate Gaussian Mixture Models
The Dataset

We need to learn the parameters $\{(\pi_k, \mu_k, \Sigma_k)\}_{k=1}^{K}$ from the data points D = (X1, ..., Xn), which are not "colored" by the component memberships, i.e. we do not observe the latent variables Z = (Z1, ..., Zn).

Figure: (a) (D, Z): the data points and their component memberships. (b) D: the dataset with only the observed data points (component memberships are latent).

EM for Gaussian Mixture Models
MLE for the Gaussian Mixture Models

The complete-data likelihood function is given by
$$P_\theta(\mathcal{D}, Z) = \prod_{i=1}^{n} \pi_{z_i}\, \mathcal{N}(X_i \mid \mu_{z_i}, \Sigma_{z_i}).$$
The marginal likelihood function is
$$P_\theta(\mathcal{D}) = \prod_{i=1}^{n} \sum_{k=1}^{K} \pi_k\, \mathcal{N}(X_i \mid \mu_k, \Sigma_k).$$
The MLE can be obtained by maximizing the marginal log-likelihood function:
$$\hat{\theta}_n^* = \arg\max_{\theta \in \Theta} \sum_{i=1}^{n} \log\Big(\sum_{k=1}^{K} \pi_k\, \mathcal{N}(X_i \mid \mu_k, \Sigma_k)\Big).$$

Exercise: Is the objective function above concave?
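As a side note, the marginal log-likelihood above is easy to evaluate numerically; here is a minimal sketch (assuming NumPy and SciPy) that uses a log-sum-exp for numerical stability:

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_marginal_likelihood(X, pis, mus, Sigmas):
    """Sum over i of log sum_k pi_k N(X_i | mu_k, Sigma_k), computed stably."""
    # log_component[i, k] = log pi_k + log N(X_i | mu_k, Sigma_k)
    log_component = np.column_stack([
        np.log(pi_k) + multivariate_normal.logpdf(X, mean=mu_k, cov=Sigma_k)
        for pi_k, mu_k, Sigma_k in zip(pis, mus, Sigmas)
    ])
    return logsumexp(log_component, axis=1).sum()

# Illustrative evaluation on random 2-D data with K = 2 components.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
pis = [0.5, 0.5]
mus = [np.zeros(2), np.ones(2)]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]
print(gmm_log_marginal_likelihood(X, pis, mus, Sigmas))
```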

EM for Gaussian Mixture Models
Implementing EM for the Gaussian Mixture Model (I)

The expected complete-data log likelihood function is

$$\mathbb{E}_Z[\log P_\theta(\mathcal{D}, Z)] = \sum_{i=1}^{n} \sum_{k=1}^{K} \gamma(k, X_i \mid \theta)\, \big(\log(\pi_k) + \log\big(\mathcal{N}(X_i \mid \mu_k, \Sigma_k)\big)\big),$$
where $\gamma(k, X_i \mid \theta) = P_\theta(Z_i = k \mid X_i)$.

γ(k, Xi | θ) is called the responsibility of component k towards data point Xi:
$$\gamma(k, X_i \mid \theta) = \frac{\pi_k\, \mathcal{N}(X_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(X_i \mid \mu_j, \Sigma_j)}.$$
Try to work out the derivation above yourself!
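A minimal sketch of the responsibility computation (assuming NumPy and SciPy; the function name is illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pis, mus, Sigmas):
    """gamma[i, k] = pi_k N(X_i | mu_k, Sigma_k) / sum_j pi_j N(X_i | mu_j, Sigma_j)."""
    weighted = np.column_stack([
        pi_k * multivariate_normal.pdf(X, mean=mu_k, cov=Sigma_k)
        for pi_k, mu_k, Sigma_k in zip(pis, mus, Sigmas)
    ])                                                   # shape (n, K)
    return weighted / weighted.sum(axis=1, keepdims=True)
```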


EM for Gaussian Mixture Models
Implementing EM for the Gaussian Mixture Model (II)

(E-step) Approximate the expected complete-data log-likelihood by fixing the responsibilities γ(k, Xi | θ) using the parameter estimates obtained from the previous iteration:
$$Q(\theta, \hat{\theta}^{(t)}) = \sum_{i=1}^{n} \sum_{k=1}^{K} \gamma(k, X_i \mid \hat{\theta}^{(t)})\, \big(\log(\pi_k) + \log\big(\mathcal{N}(X_i \mid \mu_k, \Sigma_k)\big)\big),$$
$$\gamma(k, X_i \mid \hat{\theta}^{(t)}) = \frac{\hat{\pi}_k^{(t)}\, \mathcal{N}(X_i \mid \hat{\mu}_k^{(t)}, \hat{\Sigma}_k^{(t)})}{\sum_{j=1}^{K} \hat{\pi}_j^{(t)}\, \mathcal{N}(X_i \mid \hat{\mu}_j^{(t)}, \hat{\Sigma}_j^{(t)})}.$$
(M-step) Solve a tractable optimization problem:
$$(\hat{\pi}^{(t+1)}, \hat{\mu}^{(t+1)}, \hat{\Sigma}^{(t+1)}) = \arg\max_{(\pi, \mu, \Sigma)} \sum_{i=1}^{n} \sum_{k=1}^{K} \gamma(k, X_i \mid \hat{\theta}^{(t)})\, \big(\log(\pi_k) + \log\big(\mathcal{N}(X_i \mid \mu_k, \Sigma_k)\big)\big).$$

EM for Gaussian Mixture Models
Implementing EM for the Gaussian Mixture Model (III)

The (M-step) yields the following parameter updating equations

$$\hat{\pi}_k^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} \gamma(k, X_i \mid \hat{\theta}^{(t)}),$$
$$\hat{\mu}_k^{(t+1)} = \frac{\sum_{i=1}^{n} \gamma(k, X_i \mid \hat{\theta}^{(t)})\, X_i}{\sum_{j=1}^{n} \gamma(k, X_j \mid \hat{\theta}^{(t)})},$$
$$\hat{\Sigma}_k^{(t+1)} = \frac{\sum_{i=1}^{n} \gamma(k, X_i \mid \hat{\theta}^{(t)})\, (X_i - \hat{\mu}_k^{(t+1)})(X_i - \hat{\mu}_k^{(t+1)})^T}{\sum_{j=1}^{n} \gamma(k, X_j \mid \hat{\theta}^{(t)})}.$$
Try to work out the updating equations by yourself!
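The same updates in code, as a hedged sketch (assuming NumPy; `gamma` is the matrix of responsibilities from the E-step):

```python
import numpy as np

def m_step(X, gamma):
    """GMM M-step: given responsibilities gamma (shape (n, K)), return updated parameters."""
    n, K = gamma.shape
    Nk = gamma.sum(axis=0)                  # effective number of points per component
    pis = Nk / n                            # updated mixture proportions
    mus = (gamma.T @ X) / Nk[:, None]       # updated means, shape (K, d)
    Sigmas = []
    for k in range(K):
        diff = X - mus[k]                   # (n, d) deviations from the new mean
        Sigmas.append((gamma[:, k, None] * diff).T @ diff / Nk[k])  # weighted covariance
    return pis, mus, np.array(Sigmas)
```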


EM for Gaussian Mixture Models
EM in Practice

Consider a Gaussian mixture model with K = 3, and the following parameters:

$$\pi_1 = 0.6, \quad \pi_2 = 0.05, \quad \pi_3 = 0.35,$$
$$\mu_1 = [-1.4,\ 1.8]^T, \quad \mu_2 = [-1.4,\ -2.8]^T, \quad \mu_3 = [-1.9,\ 0.55]^T,$$
$$\Sigma_1 = \begin{bmatrix} 0.8 & -0.8 \\ -0.8 & 4 \end{bmatrix}, \quad \Sigma_2 = \begin{bmatrix} 1.2 & 2.3 \\ 2.3 & 5.2 \end{bmatrix}, \quad \Sigma_3 = \begin{bmatrix} 0.4 & -0.01 \\ -0.01 & 0.35 \end{bmatrix}.$$

Try writing MATLAB code that generates a random dataset of 5000 data points drawn from the model specified above, then implement the EM algorithm to learn the model parameters from this dataset.
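For readers who prefer Python, here is a minimal NumPy/SciPy sketch of the whole exercise (data generation plus the EM loop); it is an illustrative implementation rather than a reference solution, and it uses a crude random initialization:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# True parameters from the slide.
pis_true = np.array([0.6, 0.05, 0.35])
mus_true = np.array([[-1.4, 1.8], [-1.4, -2.8], [-1.9, 0.55]])
Sigmas_true = np.array([[[0.8, -0.8], [-0.8, 4.0]],
                        [[1.2, 2.3], [2.3, 5.2]],
                        [[0.4, -0.01], [-0.01, 0.35]]])

# Generate 5000 points from the mixture.
n, K = 5000, 3
Z = rng.choice(K, size=n, p=pis_true)
X = np.stack([rng.multivariate_normal(mus_true[z], Sigmas_true[z]) for z in Z])

# Crude initialization: uniform weights, random data points as means, global covariance.
pis = np.full(K, 1.0 / K)
mus = X[rng.choice(n, size=K, replace=False)]
Sigmas = np.array([np.cov(X.T) for _ in range(K)])

for t in range(100):
    # E-step: responsibilities gamma[i, k].
    weighted = np.column_stack([pis[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
                                for k in range(K)])
    gamma = weighted / weighted.sum(axis=1, keepdims=True)
    # M-step: closed-form parameter updates.
    Nk = gamma.sum(axis=0)
    pis = Nk / n
    mus = (gamma.T @ X) / Nk[:, None]
    Sigmas = np.array([((gamma[:, k, None] * (X - mus[k])).T @ (X - mus[k])) / Nk[k]
                       for k in range(K)])

print("estimated mixture proportions:", np.round(pis, 3))
print("estimated means:\n", np.round(mus, 2))
```

With 5000 points the estimates typically land close to the true parameters, although which estimated component corresponds to which true component depends on the initialization.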


EM for Gaussian Mixture Models
EM in Practice

The marginal log-likelihood increases after every EM iteration! This means that every new iteration finds a "better" estimate!

Figure: Log-likelihood as a function of the EM iteration.

EM for Gaussian Mixture Models
EM in Practice

Compare the true density function with the estimated one.

Figure: Contour plots for the true density function (left) and the estimated density function (right).


EM Performance Guarantees
What Does EM Guarantee?

The EM algorithm does not guarantee that $\hat{\theta}^{(t)}$ will converge to $\hat{\theta}^*_n$. EM guarantees the following: $\hat{\theta}^{(t)}$ always converges (to a local optimum), and every iteration improves the marginal likelihood $P_{\hat{\theta}^{(t)}}(\mathcal{D})$.

Does the Initial Value Matter?

1. The initial value $\hat{\theta}^{(0)}$ affects the speed of convergence and the value of $\hat{\theta}^{(\infty)}$! Smart initialization methods are often needed.

2. The K-means algorithm is often used to initialize the parameters in a Gaussian mixture model before applying the EM algorithm, as in the sketch below.
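A sketch of that initialization strategy, assuming scikit-learn is available (any K-means implementation would do):

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_init(X, K, seed=0):
    """Initialize GMM parameters from a K-means clustering of X."""
    labels = KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(X)
    pis = np.array([(labels == k).mean() for k in range(K)])         # cluster fractions
    mus = np.array([X[labels == k].mean(axis=0) for k in range(K)])  # cluster centroids
    Sigmas = np.array([np.cov(X[labels == k].T) for k in range(K)])  # within-cluster covariances
    return pis, mus, Sigmas
```

Each EM component is then started from one cluster's empirical fraction, centroid, and covariance.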

