
1 The EM algorithm

In this set of notes, we discuss the EM (Expectation-Maximization) algorithm, which is a common algorithm used in statistical estimation to try to find the MLE. It is often used in situations that are not exponential families, but are derived from exponential families. A common mechanism by which these likelihoods are derived is through missing data, i.e. we only observe some of the sufficient statistics of the family.

1.1 Mixture models

A canonical application of the EM algorithm is its use in fitting a mixture model, where we assume we observe an IID sample (X_i)_{1≤i≤n} from

Y ∼ Multinomial(1, π),   π ∈ ℝ^L

X | Y = l ∼ P_{η_l}

with the simplest example of P_η being the univariate normal model

P_{η_l} = N(µ_l, σ_l^2),

keeping in mind that the parameters on the right are the mean parameters, not the natural parameters.

1.1.1 Exercise

1. Show that the joint distribution of (X, Y) is an exponential family. What is its reference measure, and what are its sufficient statistics? Write out the log-likelihood based on observing an IID sample (X_i, Y_i)_{1≤i≤n} for this model. Call this ℓ_c(η; X, Y) the complete likelihood.

2. What is the marginal density of X?

3. Write out the log-likelihood ℓ(η; X) based on observing an IID sample (X_i)_{1≤i≤n} from this model. What are its parameters?

In the mixture model, we only observe X, though the marginal distribution of X is the same as if we had generated pairs (X, Y) and marginalized over Y. In this problem, Y is missing data which we might call M, and X is observed data which we might call O. Formally, then, we partition our sufficient statistic into two sets: those observed, and those missing.
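To make the missing-data structure concrete, here is a minimal simulation sketch of the complete-data mechanism; the mixing proportions, means and standard deviations below are arbitrary illustrative values, not quantities from these notes.

import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.25, 0.75])      # mixing proportions (illustrative values)
mu = np.array([2.0, -1.0])       # class means (illustrative values)
sigma = np.array([1.0, 0.8])     # class standard deviations (illustrative values)

n = 500
Y = rng.choice(len(pi), size=n, p=pi)   # latent labels: the missing data M
X = rng.normal(mu[Y], sigma[Y])         # the observed data O
# In the mixture model we only get to keep X; Y is marginalized over.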

1.2 The EM algorithm

The EM algorithm usually has two steps, both of which are based on the following function:

Q(η; η̃) = E_η̃[ ℓ_c(η; O, M) | O ].

The basis of the EM algorithm is the following result:

Q(η; η̃) ≥ Q(η̃; η̃)  ⟹  ℓ(η; O) ≥ ℓ(η̃; O).

Therefore, any sequence (η^{(k)})_{k≥1} satisfying

Q(η^{(k+1)}; η^{(k)}) ≥ Q(η^{(k)}; η^{(k)})

has ℓ(η^{(k)}; O) non-decreasing. An algorithm that produces such a sequence is called a GEM algorithm (generalized EM algorithm).

The proof of this is fairly straightforward after some initial sleight of hand. After this sleight of hand, we see that the main ingredient in the proof is the deviance of the conditional distribution of M|O. In the general case, this deviance is not expressed in terms of natural parameters, but the argument is the same.

Here is the proof. Write the joint distribution of (O, M) (assuming it has a density with respect to P_0) as

(dP_η / dP_0)(o, m) = f_{η,(O,M)}(o, m) = f_{η,O}(o) · f_{η,M|O}(m|o)

where the f's are densities with respect to P_0. Or,

f_{η,O}(o) = f_{η,(O,M)}(o, m) / f_{η,M|O}(m|o).

Although the RHS seems to depend on m, the above equality shows that it is actually measurable with respect to o. We see that

ℓ(η; O) = Σ_{i=1}^n log f_η(O_i)
        = Σ_{i=1}^n [ log f_η(O_i, M_i) − log f_η(M_i|O_i) ]

where we know that f_η(m|o) is an exponential family for O fixed. The right-hand side is measurable with respect to O, so its conditional expectation with respect to O leaves it unchanged. Therefore, for any η̃ we have the equality

ℓ(η; O) = Σ_{i=1}^n log f_η(O_i)
        = Σ_{i=1}^n [ log f_η(O_i, M_i) − log f_η(M_i|O_i) ]
        = Σ_{i=1}^n ( E_η̃[ log f_η(O_i, M_i) | O ] − E_η̃[ log f_η(M_i|O_i) | O ] )
        = E_η̃[ ℓ_c(η; O, M) | O ] − Σ_{i=1}^n E_η̃[ log f_η(M_i|O_i) | O_i ]
        = Q(η; η̃) − Σ_{i=1}^n E_η̃[ log f_η(M_i|O_i) | O_i ].

Now,

ℓ(η; O) − ℓ(η̃; O) = Q(η; η̃) − Q(η̃; η̃)
                    + Σ_{i=1}^n ( E_η̃[ log f_η̃(M_i|O_i) | O_i ] − E_η̃[ log f_η(M_i|O_i) | O_i ] ).

The term

Σ_{i=1}^n ( E_η̃[ log f_η̃(M_i|O_i) | O_i ] − E_η̃[ log f_η(M_i|O_i) | O_i ] )

is essentially half the deviance of the exponential family of conditional distributions for M|O with sufficient statistic M.

To see this, recall our general form of the conditional density of T_1|T_2 = s_2 for an ℝ^p-valued sufficient statistic partitioned as T_1 ∈ ℝ^k, T_2 ∈ ℝ^{p−k}:

f_{T_1|T_2=s_2}(t_1) = f_{T_1,T_2}(t_1, s_2) / ∫_{ℝ^k} f_{T_1,T_2}(s_1, s_2) ds_1
                     = e^{η_1^T t_1 + η_2^T s_2} m̃_0(t_1, s_2) / ∫_{ℝ^k} e^{η_1^T s_1 + η_2^T s_2} m̃_0(s_1, s_2) ds_1
                     = e^{η_1^T t_1} m̃_0(t_1, s_2) / ∫_{ℝ^k} e^{η_1^T s_1} m̃_0(s_1, s_2) ds_1.

Therefore, with C a function independent of η,

log f_η(M_i|O_i) = η_M^T M_i − log ∫_{ℝ^k} e^{η_M^T s} m̃_0(s, O_i) ds + C(M_i, O_i)
                 = η_M^T M_i − Λ̃(η_M, O_i) + C(M_i, O_i)

where Λ̃(η_M, O_i) is the appropriate CGF for this conditional distribution.

We see, then, that

log f_η̃(M_i|O_i) − log f_η(M_i|O_i) = Λ̃(η_M, O_i) − Λ̃(η̃_M, O_i) − (η_M − η̃_M)^T M_i.

Taking conditional expectations with respect to O at η̃ (and recalling that E_η̃[M_i | O_i] = ∇Λ̃(η̃_M, O_i)) yields

Σ_{i=1}^n E_η̃[ log f_η̃(M_i|O_i) − log f_η(M_i|O_i) | O ] = (1/2) D(η̃; η | O) ≥ 0.
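As a quick sanity check on this last step (not part of the notes' derivation), the quantity Λ̃(η) − Λ̃(η̃) − (η − η̃)^T ∇Λ̃(η̃) is a Bregman divergence of a convex CGF and hence nonnegative; the sketch below verifies this numerically for a Bernoulli family, whose CGF is Λ(η) = log(1 + e^η).

import numpy as np

def Lambda(eta):
    # CGF of the Bernoulli family
    return np.log1p(np.exp(eta))

def dLambda(eta):
    # mean parameter (sigmoid), i.e. the gradient of the CGF
    return 1.0 / (1.0 + np.exp(-eta))

rng = np.random.default_rng(1)
eta, eta_tilde = rng.normal(size=1000), rng.normal(size=1000)
gap = Lambda(eta) - Lambda(eta_tilde) - (eta - eta_tilde) * dLambda(eta_tilde)
assert (gap >= -1e-12).all()   # convexity of the CGF makes the gap nonnegative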

1.3 The two basic steps

The algorithm is often described as having two steps: the E step and the M step. Formally, the E step can be described as evaluating Q(η; η̃) with η̃ fixed. That is, fix η̃ and compute

q_η̃(η) = E_η̃[ ℓ_c(η; O, M) | O ]

as a function of η. The M step is the maximization step and amounts to finding

η̂(η̃) ∈ argmax_η Q(η; η̃) = argmax_η q_η̃(η).
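Schematically, the alternation of the two steps can be written as below; `e_step` and `m_step` are hypothetical problem-specific callables standing in for the two computations just described, not functions defined in these notes.

def em(eta0, O, e_step, m_step, niter=50):
    """
    Generic (G)EM loop. e_step(eta_tilde, O) returns the function
    q(eta) = Q(eta; eta_tilde); m_step(q) returns a maximizer of q
    (any eta that increases q would already give a GEM algorithm).
    """
    eta = eta0
    for _ in range(niter):
        q = e_step(eta, O)   # E step: fix eta_tilde, build q as a function of eta
        eta = m_step(q)      # M step: maximize q over eta
    return eta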

1.4 EM algorithm for exponential families

The EM algorithm for exponential families takes a particularly nice form when the MLE map is nice in the complete data problem. Expressed sequentially, it is given by the recursion

η̂^{(k+1)} = argmax_η [ η^T E_{η^{(k)}}( (M, O) | O ) − Λ(η) ].

In other words, we need to form the conditional expectation of all the sufficient statistics given the sufficient statistics we did observe. Following this, we just return the MLE as if we had observed those sufficient statistics. Another way to phrase this is

η̂^{(k+1)} = ∇Λ^*( E_{η^{(k)}}( (M, O) | O ) ).
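As a toy illustration of this recursion (my own example, not one from the notes), take X_i IID N(µ, σ^2) with a known number of observations missing completely at random. The sufficient statistics are (Σ X_i, Σ X_i^2); the E step imputes their conditional expectations and the M step applies the usual complete-data MLE map.

import numpy as np

rng = np.random.default_rng(2)
x_obs = rng.normal(1.0, 2.0, size=80)   # observed values
n_miss = 20                             # number of missing observations
n = x_obs.size + n_miss

mu, sigma2 = 0.0, 1.0                   # starting values
for _ in range(100):
    # E step: conditional expectations of the complete-data sufficient statistics
    s1 = x_obs.sum() + n_miss * mu                      # E[sum X_i | observed]
    s2 = (x_obs**2).sum() + n_miss * (mu**2 + sigma2)   # E[sum X_i^2 | observed]
    # M step: complete-data MLE map applied to the imputed sufficient statistics
    mu = s1 / n
    sigma2 = s2 / n - mu**2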

1.5 Mixture model example

In the mixture model, if we write Y_i = (Y_{i1}, ..., Y_{iL}) as an indicator vector, the sufficient statistics can be taken to be

t(X, Y) = ( Σ_{i=1}^n Y_{ij}, Σ_{i=1}^n Y_{ij} X_i, Σ_{i=1}^n Y_{ij} X_i^2 )_{1≤j≤L}

where only Σ_{j=1}^L Y_{ij} X_i = X_i, 1 ≤ i ≤ n, is observed.

1.5.1 Exercise

Use Bayes' rule to show that, in our univariate normal mixture model,

P_η(Y = l | X = x) = π_l φ(x, µ_l, σ_l^2) / Σ_{j=1}^L π_j φ(x, µ_j, σ_j^2)

where φ(x, µ, σ^2) is the univariate density of N(µ, σ^2). If we set

γ̂_l(x, η̃) = P_η̃(Y = l | X = x),

the above exercise shows that

E_η̃( Σ_{i=1}^n Y_{il} X_i | X ) = Σ_{i=1}^n γ̂_l(X_i, η̃) X_i

E_η̃( Σ_{i=1}^n Y_{il} X_i^2 | X ) = Σ_{i=1}^n γ̂_l(X_i, η̃) X_i^2

E_η̃( Σ_{i=1}^n Y_{il} | X ) = Σ_{i=1}^n γ̂_l(X_i, η̃)

The usual MLE map (for the mean parameters) in this model can be expressed as

π̂_l = Σ_{i=1}^n Y_{il} / n

µ̂_l = Σ_{i=1}^n Y_{il} X_i / Σ_{i=1}^n Y_{il}

σ̂_l^2 = Σ_{i=1}^n Y_{il} (X_i − µ̂_l)^2 / Σ_{i=1}^n Y_{il}
      = Σ_{i=1}^n Y_{il} X_i^2 / Σ_{i=1}^n Y_{il} − ( Σ_{i=1}^n Y_{il} X_i / Σ_{i=1}^n Y_{il} )^2

This leads to the following algorithm: given an initial set of parameters η^{(0)}, we repeat the following updates for k ≥ 0.

• Form the responsibilities γ̂_l(X_i; η^{(k)}), 1 ≤ l ≤ L, 1 ≤ i ≤ n.

• Compute

  π̂_l^{(k+1)} = Σ_{i=1}^n γ̂_l(X_i; η^{(k)}) / n

  µ̂_l^{(k+1)} = Σ_{i=1}^n γ̂_l(X_i; η^{(k)}) X_i / Σ_{i=1}^n γ̂_l(X_i; η^{(k)})

  (σ̂_l^2)^{(k+1)} = Σ_{i=1}^n γ̂_l(X_i; η^{(k)}) X_i^2 / Σ_{i=1}^n γ̂_l(X_i; η^{(k)}) − (µ̂_l^{(k+1)})^2

• Repeat.

Let's test out our algorithm on some data from the mixture model.

import numpy as np
import matplotlib.pyplot as plt

mu1, sigma1 = 2, 1
mu2, sigma2 = -1, 0.8

X1 = np.random.standard_normal(200) * sigma1 + mu1
X2 = np.random.standard_normal(600) * sigma2 + mu2
X = np.hstack([X1, X2])

%%R -i X
# (assumes the rpy2 IPython extension is loaded: %load_ext rpy2.ipython)
plot(density(X))

def phi(x, mu, sigma):
    """
    Normal density
    """
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

def responsibilities(X, params):
    """
    Compute the responsibilities, as well as the likelihood at the same time.
    """
    mu1, mu2, sigma1, sigma2, pi1, pi2 = params
    gamma1 = phi(X, mu1, sigma1) * pi1
    gamma2 = phi(X, mu2, sigma2) * pi2
    denom = gamma1 + gamma2
    gamma1 /= denom
    gamma2 /= denom
    return np.array([gamma1, gamma2]).T, np.log(denom).sum()

# starting values for the EM iterations
mu1, mu2, sigma1, sigma2, pi1, pi2 = 0, 1, 1, 4, 0.5, 0.5

gamma, likelihood = responsibilities(X, (mu1, mu2, sigma1, sigma2, pi1, pi2))

Here is our iterative estimation procedure, which is fairly straightforward for this model.

niter = 20
n = X.shape[0]

values = []
for _ in range(niter):
    # E step: responsibilities and the current log-likelihood
    gamma, likelihood = responsibilities(X, (mu1, mu2, sigma1, sigma2, pi1, pi2))
    # M step: update mixing proportions, means and variances
    pi1, pi2 = gamma.sum(0) / n
    mu1 = (gamma[:, 0] * X).sum() / (pi1 * n)
    mu2 = (gamma[:, 1] * X).sum() / (pi2 * n)
    sigma1_sq = (gamma[:, 0] * X**2).sum() / (n * pi1) - mu1**2
    sigma2_sq = (gamma[:, 1] * X**2).sum() / (n * pi2) - mu2**2
    sigma1 = np.sqrt(sigma1_sq)
    sigma2 = np.sqrt(sigma2_sq)
    values.append(likelihood)

We can track the value of the log-likelihood and, since we have an EM algorithm, it should be non-decreasing across iterations.
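As a quick check (not in the original notes), we can assert that the recorded values are indeed non-decreasing, up to a small numerical tolerance.

assert (np.diff(values) >= -1e-8).all()   # EM guarantees a non-decreasing log-likelihood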

plt.plot(values)
plt.gca().set_ylabel(r'$\ell^{(k)}$')
plt.gca().set_xlabel(r'Iteration $k$')

Let's plot our density estimate to see how well the mixture model was fit.

%%R -i pi1,pi2,sigma1,sigma2,mu1,mu2
X = sort(X)
plot(X, pi1*dnorm(X, mu1, sigma1) + pi2*dnorm(X, mu2, sigma2),
     col='red', lwd=2, type='l', ylab='Density')
lines(density(X))

1.5.2 Exercise

1. Refit the mixture model assuming the variance is the same within each class, i.e. σ_l^2 = σ^2, independent of the class l.

2. Try fitting 3 and 4 component mixture models to the above data, which only has two components. What do you expect to see in the fitted density?

1.6 Gaussian random effects model

Another application of the EM algorithm is to random effects or linear mixed effects models. One version of a linear mixed effects model is

Y | X, Z ∼ N(Xβ, σ^2 I + ZΣZ^T)

where X is a fixed effects design matrix, Z is a random effects design matrix, and Σ is a covariance matrix that must be estimated along with σ. The covariance matrix Σ might not be estimated in a completely unrestricted fashion. In the example below, the model is Σ = σ_α^2 · I for some constant σ_α^2. This distribution is the same as the distribution of

Xβ + Zα + ε | X, Z

where α ∼ N(0, Σ) and ε ∼ N(0, σ^2 I) independently given X, Z. The simplest version of such a random effects model would be one in which observations were grouped by subject and each subject had a random intercept

Y_{ij} = X_i^T β + α_i + ε_{ij},   1 ≤ i ≤ n, 1 ≤ j ≤ n_i
ε_{ij} ∼ N(0, σ^2)
α_i ∼ N(0, σ_α^2)

with the ε's and α's being independent.

This corresponds to Z being a design matrix of indicator variables for a factor that has n levels, i.e. subject. Here, the matrix Σ = σ_α^2 · I_{n×n}.
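To fix ideas, here is a small simulation sketch of this random intercept model; the dimensions and parameter values below are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(3)
n_subj, n_per = 30, 5
beta = np.array([1.0, -0.5])                  # fixed effects (illustrative)
sigma, sigma_alpha = 1.0, 2.0                 # error and random-intercept SDs (illustrative)

Xmat = rng.normal(size=(n_subj, beta.size))   # one covariate vector per subject
alpha = rng.normal(0, sigma_alpha, size=n_subj)         # random intercepts
eps = rng.normal(0, sigma, size=(n_subj, n_per))        # within-subject noise
Y = (Xmat @ beta)[:, None] + alpha[:, None] + eps
# Within a subject, Cov(Y_ij, Y_ik) = sigma_alpha**2 for j != k and
# Var(Y_ij) = sigma**2 + sigma_alpha**2, matching sigma^2 I + Z Sigma Z^T.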

1.6.1 Exercise

Define the complete data to be (Y_{ij}, α_i, X_i)_{1≤i≤n, 1≤j≤n_i} and assume you are only able to observe

(Y_{ij}, X_i)_{1≤i≤n, 1≤j≤n_i}.

1. What are the sufficient statistics for the joint likelihood of the complete data (conditional on X)?

2. What is the conditional distribution of α_i | (Y_{ij})_{1≤j≤n_i}, X_i?

3. Describe the EM algorithm to estimate (β, σ^2, σ_α^2).

4. How would you estimate the accuracy of σ̂_α^2?
