
1 The EM algorithm

In this set of notes, we discuss the EM (Expectation-Maximization) algorithm, which is a common algorithm used in statistical estimation to try to find the MLE. It is often used in situations that are not exponential families, but are derived from exponential families. A common mechanism by which these likelihoods are derived is through missing data, i.e. we only observe some of the sufficient statistics of the family.

1.1 Mixture models

A canonical application of the EM algorithm is its use in fitting a mixture model, where we assume we observe an IID sample (X_i)_{1≤i≤n} from

Y ∼ Multinomial(1, π),   π ∈ ℝ^L

X | Y = l ∼ P_{η_l}

with the simplest example of P_η being the univariate normal model

P_{η_l} = N(µ_l, σ_l^2),

keeping in mind that the parameters on the right are the mean parameters, not the natural parameters.

1.1.1 Exercise

1. Show that the joint distribution of (X, Y) is an exponential family. What is its reference measure, and what are its sufficient statistics? Write out the log-likelihood based on observing an IID sample (X_i, Y_i)_{1≤i≤n} for this model. Call this ℓ_c(η; X, Y) the complete likelihood.

2. What is the marginal density of X?

3. Write out the log-likelihood ℓ(η; X) based on observing an IID sample (X_i)_{1≤i≤n} from this model. What are its parameters?

In the mixture model, we only observe X, though the marginal distribution of X is the same as if we had generated pairs (X, Y) and marginalized over Y. In this problem, Y is missing data which we might call M, and X is observed data which we might call O. Formally, then, we partition our sufficient statistic into two sets: those observed, and those missing.
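To make the missing-data structure concrete, here is a minimal simulation sketch of the complete-data mechanism; the mixing proportions, means and standard deviations below are arbitrary illustrative values, not quantities from these notes.

import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.25, 0.75])      # mixing proportions (illustrative values)
mu = np.array([2.0, -1.0])       # class means (illustrative values)
sigma = np.array([1.0, 0.8])     # class standard deviations (illustrative values)

n = 500
Y = rng.choice(len(pi), size=n, p=pi)   # latent labels: the missing data M
X = rng.normal(mu[Y], sigma[Y])         # the observed data O
# In the mixture model we only get to keep X; Y is marginalized over.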

1.2 The EM algorithm

The EM algorithm usually has two steps, both of which are based on the following function:

Q(η; η̃) = E_η̃[ ℓ_c(η; O, M) | O ].

The basis of the EM algorithm is the following result:

Q(η; η̃) ≥ Q(η̃; η̃)  ⟹  ℓ(η; O) ≥ ℓ(η̃; O).

Therefore, any sequence (η^{(k)})_{k≥1} satisfying

Q(η^{(k+1)}; η^{(k)}) ≥ Q(η^{(k)}; η^{(k)})

has ℓ(η^{(k)}; O) non-decreasing. An algorithm that produces such a sequence is called a GEM algorithm (generalized EM algorithm).

The proof of this is fairly straightforward after some initial sleight of hand. After this sleight of hand, we see that the main ingredient in the proof is the deviance of the conditional distribution of M|O. In the general case, this deviance is not expressed in terms of natural parameters, but the argument is the same.

Here is the proof. Write the joint distribution of (O, M) (assuming it has a density with respect to P_0) as

(dP_η / dP_0)(o, m) = f_{η,(O,M)}(o, m) = f_{η,O}(o) · f_{η,M|O}(m|o)

where the f's are densities with respect to P_0. Or,

f_{η,O}(o) = f_{η,(O,M)}(o, m) / f_{η,M|O}(m|o).

Although the RHS seems to depend on m, the above equality shows that it is actually measurable with respect to o. We see that

ℓ(η; O) = Σ_{i=1}^n log f_η(O_i)
        = Σ_{i=1}^n [ log f_η(O_i, M_i) − log f_η(M_i|O_i) ]

where we know that f_η(m|o) is an exponential family for O fixed. The right-hand side is measurable with respect to O, so its conditional expectation with respect to O leaves it unchanged. Therefore, for any η̃ we have the equality

ℓ(η; O) = Σ_{i=1}^n log f_η(O_i)
        = Σ_{i=1}^n [ log f_η(O_i, M_i) − log f_η(M_i|O_i) ]
        = Σ_{i=1}^n ( E_η̃[ log f_η(O_i, M_i) | O ] − E_η̃[ log f_η(M_i|O_i) | O ] )
        = E_η̃[ ℓ_c(η; O, M) | O ] − Σ_{i=1}^n E_η̃[ log f_η(M_i|O_i) | O_i ]
        = Q(η; η̃) − Σ_{i=1}^n E_η̃[ log f_η(M_i|O_i) | O_i ].

Now,

ℓ(η; O) − ℓ(η̃; O) = Q(η; η̃) − Q(η̃; η̃)
                    + Σ_{i=1}^n ( E_η̃[ log f_η̃(M_i|O_i) | O_i ] − E_η̃[ log f_η(M_i|O_i) | O_i ] ).

The term

Σ_{i=1}^n ( E_η̃[ log f_η̃(M_i|O_i) | O_i ] − E_η̃[ log f_η(M_i|O_i) | O_i ] )

is essentially half the deviance of the exponential family of conditional distributions for M|O with sufficient statistic M.

To see this, recall our general form of the conditional density of T_1|T_2 = s_2 for an ℝ^p-valued sufficient statistic partitioned as T_1 ∈ ℝ^k, T_2 ∈ ℝ^{p−k}:

f_{T_1|T_2=s_2}(t_1) = f_{T_1,T_2}(t_1, s_2) / ∫_{ℝ^k} f_{T_1,T_2}(s_1, s_2) ds_1
                     = e^{η_1^T t_1 + η_2^T s_2} m̃_0(t_1, s_2) / ∫_{ℝ^k} e^{η_1^T s_1 + η_2^T s_2} m̃_0(s_1, s_2) ds_1
                     = e^{η_1^T t_1} m̃_0(t_1, s_2) / ∫_{ℝ^k} e^{η_1^T s_1} m̃_0(s_1, s_2) ds_1.

Therefore, with C a function independent of η,

log f_η(M_i|O_i) = η_M^T M_i − log ∫_{ℝ^k} e^{η_M^T s} m̃_0(s, O_i) ds + C(M_i, O_i)
                 = η_M^T M_i − Λ̃(η_M, O_i) + C(M_i, O_i)

where Λ̃(η_M, O_i) is the appropriate CGF for this conditional distribution.

We see, then, that

log f_η̃(M_i|O_i) − log f_η(M_i|O_i) = Λ̃(η_M, O_i) − Λ̃(η̃_M, O_i) − (η_M − η̃_M)^T M_i.

Taking conditional expectations with respect to O at η̃ (and recalling that E_η̃[M_i | O_i] = ∇Λ̃(η̃_M, O_i)) yields

Σ_{i=1}^n E_η̃[ log f_η̃(M_i|O_i) − log f_η(M_i|O_i) | O ] = (1/2) D(η̃; η | O) ≥ 0.
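As a quick sanity check on this last step (not part of the notes' derivation), the quantity Λ̃(η) − Λ̃(η̃) − (η − η̃)^T ∇Λ̃(η̃) is a Bregman divergence of a convex CGF and hence nonnegative; the sketch below verifies this numerically for a Bernoulli family, whose CGF is Λ(η) = log(1 + e^η).

import numpy as np

def Lambda(eta):
    # CGF of the Bernoulli family
    return np.log1p(np.exp(eta))

def dLambda(eta):
    # mean parameter (sigmoid), i.e. the gradient of the CGF
    return 1.0 / (1.0 + np.exp(-eta))

rng = np.random.default_rng(1)
eta, eta_tilde = rng.normal(size=1000), rng.normal(size=1000)
gap = Lambda(eta) - Lambda(eta_tilde) - (eta - eta_tilde) * dLambda(eta_tilde)
assert (gap >= -1e-12).all()   # convexity of the CGF makes the gap nonnegative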

1.3 The two basic steps

The algorithm is often described as having two steps: the E step and the M step. Formally, the E step can be described as evaluating Q(η; η̃) with η̃ fixed. That is, fix η̃ and compute

q_η̃(η) = E_η̃[ ℓ_c(η; O, M) | O ]

as a function of η. The M step is the maximization step and amounts to finding

η̂(η̃) ∈ argmax_η Q(η; η̃) = argmax_η q_η̃(η).
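Schematically, the alternation of the two steps can be written as below; `e_step` and `m_step` are hypothetical problem-specific callables standing in for the two computations just described, not functions defined in these notes.

def em(eta0, O, e_step, m_step, niter=50):
    """
    Generic (G)EM loop. e_step(eta_tilde, O) returns the function
    q(eta) = Q(eta; eta_tilde); m_step(q) returns a maximizer of q
    (any eta that increases q would already give a GEM algorithm).
    """
    eta = eta0
    for _ in range(niter):
        q = e_step(eta, O)   # E step: fix eta_tilde, build q as a function of eta
        eta = m_step(q)      # M step: maximize q over eta
    return eta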

1.4 EM algorithm for exponential families

The EM algorithm for exponential families takes a particularly nice form when the MLE map is nice in the complete data problem. Expressed sequentially, it is given by the recursion

η̂^{(k+1)} = argmax_η [ η^T E_{η^{(k)}}( (M, O) | O ) − Λ(η) ].

In other words, we need to form the conditional expectation of all the sufficient statistics given the sufficient statistics we did observe. Following this, we just return the MLE as if we had observed those sufficient statistics. Another way to phrase this is

η̂^{(k+1)} = ∇Λ^*( E_{η^{(k)}}( (M, O) | O ) ).
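As a toy illustration of this recursion (my own example, not one from the notes), take X_i IID N(µ, σ^2) with a known number of observations missing completely at random. The sufficient statistics are (Σ X_i, Σ X_i^2); the E step imputes their conditional expectations and the M step applies the usual complete-data MLE map.

import numpy as np

rng = np.random.default_rng(2)
x_obs = rng.normal(1.0, 2.0, size=80)   # observed values
n_miss = 20                             # number of missing observations
n = x_obs.size + n_miss

mu, sigma2 = 0.0, 1.0                   # starting values
for _ in range(100):
    # E step: conditional expectations of the complete-data sufficient statistics
    s1 = x_obs.sum() + n_miss * mu                      # E[sum X_i | observed]
    s2 = (x_obs**2).sum() + n_miss * (mu**2 + sigma2)   # E[sum X_i^2 | observed]
    # M step: complete-data MLE map applied to the imputed sufficient statistics
    mu = s1 / n
    sigma2 = s2 / n - mu**2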

1.5 Mixture model example

In the mixture model, if we write Y_i = (Y_{i1}, ..., Y_{iL}) as an indicator vector, the sufficient statistics can be taken to be

t(X, Y) = ( Σ_{i=1}^n Y_{ij}, Σ_{i=1}^n Y_{ij} X_i, Σ_{i=1}^n Y_{ij} X_i^2 )_{1≤j≤L}

where only Σ_{j=1}^L Y_{ij} X_i = X_i, 1 ≤ i ≤ n, is observed.

1.5.1 Exercise

Use Bayes' rule to show that, in our univariate normal mixture model,

P_η(Y = l | X = x) = π_l φ(x, µ_l, σ_l^2) / Σ_{j=1}^L π_j φ(x, µ_j, σ_j^2)

where φ(x, µ, σ^2) is the univariate density of N(µ, σ^2). If we set

γ̂_l(x, η̃) = P_η̃(Y = l | X = x),

the above exercise shows that

E_η̃( Σ_{i=1}^n Y_{il} X_i | X ) = Σ_{i=1}^n γ̂_l(X_i, η̃) X_i

E_η̃( Σ_{i=1}^n Y_{il} X_i^2 | X ) = Σ_{i=1}^n γ̂_l(X_i, η̃) X_i^2

E_η̃( Σ_{i=1}^n Y_{il} | X ) = Σ_{i=1}^n γ̂_l(X_i, η̃)

The usual MLE map (for the mean parameters) in this model can be expressed as

π̂_l = Σ_{i=1}^n Y_{il} / n

µ̂_l = Σ_{i=1}^n Y_{il} X_i / Σ_{i=1}^n Y_{il}

σ̂_l^2 = Σ_{i=1}^n Y_{il} (X_i − µ̂_l)^2 / Σ_{i=1}^n Y_{il}
      = Σ_{i=1}^n Y_{il} X_i^2 / Σ_{i=1}^n Y_{il} − ( Σ_{i=1}^n Y_{il} X_i / Σ_{i=1}^n Y_{il} )^2

This leads to the following algorithm: given an initial set of parameters η^{(0)}, we repeat the following updates for k ≥ 0.

• Form the responsibilities γ̂_l(X_i; η^{(k)}), 1 ≤ l ≤ L, 1 ≤ i ≤ n.

• Compute

  π̂_l^{(k+1)} = Σ_{i=1}^n γ̂_l(X_i; η^{(k)}) / n

  µ̂_l^{(k+1)} = Σ_{i=1}^n γ̂_l(X_i; η^{(k)}) X_i / Σ_{i=1}^n γ̂_l(X_i; η^{(k)})

  (σ̂_l^2)^{(k+1)} = Σ_{i=1}^n γ̂_l(X_i; η^{(k)}) X_i^2 / Σ_{i=1}^n γ̂_l(X_i; η^{(k)}) − (µ̂_l^{(k+1)})^2

• Repeat.

Let's test out our algorithm on some data from the mixture model.

import numpy as np
import matplotlib.pyplot as plt

mu1, sigma1 = 2, 1
mu2, sigma2 = -1, 0.8

X1 = np.random.standard_normal(200) * sigma1 + mu1
X2 = np.random.standard_normal(600) * sigma2 + mu2
X = np.hstack([X1, X2])

%%R -i X
# (assumes the rpy2 IPython extension is loaded: %load_ext rpy2.ipython)
plot(density(X))

def phi(x, mu, sigma):
    """
    Normal density
    """
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

def responsibilities(X, params):
    """
    Compute the responsibilities, as well as the likelihood at the same time.
    """
    mu1, mu2, sigma1, sigma2, pi1, pi2 = params
    gamma1 = phi(X, mu1, sigma1) * pi1
    gamma2 = phi(X, mu2, sigma2) * pi2
    denom = gamma1 + gamma2
    gamma1 /= denom
    gamma2 /= denom
    return np.array([gamma1, gamma2]).T, np.log(denom).sum()

# starting values for the EM iterations
mu1, mu2, sigma1, sigma2, pi1, pi2 = 0, 1, 1, 4, 0.5, 0.5

gamma, likelihood = responsibilities(X, (mu1, mu2, sigma1, sigma2, pi1, pi2))

Here is our iterative estimation procedure, which is fairly straightforward for this model.

niter = 20
n = X.shape[0]

values = []
for _ in range(niter):
    # E step: responsibilities and the current log-likelihood
    gamma, likelihood = responsibilities(X, (mu1, mu2, sigma1, sigma2, pi1, pi2))
    # M step: update mixing proportions, means and variances
    pi1, pi2 = gamma.sum(0) / n
    mu1 = (gamma[:, 0] * X).sum() / (pi1 * n)
    mu2 = (gamma[:, 1] * X).sum() / (pi2 * n)
    sigma1_sq = (gamma[:, 0] * X**2).sum() / (n * pi1) - mu1**2
    sigma2_sq = (gamma[:, 1] * X**2).sum() / (n * pi2) - mu2**2
    sigma1 = np.sqrt(sigma1_sq)
    sigma2 = np.sqrt(sigma2_sq)
    values.append(likelihood)

We can track the value of the log-likelihood and, since we have an EM algorithm, it should be non-decreasing across iterations.
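As a quick check (not in the original notes), we can assert that the recorded values are indeed non-decreasing, up to a small numerical tolerance.

assert (np.diff(values) >= -1e-8).all()   # EM guarantees a non-decreasing log-likelihood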

plt.plot(values)
plt.gca().set_ylabel(r'$\ell^{(k)}$')
plt.gca().set_xlabel(r'Iteration $k$')

Let's plot our density estimate to see how well the mixture model was fit.

%%R -i pi1,pi2,sigma1,sigma2,mu1,mu2
X = sort(X)
plot(X, pi1*dnorm(X, mu1, sigma1) + pi2*dnorm(X, mu2, sigma2),
     col='red', lwd=2, type='l', ylab='Density')
lines(density(X))

1.5.2 Exercise

1. Refit the mixture model assuming the variance is the same within each class, i.e. σ_l^2 = σ^2, independent of the class l.

2. Try fitting 3 and 4 component mixture models to the above data, which only has two components. What do you expect to see in the fitted density?

1.6 Gaussian random effects model

Another application of the EM algorithm is to random effects or linear mixed effects models. One version of a linear mixed effects model is

Y | X, Z ∼ N(Xβ, σ^2 I + ZΣZ^T)

where X is a fixed effects design matrix, Z is a random effects design matrix, and Σ is a covariance matrix that must be estimated along with σ. The covariance matrix Σ might not be estimated in a completely unrestricted fashion. In the example below, the model is Σ = σ_α^2 · I for some constant σ_α^2. This distribution is the same as the distribution of

Xβ + Zα + ε | X, Z

where α ∼ N(0, Σ) and ε ∼ N(0, σ^2 I) independently given X, Z. The simplest version of such a random effects model would be one in which observations were grouped by subject and each subject had a random intercept

Y_{ij} = X_i^T β + α_i + ε_{ij},   1 ≤ i ≤ n, 1 ≤ j ≤ n_i
ε_{ij} ∼ N(0, σ^2)
α_i ∼ N(0, σ_α^2)

with the ε's and α's being independent.

This corresponds to Z being a design matrix of indicator variables for a factor that has n levels, i.e. subject. Here, the matrix Σ = σ_α^2 · I_{n×n}.
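To fix ideas, here is a small simulation sketch of this random intercept model; the dimensions and parameter values below are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(3)
n_subj, n_per = 30, 5
beta = np.array([1.0, -0.5])                  # fixed effects (illustrative)
sigma, sigma_alpha = 1.0, 2.0                 # error and random-intercept SDs (illustrative)

Xmat = rng.normal(size=(n_subj, beta.size))   # one covariate vector per subject
alpha = rng.normal(0, sigma_alpha, size=n_subj)         # random intercepts
eps = rng.normal(0, sigma, size=(n_subj, n_per))        # within-subject noise
Y = (Xmat @ beta)[:, None] + alpha[:, None] + eps
# Within a subject, Cov(Y_ij, Y_ik) = sigma_alpha**2 for j != k and
# Var(Y_ij) = sigma**2 + sigma_alpha**2, matching sigma^2 I + Z Sigma Z^T.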

1.6.1 Exercise

Define the complete data to be (Y_{ij}, α_i, X_i)_{1≤i≤n, 1≤j≤n_i} and assume you are only able to observe

(Y_{ij}, X_i)_{1≤i≤n, 1≤j≤n_i}.

1. What are the sufficient statistics for the joint likelihood of the complete data (conditional on X)?

2. What is the conditional distribution of α_i | (Y_{ij})_{1≤j≤n_i}, X_i?

3. Describe the EM algorithm to estimate (β, σ^2, σ_α^2).

4. How would you estimate the accuracy of σ̂_α^2?
