EM and MM Algorithms


Hua Zhou

Department of Biostatistics University of California, Los Angeles

Aug 10, 2016, SAMSI OPT Summer School

Slides and Jupyter Notebook for the lab session are available at: https://github.com/Hua-Zhou/Public-Talks

My flowchart for solving optimization problems

Outline

Overview Problem Setup Review of Newton’s method

EM algorithm Introduction Canonical Example - Finite Mixture Model In the Zoo of EM

MM algorithm Introduction MM examples

Acceleration of EM/MM SQUAREM Quasi-Newton acceleration

Problem setup

We consider continuous optimization in this lecture

minimize $f(x)$ subject to constraints on $x$,

where both the objective function $f$ and the constraint set are continuous.

I'll switch back and forth between maximization and minimization during the lecture (sorry!)

- Applied mathematicians talk about minimization exclusively

- Statisticians talk about maximization due to maximum likelihood estimation

They are fundamentally the same problem: minimizing $f(x)$ is equivalent to maximizing $-f(x)$

Newton's method

Consider maximizing a log-likelihood $L(\theta)$, $\theta \in \Theta \subset \mathbb{R}^p$.

- Newton's method was originally developed for finding roots of nonlinear equations $f(x) = 0$

- Newton's method (aka the Newton-Raphson method) is considered the gold standard for its fast (quadratic) convergence:

$$\frac{\|\theta^{(t+1)} - \theta^*\|}{\|\theta^{(t)} - \theta^*\|^2} \to \text{constant}$$

- Idea: iterative quadratic approximation

- Notation: $\nabla L(\theta)$ is the gradient vector and $d^2 L(\theta)$ is the Hessian matrix

- Statistical jargon: $\nabla L(\theta)$ is the score vector, $-d^2 L(\theta)$ is the observed information matrix, and $E[-d^2 L(\theta)]$ is the expected (Fisher) information matrix

- Taylor expansion around the current iterate $\theta^{(t)}$:

$$L(\theta) \approx L(\theta^{(t)}) + \nabla L(\theta^{(t)})^T (\theta - \theta^{(t)}) + \frac{1}{2} (\theta - \theta^{(t)})^T d^2 L(\theta^{(t)}) (\theta - \theta^{(t)}),$$

and then maximize the quadratic approximation

- To maximize the quadratic approximation, we equate its gradient to zero,

$$\nabla L(\theta^{(t)}) + [d^2 L(\theta^{(t)})](\theta - \theta^{(t)}) = 0_p,$$

which suggests the next iterate

$$\theta^{(t+1)} = \theta^{(t)} - [d^2 L(\theta^{(t)})]^{-1} \nabla L(\theta^{(t)}) = \theta^{(t)} + [-d^2 L(\theta^{(t)})]^{-1} \nabla L(\theta^{(t)})$$

Issues with Newton's method and remedies

- Need to derive, evaluate, and "invert" the observed information matrix. In statistical problems, evaluating the Hessian often costs $O(np^2)$ flops and inverting it costs $O(p^3)$ flops

Remedies:
1. automatic (analytical) differentiation
2. numerical differentiation (works for small problems)
3. quasi-Newton methods (BFGS and L-BFGS are probably the most popular black-box nonlinear optimizers)
4. exploit structure whenever possible to reduce the cost of evaluating and inverting the Hessian

- Stability: the naïve Newton iterate is not guaranteed to be an ascent algorithm. It's equally happy to head uphill or downhill

Remedies:
1. approximate $-d^2 L(\theta^{(t)})$ by a positive definite $A^{(t)}$ (if it is not), and
2. backtracking: by a first-order Taylor expansion,

$$L(\theta^{(t)} + s \Delta\theta^{(t)}) - L(\theta^{(t)}) = s \nabla L(\theta^{(t)})^T \Delta\theta^{(t)} + o(s) = s \nabla L(\theta^{(t)})^T [A^{(t)}]^{-1} \nabla L(\theta^{(t)}) + o(s).$$

For $s$ sufficiently small, the right-hand side is strictly positive

In summary, the Newton scheme iterates according to

$$\theta^{(t+1)} = \theta^{(t)} + s [A^{(t)}]^{-1} \nabla L(\theta^{(t)}) = \theta^{(t)} + s \Delta\theta^{(t)},$$

where $A^{(t)}$ is a positive definite approximation of $-d^2 L(\theta^{(t)})$ and $s$ is a step length
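As a small illustration (not from the slides), here is a minimal Python/NumPy sketch of this damped Newton scheme; the user-supplied function names `loglik`, `grad`, and `hess`, and the ridge fix for indefinite Hessians, are my assumptions.

```python
import numpy as np

def damped_newton(loglik, grad, hess, theta0, maxiter=100, tol=1e-8):
    """Maximize loglik by Newton's method with a step-halving line search.

    A^(t) is taken as -hess(theta), with a ridge added if needed so it is
    positive definite; the step length s is halved until the iterate ascends.
    """
    theta = np.asarray(theta0, dtype=float)
    p = theta.size
    for _ in range(maxiter):
        g = grad(theta)
        A = -hess(theta)
        ridge = 0.0
        while True:                                      # crude positive-definiteness fix
            try:
                np.linalg.cholesky(A + ridge * np.eye(p))
                break
            except np.linalg.LinAlgError:
                ridge = max(2 * ridge, 1e-8)
        d = np.linalg.solve(A + ridge * np.eye(p), g)    # Newton direction
        s, f0 = 1.0, loglik(theta)
        while loglik(theta + s * d) < f0 and s > 1e-10:  # step halving
            s /= 2.0
        theta_new = theta + s * d
        if np.linalg.norm(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta
```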

- Line search strategy: step-halving ($s = 1, 1/2, \ldots$), golden section search, cubic interpolation, the Armijo rule, ... The Newton direction $\Delta\theta^{(t)}$ only needs to be calculated once; the cost of the line search mainly lies in objective function evaluations

- How to approximate $-d^2 L(\theta)$? More of an art than a science. Often requires problem-specific analysis

- Taking $A^{(t)} = I$ leads to the method of steepest ascent, aka gradient ascent

Fisher's scoring

Fisher's scoring method replaces $-d^2 L(\theta)$ by the expected (Fisher) information matrix

$$I(\theta) = E[-d^2 L(\theta)] = E[\nabla L(\theta) \nabla L(\theta)^T] \succeq 0_{p \times p},$$

which is positive semidefinite under exchangeability of expectation and differentiation. Therefore Fisher's scoring algorithm iterates according to

$$\theta^{(t+1)} = \theta^{(t)} + s [I(\theta^{(t)})]^{-1} \nabla L(\theta^{(t)})$$
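As a concrete sketch (my own example, not one worked in the slides): Fisher scoring for logistic regression, where the expected information is $I(\beta) = X^T W X$ with $W = \operatorname{diag}\{p_i(1-p_i)\}$ and, for this canonical GLM, Fisher scoring coincides with Newton's method.

```python
import numpy as np

def logistic_fisher_scoring(X, y, maxiter=50, tol=1e-8):
    """Fisher scoring (here identical to Newton / IRLS) for logistic regression."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(maxiter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))      # fitted probabilities
        score = X.T @ (y - mu)               # gradient of the log-likelihood
        W = mu * (1.0 - mu)                  # weights on the diagonal of W
        info = X.T @ (W[:, None] * X)        # I(beta) = X^T W X
        delta = np.linalg.solve(info, score)
        beta += delta
        if np.linalg.norm(delta) < tol:
            break
    return beta

# usage on simulated data
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ np.array([0.5, 1.0, -1.0])))))
print(logistic_fisher_scoring(X, y))
```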

Constrained optimization

Main strategies for dealing with constrained optimization, e.g.,

minimize $f(x)$
subject to $c_i(x) \ge 0$, $i = 1, \ldots, m$

- Interior point method (barrier method):

$$\text{minimize} \quad f(x) - \mu \sum_{i=1}^m \log(c_i(x))$$

and push $\mu \to 0$

- Sequential quadratic programming (SQP):

minimize a quadratic approximation of $f$
subject to $c_i(x) \ge 0$

- Active set method

The interior point method is the most successful for large problems

Outline

Overview Problem Setup Review of Newton’s method

EM algorithm Introduction Canonical Example - Finite Mixture Model In the Zoo of EM

MM algorithm Introduction MM examples

Acceleration of EM/MM SQUAREM Quasi-Newton acceleration

EM algorithm

I Which are the most cited statistical papers?

Paper                                        Citations   Per Year
Kaplan-Meier (Kaplan and Meier, 1958)            46886        808
EM (Dempster et al., 1977)                       44050       1129
Cox model (Cox, 1972)                            40920        930
Metropolis (Metropolis et al., 1953)             31284        497
FDR (Benjamini and Hochberg, 1995)               30975       1450
Unit root test (Dickey and Fuller, 1979)         18259        493
Lasso (Tibshirani, 1996)                         15306        765
bootstrap (Efron, 1979)                          12992        351
FFT (Cooley and Tukey, 1965)                     11319        222
Gibbs sampler (Gelfand and Smith, 1990)           6531        251

Citation counts from Google Scholar on Feb 17, 2016.

- EM is one of the most influential statistical ideas, finding applications in various branches of science

General reference books on EM

I Notations

I Y : observed data

I Z: missing data

I X = (Y , Z): complete data

I Goal: maximize the log-likelihood of the observed data ln g(y|θ) (optimization!)

I Idea: choose Z such that MLE for the complete data is trivial

- Let $f(x|\theta) = f(y, z|\theta)$ be the density of the complete data

- Each iteration of EM contains two steps

- E step: calculate the conditional expectation

$$Q(\theta|\theta^{(t)}) = E_{Z|Y=y,\theta^{(t)}} [\ln f(Y, Z|\theta) \mid Y = y, \theta^{(t)}]$$

- M step: maximize $Q(\theta|\theta^{(t)})$ to generate the next iterate

$$\theta^{(t+1)} = \operatorname{argmax}_{\theta} \, Q(\theta|\theta^{(t)})$$

- Fundamental inequality of EM:

$$\begin{aligned}
Q(\theta \mid \theta^{(t)}) - \ln g(y|\theta) &= E[\ln f(Y, Z|\theta) \mid Y = y, \theta^{(t)}] - \ln g(y|\theta) \\
&= E\left[\ln \frac{f(Y, Z \mid \theta)}{g(Y \mid \theta)} \,\Big|\, Y = y, \theta^{(t)}\right] \\
&\le E\left[\ln \frac{f(Y, Z \mid \theta^{(t)})}{g(Y \mid \theta^{(t)})} \,\Big|\, Y = y, \theta^{(t)}\right] \quad \text{(information inequality)} \\
&= Q(\theta^{(t)} \mid \theta^{(t)}) - \ln g(y|\theta^{(t)})
\end{aligned}$$

Rearranging shows

$$\ln g(y \mid \theta) \ge Q(\theta \mid \theta^{(t)}) - Q(\theta^{(t)} \mid \theta^{(t)}) + \ln g(y \mid \theta^{(t)})$$

for all $\theta$. Equality is achieved at $\theta = \theta^{(t)}$

- Ascent property of EM:

$$\ln g(y \mid \theta^{(t+1)}) \ge Q(\theta^{(t+1)} \mid \theta^{(t)}) - Q(\theta^{(t)} \mid \theta^{(t)}) + \ln g(y \mid \theta^{(t)}) \ge \ln g(y \mid \theta^{(t)})$$

- Obviously we only need $Q(\theta^{(t+1)} \mid \theta^{(t)}) - Q(\theta^{(t)} \mid \theta^{(t)}) \ge 0$ for this ascent property to hold (generalized EM)

A picture to keep in mind

[Figure: the observed log-likelihood ln g(θ) and the shifted surrogates Q(θ|θ^(t)) + c^(t) and Q(θ|θ^(t+1)) + c^(t+1), with the iterates θ^(t) and θ^(t+1) marked.]

One more ... 2D function

Remarks on EM

- Works magically (at least to applied mathematicians). No need for gradients and Hessians

I Concavity of the observed log-likelihood function (objective function) is not required

I Under regularity conditions, EM converges to a stationary point of ln g(y|θ). For a non-concave objective function, it can be a local mode or even a saddle point

I Original idea also appeared in the HMM paper (Baum-Welch algorithm) (Baum et al., 1970)

- Tends to separate variables – amenable to parallel computing

A canonical EM example – the finite mixture model

- Each data point $y_i \in \mathbb{R}^d$ comes from candidate distribution $h_j$ with probability $\pi_j$, $\sum_{j=1}^k \pi_j = 1$

- Mixture density $h(y) = \sum_{j=1}^k \pi_j h_j(y \mid \mu_j, \Omega_j)$, $y \in \mathbb{R}^d$, where

$$h_j(y \mid \mu_j, \Omega_j) = \left(\frac{1}{2\pi}\right)^{d/2} |\det(\Omega_j)|^{-1/2} e^{-\frac{1}{2}(y - \mu_j)^T \Omega_j^{-1} (y - \mu_j)}$$

are the $N(\mu_j, \Omega_j)$ densities

- Given data $y_1, \ldots, y_n$, we want to estimate

$$\theta = (\pi_1, \ldots, \pi_k, \mu_1, \ldots, \mu_k, \Omega_1, \ldots, \Omega_k).$$

The observed (incomplete) data log-likelihood is

$$\ln g(y_1, \ldots, y_n|\theta) = \sum_{i=1}^n \ln h(y_i) = \sum_{i=1}^n \ln \sum_{j=1}^k \pi_j h_j(y_i \mid \mu_j, \Omega_j)$$

EM for fitting the mixture model

I Missing data: let zij = I{yi comes from group j}

I Complete data likelihood is

$$f(y, z|\theta) = \prod_{i=1}^n \prod_{j=1}^k [\pi_j h_j(y_i \mid \mu_j, \Omega_j)]^{z_{ij}}$$

and thus the complete data log-likelihood is

$$\ln f(y, z|\theta) = \sum_{i=1}^n \sum_{j=1}^k z_{ij} [\ln \pi_j + \ln h_j(y_i \mid \mu_j, \Omega_j)]$$

E step

- Conditional expectation:

$$Q(\theta|\theta^{(t)}) = E\left[\sum_{i=1}^n \sum_{j=1}^k z_{ij} [\ln \pi_j + \ln h_j(y_i|\mu_j, \Omega_j)] \,\Big|\, Y = y, \pi^{(t)}, \mu_1^{(t)}, \ldots, \mu_k^{(t)}, \Omega_1^{(t)}, \ldots, \Omega_k^{(t)}\right]$$

- By Bayes' rule, we have

$$w_{ij}^{(t)} := E[z_{ij} \mid y, \pi^{(t)}, \mu_1^{(t)}, \ldots, \mu_k^{(t)}, \Omega_1^{(t)}, \ldots, \Omega_k^{(t)}] = \frac{\pi_j^{(t)} h_j(y_i|\mu_j^{(t)}, \Omega_j^{(t)})}{\sum_{j'=1}^k \pi_{j'}^{(t)} h_{j'}(y_i|\mu_{j'}^{(t)}, \Omega_{j'}^{(t)})}$$

- The Q function becomes

$$\sum_{i=1}^n \sum_{j=1}^k w_{ij}^{(t)} \ln \pi_j + \sum_{i=1}^n \sum_{j=1}^k w_{ij}^{(t)} \left[-\frac{1}{2} \ln \det \Omega_j - \frac{1}{2} (y_i - \mu_j)^T \Omega_j^{-1} (y_i - \mu_j)\right]$$

M step

- The maximizer of the Q function gives the next iterate:

$$\pi_j^{(t+1)} = \frac{\sum_{i=1}^n w_{ij}^{(t)}}{n}, \qquad
\mu_j^{(t+1)} = \frac{\sum_{i=1}^n w_{ij}^{(t)} y_i}{\sum_{i=1}^n w_{ij}^{(t)}}, \qquad
\Omega_j^{(t+1)} = \frac{\sum_{i=1}^n w_{ij}^{(t)} (y_i - \mu_j^{(t+1)})(y_i - \mu_j^{(t+1)})^T}{\sum_i w_{ij}^{(t)}}$$

- Why? These are the multinomial MLE and the weighted multivariate normal MLE
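A minimal Python sketch of one EM iteration for this Gaussian mixture, implementing the E-step weights $w_{ij}^{(t)}$ and the M-step updates above (the function name `em_step` and the use of `scipy.stats.multivariate_normal` are my choices, not from the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(y, pi, mu, Omega):
    """One EM iteration for a k-component Gaussian mixture.

    y: (n, d) data; pi: (k,) weights; mu: (k, d) means; Omega: (k, d, d) covariances.
    Returns updated (pi, mu, Omega).
    """
    n, d = y.shape
    k = len(pi)
    # E step: membership weights w_ij
    w = np.column_stack([pi[j] * multivariate_normal.pdf(y, mu[j], Omega[j])
                         for j in range(k)])
    w /= w.sum(axis=1, keepdims=True)
    # M step: multinomial MLE and weighted multivariate normal MLEs
    nj = w.sum(axis=0)
    pi_new = nj / n
    mu_new = (w.T @ y) / nj[:, None]
    Omega_new = np.empty((k, d, d))
    for j in range(k):
        r = y - mu_new[j]
        Omega_new[j] = (w[:, j, None] * r).T @ r / nj[j]
    return pi_new, mu_new, Omega_new
```

Iterating `em_step` until the observed log-likelihood stabilizes, from several random starts, is the usual practice since the objective is not concave.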

Remarks

- No gradient or Hessian calculations at all

- The simplex constraint on $\pi$ and the positive semidefinite constraint on $\Omega_j$ are automatically satisfied by the EM iterates

- The "membership variables" $w_{ij}^{(t)}$ can be computed in parallel. Suchard et al. (2010) observe a > 100-fold speedup when fitting massive mixture models (huge $n$) on a GPU

- It is a common theme that EM/MM type algorithms are amenable to massively parallel computing

- A similar EM applies to other component distributions, as long as their MLE is easy to compute

Other classical EM algorithms

I Hidden Markov model (Baum-Welch, or forward-backward algorithm)

I Factor analysis

I Variance component model

I Hyper-parameter estimation in empirical Bayes procedure. MAP (maximum a posteriori estimation)

I Missing data

- Grouped/censored/truncated models

I Robust regression (t-distribution and t-regression)

- ...

ECM

Expectation Conditional Maximization (Meng and Rubin, 1993)

I In some problems the M step is difficult (no analytic solution)

I Conditional maximization is easy (block ascent)

- partition the parameter vector into blocks $\theta = (\theta_1, \ldots, \theta_B)$

- alternately update $\theta_b$, $b = 1, \ldots, B$

I Ascent property still holds. Why?

- ECM may converge more slowly than EM (more iterations), but the total computer time may be shorter due to the ease of the CM steps

ECME

ECM Either (Liu and Rubin, 1994)

I Each CM step maximizes either the Q function or the original incomplete observed log-likelihood

I Ascent property still holds. Why?

- Faster convergence than ECM

AECM

Alternating ECM (Meng and van Dyk, 1997)

I The specification of the complete-data is allowed to be different on each CM-step

- Ascent property still holds. Why?

PX-EM and efficient data augmentation

- Parameter-EXpanded EM (PX-EM) (Liu et al., 1998; Liu and Wu, 1999)

I Efficient data augmentation (Meng and van Dyk, 1997)

- Idea: speed up the convergence of the EM algorithm by efficiently augmenting the observed data (introducing a working parameter in the specification of the complete data)

Example: t distribution

- $W \in \mathbb{R}^p$ follows a multivariate t distribution $t_p(\mu, \Sigma, \nu)$ if $W \mid u \sim \text{Normal}(\mu, \Sigma/u)$ and $u \sim \text{Gamma}(\frac{\nu}{2}, \frac{\nu}{2})$

- Widely used for robust modeling

- Gamma$(\alpha, \beta)$ has density

$$f(u \mid \alpha, \beta) = \frac{\beta^\alpha u^{\alpha-1}}{\Gamma(\alpha)} e^{-\beta u}, \quad u \ge 0,$$

with mean $E(U) = \alpha/\beta$ and $E(\ln U) = \psi(\alpha) - \ln \beta$

- The density of $W$ is

$$f(w \mid \mu, \Sigma, \nu) = \frac{\Gamma(\frac{\nu+p}{2}) |\Sigma|^{-1/2}}{(\pi\nu)^{p/2} \Gamma(\nu/2) [1 + \delta(w, \mu; \Sigma)/\nu]^{(\nu+p)/2}},$$

where $\delta(w, \mu; \Sigma) = (w - \mu)^T \Sigma^{-1} (w - \mu)$ is the Mahalanobis squared distance between $w$ and $\mu$
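For reference, a small sketch (my own helper, not from the slides) that evaluates this multivariate t log-density via a Cholesky factor of $\Sigma$:

```python
import numpy as np
from scipy.special import gammaln
from scipy.linalg import solve_triangular

def mvt_logpdf(w, mu, Sigma, nu):
    """Log-density of the multivariate t_p(mu, Sigma, nu), evaluated at the rows of w."""
    w = np.atleast_2d(w)
    p = w.shape[1]
    L = np.linalg.cholesky(Sigma)
    z = solve_triangular(L, (w - mu).T, lower=True)   # L z = (w - mu)^T
    delta = np.sum(z ** 2, axis=0)                    # Mahalanobis squared distances
    logdet = 2 * np.sum(np.log(np.diag(L)))           # log |Sigma|
    return (gammaln((nu + p) / 2) - gammaln(nu / 2)
            - 0.5 * (p * np.log(np.pi * nu) + logdet)
            - 0.5 * (nu + p) * np.log1p(delta / nu))
```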

- Given iid data $w_1, \ldots, w_n$, the log-likelihood is

$$\begin{aligned}
L(\mu, \Sigma, \nu) &= -\frac{np}{2} \ln(\pi\nu) + n\left[\ln \Gamma\left(\frac{\nu+p}{2}\right) - \ln \Gamma\left(\frac{\nu}{2}\right)\right] \\
&\quad - \frac{n}{2} \ln|\Sigma| + \frac{n}{2}(\nu + p) \ln \nu - \frac{1}{2}(\nu + p) \sum_{j=1}^n \ln[\nu + \delta(w_j, \mu; \Sigma)]
\end{aligned}$$

- How to compute the MLE $(\hat\mu, \hat\Sigma, \hat\nu)$?

Multivariate t MLE – EM

- $W_j \mid u_j$ independent $N(\mu, \Sigma/u_j)$ and $U_j$ iid Gamma$(\frac{\nu}{2}, \frac{\nu}{2})$

- Missing data: $z = (u_1, \ldots, u_n)^T$

- The log-likelihood of the complete data is

$$\begin{aligned}
L_c(\mu, \Sigma, \nu) &= -\frac{1}{2} np \ln(2\pi) - \frac{n}{2} \ln|\Sigma| - \frac{1}{2} \sum_{j=1}^n u_j (w_j - \mu)^T \Sigma^{-1} (w_j - \mu) \\
&\quad - n \ln \Gamma\left(\frac{\nu}{2}\right) + \frac{n\nu}{2} \ln\left(\frac{\nu}{2}\right) + \frac{\nu}{2} \sum_{j=1}^n (\ln u_j - u_j) - \sum_{j=1}^n \ln u_j
\end{aligned}$$

E step

- Since the gamma is the conjugate prior for $U$, the conditional distribution of $U$ given $W = w$ is Gamma$((\nu + p)/2, (\nu + \delta(w, \mu; \Sigma))/2)$. Thus

$$\begin{aligned}
E(U_j \mid w_j, \mu^{(t)}, \Sigma^{(t)}, \nu^{(t)}) &= \frac{\nu^{(t)} + p}{\nu^{(t)} + \delta(w_j, \mu^{(t)}; \Sigma^{(t)})} =: u_j^{(t)} \\
E(\ln U_j \mid w_j, \mu^{(t)}, \Sigma^{(t)}, \nu^{(t)}) &= \ln u_j^{(t)} + \psi\left(\frac{\nu^{(t)} + p}{2}\right) - \ln\left(\frac{\nu^{(t)} + p}{2}\right)
\end{aligned}$$

- The overall Q function (up to an additive constant) takes the form

$$\begin{aligned}
&-\frac{n}{2} \ln|\Sigma| - \frac{1}{2} \sum_{j=1}^n u_j^{(t)} (w_j - \mu)^T \Sigma^{-1} (w_j - \mu) - n \ln \Gamma\left(\frac{\nu}{2}\right) + \frac{n\nu}{2} \ln\left(\frac{\nu}{2}\right) \\
&\quad + \frac{n\nu}{2} \left\{\frac{1}{n} \sum_{j=1}^n (\ln u_j^{(t)} - u_j^{(t)}) + \psi\left(\frac{\nu^{(t)} + p}{2}\right) - \ln\left(\frac{\nu^{(t)} + p}{2}\right)\right\}
\end{aligned}$$

M step

- Maximization over $(\mu, \Sigma)$ is simply a weighted multivariate normal problem:

$$\mu^{(t+1)} = \frac{\sum_{j=1}^n u_j^{(t)} w_j}{\sum_{j=1}^n u_j^{(t)}}, \qquad
\Sigma^{(t+1)} = \frac{1}{n} \sum_{j=1}^n u_j^{(t)} (w_j - \mu^{(t+1)})(w_j - \mu^{(t+1)})^T$$

Notice that the down-weighting of outliers is obvious in the update

- Maximization over $\nu$ is a univariate problem – equate the derivative to 0 and find the root
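A sketch of that univariate $\nu$ update via root finding. The stationarity condition below is what I obtain by differentiating the Q function above with respect to $\nu$, so treat the exact expression (and the bracketing interval) as assumptions to verify:

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

def update_nu(u, nu_old, p):
    """M-step update of nu given the E-step weights u_j = E[U_j | w_j, theta^(t)].

    Solves dQ/dnu = 0, i.e. -digamma(nu/2) + log(nu/2) + c = 0, by Brent's
    method; the bracket (1e-3, 1e3) is an assumption and brentq raises if
    no root lies inside it.
    """
    c = (np.mean(np.log(u) - u) + 1.0
         + digamma((nu_old + p) / 2.0) - np.log((nu_old + p) / 2.0))
    f = lambda nu: -digamma(nu / 2.0) + np.log(nu / 2.0) + c
    return brentq(f, 1e-3, 1e3)
```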

Multivariate t MLE – ECM and ECME

Partition the parameter $(\mu, \Sigma, \nu)$ into two blocks $(\mu, \Sigma)$ and $\nu$

I ECM = EM for this example. Why?

- ECME: in the second CM step, maximize over $\nu$ in terms of the original log-likelihood function instead of the Q function. They have similar difficulty, since both are univariate optimization problems!

An example from Liu and Rubin (1994): $n = 79$, $p = 2$, with missing entries. EM = ECM: solid line. ECME: dashed line.

Multivariate t MLE – efficient data augmentation

Assume $\nu$ is known

- Write $W_j = \mu + C_j / U_j^{1/2}$, where $C_j$ is $N(0, \Sigma)$ independent of $U_j$, which is Gamma$(\frac{\nu}{2}, \frac{\nu}{2})$

- Equivalently, $W_j = \mu + |\Sigma|^{-a/2} C_j / U_j(a)^{1/2}$, where $U_j(a) = |\Sigma|^{-a} U_j$

I Then the complete data is (w, z(a)), a the working parameter

I a = 0 corresponds to the vanilla EM

I Meng and van Dyk (1997) recommend using aopt = 1/(ν + p) to maximize the convergence rate

I Exercise: work out the EM update for this special case

- The only change to the vanilla EM is to replace the denominator $n$ in the update of $\Sigma$ by $\sum_{j=1}^n u_j^{(t)}$

- PX-EM (Liu et al., 1998) leads to the same update

Assume ν unknown

- Version 1: $a = a_{\text{opt}}$ in updating both $(\mu, \Sigma)$ and $\nu$

- Version 2: $a = a_{\text{opt}}$ for updating $(\mu, \Sigma)$, and take the observed data as the complete data for updating $\nu$

Conclusion in Meng and van Dyk (1997): Version 1 is 8–12 times faster than EM = ECM or ECME. Version 2 is only slightly more efficient than Version 1

MCEM

Monte Carlo EM (Wei and Tanner, 1990)

I Hard to calculate Q function? Simulate!

$$Q(\theta \mid \theta^{(t)}) \approx \frac{1}{m} \sum_{j=1}^m \ln f(y, z_j \mid \theta),$$

where the $z_j$ are iid draws from the conditional distribution of the missing data given $y$ and the previous iterate $\theta^{(t)}$

I Ascent property may be lost due to Monte Carlo errors

- Examples: the capture-recapture model (Robert and Casella, 2004, p. 184) and generalized linear mixed models (Booth and Hobert, 1999)

SEM and SAEM

Stochastic EM (SEM) (Celeux and Diebolt, 1985)

- Same as MCEM with $m = 1$: a single draw of the missing data $z$ from the conditional distribution

- $\theta^{(t)}$ forms a Markov chain which converges to a stationary distribution. There is no definite relation between this stationary distribution and the MLE

I In some specific cases, it can be shown that the stationary distribution concentrates around the MLE with a variance inversely proportional to the sample size

- Advantage: can escape the attraction of inferior modes/saddle points in some mixture model problems

Simulated Annealing EM (SAEM) (Celeux and Diebolt, 1989)

- Increase $m$ with the iterations, ending up with an EM algorithm

DA

Data Augmentation (DA) algorithm (Tanner and Wong, 1987)

I Aim for sampling from p(θ|y) instead of maximization

I Idea: incomplete data posterior density is complicated, but the complete-data posterior density is relatively easy to sample

- The data augmentation algorithm:
  - draw $z^{(t+1)}$ conditional on $(\theta^{(t)}, y)$
  - draw $\theta^{(t+1)}$ conditional on $(z^{(t+1)}, y)$

- A special case of the Gibbs sampler

- $\theta^{(t)}$ converges to the distribution $p(\theta|y)$ under general conditions

- The ergodic mean converges to the posterior mean $E(\theta|y)$, which may perform better than the MLE in finite samples

EM as a maximization-maximization procedure

Neal and Hinton (1999)

I Consider the objective function

$$F(\theta, q) = E_q[\ln f(Z, y \mid \theta)] + H(q)$$

over $\Theta \times \mathcal{Q}$, where $\mathcal{Q}$ is the set of all conditional pdfs of the missing data, $\{q(z) = p(z \mid y, \theta), \theta \in \Theta\}$, and $H(q) = -E_q \ln q$ is the entropy of $q$

- EM is essentially performing coordinate ascent to maximize $F$

- E step: at the current iterate $\theta^{(t)}$,

$$F(\theta^{(t)}, q) = E_q[\ln p(Z \mid y, \theta^{(t)})] - E_q \ln q + \ln g(y \mid \theta^{(t)}).$$

The maximizing conditional pdf is $q(z) = f(z \mid y, \theta^{(t)})$

- M step: substitute $q(z) = f(z \mid y, \theta^{(t)})$ into $F$ and maximize over $\theta$

Hastie et al. (2009, Section 8.5.3)

Incremental EM

- Assume iid observations. Then the objective function is

$$F(\theta, q) = \sum_i \left[E_{q_i} \ln p(Z_i, y_i \mid \theta) + H(q_i)\right],$$

where we search for $q$ of the factored form $q(z) = \prod_i q_i(z)$

- Maximizing $F$ over $q$ is equivalent to maximizing the contribution of each data point with respect to $q_i$

I Update θ by visiting data items sequentially rather than from a global E step

I Finite mixture example: for each data point i

- evaluate the membership variables $w_{ij}$, $j = 1, \ldots, k$

- update the parameter values $(\pi^{(t+1)}, \mu_j^{(t+1)}, \Sigma_j^{(t+1)})$ (only need to keep sufficient statistics!)

Faster convergence than vanilla EM in some examples

Outline

Overview Problem Setup Review of Newton’s method

EM algorithm Introduction Canonical Example - Finite Mixture Model In the Zoo of EM

MM algorithm Introduction MM examples

Acceleration of EM/MM SQUAREM Quasi-Newton acceleration

Our friend ...

[Figure: the same picture as before – the log-likelihood ln g(θ) with its minorizing surrogates Q(θ|θ^(t)) + c^(t) and Q(θ|θ^(t+1)) + c^(t+1).]

Our other friend ... 2D function

EM as a minorization-maximization (MM) algorithm

Refer to the picture on the previous slide

- The Q function constitutes a minorizing function of the objective function up to an additive constant:

$$L(\theta) \ge Q(\theta|\theta^{(t)}) + c^{(t)} \text{ for all } \theta, \qquad L(\theta^{(t)}) = Q(\theta^{(t)}|\theta^{(t)}) + c^{(t)}$$

- Maximizing the Q function generates an ascent iterate $\theta^{(t+1)}$

Questions:

- Is there any other way to produce such a surrogate function?

- Can we flip the picture and apply the same principle to minimization problems?

This generalization leads to a powerful tool – the MM principle (Lange et al., 2000)

General reference book on MM

A powerful tool - MM algorithm

I A prescription for constructing optimization algorithms

I Majorize/Minimize or Minorize/Maximize

I Creating a surrogate function that minorizes/majorizes the objective function

I Optimizing the surrogate function drives the objective function uphill or downhill as needed

Famous special cases in statistics: EM algorithm (Dempster et al., 1977), multidimensional scaling (de Leeuw and Heiser, 1977)

Majorization and definition of the algorithm

- A function $g(\theta|\theta_n)$ is said to majorize the function $f(\theta)$ at $\theta_n$ provided

$$f(\theta_n) = g(\theta_n|\theta_n), \qquad f(\theta) \le g(\theta|\theta_n) \text{ for all } \theta.$$

- So we choose a "nice/separable" majorizing function $g(\theta|\theta_n)$ and minimize it. This produces the next point $\theta_{n+1}$ in the algorithm.

I The descent property follows from

$$f(\theta_{n+1}) \le g(\theta_{n+1}|\theta_n) \le g(\theta_n|\theta_n) = f(\theta_n).$$

This makes MM algorithms very stable.

Rationale for the MM principle

This makes MM algorithms very stable. EM and MM Algorithms MM algorithm Introduction Rationale for the MM principle

I Free the derivation from missing data structure

I Avoid matrix inversion

I Linearize an optimization problem

I Deal gracefully with certain equality and inequality constraints

I Turn a non-differentiable problem into a smooth problem

- Separate the parameters of a problem (perfect for massive, fine-scale parallelization)

Generic methods of majorization and minorization

I Jensen’s inequality – EM algorithms

- The Cauchy-Schwarz inequality - multidimensional scaling

I Supporting hyperplane property of a convex function

I Arithmetic-geometric mean inequality

- Quadratic upper bound principle - Böhning and Lindsay

- ...

A great source of inequalities: The Cauchy-Schwarz Master Class by Michael Steele.

Positron Emission Tomography (PET)

I Data: tube readings y = (y1,..., yd )

- Estimate: photon emission intensities (pixels) $\lambda = (\lambda_1, \ldots, \lambda_p)$

- Poisson model: $Y_i \sim \text{Poisson}(\sum_{j=1}^p c_{ij} \lambda_j)$, where $c_{ij}$ is the conditional probability that a photon emitted by the $j$-th pixel is detected by the $i$-th tube

- Log-likelihood:

$$L(\lambda|y) = \sum_i \left[y_i \ln\left(\sum_j c_{ij} \lambda_j\right) - \sum_j c_{ij} \lambda_j\right] + \text{const.}$$

Essentially a Poisson regression with the constraint $\lambda_j \ge 0$

- Regularized log-likelihood for a smoother image:

$$L(\lambda|y) - \frac{\mu}{2} \sum_{\{j,k\} \in \mathcal{N}} (\lambda_j - \lambda_k)^2$$

- The EM algorithm does not apply now: there is no likelihood model for the regularization term

Minorization step

- By concavity of the $\ln s$ function (Jensen's inequality),

$$\ln\left(\sum_j c_{ij}\lambda_j\right) \ge \sum_j \frac{c_{ij}\lambda_j^{(t)}}{\sum_{j'} c_{ij'}\lambda_{j'}^{(t)}} \ln\left(\frac{\sum_{j'} c_{ij'}\lambda_{j'}^{(t)}}{c_{ij}\lambda_j^{(t)}}\, c_{ij}\lambda_j\right) = \sum_j \frac{c_{ij}\lambda_j^{(t)}}{\sum_{j'} c_{ij'}\lambda_{j'}^{(t)}} \ln \lambda_j + c_i^{(t)}$$

- By concavity of the $-s^2$ function,

$$-(\lambda_j - \lambda_k)^2 \ge -\frac{1}{2}(2\lambda_j - \lambda_j^{(t)} - \lambda_k^{(t)})^2 - \frac{1}{2}(2\lambda_k - \lambda_j^{(t)} - \lambda_k^{(t)})^2$$

- Combining minorizing terms gives an overall surrogate function

$$g(\lambda|\lambda^{(t)}) = \sum_i \sum_j y_i \frac{c_{ij}\lambda_j^{(t)}}{\sum_{j'} c_{ij'}\lambda_{j'}^{(t)}} \ln \lambda_j - \sum_i \sum_j c_{ij}\lambda_j - \frac{\mu}{4} \sum_{\{j,k\}\in\mathcal{N}} \left[(2\lambda_j - \lambda_j^{(t)} - \lambda_k^{(t)})^2 + (2\lambda_k - \lambda_j^{(t)} - \lambda_k^{(t)})^2\right]$$

Maximization step

- $g(\lambda|\lambda^{(t)})$ is trivial to maximize because the $\lambda_j$ are separated!

- Solving for the root of

$$\frac{\partial}{\partial \lambda_j} g(\lambda|\lambda^{(t)}) = \sum_i y_i \frac{c_{ij}\lambda_j^{(t)}}{\sum_{j'} c_{ij'}\lambda_{j'}^{(t)}} \lambda_j^{-1} - \sum_i c_{ij} - \mu \sum_{k \in \mathcal{N}_j} (2\lambda_j - \lambda_j^{(t)} - \lambda_k^{(t)}) = 0$$

gives $\lambda_j^{(t+1)}$

- MM algorithm for PET:

Initialize: $\lambda_j^{(0)} = 1$
repeat
    $z_{ij}^{(t)} = (y_i c_{ij} \lambda_j^{(t)}) / (\sum_k c_{ik} \lambda_k^{(t)})$
    for $j = 1$ to $p$ do
        $a = -2\mu |\mathcal{N}_j|$,  $b = \mu (|\mathcal{N}_j| \lambda_j^{(t)} + \sum_{k \in \mathcal{N}_j} \lambda_k^{(t)}) - 1$,  $c = \sum_i z_{ij}^{(t)}$
        $\lambda_j^{(t+1)} = (-b - \sqrt{b^2 - 4ac})/(2a)$
    end for
until convergence occurs
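A Python/NumPy sketch of the same iteration (my own translation of the pseudocode; the dense matrix `C` of detection probabilities, the 0/1 adjacency matrix `A` encoding the neighborhoods $\mathcal{N}_j$, and the assumption that each column of `C` sums to 1 — which is where the $-1$ in $b$ comes from — are modeling choices, not given in the slides):

```python
import numpy as np

def pet_mm(y, C, A, mu, maxiter=1000, tol=1e-8):
    """MM algorithm for the regularized PET problem (sketch).

    y: (d,) tube counts; C: (d, p) detection probabilities, assumed to have
    columns summing to 1; A: (p, p) 0/1 symmetric adjacency matrix of the
    pixel neighborhood graph; mu: penalty parameter.
    """
    d, p = C.shape
    lam = np.ones(p)                                          # lambda_j^(0) = 1
    nbrs = A.sum(axis=1)                                      # |N_j|
    for _ in range(maxiter):
        denom = C @ lam                                       # sum_k c_ik lambda_k
        zsum = ((y / denom)[:, None] * C * lam).sum(axis=0)   # sum_i z_ij
        if mu == 0:
            lam_new = zsum                                    # unpenalized case: a = 0
        else:
            a = -2.0 * mu * nbrs
            b = mu * (nbrs * lam + A @ lam) - 1.0
            c = zsum
            lam_new = (-b - np.sqrt(b ** 2 - 4 * a * c)) / (2 * a)
        if np.max(np.abs(lam_new - lam)) < tol:
            return lam_new
        lam = lam_new
    return lam
```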

- The parameter constraints $\lambda_j \ge 0$ are satisfied when starting from positive initial values

- The loop for updating pixels can be carried out independently – massive parallelism

A simulation example with $n = 2016$ and $p = 4096$ (provided by Ravi Varadhan)

CPU: i7 @ 3.20GHz (1 thread), implemented using BLAS in the GNU Scientific Library (GSL)
GPU: NVIDIA GeForce GTX 580, implemented using cuBLAS

                      CPU                                   GPU
Penalty µ    Iters     Time    Function          Iters    Time    Function        Speedup
0           100000    11250    -7337.152765     100000     140    -7337.153387       80
10^-7        24506     2573    -8500.082605      24506      35    -8508.112249       74
10^-6         6294      710    -15432.45496       6294       9    -15432.45586       79
10^-5          589       67    -55767.32966        589     0.8    -55767.32970       84

Multivariate t MLE (revisited)

Problem

- Given iid data $w_1, \ldots, w_n$, the log-likelihood is

$$\begin{aligned}
L(\mu, \Sigma, \nu) &= -\frac{np}{2} \ln(\pi\nu) + n\left[\ln \Gamma\left(\frac{\nu+p}{2}\right) - \ln \Gamma\left(\frac{\nu}{2}\right)\right] \\
&\quad - \frac{n}{2} \ln|\Sigma| + \frac{n}{2}(\nu + p) \ln \nu - \frac{1}{2}(\nu + p) \sum_{j=1}^n \ln[\nu + \delta(w_j, \mu; \Sigma)],
\end{aligned}$$

where $\delta(w_j, \mu; \Sigma) = (w_j - \mu)^T \Sigma^{-1} (w_j - \mu)$

- Want to find the MLE $(\hat\mu, \hat\Sigma, \hat\nu)$

Minorization step

- Apply the supporting hyperplane inequality:

$$-\ln[\nu + \delta(w_j, \mu; \Sigma)] \ge -\frac{\nu + \delta(w_j, \mu; \Sigma)}{\nu^{(t)} + \delta(w_j, \mu^{(t)}; \Sigma^{(t)})} + 1 - \ln[\nu^{(t)} + \delta(w_j, \mu^{(t)}; \Sigma^{(t)})]$$

- Summing over observations gives an overall minorizing function

$$\begin{aligned}
g(\mu, \Sigma, \nu \mid \mu^{(t)}, \Sigma^{(t)}, \nu^{(t)}) &= -\frac{np}{2} \ln(\pi\nu) + n\left[\ln \Gamma\left(\frac{\nu+p}{2}\right) - \ln \Gamma\left(\frac{\nu}{2}\right)\right] \\
&\quad - \frac{n}{2} \ln|\Sigma| + \frac{n}{2}(\nu + p) \ln \nu - \frac{\nu + p}{2} \sum_{j=1}^n \frac{\nu + \delta(w_j, \mu; \Sigma)}{\nu^{(t)} + \delta(w_j, \mu^{(t)}; \Sigma^{(t)})} \\
&\quad + \frac{\nu + p}{2} \left\{n - \sum_{j=1}^n \ln[\nu^{(t)} + \delta(w_j, \mu^{(t)}; \Sigma^{(t)})]\right\}
\end{aligned}$$

Maximization step

Block ascent for maximizing the minorizing function:

- Given $\nu^{(t)}$, update $(\mu, \Sigma)$ by

$$\max_{\mu, \Sigma} \; -\frac{n}{2} \ln|\Sigma| - \frac{1}{2} \sum_{j=1}^n \frac{\nu^{(t)} + p}{\nu^{(t)} + \delta(w_j, \mu^{(t)}; \Sigma^{(t)})} \, \delta(w_j, \mu; \Sigma),$$

a weighted multivariate normal problem – the same update as in the vanilla EM (= ECM)

- Given $(\mu^{(t+1)}, \Sigma^{(t+1)})$, the update of $\nu$ is easy since it is a univariate optimization problem. This gives a different update from the vanilla EM (= ECM)

I Above derivation gives an MCM (minorization conditional maximization) algorithm, which turns out to be different from ECM

- Wait a minute, why don't we maximize the original objective function directly when updating $\nu$? MCM Either!

- MCME = ECME. Why?

Yet another MM for t ...

Assume $\nu$ is known

- Notice

$$-\frac{1}{2} \ln|\Sigma| - \frac{\nu + p}{2} \ln[\nu + \delta(w_j, \mu; \Sigma)] = -\frac{\nu + p}{2} \ln\{|\Sigma|^a [\nu + \delta(w_j, \mu; \Sigma)]\},$$

where $a = 1/(\nu + p)$ is the working parameter

- Minorize via

$$-\ln\{|\Sigma|^a [\nu + \delta(w_j, \mu; \Sigma)]\} \ge -\frac{|\Sigma|^a [\nu + \delta(w_j, \mu; \Sigma)]}{|\Sigma^{(t)}|^a [\nu + \delta(w_j, \mu^{(t)}; \Sigma^{(t)})]} + 1 - \ln\{|\Sigma^{(t)}|^a [\nu + \delta(w_j, \mu^{(t)}; \Sigma^{(t)})]\}$$

- Now working out the maximization step (do it!) yields the efficient data augmentation algorithm of Meng and van Dyk (1997)

Non-negative matrix factorization (NNMF)

Lee and Seung (1999, 2001)

- Goal: find a low rank $r$ factorization $V_{m \times r} W_{r \times n}$ of a non-negative data matrix $X_{m \times n}$ such that

$$\|X - VW\|_F$$

is minimized subject to the constraints $v_{ij}, w_{jk} \ge 0$

- Resembles the SVD except for the non-negativity constraints

I A popular alternative to PCA and vector quantization

- Used for data compression, clustering, ...

Image Factorization

I MIT CBCL Face Images: 2429 faces, each 19 × 19 = 361 pixels

- A striking feature of the NNMF basis images is that they correspond to different parts of the face (eyes, nose, ...)

Approximation based on a rank-49 NNMF

MM for NNMF

- Objective:

$$\|X - VW\|_F^2 = \sum_i \sum_j \left(x_{ij} - \sum_k v_{ik} w_{kj}\right)^2$$

- Majorization: by convexity of the $s^2$ function,

$$\left(x_{ij} - \sum_k v_{ik} w_{kj}\right)^2 \le \sum_k \frac{v_{ik}^{(t)} w_{kj}^{(t)}}{\sum_l v_{il}^{(t)} w_{lj}^{(t)}} \left(x_{ij} - \frac{\sum_l v_{il}^{(t)} w_{lj}^{(t)}}{v_{ik}^{(t)} w_{kj}^{(t)}}\, v_{ik} w_{kj}\right)^2$$

- Minimization: alternate updates of $V$ and $W$

MM Algorithm for NNMF

Initialize: $V = \text{rand}(m, r)$ and $W = \text{rand}(r, n)$.
repeat
    $v_{ik} \leftarrow v_{ik} \, (XW^T)_{ik} / (VWW^T)_{ik}$ for all $m \times r$ elements
    $w_{kj} \leftarrow w_{kj} \, (V^T X)_{kj} / (V^T V W)_{kj}$ for all $r \times n$ elements
until convergence
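A direct NumPy sketch of these multiplicative updates (the small epsilon guard against division by zero and the stopping rule are my choices):

```python
import numpy as np

def nnmf(X, r, maxiter=500, tol=1e-6, seed=0):
    """Multiplicative MM updates for least-squares NNMF: X ≈ V @ W."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    V = rng.random((m, r))
    W = rng.random((r, n))
    eps = 1e-12                                  # guard against division by zero
    obj_old = np.inf
    for _ in range(maxiter):
        V *= (X @ W.T) / (V @ W @ W.T + eps)     # elementwise update of V
        W *= (V.T @ X) / (V.T @ V @ W + eps)     # elementwise update of W
        obj = np.linalg.norm(X - V @ W, "fro")
        if obj_old - obj < tol * obj_old:
            break
        obj_old = obj
    return V, W
```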

I Updates are extremely simple

I Non-negativity constraints are satisfied

I Utilize high throughput of BLAS 3 routines, either on CPU or on GPU

- Lasso or other sparsity penalties can be incorporated

Exercise

- In the original paper, Lee and Seung (1999) pose a Poisson model for the image pixels, $x_{ij} \sim \text{Poisson}(\sum_k v_{ik} w_{kj})$

- The log-likelihood is

$$L(V, W) = \sum_i \sum_j \left[x_{ij} \ln\left(\sum_k v_{ik} w_{kj}\right) - \sum_k v_{ik} w_{kj}\right]$$

I MLE under the Poisson model is amenable to another MM algorithm

- Derive it

Netflix movie rating data and matrix completion

- Snapshot of the kind of data collected by Netflix. Only 100,480,507 ratings (1.2% of the entries of the matrix) are observed

- Goal: impute the unobserved ratings for personalized recommendation

I Observe a very sparse matrix Y = (yij ). Want to impute all the missing entries. It is possible when the matrix is structured and of low rank

I Let Ω = {(i, j): observed entries} collect the observed entries and PΩ(M) denote the projection of matrix M to Ω. The problem

$$\min_{\text{rank}(X) \le r} \; \frac{1}{2} \|P_\Omega(Y) - P_\Omega(X)\|_F^2 = \frac{1}{2} \sum_{(i,j) \in \Omega} (y_{ij} - x_{ij})^2$$

unfortunately is non-convex and difficult

I Convex relaxation (Mazumder et al., 2010)

$$\min_X \; f(X) = \frac{1}{2} \|P_\Omega(Y) - P_\Omega(X)\|_F^2 + \lambda \|X\|_*,$$

where $\|X\|_* = \|\sigma(X)\|_1 = \sum_i \sigma_i(X)$ is the nuclear norm

Majorization step

A majorizing function:

$$\begin{aligned}
f(X) &= \frac{1}{2} \sum_{(i,j) \in \Omega} (y_{ij} - x_{ij})^2 + \frac{1}{2} \sum_{(i,j) \notin \Omega} 0 + \lambda \|X\|_* \\
&\le \frac{1}{2} \sum_{(i,j) \in \Omega} (y_{ij} - x_{ij})^2 + \frac{1}{2} \sum_{(i,j) \notin \Omega} (x_{ij} - x_{ij}^{(t)})^2 + \lambda \|X\|_* \\
&= \frac{1}{2} \|X - Z^{(t)}\|_F^2 + \lambda \|X\|_* = g(X|X^{(t)}),
\end{aligned}$$

where $Z^{(t)} = P_\Omega(Y) + P_{\Omega^\perp}(X^{(t)})$

Minimization step

- Majorizing function:

$$g(X|X^{(t)}) = \frac{1}{2}\|X\|_F^2 - \operatorname{tr}(X^T Z^{(t)}) + \frac{1}{2}\|Z^{(t)}\|_F^2 + \lambda\|X\|_*$$

- Observe

$$\|X\|_F^2 = \|\sigma(X)\|_2^2 = \sum_i \sigma_i^2, \qquad \|Z^{(t)}\|_F^2 = \|\sigma(Z^{(t)})\|_2^2 = \sum_i \omega_i^2,$$

and by Fan–von Neumann's inequality

$$\operatorname{tr}(X^T Z^{(t)}) \le \sum_i \sigma_i \omega_i,$$

with equality achieved if and only if the left and right singular vectors of the two matrices coincide

- Thus we can assume $X$ has the same singular vectors as $Z^{(t)}$, and

$$g(X|X^{(t)}) = \frac{1}{2}\sum_i \sigma_i^2 - \sum_i \sigma_i\omega_i + \frac{1}{2}\sum_i \omega_i^2 + \lambda\sum_i \sigma_i = \frac{1}{2}\sum_i (\sigma_i - \omega_i)^2 + \lambda\sum_i \sigma_i,$$

with minimizer given by $\sigma_i^{(t+1)} = (\omega_i - \lambda)_+$

MM algorithm for matrix completion

Initialize $X^{(0)} \in \mathbb{R}^{m \times n}$
repeat
    $Z^{(t+1)} \leftarrow P_\Omega(Y) + P_{\Omega^\perp}(X^{(t)})$
    Compute the SVD $Z^{(t+1)} = U \operatorname{diag}(w) V^T$
    $X^{(t+1)} \leftarrow U \operatorname{diag}[(w - \lambda)_+] V^T$
until the objective value converges
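A NumPy sketch of this iteration using a full SVD (fine for small matrices; the convergence test is my choice):

```python
import numpy as np

def soft_impute(Y, mask, lam, maxiter=200, tol=1e-6):
    """MM (soft-impute) iteration for nuclear-norm regularized matrix completion.

    Y: (m, n) with arbitrary values at unobserved entries; mask: boolean array
    of observed entries; lam: regularization parameter lambda.
    """
    X = np.zeros_like(Y, dtype=float)
    for _ in range(maxiter):
        Z = np.where(mask, Y, X)                  # Z = P_Omega(Y) + P_Omega^perp(X)
        U, w, Vt = np.linalg.svd(Z, full_matrices=False)
        w_thr = np.maximum(w - lam, 0.0)          # soft-threshold the singular values
        X_new = (U * w_thr) @ Vt
        if np.linalg.norm(X_new - X) < tol * max(1.0, np.linalg.norm(X)):
            return X_new
        X = X_new
    return X
```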

- The "Golub-Kahan-Reinsch" algorithm takes $4m^2n + 8mn^2 + 9n^3$ flops for an $m \ge n$ matrix and is not going to work for the 480K-by-18K Netflix matrix. Notice only the top singular values are needed. Lanczos/Arnoldi algorithms are the way to go
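SciPy exposes a Lanczos-type truncated SVD; a sketch of the thresholding step using it (choosing how many singular values `k` to compute is left to the user and is an assumption here):

```python
import numpy as np
from scipy.sparse.linalg import svds

def top_svd_threshold(Z, lam, k):
    """Rank-k truncated SVD of Z followed by soft-thresholding at lam."""
    U, w, Vt = svds(Z, k=k)            # k < min(Z.shape); works on dense or sparse Z
    w_thr = np.maximum(w - lam, 0.0)
    return (U * w_thr) @ Vt
```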

Outline

Overview Problem Setup Review of Newton’s method

EM algorithm Introduction Canonical Example - Finite Mixture Model In the Zoo of EM

MM algorithm Introduction MM examples

Acceleration of EM/MM SQUAREM Quasi-Newton acceleration

Acceleration

I Slow convergence of EM/MM (linear rate at best) is a serious concern in many problems

I Some proposed acceleration methods

I Aitken’s and Louis’ methods (Louis, 1982)

I Conjugate gradient method (Jamshidian and Jennrich, 1993)

I Quasi-Newton acceleration (Lange, 1995; Jamshidian and Jennrich, 1997)

I Ikeda acceleration (Ikeda, 2000)

- We review two recent ones which are particularly attractive in the high-dimensional setting

SQUAREM

Varadhan and Roland (2008)

- $F: \mathbb{R}^p \mapsto \mathbb{R}^p$ is the algorithm mapping: $x \to F(x) \to F \circ F(x) \to F \circ F \circ F(x) \to \cdots$

I SQUAREM extrapolation

$$x^{(t+1)} = x^{(t)} - 2s[F(x^{(t)}) - x^{(t)}] + s^2[F \circ F(x^{(t)}) - 2F(x^{(t)}) + x^{(t)}] = x^{(t)} - 2su + s^2(v - u),$$

where $s$ is a scalar, $u = F(x^{(t)}) - x^{(t)}$, and $v = F \circ F(x^{(t)}) - F(x^{(t)})$

- SqS1: $s = \dfrac{u^T u}{u^T (v - u)}$

- SqS2: $s = \dfrac{u^T (v - u)}{(v - u)^T (v - u)}$

- SqS3: $s = -\sqrt{\dfrac{u^T u}{(v - u)^T (v - u)}}$
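A minimal sketch of one SqS3 step for a generic fixed-point map `F` (a 1-D parameter vector `x` is assumed), following Varadhan and Roland (2008):

```python
import numpy as np

def squarem_step(F, x):
    """One SQUAREM (SqS3) extrapolation step for a fixed-point map F."""
    Fx = F(x)
    F2x = F(Fx)
    r = Fx - x                              # u
    q = F2x - 2 * Fx + x                    # v - u
    denom = np.sqrt(q @ q)
    if denom == 0:                          # already at a fixed point
        return F2x
    s = -np.sqrt(r @ r) / denom             # SqS3 step length
    return x - 2 * s * r + s ** 2 * q
```

In practice one also falls back to the plain EM/MM update whenever the extrapolated point degrades the objective, as noted below.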

I Extremely simple and effective

- The ascent property may be lost by the extrapolation, but we can always revert back to the EM/MM update as a safeguard

An example with 2 parameters

Quasi-Newton acceleration

Zhou et al. (2011)

- $F: \mathbb{R}^p \mapsto \mathbb{R}^p$ is the algorithm mapping: $x \to F(x) \to F \circ F(x) \to F \circ F \circ F(x) \to \cdots$

I Fixed points of F ⇔ roots of G(x) = x − F(x)

I Newton’s method for finding the roots of the mapping G(x) = x − F(x)

$$x^{(t+1)} = x^{(t)} - [I - dF(x^{(t)})]^{-1}[x^{(t)} - F(x^{(t)})]$$

- Idea: approximate $dF(x^{(t)})$ by a low-rank matrix $M$ and explicitly form the inverse $(I - M)^{-1}$

- Secant condition: $u = F(x^{(t)}) - x^{(t)}$, $v = F \circ F(x^{(t)}) - F(x^{(t)})$,

Mu = v,

where M = dF(x (t))

- For best results, we collect $q$ secant pairs $u_i, v_i$, $i = 1, \ldots, q$, and require that

$$M u_i = v_i, \quad i = 1, \ldots, q, \quad \text{or} \quad MU = V.$$

Lemma. The minimum of the strictly convex function $\|M\|_F^2$ subject to the constraint $MU = V$ is attained by the choice $M = V(U^T U)^{-1} U^T$.

I Fortunately, we have the explicit inverse by Woodbury formula

$$(I - M)^{-1} = [I - V(U^T U)^{-1} U^T]^{-1} = I + V[U^T(U - V)]^{-1} U^T$$

I Quasi-Newton update

$$x^{(t+1)} = x^{(t)} - (I - M)^{-1}[x^{(t)} - F(x^{(t)})] = x^{(t)} - [I - V(U^T U)^{-1} U^T]^{-1}[x^{(t)} - F(x^{(t)})] = F(x^{(t)}) - V[U^T(U - V)]^{-1} U^T [x^{(t)} - F(x^{(t)})]$$
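A sketch of the accelerated update with $q$ secant pairs; here the pairs are generated by simply running the algorithm map forward from the current point, which is one possible bookkeeping scheme (Zhou et al. (2011) reuse pairs from the most recent iterations), so treat it as illustrative:

```python
import numpy as np

def qn_step(F, x, q=2):
    """One quasi-Newton accelerated step using q secant pairs (sketch).

    x_new = F(x) - V [U^T (U - V)]^{-1} U^T [x - F(x)], with the columns of
    U and V built from consecutive applications of the algorithm map F.
    """
    xs = [np.asarray(x, dtype=float)]
    for _ in range(q + 1):
        xs.append(F(xs[-1]))                                        # x, F(x), F∘F(x), ...
    U = np.column_stack([xs[i + 1] - xs[i] for i in range(q)])      # u_i = F(x_i) - x_i
    V = np.column_stack([xs[i + 2] - xs[i + 1] for i in range(q)])  # v_i = F(F(x_i)) - F(x_i)
    Fx = xs[1]
    return Fx - V @ np.linalg.solve(U.T @ (U - V), U.T @ (xs[0] - Fx))
```

If the accelerated point turns out worse, one can simply fall back to the plain $F(x)$ update.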

Advantages of the new quasi-Newton acceleration scheme

$$x^{(t+1)} = F(x^{(t)}) - V[U^T(U - V)]^{-1} U^T [x^{(t)} - F(x^{(t)})]$$

- The effort per iteration is light: $O(pq^2) + O(q^3)$

I Linear constraints are preserved

- No need to store the whole approximate (inverse) Hessian matrix: $O(p^2)$ vs $O(pq)$

$p$ = # parameters, $q$ = # secant conditions

An example with 2771 parameters

References

Baum, L. E., Petrie, T., Soules, G., and Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Statist., 41:164–171. Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Ser. B, 57:289–300. Booth, J. G. and Hobert, J. P. (1999). Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm. Journal of the Royal Statistical Society. Series B (Methodology), 61(1):265–285. Celeux, G. and Diebolt, J. (1985). The SEM algorithm: A probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Computational Statistics Quarterly, 2:73–82. Celeux, G. and Diebolt, J. (1989). Une version de type recuit simule´ de l’algorithme EM. Rapport de Recherche INRIA. Cooley, J. W. and Tukey, J. W. (1965). An algorithm for the machine calculation of complex Fourier series. Math. Comp., 19:297–301. Cox, D. R. (1972). Regression models and life-tables. J. Roy. Statist. Soc. Ser. B, 34:187–220. With discussion by F. Downton, Richard Peto, D. J. Bartholomew, D. V. Lindley, P. W. Glassborow, D. E. Barton, Susannah Howard, B. Benjamin, John J. Gart, L. D. Meshalkin, A. R. Kagan, M. Zelen, R. E. Barlow, Jack Kalbfleisch, R. L. Prentice and Norman Breslow, and a reply by D. R. Cox. de Leeuw, J. and Heiser, W. (1977). Convergence of correction matrix algorithms for multidimensional scaling. Geometric Representations of Relational Data. EM and MM Algorithms References

Dempster, A., Laird, N., and Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Soceity Series B., 39(1-38). Dickey, D. A. and Fuller, W. A. (1979). Distribution of the estimators for autoregressive time series with a unit root. J. Amer. Statist. Assoc., 74(366, part 1):427–431. Efron, B. (1979). Bootstrap methods: another look at the jackknife. Ann. Statist., 7(1):1–26. Gelfand, A. E. and Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. J. Amer. Statist. Assoc., 85(410):398–409. Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer, New York, second edition. Ikeda, S. (2000). Acceleration of the EM algorithms. Systems and Computers in Japan, 31(2):10–18. Jamshidian, M. and Jennrich, R. I. (1993). Conjugate gradient acceleration of the EM algorithm. J. Amer. Statist. Assoc., 88(421):221–228. Jamshidian, M. and Jennrich, R. I. (1997). Acceleration of the EM algorithm by using quasi-Newton methods. J. Roy. Statist. Soc. Ser. B, 59(3):569–587. Kaplan, E. L. and Meier, P. (1958). Nonparametric estimation from incomplete observations. J. Amer. Statist. Assoc., 53:457–481. Lange, K. (1995). A quasi-Newton acceleration of the EM algorithm. Statist. Sinica, 5(1):1–18. EM and MM Algorithms References

Lange, K., Hunter, D. R., and Yang, I. (2000). Optimization transfer using surrogate objective functions. J. Comput. Graph. Statist., 9(1):1–59. With discussion, and a rejoinder by Hunter and Lange. Lee, D. D. and Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791. Lee, D. D. and Seung, H. S. (2001). Algorithms for non-negative matrix factorization. In NIPS, pages 556–562. MIT Press. Liu, C. and Rubin, D. B. (1994). The ECME algorithm: a simple extension of EM and ECM with faster monotone convergence. Biometrika, 81(4):633–648. Liu, C., Rubin, D. B., and Wu, Y. N. (1998). Parameter expansion to accelerate EM: the PX-EM algorithm. Biometrika, 85(4):755–770. Liu, J. S. and Wu, Y. N. (1999). Parameter expansion for data augmentation. J. Amer. Statist. Assoc., 94(448):1264–1274. Louis, T. A. (1982). Finding the observed information matrix when using the EM algorithm. J. Roy. Statist. Soc. Ser. B, 44(2):226–233. Mazumder, R., Hastie, T., and Tibshirani, R. (2010). Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11:2287–2322. Meng, X.-L. and Rubin, D. B. (1993). Maximum likelihood estimation via the ECM algorithm: a general framework. Biometrika, 80(2):267–278. Meng, X.-L. and van Dyk, D. (1997). The EM algorithm—an old folk-song sung to a fast new tune. J. Roy. Statist. Soc. Ser. B, 59(3):511–567. With discussion and a reply by the authors. EM and MM Algorithms References

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092. Neal, R. M. and Hinton, G. E. (1999). A view of the EM algorithm that justifies incremental, sparse, and other variants. In Jordan, M. I., editor, Learning in graphical models, pages 355–368. MIT Press, Cambridge, MA, USA. Robert, C. P. and Casella, G. (2004). Monte Carlo Statistical Methods. Springer Texts in Statistics. Springer-Verlag, New York, second edition. Suchard, M. A., Wang, Q., Chan, C., Frelinger, J., Cron, A., and West, M. (2010). Understanding GPU programming for statistical computation: studies in massively parallel massive mixtures. Journal of Computational and Graphical Statistics, 19(2):419–438. Tanner, M. A. and Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. J. Amer. Statist. Assoc., 82(398):528–550. With discussion and with a reply by the authors. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B, 58(1):267–288. Tropp, J. A. (2006). Just relax: convex programming methods for identifying sparse signals in noise. IEEE Transactions on Information Theory, 52(3):1030–1051. Varadhan, R. and Roland, C. (2008). Simple and globally convergent methods for accelerating the convergence of any EM algorithm. Scand. J. Statist., 35(2):335–353. EM and MM Algorithms References

Wei, G. C. G. and Tanner, M. A. (1990). A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithms. Journal of the American Statistical Association, 85(411):699–704. Zhou, H., Alexander, D., and Lange, K. (2011). A quasi-Newton acceleration for high-dimensional optimization algorithms. Statistics and Computing, 21:261–273.