EM and MM Algorithms


Hua Zhou

Department of Biostatistics University of California, Los Angeles

Aug 10, 2016, SAMSI OPT Summer School

Slides and Jupyter Notebook for the lab session are available at: https://github.com/Hua-Zhou/Public-Talks

My flowchart for solving optimization problems

Outline

Overview Problem Setup Review of Newton’s method

EM algorithm Introduction Canonical Example - Finite Mixture Model In the Zoo of EM

MM algorithm Introduction MM examples

Acceleration of EM/MM SQUAREM Quasi-Newton acceleration

Problem setup

We consider continuous optimization in this lecture

minimize $f(x)$ subject to constraints on $x$,

where both the objective function $f$ and the constraint set are continuous.

I'll switch back and forth between maximization and minimization during the lecture (sorry!)

- Applied mathematicians talk about minimization exclusively

- Statisticians talk about maximization due to maximum likelihood estimation

They are fundamentally the same problem: minimizing $f(x)$ is equivalent to maximizing $-f(x)$

Newton's method

Consider maximizing a log-likelihood $L(\theta)$, $\theta \in \Theta \subset \mathbb{R}^p$.

- Newton's method was originally developed for finding roots of nonlinear equations $f(x) = 0$

- Newton's method (aka the Newton-Raphson method) is considered the gold standard for its fast (quadratic) convergence:

$$\frac{\|\theta^{(t+1)} - \theta^*\|}{\|\theta^{(t)} - \theta^*\|^2} \to \text{constant}$$

- Idea: iterative quadratic approximation

- Notation: $\nabla L(\theta)$ is the gradient vector and $d^2 L(\theta)$ is the Hessian matrix

- Statistical jargon: $\nabla L(\theta)$ is the score vector, $-d^2 L(\theta)$ is the observed information matrix, and $E[-d^2 L(\theta)]$ is the expected (Fisher) information matrix

- Taylor expansion around the current iterate $\theta^{(t)}$:

$$L(\theta) \approx L(\theta^{(t)}) + \nabla L(\theta^{(t)})^T (\theta - \theta^{(t)}) + \frac{1}{2} (\theta - \theta^{(t)})^T d^2 L(\theta^{(t)}) (\theta - \theta^{(t)}),$$

and then maximize the quadratic approximation

- To maximize the quadratic approximation, we equate its gradient to zero,

$$\nabla L(\theta^{(t)}) + [d^2 L(\theta^{(t)})](\theta - \theta^{(t)}) = 0_p,$$

which suggests the next iterate

$$\theta^{(t+1)} = \theta^{(t)} - [d^2 L(\theta^{(t)})]^{-1} \nabla L(\theta^{(t)}) = \theta^{(t)} + [-d^2 L(\theta^{(t)})]^{-1} \nabla L(\theta^{(t)})$$

Issues with Newton's method and remedies

- Need to derive, evaluate, and "invert" the observed information matrix. In statistical problems, evaluating the Hessian often costs $O(np^2)$ flops and inverting it costs $O(p^3)$ flops

Remedies:
1. automatic (analytical) differentiation
2. numerical differentiation (works for small problems)
3. quasi-Newton methods (BFGS and L-BFGS are probably the most popular black-box nonlinear optimizers)
4. exploit structure whenever possible to reduce the cost of evaluating and inverting the Hessian

- Stability: the naïve Newton iterate is not guaranteed to be an ascent algorithm. It's equally happy to head uphill or downhill

Remedies:
1. approximate $-d^2 L(\theta^{(t)})$ by a positive definite $A^{(t)}$ (if it is not), and
2. backtracking: by a first-order Taylor expansion,

$$L(\theta^{(t)} + s \Delta\theta^{(t)}) - L(\theta^{(t)}) = s \nabla L(\theta^{(t)})^T \Delta\theta^{(t)} + o(s) = s \nabla L(\theta^{(t)})^T [A^{(t)}]^{-1} \nabla L(\theta^{(t)}) + o(s).$$

For $s$ sufficiently small, the right-hand side is strictly positive

In summary, the Newton scheme iterates according to

$$\theta^{(t+1)} = \theta^{(t)} + s [A^{(t)}]^{-1} \nabla L(\theta^{(t)}) = \theta^{(t)} + s \Delta\theta^{(t)},$$

where $A^{(t)}$ is a positive definite approximation of $-d^2 L(\theta^{(t)})$ and $s$ is a step length
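As a small illustration (not from the slides), here is a minimal Python/NumPy sketch of this damped Newton scheme; the user-supplied function names `loglik`, `grad`, and `hess`, and the ridge fix for indefinite Hessians, are my assumptions.

```python
import numpy as np

def damped_newton(loglik, grad, hess, theta0, maxiter=100, tol=1e-8):
    """Maximize loglik by Newton's method with a step-halving line search.

    A^(t) is taken as -hess(theta), with a ridge added if needed so it is
    positive definite; the step length s is halved until the iterate ascends.
    """
    theta = np.asarray(theta0, dtype=float)
    p = theta.size
    for _ in range(maxiter):
        g = grad(theta)
        A = -hess(theta)
        ridge = 0.0
        while True:                                      # crude positive-definiteness fix
            try:
                np.linalg.cholesky(A + ridge * np.eye(p))
                break
            except np.linalg.LinAlgError:
                ridge = max(2 * ridge, 1e-8)
        d = np.linalg.solve(A + ridge * np.eye(p), g)    # Newton direction
        s, f0 = 1.0, loglik(theta)
        while loglik(theta + s * d) < f0 and s > 1e-10:  # step halving
            s /= 2.0
        theta_new = theta + s * d
        if np.linalg.norm(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta
```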

- Line search strategy: step-halving ($s = 1, 1/2, \ldots$), golden section search, cubic interpolation, the Armijo rule, ... The Newton direction $\Delta\theta^{(t)}$ only needs to be calculated once; the cost of the line search mainly lies in objective function evaluations

- How to approximate $-d^2 L(\theta)$? More of an art than a science. Often requires problem-specific analysis

- Taking $A^{(t)} = I$ leads to the method of steepest ascent, aka gradient ascent

Fisher's scoring

Fisher's scoring method replaces $-d^2 L(\theta)$ by the expected (Fisher) information matrix

$$I(\theta) = E[-d^2 L(\theta)] = E[\nabla L(\theta) \nabla L(\theta)^T] \succeq 0_{p \times p},$$

which is positive semidefinite under exchangeability of expectation and differentiation. Therefore Fisher's scoring algorithm iterates according to

$$\theta^{(t+1)} = \theta^{(t)} + s [I(\theta^{(t)})]^{-1} \nabla L(\theta^{(t)})$$
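As a concrete sketch (my own example, not one worked in the slides): Fisher scoring for logistic regression, where the expected information is $I(\beta) = X^T W X$ with $W = \operatorname{diag}\{p_i(1-p_i)\}$ and, for this canonical GLM, Fisher scoring coincides with Newton's method.

```python
import numpy as np

def logistic_fisher_scoring(X, y, maxiter=50, tol=1e-8):
    """Fisher scoring (here identical to Newton / IRLS) for logistic regression."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(maxiter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))      # fitted probabilities
        score = X.T @ (y - mu)               # gradient of the log-likelihood
        W = mu * (1.0 - mu)                  # weights on the diagonal of W
        info = X.T @ (W[:, None] * X)        # I(beta) = X^T W X
        delta = np.linalg.solve(info, score)
        beta += delta
        if np.linalg.norm(delta) < tol:
            break
    return beta

# usage on simulated data
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ np.array([0.5, 1.0, -1.0])))))
print(logistic_fisher_scoring(X, y))
```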

Constrained optimization

Main strategies for dealing with constrained optimization, e.g.,

minimize $f(x)$
subject to $c_i(x) \ge 0$, $i = 1, \ldots, m$

- Interior point method (barrier method):

$$\text{minimize} \quad f(x) - \mu \sum_{i=1}^m \log(c_i(x))$$

and push $\mu \to 0$

- Sequential quadratic programming (SQP):

minimize a quadratic approximation of $f$
subject to $c_i(x) \ge 0$

- Active set method

The interior point method is the most successful for large problems

Outline

Overview Problem Setup Review of Newton’s method

EM algorithm Introduction Canonical Example - Finite Mixture Model In the Zoo of EM

MM algorithm Introduction MM examples

Acceleration of EM/MM SQUAREM Quasi-Newton acceleration

EM algorithm

I Which are the most cited statistical papers?

Paper                                        Citations   Per Year
Kaplan-Meier (Kaplan and Meier, 1958)            46886        808
EM (Dempster et al., 1977)                       44050       1129
Cox model (Cox, 1972)                            40920        930
Metropolis (Metropolis et al., 1953)             31284        497
FDR (Benjamini and Hochberg, 1995)               30975       1450
Unit root test (Dickey and Fuller, 1979)         18259        493
Lasso (Tibshirani, 1996)                         15306        765
bootstrap (Efron, 1979)                          12992        351
FFT (Cooley and Tukey, 1965)                     11319        222
Gibbs sampler (Gelfand and Smith, 1990)           6531        251

Citation counts from Google Scholar on Feb 17, 2016.

- EM is one of the most influential statistical ideas, finding applications in various branches of science

General reference books on EM

I Notations

I Y : observed data

I Z: missing data

I X = (Y , Z): complete data

I Goal: maximize the log-likelihood of the observed data ln g(y|θ) (optimization!)

I Idea: choose Z such that MLE for the complete data is trivial

- Let $f(x|\theta) = f(y, z|\theta)$ be the density of the complete data

- Each iteration of EM contains two steps

- E step: calculate the conditional expectation

$$Q(\theta|\theta^{(t)}) = E_{Z|Y=y,\theta^{(t)}} [\ln f(Y, Z|\theta) \mid Y = y, \theta^{(t)}]$$

- M step: maximize $Q(\theta|\theta^{(t)})$ to generate the next iterate

$$\theta^{(t+1)} = \operatorname{argmax}_{\theta} \, Q(\theta|\theta^{(t)})$$

- Fundamental inequality of EM:

$$\begin{aligned}
Q(\theta \mid \theta^{(t)}) - \ln g(y|\theta) &= E[\ln f(Y, Z|\theta) \mid Y = y, \theta^{(t)}] - \ln g(y|\theta) \\
&= E\left[\ln \frac{f(Y, Z \mid \theta)}{g(Y \mid \theta)} \,\Big|\, Y = y, \theta^{(t)}\right] \\
&\le E\left[\ln \frac{f(Y, Z \mid \theta^{(t)})}{g(Y \mid \theta^{(t)})} \,\Big|\, Y = y, \theta^{(t)}\right] \quad \text{(information inequality)} \\
&= Q(\theta^{(t)} \mid \theta^{(t)}) - \ln g(y|\theta^{(t)})
\end{aligned}$$

Rearranging shows

$$\ln g(y \mid \theta) \ge Q(\theta \mid \theta^{(t)}) - Q(\theta^{(t)} \mid \theta^{(t)}) + \ln g(y \mid \theta^{(t)})$$

for all $\theta$. Equality is achieved at $\theta = \theta^{(t)}$

- Ascent property of EM:

$$\ln g(y \mid \theta^{(t+1)}) \ge Q(\theta^{(t+1)} \mid \theta^{(t)}) - Q(\theta^{(t)} \mid \theta^{(t)}) + \ln g(y \mid \theta^{(t)}) \ge \ln g(y \mid \theta^{(t)})$$

- Obviously we only need $Q(\theta^{(t+1)} \mid \theta^{(t)}) - Q(\theta^{(t)} \mid \theta^{(t)}) \ge 0$ for this ascent property to hold (generalized EM)

A picture to keep in mind

[Figure: the observed log-likelihood ln g(θ) and the shifted surrogates Q(θ|θ^(t)) + c^(t) and Q(θ|θ^(t+1)) + c^(t+1), with the iterates θ^(t) and θ^(t+1) marked.]

One more ... 2D function

Remarks on EM

- Works magically (at least to applied mathematicians). No need for gradients and Hessians

I Concavity of the observed log-likelihood function (objective function) is not required

I Under regularity conditions, EM converges to a stationary point of ln g(y|θ). For a non-concave objective function, it can be a local mode or even a saddle point

I Original idea also appeared in the HMM paper (Baum-Welch algorithm) (Baum et al., 1970)

- Tends to separate variables – amenable to parallel computing

A canonical EM example – the finite mixture model

- Each data point $y_i \in \mathbb{R}^d$ comes from candidate distribution $h_j$ with probability $\pi_j$, $\sum_{j=1}^k \pi_j = 1$

- Mixture density $h(y) = \sum_{j=1}^k \pi_j h_j(y \mid \mu_j, \Omega_j)$, $y \in \mathbb{R}^d$, where

$$h_j(y \mid \mu_j, \Omega_j) = \left(\frac{1}{2\pi}\right)^{d/2} |\det(\Omega_j)|^{-1/2} e^{-\frac{1}{2}(y - \mu_j)^T \Omega_j^{-1} (y - \mu_j)}$$

are the $N(\mu_j, \Omega_j)$ densities

- Given data $y_1, \ldots, y_n$, we want to estimate

$$\theta = (\pi_1, \ldots, \pi_k, \mu_1, \ldots, \mu_k, \Omega_1, \ldots, \Omega_k).$$

The observed (incomplete) data log-likelihood is

$$\ln g(y_1, \ldots, y_n|\theta) = \sum_{i=1}^n \ln h(y_i) = \sum_{i=1}^n \ln \sum_{j=1}^k \pi_j h_j(y_i \mid \mu_j, \Omega_j)$$

EM for fitting the mixture model

I Missing data: let zij = I{yi comes from group j}

I Complete data likelihood is

$$f(y, z|\theta) = \prod_{i=1}^n \prod_{j=1}^k [\pi_j h_j(y_i \mid \mu_j, \Omega_j)]^{z_{ij}}$$

and thus the complete data log-likelihood is

$$\ln f(y, z|\theta) = \sum_{i=1}^n \sum_{j=1}^k z_{ij} [\ln \pi_j + \ln h_j(y_i \mid \mu_j, \Omega_j)]$$

E step

- Conditional expectation:

$$Q(\theta|\theta^{(t)}) = E\left[\sum_{i=1}^n \sum_{j=1}^k z_{ij} [\ln \pi_j + \ln h_j(y_i|\mu_j, \Omega_j)] \,\Big|\, Y = y, \pi^{(t)}, \mu_1^{(t)}, \ldots, \mu_k^{(t)}, \Omega_1^{(t)}, \ldots, \Omega_k^{(t)}\right]$$

- By Bayes' rule, we have

$$w_{ij}^{(t)} := E[z_{ij} \mid y, \pi^{(t)}, \mu_1^{(t)}, \ldots, \mu_k^{(t)}, \Omega_1^{(t)}, \ldots, \Omega_k^{(t)}] = \frac{\pi_j^{(t)} h_j(y_i|\mu_j^{(t)}, \Omega_j^{(t)})}{\sum_{j'=1}^k \pi_{j'}^{(t)} h_{j'}(y_i|\mu_{j'}^{(t)}, \Omega_{j'}^{(t)})}$$

- The Q function becomes

$$\sum_{i=1}^n \sum_{j=1}^k w_{ij}^{(t)} \ln \pi_j + \sum_{i=1}^n \sum_{j=1}^k w_{ij}^{(t)} \left[-\frac{1}{2} \ln \det \Omega_j - \frac{1}{2} (y_i - \mu_j)^T \Omega_j^{-1} (y_i - \mu_j)\right]$$

M step

- The maximizer of the Q function gives the next iterate:

$$\pi_j^{(t+1)} = \frac{\sum_{i=1}^n w_{ij}^{(t)}}{n}, \qquad
\mu_j^{(t+1)} = \frac{\sum_{i=1}^n w_{ij}^{(t)} y_i}{\sum_{i=1}^n w_{ij}^{(t)}}, \qquad
\Omega_j^{(t+1)} = \frac{\sum_{i=1}^n w_{ij}^{(t)} (y_i - \mu_j^{(t+1)})(y_i - \mu_j^{(t+1)})^T}{\sum_i w_{ij}^{(t)}}$$

- Why? These are the multinomial MLE and the weighted multivariate normal MLE
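A minimal Python sketch of one EM iteration for this Gaussian mixture, implementing the E-step weights $w_{ij}^{(t)}$ and the M-step updates above (the function name `em_step` and the use of `scipy.stats.multivariate_normal` are my choices, not from the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(y, pi, mu, Omega):
    """One EM iteration for a k-component Gaussian mixture.

    y: (n, d) data; pi: (k,) weights; mu: (k, d) means; Omega: (k, d, d) covariances.
    Returns updated (pi, mu, Omega).
    """
    n, d = y.shape
    k = len(pi)
    # E step: membership weights w_ij
    w = np.column_stack([pi[j] * multivariate_normal.pdf(y, mu[j], Omega[j])
                         for j in range(k)])
    w /= w.sum(axis=1, keepdims=True)
    # M step: multinomial MLE and weighted multivariate normal MLEs
    nj = w.sum(axis=0)
    pi_new = nj / n
    mu_new = (w.T @ y) / nj[:, None]
    Omega_new = np.empty((k, d, d))
    for j in range(k):
        r = y - mu_new[j]
        Omega_new[j] = (w[:, j, None] * r).T @ r / nj[j]
    return pi_new, mu_new, Omega_new
```

Iterating `em_step` until the observed log-likelihood stabilizes, from several random starts, is the usual practice since the objective is not concave.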

Remarks

- No gradient or Hessian calculations at all

- The simplex constraint on $\pi$ and the positive semidefinite constraint on $\Omega_j$ are automatically satisfied by the EM iterates

- The "membership variables" $w_{ij}^{(t)}$ can be computed in parallel. Suchard et al. (2010) observe a > 100-fold speedup when fitting massive mixture models (huge $n$) on a GPU

- It is a common theme that EM/MM type algorithms are amenable to massively parallel computing

- A similar EM applies to other component distributions, as long as their MLE is easy to compute

Other classical EM algorithms

I Hidden Markov model (Baum-Welch, or forward-backward algorithm)

I Factor analysis

I Variance component model

I Hyper-parameter estimation in empirical Bayes procedure. MAP (maximum a posteriori estimation)

I Missing data

- Grouped/censored/truncated models

I Robust regression (t-distribution and t-regression)

- ...

ECM

Expectation Conditional Maximization (Meng and Rubin, 1993)

I In some problems the M step is difficult (no analytic solution)

I Conditional maximization is easy (block ascent)

- partition the parameter vector into blocks $\theta = (\theta_1, \ldots, \theta_B)$

- alternately update $\theta_b$, $b = 1, \ldots, B$

I Ascent property still holds. Why?

- ECM may converge more slowly than EM (more iterations), but the total computer time may be shorter due to the ease of the CM steps

ECME

ECM Either (Liu and Rubin, 1994)

I Each CM step maximizes either the Q function or the original incomplete observed log-likelihood

I Ascent property still holds. Why?

- Faster convergence than ECM

AECM

Alternating ECM (Meng and van Dyk, 1997)

I The specification of the complete-data is allowed to be different on each CM-step

- Ascent property still holds. Why?

PX-EM and efficient data augmentation

- Parameter-EXpanded EM (PX-EM) (Liu et al., 1998; Liu and Wu, 1999)

I Efficient data augmentation (Meng and van Dyk, 1997)

- Idea: speed up the convergence of the EM algorithm by efficiently augmenting the observed data (introducing a working parameter in the specification of the complete data)

Example: t distribution

- $W \in \mathbb{R}^p$ follows a multivariate t distribution $t_p(\mu, \Sigma, \nu)$ if $W \mid u \sim \text{Normal}(\mu, \Sigma/u)$ and $u \sim \text{Gamma}(\frac{\nu}{2}, \frac{\nu}{2})$

- Widely used for robust modeling

- Gamma$(\alpha, \beta)$ has density

$$f(u \mid \alpha, \beta) = \frac{\beta^\alpha u^{\alpha-1}}{\Gamma(\alpha)} e^{-\beta u}, \quad u \ge 0,$$

with mean $E(U) = \alpha/\beta$ and $E(\ln U) = \psi(\alpha) - \ln \beta$

- The density of $W$ is

$$f(w \mid \mu, \Sigma, \nu) = \frac{\Gamma(\frac{\nu+p}{2}) |\Sigma|^{-1/2}}{(\pi\nu)^{p/2} \Gamma(\nu/2) [1 + \delta(w, \mu; \Sigma)/\nu]^{(\nu+p)/2}},$$

where $\delta(w, \mu; \Sigma) = (w - \mu)^T \Sigma^{-1} (w - \mu)$ is the Mahalanobis squared distance between $w$ and $\mu$
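For reference, a small sketch (my own helper, not from the slides) that evaluates this multivariate t log-density via a Cholesky factor of $\Sigma$:

```python
import numpy as np
from scipy.special import gammaln
from scipy.linalg import solve_triangular

def mvt_logpdf(w, mu, Sigma, nu):
    """Log-density of the multivariate t_p(mu, Sigma, nu), evaluated at the rows of w."""
    w = np.atleast_2d(w)
    p = w.shape[1]
    L = np.linalg.cholesky(Sigma)
    z = solve_triangular(L, (w - mu).T, lower=True)   # L z = (w - mu)^T
    delta = np.sum(z ** 2, axis=0)                    # Mahalanobis squared distances
    logdet = 2 * np.sum(np.log(np.diag(L)))           # log |Sigma|
    return (gammaln((nu + p) / 2) - gammaln(nu / 2)
            - 0.5 * (p * np.log(np.pi * nu) + logdet)
            - 0.5 * (nu + p) * np.log1p(delta / nu))
```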

- Given iid data $w_1, \ldots, w_n$, the log-likelihood is

$$\begin{aligned}
L(\mu, \Sigma, \nu) &= -\frac{np}{2} \ln(\pi\nu) + n\left[\ln \Gamma\left(\frac{\nu+p}{2}\right) - \ln \Gamma\left(\frac{\nu}{2}\right)\right] \\
&\quad - \frac{n}{2} \ln|\Sigma| + \frac{n}{2}(\nu + p) \ln \nu - \frac{1}{2}(\nu + p) \sum_{j=1}^n \ln[\nu + \delta(w_j, \mu; \Sigma)]
\end{aligned}$$

- How to compute the MLE $(\hat\mu, \hat\Sigma, \hat\nu)$?

Multivariate t MLE – EM

- $W_j \mid u_j$ independent $N(\mu, \Sigma/u_j)$ and $U_j$ iid Gamma$(\frac{\nu}{2}, \frac{\nu}{2})$

- Missing data: $z = (u_1, \ldots, u_n)^T$

- The log-likelihood of the complete data is

$$\begin{aligned}
L_c(\mu, \Sigma, \nu) &= -\frac{1}{2} np \ln(2\pi) - \frac{n}{2} \ln|\Sigma| - \frac{1}{2} \sum_{j=1}^n u_j (w_j - \mu)^T \Sigma^{-1} (w_j - \mu) \\
&\quad - n \ln \Gamma\left(\frac{\nu}{2}\right) + \frac{n\nu}{2} \ln\left(\frac{\nu}{2}\right) + \frac{\nu}{2} \sum_{j=1}^n (\ln u_j - u_j) - \sum_{j=1}^n \ln u_j
\end{aligned}$$

E step

- Since the gamma is the conjugate prior for $U$, the conditional distribution of $U$ given $W = w$ is Gamma$((\nu + p)/2, (\nu + \delta(w, \mu; \Sigma))/2)$. Thus

$$\begin{aligned}
E(U_j \mid w_j, \mu^{(t)}, \Sigma^{(t)}, \nu^{(t)}) &= \frac{\nu^{(t)} + p}{\nu^{(t)} + \delta(w_j, \mu^{(t)}; \Sigma^{(t)})} =: u_j^{(t)} \\
E(\ln U_j \mid w_j, \mu^{(t)}, \Sigma^{(t)}, \nu^{(t)}) &= \ln u_j^{(t)} + \psi\left(\frac{\nu^{(t)} + p}{2}\right) - \ln\left(\frac{\nu^{(t)} + p}{2}\right)
\end{aligned}$$

- The overall Q function (up to an additive constant) takes the form

$$\begin{aligned}
&-\frac{n}{2} \ln|\Sigma| - \frac{1}{2} \sum_{j=1}^n u_j^{(t)} (w_j - \mu)^T \Sigma^{-1} (w_j - \mu) - n \ln \Gamma\left(\frac{\nu}{2}\right) + \frac{n\nu}{2} \ln\left(\frac{\nu}{2}\right) \\
&\quad + \frac{n\nu}{2} \left\{\frac{1}{n} \sum_{j=1}^n (\ln u_j^{(t)} - u_j^{(t)}) + \psi\left(\frac{\nu^{(t)} + p}{2}\right) - \ln\left(\frac{\nu^{(t)} + p}{2}\right)\right\}
\end{aligned}$$

M step

- Maximization over $(\mu, \Sigma)$ is simply a weighted multivariate normal problem:

$$\mu^{(t+1)} = \frac{\sum_{j=1}^n u_j^{(t)} w_j}{\sum_{j=1}^n u_j^{(t)}}, \qquad
\Sigma^{(t+1)} = \frac{1}{n} \sum_{j=1}^n u_j^{(t)} (w_j - \mu^{(t+1)})(w_j - \mu^{(t+1)})^T$$

Notice that the down-weighting of outliers is obvious in the update

- Maximization over $\nu$ is a univariate problem – equate the derivative to 0 and find the root
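A sketch of that univariate $\nu$ update via root finding. The stationarity condition below is what I obtain by differentiating the Q function above with respect to $\nu$, so treat the exact expression (and the bracketing interval) as assumptions to verify:

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

def update_nu(u, nu_old, p):
    """M-step update of nu given the E-step weights u_j = E[U_j | w_j, theta^(t)].

    Solves dQ/dnu = 0, i.e. -digamma(nu/2) + log(nu/2) + c = 0, by Brent's
    method; the bracket (1e-3, 1e3) is an assumption and brentq raises if
    no root lies inside it.
    """
    c = (np.mean(np.log(u) - u) + 1.0
         + digamma((nu_old + p) / 2.0) - np.log((nu_old + p) / 2.0))
    f = lambda nu: -digamma(nu / 2.0) + np.log(nu / 2.0) + c
    return brentq(f, 1e-3, 1e3)
```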

Multivariate t MLE – ECM and ECME

Partition the parameter $(\mu, \Sigma, \nu)$ into two blocks $(\mu, \Sigma)$ and $\nu$

I ECM = EM for this example. Why?

- ECME: in the second CM step, maximize over $\nu$ in terms of the original log-likelihood function instead of the Q function. They have similar difficulty, since both are univariate optimization problems!

An example from Liu and Rubin (1994): $n = 79$, $p = 2$, with missing entries. EM = ECM: solid line. ECME: dashed line.

Multivariate t MLE – efficient data augmentation

Assume $\nu$ is known

- Write $W_j = \mu + C_j / U_j^{1/2}$, where $C_j$ is $N(0, \Sigma)$ independent of $U_j$, which is Gamma$(\frac{\nu}{2}, \frac{\nu}{2})$

- Equivalently, $W_j = \mu + |\Sigma|^{-a/2} C_j / U_j(a)^{1/2}$, where $U_j(a) = |\Sigma|^{-a} U_j$

I Then the complete data is (w, z(a)), a the working parameter

I a = 0 corresponds to the vanilla EM

I Meng and van Dyk (1997) recommend using aopt = 1/(ν + p) to maximize the convergence rate

I Exercise: work out the EM update for this special case

- The only change to the vanilla EM is to replace the denominator $n$ in the update of $\Sigma$ by $\sum_{j=1}^n u_j^{(t)}$

- PX-EM (Liu et al., 1998) leads to the same update

Assume ν unknown

- Version 1: $a = a_{\text{opt}}$ in updating both $(\mu, \Sigma)$ and $\nu$

- Version 2: $a = a_{\text{opt}}$ for updating $(\mu, \Sigma)$, and take the observed data as the complete data for updating $\nu$

Conclusion in Meng and van Dyk (1997): Version 1 is 8–12 times faster than EM = ECM or ECME. Version 2 is only slightly more efficient than Version 1

MCEM

Monte Carlo EM (Wei and Tanner, 1990)

I Hard to calculate Q function? Simulate!

$$Q(\theta \mid \theta^{(t)}) \approx \frac{1}{m} \sum_{j=1}^m \ln f(y, z_j \mid \theta),$$

where the $z_j$ are iid draws from the conditional distribution of the missing data given $y$ and the previous iterate $\theta^{(t)}$

I Ascent property may be lost due to Monte Carlo errors

- Examples: the capture-recapture model (Robert and Casella, 2004, p. 184) and generalized linear mixed models (Booth and Hobert, 1999)

SEM and SAEM

Stochastic EM (SEM) (Celeux and Diebolt, 1985)

- Same as MCEM with $m = 1$: a single draw of the missing data $z$ from the conditional distribution

- $\theta^{(t)}$ forms a Markov chain which converges to a stationary distribution. There is no definite relation between this stationary distribution and the MLE

I In some specific cases, it can be shown that the stationary distribution concentrates around the MLE with a variance inversely proportional to the sample size

- Advantage: can escape the attraction of inferior modes/saddle points in some mixture model problems

Simulated Annealing EM (SAEM) (Celeux and Diebolt, 1989)

- Increase $m$ with the iterations, ending up with an EM algorithm

DA

Data Augmentation (DA) algorithm (Tanner and Wong, 1987)

I Aim for sampling from p(θ|y) instead of maximization

I Idea: incomplete data posterior density is complicated, but the complete-data posterior density is relatively easy to sample

- The data augmentation algorithm:
  - draw $z^{(t+1)}$ conditional on $(\theta^{(t)}, y)$
  - draw $\theta^{(t+1)}$ conditional on $(z^{(t+1)}, y)$

- A special case of the Gibbs sampler

- $\theta^{(t)}$ converges to the distribution $p(\theta|y)$ under general conditions

- The ergodic mean converges to the posterior mean $E(\theta|y)$, which may perform better than the MLE in finite samples

EM as a maximization-maximization procedure

Neal and Hinton (1999)

I Consider the objective function

$$F(\theta, q) = E_q[\ln f(Z, y \mid \theta)] + H(q)$$

over $\Theta \times \mathcal{Q}$, where $\mathcal{Q}$ is the set of all conditional pdfs of the missing data, $\{q(z) = p(z \mid y, \theta), \theta \in \Theta\}$, and $H(q) = -E_q \ln q$ is the entropy of $q$

- EM is essentially performing coordinate ascent to maximize $F$

- E step: at the current iterate $\theta^{(t)}$,

$$F(\theta^{(t)}, q) = E_q[\ln p(Z \mid y, \theta^{(t)})] - E_q \ln q + \ln g(y \mid \theta^{(t)}).$$

The maximizing conditional pdf is $q(z) = f(z \mid y, \theta^{(t)})$

- M step: substitute $q(z) = f(z \mid y, \theta^{(t)})$ into $F$ and maximize over $\theta$

Hastie et al. (2009, Section 8.5.3)

Incremental EM

- Assume iid observations. Then the objective function is

$$F(\theta, q) = \sum_i \left[E_{q_i} \ln p(Z_i, y_i \mid \theta) + H(q_i)\right],$$

where we search for $q$ of the factored form $q(z) = \prod_i q_i(z)$

- Maximizing $F$ over $q$ is equivalent to maximizing the contribution of each data point with respect to $q_i$

I Update θ by visiting data items sequentially rather than from a global E step

I Finite mixture example: for each data point i

- evaluate the membership variables $w_{ij}$, $j = 1, \ldots, k$

- update the parameter values $(\pi^{(t+1)}, \mu_j^{(t+1)}, \Sigma_j^{(t+1)})$ (only need to keep sufficient statistics!)

Faster convergence than vanilla EM in some examples

Outline

Overview Problem Setup Review of Newton’s method

EM algorithm Introduction Canonical Example - Finite Mixture Model In the Zoo of EM

MM algorithm Introduction MM examples

Acceleration of EM/MM SQUAREM Quasi-Newton acceleration

Our friend ...

[Figure: the same picture as before – the log-likelihood ln g(θ) with its minorizing surrogates Q(θ|θ^(t)) + c^(t) and Q(θ|θ^(t+1)) + c^(t+1).]

Our other friend ... 2D function

EM as a minorization-maximization (MM) algorithm

Refer to the picture on the previous slide

- The Q function constitutes a minorizing function of the objective function up to an additive constant:

$$L(\theta) \ge Q(\theta|\theta^{(t)}) + c^{(t)} \text{ for all } \theta, \qquad L(\theta^{(t)}) = Q(\theta^{(t)}|\theta^{(t)}) + c^{(t)}$$

- Maximizing the Q function generates an ascent iterate $\theta^{(t+1)}$

Questions:

- Is there any other way to produce such a surrogate function?

- Can we flip the picture and apply the same principle to minimization problems?

This generalization leads to a powerful tool – the MM principle (Lange et al., 2000)

General reference book on MM

A powerful tool - MM algorithm

I A prescription for constructing optimization algorithms

I Majorize/Minimize or Minorize/Maximize

I Creating a surrogate function that minorizes/majorizes the objective function

I Optimizing the surrogate function drives the objective function uphill or downhill as needed

Famous special cases in statistics: EM algorithm (Dempster et al., 1977), multidimensional scaling (de Leeuw and Heiser, 1977)

Majorization and definition of the algorithm

- A function $g(\theta|\theta_n)$ is said to majorize the function $f(\theta)$ at $\theta_n$ provided

$$f(\theta_n) = g(\theta_n|\theta_n), \qquad f(\theta) \le g(\theta|\theta_n) \text{ for all } \theta.$$

- So we choose a "nice/separable" majorizing function $g(\theta|\theta_n)$ and minimize it. This produces the next point $\theta_{n+1}$ in the algorithm.

I The descent property follows from

$$f(\theta_{n+1}) \le g(\theta_{n+1}|\theta_n) \le g(\theta_n|\theta_n) = f(\theta_n).$$

This makes MM algorithms very stable.

Rationale for the MM principle

This makes MM algorithms very stable. EM and MM Algorithms MM algorithm Introduction Rationale for the MM principle

I Free the derivation from missing data structure

I Avoid matrix inversion

I Linearize an optimization problem

I Deal gracefully with certain equality and inequality constraints

I Turn a non-differentiable problem into a smooth problem

- Separate the parameters of a problem (perfect for massive, fine-scale parallelization)

Generic methods of majorization and minorization

I Jensen’s inequality – EM algorithms

- The Cauchy-Schwarz inequality - multidimensional scaling

I Supporting hyperplane property of a convex function

I Arithmetic-geometric mean inequality

- Quadratic upper bound principle - Böhning and Lindsay

- ...

A great source of inequalities: The Cauchy-Schwarz Master Class by Michael Steele.

Positron Emission Tomography (PET)

I Data: tube readings y = (y1,..., yd )

- Estimate: photon emission intensities (pixels) $\lambda = (\lambda_1, \ldots, \lambda_p)$

- Poisson model: $Y_i \sim \text{Poisson}(\sum_{j=1}^p c_{ij} \lambda_j)$, where $c_{ij}$ is the conditional probability that a photon emitted by the $j$-th pixel is detected by the $i$-th tube

- Log-likelihood:

$$L(\lambda|y) = \sum_i \left[y_i \ln\left(\sum_j c_{ij} \lambda_j\right) - \sum_j c_{ij} \lambda_j\right] + \text{const.}$$

Essentially a Poisson regression with the constraint $\lambda_j \ge 0$

- Regularized log-likelihood for a smoother image:

$$L(\lambda|y) - \frac{\mu}{2} \sum_{\{j,k\} \in \mathcal{N}} (\lambda_j - \lambda_k)^2$$

- The EM algorithm does not apply now: there is no likelihood model for the regularization term

Minorization step

- By concavity of the $\ln s$ function (Jensen's inequality),

$$\ln\left(\sum_j c_{ij}\lambda_j\right) \ge \sum_j \frac{c_{ij}\lambda_j^{(t)}}{\sum_{j'} c_{ij'}\lambda_{j'}^{(t)}} \ln\left(\frac{\sum_{j'} c_{ij'}\lambda_{j'}^{(t)}}{c_{ij}\lambda_j^{(t)}}\, c_{ij}\lambda_j\right) = \sum_j \frac{c_{ij}\lambda_j^{(t)}}{\sum_{j'} c_{ij'}\lambda_{j'}^{(t)}} \ln \lambda_j + c_i^{(t)}$$

- By concavity of the $-s^2$ function,

$$-(\lambda_j - \lambda_k)^2 \ge -\frac{1}{2}(2\lambda_j - \lambda_j^{(t)} - \lambda_k^{(t)})^2 - \frac{1}{2}(2\lambda_k - \lambda_j^{(t)} - \lambda_k^{(t)})^2$$

- Combining minorizing terms gives an overall surrogate function

$$g(\lambda|\lambda^{(t)}) = \sum_i \sum_j y_i \frac{c_{ij}\lambda_j^{(t)}}{\sum_{j'} c_{ij'}\lambda_{j'}^{(t)}} \ln \lambda_j - \sum_i \sum_j c_{ij}\lambda_j - \frac{\mu}{4} \sum_{\{j,k\}\in\mathcal{N}} \left[(2\lambda_j - \lambda_j^{(t)} - \lambda_k^{(t)})^2 + (2\lambda_k - \lambda_j^{(t)} - \lambda_k^{(t)})^2\right]$$

Maximization step

- $g(\lambda|\lambda^{(t)})$ is trivial to maximize because the $\lambda_j$ are separated!

- Solving for the root of

$$\frac{\partial}{\partial \lambda_j} g(\lambda|\lambda^{(t)}) = \sum_i y_i \frac{c_{ij}\lambda_j^{(t)}}{\sum_{j'} c_{ij'}\lambda_{j'}^{(t)}} \lambda_j^{-1} - \sum_i c_{ij} - \mu \sum_{k \in \mathcal{N}_j} (2\lambda_j - \lambda_j^{(t)} - \lambda_k^{(t)}) = 0$$

gives $\lambda_j^{(t+1)}$

- MM algorithm for PET:

Initialize: $\lambda_j^{(0)} = 1$
repeat
    $z_{ij}^{(t)} = (y_i c_{ij} \lambda_j^{(t)}) / (\sum_k c_{ik} \lambda_k^{(t)})$
    for $j = 1$ to $p$ do
        $a = -2\mu |\mathcal{N}_j|$,  $b = \mu (|\mathcal{N}_j| \lambda_j^{(t)} + \sum_{k \in \mathcal{N}_j} \lambda_k^{(t)}) - 1$,  $c = \sum_i z_{ij}^{(t)}$
        $\lambda_j^{(t+1)} = (-b - \sqrt{b^2 - 4ac})/(2a)$
    end for
until convergence occurs
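A Python/NumPy sketch of the same iteration (my own translation of the pseudocode; the dense matrix `C` of detection probabilities, the 0/1 adjacency matrix `A` encoding the neighborhoods $\mathcal{N}_j$, and the assumption that each column of `C` sums to 1 — which is where the $-1$ in $b$ comes from — are modeling choices, not given in the slides):

```python
import numpy as np

def pet_mm(y, C, A, mu, maxiter=1000, tol=1e-8):
    """MM algorithm for the regularized PET problem (sketch).

    y: (d,) tube counts; C: (d, p) detection probabilities, assumed to have
    columns summing to 1; A: (p, p) 0/1 symmetric adjacency matrix of the
    pixel neighborhood graph; mu: penalty parameter.
    """
    d, p = C.shape
    lam = np.ones(p)                                          # lambda_j^(0) = 1
    nbrs = A.sum(axis=1)                                      # |N_j|
    for _ in range(maxiter):
        denom = C @ lam                                       # sum_k c_ik lambda_k
        zsum = ((y / denom)[:, None] * C * lam).sum(axis=0)   # sum_i z_ij
        if mu == 0:
            lam_new = zsum                                    # unpenalized case: a = 0
        else:
            a = -2.0 * mu * nbrs
            b = mu * (nbrs * lam + A @ lam) - 1.0
            c = zsum
            lam_new = (-b - np.sqrt(b ** 2 - 4 * a * c)) / (2 * a)
        if np.max(np.abs(lam_new - lam)) < tol:
            return lam_new
        lam = lam_new
    return lam
```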

- The parameter constraints $\lambda_j \ge 0$ are satisfied when starting from positive initial values

- The loop for updating pixels can be carried out independently – massive parallelism

A simulation example with $n = 2016$ and $p = 4096$ (provided by Ravi Varadhan)

CPU: i7 @ 3.20GHz (1 thread), implemented using BLAS in the GNU Scientific Library (GSL)
GPU: NVIDIA GeForce GTX 580, implemented using cuBLAS

                      CPU                                   GPU
Penalty µ    Iters     Time    Function          Iters    Time    Function        Speedup
0           100000    11250    -7337.152765     100000     140    -7337.153387       80
10^-7        24506     2573    -8500.082605      24506      35    -8508.112249       74
10^-6         6294      710    -15432.45496       6294       9    -15432.45586       79
10^-5          589       67    -55767.32966        589     0.8    -55767.32970       84

Multivariate t MLE (revisited)

Problem

- Given iid data $w_1, \ldots, w_n$, the log-likelihood is

$$\begin{aligned}
L(\mu, \Sigma, \nu) &= -\frac{np}{2} \ln(\pi\nu) + n\left[\ln \Gamma\left(\frac{\nu+p}{2}\right) - \ln \Gamma\left(\frac{\nu}{2}\right)\right] \\
&\quad - \frac{n}{2} \ln|\Sigma| + \frac{n}{2}(\nu + p) \ln \nu - \frac{1}{2}(\nu + p) \sum_{j=1}^n \ln[\nu + \delta(w_j, \mu; \Sigma)],
\end{aligned}$$

where $\delta(w_j, \mu; \Sigma) = (w_j - \mu)^T \Sigma^{-1} (w_j - \mu)$

- Want to find the MLE $(\hat\mu, \hat\Sigma, \hat\nu)$

Minorization step

- Apply the supporting hyperplane inequality:

$$-\ln[\nu + \delta(w_j, \mu; \Sigma)] \ge -\frac{\nu + \delta(w_j, \mu; \Sigma)}{\nu^{(t)} + \delta(w_j, \mu^{(t)}; \Sigma^{(t)})} + 1 - \ln[\nu^{(t)} + \delta(w_j, \mu^{(t)}; \Sigma^{(t)})]$$

- Summing over observations gives an overall minorizing function

$$\begin{aligned}
g(\mu, \Sigma, \nu \mid \mu^{(t)}, \Sigma^{(t)}, \nu^{(t)}) &= -\frac{np}{2} \ln(\pi\nu) + n\left[\ln \Gamma\left(\frac{\nu+p}{2}\right) - \ln \Gamma\left(\frac{\nu}{2}\right)\right] \\
&\quad - \frac{n}{2} \ln|\Sigma| + \frac{n}{2}(\nu + p) \ln \nu - \frac{\nu + p}{2} \sum_{j=1}^n \frac{\nu + \delta(w_j, \mu; \Sigma)}{\nu^{(t)} + \delta(w_j, \mu^{(t)}; \Sigma^{(t)})} \\
&\quad + \frac{\nu + p}{2} \left\{n - \sum_{j=1}^n \ln[\nu^{(t)} + \delta(w_j, \mu^{(t)}; \Sigma^{(t)})]\right\}
\end{aligned}$$

Maximization step

Block ascent for maximizing the minorizing function:

- Given $\nu^{(t)}$, update $(\mu, \Sigma)$ by

$$\max_{\mu, \Sigma} \; -\frac{n}{2} \ln|\Sigma| - \frac{1}{2} \sum_{j=1}^n \frac{\nu^{(t)} + p}{\nu^{(t)} + \delta(w_j, \mu^{(t)}; \Sigma^{(t)})} \, \delta(w_j, \mu; \Sigma),$$

a weighted multivariate normal problem – the same update as in the vanilla EM (= ECM)

- Given $(\mu^{(t+1)}, \Sigma^{(t+1)})$, the update of $\nu$ is easy since it is a univariate optimization problem. This gives a different update from the vanilla EM (= ECM)

I Above derivation gives an MCM (minorization conditional maximization) algorithm, which turns out to be different from ECM

- Wait a minute, why don't we maximize the original objective function directly when updating $\nu$? MCM Either!

- MCME = ECME. Why?

Yet another MM for t ...

Assume $\nu$ is known

- Notice

$$-\frac{1}{2} \ln|\Sigma| - \frac{\nu + p}{2} \ln[\nu + \delta(w_j, \mu; \Sigma)] = -\frac{\nu + p}{2} \ln\{|\Sigma|^a [\nu + \delta(w_j, \mu; \Sigma)]\},$$

where $a = 1/(\nu + p)$ is the working parameter

- Minorize via

$$-\ln\{|\Sigma|^a [\nu + \delta(w_j, \mu; \Sigma)]\} \ge -\frac{|\Sigma|^a [\nu + \delta(w_j, \mu; \Sigma)]}{|\Sigma^{(t)}|^a [\nu + \delta(w_j, \mu^{(t)}; \Sigma^{(t)})]} + 1 - \ln\{|\Sigma^{(t)}|^a [\nu + \delta(w_j, \mu^{(t)}; \Sigma^{(t)})]\}$$

- Now working out the maximization step (do it!) yields the efficient data augmentation algorithm of Meng and van Dyk (1997)

Non-negative matrix factorization (NNMF)

Lee and Seung (1999, 2001)

- Goal: find a low rank $r$ factorization $V_{m \times r} W_{r \times n}$ of a non-negative data matrix $X_{m \times n}$ such that

$$\|X - VW\|_F$$

is minimized subject to the constraints $v_{ij}, w_{jk} \ge 0$

- Resembles the SVD except for the non-negativity constraints

I A popular alternative to PCA and vector quantization

- Used for data compression, clustering, ...

Image Factorization

I MIT CBCL Face Images: 2429 faces, each 19 × 19 = 361 pixels

- A striking feature of the NNMF basis images is that they correspond to different parts of the face (eyes, nose, ...)

Approximation based on a rank-49 NNMF

MM for NNMF

- Objective:

$$\|X - VW\|_F^2 = \sum_i \sum_j \left(x_{ij} - \sum_k v_{ik} w_{kj}\right)^2$$

- Majorization: by convexity of the $s^2$ function,

$$\left(x_{ij} - \sum_k v_{ik} w_{kj}\right)^2 \le \sum_k \frac{v_{ik}^{(t)} w_{kj}^{(t)}}{\sum_l v_{il}^{(t)} w_{lj}^{(t)}} \left(x_{ij} - \frac{\sum_l v_{il}^{(t)} w_{lj}^{(t)}}{v_{ik}^{(t)} w_{kj}^{(t)}}\, v_{ik} w_{kj}\right)^2$$

- Minimization: alternate updates of $V$ and $W$

MM Algorithm for NNMF

Initialize: $V = \text{rand}(m, r)$ and $W = \text{rand}(r, n)$.
repeat
    $v_{ik} \leftarrow v_{ik} \, (XW^T)_{ik} / (VWW^T)_{ik}$ for all $m \times r$ elements
    $w_{kj} \leftarrow w_{kj} \, (V^T X)_{kj} / (V^T V W)_{kj}$ for all $r \times n$ elements
until convergence
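A direct NumPy sketch of these multiplicative updates (the small epsilon guard against division by zero and the stopping rule are my choices):

```python
import numpy as np

def nnmf(X, r, maxiter=500, tol=1e-6, seed=0):
    """Multiplicative MM updates for least-squares NNMF: X ≈ V @ W."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    V = rng.random((m, r))
    W = rng.random((r, n))
    eps = 1e-12                                  # guard against division by zero
    obj_old = np.inf
    for _ in range(maxiter):
        V *= (X @ W.T) / (V @ W @ W.T + eps)     # elementwise update of V
        W *= (V.T @ X) / (V.T @ V @ W + eps)     # elementwise update of W
        obj = np.linalg.norm(X - V @ W, "fro")
        if obj_old - obj < tol * obj_old:
            break
        obj_old = obj
    return V, W
```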

I Updates are extremely simple

I Non-negativity constraints are satisfied

I Utilize high throughput of BLAS 3 routines, either on CPU or on GPU

- Lasso or other sparsity penalties can be incorporated

Exercise

- In the original paper, Lee and Seung (1999) pose a Poisson model for the image pixels, $x_{ij} \sim \text{Poisson}(\sum_k v_{ik} w_{kj})$

- The log-likelihood is

$$L(V, W) = \sum_i \sum_j \left[x_{ij} \ln\left(\sum_k v_{ik} w_{kj}\right) - \sum_k v_{ik} w_{kj}\right]$$

I MLE under the Poisson model is amenable to another MM algorithm

- Derive it

Netflix movie rating data and matrix completion

- Snapshot of the kind of data collected by Netflix. Only 100,480,507 ratings (1.2% of the entries of the matrix) are observed

- Goal: impute the unobserved ratings for personalized recommendation

I Observe a very sparse matrix Y = (yij ). Want to impute all the missing entries. It is possible when the matrix is structured and of low rank

I Let Ω = {(i, j): observed entries} collect the observed entries and PΩ(M) denote the projection of matrix M to Ω. The problem

$$\min_{\text{rank}(X) \le r} \; \frac{1}{2} \|P_\Omega(Y) - P_\Omega(X)\|_F^2 = \frac{1}{2} \sum_{(i,j) \in \Omega} (y_{ij} - x_{ij})^2$$

unfortunately is non-convex and difficult

I Convex relaxation (Mazumder et al., 2010)

$$\min_X \; f(X) = \frac{1}{2} \|P_\Omega(Y) - P_\Omega(X)\|_F^2 + \lambda \|X\|_*,$$

where $\|X\|_* = \|\sigma(X)\|_1 = \sum_i \sigma_i(X)$ is the nuclear norm

Majorization step

A majorizing function:

$$\begin{aligned}
f(X) &= \frac{1}{2} \sum_{(i,j) \in \Omega} (y_{ij} - x_{ij})^2 + \frac{1}{2} \sum_{(i,j) \notin \Omega} 0 + \lambda \|X\|_* \\
&\le \frac{1}{2} \sum_{(i,j) \in \Omega} (y_{ij} - x_{ij})^2 + \frac{1}{2} \sum_{(i,j) \notin \Omega} (x_{ij} - x_{ij}^{(t)})^2 + \lambda \|X\|_* \\
&= \frac{1}{2} \|X - Z^{(t)}\|_F^2 + \lambda \|X\|_* = g(X|X^{(t)}),
\end{aligned}$$

where $Z^{(t)} = P_\Omega(Y) + P_{\Omega^\perp}(X^{(t)})$

Minimization step

- Majorizing function:

$$g(X|X^{(t)}) = \frac{1}{2}\|X\|_F^2 - \operatorname{tr}(X^T Z^{(t)}) + \frac{1}{2}\|Z^{(t)}\|_F^2 + \lambda\|X\|_*$$

- Observe

$$\|X\|_F^2 = \|\sigma(X)\|_2^2 = \sum_i \sigma_i^2, \qquad \|Z^{(t)}\|_F^2 = \|\sigma(Z^{(t)})\|_2^2 = \sum_i \omega_i^2,$$

and by Fan–von Neumann's inequality

$$\operatorname{tr}(X^T Z^{(t)}) \le \sum_i \sigma_i \omega_i,$$

with equality achieved if and only if the left and right singular vectors of the two matrices coincide

- Thus we can assume $X$ has the same singular vectors as $Z^{(t)}$, and

$$g(X|X^{(t)}) = \frac{1}{2}\sum_i \sigma_i^2 - \sum_i \sigma_i\omega_i + \frac{1}{2}\sum_i \omega_i^2 + \lambda\sum_i \sigma_i = \frac{1}{2}\sum_i (\sigma_i - \omega_i)^2 + \lambda\sum_i \sigma_i,$$

with minimizer given by $\sigma_i^{(t+1)} = (\omega_i - \lambda)_+$

MM algorithm for matrix completion

Initialize $X^{(0)} \in \mathbb{R}^{m \times n}$
repeat
    $Z^{(t+1)} \leftarrow P_\Omega(Y) + P_{\Omega^\perp}(X^{(t)})$
    Compute the SVD $Z^{(t+1)} = U \operatorname{diag}(w) V^T$
    $X^{(t+1)} \leftarrow U \operatorname{diag}[(w - \lambda)_+] V^T$
until the objective value converges
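A NumPy sketch of this iteration using a full SVD (fine for small matrices; the convergence test is my choice):

```python
import numpy as np

def soft_impute(Y, mask, lam, maxiter=200, tol=1e-6):
    """MM (soft-impute) iteration for nuclear-norm regularized matrix completion.

    Y: (m, n) with arbitrary values at unobserved entries; mask: boolean array
    of observed entries; lam: regularization parameter lambda.
    """
    X = np.zeros_like(Y, dtype=float)
    for _ in range(maxiter):
        Z = np.where(mask, Y, X)                  # Z = P_Omega(Y) + P_Omega^perp(X)
        U, w, Vt = np.linalg.svd(Z, full_matrices=False)
        w_thr = np.maximum(w - lam, 0.0)          # soft-threshold the singular values
        X_new = (U * w_thr) @ Vt
        if np.linalg.norm(X_new - X) < tol * max(1.0, np.linalg.norm(X)):
            return X_new
        X = X_new
    return X
```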

- The "Golub-Kahan-Reinsch" algorithm takes $4m^2n + 8mn^2 + 9n^3$ flops for an $m \ge n$ matrix and is not going to work for the 480K-by-18K Netflix matrix. Notice only the top singular values are needed. Lanczos/Arnoldi algorithms are the way to go
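SciPy exposes a Lanczos-type truncated SVD; a sketch of the thresholding step using it (choosing how many singular values `k` to compute is left to the user and is an assumption here):

```python
import numpy as np
from scipy.sparse.linalg import svds

def top_svd_threshold(Z, lam, k):
    """Rank-k truncated SVD of Z followed by soft-thresholding at lam."""
    U, w, Vt = svds(Z, k=k)            # k < min(Z.shape); works on dense or sparse Z
    w_thr = np.maximum(w - lam, 0.0)
    return (U * w_thr) @ Vt
```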

Outline

Overview Problem Setup Review of Newton’s method

EM algorithm Introduction Canonical Example - Finite Mixture Model In the Zoo of EM

MM algorithm Introduction MM examples

Acceleration of EM/MM SQUAREM Quasi-Newton acceleration

Acceleration

I Slow convergence of EM/MM (linear rate at best) is a serious concern in many problems

I Some proposed acceleration methods

I Aitken’s and Louis’ methods (Louis, 1982)

I Conjugate gradient method (Jamshidian and Jennrich, 1993)

I Quasi-Newton acceleration (Lange, 1995; Jamshidian and Jennrich, 1997)

I Ikeda acceleration (Ikeda, 2000)

- We review two recent ones which are particularly attractive in the high-dimensional setting

SQUAREM

Varadhan and Roland (2008)

- $F: \mathbb{R}^p \mapsto \mathbb{R}^p$ is the algorithm mapping: $x \to F(x) \to F \circ F(x) \to F \circ F \circ F(x) \to \cdots$

I SQUAREM extrapolation

$$x^{(t+1)} = x^{(t)} - 2s[F(x^{(t)}) - x^{(t)}] + s^2[F \circ F(x^{(t)}) - 2F(x^{(t)}) + x^{(t)}] = x^{(t)} - 2su + s^2(v - u),$$

where $s$ is a scalar, $u = F(x^{(t)}) - x^{(t)}$, and $v = F \circ F(x^{(t)}) - F(x^{(t)})$

- SqS1: $s = \dfrac{u^T u}{u^T (v - u)}$

- SqS2: $s = \dfrac{u^T (v - u)}{(v - u)^T (v - u)}$

- SqS3: $s = -\sqrt{\dfrac{u^T u}{(v - u)^T (v - u)}}$
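A minimal sketch of one SqS3 step for a generic fixed-point map `F` (a 1-D parameter vector `x` is assumed), following Varadhan and Roland (2008):

```python
import numpy as np

def squarem_step(F, x):
    """One SQUAREM (SqS3) extrapolation step for a fixed-point map F."""
    Fx = F(x)
    F2x = F(Fx)
    r = Fx - x                              # u
    q = F2x - 2 * Fx + x                    # v - u
    denom = np.sqrt(q @ q)
    if denom == 0:                          # already at a fixed point
        return F2x
    s = -np.sqrt(r @ r) / denom             # SqS3 step length
    return x - 2 * s * r + s ** 2 * q
```

In practice one also falls back to the plain EM/MM update whenever the extrapolated point degrades the objective, as noted below.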

I Extremely simple and effective

- The ascent property may be lost by the extrapolation, but we can always revert back to the EM/MM update as a safeguard

An example with 2 parameters

Quasi-Newton acceleration

Zhou et al. (2011)

- $F: \mathbb{R}^p \mapsto \mathbb{R}^p$ is the algorithm mapping: $x \to F(x) \to F \circ F(x) \to F \circ F \circ F(x) \to \cdots$

I Fixed points of F ⇔ roots of G(x) = x − F(x)

I Newton’s method for finding the roots of the mapping G(x) = x − F(x)

$$x^{(t+1)} = x^{(t)} - [I - dF(x^{(t)})]^{-1}[x^{(t)} - F(x^{(t)})]$$

- Idea: approximate $dF(x^{(t)})$ by a low-rank matrix $M$ and explicitly form the inverse $(I - M)^{-1}$

- Secant condition: $u = F(x^{(t)}) - x^{(t)}$, $v = F \circ F(x^{(t)}) - F(x^{(t)})$,

Mu = v,

where M = dF(x (t))

- For best results, we collect $q$ secant pairs $u_i, v_i$, $i = 1, \ldots, q$, and require that

$$M u_i = v_i, \quad i = 1, \ldots, q, \quad \text{or} \quad MU = V.$$

Lemma. The minimum of the strictly convex function $\|M\|_F^2$ subject to the constraint $MU = V$ is attained by the choice $M = V(U^T U)^{-1} U^T$.

I Fortunately, we have the explicit inverse by Woodbury formula

$$(I - M)^{-1} = [I - V(U^T U)^{-1} U^T]^{-1} = I + V[U^T(U - V)]^{-1} U^T$$

I Quasi-Newton update

$$x^{(t+1)} = x^{(t)} - (I - M)^{-1}[x^{(t)} - F(x^{(t)})] = x^{(t)} - [I - V(U^T U)^{-1} U^T]^{-1}[x^{(t)} - F(x^{(t)})] = F(x^{(t)}) - V[U^T(U - V)]^{-1} U^T [x^{(t)} - F(x^{(t)})]$$
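A sketch of the accelerated update with $q$ secant pairs; here the pairs are generated by simply running the algorithm map forward from the current point, which is one possible bookkeeping scheme (Zhou et al. (2011) reuse pairs from the most recent iterations), so treat it as illustrative:

```python
import numpy as np

def qn_step(F, x, q=2):
    """One quasi-Newton accelerated step using q secant pairs (sketch).

    x_new = F(x) - V [U^T (U - V)]^{-1} U^T [x - F(x)], with the columns of
    U and V built from consecutive applications of the algorithm map F.
    """
    xs = [np.asarray(x, dtype=float)]
    for _ in range(q + 1):
        xs.append(F(xs[-1]))                                        # x, F(x), F∘F(x), ...
    U = np.column_stack([xs[i + 1] - xs[i] for i in range(q)])      # u_i = F(x_i) - x_i
    V = np.column_stack([xs[i + 2] - xs[i + 1] for i in range(q)])  # v_i = F(F(x_i)) - F(x_i)
    Fx = xs[1]
    return Fx - V @ np.linalg.solve(U.T @ (U - V), U.T @ (xs[0] - Fx))
```

If the accelerated point turns out worse, one can simply fall back to the plain $F(x)$ update.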

Advantages of the new quasi-Newton acceleration scheme

$$x^{(t+1)} = F(x^{(t)}) - V[U^T(U - V)]^{-1} U^T [x^{(t)} - F(x^{(t)})]$$

- The effort per iteration is light: $O(pq^2) + O(q^3)$

I Linear constraints are preserved

- No need to store the whole approximate (inverse) Hessian matrix: $O(p^2)$ vs $O(pq)$

$p$ = # parameters, $q$ = # secant conditions

An example with 2771 parameters

References

Baum, L. E., Petrie, T., Soules, G., and Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Statist., 41:164–171. Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Ser. B, 57:289–300. Booth, J. G. and Hobert, J. P. (1999). Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm. Journal of the Royal Statistical Society. Series B (Methodology), 61(1):265–285. Celeux, G. and Diebolt, J. (1985). The SEM algorithm: A probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Computational Statistics Quarterly, 2:73–82. Celeux, G. and Diebolt, J. (1989). Une version de type recuit simule´ de l’algorithme EM. Rapport de Recherche INRIA. Cooley, J. W. and Tukey, J. W. (1965). An algorithm for the machine calculation of complex Fourier series. Math. Comp., 19:297–301. Cox, D. R. (1972). Regression models and life-tables. J. Roy. Statist. Soc. Ser. B, 34:187–220. With discussion by F. Downton, Richard Peto, D. J. Bartholomew, D. V. Lindley, P. W. Glassborow, D. E. Barton, Susannah Howard, B. Benjamin, John J. Gart, L. D. Meshalkin, A. R. Kagan, M. Zelen, R. E. Barlow, Jack Kalbfleisch, R. L. Prentice and Norman Breslow, and a reply by D. R. Cox. de Leeuw, J. and Heiser, W. (1977). Convergence of correction matrix algorithms for multidimensional scaling. Geometric Representations of Relational Data. EM and MM Algorithms References

Dempster, A., Laird, N., and Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Soceity Series B., 39(1-38). Dickey, D. A. and Fuller, W. A. (1979). Distribution of the estimators for autoregressive time series with a unit root. J. Amer. Statist. Assoc., 74(366, part 1):427–431. Efron, B. (1979). Bootstrap methods: another look at the jackknife. Ann. Statist., 7(1):1–26. Gelfand, A. E. and Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. J. Amer. Statist. Assoc., 85(410):398–409. Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer, New York, second edition. Ikeda, S. (2000). Acceleration of the EM algorithms. Systems and Computers in Japan, 31(2):10–18. Jamshidian, M. and Jennrich, R. I. (1993). Conjugate gradient acceleration of the EM algorithm. J. Amer. Statist. Assoc., 88(421):221–228. Jamshidian, M. and Jennrich, R. I. (1997). Acceleration of the EM algorithm by using quasi-Newton methods. J. Roy. Statist. Soc. Ser. B, 59(3):569–587. Kaplan, E. L. and Meier, P. (1958). Nonparametric estimation from incomplete observations. J. Amer. Statist. Assoc., 53:457–481. Lange, K. (1995). A quasi-Newton acceleration of the EM algorithm. Statist. Sinica, 5(1):1–18. EM and MM Algorithms References

Lange, K., Hunter, D. R., and Yang, I. (2000). Optimization transfer using surrogate objective functions. J. Comput. Graph. Statist., 9(1):1–59. With discussion, and a rejoinder by Hunter and Lange. Lee, D. D. and Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791. Lee, D. D. and Seung, H. S. (2001). Algorithms for non-negative matrix factorization. In NIPS, pages 556–562. MIT Press. Liu, C. and Rubin, D. B. (1994). The ECME algorithm: a simple extension of EM and ECM with faster monotone convergence. Biometrika, 81(4):633–648. Liu, C., Rubin, D. B., and Wu, Y. N. (1998). Parameter expansion to accelerate EM: the PX-EM algorithm. Biometrika, 85(4):755–770. Liu, J. S. and Wu, Y. N. (1999). Parameter expansion for data augmentation. J. Amer. Statist. Assoc., 94(448):1264–1274. Louis, T. A. (1982). Finding the observed information matrix when using the EM algorithm. J. Roy. Statist. Soc. Ser. B, 44(2):226–233. Mazumder, R., Hastie, T., and Tibshirani, R. (2010). Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11:2287–2322. Meng, X.-L. and Rubin, D. B. (1993). Maximum likelihood estimation via the ECM algorithm: a general framework. Biometrika, 80(2):267–278. Meng, X.-L. and van Dyk, D. (1997). The EM algorithm—an old folk-song sung to a fast new tune. J. Roy. Statist. Soc. Ser. B, 59(3):511–567. With discussion and a reply by the authors. EM and MM Algorithms References

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092. Neal, R. M. and Hinton, G. E. (1999). A view of the EM algorithm that justifies incremental, sparse, and other variants. In Jordan, M. I., editor, Learning in graphical models, pages 355–368. MIT Press, Cambridge, MA, USA. Robert, C. P. and Casella, G. (2004). Monte Carlo Statistical Methods. Springer Texts in Statistics. Springer-Verlag, New York, second edition. Suchard, M. A., Wang, Q., Chan, C., Frelinger, J., Cron, A., and West, M. (2010). Understanding GPU programming for statistical computation: studies in massively parallel massive mixtures. Journal of Computational and Graphical Statistics, 19(2):419–438. Tanner, M. A. and Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. J. Amer. Statist. Assoc., 82(398):528–550. With discussion and with a reply by the authors. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B, 58(1):267–288. Tropp, J. A. (2006). Just relax: convex programming methods for identifying sparse signals in noise. IEEE Transactions on Information Theory, 52(3):1030–1051. Varadhan, R. and Roland, C. (2008). Simple and globally convergent methods for accelerating the convergence of any EM algorithm. Scand. J. Statist., 35(2):335–353. EM and MM Algorithms References

Wei, G. C. G. and Tanner, M. A. (1990). A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithms. Journal of the American Statistical Association, 85(411):699–704. Zhou, H., Alexander, D., and Lange, K. (2011). A quasi-Newton acceleration for high-dimensional optimization algorithms. Statistics and Computing, 21:261–273.