EM and MM Algorithms


Hua Zhou
Department of Biostatistics, University of California, Los Angeles
Aug 10, 2016, SAMSI OPT Summer School

Slides and a Jupyter notebook for the lab session are available at:
https://github.com/Hua-Zhou/Public-Talks

[Figure: my flowchart for solving optimization problems]

Outline

- Overview
  - Problem setup
  - Review of Newton's method
- EM algorithm
  - Introduction
  - Canonical example: finite mixture model
  - In the zoo of EM
- MM algorithm
  - Introduction
  - MM examples
- Acceleration of EM/MM
  - SQUAREM
  - Quasi-Newton acceleration

Problem setup

We consider continuous optimization in this lecture:

    minimize f(x) subject to constraints on x,

where both the objective function f and the constraint set are continuous. I'll switch back and forth between maximization and minimization during the lecture (sorry!):

- Applied mathematicians talk about minimization exclusively
- Statisticians talk about maximization, due to maximum likelihood estimation

They are fundamentally the same problem: minimizing f(x) is equivalent to maximizing −f(x).

Review of Newton's method

Consider maximizing a log-likelihood L(θ), θ ∈ Θ ⊂ R^p.

- Newton's method was originally developed for finding roots of nonlinear equations f(x) = 0
- Newton's method (aka the Newton-Raphson method) is considered the gold standard for its fast (quadratic) convergence:

      ‖θ^(t+1) − θ*‖ / ‖θ^(t) − θ*‖² → constant

- Idea: iterative quadratic approximation
- Notation: ∇L(θ) is the gradient vector and d²L(θ) the Hessian matrix
- Statistical jargon:
  ∇L(θ): score vector; −d²L(θ): observed information matrix; E[−d²L(θ)]: expected (Fisher) information matrix
- Taylor expansion around the current iterate θ^(t):

      L(θ) ≈ L(θ^(t)) + ∇L(θ^(t))^T (θ − θ^(t)) + (1/2) (θ − θ^(t))^T d²L(θ^(t)) (θ − θ^(t)),

  and then maximize the quadratic approximation
- To maximize the quadratic function, we equate its gradient to zero,

      ∇L(θ^(t)) + [d²L(θ^(t))] (θ − θ^(t)) = 0_p,

  which suggests the next iterate

      θ^(t+1) = θ^(t) − [d²L(θ^(t))]^(−1) ∇L(θ^(t)) = θ^(t) + [−d²L(θ^(t))]^(−1) ∇L(θ^(t))

Issues with Newton's method and remedies

- Need to derive, evaluate, and "invert" the observed information matrix. In statistical problems, evaluating the Hessian often costs O(np²) flops and inverting it costs O(p³) flops.
  Remedies:
  1. automatic (analytical) differentiation
  2. numerical differentiation (works for small problems)
  3. quasi-Newton methods (BFGS and L-BFGS are probably the most popular black-box nonlinear optimizers)
  4. exploit structure whenever possible to reduce the cost of evaluating and inverting the Hessian
- Stability: the naïve Newton iterate is not guaranteed to be an ascent algorithm; it is equally happy to head uphill or downhill.
  Remedies:
  1. approximate −d²L(θ^(t)) by a positive definite A^(t) (if it is not), and
  2. line search (backtracking)

  By a first-order Taylor expansion,

      L(θ^(t) + s Δθ^(t)) − L(θ^(t)) = s ∇L(θ^(t))^T Δθ^(t) + o(s)
                                     = s ∇L(θ^(t))^T [A^(t)]^(−1) ∇L(θ^(t)) + o(s).

  For s sufficiently small, the right-hand side is strictly positive.

In summary, the Newton scheme iterates according to

    θ^(t+1) = θ^(t) + s [A^(t)]^(−1) ∇L(θ^(t)) = θ^(t) + s Δθ^(t),

where A^(t) is a positive definite approximation of −d²L(θ^(t)) and s is a step length.

- Line search strategies: step-halving (s = 1, 1/2, ...), golden section search, cubic interpolation, the Armijo rule, ...
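The damped Newton scheme with step-halving just described can be sketched in a few lines of numpy. The Poisson log-likelihood used as a test problem below is an illustrative assumption (not from the slides); its MLE is the sample mean, which makes the result easy to check.

```python
import numpy as np

def newton_maximize(loglik, grad, neg_hess, theta0, max_iter=100, tol=1e-8):
    """Damped Newton ascent: theta <- theta + s * [A]^{-1} grad(L), with step-halving."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        g = grad(theta)
        A = neg_hess(theta)                 # pd approximation of -d^2 L(theta)
        direction = np.linalg.solve(A, g)   # Newton direction, computed once
        s = 1.0
        # step-halving: shrink s until the objective actually increases
        while loglik(theta + s * direction) < loglik(theta) and s > 1e-10:
            s /= 2.0
        theta_new = theta + s * direction
        if np.linalg.norm(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta

# illustrative example: Poisson MLE, where the answer is the sample mean
y = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
loglik = lambda t: np.sum(y * np.log(t[0]) - t[0])
grad = lambda t: np.array([np.sum(y) / t[0] - len(y)])
neg_hess = lambda t: np.array([[np.sum(y) / t[0] ** 2]])
lam_hat = newton_maximize(loglik, grad, neg_hess, [1.0])
# lam_hat[0] converges to y.mean() = 2.8
```

Note that the Newton direction is solved for once per iteration; the line search only re-evaluates the objective, which matches the cost accounting discussed in the slides.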
  The Newton direction Δθ^(t) only needs to be calculated once; the cost of the line search lies mainly in objective function evaluations.
- How to approximate −d²L(θ)? This is more of an art than a science and often requires problem-specific analysis.
- Taking A^(t) = I leads to the method of steepest ascent, aka gradient ascent.

Fisher's scoring

Fisher's scoring method replaces −d²L(θ) by the expected (Fisher) information matrix

    I(θ) = E[−d²L(θ)] = E[∇L(θ) ∇L(θ)^T] ⪰ 0_{p×p},

which is positive semidefinite under exchangeability of expectation and differentiation. The Fisher scoring algorithm therefore iterates according to

    θ^(t+1) = θ^(t) + s [I(θ^(t))]^(−1) ∇L(θ^(t))

Constrained optimization

Main strategies for dealing with constrained optimization, e.g.,

    minimize f(x) subject to c_i(x) ≥ 0, i = 1, ..., m:

- Interior point method (barrier method):

      minimize f(x) − μ Σ_{i=1}^m log(c_i(x)),

  and push μ to 0
- Sequential quadratic programming (SQP): minimize a quadratic approximation of f subject to c_i(x) ≥ 0
- Active set method

The interior point method is the most successful for large problems.

EM algorithm

- Which are the most cited statistical papers?
      Paper                                        Citations   Per year
      Kaplan-Meier (Kaplan and Meier, 1958)            46886        808
      EM (Dempster et al., 1977)                       44050       1129
      Cox model (Cox, 1972)                            40920        930
      Metropolis (Metropolis et al., 1953)             31284        497
      FDR (Benjamini and Hochberg, 1995)               30975       1450
      Unit root test (Dickey and Fuller, 1979)         18259        493
      Lasso (Tibshirani, 1996)                         15306        765
      Bootstrap (Efron, 1979)                          12992        351
      FFT (Cooley and Tukey, 1965)                     11319        222
      Gibbs sampler (Gelfand and Smith, 1990)           6531        251

  Citation counts from Google Scholar on Feb 17, 2016.

- EM is one of the most influential statistical ideas, finding applications in various branches of science.

[Figure: general reference books on EM]

- Notation:
  - Y: observed data
  - Z: missing data
  - X = (Y, Z): complete data
- Goal: maximize the log-likelihood of the observed data, ln g(y|θ) (optimization!)
- Idea: choose Z such that the MLE for the complete data is trivial
- Let f(x|θ) = f(y, z|θ) be the density of the complete data
- Each iteration of EM contains two steps:
  - E step: calculate the conditional expectation

        Q(θ|θ^(t)) = E_{Z|Y=y,θ^(t)} [ ln f(Y, Z|θ) | Y = y, θ^(t) ]

  - M step: maximize Q(θ|θ^(t)) to generate the next iterate

        θ^(t+1) = argmax_θ Q(θ|θ^(t))

- Fundamental inequality of EM:

      Q(θ|θ^(t)) − ln g(y|θ) = E[ ln f(Y, Z|θ) | Y = y, θ^(t) ] − ln g(y|θ)
                             = E[ ln { f(Y, Z|θ) / g(Y|θ) } | Y = y, θ^(t) ]
                             ≤ E[ ln { f(Y, Z|θ^(t)) / g(Y|θ^(t)) } | Y = y, θ^(t) ]   (information inequality)
                             = Q(θ^(t)|θ^(t)) − ln g(y|θ^(t))

  Rearranging shows that

      ln g(y|θ) ≥ Q(θ|θ^(t)) − Q(θ^(t)|θ^(t)) + ln g(y|θ^(t))

  for all θ.
  Equality is achieved at θ = θ^(t).
- Ascent property of EM:

      ln g(y|θ^(t+1)) ≥ Q(θ^(t+1)|θ^(t)) − Q(θ^(t)|θ^(t)) + ln g(y|θ^(t)) ≥ ln g(y|θ^(t))

- Obviously we only need Q(θ^(t+1)|θ^(t)) − Q(θ^(t)|θ^(t)) ≥ 0 for this ascent property to hold (generalized EM)

[Figure: a picture to keep in mind — ln g(θ) with the shifted surrogates Q(θ|θ^(t)) + c^(t) and Q(θ|θ^(t+1)) + c^(t+1) lying below it, touching at θ^(t) and θ^(t+1)]

[Figure: one more picture — a 2D function]

Remarks on EM

- Works magically (at least to applied mathematicians): no need for gradients and Hessians
- Concavity of the observed log-likelihood function (the objective function) is not required
- Under regularity conditions, EM converges to a stationary point of ln g(y|θ). For a non-concave objective function, this can be a local mode or even a saddle point
- The original idea also appeared in the HMM paper on the Baum-Welch algorithm (Baum et al., 1970)
- Tends to separate variables – parallel computing

A canonical EM example – the finite mixture model

- Each data point y_i ∈ R^d comes from candidate distribution h_j with probability π_j, where Σ_{j=1}^k π_j = 1
- Mixture density: h(y) = Σ_{j=1}^k π_j h_j(y | μ_j, Ω_j), y ∈ R^d, where

      h_j(y | μ_j, Ω_j) = (2π)^(−d/2) det(Ω_j)^(−1/2) exp{ −(1/2) (y − μ_j)^T Ω_j^(−1) (y − μ_j) }

  are N(μ_j, Ω_j) densities
- Given data y_1, ..., y_n, we want to estimate

      θ = (π_1, ..., π_k, μ_1, ..., μ_k, Ω_1, ..., Ω_k).

  The observed (incomplete) data log-likelihood is

      ln g(y_1, ..., y_n | θ) = Σ_{i=1}^n ln h(y_i) = Σ_{i=1}^n ln Σ_{j=1}^k π_j h_j(y_i | μ_j, Ω_j)

EM for fitting the mixture model

- Missing data: let z_ij = I{y_i comes from group j}
- The complete
  data likelihood is

      f(y, z|θ) = Π_{i=1}^n Π_{j=1}^k [ π_j h_j(y_i | μ_j, Ω_j) ]^(z_ij),

  and thus the complete-data log-likelihood is

      ln f(y, z|θ) = Σ_{i=1}^n Σ_{j=1}^k z_ij [ ln π_j + ln h_j(y_i | μ_j, Ω_j) ]

E step

- Conditional expectation:

      Q(θ|θ^(t)) = E[ Σ_{i=1}^n Σ_{j=1}^k z_ij { ln π_j + ln h_j(y_i | μ_j, Ω_j) } | Y = y, θ^(t) ],

  where θ^(t) = (π^(t), μ_1^(t), ..., μ_k^(t), Ω_1^(t), ..., Ω_k^(t))
- By Bayes' rule, we have

      w_ij^(t) := E[ z_ij | y, θ^(t) ]
                = π_j^(t) h_j(y_i | μ_j^(t), Ω_j^(t)) / Σ_{j'=1}^k π_{j'}^(t) h_{j'}(y_i | μ_{j'}^(t), Ω_{j'}^(t))

- The Q function becomes

      Σ_{i=1}^n Σ_{j=1}^k w_ij^(t) ln π_j + Σ_{i=1}^n Σ_{j=1}^k w_ij^(t) [ −(1/2) ln det Ω_j − (1/2) (y_i − μ_j)^T Ω_j^(−1) (y_i − μ_j) ]

M step

- The maximizer of the Q function gives the next iterate:

      π_j^(t+1) = Σ_{i=1}^n w_ij^(t) / n
      μ_j^(t+1) = Σ_{i=1}^n w_ij^(t) y_i / Σ_{i=1}^n w_ij^(t)
      Ω_j^(t+1) = Σ_{i=1}^n w_ij^(t) (y_i − μ_j^(t+1)) (y_i − μ_j^(t+1))^T / Σ_{i=1}^n w_ij^(t)

- Why? These are the multinomial MLE and the weighted multivariate normal MLE

Remarks

- No gradient or Hessian calculation at all
- The simplex constraint on π and the psd constraint on the Ω_j are automatically satisfied by the EM iterates
- The "membership variables" w_ij^(t) can be computed in parallel.
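The E and M steps above translate directly into a short numpy sketch. The synthetic two-cluster data, the starting values, and the fixed iteration count below are illustrative assumptions; a real implementation would monitor the observed log-likelihood for convergence.

```python
import numpy as np

def mvn_pdf(y, mu, omega):
    """N(mu, omega) density evaluated at each row of y."""
    d = len(mu)
    diff = y - mu
    quad = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(omega), diff)
    return (2 * np.pi) ** (-d / 2) * np.linalg.det(omega) ** (-0.5) * np.exp(-0.5 * quad)

def em_gaussian_mixture(y, pi, mu, omega, n_iter=100):
    """EM for a k-component Gaussian mixture; y has shape (n, d)."""
    n, _ = y.shape
    k = len(pi)
    for _ in range(n_iter):
        # E step: membership weights w_ij by Bayes' rule
        w = np.column_stack([pi[j] * mvn_pdf(y, mu[j], omega[j]) for j in range(k)])
        w /= w.sum(axis=1, keepdims=True)
        # M step: multinomial MLE for pi, weighted normal MLEs for mu and omega
        nj = w.sum(axis=0)                      # effective group sizes
        pi = nj / n
        mu = [w[:, j] @ y / nj[j] for j in range(k)]
        omega = [((y - mu[j]).T * w[:, j]) @ (y - mu[j]) / nj[j] for j in range(k)]
    return pi, mu, omega

# illustrative synthetic data: two well-separated 2D clusters
rng = np.random.default_rng(0)
y = np.vstack([rng.normal([-2.0, -2.0], 0.5, size=(100, 2)),
               rng.normal([2.0, 2.0], 0.5, size=(100, 2))])
pi, mu, omega = em_gaussian_mixture(y, np.array([0.5, 0.5]),
                                    [np.array([-1.0, 0.0]), np.array([1.0, 0.0])],
                                    [np.eye(2), np.eye(2)])
# pi stays near (1/2, 1/2) and mu recovers the two cluster centers
```

Each update keeps π on the simplex and each Ω_j positive semidefinite automatically, exactly as the remarks above claim, and the E step's loop over components (or over observations) is embarrassingly parallel.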