The EM Algorithm in Multivariate Gaussian Mixture Models using Anderson Acceleration

by

Joshua H. Plasse

A Project Report Submitted to the Faculty of the WORCESTER POLYTECHNIC INSTITUTE in partial fulfillment of the requirements for the Degree of Master of Science in Applied Mathematics by May 2013

APPROVED:

Dr. Homer F. Walker, Thesis Advisor
Dr. Bogdan Vernescu, Department Head

Abstract

Over the years analysts have used the EM algorithm to obtain maximum likelihood estimates from incomplete data for various models. The general algorithm admits several appealing properties such as strong global convergence; however, the rate of convergence is linear, which in some cases may be unacceptably slow. This work is primarily concerned with applying Anderson acceleration to the EM algorithm for Gaussian mixture models (GMM) in hopes of alleviating slow convergence. As preamble we provide a review of maximum likelihood estimation and derive the EM algorithm in detail. The iterates that correspond to the GMM are then formulated and examples are provided. These examples show how faster convergence is experienced when the data are well separated, whereas much slower convergence is seen whenever the sample is poorly separated. The Anderson acceleration method is then presented, and its connection to the EM algorithm is discussed. The work is concluded by applying Anderson acceleration to the EM algorithm, which reduces the number of iterations required to obtain convergence.

Acknowledgments

I would first like to thank my advisor and project mentor Dr. Homer Walker. His guidance throughout the duration of this exposition proved invaluable. I also would like to express my gratitude to Professor Mayer Humi and Professor Marcus Sarkis for instilling a strong background in various areas of numerical analysis. I would also like to thank Dr. Joseph Fehribach for taking his time to conduct an independent study. Secondly, I would like to express my thanks to Kyle Dunn, Marisa Zemsky, William Sanguinet and Dan Brady, who helped vectorize MATLAB code, produce aesthetically appealing graphics, and alleviate problems in LaTeX. I would also like to acknowledge Mike Malone and Sia Najafi, who provided me with a powerful computer that was used extensively in the computation of large-scale data sets. Lastly, I would like to thank my mother, father, Briannah and Alannah for all the love and support that they have given me over the span of the last twenty-four years. To Caitlin, I express my utmost gratitude for the patience and encouragement you have shown me, which has helped me get where I am today. It is to them I dedicate this project.

This work was supported in part through U.S. Department of Energy Office of Science Grant DE-SC0004880 to Worcester Polytechnic Institute.

Contents

Abstract
Acknowledgments
List of Figures
List of Tables
Introduction
    Maximum Likelihood Estimation
        Calculating an MLE Analytically
The EM Algorithm
    General EM
    Convergence Properties
    EM Algorithm for Mixture Models
        EM on Normal Mixtures
    Examples
        Well-Separated Case (WSC)
        Poorly-Separated Case (PSC)
        Observations
Anderson Acceleration
    Fixed-Point Iteration
    Acceleration Algorithm
    Example
Conclusions
    Future Work
Bibliography
Appendix A
Appendix B

List of Figures

1. Scatter Plot of our Random Sample (WSC)
2. After 3 Iterations (WSC)
3. After 5 Iterations (WSC)
4. After 15 Iterations (WSC)
5. Mean Errors - log ||µ_{i+1} - µ_i||_1 (WSC)
6. Proportion Errors - log ||α_{i+1} - α_i||_1 (WSC)
7. Covariance Errors - log ||Σ_{i+1} - Σ_i||_1 (WSC)
8. Applying K-Means to our Sample (WSC)
9. Mean and Proportion Errors using K-Means (WSC)
10. Covariance Errors using K-Means (WSC)
11. Scatter Plot of our Random Sample (PSC)
12. After 50 Iterations (PSC)
13. After 100 Iterations (PSC)
14. After 250 Iterations (PSC)
15. Mean Errors - log ||µ_{i+1} - µ_i||_1 (PSC)
16. Proportion Errors - log ||α_{i+1} - α_i||_1 (PSC)
17. Covariance Errors - log ||Σ_{i+1} - Σ_i||_1 (PSC)
18. Applying K-Means to our Sample (PSC)
19. After 25 Iterations using K-Means
20. After 150 Iterations using K-Means
21. Plots of Performance With and Without Acceleration

List of Tables

1. True Parameter Values (WSC)
2. Final Approximated Parameter Values (WSC)
3. Initial Parameter Guesses Using K-Means (WSC)
4. True Parameter Values (PSC)
5. Approximated Parameter Values (PSC)
6. Initial Parameter Guesses using K-Means (PSC)
7. Without Acceleration
8. With Acceleration

Introduction

Of interest is the estimation of parameters in a mixture model where all underlying components are multivariate Gaussian distributions of dimension at least two. To be precise, throughout this exposition the model used will be a Gaussian mixture model (GMM) that represents a population composed of m ∈ Z^+ subpopulations. To accompany our model we also assume that we have an unlabeled random sample X = (x_1, x_2, ..., x_N) which was obtained in an independent and identically distributed (iid) fashion. An unlabeled sample in this sense means that for any x_k ∈ X, k = 1, ..., N, the true subpopulation to which x_k belongs is not known. The goal is to make accurate statistical inferences on properties of the subpopulations using only the unlabeled sample X. As will be seen in the sequel, maximum likelihood estimation via the EM algorithm is a powerful tool in obtaining accurate parameter estimates for the GMM as well as various other models.
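To make this setting concrete, the short MATLAB sketch below draws such an unlabeled iid sample from a two-component bivariate GMM. Every numerical value in the sketch (the sample size, proportions, means and covariances) is a hypothetical choice made only for illustration; none of it is taken from the examples studied later in the report.

    % Minimal sketch: draw an unlabeled iid sample of size N from a
    % two-component bivariate GMM. All parameter values are hypothetical.
    N     = 500;                                   % sample size
    alpha = [0.3 0.7];                             % mixture proportions, sum to one
    mu    = [0 0; 4 4];                            % row i is the mean vector mu_i
    Sigma = cat(3, eye(2), [1 0.5; 0.5 2]);        % Sigma(:,:,i) is the covariance Sigma_i

    X = zeros(N, 2);
    for k = 1:N
        i = find(rand <= cumsum(alpha), 1);        % choose subpopulation i with probability alpha_i
        L = chol(Sigma(:, :, i), 'lower');         % factor Sigma_i = L*L'
        X(k, :) = (mu(i, :)' + L * randn(2, 1))';  % draw x_k from N(mu_i, Sigma_i)
    end
    % The component index i is discarded after each draw, so X is an
    % unlabeled sample in the sense described above.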
Mathematically we define the GMM as

    p(x \mid \Phi) = \sum_{i=1}^{m} \alpha_i \, p_i(x \mid \phi_i),        (1)

where x = (x_1, ..., x_d)^T ∈ R^d, φ_i = (µ_i, Σ_i), Φ = (α_1, ..., α_m, φ_1, ..., φ_m) ∈ Ω, and each p_i is a d-dimensional multivariate Gaussian distribution given by

    p_i(x \mid \phi_i) = \frac{1}{(2\pi)^{d/2} (\det \Sigma_i)^{1/2}} \, e^{-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)}.        (2)

Similar to the univariate Gaussian distribution, µ_i ∈ R^d represents the mean vector for the i-th subpopulation, whereas Σ_i is the d × d symmetric positive definite covariance matrix that corresponds to the i-th subpopulation. The collection of α_i's are known as the model's mixture proportions, i.e. each α_i represents the probability that a randomly selected x_k ∈ X was generated from the i-th subpopulation. Since each α_i is a probability, it follows that α_i ∈ [0, 1] for all i, and the α_i's are constrained to sum to one. Due to this constraint, it follows that p(x|Φ) is a probability density function (pdf), since

    1 = \sum_{i=1}^{m} \alpha_i
      = \sum_{i=1}^{m} \alpha_i \int_{\mathbb{R}^d} p_i(x \mid \phi_i) \, dx
      = \int_{\mathbb{R}^d} \sum_{i=1}^{m} \alpha_i \, p_i(x \mid \phi_i) \, dx
      = \int_{\mathbb{R}^d} p(x \mid \Phi) \, dx.

Now that we have defined the model, we seek numerical methods that will allow one to compute accurate estimates for Φ. Since ultimately our goal is to introduce, derive and accelerate the EM algorithm for (1), which is used for maximum likelihood estimation, we provide the reader with a review of the method.

Maximum Likelihood Estimation

The notation that will be developed reflects that used in Casella and Berger [2]. First we suppose that we have an n-tuple of random vectors X = (X_1, X_2, ..., X_n) that was generated in an iid fashion. We also assume that the distribution of X depends on a fixed unknown parameter θ = (θ_1, ..., θ_k) that takes its value in the parameter space Θ. Thus for any X_i ∈ X, the individual probability density (mass) function (pdf (pmf)) for X_i will be denoted by X_i ∼ f(x_i | θ) for i = 1, ..., n. We may now take advantage of our sample being iid by defining the joint pdf (pmf) for X as

    f(x \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta_1, \ldots, \theta_k).        (3)

Here x = (x_1, ..., x_n) denotes the observed values of X_1, ..., X_n. To understand how to compute a maximum likelihood estimate (MLE) for a given distribution, one must be familiar with the concept of likelihood functions, which we define below.

Definition 1. Let f(x | θ) denote the joint pdf (pmf) of our random sample X = (X_1, ..., X_n), defined as in (3). Then given the observed values X = x, the likelihood function is defined as

    L(\theta \mid x) = L(\theta_1, \ldots, \theta_k \mid x_1, \ldots, x_n) = f(x \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta_1, \ldots, \theta_k).        (4)

The reader should be aware
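To illustrate how the mixture density (1)-(2) combines with the iid likelihood (4), the following MATLAB sketch evaluates the mixture density at every observation; summing the logarithms of these values then gives a log-likelihood for the GMM. The helper gmm_density and the parameter shapes it assumes are illustrative assumptions only and are not taken from the code in the appendices.

    function p = gmm_density(X, alpha, mu, Sigma)
    % Evaluate the mixture density (1) at every row of X.
    % X: N-by-d data, alpha: 1-by-m proportions, mu: m-by-d means,
    % Sigma: d-by-d-by-m covariances. (Illustrative shapes, not the report's code.)
        [N, d] = size(X);
        p = zeros(N, 1);
        for i = 1:numel(alpha)
            dx   = X - repmat(mu(i, :), N, 1);                     % x - mu_i for each sample
            quad = sum((dx / Sigma(:, :, i)) .* dx, 2);            % (x - mu_i)' Sigma_i^{-1} (x - mu_i)
            c    = 1 / ((2*pi)^(d/2) * sqrt(det(Sigma(:, :, i)))); % normalizing constant in (2)
            p    = p + alpha(i) * c * exp(-quad / 2);              % accumulate alpha_i * p_i(x | phi_i)
        end
    end

    % For an iid sample drawn from the GMM, the likelihood (4) is the product of
    % the mixture density over the sample, so its logarithm can be computed as
    %   loglik = sum(log(gmm_density(X, alpha, mu, Sigma)));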