
arXiv:2001.02923v1 [eess.SP] 9 Jan 2020

A New Derivation for Gaussian Mixture Model Parameter Estimation: MM Based Approach

Nitesh Sahu, Prabhu Babu

Abstract—In this letter, we revisit the problem of maximum likelihood estimation (MLE) of the parameters of the Gaussian Mixture Model (GMM) and show a new derivation technique for estimating its parameters. The new derivation, unlike the classical expectation-maximization (EM) approach, is straightforward and does not invoke any hidden or latent variables or the calculation of conditional expectations. The new approach is based on finding a tighter lower bound of the log-likelihood criterion via minorization-maximization, and the update steps of the parameters obtained via the new derivation are the same as the update steps obtained via the classical EM algorithm.

Index Terms—Gaussian mixture model (GMM), Minorization-maximization (MM), Maximum-likelihood estimation (MLE).

I. INTRODUCTION

In the field of machine learning, pattern classification and statistics, one of the pivotal problems is to estimate the density or distribution function of observed data samples. In the parametric approach of density estimation, a parametric model for the distribution is assumed, and then the parameters of the model are determined using the finite record of observed data. A standard approach to estimate the parameters of the parametric model from the given data samples is maximum likelihood estimation (MLE). In practice, it is not always possible to describe the structure (distribution) of observed real-life data samples using a single distribution. To describe the complex structure in real-life data sets, a linear combination of several basic distributions is considered as a model for the distribution (density), known as the mixture model. When the component distributions involved in the mixture model are Gaussian, the model is called the Gaussian mixture model (GMM). Superposition of several component density functions can model complex distributions, which is not possible with a single component Gaussian distribution. GMMs are also widely used to find underlying clusters in data samples [1].

In this letter, we revisit the parameter estimation problem for GMM using the minorization-maximization (MM) approach, which does not require the introduction of latent (or hidden) variables and the computation of conditional expectations, which are however essential in classical EM based GMM parameter estimation. The proposed MM based approach of estimating the parameters of GMM is simple and straightforward to understand. The new derivation, based on the MM approach, produces the same parameter update expressions as those in the EM algorithm.

II. PROBLEM FORMULATION

In this section, we formulate the maximum likelihood parameter estimation problem for GMM. The most widely used model to describe the distribution of data samples is the Gaussian distribution, which is given by

N(x; μ, Σ) = (2π)^{-d/2} |Σ|^{-1/2} exp( -(1/2) (x - μ)^T Σ^{-1} (x - μ) )    (1)

where x ∈ R^{d×1} is a data sample, μ ∈ R^{d×1} denotes the mean and Σ ∈ R^{d×d}, Σ ≻ 0, represents the covariance matrix. A general mixture model is described by a superposition of K basic distributions:

p(x; θ) = Σ_{k=1}^{K} π_k f_k(x; θ_k)    (2)

where each density function f_k(x; θ_k) is called the k-th mixture component and π_k is called the mixing coefficient or mixture proportion. In order for p(x; θ) to qualify as a density function, the mixing coefficients {π_k}_{k=1}^{K} must satisfy π_k ≥ 0 ∀k and Σ_{k=1}^{K} π_k = 1. When each component density function f_k(x; θ_k) in the mixture model (2) is multivariate Gaussian, it is called the Gaussian mixture model (GMM). Therefore, the GMM can be written as

p(x; θ) = Σ_{k=1}^{K} π_k N(x; μ_k, Σ_k).    (3)

If the number of component mixture densities K is large enough, then the GMM can approximate almost any distribution defined on R^d [2]. We denote the parameters associated with the mixture density by θ = {π_k, μ_k, Σ_k}_{k=1}^{K}.

Given a data set D = {x_i}_{i=1}^{N} of N samples generated independently and identically from the GMM in (3), the problem is to estimate the parameter θ ∈ Θ using the data set D, where Θ is the parameter space. Before proceeding, for clarity of presentation, let us define the following function g_ik(φ_k), which will be used later often:

g_ik(φ_k) ≜ log( π_k N(x_i; μ_k, Σ_k) )    (4)

where φ_k = {π_k, μ_k, Σ_k}. After some manipulation, (4) can be written as

g_ik(φ_k) = log π_k - (d/2) log(2π) - (1/2) log|Σ_k| - (1/2) (x_i - μ_k)^T Σ_k^{-1} (x_i - μ_k).    (5)

Thus, from (4) we have

π_k N(x_i; μ_k, Σ_k) = e^{g_ik(φ_k)}.    (6)
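As an illustration (not part of the original letter), the densities (1) and (3) can be evaluated directly; the following minimal NumPy sketch does so. The function and variable names (gaussian_pdf, gmm_pdf, pis, mus, Sigmas) are our own choices.

```python
# Sketch of the Gaussian density (1) and the GMM density (3); names are our own.
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density N(x; mu, Sigma) as in (1)."""
    d = x.shape[0]
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)      # (x - mu)^T Sigma^{-1} (x - mu)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def gmm_pdf(x, pis, mus, Sigmas):
    """Mixture density p(x; theta) = sum_k pi_k N(x; mu_k, Sigma_k) as in (3)."""
    return sum(p * gaussian_pdf(x, m, S) for p, m, S in zip(pis, mus, Sigmas))

# Example: a two-component GMM in d = 2 dimensions.
x = np.array([0.5, -1.0])
pis = [0.3, 0.7]
mus = [np.zeros(2), np.ones(2)]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]
print(gmm_pdf(x, pis, mus, Sigmas))
```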

For the data set D we can write the likelihood function as

L(θ; D) ≜ ∏_{i=1}^{N} p(x_i; θ).    (7)

In MLE, the likelihood function is maximized to estimate the parameters of the model. Instead of maximizing L(θ; D), it is more convenient to maximize the logarithm of the likelihood function, called the log-likelihood and denoted as l(θ; D). Using (3), (6) and (7), l(θ; D) can be written as

l(θ; D) ≜ log L(θ; D) = Σ_{i=1}^{N} log( Σ_{k=1}^{K} e^{g_ik(φ_k)} )    (8)

where θ is related to φ_k via θ = {φ_k}_{k=1}^{K}. Since the logarithm is a monotonic function, the problem of estimating θ can be formulated as

maximize_{ {π_k, μ_k, Σ_k} }  l(θ; D)
subject to  π^T 1 = 1,  π ⪰ 0,  Σ_k ≻ 0  ∀k.    (9)

The problem in (9) is non-convex as the objective is not a concave function in the parameters of interest θ. Moreover, no closed form solution is available for problem (9). In the next section, we will see how the expectation maximization algorithm can be applied to arrive at a local maximizer of (9).
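The log-likelihood (8) is exactly a sum of log-sum-exp terms in the quantities g_ik(φ_k) of (5), which also suggests a numerically stable way to evaluate it. The following sketch is ours (not from the letter), assuming NumPy/SciPy; the helper names g_ik and log_likelihood are our own.

```python
# Sketch of g_ik in (5) and of the log-likelihood (8); helper names are our own.
import numpy as np
from scipy.special import logsumexp

def g_ik(x_i, pi_k, mu_k, Sigma_k):
    """g_ik(phi_k) = log(pi_k N(x_i; mu_k, Sigma_k)), expanded as in (5)."""
    d = x_i.shape[0]
    diff = x_i - mu_k
    _, logdet = np.linalg.slogdet(Sigma_k)             # log|Sigma_k| (Sigma_k > 0)
    quad = diff @ np.linalg.solve(Sigma_k, diff)
    return np.log(pi_k) - 0.5 * d * np.log(2 * np.pi) - 0.5 * logdet - 0.5 * quad

def log_likelihood(X, pis, mus, Sigmas):
    """l(theta; D) = sum_i log sum_k exp(g_ik(phi_k)) as in (8); X is N x d."""
    G = np.array([[g_ik(x_i, p, m, S) for p, m, S in zip(pis, mus, Sigmas)]
                  for x_i in X])                       # N x K matrix of g_ik values
    return logsumexp(G, axis=1).sum()
```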

III. EXPECTATION MAXIMIZATION (EM) ALGORITHM

In machine learning and statistics, maximum likelihood (ML) or maximum a posteriori (MAP) estimation of parameters is easy when complete data is available. However, when some data is missing and/or the model involves latent or hidden variables, estimation of the parameters becomes hard [2]. The EM algorithm [3], [4] is an iterative method to find the maximum likelihood estimate of the parameters of latent variable models (statistical models which involve latent or hidden variables). The EM algorithm alternates between two steps: the expectation (E) step and the maximization (M) step. In the E-step, the conditional expectation of the log-likelihood function is computed given the current estimate of the parameters, and in the M-step, the parameters are obtained by maximizing the conditional expectation of the log-likelihood function created in the E-step [5].

A. EM for GMM

In this subsection, we discuss the EM algorithm for GMM. We are given an observed data set D, and our goal is to find the parameters θ of the GMM described in (3) which model the data best. To find θ, our objective is to solve the MLE problem given in (9). The difficulty in maximizing (9) is due to the presence of the summation inside the logarithm of the objective function. The EM algorithm handles this issue by introducing latent variables and using the notion of the complete data log-likelihood. The following describes how the EM algorithm introduces latent variables in the GMM, which we feel is not that straightforward and can seem very abstract to a beginner trying to understand GMM.

Assume that the number of component densities, K, in the GMM is known. Let us define a K-dimensional binary random variable z = [z_1 ... z_K]^T, that is, each z_k ∈ {0, 1}. The random variable z is such that only a specific element is equal to 1 (z_k = 1) and the other elements are zeros. The random variable z can therefore take only K possible values {e_k}_{k=1}^{K}, where e_k denotes the k-th column of the K × K identity matrix. Hence z follows a categorical distribution over K categories (possible values), and this distribution can be defined in terms of the mixing coefficients {π_k}_{k=1}^{K} in (3) as prior probabilities, that is, the probability of z taking the value e_k is π_k: p(z = e_k) = π_k. Thus, we can write the distribution of z as

p(z) = ∏_{k=1}^{K} π_k^{z_k}.    (10)

Since we have already involved {π_k}_{k=1}^{K} to define p(z), it is natural to take the conditional distribution of x for a given value z = e_k to be N(x; μ_k, Σ_k), that is,

p(x | z = e_k) = N(x; μ_k, Σ_k)    (11)

which can be written as

p(x | z) = ∏_{k=1}^{K} N(x; μ_k, Σ_k)^{z_k}.    (12)

Thus, the joint distribution of x and z would be

p(x, z) = p(z) p(x | z) = ∏_{k=1}^{K} π_k^{z_k} N(x; μ_k, Σ_k)^{z_k}.    (13)

The posterior probability of z given x, p(z = e_k | x), which can also be written as p(z_k = 1 | x), is given by

p(z_k = 1 | x) = p(z = e_k | x)
             = p(z = e_k) p(x | z = e_k) / Σ_{j=1}^{K} p(z = e_j) p(x | z = e_j)
             = π_k N(x; μ_k, Σ_k) / Σ_{j=1}^{K} π_j N(x; μ_j, Σ_j).    (14)

Thus, we have successfully introduced the latent variable z and also defined the joint distribution of z and x in (13) for the GMM model. In the steps above we have associated a latent variable z with the variable x; similarly, we can associate a latent variable z_i with every data sample x_i. Instead of maximizing the log-likelihood of the incomplete data set D, one can look at maximizing the log-likelihood of the complete data set defined as D_c = {(x_i, z_i)}_{i=1}^{N}. The likelihood of the complete data can be written as

L_c(θ; D_c) = ∏_{i=1}^{N} p(x_i, z_i) = ∏_{i=1}^{N} ∏_{k=1}^{K} π_k^{z_k^i} N(x_i; μ_k, Σ_k)^{z_k^i}    (15)

where z_k^i represents the k-th element of z_i. Taking the logarithm, we get the complete data log-likelihood as

l_c(θ; D_c) = Σ_{i=1}^{N} Σ_{k=1}^{K} z_k^i log( π_k N(x_i; μ_k, Σ_k) ).    (16)
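The posterior probability in (14) is the quantity that reappears as the "responsibility" γ_ik in the E-step below. As a small illustration (ours, not part of the letter), it can be computed directly from the ratio in (14) using SciPy's multivariate normal; the function name posterior_z is our own.

```python
# Sketch of the posterior p(z_k = 1 | x) in (14); names are our own.
import numpy as np
from scipy.stats import multivariate_normal

def posterior_z(x, pis, mus, Sigmas):
    """Return the K-vector with entries pi_k N(x; mu_k, Sigma_k) / sum_j pi_j N(x; mu_j, Sigma_j)."""
    num = np.array([p * multivariate_normal.pdf(x, mean=m, cov=S)
                    for p, m, S in zip(pis, mus, Sigmas)])
    return num / num.sum()
```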

Now we invoke the EM algorithm, which comes in two steps: the expectation (E) step and the maximization (M) step. In the E-step, the conditional expectation of the complete data log-likelihood is computed, which is defined as follows:

Q(θ | θ_t) = E[ l_c(θ; D_c) | D, θ_t ]    (17)

where Q is called the auxiliary function [2], t indexes the iteration, and θ_t is the parameter value at the current iteration t. Therefore, using (16) in (17) we have

Q(θ | θ_t) = Σ_{i=1}^{N} Σ_{k=1}^{K} E[ z_k^i | D, θ_t ] log( π_k N(x_i; μ_k, Σ_k) )
           = Σ_{i=1}^{N} Σ_{k=1}^{K} p( z_k^i = 1 | x_i, θ_t ) log( π_k N(x_i; μ_k, Σ_k) )
           = Σ_{i=1}^{N} Σ_{k=1}^{K} γ_ik^t log( π_k N(x_i; μ_k, Σ_k) )    (18)

where

γ_ik^t ≜ π_k^t N(x_i; μ_k^t, Σ_k^t) / Σ_{j=1}^{K} π_j^t N(x_i; μ_j^t, Σ_j^t).    (19)

Since z_k^i is a binary random variable, that is, z_k^i ∈ {0, 1}, we have E[ z_k^i | D, θ_t ] = p( z_k^i = 1 | x_i, θ_t ).

In the M-step, the parameter update θ_{t+1} is obtained by maximizing Q(θ | θ_t) with respect to θ:

θ_{t+1} = arg max_{θ ∈ Θ} Q(θ | θ_t).    (20)

In [3], it is proved that when Q(θ | θ_t) increases, the likelihood of the observed data, l(θ; D), also increases, and hence a stationary point of l(θ; D) is attained. Without going into the details of solving (20), which can be found in [1], [2], the update equations for π_k, μ_k and Σ_k are given as [1], [2]:

π_k^{t+1} = ( Σ_{i=1}^{N} γ_ik^t ) / N,    (21)

μ_k^{t+1} = ( Σ_{i=1}^{N} γ_ik^t x_i ) / ( Σ_{i=1}^{N} γ_ik^t ),    (22)

Σ_k^{t+1} = ( 1 / Σ_{i=1}^{N} γ_ik^t ) Σ_{i=1}^{N} γ_ik^t ( x_i - μ_k^{t+1} )( x_i - μ_k^{t+1} )^T.    (23)
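For reference, one EM iteration — the responsibilities (19) followed by the updates (21)-(23) — can be written compactly. The sketch below is our own illustration (the name em_step and the array layout are ours); SciPy is used only for the Gaussian density.

```python
# One EM iteration for the GMM: E-step (19), M-step (21)-(23). Names are our own.
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pis, mus, Sigmas):
    N, d = X.shape
    K = len(pis)
    # E-step: gamma[i, k] = pi_k N(x_i; mu_k, Sigma_k) / sum_j pi_j N(x_i; mu_j, Sigma_j).
    dens = np.column_stack([p * multivariate_normal.pdf(X, mean=m, cov=S)
                            for p, m, S in zip(pis, mus, Sigmas)])      # N x K
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # M-step.
    Nk = gamma.sum(axis=0)                       # sum_i gamma_ik for each component
    pis_new = Nk / N                             # (21)
    mus_new = (gamma.T @ X) / Nk[:, None]        # (22)
    Sigmas_new = []
    for k in range(K):
        diff = X - mus_new[k]                    # N x d
        Sigmas_new.append((gamma[:, k, None] * diff).T @ diff / Nk[k])  # (23)
    return pis_new, mus_new, Sigmas_new
```

Iterating em_step from an initial guess monotonically increases l(θ; D), as guaranteed by the argument around (20) and, equivalently, by the MM argument of Section V.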

IV. MM PROCEDURE

In this section, we briefly describe the MM procedure for a minimization problem; the extension of this idea to a maximization problem is trivial. Consider the following minimization problem:

minimize_{u ∈ U} f(u)    (24)

where u is the variable and U is the constraint set. Let u_t denote the estimate of u at the t-th step of the MM procedure. A surrogate function g_f(u | u_t) is said to majorize the objective function f(u) at u_t if [6], [7]:

f(u) ≤ g_f(u | u_t)  ∀ u ∈ U    (25)

and

f(u_t) = g_f(u_t | u_t).    (26)

In the minimization step, g_f(u | u_t) is minimized instead of f(u), and the minimizer of g_f(u | u_t) becomes the estimate of u at the (t+1)-th iteration of MM; hence u_{t+1} can be written as

u_{t+1} = arg min_{u ∈ U} g_f(u | u_t).    (27)

The iterate u_{t+1} evaluated using (27) forces the original objective to decrease, as shown below:

f(u_{t+1}) ≤ g_f(u_{t+1} | u_t) ≤ g_f(u_t | u_t) = f(u_t)    (28)

where the first inequality follows from (25), the second from (27), and the equality from (26). Therefore, starting with an initial point u_0 ∈ U, the MM procedure generates a sequence {u_t} which monotonically decreases the objective value. Various techniques and methods to construct the surrogate function are given in [6], [8].

V. PROPOSED DERIVATION USING MM APPROACH

In this section, we approach the problem in (9) as a maximization problem, show a straightforward way to construct a minorizing surrogate function, and show how to arrive at the maximizer of the surrogate function. The parameter updates of this MM-based derivation are the same as in the case of the EM algorithm. However, the MM based derivation does not involve the introduction of any hidden variable or the computation of a conditional expectation. We feel that such a straightforward derivation of the parameter updates would make things clear to a beginner who is getting introduced to GMM. Before we move into the actual derivation, we discuss the log-sum-exp function, which will be useful in the proposed derivation. The log-sum-exp function is defined as [9]:

h(y) ≜ log( Σ_{k=1}^{n} e^{y_k} )    (29)

where y = [y_1 ... y_n]^T ∈ R^{n×1}. The log-sum-exp function h(y) is convex on R^{n×1}. Since h(y) is convex, a tight lower bound for h(y) at any y_t can be obtained by writing the first order Taylor approximation at y_t, as given below:

h(y) ≥ s_h(y | y_t) ≜ h(y_t) + ∇h(y_t)^T (y - y_t)    (30)

where ∇h(y_t) represents the gradient of h(y) computed at y_t, and equality is achieved at y = y_t, that is, h(y_t) = s_h(y_t | y_t). The gradient of h(y) can be computed as

∇h(y) = ( Σ_{k=1}^{n} e^{y_k} )^{-1} [ e^{y_1}, ..., e^{y_n} ]^T.    (31)
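The bound (30) with the gradient (31) is the only ingredient the proposed derivation needs, and it is easy to check numerically. The short script below is our own illustration; SciPy's logsumexp and softmax implement h and ∇h respectively.

```python
# Numerical check (ours) of the tangent lower bound (30) with the gradient (31).
import numpy as np
from scipy.special import logsumexp, softmax

rng = np.random.default_rng(0)
y_t = rng.normal(size=5)
grad = softmax(y_t)                          # (31): e^{y_k} / sum_j e^{y_j}
for _ in range(1000):
    y = rng.normal(size=5)
    lhs = logsumexp(y)                       # h(y)
    rhs = logsumexp(y_t) + grad @ (y - y_t)  # s_h(y | y_t) in (30)
    assert lhs >= rhs - 1e-12                # convexity of h implies (30)
```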

The objective function of problem (9) is

l(θ; D) = Σ_{i=1}^{N} log( Σ_{k=1}^{K} e^{g_ik(φ_k)} ).    (32)

We observe that the function l(θ; D) is a sum of log-sum-exp functions in g_ik(φ_k). We first compute a surrogate function for l(θ; D) at θ_t which lower bounds l(θ; D). Using (30) and (31), the lower bound for log( Σ_{k=1}^{K} e^{g_ik(φ_k)} ) at φ_k^t can be written as follows:

log( Σ_{k=1}^{K} e^{g_ik(φ_k)} ) ≥ log( Σ_{k=1}^{K} e^{g_ik(φ_k^t)} ) + (w_i^t)^T ( [g_i1(φ_1), ..., g_iK(φ_K)]^T - [g_i1(φ_1^t), ..., g_iK(φ_K^t)]^T )    (33)

where w_i^t = [w_i1^t, ..., w_iK^t]^T and

w_ik^t = e^{g_ik(φ_k^t)} / Σ_{j=1}^{K} e^{g_ij(φ_j^t)}.    (34)

Using (33), the lower bound for l(θ; D) at θ = θ_t, noting θ = {φ_k}_{k=1}^{K}, can be written as

l(θ; D) ≥ l(θ_t; D) + Σ_{i=1}^{N} (w_i^t)^T ( [g_i1(φ_1), ..., g_iK(φ_K)]^T - [g_i1(φ_1^t), ..., g_iK(φ_K^t)]^T )
        = Σ_{i=1}^{N} Σ_{k=1}^{K} w_ik^t g_ik(φ_k) + α_t
        = s_l(θ | θ_t) + α_t    (35)

where α_t ≜ l(θ_t; D) - Σ_{i=1}^{N} Σ_{k=1}^{K} w_ik^t g_ik(φ_k^t) is a constant and s_l(θ | θ_t) ≜ Σ_{i=1}^{N} Σ_{k=1}^{K} w_ik^t g_ik(φ_k). The function s_l(θ | θ_t) + α_t is a global lower bound for l(θ; D) at θ = θ_t, that is, l(θ; D) ≥ s_l(θ | θ_t) + α_t, and equality is achieved at θ = θ_t. As per the MM principle, we need to maximize the surrogate function s_l(θ | θ_t) + α_t in lieu of l(θ; D) to obtain the next update for θ, that is, θ_{t+1}. Hence, leaving out the constant term α_t, θ_{t+1} can be written as

θ_{t+1} = arg max_{ {π_k, μ_k, Σ_k} }  s_l(θ | θ_t)   subject to  π^T 1 = 1,  π ⪰ 0,  Σ_k ≻ 0  ∀k.    (36)

Using (5), s_l(θ | θ_t) can be written as

s_l(θ | θ_t) = Σ_{i=1}^{N} Σ_{k=1}^{K} w_ik^t [ -(1/2) (x_i - μ_k)^T Σ_k^{-1} (x_i - μ_k) - (1/2) log|Σ_k| + log π_k + c ]    (37)

where c = -(d/2) log(2π) is a constant.
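Before maximizing (37), note that the minorization property (35) is straightforward to verify numerically. The sketch below is our own illustration and reuses the helper functions g_ik and log_likelihood introduced in the sketch after (8); the name surrogate_plus_alpha is ours.

```python
# Numerical illustration (ours) of the surrogate in (34)-(35):
# l(theta; D) >= s_l(theta | theta_t) + alpha_t, with equality at theta = theta_t.
import numpy as np
from scipy.special import logsumexp

def surrogate_plus_alpha(X, theta, theta_t):
    """s_l(theta | theta_t) + alpha_t, where theta = (pis, mus, Sigmas)."""
    G_t = np.array([[g_ik(x, p, m, S) for p, m, S in zip(*theta_t)] for x in X])
    G   = np.array([[g_ik(x, p, m, S) for p, m, S in zip(*theta)]   for x in X])
    W = np.exp(G_t - logsumexp(G_t, axis=1, keepdims=True))   # weights w_ik^t in (34)
    alpha_t = log_likelihood(X, *theta_t) - (W * G_t).sum()
    return (W * G).sum() + alpha_t

# For any data X and any two parameter sets theta, theta_t one finds
#   log_likelihood(X, *theta) >= surrogate_plus_alpha(X, theta, theta_t),
# with equality when theta equals theta_t.
```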
We notice that s_l(θ | θ_t) is separable in {π_k} and {μ_k, Σ_k}; therefore, problem (36) can be maximized separately as two optimization problems, one in {π_k} and one in {μ_k, Σ_k}. The following problem is solved to obtain the next update π_k^{t+1}:

maximize_{ {π_k} }  Σ_{i=1}^{N} Σ_{k=1}^{K} w_ik^t log π_k   subject to  π^T 1 = 1,  π ⪰ 0    (38)

and π_k^{t+1} is given by

π_k^{t+1} = ( Σ_{i=1}^{N} w_ik^t ) / N    (39)

which is the same as the update equation obtained in (21). Next, the updates μ_k^{t+1} and Σ_k^{t+1} are obtained by solving the following problem:

minimize_{ {μ_k, Σ_k ≻ 0} }  Σ_{i=1}^{N} Σ_{k=1}^{K} w_ik^t [ (1/2) log|Σ_k| + (1/2) (x_i - μ_k)^T Σ_k^{-1} (x_i - μ_k) ]    (40)

and the solution is given by

μ_k^{t+1} = ( Σ_{i=1}^{N} w_ik^t x_i ) / ( Σ_{i=1}^{N} w_ik^t )    (41)

and

Σ_k^{t+1} = ( 1 / Σ_{i=1}^{N} w_ik^t ) Σ_{i=1}^{N} w_ik^t ( x_i - μ_k^{t+1} )( x_i - μ_k^{t+1} )^T.    (42)

Thus, we observe that the MM based approach yields the same update expressions for π_k^{t+1}, μ_k^{t+1} and Σ_k^{t+1} as obtained in (21), (22) and (23).

VI. CONCLUSION

In this paper, we have revisited the GMM and proposed a new way to derive its parameter update expressions using the MM procedure. The expressions obtained via the MM procedure are exactly the same as those obtained using the EM algorithm. The MM based derivation is simple and solves the maximum likelihood estimation problem directly, without introducing latent variables and computing conditional expectations.

REFERENCES

[1] C. M. Bishop, Pattern recognition and machine learning. Springer, 2006.
[2] K. P. Murphy, Machine learning: a probabilistic perspective. MIT Press, 2012.
[3] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society: Series B (Methodological), vol. 39, no. 1, pp. 1–22, 1977.
[4] R. A. Redner and H. F. Walker, "Mixture densities, maximum likelihood and the EM algorithm," SIAM Review, vol. 26, no. 2, pp. 195–239, 1984.
[5] E. Alpaydin, Introduction to machine learning. MIT Press, 2014.
[6] Y. Sun, P. Babu, and D. P. Palomar, "Majorization-minimization algorithms in signal processing, communications, and machine learning," IEEE Transactions on Signal Processing, vol. 65, no. 3, pp. 794–816, 2017.
[7] D. R. Hunter and K. Lange, "A tutorial on MM algorithms," The American Statistician, vol. 58, no. 1, pp. 30–37, 2004.
[8] K. Lange, D. R. Hunter, and I. Yang, "Optimization transfer using surrogate objective functions," Journal of Computational and Graphical Statistics, vol. 9, no. 1, pp. 1–20, 2000.
[9] S. Boyd and L. Vandenberghe, Convex optimization. Cambridge University Press, 2004.