Gaussian Mixture Models (GMM) and ML Estimation Examples 3
Let us look at the log likelihood function

\[
l(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log P(X_i \mid \theta)
= 2\left(\log\tfrac{2}{3} + \log\theta\right) + 3\left(\log\tfrac{1}{3} + \log\theta\right) + 3\left(\log\tfrac{2}{3} + \log(1-\theta)\right) + 2\left(\log\tfrac{1}{3} + \log(1-\theta)\right)
= C + 5\log\theta + 5\log(1-\theta),
\]

where $C$ is a constant that does not depend on $\theta$. It can be seen that the log likelihood function is easier to maximize than the likelihood function. Setting the derivative of $l(\theta)$ with respect to $\theta$ to zero,

\[
\frac{dl(\theta)}{d\theta} = \frac{5}{\theta} - \frac{5}{1-\theta} = 0,
\]

gives us the MLE, $\hat{\theta} = 0.5$. Recall that the method of moments estimate is $\hat{\theta} = 5/12$, which is different from the MLE.

Example 2: Suppose $X_1, X_2, \ldots, X_n$ are i.i.d. random variables with density function $f(x \mid \sigma) = \frac{1}{2\sigma}\exp\left(-\frac{|x|}{\sigma}\right)$. Please find the maximum likelihood estimate of $\sigma$.

Solution: The log-likelihood function is

\[
l(\sigma) = \sum_{i=1}^{n}\left[-\log 2 - \log\sigma - \frac{|X_i|}{\sigma}\right].
\]

Setting the derivative with respect to $\sigma$ to zero,

\[
l'(\sigma) = \sum_{i=1}^{n}\left[-\frac{1}{\sigma} + \frac{|X_i|}{\sigma^2}\right] = -\frac{n}{\sigma} + \frac{\sum_{i=1}^{n}|X_i|}{\sigma^2} = 0,
\]

gives us the MLE for $\sigma$:

\[
\hat{\sigma} = \frac{\sum_{i=1}^{n}|X_i|}{n}.
\]

Again this is different from the method of moments estimate, which is

\[
\hat{\sigma} = \sqrt{\frac{\sum_{i=1}^{n} X_i^2}{2n}}.
\]

Example 3: Use the maximum likelihood method to estimate the parameters $\mu$ and $\sigma$ for the normal density

\[
f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\},
\]

based on a random sample $X_1, \ldots, X_n$.

Solution: In this example we have two unknown parameters, $\mu$ and $\sigma$, so the parameter $\theta = (\mu, \sigma)$ is a vector. We first write out the log likelihood function:

\[
l(\mu, \sigma) = \sum_{i=1}^{n}\left[-\log\sigma - \frac{1}{2}\log 2\pi - \frac{1}{2\sigma^2}(X_i - \mu)^2\right]
= -n\log\sigma - \frac{n}{2}\log 2\pi - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(X_i - \mu)^2.
\]

Setting the partial derivatives to zero, we have

\[
\frac{\partial l(\mu,\sigma)}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(X_i - \mu) = 0,
\qquad
\frac{\partial l(\mu,\sigma)}{\partial \sigma} = -\frac{n}{\sigma} + \sigma^{-3}\sum_{i=1}^{n}(X_i - \mu)^2 = 0.
\]

Solving these equations gives us the MLEs for $\mu$ and $\sigma$:

\[
\hat{\mu} = \bar{X} \qquad \text{and} \qquad \hat{\sigma} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2}.
\]

This time the MLE is the same as the result of the method of moments. From these examples, we can see that the maximum likelihood result may or may not be the same as the result of the method of moments.
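As a quick numerical check of Example 3, the sketch below (the simulated data, the true values 2.0 and 1.5, and the use of numpy/scipy are illustrative assumptions, not part of the notes) minimizes the negative Gaussian log-likelihood and compares the result with the closed-form MLEs $\hat{\mu} = \bar{X}$ and $\hat{\sigma} = \sqrt{\frac{1}{n}\sum_i (X_i - \bar{X})^2}$.

```python
import numpy as np
from scipy.optimize import minimize

# Simulated sample X_1, ..., X_n (true mean 2.0 and true sigma 1.5 are arbitrary choices).
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=500)

def neg_log_likelihood(params, x):
    # -l(mu, sigma) = n*log(sigma) + (n/2)*log(2*pi) + sum((x - mu)^2) / (2*sigma^2)
    mu, sigma = params
    n = x.size
    return (n * np.log(sigma) + 0.5 * n * np.log(2 * np.pi)
            + np.sum((x - mu) ** 2) / (2 * sigma ** 2))

res = minimize(neg_log_likelihood, x0=[0.0, 1.0], args=(x,),
               bounds=[(None, None), (1e-6, None)])
mu_hat, sigma_hat = res.x

print(mu_hat, x.mean())                                   # numerical MLE vs. sample mean
print(sigma_hat, np.sqrt(np.mean((x - x.mean()) ** 2)))   # numerical MLE vs. closed-form sigma hat
```

The two numbers on each printed line should agree to within the optimizer's tolerance, matching the closed-form result derived above.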
n n From these examples,l(µ)= welog canf(X seeµ)= that(log µ + theµ log x maximum(µ + 1) log X ) likelihood result may or may not be the i| 0 ° i Xi=1 Xi=1 same as the result of method of moment.n = n log µ + nµ log x (µ + 1) log X 0 ° i Example 4: The Pareto distributionXi=1 has been used in economics as a model for a density function withLet the a derivative slowly with decaying respect to µ be zero: tail: dl(µ) n n = + n log x0 log Xi =0 dµ µ ° i=1 µ µ 1 f(x x , µ)=Xµx x° ° ,x x , µ > 1 | 0 0 ∏ 0 Assume that x0 > 0 is given and that X1,X2, ,Xn is an i.i.d. sample. Find the MLE of µ. ··· Solution: The log-likelihood function is n n l(µ)= log f(Xi µ)= (log µ + µ log x0 (µ + 1) log Xi) i=1 | i=1 ° X X n = n log µ + nµ log x (µ + 1) log X 0 ° i Xi=1 Let the derivative with respect to µ be zero: dl(µ) n n = + n log x log X =0 dµ µ 0 ° i Xi=1 4 based on a random sample X , ,X . 1 ··· n Solution: In this example, we have two unknown parameters, µ and æ, therefore the pa- Solutionrameter µ =(µ, æ) is a vector. We first write out the log likelihood function as n 1 1 n 1 n l(µ, æ)= log æ log 2º (X µ)2 = n log æ log 2º (X µ)2 ° ° 2 ° 2æ2 i ° ° ° 2 ° 2æ2 i ° Xi=1 ∑ ∏ Xi=1 • Sample mean and variance:Setting the partial derivative to be 0, we have @l(µ, æ) 1 n = (X µ)=0 @µ æ2 i ° Xi=1 n @l(µ, æ) n 3 2 = + æ° (X µ) =0 @æ °æ i ° Xi=1 Solving these equations will give us the MLE for µ and æ: 1 n µˆ = X andæ ˆ = (X X)2 v i un i=1 ° u X t This time the MLE is the same as the result of method of moment. From these examples, we can see that the maximum likelihood result may or may not be the same as the result of method of moment. Example 4: The Pareto distribution has been used in economics as a model for a density function with a slowly decaying tail: µ µ 1 f(x x , µ)=µx x° ° ,x x , µ > 1 | 0 0 ∏ 0 Assume that x0 > 0 is given and that X1,X2, ,Xn is an i.i.d. sample. Find the MLE of µ. ··· Solution: The log-likelihood function is n n l(µ)= log f(Xi µ)= (log µ + µ log x0 (µ + 1) log Xi) i=1 | i=1 ° X X n = n log µ + nµ log x (µ + 1) log X 0 ° i Xi=1 Let the derivative with respect to µ be zero: dl(µ) n n = + n log x log X =0 dµ µ 0 ° i Xi=1 Gaussian Mixture Model Gaussian Mixture Model Gaussian Mixture • Probabilistic story: Each clusterModel is associated with a • ProbabilisticGaussian distribution. story: Each To cluster generate is associated data, randomly with a choose Gaussiana cluster distribution.k with probability To generate⇡k and data,sample randomly from its choose distribution. • GMM a cluster k with probabilityK ⇡k and sample from its distribution.• Likelihood Pr(x)= ⇡k (x µk , ⌃k ) where K N | k =1 • LikelihoodK Pr(x)= X⇡ (x µ , ⌃ ) where k N | k k ⇡ = 1, 0 ⇡ k =11. k kX K k =1 X⇡ = 1, 0 ⇡ 1. k k kX=1 : : Sriram Sankararaman Clustering Sriram Sankararaman Clustering X is multidimensional. Ref: https://people.eecs.berkeley.edu/~jordan/courses/294-fall09/lectures/clustering/slides.pdf Can we use the ML estimation method to estimate the unknown parameters, Gaussian Mixture Model �1, �1, �4 ? • It is not easy: • Loss function is the negative log likelihood n K log Pr(x ⇡, µ, ⌃)= log ⇡k (x µk , ⌃k ) − | − ( N | ) Xi=1 kX=1 • Why is this function difficult to optimize? • Notice that the sum over the components appears inside the log, thus coupling all the parameters. However, it is possible to obtain an iterative solution! Sriram Sankararaman Clustering • Need a procedure that would optimize the log likelihood by working with the (easier) complete log likelihood. 
• We need a procedure that optimizes the log likelihood by working with the (easier) complete log likelihood:
• "Fill in" the latent variables using the current estimate of the parameters.
• Adjust the parameters based on the filled-in variables.

We can estimate the parameters using the iterative Expectation-Maximization (EM) algorithm.

• Given the observations $x_i$, $i = 1, 2, \ldots, n$, each $x_i$ is associated with a latent variable $z_i = (z_{i1}, \ldots, z_{iK})$.
• Given the complete data $(x, z) = (x_i, z_i)$, $i = 1, \ldots, n$, we can estimate the parameters by maximizing the complete data log likelihood

\[
\log \Pr(x, z \mid \pi, \mu, \Sigma) = \sum_{i=1}^{n}\sum_{k=1}^{K} z_{ik}\left\{\log\pi_k + \log\mathcal{N}(x_i \mid \mu_k, \Sigma_k)\right\}.
\]

• Notice that the $\pi_k$ and $(\mu_k, \Sigma_k)$ decouple, so a trivial closed-form solution exists. The latent variable $z_{ik}$ represents the contribution of the $k$-th Gaussian to $x_i$.
• Take the derivatives of the complete log-likelihood with respect to $\pi_k$, $\mu_k$, $\Sigma_k$ and set them to zero to obtain the update equations used in the EM algorithm.

The Expectation-Maximization (EM) algorithm: initialize with $\pi^{(0)}, \mu^{(0)}, \Sigma^{(0)}$ and iterate for $t = 1, 2, \ldots$. The update equations at the $t$-th iteration are:

• E-step: Given the current parameters, compute the responsibilities

\[
r_{ik} = E(z_{ik}) = \frac{\pi_k\,\mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{k'=1}^{K}\pi_{k'}\,\mathcal{N}(x_i \mid \mu_{k'}, \Sigma_{k'})}.
\]

• M-step: Maximize the expected complete log likelihood

\[
E[\log \Pr(x, z \mid \pi, \mu, \Sigma)] = \sum_{i=1}^{n}\sum_{k=1}^{K} r_{ik}\left\{\log\pi_k + \log\mathcal{N}(x_i \mid \mu_k, \Sigma_k)\right\}
\]

by updating the parameters

\[
\pi_k = \frac{\sum_i r_{ik}}{n}, \qquad
\mu_k = \frac{\sum_i r_{ik}\, x_i}{\sum_i r_{ik}}, \qquad
\Sigma_k = \frac{\sum_i r_{ik}\,(x_i - \mu_k)(x_i - \mu_k)^{T}}{\sum_i r_{ik}}.
\]

• Iterate until the likelihood converges.
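The following is a compact sketch of the EM updates above. The initialization choices, the small `1e-6 * np.eye(d)` regularization added to the covariances, and the fixed iteration count are illustrative assumptions; `scipy.stats.multivariate_normal` supplies the Gaussian density $\mathcal{N}(x \mid \mu_k, \Sigma_k)$.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    """Fit a K-component GMM to the (n, d) data matrix X with plain EM."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    pis = np.full(K, 1.0 / K)                          # pi^(0): uniform mixing weights
    mus = X[rng.choice(n, size=K, replace=False)]      # mu^(0): K random data points
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])  # Sigma^(0)

    for _ in range(n_iters):
        # E-step: responsibilities r_ik = pi_k N(x_i | mu_k, Sigma_k) / sum over components
        r = np.column_stack([pis[k] * multivariate_normal.pdf(X, mus[k], covs[k])
                             for k in range(K)])
        r /= r.sum(axis=1, keepdims=True)

        # M-step: closed-form updates from the expected complete log-likelihood
        nk = r.sum(axis=0)                             # effective number of points per component
        pis = nk / n
        mus = (r.T @ X) / nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            covs[k] = (r[:, k, None] * diff).T @ diff / nk[k] + 1e-6 * np.eye(d)

    return pis, mus, covs
```

A call such as `em_gmm(X, K=3)` returns the fitted mixing weights, means, and covariances; in practice one would also track the log-likelihood (for example with the sketch given earlier) and stop once it converges, as the last bullet suggests, rather than running a fixed number of iterations.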