
Gaussian Mixture Models (GMM) and ML Estimation Examples

Let us look at the log likelihood function

\[
l(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log P(X_i \mid \theta)
= 2\left(\log\tfrac{1}{3} + \log\theta\right) + 3\left(\log\tfrac{1}{3} + \log\theta\right) + 3\left(\log\tfrac{1}{3} + \log(1-\theta)\right) + 2\left(\log\tfrac{1}{3} + \log(1-\theta)\right)
= C + 5\log\theta + 5\log(1-\theta)
\]
where C is a constant that does not depend on θ. It can be seen that the log likelihood function is easier to maximize than the likelihood function itself. Setting the derivative of l(θ) with respect to θ to zero:

\[
\frac{dl(\theta)}{d\theta} = \frac{5}{\theta} - \frac{5}{1-\theta} = 0
\]
and the solution gives us the MLE, which is \(\hat\theta = 0.5\). Recall that the method of moments estimate is \(\hat\theta = 5/12\), which is different from the MLE.
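As a quick numerical check, the same maximizer can be found by evaluating the log likelihood on a grid (a minimal sketch in Python/numpy; the constant C is dropped because it does not affect the argmax):

```python
import numpy as np

# l(theta) = C + 5*log(theta) + 5*log(1 - theta); C is omitted.
theta = np.linspace(0.001, 0.999, 9999)
log_lik = 5 * np.log(theta) + 5 * np.log(1 - theta)
print(theta[np.argmax(log_lik)])  # ~0.5, matching the analytic MLE
```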

Example 2: Suppose \(X_1, X_2, \ldots, X_n\) are i.i.d. random variables with density function
\[
f(x \mid \sigma) = \frac{1}{2\sigma}\exp\left(-\frac{|x|}{\sigma}\right).
\]
Find the maximum likelihood estimate of σ.
Solution: The log-likelihood function is

\[
l(\sigma) = \sum_{i=1}^{n}\left[-\log 2 - \log\sigma - \frac{|X_i|}{\sigma}\right]
\]
Setting the derivative with respect to σ to zero:

\[
l'(\sigma) = \sum_{i=1}^{n}\left[-\frac{1}{\sigma} + \frac{|X_i|}{\sigma^2}\right] = -\frac{n}{\sigma} + \frac{\sum_{i=1}^{n}|X_i|}{\sigma^2} = 0
\]
and this gives us the MLE for σ as

\[
\hat\sigma = \frac{\sum_{i=1}^{n}|X_i|}{n}
\]
Again, this is different from the method of moments estimate, which is
\[
\hat\sigma = \sqrt{\frac{\sum_{i=1}^{n} X_i^2}{2n}}
\]
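A small simulation makes the difference between the two estimators concrete (a sketch in Python/numpy; the true scale σ = 2.0 and the sample size are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_true = 2.0                          # assumed value, for illustration only
x = rng.laplace(loc=0.0, scale=sigma_true, size=1000)

sigma_mle = np.mean(np.abs(x))            # MLE: sum |X_i| / n
sigma_mom = np.sqrt(np.mean(x**2) / 2)    # method of moments: sqrt(sum X_i^2 / (2n))
print(sigma_mle, sigma_mom)               # both near 2.0, but generally not equal
```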

Example 3: Find the maximum likelihood estimates of the parameters µ and σ for the normal density
\[
f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\},
\]
based on a random sample \(X_1, \ldots, X_n\), and compare them with the method of moments estimates.
Solution: In this example we have two unknown parameters, µ and σ, so the parameter θ = (µ, σ) is a vector. We first write out the log likelihood function as
\[
l(\mu, \sigma) = \sum_{i=1}^{n}\left[-\log\sigma - \frac{1}{2}\log 2\pi - \frac{1}{2\sigma^2}(X_i - \mu)^2\right] = -n\log\sigma - \frac{n}{2}\log 2\pi - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(X_i - \mu)^2
\]
Setting the partial derivatives to zero, we have
\[
\frac{\partial l(\mu,\sigma)}{\partial\mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(X_i - \mu) = 0
\]
\[
\frac{\partial l(\mu,\sigma)}{\partial\sigma} = -\frac{n}{\sigma} + \sigma^{-3}\sum_{i=1}^{n}(X_i - \mu)^2 = 0
\]
Solving these equations gives the MLE for µ and σ:
\[
\hat\mu = \bar X \qquad\text{and}\qquad \hat\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar X)^2}
\]
This time the MLE is the same as the method of moments result. From these examples, we can see that the maximum likelihood result may or may not be the same as the result of the method of moments.
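The closed-form MLEs are just the sample mean and the 1/n (biased) sample standard deviation; a minimal Python/numpy sketch (the simulated µ = 3 and σ = 1.5 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=1.5, size=500)   # assumed mu=3, sigma=1.5 for illustration

mu_hat = np.mean(x)        # MLE of mu: the sample mean
sigma_hat = np.std(x)      # MLE of sigma: sqrt(sum (X_i - Xbar)^2 / n), i.e. ddof=0
print(mu_hat, sigma_hat)
```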

n n From these examples,l(µ)= welog canf(X seeµ)= that(log µ + theµ log x maximum(µ + 1) log X ) likelihood result may or may not be the i| 0 ° i Xi=1 Xi=1 same as the result of method of moment.n = n log µ + nµ log x (µ + 1) log X 0 ° i Example 4: The Pareto distributionXi=1 has been used in economics as a model for a density function withLet the a derivative slowly with decaying respect to µ be zero: tail: dl(µ) n n = + n log x0 log Xi =0 dµ µ ° i=1 µ µ 1 f(x x , µ)=Xµx x° ° ,x x , µ > 1 | 0 0 ∏ 0




Gaussian Mixture Model
• Probabilistic story: Each cluster is associated with a Gaussian distribution. To generate data, randomly choose a cluster k with probability π_k and sample from its distribution.
• Likelihood:
\[
\Pr(x) = \sum_{k=1}^{K}\pi_k\,\mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K}\pi_k = 1,\quad 0 \le \pi_k \le 1.
\]


X is multidimensional. (Ref: https://people.eecs.berkeley.edu/~jordan/courses/294-fall09/lectures/clustering/slides.pdf)
Can we use the ML estimation method to estimate the unknown parameters π_k, µ_k, Σ_k of the Gaussian Mixture Model?
• It is not easy:

• Loss function is the negative log likelihood

\[
\log\Pr(x \mid \pi, \mu, \Sigma) = \sum_{i=1}^{n}\log\left\{\sum_{k=1}^{K}\pi_k\,\mathcal{N}(x_i \mid \mu_k, \Sigma_k)\right\}
\]
• Why is this function difficult to optimize?
• Notice that the sum over the components appears inside the log, thus coupling all the parameters.
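To make the objective concrete, here is a sketch of evaluating this log likelihood in Python with numpy/scipy (the array shapes are assumptions, and logsumexp is used for numerical stability):

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(x, pi, mus, Sigmas):
    """x: (n, d) data; pi: (K,) mixing weights; mus: (K, d) means; Sigmas: (K, d, d) covariances."""
    # log(pi_k * N(x_i | mu_k, Sigma_k)) for every point i and component k
    log_terms = np.stack([
        np.log(pi[k]) + multivariate_normal.logpdf(x, mean=mus[k], cov=Sigmas[k])
        for k in range(len(pi))
    ], axis=1)                                   # shape (n, K)
    return np.sum(logsumexp(log_terms, axis=1))  # sum_i log sum_k pi_k N(x_i | mu_k, Sigma_k)
```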

However, it is possible to obtain an iterative solution!
• We need a procedure that optimizes the log likelihood by working with the (easier) complete-data log likelihood.
• "Fill in" the latent variables using the current estimate of the parameters.
• Adjust the parameters based on the filled-in variables.

We can estimate the parameters using the iterative Expectation-Maximization (EM) algorithm.

• Given the observations x_i, i = 1, 2, ..., n

• Each x_i is associated with a latent indicator vector z_i = (z_{i1}, ..., z_{iK}).
• Given the complete data (x, z) = (x_i, z_i), i = 1, ..., n,
• we can estimate the parameters by maximizing the complete-data log likelihood.

\[
\log\Pr(x, z \mid \pi, \mu, \Sigma) = \sum_{i=1}^{n}\sum_{k=1}^{K} z_{ik}\left\{\log\pi_k + \log\mathcal{N}(x_i \mid \mu_k, \Sigma_k)\right\}
\]

• Notice that the π_k and the (µ_k, Σ_k) decouple: a trivial closed-form solution exists.
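To see the decoupling, here is a minimal Python/numpy sketch of the closed-form estimates when the assignments z are known (z is assumed to be an (n, K) one-hot matrix; empty components are not handled):

```python
import numpy as np

def mle_given_assignments(x, z):
    """x: (n, d) data; z: (n, K) one-hot assignments. Returns (pi, mus, Sigmas)."""
    n_k = z.sum(axis=0)                             # number of points in each component
    pi = n_k / len(x)                               # mixing weights
    mus = (z.T @ x) / n_k[:, None]                  # per-component sample means
    Sigmas = np.stack([
        ((x - mus[k]).T * z[:, k]) @ (x - mus[k]) / n_k[k]   # per-component sample covariances
        for k in range(z.shape[1])
    ])
    return pi, mus, Sigmas
```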

The latent variable z_{ik} represents the contribution of the k-th Gaussian to x_i. Take the derivatives of the log-likelihood with respect to π, µ, Σ and set them to zero to obtain the update equations used in the EM algorithm.

The Expectation-Maximization (EM) algorithm

• Initialize π_k, µ_k, Σ_k.
• Iterate for t = 1, 2, …; update equations at the t-th iteration:

• E-step: Given parameters, compute

\[
r_{ik} = E(z_{ik}) = \frac{\pi_k\,\mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{k'=1}^{K}\pi_{k'}\,\mathcal{N}(x_i \mid \mu_{k'}, \Sigma_{k'})}
\]
• M-step: Maximize the expected complete log likelihood
\[
E[\log\Pr(x, z \mid \pi, \mu, \Sigma)] = \sum_{i=1}^{n}\sum_{k=1}^{K} r_{ik}\left\{\log\pi_k + \log\mathcal{N}(x_i \mid \mu_k, \Sigma_k)\right\}
\]
by updating the parameters
\[
\pi_k = \frac{\sum_i r_{ik}}{n}, \qquad \mu_k = \frac{\sum_i r_{ik}\,x_i}{\sum_i r_{ik}}, \qquad \Sigma_k = \frac{\sum_i r_{ik}\,(x_i - \mu_k)(x_i - \mu_k)^{T}}{\sum_i r_{ik}}
\]
• Iterate till the likelihood converges.
• Converges to a local optimum of the log likelihood.
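Putting the E-step and M-step together, here is a compact (unregularized) EM sketch in Python/numpy; the initialization scheme and array shapes are assumptions, and gmm_log_likelihood from the earlier sketch could be reused to monitor convergence:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(x, K, n_iter=100, seed=0):
    """Fit a K-component GMM to x of shape (n, d) with a fixed number of EM iterations."""
    n, d = x.shape
    rng = np.random.default_rng(seed)
    # crude initialization: random data points as means, shared data covariance, uniform weights
    mus = x[rng.choice(n, K, replace=False)]
    Sigmas = np.stack([np.cov(x.T) for _ in range(K)])
    pi = np.full(K, 1.0 / K)

    for _ in range(n_iter):
        # E-step: responsibilities r_ik proportional to pi_k * N(x_i | mu_k, Sigma_k)
        r = np.stack([pi[k] * multivariate_normal.pdf(x, mus[k], Sigmas[k])
                      for k in range(K)], axis=1)
        r /= r.sum(axis=1, keepdims=True)

        # M-step: weighted mixing weights, means, and covariances
        n_k = r.sum(axis=0)
        pi = n_k / n
        mus = (r.T @ x) / n_k[:, None]
        Sigmas = np.stack([
            ((x - mus[k]).T * r[:, k]) @ (x - mus[k]) / n_k[k]
            for k in range(K)
        ])
    return pi, mus, Sigmas
```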


It may not converge to the global optimum!

Example

• GMM example
– Training set: N = 900 examples from a uniform pdf inside an annulus
– Model: GMM with K = 30 Gaussian components
– Training procedure
• Gaussian centers initialized by choosing 30 arbitrary training examples
• Covariance matrices initialized to be diagonal, with large variance compared to that of the training data
• To avoid singularities, at every iteration the covariance matrices computed with EM were regularized with a small multiple of the identity matrix (see the code sketch after this list)
• Components whose mixing coefficients fell below a threshold were removed

– Illustrative results are shown in the figure below.
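A small Python/numpy sketch of the two safeguards mentioned in the procedure above (the regularization constant and the pruning threshold are arbitrary choices, shown as they might be applied after each M-step):

```python
import numpy as np

def regularize_covariances(Sigmas, eps=1e-3):
    """Add a small multiple of the identity to each covariance to avoid singularities."""
    d = Sigmas.shape[-1]
    return Sigmas + eps * np.eye(d)

def prune_components(pi, mus, Sigmas, threshold=1e-3):
    """Drop components whose mixing coefficient fell below a threshold and renormalize."""
    keep = pi > threshold
    pi = pi[keep]
    return pi / pi.sum(), mus[keep], Sigmas[keep]
```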

[Figure: EM fitting of the 30-component GMM to the annulus data (observations shown as blue dots, ellipses representing the 2-D Gaussians) at iterations 0, 25, 50, 75, 275, and 300.]

Vector quantization = k-means clustering
• k-means clustering
– The k-means algorithm is a simple procedure that attempts to group a collection of unlabeled examples X = {x_1, ..., x_N} into one of K clusters.
– k-means seeks to find compact clusters, measured by
\[
J_{\mathrm{MSE}} = \sum_{i=1}^{K}\sum_{x \in \omega_i}\lVert x - \mu_i\rVert^2, \qquad \mu_i = \frac{1}{N_i}\sum_{x \in \omega_i} x
\]
– It can be shown that k-means is a special case of the GMM-EM algorithm.
– Procedure (a code sketch follows the steps):

1. Define the number of clusters.
2. Initialize clusters by (a) an arbitrary assignment of examples to clusters or (b) an arbitrary set of cluster centers (i.e., use some examples as centers).
3. Compute the sample mean of each cluster.
4. Reassign each example to the cluster with the nearest mean.
5. If the classification of all samples has not changed, stop; else go to step 3.
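A minimal k-means sketch in Python/numpy that follows the steps above, using option (b) for initialization (empty clusters are not handled):

```python
import numpy as np

def kmeans(x, K, n_iter=100, seed=0):
    """x: (n, d) data. Returns cluster centers (K, d) and labels (n,)."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), K, replace=False)]        # step 2(b): examples as centers
    for _ in range(n_iter):
        # step 4: assign each example to the cluster with the nearest mean
        dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # step 3: recompute the sample mean of each cluster
        new_centers = np.stack([x[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centers, centers):                # step 5: stop when nothing changes
            break
        centers = new_centers
    return centers, labels
```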

You may initialize the GMM algorithm from k-means cluster centers!

Example: Speaker Identification

• Feature extraction from the data using the mel-cepstrum (MFCC)

• Extract feature vectors for each speaker (e.g., from 90 s of data)
• Frame length: 10 ms

Speaker model: GMM

The GMM model λ represents a person (speaker).

Mixture weights p_i, with \(\sum_i p_i = 1\).

Train the model from the observations using the iterative EM algorithm, one model per speaker.

Classifying speakers

• Similarly, all of the speech produced by one speaker will cluster differently in MFCC space than speech from another speaker.
• We can ∴ decide if a given observation comes from one speaker or another.

How do we decide?

[Figure: observation matrix of MFCC features over time frames 1 … T; the speaker models are compared through their likelihoods P(O | λ), and the model with the larger likelihood wins.]
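The decision rule amounts to scoring the observation matrix under each speaker's GMM and picking the best one. A hedged Python sketch (it assumes one trained GMM per speaker and reuses the gmm_log_likelihood helper sketched earlier in these notes):

```python
def identify_speaker(mfcc_frames, speaker_models):
    """mfcc_frames: (T, d) observation matrix; speaker_models: dict name -> (pi, mus, Sigmas)."""
    # gmm_log_likelihood is the evaluation sketch given earlier in these notes
    scores = {name: gmm_log_likelihood(mfcc_frames, pi, mus, Sigmas)
              for name, (pi, mus, Sigmas) in speaker_models.items()}
    return max(scores, key=scores.get)   # speaker whose model gives the largest P(O | lambda)
```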

Speaker identification problem solution: "She is the person!"

Speaker identification result

16 speakers

References

• D.A. Reynolds and R.C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Transactions on Speech and Audio Processing, 1995.

• D.A. Reynolds, T.F. Quatieri, and R.B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, 2000.

• C.M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
• Clustering lecture slides: https://people.eecs.berkeley.edu/~jordan/courses/294-fall09/lectures/clustering/slides.pdf