
Parametric Models Part II: Expectation-Maximization and Mixture Density Estimation

Selim Aksoy
Department of Computer Engineering, Bilkent University
[email protected]
CS 551, Fall 2012

Missing Features

- Suppose that we have a Bayesian classifier that uses the feature vector x, but only a subset x_g of x is observed and the values of the remaining features x_b are missing.
- How can we make a decision?
  - Throw away the observations with missing values.
  - Or, substitute x_b by its average \bar{x}_b in the training data and use x = (x_g, \bar{x}_b).
  - Or, marginalize the posterior over the missing features and use the resulting posterior
    P(\omega_i \mid x_g) = \frac{\int P(\omega_i \mid x_g, x_b) \, p(x_g, x_b) \, dx_b}{\int p(x_g, x_b) \, dx_b}.
    (A numerical sketch of this marginalization appears after the EM overview below.)

Expectation-Maximization

- We can also extend maximum likelihood techniques to allow the learning of parameters when some training patterns have missing features.
- The Expectation-Maximization (EM) algorithm is a general iterative method for finding maximum likelihood estimates of the parameters of a distribution from training data.
- There are two main applications of the EM algorithm:
  - Learning when the data is incomplete or has missing values.
  - Optimizing a likelihood function that is analytically intractable but can be simplified by assuming the existence of, and values for, additional but missing (or hidden) parameters.
- The second problem is more common in pattern recognition applications.
- Assume that the observed data X is generated by some distribution, and that a complete data set Z = (X, Y) exists as a combination of the observed but incomplete data X and the missing data Y.
- The observations in Z are assumed to be i.i.d. from the joint density
  p(z \mid \Theta) = p(x, y \mid \Theta) = p(y \mid x, \Theta) \, p(x \mid \Theta).
- We can define a new likelihood function
  L(\Theta \mid Z) = L(\Theta \mid X, Y) = p(X, Y \mid \Theta),
  called the complete-data likelihood, where L(\Theta \mid X) is referred to as the incomplete-data likelihood.
- The EM algorithm first finds the expected value of the complete-data log-likelihood using the current parameter estimates (expectation step), and then maximizes this expectation (maximization step).
- Define
  Q(\Theta, \Theta^{(i-1)}) = E\left[ \log p(X, Y \mid \Theta) \mid X, \Theta^{(i-1)} \right]
  as the expected value of the complete-data log-likelihood with respect to the unknown data Y, given the observed data X and the current parameter estimates \Theta^{(i-1)}.
- This expected value can be computed as
  E\left[ \log p(X, Y \mid \Theta) \mid X, \Theta^{(i-1)} \right] = \int \log p(X, y \mid \Theta) \, p(y \mid X, \Theta^{(i-1)}) \, dy.
  This is called the E-step.
- The expectation is then maximized by finding optimum values for the new parameters \Theta as
  \Theta^{(i)} = \arg\max_{\Theta} Q(\Theta, \Theta^{(i-1)}).
  This is called the M-step.
- These two steps are repeated iteratively, where each iteration is guaranteed not to decrease the log-likelihood, and the EM algorithm is guaranteed to converge to a local maximum of the likelihood function.
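The marginalization over missing features referenced earlier can be approximated numerically. Below is a minimal sketch, assuming two classes with equal priors and 2-D Gaussian class-conditional densities whose parameters are made up for the illustration; only the first feature is observed and the second is integrated out on a grid.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Two hypothetical classes with equal priors and 2-D Gaussian class-conditionals.
priors = [0.5, 0.5]
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), np.eye(2)]

x_g = 1.2                                  # observed ("good") feature value
xb_grid = np.linspace(-10.0, 10.0, 4001)   # grid over the missing ("bad") feature
dx = xb_grid[1] - xb_grid[0]

# P(w_i | x_g, x_b) p(x_g, x_b) = P(w_i) p(x_g, x_b | w_i), so integrating the
# class-weighted joint density over x_b and normalizing gives P(w_i | x_g).
points = np.column_stack([np.full_like(xb_grid, x_g), xb_grid])
numerators = np.array([
    priors[i] * multivariate_normal.pdf(points, mean=means[i], cov=covs[i]).sum() * dx
    for i in range(2)
])
posterior = numerators / numerators.sum()
print(posterior)  # approximate P(w_1 | x_g) and P(w_2 | x_g)
```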
Mixture Densities

- A mixture model is a linear combination of m densities
  p(x \mid \Theta) = \sum_{j=1}^{m} \alpha_j \, p_j(x \mid \theta_j),
  where \Theta = (\alpha_1, \ldots, \alpha_m, \theta_1, \ldots, \theta_m) such that \alpha_j \geq 0 and \sum_{j=1}^{m} \alpha_j = 1.
- \alpha_1, \ldots, \alpha_m are called the mixing parameters.
- p_j(x \mid \theta_j), j = 1, \ldots, m, are called the component densities.
- Suppose that X = \{x_1, \ldots, x_n\} is a set of observations drawn i.i.d. from the distribution p(x \mid \Theta). The log-likelihood function of \Theta becomes
  \log L(\Theta \mid X) = \log \prod_{i=1}^{n} p(x_i \mid \Theta) = \sum_{i=1}^{n} \log \left( \sum_{j=1}^{m} \alpha_j \, p_j(x_i \mid \theta_j) \right).
- We cannot obtain an analytical solution for \Theta by simply setting the derivatives of \log L(\Theta \mid X) to zero because of the logarithm of the sum.

Mixture Density Estimation via EM

- Consider X as incomplete and define hidden variables Y = \{y_i\}_{i=1}^{n}, where y_i indicates which mixture component generated the data vector x_i; that is, y_i = j if the i'th data vector was generated by the j'th mixture component.
- Then the complete-data log-likelihood becomes
  \log L(\Theta \mid X, Y) = \log p(X, Y \mid \Theta) = \sum_{i=1}^{n} \log \left( p(x_i \mid y_i, \theta_{y_i}) \, p(y_i \mid \Theta) \right) = \sum_{i=1}^{n} \log \left( \alpha_{y_i} \, p_{y_i}(x_i \mid \theta_{y_i}) \right).
- Assume we have the initial parameter estimates \Theta^{(g)} = (\alpha_1^{(g)}, \ldots, \alpha_m^{(g)}, \theta_1^{(g)}, \ldots, \theta_m^{(g)}).
- Compute
  p(y_i \mid x_i, \Theta^{(g)}) = \frac{\alpha_{y_i}^{(g)} \, p_{y_i}(x_i \mid \theta_{y_i}^{(g)})}{p(x_i \mid \Theta^{(g)})} = \frac{\alpha_{y_i}^{(g)} \, p_{y_i}(x_i \mid \theta_{y_i}^{(g)})}{\sum_{j=1}^{m} \alpha_j^{(g)} \, p_j(x_i \mid \theta_j^{(g)})}
  and
  p(Y \mid X, \Theta^{(g)}) = \prod_{i=1}^{n} p(y_i \mid x_i, \Theta^{(g)}).
- Then Q(\Theta, \Theta^{(g)}) takes the form
  Q(\Theta, \Theta^{(g)}) = \sum_{y} \log p(X, y \mid \Theta) \, p(y \mid X, \Theta^{(g)})
  = \sum_{j=1}^{m} \sum_{i=1}^{n} \log\left( \alpha_j \, p_j(x_i \mid \theta_j) \right) p(j \mid x_i, \Theta^{(g)})
  = \sum_{j=1}^{m} \sum_{i=1}^{n} \log(\alpha_j) \, p(j \mid x_i, \Theta^{(g)}) + \sum_{j=1}^{m} \sum_{i=1}^{n} \log\left( p_j(x_i \mid \theta_j) \right) p(j \mid x_i, \Theta^{(g)}).
- We can maximize the two sets of summations independently for \alpha_j and \theta_j because they are not related.
- The estimate for \alpha_j can be computed as
  \hat{\alpha}_j = \frac{1}{n} \sum_{i=1}^{n} p(j \mid x_i, \Theta^{(g)}),
  where
  p(j \mid x_i, \Theta^{(g)}) = \frac{\alpha_j^{(g)} \, p_j(x_i \mid \theta_j^{(g)})}{\sum_{t=1}^{m} \alpha_t^{(g)} \, p_t(x_i \mid \theta_t^{(g)})}.

Mixture of Gaussians

- We can obtain analytical expressions for \theta_j for the special case of a Gaussian mixture, where \theta_j = (\mu_j, \Sigma_j) and
  p_j(x \mid \theta_j) = p_j(x \mid \mu_j, \Sigma_j) = \frac{1}{(2\pi)^{d/2} |\Sigma_j|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_j)^T \Sigma_j^{-1} (x - \mu_j) \right).
- Equating the partial derivative of Q(\Theta, \Theta^{(g)}) with respect to \mu_j to zero gives
  \hat{\mu}_j = \frac{\sum_{i=1}^{n} p(j \mid x_i, \Theta^{(g)}) \, x_i}{\sum_{i=1}^{n} p(j \mid x_i, \Theta^{(g)})}.
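The updates derived so far can be written compactly with numpy. The sketch below computes the responsibilities p(j | x_i, Θ^(g)) and the resulting estimates for the mixing weights and means of a Gaussian mixture; the function and variable names (e_step, m_step_alpha_mu, X, alphas, means, covs) are illustrative choices, not notation from the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, alphas, means, covs):
    """E-step: return p(j | x_i, Theta^(g)) as an (n, m) array of responsibilities."""
    n, m = X.shape[0], len(alphas)
    weighted = np.empty((n, m))
    for j in range(m):
        # alpha_j^(g) * p_j(x_i | theta_j^(g)) for every observation x_i
        weighted[:, j] = alphas[j] * multivariate_normal.pdf(X, mean=means[j], cov=covs[j])
    return weighted / weighted.sum(axis=1, keepdims=True)

def m_step_alpha_mu(X, resp):
    """M-step for the mixing weights and means of a Gaussian mixture."""
    n_j = resp.sum(axis=0)               # sum_i p(j | x_i, Theta^(g))
    alphas = n_j / X.shape[0]            # alpha_hat_j = (1/n) sum_i p(j | x_i, Theta^(g))
    means = (resp.T @ X) / n_j[:, None]  # mu_hat_j = responsibility-weighted average of x_i
    return alphas, means
```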
Mixture of Gaussians: Covariance Models

- We consider five models for the covariance matrix \Sigma_j:
  - \Sigma_j = \sigma^2 I:
    \hat{\sigma}^2 = \frac{1}{nd} \sum_{j=1}^{m} \sum_{i=1}^{n} p(j \mid x_i, \Theta^{(g)}) \, \|x_i - \hat{\mu}_j\|^2
  - \Sigma_j = \sigma_j^2 I:
    \hat{\sigma}_j^2 = \frac{\sum_{i=1}^{n} p(j \mid x_i, \Theta^{(g)}) \, \|x_i - \hat{\mu}_j\|^2}{d \sum_{i=1}^{n} p(j \mid x_i, \Theta^{(g)})}
  - \Sigma_j = \mathrm{diag}(\{\sigma_{jk}^2\}_{k=1}^{d}):
    \hat{\sigma}_{jk}^2 = \frac{\sum_{i=1}^{n} p(j \mid x_i, \Theta^{(g)}) \, (x_{ik} - \hat{\mu}_{jk})^2}{\sum_{i=1}^{n} p(j \mid x_i, \Theta^{(g)})}
  - \Sigma_j = \Sigma (a common covariance matrix):
    \hat{\Sigma} = \frac{1}{n} \sum_{j=1}^{m} \sum_{i=1}^{n} p(j \mid x_i, \Theta^{(g)}) \, (x_i - \hat{\mu}_j)(x_i - \hat{\mu}_j)^T
  - \Sigma_j arbitrary:
    \hat{\Sigma}_j = \frac{\sum_{i=1}^{n} p(j \mid x_i, \Theta^{(g)}) \, (x_i - \hat{\mu}_j)(x_i - \hat{\mu}_j)^T}{\sum_{i=1}^{n} p(j \mid x_i, \Theta^{(g)})}

Summary

- The estimates for \alpha_j, \mu_j and \Sigma_j perform both the expectation and maximization steps simultaneously.
- EM iterations proceed by using the current estimates as the initial estimates for the next iteration.
- The priors are computed from the proportion of examples belonging to each mixture component.
- The means are the component centroids.
- The covariance matrices are calculated as the sample covariances of the points associated with each component.

Examples

- Mixture of Gaussians examples
- 1-D Bayesian classification examples
- 2-D Bayesian classification examples

Figure 1: Illustration of the EM algorithm iterations (panels (a) through (f)) for a mixture of two Gaussians.
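To tie the pieces together, the sketch below adds the M-step for arbitrary covariance matrices and a full EM loop, reusing the e_step and m_step_alpha_mu helpers from the earlier sketch. The initialization scheme, the small regularization term, and the fixed iteration count are assumptions made for the illustration rather than choices prescribed by the slides.

```python
import numpy as np

def m_step_sigma(X, resp, means):
    """M-step for arbitrary covariances: weighted sample covariance around each mean."""
    n_j = resp.sum(axis=0)
    covs = []
    for j in range(resp.shape[1]):
        diff = X - means[j]                                   # x_i - mu_hat_j
        covs.append((resp[:, [j]] * diff).T @ diff / n_j[j])  # Sigma_hat_j
    return covs

def fit_gmm(X, m, n_iter=100, seed=0):
    """Run EM for a mixture of m Gaussians with arbitrary covariance matrices."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alphas = np.full(m, 1.0 / m)                      # uniform initial mixing weights
    means = X[rng.choice(n, size=m, replace=False)]   # random data points as initial means
    covs = [np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(m)]  # shared initial covariance
    for _ in range(n_iter):
        resp = e_step(X, alphas, means, covs)         # E-step: responsibilities
        alphas, means = m_step_alpha_mu(X, resp)      # M-step: alpha_hat_j, mu_hat_j
        covs = m_step_sigma(X, resp, means)           # M-step: Sigma_hat_j
    return alphas, means, covs
```

For example, alphas, means, covs = fit_gmm(X, m=2) fits a two-component mixture of the kind shown in Figure 1.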