Parametric Models Part II: Expectation-Maximization and Mixture Densities

Selim Aksoy

Department of Computer Engineering Bilkent University [email protected]

CS 551, Fall 2012

Missing Features

I Suppose that we have a Bayesian classifier that uses the feature vector x, but only a subset xg of x is observed and the values for the remaining features xb are missing.

I How can we make a decision?

I Throw away the observations with missing values.

I Or, substitute xb by their average x̄b over the training data, and use x = (xg, x̄b).

I Or, marginalize the posterior over the missing features, and use the resulting posterior

P(ωi|xg) = ∫ P(ωi|xg, xb) p(xg, xb) dxb / ∫ p(xg, xb) dxb.
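A minimal numerical sketch of this marginalization (not from the slides): it assumes a hypothetical two-class problem with 2-D Gaussian class-conditionals and equal priors, treats the second feature as missing, and evaluates both integrals on a grid.

```python
# Sketch only: the priors, means, and covariances below are assumed for illustration.
import numpy as np
from scipy.stats import multivariate_normal

priors = [0.5, 0.5]                                   # assumed P(wi)
means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]  # assumed class means
covs = [np.eye(2), np.eye(2)]                         # assumed class covariances

x_g = 1.0                                 # observed feature (first dimension)
xb_grid = np.linspace(-10.0, 10.0, 2001)  # integration grid for the missing feature

# P(w_i) * p(x_g, x_b | w_i) on the grid, for each class
joint = np.zeros((len(priors), xb_grid.size))
for i in range(len(priors)):
    pts = np.column_stack([np.full_like(xb_grid, x_g), xb_grid])
    joint[i] = priors[i] * multivariate_normal.pdf(pts, means[i], covs[i])

evidence = joint.sum(axis=0)            # p(x_g, x_b)
post_given_both = joint / evidence      # P(w_i | x_g, x_b)

# Numerator and denominator of the marginalization formula above
num = np.trapz(post_given_both * evidence, xb_grid, axis=1)
den = np.trapz(evidence, xb_grid)
print(num / den)                        # P(w_i | x_g), sums to 1
```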

Expectation-Maximization

I We can also extend maximum likelihood techniques to allow learning of parameters when some training patterns have missing features.

I The Expectation-Maximization (EM) algorithm is a general iterative method of finding the maximum likelihood estimates of the parameters of a distribution from training data.


I There are two main applications of the EM algorithm:

I Learning when the data is incomplete or has missing values.

I Optimizing a likelihood function that is analytically intractable but can be simplified by assuming the existence of, and values for, additional but missing (or hidden) parameters.

I The second problem is more common in pattern recognition applications.


I Assume that the observed data X is generated by some distribution.

I Assume that a complete dataset Z = (X, Y) exists as a combination of the observed but incomplete data X and the missing (or hidden) data Y.

I The observations in Z are assumed to be i.i.d. from the joint density

p(z|Θ) = p(x, y|Θ) = p(y|x, Θ)p(x|Θ).


I We can define a new likelihood function

L(Θ|Z) = L(Θ|X , Y) = p(X , Y|Θ)

called the complete-data likelihood where L(Θ|X ) is referred to as the incomplete-data likelihood.

I The EM algorithm:

I First, finds the expected value of the complete-data log-likelihood using the current estimates (expectation step).

I Then, maximizes this expectation (maximization step).


I Define

Q(Θ, Θ^(i−1)) = E[ log p(X, Y|Θ) | X, Θ^(i−1) ]

as the expected value of the complete-data log-likelihood w.r.t. the unknown data Y given the observed data X and the current parameter estimates Θ^(i−1).

I The expected value can be computed as

E[ log p(X, Y|Θ) | X, Θ^(i−1) ] = ∫ log p(X, y|Θ) p(y|X, Θ^(i−1)) dy.

I This is called the E-step.


I Then, the expectation can be maximized by finding optimum values for the new parameters Θ as

Θ^(i) = arg max_Θ Q(Θ, Θ^(i−1)).

I This is called the M-step.

I These two steps are repeated iteratively, and each iteration is guaranteed not to decrease the log-likelihood.

I The EM algorithm is also guaranteed to converge to a local maximum of the likelihood function.
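The iteration can be organized as a short generic loop. This is only a sketch (not from the slides): e_step and m_step are hypothetical callbacks that a concrete model must supply, and the stopping test exploits the fact that the log-likelihood never decreases.

```python
import numpy as np

def em(theta0, X, e_step, m_step, tol=1e-6, max_iter=200):
    """Generic EM skeleton; e_step/m_step are model-specific callbacks (assumed)."""
    theta, prev_ll = theta0, -np.inf
    for _ in range(max_iter):
        expected = e_step(theta, X)       # E-step: expectations under current theta
        theta, ll = m_step(expected, X)   # M-step: maximize Q(theta, theta_old)
        if ll - prev_ll < tol:            # log-likelihood is non-decreasing
            break
        prev_ll = ll
    return theta
```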

Mixture Densities

I A mixture model is a linear combination of m densities

p(x|Θ) = Σ_{j=1}^m αj pj(x|θj)

where Θ = (α1, . . . , αm, θ1, . . . , θm) such that αj ≥ 0 and Σ_{j=1}^m αj = 1.

I α1, . . . , αm are called the mixing parameters.

I pj(x|θj), j = 1, . . . , m are called the component densities.


I Suppose that X = {x1, . . . , xn} is a set of observations drawn i.i.d. from the distribution p(x|Θ).

I The log-likelihood function of Θ becomes

log L(Θ|X) = log ∏_{i=1}^n p(xi|Θ) = Σ_{i=1}^n log( Σ_{j=1}^m αj pj(xi|θj) ).

I We cannot obtain an analytical solution for Θ by simply setting the derivatives of log L(Θ|X ) to zero because of the logarithm of the sum.
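Although it cannot be maximized analytically, the incomplete-data log-likelihood is straightforward to evaluate. A short sketch, assuming Gaussian component densities for concreteness:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_log_likelihood(X, alphas, means, covs):
    """log L(Theta|X) = sum_i log( sum_j alpha_j * p_j(x_i | theta_j) )."""
    weighted = np.column_stack([a * multivariate_normal.pdf(X, m, c)
                                for a, m, c in zip(alphas, means, covs)])
    return np.sum(np.log(weighted.sum(axis=1)))
```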

Mixture Density Estimation via EM

I Consider X as incomplete and define the hidden variables Y = {yi}_{i=1}^n where yi corresponds to which mixture component generated the data vector xi.

I In other words, yi = j if the i’th data vector was generated by the j’th mixture component.

I Then, the log-likelihood becomes

log L(Θ|X, Y) = log p(X, Y|Θ)
             = Σ_{i=1}^n log( p(xi|yi, θyi) P(yi) )
             = Σ_{i=1}^n log( αyi pyi(xi|θyi) ).


I Assume we have the initial parameter estimates Θ^(g) = (α1^(g), . . . , αm^(g), θ1^(g), . . . , θm^(g)).

I Compute

p(yi|xi, Θ^(g)) = αyi^(g) pyi(xi|θyi^(g)) / p(xi|Θ^(g)) = αyi^(g) pyi(xi|θyi^(g)) / Σ_{j=1}^m αj^(g) pj(xi|θj^(g))

and

p(Y|X, Θ^(g)) = ∏_{i=1}^n p(yi|xi, Θ^(g)).
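These membership probabilities are exactly what the E-step computes in practice. A sketch (again assuming Gaussian components), returning an n × m matrix whose rows sum to 1:

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, alphas, means, covs):
    """p(y_i = j | x_i, Theta^(g)) for every observation i and component j."""
    weighted = np.column_stack([a * multivariate_normal.pdf(X, m, c)
                                for a, m, c in zip(alphas, means, covs)])
    return weighted / weighted.sum(axis=1, keepdims=True)
```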


I Then, Q(Θ, Θ^(g)) takes the form

Q(Θ, Θ^(g)) = Σ_y log p(X, y|Θ) p(y|X, Θ^(g))
            = Σ_{j=1}^m Σ_{i=1}^n log( αj pj(xi|θj) ) p(j|xi, Θ^(g))
            = Σ_{j=1}^m Σ_{i=1}^n log(αj) p(j|xi, Θ^(g)) + Σ_{j=1}^m Σ_{i=1}^n log( pj(xi|θj) ) p(j|xi, Θ^(g)).


I We can maximize the two sets of summations for αj and θj independently because they are not related.

I The estimate for αj can be computed as

α̂j = (1/n) Σ_{i=1}^n p(j|xi, Θ^(g))

where

p(j|xi, Θ^(g)) = αj^(g) pj(xi|θj^(g)) / Σ_{t=1}^m αt^(g) pt(xi|θt^(g)).
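In code, this update is just the column mean of the responsibility matrix produced by the E-step. A sketch (gamma is the n × m matrix returned by the responsibilities() sketch above, not an object defined in the slides):

```python
def update_weights(gamma):
    """M-step for the mixing parameters: alpha_j = (1/n) * sum_i p(j|x_i, Theta^(g))."""
    return gamma.mean(axis=0)
```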

Mixture of Gaussians

I We can obtain analytical expressions for θj for the special case of a Gaussian mixture where θj = (µj, Σj) and

pj(x|θj) = pj(x|µj, Σj) = (1 / ((2π)^(d/2) |Σj|^(1/2))) exp( −(1/2) (x − µj)^T Σj^(−1) (x − µj) ).

I Equating the partial derivative of Q(Θ, Θ^(g)) with respect to µj to zero gives

µ̂j = Σ_{i=1}^n p(j|xi, Θ^(g)) xi / Σ_{i=1}^n p(j|xi, Θ^(g)).
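In code, the mean update is a responsibility-weighted average of the data. A sketch using the same gamma matrix as above:

```python
import numpy as np

def update_means(X, gamma):
    """M-step for the means: mu_j = sum_i p(j|x_i) x_i / sum_i p(j|x_i)."""
    return gamma.T @ X / gamma.sum(axis=0)[:, None]   # shape (m, d)
```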


I We consider five models for the covariance matrix Σj:

I Σj = σ²I:

σ̂² = (1/(nd)) Σ_{j=1}^m Σ_{i=1}^n p(j|xi, Θ^(g)) ‖xi − µ̂j‖²

I Σj = σj²I:

σ̂j² = Σ_{i=1}^n p(j|xi, Θ^(g)) ‖xi − µ̂j‖² / ( d Σ_{i=1}^n p(j|xi, Θ^(g)) )


I Covariance models continued:

I Σj = diag({σjk²}_{k=1}^d):

σ̂jk² = Σ_{i=1}^n p(j|xi, Θ^(g)) (xik − µ̂jk)² / Σ_{i=1}^n p(j|xi, Θ^(g))

I Σj = Σ:

Σ̂ = (1/n) Σ_{j=1}^m Σ_{i=1}^n p(j|xi, Θ^(g)) (xi − µ̂j)(xi − µ̂j)^T

I Σj = arbitrary:

Σ̂j = Σ_{i=1}^n p(j|xi, Θ^(g)) (xi − µ̂j)(xi − µ̂j)^T / Σ_{i=1}^n p(j|xi, Θ^(g))
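For the most general case (arbitrary Σj), the update is the responsibility-weighted sample covariance around each new mean. A sketch:

```python
import numpy as np

def update_covariances(X, gamma, means_new):
    """M-step for arbitrary per-component covariance matrices."""
    n, d = X.shape
    m = gamma.shape[1]
    covs = np.zeros((m, d, d))
    for j in range(m):
        diff = X - means_new[j]                       # (n, d)
        outer = np.einsum('ni,nj->nij', diff, diff)   # per-sample outer products
        covs[j] = (gamma[:, j, None, None] * outer).sum(axis=0) / gamma[:, j].sum()
    return covs
```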


I Summary:

I The update formulas for αj, µj and Σj perform both the expectation and the maximization steps simultaneously.

I EM iterations proceed by using the current estimates as the initial estimates for the next iteration.

I The priors are computed from the proportion of examples belonging to each mixture component.

I The means are the component centroids.

I The covariance matrices are calculated as the sample covariance of the points associated with each component.
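Putting the pieces together, one EM pass for a Gaussian mixture with arbitrary covariances can be sketched as below; responsibilities(), update_weights(), update_means(), and update_covariances() are the helper sketches from the previous slides, not code from the lecture.

```python
def em_step(X, alphas, means, covs):
    """One combined E/M iteration for a Gaussian mixture with arbitrary covariances."""
    gamma = responsibilities(X, alphas, means, covs)   # E-step
    alphas = update_weights(gamma)                     # M-step: mixing parameters
    means = update_means(X, gamma)                     # M-step: component means
    covs = update_covariances(X, gamma, means)         # M-step: covariance matrices
    return alphas, means, covs
```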

Examples

I Mixture of Gaussians examples

I 1-D Bayesian classification examples

I 2-D Bayesian classification examples

Figure 1: Illustration of the EM algorithm iterations for a mixture of two Gaussians.

Figure 2: Fitting mixtures of 5 Gaussians to data from a circular distribution. (a) Scatter plot. (b) Same spherical covariance, log-likelihood = -806.08. (c) Different spherical covariance, log-likelihood = -804.21. (d) Different diagonal covariance, log-likelihood = -630.46. (e) Same arbitrary covariance, log-likelihood = -810.93. (f) Different arbitrary covariance, log-likelihood = -523.11.

Figure 3: 1-D Bayesian classification examples where the data for each class come from a mixture of three Gaussians. Bayes error is Pe = 0.0828. (a) True densities and sample. (b) Linear Gaussian classifier with Pe = 0.0914. (c) Quadratic Gaussian classifier with Pe = 0.0837. (d) Mixture of Gaussian classifier with Pe = 0.0869.

Figure 4: 2-D Bayesian classification examples where the data for the classes come from a banana shaped distribution and a bivariate Gaussian. (a) Scatter plot. (b) Linear Gaussian classifier with Pe = 0.094531. (c) Quadratic Gaussian classifier with Pe = 0.012829. (d) Mixture of Gaussian classifier with Pe = 0.002026.

Figure 5: 2-D Bayesian classification examples where the data for each class come from a banana shaped distribution. (a) Scatter plot. (b) Quadratic Gaussian classifier with Pe = 0.1570. (c) Mixture of Gaussian classifier with Pe = 0.0100.

Mixture of Gaussians

I Questions:

I How can we find the initial estimates for Θ?

I Choose random data points, make them the initial means, assign all points to these means, and compute the priors and covariance matrices.

I Or, run a clustering algorithm for an initial grouping of all points, and compute the initial estimates from these groups.

I How do we know when to stop the iterations?

I Stop if the change in log-likelihood between two iterations is less than a threshold.

I Or, use a threshold for the number of iterations.

I How can we find the number of components in the mixture?

Minimum Description Length Principle

I The Minimum Description Length (MDL) principle tries to find a compromise between the model complexity (while still having a good data approximation) and the complexity of the data approximation (while still using a simple model).

I Under the MDL principle, the best model is the one that minimizes the sum of the model’s complexity L(M) and the efficiency of the description of the training data with respect to that model L(D|M), i.e.,

L(D, M) = L(M) + L(D|M).


I According to Shannon, the shortest code-length to encode data D with a distribution p(D|M) under model M is given by

L(D|M) = −log L(M|D) = −log p(D|M)

where L(M|D) is the likelihood function for model M given the sample D.


I The model complexity is measured as the number of bits required to describe the model parameters.

I According to Rissanen, the code-length to encode κM real-valued parameters characterizing n data points is

L(M) = (κM / 2) log n

where κM is the number of free parameters in model M and n is the size of the sample used to estimate those parameters.


I Once the description lengths for different models have been calculated, we select the one having the smallest such length.

I It can be shown theoretically that classifiers designed with a minimum description length principle are guaranteed to converge to the ideal or true model in the limit of more and more data.


I As an example, let’s derive the description lengths for Gaussian mixture models with m components.

I The total number of free parameters for different models are:

Σj = σ²I                     κM = (m − 1) + md + 1
Σj = σj²I                    κM = (m − 1) + md + m
Σj = diag({σjk²}_{k=1}^d)    κM = (m − 1) + md + md
Σj = Σ                       κM = (m − 1) + md + d(d + 1)/2
Σj = arbitrary               κM = (m − 1) + md + m d(d + 1)/2

where d is the dimension of the feature vectors.
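These counts are easy to compute programmatically. A small sketch; the model-name strings are labels chosen here for illustration, not notation from the slides:

```python
def num_free_params(m, d, cov_model):
    """kappa_M for an m-component Gaussian mixture in d dimensions."""
    base = (m - 1) + m * d                        # mixing weights + means
    extra = {'sigma2_I_shared':   1,
             'sigma2_I_per_comp': m,
             'diagonal':          m * d,
             'full_shared':       d * (d + 1) // 2,
             'full_per_comp':     m * d * (d + 1) // 2}
    return base + extra[cov_model]
```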


I The first term describes the mixture weights {αj}_{j=1}^m, the second term describes the means {µj}_{j=1}^m, and the third term describes the covariance matrices {Σj}_{j=1}^m.

I Hence, the best m can be found as

" n m !# ∗ κM X X m = arg min log n − log αjpj(xi|µj, Σj) . m 2 i=1 j=1


(a) True mixture. (b) Σj = σ²I. (c) Σj = σj²I. (d) Σj = diag({σjk²}_{k=1}^d). (e) Σj = Σ. (f) Σj = arbitrary.
Figure 6: Example fits for a sample from a mixture of three bivariate Gaussians. For each covariance model, description length vs. the number of components (left) and fitted Gaussians as ellipses at one standard deviation (right) are shown. Using MDL with the arbitrary covariance matrix gave the smallest description length and also could capture the true number of components.
