
Gaussian Mixtures and the EM algorithm


[Figure: left panels show two Gaussian densities on the real line, with σ = 1.0 in the top row and σ = 0.2 in the bottom row; right panels show the corresponding responsibilities on a scale from 0.0 to 1.0.]

Details of the figure:

• Left panels: two Gaussian densities $g_0(x)$ and $g_1(x)$ (blue and orange) on the real line, and a single data point (green dot) at $x = 0.5$. The colored squares are plotted at $x = -1.0$ and $x = 1.0$, the means of the two densities.

• Right panels: the relative densities $g_0(x)/(g_0(x) + g_1(x))$ and $g_1(x)/(g_0(x) + g_1(x))$, called the “responsibilities” of each cluster, for this data point. In the top panels, the Gaussian standard deviation is $\sigma = 1.0$; in the bottom panels, $\sigma = 0.2$.

• The EM algorithm uses these responsibilities to make a “soft” assignment of each data point to each of the two clusters. When $\sigma$ is fairly large, the responsibilities can be near 0.5 (they are 0.36 and 0.64 in the top-right panel).

• As $\sigma \to 0$, the responsibility tends to 1 for the cluster center closest to the target point, and to 0 for all other clusters. This “hard” assignment is seen in the bottom-right panel.

• Model for density estimation:
$$g_Y(y) = (1 - \pi)\,\phi_{\theta_1}(y) + \pi\,\phi_{\theta_2}(y), \qquad \theta_j = (\mu_j, \sigma_j^2)$$
where $\phi_\theta$ is the normal density $N(\mu, \sigma^2)$ and $\pi$ is the mixing proportion.

• Goal: given data $y_1, y_2, \ldots, y_N$, estimate $(\{\mu_j, \sigma_j^2\}, \pi)$ by maximum likelihood:
$$\ell(\{\mu_j, \sigma_j^2\}, \pi) = \sum_{i=1}^{N} \log g_Y(y_i) \qquad \text{log-likelihood (messy)}$$
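Below is a minimal sketch (in Python with NumPy/SciPy; the function names, parameter values, and simulated data are illustrative assumptions, not taken from the text) of evaluating this mixture density and its log-likelihood:

```python
import numpy as np
from scipy.stats import norm

def mixture_density(y, mu1, sigma1, mu2, sigma2, pi):
    """g_Y(y) = (1 - pi) * phi_{theta_1}(y) + pi * phi_{theta_2}(y)."""
    return (1 - pi) * norm.pdf(y, mu1, sigma1) + pi * norm.pdf(y, mu2, sigma2)

def log_likelihood(y, mu1, sigma1, mu2, sigma2, pi):
    """Sum of log mixture densities over the observed data y_1, ..., y_N."""
    return np.sum(np.log(mixture_density(y, mu1, sigma1, mu2, sigma2, pi)))

# Illustrative data and parameter values (assumptions for this sketch only).
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-1.0, 1.0, 50), rng.normal(1.0, 1.0, 50)])
print(log_likelihood(y, mu1=-1.0, sigma1=1.0, mu2=1.0, sigma2=1.0, pi=0.5))
```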

EM algorithm

• Consider latent variables
$$\Delta_i = \begin{cases} 1 & \text{if } y_i \sim \phi_{\theta_2} \\ 0 & \text{otherwise} \end{cases}$$

• If we could observe $\Delta_1, \ldots, \Delta_N$, the log-likelihood would be
$$\ell_0 = \sum_i \left[(1 - \Delta_i)\log\phi_{\theta_1}(y_i) + \Delta_i\log\phi_{\theta_2}(y_i)\right]$$

• Let $\gamma_i(\theta) = E(\Delta_i \mid \theta, \text{data}) = \Pr(\Delta_i = 1 \mid \theta, \text{data})$, the “responsibility” of model 2 for observation $i$.

• The EM algorithm maximizes the expected log-likelihood
$$E\ell_0 = \sum_i \left[(1 - \gamma_i(\theta))\log\phi_{\theta_1}(y_i) + \gamma_i(\theta)\log\phi_{\theta_2}(y_i)\right]$$
where the expectation is conditional on the data and the current estimates $\hat\theta$.

Alternating algorithm:

• Given estimates for $(\hat\mu_j, \hat\sigma_j^2, \hat\pi)$, compute the responsibilities
$$\hat\gamma_i = \Pr(\Delta_i = 1 \mid \hat\theta, \text{data}) = \frac{\hat\pi\,\phi_{\hat\theta_2}(y_i)}{(1 - \hat\pi)\,\phi_{\hat\theta_1}(y_i) + \hat\pi\,\phi_{\hat\theta_2}(y_i)}, \qquad i = 1, 2, \ldots, N \qquad \text{(E-step)}$$

• Given the $\hat\gamma_i$, estimate $\mu_j, \sigma_j^2$ by weighted means and variances:
$$\hat\mu_1 = \frac{\sum_{i=1}^{N}(1 - \hat\gamma_i)\,y_i}{\sum_{i=1}^{N}(1 - \hat\gamma_i)}, \qquad
\hat\sigma_1^2 = \frac{\sum_{i=1}^{N}(1 - \hat\gamma_i)(y_i - \hat\mu_1)^2}{\sum_{i=1}^{N}(1 - \hat\gamma_i)},$$
$$\hat\mu_2 = \frac{\sum_{i=1}^{N}\hat\gamma_i\,y_i}{\sum_{i=1}^{N}\hat\gamma_i}, \qquad
\hat\sigma_2^2 = \frac{\sum_{i=1}^{N}\hat\gamma_i\,(y_i - \hat\mu_2)^2}{\sum_{i=1}^{N}\hat\gamma_i}, \qquad
\hat\pi = \sum_{i=1}^{N}\hat\gamma_i / N \qquad \text{(M-step)}$$

Iterate the E- and M-steps until convergence (a code sketch follows below).
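The following is a minimal sketch of these E- and M-steps for the two-component, one-dimensional mixture (Python/NumPy; the starting values, stopping rule, and simulated data are assumptions of mine, not prescribed by the slides):

```python
import numpy as np
from scipy.stats import norm

def em_two_component(y, n_iter=100, tol=1e-8):
    """EM for a two-component univariate Gaussian mixture (E- and M-steps as above)."""
    # Crude starting values (an assumption): quartiles for the means, pooled sd, pi = 0.5.
    mu1, mu2 = np.percentile(y, 25), np.percentile(y, 75)
    s1 = s2 = np.std(y)
    pi = 0.5
    old_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibility of component 2 for each observation.
        d1 = (1 - pi) * norm.pdf(y, mu1, s1)
        d2 = pi * norm.pdf(y, mu2, s2)
        gamma = d2 / (d1 + d2)
        # M-step: weighted means, variances, and mixing proportion.
        w1, w2 = (1 - gamma).sum(), gamma.sum()
        mu1 = ((1 - gamma) * y).sum() / w1
        mu2 = (gamma * y).sum() / w2
        s1 = np.sqrt(((1 - gamma) * (y - mu1) ** 2).sum() / w1)
        s2 = np.sqrt((gamma * (y - mu2) ** 2).sum() / w2)
        pi = gamma.mean()
        # Observed-data log-likelihood at the start of this iteration; stop once it stabilizes.
        ll = np.sum(np.log(d1 + d2))
        if abs(ll - old_ll) < tol:
            break
        old_ll = ll
    return mu1, s1, mu2, s2, pi

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0.0, 0.5, 60), rng.normal(3.0, 1.0, 40)])
print(em_two_component(y))
```

In practice one would also keep the estimated variances bounded away from zero, for the reason given in the next bullets.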

• The global MLE has $\ell = +\infty$, attained with $\hat\mu_j = y_i$ for any $i, j$ and $\hat\sigma_j^2 = 0$. Not a useful solution.

• We want a non-pathological local maximum.

• Usual strategy: start with, and keep, $\hat\sigma_j^2$ away from zero.

Relationship between K-means + Gaussian Mixtures

• If we restrict $\hat\sigma_1^2 = \hat\sigma_2^2 = \hat\sigma^2$ and let $\hat\sigma^2 \to 0$, then EM $\approx$ K-means.

• EM is a “soft” version of K-means.

[Figure: Gaussian mixtures with $\mu_1 = 1$, $\mu_2 = 5$, $\sigma = 1$: negative log-likelihood on training and test data versus the number of clusters $k$ (2 to 14).]

EM algorithm in general (Text pg. 276– )

• Observed data $\mathbf{Z}$; log-likelihood $\ell(\theta; \mathbf{Z})$ ← want to maximize.

• We have (or we introduce) latent (missing) data $\mathbf{Z}^m$.

• Complete data $\mathbf{T} = (\mathbf{Z}, \mathbf{Z}^m)$.

$$\Pr(\mathbf{Z}^m \mid \mathbf{Z}, \theta') = \frac{\Pr(\mathbf{Z}^m, \mathbf{Z} \mid \theta')}{\Pr(\mathbf{Z} \mid \theta')},
\quad \text{so} \quad
\Pr(\mathbf{Z} \mid \theta') = \frac{\Pr(\mathbf{T} \mid \theta')}{\Pr(\mathbf{Z}^m \mid \mathbf{Z}, \theta')}$$
or
$$\ell(\theta'; \mathbf{Z}) = \ell_0(\theta'; \mathbf{T}) - \ell_1(\theta'; \mathbf{Z}^m \mid \mathbf{Z})$$
$\ell(\theta'; \mathbf{Z})$: observed data log-likelihood
$\ell_0(\theta'; \mathbf{T})$: complete data log-likelihood
$\ell_1(\theta'; \mathbf{Z}^m \mid \mathbf{Z})$: conditional log-likelihood

• Take the conditional expectation with respect to the distribution of $\mathbf{T} \mid \mathbf{Z}$, governed by parameter $\theta$:
$$\ell(\theta'; \mathbf{Z}) = E\left(\ell_0(\theta'; \mathbf{T}) \mid \mathbf{Z}, \theta\right) - E\left(\ell_1(\theta'; \mathbf{Z}^m \mid \mathbf{Z}) \mid \mathbf{Z}, \theta\right) \equiv Q(\theta', \theta) - R(\theta', \theta)$$

EM algorithm in general

1. Start with initial guesses for the parameters $\hat\theta^{(0)}$.

2. Expectation step: at the $j$th step, compute
$$Q(\theta', \hat\theta^{(j)}) = E\left(\ell_0(\theta'; \mathbf{T}) \mid \mathbf{Z}, \hat\theta^{(j)}\right)$$
as a function of the dummy argument $\theta'$.

3. Maximization step: set $\hat\theta^{(j+1)}$ to the maximizer of $Q(\theta', \hat\theta^{(j)})$ with respect to $\theta'$.

4. Iterate steps 2 and 3 until convergence.
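Read literally, steps 1–4 amount to the generic loop sketched below (Python; `e_step`, `m_step`, and `log_lik` are hypothetical, problem-specific callables named here for illustration, not an interface defined in the text):

```python
def em(theta0, e_step, m_step, log_lik, max_iter=200, tol=1e-8):
    """Generic EM iteration: alternate E- and M-steps until the log-likelihood stabilizes."""
    theta = theta0
    old_ll = log_lik(theta)
    for _ in range(max_iter):
        expected_stats = e_step(theta)   # E-step: expectations defining Q(theta', theta^(j))
        theta = m_step(expected_stats)   # M-step: maximize (or just increase) Q over theta'
        new_ll = log_lik(theta)          # observed-data log-likelihood l(theta; Z)
        if abs(new_ll - old_ll) < tol:
            break
        old_ll = new_ll
    return theta
```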

Claim: maximization of $Q(\theta', \theta)$ over $\theta'$ gives a maximizer of $\ell(\theta'; \mathbf{Z})$.

• By Jensen’s inequality, $R(\theta^*, \theta)$ is maximized as a function of $\theta^*$ when $\theta^* = \theta$ (a short sketch of this step follows below).

• So
$$\ell(\theta'; \mathbf{Z}) - \ell(\theta; \mathbf{Z}) = \left[Q(\theta', \theta) - Q(\theta, \theta)\right] - \left[R(\theta', \theta) - R(\theta, \theta)\right] \ge 0 \quad \text{if } Q(\theta', \theta) \ge Q(\theta, \theta)$$

• Hence, as long as we go uphill in $Q$ at each M-step (we don’t need to maximize), we increase the log-likelihood.
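A brief sketch of the Jensen step, in the notation above (a standard argument, not reproduced verbatim from the text): since $R(\theta^*, \theta) = E\left(\log \Pr(\mathbf{Z}^m \mid \mathbf{Z}, \theta^*) \mid \mathbf{Z}, \theta\right)$,
$$R(\theta^*, \theta) - R(\theta, \theta)
= E\!\left[\log \frac{\Pr(\mathbf{Z}^m \mid \mathbf{Z}, \theta^*)}{\Pr(\mathbf{Z}^m \mid \mathbf{Z}, \theta)} \,\Big|\, \mathbf{Z}, \theta\right]
\le \log E\!\left[\frac{\Pr(\mathbf{Z}^m \mid \mathbf{Z}, \theta^*)}{\Pr(\mathbf{Z}^m \mid \mathbf{Z}, \theta)} \,\Big|\, \mathbf{Z}, \theta\right]
= \log 1 = 0,$$
so $R(\theta^*, \theta) \le R(\theta, \theta)$ for all $\theta^*$.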

Some properties of EM

• Each iteration increases the log-likelihood.

• In curved exponential families, one can show that any limit point of EM is a stationary point of the log-likelihood.

• Convergence speed is linear, versus quadratic for Newton methods.

Model-based clustering

Fraley + Raftery 1998

Banfield + Raftery 1994

• Gaussian mixture model for the density of the data (a code sketch follows below):
$$L_m(\Theta_1, \ldots, \Theta_G, \pi_1, \ldots, \pi_G \mid \mathbf{x}) = \prod_{i=1}^{n} \sum_{k=1}^{G} \pi_k f_k(x_i \mid \Theta_k)$$
$G$ components (groups);

$\pi_k$ = probability of an observation falling in group $k$

• $\Theta_k = (\mu_k, \Sigma_k)$

$f_k(x_i \mid \mu_k, \Sigma_k) = \phi(x_i; \mu_k, \Sigma_k)$ (multivariate normal)
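A minimal sketch of evaluating this mixture log-likelihood (Python with NumPy/SciPy; the function name, data, and parameter values are illustrative assumptions):

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_log_likelihood(X, pis, mus, Sigmas):
    """log L_m = sum_i log sum_k pi_k * phi(x_i; mu_k, Sigma_k)."""
    # Densities of each observation under each component: an (n, G) array.
    dens = np.column_stack([
        multivariate_normal.pdf(X, mean=mu, cov=Sigma)
        for mu, Sigma in zip(mus, Sigmas)
    ])
    return np.sum(np.log(dens @ np.asarray(pis)))

# Illustrative (assumed) two-component model in two dimensions.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (50, 2))])
mus = [np.zeros(2), np.full(2, 3.0)]
Sigmas = [np.eye(2), np.eye(2)]
print(mixture_log_likelihood(X, pis=[0.5, 0.5], mus=mus, Sigmas=Sigmas))
```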

• Novel idea: parametrize
$$\Sigma_k = \lambda_k D_k A_k D_k^T$$
normalized so that the determinant $|A_k| = 1$, giving $\lambda_k = |\Sigma_k|^{1/p}$.

$$\Sigma_k = \lambda_k D_k A_k D_k^T$$

$\lambda_k$: determines the volume of the ellipsoid

$D_k$: orthogonal matrix determining the orientation of the ellipsoid

$A_k$: diagonal matrix determining the shape of the ellipsoid

$\Sigma_k$                    Distribution   Volume     Orientation   Shape
$\lambda I$                   Spherical      Equal      NA            Equal
$\lambda_k I$                 Spherical      Variable   NA            Equal
$\lambda D A D^T$             Ellipsoidal    Equal      Equal         Equal
$\lambda_k D_k A_k D_k^T$     Ellipsoidal    Variable   Variable      Variable
$\lambda D_k A D_k^T$         Ellipsoidal    Equal      Variable      Equal
$\lambda_k D_k A D_k^T$       Ellipsoidal    Variable   Variable      Equal
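To make the decomposition concrete, here is a small sketch (Python/NumPy; the rotation angle, axis ratio, and volume factor are illustrative values of my own) that builds a covariance matrix from its volume, orientation, and shape factors:

```python
import numpy as np

def make_covariance(lam, D, A):
    """Sigma = lambda * D @ A @ D.T, with A rescaled so |A| = 1 (hence lambda = |Sigma|^(1/p))."""
    A = A / np.linalg.det(A) ** (1.0 / len(A))   # normalize the shape matrix to determinant 1
    return lam * D @ A @ D.T

# Illustrative 2-d example: rotate by 30 degrees, 4:1 axis ratio, volume factor lambda = 2.
theta = np.pi / 6
D = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # orientation (orthogonal matrix)
A = np.diag([4.0, 1.0])                           # shape (diagonal, before normalization)
Sigma = make_covariance(lam=2.0, D=D, A=A)
print(Sigma)
print(np.linalg.det(Sigma) ** 0.5)                # recovers lambda = 2.0, since p = 2
```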

• Estimate parameters via the EM algorithm.

• Estimates of the components of $\Sigma_k$ depend on its assumed form.

• They use the “Bayesian Information Criterion” (BIC) to choose the model parametrization and the number of clusters (a code sketch follows below):
$$2 \log p(\mathbf{x} \mid \mathcal{M}) + \text{const} \approx 2\,\ell_m(\mathbf{x}, \widehat{\mathrm{par}}) - m_{\mathcal{M}} \log n$$
$m_{\mathcal{M}}$ = number of independent parameters in model $\mathcal{M}$; $n$ = sample size.

• Useful in low dimensions (say ≤ 4), but it is difficult to estimate covariances in high-dimensional feature spaces.
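A minimal sketch of a BIC comparison under these definitions (Python; the candidate models, log-likelihood values, and parameter counts below are hypothetical placeholders, not results from the papers):

```python
import numpy as np

def bic(log_lik_max, n_params, n_obs):
    """BIC approximation: 2 * maximized log-likelihood - (# independent parameters) * log(n)."""
    return 2.0 * log_lik_max - n_params * np.log(n_obs)

# Hypothetical example: two fitted candidate models on n = 200 observations.
candidates = {
    "model_A": {"log_lik": -412.3, "n_params": 7},
    "model_B": {"log_lik": -401.8, "n_params": 11},
}
scores = {name: bic(m["log_lik"], m["n_params"], 200) for name, m in candidates.items()}
print(scores, max(scores, key=scores.get))   # with this sign convention, larger BIC is preferred
```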
