
Gaussian Mixtures and the EM algorithm


[Figure: left panels show two Gaussian densities on the real line, with σ = 1.0 in the top row and σ = 0.2 in the bottom row; right panels show the corresponding responsibilities on a scale from 0.0 to 1.0.]

Details of the figure:

• Left panels: two Gaussian densities $g_0(x)$ and $g_1(x)$ (blue and orange) on the real line, and a single data point (green dot) at $x = 0.5$. The colored squares are plotted at $x = -1.0$ and $x = 1.0$, the means of the two densities.

• Right panels: the relative densities $g_0(x)/(g_0(x) + g_1(x))$ and $g_1(x)/(g_0(x) + g_1(x))$, called the “responsibilities” of each cluster, for this data point. In the top panels, the Gaussian standard deviation is $\sigma = 1.0$; in the bottom panels, $\sigma = 0.2$.

• The EM algorithm uses these responsibilities to make a “soft” assignment of each data point to each of the two clusters. When $\sigma$ is fairly large, the responsibilities can be near 0.5 (they are 0.36 and 0.64 in the top-right panel).

• As $\sigma \to 0$, the responsibility tends to 1 for the cluster center closest to the target point, and to 0 for all other clusters. This “hard” assignment is seen in the bottom-right panel.

• Model for density estimation:
$$g_Y(y) = (1 - \pi)\,\phi_{\theta_1}(y) + \pi\,\phi_{\theta_2}(y), \qquad \theta_j = (\mu_j, \sigma_j^2)$$
where $\phi_\theta$ is the normal density $N(\mu, \sigma^2)$ and $\pi$ is the mixing proportion.

• Goal: given data $y_1, y_2, \ldots, y_N$, estimate $(\{\mu_j, \sigma_j^2\}, \pi)$ by maximum likelihood:
$$\ell(\{\mu_j, \sigma_j^2\}, \pi) = \sum_{i=1}^{N} \log g_Y(y_i) \qquad \text{log-likelihood (messy)}$$
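Below is a minimal sketch (in Python with NumPy/SciPy; the function names, parameter values, and simulated data are illustrative assumptions, not taken from the text) of evaluating this mixture density and its log-likelihood:

```python
import numpy as np
from scipy.stats import norm

def mixture_density(y, mu1, sigma1, mu2, sigma2, pi):
    """g_Y(y) = (1 - pi) * phi_{theta_1}(y) + pi * phi_{theta_2}(y)."""
    return (1 - pi) * norm.pdf(y, mu1, sigma1) + pi * norm.pdf(y, mu2, sigma2)

def log_likelihood(y, mu1, sigma1, mu2, sigma2, pi):
    """Sum of log mixture densities over the observed data y_1, ..., y_N."""
    return np.sum(np.log(mixture_density(y, mu1, sigma1, mu2, sigma2, pi)))

# Illustrative data and parameter values (assumptions for this sketch only).
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-1.0, 1.0, 50), rng.normal(1.0, 1.0, 50)])
print(log_likelihood(y, mu1=-1.0, sigma1=1.0, mu2=1.0, sigma2=1.0, pi=0.5))
```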

EM algorithm

• Consider latent variables
$$\Delta_i = \begin{cases} 1 & \text{if } y_i \sim \phi_{\theta_2} \\ 0 & \text{otherwise} \end{cases}$$

• If we could observe $\Delta_1, \ldots, \Delta_N$, the log-likelihood would be
$$\ell_0 = \sum_i \left[(1 - \Delta_i)\log\phi_{\theta_1}(y_i) + \Delta_i\log\phi_{\theta_2}(y_i)\right]$$

• Let $\gamma_i(\theta) = E(\Delta_i \mid \theta, \text{data}) = \Pr(\Delta_i = 1 \mid \theta, \text{data})$, the “responsibility” of model 2 for observation $i$.

• The EM algorithm maximizes the expected log-likelihood
$$E\ell_0 = \sum_i \left[(1 - \gamma_i(\theta))\log\phi_{\theta_1}(y_i) + \gamma_i(\theta)\log\phi_{\theta_2}(y_i)\right]$$
where the expectation is conditional on the data and the current estimates $\hat\theta$.

Alternating algorithm:

• Given estimates for $(\hat\mu_j, \hat\sigma_j^2, \hat\pi)$, compute the responsibilities
$$\hat\gamma_i = \Pr(\Delta_i = 1 \mid \hat\theta, \text{data}) = \frac{\hat\pi\,\phi_{\hat\theta_2}(y_i)}{(1 - \hat\pi)\,\phi_{\hat\theta_1}(y_i) + \hat\pi\,\phi_{\hat\theta_2}(y_i)}, \qquad i = 1, 2, \ldots, N \qquad \text{(E-step)}$$

• Given the $\hat\gamma_i$, estimate $\mu_j, \sigma_j^2$ by weighted means and variances:
$$\hat\mu_1 = \frac{\sum_{i=1}^{N}(1 - \hat\gamma_i)\,y_i}{\sum_{i=1}^{N}(1 - \hat\gamma_i)}, \qquad
\hat\sigma_1^2 = \frac{\sum_{i=1}^{N}(1 - \hat\gamma_i)(y_i - \hat\mu_1)^2}{\sum_{i=1}^{N}(1 - \hat\gamma_i)},$$
$$\hat\mu_2 = \frac{\sum_{i=1}^{N}\hat\gamma_i\,y_i}{\sum_{i=1}^{N}\hat\gamma_i}, \qquad
\hat\sigma_2^2 = \frac{\sum_{i=1}^{N}\hat\gamma_i\,(y_i - \hat\mu_2)^2}{\sum_{i=1}^{N}\hat\gamma_i}, \qquad
\hat\pi = \sum_{i=1}^{N}\hat\gamma_i / N \qquad \text{(M-step)}$$

Iterate the E- and M-steps until convergence (a code sketch follows below).
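The following is a minimal sketch of these E- and M-steps for the two-component, one-dimensional mixture (Python/NumPy; the starting values, stopping rule, and simulated data are assumptions of mine, not prescribed by the slides):

```python
import numpy as np
from scipy.stats import norm

def em_two_component(y, n_iter=100, tol=1e-8):
    """EM for a two-component univariate Gaussian mixture (E- and M-steps as above)."""
    # Crude starting values (an assumption): quartiles for the means, pooled sd, pi = 0.5.
    mu1, mu2 = np.percentile(y, 25), np.percentile(y, 75)
    s1 = s2 = np.std(y)
    pi = 0.5
    old_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibility of component 2 for each observation.
        d1 = (1 - pi) * norm.pdf(y, mu1, s1)
        d2 = pi * norm.pdf(y, mu2, s2)
        gamma = d2 / (d1 + d2)
        # M-step: weighted means, variances, and mixing proportion.
        w1, w2 = (1 - gamma).sum(), gamma.sum()
        mu1 = ((1 - gamma) * y).sum() / w1
        mu2 = (gamma * y).sum() / w2
        s1 = np.sqrt(((1 - gamma) * (y - mu1) ** 2).sum() / w1)
        s2 = np.sqrt((gamma * (y - mu2) ** 2).sum() / w2)
        pi = gamma.mean()
        # Observed-data log-likelihood at the start of this iteration; stop once it stabilizes.
        ll = np.sum(np.log(d1 + d2))
        if abs(ll - old_ll) < tol:
            break
        old_ll = ll
    return mu1, s1, mu2, s2, pi

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0.0, 0.5, 60), rng.normal(3.0, 1.0, 40)])
print(em_two_component(y))
```

In practice one would also keep the estimated variances bounded away from zero, for the reason given in the next bullets.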

• The global MLE has $\ell = +\infty$, attained with $\hat\mu_j = y_i$ for any $i, j$ and $\hat\sigma_j^2 = 0$. Not a useful solution.

• We want a non-pathological local maximum.

• Usual strategy: start with, and keep, $\hat\sigma_j^2$ away from zero.

Relationship between K-means + Gaussian Mixtures

• If we restrict $\hat\sigma_1^2 = \hat\sigma_2^2 = \hat\sigma^2$ and let $\hat\sigma^2 \to 0$, then EM $\approx$ K-means.

• EM is a “soft” version of K-means.

[Figure: Gaussian mixtures with $\mu_1 = 1$, $\mu_2 = 5$, $\sigma = 1$: negative log-likelihood on training and test data versus the number of clusters $k$ (2 to 14).]

EM algorithm in general (Text pg. 276– )

• Observed data $\mathbf{Z}$; log-likelihood $\ell(\theta; \mathbf{Z})$ ← want to maximize.

• We have (or we introduce) latent (missing) data $\mathbf{Z}^m$.

• Complete data $\mathbf{T} = (\mathbf{Z}, \mathbf{Z}^m)$.

$$\Pr(\mathbf{Z}^m \mid \mathbf{Z}, \theta') = \frac{\Pr(\mathbf{Z}^m, \mathbf{Z} \mid \theta')}{\Pr(\mathbf{Z} \mid \theta')},
\quad \text{so} \quad
\Pr(\mathbf{Z} \mid \theta') = \frac{\Pr(\mathbf{T} \mid \theta')}{\Pr(\mathbf{Z}^m \mid \mathbf{Z}, \theta')}$$
or
$$\ell(\theta'; \mathbf{Z}) = \ell_0(\theta'; \mathbf{T}) - \ell_1(\theta'; \mathbf{Z}^m \mid \mathbf{Z})$$
$\ell(\theta'; \mathbf{Z})$: observed data log-likelihood
$\ell_0(\theta'; \mathbf{T})$: complete data log-likelihood
$\ell_1(\theta'; \mathbf{Z}^m \mid \mathbf{Z})$: conditional log-likelihood

• Take the conditional expectation with respect to the distribution of $\mathbf{T} \mid \mathbf{Z}$, governed by parameter $\theta$:
$$\ell(\theta'; \mathbf{Z}) = E\left(\ell_0(\theta'; \mathbf{T}) \mid \mathbf{Z}, \theta\right) - E\left(\ell_1(\theta'; \mathbf{Z}^m \mid \mathbf{Z}) \mid \mathbf{Z}, \theta\right) \equiv Q(\theta', \theta) - R(\theta', \theta)$$

EM algorithm in general

1. Start with initial guesses for the parameters $\hat\theta^{(0)}$.

2. Expectation step: at the $j$th step, compute
$$Q(\theta', \hat\theta^{(j)}) = E\left(\ell_0(\theta'; \mathbf{T}) \mid \mathbf{Z}, \hat\theta^{(j)}\right)$$
as a function of the dummy argument $\theta'$.

3. Maximization step: set $\hat\theta^{(j+1)}$ to the maximizer of $Q(\theta', \hat\theta^{(j)})$ with respect to $\theta'$.

4. Iterate steps 2 and 3 until convergence.
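Read literally, steps 1–4 amount to the generic loop sketched below (Python; `e_step`, `m_step`, and `log_lik` are hypothetical, problem-specific callables named here for illustration, not an interface defined in the text):

```python
def em(theta0, e_step, m_step, log_lik, max_iter=200, tol=1e-8):
    """Generic EM iteration: alternate E- and M-steps until the log-likelihood stabilizes."""
    theta = theta0
    old_ll = log_lik(theta)
    for _ in range(max_iter):
        expected_stats = e_step(theta)   # E-step: expectations defining Q(theta', theta^(j))
        theta = m_step(expected_stats)   # M-step: maximize (or just increase) Q over theta'
        new_ll = log_lik(theta)          # observed-data log-likelihood l(theta; Z)
        if abs(new_ll - old_ll) < tol:
            break
        old_ll = new_ll
    return theta
```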

Claim: maximization of $Q(\theta', \theta)$ over $\theta'$ gives a maximizer of $\ell(\theta'; \mathbf{Z})$.

• By Jensen’s inequality, $R(\theta^*, \theta)$ is maximized as a function of $\theta^*$ when $\theta^* = \theta$ (a short sketch of this step follows below).

• So
$$\ell(\theta'; \mathbf{Z}) - \ell(\theta; \mathbf{Z}) = \left[Q(\theta', \theta) - Q(\theta, \theta)\right] - \left[R(\theta', \theta) - R(\theta, \theta)\right] \ge 0 \quad \text{if } Q(\theta', \theta) \ge Q(\theta, \theta)$$

• Hence, as long as we go uphill in $Q$ at each M-step (we don’t need to maximize), we increase the log-likelihood.
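A brief sketch of the Jensen step, in the notation above (a standard argument, not reproduced verbatim from the text): since $R(\theta^*, \theta) = E\left(\log \Pr(\mathbf{Z}^m \mid \mathbf{Z}, \theta^*) \mid \mathbf{Z}, \theta\right)$,
$$R(\theta^*, \theta) - R(\theta, \theta)
= E\!\left[\log \frac{\Pr(\mathbf{Z}^m \mid \mathbf{Z}, \theta^*)}{\Pr(\mathbf{Z}^m \mid \mathbf{Z}, \theta)} \,\Big|\, \mathbf{Z}, \theta\right]
\le \log E\!\left[\frac{\Pr(\mathbf{Z}^m \mid \mathbf{Z}, \theta^*)}{\Pr(\mathbf{Z}^m \mid \mathbf{Z}, \theta)} \,\Big|\, \mathbf{Z}, \theta\right]
= \log 1 = 0,$$
so $R(\theta^*, \theta) \le R(\theta, \theta)$ for all $\theta^*$.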

Some properties of EM

• Each iteration increases the log-likelihood.

• In curved exponential families, one can show that any limit point of EM is a stationary point of the log-likelihood.

• Convergence speed is linear, versus quadratic for Newton methods.

Model-based clustering

Fraley + Raftery 1998

Banfield + Raftery 1994

• Gaussian mixture model for the density of the data (a code sketch follows below):
$$L_m(\Theta_1, \ldots, \Theta_G, \pi_1, \ldots, \pi_G \mid \mathbf{x}) = \prod_{i=1}^{n} \sum_{k=1}^{G} \pi_k f_k(x_i \mid \Theta_k)$$
$G$ components (groups);

$\pi_k$ = probability of an observation falling in group $k$

• $\Theta_k = (\mu_k, \Sigma_k)$

$f_k(x_i \mid \mu_k, \Sigma_k) = \phi(x_i; \mu_k, \Sigma_k)$ (multivariate normal)
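A minimal sketch of evaluating this mixture log-likelihood (Python with NumPy/SciPy; the function name, data, and parameter values are illustrative assumptions):

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_log_likelihood(X, pis, mus, Sigmas):
    """log L_m = sum_i log sum_k pi_k * phi(x_i; mu_k, Sigma_k)."""
    # Densities of each observation under each component: an (n, G) array.
    dens = np.column_stack([
        multivariate_normal.pdf(X, mean=mu, cov=Sigma)
        for mu, Sigma in zip(mus, Sigmas)
    ])
    return np.sum(np.log(dens @ np.asarray(pis)))

# Illustrative (assumed) two-component model in two dimensions.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (50, 2))])
mus = [np.zeros(2), np.full(2, 3.0)]
Sigmas = [np.eye(2), np.eye(2)]
print(mixture_log_likelihood(X, pis=[0.5, 0.5], mus=mus, Sigmas=Sigmas))
```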

• Novel idea: parametrize
$$\Sigma_k = \lambda_k D_k A_k D_k^T$$
normalized so that the determinant $|A_k| = 1$, giving $\lambda_k = |\Sigma_k|^{1/p}$.

$$\Sigma_k = \lambda_k D_k A_k D_k^T$$

$\lambda_k$: determines the volume of the ellipsoid

$D_k$: orthogonal matrix determining the orientation of the ellipsoid

$A_k$: diagonal matrix determining the shape of the ellipsoid

$\Sigma_k$                    Distribution   Volume     Orientation   Shape
$\lambda I$                   Spherical      Equal      NA            Equal
$\lambda_k I$                 Spherical      Variable   NA            Equal
$\lambda D A D^T$             Ellipsoidal    Equal      Equal         Equal
$\lambda_k D_k A_k D_k^T$     Ellipsoidal    Variable   Variable      Variable
$\lambda D_k A D_k^T$         Ellipsoidal    Equal      Variable      Equal
$\lambda_k D_k A D_k^T$       Ellipsoidal    Variable   Variable      Equal
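To make the decomposition concrete, here is a small sketch (Python/NumPy; the rotation angle, axis ratio, and volume factor are illustrative values of my own) that builds a covariance matrix from its volume, orientation, and shape factors:

```python
import numpy as np

def make_covariance(lam, D, A):
    """Sigma = lambda * D @ A @ D.T, with A rescaled so |A| = 1 (hence lambda = |Sigma|^(1/p))."""
    A = A / np.linalg.det(A) ** (1.0 / len(A))   # normalize the shape matrix to determinant 1
    return lam * D @ A @ D.T

# Illustrative 2-d example: rotate by 30 degrees, 4:1 axis ratio, volume factor lambda = 2.
theta = np.pi / 6
D = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # orientation (orthogonal matrix)
A = np.diag([4.0, 1.0])                           # shape (diagonal, before normalization)
Sigma = make_covariance(lam=2.0, D=D, A=A)
print(Sigma)
print(np.linalg.det(Sigma) ** 0.5)                # recovers lambda = 2.0, since p = 2
```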

• Estimate parameters via the EM algorithm.

• Estimates of the components of $\Sigma_k$ depend on its assumed form.

• They use the “Bayesian Information Criterion” (BIC) to choose the model parametrization and the number of clusters (a code sketch follows below):
$$2 \log p(\mathbf{x} \mid \mathcal{M}) + \text{const} \approx 2\,\ell_m(\mathbf{x}, \widehat{\mathrm{par}}) - m_{\mathcal{M}} \log n$$
$m_{\mathcal{M}}$ = number of independent parameters in model $\mathcal{M}$; $n$ = sample size.

• Useful in low dimensions (say ≤ 4), but it is difficult to estimate covariances in high-dimensional feature spaces.
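A minimal sketch of a BIC comparison under these definitions (Python; the candidate models, log-likelihood values, and parameter counts below are hypothetical placeholders, not results from the papers):

```python
import numpy as np

def bic(log_lik_max, n_params, n_obs):
    """BIC approximation: 2 * maximized log-likelihood - (# independent parameters) * log(n)."""
    return 2.0 * log_lik_max - n_params * np.log(n_obs)

# Hypothetical example: two fitted candidate models on n = 200 observations.
candidates = {
    "model_A": {"log_lik": -412.3, "n_params": 7},
    "model_B": {"log_lik": -401.8, "n_params": 11},
}
scores = {name: bic(m["log_lik"], m["n_params"], 200) for name, m in candidates.items()}
print(scores, max(scores, key=scores.get))   # with this sign convention, larger BIC is preferred
```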
