Mixture Models & EM Algorithm Lecture 21
David Sontag, New York University
Slides adapted from Carlos Guestrin, Dan Klein, Luke Zettlemoyer, Dan Weld, Vibhav Gogate, and Andrew Moore

The Evils of “Hard Assignments”?
• Clusters may overlap
• Some clusters may be “wider” than others
• Distances can be deceiving!

Probabilistic Clustering
• Try a probabilistic model!
  – Allows overlaps, clusters of different size, etc.
• Can tell a generative story for data
  – P(X|Y) P(Y)
• Challenge: we need to estimate model parameters without labeled Ys

Y    X1    X2
??   0.1   2.1
??   0.5   -1.1
??   0.0   3.0
??   -0.1  -2.0
??   0.2   1.5
…    …     …

The General GMM assumption
• P(Y): There are k components
• P(X|Y): Each component generates data from a multivariate Gaussian with mean μi and covariance matrix Σi
Each data point is sampled from a generative process:
1. Choose component i with probability P(y=i)
2. Generate datapoint ~ N(μi, Σi)
[Figure: Gaussian mixture model (GMM) with three components centered at μ1, μ2, μ3]

What Model Should We Use?
• Depends on X!
• Here (for the data table above), maybe Gaussian Naïve Bayes?
  – Multinomial over clusters Y
  – (Independent) Gaussian for each Xi given Y

Could we make fewer assumptions?
• What if the Xi co-vary?
• What if there are multiple peaks?
• Gaussian Mixture Models!
  – P(Y) still multinomial
  – P(X|Y) is a multivariate Gaussian distribution:

$$P(X = x_j \mid Y = i) = \frac{1}{(2\pi)^{m/2} \lvert \Sigma_i \rvert^{1/2}} \exp\left[ -\frac{1}{2} (x_j - \mu_i)^T \Sigma_i^{-1} (x_j - \mu_i) \right]$$

where m is the dimension of x and |Σi| is the determinant of Σi.

Multivariate Gaussians
The same density takes different shapes depending on Σ:
• Σ ∝ identity matrix
• Σ = diagonal matrix: the Xi are independent, à la Gaussian NB
• Σ = arbitrary (positive semidefinite) matrix:
  – specifies rotation (change of basis)
  – eigenvalues specify relative elongation
• The covariance matrix Σ captures the degree to which the xi vary together; λ denotes an eigenvalue of Σ

Mixtures of Gaussians (1)
[Figure: Old Faithful data set; axes: duration of last eruption, time to eruption. Panels: single Gaussian; mixture of two Gaussians]

Mixtures of Gaussians (2)
Combine simple models into a complex model:

$$P(x) = \sum_{k=1}^{K} P(Y = k) \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$

Each Gaussian N(x | μk, Σk) is a component and each P(Y = k) is a mixing coefficient. [Figure: example with K=3]

Mixtures of Gaussians (3)
[Figure]

Eliminating Hard Assignments to Clusters
Model data as mixture of multivariate Gaussians
[Figure: shown is the posterior probability that a point was generated from the ith Gaussian, Pr(Y = i | x)]

ML estimation in supervised setting
• Univariate Gaussian
• Mixture of Multivariate Gaussians
ML estimate for each of the multivariate Gaussians is given by:

$$\mu^k_{ML} = \frac{1}{n} \sum_{j=1}^{n} x_j \qquad \Sigma^k_{ML} = \frac{1}{n} \sum_{j=1}^{n} (x_j - \mu^k_{ML})(x_j - \mu^k_{ML})^T$$

Just sums over x generated from the k’th Gaussian

That was easy! But what if unobserved data?
• MLE:
  – $\arg\max_\theta \prod_j P(y_j, x_j)$
  – θ: all model parameters
    • e.g., class probs, means, and variances
• But we don’t know the yj’s!!!
• Maximize marginal likelihood:
  – $\arg\max_\theta \prod_j P(x_j) = \arg\max_\theta \prod_j \sum_{k=1}^{K} P(Y_j = k, x_j)$

How do we optimize? Closed Form?
• Maximize marginal likelihood:
  – $\arg\max_\theta \prod_j P(x_j) = \arg\max_\theta \prod_j \sum_{k=1}^{K} P(Y_j = k, x_j)$
• Almost always a hard problem!
  – Usually no closed form solution
  – Even when lg P(X,Y) is concave in θ (a convex optimization problem), lg P(X) generally isn’t…
  – For all but the simplest P(X), we will have to do gradient ascent, in a big messy space with lots of local optima…

Learning general mixtures of Gaussian

$$P(y = k \mid x_j) \propto \frac{1}{(2\pi)^{m/2} \lvert \Sigma_k \rvert^{1/2}} \exp\left[ -\frac{1}{2} (x_j - \mu_k)^T \Sigma_k^{-1} (x_j - \mu_k) \right] P(y = k)$$

• Marginal likelihood:

$$\prod_{j=1}^{m} P(x_j) = \prod_{j=1}^{m} \sum_{k=1}^{K} P(x_j, y = k) = \prod_{j=1}^{m} \sum_{k=1}^{K} \frac{1}{(2\pi)^{m/2} \lvert \Sigma_k \rvert^{1/2}} \exp\left[ -\frac{1}{2} (x_j - \mu_k)^T \Sigma_k^{-1} (x_j - \mu_k) \right] P(y = k)$$

(The slides overload m: as the upper limit of the product it counts training examples, while in the exponent m/2 it is the dimension of x.)
• Need to differentiate and solve for μk, Σk, and P(Y=k) for k=1..K
• There will be no closed form solution, gradient is complex, lots of local optima
• Wouldn’t it be nice if there was a better way!?!

Expectation Maximization

The EM Algorithm
• A clever method for maximizing marginal likelihood:
  – $\arg\max_\theta \prod_j P(x_j) = \arg\max_\theta \prod_j \sum_{k=1}^{K} P(Y_j = k, x_j)$
  – A type of gradient ascent that can be easy to implement (e.g., no line search, learning rates, etc.)
• Alternate between two steps:
  – Compute an expectation
  – Compute a maximization
• Not magic: still optimizing a non-convex function with lots of local optima
  – The computations are just easier (often, significantly so!)

EM: Two Easy Steps
Objective: $\arg\max_\theta \lg \prod_j \sum_{k=1}^{K} P(Y_j = k, x_j \mid \theta) = \arg\max_\theta \sum_j \lg \sum_{k=1}^{K} P(Y_j = k, x_j \mid \theta)$
Data: {xj | j=1 .. n}
Notation is a bit inconsistent: parameters = θ = λ
• E-step: Compute expectations to “fill in” missing y values according to current parameters, θ(t)
  – For all examples j and values k for Yj, compute: P(Yj=k | xj, θ(t))
• M-step: Re-estimate the parameters with “weighted” MLE estimates, holding the E-step weights fixed
  – Set $\theta^{(t+1)} = \arg\max_\theta \sum_j \sum_k P(Y_j = k \mid x_j, \theta^{(t)}) \log P(Y_j = k, x_j \mid \theta)$
Especially useful when the E and M steps have closed form solutions!!!
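The objective and the E-step above can be sketched in a few lines of Python. This is a minimal illustration rather than the lecture's own code: it assumes 1D data with univariate Gaussian components, uses the natural log, and the names `gaussian_pdf`, `e_step`, and `log_marginal_likelihood` are hypothetical helpers introduced only for this example.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a univariate Gaussian N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def e_step(xs, priors, mus, sigmas):
    """Return resp[j][k] = P(Y_j = k | x_j, theta) for every point and component."""
    resp = []
    for x in xs:
        # Joint P(Y_j = k, x_j | theta) = P(Y = k) * N(x_j | mu_k, sigma_k^2)
        joint = [p * gaussian_pdf(x, mu, s) for p, mu, s in zip(priors, mus, sigmas)]
        total = sum(joint)  # marginal P(x_j | theta)
        resp.append([v / total for v in joint])
    return resp


def log_marginal_likelihood(xs, priors, mus, sigmas):
    """Objective: sum_j log sum_k P(Y_j = k, x_j | theta)."""
    return sum(
        math.log(sum(p * gaussian_pdf(x, mu, s) for p, mu, s in zip(priors, mus, sigmas)))
        for x in xs
    )


# Illustrative, made-up data and parameters (assumptions for this sketch only).
xs = [-2.0, 0.0, 2.0]
resp = e_step(xs, priors=[0.5, 0.5], mus=[-2.0, 2.0], sigmas=[1.0, 1.0])
ll = log_marginal_likelihood(xs, [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0])
```

Each row of `resp` sums to one, and the point at 0.0 is split evenly between the two symmetric components. A point very far from every component would make `total` underflow to zero in this naive version; a more careful implementation would work in log space.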
EM algorithm: Pictorial View
[Figure: the EM loop]
1. Given a set of parameters and training data
2. Estimate the class of each training example using the parameters, yielding new (weighted) training data
   – Class assignment is probabilistic or weighted (soft EM)
   – Class assignment is hard (hard EM)
3. This is now a supervised learning problem: relearn the parameters based on the new training data, and repeat

Simple example: learn means only!
Consider:
• 1D data
• Mixture of k=2 Gaussians
• Variances fixed to σ=1
• Distribution over classes is uniform
• Just need to estimate μ1 and μ2

$$\prod_{j=1}^{m} \sum_{k=1}^{K} P(x_j, Y_j = k) \propto \prod_{j=1}^{m} \sum_{k=1}^{K=2} \exp\left[ -\frac{1}{2\sigma^2} \lVert x_j - \mu_k \rVert^2 \right] P(Y_j = k)$$

EM for GMMs: only learning means
Iterate: On the t’th iteration let our estimates be
λt = { μ1(t), μ2(t) … μK(t) }

E-step
Compute “expected” classes of all datapoints

$$P(Y_j = k \mid x_j, \mu_1 \ldots \mu_K) \propto \exp\left( -\frac{1}{2\sigma^2} \lVert x_j - \mu_k \rVert^2 \right) P(Y_j = k)$$

M-step
Compute most likely new μs given class expectations

$$\mu_k = \frac{\sum_{j=1}^{m} P(Y_j = k \mid x_j) \, x_j}{\sum_{j=1}^{m} P(Y_j = k \mid x_j)}$$

E.M. for General GMMs
Iterate: On the t’th iteration let our estimates be
λt = { μ1(t), μ2(t) … μK(t), Σ1(t), Σ2(t) … ΣK(t), p1(t), p2(t) … pK(t) }
pk(t) is shorthand for the estimate of P(y=k) on the t’th iteration

E-step
Compute “expected” classes of all datapoints for each class (just evaluate a Gaussian at xj)

$$P(Y_j = k \mid x_j, \lambda_t) \propto p_k^{(t)} \, p(x_j \mid \mu_k^{(t)}, \Sigma_k^{(t)})$$

M-step
Compute weighted MLE for μ, Σ, and p given expected classes above

$$\mu_k^{(t+1)} = \frac{\sum_j P(Y_j = k \mid x_j, \lambda_t) \, x_j}{\sum_j P(Y_j = k \mid x_j, \lambda_t)}$$

$$\Sigma_k^{(t+1)} = \frac{\sum_j P(Y_j = k \mid x_j, \lambda_t) \, [x_j - \mu_k^{(t+1)}][x_j - \mu_k^{(t+1)}]^T}{\sum_j P(Y_j = k \mid x_j, \lambda_t)}$$

$$p_k^{(t+1)} = \frac{\sum_j P(Y_j = k \mid x_j, \lambda_t)}{m}, \qquad m = \text{number of training examples}$$

Gaussian Mixture Example
[Figure sequence: start; after the 1st, 2nd, 3rd, 4th, 5th, 6th, and 20th iterations]

What if we do hard assignments?
Iterate: On the t’th iteration let our estimates be
λt = { μ1(t), μ2(t) … μK(t) }

E-step
Compute “expected” classes of all datapoints

$$P(Y_j = k \mid x_j, \mu_1 \ldots \mu_K) \propto \exp\left( -\frac{1}{2\sigma^2} \lVert x_j - \mu_k \rVert^2 \right) P(Y_j = k)$$

M-step
Compute most likely new μs given class expectations. The soft update

$$\mu_k = \frac{\sum_{j=1}^{m} P(Y_j = k \mid x_j) \, x_j}{\sum_{j=1}^{m} P(Y_j = k \mid x_j)}$$

becomes

$$\mu_k = \frac{\sum_{j=1}^{m} \delta(Y_j = k, x_j) \, x_j}{\sum_{j=1}^{m} \delta(Y_j = k, x_j)}$$

where δ represents hard assignment to the “most likely” or nearest cluster.
Equivalent to the k-means clustering algorithm!!!

Let’s look at the math behind the magic!
Next lecture, we will argue that EM:
• Optimizes a bound on the likelihood
• Is a type of coordinate ascent
• Is guaranteed to converge to a (often local) optimum
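The means-only updates and their hard-assignment variant can be compared side by side. The sketch below is illustrative only: it assumes 1D data, a uniform class prior, and a fixed σ, as in the simple example. The function names (`responsibilities`, `harden`, `em_means`), the data, and the starting means are all made up for this example.

```python
import math


def responsibilities(x, mus, sigma=1.0):
    """E-step for one point: P(Y = k | x) under a uniform prior and fixed sigma."""
    # Subtract the largest exponent before exponentiating, for numerical stability.
    logits = [-((x - mu) ** 2) / (2.0 * sigma ** 2) for mu in mus]
    top = max(logits)
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def harden(resp):
    """Replace soft responsibilities by a 0/1 indicator of the most likely class."""
    best = max(range(len(resp)), key=lambda k: resp[k])
    return [1.0 if k == best else 0.0 for k in range(len(resp))]


def em_means(xs, mus, iterations=50, hard=False, sigma=1.0):
    """Means-only EM for a 1D mixture; hard=True gives the k-means style update."""
    mus = list(mus)
    for _ in range(iterations):
        # E-step: soft (or hardened) class expectations for every data point.
        resp = [responsibilities(x, mus, sigma) for x in xs]
        if hard:
            resp = [harden(r) for r in resp]
        # M-step: each mean becomes the responsibility-weighted average of the data.
        for k in range(len(mus)):
            weight = sum(r[k] for r in resp)
            if weight > 0.0:  # keep the old mean if a cluster received no points
                mus[k] = sum(r[k] * x for r, x in zip(resp, xs)) / weight
    return mus


data = [-2.1, -1.9, -2.0, 1.9, 2.0, 2.1]  # made-up illustrative data
soft_means = em_means(data, [-1.0, 1.0])
hard_means = em_means(data, [-1.0, 1.0], hard=True)
```

With `hard=True` each mean is the plain average of the points nearest to it, which is the k-means update. The soft version lets every point pull slightly on every mean, so on this well-separated data its estimates land close to, but not exactly on, the hard ones.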