
CSC 411: Lecture 13: Mixtures of Gaussians and the EM Algorithm

Richard Zemel, Raquel Urtasun and Sanja Fidler

University of Toronto

Today

Mixture of Gaussians
EM algorithm
Latent Variables

A Generative View of Clustering

Last time: hard and soft k-means algorithms
Today: statistical formulation of clustering → principled, with a justification for the updates
We need a sensible measure of what it means to cluster the data well
- This makes it possible to judge different methods
- It may help us decide on the number of clusters
An obvious approach is to imagine that the data was produced by a generative model
- Then we adjust the model parameters to maximize the probability that it would produce exactly the data we observed

Gaussian Mixture Model (GMM)

A Gaussian mixture model represents a distribution as

p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)

with \pi_k the mixing coefficients, where

\sum_{k=1}^{K} \pi_k = 1 \quad \text{and} \quad \pi_k \geq 0 \;\; \forall k

A GMM is a density estimator
Where have we already used a density estimator?
We know that neural nets are universal approximators of functions
GMMs are universal approximators of densities (if you have enough Gaussians). Even diagonal GMMs are universal approximators.
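As a concrete illustration (my own sketch, not part of the original slides), here is a minimal NumPy/SciPy evaluation of the mixture density above; the two components' parameters are made up for the example:

# Evaluate p(x) = sum_k pi_k N(x | mu_k, Sigma_k) for an illustrative 2-component GMM.
import numpy as np
from scipy.stats import multivariate_normal

pi = np.array([0.4, 0.6])                            # mixing coefficients (sum to 1)
mu = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]    # component means
Sigma = [np.eye(2), 0.5 * np.eye(2)]                 # component covariances

def gmm_density(x):
    """Mixture density at a single point x."""
    return sum(pi[k] * multivariate_normal.pdf(x, mean=mu[k], cov=Sigma[k])
               for k in range(len(pi)))

print(gmm_density(np.array([1.0, 1.0])))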

Visualizing a Mixture of Gaussians – 1D Gaussians

In the beginning of class, we tried to fit a Gaussian to data:
Now, we are trying to fit a GMM (with K = 2 in this example):
[Slide credit: K. Kutulakos]

Visualizing a Mixture of Gaussians – 2D Gaussians

Fitting GMMs: Maximum Likelihood

Maximum likelihood maximizes

\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x^{(n)} \mid \mu_k, \Sigma_k) \right)

w.r.t. Θ = {π_k, µ_k, Σ_k}
Problems:
- Singularities: arbitrarily large likelihood when a Gaussian explains a single point
- Identifiability: the solution is only defined up to permutations of the components
How would you optimize this?
Can we have a closed-form update?
Don't forget to satisfy the constraints on π_k
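To make the objective concrete, here is a hedged NumPy sketch (not from the lecture) of the log-likelihood above, computed with a log-sum-exp for numerical stability; X, pi, mu, and Sigma are assumed to be arrays of shapes (N, d), (K,), (K, d), and (K, d, d):

# ln p(X | pi, mu, Sigma) = sum_n ln sum_k pi_k N(x^(n) | mu_k, Sigma_k)
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(X, pi, mu, Sigma):
    K = len(pi)
    # log( pi_k * N(x^(n) | mu_k, Sigma_k) ) for every n and k, shape (N, K)
    log_pk = np.stack([np.log(pi[k]) + multivariate_normal.logpdf(X, mean=mu[k], cov=Sigma[k])
                       for k in range(K)], axis=1)
    return logsumexp(log_pk, axis=1).sum()   # sum over n of ln sum_k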

Trick: Introduce a Latent Variable

Introduce a hidden variable whose knowledge would simplify the maximization
We could introduce a hidden (latent) variable z which would represent which Gaussian generated our observation x, with some probability
Let z ∼ Categorical(π) (where π_k ≥ 0 and \sum_k π_k = 1)
Then:

p(x) = \sum_{k=1}^{K} p(x, z = k) = \sum_{k=1}^{K} \underbrace{p(z = k)}_{\pi_k} \, \underbrace{p(x \mid z = k)}_{\mathcal{N}(x \mid \mu_k, \Sigma_k)}
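To illustrate this generative view, a small NumPy sketch (mine, not from the slides) that samples from the mixture by first drawing z from Categorical(π) and then x from the chosen Gaussian; the parameters are placeholders:

# Ancestral sampling from a GMM: z ~ Categorical(pi), then x ~ N(mu_z, Sigma_z).
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.4, 0.6])                        # example mixing coefficients
mu = np.array([[0.0, 0.0], [3.0, 3.0]])          # example means
Sigma = np.array([np.eye(2), 0.5 * np.eye(2)])   # example covariances

def sample_gmm(n_samples):
    z = rng.choice(len(pi), size=n_samples, p=pi)                      # latent component labels
    x = np.array([rng.multivariate_normal(mu[k], Sigma[k]) for k in z])
    return x, z

X, Z = sample_gmm(500)   # Z records which Gaussian generated each point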

Latent Variable Models

Some model variables may be unobserved, either at training or at test time, or both
If occasionally unobserved they are missing, e.g., undefined inputs, missing class labels, erroneous targets
Variables which are always unobserved are called latent variables, or sometimes hidden variables
We may want to intentionally introduce latent variables to model complex dependencies between variables – this can actually simplify the model
Form of divide-and-conquer: use simple parts to build complex models
In a mixture model, the identity of the component that generated a given datapoint is a latent variable

Back to GMM

A Gaussian mixture distribution:

p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)

We had: z ∼ Categorical(π) (where π_k ≥ 0 and \sum_k π_k = 1)
Joint distribution: p(x, z) = p(z) p(x|z)
Log-likelihood:

\ell(\pi, \mu, \Sigma) = \ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln p(x^{(n)} \mid \pi, \mu, \Sigma)
= \sum_{n=1}^{N} \ln \sum_{z^{(n)}=1}^{K} p(x^{(n)} \mid z^{(n)}; \mu, \Sigma) \, p(z^{(n)} \mid \pi)

Note: we have a hidden variable z^{(n)} for every observation
General problem: sum inside the log
How can we optimize this?

Maximum Likelihood

If we knew z^{(n)} for every x^{(n)}, the maximum likelihood problem would be easy:

\ell(\pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln p(x^{(n)}, z^{(n)} \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \left[ \ln p(x^{(n)} \mid z^{(n)}; \mu, \Sigma) + \ln p(z^{(n)} \mid \pi) \right]

We have been optimizing something similar for Gaussian Bayes classifiers
We would get this (check old slides):

\mu_k = \frac{\sum_{n=1}^{N} \mathbb{1}_{[z^{(n)}=k]} \, x^{(n)}}{\sum_{n=1}^{N} \mathbb{1}_{[z^{(n)}=k]}}
\Sigma_k = \frac{\sum_{n=1}^{N} \mathbb{1}_{[z^{(n)}=k]} \, (x^{(n)} - \mu_k)(x^{(n)} - \mu_k)^T}{\sum_{n=1}^{N} \mathbb{1}_{[z^{(n)}=k]}}
\pi_k = \frac{1}{N} \sum_{n=1}^{N} \mathbb{1}_{[z^{(n)}=k]}
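A hedged NumPy sketch (mine, not the course's code) of these complete-data estimates, assuming X has shape (N, d), z holds integer labels in {0, ..., K-1}, and every cluster receives at least one point:

# Complete-data maximum likelihood for a GMM when the labels z are known.
import numpy as np

def fit_gmm_complete_data(X, z, K):
    pi = np.array([(z == k).mean() for k in range(K)])                    # pi_k = (1/N) sum_n 1[z_n = k]
    mu = np.array([X[z == k].mean(axis=0) for k in range(K)])             # within-cluster means
    Sigma = np.array([np.cov(X[z == k].T, bias=True) for k in range(K)])  # within-cluster ML covariances
    return pi, mu, Sigma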

Intuitively, How Can We Fit a Mixture of Gaussians?

Optimization uses the Expectation Maximization algorithm, which alternates between two steps:
1. E-step: Compute the posterior probability that each Gaussian generates each datapoint (as this is unknown to us)
2. M-step: Assuming that the data really was generated this way, change the parameters of each Gaussian to maximize the probability that it would generate the data it is currently responsible for.
[Figure: example responsibilities – roughly .95/.05 for points clearly belonging to one Gaussian, .5/.5 for ambiguous points]

Expectation Maximization

Elegant and powerful method for finding maximum likelihood solutions for models with latent variables
1. E-step:
- In order to adjust the parameters, we must first solve the inference problem: which Gaussian generated each datapoint?
- We cannot be sure, so it's a distribution over all possibilities:

\gamma_k^{(n)} = p(z^{(n)} = k \mid x^{(n)}; \pi, \mu, \Sigma)

2. M-step:
- Each Gaussian gets a certain amount of posterior probability for each datapoint.
- At the optimum we shall satisfy

\frac{\partial \ln p(X \mid \pi, \mu, \Sigma)}{\partial \Theta} = 0

- We can derive closed-form updates for all parameters

Visualizing a Mixture of Gaussians

E-Step: Responsibilities

Conditional probability (using Bayes rule) of z given x:

\gamma_k = p(z = k \mid x) = \frac{p(z = k) \, p(x \mid z = k)}{p(x)}
= \frac{p(z = k) \, p(x \mid z = k)}{\sum_{j=1}^{K} p(z = j) \, p(x \mid z = j)}
= \frac{\pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)}

\gamma_k can be viewed as the responsibility of component k for x
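A hedged NumPy sketch of this E-step (illustrative only); it uses the same array shapes as the log-likelihood sketch earlier:

# E-step: gamma[n, k] = pi_k N(x^(n) | mu_k, Sigma_k) / sum_j pi_j N(x^(n) | mu_j, Sigma_j)
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pi, mu, Sigma):
    K = len(pi)
    # Unnormalized posteriors pi_k * N(x^(n) | mu_k, Sigma_k), shape (N, K)
    weighted = np.stack([pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
                         for k in range(K)], axis=1)
    return weighted / weighted.sum(axis=1, keepdims=True)   # normalize across components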

M-Step: Estimate Parameters

Log-likelihood:

\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x^{(n)} \mid \mu_k, \Sigma_k) \right)

Set derivatives to 0:

\frac{\partial \ln p(X \mid \pi, \mu, \Sigma)}{\partial \mu_k} = 0 = \sum_{n=1}^{N} \frac{\pi_k \, \mathcal{N}(x^{(n)} \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x^{(n)} \mid \mu_j, \Sigma_j)} \, \Sigma_k^{-1} (x^{(n)} - \mu_k)

We used

\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d \, |\Sigma|}} \exp\left( -\tfrac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)

and

\frac{\partial \, (x^T A x)}{\partial x} = x^T (A + A^T)

M-Step: Estimate Parameters

\frac{\partial \ln p(X \mid \pi, \mu, \Sigma)}{\partial \mu_k} = 0 = \sum_{n=1}^{N} \underbrace{\frac{\pi_k \, \mathcal{N}(x^{(n)} \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x^{(n)} \mid \mu_j, \Sigma_j)}}_{\gamma_k^{(n)}} \, \Sigma_k^{-1} (x^{(n)} - \mu_k)

This gives

\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_k^{(n)} x^{(n)}

with N_k the effective number of points in cluster k:

N_k = \sum_{n=1}^{N} \gamma_k^{(n)}

We just take the center of gravity of the data that the Gaussian is responsible for
Just like in K-means, except the data is weighted by the posterior probability of the Gaussian.
Guaranteed to lie in the convex hull of the data (could be a big initial jump)

M-Step (variance, mixing coefficients)

We can similarly derive an expression for the variance:

\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_k^{(n)} (x^{(n)} - \mu_k)(x^{(n)} - \mu_k)^T

We can also maximize w.r.t. the mixing coefficients:

\pi_k = \frac{N_k}{N}, \quad \text{with} \quad N_k = \sum_{n=1}^{N} \gamma_k^{(n)}

The optimal mixing proportion to use (given these posterior probabilities) is just the fraction of the data that the Gaussian gets responsibility for.
Note that this is not a closed-form solution for the parameters, as they depend on the responsibilities γ_k^{(n)}, which are complex functions of the parameters
But we have a simple iterative scheme to optimize
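The updates above translate into a few lines of NumPy; this is a hedged sketch (not the course's reference code) that takes responsibilities gamma of shape (N, K) from an E-step:

# M-step: re-estimate (pi, mu, Sigma) from the responsibilities gamma.
import numpy as np

def m_step(X, gamma):
    N, d = X.shape
    Nk = gamma.sum(axis=0)                      # effective counts N_k
    pi = Nk / N                                 # pi_k = N_k / N
    mu = (gamma.T @ X) / Nk[:, None]            # mu_k = (1/N_k) sum_n gamma_k^(n) x^(n)
    Sigma = np.empty((len(Nk), d, d))
    for k in range(len(Nk)):
        diff = X - mu[k]                        # (N, d) deviations from the new mean
        Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    return pi, mu, Sigma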

EM Algorithm for GMM

Initialize the means µ_k, covariances Σ_k and mixing coefficients π_k
Iterate until convergence:
- E-step: Evaluate the responsibilities given the current parameters

\gamma_k^{(n)} = p(z^{(n)} = k \mid x^{(n)}) = \frac{\pi_k \, \mathcal{N}(x^{(n)} \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x^{(n)} \mid \mu_j, \Sigma_j)}

- M-step: Re-estimate the parameters given the current responsibilities

\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_k^{(n)} x^{(n)}
\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_k^{(n)} (x^{(n)} - \mu_k)(x^{(n)} - \mu_k)^T
\pi_k = \frac{N_k}{N}, \quad \text{with} \quad N_k = \sum_{n=1}^{N} \gamma_k^{(n)}

- Evaluate the log-likelihood and check for convergence:

\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x^{(n)} \mid \mu_k, \Sigma_k) \right)
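Putting the pieces together, a compact sketch of the full loop (again illustrative, not the course's reference implementation); it assumes the e_step, m_step, and gmm_log_likelihood sketches above are in scope, and uses a simple random initialization:

# Full EM loop for a GMM, wiring together the sketches defined earlier.
import numpy as np

def fit_gmm_em(X, K, n_iters=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    pi = np.full(K, 1.0 / K)                              # start from uniform mixing coefficients
    mu = X[rng.choice(N, size=K, replace=False)]          # means initialized at random datapoints
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    prev_ll = -np.inf
    for _ in range(n_iters):
        gamma = e_step(X, pi, mu, Sigma)                  # E-step: responsibilities
        pi, mu, Sigma = m_step(X, gamma)                  # M-step: closed-form re-estimation
        ll = gmm_log_likelihood(X, pi, mu, Sigma)         # monitor the log-likelihood
        if ll - prev_ll < tol:                            # stop once the improvement is negligible
            break
        prev_ll = ll
    return pi, mu, Sigma, gamma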

Mixture of Gaussians vs. K-means

EM for mixtures of Gaussians is just like a soft version of K-means, with fixed priors and covariance
Instead of hard assignments in the E-step, we do soft assignments based on the softmax of the squared Mahalanobis distance from each point to each cluster.
Each center is moved by a weighted mean of the data, with weights given by the soft assignments
In K-means, the weights are 0 or 1
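To see the connection concretely (my own sketch, not from the slides): with equal mixing weights and a shared spherical covariance εI, the responsibilities reduce to a softmax of −‖x − µ_k‖² / (2ε), and as ε → 0 they approach the hard 0/1 assignments of K-means:

# Responsibilities under a shared covariance eps*I and equal mixing weights.
import numpy as np
from scipy.special import softmax

def soft_assignments(X, mu, eps):
    sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
    return softmax(-sq_dist / (2.0 * eps), axis=1)

# For small eps (e.g. eps=1e-3) the rows are nearly one-hot, matching K-means' hard assignments.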

An Alternative View of EM

Hard to maximize the (log-)likelihood of the data directly
General problem: sum inside the log

\ln p(x \mid \Theta) = \ln \sum_{z} p(x, z \mid \Theta)

{x, z} is the complete data, and x is the incomplete data
If we knew z, it would be easy to maximize (replace the sum over k with just the k where z = k)
Unfortunately we are not given the complete data, but only the incomplete.
Our knowledge about the latent variables is p(Z|X, Θ^old)
In the E-step we compute p(Z|X, Θ^old)
In the M-step we maximize w.r.t. Θ

Q(\Theta, \Theta^{\text{old}}) = \sum_{Z} p(Z \mid X, \Theta^{\text{old}}) \ln p(X, Z \mid \Theta)

General EM Algorithm

1. Initialize Θ^old
2. E-step: Evaluate p(Z|X, Θ^old)
3. M-step:

\Theta^{\text{new}} = \arg\max_{\Theta} Q(\Theta, \Theta^{\text{old}}), \quad \text{where} \quad Q(\Theta, \Theta^{\text{old}}) = \sum_{Z} p(Z \mid X, \Theta^{\text{old}}) \ln p(X, Z \mid \Theta)

4. Evaluate the log-likelihood and check for convergence (of the log-likelihood or the parameters). If not converged, set Θ^old ← Θ^new and go to step 2

Beyond this slide: read on if you are interested in more details

How do we know that the updates improve things?

Updating each Gaussian definitely improves the probability of generating the data if we generate it from the same Gaussians after the parameter updates.
- But we know that the posterior will change after updating the parameters.
A good way to show that this is OK is to show that there is a single function that is improved by both the E-step and the M-step.
- The function we need is called Free Energy.

Why EM converges

Free energy F is a cost function that is reduced by both the E-step and the M-step:

F = expected energy − entropy

The expected energy term measures how difficult it is to generate each datapoint from the Gaussians it is assigned to. It would be happiest assigning each datapoint to the Gaussian that generates it most easily (as in K-means).
The entropy term encourages "soft" assignments. It would be happiest spreading the assignment probabilities for each datapoint equally between all the Gaussians.

Free Energy

Our goal is to maximize

p(X \mid \Theta) = \sum_{z} p(X, z \mid \Theta)

Typically optimizing p(X|Θ) is difficult, but p(X, Z|Θ) is easy
Let q(Z) be a distribution over the latent variables. For any distribution q(Z) we have

\ln p(X \mid \Theta) = L(q, \Theta) + \mathrm{KL}(q \,\|\, p(Z \mid X, \Theta))

where

L(q, \Theta) = \sum_{Z} q(Z) \ln \frac{p(X, Z \mid \Theta)}{q(Z)}
\mathrm{KL}(q \,\|\, p) = -\sum_{Z} q(Z) \ln \frac{p(Z \mid X, \Theta)}{q(Z)}
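As a quick check of this decomposition (my derivation, using only the definitions above; it relies on \sum_Z q(Z) = 1 and p(X, Z|Θ) = p(Z|X, Θ) p(X|Θ)):

\ln p(X \mid \Theta) = \sum_{Z} q(Z) \ln p(X \mid \Theta)
= \sum_{Z} q(Z) \ln \frac{p(X, Z \mid \Theta)}{p(Z \mid X, \Theta)}
= \sum_{Z} q(Z) \ln \frac{p(X, Z \mid \Theta)}{q(Z)} + \sum_{Z} q(Z) \ln \frac{q(Z)}{p(Z \mid X, \Theta)}
= L(q, \Theta) + \mathrm{KL}(q \,\|\, p(Z \mid X, \Theta))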

More on Free Energy

Since the KL divergence is always nonnegative, and is zero only if q(Z) = p(Z|X, Θ), L(q, Θ) is a lower bound on the log-likelihood:

L(q, \Theta) \leq \ln p(X \mid \Theta)

E-step and M-step

\ln p(X \mid \Theta) = L(q, \Theta) + \mathrm{KL}(q \,\|\, p(Z \mid X, \Theta))

In the E-step we maximize the lower bound L(q, Θ) w.r.t. q(Z)
Since ln p(X|Θ) does not depend on q(Z), the maximum of L is obtained when the KL is 0
This is achieved when q(Z) = p(Z|X, Θ)
The lower bound L is then

L(q, \Theta) = \sum_{Z} p(Z \mid X, \Theta^{\text{old}}) \ln p(X, Z \mid \Theta) - \sum_{Z} p(Z \mid X, \Theta^{\text{old}}) \ln p(Z \mid X, \Theta^{\text{old}}) = Q(\Theta, \Theta^{\text{old}}) + \text{const}

with the constant being the entropy of the q distribution, which is independent of Θ
In the M-step the quantity to be maximized is therefore the expectation of the complete-data log-likelihood
Note that Θ appears only inside the logarithm, and optimizing the complete-data likelihood is easier

Visualization of E-step

The q distribution is set equal to the posterior distribution for the current parameter values Θ^old, causing the lower bound to move up to the same value as the log likelihood function, with the KL divergence vanishing.

Visualization of M-step

The distribution q(Z) is held fixed and the lower bound L(q, Θ) is maximized with respect to the parameter vector Θ to give a revised value Θnew . Because the KL divergence is nonnegative, this causes the log likelihood ln p(X|Θ) to increase by at least as much as the lower bound does.

Visualization of the EM Algorithm

The EM algorithm involves alternately computing a lower bound on the log likelihood for the current parameter values and then maximizing this bound to obtain the new parameter values. See the text for a full discussion.

Summary: EM is coordinate descent in Free Energy

L(q, \Theta) = \sum_{Z} p(Z \mid X, \Theta^{\text{old}}) \ln p(X, Z \mid \Theta) - \sum_{Z} p(Z \mid X, \Theta^{\text{old}}) \ln p(Z \mid X, \Theta^{\text{old}})
= Q(\Theta, \Theta^{\text{old}}) + \text{const}
= \text{expected energy} - \text{entropy}

The E-step minimizes F by finding the best distribution over hidden configurations for each data point. The M-step holds the distribution fixed and minimizes F by changing the parameters that determine the energy of a configuration.
