
CSC 411: Lecture 13: Mixtures of Gaussians and the EM Algorithm

Richard Zemel, Raquel Urtasun and Sanja Fidler

University of Toronto

Today

Mixture of Gaussians
EM algorithm
Latent Variables

A Generative View of Clustering

Last time: hard and soft k-means algorithms
Today: statistical formulation of clustering → principled, with a justification for the updates
We need a sensible measure of what it means to cluster the data well
- This makes it possible to judge different methods
- It may help us decide on the number of clusters
An obvious approach is to imagine that the data was produced by a generative model
- Then we adjust the model parameters to maximize the probability that it would produce exactly the data we observed

Gaussian Mixture Model (GMM)

A Gaussian mixture model represents a distribution as

p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)

with \pi_k the mixing coefficients, where

\sum_{k=1}^{K} \pi_k = 1 \quad \text{and} \quad \pi_k \geq 0 \;\; \forall k

A GMM is a density estimator
Where have we already used a density estimator?
We know that neural nets are universal approximators of functions
GMMs are universal approximators of densities (if you have enough Gaussians). Even diagonal GMMs are universal approximators.
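As a concrete illustration (my own sketch, not part of the original slides), here is a minimal NumPy/SciPy evaluation of the mixture density above; the two components' parameters are made up for the example:

# Evaluate p(x) = sum_k pi_k N(x | mu_k, Sigma_k) for an illustrative 2-component GMM.
import numpy as np
from scipy.stats import multivariate_normal

pi = np.array([0.4, 0.6])                            # mixing coefficients (sum to 1)
mu = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]    # component means
Sigma = [np.eye(2), 0.5 * np.eye(2)]                 # component covariances

def gmm_density(x):
    """Mixture density at a single point x."""
    return sum(pi[k] * multivariate_normal.pdf(x, mean=mu[k], cov=Sigma[k])
               for k in range(len(pi)))

print(gmm_density(np.array([1.0, 1.0])))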

Visualizing a Mixture of Gaussians – 1D Gaussians

In the beginning of class, we tried to fit a Gaussian to data:
Now, we are trying to fit a GMM (with K = 2 in this example):
[Slide credit: K. Kutulakos]

Visualizing a Mixture of Gaussians – 2D Gaussians

Fitting GMMs: Maximum Likelihood

Maximum likelihood maximizes

\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x^{(n)} \mid \mu_k, \Sigma_k) \right)

w.r.t. Θ = {π_k, µ_k, Σ_k}
Problems:
- Singularities: arbitrarily large likelihood when a Gaussian explains a single point
- Identifiability: the solution is only defined up to permutations of the components
How would you optimize this?
Can we have a closed-form update?
Don't forget to satisfy the constraints on π_k
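To make the objective concrete, here is a hedged NumPy sketch (not from the lecture) of the log-likelihood above, computed with a log-sum-exp for numerical stability; X, pi, mu, and Sigma are assumed to be arrays of shapes (N, d), (K,), (K, d), and (K, d, d):

# ln p(X | pi, mu, Sigma) = sum_n ln sum_k pi_k N(x^(n) | mu_k, Sigma_k)
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(X, pi, mu, Sigma):
    K = len(pi)
    # log( pi_k * N(x^(n) | mu_k, Sigma_k) ) for every n and k, shape (N, K)
    log_pk = np.stack([np.log(pi[k]) + multivariate_normal.logpdf(X, mean=mu[k], cov=Sigma[k])
                       for k in range(K)], axis=1)
    return logsumexp(log_pk, axis=1).sum()   # sum over n of ln sum_k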

Trick: Introduce a Latent Variable

Introduce a hidden variable whose knowledge would simplify the maximization
We could introduce a hidden (latent) variable z which would represent which Gaussian generated our observation x, with some probability
Let z ∼ Categorical(π) (where π_k ≥ 0 and \sum_k π_k = 1)
Then:

p(x) = \sum_{k=1}^{K} p(x, z = k) = \sum_{k=1}^{K} \underbrace{p(z = k)}_{\pi_k} \, \underbrace{p(x \mid z = k)}_{\mathcal{N}(x \mid \mu_k, \Sigma_k)}
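To illustrate this generative view, a small NumPy sketch (mine, not from the slides) that samples from the mixture by first drawing z from Categorical(π) and then x from the chosen Gaussian; the parameters are placeholders:

# Ancestral sampling from a GMM: z ~ Categorical(pi), then x ~ N(mu_z, Sigma_z).
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.4, 0.6])                        # example mixing coefficients
mu = np.array([[0.0, 0.0], [3.0, 3.0]])          # example means
Sigma = np.array([np.eye(2), 0.5 * np.eye(2)])   # example covariances

def sample_gmm(n_samples):
    z = rng.choice(len(pi), size=n_samples, p=pi)                      # latent component labels
    x = np.array([rng.multivariate_normal(mu[k], Sigma[k]) for k in z])
    return x, z

X, Z = sample_gmm(500)   # Z records which Gaussian generated each point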

Latent Variable Models

Some model variables may be unobserved, either at training or at test time, or both
If occasionally unobserved they are missing, e.g., undefined inputs, missing class labels, erroneous targets
Variables which are always unobserved are called latent variables, or sometimes hidden variables
We may want to intentionally introduce latent variables to model complex dependencies between variables – this can actually simplify the model
Form of divide-and-conquer: use simple parts to build complex models
In a mixture model, the identity of the component that generated a given datapoint is a latent variable

Back to GMM

A Gaussian mixture distribution:

p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)

We had: z ∼ Categorical(π) (where π_k ≥ 0 and \sum_k π_k = 1)
Joint distribution: p(x, z) = p(z) p(x|z)
Log-likelihood:

\ell(\pi, \mu, \Sigma) = \ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln p(x^{(n)} \mid \pi, \mu, \Sigma)
= \sum_{n=1}^{N} \ln \sum_{z^{(n)}=1}^{K} p(x^{(n)} \mid z^{(n)}; \mu, \Sigma) \, p(z^{(n)} \mid \pi)

Note: we have a hidden variable z^{(n)} for every observation
General problem: sum inside the log
How can we optimize this?

Maximum Likelihood

If we knew z^{(n)} for every x^{(n)}, the maximum likelihood problem would be easy:

\ell(\pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln p(x^{(n)}, z^{(n)} \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \left[ \ln p(x^{(n)} \mid z^{(n)}; \mu, \Sigma) + \ln p(z^{(n)} \mid \pi) \right]

We have been optimizing something similar for Gaussian Bayes classifiers
We would get this (check old slides):

\mu_k = \frac{\sum_{n=1}^{N} \mathbb{1}_{[z^{(n)}=k]} \, x^{(n)}}{\sum_{n=1}^{N} \mathbb{1}_{[z^{(n)}=k]}}
\Sigma_k = \frac{\sum_{n=1}^{N} \mathbb{1}_{[z^{(n)}=k]} \, (x^{(n)} - \mu_k)(x^{(n)} - \mu_k)^T}{\sum_{n=1}^{N} \mathbb{1}_{[z^{(n)}=k]}}
\pi_k = \frac{1}{N} \sum_{n=1}^{N} \mathbb{1}_{[z^{(n)}=k]}
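A hedged NumPy sketch (mine, not the course's code) of these complete-data estimates, assuming X has shape (N, d), z holds integer labels in {0, ..., K-1}, and every cluster receives at least one point:

# Complete-data maximum likelihood for a GMM when the labels z are known.
import numpy as np

def fit_gmm_complete_data(X, z, K):
    pi = np.array([(z == k).mean() for k in range(K)])                    # pi_k = (1/N) sum_n 1[z_n = k]
    mu = np.array([X[z == k].mean(axis=0) for k in range(K)])             # within-cluster means
    Sigma = np.array([np.cov(X[z == k].T, bias=True) for k in range(K)])  # within-cluster ML covariances
    return pi, mu, Sigma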

Intuitively, How Can We Fit a Mixture of Gaussians?

Optimization uses the Expectation Maximization algorithm, which alternates between two steps:
1. E-step: Compute the posterior probability that each Gaussian generates each datapoint (as this is unknown to us)
2. M-step: Assuming that the data really was generated this way, change the parameters of each Gaussian to maximize the probability that it would generate the data it is currently responsible for.
[Figure: example responsibilities – roughly .95/.05 for points clearly belonging to one Gaussian, .5/.5 for ambiguous points]

Expectation Maximization

Elegant and powerful method for finding maximum likelihood solutions for models with latent variables
1. E-step:
- In order to adjust the parameters, we must first solve the inference problem: which Gaussian generated each datapoint?
- We cannot be sure, so it's a distribution over all possibilities:

\gamma_k^{(n)} = p(z^{(n)} = k \mid x^{(n)}; \pi, \mu, \Sigma)

2. M-step:
- Each Gaussian gets a certain amount of posterior probability for each datapoint.
- At the optimum we shall satisfy

\frac{\partial \ln p(X \mid \pi, \mu, \Sigma)}{\partial \Theta} = 0

- We can derive closed-form updates for all parameters

Visualizing a Mixture of Gaussians

E-Step: Responsibilities

Conditional probability (using Bayes rule) of z given x:

\gamma_k = p(z = k \mid x) = \frac{p(z = k) \, p(x \mid z = k)}{p(x)}
= \frac{p(z = k) \, p(x \mid z = k)}{\sum_{j=1}^{K} p(z = j) \, p(x \mid z = j)}
= \frac{\pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)}

\gamma_k can be viewed as the responsibility of component k for x
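A hedged NumPy sketch of this E-step (illustrative only); it uses the same array shapes as the log-likelihood sketch earlier:

# E-step: gamma[n, k] = pi_k N(x^(n) | mu_k, Sigma_k) / sum_j pi_j N(x^(n) | mu_j, Sigma_j)
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pi, mu, Sigma):
    K = len(pi)
    # Unnormalized posteriors pi_k * N(x^(n) | mu_k, Sigma_k), shape (N, K)
    weighted = np.stack([pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
                         for k in range(K)], axis=1)
    return weighted / weighted.sum(axis=1, keepdims=True)   # normalize across components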

M-Step: Estimate Parameters

Log-likelihood:

\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x^{(n)} \mid \mu_k, \Sigma_k) \right)

Set derivatives to 0:

\frac{\partial \ln p(X \mid \pi, \mu, \Sigma)}{\partial \mu_k} = 0 = \sum_{n=1}^{N} \frac{\pi_k \, \mathcal{N}(x^{(n)} \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x^{(n)} \mid \mu_j, \Sigma_j)} \, \Sigma_k^{-1} (x^{(n)} - \mu_k)

We used

\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d \, |\Sigma|}} \exp\left( -\tfrac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)

and

\frac{\partial \, (x^T A x)}{\partial x} = x^T (A + A^T)

M-Step: Estimate Parameters

\frac{\partial \ln p(X \mid \pi, \mu, \Sigma)}{\partial \mu_k} = 0 = \sum_{n=1}^{N} \underbrace{\frac{\pi_k \, \mathcal{N}(x^{(n)} \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x^{(n)} \mid \mu_j, \Sigma_j)}}_{\gamma_k^{(n)}} \, \Sigma_k^{-1} (x^{(n)} - \mu_k)

This gives

\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_k^{(n)} x^{(n)}

with N_k the effective number of points in cluster k:

N_k = \sum_{n=1}^{N} \gamma_k^{(n)}

We just take the center of gravity of the data that the Gaussian is responsible for
Just like in K-means, except the data is weighted by the posterior probability of the Gaussian.
Guaranteed to lie in the convex hull of the data (could be a big initial jump)

M-Step (variance, mixing coefficients)

We can similarly derive an expression for the variance:

\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_k^{(n)} (x^{(n)} - \mu_k)(x^{(n)} - \mu_k)^T

We can also maximize w.r.t. the mixing coefficients:

\pi_k = \frac{N_k}{N}, \quad \text{with} \quad N_k = \sum_{n=1}^{N} \gamma_k^{(n)}

The optimal mixing proportion to use (given these posterior probabilities) is just the fraction of the data that the Gaussian gets responsibility for.
Note that this is not a closed-form solution for the parameters, as they depend on the responsibilities γ_k^{(n)}, which are complex functions of the parameters
But we have a simple iterative scheme to optimize
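The updates above translate into a few lines of NumPy; this is a hedged sketch (not the course's reference code) that takes responsibilities gamma of shape (N, K) from an E-step:

# M-step: re-estimate (pi, mu, Sigma) from the responsibilities gamma.
import numpy as np

def m_step(X, gamma):
    N, d = X.shape
    Nk = gamma.sum(axis=0)                      # effective counts N_k
    pi = Nk / N                                 # pi_k = N_k / N
    mu = (gamma.T @ X) / Nk[:, None]            # mu_k = (1/N_k) sum_n gamma_k^(n) x^(n)
    Sigma = np.empty((len(Nk), d, d))
    for k in range(len(Nk)):
        diff = X - mu[k]                        # (N, d) deviations from the new mean
        Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    return pi, mu, Sigma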

EM Algorithm for GMM

Initialize the means µ_k, covariances Σ_k and mixing coefficients π_k
Iterate until convergence:
- E-step: Evaluate the responsibilities given the current parameters

\gamma_k^{(n)} = p(z^{(n)} = k \mid x^{(n)}) = \frac{\pi_k \, \mathcal{N}(x^{(n)} \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x^{(n)} \mid \mu_j, \Sigma_j)}

- M-step: Re-estimate the parameters given the current responsibilities

\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_k^{(n)} x^{(n)}
\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_k^{(n)} (x^{(n)} - \mu_k)(x^{(n)} - \mu_k)^T
\pi_k = \frac{N_k}{N}, \quad \text{with} \quad N_k = \sum_{n=1}^{N} \gamma_k^{(n)}

- Evaluate the log-likelihood and check for convergence:

\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x^{(n)} \mid \mu_k, \Sigma_k) \right)
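Putting the pieces together, a compact sketch of the full loop (again illustrative, not the course's reference implementation); it assumes the e_step, m_step, and gmm_log_likelihood sketches above are in scope, and uses a simple random initialization:

# Full EM loop for a GMM, wiring together the sketches defined earlier.
import numpy as np

def fit_gmm_em(X, K, n_iters=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    pi = np.full(K, 1.0 / K)                              # start from uniform mixing coefficients
    mu = X[rng.choice(N, size=K, replace=False)]          # means initialized at random datapoints
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    prev_ll = -np.inf
    for _ in range(n_iters):
        gamma = e_step(X, pi, mu, Sigma)                  # E-step: responsibilities
        pi, mu, Sigma = m_step(X, gamma)                  # M-step: closed-form re-estimation
        ll = gmm_log_likelihood(X, pi, mu, Sigma)         # monitor the log-likelihood
        if ll - prev_ll < tol:                            # stop once the improvement is negligible
            break
        prev_ll = ll
    return pi, mu, Sigma, gamma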

Mixture of Gaussians vs. K-means

EM for mixtures of Gaussians is just like a soft version of K-means, with fixed priors and covariance
Instead of hard assignments in the E-step, we do soft assignments based on the softmax of the squared Mahalanobis distance from each point to each cluster.
Each center is moved by a weighted mean of the data, with weights given by the soft assignments
In K-means, the weights are 0 or 1
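To see the connection concretely (my own sketch, not from the slides): with equal mixing weights and a shared spherical covariance εI, the responsibilities reduce to a softmax of −‖x − µ_k‖² / (2ε), and as ε → 0 they approach the hard 0/1 assignments of K-means:

# Responsibilities under a shared covariance eps*I and equal mixing weights.
import numpy as np
from scipy.special import softmax

def soft_assignments(X, mu, eps):
    sq_dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
    return softmax(-sq_dist / (2.0 * eps), axis=1)

# For small eps (e.g. eps=1e-3) the rows are nearly one-hot, matching K-means' hard assignments.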

An Alternative View of EM

Hard to maximize the (log-)likelihood of the data directly
General problem: sum inside the log

\ln p(x \mid \Theta) = \ln \sum_{z} p(x, z \mid \Theta)

{x, z} is the complete data, and x is the incomplete data
If we knew z, it would be easy to maximize (replace the sum over k with just the k where z = k)
Unfortunately we are not given the complete data, but only the incomplete.
Our knowledge about the latent variables is p(Z|X, Θ^old)
In the E-step we compute p(Z|X, Θ^old)
In the M-step we maximize w.r.t. Θ

Q(\Theta, \Theta^{\text{old}}) = \sum_{Z} p(Z \mid X, \Theta^{\text{old}}) \ln p(X, Z \mid \Theta)

General EM Algorithm

1. Initialize Θ^old
2. E-step: Evaluate p(Z|X, Θ^old)
3. M-step:

\Theta^{\text{new}} = \arg\max_{\Theta} Q(\Theta, \Theta^{\text{old}}), \quad \text{where} \quad Q(\Theta, \Theta^{\text{old}}) = \sum_{Z} p(Z \mid X, \Theta^{\text{old}}) \ln p(X, Z \mid \Theta)

4. Evaluate the log-likelihood and check for convergence (of the log-likelihood or the parameters). If not converged, set Θ^old ← Θ^new and go to step 2

Beyond this slide: read on if you are interested in more details

How do we know that the updates improve things?

Updating each Gaussian definitely improves the probability of generating the data if we generate it from the same Gaussians after the parameter updates.
- But we know that the posterior will change after updating the parameters.
A good way to show that this is OK is to show that there is a single function that is improved by both the E-step and the M-step.
- The function we need is called Free Energy.

Why EM converges

Free energy F is a cost function that is reduced by both the E-step and the M-step:

F = expected energy − entropy

The expected energy term measures how difficult it is to generate each datapoint from the Gaussians it is assigned to. It would be happiest assigning each datapoint to the Gaussian that generates it most easily (as in K-means).
The entropy term encourages "soft" assignments. It would be happiest spreading the assignment probabilities for each datapoint equally between all the Gaussians.

Free Energy

Our goal is to maximize

p(X \mid \Theta) = \sum_{z} p(X, z \mid \Theta)

Typically optimizing p(X|Θ) is difficult, but p(X, Z|Θ) is easy
Let q(Z) be a distribution over the latent variables. For any distribution q(Z) we have

\ln p(X \mid \Theta) = L(q, \Theta) + \mathrm{KL}(q \,\|\, p(Z \mid X, \Theta))

where

L(q, \Theta) = \sum_{Z} q(Z) \ln \frac{p(X, Z \mid \Theta)}{q(Z)}
\mathrm{KL}(q \,\|\, p) = -\sum_{Z} q(Z) \ln \frac{p(Z \mid X, \Theta)}{q(Z)}
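As a quick check of this decomposition (my derivation, using only the definitions above; it relies on \sum_Z q(Z) = 1 and p(X, Z|Θ) = p(Z|X, Θ) p(X|Θ)):

\ln p(X \mid \Theta) = \sum_{Z} q(Z) \ln p(X \mid \Theta)
= \sum_{Z} q(Z) \ln \frac{p(X, Z \mid \Theta)}{p(Z \mid X, \Theta)}
= \sum_{Z} q(Z) \ln \frac{p(X, Z \mid \Theta)}{q(Z)} + \sum_{Z} q(Z) \ln \frac{q(Z)}{p(Z \mid X, \Theta)}
= L(q, \Theta) + \mathrm{KL}(q \,\|\, p(Z \mid X, \Theta))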

More on Free Energy

Since the KL divergence is always nonnegative, and is zero only if q(Z) = p(Z|X, Θ), L(q, Θ) is a lower bound on the log-likelihood:

L(q, \Theta) \leq \ln p(X \mid \Theta)

E-step and M-step

\ln p(X \mid \Theta) = L(q, \Theta) + \mathrm{KL}(q \,\|\, p(Z \mid X, \Theta))

In the E-step we maximize the lower bound L(q, Θ) w.r.t. q(Z)
Since ln p(X|Θ) does not depend on q(Z), the maximum of L is obtained when the KL is 0
This is achieved when q(Z) = p(Z|X, Θ)
The lower bound L is then

L(q, \Theta) = \sum_{Z} p(Z \mid X, \Theta^{\text{old}}) \ln p(X, Z \mid \Theta) - \sum_{Z} p(Z \mid X, \Theta^{\text{old}}) \ln p(Z \mid X, \Theta^{\text{old}}) = Q(\Theta, \Theta^{\text{old}}) + \text{const}

with the constant being the entropy of the q distribution, which is independent of Θ
In the M-step the quantity to be maximized is therefore the expectation of the complete-data log-likelihood
Note that Θ appears only inside the logarithm, and optimizing the complete-data likelihood is easier

Visualization of E-step

The q distribution is set equal to the posterior distribution for the current parameter values Θ^old, causing the lower bound to move up to the same value as the log likelihood function, with the KL divergence vanishing.

Visualization of M-step

The distribution q(Z) is held fixed and the lower bound L(q, Θ) is maximized with respect to the parameter vector Θ to give a revised value Θnew . Because the KL divergence is nonnegative, this causes the log likelihood ln p(X|Θ) to increase by at least as much as the lower bound does.

Visualization of the EM Algorithm

The EM algorithm involves alternately computing a lower bound on the log likelihood for the current parameter values and then maximizing this bound to obtain the new parameter values. See the text for a full discussion.

Summary: EM is coordinate descent in Free Energy

L(q, \Theta) = \sum_{Z} p(Z \mid X, \Theta^{\text{old}}) \ln p(X, Z \mid \Theta) - \sum_{Z} p(Z \mid X, \Theta^{\text{old}}) \ln p(Z \mid X, \Theta^{\text{old}})
= Q(\Theta, \Theta^{\text{old}}) + \text{const}
= \text{expected energy} - \text{entropy}

The E-step minimizes F by finding the best distribution over hidden configurations for each data point. The M-step holds the distribution fixed and minimizes F by changing the parameters that determine the energy of a configuration.
