Generative Models for Dimensionality Reduction: Probabilistic PCA and Factor Analysis

Piyush Rai

Machine Learning (CS771A)

Oct 5, 2016

Generative Model for Dimensionality Reduction

Assume the following generative story for each $x_n \in \mathbb{R}^D$

Generate latent variables $z_n \in \mathbb{R}^K$ ($K \ll D$) as

$z_n \sim \mathcal{N}(0, I_K)$

Generate data $x_n$ conditioned on $z_n$ as

$x_n \sim \mathcal{N}(W z_n, \sigma^2 I_D)$

where $W$ is the $D \times K$ “factor loading matrix” or “dictionary”

$z_n$ is the $K$-dim latent features or latent factors or the “coding” of $x_n$ w.r.t. $W$

Note: Can also write $x_n$ as a linear transformation of $z_n$, plus Gaussian noise: $x_n = W z_n + \epsilon_n$ (where $\epsilon_n \sim \mathcal{N}(0, \sigma^2 I_D)$). This is “Probabilistic PCA” (PPCA) with a Gaussian observation model

Want to learn the model parameters $W, \sigma^2$ and the latent factors $\{z_n\}_{n=1}^N$

When $\epsilon_n \sim \mathcal{N}(0, \Psi)$ with $\Psi$ diagonal, the model is called “Factor Analysis” (FA)
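To make the generative story concrete, here is a minimal NumPy sketch that samples data from this model; the sizes D, K, N, the noise variance, and the random W below are illustrative choices of mine, not values from the lecture.

import numpy as np

rng = np.random.default_rng(0)
D, K, N = 10, 3, 500             # illustrative sizes (K << D)
sigma2 = 0.1                     # assumed noise variance
W = rng.normal(size=(D, K))      # D x K "factor loading matrix" / "dictionary"

Z = rng.normal(size=(N, K))                               # z_n ~ N(0, I_K)
X = Z @ W.T + np.sqrt(sigma2) * rng.normal(size=(N, D))   # x_n = W z_n + eps_n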

Generative Model for Dimensionality Reduction

Zooming in at the relationship between each $x_n \in \mathbb{R}^D$ and each $z_n \in \mathbb{R}^K$

$W_{dk}$ denotes the weight of the relationship between feature $d$ and latent factor $k$

This view also helps in thinking about “deep” generative models that have many layers of latent variables or “hidden units”

Linear Gaussian Systems

Note that PPCA and FA are special cases of linear Gaussian systems, which have the following general form

$p(z) = \mathcal{N}(\mu_z, \Sigma_z)$

$p(x|z) = \mathcal{N}(W z + b, \Sigma_x)$   (i.e., $x = W z + b + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \Sigma_x)$)

A few nice properties of such systems (they follow from properties of Gaussians):

The marginal distribution of $x$, i.e., $p(x)$, is Gaussian

$p(x) = \int p(x, z)\,dz = \int p(x|z)\,p(z)\,dz = \mathcal{N}(W \mu_z + b,\ \Sigma_x + W \Sigma_z W^\top)$

The posterior distribution of $z$, i.e., $p(z|x) \propto p(z)\,p(x|z)$, is Gaussian

$p(z|x) = \mathcal{N}(\mu, \Sigma)$ with $\Sigma^{-1} = \Sigma_z^{-1} + W^\top \Sigma_x^{-1} W$ and $\mu = \Sigma\left[W^\top \Sigma_x^{-1}(x - b) + \Sigma_z^{-1}\mu_z\right]$

(Chapter 4 of Murphy and Chapter 2 of Bishop have various useful results on properties of multivariate Gaussians)
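These two results translate directly into a couple of small helpers. The sketch below (NumPy; the function names are my own) simply evaluates the stated formulas for the marginal and posterior moments.

import numpy as np

def lin_gauss_marginal(W, b, mu_z, Sigma_z, Sigma_x):
    """Moments of p(x) = N(W mu_z + b, Sigma_x + W Sigma_z W^T)."""
    return W @ mu_z + b, Sigma_x + W @ Sigma_z @ W.T

def lin_gauss_posterior(x, W, b, mu_z, Sigma_z, Sigma_x):
    """Moments of p(z|x): Sigma^{-1} = Sigma_z^{-1} + W^T Sigma_x^{-1} W,
    mu = Sigma [W^T Sigma_x^{-1} (x - b) + Sigma_z^{-1} mu_z]."""
    Sx_inv = np.linalg.inv(Sigma_x)
    Sz_inv = np.linalg.inv(Sigma_z)
    Sigma = np.linalg.inv(Sz_inv + W.T @ Sx_inv @ W)
    mu = Sigma @ (W.T @ Sx_inv @ (x - b) + Sz_inv @ mu_z)
    return mu, Sigma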

PPCA or FA = Low-Rank Gaussian

Suppose we’re modeling $D$-dim data using a (say zero-mean) Gaussian $p(x) = \mathcal{N}(0, \Sigma)$, where $\Sigma$ is a $D \times D$ p.s.d. covariance matrix: $O(D^2)$ parameters needed

Consider modeling the same data using the one-layer PPCA model

$p(x|z) = \mathcal{N}(W z, \sigma^2 I_D)$ where $p(z) = \mathcal{N}(0, I_K)$

For this Gaussian PPCA, the marginal distribution $p(x) = \int p(x, z)\,dz$ is

$p(x) = \mathcal{N}(0,\ W W^\top + \sigma^2 I_D)$   (using the result from the previous slide)

The covariance matrix is close to low-rank. Also, there are only $(DK + 1)$ free parameters to learn

Thus modeling the data using Gaussian PPCA instead of a Gaussian with full covariance may be easier when we have very little but high-dimensional data (i.e., $D \gg N$)

$p(x)$ is still a Gaussian, but one between the two extremes (diagonal covariance and full covariance)
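A quick way to see the savings is to count parameters for some illustrative sizes (the numbers below are my own example, not from the slides) and to form the implied low-rank-plus-noise covariance.

import numpy as np

D, K = 1000, 10
full_cov_params = D * (D + 1) // 2     # free params of a full symmetric covariance, O(D^2)
ppca_params = D * K + 1                # W (D x K) plus the single noise variance sigma^2
print(full_cov_params, ppca_params)    # 500500 vs 10001

# implied marginal covariance of x under PPCA: W W^T + sigma^2 I_D
rng = np.random.default_rng(1)
W, sigma2 = rng.normal(size=(D, K)), 0.5
C = W @ W.T + sigma2 * np.eye(D)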

Machine Learning (CS771A) Generative Models for Dimensionality Reduction: Probabilistic PCA and Factor Analysis 5 Note: If we just want to estimate W and σ2, we could do MLE directly† on incomplete data > 2 likelihood p(x) = N (0, WW + σ ID ) Closed-form solution† can be obtained for W and σ2 by maximizing N log p(X) = − (D log 2π + log |C| + trace(C−1S) 2

where S is the data cov. matrix and C−1 = σ−1I − σ−1WM−1W>and M = W>W + σ2I But this method isn’t usually preferred because It is expensive (have to work with cov. matrices and their eig-decomp) A closed-form solution may not even be possible for more general models (e.g. Factor Analysis where σ2I is replace by diagonal matrix, or mixture of PPCA) N Won’t be possible to learn the latent variables Z = {z n}n=1

Parameter Estimation for PPCA

N N 2 Data: X = {x n}n=1, latent vars: Z = {z n}n=1, parameters: W, σ

†Probabilistic Principal Component Analysis (Tipping and Bishop, 1999) Machine Learning (CS771A) Generative Models for Dimensionality Reduction: Probabilistic PCA and Factor Analysis 6 Closed-form solution† can be obtained for W and σ2 by maximizing N log p(X) = − (D log 2π + log |C| + trace(C−1S) 2

where S is the data cov. matrix and C−1 = σ−1I − σ−1WM−1W>and M = W>W + σ2I But this method isn’t usually preferred because It is expensive (have to work with cov. matrices and their eig-decomp) A closed-form solution may not even be possible for more general models (e.g. Factor Analysis where σ2I is replace by diagonal matrix, or mixture of PPCA) N Won’t be possible to learn the latent variables Z = {z n}n=1

Parameter Estimation for PPCA

N N 2 Data: X = {x n}n=1, latent vars: Z = {z n}n=1, parameters: W, σ Note: If we just want to estimate W and σ2, we could do MLE directly† on incomplete data > 2 likelihood p(x) = N (0, WW + σ ID )

†Probabilistic Principal Component Analysis (Tipping and Bishop, 1999) Machine Learning (CS771A) Generative Models for Dimensionality Reduction: Probabilistic PCA and Factor Analysis 6 But this method isn’t usually preferred because It is expensive (have to work with cov. matrices and their eig-decomp) A closed-form solution may not even be possible for more general models (e.g. Factor Analysis where σ2I is replace by diagonal matrix, or mixture of PPCA) N Won’t be possible to learn the latent variables Z = {z n}n=1

Parameter Estimation for PPCA

N N 2 Data: X = {x n}n=1, latent vars: Z = {z n}n=1, parameters: W, σ Note: If we just want to estimate W and σ2, we could do MLE directly† on incomplete data > 2 likelihood p(x) = N (0, WW + σ ID ) Closed-form solution† can be obtained for W and σ2 by maximizing N log p(X) = − (D log 2π + log |C| + trace(C−1S) 2

where S is the data cov. matrix and C−1 = σ−1I − σ−1WM−1W>and M = W>W + σ2I

†Probabilistic Principal Component Analysis (Tipping and Bishop, 1999) Machine Learning (CS771A) Generative Models for Dimensionality Reduction: Probabilistic PCA and Factor Analysis 6 It is expensive (have to work with cov. matrices and their eig-decomp) A closed-form solution may not even be possible for more general models (e.g. Factor Analysis where σ2I is replace by diagonal matrix, or mixture of PPCA) N Won’t be possible to learn the latent variables Z = {z n}n=1

Parameter Estimation for PPCA

N N 2 Data: X = {x n}n=1, latent vars: Z = {z n}n=1, parameters: W, σ Note: If we just want to estimate W and σ2, we could do MLE directly† on incomplete data > 2 likelihood p(x) = N (0, WW + σ ID ) Closed-form solution† can be obtained for W and σ2 by maximizing N log p(X) = − (D log 2π + log |C| + trace(C−1S) 2

where S is the data cov. matrix and C−1 = σ−1I − σ−1WM−1W>and M = W>W + σ2I But this method isn’t usually preferred because

†Probabilistic Principal Component Analysis (Tipping and Bishop, 1999) Machine Learning (CS771A) Generative Models for Dimensionality Reduction: Probabilistic PCA and Factor Analysis 6 A closed-form solution may not even be possible for more general models (e.g. Factor Analysis where σ2I is replace by diagonal matrix, or mixture of PPCA) N Won’t be possible to learn the latent variables Z = {z n}n=1

Parameter Estimation for PPCA

N N 2 Data: X = {x n}n=1, latent vars: Z = {z n}n=1, parameters: W, σ Note: If we just want to estimate W and σ2, we could do MLE directly† on incomplete data > 2 likelihood p(x) = N (0, WW + σ ID ) Closed-form solution† can be obtained for W and σ2 by maximizing N log p(X) = − (D log 2π + log |C| + trace(C−1S) 2

where S is the data cov. matrix and C−1 = σ−1I − σ−1WM−1W>and M = W>W + σ2I But this method isn’t usually preferred because It is expensive (have to work with cov. matrices and their eig-decomp)

†Probabilistic Principal Component Analysis (Tipping and Bishop, 1999) Machine Learning (CS771A) Generative Models for Dimensionality Reduction: Probabilistic PCA and Factor Analysis 6 N Won’t be possible to learn the latent variables Z = {z n}n=1

Parameter Estimation for PPCA

N N 2 Data: X = {x n}n=1, latent vars: Z = {z n}n=1, parameters: W, σ Note: If we just want to estimate W and σ2, we could do MLE directly† on incomplete data > 2 likelihood p(x) = N (0, WW + σ ID ) Closed-form solution† can be obtained for W and σ2 by maximizing N log p(X) = − (D log 2π + log |C| + trace(C−1S) 2

where S is the data cov. matrix and C−1 = σ−1I − σ−1WM−1W>and M = W>W + σ2I But this method isn’t usually preferred because It is expensive (have to work with cov. matrices and their eig-decomp) A closed-form solution may not even be possible for more general models (e.g. Factor Analysis where σ2I is replace by diagonal matrix, or mixture of PPCA)

†Probabilistic Principal Component Analysis (Tipping and Bishop, 1999) Machine Learning (CS771A) Generative Models for Dimensionality Reduction: Probabilistic PCA and Factor Analysis 6 Parameter Estimation for PPCA

N N 2 Data: X = {x n}n=1, latent vars: Z = {z n}n=1, parameters: W, σ Note: If we just want to estimate W and σ2, we could do MLE directly† on incomplete data > 2 likelihood p(x) = N (0, WW + σ ID ) Closed-form solution† can be obtained for W and σ2 by maximizing N log p(X) = − (D log 2π + log |C| + trace(C−1S) 2

where S is the data cov. matrix and C−1 = σ−1I − σ−1WM−1W>and M = W>W + σ2I But this method isn’t usually preferred because It is expensive (have to work with cov. matrices and their eig-decomp) A closed-form solution may not even be possible for more general models (e.g. Factor Analysis where σ2I is replace by diagonal matrix, or mixture of PPCA) N Won’t be possible to learn the latent variables Z = {z n}n=1
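For reference, the incomplete-data objective above is easy to evaluate numerically; here is a small sketch (NumPy; hypothetical function name; X is assumed to be centered so that S = XᵀX/N is the data covariance).

import numpy as np

def ppca_incomplete_loglik(X, W, sigma2):
    """log p(X) = -(N/2) (D log 2*pi + log|C| + trace(C^{-1} S)) with C = W W^T + sigma^2 I."""
    N, D = X.shape
    C = W @ W.T + sigma2 * np.eye(D)
    S = X.T @ X / N                                  # data covariance (X assumed centered)
    _, logdetC = np.linalg.slogdet(C)
    return -0.5 * N * (D * np.log(2 * np.pi) + logdetC + np.trace(np.linalg.solve(C, S)))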

EM based Parameter Estimation for PPCA

We will instead go the EM route and work with the complete-data log-likelihood

$\log p(X, Z|W, \sigma^2) = \log \prod_{n=1}^N p(x_n, z_n|W, \sigma^2) = \log \prod_{n=1}^N p(x_n|z_n, W, \sigma^2)\,p(z_n) = \sum_{n=1}^N \left\{\log p(x_n|z_n, W, \sigma^2) + \log p(z_n)\right\}$

As we’ll see, it leads to much simpler expressions and efficient solutions

Recall that $p(x_n|z_n, W, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{D/2}} \exp\left(-\frac{(x_n - W z_n)^\top (x_n - W z_n)}{2\sigma^2}\right)$ and $p(z_n) \propto \exp\left(-\frac{z_n^\top z_n}{2}\right)$

Plugging in, simplifying, using the trace trick, and ignoring constants, we get the following expression for the complete-data log-likelihood $\log p(X, Z|W, \sigma^2)$:

$\sum_{n=1}^N -\left[\frac{D}{2}\log\sigma^2 + \frac{1}{2\sigma^2}\|x_n\|^2 - \frac{1}{\sigma^2} z_n^\top W^\top x_n + \frac{1}{2\sigma^2}\mathrm{tr}(z_n z_n^\top W^\top W) + \frac{1}{2}\mathrm{tr}(z_n z_n^\top)\right]$

We will need the expected value of this quantity in the M step of EM

This requires computing the posterior distribution of $z_n$ in the E step (which is Gaussian; recall the result from the earlier slide on linear Gaussian systems)
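As a sanity check on the trace-trick expression, it can be evaluated directly; a minimal sketch (hypothetical function name, additive constants dropped as on the slide):

import numpy as np

def complete_data_loglik(X, Z, W, sigma2):
    """Complete-data log-likelihood, up to additive constants, per the expression above."""
    N, D = X.shape
    val = N * (D / 2) * np.log(sigma2)
    val += np.sum(X ** 2) / (2 * sigma2)
    val -= np.sum(Z * (X @ W)) / sigma2                   # sum_n z_n^T W^T x_n
    val += np.trace(Z.T @ Z @ W.T @ W) / (2 * sigma2)     # sum_n tr(z_n z_n^T W^T W)
    val += np.trace(Z.T @ Z) / 2                          # sum_n tr(z_n z_n^T)
    return -val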

Machine Learning (CS771A) Generative Models for Dimensionality Reduction: Probabilistic PCA and Factor Analysis 7 2 Taking the derivative of E[log p(X, Z|W, σ )] w.r.t. W and setting to zero

" N #" N #−1 X > X > W = x nE[zn] E[znzn ] n=1 n=1 > To compute W, we also need two expectations E[z n] and E[z nz n ]

These can be obtained in E step by computing posterior over z n, which, using the results of Gaussian posterior for linear Gaussian models, is

−1 > 2 −1 > 2 p(zn|x n, W) = N (M W x n, σ M ) where M = W W + σ IK The required expectations can be easily obtained from the Gaussian posterior −1 > E[zn] = M W x n > > > 2 −1 E[znzn ] = E[zn]E[zn] + cov(zn) = E[zn]E[zn] + σ M Note: The noise σ2 can also be estimated (take deriv., set to zero..)

EM based Parameter Estimation for PPCA

2 The expected complete data log-likelihood E[log p(X, Z|W, σ )] N   X D 2 1 2 1 > > 1 > > 1 > = − log σ + ||x n|| − [zn] W x n + tr( [znz ]W W) + tr( [znz ]) 2 2σ2 σ2 E 2σ2 E n 2 E n n=1

Machine Learning (CS771A) Generative Models for Dimensionality Reduction: Probabilistic PCA and Factor Analysis 8 > To compute W, we also need two expectations E[z n] and E[z nz n ]

These can be obtained in E step by computing posterior over z n, which, using the results of Gaussian posterior for linear Gaussian models, is

−1 > 2 −1 > 2 p(zn|x n, W) = N (M W x n, σ M ) where M = W W + σ IK The required expectations can be easily obtained from the Gaussian posterior −1 > E[zn] = M W x n > > > 2 −1 E[znzn ] = E[zn]E[zn] + cov(zn) = E[zn]E[zn] + σ M Note: The noise variance σ2 can also be estimated (take deriv., set to zero..)

EM based Parameter Estimation for PPCA

2 The expected complete data log-likelihood E[log p(X, Z|W, σ )] N   X D 2 1 2 1 > > 1 > > 1 > = − log σ + ||x n|| − [zn] W x n + tr( [znz ]W W) + tr( [znz ]) 2 2σ2 σ2 E 2σ2 E n 2 E n n=1

2 Taking the derivative of E[log p(X, Z|W, σ )] w.r.t. W and setting to zero

" N #" N #−1 X > X > W = x nE[zn] E[znzn ] n=1 n=1

Machine Learning (CS771A) Generative Models for Dimensionality Reduction: Probabilistic PCA and Factor Analysis 8 These can be obtained in E step by computing posterior over z n, which, using the results of Gaussian posterior for linear Gaussian models, is

−1 > 2 −1 > 2 p(zn|x n, W) = N (M W x n, σ M ) where M = W W + σ IK The required expectations can be easily obtained from the Gaussian posterior −1 > E[zn] = M W x n > > > 2 −1 E[znzn ] = E[zn]E[zn] + cov(zn) = E[zn]E[zn] + σ M Note: The noise variance σ2 can also be estimated (take deriv., set to zero..)

EM based Parameter Estimation for PPCA

2 The expected complete data log-likelihood E[log p(X, Z|W, σ )] N   X D 2 1 2 1 > > 1 > > 1 > = − log σ + ||x n|| − [zn] W x n + tr( [znz ]W W) + tr( [znz ]) 2 2σ2 σ2 E 2σ2 E n 2 E n n=1

2 Taking the derivative of E[log p(X, Z|W, σ )] w.r.t. W and setting to zero

" N #" N #−1 X > X > W = x nE[zn] E[znzn ] n=1 n=1 > To compute W, we also need two expectations E[z n] and E[z nz n ]

Machine Learning (CS771A) Generative Models for Dimensionality Reduction: Probabilistic PCA and Factor Analysis 8 The required expectations can be easily obtained from the Gaussian posterior −1 > E[zn] = M W x n > > > 2 −1 E[znzn ] = E[zn]E[zn] + cov(zn) = E[zn]E[zn] + σ M Note: The noise variance σ2 can also be estimated (take deriv., set to zero..)

EM based Parameter Estimation for PPCA

2 The expected complete data log-likelihood E[log p(X, Z|W, σ )] N   X D 2 1 2 1 > > 1 > > 1 > = − log σ + ||x n|| − [zn] W x n + tr( [znz ]W W) + tr( [znz ]) 2 2σ2 σ2 E 2σ2 E n 2 E n n=1

2 Taking the derivative of E[log p(X, Z|W, σ )] w.r.t. W and setting to zero

" N #" N #−1 X > X > W = x nE[zn] E[znzn ] n=1 n=1 > To compute W, we also need two expectations E[z n] and E[z nz n ]

These can be obtained in E step by computing posterior over z n, which, using the results of Gaussian posterior for linear Gaussian models, is

−1 > 2 −1 > 2 p(zn|x n, W) = N (M W x n, σ M ) where M = W W + σ IK

Machine Learning (CS771A) Generative Models for Dimensionality Reduction: Probabilistic PCA and Factor Analysis 8 −1 > E[zn] = M W x n > > > 2 −1 E[znzn ] = E[zn]E[zn] + cov(zn) = E[zn]E[zn] + σ M Note: The noise variance σ2 can also be estimated (take deriv., set to zero..)

EM based Parameter Estimation for PPCA

2 The expected complete data log-likelihood E[log p(X, Z|W, σ )] N   X D 2 1 2 1 > > 1 > > 1 > = − log σ + ||x n|| − [zn] W x n + tr( [znz ]W W) + tr( [znz ]) 2 2σ2 σ2 E 2σ2 E n 2 E n n=1

2 Taking the derivative of E[log p(X, Z|W, σ )] w.r.t. W and setting to zero

" N #" N #−1 X > X > W = x nE[zn] E[znzn ] n=1 n=1 > To compute W, we also need two expectations E[z n] and E[z nz n ]

These can be obtained in E step by computing posterior over z n, which, using the results of Gaussian posterior for linear Gaussian models, is

−1 > 2 −1 > 2 p(zn|x n, W) = N (M W x n, σ M ) where M = W W + σ IK The required expectations can be easily obtained from the Gaussian posterior

Machine Learning (CS771A) Generative Models for Dimensionality Reduction: Probabilistic PCA and Factor Analysis 8 > > > 2 −1 E[znzn ] = E[zn]E[zn] + cov(zn) = E[zn]E[zn] + σ M Note: The noise variance σ2 can also be estimated (take deriv., set to zero..)

EM based Parameter Estimation for PPCA

2 The expected complete data log-likelihood E[log p(X, Z|W, σ )] N   X D 2 1 2 1 > > 1 > > 1 > = − log σ + ||x n|| − [zn] W x n + tr( [znz ]W W) + tr( [znz ]) 2 2σ2 σ2 E 2σ2 E n 2 E n n=1

2 Taking the derivative of E[log p(X, Z|W, σ )] w.r.t. W and setting to zero

" N #" N #−1 X > X > W = x nE[zn] E[znzn ] n=1 n=1 > To compute W, we also need two expectations E[z n] and E[z nz n ]

These can be obtained in E step by computing posterior over z n, which, using the results of Gaussian posterior for linear Gaussian models, is

−1 > 2 −1 > 2 p(zn|x n, W) = N (M W x n, σ M ) where M = W W + σ IK The required expectations can be easily obtained from the Gaussian posterior −1 > E[zn] = M W x n

Machine Learning (CS771A) Generative Models for Dimensionality Reduction: Probabilistic PCA and Factor Analysis 8 > 2 −1 = E[zn]E[zn] + σ M Note: The noise variance σ2 can also be estimated (take deriv., set to zero..)

EM based Parameter Estimation for PPCA

2 The expected complete data log-likelihood E[log p(X, Z|W, σ )] N   X D 2 1 2 1 > > 1 > > 1 > = − log σ + ||x n|| − [zn] W x n + tr( [znz ]W W) + tr( [znz ]) 2 2σ2 σ2 E 2σ2 E n 2 E n n=1

2 Taking the derivative of E[log p(X, Z|W, σ )] w.r.t. W and setting to zero

" N #" N #−1 X > X > W = x nE[zn] E[znzn ] n=1 n=1 > To compute W, we also need two expectations E[z n] and E[z nz n ]

These can be obtained in E step by computing posterior over z n, which, using the results of Gaussian posterior for linear Gaussian models, is

−1 > 2 −1 > 2 p(zn|x n, W) = N (M W x n, σ M ) where M = W W + σ IK The required expectations can be easily obtained from the Gaussian posterior −1 > E[zn] = M W x n > > E[znzn ] = E[zn]E[zn] + cov(zn)

Machine Learning (CS771A) Generative Models for Dimensionality Reduction: Probabilistic PCA and Factor Analysis 8 Note: The noise variance σ2 can also be estimated (take deriv., set to zero..)

EM based Parameter Estimation for PPCA

2 The expected complete data log-likelihood E[log p(X, Z|W, σ )] N   X D 2 1 2 1 > > 1 > > 1 > = − log σ + ||x n|| − [zn] W x n + tr( [znz ]W W) + tr( [znz ]) 2 2σ2 σ2 E 2σ2 E n 2 E n n=1

2 Taking the derivative of E[log p(X, Z|W, σ )] w.r.t. W and setting to zero

" N #" N #−1 X > X > W = x nE[zn] E[znzn ] n=1 n=1 > To compute W, we also need two expectations E[z n] and E[z nz n ]

These can be obtained in E step by computing posterior over z n, which, using the results of Gaussian posterior for linear Gaussian models, is

−1 > 2 −1 > 2 p(zn|x n, W) = N (M W x n, σ M ) where M = W W + σ IK The required expectations can be easily obtained from the Gaussian posterior −1 > E[zn] = M W x n > > > 2 −1 E[znzn ] = E[zn]E[zn] + cov(zn) = E[zn]E[zn] + σ M

Machine Learning (CS771A) Generative Models for Dimensionality Reduction: Probabilistic PCA and Factor Analysis 8 EM based Parameter Estimation for PPCA

2 The expected complete data log-likelihood E[log p(X, Z|W, σ )] N   X D 2 1 2 1 > > 1 > > 1 > = − log σ + ||x n|| − [zn] W x n + tr( [znz ]W W) + tr( [znz ]) 2 2σ2 σ2 E 2σ2 E n 2 E n n=1

2 Taking the derivative of E[log p(X, Z|W, σ )] w.r.t. W and setting to zero

" N #" N #−1 X > X > W = x nE[zn] E[znzn ] n=1 n=1 > To compute W, we also need two expectations E[z n] and E[z nz n ]

These can be obtained in E step by computing posterior over z n, which, using the results of Gaussian posterior for linear Gaussian models, is

−1 > 2 −1 > 2 p(zn|x n, W) = N (M W x n, σ M ) where M = W W + σ IK The required expectations can be easily obtained from the Gaussian posterior −1 > E[zn] = M W x n > > > 2 −1 E[znzn ] = E[zn]E[zn] + cov(zn) = E[zn]E[zn] + σ M Note: The noise variance σ2 can also be estimated (take deriv., set to zero..)
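These E-step expectations vectorize over the whole dataset; a minimal NumPy sketch (hypothetical function name; since M is symmetric, its inverse can be applied from either side):

import numpy as np

def ppca_e_step(X, W, sigma2):
    """E[z_n] = M^{-1} W^T x_n and E[z_n z_n^T] = E[z_n] E[z_n]^T + sigma^2 M^{-1}."""
    K = W.shape[1]
    M = W.T @ W + sigma2 * np.eye(K)
    Minv = np.linalg.inv(M)
    Ez = X @ W @ Minv                                             # row n is E[z_n]
    EzzT = sigma2 * Minv[None] + Ez[:, :, None] * Ez[:, None, :]  # (N, K, K) stack of E[z_n z_n^T]
    return Ez, EzzT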

The Full EM Algorithm for PPCA

Specify $K$, initialize $W$ and $\sigma^2$ randomly. Also center the data

E step: Compute the expectations required in the M step. For each data point

$\mathbb{E}[z_n] = (W^\top W + \sigma^2 I_K)^{-1} W^\top x_n = M^{-1} W^\top x_n$
$\mathbb{E}[z_n z_n^\top] = \mathrm{cov}(z_n) + \mathbb{E}[z_n]\mathbb{E}[z_n]^\top = \mathbb{E}[z_n]\mathbb{E}[z_n]^\top + \sigma^2 M^{-1}$

M step: Re-estimate $W$ and $\sigma^2$

$W_{\text{new}} = \left[\sum_{n=1}^N x_n \mathbb{E}[z_n]^\top\right]\left[\sum_{n=1}^N \mathbb{E}[z_n z_n^\top]\right]^{-1} = \left[\sum_{n=1}^N x_n \mathbb{E}[z_n]^\top\right]\left[\sum_{n=1}^N \mathbb{E}[z_n]\mathbb{E}[z_n]^\top + \sigma^2 M^{-1}\right]^{-1}$
$\sigma^2_{\text{new}} = \frac{1}{ND}\sum_{n=1}^N \left\{\|x_n\|^2 - 2\,\mathbb{E}[z_n]^\top W_{\text{new}}^\top x_n + \mathrm{tr}\!\left(\mathbb{E}[z_n z_n^\top]\,W_{\text{new}}^\top W_{\text{new}}\right)\right\}$

Set $W = W_{\text{new}}$ and $\sigma^2 = \sigma^2_{\text{new}}$

If not converged, go back to the E step (we can monitor the incomplete/complete log-likelihood to assess convergence)
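Putting the E and M steps together gives a compact EM loop. The sketch below follows the updates on this slide; the function name, iteration count, and initialization are my own choices, and X is assumed to be centered already.

import numpy as np

def ppca_em(X, K, n_iters=100, seed=0):
    """EM for PPCA: alternate the E-step moments and the M-step updates shown above."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    W = rng.normal(size=(D, K))
    sigma2 = 1.0
    for _ in range(n_iters):
        # E step: posterior moments of z_n
        M = W.T @ W + sigma2 * np.eye(K)
        Minv = np.linalg.inv(M)
        Ez = X @ W @ Minv                          # rows are E[z_n] = M^{-1} W^T x_n
        EzzT_sum = N * sigma2 * Minv + Ez.T @ Ez   # sum_n E[z_n z_n^T]
        # M step: update W and sigma^2
        W_new = (X.T @ Ez) @ np.linalg.inv(EzzT_sum)
        sigma2 = (np.sum(X ** 2)
                  - 2 * np.sum(Ez * (X @ W_new))
                  + np.trace(EzzT_sum @ W_new.T @ W_new)) / (N * D)
        W = W_new
    return W, sigma2

# usage on centered data: W_hat, s2_hat = ppca_em(X - X.mean(axis=0), K=3)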

EM for Factor Analysis

Similar to PPCA, except that the Gaussian conditional distribution $p(x_n|z_n)$ has a diagonal instead of a spherical covariance, i.e., $x_n \sim \mathcal{N}(W z_n, \Psi)$, where $\Psi$ is a diagonal matrix

EM for Factor Analysis is very similar to that for PPCA

The required expectations in the E step:

$\mathbb{E}[z_n] = G\, W^\top \Psi^{-1} x_n$
$\mathbb{E}[z_n z_n^\top] = \mathbb{E}[z_n]\mathbb{E}[z_n]^\top + G$

where $G = (W^\top \Psi^{-1} W + I_K)^{-1}$. Note that if $\Psi = \sigma^2 I_D$, we get the same equations as in PPCA

In the M step, the update for $W_{\text{new}}$ is the same as in PPCA

In the M step, the update for $\Psi$ is

$\Psi_{\text{new}} = \mathrm{diag}\left\{S - W_{\text{new}}\,\frac{1}{N}\sum_{n=1}^N \mathbb{E}[z_n]\, x_n^\top\right\}$   ($S$ is the covariance matrix of the data)
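Only the E step changes relative to PPCA; here is a minimal sketch of the FA moments (Psi is stored as the vector of its diagonal entries; the function name is mine):

import numpy as np

def fa_e_step(X, W, psi_diag):
    """E[z_n] = G W^T Psi^{-1} x_n, E[z_n z_n^T] = E[z_n] E[z_n]^T + G, with G = (W^T Psi^{-1} W + I_K)^{-1}."""
    K = W.shape[1]
    G = np.linalg.inv(W.T @ (W / psi_diag[:, None]) + np.eye(K))
    Ez = (X / psi_diag) @ W @ G               # rows are E[z_n] (G is symmetric)
    EzzT_sum = X.shape[0] * G + Ez.T @ Ez     # sum_n E[z_n z_n^T]
    return Ez, EzzT_sum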


Some Aspects about PPCA/FA

Can also handle missing data, as additional latent variables in the E step: just write each data point as x_n = [x_n^obs, x_n^miss] and treat x_n^miss as latent variables. The posterior p(x_n^miss | x_n^obs) over the missing data can usually be computed (a small sketch of this computation appears after this list). Note: the ability to handle missing data is a property of EM in general and can be used in other models as well (e.g., GMM)

Can learn other model parameters, such as the noise variance σ², using MLE/MAP

Also more efficient than naïve PCA: doesn't require computing the D × D covariance matrix of the data or performing an expensive eigendecomposition. The model can also be learned very efficiently using “online EM”, and it is possible to give it a fully Bayesian treatment (which has many other benefits, such as inferring K using nonparametric Bayesian modeling)
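As referenced above, the key computation for missing data is the Gaussian conditional of x_n^miss given x_n^obs under the current parameters. A minimal sketch for the zero-mean PPCA marginal x_n ∼ N(0, C) with C = W W^⊤ + σ² I_D (the function and variable names are illustrative, not from the lecture):

import numpy as np

def impute_missing(x, obs_mask, W, sigma2):
    # Posterior mean of the missing entries of one data point x (length D),
    # under x ~ N(0, C) with C = W W^T + sigma2 * I_D.
    # obs_mask is a boolean array marking the observed dimensions.
    D = x.shape[0]
    C = W @ W.T + sigma2 * np.eye(D)
    o, m = obs_mask, ~obs_mask
    # Gaussian conditioning (zero mean assumed): E[x_m | x_o] = C_mo C_oo^{-1} x_o
    C_oo = C[np.ix_(o, o)]
    C_mo = C[np.ix_(m, o)]
    x_filled = x.copy()
    x_filled[m] = C_mo @ np.linalg.solve(C_oo, x[o])
    return x_filled

Within EM, these conditional expectations would be used in the E step alongside the expectations over z_n.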


Some Aspects about PPCA/FA

Provides a framework that can be extended to build more complex models:

Mixture of PPCA/FA models (joint clustering + dimensionality reduction, or nonlinear dimensionality reduction)

Deep models for feature learning and dimensionality reduction

Supervised extensions, e.g., by jointly modeling the labels y_n as conditioned on the latent factors, i.e., p(y_n = 1 | z_n, θ), using a logistic model with weights θ ∈ R^K (one possible form is written out below)
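For concreteness, one way such a supervised model could be written (this particular factorization is an illustrative assumption, not something specified on the slide) is

p(x_n, y_n, z_n \mid W, \sigma^2, \theta) \;=\; \mathcal{N}(z_n \mid 0, I_K)\, \mathcal{N}(x_n \mid W z_n, \sigma^2 I_D)\, \mathrm{Bernoulli}\!\big(y_n \mid \mathrm{sigm}(\theta^\top z_n)\big)

where sigm(·) denotes the logistic sigmoid. EM would then require expectations of z_n under a posterior that conditions on both x_n and y_n, which generally needs an approximation because the logistic term is not conjugate to the Gaussian prior.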


Some Applications of PPCA

Learning the noise variance allows “image denoising”

Ability to fill in missing data allows “image inpainting” (left: image with 80% missing data, middle: reconstructed, right: original); a minimal reconstruction sketch is given below
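A minimal sketch of the reconstruction step underlying both applications (illustrative code; it assumes images are flattened into rows of a centered N × D matrix X and that W and σ² have already been learned):

import numpy as np

def ppca_reconstruct(X, W, sigma2):
    # Denoise by mapping each x_n to its posterior-mean code and back:
    # E[z_n | x_n] = (W^T W + sigma2 I_K)^{-1} W^T x_n,  x_hat_n = W E[z_n | x_n]
    K = W.shape[1]
    M = W.T @ W + sigma2 * np.eye(K)
    Ez = np.linalg.solve(M, W.T @ X.T).T   # N x K posterior means
    return Ez @ W.T                        # N x D reconstructions

For inpainting, the missing pixels of each image would first be handled as in the earlier missing-data sketch (or marginalized in the E step) before this reconstruction is applied.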

Machine Learning (CS771A) Generative Models for Dimensionality Reduction: Probabilistic PCA and Factor Analysis 13 Let’s first look at the E step

> 2 −1 > > −1 > E[zn] = (W W + σ IK ) W x n =( W W) W x n

> > (no need to compute E[z nz n ] since it will simply be equal to E[z n]E[z n] ) Let’s now look at the M step " N #" N #−1 X > X > > > −1 Wnew = x nE[zn] E[zn]E[zn] = X Ω(Ω Ω) n=1 n=1 where Ω = E[Z] is an N × K matrix with row n equal to E[z n] Note that M step is equivalent to finding W that minimizes the recon. error 2 2 Wnew = arg min ||X − E[Z]W|| = arg min ||X − ΩW|| W W

Thus EM can also be used to efficiently solve the standard non-probabilistic PCA without doing eigendecomposition

Using EM for (efficiently) solving standard PCA

Let’s see what happens if the noise variance σ2 goes to 0

Machine Learning (CS771A) Generative Models for Dimensionality Reduction: Probabilistic PCA and Factor Analysis 14 > > (no need to compute E[z nz n ] since it will simply be equal to E[z n]E[z n] ) Let’s now look at the M step " N #" N #−1 X > X > > > −1 Wnew = x nE[zn] E[zn]E[zn] = X Ω(Ω Ω) n=1 n=1 where Ω = E[Z] is an N × K matrix with row n equal to E[z n] Note that M step is equivalent to finding W that minimizes the recon. error 2 2 Wnew = arg min ||X − E[Z]W|| = arg min ||X − ΩW|| W W

Thus EM can also be used to efficiently solve the standard non-probabilistic PCA without doing eigendecomposition

Using EM for (efficiently) solving standard PCA

Let’s see what happens if the noise variance σ2 goes to 0 Let’s first look at the E step

> 2 −1 > > −1 > E[zn] = (W W + σ IK ) W x n =( W W) W x n

Machine Learning (CS771A) Generative Models for Dimensionality Reduction: Probabilistic PCA and Factor Analysis 14 Let’s now look at the M step " N #" N #−1 X > X > > > −1 Wnew = x nE[zn] E[zn]E[zn] = X Ω(Ω Ω) n=1 n=1 where Ω = E[Z] is an N × K matrix with row n equal to E[z n] Note that M step is equivalent to finding W that minimizes the recon. error 2 2 Wnew = arg min ||X − E[Z]W|| = arg min ||X − ΩW|| W W

Thus EM can also be used to efficiently solve the standard non-probabilistic PCA without doing eigendecomposition

Using EM for (efficiently) solving standard PCA

Let’s see what happens if the noise variance σ2 goes to 0 Let’s first look at the E step

> 2 −1 > > −1 > E[zn] = (W W + σ IK ) W x n =( W W) W x n

> > (no need to compute E[z nz n ] since it will simply be equal to E[z n]E[z n] )

Machine Learning (CS771A) Generative Models for Dimensionality Reduction: Probabilistic PCA and Factor Analysis 14 = X>Ω(Ω>Ω)−1

where Ω = E[Z] is an N × K matrix with row n equal to E[z n] Note that M step is equivalent to finding W that minimizes the recon. error 2 2 Wnew = arg min ||X − E[Z]W|| = arg min ||X − ΩW|| W W

Thus EM can also be used to efficiently solve the standard non-probabilistic PCA without doing eigendecomposition

Using EM for (efficiently) solving standard PCA

Let’s see what happens if the noise variance σ2 goes to 0 Let’s first look at the E step

> 2 −1 > > −1 > E[zn] = (W W + σ IK ) W x n =( W W) W x n

> > (no need to compute E[z nz n ] since it will simply be equal to E[z n]E[z n] ) Let’s now look at the M step " N #" N #−1 X > X > Wnew = x nE[zn] E[zn]E[zn] n=1 n=1

Machine Learning (CS771A) Generative Models for Dimensionality Reduction: Probabilistic PCA and Factor Analysis 14 where Ω = E[Z] is an N × K matrix with row n equal to E[z n] Note that M step is equivalent to finding W that minimizes the recon. error 2 2 Wnew = arg min ||X − E[Z]W|| = arg min ||X − ΩW|| W W

Thus EM can also be used to efficiently solve the standard non-probabilistic PCA without doing eigendecomposition

Using EM for (efficiently) solving standard PCA

Let’s see what happens if the noise variance σ2 goes to 0 Let’s first look at the E step

> 2 −1 > > −1 > E[zn] = (W W + σ IK ) W x n =( W W) W x n

> > (no need to compute E[z nz n ] since it will simply be equal to E[z n]E[z n] ) Let’s now look at the M step " N #" N #−1 X > X > > > −1 Wnew = x nE[zn] E[zn]E[zn] = X Ω(Ω Ω) n=1 n=1

Machine Learning (CS771A) Generative Models for Dimensionality Reduction: Probabilistic PCA and Factor Analysis 14 Note that M step is equivalent to finding W that minimizes the recon. error 2 2 Wnew = arg min ||X − E[Z]W|| = arg min ||X − ΩW|| W W

Thus EM can also be used to efficiently solve the standard non-probabilistic PCA without doing eigendecomposition

Using EM for (efficiently) solving standard PCA

Let’s see what happens if the noise variance σ2 goes to 0 Let’s first look at the E step

> 2 −1 > > −1 > E[zn] = (W W + σ IK ) W x n =( W W) W x n

> > (no need to compute E[z nz n ] since it will simply be equal to E[z n]E[z n] ) Let’s now look at the M step " N #" N #−1 X > X > > > −1 Wnew = x nE[zn] E[zn]E[zn] = X Ω(Ω Ω) n=1 n=1 where Ω = E[Z] is an N × K matrix with row n equal to E[z n]

Machine Learning (CS771A) Generative Models for Dimensionality Reduction: Probabilistic PCA and Factor Analysis 14 Thus EM can also be used to efficiently solve the standard non-probabilistic PCA without doing eigendecomposition

Using EM for (efficiently) solving standard PCA

Let’s see what happens if the noise variance σ2 goes to 0 Let’s first look at the E step

> 2 −1 > > −1 > E[zn] = (W W + σ IK ) W x n =( W W) W x n

> > (no need to compute E[z nz n ] since it will simply be equal to E[z n]E[z n] ) Let’s now look at the M step " N #" N #−1 X > X > > > −1 Wnew = x nE[zn] E[zn]E[zn] = X Ω(Ω Ω) n=1 n=1 where Ω = E[Z] is an N × K matrix with row n equal to E[z n] Note that M step is equivalent to finding W that minimizes the recon. error 2 2 Wnew = arg min ||X − E[Z]W|| = arg min ||X − ΩW|| W W

Using EM for (efficiently) solving standard PCA

Let’s see what happens if the noise variance σ² goes to 0. Let’s first look at the E step:

E[z_n] = (W^⊤ W + σ² I_K)^{-1} W^⊤ x_n = (W^⊤ W)^{-1} W^⊤ x_n    (in the limit σ² → 0)

(no need to compute E[z_n z_n^⊤], since it will simply be equal to E[z_n] E[z_n]^⊤)

Let’s now look at the M step:

W_new = [ Σ_{n=1}^N x_n E[z_n]^⊤ ] [ Σ_{n=1}^N E[z_n] E[z_n]^⊤ ]^{-1} = X^⊤ Ω (Ω^⊤ Ω)^{-1}

where Ω = E[Z] is an N × K matrix whose n-th row is E[z_n]^⊤. Note that the M step is equivalent to finding the W that minimizes the reconstruction error

W_new = arg min_W ||X − E[Z] W^⊤||² = arg min_W ||X − Ω W^⊤||²

Thus EM can also be used to efficiently solve standard (non-probabilistic) PCA without doing an eigendecomposition; a minimal sketch of this procedure is given below
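A minimal NumPy sketch of this zero-noise EM for PCA (illustrative only: the initialization, iteration count, and function name are arbitrary choices, the data are assumed centered, and the returned W spans the principal subspace without being forced to be orthonormal):

import numpy as np

def em_pca(X, K, n_iters=50, seed=0):
    # EM for standard PCA (the sigma^2 -> 0 limit). X is N x D, assumed centered.
    D = X.shape[1]
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((D, K))
    for _ in range(n_iters):
        # E step: row n of Omega is E[z_n]^T = x_n^T W (W^T W)^{-1}
        Omega = X @ W @ np.linalg.inv(W.T @ W)
        # M step: W_new = X^T Omega (Omega^T Omega)^{-1}
        W = X.T @ Omega @ np.linalg.inv(Omega.T @ Omega)
    return W, Omega

Each iteration costs roughly O(NDK) plus small K × K inversions, instead of forming the D × D covariance matrix and eigendecomposing it, which is where the efficiency argument above comes from.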


Identifiability

Note that p(x_n) = N(0, W W^⊤ + σ² I_D). If we replace W by W̃ = WR for some orthogonal rotation matrix R, then

p(x_n) = N(0, W̃ W̃^⊤ + σ² I_D) = N(0, W R R^⊤ W^⊤ + σ² I_D) = N(0, W W^⊤ + σ² I_D)

Thus PPCA doesn't give a unique solution: for every W, there is another W̃ = WR that gives the same marginal distribution, so the PPCA model is not uniquely identifiable

Usually this is not a problem, unless we want to interpret W very strictly. To ensure identifiability, we can impose more structure on W, e.g., constrain it to be a lower-triangular or sparse matrix (a small numerical check of the rotation invariance is included below)
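A tiny numerical check of this rotational non-identifiability (purely illustrative; the dimensions and random seed are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
D, K, sigma2 = 5, 2, 0.1
W = rng.standard_normal((D, K))
# Q is a random orthogonal K x K matrix (the QR factor); it plays the role of R on the slide
Q, _ = np.linalg.qr(rng.standard_normal((K, K)))
W_tilde = W @ Q
C = W @ W.T + sigma2 * np.eye(D)
C_tilde = W_tilde @ W_tilde.T + sigma2 * np.eye(D)
print(np.allclose(C, C_tilde))   # True: both loadings give the same marginal covariance of x_n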

Machine Learning (CS771A) Generative Models for Dimensionality Reduction: Probabilistic PCA and Factor Analysis 15 Looked at a way (EM) to perform parameter estimation in such models EM is a general framework for parameter estimation in latent variable models

Looked at two types of unsupervised learning problems Mixture models: Clustering Latent factor models: Dimensionality reduction Both these models can also be used for estimating the prob. density p(x)

More sophisticated models are usually built on these basic principles E.g., Hidden Markov Models and Kalman Filters can be seen as generalization of mixture models and Gaussian latent factor models, respectively, for sequential data (z n correspond to the “state” of x n) We will look at these and other related models (e.g., LSTM) when talking about learning from seqential data

Some Concluding Thoughts

Discussed the basic idea of generative models for doing unsupervised learning

Machine Learning (CS771A) Generative Models for Dimensionality Reduction: Probabilistic PCA and Factor Analysis 16 Looked at two types of unsupervised learning problems Mixture models: Clustering Latent factor models: Dimensionality reduction Both these models can also be used for estimating the prob. density p(x)

More sophisticated models are usually built on these basic principles E.g., Hidden Markov Models and Kalman Filters can be seen as generalization of mixture models and Gaussian latent factor models, respectively, for sequential data (z n correspond to the “state” of x n) We will look at these and other related models (e.g., LSTM) when talking about learning from seqential data

Some Concluding Thoughts

Discussed the basic idea of generative models for doing unsupervised learning Looked at a way (EM) to perform parameter estimation in such models EM is a general framework for parameter estimation in latent variable models

Machine Learning (CS771A) Generative Models for Dimensionality Reduction: Probabilistic PCA and Factor Analysis 16 Mixture models: Clustering Latent factor models: Dimensionality reduction Both these models can also be used for estimating the prob. density p(x)

More sophisticated models are usually built on these basic principles E.g., Hidden Markov Models and Kalman Filters can be seen as generalization of mixture models and Gaussian latent factor models, respectively, for sequential data (z n correspond to the “state” of x n) We will look at these and other related models (e.g., LSTM) when talking about learning from seqential data

Some Concluding Thoughts

Discussed the basic idea of generative models for doing unsupervised learning Looked at a way (EM) to perform parameter estimation in such models EM is a general framework for parameter estimation in latent variable models

Looked at two types of unsupervised learning problems

Machine Learning (CS771A) Generative Models for Dimensionality Reduction: Probabilistic PCA and Factor Analysis 16 Latent factor models: Dimensionality reduction Both these models can also be used for estimating the prob. density p(x)

More sophisticated models are usually built on these basic principles E.g., Hidden Markov Models and Kalman Filters can be seen as generalization of mixture models and Gaussian latent factor models, respectively, for sequential data (z n correspond to the “state” of x n) We will look at these and other related models (e.g., LSTM) when talking about learning from seqential data

Some Concluding Thoughts

Discussed the basic idea of generative models for doing unsupervised learning Looked at a way (EM) to perform parameter estimation in such models EM is a general framework for parameter estimation in latent variable models

Looked at two types of unsupervised learning problems Mixture models: Clustering

Machine Learning (CS771A) Generative Models for Dimensionality Reduction: Probabilistic PCA and Factor Analysis 16 Both these models can also be used for estimating the prob. density p(x)

More sophisticated models are usually built on these basic principles E.g., Hidden Markov Models and Kalman Filters can be seen as generalization of mixture models and Gaussian latent factor models, respectively, for sequential data (z n correspond to the “state” of x n) We will look at these and other related models (e.g., LSTM) when talking about learning from seqential data

Some Concluding Thoughts

Discussed the basic idea of generative models for doing unsupervised learning Looked at a way (EM) to perform parameter estimation in such models EM is a general framework for parameter estimation in latent variable models

Looked at two types of unsupervised learning problems Mixture models: Clustering Latent factor models: Dimensionality reduction

Machine Learning (CS771A) Generative Models for Dimensionality Reduction: Probabilistic PCA and Factor Analysis 16 More sophisticated models are usually built on these basic principles E.g., Hidden Markov Models and Kalman Filters can be seen as generalization of mixture models and Gaussian latent factor models, respectively, for sequential data (z n correspond to the “state” of x n) We will look at these and other related models (e.g., LSTM) when talking about learning from seqential data

Some Concluding Thoughts

Discussed the basic idea of generative models for doing unsupervised learning Looked at a way (EM) to perform parameter estimation in such models EM is a general framework for parameter estimation in latent variable models

Looked at two types of unsupervised learning problems Mixture models: Clustering Latent factor models: Dimensionality reduction Both these models can also be used for estimating the prob. density p(x)

Machine Learning (CS771A) Generative Models for Dimensionality Reduction: Probabilistic PCA and Factor Analysis 16 E.g., Hidden Markov Models and Kalman Filters can be seen as generalization of mixture models and Gaussian latent factor models, respectively, for sequential data (z n correspond to the “state” of x n) We will look at these and other related models (e.g., LSTM) when talking about learning from seqential data

Some Concluding Thoughts

Discussed the basic idea of generative models for doing unsupervised learning Looked at a way (EM) to perform parameter estimation in such models EM is a general framework for parameter estimation in latent variable models

Looked at two types of unsupervised learning problems Mixture models: Clustering Latent factor models: Dimensionality reduction Both these models can also be used for estimating the prob. density p(x)

More sophisticated models are usually built on these basic principles

Machine Learning (CS771A) Generative Models for Dimensionality Reduction: Probabilistic PCA and Factor Analysis 16 We will look at these and other related models (e.g., LSTM) when talking about learning from seqential data

Some Concluding Thoughts

Discussed the basic idea of generative models for doing unsupervised learning Looked at a way (EM) to perform parameter estimation in such models EM is a general framework for parameter estimation in latent variable models

Looked at two types of unsupervised learning problems Mixture models: Clustering Latent factor models: Dimensionality reduction Both these models can also be used for estimating the prob. density p(x)

More sophisticated models are usually built on these basic principles E.g., Hidden Markov Models and Kalman Filters can be seen as generalization of mixture models and Gaussian latent factor models, respectively, for sequential data (z n correspond to the “state” of x n)

Some Concluding Thoughts

Discussed the basic idea of generative models for doing unsupervised learning. Looked at a way (EM) to perform parameter estimation in such models; EM is a general framework for parameter estimation in latent variable models

Looked at two types of unsupervised learning problems:

Mixture models: Clustering
Latent factor models: Dimensionality reduction

Both of these models can also be used for estimating the probability density p(x)

More sophisticated models are usually built on these basic principles. E.g., Hidden Markov Models and Kalman Filters can be seen as generalizations of mixture models and Gaussian latent factor models, respectively, for sequential data (z_n corresponds to the “state” of x_n). We will look at these and other related models (e.g., LSTMs) when talking about learning from sequential data
