
Mixtures of Bernoulli Distributions


Sargur Srihari

Mixtures of Bernoulli Distributions

• GMMs are defined over continuous variables
• Now consider mixtures of discrete binary variables: Bernoulli distributions (BMMs)
• Sets the foundation for HMMs over discrete variables
• We begin by defining:
  1. Bernoulli
  2. Multivariate Bernoulli
  3. Mixture of Bernoullis
  4. Mixture of multivariate Bernoullis

Bernoulli Distribution

1. A coin has a Bernoulli distribution

   p(x | µ) = µ^x (1 − µ)^{1−x},   where x ∈ {0, 1}

2. Each pixel of a binary image has a Bernoulli distribution.

   – All D pixels together define a multivariate Bernoulli distribution
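As a quick illustration (not part of the original slides), a minimal Python sketch of this pmf; the function name bernoulli_pmf and the value µ = 0.7 are just for the example:

    def bernoulli_pmf(x, mu):
        """p(x | mu) = mu^x * (1 - mu)^(1 - x), for x in {0, 1}."""
        return (mu ** x) * ((1 - mu) ** (1 - x))

    # A biased coin with illustrative parameter mu = 0.7
    print(bernoulli_pmf(1, 0.7))   # 0.7
    print(bernoulli_pmf(0, 0.7))   # 1 - 0.7 = 0.3 (up to floating point)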

Mixture of Two Bernoullis

• x is a single value, the result of tossing one of K = 2 coins with parameters µ = [µ_1, µ_2], where coin k is chosen with probability π = [π_1, π_2]

   p(x | µ, π) = π_1 µ_1^x (1 − µ_1)^{1−x} + π_2 µ_2^x (1 − µ_2)^{1−x}

   – Two coins

– Two pixels from a binary image
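A minimal sketch of this two-coin mixture in Python (NumPy assumed; the parameter values below are illustrative, not taken from the slides):

    import numpy as np

    def two_coin_mixture_pmf(x, mu, pi):
        """p(x | mu, pi) = sum_k pi_k * mu_k^x * (1 - mu_k)^(1 - x), for x in {0, 1}."""
        mu, pi = np.asarray(mu), np.asarray(pi)
        return np.sum(pi * (mu ** x) * ((1 - mu) ** (1 - x)))

    # Two coins with different biases, chosen with equal probability
    mu, pi = [0.9, 0.2], [0.5, 0.5]
    print(two_coin_mixture_pmf(1, mu, pi))   # 0.5*0.9 + 0.5*0.2 = 0.55
    print(two_coin_mixture_pmf(0, mu, pi))   # 0.5*0.1 + 0.5*0.8 = 0.45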

Multivariate Bernoulli

• A set of D independent binary variables x_i, i = 1,..,D
   – E.g., a set of D coins, or a set of D pixels

• Each variable is governed by its own parameter µ_i

• Multivariate distribution:

   p(x | µ) = Π_{i=1}^{D} µ_i^{x_i} (1 − µ_i)^{1−x_i}

   where x = (x_1,..,x_D)^T and µ = (µ_1,..,µ_D)^T

• Mean and covariance are

   E[x] = µ,   cov[x] = diag{µ_i(1 − µ_i)}

• Bivariate example (D = 2) with µ_1 = 0.8, µ_2 = 0.7:

   p(0,0) = 0.06,  p(1,0) = 0.24,  p(1,1) = 0.56,  p(0,1) = 0.14

[Figure: bivariate Bernoulli distribution, showing the four outcomes (0,0), (0,1), (1,0), (1,1) in the (x_1, x_2) plane]
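The bivariate example can be checked numerically; a small sketch (NumPy assumed, parameter values taken from the slide above):

    import numpy as np
    from itertools import product

    def multivariate_bernoulli_pmf(x, mu):
        """p(x | mu) = prod_i mu_i^{x_i} * (1 - mu_i)^{1 - x_i}."""
        x, mu = np.asarray(x), np.asarray(mu)
        return np.prod(mu ** x * (1 - mu) ** (1 - x))

    mu = np.array([0.8, 0.7])
    for x in product([0, 1], repeat=2):
        print(x, multivariate_bernoulli_pmf(x, mu))
    # (0,0) 0.06, (0,1) 0.14, (1,0) 0.24, (1,1) 0.56 -- matching the slide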

Mixture of Multivariate Bernoullis

• Finite mixture of K multivariate Bernoulli distributions
   – E.g., K bags of D coins each, where bag k is chosen with probability π_k

   p(x | µ, π) = Σ_{k=1}^{K} π_k p(x | µ_k)

   where µ = {µ_1,..,µ_K}, π = {π_1,..,π_K}, Σ_k π_k = 1, and

   p(x | µ_k) = Π_{i=1}^{D} µ_{ki}^{x_i} (1 − µ_{ki})^{1−x_i}

[Figure: two bivariate Bernoulli components (D = 2, K = 2), shown in the (x_1, x_2) plane]
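A sketch of this mixture density in Python (NumPy assumed; mu is a K × D matrix and the parameter values are hypothetical):

    import numpy as np

    def bernoulli_mixture_pmf(x, mu, pi):
        """p(x | mu, pi) = sum_k pi_k * prod_i mu_ki^{x_i} * (1 - mu_ki)^{1 - x_i}."""
        x, mu, pi = np.asarray(x), np.asarray(mu), np.asarray(pi)
        per_component = np.prod(mu ** x * (1 - mu) ** (1 - x), axis=1)   # shape (K,)
        return np.dot(pi, per_component)

    # K = 2 bags of D = 2 coins, illustrative parameters
    mu = np.array([[0.9, 0.8],
                   [0.2, 0.3]])
    pi = np.array([0.6, 0.4])
    print(bernoulli_mixture_pmf([1, 1], mu, pi))   # 0.6*0.72 + 0.4*0.06 = 0.456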

Mean and Covariance of BMM

   E[x] = Σ_{k=1}^{K} π_k µ_k

   cov[x] = Σ_{k=1}^{K} π_k { Σ_k + µ_k µ_k^T } − E[x] E[x]^T

   where Σ_k = diag{µ_ki(1 − µ_ki)}

• Because the covariance matrix cov[x] is no longer diagonal, the mixture distribution can capture correlations between the variables, unlike a single multivariate Bernoulli distribution
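A sketch computing these two moments for a hypothetical two-component model (NumPy assumed); note that the off-diagonal entries of cov[x] come out nonzero:

    import numpy as np

    def bmm_mean_cov(mu, pi):
        """E[x] = sum_k pi_k mu_k;
        cov[x] = sum_k pi_k (Sigma_k + mu_k mu_k^T) - E[x] E[x]^T, Sigma_k = diag{mu_ki(1 - mu_ki)}."""
        mu, pi = np.asarray(mu), np.asarray(pi)
        mean = pi @ mu                                                   # shape (D,)
        second = sum(p * (np.diag(m * (1 - m)) + np.outer(m, m)) for p, m in zip(pi, mu))
        return mean, second - np.outer(mean, mean)

    mu = np.array([[0.9, 0.8],
                   [0.2, 0.3]])
    pi = np.array([0.6, 0.4])
    mean, cov = bmm_mean_cov(mu, pi)
    print(mean)   # [0.62 0.6 ]
    print(cov)    # off-diagonal entries are nonzero: the mixture captures correlation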

Log-likelihood of Bernoulli Mixture

• Given a data set X = {x_1,..,x_N}, the log-likelihood of the model is

   ln p(X | µ, π) = Σ_{n=1}^{N} ln { Σ_{k=1}^{K} π_k p(x_n | µ_k) }

   – The outer summation over n arises because the logarithm turns the product over data points into a sum; the summation over k, inside the logarithm, is due to the mixture

• Because of the summation inside the logarithm, there is no closed-form maximum likelihood solution
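A sketch of this log-likelihood in Python (NumPy assumed; the log-sum-exp over components is a numerical-stability choice, not something stated on the slides):

    import numpy as np

    def bmm_log_likelihood(X, mu, pi, eps=1e-12):
        """ln p(X | mu, pi) = sum_n ln sum_k pi_k prod_i mu_ki^{x_ni} (1 - mu_ki)^{1 - x_ni}."""
        X, mu, pi = np.asarray(X, float), np.asarray(mu), np.asarray(pi)
        # log p(x_n | mu_k) for every n, k -> shape (N, K)
        log_px = X @ np.log(mu + eps).T + (1 - X) @ np.log(1 - mu + eps).T
        log_weighted = np.log(pi + eps) + log_px
        m = log_weighted.max(axis=1, keepdims=True)                      # stabilize
        return np.sum(m.squeeze(1) + np.log(np.exp(log_weighted - m).sum(axis=1)))

    # Tiny illustrative check on two binary vectors
    X = np.array([[1, 0], [1, 1]])
    mu = np.array([[0.9, 0.8], [0.2, 0.3]])
    pi = np.array([0.6, 0.4])
    print(bmm_log_likelihood(X, mu, pi))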

To define EM, introduce latent variables

• One-of-K representation z = (z_1,..,z_K)^T

• Conditional distribution of x given the latent variable is

   p(x | z, µ) = Π_{k=1}^{K} p(x | µ_k)^{z_k}

• Prior distribution for the latent variables is the same as for the GMM:

   p(z | π) = Π_{k=1}^{K} π_k^{z_k}

• If we form the product of p(x | z, µ) and p(z | π) and marginalize over z, we recover

   p(x | µ, π) = Σ_{k=1}^{K} π_k p(x | µ_k)
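The marginalization can be spelled out in one line; a short derivation in LaTeX (standard, following the one-of-K coding above):

    p(x \mid \mu, \pi)
      = \sum_{z} p(x \mid z, \mu)\, p(z \mid \pi)
      = \sum_{z} \prod_{k=1}^{K} \bigl[\pi_k\, p(x \mid \mu_k)\bigr]^{z_k}
      = \sum_{k=1}^{K} \pi_k\, p(x \mid \mu_k)

since the one-of-K vector z has exactly one nonzero entry, so the sum over z ranges over the K vectors with z_k = 1.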

Complete-Data Log-likelihood

• To derive the EM algorithm we first write down the complete-data log-likelihood function

   ln p(X, Z | µ, π) = Σ_{n=1}^{N} Σ_{k=1}^{K} z_{nk} { ln π_k + Σ_{i=1}^{D} [ x_{ni} ln µ_{ki} + (1 − x_{ni}) ln(1 − µ_{ki}) ] }

   – where X = {x_n} and Z = {z_n}

• Taking its expectation w.r.t. the posterior distribution of the latent variables gives

   E_Z[ ln p(X, Z | µ, π) ] = Σ_{n=1}^{N} Σ_{k=1}^{K} γ(z_{nk}) { ln π_k + Σ_{i=1}^{D} [ x_{ni} ln µ_{ki} + (1 − x_{ni}) ln(1 − µ_{ki}) ] }

   – where γ(z_{nk}) = E[z_{nk}] is the posterior probability, or responsibility, of component k given data point x_n

• In the E-step these responsibilities are evaluated
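For reference, a sketch of this expected complete-data log-likelihood as a Python function of the responsibilities gamma (NumPy assumed; the slides show no code, and this quantity is mainly useful for monitoring EM):

    import numpy as np

    def expected_complete_ll(X, gamma, mu, pi, eps=1e-12):
        """E_Z[ln p(X, Z | mu, pi)] = sum_n sum_k gamma_nk * ( ln pi_k
           + sum_i [ x_ni ln mu_ki + (1 - x_ni) ln(1 - mu_ki) ] )."""
        X, gamma, mu, pi = map(np.asarray, (X, gamma, mu, pi))
        log_px = X @ np.log(mu + eps).T + (1 - X) @ np.log(1 - mu + eps).T   # (N, K)
        return np.sum(gamma * (np.log(pi + eps) + log_px))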

E-step: Evaluating Responsibilities

• In the E step the responsibilities γ(z_{nk}) = E[z_{nk}] are evaluated using Bayes' theorem:

   γ(z_{nk}) = E[z_{nk}] = Σ_{z_{nk}} z_{nk} [ π_k p(x_n | µ_k) ]^{z_{nk}} / Σ_{z_{nj}} [ π_j p(x_n | µ_j) ]^{z_{nj}}

               = π_k p(x_n | µ_k) / Σ_{j=1}^{K} π_j p(x_n | µ_j)

• If we consider the sum over n in

   E_Z[ ln p(X, Z | µ, π) ] = Σ_{n=1}^{N} Σ_{k=1}^{K} γ(z_{nk}) { ln π_k + Σ_{i=1}^{D} [ x_{ni} ln µ_{ki} + (1 − x_{ni}) ln(1 − µ_{ki}) ] }

   the responsibilities enter only through two terms:

   N_k = Σ_{n=1}^{N} γ(z_{nk})                  (effective number of data points associated with component k)

   x̄_k = (1/N_k) Σ_{n=1}^{N} γ(z_{nk}) x_n
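A sketch of this E step in Python (NumPy assumed; computing the responsibilities in log space before normalizing is an implementation choice for numerical stability):

    import numpy as np

    def e_step(X, mu, pi, eps=1e-12):
        """Responsibilities gamma (N x K): gamma_nk = pi_k p(x_n | mu_k) / sum_j pi_j p(x_n | mu_j)."""
        X, mu, pi = np.asarray(X, float), np.asarray(mu), np.asarray(pi)
        log_px = X @ np.log(mu + eps).T + (1 - X) @ np.log(1 - mu + eps).T   # (N, K)
        log_w = np.log(pi + eps) + log_px
        log_w -= log_w.max(axis=1, keepdims=True)        # stabilize before exponentiating
        gamma = np.exp(log_w)
        gamma /= gamma.sum(axis=1, keepdims=True)
        return gamma

    # N_k and the weighted means then follow directly:
    #   Nk = gamma.sum(axis=0);  xbar = (gamma.T @ X) / Nk[:, None]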

M-step

• In the M step we maximize the expected complete-data log-likelihood with respect to the parameters µ_k and π

• If we set the derivative of

   E_Z[ ln p(X, Z | µ, π) ] = Σ_{n=1}^{N} Σ_{k=1}^{K} γ(z_{nk}) { ln π_k + Σ_{i=1}^{D} [ x_{ni} ln µ_{ki} + (1 − x_{ni}) ln(1 − µ_{ki}) ] }

   with respect to µ_k equal to zero and rearrange the terms, we obtain

   µ_k = x̄_k

   – Thus the mean of component k is equal to the weighted mean of the data, with weighting given by the responsibilities of component k for the data points

• For maximization with respect to π_k, we use a Lagrange multiplier to enforce Σ_k π_k = 1
   – Following steps similar to those for the GMM, we obtain the reasonable result

   π_k = N_k / N
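A sketch of the corresponding M step in Python (NumPy assumed; gamma is the responsibility matrix from the E step above, and the small floor on N_k is only a numerical safeguard):

    import numpy as np

    def m_step(X, gamma):
        """Update the parameters: mu_k = xbar_k and pi_k = N_k / N."""
        X, gamma = np.asarray(X, float), np.asarray(gamma)
        N = X.shape[0]
        Nk = gamma.sum(axis=0)                                     # effective counts, shape (K,)
        mu = (gamma.T @ X) / np.maximum(Nk, 1e-12)[:, None]        # weighted means xbar_k
        pi = Nk / N
        return mu, pi

    # One full EM iteration: gamma = e_step(X, mu, pi); mu, pi = m_step(X, gamma)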

No Singularities in BMMs

• In contrast to GMMs, there are no singularities with BMMs

• This can be seen as follows:
   – The likelihood function is bounded above because 0 ≤ p(x_n | µ_k) ≤ 1
   – There exist singularities at which the likelihood function goes to zero

• But these will not be found by EM, provided it is not initialized to a pathological starting point
   – because EM always increases the value of the likelihood function

Illustrate BMM with Handwritten Digits

• Given a set of unlabeled digits 2, 3 and 4

• Images turned to binary: values > 0.5 set to 1, the rest to 0
• Fit to a data set of N = 600 digits, with K = 3
• 10 iterations of EM

• Mixing coefficients initialized with π_k = 1/K

• Parameters µ_ki were set to random values chosen uniformly in the range (0.25, 0.75) and then normalized so that Σ_j µ_kj = 1
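A sketch of this setup in Python (NumPy assumed; the array images is a hypothetical N × D stand-in for the digit data, and the 28 × 28 pixel size is an assumption rather than something stated on the slides):

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical stand-in for the N = 600 digit images, each with D pixels in [0, 1]
    N, D, K = 600, 28 * 28, 3
    images = rng.random((N, D))

    # Binarize: values > 0.5 become 1, the rest 0
    X = (images > 0.5).astype(float)

    # Initialization described on the slide
    pi = np.full(K, 1.0 / K)                    # mixing coefficients pi_k = 1/K
    mu = rng.uniform(0.25, 0.75, size=(K, D))
    mu /= mu.sum(axis=1, keepdims=True)         # normalize so that sum_j mu_kj = 1

    # EM then alternates the E and M steps sketched earlier for 10 iterations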

Result of BMM

• EM finds the three clusters corresponding to the different digits

• Parameters µ_ki for each of the three components of the mixture, shown as gray-scale images (since the mean for each pixel varies)

• Using a single multivariate Bernoulli distribution fitted by maximum likelihood amounts to averaging the counts in each pixel

Bayesian EM for Discrete Case

• The conjugate prior of the parameters of the Bernoulli is given by the Beta distribution

• A Beta prior is equivalent to introducing additional effective observations of x

• We can similarly introduce priors into the Bernoulli mixture model

• Use EM to maximize the posterior probability of the parameters

• Can be extended to multinomial discrete variables
   – Introduce Dirichlet priors over the model parameters if desired
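One way the extra effective observations show up in code: a sketch of a MAP-style M step with a Beta(a, b) prior on each µ_ki (the values a = b = 2 are illustrative, not from the slides; with a = b = 1, a flat prior, this reduces to the maximum likelihood update):

    import numpy as np

    def m_step_map(X, gamma, a=2.0, b=2.0):
        """MAP update with a Beta(a, b) prior on each mu_ki:
           mu_ki = (sum_n gamma_nk x_ni + a - 1) / (N_k + a + b - 2)."""
        X, gamma = np.asarray(X, float), np.asarray(gamma)
        Nk = gamma.sum(axis=0)                                     # effective counts
        mu = (gamma.T @ X + (a - 1.0)) / (Nk + a + b - 2.0)[:, None]
        pi = Nk / X.shape[0]                                       # pi could similarly get a Dirichlet prior
        return mu, pi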
