
The Exponential Family

David M. Blei
Columbia University

November 3, 2015

Definition

• A probability density in the exponential family has this form:

p(x \mid \eta) = h(x) \exp\{ \eta^\top t(x) - a(\eta) \},   (1)

where

- $\eta$ is the natural parameter
- $t(x)$ are the sufficient statistics
- $h(x)$ is the "underlying measure", which ensures that $x$ is in the right space
- $a(\eta)$ is the log normalizer

• Examples of exponential family distributions include the Gaussian, gamma, Poisson, Bernoulli, and multinomial distributions, and Markov models.

Examples of distributions that are not in this family include the Student-t distribution, mixtures, and hidden Markov models. (We are considering these families as distributions of data; the latent variables are implicitly marginalized out.)

• The statistic $t(x)$ is called sufficient because the probability of $x$ under $\eta$ depends on $x$ only through $t(x)$.

• The exponential family has fundamental connections to the world of graphical models. For our purposes, we'll use exponential families as components in directed graphical models, e.g., in mixtures of Gaussians.

• The log normalizer ensures that the density integrates to 1,

a(\eta) = \log \int h(x) \exp\{ \eta^\top t(x) \} \, dx.   (2)

In other words, $a(\eta)$ is the logarithm of the normalizing constant; it enters the density with a negative sign.

• Example: Bernoulli. As an example, let's put the Bernoulli (in its usual form) into its exponential family form. The Bernoulli you are used to seeing is

p(x \mid \pi) = \pi^x (1 - \pi)^{1 - x}, \quad x \in \{0, 1\}.   (3)

In exponential family form,

p(x \mid \pi) = \exp\{ \log( \pi^x (1 - \pi)^{1 - x} ) \}   (4)
  = \exp\{ x \log \pi + (1 - x) \log(1 - \pi) \}   (5)
  = \exp\{ x \log \pi - x \log(1 - \pi) + \log(1 - \pi) \}   (6)
  = \exp\{ x \log( \pi / (1 - \pi) ) + \log(1 - \pi) \}.   (7)

This reveals the exponential family representation, where

\eta = \log( \pi / (1 - \pi) )   (8)
t(x) = x   (9)
a(\eta) = -\log(1 - \pi) = \log(1 + e^\eta)   (10)
h(x) = 1   (11)

Note that the relationship between $\eta$ and $\pi$ is invertible,

\pi = 1 / (1 + e^{-\eta}).   (12)

This is the logistic function.
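To make these mappings concrete, here is a minimal Python sketch (not part of the notes; the function names and values are illustrative only) that converts between the mean parameter $\pi$ and the natural parameter $\eta$, and checks that the exponential family form, with $t(x) = x$, $h(x) = 1$, and $a(\eta) = \log(1 + e^\eta)$, matches the usual Bernoulli density.

    # A minimal sketch: the Bernoulli in its mean and natural parameterizations.
    import math

    def logit(pi):
        """Natural parameter: eta = log(pi / (1 - pi)), equation (8)."""
        return math.log(pi / (1.0 - pi))

    def logistic(eta):
        """Mean parameter: pi = 1 / (1 + exp(-eta)), equation (12)."""
        return 1.0 / (1.0 + math.exp(-eta))

    def bernoulli_expfam(x, eta):
        """p(x | eta) = exp{eta * x - a(eta)} with a(eta) = log(1 + e^eta)."""
        return math.exp(eta * x - math.log1p(math.exp(eta)))

    pi = 0.3
    eta = logit(pi)
    for x in (0, 1):
        usual = pi ** x * (1 - pi) ** (1 - x)                # equation (3)
        assert abs(usual - bernoulli_expfam(x, eta)) < 1e-12
    print(eta, logistic(eta))                                # the logistic inverts the logit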

• Example: Gaussian. The familiar form of the univariate Gaussian is

p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\}.   (13)

We put it in exponential family form by expanding the square

p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}} \exp\left\{ \frac{\mu}{\sigma^2} x - \frac{1}{2\sigma^2} x^2 - \frac{\mu^2}{2\sigma^2} - \log \sigma \right\}.   (14)

We see that

\eta = \left[ \mu / \sigma^2, \; -1/(2\sigma^2) \right]   (15)
t(x) = \left[ x, \; x^2 \right]   (16)
a(\eta) = \mu^2 / (2\sigma^2) + \log \sigma   (17)
        = -\eta_1^2 / (4\eta_2) - (1/2) \log(-2\eta_2)   (18)
h(x) = 1 / \sqrt{2\pi}   (19)
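As an illustration (not from the notes; function names and the toy values are my own), the following Python sketch maps the familiar parameters $(\mu, \sigma^2)$ to the natural parameters of equation (15) and back, and checks that expressions (17) and (18) for the log normalizer agree.

    # A minimal sketch of the Gaussian's natural parameterization.
    import math

    def natural_params(mu, sigma2):
        """eta = [mu / sigma^2, -1 / (2 sigma^2)], equation (15)."""
        return mu / sigma2, -1.0 / (2.0 * sigma2)

    def mean_params(eta1, eta2):
        """Invert the map back to (mu, sigma^2)."""
        sigma2 = -1.0 / (2.0 * eta2)
        return eta1 * sigma2, sigma2

    def log_normalizer(eta1, eta2):
        """a(eta) = -eta1^2 / (4 eta2) - (1/2) log(-2 eta2), equation (18)."""
        return -eta1 ** 2 / (4.0 * eta2) - 0.5 * math.log(-2.0 * eta2)

    mu, sigma2 = 1.5, 4.0
    eta1, eta2 = natural_params(mu, sigma2)
    print(mean_params(eta1, eta2))                            # recovers (1.5, 4.0)
    print(log_normalizer(eta1, eta2),
          mu ** 2 / (2 * sigma2) + 0.5 * math.log(sigma2))    # (18) agrees with (17)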

Moments of an exponential family

• Let's go back to the general family. We are going to take derivatives of the log normalizer. This gives us moments of the sufficient statistics,

\nabla_\eta a(\eta) = \nabla_\eta \log \int \exp\{ \eta^\top t(x) \} h(x) \, dx   (20)
  = \frac{ \nabla_\eta \int \exp\{ \eta^\top t(x) \} h(x) \, dx }{ \int \exp\{ \eta^\top t(x) \} h(x) \, dx }   (21)
  = \int t(x) \, \frac{ \exp\{ \eta^\top t(x) \} h(x) }{ \int \exp\{ \eta^\top t(x) \} h(x) \, dx } \, dx   (22)
  = E_\eta[ t(X) ].   (23)

• Check: Higher-order derivatives give higher-order moments. For example,

\frac{\partial^2 a(\eta)}{\partial \eta_i \, \partial \eta_j} = \mathrm{Cov}( t_i(X), t_j(X) ).   (24)

• There is a 1-1 relationship between the mean of the sufficient statistics $E_\eta[t(X)]$ and the natural parameter $\eta$. In other words, the mean is an alternative parameterization of the distribution. (Many of the forms of exponential families that you know are the mean parameterization.)

• Consider again the Bernoulli. We saw this with the logistic function; note that $\pi = E[X]$ (because $X$ is an indicator).

• Now consider the Gaussian.

The derivative with respect to $\eta_1$ is

\frac{d a(\eta)}{d \eta_1} = -\frac{\eta_1}{2\eta_2}   (25)
  = \mu   (26)
  = E[X].   (27)

The derivative with respect to $\eta_2$ is

\frac{d a(\eta)}{d \eta_2} = \frac{\eta_1^2}{4\eta_2^2} - \frac{1}{2\eta_2}   (28)
  = \mu^2 + \sigma^2   (29)
  = E[X^2].   (30)

This means that the variance is

\mathrm{Var}(X) = E[X^2] - E[X]^2   (31)
  = -\frac{1}{2\eta_2}   (32)
  = \sigma^2.   (33)
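As a numerical sanity check (an illustration, not part of the notes; the chosen $\mu$, $\sigma^2$, and step size are arbitrary), the sketch below differentiates the Gaussian log normalizer by finite differences and compares against $E[X]$, $E[X^2]$, and $\mathrm{Var}(X)$, as in equations (23), (24), and (31)-(33).

    # A minimal sketch: derivatives of a(eta) give moments of t(X) = [X, X^2].
    import math

    def a(eta1, eta2):
        return -eta1 ** 2 / (4.0 * eta2) - 0.5 * math.log(-2.0 * eta2)

    mu, sigma2 = 2.0, 3.0
    eta1, eta2 = mu / sigma2, -1.0 / (2.0 * sigma2)
    eps = 1e-5

    d1 = (a(eta1 + eps, eta2) - a(eta1 - eps, eta2)) / (2 * eps)      # ~ E[X]
    d2 = (a(eta1, eta2 + eps) - a(eta1, eta2 - eps)) / (2 * eps)      # ~ E[X^2]
    d11 = (a(eta1 + eps, eta2) - 2 * a(eta1, eta2)
           + a(eta1 - eps, eta2)) / eps ** 2                          # ~ Var(X)

    print(d1, mu)                    # E[X] = mu
    print(d2, mu ** 2 + sigma2)      # E[X^2] = mu^2 + sigma^2
    print(d11, sigma2)               # Var(X) = sigma^2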

• To see the 1-1 relationship between $E_\eta[t(X)]$ and $\eta$, note that

$\mathrm{Var}(t(X)) = \nabla^2 a(\eta)$ is positive $\rightarrow$ $a(\eta)$ is convex $\rightarrow$ 1-1 relationship between its argument and first derivative.

Here is some notation for later (when we discuss generalized linear models). Denote the mean parameter as $\mu = E_\eta[t(X)]$; denote the inverse map as $\eta(\mu)$, which gives the $\eta$ such that $E_\eta[t(X)] = \mu$.

Maximum likelihood estimation of an exponential family

• The data are $x_{1:N}$. We seek the value of $\eta$ that maximizes the likelihood.

• The log likelihood is

\mathcal{L} = \sum_{n=1}^N \log p(x_n \mid \eta)   (34)
  = \sum_{n=1}^N \left( \log h(x_n) + \eta^\top t(x_n) - a(\eta) \right)   (35)
  = \sum_{n=1}^N \log h(x_n) + \eta^\top \sum_{n=1}^N t(x_n) - N a(\eta).   (36)

As a function of $\eta$, the log likelihood only depends on $\sum_{n=1}^N t(x_n)$. It has fixed dimension; there is no need to store the data. (And note that it is sufficient for estimating $\eta$.)

• Take the gradient of the log likelihood and set it to zero,

\nabla_\eta \mathcal{L} = \sum_{n=1}^N t(x_n) - N \nabla_\eta a(\eta).   (37)

It's now easy to solve for the mean parameter:

\hat{\mu}_{ML} = \frac{ \sum_{n=1}^N t(x_n) }{ N }.   (38)

This is, as you might guess, the empirical mean of the sufficient statistics. The inverse map gives us the natural parameter,

\hat{\eta}_{ML} = \eta( \hat{\mu}_{ML} ).   (39)

• Consider the Bernoulli. The MLE of the mean parameter $\hat{\mu}_{ML}$ is the sample mean; the MLE of the natural parameter is the corresponding log odds.

• Consider the Gaussian. The MLEs of the mean parameters are the sample mean and the sample variance.
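For illustration (a sketch, not from the notes; the simulated data and variable names are assumed for the example), here is the Gaussian MLE computed through equation (38): average the sufficient statistics $t(x) = [x, x^2]$ over the data, then invert the mean parameterization to recover $\mu$, $\sigma^2$, and the natural parameters.

    # A minimal sketch of maximum likelihood via sufficient statistics.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=2.0, scale=1.5, size=10_000)     # toy data, assumed for illustration

    # Empirical mean of the sufficient statistics, equation (38).
    m1, m2 = x.mean(), (x ** 2).mean()                  # estimates of E[X] and E[X^2]

    # Invert the mean parameterization, then map to the natural parameters (15).
    mu_ml = m1
    sigma2_ml = m2 - m1 ** 2                            # the (biased) sample variance
    eta_ml = np.array([mu_ml / sigma2_ml, -1.0 / (2.0 * sigma2_ml)])

    print(mu_ml, sigma2_ml, eta_ml)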

Conjugacy

• Consider the following setup:

\eta \sim F(\cdot \mid \lambda)   (40)
x_i \sim G(\cdot \mid \eta) \quad \text{for } i \in \{1, \ldots, n\}.   (41)

This is a classical Bayesian data analysis setting. And this is used as a component in more complicated models, e.g., in hierarchical models.

• The posterior distribution of $\eta$ given the data $x_{1:n}$ is

p(\eta \mid x_{1:n}, \lambda) \propto F(\eta \mid \lambda) \prod_{i=1}^n G(x_i \mid \eta).   (42)

Suppose this distribution is in the same family as $F$, i.e., its parameters are in the space indexed by $\lambda$. Then $F$ and $G$ are a conjugate pair.

• For example,

- Gaussian likelihood with fixed variance; Gaussian prior on the mean
- Multinomial likelihood; Dirichlet prior on the probabilities
- Bernoulli likelihood; beta prior on the bias
- Poisson likelihood; gamma prior on the rate

In all these settings, the conditional distribution of the parameter given the data is in the same family as the prior.

• Suppose the data come from an exponential family. Every exponential family has a conjugate prior,

p(x_i \mid \eta) = h_\ell(x_i) \exp\{ \eta^\top t(x_i) - a_\ell(\eta) \}   (43)
p(\eta \mid \lambda) = h_c(\eta) \exp\{ \lambda_1^\top \eta + \lambda_2 ( -a_\ell(\eta) ) - a_c(\lambda) \}.   (44)

The natural parameter $\lambda = [\lambda_1, \lambda_2]$ has dimension $\dim(\eta) + 1$. The sufficient statistics are $[\eta, -a_\ell(\eta)]$.

The other terms $h_c(\cdot)$ and $a_c(\cdot)$ depend on the form of the exponential family. For example, when $\eta$ are multinomial parameters, the other terms help define a Dirichlet.

• Let's compute the posterior in the general case,

p(\eta \mid x_{1:n}, \lambda) \propto p(\eta \mid \lambda) \prod_{i=1}^n p(x_i \mid \eta)   (45)
  = h_c(\eta) \exp\{ \lambda_1^\top \eta + \lambda_2 ( -a_\ell(\eta) ) - a_c(\lambda) \} \prod_{i=1}^n h_\ell(x_i) \exp\{ \eta^\top t(x_i) - a_\ell(\eta) \}   (46)
  \propto h_c(\eta) \exp\{ ( \lambda_1 + \textstyle\sum_{i=1}^n t(x_i) )^\top \eta + ( \lambda_2 + n )( -a_\ell(\eta) ) \}.   (47)

This is the same exponential family as the prior, with parameters

\hat{\lambda}_1 = \lambda_1 + \sum_{i=1}^n t(x_i)   (48)
\hat{\lambda}_2 = \lambda_2 + n.   (49)
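In code, the update in equations (48)-(49) is just an accumulation of sufficient statistics and a count. Below is a generic sketch (not from the notes; the function name, interface, and toy data are illustrative only).

    # A minimal sketch of the general conjugate update, equations (48)-(49).
    import numpy as np

    def conjugate_update(lambda1, lambda2, t_of_x):
        """lambda1: prior natural parameter (same shape as t(x));
        lambda2: prior 'count';
        t_of_x: sufficient statistics, one row per data point."""
        t_of_x = np.atleast_2d(t_of_x)
        return lambda1 + t_of_x.sum(axis=0), lambda2 + t_of_x.shape[0]

    # Bernoulli example: t(x) = x, so lambda1 accumulates the number of ones.
    data = np.array([[1.0], [0.0], [1.0], [1.0], [0.0]])
    print(conjugate_update(np.array([1.0]), 2.0, data))   # (array([4.]), 7.0)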

Posterior predictive distribution

• Let's compute the posterior predictive distribution. This comes up frequently in applications of models (i.e., when doing prediction) and in inference (i.e., when doing collapsed Gibbs sampling).

• Consider a new data point $x_{\text{new}}$ (writing $t(x_{\text{new}}) = x_{\text{new}}$ for simplicity). Its posterior predictive distribution is

p(x_{\text{new}} \mid x_{1:n}) = \int p(x_{\text{new}} \mid \eta) \, p(\eta \mid x_{1:n}) \, d\eta   (50)
  \propto \int \exp\{ \eta^\top x_{\text{new}} - a_\ell(\eta) \} \exp\{ \hat{\lambda}_1^\top \eta + \hat{\lambda}_2 ( -a_\ell(\eta) ) - a_c(\hat{\lambda}) \} \, d\eta   (51)
  = \frac{ \int \exp\{ ( \hat{\lambda}_1 + x_{\text{new}} )^\top \eta + ( \hat{\lambda}_2 + 1 )( -a_\ell(\eta) ) \} \, d\eta }{ \exp\{ a_c( \hat{\lambda}_1, \hat{\lambda}_2 ) \} }   (52)
  = \exp\{ a_c( \hat{\lambda}_1 + x_{\text{new}}, \hat{\lambda}_2 + 1 ) - a_c( \hat{\lambda}_1, \hat{\lambda}_2 ) \}.   (53)

• In other words, the posterior predictive density is a ratio of normalizing constants. In the numerator is the posterior with the new data point added; in the denominator is the posterior without the new data point.
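For the beta-Bernoulli pair, this ratio of normalizing constants can be computed directly, since the beta's log normalizer is the log beta function. A small sketch (not from the notes; the posterior parameters below are hypothetical):

    # A minimal sketch of equation (53) for a Bernoulli likelihood with a beta posterior.
    from math import exp, lgamma

    def log_beta(alpha, beta):
        """Log normalizer of Beta(alpha, beta): log Gamma(a) + log Gamma(b) - log Gamma(a+b)."""
        return lgamma(alpha) + lgamma(beta) - lgamma(alpha + beta)

    def predictive(x_new, alpha_hat, beta_hat):
        """p(x_new | x_{1:n}) = exp{ a_c(posterior updated with x_new) - a_c(posterior) }."""
        return exp(log_beta(alpha_hat + x_new, beta_hat + (1 - x_new))
                   - log_beta(alpha_hat, beta_hat))

    alpha_hat, beta_hat = 4.0, 7.0                # hypothetical posterior parameters
    print(predictive(1, alpha_hat, beta_hat))     # equals alpha_hat / (alpha_hat + beta_hat)
    print(predictive(0, alpha_hat, beta_hat))     # equals beta_hat / (alpha_hat + beta_hat)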

• This is how we can derive collapsed Gibbs samplers.

Canonical prior

• There is a variation on the form of the simple conjugate prior that is useful for understanding its properties. We set $\lambda_1 = x_0 n_0$ and $\lambda_2 = n_0$. (And, for convenience, we drop the sufficient statistic so that $t(x) = x$.)

• In this form the conjugate prior is

p(\eta \mid x_0, n_0) \propto \exp\{ n_0 x_0^\top \eta - n_0 a(\eta) \}.   (54)

Here we can interpret $x_0$ as our prior idea of the expectation of $x$, and $n_0$ as the number of "prior data points."

Consider the predictive expectation of $X$, $E[ E[X \mid \eta] ]$. The first expectation is with respect to the prior; the second is with respect to the data distribution. As an exercise, show that

E[ E[X \mid \eta] ] = x_0.   (55)

• Assume data $x_{1:n}$. The mean is

\bar{x} = (1/n) \sum_{i=1}^n x_i.   (56)

Given the data and a prior, the posterior parameters are

\hat{x}_0 = \frac{ n_0 x_0 + n \bar{x} }{ n + n_0 }   (57)
\hat{n}_0 = n_0 + n.   (58)

This is easy to see from our previous derivation of $\hat{\lambda}_1, \hat{\lambda}_2$. Further, we see that the posterior expectation of $X$ is a convex combination of $x_0$ (our prior idea) and $\bar{x}$ (the data mean).

Example: Data from a unit variance Gaussian

• Suppose the data $x_i$ come from a unit variance Gaussian,

p(x \mid \mu) = \frac{1}{\sqrt{2\pi}} \exp\{ -(x - \mu)^2 / 2 \}.   (59)

This is a simpler exponential family than the previous Gaussian

p(x \mid \mu) = \frac{ \exp\{ -x^2/2 \} }{ \sqrt{2\pi} } \exp\{ x \mu - \mu^2 / 2 \}.   (60)

In this case

\eta = \mu   (61)
t(x) = x   (62)
h(x) = \exp\{ -x^2/2 \} / \sqrt{2\pi}   (63)
a(\eta) = \mu^2 / 2 = \eta^2 / 2.   (64)

• What is the conjugate prior? It is

p(\mu \mid \lambda) = h_c(\mu) \exp\{ \lambda_1 \mu + \lambda_2 ( -\mu^2 / 2 ) - a_c(\lambda) \}.   (65)

This has sufficient statistics $[\mu, -\mu^2/2]$, which means it's a Gaussian distribution.

• Put it in canonical form,

p(\mu \mid n_0, x_0) = h_c(\mu) \exp\{ n_0 x_0 \mu - n_0 ( \mu^2 / 2 ) - a_c(n_0, x_0) \}.   (66)

• The posterior parameters are

\hat{x}_0 = \frac{ n_0 x_0 + n \bar{x} }{ n_0 + n }   (67)
\hat{n}_0 = n_0 + n.   (68)

This implies that the posterior mean and variance are

\hat{\mu} = \hat{x}_0   (69)
\hat{\sigma}^2 = 1 / (n_0 + n).   (70)

These are results that we asserted earlier.
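Here is a small Python sketch of this conjugate update (an illustration, not part of the notes; the prior values and the simulated data are assumed for the example). The prior is $\mu \sim N(x_0, 1/n_0)$, the data are unit variance Gaussian, and the posterior mean and variance follow equations (67)-(70).

    # A minimal sketch of the Gaussian conjugate update for an unknown mean.
    import numpy as np

    def gaussian_posterior(x, x0, n0):
        n = len(x)
        x_bar = np.mean(x)
        post_mean = (n0 * x0 + n * x_bar) / (n0 + n)   # equations (67) and (69)
        post_var = 1.0 / (n0 + n)                      # equation (70)
        return post_mean, post_var

    rng = np.random.default_rng(0)
    x = rng.normal(loc=1.0, scale=1.0, size=50)        # toy data, assumed for illustration
    print(gaussian_posterior(x, x0=0.0, n0=1.0))       # mean pulled toward x_bar, small variance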

• Intuitively, when we haven't seen any data, our estimate of the mean is the prior mean. As we see more data, our estimate of the mean moves towards the sample mean.

Before seeing data, our "confidence" in the estimate is measured by the prior variance. As we see more data, this variance decreases.

Example: Data from a Bernoulli

• The Bernoulli likelihood is

p(x \mid \pi) = \pi^x (1 - \pi)^{1 - x}.   (71)

Above we have seen its form as a minimal exponential family. Let's rewrite that, but keeping the mean parameter in the picture,

p(x \mid \pi) = \exp\{ x \log( \pi / (1 - \pi) ) + \log(1 - \pi) \}.   (72)

• The canonical conjugate prior therefore looks like this,

p(\pi \mid x_0, n_0) = \exp\{ n_0 x_0 \log( \pi / (1 - \pi) ) + n_0 \log(1 - \pi) - a_c(n_0, x_0) \}.   (73)

This simplifies to

p(\pi \mid x_0, n_0) = \exp\{ n_0 x_0 \log \pi + n_0 (1 - x_0) \log(1 - \pi) - a_c(n_0, x_0) \}.   (74)

Putting this in non-exponential-family form,

p(\pi \mid x_0, n_0) \propto \pi^{n_0 x_0} (1 - \pi)^{n_0 (1 - x_0)},   (75)

which nearly looks like the familiar beta distribution.

• To get the beta, we set $\alpha \triangleq n_0 x_0 + 1$ and $\beta \triangleq n_0 (1 - x_0) + 1$. Bringing in the resulting normalizer, we have

p(\pi \mid \alpha, \beta) = \frac{ \Gamma(\alpha + \beta) }{ \Gamma(\alpha) \Gamma(\beta) } \pi^{\alpha - 1} (1 - \pi)^{\beta - 1}.   (76)

• The posterior distribution follows the usual recipe, from which we can derive that the corresponding updates are

\hat{\alpha} = \alpha + \sum_{i=1}^n x_i   (77)
\hat{\beta} = \beta + \sum_{i=1}^n (1 - x_i).   (78)

To see this, update $x_0$ and $n_0$ as above and compute $\hat{\alpha}$ and $\hat{\beta}$ from the definitions.
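The beta-Bernoulli update is simple enough to state directly in code; a minimal sketch (not from the notes; the function name and toy observations are illustrative):

    # A minimal sketch of the beta posterior updates, equations (77)-(78).
    def beta_posterior(x, alpha, beta):
        """x is a list of 0/1 observations; returns (alpha_hat, beta_hat)."""
        ones = sum(x)
        return alpha + ones, beta + len(x) - ones

    print(beta_posterior([1, 0, 1, 1, 0], alpha=1.0, beta=1.0))   # (4.0, 3.0)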

The big picture

• Exponential families and conjugate priors can be used in many graphical models. Gibbs sampling is straightforward when each complete conditional involves a conjugate "prior" and likelihood pair.

• For example, we can now define an exponential family mixture model. The mixture components are drawn from a conjugate prior; the data are drawn from the corresponding exponential family.

• Imagine a generic model $p(x_{1:n}, z_{1:n} \mid \alpha)$. Suppose each complete conditional is in the exponential family,

p(z_i \mid z_{-i}, x, \alpha) = h(z_i) \exp\{ \eta_i(z_{-i}, x)^\top z_i - a( \eta_i(z_{-i}, x) ) \}.   (79)

Notice that the natural parameter is a function of the variables we condition on. This is called a conditionally conjugate model. Provided we can compute the natural parameter, Gibbs sampling is immediate.
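As an illustration of conditional conjugacy (a sketch under assumed modeling choices, not an algorithm from the notes), the following Gibbs sampler fits a two-component mixture of unit variance Gaussians with equal mixing proportions and a $N(0, 1/n_0)$ prior on each component mean. Each complete conditional is either a categorical distribution whose parameters depend on the current means, or a conjugate Gaussian update.

    # A minimal Gibbs sampling sketch for a conditionally conjugate mixture model.
    import numpy as np

    rng = np.random.default_rng(0)

    # Toy data from two well-separated components (assumed for illustration).
    x = np.concatenate([rng.normal(-2.0, 1.0, 100), rng.normal(2.0, 1.0, 100)])
    K, n0, n_iter = 2, 1.0, 200

    mu = rng.normal(size=K)               # initialize component means
    z = rng.integers(K, size=len(x))      # initialize assignments

    for _ in range(n_iter):
        # Complete conditional of each z_i: categorical, with log probabilities
        # given by the Gaussian log likelihood under each component.
        logits = -0.5 * (x[:, None] - mu[None, :]) ** 2
        probs = np.exp(logits - logits.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)
        z = np.array([rng.choice(K, p=p) for p in probs])

        # Complete conditional of each mu_k: the conjugate Gaussian update,
        # as in equations (67)-(70), with prior mean x0 = 0.
        for k in range(K):
            xk = x[z == k]
            post_var = 1.0 / (n0 + len(xk))
            post_mean = xk.sum() * post_var
            mu[k] = rng.normal(post_mean, np.sqrt(post_var))

    print(np.sort(mu))                    # should land near [-2, 2]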

• Many models from the machine learning research literature are conditionally conjugate. Examples include HMMs, mixture models, hierarchical linear regression, probit regression, factorial models, and others.

• Consider the HMM, for example, but with its parameters unknown,

[ graphical model ]

Notice that this is like a mixture model, but where the mixture components are embedded in a Markov chain.

Each row of the transition matrix is a point on the $(K-1)$-simplex; each set of emission probabilities is a point on the $(V-1)$-simplex. We can place Dirichlet priors on these components. (We will talk about the Dirichlet later. It is a multivariate generalization of the beta distribution, and is the conjugate prior to the categorical/multinomial.)

We can easily obtain the complete conditional of $z_i$. The complete conditionals of the transitions and emissions follow from our discussion of conjugacy.

Collapsed Gibbs sampling is possible, but is quite expensive. One needs to run forward-backward at each sampling of a $z_i$.

10