The Exponential Family

David M. Blei
Columbia University

November 3, 2015

Definition

• A probability density in the exponential family has this form,

    $p(x \mid \eta) = h(x) \exp\{ \eta^\top t(x) - a(\eta) \}$,    (1)

  where

  - $\eta$ is the natural parameter
  - $t(x)$ are sufficient statistics
  - $h(x)$ is the "underlying measure"; it ensures $x$ is in the right space
  - $a(\eta)$ is the log normalizer

  Examples of exponential family distributions include the Gaussian, gamma, Poisson, Bernoulli, multinomial, and Markov models. Examples of distributions that are not in this family include the Student-t, mixtures, and hidden Markov models. (We are considering these families as distributions of data. The latent variables are implicitly marginalized out.)

• The statistic $t(x)$ is called sufficient because the probability of $x$ under $\eta$ only depends on $x$ through $t(x)$.

• The exponential family has fundamental connections to the world of graphical models. For our purposes, we'll use exponential families as components in directed graphical models, e.g., in mixtures of Gaussians.

• The log normalizer ensures that the density integrates to 1,

    $a(\eta) = \log \int h(x) \exp\{ \eta^\top t(x) \} \, dx$.    (2)

  This is the negative logarithm of the normalizing constant.

• Example: Bernoulli. As an example, let's put the Bernoulli (in its usual form) into its exponential family form. The Bernoulli you are used to seeing is

    $p(x \mid \pi) = \pi^x (1 - \pi)^{1 - x}, \quad x \in \{0, 1\}$.    (3)

  In exponential family form,

    $p(x \mid \pi) = \exp\{ \log(\pi^x (1 - \pi)^{1 - x}) \}$    (4)
    $= \exp\{ x \log \pi + (1 - x) \log(1 - \pi) \}$    (5)
    $= \exp\{ x \log \pi - x \log(1 - \pi) + \log(1 - \pi) \}$    (6)
    $= \exp\{ x \log(\pi / (1 - \pi)) + \log(1 - \pi) \}$.    (7)

  This reveals the exponential family representation, where

    $\eta = \log(\pi / (1 - \pi))$    (8)
    $t(x) = x$    (9)
    $a(\eta) = -\log(1 - \pi) = \log(1 + e^{\eta})$    (10)
    $h(x) = 1$.    (11)

  Note the relationship between $\eta$ and $\pi$ is invertible,

    $\pi = 1 / (1 + e^{-\eta})$.    (12)

  This is the logistic function.

• Example: Gaussian. The familiar form of the univariate Gaussian is

    $p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\}$.    (13)

  We put it in exponential family form by expanding the square,

    $p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}} \exp\left\{ \frac{\mu}{\sigma^2} x - \frac{1}{2\sigma^2} x^2 - \frac{\mu^2}{2\sigma^2} - \log \sigma \right\}$.    (14)

  We see that

    $\eta = [\mu / \sigma^2, \; -1/(2\sigma^2)]$    (15)
    $t(x) = [x, \; x^2]$    (16)
    $a(\eta) = \mu^2 / (2\sigma^2) + \log \sigma$    (17)
    $\phantom{a(\eta)} = -\eta_1^2 / (4\eta_2) - (1/2) \log(-2\eta_2)$    (18)
    $h(x) = 1 / \sqrt{2\pi}$.    (19)

Moments of an exponential family

• Let's go back to the general family. We are going to take derivatives of the log normalizer. This gives us moments of the sufficient statistics,

    $\nabla_\eta a(\eta) = \nabla_\eta \left\{ \log \int \exp\{ \eta^\top t(x) \} h(x) \, dx \right\}$    (20)
    $= \frac{\int \nabla_\eta \exp\{ \eta^\top t(x) \} h(x) \, dx}{\int \exp\{ \eta^\top t(x) \} h(x) \, dx}$    (21)
    $= \int t(x) \, \frac{\exp\{ \eta^\top t(x) \} h(x)}{\int \exp\{ \eta^\top t(x) \} h(x) \, dx} \, dx$    (22)
    $= E_\eta[t(X)]$.    (23)

• Check: Higher order derivatives give higher order moments. For example,

    $\frac{\partial^2 a(\eta)}{\partial \eta_i \, \partial \eta_j} = \mathrm{Cov}(t_i(X), t_j(X))$.    (24)

• There is a 1-1 relationship between the mean of the sufficient statistics $E[t(X)]$ and the natural parameter $\eta$. In other words, the mean is an alternative parameterization of the distribution. (Many of the forms of exponential families that you know are the mean parameterization.)

• Consider again the Bernoulli. We saw this with the logistic function, where note that $\pi = E[X]$ (because $X$ is an indicator).

• Now consider the Gaussian. The derivative with respect to $\eta_1$ is

    $\frac{d a(\eta)}{d \eta_1} = -\frac{\eta_1}{2\eta_2}$    (25)
    $= \mu$    (26)
    $= E[X]$.    (27)

  The derivative with respect to $\eta_2$ is

    $\frac{d a(\eta)}{d \eta_2} = \frac{\eta_1^2}{4\eta_2^2} - \frac{1}{2\eta_2}$    (28)
    $= \mu^2 + \sigma^2$    (29)
    $= E[X^2]$.    (30)

  This means that the variance is

    $\mathrm{Var}(X) = E[X^2] - E[X]^2$    (31)
    $= -\frac{1}{2\eta_2}$    (32)
    $= \sigma^2$.    (33)

• To see the 1-1 relationship between $E[t(X)]$ and $\eta$, note that $\nabla^2_\eta a(\eta) = \mathrm{Var}(t(X))$ is positive $\rightarrow$ $a(\eta)$ is convex $\rightarrow$ there is a 1-1 relationship between its argument and first derivative.
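To make the Bernoulli example and the moment identities concrete, here is a minimal numerical sketch in Python (not part of the original notes; the function names are mine). It maps the mean parameter $\pi$ to the natural parameter $\eta$, evaluates the log normalizer $a(\eta) = \log(1 + e^{\eta})$, and checks by finite differences that $a'(\eta)$ recovers the mean $\pi$ and $a''(\eta)$ recovers the variance $\pi(1 - \pi)$.

import numpy as np

def bernoulli_natural_param(pi):
    """Map the mean parameter pi to the natural parameter eta = log(pi / (1 - pi))."""
    return np.log(pi / (1.0 - pi))

def bernoulli_log_normalizer(eta):
    """Log normalizer a(eta) = log(1 + e^eta) for the Bernoulli."""
    return np.log1p(np.exp(eta))

pi = 0.3
eta = bernoulli_natural_param(pi)
eps = 1e-4

# First derivative of a(eta) should recover the mean E[t(X)] = pi.
grad_a = (bernoulli_log_normalizer(eta + eps)
          - bernoulli_log_normalizer(eta - eps)) / (2 * eps)
print(grad_a, pi)                  # both approximately 0.3

# Second derivative should recover the variance Var(t(X)) = pi * (1 - pi).
hess_a = (bernoulli_log_normalizer(eta + eps)
          - 2 * bernoulli_log_normalizer(eta)
          + bernoulli_log_normalizer(eta - eps)) / eps**2
print(hess_a, pi * (1 - pi))       # both approximately 0.21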
• Here is some notation for later (when we discuss generalized linear models). Denote the mean parameter as $\mu = E[t(X)]$; denote the inverse map as $\eta(\cdot)$, which gives the $\eta$ such that $E_\eta[t(X)] = \mu$.

Maximum likelihood estimation of an exponential family

• The data are $x_{1:N}$. We seek the value of $\eta$ that maximizes the likelihood.

• The log likelihood is

    $\mathcal{L} = \sum_{n=1}^N \log p(x_n \mid \eta)$    (34)
    $= \sum_{n=1}^N \left( \log h(x_n) + \eta^\top t(x_n) - a(\eta) \right)$    (35)
    $= \sum_{n=1}^N \log h(x_n) + \eta^\top \sum_{n=1}^N t(x_n) - N a(\eta)$.    (36)

  As a function of $\eta$, the log likelihood only depends on $\sum_{n=1}^N t(x_n)$. It has fixed dimension; there is no need to store the data. (And note that it is sufficient for estimating $\eta$.)

• Take the gradient of the log likelihood and set it to zero,

    $\nabla_\eta \mathcal{L} = \sum_{n=1}^N t(x_n) - N \nabla_\eta a(\eta)$.    (37)

  It's now easy to solve for the mean parameter:

    $\mu_{\mathrm{ML}} = \frac{\sum_{n=1}^N t(x_n)}{N}$.    (38)

  This is, as you might guess, the empirical mean of the sufficient statistics. The inverse map gives us the natural parameter,

    $\eta_{\mathrm{ML}} = \eta(\mu_{\mathrm{ML}})$.    (39)

• Consider the Bernoulli. The MLE of the mean parameter $\mu_{\mathrm{ML}}$ is the sample mean; the MLE of the natural parameter is the corresponding log odds.

• Consider the Gaussian. The MLEs of the mean parameters are the sample mean and the sample variance.

Conjugacy

• Consider the following set up:

    $\eta \sim F(\cdot \mid \lambda)$    (40)
    $x_i \sim G(\cdot \mid \eta) \quad \text{for } i \in \{1, \ldots, n\}$.    (41)

  This is a classical Bayesian data analysis setting. And, this is used as a component in more complicated models, e.g., in hierarchical models.

• The posterior distribution of $\eta$ given the data $x_{1:n}$ is

    $p(\eta \mid x_{1:n}, \lambda) \propto F(\eta \mid \lambda) \prod_{i=1}^n G(x_i \mid \eta)$.    (42)

  Suppose this distribution is in the same family as $F$, i.e., its parameters are in the space indexed by $\lambda$. Then $F$ and $G$ are a conjugate pair.

• For example,

  - Gaussian likelihood with fixed variance; Gaussian prior on the mean
  - Multinomial likelihood; Dirichlet prior on the probabilities
  - Bernoulli likelihood; beta prior on the bias
  - Poisson likelihood; gamma prior on the rate

  In all these settings, the conditional distribution of the parameter given the data is in the same family as the prior.

• Suppose the data come from an exponential family. Every exponential family has a conjugate prior,

    $p(x_i \mid \eta) = h_\ell(x_i) \exp\{ \eta^\top t(x_i) - a_\ell(\eta) \}$    (43)
    $p(\eta \mid \lambda) = h_c(\eta) \exp\{ \lambda_1^\top \eta + \lambda_2 (-a_\ell(\eta)) - a_c(\lambda) \}$.    (44)

  The natural parameter $\lambda = [\lambda_1, \lambda_2]$ has dimension $\dim(\eta) + 1$. The sufficient statistics are $[\eta, -a_\ell(\eta)]$. The other terms $h_c(\cdot)$ and $a_c(\cdot)$ depend on the form of the exponential family. For example, when $\eta$ are multinomial parameters then the other terms help define a Dirichlet.

• Let's compute the posterior in the general case,

    $p(\eta \mid x_{1:n}, \lambda) \propto p(\eta \mid \lambda) \prod_{i=1}^n p(x_i \mid \eta)$    (45)
    $= h_c(\eta) \exp\{ \lambda_1^\top \eta + \lambda_2 (-a_\ell(\eta)) - a_c(\lambda) \} \, \prod_{i=1}^n h_\ell(x_i) \, \exp\{ \eta^\top \textstyle\sum_{i=1}^n t(x_i) - n a_\ell(\eta) \}$    (46)
    $\propto h_c(\eta) \exp\{ (\lambda_1 + \textstyle\sum_{i=1}^n t(x_i))^\top \eta + (\lambda_2 + n)(-a_\ell(\eta)) \}$.    (47)

  This is the same exponential family as the prior, with parameters

    $\hat{\lambda}_1 = \lambda_1 + \textstyle\sum_{i=1}^n t(x_i)$    (48)
    $\hat{\lambda}_2 = \lambda_2 + n$.    (49)

Posterior predictive distribution

• Let's compute the posterior predictive distribution. This comes up frequently in applications of models (i.e., when doing prediction) and in inference (i.e., when doing collapsed Gibbs sampling).

• Consider a new data point $x_{\text{new}}$. Its posterior predictive is

    $p(x_{\text{new}} \mid x_{1:n}) = \int p(x_{\text{new}} \mid \eta) \, p(\eta \mid x_{1:n}) \, d\eta$    (50)
    $\propto \int \exp\{ \eta^\top x_{\text{new}} - a_\ell(\eta) \} \, h_c(\eta) \exp\{ \hat{\lambda}_1^\top \eta + \hat{\lambda}_2 (-a_\ell(\eta)) - a_c(\hat{\lambda}) \} \, d\eta$    (51)
    $= \frac{\int h_c(\eta) \exp\{ (\hat{\lambda}_1 + x_{\text{new}})^\top \eta + (\hat{\lambda}_2 + 1)(-a_\ell(\eta)) \} \, d\eta}{\exp\{ a_c(\hat{\lambda}_1, \hat{\lambda}_2) \}}$    (52)
    $= \exp\{ a_c(\hat{\lambda}_1 + x_{\text{new}}, \hat{\lambda}_2 + 1) - a_c(\hat{\lambda}_1, \hat{\lambda}_2) \}$.    (53)

• In other words, the posterior predictive density is a ratio of normalizing constants. In the numerator is the posterior with the new data point added; in the denominator is the posterior without the new data point.

• This is how we can derive collapsed Gibbs samplers.
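As a concrete instance of the conjugate update (48)-(49) and the ratio-of-normalizers form (53), here is a short Beta-Bernoulli sketch in Python (my own addition, written in the usual Beta$(\alpha, \beta)$ parameterization rather than the natural parameterization $[\lambda_1, \lambda_2]$; the hyperparameters and data are made up). The posterior predictive of a new observation comes out as a ratio of Beta normalizing constants.

import numpy as np
from scipy.special import betaln   # log Beta function: the log normalizer of a Beta density

# Hypothetical prior hyperparameters and toy Bernoulli data.
alpha, beta = 2.0, 2.0
x = np.array([1, 0, 1, 1, 0, 1])

# Conjugate posterior update: add the sufficient statistics and the counts,
# mirroring lambda_hat_1 = lambda_1 + sum_i t(x_i) and lambda_hat_2 = lambda_2 + n.
alpha_hat = alpha + x.sum()
beta_hat = beta + len(x) - x.sum()

# Posterior predictive p(x_new = 1 | x_{1:n}) as a ratio of normalizing constants:
# B(alpha_hat + 1, beta_hat) / B(alpha_hat, beta_hat).
log_pred_one = betaln(alpha_hat + 1, beta_hat) - betaln(alpha_hat, beta_hat)
print(np.exp(log_pred_one))                    # ratio of normalizers
print(alpha_hat / (alpha_hat + beta_hat))      # same value in closed form (0.6 here)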
Canonical prior

• There is a variation on the form of the simple conjugate prior that is useful for understanding its properties. We set $\lambda_1 = x_0 n_0$ and $\lambda_2 = n_0$. (And, for convenience, we drop the sufficient statistic so that $t(x) = x$.)

• In this form the conjugate prior is

    $p(\eta \mid x_0, n_0) \propto \exp\{ n_0 x_0^\top \eta - n_0 a(\eta) \}$.    (54)

  Here we can interpret $x_0$ as our prior idea of the expectation of $x$, and $n_0$ as the number of "prior data points."

• Consider the predictive expectation of $X$, $E[E[X \mid \eta]]$. The first expectation is with respect to the prior; the second is with respect to the data distribution. As an exercise, show that

    $E[E[X \mid \eta]] = x_0$.    (55)

• Assume data $x_{1:n}$.
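Here is a quick Monte Carlo check of the exercise (55) for the Bernoulli case (my own sketch, not from the notes). It relies on the standard fact, assumed here, that the canonical prior (54) on $\eta$ corresponds, after the change of variables $\pi = 1/(1 + e^{-\eta})$, to a Beta$(n_0 x_0, n_0(1 - x_0))$ density on $\pi$; since $E[X \mid \eta] = \pi$ for the Bernoulli, the predictive expectation is just the prior mean of $\pi$.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical prior mean x0 and prior count n0 for the canonical Bernoulli prior (54).
x0, n0 = 0.25, 10.0

# Assumption: under the change of variables pi = 1 / (1 + exp(-eta)), the canonical
# prior on eta becomes a Beta(n0 * x0, n0 * (1 - x0)) density on pi.
pi_samples = rng.beta(n0 * x0, n0 * (1.0 - x0), size=1_000_000)

# For the Bernoulli, E[X | eta] = pi, so E[E[X | eta]] is the prior mean of pi.
print(pi_samples.mean())   # approximately x0 = 0.25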