The Exponential Family

David M. Blei
Columbia University

November 9, 2016

The exponential family is a class of densities (Brown, 1986). It encompasses many familiar forms of likelihoods, such as the Gaussian, Poisson, multinomial, and Bernoulli. It also encompasses their conjugate priors, such as the Gamma, Dirichlet, and beta.

1 Definition

A probability density in the exponential family has this form

\[
p(x \mid \eta) = h(x) \exp\{\eta^\top t(x) - a(\eta)\}, \tag{1}
\]

where

- $\eta$ is the natural parameter;
- $t(x)$ are the sufficient statistics;
- $h(x)$ is the "base measure;"
- $a(\eta)$ is the log normalizer.

Examples of exponential family distributions include the Gaussian, gamma, Poisson, Bernoulli, multinomial, and Markov models.

Examples of distributions that are not in this family include the Student's t, mixtures, and hidden Markov models. (We are considering these families as distributions of data. The latent variables are implicitly marginalized out.)

The statistic $t(x)$ is called sufficient because the probability of $x$, as a function of $\eta$, depends on $x$ only through $t(x)$.

The exponential family has fundamental connections to the world of graphical models (Wainwright and Jordan, 2008). For our purposes, we'll use exponential families as components in directed graphical models, e.g., in mixtures of Gaussians.

The log normalizer ensures that the density integrates to 1,
\[
a(\eta) = \log \int h(x) \exp\{\eta^\top t(x)\}\, dx. \tag{2}
\]

This is the negative logarithm of the normalizing constant $\exp\{-a(\eta)\}$ that appears in Equation 1.

The function $h(x)$ can be a source of confusion. One way to interpret $h(x)$ is as the (unnormalized) distribution of $x$ when $\eta = 0$. It might involve statistics of $x$ that are not in $t(x)$, i.e., that do not vary with the natural parameter.

How do we take a familiar distribution with parameter $\theta$ and turn it into an exponential family?

1. Exponentiate the log density to find $\exp\{\log p(x \mid \theta)\}$.

2. Inside the exponential, find terms that can be written as $\eta_i(\theta)\, t_i(x)$ for sets of functions $\eta_i(\theta)$ and $t_i(x)$. These are the natural parameters and sufficient statistics.

3. Find terms that are a function only of the parameters, $a(\theta)$. Try to write them in terms of the $\eta_i(\theta)$ identified in the previous step. This is the log normalizer.

4. Now look at the rest; hopefully it is a function of $x$ and constants, i.e., not of any of the free parameters $\theta$. This is $h(x)$.

Bernoulli. As an example, let's put the Bernoulli (in its usual form) into its exponential family form. The Bernoulli you are used to seeing is

\[
p(x \mid \pi) = \pi^x (1 - \pi)^{1 - x}, \qquad x \in \{0, 1\}. \tag{3}
\]

In exponential family form

\begin{align}
p(x \mid \pi) &= \exp\{\log(\pi^x (1 - \pi)^{1 - x})\} \tag{4} \\
&= \exp\{x \log \pi + (1 - x) \log(1 - \pi)\} \tag{5} \\
&= \exp\{x \log \pi - x \log(1 - \pi) + \log(1 - \pi)\} \tag{6} \\
&= \exp\{x \log(\pi / (1 - \pi)) + \log(1 - \pi)\} \tag{7}
\end{align}


Figure 1: The logistic function (from Wikipedia).

This reveals the exponential family where

\begin{align}
\eta &= \log(\pi / (1 - \pi)) \tag{8} \\
t(x) &= x \tag{9} \\
a(\eta) &= -\log(1 - \pi) = \log(1 + e^{\eta}) \tag{10} \\
h(x) &= 1 \tag{11}
\end{align}

Note that the relationship between $\eta$ and $\pi$ is invertible,

\[
\pi = \frac{1}{1 + e^{-\eta}}. \tag{12}
\]

This is the logistic function. See Figure 1.
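As a quick numerical sanity check (a sketch of ours, not part of the original notes), the following Python snippet evaluates the Bernoulli both in its usual form and in the exponential family form above and confirms that they agree; the function names are hypothetical.

```python
import numpy as np

def bernoulli_standard(x, pi):
    """Bernoulli in its usual form, pi^x (1 - pi)^(1 - x)."""
    return pi**x * (1.0 - pi)**(1 - x)

def bernoulli_expfam(x, pi):
    """Bernoulli in exponential family form with natural parameter eta."""
    eta = np.log(pi / (1.0 - pi))        # Equation (8)
    a = np.log(1.0 + np.exp(eta))        # log normalizer, Equation (10)
    h = 1.0                              # base measure, Equation (11)
    return h * np.exp(eta * x - a)       # t(x) = x

for pi in (0.2, 0.7):
    for x in (0, 1):
        assert np.isclose(bernoulli_standard(x, pi), bernoulli_expfam(x, pi))

# The mean parameter is recovered by the logistic function, Equation (12).
eta = np.log(0.7 / 0.3)
assert np.isclose(1.0 / (1.0 + np.exp(-eta)), 0.7)
```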

Poisson. The Poisson is a distribution on the nonnegative integers, often used for count data. The familiar form of the Poisson is

\[
p(x \mid \lambda) = \frac{1}{x!} \lambda^x \exp\{-\lambda\}. \tag{13}
\]

Put this in exponential family form

\[
p(x \mid \lambda) = \frac{1}{x!} \exp\{x \log \lambda - \lambda\}. \tag{14}
\]

We see that

\begin{align}
\eta &= \log \lambda \tag{15} \\
t(x) &= x \tag{16} \\
a(\eta) &= \lambda = \exp\{\eta\} \tag{17} \\
h(x) &= 1 / x! \tag{18}
\end{align}
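A similar check for the Poisson (again, our own illustrative sketch; scipy.stats.poisson is used only for comparison):

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import poisson

def poisson_expfam(x, lam):
    """Poisson in exponential family form."""
    eta = np.log(lam)                    # Equation (15)
    a = np.exp(eta)                      # log normalizer = lambda, Equation (17)
    log_h = -gammaln(x + 1)              # h(x) = 1 / x!, Equation (18)
    return np.exp(log_h + eta * x - a)   # t(x) = x

lam = 3.5
xs = np.arange(10)
assert np.allclose(poisson_expfam(xs, lam), poisson.pmf(xs, lam))
```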

Gaussian. See Appendix A for the natural parameterization of the Gaussian.

2 Moments of an exponential family

Let's go back to the general family. We are going to take derivatives of the log normalizer. This gives us moments of the sufficient statistics,

\begin{align}
\nabla_\eta a(\eta) &= \nabla_\eta \log \left\{ \int \exp\{\eta^\top t(x)\}\, h(x)\, dx \right\} \tag{19} \\
&= \frac{\nabla_\eta \int \exp\{\eta^\top t(x)\}\, h(x)\, dx}{\int \exp\{\eta^\top t(x)\}\, h(x)\, dx} \tag{20} \\
&= \int t(x)\, \frac{\exp\{\eta^\top t(x)\}\, h(x)}{\int \exp\{\eta^\top t(x)\}\, h(x)\, dx}\, dx \tag{21} \\
&= \mathbb{E}_\eta[t(X)] \tag{22}
\end{align}

Check: Higher order derivatives give higher order moments. For example,

\[
\frac{\partial^2 a(\eta)}{\partial \eta_i\, \partial \eta_j} = \mathbb{E}\left[t_i(X)\, t_j(X)\right] - \mathbb{E}\left[t_i(X)\right] \mathbb{E}\left[t_j(X)\right] = \mathrm{Cov}(t_i(X), t_j(X)). \tag{23}
\]
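To make these identities concrete, here is a small numerical sketch of ours for the Bernoulli, whose log normalizer is $a(\eta) = \log(1 + e^\eta)$: finite differences of $a(\eta)$ recover the mean and variance of $t(X) = X$.

```python
import numpy as np

def a(eta):
    """Bernoulli log normalizer, a(eta) = log(1 + exp(eta))."""
    return np.log(1.0 + np.exp(eta))

eta, eps = 0.4, 1e-4
pi = 1.0 / (1.0 + np.exp(-eta))          # mean parameter

# First derivative of a(eta) is E[t(X)] = pi.
grad = (a(eta + eps) - a(eta - eps)) / (2 * eps)
assert np.isclose(grad, pi, atol=1e-6)

# Second derivative of a(eta) is Var(t(X)) = pi (1 - pi).
hess = (a(eta + eps) - 2 * a(eta) + a(eta - eps)) / eps**2
assert np.isclose(hess, pi * (1 - pi), atol=1e-4)
```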

There is a 1-1 relationship between the mean of the sufficient statistics $\mathbb{E}[t(X)]$ and the natural parameter $\eta$. In other words, the mean is an alternative parameterization of the distribution. (Many of the forms of exponential families that you know are the mean parameterization.)

We can easily see the 1-1 relationship between $\mathbb{E}[t(X)]$ and $\eta$ in one dimension. Let $X$ be a scalar. Note that

- $\mathrm{Var}(t(X)) = \nabla^2_\eta a(\eta)$ is positive.
- $\Rightarrow$ $a(\eta)$ is convex.
- $\Rightarrow$ there is a 1-1 relationship between its argument $\eta$ and its first derivative $\nabla_\eta a(\eta)$.

Here is some notation for later (when we discuss generalized linear models). Denote the mean parameter as $\mu = \mathbb{E}[t(X)]$; denote the inverse map as $\eta(\mu)$, which gives the $\eta$ such that $\mathbb{E}[t(X)] = \mu$.

Bernoulli. We saw the 1-1 relationship through the logistic function. Note that $\mu = \mathbb{E}[X]$ because $X$ is an indicator.

Gaussian. The derivative with respect to $\eta_1$ is

\begin{align}
\frac{d a(\eta)}{d \eta_1} &= -\frac{\eta_1}{2 \eta_2} \tag{24} \\
&= \mu \tag{25} \\
&= \mathbb{E}[X] \tag{26}
\end{align}

The derivative with respect to $\eta_2$ is

\begin{align}
\frac{d a(\eta)}{d \eta_2} &= \frac{\eta_1^2}{4 \eta_2^2} - \frac{1}{2 \eta_2} \tag{27} \\
&= \mu^2 + \sigma^2 \tag{28} \\
&= \mathbb{E}\left[X^2\right] \tag{29}
\end{align}

This confirms that the variance is

\begin{align}
\mathrm{Var}(X) &= \mathbb{E}\left[X^2\right] - \mathbb{E}[X]^2 \tag{30} \\
&= -\frac{1}{2 \eta_2} \tag{31} \\
&= \sigma^2 \tag{32}
\end{align}

Poisson. The expectation is the rate,

\[
\mathbb{E}[X] = \frac{d a(\eta)}{d \eta} = \exp\{\eta\} = \lambda. \tag{33}
\]

All higher derivatives of the log normalizer are also the rate; in particular, the variance equals the mean. This tying of mean and variance leads applied statisticians to alternatives, such as the negative binomial, where the mean and variance are controlled separately.

3 Maximum likelihood estimation of an exponential family

The data are $x_{1:N}$. We seek the value of $\eta$ that maximizes the likelihood. The log likelihood is

\begin{align}
\mathcal{L} &= \sum_{n=1}^N \log p(x_n \mid \eta) \tag{34} \\
&= \sum_{n=1}^N \left( \log h(x_n) + \eta^\top t(x_n) - a(\eta) \right) \tag{35} \\
&= \sum_{n=1}^N \log h(x_n) + \eta^\top \Big( \sum_{n=1}^N t(x_n) \Big) - N a(\eta) \tag{36}
\end{align}

As a function of $\eta$, the log likelihood only depends on $\sum_{n=1}^N t(x_n)$. It has fixed dimension; there is no need to store the data. (And note that it is sufficient for estimating $\eta$.)

Take the gradient of the log likelihood and set it to zero,

\[
\nabla_\eta \mathcal{L} = \sum_{n=1}^N t(x_n) - N\, \nabla_\eta a(\eta). \tag{37}
\]

It's now easy to solve for the mean parameter:

\[
\mu_{\mathrm{ML}} = \frac{\sum_{n=1}^N t(x_n)}{N}. \tag{38}
\]

This is, as you might guess, the empirical mean of the sufficient statistics. The inverse map gives us the natural parameter,

\[
\eta_{\mathrm{ML}} = \eta(\mu_{\mathrm{ML}}). \tag{39}
\]

Bernoulli. The MLE of the mean parameter, $\mu_{\mathrm{ML}}$, is the sample mean; the MLE of the natural parameter is the corresponding log odds.

Gaussian. The MLEs of the mean parameters correspond to the sample mean and the sample variance.

Poisson. The MLE of the mean parameter is the sample mean; the MLE of the natural parameter is its log.
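A short sketch of ours that follows this recipe on simulated data: average the sufficient statistics to get $\mu_{\mathrm{ML}}$, then apply the inverse map to get $\eta_{\mathrm{ML}}$. The variable names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

# Bernoulli: t(x) = x, and the inverse map is the log odds.
x = rng.binomial(1, 0.3, size=5000)
mu_ml = x.mean()                          # Equation (38)
eta_ml = np.log(mu_ml / (1 - mu_ml))      # Equation (39), inverse of the logistic
print(mu_ml, eta_ml, np.log(0.3 / 0.7))   # eta_ml should be near the true log odds

# Poisson: t(x) = x, and the inverse map is the log.
y = rng.poisson(4.0, size=5000)
print(y.mean(), np.log(y.mean()), np.log(4.0))
```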

4 Conjugacy

Consider the following setup:

\begin{align}
\theta &\sim F(\cdot \mid \lambda) \tag{40} \\
x_i &\sim G(\cdot \mid \theta) \quad \text{for } i \in \{1, \ldots, n\}. \tag{41}
\end{align}

This is a classical Bayesian setting, and it is also used as a component in more complicated models, e.g., hierarchical models.

The posterior distribution of $\theta$ given the data $x_{1:n}$ is
\[
p(\theta \mid x_{1:n}, \lambda) \propto F(\theta \mid \lambda) \prod_{i=1}^n G(x_i \mid \theta). \tag{42}
\]

Suppose this distribution is in the same family as $F$, i.e., its parameters are in the space indexed by $\lambda$. Then $F$ and $G$ are a conjugate pair.

For example,

- Gaussian likelihood with fixed variance; Gaussian prior on the mean
- Multinomial likelihood; Dirichlet prior on the probabilities
- Bernoulli likelihood; beta prior on the bias
- Poisson likelihood; gamma prior on the rate

In all these settings, the conditional distribution of the parameter given the data is in the same family as the prior.

4.1 A generic conjugate prior

Suppose the data come from an exponential family. Every exponential family has a conjugate prior (Diaconis and Ylvisaker, 1979; Bernardo and Smith, 1994),

\begin{align}
p(x_i \mid \theta) &= h_\ell(x_i) \exp\{\theta^\top t(x_i) - a_\ell(\theta)\} \tag{43} \\
p(\theta \mid \lambda) &= h_c(\theta) \exp\{\lambda_1^\top \theta + \lambda_2 (-a_\ell(\theta)) - a_c(\lambda)\}. \tag{44}
\end{align}

The natural parameter $\lambda = [\lambda_1, \lambda_2]$ has dimension $\dim(\theta) + 1$. The sufficient statistics are $[\theta, -a_\ell(\theta)]$. The other terms $h_c(\cdot)$ and $a_c(\cdot)$ depend on the form of the exponential family. For example, when $\theta$ are multinomial parameters, these terms define a Dirichlet.

4.2 The posterior

We compute the posterior in the general case,

\begin{align}
p(\theta \mid x_{1:n}, \lambda) &\propto p(\theta \mid \lambda) \prod_{i=1}^n p(x_i \mid \theta) \tag{45} \\
&= h_c(\theta) \exp\{\lambda_1^\top \theta + \lambda_2 (-a_\ell(\theta)) - a_c(\lambda)\} \cdot \prod_{i=1}^n h_\ell(x_i) \exp\Big\{\theta^\top \sum_{i=1}^n t(x_i) - n a_\ell(\theta)\Big\} \tag{46} \\
&\propto h_c(\theta) \exp\Big\{\Big(\lambda_1 + \sum_{i=1}^n t(x_i)\Big)^\top \theta + (\lambda_2 + n)(-a_\ell(\theta))\Big\}. \tag{47}
\end{align}

This is the same exponential family as the prior, with parameters

\begin{align}
\hat{\lambda}_1 &= \lambda_1 + \sum_{i=1}^n t(x_i) \tag{48} \\
\hat{\lambda}_2 &= \lambda_2 + n. \tag{49}
\end{align}
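This update is easy to code. Below is a sketch of ours (not from the notes): a generic function implementing Equations 48 and 49, instantiated for the beta-Bernoulli pair; the function name conjugate_update and the variable names are hypothetical.

```python
import numpy as np

def conjugate_update(lambda1, lambda2, suff_stats):
    """Generic conjugate update of Equations (48) and (49).

    lambda1, lambda2 : prior natural parameters.
    suff_stats       : array of t(x_i), one row per data point.
    """
    suff_stats = np.atleast_2d(suff_stats)
    lambda1_hat = lambda1 + suff_stats.sum(axis=0)
    lambda2_hat = lambda2 + suff_stats.shape[0]
    return lambda1_hat, lambda2_hat

# Beta-Bernoulli instance: t(x) = x. As derived in Section 4.5, the
# corresponding beta parameters are alpha = lambda1 + 1 and
# beta = lambda2 - lambda1 + 1.
x = np.array([1, 0, 1, 1, 0, 1])
l1_hat, l2_hat = conjugate_update(lambda1=2.0, lambda2=5.0, suff_stats=x[:, None])
alpha_hat = l1_hat[0] + 1
beta_hat = l2_hat - l1_hat[0] + 1
print(l1_hat, l2_hat, alpha_hat, beta_hat)
```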

4.3 The posterior predictive

We compute the posterior predictive distribution. This comes up frequently in applications of models (i.e., when doing prediction) and in inference (i.e., when doing collapsed Gibbs sampling).

Consider a new data point $x_{\mathrm{new}}$. Its posterior predictive is
\begin{align}
p(x_{\mathrm{new}} \mid x_{1:n}) &= \int p(x_{\mathrm{new}} \mid \theta)\, p(\theta \mid x)\, d\theta \tag{50} \\
&= \int \exp\{\theta^\top x_{\mathrm{new}} - a_\ell(\theta)\} \exp\{\hat{\lambda}_1^\top \theta + \hat{\lambda}_2(-a_\ell(\theta)) - a_c(\hat{\lambda})\}\, d\theta \tag{51} \\
&= \frac{\int \exp\{(\hat{\lambda}_1 + x_{\mathrm{new}})^\top \theta + (\hat{\lambda}_2 + 1)(-a_\ell(\theta))\}\, d\theta}{\exp\{a_c(\hat{\lambda}_1, \hat{\lambda}_2)\}} \tag{52} \\
&= \exp\{a_c(\hat{\lambda}_1 + x_{\mathrm{new}}, \hat{\lambda}_2 + 1) - a_c(\hat{\lambda}_1, \hat{\lambda}_2)\} \tag{53}
\end{align}

In other words, the posterior predictive density is a ratio of normalizing constants.

In the numerator is the normalizer of the posterior with the new data point added; in the denominator is the normalizer of the posterior without the new data point.

Note: this is how we can derive collapsed Gibbs samplers.
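For a concrete instance (our own sketch, not from the notes): for the beta-Bernoulli pair, the log normalizer $a_c$ is the log beta function, so Equation 53 becomes a ratio of beta functions, which simplifies to the posterior mean of $\pi$. The function name is hypothetical.

```python
import numpy as np
from scipy.special import betaln

def bernoulli_predictive(x_new, alpha_hat, beta_hat):
    """Posterior predictive as a ratio of normalizing constants (Equation 53).

    For the beta, the log normalizer is the log beta function, so the
    predictive is exp{log B(posterior with x_new) - log B(posterior)}.
    """
    num = betaln(alpha_hat + x_new, beta_hat + (1 - x_new))
    den = betaln(alpha_hat, beta_hat)
    return np.exp(num - den)

alpha_hat, beta_hat = 5.0, 3.0
p1 = bernoulli_predictive(1, alpha_hat, beta_hat)
# The ratio of beta functions simplifies to the posterior mean of pi.
assert np.isclose(p1, alpha_hat / (alpha_hat + beta_hat))
```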

4.4 Canonical prior

There is a variation on the form of the simple conjugate prior that is useful for understanding its properties. Set $\lambda_1 = x_0 n_0$ and $\lambda_2 = n_0$. (And, for convenience, drop the sufficient statistic so that $t(x) = x$.) In this form the conjugate prior is

\[
p(\theta \mid x_0, n_0) \propto \exp\{n_0 x_0^\top \theta - n_0 a(\theta)\}. \tag{54}
\]

We can interpret $x_0$ as our prior idea of the expectation of $x$, and $n_0$ as the number of "prior data points."

Consider the predictive expectation of $X$, $\mathbb{E}_\theta[\mathbb{E}_X[X \mid \theta]]$. The first expectation is with respect to the prior; the second is with respect to the data distribution. Fact:

\[
\mathbb{E}_\theta[\mathbb{E}_X[X \mid \theta]] = x_0. \tag{55}
\]

Assume data $x_{1:n}$. The mean is
\[
\bar{x} = (1/n) \sum_{i=1}^n x_i. \tag{56}
\]

Given the data and a prior, the posterior parameters are

\begin{align}
\hat{x}_0 &= \frac{n_0 x_0 + n \bar{x}}{n + n_0} \tag{57} \\
\hat{n}_0 &= n_0 + n \tag{58}
\end{align}

This follows from $\hat{\lambda}_1$ and $\hat{\lambda}_2$ in Equation 48 and Equation 49. Notice that the posterior expectation of $X$, $\hat{x}_0$, is a convex combination of $x_0$ (our prior idea) and $\bar{x}$ (the data mean). This is a property of conjugate priors (Diaconis and Ylvisaker, 1979).
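A small numerical illustration of ours of this convex-combination property, using Bernoulli data and the canonical parameters $x_0$ and $n_0$ (the variable names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

x0, n0 = 0.5, 10.0          # prior guess for E[X] and number of "prior data points"
x = rng.binomial(1, 0.8, size=40)
xbar, n = x.mean(), len(x)

x0_hat = (n0 * x0 + n * xbar) / (n0 + n)   # Equation (57)
n0_hat = n0 + n                            # Equation (58)

# x0_hat sits between the prior guess and the data mean,
# with weights n0 / (n0 + n) and n / (n0 + n).
w = n0 / (n0 + n)
assert np.isclose(x0_hat, w * x0 + (1 - w) * xbar)
print(x0, xbar, x0_hat)
```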

As an example, we now derive the canonical prior for a Bernoulli likelihood. (Appendix B derives the canonical prior for a unit-variance Gaussian.)

4.5 Example: Data from a Bernoulli

The Bernoulli likelihood is

\[
p(x \mid \pi) = \pi^x (1 - \pi)^{1 - x}. \tag{59}
\]

Above we have seen its form as a minimal exponential family. Let’s rewrite that, but keeping the mean parameter in the picture,

\[
p(x \mid \pi) = \exp\{x \log(\pi / (1 - \pi)) + \log(1 - \pi)\}. \tag{60}
\]

The canonical conjugate prior therefore looks like this,

\[
p(\pi \mid x_0, n_0) = \exp\{n_0 x_0 \log(\pi / (1 - \pi)) + n_0 \log(1 - \pi) - a(n_0, x_0)\}. \tag{61}
\]

This simplifies to

\[
p(\pi \mid x_0, n_0) = \exp\{n_0 x_0 \log \pi + n_0 (1 - x_0) \log(1 - \pi) - a(n_0, x_0)\}. \tag{62}
\]

Putting this in non-exponential family form,

\[
p(\pi \mid x_0, n_0) \propto \pi^{n_0 x_0} (1 - \pi)^{n_0 (1 - x_0)}, \tag{63}
\]
which nearly looks like the familiar beta distribution.

To get the beta, we set $\alpha = n_0 x_0 + 1$ and $\beta = n_0 (1 - x_0) + 1$. Bringing in the resulting normalizer, we have

\[
p(\pi \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \pi^{\alpha - 1} (1 - \pi)^{\beta - 1}. \tag{64}
\]

The posterior distribution follows the usual recipe, from which we can derive that the corresponding updates are

\begin{align}
\hat{\alpha} &= \alpha + \sum_{i=1}^n x_i \tag{65} \\
\hat{\beta} &= \beta + \sum_{i=1}^n (1 - x_i) \tag{66}
\end{align}

To see this, update $\hat{x}_0$ and $\hat{n}_0$ as above and compute $\hat{\alpha}$ and $\hat{\beta}$ from the definitions.

4.6 The big picture

Exponential families and conjugate priors can be used in many graphical models. Gibbs sampling is straightforward when each complete conditional involves a conjugate "prior" and likelihood pair.

For example, we can now define an exponential family mixture model. The mixture components are drawn from a conjugate prior; the data are drawn from the corresponding exponential family.

[ graphical model ]

In the mixture of Gaussians, the conditional distribution of the mixture locations was Gaussian. This was thanks to conjugacy. In general, we can imagine a mixture of multinomials (with Dirichlet priors), a mixture of Poissons (with Gamma priors), and so on.

Imagine a generic model $p(z, x \mid \alpha)$. Suppose each complete conditional is in the exponential family,
\[
p(z_i \mid z_{-i}, x, \alpha) = h(z_i) \exp\{\eta_i(z_{-i}, x)^\top z_i - a(\eta_i(z_{-i}, x))\}. \tag{67}
\]

Notice the natural parameter is a function of the variables we condition on. This is called a conditionally conjugate model. Provided we can compute the natural parameter, Gibbs sampling is immediate.
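As an illustrative sketch of ours (with assumed variable names and fixed, uniform mixing proportions), here is a minimal Gibbs sampler for a two-component mixture of unit-variance Gaussians with conjugate Gaussian priors on the means. Each update only requires computing the natural parameter of the corresponding complete conditional.

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(2)

# Simulated data from a two-component mixture of unit-variance Gaussians.
true_means = np.array([-2.0, 3.0])
z_true = rng.integers(0, 2, size=200)
x = rng.normal(true_means[z_true], 1.0)

K, n0, x0 = 2, 1.0, 0.0      # number of components; Gaussian prior on each mean
mu = rng.normal(size=K)      # current mixture locations
z = rng.integers(0, K, size=len(x))

for it in range(100):
    # Complete conditional of each z_i: categorical, with natural parameter
    # given by the log density of x_i under each component (uniform mixing).
    logits = -0.5 * (x[:, None] - mu[None, :]) ** 2
    logp = logits - logsumexp(logits, axis=1, keepdims=True)
    z = np.array([rng.choice(K, p=np.exp(lp)) for lp in logp])

    # Complete conditional of each mean: Gaussian, by conjugacy (Appendix B):
    # posterior precision n0 + n_k, posterior mean a weighted average.
    for k in range(K):
        xk = x[z == k]
        nk = len(xk)
        post_mean = (n0 * x0 + xk.sum()) / (n0 + nk)
        post_var = 1.0 / (n0 + nk)
        mu[k] = rng.normal(post_mean, np.sqrt(post_var))

print(np.sort(mu))   # should be near the true means, up to label switching
```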

Many models from the machine learning research literature are conditionally conjugate. Examples include hidden Markov models, Kalman filters, mixture models, hierarchical regression models, probit regression, factorial models, and others.

Consider the HMM, for example, but with its parameters unknown,

[ graphical model ]

Notice that this is like a mixture model, but where the mixture components are embedded in a Markov chain.

Each row of the transition matrix is a point on the $(K-1)$-simplex; each set of emission probabilities is a point on the $(V-1)$-simplex. Place Dirichlet priors on these components. (We discuss the Dirichlet later. It is a multivariate generalization of the beta distribution, and is the conjugate prior to the categorical/multinomial.)

We easily obtain the complete conditional of $z_i$. The complete conditionals of the transitions and emissions follow from our discussion of conjugacy.

References

Bernardo, J. and Smith, A. (1994). Bayesian Theory. John Wiley & Sons Ltd., Chichester.

Brown, L. (1986). Fundamentals of Statistical Exponential Families. Institute of Mathematical Statistics, Hayward, CA.

Diaconis, P. and Ylvisaker, D. (1979). Conjugate priors for exponential families. The Annals of Statistics, 7(2):269–281.

Wainwright, M. and Jordan, M. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305.

A The natural parameterization of the Gaussian

The familiar form of the Gaussian is

\[
p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\} \tag{68}
\]

We put it in exponential family form by expanding the square

\[
p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}} \exp\left\{ \frac{\mu}{\sigma^2} x - \frac{1}{2\sigma^2} x^2 - \frac{1}{2\sigma^2} \mu^2 - \log \sigma \right\} \tag{69}
\]

We see that

\begin{align}
\eta &= \left[\mu / \sigma^2,\; -1/(2\sigma^2)\right] \tag{70} \\
t(x) &= \left[x,\; x^2\right] \tag{71} \\
a(\eta) &= \mu^2 / (2\sigma^2) + \log \sigma \tag{72} \\
&= -\eta_1^2 / (4 \eta_2) - (1/2) \log(-2 \eta_2) \tag{73} \\
h(x) &= 1 / \sqrt{2\pi} \tag{74}
\end{align}
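A quick numerical check of this parameterization (our own sketch; scipy.stats.norm is used only for comparison, and the function name is hypothetical):

```python
import numpy as np
from scipy.stats import norm

def gaussian_expfam(x, mu, sigma2):
    """Gaussian density via its natural parameterization (Equations 70-74)."""
    eta1, eta2 = mu / sigma2, -1.0 / (2.0 * sigma2)
    a = -eta1**2 / (4.0 * eta2) - 0.5 * np.log(-2.0 * eta2)
    h = 1.0 / np.sqrt(2.0 * np.pi)
    return h * np.exp(eta1 * x + eta2 * x**2 - a)

xs = np.linspace(-3, 3, 7)
mu, sigma2 = 1.2, 2.5
assert np.allclose(gaussian_expfam(xs, mu, sigma2),
                   norm.pdf(xs, loc=mu, scale=np.sqrt(sigma2)))
```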

B The unit-variance Gaussian and its canonical prior

Suppose the data $x_i$ come from a unit-variance Gaussian

\[
p(x \mid \mu) = \frac{1}{\sqrt{2\pi}} \exp\{-(x - \mu)^2 / 2\}. \tag{75}
\]

This is a simpler exponential family than the previous Gaussian

\[
p(x \mid \mu) = \frac{\exp\{-x^2/2\}}{\sqrt{2\pi}} \exp\{x \mu - \mu^2 / 2\}. \tag{76}
\]

In this case

\begin{align}
\eta &= \mu \tag{77} \\
t(x) &= x \tag{78} \\
h(x) &= \frac{\exp\{-x^2/2\}}{\sqrt{2\pi}} \tag{79} \\
a(\eta) &= \mu^2 / 2 = \eta^2 / 2. \tag{80}
\end{align}

What is the conjugate prior? It is

\[
p(\mu \mid \lambda) = h(\mu) \exp\{\lambda_1 \mu + \lambda_2 (-\mu^2/2) - a_c(\lambda)\}. \tag{81}
\]

This has sufficient statistics $[\mu, -\mu^2/2]$, which means it's a Gaussian distribution. Put it in canonical form,

\[
p(\mu \mid n_0, x_0) = h(\mu) \exp\{n_0 x_0 \mu - n_0 (\mu^2/2) - a_c(n_0, x_0)\}. \tag{82}
\]

The posterior parameters are

\begin{align}
\hat{x}_0 &= \frac{n_0 x_0 + n \bar{x}}{n_0 + n} \tag{83} \\
\hat{n}_0 &= n_0 + n. \tag{84}
\end{align}

We now use the mapping from mean to natural parameters (Equation 70) to derive the posterior mean and variance. They are

\begin{align}
\hat{\mu} &= \hat{x}_0 \tag{85} \\
\hat{\sigma}^2 &= 1/(n_0 + n). \tag{86}
\end{align}

These are results that we asserted earlier.

When we haven't seen any data, our estimate of the mean is the prior mean. As we see more data, our estimate of the mean moves towards the sample mean.

Before seeing data, our uncertainty about the estimate is the prior variance. As we see more data, that uncertainty decreases.
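A short numerical illustration of this behavior (our own sketch, with a hypothetical true mean), showing the posterior mean and variance of Equations 85 and 86 as the number of observations grows:

```python
import numpy as np

rng = np.random.default_rng(3)
x0, n0, true_mu = 0.0, 5.0, 2.0

for n in (0, 5, 50, 500):
    x = rng.normal(true_mu, 1.0, size=n)
    xbar = x.mean() if n > 0 else 0.0
    post_mean = (n0 * x0 + n * xbar) / (n0 + n)   # Equation (85) via (83)
    post_var = 1.0 / (n0 + n)                     # Equation (86)
    print(n, round(post_mean, 3), round(post_var, 4))
# With no data the posterior mean is x0; as n grows it approaches the
# sample mean and the posterior variance shrinks.
```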
