Lecture 2. Exponential Families
Nan Ye
School of Mathematics and Physics, University of Queensland

Recall: definition of GLM
∙ A GLM has the following structure:
  (systematic) E(Y | x) = h(β⊤x).
  (random) Y | x follows an exponential family distribution.
∙ This is usually separated into three components:
  ∙ The linear predictor β⊤x.
  ∙ The response function h. People often specify the link function g = h⁻¹ instead.
  ∙ The exponential family for the conditional distribution of Y given x.

Recall: remarks on exponential families
∙ It is common: the normal, Bernoulli, Poisson, and Gamma distributions are all exponential families.
∙ It gives a well-defined model: its parameters are determined by the mean μ = E(Y | x).
∙ It leads to a unified treatment of many different models: linear regression, logistic regression, ...

This Lecture
∙ One-parameter exponential family: definition and examples.
∙ Maximum likelihood estimation: score equation, MLE as moment matching.

Exponential Families

One-parameter exponential family
An exponential family distribution has a PDF (or PMF for a discrete random variable) of the form
  f(y | θ, φ) = exp( (η(θ)T(y) − A(θ)) / b(φ) + c(y, φ) ).
∙ T(y) is called the natural statistic.
∙ η(θ) is called the natural parameter.
∙ φ is a nuisance parameter treated as known.
The range of y must be independent of θ.

Special forms
∙ A natural exponential family is one parametrized by η, that is, θ = η, and the PDF/PMF can be written as
  f(y | η, φ) = exp( (ηT(y) − A(η)) / b(φ) + c(y, φ) ).
∙ The nuisance parameter can be absent, in which case b and c have the form b(φ) = 1 and c(y, φ) = c(y).

General form
∙ In general, η(θ) and T(y) can be vector-valued functions, θ can be a vector, and y can be an arbitrary object (like strings or sequences).
∙ Of course, η(θ)T(y) then needs to be replaced by the inner product η(θ)⊤T(y).
∙ In practice, this general form has been heavily used in various domains such as computer vision and natural language processing.
∙ For this course, we focus on the single-parameter case, but the properties studied here can be easily generalized to the general form.

Examples

Example 1. Gaussian distribution (known σ²)
∙ The PDF is
  f(y | μ, σ) = (1/√(2πσ²)) exp(−(y − μ)²/(2σ²))
            = exp( −y²/(2σ²) + μy/σ² − μ²/(2σ²) − ln √(2πσ²) ).
∙ This is a natural exponential family with a nuisance parameter σ:
  ∙ η(μ) = μ.
  ∙ T(y) = y.
  ∙ A(μ) = μ²/2.
  ∙ b(σ) = σ².
  ∙ c(y, σ) = −y²/(2σ²) − ln √(2πσ²).
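
The decomposition above can be verified numerically: the direct density and the exponential-family form should agree at every point. A minimal sketch (function names are illustrative):

```python
import math

def gaussian_pdf(y, mu, sigma):
    # Direct form of the normal density with mean mu and s.d. sigma.
    return math.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def gaussian_expfam(y, mu, sigma):
    # Exponential-family form exp((eta*T(y) - A)/b + c) with
    # eta = mu, T(y) = y, A = mu^2/2, b = sigma^2,
    # c = -y^2/(2 sigma^2) - ln sqrt(2 pi sigma^2).
    eta, T, A, b = mu, y, mu ** 2 / 2, sigma ** 2
    c = -y ** 2 / (2 * sigma ** 2) - math.log(math.sqrt(2 * math.pi * sigma ** 2))
    return math.exp((eta * T - A) / b + c)

# The two forms agree for any y, mu, sigma:
assert abs(gaussian_pdf(1.3, 0.5, 2.0) - gaussian_expfam(1.3, 0.5, 2.0)) < 1e-12
```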

Example 2. Bernoulli distribution
∙ The PMF is
  f(y | p) = p^y (1 − p)^(1−y)
         = exp(y ln p + (1 − y) ln(1 − p))
         = exp( y ln(p/(1 − p)) + ln(1 − p) ).
∙ This is an exponential family without a nuisance parameter:
  ∙ η(p) = ln(p/(1 − p)).
  ∙ T(y) = y.
  ∙ A(p) = −ln(1 − p).
  ∙ c(y) = 0.
∙ The natural form is given by
  f(y | η) = e^(yη) / (1 + e^η).
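
As a quick sanity check, the natural form agrees with the original PMF once η = ln(p/(1 − p)); a small sketch (function names are illustrative):

```python
import math

def bernoulli_pmf(y, p):
    # Direct PMF: p^y (1 - p)^(1 - y) for y in {0, 1}.
    return p ** y * (1 - p) ** (1 - y)

def bernoulli_natural(y, eta):
    # Natural form f(y | eta) = e^(y eta) / (1 + e^eta).
    return math.exp(y * eta) / (1 + math.exp(eta))

p = 0.3
eta = math.log(p / (1 - p))  # the natural parameter (log-odds)
for y in (0, 1):
    assert abs(bernoulli_pmf(y, p) - bernoulli_natural(y, eta)) < 1e-12
```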

Example 3. Poisson distribution
∙ The PMF is
  f(y | λ) = e^(−λ) λ^y / y! = exp(y ln λ − λ − ln y!).
∙ This is an exponential family without a nuisance parameter:
  ∙ η(λ) = ln λ.
  ∙ T(y) = y.
  ∙ A(λ) = λ.
  ∙ c(y) = −ln y!.
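
The same kind of numerical check works here; a sketch (function names are illustrative, and ln y! is computed via the log-gamma function):

```python
import math

def poisson_pmf(y, lam):
    # Direct PMF: e^(-lam) lam^y / y!.
    return math.exp(-lam) * lam ** y / math.factorial(y)

def poisson_expfam(y, lam):
    # Exponential-family form: eta = ln(lam), T(y) = y, A = lam,
    # c(y) = -ln y! (computed as -lgamma(y + 1)).
    return math.exp(y * math.log(lam) - lam - math.lgamma(y + 1))

for y in range(6):
    assert abs(poisson_pmf(y, 2.5) - poisson_expfam(y, 2.5)) < 1e-12
```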

Example 4. Gamma distribution (fixed shape)
∙ The PDF, parametrized using the shape k and scale θ, is
  f(y | k, θ) = y^(k−1) e^(−y/θ) / (Γ(k) θ^k).
∙ An alternative parametrization, sometimes easier to work with, uses two parameters μ and ν such that μ is the mean and the variance is μ²/ν.
∙ In the (k, θ) parametrization, the mean and variance are kθ and kθ² respectively; thus k = ν and θ = μ/ν, and
  f(y | μ, ν) = y^(ν−1) ν^ν e^(−yν/μ) / (Γ(ν) μ^ν)
           = exp( −ν(y/μ + ln μ) + (ν − 1) ln y + ν ln ν − ln Γ(ν) ).
∙ This is an exponential family with a nuisance parameter ν:
  ∙ η(μ) = −1/μ (note the sign, needed for η(μ)y − A(μ) to match the exponent above).
  ∙ T(y) = y.
  ∙ A(μ) = ln μ.
  ∙ b(ν) = 1/ν.
  ∙ c(y, ν) = (ν − 1) ln y + ν ln ν − ln Γ(ν).
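
The (μ, ν) form can also be checked numerically, taking the natural parameter to be η(μ) = −1/μ so that the exponent matches; a sketch (function names are illustrative):

```python
import math

def gamma_pdf(y, mu, nu):
    # Gamma density with mean mu and variance mu^2/nu (shape nu, scale mu/nu).
    return y ** (nu - 1) * nu ** nu * math.exp(-y * nu / mu) / (math.gamma(nu) * mu ** nu)

def gamma_expfam(y, mu, nu):
    # Exponential-family form: eta = -1/mu, T(y) = y, A = ln mu, b = 1/nu,
    # c = (nu - 1) ln y + nu ln nu - ln Gamma(nu).
    eta, A, b = -1 / mu, math.log(mu), 1 / nu
    c = (nu - 1) * math.log(y) + nu * math.log(nu) - math.lgamma(nu)
    return math.exp((eta * y - A) / b + c)

assert abs(gamma_pdf(1.7, 2.0, 3.0) - gamma_expfam(1.7, 2.0, 3.0)) < 1e-12
```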

Example 5. Negative binomial (fixed r)
∙ The PMF of N(r, p) is
  f(y | p, r) = C(y + r − 1, y) (1 − p)^r p^y
           = exp( ln C(y + r − 1, y) + r ln(1 − p) + y ln p ),
  where C(n, k) denotes the binomial coefficient.
∙ This is an exponential family when r is fixed:
  ∙ η(p) = ln p.
  ∙ T(y) = y.
  ∙ A(p) = −r ln(1 − p).
  ∙ c(y) = ln C(y + r − 1, y).
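
Again the two forms can be compared numerically; a sketch (function names are illustrative):

```python
import math

def negbin_pmf(y, p, r):
    # Direct PMF: C(y + r - 1, y) (1 - p)^r p^y.
    return math.comb(y + r - 1, y) * (1 - p) ** r * p ** y

def negbin_expfam(y, p, r):
    # Exponential-family form: eta = ln p, T(y) = y, A = -r ln(1 - p),
    # c(y) = ln C(y + r - 1, y).
    return math.exp(y * math.log(p) + r * math.log(1 - p)
                    + math.log(math.comb(y + r - 1, y)))

for y in range(5):
    assert abs(negbin_pmf(y, 0.4, 3) - negbin_expfam(y, 0.4, 3)) < 1e-12
```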

MLE

Score equation
The MLE θ̂ is the solution to
  A′(θ)/η′(θ) = T̃,
where T̃ = (1/n) Σᵢ T(yᵢ).

Derivation
∙ The log-likelihood of θ given y₁, ..., yₙ is
  ℓ(θ) = Σᵢ [ (η(θ)T(yᵢ) − A(θ)) / b(φ) + c(yᵢ, φ) ].
∙ Its derivative is ℓ′(θ) = (n/b(φ)) (η′(θ)T̃ − A′(θ)).
∙ Set the derivative to 0 and rearrange to obtain A′(θ)/η′(θ) = T̃.
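
When the score equation has no closed-form solution, it can be solved numerically; a minimal bisection sketch (the helper `score_mle` is illustrative, and assumes A′/η′ is increasing on the bracket):

```python
def score_mle(dA, deta, t_bar, lo, hi, tol=1e-10):
    # Solve the score equation A'(theta)/eta'(theta) = t_bar by bisection,
    # assuming A'/eta' is increasing on (lo, hi).
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if dA(mid) / deta(mid) > t_bar:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

# Poisson: A(lam) = lam, eta(lam) = ln lam, so A'/eta' = lam
# and the MLE should come out as the sample mean.
ys = [3, 1, 4, 1, 5]
ybar = sum(ys) / len(ys)
lam_hat = score_mle(lambda l: 1.0, lambda l: 1 / l, ybar, 1e-6, 100.0)
assert abs(lam_hat - ybar) < 1e-6
```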

MLE as moment matching
∙ We will see that E(T | θ) = A′(θ)/η′(θ) in the next lecture.
∙ Thus the score equation is E(T | θ) = T̃.
∙ The MLE thus matches the first moment of the exponential family to that of the empirical distribution (i.e., it is a method of moments estimator).

Example. MLE for negative binomial distribution
∙ Recall: for fixed r, A(p) = −r ln(1 − p), η(p) = ln p.
∙ The score equation is A′(p)/η′(p) = rp/(1 − p) = ȳ = (1/n) Σᵢ yᵢ.
∙ Solving for p gives p̂ = ȳ/(ȳ + r).
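
The closed-form estimate can be checked directly against the log-likelihood: since ℓ(p) is concave in p here, p̂ should beat nearby values. A sketch with a made-up sample (names are illustrative):

```python
import math

def negbin_loglik(p, ys, r):
    # Log-likelihood of p for a negative binomial sample with fixed r.
    return sum(math.log(math.comb(y + r - 1, y))
               + r * math.log(1 - p) + y * math.log(p)
               for y in ys)

ys = [2, 0, 5, 3, 1, 4]          # a made-up sample
r = 3
ybar = sum(ys) / len(ys)
p_hat = ybar / (ybar + r)        # closed-form MLE from the score equation

# p_hat should beat nearby values of p:
for p in (p_hat - 0.05, p_hat + 0.05):
    assert negbin_loglik(p_hat, ys, r) >= negbin_loglik(p, ys, r)
```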

What You Need to Know
∙ Recognise whether a distribution is a one-parameter exponential family.
∙ The score equation for a one-parameter exponential family.
∙ The sufficiency of T̃ for determining the MLE.
∙ MLE as moment matching.