
1 One-parameter exponential families

The world of exponential families bridges the gap between the Gaussian family and general distributions. Many properties of Gaussians carry through to exponential families in a fairly precise sense.

• In the Gaussian world, there are exact small-sample distributional results (i.e. t, F, χ²).

• In the exponential family world, there are approximate distributional results (i.e. deviance tests).

• In the general setting, we can only appeal to asymptotics.

A one-parameter exponential family F is a one-parameter family of distributions of the form

P_η(dx) = exp(η · t(x) − Λ(η)) P_0(dx)

for some P_0. The parameter η is called the natural or canonical parameter, and Λ is called the cumulant generating function (CGF); it is simply the normalization needed to make

f_η(x) = (dP_η/dP_0)(x) = exp(η · t(x) − Λ(η))

a proper probability density. The random variable t(X) is the sufficient statistic of the exponential family. Note that P_0 does not have to be a distribution on R, but these are of course the simplest examples.

1.0.1 A first example: Gaussian with linear sufficient statistic

Consider the standard Gaussian

P_0(A) = ∫_A e^{−z²/2}/√(2π) dz

and let t(x) = x. Then, the exponential family is

P_η(dx) ∝ e^{η·x − x²/2}/√(2π) dx

and we see that Λ(η) = η²/2.

import numpy as np
import matplotlib.pyplot as plt

eta = np.linspace(-2, 2, 101)
CGF = eta**2 / 2.
plt.plot(eta, CGF)
A = plt.gca()
A.set_xlabel(r'$\eta$', size=20)
A.set_ylabel(r'$\Lambda(\eta)$', size=20)
f = plt.gcf()

Thus, the exponential family in this setting is the collection

F = {N(η, 1) : η ∈ R} .

1.0.2 Normal with quadratic sufficient statistic on R^d

As a second example, take P_0 = N(0, I_{d×d}), i.e. the standard normal distribution on R^d. As sufficient statistic, we take t(x) = ‖x‖₂²/2. Then, the exponential family is

P_η(dx) ∝ e^{η·‖x‖₂²/2 − ‖x‖₂²/2} dx

and we see that the family is only defined for η < 1. For η < 1,

Λ(η) = −(d/2) · log(1 − η).

We see that not all exponential families have all of R as their parameter domain. We might as well define Λ over all of R:

Λ(η) = −(d/2) · log(1 − η) for η < 1, and Λ(η) = ∞ for η ≥ 1.

The exponential family here is

F = { N(0_{d×1}, (1 − η)^{−1} · I_{d×d}) : η < 1 }.

eta = np.linspace(-3, 0.99, 101)
d = 3
CGF = -d * np.log(1 - eta) / 2.
plt.plot(eta, CGF)
A = plt.gca()
A.set_xlabel(r'$\eta$', size=20)
A.set_ylabel(r'$\Lambda(\eta)$', size=20)


1.0.3 Tilts of the triangular distribution

In the previous two examples, we could express Λ explicitly by simple integration. This is not always possible, though we can use the computer to do some calculations for us. Take P_0 to be the triangular distribution on (−1, 1) with sufficient statistic t(x) = x so that

P_η(dx) = exp(η · x − Λ(η)) P_0(dx)

with

Λ(η) = log( ∫_{−1}^{1} e^{ηx} P_0(dx) ).

X = np.linspace(-1, 1, 501)
dX = X[1] - X[0]

def tilted_density(eta):
    D = np.exp(eta * X) * np.minimum(1 + X, 1 - X)
    CGF = np.log((D * dX).sum())
    return D / np.exp(CGF)

[plt.plot(X, tilted_density(eta), label=r'$\eta=%d$' % eta) for eta in [0, 1, 2, 3]]
plt.gca().set_title('Tilts of the triangular distribution.')
plt.legend(loc='upper left')

1.0.4 Carrier measure

More generally, P_0 could be replaced by some measure m_0 that is not a probability measure. For example, take m_0 to be Lebesgue measure on R and t(x) = x²/2. Then, for all η < 0,

(dP_η/dm_0)(x) = e^{η·x²/2} / √(2π/(−η))

corresponds to a N(0, −η^{−1}) density. To find Λ(η), note that

e^{Λ(η)} = ∫_R e^{η·x²/2} dm_0(x) = √(2π/(−η)) for η < 0, and ∞ otherwise.

Therefore, for η < 0,

Λ(η) = −(1/2) · log(−η) + (1/2) · log(2π).

The exponential family is therefore

F = { N(0, −η^{−1}) : η < 0 }.
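As a quick numerical sanity check (a sketch, not from the original notes; it assumes scipy is available), we can compare Λ(η) computed by quadrature with the closed form at a single point:

import numpy as np
from scipy.integrate import quad

eta = -2.0
# e^{Lambda(eta)} is the integral of e^{eta x^2 / 2} over R
val, _ = quad(lambda x: np.exp(eta * x**2 / 2), -np.inf, np.inf)
print(np.log(val))                                # numerical Lambda(eta)
print(-np.log(-eta) / 2 + np.log(2 * np.pi) / 2)  # closed form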

1.1 Reparametrizing the family

Note that the exponential family is determined by the pair (t(X), m_0). The choice of m_0 is somewhat arbitrary. We could fix some η_0 and consider a new family with carrier measure P_{η_0} ∈ F:

F̃ = { P̃_η̃(dx) = exp(η̃ · t(x) − Λ̃(η̃)) P_{η_0}(dx) }.

But, a simple manipulation shows that

P_η(dx) = exp(η · t(x) − Λ(η)) m_0(dx)
        = exp((η − η_0) · t(x) − (Λ(η) − Λ(η_0))) P_{η_0}(dx).

This shows that there is a 1:1 correspondence between F and F̃. Namely,

F̃ ∋ P̃_η̃ ↦ P_{η̃+η_0} ∈ F,    Λ̃(η̃) = Λ(η̃ + η_0) − Λ(η_0).

1.2 Domain of an exponential family

In the examples above, we saw that not all values of η lead to a probability distribution, due to the sufficient statistic not being integrable with respect to m_0. The domain D(F) can be thought of as the set of all natural parameters which lead to a probability distribution. Formally, we define the domain as

D(F) = D((t(X), m_0)) = {η : Λ(η) < ∞}.

The domain is also defined relative to the carrier measure m0. As in the previous section on reparametrization, we see

D(F̃) = D((t(X), P_{η_0})) = {η̃ : η̃ + η_0 ∈ D(F)} = {η̃ : Λ(η̃ + η_0) − Λ(η_0) < ∞} = D(F) ⊕ (−η_0).

Hence, the domains of two exponential families with parametrizations determined by different canonical parameters are related by a simple translation.

1.2.1 Exercise: convexity of D(F)

1. Show that Λ is a (possibly infinite) convex function on R.

2. Use this to show that D(F) is convex, i.e. a (possibly infinite) interval.

1.2.2 Exercise: half-Gaussian density

Consider the half-Gaussian distribution with density

f(x) = 2e^{−x²/2}/√(2π),  x ≥ 0.

1. Use f as carrier measure to create an exponential family with sufficient statistic t(x) = −x.

2. Plot the density for η ∈ [0, 2, 4, 6].

3. What is D(F)?

4. What happens as η → ∞? What about η → −∞?

5. Can you renormalize the random variables with distributions in F to get a "nice" limit at either ±∞? That is, suppose Z_n ∼ P_{η_n} with η_n → ±∞. Can you define W_n = c_n(Z_n − µ_n) so that W_n converges in distribution?

1.3 Example: the Poisson family

An important example of a one-parameter family that we will revisit often is the Poisson family on the non-negative integers Z≥0 = {0, 1, 2, ...}. The carrier measure is

m_0(dx) = (1/x!) · m(dx)

with m the counting measure on Z≥0. Poisson random variables are usually parametrized by their expectation λ. This is different from the canonical parametrization. Let's write this parametrization as

Q_λ(dx) = (e^{−λ} λ^x / x!) · m(dx) = e^{−λ} λ^x · m_0(dx).

We see, then,

P_η(dx) = exp(η · x − Λ(η)) · m_0(dx).

The two parametrizations are related by

η = log(λ),    Λ(η) = e^η = λ.
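To check the relation Λ(η) = e^η numerically, we can sum the tilted carrier measure directly (a sketch; the choice η = 0.7 and the truncation at x = 200 are arbitrary, and gammaln from scipy is used for log x!):

import numpy as np
from scipy.special import gammaln

eta = 0.7
x = np.arange(200)
# e^{Lambda(eta)} = sum_x e^{eta x} / x!
log_terms = eta * x - gammaln(x + 1)
print(np.log(np.exp(log_terms).sum()))   # numerical Lambda(eta)
print(np.exp(eta))                       # closed form: e^eta = lambda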

1.3.1 Exercise: reparametrizing the Poisson family

Let F = (x, m_0) denote the Poisson family.

1. What is D(F)?

2. Rewrite the Poisson family P_η so that the carrier measure is a Poisson distribution with mean 2. Call the exponential family with this carrier measure F_2. What is D(F_2)?

3. Write the Poisson distribution with mean 6 as a point in D(F) and as a point in D(F_2). That is, in each case, find the canonical parameter such that the corresponding distribution is a Poisson with mean 6.

1.4 Expectation and variance

The function Λ is the cumulant generating function of the family, and differentiating it yields the cumulants of the sufficient statistic t(X). Specifically, if the carrier measure is a probability measure, Λ is the logarithm of the moment generating function of t(X) under P_0. More generally, if the carrier measure is not a probability measure but just a measure on some Ω, then for any η ∈ D(F),

E_η(e^{θ·t(X)}) = ∫_Ω e^{(θ+η)·t(x) − Λ(η)} m_0(dx) = e^{Λ(θ+η) − Λ(η)}.

Note that

e^{Λ(η)} = ∫_Ω e^{η·t(x)} m_0(dx).

Differentiating with respect to η yields

Λ̇(η) · e^{Λ(η)} = ∫_Ω t(x) e^{η·t(x)} m_0(dx) = e^{Λ(η)} · ∫_Ω t(x) P_η(dx) = e^{Λ(η)} · E_η(t(X)).

Differentiating a second time yields

(Λ̈(η) + Λ̇(η)²) · e^{Λ(η)} = ∫_Ω t(x)² e^{η·t(x)} m_0(dx) = e^{Λ(η)} · E_η(t(X)²).

Summarizing,

Λ̇(η) = E_η(t(X))
Λ̈(η) = E_η[(t(X) − E_η(t(X)))²] = Var_η(t(X)).

The above also motivates the definition of another space related to F, the set of realizable expected values

M(F) = { Λ̇(η) : η ∈ D(F) } = Λ̇(D(F)).
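We can confirm the identity Λ̇(η) = E_η(t(X)) numerically for the triangular family of Section 1.0.3, comparing a finite-difference derivative of Λ with the mean of the tilted density (a sketch; the grid, the value η = 1 and the step size are arbitrary choices):

import numpy as np

X = np.linspace(-1, 1, 2001)
dX = X[1] - X[0]
density = np.minimum(1 + X, 1 - X)   # triangular density on (-1, 1)

def Lambda(eta):
    return np.log((np.exp(eta * X) * density * dX).sum())

eta, h = 1.0, 1e-5
deriv = (Lambda(eta + h) - Lambda(eta - h)) / (2 * h)   # finite-difference CGF derivative
tilted = np.exp(eta * X) * density / np.exp(Lambda(eta))
mean = (X * tilted * dX).sum()                          # E_eta(t(X)) with t(x) = x
print(deriv, mean)   # these agree to several decimals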

1.4.1 Parametrization by the mean

The above calculation yields a parametrization

µ(η) = E_η(t(X)) = Λ̇(η).

As

dµ/dη = Λ̈(η) = Var_η(t(X)) ≥ 0,

the mapping is non-decreasing, and it is 1:1 (hence invertible) as long as the random variable t(X) is not constant under P_η. Further, as the moment generating function of t(X) under P_η is defined for all η ∈ D(F), this map is infinitely differentiable on D(F).

1.5 Skewness and kurtosis

We saw above that the moment generating function of t(X) under Pη can be expressed as

E_η(e^{θ·t(X)}) = e^{Λ(θ+η) − Λ(η)}.

Taking logs and expanding yields the cumulants of t(X) under P_η. That is,

Λ(θ + η) − Λ(η) = k_1 · θ + k_2 · θ²/2 + k_3 · θ³/6 + ...

where

k_i = Λ^{(i)}(η)

are the cumulants of t(X) under P_η. The skewness of a random variable Y is defined as

Skew(Y) = Skew(Y, L_Y) = E[(Y − E(Y))³] / Var(Y)^{3/2} ≜ γ = k_3 / k_2^{3/2}

and kurtosis is defined as

4 E[(Y − E(Y )) ] ∆ k4 Kurtosis(Y ) = Kurtosis(Y, LY ) = 2 − 3 = δ = 2 . Var(Y ) k2 For a one-parameter exponential family, we see that

γ(η) = Skew(t(X), P_η) = Λ^{(3)}(η) / Λ̈(η)^{3/2}

δ(η) = Kurtosis(t(X), P_η) = Λ^{(4)}(η) / Λ̈(η)².

1.5.1 Exercise: skewness and kurtosis of the Poisson family

1. Plot the skewness and kurtosis of the Poisson family as a function of η.

2. What happens as η → ∞? Is this expected?

1.6 Cumulants in the mean parametrization

Above, we see that the most natural parametrization of cumulants (and skewness, kurtosis) is in terms of the canonical parameter η. However, we now have an alternative parametrization in terms of the mean µ. Given a quantity like γ(η), we can reparametrize it as

γ̃(µ) = Skew(t(X), P_{η(µ)}) = γ(η(µ)) = Λ^{(3)}(η(µ)) / Var_{η(µ)}(t(X))^{3/2}.

This new parametrization yields some interesting relations.

1.6.1 Exercise: reparametrization of skewness

Show that

γ̃(µ) = 2 · (d/dµ) Var_{η(µ)}(t(X))^{1/2}.

1.6.2 Exercise: estimation of µ and η

Suppose X ∼ P_η. Then, from the calculations we have seen above,

E_η(t(X)) = Λ̇(η) = µ(η).

Can we find an unbiased estimate of η? For this exercise, assume Ω = R, t(x) = x, and that the carrier measure has a density g_0 with respect to Lebesgue measure, supported on [m, M] (with one or both of m, M possibly infinite). That is,

P_η(dx) = e^{η·x − Λ(η)} g_0(x) dx for m ≤ x ≤ M, and 0 otherwise,

with Λ(0) = 0. Define ℓ_0(x) = log(g_0(x)). Show that

E_η[ −(d/dx) ℓ_0(X) ] = η − [g_η(M) − g_η(m)],

where g_η is the density of P_η. Try this out numerically for the half-Gaussian family.

1.7 Repeated sampling

One of the very nice properties of exponential families is their behaviour under IID sampling. Specifically, let

X_1, ..., X_n ∼ P_η  (IID)

with

(dP_η/dm_0)(x) = exp(η · t(x) − Λ(η)).

The joint density has a very simple expression:

∏_{i=1}^n [exp(η · t(x_i) − Λ(η)) m_0(dx_i)] = exp(n · [η · t̄(X) − Λ(η)]) · ∏_{i=1}^n m_0(dx_i)

with

t̄(X) = (1/n) · ∑_{i=1}^n t(X_i).

This is a one-parameter exponential family with parameter η, sufficient statistic

n · t̄(X) = ∑_{i=1}^n t(X_i)

and carrier measure ∏_{i=1}^n m_0(dx_i) defined on Ω^n, where Ω is the sample space for each of the X_i's.
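The factorization can be checked directly: for an IID Poisson sample, the log-likelihood (up to the carrier term) depends on the data only through ∑_i t(X_i). A minimal sketch, with arbitrary values λ = 3 and n = 50:

import numpy as np

rng = np.random.default_rng(0)
x = rng.poisson(lam=3.0, size=50)
eta = np.log(3.0)                     # canonical parameter; Lambda(eta) = e^eta

# log-likelihood computed term by term...
ll_long = (eta * x - np.exp(eta)).sum()
# ...equals the expression through the sufficient statistic alone
n, t_bar = len(x), x.mean()
ll_short = n * (eta * t_bar - np.exp(eta))
print(ll_long, ll_short)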

1.7.1 Exercise: cumulants of sum

1. What is the cumulant generating function of the above exponential family? Call it Λn.

2. Relate the first 4 cumulants, as a function of η, of Λ_n to the mean, variance, skewness and kurtosis of ∑_{i=1}^n t(X_i).

3. Repeat 2. for the random variable t̄(X).

1.7.2 Exercise: sufficiency

For this question, assume m_0 = P_0 is a probability distribution.

1. Relate the one-parameter exponential family of distributions on Ω^n above to a one-parameter exponential family of distributions on R.

2. What is the carrier measure of this exponential family? What is its Λ, D(F)?

3. How does this relate to sufficiency?

1.8 Examples

1.8.1 Binomial: Bin(n, π)

Suppose that X ∼ Binomial(n, π). Then,

P(X = j) = C(n, j) · π^j (1 − π)^{n−j} = C(n, j) · exp( log(π/(1−π)) · j + n · log(1 − π) ).

This is a one-parameter family whose carrier measure we can take as having a density

m_0(j) = C(n, j) for 0 ≤ j ≤ n, and 0 otherwise,

with respect to counting measure on Z. The natural parameter is η = log(π/(1 − π)) with D = (−∞, ∞) and cumulant generating function

Λ(η) = n log(1 + eη).

We see that

Λ̇(η) = n·e^η / (1 + e^η) = nπ

and

Λ̈(η) = n · (e^η(1 + e^η) − e^{2η}) / (1 + e^η)² = n · π(1 − π).
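A quick finite-difference check of these derivatives (a sketch; the values n = 10 and π = 0.3 are arbitrary):

import numpy as np

n, pi = 10, 0.3
eta = np.log(pi / (1 - pi))

def Lambda(e):
    return n * np.log(1 + np.exp(e))

h = 1e-4
# first derivative vs n * pi
print((Lambda(eta + h) - Lambda(eta - h)) / (2 * h), n * pi)
# second derivative vs n * pi * (1 - pi)
print((Lambda(eta + h) - 2 * Lambda(eta) + Lambda(eta - h)) / h**2, n * pi * (1 - pi))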

1.8.2 Exercise: Gamma with fixed shape

Suppose we consider the Gamma family with fixed shape k and unknown scale. The Gamma density is

f_{λ,k}(x) = (1/(Γ(k)·λ^k)) · x^{k−1} e^{−x/λ},  x ≥ 0.

1. Write this as a one-parameter exponential family. What is the canonical parameter?

2. Compute the mean, variance, skewness and kurtosis of this family.

1.8.3 Exercise: Gamma with a fixed scale Suppose that, instead of fixing the shape of the Gamma family, we fix the scale at some value λ. Can you write this as a one-parameter exponential family?

1.8.4 Exercise: Negative binomial

The negative binomial arises when waiting for a fixed number, k, of failures in IID Bernoulli(π) trials. Specifically,

P(X = n) = C(n + k − 1, n) · (1 − π)^k π^n.

1. Write this as a one-parameter exponential family.

2. Compute the mean, variance and skewness as a function of π.

1.8.5 Inverse Gaussian

This distribution arises as the hitting time of a standard Brownian motion with drift, crossing the boundary 1. If the drift is 1/µ, then the density has the form

g_µ(x) = (1/√(2πx³)) · e^{−(x−µ)²/(2µ²x)}.

2 Basic results on exponential families

2.1 MLE

Recall the joint density under repeated IID sampling:

∏_{i=1}^n P_η(dx_i) = ∏_{i=1}^n [exp(η · t(x_i) − Λ(η)) m_0(dx_i)] = exp(n · [η · t̄(X) − Λ(η)]) · ∏_{i=1}^n m_0(dx_i).

From this, we see that the log-likelihood has a very compact expression:

ℓ(η) = log( ∏_{i=1}^n (dP_η/dm_0)(x_i) ) = n · [η · t̄(X) − Λ(η)].

The score function is defined as

ℓ̇(η) = (d/dη) ℓ(η).

2.2 The MLE map

The maximum likelihood estimator of the canonical parameter is

η̂ = argmax_{η∈D} ℓ(η),

which satisfies (assuming the maximum is achieved in the interior of D(F))

0 = ℓ̇(η̂) = n · [ t̄(X) − Λ̇(η̂) ].

In words, the MLE of η is chosen so that the expectation of the sufficient statistic under P_η̂ is the observed value t̄(X).

Solving the MLE equations therefore determines a map from M(F), the set of all possible mean values for F, to D(F), the canonical parameter space. This map is effectively the inverse of Λ̇. That is,

η̂(t̄(X)) = Λ̇^{−1}(t̄(X)) = argmax_η [ η · (1/n)·∑_{i=1}^n t(X_i) − Λ(η) ].

2.2.1 Fenchel-Legendre transform

Consider the maximized log-likelihood, as a function of t̄(X):

sup_{η∈D} (η · t − Λ(η)).

In convex analysis, this function is called the Fenchel-Legendre transform of Λ and is often denoted by Λ*. That is,

Λ*(t) = sup_{η∈D} (η · t − Λ(η)).

Another general fact from convex analysis says that

(d/dt) Λ*(t) = argmax_{η∈D} (η · t − Λ(η)).

Sometimes, the Fenchel-Legendre transform may fail to be differentiable. In this case, the argmax above is a set, called the subdifferential of Λ* at t. This would correspond to there being more than one MLE, which will not happen for one-parameter exponential families. In any case, Λ̇* provides the MLE map. That is,

η(µ) = Λ̇*(µ).

Another property of this map is

Λ*(µ) = η(µ) · µ − Λ(η(µ)).

In particular, this implies

Λ̇(Λ̇*(µ)) = Λ̇(η(µ)) = E_{η(µ)}[t(X)] = µ.

This implies that Λ̇ ∘ Λ̇* is the identity on M, and therefore Λ̇* ∘ Λ̇ is the identity on D. In other words, Λ̇^{−1} = Λ̇*.
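As an illustration (a sketch, not from the notes; it uses scipy's scalar minimizer), we can compute Λ* numerically for the Gaussian family of Section 1.0.1, where Λ(η) = η²/2 and hence Λ*(t) = t²/2 with maximizer η = t:

import numpy as np
from scipy.optimize import minimize_scalar

def Lambda(eta):
    return eta**2 / 2

def Lambda_star(t):
    # sup_eta (eta * t - Lambda(eta)), computed by minimizing the negative
    res = minimize_scalar(lambda eta: -(eta * t - Lambda(eta)))
    return -res.fun, res.x   # (Lambda^*(t), maximizing eta)

print(Lambda_star(1.7))   # approximately (1.7**2 / 2, 1.7)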

2.2.2 Likelihood as a function of µ

Alternatively, we might try computing the MLE in the mean parametrization. In this case, we write the likelihood as

ℓ̃(µ) = ℓ(Λ̇*(µ)).

Differentiating,

(d/dµ) ℓ̃(µ) = n · (d/dη)[η · t̄(X) − Λ(η)] · (dη/dµ)
            = n · (t̄(X) − µ) / Var_{η(µ)}(t(X))
            = n · Λ̈*(µ) · [ t̄(X) − µ ],

which shows that µ̂ = t̄(X).

2.2.3 Exercise: density in M parametrization

1. Show that

(dP_{η(µ)}/dP_0)(x) = exp( Λ*(µ) − (µ − t(x)) · Λ̇*(µ) ).

2. Rederive the score equation for µ,

n · Λ̈*(µ̂) · [ t̄(X) − µ̂ ] = 0,

directly with this formula.

2.2.4 Exercise: computing Λ*

1. Compute Λ* for the Poisson family.

2. Knowing the relationship between µ and λ implied by µ(η) = Λ̇(η), show that η̂ computed with Λ̇* agrees with the usual MLE rule by plugging µ̂ into this relationship.

2.2.5 Score for arbitrary parameters

In general, the score function for some (invertible) function of η, i.e. ξ = h(η), is

(d/dξ) ℓ(h^{−1}(ξ)) = [ ℓ̇(η) / ḣ(η) ] evaluated at η = h^{−1}(ξ).

One key property of the score function is

E_η[ (d/dζ) ℓ(ζ) |_{ζ=η} ] = n · [ E_η(t(X)) − Λ̇(η) ] = n · [µ(η) − µ(η)] = 0.

This is also true for the score of any ξ = h(η).

2.3 Fisher information

The Fisher information in (X_1, ..., X_n) for ξ = h(η) at η ∈ D is given by Var_η( (d/dξ) ℓ(h^{−1}(ξ)) ). That is,

I_η^{(n)}(h(η)) = E_η[ ( (d/dξ) ℓ(h^{−1}(ξ)) |_{ξ=h(η)} )² ] = I_η^{(n)} / ḣ(η)².

Above,

I_η^{(n)} = I_η^{(n)}(η) = n · E_η[(t(X) − Λ̇(η))²] = n · Var_η(t(X)),

where Var_η(t(X)) is the Fisher information for η in one observation. On reinspection of the log-likelihood ℓ(η), we see that

−ℓ̈(η) = n · Var_η(t(X)).

So, the second derivative of the log-likelihood is in fact not random, and we can write

I_η^{(n)} = −ℓ̈(η).
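This non-randomness is easy to see numerically in the Poisson family, where Λ̈(η) = e^η: the data enter ℓ only through the linear term, so ℓ̈ does not depend on the sample at all (a sketch; λ = 4 and n = 25 are arbitrary):

import numpy as np

rng = np.random.default_rng(1)
x = rng.poisson(lam=4.0, size=25)
n = len(x)

def loglik(eta):
    # the data enter only through the linear term eta * x.mean()
    return n * (eta * x.mean() - np.exp(eta))

eta, h = np.log(4.0), 1e-4
ell_ddot = (loglik(eta + h) - 2 * loglik(eta) + loglik(eta - h)) / h**2
print(-ell_ddot, n * np.exp(eta))   # both equal n * Var_eta(t(X)) = n * lambda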

2.3.1 Exercise: Fisher information for the mean

Compute the Fisher information for the mean parameter in an IID sample (X1,...,Xn).

2.3.2 Exercise: Fisher information as a pullback

Consider the map

Ψ: R ∋ η ↦ (dP_η/dm_0)^{1/2} = exp( (1/2) · (η · t(x) − Λ(η)) ) ∈ L²(Ω, m_0).

1. Show that the derivative of this map is

∂Ψ/∂η = (1/2) · (t(x) − Λ̇(η)) · exp( (1/2) · (η · t(x) − Λ(η)) ) ∈ L²(Ω, m_0).

2. Show that

I_η = 4 · ‖∂Ψ/∂η‖²_{L²(Ω, m_0)}.

Geometrically, this says that the Fisher information is the pull-back of the inner product determined by Hellinger distance. In other words, the length of a curve

∫_a^b √(I_η) dη

in a one-parameter exponential family is, up to the constant factor 2, equal to its arclength in the Hilbert space structure induced by Hellinger distance.

2.3.3 Cramer-Rao lower bound

The Cramer-Rao lower bound for an unbiased estimator ξ̂ of ξ = h(η), based on an IID sample (X_1, ..., X_n) from P_η, is

Var_η(ξ̂) ≥ 1 / I_η^{(n)}(h(η)) = ḣ(η)² / (n · Var_η(t(X))).

Applying this to µ = Λ̇(η) yields

Var_η(µ̂) ≥ Λ̈(η)² / (n · Var_η(t(X))) = Var_η(t(X)) / n,

and µ̂ = t̄(X) achieves the Cramer-Rao lower bound. This happens for linear functions of µ, but generally not for η. The Cramer-Rao bound applies to unbiased estimators. In general, the MLE ξ̂ = h(η̂) is not unbiased and the bias should be included. Nevertheless, the delta rule approximation

Var(ξ̂) ≈ ḣ(η)² / (n · Var_η(t(X)))

is usually reasonable. In practice, of course, we don't know η, so we must use

Var̂(ξ̂) ≈ ḣ(η̂)² / (n · Var_{η̂}(t(X))).
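For example (a sketch, not from the notes): in the Poisson family with ξ = λ = h(η) = e^η, we have ḣ(η) = e^η and Var_η(t(X)) = e^η, so the plug-in delta rule reduces to λ̂/n:

import numpy as np

rng = np.random.default_rng(2)
x = rng.poisson(lam=5.0, size=100)
n = len(x)

lam_hat = x.mean()            # MLE of lambda
eta_hat = np.log(lam_hat)     # MLE of the canonical parameter
# hdot(eta)^2 / (n * Var_eta(t(X))) evaluated at eta_hat
var_hat = np.exp(eta_hat)**2 / (n * np.exp(eta_hat))
print(var_hat, lam_hat / n)   # identical: the delta rule reduces to lambda_hat / n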

2.3.4 Deviance

The deviance (also known as twice the Kullback-Leibler (KL) divergence) between probability measures is defined as

D(P_1; P_2) = 2 · E_{P_1}[ log(dP_1/dP_2) ] if P_1 ≪ P_2, and ∞ otherwise.

This notation is slightly different than the usual notation for KL divergence:

D_KL(P ‖ Q) = E_P[ log(dP/dQ) ].

I have used the same indexing as Brad Efron's notes: the first parameter is the one the expectation is computed with respect to.

2.3.5 Exercise: deviance in exponential families

1. Show that, in a one-parameter exponential family F, for any η1, η2 ∈ D(F)

D(η_1; η_2) ≜ D(P_{η_1}; P_{η_2}) = 2 · [ Λ(η_2) − Λ(η_1) + (η_1 − η_2) · Λ̇(η_1) ].

2. Give a direct argument based on convexity that shows that D(η1; η2) ≥ 0.

2.3.6 Convexity picture

The form of the deviance in one-parameter exponential families shows that it is in fact a remainder in a first-order Taylor expansion:

Λ(η_2) = Λ(η_1) + (η_2 − η_1) · Λ̇(η_1) + R(η_1; η_2)

with

R(η_1; η_2) = (1/2) · D(η_1; η_2) = (1/2) · Λ̈(θ) · (η_2 − η_1)²

for some θ = θ(η_1, η_2) ∈ [η_1, η_2]. Let's make a simple exponential family, which we suppose has

Λ(η) = 5η² − log(η),    Λ̇(η) = 10η − η^{−1}.

# Two points in the domain

eta1 = 1.5
d_eta1 = 1
eta2 = eta1 + d_eta1

# Our CGF

def CGF(eta):
    return 5 * eta**2 - np.log(eta)

# Derivative of the CGF

def dotCGF(eta):
    return 10 * eta - 1 / eta

Here is our deviance function on the η scale

# Deviance on the natural parameter scale

def deviance(eta2, eta1=eta1):
    return 2 * (CGF(eta2) - CGF(eta1) + (eta1 - eta2) * dotCGF(eta1))

deviance(eta2, eta1)

10.311682085801348

We can plot half the deviance, D(η_1; η_2)/2, as the difference at η_2 between the true value Λ(η_2) and the tangent approximation to the graph of Λ at η_1.

# Plotting points

eta = np.linspace(eta1 - d_eta1, eta1 + 1.5 * d_eta1, 101)
plt.figure(figsize=(8, 8))

# Plot the CGF

plt.plot(eta, CGF(eta), label=r'$\Lambda(\eta)$', linewidth=4)

# First order Taylor approximation at eta1

plt.plot(eta, CGF(eta1) + (eta - eta1) * dotCGF(eta1),
         label=r'$\Lambda(\eta_1)+\dot{\Lambda}(\eta_1)(\eta-\eta_1)$', linewidth=4)

# Difference between CGF and approximation: half the deviance

plt.plot([eta2, eta2], [CGF(eta2), CGF(eta1) + d_eta1 * dotCGF(eta1)],
         label=r'$D(\eta_1;\eta_2)/2$', linewidth=4)

# Markers for where the two points are

plt.plot([eta1, eta1], [0, CGF(eta1)], linestyle='--', color='gray', linewidth=4)
plt.plot([eta2, eta2], [0, CGF(eta2) - deviance(eta2, eta1) / 2], linestyle='--',
         color='gray', linewidth=4)

# Labelling

a = plt.gca()
a.set_xticks([eta1, eta2])
a.set_xticklabels([r'$\eta_1$', r'$\eta_2$'], size=15)
a.set_xlim(sorted([eta1 - .8 * d_eta1, eta1 + 1.2 * d_eta1]))
a.set_ylim([0, 40])
a.set_xlabel(r'$\eta\in{\cal D}$', size=20)
a.set_ylabel(r'$\Lambda(\eta)$', size=20)

# Add a legend and title

plt.legend(loc='upper left')
a.set_title(r'Convexity picture on ${\cal D}$ at $(\eta_1,\eta_2)=(%0.1f,%0.1f)$' % (eta1, eta2),
            size=20)
f = plt.gcf()
plt.close()

Finally, here is our rendered figure.

f

2.3.7 Exercise: convexity picture for Poisson family

1. Plot the convexity picture for the Poisson family with η1 corresponding to a mean of 10 and η2 to a mean of 20.

2.3.8 Hoeffding's formula

Let η̂(t(x)) = Λ̇*(t(x)) denote the MLE of η having observed t(x) as sufficient statistic. Then, using the identity

t(x) = Λ̇(η̂(t(x))),

we arrive at

(dP_η/dP_η̂)(x) = e^{−D(η̂(t(x)); η)/2}.

This leads to a scaled version of the likelihood having maximum value 1:

L(η) = exp(−D(η̂(t(X)); η)/2) = exp(η · t(X) − Λ(η) − Λ*(t(X))).

In the M parametrization, this reads as

L̃(µ) = L(η(µ)) = exp(−D(η(µ̂); η(µ))/2) = exp(−D̃(t(X); µ)/2).

The other point we see here is that maximum likelihood estimation is the same as minimum deviance or minimum KL estimation. That is,

argmax_η [ η · t(X) − Λ(η) ] = argmin_η D(η̂(t(X)); η).
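A small check of the scaled likelihood (a sketch using the Gaussian family of Section 1.0.1, where Λ(η) = η²/2 and Λ*(t) = t²/2, so η̂(t) = t): L(η) equals 1 at the MLE and is strictly smaller elsewhere.

import numpy as np

t = 1.3                           # observed sufficient statistic (arbitrary)
Lambda = lambda eta: eta**2 / 2
Lambda_star = lambda s: s**2 / 2

def L(eta):
    return np.exp(eta * t - Lambda(eta) - Lambda_star(t))

print(L(t))                    # exactly 1 at eta_hat = t
print(L(t + 0.5), L(t - 1.0))  # strictly less than 1 away from the MLE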

2.3.9 Exercise: the Normal deviance

1. When the family is the Normal family with unknown mean and known variance σ², show that

η(µ) = Λ̇*(µ) = µ/σ².

2. Conclude that

D̃(t(X); µ) = (t(X) − µ)² / σ².

2.3.10 Exercise: familiar deviances

In both the M and D parametrizations, compute the deviances for the following families:

1. Poisson with mean parameter µ ∈ M.

2. Binomial with n trials having probability of success π and mean n · π ∈ M (n is fixed).

3. Gamma with fixed shape parameter k and mean k · µ ∈ M.

2.3.11 Deviance under IID sampling

Suppose we observe n IID samples from a one-parameter exponential family F. Then, the total deviance is

D^{(n)}(η_1; η_2) = n · D(η_1; η_2).

2.3.12 Deviance and Fisher information

So, as with Fisher information, the deviance grows with n. The first and second order versions of Taylor's theorem imply

Λ(η_2) = Λ(η_1) + Λ̇(η_1) · (η_2 − η_1) + (1/2) · D(η_1; η_2)
       = Λ(η_1) + Λ̇(η_1) · (η_2 − η_1) + (1/2) · Λ̈(η_1) · (η_2 − η_1)² + Λ^{(3)}(θ) · (η_2 − η_1)³/6

for some θ ∈ [η_1, η_2]. We see, then, that

| D(η_1; η_2) − I_{η_1} · (η_1 − η_2)² | ≤ C(η_1; |η_1 − η_2|) · |η_1 − η_2|³ / 3

where

C(η; r) = sup_{θ: |θ−η| ≤ r} |Λ̈(θ) − Λ̈(η)| / |θ − η| = sup_{θ: |θ−η| ≤ r} |I_θ − I_η| / |θ − η|

is a local Lipschitz constant or modulus of continuity for I at η.

2.3.13 Exercise: dual version of convexity picture In this exercise, we parameterize the deviance by µ instead of η.

1. Show that

Λ*(µ) = Λ̇*(µ) · µ − Λ(Λ̇*(µ)).

2. Verify Λ∗ in the code below.

3. Use this to verify the convexity picture below for

D̃(µ_1; µ_2) ≜ D(η(µ_1); η(µ_2))

as above in the µ parametrization.

4. From the picture below, give an explicit formula for D̃(µ_1; µ_2).

Based on our previous exponential family, here are the conjugate and its derivative. Note that we only need to compute the MLE map to compute Λ*.

# Based on the previous family, the MLE map

def dotCGFstar(mu):
    return (mu + np.sqrt(mu**2 + 40)) / 20.

# A general formula for CGF^*

def CGFstar(mu):
    return dotCGFstar(mu) * mu - CGF(dotCGFstar(mu))

# Find the two corresponding points from the natural parameter scale
mu1 = dotCGF(eta1)
mu2 = dotCGF(eta2)
d_mu2 = mu1 - mu2
mu = np.linspace(mu2 - d_mu2, mu2 + 1.5 * d_mu2, 101)
plt.figure(figsize=(8, 8))

# Plot CGF^*
plt.plot(mu, CGFstar(mu), label=r'$\Lambda^*(\mu)$', linewidth=4)

# The first order Taylor approximation
plt.plot(mu, CGFstar(mu2) + (mu - mu2) * dotCGFstar(mu2),
         label=r'$\Lambda^*(\mu_2)+\dot{\Lambda}^*(\mu_2)(\mu-\mu_2)$', linewidth=4)

# The difference between CGF^* and the Taylor approximation at mu1
plt.plot([mu1, mu1], [CGFstar(mu1), CGFstar(mu2) + d_mu2 * dotCGFstar(mu2)],
         label=r'$\tilde{D}(\mu_1;\mu_2)/2$', linewidth=4)

# Mark where the points are
plt.plot([mu1, mu1], [0, CGFstar(mu1) - deviance(eta1, eta2) / 2], linestyle='--',
         color='gray', linewidth=4)
plt.plot([mu2, mu2], [0, CGFstar(mu2)], linestyle='--', color='gray', linewidth=4)

# Labelling
a = plt.gca()
a.set_xticks([mu1, mu2])
a.set_xticklabels([r'$\mu_1$', r'$\mu_2$'], size=15)
a.set_xlim(sorted([mu2 - 1.2 * d_mu2, mu2 + 1.2 * d_mu2]))
a.set_xlabel(r'$\mu\in{\cal M}$', size=20)
a.set_ylabel(r'$\Lambda^*(\mu)$', size=20)
plt.legend(loc='upper left')

# Add a legend and title
a.set_title(r"""Convexity picture on ${\cal M}$ at
$(\mu_1,\mu_2)=(\dot{\Lambda}^*(%0.1f),\dot{\Lambda}^*(%0.1f))\approx(%0.2f,%0.2f)$"""
            % (eta1, eta2, mu1, mu2), size=20)
f = plt.gcf()
plt.close()

Here is our rendered figure.

f

2.4 Deviance residuals

For the normal family, you showed in your homework that, in the mean parametrization, the deviance has the form

D̃(µ̂; µ) = (µ̂ − µ)² / σ².

Hence, it is like a normalized squared residual. To recover the original residual, one would compute

r(µ̂; µ) = √(D̃(µ̂; µ)) · sign(µ̂ − µ).

This is the general form of a deviance residual. In the repeated sampling setting, we observe a sample of deviance residuals. Namely,

r_i = r(t̄(X); t(X_i)).

There is also a deviance residual for t̄(X) with respect to µ:

R_D = sign(t̄(X) − µ) · √(D̃_n(t̄(X); µ)) = sign(t̄(X) − µ) · √(n · D̃(t̄(X); µ)).

We might compare this with the Pearson residual

R_P = (t̄(X) − µ) / √(Var_η(t(X)) / n).

Below are comparisons of the two residuals for a sample of size 10 from Poisson(1). Generally speaking, the deviance residuals are closer to normal than Pearson residuals.

%%R
mu = 10
nsim = 10000
Y = rpois(nsim, mu)

dev.resid = sign(Y - mu) * sqrt(2 * (Y * log(Y / mu) - (Y - mu)))
qqnorm(dev.resid)
abline(0, 1)

%%R
pearson.resid = (Y - mu) / sqrt(mu)
qqnorm(pearson.resid)
abline(0, 1)

Of course, for the Poisson the residuals can't be exactly normal: there is some lattice effect. We do see that the deviance residuals are closer to the diagonal than the Pearson residuals in the tails, though in the center, the approximation looks reasonable.

2.4.1 Bias and variance correction of deviance residuals

(From Appendix C of McCullagh and Nelder.) The deviance residual of t̄(X) for µ is asymptotically

N(B_n, V_n)

with

B_n = −ρ_3/6
V_n = 1 + (7/36) · ρ_3² − ρ_4/8

where

ρ_3 = Skew(t̄(X), P_η),    ρ_4 = Kurtosis(t̄(X), P_η).

Hence, families whose sufficient statistics are highly skewed will have poorer approximations by the normal distribution.

The corresponding result for the deviance itself is

D_n(t̄(X); µ) ≈ (1 + (5ρ_3² − 3ρ_4)/12) · χ²_1.

These results extend to multiparameter versions as well. Here is a brief sketch of how you might begin to prove some of these asymptotic bias formulae for Dn. The convexity picture tells us that the deviance is a residual in a Taylor series expansion. This residual has an exact form in terms of an infinite Taylor series

(1/2) · D_n(t̄(X); µ) = (1/2) · Λ̈_n*(µ) · (t̄(X) − µ)² + ∑_{j=3}^∞ [ (Λ_n*)^{(j)}(µ) / j! ] · (t̄(X) − µ)^j

where Λ_n* is the Fenchel-Legendre transform of the CGF of t̄(X). The first term is recognized as half the square of a random variable with mean 0 and variance 1. The remaining terms can be analyzed to approximate the bias and variance for a fixed n.
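We can probe the scaled χ² approximation by simulation. For a single Poisson(µ) observation all cumulants equal µ, so ρ_3² = 1/µ and ρ_4 = 1/µ, and the factor becomes 1 + (5/µ − 3/µ)/12 = 1 + 1/(6µ). A sketch (the value µ = 10 and the simulation size are arbitrary):

import numpy as np

rng = np.random.default_rng(3)
mu, nsim = 10.0, 200000
Y = rng.poisson(mu, size=nsim).astype(float)

# Poisson deviance: 2 * (Y * log(Y / mu) - (Y - mu)), with 0 * log(0) = 0
with np.errstate(divide='ignore', invalid='ignore'):
    term = np.where(Y > 0, Y * np.log(Y / mu), 0.0)
D = 2 * (term - (Y - mu))

print(D.mean())          # simulated E[D]
print(1 + 1 / (6 * mu))  # (1 + (5 rho3^2 - 3 rho4) / 12) * E[chi^2_1]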

2.4.2 Exercise

In this exercise, we use the Gamma exponential family with fixed shape parameter set to 1. That is, t(x) = x and

P_η(dx) = e^{η·x − Λ(η)} dx,  x ≥ 0.

1. Compute the skewness and kurtosis of t̄(X) = X̄_n for an IID sample of size n.

2. Set η = −1 and simulate the deviance residual for X̄_n with respect to η. Compare this distribution to N(B_n, V_n) for various values of n.

3. Try to choose the sample size n so that B_n is approximately 1/100.

2.5 References

• Efron, B. Defining the Curvature of a Statistical Problem (with Applications to Second Order Efficiency)

• Efron, B. The geometry of exponential families

• McCullagh, P. and Nelder, J. Generalized Linear Models, Appendix C
