1 Multiparameter exponential families

1.1 General definitions

Not surprisingly, a multi-parameter exponential family $\mathcal{F}$ is a family of distributions of the form
$$P_\eta(dx) = \exp\left(\eta^T t(x) - \Lambda(\eta)\right) m_0(dx), \qquad \eta \in \mathbb{R}^p,$$
for some reference measure $m_0$ on $\Omega$. Nothing really changes except that $\eta \cdot t(x)$ has become $\eta^T t(x)$. This indicates that there is a vector $\eta = (\eta_1, \ldots, \eta_p)$ of natural parameters as well as a vector $t(x) = (t_1(x), \ldots, t_p(x))$ of sufficient statistics. Again, $\Lambda: \mathbb{R}^p \to \mathbb{R}$ is the normalizing constant needed to make $P_\eta$ a proper distribution:
$$e^{\Lambda(\eta)} = \int_\Omega e^{\eta^T t(x)} m_0(dx).$$
The domain is $\mathcal{D}(\mathcal{F}) = \{\eta : \Lambda(\eta) < \infty\}$.

1.1.1 Exercise: normal family

1. Show that the family of distributions

$$P_\eta = \mathcal{L}\left((X_1, \ldots, X_N) \overset{\text{IID}}{\sim} N(\mu, \sigma^2)\right)$$

can be written as a 2-dimensional exponential family with sufficient statistic

$$t(x_1, \ldots, x_N) = \left(\frac{1}{N}\sum_{i=1}^N x_i, \; \frac{1}{N}\sum_{i=1}^N x_i^2\right).$$

2. What is the carrier measure?

In this family all of the information about $(\mu, \sigma^2)$ is contained in a 2-dimensional statistic. This, of course, is the basis of sufficiency.

In general, if we observe $(X_1, \ldots, X_n) \overset{\text{IID}}{\sim} P_\eta$ there is a corresponding $p$-dimensional exponential family of distributions on $\mathbb{R}^p$ which we might denote by
$$\tilde{P}_\eta(A) \overset{\Delta}{=} P_\eta(t(X) \in A),$$

which is just the push-forward of $P_\eta$ under $t: \Omega \to \mathbb{R}^p$.

1.1.2 Exercise: the pushed-forward family

1. What is the sufficient statistic of the family $\tilde{\mathcal{F}} = \{\tilde{P}_\eta\}$?

2. What is the reference measure?

3. What is the domain of $\tilde{\mathcal{F}}$?

1.2 Expectation and variance

As in the case of a one-parameter family, the expected value of the sufficient statistic is determined by the CGF, $\Lambda$. That is,
$$E_\eta(t(X)) = \nabla\Lambda(\eta).$$
This motivates the definition of the mean parameter space
$$\mathcal{M}(\mathcal{F}) = \text{conv}\left(\{t(x) : x \in \Omega\}\right) = \text{conv}(t(\Omega))$$
where $\text{conv}(S)$ is the convex hull of $S$. (Brad Efron's notes and papers typically refer to $\Lambda(\mathcal{D})$ as $B$ and $\mathcal{D}$ as $A$, notation we may come back to at some point.) Unlike in the one-parameter case, it is possible that

$$\nabla\Lambda(\mathcal{D}) \subsetneq \mathcal{M}(\mathcal{F}).$$
See Geometry of Exponential Families for an example, albeit a slightly unnatural one. I say unnatural because the construction uses a carrier measure that is singular (in a certain sense). As in the one-parameter case, variances are also obtained from $\Lambda$ via
$$\text{Var}_\eta(t(X)) = \nabla^2\Lambda(\eta).$$

1.2.1 Exercise: convexity

1. Prove that $\mathcal{D}$ is convex.

2. Show that $\nabla\Lambda(\mathcal{D}) \subset \mathcal{M}$.

1.3 The univariate normal family

Consider the family
$$P_\eta(dx) = e^{\eta_1 \cdot x - \eta_2 \cdot x^2/2 - \Lambda(\eta_1, \eta_2)}\, dx.$$
The sufficient statistics here are $t(x) = (x, -x^2/2)$. We assume that we have observed $N$ IID samples so the family is
$$\mathcal{F} = \left\{\prod_{i=1}^N P_\eta(dx_i) = e^{N \cdot [\eta^T t(x) - \Lambda(\eta_1, \eta_2)]} \prod_{i=1}^N dx_i\right\}$$
with
$$t(x) = t(x_1, \ldots, x_N) = \left(\frac{1}{N}\sum_{i=1}^N x_i, \; -\frac{1}{2N}\sum_{i=1}^N x_i^2\right).$$
We see that this is a parameterization of the univariate normal family with

$$e^{\Lambda(\eta_1, \eta_2)} = \int_{\mathbb{R}} e^{\eta_1 x - \eta_2 x^2/2}\, dx = \sqrt{2\pi/\eta_2} \cdot e^{\eta_1^2/2\eta_2}$$
so that
$$\Lambda(\eta_1, \eta_2) = \frac{1}{2}\left[\log(2\pi) - \log(\eta_2) + \frac{\eta_1^2}{\eta_2}\right].$$
The domain is the upper half plane

$$\mathcal{D} = \{(\eta_1, \eta_2) : \eta_2 > 0\}.$$

Means and variances for the normal family

Straightforward calculations yield

$$\nabla\Lambda(\eta) = \left(\frac{\eta_1}{\eta_2}, \; -\frac{1}{2}\left[\frac{1}{\eta_2} + \frac{\eta_1^2}{\eta_2^2}\right]\right)$$
$$\nabla^2\Lambda(\eta) = \begin{pmatrix} \dfrac{1}{\eta_2} & -\dfrac{\eta_1}{\eta_2^2} \\[2mm] -\dfrac{\eta_1}{\eta_2^2} & \dfrac{1}{2\eta_2^2} + \dfrac{\eta_1^2}{\eta_2^3} \end{pmatrix}$$
Here is an example of the family based on a sample of size 10.

import numpy as np
import matplotlib.pyplot as plt

C = np.log(2*np.pi)

def CGFnormal(eta, N=10):
    r"""
    The univariate normal CGF for a sample of size N

    .. math::

       \frac{1}{2}\left[\log(2\pi)-\log(\eta_2)+\frac{\eta_1^2}{\eta_2}\right]
    """
    eta = np.asarray(eta)
    return N * (C - np.log(eta[1]) + eta[0]**2 / eta[1]) / 2.

def dotCGFnormal(eta, N=10):
    """
    The gradient of the univariate normal CGF for a sample of size N
    """
    eta = np.asarray(eta)
    return N * np.array([eta[0] / eta[1],
                         (-eta[0]**2 / eta[1]**2 - 1. / eta[1]) / 2.])

def ddotCGFnormal(eta, N=10):
    """
    The Hessian of the univariate normal CGF for a sample of size N
    """
    eta = np.asarray(eta)
    H = np.zeros((2, 2))
    H[0,0] = 1. / eta[1]
    H[0,1] = H[1,0] = -eta[0] / eta[1]**2
    H[1,1] = 0.5 / eta[1]**2 + eta[0]**2 / eta[1]**3
    return N * H
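As a quick Monte Carlo sanity check (a sketch of my own, with an arbitrarily chosen $\eta$): for a single observation, the gradient of the CGF should match an empirical estimate of $E_\eta[t(X)]$, where $t(x) = (x, -x^2/2)$ and $X \sim N(\eta_1/\eta_2, 1/\eta_2)$.

# sanity check: for N=1, dotCGFnormal(eta) should equal E_eta[t(X)]
eta_check = np.array([1., 2.])
mu, sigma = eta_check[0] / eta_check[1], 1. / np.sqrt(eta_check[1])
X = mu + sigma * np.random.standard_normal(50000)
print np.array([X.mean(), -(X**2).mean() / 2.])   # Monte Carlo estimate of E[t(X)]
print dotCGFnormal(eta_check, N=1)                # should be close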

We can wrap this in a class, to make it more formal.

class normal(object):
    """
    The univariate normal family for a sample of size N
    """
    def __init__(self, N):
        self.N = N

    def value(self, eta):
        return CGFnormal(eta, self.N)

    def grad(self, eta, v=None):
        """
        The gradient at eta, in the direction v if v is not None
        """
        g = dotCGFnormal(eta, self.N)
        if v is not None:
            return (v*g).sum()
        return g

    def hess(self, eta, vw=None):
        """
        The Hessian at eta, in the directions v, w = vw if vw is not None
        """
        h = ddotCGFnormal(eta, self.N)
        if vw is not None:
            v, w = vw
            return (v * np.dot(h, w)).sum()
        return h
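For instance, a small usage sketch (values chosen arbitrarily): the class returns the CGF, a directional derivative $v^T\nabla\Lambda(\eta)$, and a directional second derivative $v^T\nabla^2\Lambda(\eta)v$.

family10 = normal(10)
eta_pt = np.array([1., 2.])
v_dir = np.array([1., 0.])
print family10.value(eta_pt)                    # Lambda for a sample of size 10
print family10.grad(eta_pt, v=v_dir)            # v^T grad Lambda(eta)
print family10.hess(eta_pt, vw=(v_dir, v_dir))  # v^T Hessian v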

While $\Lambda$ is convex and $\nabla\Lambda$ maps natural parameters to mean values, the image of $\mathcal{D}$ under $\nabla\Lambda$ is not necessarily convex. In fact, even the image of one-parameter subfamilies may be curved. For example, let's restrict our family to an affine line (which I'll call a ray since we only plot some of it)
$$L = \left\{\eta = \eta(s) = \eta_0 + s \cdot v \in \mathbb{R}^2, \; s \in \mathbb{R}\right\}$$
bearing in mind that we might hit the boundary of $\mathcal{D}$ with such a ray. We can look at the paths $s \mapsto \eta(s) \in \mathcal{D}$ and $s \mapsto \nabla\Lambda(\eta(s)) \in \mathcal{M}$. Since it is a ray in $\mathcal{D}$, it is not surprising that it looks like a ray. . .

eta_0 = np.array([0, 0.01])
v = np.array([12.5, .1])

ray = np.linspace(0, 1, 101)
eta = np.array([eta_0 + s*v for s in ray])
plt.plot(eta[:,0], eta[:,1])
a = plt.gca()
a.set_xlabel(r'$\eta_1$', size=20)
a.set_ylabel(r'$\eta_2$', size=20)
a.set_title(r'${\cal D}$ coordinates', size=20)

However, in $\mathcal{M}$ it does not look straight; it is curved.

family = normal(1)
mu = np.array([family.grad(eta_0 + t*v) for t in ray])
plt.plot(mu[:,0], mu[:,1])
a = plt.gca()
a.set_xlabel(r'$\mu_1$', size=20)
a.set_ylabel(r'$\mu_2$', size=20)
a.set_title(r'${\cal M}$ coordinates', size=20)

1.3.1 Exercise

1. Repeat the above plots, but start at point η0 = (2, 2) and move along the ray dη = (1, 1). Is this expected?

1.4 Estimation

Generally speaking, the images of flat lines under $\nabla\Lambda$ are curves. This makes it seem like our nice picture from one-parameter families is going to fail when it comes to maximum likelihood. We'll see that all is not really lost. First of all, the map $s \mapsto \Lambda(\eta(s)) - \eta(s)^T t$ is convex, so finding the MLE restricted to the ray is a convex problem. In fact, it can essentially be solved by Newton-Raphson using

$$\frac{d\Lambda(\eta(s))}{ds} = v^T \nabla\Lambda(\eta(s))$$
$$\frac{d^2\Lambda(\eta(s))}{ds^2} = v^T \nabla^2\Lambda(\eta(s))\, v.$$
However, we must be careful to not have Newton-Raphson force us to take too big a step. In particular, we must stay in $\mathcal{D}$.

Notation for multivariate derivatives

It is sometimes convenient to write
$$v^T \nabla\Lambda(\eta) = \nabla\Lambda_\eta(v) = \nabla\Lambda(v)\big|_\eta$$
$$v^T \nabla^2\Lambda(\eta(s))\, w = \nabla^2_\eta\Lambda(v, w) = \nabla^2\Lambda(v, w)\big|_\eta.$$
This makes higher order derivatives easy to express in two (similar) fashions:
$$\nabla^k_\eta\Lambda(v_1, \ldots, v_k) = \nabla^k\Lambda(v_1, \ldots, v_k)\big|_\eta.$$
Here is what happens with the naive Newton-Raphson iteration

$$s^{(k+1)} = s^{(k)} - \left[\nabla^2\Lambda(v, v)\big|_{\eta_0 + s^{(k)}v}\right]^{-1}\left[\nabla\Lambda(v)\big|_{\eta_0 + s^{(k)}v} - t^T v\right]$$

s = 10
eta_0 = np.array([10, 2])
v = np.array([1.1, .2])
tx = np.array([.2, -2])
grad, hess = np.inf, np.inf   # initial values
for i in range(5):
    print 's, grad, hess = %s' % repr((s, grad, hess))
    print 'eta = %s' % repr(eta_0 + s*v)
    grad = (v * (family.grad(eta_0 + s*v) - tx)).sum()
    hess = family.hess(eta_0 + s*v, (v, v))
    s = s - grad / hess

s, grad, hess = (10, inf, inf)
eta = array([ 21., 4.])
s, grad, hess = (-1682.6666666666467, 3.1737500000000005, 0.0018750000000000225)
eta = array([-1840.93333333, -334.53333333])
s, grad, hess = (-18045209.467520244, 3.2052980303192289, 1.7764254547883605e-07)
eta = array([-19849720.41427227, -3609039.89350405])
s, grad, hess = (-2087284487739199.8, 3.2050000277081927, 1.5354878907224692e-15)
eta = array([ -2.29601294e+15, -4.17456898e+14])
s, grad, hess = (-8.1256403474629461e+31, 3.2050000000000014, 3.9443045261050625e-32)
eta = array([ -8.93820438e+31, -1.62512807e+31])
-c:11: RuntimeWarning: divide by zero encountered in double_scalars

We see that after the very first step, we jumped outside of $\mathcal{D}$. Here is a slightly modified algorithm that just takes a descent step, but makes sure not to step outside of $\mathcal{D}$.

s = 10                        # initial point
s_iterates = [s]              # sequence of values of s

value = np.inf                # initial value
value_iterates = [value]      # sequence of objective values
step_size = 1                 # how big a step do we try to take

tol = 1.e-7                   # tolerance for stopping -- based on objective

max_iters = 100

for i in range(max_iters):
    grad = (v * (family.grad(eta_0 + s*v) - tx)).sum()
    hess = family.hess(eta_0 + s*v, (v, v))
    count = 0
    step_size = 1.
    while True:
        proposed_s = s - step_size * grad / hess
        proposed_eta = eta_0 + proposed_s * v
        proposed_value = family.value(proposed_eta) - (tx * proposed_eta).sum()

        # if the step leads outside the domain, or is not a descent, take
        # a smaller step

        if proposed_eta[1] < 0 or proposed_value > value:
            step_size *= 0.9
        else:
            break

    # when do we stop?
    if i > 1 and np.fabs((value - proposed_value) / value) < tol:
        break

    value = proposed_value
    s = proposed_s
    s_iterates.append(s)
    value_iterates.append(family.value(eta_0 + s*v) - (tx * (eta_0 + s*v)).sum())

-c:12: RuntimeWarning: invalid value encountered in log

Let’s look at how many steps it took and how quickly our objective dropped.

plt.plot(np.log(value_iterates))
a = plt.gca()
a.set_xlabel('Iteration $k$')
a.set_ylabel(r'$\log(\Lambda(\eta_0+s^{(k)}v)-(\eta_0+s^{(k)}v)^Tt(x))$')
print grad, s

-8.2910256441e-05 -9.03537880144

How about the actual point on the ray?

plt.plot(s_iterates)
a = plt.gca()
a.set_xlabel('Iteration $k$')
a.set_ylabel(r'$s^{(k)}$')

1.5 MLE map

The curved picture above hides a little structure from us and makes things seem worse than they really are. What is really going on is that we have not properly taken into account the fact that we were going to restrict our attention to a fixed affine line. Let's first consider the unrestricted MLE problem. That is, having observed $t(X) = t$, let's estimate $\eta$. The problem we must solve is

$$\hat\eta(t) = \mathop{\mathrm{argmax}}_{\eta \in \mathcal{D}} \; \eta^T t - \Lambda(\eta).$$

The Fenchel-Legendre transform (convex conjugate) extends to arbitrary dimensions

$$\Lambda^*(t) = \sup_{\eta \in \mathcal{D}} \; \eta^T t - \Lambda(\eta).$$

Again, by general convex analysis, we see that

$$\hat\eta(t) = \nabla\Lambda^*(t).$$
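Numerically, the unrestricted MLE can be found by directly minimizing $\Lambda(\eta) - \eta^T t$; here is a minimal sketch for the univariate normal family, assuming scipy is available (the observed $t$ is chosen arbitrarily).

from scipy.optimize import minimize

t_obs = np.array([.2, -2.])   # an arbitrary observed sufficient statistic

def neg_loglike(eta):
    # eta -> Lambda(eta) - eta^T t, a convex function of eta
    return CGFnormal(eta, N=1) - (eta * t_obs).sum()

fit = minimize(neg_loglike, x0=np.array([0., 1.]),
               bounds=[(None, None), (1e-6, None)], method='L-BFGS-B')
print fit.x   # a numerical approximation to eta_hat(t_obs)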

Subdifferential

If the MLE is not unique, then we can describe the set of MLE estimates as

$$\hat\eta(t) \in \partial\Lambda^*(t) = \left\{\eta : \eta^T t - \Lambda(\eta) = \Lambda^*(t)\right\}.$$

In words, these are the $\eta$'s that achieve the maximum value of the likelihood. The set $\partial f(x)$ is referred to as the subdifferential of $f$ at $x$. For an arbitrary function with domain $\mathcal{D}(f) = \{x : f(x) < \infty\}$ it is defined as

$$\partial f(x) = \left\{v : f(y) - f(x) \geq v^T(y - x) \;\; \forall y \in \mathcal{D}(f)\right\}.$$

1.5.1 Exercise

Show, from the definition of $\partial\Lambda^*$, that if $\Lambda^*(t) < \infty$

$$\partial\Lambda^*(t) = \left\{\eta : \eta^T t - \Lambda(\eta) = \Lambda^*(t)\right\}.$$

1.6 MLE in a restricted model

Let's go back to our ray in the two-dimensional family above. The ray can be described by a 1-dimensional constraint. If $a$ is normal to $d\eta$, then the ray is the set

$$\left\{\eta : a^T\eta = a^T\eta_0 = b\right\}$$
for some $b \in \mathbb{R}$. To perform MLE in this model, we should consider the problem

$$\hat\eta(t, b) = \mathop{\mathrm{argmax}}_{\eta \in \mathcal{D}} \; \eta^T t - \Lambda(\eta)$$
subject to the additional constraint $\eta^T a = b$.

1.6.1 Lagrange multipliers

Standard multivariate calculus says we should use a Lagrange multiplier. Then, we need to find the appropriate value of the Lagrange multiplier and solve the equations

$$t - \nabla\Lambda(\hat\eta(u)) + u \cdot a = 0, \qquad u \in \mathbb{R}$$
which are the critical point equations for solving

$$\hat\eta(t, u) = \mathop{\mathrm{argmax}}_{\eta} \; \eta^T t - \Lambda(\eta) + u \cdot (\eta^T a - b).$$

Then, we should find the correct Lagrange multiplier $u = u(t, b)$ so that the solution to the above problem actually satisfies $\hat\eta(t, u)^T a = b$. This trick is used over and over in optimization and is referred to as forming a Lagrangian

$$L(\eta; u) = \eta^T t - \Lambda(\eta) + u \cdot (\eta^T a - b).$$

1.6.2 Dual function

Maximizing this over $\eta$ yields a function

$$g(u) = \sup_{\eta \in \mathcal{D}} \; \eta^T t - \Lambda(\eta) + u \cdot (\eta^T a - b) = \Lambda^*(t + u \cdot a) - ub.$$

So, our old friend Λ∗ appears. Since it is convex, we see that the problem of minimizing g(u) is a convex problem. Having found

$$\hat u(t, b) = \mathop{\mathrm{argmin}}_u \; \Lambda^*(t + u \cdot a) - ub$$
we can then solve
$$\mathop{\mathrm{maximize}}_{\eta} \; \eta^T t - \Lambda(\eta) + \hat u(t, b) \cdot (\eta^T a - b).$$
This last problem can be solved explicitly as

$$\hat\eta(t, b) = \nabla\Lambda^*(t + \hat u(t, b) \cdot a).$$

By the famous KKT conditions, this is actually a solution to our restricted problem. Specifically, the KKT conditions are that at this pair $(\hat\eta(t, b), \hat u(t, b))$ we have

$$\nabla_\eta L(\eta, u) = t + u \cdot a - \nabla\Lambda(\eta) = 0$$
$$\nabla_u L(\eta, u) = \eta^T a - b = 0.$$

In other words, the pair $(\hat\eta(t, b), \hat u(t, b))$ is a critical point of the function $L(\eta, u)$.
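Here is a minimal numerical sketch of this dual approach (my own sketch, assuming scipy is available). It reuses the ray from the Newton-Raphson example above, with $a$ normal to $v = (1.1, .2)$ and $b = a^T\eta_0$, and the closed form of $\Lambda^*$ for the univariate normal family from Exercise 1.6.3 below; the MLE map $\nabla\Lambda^*$ is approximated by finite differences.

from scipy.optimize import minimize_scalar

def conjugate(t):
    # Lambda^* for the univariate normal family (see Exercise 1.6.3 below)
    t = np.asarray(t)
    return -0.5 * (np.log(-(2*t[1] + t[0]**2)) + np.log(2*np.pi) + 1)

def grad_conjugate(t, eps=1.e-6):
    # the MLE map grad Lambda^*, approximated by central differences
    t = np.asarray(t)
    g = np.zeros(2)
    for i in range(2):
        dt = np.zeros(2)
        dt[i] = eps
        g[i] = (conjugate(t + dt) - conjugate(t - dt)) / (2 * eps)
    return g

tx = np.array([.2, -2.])
eta_0, v = np.array([10., 2.]), np.array([1.1, .2])
a = np.array([.2, -1.1])        # normal to v, so a^T eta = b describes the ray
b = np.dot(a, eta_0)

g = lambda u: conjugate(tx + u * a) - u * b
u_hat = minimize_scalar(g, bounds=(-1., 10.), method='bounded').x
print grad_conjugate(tx + u_hat * a)   # should match eta_0 + s*v from the damped iteration above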

1.6.3 Exercise: estimation via $\Lambda^*$

1. In the univariate normal family, show that
$$\Lambda^*(t_1, t_2) = -\frac{1}{2}\left[\log(-(2t_2 + t_1^2)) + \log(2\pi) + 1\right].$$
What is the domain $\mathcal{D}(\Lambda^*) = \{(t_1, t_2) : \Lambda^*(t_1, t_2) < \infty\}$?

2. What is the MLE map $\nabla\Lambda^*$? Describe it in terms of the usual MLE $(\hat\mu, \hat\sigma)$ from IID $N(\mu, \sigma^2)$ samples. Does the MLE exist when $N = 1$?

3. Suppose we observed tx = [.2, −2]. Find the MLE of the restriction to

$$R = \left\{\eta = \eta(s) = (10 + 11 \cdot s, \; 2 + .2 \cdot s) : \eta \in \mathcal{D}, \; s \in \mathbb{R}\right\}.$$
That is, solve
$$\mathop{\mathrm{minimize}}_{s} \;\; \Lambda(\eta(s)) - x \cdot \eta_1(s) + x^2/2 \cdot \eta_2(s).$$

4. Find a suitable $(a, b)$ such that

$$R = \left\{\eta : a^T\eta = b, \; \eta \in \mathcal{D}\right\}.$$

5. Plot the function
$$g(u) = \Lambda^*(t + u \cdot a) - u \cdot b$$
and find
$$\hat u(t, b) = \mathop{\mathrm{argmin}}_u \; \Lambda^*(t + u \cdot a) - u \cdot b.$$
Compare $\hat\eta(t, b) = \nabla\Lambda^*(t + \hat u(t, b) \cdot a)$ with your solution to 3.

Here’s my solution (without the code, of course)

%run scripts/normal_family_exercise.py

original solution: [ 0.30800078 1.82378183]
dual solution: [ 0.3080008 1.82378183]

1.7 Examples

1.7.1 Beta family

In this family, we set $m_0(dx) = 1_{[0,1]}(x)\, dx$ and

t(x) = (log(x), log(1 − x)).

Or,
$$P_\eta(dx) = e^{\eta_1\log(x) + \eta_2\log(1-x) - \Lambda(\eta)}\, m_0(dx).$$
We see that
$$e^{\Lambda(\eta)} = B(\eta_1 + 1, \eta_2 + 1) = \int_0^1 x^{\eta_1}(1 - x)^{\eta_2}\, dx.$$
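Numerically, this CGF can be evaluated with scipy's betaln; a small sketch, which may be handy for checking the exercise below by finite differences:

from scipy.special import betaln

def CGFbeta(eta):
    # Lambda(eta) = log B(eta_1 + 1, eta_2 + 1)
    eta = np.asarray(eta)
    return betaln(eta[0] + 1, eta[1] + 1)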

1.7.2 Exercise: Beta family

1. What is $\mathcal{D}(\mathcal{F})$ for the Beta family?

2. Compute ∇Λ(η), ∇2Λ(η).

1.7.3 Multinomial

For $n, k$ positive integers, set

$$m_0(dx) = \binom{n}{x_1, \ldots, x_k} 1_{A_{n,k}}(x)\, m(dx)$$

where $m(dx)$ is counting measure on $\mathbb{Z}^k$ and

$$A_{n,k} = \left\{(x_1, \ldots, x_k) \in \mathbb{Z}^k : x_i \geq 0, \; \sum_{i=1}^k x_i = n\right\}.$$
The multinomial family can be written as

$$P_\eta(dx) = \exp(x^T\eta - \Lambda(\eta))\, m_0(dx)$$
where
$$e^{\Lambda(\eta)} = \sum_{x \in A_{n,k}} \binom{n}{x_1, \ldots, x_k} \prod_{i=1}^k e^{x_i\eta_i}.$$
We see that $\Lambda$ is linear along $\mathbf{1}$. That is, for any $\alpha \in \mathbb{R}$

$$\Lambda(\eta + \alpha \cdot \mathbf{1}) = \Lambda(\eta) + \alpha \cdot n.$$

This is essentially an issue of identifiability as it says that the map

$$\eta \mapsto P_\eta$$
is many-to-one. In turn, this will manifest itself in the Fisher information:

$$\nabla^2\Lambda(\mathbf{1}, \mathbf{1})\big|_\eta = 0.$$
This shows that $\Lambda$ is not strongly convex, so $\nabla\Lambda^*$ is not smooth. As $\nabla\Lambda^*$ is our MLE map, this says that the MLE will not be easy to find; it is not even unique.
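By the multinomial theorem the sum above collapses to $e^{\Lambda(\eta)} = (\sum_i e^{\eta_i})^n$, so the linearity along $\mathbf{1}$ is easy to check numerically; a small sketch with arbitrarily chosen values:

def CGFmultinomial(eta, n):
    # Lambda(eta) = n * log(sum_i exp(eta_i))
    eta = np.asarray(eta)
    return n * np.log(np.exp(eta).sum())

eta_m = np.array([0.5, -1., 2.])
alpha, n = 0.3, 7
print CGFmultinomial(eta_m + alpha, n) - CGFmultinomial(eta_m, n)   # equals alpha * n = 2.1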

1.7.4 Exercise: reparametrizing the multinomial

An alternative to avoid the identifiability issues in the multinomial is to reparametrize by taking

$$\tilde{\mathcal{D}} = \left\{\eta : \eta^T\mathbf{1} = 0\right\}.$$

1. Define $\tilde\Lambda(\eta) = \Lambda(\eta) + P(\eta)$ where
$$P(\eta) = \begin{cases} 0 & \eta^T\mathbf{1} = 0 \\ \infty & \text{otherwise.} \end{cases}$$
Argue that $\mathcal{D}(\tilde\Lambda) = \tilde{\mathcal{D}}$ as above.

2. Solve the problem
$$\mathop{\mathrm{minimize}}_{\eta \in \tilde{\mathcal{D}}} \;\; \tilde\Lambda(\eta) - x^T\eta.$$
Would you call the minimizers MLEs, or is it better to think of them as "restricted"?

1.7.5 Dirichlet family

There are at least two ways to define the Dirichlet distribution with parameters $\alpha \in \mathbb{R}^k$. The first is a distribution on $\mathbb{R}^{k-1}$. Set
$$m_0(dx) = 1_{\tilde{S}_k}(x)\, dx, \qquad x \in \mathbb{R}^{k-1}$$
where $dx$ is Lebesgue measure on $\mathbb{R}^{k-1}$ and

$$\tilde{S}_k = \left\{(x_1, \ldots, x_{k-1}) \in \mathbb{R}^{k-1} : x_i \geq 0, \; \sum_{i=1}^{k-1} x_i \leq 1\right\}$$
are the first $(k-1)$ coordinates of the simplex

$$S_k = \left\{(x_1, \ldots, x_k) \in \mathbb{R}^k : x_i \geq 0, \; \sum_{i=1}^k x_i = 1\right\}.$$
Take
$$t(x) = \left(\log(x_1), \log(x_2), \ldots, \log(x_{k-1}), \log\left(1 - \sum_{i=1}^{k-1} x_i\right)\right) \in \mathbb{R}^k$$
and set
$$\frac{dP_\eta}{dm_0}(x) = e^{\eta^T t(x) - \Lambda(\eta)}$$
with
$$e^{\Lambda(\eta)} = \int_{\tilde{S}_k} \left(1 - \sum_{i=1}^{k-1} x_i\right)^{\eta_k} \prod_{i=1}^{k-1} x_i^{\eta_i}\, dx.$$
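This integral is the normalizing constant of a Dirichlet distribution with parameters $\alpha_i = \eta_i + 1$, so the CGF can be evaluated with gammaln; a small sketch, assuming scipy:

from scipy.special import gammaln

def CGFdirichlet(eta):
    # Lambda(eta) = sum_i log Gamma(eta_i + 1) - log Gamma(sum_i (eta_i + 1))
    alpha = np.asarray(eta) + 1
    return gammaln(alpha).sum() - gammaln(alpha.sum())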

1.7.6 Exercise: Dirichlet on the simplex

Write the Dirichlet distribution as a k-parameter exponential family on Sk. What is its reference measure?

1.7.7 The multivariate normal family: linear sufficient statistic

Probably the most common of all multiparameter exponential families is the normal model on $\mathbb{R}^p$ with fixed covariance. Of course, the most common choice from a modelling standpoint is probably a $p$-vector of IID $N(0, 1)$ random variables, i.e. covariance $I_{p \times p}$. We will start with reference measure $P_0 = N(0, \Sigma)$.

If Σ > 0, i.e. it is positive definite, then P0 has density

$$P_0(dx) = \det(2\pi\Sigma)^{-1/2} e^{-x^T\Sigma^{-1}x/2}\, dx.$$

If we take sufficient statistic t(x) = x we see that

$$\frac{dP_\eta}{dP_0} = e^{x^T\eta - \Lambda(\eta)}$$
and
$$e^{\Lambda(\eta)} = e^{\eta^T\Sigma\eta/2}.$$
Further, $\nabla\Lambda(\eta) = \Sigma\eta$ so that
$$\mathcal{F} = \left\{N(\Sigma\eta, \Sigma), \; \eta \in \mathbb{R}^p\right\}.$$
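A quick Monte Carlo sketch of the mean relation $\nabla\Lambda(\eta) = \Sigma\eta$ (with an arbitrarily chosen $\Sigma$ and $\eta$): samples from $P_\eta = N(\Sigma\eta, \Sigma)$ should average to $\Sigma\eta$.

Sigma = np.array([[2., .5],
                  [.5, 1.]])
eta_n = np.array([1., -1.])
X = np.random.multivariate_normal(np.dot(Sigma, eta_n), Sigma, size=50000)
print X.mean(0)              # Monte Carlo estimate of E_eta[t(X)] = E_eta[X]
print np.dot(Sigma, eta_n)   # grad Lambda(eta) = Sigma eta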

1.7.8 Exercise: computing ∇Λ∗ for the normal means model

1. Compute $\Lambda^*$ for the normal means model with reference $P_0 = N(0, \Sigma)$ and sufficient statistic $t(x) = x$. That is, find
$$\Lambda^*(t) = \sup_\eta \; \eta^T t - \frac{1}{2}\eta^T\Sigma\eta.$$

2. What happens when Σ is not positive definite, but only nonnegative definite? What is D(Λ), Λ∗?

3. How might you modify Λ as in the multinomial family to yield an identifiable model?

1.7.9 The multivariate normal family: quadratic sufficient statistic

In this section, we again consider the normal family, but we focus on an exponential family for the covariance of the normal family. As in the 1-dimensional case, we see that the natural parameters are related to the inverse of the covariance, i.e. the so-called precision or concentration matrix

$$\Theta = \Sigma^{-1}.$$

We will take the reference measure to be $m_0(dx) = (2\pi)^{-p/2}\, dx$, a multiple of Lebesgue measure on $\mathbb{R}^p$. With this reference measure, the $N(0, \Theta^{-1})$ family has density

$$\det(\Theta)^{1/2}\exp(-x^T\Theta x/2).$$

We can write $x^T\Theta x = \text{Tr}(xx^T\Theta)$ and hence our sufficient statistic is

$$t(x) = -xx^T/2 \in \mathbb{R}^{p \times p}.$$
After repeated sampling we see that
$$t(x) = -\frac{1}{2n}\sum_{i=1}^n x_i x_i^T$$
which is minus half the MLE of $\Sigma$ under $N(0, \Sigma)$. (Well, it certainly seems to be the MLE... more on this in a moment.) In any case, from the explicit form of our density, we see that
$$\Lambda(\Theta) = -\frac{1}{2}\log\det(\Theta).$$
The conjugate, whose gradient is the MLE map, is therefore
$$\Lambda^*(T) = \sup_\Theta \; \text{Tr}(T\Theta) + \frac{1}{2}\log\det(\Theta).$$

1.7.10 Exercise: domain of $\Lambda(\Theta)$

Suppose $\Theta$ is not symmetric. Can you still make sense of $\Lambda(\Theta)$? What is $\mathcal{D}(\Lambda)$?

1.7.11 Exercise: computing $\Lambda^*(S)$

1. Suppose $f(\theta) = -\frac{1}{2}\log(\theta)$. Show that $f^*(s) = -1/2 - \log(-2s)/2$.

2. Argue that
$$\Lambda^*(T) = C - \frac{1}{2}\log\det(-2T).$$

3. What is $\mathcal{D}(\Lambda^*)$?

Finally, we see that the MLE map is
$$\hat\Theta = \nabla\log\det(S)$$

where $S = -2\,t(x) = \frac{1}{n}\sum_{i=1}^n x_i x_i^T$ is the usual estimate of the covariance matrix. It remains to compute $\nabla\log\det(A)$ for some symmetric $A > 0$. A very useful formula tells us that
$$\nabla\log\det(A) = A^{-1}.$$
That is,
$$\nabla\log\det(T)\big|_A = \text{Tr}(A^{-1}T).$$
The Hessian also has a nice form, analogous to the Hessian of $\log$ on $\mathbb{R}$:
$$\nabla^2\log\det(V, W)\big|_A = -\text{Tr}(A^{-1}VA^{-1}W).$$
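A small finite-difference sketch of the formula $\nabla\log\det(A) = A^{-1}$ (an illustration with an arbitrarily chosen $A$, not a proof):

A = np.array([[2., .3],
              [.3, 1.]])
eps = 1.e-6
num_grad = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        E = np.zeros((2, 2))
        E[i, j] = eps
        num_grad[i, j] = (np.log(np.linalg.det(A + E)) - np.log(np.linalg.det(A))) / eps
print num_grad           # numerical gradient of log det at A
print np.linalg.inv(A)   # should roughly agree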

1.7.12 Exercise: MLE for covariance matrices

If $n < p$, is the MLE $\hat\Theta$ well defined?
