
Lectures on Machine Learning (Fall 2017)

Hyeong In Choi
Seoul National University

Lecture 4: Exponential family of distributions and generalized linear model (GLM) (Draft: version 0.9.2)

Topics to be covered:

• Exponential family of distributions

• Mean parameters and (canonical) link functions

• Convexity of the log partition function

• Generalized linear model (GLM)

• Various GLM models

1 Exponential family of distributions

In this section, we study a family of distributions called the exponential family (of distributions). It is of a special form, but most, if not all, of the well-known probability distributions belong to this class.

1.1 Definition

Definition 1. A probability distribution (PDF or PMF) is said to belong to the exponential family of distributions (in natural or canonical form) if it is of the form

P_θ(y) = (1/Z(θ)) h(y) e^{θ·T(y)},   (1)

where y = (y_1, ···, y_m) is a point in R^m; θ = (θ_1, ···, θ_k) ∈ R^k is a vector called the canonical (natural) parameter; T : R^m → R^k is a map T(y) = (T_1(y), ···, T_k(y)); and Z(θ) = ∫ h(y) e^{θ·T(y)} dy is called the partition function, while its logarithm, A(θ) = log Z(θ), is called the log partition (cumulant) function.

Remark. In this lecture and throughout this course, the “dot” notation as in θ · T(y) always denotes the inner (dot) product of two vectors.

Equivalently, Pθ(y) can be written in the form

P_θ(y) = exp[θ · T(y) − A(θ) + C(y)],   (2)

where C(y) = log h(y). More generally, one sometimes introduces an extra parameter φ, called the dispersion parameter, to control the shape of P_θ(y) by

P_θ(y) = exp[ (θ · T(y) − A(θ))/φ + C(y, φ) ].

1.2 Standing assumption

In our discussion of the exponential family of distributions, we always assume the following.

• In case Pθ(y) is a probability density function (PDF), it is assumed to be continuous as a function of y. It means that there is no singularity in the probability Pθ(y);

• In case P_θ(y) is a probability mass function (PMF), the set of discrete values of y on which P_θ(y) is positive is the same for all θ.

If P_θ(y) satisfies either condition, we say it is regular, which is always assumed throughout this course.

Remark. Sometimes people use the more general form of P_θ(y) and write

P_θ(y) = (1/Z(θ)) h(y) e^{η(θ)·T(y)}.

But most of the time the same result can be obtained without the use of the general form η(θ). So it is not much of a loss of generality to stick to our convention of using just θ.

1.3 Examples

Let us now look at a few illustrative examples.

(1) Bernoulli(µ) The Bernoulli distribution is perhaps the simplest member of the exponential family. Let Y be a random variable taking its binary value in {0, 1}. Let µ = P[Y = 1]. Its distribution (in fact, the probability mass function, PMF) is then succinctly written as P(y) = µ^y (1 − µ)^{1−y}. Then,

P(y) = µ^y (1 − µ)^{1−y}
     = exp[ y log µ + (1 − y) log(1 − µ) ]
     = exp[ y log(µ/(1 − µ)) + log(1 − µ) ].

Letting T(y) = y and θ = log(µ/(1 − µ)), and recalling the definitions of the logit and σ (sigmoid) functions given in Lecture 3, we have

logit(µ) = log( µ/(1 − µ) ).

Thus θ = logit(µ).   (3)

Its inverse function is the sigmoid function:

µ = σ(θ), (4)

where

σ(θ) = 1/(1 + e^{−θ}).

Therefore, we have

A(θ) = − log(1 − µ) = log(1 + e^θ).   (5)

Thus Pθ(y), written in canonical form as in (2), becomes

P_θ(y) = exp[θ · y − log(1 + e^θ)].
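As a quick numerical sanity check (a small Python sketch added for illustration, not part of the original text; the value of µ is made up), the canonical form above reproduces µ^y (1 − µ)^{1−y}:

import numpy as np

mu = 0.3                                  # made-up Bernoulli parameter
theta = np.log(mu / (1 - mu))             # canonical parameter: theta = logit(mu)
A = np.log(1 + np.exp(theta))             # log partition function A(theta) = log(1 + e^theta)

for y in (0, 1):
    canonical = np.exp(theta * y - A)     # exp[theta*y - A(theta)]
    standard = mu**y * (1 - mu)**(1 - y)  # mu^y (1 - mu)^(1 - y)
    print(y, canonical, standard)         # the two values agree for each y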

(2) The exponential distribution models the waiting time between independent arrivals. Its distribution (the probability density function, PDF) is given as

P_θ(y) = θ e^{−θy} I(y ≥ 0).

To put it in the exponential family form, we use the same θ as the canonical parameter and we let T(y) = −y and h(y) = I(y ≥ 0). Since

Z(θ) = 1/θ = ∫ e^{−θy} I(y ≥ 0) dy,

it is already in the canonical form given as in (1).

(3) The normal (Gaussian) distribution given by

P(y) = (1/√(2πσ²)) exp( −(y − µ)²/(2σ²) )

is the single most well-known distribution. As far as its relation with the exponential family is concerned, there are two views. (Here, this σ is a standard deviation, not the sigmoid function.)

• 1st view (σ² as a dispersion parameter) This is the case when the main focus is the mean. In this case, σ² is regarded as known or as a parameter that can be fiddled with as if known. Writing P(y) in the form

P_θ(y) = exp[ ( −(1/2)y² + yµ − (1/2)µ² )/σ² − (1/2) log(2πσ²) ],

one can see right away it is in the form of the exponential family if we set

θ = (θ_1, θ_2) = (µ, 1)
T(y) = (y, −(1/2) y²)
φ = σ²
A(θ) = (1/2) µ² = (1/2) θ_1²
C(y, φ) = −(1/2) log(2πσ²).

• 2nd view (φ = 1) When both µ and σ are to be treated as unknown, we take this point of view. Here, we set the dispersion parameter φ = 1. Writing out P_θ(y) we have

P_θ(y) = exp[ −(1/(2σ²)) y² + (µ/σ²) y − (1/(2σ²)) µ² − log σ − (1/2) log 2π ].

Thus it is easy to see the following:

T(y) = (y, (1/2) y²)
θ = (θ_1, θ_2) = (µ/σ², −1/σ²)
A(θ) = (1/(2σ²)) µ² + log σ = −θ_1²/(2θ_2) − (1/2) log(−θ_2)
C(y) = −(1/2) log(2π).

1.4 Properties of exponential family

The log partition function A(θ) plays a key role, so let us now look at it more carefully. First, since

∫ P_θ(y) dy = ∫ exp[ (θ · T(y) − A(θ))/φ + C(y, φ) ] dy = 1,

taking ∇_θ, we have

∫ exp[ (θ · T(y) − A(θ))/φ + C(y, φ) ] · (T(y) − ∇_θA(θ))/φ dy = 0.

Thus

∫ exp[ (θ · T(y) − A(θ))/φ + C(y, φ) ] T(y) dy = ∫ exp[ (θ · T(y) − A(θ))/φ + C(y, φ) ] ∇_θA(θ) dy
= ∇_θA(θ) ∫ exp[ (θ · T(y) − A(θ))/φ + C(y, φ) ] dy.

Since

∫ exp[ (θ · T(y) − A(θ))/φ + C(y, φ) ] dy = 1,

we have the following:

Proposition 1. ∇_θA(θ) = ∫ P_θ(y) T(y) dy = E[T(Y)],

where Y is the random variable with distribution P_θ(y). Writing this componentwise, we have

∂A/∂θ_i = ∫ exp[ (θ · T(y) − A(θ))/φ + C(y, φ) ] T_i(y) dy.

Taking the second partial derivative, we have

∂²A/(∂θ_i ∂θ_j) = (1/φ) ∫ exp[ (θ · T(y) − A(θ))/φ + C(y, φ) ] ( T_j(y) − ∂A(θ)/∂θ_j ) T_i(y) dy
= (1/φ) [ ∫ exp[ (θ · T(y) − A(θ))/φ + C(y, φ) ] T_i(y) T_j(y) dy − (∂A(θ)/∂θ_j) ∫ exp[ (θ · T(y) − A(θ))/φ + C(y, φ) ] T_i(y) dy ].

Using Proposition 1 once more, we have

∂²A/(∂θ_i ∂θ_j) = (1/φ) { E[T_i(Y)T_j(Y)] − E[T_i(Y)] E[T_j(Y)] }
= (1/φ) E[ (T_i(Y) − E[T_i(Y)]) (T_j(Y) − E[T_j(Y)]) ]
= (1/φ) Cov(T_i(Y), T_j(Y)).

Therefore we have the following result on the Hessian of A.

Proposition 2. D_θ² A = ( ∂²A/(∂θ_i ∂θ_j) ) = (1/φ) Cov(T(Y), T(Y)) = (1/φ) Cov(T(Y)),

where Y is the random variable with distribution P_θ(y). Here, Cov(T(Y), T(Y)) denotes the covariance matrix of T(Y) with itself, which is sometimes called the variance matrix. Since the covariance matrix is always positive semi-definite, we have

Corollary 1. A(θ) is a convex function of θ.

Remark. In most exponential family models, Cov(T (Y )) is positive definite, in which case A(θ) is strictly convex. In this lecture and throughout the course, we always assume that Cov(T (Y )) is positive definite and therefore that A(θ) is strictly convex.
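Propositions 1 and 2 are easy to check numerically. The following Monte Carlo sketch (added for illustration; the value of θ and the sample size are made up) does so for the exponential distribution of Example (2), where T(y) = −y and A(θ) = log Z(θ) = −log θ, so that A′(θ) = −1/θ and A″(θ) = 1/θ²:

import numpy as np

rng = np.random.default_rng(0)
theta = 1.7                                    # made-up canonical parameter
Y = rng.exponential(scale=1.0 / theta, size=1_000_000)
T = -Y                                         # sufficient statistic T(y) = -y

# Proposition 1: dA/dtheta = E[T(Y)]
print(-1.0 / theta, T.mean())                  # both approximately -1/theta

# Proposition 2 (phi = 1): d^2A/dtheta^2 = Var(T(Y))
print(1.0 / theta**2, T.var())                 # both approximately 1/theta^2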

1.5 Maximum likelihood estimation

Let D = {y^(i)}_{i=1}^N be a given set of IID samples, where y^(i) ∈ R^m. Then its likelihood function L(θ) and the log likelihood function l(θ) are given by

L(θ) = ∏_{i=1}^N exp[ (θ · T(y^(i)) − A(θ))/φ + C(y^(i), φ) ]
l(θ) = (1/φ) [ ∑_{i=1}^N θ · T(y^(i)) − N A(θ) ] + ∑_{i=1}^N C(y^(i), φ).

Since A(θ) is strictly convex, l(θ) has a unique maximum at θ̂ that is the unique solution of ∇_θ l(θ) = 0. Note that

∇_θ l(θ) = (1/φ) [ ∑_{i=1}^N T(y^(i)) − N ∇_θA(θ) ].

So we have the following:

Proposition 3. There is a unique θ̂ that maximizes l(θ), at which

∇_θA(θ)|_{θ=θ̂} = (1/N) ∑_{i=1}^N T(y^(i)).

1.5.1 Example: Bernoulli Distribution

Let us apply the above argument to the Bernoulli distribution Bernoulli(µ) and see what it leads to. First, differentiating A(θ) as in (5), we get

A′(θ) = 1/(1 + e^{−θ}) = σ(θ).

Thus by Proposition 3 we have a unique θ̂ satisfying

σ(θ̂) = (1/N) ∑_{i=1}^N T(y^(i)).

Since T(y) = y, and y takes on 0 or 1 as its value, we have T(y) = I(y = 1). Defining µ̂ = σ(θ̂) by (4), we then have

µ̂ = σ(θ̂) = (1/N) ∑_{i=1}^N I(y^(i) = 1) = (Number of 1’s in the data {y^(i)}) / N.

This confirms the usual estimate µ̂ of µ.
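The same conclusion can be reached purely numerically: the sketch below (made-up data, added for illustration) solves the moment-matching equation of Proposition 3 by Newton's method and recovers exactly the fraction of 1's:

import numpy as np

y = np.array([1, 0, 0, 1, 1, 1, 0, 1])         # made-up binary data
target = y.mean()                              # (1/N) sum_i T(y^(i)), since T(y) = y

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

theta = 0.0
for _ in range(50):                            # Newton's method on sigma(theta) = target
    f = sigmoid(theta) - target
    fprime = sigmoid(theta) * (1.0 - sigmoid(theta))   # = A''(theta)
    theta -= f / fprime

print(sigmoid(theta), target)                  # mu_hat equals the fraction of 1's in the data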

Remark. T is a sufficient statistic. It is well known that the statistic S(y_1, ···, y_n) = ∑_{i=1}^n T(y_i) is sufficient, which means that the estimate of θ is constant on the level set

{(y1, ··· , yn) | S(y1, y2, ··· , yn) = const}.

This provides many advantages in estimating θ, as one need not look inside the level sets.

1.6 Maximum Entropy Distribution

The exponential family is useful in that it is the most random distribution under some constraints. Let Y be a random variable with an unknown distribution P(y). Suppose the expected values of certain features f_1(y), ···, f_k(y) are nonetheless known. Namely,

∫ f_i(y) P(y) dy = C_i   (6)

is known for some given constant C_i for i = 1, ···, k. The question is how to fix P(y) so that it is as random as possible while satisfying the above constraints. One approach is to construct a maximum entropy distribution satisfying the constraints (6). The entropy is a measure of randomness and, in general, the higher it is, the more random the distribution is deemed. The definition of entropy is

H = −∑_y P(y) log P(y)

for the discrete case; for the continuous case, the sum becomes an integral so that

H = −∫ P(y) log P(y) dy.

We will use the integral notation here without loss of generality. To solve the problem, we use the Lagrange multiplier method to set up the following variational problem for J(P, λ):

J(P, λ) = −∫ P(y) log P(y) dy + λ_0 ( 1 − ∫ P(y) dy ) + ∑_{i=1}^k λ_i ( C_i − ∫ f_i(y) P(y) dy ).

The solution is found as the critical point. Namely, setting the variational (functional or Fréchet) derivative equal to 0,

δJ/δP = −1 − log P(y) − λ_0 − ∑_{i=1}^k λ_i f_i(y) = 0.   (7)

To see where this comes from, let η be any smooth function with compact support and define

J(P + εη, λ) = −∫ (P + εη) log(P + εη) dy + λ_0 ( 1 − ∫ (P + εη) dy ) + ∑_i λ_i ( C_i − ∫ f_i (P + εη) dy ).

Taking the derivative with respect to ε at ε = 0 and setting it equal to 0, we get

d/dε J(P + εη, λ)|_{ε=0} = −∫ (η + η log P) dy − λ_0 ∫ η dy + ∑_i λ_i ( −∫ f_i η dy )
= ∫ ( −1 − log P − λ_0 − ∑_{i=1}^k λ_i f_i ) η dy
= 0.

Since the above formula is true for any η, we have

−1 − log P − λ_0 − ∑_{i=1}^k λ_i f_i = 0,

which is (7). Solving (7) for P(y), we have

P(y) = (1/Z) exp( −∑_{i=1}^k λ_i f_i(y) ),

where Z = e^{1+λ_0}. Now

1 = ∫ P(y) dy = (1/Z) ∫ exp( −∑_{i=1}^k λ_i f_i(y) ) dy.

Thus

Z = ∫ exp( −∑_{i=1}^k λ_i f_i(y) ) dy.

Therefore P(y) belongs to the exponential family of distributions. This kind of distribution is in general called the Gibbs distribution.
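For a concrete finite example (added for illustration; the support {1, ···, 6} and the constraint value are made up), the maximum entropy distribution on {1, ···, 6} with a single constraint E[f(Y)] = C, f(y) = y, has the Gibbs form P(y) ∝ exp(−λy); the sketch below solves for the multiplier λ by bisection:

import numpy as np

y = np.arange(1, 7)                            # support {1, ..., 6}
C = 4.5                                        # prescribed mean E[Y] = C (made up)

def mean_under(lam):
    # Gibbs form: P(y) proportional to exp(-lam * f(y)) with f(y) = y
    w = np.exp(-lam * y)
    p = w / w.sum()
    return float((p * y).sum())

lo, hi = -20.0, 20.0                           # mean_under is decreasing in lam
for _ in range(200):                           # bisection for mean_under(lam) = C
    mid = 0.5 * (lo + hi)
    if mean_under(mid) > C:
        lo = mid
    else:
        hi = mid

lam = 0.5 * (lo + hi)
P = np.exp(-lam * y); P /= P.sum()
print(P, (P * y).sum())                        # the maximum entropy PMF and its mean (about 4.5)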

2 Generalized linear model (GLM)

2.1 Mean parameter and canonical link function

Before we embark on the generalized linear model, we need the following elementary fact.

Lemma 1. Let f : R^k → R be a strictly convex C^1 function. Then ∇_x f : R^k → R^k is an invertible function.

Proof. Let p ∈ R^k be any given vector. By the strict convexity and the absence of corners (the C^1 condition), there is a unique hyperplane H in R^{k+1} that is tangent to the graph of y = f(x), having the normal vector of the form (p, −1). Let (x, y) be the point of contact. Since the graph is the zero set of F(x, y) = f(x) − y, its normal vector at (x, y) is (∇f(x), −1). Therefore we must have p = ∇f(x).

Figure 1: Graph of a strictly convex function with a tangent plane

Let us now look at the exponential family

P_θ(y) = exp[ θ · T(y) − A(θ) + C(y) ],   (8)

where the dispersion parameter φ is set to be 1 for the sake of simplicity of the presentation. (The argument does not change much even with the presence of the dispersion parameter.) Recall that Proposition 1 says that ∇_θA(θ) = E[T(Y)]. We always assume that the log partition function A : R^k → R is strictly convex and C^1. Thus by the above Lemma, ∇_θA : R^k → R^k is invertible. We now set up a few terminologies.

Definition 2. The mean parameter µ is defined as

µ = E[T(Y)] = ∇_θA(θ),

and the function ∇_θA : R^k → R^k is called the mean function.

Since ∇_θA is invertible, we define:

Definition 3. The inverse function (∇_θA)^{−1} of ∇_θA is called the canonical link function and is denoted by ψ. Thus

θ = ψ(µ) = (∇_θA)^{−1}(µ).

Therefore, combining the above definitions, the mean function is also written as

µ = ψ^{−1}(θ) = ∇_θA(θ).

Example: Bernoulli(µ) Recall, for Bernoulli(µ), P(y) is given by

P(y) = µ^y (1 − µ)^{1−y} = exp[ y log(µ/(1 − µ)) + log(1 − µ) ]
     = exp[ θy − log(1 + e^θ) ] = exp[ θy − A(θ) ],

where θ = log(µ/(1 − µ)) and A(θ) = log(1 + e^θ). Thus

µ = sigmoid(θ) = A′(θ)
θ = logit(µ) = (A′)^{−1}(µ).

Therefore for the Bernoulli distribution the sigmoid function is the mean function and the logit function is the canonical link function.
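A short numerical check (added for illustration; the θ values are made up) that for the Bernoulli distribution the sigmoid is indeed the mean function A′(θ) and the logit is its inverse:

import numpy as np

A = lambda t: np.log(1.0 + np.exp(t))          # log partition function
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
logit = lambda m: np.log(m / (1.0 - m))

theta = np.array([-2.0, -0.5, 0.0, 1.3, 3.0])  # made-up canonical parameters
h = 1e-6
A_prime = (A(theta + h) - A(theta - h)) / (2 * h)        # finite-difference A'(theta)

print(np.allclose(A_prime, sigmoid(theta), atol=1e-5))   # True: the mean function is the sigmoid
print(np.allclose(logit(sigmoid(theta)), theta))         # True: the logit inverts it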

2.2 Classification in GLM

From now on in this lecture, to simplify notation, we use ω to stand for both w and b, and we extend x ∈ R^d to x ∈ R^{d+1} by adding x_0 = 1. (In the previous lectures, we used θ as a generic term to represent both w and b. But in the exponential family notation, θ is reserved to denote the canonical parameter. So we are forced to use ω. One more note: although the typography is difficult to discern, this ω is the Greek lower case ‘omega,’ not the English w.)

2.2.1 GLM recipe

With this notation, setting ω = (b, w_1, ···, w_d), the expression in (1) of Lecture 3 is now written as ω · x, and thus the conditional probability in Lecture 3 is written as

P(y = 1 | x) = e^{ω·x} / (1 + e^{ω·x})
P(y = 0 | x) = 1 / (1 + e^{ω·x}),

which can be simplified as

P(y | x) = e^{(ω·x)y} / (1 + e^{ω·x}),

for y ∈ {0, 1}. Using (5), this is easily seen to be equivalent to the following conditional probability model

P (y | x) = exp [(ω · x)y − A(ω · x)] = exp [θy − A(θ)] .

Summarizing this process, we have the following:

• GLM Recipe: P(y | x) is obtained from P_θ(y) by replacing the canonical parameter θ in (8) with a linear expression in x. In the binary case, the linear expression is ω · x. The multiclass case mimics it in exactly the same way. To describe it correctly, first define the parameter matrix

    [ (ω_1)^T ]   [ ω_10 ··· ω_1j ··· ω_1d ]
    [    ⋮    ]   [   ⋮          ⋮         ]
W = [ (ω_i)^T ] = [ ω_i0 ··· ω_ij ··· ω_id ] .   (9)
    [    ⋮    ]   [   ⋮          ⋮         ]
    [ (ω_k)^T ]   [ ω_k0 ··· ω_kj ··· ω_kd ]

Then replace θ in (8) with W x to get the conditional probability as

P (y | x) = exp[(W x) · T (y) − A(W x) + C(y)]. (10)

Written this way, we can see that

(Wx) · T(y) = ∑_{ℓ=1}^k ∑_{j=0}^d ω_{ℓj} x_j T_ℓ(y).

Now, let D = {(x^(i), y^(i))}_{i=1}^N be a given data set, where the x-part of the i-th data point is represented by the column vector

x^(i) = [x_0^(i), ···, x_d^(i)]^T.

Using the conditional probability as in (10), one can write the likelihood function and the log likelihood function by

L(W) = ∏_{i=1}^N exp[ (W x^(i)) · T(y^(i)) − A(W x^(i)) + C(y^(i)) ]
l(W) = log L(W) = ∑_{i=1}^N [ (W x^(i)) · T(y^(i)) − A(W x^(i)) + C(y^(i)) ]
     = ∑_{i=1}^N [ ∑_{ℓ=1}^k ∑_{j=0}^d ω_{ℓj} x_j^(i) T_ℓ(y^(i)) − A(ω_1^T x^(i), ···, ω_k^T x^(i)) + C(y^(i)) ].

Thus taking the derivative, one gets

∂l(W)/∂ω_rs = ∑_{i=1}^N [ ∑_{ℓ=1}^k ∑_{j=0}^d x_j^(i) δ_{ℓr} δ_{js} T_ℓ(y^(i)) − ∑_{ℓ=1}^k ∑_{j=0}^d (∂A/∂ω_ℓ) x_j^(i) δ_{ℓr} δ_{js} ]
= ∑_{i=1}^N [ x_s^(i) T_r(y^(i)) − (∂A/∂ω_r) x_s^(i) ]
= ∑_{i=1}^N [ T_r(y^(i)) − (∂A/∂ω_r)(W x^(i)) ] x_s^(i).   (11)

This is a generalized form of what we got for logistic regression (see (10) of Lecture 3).
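To see formula (11) in action, here is a minimal sketch (added for illustration; the data, step size, and iteration count are made up) for the binary case, where k = 1, T(y) = y, and ∂A/∂ω is the sigmoid, so the update is plain gradient ascent on l(ω) for logistic regression:

import numpy as np

rng = np.random.default_rng(0)

# toy data; x includes the constant feature x_0 = 1
N, d = 200, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])   # shape (N, d+1)
w_true = np.array([-0.5, 2.0, -1.0])
y = (rng.random(N) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# (11) with T(y) = y: grad_s = sum_i ( y_i - sigmoid(omega · x_i) ) x_is
w = np.zeros(d + 1)
lr = 0.1
for _ in range(2000):
    grad = X.T @ (y - sigmoid(X @ w))
    w += lr * grad / N
print(w)                                       # a rough estimate of w_true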

2.3 Multiclass classification (softmax regression)

We now look at the multiclass classification problem. Formula (11) is the derivative formula with which gradient descent can be used.

But for multiclass regression, the parametrization has the redundancy we talked about in Lecture 3. As usual, let y_i = I(y = i) and let Prob[Y = i] = µ_i. Then we must have y_1 + ··· + y_k = 1 and µ_1 + ··· + µ_k = 1. Thus the probability (PMF) can be written as

P(y) = µ_1^{y_1} µ_2^{y_2} ··· µ_k^{y_k}.

Rewrite P(y) by

P(y) = µ_1^{y_1} µ_2^{y_2} ··· µ_{k−1}^{y_{k−1}} µ_k^{(1 − ∑_{i=1}^{k−1} y_i)}
     = exp[ y_1 log µ_1 + ··· + y_{k−1} log µ_{k−1} + (1 − ∑_{i=1}^{k−1} y_i) log µ_k ]
     = exp[ ∑_{i=1}^{k−1} y_i log(µ_i/µ_k) + log µ_k ].

Note that when k = 2, this is exactly the Bernoulli distribution. Define θ_i = log(µ_i/µ_k) and T_i(y) = y_i for i = 1, ···, k − 1. Using the facts that µ_i = µ_k e^{θ_i} and 1 − µ_k = ∑_{i=1}^{k−1} µ_i, and solving for µ_k and then for µ_j, we get

µ_k = 1 / (1 + ∑_{i=1}^{k−1} e^{θ_i})
µ_j = e^{θ_j} / (1 + ∑_{i=1}^{k−1} e^{θ_i}),   (12)

for j = 1, ···, k − 1. The expression on the right hand side of (12) is called the generalized sigmoid (softmax) function. Therefore P(y) can be written in the exponential family form as

Pθ(y) = exp[θ · T (y) − A(θ)], where θ = (θ1, ··· , θk−1),

T(y) = (y_1, ···, y_{k−1}), and

A(θ) = − log µ_k = log( 1 + ∑_{i=1}^{k−1} e^{θ_i} ).

Note that the mean parameter µ_i is given as

µ_i = E[T_i(Y)] = E[Y_i] = Prob[Y_i = 1] = Prob[Y = i],

which again can be calculated by the fact µ = ∇_θA. Indeed one can verify that

∂A/∂θ_j = e^{θ_j} / (1 + ∑_{i=1}^{k−1} e^{θ_i}) = µ_j.

We have shown above that θ_j = log(µ_j/µ_k). Thus for j = 1, ···, k − 1, we have

θ_j = log( µ_j / (1 − ∑_{i=1}^{k−1} µ_i) ).

The expression on the right side of the above equation is called the generalized logit function.
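The generalized sigmoid (softmax) and the generalized logit are easily checked to be inverses of each other; the sketch below (added for illustration; k = 4 and the θ values are made up) does so numerically:

import numpy as np

theta = np.array([0.3, -1.2, 2.0])                     # theta_1, ..., theta_{k-1} with k = 4
denom = 1.0 + np.exp(theta).sum()
mu = np.append(np.exp(theta) / denom, 1.0 / denom)     # (mu_1, ..., mu_k) via (12); sums to 1

theta_back = np.log(mu[:-1] / (1.0 - mu[:-1].sum()))   # generalized logit applied to mu
print(mu.sum(), np.allclose(theta_back, theta))        # 1.0, True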

2.4 Probit regression

Recall that the essence of the GLM for the exponential family is the way it links the (outside) features x_1, ···, x_d to the probability model. To be specific, given the probability model (8), the linking was done by setting θ = ψ(µ) = ω · x so that the conditional probability becomes

P (y | x) = exp[(ω · x)T (y) − A(ω · x) + C(y)].

But there is no a priori reason why µ has to be related to ω · x via the canonical link function only. One may use any function as long as it is invertible. So set g(µ) = ω · x, where g : R^k → R^k is an invertible function. We call g a link function, as opposed to the canonical link function, and its inverse g^{−1} the mean function. So we have

g(µ) = ω · x : link function
µ = g^{−1}(ω · x) : mean function.

Let Φ(t) be the cumulative distribution function (CDF) of the standard normal distribution N(0, 1), i.e.,

Φ(t) = (1/√(2π)) ∫_{−∞}^t e^{−s²/2} ds.

The probit regression deals with the following situation: the output y is binary, taking on values in {0, 1}, and its probability model is Bernoulli. So P(y) = µ^y (1 − µ)^{1−y}, where µ = P(y = 1). Its link to the outside variables (features) is via the link function g = Φ^{−1}. Thus

g(µ) = ω · x = ω_0 + ω_1 x_1 + ··· + ω_d x_d
µ = g^{−1}(ω · x) = Φ(ω · x).   (13)

Therefore,

P(y | x) = Φ(ω · x)^y (1 − Φ(ω · x))^{1−y}.

Since this is still a Bernoulli, hence an exponential family distribution, the general machinery of GLM can be employed. But in the probit regression we change tack and take a different approach. First, we let y ∈ {−1, 1}. Then since

µ = E[Y] = 1 · P(y = 1) + (−1) · P(y = −1) = P(y = 1) − [1 − P(y = 1)],

we have µ = 2P(y = 1) − 1. Thus P(y = 1) = (1/2)(1 + µ) and P(y = −1) = (1/2)(1 − µ). Therefore,

P(y) = (1/2)(1 + yµ).

Thus using (13), we have

P(y | x) = (1/2)[1 + yΦ(ω · x)].

Now let the data D = {(x_i, y_i)}_{i=1}^n be given. Then the log likelihood function is

l(ω) = ∑_{i=1}^n log P(y_i | x_i) = ∑_{i=1}^n log( (1/2)[1 + y_i Φ(ω · x_i)] ).

Thus

∂l(ω)/∂ω_k = ∑_{i=1}^n [ y_i Φ′(ω · x_i) / (1 + y_i Φ(ω · x_i)) ] x_ik,

which is in a neat form to apply the numerical methods introduced in Lecture 3, where Φ′(ω · x) = (1/√(2π)) e^{−(ω·x)²/2}.
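The following small sketch (added for illustration; the data and step size are made up, and scipy's norm.cdf and norm.pdf stand in for Φ and Φ′) runs plain gradient ascent on this probit log likelihood; note that, unlike the logistic case, there is no concavity guarantee here, so this is only a toy demonstration:

import numpy as np
from scipy.stats import norm                   # norm.cdf = Phi, norm.pdf = Phi'

rng = np.random.default_rng(1)
n, d = 300, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])   # x_0 = 1 included
w_true = np.array([0.3, 1.0, -1.5])
p_plus = 0.5 * (1.0 + norm.cdf(X @ w_true))    # P(y = +1 | x) under the model above
y = np.where(rng.random(n) < p_plus, 1.0, -1.0)

w = np.zeros(d + 1)
lr = 0.05
for _ in range(3000):
    t = X @ w
    coef = y * norm.pdf(t) / (1.0 + y * norm.cdf(t))   # per-sample factor in dl/dw above
    w += lr * (X.T @ coef) / n
print(w)                                       # should drift toward w_true on this toy data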

Comments.

(1). When one uses a link function which is not canonical, the probability model does not even have to belong to the exponential family. As long as one can get hold of the conditional distribution P(y | x), one is in business. Namely, with it one can form the likelihood function and apply the rest of the maximum likelihood machinery.

(2). One may not even need the probability distribution as long as one can somehow estimate ω so that the decisions can be made.

(3). Historically the GLM was first developed for the exponential family but was later extended to the non-exponential family and even to the case where the distribution is not completely known.

2.5 GLM for other distributions

2.5.1 GLM for Poisson(µ)

Recall that the Poisson distribution is the probability distribution (PMF) on the non-negative integers given by

P(n) = e^{−µ} µ^n / n!,

for n = 0, 1, 2, ···. It is an exponential family distribution as it can be written as

P(y) = e^{−µ} µ^y / y! = exp{ y log µ − µ − log(y!) },

where y = 0, 1, 2, ···. It can be put in the exponential family form by setting

θ = log µ : canonical link function
µ = e^θ : mean function.

One can independently check that

E[Y] = ∑_{n=0}^∞ n P(n) = µ.
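As a sketch of the Poisson GLM in action (added for illustration; data and step size are made up), the generic gradient formula (11) specializes, with T(y) = y and mean function e^θ, to ∑_i (y_i − e^{ω·x_i}) x_i:

import numpy as np

rng = np.random.default_rng(2)
N, d = 400, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
w_true = np.array([0.5, 0.8, -0.3])
y = rng.poisson(np.exp(X @ w_true))            # counts with mean exp(omega_true · x)

w = np.zeros(d + 1)
lr = 0.05
for _ in range(2000):
    grad = X.T @ (y - np.exp(X @ w))           # sum_i (y_i - e^{omega · x_i}) x_i
    w += lr * grad / N
print(w)                                       # roughly recovers w_true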

2.5.2 GLM for Γ(α, λ)

Recall that the Gamma distribution is written in the following form:

P(y) = λ e^{−λy} (λy)^{α−1} / Γ(α), for y ≥ 0
     = exp{ log λ − λy + (α − 1) log(λy) − log Γ(α) }
     = exp{ α log λ − λy + (α − 1) log y − log Γ(α) }
     = exp{ [ −λφ y − (− log λ) ]/φ + (1/φ − 1) log y − log Γ(1/φ) }.

Set φ = 1/α as the dispersion parameter and let θ = −λφ. Then P(y) is an exponential family distribution written as:

P_θ(y) = exp{ [ θy − (− log(−θ)) ]/φ + C(y, φ) },

where

C(y, φ) = (1/φ) log(1/φ) + (1/φ − 1) log y − log Γ(1/φ).

Therefore the log partition function is A(θ) = − log(−θ). Differentiating this we get the mean parameter

µ = A′(θ) = −1/θ.

Since E[Y] = α/λ for Y ∼ Γ(α, λ), this fact is independently verified. To recap, the mean function and the canonical link function are given by

µ = ψ^{−1}(θ) = −1/θ : mean function
θ = ψ(µ) = −1/µ : canonical link function.
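A quick Monte Carlo check (added for illustration; α and λ are made up) of the facts just derived, namely that µ = −1/θ = α/λ and that the canonical link −1/µ recovers θ:

import numpy as np

rng = np.random.default_rng(3)
alpha, lam = 3.0, 2.0
phi = 1.0 / alpha                              # dispersion parameter
theta = -lam * phi                             # canonical parameter theta = -lambda * phi

samples = rng.gamma(shape=alpha, scale=1.0 / lam, size=200_000)
print(samples.mean(), -1.0 / theta, alpha / lam)   # all three approximately 1.5

# the canonical link and the mean function are inverses of each other
mu = -1.0 / theta
print(np.isclose(-1.0 / mu, theta))            # True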

In practice, this canonical link function is rarely used. Instead one uses one of the following three alternatives.

(i) the inverse link g(µ) = 1/µ,

(ii) the log link g(µ) = log µ,

(iii) the identity link g(µ) = µ.

2.5.3 GLM Summary

The GLM link and inverse link (mean) functions for various probability models are summarized in the following Table 1. Here, the range of j in the link and the inverse link of the categorical distribution is j = 1, ···, k − 1, although µ = (µ_1, ···, µ_k).

Distribution     | Range             | Link                                        | Inverse link (mean)                          | Dispersion
N(µ, σ²)         | (−∞, ∞)           | θ = µ                                       | µ = θ                                        | σ²
Bernoulli(µ)     | {0, 1}            | θ = logit(µ)                                | µ = σ(θ)                                     | 1
Categorical(µ)   | {1, ···, k}       | θ_j = log( µ_j / (1 − ∑_{i=1}^{k−1} µ_i) )  | µ_j = e^{θ_j} / (1 + ∑_{i=1}^{k−1} e^{θ_i})  | 1
Probit           | {0, 1} or {−1, 1} | Φ^{−1}                                      | Φ                                            | 1
Poisson(µ)       | Z_+               | θ = log µ                                   | µ = e^θ                                      | 1
Gamma(α, λ)      | R_+               | θ = 1/µ                                     | µ = 1/θ                                      | 1/α

Table 1: GLM Summary
