
Lectures on Machine Learning (Fall 2017)

Hyeong In Choi
Seoul National University

Lecture 4: Exponential family of distributions and generalized linear model (GLM) (Draft: version 0.9.2)

Topics to be covered:

• Exponential family of distributions

• Mean parameters and (canonical) link functions

• Convexity of the log partition function

• Generalized linear model (GLM)

• Various GLM models

1 Exponential family of distributions

In this section, we study a family of distributions called the exponential family (of distributions). It is of a special form, but most, if not all, of the well-known probability distributions belong to this class.

1.1 Definition

Definition 1. A probability distribution (PDF or PMF) is said to belong to the exponential family of distributions (in natural or canonical form) if it is of the form

P_θ(y) = (1/Z(θ)) h(y) e^{θ·T(y)},   (1)

where y = (y_1, ···, y_m) is a point in R^m; θ = (θ_1, ···, θ_k) ∈ R^k is a vector called the canonical (natural) parameter; T : R^m → R^k is a map T(y) = (T_1(y), ···, T_k(y)); and Z(θ) = ∫ h(y) e^{θ·T(y)} dy is called the partition function, while its logarithm, A(θ) = log Z(θ), is called the log partition (cumulant) function.

Remark. In this lecture and throughout this course, the “dot” notation as in θ · T(y) always denotes the inner (dot) product of two vectors.

Equivalently, Pθ(y) can be written in the form

P_θ(y) = exp[θ · T(y) − A(θ) + C(y)],   (2)

where C(y) = log h(y). More generally, one sometimes introduces an extra parameter φ, called the dispersion parameter, to control the shape of P_θ(y) by

P_θ(y) = exp[ (θ · T(y) − A(θ))/φ + C(y, φ) ].

1.2 Standing assumption

In our discussion of the exponential family of distributions, we always assume the following.

• In case Pθ(y) is a probability density function (PDF), it is assumed to be continuous as a function of y. It means that there is no singularity in the probability Pθ(y);

• In case P_θ(y) is a probability mass function (PMF), the set of discrete values of y on which P_θ(y) is positive is the same for all θ.

If P_θ(y) satisfies either condition, we say it is regular, which is always assumed throughout this course.

Remark. Sometimes people use the more general form of P_θ(y) and write

P_θ(y) = (1/Z(θ)) h(y) e^{η(θ)·T(y)}.

But most of the time the same result can be obtained without the use of the general form η(θ). So it is not much of a loss of generality to stick to our convention of using just θ.

1.3 Examples

Let us now look at a few illustrative examples.

(1) Bernoulli(µ) The Bernoulli distribution is perhaps the simplest member of the exponential family. Let Y be a random variable taking its binary value in {0, 1}. Let µ = P[Y = 1]. Its distribution (in fact, the probability mass function, PMF) is then succinctly written as P(y) = µ^y (1 − µ)^{1−y}. Then,

P(y) = µ^y (1 − µ)^{1−y}
     = exp[ y log µ + (1 − y) log(1 − µ) ]
     = exp[ y log(µ/(1 − µ)) + log(1 − µ) ].

Letting T(y) = y and θ = log(µ/(1 − µ)), and recalling the definitions of the logit and σ (sigmoid) functions given in Lecture 3, we have

logit(µ) = log( µ/(1 − µ) ).

Thus θ = logit(µ).   (3)

Its inverse function is the sigmoid function:

µ = σ(θ), (4)

where

σ(θ) = 1/(1 + e^{−θ}).

Therefore, we have

A(θ) = − log(1 − µ) = log(1 + e^θ).   (5)

Thus Pθ(y), written in canonical form as in (2), becomes

P_θ(y) = exp[θ · y − log(1 + e^θ)].
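As a quick numerical sanity check (a small Python sketch added for illustration, not part of the original text; the value of µ is made up), the canonical form above reproduces µ^y (1 − µ)^{1−y}:

import numpy as np

mu = 0.3                                  # made-up Bernoulli parameter
theta = np.log(mu / (1 - mu))             # canonical parameter: theta = logit(mu)
A = np.log(1 + np.exp(theta))             # log partition function A(theta) = log(1 + e^theta)

for y in (0, 1):
    canonical = np.exp(theta * y - A)     # exp[theta*y - A(theta)]
    standard = mu**y * (1 - mu)**(1 - y)  # mu^y (1 - mu)^(1 - y)
    print(y, canonical, standard)         # the two values agree for each y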

(2) The exponential distribution models the waiting time between independent arrivals. Its distribution (the probability density function, PDF) is given as

P_θ(y) = θ e^{−θy} I(y ≥ 0).

To put it in the exponential family form, we use the same θ as the canonical parameter and we let T(y) = −y and h(y) = I(y ≥ 0). Since

Z(θ) = 1/θ = ∫ e^{−θy} I(y ≥ 0) dy,

it is already in the canonical form given as in (1).

(3) The normal (Gaussian) distribution given by

P(y) = (1/√(2πσ²)) exp( −(y − µ)²/(2σ²) )

is the single most well-known distribution. As far as its relation with the exponential family is concerned, there are two views. (Here, this σ is a standard deviation, not the sigmoid function.)

• 1st view (σ² as a dispersion parameter) This is the case when the main focus is the mean. In this case, σ² is regarded as known or as a parameter that can be fiddled with as if known. Writing P(y) in the form

P_θ(y) = exp[ ( −(1/2)y² + yµ − (1/2)µ² )/σ² − (1/2) log(2πσ²) ],

one can see right away it is in the form of the exponential family if we set

θ = (θ_1, θ_2) = (µ, 1)
T(y) = (y, −(1/2) y²)
φ = σ²
A(θ) = (1/2) µ² = (1/2) θ_1²
C(y, φ) = −(1/2) log(2πσ²).

• 2nd view (φ = 1) When both µ and σ are to be treated as unknown, we take this point of view. Here, we set the dispersion parameter φ = 1. Writing out P_θ(y) we have

P_θ(y) = exp[ −(1/(2σ²)) y² + (µ/σ²) y − (1/(2σ²)) µ² − log σ − (1/2) log 2π ].

Thus it is easy to see the following:

T(y) = (y, (1/2) y²)
θ = (θ_1, θ_2) = (µ/σ², −1/σ²)
A(θ) = (1/(2σ²)) µ² + log σ = −θ_1²/(2θ_2) − (1/2) log(−θ_2)
C(y) = −(1/2) log(2π).

1.4 Properties of exponential family

The log partition function A(θ) plays a key role, so let us now look at it more carefully. First, since

∫ P_θ(y) dy = ∫ exp[ (θ · T(y) − A(θ))/φ + C(y, φ) ] dy = 1,

taking ∇_θ, we have

∫ exp[ (θ · T(y) − A(θ))/φ + C(y, φ) ] · (T(y) − ∇_θA(θ))/φ dy = 0.

Thus

∫ exp[ (θ · T(y) − A(θ))/φ + C(y, φ) ] T(y) dy = ∫ exp[ (θ · T(y) − A(θ))/φ + C(y, φ) ] ∇_θA(θ) dy
= ∇_θA(θ) ∫ exp[ (θ · T(y) − A(θ))/φ + C(y, φ) ] dy.

Since

∫ exp[ (θ · T(y) − A(θ))/φ + C(y, φ) ] dy = 1,

we have the following:

Proposition 1. ∇_θA(θ) = ∫ P_θ(y) T(y) dy = E[T(Y)],

where Y is the random variable with distribution P_θ(y). Writing this componentwise, we have

∂A/∂θ_i = ∫ exp[ (θ · T(y) − A(θ))/φ + C(y, φ) ] T_i(y) dy.

Taking the second partial derivative, we have

∂²A/(∂θ_i ∂θ_j) = (1/φ) ∫ exp[ (θ · T(y) − A(θ))/φ + C(y, φ) ] ( T_j(y) − ∂A(θ)/∂θ_j ) T_i(y) dy
= (1/φ) [ ∫ exp[ (θ · T(y) − A(θ))/φ + C(y, φ) ] T_i(y) T_j(y) dy − (∂A(θ)/∂θ_j) ∫ exp[ (θ · T(y) − A(θ))/φ + C(y, φ) ] T_i(y) dy ].

Using Proposition 1 once more, we have

∂²A/(∂θ_i ∂θ_j) = (1/φ) { E[T_i(Y)T_j(Y)] − E[T_i(Y)] E[T_j(Y)] }
= (1/φ) E[ (T_i(Y) − E[T_i(Y)]) (T_j(Y) − E[T_j(Y)]) ]
= (1/φ) Cov(T_i(Y), T_j(Y)).

Therefore we have the following result on the Hessian of A.

Proposition 2. D_θ² A = ( ∂²A/(∂θ_i ∂θ_j) ) = (1/φ) Cov(T(Y), T(Y)) = (1/φ) Cov(T(Y)),

where Y is the random variable with distribution P_θ(y). Here, Cov(T(Y), T(Y)) denotes the covariance matrix of T(Y) with itself, which is sometimes called the variance matrix. Since the covariance matrix is always positive semi-definite, we have

Corollary 1. A(θ) is a convex function of θ.

Remark. In most exponential family models, Cov(T (Y )) is positive definite, in which case A(θ) is strictly convex. In this lecture and throughout the course, we always assume that Cov(T (Y )) is positive definite and therefore that A(θ) is strictly convex.
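Propositions 1 and 2 are easy to check numerically. The following Monte Carlo sketch (added for illustration; the value of θ and the sample size are made up) does so for the exponential distribution of Example (2), where T(y) = −y and A(θ) = log Z(θ) = −log θ, so that A′(θ) = −1/θ and A″(θ) = 1/θ²:

import numpy as np

rng = np.random.default_rng(0)
theta = 1.7                                    # made-up canonical parameter
Y = rng.exponential(scale=1.0 / theta, size=1_000_000)
T = -Y                                         # sufficient statistic T(y) = -y

# Proposition 1: dA/dtheta = E[T(Y)]
print(-1.0 / theta, T.mean())                  # both approximately -1/theta

# Proposition 2 (phi = 1): d^2A/dtheta^2 = Var(T(Y))
print(1.0 / theta**2, T.var())                 # both approximately 1/theta^2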

1.5 Maximum likelihood estimation

Let D = {y^(i)}_{i=1}^N be a given set of IID samples, where y^(i) ∈ R^m. Then its likelihood function L(θ) and the log likelihood function l(θ) are given by

L(θ) = ∏_{i=1}^N exp[ (θ · T(y^(i)) − A(θ))/φ + C(y^(i), φ) ]
l(θ) = (1/φ) [ ∑_{i=1}^N θ · T(y^(i)) − N A(θ) ] + ∑_{i=1}^N C(y^(i), φ).

Since A(θ) is strictly convex, l(θ) has a unique maximum at θ̂ that is the unique solution of ∇_θ l(θ) = 0. Note that

∇_θ l(θ) = (1/φ) [ ∑_{i=1}^N T(y^(i)) − N ∇_θA(θ) ].

So we have the following:

Proposition 3. There is a unique θ̂ that maximizes l(θ), at which

∇_θA(θ)|_{θ=θ̂} = (1/N) ∑_{i=1}^N T(y^(i)).

1.5.1 Example: Bernoulli Distribution

Let us apply the above argument to the Bernoulli distribution Bernoulli(µ) and see what it leads to. First, differentiating A(θ) as in (5), we get

A′(θ) = 1/(1 + e^{−θ}) = σ(θ).

Thus by Proposition 3 we have a unique θ̂ satisfying

σ(θ̂) = (1/N) ∑_{i=1}^N T(y^(i)).

Since T(y) = y, and y takes on 0 or 1 as its value, we have T(y) = I(y = 1). Defining µ̂ = σ(θ̂) by (4), we then have

µ̂ = σ(θ̂) = (1/N) ∑_{i=1}^N I(y^(i) = 1) = (Number of 1’s in the data {y^(i)}) / N.

This confirms the usual estimate µ̂ of µ.
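The same conclusion can be reached purely numerically: the sketch below (made-up data, added for illustration) solves the moment-matching equation of Proposition 3 by Newton's method and recovers exactly the fraction of 1's:

import numpy as np

y = np.array([1, 0, 0, 1, 1, 1, 0, 1])         # made-up binary data
target = y.mean()                              # (1/N) sum_i T(y^(i)), since T(y) = y

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

theta = 0.0
for _ in range(50):                            # Newton's method on sigma(theta) = target
    f = sigmoid(theta) - target
    fprime = sigmoid(theta) * (1.0 - sigmoid(theta))   # = A''(theta)
    theta -= f / fprime

print(sigmoid(theta), target)                  # mu_hat equals the fraction of 1's in the data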

Remark. T is a sufficient statistic. It is well known that the statistic S(y_1, ···, y_n) = ∑_{i=1}^n T(y_i) is sufficient, which means that the estimate of θ is constant on the level set

{(y1, ··· , yn) | S(y1, y2, ··· , yn) = const}.

This provides many advantages in estimating θ, as one need not look inside the level sets.

1.6 Maximum Entropy Distribution

The exponential family is useful in that it is the most random distribution under some constraints. Let Y be a random variable with an unknown distribution P(y). Suppose the expected values of certain features f_1(y), ···, f_k(y) are nonetheless known. Namely,

∫ f_i(y) P(y) dy = C_i   (6)

is known for some given constant C_i for i = 1, ···, k. The question is how to fix P(y) so that it is as random as possible while satisfying the above constraints. One approach is to construct a maximum entropy distribution satisfying the constraints (6). The entropy is a measure of randomness and, in general, the higher it is, the more random the distribution is deemed. The definition of entropy is

H = −∑_y P(y) log P(y)

for the discrete case; for the continuous case, the sum becomes an integral so that

H = −∫ P(y) log P(y) dy.

We will use the integral notation here without loss of generality. To solve the problem, we use the Lagrange multiplier method to set up the following variational problem for J(P, λ):

J(P, λ) = −∫ P(y) log P(y) dy + λ_0 ( 1 − ∫ P(y) dy ) + ∑_{i=1}^k λ_i ( C_i − ∫ f_i(y) P(y) dy ).

The solution is found as the critical point. Namely, setting the variational (functional or Fréchet) derivative equal to 0,

δJ/δP = −1 − log P(y) − λ_0 − ∑_{i=1}^k λ_i f_i(y) = 0.   (7)

To see where this comes from, let η be any smooth function with compact support and define

J(P + εη, λ) = −∫ (P + εη) log(P + εη) dy + λ_0 ( 1 − ∫ (P + εη) dy ) + ∑_i λ_i ( C_i − ∫ f_i (P + εη) dy ).

Taking the derivative with respect to ε at ε = 0 and setting it equal to 0, we get

d/dε J(P + εη, λ)|_{ε=0} = −∫ (η + η log P) dy − λ_0 ∫ η dy + ∑_i λ_i ( −∫ f_i η dy )
= ∫ ( −1 − log P − λ_0 − ∑_{i=1}^k λ_i f_i ) η dy
= 0.

Since the above formula is true for any η, we have

−1 − log P − λ_0 − ∑_{i=1}^k λ_i f_i = 0,

which is (7). Solving (7) for P(y), we have

P(y) = (1/Z) exp( −∑_{i=1}^k λ_i f_i(y) ),

where Z = e^{1+λ_0}. Now

1 = ∫ P(y) dy = (1/Z) ∫ exp( −∑_{i=1}^k λ_i f_i(y) ) dy.

Thus

Z = ∫ exp( −∑_{i=1}^k λ_i f_i(y) ) dy.

Therefore P(y) belongs to the exponential family of distributions. This kind of distribution is in general called the Gibbs distribution.
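For a concrete finite example (added for illustration; the support {1, ···, 6} and the constraint value are made up), the maximum entropy distribution on {1, ···, 6} with a single constraint E[f(Y)] = C, f(y) = y, has the Gibbs form P(y) ∝ exp(−λy); the sketch below solves for the multiplier λ by bisection:

import numpy as np

y = np.arange(1, 7)                            # support {1, ..., 6}
C = 4.5                                        # prescribed mean E[Y] = C (made up)

def mean_under(lam):
    # Gibbs form: P(y) proportional to exp(-lam * f(y)) with f(y) = y
    w = np.exp(-lam * y)
    p = w / w.sum()
    return float((p * y).sum())

lo, hi = -20.0, 20.0                           # mean_under is decreasing in lam
for _ in range(200):                           # bisection for mean_under(lam) = C
    mid = 0.5 * (lo + hi)
    if mean_under(mid) > C:
        lo = mid
    else:
        hi = mid

lam = 0.5 * (lo + hi)
P = np.exp(-lam * y); P /= P.sum()
print(P, (P * y).sum())                        # the maximum entropy PMF and its mean (about 4.5)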

2 Generalized linear model (GLM)

2.1 Mean parameter and canonical link function

Before we embark on the generalized linear model, we need the following elementary fact.

Lemma 1. Let f : R^k → R be a strictly convex C^1 function. Then ∇_x f : R^k → R^k is an invertible function.

Proof. Let p ∈ R^k be any given vector. By the strict convexity and the absence of corners (the C^1 condition), there is a unique hyperplane H in R^{k+1} that is tangent to the graph of y = f(x), having the normal vector of the form (p, −1). Let (x, y) be the point of contact. Since the graph is the zero set of F(x, y) = f(x) − y, its normal vector at (x, y) is (∇f(x), −1). Therefore we must have p = ∇f(x).

Figure 1: Graph of a strictly convex function with a tangent plane

Let us now look at the exponential family

P_θ(y) = exp[ θ · T(y) − A(θ) + C(y) ],   (8)

where the dispersion parameter φ is set to be 1 for the sake of simplicity of the presentation. (The argument does not change much even with the presence of the dispersion parameter.) Recall that Proposition 1 says that ∇_θA(θ) = E[T(Y)]. We always assume that the log partition function A : R^k → R is strictly convex and C^1. Thus by the above Lemma, ∇_θA : R^k → R^k is invertible. We now set up a few terminologies.

Definition 2. The mean parameter µ is defined as

µ = E[T(Y)] = ∇_θA(θ),

and the function ∇_θA : R^k → R^k is called the mean function.

Since ∇_θA is invertible, we define:

Definition 3. The inverse function (∇_θA)^{−1} of ∇_θA is called the canonical link function and is denoted by ψ. Thus

θ = ψ(µ) = (∇_θA)^{−1}(µ).

Therefore, combining the above definitions, the mean function is also written as

µ = ψ^{−1}(θ) = ∇_θA(θ).

Example: Bernoulli(µ) Recall, for Bernoulli(µ), P(y) is given by

P(y) = µ^y (1 − µ)^{1−y} = exp[ y log(µ/(1 − µ)) + log(1 − µ) ]
     = exp[ θy − log(1 + e^θ) ] = exp[ θy − A(θ) ],

where θ = log(µ/(1 − µ)) and A(θ) = log(1 + e^θ). Thus

µ = sigmoid(θ) = A′(θ)
θ = logit(µ) = (A′)^{−1}(µ).

Therefore for the Bernoulli distribution the sigmoid function is the mean function and the logit function is the canonical link function.
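A short numerical check (added for illustration; the θ values are made up) that for the Bernoulli distribution the sigmoid is indeed the mean function A′(θ) and the logit is its inverse:

import numpy as np

A = lambda t: np.log(1.0 + np.exp(t))          # log partition function
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
logit = lambda m: np.log(m / (1.0 - m))

theta = np.array([-2.0, -0.5, 0.0, 1.3, 3.0])  # made-up canonical parameters
h = 1e-6
A_prime = (A(theta + h) - A(theta - h)) / (2 * h)        # finite-difference A'(theta)

print(np.allclose(A_prime, sigmoid(theta), atol=1e-5))   # True: the mean function is the sigmoid
print(np.allclose(logit(sigmoid(theta)), theta))         # True: the logit inverts it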

2.2 Classification in GLM

From now on in this lecture, to simplify notation, we use ω to stand for both w and b, and we extend x ∈ R^d to x ∈ R^{d+1} by adding x_0 = 1. (In the previous lectures, we used θ as a generic term to represent both w and b. But in the exponential family notation, θ is reserved to denote the canonical parameter. So we are forced to use ω. One more note: although the typography is difficult to discern, this ω is the Greek lower case ‘omega,’ not the English w.)

2.2.1 GLM recipe

With this notation, setting ω = (b, w_1, ···, w_d), the expression in (1) of Lecture 3 is now written as ω · x, and thus the conditional probability in Lecture 3 is written as

P(y = 1 | x) = e^{ω·x} / (1 + e^{ω·x})
P(y = 0 | x) = 1 / (1 + e^{ω·x}),

which can be simplified as

P(y | x) = e^{(ω·x)y} / (1 + e^{ω·x}),

for y ∈ {0, 1}. Using (5), this is easily seen to be equivalent to the following conditional probability model

P (y | x) = exp [(ω · x)y − A(ω · x)] = exp [θy − A(θ)] .

Summarizing this process, we have the following:

• GLM Recipe: P(y | x) is obtained from P_θ(y) by replacing the canonical parameter θ in (8) with a linear expression in x. In the binary case, the linear expression is ω · x. The multiclass case mimics it in exactly the same way. To describe it correctly, first define the parameter matrix

    [ (ω_1)^T ]   [ ω_10 ··· ω_1j ··· ω_1d ]
    [    ⋮    ]   [   ⋮          ⋮         ]
W = [ (ω_i)^T ] = [ ω_i0 ··· ω_ij ··· ω_id ] .   (9)
    [    ⋮    ]   [   ⋮          ⋮         ]
    [ (ω_k)^T ]   [ ω_k0 ··· ω_kj ··· ω_kd ]

Then replace θ in (8) with W x to get the conditional probability as

P (y | x) = exp[(W x) · T (y) − A(W x) + C(y)]. (10)

Written this way, we can see that

(Wx) · T(y) = ∑_{ℓ=1}^k ∑_{j=0}^d ω_{ℓj} x_j T_ℓ(y).

Now, let D = {(x^(i), y^(i))}_{i=1}^N be a given data set, where the x-part of the i-th data point is represented by the column vector

x^(i) = [x_0^(i), ···, x_d^(i)]^T.

Using the conditional probability as in (10), one can write the likelihood function and the log likelihood function by

L(W) = ∏_{i=1}^N exp[ (W x^(i)) · T(y^(i)) − A(W x^(i)) + C(y^(i)) ]
l(W) = log L(W) = ∑_{i=1}^N [ (W x^(i)) · T(y^(i)) − A(W x^(i)) + C(y^(i)) ]
     = ∑_{i=1}^N [ ∑_{ℓ=1}^k ∑_{j=0}^d ω_{ℓj} x_j^(i) T_ℓ(y^(i)) − A(ω_1^T x^(i), ···, ω_k^T x^(i)) + C(y^(i)) ].

Thus taking the derivative, one gets

∂l(W)/∂ω_rs = ∑_{i=1}^N [ ∑_{ℓ=1}^k ∑_{j=0}^d x_j^(i) δ_{ℓr} δ_{js} T_ℓ(y^(i)) − ∑_{ℓ=1}^k ∑_{j=0}^d (∂A/∂ω_ℓ) x_j^(i) δ_{ℓr} δ_{js} ]
= ∑_{i=1}^N [ x_s^(i) T_r(y^(i)) − (∂A/∂ω_r) x_s^(i) ]
= ∑_{i=1}^N [ T_r(y^(i)) − (∂A/∂ω_r)(W x^(i)) ] x_s^(i).   (11)

This is a generalized form of what we got for logistic regression (see (10) of Lecture 3).
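To see formula (11) in action, here is a minimal sketch (added for illustration; the data, step size, and iteration count are made up) for the binary case, where k = 1, T(y) = y, and ∂A/∂ω is the sigmoid, so the update is plain gradient ascent on l(ω) for logistic regression:

import numpy as np

rng = np.random.default_rng(0)

# toy data; x includes the constant feature x_0 = 1
N, d = 200, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])   # shape (N, d+1)
w_true = np.array([-0.5, 2.0, -1.0])
y = (rng.random(N) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# (11) with T(y) = y: grad_s = sum_i ( y_i - sigmoid(omega · x_i) ) x_is
w = np.zeros(d + 1)
lr = 0.1
for _ in range(2000):
    grad = X.T @ (y - sigmoid(X @ w))
    w += lr * grad / N
print(w)                                       # a rough estimate of w_true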

2.3 Multiclass classification (softmax regression)

We now look at the multiclass classification problem. Formula (11) is the derivative formula with which gradient descent can be used.

But for multiclass regression, the parametrization has the redundancy we talked about in Lecture 3. As usual, let y_i = I(y = i) and let Prob[Y = i] = µ_i. Then we must have y_1 + ··· + y_k = 1 and µ_1 + ··· + µ_k = 1. Thus the probability (PMF) can be written as

P(y) = µ_1^{y_1} µ_2^{y_2} ··· µ_k^{y_k}.

Rewrite P(y) by

P(y) = µ_1^{y_1} µ_2^{y_2} ··· µ_{k−1}^{y_{k−1}} µ_k^{(1 − ∑_{i=1}^{k−1} y_i)}
     = exp[ y_1 log µ_1 + ··· + y_{k−1} log µ_{k−1} + (1 − ∑_{i=1}^{k−1} y_i) log µ_k ]
     = exp[ ∑_{i=1}^{k−1} y_i log(µ_i/µ_k) + log µ_k ].

Note that when k = 2, this is exactly the Bernoulli distribution. Define θ_i = log(µ_i/µ_k) and T_i(y) = y_i for i = 1, ···, k − 1. Using the facts that µ_i = µ_k e^{θ_i} and 1 − µ_k = ∑_{i=1}^{k−1} µ_i, and solving for µ_k and then for µ_j, we get

µ_k = 1 / (1 + ∑_{i=1}^{k−1} e^{θ_i})
µ_j = e^{θ_j} / (1 + ∑_{i=1}^{k−1} e^{θ_i}),   (12)

for j = 1, ···, k − 1. The expression on the right hand side of (12) is called the generalized sigmoid (softmax) function. Therefore P(y) can be written in the exponential family form as

Pθ(y) = exp[θ · T (y) − A(θ)], where θ = (θ1, ··· , θk−1),

T(y) = (y_1, ···, y_{k−1}), and

A(θ) = − log µ_k = log( 1 + ∑_{i=1}^{k−1} e^{θ_i} ).

Note that the mean parameter µ_i is given as

µ_i = E[T_i(Y)] = E[Y_i] = Prob[Y_i = 1] = Prob[Y = i],

which again can be calculated by the fact µ = ∇_θA. Indeed one can verify that

∂A/∂θ_j = e^{θ_j} / (1 + ∑_{i=1}^{k−1} e^{θ_i}) = µ_j.

We have shown above that θ_j = log(µ_j/µ_k). Thus for j = 1, ···, k − 1, we have

θ_j = log( µ_j / (1 − ∑_{i=1}^{k−1} µ_i) ).

The expression on the right side of the above equation is called the generalized logit function.
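The generalized sigmoid (softmax) and the generalized logit are easily checked to be inverses of each other; the sketch below (added for illustration; k = 4 and the θ values are made up) does so numerically:

import numpy as np

theta = np.array([0.3, -1.2, 2.0])                     # theta_1, ..., theta_{k-1} with k = 4
denom = 1.0 + np.exp(theta).sum()
mu = np.append(np.exp(theta) / denom, 1.0 / denom)     # (mu_1, ..., mu_k) via (12); sums to 1

theta_back = np.log(mu[:-1] / (1.0 - mu[:-1].sum()))   # generalized logit applied to mu
print(mu.sum(), np.allclose(theta_back, theta))        # 1.0, True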

2.4 Probit regression

Recall that the essence of the GLM for the exponential family is the way it links the (outside) features x_1, ···, x_d to the probability model. To be specific, given the probability model (8), the linking was done by setting θ = ψ(µ) = ω · x so that the conditional probability becomes

P (y | x) = exp[(ω · x)T (y) − A(ω · x) + C(y)].

But there is no a priori reason why µ has to be related to ω · x via the canonical link function only. One may use any function as long as it is invertible. So set g(µ) = ω · x, where g : R^k → R^k is an invertible function. We call g a link function, as opposed to the canonical link function, and its inverse g^{−1} the mean function. So we have

g(µ) = ω · x : link function
µ = g^{−1}(ω · x) : mean function.

Let Φ(t) be the cumulative distribution function (CDF) of the standard normal distribution N(0, 1), i.e.,

Φ(t) = (1/√(2π)) ∫_{−∞}^t e^{−s²/2} ds.

The probit regression deals with the following situation: the output y is binary, taking on values in {0, 1}, and its probability model is Bernoulli. So P(y) = µ^y (1 − µ)^{1−y}, where µ = P(y = 1). Its link to the outside variables (features) is via the link function g = Φ^{−1}. Thus

g(µ) = ω · x = ω_0 + ω_1 x_1 + ··· + ω_d x_d
µ = g^{−1}(ω · x) = Φ(ω · x).   (13)

Therefore,

P(y | x) = Φ(ω · x)^y (1 − Φ(ω · x))^{1−y}.

Since this is still a Bernoulli, hence an exponential family distribution, the general machinery of GLM can be employed. But in the probit regression we change tack and take a different approach. First, we let y ∈ {−1, 1}. Then since

µ = E[Y] = 1 · P(y = 1) + (−1) · P(y = −1) = P(y = 1) − [1 − P(y = 1)],

we have µ = 2P(y = 1) − 1. Thus P(y = 1) = (1/2)(1 + µ) and P(y = −1) = (1/2)(1 − µ). Therefore,

P(y) = (1/2)(1 + yµ).

Thus using (13), we have

P(y | x) = (1/2)[1 + yΦ(ω · x)].

Now let the data D = {(x_i, y_i)}_{i=1}^n be given. Then the log likelihood function is

l(ω) = ∑_{i=1}^n log P(y_i | x_i) = ∑_{i=1}^n log( (1/2)[1 + y_i Φ(ω · x_i)] ).

Thus

∂l(ω)/∂ω_k = ∑_{i=1}^n [ y_i Φ′(ω · x_i) / (1 + y_i Φ(ω · x_i)) ] x_ik,

which is in a neat form to apply the numerical methods introduced in Lecture 3, where Φ′(ω · x) = (1/√(2π)) e^{−(ω·x)²/2}.
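The following small sketch (added for illustration; the data and step size are made up, and scipy's norm.cdf and norm.pdf stand in for Φ and Φ′) runs plain gradient ascent on this probit log likelihood; note that, unlike the logistic case, there is no concavity guarantee here, so this is only a toy demonstration:

import numpy as np
from scipy.stats import norm                   # norm.cdf = Phi, norm.pdf = Phi'

rng = np.random.default_rng(1)
n, d = 300, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])   # x_0 = 1 included
w_true = np.array([0.3, 1.0, -1.5])
p_plus = 0.5 * (1.0 + norm.cdf(X @ w_true))    # P(y = +1 | x) under the model above
y = np.where(rng.random(n) < p_plus, 1.0, -1.0)

w = np.zeros(d + 1)
lr = 0.05
for _ in range(3000):
    t = X @ w
    coef = y * norm.pdf(t) / (1.0 + y * norm.cdf(t))   # per-sample factor in dl/dw above
    w += lr * (X.T @ coef) / n
print(w)                                       # should drift toward w_true on this toy data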

Comments.

(1). When one uses a link function which is not canonical, the probability model does not even have to belong to the exponential family. As long as one can get hold of the conditional distribution P(y | x), one is in business. Namely, with it one can form the likelihood function and apply the rest of the maximum likelihood machinery.

(2). One may not even need the probability distribution as long as one can somehow estimate ω so that the decisions can be made.

(3). Historically the GLM was first developed for the exponential family but was later extended to the non-exponential family and even to the case where the distribution is not completely known.

2.5 GLM for other distributions

2.5.1 GLM for Poisson(µ)

Recall that the Poisson distribution is the probability distribution (PMF) on the non-negative integers given by

P(n) = e^{−µ} µ^n / n!,

for n = 0, 1, 2, ···. It is an exponential family distribution as it can be written as

P(y) = e^{−µ} µ^y / y! = exp{ y log µ − µ − log(y!) },

where y = 0, 1, 2, ···. It can be put in the exponential family form by setting

θ = log µ : canonical link function
µ = e^θ : mean function.

One can independently check that

E[Y] = ∑_{n=0}^∞ n P(n) = µ.
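As a sketch of the Poisson GLM in action (added for illustration; data and step size are made up), the generic gradient formula (11) specializes, with T(y) = y and mean function e^θ, to ∑_i (y_i − e^{ω·x_i}) x_i:

import numpy as np

rng = np.random.default_rng(2)
N, d = 400, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
w_true = np.array([0.5, 0.8, -0.3])
y = rng.poisson(np.exp(X @ w_true))            # counts with mean exp(omega_true · x)

w = np.zeros(d + 1)
lr = 0.05
for _ in range(2000):
    grad = X.T @ (y - np.exp(X @ w))           # sum_i (y_i - e^{omega · x_i}) x_i
    w += lr * grad / N
print(w)                                       # roughly recovers w_true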

2.5.2 GLM for Γ(α, λ)

Recall that the Gamma distribution is written in the following form:

P(y) = λ e^{−λy} (λy)^{α−1} / Γ(α), for y ≥ 0
     = exp{ log λ − λy + (α − 1) log(λy) − log Γ(α) }
     = exp{ α log λ − λy + (α − 1) log y − log Γ(α) }
     = exp{ [ −λφ y − (− log λ) ]/φ + (1/φ − 1) log y − log Γ(1/φ) }.

Set φ = 1/α as the dispersion parameter and let θ = −λφ. Then P(y) is an exponential family distribution written as:

P_θ(y) = exp{ [ θy − (− log(−θ)) ]/φ + C(y, φ) },

where

C(y, φ) = (1/φ) log(1/φ) + (1/φ − 1) log y − log Γ(1/φ).

Therefore the log partition function is A(θ) = − log(−θ). Differentiating this we get the mean parameter

µ = A′(θ) = −1/θ.

Since E[Y] = α/λ for Y ∼ Γ(α, λ), this fact is independently verified. To recap, the mean function and the canonical link function are given by

µ = ψ^{−1}(θ) = −1/θ : mean function
θ = ψ(µ) = −1/µ : canonical link function.
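A quick Monte Carlo check (added for illustration; α and λ are made up) of the facts just derived, namely that µ = −1/θ = α/λ and that the canonical link −1/µ recovers θ:

import numpy as np

rng = np.random.default_rng(3)
alpha, lam = 3.0, 2.0
phi = 1.0 / alpha                              # dispersion parameter
theta = -lam * phi                             # canonical parameter theta = -lambda * phi

samples = rng.gamma(shape=alpha, scale=1.0 / lam, size=200_000)
print(samples.mean(), -1.0 / theta, alpha / lam)   # all three approximately 1.5

# the canonical link and the mean function are inverses of each other
mu = -1.0 / theta
print(np.isclose(-1.0 / mu, theta))            # True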

In practice, this canonical link function is rarely used. Instead one uses one of the following three alternatives.

(i) the inverse link g(µ) = 1/µ,

(ii) the log link g(µ) = log µ,

(iii) the identity link g(µ) = µ.

2.5.3 GLM Summary

The GLM link and inverse link (mean) functions for various probability models are summarized in the following Table 1. Here, the range of j in the link and the inverse link of the categorical distribution is j = 1, ···, k − 1, although µ = (µ_1, ···, µ_k).

Distribution     | Range             | Link                                        | Inverse link (mean)                          | Dispersion
N(µ, σ²)         | (−∞, ∞)           | θ = µ                                       | µ = θ                                        | σ²
Bernoulli(µ)     | {0, 1}            | θ = logit(µ)                                | µ = σ(θ)                                     | 1
Categorical(µ)   | {1, ···, k}       | θ_j = log( µ_j / (1 − ∑_{i=1}^{k−1} µ_i) )  | µ_j = e^{θ_j} / (1 + ∑_{i=1}^{k−1} e^{θ_i})  | 1
Probit           | {0, 1} or {−1, 1} | Φ^{−1}                                      | Φ                                            | 1
Poisson(µ)       | Z_+               | θ = log µ                                   | µ = e^θ                                      | 1
Gamma(α, λ)      | R_+               | θ = 1/µ                                     | µ = 1/θ                                      | 1/α

Table 1: GLM Summary
