
8 | Maximum likelihood estimation

Maximum likelihood (ML) estimation is a general principle for deriving point estimates in probabilistic models. ML estimation was popularized by Fisher at the beginning of the 20th century, but already found application in the works of Laplace (1749-1827) and Gauss (1777-1855) (Aldrich, 1997). ML estimation is based on the following intuition: the most likely parameter value of a probabilistic model that generated an observed data set should be that parameter value for which the probability of the data under the model is maximal. In this section, we first make this intuition more precise and introduce the notions of (log) likelihood functions and ML estimators (Section 8.1). We then exemplify the ML approach by discussing ML parameter estimation for univariate Gaussian samples (Section 8.2). Finally, we consider ML parameter estimation for the GLM, relate it to ordinary least squares estimation, and introduce the restricted ML estimator for the variance parameter of the GLM (Section 8.3).

8.1 Likelihood functions and maximum likelihood estimators

Likelihood functions The fundamental idea of ML estimation is to select as a point estimate of the true, but unknown, parameter value that gave rise to the data that parameter value which maximizes the probability of the data under the model of interest. To implement this intuition, the notion of the likelihood function and its maximization is invoked. To introduce the likelihood function, consider a parametric probabilistic model $p_\theta(y)$ which specifies the distribution of a random entity $y$. Here, $y$ models data and $\theta$ denotes the model's parameter with parameter space $\Theta$. Given a parametric probabilistic model $p_\theta(y)$, the function

$$L_y : \Theta \to \mathbb{R}_{\geq 0}, \quad \theta \mapsto L_y(\theta) := p_\theta(y) \tag{8.1}$$

is called the likelihood function of the parameter $\theta$ for the data $y$. Note that the specific nature of $\theta$ and $y$ is left unspecified, i.e., $\theta$ and $y$ may be scalars, vectors, or matrices. Notably, the likelihood function is a function of the parameter $\theta$, while it also depends on $y$. Because $y$ is a random entity, different data samples from the probabilistic model $p_\theta(y)$ result in different likelihood functions. In this sense, there is a distribution of likelihood functions for each probabilistic model, but once a data realization has been obtained, the likelihood function is a (deterministic) function of the parameter value only. This is in stark contrast with PDFs and PMFs, which are functions of the random variable's outcome values (Section 5 | Probability and random variables). Stated differently, the input argument of a PDF or PMF is the value of a random variable, and the output of a PDF or PMF is the probability density or mass of this value for a fixed value of the model's parameter. In contrast, the input argument of a likelihood function is a parameter value, and the output of the likelihood function is the probability density or mass of a fixed value of the random variable modelling data for this parameter value under the probability model of interest. If the random variable value and the parameter value submitted to a PDF or PMF of a model and to its corresponding likelihood function are identical, so are the outputs of both functions. It is the functional dependencies that distinguish likelihood functions from PDFs and PMFs, not their functional form.
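For readers who prefer to see this definition in code, the following minimal Python sketch evaluates the likelihood function of the expectation parameter of a univariate Gaussian with known variance for a fixed data realization. The data values and the known standard deviation are illustrative assumptions, not part of the text.

```python
# A minimal sketch of a likelihood function: the Gaussian expectation parameter
# mu is the input, the joint density of a fixed data realization y is the output.
import numpy as np
from scipy.stats import norm

y = np.array([1.2, 0.7, 2.1, 1.5])   # fixed (hypothetical) data realization
sigma = 1.0                           # standard deviation assumed known

def likelihood(mu):
    return np.prod(norm.pdf(y, loc=mu, scale=sigma))

print(likelihood(0.0))                # density of y under mu = 0
print(likelihood(np.mean(y)))         # larger: y is more probable under mu = y-bar
```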

Maximum likelihood estimators

The ML estimator of a given probabilistic model $p_\theta(y)$ is that parameter value which maximizes the likelihood function. Formally, this can be expressed as

$$\hat{\theta}_{\mathrm{ML}} := \arg\max_{\theta \in \Theta} L_y(\theta). \tag{8.2}$$

Eq. (8.2) should be read as follows: $\hat{\theta}_{\mathrm{ML}}$ is defined as that argument of the likelihood function $L_y$ for which $L_y(\theta)$ assumes its maximal value over all possible parameter values $\theta$ in the parameter space $\Theta$. Note that from a mathematical viewpoint, the above definition is not overly general, because it is tacitly assumed that $L_y$ in fact has a maximizing argument and that this argument is unique. Also note that

instead of concrete values for $\hat{\theta}_{\mathrm{ML}}$, one is often interested in functional forms that express $\hat{\theta}_{\mathrm{ML}}$ as a function of the data $y$. Concrete numerical values of $\hat{\theta}_{\mathrm{ML}}$ are referred to as ML estimates, while functional forms of $\hat{\theta}_{\mathrm{ML}}$ are referred to as ML estimators. There are essentially two approaches to ML estimation. The first approach aims to obtain functional forms of ML estimators (sometimes referred to as closed-form solutions) by analytically maximizing the likelihood function with respect to $\theta$. The second approach, often encountered in applied computing, builds on the former and systematically varies $\theta$ given an observation of $y$ while monitoring the numeric value of the likelihood function. Once this value appears to be maximal, varying $\theta$ stops, and the resulting value is used as an ML estimate. In the following, we consider the first approach, which is of immediate relevance for basic parameter estimation in the GLM, in more detail. From Section 3 | Calculus, we know that candidate values for the ML estimator $\hat{\theta}_{\mathrm{ML}}$ fulfil the requirement

$$\frac{d}{d\theta} L_y(\theta)\Big|_{\theta = \hat{\theta}_{\mathrm{ML}}} = 0. \tag{8.3}$$

Eq. (8.3) is known as the likelihood equation and should be read as follows: at the location of $\hat{\theta}_{\mathrm{ML}}$, the derivative $\frac{d}{d\theta} L_y$ of the likelihood function with respect to $\theta$ is equal to zero. If $\theta \in \mathbb{R}^p$, $p > 1$, the statement implies that at the location of $\hat{\theta}_{\mathrm{ML}}$, the gradient $\nabla L_y$ with respect to $\theta$ is equal to the zero vector $0_p$. Clearly, eq. (8.3) corresponds to the necessary condition for extrema of functions. By evaluating the necessary derivatives of the likelihood function and setting them to zero, one may thus obtain a set of equations which can hopefully be solved for an ML estimator.
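The second, numerical approach may be sketched as follows: a general-purpose optimizer varies $\theta$ for a fixed data realization while monitoring the value of the likelihood function. In the Python sketch below, the data values, the known standard deviation, and the optimizer bounds are illustrative assumptions.

```python
# Sketch of the numerical approach to ML estimation: vary theta (here a single
# expectation parameter mu) while monitoring the likelihood function.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

y = np.array([1.2, 0.7, 2.1, 1.5])    # hypothetical data realization
sigma = 1.0                            # standard deviation assumed known

def neg_likelihood(mu):
    # optimizers minimize, so the likelihood is negated
    return -np.prod(norm.pdf(y, loc=mu, scale=sigma))

result = minimize_scalar(neg_likelihood, bounds=(-10.0, 10.0), method="bounded")
print(result.x)        # numerical ML estimate of mu
print(np.mean(y))      # analytical ML estimate (the sample mean), cf. Section 8.2
```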

The log likelihood function To simplify the analytical approach for finding ML estimators as sketched above, one usually considers the logarithm of the likelihood function, the so-called log likelihood function. The log likelihood function is defined as (cf. eq. (8.1))

$$\ell_y : \Theta \to \mathbb{R}, \quad \theta \mapsto \ell_y(\theta) := \ln L_y(\theta) = \ln p_\theta(y). \tag{8.4}$$

Because the logarithm is a monotonically increasing function, the location in parameter space at which the likelihood function assumes its maximal value corresponds to the location in parameter space at which the log likelihood function assumes its maximal value. Using either the likelihood function or the log likelihood function to find a maximum likelihood estimator is thus equivalent, as both will identify the same maximizing value (if it exists). The use of log likelihood functions instead of likelihood functions in ML estimation is primarily of pragmatic nature: first, probabilistic models often involve PDFs with exponential terms that are dissolved by the log transform. Second, independence assumptions often give rise to factorized probability distributions which are simplified to sums by the log transform. Finally, from a numerical perspective, one often deals with PDF or PMF values that are rather close to zero and that are stretched to a broader range by the log transform. In analogy to (8.3), the log likelihood equation for the maximum likelihood estimator is given by

$$\frac{d}{d\theta} \ell_y(\theta)\Big|_{\theta = \hat{\theta}_{\mathrm{ML}}} = 0. \tag{8.5}$$

Like eq. (8.3), the log likelihood equation can be extended to multivariate $\theta$ in terms of the gradient of $\ell_y$, and like eq. (8.3), it can be solved for $\hat{\theta}_{\mathrm{ML}}$. We next aim to exemplify the idea of ML estimation in a first example (Section 8.2). To do so, we first discuss two additional assumptions that simplify the application of the ML approach considerably: the assumption of a concave log likelihood function and the assumption of independent data random variables with associated PDFs. Finally, we summarize the ML method in a recipe-like manner.

Concave log likelihood functions If the log likelihood function is concave, then the necessary condition for a maximum of the log likelihood function is also sufficient. Recall that a multivariate real-valued function $f : \mathbb{R}^n \to \mathbb{R}$ is called concave, if for all input arguments $a, b \in \mathbb{R}^n$ the straight line connecting $f(a)$ and $f(b)$ lies below the function's graph. Formally,

$$f(ta + (1-t)b) \ge t f(a) + (1-t) f(b) \quad \text{for } a, b \in \mathbb{R}^n \text{ and } t \in [0,1]. \tag{8.6}$$


Here, $ta + (1-t)b$ for $t \in [0,1]$ describes a straight line in the domain of the function, while $tf(a) + (1-t)f(b)$ for $t \in [0,1]$ describes a straight line in the range of the function. Leaving mathematical subtleties aside, it is roughly correct that concave functions have a single maximum, or in other words, that a critical point at which the gradient vanishes is guaranteed to be a maximum of the function. Thus, if the log likelihood function is concave, finding a parameter value for which the log likelihood equation holds is sufficient to identify a maximum at this location. In principle, whenever applying the ML method based on the log likelihood equation, it is thus necessary to show that the log likelihood function is concave and that the necessary condition for a maximum is hence also sufficient. However, such an approach is beyond the level of rigour herein, and we content ourselves with stating without proof that the log likelihood functions of interest in the following are concave.

Independent data random variables with probability density functions A second assumption that simplifies the application of the ML method is the assumption of independent data random variables with associated PDFs. To this end, we first note that in the case of more than one data point, the data random entity $y$ corresponds to a random vector comprising the data random variables $y_1, y_2, \ldots, y_n$, i.e. $y := (y_1, \ldots, y_n)^T$. If in addition one assumes that these data random variables are independent and each variable is governed by a PDF that is parameterized by the same parameter vector, then the joint PDF of $y$ can be written as the product of the individual PDFs $p_\theta(y_i)$, $i = 1, \ldots, n$. Formally, we write

$$p_\theta(y) = p_\theta(y_1, \ldots, y_n) = \prod_{i=1}^n p_\theta(y_i). \tag{8.7}$$

Eq. (8.7) may be conceived from two angles: on the one hand, one may think of the random variables $y_i$ as being governed by one and the same underlying probability distribution from which samples are obtained with replacement. Alternatively, one may think of each $y_i$ as being governed by its own individual probability distribution defined in terms of its PDF, all of which are, however, identical. For our purposes, these two angles are equivalent, while the latter conception is somewhat closer to the formal developments below.

Crucially, in the case of independent data random variables $y_1, \ldots, y_n$, the log likelihood function is given by

$$\ell_y(\theta) = \ln p_\theta(y) = \ln \prod_{i=1}^n p_\theta(y_i). \tag{8.8}$$

Repeated application of the product property of the logarithm then allows for expressing the log likelihood function (8.8) as

$$\ell_y(\theta) = \ln p_\theta(y) = \sum_{i=1}^n \ln p_\theta(y_i). \tag{8.9}$$

The evaluation of the logarithm of a product of PDFs $p_\theta(y_i)$ is thus simplified to the summation of the logarithms of the individual PDFs $p_\theta(y_i)$.
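The numerical benefit of working with sums of log densities rather than products of densities can be made concrete with a brief Python sketch; the simulated data and parameter values below are illustrative assumptions.

```python
# Why sums of log densities are preferable numerically: for many data points the
# product of densities underflows to 0.0, while the sum of log densities is finite.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(loc=1.0, scale=3.0, size=2000)       # simulated data, n = 2000

mu, sigma = 1.0, 3.0
print(np.prod(norm.pdf(y, loc=mu, scale=sigma)))    # 0.0 (floating point underflow)
print(np.sum(norm.logpdf(y, loc=mu, scale=sigma)))  # finite log likelihood value
```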

Analytical derivation of maximum likelihood estimators In summary, the developments above suggest the following three-step procedure for the analytical derivation of ML estimators in probabilistic models:

(1) Formulation of the log likelihood function. This step corresponds to writing down the log probability density of a set of data random variables under the model of interest. Special attention has to be paid to the number of observable variables considered and their independence properties.

(2) Evaluation of the log likelihood function's gradient. Often, probabilistic models of interest have more than one parameter and ML estimators for each parameter are required, i.e., the partial derivatives of the log likelihood function with respect to each parameter have to be evaluated. This step is usually eased by the use of PDFs that involve exponential terms and the assumption of independent data random variables.

(3) Solution of the log likelihood equations. Under the assumption of concave log likelihood functions, solving the log likelihood equations yields the location of the maximum of the log likelihood function in parameter space. The parameter values thus obtained then correspond to ML estimators.

We next consider an exemplary application of the maximum likelihood method.


8.2 Maximum likelihood estimation for univariate Gaussian distributions

As a first example of the ML method, we consider the case of $n$ independent and identically distributed random variables $y_1, \ldots, y_n$ with univariate Gaussian distribution and parameter vector $(\mu, \sigma^2) \in \mathbb{R} \times \mathbb{R}_{>0}$. Our aim is to derive ML estimators $\hat{\mu}_{\mathrm{ML}}$ and $\hat{\sigma}^2_{\mathrm{ML}}$ for $\mu$ and $\sigma^2$, respectively. In terms of the general principle discussed above, we thus have $\theta := (\mu, \sigma^2)$ and $\Theta := \mathbb{R} \times \mathbb{R}_{>0}$.

Formulation of the log likelihood function The first step in the application of the ML approach is the formulation of the log likelihood function. For the current example, the distribution of the $i$th random variable $y_i$, $i = 1, \ldots, n$ is governed by a univariate Gaussian distribution with PDF $N(y_i; \mu, \sigma^2)$, and the random variables $y_1, \ldots, y_n$ are assumed to be independent. The PDF of the joint outcome value $y = (y_1, \ldots, y_n)^T \in \mathbb{R}^n$ thus corresponds to the product of the PDFs of each individual $y_i$, $i = 1, \ldots, n$. This can be written as

$$p_{\mu,\sigma^2}(y) = \prod_{i=1}^n p_{\mu,\sigma^2}(y_i) = \prod_{i=1}^n N(y_i; \mu, \sigma^2). \tag{8.10}$$

Because the PDFs of $y_i$, $i = 1, \ldots, n$ are of the form

$$N(y_i; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(y_i - \mu)^2\right), \tag{8.11}$$

we may rewrite (8.10) as

$$p_{\mu,\sigma^2}(y) = \left(2\pi\sigma^2\right)^{-\frac{n}{2}} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2\right), \tag{8.12}$$

as shown below.

Proof. With the laws of exponentiation and the exponentiation property of the exponential function, we have

$$
\begin{aligned}
\prod_{i=1}^n N(y_i; \mu, \sigma^2)
&= \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(y_i - \mu)^2\right) \\
&= \prod_{i=1}^n \left(2\pi\sigma^2\right)^{-\frac{1}{2}} \prod_{i=1}^n \exp\left(-\frac{1}{2\sigma^2}(y_i - \mu)^2\right) \\
&= \left(2\pi\sigma^2\right)^{-\frac{n}{2}} \exp\left(-\sum_{i=1}^n \frac{1}{2\sigma^2}(y_i - \mu)^2\right) \\
&= \left(2\pi\sigma^2\right)^{-\frac{n}{2}} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2\right).
\end{aligned} \tag{8.13}
$$

Based on eq. (8.12), we can write down the likelihood function as

$$L_y : \mathbb{R} \times \mathbb{R}_{>0} \to \mathbb{R}_{>0}, \quad (\mu, \sigma^2) \mapsto L_y(\mu, \sigma^2) := \left(2\pi\sigma^2\right)^{-\frac{n}{2}} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2\right) \tag{8.14}$$

and, as shown below, the corresponding log likelihood function evaluates to

$$\ell_y : \mathbb{R} \times \mathbb{R}_{>0} \to \mathbb{R}, \quad (\mu, \sigma^2) \mapsto \ell_y(\mu, \sigma^2) := -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln \sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2. \tag{8.15}$$


Proof. With the properties of the logarithm, we have

$$
\begin{aligned}
\ln L_y(\mu, \sigma^2)
&= \ln\left(\left(2\pi\sigma^2\right)^{-\frac{n}{2}} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2\right)\right) \\
&= \ln\left(2\pi\sigma^2\right)^{-\frac{n}{2}} + \ln \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2\right) \\
&= -\frac{n}{2}\ln\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2 \\
&= -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln \sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2.
\end{aligned} \tag{8.16}
$$
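As a brief numerical cross-check, the following sketch evaluates eq. (8.15) directly and compares it with the sum of Gaussian log densities computed by scipy; the data values and the point in parameter space are illustrative assumptions.

```python
# Direct implementation of the log likelihood function (8.15), checked against
# the sum of Gaussian log densities as evaluated by scipy.
import numpy as np
from scipy.stats import norm

y = np.array([1.2, 0.7, 2.1, 1.5, 0.3])   # hypothetical data realization
mu, sigma_sq = 1.0, 2.0                    # an arbitrary point in parameter space
n = y.size

ell = (-n / 2 * np.log(2 * np.pi)
       - n / 2 * np.log(sigma_sq)
       - np.sum((y - mu) ** 2) / (2 * sigma_sq))

print(ell)
print(np.sum(norm.logpdf(y, loc=mu, scale=np.sqrt(sigma_sq))))   # identical value
```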

Evaluation of the log likelihood function's gradient The second step in the analytical derivation of ML estimators is the evaluation of the gradient

$$\nabla \ell_y(\mu, \sigma^2) = \begin{pmatrix} \frac{\partial}{\partial \mu} \ell_y(\mu, \sigma^2) \\[4pt] \frac{\partial}{\partial \sigma^2} \ell_y(\mu, \sigma^2) \end{pmatrix}. \tag{8.17}$$

As shown below, for the partial derivative with respect to $\mu$, we have

$$\frac{\partial}{\partial \mu} \ell_y(\mu, \sigma^2) = \frac{1}{\sigma^2}\sum_{i=1}^n (y_i - \mu), \tag{8.18}$$

and for the partial derivative with respect to $\sigma^2$, we have

$$\frac{\partial}{\partial \sigma^2} \ell_y(\mu, \sigma^2) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (y_i - \mu)^2. \tag{8.19}$$

Proof. With the summation and chain rules of differential calculus, we have

$$
\begin{aligned}
\frac{\partial}{\partial \mu} \ell_y(\mu, \sigma^2)
&= \frac{\partial}{\partial \mu}\left(-\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln \sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2\right) \\
&= \frac{\partial}{\partial \mu}\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2\right) \\
&= -\frac{1}{2\sigma^2}\sum_{i=1}^n \frac{\partial}{\partial \mu}(y_i - \mu)^2 \\
&= -\frac{1}{2\sigma^2}\sum_{i=1}^n 2(y_i - \mu)\frac{\partial}{\partial \mu}(y_i - \mu) \\
&= \frac{1}{\sigma^2}\sum_{i=1}^n (y_i - \mu),
\end{aligned} \tag{8.20}
$$

and with the form of the derivative of the logarithm, we have

$$
\begin{aligned}
\frac{\partial}{\partial \sigma^2} \ell_y(\mu, \sigma^2)
&= \frac{\partial}{\partial \sigma^2}\left(-\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln \sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2\right) \\
&= -\frac{n}{2}\frac{\partial}{\partial \sigma^2}\ln \sigma^2 - \frac{\partial}{\partial \sigma^2}\left(\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2\right) \\
&= -\frac{n}{2}\frac{1}{\sigma^2} - \frac{1}{2}\frac{\partial}{\partial \sigma^2}\left(\sigma^2\right)^{-1}\sum_{i=1}^n (y_i - \mu)^2 \\
&= -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (y_i - \mu)^2.
\end{aligned} \tag{8.21}
$$
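The analytical partial derivatives can be verified numerically, for instance against central finite differences of the log likelihood function; the data and the evaluation point in the sketch below are illustrative assumptions.

```python
# Checking the analytical partial derivatives (8.18) and (8.19) against central
# finite differences of the log likelihood function (8.15).
import numpy as np

y = np.array([1.2, 0.7, 2.1, 1.5, 0.3])   # hypothetical data realization
n = y.size

def ell(mu, s2):
    return -n/2*np.log(2*np.pi) - n/2*np.log(s2) - np.sum((y - mu)**2)/(2*s2)

mu, s2, h = 0.8, 1.7, 1e-6                # arbitrary point (mu, sigma^2) and step size
d_mu = np.sum(y - mu) / s2                            # eq. (8.18)
d_s2 = -n/(2*s2) + np.sum((y - mu)**2) / (2*s2**2)    # eq. (8.19)

print(d_mu, (ell(mu + h, s2) - ell(mu - h, s2)) / (2*h))
print(d_s2, (ell(mu, s2 + h) - ell(mu, s2 - h)) / (2*h))
```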


Solution of the log likelihood equations

With the above, the log likelihood equations corresponding to $\nabla \ell_y(\mu, \sigma^2) = 0$ are given by

$$
\begin{aligned}
\frac{1}{\sigma^2}\sum_{i=1}^n (y_i - \mu) &= 0 \\
-\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (y_i - \mu)^2 &= 0.
\end{aligned} \tag{8.22}
$$

Notably, these log likelihood equations exhibit a dependence between the ML estimator for $\mu$ and the ML estimator for $\sigma^2$, because both parameters appear in both equations. To solve the log likelihood equations for $\hat{\mu}_{\mathrm{ML}}$ and $\hat{\sigma}^2_{\mathrm{ML}}$, a standard approach is to first solve the log likelihood equation for $\hat{\mu}_{\mathrm{ML}}$ and then use the solution to solve the log likelihood equation for $\hat{\sigma}^2_{\mathrm{ML}}$. As shown below, this yields

$$\hat{\mu}_{\mathrm{ML}} = \frac{1}{n}\sum_{i=1}^n y_i \tag{8.23}$$

and

$$\hat{\sigma}^2_{\mathrm{ML}} = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{\mu}_{\mathrm{ML}})^2. \tag{8.24}$$

Proof. The first log likelihood equation implies that $\sigma^{-2}$ or $\sum_{i=1}^n (y_i - \hat{\mu}_{\mathrm{ML}})$ is equal to zero. Because by definition $\sigma^2 > 0$ and thus $\sigma^{-2} > 0$, the equation can only hold if $\sum_{i=1}^n (y_i - \hat{\mu}_{\mathrm{ML}})$ equals zero. We thus have

$$
\begin{aligned}
& \sum_{i=1}^n (y_i - \hat{\mu}_{\mathrm{ML}}) = 0 \\
\Leftrightarrow\ & \sum_{i=1}^n y_i - \sum_{i=1}^n \hat{\mu}_{\mathrm{ML}} = 0 \\
\Leftrightarrow\ & \sum_{i=1}^n y_i - n\hat{\mu}_{\mathrm{ML}} = 0 \\
\Leftrightarrow\ & \hat{\mu}_{\mathrm{ML}} = \frac{1}{n}\sum_{i=1}^n y_i.
\end{aligned} \tag{8.25}
$$

To find the maximum likelihood estimator for $\sigma^2$, we substitute this result in the second log likelihood equation and solve for $\hat{\sigma}^2_{\mathrm{ML}}$:

$$
\begin{aligned}
& -\frac{n}{2\hat{\sigma}^2_{\mathrm{ML}}} + \frac{1}{2\hat{\sigma}^4_{\mathrm{ML}}}\sum_{i=1}^n (y_i - \hat{\mu}_{\mathrm{ML}})^2 = 0 \\
\Leftrightarrow\ & \frac{1}{2\hat{\sigma}^4_{\mathrm{ML}}}\sum_{i=1}^n (y_i - \hat{\mu}_{\mathrm{ML}})^2 = \frac{n}{2\hat{\sigma}^2_{\mathrm{ML}}} \\
\Leftrightarrow\ & \sum_{i=1}^n (y_i - \hat{\mu}_{\mathrm{ML}})^2 = \frac{2n\hat{\sigma}^4_{\mathrm{ML}}}{2\hat{\sigma}^2_{\mathrm{ML}}} = n\hat{\sigma}^2_{\mathrm{ML}} \\
\Leftrightarrow\ & \hat{\sigma}^2_{\mathrm{ML}} = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{\mu}_{\mathrm{ML}})^2.
\end{aligned} \tag{8.26}
$$
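A short numerical illustration with simulated data (the sample size and true parameter values are illustrative assumptions): the closed-form estimators (8.23) and (8.24) coincide with the result of a direct numerical maximization of the log likelihood function.

```python
# The closed-form ML estimators (8.23) and (8.24) compared with a direct
# numerical maximization of the Gaussian log likelihood over (mu, sigma^2).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
y = rng.normal(loc=2.0, scale=1.5, size=100)   # simulated sample
n = y.size

mu_ml = np.mean(y)                              # eq. (8.23)
sigma_sq_ml = np.mean((y - mu_ml) ** 2)         # eq. (8.24)

def neg_ell(theta):
    mu, s2 = theta
    return n/2*np.log(2*np.pi) + n/2*np.log(s2) + np.sum((y - mu)**2)/(2*s2)

res = minimize(neg_ell, x0=np.array([0.0, 1.0]), bounds=[(None, None), (1e-6, None)])
print(mu_ml, sigma_sq_ml)
print(res.x)                                    # numerical optimum agrees
```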

Notably, the ML estimator $\hat{\mu}_{\mathrm{ML}}$ corresponds to the sample mean

$$\bar{y} := \frac{1}{n}\sum_{i=1}^n y_i. \tag{8.27}$$

On the other hand, the ML estimator $\hat{\sigma}^2_{\mathrm{ML}}$ does not correspond to the sample variance

$$s^2 := \frac{1}{n-1}\sum_{i=1}^n (y_i - \bar{y})^2. \tag{8.28}$$

While the sample variance is an unbiased estimator of $\sigma^2$, the ML estimator is not.
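A small simulation sketch makes this bias visible; the sample size, true variance, and number of repetitions are illustrative choices.

```python
# Simulation illustrating the bias of the ML variance estimator: across repeated
# samples of size n = 10 from N(0, 4), the average ML estimate is roughly
# ((n-1)/n) * 4 = 3.6, while the average sample variance is roughly 4.
import numpy as np

rng = np.random.default_rng(2)
n, sigma_sq, reps = 10, 4.0, 100_000
s2_ml = np.empty(reps)
s2_sample = np.empty(reps)

for r in range(reps):
    y = rng.normal(loc=0.0, scale=np.sqrt(sigma_sq), size=n)
    s2_ml[r] = np.mean((y - y.mean()) ** 2)    # ML estimator, denominator n
    s2_sample[r] = np.var(y, ddof=1)           # sample variance, denominator n - 1

print(s2_ml.mean(), s2_sample.mean())
```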


8.3 ML estimation of GLM parameters

We have previously seen that the GLM can be expressed in PDF form as

$$p_{\beta,\sigma^2}(y) = N\left(y; X\beta, \sigma^2 I_n\right). \tag{8.29}$$

We now turn to the problem of finding ML estimates $\hat{\beta}$ and $\hat{\sigma}^2$ for $\beta$ and $\sigma^2$, respectively.

Beta parameter estimation We first note that, assuming a known value $\sigma^2 > 0$, the likelihood function for the beta parameter is

$$L_y : \mathbb{R}^p \to \mathbb{R}_{>0}, \quad \beta \mapsto L_y(\beta) := \left(2\pi\sigma^2\right)^{-\frac{n}{2}} \exp\left(-\frac{1}{2\sigma^2}(y - X\beta)^T(y - X\beta)\right). \tag{8.30}$$

Logarithmic transformation yields the corresponding log likelihood function

$$\ell_y : \mathbb{R}^p \to \mathbb{R}, \quad \beta \mapsto \ell_y(\beta) := -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln \sigma^2 - \frac{1}{2\sigma^2}(y - X\beta)^T(y - X\beta). \tag{8.31}$$

As shown below, the necessary condition for a maximum of $\ell_y$ is equivalent to

$$X^T X \beta = X^T y. \tag{8.32}$$

Eq. (8.32) is a set of $p$ linear equations known as the normal equations. If it is assumed that the matrix $X^T X$ is invertible, then the normal equations can readily be solved for the ML beta parameter estimate

$$\hat{\beta} = \left(X^T X\right)^{-1} X^T y. \tag{8.33}$$

It is a basic exercise in linear algebra to prove that the invertibility of $X^T X$ is given if the design matrix $X \in \mathbb{R}^{n \times p}$ is of full column rank $p$. Experimental designs yielding full column-rank design matrices hence allow for the unique identification of the GLM beta parameter maximum likelihood estimate. We will refer to eq. (8.33) as the beta parameter estimator.

Proof. The equivalence of the necessary condition for a maximum of $\ell_y$ and eq. (8.32) derives from the following considerations: the necessary condition for a maximum of $\ell_y$ is

$$\nabla \ell_y(\beta) = 0_p, \tag{8.34}$$

which implies that

$$\frac{\partial}{\partial \beta_j} \ell_y(\beta) = 0 \quad \text{for all } j = 1, \ldots, p. \tag{8.35}$$

Given the functional form of $\ell_y$, eq. (8.35) is equivalent to

$$-\frac{1}{2\sigma^2}\frac{\partial}{\partial \beta_j}(y - X\beta)^T(y - X\beta) = 0 \quad \text{for all } j = 1, \ldots, p. \tag{8.36}$$

Because $\sigma^2 > 0$, eq. (8.36) in turn is equivalent to

$$\frac{1}{2}\frac{\partial}{\partial \beta_j}(y - X\beta)^T(y - X\beta) = 0 \quad \text{for all } j = 1, \ldots, p. \tag{8.37}$$


We next consider the partial derivative in eq. (8.37) for a selected $j \in \{1, \ldots, p\}$ in more detail. We have

$$
\begin{aligned}
\frac{1}{2}\frac{\partial}{\partial \beta_j}(y - X\beta)^T(y - X\beta)
&= \frac{1}{2}\frac{\partial}{\partial \beta_j}\sum_{i=1}^n \left(y_i - (X\beta)_i\right)^2 \\
&= \sum_{i=1}^n \left(y_i - (X\beta)_i\right)\frac{\partial}{\partial \beta_j}\left(y_i - (X\beta)_i\right) \\
&= -\sum_{i=1}^n \left(y_i - (X\beta)_i\right)\frac{\partial}{\partial \beta_j}(X\beta)_i \\
&= -\sum_{i=1}^n \left(y_i - (X\beta)_i\right)\frac{\partial}{\partial \beta_j}\left(x_{i1}\beta_1 + \cdots + x_{ij}\beta_j + \cdots + x_{ip}\beta_p\right) \\
&= -\sum_{i=1}^n \left(y_i - x_{i1}\beta_1 - \cdots - x_{ij}\beta_j - \cdots - x_{ip}\beta_p\right) x_{ij} \\
&= -\sum_{i=1}^n \left(x_{ij}y_i - x_{ij}x_{i1}\beta_1 - \cdots - x_{ij}x_{ij}\beta_j - \cdots - x_{ij}x_{ip}\beta_p\right) \\
&= -\sum_{i=1}^n x_{ij}y_i + \sum_{i=1}^n x_{ij}x_{i1}\beta_1 + \cdots + \sum_{i=1}^n x_{ij}x_{ij}\beta_j + \cdots + \sum_{i=1}^n x_{ij}x_{ip}\beta_p.
\end{aligned} \tag{8.38}
$$

From eq. (8.37), we thus have

$$\sum_{i=1}^n x_{ij}x_{i1}\beta_1 + \cdots + \sum_{i=1}^n x_{ij}x_{ij}\beta_j + \cdots + \sum_{i=1}^n x_{ij}x_{ip}\beta_p = \sum_{i=1}^n x_{ij}y_i \quad \text{for all } 1 < j < p, \tag{8.39}$$

and similarly for $j = 1$ and $j = p$. Summarizing these $p$ equations in vector format then results in

$$
\begin{pmatrix}
\sum_{i=1}^n x_{i1}x_{i1}\beta_1 + \sum_{i=1}^n x_{i1}x_{i2}\beta_2 + \cdots + \sum_{i=1}^n x_{i1}x_{ip}\beta_p \\
\sum_{i=1}^n x_{i2}x_{i1}\beta_1 + \sum_{i=1}^n x_{i2}x_{i2}\beta_2 + \cdots + \sum_{i=1}^n x_{i2}x_{ip}\beta_p \\
\vdots \\
\sum_{i=1}^n x_{ip}x_{i1}\beta_1 + \sum_{i=1}^n x_{ip}x_{i2}\beta_2 + \cdots + \sum_{i=1}^n x_{ip}x_{ip}\beta_p
\end{pmatrix}
=
\begin{pmatrix}
\sum_{i=1}^n x_{i1}y_i \\
\sum_{i=1}^n x_{i2}y_i \\
\vdots \\
\sum_{i=1}^n x_{ip}y_i
\end{pmatrix}. \tag{8.40}
$$

Furthermore, we may rewrite the left-hand side of eq. (8.40) as

$$
\begin{pmatrix}
\sum_{i=1}^n x_{i1}x_{i1}\beta_1 + \sum_{i=1}^n x_{i1}x_{i2}\beta_2 + \cdots + \sum_{i=1}^n x_{i1}x_{ip}\beta_p \\
\sum_{i=1}^n x_{i2}x_{i1}\beta_1 + \sum_{i=1}^n x_{i2}x_{i2}\beta_2 + \cdots + \sum_{i=1}^n x_{i2}x_{ip}\beta_p \\
\vdots \\
\sum_{i=1}^n x_{ip}x_{i1}\beta_1 + \sum_{i=1}^n x_{ip}x_{i2}\beta_2 + \cdots + \sum_{i=1}^n x_{ip}x_{ip}\beta_p
\end{pmatrix}
=
\begin{pmatrix}
\sum_{i=1}^n x_{i1}x_{i1} & \sum_{i=1}^n x_{i1}x_{i2} & \cdots & \sum_{i=1}^n x_{i1}x_{ip} \\
\sum_{i=1}^n x_{i2}x_{i1} & \sum_{i=1}^n x_{i2}x_{i2} & \cdots & \sum_{i=1}^n x_{i2}x_{ip} \\
\vdots & \vdots & \ddots & \vdots \\
\sum_{i=1}^n x_{ip}x_{i1} & \sum_{i=1}^n x_{ip}x_{i2} & \cdots & \sum_{i=1}^n x_{ip}x_{ip}
\end{pmatrix}
\begin{pmatrix}
\beta_1 \\ \beta_2 \\ \vdots \\ \beta_p
\end{pmatrix}
$$

$$
=
\begin{pmatrix}
x_{11} & x_{21} & \cdots & x_{n1} \\
x_{12} & x_{22} & \cdots & x_{n2} \\
\vdots & \vdots & \ddots & \vdots \\
x_{1p} & x_{2p} & \cdots & x_{np}
\end{pmatrix}
\begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np}
\end{pmatrix}
\begin{pmatrix}
\beta_1 \\ \beta_2 \\ \vdots \\ \beta_p
\end{pmatrix}
= X^T X \beta. \tag{8.41}
$$

Similarly, we may rewrite the right-hand side of eq. (8.40) as

$$
\begin{pmatrix}
\sum_{i=1}^n x_{i1}y_i \\
\sum_{i=1}^n x_{i2}y_i \\
\vdots \\
\sum_{i=1}^n x_{ip}y_i
\end{pmatrix}
=
\begin{pmatrix}
x_{11} & x_{21} & \cdots & x_{n1} \\
x_{12} & x_{22} & \cdots & x_{n2} \\
\vdots & \vdots & \ddots & \vdots \\
x_{1p} & x_{2p} & \cdots & x_{np}
\end{pmatrix}
\begin{pmatrix}
y_1 \\ y_2 \\ \vdots \\ y_n
\end{pmatrix}
= X^T y. \tag{8.42}
$$

The normal equations then follow directly from eqs. (8.40), (8.41), and (8.42).
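As a numerical illustration of eqs. (8.32) and (8.33), the following Python sketch evaluates the beta parameter estimate for a simulated data set and a hypothetical two-column design matrix; the data, design, and noise level are illustrative assumptions.

```python
# Numerical sketch of the beta parameter estimator (8.33) for a hypothetical
# design matrix (intercept plus group indicator) of full column rank.
import numpy as np

rng = np.random.default_rng(3)
n = 20
X = np.column_stack([np.ones(n), np.repeat([0.0, 1.0], n // 2)])
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + 0.5 * rng.standard_normal(n)    # simulated data

beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y          # eq. (8.33)
# numerically, solving the normal equations (8.32) is preferable to the explicit inverse:
beta_hat_solve = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat, beta_hat_solve)
```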

Ordinary least squares beta parameter estimation A popular alternative approach for estimating the parameters of GLMs is the ordinary least squares (OLS) method. The idea of OLS estimation is to minimize the squared distance between observed data points

and the data predicted by the GLM. In contrast to the ML approach, OLS estimation does not depend on any specific parametric, or indeed any probabilistic, assumptions about the data. Below, we discuss the equivalence of ML and OLS estimators for GLM beta parameters. While both approaches result in the same beta parameter estimator, OLS estimation does not lend itself to probabilistic inference, because its ensuing estimates are endowed with neither a Frequentist nor a Bayesian distributional theory. To show the equivalence of ML and OLS beta parameter estimation, we first consider OLS estimation. In OLS estimation, the aim is to minimize the sum of error squares (SES), defined as

$$\mathrm{SES} := \sum_{i=1}^n \left(y_i - (X\beta)_i\right)^2, \tag{8.43}$$

where $(X\beta)_i$ denotes the $i$th row of $X\beta$. $(X\beta)_i$ is the GLM prediction of $y_i$ and depends on the value of $\beta$. The sum of all squared prediction errors $y_i - (X\beta)_i$ forms the SES. Clearly, due to the quadratic terms, it holds that $\mathrm{SES} \ge 0$. We next reconsider the likelihood function of the GLM beta parameter for known $\sigma^2 > 0$ (cf. eq. (8.30)),

$$L_y(\beta) = \left(2\pi\sigma^2\right)^{-\frac{n}{2}} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n \left(y_i - (X\beta)_i\right)^2\right). \tag{8.44}$$

As is readily apparent, the likelihood function $L_y$ comprises the SES in its exponential term. Because the SES itself is non-negative and enters the functional form of $L_y$ with a minus sign, the exponential term of $L_y$ becomes maximal if the squared deviations between model predictions and data values become minimal. In other words, the GLM likelihood function for the beta parameter is maximized if the SES is minimized. In effect, irrespective of whether the OLS or ML method is employed to derive the GLM beta parameter estimator, the resulting beta parameter estimators are identical.
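This equivalence can also be checked numerically: in the sketch below (design matrix, true parameter values, and noise are illustrative assumptions), a direct least squares fit yields the same estimate as the normal equations.

```python
# The OLS solution (which minimizes the SES directly) coincides with the ML beta
# estimator obtained from the normal equations, here checked numerically.
import numpy as np

rng = np.random.default_rng(4)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
y = X @ np.array([1.0, -0.5, 2.0]) + rng.standard_normal(n)   # simulated data

beta_ml = np.linalg.solve(X.T @ X, X.T @ y)         # normal equations (8.32)
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)    # least squares fit
print(np.allclose(beta_ml, beta_ols))               # True
```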

Maximum likelihood variance parameter estimation Finally, to derive the ML estimator for the GLM variance parameter $\sigma^2$, we proceed as follows: we substitute $\hat{\beta}$ in the GLM log likelihood function and then maximize the resulting log likelihood function with respect to $\sigma^2$. Substitution of $\hat{\beta}$ in eq. (8.31) renders the GLM log likelihood function a function of $\sigma^2 > 0$ only,

$$\ell_{y,\hat{\beta}} : \mathbb{R}_{>0} \to \mathbb{R}, \quad \sigma^2 \mapsto \ell_{y,\hat{\beta}}(\sigma^2) := -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln \sigma^2 - \frac{1}{2\sigma^2}\left(y - X\hat{\beta}\right)^T\left(y - X\hat{\beta}\right), \tag{8.45}$$

and, as shown below, the derivative of $\ell_{y,\hat{\beta}}$ with respect to $\sigma^2$ evaluates to

$$\frac{d}{d\sigma^2}\ell_{y,\hat{\beta}}(\sigma^2) = -\frac{1}{2}\frac{n}{\sigma^2} + \frac{1}{2}\frac{1}{(\sigma^2)^2}\left(y - X\hat{\beta}\right)^T\left(y - X\hat{\beta}\right). \tag{8.46}$$

Proof. We have

$$
\begin{aligned}
\frac{d}{d\sigma^2}\ell_{y,\hat{\beta}}(\sigma^2)
&= \frac{d}{d\sigma^2}\left(-\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln \sigma^2 - \frac{1}{2\sigma^2}\left(y - X\hat{\beta}\right)^T\left(y - X\hat{\beta}\right)\right) \\
&= -\frac{n}{2}\frac{d}{d\sigma^2}\ln \sigma^2 - \frac{1}{2}\left(y - X\hat{\beta}\right)^T\left(y - X\hat{\beta}\right)\frac{d}{d\sigma^2}\left(\sigma^2\right)^{-1} \\
&= -\frac{n}{2}\frac{1}{\sigma^2} - \frac{1}{2}\left(y - X\hat{\beta}\right)^T\left(y - X\hat{\beta}\right)\left(-\left(\sigma^2\right)^{-2}\right) \\
&= -\frac{1}{2}\frac{n}{\sigma^2} + \frac{1}{2}\frac{1}{(\sigma^2)^2}\left(y - X\hat{\beta}\right)^T\left(y - X\hat{\beta}\right).
\end{aligned} \tag{8.47}
$$

Finally, as shown below, setting $\frac{d}{d\sigma^2}\ell_{y,\hat{\beta}}$ to zero and solving for $\hat{\sigma}^2$ yields

$$\hat{\sigma}^2 = \frac{1}{n}\left(y - X\hat{\beta}\right)^T\left(y - X\hat{\beta}\right). \tag{8.48}$$


Proof. We have

$$
\begin{aligned}
& \frac{d}{d\sigma^2}\ell_{y,\hat{\beta}}(\hat{\sigma}^2) = 0 \\
\Leftrightarrow\ & -\frac{1}{2}\frac{n}{\hat{\sigma}^2} + \frac{1}{2}\frac{1}{(\hat{\sigma}^2)^2}\left(y - X\hat{\beta}\right)^T\left(y - X\hat{\beta}\right) = 0 \\
\Leftrightarrow\ & \frac{1}{2}\frac{1}{(\hat{\sigma}^2)^2}\left(y - X\hat{\beta}\right)^T\left(y - X\hat{\beta}\right) = \frac{1}{2}\frac{n}{\hat{\sigma}^2} \\
\Leftrightarrow\ & \left(y - X\hat{\beta}\right)^T\left(y - X\hat{\beta}\right) = n\hat{\sigma}^2 \\
\Leftrightarrow\ & \hat{\sigma}^2 = \frac{1}{n}\left(y - X\hat{\beta}\right)^T\left(y - X\hat{\beta}\right).
\end{aligned} \tag{8.49}
$$

A couple of aspects of eq. (8.48) are noteworthy. First, the term $y - X\hat{\beta}$ corresponds to the difference between the data $y$ and the GLM data prediction $X\hat{\beta}$. The variance parameter estimate $\hat{\sigma}^2$ thus corresponds to a scaled version of the residual sum of squares (RSS), defined as

$$\mathrm{RSS} := \sum_{i=1}^n \left(y_i - (X\hat{\beta})_i\right)^2. \tag{8.50}$$

Second, the ML estimator of $\sigma^2$ is biased: it can be shown that the expectation of $\frac{1}{n}(y - X\hat{\beta})^T(y - X\hat{\beta})$ is smaller than $\sigma^2$. In other words, the ML variance parameter estimator underestimates the true, but unknown, GLM variance parameter. This can, however, readily be rectified by dividing the RSS not by $n$, but by $n - p$. This yields the so-called restricted maximum likelihood (ReML) estimator of the GLM variance parameter, defined as

$$\hat{\sigma}^2 := \frac{(y - X\hat{\beta})^T(y - X\hat{\beta})}{n - p}. \tag{8.51}$$

In the following, we will hence consider the ReML estimator for estimating $\sigma^2$. For simplicity, we shall refer to (8.51) as the variance parameter estimator. An introduction to the origin of the estimation bias of the ML variance parameter estimator and the concept of ReML is beyond the scope of a basic introduction to the GLM.
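For a simulated GLM data set (the design matrix, parameter values, and noise level below are illustrative assumptions), the ML and ReML variance estimates differ only in the denominator applied to the RSS.

```python
# ML and ReML variance parameter estimates for a simulated GLM: the ML estimate
# divides the residual sum of squares by n, the ReML estimate by n - p.
import numpy as np

rng = np.random.default_rng(5)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
y = X @ np.array([1.0, -0.5, 2.0]) + 2.0 * rng.standard_normal(n)   # true sigma^2 = 4

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
rss = resid @ resid

print(rss / n)        # ML estimate, eq. (8.48)
print(rss / (n - p))  # ReML estimate, eq. (8.51)
```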

8.4 Example (Independent and identically distributed Gaussian samples)

As a first application of the beta and variance parameter estimators introduced above, we consider the GLM scenario of n independent and identically distributed Gaussian samples,

$$y_i \sim N(\mu, \sigma^2) \quad \text{for } i = 1, \ldots, n, \tag{8.52}$$

which, as discussed in Section 7 | Probability distributions, corresponds to the GLM

$$y \sim N(X\beta, \sigma^2 I_n), \quad \text{where } X := 1_n \in \mathbb{R}^{n \times 1},\ \beta := \mu, \text{ and } \sigma^2 > 0. \tag{8.53}$$

As shown below, for the model specified in eq. (8.53), the beta and variance parameter estimators evaluate to

$$\hat{\beta} = \frac{1}{n}\sum_{i=1}^n y_i = \bar{y} \quad \text{and} \quad \hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^n (y_i - \bar{y})^2 = s^2, \tag{8.54}$$

respectively. In the GLM scenario of independent and identically distributed Gaussian samples, the beta and variance parameter estimators are thus identical to the sample mean and sample variance of the random sample $y_1, \ldots, y_n \sim N(\mu, \sigma^2)$.

Proof. The beta parameter estimator evaluates to

$$\hat{\beta} = (X^T X)^{-1} X^T y = \left(\begin{pmatrix}1 & \cdots & 1\end{pmatrix}\begin{pmatrix}1 \\ \vdots \\ 1\end{pmatrix}\right)^{-1}\begin{pmatrix}1 & \cdots & 1\end{pmatrix}\begin{pmatrix}y_1 \\ \vdots \\ y_n\end{pmatrix} = n^{-1}\sum_{i=1}^n y_i = \frac{1}{n}\sum_{i=1}^n y_i = \bar{y}. \tag{8.55}$$


The variance parameter estimator is given by

$$
\begin{aligned}
\hat{\sigma}^2
&= \frac{1}{n-1}\left(y - X\hat{\beta}\right)^T\left(y - X\hat{\beta}\right) \\
&= \frac{1}{n-1}\left(\begin{pmatrix}y_1 \\ \vdots \\ y_n\end{pmatrix} - \begin{pmatrix}1 \\ \vdots \\ 1\end{pmatrix}\frac{1}{n}\sum_{i=1}^n y_i\right)^T\left(\begin{pmatrix}y_1 \\ \vdots \\ y_n\end{pmatrix} - \begin{pmatrix}1 \\ \vdots \\ 1\end{pmatrix}\frac{1}{n}\sum_{i=1}^n y_i\right) \\
&= \frac{1}{n-1}\begin{pmatrix}y_1 - \frac{1}{n}\sum_{i=1}^n y_i \\ \vdots \\ y_n - \frac{1}{n}\sum_{i=1}^n y_i\end{pmatrix}^T\begin{pmatrix}y_1 - \frac{1}{n}\sum_{i=1}^n y_i \\ \vdots \\ y_n - \frac{1}{n}\sum_{i=1}^n y_i\end{pmatrix} \\
&= \frac{1}{n-1}\sum_{i=1}^n \left(y_i - \frac{1}{n}\sum_{i=1}^n y_i\right)^2 \\
&= \frac{1}{n-1}\sum_{i=1}^n (y_i - \bar{y})^2 \\
&= s^2.
\end{aligned} \tag{8.56}
$$
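A brief numerical sketch of this result with a simulated sample (sample size and true parameter values are illustrative assumptions):

```python
# The GLM estimators for the iid Gaussian scenario: with X = 1_n, the beta
# parameter estimator equals the sample mean and the (ReML) variance parameter
# estimator equals the sample variance.
import numpy as np

rng = np.random.default_rng(6)
n = 12
y = rng.normal(loc=3.0, scale=2.0, size=n)            # simulated sample
X = np.ones((n, 1))                                    # design matrix 1_n, p = 1

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)           # cf. eq. (8.55)
resid = y - X @ beta_hat
sigma_sq_hat = (resid @ resid) / (n - 1)               # cf. eq. (8.56)

print(beta_hat[0], np.mean(y))                         # identical
print(sigma_sq_hat, np.var(y, ddof=1))                 # identical
```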

8.5 Bibliographic remarks

Treatments of ML and GLM estimation theory can be found in virtually all statistical textbooks. Seber (2015) and Christensen (2011) provide comprehensive introductions for the GLM.

8.6 Study questions

1. Write down the general form of a likelihood function and name its components.
2. Write down the general form of a log likelihood function and name its components.
3. Write down the general form of an ML estimator and explain it.
4. Discuss commonalities and differences between OLS and ML beta parameter estimators.
5. Write down the formula of the GLM ML beta parameter estimator and name its components.
6. Write down the formula of the GLM ML variance parameter estimator and name its components.
7. Write down the formula of the GLM ReML variance parameter estimator and name its components.
8. Define the sum of error squares (SES) and the residual sum of squares (RSS) and discuss their commonalities and differences.
9. Write down the GLM incarnation of independent and identical sampling from a univariate Gaussian distribution as well as the ensuing expectation and variance parameter estimators.

8.7 References

Aldrich, J. (1997). R.A. Fisher and the making of maximum likelihood 1912-1922. Statistical Science, 12(3):162–176.

Christensen, R. (2011). Plane Answers to Complex Questions. Springer Texts in Statistics. Springer New York, New York, NY.

Seber, G. (2015). The Linear Model and Hypothesis. Springer Series in Statistics. Springer International Publishing, Cham.

The General Linear Model | © 2020 Dirk Ostwald CC BY-NC-SA 4.0