The General Linear Model 20/21

Univ.-Prof. Dr. Dirk Ostwald

Contents

1 Introduction
  1.1 Probabilistic modelling
  1.2 Experimental design
  1.3 A verbose introduction to the general linear model
  1.4 Bibliographic remarks
  1.5 Study questions

2 Sets, sums, and functions
  2.1 Sets
  2.2 Sums, products, and exponentiation
  2.3 Functions
  2.4 Bibliographic remarks
  2.5 Study questions

3 Calculus
  3.1 Derivatives of univariate real-valued functions
  3.2 Analytical optimization of univariate real-valued functions
  3.3 Derivatives of multivariate real-valued functions
  3.4 Derivatives of multivariate vector-valued functions
  3.5 Basic integrals
  3.6 Bibliographic remarks
  3.7 Study questions

4 Matrices
  4.1 Matrix definition
  4.2 Matrix operations
  4.3 Determinants
  4.4 Symmetry and positive-definiteness
  4.5 Bibliographic remarks
  4.6 Study questions

5 Probability spaces and random variables
  5.1 Probability spaces
  5.2 Elementary probabilities
  5.3 Random variables and distributions
  5.4 Random vectors and multivariate probability distributions
  5.5 Bibliographic remarks
  5.6 Study questions

6 Expectation, variance, and transformations
  6.1 Expectation
  6.2 Variance
  6.3 Sample mean, sample variance, and sample standard deviation
  6.4 Covariance and correlation of random variables
  6.5 Sample covariance and sample correlation
  6.6 Probability density transformations
  6.7 Combining random variables
  6.8 Bibliographic remarks
  6.9 Study questions

7 Probability distributions
  7.1 The multivariate Gaussian distribution
  7.2 The General Linear Model
  7.3 The Gamma distribution
  7.4 The χ2 distribution
  7.5 The t distribution
  7.6 The f distribution
  7.7 Bibliographic remarks
  7.8 Study questions

8 Maximum likelihood estimation
  8.1 Likelihood functions and maximum likelihood estimators
  8.2 Maximum likelihood estimation for univariate Gaussian distributions
  8.3 ML estimation of GLM parameters
  8.4 Example (Independent and identically distributed Gaussian samples)
  8.5 Bibliographic remarks
  8.6 Study questions

9 Frequentist distribution theory
  9.1 Introduction
  9.2 Beta parameter estimates
  9.3 Variance parameter estimates
  9.4 The T-statistic
  9.5 The F-statistic
  9.6 Bibliographic remarks
  9.7 Study questions

10 Statistical testing
  10.1 Statistical tests
  10.2 A single-observation z-test
  10.3 Bibliographic remarks
  10.4 Study questions

11 T-tests and simple linear regression
  11.1 Introduction
  11.2 One-sample t-test
  11.3 Independent two-sample t-test
  11.4 Simple linear regression
  11.5 Bibliographic remarks
  11.6 Study questions

12 Multiple linear regression
  12.1 An exemplary multiple linear regression design
  12.2 Linearly independent, orthogonal, and uncorrelated regressors
  12.3 Statistical efficiency of multiple linear regression designs
  12.4 Multiple linear regression in functional neuroimaging
  12.5 Bibliographic remarks
  12.6 Study questions

13 One-way ANOVA
  13.1 The GLM perspective
  13.2 The F-test perspective
  13.3 Bibliographic remarks
  13.4 Study questions

14 Two-way ANOVA
  14.1 An additive two-way ANOVA design
  14.2 A two-way ANOVA design with interaction
  14.3 Bibliographic remarks
  14.4 Study questions

1 | Introduction

The general linear model (GLM) is a unifying perspective on many analytical techniques in statistics, machine learning, and artificial intelligence. For example, many statistical methods, such as T-tests, F-tests, simple linear regression, multiple linear regression, the analysis of variance, and the analysis of covariance, are special cases of the GLM. Furthermore, the mathematical machinery of the GLM forms the basis for many more advanced data analytical techniques ranging from mixed linear models to neural networks to Bayesian hierarchical models. In cognitive neuroimaging, the GLM is popular as a standard technique in the analysis of fMRI data. The aim of this introductory Section is to preview the scope of contemporary data analytical approaches, which is most sensibly summarized by the term probabilistic modelling (Section 1.1). After touching upon some basic aspects of experimental design (Section 1.2), we then provide a verbose introduction to the GLM and its mathematical form (Section 1.3). The mathematical language that is needed to discuss the GLM (e.g., matrix calculus and multivariate Gaussian distributions) will be expanded upon in subsequent Sections. It is introduced here primarily to motivate the engagement with these mathematically more basic concepts in subsequent Sections.

1.1 Probabilistic modelling

Science is the dyad of formulating quantitative theories about natural phenomena and validating these theories in light of quantitative data. Because quantitative data is finite, theories will only ever be quantified up to a certain level of uncertainty. Probabilistic modelling provides the glue between formalized scientific theories and empirical data and offers a mechanistic framework for quantifying the remaining uncertainty about a theory's validation. Probabilistic modelling has many synonyms, such as statistics, data assimilation, advanced machine learning, or simply data analysis. Cognitive neuroscience aims for a scientific approach to understanding brain function. When designing any experiment in cognitive neuroscience, it is thus essential to have at least a vague idea about the data analytical procedures that are going to be used on the collected data, irrespective of whether the data is behavioural or derives from neuroimaging techniques such as functional magnetic resonance imaging (fMRI) or magneto- or electroencephalography (M/EEG). In the current Section, we provide a brief overview of common data analytical strategies employed in cognitive neuroimaging or, more generally, in probabilistic quantitative data analysis. To this end, it is first helpful to appreciate that any form of data analysis embodies data reduction and that any sensible form of data reduction is based on a model of the data generating process.

Data analysis is data reduction. Any cognitive neuroscience experiment generates a wealth of quantitative data (numbers). For example, when conducting a typical behavioural experiment, one presents stimuli of different experimental conditions multiple times to participants and records, for example, the correctness of the response and the associated reaction time on each experimental trial. For reaction times only and with a hundred trials for each of four experimental conditions, this amounts to four hundred numbers per participant. Usually, one does not only acquire data from a single participant and thus deals with four hundred data points times the number of participants. If one concomitantly acquires neurophysiological data, for example fMRI data across many voxels or EEG data from multiple electrodes, the number of data points grows into the hundreds of thousands or even millions very quickly. Nevertheless, one would like to understand and visualize in which way the experimental manipulation has affected the recorded data. Any data analysis method must hence project large sets of numbers onto smaller sets of numbers that allow for the experimental effects to be more readily appreciated by humans. These smaller sets of numbers are commonly referred to as statistics. While many data analysis techniques appear to be very different at least on the surface, a reduction of the data dimensionality is a common characteristic of all forms of data analysis (Figure 1.1).

Figure 1.1. Data analysis is data reduction (left panel: Raw Data; right panel: Reduced Data). Raw data usually takes the form of large data matrices, here represented by a 100 × 100 array of different colours encoding real number values. Usually, the raw data are not reported in scientific reports, but rather a smaller set of numbers, such as T- or p-values in frequentist statistics. This smaller set of numbers is represented by the 2 × 2 array of different colours on the right. The process of transforming a large data set into a smaller data set that can be more readily appreciated by humans is called data analysis (glm 1.m).

Data analysis is model-based. A second characteristic of any data analysis method is that it embodies assumptions about how the data were generated and which data aspects are important. The key step of any data analysis method is to evaluate how well a given set of quantitative assumptions, i.e., a model, can explain a set of observed data. When studying any data analysis approach, it is helpful to identify the following three components of the scientific method: model formulation, model estimation, and model evaluation (Figure 1.2). Model formulation refers to the mathematical formalization of informal ideas about the generation of empirical data. Typically, models aim to mechanistically and quantitatively capture data generating processes and comprise both deterministic and probabilistic aspects. Some components of a model may take predefined values and are referred to as fixed parameters, while other components of a model can be informed by the data and are referred to as free parameters. Model estimation is the adaptation of model parameters in light of observed data. Often the adaptation of free model parameters in light of observed data is a non-trivial task and requires sophisticated mathematical and statistical techniques. Finally, model evaluation refers to the evaluation of adapted parameter values in some meaningful sense and drawing conclusions about experimental hypotheses. Note that upon model evaluation, the scientific method proceeds by going back to the model formulation step. At least two aims may be addressed during model reformulation: either to conceive a model formulation that may capture observed data in a more meaningful way or to relax the assumptions of the model to derive a more general theory.

Model classes

It is sometimes helpful to classify a particular model. While ultimately every model and its associated estimation and evaluation scheme is unique, some rough categorization can help to obtain an overview about the plethora of data analysis approaches in functional neuroimaging. Below we discuss a non-exhaustive list of dichotomies.

Static vs. dynamic models. In most simple terms, static models describe the current state of a phenomenon, while dynamic models describe how the phenomenon of interest currently changes. Usually, static models have no inherent representation of time, while dynamic models typically treat time as an explicit model variable. Static models often have a relatively simple algebraic form, whereas dynamic models are usually formulated with the help of differential equations. While not originally conceived as models of time-series data, many static models are also applied to data, often inducing the need for sophisticated model modifications. Dynamic models can further be classified into deterministic dynamic and stochastic dynamic models. Deterministic dynamic models describe the change of the state of a phenomenon without additive stochastic error, commonly using systems of ordinary or partial differential equations. Stochastic dynamic models additionally assume probabilistic influences on the change of the phenomenon of interest and are formulated using stochastic differential equations.

Univariate vs. multivariate models. Another way to classify models is according to the dimensionality of the measurement data they describe. If this dimension is one, i.e., for each measurement a single number is observed and modelled, the model is referred to as univariate. On the other hand, if each measurement constitutes two or more numbers which are modelled, the model is referred to as multivariate.


Figure 1.2. Data analysis is model-based. (Diagram elements: Reality, Data, Science, Model formulation, Model estimation, Model evaluation.) The figure depicts the relationship between the scientific method (big box) and reality. Data forms part of the scientific method, because it is registered in data recording instruments that aim to capture specific aspects of reality. The scientific method is based on the formulation of models (also known as theories or hypotheses), the estimation of these models based on data (also known as parameter estimation or model fitting), and the evaluation of the models in light of the data upon their estimation. Typically, multiple models are compared with respect to each other. Upon evaluation of a model, the model may be refined or a new model may be formulated. Note that this is a highly idealistic description of the scientific process, which omits all sociological factors involved in actual academic practice (glm 1.m).

Encoding vs. decoding models. Another popular model classification scheme uses the notions of encoding vs. decoding models. According to this scheme, encoding models rest on an explicit formulation of the experimental circumstances that generate measurements, while decoding approaches decode the experimental circumstances from the observed measurements. However, the distinction between encoding and decoding models is meaningless because every "decoding model" is also based on a model of the measurement data - most typically a very simple one with little explanatory appeal. As will become evident in subsequent Sections, the GLM is a static, univariate model that can be used both in an encoding and a decoding manner. Due to its relative simplicity, the GLM forms an ideal starting point for studying modern data analysis.

Model estimation and evaluation techniques

Probabilistic models comprise both deterministic aspects and stochastic aspects. The stochastic aspects commonly model that part of the data variability that is not explained by the deterministic aspects. The frameworks of Frequentist and Bayesian statistics differ in the way that stochastic aspects are interpreted.

Frequentist statistics. In Frequentist statistics, probabilities are interpreted as large-sample limits of relative frequencies of random phenomena. Most of classical Frequentist statistics as encountered in undergraduate statistics combines variants of the GLM with null hypothesis significance testing (NHST). NHST is based on the following logic: one assumes that if there is no experimental effect, a statistic of interest has a certain probability distribution. This is referred to as the null distribution. Upon observing data, one can compute the probability of obtaining the observed or more extreme data under the null distribution. If this probability (known as the p-value) is small, one concludes that the data does not support the null hypothesis and declares the experimental effect to be "statistically significant".
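As a concrete illustration of this logic, the following minimal sketch computes a two-sided p-value for a hypothetical observed z-statistic under a standard normal null distribution. The statistic value and the use of SciPy are illustrative assumptions added here, not part of the original text; the single-observation z-test itself is treated in Section 10.

```python
# Minimal sketch: two-sided p-value of an observed z-statistic under a
# standard normal null distribution (the value 2.1 is chosen for illustration only).
from scipy import stats

z_observed = 2.1                              # hypothetical observed statistic
p_value = 2 * stats.norm.sf(abs(z_observed))  # P(|Z| >= |z_observed|) under N(0, 1)
print(f"two-sided p-value: {p_value:.4f}")    # approx. 0.0357
```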

Bayesian statistics. In Bayesian statistics, probabilities are interpreted as measures of subjective uncertainty. Here, in the absence of any experimental data, one quantifies one's uncertainty about model parameters using so-called prior distributions. Using Bayes' theorem, one then computes the posterior distribution of model parameters given the data, resulting in an updated belief. At the same time, one often aims to quantify the probability of the data under the model assumptions employed. It should be noted that the dichotomy between Frequentist statistics and Bayesian statistics is not a strict one, and that mixed forms, such as parametric empirical Bayes or the study of Frequentist quality criteria of Bayesian point estimators, exist. In this introduction to the GLM, we will focus on Frequentist statistics, which remains the dominant statistical paradigm in the empirical sciences.

1.2 Experimental design

In this Section, we briefly review a few terms from the theory of experimental design that will be needed for introducing the GLM.

Experiment and experimental design. An experiment is the controlled test of a scientific hypothesis or theory. Experiments manipulate some aspect of the world and then measure the outcome of that manipulation. In functional neuroimaging experiments, researchers often manipulate some aspects of a stimulus (for example, presenting a picture of a face or a house, or manipulating whether a word is easy or difficult to remember) and measure the participant's behaviour and brain activity using fMRI or EEG. Here, experimental design refers to the organization of an experiment to allow for the effective investigation of the research hypothesis. All well-designed experiments share several characteristics: they test specific hypotheses, rule out alternative explanations for the data, and minimize the costs involved in the experiment.

Independent and dependent experimental variables. An experimental variable can be defined as a manipulated or measured quantity that varies within an experiment. Two classes of experimental variables are central: independent and dependent variables. Independent experimental variables are aspects of the experimental design that are intentionally manipulated by the experimenter and that are hypothesized to cause changes in the dependent variables. Independent variables in functional neuroimaging experiments include, for example, different forms of sensory stimulation, different cognitive contexts, or different motor tasks. The different values of an independent variable are often referred to as conditions or levels. Usually, independent variables are explicitly controlled. From a modelling perspective, they are thus usually represented by constants rather than by random variables. Dependent experimental variables are quantities that are measured by the experimenter in order to evaluate the effect of the independent variables. Examples for dependent variables in functional neuroimaging experiments are the response accuracy and reaction time in behavioural tasks, the BOLD signal at a given voxel in an fMRI experiment, or the composition of a recording channel in an EEG experiment. From a data analytical perspective, dependent experimental variables are usually modelled by random variables.

Categorical and continuous experimental variables. In principle, both independent and dependent variables can either be categorical or continuous. A categorical experimental variable is an experimental variable that can take on one of several discrete values, for example, encoding sensory stimulation (1) vs. no sensory stimulation (0). Categorical experimental variables are commonly referred to as factors taking on different levels. Mathematically, categorical variables are usually represented by elements of the natural numbers or signed integers. A continuous experimental variable is an experimental variable that can take on any value within a specified range. Examples for continuous variables are different contrast levels of a visual stimulus as well as most observed signals in functional neuroimaging, such as the BOLD signal or electrical potentials in EEG. Mathematically, continuous experimental variables are usually represented by real numbers.

Between- and within-participant designs. Experimental designs can be classified according to whether the levels of an independent variable are applied to the same group of participants or to different groups of participants. In a between-participant design, different participant groups are associated with different values of an independent experimental variable. A more common design type in basic functional neuroimaging research is the within-participant design, in which each participant is exposed to all levels of the independent experimental variables. These designs are also commonly referred to as repeated-measures designs.


1.3 A verbose introduction to the general linear model

The GLM can be neatly summarized in the expression

y = Xβ + ε, (1.1)

which we will refer to as the GLM equation. In the GLM equation, y represents the data, X denotes a design matrix, β denotes a parameter vector, and ε denotes an error vector. The aim of the current Section is to gain an initial understanding of eq. (1.1). To this end, we will first consider the structural aspects of eq. (1.1), comprising the data y, the design matrix X, and the parameter vector β, from the perspective of independent and dependent experimental variables. In a second step, we then consider the stochastic aspect represented by the error vector ε. We exemplify both the structural and stochastic aspects by means of the simple linear regression model. Throughout, it is important to note that the data y is modelled by the GLM as a stochastic entity of which a single realization is available in a practical context.

Structural aspects

To obtain an initial understanding of the GLM equation (1.1), we consider an independent experimental variable, denoted by x for the moment, and a dependent experimental variable, denoted by y for the moment. As reviewed above, the independent experimental variable x is under the control of the experimenter, while the dependent experimental variable y models measurements of a phenomenon of interest. y is not under the direct control of the researcher, but it is assumed that it is in some way related to x. For reasons of simplicity and flexibility, and because smooth functional relationships are locally approximately linear, researchers choose to model many of these relationships by means of affine-linear functions, or linear models for short. In verbose terms, a noise-free linear model states that "an observed value of the dependent variable y is equal to a weighted sum of values associated with one or more independent variables x".

To render the last statement more precise, we introduce some additional notation: let yi denote one observation of the dependent variable y, where i = 1, ..., n, such that there are n observations in total. Likewise, let xij, i = 1, ..., n, j = 1, ..., p denote the values of a number of independent experimental variables that are supposed to be associated with the observation yi. Here, p is the number of independent experimental variables. The statement that the value yi equals the weighted sum of the values of the independent variables xij associated with this observation can then be written as

yi = xi1β1 + xi2β2 + xi3β3 + ... + xipβp. (1.2)

In eq. (1.2), the βj parameters are multiplicative coefficients that quantify the contribution of the independent experimental variable xij to the value of the dependent experimental variable yi. Each βj parameter may thus be conceived as the size of the effect that the independent experimental variable xij has on the value of the dependent experimental variable yi. All variables in eq. (1.2) should be thought of as scalar numbers. As a concrete example of eq. (1.2), we consider the 7th dependent variable y7 of a set of dependent variables associated with p = 4 independent experimental variables and, correspondingly, four βj parameters:

y7 = x71β1 + x72β2 + x73β3 + x74β4. (1.3) A numerical example of eq. (1.3) is 10 = 16 · 0.25 + 1 · 2 + 3 · 0.5 + 2.5 · 1. (1.4)

Here, the value of the dependent experimental variable is y7 = 10, the values of the independent experimental variables are x71 = 16, x72 = 1, x73 = 3, x74 = 2.5, and the βj values are β1 = 0.25, β2 = 2, β3 = 0.5, β4 = 1. For understanding the GLM, it is important to be clear about which variables are known at which point of a research project: the values of the independent experimental variables xij, i = 1, ..., n, j = 1, ..., p are specified by the researcher and are hence known as soon as the researcher has decided on the design of a given experiment. The dependent experimental variable values yi, i = 1, ..., n are known as soon as the researcher has collected data in response to the independent experimental variable values xi1, ..., xip. However, how much each of the independent experimental variables xi1, ..., xip contributes to the sum on the right-hand side of (1.2), and thus to the observed data on the left-hand side of (1.2), is unknown at this point. In other words, the weighting coefficients β1, ..., βp are not known to the researcher in advance and have to be estimated. As discussed in Section 1.1, the process of identifying these parameter values is referred to as model estimation and will be discussed in detail in subsequent Sections.
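The weighted sum in the numerical example (1.4) can be written as a dot product. The following minimal sketch, added here for illustration only and using NumPy as an assumed tool, reproduces the value y7 = 10:

```python
# Sketch: the numerical example of eq. (1.4) as a weighted sum (dot product).
import numpy as np

x_7 = np.array([16.0, 1.0, 3.0, 2.5])   # values of the independent variables x_71, ..., x_74
beta = np.array([0.25, 2.0, 0.5, 1.0])  # beta parameters beta_1, ..., beta_4
y_7 = x_7 @ beta                        # x_71*beta_1 + x_72*beta_2 + x_73*beta_3 + x_74*beta_4
print(y_7)                              # 10.0
```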


Structural aspects of simple linear regression

Expressions such as eq. (1.2) are often referred to as multiple linear regression models and are usually introduced as generalizations of simple linear regression models to scenarios of more than one independent experimental variable. In the current Section, we consider the structural aspects of simple linear regression models to introduce the GLM matrix notation. In undergraduate statistics, simple linear regression models are often written as

y = a + bx, (1.5)

where y is referred to as the dependent experimental variable, a is referred to as the offset, b is referred to as the slope, and x is referred to as the independent experimental variable. Crucially, eq. (1.5) encodes the idea that if we know the values of x, b, and a, we can compute the value of y. Let us hence assume that we would like to compute the value of y for five different values of x, namely,

x12 = 0.2, x22 = 1.4, x32 = 2.3, x42 = 0.7, and x52 = 0.5. (1.6)

Notably, the values of x and y are allowed to vary, whereas the values of a and b are fixed. Let us hence assume that a = 0.8 and that b = 1.3. We may thus write the five values of y corresponding to the five values of x as

y1 = a + bx12 = 1 · 0.8 + 1.3 · 0.2

y2 = a + bx22 = 1 · 0.8 + 1.3 · 1.4

y3 = a + bx32 = 1 · 0.8 + 1.3 · 2.3 (1.7)

y4 = a + bx42 = 1 · 0.8 + 1.3 · 0.7

y5 = a + bx52 = 1 · 0.8 + 1.3 · 0.5.

Using matrix notation as formally introduced in Section 4 | Matrix algebra, we can equivalently express (1.7) as

\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \end{pmatrix} = \begin{pmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \\ x_{31} & x_{32} \\ x_{41} & x_{42} \\ x_{51} & x_{52} \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} 1 & 0.2 \\ 1 & 1.4 \\ 1 & 2.3 \\ 1 & 0.7 \\ 1 & 0.5 \end{pmatrix} \begin{pmatrix} 0.8 \\ 1.3 \end{pmatrix} = \begin{pmatrix} 1 \cdot 0.8 + 1.3 \cdot 0.2 \\ 1 \cdot 0.8 + 1.3 \cdot 1.4 \\ 1 \cdot 0.8 + 1.3 \cdot 2.3 \\ 1 \cdot 0.8 + 1.3 \cdot 0.7 \\ 1 \cdot 0.8 + 1.3 \cdot 0.5 \end{pmatrix}. (1.8)

Note that in (1.8) we have introduced another variable xi1, i = 1, ..., n, which takes on the value 1 for all values of yi for i = 1, ..., n and serves the purpose of including the offset 0.8 on the right-hand side. Independent variables that take on only the values 0, 1, or −1 are sometimes referred to as dummy variables. What is the benefit of rewriting eq. (1.7) in the form of eq. (1.8)? Conceptually nothing has changed, but notationally, we can now express the relatively large expression (1.7) much more compactly. To do so, we define

y := \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \end{pmatrix}, X := \begin{pmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \\ x_{31} & x_{32} \\ x_{41} & x_{42} \\ x_{51} & x_{52} \end{pmatrix}, and β := \begin{pmatrix} a \\ b \end{pmatrix}. (1.9)

Moreover, the definition of β in (1.9) can be simplified and aligned to the notation used in the previous Section by setting β1 := a and β2 := b, i.e., by defining

β := \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}. (1.10)

Take note of the dimensions of y, X, and β: y is a 5 × 1 vector, X is a 5 × 2 matrix, and β is a 2 × 1 vector. In matrix form, we can thus write (1.7) very compactly as

y = Xβ, where y ∈ R5, X ∈ R5×2, and β ∈ R2. (1.11)

Matrix notation thus allows for neatly summarizing sets of linear equations as made explicit in (1.7). Moreover, as will become clear in subsequent Sections, matrix algebra also allows for writing other aspects of the GLM, such as parameter estimation and the evaluation of statistics, in very compact forms that can readily be implemented in computer code.
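To make this compactness concrete, the following sketch, which is not part of the original text and uses NumPy as an illustrative choice, builds the design matrix of eq. (1.8) and evaluates y = Xβ for the offset a = 0.8 and slope b = 1.3:

```python
# Sketch: the simple linear regression design of eq. (1.8) in matrix form.
import numpy as np

x = np.array([0.2, 1.4, 2.3, 0.7, 0.5])  # independent variable values x_12, ..., x_52
X = np.column_stack([np.ones(5), x])     # 5 x 2 design matrix with a constant first column
beta = np.array([0.8, 1.3])              # offset a and slope b
y = X @ beta                             # y = X beta, i.e., eq. (1.11) without an error term
print(y)                                 # approximately [1.06 2.62 3.79 1.71 1.45]
```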


To conclude this Section, consider again the GLM equation (1.1). In comparison to eq. (1.11), it is apparent that we have not yet considered the error term ε. In fact, the right-hand side of eq. (1.11) merely describes the structural or deterministic aspect of the GLM. In the following Section, we shall thus consider the error term, which reflects the probabilistic aspect of the GLM and provides an essential contribution to the data y and its conception as a random variable.

Probabilistic aspects

Before delving into the meaning of the error term ε in the GLM equation (1.1), we shall summarize in more general form what we have learned so far. To this end, we first note that a fundamental aspect of the GLM equation (1.1) is that it generalizes many experimental design cases, such as the simple linear regression design discussed above. In all generality and using matrix notation as introduced in Section 4 | Matrix algebra, the structural elements of the GLM equation (1.1) take the forms y ∈ Rn, X ∈ Rn×p, and β ∈ Rp. Note that the design matrix always has as many rows as there are data values (n) and as many columns as there are parameter values (p). Explicitly, we may thus write the GLM equation (1.1) as

\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{pmatrix} + ε. (1.12)

We now consider ε in (1.12) in more detail. We first note that because X ∈ Rn×p and y ∈ Rn, ε must also be an n-dimensional real vector, i.e., ε ∈ Rn, and we thus have

\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}. (1.13)

We next consider the ith row of (1.13), which reads

yi = xi1β1 + xi2β2 + ... + xipβp + εi. (1.14)

The right-hand side of (1.14) now corresponds to the full GLM assumption about the ith data value yi and comprises two categorically different entities. The first part xi1β1 + xi2β2 + ... + xipβp is the structural, deterministic part already discussed above. The value εi, on the other hand, is conceived as the realization of a random variable. This means that the values εi for i = 1, ..., n are governed by random variables and their associated probability distributions. We might know some parameters of these probability distributions, but the exact values of the εi’s do not follow deterministically from this knowledge. Eq. (1.14) thus implies that the value yi is given by the sum of a deterministic and a probabilistic term. Next, consider obtaining sample values εi and adding them to the deterministic value

µi := xi1β1 + xi2β2 + ... + xipβp, (1.15) such that yi = µi + εi. (1.16)

In eq. (1.16), µi is a deterministic value and εi is a random variable realization. We now make the central assumption that the values εi are drawn from independent univariate Gaussian distributions with specified expectation parameter 0 and variance parameter σ2 > 0, which will formally be introduced in Section 5 | Probability theory and Section 6 | Probability distributions. For small values of σ2, the sampled values εi will be close to zero, but on occasion they may be a little bit positive or a little bit negative. Consider drawing the sample values ε1 = 0.200, ε2 = −0.001, ε3 = 0.050 for µ1 = µ2 = µ3 = 1. If we evaluate (1.16) for these values, we obtain

y1 = µ1 + ε1 = 1 + 0.200 = 1.200

y2 = µ2 + ε2 = 1 − 0.001 = 0.999 (1.17)

y3 = µ3 + ε3 = 1 + 0.050 = 1.050.

The most important thing to realize about (1.17) is that despite the fact that each yi has the same deterministic aspect µi = 1 for i = 1, 2, 3, the values yi, i = 1, 2, 3, still vary, because realizations of random variables are added to the µi for i = 1, 2, 3. Crucially, this renders the yi themselves realizations of random variables. We can also infer how the random variables they result from are distributed: because the random variables governing the εi's have an expectation of zero, the expectation of the random variables governing the yi's will correspond to the deterministic aspects µi. The variance of the random variables governing the yi's, on the other hand, corresponds to the variance of the random variables governing the εi's. There are two ways to express this more formally. We can either state that

yi = µi + εi, (1.18)

where εi is a realization of a random variable distributed according to a univariate Gaussian distribution with expectation parameter 0 and variance parameter σ2, such that in distribution form (Section 7 | Probability distributions) we may write

εi ∼ N(0, σ2). (1.19)

Equivalently, we may state that yi is a realization of a random variable distributed according to a univariate Gaussian distribution with expectation parameter µi and variance parameter σ2, such that in distribution form we may write

yi ∼ N(µi, σ2). (1.20)

Formally, (1.20) follows directly from application of the linear-affine transformation theorem for Gaussian distributions to εi under addition of µi (as introduced in Section 7 | Probability distributions). Next, recall that

µi = xi1β1 + xi2β2 + ... + xipβp, (1.21)

which may be re-expressed using matrix multiplication as

µi = xiβ, (1.22)

where we defined xi ∈ R1×p as the row vector

xi := (xi1  xi2  ...  xip) ∈ R1×p, (1.23)

and β ∈ Rp as the column vector

β := \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{pmatrix}. (1.24)

Note that xi ∈ R1×p in (1.22) corresponds to the ith row of the design matrix X ∈ Rn×p of the GLM equation (1.1) and that β ∈ Rp in (1.22) corresponds to the beta parameter of the GLM equation (1.1). We may thus rewrite (1.20) as

yi ∼ N(xiβ, σ2). (1.25)

Notably, we consider this probability distribution as parameterized by β and σ2, but not by xi, as the values of the independent experimental variable usually do not correspond to a model parameter. Eq. (1.25) summarizes an essential aspect of the GLM: the ith data point yi is assumed to be a realization of a univariate Gaussian probability distribution with expectation parameter governed by the ith row of the design matrix and the beta parameter β, as well as a variance parameter that is determined by the variance parameter of the error term.

Finally, we can consider a more macroscopic perspective on the error terms εi, i = 1, ..., n. In Section 7 | Probability distributions, we will see that the probability density function of n independent univariate Gaussian random variables with expectation parameters µi, i = 1, ..., n and common variance parameter σ2 is identical to the probability density function of the random vector comprising these random variables with expectation parameter vector µ := (µ1, ..., µn)T and spherical covariance matrix σ2In. Applied to the current scenario, we may write this identity as

∏_{i=1}^{n} N(εi; 0, σ2) = N(ε; 0n, σ2In), (1.26)

where ε := (ε1, ..., εn)T ∈ Rn denotes the vector of error terms as introduced in (1.13) and 0n ∈ Rn denotes a vector of zeros. By noting that the product of the design matrix X ∈ Rn×p and the parameter vector β ∈ Rp results in a vector µ := Xβ ∈ Rn, the left-hand side of (1.13) hence corresponds to a special case of the linear-affine transformation theorem for Gaussian distributions, for which the transformation matrix conforms to the identity matrix In. To make this explicit, let ε denote a random vector distributed according to an n-variate Gaussian distribution with expectation parameter 0n and covariance matrix parameter σ2In, let In denote a transformation matrix, and let Xβ ∈ Rn. Then the random vector

y = Inε + Xβ = Xβ + ε with ε ∼ N(0n, σ2In) (1.27)

is distributed according to the n-variate Gaussian distribution

y ∼ N(Xβ, σ2In). (1.28)

Eq. (1.28) is central for the theory of the GLM as discussed in the following. With respect to the GLM equation (1.1), it puts a stronger emphasis on the GLM as a probabilistic model and will form the basis for GLM parameter estimation and GLM model evaluation in all subsequent Sections. Note that throughout, we have made the explicit assumption that the error term ε is distributed according to a Gaussian distribution. While accounts of the GLM exist that only postulate that ε is random (but not necessarily Gaussian distributed), these accounts do not align well with the classical Frequentist theory of GLM inference and testing, as will become evident in Section 8 | Maximum likelihood estimation and Section 9 | Frequentist distribution theory. Moreover, standard Bayesian conjugate inference using Gaussian prior distributions for the beta parameter also necessitates Gaussian error terms.

Figure 1.3. A sample of a simple linear regression model, obtained by sampling an 11-dimensional Gaussian distribution. Here, the design matrix comprises a column of 11 ones and a column of the values 0, 1, 2, ..., 10, corresponding to the matrix entries x12, x22, ..., x11,2. The black dots depict the result of the multiplication of this design matrix with the beta parameter vector β := (1, 1)T, resulting in the Gaussian expectation vector µ := Xβ ∈ R11. The blue dots depict the result of sampling the 11-dimensional Gaussian distribution with expectation parameter µ := Xβ ∈ R11 and spherical covariance matrix σ2I11, where σ2 := 3. The abscissa shows the independent variable (xi2) from 0 to 10 and the ordinate the dependent variable (glm 1.m).

Probabilistic aspects of simple linear regression

To conclude the current Section, we briefly discuss how eq. (1.28) can be used for probabilistically sampling a simple linear regression model. To this end, we first note that most programming environments provide the functionality to sample n-dimensional Gaussian distributions with specified expectation and covariance matrix parameters (e.g., the routine scipy.stats.multivariate_normal of Python's SciPy package or the mvnrnd.m function in Matlab). To sample a simple linear regression model with n data points according to eq. (1.28), we thus define (1) the expectation parameter by the matrix product of the simple linear regression design matrix X ∈ Rn×2 and a chosen parameter vector β ∈ R2 and (2) the covariance matrix parameter by multiplying a chosen variance parameter σ2 > 0 with the n × n-dimensional identity matrix In. Figure 1.3 visualizes this process.
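As a sketch of this procedure, and assuming Python with SciPy as the programming environment (the scipy.stats.multivariate_normal routine mentioned above), the design of Figure 1.3 may be sampled as follows; the random seed is an illustrative assumption:

```python
# Sketch: sampling the simple linear regression model of Figure 1.3
# according to eq. (1.28), y ~ N(X beta, sigma^2 I_n).
import numpy as np
from scipy import stats

n = 11
X = np.column_stack([np.ones(n), np.arange(n)])  # design matrix: a constant column and 0, 1, ..., 10
beta = np.array([1.0, 1.0])                      # beta parameter vector
sigma_sq = 3.0                                   # variance parameter sigma^2

mu = X @ beta                                    # expectation parameter mu = X beta
Sigma = sigma_sq * np.eye(n)                     # spherical covariance matrix sigma^2 I_n
y = stats.multivariate_normal.rvs(mean=mu, cov=Sigma, random_state=0)
print(y)                                         # one realization of the data vector y
```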

1.4 Bibliographic remarks

Efron and Hastie (2016) provide an excellent overview of the historical development of statistical methodology from the beginning of the twentieth century until today. Wasserman (2004) provides a concise overview of Frequentist and Bayesian methodology. Bishop (2006), Barber (2012), and Murphy (2012) cover a wide range of statistical techniques from the machine learning perspective. Friston (2007) provides a comprehensive, but technically advanced, overview of most of the techniques discussed in the following and a wide variety of additional methods implemented in the SPM Matlab toolbox (https://www.fil.ion.ucl.ac.uk/spm/). The current Section is modelled on the verbose discussion of the GLM in the later chapters of applied statistics textbooks, such as Hays (1994).

1.5 Study questions

1. Give definitions of the terms model formulation, model estimation, and model evaluation.
2. What is the difference between static and dynamic models?
3. Provide a brief overview of differences and commonalities between Frequentist and Bayesian statistics.
4. Define the terms independent experimental variable, dependent experimental variable, categorical variable, and continuous variable.
5. Explain the difference between within- and between-participant experimental designs.
6. Consider the GLM equation y = Xβ + ε (1.29). Which of the symbols y, X, β, ε represents independent experimental variables, and which of the symbols represents dependent experimental variables?
7. Consider the GLM equation y = Xβ + ε (1.30). In an experimental context, which of the components of this equation are known before performing an experiment, and which of the components are known after the experiment but before estimating the model?
8. The design matrix X is of dimensionality n × p. What do n ∈ N and p ∈ N represent, respectively?
9. Express the GLM (matrix) equation y = Xβ + ε (1.31) as a set of (simultaneous non-matrix) equations for n := 4 and p := 3.
10. Name and explain the components and their properties of the GLM equation y = Xβ + ε (1.32).

2 | Sets, sums, and functions

In this Section, we review very fundamental and very important mathematical concepts. Essentially, one of the main aims of modern mathematics is to express mathematical content as sets and functions between sets. For example, a data-analytical model can be understood as a function that maps experimental conditions onto observable data under additive stochastic noise. Stochastic noise can be modelled as a random variable, which is formally defined as a function from a probability space to an outcome space. Model estimation can be understood as a function that maps data onto statistics, and model evaluation can be understood as a function that maps models and data onto model evaluation criteria. Being able to mathematically formulate sets and functions is an important precondition for implementing data analysis methods in computer code: standard imperative programming rests on defining functions on predefined sets of data structures, while object-oriented programming allows for representing close associations between sets of data and functions defined on that data. To provide the necessary mathematical language for these considerations, we first review the notion of a set and important sets of numbers that can be used to represent quantitative data (Section 2.1). We next consider important mathematical notation for representing basic mathematical entities, such as sums, products, and exponentiation operations (Section 2.2). Finally, we review the abstract notion of a function (or mapping) and some essential functions that will be encountered repeatedly in the context of the GLM (Section 2.3).

2.1 Sets

A set may be defined according to Cantor (1895) as follows: “A set is a gathering together into a whole of definite, distinct objects - which are called elements of the set.” We primarily use sets as a means to identify the mathematical objects we are dealing with. Sets are usually denoted using curly brackets. For example, the set A comprising the first five lower-case letters of the Roman alphabet is denoted as

A := {a, b, c, d, e} . (2.1)

Note that the symbol “:=” is used to denote a definition and the symbols “=” and “≠” are used to denote equalities and inequalities. Definitions are justified by their utility in a given context and cannot be false, whereas equalities and inequalities follow from definitions by logical inference. There are three ways to define sets:

• A set may be defined by listing the elements of the set as in expression (2.1).

• A set may be defined by specifying the properties of the elements of the set, for example as in

B := {x|x is one of the first five lowercase letters of the alphabet}. (2.2)

Here, the variable x in front of the vertical bar denotes the elements of the set in a generic way and the statement after the vertical bar expresses the defining properties of the set.

• Finally, a set may be defined by defining it to be equal to another set, for example,

C := N, (2.3)

where N denotes the set of natural numbers to be introduced below.

Elements, cardinality, subsets, and supersets

To indicate that b is an element of a set A, we write

b ∈ A, (2.4)

which should be read as “b is in A” or “b is an element of A”. To indicate that, for example, 2 is not an element of a set A, we write

2 ∉ A, (2.5)

which may be read as “2 is not in A” or “2 is not an element of A”. The number of elements comprising a set is called the cardinality of the set and is denoted by vertical bars. For example, for the sets defined in (2.1) and (2.2), it holds that

|A| = |B| = 5, (2.6)

because both sets contain five elements. A set that does not contain any elements has a cardinality of zero and is called the empty set. Empty sets are denoted by ∅. If a set B contains all elements of another set A and A contains some additional elements, i.e., the two sets are not equal, then B is said to be a subset of A, denoted as B ⊂ A, and A is said to be a superset of B, denoted as A ⊃ B. For example, if A := {1, 2, a, b} and B := {1, a}, then B ⊂ A and A ⊃ B, because all elements of B are also in A. If a set B may either be a subset of or equal to another set A, the notations B ⊆ A and A ⊇ B are used.
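For readers who want to experiment with these notions, the following minimal Python sketch, which is not part of the original text, expresses membership, cardinality, and the subset and superset relations for the example sets A := {1, 2, a, b} and B := {1, a}:

```python
# Sketch: set membership, cardinality, and subset relations in Python.
A = {1, 2, "a", "b"}
B = {1, "a"}

print("a" in A)        # True:  "a" is an element of A
print(2 in B)          # False: 2 is not an element of B
print(len(A), len(B))  # cardinalities |A| = 4 and |B| = 2
print(B < A)           # True:  B is a subset of A (and not equal to it)
print(A >= B)          # True:  A is a superset of B or equal to it
```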

Unions, intersections, differences, and complements

Let M and N be two arbitrary sets. Then

M ∪ N := {x|x ∈ M or x ∈ N} (2.7) defines the union of the two sets M and N. The union of two sets is the set which comprises all elements that are either in M (only), in N (only), or in both M and N. The “or” in the definition of M ∪ N is thus understood in an inclusive logical “and/or” manner, rather than in the exclusive logical “or” way. As an example, for M := {1, 2, 3} and N := {2, 3, 5, 7}, we have

M ∪ N = {1, 2, 3, 5, 7}. (2.8)

The intersection of two sets M and N is defined as

M ∩ N := {x|x ∈ M and x ∈ N}. (2.9)

The intersection M ∩N is thus a set that only comprises elements that are both in M and N. For example, for M := {1, 2, 3} and N := {2, 3, 5, 7}, we have

M ∩ N = {2, 3}, (2.10) because 2 and 3 are the only numbers that are both in M and N. If the intersection of two sets is the empty set, the two sets are said to be disjoint. The difference of two sets is defined as

M \ N := {x|x ∈ M and x∈ / N}. (2.11)

The set M \N thus comprises all elements which are in M, but are not in N. For example, for M := {1, 2, 3} and N := {2, 3, 5, 7}, we have M \ N = {1}, (2.12) because the elements 2, 3 ∈ M are also in N. The remaining elements of N do not play a role in the evaluation of the difference set. Notably, the difference of two sets is not symmetric. For example, for M := {1, 2, 3} and N := {2, 3, 5, 7}, we have

N \ M = {5, 7} ≠ {1} = M \ N. (2.13)

Finally, assume that N ⊂ M. Then the complement of N with respect to M is the set of all elements of M which are not in N. We denote and define this complement as

N^c := {x ∈ M | x ∉ N}. (2.14)

Alternatively, we may state that

N^c = M \ N. (2.15)

The complement as defined here is sometimes also referred to as absolute complement.
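The set operations of this subsection map directly onto Python's built-in set type. The following sketch, added here for illustration with a hypothetical superset O for forming complements, reproduces the example results (2.8), (2.10), (2.12), and (2.13):

```python
# Sketch: union, intersection, difference, and complement for the example sets
# M = {1, 2, 3} and N = {2, 3, 5, 7}; O is a hypothetical superset for complements.
M = {1, 2, 3}
N = {2, 3, 5, 7}
O = {1, 2, 3, 4, 5, 6, 7}

print(M | N)   # union, cf. eq. (2.8)
print(M & N)   # intersection, cf. eq. (2.10)
print(M - N)   # difference M \ N, cf. eq. (2.12)
print(N - M)   # difference N \ M: not symmetric, cf. eq. (2.13)
print(O - N)   # complement of N with respect to O, cf. eq. (2.15)
```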


Figure 2.1. Venn diagram visualization of the identity M ∪ N = (M ∩ N^c) ∪ (M ∩ N) ∪ (M^c ∩ N) for two sets M and N that are subsets of a superset O.

Example. As an example for the notions of unions, intersections, and complements, we highlight that the union of two sets M and N, which are both subsets of a set O can be expressed as

M ∪ N = (M ∩ N^c) ∪ (M ∩ N) ∪ (M^c ∩ N). (2.16)

Note that it is implied that the formation of complements is with respect to the superset O. In lieu of a formal proof, we provide a Venn diagram visualization of the statement in Figure 2.1. Finally, some notational remarks. Both unions and intersections may be applied to multiple, indexed sets. For example, if A1, A2, and A3 are sets, then the union of these three sets is

A = A1 ∪ A2 ∪ A3. (2.17)

Note that this set comprises all elements which are in A1 and/or in A2 and/or A3. To simplify the notation of the union of many indexed sets, the index and its maximal value can be sub- and superscripted at the union symbol, such that the above can be written as

A = ∪_{i=1}^{3} Ai = A1 ∪ A2 ∪ A3. (2.18)

Analogously, the union of an infinite number of sets A1,A2, ... is denoted as

A = ∪_{i=1}^{∞} Ai, (2.19)

B = ∩_{i=1}^{∞} Bi. (2.20)

Power sets

Let M denote a set. Then the power set P(M) of M is the set of all subsets of M. The power set always includes the empty set and the original set M. Without proof, we note that the cardinality of the power set of a set with cardinality n is 2^n, i.e., if |M| = n, then |P(M)| = 2^n.

Example. For example, let M := {1, 2, 3}. (2.21) Then the power set of M is

P(M) = {∅, {1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}}. (2.22)

Note that |M| = 3 and |P(M)| = 2^3 = 8.
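A power set can be enumerated programmatically by collecting all combinations of all sizes. The following sketch, not part of the original text, reproduces (2.22) for M = {1, 2, 3} using Python's itertools:

```python
# Sketch: enumerating the power set of M = {1, 2, 3} via itertools.
from itertools import chain, combinations

M = {1, 2, 3}
elements = sorted(M)  # fix an element order for reproducible output
subsets = list(chain.from_iterable(
    combinations(elements, r) for r in range(len(elements) + 1)
))
print(subsets)        # [(), (1,), (2,), (3,), (1, 2), (1, 3), (2, 3), (1, 2, 3)]
print(len(subsets))   # 8 = 2**len(M), consistent with |P(M)| = 2^|M|
```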

Selected sets of numbers

We next introduce a selection of sets of numbers that are essential for modelling quantitative data.

The set of natural numbers, also referred to as positive integers, is denoted by N and is defined as

N = {1, 2, 3,...}, (2.23) where the dots “...” denote “to infinity”. Subsets of the set of natural numbers are the sets of natural numbers of order n ∈ N, which are defined as

Nn := {1, 2, . . . , n}. (2.24)


The union of the set of natural numbers and zero is denoted by N0, i.e., N0 := N ∪ {0}. If the “negative natural numbers” are added to the set N, the set of integers, which is defined as

Z := {..., −3, −2, −1, 0, 1, 2, 3, ...}, (2.25)

results. In turn, adding ratios of integers to Z yields the set of rational numbers, which is defined as

Q := {p/q | p, q ∈ Z, q ≠ 0}. (2.26)

The most important set of numbers for the purposes of quantitative data analysis is the set of real numbers, denoted by R. The real numbers R are a superset of the rational numbers, i.e., Q ⊂ R. In addition to the rational numbers, the real numbers include the solutions of some algebraic equations, for example √2, the solution of the equation x^2 = 2, which is not an element of Q. These numbers are called irrational numbers. Additionally, the real numbers include the limits of sequences of irrational numbers, such as π ≈ 3.14. Intuitively, the set of real numbers is the set one thinks of when referring to “continuous experimental variables”. Notably, there exist infinitely many real numbers between any two distinct x ∈ R and y ∈ R, while R also extends to negative and positive infinity (but neither includes negative nor positive infinity). More formally, it can be shown that there are more real numbers than natural numbers (Cantor, 1892), i.e., that the real numbers are an uncountable infinite set. Finally, on occasion, one restricts attention to the non-negative or positive real numbers, which are denoted by

R≥0 := {x ∈ R|x ≥ 0} and R>0 := {x ∈ R|x > 0}, (2.27) respectively.

Intervals

Contiguous subsets of the real numbers are referred to as intervals. We will primarily deal with closed intervals, i.e., subsets of R which are defined in terms of the upper and lower boundary values a, b ∈ R by

[a, b] := {x ∈ R|a ≤ x ≤ b}. (2.28) Note that both a and b are elements of [a, b] and that [a, b] is defined as the empty set if b < a. In addition to closed intervals, one may define three further types of intervals

]a, b] := {x ∈ R | a < x ≤ b},
[a, b[ := {x ∈ R | a ≤ x < b}, (2.29)
]a, b[ := {x ∈ R | a < x < b},

referred to as left- or right-semi-open and open intervals. Note, for example, that the open interval ]a, b[ neither contains a nor b.

Cartesian products

Let M and N denote two sets. Then the Cartesian product of M and N is the set of all ordered tuples (m, n) for which m ∈ M and n ∈ N. The Cartesian product of M and N thus comprises the dyads of all elements of M with all elements of N. Formally, the Cartesian product of two sets M and N is denoted and defined as

M × N := {(m, n) | m ∈ M, n ∈ N}. (2.30)

The cardinality of the Cartesian product of two sets M and N with finite cardinalities is given by

|M × N| = |M| · |N|. (2.31)

The Cartesian product of a set with itself is denoted and defined as

M^2 := M × M := {(m, m′) | m ∈ M, m′ ∈ M}. (2.32)


Examples. Consider the sets M := {1, 2, 3} and N := {2, 4}. Then the Cartesian product of M and N is the set M × N := {(1, 2), (1, 4), (2, 2), (2, 4), (3, 2), (3, 4)}. (2.33)

Similarly, consider the sets I1 := [1, 2] and I2 := [3, 5]. Then the Cartesian product of I1 and I2 is the “rectangle”

I1 × I2 = [1, 2] × [3, 5] = {(x1, x2) | 1 ≤ x1 ≤ 2, 3 ≤ x2 ≤ 5}. (2.34)

The Cartesian product of two sets can be generalized to the Cartesian product of a finite set of n sets. To this end, let M1, ..., Mn denote a set of n sets. Then the Cartesian product of M1, ..., Mn is denoted and defined as

The Cartesian product of M1, ..., Mn is thus the set of ordered n-tuples (m1, ..., mn) for which mi ∈ Mi, i = 1, ..., n. As a special case, the n-fold Cartesian product of a set X with itself is denoted and defined as n n Y X := Xi := {(x1, ..., xn)|x1 ∈ X, ..., xn ∈ X}. (2.36) i=1

The set Rn. The n-fold Cartesian product of the set of real numbers with itself is denoted and defined as n n Y R := R = {x := (x1, ..., xn)|xi ∈ R}. (2.37) i=1 n n R is thus the set of ordered n-tuples (x1, ..., xn), for which xi ∈ R, i = 1, ..., n. The elements of R are typically denoted as column lists   x1  .  x :=  .  (2.38) xn and referred to as n-dimensional real vectors. We thus often write “x ∈ Rn” to express that x is a column vector of n real numbers. An example for a four-dimensional real vector x ∈ R4 is 0.16 1.76 x =   . (2.39) 0.23 7.01

From the perspective of the set Rn, the special case R1 = R is the set of real numbers, the elements of which are often referred to as scalars.

2.2 Sums, products, and exponentiation

We next introduce three often encountered concepts that relate to basic mathematical operations: the sum symbol, the product symbol, and exponentiation.

Sums and products

In mathematics, one often has to add numbers. A concise way to represent sums is afforded by the sum symbol

∑ (2.40)

The sum symbol is reminiscent of the Greek letter Sigma (Σ), corresponding to the Roman capital S and thus mnemonic for sum. The terms summed over are denoted to the right of the sum symbol, usually with the help of indices. For example, for x1, x2, x3 ∈ R, we can write the equation

x1 + x2 + x3 = y (2.41)

in shorthand notation as

∑_{i=1}^{3} xi = y. (2.42)


The subscript on the sum symbol indicates the running index and its initial value, here given by i and 1, respectively. The superscript on the sum symbol denotes the final value of the running index, here given by i = 3. To become familiar with the sum symbol, we consider the following examples

a = 1 + 4 + 9 + 16 + 25 + 36 + 49 + 64 + 81 + 100 (2.43)

b = 1 · x1 + 2 · x2 + 3 · x3 + ... + n · xn (2.44) c = 2 + 2 + 2 + 2 + 2. (2.45)

Using the sum symbol, the sums a, b, and c may be written as follows: for a, all squares of the natural numbers from 1 to 10 are summed up. We thus write

a = ∑_{i=1}^{10} i^2 = 1 + 4 + 9 + 16 + 25 + 36 + 49 + 64 + 81 + 100. (2.46)

Note that the denotation of the index variable is irrelevant, as we have for example

a = ∑_{i=1}^{10} i^2 = ∑_{j=1}^{10} j^2. (2.47)

For b, we are given the numbers x1, . . . , xn ∈ R and have to multiply each one with its index and then add them all up. We thus write

b = ∑_{i=1}^{n} i · xi = 1 · x1 + 2 · x2 + 3 · x3 + ... + n · xn. (2.48)

Finally, for c we have to add the number 2 five times. For this, we write

c = ∑_{i=1}^{5} 2 = 2 + 2 + 2 + 2 + 2. (2.49)

Constant multiplicative factors in sums, i.e., multiplicative factors that do not depend on the sum index, may either be written to the right or to the left of the sum symbol. To see this, consider the arithmetic mean x̄ of n real numbers: the arithmetic mean of n real numbers x1, x2, ..., xn is defined as the sum of the n numbers divided by n,

x̄ := (x1 + x2 + ... + xn)/n. (2.50)

Using the sum symbol notation, we have

x̄ = (x1 + x2 + ... + xn)/n = x1/n + x2/n + ... + xn/n = ∑_{i=1}^{n} xi/n = ∑_{i=1}^{n} (1/n) xi, (2.51)

or, equivalently,

x̄ = (x1 + x2 + ... + xn)/n = (∑_{i=1}^{n} xi)/n = (1/n) ∑_{i=1}^{n} xi. (2.52)

We have thus shown that

∑_{i=1}^{n} (1/n) xi = (1/n) ∑_{i=1}^{n} xi. (2.53)

Constant factors under the sum symbol may thus be taken out of the summation operation. Finally, another common mathematical operation is the multiplication of numbers. To this end, the product sign Π (the Greek capital Pi, for product) allows for writing products of multiple factors in a concise manner. In complete analogy to the sum symbol, the product symbol has the following semantics

$$\prod_{i=1}^{n} a_i = a_1 \cdot a_2 \cdot \ldots \cdot a_n. \tag{2.54}$$
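For illustration, the sum and product symbols and the constant-factor property (2.53) can be evaluated numerically. The following is a minimal Python sketch using NumPy; it is an illustration only and not part of the accompanying MATLAB scripts, and the example numbers are chosen arbitrarily.

```python
import numpy as np

# Example a: the sum of the squares of the natural numbers from 1 to 10, cf. (2.46)
a = sum(i**2 for i in range(1, 11))            # 385

# Example c: adding the number 2 five times, cf. (2.49)
c = sum(2 for _ in range(5))                   # 10

# Product symbol, cf. (2.54): 1*2*3*4 = 24
p = np.prod([1, 2, 3, 4])

# Constant-factor property (2.53): sum_i (1/n) x_i equals (1/n) sum_i x_i,
# and both equal the arithmetic mean (2.50)
x = np.array([3.0, 2.0, 5.0, -2.0])            # arbitrary example numbers
n = len(x)
lhs = sum((1.0 / n) * xi for xi in x)
rhs = (1.0 / n) * sum(xi for xi in x)
print(a, c, p, np.isclose(lhs, rhs), np.isclose(lhs, x.mean()))
```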


Exponentiation

Finally, we review some basic aspects of the exponentiation operation. For $a \in \mathbb{R}$ and $n \in \mathbb{N}_0$, "$a$ to the power of $n$" is defined recursively as

$$a^0 := 1 \quad\text{and}\quad a^{n+1} := a^n \cdot a. \tag{2.55}$$

Further, for $a \in \mathbb{R} \setminus \{0\}$ and $n \in \mathbb{N}$, "$a$ to the power of minus $n$" is defined by
$$a^{-n} := (a^n)^{-1} := \frac{1}{a^n}. \tag{2.56}$$
In $a^n$, $a$ is referred to as the base and $n$ is referred to as the exponent or power. Based on the definitions in (2.55) and (2.56), the following familiar laws of exponentiation can be derived, which hold for all $a, b \in \mathbb{R}$ and $n, m \in \mathbb{Z}$ (and, in the case of negative powers, $a \neq 0$):
$$a^n a^m = a^{n+m}, \tag{2.57}$$

$$(a^n)^m = a^{nm}, \tag{2.58}$$
$$(ab)^n = a^n b^n. \tag{2.59}$$

The $n$th root of a number $a \in \mathbb{R}$ is defined as a number $r \in \mathbb{R}$ such that its $n$th power equals $a$:
$$r^n := a. \tag{2.60}$$

From this definition it follows that the $n$th root may equivalently be written using a rational exponent

$$r = a^{\frac{1}{n}}, \tag{2.61}$$
because, with (2.58) and (2.57), it then follows that

$$r^n = \left(a^{\frac{1}{n}}\right)^n = a^{\frac{1}{n}} \cdot a^{\frac{1}{n}} \cdot \ldots \cdot a^{\frac{1}{n}} = a^{\sum_{i=1}^{n} \frac{1}{n}} = a^1 = a. \tag{2.62}$$

The familiar square root of a number a ∈ R, a ≥ 0 may thus equivalently be written as

$$\sqrt{a} = a^{\frac{1}{2}}, \tag{2.63}$$
which, together with the laws of exponentiation (2.57)-(2.59), often simplifies the handling of square roots in mathematical expressions.

2.3 Functions

Functions are operations between sets and constitute rules that relate elements of one set to elements of another set. In the following, we review how to mathematically formulate functions, discuss some basic properties of functions, and finally review a collection of essential functions.

Formulation A function f is generally specified as

Formulation. A function $f$ is generally specified as
$$f : D \to R, \; x \mapsto f(x), \tag{2.64}$$
where the set $D$ is called the domain of the function $f$ and the set $R$ is called the range or co-domain of the function $f$. In (2.64), the statement $f : D \to R$ should be read as "the function $f$ maps all elements of the set $D$ onto elements of the set $R$". The statement $x \mapsto f(x)$ in (2.64) denotes the mapping of the domain element $x \in D$ onto the range element $f(x) \in R$ and should be read as "$x$, which is an element of $D$, is mapped by $f$ onto $f(x)$, which is an element of $R$". Note that the arrow $\to$ is used to denote the mapping between the two sets $D$ and $R$, and the arrow $\mapsto$ is used to denote the mapping between an element in the domain of $f$ and an element in the range of $f$. The most important aspect of (2.64) is that specifying functions in this way differentiates between the function $f$ proper, which corresponds to an abstract rule, and elements $f(x)$ of its range, which often correspond to numerical values. A function can thus be understood as a rule that relates two sets of quantities, the inputs and the outputs. Importantly,

each input $x$ to a function is related to an output of the function $f(x)$ in a deterministic fashion. While statements of the form (2.64) may appear cumbersome at first sight, they are of immense practical value in a computational context, because they directly lend themselves to programming implementations and the handling of different data types. In the mathematical definition of a function, the specification of its denomination as well as its domain and range is usually followed by a definition of its functional form. The functional form of a function specifies how the elements of a function's range are evaluated based on the elements of the function's domain. Consider the example of a familiar function, the square of a real number. Using the notation introduced above, this function is written as

$$f : \mathbb{R} \to \mathbb{R}, \; x \mapsto f(x) := x^2, \tag{2.65}$$
and we refer to $x^2$ as the functional form of $f$.

Function concatenation. If $f : D \to R$ and $g : R \to S$ are two functions for which the domain of $g$ corresponds to the range of $f$, then the two functions can be applied to an element in the domain of $f$ in succession. Formally, this concatenation of two functions is written as
$$g \circ f : D \to S, \; x \mapsto (g \circ f)(x) := g(f(x)). \tag{2.66}$$
Note that the function $g \circ f$ maps an element $x \in D$ onto an element $g(f(x)) \in S$. Intuitively, $g \circ f$ thus carries out a transformation of the form $D \to R \to S$. As an example for a concatenated function, consider the function

$$h : \mathbb{R} \to \mathbb{R}_{>0}, \; x \mapsto h(x) := \exp(-x^2). \tag{2.67}$$
From the perspective of concatenated functions, $h$ can be viewed as the concatenation of the functions

$$f : \mathbb{R} \to \mathbb{R}_{\leq 0}, \; x \mapsto f(x) := -x^2 \quad\text{and}\quad g : \mathbb{R}_{\leq 0} \to \mathbb{R}_{>0}, \; x \mapsto g(x) := \exp(x), \tag{2.68}$$
because
$$g \circ f : \mathbb{R} \to \mathbb{R}_{>0}, \; x \mapsto (g \circ f)(x) := g(f(x)) = \exp(-x^2). \tag{2.69}$$
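As a computational illustration of function concatenation, the following minimal Python sketch (added here for illustration, not part of the original materials) implements $f$, $g$, and $h = g \circ f$ from (2.67)-(2.69) and checks that applying them in sequence agrees with the direct definition of $h$.

```python
import numpy as np

def f(x):
    # f : R -> R_{<=0}, x |-> -x^2, cf. (2.68)
    return -x**2

def g(x):
    # g : R_{<=0} -> R_{>0}, x |-> exp(x), cf. (2.68)
    return np.exp(x)

def h(x):
    # h : R -> R_{>0}, x |-> exp(-x^2), cf. (2.67)
    return np.exp(-x**2)

x = np.linspace(-3, 3, 7)
print(np.allclose(g(f(x)), h(x)))   # True: (g o f)(x) = h(x)
```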

Basic properties of functions Functions can be categorized according to a basic set of properties that specify how a function relates elements of its domain to elements of its range. To define these properties, it is helpful to introduce the notions of a function’s image and preimage first. To this end, let f : D → R be a function and let x ∈ D. Then f(x), i.e., the element of R that x is mapped onto, is called the image of x under f. The entire subset of the range R for which images under f exist, i.e., the set

$$f(D) := \{y \in R \,|\, \text{there exists an } x \in D \text{ with } f(x) = y\} \subseteq R \tag{2.70}$$
is called the image of $D$ under $f$. Note that the image $f(D)$ and the range $R$ are not necessarily identical and that the image can be a subset of the range. If $y$ is an element of $f(D)$, then an $x \in D$ for which $f(x) = y$ holds is called a preimage of $y$ under $f$. The following relationships between images and preimages are important:
• Every element in the domain of a function is allocated exactly one image in the range under $f$.
• Not every element in the range of a function has to be a member of the image of $f$.
• If $y$ is an element of $f(D)$, then there may exist multiple preimages of $y$.
The standard example to understand these properties is the square function (2.65). While every real number $x \in \mathbb{R}$ may be multiplied by itself, and thus $y := x^2$ always exists, many $y \in \mathbb{R}$ do not have preimages under $f$, namely all negative numbers. This simply follows as the square of 0 is 0, the square of a positive number is a positive number, and the square of a negative number is a positive number. Finally, this function also has the property that for one $y \in f(\mathbb{R})$ there exist multiple preimages: for example, under $f$, 4 has the preimages 2 and $-2$. The relations between images and preimages of functions as just sketched are formalized by the concepts of injective, surjective, and bijective functions. These are defined as follows:



Figure 2.2. Non-surjective, non-injective, and bijective functions. A function that is non-surjective leaves some elements in its range without corresponding elements in its domain. A function that is non-injective maps multiple elements in its domain onto a single element in its range. A bijective mapping is a one-to-one mapping.

• Let $f : D \to R$ be a function. $f$ is called surjective if every element $y \in R$ is a member of the image of $f$, or in other words, if $f(D) = R$. If this is not the case, $f$ is said to be not surjective.
• $f$ is called injective if every element in the image of $f$ has exactly one preimage under $f$. If this is not the case, $f$ is said to be not injective.
• A function $f$ that is surjective and injective is called bijective or a one-to-one mapping.
Figure 2.2 illustrates a non-surjective function, a non-injective function, and a bijective function.

Linear and nonlinear functions. A function $f : D \to R$ is called linear if and only if for all $a, b \in D$ and a scalar $c$ the following properties hold:
$$f(a + b) = f(a) + f(b) \quad\text{and}\quad f(ca) = cf(a). \tag{2.71}$$
A function that is not linear is called a non-linear function. An example for a linear function is

f : R → R, x 7→ f(x) := ax (a ∈ R). (2.72)

That f is a linear function can be seen as follows: first, for all x, y ∈ R, we have f(x + y) = a(x + y) = ax + ay = f(x) + f(y). (2.73)

Second, for all c ∈ R, we also have f(cx) = acx = cax = cf(x). (2.74)

The function thus satisfies the definition of a linear function (2.71). On the other hand, the function

$$g : \mathbb{R} \to \mathbb{R}, \; x \mapsto g(x) := ax + b \quad (a, b \in \mathbb{R}) \tag{2.75}$$
is not a linear function. For example, with $a := 1$ and $b := 1$ we have for $x, y \in \mathbb{R}$
$$g(x + y) = 1(x + y) + 1 = x + y + 1 \neq x + 1 + y + 1 = g(x) + g(y). \tag{2.76}$$

Functions of the form g in (2.75) are referred to as linear-affine functions. Linear functions, unlike linear-affine functions, always map the zero element 0 onto the zero element 0.
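A brief numerical check of the linearity conditions (2.71) for the two example functions is given below; this Python sketch is an added illustration with arbitrarily chosen coefficients and test values.

```python
import numpy as np

a, b = 2.0, 1.0                      # arbitrary coefficients
f = lambda x: a * x                  # linear function, cf. (2.72)
g = lambda x: a * x + b              # linear-affine function, cf. (2.75)

x, y, c = 1.5, -0.7, 3.0             # arbitrary test values

# f satisfies both conditions of (2.71)
print(np.isclose(f(x + y), f(x) + f(y)), np.isclose(f(c * x), c * f(x)))   # True True

# g violates additivity and maps 0 onto b rather than onto 0
print(np.isclose(g(x + y), g(x) + g(y)), g(0.0))                           # False 1.0
```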

Inverse functions. Assume that $f : D \to R$ is a bijective function. Then there exists a function $f^{-1}$ that undoes the transformation of $x$ onto $f(x)$, i.e., that maps $f(x)$ onto $x$. Such a function $f^{-1}$ is called the inverse function of $f$. Notably, first applying $f$ to $x$ and then $f^{-1}$ to $f(x)$ yields $x$, formally

$$f^{-1} \circ f : D \to D, \; x \mapsto (f^{-1} \circ f)(x) := f^{-1}(f(x)) = x. \tag{2.77}$$
$f^{-1} \circ f$ thus corresponds to the identity function that maps $x$ onto itself.


Example 1. As an example for an inverse function, consider the function

$$f : \mathbb{R} \to \mathbb{R}, \; x \mapsto f(x) := ax + b \quad\text{with } a, b \in \mathbb{R}, a \neq 0. \tag{2.78}$$
Then the inverse function of $f$ is
$$f^{-1} : \mathbb{R} \to \mathbb{R}, \; y \mapsto f^{-1}(y) := \frac{y - b}{a}, \tag{2.79}$$
because
$$f^{-1}(f(x)) = f^{-1}(ax + b) = \frac{ax + b - b}{a} = \frac{ax}{a} = x, \tag{2.80}$$
and $f^{-1} \circ f$ thus corresponds to the identity function.

Example 2. As a second example, consider the square function. While
$$f : \mathbb{R} \to \mathbb{R}_{\geq 0}, \; x \mapsto f(x) := x^2 \tag{2.81}$$
is not injective and hence no inverse function of $f$ exists, its restrictions to the positive and negative real numbers can be inverted. Specifically, for positive $x \in \mathbb{R}_{>0}$, let
$$f_1 : \mathbb{R}_{>0} \to \mathbb{R}_{>0}, \; x \mapsto f_1(x) := x^2 \tag{2.82}$$
and
$$f_2 : \mathbb{R}_{<0} \to \mathbb{R}_{>0}, \; -x \mapsto f_2(-x) := (-x)^2. \tag{2.83}$$

Then the inverse function of $f_1$ is
$$f_1^{-1} : \mathbb{R}_{>0} \to \mathbb{R}_{>0}, \; y \mapsto f_1^{-1}(y) := \sqrt{y}, \tag{2.84}$$
because
$$f_1^{-1}(f_1(x)) = \sqrt{x^2} = x. \tag{2.85}$$

On the other hand, the inverse function of $f_2$ is
$$f_2^{-1} : \mathbb{R}_{>0} \to \mathbb{R}_{<0}, \; y \mapsto f_2^{-1}(y) := -\sqrt{y}, \tag{2.86}$$
because
$$f_2^{-1}(f_2(-x)) = -\sqrt{(-x)^2} = -\sqrt{x^2} = -x. \tag{2.87}$$
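The inverse-function relationships of Examples 1 and 2 can be verified numerically; the following Python sketch is an added illustration with arbitrarily chosen parameter values.

```python
import numpy as np

# Example 1: f(x) = a x + b with a != 0 and its inverse, cf. (2.78)-(2.80)
a, b = 2.0, 3.0
f     = lambda x: a * x + b
f_inv = lambda y: (y - b) / a
x = np.array([-2.0, 0.0, 1.5])
print(np.allclose(f_inv(f(x)), x))           # True: f^{-1}(f(x)) = x

# Example 2: restrictions of the square function, cf. (2.82)-(2.87)
f1     = lambda x: x**2                      # on R_{>0}
f1_inv = lambda y: np.sqrt(y)
f2     = lambda x: x**2                      # on R_{<0}
f2_inv = lambda y: -np.sqrt(y)
xp = np.array([0.5, 1.0, 2.0])               # positive values
print(np.allclose(f1_inv(f1(xp)), xp))       # True
print(np.allclose(f2_inv(f2(-xp)), -xp))     # True
```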

Essential functions This section catalogues a number of essential functions and their properties for later reference. Although not formally introduced at this point, we include the function’s derivatives for completeness.

The identity function. The identity function is defined as

$$\mathrm{id} : \mathbb{R} \to \mathbb{R}, \; x \mapsto \mathrm{id}(x) := x. \tag{2.88}$$
The identity function thus maps values $x \in \mathbb{R}$ onto themselves. The derivative of the identity function is 1:
$$\frac{d}{dx}\mathrm{id}(x) = \frac{d}{dx}x = 1. \tag{2.89}$$
The graph of the identity function is depicted in Figure 2.3A.

The exponential function. The exponential function is defined as
$$\exp : \mathbb{R} \to \mathbb{R}, \; x \mapsto \exp(x) := e^x := \sum_{n=0}^{\infty} \frac{x^n}{n!} = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \frac{x^4}{4!} + \ldots \tag{2.90}$$
The mathematical object $\sum_{n=0}^{\infty} \frac{x^n}{n!}$, an infinite sum, is called a series. For the purposes of the GLM, it suffices to recall that $e^1 \approx 2.71$ is called Euler's number and that $\exp(x)$ thus corresponds to Euler's number to the power of $x$. The graph of the exponential function is depicted in Figure 2.3A. A defining property of the exponential function is that it is equal to its own derivative:
$$\frac{d}{dx}\exp(x) = \exp'(x) = \exp(x). \tag{2.91}$$
Without proofs, we note the following properties of the exponential function, which are often helpful in algebraic manipulations:


• Special values:
$$\exp(0) = 1 \quad\text{and}\quad \exp(1) = e. \tag{2.92}$$

• Value range:
$$x \in\, ]-\infty, 0[ \;\Rightarrow\; 0 < \exp(x) < 1 \quad\text{and}\quad x \in\, ]0, \infty[ \;\Rightarrow\; 1 < \exp(x) < \infty. \tag{2.93}$$
The exponential function thus assumes only strictly positive values,

$$\exp(\mathbb{R}) = \,]0, \infty[. \tag{2.94}$$

• Monotonicity:
$$x < y \;\Rightarrow\; \exp(x) < \exp(y). \tag{2.95}$$
The exponential function is thus said to be strictly monotonically increasing.
• Exponentiation identity (product property):
$$\exp(a + b) = \exp(a)\exp(b). \tag{2.96}$$
From the exponentiation property, it directly follows that
$$\exp(a - b) = \frac{\exp(a)}{\exp(b)} \quad\text{and}\quad \exp(a)\exp(-a) = \frac{\exp(a)}{\exp(a)} = 1. \tag{2.97}$$
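To make the series definition (2.90) concrete, the following added Python sketch approximates $\exp(x)$ by a truncated series and compares it with NumPy's implementation; it also checks the product property (2.96). The truncation length is an arbitrary choice for this illustration.

```python
import numpy as np
from math import factorial

def exp_series(x, n_terms=30):
    # Truncated series sum_{n=0}^{n_terms-1} x^n / n!, cf. (2.90)
    return sum(x**n / factorial(n) for n in range(n_terms))

for x in [-2.0, 0.0, 1.0, 3.0]:
    print(x, exp_series(x), np.exp(x))        # truncated series is close to exp(x)

# Product property (2.96): exp(a + b) = exp(a) exp(b)
a_val, b_val = 0.7, -1.3
print(np.isclose(np.exp(a_val + b_val), np.exp(a_val) * np.exp(b_val)))   # True
```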

The natural logarithm. The natural logarithm can be defined as the inverse function of the exponential function:
$$\ln : \,]0, \infty[ \,\to \mathbb{R}, \; x \mapsto \ln(x), \tag{2.98}$$
where by definition

$$\ln(\exp(x)) = x \;\text{ for all } x \in \mathbb{R} \quad\text{and}\quad \exp(\ln(x)) = x \;\text{ for all } x \in\, ]0, \infty[. \tag{2.99}$$
Note that the natural logarithm is only defined for positive values $x \in\, ]0, \infty[$. The graph of the natural logarithm is depicted in Figure 2.3A. The derivative of the natural logarithm is
$$\frac{d}{dx}\ln(x) = \ln'(x) = \frac{1}{x}. \tag{2.100}$$
Without proofs, we note the following properties of the natural logarithm:
• Special values:
$$\ln(1) = 0 \quad\text{and}\quad \ln(e) = 1. \tag{2.101}$$

• Value ranges:
$$x \in\, ]0, 1[ \;\Rightarrow\; \ln(x) < 0 \quad\text{and}\quad x \in\, ]1, \infty[ \;\Rightarrow\; \ln(x) > 0. \tag{2.102}$$
The natural logarithm thus assumes values in the entire range of the real numbers, but is only defined on the set of positive real numbers, i.e.,

$$\ln(]0, \infty[) = \mathbb{R}. \tag{2.103}$$

• Monotonicity:
$$x < y \;\Rightarrow\; \ln(x) < \ln(y). \tag{2.104}$$
The natural logarithm is thus strictly monotonically increasing.
• Inverse property:
$$\ln\left(\frac{1}{x}\right) = -\ln(x) \;\text{ for all } x \in\, ]0, \infty[. \tag{2.105}$$

• Product property:
$$\ln(xy) = \ln(x) + \ln(y) \;\text{ for all } x, y \in\, ]0, \infty[. \tag{2.106}$$
The natural logarithm thus "turns multiplication into addition".
• Power property:
$$\ln(x^k) = k\ln(x) \;\text{ for all } x \in\, ]0, \infty[ \text{ and } k \in \mathbb{Q}. \tag{2.107}$$
The natural logarithm thus "turns exponentiation into multiplication".


[Figure 2.3, panels A and B: panel A shows the graphs of $\mathrm{id}(x)$, $\exp(x)$, and $\ln(x)$; panel B shows the graphs of $p(x) := 2$, $p(x) := x$, $p(x) := 0.5x$, $p(x) := 1 + 0.5x$, and $p(x) := x^2$.]

Figure 2.3. Essential functions. A Graphs of the identity, exponential, and logarithm functions. B Graphs of selected polynomial functions (glm 2.m).

Polynomial functions. A function of the form

$$p : \mathbb{R} \to \mathbb{R}, \; x \mapsto p(x) := \sum_{i=0}^{k} a_i x^i = a_0 + a_1 x + a_2 x^2 + \cdots + a_k x^k \tag{2.108}$$
is called a polynomial function of degree $k \in \mathbb{N}$ with coefficients $a_0, a_1, ..., a_k \in \mathbb{R}$. Depending on the degree and the value of the coefficients, typical examples for polynomial functions of degrees $k = 0$, $k = 1$, and $k = 2$ are listed below.

Name                      Functional form          Special coefficient values
Constant function         $p(x) = a_0$
Identity function         $p(x) = x$               $a_0 := 0, a_1 := 1$
Linear functions          $p(x) = a_1 x$           $a_0 := 0$
Linear-affine functions   $p(x) = a_0 + a_1 x$
Square function           $p(x) = x^2$             $a_0 := 0, a_1 := 0, a_2 := 1$

Graphs of the polynomial functions listed above are depicted in Figure 2.3B. Note that the identity function is a special polynomial function.
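As an added illustration, the polynomial form (2.108) can be evaluated directly from its coefficients. The helper function below is hypothetical (Python, using NumPy) and simply sums the coefficient-weighted powers; it evaluates the special cases listed above at a few points.

```python
import numpy as np

def polynomial(coeffs, x):
    # p(x) = sum_i a_i x^i with coeffs = [a_0, a_1, ..., a_k], cf. (2.108)
    return sum(a * x**i for i, a in enumerate(coeffs))

x = np.linspace(-2, 2, 5)
print(polynomial([2.0], x))               # constant function p(x) = 2
print(polynomial([0.0, 1.0], x))          # identity function p(x) = x
print(polynomial([0.0, 0.5], x))          # linear function p(x) = 0.5 x
print(polynomial([1.0, 0.5], x))          # linear-affine function p(x) = 1 + 0.5 x
print(polynomial([0.0, 0.0, 1.0], x))     # square function p(x) = x^2
```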

Gamma function. The Gamma function is defined as
$$\Gamma : \mathbb{R} \to \mathbb{R}, \; x \mapsto \Gamma(x) := \int_0^\infty \xi^{x-1} \exp(-\xi)\, d\xi. \tag{2.109}$$
Without proofs, we note the following properties of the Gamma function:
• Recursive property: For $x > 0$, it holds that
$$\Gamma(x + 1) = x\Gamma(x). \tag{2.110}$$

• Special values:
$$\Gamma(1) = 1, \quad \Gamma\!\left(\tfrac{1}{2}\right) = \sqrt{\pi}, \quad\text{and}\quad \Gamma(n) = (n-1)! \;\text{ for } n \in \mathbb{N}. \tag{2.111}$$

• Integral formula: For $0 < x < \infty$ and $\alpha, \beta > 0$, it holds that
$$\int_0^\infty x^{\alpha - 1} \exp\left(-\frac{x}{\beta}\right) dx = \Gamma(\alpha)\beta^\alpha. \tag{2.112}$$
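The properties (2.110)-(2.112) can be checked numerically, for example with SciPy's gamma function and numerical quadrature. The sketch below is an added illustration; the values of $x$, $\alpha$, and $\beta$ are chosen arbitrarily.

```python
import numpy as np
from scipy.special import gamma
from scipy.integrate import quad

# Recursive property (2.110) and special values (2.111)
x = 3.7
print(np.isclose(gamma(x + 1), x * gamma(x)))        # True
print(np.isclose(gamma(0.5), np.sqrt(np.pi)))        # True
print([gamma(n) for n in range(1, 6)])               # 1, 1, 2, 6, 24 = (n-1)!

# Integral formula (2.112) for arbitrary alpha, beta > 0
alpha, beta = 2.5, 1.8
integral, _ = quad(lambda t: t**(alpha - 1) * np.exp(-t / beta), 0, np.inf)
print(np.isclose(integral, gamma(alpha) * beta**alpha))   # True
```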


2.4 Bibliographic remarks

The material presented in this chapter is standard and can be found in any undergraduate mathematics textbook. Spivak (2008) is a good starting point.

2.5 Study questions

1. Give brief explanations of the symbols $\mathbb{N}$, $\mathbb{N}_n$, $\mathbb{Z}$, $\mathbb{Q}$, $\mathbb{R}$, $\mathbb{R}^n$.
2. Provide a numerical example for $x \in \mathbb{R}^5$.
3. Consider the sets $A := \{1, 2, 3\}$ and $B := \{3, 4, 5\}$. Write down the sets $C := A \cup B$ and $D := A \cap B$.

4. Write down the definition of the interval $[0, 1] \subset \mathbb{R}$. Is 0 an element of this interval?
5. Evaluate the sum
$$y := \sum_{i=1}^{4} a_i x_i \tag{2.113}$$
for
$$a_1 = -1, \; a_2 = 0, \; a_3 = 2, \; a_4 = -2 \tag{2.114}$$
and
$$x_1 = 3, \; x_2 = 2, \; x_3 = 5, \; x_4 = -2. \tag{2.115}$$

6. Explain the meaning of
$$f : D \to R, \; x \mapsto f(x) \tag{2.116}$$
and its components $f$, $D$, $R$, $x$, and $f(x)$ using an example of your choice.
7. Is the function
$$f : \mathbb{R} \to \mathbb{R}, \; x \mapsto f(x) := 2x + 2 \tag{2.117}$$
a linear function? Justify your answer.
8. Sketch the identity function.
9. Sketch the exponential function.
10. Sketch the natural logarithm.

3 | Calculus

The relative quality of parameter values during model estimation is commonly evaluated by means of a function that measures the relative goodness of a given parameter value with respect to the model and observed data. The ability to study how the value of such a function changes in response to changes in the parameter value is an essential requirement for adapting parameter values in a sensible fashion. The evaluation of function changes is the core topic of differential calculus. In this Section, we first review some essential aspects of differential calculus from the viewpoint of developing the theory of the GLM, including the notions of derivatives of univariate real-valued functions (Section 3.1), the analytical optimization of univariate real-valued functions (Section 3.2), and derivatives of multivariate real-valued functions (Section 3.3). In a final Section (Section 3.5), we review some essential aspects of integral calculus. In the context of the theory of the GLM, integrals primarily occur as expectations, variances, and covariances of random variables and random vectors.

3.1 Derivatives of univariate real-valued functions

We first consider derivatives of univariate real-valued functions, by which we understand functions that map real numbers onto real numbers. In other words, we consider functions f of the type

f : R → R, x 7→ f(x). (3.1)

The derivative $f'(x_0) \in \mathbb{R}$ of such a function at the location $x_0 \in \mathbb{R}$ conveys two basic and familiar intuitions:
1. $f'(x_0)$ is a measure of the rate of change of $f$ at location $x_0$,
2. $f'(x_0)$ is the slope of the tangent line of $f$ at the point $(x_0, f(x_0)) \in \mathbb{R}^2$.
Formally, this may be expressed by the differential quotient of $f$. The differential quotient, also referred to as Newton's difference quotient, expresses the difference between two values of the function $f(x + h)$ and $f(x)$ with respect to the difference between the two locations $x$ and $x + h$ for $h$ approaching zero:

$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}. \tag{3.2}$$
The differential quotient (3.2) represents the formal definition of the derivative of $f$ at $x$ and the basis for mathematical proofs of the rules of differentiation to be discussed in the following. However, its practical importance for the development of the theory of the GLM is by and large negligible. It is, however, important to distinguish two common usages of the term derivative: first, the derivative of a function $f$ can be considered at a specific value $x_0$ in the domain of $f$, denoted by

$$f'(x)|_{x = x_0} \in \mathbb{R}, \tag{3.3}$$
and represented by a number. Second, if the derivative (3.3) is evaluated for all possible values in the domain of $f$, the derivative of $f$ can be conceived as a function

$$f' : \mathbb{R} \to \mathbb{R}, \; x_0 \mapsto f'(x_0) := f'(x)|_{x = x_0}. \tag{3.4}$$
Intuitively, (3.4) means that the derivative of a (differentiable) univariate real-valued function may be evaluated at any point of the real line.
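Although the differential quotient (3.2) is rarely needed for analytical work here, it can be evaluated numerically. The following added Python sketch approximates $f'(x_0)$ for the example function $f(x) = x^2$ by shrinking $h$; the function and evaluation point are arbitrary choices for this illustration.

```python
import numpy as np

f = lambda x: x**2                 # example function with known derivative f'(x) = 2x
x0 = 1.5

for h in [1e-1, 1e-3, 1e-5]:
    diff_quotient = (f(x0 + h) - f(x0)) / h     # Newton's difference quotient, cf. (3.2)
    print(h, diff_quotient)                     # approaches f'(x0) = 3 as h -> 0
```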

Higher-order derivatives. The derivative $f'$ of a function $f$ is also referred to as the first-order derivative of a function (the zeroth-order derivative of a function corresponds to the function itself). Higher-order derivatives (i.e., second-order, third-order, and so on) can be evaluated by recursively forming the derivative of the respective lower-order derivative. For example, the second-order derivative of a function corresponds to

the (first-order) derivative of the first-order derivative of a function. To this end, the $\frac{d}{dx}$ operator notation for derivatives is useful. Intuitively, the symbol
$$\frac{d}{dx} f \tag{3.5}$$
can be understood as the imperative to evaluate the derivative of $f$, or simply as an alternative notation for $f'$. By itself, $\frac{d}{dx}$ carries no meaning. We thus have for the first-order derivative
$$\frac{d}{dx} f(x) = f'(x) \tag{3.6}$$
and for the second-order derivative
$$\frac{d^2}{dx^2} f(x) = \frac{d}{dx}\left(\frac{d}{dx} f(x)\right) = f''(x). \tag{3.7}$$
Intuitively, the second-order derivative measures the rate of change of the first-order derivative in the vicinity of $x$. If these first-order derivatives, which may be visualized as tangent lines, change relatively quickly in the vicinity of $x$, the second-order derivative is large, and the function is said to have a high curvature.

Derivatives of important functions We next collect the derivatives of essential functions without proofs.

Constant function. The derivative of any constant function

$$f : \mathbb{R} \to \mathbb{R}, \; x \mapsto f(x) := a, \quad\text{where } a \in \mathbb{R}, \tag{3.8}$$
is zero:
$$f' : \mathbb{R} \to \mathbb{R}, \; x \mapsto f'(x) = 0. \tag{3.9}$$
For example, the derivative of $f(x) := 2$ is $f'(x) = 0$.

Single-term polynomial functions. Let f be a single-term polynomial function of the form

$$f : \mathbb{R} \to \mathbb{R}, \; x \mapsto f(x) := ax^b, \quad\text{where } a \in \mathbb{R}, b \in \mathbb{R} \setminus \{0\}. \tag{3.10}$$
Then the derivative of $f$ is given by
$$f' : \mathbb{R} \to \mathbb{R}, \; x \mapsto f'(x) = bax^{b-1}. \tag{3.11}$$

For example, the derivative of $f(x) := 2x^3$ is $f'(x) = 6x^2$ and the derivative of $g(x) := \sqrt{x} = x^{\frac{1}{2}}$ is $g'(x) = \frac{1}{2}x^{-\frac{1}{2}} = \frac{1}{2\sqrt{x}}$.

Exponential function. Let f be the exponential function

$$f : \mathbb{R} \to \mathbb{R}_{>0}, \; x \mapsto f(x) := \exp(x). \tag{3.12}$$
Then the derivative of $f$ is given by
$$f' : \mathbb{R} \to \mathbb{R}_{>0}, \; x \mapsto f'(x) = \exp(x). \tag{3.13}$$

Natural logarithm. Let f be the natural logarithm

$$f : \mathbb{R}_{>0} \to \mathbb{R}, \; x \mapsto f(x) := \ln(x). \tag{3.14}$$
Then the derivative of $f$ is given by
$$f' : \mathbb{R}_{>0} \to \mathbb{R}, \; x \mapsto f'(x) = \frac{1}{x}. \tag{3.15}$$

Rules of differentiation We next state important rules for evaluating the derivatives of univariate functions without proof.


Summation rule. Let
$$f : \mathbb{R} \to \mathbb{R}, \; x \mapsto f(x) := \sum_{i=1}^{n} g_i(x) \tag{3.16}$$
be the sum of $n$ arbitrary functions $g_i : \mathbb{R} \to \mathbb{R}$ $(i = 1, 2, ..., n)$. Then the derivative of $f$ is given by the sum of the derivatives of the functions $g_i$:

$$f' : \mathbb{R} \to \mathbb{R}, \; x \mapsto f'(x) := \sum_{i=1}^{n} g_i'(x). \tag{3.17}$$
For example, the derivative of

$$f(x) = x^2 + 2x, \quad\text{where } g_1(x) := x^2 \text{ and } g_2(x) := 2x, \tag{3.18}$$
is
$$f'(x) = g_1'(x) + g_2'(x) = 2x + 2. \tag{3.19}$$

Chain rule. Let h be the composition of two functions f : R → R and g : R → R, i.e.,

$$h : \mathbb{R} \to \mathbb{R}, \; x \mapsto h(x) := (g \circ f)(x) = g(f(x)). \tag{3.20}$$
Then the derivative of $h$ is given by

$$h' : \mathbb{R} \to \mathbb{R}, \; x \mapsto h'(x) := g'(f(x))f'(x). \tag{3.21}$$
In words, the derivative of a function that can be written as the composition of a first function $f$ with a second function $g$ is given by the derivative of the second function $g$ "at the location of the function $f$" multiplied by the derivative of the first function $f$. For example, the derivative of

$$h(x) := \exp\left(-\frac{1}{2}x^2\right), \tag{3.22}$$
which can be written as the composition of a function $g(x) := \exp(x)$ with derivative $g'(x) = \exp(x)$ and a function $f(x) := -\frac{1}{2}x^2$ with derivative $f'(x) = -x$, is given by
$$h'(x) = -\exp\left(-\frac{1}{2}x^2\right)x. \tag{3.23}$$

Product rule. Let f be the product of two functions gi : R → R with i = 1, 2, i.e.,

$$f : \mathbb{R} \to \mathbb{R}, \; x \mapsto f(x) := g_1(x)g_2(x). \tag{3.24}$$
Then the derivative of $f$ is given by

$$f' : \mathbb{R} \to \mathbb{R}, \; x \mapsto f'(x) := g_1'(x)g_2(x) + g_1(x)g_2'(x), \tag{3.25}$$

where $g_1'$ and $g_2'$ denote the derivatives of $g_1$ and $g_2$, respectively. In words, if a function can be written as the product of a first and a second function, its derivative corresponds to the product of the derivative of the first function with the second function plus the product of the first function with the derivative of the second function. For example, the derivative of

$$f(x) := x^2 \exp(x) \tag{3.26}$$

can be found by writing $f$ as $g_1 \cdot g_2$ with $g_1(x) := x^2$ and $g_2(x) := \exp(x)$ with derivatives $g_1'(x) = 2x$ and $g_2'(x) = \exp(x)$, respectively. This then yields

$$f'(x) = 2x\exp(x) + x^2\exp(x). \tag{3.27}$$


Quotient rule. Let f be the quotient of two functions gi : R → R with i = 1, 2, i.e.

$$f : \mathbb{R} \to \mathbb{R}, \; x \mapsto f(x) := \frac{g_1(x)}{g_2(x)}. \tag{3.28}$$
Then the derivative of $f$ is given by

$$f' : \mathbb{R} \to \mathbb{R}, \; x \mapsto f'(x) := \frac{g_1'(x)g_2(x) - g_1(x)g_2'(x)}{g_2(x)^2}. \tag{3.29}$$
In words, the derivative of a function that can be written as the quotient of a first function in the numerator and a second function in the denominator is given by the difference of the product of the derivative of the first function with the second function and the product of the first function with the derivative of the second function, divided by the square of the second function, i.e., the function in the denominator of the original function. For example, the derivative of

$$f(x) := \frac{\exp(x)}{x^2 + 1} \tag{3.30}$$

can be evaluated by considering $g_1(x) := \exp(x)$ with derivative $g_1'(x) = \exp(x)$ and $g_2(x) := x^2 + 1$ with derivative $g_2'(x) = 2x$, yielding

$$f'(x) = \frac{\exp(x)(x^2 + 1) - \exp(x) \cdot 2x}{(x^2 + 1)^2}. \tag{3.31}$$
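The worked examples for the summation, chain, product, and quotient rules can be reproduced with symbolic differentiation. The following SymPy sketch is an added illustration confirming (3.19), (3.23), (3.27), and (3.31).

```python
import sympy as sp

x = sp.symbols('x')

# Summation rule example, cf. (3.18)-(3.19)
print(sp.diff(x**2 + 2*x, x))                                  # 2*x + 2

# Chain rule example, cf. (3.22)-(3.23)
print(sp.diff(sp.exp(-x**2 / 2), x))                           # -x*exp(-x**2/2)

# Product rule example, cf. (3.26)-(3.27)
print(sp.diff(x**2 * sp.exp(x), x))                            # x**2*exp(x) + 2*x*exp(x)

# Quotient rule example, cf. (3.30)-(3.31)
quotient = sp.diff(sp.exp(x) / (x**2 + 1), x)
expected = (sp.exp(x)*(x**2 + 1) - sp.exp(x)*2*x) / (x**2 + 1)**2
print(sp.simplify(quotient - expected) == 0)                   # True
```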

3.2 Analytical optimization of univariate real-valued functions

First- and second-order derivatives can be used to find local maxima and minima of functions. Finding maxima and minima of functions is a fundamental aspect of applied mathematics and is, in general, referred to as optimization. As discussed in the introductory statements of this Section, optimization is central for model estimation, as will become evident in Section 8 | Maximum likelihood estimation.

Extrema and extremal points It is helpful to clearly differentiate between two aspects of optimization: on the one hand, when finding a maximum or a minimum, one finds a value of a function f : D → R for which the conditions f(x) ≥ f(x0) or f(x) ≤ f(x0) hold for at least all x0 in a vicinity of x ∈ D. These values, which are elements of the range of f, are called maxima or minima and are abbreviated by

$$\max_{x \in D} f(x) \quad\text{and}\quad \min_{x \in D} f(x), \tag{3.32}$$
respectively. On the other hand, one simultaneously finds those points $x$ in the domain of $f$ for which $f(x)$ assumes a maximum or minimum. These points, which are often more interesting than the corresponding values $f(x)$ themselves, are referred to as extremal points and are abbreviated by

$$\arg\max_{x \in D} f(x) \quad\text{and}\quad \arg\min_{x \in D} f(x), \tag{3.33}$$
for extremal points that correspond to maxima and minima of $f$, respectively. Note the difference between $\max_{x \in D} f(x)$ and $\arg\max_{x \in D} f(x)$: the former refers to a point or a set of points in the range of $f$, the latter to a point or a set of points in the domain of $f$.

Necessary condition for an extremum. When using first- and second-order derivatives to find extremal points and their corresponding maxima or minima, it is helpful to distinguish necessary and sufficient conditions for extrema. The necessary condition for an extremum (i.e., a maximum or a minimum) of a function $f : \mathbb{R} \to \mathbb{R}$ at a point $x \in \mathbb{R}$ is that the first derivative is equal to zero: $f'(x) = 0$. Intuitively, this can be made transparent by considering a maximum of $f$ at a point $x_{\max}$: for all values of $x$ which are smaller than $x_{\max}$, the derivative is positive, because the function is increasing, leading up to the maximum. For all values of $x$ which are larger than $x_{\max}$, the derivative is negative, because the function is decreasing. At the location of the maximum, the

function is neither increasing nor decreasing, and thus $f'(x) = 0$. The reverse is true for a minimum of $f$ at a point $x_{\min}$: for all values of $x$ which are smaller than $x_{\min}$, the derivative is negative, because the function is decreasing towards the minimum. For all values of $x$ which are larger than $x_{\min}$, the derivative is positive, because the function is increasing again and recovering from the minimum. Again, at the location of the minimum, the function is neither increasing nor decreasing, and thus $f'(x) = 0$. Based on finding a point $x^*$ with $f'(x^*) = 0$ one cannot decide whether $x^*$ corresponds to a maximum or a minimum, because in both cases $f'(x) = 0$. On the other hand, if a minimum or maximum exists at a point $x^*$, it necessarily follows that $f'(x^*) = 0$; hence the nomenclature necessary condition. In fact, there are two more possibilities for points at which $f'(x^*) = 0$: the function may be increasing for $x < x^*$ and for $x > x^*$, or the function may be decreasing for $x < x^*$ and for $x > x^*$. In both cases, there is neither a maximum nor a minimum at $x^*$, but what is referred to as a saddle point.

Sufficient conditions for an extremum. The second-order derivative $f''(x)$ allows for testing whether a critical point $x^*$ for an extremum, i.e., a point for which $f'(x^*) = 0$, is a maximum, a minimum, or a saddle point. In brief, if $f''(x^*) < 0$, there is a maximum at $x^*$, if $f''(x^*) > 0$, there is a minimum at $x^*$, and if $f''(x^*) = 0$, there is a saddle point at $x^*$. Together with the condition $f'(x^*) = 0$, these conditions are referred to as sufficient conditions for an extremum or a saddle point. The role of the second derivative can be made intuitive by considering a maximum at $x^*$. For points $x < x^*$, the slope of the tangent line at $f(x)$ must be positive, because $f$ is increasing towards $x^*$. Likewise, for points $x > x^*$, the slope of the tangent line at $f(x)$ must be negative, because $f$ is decreasing after assuming its maximum at $x^*$. In other words, $f'(x) > 0$ (positive) for $x < x^*$, $f'(x) < 0$ (negative) for $x > x^*$, and $f'(x^*) = 0$. We next consider the change of $f'$, i.e., $f''$: in the region around the maximum, $f'$ decreases from a positive value to zero to a negative value, as just stated. Because $f'(x)$ is positive just before (to the left of) $x^*$ and negative just after (to the right of) $x^*$, it obviously decreases from just before to just after $x^*$. But this means that its own rate of change, $f''$, is negative at $x^*$. The reverse holds for a minimum of $f$ at $x^*$. To recapitulate, we have established the following conditions that use the derivatives of a function $f : \mathbb{R} \to \mathbb{R}$ to determine its extrema:
• If there is a maximum or minimum of $f$ at $x^*$, then $f'(x^*) = 0$.
• If $f'(x^*) = 0$ and $f''(x^*) > 0$, then there is a minimum at $x^*$.
• If $f'(x^*) = 0$ and $f''(x^*) < 0$, then there is a maximum at $x^*$.
The first of these conditions is referred to as the necessary condition for extrema, the latter two conditions as the sufficient conditions for extrema. We next discuss three examples in which we use these conditions to determine the location of extrema (Figure 3.1).

Example 1. Consider the function

$$f : [0, \pi] \to [-1, 1], \; x \mapsto \sin(x) \tag{3.34}$$
depicted in Figure 3.1A by a blue curve. The first derivative of $f$ is given by
$$f' : [0, \pi] \to [-1, 1], \; x \mapsto \frac{d}{dx}\sin(x) = \cos(x) \tag{3.35}$$
and depicted in Figure 3.1A by a red curve. Notably, in the interval $[0, \pi]$, the cosine function assumes a zero point at $\frac{\pi}{2}$. We thus have the critical point $x^* = \pi/2$ for an extremum. The second derivative of $f$ is given by
$$f'' : [0, \pi] \to [-1, 1], \; x \mapsto \frac{d}{dx}\cos(x) = -\sin(x) \tag{3.36}$$
and is depicted in Figure 3.1A by a dashed red curve. Because $f''(\pi/2) = -\sin(\pi/2) = -1 < 0$, we can conclude that there is a maximum of $f$ at $x = \pi/2$. Of course, this is also obvious from the graph of $f$.


[Figure 3.1, panels A to C: A $f(x) := \sin(x)$, B $f(x) := (x - 1)^2$, C $f(x) := -x^2$; each panel shows $f(x)$, $f'(x)$, $f''(x)$, and the critical point $x^*$.]

Figure 3.1. Analytical optimization of basic functions. For a detailed discussion, please refer to the main text (glm 3.m).

Example 2. Consider the function
$$f : \mathbb{R} \to \mathbb{R}, \; x \mapsto (x - 1)^2 \tag{3.37}$$
depicted in Figure 3.1B by a blue curve. The first derivative of $f$ is given by
$$f' : \mathbb{R} \to \mathbb{R}, \; x \mapsto \frac{d}{dx}(x - 1)^2 = 2x - 2 \tag{3.38}$$
and depicted in Figure 3.1B by a red curve. Setting this derivative to zero and solving for $x$ yields

2x − 2 = 0 ⇔ 2x = 2 ⇔ x = 1. (3.39)

We thus have the critical point $x^* = 1$ for an extremum. The second derivative of $f$ is given by
$$f'' : \mathbb{R} \to \mathbb{R}, \; x \mapsto \frac{d}{dx}(2x - 2) = 2 \tag{3.40}$$
and is depicted in Figure 3.1B by a dashed red curve. The second derivative is thus a constant function, and $f''(x^*) = 2 > 0$. We thus conclude that there is a minimum of $f$ at $x = 1$. Again, this is also obvious from the graph of $f$.

Example 3. Finally, we consider the function

$$f : \mathbb{R} \to \mathbb{R}, \; x \mapsto -x^2 \tag{3.41}$$
depicted in Figure 3.1C by a blue curve. The first derivative of $f$ is given by
$$f' : \mathbb{R} \to \mathbb{R}, \; x \mapsto \frac{d}{dx}(-x^2) = -2x \tag{3.42}$$
and depicted in Figure 3.1C by a red curve. Setting this derivative to zero and solving for $x$ yields

− 2x = 0 ⇔ x = 0. (3.43)

We thus have the critical point $x^* = 0$ for an extremum. The second derivative of $f$ is given by
$$f'' : \mathbb{R} \to \mathbb{R}, \; x \mapsto \frac{d}{dx}(-2x) = -2 \tag{3.44}$$
and is depicted in Figure 3.1C by a dashed red curve. The second derivative is thus a constant function and $f''(x^*) = -2 < 0$. We thus conclude that there is a maximum of $f$ at $x = 0$.
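The necessary and sufficient conditions of Examples 2 and 3 can also be applied symbolically. The added SymPy sketch below finds the critical points by solving $f'(x) = 0$ and classifies them by the sign of $f''$.

```python
import sympy as sp

x = sp.symbols('x')

for f in [(x - 1)**2, -x**2]:                  # Examples 2 and 3, cf. (3.37) and (3.41)
    f1 = sp.diff(f, x)                         # first derivative
    f2 = sp.diff(f, x, 2)                      # second derivative
    for xstar in sp.solve(sp.Eq(f1, 0), x):    # critical points with f'(x*) = 0
        curvature = f2.subs(x, xstar)          # nonzero for both examples
        kind = 'minimum' if curvature > 0 else 'maximum'
        print(f, xstar, kind)                  # minimum at x = 1, maximum at x = 0
```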


Figure 3.2. Visualization of multivariate (here bivariate) real-valued functions. Real-valued functions of multiple variables are often visualized in a three-dimensional way as in the left panels of the Figure. Note that although this is a 3D plot, the function is bivariate, i.e., it is a function of two variables. The same information can be conveyed by using isocontour plots, which visualize the isocontours of functions in 2D. Isocontours are the lines assuming equal values in the range of the function. Usually, isocontour plots suffice to convey all relevant information about a bivariate function (glm 03.m).

3.3 Derivatives of multivariate real-valued functions

Thus far, we have considered functions of the form $f : \mathbb{R} \to \mathbb{R}$, which map numbers $x \in \mathbb{R}$ onto numbers $f(x) \in \mathbb{R}$. Another function type that is encountered in the development of the GLM is that of functions of the form
$$f : \mathbb{R}^n \to \mathbb{R}, \; x \mapsto f(x), \tag{3.45}$$
where
$$x := \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} \in \mathbb{R}^n \tag{3.46}$$
is an $n$-dimensional vector. Because the input argument $x$ to such a function can vary along $n \geq 1$ dimensions and its output argument $f(x)$ is a scalar real number, such functions are also called multivariate real-valued functions. In physics, such functions are referred to as scalar fields, because they allocate scalars $f(x) \in \mathbb{R}$ to points $x$ in the $n$-dimensional space $\mathbb{R}^n$. An example for a function of the type (3.45) is
$$f : \mathbb{R}^2 \to \mathbb{R}, \; x \mapsto f(x) = f\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} := x_1^2 + x_2^2, \tag{3.47}$$
which is visualized in Figure 3.2A. Another example is the function
$$g : \mathbb{R}^2 \to \mathbb{R}, \; x \mapsto g(x) = g\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} := \exp\left(-\frac{1}{2}\left((x_1 - 1)^2 + (x_2 - 1)^2\right)\right) \tag{3.48}$$
which is visualized in Figure 3.2B. Note that functions defined on spaces $\mathbb{R}^n$ with $n > 2$ cannot be visualized easily.


As for univariate real-valued functions, one can ask how much a change in the input argument at a specific point in $\mathbb{R}^n$ of a multivariate real-valued function affects the value of the function. If one asks this question for each of the subcomponents $x_i$, $i = 1, ..., n$ of $x \in \mathbb{R}^n$ independently of the remaining $n - 1$ subcomponents, one is led to the concept of a partial derivative: the partial derivative of a multivariate real-valued function $f : \mathbb{R}^n \to \mathbb{R}$ with respect to a variable $x_i$, $i = 1, ..., n$ captures how much the function value changes "in the direction" of $x_i$, i.e., in the cross-section through the space $\mathbb{R}^n$ defined by the variable of interest. Stated differently, the partial derivative of a function $f : \mathbb{R}^n \to \mathbb{R}$ in a point $x \in \mathbb{R}^n$ with respect to a variable $x_i$ is the derivative of the function $f$ with respect to $x_i$ while all other variables $x_j$, $j = 1, 2, ..., i - 1, i + 1, ..., n$ are held constant. The partial derivative of a function $f : \mathbb{R}^n \to \mathbb{R}$ in a point $x \in \mathbb{R}^n$ with respect to a variable $x_i$ is denoted by
$$\frac{\partial}{\partial x_i} f(x), \tag{3.49}$$
where the $\partial$ symbol is used to distinguish the notion of a partial derivative from a standard derivative. This notation is somewhat redundant, because the subscript $i$ on the $x$ in $\frac{\partial}{\partial x_i}$ already makes it clear that the derivative is with respect to $x_i$ only. The notation is, however, commonly used, and if the subcomponents of $x$ are not denoted by $x_1, ..., x_n$, but by, say, $a := x_1, b := x_2, ..., e := x_5$, it is, in fact, helpful. Like the derivative of a univariate real-valued function, one may evaluate the partial derivative for all $x \in \mathbb{R}^n$ and hence also view the partial derivative of a multivariate real-valued function as a function

$$\frac{\partial}{\partial x_i} f : \mathbb{R}^n \to \mathbb{R}, \; x \mapsto \frac{\partial}{\partial x_i} f(x). \tag{3.50}$$
We next discuss two examples.

Example 1. We first consider the function

$$f : \mathbb{R}^2 \to \mathbb{R}, \; x \mapsto f(x) := x_1^2 + x_2^2. \tag{3.51}$$
Because this function has a two-dimensional domain, one can evaluate two different partial derivatives,

$$\frac{\partial}{\partial x_1} f : \mathbb{R}^2 \to \mathbb{R}, \; x \mapsto \frac{\partial}{\partial x_1} f(x) \tag{3.52}$$
and
$$\frac{\partial}{\partial x_2} f : \mathbb{R}^2 \to \mathbb{R}, \; x \mapsto \frac{\partial}{\partial x_2} f(x). \tag{3.53}$$
To evaluate the partial derivative (3.52), one considers the function

$$f_{x_2} : \mathbb{R} \to \mathbb{R}, \; x_1 \mapsto f_{x_2}(x_1) := x_1^2 + x_2^2, \tag{3.54}$$
where $x_2$ assumes the role of a constant. To indicate that $x_2$ is no longer an input argument of the function, but the function is still dependent on the constant $x_2$, we have used the subscript notation $f_{x_2}(x_1)$. To evaluate the partial derivative, we evaluate the standard univariate derivative of $f_{x_2}$,
$$f_{x_2}'(x_1) = 2x_1. \tag{3.55}$$
We thus have
$$\frac{\partial}{\partial x_1} f : \mathbb{R}^2 \to \mathbb{R}, \; x \mapsto \frac{\partial}{\partial x_1} f(x) = \frac{\partial}{\partial x_1}(x_1^2 + x_2^2) = f_{x_2}'(x_1) = 2x_1. \tag{3.56}$$

Accordingly, with the corresponding definition of $f_{x_1}$, we have
$$\frac{\partial}{\partial x_2} f : \mathbb{R}^2 \to \mathbb{R}, \; x \mapsto \frac{\partial}{\partial x_2} f(x) = \frac{\partial}{\partial x_2}(x_1^2 + x_2^2) = f_{x_1}'(x_2) = 2x_2. \tag{3.57}$$

Example 2. We next consider the example
$$g : \mathbb{R}^2 \to \mathbb{R}, \; x \mapsto g(x) := \exp\left(-\frac{1}{2}\left((x_1 - 1)^2 + (x_2 - 1)^2\right)\right). \tag{3.58}$$
Again, there are two partial derivatives to consider. Using the chain rule of differentiation and the logic of treating the variable with respect to which the derivative is not performed as a constant, we obtain


$$\begin{aligned}
\frac{\partial}{\partial x_1} g(x) &= \frac{\partial}{\partial x_1} \exp\left(-\frac{1}{2}\left((x_1 - 1)^2 + (x_2 - 1)^2\right)\right) \\
&= \exp\left(-\frac{1}{2}\left((x_1 - 1)^2 + (x_2 - 1)^2\right)\right) \frac{\partial}{\partial x_1}\left(-\frac{1}{2}\left((x_1 - 1)^2 + (x_2 - 1)^2\right)\right) \\
&= -\exp\left(-\frac{1}{2}\left((x_1 - 1)^2 + (x_2 - 1)^2\right)\right)(x_1 - 1),
\end{aligned} \tag{3.59}$$
and
$$\begin{aligned}
\frac{\partial}{\partial x_2} g(x) &= \frac{\partial}{\partial x_2} \exp\left(-\frac{1}{2}\left((x_1 - 1)^2 + (x_2 - 1)^2\right)\right) \\
&= \exp\left(-\frac{1}{2}\left((x_1 - 1)^2 + (x_2 - 1)^2\right)\right) \frac{\partial}{\partial x_2}\left(-\frac{1}{2}\left((x_1 - 1)^2 + (x_2 - 1)^2\right)\right) \\
&= -\exp\left(-\frac{1}{2}\left((x_1 - 1)^2 + (x_2 - 1)^2\right)\right)(x_2 - 1),
\end{aligned} \tag{3.60}$$
for the values of the respective partial derivative functions.
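The partial derivatives of Examples 1 and 2 can be verified symbolically; the following SymPy sketch is an added illustration.

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2')

# Example 1, cf. (3.51) and (3.56)-(3.57)
f = x1**2 + x2**2
print(sp.diff(f, x1), sp.diff(f, x2))            # 2*x1, 2*x2

# Example 2, cf. (3.58)-(3.60)
g = sp.exp(-((x1 - 1)**2 + (x2 - 1)**2) / 2)
dg_dx1 = sp.diff(g, x1)
print(sp.simplify(dg_dx1 + (x1 - 1) * g) == 0)   # True: matches (3.59)
```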

Higher-order partial derivatives

As for the standard derivative of a univariate real-valued function $f : \mathbb{R} \to \mathbb{R}$, higher-order partial derivatives can be formulated and evaluated by taking partial derivatives of partial derivatives. Because multivariate real-valued functions are functions of multiple input arguments, more possibilities exist for higher-order derivatives compared to the univariate case. For example, given the partial derivative $\frac{\partial}{\partial x_1} f$ of a function $f : \mathbb{R}^3 \to \mathbb{R}$, one may next form the partial derivative again with respect to $x_1$, yielding the second-order partial derivative equivalent to the second-order derivative of a univariate function and denoted by $\frac{\partial^2}{\partial x_1^2} f$. However, one may also form the partial derivative with respect to $x_2$, $\frac{\partial^2}{\partial x_1 \partial x_2} f$, or with respect to $x_3$, $\frac{\partial^2}{\partial x_1 \partial x_3} f$. Note that the numerator of the partial derivative sign increases its power with the order of the derivative and the denominator denotes the variables with respect to which the derivative is taken. If the derivative is taken multiple times with respect to the same variable, the variable in the denominator is notated with the corresponding power. Again, note that these are mere conventions to signal the form of the partial derivative, but the symbols themselves do not carry any meaning beyond the implicit imperative to consider or evaluate the corresponding partial derivative.

Example. To exemplify the notation introduced above, we evaluate the first and second-order partial derivatives of the function

$$f : \mathbb{R}^3 \to \mathbb{R}, \; x \mapsto f(x) := x_1^2 + x_1 x_2 + x_2\sqrt{x_3}. \tag{3.61}$$
For the first-order derivatives, we have

$$\begin{aligned}
\frac{\partial}{\partial x_1} f(x) &= \frac{\partial}{\partial x_1}\left(x_1^2 + x_1 x_2 + x_2\sqrt{x_3}\right) = 2x_1 + x_2, \\
\frac{\partial}{\partial x_2} f(x) &= \frac{\partial}{\partial x_2}\left(x_1^2 + x_1 x_2 + x_2\sqrt{x_3}\right) = x_1 + \sqrt{x_3}, \\
\frac{\partial}{\partial x_3} f(x) &= \frac{\partial}{\partial x_3}\left(x_1^2 + x_1 x_2 + x_2\sqrt{x_3}\right) = \frac{x_2}{2\sqrt{x_3}}.
\end{aligned} \tag{3.62}$$

For the second-order derivatives with respect to x1, we then have

$$\begin{aligned}
\frac{\partial^2}{\partial x_1 \partial x_1} f(x) &= \frac{\partial}{\partial x_1}\left(\frac{\partial}{\partial x_1} f(x)\right) = \frac{\partial}{\partial x_1}(2x_1 + x_2) = 2, \\
\frac{\partial^2}{\partial x_2 \partial x_1} f(x) &= \frac{\partial}{\partial x_2}\left(\frac{\partial}{\partial x_1} f(x)\right) = \frac{\partial}{\partial x_2}(2x_1 + x_2) = 1, \\
\frac{\partial^2}{\partial x_3 \partial x_1} f(x) &= \frac{\partial}{\partial x_3}\left(\frac{\partial}{\partial x_1} f(x)\right) = \frac{\partial}{\partial x_3}(2x_1 + x_2) = 0.
\end{aligned} \tag{3.63}$$


For the second-order derivatives with respect to $x_2$, we have
$$\begin{aligned}
\frac{\partial^2}{\partial x_1 \partial x_2} f(x) &= \frac{\partial}{\partial x_1}\left(\frac{\partial}{\partial x_2} f(x)\right) = \frac{\partial}{\partial x_1}(x_1 + \sqrt{x_3}) = 1, \\
\frac{\partial^2}{\partial x_2 \partial x_2} f(x) &= \frac{\partial}{\partial x_2}\left(\frac{\partial}{\partial x_2} f(x)\right) = \frac{\partial}{\partial x_2}(x_1 + \sqrt{x_3}) = 0, \\
\frac{\partial^2}{\partial x_3 \partial x_2} f(x) &= \frac{\partial}{\partial x_3}\left(\frac{\partial}{\partial x_2} f(x)\right) = \frac{\partial}{\partial x_3}(x_1 + \sqrt{x_3}) = \frac{1}{2\sqrt{x_3}}.
\end{aligned} \tag{3.64}$$

Finally, for the second-order derivatives with respect to x3, we have 2   ∂ ∂ ∂ ∂  x2 √  f(x) = f(x) = x3 = 0, ∂x1∂x3 ∂x1 ∂x3 ∂x1 2 ∂2 ∂  ∂  ∂  x  1 f(x) = f(x) = √2 = √ , (3.65) ∂x2∂x3 ∂x2 ∂x3 ∂x2 2 x3 2 x3

2    1  3 ∂ ∂ ∂ ∂ 1 − 2 1 − 2 f(x) = f(x) = x2 x3 = − x2x3 . ∂x3∂x3 ∂x3 ∂x3 ∂x3 2 4 Note from the above that it does not matter in which order the second derivatives are taken, as ∂2 ∂2 f(x) = f(x) = 1, ∂x1∂x2 ∂x2∂x1 ∂2 ∂2 f(x) = f(x) = 0, (3.66) ∂x1∂x3 ∂x3∂x1 ∂2 ∂2 1 f(x) = f(x) = √ . ∂x2∂x3 ∂x3∂x2 2 x3

This is a general property of partial derivatives known as Schwarz’ Theorem, which we state without proof. Theorem 3.3.1 (Schwarz’ Theorem). For a multivariate real-valued function

$$f : \mathbb{R}^n \to \mathbb{R}, \; x \mapsto f(x), \tag{3.67}$$
it holds that
$$\frac{\partial^2}{\partial x_i \partial x_j} f(x) = \frac{\partial^2}{\partial x_j \partial x_i} f(x) \quad\text{for all } 1 \leq i, j \leq n. \tag{3.68}$$

Schwarz’ Theorem is helpful when evaluating partial derivatives: on the one hand, one can save some work by relying on it, on the other hand, it can help to validate one’s analytical results, because if one finds that it does not hold for certain second-order partial derivatives, there must be an error.

Gradient and Hessian. The first- and second-order partial derivatives of a multivariate real-valued function $f$ can be summarized in two entities known as the gradient (or gradient vector) and the Hessian (or Hessian matrix).

Gradient. The gradient of a function

$$f : \mathbb{R}^n \to \mathbb{R}, \; x \mapsto f(x) \tag{3.69}$$
at a location $x \in \mathbb{R}^n$ is defined as the $n$-dimensional vector of the function's partial derivatives evaluated at this location and is denoted by the $\nabla$ (nabla) symbol:

 ∂ f(x) ∂x1 ∂  ∂x f(x) ∇f : n → n, x 7→ ∇f(x) :=  2  . (3.70) R R  .   .  ∂ f(x) ∂xn Note that the gradient is a vector-valued function: it takes a vector x ∈ Rn as input and returns a vector ∇f (x) ∈ Rn. We note without proof that the gradient evaluated at x ∈ Rn is a vector that points in the direction of the greatest rate of increase (steepest ascent) of the function.


Hessian. The second-order partial derivatives of a multivariate real-valued function $f$ can be summarized in the Hessian of the function, which hereinafter is denoted by $H^f$. It is defined as

$$H^f : \mathbb{R}^n \to \mathbb{R}^{n \times n}, \; x \mapsto H^f(x), \tag{3.71}$$
where

$$H^f(x) := \begin{pmatrix}
\frac{\partial^2}{\partial x_1 \partial x_1} f(x) & \frac{\partial^2}{\partial x_1 \partial x_2} f(x) & \cdots & \frac{\partial^2}{\partial x_1 \partial x_n} f(x) \\
\frac{\partial^2}{\partial x_2 \partial x_1} f(x) & \frac{\partial^2}{\partial x_2 \partial x_2} f(x) & \cdots & \frac{\partial^2}{\partial x_2 \partial x_n} f(x) \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2}{\partial x_n \partial x_1} f(x) & \frac{\partial^2}{\partial x_n \partial x_2} f(x) & \cdots & \frac{\partial^2}{\partial x_n \partial x_n} f(x)
\end{pmatrix}. \tag{3.72}$$
Note that in each row of the Hessian, the second of the two partial derivatives is constant (in the order of differentiation, not in the order of notation), while the first partial derivative varies from 1 to $n$ over columns, and the reverse is true for each column. Notably, the Hessian matrix is a matrix-valued function: it takes a vector $x \in \mathbb{R}^n$ as input and returns an $n \times n$ matrix $H^f(x) \in \mathbb{R}^{n \times n}$. Finally, note that due to Schwarz' Theorem, the Hessian matrix is symmetric, i.e.,

$$H^f(x) = \left(H^f(x)\right)^T. \tag{3.73}$$
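Gradient and Hessian can be obtained symbolically, for example for the function (3.61) of the preceding example. The following SymPy sketch is an added illustration; declaring the symbols as positive is an assumption made so that square-root expressions simplify cleanly.

```python
import sympy as sp

x1, x2, x3 = sp.symbols('x1 x2 x3', positive=True)
f = x1**2 + x1*x2 + x2*sp.sqrt(x3)            # cf. (3.61)
variables = [x1, x2, x3]

grad = sp.Matrix([sp.diff(f, v) for v in variables])   # gradient, cf. (3.70)
hess = sp.hessian(f, variables)                        # Hessian matrix, cf. (3.72)

print(grad)                # Matrix([[2*x1 + x2], [x1 + sqrt(x3)], [x2/(2*sqrt(x3))]])
print(sp.simplify(hess - hess.T) == sp.zeros(3, 3))    # True: symmetry, cf. (3.73)
```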

3.4 Derivatives of multivariate vector-valued functions

Multivariate vector-valued functions. Thus far, we have discussed univariate real-valued and multivariate real-valued functions. A further type of function that is commonly encountered is a function that maps vectors onto vectors. A principled account and theoretical development of derivatives for such multivariate vector-valued functions is provided by Magnus and Neudecker (1989). Multivariate vector-valued functions are functions of the form
$$f : \mathbb{R}^n \to \mathbb{R}^m, \; x \mapsto f(x) := \begin{pmatrix} f_1(x_1, ..., x_n) \\ f_2(x_1, ..., x_n) \\ \vdots \\ f_m(x_1, ..., x_n) \end{pmatrix}. \tag{3.74}$$
In physics, such functions are referred to as vector fields. The multivariate real-valued functions

$$f_i : \mathbb{R}^n \to \mathbb{R}, \quad i = 1, ..., m \tag{3.75}$$
are referred to as the component functions of $f$.

Example. A first example for a multivariate vector-valued function is
$$f : \mathbb{R}^3 \to \mathbb{R}^2, \; x \mapsto f(x) := \begin{pmatrix} x_1 + x_2 \\ x_2 x_3 \end{pmatrix}, \tag{3.76}$$
for which the component functions are given by

$$f_1 : \mathbb{R}^3 \to \mathbb{R}, \; x \mapsto f_1(x) := x_1 + x_2, \tag{3.77}$$
$$f_2 : \mathbb{R}^3 \to \mathbb{R}, \; x \mapsto f_2(x) := x_2 x_3. \tag{3.78}$$

The Jacobian matrix

The first derivative of multivariate vector-valued functions evaluated at x ∈ Rn is given by the Jacobian matrix. The Jacobian matrix is denoted and defined by

 ∂ ∂  ∂x f1(x) ··· ∂x f1(x)  ∂  1 n J f : n → m×n, x 7→ J f (x) := f (x) =  . .. .  . (3.79) R R ∂x i  . . .  j i=1,...,m,j=1,...,n ∂ ∂ fm(x) ··· fm(x) ∂x1 ∂xn


In words, the Jacobian matrix of a multivariate vector-valued function f : Rn → Rm with component functions fi, i = 1, ..., m is the m × n matrix of n partial derivatives of the m component functions with respect to the n input vector components xj, j = 1, ..., n. Note that the gradient of a multivariate real-valued function corresponds to the transpose of the Jacobian matrix of the function: for

$$f : \mathbb{R}^n \to \mathbb{R}^m \tag{3.80}$$
with $m = 1$, we have
$$\nabla f(x) = \left(J^f(x)\right)^T. \tag{3.81}$$
This can be readily seen by inspecting the entries of the first row of (3.82). Finally, note that the determinant of the Jacobian matrix is often referred to as the Jacobian.

Example. As an example, consider the Jacobian matrix of the function defined in (3.76). By evaluation of the respective partial derivatives, we have

$$J^f : \mathbb{R}^3 \to \mathbb{R}^{2 \times 3}, \; x \mapsto J^f(x) := \begin{pmatrix} \frac{\partial}{\partial x_1} f_1(x) & \frac{\partial}{\partial x_2} f_1(x) & \frac{\partial}{\partial x_3} f_1(x) \\ \frac{\partial}{\partial x_1} f_2(x) & \frac{\partial}{\partial x_2} f_2(x) & \frac{\partial}{\partial x_3} f_2(x) \end{pmatrix} = \begin{pmatrix} 1 & 1 & 0 \\ 0 & x_3 & x_2 \end{pmatrix}. \tag{3.82}$$
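The Jacobian matrix (3.82) can likewise be obtained symbolically; the following SymPy sketch is an added illustration.

```python
import sympy as sp

x1, x2, x3 = sp.symbols('x1 x2 x3')
f = sp.Matrix([x1 + x2, x2*x3])               # component functions, cf. (3.76)-(3.78)

J = f.jacobian([x1, x2, x3])                  # Jacobian matrix, cf. (3.79) and (3.82)
print(J)                                      # Matrix([[1, 1, 0], [0, x3, x2]])
```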

3.5 Basic integrals

In this Section, we review the intuition of the definite integral as the signed area under a function’s graph and the notion of indefinite integration as the inverse of differentiation.

Definite integrals of univariate real-valued functions

We denote the definite integral of a univariate real-valued function $f$ on an interval $[a, b] \subset \mathbb{R}$ by the real number
$$I := \int_a^b f(x)\,dx \in \mathbb{R}. \tag{3.83}$$
It is important to realize two aspects of (3.83): first, the definite integral is a real number and second, the right-hand side of (3.83) is merely notational and to be understood as the imperative for integrating the function $f$ on the interval $[a, b]$. In other words, there is no mathematical meaning associated with the $dx$ or the $\int_a^b$ that goes beyond the definition of the integral boundaries $a$ and $b$. The term definite is used here to distinguish this integral from the indefinite integral discussed below. Put simply, definite integrals are those integrals for which the integral boundaries appear at the integral sign, although they may sometimes be omitted, e.g., if the interval of integration is the entire real line. Intuitively, the definite integral $\int_a^b f(x)\,dx$ is best understood as the continuous generalization of the discrete sum

$$\sum_{i=1}^{n} f(x_i)\Delta x, \tag{3.84}$$
where
$$a =: x_1, \; x_2 := x_1 + \Delta x, \; x_3 := x_2 + \Delta x, \; ..., \; x_{n+1} := b \tag{3.85}$$
corresponds to an equipartition of the interval $[a, b]$, i.e., a partition of the interval $[a, b]$ into $n$ bins of equal size $\Delta x$. The term $f(x_i)\Delta x$ for $i = 1, ..., n$ in (3.84) corresponds to the area of the rectangle formed by the value of the function $f$ at $x_i$ (i.e., the upper left corner of the rectangle) as height and the bin width $\Delta x$ as width. Summing over all rectangles then yields an approximation of the area under the graph of the function $f$, where terms with negative values of $f(x_i)$ enter the sum with a negative sign. Intuitively, letting the bin width $\Delta x$ in the sum (3.84) approach zero then approximates the integral of $f$ on the interval $[a, b]$,
$$\int_a^b f(x)\,dx \approx \sum_{i=1}^{n} f(x_i)\Delta x \quad\text{for } \Delta x \to 0. \tag{3.86}$$
This approximation approach to the definite integral is visualized in Figure 3.3.
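The approximation (3.86) is easy to explore numerically. The following added Python sketch approximates $\int_0^1 x^2\,dx = 1/3$ with rectangles of shrinking bin width; the example function and interval are arbitrary choices for this illustration.

```python
import numpy as np

def riemann_sum(f, a, b, n):
    # Left-endpoint approximation sum_i f(x_i) * dx with n equally sized bins, cf. (3.84)-(3.86)
    dx = (b - a) / n
    x = a + dx * np.arange(n)                 # x_1, ..., x_n (left endpoints)
    return np.sum(f(x) * dx)

f = lambda x: x**2
for n in [10, 100, 10000]:
    print(n, riemann_sum(f, 0.0, 1.0, n))     # approaches 1/3 as the bin width shrinks
```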


Figure 3.3. Evaluation of a definite integral by means of the approximation approach described in eq. (3.86).

Definite integrals have a linearity property, which is often useful when evaluating integrals analytically. Based on the intuition that for a function $f : \mathbb{R} \to \mathbb{R}$, the definite integral corresponds to
$$\int_a^b f(x)\,dx \approx \sum_{i=1}^{n} f(x_i)\Delta x, \tag{3.87}$$
and the fact that for a second function $g : \mathbb{R} \to \mathbb{R}$, we have
$$\sum_{i=1}^{n} (f(x_i) + g(x_i))\Delta x = \sum_{i=1}^{n} (f(x_i)\Delta x + g(x_i)\Delta x) = \sum_{i=1}^{n} f(x_i)\Delta x + \sum_{i=1}^{n} g(x_i)\Delta x, \tag{3.88}$$
and for a constant $c \in \mathbb{R}$ we have
$$\sum_{i=1}^{n} c f(x_i)\Delta x = c \sum_{i=1}^{n} f(x_i)\Delta x, \tag{3.89}$$
we can infer the following two properties of the integral:

$$\int_a^b (f(x) + g(x))\,dx = \int_a^b f(x)\,dx + \int_a^b g(x)\,dx \tag{3.90}$$
and
$$\int_a^b c f(x)\,dx = c \int_a^b f(x)\,dx. \tag{3.91}$$
In words, first, the integral of the sum of two functions $f + g$ over an interval $[a, b]$ corresponds to the sum of the integrals of the individual functions $f$ and $g$ on $[a, b]$. Second, the integral of a function $f$ multiplied by a constant $c$ on an interval $[a, b]$ corresponds to the integral of the function $f$ on the interval $[a, b]$ multiplied by the constant. Both properties are very useful when evaluating integrals analytically: the first allows for decomposing integrals of composite functions into sums of integrals of less complex functions, while the second allows for removing constants from integration.

Indefinite integrals Consider a univariate real-valued function

$$f : \mathbb{R} \to \mathbb{R}, \; x \mapsto f(x). \tag{3.92}$$
Next, consider a second function that is defined in terms of definite integrals of $f$ by making the upper integration boundary of these definite integrals its argument:
$$F : \mathbb{R}_{\geq 0} \to \mathbb{R}, \; x \mapsto F(x) := \int_0^x f(s)\,ds. \tag{3.93}$$


From the discussion above, we have that the value of $F$ at $x$ corresponds to the signed area under the graph of the function $f$ on the interval from 0 to $x$. Notably, the derivative $F'(x)$ of $F$ at $x$ corresponds to the value of the function $f$, i.e.,
$$F'(x) = \frac{d}{dx}\left(\int_0^x f(s)\,ds\right) = f(x). \tag{3.94}$$
Intuitively, eq. (3.94) states that integration is the inverse of differentiation, in the sense that first integrating $f$ from 0 to $x$ and then computing the derivative with respect to $x$ yields $f$. Any function $F$ with the property $F'(x) = f(x)$ for a function $f$ is called an anti-derivative or indefinite integral of $f$. An indefinite integral is denoted by
$$F : \mathbb{R} \to \mathbb{R}, \; x \mapsto F(x) = \int f(s)\,ds. \tag{3.95}$$

Note that the definite integral defined above corresponds to a real scalar number, while the indefinite integral is a function.

Proof of (3.94). While the statement of equation (3.94) is familiar and intuitive, it is not necessarily formally easy to grasp. Here, we provide a proof of this equation based on Leithold (1976). The proof makes use of limiting processes and the mean value theorem (Spivak, 2008). Let $f : \mathbb{R} \to \mathbb{R}, \; s \mapsto f(s)$ be a univariate real-valued function, and define another function
$$F : \mathbb{R} \to \mathbb{R}, \; x \mapsto F(x) := \int_a^x f(s)\,ds. \tag{3.96}$$
For any two numbers $x_1$ and $x_1 + \Delta x$ in the (closed) interval $[a, b] \subset \mathbb{R}$, we then have

$$F(x_1) = \int_a^{x_1} f(s)\,ds \quad\text{and}\quad F(x_1 + \Delta x) = \int_a^{x_1 + \Delta x} f(s)\,ds. \tag{3.97}$$
Subtraction of these two equalities yields

$$F(x_1 + \Delta x) - F(x_1) = \int_a^{x_1 + \Delta x} f(s)\,ds - \int_a^{x_1} f(s)\,ds. \tag{3.98}$$
From the intuition of the integral as the area between the function $f$ and the $x$-axis, it follows naturally that the sum of the areas of two adjacent regions is equal to the area of both regions combined, i.e.,

$$\int_a^{x_1} f(s)\,ds + \int_{x_1}^{x_1 + \Delta x} f(s)\,ds = \int_a^{x_1 + \Delta x} f(s)\,ds. \tag{3.99}$$
From this it follows that the difference above evaluates to

$$F(x_1 + \Delta x) - F(x_1) = \int_{x_1}^{x_1 + \Delta x} f(s)\,ds. \tag{3.100}$$

According to the mean value theorem for integration, there exists a real number $c_{\Delta x} \in [x_1, x_1 + \Delta x]$ (the dependence on $\Delta x$ of which we have denoted by the subscript) with

$$\int_{x_1}^{x_1 + \Delta x} f(s)\,ds = f(c_{\Delta x})\Delta x, \tag{3.101}$$
and we hence obtain
$$F(x_1 + \Delta x) - F(x_1) = f(c_{\Delta x})\Delta x. \tag{3.102}$$
Division by $\Delta x$ then yields
$$\frac{F(x_1 + \Delta x) - F(x_1)}{\Delta x} = f(c_{\Delta x}), \tag{3.103}$$
where the left-hand side corresponds to Newton's difference quotient. Taking the limit $\Delta x \to 0$ on both sides then yields
$$\lim_{\Delta x \to 0} \frac{F(x_1 + \Delta x) - F(x_1)}{\Delta x} = \lim_{\Delta x \to 0} f(c_{\Delta x}) \;\Leftrightarrow\; F'(x_1) = \lim_{\Delta x \to 0} f(c_{\Delta x}) \tag{3.104}$$
by definition of the derivative as the limit of Newton's difference quotient. The limit on the right-hand side of (3.104) remains to be evaluated. To this end, we recall that $c_{\Delta x} \in [x_1, x_1 + \Delta x]$ or, in other words, that $x_1 \leq c_{\Delta x} \leq x_1 + \Delta x$. Notably, $\lim_{\Delta x \to 0} x_1 = x_1$ and $\lim_{\Delta x \to 0} (x_1 + \Delta x) = x_1$. Therefore, we can conclude that $\lim_{\Delta x \to 0} c_{\Delta x} = x_1$, as $c_{\Delta x}$ is squeezed between $x_1$ and $x_1 + \Delta x$, both of which converge to $x_1$. We thus find

$$F'(x_1) = \lim_{\Delta x \to 0} f(c_{\Delta x}) = f(x_1), \tag{3.105}$$

which concludes the proof. □

Indefinite integrals allow for the evaluation of definite integrals $\int_a^b f(s)\,ds$ by means of the fundamental theorem of calculus
$$\int_a^b f(s)\,ds = F(b) - F(a). \tag{3.106}$$
In words, to evaluate the integral of a univariate real-valued function $f$ on the interval $[a, b]$, one has to first compute the anti-derivative of $f$, and then compute the difference between the anti-derivative evaluated at the upper integral interval boundary $b$ and the anti-derivative evaluated at the lower integral interval boundary $a$. Equation (3.106) is very familiar. We first consider some of its properties and then provide a formal justification.

Properties of indefinite integrals. We first note without proof that the linearity properties of the definite integral also hold for the indefinite integral: for functions $f, g : \mathbb{R} \to \mathbb{R}$ and a constant $c \in \mathbb{R}$ we have
$$\int (f(x) + g(x))\,dx = \int f(x)\,dx + \int g(x)\,dx \tag{3.107}$$
and
$$\int c f(x)\,dx = c \int f(x)\,dx. \tag{3.108}$$

As for differentiation, it is useful to know the anti-derivatives of a handful of univariate real-valued functions that are commonly encountered. A selection of anti-derivatives is presented below. These can readily be verified by evaluating the derivatives of the respective anti-derivatives to recover the original functions. Note that the derivative of the constant function f(x) := c, c ∈ R is zero. We have

$f(x) := a$        ⇒  $F(x) = ax + c$
$f(x) := x^a$      ⇒  $F(x) = \frac{1}{a+1} x^{a+1} + c$  $(a \neq -1)$
$f(x) := x^{-1}$   ⇒  $F(x) = \ln x + c$
$f(x) := \exp(x)$  ⇒  $F(x) = \exp(x) + c$
$f(x) := \sin(x)$  ⇒  $F(x) = -\cos(x) + c$
$f(x) := \cos(x)$  ⇒  $F(x) = \sin(x) + c$

Proof of (3.106). As for the statement that the derivative of an anti-derivative is the original function, the fundamental theorem of calculus is very familiar but a formal derivation is somewhat more involved. The proof provided here, which again follows Leithold (1976), makes use of limiting processes and the mean value theorem of differentiation. We first consider the quantity $F(b) - F(a)$. To this end, we select numbers $x_0, ..., x_n$ such that

a := x0 < x1 < x2 < . . . < xn−1 < xn =: b. (3.109)

It then follows that F (b) − F (a) = F (xn) − F (x0) . (3.110)

Next, each F (xi) , i = 1, . . . , n − 1 is added to the quantity F (b) − F (a) together with its additive inverse

$$\begin{aligned}
F(b) - F(a) &= F(x_n) + (-F(x_{n-1}) + F(x_{n-1})) + \ldots + (-F(x_1) + F(x_1)) - F(x_0) \\
&= (F(x_n) - F(x_{n-1})) + (F(x_{n-1}) - F(x_{n-2})) + \ldots + (F(x_1) - F(x_0)) \\
&= \sum_{i=1}^{n} (F(x_i) - F(x_{i-1})).
\end{aligned} \tag{3.111}$$

For a function F :[a, b] → R, the mean value theorem of differentiation states that under certain constraints on F , which we assume to be fulfilled, there exists a number c ∈]a, b[ such that

$$F'(c) = \frac{F(b) - F(a)}{b - a}. \tag{3.112}$$
From the mean value theorem of differentiation, it thus follows that for the terms of the sum above, we have, with appropriately chosen $c_i \in\, ]a, b[$, $i = 1, \ldots, n$,

$$F(x_i) - F(x_{i-1}) = F'(c_i)(x_i - x_{i-1}), \tag{3.113}$$

and substitution then yields
$$F(b) - F(a) = \sum_{i=1}^{n} F'(c_i)(x_i - x_{i-1}). \tag{3.114}$$
By definition, it follows that
$$F'(c_i) = f(c_i), \tag{3.115}$$
and setting $\Delta x_{i-1} := x_i - x_{i-1}$ yields

$$F(b) - F(a) = \sum_{i=1}^{n} f(c_i)\Delta x_{i-1}. \tag{3.116}$$

Now, $F(b)$ and $F(a)$ are independent of the $x_i$ and the left-hand side of the above thus evaluates to $F(b) - F(a)$. For the right-hand side, we note that $x_{i-1} \leq c_i \leq x_{i-1} + \Delta x_{i-1}$ and thus $\lim_{\Delta x_{i-1} \to 0} x_{i-1} = x_{i-1}$ and $\lim_{\Delta x_{i-1} \to 0} (x_{i-1} + \Delta x_{i-1}) = x_{i-1}$, from which it follows that $\lim_{\Delta x_{i-1} \to 0} c_i = x_{i-1}$. We thus have

$$F(b) - F(a) = \sum_{i=1}^{n} f(x_{i-1})\Delta x_{i-1} = \sum_{i=0}^{n-1} f(x_i)\Delta x_i \approx \int_a^b f(s)\,ds \tag{3.117}$$
with the definition of the definite integral under the generalization that the $\Delta x_i$ may not be equally spaced. This concludes the proof.

□

Example. To illustrate the theory and interplay of indefinite and definite integrals, we evaluate the definite integral of the function
$$f : \mathbb{R} \to \mathbb{R}, \; x \mapsto f(x) := 2x^2 + x + 1 \tag{3.118}$$
on the interval $[1, 2]$. To this end, we first use the linearity property of the indefinite integral, which yields
$$F : \mathbb{R} \to \mathbb{R}, \; x \mapsto F(x) := \int f(x)\,dx = \int (2x^2 + x + 1)\,dx = 2\int x^2\,dx + \int x\,dx + \int 1\,dx. \tag{3.119}$$

We then make use of the table of commonly encountered anti-derivatives to evaluate the remaining integral terms, yielding

F(x) = (2/3)x³ + (1/2)x² + x + c,    (3.120)

where the constant c ∈ R comprises all constant terms. Importantly, this constant term vanishes once we evaluate a definite integral by means of the fundamental theorem of calculus:

∫_1^2 f(x) dx = F(2) − F(1)
             = ((2/3)·2³ + (1/2)·2² + 2 + c) − ((2/3)·1³ + (1/2)·1² + 1 + c)
             = 16/3 + 4/2 + 2 + c − 2/3 − 1/2 − 1 − c    (3.121)
             = 32/6 + 12/6 + 12/6 + c − 4/6 − 3/6 − 6/6 − c
             = 43/6.

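The worked example can be cross-checked symbolically. The sketch below (an addition to the text, assuming sympy is available) reproduces the value 43/6 ≈ 7.17 obtained in (3.121).

```python
import sympy as sp

x = sp.symbols('x')
f = 2*x**2 + x + 1

# Indefinite integral (sympy omits the constant of integration c)
F = sp.integrate(f, x)                          # 2*x**3/3 + x**2/2 + x

# Definite integral on [1, 2] via the fundamental theorem of calculus
value = F.subs(x, 2) - F.subs(x, 1)
assert value == sp.Rational(43, 6)

# The same value obtained by direct definite integration
assert sp.integrate(f, (x, 1, 2)) == sp.Rational(43, 6)
```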
Integration by parts Integration by parts can be considered an analogue of the product rule of differentiation for integrals. For two functions f : [a, b] → R and g : [a, b] → R, the integration by parts rule states that

∫_a^b f′(x)g(x) dx = f(b)g(b) − f(a)g(a) − ∫_a^b f(x)g′(x) dx.    (3.122)

The integration by parts rule can be useful if the anti-derivative f of f′ and the integral on the right-hand side are readily available.


Proof. With the product rule of differentiation, we have

(f(x)g(x))′ = f′(x)g(x) + f(x)g′(x)
⇔ f′(x)g(x) = (f(x)g(x))′ − f(x)g′(x)    (3.123)
⇔ ∫_a^b f′(x)g(x) dx = ∫_a^b (f(x)g(x))′ dx − ∫_a^b f(x)g′(x) dx.

Because f(x)g(x) is an anti-derivative of (f(x)g(x))′, it follows immediately with the fundamental theorem of calculus that

∫_a^b f′(x)g(x) dx = f(b)g(b) − f(a)g(a) − ∫_a^b f(x)g′(x) dx.    (3.124)

Integration by substitution The fundamental theorem of calculus allows for the evaluation of certain integrals by means of an integration rule that is known as integration by substitution and sometimes referred to as “integration by a change of variables”. Specifically, for two functions f : I → R and g : [a, b] → R with g([a, b]) ⊆ I, it holds that

∫_a^b f(g(x))g′(x) dx = ∫_{g(a)}^{g(b)} f(x) dx.    (3.125)

Proof. We first note that an anti-derivative of f(g(x))g′(x) is given by (F ◦ g)(x), where F denotes an anti-derivative of f, because

(F ◦ g)′(x) = F′(g(x))g′(x) = f(g(x))g′(x).    (3.126)

With the fundamental theorem of calculus, we then have

∫_a^b f(g(x))g′(x) dx = (F ◦ g)(b) − (F ◦ g)(a) = F(g(b)) − F(g(a)) = ∫_{g(a)}^{g(b)} f(x) dx.    (3.127)

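To illustrate the substitution rule (3.125) with a concrete case (an example added here, not taken from the text), one may choose f(x) = x² and g(x) = sin(x) on [0, π/2]; both sides of (3.125) then evaluate to 1/3. The sketch assumes numpy and scipy are available.

```python
import numpy as np
from scipy.integrate import quad

# Left-hand side of (3.125): integral of f(g(x)) g'(x) over [a, b] = [0, pi/2]
lhs, _ = quad(lambda x: np.sin(x)**2 * np.cos(x), 0.0, np.pi / 2)

# Right-hand side: integral of f over [g(a), g(b)] = [sin(0), sin(pi/2)] = [0, 1]
rhs, _ = quad(lambda x: x**2, 0.0, 1.0)

assert abs(lhs - rhs) < 1e-10      # both equal 1/3
```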
Definite integrals of multivariate real-valued functions on rectangles The notion of the definite integral of a univariate real-valued function can be generalized to the definite integral of multivariate real-valued functions. Specifically, let

R := [a_1, b_1] × ··· × [a_n, b_n] ⊆ R^n    (3.128)

denote a rectangle, where the a_i, b_i, i = 1, ..., n may be finite or infinite. Further, let

f : R → R, x ↦ f(x)    (3.129)

denote a multivariate real-valued function. Then, under certain regularity conditions which are omitted here, Fubini’s theorem states that

∫_R f(x) dx = ∫_{a_1}^{b_1} ··· ∫_{a_n}^{b_n} f(x_1, ..., x_n) dx_n ··· dx_1.    (3.130)

In words, the definite integral ∫_R f(x) dx of the multivariate real-valued function f on the rectangle R can be evaluated as the iterated integral ∫_{a_1}^{b_1} ··· ∫_{a_n}^{b_n} f(x_1, ..., x_n) dx_n ··· dx_1, which corresponds to a sequence of definite integrals of univariate real-valued functions. Crucially, Fubini’s theorem implies that the order of integration in iterated integrals does not matter.

Example. As an example for the definite integral of a multivariate real-valued function, consider the integral of the function

f : R² → R, x ↦ f(x) := 2x_1 + x_2    (3.131)

on R := [0, 3] × [0, 2], i.e., for x_1 ∈ [0, 3] and x_2 ∈ [0, 2]. We have

∫_R f(x) dx = ∫_{[0,2]} ( ∫_{[0,3]} (2x_1 + x_2) dx_1 ) dx_2
            = ∫_{[0,2]} ( ∫_{[0,3]} 2x_1 dx_1 + ∫_{[0,3]} x_2 dx_1 ) dx_2    (3.132)
            = ∫_{[0,2]} ( ∫_{[0,3]} 2x_1 dx_1 + x_2 ∫_{[0,3]} 1 dx_1 ) dx_2.


For the integrals with respect to x_1, we then have with the fundamental theorem of calculus and the fact that the anti-derivatives of g(x_1) = 2x_1 and h(x_1) = 1 evaluate to G(x_1) = x_1² + c and H(x_1) = x_1 + c, respectively,

∫_R f(x) dx = ∫_{[0,2]} (3² − 0² + x_2(3 − 0)) dx_2 = ∫_{[0,2]} (3x_2 + 9) dx_2.    (3.133)

With the fact that the anti-derivative of g(x_2) = 3x_2 + 9 is given by G(x_2) = (3/2)x_2² + 9x_2, we then have

∫_R f(x) dx = (3/2)·2² + 9·2 − ((3/2)·0² + 9·0) = 24.    (3.134)

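A numerical cross-check of this iterated integral is straightforward (a sketch added here, assuming a recent scipy is available): integrating 2x_1 + x_2 over x_1 ∈ [0, 3] and x_2 ∈ [0, 2] with scipy.integrate.dblquad should reproduce the value 24.

```python
from scipy.integrate import dblquad

# dblquad integrates func(y, x) for x in [a, b] and y in [gfun(x), hfun(x)];
# here the outer variable is x1 in [0, 3] and the inner variable is x2 in [0, 2]
value, abserr = dblquad(lambda x2, x1: 2*x1 + x2, 0.0, 3.0, 0.0, 2.0)

assert abs(value - 24.0) < 1e-10
```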
3.6 Bibliographic remarks

The material presented in this chapter is standard and can be found in any undergraduate textbook on calculus. A good starting point is Spivak (2008), which also provides many justifications for the results presented in the current chapter. Abbott (2015) provides a very readable treatment of the more subtle aspects of real analysis. A theoretically grounded introduction to multivariate calculus is provided by Magnus and Neudecker (1989). As previously mentioned, for justifying the central results on definite and indefinite integration, we consulted Leithold (1976).

3.7 Study questions

1. Give a brief explanation of the notion of a derivative of a univariate function f in a point x.

2. Provide brief explanations of the symbols d/dx, d²/dx², ∂/∂x, and ∂²/∂x².
3. Compute the first derivatives of the following functions:

f : R → R, x ↦ f(x) := 3 exp(−x²)    (3.135)
g : R → R, x ↦ g(x) := (x² + 2 ln(x) − a)³.    (3.136)

4. Determine the minimum of the function

f : R → R, x ↦ f(x) := x² + 3x − 2.    (3.137)

5. Compute the partial derivatives of the function

f : R² → R, (x, y) ↦ f(x, y) := ln(x) + Σ_{i=1}^{n} (y − 3)².    (3.138)

6. Write down the definition of the gradient of a multivariate real-valued function. 7. Write down the definition of the Hessian of a multivariate real-valued function. 8. Evaluate the gradient and the Hessian of

f : R² → R, (x, y) ↦ f(x, y) := 2 exp(x² − 3y).    (3.139)

9. State the intuitions for the definite integral ∫_a^b f(x) dx and the indefinite integral ∫ f(x) dx of a univariate real-valued function f.
10. Evaluate the definite integral

I := ∫_1^3 (5x² + 2x) dx.    (3.140)

4 | Matrices

Matrices and matrix computations are essential in the development of the theory of the GLM. In brief, matrices provide a general language to subsume many variants of linear statistical models and represent their associated theory in an integrated fashion, while they also allow for representing a given modelling scenario in a precise mathematical form that can readily be implemented in a computational environment. In this Section, we first define the notion of a matrix (Section 4.1) and then discuss a number of essential matrix operations (Section 4.2). We also introduce two matrix concepts that are important for introducing multivariate Gaussian distributions, determinants (Section 4.3) and positive-definite matrices (Section 4.4). In essence, we here review the very minimum of concepts from linear algebra that still allows for obtaining a first understanding of the GLM. For a more comprehensive view, it is strongly recommended to consult standard undergraduate textbooks on linear algebra, such as Strang(2009).

4.1 Matrix definition

A matrix is a rectangular collection of numbers. Matrices are usually denoted by capital letters as follows:

A := [ a_11  a_12  ···  a_1m
       a_21  a_22  ···  a_2m
        ⋮     ⋮    ⋱    ⋮
       a_n1  a_n2  ···  a_nm ] = (a_ij)_{1≤i≤n, 1≤j≤m}.    (4.1)

An entry a_ij in a matrix A is indexed by its row index i and its column index j. For example, the entry a_32 in the matrix

A := [ 2 7 5 2
       8 2 5 6
       6 4 0 9
       9 2 1 2 ]    (4.2)

is 4. The size of a matrix is determined by its number of rows n ∈ N and its number of columns m ∈ N. If a matrix has the same number of rows and columns, i.e., if n = m, the matrix is called a square matrix. When referring to a matrix, it is very useful to mention its size and the properties of its entries. In the theory of the GLM, the entries of a matrix are usually elements of the set of real numbers R. We thus write

A ∈ R^{n×m}    (4.3)

to denote that a given matrix A has entries from the set of real numbers and that it comprises n rows and m columns. In words, (4.3) is to be read as “the matrix A consists of n rows and m columns and the entries in A are real numbers”. For example, for the matrix A in (4.2), we write

A ∈ R^{4×4},    (4.4)

because it has four rows and four columns. Note that this matrix is a square matrix. Matrices comprising a single column require special attention: in the context of matrix algebra, we can identify the n-dimensional vectors introduced in Section 2 | Sets, sums, and functions with the set of n × 1 matrices. In other words, we treat n-dimensional vectors and matrices with n rows and a single column as equivalent. This means that we usually set R^n := R^{n×1} for n ∈ N.

4.2 Matrix operations

Just as one can compute with real numbers, one can do algebra with matrices. More specifically, one can

• add and subtract two matrices of the same size, referred to as matrix addition and subtraction,
• multiply a matrix with a scalar, referred to as matrix scalar multiplication,
• multiply two matrices under certain conditions, referred to as matrix multiplication,

• divide by a matrix or, more precisely, multiply by a matrix inverse, which is found by a process referred to as matrix inversion,
• change the ordering of elements of a matrix in a prescribed manner, referred to as matrix transposition.

In the following, we discuss these operations in some detail and provide examples.

Matrix addition and subtraction Two matrices of the same size can be added or subtracted by adding or subtracting their entries in an element-wise fashion. Formally, in case of addition, this can be expressed as

A + B = C    (4.5)

with A, B, C ∈ R^{n×m}, and where the matrix C is given by

A + B = [ a_11 + b_11   a_12 + b_12   ···   a_1m + b_1m
          a_21 + b_21   a_22 + b_22   ···   a_2m + b_2m
               ⋮             ⋮         ⋱         ⋮
          a_n1 + b_n1   a_n2 + b_n2   ···   a_nm + b_nm ].    (4.6)

The analogue element-wise operation is defined for subtraction:

A − B = D. (4.7)

Example. As an example, consider the 2 × 3 matrices A, B ∈ R^{2×3} defined as

A := [ 2 −3 0      and     B := [  4 1 0
       1  6 5 ]                   −4 2 0 ].    (4.8)

Since both matrices have the same size, we can compute their sum

C = A + B
  = [ 2 −3 0     [  4 1 0
      1  6 5 ] +   −4 2 0 ]
  = [ 2 + 4   −3 + 1   0 + 0
      1 − 4    6 + 2   5 + 0 ]    (4.9)
  = [  6 −2 0
      −3  8 5 ]

and their difference

D = A − B
  = [ 2 −3 0     [  4 1 0
      1  6 5 ] −   −4 2 0 ]
  = [ 2 − 4   −3 − 1   0 − 0
      1 + 4    6 − 2   5 − 0 ]    (4.10)
  = [ −2 −4 0
       5  4 5 ].

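The element-wise nature of matrix addition and subtraction is directly mirrored by numpy arrays. The sketch below (an addition to the text, assuming numpy is available) reproduces the matrices C and D from (4.9) and (4.10).

```python
import numpy as np

A = np.array([[2, -3, 0],
              [1,  6, 5]])
B = np.array([[ 4, 1, 0],
              [-4, 2, 0]])

C = A + B        # element-wise sum, cf. (4.9)
D = A - B        # element-wise difference, cf. (4.10)

assert np.array_equal(C, np.array([[ 6, -2, 0], [-3, 8, 5]]))
assert np.array_equal(D, np.array([[-2, -4, 0], [ 5, 4, 5]]))
```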
Matrix scalar multiplication

One can also multiply a matrix A ∈ Rn×m by a scalar c ∈ R. In contrast to a matrix or vector, a scalar is a single number. The operation of multiplying a matrix by a scalar is called scalar multiplication and,

like matrix addition and subtraction, is performed in an element-wise fashion. Formally, we have

cA = [ ca_11  ca_12  ···  ca_1m
       ca_21  ca_22  ···  ca_2m
         ⋮      ⋮     ⋱     ⋮
       ca_n1  ca_n2  ···  ca_nm ] =: B.    (4.11)

Example. As an example, consider the matrix A ∈ R^{4×3} defined as

A := [ 3 1 1
       5 2 5
       2 7 1
       3 4 2 ]    (4.12)

and let c := −3. Then B = cA evaluates to

3 1 1 −3 · 3 −3 · 1 −3 · 1  −9 −3 −3  5 2 5 −3 · 5 −3 · 2 −3 · 5 −15 −6 −15 B = −3   =   =   . (4.13) 2 7 1 −3 · 2 −3 · 7 −3 · 1  −6 −21 −3  3 4 2 −3 · 3 −3 · 4 −3 · 2 −9 −12 −6

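Scalar multiplication is likewise element-wise for numpy arrays; the sketch below (added here, assuming numpy) reproduces B = −3A from (4.13).

```python
import numpy as np

A = np.array([[3, 1, 1],
              [5, 2, 5],
              [2, 7, 1],
              [3, 4, 2]])

B = -3 * A       # element-wise scalar multiplication, cf. (4.13)

assert np.array_equal(B, np.array([[ -9,  -3,  -3],
                                   [-15,  -6, -15],
                                   [ -6, -21,  -3],
                                   [ -9, -12,  -6]]))
```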
Matrix multiplication In addition to adding and subtracting matrices of the same size, as well as multiplying a matrix by a scalar, one can also multiply two matrices. However, matrix multiplication is not an element-wise operation, but has a special definition. Importantly, two matrices A and B can only be multiplied in the order

AB, (4.14) if the first matrix A has as many columns as the second matrix B has rows, or in other words, if

A ∈ R^{n×m} and B ∈ R^{m×p}    (4.15)

with n, m, p ∈ N. This condition is sometimes referred to as the equality of the inner dimensions of the product AB. If this equality does not hold, the two matrices cannot be multiplied. If the inner dimensions of the matrices A and B are equal, the matrix product AB is defined as

AB = C, (4.16)

where the resulting matrix C ∈ R^{n×p} is computed as

AB = [ Σ_{i=1}^{m} a_1i b_i1   Σ_{i=1}^{m} a_1i b_i2   ···   Σ_{i=1}^{m} a_1i b_ip
       Σ_{i=1}^{m} a_2i b_i1   Σ_{i=1}^{m} a_2i b_i2   ···   Σ_{i=1}^{m} a_2i b_ip
                 ⋮                       ⋮              ⋱              ⋮
       Σ_{i=1}^{m} a_ni b_i1   Σ_{i=1}^{m} a_ni b_i2   ···   Σ_{i=1}^{m} a_ni b_ip ]    (4.17)

   =: [ c_11 c_12 ··· c_1p
        c_21 c_22 ··· c_2p
         ⋮    ⋮   ⋱   ⋮
        c_n1 c_n2 ··· c_np ]

    = C.

The expression for the matrix product may appear a bit unwieldy, but it is very important. The expression states that the entry in the ith row and jth column of the matrix C ∈ R^{n×p} is given by overlaying the entries in the ith row of matrix A with the jth column entries of matrix B, multiplying the overlaid entries, and then adding them up.

Example. Let A ∈ R^{2×3} and B ∈ R^{3×2} be defined as

A := [ 2 −3 0      and     B := [  4 2
       1  6 5 ]                   −1 0
                                   1 3 ].    (4.18)

We first consider the size of the matrix C := AB. We know that A has two rows and three columns, while B has three rows and two columns. Because B has the same number of rows as A has columns, the matrix product AB = C is defined. Expression (4.17) tells us that the resulting matrix C has two rows and two columns, because the number of rows of the resulting matrix C is determined by the number of rows of the first matrix A, and the number of columns of the resulting matrix C is determined by the number of columns of the second matrix B. We thus know that C is a 2 × 2 matrix, i.e., C ∈ R2×2. Overlaying the rows of A on the columns of B, multiplying the entries, and adding up the results yields

C = AB
  = [ 2 −3 0     [  4 2
      1  6 5 ] ·   −1 0
                    1 3 ]
  = [ 2·4 + (−3)·(−1) + 0·1    2·2 + (−3)·0 + 0·3
      1·4 + 6·(−1) + 5·1       1·2 + 6·0 + 5·3    ]    (4.19)
  = [ 8 + 3 + 0    4 + 0 + 0
      4 − 6 + 5    2 + 0 + 15 ]
  = [ 11  4
       3 17 ].

It is essential to always keep track of the sizes of the matrices that are involved in matrix multiplications. Specifically, if matrix A is of size n × m and matrix B is of size m × p, then the product AB will always be of size n × p. This can be visualized as

(n × m) · (m × p) = (n × p), (4.20) i.e., the inner numbers m disappear.

Note that if one is calculating with scalars, multiplication is commutative, i.e., for a, b ∈ R, it holds that ab = ba. In contrast, matrix multiplication is generally not commutative, i.e., the order of A and B in the

matrix product matters. As an example, consider the matrices A and B above. We have just seen that

 4 2 2 −3 0 11 4  C := AB = −1 0 = . (4.21) 1 6 5   3 17 1 3

On the other hand, we have

 4 2  10 0 10 2 −3 0 D := BA = −1 0 = −2 3 0 , (4.22)   1 6 5   1 3 5 15 15 and thus, for the current example, AB 6= BA. Depending on the matrix sizes, the commuted matrix product may not even be defined: if A ∈ R2×3 and B = R3×4, the matrix product AB exists, but the matrix product BA does not.

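Matrix multiplication and its non-commutativity can be checked with numpy's @ operator (a sketch added to the text, assuming numpy; it reproduces (4.19)–(4.22)).

```python
import numpy as np

A = np.array([[2, -3, 0],
              [1,  6, 5]])          # 2 x 3
B = np.array([[ 4, 2],
              [-1, 0],
              [ 1, 3]])             # 3 x 2

C = A @ B                           # 2 x 2, cf. (4.19) and (4.21)
D = B @ A                           # 3 x 3, cf. (4.22)

assert np.array_equal(C, np.array([[11, 4], [3, 17]]))
assert np.array_equal(D, np.array([[10, 0, 10], [-2, 3, 0], [5, 15, 15]]))
assert C.shape != D.shape           # AB and BA do not even have the same size here
```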
Matrix inversion To motivate the idea of matrix inversion consider the equation

Ax = b, (4.23) where A ∈ Rn×n, x ∈ Rn, and b ∈ Rn. We next simplify eq. (4.23) by assuming that n = 1. To remind us of this assumption, we rewrite (4.23) using lower-case letters

ax = b, (4.24) where now a, x, b ∈ R. We further assume that we know that a = 2 and b = 6, i.e., we have the equation 2x = 6. (4.25)

From high school mathematics we know how to solve equations such as (4.25) for the unknown variable x. The general strategy is to isolate x on the left-hand side by the appropriate algebraic operations, such as additions and/or multiplications by the known variables, and observe the outcome on the right-hand side of the equation. For the current case, such a strategy takes the following form:

2x = 6                  | divide both sides by 2
(2/2)x = 6/2            | evaluate the ratios    (4.26)
x = 3,

and we have found that the value of the unknown variable x that solves eq. (4.25) is x = 3. From high school mathematics, we also recall that division by a scalar number a corresponds to multiplication by its multiplicative inverse. The multiplicative inverse, also referred to as the reciprocal, of a scalar a is the number which yields 1 if it is multiplied with a, i.e., 1/a. We may thus also write the solution strategy (4.26) in the form

2x = 6                  | multiply both sides by 1/2
(1/2) · 2x = (1/2) · 6  | evaluate the ratios    (4.27)
x = 3.

Of course both strategies, (4.26) and (4.27), yield the same result. From high school mathematics, one may also remember that the multiplicative inverse 1/a of a scalar a can be denoted by a−1. In general mathematics, a−1 is called the multiplicative inverse of a if, and only if,

aa−1 = a−1a = 1, (4.28) where 1 denotes the neutral element with respect to multiplication, i.e.,

a · 1 = 1 · a = a, (4.29)

for all a. Recapitulating (4.26) in more abstract terms, we have thus carried out the following operations

ax = b            | multiply both sides by a⁻¹
a⁻¹ax = a⁻¹b      | evaluate the multiplications    (4.30)
x = a⁻¹b.

We now return to the case that A ∈ Rn×n, x ∈ Rn, and b ∈ Rn are in fact matrices and vectors, i.e., n > 1. In matrix algebra, one often encounters statements of the form

Ax = b, (4.31) which are referred to as systems of linear equations. Here, the values of A ∈ Rn×n and b ∈ Rn are known and one would like to find an x ∈ Rn which solves (4.31). In complete analogy to the solution strategy described in (4.30), one can multiply both sides of (4.31) with the inverse A−1 ∈ Rn×n of A (if it exists) to obtain A−1Ax = A−1b. (4.32) In analogy to the scalar case a−1a = 1, the matrix product A−1A yields, by definition, a specific matrix, n×n known as the identity matrix In ∈ R . The identity matrix comprises ones on its main diagonal and zeros everywhere else. That is, by definition, we have

1 0 ··· 0 0 1 ··· 0 A−1A = I :=   . (4.33) n . . .. . . . . . 0 0 ··· 1

In analogy to a · 1 = 1 · a = a for scalars, the product of a matrix A with the identity matrix always yields the matrix A again, AI = IA = A, (4.34) i.e., the identity matrix is the neutral element with respect to matrix multiplication. We thus see that we can evaluate the left-hand side of (4.31) as follows

Ax = b            | multiply both sides by A⁻¹
A⁻¹Ax = A⁻¹b      | evaluate the multiplications    (4.35)
x = A⁻¹b.

In other words, if we have a way to evaluate the inverse A⁻¹ of the square matrix A, we can solve for the unknown x ∈ R^n in the same way that we can solve for the unknown x ∈ R in the scalar case. Matrix inversion is fundamental for solving systems of linear equations and for scientific computing in general. However, a comprehensive discussion of algorithms for matrix inversion is beyond the scope of an introduction to the GLM. In the following, we thus content ourselves with providing two examples for the evaluation of matrix inverses in special cases, namely diagonal matrices and small invertible matrices.

The inverse of a diagonal matrix. A diagonal matrix D is a square matrix with non-zero elements d_i ≠ 0, i = 1, ..., n along its main diagonal and zeros everywhere else,

D = [ d_1   0   ···   0
       0   d_2  ···   0
       ⋮    ⋮    ⋱    ⋮
       0    0   ···  d_n ] ∈ R^{n×n}.    (4.36)

The inverse of a diagonal matrix is given by a diagonal matrix D⁻¹ with the scalar inverses 1/d_i of the d_i on its main diagonal and zeros everywhere else,

 1 0 ··· 0  d1 1  0 d ··· 0  D−1 =  2  ∈ n×n, (4.37)  . . .. .  R  . . . .  0 0 ··· 1 dn

because

D⁻¹D = [ (1/d_1)·d_1       0         ···        0
              0       (1/d_2)·d_2    ···        0
              ⋮             ⋮          ⋱         ⋮          (4.38)
              0             0         ···   (1/d_n)·d_n ]
      = [ 1 0 ··· 0
          0 1 ··· 0
          ⋮ ⋮  ⋱  ⋮
          0 0 ··· 1 ]
      = I_n,

and likewise for DD⁻¹.

Gaussian elimination. Small matrices can be inverted using Gaussian elimination. We will demonstrate this approach in the context of solving the system of linear equations

Ax = b    (4.39)

with

A := [ 1  0  2
       2 −1  3
       4  1  8 ] ∈ R^{3×3},    x ∈ R³,    and    b := (2, 1, 2)ᵀ ∈ R³.

Based on the discussion above, we know that we can solve the system of linear equations for x ∈ R3, if we are able to determine the inverse A−1 of A, because then

x = A−1b. (4.40)

We also know that for the inverse of A, we have

A⁻¹A = I₃.    (4.41)

The inverse of A can be found using the following procedure:

1. Write down the matrix A and next to it the identity matrix I₃:

[ 1  0  2 | 1 0 0
  2 −1  3 | 0 1 0
  4  1  8 | 0 0 1 ].    (4.42)

2. Use three kinds of operations on the rows of A to transform A in the array above into the identity matrix. Apply the same operations to the identity matrix in parallel. The admissible operations are: (a) exchanging two rows of A, (b) multiplying a row of A by a number, and (c) adding or subtracting a multiple of another row of A from any row of A.

Adding −2 times the first row to the second row, and −4 times the first row to the third row yields

[ 1  0  2 |  1 0 0
  0 −1 −1 | −2 1 0
  0  1  0 | −4 0 1 ].    (4.43)

Exchanging the second and third row yields

[ 1  0  2 |  1 0 0
  0  1  0 | −4 0 1
  0 −1 −1 | −2 1 0 ].    (4.44)

Adding the second row to the third row yields

[ 1  0  2 |  1 0 0
  0  1  0 | −4 0 1
  0  0 −1 | −6 1 1 ].    (4.45)

The General Linear Model 20/21 | © 2020 Dirk Ostwald CC BY-NC-SA 4.0 Matrix operations 51

Adding 2 times the third row to the first row yields

1 0 0 | −11 2 2 0 1 0 | −4 0 1 . (4.46) 0 0 −1 | −6 1 1

Multiplying the third row by -1 yields

1 0 0 | −11 2 2  0 1 0 | −4 0 1  . (4.47) 0 0 1 | 6 −1 −1

Having transformed the matrix on the left into the identity matrix, the matrix that is left on the right is now the inverse A−1 of A, as can be verified by computing

1 0 2 −11 2 2  1 0 0 2 −1 3  −4 0 1  = 0 1 0 . (4.48) 4 1 8 6 −1 −1 0 0 1

Hence, with −11 2 2  −1 A =  −4 0 1  , (4.49) 6 −1 −1 the solution of (4.39) can be computed using (4.40) as

−11 2 2  2 −16 −1 x = A b =  −4 0 1  1 =  −6  . (4.50) 6 −1 −1 2 9

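The result of the Gaussian elimination can be cross-checked numerically (a sketch added here, assuming numpy): np.linalg.inv reproduces A⁻¹ from (4.49) and np.linalg.solve the solution x from (4.50).

```python
import numpy as np

A = np.array([[1,  0, 2],
              [2, -1, 3],
              [4,  1, 8]], dtype=float)
b = np.array([2, 1, 2], dtype=float)

A_inv = np.linalg.inv(A)                          # cf. (4.49)
x = np.linalg.solve(A, b)                         # cf. (4.50)

assert np.allclose(A_inv, [[-11, 2, 2], [-4, 0, 1], [6, -1, -1]])
assert np.allclose(A @ A_inv, np.eye(3))          # A A^{-1} = I_3
assert np.allclose(x, [-16, -6, 9])
```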
Matrix transposition The transposition of a matrix is the exchange of its row and column elements. Transposition of a matrix A is denoted by Aᵀ and implies that if A ∈ R^{n×m}, then Aᵀ ∈ R^{m×n}. Formally, we have for

A := [ a_11 a_12 ··· a_1m
       a_21 a_22 ··· a_2m
        ⋮    ⋮   ⋱   ⋮
       a_n1 a_n2 ··· a_nm ] ∈ R^{n×m}    (4.51)

and

B := Aᵀ ∈ R^{m×n},    (4.52)

that

B = [ b_11 b_12 ··· b_1n
      b_21 b_22 ··· b_2n
       ⋮    ⋮   ⋱   ⋮
      b_m1 b_m2 ··· b_mn ]
  := [ a_11 a_21 ··· a_n1
       a_12 a_22 ··· a_n2
        ⋮    ⋮   ⋱   ⋮
       a_1m a_2m ··· a_nm ] ∈ R^{m×n}.    (4.53)

Example. For example, if

A := [ 2 3 0
       1 6 5 ],    (4.54)

then

Aᵀ = [ 2 1
       3 6
       0 5 ].    (4.55)

Note that the transpose of a 1 × 1 matrix, i.e., a scalar, is just the same scalar again.


4.3 Determinants

Historically, determinants were used to characterize the solution space of systems of linear equations of the form (4.23): if the determinant of the coefficient matrix A is non-zero, the system has a unique solution; if the determinant of A is zero, the system either has infinitely many solutions or no solution at all. There exists a rich general theory of determinants in linear algebra. However, from the perspective of an introduction to the GLM, it suffices to review the determinants of a few basic matrices that will reappear in the context of Gaussian distributions.

Determinants of 2 × 2 and 3 × 3 matrices. In general, determinants can be understood as functions on the set of square matrices A ∈ R^{n×n} which take on scalar values. The determinant of a matrix A ∈ R^{n×n} is denoted by |A|. For a matrix A ∈ R^{2×2}, the determinant is defined as

| · | : R^{2×2} → R, A ↦ |A| := a_11 a_22 − a_12 a_21.    (4.56)

For a matrix A ∈ R^{3×3}, the determinant is defined as

| · | : R^{3×3} → R, A ↦ |A| := a_11 a_22 a_33 + a_12 a_23 a_31 + a_13 a_21 a_32
                                − a_12 a_21 a_33 − a_11 a_23 a_32 − a_13 a_22 a_31.    (4.57)

Determinants of diagonal matrices. Let D ∈ R^{n×n} denote a diagonal matrix, i.e., D = (d_ij)_{1≤i,j≤n} with d_ij ≠ 0 for the diagonal elements (i.e., i = j) and d_ij = 0 for the off-diagonal elements (i.e., i ≠ j). Then the determinant of D is the product of its diagonal elements:

| · | : R^{n×n} → R, D ↦ |D| = Π_{i=1}^{n} d_ii.    (4.58)

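The determinant formulas above can be evaluated with np.linalg.det. The sketch below (an addition to the text, assuming numpy; the concrete matrices are chosen for illustration only) checks the 2 × 2 formula (4.56) and the diagonal rule (4.58).

```python
import numpy as np

# 2 x 2 example: |A| = a11*a22 - a12*a21
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
assert np.isclose(np.linalg.det(A), 1*4 - 2*3)        # -2

# Diagonal example: the determinant is the product of the diagonal elements
D = np.diag([2.0, 3.0, 5.0])
assert np.isclose(np.linalg.det(D), 2*3*5)            # 30
```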
4.4 Symmetry and positive-definiteness

A square matrix A ∈ R^{n×n} is called symmetric, if it equals its transpose, i.e., if A = Aᵀ. For example, the matrix

A = [ 2 1 3
      1 4 5
      3 5 0 ]    (4.59)

is symmetric, which is easily seen by writing down its transpose. Note that non-square matrices cannot be symmetric, because their transposition results in a new matrix which differs from the original matrix in the number of its rows and columns, and thus cannot be identical to the original matrix. Symmetric matrices can have an additional property known as positive-definiteness. The concept of positive-definiteness can be approached from multiple perspectives and it is usually not trivial to check whether a given matrix is in fact positive-definite. A very fundamental notion of positive-definiteness is the following definition: a symmetric matrix A ∈ R^{n×n} is called positive-definite, if

xᵀAx > 0 for all x ∈ R^n, x ≠ 0.    (4.60)

Note that based on this definition, one would have to evaluate the product xT Ax for all x ∈ Rn to check whether A is in fact positive-definite. This is not realistic. Therefore a variety of additional and equivalent criteria exist for the positive-definiteness of a matrix, which are, however, beyond the scope of this introduction to the GLM. In the following example, we consider a simple scenario where the positive-definiteness of an exemplary matrix can actually be determined directly.


Example. We consider the symmetric matrix

A = [ 2 1
      1 2 ] ∈ R^{2×2}.    (4.61)

If we consider an arbitrary, but non-zero, vector

x := (x1, x2)ᵀ ∈ R²,    (4.62)

we find that

xᵀAx = (x1, x2) [ 2 1
                  1 2 ] (x1, x2)ᵀ
     = (2x1 + x2, x1 + 2x2)(x1, x2)ᵀ    (4.63)
     = (2x1 + x2)x1 + (x1 + 2x2)x2.

From the right-hand side of eq. (4.63), we then have

(2x1 + x2)x1 + (x1 + 2x2)x2 = 2x1² + x2x1 + x1x2 + 2x2²
                            = 2x1² + 2x1x2 + 2x2²
                            = x1² + 2x1x2 + x2² + x1² + x2²    (4.64)
                            = (x1 + x2)² + x1² + x2².

For an arbitrary non-zero x = (x1, x2)ᵀ, we have thus obtained a sum of squares of the components of x, which is always larger than zero. We have thus shown that A in (4.61) is positive-definite.

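The quadratic-form argument can also be probed numerically: evaluating xᵀAx for many randomly drawn non-zero x should always yield a positive value. The sketch below (added here, assuming numpy; it is merely a sanity check consistent with (4.60) and (4.64), not a proof) does exactly this.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

for _ in range(1_000):
    x = rng.standard_normal(2)
    if np.allclose(x, 0.0):
        continue                      # the definition (4.60) excludes x = 0
    assert x @ A @ x > 0.0            # quadratic form x^T A x
```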
4.5 Bibliographic remarks

In this Section, we reviewed some fundamental aspects of matrices and matrix computations as required for the remainder of this introduction to the GLM. There exists a large variety of introductory texts on linear algebra. A good starting point is the classical textbook by Strang (2009). A more advanced treatment is offered for example by Horn and Johnson (2012). Comprehensive introductions to the solution theory of systems of linear equations and the algorithmic evaluation of matrix inverses can be found for example in Press (2007). Searle (1982) provides an excellent comprehensive overview of matrix algebra as relevant to statistics.

4.6 Study Questions

1. Write down the definition of matrix addition, subtraction, and scalar multiplication.
2. Write down the definition of the matrix product.
3. If X is an n × p matrix and β is a p × 1 vector, what is the size of Xβ?
4. Let

A := [ 1 2      and     B := [ 1 1
       3 4 ]                   0 2 ].    (4.65)

Evaluate the following matrices:

C := A + Bᵀ,  D := A − B,  E := AB,  and  F := BA.    (4.66)

5. Let 2 4 3 A := 1 3 and b = . (4.67)   2 0 2 Evaluate x = Ab, B = bbT AT , and C = bT AT A. (4.68)

The General Linear Model 20/21 | © 2020 Dirk Ostwald CC BY-NC-SA 4.0 Study Questions 54

6. Let X ∈ R^{10×3} and y ∈ R^{10}. What are the sizes of XᵀX, (XᵀX)⁻¹, Xᵀy, and of (XᵀX)⁻¹Xᵀy?
7. Explain the concept of a matrix inverse.
8. Evaluate the inverse A⁻¹ for

A := [ 1 0 0
       0 2 0
       0 0 4 ].    (4.69)

9. Evaluate the determinants of the matrices

M := [ 1 2      and     N := [ 2 1
       0 1 ]                   2 1 ].    (4.70)

10. Write down the definition of a positive-definite matrix.

5 | Probability spaces and random variables

In this Chapter, we review some essentials of probability theory as required for the theory of the GLM. We focus on the particularities and inner logic of the probability theory model rather than its practical application and primarily aim to establish important concepts and notation that will be used in subsequent sections. In Section 5.1, we first introduce the basic notion of a probability space as a model for experiments that involve some degree of . We then discuss some elementary aspects of probability in Section 5.2 which mainly serve to ground the subsequently discussed theory of random variables and random vectors. The fundamental mathematical construct to model univariate data endowed with uncertainty is the concept of a random variable. We focus on different ways of specifying probability distributions of random variables, notably probability mass and density functions for discrete and continuous random variables, respectively, in Section 5.3. The concise mathematical representation of more than one data point requires the concept of a random vector. In Section 5.4, we first discuss the extension of random variable concepts to the multivariate case of random vectors and then focus on three concepts that arise only in the multivariate scenario and are of immense importance for statistical data analysis: marginal distributions, conditional distributions, and independent random variables.

5.1 Probability spaces

Probability spaces Probability spaces are very general and abstract models of random experiments. We use the following definition.

Definition 5.1.1 (Probability space). A probability space is a triple (Ω, A, P), where

• Ω is a set of elementary outcomes ω,
• A is a σ-algebra, i.e., A is a set with the following properties:
  ◦ Ω ∈ A,
  ◦ A is closed under the formation of complements, i.e., if A ∈ A, then also Aᶜ := Ω \ A ∈ A,
  ◦ A is closed under countable unions, i.e., if A1, A2, A3, ... ∈ A, then ∪_{i=1}^{∞} A_i ∈ A.
• P is a probability measure, i.e., P is a mapping P : A → [0, 1] with the following properties:
  ◦ P is normalized, i.e., P(∅) = 0 and P(Ω) = 1, and
  ◦ P is σ-additive, i.e., if A1, A2, ... is a pairwise disjoint sequence in A (i.e., A_i ∈ A for i = 1, 2, ... and A_i ∩ A_j = ∅ for i ≠ j), then P(∪_{i=1}^{∞} A_i) = Σ_{i=1}^{∞} P(A_i).

•

Example A basic example is a probability space that models the throw of a die. In this case the elementary outcomes ω ∈ Ω model the six faces of the die, i.e., one may define Ω := {1, 2, 3, 4, 5, 6}. If the die is thrown, it will roll, and once it comes to rest, its upper surface will show one of the elementary outcomes. The typical σ-algebra used in the case of discrete and finite outcome sets (such as the current Ω) is the power set P(Ω) of Ω. It is a basic exercise in probability theory to show that the power set indeed fulfils the properties of a σ-algebra as defined above. Because P(Ω) contains all subsets of Ω, it also contains the elementary outcome sets {1}, {2}, ..., {6}, which thus get allocated a probability P({ω}) ∈ [0, 1], ω ∈ Ω by the probability measure P. Probabilities of sets containing a single elementary outcome are also often written simply as P(ω) (:= P({ω})). The typical value ascribed to P(ω), ω ∈ Ω, if used to model a fair die, is P(ω) = 1/6.

The σ-algebra P(Ω) contains many more sets than the sets of elementary outcomes. The purpose of these additional elements is to model all sorts of events to which an observer of the random experiment may want to ascribe probabilities. For example, the observer may ask “What is the probability that the upper surface shows a number larger than three?”. This event corresponds to the set {4, 5, 6}, which, because the σ-algebra P(Ω) contains all possible subsets of Ω, is contained in P(Ω). Likewise, the observer may ask “What is the probability that the upper surface shows an even number?”, which corresponds to the subset {2, 4, 6} of Ω. The probability measure P is defined in such a manner that the answers to the following questions are predetermined: “What is the probability that the upper surface shows nothing?” and “What is the probability that the upper surface shows any number in Ω?”. The element of P(Ω) that corresponds to the first question is the empty set, and by definition of P, P(∅) = 0. This models the idea that one of the elementary outcomes, i.e., one surface with pips, will show up on every instance of the random experiment. If this is not the case, for example because the pips have worn off at one of the surfaces, the probability space model as sketched thus far is not a good model of the die experiment. The element of P(Ω) that corresponds to the second question is Ω itself. Here, the definition of the probability measure assigns P(Ω) = 1, i.e., the probability that something unspecific will happen is one. Again, if the die falls off the table and cannot be recovered, the probability space model and the experiment are not in good alignment.

Finally, the definition of the probability space as provided above allows one to evaluate probabilities for certain events based on the probabilities of other events by means of the σ-additivity of P. Assume for example that the probability space models the throw of a fair die, such that P({ω}) = 1/6 by definition. Based on this assumption, the σ-additivity property allows one to evaluate the probabilities of many other events. Consider for example an observer who is interested in the probability of the event that the surface of the die shows a number smaller than or equal to three.
Because the elementary events {1}, {2}, {3} are pairwise disjoint, and because the event of interest can be written as the countable union {1, 2, 3} = {1} ∪ {2} ∪ {3} of these events, one may evaluate the probability of the event of interest by

P(∪_{i=1}^{3} {i}) = Σ_{i=1}^{3} P({i}) = 1/6 + 1/6 + 1/6 = 1/2.

The die example is concerned with the case that a probability space is used to model a random experiment with a finite number of elementary outcomes. In the modelling of scientific experiments, the elementary outcomes are often modelled by the set of real numbers or real-valued vectors. Much of the theoretical development of modern probability theory in the early twentieth century was concerned with the question of how ideas from basic probability with finite elementary outcome spaces can be generalized to the continuous outcome space case of real numbers and vectors. In fact, it is perhaps the most important contribution of the probability space model as defined above and originally developed by Kolmogorov (1956) to be applicable in both the discrete-finite and the continuous-infinite elementary outcome set scenarios. The study of probability spaces for Ω := R or Ω := R^n, n > 1 is a central topic in probability theory which we by and large omit here. We do however note that the σ-algebras employed when Ω := R^n, n ≥ 1 are the so-called Borel σ-algebras, commonly denoted by B for n = 1 and B^n for n > 1. The mathematical construction of these σ-algebras is beyond our scope, but for the theory of the GLM, it is not unhelpful to think of Borel σ-algebras as power sets of R or R^n, n > 1. This is factually wrong, as it can be shown that there are in fact more subsets of R or R^n, n > 1 than there are elements in the corresponding Borel σ-algebras. Nevertheless, many events of interest, such as the probability for the elementary outcome of a random experiment with outcome space R to fall into a real interval [a, b], are in B.
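The probability 1/2 obtained by σ-additivity can also be approximated by simulating a large number of throws of a fair die (a sketch added here, assuming numpy; the long-run relative frequency is only an approximation of the model probability).

```python
import numpy as np

rng = np.random.default_rng(1)
n_throws = 1_000_000

# Simulate a fair die: each elementary outcome in {1, ..., 6} has probability 1/6
throws = rng.integers(1, 7, size=n_throws)

# Relative frequency of the event {1, 2, 3}
freq = np.mean(throws <= 3)
assert abs(freq - 0.5) < 0.01         # close to P({1, 2, 3}) = 1/2
```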

5.2 Elementary probabilities

We next discuss a few elementary aspects of probabilities defined on probability spaces. Throughout, let (Ω, A, P) denote a probability space, such that P : A → [0, 1] is a probability measure.

Interpretation

We first note that the probability P(A) of an event A is associated with at least two interpretations. From a Frequentist perspective, the probability of an event corresponds to the idealized long run frequency of observing the event A. From a Bayesian perspective, the probability of an event corresponds to the degree of belief that the event is true. Notably, both interpretations are subjective in the sense that the Frequentist perspective envisions an idealized long run frequency which can never be realized in practice, while the Bayesian belief interpretation is explicitly subjective and specific to a given observer. However, irrespective of the specific interpretation of the probability of an event, the logical rules for probabilistic inference, also known as probability calculus, are identical under both interpretations.


Basic properties We next note the following basic properties of probabilities, which follow directly from the probability space definition.

Theorem 5.2.1 (Properties of probabilities). Let (Ω, A, P) denote a probability space. Then the following properties holds.

(1) If A ⊂ B, then P(A) ≤ P(B).
(2) P(Aᶜ) = 1 − P(A).
(3) If A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B).
(4) P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

As an example, we prove property (4) of Theorem 5.2.1 below.

Proof. With the fact that any union of two sets A, B ⊂ Ω can be written as the union of the disjoint sets A ∩ Bᶜ, A ∩ B, and Aᶜ ∩ B (cf. Section 2 | Sets, sums, and functions) and with the additivity of P for disjoint events, we have:

P(A ∪ B) = P(A ∩ Bᶜ) + P(A ∩ B) + P(Aᶜ ∩ B)
         = P(A ∩ Bᶜ) + P(A ∩ B) + P(Aᶜ ∩ B) + P(A ∩ B) − P(A ∩ B)
         = P((A ∩ Bᶜ) ∪ (A ∩ B)) + P((Aᶜ ∩ B) ∪ (A ∩ B)) − P(A ∩ B)    (5.1)
         = P(A) + P(B) − P(A ∩ B).

Independence An important feature of many probabilistic models is the independence of events. Intuitively, independence models the absence of deterministic and stochastic influences between events. Notably, independence can either be assumed, and thus built into a probabilistic model by design, or independence can follow from the design of the model. Regardless of the origin of the independence of events, we use the following definitions.

Definition 5.2.1 (Independent events). Let (Ω, A, P) denote a probability space. Two events A ∈ A and B ∈ A are independent, if P(A ∩ B) = P(A)P(B). (5.2)

A set of events {A_i | i ∈ I} ⊂ A with index set I is independent, if for every finite subset J ⊂ I

P(∩_{j∈J} A_j) = Π_{j∈J} P(A_j).    (5.3)

•

Notably, disjoint events with positive probability, such as observing an even or odd number of pips in the die experiment, are not independent: if P(A) > 0 and P(B) > 0, then P(A)P(B) > 0, but P(A ∩ B) = P(∅) = 0, and thus P(A ∩ B) ≠ P(A)P(B).

Conditional probability The basis for many forms of probabilistic inference is the conditional probability of an event given that another event occurs. We use the following definition.

Definition 5.2.2 (Conditional probability). Let (Ω, A, P) denote a probability space and let A, B ∈ A with P(B) > 0. Then the conditional probability of A given B is defined as

P(A|B) = P(A ∩ B) / P(B).    (5.4)

•


Without proof, we note that for any fixed B ∈ A, P(·|B) is a probability measure, i.e., P(·|B) ≥ 0, P(Ω|B) = 1, and for disjoint A1, A2, ... ∈ A, P(∪_{i=1}^{∞} A_i | B) = Σ_{i=1}^{∞} P(A_i|B). Note that the rules of probability apply to the events on the left of the vertical bar. Intuitively, P(A|B) is the fraction of times the event A occurs among those times in which the event B occurs. This fraction is already determined up to proportionality by P(A ∩ B), the idealized relative frequency or the belief that the events A and B occur together. Division of P(A ∩ B) by P(B) yields a normalized measure. Furthermore, in most probabilistic models P(A|B) ≠ P(B|A). For example, the probability of exhibiting respiratory symptoms after contracting corona virus does not necessarily equal the probability of contracting corona virus when exhibiting respiratory symptoms. Finally, a mathematical extension of conditional probability to the case of P(B) = 0 is possible, but technically beyond our scope. Rearranging the definition of conditional probability allows for expressing the probability of two events to occur jointly by the product of the conditional probability of one event given the other and the probability of the conditioning event. This fact is routinely used in the construction of probabilistic models. Formally, we have the following theorem, which follows directly from the definition of conditional probability.

Theorem 5.2.2 (Joint and conditional probabilities). Let (Ω, A, P) denote a probability space and let A, B ∈ A with P(A), P(B) > 0. Then

P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A).    (5.5)

For independent events, knowledge of the occurrence of one of the events does not affect the probability of the other event occurring:

Theorem 5.2.3 (Conditional probability for independent events). Let (Ω, A, P) denote a probability space and let A, B ∈ A with P(A), P(B) > 0 denote two independent events. Then

P(A|B) = P(A) and P(B|A) = P(B). (5.6)

□

Proof. With the definitions of conditional probability and independent events, we have

P(A|B) = P(A ∩ B) / P(B) = P(A)P(B) / P(B) = P(A),    (5.7)

and analogously for P(B|A).
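Theorem 5.2.3 can be illustrated with the die model (a simulation sketch added here, assuming numpy): for A = {2, 4, 6} and B = {1, 2, 3, 4} one has P(A) = 1/2, P(B) = 2/3, and P(A ∩ B) = P({2, 4}) = 1/3 = P(A)P(B), so A and B are independent and the conditional relative frequency of A given B should approximate P(A).

```python
import numpy as np

rng = np.random.default_rng(2)
throws = rng.integers(1, 7, size=1_000_000)   # fair die

A = (throws % 2 == 0)                         # event A = {2, 4, 6}
B = (throws <= 4)                             # event B = {1, 2, 3, 4}

p_A = A.mean()
p_A_given_B = A[B].mean()                     # relative frequency of A among throws in B

assert abs(p_A - 0.5) < 0.01
assert abs(p_A_given_B - p_A) < 0.01          # P(A|B) = P(A) for independent events
```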

5.3 Random variables and distributions

The fundamental construct for the mathematical representation of numerical data endowed with uncertainty are random variables. From a mathematical perspective, random variables are neither random nor variables. Instead, random variables are functions that map elements of a probability outcome space Ω into another outcome space Γ. Γ is either a countable set X , in which case the functions are referred to as discrete random variables, or Γ is the real line R, in which case the functions are referred to as continuous random variables. If Γ is a multidimensional space, the respective functions are referred to as random vectors. In the current section, we are concerned with some fundamental aspects of random variables. In Section 5.4, we consider their multivariate generalization as random vectors.

Measurable functions and random variables First, not all functions that map elements of a probability outcome space Ω onto elements of another outcome space Γ are random variables. A fundamental feature of random variables is that they are measurable. In the mathematical literature, the terms measurable function and random variable are hence used interchangeably. To make the concept of a measurable function precise, let (Ω, A, P) denote a probability space, and let ξ :Ω → Γ, ω 7→ ξ(ω) (5.8) denote a function. Assume further that there exists a σ-algebra S on Γ. The tuple of a set Γ and a σ-algebra S is referred to as measurable space (for every probability space (Ω, A, P), (Ω, A) thus forms

a measurable space). Finally, for every set S ∈ S let ξ⁻¹(S) denote the preimage of S under ξ. The preimage of S ∈ S under ξ is the set of all ω ∈ Ω that are mapped onto elements of S by ξ, i.e.,

ξ−1(S) := {ω ∈ Ω|ξ(ω) ∈ S}. (5.9)

Now, if the preimages of all S ∈ S are elements of the σ-algebra A on Ω, then ξ is called a measurable function. Formally, we have the following definition.

Definition 5.3.1 (Measurable function). Let (Ω, A, P) be a probability space, let (Γ, S) denote a measur- able space, and let ξ :Ω → Γ, ω 7→ ξ(ω) (5.10) be a function. If ξ−1(S) ∈ A for all S ∈ S, (5.11) then ξ is called a measurable function. • A measurable function ξ :Ω → Γ is called a random variable:

Definition 5.3.2 (Random variable). Let (Ω, A, P) denote a probability space and let ξ :Ω → Γ denote a function. If ξ is a measurable function, then ξ is called a random variable. •

Probability distributions The condition of measurability of the function ξ has a fundamental consequence for the sets in S: because the probability measure P allocates a probability P(A) to all sets in A, and because, by definition of the measurability of ξ, all preimages ξ⁻¹(S) of all sets S ∈ S are sets in A, the construction of a random variable allows for allocating a probability to all sets S ∈ S, namely the probability of the preimage ξ⁻¹(S) ∈ A under P. This entails the induction of a probability measure on the measurable space (Γ, S). This induced probability measure is called the probability distribution of the random variable ξ and is denoted by Pξ. We use the following definition.

Definition 5.3.3 (Probability distribution). Let (Ω, A, P) denote a probability space, let (Γ, S) denote a measurable space, and let

ξ : Ω → Γ, ω ↦ ξ(ω)    (5.12)

denote a random variable. Then the probability measure Pξ defined by

Pξ : S → [0, 1], S ↦ Pξ(S) := P(ξ⁻¹(S))    (5.13)

is called the probability distribution of the random variable ξ.

•

Intuitively, the notion of randomness in the values ξ(ω) of ξ is captured by this construction as follows: in a first step, an element ω ∈ Ω is selected according to the probability P({ω}) that is allocated to ω by the probability measure P on (Ω, A). In a second step, this ω is mapped onto an element ξ(ω) in Γ, which is also referred to as a realization of the random variable ξ. Across realizations, the values of ξ exhibit a probability distribution that depends both on the properties of P and ξ and is denoted by Pξ. Figure 5.1 visualizes the situation.

Clearly, if Γ = Ω, S = A and ξ := id, then P and Pξ are identical. Importantly, the union of the measurable space (Γ, S) and the probability measure Pξ forms the probability space (Γ, S, Pξ). In most probabilistic models, it is the latter probability space that takes center stage. Most commonly, the random variable outcome set is given by the real line Γ := R and the σ-algebra corresponds to the Borel σ-algebra S := B. Moreover, the probability measure Pξ is usually directly defined by means of a probability density function (see below). Notably, given the probability space (R, B, Pξ), an underlying probability space (Ω, A, P) can always be constructed post-hoc by setting ξ := id.



Figure 5.1. Random variables and probability distributions. For a detailed discussion, please refer to the main text.

Notation In the following, we discuss a number of notational conventions with regards to probability distributions. We first note that random variables of the form ξ :Ω → Γ are often written as

ξ : (Ω, A) → (Γ, S) or ξ : (Ω, A, P) → (Γ, S).    (5.14)

Neither notation is inherently meaningful, as the random variable ξ only maps elements of Ω onto elements of Γ. Presumably, the notations of (5.14) evolved to stress the fact that the concept of a random variable entails the theoretical overhead of probability distributions that relate to S, A and P as described above. Second, the following notational conventions for events in A are commonly employed:

{ξ ∈ S} := {ω ∈ Ω | ξ(ω) ∈ S}
{ξ = x} := {ω ∈ Ω | ξ(ω) = x}
{ξ < x} := {ω ∈ Ω | ξ(ω) < x}    (5.15)
{ξ ≤ x} := {ω ∈ Ω | ξ(ω) ≤ x}
{ξ > x} := {ω ∈ Ω | ξ(ω) > x}
{ξ ≥ x} := {ω ∈ Ω | ξ(ω) ≥ x}

for S ∈ S and x ∈ Γ and

{x1 < ξ < x2} := {ω ∈ Ω | x1 < ξ(ω) < x2}
{x1 ≤ ξ < x2} := {ω ∈ Ω | x1 ≤ ξ(ω) < x2}    (5.16)
{x1 < ξ ≤ x2} := {ω ∈ Ω | x1 < ξ(ω) ≤ x2}
{x1 ≤ ξ ≤ x2} := {ω ∈ Ω | x1 ≤ ξ(ω) ≤ x2}

for x1, x2 ∈ Γ, x1 ≤ x2 and similarly for larger than relationships. These conventions entail the following conventions for expressing the probabilistic behaviour of random variables, here demonstrated for a selection of the events listed above:

Pξ(ξ ∈ S) := P({ξ ∈ S}) = P({ω ∈ Ω | ξ(ω) ∈ S})    (5.17)
Pξ(ξ = x) := P({ξ = x}) = P({ω ∈ Ω | ξ(ω) = x})    (5.18)
Pξ(ξ ≤ x) := P({ξ ≤ x}) = P({ω ∈ Ω | ξ(ω) ≤ x})    (5.19)
Pξ(x1 ≤ ξ ≤ x2) := P({x1 ≤ ξ ≤ x2}) = P({ω ∈ Ω | x1 ≤ ξ(ω) ≤ x2}).    (5.20)

Because of the redundancy in the reference to ξ in symbols of the form Pξ(ξ ≥ s), the subscript is often omitted, i.e., the expression is written as P(ξ ≥ s). Note that this notation entails the danger of confusing

the underlying probability measure P of the probability space (Ω, A, P) with the induced probability measure Pξ on (Γ, S). However, as remarked above, (Ω, A, P) plays no explicit role in most applied cases and hence this danger is usually negligible. We next consider the direct specification of probability distributions by means of cumulative distribution functions, probability mass functions, and probability density functions.

Cumulative distribution functions

One way to specify the probability distribution Pξ of a random variable is to define its cumulative distribution function. We denote the cumulative distribution function of a random variable ξ by Pξ and use the following definition. Definition 5.3.4 (Cumulative distribution function). Let ξ be a real-valued random variable. Then a cumulative distribution function of ξ is a function defined as

Pξ :Γ → [0, 1], x 7→ Pξ(x) := Pξ(ξ ≤ x). (5.21) •

Intuitively, Pξ(x) represents the probability that the random variable ξ takes on a value smaller than or equal to x. It thus follows that 1 − Pξ(x) represents the probability that the random variable ξ takes on a value larger than x. Importantly, by specifying the functional form of a cumulative distribution function Pξ, the probability of all events {ξ ≤ x} for x ∈ Γ is defined. An alternative and much more common approach to define the probability distributions of random variables is by means of probability mass and probability density functions.

Probability mass functions Probability mass functions are used to define the distributions of discrete random variables ξ : Ω → X with discrete and finite (or at least countable) outcome set X. We use the following definitions.

Definition 5.3.5 (Discrete random variable, probability mass function). Let (Ω, A, P) denote a probability space. A random variable ξ :Ω → X is called discrete, if its outcome space X contains only a finite number of or countably many elements xi, i = 1, 2, .... The probability mass function (PMF) of a discrete random variable ξ is denoted by pξ and is defined as

pξ : X → [0, 1], xi 7→ pξ(xi) := Pξ(ξ = xi). (5.22)

•

Note that by definition, PMFs are non-negative and normalized, i.e.,

pξ(xi) ≥ 0 for all xi ∈ X and Σ_{xi ∈ X} pξ(xi) = 1,    (5.23)

respectively. Both properties follow directly from the definition of a probability distribution as a probability measure.

The cumulative distribution function of a discrete random variable ξ with PMF pξ evaluates to

Pξ : X → [0, 1], x ↦ Pξ(x) := Pξ(ξ ≤ x) = Σ_{xi ≤ x} pξ(xi)    (5.24)

and is also referred to as cumulative mass function (CMF).

Probability density functions

Probability density functions are used to define the distributions of continuous random variables ξ :Ω → R. We use the following definitions.

Definition 5.3.6 (Continuous random variable, probability density function). Let (Ω, A, P) denote a probability space. A random variable ξ :Ω → R is called a continuous random variable or a real-valued random variable. The probability density function (PDF) of a continuous random variable is defined as a function pξ : R → R≥0, x 7→ pξ(x) (5.25) with the properties


(1) ∫_{−∞}^{∞} pξ(x) dx = 1, and
(2) Pξ(x1 ≤ ξ ≤ x2) = ∫_{x1}^{x2} pξ(x) dx for all x1, x2 ∈ R with x1 ≤ x2.

•

Property (2) of Definition 5.3.6 is central to the understanding of PDFs: the probability of a continuous random variable ξ to take on values in an interval [x1, x2] ⊂ R is obtained by integrating its associated PDF on the interval [x1, x2]. Notably, the probability for a continuous random variable ξ to take on any specific value x ∈ R is zero, because by property (2) in Definition 5.3.6, we have

Pξ(ξ = x) = Pξ(x ≤ ξ ≤ x) = ∫_x^x pξ(s) ds = 0.    (5.26)

Also note that the motivation of the term probability density relates closely to the physical relations between mass, density, and volume,

Mass = Density × Volume. (5.27)

Physical density is a measure of the physical mass of a material per unit volume. To obtain the physical mass of an object of a given material with arbitrary volume, the physical density of the material has to be multiplied with the volume of the object. In analogy and with the intuition of definite integrals (cf. Section 3 | Calculus), to obtain the probability mass that is associated with a given interval of the real numbers, the size of the interval has to be multiplied with the associated values of the probability density. The cumulative distribution function of a continuous real-valued random variable ξ with PDF pξ evaluates to

Pξ : R → [0, 1], x ↦ Pξ(x) = ∫_{−∞}^{x} pξ(s) ds    (5.28)

and is also referred to as cumulative density function (CDF). With the intuition of indefinite integrals (cf. Section 3 | Calculus), we thus see that PDFs can be regarded as derivatives of CDFs, or vice versa, CDFs can be regarded as anti-derivatives of PDFs, in symbols

pξ(x) = d/dx Pξ(x).    (5.29)

Finally, with the properties of basic integrals, we have the following possibility to evaluate the probability that a continuous random variable takes on values in an interval [x1, x2] by means of its CDF (and likewise for semi-open and open intervals):

Pξ(x1 ≤ ξ ≤ x2) = Pξ(x2) − Pξ(x1). (5.30)
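The relations (5.28)–(5.30) can be illustrated with a concrete continuous distribution (a sketch added here, assuming numpy and scipy; it uses the standard Gaussian from scipy.stats, a distribution that is only introduced later in the text): the probability of an interval equals both the integral of the PDF over the interval and the difference of the CDF values.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

x1, x2 = -1.0, 2.0

# P(x1 <= xi <= x2) via the integral of the PDF, cf. property (2) of Definition 5.3.6
p_int, _ = quad(norm.pdf, x1, x2)

# The same probability via the CDF difference, cf. (5.30)
p_cdf = norm.cdf(x2) - norm.cdf(x1)

assert np.isclose(p_int, p_cdf)

# P(xi = x) = 0 for any single point x, cf. (5.26)
assert np.isclose(quad(norm.pdf, 1.0, 1.0)[0], 0.0)
```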

5.4 Random vectors and multivariate probability distributions

Random vectors Random vectors are the multivariate extension of random variables. We use the following definition.

Definition 5.4.1 (Random vector). Let (Ω, A, P) denote a probability space and let (Γn, Sn) denote the n-dimensional measurable space. Then a function

ξ :Ω → Γn, ω 7→ ξ(ω) (5.31) is called an n-dimensional random vector, if it is a measurable function, i.e., if

ξ−1(S) ∈ A for all S ∈ Sn. (5.32)

Without proof, we note that a multivariate function ξ = (ξ1, ..., ξn)ᵀ is a measurable function, if its component functions ξ1, ..., ξn are measurable functions. This implies that the component functions of a random vector are random variables. n-dimensional random vectors may thus be conceived as the concatenation of n random variables, while random variables are one-dimensional random vectors.


Multivariate probability distributions Multivariate probability distributions are the probability distributions of random vectors. In complete analogy to the random variable scenario, we use the following definition.

Definition 5.4.2 (Multivariate probability distribution). Let (Ω, A, P) denote a probability space, let (Γn, Sn) denote the n-dimensional measurable space, and let

ξ : Ω → Γⁿ, ω ↦ ξ(ω)    (5.33)

denote a random vector. Then the probability measure Pξ defined by

Pξ : Sⁿ → [0, 1], S ↦ Pξ(S) := P(ξ⁻¹(S)) = P({ω ∈ Ω | ξ(ω) ∈ S})    (5.34)

is called the multivariate probability distribution of the random vector ξ.

•

For simplicity, the multivariate nature of the probability distribution of a random vector is often left implicit, such that one simply speaks of the probability distribution of a random vector.

Notation The notational conventions for events discussed in Section 5.3 extend to the multivariate case. For example, for S ∈ Sn and x ∈ Γn, we have

Pξ(ξ ∈ S) := P({ξ ∈ S}) = P({ω ∈ Ω | ξ(ω) ∈ S})
Pξ(ξ = x) := P({ξ = x}) = P({ω ∈ Ω | ξ(ω) = x})    (5.35)
Pξ(ξ ≤ x) := P({ξ ≤ x}) = P({ω ∈ Ω | ξ(ω) ≤ x})
Pξ(x1 ≤ ξ ≤ x2) := P({x1 ≤ ξ ≤ x2}) = P({ω ∈ Ω | x1 ≤ ξ(ω) ≤ x2}).

Note that relational operators such as ≤ are understood to hold component-wise for multivariate entities, e.g., x ≤ y for x, y ∈ Γⁿ is understood as xi ≤ yi for all i = 1, ..., n.

Multivariate cumulative distribution functions

One way to specify the probability distribution Pξ of a random vector is to define its multivariate cumulative distribution function. In analogy to the random variable scenario, we use the following definition.

Definition 5.4.3 (Multivariate cumulative distribution function). Let ξ be a random vector. Then a multivariate cumulative distribution function of ξ is a function

Pξ : Γ^n → [0, 1], x ↦ Pξ(x) := Pξ(ξ ≤ x). (5.36)

•

More commonly employed alternatives for specifying the probability distributions of random vectors are multivariate probability mass and density functions. The intuitions for probability mass and density functions established for random variables extend to random vectors.

Multivariate probability mass functions Multivariate probability mass functions are used to define the distributions of discrete random vectors. We use the following definitions.

Definition 5.4.4 (Discrete random vector, multivariate probability mass function). Let (Ω, A, P) denote a probability space. A random vector ξ : Ω → X is called discrete, if its outcome space X contains only a finite or countably infinite number of elements xi, i = 1, 2, .... The multivariate probability mass function of a discrete random vector ξ is denoted by pξ and is defined as

pξ : X → [0, 1], xi ↦ pξ(xi) := Pξ(ξ = xi). (5.37)

•

Like their univariate counterparts, multivariate PMFs are non-negative and normalized.


Example. To exemplify the concept of a multivariate PMF, we consider a discrete two-dimensional random vector ξ = (ξ1, ξ2) taking values in X = X1 × X2 with X1 := {1, 2, 3} and X2 := {1, 2, 3, 4}. An exemplary two-dimensional PMF of the form

pξ : {1, 2, 3} × {1, 2, 3, 4} → [0, 1], (x1, x2) ↦ pξ(x1, x2) (5.38)

is specified in Table 5.1. Note that Σ_{x1=1}^{3} Σ_{x2=1}^{4} pξ(x1, x2) = 1.

pξ(x1, x2)   x2 = 1   x2 = 2   x2 = 3   x2 = 4
x1 = 1       0.1      0.0      0.2      0.1
x1 = 2       0.1      0.2      0.0      0.0
x1 = 3       0.0      0.1      0.1      0.1

Table 5.1. An exemplary bivariate PMF.

Multivariate probability density functions Multivariate probability density functions are used to define the distributions of continuous random vectors. We use the following definitions.

Definition 5.4.5 (Continuous random vector, multivariate probability density function). Let (Ω, A, P) denote a probability space. A random vector ξ :Ω → Rn is called a continuous random vector. The multivariate probability density function of a continuous random vector is defined as a function

pξ : R^n → R≥0, x ↦ pξ(x), (5.39)

such that

(1) ∫_{R^n} pξ(x) dx = 1, and
(2) Pξ(x1 ≤ ξ ≤ x2) = ∫_{x11}^{x21} ··· ∫_{x1n}^{x2n} pξ(s1, ..., sn) ds1 ··· dsn for all x1, x2 ∈ R^n with x1 ≤ x2.

•

Like in the random variable scenario, we have

Pξ(ξ = x) = Pξ(x ≤ ξ ≤ x) = ∫_{x1}^{x1} ··· ∫_{xn}^{xn} pξ(s1, ..., sn) ds1 ··· dsn = 0. (5.40)

As for the probability distributions of random vectors, we often omit the qualifying adjective multivariate when discussing the PMFs and PDFs of random vectors.

Marginal distributions Marginal distributions are the probability distributions of the components of random vectors. In the following, we first define marginal distributions and discuss how univariate marginal distributions can be evaluated based on multivariate PMFs and PDFs. We then discuss an example for the marginal distributions of a two-dimensional discrete random vector. Examples for marginal distributions of multivariate continuous random vectors are discussed in the context of Gaussian distributions in Section 7 | Probability distributions.

Definition 5.4.6 (Marginal random variables and vectors, marginal probability distributions). Let (Ω, A, P) denote a probability space, let ξ : Ω → Γ^n denote a random vector, let Pξ denote the probability distribution of ξ, and let Γ^(i) denote the outcome space of the ith component of ξ, such that Γ^n = ×_{i=1}^{n} Γ^(i). Then the probability distribution defined by

Pξi : S → [0, 1], S ↦ Pξi(S) := Pξ(Γ^(1) × ··· × Γ^(i−1) × S × Γ^(i+1) × ··· × Γ^(n)) for S ⊆ Γ^(i) (5.41)

is called the ith univariate marginal distribution of ξ. •


Without proof, we note that marginal distributions can be evaluated from multivariate PMFs and PDFs by means of summation and integration, respectively.

Theorem 5.4.1 (Marginal probability mass functions, marginal probability density functions). Let ξ denote a discrete random vector with probability mass function pξ. Then the probability mass function of the ith component ξi of ξ evaluates to

pξi : R → [0, 1], xi ↦ pξi(xi) := Σ_{x1} ··· Σ_{xi−1} Σ_{xi+1} ··· Σ_{xn} pξ(x). (5.42)

Similarly, let ξ denote a continuous random vector with probability density function pξ. Then the probability density function of the ith component ξi of ξ evaluates to

pξi : R → R≥0, xi ↦ pξi(xi) := ∫_{x1} ··· ∫_{xi−1} ∫_{xi+1} ··· ∫_{xn} pξ(x) dx1 ··· dxi−1 dxi+1 ··· dxn. (5.43)



Example To exemplify the concept of a marginal PMF, we reconsider the discrete two-dimensional random vector ξ = (ξ1, ξ2) taking values in X = X1 × X2 with X1 := {1, 2, 3} and X2 := {1, 2, 3, 4} and

PMF specified in Table 5.1. Based on Theorem 5.4.1, the marginal PMFs pξ1 and pξ2 of ξ evaluate as specified in Table 5.2 below. Note that Σ_{x1=1}^{3} pξ1(x1) = 1 and Σ_{x2=1}^{4} pξ2(x2) = 1.

pξ(x1, x2)   x2 = 1   x2 = 2   x2 = 3   x2 = 4   pξ1(x1)
x1 = 1       0.1      0.0      0.2      0.1      0.4
x1 = 2       0.1      0.2      0.0      0.0      0.3
x1 = 3       0.0      0.1      0.1      0.1      0.3
pξ2(x2)      0.2      0.3      0.3      0.2

Table 5.2. Exemplary marginal PMFs.
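As a brief computational aside, the marginal PMFs of Table 5.2 can be reproduced by summing the joint PMF of Table 5.1 over rows and columns. The following minimal Python sketch assumes that NumPy is available; all variable names are illustrative.

```python
# Minimal numerical check of the marginal PMFs in Table 5.2 (assumes NumPy).
import numpy as np

# Joint PMF of (xi_1, xi_2) from Table 5.1; rows index x1 = 1, 2, 3, columns x2 = 1, 2, 3, 4.
p_joint = np.array([[0.1, 0.0, 0.2, 0.1],
                    [0.1, 0.2, 0.0, 0.0],
                    [0.0, 0.1, 0.1, 0.1]])

p_x1 = p_joint.sum(axis=1)  # marginal PMF of xi_1: sum over x2
p_x2 = p_joint.sum(axis=0)  # marginal PMF of xi_2: sum over x1

print(p_x1)                     # [0.4 0.3 0.3]
print(p_x2)                     # [0.2 0.3 0.3 0.2]
print(p_x1.sum(), p_x2.sum())   # both 1.0 (up to floating point error)
```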

Conditional distributions

Recall that for a probability space (Ω, A, P) and two events A, B ∈ A with P(B) > 0, the conditional probability of event A given event B is defined as

P(A|B) = P(A ∩ B)/P(B). (5.44)

Analogously, for the distribution of two random variables ξ1 and ξ2, the conditional probability distribution of ξ1 given ξ2 is defined in terms of events A = {ξ1 ∈ X1} and B = {ξ2 ∈ X2}. To introduce conditional distributions, we first consider the case of two-dimensional (bivariate) discrete and continuous random vectors.

Definition 5.4.7 (Conditional PMF, discrete conditional distribution). Let ξ = (ξ1, ξ2)^T denote a discrete random vector with PMF pξ = pξ1,ξ2 and marginal PMFs pξ1 and pξ2. Then the conditional PMF of ξ1 given ξ2 = x2 is defined as

pξ1|ξ2 : R → [0, 1], x1 ↦ pξ1|ξ2(x1|x2) := pξ1,ξ2(x1, x2)/pξ2(x2) for pξ2(x2) > 0 (5.45)

and the conditional PMF of ξ2 given ξ1 = x1 is defined as

pξ2|ξ1 : R → [0, 1], x2 ↦ pξ2|ξ1(x2|x1) := pξ1,ξ2(x1, x2)/pξ1(x1) for pξ1(x1) > 0. (5.46)

The discrete distributions with PMFs pξ1|ξ2 (·|ξ2 = x2) and pξ2|ξ1(·|ξ1 = x1) are called the conditional distributions of ξ1 given ξ2 = x2 and ξ2 given ξ1 = x1, respectively. •


In complete analogy to the conditional probabilities of events, we have

pξ1|ξ2(x1|x2) = pξ1,ξ2(x1, x2)/pξ2(x2) = P({ξ1 = x1} ∩ {ξ2 = x2})/P(ξ2 = x2) (5.47)

and likewise for pξ2|ξ1. Like conditional probabilities, conditional PMFs behave like proper probability measures in their first argument.

Example. Consider the earlier example of the two-dimensional PMF pξ1,ξ2 and its marginal PMFs pξ1 and pξ2 documented in Table 5.1 and Table 5.2. For this example, the conditional PMFs of ξ2 given

ξ1 = 1, ξ1 = 2, and ξ1 = 3 are evaluated in Table 5.3 below. Note the qualitative similarity of pξ1,ξ2 (x1, x2) and pξ2|ξ1 (x2|x1).

pξ2|ξ1(x2|x1)        x2 = 1          x2 = 2          x2 = 3          x2 = 4
pξ2|ξ1(x2|x1 = 1)    0.1/0.4 = 1/4   0.0/0.4 = 0     0.2/0.4 = 1/2   0.1/0.4 = 1/4   Σ_{x2=1}^{4} pξ2|ξ1(x2|x1) = 1
pξ2|ξ1(x2|x1 = 2)    0.1/0.3 = 1/3   0.2/0.3 = 2/3   0.0/0.3 = 0     0.0/0.3 = 0     Σ_{x2=1}^{4} pξ2|ξ1(x2|x1) = 1
pξ2|ξ1(x2|x1 = 3)    0.0/0.3 = 0     0.1/0.3 = 1/3   0.1/0.3 = 1/3   0.1/0.3 = 1/3   Σ_{x2=1}^{4} pξ2|ξ1(x2|x1) = 1

Table 5.3. Exemplary conditional PMF.
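The conditional PMFs of Table 5.3 can likewise be obtained by dividing each row of the joint PMF by the corresponding marginal probability. A minimal Python sketch, again assuming NumPy, is given below; the variable names are illustrative.

```python
# Minimal sketch: conditional PMFs of xi_2 given xi_1 from the joint PMF (assumes NumPy).
import numpy as np

p_joint = np.array([[0.1, 0.0, 0.2, 0.1],
                    [0.1, 0.2, 0.0, 0.0],
                    [0.0, 0.1, 0.1, 0.1]])          # Table 5.1

p_x1 = p_joint.sum(axis=1, keepdims=True)           # marginal of xi_1 as a column vector
p_x2_given_x1 = p_joint / p_x1                       # row-wise p(x2|x1) = p(x1, x2)/p(x1)

print(p_x2_given_x1)                                 # rows reproduce Table 5.3
print(p_x2_given_x1.sum(axis=1))                     # each row sums to 1
```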

Similarly, we have the following definition for conditional distributions of continuous random variables.

Definition 5.4.8 (Conditional PDF, continuous conditional distribution). Let ξ = (ξ1, ξ2)^T denote a continuous random vector with PDF pξ = pξ1,ξ2 and marginal PDFs pξ1 and pξ2. Then the conditional PDF of ξ1 given ξ2 = x2 is defined as

pξ1|ξ2 : R → R≥0, x1 ↦ pξ1|ξ2(x1|x2) := pξ1,ξ2(x1, x2)/pξ2(x2) for pξ2(x2) > 0, (5.48)

and the conditional PDF of ξ2 given ξ1 = x1 is defined as

pξ2|ξ1 : R → R≥0, x2 ↦ pξ2|ξ1(x2|x1) := pξ1,ξ2(x1, x2)/pξ1(x1) for pξ1(x1) > 0. (5.49)

The continuous distributions with PDFs pξ1|ξ2(·|ξ2 = x2) and pξ2|ξ1(·|ξ1 = x1) are called the conditional distributions of ξ1 given ξ2 = x2 and of ξ2 given ξ1 = x1, respectively. •

Finally, the two-dimensional scenario discussed thus far can be generalized to the multivariate scenario in terms of the following definition, which covers both the discrete and continuous settings.

Definition 5.4.9 (Multivariate conditional PMF and PDF). Let ξ = (ξ1, ξ2) denote an n-dimensional random vector, where ξ1 and ξ2 denote k- and (n − k)-dimensional random vectors, respectively. Let pξ1,ξ2 denote the PMF or PDF of ξ and let pξ2 denote the (n − k)-dimensional marginal PMF or PDF of ξ2. Then, for ξ2 = x2, the conditional k-dimensional PMF or PDF of ξ1 given ξ2 is defined as

pξ1|ξ2 : R^k → R≥0, x1 ↦ pξ1|ξ2(x1|x2) := pξ1,ξ2(x1, x2)/pξ2(x2) for pξ2(x2) > 0. (5.50)

•

Independence

In analogy to the definition of independent events (cf. 5.2.1), two random variables ξ1 and ξ2 are called independent, if {ξ1 ∈ S1} and {ξ2 ∈ S2} are independent events for all S1 and S2. We use the following definition.

Definition 5.4.10 (Independent random variables). Two random variables ξ1 : Ω → Γ^(1) and ξ2 : Ω → Γ^(2) are independent, if for every S1 ⊆ Γ^(1) and S2 ⊆ Γ^(2) it holds that

P(ξ1 ∈ S1, ξ2 ∈ S2) = P(ξ1 ∈ S1)P(ξ2 ∈ S2). (5.51) •


As in the elementary probability scenario, independence of random variables implies that

P({ξ1 ∈ S1}|{ξ2 ∈ S2}) = P({ξ1 ∈ S1}) (5.52) or, intuitively, that knowledge of the fact that ξ2 ∈ S2 does not affect the probability of the event ξ1 ∈ S1. Without proof, we note the following theorem that transfers the definition of independent random variables to their respective PMF or PDF.

Theorem 5.4.2 (Independence and PMF/PDF factorization). Let ξ1 :Ω → X1 and ξ2 :Ω → X2 denote discrete random variables with PMF pξ1,ξ2 and marginal PMFs pξ1 and pξ2 , respectively. Then ξ1 and ξ2 are independent, if and only if

pξ1,ξ2 (x1, x2) = pξ1 (x1)pξ2 (x2) for all (x1, x2) ∈ X1 × X2. (5.53)

Similarly, let ξ1 and ξ2 denote continuous random variables with PDF pξ1,ξ2 and marginal PDFs pξ1 and pξ2 , respectively. Then ξ1 and ξ2 are independent, if and only if

pξ1,ξ2(x1, x2) = pξ1(x1)pξ2(x2) for all (x1, x2) ∈ R². (5.54)

Notably, the PMF or PDF property

pξ1,ξ2 (x1, x2) = pξ1 (x1)pξ2 (x2) (5.55) is referred to as factorization of the PMF or PDF. The independence of two random variables is thus equivalent to the factorization of their bivariate PMF or PDF.

Example Consider the earlier example of a bivariate PMF and its associated marginal PMFs (cf. Table 5.2). Because

pξ1,ξ2(1, 1) = 0.1 ≠ 0.08 = pξ1(1)pξ2(1), (5.56)

pξ1,ξ2(x1, x2)   x2 = 1   x2 = 2   x2 = 3   x2 = 4   pξ1(x1)
x1 = 1           0.08     0.12     0.12     0.08     0.40
x1 = 2           0.06     0.09     0.09     0.06     0.30
x1 = 3           0.06     0.09     0.09     0.06     0.30
pξ2(x2)          0.20     0.30     0.30     0.20

Table 5.4. A factorized PMF.
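The factorized PMF of Table 5.4 and the failure of the joint PMF of Table 5.1 to factorize can be checked numerically. The following Python sketch assumes NumPy; the outer product of the marginal PMFs yields the PMF of the independent case.

```python
# Minimal sketch: testing PMF factorization (assumes NumPy).
import numpy as np

p_joint = np.array([[0.1, 0.0, 0.2, 0.1],
                    [0.1, 0.2, 0.0, 0.0],
                    [0.0, 0.1, 0.1, 0.1]])          # Table 5.1

p_x1 = p_joint.sum(axis=1)
p_x2 = p_joint.sum(axis=0)
p_factorized = np.outer(p_x1, p_x2)                  # PMF under independence (Table 5.4)

print(p_factorized)                                  # 0.08, 0.12, ... as in Table 5.4
print(np.allclose(p_joint, p_factorized))            # False: xi_1 and xi_2 are not independent
```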

The bivariate case of two independent random variables is generalized to the case of n independent random variables in the following definition.

Definition 5.4.11 (n independent random variables). n random variables ξ1, ..., ξn are independent, if (1) (n) for every S1 ⊆ Γ , ..., Sn ⊆ Γ ,

n Y P(ξ1 ∈ S1, ..., ξn ∈ Sn) = P(ξi ∈ Si). (5.57) i=1

If the random variables have a multivariate PMF or PDF pξ1,...,ξn (x1, ..., xn) with marginal PMFs or

PDFs pξi , i = 1, ..., n, then independence holds if

pξ1,...,ξn(x1, ..., xn) = ∏_{i=1}^{n} pξi(xi). (5.58)

•

The special case of n independent random variables with identical marginal distributions serves as a fundamental assumption in many statistical settings. We use the following definition.


Definition 5.4.12 (Independent and identically distributed random variables). n random variables ξ1, ..., ξn are called independent and identically distributed (iid), if and only if

(1) ξ1, ..., ξn are independent random variables, and

(2) each ξi has the same marginal distribution for i = 1, ..., n. •

In Section 7 | Probability distributions, we consider the case of n iid Gaussian random variables and how their joint distribution can be represented by a multivariate Gaussian distribution.

5.5 Bibliographic remarks

The presented material is standard and can be found in any introductory textbook on probability and statistics. DeGroot and Schervish (2012) and Wasserman (2004) are the main sources for the presentation provided here. Excellent introductions to modern probability theory include Billingsley (1995), Fristedt et al. (1998), Rosenthal (2006), and, from a statistical perspective, Shao (2003).

5.6 Study questions

1. Write down the definition of a probability space.
2. Write down the definition of the independence of two events A and B.
3. Write down the definition of a random variable.
4. Write down the definition of the cumulative distribution function of a random variable.
5. Write down the definitions of a PMF and a PDF.
6. Write down the definition of a random vector.
7. Write down the definition of the cumulative distribution function of a random vector.
8. Write down the definition of a multivariate PMF and a multivariate PDF.

9. Write down the definition of the independence of n random variables ξi, i = 1, ..., n.

10. What does it mean for n random variables ξ1, ..., ξn to be iid?

6 | Expectation, covariance, and transformations

6.1 Expectation

Definition and examples The expectation is a one-number summary of the distribution of a random variable. Intuitively, the expectation is the value that one expects to observe on average over many realizations of the random variable. We use the following definition.

Definition 6.1.1 (Expectation of a random variable). Let (Ω, A, P) denote a probability space and let ξ denote a random variable. The expectation (or expected value) of ξ is defined as

- E(ξ) := Σ_{x∈X} x pξ(x), if ξ : Ω → X is a discrete random variable with PMF pξ, and as
- E(ξ) := ∫_{−∞}^{∞} x pξ(x) dx, if ξ : Ω → R is a continuous random variable with PDF pξ.

The expectation of a random variable is said to exist, if it is finite. •

Example 1 As a first example, we consider the expectation of a discrete random variable ξ with PMF

pξ : N6 → [0, 1], x ↦ pξ(x) := 1/6, (6.1)

e.g., a random variable modelling the numerical outcome of a fair die. From Definition 6.1.1 and with X = N6, we then have

E(ξ) = Σ_{x∈N6} x pξ(x) = 1 · 1/6 + 2 · 1/6 + 3 · 1/6 + 4 · 1/6 + 5 · 1/6 + 6 · 1/6 = 21/6 = 3.5. (6.2)
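This evaluation can be verified numerically, both by direct summation over the PMF and by averaging a large number of simulated die rolls. The following Python sketch assumes NumPy; the seed and sample size are arbitrary.

```python
# Minimal check of E(xi) = 3.5 for a fair die, analytically and by simulation (assumes NumPy).
import numpy as np

x = np.arange(1, 7)          # outcome space {1, ..., 6}
p = np.full(6, 1 / 6)        # uniform PMF

print(np.sum(x * p))         # 3.5, the expectation of eq. (6.2)

rng = np.random.default_rng(0)
samples = rng.integers(1, 7, size=100_000)
print(samples.mean())        # close to 3.5 for a large number of realizations
```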

Example 2 As a second example, we consider the expectation of a Gaussian random variable ξ, i.e., a continuous random variable with PDF

pξ : R → R>0, x ↦ pξ(x) := 1/√(2πσ²) exp(−(x − µ)²/(2σ²)). (6.3)

Then

E(ξ) = µ. (6.4)

Proof. We first note without proof that

∫_{−∞}^{∞} exp(−x²) dx = √π. (6.5)

From Definition 6.1.1, we have

E(ξ) = ∫_{−∞}^{∞} x 1/√(2πσ²) exp(−(x − µ)²/(2σ²)) dx. (6.6)

With the integration by substitution rule (cf. Section 3 | Calculus)

∫_{g(a)}^{g(b)} f(x) dx = ∫_{a}^{b} f(g(x)) g′(x) dx (6.7)

and the definition of

g : R → R, x ↦ g(x) := √(2σ²) x + µ with g′(x) = √(2σ²), (6.8)

we then have

E(ξ) = 1/√(2πσ²) ∫_{−∞}^{∞} x exp(−(x − µ)²/(2σ²)) dx
= 1/√(2πσ²) ∫_{−∞}^{∞} (√(2σ²) x + µ) exp(−((√(2σ²) x + µ) − µ)²/(2σ²)) √(2σ²) dx
= √(2σ²)/√(2πσ²) ∫_{−∞}^{∞} (√(2σ²) x + µ) exp(−x²) dx (6.9)
= 1/√π (√(2σ²) ∫_{−∞}^{∞} x exp(−x²) dx + µ ∫_{−∞}^{∞} exp(−x²) dx)
= 1/√π (√(2σ²) ∫_{−∞}^{∞} x exp(−x²) dx + µ √π).

An anti-derivative of x exp(−x²) is given by −(1/2) exp(−x²), because

d/dx (−(1/2) exp(−x²)) = −(1/2) d/dx exp(−x²) = −(1/2) exp(−x²)(−2x) = x exp(−x²). (6.10)

With lim_{x→−∞} −(1/2) exp(−x²) = 0 and lim_{x→∞} −(1/2) exp(−x²) = 0, the remaining integral term thus vanishes and we obtain

E(ξ) = 1/√π (µ √π) = µ. (6.11)

Properties of expectations We next discuss some properties of expectations that are often useful when evaluating the expectation of a random variable. Intuitively, the expectation of a sum of (scaled) random variables corresponds to the sum of the (scaled) expectations of the individual random variables and, for independent random variables, the expectation of a product of random variables corresponds to the product of the expectations of the individual random variables. These properties follow directly from the definition of the expectation as a sum or integral, both of which exhibit linearity properties (cf. Section 2 | Sets, sums, and functions and Section 3 | Calculus). For conciseness, we only consider the case of continuous random variables in the given proofs. We start with the following theorem.

Theorem 6.1.1 (Expectations of linear-affine transformations of random variables). Let ξ denote a random variable, let a, b ∈ R, and let ζ := aξ + b. Then

E(ζ) = aE(ξ) + b. (6.12)

Proof. The theorem follows directly with the linearity properties of sums and integrals (cf. Section 2 | Sets, sums, and functions and Section 3 | Calculus). We consider the case of a continuous random variable ξ with PDF pξ in more detail. In this case, we have

E(ζ) = E(aξ + b) = ∫ (ax + b) pξ(x) dx = ∫ a pξ(x) x + b pξ(x) dx = a ∫ x pξ(x) dx + b ∫ pξ(x) dx = aE(ξ) + b. (6.13)

The following theorem formalizes that the expectation of the scaled sum of random variables corresponds to the sum of scaled random variable expectations.

Theorem 6.1.2 (Expectations of linear combinations of random variables). Let ξ1, ..., ξn denote random variables and let a1, ..., an ∈ R. Then

E(Σ_{i=1}^{n} ai ξi) = Σ_{i=1}^{n} ai E(ξi). (6.14)

Proof. The theorem follows directly with the linearity properties of sums and integrals (cf. Section 2 | Sets, sums, and functions and Section 3 | Calculus). We consider the case of two continuous random variables ξ1 and ξ2 with bivariate PDF pξ1,ξ2 in more detail. In this case, we have

E(Σ_{i=1}^{2} ai ξi) = E(a1ξ1 + a2ξ2)
= ∫∫ (a1x1 + a2x2) pξ1,ξ2(x1, x2) dx1 dx2
= ∫∫ a1x1 pξ1,ξ2(x1, x2) + a2x2 pξ1,ξ2(x1, x2) dx1 dx2
= a1 ∫∫ x1 pξ1,ξ2(x1, x2) dx1 dx2 + a2 ∫∫ x2 pξ1,ξ2(x1, x2) dx1 dx2 (6.15)
= a1 ∫ x1 (∫ pξ1,ξ2(x1, x2) dx2) dx1 + a2 ∫ x2 (∫ pξ1,ξ2(x1, x2) dx1) dx2
= a1 ∫ x1 pξ1(x1) dx1 + a2 ∫ x2 pξ2(x2) dx2
= a1 E(ξ1) + a2 E(ξ2)
= Σ_{i=1}^{2} ai E(ξi).

Finally, an induction argument can be used to generalize the bivariate to the n-variate case.

Finally, the expectation of the product of random variables corresponds to the product of the expectations of the individual random variables, but in general only if the random variables are independent.

Theorem 6.1.3 (Expectation of products of independent random variables). Let ξ1, ..., ξn denote independent random variables. Then

E(∏_{i=1}^{n} ξi) = ∏_{i=1}^{n} E(ξi). (6.16)

Proof. We consider the case of n continuous random variables with joint PDF pξ1,...,ξn. Because ξ1, ..., ξn are independent, it holds that

pξ1,...,ξn(x1, ..., xn) = ∏_{i=1}^{n} pξi(xi). (6.17)

We thus have

E(∏_{i=1}^{n} ξi) = ∫ ··· ∫ (∏_{i=1}^{n} xi) pξ1,...,ξn(x1, ..., xn) dx1 ··· dxn
= ∫ ··· ∫ (∏_{i=1}^{n} xi) (∏_{i=1}^{n} pξi(xi)) dx1 ··· dxn
= ∫ ··· ∫ ∏_{i=1}^{n} xi pξi(xi) dx1 ··· dxn (6.18)
= ∏_{i=1}^{n} ∫ xi pξi(xi) dxi
= ∏_{i=1}^{n} E(ξi).

6.2 Variance

Definition and examples Variance and standard deviation are further one-number summaries of distributions. Intuitively, both the variance and the standard deviation capture the spread of the realizations of the random variable. The standard deviation of a random variable is the square root of its variance. Because the variance of a random variable involves a squaring operation, it is expressed in squared units of the random variable; the standard deviation, as its square root, is expressed in the same units as the random variable.

Definition 6.2.1 (Variance and standard deviation of a random variable). Let ξ denote a random variable with expectation E(ξ). The variance of ξ is defined as

V(ξ) := E((ξ − E(ξ))²), (6.19)

assuming that this expectation exists. The standard deviation of a random variable is defined as the square root of its variance,

S(ξ) := √V(ξ). (6.20)

•

Example 1 As a first example, we consider the variance of the discrete random variable ξ with PMF

pξ : N6 → [0, 1], x ↦ pξ(x) := 1/6. (6.21)

Above, we have evaluated the expectation of this random variable to E(ξ) = 3.5. With Definition 6.2.1 and the definition of the expectation of a discrete random variable, we then have

V(ξ) = E((ξ − E(ξ))²)
= Σ_{x∈N6} (x − E(ξ))² pξ(x)
= Σ_{x∈N6} (x − 3.5)² · 1/6 (6.22)
= ((1 − 3.5)² + (2 − 3.5)² + (3 − 3.5)² + (4 − 3.5)² + (5 − 3.5)² + (6 − 3.5)²) · 1/6
= 17.5/6.

The variance of ξ is thus V(ξ) = 17.5/6 ≈ 2.92 and the standard deviation of ξ is S(ξ) ≈ √2.92 ≈ 1.71. An alternative representation of the variance of a random variable that is often useful for the analytical evaluation of variances is given in the following theorem.

Theorem 6.2.1 (Variance translation theorem). Let ξ be a random variable. Then

V(ξ) = E(ξ²) − E(ξ)². (6.23)



Proof. With the definition of the variance of a random variable and the linearity of expectations, we have

V(ξ) = E((ξ − E(ξ))²)
= E(ξ² − 2ξE(ξ) + E(ξ)²)
= E(ξ²) − 2E(ξ)E(ξ) + E(ξ)² (6.24)
= E(ξ²) − 2E(ξ)² + E(ξ)²
= E(ξ²) − E(ξ)².

Example 2 In the following, we use the variance translation theorem to show that the variance of a Gaussian random variable ξ with PDF N(ξ; µ, σ²) is given by

V(ξ) = σ². (6.25)

Proof. We first note that with the variance translation theorem

V(ξ) = E(ξ²) − E(ξ)² = 1/√(2πσ²) ∫_{−∞}^{∞} x² exp(−(x − µ)²/(2σ²)) dx − µ². (6.26)

With the integration by substitution rule (cf. Section 3 | Calculus)

∫_{a}^{b} f(g(x)) g′(x) dx = ∫_{g(a)}^{g(b)} f(x) dx (6.27)

and the definition of

g : R → R, x ↦ √(2σ²) x + µ, g(−∞) := −∞, g(∞) := ∞, with g′(x) = √(2σ²), (6.28)

the integral term on the right-hand side of eq. (6.26) can be rewritten as

∫_{−∞}^{∞} x² exp(−(x − µ)²/(2σ²)) dx = ∫_{−∞}^{∞} (√(2σ²) x + µ)² exp(−((√(2σ²) x + µ) − µ)²/(2σ²)) √(2σ²) dx
= √(2σ²) ∫_{−∞}^{∞} (√(2σ²) x + µ)² exp(−2σ²x²/(2σ²)) dx (6.29)
= √(2σ²) ∫_{−∞}^{∞} (√(2σ²) x + µ)² exp(−x²) dx.

We thus have

V(ξ) = √(2σ²)/√(2πσ²) ∫_{−∞}^{∞} (√(2σ²) x + µ)² exp(−x²) dx − µ²
= 1/√π ∫_{−∞}^{∞} (2σ²x² + 2√(2σ²) xµ + µ²) exp(−x²) dx − µ² (6.30)
= 1/√π (2σ² ∫_{−∞}^{∞} x² exp(−x²) dx + 2√(2σ²) µ ∫_{−∞}^{∞} x exp(−x²) dx + µ² ∫_{−∞}^{∞} exp(−x²) dx) − µ².

Taking

∫_{−∞}^{∞} x exp(−x²) dx = 0 and ∫_{−∞}^{∞} exp(−x²) dx = √π (6.31)

as given, we then obtain

V(ξ) = 1/√π (2σ² ∫_{−∞}^{∞} x² exp(−x²) dx + µ² √π) − µ²
= 2σ²/√π ∫_{−∞}^{∞} x² exp(−x²) dx + µ² − µ² (6.32)
= 2σ²/√π ∫_{−∞}^{∞} x² exp(−x²) dx.

With the integration by parts rule (cf. Section 3 | Calculus)

∫_{a}^{b} f′(x) g(x) dx = f(x)g(x)|_{a}^{b} − ∫_{a}^{b} f(x) g′(x) dx (6.33)

and the definitions of

f : R → R, x ↦ f(x) := exp(−x²) with f′(x) = −2x exp(−x²) (6.34)

and

g : R → R, x ↦ g(x) := −x/2 with g′(x) = −1/2, (6.35)

such that

f′(x)g(x) = −2x exp(−x²)(−x/2) = x² exp(−x²), (6.36)

we then have

V(ξ) = 2σ²/√π ∫_{−∞}^{∞} x² exp(−x²) dx
= 2σ²/√π (−(x/2) exp(−x²)|_{−∞}^{∞} − ∫_{−∞}^{∞} exp(−x²)(−1/2) dx) (6.37)
= 2σ²/√π (−(x/2) exp(−x²)|_{−∞}^{∞} + (1/2) ∫_{−∞}^{∞} exp(−x²) dx).

From lim_{x→±∞} x exp(−x²) = 0, we infer that the first term in the bracketed expression on the right-hand side evaluates to 0, such that we obtain

V(ξ) = 2σ²/√π (1/2) ∫_{−∞}^{∞} exp(−x²) dx = σ²/√π · √π = σ². (6.38)

Properties of variances We next discuss some properties of variances that are often useful when evaluating the variance and/or standard deviation of a random variable. In brief, the variance of a scaled random variable corresponds to the variance of the original random variable multiplied by the square of the scaling factor, and the variance of the sum of independent random variables corresponds to the sum of the variances of the individual random variables. We commence with the following theorem.

Theorem 6.2.2 (Variances and standard deviations of linear-affine transformations of random variables). Let ξ denote a random variable, let a, b ∈ R, and let ζ := aξ + b. Then

V(ζ) = a² V(ξ) (6.39)

and

S(ζ) = |a| S(ξ). (6.40)


Proof. We first note that from Theorem 6.1.1, E(ζ) = aE(ξ) + b. For the variance of ζ, we thus have

V(ζ) = E((ζ − E(ζ))²)
= E((aξ + b − aE(ξ) − b)²)
= E((aξ − aE(ξ))²)
= E((a(ξ − E(ξ)))²) (6.41)
= E(a²(ξ − E(ξ))²)
= a² E((ξ − E(ξ))²)
= a² V(ξ).

Taking the square root then yields the result for the standard deviation.

The next theorem shows that for independent random variables, the variance of the sum of random variables corresponds to the sum of the individual random variables’ variances.

Theorem 6.2.3 (Variances of linear combinations of independent random variables). Let ξ1, ..., ξn denote independent random variables and let a1, ..., an ∈ R. Then

V(Σ_{i=1}^{n} ai ξi) = Σ_{i=1}^{n} ai² V(ξi). (6.42)

Proof. We consider the case of two independent random variables ξ1 and ξ2 in more detail. We first note that in this case, we have

E(a1ξ1 + a2ξ2) = a1E(ξ1) + a2E(ξ2). (6.43)

We thus have

V(Σ_{i=1}^{2} ai ξi) = V(a1ξ1 + a2ξ2)
= E((a1ξ1 + a2ξ2 − E(a1ξ1 + a2ξ2))²)
= E((a1ξ1 + a2ξ2 − a1E(ξ1) − a2E(ξ2))²)
= E((a1ξ1 − a1E(ξ1) + a2ξ2 − a2E(ξ2))²)
= E((a1(ξ1 − E(ξ1)) + a2(ξ2 − E(ξ2)))²) (6.44)
= E(a1²(ξ1 − E(ξ1))² + 2a1a2(ξ1 − E(ξ1))(ξ2 − E(ξ2)) + a2²(ξ2 − E(ξ2))²)
= a1² E((ξ1 − E(ξ1))²) + 2a1a2 E((ξ1 − E(ξ1))(ξ2 − E(ξ2))) + a2² E((ξ2 − E(ξ2))²)
= a1² V(ξ1) + 2a1a2 E((ξ1 − E(ξ1))(ξ2 − E(ξ2))) + a2² V(ξ2)
= Σ_{i=1}^{2} ai² V(ξi) + 2a1a2 E((ξ1 − E(ξ1))(ξ2 − E(ξ2))).

Because ξ1 and ξ2 are independent, we have with Theorem 6.1.3

E((ξ1 − E(ξ1))(ξ2 − E(ξ2))) = E(ξ1 − E(ξ1)) E(ξ2 − E(ξ2)) = (E(ξ1) − E(ξ1))(E(ξ2) − E(ξ2)) = 0, (6.45)

and thus

V(Σ_{i=1}^{2} ai ξi) = Σ_{i=1}^{2} ai² V(ξi). (6.46)

Finally, an induction argument can be used to generalize the bivariate to the n-variate case.

6.3 Sample mean, sample variance, and sample standard deviation

The theoretical constructs of expectations, variances, and standard deviations should not be confused with the concepts of sample means, sample variances, and sample standard deviations. The former entities are of theoretical nature and can be evaluated once the distributions of random variables have been specified. The latter entities are of practical nature, can be evaluated numerically based on observed data, and serve as estimators of the former theoretical quantities. We use the following definitions.

Definition 6.3.1 (Sample mean, sample variance, sample standard deviation). Let ξ1, ..., ξn denote random variables. Then

- the sample mean of ξ1, ..., ξn is defined as the arithmetic average of ξ1, ..., ξn,

ξ̄n := (1/n) Σ_{i=1}^{n} ξi, (6.47)

- the sample variance of ξ1, ..., ξn is defined as

Sn² := 1/(n − 1) Σ_{i=1}^{n} (ξi − ξ̄n)², (6.48)

and

- the sample standard deviation is defined as

Sn := √(Sn²). (6.49)

•

Example As an example, we consider the case of 10 independent and identically distributed Gaussian random variables with expectation parameter µ = 1 and variance parameter σ² = 2, i.e., ξ1, ..., ξ10 ∼ N(1, 2). A set of realizations x1, ..., x10 is provided in Table 6.1 below.

x1     x2     x3     x4     x5     x6     x7     x8     x9     x10
0.54   1.01   -3.28  0.35   2.75   -0.51  2.32   1.49   0.96   1.25

Table 6.1. Realizations of 10 independent and identically distributed Gaussian random variables.

For ξ1 = x1, ..., ξ10 = x10, application of the formula for the sample mean then yields

ξ̄10 = (1/10) Σ_{i=1}^{10} ξi = (1/10) Σ_{i=1}^{10} xi = 6.88/10 = 0.68, (6.50)

application of the formula for the sample variance yields

S10² = 1/(10 − 1) Σ_{i=1}^{10} (ξi − ξ̄10)² = (1/9) Σ_{i=1}^{10} (xi − x̄10)² = (1/9) Σ_{i=1}^{10} (xi − 0.68)² = 25.37/9 = 2.82, (6.51)

and application of the formula for the sample standard deviation yields

S10 = √(S10²) = √2.82 = 1.68. (6.52)
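The sample statistics reported in eqs. (6.50) to (6.52) can be reproduced with a few lines of Python, assuming NumPy is available. Note that ddof=1 selects the 1/(n − 1) normalization of Definition 6.3.1.

```python
# Minimal reproduction of the sample statistics for the realizations in Table 6.1 (assumes NumPy).
import numpy as np

x = np.array([0.54, 1.01, -3.28, 0.35, 2.75, -0.51, 2.32, 1.49, 0.96, 1.25])  # Table 6.1

x_bar = x.mean()             # 0.688 = 6.88/10, cf. eq. (6.50)
s2 = x.var(ddof=1)           # approx. 2.82, cf. eq. (6.51)
s = np.sqrt(s2)              # approx. 1.68, cf. eq. (6.52)

print(x_bar, s2, s)
```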

6.4 Covariance and correlation of random variables

Definition 6.4.1 (Covariance and correlation of random variables). The covariance of two random variables ξ1 and ξ2 with finite expectations is defined as

C(ξ1, ξ2) = E((ξ1 − E(ξ1))(ξ2 − E(ξ2))), (6.53)

if this expectation exists. The correlation of two random variables ξ1 and ξ2 with finite expectations is defined as

ρ(ξ1, ξ2) = C(ξ1, ξ2)/(√V(ξ1) √V(ξ2)) = C(ξ1, ξ2)/(S(ξ1)S(ξ2)). (6.54)

•

Note that the covariance of a random variable with itself corresponds to its variance:

C(ξ, ξ) = E((ξ − E(ξ))(ξ − E(ξ))) = E((ξ − E(ξ))²) = V(ξ). (6.55)


pξ(x1, x2)   x2 = 1   x2 = 2   x2 = 3   pξ1(x1)
x1 = 1       0.10     0.05     0.15     0.30
x1 = 2       0.60     0.05     0.05     0.70
pξ2(x2)      0.70     0.10     0.20

Table 6.2. An exemplary joint PMF.

Example 1. As a first example, we consider the covariance of two discrete random variables ξ1 and ξ2 with the bivariate PMF depicted in Table 6.2 above. To this end, we first note that

E(ξ1) = Σ_{x1=1}^{2} x1 pξ1(x1) = 1 · 0.3 + 2 · 0.7 = 1.7 (6.56)

and

E(ξ2) = Σ_{x2=1}^{3} x2 pξ2(x2) = 1 · 0.7 + 2 · 0.1 + 3 · 0.2 = 1.5. (6.57)

With the definition of the covariance of ξ1 and ξ2, we then have

C(ξ1, ξ2) = E((ξ1 − E(ξ1))(ξ2 − E(ξ2)))
= Σ_{x1=1}^{2} Σ_{x2=1}^{3} (x1 − E(ξ1))(x2 − E(ξ2)) pξ1,ξ2(x1, x2)
= Σ_{x1=1}^{2} Σ_{x2=1}^{3} (x1 − 1.7)(x2 − 1.5) pξ1,ξ2(x1, x2)
= Σ_{x1=1}^{2} ((x1 − 1.7)(1 − 1.5) pξ1,ξ2(x1, 1) + (x1 − 1.7)(2 − 1.5) pξ1,ξ2(x1, 2) + (x1 − 1.7)(3 − 1.5) pξ1,ξ2(x1, 3))
= (1 − 1.7)(1 − 1.5) pξ1,ξ2(1, 1) + (1 − 1.7)(2 − 1.5) pξ1,ξ2(1, 2) + (1 − 1.7)(3 − 1.5) pξ1,ξ2(1, 3)
+ (2 − 1.7)(1 − 1.5) pξ1,ξ2(2, 1) + (2 − 1.7)(2 − 1.5) pξ1,ξ2(2, 2) + (2 − 1.7)(3 − 1.5) pξ1,ξ2(2, 3)
= (−0.7) · (−0.5) · 0.10 + (−0.7) · 0.5 · 0.05 + (−0.7) · 1.5 · 0.15 + 0.3 · (−0.5) · 0.60 + 0.3 · 0.5 · 0.05 + 0.3 · 1.5 · 0.05
= 0.035 − 0.0175 − 0.1575 − 0.09 + 0.0075 + 0.0225
= −0.2. (6.58)
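The expectations and the covariance of this example can be reproduced numerically. The following Python sketch assumes NumPy and evaluates the double sum of eq. (6.58) directly; all variable names are illustrative.

```python
# Minimal numerical check of the covariance computed in eq. (6.58) (assumes NumPy).
import numpy as np

x1 = np.array([1.0, 2.0])
x2 = np.array([1.0, 2.0, 3.0])
p = np.array([[0.10, 0.05, 0.15],
              [0.60, 0.05, 0.05]])              # joint PMF of Table 6.2 (rows x1, columns x2)

E1 = np.sum(x1 * p.sum(axis=1))                 # 1.7, cf. eq. (6.56)
E2 = np.sum(x2 * p.sum(axis=0))                 # 1.5, cf. eq. (6.57)
cov = np.sum(np.outer(x1 - E1, x2 - E2) * p)    # definition (6.53) evaluated as a double sum

print(E1, E2, cov)                              # 1.7 1.5 -0.2 (up to floating point error)
```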

Example 2. As a second example, we note that the covariance and correlation of two random variables ξ1 and ξ2 with bivariate Gaussian joint distribution N(µ, Σ), where

µ = (µ1, µ2)^T and Σ = ((σ11², σ12²), (σ21², σ22²)), (6.59)

are given by

C(ξ1, ξ2) = σ12² = σ21² and ρ(ξ1, ξ2) = σ12²/(σ11σ22), (6.60)

respectively. For a proof, see Casella and Berger (2012, p. 175). An alternative representation of the covariance of two random variables that is often useful for the analytical evaluation of covariances is given in the following theorem.

Theorem 6.4.1 (Covariance translation theorem). Let ξ1 and ξ2 denote two random variables. Then

C(ξ1, ξ2) = E(ξ1ξ2) − E(ξ1)E(ξ2). (6.61)




Proof. With the definition of the covariance of ξ1 and ξ2, we have

C(ξ1, ξ2) = E((ξ1 − E(ξ1))(ξ2 − E(ξ2)))
= E(ξ1ξ2 − ξ1E(ξ2) − E(ξ1)ξ2 + E(ξ1)E(ξ2)) (6.62)
= E(ξ1ξ2) − E(ξ1)E(ξ2) − E(ξ1)E(ξ2) + E(ξ1)E(ξ2)
= E(ξ1ξ2) − E(ξ1)E(ξ2).

Note that for independent ξ1 and ξ2, we have E(ξ1ξ2) = E(ξ1)E(ξ2) and thus C(ξ1, ξ2) = 0. Covariance and dependence are two different concepts. As the following theorem shows, two random variables can have zero covariance, but be dependent. On the other hand, independent random variables always have zero covariance.

Theorem 6.4.2 (Covariance, correlation and independence). Let ξ1 and ξ2 denote two random variables. If ξ1 and ξ2 are independent random variables, then C(ξ1, ξ2) = 0 and ξ1 and ξ2 are uncorrelated. Conversely, if C(ξ1, ξ2) = 0 and hence ξ1 and ξ2 are uncorrelated, then ξ1 and ξ2 are not necessarily independent.

Proof. (1) We first show that the independence of ξ1 and ξ2 implies that their covariance is zero. To this end, we note that for independent random variables, we have E(ξ1ξ2) = E(ξ1)E(ξ2). (6.63) With the covariance translation theorem, it then follows that

C(ξ1, ξ2) = E(ξ1ξ2) − E(ξ1)E(ξ2) = E(ξ1)E(ξ2) − E(ξ1)E(ξ2) = 0. (6.64)

With the definition of the correlation coefficient, it follows immediately that ρ(ξ1, ξ2) = 0 and thus that ξ1 and ξ2 are uncorrelated.

(2) We next show by example that the covariance of non-independent random variables ξ1 and ξ2 can be zero. To this end, we consider the case of two discrete random variables ξ1 and ξ2 with outcome spaces X = {−1, 0, 1} and Y = {0, 1}, marginal PMF of ξ1 given by pξ1(ξ1 = x) = 1/3 for x ∈ X, and the definition ξ2 := ξ1². We first note that

E(ξ1) = Σ_{x∈X} x pξ1(ξ1 = x) = −1 · 1/3 + 0 · 1/3 + 1 · 1/3 = 0 (6.65)

and

E(ξ1ξ2) = E(ξ1ξ1²) = E(ξ1³) = Σ_{x∈X} x³ pξ1(ξ1 = x) = −1 · 1/3 + 0 · 1/3 + 1 · 1/3 = 0. (6.66)

With the covariance translation theorem, we thus have

C(ξ1, ξ2) = E(ξ1ξ2) − E(ξ1)E(ξ2) = E(ξ1³) − E(ξ1)E(ξ2) = 0 − 0 · E(ξ2) = 0. (6.67)

The covariance of ξ1 and ξ2 is thus zero. However, as shown below, the joint PMF of ξ1 and ξ2 does not factorize, and thus ξ1 and ξ2 are not independent. The definition ξ2 := ξ1² entails the following conditional PMF pξ2|ξ1:

pξ2|ξ1(x2|x1)   x1 = −1   x1 = 0   x1 = 1
x2 = 0          0         1        0
x2 = 1          1         0        1

The marginal PMF pξ1 and the conditional PMF pξ2|ξ1 in turn entail the following joint PMF pξ1,ξ2:

pξ1,ξ2(x1, x2)   x1 = −1   x1 = 0   x1 = 1   pξ2(x2)
x2 = 0           0         1/3      0        1/3
x2 = 1           1/3       0        1/3      2/3
pξ1(x1)          1/3       1/3      1/3

But, for example,

pξ1,ξ2(x1 = −1, x2 = 0) = 0 ≠ 1/9 = 1/3 · 1/3 = pξ1(x1 = −1) pξ2(x2 = 0) (6.68)

and hence ξ1 and ξ2 are not independent.
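The counterexample can also be illustrated by simulation. The following Python sketch, assuming NumPy, draws realizations of ξ1 and ξ2 := ξ1² and shows that the sample covariance is approximately zero, while the joint frequencies do not factorize.

```python
# Minimal simulation sketch of the counterexample in Theorem 6.4.2 (assumes NumPy).
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.choice([-1, 0, 1], size=100_000)   # marginal PMF 1/3, 1/3, 1/3
x2 = x1 ** 2                                # deterministic function of x1, hence dependent

print(np.cov(x1, x2)[0, 1])                 # sample covariance close to 0
print(np.mean((x1 == -1) & (x2 == 0)))      # exactly 0 here, while p(x1=-1)*p(x2=0) = 1/9
```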

The correlation of two random variables can be understood as a measure of their linear dependence in the sense of the following theorem.


Theorem 6.4.3 (Correlation and linear-affine transformations). Let ξ1 and ξ2 denote two random variables with V(ξ1) > 0 and V(ξ2) > 0. Then

ξ2 = aξ1 + b ⇔ ρ(ξ1, ξ2) = 1 or ρ(ξ1, ξ2) = −1. (6.69)

Proof. We content ourselves with showing the ⇒ direction. We assume that ξ2 = aξ1 + b with V(ξ1) > 0 and V(ξ2) > 0 holds and first show that this implies

S(ξ2) = ±a S(ξ1) and C(ξ1, ξ2) = a V(ξ1). (6.70)

To see this, we first note that, as seen above,

E(ξ2) = aE(ξ1) + b and V(ξ2) = a² V(ξ1). (6.71)

Thus a ≠ 0, and if a > 0, then S(ξ2) = a S(ξ1), while if a < 0, then S(ξ2) = −a S(ξ1) > 0. Next, with respect to

C(ξ1, ξ2) = E((ξ1 − E(ξ1))(ξ2 − E(ξ2))), (6.72)

we note that

ξ2 − E(ξ2) = aξ1 + b − E(ξ2) = aξ1 + b − aE(ξ1) − b = a(ξ1 − E(ξ1)). (6.73)

We thus obtain

C(ξ1, ξ2) = E(a(ξ1 − E(ξ1))²) = a E((ξ1 − E(ξ1))²) = a V(ξ1). (6.74)

With (6.70), it then follows that

ρ(ξ1, ξ2) = C(ξ1, ξ2)/(S(ξ1)S(ξ2)) = a V(ξ1)/(S(ξ1)(±a S(ξ1))) = ± a V(ξ1)/(a V(ξ1)) = ±1. (6.75)

Finally, the notion of covariance allows for establishing a formula for the variance of linear affine combinations of arbitrary random variables. We have the following theorem.

Theorem 6.4.4 (Variances of sums and differences of random variables). Let ξ1 and ξ2 denote two random variables and let a, b, c ∈ R. Then

V(aξ1 + bξ2 + c) = a² V(ξ1) + b² V(ξ2) + 2ab C(ξ1, ξ2). (6.76)

In particular,

V(ξ1 + ξ2) = V(ξ1) + V(ξ2) + 2C(ξ1, ξ2) (6.77)

and

V(ξ1 − ξ2) = V(ξ1) + V(ξ2) − 2C(ξ1, ξ2). (6.78)



Proof. We first note that

E(aξ1 + bξ2 + c) = aE(ξ1) + bE(ξ2) + c. (6.79)

We thus have

V(aξ1 + bξ2 + c) = E((aξ1 + bξ2 + c − aE(ξ1) − bE(ξ2) − c)²)
= E((a(ξ1 − E(ξ1)) + b(ξ2 − E(ξ2)))²)
= E(a²(ξ1 − E(ξ1))² + b²(ξ2 − E(ξ2))² + 2ab(ξ1 − E(ξ1))(ξ2 − E(ξ2))) (6.80)
= a² E((ξ1 − E(ξ1))²) + b² E((ξ2 − E(ξ2))²) + 2ab E((ξ1 − E(ξ1))(ξ2 − E(ξ2)))
= a² V(ξ1) + b² V(ξ2) + 2ab C(ξ1, ξ2).

The special cases then follow directly with a = b = 1 and with a = 1, b = −1, respectively.

6.5 Sample covariance and sample correlation

Like expectations and variances, covariances and correlations should not be confused with their empirical counterparts, the sample covariance and the sample correlation. We have the following definitions.

Definition 6.5.1 (Sample covariance and sample correlation). Let (ξ1^1, ξ2^1), ..., (ξ1^n, ξ2^n) denote n two-dimensional random vectors. Then

- the sample mean of (ξ1^1, ξ2^1), ..., (ξ1^n, ξ2^n) is defined as

(ξ̄1, ξ̄2) := ((1/n) Σ_{i=1}^{n} ξ1^i, (1/n) Σ_{i=1}^{n} ξ2^i), (6.81)

- the sample covariance of (ξ1^1, ξ2^1), ..., (ξ1^n, ξ2^n) is defined as

Cn := 1/(n − 1) Σ_{i=1}^{n} (ξ1^i − ξ̄1)(ξ2^i − ξ̄2), (6.82)

- and the sample correlation coefficient of (ξ1^1, ξ2^1), ..., (ξ1^n, ξ2^n) is defined as

Rn := Cn/(Sξ1 Sξ2), (6.83)

where Sξ1 and Sξ2 denote the sample standard deviations of ξ1^1, ..., ξ1^n and ξ2^1, ..., ξ2^n, respectively.

Example 1. Assume the following realizations.

(x1^1, x2^1)  (x1^2, x2^2)  (x1^3, x2^3)  (x1^4, x2^4)  (x1^5, x2^5)  (x1^6, x2^6)  (x1^7, x2^7)  (x1^8, x2^8)  (x1^9, x2^9)  (x1^10, x2^10)
(0.8, -0.7)   (1.1, 1.6)    (-0.8, 1.1)   (-0.2, 0.1)   (1.1, 0.4)    (0.5, 1.5)    (1.3, -1.2)   (1.8, 0.6)    (0.4, 0.2)    (1.5, -1.0)

Then the sample mean realization is given by

(x̄1, x̄2) = ((1/10) Σ_{i=1}^{10} x1^i, (1/10) Σ_{i=1}^{10} x2^i) = (0.75, 0.26), (6.84)

the sample standard deviation realizations are given by

sx1 = √((1/9) Σ_{i=1}^{10} (x1^i − x̄1)²) = 0.79 and sx2 = √((1/9) Σ_{i=1}^{10} (x2^i − x̄2)²) = 0.99, (6.85)

and the sample covariance and sample correlation realizations are given by

cn = 1/(n − 1) Σ_{i=1}^{n} (x1^i − x̄1)(x2^i − x̄2) = −0.26 (6.86)

and

rn = cn/(sx1 sx2) = −0.33, (6.87)

respectively.
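The realizations of this example can be analysed with NumPy's covariance and correlation routines, which implement eqs. (6.82) and (6.83). The following Python sketch reproduces the values reported above up to rounding.

```python
# Minimal reproduction of the sample covariance and correlation in Example 1 (assumes NumPy).
import numpy as np

x1 = np.array([0.8, 1.1, -0.8, -0.2, 1.1, 0.5, 1.3, 1.8, 0.4, 1.5])
x2 = np.array([-0.7, 1.6, 1.1, 0.1, 0.4, 1.5, -1.2, 0.6, 0.2, -1.0])

print(x1.mean(), x2.mean())          # 0.75, 0.26, cf. eq. (6.84)
print(np.cov(x1, x2)[0, 1])          # sample covariance, approx. -0.26, cf. eq. (6.86)
print(np.corrcoef(x1, x2)[0, 1])     # sample correlation, approx. -0.33, cf. eq. (6.87)
```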

6.6 Probability density transformations

A central theme in the theory of the GLM is the transformation of PDFs. The fundamental probabilistic assumption in the classical Frequentist treatment of the GLM is that observation errors are independently and identically Gaussian distributed. The theory of the GLM captures how this Gaussian error assumption results in Gaussian distributed data, which in turn result in Gaussian and χ2 distributed parameter estimates, which in turn result in t and f distributed statistics. In this Section, we introduce basic theorems that allow for the analytical evaluation of PDFs of random variables that result from other random variables by means of the application of a function. We here focus on general scenarios and will see specific applications in the context of the GLM in subsequent sections.

Univariate probability density transformations We first consider the PDF of a univariate random variable υ that results from the transformation of a univariate random variable ξ with PDF pξ by means of a function f. The following theorem states how the resulting PDF pυ of υ can be evaluated (DeGroot and Schervish, 2012, Section 3.8). Theorem 6.6.1 (Univariate PDF transformations for bijective functions). Let ξ be a random variable with outcome set X and PDF pξ for which P(]a, b[) = 1, where a and/or b are either finite or infinite. Let υ = f(ξ), where f is differentiable and bijective for ]a, b[. Let f(]a, b[) be the image of ]a, b[ under f.


Finally, let f^{-1}(y) denote the inverse of f for y ∈ f(]a, b[) and let f′(x) denote the first derivative of f at x. Then the PDF of the random variable υ with outcome set Y is given by

pυ : Y → R≥0, y ↦ pυ(y) := (1/|f′(f^{-1}(y))|) pξ(f^{-1}(y)) for y ∈ f(]a, b[), and pυ(y) := 0 for y ∈ R \ f(]a, b[). (6.88)

For a proof of Theorem 6.6.1, see DeGroot and Schervish (2012, Section 3.8). Note that Theorem 6.6.1 implies an analytical procedure for deriving the PDF pυ: based on the definitions of pξ and f, the inverse and the derivative of f have to be evaluated and substituted in eq. (6.88) appropriately. We demonstrate this procedure in the proof of the following theorem, which specializes Theorem 6.6.1 to linear-affine functions.

Theorem 6.6.2 (Univariate PDF transformations for linear-affine functions). Let ξ be a random variable with PDF pξ and let υ := f(ξ) with

f(ξ) := aξ + b with a, b ∈ R, a ≠ 0. (6.89)

Then the PDF of υ is given by

pυ : R → R≥0, y ↦ pυ(y) := (1/|a|) pξ((y − b)/a). (6.90)



Proof. We first note that the inverse of f is given by

f^{-1} : R → R, y ↦ f^{-1}(y) := (y − b)/a, (6.91)

because

f(f^{-1}(x)) = a((x − b)/a) + b = x − b + b = x for all x ∈ R, (6.92)

and thus

f ∘ f^{-1} = id_R. (6.93)

We next note that

f′ : R → R, x ↦ f′(x) = d/dx (ax + b) = a. (6.94)

Substitution in (6.88) then yields

pυ : R → R≥0, y ↦ pυ(y) := (1/|f′(f^{-1}(y))|) pξ(f^{-1}(y)) = (1/|a|) pξ((y − b)/a). (6.95)
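As a numerical illustration of Theorem 6.6.2, the following Python sketch evaluates the right-hand side of eq. (6.90) for a standard normal ξ and compares it to the known density of υ = aξ + b as well as to an empirical estimate from simulated realizations. It assumes NumPy and SciPy; the values of a, b, and the evaluation point are arbitrary.

```python
# Minimal sketch of the linear-affine PDF transformation of eq. (6.90) (assumes NumPy and SciPy).
import numpy as np
from scipy import stats

a, b = 2.0, 1.0                                        # illustrative constants, a != 0
y = 3.0                                                # an arbitrary evaluation point

p_transformed = stats.norm.pdf((y - b) / a) / abs(a)   # (1/|a|) * p_xi((y - b)/a)
p_reference = stats.norm.pdf(y, loc=b, scale=abs(a))   # known density of upsilon = a*xi + b

rng = np.random.default_rng(2)
upsilon = a * rng.standard_normal(1_000_000) + b
p_empirical = np.mean(np.abs(upsilon - y) < 0.05) / 0.1  # histogram-type density estimate at y

print(p_transformed, p_reference, p_empirical)         # all approx. 0.121
```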

We next note a generalization of Theorem 6.6.1 to the case of only piecewise bijective functions.

Theorem 6.6.3 (Univariate PDF transformations for piecewise bijective functions). Let ξ be a random variable with outcome set X and PDF pξ. Assume further that υ := f(ξ), where f is such that the outcome set of ξ can be partitioned into a finite number of sets X1, ..., Xk with a corresponding number of sets f(X1), ..., f(Xk) in the outcome set Y of υ (which may not be mutually exclusive), such that f is bijective on each of X1, ..., Xk. Let further fi^{-1} denote the inverse function of f on Xi and assume that its derivative exists and is continuous for all i = 1, ..., k. Then the PDF of υ is given by

pυ : Y → R≥0, y ↦ pυ(y) := Σ_{i=1}^{k} 1_{f(Xi)}(y) (1/|f′(fi^{-1}(y))|) pξ(fi^{-1}(y)). (6.96)

Theorem 6.6.3 is especially important in the derivation of the χ2 distribution (cf. Section 7 | Probability distributions).


Multivariate probability density transformations Finally, we note that Theorem 6.6.1 has a straightforward generalization to the multivariate scenario of transforming random vectors. It is this generalization, and its application in the case of linear-affine transformations, that is central to the formulation of the GLM (cf. Section 7 | Probability distributions) and to the Frequentist distributions of GLM parameter estimates (cf. Section 9 | Frequentist distribution theory). Again, we state the general theorem without proof, but exemplify the procedure for evaluating the PDF of the resulting random vector in the proof of its linear-affine special case.

Theorem 6.6.4 (Multivariate PDF transformations). Let ξ be an n-dimensional random vector with PDF pξ and let υ = f(ξ) be an m-dimensional random vector with differentiable f : R^n → R^m. Let f^{-1} : R^m → R^n denote the inverse of f. Let further

J^f(x) = (∂fi(x)/∂xj)_{1≤i≤m, 1≤j≤n} ∈ R^{m×n} (6.97)

denote the Jacobian matrix of f at x ∈ R^n, let |J^f(x)| denote its determinant, and assume that |J^f(x)| ≠ 0 for all x ∈ R^n. Then the PDF of υ is given by

pυ : R^m → R≥0, y ↦ pυ(y) := (1/|J^f(f^{-1}(y))|) pξ(f^{-1}(y)) for y ∈ f(R^n), and pυ(y) := 0 for y ∈ R^m \ f(R^n). (6.98)

An application of Theorem 6.6.4 is given in the proof of the following theorem.

Theorem 6.6.5 (Multivariate nonsingular PDF transformations). Let ξ be a random vector with PDF pξ and let υ = Aξ for invertible A ∈ R^{n×n}. Then the PDF of υ is given by

pυ : R^n → R≥0, y ↦ pυ(y) = (1/|A|) pξ(A^{-1}y), (6.99)

where |A| and A^{-1} denote the determinant and the inverse of A, respectively.



Proof. We first show that

f^{-1} : R^n → R^n, y ↦ f^{-1}(y) := A^{-1}y. (6.100)

To this end, we note that

f^{-1}(f(x)) = A^{-1}Ax = x for all x ∈ R^n, (6.101)

and thus

f^{-1} ∘ f = id_{R^n}. (6.102)

We next show that

J^f(f^{-1}(y)) = A. (6.103)

To this end, we first note that

fi(x) = Σ_{j=1}^{n} aij xj. (6.104)

Thus

J^f(x) = (∂fi(x)/∂xj)_{1≤i,j≤n} = (∂/∂xj Σ_{k=1}^{n} aik xk)_{1≤i,j≤n} = (aij)_{1≤i,j≤n} = A ∈ R^{n×n}. (6.105)

Substitution in (6.98) then yields

pυ(y) = (1/|J^f(f^{-1}(y))|) pξ(f^{-1}(y)) = (1/|A|) pξ(A^{-1}y). (6.106)

6.7 Combining random variables

In this section, we consider the PDFs of random variables that result from various combinations of two continuous random variables.


Linear combinations We have the following theorem.

Theorem 6.7.1 (Linear combination of two continuous random variables). Let ξ1 and ξ2 be two continuous random variables with joint PDF pξ1,ξ2(x1, x2), and let

υ = a1ξ1 + a2ξ2 + b with a1 ≠ 0. (6.107)

Then υ has a continuous distribution with PDF

pυ(y) = ∫_{−∞}^{∞} pξ1,ξ2((y − b − a2x2)/a1, x2) (1/a1) dx2. (6.108)

Proof. We first note that for any joint PDF pξ of a random vector ξ and any multivariate real-valued function f such that υ := f(ξ), the CDF of υ takes on the values

Pυ(y) = ∫_{Ay} pξ(x) dx, where Ay := {x | f(x) ≤ y}, (6.109)

because

Pυ(y) = P(υ ≤ y) = P(f(ξ) ≤ y) = P(ξ ∈ {x | f(x) ≤ y}) = P(ξ ∈ Ay) = ∫_{Ay} pξ(x) dx. (6.110)

We next evaluate the CDF Pυ of υ of the linear combination theorem in the form

Pυ(y) = ∫_{−∞}^{y} pυ(s) ds, (6.111)

from which the form of pυ then follows directly. To this end, we define

Ay := {(x1, x2) | a1x1 + a2x2 + b ≤ y} for all y ∈ R. (6.112)

Then, from the above, we have

Pυ(y) = ∫∫_{Ay} pξ1,ξ2(x1, x2) dx1 dx2. (6.113)

To evaluate this integral, visualized below, we consider −∞ < x2 < ∞ and for each x2 integrate x1 from −∞ to

x1 = (y − a2x2 − b)/a1 ⇔ a1x1 + a2x2 + b = y. (6.114)

We thus consider the integral

Pυ(y) = ∫∫_{Ay} pξ1,ξ2(x1, x2) dx1 dx2 = ∫_{−∞}^{∞} ∫_{−∞}^{(y−a2x2−b)/a1} pξ1,ξ2(x1, x2) dx1 dx2. (6.115)

See Figure 6.1 for a visualization. The inner integral on the right-hand side of the above can then be rewritten by means of the integration by substitution rule as

∫_{−∞}^{(y−a2x2−b)/a1} pξ1,ξ2(x1, x2) dx1 = ∫_{−∞}^{y} pξ1,ξ2((ξ − b − a2x2)/a1, x2) (1/a1) dξ (6.116)

(see below for a detailed derivation). Substitution in the above then yields

Pυ(y) = ∫_{−∞}^{∞} ∫_{−∞}^{y} pξ1,ξ2((ξ − b − a2x2)/a1, x2) (1/a1) dξ dx2
= ∫_{−∞}^{y} ∫_{−∞}^{∞} pξ1,ξ2((ξ − b − a2x2)/a1, x2) (1/a1) dx2 dξ. (6.117)


Figure 6.1. Integration area Ay of interest. For each x2 ∈ ]−∞, ∞[, the inner integral is evaluated along the x1 dimension from −∞ to (y − a2x2 − b)/a1.

But then it follows from basic calculus that

pυ(y) = d/dy Pυ(y)
= d/dy ∫_{−∞}^{y} ∫_{−∞}^{∞} pξ1,ξ2((ξ − b − a2x2)/a1, x2) (1/a1) dx2 dξ (6.118)
= ∫_{−∞}^{∞} pξ1,ξ2((y − b − a2x2)/a1, x2) (1/a1) dx2.

Finally, we show that

∫_{−∞}^{(y−a2x2−b)/a1} pξ1,ξ2(x1, x2) dx1 = ∫_{−∞}^{y} pξ1,ξ2((ξ − b − a2x2)/a1, x2) (1/a1) dξ (6.119)

by means of the integration by substitution rule. To this end, we first recall that the integration by substitution rule states that for univariate real-valued functions g and h it holds that

∫_{h(a)}^{h(b)} g(x) dx = ∫_{a}^{b} g(h(x)) h′(x) dx. (6.120)

For constant x2 ∈ R, we next define

g : R → R, x ↦ g(x) := pξ1,ξ2(x, x2) (6.121)

and

h : R → R, x ↦ h(x) := (x − a2x2 − b)/a1. (6.122)

We note that the derivative of h at x evaluates to

h′(x) = 1/a1. (6.123)

Finally, we set b := y and a := −∞. Substitution in (6.120) then yields

∫_{h(a)}^{h(b)} g(x) dx = ∫_{a}^{b} g(h(x)) h′(x) dx
⇔ ∫_{h(−∞)}^{h(y)} pξ1,ξ2(x, x2) dx = ∫_{−∞}^{y} pξ1,ξ2(h(x), x2) (1/a1) dx
⇔ ∫_{−∞}^{(y−a2x2−b)/a1} pξ1,ξ2(x, x2) dx = ∫_{−∞}^{y} pξ1,ξ2((x − a2x2 − b)/a1, x2) (1/a1) dx (6.124)
⇔ ∫_{−∞}^{(y−a2x2−b)/a1} pξ1,ξ2(x1, x2) dx1 = ∫_{−∞}^{y} pξ1,ξ2((ξ − a2x2 − b)/a1, x2) (1/a1) dξ.


The following special case is of particular interest.

Theorem 6.7.2 (Convolution of random variables). Let ξ1 and ξ2 be two independent continuous random variables with marginal PDFs pξ1 and pξ2, respectively, and let υ := ξ1 + ξ2. Then a PDF of the distribution of υ is given by the convolution of pξ1 and pξ2, i.e.,

pυ(y) = ∫_{−∞}^{∞} pξ1(y − x2) pξ2(x2) dx2 = ∫_{−∞}^{∞} pξ1(x1) pξ2(y − x1) dx1. (6.125)

Proof. We first note that for independent ξ1, ξ2, pξ1,ξ2 factorizes. Setting a1 = a2 = 1 and b = 0 in the theorem on linear combinations of two continuous random variables then yields

pυ(y) = ∫_{−∞}^{∞} pξ1((y − 0 − 1 · x2)/1) pξ2(x2) dx2 = ∫_{−∞}^{∞} pξ1(y − x2) pξ2(x2) dx2. (6.126)

Finally, by exchanging the roles of ξ1 and ξ2, we obtain

pυ(y) = ∫_{−∞}^{∞} pξ2(y − x1) pξ1(x1) dx1 = ∫_{−∞}^{∞} pξ1(x1) pξ2(y − x1) dx1. (6.127)

A direct proof of the convolution formula can be given by considering the transformation (ξ1, ξ2) ↦ (ξ1 + ξ2, ξ1) and marginalization (e.g., Casella and Berger (2012, Theorem 5.2.9)).
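As a numerical illustration of Theorem 6.7.2, the following Python sketch evaluates the convolution integral of eq. (6.125) on a grid for two independent standard normal random variables and compares the result to the N(0, 2) density of their sum. It assumes NumPy and SciPy; grid and evaluation point are arbitrary.

```python
# Minimal numerical sketch of the convolution theorem for two standard normal variables
# (assumes NumPy and SciPy).
import numpy as np
from scipy import stats

y = 1.3                                               # an arbitrary evaluation point
x, dx = np.linspace(-10, 10, 4001, retstep=True)      # integration grid

# Convolution integral of eq. (6.125), evaluated by a Riemann sum
p_conv = np.sum(stats.norm.pdf(y - x) * stats.norm.pdf(x)) * dx

print(p_conv)                                         # approx. 0.185
print(stats.norm.pdf(y, loc=0, scale=np.sqrt(2)))     # reference: the N(0, 2) density at y
```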

Ratios We next consider the scenario of a random variable ζ defined by the ratio of two positive random variables ξ1 and ξ2. We have the following theorem.

Theorem 6.7.3 (Ratio distributions). Let ξ1 and ξ2 be two positive random variables with joint PDF pξ1,ξ2 and let ζ := ξ1/ξ2. Then a PDF of ζ is given by

pζ : R>0 → R>0, z ↦ pζ(z) := ∫_{0}^{∞} x2 pξ1,ξ2(zx2, x2) dx2. (6.128)

Proof. We first note that for any joint PDF pξ and any multivariate real-valued function f such that ζ := f(ξ) is a random variable, the CDF of ζ takes on the values

Pζ(z) := ∫_{Az} pξ(x) dx, where Az := {x | f(x) ≤ z}, (6.129)

because

Pζ(z) = P(ζ ≤ z) = P(f(ξ) ≤ z) = P(ξ ∈ {x | f(x) ≤ z}) = P(ξ ∈ Az) = ∫_{Az} pξ(x) dx. (6.130)

To prove the current theorem, we are thus interested in evaluating

Pζ(z) = ∫∫_{Az} pξ1,ξ2(x1, x2) dx1 dx2, (6.131)

where

Az := {(x1, x2) | x1/x2 ≤ z} for all z ∈ R. (6.132)

To evaluate this integral, we consider 0 < x2 < ∞ and for each x2 integrate x1 from 0 to zx2, because

x1/x2 = z ⇔ x1 = zx2. (6.133)

We are thus interested in evaluating

Pζ(z) = ∫_{0}^{∞} ∫_{0}^{zx2} pξ1,ξ2(s, x2) ds dx2. (6.134)

We next rewrite the inner integral using the integration by substitution rule

∫_{g(a)}^{g(b)} f(s) ds = ∫_{a}^{b} f(g(s)) g′(s) ds. (6.135)

Specifically, we define f := pξ1,ξ2(·, x2) for fixed x2 and

g : R → R, s ↦ g(s) := sx2, (6.136)

such that g(0) = 0, g(z) = zx2, and g′(s) = x2. With a := 0 and b := z, we thus have from (6.135)

∫_{0}^{zx2} pξ1,ξ2(s, x2) ds = ∫_{0}^{z} pξ1,ξ2(sx2, x2) x2 ds. (6.137)

Substitution in (6.134) then yields

Pζ(z) = ∫_{0}^{∞} ∫_{0}^{z} x2 pξ1,ξ2(sx2, x2) ds dx2 = ∫_{0}^{z} ∫_{0}^{∞} x2 pξ1,ξ2(sx2, x2) dx2 ds. (6.138)


With

d/dz ∫_{0}^{z} f(s) ds = f(z) (6.139)

(cf. Section 3 | Calculus), we then have

pζ(z) = d/dz Pζ(z) = d/dz ∫_{0}^{z} (∫_{0}^{∞} x2 pξ1,ξ2(sx2, x2) dx2) ds = ∫_{0}^{∞} x2 pξ1,ξ2(zx2, x2) dx2. (6.140)

6.8 Bibliographic remarks

The material discussed in this section is standard. We followed Wasserman (2004, Sections 3.1 - 3.3) and DeGroot and Schervish (2012, Sections 4.1 - 4.3, 4.6).

6.9 Study questions

1. Write down the definition of the expectation of a random variable and discuss its intuition.
2. What does it mean for the expectation of a random variable to exist?
3. State the linearity and multiplication properties of expectations.
4. Write down the definition of the variance of a random variable and discuss its intuition.
5. Write down the definition of the standard deviation of a random variable and discuss its intuition.
6. Write down the expectation of the square of a random variable in terms of its variance and expectation.

7. For a random variable ξ and a constant a, what is V(aξ)?

8. Write down the definition of the covariance and correlation of two random variables ξ1 and ξ2.

9. Express the covariance of two random variables ξ1 and ξ2 in terms of expectations.

10. What is the variance of the sum of two random variables ξ1 and ξ2, if ξ1 and ξ2 are independent and in general?

7 | Probability distributions

In this Section, we review the essential probability distributions for the GLM. All distributions of relevance for the theory of the GLM derive from the zero-centred univariate Gaussian distribution that governs the error term of a single data point. Because all these distributions govern the behaviour of continuous random variables, all distributions can be specified in terms of PDFs. We first establish the univariate and multivariate Gaussian distributions, which will allow us to cast the GLM in probabilistic form in Section 7.2. Subsequently, we consider the χ2, T, and F distributions, which derive from nonlinear transformations of Gaussian distributions.

Definition 7.0.1 (Univariate Gaussian distribution, standard normal distribution). Let ξ be a random variable with outcome set R and PDF

pξ : R → R>0, x ↦ pξ(x) := 1/√(2πσ²) exp(−(x − µ)²/(2σ²)). (7.1)

Then ξ is called a Gaussian random variable and said to be distributed according to a Gaussian distribution with parameters µ ∈ R and σ² > 0, for which we write ξ ∼ N(µ, σ²). We abbreviate the PDF of a Gaussian random variable by

N(x; µ, σ²) := 1/√(2πσ²) exp(−(x − µ)²/(2σ²)). (7.2)

A Gaussian random variable with expectation parameter µ = 0 and variance parameter σ² = 1 is called a standard normal (or z) random variable, the distribution of a standard normal (or z) random variable is called the standard normal (or z) distribution, and its PDF is given by

N(z; 0, 1) = 1/√(2π) exp(−z²/2). (7.3)

•

The parameter µ of a univariate Gaussian distribution specifies the location of highest probability density, while the parameter σ² specifies the width of the PDF (Figure 7.1).

Linear-affine transformations An important aspect of Gaussian distributions is that they reproduce under linear-affine transformations. Intuitively, applying a linear-affine transformation to a Gaussian random variable yields a new Gaussian random variable whose expectation and variance parameters result from the transformation of the original random variable's parameters. Formally, we have the following theorem.

Theorem 7.0.1 (Linear-affine transformations of Gaussian random variables). Let ξ ∼ N(µ, σ²) denote a Gaussian random variable with expectation parameter µ and variance parameter σ². Let further υ := f(ξ), where

f : R → R, x ↦ f(x) := ax + b, a, b ∈ R, a ≠ 0. (7.4)

Then

υ ∼ N(aµ + b, a²σ²). (7.5)



Proof. We first note that the inverse function of f is given by

f^{-1} : R → R, y ↦ f^{-1}(y) = (y − b)/a (7.6)

and that f′(x) = a. With the univariate PDF transformation theorem for linear-affine functions (cf. Section 6 | Expectation, covariance, and transformations), we then have for the PDF of υ:

pυ(y) = (1/|a|) N((y − b)/a; µ, σ²)
= (1/|a|) 1/√(2πσ²) exp(−((y − b)/a − µ)²/(2σ²))
= 1/√(2πa²σ²) exp(−(((y − b)/a)² − 2((y − b)/a)µ + µ²)/(2σ²)) (7.7)
= 1/√(2πa²σ²) exp(−((y − b)² − 2(y − b)aµ + a²µ²)/(2a²σ²))
= 1/√(2πa²σ²) exp(−(y − b − aµ)²/(2a²σ²))
= 1/√(2πa²σ²) exp(−(y − (aµ + b))²/(2a²σ²))
= N(y; aµ + b, a²σ²).
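Theorem 7.0.1 can be illustrated by simulation: the sample moments of υ = aξ + b should approximate aµ + b and a²σ². The following Python sketch assumes NumPy; all parameter values are illustrative.

```python
# Minimal simulation sketch of Theorem 7.0.1: upsilon = a*xi + b for xi ~ N(mu, sigma^2)
# (assumes NumPy).
import numpy as np

mu, sigma2 = 1.0, 2.0
a, b = -3.0, 0.5

rng = np.random.default_rng(3)
xi = rng.normal(mu, np.sqrt(sigma2), size=200_000)
upsilon = a * xi + b

print(upsilon.mean(), a * mu + b)           # both approx. -2.5
print(upsilon.var(ddof=1), a**2 * sigma2)   # both approx. 18.0
```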

The z-transformation An often encountered transformation of univariate Gaussian distributions is the z-transformation. The z-transformation is a procedure to transform an arbitrary Gaussian random variable to a standard normal random variable. We have the following theorem, which follows directly from Theorem 6.6.1 on the transformations of univariate random variables by bijective functions.

Theorem 7.0.2 (z-transformation). Let ξ ∼ N(µ, σ²) and let υ := f(ξ) with f(x) := (x − µ)/σ. Then υ ∼ N(0, 1).

Proof. We first note that f^{-1}(y) = σy + µ and f′(x) = 1/σ. With the theorem on the transformations of univariate random variables by bijective functions (Theorem 6.6.1), we then have for the PDF of υ

pυ(y) = (1/|1/σ|) N(σy + µ; µ, σ²)
= √(σ²) 1/√(2πσ²) exp(−(σy + µ − µ)²/(2σ²))
= 1/√(2π) exp(−σ²y²/(2σ²)) (7.8)
= 1/√(2π) exp(−y²/2)
= N(y; 0, 1).

7.1 The multivariate Gaussian distribution

The most important distribution for the theory of the GLM is the multivariate Gaussian distribution. We use the following definition.

Definition 7.1.1 (Multivariate Gaussian distribution). Let ξ be an n-dimensional random vector with outcome set R^n and PDF

pξ : R^n → R>0, x ↦ pξ(x) := (2π)^{−n/2} |Σ|^{−1/2} exp(−(1/2)(x − µ)^T Σ^{−1} (x − µ)). (7.9)

Then ξ is said to be distributed according to a multivariate (or n-dimensional) Gaussian distribution with expectation parameter µ ∈ R^n and positive-definite covariance matrix parameter Σ ∈ R^{n×n}, for which we write ξ ∼ N(µ, Σ). We abbreviate the PDF of a Gaussian random vector by

N(x; µ, Σ) := (2π)^{−n/2} |Σ|^{−1/2} exp(−(1/2)(x − µ)^T Σ^{−1} (x − µ)). (7.10)

•
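The functional form of eq. (7.9) can be checked against a standard library implementation. The following Python sketch assumes NumPy and SciPy; the parameter values and evaluation point are illustrative.

```python
# Minimal check of the multivariate Gaussian PDF of eq. (7.9) against SciPy (assumes NumPy, SciPy).
import numpy as np
from scipy import stats

mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([0.3, 0.7])

n = len(mu)
d = x - mu
p_manual = (2 * np.pi) ** (-n / 2) * np.linalg.det(Sigma) ** (-0.5) \
           * np.exp(-0.5 * d @ np.linalg.inv(Sigma) @ d)

print(p_manual)
print(stats.multivariate_normal.pdf(x, mean=mu, cov=Sigma))  # should agree with p_manual
```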


Figure 7.1. Univariate and bivariate Gaussian distributions with varying parameter settings.

In the functional form of the multivariate Gaussian distribution, the parameter µ ∈ R^n specifies the location of highest probability density in R^n. The diagonal elements of Σ specify the width of the PDF with respect to the x1, ..., xn components of x. The (i, j)th off-diagonal element of Σ specifies the degree of covariation of the xi and xj components of x. A first understanding of multivariate Gaussian distributions can be achieved by considering the bivariate case n = 2 in some detail (cf. Figure 7.1).

Bivariate Gaussian distributions

Definition 7.1.2 (Bivariate Gaussian distribution). A two-dimensional random vector $\xi := (\xi_1, \xi_2)$ with outcome set $\mathbb{R}^2$ is said to have a bivariate Gaussian distribution, if it has the PDF
\[
p_\xi(x_1, x_2) = N\!\left(\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}; \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix}\right). \tag{7.11}
\]

•

We first rewrite the PDF of a two-dimensional Gaussian random vector in a manner that eases its analytical characterization.

Theorem 7.1.1 (Bivariate Gaussian PDF). Let $\xi$ denote a two-dimensional random vector with bivariate Gaussian distribution and let
\[
\rho := \frac{\sigma_{12}}{\sqrt{\sigma_{11}\sigma_{22}}} = \frac{\sigma_{21}}{\sqrt{\sigma_{11}\sigma_{22}}}. \tag{7.12}
\]
Then the PDF of $\xi$ can be written as
\[
p_\xi(x_1, x_2) = \left(2\pi\sqrt{\sigma_{11}}\sqrt{\sigma_{22}}\sqrt{1 - \rho^2}\right)^{-1} \exp\left(-\frac{1}{2(1 - \rho^2)}\left(\left(\frac{x_1 - \mu_1}{\sqrt{\sigma_{11}}}\right)^2 - 2\rho\left(\frac{x_1 - \mu_1}{\sqrt{\sigma_{11}}}\right)\left(\frac{x_2 - \mu_2}{\sqrt{\sigma_{22}}}\right) + \left(\frac{x_2 - \mu_2}{\sqrt{\sigma_{22}}}\right)^2\right)\right).
\]

Proof. With Definition 7.1.1, the notation of eq. (7.11), the fact that the inverse of a non-singular $A = (a_{ij})_{1 \le i,j \le 2} \in \mathbb{R}^{2 \times 2}$ is given by
\[
A^{-1} = \frac{1}{a_{11}a_{22} - a_{12}a_{21}} \begin{pmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{pmatrix}, \tag{7.13}
\]
as well as
\[
\sigma_{12}\sigma_{21} = \sigma_{12}^2 = \rho^2 \sigma_{11}\sigma_{22}, \tag{7.14}
\]
we have
\[
\begin{aligned}
& N\!\left(\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}; \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix}\right) \\
&= (2\pi)^{-1} (\sigma_{11}\sigma_{22} - \sigma_{12}\sigma_{21})^{-\frac{1}{2}} \exp\left(-\frac{1}{2(\sigma_{11}\sigma_{22} - \sigma_{12}\sigma_{21})} \begin{pmatrix} x_1 - \mu_1 & x_2 - \mu_2 \end{pmatrix} \begin{pmatrix} \sigma_{22} & -\sigma_{12} \\ -\sigma_{21} & \sigma_{11} \end{pmatrix} \begin{pmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{pmatrix}\right) \\
&= (2\pi)^{-1} \left(\sigma_{11}\sigma_{22}(1 - \rho^2)\right)^{-\frac{1}{2}} \exp\left(-\frac{1}{2(1 - \rho^2)} \cdot \frac{(x_1 - \mu_1)^2\sigma_{22} - 2(x_1 - \mu_1)(x_2 - \mu_2)\sigma_{12} + (x_2 - \mu_2)^2\sigma_{11}}{\sigma_{11}\sigma_{22}}\right) \\
&= \left(2\pi\sqrt{\sigma_{11}}\sqrt{\sigma_{22}}\sqrt{1 - \rho^2}\right)^{-1} \exp\left(-\frac{1}{2(1 - \rho^2)}\left(\frac{(x_1 - \mu_1)^2}{\sigma_{11}} - 2\frac{(x_1 - \mu_1)(x_2 - \mu_2)\rho\sqrt{\sigma_{11}\sigma_{22}}}{\sigma_{11}\sigma_{22}} + \frac{(x_2 - \mu_2)^2}{\sigma_{22}}\right)\right) \\
&= \left(2\pi\sqrt{\sigma_{11}}\sqrt{\sigma_{22}}\sqrt{1 - \rho^2}\right)^{-1} \exp\left(-\frac{1}{2(1 - \rho^2)}\left(\left(\frac{x_1 - \mu_1}{\sqrt{\sigma_{11}}}\right)^2 - 2\rho\left(\frac{x_1 - \mu_1}{\sqrt{\sigma_{11}}}\right)\left(\frac{x_2 - \mu_2}{\sqrt{\sigma_{22}}}\right) + \left(\frac{x_2 - \mu_2}{\sqrt{\sigma_{22}}}\right)^2\right)\right).
\end{aligned} \tag{7.15}
\]

A bivariate Gaussian random vector can be constructed from linear combinations of two independent standard normal random variables. Specifically, we have the following theorem.

Theorem 7.1.2 (Construction of bivariate Gaussian distributions). Let $\zeta_1$ and $\zeta_2$ denote two independent standard normal random variables, let $\mu_1, \mu_2 \in \mathbb{R}$, let $\sigma_{11}, \sigma_{22} > 0$, let $\rho \in (-1, 1)$, and define
\[
\xi_1 := \sqrt{\sigma_{11}}\,\zeta_1 + \mu_1 \quad \text{and} \quad \xi_2 := \sqrt{\sigma_{22}}\left(\rho\zeta_1 + (1 - \rho^2)^{\frac{1}{2}}\zeta_2\right) + \mu_2. \tag{7.16}
\]
Then the joint distribution of $\xi_1$ and $\xi_2$ is a bivariate Gaussian distribution with expectation and covariance matrix parameters
\[
\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} \quad \text{and} \quad \Sigma = \begin{pmatrix} \sigma_{11} & \rho\sqrt{\sigma_{11}\sigma_{22}} \\ \rho\sqrt{\sigma_{11}\sigma_{22}} & \sigma_{22} \end{pmatrix}, \tag{7.17}
\]
respectively.

Proof. See DeGroot and Schervish(2012, pp. 338-339).
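The construction of Theorem 7.1.2 is easy to check by simulation. The following minimal sketch (using NumPy; the chosen values of $\mu_1, \mu_2, \sigma_{11}, \sigma_{22}, \rho$, the sample size, and the seed are arbitrary illustrative assumptions) compares the sample covariance matrix of the constructed variables with the analytical $\Sigma$ of eq. (7.17):

```python
import numpy as np

rng = np.random.default_rng(1)
mu1, mu2 = 1.0, -2.0            # illustrative expectation parameters
s11, s22, rho = 4.0, 1.0, 0.6   # illustrative variance and correlation parameters

z1 = rng.standard_normal(200_000)
z2 = rng.standard_normal(200_000)

# Construction of eq. (7.16)
xi1 = np.sqrt(s11) * z1 + mu1
xi2 = np.sqrt(s22) * (rho * z1 + np.sqrt(1 - rho**2) * z2) + mu2

# Sample covariance matrix should approximate Sigma of eq. (7.17)
sigma = np.array([[s11, rho * np.sqrt(s11 * s22)],
                  [rho * np.sqrt(s11 * s22), s22]])
print(np.cov(xi1, xi2))
print(sigma)
```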

Theorem 7.1.3 (Correlation of Gaussian random variables). Let $\xi_1$ and $\xi_2$ be two random variables with joint bivariate Gaussian distribution, let $\sigma_{11}$ and $\sigma_{22}$ denote the diagonal elements of the covariance matrix parameter governing their PDF, and let $\sigma_{12} = \sigma_{21}$ denote the off-diagonal elements of the covariance matrix parameter governing their PDF. Finally, let
\[
\rho := \frac{\sigma_{12}}{\sqrt{\sigma_{11}\sigma_{22}}}. \tag{7.18}
\]
Then the correlation of $\xi_1$ and $\xi_2$ is $\rho$.

Proof. See Casella and Berger(2012, p. 176).


Independent Gaussian random variables A central element in the theory of the GLM is the fact that $n$ independent univariate Gaussian random variables can be modelled by an $n$-dimensional Gaussian distribution with spherical covariance matrix parameter. More specifically, consider $n$ univariate Gaussian random variables with possibly different expectation parameters but a common variance parameter. These random variables can equivalently be described by an $n$-dimensional Gaussian random vector whose expectation parameter is the concatenation of the individual univariate expectation parameters and whose covariance matrix parameter is the $n \times n$ identity matrix multiplied by the common variance parameter. Formally, we have the following theorem.

Theorem 7.1.4 (Independent Gaussian distributions). For $i = 1, ..., n$, let $N(x_i; \mu_i, \sigma^2)$ denote the PDFs of $n$ independent univariate Gaussian random variables $\xi_1, ..., \xi_n$ with $\mu_1, ..., \mu_n \in \mathbb{R}$ and $\sigma^2 > 0$. Further, let $N(x; \mu, \sigma^2 I_n)$ denote the PDF of an $n$-variate Gaussian random vector $\xi$ with expectation parameter $\mu := (\mu_1, ..., \mu_n)^T$. Then
\[
p_\xi(x) = p_{\xi_1, ..., \xi_n}(x_1, ..., x_n) \tag{7.19}
\]
and in particular
\[
N\!\left(x; \mu, \sigma^2 I_n\right) = \prod_{i=1}^n N(x_i; \mu_i, \sigma^2). \tag{7.20}
\]



Proof. We show the identity of the multivariate Gaussian PDF $N(x; \mu, \sigma^2 I_n)$ with the product of $n$ univariate Gaussian PDFs $N(x_i; \mu_i, \sigma^2)$, where $\mu_i$ denotes the $i$th entry of $\mu \in \mathbb{R}^n$. With definition (7.9), we have
\[
\begin{aligned}
N\!\left(x; \mu, \sigma^2 I_n\right) &= (2\pi)^{-\frac{n}{2}} \left|\sigma^2 I_n\right|^{-\frac{1}{2}} \exp\left(-\frac{1}{2}(x - \mu)^T \left(\sigma^2 I_n\right)^{-1}(x - \mu)\right) \\
&= \left(\prod_{i=1}^n (2\pi)^{-\frac{1}{2}}\right) \left(\sigma^2\right)^{-\frac{n}{2}} \exp\left(-\frac{1}{2\sigma^2}(x - \mu)^T (x - \mu)\right) \\
&= \left(\prod_{i=1}^n \left(2\pi\sigma^2\right)^{-\frac{1}{2}}\right) \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu_i)^2\right) \\
&= \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \prod_{i=1}^n \exp\left(-\frac{1}{2\sigma^2}(x_i - \mu_i)^2\right) \\
&= \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x_i - \mu_i)^2\right) \\
&= \prod_{i=1}^n N(x_i; \mu_i, \sigma^2),
\end{aligned} \tag{7.21}
\]
where the last equality follows with definition (7.1).

From a sampling perspective, Theorem 7.1.4 can be conceived as follows: the sequential sampling of values from $n$ independent univariate Gaussian random variables with expectation parameters $\mu_i$, $i = 1, ..., n$ and common variance parameter $\sigma^2$ is equivalent to the simultaneous sampling of $n$ univariate marginal Gaussian random variables in the form of an $n$-dimensional random vector distributed according to a multivariate Gaussian distribution with expectation parameter vector $\mu = (\mu_1, ..., \mu_n)^T$ and spherical covariance matrix parameter $\sigma^2 I_n$.
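Eq. (7.20) can also be verified numerically at any evaluation point. The following minimal sketch (using SciPy's `scipy.stats.norm` and `scipy.stats.multivariate_normal`; the expectation vector, variance, and evaluation point are arbitrary illustrative assumptions) compares the spherical multivariate Gaussian PDF with the product of the corresponding univariate PDFs:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# Illustrative (hypothetical) parameter choices
mu = np.array([0.5, -1.0, 2.0])
sigma2 = 1.5
x = np.array([0.2, -0.7, 1.9])      # arbitrary evaluation point

lhs = multivariate_normal.pdf(x, mean=mu, cov=sigma2 * np.eye(3))
rhs = np.prod(norm.pdf(x, loc=mu, scale=np.sqrt(sigma2)))

print(lhs, rhs)   # both values should coincide up to floating point error
```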

Linear-affine transformations Like univariate Gaussian distributions, multivariate Gaussian distributions reproduce under linear-affine transformations. As in the univariate case, applying a linear-affine transformation to a Gaussian random vector yields a new Gaussian random vector with expectation and covariance matrix parameters that result from the corresponding transformation of the original random vector's parameters. We first note the following result without proof, which applies to the transformation of Gaussian random vectors by invertible matrices.

Theorem 7.1.5 (Nonsingular transformations of Gaussian random vectors). Let $\xi \sim N(\mu, \Sigma)$ denote an $n$-dimensional Gaussian random vector and let $\upsilon := A\xi$ with invertible $A \in \mathbb{R}^{n \times n}$. Then
\[
\upsilon \sim N\!\left(A\mu, A\Sigma A^T\right). \tag{7.22}
\]


We also note without proof that Theorem 7.1.5 can be generalized to the case of arbitrary linear-affine transformations, as stated in the following theorem. For a (non-trivial) proof of the theorem in the linear transformation case, see Anderson (2003, Section 2.4).

Theorem 7.1.6 (Linear-affine transformations of Gaussian random vectors). Let $\xi \sim N(\mu, \Sigma)$ denote an $n$-dimensional Gaussian random vector and let $\upsilon := A\xi + b$ with $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$. Then
\[
\upsilon \sim N\!\left(A\mu + b, A\Sigma A^T\right). \tag{7.23}
\]



7.2 The General Linear Model

The results on multivariate Gaussian distributions discussed in the previous section allow us to reconsider the GLM as introduced in Section 1.3 in a more rigorous fashion: the GLM with spherical covariance matrix is a multivariate Gaussian distribution of an $n$-dimensional random vector $y$ with expectation parameter $\mu := X\beta \in \mathbb{R}^n$ and covariance matrix parameter $\Sigma := \sigma^2 I_n$.

To see this, first recall that the GLM models $n$ random variables $y_i$, $i = 1, ..., n$ by a linear combination of $p$ (non-random) predictor variables $x_{i1}, ..., x_{ip}$ under independent additive Gaussian noise contributions with expectation 0 and variance $\sigma^2$, i.e.,
\[
y_i = x_{i1}\beta_1 + x_{i2}\beta_2 + \cdots + x_{ip}\beta_p + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2) \text{ for } i = 1, ..., n \text{ and mutually independent } \varepsilon_i. \tag{7.24}
\]
In vector-matrix notation and with Theorem 7.1.4, eq. (7.24) can be restated as
\[
y = X\beta + \varepsilon, \quad \varepsilon \sim N(0_n, \sigma^2 I_n) \text{ with } y \in \mathbb{R}^n, \; X \in \mathbb{R}^{n \times p}, \text{ and } \beta \in \mathbb{R}^p. \tag{7.25}
\]
The data vector $y$ thus corresponds to a random vector that results from the linear-affine transformation of a Gaussian random vector $\varepsilon$ with expectation parameter $0_n$ and covariance matrix parameter $\sigma^2 I_n$. Formally, we thus have that $y = f(\varepsilon)$ with $\varepsilon \sim N(0_n, \sigma^2 I_n)$ and linear-affine transformation
\[
f : \mathbb{R}^n \to \mathbb{R}^n, \quad \varepsilon \mapsto f(\varepsilon) := I_n\varepsilon + X\beta. \tag{7.26}
\]
Theorem 7.1.6 then yields
\[
y \sim N\!\left(I_n 0_n + X\beta, I_n\sigma^2 I_n I_n^T\right) \tag{7.27}
\]
and thus
\[
y \sim N\!\left(X\beta, \sigma^2 I_n\right). \tag{7.28}
\]
The GLM with mutually independent noise contributions thus models an $n$-dimensional data set of scalar data points as the realization of an $n$-dimensional random vector $y$ with expectation parameter $X\beta$ and covariance matrix parameter $\sigma^2 I_n$. Specific instantiations of the GLM, such as T-tests, simple or multiple linear regression, ANOVAs, or ANCOVAs, then correspond to specific assumptions about the design matrix and beta parameters (cf. Sections 11 - 15). Below, we briefly review two GLM designs that will serve as working examples throughout Chapters 7 - 10.

Example 1 (Independent and identically distributed Gaussian samples). Consider the scenario of $n$ independent samples from a univariate Gaussian distribution with expectation parameter $\mu$ and variance parameter $\sigma^2$,
\[
y_i \sim N(\mu, \sigma^2) \text{ for } i = 1, ..., n. \tag{7.29}
\]
From the above, eq. (7.29) is equivalent to
\[
y_i = \mu + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2) \text{ for } i = 1, ..., n \text{ and mutually independent } \varepsilon_i. \tag{7.30}
\]
Based on the above, eq. (7.30) can then equivalently be expressed as
\[
y \sim N(X\beta, \sigma^2 I_n) \text{ with } X := 1_n \in \mathbb{R}^{n \times 1}, \; \beta := \mu \in \mathbb{R}^1, \; \sigma^2 > 0. \tag{7.31}
\]
In other words, the scenario of $n$ independent samples from a univariate Gaussian distribution with expectation parameter $\mu$ and variance parameter $\sigma^2$ corresponds to sampling from a GLM with a design matrix given by a vector of all 1's, a single beta parameter that corresponds to the expectation parameter $\mu$ of the univariate Gaussian distribution, and a spherical covariance matrix parameter $\sigma^2 I_n$, where $\sigma^2$ corresponds to the variance parameter of the univariate Gaussian distribution.
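The equivalence of eqs. (7.29) and (7.31) can be illustrated with a short simulation sketch (NumPy; the values of $n$, $\mu$, $\sigma^2$, and the seed are arbitrary assumptions for this example), which generates data once as $n$ univariate draws and once from the GLM formulation with $X = 1_n$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, mu, sigma2 = 10, 3.0, 2.0      # illustrative values

# Sampling formulation (7.29): n independent univariate draws
y_univariate = rng.normal(mu, np.sqrt(sigma2), size=n)

# GLM formulation (7.31): y = X beta + noise with X = 1_n, beta = mu
X = np.ones((n, 1))
beta = np.array([mu])
y_glm = X @ beta + rng.normal(0.0, np.sqrt(sigma2), size=n)

print(y_univariate.mean(), y_glm.mean())   # both samples scatter around mu
```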


Example 2 (Simple linear regression). Consider again the simple linear regression scenario as discussed in Section 1.3. Recall that the simple linear regression model conceives each data point as the realization of a random variable constructed by the summation of an offset $a$, a scaled predictor $x_i$, where the scaling is represented by a slope parameter $b$, and additive zero-centred Gaussian noise with variance $\sigma^2$,
\[
y_i = a + bx_i + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2) \text{ for } i = 1, ..., n \text{ and mutually independent } \varepsilon_i. \tag{7.32}
\]
From the above, eq. (7.32) is equivalent to
\[
y_i \sim N(\mu_i, \sigma^2) \text{ with } \mu_i := a + bx_i \text{ and with mutually independent } y_1, ..., y_n. \tag{7.33}
\]
Based on the above, eq. (7.33) can then equivalently be expressed as
\[
y \sim N(X\beta, \sigma^2 I_n) \text{ with } X := \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} \in \mathbb{R}^{n \times 2}, \; \beta := \begin{pmatrix} a \\ b \end{pmatrix} \in \mathbb{R}^2, \; \sigma^2 > 0. \tag{7.34}
\]
In other words, the simple linear regression scenario with offset parameter $a$, slope parameter $b$, and Gaussian noise variance parameter $\sigma^2 > 0$ corresponds to a GLM with a design matrix with two columns, the first being a vector of all 1's and the second being the vector of predictor variables, a two-dimensional beta parameter vector comprising the offset and slope parameters, and a spherical covariance matrix parameter $\sigma^2 I_n$, where $\sigma^2$ corresponds to the variance of the additive Gaussian noise on each data point.
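A minimal sketch of sampling from the simple linear regression GLM of eq. (7.34) follows (NumPy; the offset, slope, noise variance, predictor values, and seed are arbitrary illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
n, a, b, sigma2 = 10, 1.0, 0.5, 1.0      # illustrative values
x = np.arange(n, dtype=float)            # predictor values x_1, ..., x_n

# Design matrix of eq. (7.34): a column of ones and the predictor column
X = np.column_stack([np.ones(n), x])
beta = np.array([a, b])

# Sampling from the GLM y ~ N(X beta, sigma^2 I_n)
y = X @ beta + rng.normal(0.0, np.sqrt(sigma2), size=n)
print(y)
```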

7.3 The Gamma distribution

Definition 7.3.1. Let $\xi$ be a random variable with outcome set $\mathbb{R}_{>0}$ and PDF
\[
p_\xi : \mathbb{R}_{>0} \to \mathbb{R}_{>0}, \quad x \mapsto p_\xi(x) := \frac{1}{\Gamma(\alpha)\beta^\alpha} x^{\alpha - 1} \exp\left(-\frac{x}{\beta}\right). \tag{7.35}
\]
Then $\xi$ is called a Gamma random variable and is said to be distributed according to a Gamma distribution with $\alpha > 0$ and $\beta > 0$, for which we write $\xi \sim G(\alpha, \beta)$. We abbreviate the PDF of a Gamma random variable by
\[
G(x; \alpha, \beta) := \frac{1}{\Gamma(\alpha)\beta^\alpha} x^{\alpha - 1} \exp\left(-\frac{x}{\beta}\right). \tag{7.36}
\]
•

The Gamma distribution family comprises two important special cases: for an integer n, the Gamma distribution with α = n/2 and β = 2 corresponds to the χ2 distribution (see below), while the Gamma distribution with α = 1 and β > 0 is known as the exponential distribution (Figure 7.2).

7.4 The χ2 distribution

The χ² distribution is the distribution of the sum of $n$ squared independent standard normal random variables. More specifically, we have the following definition and theorems (cf. Figure 7.2).

Definition 7.4.1 (Chi-squared random variable). A random variable $\xi$ is called a chi-squared random variable with $n$ degrees of freedom, if its PDF is given by
\[
p_\xi : \mathbb{R}_{>0} \to \mathbb{R}_{>0}, \quad x \mapsto p_\xi(x) := \frac{1}{\Gamma\!\left(\frac{n}{2}\right)2^{\frac{n}{2}}} x^{\frac{n}{2} - 1} \exp\left(-\frac{1}{2}x\right). \tag{7.37}
\]
We abbreviate the PDF of a chi-squared random variable by
\[
\chi^2(x; n) := \frac{1}{\Gamma\!\left(\frac{n}{2}\right)2^{\frac{n}{2}}} x^{\frac{n}{2} - 1} \exp\left(-\frac{1}{2}x\right). \tag{7.38}
\]
•

Theorem 7.4.1 (Squared standard normal random variable). Let $\xi \sim N(0, 1)$ be a standard normal random variable and let $\zeta := \xi^2$. Then $\zeta$ is a chi-squared random variable with one degree of freedom.



Figure 7.2. Gamma, χ2, t, and f distributions with varying parameter settings.

Proof. We first note that with the univariate PDF theorem for piecewise bijective functions (cf. Section 6 | Expectation, covariance, and transformations), the PDF of a random variable $\zeta = f(\xi)$ resulting from the transformation of a random variable $\xi$ with PDF $p_\xi$ by a piecewise differentiable and invertible function is given by
\[
p_\zeta(z) = \sum_{i=1}^k 1_{\mathcal{X}'_i}(z)\, \frac{1}{|f_i'(f_i^{-1}(z))|}\, p_\xi\!\left(f_i^{-1}(z)\right). \tag{7.39}
\]
We next define
\[
\mathcal{X}_1 := \,]-\infty, 0[\,, \quad \mathcal{X}_2 := \,]0, \infty[\,, \quad \text{and} \quad \mathcal{X}'_i := \mathbb{R}_{>0}, \tag{7.40}
\]
as well as
\[
f_i : \mathcal{X}_i \to \mathbb{R}_{>0}, \quad x \mapsto f_i(x) := x^2 =: z \quad \text{for } i = 1, 2 \tag{7.41}
\]
with derivatives
\[
f_i' : \mathcal{X}_i \to \mathbb{R}, \quad x \mapsto f_i'(x) = 2x \quad \text{for } i = 1, 2 \tag{7.42}
\]
and with inverse functions
\[
f_1^{-1} : \mathbb{R}_{>0} \to \mathcal{X}_1, \; z \mapsto f_1^{-1}(z) := -\sqrt{z} \quad \text{and} \quad f_2^{-1} : \mathbb{R}_{>0} \to \mathcal{X}_2, \; z \mapsto f_2^{-1}(z) := \sqrt{z}. \tag{7.43}
\]
From eq. (7.39), we then have
\[
\begin{aligned}
p_\zeta(z) &= 1_{\mathbb{R}_{>0}}(z)\frac{1}{|f_1'(f_1^{-1}(z))|} p_\xi\!\left(f_1^{-1}(z)\right) + 1_{\mathbb{R}_{>0}}(z)\frac{1}{|f_2'(f_2^{-1}(z))|} p_\xi\!\left(f_2^{-1}(z)\right) \\
&= \frac{1}{|2(-\sqrt{z})|}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}(-\sqrt{z})^2\right) + \frac{1}{|2\sqrt{z}|}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}(\sqrt{z})^2\right) \\
&= \frac{1}{2\sqrt{z}}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}z\right) + \frac{1}{2\sqrt{z}}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}z\right) \\
&= \frac{1}{\sqrt{2\pi}}\frac{1}{\sqrt{z}}\exp\left(-\frac{1}{2}z\right).
\end{aligned} \tag{7.44}
\]
On the other hand, we have for the PDF of a chi-squared random variable $\zeta$ with one degree of freedom
\[
p_\zeta(z) = \frac{1}{\Gamma\!\left(\frac{1}{2}\right)2^{\frac{1}{2}}} z^{\frac{1}{2} - 1}\exp\left(-\frac{1}{2}z\right) = \frac{1}{\sqrt{2\pi}}\frac{1}{\sqrt{z}}\exp\left(-\frac{1}{2}z\right). \tag{7.45}
\]

Without proof, we state the following theorem.

Theorem 7.4.2 (Sum of $n$ independent squared standard normal random variables). For $i = 1, ..., n$, let $\xi_i \sim N(0, 1)$ denote independent standard normal random variables and let $\zeta := \sum_{i=1}^n \xi_i^2$. Then $\zeta$ is a chi-squared random variable with $n$ degrees of freedom.
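Theorem 7.4.2 is easy to probe by simulation. The following minimal sketch (NumPy and SciPy; the degrees of freedom, sample size, and seed are arbitrary illustrative assumptions) compares the sample mean and variance of sums of squared standard normal draws with the analytical moments of the $\chi^2(n)$ distribution:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(4)
n = 5                                          # degrees of freedom, illustrative

# zeta = sum of n squared independent standard normal variables
zeta = (rng.standard_normal((100_000, n)) ** 2).sum(axis=1)

# Compare empirical moments with the chi^2(n) moments (n and 2n)
print(zeta.mean(), zeta.var(ddof=1))
print(chi2.mean(n), chi2.var(n))
```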

7.5 The t distribution

Definition 7.5.1. Let $T$ be a random variable with outcome set $\mathbb{R}$ and PDF
\[
p_T : \mathbb{R} \to \mathbb{R}_{>0}, \quad t \mapsto p_T(t) := \frac{1}{\sqrt{\pi n}}\,\frac{1}{\Gamma\!\left(\frac{n}{2}\right)}\,\Gamma\!\left(\frac{n + 1}{2}\right)\left(\frac{1}{1 + \frac{t^2}{n}}\right)^{\frac{n + 1}{2}}. \tag{7.46}
\]
Then $T$ is called a t random variable and is said to be distributed according to a t distribution with $n$ degrees of freedom, for which we write $T \sim t(n)$. We abbreviate the PDF of a t random variable by
\[
T(t; n) := \frac{1}{\sqrt{\pi n}}\,\frac{1}{\Gamma\!\left(\frac{n}{2}\right)}\,\Gamma\!\left(\frac{n + 1}{2}\right)\left(\frac{1}{1 + \frac{t^2}{n}}\right)^{\frac{n + 1}{2}}. \tag{7.47}
\]


We have the following theorem (Figure 7.2).

Theorem 7.5.1 (t distribution). Let $Z \sim N(0, 1)$ be a standard normal random variable, let $V \sim \chi^2(n)$ be a chi-squared random variable with $n$ degrees of freedom, and assume that $Z$ and $V$ are independent random variables. Then the random variable
\[
T := \frac{Z}{\sqrt{V/n}} \tag{7.48}
\]
is a t random variable with $n$ degrees of freedom.

Proof. We first note that the joint distribution of $Z$ and $V$ has the PDF
\[
p_{Z,V}(z, v) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}z^2\right)\frac{1}{\Gamma\!\left(\frac{n}{2}\right)2^{\frac{n}{2}}} v^{\frac{n}{2} - 1}\exp\left(-\frac{1}{2}v\right). \tag{7.49}
\]
We next consider the transformation
\[
f : \mathbb{R}^2 \to \mathbb{R}^2, \quad (z, v) \mapsto f(z, v) := \left(\frac{z}{\sqrt{v/n}}, v\right) =: (t, w) \tag{7.50}
\]
and use the multivariate PDF transform theorem to derive the PDF of $(t, w)$. To this end, we first recall that if $\xi$ is an $n$-dimensional random vector with PDF $p_\xi$ and $\upsilon := f(\xi)$ for differentiable and bijective $f : \mathbb{R}^n \to \mathbb{R}^n$, then the PDF of $\upsilon$ is given by
\[
p_\upsilon : \mathbb{R}^n \to \mathbb{R}_{\ge 0}, \quad y \mapsto p_\upsilon(y) := \frac{1}{|J^f(f^{-1}(y))|}\, p_\xi\!\left(f^{-1}(y)\right). \tag{7.51}
\]
For the current transformation $f$, we first note that
\[
f^{-1} : \mathbb{R}^2 \to \mathbb{R}^2, \quad (t, w) \mapsto f^{-1}(t, w) := \left(\sqrt{w/n}\, t, w\right), \tag{7.52}
\]
because
\[
f^{-1}(f(z, v)) = f^{-1}\!\left(\frac{z}{\sqrt{v/n}}, v\right) = \left(\frac{\sqrt{v/n}\, z}{\sqrt{v/n}}, v\right) = (z, v) \quad \text{for all } (z, v) \in \mathbb{R}^2. \tag{7.53}
\]
We next note that the determinant of the Jacobian of $f$ evaluates to
\[
|J^f(z, v)| = \begin{vmatrix} \frac{\partial}{\partial z}\frac{z}{\sqrt{v/n}} & \frac{\partial}{\partial v}\frac{z}{\sqrt{v/n}} \\ \frac{\partial}{\partial z}v & \frac{\partial}{\partial v}v \end{vmatrix} = \left(\frac{v}{n}\right)^{-1/2}, \tag{7.54}
\]
such that
\[
\frac{1}{|J^f(f^{-1}(t, w))|} = \left(\frac{w}{n}\right)^{1/2}. \tag{7.55}
\]
Substitution in (7.51) then yields
\[
p_{T,W}(t, w) = \left(\frac{w}{n}\right)^{1/2} p_{Z,V}\!\left(\sqrt{w/n}\, t, w\right), \tag{7.56}
\]
and thus
\[
\begin{aligned}
p_T(t) &= \int_0^\infty p_{T,W}(t, w)\, dw \\
&= \int_0^\infty \left(\frac{w}{n}\right)^{1/2} p_{Z,V}\!\left(\sqrt{w/n}\, t, w\right) dw \\
&= \int_0^\infty \left(\frac{w}{n}\right)^{1/2}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}\left(\sqrt{w/n}\, t\right)^2\right)\frac{1}{\Gamma\!\left(\frac{n}{2}\right)2^{\frac{n}{2}}} w^{\frac{n}{2} - 1}\exp\left(-\frac{1}{2}w\right) dw \\
&= \frac{1}{\sqrt{2\pi}}\frac{1}{\Gamma\!\left(\frac{n}{2}\right)2^{\frac{n}{2}} n^{\frac{1}{2}}}\int_0^\infty \exp\left(-\frac{1}{2}\frac{w}{n}t^2\right)\exp\left(-\frac{1}{2}w\right) w^{\frac{n}{2} - 1} w^{\frac{1}{2}}\, dw \\
&= \frac{1}{\sqrt{2\pi}}\frac{1}{\Gamma\!\left(\frac{n}{2}\right)2^{\frac{n}{2}} n^{\frac{1}{2}}}\int_0^\infty \exp\left(-\frac{1}{2}\left(1 + \frac{t^2}{n}\right)w\right) w^{\frac{n+1}{2} - 1}\, dw.
\end{aligned} \tag{7.57}
\]
We next note that the integrand in eq. (7.57) corresponds to the kernel of a Gamma PDF with parameters $\alpha = \frac{n+1}{2}$ and $\beta = \frac{2}{1 + \frac{t^2}{n}}$. Explicitly, with
\[
G(w; \alpha, \beta) = \frac{1}{\Gamma(\alpha)\beta^\alpha} w^{\alpha - 1}\exp\left(-\frac{w}{\beta}\right), \tag{7.58}
\]
we have
\[
G\!\left(w; \frac{n + 1}{2}, \frac{2}{1 + \frac{t^2}{n}}\right) = \frac{1}{\Gamma\!\left(\frac{n+1}{2}\right)\left(\frac{2}{1 + \frac{t^2}{n}}\right)^{\frac{n+1}{2}}}\exp\left(-\frac{1}{2}\left(1 + \frac{t^2}{n}\right)w\right) w^{\frac{n+1}{2} - 1}. \tag{7.59}
\]
We thus have
\[
p_T(t) = \frac{1}{\sqrt{2\pi}}\frac{1}{\Gamma\!\left(\frac{n}{2}\right)2^{\frac{n}{2}} n^{\frac{1}{2}}}\,\Gamma\!\left(\frac{n + 1}{2}\right)\left(\frac{2}{1 + \frac{t^2}{n}}\right)^{\frac{n+1}{2}}\int_0^\infty G\!\left(w; \frac{n + 1}{2}, \frac{2}{1 + \frac{t^2}{n}}\right) dw. \tag{7.61}
\]
Finally, we note that the integral term of eq. (7.61) corresponds to the normalization of the Gamma PDF and thus evaluates to 1 (cf. Section 2 | Sets, sums, and functions). We thus have
\[
\begin{aligned}
p_T(t) &= \frac{1}{\sqrt{2\pi}}\frac{1}{\Gamma\!\left(\frac{n}{2}\right)2^{\frac{n}{2}} n^{\frac{1}{2}}}\,\Gamma\!\left(\frac{n + 1}{2}\right)\left(\frac{2}{1 + \frac{t^2}{n}}\right)^{\frac{n+1}{2}} \\
&= (2\pi)^{-\frac{1}{2}}\, n^{-\frac{1}{2}}\, 2^{-\frac{n}{2}}\, 2^{\frac{n+1}{2}}\,\frac{1}{\Gamma\!\left(\frac{n}{2}\right)}\,\Gamma\!\left(\frac{n + 1}{2}\right)\left(\frac{1}{1 + \frac{t^2}{n}}\right)^{\frac{n+1}{2}} \\
&= \frac{1}{\sqrt{\pi n}}\,\frac{1}{\Gamma\!\left(\frac{n}{2}\right)}\,\Gamma\!\left(\frac{n + 1}{2}\right)\left(\frac{1}{1 + \frac{t^2}{n}}\right)^{\frac{n+1}{2}},
\end{aligned} \tag{7.62}
\]
which corresponds to the PDF of a t random variable with $n$ degrees of freedom.
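The construction of eq. (7.48) can also be probed numerically. The sketch below (NumPy and SciPy; the degrees of freedom, sample size, and seed are arbitrary illustrative assumptions) forms $T = Z/\sqrt{V/n}$ from simulated $Z$ and $V$ and compares an empirical quantile to the analytical $t(n)$ quantile:

```python
import numpy as np
from scipy.stats import t as t_dist, chi2

rng = np.random.default_rng(5)
n = 8                                     # degrees of freedom, illustrative

Z = rng.standard_normal(200_000)
V = chi2.rvs(n, size=200_000, random_state=rng)
T = Z / np.sqrt(V / n)                    # construction of eq. (7.48)

# Compare empirical and analytical 97.5% quantiles
print(np.quantile(T, 0.975), t_dist.ppf(0.975, n))
```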

7.6 The f distribution

Definition 7.6.1. Let $F$ be a random variable with outcome set $\mathbb{R}_{>0}$ and PDF
\[
p_F : \mathbb{R}_{>0} \to \mathbb{R}_{>0}, \quad f \mapsto p_F(f) := \frac{\Gamma\!\left(\frac{n + m}{2}\right)}{\Gamma\!\left(\frac{n}{2}\right)\Gamma\!\left(\frac{m}{2}\right)}\left(\frac{n}{m}\right)^{\frac{n}{2}}\frac{f^{\frac{n}{2} - 1}}{\left(1 + \frac{n}{m}f\right)^{\frac{n + m}{2}}}. \tag{7.63}
\]
Then $F$ is called an f random variable and is said to be distributed according to an f distribution with $n$ and $m$ degrees of freedom, for which we write $F \sim f(n, m)$. We abbreviate the PDF of an f random variable by
\[
F(f; n, m) := \frac{\Gamma\!\left(\frac{n + m}{2}\right)}{\Gamma\!\left(\frac{n}{2}\right)\Gamma\!\left(\frac{m}{2}\right)}\left(\frac{n}{m}\right)^{\frac{n}{2}}\frac{f^{\frac{n}{2} - 1}}{\left(1 + \frac{n}{m}f\right)^{\frac{n + m}{2}}}. \tag{7.64}
\]

We note the following theorem without proof.

Theorem 7.6.1 (f distribution). Let $V \sim \chi^2(n)$ and $W \sim \chi^2(m)$ be two independent chi-squared random variables with $n$ and $m$ degrees of freedom, respectively. Then the random variable
\[
F := \frac{V/n}{W/m} \tag{7.65}
\]
is an f random variable with $n$ and $m$ degrees of freedom.


7.7 Bibliographic remarks

The material presented in this Section is standard. DeGroot and Schervish (2012) and Casella and Berger (2012) provide comprehensive overviews of the distribution theory for classical frequentist statistics. The basic GLM theory is covered in many books, for example Seber (2015), Christensen (2011), Rao (2002), and Hocking (2003), to name just a few.

7.8 Study questions

1. Write down the PDF of a univariate Gaussian distribution.
2. Write down the PDF of a standard normal distribution.
3. Write down the PDF of a multivariate Gaussian distribution and comment on its components.
4. State the theorem on linear-affine transformations of Gaussian random vectors.
5. State the theorem on independent Gaussian distributions.
6. State the theorem on squared standard normal random variables.
7. State the theorem on t distributions.
8. Write down the GLM in multivariate Gaussian form and comment on its components.
9. Write down the GLM formulation of independent and identically distributed Gaussian samples.
10. Write down the GLM formulation of a simple linear regression model.

8 | Maximum likelihood estimation

Maximum likelihood (ML) estimation is a general principle to derive point estimators in probabilistic models. ML estimation was popularized by Fisher at the beginning of the 20th century, but already found application in the works of Laplace (1749-1827) and Gauss (1777-1855) (Aldrich, 1997). ML estimation is based on the following intuition: the most likely parameter value of a probabilistic model that generated an observed data set should be that parameter value for which the probability of the data under the model is maximal. In this Section, we first make this intuition more precise and introduce the notions of (log) likelihood functions and ML estimators (Section 8.1). We then exemplify the ML approach by discussing ML parameter estimation for univariate Gaussian samples (Section 8.2). Finally, we consider ML parameter estimation for the GLM, relate it to ordinary least squares estimation, and introduce the restricted ML estimator for the variance parameter of the GLM (Section 8.3).

8.1 Likelihood functions and maximum likelihood estimators

Likelihood functions The fundamental idea of ML estimation is to select as a point estimate of the true, but unknown, parameter value that gave rise to the data the parameter value which maximizes the probability of the data under the model of interest. To implement this intuition, the notion of the likelihood function and its maximization is invoked. To introduce the likelihood function, consider a parametric probabilistic model $p_\theta(y)$ which specifies the probability distribution of a random entity $y$. Here, $y$ models data and $\theta$ denotes the model's parameter with parameter space $\Theta$. Given a parametric probabilistic model $p_\theta(y)$, the function
\[
L_y : \Theta \to \mathbb{R}_{\ge 0}, \quad \theta \mapsto L_y(\theta) := p_\theta(y) \tag{8.1}
\]
is called the likelihood function of the parameter $\theta$ for the data $y$. Note that the specific nature of $\theta$ and $y$ is left unspecified, i.e., $\theta$ and $y$ may be scalars, vectors, or matrices. Notably, the likelihood function is a function of the parameter $\theta$, while it also depends on $y$. Because $y$ is a random entity, different data samples from the probabilistic model $p_\theta(y)$ result in different likelihood functions. In this sense, there is a distribution of likelihood functions for each probabilistic model, but once a data realization has been obtained, the likelihood function is a (deterministic) function of the parameter value only. This is in stark contrast with PDFs and PMFs, which are functions of the random variable's outcome values (Section 5 | Probability spaces and random variables). Stated differently, the input argument of a PDF or PMF is the value of a random variable, and its output is the probability density or mass of this value for a fixed value of the model's parameter. In contrast, the input argument of a likelihood function is a parameter value, and its output is the probability density or mass of a fixed value of the random variable modelling data for this parameter value under the probability model of interest. If the random variable value and parameter value submitted to a PDF or PMF of a model and to its corresponding likelihood function are identical, so are the outputs of both functions. It is their functional dependencies that distinguish likelihood functions from PDFs and PMFs, not their functional form.

Maximum likelihood estimators

The ML estimator of a given probabilistic model $p_\theta(y)$ is that parameter value which maximizes the likelihood function. Formally, this can be expressed as
\[
\hat{\theta}_{\mathrm{ML}} := \arg\max_{\theta \in \Theta} L_y(\theta). \tag{8.2}
\]

Eq. (8.2) should be read as follows: $\hat{\theta}_{\mathrm{ML}}$ is defined as that argument of the likelihood function $L_y$ for which $L_y(\theta)$ assumes its maximal value over all possible parameter values $\theta$ in the parameter space $\Theta$. Note that from a mathematical viewpoint, the above definition is not overly general, because it is tacitly assumed that $L_y$ in fact has a maximizing argument and that this argument is unique. Also note that

instead of values for $\hat{\theta}_{\mathrm{ML}}$, one is often interested in functional forms that express $\hat{\theta}_{\mathrm{ML}}$ as a function of the data $y$. Concrete numerical values of $\hat{\theta}_{\mathrm{ML}}$ are referred to as ML estimates, while functional forms of $\hat{\theta}_{\mathrm{ML}}$ are referred to as ML estimators. There are essentially two approaches to ML estimation. The first approach aims to obtain functional forms of ML estimators (sometimes referred to as closed-form solutions) by analytically maximizing the likelihood function with respect to $\theta$. The second approach, often encountered in applied computing, builds on the former and systematically varies $\theta$ given an observation of $y$ while monitoring the numeric value of the likelihood function. Once this value appears to be maximal, the variation of $\theta$ stops, and the resulting value is used as an ML estimate. In the following, we consider the first approach, which is of immediate relevance for basic parameter estimation in the GLM, in more detail. From Section 3 | Calculus, we know that candidate values for the ML estimator $\hat{\theta}_{\mathrm{ML}}$ fulfil the requirement
\[
\frac{d}{d\theta} L_y(\theta)\Big|_{\theta = \hat{\theta}_{\mathrm{ML}}} = 0. \tag{8.3}
\]
Eq. (8.3) is known as the likelihood equation and should be read as follows: at the location of $\hat{\theta}_{\mathrm{ML}}$, the derivative $\frac{d}{d\theta} L_y$ of the likelihood function with respect to $\theta$ is equal to zero. If $\theta \in \mathbb{R}^p$, $p > 1$, the statement implies that at the location of $\hat{\theta}_{\mathrm{ML}}$, the gradient $\nabla L_y$ with respect to $\theta$ is equal to the zero vector $0_p$. Clearly, eq. (8.3) corresponds to the necessary condition for extrema of functions. By evaluating the necessary derivatives of the likelihood function and setting them to zero, one may thus obtain a set of equations which can hopefully be solved for an ML estimator.

The log likelihood function To simplify the analytical approach for finding ML estimators as sketched above, one usually considers the logarithm of the likelihood function, the so-called log likelihood function. The log likelihood function is defined as (cf. eq. (8.1))
\[
\ell_y : \Theta \to \mathbb{R}, \quad \theta \mapsto \ell_y(\theta) := \ln L_y(\theta) = \ln p_\theta(y). \tag{8.4}
\]
Because the logarithm is a monotonically increasing function, the location in parameter space at which the likelihood function assumes its maximal value corresponds to the location in parameter space at which the log likelihood function assumes its maximal value. Using either the likelihood function or the log likelihood function to find a maximum likelihood estimator is thus equivalent, as both will identify the same maximizing value (if it exists). The use of log likelihood functions instead of likelihood functions in ML estimation is primarily of pragmatic nature: first, probabilistic models often involve PDFs with exponential terms that are dissolved by the log transform. Second, independence assumptions often give rise to factorized probability distributions which are simplified to sums by the log transform. Finally, from a numerical perspective, one often deals with PDF or PMF values that are rather close to zero and that are stretched to a broader range by the log transform. In analogy to (8.3), the log likelihood equation for the maximum likelihood estimator is given by
\[
\frac{d}{d\theta} \ell_y(\theta)\Big|_{\theta = \hat{\theta}_{\mathrm{ML}}} = 0. \tag{8.5}
\]
Like eq. (8.3), the log likelihood equation can be extended to multivariate $\theta$ in terms of the gradient of $\ell_y$, and like eq. (8.3), it can be solved for $\hat{\theta}_{\mathrm{ML}}$. We next aim to exemplify the idea of ML estimation in a first example (Section 8.2). To do so, we first discuss two additional assumptions that simplify the application of the ML approach considerably: the assumption of a concave log likelihood function and the assumption of independent data random variables with associated PDFs. Finally, we summarize the ML method in a recipe-like manner.

Concave log likelihood functions If the log likelihood function is concave, then the necessary condition for a maximum of the log likelihood function is also sufficient. Recall that a multivariate real-valued function $f : \mathbb{R}^n \to \mathbb{R}$ is called concave, if for all input arguments $a, b \in \mathbb{R}^n$ the straight line connecting $f(a)$ and $f(b)$ lies below the function's graph. Formally,
\[
f(ta + (1 - t)b) \ge tf(a) + (1 - t)f(b) \quad \text{for } a, b \in \mathbb{R}^n \text{ and } t \in [0, 1]. \tag{8.6}
\]


Here, $ta + (1 - t)b$ for $t \in [0, 1]$ describes a straight line in the domain of the function, while $tf(a) + (1 - t)f(b)$ for $t \in [0, 1]$ describes a straight line in the range of the function. Leaving mathematical subtleties aside, it is roughly correct that concave functions have a single maximum, or in other words, that a critical point at which the gradient vanishes is guaranteed to be a maximum of the function. Thus, if the log likelihood function is concave, finding a parameter value for which the log likelihood equation holds is sufficient to identify a maximum at this location. In principle, whenever applying the ML method based on the log likelihood equation, it is thus necessary to show that the log likelihood function is concave and that the necessary condition for a maximum is hence also sufficient. However, such an approach is beyond the level of rigour herein, and we content ourselves with stating without proof that the log likelihood functions of interest in the following are concave.

Independent data random variables with probability density functions A second assumption that simplifies the application of the ML method is the assumption of independent data random variables with associated PDFs. To this end, we first note that in the case of more than one data point, the data random entity $y$ corresponds to a random vector comprising data random variables $y_1, y_2, ..., y_n$, i.e., $y := (y_1, ..., y_n)^T$. If in addition one assumes that these data variables are independent and each variable is governed by a PDF that is parameterized by the same parameter vector, then the joint PDF of $y$ can be written as the product of the individual PDFs $p_\theta(y_i)$, $i = 1, ..., n$. Formally, we write
\[
p_\theta(y) = p_\theta(y_1, ..., y_n) = \prod_{i=1}^n p_\theta(y_i). \tag{8.7}
\]

Eq. (8.7) may be conceived from two angles: on the one hand, one may think of the random variables yi to be governed by one and the same underlying probability distribution from which samples are obtained with replacement. Alternatively, one may think of each yi to be governed by its individual probability distribution defined in terms of its PDF, all of which are however identical. For our purposes, these two angles are equivalent, while the latter conception seems somewhat closer to the formal developments below.

Crucially, in the case of independent data random variables $y_1, ..., y_n$, the log likelihood function is given by
\[
\ell_y(\theta) = \ln p_\theta(y) = \ln \prod_{i=1}^n p_\theta(y_i). \tag{8.8}
\]
Repeated application of the product property of the logarithm then allows for expressing the log likelihood function as
\[
\ell_y(\theta) = \ln p_\theta(y) = \sum_{i=1}^n \ln p_\theta(y_i). \tag{8.9}
\]

The evaluation of the logarithm of a product of PDFs pθ(yi) is thus simplified to the summation of logarithms of individual PDFs pθ(yi).

Analytical derivation of maximum likelihood estimators In summary, the developments above suggest the following three-step procedure for the analytical derivation of ML estimators in probabilistic models:

(1) Formulation of the log likelihood function. This step corresponds to writing down the log probability density of a set of data random variables under the model of interest. Special attention has to be paid to the number of observable variables considered and their independence properties.

(2) Evaluation of the log likelihood function's gradient. Often, probabilistic models of interest have more than one parameter and ML estimators for each parameter are required, i.e., the partial derivatives of the log likelihood function with respect to the parameters have to be evaluated. This step is usually eased by the use of PDFs that involve exponential terms and the assumption of independent data random variables.

(3) Solution of the log likelihood equations. Under the assumption of concave log likelihood functions, solving the log likelihood equations yields the location of the maximum of the log likelihood function in parameter space. The parameter values thus obtained then correspond to ML estimators.

We next consider an exemplary application of the maximum likelihood method.


8.2 Maximum likelihood estimation for univariate Gaussian distributions

As a first example of the ML method, we consider the case of $n$ independent and identically distributed random variables $y_1, ..., y_n$ with univariate Gaussian distribution and parameter vector $(\mu, \sigma^2) \in \mathbb{R} \times \mathbb{R}_{>0}$. Our aim is to derive ML estimators $\hat{\mu}_{\mathrm{ML}}$ and $\hat{\sigma}^2_{\mathrm{ML}}$ for $\mu$ and $\sigma^2$, respectively. In terms of the general principle discussed above, we thus have $\theta := (\mu, \sigma^2)$ and $\Theta := \mathbb{R} \times \mathbb{R}_{>0}$.

Formulation of the log likelihood function The first step in the application of the ML approach is the formulation of the log likelihood function. For the current example, the distribution of the $i$th random variable $y_i$, $i = 1, ..., n$ is governed by a univariate Gaussian distribution with PDF $N(y_i; \mu, \sigma^2)$ and the random variables $y_1, ..., y_n$ are assumed to be independent. The PDF of the joint outcome value $y = (y_1, ..., y_n)^T \in \mathbb{R}^n$ thus corresponds to the product of the PDFs of each individual $y_i$, $i = 1, ..., n$. This can be written as
\[
p_{\mu, \sigma^2}(y) = \prod_{i=1}^n p_{\mu, \sigma^2}(y_i) = \prod_{i=1}^n N(y_i; \mu, \sigma^2). \tag{8.10}
\]

Because the PDFs of $y_i$, $i = 1, ..., n$ are of the form
\[
N(y_i; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(y_i - \mu)^2\right), \tag{8.11}
\]
we may rewrite (8.10) as
\[
p_{\mu, \sigma^2}(y) = \left(2\pi\sigma^2\right)^{-\frac{n}{2}}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2\right), \tag{8.12}
\]
as shown below.

Proof. With the laws of exponentiation and the exponentiation property of the exponential function, we have
\[
\begin{aligned}
\prod_{i=1}^n N(y_i; \mu, \sigma^2) &= \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(y_i - \mu)^2\right) \\
&= \prod_{i=1}^n \left(2\pi\sigma^2\right)^{-\frac{1}{2}} \prod_{i=1}^n \exp\left(-\frac{1}{2\sigma^2}(y_i - \mu)^2\right) \\
&= \left(2\pi\sigma^2\right)^{-\frac{n}{2}}\exp\left(-\sum_{i=1}^n \frac{1}{2\sigma^2}(y_i - \mu)^2\right) \\
&= \left(2\pi\sigma^2\right)^{-\frac{n}{2}}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2\right).
\end{aligned} \tag{8.13}
\]

Based on eq. (8.12), we can write down the likelihood function as
\[
L_y : \mathbb{R} \times \mathbb{R}_{>0} \to \mathbb{R}_{>0}, \quad (\mu, \sigma^2) \mapsto L_y(\mu, \sigma^2) := \left(2\pi\sigma^2\right)^{-\frac{n}{2}}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2\right) \tag{8.14}
\]
and, as shown below, the corresponding log likelihood function evaluates to
\[
\ell_y : \mathbb{R} \times \mathbb{R}_{>0} \to \mathbb{R}, \quad (\mu, \sigma^2) \mapsto \ell_y(\mu, \sigma^2) := -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln \sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2. \tag{8.15}
\]


Proof. With the properties of the logarithm, we have
\[
\begin{aligned}
\ln L_y(\mu, \sigma^2) &= \ln\left(\left(2\pi\sigma^2\right)^{-\frac{n}{2}}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2\right)\right) \\
&= \ln\left(\left(2\pi\sigma^2\right)^{-\frac{n}{2}}\right) + \ln\left(\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2\right)\right) \\
&= -\frac{n}{2}\ln\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2 \\
&= -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln \sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2.
\end{aligned} \tag{8.16}
\]

Evaluation of the log likelihood function's gradient The second step in the analytical derivation of ML estimators is the evaluation of the gradient
\[
\nabla \ell_y(\mu, \sigma^2) = \begin{pmatrix} \frac{\partial}{\partial \mu}\ell_y(\mu, \sigma^2) \\ \frac{\partial}{\partial \sigma^2}\ell_y(\mu, \sigma^2) \end{pmatrix}. \tag{8.17}
\]
As shown below, for the partial derivative with respect to $\mu$, we have
\[
\frac{\partial}{\partial \mu}\ell_y(\mu, \sigma^2) = \frac{1}{\sigma^2}\sum_{i=1}^n (y_i - \mu), \tag{8.18}
\]
and for the partial derivative with respect to $\sigma^2$, we have
\[
\frac{\partial}{\partial \sigma^2}\ell_y(\mu, \sigma^2) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (y_i - \mu)^2. \tag{8.19}
\]

Proof. With the summation and chain rules of differential calculus, we have
\[
\begin{aligned}
\frac{\partial}{\partial \mu}\ell_y(\mu, \sigma^2) &= \frac{\partial}{\partial \mu}\left(-\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln \sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2\right) \\
&= -\frac{\partial}{\partial \mu}\left(\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2\right) \\
&= -\frac{1}{2\sigma^2}\sum_{i=1}^n \frac{\partial}{\partial \mu}(y_i - \mu)^2 \\
&= -\frac{1}{2\sigma^2}\sum_{i=1}^n 2(y_i - \mu)\frac{\partial}{\partial \mu}(-\mu) \\
&= \frac{1}{\sigma^2}\sum_{i=1}^n (y_i - \mu),
\end{aligned} \tag{8.20}
\]
and with the form of the derivative of the logarithm, we have
\[
\begin{aligned}
\frac{\partial}{\partial \sigma^2}\ell_y(\mu, \sigma^2) &= \frac{\partial}{\partial \sigma^2}\left(-\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln \sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2\right) \\
&= -\frac{n}{2}\frac{\partial}{\partial \sigma^2}\ln \sigma^2 - \frac{1}{2}\frac{\partial}{\partial \sigma^2}\left(\sigma^2\right)^{-1}\sum_{i=1}^n (y_i - \mu)^2 \\
&= -\frac{n}{2}\frac{1}{\sigma^2} - \frac{1}{2}\left(-\left(\sigma^2\right)^{-2}\right)\sum_{i=1}^n (y_i - \mu)^2 \\
&= -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (y_i - \mu)^2.
\end{aligned} \tag{8.21}
\]


Solution of the log likelihood equations

With the above, the log likelihood equations corresponding to $\nabla \ell_y(\mu, \sigma^2) = 0$ are given by
\[
\begin{aligned}
\frac{1}{\sigma^2}\sum_{i=1}^n (y_i - \mu) &= 0 \\
-\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (y_i - \mu)^2 &= 0.
\end{aligned} \tag{8.22}
\]
Notably, these log likelihood equations exhibit a dependence between the ML estimator for $\mu$ and the ML estimator for $\sigma^2$, because both parameters appear in both equations. To solve the log likelihood equations for $\hat{\mu}_{\mathrm{ML}}$ and $\hat{\sigma}^2_{\mathrm{ML}}$, a standard approach is to first solve the first log likelihood equation for $\hat{\mu}_{\mathrm{ML}}$ and then use the solution to solve the second log likelihood equation for $\hat{\sigma}^2_{\mathrm{ML}}$. As shown below, this yields
\[
\hat{\mu}_{\mathrm{ML}} = \frac{1}{n}\sum_{i=1}^n y_i \tag{8.23}
\]
and
\[
\hat{\sigma}^2_{\mathrm{ML}} = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{\mu}_{\mathrm{ML}})^2. \tag{8.24}
\]

Proof. The first log likelihood equation implies that $\sigma^{-2}$ or $\sum_{i=1}^n (y_i - \hat{\mu}_{\mathrm{ML}})$ is equal to zero. Because by definition $\sigma^2 > 0$ and thus $\sigma^{-2} > 0$, the equation can only hold if $\sum_{i=1}^n (y_i - \hat{\mu}_{\mathrm{ML}})$ equals zero. We thus have
\[
\begin{aligned}
\sum_{i=1}^n (y_i - \hat{\mu}_{\mathrm{ML}}) = 0
&\Leftrightarrow \sum_{i=1}^n y_i - \sum_{i=1}^n \hat{\mu}_{\mathrm{ML}} = 0 \\
&\Leftrightarrow \sum_{i=1}^n y_i - n\hat{\mu}_{\mathrm{ML}} = 0 \\
&\Leftrightarrow \hat{\mu}_{\mathrm{ML}} = \frac{1}{n}\sum_{i=1}^n y_i.
\end{aligned} \tag{8.25}
\]
To find the maximum likelihood estimator for $\sigma^2$, we substitute this result in the second log likelihood equation and solve for $\hat{\sigma}^2_{\mathrm{ML}}$:
\[
\begin{aligned}
-\frac{n}{2\hat{\sigma}^2_{\mathrm{ML}}} + \frac{1}{2\hat{\sigma}^4_{\mathrm{ML}}}\sum_{i=1}^n (y_i - \hat{\mu}_{\mathrm{ML}})^2 = 0
&\Leftrightarrow \frac{1}{2\hat{\sigma}^4_{\mathrm{ML}}}\sum_{i=1}^n (y_i - \hat{\mu}_{\mathrm{ML}})^2 = \frac{n}{2\hat{\sigma}^2_{\mathrm{ML}}} \\
&\Leftrightarrow \sum_{i=1}^n (y_i - \hat{\mu}_{\mathrm{ML}})^2 = \frac{2n\hat{\sigma}^4_{\mathrm{ML}}}{2\hat{\sigma}^2_{\mathrm{ML}}} = n\hat{\sigma}^2_{\mathrm{ML}} \\
&\Leftrightarrow \hat{\sigma}^2_{\mathrm{ML}} = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{\mu}_{\mathrm{ML}})^2.
\end{aligned} \tag{8.26}
\]

Notably, the ML estimator $\hat{\mu}_{\mathrm{ML}}$ corresponds to the sample mean
\[
\bar{y} := \frac{1}{n}\sum_{i=1}^n y_i. \tag{8.27}
\]
On the other hand, the ML estimator $\hat{\sigma}^2_{\mathrm{ML}}$ does not correspond to the sample variance
\[
s^2 := \frac{1}{n - 1}\sum_{i=1}^n (y_i - \bar{y})^2. \tag{8.28}
\]
While the sample variance is a bias-free estimator of $\sigma^2$, the ML estimator is not.
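The estimators (8.23), (8.24), and the sample variance (8.28) are straightforward to evaluate numerically. The sketch below (NumPy; the true parameter values, sample size, and seed are arbitrary illustrative assumptions) highlights the difference between the $1/n$ scaling of the ML variance estimator and the $1/(n-1)$ scaling of the sample variance:

```python
import numpy as np

rng = np.random.default_rng(6)
y = rng.normal(3.0, 2.0, size=100)        # illustrative sample, mu = 3, sigma = 2

mu_ml = y.mean()                          # ML estimator, eq. (8.23)
sigma2_ml = np.mean((y - mu_ml) ** 2)     # ML estimator, eq. (8.24): divides by n
s2 = y.var(ddof=1)                        # sample variance, eq. (8.28): divides by n - 1

print(mu_ml, sigma2_ml, s2)
```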


8.3 ML estimation of GLM parameters

We have previously seen that the GLM can be expressed in PDF form as
\[
p_{\beta, \sigma^2}(y) = N\!\left(y; X\beta, \sigma^2 I_n\right). \tag{8.29}
\]
We now turn to the problem of finding ML estimates $\hat{\beta}$ and $\hat{\sigma}^2$ for $\beta$ and $\sigma^2$, respectively.

Beta parameter estimation We first note that, assuming a known value $\sigma^2 > 0$, the likelihood function for the beta parameter is
\[
L_y : \mathbb{R}^p \to \mathbb{R}_{>0}, \quad \beta \mapsto L_y(\beta) := \left(2\pi\sigma^2\right)^{-\frac{n}{2}}\exp\left(-\frac{1}{2\sigma^2}(y - X\beta)^T(y - X\beta)\right). \tag{8.30}
\]
Logarithmic transformation yields the corresponding log likelihood function
\[
\ell_y : \mathbb{R}^p \to \mathbb{R}, \quad \beta \mapsto \ell_y(\beta) := -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln \sigma^2 - \frac{1}{2\sigma^2}(y - X\beta)^T(y - X\beta). \tag{8.31}
\]

As shown below, the necessary condition for a maximum of $\ell_y$ is equivalent to
\[
X^T X\beta = X^T y. \tag{8.32}
\]
Eq. (8.32) is a set of $p$ linear equations known as the normal equations. If it is assumed that the matrix $X^T X$ is invertible, then the normal equations can readily be solved for the ML beta parameter estimate
\[
\hat{\beta} = \left(X^T X\right)^{-1} X^T y. \tag{8.33}
\]
It is a basic exercise in linear algebra to prove that the invertibility of $X^T X$ is given, if the design matrix $X \in \mathbb{R}^{n \times p}$ is of full column-rank $p$. Experimental designs yielding full column-rank design matrices hence allow for the unique identification of the GLM beta parameter maximum likelihood estimate. We will refer to eq. (8.33) as the beta parameter estimator.
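Eq. (8.33) translates directly into code. The sketch below (NumPy; the design, true parameter values, and seed are arbitrary illustrative assumptions) evaluates the beta parameter estimator for a simple linear regression design and, as a cross-check, also solves the same least-squares problem with `np.linalg.lstsq`, which is generally the numerically preferable route compared with explicitly inverting $X^T X$:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10
X = np.column_stack([np.ones(n), np.arange(n, dtype=float)])   # simple linear regression design
beta_true = np.array([1.0, 1.0])                               # illustrative true values
y = X @ beta_true + rng.standard_normal(n)

# Beta parameter estimator of eq. (8.33)
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# The same normal equations solved by a standard least-squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat, beta_lstsq)
```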

Proof. The equivalence of the necessary condition for a maximum of $\ell_y$ and eq. (8.32) derives from the following considerations: the necessary condition for a maximum of $\ell_y$ is
\[
\nabla \ell_y(\beta) = 0_p, \tag{8.34}
\]
which implies that
\[
\frac{\partial}{\partial \beta_j}\ell_y(\beta) = 0 \quad \text{for all } j = 1, ..., p. \tag{8.35}
\]
Given the functional form of $\ell_y$, eq. (8.35) is equivalent to
\[
-\frac{1}{2\sigma^2}\frac{\partial}{\partial \beta_j}(y - X\beta)^T(y - X\beta) = 0 \quad \text{for all } j = 1, ..., p. \tag{8.36}
\]
Because $\sigma^2 > 0$, eq. (8.36) in turn is equivalent to
\[
\frac{1}{2}\frac{\partial}{\partial \beta_j}(y - X\beta)^T(y - X\beta) = 0 \quad \text{for all } j = 1, ..., p. \tag{8.37}
\]


We next consider the partial derivative in eq. (8.37) for a selected $j \in \{1, ..., p\}$ in more detail. We have
\[
\begin{aligned}
\frac{1}{2}\frac{\partial}{\partial \beta_j}(y - X\beta)^T(y - X\beta)
&= \frac{1}{2}\frac{\partial}{\partial \beta_j}\sum_{i=1}^n \left(y_i - (X\beta)_i\right)^2 \\
&= \sum_{i=1}^n \left(y_i - (X\beta)_i\right)\frac{\partial}{\partial \beta_j}\left(y_i - (X\beta)_i\right) \\
&= -\sum_{i=1}^n \left(y_i - (X\beta)_i\right)\frac{\partial}{\partial \beta_j}(X\beta)_i \\
&= -\sum_{i=1}^n \left(y_i - (X\beta)_i\right)\frac{\partial}{\partial \beta_j}\left(x_{i1}\beta_1 + \cdots + x_{ij}\beta_j + \cdots + x_{ip}\beta_p\right) \\
&= -\sum_{i=1}^n \left(y_i - x_{i1}\beta_1 - \cdots - x_{ij}\beta_j - \cdots - x_{ip}\beta_p\right)x_{ij} \\
&= -\sum_{i=1}^n \left(x_{ij}y_i - x_{ij}x_{i1}\beta_1 - \cdots - x_{ij}x_{ij}\beta_j - \cdots - x_{ij}x_{ip}\beta_p\right) \\
&= -\sum_{i=1}^n x_{ij}y_i + \sum_{i=1}^n x_{ij}x_{i1}\beta_1 + \cdots + \sum_{i=1}^n x_{ij}x_{ij}\beta_j + \cdots + \sum_{i=1}^n x_{ij}x_{ip}\beta_p.
\end{aligned} \tag{8.38}
\]
From eq. (8.37), we thus have
\[
\sum_{i=1}^n x_{ij}x_{i1}\beta_1 + \cdots + \sum_{i=1}^n x_{ij}x_{ij}\beta_j + \cdots + \sum_{i=1}^n x_{ij}x_{ip}\beta_p = \sum_{i=1}^n x_{ij}y_i \quad \text{for all } j = 1, ..., p. \tag{8.39}
\]
Summarizing these $p$ equations in vector format then results in
\[
\begin{pmatrix}
\sum_{i=1}^n x_{i1}x_{i1}\beta_1 + \sum_{i=1}^n x_{i1}x_{i2}\beta_2 + \cdots + \sum_{i=1}^n x_{i1}x_{ip}\beta_p \\
\sum_{i=1}^n x_{i2}x_{i1}\beta_1 + \sum_{i=1}^n x_{i2}x_{i2}\beta_2 + \cdots + \sum_{i=1}^n x_{i2}x_{ip}\beta_p \\
\vdots \\
\sum_{i=1}^n x_{ip}x_{i1}\beta_1 + \sum_{i=1}^n x_{ip}x_{i2}\beta_2 + \cdots + \sum_{i=1}^n x_{ip}x_{ip}\beta_p
\end{pmatrix}
=
\begin{pmatrix}
\sum_{i=1}^n x_{i1}y_i \\
\sum_{i=1}^n x_{i2}y_i \\
\vdots \\
\sum_{i=1}^n x_{ip}y_i
\end{pmatrix}. \tag{8.40}
\]
Furthermore, we may rewrite the left-hand side of eq. (8.40) as
\[
\begin{pmatrix}
x_{11} & x_{21} & \cdots & x_{n1} \\
x_{12} & x_{22} & \cdots & x_{n2} \\
\vdots & \vdots & \ddots & \vdots \\
x_{1p} & x_{2p} & \cdots & x_{np}
\end{pmatrix}
\begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np}
\end{pmatrix}
\begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{pmatrix}
= X^T X\beta. \tag{8.41}
\]
Similarly, we may rewrite the right-hand side of eq. (8.40) as
\[
\begin{pmatrix}
\sum_{i=1}^n x_{i1}y_i \\
\sum_{i=1}^n x_{i2}y_i \\
\vdots \\
\sum_{i=1}^n x_{ip}y_i
\end{pmatrix}
=
\begin{pmatrix}
x_{11} & x_{21} & \cdots & x_{n1} \\
x_{12} & x_{22} & \cdots & x_{n2} \\
\vdots & \vdots & \ddots & \vdots \\
x_{1p} & x_{2p} & \cdots & x_{np}
\end{pmatrix}
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
= X^T y. \tag{8.42}
\]
The normal equations then follow directly from eqs. (8.40), (8.41), and (8.42).

Ordinary least squares beta parameter estimation A popular alternative approach for estimating the parameters of GLMs is the ordinary least squares (OLS) method. The idea of OLS estimation is to minimize the squared distance between observed data points

and the data predicted by the GLM. In contrast to the ML approach, OLS estimation does not depend on any specific parametric, nor even any probabilistic, assumptions about the data. Below, we discuss the equivalence of ML and OLS estimators for GLM beta parameters. While both approaches result in the same beta parameter estimator, OLS estimation does not lend itself to probabilistic inference, because its ensuing estimates are endowed neither with a Frequentist nor a Bayesian distributional theory. To show the equivalence of ML and OLS beta parameter estimation, we first consider OLS estimation. In OLS estimation, the aim is to minimize the sum of error squares (SES), defined as
\[
\mathrm{SES} := \sum_{i=1}^n \left(y_i - (X\beta)_i\right)^2, \tag{8.43}
\]
where $(X\beta)_i$ denotes the $i$th row of $X\beta$. $(X\beta)_i$ is the GLM prediction of $y_i$ and depends on the value of $\beta$. The sum of all squared prediction errors $y_i - (X\beta)_i$ forms the SES. Clearly, due to the quadratic terms, it holds that $\mathrm{SES} \ge 0$. We next reconsider the likelihood function of the GLM beta parameter for known $\sigma^2 > 0$ (cf. eq. (8.30)),
\[
L_y(\beta) = \left(2\pi\sigma^2\right)^{-\frac{n}{2}}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n \left(y_i - (X\beta)_i\right)^2\right). \tag{8.44}
\]

As is readily apparent, the likelihood function $L_y$ comprises the SES in its exponential term. Because the SES is non-negative and enters the functional form of $L_y$ with a minus sign, the exponential term of $L_y$ becomes maximal if the squared deviations between model prediction and data values become minimal. In other words, the GLM likelihood function for the beta parameter is maximized if the SES is minimized. In effect, irrespective of whether the OLS or ML method is employed to derive the GLM beta parameter estimator, the resulting beta parameter estimators are identical.

Maximum likelihood variance parameter estimation Finally, to derive the ML estimator for the GLM variance parameter $\sigma^2$, we proceed as follows: we substitute $\hat{\beta}$ in the GLM log likelihood function and then maximize the resulting log likelihood function with respect to $\sigma^2$. Substitution of $\hat{\beta}$ in eq. (8.31) renders the GLM log likelihood function a function of $\sigma^2 > 0$ only,
\[
\ell_{y, \hat{\beta}} : \mathbb{R}_{>0} \to \mathbb{R}, \quad \sigma^2 \mapsto \ell_{y, \hat{\beta}}(\sigma^2) := -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln \sigma^2 - \frac{1}{2\sigma^2}\left(y - X\hat{\beta}\right)^T\left(y - X\hat{\beta}\right), \tag{8.45}
\]
and, as shown below, the derivative of $\ell_{y, \hat{\beta}}$ with respect to $\sigma^2$ evaluates to
\[
\frac{d}{d\sigma^2}\ell_{y, \hat{\beta}}(\sigma^2) = -\frac{1}{2}\frac{n}{\sigma^2} + \frac{1}{2}\frac{1}{(\sigma^2)^2}\left(y - X\hat{\beta}\right)^T\left(y - X\hat{\beta}\right). \tag{8.46}
\]

Proof. We have
\[
\begin{aligned}
\frac{d}{d\sigma^2}\ell_{y, \hat{\beta}}(\sigma^2) &= \frac{d}{d\sigma^2}\left(-\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln \sigma^2 - \frac{1}{2\sigma^2}\left(y - X\hat{\beta}\right)^T\left(y - X\hat{\beta}\right)\right) \\
&= -\frac{n}{2}\frac{1}{\sigma^2} - \frac{1}{2}\left(y - X\hat{\beta}\right)^T\left(y - X\hat{\beta}\right)\frac{d}{d\sigma^2}\left(\sigma^2\right)^{-1} \\
&= -\frac{n}{2}\frac{1}{\sigma^2} - \frac{1}{2}\left(y - X\hat{\beta}\right)^T\left(y - X\hat{\beta}\right)(-1)\left(\sigma^2\right)^{-2} \\
&= -\frac{1}{2}\frac{n}{\sigma^2} + \frac{1}{2}\frac{1}{(\sigma^2)^2}\left(y - X\hat{\beta}\right)^T\left(y - X\hat{\beta}\right).
\end{aligned} \tag{8.47}
\]

Finally, as shown below, setting $\frac{d}{d\sigma^2}\ell_{y, \hat{\beta}}$ to zero and solving for $\hat{\sigma}^2$ yields
\[
\hat{\sigma}^2 = \frac{1}{n}\left(y - X\hat{\beta}\right)^T\left(y - X\hat{\beta}\right). \tag{8.48}
\]


Proof. We have
\[
\begin{aligned}
\frac{d}{d\sigma^2}\ell_{y, \hat{\beta}}(\hat{\sigma}^2) = 0
&\Leftrightarrow -\frac{1}{2}\frac{n}{\hat{\sigma}^2} + \frac{1}{2}\frac{1}{(\hat{\sigma}^2)^2}\left(y - X\hat{\beta}\right)^T\left(y - X\hat{\beta}\right) = 0 \\
&\Leftrightarrow \frac{1}{2}\frac{1}{(\hat{\sigma}^2)^2}\left(y - X\hat{\beta}\right)^T\left(y - X\hat{\beta}\right) = \frac{1}{2}\frac{n}{\hat{\sigma}^2} \\
&\Leftrightarrow \left(y - X\hat{\beta}\right)^T\left(y - X\hat{\beta}\right) = \frac{(\hat{\sigma}^2)^2 n}{\hat{\sigma}^2} = n\hat{\sigma}^2 \\
&\Leftrightarrow \hat{\sigma}^2 = \frac{1}{n}\left(y - X\hat{\beta}\right)^T\left(y - X\hat{\beta}\right).
\end{aligned} \tag{8.49}
\]

A couple of aspects of eq. (8.48) are noteworthy. First, the term $y - X\hat{\beta}$ corresponds to the difference between the data $y$ and the GLM data prediction $X\hat{\beta}$. The variance parameter estimate $\hat{\sigma}^2$ thus corresponds to a scaled version of the residual sum of squares (RSS), defined as
\[
\mathrm{RSS} := \sum_{i=1}^n \left(y_i - (X\hat{\beta})_i\right)^2. \tag{8.50}
\]
Second, the ML estimator of $\sigma^2$ is biased: it can be shown that the expected value of $\frac{1}{n}(y - X\hat{\beta})^T(y - X\hat{\beta})$ is smaller than $\sigma^2$. In other words, the ML variance parameter estimator underestimates the true, but unknown, GLM variance parameter. This can, however, readily be rectified by dividing the RSS not by $n$, but by $n - p$. This yields the so-called restricted maximum likelihood (ReML) estimator of the GLM variance parameter, defined as
\[
\hat{\sigma}^2 := \frac{(y - X\hat{\beta})^T(y - X\hat{\beta})}{n - p}. \tag{8.51}
\]
In the following, we will hence consider the ReML estimator for estimating $\sigma^2$. For simplicity, we shall refer to (8.51) as the variance parameter estimator. An introduction to the origin of the estimation bias of the ML variance parameter estimator and to the concept of ReML is beyond the scope of a basic introduction to the GLM.
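The beta parameter estimator (8.33) and the ReML variance parameter estimator (8.51) can be collected in a small helper function. The following minimal sketch (NumPy; the function name `glm_estimators`, the design, the true parameter values, and the seed are illustrative assumptions, not notation from the text) shows one possible implementation:

```python
import numpy as np

def glm_estimators(y, X):
    """Beta parameter estimator (8.33) and ReML variance estimator (8.51)."""
    n, p = X.shape
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # solve the normal equations
    residuals = y - X @ beta_hat
    sigma2_hat = (residuals @ residuals) / (n - p)
    return beta_hat, sigma2_hat

# Illustrative usage with a simple linear regression design
rng = np.random.default_rng(8)
n = 20
X = np.column_stack([np.ones(n), np.arange(n, dtype=float)])
y = X @ np.array([1.0, 0.5]) + rng.standard_normal(n)
print(glm_estimators(y, X))
```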

8.4 Example (Independent and identically distributed Gaussian samples)

As a first application of the beta and variance parameter estimators introduced above, we consider the GLM scenario of $n$ independent and identically distributed Gaussian samples,
\[
y_i \sim N(\mu, \sigma^2) \text{ for } i = 1, ..., n, \tag{8.52}
\]
which, as discussed in Section 7 | Probability distributions, corresponds to the GLM
\[
y \sim N(X\beta, \sigma^2 I_n), \quad \text{where } X := 1_n \in \mathbb{R}^{n \times 1}, \; \beta := \mu, \text{ and } \sigma^2 > 0. \tag{8.53}
\]
As shown below, for the model specified in eq. (8.53), the beta and variance parameter estimators evaluate to
\[
\hat{\beta} = \frac{1}{n}\sum_{i=1}^n y_i = \bar{y} \quad \text{and} \quad \hat{\sigma}^2 = \frac{1}{n - 1}\sum_{i=1}^n (y_i - \bar{y})^2 = s^2, \tag{8.54}
\]
respectively. In the GLM scenario of independent and identically distributed Gaussian samples, the beta and variance parameter estimators are thus identical to the sample mean and sample variance of the random sample $y_1, ..., y_n \sim N(\mu, \sigma^2)$.

Proof. The beta parameter estimator evaluates to
\[
\hat{\beta} = (X^T X)^{-1}X^T y = \left(\begin{pmatrix} 1 & \cdots & 1 \end{pmatrix}\begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}\right)^{-1}\begin{pmatrix} 1 & \cdots & 1 \end{pmatrix}\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = n^{-1}\sum_{i=1}^n y_i = \frac{1}{n}\sum_{i=1}^n y_i = \bar{y}. \tag{8.55}
\]


The variance parameter estimator is given by
\[
\begin{aligned}
\hat{\sigma}^2 &= \frac{1}{n - 1}\left(y - X\hat{\beta}\right)^T\left(y - X\hat{\beta}\right) \\
&= \frac{1}{n - 1}\left(\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} - \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}\frac{1}{n}\sum_{i=1}^n y_i\right)^T\left(\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} - \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix}\frac{1}{n}\sum_{i=1}^n y_i\right) \\
&= \frac{1}{n - 1}\begin{pmatrix} y_1 - \frac{1}{n}\sum_{i=1}^n y_i \\ \vdots \\ y_n - \frac{1}{n}\sum_{i=1}^n y_i \end{pmatrix}^T\begin{pmatrix} y_1 - \frac{1}{n}\sum_{i=1}^n y_i \\ \vdots \\ y_n - \frac{1}{n}\sum_{i=1}^n y_i \end{pmatrix} \\
&= \frac{1}{n - 1}\sum_{i=1}^n \left(y_i - \frac{1}{n}\sum_{j=1}^n y_j\right)^2 \\
&= \frac{1}{n - 1}\sum_{i=1}^n (y_i - \bar{y})^2 \\
&= s^2.
\end{aligned} \tag{8.56}
\]

8.5 Bibliographic remarks

Treatments of ML and GLM estimation theory can be found in virtually all statistical textbooks. Seber (2015) and Christensen (2011) provide comprehensive introductions for the GLM.

8.6 Study questions

1. Write down the general form of a likelihood function and name its components.
2. Write down the general form of a log likelihood function and name its components.
3. Write down the general form of an ML estimator and explain it.
4. Discuss commonalities and differences between OLS and ML beta parameter estimators.
5. Write down the formula of the GLM ML beta parameter estimator and name its components.
6. Write down the formula of the GLM ML variance parameter estimator and name its components.
7. Write down the formula of the GLM ReML variance parameter estimator and name its components.
8. Define the sum of error squares (SES) and the residual sum of squares (RSS) and discuss their commonalities and differences.
9. Write down the GLM incarnation of independent and identically distributed sampling from a univariate Gaussian distribution as well as the ensuing expectation and variance parameter estimators.

9 | Frequentist distribution theory

9.1 Introduction

Let $y \sim N(X\beta, \sigma^2 I_n)$ denote the GLM. The Frequentist distribution theory of the GLM unfolds against the background of the following intuition: it is assumed that there exist true, but unknown, parameter values $\beta$ and $\sigma^2$. An observed $n$-dimensional data set $y$ is conceptualized as a realization of the multivariate Gaussian distribution $N(X\beta, \sigma^2 I_n)$ governed by these true, but unknown, parameter values. Hypothetically, however, $N(X\beta, \sigma^2 I_n)$ can be sampled many times, generating a set of independent $n$-dimensional data set realizations, say, $y^{(1)}, y^{(2)}, ...$ For each of these hypothetical data sets, beta and variance parameter estimators can be evaluated, resulting in a set of independent GLM parameter estimate realizations, say, $(\hat{\beta}, \hat{\sigma}^2)^{(1)}, (\hat{\beta}, \hat{\sigma}^2)^{(2)}, ...$ Clearly, due to the variable nature of the hypothetical data set realizations, these hypothetical parameter estimate realizations also vary and exhibit probability distributions themselves. At the same time, the distributions of these random variables exhibit structure that depends on the true, but unknown, parameters of the data distribution. For example, if the true, but unknown, variance parameter $\sigma^2$ of the GLM is large and data realizations vary a lot, it is likely that beta parameter estimates also vary a lot. Similarly, statistics such as the $T$- and $F$-statistics, which, as introduced below, are functions of the GLM parameter estimates, will exhibit probability distributions. In more mundane terms, because $y$ is a random vector, any functions of $y$, such as $\hat{\beta}$ and $\hat{\sigma}^2$, or $T$- and $F$-statistics, are random variables. Due to the fact that $y$ is Gaussian distributed, as well as the relatively simple nature of the functional forms of these parameter estimators and statistics, it is possible to derive parametric forms for the distributions governing $\hat{\beta}$, $\hat{\sigma}^2$, $T$, and $F$. In this Section, we state the parametric forms of these distributions and show how they result from the combination of the Gaussian distribution of $y$ and the functional form of the respective parameter estimator or statistic. From a data-analytical perspective, it is important to realize that these distributions are "hypothetical", in the sense that in a concrete data analysis scenario, only a single $n$-dimensional data set is observed, and thus only single beta and variance parameter estimates and $T$- or $F$-statistics can be evaluated. As will be discussed in subsequent sections, frequentist parameter and model inference techniques, such as $T$- and $F$-tests, evaluate these actual parameter estimate or statistic observations in the light of the distributions introduced in the current Section.

9.2 Beta parameter estimates

We have the following theorem:

Theorem 9.2.1 (Beta parameter estimate distribution). For $X \in \mathbb{R}^{n \times p}$, $\beta \in \mathbb{R}^p$, and $\sigma^2 > 0$, let $y \sim N(X\beta, \sigma^2 I_n)$ denote the GLM and let $\hat{\beta} = (X^T X)^{-1}X^T y$ denote the beta parameter estimator. Then
\[
\hat{\beta} \sim N\!\left(\beta, \sigma^2\left(X^T X\right)^{-1}\right). \tag{9.1}
\]



Proof. The theorem follows with the linear-affine transformation theorem for Gaussian random vectors (Section 7 | Probability distributions). Specifically, for the current scenario, the linear-affine transformation theorem for Gaussian distributions states that
\[
\hat{\beta} \sim N\!\left((X^T X)^{-1}X^T X\beta, \; (X^T X)^{-1}X^T (\sigma^2 I_n)\left((X^T X)^{-1}X^T\right)^T\right). \tag{9.2}
\]

The expectation parameter of this distribution can be simplified to

\[
(X^T X)^{-1}X^T X\beta = \beta, \tag{9.3}
\]
and the covariance matrix parameter can be simplified according to
\[
\begin{aligned}
(X^T X)^{-1}X^T (\sigma^2 I_n)\left((X^T X)^{-1}X^T\right)^T &= (X^T X)^{-1}X^T (\sigma^2 I_n)X(X^T X)^{-1} \\
&= \sigma^2 (X^T X)^{-1}X^T X(X^T X)^{-1} \\
&= \sigma^2 (X^T X)^{-1}.
\end{aligned} \tag{9.4}
\]
Here, the first equality follows from the fact that both $X^T X$ and its inverse $(X^T X)^{-1}$ are symmetric matrices.

Figure 9.1. Beta parameter estimates distribution for a simple linear regression model. (A) Ten data samples of a simple linear regression model with independent variable values $x_{i2} = i - 1$, $i = 1, ..., 10$, and true, but unknown, parameter values $\beta := (1, 1)^T$ and $\sigma^2 := 1$. (B) Beta parameter estimates corresponding to the data samples in (A). (C) Probability density function of $\hat{\beta}$. (D) Probability density functions of $\hat{\beta}$ for different simple linear regression design matrix specifications.

Theorem 9.2.1 states that the beta parameter estimate $\hat{\beta}$ is distributed according to a $p$-dimensional Gaussian distribution with expectation parameter $\beta$ and covariance matrix parameter $\sigma^2(X^T X)^{-1}$. The expectation of the beta parameter estimator thus corresponds to the true, but unknown, beta parameter, while the covariance of the beta parameter estimator is proportional to the true, but unknown, variance parameter. Moreover, the covariance of the beta parameter estimator also depends on the inverse of the design matrix product $X^T X$. In an applied context, this allows for minimizing the beta parameter estimate covariance by adapting the design matrix in such a manner that $(X^T X)^{-1}$ is minimized.

Example 1 (Independent and identically distributed Gaussian samples). Consider the GLM scenario of $n$ independent and identically distributed Gaussian samples,
\[
y \sim N(X\beta, \sigma^2 I_n) \text{ with } X := 1_n \in \mathbb{R}^{n \times 1}, \; \beta := \mu \in \mathbb{R}, \; \sigma^2 > 0. \tag{9.5}
\]
We have seen in Section 8 | Maximum likelihood estimation that $\hat{\beta} = \bar{y} = \frac{1}{n}\sum_{i=1}^n y_i$. Theorem 9.2.1 implies that
\[
\bar{y} \sim N\!\left(\mu, \frac{\sigma^2}{n}\right). \tag{9.6}
\]
The sample mean of $n$ independent and identically distributed Gaussian random variables with expectation parameter $\mu$ and variance parameter $\sigma^2$ is thus distributed according to a Gaussian distribution with expectation parameter $\mu$ and variance parameter $\sigma^2/n$.
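The hypothetical repeated-sampling perspective behind eq. (9.6) is easy to emulate numerically. The sketch below (NumPy; the number of hypothetical data sets, the parameter values, and the seed are arbitrary illustrative assumptions) draws many data set realizations, computes one sample mean per realization, and compares the empirical variance of these means with $\sigma^2/n$:

```python
import numpy as np

rng = np.random.default_rng(9)
n, mu, sigma2 = 10, 1.0, 4.0                     # illustrative values

# Many hypothetical data set realizations, one sample mean per realization
y = rng.normal(mu, np.sqrt(sigma2), size=(50_000, n))
beta_hat = y.mean(axis=1)

# Empirical mean and variance of beta_hat versus mu and sigma^2 / n
print(beta_hat.mean(), beta_hat.var(ddof=1), sigma2 / n)
```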

Example 2 (Simple linear regression.) Consider the simple linear regression GLM scenario   1 x1 2 . .  n×2 2 2 y ∼ N(Xβ, σ In) with . .  ∈ R , β ∈ R , σ > 0. (9.7) 1 xn In Section 8 | Maximum likelihood estimation, we have seen that

σ^2 (X^T X)^{-1} = (σ^2/sxx) [ sxx/n + x̄^2   −x̄
                               −x̄              1 ],   with sxx := Σ_{i=1}^n (xi − x̄)^2.    (9.8)

The variance of the offset parameter estimate thus depends on both the sum of squared differences sxx and the mean x̄ of the independent variable values x1, ..., xn, whereas the slope parameter estimate variance depends only on the sum of squared differences of the x1, ..., xn. The covariance of the offset and slope parameter estimates is determined by the mean of the x1, ..., xn. We visualize the distribution of beta parameter estimates as well as the effect of adjusting the independent variable values x1, ..., xn in different ways in Figure 9.1.

Figure 9.2. Variance parameter estimate distribution for a simple linear regression model. (A) 1,000 data samples of a simple linear regression model with independent variable values x_{i2} = i − 1, i = 1, ..., 10, and true, but unknown, parameter values β := (1, 1)^T and σ^2 := 1. (B) Histogram of the variance parameter estimates σ̂^2 based on the data samples in (A). (C) Histogram-based estimate of the probability density function of the scaled variance parameter estimates ((n − p)/σ^2) σ̂^2 and its analytical counterpart χ^2(n − p).
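As a minimal sketch of these dependencies (the concrete values n := 10, σ^2 := 1, β := (1, 1)^T, and xi := i − 1 are assumptions mirroring Figure 9.1, not prescriptions), the analytical covariance σ^2 (X^T X)^{-1} of eq. (9.8) can be compared to the sample covariance of beta estimates obtained from repeatedly simulated data sets:

import numpy as np

rng = np.random.default_rng(1)
n, sigma2, beta_true = 10, 1.0, np.array([1.0, 1.0])
x = np.arange(n, dtype=float)                      # x_i = i - 1, i = 1, ..., 10
X = np.column_stack([np.ones(n), x])               # simple linear regression design

cov_analytical = sigma2 * np.linalg.inv(X.T @ X)   # eq. (9.8)

# re-estimate beta for many simulated data sets
beta_hats = np.array([
    np.linalg.solve(X.T @ X, X.T @ (X @ beta_true + rng.normal(0, np.sqrt(sigma2), n)))
    for _ in range(10_000)
])

print(cov_analytical)
print(np.cov(beta_hats.T))   # should be close to the analytical covariance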

9.3 Variance parameter estimates

We have the following theorem.

Theorem 9.3.1 (Variance parameter estimate distribution). For X ∈ R^{n×p}, β ∈ R^p, and σ^2 > 0, let y ∼ N(Xβ, σ^2 In) denote the GLM and let σ̂^2 denote the frequentist variance parameter estimator. Then

((n − p)/σ^2) σ̂^2 ∼ χ^2(n − p).    (9.9)

Sketch of proof. A full proof of Theorem 9.3.1 is beyond the mathematical scope of this introduction. We thus content ourselves with the sketch of a proof given by Seber and Lee (2003, Section 3.4). The proof proceeds in three steps. First, it is established that the scaled sum of error squares is distributed according to a chi-squared distribution with n degrees of freedom, i.e.,

(1/σ^2) (y − Xβ)^T (y − Xβ) ∼ χ^2(n),    (9.10)

which follows from the fact that the εi ∼ N(0, σ^2), i = 1, ..., n are i.i.d. random variables, and that, similarly,

(1/σ^2) (β̂ − β)^T X^T X (β̂ − β) ∼ χ^2(p).    (9.11)

In a second step, it is then shown that the residual sum of squares can be written as

RSS = SES − Q,    (9.12)

where SES := (y − Xβ)^T (y − Xβ) denotes the sum of error squares, Q := (β̂ − β)^T X^T X (β̂ − β), and RSS and Q are independent random variables. Finally, from the properties of chi-squared random variables, it then follows that

((n − p)/σ^2) σ̂^2 = (1/σ^2) (y − Xβ̂)^T (y − Xβ̂) = (1/σ^2) RSS ∼ χ^2(n − p).    (9.13)

Theorem 9.3.1 states that the scaled variance parameter estimates ((n − p)/σ^2) σ̂^2, but not the variance parameter estimates themselves, are distributed according to a chi-squared distribution with n − p degrees of freedom. We visualize this result in Figure 9.2.
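The following minimal simulation sketch (with assumed values n := 10, p := 2, σ^2 := 1, and the design of Figure 9.2) illustrates Theorem 9.3.1 by comparing the empirical mean and variance of ((n − p)/σ^2) σ̂^2 over many simulated data sets to the analytical mean n − p and variance 2(n − p) of a χ^2(n − p) random variable.

import numpy as np

rng = np.random.default_rng(2)
n, p, sigma2 = 10, 2, 1.0
X = np.column_stack([np.ones(n), np.arange(n, dtype=float)])
beta = np.array([1.0, 1.0])

scaled = np.empty(10_000)
for k in range(scaled.size):
    y = X @ beta + rng.normal(0.0, np.sqrt(sigma2), n)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    rss = np.sum((y - X @ beta_hat) ** 2)
    scaled[k] = rss / sigma2                  # = ((n - p)/sigma^2) * sigma_hat^2

print("empirical mean :", scaled.mean(), "  analytical:", n - p)
print("empirical var  :", scaled.var(),  "  analytical:", 2 * (n - p))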


9.4 The T -statistic

We start with the definition of the T -statistic.

Definition 9.4.1 (T-statistic). For X ∈ R^{n×p}, β ∈ R^p, and σ^2 > 0, let y ∼ N(Xβ, σ^2 In) denote the GLM and let β̂ and σ̂^2 denote the beta and variance parameter estimators, respectively. In addition, let c ∈ R^p denote a contrast weight vector. Then the T-statistic is defined as

Tc : R^p × R_{≥0} → R, (β̂, σ̂^2) ↦ Tc(β̂, σ̂^2) := c^T β̂ / √(σ̂^2 c^T (X^T X)^{-1} c).    (9.14)

•

To obtain an intuition about this definition, we consider a familiar example.

Example 1. (Independent and identically distributed Gaussian samples) Consider the GLM scenario of n independent and identically distributed Gaussian samples,

y ∼ N(Xβ, σ^2 In) with X := 1n ∈ R^{n×1}, β = µ ∈ R, σ^2 > 0.    (9.15)

We have previously seen that

β̂ = ȳ and σ̂^2 = s^2,    (9.16)

i.e., that for the model of eq. (9.15) the beta and variance parameter estimators correspond to the sample mean and sample variance of the random sample y1, ..., yn ∼ N(µ, σ^2). For the evaluation of the T-statistic, we consider the contrast weight vector c := 1. In this case, the T-statistic evaluates to

Tc(β̂, σ̂^2) = 1^T β̂ / √(σ̂^2 · 1^T (1n^T 1n)^{-1} · 1) = β̂ / √(σ̂^2/n) = √n · β̂/√(σ̂^2) = √n · ȳ/s.    (9.17)

The T-statistic for the independent and identically distributed Gaussian samples model of eq. (9.15) thus corresponds to the ratio of the sample mean and the sample standard deviation of the random sample y1, ..., yn ∼ N(µ, σ^2), multiplied by the square root of the sample size. Informally, we thus have

T-value = √(Sample size) · (Sample mean / Sample standard deviation).    (9.18)

In this scenario, large positive or negative values of the T-statistic thus have the following interpretation: in comparison to the sample standard deviation, the sample mean is large, either in the positive or negative direction. Conversely, low absolute values of the T-statistic, i.e., T values close to zero, reflect a small effect size in comparison to the data variability. In other words, the T-statistic measures the effect size with respect to the yardstick of data variability.

The distribution of the centred T -statistic For frequentist hypothesis testing, the centred T -statistic is crucial. We define it as follows.

Definition 9.4.2 (Centred T-statistic). For X ∈ R^{n×p}, β ∈ R^p, and σ^2 > 0, let y ∼ N(Xβ, σ^2 In) denote the GLM and let β̂ and σ̂^2 denote the frequentist beta and variance parameter estimators, respectively. In addition, let c ∈ R^p denote a contrast weight vector. Then the centred T-statistic is defined as

T_{β,c} : R^p × R_{≥0} → R, (β̂, σ̂^2) ↦ T_{β,c}(β̂, σ̂^2) := (c^T β̂ − c^T β) / √(σ̂^2 c^T (X^T X)^{-1} c).    (9.19)

The centred T -statistic is distributed according to a t distribution with n − p degrees of freedom. We have the following theorem.

Theorem 9.4.1 (Distribution of the centred T-statistic). For X ∈ R^{n×p}, β ∈ R^p, and σ^2 > 0, let y ∼ N(Xβ, σ^2 In) denote the GLM and let β̂ and σ̂^2 denote the frequentist beta and variance parameter estimators, respectively. In addition, let c ∈ R^p denote a contrast weight vector and let T_{β,c} denote the centred T-statistic. Then T_{β,c} ∼ t(n − p).


Figure 9.3. (A) 1,000 data samples of a simple linear regression model with independent variable values x_{i2} = i − 1, i = 1, ..., 10, and true, but unknown, parameter values β := (1, 1)^T and σ^2 := 1. (B) True, but unknown, beta parameter values and beta parameter estimates corresponding to the data samples in (A). (C) Analytical and empirical distribution of the product c^T β̂ for c = (0, 1)^T. Note that while the distribution of β̂ can be high-dimensional, the distribution of c^T β̂ is always one-dimensional. (D) Empirical and analytical distributions of the ensuing scaled variance parameter estimates ((n − p)/σ^2) σ̂^2. (E) Empirical and analytical distributions of the centred T-statistic.

Proof. We first rewrite Tβ,c as follows:

T_{β,c} = (c^T β̂ − c^T β) / √(σ̂^2 c^T (X^T X)^{-1} c)

        = [√(n − p) (c^T β̂ − c^T β)] / [√(n − p) √(σ^2/σ^2) √(σ̂^2 c^T (X^T X)^{-1} c)]

        = [√(n − p) (c^T β̂ − c^T β)] / [√(((n − p)/σ^2) σ̂^2) √(σ^2 c^T (X^T X)^{-1} c)]    (9.20)

        = [(c^T β̂ − c^T β) / √(σ^2 c^T (X^T X)^{-1} c)] / √((((n − p)/σ^2) σ̂^2) / (n − p))

        =: ζ / √(υ/(n − p)).

We next consider the numerator of the right-hand side of (9.20) and find that ζ ∼ N(0, 1). To see this, we apply the linear transformation theorem for Gaussian random vectors (Theorem 7.2.6). With β̂ ∼ N(β, σ^2 (X^T X)^{-1}), transformation matrix c^T and offset term −c^T β, we thus obtain

c^T β̂ − c^T β ∼ N(0, σ^2 c^T (X^T X)^{-1} c).    (9.21)

With the z-transformation theorem (Theorem 7.1.6), ζ ∼ N(0, 1) then follows immediately. Furthermore, we have already seen above that υ := ((n − p)/σ^2) σ̂^2 ∼ χ^2(n − p). T_{β,c} thus corresponds to the ratio of a standard normal random variable and the square root of a chi-squared random variable divided by its degrees of freedom. With Theorem 7.6.1, it then follows directly that T_{β,c} is distributed according to t(n − p).

The distribution of the centred T-statistic is visualized in Figure 9.3. The distribution of the centred T-statistic is commonly put to use in statistical testing scenarios (cf. Section 10 | Statistical testing). Specifically, given a hypothesized true, but unknown, parameter value β ∈ R^p and an observed value t_{β,c} of T_{β,c}, the evaluation of P(T_{β,c} ≥ t_{β,c}) can serve as an indicator of the compatibility of the observed data with the hypothesized parameter value.
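The following minimal simulation sketch (with assumed settings n := 10, p := 2, β := (1, 1)^T, σ^2 := 1, and c := (0, 1)^T, mirroring Figure 9.3) illustrates Theorem 9.4.1 by comparing empirical quantiles of the centred T-statistic over many simulated data sets to the corresponding quantiles of t(n − p).

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, p, sigma2 = 10, 2, 1.0
X = np.column_stack([np.ones(n), np.arange(n, dtype=float)])
beta, c = np.array([1.0, 1.0]), np.array([0.0, 1.0])
XtX_inv = np.linalg.inv(X.T @ X)

t_vals = np.empty(10_000)
for k in range(t_vals.size):
    y = X @ beta + rng.normal(0.0, np.sqrt(sigma2), n)
    beta_hat = XtX_inv @ X.T @ y
    sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / (n - p)
    t_vals[k] = (c @ beta_hat - c @ beta) / np.sqrt(sigma2_hat * (c @ XtX_inv @ c))

q = [0.05, 0.5, 0.95]
print("empirical quantiles :", np.quantile(t_vals, q))
print("t(n - p) quantiles  :", stats.t.ppf(q, df=n - p))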

9.5 The F -statistic

Model partitioning and likelihood ratios The F-statistic can be motivated from the perspective of comparing two GLMs using a likelihood ratio. To this end, we consider a GLM with p design matrix columns, which we will refer to as a full model. The aim is to compare this model to a model comprising the first p1 < p regressors of the full model. The latter model is referred to as a reduced model. Because a reduced model is always part of a full model, a

reduced model is also said to be nested in the full model. Formally, let a design matrix X ∈ R^{n×p} with p > 1 be partitioned according to

X = [X1  X2], where X1 ∈ R^{n×p1}, X2 ∈ R^{n×p2} and p1 + p2 = p,    (9.22)

and let a parameter vector β ∈ R^p be partitioned according to

β := (β1^T, β2^T)^T, where β1 ∈ R^{p1} and β2 ∈ R^{p2}.    (9.23)

Then a full model is given by

y = Xβ + ε, p(ε) = N(ε; 0n, σ^2 In),    (9.24)

and a reduced model is given by

y = X1 β1 + ε1, p(ε1) = N(ε1; 0n, σ^2 In).    (9.25)

The general idea of using likelihood function ratios for model comparison is to fit two models m1 and m2, such as (9.24) and (9.25), to a single data set using maximum likelihood parameter estimation and then use the ratio of the two maximized likelihood function values to assess which probability is larger. In other words, the aim is to compare the probabilities p_{β̂}(y) and p_{β̂1}(y) of both models to account for the same data y under the optimal parameter settings of each model. If the probability of observing the data y under, say, the optimized model m1 is higher than under the optimized model m2, then one concludes that m1 is a better model for the data. As in the context of maximum likelihood estimation, it is often helpful to work with log likelihood function differences rather than likelihood function ratios (recall that the log function turns ratios into differences).

To apply the idea of log likelihood differences in the GLM context, suppose that we estimate the beta parameters of the reduced model (9.25) using the maximum likelihood approach for a known value σ^2 > 0. As discussed previously, the maximized log likelihood function is then given by

max_{β1 ∈ R^{p1}} ℓ(β1) = −(n/2) ln 2π − (n/2) ln σ^2 − (1/(2σ^2)) e1^T e1,    (9.26)

where

e1 := y − X1 β̂1,    (9.27)

such that e1^T e1 is the residual sum of squares of the reduced model. Next, consider maximizing the log likelihood function of the full model (9.24). In this case, the maximized log likelihood function is given by

max_{β ∈ R^p} ℓ(β) = −(n/2) ln 2π − (n/2) ln σ^2 − (1/(2σ^2)) e^T e,    (9.28)

where

e := y − [X1  X2] (β̂1^T, β̂2^T)^T,    (9.29)

such that e^T e is the residual sum of squares of the full model. The difference of the two maximized log likelihood functions (9.28) and (9.26) then yields the following basic model comparison criterion:

∆ := max_{β ∈ R^p} ℓ(β) − max_{β1 ∈ R^{p1}} ℓ(β1)

   = (−(n/2) ln 2π − (n/2) ln σ^2 − (1/(2σ^2)) e^T e) − (−(n/2) ln 2π − (n/2) ln σ^2 − (1/(2σ^2)) e1^T e1)    (9.30)

   = (1/(2σ^2)) (e1^T e1 − e^T e).

Note that in eq. (9.30), e1^T e1 corresponds to the residual sum of squares resulting from fitting the p1 regressors of the reduced model to the data, while e^T e corresponds to the residual sum of squares resulting from fitting all p regressors of the full model to the data. If the additional p2 regressors present in the full model capture data variance in addition to that captured by the p1 regressors of the reduced model, then e^T e becomes small compared to e1^T e1 and, for constant σ^2, ∆ becomes large.


Definition of the F -statistic The log likelihood function difference ∆ is the major building block of the F -statistic defined next.

Definition 9.5.1 (F-statistic). For X ∈ R^{n×p}, β ∈ R^p, and σ^2 > 0, let y ∼ N(Xβ, σ^2 In) denote the GLM and assume that this full model can be partitioned according to

X = [X1  X2], X1 ∈ R^{n×p1}, X2 ∈ R^{n×p2}, β := (β1^T, β2^T)^T, β1 ∈ R^{p1}, β2 ∈ R^{p2}, and p1 + p2 = p.    (9.31)

Further, let

β̂ := (X^T X)^{-1} X^T y and β̂1 := (X1^T X1)^{-1} X1^T y,    (9.32)

and

e := y − Xβ̂ and e1 := y − X1 β̂1,    (9.33)

denote the beta parameter estimates and residuals of the full and partitioned models, respectively. Then the F-statistic is defined as

F : R^n × R^n → R_{≥0}, (e1, e) ↦ F(e1, e) := [(e1^T e1 − e^T e)/p2] / [e^T e/(n − p)].    (9.34)

•

To obtain an intuition about the F-statistic definition, we first consider its numerator. Here, it is helpful to consider two scenarios: first, a scenario in which the data is generated by the reduced model, and second, a scenario in which the data is generated by the full model. If the data is generated by the reduced model, the beta parameter estimates for the additional p2 regressors in the full model will approximately equal zero and the residual sums of squares e1^T e1 and e^T e will be of similar size. If the data is instead generated by the full model, the residual sum of squares e1^T e1 resulting from fitting the reduced model will be larger than the residual sum of squares e^T e resulting from fitting the full model. The numerator of the F-statistic thus represents the reduction in the residual sum of squares resulting from including the additional p2 regressors, measured in terms of the number of additional regressors included (i.e., divided by p2). If this reduction is small, i.e., the additional p2 regressors of the full model do not account for much data variance, then the value of the F-statistic numerator will be small. If this reduction is large, i.e., the full model accounts for considerably more data variance than the reduced model, then the F-statistic numerator will be large.

Regardless of the generating model, the denominator of the F-statistic constitutes an estimate of the variance parameter σ^2: in the case that the data is in fact generated from the reduced model, the beta parameter estimates for the p2 additional regressors of the full model will tend to zero and the residual sum of squares is evaluated as for the reduced model. In sum, the F-statistic thus measures the reduction of the residual sum of squares attributable to the inclusion of the p2 additional regressors with respect to the reduced model, per regressor and normalized by the estimated data variance.
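As a minimal sketch (the design, parameter values, and partition below are assumptions chosen for illustration), the F-statistic of eq. (9.34) can be computed by fitting a full and a nested reduced model to the same simulated data set and comparing their residual sums of squares:

import numpy as np

rng = np.random.default_rng(4)
n, p1, p2, sigma2 = 30, 1, 1, 1.0
X1 = np.ones((n, 1))                               # reduced model: offset only
X2 = np.arange(n, dtype=float).reshape(n, 1)       # additional regressor
X = np.hstack([X1, X2])                            # full model
p = p1 + p2

beta = np.array([1.0, 0.5])                        # assumed true parameter values
y = X @ beta + rng.normal(0.0, np.sqrt(sigma2), n)

def rss(design, y):
    # residual sum of squares after least-squares fitting of the given design
    beta_hat = np.linalg.lstsq(design, y, rcond=None)[0]
    return np.sum((y - design @ beta_hat) ** 2)

e1_e1, e_e = rss(X1, y), rss(X, y)
F = ((e1_e1 - e_e) / p2) / (e_e / (n - p))         # eq. (9.34)
print("F-statistic:", F)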

The distribution of the F -statistic As for the T -statistic, the distribution of the F -statistic for true, but unknown, parameter values can be evaluated analytically.

Theorem 9.5.1 (Distribution of the F-statistic for β2 = 0_{p2}). For X ∈ R^{n×p}, β ∈ R^p, and σ^2 > 0, let y ∼ N(Xβ, σ^2 In) denote a GLM that is partitioned according to Definition 9.5.1 with full model parameter vector β := (β1^T, β2^T)^T. Then, if β2 := 0_{p2}, F ∼ f(p2, n − p).

Proof. We content ourselves with the sketch of a proof. First, recall that the f-distribution is the distribution of the ratio of two independent chi-squared random variables, each of which is divided by its respective degrees of freedom. Second, we note without proof that for β2 = 0_{p2}, we have

p((e1^T e1 − e^T e)/σ^2) = χ^2((e1^T e1 − e^T e)/σ^2; p2) and p(e^T e/σ^2) = χ^2(e^T e/σ^2; n − p),    (9.35)

and that (e1^T e1 − e^T e)/σ^2 and e^T e/σ^2 are independent random variables. But then

F = [(e1^T e1 − e^T e)/p2] / [e^T e/(n − p)] = [((e1^T e1 − e^T e)/σ^2)/p2] / [(e^T e/σ^2)/(n − p)]    (9.36)


Figure 9.4. (A) 1,000 data samples of a simple linear regression model with independent variable values x_{i2} = i − 1, i = 1, ..., 10, and true, but unknown, parameter values β := (1, 0)^T and σ^2 := 1. (B) Histograms of the residual sums of squares corresponding to the data samples in (A). (C) Analytical and empirical distribution of the F-statistic.

is the ratio of two independent chi-squared random variables, each of which is divided by its respective degrees of freedom. Thus, F is distributed according to an f-distribution with p2 and n − p degrees of freedom.

Like the distribution of the centred T-statistic, the distribution of the F-statistic for β2 = 0_{p2} is commonly used in statistical testing scenarios. Specifically, based on an observed value f̃ of the F-statistic, assessment of the probability P(F ≥ f̃) can serve as an indicator of the compatibility of the observed data with the hypothesized parameter value β2 = 0_{p2}. We visualize the F-statistic distribution for the case of a simple linear regression GLM in Figure 9.4.

9.6 Bibliographic remarks

Treatments of the estimation and frequentist distribution theory of the GLM can be found in most statistical textbooks. Seber and Lee (2003) provide a comprehensive introduction.

9.7 Study questions

1. Discuss the intuitive background of the frequentist distribution theory.
2. State the Beta parameter estimate distribution theorem.
3. State the Variance parameter estimate distribution theorem.
4. Write down the definition of the T-statistic.
5. Discuss the intuitive meaning of the T-statistic.
6. Write down the definition of the centred T-statistic.
7. State the distribution of the centred T-statistic.
8. Write down the definition of the F-statistic.
9. Discuss the intuitive meaning of the F-statistic.

10. State the distribution of the F-statistic for β2 = 0_{p2}.

10 | Statistical testing

In this Section, we first introduce the abstract notion of a statistical test in Section 10.1. Building on the test scenario, we discuss the single-observation z-test as a first example of a statistical test in Section 10.2.

10.1 Statistical tests

The statistical test scenario

To introduce the notion of a statistical test, we consider a parametric probabilistic model pθ(y) that describes the probability distribution of a random entity y and that is governed by a parameter θ ∈ Θ. The random entity y models data and is assumed to take on values in R^n, n ≥ 1. In test scenarios, the parameter space Θ is partitioned into two disjoint subsets, denoted by Θ0 and Θ1, such that Θ = Θ0 ∪ Θ1 and Θ0 ∩ Θ1 = ∅. A test hypothesis is a statement about the parameter governing pθ(y) in relation to these parameter space subsets. Specifically, the statement

θ ∈ Θ0 ⇔ H = 0 (10.1) is referred to as the null hypothesis and the statement

θ ∈ Θ1 ⇔ H = 1    (10.2)

is referred to as the alternative hypothesis. Note that we are concerned with the Neyman-Pearson hypothesis testing framework (Neyman and Pearson, 1933) and thus assume that null and alternative hypotheses exist in an explicitly defined manner. A number of things are noteworthy. First, a statistical hypothesis is a statement about the parameter of a probabilistic model. In the following, we will use the subscript notations pΘ0 and pΘ1 to indicate that the parameter θ of the probabilistic model pθ is an element of Θ0 or Θ1, respectively. Second, the term null hypothesis does not necessarily denote the statement that some parameter assumes the value zero, even if this is often the case in practice. Rather, the null hypothesis in a statistical testing problem is the statement about the parameter one is willing to nullify, i.e., reject. Finally, the expressions H = 0 and H = 1 are not conceived as realizations of a random variable and hence hypothesis-conditional probability statements are not meaningful. The statements H = 0 and H = 1 are merely equivalent expressions for θ ∈ Θ0 and θ ∈ Θ1, respectively: H = 0 refers to the true, but unknown, state of the world in which the null hypothesis is true and the alternative hypothesis is false (θ ∈ Θ0), and H = 1 refers to the true, but unknown, state of the world in which the alternative hypothesis is true and the null hypothesis is false (θ ∈ Θ1). In general, hypotheses can be classified as simple or composite. A simple hypothesis refers to a subset of parameter space which contains a single element, for example Θ0 := {θ0}. A composite hypothesis refers to a subset of parameter space which contains more than one element, for example Θ0 := R≤0. The commonly encountered null hypothesis Θ0 = {0}, also referred to as the nil hypothesis, is an example of a simple hypothesis.

Tests Given the test hypotheses scenario introduced above, a test is defined as a mapping from the data outcome space to the set {0, 1}, formally

φ : R^n → {0, 1}, y ↦ φ(y).    (10.3)

Here, the test value φ(y) = 0 represents the act of not rejecting the null hypothesis, while the test value φ(y) = 1 represents the act of rejecting the null hypothesis. Rejecting the null hypothesis is equivalent to accepting the alternative hypothesis, and accepting the null hypothesis is equivalent to rejecting the alternative hypothesis. Because y is a random entity, the expression φ(y) is also a random entity. All tests φ considered herein involve the composition of a test statistic

γ : R^n → R,    (10.4)

where R models the test statistic’s outcome space, and a subsequent decision rule

δ : R → {0, 1},    (10.5)

such that the test can be written as

φ = δ ◦ γ : R^n → {0, 1}.    (10.6)

Note that given their dependence on the random entity y, both γ and δ should be understood as random entities. The subset of the test statistic’s outcome space for which the test assumes the value 1 is referred to as the rejection region of the test. Formally, the rejection region is defined as

R := {γ(y) ∈ R | φ(y) = 1} ⊂ R.    (10.7)

The random events φ(y) = 1 and γ(y) ∈ R are thus equivalent and associated with the same probability under pθ. In a concrete test scenario, it is hence usually the probability distribution of the test statistic that is of principal concern for assessing the test’s outcome behaviour. Finally, all test decision rules considered herein are based on the test statistic exceeding a critical value u ∈ R. By means of the indicator function, the tests considered here can thus be written as

φ : R^n → {0, 1}, y ↦ φ(y) := 1_{γ(y) ≥ u} := { 1, γ(y) ≥ u
                                                0, γ(y) < u.    (10.8)

Note that (10.8) describes the situation of one-sided tests. The one-sided one-sample t-test is a familiar example of the general test structure described by expression (10.8): using the sample mean and sample standard deviation, a realization of the random entity y is first transformed into the value of the t-statistic, whose size is then compared to a critical value in order to decide whether or not to reject the null hypothesis.
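The following minimal sketch (the data, the critical value, and the Gaussian sampling model are illustrative assumptions) expresses a one-sided one-sample t-test explicitly as the composition φ = δ ◦ γ of a test statistic and a decision rule in the sense of eq. (10.6):

import numpy as np

def gamma(y):
    # test statistic: one-sample t-statistic sqrt(n) * mean / standard deviation
    n = y.size
    return np.sqrt(n) * y.mean() / y.std(ddof=1)

def delta(t, u):
    # decision rule: reject (1) if the statistic equals or exceeds the critical value u
    return int(t >= u)

def phi(y, u):
    # the test as the composition delta(gamma(y))
    return delta(gamma(y), u)

rng = np.random.default_rng(5)
y = rng.normal(0.5, 1.0, size=20)           # assumed data
print("test statistic:", gamma(y))
print("test outcome  :", phi(y, u=1.729))   # u: assumed critical value for illustration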

Test error probabilities When conducting a test as just described, two kinds of errors can occur. First, the null hypothesis can be rejected (φ(y) = 1), when it is in fact true (θ ∈ Θ0). This error is referred to as the Type I error. Second, the null hypothesis may not be rejected (φ(y) = 0), when it is in fact false (θ ∈ Θ1). The latter error is known as the Type II error. The probabilities of Type I and Type II errors under a given probabilistic model are central to the quality of a test: the probability of a Type I error is called the size of the test and is denoted by α ∈ [0, 1]. It is defined as

α := pΘ0(φ(y) = 1),    (10.9)

and is also routinely referred to as the Type I error rate of the test. Its complementary probability,

pΘ0 (φ(y) = 0) = 1 − α, (10.10) is known as the specificity of a test. The probability of a Type II error

pΘ1 (φ(y) = 0) (10.11) lacks a universally accepted denomination. Its complementary probability

β := pΘ1(φ(y) = 1)    (10.12)

is referred to as the power of the test. In words, the power of a test is thus the probability of accepting the alternative hypothesis (rejecting the null hypothesis) if θ ∈ Θ1, i.e., if the alternative hypothesis is true. Note that basic introductions to test error probabilities often denote the probability of a Type II error by β ∈ [0, 1] and thus define power by 1 − β. For our current purposes, we prefer the definition of eq. (10.12), because it keeps the notation concise and is more coherent with established notations of test quality functions.

Significance levels It is important to distinguish between the size and the significance level of a test: a test is said to be of significance level α0 ∈ [0, 1], if its size α is smaller than or equal to α0, i.e., if

α ≤ α0.    (10.13)

If for a test of significance level α0 it holds that α < α0, the test is referred to as a conservative test. If for a test of significance level α0 it holds that α = α0, the test is referred to as an exact test. Tests with an associated significance level α0 for which α > α0 are sometimes referred to as liberal tests. Note, however, that such tests are, strictly speaking, not of significance level α0.


The test quality function The size and the power of a test are summarized in the test’s quality function. For a test φ, the test quality function is defined as

q : Θ → [0, 1], θ ↦ q(θ) := E_{pθ(y)}(φ(y)).    (10.14)

In words, the test quality function is a function of the probabilistic model parameter θ and assigns to each value of this parameter a value in the interval [0, 1]. This value is given by the expectation of the test φ under the probabilistic model pθ(y). The definition of the test quality function is motivated by the values it assumes for θ ∈ Θ0 and θ ∈ Θ1: because the random variable φ(y) only takes on values in {0, 1}, the expected value E_{pθ(y)}(φ(y)) is identical to the probability of the event φ(y) = 1 under pθ(y). Thus, for θ ∈ Θ0, the test quality function returns the size of the test (eq. (10.9)), and for θ ∈ Θ1, the test quality function returns the power of the test (eq. (10.12)). For θ ∈ Θ1, the test quality function is also referred to as the test’s power function and is denoted by

β :Θ1 → [0, 1], θ 7→ β(θ) := pΘ1 (φ(y) = 1). (10.15)

Test construction In both applications and the theoretical development of statistical tests, the probability of a Type I error, i.e., the test size, is usually considered to be more important than the Type II error rate, i.e., the complement of the test’s power. In effect, when designing a test, the test’s size is usually fixed first, for example by deciding on a significance level such as α0 = 0.05 and its associated critical value uα0 of the test statistic (cf. eq. (10.8)). In a second step, different tests or different probabilistic models are then compared in their ability to minimize the probability of the test’s Type II error, i.e., to maximize the test’s power. For example, the celebrated Neyman-Pearson lemma states that for tests of simple hypotheses, the likelihood ratio test achieves the highest power for a given significance level over all conceivable statistical tests (Neyman and Pearson, 1933).

10.2 A single-observation z-test

We next illustrate the theoretical concepts introduced in Section 10.1 with an example. To this end, we consider a probabilistic model pθ(y) that governs the distribution of a data random variable y taking values in R. For µ ∈ R and σ^2 > 0, the model is assumed to be defined in terms of the probability density function

pθ(y) := N(y; µ, σ^2).    (10.16)

Intuitively, a single data point is thus assumed to have been sampled from a univariate Gaussian distribution of unknown expectation and known variance. For this model, we assume that the parameter space of interest is of the form Θ := R≥0. A single test scenario is then induced by defining the null and alternative hypotheses

µ ∈ Θ0 := {0} and µ ∈ Θ1 := R>0.    (10.17)

Furthermore, a test of the form (10.8) can be constructed by defining the identity test statistic

Z : R → R, y ↦ Z(y) := y,    (10.18)

and the test

φ : R → {0, 1}, y ↦ φ(y) := 1_{Z(y) ≥ u}.    (10.19)

In words, the null hypothesis µ ∈ Θ0 is rejected, if the data realization is equal to or exceeds a given critical value u ∈ R, otherwise it is not rejected.

Type I error rate control As discussed above, to afford Type I error rate control and to evaluate the power of a thus controlled test, the distributions of the test statistic under the null and alternative hypotheses are central. The former distribution allows for identifying a critical value such that the size of the test maximally assumes a certain probability. The latter distribution allows for evaluating the probability of rejecting the null hypothesis under the scenario of the alternative hypothesis being true. In the current test scenario, the


distribution of the test statistic under the null hypothesis θ ∈ Θ0, and hence also the probabilities for the equivalent events Z(y) ∈]u, ∞[ and φ(y) = 1, can be readily inferred: because the test statistic conforms to the identity mapping, its distribution for µ ∈ Θ0 is given by the probability density function

pΘ0(z) = N(z; 0, σ^2).    (10.20)

Likewise, the test statistic distribution for θ ∈ Θ1 and its associated events Z(y) ∈]u, ∞[ and φ(y) = 1 is given by the probability density function

pΘ1(z) = N(z; µ, σ^2) with µ ∈ R>0.    (10.21)

Given the form (10.19) of the current test, φ(y) can be rendered an exact test of significance level α0 by choosing a critical value uα0 such that

pΘ0(φ(y) = 1) = pΘ0(Z(y) ≥ uα0) = 1 − ∫_{−∞}^{uα0} N(x; 0, σ^2) dx = α0.    (10.22)

Note that the required integral corresponds to the CDF of the univariate Gaussian distribution, for which well-known and widely implemented approximations exist.
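The following minimal sketch (assuming σ^2 := 1 and α0 := 0.05 for illustration) determines the critical value uα0 of eq. (10.22) from the inverse CDF of the Gaussian distribution and verifies the resulting test size numerically:

import numpy as np
from scipy import stats

sigma, alpha0 = 1.0, 0.05                                    # assumed values
u_alpha0 = stats.norm.ppf(1 - alpha0, loc=0, scale=sigma)    # critical value

# check: P(Z(y) >= u_alpha0) under the null hypothesis equals alpha0
print("critical value :", u_alpha0)
print("test size      :", 1 - stats.norm.cdf(u_alpha0, loc=0, scale=sigma))

# Monte Carlo check of the test size
rng = np.random.default_rng(6)
z = rng.normal(0.0, sigma, size=100_000)                     # data under the null hypothesis
print("empirical size :", np.mean(z >= u_alpha0))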

Power and positive predictive value function

Given a critical value uα0 and the distribution of the test statistic under the alternative hypothesis scenario as specified by (10.21), the probability of the event φ(y) = 1 evaluates to

pΘ1(φ(y) = 1) = pΘ1(Z(y) ≥ uα0) = 1 − ∫_{−∞}^{uα0} N(x; µ, σ^2) dx for µ ∈ R>0.    (10.23)

The power function of the test thus takes the form

β : R>0 → [0, 1], µ ↦ β(µ) := 1 − ∫_{−∞}^{uα0} N(x; µ, σ^2) dx.    (10.24)

In applied settings, the parameterization of power functions in terms of the effect size measure Cohen’s d is often preferred. For a univariate Gaussian distribution with expectation parameter µ ∈ R and variance parameter σ^2 > 0, Cohen’s d is defined as

d := µ/σ.    (10.25)

For the power function (10.24), re-parameterization in terms of d results in

β : R>0 → [0, 1], d ↦ β(d) := 1 − ∫_{−∞}^{uα0} N(x; σd, σ^2) dx.    (10.26)
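As a minimal sketch (σ^2 := 1, α0 := 0.05, and the list of effect sizes are assumed for illustration), the power function of eq. (10.26) can be evaluated for a range of effect sizes d:

import numpy as np
from scipy import stats

sigma, alpha0 = 1.0, 0.05
u_alpha0 = stats.norm.ppf(1 - alpha0, loc=0, scale=sigma)

def power(d):
    # eq. (10.26): probability of exceeding the critical value when mu = sigma * d
    return 1 - stats.norm.cdf(u_alpha0, loc=sigma * d, scale=sigma)

for d in (0.2, 0.5, 0.8, 1.0, 2.0):
    print(f"d = {d:3.1f}  power = {power(d):.3f}")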

10.3 Bibliographic remarks

The first part of this chapter reviews the classical frequentist theory of hypothesis testing as developed by Pearson (1900) (p-value), Student (1908) (T test statistic distribution), Fisher (1925) (significance testing), and Neyman and Pearson (1933) (hypothesis testing). For a disentanglement of their respective contributions, see, e.g., Gigerenzer (2004). In our presentation, we have mainly followed Czado and Schmidt (2011). Similar treatments can be found, for example, in Casella and Berger (2012), Wasserman (2004), and Lehmann and Romano (2005). The development of the positive predictive value function is based on Wacholder et al. (2004).

10.4 Study questions

1. Define the notion of a test hypothesis.
2. Discuss the notions of simple and composite test hypotheses.
3. Define the notion of a statistical test.
4. Define the notions of a test statistic and a test decision rule.


5. Define the notion of a test rejection region.
6. Define Type I and Type II errors.
7. Define the size and the power of a test.
8. Explain the notion of a test's significance level.
9. Define the notions of conservative, exact, and liberal tests.
10. Discuss the standard approach to test construction.

11 | T-tests and simple linear regression

11.1 Introduction

In this and the subsequent chapters, we consider the following statistical methods from the perspective of the GLM: t-tests, simple linear regression, multiple linear regression, the one- and multi-factorial analysis of variance (ANOVA), F-tests, and the analysis of covariance (ANCOVA). Each method is characterized in terms of its design matrix and beta parameter vector, but the statistical machinery for parameter estimation and inference is identical throughout. To exemplify the different methods, we shall apply them to one common artificial data set shown in Table 11.1. In this data set, the data vector y ∈ R^n represents the volume of the dorsolateral prefrontal cortex (DLPFC) of each of n = 32 participants and the experimental design concerns the question of whether the experimental factors age, measured in years, and alcohol consumption, measured in units of 7.9 grams of pure alcohol, influence the dependent variable DLPFC volume. Note that we are dealing with a between-subject design, which justifies the assumption that the error terms εi, i = 1, ..., n are independent.

Participant  Age [years]  Alcohol [units]  DLPFC volume
1   15  3  178.7708
2   16  6  168.4660
3   17  5  169.9513
4   18  7  162.0778
5   19  4  170.1884
6   20  8  156.9287
7   21  1  175.4092
8   22  2  173.3972
9   23  7  154.4907
10  24  5  158.3642
11  25  1  172.1033
12  26  3  162.6648
13  27  2  165.4449
14  28  8  142.2121
15  29  4  154.3557
16  30  6  145.6544
17  31  3  155.5286
18  32  4  150.5144
19  33  7  137.8262
20  34  1  160.1183
21  35  2  155.4419
22  36  8  127.1715
23  37  5  138.0237
24  38  6  133.4589
25  39  4  139.3813
26  40  3  145.1997
27  41  7  123.7259
28  42  5  130.7300
29  43  8  114.1148
30  44  1  151.1943
31  45  6  121.7235
32  46  2  140.9424

Table 11.1. An exemplary data set with dependent variable DLPFC volume and independent variables age and alcohol consumption.

In general, GLM designs lie within a spectrum between two extremes: on one side of the spectrum, one may assume that there is no systematic variation in the dependent variable at all. This corresponds to the case that each data point yi, i = 1, ..., n is a realization of a univariate Gaussian random variable, all of which have identical expectation. In this case, all observed data variability over experimental units conforms to “Gaussian noise”. Formally, this scenario can be written as

yi ∼ N(µ, σ^2) ⇔ yi = µ + εi, εi ∼ N(0, σ^2) for i = 1, ..., n.    (11.1)

In terms of the GLM formulation, this null model corresponds to the case of a design matrix consisting of a single column of ones and a single beta parameter. The ith entry in the matrix product of these terms, (Xβ)i, represents the experimental unit-independent expectation parameter µ ∈ R of the univariate Gaussian data random variables. Formally,

1 . n×1 X := . ∈ R and β := µ. (11.2) 1

On the other side of the spectrum, one may assume that there is complete and unsystematic variability over all experimental units, in the sense that each experimental unit’s data corresponds to a sample from a unit-specific univariate Gaussian random variable. In other words, each experimental unit is modelled by an experimental unit-specific expectation parameter µi,

yi ∼ N(µi, σ^2) ⇔ yi = µi + εi, εi ∼ N(0, σ^2) for i = 1, ..., n.    (11.3)

In its GLM implementation, this scenario corresponds to the square identity design matrix and a beta parameter vector comprising as many parameters as there are data points. This renders the ith entry in the matrix product Xβ the experimental unit-specific expectation parameter µi,

X := In ∈ R^{n×n} and β := (µ1, ..., µn)^T ∈ R^n.    (11.4)

Note that in neither case can any interesting statements be made with respect to the columns of the design matrix and the data variable, because (11.1) assumes that all experimental units are “the same”, while (11.3) assumes that all experimental units are “mutually different”. All the designs we will encounter in the following lie somewhere between (11.2) and (11.4) and thus constrain the differences between experimental units in some meaningful way which lends itself to an interpretation in terms of the independent experimental variables.

Continuous and categorical designs In general, design matrices can be classified according to whether their columns are formed by continuously varying real numbers or by so-called indicator or, more colloquially, dummy variables. In the first case, the designs are referred to as continuous or regression designs. Here, the independent variables represent continuous experimental factors and the design matrix columns are commonly referred to as regressors, predictors, or covariates. In the second case, the entries of the design matrix are typically 1’s and 0’s, and the designs are referred to as categorical or ANOVA-type designs. In categorical designs, the independent experimental variables are usually referred to as experimental factors and the values that they can take on as levels. The key difference between the continuous and categorical approaches lies in the expectation about changes in the dependent variable as a result of changes in the independent variable. If one treats an independent experimental variable as a continuous variate, one assumes a linear effect of this predictor on the dependent variable. If one treats an independent experimental variable as a discrete variate, one does not need to assume that for every unit change in the independent variable one expects a scaled unit change in the value of the dependent variable. In other words, one allows for arbitrary changes in the response from one category to another. This approach has the advantage of a simple interpretation and may be viewed as a prediction of qualitative differences. On the other hand, by grouping independent variable values into discrete categories, one discards information contained in the continuous covariation of independent and dependent variables.

In the following, we will discuss two forms of continuous GLM designs, simple and multiple linear regression, three forms of categorical GLM designs, t-tests, ANOVA designs, and F-tests, and one mixed form, the ANCOVA design. For each design, we will select the relevant aspects of the example data set in Table 11.1, write down the corresponding GLM in structural and design matrix form, show a typical visualization, and discuss its estimation and interpretation from a frequentist viewpoint.


11.2 One-sample t-test

The one-sample t-test is commonly portrayed as a procedure to test whether all data points were generated from univariate Gaussian distributions with identical expectation parameters. From a GLM viewpoint, the one-sample t-test corresponds to a categorical design with a single experimental factor taking on a single level. Each data point is modelled by a univariate Gaussian variable with identical expectations over data points and thus corresponds to the independent and identically distributed Gaussian samples scenario encountered already in Chapter 8 | Maximum likelihood estimation:

yi ∼ N(µ, σ^2) ⇔ yi = µ + εi, εi ∼ N(0, σ^2) for i = 1, ..., n.    (11.5)

As seen previously, in design matrix formulation (11.5) corresponds to

y ∼ N(Xβ, σ^2 In), where X := 1n ∈ R^{n×1}, β := µ, and σ^2 > 0.    (11.6)

Notably, the single-entry parameter vector β assumes the role of the data expectation µ. If applied to the example data set of Table 11.1, the one-sample t-test allows for evaluating the null hypothesis that the expectation parameter shared by all observed data points is zero. Table 11.2 shows the example data set viewed from the perspective of a one-sample t-test. One-sample t-test data are usually not visualized. However, in line with the visualization of other categorical designs, it is most appropriate to visualize them by means of their sample mean and sample standard deviation or standard error of the mean, as shown in Figure 11.1A.

Participant (i)  DLPFC volume (yi)
1   178.7708
2   168.4660
3   169.9513
4   162.0778
5   170.1884
6   156.9287
7   175.4092
8   173.3972
9   154.4907
10  158.3642
11  172.1033
12  162.6648
13  165.4449
14  142.2121
15  154.3557
16  145.6544
17  155.5286
18  150.5144
19  137.8262
20  160.1183
21  155.4419
22  127.1715
23  138.0237
24  133.4589
25  139.3813
26  145.1997
27  123.7259
28  130.7300
29  114.1148
30  151.1943
31  121.7235
32  140.9424

Table 11.2. The example data set considered as a one-sample t-test design.



Figure 11.1. (A) Visualization of a one-sample t-test design. (B) Visualization of an independent two-sample t-test design. The error bars depict the pooled standard deviation s12.

Estimation and evaluation of the one-sample t-test design In Chapter 8 | Maximum likelihood estimation, we have seen that the beta and variance parameter estimators for the one-sample t-test model evaluate to

β̂ = (1/n) Σ_{i=1}^n yi = ȳ and σ̂^2 = (1/(n − 1)) Σ_{i=1}^n (yi − ȳ)^2,    (11.7)

i.e., the sample mean and the sample variance. Moreover, in Chapter 9 | Frequentist distribution theory, we have seen that the T-statistic for the one-sample t-test model with contrast vector c := 1 evaluates to

Tc = √n · ȳ/s.    (11.8)

Under the null hypothesis µ ∈ {0}, the centred T-statistic

Tβ,c = √n · (ȳ − µ)/s = √n · (ȳ − 0)/s = √n · ȳ/s    (11.9)

is distributed according to a t-distribution with n − 1 degrees of freedom. If the probability of the T-statistic Tc to exceed an observed value tc is low, we may thus consider rejecting the null hypothesis. For the data depicted in Table 11.2, the beta parameter and variance parameter estimates evaluate to

β̂ = ȳ = 151.1 and σ̂^2 = s^2 = 290.8,    (11.10)

such that the T-statistic evaluates to

Tc = √32 · (151.1/17.1) ≈ 50.1.    (11.11)

The fact that under the null hypothesis P(Tβ,c ≥ 50.1) < 0.001 would hence prompt a rejection of the null hypothesis µ ∈ {0}, if, for example, a test significance level of α0 = 0.05 is desired.
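The following minimal sketch (using the 32 DLPFC volume values of Table 11.2; the use of the t-distribution CDF from scipy is an implementation choice, not part of the text) computes the one-sample t-test quantities of eqs. (11.10) and (11.11):

import numpy as np
from scipy import stats

y = np.array([178.7708, 168.4660, 169.9513, 162.0778, 170.1884, 156.9287,
              175.4092, 173.3972, 154.4907, 158.3642, 172.1033, 162.6648,
              165.4449, 142.2121, 154.3557, 145.6544, 155.5286, 150.5144,
              137.8262, 160.1183, 155.4419, 127.1715, 138.0237, 133.4589,
              139.3813, 145.1997, 123.7259, 130.7300, 114.1148, 151.1943,
              121.7235, 140.9424])                       # DLPFC volumes, Table 11.2

n = y.size
beta_hat = y.mean()                                      # sample mean
sigma2_hat = y.var(ddof=1)                               # sample variance
t_c = np.sqrt(n) * beta_hat / np.sqrt(sigma2_hat)        # eq. (11.8)
p_value = 1 - stats.t.cdf(t_c, df=n - 1)                 # P(T >= t_c) under mu = 0

print(beta_hat, sigma2_hat, t_c, p_value)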

11.3 Independent two-sample t-test

The independent two-sample t-test is commonly portrayed as a procedure to evaluate whether two groups of data points y1 ∈ R^{n1} and y2 ∈ R^{n2} were generated from the same underlying univariate Gaussian distribution. From the GLM perspective, the two-sample t-test corresponds to a categorical design with a single experimental factor taking on two levels. The distributions of the n1 random variables modelling data points of the first level take the form

y1i ∼ N(µ1, σ^2) ⇔ y1i = µ1 + εi, εi ∼ N(0, σ^2) for i = 1, ..., n1,    (11.12)

while the distributions of the n2 random variables modelling data points of the second level take the form

y2i ∼ N(µ2, σ^2) ⇔ y2i = µ2 + εi, εi ∼ N(0, σ^2) for i = 1, ..., n2.    (11.13)


In design matrix formulation, the two-sample t-test model for n := n1 + n2 can be written as

y ∼ N(Xβ, σ^2 In),    (11.14)

where

y := (y11, ..., y1n1, y21, ..., y2n2)^T ∈ R^n,   X := [ 1n1  0n1
                                                        0n2  1n2 ] ∈ R^{n×2},   β := (µ1, µ2)^T ∈ R^2,   and σ^2 > 0.    (11.15)

y2n2 0 1 Note that data and non-zero entries in the design matrix have to be arranged in such a manner that they identify the respective data variable group membership. As an example, consider regrouping the example data set of Table 11.1 into two groups of data points corresponding to participants younger than 31 years and participants older or equal to 31 years as shown in Table 11.3. Data observations for independent two-sample t-test designs are usually visualized by portraying their group sample means and associated standard deviations or standard errors of mean as shown in Figure 11.1B.

Group 1                                                    Group 2
Participant  Variable Index (ij)  DLPFC volume (y1i)       Participant  Variable Index (ij)  DLPFC volume (y2i)
1   1,1   178.7708                                         17  2,1   155.5286
2   1,2   168.4660                                         18  2,2   150.5144
3   1,3   169.9513                                         19  2,3   137.8262
4   1,4   162.0778                                         20  2,4   160.1183
5   1,5   170.1884                                         21  2,5   155.4419
6   1,6   156.9287                                         22  2,6   127.1715
7   1,7   175.4092                                         23  2,7   138.0237
8   1,8   173.3972                                         24  2,8   133.4589
9   1,9   154.4907                                         25  2,9   139.3813
10  1,10  158.3642                                         26  2,10  145.1997
11  1,11  172.1033                                         27  2,11  123.7259
12  1,12  162.6648                                         28  2,12  130.7300
13  1,13  165.4449                                         29  2,13  114.1148
14  1,14  142.2121                                         30  2,14  151.1943
15  1,15  154.3557                                         31  2,15  121.7235
16  1,16  145.6544                                         32  2,16  140.9424

Table 11.3. An independent two-sample t-test data set. The Participant column comprises the original data labels as in Table 11.1, while the Variable Index columns ij comprise the indices of the data points y1i and y2i after relabelling for the independent two-sample t-test design.

Estimation and evaluation of independent two-sample t-test designs As shown below, the independent two-sample t-test model’s beta and variance parameter estimators evaluate to

β̂ = ( (1/n1) Σ_{i=1}^{n1} y1i ,  (1/n2) Σ_{i=1}^{n2} y2i )^T  and  σ̂^2 = [Σ_{i=1}^{n1} (y1i − ȳ1)^2 + Σ_{i=1}^{n2} (y2i − ȳ2)^2] / [(n1 − 1) + (n2 − 1)],    (11.16)

respectively. The two entries in the beta parameter estimator of the independent two-sample t-test design thus correspond to the two sample averages

ȳ1 := (1/n1) Σ_{i=1}^{n1} y1i  and  ȳ2 := (1/n2) Σ_{i=1}^{n2} y2i.    (11.17)

The independent two-sample t-test model’s variance parameter estimator is also known as the pooled sample variance and commonly denoted by

s12^2 := [Σ_{i=1}^{n1} (y1i − ȳ1)^2 + Σ_{i=1}^{n2} (y2i − ȳ2)^2] / [(n1 − 1) + (n2 − 1)].    (11.18)

The square root of the pooled sample variance, s12 := √(s12^2), is known as the pooled sample standard deviation.


Proof. For the beta parameter estimator, we have

β̂ = (X^T X)^{-1} X^T y
   = [ n1, 0 ; 0, n2 ]^{-1} ( Σ_{i=1}^{n1} y1i ,  Σ_{i=1}^{n2} y2i )^T
   = [ n1^{-1}, 0 ; 0, n2^{-1} ] ( Σ_{i=1}^{n1} y1i ,  Σ_{i=1}^{n2} y2i )^T    (11.19)
   = ( (1/n1) Σ_{i=1}^{n1} y1i ,  (1/n2) Σ_{i=1}^{n2} y2i )^T.

For the variance estimator, we have with n = n1 + n2 and p = 2

σ̂^2 = (y − Xβ̂)^T (y − Xβ̂) / (n − p)
    = (y11 − ȳ1, ..., y1n1 − ȳ1, y21 − ȳ2, ..., y2n2 − ȳ2) (y11 − ȳ1, ..., y1n1 − ȳ1, y21 − ȳ2, ..., y2n2 − ȳ2)^T / (n1 + n2 − 2)    (11.20)
    = [Σ_{i=1}^{n1} (y1i − ȳ1)^2 + Σ_{i=1}^{n2} (y2i − ȳ2)^2] / (n1 + n2 − 2),

which corresponds to the pooled sample variance s12^2.

Finally, as shown below, the centred T-statistic for evaluating the simple null hypothesis µ1 − µ2 ∈ {0} is given by defining the contrast vector c := (1, −1)^T and setting β := (0, 0)^T. This yields the familiar formula

Tβ,c = (ȳ1 − ȳ2) / ( √(n1^{-1} + n2^{-1}) · s12 )    (11.21)

for the T-statistic of the independent two-sample t-test.


Proof. We have

Tβ,c = (c^T β̂ − c^T β) / √(σ̂^2 c^T (X^T X)^{-1} c)

     = (ȳ1 − ȳ2) / √( s12^2 (1, −1) [ n1^{-1}, 0 ; 0, n2^{-1} ] (1, −1)^T )

     = (ȳ1 − ȳ2) / √( (n1^{-1} + n2^{-1}) s12^2 )    (11.22)

     = (ȳ1 − ȳ2) / ( √(n1^{-1} + n2^{-1}) · s12 ).

As discussed in Chapter 9 | Frequentist distribution theory, this centred T -statistic is distributed according to a t-distribution with n − p = n1 + n2 − 2 degrees of freedom. If the probability of the corresponding T -statistic Tc to exceed an observed value tc is low, the null hypothesis µ1 − µ2 ∈ {0} may thus be rejected. For the data depicted in Table 11.3, the beta parameter and variance parameter estimates evaluate to

β̂ = (ȳ1, ȳ2)^T = (163.2, 139.1)^T and σ̂^2 = 145.8,    (11.23)

such that the T-statistic evaluates to

Tc = (163.2 − 139.1) / √((16^{-1} + 16^{-1}) · 145.8) = 24.1 / (0.35 · 12.1) ≈ 5.7.    (11.24)

The fact that under the null hypothesis P(Tβ,c ≥ 5.7) < 0.001 would hence prompt rejection of the null hypothesis, if, for example, a test significance level of α0 = 0.05 is desired.
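The following minimal sketch (using the DLPFC volume values of Table 11.1, split into the two groups of Table 11.3) computes the independent two-sample t-test quantities of eqs. (11.23) and (11.24):

import numpy as np

y = np.array([178.7708, 168.4660, 169.9513, 162.0778, 170.1884, 156.9287,
              175.4092, 173.3972, 154.4907, 158.3642, 172.1033, 162.6648,
              165.4449, 142.2121, 154.3557, 145.6544, 155.5286, 150.5144,
              137.8262, 160.1183, 155.4419, 127.1715, 138.0237, 133.4589,
              139.3813, 145.1997, 123.7259, 130.7300, 114.1148, 151.1943,
              121.7235, 140.9424])
y1, y2 = y[:16], y[16:]                       # participants 1-16 and 17-32
n1, n2 = y1.size, y2.size

y1_bar, y2_bar = y1.mean(), y2.mean()
s12_sq = (np.sum((y1 - y1_bar) ** 2) + np.sum((y2 - y2_bar) ** 2)) / (n1 + n2 - 2)
t = (y1_bar - y2_bar) / (np.sqrt(1 / n1 + 1 / n2) * np.sqrt(s12_sq))   # eq. (11.21)

print(y1_bar, y2_bar, s12_sq, t)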

11.4 Simple linear regression

The central idea of simple linear regression is that when the continuous regressor takes on the value 0, the expectation of the corresponding data variable is given by a ∈ R, and that the expectations of the data variables change by a fixed amount b ∈ R for a unit change in the continuous regressor. The distributions of the n random variables modelling the dependent variable take the form

yi ∼ N(a + b xi, σ^2) ⇔ yi = a + b xi + εi, εi ∼ N(0, σ^2) for i = 1, ..., n,    (11.25)

where xi, i = 1, ..., n denote the values of the continuous regressor variable. In design matrix formulation, the simple linear regression model can be written as

y ∼ N(Xβ, σ^2 In), where y ∈ R^n, X := [ 1  x1
                                         ⋮   ⋮
                                         1  xn ] ∈ R^{n×2}, β := (a, b)^T ∈ R^2, and σ^2 > 0.    (11.26)

The entries of β := (a, b)^T are commonly known as the offset parameter and the slope parameter, respectively. As an example of a simple linear regression design, we reconsider the example data of Table 11.1, for which we disregard the alcohol consumption variable. This results in the data set shown in Table 11.4. Here, the values x1, ..., xn of the continuous regressor variable are listed in the Age column. Simple linear regression designs are commonly visualized by plotting the values of the regressor on the x-axis and the values of the dependent experimental variable on the y-axis, as shown in Figure 11.2A. Often, the estimated data expectation vector Xβ̂ ∈ R^n is included and visualized as a function of the x1, ..., xn.


Participant (i)  Age (xi)  DLPFC volume (yi)
1   15  178.7708
2   16  168.4660
3   17  169.9513
4   18  162.0778
5   19  170.1884
6   20  156.9287
7   21  175.4092
8   22  173.3972
9   23  154.4907
10  24  158.3642
11  25  172.1033
12  26  162.6648
13  27  165.4449
14  28  142.2121
15  29  154.3557
16  30  145.6544
17  31  155.5286
18  32  150.5144
19  33  137.8262
20  34  160.1183
21  35  155.4419
22  36  127.1715
23  37  138.0237
24  38  133.4589
25  39  139.3813
26  40  145.1997
27  41  123.7259
28  42  130.7300
29  43  114.1148
30  44  151.1943
31  45  121.7235
32  46  140.9424

Table 11.4. The example data set arranged for a simple linear .

Estimation and evaluation of simple linear regression designs As shown below, the beta parameter estimator for the simple linear regression can be written as

β̂ = (β̂1, β̂2)^T = (â, b̂)^T = ( ȳ − (sxy/sxx) x̄ ,  sxy/sxx )^T,    (11.27)

where

x̄ := (1/n) Σ_{i=1}^n xi,  ȳ := (1/n) Σ_{i=1}^n yi,  sxx := Σ_{i=1}^n (xi − x̄)^2,  and  sxy := Σ_{i=1}^n (xi − x̄)(yi − ȳ).    (11.28)

The representation of the beta parameter estimates in the form of (11.27) is useful, because it allows one to readily appreciate the similarities and differences between simple linear regression and correlation models.

Proof. We first note that

sxy = Σ_{i=1}^n xi yi − n x̄ ȳ    (11.29)

and

sxx = Σ_{i=1}^n xi^2 − n x̄^2,    (11.30)



Figure 11.2. Visualization of a simple linear regression design.

because

sxy := Σ_{i=1}^n (xi − x̄)(yi − ȳ)
     = Σ_{i=1}^n (xi yi − xi ȳ − x̄ yi + x̄ ȳ)
     = Σ_{i=1}^n xi yi − ȳ Σ_{i=1}^n xi − x̄ Σ_{i=1}^n yi + n x̄ ȳ
     = Σ_{i=1}^n xi yi − n x̄ ȳ − n x̄ ȳ + n x̄ ȳ
     = Σ_{i=1}^n xi yi − n x̄ ȳ,    (11.31)

and

sxx = Σ_{i=1}^n (xi − x̄)^2
    = Σ_{i=1}^n (xi^2 − 2 xi x̄ + x̄^2)
    = Σ_{i=1}^n xi^2 − 2 x̄ Σ_{i=1}^n xi + n x̄^2
    = Σ_{i=1}^n xi^2 − 2 n x̄^2 + n x̄^2
    = Σ_{i=1}^n xi^2 − n x̄^2.    (11.32)


With eqs. (11.29) and (11.30), we then have

βˆ = (XT X)−1XT y −1  1 x1  y1   1 ··· 1   1 ··· 1  =  . .   .   x1 ··· xn . .  x1 ··· xn  .  1 xn yn (11.33)  Pn −1  Pn  n i=1 xi i=1 yi = Pn Pn 2 Pn i=1 xi i=1 xi i=1 xiyi  n nx¯ −1  ny¯  = Pn 2 Pn . nx¯ i=1 xi i=1 xiyi The inverse of XT X is given by 1  sxx +x ¯2 −x¯ n , (11.34) sxx −x¯ 1 because 1  sxx 2    1  nsxx 2 2 sxxnx¯ 2 Pn 2 n +x ¯ −x¯ n nx¯ n + nx¯ − nx¯ n + nx¯ x¯ − x¯ i=1 xi Pn 2 = 2 Pn 2 sxx −x¯ 1 nx¯ i=1 xi sxx −xn¯ + nx¯ −nx¯ + i=1 xi  Pn 2 2  1 sxx sxxx¯ − x¯( i=1 xi − nx¯ ) = Pn 2 2 sxx 0 i=1 xi − nx¯ 1 s s x¯ − xs¯  = xx xx xx (11.35) sxx 0 sxx 1 s 0  = xx sxx 0 sxx 1 0 = . 0 1 Hence,

 1 x¯2 x¯    n + s − s ny¯ βˆ =  xx xx    x¯ 1 Pn − xiyi sxx sxx i=1  2  Pn  1 + x¯ ny¯ − x¯ i=1 xiyi n sxx sxx =  Pn  i=1 xiyi − nx¯y¯ sxx sxx  2 Pn  ny¯ + x¯ ny¯ − x¯ i=1 xiyi n sxx sxx =  Pn  i=1 xiyi−nx¯y¯ s xx (11.36)  Pn  y¯ + xn¯ x¯y¯−x¯ i=1 xiyi sxx =  Pn  i=1 xiyi−nx¯y¯ sxx  Pn  y¯ − i=1 xiyi−nx¯y¯ x¯ sxx =  Pn  i=1 xiyi−nx¯y¯ sxx

 sxy  y¯ − s x¯ =  xx  . sxy sxx

The variance parameter estimator of the simple linear regression model is best represented in its native form

σ̂^2 = (y − Xβ̂)^T (y − Xβ̂) / (n − 2).    (11.37)

Statistical tests in the context of simple linear regression commonly concern the null hypotheses a ∈ {0} and b ∈ {0} by means of the contrast vectors c := (1, 0)^T and c := (0, 1)^T, respectively. For the data depicted in Table 11.4, the beta parameter and variance parameter estimates evaluate to

β̂ = (â, b̂)^T = (196.9, −1.5)^T and σ̂^2 = 95.6.    (11.38)

The T -statistic for assessing the null hypothesis a ∈ {0} evaluates to Tc = 33.01. The fact that under the null hypothesis P(Tβ,c ≥ 33.01) < 0.001 would hence prompt rejection of the null hypothesis, if, for example, a test significance level of α0 = 0.05 is desired. Similarly, the T -statistic for assessing the null hypothesis


b ∈ {0} evaluates to Tc = −8.02. The fact that under the null hypothesis P(Tβ,c ≥ | − 8.02|) < 0.001 would hence prompt rejection of the null hypothesis, if, for example, a test significance level of α0 = 0.05 is desired.
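The following minimal sketch (using the age and DLPFC volume values of Table 11.4; the contrast-based T-statistic follows eq. (9.14)) computes the simple linear regression estimates and the two T-statistics reported above:

import numpy as np

x = np.arange(15, 47, dtype=float)                 # ages 15, ..., 46 (Table 11.4)
y = np.array([178.7708, 168.4660, 169.9513, 162.0778, 170.1884, 156.9287,
              175.4092, 173.3972, 154.4907, 158.3642, 172.1033, 162.6648,
              165.4449, 142.2121, 154.3557, 145.6544, 155.5286, 150.5144,
              137.8262, 160.1183, 155.4419, 127.1715, 138.0237, 133.4589,
              139.3813, 145.1997, 123.7259, 130.7300, 114.1148, 151.1943,
              121.7235, 140.9424])

n = y.size
X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                       # (a hat, b hat)
sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / (n - 2)

for c in (np.array([1.0, 0.0]), np.array([0.0, 1.0])):
    t = (c @ beta_hat) / np.sqrt(sigma2_hat * (c @ XtX_inv @ c))
    print("contrast", c, "  T-statistic:", t)

print(beta_hat, sigma2_hat)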

11.5 Bibliographic remarks

The organization of this chapter owes much to Rodríguez (2007).

11.6 Study questions

1. Discuss the extremes of the spectrum of GLM designs for the analysis of any data set.
2. Discuss commonalities and differences between continuous and categorical GLM designs.
3. Write down the GLM formulation of the one-sample t-test.
4. Write down the one-sample t-test beta estimator and variance parameter estimator.
5. Write down the one-sample t-test T-statistic for the contrast vector c := 1.
6. Write down the GLM formulation of the independent two-sample t-test.
7. Write down the independent two-sample t-test beta estimator and variance parameter estimator.
8. Write down the independent two-sample t-test T-statistic for the contrast vector c := (1, −1)^T.
9. Write down the GLM formulation of simple linear regression.
10. Write down the simple linear regression beta estimator.

12 | Multiple linear regression

Multiple linear regression can be viewed as the most general application of the GLM in the sense that all columns of the design matrix are allowed to take on arbitrary values and there may be arbitrarily many of them. As introduced in Chapter 1, for i = 1, ..., n data variables yi, multiple linear regression designs take the general form

yi = xi1 β1 + xi2 β2 + xi3 β3 + ... + xip βp + εi,  εi ∼ N(0, σ^2).    (12.1)

The values xij, i = 1, ..., n that constitute the jth design matrix column are variously referred to as regressors, predictors, covariates, or independent variables. The values βj, j = 1, ..., p are variously referred to as beta parameters, regression weights, or (fixed) effects. In this Chapter, we first consider an exemplary multiple linear regression design for the data set introduced in Chapter 11 | T-tests and simple linear regression. We next consider the notions of linearly independent, orthogonal, and uncorrelated regressors and introduce a measure for the statistical efficiency of a multiple linear regression design. Finally, we consider the application of multiple linear regression designs in functional magnetic resonance imaging.

12.1 An exemplary multiple linear regression design

Here, we explicitly state the multiple linear regression design applicable to the data in Table 12.1, which comprises the two predictor variables Age and Alcohol for the dependent variable DLPFC volume.

Participant (i)  Age (xi1)  Alcohol (xi2)  DLPFC Volume (yi)
1   15  3  178.7708
2   16  6  168.4660
3   17  5  169.9513
4   18  7  162.0778
5   19  4  170.1884
6   20  8  156.9287
7   21  1  175.4092
8   22  2  173.3972
9   23  7  154.4907
10  24  5  158.3642
11  25  1  172.1033
12  26  3  162.6648
13  27  2  165.4449
14  28  8  142.2121
15  29  4  154.3557
16  30  6  145.6544
17  31  3  155.5286
18  32  4  150.5144
19  33  7  137.8262
20  34  1  160.1183
21  35  2  155.4419
22  36  8  127.1715
23  37  5  138.0237
24  38  6  133.4589
25  39  4  139.3813
26  40  3  145.1997
27  41  7  123.7259
28  42  5  130.7300
29  43  8  114.1148
30  44  1  151.1943
31  45  6  121.7235
32  46  2  140.9424

Table 12.1. The example data set introduced in Chapter 11, considered here from the perspective of a multiple linear regression design with two predictor variables.

Figure 12.1. (A) Simple linear regression. (B) A multiple linear regression example.

A multiple linear regression design with one offset variable and two predictor variables takes the form

yi ∼ N(a + b1 xi1 + b2 xi2, σ^2) ⇔ yi = a + b1 xi1 + b2 xi2 + εi, εi ∼ N(0, σ^2) for i = 1, ..., n,    (12.2)

where xij, i = 1, ..., n, j = 1, 2 denotes the value of the jth predictor variable for the ith observation. In its design matrix formulation, eq. (12.2) corresponds to

y ∼ N(Xβ, σ^2 In), where y ∈ R^n, X := [ 1  x11  x12
                                         1  x21  x22
                                         ⋮   ⋮    ⋮
                                         1  xn1  xn2 ] ∈ R^{n×3}, β := (a, b1, b2)^T ∈ R^3, and σ^2 > 0.    (12.3)

Here, the first entry in the beta parameter vector assumes the role of the offset parameter, the second entry assumes the role of the slope with respect to the first predictor variable, and the third entry assumes the role of the slope with respect to the second predictor.

Exemplary estimation and evaluation of a multiple linear regression design The most straightforward way to evaluate the beta and variance parameter estimates in a given multiple linear regression design is by direct implementation of their respective formulas,

$$\hat{\beta} = (X^TX)^{-1}X^Ty \quad \text{and} \quad \hat{\sigma}^2 = \frac{(y - X\hat{\beta})^T(y - X\hat{\beta})}{n - p}. \tag{12.4}$$

For the data depicted in Table 12.1, the beta and variance parameter estimates evaluate to

$$\hat{\beta} = \begin{pmatrix} \hat{a} \\ \hat{b}_1 \\ \hat{b}_2 \end{pmatrix} = \begin{pmatrix} 215.1 \\ -1.5 \\ -4.0 \end{pmatrix} \quad \text{and} \quad \hat{\sigma}^2 = 4.7. \tag{12.5}$$

T-statistics for assessing the null hypotheses $a \in \{0\}$, $b_1 \in \{0\}$, $b_2 \in \{0\}$, and $b_1 - b_2 \in \{0\}$ can then be evaluated based on the four contrast vectors $c_1 := (1, 0, 0)^T$, $c_2 := (0, 1, 0)^T$, $c_3 := (0, 0, 1)^T$, and $c_4 := (0, 1, -1)^T$, respectively. Based on the corresponding observed T-statistics of $t_1 = 140.8$, $t_2 = -36.1$, $t_3 = -24.0$, and $t_4 = 14.61$, and the associated probabilities of $P(T_{c,\beta} \ge 140.8) \le 0.001$, $P(T_{c,\beta} \ge |-36.1|) \le 0.001$, $P(T_{c,\beta} \ge |-24.0|) \le 0.001$, and $P(T_{c,\beta} \ge 14.61) \le 0.001$, one is hence prompted to reject the null hypotheses of zero effects and of a zero difference between the effects of age and alcohol. Multiple linear regression designs are not easily visualized, especially if the number of independent experimental variables is larger than 2. In Figure 12.1, we visualize the multiple linear regression design and the estimated regression plane for the current example. Note, however, that these kinds of graphs are rarely seen in the literature.
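A direct implementation of eq. (12.4) and of the contrast-based T-statistics, for example in Python with numpy and scipy, might look as follows. This is a minimal sketch: the array names `age`, `alcohol`, and `volume` are hypothetical placeholders for the columns of Table 12.1, and the code is meant to illustrate the formulas rather than to stand in for the computations reported in eqs. (12.5) and above.

```python
import numpy as np
from scipy import stats

def fit_glm(y, X):
    """Beta parameter and variance parameter estimates as in eq. (12.4)."""
    n, p = X.shape
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - p)
    return beta_hat, sigma2_hat

def contrast_t(c, y, X):
    """T-statistic for a contrast weight vector c and the probability P(T >= |t|)."""
    n, p = X.shape
    beta_hat, sigma2_hat = fit_glm(y, X)
    t = c @ beta_hat / np.sqrt(sigma2_hat * c @ np.linalg.inv(X.T @ X) @ c)
    return t, stats.t.sf(np.abs(t), df=n - p)

# Assuming age, alcohol, volume are arrays holding the columns of Table 12.1:
# X = np.column_stack([np.ones(len(age)), age, alcohol])      # design matrix of eq. (12.3)
# beta_hat, sigma2_hat = fit_glm(volume, X)                    # cf. eq. (12.5)
# t4, p4 = contrast_t(np.array([0.0, 1.0, -1.0]), volume, X)   # null hypothesis b1 - b2 = 0
```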


12.2 Linearly independent, orthogonal, and uncorrelated regressors

If the columns of a design matrix show a high degree of similarity, estimation of their corresponding beta parameters can become problematic. Intuitively, the beta parameter estimates allocate parts of the data variability onto the design matrix regressors. However, if these regressors are highly similar, the estimation theory of the GLM has no principled answer as to how to best distribute the data variability among the regressors. In general, the GLM is not suited for assessing the differential effects of regressors with almost identical profiles. It is thus good practice to minimize the co-occurrence of events of interest that will form different regressors as much as possible while designing an experiment. There exist a great number of ways to optimize experimental designs. In this section, we introduce some of the nomenclature that is used in the discussion of optimal experimental designs. Specifically, we consider linearly independent, orthogonal, and uncorrelated regressors.

Although often used interchangeably, the concepts of linearly independent, orthogonal, and uncorrelated regressors are not identical. In principle, the problem of regressor similarity can be solved by designing regressors that are linearly independent. To understand what these terms mean, we give their definitions and then consider a number of examples. Consider a design matrix $X \in \mathbb{R}^{n \times p}$ and let $x_1 \in \mathbb{R}^n$ and $x_2 \in \mathbb{R}^n$ denote two columns of the design matrix. $x_1$ and $x_2$ are best conceived as vectors lying in an $n$-dimensional vector space. From this perspective, their correlation is not a meaningful concept, because, in a formal sense, correlation is a concept that applies to random variables. Nevertheless, one can of course compute a correlation coefficient between the entries of $x_1$ and $x_2$ and, because the correlation between the regressors of a design matrix is often mentioned in the literature, we consider it here as well. The notions of linear independence, orthogonality, and uncorrelatedness for $x_1$ and $x_2$ are defined as follows.

Definition 12.2.1 (Regressor linear independence, orthogonality, and uncorrelatedness). Let $x_1, x_2 \in \mathbb{R}^n$. Then

(1) $x_1$ and $x_2$ are called linearly independent, if and only if there exists no $a \in \mathbb{R}$ such that $ax_1 = x_2$ for $x_1, x_2 \neq 0$,
(2) $x_1$ and $x_2$ are called orthogonal, if and only if $x_1^T x_2 = 0$, and
(3) $x_1$ and $x_2$ are called uncorrelated, if and only if $(x_1 - \bar{x}_1 1_n)^T (x_2 - \bar{x}_2 1_n) = 0$, where $\bar{x}_1 := \frac{1}{n}\sum_{i=1}^n x_{1i}$, $\bar{x}_2 := \frac{1}{n}\sum_{i=1}^n x_{2i}$, and $1_n$ is a vector of all ones.

•

If one considers $x_1$ and $x_2$ as arrows in an $n$-dimensional space, definition (1) means that $x_1$ and $x_2$ do not fall along the same direction, and definition (2) means that $x_1$ and $x_2$ are perpendicular, i.e., the angle between them is 90°. Orthogonality is thus a special case of linear independence. With respect to the definition of uncorrelated regressors, it is noteworthy that it applies to centred regressors, i.e., the regressors from which their average value has been subtracted. Note also that if $x_1$ and $x_2$ are centred from the outset, orthogonality and uncorrelatedness are identical. Because orthogonality is a special case of linear independence, in this case $x_1$ and $x_2$ are also linearly independent. To illustrate these concepts, we consider three examples.

Example 1 Let

$$x_1 := \begin{pmatrix} 1 \\ 1 \\ 2 \\ 3 \end{pmatrix} \quad \text{and} \quad x_2 := \begin{pmatrix} 2 \\ 3 \\ 4 \\ 5 \end{pmatrix}. \tag{12.6}$$

Because there exists no real number $a$ that by scalar multiplication transforms $x_1$ into $x_2$ (the first two entries alone would require $a = 2$ and $a = 3$, respectively), the vectors are linearly independent. However, because $x_1^T x_2 = 2 + 3 + 8 + 15 = 28$, the vectors are not orthogonal, and because $(x_1 - \bar{x}_1 1_n)^T(x_2 - \bar{x}_2 1_n) = 3.5$, the vectors are not uncorrelated.

Example 2 Let

$$x_1 := \begin{pmatrix} 0 \\ 0 \\ 1 \\ 1 \end{pmatrix} \quad \text{and} \quad x_2 := \begin{pmatrix} 1 \\ 0 \\ 1 \\ 0 \end{pmatrix}. \tag{12.7}$$


Because there exists no real number $a$ that by scalar multiplication transforms $x_1$ into $x_2$, the vectors are linearly independent. Because $x_1^T x_2 = 1$, the vectors are not orthogonal. They are, however, uncorrelated: we have $\bar{x}_1 = \bar{x}_2 = 1/2$, and the centred vectors $x_1^c = (-1/2, -1/2, 1/2, 1/2)^T$ and $x_2^c = (1/2, -1/2, 1/2, -1/2)^T$ are orthogonal because $x_1^{cT} x_2^c = -1/4 + 1/4 + 1/4 - 1/4 = 0$.

Example 3 Let

$$x_1 := \begin{pmatrix} 1 \\ -5 \\ 3 \\ -1 \end{pmatrix} \quad \text{and} \quad x_2 := \begin{pmatrix} 5 \\ 1 \\ 1 \\ 3 \end{pmatrix}. \tag{12.8}$$

Because $x_1$ cannot be transformed into $x_2$ by scalar multiplication, the vectors are linearly independent. Because $x_1^T x_2 = 5 - 5 + 3 - 3 = 0$, the vectors are also orthogonal. However, they are not uncorrelated, because $(x_1 - \bar{x}_1 1_n)^T(x_2 - \bar{x}_2 1_n) = 5$.

In summary, uncorrelated regressors are not necessarily orthogonal, and orthogonal regressors are not necessarily uncorrelated. Because the concept of linear independence subsumes both orthogonality and uncorrelatedness, it is best to speak of regressor collinearity rather than correlation.
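The three examples can be checked numerically, for instance with the following short sketch, which evaluates linear independence via the matrix rank, orthogonality via the inner product, and uncorrelatedness via the inner product of the centred vectors. The function name is chosen here purely for illustration.

```python
import numpy as np

def regressor_relations(x1, x2):
    """Return (linearly independent, orthogonal, uncorrelated) for two regressors."""
    lin_indep = np.linalg.matrix_rank(np.column_stack([x1, x2])) == 2
    orthogonal = np.isclose(x1 @ x2, 0.0)
    x1c, x2c = x1 - x1.mean(), x2 - x2.mean()          # centred regressors
    uncorrelated = np.isclose(x1c @ x2c, 0.0)
    return lin_indep, orthogonal, uncorrelated

examples = {
    "Example 1": (np.array([1.0, 1, 2, 3]),  np.array([2.0, 3, 4, 5])),
    "Example 2": (np.array([0.0, 0, 1, 1]),  np.array([1.0, 0, 1, 0])),
    "Example 3": (np.array([1.0, -5, 3, -1]), np.array([5.0, 1, 1, 3])),
}
for name, (x1, x2) in examples.items():
    print(name, regressor_relations(x1, x2))
# Expected: Example 1 -> (True, False, False), Example 2 -> (True, False, True),
#           Example 3 -> (True, True, False)
```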

12.3 Statistical efficiency of multiple linear regression designs

The quality of a multiple linear regression design can be measured according to a variety of criteria. In the current section, we focus on a simple statistical criterion that has been proposed as a measure of the statistical efficiency of multiple linear regression in the context of event-related fMRI designs. The proposed criterion relates to the covariance of the beta parameter estimate distribution. Recall that the frequentist distribution of the beta parameter estimates is given by

$$\hat{\beta} \sim N(\beta, \sigma^2(X^TX)^{-1}). \tag{12.9}$$

The covariance matrix of beta parameter estimates is thus given by

$$C(\hat{\beta}) = \sigma^2(X^TX)^{-1} \in \mathbb{R}^{p \times p}. \tag{12.10}$$

Intuitively, the diagonal elements of this covariance matrix encode how much the regressor-specific effect estimates $\hat{\beta}_j, j = 1, ..., p$ vary over repeated frequentist sampling. According to eq. (12.10), this variability is a function of the variance parameter $\sigma^2 > 0$ and the inverse of the design matrix product $X^TX$. For constant $\sigma^2 > 0$, the variance of the effect size estimates $\hat{\beta}_j, j = 1, ..., p$ is thus a function of the diagonal entries $\left((X^TX)^{-1}\right)_{jj}, j = 1, ..., p$. For a contrast weight vector $c \in \mathbb{R}^p$, this insight motivates the following criterion for the statistical efficiency of multiple linear regression designs:

Definition 12.3.1 (A multiple linear regression design efficiency criterion). Let $y \sim N(X\beta, \sigma^2 I_n)$ denote a multiple linear regression GLM and let $c \in \mathbb{R}^p$ denote a contrast vector. Then

$$\xi : \mathbb{R}^p \times \mathbb{R}^{n \times p} \to \mathbb{R}_{\ge 0}, \; (c, X) \mapsto \xi(c, X) := \left(c^T(X^TX)^{-1}c\right)^{-1} \tag{12.11}$$

serves as a measure of the statistical efficiency of the design matrix $X$ for contrast $c$.

•

For contrast vectors $c$ comprising a one at the $j$th entry and zeros for all other entries, $\xi(c, X)$ returns the reciprocal of the $j$th diagonal element of the inverse design product matrix $(X^TX)^{-1}$. As, for constant $\sigma^2 > 0$, this diagonal element encodes the variance of the corresponding beta parameter estimate $\hat{\beta}_j$, the reciprocal of this value encodes its inverse variance or statistical efficiency. An alternative interpretation of the criterion (12.11) is afforded by considering the T-statistic definition

$$T_c := \frac{c^T\hat{\beta}}{\sqrt{\sigma^2 c^T(X^TX)^{-1}c}}. \tag{12.12}$$

Under the assumption of identical true, but unknown, values of $\beta$ and $\sigma^2$, the criterion (12.11) thus favours larger values of the T-statistic, because for larger values of $\xi(c, X)$ the denominator in (12.12)

becomes smaller. Importantly, the criterion $\xi$ depends on both the design matrix $X$ and the contrast weight vector of interest $c \in \mathbb{R}^p$. In other words, according to the criterion (12.11), the same multiple linear regression design can, in principle, be statistically efficient with respect to one contrast of interest and inefficient with respect to another.
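Definition 12.3.1 translates directly into code. The following is a minimal sketch; the synthetic design matrix is used purely for illustration and is not part of the running example.

```python
import numpy as np

def design_efficiency(c, X):
    """xi(c, X) := (c^T (X^T X)^{-1} c)^{-1}, cf. eq. (12.11)."""
    return 1.0 / (c @ np.linalg.inv(X.T @ X) @ c)

# Illustration with a synthetic design matrix (offset plus two random regressors)
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])
print(design_efficiency(np.array([0.0, 1.0, -1.0]), X))   # efficiency for the contrast b1 - b2
```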

12.4 Multiple linear regression in functional neuroimaging

The fundamental aim of applying the GLM in the analysis of fMRI data is to map cognitive processes onto brain areas. To this end, two fortuitous facts are exploited: first, local neural activity results in a local alteration of the ratio of deoxygenated and oxygenated haemoglobin via its metabolic demands. Second, the local displacement of deoxygenated haemoglobin alters the local magnetic susceptibility of brain tissue and can be detected as an increase of the local MR signal by an MRI scanner. The MR signal change induced by local neural activity is referred to as the blood oxygen level-dependent signal (BOLD signal). The idea of task-based fMRI is to induce specific cognitive processes in participants, which are presumably reflected in specific neural and metabolic activity states, which in turn can be detected by means of fMRI. If two different cognitive processes are represented by anatomically different brain structures, the statistical evaluation of the location of task-induced MR signal differences thus allows for mapping cognitive processes onto the anatomy of the brain. In the following, we briefly review the process of fMRI data acquisition and fMRI data preprocessing before formulating the GLM for fMRI data analysis.

fMRI data acquisition

MRI scanners allow for taking images, i.e., three-dimensional arrays of numbers, of the brain. Depending on the specific parameters that are used to take these images, different image types result: T1-weighted images have high spatial resolution and reveal fine anatomical detail. T2*-weighted images, on the other hand, are typically less anatomically precise, but are sensitive to BOLD signal changes and take only about two seconds to acquire. fMRI studies based on the BOLD signal hence typically use T2*-weighted images. T2*-weighted images are usually acquired using an MR imaging process known as echo-planar imaging (EPI), hence the images used for fMRI are also often referred to as EPI images. In essence, the GLM-based analysis of these images converts EPI image time-series into maps of statistics that indicate local changes of the BOLD signal. These maps are called statistical parametric maps (SPMs). Here, parametric refers to the fact that the statistics involved are evaluated using parametric assumptions about their underlying distributions. fMRI data is typically organized as follows. A single participant is usually scanned in a single session, which comprises multiple runs of continuous MRI data acquisition. A run usually lasts about 10 to 15 minutes. During a run, the participant carries out a cognitive task, e.g., responding to visually presented stimuli via button presses, while a series of EPI images is acquired. fMRI data of one run thus comprises a temporal sequence of EPI images. The time it takes to acquire a single EPI image corresponds to the sampling interval of fMRI and is called time-to-repetition (TR). For typical two-dimensional imaging acquisition schemes, each image comprises a number of slices. Each slice in turn comprises a spatially arranged set of data containers known as volume elements, or voxels for short. Voxels are the three-dimensional analogues of two-dimensional pixels and make up the entire EPI image. Typical voxel sizes are 2 mm × 2 mm × 2 mm. It is very helpful to simply think of EPI images as three-dimensional arrays of numbers that represent the MR signal values of the image's voxels.

fMRI data preprocessing

Before the application of the GLM, fMRI data typically undergo some amount of data preprocessing to increase their quality. fMRI data preprocessing commonly comprises spatial distortion correction, spatial realignment, slice-time correction, spatial normalization, and spatial smoothing. We briefly review each of these steps in turn.

• Due to inhomogeneities of the magnetic field, certain parts of EPI images may be distorted with respect to the object the image is taken of. Correcting these distortions based on knowledge of magnetic field inhomogeneities is referred to as spatial distortion correction. The aim of distortion correction is thus to render the image a more veridical representation of the imaged object.


• In order to allocate an observed BOLD signal change to a specific brain region, one has to be sure that the time-series data of a voxel actually refers to the same brain region throughout the course of an experiment. The MR scanner's voxel grid is overlaid on the participant's brain in a fixed position. Thus, if the participant moves during a run, voxels will represent different brain regions over time. For this reason, participants are usually spatially fixated as much as possible and encouraged not to move their head during scanning. However, some residual motion, e.g., by the pulsation of the blood in the brain's vasculature, cannot be avoided and is corrected during the spatial realignment step. Typically, the first image of an EPI image time-series is used as the reference image to which all subsequent images are aligned.

• The slices comprising an EPI image are acquired in temporal succession. Because of this, and because during data analysis EPI images are typically considered as data samples from a single time-point, temporal interpolation can be used to resample each slice with respect to the EPI image acquisition's onset time. This process is known as slice-time correction.

• Spatial normalization refers to the transformation of the participant-specific three-dimensional voxel time-series into a standard group anatomical space. This transformation is performed by translating, rotating, stretching, and squeezing the EPI data in three dimensions. Because brain anatomy exhibits some degree of inter-individual variability, spatial normalization can never achieve full alignment between the brains of different participants, but it is a useful and commonly used approach for fMRI group studies.

• Spatial smoothing refers to the weighted spatial averaging of voxel data with data from nearby voxels. Intuitively, spatial smoothing may be conceived as removing random signal fluctuations over space and rendering the data spatially more coherent.

fMRI data organization

Upon data acquisition and data preprocessing, fMRI data of a single participant is organized in a spatiotemporal format of voxel-specific MR signal time-series. These spatially arranged MR signal time- series may be envisioned as in Table 12.2, where rows represent time-points (EPI images) and columns represent spatial locations (voxels). The basic idea of the mass-univariate viewpoint of GLM-based fMRI data analysis is to consider each voxel’s MR signal time-series in isolation. This is referred to as a mass-univariate approach, because the dependent variable, i.e., the voxel-specific MR signal time-series, is univariate - but there are many voxels.

                         Voxel 1   Voxel 2   Voxel 3   Voxel 4   ...   Voxel m
Time-point (image) 1      97.3      90.2      86.1      89.9     ...    85.3
Time-point (image) 2      98.2      91.1      87.0      89.5     ...    86.2
...                       ...       ...       ...       ...      ...    ...
Time-point (image) n      92.3      95.6      82.0      87.4     ...    83.1

Table 12.2. Tabular representation of voxel-specific MR signal time-series of a single participant and a single fMRI experimental run. Rows represent EPI images, i.e., data acquisition times, and columns represent voxels, i.e., three-dimensional locations. $n \in \mathbb{N}$ and $m \in \mathbb{N}$ denote the total number of images and the total number of voxels, respectively.

We next formulate the mass-univariate GLM for the analysis of fMRI data and explore how it can be used for cognitive process brain mapping. To this end, we consider the time-series data of a single voxel (i.e., a column of Table 12.2) and denote it by $y = (y_1, y_2, ..., y_n)^T$. The time-series data of a single voxel is thus expressed as a column vector $y \in \mathbb{R}^n$, the $i$th component of which corresponds to the $i$th time-point of the fMRI data acquisition. The GLM design for such a voxel data time-series then takes the form of a multiple linear regression over time. Specifically, the $i$th data point $y_i$ is modelled as a weighted sum of the values of a set of $p$ independent variables and an additive noise term,

$$y_i = x_{i1}\beta_1 + x_{i2}\beta_2 + x_{i3}\beta_3 + \cdots + x_{ip}\beta_p + \varepsilon_i \quad \text{for } i = 1, ..., n. \tag{12.13}$$

In eq. (12.13), $y_i \in \mathbb{R}$ denotes the value of the MR signal of the voxel for time-point (EPI image) $i$, $x_{ij}$ denotes the value of the $j$th independent variable for time-point $i$, $\beta_j, j = 1, ..., p$ denotes the time-independent weighting parameter for the $j$th independent variable, and $\varepsilon_i$ denotes a time-point-specific error contribution. Note that it is assumed that there are $p$ independent variables. The independent variables are also routinely referred to as regressors or predictors. The time point-specific noise term $\varepsilon_i$ is assumed to be distributed according to a univariate Gaussian distribution with expectation parameter 0 and variance parameter $\sigma^2$. Until further notice, we will assume that the $\varepsilon_i$ are distributed identically and independently for $i = 1, ..., n$. In probability density function form, eq. (12.13) thus takes the standard GLM form

$$y \sim N(X\beta, \sigma^2 I_n), \text{ where } y \in \mathbb{R}^n, \; X \in \mathbb{R}^{n \times p}, \; \beta \in \mathbb{R}^p, \text{ and } \sigma^2 > 0. \tag{12.14}$$

Figure 12.2. (A) The haemodynamic response function reflects the idealized MR signal response to a brief neural event. It serves as an important basis for GLM modelling of fMRI data in a temporal convolution framework. (B) Example condition onsets in a hypothetical experiment with two conditions. (C) Exemplary MR signal time-series from two different voxels. (D) Predicted MR signal time-series (design matrix columns) formed by convolving the stimulus onset functions of Panel B with the canonical haemodynamic response function of Panel A. (E) Predicted and observed voxel time-series data based on the parameter choices $\hat{\beta}_A$ and $\hat{\beta}_B$.
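To make the mass-univariate idea concrete, the following minimal sketch fits the GLM of eq. (12.14) separately to every voxel time-series of a data array organized as in Table 12.2. The array names `Y` and `X` are hypothetical placeholders for the preprocessed voxel data and the design matrix; the sketch illustrates the approach under those assumptions and is not a reference implementation.

```python
import numpy as np

def mass_univariate_glm(Y, X):
    """Fit y ~ N(X beta, sigma^2 I_n) separately to every voxel time-series.

    Y: (n, m) array of voxel time-series, organized as in Table 12.2.
    X: (n, p) design matrix. Returns (p, m) beta estimates and (m,) variance estimates."""
    n, p = X.shape
    B = np.linalg.solve(X.T @ X, X.T @ Y)        # beta estimates, one column per voxel
    R = Y - X @ B                                # residuals
    sigma2 = (R * R).sum(axis=0) / (n - p)       # voxel-wise variance estimates
    return B, sigma2

def t_map(c, Y, X):
    """Voxel-wise T-statistics for a contrast weight vector c (one value per voxel)."""
    B, sigma2 = mass_univariate_glm(Y, X)
    denom = np.sqrt(sigma2 * (c @ np.linalg.inv(X.T @ X) @ c))
    return (c @ B) / denom

# Usage (assuming Y and X exist): t_values = t_map(np.array([1.0, -1.0]), Y, X)
```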

We next explore how the mass-univariate GLM of eq. (12.14) and the intuition of the specificity of a brain region's response to a cognitive process of interest are related. To this end, we consider a simple fMRI experiment with one experimental factor comprising two levels. An example for such an experiment is the repeated presentation of two different visual stimuli, for example faces and houses, inducing the cognitive processes of visual face and location processing, respectively. For simplicity, we will refer to the two experimental levels as Condition 1 and Condition 2, respectively. As described above, for each fMRI data acquisition run, fMRI data are acquired continuously for approximately 10 to 15 minutes, while participants are exposed to the experimental manipulation. As also described above, the fundamental idea of GLM-based fMRI is that in response to a cognitive process, neurons in brain regions that are specialized for the respective cognitive process become active, causing a metabolic cascade which leads to a local increase in the level of oxygenated haemoglobin and hence the local MR signal. Crucially,


such haemodynamic responses to a variety of cognitive processes, mainly of the visual processing type, have been measured. Based on these measurements, the concept of a haemodynamic response function has been formulated. A haemodynamic response function is a mathematical model that describes the idealized change of the MR signal in response to a neurocognitive event. The particularities of haemodynamic response functions that are employed in GLM-based fMRI data analysis are discussed in a later section. For the moment, it suffices to note that the MR signal time-series in response to a neurocognitive event at time point $t = 0$ approximately takes the shape of the function shown in Figure 12.2A.

For the purpose of the example, we next assume that a participant was presented with stimuli representing Condition 1 and Condition 2 in random order over the course of an experimental run of 260 seconds at the times shown by the stick functions in Figure 12.2B. Whenever the stick function of a condition takes on the value 1, the respective condition was presented. For example, at times $t = 0$ and $t = 16$ Condition 1 was presented, at time $t = 32$ Condition 2 was presented, and so on. We further assume that while these conditions were presented, fMRI data were collected every 1.44 seconds from two voxels, which we refer to as Voxel A and Voxel B. We assume that these voxels are located in different brain regions and that their recorded data take the form of Figure 12.2C. Visually comparing the event time-points of Figure 12.2B to the voxel time-series data in Figure 12.2C suggests that Voxel A shows MR signal increases whenever Condition 1 is presented and no MR signal increase when Condition 2 is presented. On the other hand, Voxel B shows MR signal increases for both Condition 1 and Condition 2.

In the GLM analysis of fMRI data, the voxel-specific responsiveness to a specific condition that is suggested by Figures 12.2B and 12.2C is encoded by the beta parameter values of the independent variable representing the respective condition. To see this, consider the predicted signal time-series for voxels that are only and ideally responsive to either Condition 1 or Condition 2 as shown in Figure 12.2D. These predicted time-series are obtained by "replacing" the stick functions of Figure 12.2B with the assumed haemodynamic response function of Figure 12.2A. Technically, this "replacement" is achieved by the convolution of the stimulus stick functions with the haemodynamic response function and implements the linear time-invariant system perspective of GLM-based fMRI. This will be detailed in a later section. Crucially, in the GLM analysis of fMRI data, the predicted time-series for each condition are identified with the columns of the design matrix in the multiple linear regression design for a single voxel's observed data time-series. That is, the number that is given by the predicted MR signal for Condition 1 and the number that is given by the predicted MR signal for Condition 2 at time-point $i$ are concatenated into a row vector with two entries and entered in the $i$th row of the design matrix. Note that for the current example, data were acquired for $n = 260/1.44 \approx 180$ data points, such that $X \in \mathbb{R}^{180 \times 2}$. Alternatively, one can imagine transposing the predicted MR signals of Condition 1 into a column vector and entering this as the first column of the design matrix, and likewise transposing the predicted MR signals of Condition 2 into a column vector and entering this as the second column of the design matrix. The multiple linear regression design matrices for GLM-based fMRI data analyses are commonly represented as grey-scale images of their entries, as in the leftmost panels of Figures 12.3A and B.

Figure 12.3. (A) Graphical representation of the GLM matrix product for $\beta_A = (1, 0)^T$. (B) Graphical representation of the GLM matrix product for $\beta_B = (1, 1)^T$.

Now consider again the MR signal time-series of Voxel A in Figure 12.2C. Transposing this row vector into a column vector $y_A \in \mathbb{R}^{180}$, we see that we can write down the GLM equation for this voxel's time-series data quite well, if we choose the true, but unknown, beta parameter vector to be approximately $\beta_A := (1, 0)^T$. Intuitively, we can represent the corresponding GLM matrix multiplication graphically as in Figure 12.3A. Likewise, consider the MR signal time-series of Voxel B in Figure 12.2C. Here we see that we can recreate the voxel's data time-series using the same design matrix as for Voxel A but setting the true, but unknown, parameter vector to $\beta_B := (1, 1)^T$, as shown in Figure 12.3B. Note that the observed signal results from the outcome of the design matrix and parameter multiplication plus a noise vector, here denoted by $\varepsilon_A$ and $\varepsilon_B$, respectively. Equivalently, we may evaluate the respective beta parameter estimates $\hat{\beta}_A$ and $\hat{\beta}_B$ based on the design matrix $X$ and the data $y_A$ and $y_B$, respectively. These parameter estimates result in $\hat{\beta}_A = (1.0027, -0.0471)^T$ and $\hat{\beta}_B = (1.0033, 1.0025)^T$. Overlaying the estimated (or fitted) time-series $X\hat{\beta}_A$ and $X\hat{\beta}_B$ and the originally observed time-series confirms a good correspondence, as shown in Figure 12.2E.

In essence, the value of the true, but unknown, beta parameter which belongs to a specific onset regressor, or its estimated counterpart, tells us something about the voxel's preference with respect to an experimental condition or neurocognitive process: for Voxel A, the first entry in $\beta_A$ and $\hat{\beta}_A$ is large and the second entry in $\beta_A$ and $\hat{\beta}_A$ is small. From the discussion above, we see that this means that Voxel A is responsive to Condition 1, but not to Condition 2. Likewise, for Voxel B, the two entries in $\beta_B$ and $\hat{\beta}_B$ are very similar and, as discussed above, Voxel B responds equally well to both conditions.

Finally, statistical parametric maps are created by evaluating not only beta parameter estimates, but also variance parameter estimates and the ensuing T- or F-statistics (Figure 12.4). For example, for Voxel A, the T-statistic for the null hypothesis that the true, but unknown, beta parameters are identical, using a contrast weight vector of the form $c = (1, -1)^T$, will yield a value deviating from 0 quite a bit. For Voxel B, the difference between the entries in $\hat{\beta}_B$ is around zero, and the T-statistic using this contrast weight vector is hence also close to zero (given that the estimated variance parameters are approximately identical and not themselves close to zero). Concatenating all statistics values for a given contrast weight vector over voxels and arranging them in their anatomical relationships then yields an SPM.

Figure 12.4. Typical visualization of a statistical parametric map. Note that the SPM itself is overlaid on an anatomical image and voxels with statistics smaller than the threshold value u = 7.22 are masked out, i.e., translucent. Further note that only three slices of the SPM oriented along the principal anatomical axes are shown, and not the entire set of suprathreshold SPM values.
Typically, SPMs are visualized by overlaying the respective statistics values on anatomical images, representing the value of a statistic by means of a colormap, and thresholding the SPM at an appropriate value such that statistics smaller than this value are not shown at all.
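As an illustration of the convolution step described above, the following sketch builds a two-column fMRI design matrix by convolving condition onset stick functions with a haemodynamic response function and resampling at the TR of 1.44 seconds. The double-gamma HRF parameters and the onset times are assumptions chosen purely for illustration; they are not the functions or event sequences used to generate Figure 12.2.

```python
import numpy as np
from scipy import stats

def double_gamma_hrf(t, peak=6.0, undershoot=16.0, ratio=1/6.0):
    """An assumed double-gamma haemodynamic response function, normalized to a peak of 1."""
    h = stats.gamma.pdf(t, peak) - ratio * stats.gamma.pdf(t, undershoot)
    return h / h.max()

def fmri_design_matrix(onsets_per_condition, run_length=260.0, tr=1.44, dt=0.1):
    """Convolve condition onset stick functions with the HRF and sample at the TR."""
    t_hi = np.arange(0.0, run_length, dt)                 # high-resolution time grid (s)
    kernel = double_gamma_hrf(np.arange(0.0, 32.0, dt))   # 32 s HRF kernel
    scan_times = np.arange(0.0, run_length, tr)           # acquisition times of the EPI images
    columns = []
    for onsets in onsets_per_condition:
        sticks = np.zeros_like(t_hi)
        sticks[np.round(np.asarray(onsets) / dt).astype(int)] = 1.0   # stick function
        predicted = np.convolve(sticks, kernel)[: t_hi.size]          # predicted BOLD response
        columns.append(np.interp(scan_times, t_hi, predicted))        # resample at the TR
    return np.column_stack(columns)

# Hypothetical onset times (in seconds) for Condition 1 and Condition 2
X = fmri_design_matrix([[0, 16, 64, 128, 192], [32, 96, 160, 224]])
print(X.shape)  # (181, 2): one column per condition, one row per EPI image
```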

Optimizing fMRI designs

As an example of assessing the statistical efficiency of an fMRI design, we consider measuring the statistical efficiency of a first-level fMRI design with two experimental conditions and varying inter-event intervals (IEIs), i.e., temporal durations between the onsets of two successive events. To this end, we assume that for each of the two experimental conditions, 20 event repeats are presented in random order


and the IEI is varied between 2 and 22 seconds. In Figure 12.5A, we visualize two exemplary designs with IEI = 4 seconds and IEI = 12 seconds, respectively. The left panel of Figure 12.5B depicts the angle in degrees and the correlation of the two design matrix regressors as a function of the IEI, averaged over 100 randomly allocated event sequences. Note that the geometrical relationship of the regressors and their correlation do not show a one-to-one correspondence. For example, for an IEI of approximately 5 seconds, the regressors are nearly orthogonal, but also strongly negatively correlated. The right panel of Figure 12.5B depicts the design efficiency criterion $\xi$ as a function of the IEI for two different contrast weight vectors. The first contrast weight vector $c_1 := (1, 1)^T$ allows for detecting activation over both experimental conditions, while the second contrast weight vector $c_2 := (1, -1)^T$ allows for detecting differential activation between both experimental conditions. Notably, short IEIs are more efficient for detecting activation across both conditions, while intermediate IEIs of around 10 seconds are most efficient for detecting differential activations between the conditions. In summary, criteria such as (12.11) can help to optimize first-level fMRI GLM designs. Often, however, there exist additional constraints, such as the cognitive effects to be elicited by the paradigm, and general-purpose criteria for optimizing first-level fMRI GLM designs are difficult to establish. Simulating potential first-level fMRI GLM designs and assessing their regressor collinearity is in general an advisable approach.

Figure 12.5. Measuring the statistical efficiency of first-level fMRI GLM designs. (A) Two-condition fMRI GLM designs with varying inter-event intervals (IEIs). The left panel depicts an exemplary design with an IEI of 4 seconds, the right panel an exemplary design with an IEI of 12 seconds. Note that for the short IEI, the individual predicted BOLD responses for experimental events overlap, which is not the case for the long IEI. (B) The left panel depicts the average angle and correlation between the two regressors as a function of the IEI over 100 randomly allocated event sequences. The right panel depicts the average design efficiency as measured by $\xi$ for two contrast weight vectors. Notably, the IEI and the parameter contrast of interest interact in their determination of the first-level fMRI GLM design efficiency as measured by $\xi(c, X)$.

12.5 Bibliographic remarks

Comprehensive introductions to multiple linear regression are provided, for example, by Draper and Smith (1998), Hocking (2003), and Seber and Lee (2003). A comprehensive introduction to the physical and biological basis of fMRI is given by Huettel et al. (2014). More advanced and in-depth accounts are provided by Jezzard et al. (2001) and Uludag et al. (2015). Finally, comprehensive introductions to fMRI data preprocessing are provided in Poldrack et al. (2011) and Friston (2007).

12.6 Study questions

1. Write down the GLM formulation of a multiple linear regression with one offset variable and two predictor variables.
2. Define the notions of collinear, orthogonal, and correlated design matrix regressors.

3. Discuss why, for a design matrix $X \in \mathbb{R}^{n \times p}$ and a contrast weight vector $c \in \mathbb{R}^p$, the function

$$\xi : \mathbb{R}^p \times \mathbb{R}^{n \times p} \to \mathbb{R}_{>0}, \; (c, X) \mapsto \xi(c, X) := \left(c^T(X^TX)^{-1}c\right)^{-1} \tag{12.15}$$


has some merit as a measure of the efficiency of the experimental design encoded in the design matrix.
4. What does it mean for the GLM to be used for fMRI data analysis in a mass-univariate fashion?
5. Describe the fMRI data organization after fMRI data preprocessing.
6. What is the difference between the haemodynamic response and a haemodynamic response function?
7. Which GLM design category is used for the analysis of fMRI time-series data?
8. What do the beta parameter estimates obtained in a GLM analysis of the fMRI time-series data of a single voxel reflect?
9. What is a statistical parametric map?
10. How can fMRI designs be optimized?

13 | One-way ANOVA

13.1 The GLM perspective

A convenient way to think about the one-way analysis of variance (ANOVA) is to consider it the extension of an independent two-sample t-test to more than two levels of a single experimental factor. Let $m \in \mathbb{N}$ denote the number of levels of the experimental factor, let $n_i \in \mathbb{N}$ denote the number of observations at level $i$ of the factor, such that the total number of observations is $n := \sum_{i=1}^m n_i$, and for $j = 1, ..., n_i$, let $y_{ij} \in \mathbb{R}$ denote the random variable modelling the $j$th experimental unit on the $i$th level of the experimental factor. The distributions of the random variables modelling data points in a one-way ANOVA design then take the form

$$y_{ij} \sim N(\mu_i, \sigma^2) \;\Leftrightarrow\; y_{ij} = \mu_i + \varepsilon_{ij}, \; \varepsilon_{ij} \sim N(0, \sigma^2), \; \sigma^2 > 0 \quad \text{for } i = 1, ..., m \text{ and } j = 1, ..., n_i. \tag{13.1}$$

Note that the ni random variables on level i of the experimental factor have the same expectation parameter µi. On first approximation, the underlying assumption of the one-way ANOVA model about this level-specific expectation parameter is

µi := µ0 + αi for i = 1, ..., m. (13.2)

In eq. (13.2), µ0 ∈ R models a common offset for all levels of the experimental factor and αi ∈ R models the additional effect of the ith level of the experimental factor. In design matrix formulation, the one-way ANOVA as just specified can be written as

$$y \sim N(X\beta, \sigma^2 I_n), \tag{13.3}$$

where

$$y := \begin{pmatrix} y_{11} \\ \vdots \\ y_{1n_1} \\ y_{21} \\ \vdots \\ y_{2n_2} \\ \vdots \\ y_{m1} \\ \vdots \\ y_{mn_m} \end{pmatrix} \in \mathbb{R}^n, \quad X := \begin{pmatrix} 1 & 1 & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & 1 & 0 & \cdots & 0 \\ 1 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & 0 & 0 & \cdots & 1 \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & 0 & 0 & \cdots & 1 \end{pmatrix} \in \mathbb{R}^{n \times (m+1)}, \quad \beta := \begin{pmatrix} \mu_0 \\ \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_m \end{pmatrix} \in \mathbb{R}^{m+1}, \quad \text{and } \sigma^2 > 0. \tag{13.4}$$

Notably, this one-way ANOVA GLM formulation is based on a design matrix that comprises $m + 1$ columns: one column of all 1's corresponding to the constant offset parameter $\mu_0$ and $m$ columns of indicator variables corresponding to the $m$ level-specific effect parameters $\alpha_i, i = 1, ..., m$. These indicator variables take on the value 1 in rows of data variables corresponding to the $i$th level of the experimental factor and take on the value 0 otherwise.

Over-parameterization of the one-way ANOVA formulation (13.1) - (13.4)

Unfortunately, the one-way ANOVA GLM formulation of eqs. (13.1) - (13.4) requires a reformulation in order to enable the estimation of its parameters. As it stands, the model is over-parameterized. This problem may be viewed from at least three perspectives.

• From a data-analytical perspective, there are more unknowns than knowns: one can obtain averages from the $m$ levels of the experimental factor to estimate the $m$ expectations $\mu_i, i = 1, ..., m$, but the $m$ expectations $\mu_i$ are parameterized with the $m + 1$ parameters $\mu_0$ and $\alpha_1, ..., \alpha_m$.

• From the perspective of systems of linear equations, over-parameterization implies that we have more unknown parameters than equations from which to determine them. A simple example is the system of linear equations

$$a + b + c = 0, \quad 2a + b + c = 1 \;\Leftrightarrow\; \begin{pmatrix} 1 & 1 & 1 \\ 2 & 1 & 1 \end{pmatrix} \begin{pmatrix} a \\ b \\ c \end{pmatrix} = \begin{pmatrix} 0 \\ 1 \end{pmatrix}. \tag{13.5}$$

In eq. (13.5), we have two equations and thus two "data points" associated with a specific parameter combination, as well as three "parameters" $a$, $b$, and $c$. The problem is that different combinations of values of $a$, $b$, and $c$ can solve the system (13.5). We thus cannot uniquely infer the parameter values from the data points. For example, both the combination $a = 1, b = 1, c = -2$ and the combination $a = 1, b = -1, c = 0$ solve the system of linear equations (13.5).

• Finally, from the perspective of linear algebra, the design matrix $X$ in eq. (13.4) is not of full column rank, because the first column of $X$ is the sum of the last $m$ columns of $X$. In other words, the columns of $X$ are not linearly independent. It can be shown that the rank-deficiency of $X$ results in the rank-deficiency of $X^TX$. This in turn corresponds to $X^TX$ being non-invertible, which implies that the beta estimator is not defined.

To nevertheless obtain a useful one-way ANOVA model, the model represented by eqs. (13.1), (13.2), (13.3), and (13.4) needs to be reformulated. There are several ways in which this can be done. One approach is to set $\mu_0 = 0$ or to simply omit the constant offset $\mu_0$. If this approach is chosen, the $\alpha_i$ become the factor level expectations and $\alpha_i$ represents the expected response at factor level $i$. This approach, however, does not generalize well to models with more than one factor. Instead, the so-called reference cell method is generally preferred.

Reference cell reformulation

The reference cell method corresponds to constraining one of the level effects $\alpha_i, i = 1, ..., m$ to be zero. While in principle any level effect can be chosen as the reference cell, conventionally the effect of the first factor level $\alpha_1$ is constrained to be zero. That is, by definition $\alpha_1 := 0$. The level-specific expectation parameters originally formulated in eq. (13.2) thus take on the forms listed in Table 13.1.

Original formulation        Reformulation with α1 := 0
µ1 := µ0 + α1               µ1 := µ0
µ2 := µ0 + α2               µ2 := µ0 + α2
...                         ...
µm := µ0 + αm               µm := µ0 + αm

Table 13.1. The reference cell reformulation of the level-specific expectation parameters of a one-way ANOVA model with m levels.

Applying the reference cell method to the one-way ANOVA model thus entails that $\mu_0$ becomes the expectation parameter of the first level of the experimental factor, while the parameters $\alpha_i, i = 2, ..., m$ become the additional effects of the $i$th factor level with respect to the first-level expectation parameter $\mu_0$. Crucially, $\alpha_2, ..., \alpha_m$ thus model the expected differences between the expectations of the respective experimental factor levels $i = 2, ..., m$ and the expectation of the first level.

In its matrix formulation, the reference cell method with α1 := 0 is equivalent to removing α1 from the beta parameter vector and deleting its corresponding indicator variable column from the design matrix. The thus reformulated model of eqs. (13.3) and (13.4) takes the form

$$y \sim N(X\beta, \sigma^2 I_n), \tag{13.6}$$

where

$$y := \begin{pmatrix} y_{11} \\ \vdots \\ y_{1n_1} \\ y_{21} \\ \vdots \\ y_{2n_2} \\ \vdots \\ y_{m1} \\ \vdots \\ y_{mn_m} \end{pmatrix} \in \mathbb{R}^n, \quad X := \begin{pmatrix} 1 & 0 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 1 & 0 & \cdots & 0 \\ 1 & 1 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 1 & 1 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 1 & 0 & \cdots & 1 \\ \vdots & \vdots & & \vdots \\ 1 & 0 & \cdots & 1 \end{pmatrix} \in \mathbb{R}^{n \times m}, \quad \beta := \begin{pmatrix} \mu_0 \\ \alpha_2 \\ \vdots \\ \alpha_m \end{pmatrix} \in \mathbb{R}^m, \quad \text{and } \sigma^2 > 0. \tag{13.7}$$

Notably, this one-way ANOVA GLM formulation is based on a design matrix that comprises $m$ columns: one column of all 1's corresponding to the constant offset parameter $\mu_0$ and $m - 1$ columns of indicator variables corresponding to the $m - 1$ level-specific effect parameters $\alpha_2, ..., \alpha_m$. These indicator variables take on the value 1 in rows of data variables corresponding to the respective level of the experimental factor and take on the value 0 otherwise. In contrast to the rank-deficient design matrix in eq. (13.4), the last $m - 1$ columns of the design matrix defined in eq. (13.7) do not add up to its first column.
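The rank argument can be checked numerically. The following sketch builds both the over-parameterized design matrix of eq. (13.4) and the reference cell design matrix of eq. (13.7) for a balanced design and compares their column ranks; the helper function name is hypothetical and chosen for illustration.

```python
import numpy as np

def anova_design(n_per_level, reference_cell=True):
    """One-way ANOVA design matrix, cf. eqs. (13.4) and (13.7)."""
    m = len(n_per_level)
    blocks = []
    for i, n_i in enumerate(n_per_level):
        indicators = np.zeros((n_i, m))
        indicators[:, i] = 1.0
        blocks.append(np.column_stack([np.ones(n_i), indicators]))
    X = np.vstack(blocks)
    if reference_cell:
        X = np.delete(X, 1, axis=1)   # drop the alpha_1 indicator column
    return X

X_over = anova_design([8, 8, 8, 8], reference_cell=False)   # eq. (13.4), m + 1 = 5 columns
X_ref = anova_design([8, 8, 8, 8], reference_cell=True)     # eq. (13.7), m = 4 columns
print(np.linalg.matrix_rank(X_over), X_over.shape)   # rank 4 < 5 columns: rank-deficient
print(np.linalg.matrix_rank(X_ref), X_ref.shape)     # rank 4 = 4 columns: full column rank
```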

Estimation and evaluation of one-way ANOVA designs

As shown below, the one-way ANOVA model's beta parameter estimator for the GLM design of eqs. (13.6) and (13.7) evaluates to

$$\hat{\beta} = \begin{pmatrix} \hat{\mu}_0 \\ \hat{\alpha}_2 \\ \vdots \\ \hat{\alpha}_m \end{pmatrix} = \begin{pmatrix} \frac{1}{n_1}\sum_{j=1}^{n_1} y_{1j} \\ \frac{1}{n_2}\sum_{j=1}^{n_2} y_{2j} - \frac{1}{n_1}\sum_{j=1}^{n_1} y_{1j} \\ \vdots \\ \frac{1}{n_m}\sum_{j=1}^{n_m} y_{mj} - \frac{1}{n_1}\sum_{j=1}^{n_1} y_{1j} \end{pmatrix} = \begin{pmatrix} \bar{y}_1 \\ \bar{y}_2 - \bar{y}_1 \\ \vdots \\ \bar{y}_m - \bar{y}_1 \end{pmatrix}, \tag{13.8}$$

where

$$\bar{y}_i := \frac{1}{n_i}\sum_{j=1}^{n_i} y_{ij} \quad \text{for } i = 1, ..., m. \tag{13.9}$$

In other words, the expectation parameter $\mu_0$ of the first level of the experimental factor is estimated by the sample mean of the first-level data variables $y_{1j}, j = 1, ..., n_1$. Furthermore, for $i = 2, ..., m$, the expected differences $\alpha_i$ between the expectation of the $i$th level of the experimental factor and the expectation of the first level of the experimental factor are estimated by the difference between the sample mean of the $i$th-level data variables $y_{ij}, j = 1, ..., n_i$ and the sample mean of the first-level data variables $y_{1j}, j = 1, ..., n_1$.

Proof. We first note that for the design matrix defined in eq. (13.7), we have

$$X^TX = \begin{pmatrix} n & n_2 & n_3 & \cdots & n_m \\ n_2 & n_2 & 0 & \cdots & 0 \\ n_3 & 0 & n_3 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ n_m & 0 & 0 & \cdots & n_m \end{pmatrix}. \tag{13.10}$$


The inverse of $X^TX$ is given by

$$(X^TX)^{-1} = \begin{pmatrix} \frac{1}{n_1} & -\frac{1}{n_1} & \cdots & -\frac{1}{n_1} \\ -\frac{1}{n_1} & \frac{n_1+n_2}{n_1 n_2} & \cdots & \frac{1}{n_1} \\ \vdots & \vdots & \ddots & \vdots \\ -\frac{1}{n_1} & \frac{1}{n_1} & \cdots & \frac{n_1+n_m}{n_1 n_m} \end{pmatrix}. \tag{13.11}$$

For example, for $m = 4$, we have

$$X^TX = \begin{pmatrix} n & n_2 & n_3 & n_4 \\ n_2 & n_2 & 0 & 0 \\ n_3 & 0 & n_3 & 0 \\ n_4 & 0 & 0 & n_4 \end{pmatrix} \quad \text{and} \quad (X^TX)^{-1} = \begin{pmatrix} \frac{1}{n_1} & -\frac{1}{n_1} & -\frac{1}{n_1} & -\frac{1}{n_1} \\ -\frac{1}{n_1} & \frac{n_1+n_2}{n_1 n_2} & \frac{1}{n_1} & \frac{1}{n_1} \\ -\frac{1}{n_1} & \frac{1}{n_1} & \frac{n_1+n_3}{n_1 n_3} & \frac{1}{n_1} \\ -\frac{1}{n_1} & \frac{1}{n_1} & \frac{1}{n_1} & \frac{n_1+n_4}{n_1 n_4} \end{pmatrix}. \tag{13.12}$$

We next note that

$$X^Ty = \begin{pmatrix} \sum_{i=1}^m \sum_{j=1}^{n_i} y_{ij} \\ \sum_{j=1}^{n_2} y_{2j} \\ \vdots \\ \sum_{j=1}^{n_m} y_{mj} \end{pmatrix}. \tag{13.13}$$

We thus obtain

$$\hat{\beta} = (X^TX)^{-1}X^Ty = \begin{pmatrix} \frac{1}{n_1} & -\frac{1}{n_1} & \cdots & -\frac{1}{n_1} \\ -\frac{1}{n_1} & \frac{n_1+n_2}{n_1 n_2} & \cdots & \frac{1}{n_1} \\ \vdots & \vdots & \ddots & \vdots \\ -\frac{1}{n_1} & \frac{1}{n_1} & \cdots & \frac{n_1+n_m}{n_1 n_m} \end{pmatrix} \begin{pmatrix} \sum_{i=1}^m \sum_{j=1}^{n_i} y_{ij} \\ \sum_{j=1}^{n_2} y_{2j} \\ \vdots \\ \sum_{j=1}^{n_m} y_{mj} \end{pmatrix}. \tag{13.14}$$

For the first entry in $\hat{\beta}$, we have

$$\begin{aligned}
\hat{\beta}_1 &= \frac{1}{n_1}\sum_{i=1}^m\sum_{j=1}^{n_i} y_{ij} - \frac{1}{n_1}\sum_{j=1}^{n_2} y_{2j} - \cdots - \frac{1}{n_1}\sum_{j=1}^{n_m} y_{mj} \\
&= \frac{1}{n_1}\left(\left(\sum_{j=1}^{n_1} y_{1j} + \sum_{j=1}^{n_2} y_{2j} + \cdots + \sum_{j=1}^{n_m} y_{mj}\right) - \sum_{j=1}^{n_2} y_{2j} - \cdots - \sum_{j=1}^{n_m} y_{mj}\right) \\
&= \frac{1}{n_1}\sum_{j=1}^{n_1} y_{1j} \\
&= \bar{y}_1.
\end{aligned} \tag{13.15}$$



Figure 13.1. (A) Visualization of a one-sample t-test design. (B) Visualization of an independent two-sample t-test design. The errorbars depict the pooled standard deviation s12. (C) Visualization of a one-way ANOVA design.

For the second entry in $\hat{\beta}$, we have

$$\begin{aligned}
\hat{\beta}_2 &= -\frac{1}{n_1}\sum_{i=1}^m\sum_{j=1}^{n_i} y_{ij} + \frac{n_1+n_2}{n_1 n_2}\sum_{j=1}^{n_2} y_{2j} + \frac{1}{n_1}\sum_{j=1}^{n_3} y_{3j} + \cdots + \frac{1}{n_1}\sum_{j=1}^{n_m} y_{mj} \\
&= \frac{n_1+n_2}{n_1 n_2}\sum_{j=1}^{n_2} y_{2j} - \frac{1}{n_1}\left(\sum_{j=1}^{n_1} y_{1j} + \sum_{j=1}^{n_2} y_{2j} + \cdots + \sum_{j=1}^{n_m} y_{mj}\right) + \frac{1}{n_1}\sum_{j=1}^{n_3} y_{3j} + \cdots + \frac{1}{n_1}\sum_{j=1}^{n_m} y_{mj} \\
&= \frac{n_1+n_2}{n_1 n_2}\sum_{j=1}^{n_2} y_{2j} - \frac{1}{n_1}\sum_{j=1}^{n_1} y_{1j} - \frac{1}{n_1}\sum_{j=1}^{n_2} y_{2j} \\
&= \frac{n_1+n_2}{n_1 n_2}\sum_{j=1}^{n_2} y_{2j} - \frac{n_2}{n_1 n_2}\sum_{j=1}^{n_2} y_{2j} - \frac{1}{n_1}\sum_{j=1}^{n_1} y_{1j} \\
&= \frac{n_1}{n_1 n_2}\sum_{j=1}^{n_2} y_{2j} - \frac{1}{n_1}\sum_{j=1}^{n_1} y_{1j} \\
&= \frac{1}{n_2}\sum_{j=1}^{n_2} y_{2j} - \frac{1}{n_1}\sum_{j=1}^{n_1} y_{1j} \\
&= \bar{y}_2 - \bar{y}_1,
\end{aligned} \tag{13.16}$$

and analogously for the remaining entries $\hat{\beta}_3, ..., \hat{\beta}_m$.

As an example, we consider again the data introduced in Chapter 11. To illustrate the use of a one-way ANOVA model in this scenario, we ignore the alcohol factor and only consider the age factor. Moreover, we define m = 4 levels of the age factor: for level i = 1 we group participants aged 15 - 22 years, for level i = 2 we group participants aged 23 - 30 years, for level i = 3 we group participants aged 31-38 years, and for level i = 4 we group participants aged 39 - 46 years. This regrouping results in the data layout documented in Table 13.2. Here, yij models DLPFC volume of the jth participant at the ith level of the age factor and n1 = n2 = n3 = n4 = 8. One-way ANOVA designs are usually visualized by depicting the group sample means and the associated standard deviations as shown in Figure 13.1.

          i = 1                  i = 2                  i = 3                  i = 4
P    ij    y1j          P    ij    y2j          P    ij    y3j          P    ij    y4j
1    11    178.8        9    21    154.5        17   31    155.5        25   41    139.4
2    12    168.5        10   22    158.4        18   32    150.5        26   42    145.2
3    13    169.9        11   23    172.1        19   33    137.8        27   43    123.7
4    14    162.1        12   24    162.7        20   34    160.1        28   44    130.7
5    15    170.2        13   25    165.4        21   35    155.4        29   45    114.1
6    16    156.9        14   26    142.2        22   36    127.2        30   46    151.2
7    17    175.4        15   27    154.4        23   37    138.0        31   47    121.7
8    18    173.4        16   28    145.7        24   38    133.4        32   48    140.9

Table 13.2. The example data set introduced in Chapter 11 in a one-way ANOVA layout. The participant labels in the P column correspond to the labels in the original data table of Chapter 11, while the ij-indices correspond to the one-way ANOVA-style relabelled variables yij for i = 1, ..., m and j = 1, ..., ni.


The one-way ANOVA model for m = 4 and ni = 8, i = 1, 2, 3, 4 in reference cell formulation takes the form

$$\begin{aligned}
y_{1j} &\sim N(\mu_1, \sigma^2) \;\Leftrightarrow\; y_{1j} = \mu_1 + \varepsilon_{1j}, \; \varepsilon_{1j} \sim N(0, \sigma^2) \quad \text{for } j = 1, ..., 8 \text{ with } \mu_1 := \mu_0 \\
y_{2j} &\sim N(\mu_2, \sigma^2) \;\Leftrightarrow\; y_{2j} = \mu_2 + \varepsilon_{2j}, \; \varepsilon_{2j} \sim N(0, \sigma^2) \quad \text{for } j = 1, ..., 8 \text{ with } \mu_2 := \mu_0 + \alpha_2 \\
y_{3j} &\sim N(\mu_3, \sigma^2) \;\Leftrightarrow\; y_{3j} = \mu_3 + \varepsilon_{3j}, \; \varepsilon_{3j} \sim N(0, \sigma^2) \quad \text{for } j = 1, ..., 8 \text{ with } \mu_3 := \mu_0 + \alpha_3 \\
y_{4j} &\sim N(\mu_4, \sigma^2) \;\Leftrightarrow\; y_{4j} = \mu_4 + \varepsilon_{4j}, \; \varepsilon_{4j} \sim N(0, \sigma^2) \quad \text{for } j = 1, ..., 8 \text{ with } \mu_4 := \mu_0 + \alpha_4
\end{aligned} \tag{13.17}$$

In design matrix form, eq. (13.17) can be written as

$$y \sim N(X\beta, \sigma^2 I_{32}), \tag{13.18}$$

where

$$y = (y_{11}, ..., y_{18}, y_{21}, ..., y_{28}, y_{31}, ..., y_{38}, y_{41}, ..., y_{48})^T \in \mathbb{R}^{32}, \quad X = \begin{pmatrix} 1_8 & 0_8 & 0_8 & 0_8 \\ 1_8 & 1_8 & 0_8 & 0_8 \\ 1_8 & 0_8 & 1_8 & 0_8 \\ 1_8 & 0_8 & 0_8 & 1_8 \end{pmatrix} \in \mathbb{R}^{32 \times 4}, \quad \beta = \begin{pmatrix} \mu_0 \\ \alpha_2 \\ \alpha_3 \\ \alpha_4 \end{pmatrix}, \quad \text{and } \sigma^2 > 0, \tag{13.19}$$

where $1_8$ and $0_8$ denote eight-dimensional column vectors of ones and zeros, respectively.

For the data depicted in Table 13.2, the ith level sample means evaluate to

$$\bar{y}_1 = 169.40, \quad \bar{y}_2 = 156.91, \quad \bar{y}_3 = 144.76, \quad \text{and} \quad \bar{y}_4 = 133.38.$$

In accordance, the beta parameter estimates evaluate to

$$\hat{\beta} = \begin{pmatrix} \hat{\mu}_0 \\ \hat{\alpha}_2 \\ \hat{\alpha}_3 \\ \hat{\alpha}_4 \end{pmatrix} = \begin{pmatrix} 169.40 \\ -12.49 \\ -24.64 \\ -36.02 \end{pmatrix}. \tag{13.20}$$

From eq. (13.8), it follows that the $i$th level sample means can be reconstructed from the beta parameter estimates as

$$\begin{aligned}
\bar{y}_1 &= \hat{\mu}_0 = 169.40 \\
\bar{y}_2 &= \hat{\mu}_0 + \hat{\alpha}_2 = 169.40 - 12.49 = 156.91 \\
\bar{y}_3 &= \hat{\mu}_0 + \hat{\alpha}_3 = 169.40 - 24.64 = 144.76 \\
\bar{y}_4 &= \hat{\mu}_0 + \hat{\alpha}_4 = 169.40 - 36.02 = 133.38.
\end{aligned} \tag{13.21}$$

The variance parameter estimate for the data depicted in Table 13.2 evaluates to $\hat{\sigma}^2 = 115.43$. Based on the respective unit vector contrasts, the null hypotheses $\mu_0 \in \{0\}$, $\alpha_2 \in \{0\}$, $\alpha_3 \in \{0\}$, and $\alpha_4 \in \{0\}$


may be evaluated. Exemplarily, we note that $\alpha_2 \in \{0\}$ can be evaluated using $c = (0, 1, 0, 0)^T$, resulting in a T-statistic of $T_c = -2.32$. Because under the null hypothesis $P(T_{\beta,c} \ge |-2.32|) = 0.01$, one would hence be prompted to reject the null hypothesis, if a test significance level of $\alpha_0 = 0.05$ is desired. Other potential null hypotheses pertain to differences in the experimental level effects, i.e., $\alpha_2 - \alpha_3 \in \{0\}$, $\alpha_3 - \alpha_4 \in \{0\}$, and $\alpha_2 - \alpha_4 \in \{0\}$. Evaluation of the corresponding T-statistics based on the contrast vectors $c_1 := (0, 1, -1, 0)^T$, $c_2 := (0, 0, 1, -1)^T$, and $c_3 := (0, 1, 0, -1)^T$ yields the T-statistic values $T_{c_1} = 2.26$, $T_{c_2} = 2.12$, and $T_{c_3} = 4.38$ with associated probabilities $P(T_{\beta,c} \ge 2.26) = 0.02$, $P(T_{\beta,c} \ge 2.12) = 0.02$, and $P(T_{\beta,c} \ge 4.38) < 0.001$ under the respective null hypotheses. These results would hence prompt the rejection of the null hypothesis $\alpha_2 - \alpha_3 \in \{0\}$, the rejection of the null hypothesis $\alpha_3 - \alpha_4 \in \{0\}$, and the rejection of the null hypothesis $\alpha_2 - \alpha_4 \in \{0\}$, if a test significance level of $\alpha_0 = 0.05$ is desired.
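The estimates and T-statistics of this example can be computed along the following lines. The sketch assumes that the Table 13.2 values have been vectorized into an array `y` ordered as in eq. (13.19); the array and function names are hypothetical placeholders, and the numerical values reported above are not re-derived here.

```python
import numpy as np
from scipy import stats

# y: vector of the 32 DLPFC volumes of Table 13.2, ordered as in eq. (13.19) (assumed loaded)
m, n_i = 4, 8
n = m * n_i
# Reference cell design matrix of eq. (13.19): offset column plus indicators for levels 2-4
X = np.column_stack([np.ones(n), np.kron(np.eye(m)[:, 1:], np.ones((n_i, 1)))])

def contrast_t(c, y, X):
    """T-statistic and the probability P(T >= |t|) for a contrast vector c under the GLM."""
    n, p = X.shape
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - p)
    t = c @ beta_hat / np.sqrt(sigma2_hat * c @ np.linalg.inv(X.T @ X) @ c)
    return t, stats.t.sf(np.abs(t), df=n - p)

# beta_hat would reproduce (mu_0, alpha_2, alpha_3, alpha_4) of eq. (13.20); for example
# t, p = contrast_t(np.array([0.0, 1.0, -1.0, 0.0]), y, X)   # null hypothesis alpha_2 - alpha_3 = 0
```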

13.2 The F -test perspective

F-tests are commonly introduced in the context of single-factor ANOVA designs. In this setting, the F-statistic refers to the ratio of a between-group variance, also referred to as a treatment variance, and a within-group variance, also referred to as an error variance. The aim of the current section is to link this classical F-test perspective to the model comparison perspective introduced in Chapter 9 | Frequentist distribution theory. To this end, we first review the one-way ANOVA design and the associated variance partitioning approach of the classical F-tests. We then relate the ensuing variance partitioning scheme to the structural form of the corresponding full and reduced GLM as introduced in Chapter 9 | Frequentist distribution theory and demonstrate the equivalence of both perspectives.

Classical variance partitioning one-way ANOVA

From a classical perspective, the one-way ANOVA F-test provides a single test procedure that allows one to assess the null hypothesis that the "population means of three or more experimental groups are equal". As in the GLM view of one-way ANOVA, the categorical independent variable of classical one-way ANOVA designs is referred to as a factor and the different values that it may assume are referred to as levels. Data in a one-way ANOVA design are then typically organized as shown in Table 13.3 below:

Level 1    Level 2    ···    Level m
y11        y21        ···    ym1
y12        y22        ···    ym2
...        ...        ···    ...
y1n1       y2n2       ···    ymnm

Table 13.3. Data layout of a one-factorial classical ANOVA design.

In Table 13.3, the entry $y_{ij} \in \mathbb{R}$ refers to the data obtained from the $j$th experimental unit on the $i$th level of the experimental factor, where $j = 1, ..., n_i$ and $i = 1, ..., m$. $n_i$ is the number of experimental units for level $i = 1, ..., m$, and the assumption of a balanced design corresponds to $n_1 = n_2 = ... = n_m$. $m$ is the number of levels of the experimental factor and has to be larger than 1. The total number of data points/experimental units is $n = \sum_{i=1}^m n_i$. The fundamental idea of classical one-way ANOVA is to assess whether the variability of the data is primarily due to the variability of the independent variable, i.e., differences between the levels of the experimental factor, or whether it is primarily due to inherent noise in the dependent variables. This is achieved by assessing the relative contributions of treatment-related variance and noise-related variance in a partitioning of the overall data variance. This partitioning takes the intuitive form

Data variance = Treatment-related variance + Noise Variance. (13.22)

If the ratio of treatment-related variance and noise variance, informally given by the F-statistic

$$\text{F-statistic} \approx \frac{\text{Treatment-related variance}}{\text{Noise variance}}, \tag{13.23}$$

is large, then one may infer that the experimental factor had some effect on the observed data values, and the null hypothesis that the experimental factor has no effect on the dependent variable may be rejected.


To formalize these notions, we first define a set of average and variance measures that can be computed based on the data variables listed in Table 13.3.

Definition 13.2.1 (Grand mean, group means, and sums of error squares). For $i = 1, ..., m$ and $j = 1, ..., n_i$, let $y_{ij}$ denote the $j$th data variable on the $i$th level of a one-way ANOVA design. Then

• the grand mean of the data is defined as

$$\bar{y} := \frac{1}{n}\sum_{i=1}^m\sum_{j=1}^{n_i} y_{ij}, \tag{13.24}$$

• the $i$th-level mean is defined as

$$\bar{y}_i := \frac{1}{n_i}\sum_{j=1}^{n_i} y_{ij}, \tag{13.25}$$

• the total sum of error squares is defined as

$$SES_{Total} := \sum_{i=1}^m\sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2, \tag{13.26}$$

• the between-level sum of error squares (or treatment sum of squares) is defined as

$$SES_{Between} := \sum_{i=1}^m n_i(\bar{y}_i - \bar{y})^2, \tag{13.27}$$

• the within-level sum of error squares (or error sum of squares) is defined as

$$SES_{Within} := \sum_{i=1}^m\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2. \tag{13.28}$$

Note that SESTotal quantifies the total variability of all data points about the grand mean, SESBetween quantifies the variability of the ith level means about the grand mean, and SESWithin quantifies the variability of the ith level data points about the ith level mean for i = 1, ..., m. With these definitions, the intuitive variance partitioning of eq. (13.22) can be formalized as follows.

Theorem 13.2.1 (One-way ANOVA variance partitioning). For i = 1, ..., m and j = 1, ..., ni, let yij denote the jth data variable on the ith level of a one-way ANOVA design and let SESTotal, SESBetween, and SESWithin denote the total sum of error squares, the between-level sum of error squares, and the within-level sum of error squares, respectively, as defined in Definition 13.2.1. Then

SESTotal = SESWithin + SESBetween. (13.29)

The General Linear Model 20/21 | © 2020 Dirk Ostwald CC BY-NC-SA 4.0 The F -test perspective 151

Proof. With Definition 13.2.1, we have

$$\begin{aligned}
SES_{Total} &= \sum_{i=1}^m\sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2 \\
&= \sum_{i=1}^m\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i + \bar{y}_i - \bar{y})^2 \\
&= \sum_{i=1}^m\sum_{j=1}^{n_i} \left((y_{ij} - \bar{y}_i)^2 + 2(y_{ij} - \bar{y}_i)(\bar{y}_i - \bar{y}) + (\bar{y}_i - \bar{y})^2\right) \\
&= \sum_{i=1}^m\left(\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_i)^2 + 2(\bar{y}_i - \bar{y})\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_i) + n_i(\bar{y}_i - \bar{y})^2\right) \\
&= \sum_{i=1}^m\left(\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_i)^2 + 2(\bar{y}_i - \bar{y})\left(\sum_{j=1}^{n_i} y_{ij} - n_i\,\frac{1}{n_i}\sum_{j=1}^{n_i} y_{ij}\right) + n_i(\bar{y}_i - \bar{y})^2\right) \\
&= \sum_{i=1}^m\left(\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_i)^2 + n_i(\bar{y}_i - \bar{y})^2\right) \\
&= \sum_{i=1}^m\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_i)^2 + \sum_{i=1}^m n_i(\bar{y}_i - \bar{y})^2 \\
&= SES_{Within} + SES_{Between}.
\end{aligned} \tag{13.30}$$

To define the F-statistic in the classical perspective of one-way ANOVA, the additional concepts of degrees of freedom and mean squares are required. In the current context, the notion of degrees of freedom refers to the number of independent data points that result from computing a mean over a group of data points. Specifically, for the case of the grand mean and the associated total sum of error squares, there are $n$ values and, if the grand mean is known, $n - 1$ choices for different data points. The number of degrees of freedom of $SES_{Total}$ is thus said to be $DF_{Total} = n - 1$. Likewise, for the case of the grand mean and the level means and their associated between-level sum of error squares, if the grand mean is known, $m - 1$ of the group means may be chosen freely. The number of degrees of freedom of $SES_{Between}$ is thus said to be $DF_{Between} = m - 1$. Finally, for the case of the within-level sum of error squares, the $m$ level means and the individual data points are considered. For each level-specific data set, there are $n_i - 1$ degrees of freedom if the group mean is known. Summing over the $m$ levels, the total degrees of freedom of the within-level sum of error squares are $DF_{Within} = \sum_{i=1}^m (n_i - 1) = n - m$. In summary, the degrees of freedom of the three sums of error squares of interest are

$$
\mathrm{DF}_{\mathrm{Total}} = \mathrm{DF}_{\mathrm{Between}} + \mathrm{DF}_{\mathrm{Within}} \;\Leftrightarrow\; n - 1 = (m - 1) + (n - m). \tag{13.31}
$$

Division of the sums of error squares by their respective degrees of freedom yields estimators for the total, between-level, and within-level variances. In the context of one-way ANOVA, these variance estimators are referred to as total mean squares, between-level mean squares, and within-level mean squares, and they are defined as

$$
\begin{aligned}
\mathrm{MS}_{\mathrm{Total}} &:= \frac{\mathrm{SES}_{\mathrm{Total}}}{\mathrm{DF}_{\mathrm{Total}}} = \frac{1}{n-1}\sum_{i=1}^{m}\sum_{j=1}^{n_i}(y_{ij}-\bar{y})^2, \\
\mathrm{MS}_{\mathrm{Between}} &:= \frac{\mathrm{SES}_{\mathrm{Between}}}{\mathrm{DF}_{\mathrm{Between}}} = \frac{1}{m-1}\sum_{i=1}^{m} n_i(\bar{y}_i-\bar{y})^2, \\
\mathrm{MS}_{\mathrm{Within}} &:= \frac{\mathrm{SES}_{\mathrm{Within}}}{\mathrm{DF}_{\mathrm{Within}}} = \frac{1}{n-m}\sum_{i=1}^{m}\sum_{j=1}^{n_i}(y_{ij}-\bar{y}_i)^2,
\end{aligned}
\tag{13.32}
$$

respectively. Finally, the F-statistic is introduced as the ratio of the between-level mean squares to the within-level mean squares,

$$
F := \frac{\mathrm{MS}_{\mathrm{Between}}}{\mathrm{MS}_{\mathrm{Within}}}. \tag{13.33}
$$

As will be shown in the next section, the F-statistic defined in eq. (13.33) is identical to an F-statistic as defined in Chapter 9 under the one-way ANOVA model in reference cell formulation. Thus, under the assumption of independent and identically distributed Gaussian errors, the F-statistic defined in eq. (13.33) is distributed according to an f-distribution.
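The mean squares and the F-statistic of eqs. (13.32) and (13.33) translate directly into code. The sketch below implements them for simulated grouped data (again, the group structure is an arbitrary illustrative choice) and, as an optional cross-check, compares the result to SciPy's one-way ANOVA routine.

```python
# Classical one-way ANOVA F-statistic via mean squares (eqs. (13.31)-(13.33));
# a sketch on simulated data, not a reproduction of any data set from the text.
import numpy as np
from scipy.stats import f_oneway

def one_way_anova_f(groups):
    """groups: list of 1-D arrays, one array of data points per factor level."""
    y_all = np.concatenate(groups)
    n, m = y_all.size, len(groups)
    y_bar = y_all.mean()
    ses_between = sum(g.size * (g.mean() - y_bar) ** 2 for g in groups)
    ses_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)
    ms_between = ses_between / (m - 1)        # DF_Between = m - 1
    ms_within = ses_within / (n - m)          # DF_Within  = n - m
    return ms_between / ms_within, m - 1, n - m

rng = np.random.default_rng(1)
groups = [rng.normal(10.0, 1.0, 6), rng.normal(12.0, 1.0, 8), rng.normal(9.0, 1.0, 5)]
f_stat, df_between, df_within = one_way_anova_f(groups)
print(f"F({df_between}, {df_within}) = {f_stat:.3f}")

# Cross-check against SciPy's implementation of the same classical F-statistic
assert np.isclose(f_stat, f_oneway(*groups).statistic)
```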

The GLM equivalence To show that the F -statistic defined in eq. (13.33) is identical to an F -statistic as defined in Chapter 9 under the one-way ANOVA model in reference cell formulation, we first make the one-way ANOVA model assumption that the distribution of the vectorized data of Table 13.3 adheres to the one-way ANOVA GLM in reference cell formulation, i.e., for

$$
y = (y_{11}, \ldots, y_{1n_1}, y_{21}, \ldots, y_{2n_2}, \ldots, y_{m1}, \ldots, y_{mn_m})^T \in \mathbb{R}^n, \tag{13.34}
$$

we assume

$$
y \sim N(X\beta, \sigma^2 I_n)
\quad\text{with}\quad
X := \begin{pmatrix}
1 & 0 & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
1 & 0 & \cdots & 0 \\
1 & 1 & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
1 & 1 & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
1 & 0 & \cdots & 1 \\
\vdots & \vdots & & \vdots \\
1 & 0 & \cdots & 1
\end{pmatrix} \in \mathbb{R}^{n \times m},
\quad
\beta := \begin{pmatrix} \mu_0 \\ \alpha_2 \\ \vdots \\ \alpha_m \end{pmatrix} \in \mathbb{R}^m,
\quad\text{and}\quad \sigma^2 > 0. \tag{13.35}
$$

In line with the discussion in Chapter 9, we next consider the partitioning of the GLM specified in eq. (13.35) according to

$$
X = \begin{pmatrix} X_1 & X_2 \end{pmatrix} \in \mathbb{R}^{n \times m}
\quad\text{and}\quad
\beta := \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} \in \mathbb{R}^m, \tag{13.36}
$$

where

$$
X_1 := \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} \in \mathbb{R}^{n \times 1},
\quad \beta_1 := \mu_0 \in \mathbb{R},
\quad
X_2 := \begin{pmatrix}
0 & \cdots & 0 \\
\vdots & & \vdots \\
0 & \cdots & 0 \\
1 & \cdots & 0 \\
\vdots & & \vdots \\
1 & \cdots & 0 \\
\vdots & & \vdots \\
0 & \cdots & 1 \\
\vdots & & \vdots \\
0 & \cdots & 1
\end{pmatrix} \in \mathbb{R}^{n \times (m-1)},
\quad\text{and}\quad
\beta_2 := \begin{pmatrix} \alpha_2 \\ \vdots \\ \alpha_m \end{pmatrix} \in \mathbb{R}^{m-1},
\tag{13.37}
$$

such that with the definitions of eqs. (13.34) to (13.37) the reduced and full models are given by

$$
y \sim N(X_1\beta_1, \sigma^2 I_n) \quad\text{and}\quad y \sim N(X\beta, \sigma^2 I_n), \tag{13.38}
$$

respectively. We then have the following result:

Theorem 13.2.2 (Classical and GLM F-statistic equivalence). For i = 1, ..., m and j = 1, ..., ni, let yij denote the jth data variable on the ith level of a one-way ANOVA GLM design $y \sim N(X\beta, \sigma^2 I_n)$ in reference cell formulation (cf. eqs. (13.34) and (13.35)). Let this model correspond to the full model and let the reduced model be given in terms of the first column of X and the first entry of β of the full model (cf. eqs. (13.36) and (13.37)). Let further

$$
e := y - X\hat{\beta} \quad\text{and}\quad e_1 := y - X_1\hat{\beta}_1 \tag{13.39}
$$

denote the residuals of the full and reduced models, respectively. Then, with the definitions of the mean squares in the classical variance partitioning one-way ANOVA perspective (cf. (13.32)), it holds that

$$
F = \frac{\mathrm{MS}_{\mathrm{Between}}}{\mathrm{MS}_{\mathrm{Within}}} = \frac{(e_1^T e_1 - e^T e)/(m-1)}{e^T e/(n-m)}. \tag{13.40}
$$

That is, the F-statistic defined in the context of classical variance partitioning (cf. (13.33)) is identical to the F-statistic as introduced in the GLM context (cf. Chapter 9) for p2 := m − 1 and p := m. Thus, under the assumption of independent and identically distributed Gaussian error terms, the F-statistic defined in eq. (13.33) is distributed according to an f-distribution with m − 1 and n − m degrees of freedom. ◦

Proof. We first note that the beta parameter estimator of the reduced model $y \sim N(X_1\beta_1, \sigma^2 I_n)$ evaluates to

$$
\hat{\beta}_1 = (X_1^T X_1)^{-1} X_1^T y = (1_n^T 1_n)^{-1} 1_n^T y = \frac{1}{n}\sum_{i=1}^{m}\sum_{j=1}^{n_i} y_{ij} = \bar{y}. \tag{13.41}
$$

The beta parameter estimator of the reduced model thus corresponds to the grand mean defined in (13.24). Further, the residual sum of squares of the reduced model is given by

$$
e_1^T e_1 = (y - X_1\hat{\beta}_1)^T (y - X_1\hat{\beta}_1) = (y - 1_n\bar{y})^T (y - 1_n\bar{y}) = \sum_{i=1}^{m}\sum_{j=1}^{n_i}(y_{ij} - \bar{y})^2 = \mathrm{SES}_{\mathrm{Total}}. \tag{13.42}
$$

The residual sum of squares of the reduced model thus corresponds to the total sum of error squares as defined in (13.26). We next note that the beta parameter estimator of the full model $y \sim N(X\beta, \sigma^2 I_n)$ is given by

$$
\hat{\beta}
= \begin{pmatrix} \hat{\mu}_0 \\ \hat{\alpha}_2 \\ \vdots \\ \hat{\alpha}_m \end{pmatrix}
= \begin{pmatrix}
\frac{1}{n_1}\sum_{j=1}^{n_1} y_{1j} \\
\frac{1}{n_2}\sum_{j=1}^{n_2} y_{2j} - \frac{1}{n_1}\sum_{j=1}^{n_1} y_{1j} \\
\vdots \\
\frac{1}{n_m}\sum_{j=1}^{n_m} y_{mj} - \frac{1}{n_1}\sum_{j=1}^{n_1} y_{1j}
\end{pmatrix}
= \begin{pmatrix} \bar{y}_1 \\ \bar{y}_2 - \bar{y}_1 \\ \vdots \\ \bar{y}_m - \bar{y}_1 \end{pmatrix}.
\tag{13.43}
$$


The beta parameter estimator of the full model thus corresponds to the vector comprising the first level mean $\bar{y}_1$ and, for i = 2, ..., m, the differences of the ith level mean and the first level mean $\bar{y}_i - \bar{y}_1$ as defined in (13.25). The residual sum of squares of the full model is hence given by

$$
e^T e = (y - X\hat{\beta})^T (y - X\hat{\beta}),
$$

where, by the structure of X and $\hat{\beta}$, the entry of $y - X\hat{\beta}$ corresponding to $y_{1j}$ evaluates to $y_{1j} - \bar{y}_1$ and the entry corresponding to $y_{ij}$ with $i = 2, ..., m$ evaluates to $y_{ij} - \bar{y}_1 - (\bar{y}_i - \bar{y}_1) = y_{ij} - \bar{y}_i$, such that

$$
e^T e = \sum_{i=1}^{m}\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_i)^2 = \mathrm{SES}_{\mathrm{Within}}. \tag{13.44}
$$

The residual sum of squares of the full model thus corresponds to the within-level sum of error squares as defined in (13.28). From Theorem 13.2.1, it then follows immediately that

$$
\mathrm{SES}_{\mathrm{Between}} = \mathrm{SES}_{\mathrm{Total}} - \mathrm{SES}_{\mathrm{Within}} = e_1^T e_1 - e^T e. \tag{13.45}
$$

But then it follows that

$$
F = \frac{\mathrm{MS}_{\mathrm{Between}}}{\mathrm{MS}_{\mathrm{Within}}}
= \frac{\mathrm{SES}_{\mathrm{Between}}/\mathrm{DF}_{\mathrm{Between}}}{\mathrm{SES}_{\mathrm{Within}}/\mathrm{DF}_{\mathrm{Within}}}
= \frac{(e_1^T e_1 - e^T e)/(m-1)}{e^T e/(n-m)}. \tag{13.46}
$$
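The equivalence established in Theorem 13.2.2 can also be illustrated numerically: fitting the reduced and full models by ordinary least squares and forming the residual-sum-of-squares F-statistic of eq. (13.40) recovers the classical mean-squares ratio of eq. (13.33). The following sketch does so for simulated data with an arbitrary group structure.

```python
# Numerical illustration of Theorem 13.2.2: the GLM F-statistic based on the
# residual sums of squares of the reduced and full models equals the classical
# MS_Between / MS_Within ratio. Simulated data; group structure is arbitrary.
import numpy as np

rng = np.random.default_rng(2)
groups = [rng.normal(10.0, 1.0, 6), rng.normal(12.0, 1.0, 8), rng.normal(9.0, 1.0, 5)]
y = np.concatenate(groups)
n, m = y.size, len(groups)

# Full model design matrix in reference cell formulation (cf. eq. (13.35)):
# a constant column plus indicator columns for levels 2, ..., m.
X = np.zeros((n, m))
X[:, 0] = 1.0
offset = groups[0].size
for i, g in enumerate(groups[1:], start=1):
    X[offset:offset + g.size, i] = 1.0
    offset += g.size
X1 = X[:, :1]                                   # reduced model: constant only

def rss(design, y):
    beta_hat = np.linalg.lstsq(design, y, rcond=None)[0]
    e = y - design @ beta_hat
    return e @ e

ete, e1te1 = rss(X, y), rss(X1, y)
f_glm = ((e1te1 - ete) / (m - 1)) / (ete / (n - m))        # eq. (13.40)

# Classical form (eq. (13.33)) for comparison
y_bar = y.mean()
ms_between = sum(g.size * (g.mean() - y_bar) ** 2 for g in groups) / (m - 1)
ms_within = sum(np.sum((g - g.mean()) ** 2) for g in groups) / (n - m)
assert np.isclose(f_glm, ms_between / ms_within)
```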

13.3 Bibliographic remarks

Treatments of one-way ANOVA designs can be found in most introductory statistical textbooks, e.g., DeGroot and Schervish (2012, Chapter 11.6), Casella and Berger (2012, Chapters 11.1-11.2), and Georgii (2009, Chapter 23.5). Comprehensive accounts of one-way ANOVA designs from the perspective of the GLM are provided by Seber and Lee (2003), Hocking (2003), and Rutherford (2011), amongst many others.

13.4 Study questions

1. Provide a verbose account of the reference cell method reformulation for one-way ANOVA designs.
2. Write down the one-way ANOVA GLM in its reference cell, i.e., non-over-parameterized, formulation.
3. Write down the beta parameter estimator of a one-way ANOVA GLM in its reference cell formulation.
4. Define the grand mean and the ith level means of a one-way ANOVA.
5. Define the total, between-level, and within-level sums of error squares of a one-way ANOVA.
6. Define the total, between-level, and within-level degrees of freedom of a one-way ANOVA.
7. Define the total, between-level, and within-level mean squares of a one-way ANOVA.
8. Define the F-statistic in terms of mean squares and discuss its intuition.
9. Write down the reduced and full model one-way ANOVA GLMs, such that the total sum of error squares corresponds to the residual sum of squares of the reduced model and the within-level sum of error squares corresponds to the residual sum of squares of the full model.
10. Write down the F-statistic in its mean squares form and in its residual sum of squares form and explain their equivalence in verbose terms.

14 | Two-way analysis of variance

In multifactorial designs, two or more independent experimental factors are manipulated and all possible combinations of their levels are assessed. Multifactorial designs are usually referred to simply as factorial designs. For example, a typical 2 × 2-factorial design used in functional neuroimaging may involve a stimulus manipulation (e.g., weakly and highly degraded visual stimuli) and a cognitive manipulation (e.g., attended and unattended visual stimuli). Any two-dimensional n × m or higher-dimensional n × m × p × q × ... factorial design is conceivable. Due to experimental constraints and the aim to measure each factorial combination with the same number of experimental trials, 2 × 2-factorial designs are probably the most prevalent designs in functional neuroimaging. Factorial designs allow for measuring (1) the main effects of each factor, i.e., the differential variability in the dependent experimental variable induced by the levels of the respective factor, averaged over the other factors, and (2) the interactions between factors. In intuitive terms, an interaction in a 2 × 2-factorial design refers to a difference in a difference.

Before considering GLM formulations of 2 × 2-factorial designs, we first illuminate the concept of a 2 × 2-factorial design using the example data set of Chapter 11. To this end, we define the experimental factors age and alcohol consumption and allow each of these factors to take on only two levels: at most 31 years of age (factor age, level 1) and older than 31 years (factor age, level 2), as well as at most 5 units of alcohol (factor alcohol consumption, level 1) and more than 5 units (factor alcohol consumption, level 2). Each combination of a specific level of one factor with a specific level of the other factor is referred to as a cell of the design. 2 × 2-factorial designs are commonly depicted using a square lattice as shown in Figure 14.1. According to their position in the square lattice, the factors may also be referred to as the row and column factors, respectively. Average data from the different cells of a 2 × 2-factorial design are commonly depicted as bar graphs (Figure 14.2).

Figure 14.1. Conceptual visualization of an exemplary 2 x 2 factorial design.

For this 2 × 2-ANOVA setting, the following questions may be investigated:

1. Does DLPFC volume change with the age of the participant, irrespective of (i.e., averaged over) whether the participant consumes a lot of or little alcohol? The answer to this question is referred to as the main effect of age.
2. Does DLPFC volume change with the alcohol consumption of the participant, irrespective of (i.e., averaged over) whether the participant is young or old? The answer to this question is referred to as the main effect of alcohol.
3. Does the difference in DLPFC volume observed for the different levels of the age factor change with the different levels of the alcohol factor? Or, vice versa, does the difference in DLPFC volume observed for the different levels of the alcohol factor change between old and young age? This difference in the differences is referred to as the interaction between the age and alcohol factors.

In the following, we first discuss the GLM formulation of a two-way ANOVA design that applies if only the first two questions are of interest. We then extend this formulation to the case that all three questions are of interest.

Figure 14.2. Visualization of a two-way ANOVA design: bar graph of DLPFC volume for the cells Low Alcohol and High Alcohol within the Young Age and Old Age groups (cf. Table 14.1).

                Alcohol Low                              Alcohol High
                P    ijk   DLPFC volume y11k            P    ijk   DLPFC volume y12k
  Age Young     1    111   178.7708                     2    121   168.4660
                5    112   170.1884                     3    122   169.9513
                7    113   175.4092                     4    123   162.0778
                8    114   173.3972                     6    124   156.9287
                11   115   172.1033                     9    125   154.4907
                12   116   162.6648                     10   126   158.3642
                13   117   165.4449                     14   127   142.2121
                15   118   154.3557                     16   128   145.6544

                P    ijk   DLPFC volume y21k            P    ijk   DLPFC volume y22k
  Age Old       17   211   155.5286                     19   221   137.8262
                18   212   150.5144                     22   222   127.1715
                20   213   160.1183                     23   223   138.0237
                21   214   155.4419                     24   224   133.4589
                25   215   139.3813                     27   225   123.7259
                26   216   145.1997                     28   226   130.7300
                30   217   151.1943                     29   227   114.1148
                32   218   140.9424                     31   228   121.7235

Table 14.1. The example data set of Chapter 11 in a 2 × 2 ANOVA layout with the row factor Age taking on the levels Young and Old and the column factor Alcohol taking on the levels Low and High. The column P denotes the original participant label, while the column ijk denotes the reformulated index.

To formulate the two-way ANOVA GLM, it is helpful to adopt a notation that is in accordance with the 2 × 2-factorial design. For the data variables, we use

$$
y_{ijk} \in \mathbb{R}, \quad\text{where } i = 1, ..., r,\; j = 1, ..., c,\;\text{and } k = 1, ..., n_{ij}, \tag{14.1}
$$

to denote the kth data point in the cell corresponding to the combination of the ith level of the row factor with the jth level of the column factor. Each cell comprises $n_{ij} \in \mathbb{N}$ data points. For the special case of a 2 × 2 ANOVA design, we have r = c = 2. Table 14.1 depicts the exemplary data set of Chapter 11 in a 2 × 2 ANOVA layout.
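To make the $y_{ijk}$ indexing concrete, the following sketch arranges the data of Table 14.1 by design cell and computes the four cell means, i.e., the quantities that a bar graph such as Figure 14.2 displays; the dictionary layout is merely one convenient choice.

```python
# Data of Table 14.1 arranged by design cell (i, j): row factor Age
# (1 = Young, 2 = Old), column factor Alcohol (1 = Low, 2 = High).
import numpy as np

cells = {
    (1, 1): [178.7708, 170.1884, 175.4092, 173.3972,
             172.1033, 162.6648, 165.4449, 154.3557],   # y_11k: Young, Low
    (1, 2): [168.4660, 169.9513, 162.0778, 156.9287,
             154.4907, 158.3642, 142.2121, 145.6544],   # y_12k: Young, High
    (2, 1): [155.5286, 150.5144, 160.1183, 155.4419,
             139.3813, 145.1997, 151.1943, 140.9424],   # y_21k: Old, Low
    (2, 2): [137.8262, 127.1715, 138.0237, 133.4589,
             123.7259, 130.7300, 114.1148, 121.7235],   # y_22k: Old, High
}

for (i, j), y_ij in cells.items():
    print(f"cell ({i},{j}): n_ij = {len(y_ij)}, cell mean = {np.mean(y_ij):.2f}")
```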

14.1 An additive two-way ANOVA design

We first consider a 2 × 2 ANOVA design without interaction, i.e., a purely additive setting. To this end, we conceive of the data point $y_{ijk}$ as the realization of a univariate Gaussian random variable distributed according to

$$
y_{ijk} \sim N(\mu_{ij}, \sigma^2) \;\Leftrightarrow\; y_{ijk} = \mu_{ij} + \varepsilon_{ijk}, \quad \varepsilon_{ijk} \sim N(0, \sigma^2) \quad\text{for } k = 1, ..., n_{ij}, \tag{14.2}
$$

where

$$
\mu_{ij} := \mu_0 + \alpha_i + \beta_j. \tag{14.3}
$$


In this formulation, $\mu_0$ represents a constant offset common to all cells, $\alpha_i$, $i = 1, ..., r$, represents the effect of the ith level of the row factor, and $\beta_j$, $j = 1, ..., c$, represents the effect of the jth level of the column factor. The design matrix $X \in \mathbb{R}^{n \times (1+r+c)}$ implementing the model specified in eqs. (14.2) and (14.3) comprises a column of 1's (representing the constant offset) and two sets of indicator variables (representing the $r \in \mathbb{N}$ levels of the row factor and the $c \in \mathbb{N}$ levels of the column factor). The corresponding beta parameter vector then encodes the effect of each level of each factor. We thus have

$$
y \sim N(X\beta, \sigma^2 I_n), \tag{14.4}
$$

where

$$
y := \begin{pmatrix}
y_{111} \\ \vdots \\ y_{11n_{11}} \\ y_{121} \\ \vdots \\ y_{12n_{12}} \\ y_{211} \\ \vdots \\ y_{21n_{21}} \\ y_{221} \\ \vdots \\ y_{22n_{22}}
\end{pmatrix} \in \mathbb{R}^n,
\quad
X = \begin{pmatrix}
1 & 1 & 0 & 1 & 0 \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
1 & 1 & 0 & 1 & 0 \\
1 & 1 & 0 & 0 & 1 \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
1 & 1 & 0 & 0 & 1 \\
1 & 0 & 1 & 1 & 0 \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
1 & 0 & 1 & 1 & 0 \\
1 & 0 & 1 & 0 & 1 \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
1 & 0 & 1 & 0 & 1
\end{pmatrix} \in \mathbb{R}^{n \times 5},
\quad
\beta := \begin{pmatrix} \mu_0 \\ \alpha_1 \\ \alpha_2 \\ \beta_1 \\ \beta_2 \end{pmatrix} \in \mathbb{R}^5,
\quad\text{and}\quad \sigma^2 > 0. \tag{14.5}
$$

As for the one-way ANOVA, the model thus defined is overparameterized. Effectively, we have five parameters to estimate and four equations for the respective group expectations. Viewed differently, in eq. (14.3) we could add a constant either to each of the αi’s or to each of the βj’s and subtract it from µ0 without altering any of the expected responses µij. We thus require two constraints to obtain an identifiable model. Commonly, these correspond to setting

$$
\alpha_1 := \beta_1 := 0 \tag{14.6}
$$

and thus identifying the combination of the first level of the row factor and the first level of the column factor as the reference cell. The meaning of the remaining parameters is then as provided in Table 14.2 and Table 14.3. The entries in these tables document the expected dependent variable responses for each combination of levels of the row and column factors, in terms of the initial formulation of the additive 2 × 2 ANOVA and in terms of the reference cell method reformulation of the additive 2 × 2 ANOVA, respectively. In the reference cell formulation of Table 14.3, µ0 represents the expected response in the reference cell, α2 represents the effect of level 2 of the row factor compared to its first level for any fixed level of the column factor, and β2 represents the effect of level 2 of the column factor compared to its first level for any fixed level of the row factor. As for the case of the one-way ANOVA, the parameters α2 and β2 thus encode differences in expected values between the design cells. Finally, note that the model is additive in the sense that the effect of each factor is the same at all levels of the other factor. To see this point, consider moving from the first to the second row of Table 14.3: the expected response increases by α2 in the first column as well as in the second column. Likewise, the expected response increases by β2 if one moves from the first to the second column, for both the first and the second row of the table.

        1                     2
 1      µ0 + α1 + β1          µ0 + α1 + β2
 2      µ0 + α2 + β1          µ0 + α2 + β2

Table 14.2. Formulation of an overparameterized two-way additive ANOVA model.

Equivalently, the design matrix defined in eq. (14.5) is not of full column rank, because the row factor indicator variables (columns 2 and 3) as well as the column factor indicator variables (columns 4 and 5) each add up to the constant offset indicator (column 1). The two required constraints correspond to omitting the indicator variables corresponding to the first row factor level and to the first column factor level. This results in the following reformulation of the GLM for the 2 × 2 ANOVA layout:

$$
y \sim N(X\beta, \sigma^2 I_n), \tag{14.7}
$$


        1                2
 1      µ0               µ0 + β2
 2      µ0 + α2          µ0 + α2 + β2

Table 14.3. Reference cell reformulation of the two-way additive ANOVA model.

where

$$
y := \begin{pmatrix}
y_{111} \\ \vdots \\ y_{11n_{11}} \\ y_{121} \\ \vdots \\ y_{12n_{12}} \\ y_{211} \\ \vdots \\ y_{21n_{21}} \\ y_{221} \\ \vdots \\ y_{22n_{22}}
\end{pmatrix} \in \mathbb{R}^n,
\quad
X = \begin{pmatrix}
1 & 0 & 0 \\
\vdots & \vdots & \vdots \\
1 & 0 & 0 \\
1 & 0 & 1 \\
\vdots & \vdots & \vdots \\
1 & 0 & 1 \\
1 & 1 & 0 \\
\vdots & \vdots & \vdots \\
1 & 1 & 0 \\
1 & 1 & 1 \\
\vdots & \vdots & \vdots \\
1 & 1 & 1
\end{pmatrix} \in \mathbb{R}^{n \times 3},
\quad
\beta := \begin{pmatrix} \mu_0 \\ \alpha_2 \\ \beta_2 \end{pmatrix} \in \mathbb{R}^3,
\quad\text{and}\quad \sigma^2 > 0. \tag{14.8}
$$

Based on the formulation of the two-way ANOVA design in (14.8), one may use the contrast vectors $c_\alpha = (0, 1, 0)^T$ and $c_\beta = (0, 0, 1)^T$ to statistically assess the main effects of Age and Alcohol for the data of Table 14.1 using t-tests. The corresponding T-statistics evaluate to $T_{\beta,c_\alpha} = -7.88$ and $T_{\beta,c_\beta} = -5.43$, respectively. Because under the null hypotheses $\alpha_2 \in \{0\}$ and $\beta_2 \in \{0\}$ the associated probabilities are given by $P(T_{\beta,\alpha_2} \ge |-7.88|) < 0.001$ and $P(T_{\beta,\beta_2} \ge |-5.43|) < 0.001$, both main effects would be declared significant if a test significance level of $\alpha_0 = 0.05$ was desired in a two-sided test scenario.
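The reported T-statistics can be recomputed from first principles. The following sketch assembles the reference cell design matrix of eq. (14.8) for the data of Table 14.1 and evaluates the contrast-based T-statistics; it assumes the standard OLS-based form of the GLM T-statistic of Chapter 9, i.e., $T = c^T\hat{\beta}/\sqrt{\hat{\sigma}^2 c^T(X^TX)^{-1}c}$ with $\hat{\sigma}^2 = e^Te/(n-p)$, and under this assumption should yield values close to those reported above.

```python
# Additive 2 x 2 ANOVA GLM in reference cell formulation (eq. (14.8)) for the
# data of Table 14.1, with contrast-based T-statistics for the two main effects.
# Assumes the standard OLS-based T-statistic (cf. Chapter 9).
import numpy as np
from scipy import stats

y11 = [178.7708, 170.1884, 175.4092, 173.3972, 172.1033, 162.6648, 165.4449, 154.3557]
y12 = [168.4660, 169.9513, 162.0778, 156.9287, 154.4907, 158.3642, 142.2121, 145.6544]
y21 = [155.5286, 150.5144, 160.1183, 155.4419, 139.3813, 145.1997, 151.1943, 140.9424]
y22 = [137.8262, 127.1715, 138.0237, 133.4589, 123.7259, 130.7300, 114.1148, 121.7235]
y = np.array(y11 + y12 + y21 + y22)
n = y.size

# Design matrix columns: constant, row indicator (Age = Old), column indicator
# (Alcohol = High); cf. eq. (14.8).
age = np.repeat([0, 0, 1, 1], [len(y11), len(y12), len(y21), len(y22)])
alc = np.repeat([0, 1, 0, 1], [len(y11), len(y12), len(y21), len(y22)])
X = np.column_stack([np.ones(n), age, alc])
p = X.shape[1]

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta_hat
sigma2_hat = (e @ e) / (n - p)
XtX_inv = np.linalg.inv(X.T @ X)

for name, c in [("alpha_2 (Age)", np.array([0.0, 1.0, 0.0])),
                ("beta_2 (Alcohol)", np.array([0.0, 0.0, 1.0]))]:
    t = (c @ beta_hat) / np.sqrt(sigma2_hat * c @ XtX_inv @ c)
    p_two = 2 * stats.t.sf(np.abs(t), df=n - p)
    print(f"{name}: T = {t:.2f}, two-sided p = {p_two:.4f}")
```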

14.2 A two-way ANOVA design with interaction

In order to allow for the modelling of interaction effects in 2 × 2-factorial designs, the GLM of the previous section is modified as follows:

$$
y_{ijk} \sim N(\mu_{ij}, \sigma^2) \;\Leftrightarrow\; y_{ijk} = \mu_{ij} + \varepsilon_{ijk}, \quad \varepsilon_{ijk} \sim N(0, \sigma^2) \quad\text{for } k = 1, ..., n_{ij}, \tag{14.9}
$$

where we now define

$$
\mu_{ij} := \mu_0 + \alpha_i + \beta_j + (\alpha\beta)_{ij}. \tag{14.10}
$$

In this formulation, the first three terms are familiar: $\mu_0$ is a constant, and $\alpha_i$ and $\beta_j$ are the main effects of the levels $i = 1, ..., r$ of the row factor and $j = 1, ..., c$ of the column factor, respectively. The new term $(\alpha\beta)_{ij}$ is an interaction effect. It represents the effect of the combination of levels $i$ and $j$ of the row and column factors. The notation $(\alpha\beta)$ should be understood as a single symbol, not as a product. One could have chosen $\gamma_{ij}$ to denote this interaction effect, but the notation $(\alpha\beta)_{ij}$ is more suggestive and reminds us that the term corresponds to an effect due to the combination of levels $i$ and $j$ of each factor. Table 14.4 displays the parameters of the two-way ANOVA with interaction.

        1                              2
 1      µ0 + α1 + β1 + (αβ)11          µ0 + α1 + β2 + (αβ)12
 2      µ0 + α2 + β1 + (αβ)21          µ0 + α2 + β2 + (αβ)22

Table 14.4. Formulation of an overparameterized two-way ANOVA GLM model with interaction.

In design matrix form, eqs. (14.9) and (14.10) correspond to

$$
y \sim N(X\beta, \sigma^2 I_n), \tag{14.11}
$$

where

$$
y := \begin{pmatrix}
y_{111} \\ \vdots \\ y_{11n_{11}} \\ y_{121} \\ \vdots \\ y_{12n_{12}} \\ y_{211} \\ \vdots \\ y_{21n_{21}} \\ y_{221} \\ \vdots \\ y_{22n_{22}}
\end{pmatrix} \in \mathbb{R}^n,
\quad
X = \begin{pmatrix}
1 & 1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 \\
\vdots & & & & & & & & \vdots \\
1 & 1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 \\
\vdots & & & & & & & & \vdots \\
1 & 1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 \\
1 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 \\
\vdots & & & & & & & & \vdots \\
1 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 \\
1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 1 \\
\vdots & & & & & & & & \vdots \\
1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 1
\end{pmatrix} \in \mathbb{R}^{n \times 9},
\quad
\beta := \begin{pmatrix} \mu_0 \\ \alpha_1 \\ \alpha_2 \\ \beta_1 \\ \beta_2 \\ (\alpha\beta)_{11} \\ (\alpha\beta)_{12} \\ (\alpha\beta)_{21} \\ (\alpha\beta)_{22} \end{pmatrix} \in \mathbb{R}^9,
\quad \sigma^2 > 0. \tag{14.12}
$$

Notably, the second and third columns, the fourth and fifth columns, and the sixth to ninth columns of the design matrix each add up to the first column, rendering the design matrix rank-deficient with multiple linear dependencies among its columns. We thus re-express (14.9) and (14.10) in terms of an extended reference cell method, which sets all parameters involving the first row or the first column of the two-way layout to zero, i.e.,

$$
\alpha_1 := \beta_1 := (\alpha\beta)_{i1} := (\alpha\beta)_{1j} := 0 \quad\text{for } i = 1, ..., r,\; j = 1, ..., c. \tag{14.13}
$$

The meaning of the remaining parameters can then be read off Table 14.5. Again, µ0 represents the expected response in the reference cell. The main effects now assume a more specific meaning: α2 is the expected difference due to level 2 of the row factor, compared to level 1, when the column factor is at level 1. β2 is the expected difference due to level 2 of the column factor, compared to level 1, when the row factor is at level 1. The interaction term (αβ)22 is the additional effect of level 2 of the row factor, compared to level 1, when the column factor is at level 2 rather than 1. This term can also be interpreted as the additional effect of level 2 of the column factor, compared to level 1, when the row factor is at level 2 rather than 1. The key feature of this model is that the effect of one factor now depends on the level of the other factor. For example, the effect of level 2 of the row factor, compared to level 1, is α2 in the first column and α2 + (αβ)22 in the second column.

        1                2
 1      µ0               µ0 + β2
 2      µ0 + α2          µ0 + α2 + β2 + (αβ)22

Table 14.5. Reference cell method reformulation of the two-way ANOVA GLM model with interaction.

The reformulated design matrix representation of the 2 × 2-ANOVA GLM with interaction is of size

$$
n \times \left(1 + (r-1) + (c-1) + (r-1)(c-1)\right). \tag{14.14}
$$

Specifically, it comprises a column of ones representing the constant offset µ0, a set of (r − 1) indicator variables representing the row effects, a set of (c − 1) indicator variables representing the column effects, and a set of (r − 1)(c − 1) indicator variables representing the interactions. The easiest way to compute the values of the interaction indicator variables is as products of the row and column indicator variable values. In other words, if ri takes the value 1 for observations in row i and 0 otherwise, and cj takes the value 1 for observations in column j and 0 otherwise, then the product ricj takes the value 1 for observations that are in row i and column j, and 0 for all others. We thus have

$$
y \sim N(X\beta, \sigma^2 I_n), \tag{14.15}
$$

where

$$
y := \begin{pmatrix}
y_{111} \\ \vdots \\ y_{11n_{11}} \\ y_{121} \\ \vdots \\ y_{12n_{12}} \\ y_{211} \\ \vdots \\ y_{21n_{21}} \\ y_{221} \\ \vdots \\ y_{22n_{22}}
\end{pmatrix} \in \mathbb{R}^n,
\quad
X = \begin{pmatrix}
1 & 0 & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots \\
1 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 \\
\vdots & \vdots & \vdots & \vdots \\
1 & 0 & 1 & 0 \\
1 & 1 & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots \\
1 & 1 & 0 & 0 \\
1 & 1 & 1 & 1 \\
\vdots & \vdots & \vdots & \vdots \\
1 & 1 & 1 & 1
\end{pmatrix} \in \mathbb{R}^{n \times 4},
\quad
\beta := \begin{pmatrix} \mu_0 \\ \alpha_2 \\ \beta_2 \\ (\alpha\beta)_{22} \end{pmatrix} \in \mathbb{R}^4,
\quad\text{and}\quad \sigma^2 > 0. \tag{14.16}
$$

This design is identifiable, and parameter estimation and inference can proceed in the standard fashion. Based on the formulation of the two-way ANOVA design in (14.16), one may use the contrast vectors $c_\alpha = (0, 1, 0, 0)^T$, $c_\beta = (0, 0, 1, 0)^T$, and $c_{\alpha\beta} = (0, 0, 0, 1)^T$ to statistically assess the main effects and their interaction for the data of Table 14.1 using t-tests. The corresponding T-statistics evaluate to $T_{\beta,c_\alpha} = -4.58$, $T_{\beta,c_\beta} = -2.80$, and $T_{\beta,c_{\alpha\beta}} = -1.63$, respectively. Because under the null hypotheses $\alpha_2 \in \{0\}$, $\beta_2 \in \{0\}$, and $(\alpha\beta)_{22} \in \{0\}$ the associated probabilities are given by $P(T_{\beta,\alpha_2} \ge |-4.58|) < 0.001$, $P(T_{\beta,\beta_2} \ge |-2.80|) < 0.005$, and $P(T_{\beta,(\alpha\beta)_{22}} \ge |-1.626|) = 0.06$, the main effects would be declared significant, whereas their interaction would not, if a test significance level of $\alpha_0 = 0.05$ was desired in a two-sided test scenario.
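A corresponding sketch for the interaction model of eq. (14.16) differs from the additive one only in the additional design matrix column, computed as the elementwise product of the row and column indicator variables; again, the standard OLS-based T-statistic of Chapter 9 is assumed.

```python
# 2 x 2 ANOVA GLM with interaction in reference cell formulation (eq. (14.16))
# for the data of Table 14.1; the interaction column is the elementwise product
# of the row and column indicator variables. Standard OLS-based T-statistics.
import numpy as np
from scipy import stats

y11 = [178.7708, 170.1884, 175.4092, 173.3972, 172.1033, 162.6648, 165.4449, 154.3557]
y12 = [168.4660, 169.9513, 162.0778, 156.9287, 154.4907, 158.3642, 142.2121, 145.6544]
y21 = [155.5286, 150.5144, 160.1183, 155.4419, 139.3813, 145.1997, 151.1943, 140.9424]
y22 = [137.8262, 127.1715, 138.0237, 133.4589, 123.7259, 130.7300, 114.1148, 121.7235]
y = np.array(y11 + y12 + y21 + y22)
n = y.size

age = np.repeat([0, 0, 1, 1], [len(y11), len(y12), len(y21), len(y22)])  # r_2 indicator
alc = np.repeat([0, 1, 0, 1], [len(y11), len(y12), len(y21), len(y22)])  # c_2 indicator
X = np.column_stack([np.ones(n), age, alc, age * alc])                   # cf. eq. (14.16)
p = X.shape[1]

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta_hat
sigma2_hat = (e @ e) / (n - p)
XtX_inv = np.linalg.inv(X.T @ X)

contrasts = {"alpha_2": [0, 1, 0, 0], "beta_2": [0, 0, 1, 0], "(alpha beta)_22": [0, 0, 0, 1]}
for name, c in contrasts.items():
    c = np.asarray(c, dtype=float)
    t = (c @ beta_hat) / np.sqrt(sigma2_hat * c @ XtX_inv @ c)
    p_two = 2 * stats.t.sf(np.abs(t), df=n - p)
    print(f"{name}: T = {t:.2f}, two-sided p = {p_two:.3f}")
```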

14.3 Bibliographic remarks

Treatments of two-way ANOVA designs can be found in most introductory statistical textbooks, e.g., DeGroot and Schervish (2012, Chapters 11.7 and 11.8). Comprehensive accounts of two-way ANOVA designs from the perspective of the GLM are provided by Seber and Lee (2003), Hocking (2003), and Rutherford (2011), amongst many others.

14.4 Study questions

1. Explain the terms factorial design and 2 × 2 factorial design.
2. Explain the terms main effect and interaction in factorial designs.
3. Write down the structural form of a purely additive two-way ANOVA design upon reference cell formulation.
4. Write down the GLM form of a purely additive 2 × 2 ANOVA design upon reference cell formulation.
5. Write down the structural form of a two-way ANOVA design with interaction upon reference cell formulation.
6. Write down the GLM form of a 2 × 2 ANOVA design with interaction upon reference cell formulation.


References

Abbott, S. (2015). Understanding Analysis. Undergraduate Texts in Mathematics. Springer New York, New York, NY.
Aldrich, J. (1997). R.A. Fisher and the making of maximum likelihood 1912-1922. Statistical Science, 12(3):162–176.
Anderson, T. W. (2003). An Introduction to Multivariate Statistical Analysis. Wiley Series in Probability and Statistics. Wiley-Interscience, Hoboken, NJ, 3rd edition.
Barber, D. (2012). Bayesian Reasoning and Machine Learning. Cambridge University Press.
Billingsley, P. (1995). Probability and Measure. Wiley Series in Probability and Mathematical Statistics. Wiley, New York, 3rd edition.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Information Science and Statistics. Springer, New York.
Cantor, G. (1892). Über eine Eigenschaft des Inbegriffes aller reellen algebraischen Zahlen. Jahresbericht der Deutschen Mathematiker-Vereinigung, 1.
Cantor, G. (1895). Beiträge zur Begründung der transfiniten Mengenlehre. Mathematische Annalen, 46(4):481–512.
Casella, G. and Berger, R. (2012). Statistical Inference. Duxbury.
Christensen, R. (2011). Plane Answers to Complex Questions. Springer Texts in Statistics. Springer New York, New York, NY.
Czado, C. and Schmidt, T. (2011). Mathematische Statistik. Statistik und ihre Anwendungen. Springer, Berlin.
DeGroot, M. H. and Schervish, M. J. (2012). Probability and Statistics. Addison-Wesley, Boston, 4th edition.
Draper, N. and Smith, H. (1998). Applied Regression Analysis. Wiley-Interscience.
Efron, B. and Hastie, T. (2016). Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge University Press, first edition.
Fisher, R. A. (1925). Statistical Methods for Research Workers. Oliver & Boyd.
Fristedt, B. E. and Gray, L. F. (1998). A Modern Approach to Probability Theory. Birkhäuser Publishing Ltd.
Friston, K. J., editor (2007). Statistical Parametric Mapping: The Analysis of Functional Brain Images. Elsevier/Academic Press, Amsterdam; Boston, 1st edition.
Georgii, H.-O. (2009). Stochastik: Einführung in die Wahrscheinlichkeitstheorie und Statistik. De-Gruyter-Lehrbuch. de Gruyter, Berlin, 4th, revised and extended edition.
Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33(5):587–606.
Hays, W. (1994). Statistics. Harcourt Brace College Publishers, fifth edition.
Hocking, R. (2003). Methods and Applications of Linear Models - Regression and the Analysis of Variance. Wiley.
Horn, R. A. and Johnson, C. R. (2012). Matrix Analysis. Cambridge University Press, Cambridge; New York, 2nd edition.
Huettel, S. A., Song, A. W., and McCarthy, G. (2014). Functional Magnetic Resonance Imaging. Sinauer Associates, Sunderland, MA, 3rd edition.
Jezzard, P., Matthews, P. M., and Smith, S. M., editors (2001). Functional MRI: An Introduction to Methods. Oxford University Press, Oxford; New York.
Kolmogorov, A. N. (1956). Foundations of the Theory of Probability. Chelsea Pub Co.


Lehmann, E. L. and Romano, J. P. (2005). Testing Statistical Hypotheses. Springer Texts in Statistics. Springer, New York, 3rd edition.
Leithold, L. (1976). The Calculus, with Analytic Geometry. Harper & Row, New York, 3rd edition.
Magnus, J. R. and Neudecker, H. (1989). Matrix Differential Calculus with Applications in Statistics and Econometrics. Journal of the American Statistical Association, 84(408):1103.
Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. Adaptive Computation and Machine Learning Series. MIT Press, Cambridge, MA.
Neyman, J. and Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Phil. Trans. R. Soc. Lond. A, 231(694-706):289–337.
Pearson, K. (1900). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 50(302):157–175.
Poldrack, R. A., Nichols, T., and Mumford, J. (2011). Handbook of Functional MRI Data Analysis. Cambridge University Press, Cambridge.
Press, W. (2007). Numerical Recipes: The Art of Scientific Computing. Cambridge University Press.
Rao, C. R. (2002). Linear Statistical Inference and Its Applications. Wiley Series in Probability and Statistics. Wiley, New York, 2nd edition (paperback).
Rodríguez, G. (2007). Lecture Notes on Generalized Linear Models.
Rosenthal, J. S. (2006). A First Look at Rigorous Probability Theory. World Scientific, Singapore; Hackensack, NJ, 2nd edition.
Rutherford, A. (2011). ANOVA and ANCOVA: A GLM Approach. Wiley.
Searle, S. (1982). Matrix Algebra Useful for Statistics. Wiley-Interscience.
Seber, G. (2015). The Linear Model and Hypothesis. Springer Series in Statistics. Springer International Publishing, Cham.
Seber, G. A. F. and Lee, A. J. (2003). Linear Regression Analysis. Wiley Series in Probability and Statistics. Wiley-Interscience, Hoboken, NJ, 2nd edition.
Shao, J. (2003). Mathematical Statistics. Springer Texts in Statistics. Springer, New York, 2nd edition.
Spivak, M. (2008). Calculus. Publish or Perish, Inc., fourth edition.
Strang, G. (2009). Introduction to Linear Algebra.
Student (1908). The Probable Error of a Mean. Biometrika, 6(1):1–25.
Uludag, K., Ugurbil, K., and Berliner, L., editors (2015). fMRI: From Nuclear Spins to Brain Functions. Number 30 in Biological Magnetic Resonance. Springer, New York, NY.
Wacholder, S., Chanock, S., Garcia-Closas, M., El ghormli, L., and Rothman, N. (2004). Assessing the Probability That a Positive Report is False: An Approach for Molecular Epidemiology Studies. JNCI Journal of the National Cancer Institute, 96(6):434–442.
Wasserman, L. (2004). All of Statistics.
