The General Linear Model 20/21
Univ.-Prof. Dr. Dirk Ostwald

Contents

1 Introduction
  1.1 Probabilistic modelling
  1.2 Experimental design
  1.3 A verbose introduction to the general linear model
  1.4 Bibliographic remarks
  1.5 Study questions
2 Sets, sums, and functions
  2.1 Sets
  2.2 Sums, products, and exponentiation
  2.3 Functions
  2.4 Bibliographic remarks
  2.5 Study questions
3 Calculus
  3.1 Derivatives of univariate real-valued functions
  3.2 Analytical optimization of univariate real-valued functions
  3.3 Derivatives of multivariate real-valued functions
  3.4 Derivatives of multivariate vector-valued functions
  3.5 Basic integrals
  3.6 Bibliographic remarks
  3.7 Study questions
4 Matrices
  4.1 Matrix definition
  4.2 Matrix operations
  4.3 Determinants
  4.4 Symmetry and positive-definiteness
  4.5 Bibliographic remarks
  4.6 Study questions
5 Probability spaces and random variables
  5.1 Probability spaces
  5.2 Elementary probabilities
  5.3 Random variables and distributions
  5.4 Random vectors and multivariate probability distributions
  5.5 Bibliographic remarks
  5.6 Study questions
6 Expectation, covariance, and transformations
  6.1 Expectation
  6.2 Variance
  6.3 Sample mean, sample variance, and sample standard deviation
  6.4 Covariance and correlation of random variables
  6.5 Sample covariance and sample correlation
  6.6 Probability density transformations
  6.7 Combining random variables
  6.8 Bibliographic remarks
  6.9 Study questions
7 Probability distributions
  7.1 The multivariate Gaussian distribution
  7.2 The General Linear Model
  7.3 The Gamma distribution
  7.4 The χ² distribution
  7.5 The t distribution
  7.6 The f distribution
  7.7 Bibliographic remarks
  7.8 Study questions
8 Maximum likelihood estimation
  8.1 Likelihood functions and maximum likelihood estimators
  8.2 Maximum likelihood estimation for univariate Gaussian distributions
  8.3 ML estimation of GLM parameters
  8.4 Example (Independent and identically distributed Gaussian samples)
  8.5 Bibliographic remarks
  8.6 Study questions
9 Frequentist distribution theory
  9.1 Introduction
  9.2 Beta parameter estimates
  9.3 Variance parameter estimates
  9.4 The T-statistic
  9.5 The F-statistic
  9.6 Bibliographic remarks
  9.7 Study questions
10 Statistical testing
  10.1 Statistical tests
  10.2 A single-observation z-test
  10.3 Bibliographic remarks
  10.4 Study questions
11 T-tests and simple linear regression
  11.1 Introduction
  11.2 One-sample t-test
  11.3 Independent two-sample t-test
  11.4 Simple linear regression
  11.5 Bibliographic remarks
  11.6 Study questions
12 Multiple linear regression
  12.1 An exemplary multiple linear regression design
  12.2 Linearly independent, orthogonal, and uncorrelated regressors
  12.3 Statistical efficiency of multiple linear regression designs
  12.4 Multiple linear regression in functional neuroimaging
  12.5 Bibliographic remarks
  12.6 Study questions
13 One-way ANOVA
  13.1 The GLM perspective
  13.2 The F-test perspective
  13.3 Bibliographic remarks
  13.4 Study questions
14 Two-way analysis of variance
  14.1 An additive two-way ANOVA design
  14.2 A two-way ANOVA design with interaction
  14.3 Bibliographic remarks
  14.4 Study questions
1 | Introduction
The general linear model (GLM) is a unifying perspective on many data analytical techniques in statistics, machine learning, and artificial intelligence. For example, many statistical methods, such as T-tests, F-tests, simple linear regression, multiple linear regression, the analysis of variance, and the analysis of covariance, are special cases of the GLM. Furthermore, the mathematical machinery of the GLM forms the basis for many more advanced data analytical techniques, ranging from mixed linear models to neural networks to Bayesian hierarchical models. In cognitive neuroimaging, the GLM is popular as a standard technique in the analysis of fMRI data. The aim of this introductory Section is to preview the scope of contemporary data analytical approaches, which is most sensibly summarized by the term probabilistic modelling (Section 1.1). After touching upon some basic aspects of experimental design (Section 1.2), we then provide a verbose introduction to the GLM and its mathematical form (Section 1.3). The mathematical language that is needed to discuss the GLM (e.g., matrix calculus and multivariate Gaussian distributions) will be expanded upon in subsequent Sections. It is introduced here primarily to motivate the engagement with these more basic mathematical concepts.
1.1 Probabilistic modelling
Science is the dyad of formulating quantitative theories about natural phenomena and validating these theories in light of quantitative data. Because quantitative data is finite, theories can ever only be validated up to a certain level of uncertainty. Probabilistic modelling provides the glue between formalized scientific theories and empirical data and offers a mechanistic framework for quantifying the remaining uncertainty about a theory’s validation. Probabilistic modelling has many synonyms, such as statistics, Bayesian inference, data assimilation, advanced machine learning, or simply data analysis. Cognitive neuroscience aims for a scientific approach to understanding brain function. When designing any experiment in cognitive neuroscience, it is thus essential to have at least a vague idea about the data analytical procedures that are going to be used on the collected data, irrespective of whether the data is behavioural or derives from neuroimaging techniques such as functional magnetic resonance imaging (fMRI) or magneto- or electroencephalography (M/EEG). In the current Section, we provide a brief overview of common data analytical strategies employed in cognitive neuroimaging or, more generally, in probabilistic quantitative data analysis. To this end, it is first helpful to appreciate that any form of data analysis embodies data reduction and that any sensible form of data reduction is based on a model of the data generating process.
Data analysis is data reduction. Any cognitive neuroscience experiment generates a wealth of quantitative data (numbers). For example, when conducting a typical behavioural experiment, one presents stimuli of different experimental conditions multiple times to participants and records, for example, the correctness of the response and the associated reaction time on each experimental trial. For reaction times alone, with a hundred trials in each of four experimental conditions, this amounts to four hundred numbers per participant. Usually, one acquires data from more than a single participant and thus deals with four hundred data points times the number of participants. If one concomitantly acquires neurophysiological data, for example fMRI data across many voxels or EEG data from multiple electrodes, the number of data points grows into the hundreds of thousands or even millions very quickly. Nevertheless, one would like to understand and visualize in which way the experimental manipulation has affected the recorded data. Any data analysis method must hence project large sets of numbers onto smaller sets of numbers that allow for the experimental effects to be more readily appreciated by humans. These smaller sets of numbers are commonly referred to as statistics. While many data analysis techniques appear to be very different at least on the surface, a reduction of the data dimensionality is a common characteristic of all forms of data analysis (Figure 1.1).
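To make the notion of data reduction concrete, consider the following minimal sketch (a hypothetical Python example assuming NumPy; the data and all numbers are invented for illustration and are not part of the original text). It reduces four hundred simulated reaction times to eight summary statistics:

```python
import numpy as np

# Hypothetical example: reaction times (in seconds) for one participant,
# 100 trials in each of 4 experimental conditions -> 400 raw numbers.
rng = np.random.default_rng(0)
raw_data = rng.gamma(shape=8.0, scale=0.06, size=(4, 100))  # 4 x 100 array

# Data reduction: project the 400 numbers onto 8 summary statistics,
# the per-condition mean and standard deviation of the reaction times.
condition_means = raw_data.mean(axis=1)          # 4 numbers
condition_stds = raw_data.std(axis=1, ddof=1)    # 4 numbers

for c in range(4):
    print(f"Condition {c + 1}: mean RT = {condition_means[c]:.3f} s, "
          f"SD = {condition_stds[c]:.3f} s")
```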
Data analysis is model-based. A second characteristic of any data analysis method is that it embodies assumptions about how the data were generated and which data aspects are important. The key step of any data analysis method is to evaluate how well a given set of quantitative assumptions, i.e., a model, can explain a set of observed data. When studying any data analysis approach, it is helpful to identify the following three components of the scientific method: model formulation, model estimation, and model evaluation (Figure 1.2).

Figure 1.1. Data analysis is data reduction. Raw data usually takes the form of large data matrices, here represented by a 100 × 100 array of different colours encoding real number values. Usually, the raw data are not reported in scientific reports, but rather a smaller set of numbers, such as T- or p-values in frequentist statistics. This smaller set of numbers is represented by the 2 × 2 array of different colours on the right. The process of transforming a large data set into a smaller data set that can be more readily appreciated by humans is called data analysis (glm 1.m).

Model formulation refers to the mathematical formalization of informal ideas about the generation of empirical data. Typically, models aim to mechanistically and quantitatively capture data generating processes and comprise both deterministic and probabilistic aspects. Some components of a model may take predefined values and are referred to as fixed parameters, while other components of a model can be informed by the data and are referred to as free parameters. Model estimation is the adaptation of free model parameters in light of observed data. Often, this adaptation is a non-trivial task and requires sophisticated mathematical and statistical techniques. Finally, model evaluation refers to the evaluation of the adapted parameter values in some meaningful sense and to drawing conclusions about experimental hypotheses. Note that upon model evaluation, the scientific method proceeds by going back to the model formulation step. At least two aims may be addressed during model reformulation: either to conceive a model formulation that captures the observed data in a more meaningful way, or to relax the assumptions of the model to derive a more general theory.
Model classes

It is sometimes helpful to classify a particular model. While ultimately every model and its associated estimation and evaluation scheme is unique, some rough categorization can help to obtain an overview of the plethora of data analysis approaches in functional neuroimaging. Below, we discuss a non-exhaustive list of dichotomies.
Static vs. dynamic models. In the simplest terms, static models describe the current state of a phenomenon, while dynamic models describe how the phenomenon of interest currently changes. Usually, static models have no inherent representation of time, while dynamic models typically treat time as an explicit model variable. Static models often have a relatively simple algebraic form, whereas dynamic models are usually formulated with the help of differential equations. While not originally conceived as models of time-series data, many static models are also applied to time-series data, often inducing the need for sophisticated model modifications. Dynamic models can further be classified into deterministic dynamic and stochastic dynamic models. Deterministic dynamic models describe the change of the state of a phenomenon without additive stochastic error, commonly using systems of ordinary or partial differential equations. Stochastic dynamic models additionally assume probabilistic influences on the change of the phenomenon of interest and are formulated using stochastic differential equations.
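As a minimal illustration of this distinction, the following hypothetical Python sketch (assuming NumPy; the model and all parameter values are invented for illustration) integrates a deterministic and a stochastic version of the same exponential-decay dynamic model, using simple Euler and Euler-Maruyama schemes:

```python
import numpy as np

# Hypothetical example contrasting a deterministic dynamic model,
# dx/dt = -x (an ODE), with a stochastic dynamic model,
# dx = -x dt + sigma dW (an SDE with additive noise).
rng = np.random.default_rng(1)
dt, n_steps, sigma = 0.01, 500, 0.1

x_det = np.zeros(n_steps)  # deterministic trajectory (Euler scheme)
x_sto = np.zeros(n_steps)  # stochastic trajectory (Euler-Maruyama scheme)
x_det[0] = x_sto[0] = 1.0

for t in range(n_steps - 1):
    x_det[t + 1] = x_det[t] - x_det[t] * dt
    x_sto[t + 1] = (x_sto[t] - x_sto[t] * dt
                    + sigma * np.sqrt(dt) * rng.standard_normal())

# The deterministic trajectory is fully reproducible; the stochastic one
# fluctuates around it due to the additive noise increments.
print(x_det[-1], x_sto[-1])
```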
Univariate vs. multivariate models. Another way to classify models is according to the dimensionality of the measurement data they describe. If this dimension is one, i.e., for each measurement a single number is observed and modelled, the model is referred to as univariate. On the other hand, if each measurement constitutes two or more numbers which are modelled, the model is referred to as multivariate.
Figure 1.2. Data analysis is model-based. The figure depicts the relationship between the scientific method (big box) and reality. Data forms part of the scientific method, because it is registered in data recording instruments that aim to capture specific aspects of reality. The scientific method is based on the formulation of models (also known as theories or hypotheses), the estimation of these models based on data (also known as parameter estimation or model fitting), and the evaluation of the models in light of the data upon their estimation. Typically, multiple models are compared with respect to each other. Upon evaluation of a model, the model may be refined or a new model may be formulated. Note that this is a highly idealistic description of the scientific process, which omits all sociological factors involved in actual academic practice (glm 1.m).
Encoding vs. decoding models. Another popular model classification scheme uses the notions of encoding vs. decoding models. According to this scheme, encoding models rest on an explicit formulation of the experimental circumstances that generate measurements, while decoding approaches decode the experimental circumstances from the observed measurements. However, the distinction between encoding and decoding models is meaningless, because every “decoding model” is also based on a generative model of the measurement data, most typically a very simple one with little explanatory appeal. As will become evident in subsequent Sections, the GLM is a static, univariate model that can be used both in an encoding and a decoding manner. Due to its relative simplicity, the GLM forms an ideal starting point for studying modern data analysis.
Model estimation and evaluation techniques

Probabilistic models comprise both deterministic and stochastic aspects. The stochastic aspects commonly model that part of the data variability that is not explained by the deterministic aspects. The frameworks of Frequentist and Bayesian statistics differ in the way that these stochastic aspects are interpreted.
Frequentist statistics. In Frequentist statistics, probabilities are interpreted as the large-sample limits of relative frequencies of random events. Most of classical Frequentist statistics as encountered in undergraduate statistics combines variants of the GLM with null hypothesis significance testing (NHST). NHST is based on the following logic: one assumes that if there is no experimental effect, a statistic of interest has a certain probability distribution. This is referred to as the null distribution. Upon observing data, one can compute the probability of obtaining the observed or more extreme data under the null distribution. If this probability (known as the p-value) is small, one concludes that the data does not support the null hypothesis and declares the experimental effect to be “statistically significant”.
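As a minimal sketch of this logic (a hypothetical Python example assuming NumPy and SciPy, not the course’s own implementation), consider a one-sample z-test with known variance:

```python
import numpy as np
from scipy import stats

# Hypothetical example of the NHST logic for a one-sample z-test: under the
# null hypothesis of no effect, the sample mean of n standard normal
# observations, scaled by sqrt(n), follows the null distribution N(0, 1).
rng = np.random.default_rng(2)
n = 20
y = rng.normal(loc=0.5, scale=1.0, size=n)  # data with a true effect of 0.5

z = np.sqrt(n) * y.mean() / 1.0  # test statistic (known variance of 1)
p = 1 - stats.norm.cdf(z)        # probability of the observed or more
                                 # extreme (larger) data under the null
print(f"z = {z:.2f}, p = {p:.4f}")  # small p -> 'statistically significant'
```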
Bayesian statistics. In Bayesian statistics, probabilities are interpreted as measures of subjective uncertainty. Here, in the absence of any experimental data, one quantifies one’s uncertainty about model parameters using so-called prior probability distributions. Using Bayes’ theorem, one then computes the posterior distribution of model parameters given the data, resulting in an updated belief. At the same time, one often aims to quantify the probability of the data under the model assumptions employed. It should be noted that the dichotomy between Frequentist statistics and Bayesian statistics is not a strict one, and that mixed forms, such as parametric empirical Bayes or the study of Frequentist quality criteria
of Bayesian point estimators, exist. In this introduction to the GLM, we will focus on Frequentist statistics, which remains the dominant statistical paradigm in the empirical sciences.
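For a flavour of the Bayesian approach, the following hypothetical Python sketch (assuming NumPy; the prior, data, and all numbers are invented for illustration) updates a Gaussian prior over an expectation parameter to its posterior, using the standard conjugacy result for a Gaussian with known variance:

```python
import numpy as np

# Hypothetical sketch of a Bayesian belief update for the expectation mu of
# a Gaussian with known variance sigma^2, using a conjugate Gaussian prior
# N(m0, s0^2); the posterior is then also Gaussian (standard conjugacy result).
rng = np.random.default_rng(3)
sigma2 = 1.0          # known data variance
m0, s02 = 0.0, 10.0   # prior expectation and variance (a weak prior)

y = rng.normal(loc=1.5, scale=np.sqrt(sigma2), size=50)  # observed data
n = y.size

# Posterior parameters from Bayes' theorem (precision-weighted combination)
s2_post = 1.0 / (1.0 / s02 + n / sigma2)
m_post = s2_post * (m0 / s02 + y.sum() / sigma2)
print(f"Posterior: N({m_post:.3f}, {s2_post:.3f})")
```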
1.2 Experimental design
In this Section, we briefly review a few terms from the theory of experimental design that will be needed for introducing the GLM.
Experiment and experimental design. An experiment is the controlled test of a scientific hypothesis or theory. Experiments manipulate some aspect of the world and then measure the outcome of that manipulation. In functional neuroimaging experiments, researchers often manipulate some aspects of a stimulus (for example presenting a picture of a face or a house, or manipulating whether a word is easy or difficult to remember) and measure the participant’s behaviour and brain activity using fMRI or EEG. Here, experimental design refers to the organization of an experiment to allow for the effective investigation of the research hypothesis. All well-designed experiments share several characteristics: they test specific hypotheses, rule out alternative explanations for the data, and minimize costs involved in the experiment.
Independent and dependent experimental variables. An experimental variable can be defined as a manipulated or measured quantity that varies within an experiment. Two classes of experimental variables are central: independent and dependent variables. Independent experimental variables are aspects of the experimental design that are intentionally manipulated by the experimenter and that are hypothesized to cause changes in the dependent variables. Independent variables in functional neuroimaging experiments include, for example, different forms of sensory stimulation, different cognitive contexts, or different motor tasks. The different values of an independent variable are often referred to as conditions or levels. Usually, independent variables are explicitly controlled. From a modelling perspective, they are thus usually represented by constants rather than by random variables. Dependent experimental variables are quantities that are measured by the experimenter in order to evaluate the effect of the independent variables. Examples of dependent variables in functional neuroimaging experiments are the response accuracy and reaction time in behavioural tasks, the BOLD signal at a given voxel in an fMRI experiment, or the frequency composition of a recording channel in an EEG experiment. From a data analytical perspective, dependent experimental variables are usually modelled by random variables.
Categorical and continuous experimental variables. In principle, both independent and dependent variables can either be categorical or continuous. A categorical experimental variable is an experimental variable that can take on one of several discrete values, for example, encoding sensory stimulation (1) vs. no sensory stimulation (0). Categorical experimental variables are commonly referred to as factors taking on different levels. Mathematically, categorical variables are usually represented by elements of the natural numbers or signed integers. A continuous experimental variable is an experimental variable that can take on any value within a specified range. Examples of continuous variables are different contrast levels of a visual stimulus as well as most observed signals in functional neuroimaging, such as the BOLD signal or electrical potentials in EEG. Mathematically, continuous experimental variables are usually represented by real numbers.
Between- and within-participant designs. Experimental designs can be classified according to whether the levels of an independent variable are applied to the same group of participants or to different groups of participants. In a between-participant design, different participant groups are associated with different values of an independent experimental variable. A more common design type in basic functional neuroimaging research is the within-participant design, in which each participant is exposed to all levels of the independent experimental variables. These designs are also commonly referred to as repeated-measures designs.
1.3 A verbose introduction to the general linear model
The GLM can be neatly summarized in the expression

y = Xβ + ε, (1.1)

which we will refer to as the GLM equation. In the GLM equation, y represents data, X denotes a design matrix, β denotes a parameter vector, and ε denotes an error vector. The aim of the current Section is to gain an initial understanding of eq. (1.1). To this end, we will first consider the structural aspects of eq. (1.1), comprising the data y, the design matrix X, and the parameter vector β, from the perspective of independent and dependent experimental variables. In a second step, we then consider the stochastic aspect represented by the error vector ε. We exemplify both the structural and the stochastic aspects by means of the simple linear regression model. Throughout, it is important to note that the GLM models the data y as a stochastic entity of which a single realization is available in a practical context.
Structural aspects

To obtain an initial understanding of the GLM equation (1.1), we consider an independent experimental variable, denoted by x for the moment, and a dependent experimental variable, denoted by y for the moment. As reviewed above, the independent experimental variable x is under the control of the experimenter, while the dependent experimental variable y models measurements of a phenomenon of interest. y is not under the direct control of the researcher, but it is assumed that it is in some way related to x. For reasons of simplicity and flexibility, and because every sufficiently smooth functional relationship is locally approximately linear, researchers choose to model many of these relationships by means of affine-linear functions, or linear models for short. In verbose terms, a noise-free linear model states that “an observed value of the dependent variable y is equal to a weighted sum of values associated with one or more independent variables x”.
To render the last statement more precise, we introduce some additional notation: let yi denote one observation of the dependent variable y, where i = 1, ..., n, such that there are n observations in total. Likewise, let xij, i = 1, ..., n, j = 1, ..., p denote the values of a number of independent experimental variables that are supposed to be associated with the observation yi. Here, p is the number of independent experimental variables. The statement that the value yi equals the weighted sum of the values of the independent variables xij associated with this observation can then be written as
yi = xi1β1 + xi2β2 + xi3β3 + ... + xipβp. (1.2)
In eq. (1.2), the βj parameters are multiplicative coefficients that quantify the contribution of the independent experimental variable xij to the value of the dependent experimental variable yi. Each βj parameter may thus be conceived as the size of the effect that the independent experimental variable xij has on the value of the dependent experimental variable yi. All variables in eq. (1.2) should be thought of as scalar numbers. As a concrete example of eq. (1.2), we consider the seventh observation y7 of a dependent variable associated with p = 4 independent experimental variables and, correspondingly, four βj parameters:
y7 = x71β1 + x72β2 + x73β3 + x74β4. (1.3)

A numerical example of eq. (1.3) is

10 = 16 · 0.25 + 1 · 2 + 3 · 0.5 + 2.5 · 1. (1.4)
Here, the value of the dependent experimental variable is y7 = 10, the values of the independent experimental variables are x71 = 16, x72 = 1, x73 = 3, x74 = 2.5, and the βj values are β1 = 0.25, β2 = 2, β3 = 0.5, β4 = 1. For understanding the GLM, it is important to be clear about which variables are known at which point of a research project: the values of the independent experimental variables xij, i = 1, ..., n, j = 1, ..., p are specified by the researcher and are hence known as soon as the researcher has decided on the design of a given experiment. The dependent experimental variable values yi, i = 1, ..., n are known as soon as the researcher has collected data in response to the independent experimental variable values xi1, ..., xip. However, how much each of the independent experimental variables xi1, ..., xip contributes to the sum on the right-hand side of (1.2), and thus to the observed data on the left-hand side of (1.2), is unknown at this point. In other words, the weighting coefficients β1, ..., βp are not known to the researcher in advance and have to be estimated. As discussed in Section 1.1, the process of identifying these parameter values is referred to as model estimation and will be discussed in detail in subsequent Sections.
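The arithmetic of eq. (1.4) can be verified directly. The following minimal Python sketch (assuming NumPy; it is a hypothetical illustration, not part of the original text) expresses the weighted sum of eq. (1.2) as an inner product:

```python
import numpy as np

# Numerical check of eq. (1.4): the observed value y7 equals the weighted
# sum of the independent variable values x7j and the beta parameters.
x7 = np.array([16.0, 1.0, 3.0, 2.5])
beta = np.array([0.25, 2.0, 0.5, 1.0])
y7 = x7 @ beta  # inner product x71*b1 + x72*b2 + x73*b3 + x74*b4
print(y7)       # 10.0
```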
Structural aspects of simple linear regression

Expressions such as eq. (1.2) are often referred to as multiple linear regression models and are usually introduced as generalizations of simple linear regression models to scenarios with more than one independent experimental variable. In the current Section, we consider the structural aspects of simple linear regression models to introduce the GLM matrix notation. In undergraduate statistics, simple linear regression models are often written as

y = a + bx, (1.5)

where y is referred to as the dependent experimental variable, a is referred to as the offset, b is referred to as the slope, and x is referred to as the independent experimental variable. Crucially, eq. (1.5) encodes the idea that if we know the values of x, b, and a, we can compute the value of y. Let us hence assume that we would like to compute the value of y for five different values of x, namely,
x12 = 0.2, x22 = 1.4, x32 = 2.3, x42 = 0.7, and x52 = 0.5. (1.6)
Notably, the values of x and y are allowed to vary, whereas the values of a and b are fixed. Let us further assume that a = 0.8 and that b = 1.3. We may thus write the five values of y corresponding to the five values of x as
y1 = a + bx12 = 1 · 0.8 + 1.3 · 0.2
y2 = a + bx22 = 1 · 0.8 + 1.3 · 1.4
y3 = a + bx32 = 1 · 0.8 + 1.3 · 2.3 (1.7)
y4 = a + bx42 = 1 · 0.8 + 1.3 · 0.7
y5 = a + bx52 = 1 · 0.8 + 1.3 · 0.5.
Using matrix notation as formally introduced in Section 4 | Matrices, we can equivalently express (1.7) as

$$
\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \end{pmatrix}
=
\begin{pmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \\ x_{31} & x_{32} \\ x_{41} & x_{42} \\ x_{51} & x_{52} \end{pmatrix}
\begin{pmatrix} a \\ b \end{pmatrix}
=
\begin{pmatrix} 1 & 0.2 \\ 1 & 1.4 \\ 1 & 2.3 \\ 1 & 0.7 \\ 1 & 0.5 \end{pmatrix}
\begin{pmatrix} 0.8 \\ 1.3 \end{pmatrix}
=
\begin{pmatrix} 1 \cdot 0.8 + 1.3 \cdot 0.2 \\ 1 \cdot 0.8 + 1.3 \cdot 1.4 \\ 1 \cdot 0.8 + 1.3 \cdot 2.3 \\ 1 \cdot 0.8 + 1.3 \cdot 0.7 \\ 1 \cdot 0.8 + 1.3 \cdot 0.5 \end{pmatrix}. \tag{1.8}
$$
Note that in (1.8) we have introduced another variable $x_{i1}, i = 1, \ldots, n$, which takes on the value 1 for all $i = 1, \ldots, n$ and serves the purpose of including the offset 0.8 on the right-hand side. Independent variables that take on only the values 0, 1, or −1 are sometimes referred to as dummy variables. What is the benefit of rewriting eq. (1.7) in the form of eq. (1.8)? Conceptually, nothing has changed, but notationally, we can now express the relatively large expression (1.7) much more compactly. To do so, we define

$$
y := \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \end{pmatrix}, \quad
X := \begin{pmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \\ x_{31} & x_{32} \\ x_{41} & x_{42} \\ x_{51} & x_{52} \end{pmatrix}, \quad \text{and} \quad
\beta := \begin{pmatrix} a \\ b \end{pmatrix}. \tag{1.9}
$$

Moreover, the definition of β in (1.9) can be simplified and aligned with the notation used in the previous Section by setting $\beta_1 := a$ and $\beta_2 := b$, i.e., by defining

$$
\beta := \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}. \tag{1.10}
$$

Take note of the dimensions of y, X, and β: y is a 5 × 1 vector, X is a 5 × 2 matrix, and β is a 2 × 1 vector. In matrix form, we can thus write (1.7) very compactly as
$$
y = X\beta, \quad \text{where } y \in \mathbb{R}^5, \; X \in \mathbb{R}^{5 \times 2}, \; \text{and } \beta \in \mathbb{R}^2. \tag{1.11}
$$

Matrix notation thus allows for neatly summarizing sets of linear equations as made explicit in (1.7). Moreover, as will become clear in subsequent Sections, matrix algebra also allows for writing other aspects of the GLM, such as parameter estimation and the evaluation of statistics, in very compact forms that can readily be implemented in computer code.
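As a minimal illustration of this compactness (a hypothetical Python sketch assuming NumPy, not part of the original text), the five equations of (1.7) indeed reduce to a single matrix-vector product:

```python
import numpy as np

# Minimal sketch of eqs. (1.8)-(1.11): the design matrix X collects a
# dummy variable of ones (for the offset a) and the values of the
# independent variable x; y then follows from one matrix-vector product.
x = np.array([0.2, 1.4, 2.3, 0.7, 0.5])
X = np.column_stack([np.ones(5), x])  # 5 x 2 design matrix
beta = np.array([0.8, 1.3])           # beta1 = a (offset), beta2 = b (slope)

y = X @ beta  # eq. (1.11), y = X beta
print(y)      # [1.06 2.62 3.79 1.71 1.45]
```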
To conclude this Section, consider again the GLM equation (1.1). In comparison to eq. (1.11), it is apparent that we have not yet considered the error term ε. In fact, the right-hand side of eq. (1.11) merely describes the structural or deterministic aspect of the GLM. In the following Section, we shall thus consider the error term, which reflects the probabilistic aspect of the GLM and provides an essential contribution to the data y and its conception as a random variable.
Probabilistic aspects

Before delving into the meaning of the error term ε in the GLM equation (1.1), we shall summarize in more general form what we have learned so far. To this end, we first note that a fundamental aspect of the GLM equation (1.1) is that it generalizes many experimental design cases, such as the simple linear regression design discussed above. In all generality and using matrix notation as introduced in Section 4 | Matrices, the structural elements of the GLM equation (1.1) take the forms $y \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times p}$, and $\beta \in \mathbb{R}^p$. Note that the design matrix always has as many rows as there are data values (n) and as many columns as there are parameter values (p). Explicitly, we may thus write the GLM equation (1.1) as

$$
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
=
\begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}
\begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{pmatrix}
+ \varepsilon. \tag{1.12}
$$
We now consider ε in (1.12) in more detail. We first note that because $X \in \mathbb{R}^{n \times p}$ and $y \in \mathbb{R}^n$, ε must also be an n-dimensional real vector, i.e., $\varepsilon \in \mathbb{R}^n$, and we thus have

$$
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
=
\begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}
\begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{pmatrix}
+
\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}. \tag{1.13}
$$

We next consider the ith row of (1.13), which reads
yi = xi1β1 + xi2β2 + ... + xipβp + εi. (1.14)
The right-hand side of (1.14) now corresponds to the full GLM assumption about the ith data value yi and comprises two categorically different entities. The first part xi1β1 + xi2β2 + ... + xipβp is the structural, deterministic part already discussed above. The value εi, on the other hand, is conceived as the realization of a random variable. This means that the values εi for i = 1, ..., n are governed by random variables and their associated probability distributions. We might know some parameters of these probability distributions, but the exact values of the εi’s do not follow deterministically from this knowledge. Eq. (1.14) thus implies that the value yi is given by the sum of a deterministic and a probabilistic term. Next, consider obtaining sample values εi and adding them to the deterministic value
µi := xi1β1 + xi2β2 + ... + xipβp, (1.15)

such that

yi = µi + εi. (1.16)
In eq. (1.16), µi is a deterministic value and εi is a random variable realization. We now make the central assumption that the values εi are drawn from independent univariate Gaussian distributions with expectation parameter 0 and variance parameter σ² > 0; these distributions will formally be introduced in Section 5 | Probability spaces and random variables and Section 7 | Probability distributions. For small values of σ², the sampled values εi will be close to zero, but on occasion they may be a little bit positive or a little bit negative. Consider drawing the sample values ε1 = 0.200, ε2 = −0.001, ε3 = 0.050 for µ1 = µ2 = µ3 = 1. If we evaluate (1.16) for these values, we obtain
y1 = µ1 + ε1 = 1 + 0.200 = 1.200
y2 = µ2 + ε2 = 1 − 0.001 = 0.999 (1.17)
y3 = µ3 + ε3 = 1 + 0.050 = 1.050.
The most important thing to realize about (1.17) is that despite the fact that each yi has the same deterministic aspect µi = 1 for i = 1, 2, 3, the values yi, i = 1, 2, 3, still vary, because realizations of random
variables are added to the µi for i = 1, 2, 3. Crucially, this renders the yi themselves realizations of random variables. We can also infer how the random variables they result from are distributed: because the random variables governing the εi’s have an expectation of zero, the expectation of the random variables governing the yi’s will correspond to the deterministic aspects µi. The variance of the random variables governing the yi’s, on the other hand, corresponds to the variance of the random variables governing the εi’s.
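A minimal Python sketch of this sampling scheme (assuming NumPy; the specific numbers drawn are hypothetical and will differ from the hand-picked values in (1.17)):

```python
import numpy as np

# Minimal sketch of eqs. (1.16)-(1.17): identical deterministic aspects
# mu_i combined with Gaussian error realizations eps_i ~ N(0, sigma^2)
# yield data values y_i that vary from realization to realization.
rng = np.random.default_rng(4)
mu = np.ones(3)  # mu_1 = mu_2 = mu_3 = 1
sigma = 0.1
eps = rng.normal(loc=0.0, scale=sigma, size=3)  # error realizations

y = mu + eps  # eq. (1.16)
print(eps, y)  # the y_i scatter around 1 with variance sigma^2
```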
There are two ways to express this more formally. We can either state that

yi = µi + εi, (1.18)

where εi is a realization of a random variable distributed according to a univariate Gaussian distribution with expectation parameter 0 and variance parameter σ², such that in distribution form (Section 7 | Probability distributions) we may write

εi ∼ N(0, σ²). (1.19)
Equivalently, we may state that yi is a realization of a random variable distributed according to a univariate Gaussian distribution with expectation parameter µi and variance parameter σ², such that in distribution form we may write

yi ∼ N(µi, σ²). (1.20)

Formally, (1.20) follows directly from application of the linear-affine transformation theorem for Gaussian distributions to εi under addition of µi (as introduced in Section 7 | Probability distributions).
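The equivalence of the two formulations can also be illustrated by simulation. The following hypothetical Python sketch (assuming NumPy; all numbers are invented for illustration) compares the two sampling schemes:

```python
import numpy as np

# Sketch of the equivalence of eqs. (1.19) and (1.20): adding mu_i to
# eps_i ~ N(0, sigma^2) yields the same distribution as sampling y_i
# directly from N(mu_i, sigma^2).
rng = np.random.default_rng(5)
mu, sigma, n = 2.0, 0.5, 100_000

y_via_eps = mu + rng.normal(0.0, sigma, size=n)  # eq. (1.18) with (1.19)
y_direct = rng.normal(mu, sigma, size=n)         # eq. (1.20)

# Both sampling schemes recover the same expectation and standard deviation.
print(y_via_eps.mean(), y_direct.mean())  # both approx. 2.0
print(y_via_eps.std(), y_direct.std())    # both approx. 0.5
```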
Next, recall that

µi = xi1β1 + xi2β2 + ... + xipβp, (1.21)

which may be re-expressed using matrix multiplication as

µi = xiβ, (1.22)
where we defined $x_i \in \mathbb{R}^{1 \times p}$ as the row vector