
5 | Probability Spaces and Random Variables


In this chapter, we review some essentials of probability theory as required for the theory of the GLM. We focus on the particularities and inner logic of the probability theory model rather than its practical application and primarily aim to establish important concepts and notation that will be used in subsequent sections. In Section 5.1, we first introduce the basic notion of a probability space as a model for experiments that involve some degree of uncertainty. We then discuss some elementary aspects of probability in Section 5.2, which mainly serve to ground the subsequently discussed theory of random variables and random vectors. The fundamental mathematical construct to model univariate data endowed with uncertainty is the concept of a random variable. We focus on different ways of specifying probability distributions of random variables, notably probability mass and density functions for discrete and continuous random variables, respectively, in Section 5.3. The concise mathematical representation of more than one data point requires the concept of a random vector. In Section 5.4, we first discuss the extension of random variable concepts to the multivariate case of random vectors and then focus on three concepts that arise only in the multivariate scenario and are of immense importance for statistical data analysis: marginal distributions, conditional distributions, and independent random variables.

5.1 Probability spaces

Probability spaces Probability spaces are very general and abstract models of random experiments. We use the following definition.

Definition 5.1.1 (Probability space). A probability space is a triple (Ω, A, P), where

• Ω is a set of elementary outcomes ω,

• A is a σ-algebra, i.e., A is a set of subsets of Ω with the following properties:

◦ Ω ∈ A,
◦ A is closed under the formation of complements, i.e., if A ∈ A, then also A^c := Ω \ A ∈ A,
◦ A is closed under countable unions, i.e., if A1, A2, A3, ... ∈ A, then ∪_{i=1}^{∞} Ai ∈ A,

• P is a probability measure, i.e., P is a mapping P : A → [0, 1] with the following properties:

◦ P is normalized, i.e., P(∅) = 0 and P(Ω) = 1, and
◦ P is σ-additive, i.e., if A1, A2, ... is a pairwise disjoint sequence in A (i.e., Ai ∈ A for i = 1, 2, ... and Ai ∩ Aj = ∅ for i ≠ j), then P(∪_{i=1}^{∞} Ai) = Σ_{i=1}^{∞} P(Ai).

•

Example A basic example is a probability space that models the throw of a die. In this case the elementary outcomes ω ∈ Ω model the six faces of the die, i.e., one may define Ω := {1, 2, 3, 4, 5, 6}. If the die is thrown, it will roll, and once it comes to rest, its upper surface will show one of the elementary outcomes. The typical σ-algebra used in the case of discrete and finite outcome sets (such as the current Ω) is the power set P(Ω) of Ω. It is a basic exercise in probability theory to show that the power set indeed fulfils the properties of a σ-algebra as defined above. Because P(Ω) contains all subsets of Ω, it also contains the elementary sets {1}, {2}, ..., {6}, which thus get allocated a probability P({ω}) ∈ [0, 1], ω ∈ Ω by the probability measure P. Probabilities of sets containing a single elementary outcome are also often written simply as P(ω) (:= P({ω})). The typical value ascribed to P(ω), ω ∈ Ω, if used to model a fair die, is P(ω) = 1/6.

The σ-algebra P(Ω) contains many more sets than the sets of elementary outcomes. The purpose of these additional elements is to model all sorts of events to which an observer of the random experiment may want to ascribe probabilities. For example, the observer may ask “What is the probability that the upper surface shows a number larger than three?”. This event corresponds to the set {4, 5, 6}, which, because the σ-algebra P(Ω) contains all possible subsets of Ω, is contained in P(Ω). Likewise, the observer may ask “What is the probability that the upper surface shows an even number?”, which corresponds to the subset {2, 4, 6} of Ω. The probability measure P is defined in such a manner that the answers to the following questions are predetermined: “What is the probability that the upper surface shows nothing?” and “What is the probability that the upper surface shows any number in Ω?”. The element of P(Ω) that corresponds to the first question is the empty set ∅, and by definition of P, P(∅) = 0. This models the idea that one of the elementary outcomes, i.e., one surface with pips, will show up on every instance of the random experiment. If this is not the case, for example because the pips have worn off at one of the surfaces, the probability space model as sketched thus far is not a good model of the die experiment. The element of P(Ω) that corresponds to the second question is Ω itself. Here, the definition of the probability measure assigns P(Ω) = 1, i.e., the probability that something unspecific will happen is one. Again, if the die falls off the table and cannot be recovered, the probability space model and the experiment are not in good alignment.

Finally, the definition of the probability space as provided above allows one to evaluate probabilities for certain events based on the probabilities of other events by means of the σ-additivity of P. Assume for example that the probability space models the throw of a fair die, such that P({ω}) = 1/6 by definition. Based on this assumption, the σ-additivity property allows one to evaluate the probabilities of many other events. Consider for example an observer who is interested in the probability of the event that the surface of the die shows a number smaller or equal to three. Because the elementary events {1}, {2}, {3} are pairwise disjoint, and because the event of interest can be written as the countable union {1, 2, 3} = {1} ∪ {2} ∪ {3} of these events, one may evaluate the probability of the event of interest by

P(∪_{i=1}^{3} {i}) = Σ_{i=1}^{3} P(i) = 1/6 + 1/6 + 1/6 = 1/2.
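The die model can also be rendered computationally. The following minimal Python sketch (the names Omega, P_elem, and P are hypothetical and chosen here for illustration only) represents the elementary outcomes and evaluates event probabilities by summing elementary probabilities, i.e., by σ-additivity, under the fair-die assumption.

    # Minimal sketch of the fair-die probability space (assumption: P({w}) = 1/6).
    # Events are represented as Python sets A ⊆ Omega; their probability is
    # obtained by sigma-additivity, i.e., by summing elementary probabilities.
    from fractions import Fraction

    Omega = {1, 2, 3, 4, 5, 6}                    # elementary outcomes
    P_elem = {w: Fraction(1, 6) for w in Omega}   # fair-die assumption

    def P(A):
        """Probability of an event A, an element of the power set of Omega."""
        return sum(P_elem[w] for w in A)

    print(P(set()))       # impossible event: 0
    print(P(Omega))       # certain event: 1
    print(P({4, 5, 6}))   # "larger than three": 1/2
    print(P({1, 2, 3}))   # union {1} ∪ {2} ∪ {3}: 1/6 + 1/6 + 1/6 = 1/2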
The die example is concerned with the case that a probability space is used to model a random experiment with a finite number of elementary outcomes. In the modelling of scientific experiments, the elementary outcomes are often modelled by the set of real numbers or real-valued vectors. Much of the theoretical development of modern probability theory in the early twentieth century was concerned with the question of how ideas from basic probability with finite elementary outcome spaces can be generalized to the continuous outcome space case of real numbers and vectors. In fact, it is perhaps the most important contribution of the probability space model as defined above and originally developed by Kolmogorov (1956) to be applicable in both the discrete-finite and the continuous-infinite elementary outcome set scenarios. The study of probability spaces for Ω := R or Ω := Rn, n > 1 is a central topic in probability theory which we by and large omit here. We do however note that the σ-algebras employed when Ω := Rn, n ≥ 1 are the so-called Borel σ-algebras, commonly denoted by B for n = 1 and Bn for n > 1. The mathematical construction of these σ-algebras is beyond our scope, but for the theory of the GLM, it is not unhelpful to think of Borel σ-algebras as power sets of R or Rn, n > 1. This is factually wrong, as it can be shown that there are in fact more subsets of R or Rn, n > 1 than there are elements in the corresponding Borel σ-algebras. Nevertheless, many events of interest, such as the event that the elementary outcome of a random experiment with outcome space R falls into a real interval [a, b], are in B.

5.2 Elementary probabilities

We next discuss a few elementary aspects of probabilities defined on probability spaces. Throughout, let (Ω, A, P) denote a probability space, such that P : A → [0, 1] is a probability measure.

Interpretation

We first note that the probability P(A) of an event A is associated with at least two interpretations. From a Frequentist perspective, the probability of an event corresponds to the idealized long run frequency of observing the event A. From a Bayesian perspective, the probability of an event corresponds to the degree of belief that the event is true. Notably, both interpretations are subjective in the sense that the Frequentist perspective envisions an idealized long run frequency which can never be realized in practice, while the Bayesian belief interpretation is explicitly subjective and specific to a given observer. However, irrespective of the specific interpretation of the probability of an event, the logical rules for probabilistic inference, also known as probability calculus, are identical under both interpretations.


Basic properties We next note the following basic properties of probabilities, which follow directly from the probability space definition.

Theorem 5.2.1 (Properties of probabilities). Let (Ω, A, P) denote a probability space. Then the following properties hold.

(1) If A ⊂ B, then P(A) ≤ P(B).
(2) P(A^c) = 1 − P(A).
(3) If A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B).
(4) P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

As an example, we prove property (4) of Theorem 5.2.1 below.

Proof. With the fact that any union of two sets A, B ⊂ Ω can be written as the union of the pairwise disjoint sets A ∩ B^c, A ∩ B, and A^c ∩ B (cf. Section 2 | Sets, sums, and functions) and with the additivity of P for disjoint events, we have

P(A ∪ B) = P(A ∩ B^c) + P(A ∩ B) + P(A^c ∩ B)
         = P(A ∩ B^c) + P(A ∩ B) + P(A^c ∩ B) + P(A ∩ B) − P(A ∩ B)
         = P((A ∩ B^c) ∪ (A ∩ B)) + P((A^c ∩ B) ∪ (A ∩ B)) − P(A ∩ B)    (5.1)
         = P(A) + P(B) − P(A ∩ B).
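Property (4) can also be verified numerically in the fair-die model sketched in Section 5.1. The following Python snippet is an illustration only; the event sets and the helper P are hypothetical names reintroduced here so that the snippet is self-contained.

    # Check of the inclusion-exclusion property (4) for two die events,
    # assuming a fair die with P({w}) = 1/6.
    from fractions import Fraction

    def P(A):
        return sum(Fraction(1, 6) for w in A)      # event probability under a fair die

    A = {2, 4, 6}                                  # "even number of pips"
    B = {1, 2, 3}                                  # "at most three pips"
    print(P(A | B) == P(A) + P(B) - P(A & B))      # True, cf. property (4)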

Independence An important feature of many probabilistic models is the independence of events. Intuitively, independence models the absence of deterministic and stochastic influences between events. Notably, independence can either be assumed and thus build into a probabilistic model by design or independence can follow from the design of the model. Regardless of the origin of the independence of events, we use the following definitions.

Definition 5.2.1 (Independent events). Let (Ω, A, P) denote a probability space. Two events A ∈ A and B ∈ A are independent, if P(A ∩ B) = P(A)P(B). (5.2)

A set of events {Ai | i ∈ I} ⊂ A with index set I is independent, if for every finite subset J ⊂ I

P(∩_{j∈J} Aj) = Π_{j∈J} P(Aj). (5.3)

•

Notably, disjoint events with positive probability, such as observing an even or an odd number of pips in the die experiment, are not independent: if P(A) > 0 and P(B) > 0, then P(A)P(B) > 0, but P(A ∩ B) = P(∅) = 0, and thus P(A ∩ B) ≠ P(A)P(B).
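In the fair-die model, independence and its absence can be made concrete with a small Python sketch (an illustration under the fair-die assumption; the event names are hypothetical):

    # Checking Definition 5.2.1 for die events under the fair-die assumption.
    from fractions import Fraction

    def P(A):
        return sum(Fraction(1, 6) for w in A)      # event probability under a fair die

    A = {2, 4, 6}       # even number of pips,  P(A) = 1/2
    B = {1, 2, 3, 4}    # at most four pips,    P(B) = 2/3
    C = {1, 3, 5}       # odd number of pips,   P(C) = 1/2

    print(P(A & B) == P(A) * P(B))   # True:  P(A ∩ B) = P({2, 4}) = 1/3 = (1/2)(2/3)
    print(P(A & C) == P(A) * P(C))   # False: disjoint events with positive probability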

Conditional probability The basis for many forms of probabilistic inference is the of an event given that another event occurs. We use the following definition.

Definition 5.2.2 (Conditional probability). Let (Ω, A, P) denote a probability space and let A, B ∈ A with P(B) > 0. Then the conditional probability of A given B is defined as

P(A|B) = P(A ∩ B) / P(B). (5.4)

•


Without proof, we note that for any fixed B ∈ A, P(·|B) is a probability measure, i.e., P(·|B) ≥ 0, P(Ω|B) = 1, and for pairwise disjoint A1, A2, ... ∈ A, P(∪_{i=1}^{∞} Ai|B) = Σ_{i=1}^{∞} P(Ai|B). Note that the rules of probability apply to the events on the left of the vertical bar. Intuitively, P(A|B) is the fraction of times the event A occurs among those times in which the event B occurs. This fraction is already determined up to proportionality by P(A ∩ B), the idealized relative frequency or the degree of belief that the events A and B occur together. Division of P(A ∩ B) by P(B) yields a normalized measure. Furthermore, in most probabilistic models P(A|B) ≠ P(B|A). For example, the probability of exhibiting respiratory symptoms after contracting corona virus does not necessarily equal the probability of contracting corona virus when exhibiting respiratory symptoms. Finally, a mathematical extension of conditional probability to the case of P(B) = 0 is possible, but technically beyond our scope.

Rearranging the definition of conditional probability allows for expressing the probability of two events occurring jointly as the product of the conditional probability of one event given the other and the probability of the conditioning event. This fact is routinely used in the construction of probabilistic models. Formally, we have the following theorem, which follows directly from the definition of conditional probability.

Theorem 5.2.2 (Joint and conditional probabilities). Let (Ω, A, P) denote a probability space and let A, B ∈ A with P(A), P(B) > 0. Then

P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A). (5.5)

For independent events, knowledge of the occurrence of one of the events does not affect the probability of the other event occurring:

Theorem 5.2.3 (Conditional probability for independent events). Let (Ω, A, P) denote a probability space and let A, B ∈ A with P(A), P(B) > 0 denote two independent events. Then

P(A|B) = P(A) and P(B|A) = P(B). (5.6)



Proof. With the definitions of conditional probability and independent events, we have

P(A|B) = P(A ∩ B) / P(B) = P(A)P(B) / P(B) = P(A), (5.7)

and analogously for P(B|A).
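Conditional probabilities in the die model can likewise be evaluated directly. The following Python sketch (fair-die assumption; helper names are hypothetical) computes P(A|B) and verifies equation (5.5) as well as Theorem 5.2.3 for two independent events:

    # Conditional probability in the fair-die model, cf. Definition 5.2.2,
    # Theorem 5.2.2, and Theorem 5.2.3.
    from fractions import Fraction

    def P(A):
        return sum(Fraction(1, 6) for w in A)     # event probability under a fair die

    def P_cond(A, B):
        return P(A & B) / P(B)                    # P(A|B), assuming P(B) > 0

    A = {2, 4, 6}       # even number of pips
    B = {1, 2, 3, 4}    # at most four pips

    print(P_cond(A, B) == P(A))                                     # True: A, B independent
    print(P(A & B) == P_cond(A, B) * P(B) == P_cond(B, A) * P(A))   # True, cf. (5.5)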

5.3 Random variables and distributions

The fundamental construct for the mathematical representation of numerical data endowed with uncertainty is the random variable. From a mathematical perspective, random variables are neither random nor variables. Instead, random variables are functions that map elements of a probability outcome space Ω into another outcome space Γ. Γ is either a discrete set X, in which case the functions are referred to as discrete random variables, or Γ is the real line R, in which case the functions are referred to as continuous random variables. If Γ is a multidimensional space, the respective functions are referred to as random vectors. In the current section, we are concerned with some fundamental aspects of random variables. In Section 5.4, we consider their multivariate generalization as random vectors.

Measurable functions and random variables First, not all functions that map elements of a probability outcome space Ω onto elements of another outcome space Γ are random variables. A fundamental feature of random variables is that they are measurable. In the mathematical literature, the terms measurable function and random variable are hence used interchangeably. To make the concept of a measurable function precise, let (Ω, A, P) denote a probability space, and let

ξ : Ω → Γ, ω ↦ ξ(ω) (5.8)

denote a function. Assume further that there exists a σ-algebra S on Γ. The tuple of a set Γ and a σ-algebra S is referred to as a measurable space (for every probability space (Ω, A, P), (Ω, A) thus forms

a measurable space). Finally, for every set S ∈ S let ξ−1(S) denote the preimage of S under ξ. The preimage of S ∈ S under ξ is the set of all ω ∈ Ω that are mapped onto elements of S by ξ, i.e.,

ξ−1(S) := {ω ∈ Ω|ξ(ω) ∈ S}. (5.9)

Now, if the preimages of all S ∈ S are elements of the σ-algebra A on Ω, then ξ is called a measurable function. Formally, we have the following definition.

Definition 5.3.1 (Measurable function). Let (Ω, A, P) be a probability space, let (Γ, S) denote a measurable space, and let

ξ : Ω → Γ, ω ↦ ξ(ω) (5.10)

be a function. If

ξ−1(S) ∈ A for all S ∈ S, (5.11)

then ξ is called a measurable function. •

A measurable function ξ : Ω → Γ is called a random variable:

Definition 5.3.2 (Random variable). Let (Ω, A, P) denote a probability space and let ξ :Ω → Γ denote a function. If ξ is a measurable function, then ξ is called a random variable. •

Probability distributions The condition of measurability of the function ξ has a fundamental consequence for the sets in S: because the probability measure P allocates a probability P(A) to all sets in A, and because, by definition of the measurability of ξ, all preimages ξ−1(S) of all sets S ∈ S are sets in A, the construction of a random variable allows for allocating a probability to all sets S ∈ S - namely the probability of the preimage ξ−1(S) ∈ A under P. This entails the induction of a probability measure on the measurable space (Γ, S). This induced probability measure is called the probability distribution of the random variable ξ and is denoted by Pξ. We use the following definition.

Definition 5.3.3 (Probability distribution). Let (Ω, A, P) denote a probability space, let (Γ, S) denote a measurable space, and let

ξ : Ω → Γ, ω ↦ ξ(ω) (5.12)

denote a random variable. Then the probability measure Pξ defined by

Pξ : S → [0, 1], S ↦ Pξ(S) := P(ξ−1(S)) (5.13)

is called the probability distribution of the random variable ξ. •

Intuitively, the notion of randomness in the values ξ(ω) of ξ is captured by this construction as follows: in a first step, an element ω ∈ Ω is selected according to the probability P({ω}) that is allocated to ω by the probability measure P on (Ω, A). In a second step, this ω is mapped onto an element ξ(ω) in Γ, which is also referred to as a realization of the random variable ξ. Across realizations, the values of ξ exhibit a probability distribution that depends both on the properties of P and ξ and is denoted by Pξ. Figure 5.1 visualizes the situation.

Clearly, if Γ = Ω, S = A and ξ := id, then P and Pξ are identical. Importantly, the union of the measurable space (Γ, S) and the probability measure Pξ forms the probability space (Γ, S, Pξ). In most probabilistic models, it is the latter probability space that takes center stage. Most commonly, the random variable outcome set is given by the real line Γ := R and the σ-algebra corresponds to the Borel σ-algebra S := B. Moreover, the probability measure Pξ is usually directly defined by means of a probability density function (see below). Notably, given the probability space (R, B, Pξ), an underlying probability space (Ω, A, P) can always be constructed post-hoc by setting ξ := id.
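The pushforward construction of Definition 5.3.3 can be illustrated in the die model with a short Python sketch (the random variable ξ is here chosen, purely for illustration, as the indicator of an even face; all names are hypothetical):

    # Induced distribution of a random variable on the fair-die space:
    # P_xi(S) := P(xi^{-1}(S)) = P({w in Omega | xi(w) in S}).
    from fractions import Fraction

    Omega = {1, 2, 3, 4, 5, 6}
    def P(A):
        return sum(Fraction(1, 6) for w in A)        # fair-die probability measure

    def xi(w):
        return 1 if w % 2 == 0 else 0                # maps a face to 0 (odd) or 1 (even)

    def P_xi(S):
        return P({w for w in Omega if xi(w) in S})   # probability of the preimage

    print(P_xi({1}))      # probability that xi realizes the value 1: 1/2
    print(P_xi({0, 1}))   # probability of the full outcome space of xi: 1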


Figure 5.1. Random variables and probability distributions. The random variable ξ : Ω → Γ maps the probability space (Ω, A, P) to the measurable space (Γ, S) and induces the distribution Pξ(S) := P(ξ−1(S)) = P({ω ∈ Ω|ξ(ω) ∈ S}). For a detailed discussion, please refer to the main text.

Notation In the following, we discuss a number of notational conventions with regards to probability distributions. We first note that random variables of the form ξ :Ω → Γ are often written as

ξ : (Ω, A) → (Γ, S) or ξ : (Ω, A, P) → (Γ, S). (5.14)

Neither notation is inherently meaningful, as the random variable ξ only maps elements of Ω onto elements of Γ. Presumably, the notations of (5.14) evolved to stress the fact that the concept of a random variable entails the theoretical overhead of probability distributions that relate to S, A and P as described above. Second, the following notational conventions for events in A are commonly employed:

{ξ ∈ S} := {ω ∈ Ω|ξ(ω) ∈ S}
{ξ = x} := {ω ∈ Ω|ξ(ω) = x}
{ξ < x} := {ω ∈ Ω|ξ(ω) < x}    (5.15)
{ξ ≤ x} := {ω ∈ Ω|ξ(ω) ≤ x}
{ξ > x} := {ω ∈ Ω|ξ(ω) > x}
{ξ ≥ x} := {ω ∈ Ω|ξ(ω) ≥ x}

for S ∈ S and x ∈ Γ and

{x1 < ξ < x2} := {ω ∈ Ω|x1 < ξ(ω) < x2}
{x1 ≤ ξ < x2} := {ω ∈ Ω|x1 ≤ ξ(ω) < x2}    (5.16)
{x1 < ξ ≤ x2} := {ω ∈ Ω|x1 < ξ(ω) ≤ x2}
{x1 ≤ ξ ≤ x2} := {ω ∈ Ω|x1 ≤ ξ(ω) ≤ x2}

for x1, x2 ∈ Γ, x1 ≤ x2 and similarly for larger than relationships. These conventions entail the following conventions for expressing the probabilistic behaviour of random variables, here demonstrated for a selection of the events listed above:

Pξ(ξ ∈ S) := P({ξ ∈ S}) = P({ω ∈ Ω|ξ(ω) ∈ S}) (5.17)
Pξ(ξ = x) := P({ξ = x}) = P({ω ∈ Ω|ξ(ω) = x}) (5.18)
Pξ(ξ ≤ x) := P({ξ ≤ x}) = P({ω ∈ Ω|ξ(ω) ≤ x}) (5.19)
Pξ(x1 ≤ ξ ≤ x2) := P({x1 ≤ ξ ≤ x2}) = P({ω ∈ Ω|x1 ≤ ξ(ω) ≤ x2}). (5.20)

Because of the redundancy in the reference to ξ in symbols of the form Pξ(ξ ≥ s), the subscript is often omitted, i.e., the expression is written as P(ξ ≥ s). Note that this notation entails the danger of confusing

the underlying probability measure P of the probability space (Ω, A, P) with the induced probability measure Pξ on (Γ, S). However, as remarked above, (Ω, A, P) plays only a background role in most applied cases and hence this danger is usually negligible. We next consider the direct specification of probability distributions by means of cumulative distribution functions, probability mass functions, and probability density functions.

Cumulative distribution functions

One way to specify the probability distribution Pξ of a random variable is to define its cumulative distribution function. We denote the cumulative distribution function of a random variable ξ by Pξ and use the following definition. Definition 5.3.4 (Cumulative distribution function). Let ξ be a real-valued random variable. Then a cumulative distribution function of ξ is a function defined as

Pξ : Γ → [0, 1], x ↦ Pξ(x) := Pξ(ξ ≤ x). (5.21)

•

Intuitively, Pξ(x) represents the probability that the random variable ξ takes on a value equal to or smaller than x. It thus follows that 1 − Pξ(x) represents the probability that the random variable ξ takes on a value larger than x. Importantly, by specifying the functional form of a cumulative distribution function Pξ, the probability of all events {ξ ≤ x} for x ∈ Γ is defined. An alternative and much more common approach to define the probability distributions of random variables is by means of probability mass and probability density functions.

Probability mass functions Probability mass functions are used to define the distributions of discrete random variables ξ : Ω → X with discrete and finite (or at least countable) outcome set X. We use the following definitions.

Definition 5.3.5 (Discrete random variable, probability mass function). Let (Ω, A, P) denote a probability space. A random variable ξ : Ω → X is called discrete, if its outcome space X contains only finitely many or countably many elements xi, i = 1, 2, .... The probability mass function (PMF) of a discrete random variable ξ is denoted by pξ and is defined as

pξ : X → [0, 1], xi ↦ pξ(xi) := Pξ(ξ = xi). (5.22)

•

Note that by definition, PMFs are non-negative and normalized, i.e.,

pξ(xi) ≥ 0 for all xi ∈ X and Σ_{xi∈X} pξ(xi) = 1, (5.23)

respectively. Both properties follow directly from the definition of a probability distribution as a probability measure.

The cumulative distribution function of a discrete random variable ξ with PMF pξ evaluates to

Pξ : X → [0, 1], x ↦ Pξ(x) := Pξ(ξ ≤ x) = Σ_{xi≤x} pξ(xi) (5.24)

and is also referred to as cumulative mass function (CMF).
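As a computational illustration, the following Python sketch evaluates the PMF and CMF of a hypothetical discrete random variable with outcome set {1, 2, 3, 4} (the probability values are chosen for demonstration only):

    # A PMF as a dictionary and its cumulative mass function (CMF).
    p = {1: 0.1, 2: 0.4, 3: 0.3, 4: 0.2}      # non-negative and summing to one

    def cmf(x):
        # P(xi <= x): sum of the PMF over all outcomes x_i <= x, cf. (5.24)
        return sum(p_i for x_i, p_i in p.items() if x_i <= x)

    print(sum(p.values()))    # 1.0: normalization
    print(cmf(2))             # P(xi <= 2) = 0.1 + 0.4 = 0.5
    print(1 - cmf(2))         # P(xi > 2) = 0.5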

Probability density functions

Probability density functions are used to define the distributions of continuous random variables ξ :Ω → R. We use the following definitions.

Definition 5.3.6 (Continuous random variable, probability density function). Let (Ω, A, P) denote a probability space. A random variable ξ : Ω → R is called a continuous random variable or a real-valued random variable. The probability density function (PDF) of a continuous random variable is defined as a function

pξ : R → R≥0, x ↦ pξ(x) (5.25)

with the properties


(1) ∫_{−∞}^{∞} pξ(x) dx = 1, and

(2) Pξ(x1 ≤ ξ ≤ x2) = ∫_{x1}^{x2} pξ(x) dx for all x1, x2 ∈ R with x1 ≤ x2.

•

Property (2) of Definition 5.3.6 is central to the understanding of PDFs: the probability of a continuous random variable ξ to take on values in an interval [x1, x2] ⊂ R is obtained by integrating its associated PDF on the interval [x1, x2]. Notably, the probability for a continuous random variable ξ to take on any specific value x ∈ R is zero, because by property (2) of Definition 5.3.6, we have

Pξ(ξ = x) = Pξ(x ≤ ξ ≤ x) = ∫_{x}^{x} pξ(s) ds = 0. (5.26)

Also note that the motivation of the term probability density relates closely to the physical relations between mass, density, and volume,

Mass = Density × Volume. (5.27)

Physical density is a measure of the physical mass of a material per unit volume. To obtain the physical mass of an object of a given material with arbitrary volume, the physical density of the material has to be multiplied with the volume of the object. In analogy and with the intuition of definite integrals (cf. Section 3 | Calculus), to obtain the probability mass that is associated with a given interval of the real numbers, the size of the interval has to be multiplied with the associated values of the probability density. The cumulative distribution function of a continuous real-valued random variable ξ with PDF pξ evaluates to

Pξ : R → [0, 1], x ↦ Pξ(x) = ∫_{−∞}^{x} pξ(s) ds (5.28)

and is also referred to as cumulative density function (CDF). With the intuition of indefinite integrals (cf. Section 3 | Calculus), we thus see that PDFs can be regarded as derivatives of CDFs - or vice versa, CDFs can be regarded as anti-derivatives of PDFs, in symbols

pξ(x) = d/dx Pξ(x). (5.29)

Finally, with the properties of basic integrals, we have the following possibility to evaluate the probability that a continuous random variable takes on values in an interval [x1, x2] by means of its CDF (and likewise for semi-open and open intervals):

Pξ(x1 ≤ ξ ≤ x2) = Pξ(x2) − Pξ(x1). (5.30)
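The relation between PDF, CDF, and interval probabilities can be checked numerically. The following Python sketch uses an exponential random variable with PDF pξ(x) = exp(−x) for x ≥ 0 as a stand-in example (this specific distribution is an assumption made here for illustration and is not part of the text):

    # Interval probabilities of an exponential random variable, evaluated once
    # by numerically integrating the PDF (eq. 5.28) and once via the CDF (eq. 5.30).
    import math

    def pdf(x):
        return math.exp(-x) if x >= 0 else 0.0

    def cdf(x):
        return 1.0 - math.exp(-x) if x >= 0 else 0.0   # antiderivative of the PDF

    def integrate(f, a, b, n=100000):
        # midpoint Riemann-sum approximation of the definite integral of f on [a, b]
        h = (b - a) / n
        return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

    x1, x2 = 0.5, 2.0
    print(integrate(pdf, x1, x2))   # P(x1 <= xi <= x2) by integrating the PDF
    print(cdf(x2) - cdf(x1))        # the same probability via the CDF, eq. (5.30)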

5.4 Random vectors and multivariate probability distributions

Random vectors Random vectors are the multivariate extension of random variables. We use the following definition.

Definition 5.4.1 (Random vector). Let (Ω, A, P) denote a probability space and let (Γn, Sn) denote the n-dimensional measurable space. Then a function

ξ : Ω → Γn, ω ↦ ξ(ω) (5.31)

is called an n-dimensional random vector, if it is a measurable function, i.e., if

ξ−1(S) ∈ A for all S ∈ Sn. (5.32)

Without proof, we note that a multivariate function ξ = (ξ1, ..., ξn)^T is a measurable function, if its component functions ξ1, ..., ξn are measurable functions. This implies that the component functions of a random vector are random variables. n-dimensional random vectors may thus be conceived as the concatenation of n random variables, while random variables are one-dimensional random vectors.


Multivariate probability distributions Multivariate probability distributions are the probability distributions of random vectors. In complete analogy to the random variable scenario, we use the following definition.

Definition 5.4.2 (Multivariate probability distribution). Let (Ω, A, P) denote a probability space, let (Γn, Sn) denote the n-dimensional measurable space, and let

ξ : Ω → Γn, ω ↦ ξ(ω) (5.33)

denote a random vector. Then the probability measure Pξ defined by

Pξ : Sn → [0, 1], S ↦ Pξ(S) := P(ξ−1(S)) = P({ω ∈ Ω|ξ(ω) ∈ S}) (5.34)

is called the multivariate probability distribution of the random vector ξ. •

For simplicity, the multivariate nature of the probability distribution of a random vector is often left implicit, such that one simply speaks of the probability distribution of a random vector.

Notation The notational conventions for events discussed in Section 5.3 extend to the multivariate case. For example, for S ∈ Sn and x ∈ Γn, we have

Pξ(ξ ∈ S) := P({ξ ∈ S}) = P({ω ∈ Ω|ξ(ω) ∈ S})
Pξ(ξ = x) := P({ξ = x}) = P({ω ∈ Ω|ξ(ω) = x})    (5.35)
Pξ(ξ ≤ x) := P({ξ ≤ x}) = P({ω ∈ Ω|ξ(ω) ≤ x})
Pξ(x1 ≤ ξ ≤ x2) := P({x1 ≤ ξ ≤ x2}) = P({ω ∈ Ω|x1 ≤ ξ(ω) ≤ x2}).

Note that relational operators such as ≤ are understood to hold component-wise for multivariate entities, e.g., x ≤ y for x, y ∈ Γn is understood as xi ≤ yi for all i = 1, ..., n.

Multivariate cumulative distribution functions

One way to specify the probability distribution Pξ of a random vector is to define its multivariate cumulative distribution function. In analogy to the random variable scenario, we use the following definition. Definition 5.4.3 (Multivariate cumulative distribution function). Let ξ be a random vector. Then a multivariate cumulative distribution function of ξ is a function

Pξ : Γn → [0, 1], x ↦ Pξ(x) := Pξ(ξ ≤ x). (5.36)

•

More commonly employed alternatives for specifying the probability distributions of random vectors are multivariate probability mass and density functions. The intuitions for probability mass and density functions established for random variables extend to random vectors.

Multivariate probability mass functions Multivariate probability mass functions are used to define the distributions of discrete random vectors. We use the following definitions.

Definition 5.4.4 (Discrete random vector, multivariate probability mass function). Let (Ω, A, P) denote a probability space. A random vector ξ : Ω → X is called discrete, if its outcome space X contains only finitely many or countably many elements xi, i = 1, 2, .... The multivariate probability mass function of a discrete random vector ξ is denoted by pξ and is defined as

pξ : X → [0, 1], xi ↦ pξ(xi) := Pξ(ξ = xi). (5.37)

•

Like their univariate counterparts, multivariate PMFs are non-negative and normalized.


Example. To exemplify the concept of multivariate PMF, we consider a discrete two-dimensional random vector ξ = (ξ1, ξ2) taking values in X = X1 × X2 with X1 := {1, 2, 3} and X2 := {1, 2, 3, 4}. An exemplary two-dimensional PMF of the form

pξ : {1, 2, 3} × {1, 2, 3, 4} → [0, 1], (x1, x2) ↦ pξ(x1, x2) (5.38)

is specified in Table 5.1. Note that Σ_{x1=1}^{3} Σ_{x2=1}^{4} pξ(x1, x2) = 1.

pξ(x1, x2)   x2 = 1   x2 = 2   x2 = 3   x2 = 4
x1 = 1        0.1      0.0      0.2      0.1
x1 = 2        0.1      0.2      0.0      0.0
x1 = 3        0.0      0.1      0.1      0.1

Table 5.1. An exemplary bivariate PMF.
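The PMF of Table 5.1 can be represented as a two-dimensional array, which also makes the normalization easy to verify (a sketch using NumPy; the array name p is hypothetical):

    # Joint PMF of Table 5.1; rows correspond to x1 = 1, 2, 3, columns to x2 = 1, ..., 4.
    import numpy as np

    p = np.array([[0.1, 0.0, 0.2, 0.1],
                  [0.1, 0.2, 0.0, 0.0],
                  [0.0, 0.1, 0.1, 0.1]])

    print(p.sum())     # 1.0: the PMF is normalized
    print(p[0, 2])     # p_xi(x1 = 1, x2 = 3) = 0.2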

Multivariate probability density functions Multivariate probability density functions are used to define the distributions of continuous random vectors. We use the following definitions.

Definition 5.4.5 (Continuous random vector, multivariate probability density function). Let (Ω, A, P) denote a probability space. A random vector ξ :Ω → Rn is called a continuous random vector. The multivariate probability density function of a continuous random vector is defined as a function

pξ : Rn → R≥0, x ↦ pξ(x), (5.39)

such that

(1) ∫_{Rn} pξ(x) dx = 1, and

(2) Pξ(x1 ≤ ξ ≤ x2) = ∫_{x11}^{x21} ··· ∫_{x1n}^{x2n} pξ(s1, ..., sn) ds1 ··· dsn for all x1, x2 ∈ Rn with x1 ≤ x2.

•

As in the random variable scenario, we have

Pξ(ξ = x) = Pξ(x ≤ ξ ≤ x) = ∫_{x1}^{x1} ··· ∫_{xn}^{xn} pξ(s1, ..., sn) ds1 ··· dsn = 0. (5.40)

As for the probability distributions of random vectors, we often omit the qualifying adjective multivariate when discussing the PMFs and PDFs of random vectors.

Marginal distributions Marginal distributions are the probability distributions of the components of random vectors. In the following, we first define marginal distributions and discuss how univariate marginal distributions can be evaluated based on multivariate PMFs and PDFs. We then discuss an example for the marginal distributions of a two-dimensional discrete random vector. Examples for marginal distributions of multivariate continuous vectors are discussed in the context of Gaussian distributions in Section 7 | Probability distributions.

Definition 5.4.6 (Marginal random variables and vectors, marginal probability distributions). Let (Ω, A, P) denote a probability space, let ξ : Ω → Γn denote a random vector, let Pξ denote the probability distribution of ξ, and let Γ(i) denote the outcome space of the ith component of ξ such that Γn = ×_{i=1}^{n} Γ(i). Then the probability distribution defined by

Pξi : S → [0, 1], S ↦ Pξi(S) := Pξ(Γ(1) × ··· × Γ(i−1) × S × Γ(i+1) × ··· × Γ(n)) for S ⊆ Γ(i) (5.41)

is called the ith univariate marginal distribution of ξ. •


Without proof, we note that marginal distributions can be evaluated from multivariate PMFs and PDFs by means of summation and integration, respectively.

Theorem 5.4.1 (Marginal probability mass functions, marginal probability density functions). Let ξ denote a discrete random vector with probability mass function pξ. Then the probability mass function of the ith component ξi of ξ evaluates to

pξi : R → [0, 1], xi ↦ pξi(xi) := Σ_{x1} ··· Σ_{xi−1} Σ_{xi+1} ··· Σ_{xn} pξ(x). (5.42)

Similarly, let ξ denote a continuous random vector with probability density function pξ. Then the probability density function of the ith component ξi of ξ evaluates to

pξi : R → R≥0, xi ↦ pξi(xi) := ∫_{x1} ··· ∫_{xi−1} ∫_{xi+1} ··· ∫_{xn} pξ(x) dx1 ··· dxi−1 dxi+1 ··· dxn. (5.43)



Example To exemplify the concept of a marginal PMF, we reconsider the discrete two-dimensional random vector ξ = (ξ1, ξ2) taking values in X = X1 × X2 with X1 := {1, 2, 3} and X2 := {1, 2, 3, 4} and

PMF specified in Table 5.1. Based on Theorem 5.4.1, the marginal PMFs pξ1 and pξ2 of ξ evaluate as specified in Table 5.2 below. Note that Σ_{x1=1}^{3} pξ1(x1) = 1 and Σ_{x2=1}^{4} pξ2(x2) = 1.

pξ(x1, x2)   x2 = 1   x2 = 2   x2 = 3   x2 = 4   pξ1(x1)
x1 = 1        0.1      0.0      0.2      0.1      0.4
x1 = 2        0.1      0.2      0.0      0.0      0.3
x1 = 3        0.0      0.1      0.1      0.1      0.3
pξ2(x2)       0.2      0.3      0.3      0.2

Table 5.2. Exemplary marginal PMFs.
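Starting from the array representation of the joint PMF introduced above, the marginal PMFs of Table 5.2 are obtained by summation along the respective other dimension (a NumPy sketch; all names are hypothetical):

    # Marginal PMFs of Table 5.2 by summation over the joint PMF of Table 5.1.
    import numpy as np

    p = np.array([[0.1, 0.0, 0.2, 0.1],
                  [0.1, 0.2, 0.0, 0.0],
                  [0.0, 0.1, 0.1, 0.1]])

    p_x1 = p.sum(axis=1)     # marginal PMF of xi_1: [0.4, 0.3, 0.3]
    p_x2 = p.sum(axis=0)     # marginal PMF of xi_2: [0.2, 0.3, 0.3, 0.2]
    print(p_x1, p_x1.sum())  # normalization of the first marginal
    print(p_x2, p_x2.sum())  # normalization of the second marginal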

Conditional distributions

Recall that for a probability space (Ω, A, P) and two events A, B ∈ A with P(B) > 0, the conditional probability of event A given event B is defined as

P(A|B) = P(A ∩ B) / P(B). (5.44)

Analogously, for the distribution of two random variables ξ1 and ξ2, the conditional probability distribution of ξ1 given ξ2 is defined in terms of events A = {ξ1 ∈ X1} and B = {ξ2 ∈ X2}. To introduce conditional distributions, we first consider the case of two-dimensional (bivariate) discrete and continuous random vectors.

Definition 5.4.7 (Conditional PMF, discrete conditional distribution). Let ξ = (ξ1, ξ2)^T denote a discrete random vector with PMF pξ = pξ1,ξ2 and marginal PMFs pξ1 and pξ2. Then the conditional PMF of ξ1 given ξ2 = x2 is defined as

pξ1|ξ2 : R → [0, 1], x1 ↦ pξ1|ξ2(x1|x2) := pξ1,ξ2(x1, x2) / pξ2(x2) for pξ2(x2) > 0 (5.45)

and the conditional PMF of ξ2 given ξ1 = x1 is defined as

pξ2|ξ1 : R → [0, 1], x2 ↦ pξ2|ξ1(x2|x1) := pξ1,ξ2(x1, x2) / pξ1(x1) for pξ1(x1) > 0. (5.46)

The discrete distributions with PMFs pξ1|ξ2 (·|ξ2 = x2) and pξ2|ξ1(·|ξ1 = x1) are called the conditional distributions of ξ1 given ξ2 = x2 and ξ2 given ξ1 = x1, respectively. •


In complete analogy to the conditional probabilities of events, we have

pξ1|ξ2(x1|x2) = pξ1,ξ2(x1, x2) / pξ2(x2) = P({ξ1 = x1} ∩ {ξ2 = x2}) / P(ξ2 = x2) (5.47)

and likewise for pξ2|ξ1. Like conditional probabilities, conditional PMFs behave like proper probability measures in their first argument.

Example. Consider the earlier example of the two-dimensional PMF pξ1,ξ2 and its marginal PMFs pξ1 and pξ2 documented in Table 5.1 and Table 5.2. For this example, the conditional PMFs of ξ2 given

ξ1 = 1, ξ1 = 2, and ξ1 = 3 are evaluated in Table 5.3 below. Note the qualitative similarity of pξ1,ξ2 (x1, x2) and pξ2|ξ1 (x2|x1).

pξ2|ξ1(x2|x1)        x2 = 1          x2 = 2          x2 = 3          x2 = 4
pξ2|ξ1(x2|x1 = 1)    0.1/0.4 = 1/4   0.0/0.4 = 0     0.2/0.4 = 1/2   0.1/0.4 = 1/4    Σ_{x2=1}^{4} pξ2|ξ1(x2|x1) = 1
pξ2|ξ1(x2|x1 = 2)    0.1/0.3 = 1/3   0.2/0.3 = 2/3   0.0/0.3 = 0     0.0/0.3 = 0      Σ_{x2=1}^{4} pξ2|ξ1(x2|x1) = 1
pξ2|ξ1(x2|x1 = 3)    0.0/0.3 = 0     0.1/0.3 = 1/3   0.1/0.3 = 1/3   0.1/0.3 = 1/3    Σ_{x2=1}^{4} pξ2|ξ1(x2|x1) = 1

Table 5.3. Exemplary conditional PMF.
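The conditional PMFs of Table 5.3 correspond to a row-wise division of the joint PMF array by the marginal probabilities of ξ1, cf. equation (5.46) (a NumPy sketch reusing the hypothetical array representation introduced above):

    # Conditional PMFs p(x2 | x1) of Table 5.3: divide each row of the joint PMF
    # by the corresponding marginal probability p_xi1(x1).
    import numpy as np

    p = np.array([[0.1, 0.0, 0.2, 0.1],
                  [0.1, 0.2, 0.0, 0.0],
                  [0.0, 0.1, 0.1, 0.1]])
    p_x1 = p.sum(axis=1)

    p_x2_given_x1 = p / p_x1[:, None]
    print(p_x2_given_x1[0])            # p(x2 | x1 = 1): [0.25, 0.0, 0.5, 0.25]
    print(p_x2_given_x1.sum(axis=1))   # each conditional PMF sums to one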

Similarly, we have the following definition for conditional distributions of continuous random variables.

Definition 5.4.8 (Conditional PDF, continuous conditional distribution). Let ξ = (ξ1, ξ2)^T denote a continuous random vector with PDF pξ = pξ1,ξ2 and marginal PDFs pξ1 and pξ2. Then the conditional PDF of ξ1 given ξ2 is defined as

pξ1|ξ2 : R → R≥0, x1 ↦ pξ1|ξ2(x1|x2) := pξ1,ξ2(x1, x2) / pξ2(x2) for pξ2(x2) > 0, (5.48)

and the conditional PDF of ξ2 given ξ1 = x1 is defined as

pξ2|ξ1 : R → R≥0, x2 ↦ pξ2|ξ1(x2|x1) := pξ1,ξ2(x1, x2) / pξ1(x1) for pξ1(x1) > 0. (5.49)

The continuous distributions with PDFs pξ1|ξ2(·|ξ2 = x2) and pξ2|ξ1(·|ξ1 = x1) are called the conditional distributions of ξ1 given ξ2 and of ξ2 given ξ1, respectively. •

Finally, the two-dimensional scenario discussed thus far can be generalized to the multivariate scenario in terms of the following definition, which covers both the discrete and continuous settings.

Definition 5.4.9 (Multivariate conditional PMF and PDF). Let ξ = (ξ1, ξ2) denote an n-dimensional random vector, where ξ1 and ξ2 denote k- and (n − k)-dimensional random vectors, respectively. Let pξ1,ξ2 denote the PMF or PDF of ξ and let pξ2 denote the (n − k)-dimensional marginal PMF or PDF of ξ2. Then, for ξ2 = x2, the conditional k-dimensional PMF or PDF of ξ1 given ξ2 is defined as

pξ1|ξ2 : Rk → R≥0, x1 ↦ pξ1|ξ2(x1|x2) := pξ1,ξ2(x1, x2) / pξ2(x2) for pξ2(x2) > 0. (5.50)

•

Independence

In analogy to the definition of independent events (cf. 5.2.1), two random variables ξ1 and ξ2 are called independent, if {ξ1 ∈ S1} and {ξ2 ∈ S2} are independent events for all S1 and S2. We use the following definition.

Definition 5.4.10 (Independent random variables). Two random variables ξ1 : Ω → Γ(1) and ξ2 : Ω → Γ(2) are independent, if for every S1 ⊆ Γ(1) and S2 ⊆ Γ(2) it holds that

P(ξ1 ∈ S1, ξ2 ∈ S2) = P(ξ1 ∈ S1)P(ξ2 ∈ S2). (5.51) •


As in the elementary probability scenario, independence of random variables implies that

P({ξ1 ∈ S1}|{ξ2 ∈ S2}) = P({ξ1 ∈ S1}) (5.52) or, intuitively, that knowledge of the fact that ξ2 ∈ S2 does not affect the probability of the event ξ1 ∈ S1. Without proof, we note the following theorem that transfers the definition of independent random variables to their respective PMF or PDF.

Theorem 5.4.2 (Independence and PMF/PDF factorization). Let ξ1 :Ω → X1 and ξ2 :Ω → X2 denote discrete random variables with PMF pξ1,ξ2 and marginal PMFs pξ1 and pξ2 , respectively. Then ξ1 and ξ2 are independent, if and only if

pξ1,ξ2 (x1, x2) = pξ1 (x1)pξ2 (x2) for all (x1, x2) ∈ X1 × X2. (5.53)

Similarly, let ξ1 and ξ2 denote continuous random variables with PDF pξ1,ξ2 and marginal PDFs pξ1 and pξ2 , respectively. Then ξ1 and ξ2 are independent, if and only if

pξ1,ξ2(x1, x2) = pξ1(x1)pξ2(x2) for all (x1, x2) ∈ R2. (5.54)


Notably, the PMF or PDF property

pξ1,ξ2 (x1, x2) = pξ1 (x1)pξ2 (x2) (5.55) is referred to as factorization of the PMF or PDF. The independence of two random variables is thus equivalent to the factorization of their bivariate PMF or PDF.

Example Consider the earlier example of a bivariate PMF and its associated marginal PMFs (cf. Table 5.2). Because

pξ1,ξ2(1, 1) = 0.1 ≠ 0.08 = pξ1(1)pξ2(1), (5.56)

the random variables ξ1 and ξ2 are not independent. For the marginal distributions specified in Table 5.2, the bivariate PMF for independent ξ1 and ξ2 is documented in Table 5.4 below.

pξ1,ξ2(x1, x2)   x2 = 1   x2 = 2   x2 = 3   x2 = 4   pξ1(x1)
x1 = 1            0.08     0.12     0.12     0.08     0.40
x1 = 2            0.06     0.09     0.09     0.06     0.30
x1 = 3            0.06     0.09     0.09     0.06     0.30
pξ2(x2)           0.20     0.30     0.30     0.20

Table 5.4. A factorized PMF
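Computationally, the factorization criterion of Theorem 5.4.2 amounts to comparing the joint PMF array with the outer product of its marginals (a NumPy sketch; the outer product reproduces Table 5.4):

    # Independence check by PMF factorization, cf. Theorem 5.4.2.
    import numpy as np

    p = np.array([[0.1, 0.0, 0.2, 0.1],
                  [0.1, 0.2, 0.0, 0.0],
                  [0.0, 0.1, 0.1, 0.1]])
    p_x1, p_x2 = p.sum(axis=1), p.sum(axis=0)   # marginal PMFs of Table 5.2

    p_factorized = np.outer(p_x1, p_x2)         # p_xi1(x1) * p_xi2(x2), Table 5.4
    print(np.allclose(p, p_factorized))         # False: xi_1 and xi_2 are not independent
    print(p_factorized)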

The bivariate case of two independent random variables is generalized to the case of n independent random variables in the following definition.

Definition 5.4.11 (n independent random variables). n random variables ξ1, ..., ξn are independent, if for every S1 ⊆ Γ(1), ..., Sn ⊆ Γ(n),

P(ξ1 ∈ S1, ..., ξn ∈ Sn) = Π_{i=1}^{n} P(ξi ∈ Si). (5.57)

If the random variables have a multivariate PMF or PDF pξ1,...,ξn (x1, ..., xn) with marginal PMFs or

PDFs pξi , i = 1, ..., n, then independence holds if

pξ1,...,ξn(x1, ..., xn) = Π_{i=1}^{n} pξi(xi). (5.58)

•

The special case of n independent random variables with identical marginal distributions serves as a fundamental assumption in many statistical settings. We use the following definition.


Definition 5.4.12 (Independent and identically distributed random variables). n random variables ξ1, ..., ξn are called independent and identically distributed (iid), if and only if

(1) ξ1, ..., ξn are independent random variables, and

(2) each ξi has the same marginal distribution for i = 1, ..., n. •

In Section 7 | Probability distributions, we consider the case of n iid Gaussian random variables and how their joint distribution can be represented by a multivariate Gaussian distribution.
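As a computational illustration of the iid concept, the following Python sketch draws one realization of n iid standard normal random variables (the use of NumPy's random number generator and the chosen parameter values are assumptions made here for demonstration purposes):

    # One realization of (xi_1, ..., xi_n) for n iid standard normal random variables.
    import numpy as np

    rng = np.random.default_rng(0)                       # seeded random number generator
    samples = rng.normal(loc=0.0, scale=1.0, size=5)     # n = 5 iid draws from N(0, 1)
    print(samples)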

5.5 Bibliographic remarks

The presented material is standard and can be found in any introductory textbook on probability and statistics. DeGroot and Schervish (2012) and Wasserman (2004) are the main sources for the presentation provided here. Excellent introductions to modern probability theory include Billingsley (1995), Fristedt and Gray (1998), Rosenthal (2006), and, from a statistical perspective, Shao (2003).

5.6 Study questions

1. Write down the definition of a probability space.
2. Write down the definition of the independence of two events A and B.
3. Write down the definition of a random variable.
4. Write down the definition of the cumulative distribution function of a random variable.
5. Write down the definitions of a PMF and a PDF.
6. Write down the definition of a random vector.
7. Write down the definition of the cumulative distribution function of a random vector.
8. Write down the definition of a multivariate PMF and a multivariate PDF.

9. Write down the definition of the independence of n random variables ξi, i = 1, ..., n.

10. What does it mean for n random variables ξ1, ..., ξn to be iid?

References

Billingsley, P. (1995). Probability and Measure. Wiley Series in Probability and Statistics. Wiley, New York, 3rd edition.
DeGroot, M. H. and Schervish, M. J. (2012). Probability and Statistics. Addison-Wesley, Boston, 4th edition.
Fristedt, B. E. and Gray, L. F. (1998). A Modern Approach to Probability Theory. Birkhäuser, Boston.
Kolmogorov, A. N. (1956). Foundations of the Theory of Probability. Chelsea Publishing Company, New York.
Rosenthal, J. S. (2006). A First Look at Rigorous Probability Theory. World Scientific, Singapore; Hackensack, N.J., 2nd edition.
Shao, J. (2003). Mathematical Statistics. Springer Texts in Statistics. Springer, New York, 2nd edition.
Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer, New York.
