5 | Probability spaces and random variables
In this chapter, we review some essentials of probability theory as required for the theory of the GLM. We focus on the particularities and inner logic of the probability theory model rather than its practical application, and primarily aim to establish important concepts and notation that will be used in subsequent sections. In Section 5.1, we first introduce the basic notion of a probability space as a model for experiments that involve some degree of randomness. We then discuss some elementary aspects of probability in Section 5.2, which mainly serve to ground the subsequently discussed theory of random variables and random vectors. The fundamental mathematical construct to model univariate data endowed with uncertainty is the concept of a random variable. We focus on different ways of specifying probability distributions of random variables, notably probability mass and density functions for discrete and continuous random variables, respectively, in Section 5.3. The concise mathematical representation of more than one data point requires the concept of a random vector. In Section 5.4, we first discuss the extension of random variable concepts to the multivariate case of random vectors and then focus on three concepts that arise only in the multivariate scenario and are of immense importance for statistical data analysis: marginal distributions, conditional distributions, and independent random variables.
5.1 Probability spaces
Probability spaces are very general and abstract models of random experiments. We use the following definition.
Definition 5.1.1 (Probability space). A probability space is a triple (Ω, A, P), where

◦ Ω is a set of elementary outcomes ω,
◦ A is a σ-algebra, i.e., A is a set of subsets of Ω with the following properties:
  ◦ Ω ∈ A,
  ◦ A is closed under the formation of complements, i.e., if A ∈ A, then also Ac := Ω \ A ∈ A,
  ◦ A is closed under countable unions, i.e., if A1, A2, A3, ... ∈ A, then ∪∞i=1 Ai ∈ A,
◦ P is a probability measure, i.e., P is a mapping P : A → [0, 1] with the following properties:
  ◦ P is normalized, i.e., P(∅) = 0 and P(Ω) = 1, and
  ◦ P is σ-additive, i.e., if A1, A2, ... is a pairwise disjoint sequence in A (i.e., Ai ∈ A for i = 1, 2, ... and Ai ∩ Aj = ∅ for i ≠ j), then P(∪∞i=1 Ai) = Σ∞i=1 P(Ai).

•
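As a minimal illustrative sketch (not part of the text), the defining properties of a finite probability space can be checked by brute force. Here Ω models a fair die, the σ-algebra is the power set, and P assigns each event the sum of its elementary probabilities; all names are illustrative.

```python
from fractions import Fraction
from itertools import chain, combinations

# Outcome set of the fair-die example: Omega := {1, ..., 6}.
Omega = frozenset({1, 2, 3, 4, 5, 6})

def power_set(s):
    """All subsets of s, modelling the sigma-algebra A := P(Omega)."""
    elems = list(s)
    return {frozenset(c) for c in chain.from_iterable(
        combinations(elems, r) for r in range(len(elems) + 1))}

def P(event):
    """Fair-die probability measure: |event| / 6, computed exactly."""
    return Fraction(len(event), 6)

A = power_set(Omega)

# Sigma-algebra properties from Definition 5.1.1:
assert Omega in A                              # Omega is in A
assert all(Omega - a in A for a in A)          # closed under complements
# P is normalized: P(empty set) = 0 and P(Omega) = 1.
assert P(frozenset()) == 0 and P(Omega) == 1
```

For a finite Ω, closure under countable unions reduces to closure under finite unions, which the power set satisfies trivially.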
Example A basic example is a probability space that models the throw of a die. In this case, the elementary outcomes ω ∈ Ω model the six faces of the die, i.e., one may define Ω := {1, 2, 3, 4, 5, 6}. If the die is thrown, it will roll, and once it comes to rest, its upper surface will show one of the elementary outcomes. The typical σ-algebra used in the case of discrete and finite outcome sets (such as the current Ω) is the power set P(Ω) of Ω. It is a basic exercise in probability theory to show that the power set indeed fulfils the properties of a σ-algebra as defined above. Because P(Ω) contains all subsets of Ω, it also contains the elementary outcome sets {1}, {2}, ..., {6}, which thus get allocated a probability P({ω}) ∈ [0, 1], ω ∈ Ω by the probability measure P. Probabilities of sets containing a single elementary outcome are also often written simply as P(ω) (:= P({ω})). The typical value ascribed to P(ω), ω ∈ Ω, if used to model a fair die, is P(ω) = 1/6. The σ-algebra P(Ω) contains many more sets than the sets of elementary outcomes. The purpose of these additional elements is to model all sorts of events to which an observer of the random experiment may want to ascribe probabilities. For example, the observer may ask “What is the probability that the upper surface shows a number larger than three?”. This event corresponds to the set {4, 5, 6}, which, because the σ-algebra P(Ω) contains all possible subsets of Ω, is contained in P(Ω). Likewise, the observer may ask “What is the probability that the upper surface shows an even number?”, which corresponds to the subset {2, 4, 6} of Ω. The probability measure P is defined in such a manner that the answers to the following questions are predetermined: “What is the probability that the upper surface shows nothing?” and “What is the probability that the upper surface shows any number in Ω?”.
The element of P(Ω) that corresponds to the first question is the empty set, and by definition of P, P(∅) = 0. This models the idea that one of the elementary outcomes, i.e., one surface with pips, will show up on every instance of the random experiment. If this is not the case, for example because the pips have worn off at one of the surfaces, the probability space model as sketched thus far is not a good model of the die experiment. The element of P(Ω) that corresponds to the second question is Ω itself. Here, the definition of the probability measure assigns P(Ω) = 1, i.e., the probability that something unspecific will happen is one. Again, if the die falls off the table and cannot be recovered, the probability space model and the experiment are not in good alignment. Finally, the definition of the probability space as provided above allows one to evaluate probabilities for certain events based on the probabilities of other events by means of the σ-additivity of P. Assume for example that the probability space models the throw of a fair die, such that P({ω}) = 1/6 by definition. Based on this assumption, the σ-additivity property allows one to evaluate the probabilities of many other events. Consider for example an observer who is interested in the probability of the event that the surface of the die shows a number smaller than or equal to three. Because the elementary events {1}, {2}, {3} are pairwise disjoint, and because the event of interest can be written as the countable union {1, 2, 3} = {1} ∪ {2} ∪ {3} of these events, one may evaluate the probability of the event of interest by

P({1} ∪ {2} ∪ {3}) = P({1}) + P({2}) + P({3}) = 1/6 + 1/6 + 1/6 = 1/2.

The die example is concerned with the case that a probability space is used to model a random experiment with a finite number of elementary outcomes. In the modelling of scientific experiments, the elementary outcomes are often modelled by the set of real numbers or real-valued vectors.
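The σ-additivity computation from the die example can be sketched numerically (an illustration, not part of the text):

```python
from fractions import Fraction

def P(event):
    """Fair-die probability measure on subsets of {1, ..., 6}."""
    return Fraction(len(event), 6)

# The event {1, 2, 3} is the disjoint union of {1}, {2}, and {3}, so its
# probability equals the sum of the elementary probabilities.
event = {1} | {2} | {3}
additive = P({1}) + P({2}) + P({3})

assert P(event) == additive == Fraction(1, 2)
```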
Much of the theoretical development of modern probability theory in the early twentieth century was concerned with the question of how ideas from basic probability with finite elementary outcome spaces can be generalized to the continuous outcome space case of real numbers and vectors. In fact, perhaps the most important contribution of the probability space model as defined above, originally developed by Kolmogorov (1956), is that it applies in both the discrete-finite and the continuous-infinite elementary outcome set scenarios. The study of probability spaces for Ω := R or Ω := Rn, n > 1 is a central topic in probability theory which we by and large omit here. We do however note that the σ-algebras employed when Ω := Rn, n ≥ 1 are the so-called Borel σ-algebras, commonly denoted by B for n = 1 and Bn for n > 1. The mathematical construction of these σ-algebras is beyond our scope, but for the theory of the GLM, it is not unhelpful to think of Borel σ-algebras as power sets of R or Rn, n > 1. This is factually wrong, as it can be shown that there are in fact more subsets of R or Rn, n > 1 than there are elements in the corresponding Borel σ-algebras. Nevertheless, many events of interest, such as the event that the elementary outcome of a random experiment with outcome space R falls into a real interval [a, b], are in B.
5.2 Elementary probabilities
We next discuss a few elementary aspects of probabilities defined on probability spaces. Throughout, let (Ω, A, P) denote a probability space, such that P : A → [0, 1] is a probability measure.
Interpretation
We first note that the probability P(A) of an event A is associated with at least two interpretations. From a Frequentist perspective, the probability of an event corresponds to the idealized long run frequency of observing the event A. From a Bayesian perspective, the probability of an event corresponds to the degree of belief that the event is true. Notably, both interpretations are subjective in the sense that the Frequentist perspective envisions an idealized long run frequency which can never be realized in practice, while the Bayesian belief interpretation is explicitly subjective and specific to a given observer. However, irrespective of the specific interpretation of the probability of an event, the logical rules for probabilistic inference, also known as probability calculus, are identical under both interpretations.
The General Linear Model | © 2020 Dirk Ostwald CC BY-NC-SA 4.0
Basic properties We next note the following basic properties of probabilities, which follow directly from the probability space definition.
Theorem 5.2.1 (Properties of probabilities). Let (Ω, A, P) denote a probability space. Then the following properties hold.
(1) If A ⊂ B, then P(A) ≤ P(B).
(2) P(Ac) = 1 − P(A).
(3) If A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B).
(4) P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
As an example, we prove property (4) of Theorem 5.2.1 below.
Proof. With the fact that any union of two sets A, B ⊂ Ω can be written as the union of the pairwise disjoint sets A ∩ Bc, A ∩ B, and Ac ∩ B (cf. Section 2 | Sets, sums, and functions) and with the additivity of P for disjoint events, we have:

P(A ∪ B) = P(A ∩ Bc) + P(A ∩ B) + P(Ac ∩ B)
         = P(A ∩ Bc) + P(A ∩ B) + P(Ac ∩ B) + P(A ∩ B) − P(A ∩ B)
         = P((A ∩ Bc) ∪ (A ∩ B)) + P((Ac ∩ B) ∪ (A ∩ B)) − P(A ∩ B)    (5.1)
         = P(A) + P(B) − P(A ∩ B).
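All four properties of Theorem 5.2.1 can be verified exhaustively on the finite die space. This is an illustrative brute-force check over all pairs of events, not a proof; names are illustrative.

```python
from fractions import Fraction
from itertools import chain, combinations

Omega = frozenset({1, 2, 3, 4, 5, 6})

def P(event):
    """Fair-die probability measure: |event| / 6."""
    return Fraction(len(event), 6)

# Enumerate the full power set of Omega (64 events).
events = [frozenset(c) for c in chain.from_iterable(
    combinations(sorted(Omega), r) for r in range(7))]

for A in events:
    assert P(Omega - A) == 1 - P(A)                    # property (2)
    for B in events:
        if A <= B:
            assert P(A) <= P(B)                        # property (1)
        if not (A & B):
            assert P(A | B) == P(A) + P(B)             # property (3)
        assert P(A | B) == P(A) + P(B) - P(A & B)      # property (4)
```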
Independence An important feature of many probabilistic models is the independence of events. Intuitively, independence models the absence of deterministic and stochastic influences between events. Notably, independence can either be assumed, and thus built into a probabilistic model by design, or it can follow from the design of the model. Regardless of the origin of the independence of events, we use the following definitions.
Definition 5.2.1 (Independent events). Let (Ω, A, P) denote a probability space. Two events A ∈ A and B ∈ A are independent, if P(A ∩ B) = P(A)P(B). (5.2)
A set of events {Ai | i ∈ I} ⊂ A with index set I is independent, if for every finite subset J ⊂ I

P(∩j∈J Aj) = Πj∈J P(Aj). (5.3)

•

Notably, disjoint events with positive probability, such as observing an even or odd number of pips in the die experiment, are not independent: if P(A) > 0 and P(B) > 0, then P(A)P(B) > 0, but P(A ∩ B) = P(∅) = 0, and thus P(A ∩ B) ≠ P(A)P(B).
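On the fair-die space, Definition 5.2.1 can be checked directly: the events "even number" and "number at most two" turn out to be independent, whereas the disjoint events "even" and "odd" are not. A minimal sketch with illustrative names:

```python
from fractions import Fraction

def P(event):
    """Fair-die probability measure on subsets of {1, ..., 6}."""
    return Fraction(len(event), 6)

even, odd, at_most_two = {2, 4, 6}, {1, 3, 5}, {1, 2}

# P(even ∩ at_most_two) = P({2}) = 1/6 = (1/2)(1/3) = P(even)P(at_most_two):
assert P(even & at_most_two) == P(even) * P(at_most_two)

# Disjoint events with positive probability are never independent:
assert P(even & odd) == 0 != P(even) * P(odd)
```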
Conditional probability The basis for many forms of probabilistic inference is the conditional probability of an event given that another event occurs. We use the following definition.
Definition 5.2.2 (Conditional probability). Let (Ω, A, P) denote a probability space and let A, B ∈ A with P(B) > 0. Then the conditional probability of A given B is defined as
P(A|B) = P(A ∩ B) / P(B). (5.4)

•
Without proof, we note that for any fixed B ∈ A, P(·|B) is a probability measure, i.e., P(·|B) ≥ 0, P(Ω|B) = 1, and for pairwise disjoint A1, A2, ... ∈ A, P(∪∞i=1 Ai|B) = Σ∞i=1 P(Ai|B). Note that the rules of probability apply to the events on the left of the vertical bar. Intuitively, P(A|B) is the fraction of times the event A occurs among those times in which the event B occurs. Up to normalization, this fraction is already given by P(A ∩ B), the idealized relative frequency of, or the belief that, the events A and B occur together. Division of P(A ∩ B) by P(B) then yields a normalized measure. Furthermore, in most probabilistic models P(A|B) ≠ P(B|A). For example, the probability of exhibiting respiratory symptoms after contracting coronavirus does not necessarily equal the probability of having contracted coronavirus when exhibiting respiratory symptoms. Finally, a mathematical extension of conditional probability to the case of P(B) = 0 is possible, but technically beyond our scope. Rearranging the definition of conditional probability allows for expressing the probability that two events occur jointly as the product of the conditional probability of one event given the other and the probability of the conditioning event. This fact is routinely used in the construction of probabilistic models. Formally, we have the following theorem, which follows directly from the definition of conditional probability.
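The asymmetry P(A|B) ≠ P(B|A) is easy to exhibit on the fair-die space. A minimal sketch (illustrative names) with A = "even" and B = "at most two":

```python
from fractions import Fraction

def P(event):
    """Fair-die probability measure on subsets of {1, ..., 6}."""
    return Fraction(len(event), 6)

def P_cond(A, B):
    """Conditional probability P(A|B) = P(A ∩ B) / P(B); requires P(B) > 0."""
    assert P(B) > 0
    return P(A & B) / P(B)

A, B = {2, 4, 6}, {1, 2}             # "even" and "at most two"
assert P_cond(A, B) == Fraction(1, 2)   # P(A ∩ B)/P(B) = (1/6)/(1/3)
assert P_cond(B, A) == Fraction(1, 3)   # P(A ∩ B)/P(A) = (1/6)/(1/2)
assert P_cond(A, B) != P_cond(B, A)
```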
Theorem 5.2.2 (Joint and conditional probabilities). Let (Ω, A, P) denote a probability space and let A, B ∈ A with P(A), P(B) > 0. Then
P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A). (5.5)
For independent events, knowledge of the occurrence of one of the events does not affect the probability that the other event occurs:
Theorem 5.2.3 (Conditional probability for independent events). Let (Ω, A, P) denote a probability space and let A, B ∈ A with P(A), P(B) > 0 denote two independent events. Then
P(A|B) = P(A) and P(B|A) = P(B). (5.6)
Proof. With the definitions of conditional probability and independent events, we have
P(A|B) = P(A ∩ B)/P(B) = P(A)P(B)/P(B) = P(A), (5.7)

and analogously for P(B|A).
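Theorem 5.2.3 can be verified on the fair-die space with the independent events "even" and "at most two" (a sketch with illustrative names, not part of the text):

```python
from fractions import Fraction

def P(event):
    """Fair-die probability measure on subsets of {1, ..., 6}."""
    return Fraction(len(event), 6)

even, at_most_two = {2, 4, 6}, {1, 2}

# The two events are independent:
assert P(even & at_most_two) == P(even) * P(at_most_two)
# Hence conditioning leaves the probabilities unchanged:
assert P(even & at_most_two) / P(at_most_two) == P(even)        # P(A|B) = P(A)
assert P(even & at_most_two) / P(even) == P(at_most_two)        # P(B|A) = P(B)
```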
5.3 Random variables and distributions
The fundamental construct for the mathematical representation of numerical data endowed with uncertainty is the random variable. From a mathematical perspective, random variables are neither random nor variables. Instead, random variables are functions that map elements of a probability outcome space Ω into another outcome space Γ. Γ is either a countable set X, in which case the functions are referred to as discrete random variables, or Γ is the real line R, in which case the functions are referred to as continuous random variables. If Γ is a multidimensional space, the respective functions are referred to as random vectors. In the current section, we are concerned with some fundamental aspects of random variables. In Section 5.4, we consider their multivariate generalization as random vectors.
Measurable functions and random variables First, not all functions that map elements of a probability outcome space Ω onto elements of another outcome space Γ are random variables. A fundamental feature of random variables is that they are measurable. In the mathematical literature, the terms measurable function and random variable are hence used interchangeably. To make the concept of a measurable function precise, let (Ω, A, P) denote a probability space, and let

ξ : Ω → Γ, ω ↦ ξ(ω) (5.8)

denote a function. Assume further that there exists a σ-algebra S on Γ. The tuple of a set Γ and a σ-algebra S is referred to as a measurable space (for every probability space (Ω, A, P), (Ω, A) thus forms a measurable space). Finally, for every set S ∈ S, let ξ−1(S) denote the preimage of S under ξ. The preimage of S ∈ S under ξ is the set of all ω ∈ Ω that are mapped onto elements of S by ξ, i.e.,
ξ−1(S) := {ω ∈ Ω|ξ(ω) ∈ S}. (5.9)
Now, if the preimages of all S ∈ S are elements of the σ-algebra A on Ω, then ξ is called a measurable function. Formally, we have the following definition.
Definition 5.3.1 (Measurable function). Let (Ω, A, P) be a probability space, let (Γ, S) denote a measurable space, and let

ξ : Ω → Γ, ω ↦ ξ(ω) (5.10)

be a function. If

ξ−1(S) ∈ A for all S ∈ S, (5.11)

then ξ is called a measurable function.

•

A measurable function ξ : Ω → Γ is called a random variable:
Definition 5.3.2 (Random variable). Let (Ω, A, P) denote a probability space and let ξ :Ω → Γ denote a function. If ξ is a measurable function, then ξ is called a random variable. •
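On a finite space, measurability can be checked by enumerating all preimages. The following sketch (illustrative names) takes ξ to be the indicator of an even die throw, Γ := {0, 1}, and power sets as the σ-algebras, and verifies Definition 5.3.1 directly:

```python
from itertools import chain, combinations

Omega = frozenset({1, 2, 3, 4, 5, 6})   # die outcome space
Gamma = frozenset({0, 1})               # target outcome space

def power_set(s):
    """All subsets of s, used here as the sigma-algebras on Omega and Gamma."""
    elems = list(s)
    return {frozenset(c) for c in chain.from_iterable(
        combinations(elems, r) for r in range(len(elems) + 1))}

def xi(omega):
    """Indicator of an even throw: xi(omega) = 1 if omega is even, else 0."""
    return 1 if omega % 2 == 0 else 0

def preimage(S):
    """Preimage of S under xi: all omega in Omega with xi(omega) in S."""
    return frozenset(w for w in Omega if xi(w) in S)

A, S_algebra = power_set(Omega), power_set(Gamma)

# Measurability: every preimage of a set in the sigma-algebra on Gamma is in A.
assert all(preimage(S) in A for S in S_algebra)
assert preimage(frozenset({1})) == frozenset({2, 4, 6})
```

Because A is the full power set of Ω, every function out of Ω is measurable here; with a coarser σ-algebra on Ω, the measurability check could fail.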
Probability distributions The condition of measurability of the function ξ has a fundamental consequence for the sets in S: because the probability measure P allocates a probability P(A) to all sets in A, and because, by definition of the measurability of ξ, all preimages ξ−1(S) of sets S ∈ S are sets in A, the construction of a random variable allows for allocating a probability to all sets S ∈ S, namely the probability of the preimage ξ−1(S) ∈ A under P. This entails the induction of a probability measure on the measurable space (Γ, S). This induced probability measure is called the probability distribution of the random variable ξ and is denoted by Pξ. We use the following definition.

Definition 5.3.3 (Probability distribution). Let (Ω, A, P) denote a probability space, let (Γ, S) denote a measurable space, and let

ξ : Ω → Γ, ω ↦ ξ(ω) (5.12)

denote a random variable. Then the probability measure Pξ defined by