<<

Lecture notes for STAT 547C: Topics in Probability (draft)

Ben Bloem-Reddy

November 26, 2019

Contents

1 Motivation

2 Measurable spaces
 2.1 Basic set notation and operations
 2.2 σ-algebras
 2.3 Examples of σ-algebras
 2.4 Measurable spaces
  2.4.1 Products of measurable spaces

3 Measurable functions
 3.1 Measurability of functions
 3.2 Compositions of functions
 3.3 Numerical functions
 3.4 Measurability of limits of sequences of functions
 3.5 Simple functions and approximating measurable functions
 3.6 Standard and common measurable spaces

4 Measures, probability measures, and probability spaces
 4.1 Measures and measure spaces
  4.1.1 Probability measures and probability spaces
 4.2 Examples
 4.3 Some properties of measures
 4.4 Negligible/null sets and completeness
 4.5 Almost everywhere/surely
 4.6 Image measures

5 Probability spaces
 5.1 Probability spaces as models of reality
 5.2 Statistical models

6 Random variables
 6.1 Distribution of a random variable
 6.2 Examples of random variables
 6.3 Joint distributions and independence
 6.4 Information and determinability

7 Integration
 7.1 Definition and desiderata
 7.2 Examples
 7.3 Basic properties of integration
 7.4 Monotone Convergence Theorem
 7.5 Further properties of integration
 7.6 Characterization of the integral
 7.7 More on interchanging limits and integration
 7.8 Integration and image measures

8 Expectation
 8.1 Integrating with respect to a probability measure
 8.2 Changes of measure: indefinite integrals and Radon–Nikodym derivatives
 8.3 Moments and inequalities
 8.4 Transforms and generating functions

9 Independence
 9.1 Independence of random variables
 9.2 Sums of independent random variables
 9.3 Tail fields and Kolmogorov's 0-1 law

10 Conditioning and disintegration
 10.1 Conditional expectations
 10.2 Conditional probabilities and distributions
 10.3 Interjection: kernels and product spaces
 10.4 Conditional probabilities and distributions, continued
 10.5 Conditional independence
 10.6 Statistical sufficiency
 10.7 Construction of probability spaces

11 Convergence
 11.1 Convergence of real sequences
 11.2 Almost sure convergence
 11.3 Convergence in probability
 11.4 Convergence in Lp
 11.5 Overview of modes of convergence
 11.6 Convergence in distribution
 11.7 Central Limit Theorem
 11.8 Laws of large numbers

12 Martingales
 12.1 Basic definitions
 12.2 Martingale Convergence Theorem
 12.3 Martingale inequalities

A Additional results
 A.1 An example of the use of Radon–Nikodym derivatives

List of Exercises

1 Exercise (Limits of monotone decreasing sequences of sets)
2 Exercise (Countable unions and limits of monotone increasing sequences of sets)
3 Exercise (Countable intersections and limits of monotone decreasing sequences of sets)
4 Exercise (Intersection of σ-algebras)
5 Exercise (Trace spaces, Çınlar Ex. I.1.15)
6 Exercise (Proof of Lemma 3.1)
7 Exercise (Proof of Proposition 3.2)
8 Exercise (σ-algebra generated by a function)
9 Exercise (Measurable functions of measurable functions are measurable)
10 Exercise (Compositions under which the class of simple functions is closed)
11 Exercise (Plotting the dyadic functions)
12 Exercise (Restrictions and traces, Çınlar Ex. I.3.11)
13 Exercise (Extensions, Çınlar Ex. I.3.12)
14 Exercise (Image measure)
15 Exercise
16 Exercise (Random variable from undergraduate probability)
17 Exercise (Random variable from your research interests)
18 Exercise (Insensitivity of integration)
19 Exercise (Proof of Theorem 7.9)
20 Exercise (Proof of Markov's inequality)
21 Exercise (Chebyshev's inequality)
22 Exercise (Generalized Markov's inequality)
23 Exercise (Independence of functions of independent random variables)
24 Exercise
25 Exercise
26 Exercise
27 Exercise (Independence in conditional expectation)
28 Exercise (Measurable sections)
29 Exercise (Equivalence of conditional independence relationships)
30 Exercise (Chain rule for conditional independence)
31 Exercise (Almost sure version of the continuous mapping theorem)
32 Exercise (Applications of Borel–Cantelli Lemma)
33 Exercise (Convergence in probability and arithmetic)
34 Exercise (Implications of Lp convergence)
35 Exercise (Convex increasing functions of submartingales)

1 Motivation

Placeholder for these notes.

A few words on mathematics in these notes and generally

Mathematics in these notes. I aim to be mathematically rigorous in these notes, but I will also include some (attempts at) intuitive explanations of challenging technical concepts. Appealing to intuition sometimes requires relaxing the rigor. It should be clear when I am making an intuition-based argument—if it is not, please ask.

Notation. I will mostly follow the notation of Çınlar [Çin11], who lists some frequently used notation on page ix. If I deviate from that notation, I will say so.

Lemma, Proposition, Theorem, etc. As is typically the case, only the most important results will be called theorems. Propositions are typically intermediate results that may be of interest on their own, but whose importance—at least within this material—does not reach the lofty heights of the exalted theorems. Lemmas are intermediate or supporting results that often could be included within the body of the proof they support. Different authors have different preferences, but well-structured lemmas can enhance the conceptual clarity and readability of long or complex proofs.

Writing mathematics. Something to keep in mind as you do research is that the clarity of your writing can greatly affect how your work is received—even the very best work will be hindered by poor (or even mediocre) writing. Writing mathematics can be especially challenging. I encourage you to practice in your assignment write-ups and especially in your final project. I will post some resources that I have found helpful to my writing.

Disclaimer. The primary purpose of these notes is to stay organized during lecture. They are a work in progress. They do not replace the readings. Furthermore, they are not a model of good mathematical writing. If something is unclear, please ask. If you find a typo or an error, please let me know.

2 Measurable spaces

Reading: Çınlar, I.1. Supplemental:

Overview. Recall from last time that defining a probability on an uncountable space can lead to fundamental issues. Ideally, we want the elements of the space to encode all possible outcomes of an experiment, and we want to define a function for assigning probability to those outcomes in a self-consistent way (i.e., the probability of all possible outcomes is 1, and is equal to the sum of the probabilities of each possible outcome). We ran into problems when trying to work in an uncountable space—in essence, there were too many points (uncountably many) that we required the probability to "measure" in a self-consistent way. This raises the question of what we might measure instead (and get something useful). This section is devoted primarily to introducing and defining the relevant terms, and exploring some of their properties. For the sake of time, we're skipping over some material that is quite important to measure theory, though we won't need it later on. In particular, we won't cover the sections "p-systems and d-systems" and "Monotone class theorem" in Çınlar [Çin11]. I encourage you to read them on your own.

2.1 Basic set notation and operations

We'll start with some notation and definitions. Following Çınlar's notation, let E denote a set. (For the purposes of this and the next few sections, capital letters will typically denote sets. Once we start working with random variables, capital letters will generally indicate random variables.)

Subsets. Let A be a subset of E, denoted A ⊂ E. A = E if and only if A ⊂ E and A ⊃ E. The empty set is denoted by ∅. Note that some authors use A ⊆ E to denote that A is a subset of E, and A ⊂ E to denote that it is a proper subset, that is, A ≠ E. I will follow Çınlar and use ⊂ to mean subset. Often, the distinction won't matter; if it does, then I will use A ⊊ E to denote a proper subset.

Set operations. The basic set operations on subsets A, B of E are
• union, denoted A ∪ B;
• intersection, denoted A ∩ B;
• complement of B in A, denoted A \ B.
E will typically be a generic "master" set, for which we use the notation B^c to denote E \ B.

Logic and set operations. Set operations encode the logical relationships or, and, and not, respectively. We can think of constructing a "probabilistic query"1 as combining sets of outcomes via logical operations (e.g., "It's raining today." and "It rained yesterday."). Sets and set operations are the natural mathematical objects for encoding such queries.

Collections of (sub)sets. We will be working extensively with collections of subsets. For an arbitrary index set I, we denote a collection by C = {Ai : i ∈ I}. We write

∪_{i∈I} Ai , ∩_{i∈I} Ai (2.1)

for the union and intersection, respectively, of all sets Ai in the collection.

A collection C is disjointed if Ai ∩ Aj = ∅ for all Ai,Aj ∈ C, i 6= j. A partition of a set E is a countable disjointed collection of sets whose union is E.

Closure. A collection C is said to be closed under intersections if A ∩ B ∈ C whenever A, B ∈ C. C is closed under countable intersections if the intersection of every countable collection of sets in C is also in C. Closure under complements, unions, and countable unions are defined analogously.

Sequences of sets. Let (An) = A1, A2, . . . be a sequence of sets in C. The limit superior and limit inferior are defined as

lim sup_n An = ∩_{n≥1} ∪_{k≥n} Ak , (2.2)
lim inf_n An = ∪_{n≥1} ∩_{k≥n} Ak . (2.3)

If the two are equal, then we call it the limit of (An).

A sequence of sets is monotone increasing if A1 ⊂ A2 ⊂ · · · , and likewise for monotone decreasing sequences of sets. In both cases, the limit exists. Call it A. Then we write An ↗ A for increasing sequences and An ↘ A for decreasing sequences.

1Thanks to Trevor Campbell for the analogy.

Example 2.1. Limits of monotone sequences of sets. Show that An ↗ ∪m Am for any monotone increasing sequence (An).

Evaluating (2.2) and (2.3) yields the answer. In particular, consider lim sup_n An term-by-term:
• n = 1: ∪_{k≥n} Ak = ∪_{m≥1} Am.
• n = 2: ∪_{k≥n} Ak = ∪_{m≥2} Am.
• And so on.
Now if we take the intersection of these terms, we get, for example,

(∪m≥1Am) ∩ (∪m≥2Am) = ∪m≥2Am .

But A1 ⊂ A2, so ∪m≥2Am = ∪m≥1Am, and therefore lim supn An = ∪nAn.

Similarly, lim inf_n An = ∪n An because ∪_{n≥1} ∩_{k≥n} Ak = ∪_{n≥1} An. (The intersection of an increasing sequence of sets is just the smallest member of the sequence.) Thus, lim sup_n An = lim inf_n An = lim_n An = ∪n An. Because (An) is a monotone increasing sequence, An ↗ ∪m Am.

Exercise 1 (Limits of monotone decreasing sequences of sets): Show that An ↘ ∩m Am for any monotone decreasing sequence.

2.2 σ-algebras

Warning! σ-algebras are notoriously non-intuitive the first (and second, and third, and ...) time you encounter them. I think it's useful to keep in mind that ultimately we want to have a properly defined notion of probability that takes as input a query (as described above) and outputs values in [0, 1]. We also want that probability to respect some logical relationships (e.g., exclusive outcomes cannot occur simultaneously). A σ-algebra is the mathematical formalism of what we can possibly query. (Most authors write "σ-algebra", but you may see "sigma-algebra". You may see both within the same text (as in Çınlar) because including Greek letters in LaTeX section headings causes problems for the hyperref package.2)

Definition. A non-empty collection E of subsets of E is called an algebra if it is closed under finite unions and complements. E is a σ-algebra if it is closed under countable unions and complements. In math:
1. A ∈ E ⇒ E \ A ∈ E;
2. A1, A2, . . . ∈ E ⇒ ∪_{n≥1} An ∈ E.
Observe that a σ-algebra is also closed under countable intersections:

∩n≥1An = E \ (∪n≥1(E \ An)) . (2.4)

This is an element of E because:
1. each entry in the union is in E (closed under complements);
2. the union of those entries is in E (closed under countable unions);

2Aside: most of the time this can be dealt with by using the texorpdfstring command.

3. the complement of the union is in E (closed under complements).
This type of recursive closure suggests an algorithm to create a σ-algebra from any collection of sets,3 as sketched below. Conversely (and somewhat less intuitively), for a given σ-algebra, we might be able to find a relatively small sub-collection that can be used to produce any of the other sets in the σ-algebra. We'll return to this idea shortly.
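
The recursive closure is easy to make concrete on a small finite set. The sketch below is my own illustration (the function name and the example collection are made up, not from the text): it closes a collection under complements and pairwise unions until nothing new appears, which on a finite set yields exactly σC.

from itertools import combinations

def generate_sigma_algebra(E, C):
    """Close a collection C of subsets of E under complements and unions.
    On a finite set, countable unions reduce to finite ones, so this
    produces the generated sigma-algebra, sigma(C)."""
    sets = {frozenset(A) for A in C} | {frozenset(), frozenset(E)}
    changed = True
    while changed:
        changed = False
        new = set()
        for A in sets:
            comp = frozenset(E) - A          # closure under complements
            if comp not in sets:
                new.add(comp)
        for A, B in combinations(sets, 2):   # closure under (pairwise) unions
            if (A | B) not in sets:
                new.add(A | B)
        if new:
            sets |= new
            changed = True
    return sets

E = {1, 2, 3, 4}
C = [{1}, {1, 2}]
for A in sorted(generate_sigma_algebra(E, C), key=lambda s: (len(s), sorted(s))):
    print(set(A) or "{}")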

Why closure under complements and unions? Back to the probabilistic query analogy: if we're interested in the probability of some set A, we should also be able to query, for example, "not A" (A^c), "A or B" (A ∪ B), "A or B and not D" ((A ∪ B) ∩ D^c). Closure under complements and unions ensures this.

Why countable unions? Countable unions ensure that the σ-algebra also contains the limits of all monotone sequences of its sets. In fact, this is how Doob [Doo94] defines σ-algebra.

Every monotone sequence of sets has a limit, denoted A∞. We denote this by An ↗ A∞ for increasing sequences and An ↘ A∞ for decreasing sequences.

Exercise 2 (Countable unions and limits of monotone increasing sequences of sets): Show that a collection of sets E that is closed under finite unions is also closed under countable unions if and only if it contains the limits of all monotone increasing sequences A1 ⊂ A2 ⊂ · · · of its sets.

Exercise 3 (Countable intersections and limits of monotone decreasing sequences of sets): Show that a collection of sets E that is closed under finite intersections is also closed under countable intersections if and only if it contains the limits of all monotone decreasing sequences A1 ⊃ A2 ⊃ · · · of its sets.

Exercise 4 (Intersection of σ-algebras):
a) Let E1 and E2 be σ-algebras on E. Show that E1 ∩ E2 is a valid σ-algebra on E.

b) Let (En)n≥1 be a (countable) collection of σ-algebras on E. Show that their intersection is a valid σ-algebra on E.

2.3 Examples of σ-algebras

Trivial σ-algebra. Every σ-algebra on E contains at least ∅ and E. E = {∅,E} is called the trivial σ-algebra.

Discrete σ-algebra. The largest σ-algebra is the collection of all subsets of E, denoted 2^E and called the discrete σ-algebra. We'll see later that if E is countable then this is the natural σ-algebra with which to define a probability space. If E is uncountable, 2^E won't work (it's too big).

3This is what Rohlin had in mind.

Generated σ-algebra. Fix an arbitrary collection C of subsets of E, and consider all the σ-algebras that contain C. At the very least, 2^E contains C. Take the intersection of all of those σ-algebras and call the result the σ-algebra generated by C, denoted σC or σ(C). Observe (convince yourself) that σC is the smallest σ-algebra that contains C. Observe (convince yourself) that every member of σC can be obtained by at most a countable number of set operations (complement, union, intersection) applied to at most a countable number of sets from C. σ-algebras generated by random variables are used extensively in probability. The following proposition collects some basic relationships between collections and their generated σ-algebras.

Proposition 2.1. Let C and D be two collections of subsets of E. Then:
(a) If C ⊂ D then σC ⊂ σD.
(b) If C ⊂ σD then σC ⊂ σD.
(c) If C ⊂ σD and D ⊂ σC, then σC = σD.
(d) If C ⊂ D ⊂ σC, then σC = σD.

Proof. (a) Since C ⊂ D, C ⊂ σD. By definition, σC is the smallest σ-algebra that contains C, so it must be that σC ⊂ σD.
(b) The argument is the same as in part (a).
(c) Applying part (b) to C ⊂ σD implies that σC ⊂ σD. Applying part (b) to D ⊂ σC implies that σD ⊂ σC. Together these imply that σC = σD.
(d) Part (a) applied to C ⊂ D implies that σC ⊂ σD. Part (b) applied to D ⊂ σC implies that σD ⊂ σC. Together these imply that σC = σD.

We will see an example of how to use these relationships in Example 2.2.

A bit of topology. To properly define the next example, we need the concept of a topological space. My reference for this section is Aliprantis and Border [AB06, Ch. 2].4

A topology τS on a set S is a collection of subsets of S satisfying:

1. ∅,S ∈ τS.

2. τS is closed under finite intersections.

3. τS is closed under arbitrary (finite, countable, uncountable) unions.

4Available as a PDF through the UBC Library at SpringerLink. (If you’re not on the UBC network, you might need to go through the Library’s website and sign into CWL.)

A non-empty set S equipped with a topology τS is called a topological space, denoted (S, τS). Members of τS are called the open sets of S. (Convince yourself that your notion of an open set, likely tied to a mental model of an interval of R, is consistent with the definition of the topology τS.) The complement of an open set is a closed set. Most (if not all) spaces we encounter will be topological spaces. In fact, most (if not all) spaces we encounter will be metric spaces. A metric on a space S is a function d : S × S → R that is non-negative, symmetric, and satisfies:
1. d(x, x) = 0 for all x ∈ S.
2. d(x, y) = 0 implies x = y for all x, y ∈ S.
3. d(x, z) ≤ d(x, y) + d(y, z) for all x, y, z ∈ S. (This is the triangle inequality.)
A metric space is a pair (S, d), where d is a metric on S.

Given a metric d, define the open ε-ball around x, B_ε(x) = {y : d(x, y) < ε}. A set A ⊂ S is open in the metric topology generated by d if for each point x ∈ A there is an ε > 0 such that B_ε(x) ⊂ A. The metric d(x, y) = |x − y| defines a topology on R, and every open interval (a, b) is an open set in this topology. Every open set in R can be expressed as the (countable) union of disjoint open intervals [see, e.g., SS05, Thm. 1.3]. This is all the topology we need (and maybe a bit more). The main ideas are:
• A topological space is defined in terms of its open sets.
• The metric of a metric space can be used to define a topology based on open balls.
• For our purposes, the open intervals of R are a good mental model of a topology generated by a metric.

Borel σ-algebra. If (E, τE) is a topological space then the σ-algebra generated by the collection of all open subsets of E (i.e., τE) is called the Borel σ-algebra, often denoted B(E) or BE. Its elements are called the Borel sets. In most applications (and everything we encounter in this class), we’re working in topological spaces (and usually metric spaces). The Borel σ-algebra is ubiquitous. At a high level, one of the main reasons for this ubiquity is that, in the words of Aliprantis and Border [AB06, p. 21], “topology is the abstract study of convergence and approximation”. Though we won’t study it in this level of detail, convergence and approximation are crucial to measure theory—and to the probability theory built on top of it. The open sets are the basic building blocks of a topological space, so it is natural to construct our system of probabilistic queries (i.e., the σ-algebra) from them.

Example 2.2. Show that B(R) is generated by the collection of all open intervals.

Let CI denote the collection of all open intervals of R, and σCI the σ-algebra generated by it. Likewise, let O and σO = B(R) be the collection of all open sets and its generated σ-algebra (i.e., the Borel σ-algebra).

Each open interval is an open set, so CI ⊂ O, which by Proposition 2.1 (part (a)) implies that σCI ⊂ σO.

Conversely, every open set is the union of at most a countable number of open intervals, i.e., A ∈ σCI for all open sets A ∈ O. Therefore, O ⊂ σCI . Proposition 2.1 (part (b)) implies σO ⊂ σCI .

Hence, σCI = σO = B(R). This generating collection is not unique! B(R) is also generated by any of the following collections (see Çınlar, Exercise I.1.13):
• The collection of all intervals of the form (−∞, x].
• The collection of all intervals of the form (x, y].
• The collection of all intervals of the form [x, y].
• The collection of all intervals of the form (x, ∞).
In each case (and the one above), x and y can be limited to the rational numbers Q.
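
To see concretely how one of these alternative collections does the job, here is one direction of the reduction, sketched (the reverse inclusion mirrors Example 2.2): from the collection of intervals (−∞, x], we can build the semi-closed and open intervals via

(x, y] = (−∞, y] \ (−∞, x] , (x, y) = ∪_{n≥1} (x, y − 1/n] ,

so the σ-algebra generated by {(−∞, x] : x ∈ R} contains all open intervals, and hence all of B(R).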

Example 2.3. Every interval of R is a Borel set. Based on the previous example, the open sets generate B(R), so clearly they are Borel sets.

The semi-closed intervals are, too: (x, y] = ∩_{n≥1}(x, y + 1/n). The singleton sets are Borel sets: building on the fact that the open and semi-closed intervals are Borel sets, {y} = (x, y] \ (x, y). Alternatively, {x} = ∩_{n≥1}(x − 1/n, x + 1/n). The closed intervals: [x, y] = {x} ∪ (x, y].

Observe that there are many other ways we could prove that every interval of R is a Borel set.

2.4 Measurable spaces

A measurable space is a pair (E, E), where E is a set and E is a σ-algebra on E. The elements of E are called measurable sets of E. When E is a topological space and E = B(E), then the measurable sets are called the Borel sets of E.

Exercise 5 (Trace spaces, Çınlar Ex. I.1.15): Let (E, E) be a measurable space. Fix D ⊂ E and let

D = E ∩ D = {A ∩ D : A ∈ E} .

Show that D is a σ-algebra on D. It is called the trace of E on D, and (D, D) is called the trace of (E, E) on D.

2.4.1 Products of measurable spaces

Let (E, E) and (F, F) be two measurable spaces. For A ⊂ E and B ⊂ F, the product5 of A and B is

A × B = {(x, y): x ∈ A, y ∈ B} . (2.5)

If A ∈ E and B ∈ F, then A × B is called a measurable rectangle. The product σ-algebra on E × F, denoted E ⊗ F, is the σ-algebra generated by the collection of all measurable rectangles. The measurable space (E × F, E ⊗ F) is the product of (E, E) and (F, F), also denoted (E, E) × (F, F). Product spaces play a key role in probability theory for statistical models, particularly in conditioning and conditional independence.

5A × B is also called the Cartesian product of A and B.

3 Measurable functions

Reading: Çınlar, I.2. Supplemental:

Overview. With Section 2 in hand, most courses in probability theory move on to defining probability measures and/or random variables. We will put off probability for another two sections and remain in the realm of deterministic measure theory. When we get to probability theory, we'll see that for the most part it can be interpreted as a special case of measure theory. For example, random variables are measurable functions that take on special meaning in probability theory. However, the language of probability theory is still rooted in pre-measure-theoretic ideas, and somehow "feels" different. Çınlar [Çin11, p. 49] says, "... our attitude and emotional response toward one [measure theory] is entirely different from those toward the other [probability theory]. On a measure space everything is deterministic and certain, on a probability space we face randomness and uncertainty." I tend to agree with this assessment. With this in mind, we'll continue on with measure theory—without the emotional baggage of randomness and uncertainty.

3.1 Measurability of functions

Let E and F be sets. A function or mapping6 f from E into F is a rule that assigns an element f(x) ∈ F to each x ∈ E. This is typically written as f : E → F. For more specificity, the function mapping R into R₊ given by x ↦ x² + 5 is also sometimes written as f : x ↦ x² + 5. You may also see f : E → F, x ↦ x² + 5. For any mapping f : E → F and subset B ⊂ F, the inverse image of B under f is

f⁻¹B = {x ∈ E : f(x) ∈ B} . (3.1)

Picture on board. The inverse image is essentially a set operation that “plays well” with other set operations (i.e., commutes, distributes). Lemma 3.1. Let f be a mapping from E into F . Then,

f⁻¹∅ = ∅ , f⁻¹F = E , f⁻¹(B \ C) = (f⁻¹B) \ (f⁻¹C) ,
f⁻¹(∪_i Bi) = ∪_i f⁻¹Bi , f⁻¹(∩_i Bi) = ∩_i f⁻¹Bi , (3.2)

for all subsets B and C of F and arbitrary collections {Bi : i ∈ I} of subsets of F .
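
Lemma 3.1 is easy to sanity-check computationally. The following is a small illustration of mine (finite sets, with f(x) = x²); the assertions mirror three of the identities in (3.2).

E = {-2, -1, 0, 1, 2}
f = lambda x: x * x                      # a mapping from E into F = {0, 1, 4}

def inv(B):
    """Inverse image: {x in E : f(x) in B}."""
    return {x for x in E if f(x) in B}

B, C = {0, 1}, {1, 4}
assert inv(B | C) == inv(B) | inv(C)     # inverse image commutes with unions
assert inv(B & C) == inv(B) & inv(C)     # ... with intersections
assert inv(B - C) == inv(B) - inv(C)     # ... with differences
print(inv(B), inv(C))                    # {-1, 0, 1} and {-2, -1, 1, 2}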

Exercise 6 (Proof of Lemma 3.1): Prove Lemma 3.1.

6 Be aware that different sub-fields of mathematics may or may not treat these as synonymous (Çınlar does); see, for example, this Math Stack Exchange thread. I will treat them as synonymous.

Measurable functions. Let (E, E) and (F, F) be measurable spaces. A mapping f : E → F is measurable relative to E and F if f⁻¹B ∈ E for all B ∈ F. For short, we also say that f is E/F-measurable. Often, F will be fixed or obvious (e.g., F is a topological space and F = B(F)), in which case we say that f is measurable with respect to E, or E-measurable. As a further simplification, when both E and F are fixed, we may just say that f is a measurable function. (I will try to avoid this unless there is no chance for confusion, but much of the literature is written this way.) The following proposition is very useful: if we can establish measurability with respect to a generating collection F0 such that F = σF0, then we have also established F-measurability.

Proposition 3.2. Let F0 ⊂ F be such that σF0 = F. Then a mapping f : E → F is measurable relative to E and F if and only if f⁻¹B ∈ E for every B ∈ F0.

Exercise 7 (Proof of Proposition 3.2): Prove Proposition 3.2. Hint: Use Lemma 3.1.

σ-algebra generated by a function. Let E be a set and (F, F) a measurable space. For f : E → F , define

f⁻¹F = {f⁻¹B : B ∈ F} . (3.3)

This is a σ-algebra on E, called the σ-algebra generated by f. f⁻¹F is the smallest σ-algebra on E such that f is measurable relative to it and F.

Exercise 8 (σ-algebra generated by a function): Show that f⁻¹F as in Eq. (3.3) is a σ-algebra on E.

Another way to define measurability. If (E, E) is a measurable space, then f is measurable relative to E and F if and only if f⁻¹F ⊂ E. This is another way of defining measurability.

Why bother? If you're confused about why we're bothering with all of this, it might be helpful to think of measurability this way: if we're given a set of values B ⊂ F and a mapping f : E → F, does E contain all of the possible values x ∈ E that could have been mapped to B via f? This typically (though not always) comes down to whether E is large (or "fine") enough. For example, if f : E → R is a non-constant mapping and E is the trivial σ-algebra {∅, E}, then f is non-measurable (relative to E and B(R)). Cases that arise in practice are not typically so extreme, but it is common to ask, for two functions f : E → F and g : E → G, whether g is measurable relative to f⁻¹F (and vice versa). In terms of probability and statistics, this is basically equivalent to whether one random variable is measurable relative to another.

3.2 Compositions of functions

Let (E, E), (F, F), and (G, G) be measurable spaces. Let f : E → F and g : F → G. The composition of f and g is the mapping g ◦ f : E → G defined by

(g ◦ f)(x) = g(f(x)) , x ∈ E. (3.4)

In statistics, machine learning, and related fields,7 we often compose functions and typically do so without any second thoughts about measurability. That’s because measurable functions of measurable functions are measurable. Proposition 3.3. If f is E/F-measurable and g is F/G-measurable, then g ◦ f is E/G-measurable.

Exercise 9 (Measurable functions of measurable functions are measurable): Prove Proposition 3.3.

3.3 Numerical functions

Some notation for variations on the real line:
• R = (−∞, +∞)
• R̄ = [−∞, +∞]
• R₊ = [0, +∞)
• R̄₊ = [0, +∞]

For a measurable space (E, E), a numerical function on E is a mapping from E into R̄, or some subset thereof. It is positive if all of its values are in R̄₊. (Note that some authors use non-negative for this.)

A numerical function is E-measurable if it is E/B(R̄)-measurable. If E is topological and E = B(E), then E-measurable functions are called Borel functions.

Recall from Example 2.2 that various different collections of intervals generate B(R). Recall also from Proposition 3.2 that we can establish measurability by checking measurability on a generating collection. Then there is a version of the following proposition for each type of interval collection generating B(R).

Proposition 3.4. A mapping f : E → R̄ is E-measurable if and only if, for every r ∈ R, f⁻¹[−∞, r] ∈ E.

Positive and negative parts. For a and b in R, we write a ∨ b to denote max{a, b} and a ∧ b to denote min{a, b}. When applied to numerical functions, the maximum is taken pointwise: f ∨ g is a function whose value at x is f(x) ∨ g(x). For a measurable space (E, E) and a function f : E → R̄, the positive part and negative part of f are

f⁺ = f ∨ 0 , and f⁻ = −(f ∧ 0) .

7 Artificial neural networks, i.e., deep learning, are iterated compositions of functions.

It should be (intuitively) clear that f is E-measurable if and only if both f⁺ and f⁻ are. This fact is important enough that it is stated as a proposition in Çınlar [Çin11, Prop. I.2.9] because we can obtain many results for arbitrary f from the corresponding results for positive functions.

3.4 Measurability of limits of sequences of functions

Many of the proof techniques we will use in upcoming sections rely on analyzing sequences of numerical functions (fn)n≥1, or (fn) for short, and their limits:

inf_n fn , sup_n fn , lim inf_n fn , lim sup_n fn . (3.5)

Each of these is a function defined on E pointwise: fix x ∈ E and construct the sequence of numbers (f1(x), f2(x), . . . ). Sequences of real numbers and taking the inf, sup, lim inf, or lim sup should be familiar from calculus and/or analysis [see Abb15, Ch. 6, to refresh your memory]. Repeating this operation for each x ∈ E yields the result. For example, f = inf_n fn defines a function whose value at x is f(x) = inf_n(fn(x)), i.e., the infimum of (f1(x), f2(x), . . . ).

In general, lim inf is dominated by lim sup. If they are equal, lim infn fn = lim supn fn = f, then the sequence (fn) has the pointwise limit f, denoted limn fn = f or, more commonly, fn → f.

Monotone sequences of functions. If (fn) is increasing, that is, f1 ≤ f2 ≤ . . . (for all x ∈ E), then lim_n fn = sup_n fn. The shorthand for "(fn) is increasing and has limit f" is fn ↗ f. Likewise, fn ↘ f means that (fn) is decreasing and has limit f. Importantly, the class of measurable (numerical) functions is closed under limits.

Theorem 3.5. Let (fn) be a sequence of E-measurable numerical functions. Then each of the four functions in (3.5) is E-measurable. Moreover, if it exists, limn fn is E-measurable.

Proof. We start with f = supn fn. For every x ∈ E and r ∈ R, observe that f(x) ≤ r if and only if fn(x) ≤ r for all n. Thus, for each r ∈ R,

f⁻¹[−∞, r] = {x : f(x) ≤ r} = ∩_n {x : fn(x) ≤ r} = ∩_n fn⁻¹[−∞, r] . (3.6)

Each term in the intersection in the right-most expression is in E because fn is E-measurable. Furthermore, E is closed under countable intersections, so the entire right-most expression is in E. The intervals [−∞, r] generate B(R̄), so by Proposition 3.4, f = sup_n fn is E-measurable.

Measurability of infn fn follows from the identity infn fn = − supn(−fn). The composition of two measurable functions is again measurable, so

lim inf_n fn = sup_m inf_{n≥m} fn , and lim sup_n fn = inf_m sup_{n≥m} fn

are both E-measurable.

As we will see in the next two sections, closure under limits gives us a powerful tool for proving certain fundamental properties of measurable functions.

3.5 Simple functions and approximating measurable functions

Let A ⊂ E. The indicator function is

1_A(x) = { 1 if x ∈ A ; 0 if x ∉ A } . (3.7)

1A is E-measurable if and only if A ∈ E.

Simple functions. A function f on E is said to be simple if it has the form

f = Σ_{i=1}^{n} a_i 1_{A_i} , (3.8)

for some n ∈ N, some real numbers (a_i)_{i=1}^{n}, and sets (A_i)_{i=1}^{n} in E. For any simple function f, there exists some m ∈ N, distinct real numbers (b_i)_{i=1}^{m}, and a measurable partition {B1, . . . , Bm} of E such that f = Σ_{i=1}^{m} b_i 1_{B_i}. (Convince yourself that this is true.) This is called the canonical form of the simple function f.
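
To make the canonical form concrete, here is a small sketch of mine (the sets A1, A2 and the coefficients are arbitrary choices): group the points of a finite E by the value f takes there; the distinct values and their preimages give the bi and the measurable partition {Bi}.

from collections import defaultdict

E = {0, 1, 2, 3, 4}
A1, A2 = {0, 1, 2}, {2, 3}                    # f = 2*1_{A1} + 3*1_{A2}
f = lambda x: 2 * (x in A1) + 3 * (x in A2)

blocks = defaultdict(set)                     # value b -> block B = preimage of {b}
for x in E:
    blocks[f(x)].add(x)

for b, B in sorted(blocks.items()):
    print(f"b = {b} on B = {B}")
# The blocks partition E and the values are distinct: the canonical form.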

Measurability. Proposition 3.4 applied to the canonical form implies that every simple function is E-measurable. Conversely, if a function f is E-measurable and takes on finitely many values in R, then f is simple.

Exercise 10 (Compositions under which the class of simple functions is closed): For f, g both simple functions, show that the following are also simple:
1. f + g
2. f − g
3. fg
4. f/g (with the caveat that g is nowhere equal to zero)
5. f ∨ g
6. f ∧ g

Dyadic functions. Recall that dyadic intervals are bounded intervals of R whose endpoints are j/2^n and (j+1)/2^n, where j, n ∈ Z. Observe that any fixed finite interval (a, b) contains increasingly more dyadic intervals as n increases. In particular, [0, n], n ∈ N, contains n2^n dyadic intervals of equal size. Define the dyadic function dn : R → R as the simple function

dn(r) = Σ_{k=1}^{n2^n} ((k−1)/2^n) 1_{[(k−1)/2^n, k/2^n)}(r) + n 1_{[n,∞]}(r) . (3.9)

Draw on board for n = 1, 2.

We can think of the functions (dn) as a sequence of finer and finer (better and better) discrete approximations of f(x) = x.

Exercise 11 (Plotting the dyadic functions): Use your scientific computing language/environment of choice to plot the dyadic functions for n ∈ {1, 2, 3, 10}.
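
One possible solution sketch, assuming numpy and matplotlib are available (for r ≥ 0, the floor expression below is an equivalent rewriting of (3.9)):

import numpy as np
import matplotlib.pyplot as plt

def d_n(r, n):
    """Dyadic function of (3.9): floor(2^n r)/2^n on [0, n), and n on [n, inf)."""
    r = np.asarray(r, dtype=float)
    return np.where(r >= n, float(n), np.floor((2.0 ** n) * r) / (2.0 ** n))

r = np.linspace(0, 3, 2000)
for n in [1, 2, 3, 10]:
    plt.step(r, d_n(r, n), where="post", label=f"n = {n}")
plt.plot(r, r, "k--", lw=0.8, label="f(r) = r")   # the limiting function
plt.legend(); plt.xlabel("r"); plt.ylabel("d_n(r)")
plt.show()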

Measurability. Recall from Example 2.3 that the semi-closed and closed intervals are Borel sets; because dn is a simple function, it is measurable for each n. We can prove the following with pictures.

Lemma 3.6. For each n ∈ N, dn is an increasing right-continuous simple function on R̄₊, and dn(r) increases to r (that is, dn(r) ↗ r) for each r ∈ R̄₊ as n → ∞.

Approximating measurable functions. The dyadic functions can be used to approximate any positive function f to arbitrary precision: define fn = dn ◦ f, and observe that since dn(r) increases to r as n → ∞, dn ◦ f ↗ f. If f is E-measurable, then so is dn ◦ f for each n ∈ N, by Proposition 3.3. This is essentially all that is needed to prove the following important result, which provides a basic tool for proving certain properties of measurable functions by reducing the problem to proving the property on simple functions and taking the limit.

Theorem 3.7. A positive function on E is E-measurable if and only if it is the limit of an increasing sequence of positive simple functions.

Proof technique. The proof of this theorem uses a technique that can be/is used to prove measurability of a positive function f (or a class of functions) in many situations:
1. Prove that all simple functions are measurable with respect to the relevant σ-algebras.
2. Approximate f by simple functions fn ↗ f and take the limit n → ∞.
3. Argue that the limit is measurable.
This technique can be extended to arbitrary functions by separating into positive and negative parts: f = f⁺ − f⁻. This technique will be useful on problem 4 of the first assignment.

Proof. If a positive function f is the limit of an increasing sequence of positive simple functions, then by Theorem 3.5, it is also measurable.

Conversely, let f : E → R₊ be E-measurable. We need to show that there is a sequence (fn) of positive simple functions increasing to f. Let dn be as in (3.9) and put fn = dn ◦ f. For each n, fn is E-measurable, since it is the composition of two measurable functions (Proposition 3.3). Moreover, fn is positive and takes only finitely many values: it is positive and simple. Since dn(r) ↗ r for each r ∈ R₊, we have that fn(x) = dn(f(x)) ↗ f(x) for each x ∈ E.

3.6 Standard and common measurable spaces

Isomorphisms. Let (E, E) and (F, F) be measurable spaces. Let f : E → F be a bijection and denote by f̂ the functional inverse: f̂(y) = x if and only if f(x) = y. If f is E/F-measurable and f̂ is F/E-measurable, then f is an isomorphism of (E, E) and (F, F). (E, E) and (F, F) are said to be isomorphic if there exists an isomorphism between them.

Standard spaces. A measurable space (E, E) is standard if it is isomorphic to (F, B(F )) for some Borel subset F of R. (Various other terms are used by other authors for this property. “Borel space” and “standard Borel space” are two of the most common.) One of the biggest advantages of working with standard Borel spaces is that it’s typically enough to show that some property holds on [0, 1] (or R, or any subset thereof—whatever is easiest for what you’re trying to prove).

What are some standard measurable spaces? Most of the spaces we encounter in statistics and machine learning are standard Borel spaces. Some examples:

• R, R^d, R^∞ together with their Borel σ-algebras.
• If E is a complete separable metric space (c.s.m.s.),8 then (E, B(E)) is standard.
• Completeness and separability of a set E depend on the metric. If E is a topological space such that there is a metric on E which defines a topology of E and which makes E a c.s.m.s., then E is a Polish space9 and (E, B(E)) is standard.
• Another example, for which we won't go into detail: for a separable Banach space E (of which a separable Hilbert space is a special case), (E, B(E)) is standard.

The three standard measurable spaces. A deep result from measure theory says that every standard measurable space (i.e., standard Borel space) is isomorphic to one of:10
• {1, 2, . . . , n} and its discrete σ-algebra;
• {1, 2, . . . } and its discrete σ-algebra;
• [0, 1] and its Borel σ-algebra.

Notation. Çınlar [Çin11] uses E to denote both the σ-algebra and the class of functions that are measurable relative to it. To avoid confusion, I will use E^fn to denote the latter. We haven't encountered them yet, but we will also need the following notation:

8 Complete: Every Cauchy sequence (the elements get arbitrarily close to each other) in E has a limit in E (i.e., parts of the space don't go missing). Separable: E has a countable dense subset (think of Q in R—we can approximate anything in E arbitrarily well using only countable approximations). Metric space: A set with a metric on the set (i.e., a notion of distance), (E, d).
9 From Wikipedia: "Polish spaces are so named because they were first extensively studied by Polish topologists and logicians—Sierpiński, Kuratowski, Tarski and others".
10 This result is due to Kuratowski, Sur une généralisation de la notion d'homéomorphie, Fund. Math. 22 (1934), 206–220. I haven't been able to find a good textbook reference.

21 • M denotes an arbitrary collection of numerical functions;

• M+ ⊂ M denotes the positive functions in M;

• Mb ⊂ M is the collection of bounded functions in M.
For example, E^fn₊ is the class of positive E-measurable numerical functions.

4 Measures, probability measures, and probability spaces

Reading: Çınlar, I.3. Supplemental:

Overview. Sections 2 and 3 followed Çınlar [Çin11] pretty closely. This section will continue to do so, though we will introduce a few ideas from section II.1.

4.1 Measures and measure spaces

Let (E, E) be a measurable space. A measure on (E, E) is a mapping µ : E → R̄₊ with the following properties:
a) Zero on the empty set: µ(∅) = 0.
b) Countable additivity: µ(∪n An) = Σn µ(An) for every disjointed sequence (An) in E.
The number µ(A) ∈ R̄₊ is called the measure or mass of A. It is also written as µA. A measure space is a triplet (E, E, µ) where (E, E) is a measurable space and µ is a measure on it.

4.1.1 Probability measures and probability spaces

A probability measure P on a measurable space (Ω, H) is a measure such that P(Ω) = 1. A probability space is a triple (Ω, H, P), where (Ω, H) is a measurable space and P is a probability measure on it. Nothing more, nothing less. The fact that the total mass of Ω is one lets us say a few more things, but mathematically there is not a major difference. However, an entirely different vocabulary—some might say an entirely different conceptual framework—is used to describe probability spaces. Recall what Çınlar says: "our attitude and emotional response toward one is entirely different from those toward the other". The space Ω is called the sample space and its elements ω are called outcomes. The σ-algebra H is called the grand history (maybe a bit less common these days) and its elements are called events. Ω is often described as containing all possible states of the world/universe, each corresponding to an element ω, and the random variables we work with are functions whose value changes depending on the state of the world. I never found this analogy helpful, but some people do.

4.2 Examples

The following examples are measures that we will encounter frequently. For each of the following, let (E, E) be a measurable space on which we will define the measure.

Example 4.1. Dirac measure. Let x be a fixed point of E. For each A ∈ E, define

δ_x(A) = { 1 if x ∈ A ; 0 if x ∉ A } . (4.1)

δ_x is a measure on (E, E) called the Dirac measure.

Example 4.2. Counting, discrete, and empirical measures. Let D be a fixed subset of E. For each A ∈ E, let ν_D(A) be the number of points in A ∩ D (this might be infinite). ν_D is a measure on (E, E) called a counting measure. Often, D is taken to be countable, in which case

ν_D(A) = Σ_{x∈D} δ_x(A) , A ∈ E . (4.2)

For countable D, let m(x) be a positive number for each x ∈ D. Define the discrete measure

ν̄_D(A) = Σ_{x∈D} m(x) δ_x(A) , A ∈ E . (4.3)

It's helpful to visualize a discrete measure as a mass m(x) attached to a particular point x (sometimes called an atom of the measure); ν̄_D(A) is the sum of the mass of the atoms attached to points in A. If (E, E) is a discrete measurable space, then every measure on it has this form.

The normalized version of the counting measure, with m(x) = 1/|D| for each x ∈ D, is the empirical measure

ν̂_D(A) = (1/|D|) Σ_{x∈D} δ_x(A) , A ∈ E , (4.4)

which gets its name from its use with, e.g., a sequence of observations D = {x1, x2, . . . , xn}.

Example 4.3. Lebesgue measure. A measure λ on (R, B(R)) is called the Lebesgue measure on R if λ(A) is the length of A for every interval A. We can see that this definition makes it impossible to display λ(A) for every Borel set A, but we can use it to integrate (which, as Çınlar says, is the main thing measures are for). For R², the Lebesgue measure measures "area"; on R³, "volume". The notation λ for Lebesgue measure is fairly standard, though Çınlar uses Leb. I will use λ.

4.3 Some properties of measures

Most of the key properties of measures and probability measures are summarized in the following proposition.

Proposition 4.1. Measures and probability measures have the following properties:

Property                            Measure space (E, E, µ)                      Probability space (Ω, H, P)
Norming                             µ(∅) = 0                                     P(∅) = 0 and P(Ω) = 1
Countable & finite additivity       (Hn) disjointed ⇒ µ(∪n Hn) = Σn µ(Hn)        P(∪n Hn) = Σn P(Hn)
Monotonicity                        H ⊂ K ⇒ µ(H) ≤ µ(K)                          P(H) ≤ P(K)
Sequential continuity               Hn ↗ H ⇒ µ(Hn) ↗ µ(H)                        P(Hn) ↗ P(H); Hn ↘ H ⇒ P(Hn) ↘ P(H)
Boole's inequality (union bound)    µ(∪n Hn) ≤ Σn µ(Hn)                          P(∪n Hn) ≤ Σn P(Hn)

Proof. See Çınlar [Çin11, Prop. I.3.6 and remarks on p. 50].

Exercise 12 (Restrictions and traces, Çınlar Ex. I.3.11): Let (E, E) be a measurable space and µ a measure on it. Fix D ∈ E.
a) Define ν(A) = µ(A ∩ D), A ∈ E. Show that ν is a measure on (E, E). It is called the trace of µ on D.
b) Let D be the trace of E on D (recall Exercise 5). Define ν(A) = µ(A) for A ∈ D. Show that ν is a measure on (D, D). It is called the restriction of µ to D.

Exercise 13 (Extensions, Çınlar Ex. I.3.12): Let (E, E) be a measurable space, let D ∈ E, and let (D, D) be the trace of (E, E) on D. Let µ be a measure on (D, D) and define ν by

ν(A) = µ(A ∩ D) , A ∈ E . (4.5)

Show that ν is a measure on (E, E). This device allows us to regard a "measure on D" as a "measure on E".

Arithmetic and uniqueness. The set of measures is closed under (positive) linear operations:
• If c > 0 and µ is a measure on (E, E), then so is cµ.
• If µ and ν are measures on (E, E), then so is µ + ν.
• If µ1, µ2, . . . are measures on (E, E), then so is Σn µn.
You can prove these by checking the definition of a measure.

Less straightforward is the following uniqueness of measures result. We won't prove it, but it's a useful result. See Çınlar [Çin11, Prop. 3.7] if you're interested in the proof.

Proposition 4.2. Let (E, E) be a measurable space. Let µ and ν be measures on it with µ(E) = ν(E) < ∞. Let C be a collection of sets that is closed under finite intersections, and such that σC = E. If µ and ν agree on all elements of C, then µ and ν are identical.

You may see the term p-system or π-system to describe a collection of sets that is closed under finite intersections. We will not concern ourselves with them, but it's good to know the term.

Proposition 4.2 has an important special case when applied to R.

Corollary 4.3. Let µ and ν be probability measures on (R̄, B(R̄)). Then µ = ν if and only if µ[−∞, r] = ν[−∞, r] for all r ∈ R̄.

Atoms, purely atomic measures, diffuse measures. Suppose that the singleton sets {x} are measurable (i.e., belong to E) for all x ∈ E. This is true for all standard (Borel) measurable spaces. Let µ be a measure on (E, E). A point x is called an atom of µ if µ({x}) > 0. If µ has no atoms, it is said to be diffuse; it is purely atomic if the set D of atoms is countable and µ(E \ D) = 0. The Lebesgue measures are diffuse, and discrete measures are purely atomic. Note that a measure may have a diffuse part and an atomic part: under quite general conditions, µ = λ + ν, where λ is a diffuse measure and ν is purely atomic [see Çin11, Prop. 3.9].

4.4 Negligible/null sets and completeness

Let (E, E, µ) be a measure space. A measurable set B is said to be negligible if µ(B) = 0. On a probability space (Ω, H, P), a measurable set B with P(B) = 0 is called a null set. A measure space is said to be complete if every subset of a negligible set is measurable. Likewise for probability spaces and null sets. If a measure space (E, E, µ) is not complete, it can be enlarged/extended to its completion (E, Ē, µ̄), in which Ē contains all subsets of negligible sets and µ is extended onto Ē. (We'll take this as fact because it's really just a technical result that we won't revisit; see Çınlar [Çin11, Prop. I.3.10].)

4.5 Almost everywhere/surely

If a claim holds for all x ∈ E except for x in a negligible set, then we say that it holds for almost every x, or almost everywhere. If we need to indicate the measure µ with respect to which this holds, we say µ-almost everywhere. These are often abbreviated as a.e. and µ-a.e. The same definitions are true for probability measures/spaces, except we say almost surely (or P-almost surely or almost surely P), abbreviated a.s. or P-a.s.

4.6 Image measures

Let (E, E) and (F, F) be measurable spaces. Let ν be a measure on (E, E) and f : E → F an E/F-measurable function. Define the mapping µ : F → R̄₊ as

µ(B) = (ν ◦ f⁻¹)(B) = ν(f⁻¹B) , B ∈ F . (4.6)

This is well-defined because of the measurability of f, and is called the image, or image measure, of ν under f. Other notation for it includes, variously, f ◦ ν, f(ν), ν ◦ f⁻¹, and νf. Another common name for the image measure is the pushforward measure.

Exercise 14 (Image measure): Show that µ = ν ◦ f⁻¹ is a measure on (F, F).
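
As a concrete illustration of the object in Exercise 14 (a sketch under my own choice of ν and f, not anything prescribed by the text): push an empirical measure on E forward through f.

E = list(range(-3, 4))               # atoms of nu, each with mass 1/7
f = lambda x: x * x                  # a measurable map from E into F

def nu(A):                           # empirical measure on E
    return sum(1 for x in E if x in A) / len(E)

def mu(B):                           # image measure: mu(B) = nu(inverse image of B)
    return nu({x for x in E if f(x) in B})

print(mu({0, 1}))                    # nu({-1, 0, 1}) = 3/7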

5 Probability spaces

Reading: Çınlar [Çin11], II.1, II.4. Supplemental: Gut [Gut05], Section 2 of Chapter 2, i.e., 2.2. Skip subsection 2.2.2.

From here on, we will be firmly in the realm of probability, with all of the randomness and uncertainty that comes with it. Keep in mind that measure theory is always in the background, even if we don't appeal directly to measure theoretic results (though we often will). It's worth devoting a bit of time to translate from measure theory to probability theory; to make a mental model of how the abstract mathematics of the earlier sections becomes real when we apply probability to solve problems in statistics and other fields.

5.1 Probability spaces as models of reality

The inherent uncertainty in the world makes probability the natural mathematical language with which to describe it. In particular, we can use probability to construct models of (parts of) reality. A traditional description would say that probability is concerned with random experiments: a phenomenon whose outcome is not predictable with certainty. The basic requirement for using probability to analyze uncertain phenomena is that we can describe (mathematically) the set of all possible outcomes of the phenomena, Ω. Thus far, we've built up the mathematics to properly define a probability model as a probability space, (Ω, H, P). Many courses and books on probability theory begin by defining probability spaces/models axiomatically:
• Ω is a set called the sample space. The elements ω ∈ Ω are called outcomes.
• A special collection of subsets of Ω, a σ-algebra, denoted H.11 The sets in H are called events.12
• A probability measure, P, satisfying:
 – P(Ω) = 1 (this implies that P(∅) = 0).
 – For any countable disjoint collection (An) in H, P(∪n An) = Σn P(An) (countable additivity).
The axiomatic definition is due to Kolmogorov, who laid the foundations of measure theoretic probability in the 1930s. In particular, he showed that the measure theoretic construction based on events rather than individual outcomes solved the foundational issue of defining probability models with uncountable sample spaces. For good reason, the axioms above are often called Kolmogorov's axioms of probability.

A canonical example of a probability model is the following coin-tossing experiment: I flip two identical coins; what is the probability of one heads and one tails? We construct a probability model as follows:

11 Çınlar [Çin11] says that H is sometimes called the grand history; I haven't come across the term in any other texts, so I assume it's an older/outdated term.
12 Other common notation for the σ-algebra on Ω is A or F.

• Ω1 = {(H,H), (H,T), (T,H), (T,T)}.
• H is the collection of all possible subsets of Ω1, 2^{Ω1}. For example, the event "at least one tails" is Etails = {(H,T), (T,H), (T,T)}.
• P is up to us, a modeling choice. In this case, many would specify P as uniform over the elements of Ω1.

Using this probability model, it's straightforward to calculate P(Etails) = P{(H,T), (T,H), (T,T)} = 3/4.
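
The model is small enough to enumerate directly; a quick illustrative check in code:

from itertools import product

omega1 = list(product("HT", repeat=2))        # the sample space Omega_1
P = {w: 1 / len(omega1) for w in omega1}      # the uniform probability measure
E_tails = [w for w in omega1 if "T" in w]     # the event "at least one tails"
print(sum(P[w] for w in E_tails))             # 0.75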

Note that because Ω1 is discrete, we don't really need any of the measure theory developed so far. Let's change that. Suppose that I had (ignoring any issues of finite precision, i.e., all numerical values are represented with infinite precision):
• a spinning wheel13 that, upon stopping, reports the fraction p ∈ (0, 1) of 360° that it rotated (modulo 360°);
• a machine that could make biased coins such that the probability of coming up heads is equal to a user-specified p ∈ (0, 1).
My experiment could take on a number of forms. Here's one: spin the wheel; input the rotation fraction p into the coin machine; flip the resulting coin two times. Now Ω2 = (0, 1) × Ω1 and H = B((0, 1)) ⊗ 2^{Ω1}. How might we construct a probability measure on the following H-measurable events?
• {0.1} × Etails
• [0.1, 0.5] × Etails
• [0.1, 0.1 + δ] × Etails, for any 0 < δ < 0.9
• Etails (i.e., (0, 1) × Etails)
Because (0, 1) is uncountable, we need our measure theoretic machinery to define a probability measure on H. As a practical matter, it is cumbersome to specify a probability on every event in the product σ-algebra, B((0, 1)) ⊗ 2^{Ω1} (or even on generating subsets). Although you probably know how to construct a valid probability measure via conditional probability, we haven't yet developed those tools in this class. We'll return to these ideas soon.
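
Although we have not yet built the conditioning tools to write this measure down, we can already simulate the experiment. A Monte Carlo sketch of mine (the value 2/3 in the final comment comes from averaging 1 − p² over a uniform p, which we will be able to justify formally later):

import random

def experiment():
    p = random.random()                               # spin the wheel: p in (0, 1)
    flips = [random.random() < p for _ in range(2)]   # True means heads
    return flips

n, hits = 200_000, 0
for _ in range(n):
    if not all(experiment()):                         # E_tails: at least one tails
        hits += 1
print(hits / n)                                       # approximately 2/3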

5.2 Statistical models

Probability models are the basic building blocks of statistical models. A statistical model is a family of probability measures on a common measurable space (Ω, H),

P = {Pθ : H → [0, 1], θ ∈ Θ} . (5.1)

Here, Θ is an index set for the model. If Θ is finite-dimensional (e.g., Θ = R^d) then P is called parametric; otherwise, it is called non-parametric. The term semi-parametric typically refers to a model whose parameter space is Θp × Θn, where Θp has finite dimension and Θn does not.

13Something like this.

Example 5.1. Normal family. Let N_{θ,σ²}, θ ∈ R, denote the normal distribution with mean θ and variance σ². A statistical model might be

P = {N_{θ,1} : θ ∈ R} .

Clearly, this model is parametric.

Example 5.2. Mixture of normals. Let (pk) = p1, p2, . . . be a sequence satisfying pk ≥ 0, k ∈ N, and Σ_{k≥1} pk = 1. Such a sequence of length K + 1 lies in a K-dimensional simplex (or K-simplex), denoted S_K. The sequence might be infinite, in which case it lies in S_∞. Define the mixture of normals distribution as

(p ◦ N)(A) = Σ_k pk N_{µk,σk²}(A) , A ∈ B(R) .

A statistical model using this might be

P = {(p ◦ N)_θ : θ ∈ R × R₊ × S_K} .

If K is finite, this model is parametric; if not, then the model is non-parametric. In practice, our finite data sets will use only a finite number of parameters, even in a non-parametric model. However, a larger data set can use a larger number of parameters in a non-parametric model; another way to describe non-parametric models is to say that the parameter space is unbounded.
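
To make the generative structure concrete, here is a sketch of sampling a single draw from a finite mixture of normals; the particular weights, means, and variances are placeholders of mine, not values from the text.

import random

p = [0.5, 0.3, 0.2]                  # (p_k), a point in the 2-simplex
mu = [-2.0, 0.0, 3.0]
sigma = [1.0, 0.5, 2.0]

def sample_mixture():
    k = random.choices(range(len(p)), weights=p)[0]   # pick component k w.p. p_k
    return random.gauss(mu[k], sigma[k])              # then sample from that normal

print([sample_mixture() for _ in range(5)])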

Statistical inference. Frequentist inference boils down to choosing a particular probability measure from P that is optimal in some sense. In the parametric setting, choosing a probability measure is equivalent to choosing a parameter; hence the central importance of parameter estimation in frequentist inference. Bayesian inference amounts to putting a prior probability measure on P and using conditional probability to determine a posterior probability measure on P after observing data. We will return to this in the context of conditioning.

6 Random variables

Reading: Çınlar [Çin11], II.1. Supplemental:

For the rest of these notes, (Ω, H, P) will always denote the probability space with which we are working. Let (E, E) be a measurable space. An H/E-measurable mapping X : Ω → E is a random variable taking values in (E, E). Recall that H/E-measurability means that

X⁻¹A := {ω ∈ Ω : X(ω) ∈ A} ∈ H , A ∈ E . (6.1)

A more common way to denote X⁻¹A is {X(ω) ∈ A}, read as "the event X in A". This type of notation is called event notation. It is customary to denote random variables by capital letters, and fixed values they might assume as lowercase letters, e.g., {X = x}. When E is understood from context, we often say that X is E-valued.

The simplest random variables are indicator functions on sets in H: 1_H for H ∈ H. A simple random variable takes only finitely many values (typically in R); a discrete random variable takes on at most countably many different values. In undergraduate probability courses, continuous random variables are also typically defined here. We will be more careful and properly define what this means soon.

Random variables, random elements. Some authors define X : Ω → E to be a random element of the arbitrary (E, E). When that measurable space is (R, B(R)), the term "random variable" is applied. Çınlar [Çin11] doesn't make such a distinction, and neither will I.

6.1 Distribution of a random variable

Let X be a random variable that takes values in (E, E). The distribution, or law, of X is the image of P under X,

µ(A) = (P ◦ X⁻¹)(A) = P(X⁻¹A) = P{X ∈ A} , A ∈ E . (6.2)

Induced probability spaces. The probability space (Ω, H, P) is often referred to as the background probability space because it gets relegated to the background: the assumption of its existence is stated at the beginning, and it is never heard from again. As such, the random variable X(ω) is written not as a function, but simply as X. In practice, we typically work directly on the space on which X takes values. For example, we don't work with P, defined on Ω, and then determine the image measure P ◦ X⁻¹. Rather, we define the distribution µX and proceed from there, forgetting (Ω, H, P). In this case, we call (E, E, µX) the induced probability space. The following fact, stated as a proposition for posterity, follows from (E, E) being a measurable space and µX being a probability measure.

Proposition 6.1. The induced space (E, E, µX) is a probability space.

For complicated models, we might construct hierarchies or sequences of random variables. This works under the assumption that all random quantities are measurable relative to H and the σ-algebra on whatever space in which they take values. A common interpretation in such models is that Ω is "the state of the world", and the random quantities included in the model are different functions of that state. I like to think of the background probability space (Ω, H, P) as a reservoir of randomness that we draw on to do all of our stochastic/probabilistic operations.

Stochastic equality. Let X and Y be two random variables taking values in a measurable space (E, E). X and Y are said to be equal almost surely or almost everywhere if

P{ω ∈ Ω: X(ω) = Y (ω)} = P{X = Y } = 1 . (6.3)

We denote this by X =_{a.s.} Y. Observe that this means that X and Y may differ only on a null set. X and Y are said to be equal in distribution if

P{X ∈ A} = P{Y ∈ A} ,A ∈ E , (6.4)

denoted X =_d Y. Note that in general, this is a much weaker form of equality: almost-sure equality implies distributional equality, but not conversely.

Distributions on R. In light of Corollary 4.3 on the uniqueness of probability measures on R̄, in order to specify the distribution of a random variable X taking values in [−∞, ∞], it is enough to specify µ([−∞, r]) for all r ∈ R, i.e., the distribution function

F(x) = µ[−∞, x] = P{X ≤ x} , x ∈ R . (6.5)

This is also called the cumulative distribution function, or cdf. We won't dwell on them further in lecture because you've likely studied them in previous courses on statistics/probability. You will be asked to prove some of their properties on Assignment 2.

Exercise 15: Show the following: two R-valued random variables X and Y are equal in distribution if and only if their distribution functions F_X and F_Y are equal.

Densities on R. When the distribution function of X has the form

F(x) = ∫_{−∞}^{x} f(s) λ(ds) , (6.6)

we say that X has density function f, typically denoted f_X. A distribution function F that has a density is said to be absolutely continuous with respect to the Lebesgue measure (or simply absolutely continuous); f = dF/dλ is the Radon–Nikodym derivative of F with respect to the Lebesgue measure. We will study these ideas in more detail in the context of integration and expectation. Note that the Lebesgue measure is so ubiquitous that λ(dx) is typically written simply as dx.

6.2 Examples of random variables

Example 6.1. Poisson distribution. Let X be a random variable taking values in N = {0, 1, 2, . . . }, with the discrete σ-algebra 2^N. X has the Poisson distribution with mean c > 0 if

P{X = n} = e^{−c} c^n / n! . (6.7)

The corresponding distribution µ on N is

µ(A) = Σ_{n∈A} e^{−c} c^n / n! , A ⊂ N . (6.8)

Example 6.2. Gamma distribution. Let X be a random variable taking values in (R+, B(R+)). It has the gamma distribution with shape index a and scale parameter b if its distribution has the form

µ(dx) = (b^a x^{a−1} e^{−bx} / Γ(a)) λ(dx) , x ∈ R+ . (6.9)

The normalizing constant Γ(a) is the gamma function

Γ(a) = ∫_0^∞ dx x^{a−1} e^{−x} . (6.10)

When a = 1, the gamma distribution is the exponential distribution; when b = 1/2 and a = n/2 for some integer n, it is the χ²-distribution with n degrees of freedom.

Example 6.3. Random graph. Many interesting random objects do not have a distribution function that we can write in closed form or compute easily (or at all). Rather, their distribution is induced by a constructive or generative model. A graph G is a set V of vertices and a set E ⊂ V × V of edges between them. For example, vertices might represent users of a social network and edges represent friendship connections. A common representation of a simple graph^a enumerates the vertices (V = {1, 2, . . . , n} := [n]) and records a value in {0, 1} for each possible edge in [n] × [n]. A random simple graph on n vertices is therefore a random variable Gn taking values in {0, 1}^{[n]×[n]}. This is a finite (discrete) space, so the discrete σ-algebra can be used without problems. In all but the simplest models, specifying a probability distribution on a (possibly large) composite structure like Gn is highly non-trivial. In the case of random graphs, there are good reasons not to specify a model this way (we won't go into them). However, consider the following simple generative model (a simulation sketch follows the footnote below):
1. Start with G2 as two vertices connected by an edge.
2. At each n = 3, . . . , connect a new vertex via an edge to a random vertex in V(G_{n−1}).
3. The probability that the new vertex n attaches to any existing vertex v is proportional to the number of edges that attach to v in G_{n−1}.
This is the simplest version of the widely studied preferential attachment model; it generates random trees. Its simplicity makes it amenable to probabilistic study. Observe that for any set A ⊂ {0, 1}^{[n]×[n]}, we could in theory compute P{Gn ∈ A}. However, as n gets even moderately large, the size of {0, 1}^{[n]×[n]} blows up and computing the distribution of Gn becomes unwieldy/impossible. Specifying the model as above, however, allows us to write the distribution in terms of conditional probabilities. We will revisit this example in that context. Note that this is not (by far) the only way we might specify a distribution on random graphs. There is a vast literature on the topic, ranging from the purely probabilistic (for example, preferential attachment has been generalized in many ways for probabilistic study) to more statistics-oriented models like the so-called exponential random graph model, which specifies sufficient statistics (e.g., number of triangles, degree sequence, and so on) and uses them in an exponential family distribution. (There has been an increasing exchange of ideas from these two ends of the spectrum in recent years.)

^a This is a simple graph, the simplest version of such an object; multigraphs, weighted graphs, hypergraphs, multiplex graphs, and others extend the basic idea.
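A minimal simulation sketch of the preferential attachment tree described above, assuming only the Python standard library; the function name sample_pa_tree and the edge-list representation are illustrative choices, not notation from these notes.

import random

def sample_pa_tree(n):
    # Draw a preferential attachment tree on vertices 1..n: start
    # from G_2 = a single edge {1, 2}; each new vertex attaches to
    # an existing vertex with probability proportional to its degree.
    edges = [(1, 2)]
    degree = {1: 1, 2: 1}
    for new in range(3, n + 1):
        targets = list(degree.keys())
        weights = [degree[v] for v in targets]
        # random.choices draws proportionally to the supplied weights.
        v = random.choices(targets, weights=weights)[0]
        edges.append((new, v))
        degree[new] = 1
        degree[v] += 1
    return edges

# The degree of vertex 1 fluctuates from draw to draw: it is a
# random variable (D_{1,n} in Example 6.4 below).
tree = sample_pa_tree(100)
print(sum(1 for u, v in tree if 1 in (u, v)))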

Example 6.4. Function of a random variable. Let X be a random variable taking values in a measurable space (E, E), and (F, F) another measurable space such that the mapping f : E → F is E/F-measurable. Then the composition Y = f ◦ X,

Y (ω) = f ◦ X(ω) = f(X(ω)) , ω ∈ Ω , is a random variable taking values in (F, F). This follows from the measurability of the composition of measurable functions (Proposition 3.3). If µ is the distribution of X, then the distribution ν of Y is the image measure

ν(A) = P{Y ∈ A} = P{X ∈ f^{−1}A} = µ(f^{−1}A) , A ∈ F .

For example, let D_{1,n} be the degree of the first vertex in a preferential attachment tree with n edges, G_n. D_{1,n} takes values in N, and its distribution is much easier to analyze than the distribution of the entire random tree—much of the probabilistic analysis of preferential attachment models focuses on the properties of the entire sequence of degrees of the vertices, (D_{1,n}, . . . , D_{n+1,n}).

Example 6.5. Random variable defined by a computer program. Many of the (pseudo-)random variables we know and love can be generated by transforming uniform (pseudo-)random variables. For example, let U ∼ Unif[0, 1] (read "U sampled from the uniform distribution on [0, 1]"), and let X ∼ Exp(b) (read "X sampled from the exponential distribution with rate parameter b"). Then we know that X =_d −ln(U)/b. This is an example of a distributional identity. On a computer, generating a (pseudo-)random variable with exponential distribution might call a program with the following basic structure:

function exponential(b)
    u ← −ln(rand())/b
    return u
end function

The function rand() is a sort of computerized reservoir of randomness that represents the background probability space (Ω, H, P): anything stochastic or probabilistic that we might do on a computer relies on calls to the pseudo-random number generator. Moreover, ln calls the natural logarithm function, which is a built-in function in most mathematics libraries and is evaluated via numerical approximation. This is a simple example of a computer program that can be viewed as a random variable. The field of probabilistic programming takes a broader view: any computer program can be viewed as a random variable whose distribution is induced by the program. Statistical inference (typically Bayesian) can be performed over the distribution of execution traces by conditioning on observed data. These programs (and research into them) are at the forefront of machine learning. Some examples are BUGS and Stan (from the statistics world), and TensorFlow Probability (which absorbed Edward), Pyro, Church, and Anglican (from the machine learning world).
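Here is the same sampler as runnable Python, assuming only the standard library; the final sanity check against the exponential mean 1/b is an illustrative addition, not part of the pseudocode above.

import math
import random

def exponential(b):
    # Inverse-transform sampling: if U ~ Unif[0, 1], then -ln(U)/b
    # has the exponential distribution with rate b. Using 1 - random()
    # keeps the argument of ln strictly positive.
    return -math.log(1.0 - random.random()) / b

b = 2.0
samples = [exponential(b) for _ in range(100_000)]
print(sum(samples) / len(samples))  # approximately 1/b = 0.5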

Exercise 16 (Random variable from undergraduate probability): Give the full definition (i.e., set of values, σ-algebra) of a random variable you studied in undergraduate probability. State at least one distributional identity involving that random variable.

Exercise 17 (Random variable from your research interests): Give the full definition of a random variable that is not encountered in introductory textbooks on probability, ideally one you encounter in research. If you can, state at least one distributional identity involving that random variable.

6.3 Joint distributions and independence

Let X and Y be random variables taking values in measurable spaces (E, E) and (F, F), respectively. The pair Z = (X, Y) : ω ↦ Z(ω) = (X(ω), Y(ω)) is measurable relative to H and the product σ-algebra E ⊗ F. That is, Z is a random variable taking values in the product space (E × F, E ⊗ F). The distribution of Z is the probability measure π on the product space, called the joint distribution of X and Y. In order to specify π it is sufficient to specify

π(A × B) = P{X ∈ A, Y ∈ B} = P({X ∈ A} ∩ {Y ∈ B}) , A ∈ E, B ∈ F . (6.11)

The marginal distributions of X and Y are

µ(A) = P{X ∈ A} = π(A × F) , A ∈ E  and  ν(B) = P{Y ∈ B} = π(E × B) , B ∈ F . (6.12)

These terms are extended in the obvious way to arbitrary finite collections of random variables.

Independence. X and Y are said to be independent if their joint distribution is just the product of their marginal distributions:

π(A × B) = P{X ∈ A, Y ∈ B} = P{X ∈ A} P{Y ∈ B} = µ(A) ν(B) , A ∈ E, B ∈ F . (6.13)

Independence is one of the fundamental (and most useful) ideas used to construct statistical models. As a general rule, the more independence a model has, the easier it is to perform inference, computation, and closed-form analysis. The trade-off for too much independence is that the model might not be able to capture properties of real data. Traditionally, good models strike a balance—but doing so is a bit of an art.
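A quick Monte Carlo illustration of the factorization (6.13), assuming NumPy is available; the distributions and the sets A and B are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(size=1_000_000)   # X ~ Unif[0, 1]
y = rng.normal(size=1_000_000)    # Y ~ N(0, 1), drawn independently

in_A = x < 0.3
in_B = y > 0.5
print(np.mean(in_A & in_B))           # ~ pi(A x B)
print(np.mean(in_A) * np.mean(in_B))  # ~ mu(A) nu(B), nearly equal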

6.4 Information and determinability

Theorem 6.2. Let X be a random variable taking values in some measurable space (E, E). A mapping V : Ω → R̄ belongs to σX (is σX/B(R̄)-measurable) if and only if V = f ◦ X for some deterministic E/B(R̄)-measurable function f.

Question 4 of Assignment 1 was a slightly less general version of this. One interpretation is that σX is the collection of all R̄-valued random variables that are measurable functions of X (and X only). One can think of σX as a body of information; measurability with respect to it means being determined by that information. In this sense, X determines all σX-measurable random variables. A random variable Y that is not σX-measurable is not determined by X, i.e., there is no function f such that Y = f(X)—there is leftover randomness. When we study conditional expectations, we will see that conditioning on σX yields a "best estimate" of Y given the information in σX.

7 Integration

Reading: Çinlar [Çin11], I.4. Supplemental:

7.1 Definition and desiderata

Let (E, E, µ) be a measure space. Recall that E^fn denotes the collection of all E/B(R)-measurable functions, and E^fn_+ is the sub-collection of positive measurable functions. Our aim in this section is to define the integral of a function with respect to a measure. Of course, we would like whatever we define to satisfy certain properties; the bulk of the section is devoted to proving that what we define actually has the properties we want. To that end, denote the integral of a function f with respect to a measure µ by (these are just different ways of writing the same thing)

µf = µ(f) = ∫_E µ(dx) f(x) = ∫_E f dµ . (7.1)

The notation µf suggests a kind of multiplication, and indeed integration behaves much like multiplication via the following properties (which we will prove), for all a, b ∈ R+ and f, g, f_n ∈ E^fn:
i) Positivity: µf ≥ 0 if f ≥ 0.
ii) Linearity: µ(af + bg) = aµf + bµg.

iii) Monotone convergence: If f_n ↗ f then µf_n ↗ µf.

The integral is defined in parts (recursively), in order to handle different types of functions:

a) Simple and positive functions: Let f be simple and positive, with canonical form f = Σ_{i=1}^{n} a_i 1_{A_i}. Then the integral is defined as

µf = Σ_{i=1}^{n} a_i µ(A_i) . (7.2)

b) Positive functions: Let f ∈ E^fn_+, and put f_n = d_n ◦ f with d_n the dyadic function defined in (3.9) (with properties given in Lemma 3.6). Then each f_n is simple and positive, and f_n ↗ f. The integral µf_n is defined in part a) above; the sequence of numbers µf_n is increasing so the limit exists (we will establish this below). We define

µf = lim_n µf_n . (7.3)

c) Arbitrary measurable functions: Let f ∈ E^fn. Then f^+, f^− ∈ E^fn_+, and their integrals are defined by part b) above. Noting that f = f^+ − f^−, we define

µf = µf^+ − µf^− , (7.4)

provided at least one of the terms on the RHS is finite. Otherwise, if both terms are infinite, µf is undefined.

We will show that this definition is the "right one", in the sense that it satisfies properties i)-iii) above, and that those properties characterize the integral—there is no other way to define it if we want those properties.

Some observations and immediate implications for positive simple functions. Let f, g be simple and positive. Then the following are true:
• The formula (7.2) for µf still holds even if f is not in canonical form. (This is true due to the finite additivity of µ.)
• Linearity of the integral follows from the finite additivity of µ.
• The integral is monotonic: if f ≤ g then µf ≤ µg. This follows from the linearity property applied to the functions f and g − f:

µf ≤ µf + µ(g − f) = µ(f + g − f) = µg .

• By monotonicity, an increasing sequence of functions f_1 ≤ f_2 ≤ · · · has µf_1 ≤ µf_2 ≤ · · · , so lim_n µf_n exists (it may be +∞).

What's the intuition? Recall from your calculus class that the Riemann integral of f : R → R (the one we know and love) is calculated by partitioning the domain and drawing a rectangle on each element of the partition with height equal to the value of f at the leftmost point of the element. Another set of rectangles is generated with height equal to the value of f at the rightmost point of each element of the partition. As more and more elements are squeezed into the partition, the common limit of the two sets of rectangles is the Riemann integral. We'll see an example below of a function for which this fails, but for which we'd still like to define an integral. The integral defined above, generally known as the Lebesgue integral (not to be confused with Lebesgue measure), takes the "inverse" approach: composing d_n with f essentially partitions the "y-axis" (i.e., the range of f):

(d_n ◦ f)(x) = Σ_{k=1}^{n2^n} ((k−1)/2^n) 1_{[(k−1)/2^n, k/2^n)}(f(x)) + n 1_{[n,∞]}(f(x)) . (7.5)

For each term in the sum, the general integral then uses the definition of the integral for simple functions to make a rectangle with height (k−1)/2^n and base

µ( f^{−1}[(k−1)/2^n, k/2^n) ) = µ( {x ∈ E : (k−1)/2^n ≤ f(x) < k/2^n} ) .

The crucial technical step is proving that the limit of this construction behaves as we would like.
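To make the dyadic construction concrete, here is a small numerical sketch, assuming NumPy is available; it applies d_n to f(x) = x² on [0, 1] and approximates the integral µ(d_n ◦ f) by averaging over a fine grid (the grid is a stand-in for Lebesgue measure, for illustration only).

import numpy as np

def dyadic(y, n):
    # d_n: values above n are mapped to n; otherwise round down
    # to the nearest multiple of 1/2^n, as in (7.5).
    return np.where(y >= n, n, np.floor(y * 2**n) / 2**n)

x = np.linspace(0.0, 1.0, 1_000_001)
f = x**2
for n in [1, 2, 4, 8]:
    # mean over the grid ~ integral of d_n(f(x)) over [0, 1];
    # the values increase to the Lebesgue integral 1/3.
    print(n, dyadic(f, n).mean())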

7.2 Examples

Example 7.1. Integrating with discrete measures. Fix x_0 ∈ E and consider the Dirac measure δ_{x_0}. The integral as defined above yields δ_{x_0} f = f(x_0) for every f ∈ E^fn. This extends to discrete measures µ = Σ_{x∈D} m(x) δ_x for some countable set D and masses m(x):

µf = Σ_{x∈D} m(x) f(x) ,

for every f ∈ E^fn_+. Similar results hold for purely atomic measures, and for measures and functions defined on discrete spaces.

Example 7.2. Integrating with the Lebesgue measure. Suppose that E is a Borel subset of R^d, d ≥ 1, and that E = B(E). Suppose that µ is the restriction of the Lebesgue measure on R^d to (E, E). For f ∈ E^fn, the Lebesgue integral of f on E is denoted

µf = ∫_E λ(dx) f(x) = ∫_E dx f(x) .

If the Riemann integral of f exists, then so does the Lebesgue integral (and they are equal). However, the Lebesgue integral exists for a larger class of functions than the Riemann integral. Consider the function on [0, 1],

f(x) = 1 if x ∈ Q ,  f(x) = 0 if x ∉ Q .

The set of discontinuities of this function is the entire interval [0, 1]; recall that Lebesgue's criterion for the existence of the Riemann integral of a function is that the set of discontinuities has (Lebesgue) measure zero [see Abb15, Ch. 7], and therefore the function is not Riemann-integrable. However, by our definition above, and in particular a) and the monotone convergence property (which we will prove below), the Lebesgue integral is well-defined and is equal to 0.

7.3 Basic properties of integration

Integrability. A function f ∈ E^fn is said to be integrable if µf exists and is a real number. Thus, f is integrable if and only if µf^+ < ∞ and µf^− < ∞, or equivalently, µ|f| = µf^+ + µf^− < ∞.

Integrating over a set. Let f ∈ E^fn and A ∈ E. Then f 1_A ∈ E^fn and the integral of f over A is defined to be the integral of f 1_A, denoted

µ(f 1_A) = ∫_A µ(dx) f(x) = ∫_A f dµ .

We would like the integrals over two disjoint sets to sum to their integral over the union of those sets; the following lemma establishes that this is the case.

Lemma 7.1. Let f ∈ E^fn_+. Let A and B be disjoint sets in E with C = A ∪ B. Then

µ(f1A) + µ(f1B) = µ(f1C ) .

Proof. If f is simple, the lemma follows from the linearity property established for simple functions:

µ(f1A) + µ(f1B) = µ(f1A + f1B) = µ(f(1A + 1B)) = µ(f1A∪B) = µ(f1C ) .

For arbitrary f ∈ E^fn_+, putting f_n = d_n ◦ f, we get µ(f_n 1_A) + µ(f_n 1_B) = µ(f_n 1_C) for all n since the f_n are simple. Observing that f_n 1_A = d_n ◦ (f 1_A) (convince yourself of this), and likewise for B and C, we get

µ(dn ◦ (f1A)) + µ(dn ◦ (f1B)) = µ(dn ◦ (f1C )) .

Taking the limit n → ∞ and checking the definition (part b) yields the result.

Positivity and monotonicity.

Proposition 7.2. If f ∈ E^fn_+ then µf ≥ 0. If f, g ∈ E^fn_+ and f ≤ g, then µf ≤ µg.

Proof. Positivity of the integral for f ∈ E^fn_+ follows from the definition of the integral. For monotonicity, let f_n = d_n ◦ f and g_n = d_n ◦ g; since d_n is an increasing function, f ≤ g implies that f_n ≤ g_n for all n. These are both simple functions, and we established monotonicity for simple functions as an immediate consequence of the definition of the integral. Hence, letting n → ∞ and checking the definition (part b)), we see that µf ≤ µg.

7.4 Monotone Convergence Theorem

As Çinlar notes, this is the main theorem of integration. It says that the mapping f ↦ µf is continuous under increasing limits, and we may interchange integral with limit. It is useful in its own right, as it is often easier to evaluate µf_n and then take the limit, rather than the other way around. The proof is a bit involved, but it's illuminating and worth the time.

Theorem 7.3 (Monotone Convergence (MCT)). Let (f_n) be an increasing sequence of positive E/B(R+)-measurable functions. Then

µ(lim_n f_n) = lim_n µf_n .

Proof. First, let's make sure that the limits on both sides of the claimed equality are well-defined. Let f = lim_n f_n, which is well-defined because (f_n) is increasing. Clearly, f ∈ E^fn_+ so µf is well-defined. Since (f_n) is increasing, Proposition 7.2 on monotonicity implies that the integrals (µf_n) form an increasing sequence of numbers. Hence, lim_n µf_n exists. We want to show that it is equal to µf. We will do so by proving the following two claims:

1. µf ≥ lim_n µf_n.

2. µf ≤ lim_n µf_n.

Proof of Claim 1: Because f ≥ f_n for each n, monotonicity (Proposition 7.2) yields µf ≥ µf_n for each n. It follows that µf ≥ lim_n µf_n.

Proof of Claim 2: We will do this in three parts, using first an indicator function, then a simple function, then the limit of dyadic functions.

Indicator function. Fix b ∈ R+ and a set B ∈ E. Suppose that f(x) > b for every x ∈ B. Since the sets {f_n > b} = {x ∈ E : f_n(x) > b} are increasing to {f > b}, the sets B_n = B ∩ {f_n > b} are increasing to B. Therefore, by the sequential continuity of µ (Proposition 4.1),

lim_n µ(B_n) = µ(B) . (7.6)

Now consider the function f_n 1_B. We have that

f_n 1_B ≥ f_n 1_{B_n} ≥ b 1_{B_n} ,

which (again) by monotonicity yields

µ(f_n 1_B) ≥ µ(b 1_{B_n}) = b µ(B_n) .

We established the limit of µ(B_n) in (7.6), so

lim_n µ(f_n 1_B) ≥ b µ(B) . (7.7)

This is still true if f(x) ≥ b for all x ∈ B. To see this, note that if b = 0 then by the positivity of the integral this is trivially true. For b > 0, choose a sequence of positive numbers (b_m) such that b_m ↗ b. Then f(x) > b_m for all x ∈ B, and (7.7) is true with b replaced by b_m. Letting m → ∞, we get (7.7) again.

Simple function. Let g be a positive simple function such that f ≥ g, with canonical representation g = Σ_{i=1}^{m} b_i 1_{B_i}. Therefore, f(x) ≥ b_i for every x ∈ B_i, and (7.7) yields

lim_n µ(f_n 1_{B_i}) ≥ b_i µ(B_i) , i = 1, . . . , m . (7.8)

Furthermore, f_n ≥ f_n Σ_{i=1}^{m} 1_{B_i} and therefore

lim_n µf_n ≥ lim_n µ( f_n Σ_{i=1}^{m} 1_{B_i} )   (monotonicity)
         = lim_n Σ_{i=1}^{m} µ(f_n 1_{B_i})   (finite additivity)
         = Σ_{i=1}^{m} lim_n µ(f_n 1_{B_i})   (interchange limit with finite sum)
         ≥ Σ_{i=1}^{m} b_i µ(B_i)   (by (7.8))
         = µg   (definition of integral for simple functions) .

This holds for every positive simple function g such that f ≥ g.

Limit of dyadic functions. Recall that by definition, µf = lim_k µ(d_k ◦ f). For each k, d_k ◦ f is a positive simple function and f ≥ d_k ◦ f. Hence, setting g = d_k ◦ f, we have

lim_n µf_n ≥ µ(d_k ◦ f)

for all k. Letting k → ∞, we obtain lim_n µf_n ≥ lim_k µ(d_k ◦ f) = µf.
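A numerical illustration of the MCT, assuming NumPy; take f(x) = x^{−1/2} on (0, 1], which has Lebesgue integral 2, and the increasing truncations f_n = min(f, n), whose integrals 2 − 1/n increase to 2. The uniform grid average is a stand-in for the integral.

import numpy as np

x = np.linspace(1e-12, 1.0, 2_000_001)
f = 1.0 / np.sqrt(x)
for n in [1, 2, 5, 10, 100]:
    fn = np.minimum(f, n)      # f_n increases pointwise to f
    print(n, fn.mean())        # ~ 2 - 1/n, increasing to mu f = 2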

7.5 Further properties of integration

Linearity.

Proposition 7.4 (Linearity of integration). For f, g ∈ E^fn_+ and a, b ∈ R+,

µ(af + bg) = aµf + bµg . (7.9)

The same is true of integrable f, g ∈ E^fn and arbitrary a, b ∈ R.

Proof. This proof is a nice example of how useful/powerful the MCT can be. Suppose that f, g ∈ E^fn_+ and a, b ≥ 0. We established linearity for simple f, g immediately after the definition of the integral. For general positive f, g, choose increasing sequences of positive simple functions f_n ↗ f and g_n ↗ g. Then

µ(a f_n + b g_n) = a µf_n + b µg_n .

Applying the Monotone Convergence Theorem to both sides,

lim_n µ(a f_n + b g_n) = a lim_n µf_n + b lim_n µg_n

µ(a lim_n f_n + b lim_n g_n) = a µ(lim_n f_n) + b µ(lim_n g_n)

µ(af + bg) = aµf + bµg .

Insensitivity.

Proposition 7.5. The integral has the following insensitivity properties:
i) If A ∈ E is negligible (i.e., µ(A) = 0), then µ(f 1_A) = 0 for every f ∈ E^fn.
ii) If f, g ∈ E^fn_+ and f = g almost everywhere, then µf = µg.
iii) If f ∈ E^fn_+ and µf = 0, then f = 0 almost everywhere.

Exercise 18 (Insensitivity of integration): Prove Proposition 7.5.

7.6 Characterization of the integral

We will not prove the following important result, which essentially says that the properties of positivity, linearity, and monotone convergence characterize the integral.

Theorem 7.6 (Characterization of the integral). Let (E, E) be a measurable space. Let L be a mapping from E^fn_+ into R̄+. Then there exists a unique measure µ on (E, E) such that L(f) = µf for every f ∈ E^fn_+ if and only if:
a) f = 0 ⇒ L(f) = 0.
b) f, g ∈ E^fn_+ and a, b ∈ R+ ⇒ L(af + bg) = aL(f) + bL(g).
c) (f_n) ⊂ E^fn_+ and f_n ↗ f ⇒ L(f_n) ↗ L(f).

7.7 More on interchanging limits and integration

Two limiting operations do not necessarily commute—they might not be interchangeable. Integration, differentiation, and infinite sums are different types of limiting operations. It is common to encounter something like

lim_n ∫_R f_n(x, θ) dx   or   (d/dθ) ∫_R f(x, θ) dx .

Often, performing one limit operation before the other is much easier, e.g., differentiating before integrating. We will establish conditions for interchanging differentiation and integration in the context of expectations. The MCT allows us to interchange limits with integration for increasing sequences of functions; the Dominated Convergence Theorem relaxes the requirement that (fn) is increasing, at the expense of bounding the sequence with another integrable function.

Dominated functions and integration. A function f is dominated by a function g if |f| ≤ g. (Note that g ≥ 0 necessarily.) A sequence (f_n) is dominated by g if |f_n| ≤ g for every n.

Theorem 7.7 (Dominated Convergence Theorem). Let (f_n) be a sequence of E-measurable functions. Suppose that (f_n) is dominated by some integrable function g. If lim_n f_n exists, then it is integrable and

µ(lim_n f_n) = lim_n µf_n .

See Çinlar [Çin11, pp. 25-26] for the proof, along with two necessary intermediate results on interchanging lim inf/lim sup and integration.

Note that for the DCT, (fn) does not need to be monotone; it only needs to have a limit and be dominated by an integrable function.

If (fn) is bounded by some constant b ∈ R+ and µ is finite, then we can take g = b in the DCT.

Theorem 7.8 (Bounded Convergence Theorem). Let (f_n) be a sequence of E-measurable functions, bounded by b ∈ R+. Suppose that µ is finite. If lim_n f_n exists, then it is a bounded integrable function and

µ(lim_n f_n) = lim_n µf_n .
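A numerical contrast, assuming NumPy: for f_n(x) = x^n on [0, 1] the DCT applies (dominated by g ≡ 1, which is integrable since λ([0, 1]) is finite), and the integrals converge to the integral of the pointwise limit, 0. The second sequence converges pointwise to 0 but admits no integrable dominating function, and its integrals converge to 1 instead; both sequences are illustrative choices.

import numpy as np

x = np.linspace(1e-9, 1.0, 1_000_001)
for n in [1, 5, 20, 80]:
    dominated = x**n                                # |f_n| <= 1
    escaping = 2 * n**2 * x * np.exp(-(n * x)**2)   # mass escapes toward 0
    print(n, dominated.mean(), escaping.mean())
# First column -> 0 (DCT applies); second -> 1, even though the
# pointwise limit is 0, because no integrable g dominates the sequence.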

7.8 Integration and image measures

Recall (from Section 4.6) that for a measurable function h : F → E and a measure ν on (F, F), the image of ν under h, ν ◦ h^{−1}, is a measure on (E, E). The following theorem says that we can integrate either with the image measure on (E, E), or with ν on (F, F).

Theorem 7.9. For every f ∈ E^fn_+, (ν ◦ h^{−1})f = ν(f ◦ h).

Proof. Sketch of proof: Define L : E^fn_+ → R̄+ by setting L(f) = ν(f ◦ h), and check that L meets the conditions of Theorem 7.6. Thus, L(f) = µf for some unique measure µ on (E, E). Then, show that µ and ν ◦ h^{−1} agree on all B ∈ E.

Exercise 19 (Proof of Theorem 7.9): Complete the proof of Theorem 7.9.

Written explicitly, this theorem says that

∫_F ν(dx) f(h(x)) = ∫_E µ(dy) f(y) , (7.10)

where µ = ν ◦ h^{−1}. This might look familiar: it is just a general version of the change-of-variables formula from calculus. When expectations and random variables are involved, it also goes by the (somewhat annoying) name "Law of the unconscious statistician". For F = E = R^d, µ and ν are typically expressed in terms of the Lebesgue measure on R^d and the Jacobian of the transformation h.
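A Monte Carlo check of (7.10), assuming NumPy; take ν = Unif[0, 1], h(x) = x², and f(y) = cos(y), so that the image measure µ = ν ◦ h^{−1} has the known density 1/(2√y) on (0, 1] and both sides can be computed separately.

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(size=1_000_000)
lhs = np.cos(x**2).mean()                    # nu(f o h), by sampling

y = np.linspace(1e-9, 1.0, 1_000_001)
rhs = np.mean(np.cos(y) / (2 * np.sqrt(y)))  # mu f, via the density
print(lhs, rhs)                              # agree up to MC/grid error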

Example 7.3. Beta-gamma magic. Let X and Y be independent gamma random variables (as in Example 6.2), with parameters (a, 1) and (b, 1), respectively. We will show the following:
a) Z = X + Y has a gamma distribution with parameters (a + b, 1), denoted γ_{a+b,1}.
b) U = X/(X + Y) has a beta distribution with parameters (a, b), denoted β_{a,b}. That is,

P(U ∈ du) = β_{a,b}(du) = (Γ(a + b)/(Γ(a)Γ(b))) u^{a−1} (1 − u)^{b−1} du , 0 < u < 1 . (7.11)

c) U and Z are independent: their joint distribution π(dz, du) is the product measure γ_{a+b,1}(dz) × β_{a,b}(du).

Let f be any positive Borel function on R+ × [0, 1] and consider

πf = E[f(X + Y, X/(X + Y))]
   = ∫_0^∞ dx (x^{a−1}e^{−x}/Γ(a)) ∫_0^∞ dy (y^{b−1}e^{−y}/Γ(b)) f(x + y, x/(x + y))
   = ∫_0^∞ dz ∫_0^1 du (z^{a+b−1}e^{−z}/(Γ(a)Γ(b))) u^{a−1}(1 − u)^{b−1} f(z, u)
   = ∫_0^∞ γ_{a+b,1}(dz) ∫_0^1 β_{a,b}(du) f(z, u) ,

where we have made the substitution x = zu and y = (1 − u)z; the Jacobian is equal to z.
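An empirical check of the three claims, assuming NumPy; the parameter values are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(2)
a, b = 2.0, 3.0
X = rng.gamma(a, size=1_000_000)   # Gamma(a, 1)
Y = rng.gamma(b, size=1_000_000)   # Gamma(b, 1), independent of X
Z, U = X + Y, X / (X + Y)

print(Z.mean(), a + b)             # Gamma(a+b, 1) has mean a+b
print(U.mean(), a / (a + b))       # Beta(a, b) has mean a/(a+b)
print(np.corrcoef(Z, U)[0, 1])     # ~ 0, consistent with independence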

8 Expectation

Reading: Çinlar [Çin11], I.4, II.2. Supplemental:

Random variables are measurable functions; expectations of random variables are integrals of measurable functions. Hence, we've already developed most of the necessary theory. We'll focus on more specialized, "probabilistic" results in this section.

8.1 Integrating with respect to a probability measure

As always, we have a background probability space (Ω, H, P). We denote the expectation or expected value of a random variable X as

EX = E[X] = ∫_Ω P(dω) X(ω) = ∫_Ω X dP = PX . (8.1)

Other notation used in various fields is meant to clarify which random variable is in question and which distribution is being used, e.g., E_X[f(X, Y)] or E_{X∼P}[f(X)]. E is treated as an operator corresponding to P.

Properties. All of the conventions and notations of integration transfer over to expectation. X is integrable if EX exists and is finite. The integral of X over the event H ∈ H is E[X 1_H]. Everything that we proved for integrals in Section 7 holds for expectation (e.g., positivity, monotonicity, linearity, monotone convergence, etc.). The following result is, in Çinlar's words, the "work horse" of probabilistic computations.

Theorem 8.1. Let X be a random variable taking values in a measurable space (E, E). If µ is the distribution of X, then

Ef ◦ X = µf (8.2)

for every f ∈ E^fn_+. Conversely, if (8.2) holds for some measure µ and all f ∈ E^fn_+, then µ is the distribution of X.

Proof. The first statement is true because of what we know about integration with respect to image measures (Theorem 7.9): µ = P ◦ X^{−1}, so µf = P(f ◦ X) = Ef ◦ X, at least for f ∈ E^fn_+. Conversely, if (8.2) holds for all f ∈ E^fn_+, taking f = 1_A we see that

µ(A) = µ1_A = E[1_A ◦ X] = P{X ∈ A} , which is the distribution of X.

A common way to show that X =_d Y is to show that E[f(X)] = E[f(Y)] for every f ∈ E^fn_+.

8.2 Changes of measure: indefinite integrals and Radon–Nikodym derivatives

Composing a measure with a function constructs a measure (the image); alternatively, we might multiply a measure and a function. In particular, let (E, E, µ) be a measure space and p : E → R+ be a measurable function. Define

ν(A) = µ(p 1_A) = ∫_A µ(dx) p(x) , A ∈ E . (8.3)

ν is a measure on (E, E). (Can you prove this?) It is called the indefinite integral of p with respect to µ. Like the image measure, we can choose to integrate with µ or with ν.

Proposition 8.2. For every f ∈ E^fn_+, νf = µ(pf).

The proof has the same structure as that of Theorem 7.9. In more explicit notation, Proposition 8.2 reads

∫_E ν(dx) f(x) = ∫_E µ(dx) p(x) f(x) , f ∈ E^fn_+ . (8.4)

Informally, we write

ν(dx) = µ(dx)p(x) , x ∈ E. (8.5)

This aligns with ideas we may have taken for granted in introductory probability. Heuristically, µ(dx) is the amount of mass µ assigns to an "infinitesimal neighborhood" dx of a point x, and similarly for ν(dx). The density p(x) tells us how to go between the two; it is the mass density at x of ν with respect to µ. The function p is called the density function of ν relative to µ, with the following notation:

p = dν/dµ ,  and  p(x) = (dν/dµ)(x) = ν(dx)/µ(dx) . (8.6)

When dealing with random variables X and X′, we might see

p(x) = P(X ∈ dx)/P(X′ ∈ dx) , x ∈ E . (8.7)
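A quick Monte Carlo sanity check of Proposition 8.2 (νf = µ(pf)), assuming NumPy; the choices µ = Unif[0, 1], p(x) = 2x (so ν is the Beta(2, 1) distribution), and f(x) = e^x are illustrative.

import numpy as np

rng = np.random.default_rng(3)
u = rng.uniform(size=1_000_000)          # samples from mu = Unif[0,1]
v = rng.beta(2.0, 1.0, size=1_000_000)   # samples from nu = Beta(2,1)

print(np.mean(2 * u * np.exp(u)))  # mu(p f)
print(np.mean(np.exp(v)))          # nu f -- agrees up to MC error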

Radon–Nikodym derivative. Above, we used some function p to construct a new measure via (8.3). What if we have two measures? Is there some p that relates the two? The Radon–Nikodym theorem establishes conditions for the existence (and almost everywhere uniqueness) of such a p. Although it may seem like something of a technical curiosity, it is extremely powerful; we will use it to build conditional probability distributions. At a minimum, ν must assign zero mass to the null sets of µ. In particular, a measure ν on (E, E) is absolutely continuous with respect to µ (also on (E, E)) if, for every set A ∈ E,

µ(A) = 0 ⇒ ν(A) = 0 .

This is denoted ν ≪ µ. In the indefinite integral construction (8.3), clearly ν ≪ µ. That the converse is true (when µ is σ-finite) is the content of the theorem. For the purposes of stating the theorem, we need the following definition. A measure µ on (E, E) is σ-finite if there exists a measurable partition (E_n) of E such that µ(E_n) < ∞ for all n. Clearly, every finite measure is σ-finite. In this course, we will only work with probability measures, which are finite by definition.

Theorem 8.3 (Radon–Nikodym). Suppose that µ is a σ-finite measure and that ν ≪ µ. Then there exists a positive E-measurable function p : E → R+ such that

∫_E ν(dx) f(x) = ∫_E µ(dx) p(x) f(x) , f ∈ E^fn_+ . (8.8)

Moreover, p is unique up to a µ-null set: if (8.8) holds for some other function p̂ ∈ E^fn_+, then p̂(x) = p(x) for µ-almost every x ∈ E.

We won't prove this now, but we may prove it later in the course. The Radon–Nikodym theorem is the primary technical tool for defining conditional expectation (coming soon). More immediately, we recognize its role in defining some familiar objects.

Example 8.1. Discrete c.d.f. and p.m.f. Let −∞ < a_1 < a_2 < · · · < ∞ be a sequence of real numbers and let (p_n) be a sequence of positive numbers such that Σ_n p_n = 1. Then

F(x) = Σ_{n=1}^{m} p_n  for a_m ≤ x < a_{m+1} ,  and  F(x) = 0  for −∞ < x < a_1 ,

is a cumulative distribution function consisting purely of jumps. F corresponds to some discrete probability measure µ on (R, B(R)),

µ(A) = Σ_{n : a_n ∈ A} p_n , A ∈ B(R) .

Let ν be the counting measure on 2^R. Then µ ≪ ν and

µ(A) = ∫_A f(x) ν(dx) = Σ_{a_i ∈ A} f(a_i) , A ⊂ R ,

where f(a_i) = p_i. f is the probability mass function of µ or F with respect to ν.

Example 8.2. c.d.f. and p.d.f. Let F be a distribution function, and assume that F is differentiable (in the same sense as in calculus). Then its derivative f satisfies

F(x) = ∫_{−∞}^{x} f(s) λ(ds) , x ∈ R .

Let µ be the corresponding probability measure on (R, B(R)). Then

µ(A) = ∫_A f(s) λ(ds) , A ∈ B(R) .

f is the probability density function of µ or F with respect to the Lebesgue measure λ.

The Radon–Nikodym derivative obeys a number of useful identities.

Proposition 8.4. Let µ be a σ-finite measure on (E, E). Let all other measures appearing below also be on (E, E). The following hold:
i) If ν ≪ µ and f ∈ E^fn_+, then

∫_E f dν = ∫_E f (dν/dµ) dµ .

ii) If ν_1 ≪ µ and ν_2 ≪ µ, then ν_1 + ν_2 ≪ µ and

d(ν_1 + ν_2)/dµ = dν_1/dµ + dν_2/dµ , µ-a.e.

iii) If η is a measure, ν is a σ-finite measure, and η ≪ ν ≪ µ, then

dη/dµ = (dη/dν)(dν/dµ) , µ-a.e.

In particular, if ν ≪ µ and µ ≪ ν (in which case µ and ν are equivalent), then

dν/dµ = (dµ/dν)^{−1} , µ- or ν-a.e.

iv) Let µ_1, ν_1 be σ-finite measures on (E, E) and µ_2, ν_2 σ-finite measures on (F, F), such that ν_1 ≪ µ_1 and ν_2 ≪ µ_2. Then ν_1 × ν_2 ≪ µ_1 × µ_2 and

(d(ν_1 × ν_2)/d(µ_1 × µ_2))(x_1, x_2) = (dν_1/dµ_1)(x_1) · (dν_2/dµ_2)(x_2) , µ_1 × µ_2-a.e.

See Appendix A.1 for an application.

8.3 Moments and inequalities

The expectation is used to define a number of special classes of functions that describe different properties of random variables.

Moments. Certain expected values have special names. Let X be a random variable taking values in R̄, with distribution µ. Then E[X^n] is called the nth moment of X. For n = 1, E[X] is the mean. Assuming the mean is finite, say E[X] = a < ∞, the nth moment of X − a is called the nth centered moment of X. A familiar example is the variance of X, Var[X] = E[(X − a)²].

Markov's inequality. One of the most useful (basic) results from probability theory is Markov's inequality. It is a standard tool in probability, and a number of other named inequalities are derived from it. Let X be a random variable in R+. Then for every c > 0,

P{X > c} ≤ (1/c) E[X] . (8.9)

Exercise 20 (Proof of Markov's inequality): Prove Markov's inequality (8.9).

Hint: Use the fact that X ≥ c 1_{X>c}.

Exercise 21 (Chebyshev's inequality): Assume that E[X] is finite. Apply Markov's inequality to (X − E[X])² to show that

P{|X − E[X]| > c} ≤ Var[X]/c² . (8.10)

Exercise 22 (Generalized Markov's inequality): Let X be a random variable in R, and f : R → R+ be increasing. Show that for every c ∈ R,

P{X > c} ≤ (1/f(c)) E[f(X)] . (8.11)
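A quick numerical look at Markov's inequality (8.9), assuming NumPy; the exponential distribution is an arbitrary test case with E[X] = 1.

import numpy as np

rng = np.random.default_rng(4)
x = rng.exponential(size=1_000_000)
for c in [1.0, 2.0, 5.0]:
    # P{X > c} = e^{-c} for Exp(1), which is indeed <= E[X]/c = 1/c.
    print(c, np.mean(x > c), 1.0 / c)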

Jensen's inequality. Another standard tool relies on special properties of convex functions. A function ϕ : R → R is convex if ϕ = sup_n ϕ_n for some sequence of functions of the form ϕ_n(x) = a_n + b_n x.

Theorem 8.5 (Jensen's inequality). Let X be a random variable in R, and ϕ : R → R be a convex function. Then

ϕ(E[X]) ≤ E[ϕ(X)] . (8.12)

The measure-theoretic version of this is (with µ the distribution of X)

ϕ( ∫_R x µ(dx) ) ≤ ∫_R ϕ(x) µ(dx) .

Example 8.3. Positivity of Kullback–Leibler divergence. Let ν ≪ µ be two probability measures on (E, E). The Kullback–Leibler divergence, or KL divergence, between ν and µ is

KL(ν‖µ) = ∫_E ln(dν/dµ) dν . (8.13)

The KL divergence plays an important role in Bayesian statistics and machine learning, particularly in variational inference methods. Assume also that µ ≪ ν. Then, because −ln(x) is convex,

KL(ν‖µ) = −∫_E ln(dµ/dν) dν
        ≥ −ln( ∫_E (dµ/dν) dν )
        = −ln( ∫_E dµ ) = −ln(1) = 0 .

Hence, we have shown that when ν and µ are mutually absolutely continuous, KL(ν‖µ) ≥ 0. (This is also true without the mutual absolute continuity assumption, but we can't use this technique.)
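A small numerical check on a discrete space, assuming NumPy; p and q are arbitrary distributions with common support, so the density dν/dµ is just the ratio p/q.

import numpy as np

p = np.array([0.1, 0.2, 0.3, 0.25, 0.15])
q = np.array([0.2, 0.2, 0.2, 0.2, 0.2])

print(np.sum(p * np.log(p / q)))  # KL(p||q) >= 0, as Example 8.3 shows
print(np.sum(q * np.log(q / p)))  # KL(q||p) >= 0 too; note the asymmetry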

8.4 Transforms and generating functions

Laplace and Fourier transforms. Let X be a random variable in R+ with distribution µ. Then for r ∈ R+, the random variable e^{−rX} takes values in [0, 1], and

µ̂_r = E[e^{−rX}] = ∫_{R+} e^{−rx} µ(dx) (8.14)

is a number in [0, 1]. The function r ↦ µ̂_r from R+ into [0, 1] is called the Laplace transform of µ (also of X). It can be shown that the Laplace transform determines the distribution: if µ and ν are distributions on R̄+, and µ̂_r = ν̂_r for all r ∈ R+, then µ = ν.

This is also known as the moment generating function: if E[X^n] < ∞ and µ̂_r is differentiable with respect to r on (0, +∞), then (we won't prove this)

lim_{r↓0} (d^n/dr^n) µ̂_r = (−1)^n E[X^n] . (8.15)
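A symbolic check of (8.15) for the exponential distribution, whose Laplace transform is b/(b + r), assuming SymPy is available; the nth moment of Exp(b) is n!/b^n.

import sympy as sp

r, b = sp.symbols("r b", positive=True)
mu_hat = b / (b + r)  # Laplace transform of Exp(b)

for n in [1, 2, 3]:
    deriv = sp.diff(mu_hat, r, n)
    moment = (-1)**n * sp.limit(deriv, r, 0, "+")
    print(n, sp.simplify(moment))  # prints n!/b**n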

The Fourier transform, or characteristic function, plays a similar role for distributions on the entire real line, R. For a random variable X taking values in R with distribution µ, e^{irX} = cos(rX) + i sin(rX) is a complex-valued random variable, and the notion of expectation extends

naturally:

µ̂*_r = E[e^{irX}] = ∫_R e^{irx} µ(dx) = ∫_R µ(dx) cos(rx) + i ∫_R µ(dx) sin(rx) . (8.16)

The Fourier transform determines the distribution.

Probability generating function. If X takes values in {0, 1,... } with distribution µ, then the probability generating function of µ is

E[z^X] = Σ_{n=0}^{∞} z^n P{X = n} , z ∈ [0, 1] . (8.17)

It determines the distribution of X: in the power series expansion of the generating function, the coefficient of z^n is P{X = n} for each n.

9 Independence

Reading: Çinlar [Çin11], II.5. Supplemental:

Recall that two random variables are independent if their joint distribution factors into the product of their marginals. There is a more general notion of independence, in terms of expectations and σ-algebras. A σ-algebra G is a sub-σ-algebra of H if G ⊂ H. Çinlar advocates for thinking of the sub-σ-algebra G both as a collection of events, and as the collection of all numerical random variables that are G-measurable. This is justified, particularly when G = σX (recall the definition of a σ-algebra generated by a function, defined in Section 3.1, (3.3)), because of the following theorem—a version of which you proved in Assignment 1.

A word on notation. Recall that for a measurable space (E, E), E^fn denotes the set of all numerical functions that are E/B(R)-measurable. When working with random variables (and σ-algebras generated by random variables), this notation can get a bit tedious. I will adopt the following short-hand: if V : Ω → R̄ is a random variable, we say that V belongs to σX, denoted V ∈ σX, if it is σX/B(R̄)-measurable. Recall from Theorem 6.2 that a mapping V : Ω → R̄ belongs to σX if and only if V = f ◦ X,

for some deterministic function f ∈ E^fn.

General definition. For any (countable or uncountable) set I indexing a set of σ-algebras (Fi)i∈I , denote by

F_I = ⋁_{i∈I} F_i = σ( ⋃_{i∈I} F_i ) (9.1)

the σ-algebra generated by the union of the σ-algebras. (Recall that the union itself is in general not a σ-algebra.)

Let F1,..., Fn be a sequence of sub-σ-algebras of H. Then {F1,..., Fn} is called an independency if

E[V1 ··· Vn] = E[V1] ... E[Vn] , (9.2)

for all positive random variables V1 ∈ F1,...,Vn ∈ Fn.

For an arbitrary (possibly infinite) index set T, and a sub-σ-algebra F_t of H for each t ∈ T, the collection {F_t : t ∈ T} is an independency if every finite subset is an independency. In general, elements—such as random variables—of an independency are said to be independent or mutually independent. This reveals what we really mean when we say the random variables X and Y are independent: σX and σY are independent.

The following test for independence of σ-algebras echoes the test for measurability of functions in Proposition 3.2: we only need to show independence on a generating subset. Recall that a p-system is a collection of sets that is closed under intersection.

Proposition 9.1. Let F_1, . . . , F_n be sub-σ-algebras of H, n ≥ 2. For each i ≤ n, let C_i be a p-system that generates F_i. Then F_1, . . . , F_n are independent if and only if

P(H_1 ∩ · · · ∩ H_n) = P(H_1) · · · P(H_n) ,

for all H_i in C̄_i = C_i ∪ {Ω}, i = 1, . . . , n. See Çinlar [Çin11, Prop. II.5.2] for a proof.

We will need the following result, stating that independence survives grouping, to prove Kolmogorov's 0-1 law. In particular, let {F_t : t ∈ T} be an independency. For a countable partition

{T_1, T_2, . . . } of T, the subcollections F_{T_i} = {F_t : t ∈ T_i}, i ∈ N+, form a partition of the original independency.

Proposition 9.2. Every partition of an independency is an independency.

9.1 Independence of random variables

For each t in some index set T , let Xt be a random variable in a measurable space (Et, Et). As defined above, the random variables Xt are independent if {σXt : t ∈ T } is an independency. This is used to adapt the previous result Proposition 9.1 to random variables.

Proposition 9.3. The random variables X1,...,Xn are independent if and only if

E[f1(X1) ··· fn(Xn)] = E[f1(X1)] ··· E[fn(Xn)] , (9.3)

for all f_1 ∈ E_{1,+}^fn, . . . , f_n ∈ E_{n,+}^fn.

Proof. We need to show that (9.2) holds for all positive V_1 ∈ σX_1, . . . , V_n ∈ σX_n if and only if (9.3) holds for all positive f_1 ∈ E_{1,+}^fn, . . . , f_n ∈ E_{n,+}^fn. This is immediate from Theorem 6.2: V_i ∈ σX_i if and only if V_i = f_i(X_i) for some f_i ∈ E_i^fn.

We can translate this into a statement about the distributions of the random variables. In particular, let π be the joint distribution of X_1, . . . , X_n, and µ_1, . . . , µ_n their marginals. Then rewriting (9.3) yields

∫_{E_1×···×E_n} π(dx_1, . . . , dx_n) f_1(x_1) · · · f_n(x_n) = ∫_{E_1} µ_1(dx_1) f_1(x_1) · · · ∫_{E_n} µ_n(dx_n) f_n(x_n) . (9.4)

Clearly, that this equality holds for all positive measurable f1, . . . , fn is equivalent to saying that π = µ1 × · · · × µn.

Proposition 9.4. The random variables X_1, . . . , X_n are independent if and only if their joint distribution is the product of their marginal distributions.

Finally, we establish that functions of independent random variables are also independent.

Proposition 9.5. Measurable functions of independent random variables are independent.

Exercise 23 (Independence of functions of independent random variables): Prove Proposition 9.5.

Hint: Let Yi = fi(Xi), i = 1, . . . , n. Then σYi ⊂ σXi.

9.2 Sums of independent random variables

Sums of sequences of random variables are of constant interest in probability and statistics. We focus briefly on the distribution of the sum of two independent random variables. Let X and Y be independent R-valued random variables with distributions µ and ν, respectively. Then the distribution of (X, Y) is the product measure µ × ν, and the distribution of X + Y, µ ∗ ν, is given by

(µ ∗ ν)f = E[f(X + Y)] = ∫_R µ(dx) ∫_R ν(dy) f(x + y) . (9.5)

 b a  b c [e−rX ] = and [e−rY ] = . E b + r E b + r

What is the distribution of X + Y ? We have (by independence)

 b a b c  b a+c [e−r(X+Y )] = [e−rX ] [e−rY ] = = = [e−rZ ] , E E E b + r b + r b + r E where Z is gamma-distributed with parameters (a + c, b). Because the Laplace transform charac- terizes the distribution for positive random variables, we know that Z =d X + Y .

9.3 Tail fields and Kolmogorov’s 0-1 law

Let (G_n) be a sequence of sub-σ-algebras of H. For the purposes of this section, we think of G_n as the information revealed by the nth trial of an experiment. Then the information about the future after n is T_n = ⋁_{m>n} G_m. The σ-algebra consisting of events whose occurrences are unaffected by anything in finite time is the tail σ-algebra, T = ⋂_n T_n.

Example 9.2. Tail σ-algebra. Let X_1, X_2, . . . be random variables in R, and S_n = X_1 + · · · + X_n.
a) The event {ω : lim_n S_n(ω) exists} belongs to T_n for every n, so it belongs to T.
b) Likewise, {lim sup_n (1/n) S_n > b} is unaffected by the first n variables, and hence it belongs to T.
c) Conversely, {lim sup_n S_n > b} is not in T—we could change X_1 and change the event.
d) Let B be a Borel subset of R. Let {X_n ∈ B i.o.}, read "X_n is in B infinitely often", be the set of ω for which Σ_n 1_B ◦ X_n(ω) = +∞. This event belongs to T.
e) The event {S_n ∈ B i.o.} is not in T.

When the sequence (Gn) is an independency, something special happens: each of the events in T has probability 0 or 1.

Theorem 9.6 (Kolmogorov’s 0-1 law). Let G1, G2,... be independent. Then P(H) is either 0 or 1 for every event H in the tail σ-algebra T .

Proof. By Proposition 9.2 on partition independencies, {G_1, . . . , G_n, T_n} is an independency for each n, which implies that so is {G_1, . . . , G_n, T} for each n, since T ⊂ T_n. Thus, by definition {T, G_1, G_2, . . . } is an independency, and so is {T, T_0} by Proposition 9.2 (again).

Pick any two events H1 ∈ T and H2 ∈ T0. Then P(H1 ∩ H2) = P(H1)P(H2). Since T ⊂ T0, for any H1 ∈ T , we have H1 ∈ T0. Thus, for H1 ∈ T ,

P(H1) = P(H1)P(H1) , which means that P(H1) equals either 0 or 1.

As a corollary, every random variable in the tail σ-algebra (for example, lim sup_n (1/n) S_n) is almost surely constant.

Corollary 9.7. Let G_1, G_2, . . . be independent, and let V be a random variable in R̄ such that V ∈ T. Then there is a constant c ∈ R̄ such that V =_{a.s.} c.

Exercise 24: Prove Corollary 9.7.

10 Conditioning and disintegration

Reading: Çinlar [Çin11], Chapter IV. Supplemental:

Conditioning is one of the most important concepts in modern probability, especially for applications in statistics and machine learning. Martingales and Markov processes cannot even be defined without conditioning, and the starting point of Bayesian inference is conditioning on data. Hierarchical models, latent variable models, stochastic processes—none of these work without conditioning.

10.1 Conditional expectations

For two R-valued random variables X and Y , a heuristic notion of the conditional expectation E[X | Y = y] is an estimate of X given the information contained in the event {Y = y}, for some fixed value y. This is a pretty good heuristic, and one to keep in mind as we develop a more technical view of conditional expectation.

Some intuition. Let (Ω, H, P) be a probability space, F a sub-σ-algebra of H, and X an R+-valued random variable. Çinlar reminds us to regard F both as a collection of events and as the collection of all F-measurable random variables. Let H be an event in H and assume that P(H) > 0. Fix some ω ∈ Ω, and suppose that all we know is that ω ∈ H. Given this information, our best estimate of X(ω) should be the "average" over H,

E[X | H] = (1/P(H)) ∫_H P(dω) X(ω) = (1/P(H)) E[X 1_H] .

The quantity E[X | H] is a number called the conditional expectation of X given the event H. Note that we’re conditioning on an event, not a random variable. In general, we will condition on events or collections of events (i.e., σ-algebras). Conditioning on a random variable Y is really short-hand for conditioning on σY .

Now suppose that the sub-σ-algebra F is generated by a measurable partition (H_n) of Ω. For fixed ω, consider our estimate of X(ω) given the information F: we can tell which of the events H_1, H_2, . . . includes ω, and therefore our estimate is

X̄(ω) = E[X | F](ω) = Σ_n E[X | H_n] 1_{H_n}(ω) . (10.1)

If we do this for each ω ∈ Ω, we have defined a random variable X̄ : Ω → R+. This is the conditional expectation of X given F.

Observe that because X̄ = E[X | F] is a random variable, important technical considerations already arise. For example: Is it unique? Under what conditions is it integrable? In order to generalize to arbitrary F, two properties from the example above are key. First, X̄ is F-measurable; it is determined by the information in F. Second, E[V X] = E[V X̄] for every V that belongs to F+. (These are worth checking.) These two properties will be used to define conditional expectation.

Exercise 25: Show that X̄ in Eq. (10.1) belongs to F.

Exercise 26: Show that for X̄ in Eq. (10.1), E[V X] = E[V X̄] for every V that belongs to F+.

Hint: First, fix n and let V = 1_{H_n} and show the desired equality. Then, extend that to arbitrary V ∈ F+ using the MCT, observing that for this particular F, all such V = Σ_n a_n 1_{H_n}.
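A small numerical rendering of (10.1), assuming NumPy: draws of ω are approximated by Monte Carlo, F is generated by the partition {Y = k}, and E[X | F] is the partition-wise average; the particular X and Y are illustrative.

import numpy as np

rng = np.random.default_rng(6)
omega = rng.uniform(size=1_000_000)    # stand-in for draws of omega
X = omega**2                           # a positive random variable
Y = np.floor(4 * omega).astype(int)    # generates a 4-cell partition of Omega

# E[X | F] as in (10.1): on each cell H_k = {Y = k}, replace X by
# its average over that cell.
Xbar = np.empty_like(X)
for k in np.unique(Y):
    cell = (Y == k)
    Xbar[cell] = X[cell].mean()

# Projection property: E[V X] = E[V Xbar] for V in F_+, e.g. V = Y.
print((Y * X).mean(), (Y * Xbar).mean())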

Definition. Let F be a sub-σ-algebra of H. The conditional expectation of X given F, denoted E[X | F], is defined as follows:
i) For X in H+, E[X | F] is any random variable X̄ that satisfies

a) measurability: X̄ belongs to F+;
b) projection:^14 E[V X] = E[V X̄] for every V that belongs to F+.
Then we write E[X | F] = X̄ and call X̄ a version of E[X | F].
ii) For arbitrary X ∈ H, if E[X] exists, then we define

E[X | F] = E[X^+ | F] − E[X^− | F] .

Otherwise, if E[X^+] = E[X^−] = +∞, then E[X | F] is left undefined.

Remark. Observe that for X ∈ H+, the projection property is equivalent to the condition that

E[1_H X] = E[1_H X̄] , H ∈ F .

To see this, we can use the MCT on both sides of the equality to extend from indicators to simple functions, and then to arbitrary positive V ∈ F.

Uniqueness and versions. If Y and Z are random variables belonging to F+, and if E[1_H Y] = E[1_H Z] for every H ∈ F, then Y =_{a.s.} Z. That is, Y and Z are equivalent up to null sets. To see this, let H_{q,r} be of the form {Y < q < r < Z}, for some rational numbers q < r. Then

q P(H_{q,r}) = E[q 1_{H_{q,r}}] > E[Y 1_{H_{q,r}}] = E[Z 1_{H_{q,r}}] > E[r 1_{H_{q,r}}] = r P(H_{q,r}) ,

but this can only be true if P(H_{q,r}) = 0 for all q < r ∈ Q+, and likewise for the reverse ordering Z < q < r < Y. Therefore, Y =_{a.s.} Z. We use this fact to establish uniqueness of conditional expectation.

Proposition 10.1. Let X be a random variable belonging to H+. Then the conditional expectation defined above is unique up to almost-sure equivalence.

^14 "Projection" is intentional: an alternative definition of conditional expectation is via projection in Hilbert spaces. See Çinlar [Çin11].

Proof. Consider two versions X̄ and X̄′ of E[X | F] for X ≥ 0. Then both versions belong to F+ and E[V X] = E[V X̄] = E[V X̄′] for all V ∈ F+. Therefore, X̄ =_{a.s.} X̄′.

Conversely, if E[X | F] = X̄ and if X̄′ ∈ F+ and X̄′ =_{a.s.} X̄, then X̄′ satisfies the measurability and projection properties in the definition and hence is a version of E[X | F].

The uniqueness extends to arbitrary X for which E[X] exists via an integrability argument (see Çinlar, IV.1.6f). We see that "the conditional expectation" actually refers to an equivalence class of random variables (versions), but for probabilistic purposes, their almost-sure equivalence is strong enough to justify the definite article. In light of this, some authors may say E[X | F] = X̄ almost surely.

Integrability. With V = 1, if X ∈ H+, then E[X] = E[E[X | F]]. Therefore, if X is integrable then so is E[X | F]. For general X belonging to H, the previous argument applies separately to X^+ and X^−. Hence, if X is integrable then so is E[X | F] = X̄, and the projection property can be expressed as E[V(X − X̄)] = 0 for every bounded V ∈ F.

Existence. The following proof uses the Radon–Nikodym derivative to show the existence of conditional expectations. A different proof relying on projections in Hilbert spaces can be found in Çinlar (and elsewhere).

Theorem 10.2. Let X ∈ H+. Let F be a sub-σ-algebra of H. Then E[X | F] exists and is unique up to equivalence.

Proof. For each event H in F, define

P̄(H) = P(H)  and  Q(H) = ∫_H P(dω) X(ω) . (10.2)

That is, P̄ is the restriction of P to F, and Q is obtained by averaging X over the sets in F. On the measurable space (Ω, F), P̄ is a probability measure and Q is a measure that is absolutely continuous with respect to P̄. Hence, by Theorem 8.3 (Radon–Nikodym), there exists X̄ belonging to F+ such that for every H ∈ F,

E[X 1_H] = ∫_Ω P(dω) X(ω) 1_H(ω)   (definition)
         = ∫_Ω Q(dω) 1_H(ω)   (Equation (10.2))
         = ∫_Ω P̄(dω) X̄(ω) 1_H(ω)   (Radon–Nikodym theorem)
         = ∫_Ω P(dω) X̄(ω) 1_H(ω) = E[X̄ 1_H]   (Equation (10.2)) .

Recall (from the remark above) that this is equivalent to E[X V] = E[X̄ V] for all V belonging to F+. This shows that X̄ is a version of E[X | F]. Uniqueness was established in Proposition 10.1 above.

Properties similar to expectation. For the most part, conditional expectations behave much like expectations, with the caveat that one should remember that E[X | F] is a random variable, so statements must be modified with probabilistic qualifiers like "almost surely". See Çinlar [Çin11, Remark IV.1.9] for a detailed explanation.

Proposition 10.3. Let F be a sub-σ-algebra of H, let X, Y, (X_n)_{n≥1}, (Y_n)_{n≥1} be R̄-valued random variables, and let a, b, c be R-valued constants. Furthermore, assume that all of the conditional expectations in the following claims exist. Then the following properties hold almost surely:

• Monotonicity: X ≤ Y ⇒ E[X | F] ≤ E[Y | F].
• Linearity: E[aX + bY + c | F] = aE[X | F] + bE[Y | F] + c.

• Monotone Convergence Theorem: X_n ≥ 0, X_n ↗ X ⇒ E[X_n | F] ↗ E[X | F].

• Dominated Convergence Theorem: X_n → X, |X_n| ≤ Y with Y integrable ⇒ E[X_n | F] → E[X | F].
• Jensen's inequality: f convex ⇒ E[f(X) | F] ≥ f(E[X | F]).
The proof is left as an exercise.

Special properties. In addition to its expectation-like properties, the conditional expectation has two special properties that capture how it behaves with more or less information. The first special property is conditional determinism: if a random variable W is determined by F, i.e., it is F-measurable, then it should be treated as a deterministic number in the presence of the information contained in F. The second special property is repeated conditioning. Let F ⊂ G be two sub-σ-algebras of H. At a high level, F contains less information than G. Çinlar says to "think of F as the information a fool has, and G as that a genius has: the genius cannot improve on the fool's estimate, but the fool has no difficulty worsening the genius's. In repeated conditioning, the fool wins all the time."

Theorem 10.4. Let F and G be sub-σ-algebras of H. Let W and X be random variables such that E[X] and E[WX] exist. Then the following hold:
a) Conditional determinism: W ∈ F ⇒ E[WX | F] = W E[X | F].
b) Repeated conditioning: F ⊂ G ⇒ E[E[X | G] | F] = E[E[X | F] | G] = E[X | F].

Proof. We give the proofs for positive W and X; the general case follows from the usual arguments.

a) Suppose X ∈ H+ and W ∈ F+. Then X̄ = E[X | F] ∈ F+ and

E[V · (WX)] = E[(VW) · X] = E[(VW) · X̄] = E[V · (W X̄)] ,

for every V ∈ F+, by the projection property and the fact that VW ∈ F+. Hence, W X̄ = W E[X | F] is a version of E[WX | F].

b) Since the random variable E[X | F] belongs to F+ by definition, and F ⊂ G, we have that E[X | F] belongs to G+. Therefore, by conditional determinism, E[E[X | F] | G] = E[X | F], which is the second equality above.

Finally, let Y = E[X | G]; we will show that X̄ = E[X | F] is a version of E[Y | F]. To that end, observe that X̄ belongs to F+. Now let V ∈ F+. By definition, E[V X] = E[V X̄]. Because V ∈ F+, it also belongs to G+, so by definition E[V Y] = E[V X]. This shows that E[V Y] = E[V X̄], and thus E[E[X | G] | F] = E[X | F].

Conditional expectations given random variables. Let Y be a random variable taking values in some measurable space (E, E). Recall from Theorem 6.2 that the σ-algebra generated by Y, σY, contains all R̄-valued random variables of the form f ◦ Y, for some measurable f. For X belonging to H, the conditional expectation given Y is defined to be E[X | σY]. Similarly, for a collection of random variables {Y_t : t ∈ T}, the conditional expectation of X given the collection is E[X | {Y_t : t ∈ T}]. Observe that these are really the same definition because we can always define Y = (Y_t)_{t∈T}. The following result is an immediate consequence of these definitions and Theorem 6.2. It says that E[X | σY] = f(Y) for some measurable f.

Theorem 10.5. Let X belong to H+, and let Y be a random variable taking values in some measurable space (E, E). Then, every version of E[X | σY] has the form f ◦ Y for some E-measurable function f : E → R̄+. Conversely, f ◦ Y is a version of E[X | σY] if and only if

E[(f ◦ Y) · (h ◦ Y)] = E[X · (h ◦ Y)] , for every h belonging to E+ . (10.3)

Independence. If X and Y are independent, then conditioning on σY tells us nothing about X.

Proposition 10.6. Suppose that X and Y are independent random variables taking values in (D, D) and (E, E), respectively. If f belongs to D+ and g to E+, then

E[(f ◦ X)(g ◦ Y) | σY] = (g ◦ Y) E[f ◦ X] .

Exercise 27 (Independence in conditional expectation): £ Prove Proposition 10.6. ¢ ¡

10.2 Conditional probabilities and distributions Conditional expectations are (almost) all we need to define conditional probabilities. For a sub-σ- algebra F of H, for each event H ∈ H,

P[H | F] = E[1H | F] , (10.4) is called the conditional probability of H given F. In more elementary settings (i.e., undergrad- uate probability), conditional probability is defined in terms of events. For events G and H, the conditional probability of H given G is defined to be any number P(H | G) ∈ [0, 1] satisfying P(G ∩ H) = P(G)P(H | G) . (10.5) When P(G) > 0, this is unique. However, a more general version will define conditional probability in terms of σ-algebras, allowing us to handle events with zero measure, such as conditioning on the event that a R-valued random variable takes a particular value. For this, we need some (interesting!) technical tools.

61 10.3 Interjection: kernels and product spaces We need a bit of mathematical background to study conditional probability. In particular, we need some tools for moving between measurable spaces. This material comes from C¸inlar [C¸in11, Ch. I.6].

Transition kernels. Let (E, E) and (F, F) be two measurable spaces, and let K be a mapping from ¯ E × F into R+. Then K is called a transition kernel from (E, E) into (F, F) if a) the mapping x 7→ K(x, B) is E-measurable for every set B ∈ F; and b) the mapping B 7→ K(x, B) is a measure on (F, F) for every x ∈ E. If the mapping B 7→ K(x, B) is a probability measure on (F, F) for every x ∈ E, then K is a probability transition kernel. If you’ve ever studied Markov chains on discrete spaces, you’ve seen a particular (common) version of the latter. When E = F = {1, . . . , m} is equipped with its discrete σ-algebras, the probability transition kernel is specified by the K(x, {y}). This is often denoted by the m-by-m matrix of positive numbers P , with X X K(x, B) = K(x, {y}) = Px,y ,B ⊂ {1, . . . , m} . y∈B y∈B

This special case (of discrete E,F ) informs the choice of notation Kf and µK below. For general E and F , it can be awkward/difficult to define K with the second argument taking sets B ∈ F. A common way to overcome this difficulty is to obtain a transition kernel by integrating a function k : E × F → R+ against a finite measure ν on (F, F): Z K(x, B) = ν(dy)k(x, y) , x ∈ E,B ∈ F . (10.6) B

Obtaining measures and functions from transition kernels. One of the most useful properties of transition kernels is that they can be used to obtain functions and measures from other functions and measures. Theorem 10.7. Let K be a transition kernel from (E, E) into (F, F). Then: i) Z Kf(x) = K(x, dy)f(y) , x ∈ E, (10.7) F

defines a function Kf that belongs to E+ for every function f belonging to F+. ii) Z µK(B) = µ(dx)K(x, B) ,B ∈ F , (10.8) E defines a measure on (F, F) for each measure µ on (E, E).

62 iii) Z Z (µK)f = µ(Kf) = µ(dx) K(x, dy)f(y) (10.9) E F

for every measure µ on (E, E) and function f belonging to F+.

Proof sketch. The proof of i) relies on the fact that K(x, B) is a measure on (F, F), following the usual simple function/limit of simple functions proof technique. ii) and iii) are proven at the same time, by using the characterization theorem for the integral, ¯ Theorem 7.6. Define L : F+ → R+ by setting L(f) = µ(Kf). Clearly, if f = 0 then L(f) = 0. Linearity follows from the linearity of integration with respect to the measure B 7→ K(x, B) for each x and by the linearity of integration with respect to µ. Similarly, the MCT applied to the measures B 7→ K(x, B) and to µ characterize L(f) as the integral νf, for some unique ν. Taking f = 1B shows that µK = ν. See C¸inlar [C¸in11, Theorem I.6.3] for the full proof.

Products of kernels. We can construct a transition kernel from the product of two kernels. Specifically, let K be a transition kernel from (E, E) into (F, F), and L from (F, F) into (G, G). Then their product transition kernel KL from (E, E) into (G, G) is defined by Z KL(x, B) = K(x, dy)L(y, B) , x ∈ E,B ∈ G . F

An equivalent way of defining this (or any transition kernel) is by defining (KL)f for every f ∈ E+. (We won’t show this; see C¸inlar [C¸in11, Remark I.6.4].)

Markov kernels. A transition kernel from (E, E) into (E, E) is a transition kernel on (E, E). Such a kernel K is a Markov kernel if K(x, E) = 1 for every x, and a sub-Markov kernel if K(x, E) ≤ 1 for every x. Transition kernels can be recursively multiplied to obtain a sequence of kernels, K0 = I,K1 = K,K2 = KK,K3 = KK2 ,...,

where I is the identity kernel on (E, E) defined by I(x, A) = 1A(x), x ∈ E,A ∈ E. If K is Markov, so is Kn for every integer n ≥ 0.

Kernels finite and bounded. As with measures, there are various notions of “well-behavedness” relating to boundedness. For example, when viewed as a measure, a kernel K from (E, E) into (F, F) is said to be finite if K(x, F ) < ∞ for each x, and σ-finite if the measure B 7→ K(x, B) is σ-finite for each x. When viewed as a function, K is said to be bounded if x 7→ K(x, F ) is bounded. There are further definitions, but for the purposes of these lecture notes, we will assume (mostly for convenience) that K is finite as a measure and bounded as a function. See C¸inlar [C¸in11, p. 40] for the details.

Functions on product spaces. Note: I skipped this section in lecture. The two most important results are Theorems 10.10 and 10.11 below. Their importance will become clear when we apply

63 them in the context of conditioning. In order to prove them, we need two intermediate results, the first of which has to do with the measurability of sections. Specifically, let f be function defined on a product space, f : E × F → G. Then for a fixed x ∈ E, the mapping fx : F → G, y 7→ f(x, y) is the section of f at x; and likewise for fy : E → G. As usual, we are particularly interested in G ⊂ R and G = B(G), in which case we use our usual notation f ∈ E ⊗ F to say that f is E ⊗ F/B(R)-measurable. Lemma 10.8 (Measurable sections). Let f ∈ E ⊗ F. Then the sections are measurable. That is, x 7→ f(x, y) belongs to E for each y ∈ F , and y 7→ f(x, y) belongs to F for each x ∈ E.

Exercise 28 (Measurable sections): £ ¢ Prove Lemma 10.8. (See C¸inlar [C¸in11¡ , Exercise I.2.2] for a hint.)

Note that the converse is not necessarily true: it is possible for each of the sections to be measurable, but f is not E ⊗ F-measurable. Stronger conditions on the sections are needed; for example, left- or right-continuity of x 7→ f(x, y) for each y ∈ F . [See C¸in11, Exercise I.6.28]. For functions defined on a product space, the following result generalizes the operation K : F fn → Efn, f 7→ Kf of Theorem 10.7. Proposition 10.9. Let K be a finite kernel from (E, E) into (F, F). Then, for every positive function f that belongs to E ⊗ F, Z T f(x) = K(x, dy)f(x, y) , x ∈ E, F

fn fn defines a function T f that belongs to E+. Moreover, the transformation T :(E ⊗ F)+ → E+ is linear and continuous under increasing limits. That is, fn a) T (af + bg) = aT f + bT g for positive f and g in (E ⊗ F)+ and a, b ∈ R+; and fn b) T fn % T f for every positive sequence (fn) ⊂ (E ⊗ F)+ with fn % f. Some remarks. This theorem holds even if we relax the finiteness condition on K, but the proof becomes more involved. See C¸inlar [C¸in11, Proposition 6.9]. fn For the purposes of stating the proof, I reverted back to the (E ⊗ F)+ notation, to avoid confusion about what T operates on. I’ll switch back to (E ⊗ F)+ now. The proof highlights the properties of K as a measure for each x ∈ E.

Proof. Let f be a positive function belonging to (E ⊗ F)+. By Lemma 10.8, for each x ∈ E the section fx : y 7→ f(x, y) belongs to F+. Therefore, T f(x) is the integral of fx with respect to the measure Kx : B 7→ K(x, B). Thus, T f(x) is a well-defined number for each x ∈ E, and the linearity property a) follows from the linearity of integration with respect to Kx for each x; the continuity under increasing limits property b) follows from the MCT for the measures Kx. We assumed that K is finite, i.e., K(x, F ) < ∞ for each x ∈ E, so clearly x 7→ K(x, F ) is bounded. Boundedness of K implies that T f is well-defined and bounded for each bounded f ∈ E ⊗ F. E-measurability follows from the usual indicator→simple function→positive function→measurable

64 function chain of arguments. We will just show that functions of the form T 1A×B for A ∈ E,B ∈ F are E-measurable. Specifically, Z T 1A×B(x) = K(x, dy)1A×B(x, y) = 1A(x)K(x, B) , F is a product of two E-measurable functions (A ∈ E and x 7→ K(x, B) is E-measurable by definition), so T 1A×B(x) is E-measurable. Now, because of properties a) and b), similar identities can be shown for simple functions and increasing limits of simple functions (i.e., positive measurable functions).

Measures on product spaces. Note: I resumed here in lecture. For our purposes, this is the key construction, and something we’ve been working towards for awhile. Specifically, we know from experience that specifying joint probability distributions with any non-trivial structure over more than a few random variables is very hard. In practice, probability models are often constructed one random quantity at a time, through a chain or hierarchy of conditional distributions. This mode of model-specification dominates Bayesian modeling. The following result shows that this very general method is valid. Theorem 10.10 (Marginal-conditional-joint construction). Let µ be a finite measure on (E, E), and K a finite transition kernel from (E, E) into (F, F). Then π is the unique (finite) measure on the product space (E × F, E ⊗ F) satisfying, Z π(A × B) = µ(dx)K(x, B) ,A ∈ E,B ∈ F . (10.10) A Moreover, Z Z πf = µ(dx) K(x, dy)f(x, y) , f ∈ (E ⊗ F)+ . (10.11) E F

Proof. Clearly, π defined by (10.10) is a measure on the product space that inherits the finiteness of µ and K. To show that it is unique, assume thatπ ˆ is another measure satisfying (10.10). Then

π(A × B) =π ˆ(A × B) ,A ∈ E,B ∈ F .

Thus, π andπ ˆ agree on the set of measurable rectangles A × B, which is closed under unions and which generates E ⊗ F. Thus, by Proposition 4.2, π =π ˆ. To prove (10.11), note that by Proposition 10.9 it can be written as Z πf = µ(dx)T f(x) = µ(T f) , F

with T f belonging to E+. Now Theorem 7.6 can be used, with L(f) = µ(T f), to show that the equality holds.

Note that this theorem holds under more general conditions on µ and K, but in statistics we are typically dealing with probability measures and probability transition kernels, so this is sufficient.

65 Product measures. In the special case that K has the special form K(x, B) = ν(B) for all x ∈ E, for some (finite) measure ν on (F, F), then π is the product measure µ × ν. In this case, the following result (Fubini’s theorem) says that we can integrate in whichever order is convenient. Theorem 10.11 (Fubini). Let µ and ν be finite measures on (E, E) and (F, F), respectively. There exists a finite measure π on (E × F, E ⊗ F) such that, for each f ∈ (E ⊗ F)+, Z Z Z Z πf = µ(dx) ν(dy)f(x, y) = ν(dy) ν(dx)f(x, y) . E F F E

Proof. Let h : E × F → F × E be the transposition mapping (x, y) 7→ (y, x). This is clearly E ⊗ F/F ⊗ E-measurable. Now, for sets A ∈ E,B ∈ F,

π ◦ h−1(B × A) = π(A × B) = µ(A)ν(B) =π ˆ(B × A) ,

which implies thatπ ˆ = π ◦ h−1 via Proposition 4.2. Let fˆ(y, x) = f(x, y). Then

πˆfˆ = (π ◦ h−1)fˆ = π(fˆ◦ h) = πf

since fˆ◦ h(x, y) = fˆ(y, x) = f(x, y).

Kernels and randomization. We know from our study of distribution functions that for a random variable X with distribution function F (and quantile function Q), we can generate a realization of X with a uniform random variable U: X =d Q(U). We end this detour with a result that extends that construction to more general spaces in a way that can be used with conditional distributions. Proposition 10.12. Let K be a probability transition kernel from a measurable space (E, E) into a standard measurable space (F, F). Then there exists a family of F -valued random variables (Yx) with distribution K(x, • ) for every x ∈ E. More precisely, there exists a measurable function f : E × [0, 1] → F such that if U is a Unif[0, 1] random variable, then we may set Yx = f(x, U) for every x ∈ E.

Proof. First, assume that F = [0, 1] and F = B([0, 1]). Define

f(x, u) = inf{r ∈ [0, 1] ∩ Q : K(x, [0, r]) > u} , x ∈ E, u ∈ [0, 1] , and note that f is (E ⊗B([0, 1]))-measurable because the set {(x, u): K(x, [0, r]) < u} is measurable for each r (and by Theorem 3.5, so is their infimum). If U ∼ Unif[0, 1], then setting Yx = f(x, Y ), we have

P{Yx ≤ r} = P{f(x, U) ≤ r} = P{U ≤ K(x, [0, r])} = K(x, [0, r]) , r ∈ [0, 1] ∩ Q .

The intervals [0, r], r ∈ Q, generate B([0, 1]) so f(x, U) has distribution K(x, • ), by Corollary 4.3. For the general case of a standard measurable space (F, F), recall (see Section 3.6) that by definition there is an isomorphism g from F into some Borel subset G ⊂ [0, 1]. Now, we can construct a tran- sition probability kernel from (E, E) into (G, G) by composing K with g: Kˆ (x, B) = K(x, g−1B). Because g is an isomorphism, it is straightforward to show that Kˆ is a transition probability kernel from (E, E) into (G, G). Therefore, the previous part of the proof applies to guarantee the existence

66 of the random variables (Yˆx) taking values in G, with distribution Kˆ (x, • ). Let h : G → F be the functional inverse of g and define Yx = h(Yˆx). Observe that ˆ −1 ˆ −1 −1 −1 P{Yx ∈ B} = P{Yx ∈ h B} = K(x, h B) = K(x, (g ◦ h )B) = K(x, B) ,

which shows that Yx has distribution K(x, • ) for each x ∈ E.

10.4 Conditional probabilities and distributions, continued Recall that before our detour into kernels and product spaces, we defined the conditional probability as

P[H | F](ω) = E[1H | F](ω) ,H ∈ H, ω ∈ Ω , (10.12) where F is a sub-σ-algebra of H, and I am making explicit the dependence on ω. Now let Q(H) be a version of P[H | F] for each H ∈ H, assuming that Q(∅) = 0 and Q(Ω) = 1. So-defined, Q(H) is a random variable belonging to F. We denote its value at ω ∈ Ω by Qω(H).

The mapping Q :(ω, H) 7→ Qω(H) looks like a transition probability kernel from (Ω, F) into (Ω, H): the mapping ω 7→ Qω(H) is F-measurable for each H ∈ H (by definition of conditional expectation); and by the monotone convergence property of Proposition 10.3, X Qω(∪nHn) = Qω(Hn) , (Hn) disjointed in H , (10.13) n for almost every ω. The almost is a limitation: we would need to pin down the null sets (and therefore the almost sure event Ωh) for which (10.13) holds, but this generally depends on the sequence h = (Hn). We might run into some serious technical difficulties in specifying the set of ω in Ω for which H 7→ Qω(H) is a probability measure. This set is Ω0 = ∩hΩh, where the intersection is taken over all disjointed sequences h ⊂ H. As C¸inlar writes, “Ω0 is generally a miserable object.”

Despite these challenges, it is often possible to pick versions of Q(H) such that H 7→ Qω(H) is a probability measure for all ω ∈ Ω (i.e., Ω0 = Ω).

Regular versions. Let Q(H) be a version of P[H | F] for every H ∈ H. Then Q :(ω, H) 7→ Qω(H) is said to be a regular version of the conditional probability P[ • | F] if Q is a transition probability kernel from (Ω, F) into (Ω, H). We also call Q a regular conditional probability. In the literature, these versions are studied and used almost exclusively. The main reason for their ubiquity is the following.

Proposition 10.13. Suppose that P[ • | F] has a regular version Q. Then Z 0 0 QX : ω 7→ QωX = Qω(dω )X(ω ) (10.14) Ω is a version of E[X | F] for every random variable X whose expectation exists.

Proof. It is sufficient to prove this for positive X belonging to H. By Theorem 10.7 applied to the transition kernel Q and the function X, we have that QX belongs to F+. We just need to check the projection property. That is, we need to show that for V belonging to F+, E[VX] = E[V QX] .

67 Fix V . For X = 1H , this follows from the definition of Q(H) as a version of E[H | F]. This extends to simple and then arbitrary positive random variables by the linearity and monotone convergence properties of the operators X 7→ QX (Proposition 10.3) and Z 7→ E[Z] (Section7).

Conditional distributions. Let Y be a random variable taking values in a measurable space (E, E), and let F be a sub-σ-algebra of H. Then the conditional distribution of Y given F is any transition probability kernel L :(ω, B) 7→ Lω(B) from (Ω, F) into (E, E) such that

P[Y ∈ B | F](ω) = Lω(B) , ω ∈ Ω,B ∈ E . (10.15)

If P[ • | F] has a regular version Q, then

Lω(B) = Qω{Y ∈ B} , ω ∈ Ω,B ∈ E , (10.16)

defines a version L of the conditional distribution of Y given F. In general the problem is to find a regular version of P[ • | F] restricted to σY . The following standard existence result does so under fairly general conditions, but requires some regularity conditions on (E, E). Theorem 10.14. Let Y be a random variable taking values in (E, E). If (E, E) is a standard measurable space, then there exists a version of the conditional distribution of Y given F (10.15). Moreover, P[ • | F] has a regular version. Some remarks. The standard measurable space is key: we need to be able to map to R and take advantage of the properties of distribution functions. Specifically, the proof is constructive and looks like a much more involved version of that of Proposition 10.12. There are a number of technical details related to measurability that require careful attention, but the basic idea is very similar: construct the distribution function by mapping (E, E) into ([0, 1]), B([0, 1])), and then construct the kernel from the distribution function. See C¸inlar [C¸in11, Theorem IV.2.10].

The second claim, that P[ • | F] has a regular version, follows from the first: define Y (ω) = ω for all ω ∈ Ω. Then (E, E) = (Ω, H) and the conditional distribution of Y given F is the regular version of P[ • | F] given by (10.15).

Conditioning on random variables. In most applications, we are interested in conditioning on a random variable. As with conditional expectations, this means conditioning on the σ-algebra generated by the random variable. The following result shows how to do so via a transition probability kernel K, giving proper meaning to the conditional distribution of Y given X = x as,

P[ • | X = x] = K(x, • ) . (10.17)

Theorem 10.15. Suppose X is a random variable taking values in a measurable space (D, D) and Y is a random variable taking values in a standard measurable space (E, E). Then there is a transition probability kernel K from (D, D) into (E, E) such that

Lω(B) = K(X(ω),B) ,B ∈ E ,

is a version of the conditional distribution of Y given F = σX.

68 Proof. Applying Theorem 10.14 with F = σX implies the existence of a regular version   Lω(B) = Qω{Y ∈ B} = P[Y ∈ B | F](ω) = E 1{Y ∈B} F (ω) , ω ∈ Ω,B ∈ H , where the second equality is just the definition re-stated as a reminder. Furthermore, it makes clear that Lω(B) belongs to F+ (by definition of conditional expectation). On the other hand, we know from Theorem 6.2 that a mapping V :Ω → R¯ belongs to σX if and only if V = f ◦ X for some ¯ D-measurable f : D → R. Therefore, we can define K(X(ω),B) = Lω(B) for each B ∈ E and check that K so-defined is a transition probability kernel. By regularity of Qω, B 7→ K(X(ω),B) is almost surely a probability measure on (E, E), and by the previous arguments ω 7→ K(X(ω),B) belongs to σX.

Disintegration. In the previous section, we constructed a joint measure π from a measure µ and a transition kernel K, with (10.11) Z Z πf = µ(dx) K(x, dy)f(x, y) , f ∈ (D ⊗ E)+ , (10.18) D E and π satisfying Z π(A × B) = µ(dx)K(x, B) ,A ∈ D,B ∈ E . (10.19) A A natural question is whether a measure π on (D×E, D⊗E) has a disintegration into components, in which case we would write (informally),

π(dx, dy) = µ(dx)K(x, dy) , x ∈ D, y ∈ E. (10.20)

The probabilistic interpretation is as follows: if π is the joint distribution of X and Y , then µ(dx) is “the probability that X is in the small set dx centered at x” and K(x, dy) is “the conditional probability that Y is in the small set dy, given that X is equal to x”. The disintegration of π is the converse of the construction of π in Theorem 10.10. The following result is the exact converse of that theorem, except that we also require here that (E, E) be standard.

Theorem 10.16 (Disintegration). Let π be a probability measure on the product space (D ×E, D ⊗ E), and suppose that (E, E) is standard. Then there exist a probability measure µ on (D, D) and a transition probability kernel K from (D, D) into (E, E) such that (10.20) holds.

The theorem has some implications. One is that for every f belonging to (D ⊗ E)+, Z E[f(X,Y ) | σX] = K(X, dy)f(X, y) , (10.21) E which further implies that Z Z E[f(X,Y )] = µ(dx) K(x, dy)f(x, y) . (10.22) D E

69 Proof of Theorem 10.16. This can be cast as a special case of Theorem 10.14: Let W = D × E, W = D ⊗ E, P = π. On the probability space (W, W, π), define the random variable X(w) = x and Y (w) = y, for w = (x, y) ∈ W . Let µ be the distribution of X: µ(A) = π(A × E), A ∈ D. Since Y takes values in a standard measurable space (E, E), there is a regular version L of the conditional distribution of Y given F = σX. Observe that F consists of measurable rectangles of the form A×E, A ∈ D, and therefore a random ¯ variable V belongs to F+ if and only if V (w) = V (x, y) = v(x), for some mapping v : D → R+ belonging to D+.

Now, by the same argument we used in the proof of Theorem 10.16, this implies that Lw(B) = K(X(w),B), where K is a transition probability kernel from (D, D) into (E, E). Therefore, using the projection property of Lw(B), and writing Eπ for expectation with respect to π, Z π(A × B) = Eπ[1A ◦ X 1B ◦ Y ] = Eπ[1A ◦ XK(X,B)] = µ(dx)1A(x)K(x, B) . D

This establishes (10.20), and also (10.18) for f = 1A×B; it is extended to measurable f by the usual arguments.

10.5 Conditional independence This is an important generalization of independence (Section9), which is recovered in the special case of conditioning on a trivial σ-algebra.

Let F, F1,..., Fn be sub-σ-algebras of H. Then F1,..., Fn are said to be conditionally indepen- dent given F if, for all positive random variables V1,...,Vn belonging to F1,..., Fn, respectively,

E[V1 ··· Vn | F] = E[V1 | F] ··· E[Vn | F] . (10.23)

Recall that both sides of this equality are random variables, so there is an implicit “P-almost surely” here. Comparing the definition (10.23) to that of independence (9.2), it is apparent that the only difference is the substitution of E[ • | F] for E[ • ]. All of the results about independence (i.e., Propositions 9.1 to 9.4) have conditionally independent counterparts. If F is trivial—that is, F = {∅, Ω} and we condition on what amounts to nothing—then conditional independence given F is the same as independence.

Our heuristic in Section9 was that F1 is independent from F2 if information from F1 is useless for estimating random variables belonging to F2. An analogous heuristic can be used for conditional independence: given the information in F, the further information in F1 is useless for estimating quantities belonging to F2.

For convenience, let F1⊥⊥F F2 denote that F1 and F2 are conditionally independent given F. Fur- thermore, conditioning on multiple σ-algebras means that we condition on the σ-algebra generated by the union, e.g.,

E[X | F, G] = E[X | σ(F ∪ G)] = E[X | F ∨ G] . I will use the expression on the left-hand side for convenience.

70 Proposition 10.17. The following are equivalent:

a) F1⊥⊥F F2.

b) E[V2 | F, F1] = E[V2 | F] for every positive V2 ∈ F2.

c) E[V2 | F, F1] ∈ F for every positive V2 ∈ F2.

Proof. Assume V,V2,V2 are positive and belong to F, F1, F2, respectively. First, consider a), which is equivalent to (by conditional independence and then the conditional determinism property),

E[V1V2 | F] = (E[V1 | F])(E[V2 | F]) = E[(V1E[V2 | F]) | F] . Therefore,

E[VV1V2] = E[E[VV1V2 | F]] = E[V E[V1V2 | F]] = E[V E[(V1E[V2 | F]) | F]] = E[VV1E[V2 | F]] . On the other hand,

E[VV1V2] = E[E[VV1V2 | F, F1]] = E[VV1E[V2 | F, F1]] . Therefore,

E[VV1E[V2 | F, F1]] = E[VV1E[V2 | F]] .

Now, random variables of the form VV1 generate F, F1 (i.e., all random variables in F, F1 have that form for some V ∈ F and V1 ∈ F1), so this shows a) ⇐⇒ b). By the measurability property of conditional expectations, b) ⇒ c). Conversely, if c) holds, then (by the conditional determinism implied by c) and then by repeated conditioning with F ⊂ (F ∨ F1)),

E[V2 | F, F1] = E[E[V2 | F, F1] | F] = E[V2 | F] , so c) ⇒ b).

I have found the following results useful in my research. The first follows from the previous propo- sition, via the correspondence between simple functions and measurable functions.

Proposition 10.18. For any σ-algebras F, F1, F2, we have F1⊥⊥F F2 if and only if,

P[H | F, F1] = P[H | F] for each event H ∈ F2 .

Corollary 10.19. For any σ-algebras F, F1, F2, we have F1⊥⊥F F2 if and only if (F, F1)⊥⊥F F2.

Exercise 29 (Equivalence of conditional independence relationships): £ Prove Corollary 10.19. ¢ ¡

Proposition 10.20 (Chain rule).

F⊥⊥G(F1, F2,... ) if and only if F⊥⊥(G,F1,...,Fn)Fn+1 for n ≥ 0. In particular,

F⊥⊥G(F1, F2) if and only if F⊥⊥GF1 and F⊥⊥(G,F1)F2 . (10.24)

71 Exercise 30 (Chain rule for conditional independence): £ ¢ Prove (10.24). ¡

The previous results specified how to check for conditional independence via σ-algebras. An al- ternative is to check via an independent randomization. The proof relies on an application of Proposition 10.12. Proposition 10.21. Let X,Y,Z be random variables taking values in standard measurable spaces a.s. (E, E), (F, F), (G, G), respectively. Then X⊥⊥Y Z if and only if X = f(Y,U) for some (F ⊗ B([0, 1]))/E-measurable function f : F × [0, 1] → E and a uniform random variable U that is independent of Y and Z.

10.6 Statistical sufficiency

15

Conditioning and conditional independence have many important applications in statistics (see the classic papers by Dawid [Daw79; Daw80]). Sufficiency is one of the most important. It continues to play an important rule in modern statistics and machine learning. Recall that a statistical model for a random variable X taking values in a sample space (E, E) is a family P of probability measures on (E, E). A statistic statistic is a measurable function S : E → S into some measurable space (F, F). (In practice, S is typically Polish and therefore standard.) A statistic S is called sufficient for a model P if all probability measures P ∈ P have the conditional distribution of X given S. For standard (E, E), we showed above that this means there is some transition probability kernel K from (E, E) into (F, F) such that

P [X ∈ • | S = s] = K(s, • ) , for each P ∈ P . (10.25)

Example 10.1. Sufficiency in coin-flipping. Suppose X = (X1,...,Xn) is a sequence of n i.i.d. flips of a biased coin with probability p of heads, encoded as 1 for heads, 0 for tails. As you may Pn know from an introductory statistics class, S(X) = i=1 Xi is a sufficient statistic. In this case, K(s, • ) is the uniform distribution supported on all binary sequences of length n that have s 1’s. As we know, conditioning on S means conditioning on σS, and therefore any measurable function f of S is also a sufficient statistic because f belongs to σS. Hence, a sufficient statistic typically is not unique; let SP be the set of sufficient statistics for P. There is a large literature from classical statistics on finding a minimal sufficient statistic, which is a statistic T such that for every sufficient statistic S there is some measurable function f such that T = f ◦ S. Modulo some technical caveats, we see (via Theorem 6.2) that a minimal sufficient σ-algebra can be thought of as σT = T σS. S∈SP Sufficient statistic-kernel pairs. It is helpful to think of sufficiency in terms of the pair (S, K). A statistic S may be sufficient for two different models P and P0, but may fail to be sufficient for P ∪ P0. If the kernel K is the same in both cases, then (S, K) is sufficient for the union. More

15This section borrows heavily from the lecture notes of Peter Orbanz, http://www.gatsby.ucl.ac.uk/~porbanz/ teaching/G6106S16/NotesG6106S16_9May16.pdf.

72 importantly, if both S and K are specified, there is a uniquely defined set for which (S, K) is sufficient,

P(S, K) := {P : P [ • | S] = K(S, • )} .

The work of Lauritzen [Lau74], Diaconis and Freedman [DF84], and Diaconis [Dia88] explored the relationship between symmetry and sufficiency, and showed that P(S, K) is a convex set. Under suitable conditions, the extreme points of P(S, K) are those measures which are of the form K(s, • ) for some s ∈ F . Therefore, every P ∈ P(S, K) has the representation Z P ( • ) = K(s, • )νP (ds) , F for some probability measure νP on (F, F). Representations of this type provide the foundations for much of Bayesian statistics. For example, the famous theorem of de Finetti is obtained as a special case.

10.7 Construction of probability spaces Transition kernels can be used to prove the existence of basically all probability spaces that we encounter. The main results to that end are Ionescu-Tulcea’s Theorem and Kolmogorov’s Extention Theorem. The latter is commonly invoked to show the existence of stochastic processes. We won’t study these results here, but C¸inlar [C¸in11, Section IV.4] covers them in detail. You’re now equipped to understand the proofs and you should see them once in your life, so I encourage you to read that section on your own.

73 11 Convergence

Reading: C¸inlar [C¸in11], III.2 Supplemental: C¸inlar [C¸in11], III.1 Proving convergence of sequences of random variables is a core area of probability, and is often applied to statistical problems. The aim of this section is to provide the basic concepts that we will use in our study of martingales and of Markov chains. C¸inlar [C¸in11, Chapter III] contains much more material, but I will streamline the ideas as much as possible. There are three primary modes of convergence that we will study: almost sure convergence, con- vergence in probability, and convergence in distribution (also known as weak convergence). We will also briefly touch on Lp-convergence.

11.1 Convergence of real sequences I will introduce notation and review a few basic ideas about convergence of real sequences. I won’t go over everything we need here—please read C¸inlar [C¸in11], III.1 (and see the same for proofs of all results in this subsection).

Let (xn) be a sequence of real numbers, indexed by N+ = {1, 2,... }. Then

lim inf xn = sup inf xn , lim sup xn = inf sup xn , (11.1) n m n≥m n m n≥m

are well-defined numbers, possibly infinite. If they are equal to the same number x then (xn) is said to have limit x, written limn xn = x or xn → x. The sequence is said to be convergent if the limit exists and is a real (finite) number.

Characterization. C¸inlar has some nice notation: for ε ∈ R+, let iε be the indicator on the interval (ε, ∞), ( 1 x > ε iε(x) = 1 (x) = . (11.2) (ε,∞) 0 x ≤ ε

Let (xn) be a sequence of numbers in R. The sequence converges to x if and only if the sequence |xn − x| converges to 0. When x is known, it is typically easier to work with |xn − x| because of its positivity. The classical (real analysis) statement of convergence says that a sequence of positive numbers (xn) converges to 0 if and only if for every ε > 0 there exists a k such that xn ≤ ε for all n ≥ k, i.e. X xn → 0 ⇐⇒ iε(xn) < ∞ , for every ε > 0 . (11.3) n Since every term on the right-hand side is either 0 or 1, X iε(xn) < ∞ ⇐⇒ lim sup iε(xn) = 0 ⇐⇒ lim iε(xn) = 0 . (11.4) n n n

This is the basic characterization of convergence. C¸inlar [C¸in11], III.1 gives a number of useful ways to establish convergence—I will refer to them as necessary.

74 11.2 Almost sure convergence

As always (Ω, H, P) is our basic probability space. (Xn) denotes a sequence of random variables in R.

A sequence (Xn) is said to be almost surely convergent if the numerical sequence (Xn(ω)) is convergent for (P-)almost every ω. It is said to converge to X if X is an almost surely R-valued a.s. random variable and limn Xn(ω) = X(ω) for almost every ω. This is denoted by Xn −→ X.

This definition is basically almost sure pointwise convergence of the measurable functions Xn :Ω → R. Since lim infn Xn and lim supn Xn are random variables,

Ω0 = {ω ∈ Ω : lim inf Xn(ω) = lim sup Xn(ω) ∈ R} n n

is an event in H. The sequence (Xn) is almost surely convergent if P(Ω0) = 1. 0 a.s. 0 a.s. Observe that if X is another random variable such that X = X , then Xn −→ X implies that a.s. 0 Xn −→ X , too. The following theorem characterizes almost sure convergence.

Theorem 11.1. The sequence (Xn) converges to X almost surely if and only if, for every ε > 0, X iε ◦ |Xn − X| < ∞ almost surely. (11.5) n

a.s. Proof. Necessity: Suppose that Xn −→ X. Let Ω0 be the almost sure set on which Xn(ω) → X(ω), P and let Yn = |Xn − X|. Then for each ω ∈ Ω0, n iε ◦ Yn < ∞ for every  > 0, by (11.3). Thus, for fixed ,(11.5) holds on the almost sure set Ω0, which does not depend on .

Sufficiency: Suppose that, for each  > 0, (11.5) holds. Let (m) be a sequence strictly decreasing to P 0, and Nm = n im ◦ Yn. We have P{Nm < ∞} = 1 for every m by assumption. Since m+1 < m,

im+1 ≥ im and therefore Nm+1 ≥ Nm. Thus, the events {Nm < ∞} are shrinking to \ Ω0 = {Nm < ∞} . m

By sequential continuity of P, we have P(Ω0) = limm P{Nm < ∞} = 1. Furthermore, again by (11.3), Xn(ω) → X(ω) for every ω ∈ Ω0.

Exercise 31 (Almost sure version of the continuous mapping theorem): £ Let X and (Xn) be random variables taking values in R, and f : R → R a function with at most a ¢ a.s. ¡ a.s. countable number of points of discontinuity. Show that Xn −→ X implies that f ◦Xn −→ f ◦X.

Convergence in metric spaces. Recall that a metric space is a pair (E, d), where E is a set and d is a metric on E (see pp. 10-11). Let X and X1,X2,... be random variables taking values in E. Then the sequence (Xn) is said to converge to X almost surely if the R+-valued random variables d(Xn,X) converge to 0 almost surely.

Borel–Cantelli lemmas.

75 Lemma 11.2 (Borel–Cantelli). Let (Hn) be a sequence of events. Then X X P(Hn) < ∞ ⇒ 1Hn < ∞ almost surely. (11.6) n n P P Proof. Let N = n 1Hn be the random variable under question. By the MCT, E[N] = n P(Hn). Clearly, E[N] < ∞ implies that N < ∞ almost surely, which is what is claimed.

Another way to interpret this is that if (Hn) is a sequence of events, then N is the number of times P these events occur. So if n P(Hn) < ∞, then almost surely only a finite number of events Hn occur. The following suite of results follow from Lemma 11.2.

a.s. Proposition 11.3. Xn −→ X if any of the following holds: P i) n P{|Xn − X| > ε} < ∞ for every ε > 0. P ii) There exists a sequence (εn) decreasing to 0 such that n P{|Xn − X| > εn} < ∞. P P iii) There exists a sequence (εn) of strictly positive numbers such that n εn < ∞ and n P{|Xn+1− Xn| > εn} < ∞.

Exercise 32 (Applications of Borel–Cantelli Lemma): £ Prove Proposition 11.3. ¢ ¡

In general, the converse of Lemma 11.2 is not true, unless the events are independent.

Lemma 11.4 (Second Borel–Cantelli). Let (Hn) be a sequence of independent events. Then X X P(Hn) = ∞ ⇒ 1Hn = ∞ almost surely. (11.7) n n P Proof. Recall that { n 1Hn = ∞} is also written as {Hn i.o.}. Recall further from Section 9.3 that the event {Hn i.o.} is in the tail-σ-algebra, and therefore

{Hn i.o.} = (∩m ∪n≥m Hn) = lim (∪n≥mHn) . P P m P

P P Suppose that n P(Hn) = ∞. Note that this implies n≥m P(Hn) = ∞, m ≥ 1. Now,

c (N = ∞) = lim (∪n≥mHn) = 1 − lim (∩n≥mHn) . P m P m P

76 −x Now, by independence of (Hn) and the fact that 1 − x ≤ e for x ≥ 0,

Y c P(N = ∞) = 1 − P(Hn) n≥m Y = 1 − (1 − P(Hn)) n≥m Y ≥ 1 − e−P(Hn) n≥m  X  = 1 − exp P(Hn) n≥m = 1 − e−∞ = 1 .

11.3 Convergence in probability

A weaker notion than almost sure convergence is convergence in probability. Let X1,X2,... be R-valued random variables. The sequence (Xn) is said to converge to X in probability if, for every  > 0,

lim {|Xn − X| > } = 0 . (11.8) n P

P We denote this as Xn −→ X.

For comparison, observe that we can write almost sure convergence similarly: for every  > 0, (Xn) converges to X almost surely if

{lim |Xn − X| > } = 0 . (11.9) P n

Convergence in probability is widely used in modern probability—more so than almost sure conver- gence. This is because almost sure convergence requires satisfying an extremely strong condition: pointwise convergence on a set of probability 1. Convergence in probability is much less stringent. The following example illustrates. Example 11.1. Convergence in probability but not almost surely. Let Ω = (0, 1], H the Borel σ-algebra on (0, 1], and P the Lebesgue measure. Let X1,X2,X3,X4,X5,X6,X7,... be the indicators of (0, 1], (0, 1/2], (1/2, 1], (0, 1/3], (1/3, 2/3], (2/3, 1], (0, 1/4],... , respectively. Then for arbitrary  ∈ (0, 1), the probabilities P{Xn > } for the sequence (1, 1/2, 1/2, 1/3, 1/3, 1/3, 1/4,... ), P whose limit is 0. Thus, Xn −→ 0. But, for every ω ∈ Ω, the sequence (Xn(ω)) consists of zeros and ones without end (i.e., switching back and forth infinitely often). Therefore, lim infn(Xn(ω)) = 0 and lim supn(Xn(ω)) = 1, and the set of ω for which (Xn(ω)) converges is empty. The following theorem summarizes the relationship between almost sure convergence and conver- gence in probability.

Theorem 11.5. i) If (Xn) converges to X almost surely, then in converges to X in probability, as well.

77 ii) If (Xn) converges to X in probability, then it has a subsequence that converges to X almost surely.

iii) If every subsequence of (Xn) has a further subsequence that converges to X almost surely, then (Xn) converges to X in probability.

Proof. For simplicity, assume that the (Xn) are positive and X = 0; this is the same as replacing Xn with |Xn − X|. Recall that iε is the indicator of the interval (, ∞). Define the bounded sequence (pn) by

pn() = E[iε ◦ Xn] = P{Xn > } .

a.s. a.s. i) Suppose that Xn −→ 0. Fix  > 0. Then iε ◦ Xn −→ 0, which by the bounded convergence P theorem implies that limn pn() = limn E[iε ◦ Xn] = E[limn iε ◦ Xn] = 0. Therefore, Xn −→ 0.

P ii) Suppose that Xn −→ 0. Let k = 1/k, for k ≥ 1, and set n0 = 0. For each k ≥ 1, pn(k) → 0 by assumption (of convergence in probability). Therefore, there exists some nk > nk−1 such −k that p(k) ≤ 2 for all n ≥ nk. Our subsequence is then Xn1 ,Xn2 ,... , and

X X −k P{Xnk > k} ≤ 2 = 1 . k≥1 k≥1

By Proposition 11.3ii, the subsequence (Xnk ) converges to 0 almost surely.

To prove part iii), we need the following lemma characterizing convergence of sequences of real numbers via convergent subsequences.

Lemma 11.6. Let (xn) be a sequence of real numbers. If every subsequence that has a limit has the same value x for its limit, then the sequence (xn) tends to x (infinite values are allowed). If the sequence is bounded, and every convergent subsequence of it has the same limit x, then (xn) converges to x.

Proof of Theorem 11.5, continued.

iii) Assume that every subsequence of (Xn) has a further subsequence that converges to 0 almost surely. Fix some  > 0. Consider the bounded sequence of numbers (pn()). Let N be a 16 subsequence of N along which (pn()) converges; denote its limit by p. By assumption, N 0 a.s. 0 0 has a subsequence N such that Xn0 −→ 0, for n ∈ N . By part a) above, we have that 0 pn0 () → 0 along N , which means that the limit p = 0. Since every subsequence of (Xn) has a further subsequence for which this is true, by Lemma 11.6, the original sequence (pn()) P converges to 0 for every  > 0, and therefore Xn −→ 0.

16We know that such a convergent subsequence exists by the Bolzano–Weierstrass theorem, which says that every n bounded sequence in R has a convergent subsequence.

78 Convergence of continuous functions. An application of Theorem 11.5 is an in probability version of the continuous mapping theorem.

P P Proposition 11.7. Let f : R → R be continuous. If Xn −→ X then f ◦ Xn −→ f ◦ X. The proof demonstrates a useful technique: using almost sure convergence of subsequences to show convergence in probability of the entire sequence.

P 0 Proof. Suppose that Xn −→ X. Let N be a subsequence of N. Since the original sequence (Xn) 0 0 converges in probability to X, so does the subsequence (Xn0 ), n ∈ N . By Theorem 11.5b, there 00 a.s. 00 00 is a further subsequence N such that Xn00 −→ X, n ∈ N , which by Exercise 31 (almost sure a.s. version of continuous mapping) implies that f ◦ Xn00 −→ f ◦ X. Applying Theorem 11.5c, we have P that f ◦ Xn −→ f ◦ X for the original sequence.

Convergence and arithmetic. The previous proof technique can be used to show that convergence in probability is preserved under arithmetical operations.

P P Theorem 11.8. Suppose that Xn −→ X and Yn −→ Y . Then:

P i) Xn + Yn −→ X + Y ;

P ii) Xn − Yn −→ X − Y ;

P iii) XnYn −→ XY ; and

P iv) Xn/Yn −→ X/Y , provided that, almost surely, Yn and Y are non-zero.

Exercise 33 (Convergence in probability and arithmetic): £ Prove Theorem 11.8i using the method of Proposition 11.7. ¢ ¡

11.4 Convergence in Lp In order to define this mode of convergence, we need to define Lp spaces, which we previously skipped. See C¸inlar [C¸in11, Ch. II.3].

Lp spaces. Let X be a random variables taking values in R, and for p ∈ [1, ∞], define ( (E[Xp])1/p p ∈ [1, ∞) ||X||p = (11.10) inf{b ∈ R+ : |X| ≤ b almost surely} .

It is easy to see that

a.s. ||X||p = 0 ⇒ X = 0 , and ||cX||p = c||X||p . (11.11)

Furthermore, it can be shown that (see H¨older’sinequality, C¸inlar Theorem II.3.6)

0 ≤ ||X||p ≤ ||X||q ≤ +∞ if 1 ≤ p ≤ q ≤ +∞ . (11.12)

79 p For each p ∈ [1, ∞], denote by L the collection of all R-valued random variables X with ||X||p < ∞. p p For X in L , ||X||p is called the L -norm of X. For p = ∞, ||X∞|| is also called the essential supremum of X. Clearly, X is in Lp if and only if |X|p is integrable, and X is in L∞ if and only if X is almost surely bounded. These ideas have natural generalizations for measures and functions: for a measure µ on some R p 1/p p measurable space (E, E) and a measurable function f : E → R, ||f||p = ( µ(dx)f(x) ) . The L p space L (µ) is the collection of all R-valued functions f such that ||f||p < ∞. In the probability literature, it is common to see Lp(Q), etc., for some probability measure Q, to differentiate between the Lp spaces of different probability measures.

p L convergence. We will now restrict ourselves to p ∈ [1, ∞). A sequence (Xn) is said to converge in Lp to X if

p lim [|Xn − X| ] = 0 . (11.13) n E

Lp We denote this as Xn −→ X. Recall that Lp is the collection of all R-valued random variables with finite Lp-norm (11.10). If we identify as a single random variable all random variables that are almost surely equal to each other, then Lp is a normed vector space. Clearly, (11.13) is equivalent to

lim ||Xn − X||p = 0 , (11.14) n which is the usual notion of convergence in a normed vector space. Hence, Lp convergence aligns with much of our usual intuition about convergence.

p Lp Proposition 11.9. Suppose that (Xn) and X are in L for some p ∈ [1, ∞). Then Xn −→ X implies the following:

Lq i) Xn −→ X for each q ∈ [1, p]; and

P ii) Xn −→ X.

Exercise 34 (Implications of Lp convergence): £ Prove Proposition 11.9. ¢ ¡ Hint: For ii), use Markov’s inequality. Example 11.2. Convergence in probability and not L1 . Recall the setup from Example 11.1: Ω = (0, 1], H the Borel σ-algebra on (0, 1], and P the Lebesgue measure. Let X1,X2,X3,X4,X5,X6,X7,... be the indicators of (0, 1], (0, 1/2], (1/2, 1], (0, 1/3], (1/3, 2/3], (2/3, 1], (0, 1/4],... , respectively. We showed in that ex- 1 P L ample that Xn −→ 0. We have that Xn −→ 0, as well: E[|Xn|] → 0.

A slight alteration changes that. Let (Xˆn) = (X1, 2X2, 2X3, 3X4, 3X5, 3X6, 4X7,... ). Now for ˆ ˆ P ˆ ˆ  ∈ (0, 1), P{Xn > } = P{Xn > }, so Xn −→ 0. However, E[|Xn|] = 1 for all n,(Xn) does not converge to 0 in L1. But it cannot converge elsewhere, say some a 6= 0, otherwise Proposition 11.9 P 1 would imply that Xn −→ a. So (Xˆn) does not converge in L .

80 Convergence in L1 is more common in the literature than any other p. In fact, one often sees a proof of Lp-convergence (and in particular, L2) in order to show, via Proposition 11.9, L1-convergence. L1 One reason for this is that it allows interchanging limits with expectation: if Xn −→ X, then E[Xn] → E[X]. This is a corollary to the following more general result.

L1 Proposition 11.10. If Xn −→ X, then for every bounded random variable Y ,

lim [Xn Y ] = [XY ] . n E E

L1 Proof. Suppose that |Y | ≤ b almost surely, and that Xn −→ X. Then

|E[Xn Y ] − E[XY ]| ≤ E[|Xn Y − XY |] ≤ b E[|Xn − X|] → 0 .

We will find this particularly useful when working with martingales.

Uniform integrability. Convergence in L1 implies convergence in probability. However, the converse is not generally true. To get from convergence in probability to L1, we need a further condition. One might be that (Xn) is dominated, i.e., |Xn| ≤ Y almost surely for some integrable random variable Y . This is a pretty strong condition, and we might want something weaker.

P Revisiting Example 11.2 illustrates what we need to address: even though (Xn) −→ 0, i.e., P gives less and less mass to non-zero values of Xn, which non-zero value has non-zero mass is growing at the same rate as its probability decreases. Non-zero probability mass, however small, is escaping “off to infinity”. If we can control this escape, then we can get convergence in L1. This is known as uniform integrability. Before giving the definition (which applies to a collection), we illustrate with the simplest case of a single random variable.

Lemma 11.11. Let X be a random variable taking values in R = (−∞, +∞). Then, X is integrable if and only if

lim E[ |X| 1|X|>b ] = 0 . (11.15) b→∞

Proof. Let Zb = |X| 1|X|>b. Note that it is dominated by |X| and goes to 0 as b → ∞, since X takes values in R. If X is integrable, then dominated convergence (Theorem 7.7) implies that limb→∞ E[Zb] = E[limb→∞ Zb] = 0. This is (11.15).

Conversely, if (11.15) holds, choose b large enough to have E[Zb] ≤ 1. Then

|X| = |X|1|X|≤b + |X|1|X|>b ≤ b + Zb; ,

which shows that E[|X|] ≤ b + 1 < ∞, and thus X is integrable.

81 We extend this idea to a collection K of random variables by taking the limit (11.15) uniformly in K. Specifically, K is said to be uniformly integrable if,

k(b) = sup E[ |X| 1|X|>b ] , (11.16) X∈K goes to 0 as b → ∞. If the collection K is finite and each X ∈ K is integrable, then K is uniformly integrable:

(finite K) (Lemma 11.11) lim sup k(b) = sup lim k(b) = 0 . b→∞ K K b→∞

The harder case is when K is not finite, and we can’t necessarily interchange limb with supK. We need to consider what other conditions would give us uniform integrability. There turn out to be a few; the following three that are relatively easy to work with. (See C¸inlar [C¸in11, Ch. II.3] for more.)

Proposition 11.12. Let K be a collection of R-valued random variables. If any of the following conditions hold, then K is uniformly integrable. p p i) L -boundedness. supX∈K E[|X| ] < ∞ for some p > 1. ii) Integrable envelope. There is a random variable Y with E[|Y |] < ∞ such that each X ∈ K satisfies Z Z |X|dP ≤ |Y |dP , for all b > 0 . {|X|>b} {|X|>b}

iii) There is a (positive) random variable Y with E[Y ] < ∞ such that |X| ≤ Y almost surely, for each X ∈ K.

Proof. i) Assume without loss of generality that X ≥ 0 for each X ∈ K. Fix p > 1 such that condition p i) holds, and let K be some constant such that supX∈K E[|X| ] < k < ∞. If X ≥ b, then X1−p ≤ b1−p, and thus X ≤ Xpb1−p. Therefore, k [|X|1 ] ≤ b1−p [|X|p1 ] ≤ , E |X|>b E |X|>b bp−1 and hence k lim sup E[|X|1|X|>b] ≤ lim p−1 = 0 . b→∞ X∈K b→∞ b

ii) By a similar argument as Lemma 11.11, if condition ii) holds then,

(DCT) lim sup E[|X|1|X|>b] ≤ lim E[|Y |1|X|>b] = E[ lim |Y |1|X|>b] = 0 . b→∞ X∈K b→∞ b→∞

iii) If |X| ≤ Y almost surely then |X|1|X|>b ≤ |Y |1|X|>b ≤ Y 1Y >b almost surely, and in particular condition ii) holds.

82 The following result is the main theorem relating L1-convergence to convergence in probability. Its proof can be found as part of C¸inlar [C¸in11, Thm. III.4.6].

1 L1 Theorem 11.13. Let (Xn) and X be random variables in L . Then Xn −→ X if and only if: (i) P Xn −→ X; and (ii) the collection (Xn) is uniformly integrable. Using Theorem 11.5, we can prove the following corollary. 1 Corollary 11.14. Let (Xn) and X be random variables in L , and assume that (Xn) is uniformly a.s. L1 L1 integrable. If Xn −→ X then Xn −→ X. Conversely, if Xn −→ X then (Xn) has a subsequence that converges to X almost surely.

11.5 Overview of modes of convergence Chart.

Cauchy criterion. For completeness, an additional characterization of the different modes of conver- gence is the Cauchy criterion. Recall that a sequence of real numbers (xn), the Cauchy criterion says that (xn) converges if and only if, for every  > 0, there is some k such that

|xn − xm| ≤  , for all m ≥ k and n ≥ k .

An equivalent way of writing this is limm,n→∞ |xn − xm| = 0. Note that it doesn’t say where the sequence converges, but it is useful nonetheless. The probabilistic versions of this are summarized in the following result.

Theorem 11.15 (Probabilistic Cauchy criteria). Let (Xn) be a sequence of R-valued random vari- ables. i) Almost sure. The sequence is convergent almost surely if and only if

lim |Xn − Xm| = 0 , almost surely. m,n→∞

ii) In probability. The sequences converges in probability if and only if, for every  > 0,

lim {|Xn − Xm| > } = 0 . m,n→∞ P

iii) In L1. The sequence converges in L1 if and only if

lim [|Xn − Xm|] = 0 . m,n→∞ E

11.6 Convergence in distribution

The previous modes of convergence that we studied pertained to R-valued random variables (i.e., measurable functions on Ω) converging either:

83 a) pointwise on P-almost sure sets (almost sure); b) on sets with limiting probability is 1 (in probability); c) in expectation (in Lp). Convergence in distribution, or weak convergence, has to do with the convergence of se- quences of probability measures on a topological space. We will work exclusively with convergence in R, but these ideas can be ported to general topological spaces.

Let X1,X2,...,X be R-valued random variables whose distributions are µ1, µ2, . . . , µ (probability measures on R), respectively. Let Cb = Cb(R, R) denote the collection of bounded continuous functions from R into R.

The sequence (µn) is said to converge weakly to µ if

lim µnf = µf , for every f ∈ b . n→∞ C

d The sequence (Xn) is said to converge in distribution to X, denoted Xn −→ X if (µn) converges to µ weakly, that is, if

lim [f ◦ Xn] = [f ◦ X] , for every f ∈ b . (11.17) n→∞ E E C

Example 11.3. Weak convergence to Lebesgue measure. Let µn be the probability measure that puts mass 1/n on each of the points 1/n, 2/n, . . . , n/n. Then, for f ∈ Cb,

n X 1  k  Z 1 µ f = f −→ du f(u) = λf . n n n k=1 0

Alternatively, let Un be a random variable with distribution µn (i.e, the discrete uniform distribution d on {1/n, 2/n, . . . , 1}), and U ∼ Unif(0, 1]. Then Un −→ U. Weak convergence is implied by each of the other modes of convergence that we have seen. The following result illustrates.

P Proposition 11.16. Let (Xn) and X be R-valued random variables such that Xn −→ X. Then d Xn −→ X.

P Proof. Let f ∈ Cb. Since f is continuous, by Proposition 11.7 we know that f ◦ Xn −→ f ◦ X. Since f is also bounded, by Proposition 11.12iii we know that (f ◦ Xn) is uniformly integrable and L1 d therefore (by Theorem 11.13) f ◦ Xn −→ f ◦ X. Therefore, (11.17) holds and Xn −→ X.

There is a (very weak) partial converse.

d Proposition 11.17. Let (Xn) and X be R-valued random variables such that Xn −→ X. If a.s. P X = x0 for some constant x0, then Xn −→ X, as well.

d One might be tempted to think that Xn −→ X implies that P{Xn ∈ A} converges to P{X ∈ A}, for all Borel sets A. In reality, this is only true for some special sets. In the case of R- valued random variables, this is intimately related to the convergence of distribution functions

84 Fn(x) = µn(−∞, x]. At a high level, weak convergence only implies (and requires) convergence of the distribution functions on the sets of continuity. The proof of this fact is rather technical (e.g., [C¸in11, Prop. 5.7]), but we state the result for posterity. Theorem 11.18. The following are equivalent:

i) (µn) converges weakly to µ.

ii) The distribution functions Fn(x) → F (x) for every continuity point x of F .

iii) The quantile functions Qn(u) → Q(u) for every continuity point u of Q. There is much more that can (and probably should) be said about weak convergence. See C¸inlar [C¸in11, Ch. III.5]. For the sake of brevity, we devote our attention to two important results that will let us prove the Central Limit Theorem.

Tightness and Prohorov’s Theorem. For each f ∈ Cb, the sequence of numbers (µnf) is bounded. Therefore, using Lemma 11.6, if every subsequence of (µnf) has a further subsequence that con- verges to µf, then µnf → µf. In other words, if every subsequence of (µn) has a further subsequence that converges weakly to µ, then (µn) converges weakly to µ.

In order to make this idea work, we need a regularity condition: the sequence (µn) is said to be tight if for every  > 0, there is a compact set K such that µn(K) > 1 −  for all n. Recall that in a Euclidean space (such as Rd), a set K is compact if and only if it is closed and bounded. (For a general topological space, see [AB06, Ch. 2].) Therefore, tightness amounts to there being some finite r ∈ R such that c lim sup µn([−r, r] ) = 0 . (11.18) r→∞ n

Theorem 11.19 (Prohorov’s theorem). If (µn) is tight then every subsequence of it has a further subsequence that is weakly convergent.

Sketch of proof. At a high level, the proof goes as follows: Let (Fn) be the sequence of distribution functions corresponding to (µn). Helly’s theorem [C¸in11, Thm. III.1.9] states that every sequence of distribution functions has a subsequence that converges pointwise to some distribution func- tion F at every continuity point of F . Using this, we see that every subsequence of (Fn) has a further subsequence that converges to a distribution function F , corresponding to a measure µ. Using tightness, we can show that µ is a probability measure (i.e., F (−∞) = 0 and F (∞) = 1); Theorem 11.18ii shows that (µn) converges weakly to µ along the sub-subsequence.

See C¸inlar [C¸in11, Thm. III.3.15] for the complete proof.

Convergence of Fourier transforms. Recall that the Fourier transform, of a probability distribu- tion µn is Z irx fn(r) = µn(dx)e , r ∈ R . R Prohorov’s theorem is the key to showing the close relationship between weak convergence and Fourier transforms.

85 Theorem 11.20. The sequence (µn) is weakly convergent if and only if

lim fn(r) = f(r) (11.19) n→∞ exists for every r ∈ R and the function f is continuous at 0. If this holds, then f is the Fourier transform of a probability measure µ on R, and µ is the weak limit of (µn).

Proof. Necessity. Assume that (µn) converges weakly to a probability measure µ on R. Then since g : x 7→ cos(rx) and h : x 7→ sin(rx) are in Cb, µng → µg and µnh → µh, and hence

fn(r) = µng + iµnh → µg + iµh = f(r) , where f is the Fourier transform of µ. Then, f is continuous at 0 because it is is a probability measure on R.17

Sufficiency. This argument is a bit more complicated. First we will show tightness of (µn), then use Prohorov’s theorem to argue that every subsequence has a further subsequence along which (µn) 0 converges to some probability measure µ . By the necessity part of the proof, that implies that (fn) 0 0 converges to f along that subsequence. By assumption, though, (fn) converges to f, so f = f no matter the subsequence. Therefore, every subsequence of (µn) has a further subsequence that converges weakly to the same probability measure µ, whose Fourier transform is f, and therefore (µn) converges weakly to µ. Suppose that the limits (11.19) exist and f is continuous at 0. To show tightness, note that by the bounded convergence theorem (Theorem 7.8) we have, 1 Z b 1 Z b lim dr|1 − fn(r)| = dr|1 − f(r)| . n→∞ b −b b −b

By the continuity of f at 0, as b → 0, f(r) → f(0) = limn fn(0) = 1, so the right-hand side of the previous equation goes to 0 as b → 0. Therefore, for every  > 0 there is some b > 0 such that the right-hand side is less than /2, and thus, 1 Z b dr|1 − fn(r)| ≤  , (11.20) b −b for all but finitely many n. Now, by Fubini’s theorem (Theorem 10.11), 1 Z b Z 1 Z b dr(1 − f (r)) = µ (dx) dr (1 − eirx) . b n n b −b R −b Recall that eirx = cos rx + i sin rx. Furthermore, observe that sin is an odd function so its integral over a symmetric interval vanishes, and therefore, 1 Z b Z  sin bx dr(1 − f (r)) = 2 µ (dx) 1 − . b n n bx −b R 17To see this, note that |eirx| = 1 for all r, and therefore we can apply the bounded convergence theorem (Theo- rem 7.8) to show that f is continuous at 0 (and actually that it is uniformly continuous).

Now, 1 − y^{−1} sin y is always non-negative and exceeds 1/2 when |y| > 2. (See Figure 1 for an illustration with b = 1.) Therefore,

(1/b) ∫_{−b}^{b} dr (1 − f_n(r)) ≥ ∫_R µ_n(dx) 1_{(2,∞)}(b|x|) = 1 − µ_n([−2/b, 2/b]) .

Putting this together with (11.20), for every ε > 0 there is some b > 0 such that the set K = [−2/b, 2/b] satisfies µ_n(K) ≥ 1 − ε for all but finitely many n. By taking b smaller if necessary, we can ensure that µ_n(K) ≥ 1 − ε for all n. Hence (µ_n) is tight.

[Figure 1: The function sin(x)/x, shown over the interval (−5, 5) (left) and over (−100, 100) (right).]

We can translate this into convergence in distribution, recalling that the characteristic function of a random variable X is the Fourier transform of its distribution.

Corollary 11.21. X_n →d X if and only if

lim_{n→∞} E[e^{irX_n}] = E[e^{irX}] ,   r ∈ R .
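As a quick numerical illustration of Theorem 11.20 (a sketch under assumed distributions, not part of the proofs): take µ_n = N(0, 1 + 1/n), which converges weakly to µ = N(0, 1), and compare the closed-form Fourier transforms.

```python
import numpy as np

# Illustration sketch: mu_n = N(0, 1 + 1/n) converges weakly to mu = N(0, 1).
# The Fourier transforms are known in closed form, f_n(r) = exp(-(1 + 1/n) r^2 / 2),
# and converge pointwise to f(r) = exp(-r^2 / 2), the transform of the limit.
r = np.linspace(-5, 5, 201)
f = np.exp(-r**2 / 2)                 # Fourier transform of the weak limit
for n in (1, 10, 100, 1000):
    fn = np.exp(-(1 + 1 / n) * r**2 / 2)
    print(n, np.max(np.abs(fn - f)))  # sup-distance on [-5, 5] shrinks with n
```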

11.7 Central Limit Theorem

From these results follows one of the most important achievements in probability theory: the Central Limit Theorem (CLT). The basic idea is that if X_1, X_2, ... is a sequence of i.i.d. random variables with mean µ and variance σ² < ∞, and S_n = Σ_{i=1}^n X_i is their sum, then the distribution of

Z_n = (S_n − nµ)/(σ √n)

is approximately standard normal for large n. The importance of this result is that we haven't assumed anything about the distribution of X_i except that it has finite variance. The standard proof technique (which we follow) obscures some of the intuition, but the high-level idea is that Z_n has mean 0 and variance 1 for all n; for large n, all higher-order fluctuations are negligible. To state the result, let Z denote a standard normal random variable, i.e., Z ∼ N(0, 1).

Theorem 11.22 (Central Limit Theorem). Let (X_n) be a sequence of i.i.d. R-valued random variables with finite mean µ and variance σ² < ∞. Then (Z_n) (as defined above) converges to Z in distribution.

Proof. Let f denote the characteristic function of Y_n = (X_n − µ)/σ (i.e., the Fourier transform of the distribution of Y_n). Observe that Y_n has mean 0 and variance 1. A version of Taylor's theorem yields

f(r) = f(0) + f′(0) r + (1/2) f″(0) r² (1 + h(r)) ,

with some function h such that |h(r)| → 0 as r → 0.

Recall that for a random variable X with distribution µ and characteristic function f, if E[|X|^m] < ∞, then

lim_{r→0} (d^m/dr^m) f(r) = i^m E[X^m] .

For Y_n, this yields f(0) = 1, f′(0) = i E[Y_n] = 0, and f″(0) = i² E[Y_n²] = −1, and thus

E[e^{irZ_n}] = (f(r/√n))^n = (1 − (r²/2n)(1 + h(r/√n)))^n → e^{−r²/2} , as n → ∞ ,

because (1 + c_n/n)^n → e^c if c_n → c. This is the characteristic function of a standard normal random variable Z: E[e^{irZ}] = e^{−r²/2}.

Therefore, by Corollary 11.21, we have Z_n →d Z.
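Here is a minimal simulation sketch of the theorem (the choice of non-normal distribution and the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is arbitrary

# CLT illustration: draw i.i.d. samples from a decidedly non-normal
# distribution (Exponential(1), so mu = sigma = 1) and check that the
# standardized sums Z_n = (S_n - n*mu) / (sigma*sqrt(n)) look N(0, 1).
n, reps = 1000, 10_000
X = rng.exponential(scale=1.0, size=(reps, n))
Zn = (X.sum(axis=1) - n * 1.0) / (1.0 * np.sqrt(n))

print(Zn.mean(), Zn.std())          # should be close to 0 and 1
print(np.mean(np.abs(Zn) <= 1.96))  # should be close to 0.95, as for N(0, 1)
```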

It's worth squinting to see what actually happened there. For convenience, denote the mth derivative of f with respect to r as f^{(m)}. Let's take a closer look at h(r/√n) (assuming the higher-order derivatives exist):

h(r/√n) = (2n/(f^{(2)}(0) r²)) ( (1/(3! n^{3/2})) f^{(3)}(0) r³ + (1/(4! n²)) f^{(4)}(0) r⁴ + ··· ) .

For large n, these terms are dominated by their dependence on n, i.e., n^{−1/2}, n^{−1}, and so on. They all tend to zero as n → ∞. Observe that this would be true for any h(r/n^α) with α > 0. On the other hand, consider [f(r/n^α)]^n as n grows large: if α > 1/2, then we would have [f(r/n^α)]^n → 1 because the resulting sequence (c_n) would converge to zero; if α < 1/2, then [f(r/n^α)]^n → 0 for every r ≠ 0, because c_n → −∞. The proof works because α = 1/2 is just right. See the discussion below about the relationship between the CLT and the Law of Large Numbers for more on this. There are stronger versions of the CLT that relax the i.i.d. assumption, and that have weaker assumptions on the moments; they all describe a phenomenon similar to the basic CLT above. See Çinlar [Çin11, Ch. III.8] for much more on this.

11.8 Laws of large numbers

Like the CLT, the law of large numbers (LLN) is a landmark result from classical probability theory. It is typically presented before the CLT because it says something somewhat less informative (more on this below). The trade-off for the less informative conclusion is that it is a stronger convergence result: the so-called weak law of large numbers (WLLN) states that the sample average converges in probability to its mean; the strong LLN (SLLN) establishes convergence almost surely. The proof of the almost sure version requires the following technical lemma.

Lemma 11.23. Let (x_n) be a sequence of positive numbers and x̄_n = (x_1 + ··· + x_n)/n the average of the first n entries. Let N = (n_k) be a subsequence of N with lim_k n_{k+1}/n_k = r > 0. If the sequence (x̄_n) converges along N to x, then

x/r ≤ lim inf_n x̄_n ≤ lim sup_n x̄_n ≤ rx .

See [Çin11, Lemma 1.7] for the proof.

Theorem 11.24. Suppose that the X_n are pairwise independent and (marginally) identically distributed with finite mean µ and variance σ². Then (X̄_n) converges to µ in L², in probability, and almost surely.

Proof. Observe that E[S_n] = nµ and Var[S_n] = nσ² because of pairwise independence. Thus, E[X̄_n] = µ and Var[X̄_n] = σ²/n. Hence,

E[|X̄_n − µ|²] = Var[X̄_n] = σ²/n → 0 , as n → ∞ .

In other words, X̄_n →L² µ. This implies that X̄_n →P µ as well.

For almost sure convergence, note that it is enough to show this for positive X_n, because we could show (separately) convergence for the positive and negative parts of general X_n. Assume that X_n ≥ 0 for all n. Let N = (n_k) be defined by n_k = k². By Chebyshev's inequality, for any ε > 0,

Σ_{n∈N} P{|X̄_n − µ| > ε} ≤ Σ_{k=1}^∞ σ²/(ε² k²) < ∞ .

By Proposition 11.3, this implies that X̄_n →a.s. µ along N.

Let Ω₀ be the almost sure set on which this convergence holds. For ω ∈ Ω₀, (the original sequence) (X_n(ω)) is a sequence of positive numbers, and therefore Lemma 11.23 applied to (X̄_n(ω)) with r = lim_k (k + 1)²/k² = 1 yields

µ ≤ lim inf_n X̄_n(ω) ≤ lim sup_n X̄_n(ω) ≤ µ .

Therefore, for every ω ∈ Ω₀, (X̄_n(ω)) converges to µ. But Ω₀ is almost sure, so X̄_n →a.s. µ.

Relationship between CLT and LLN.^{18} For simplicity, consider an i.i.d. sequence (X_n), with finite mean µ and variance σ². Then the SLLN implies that X̄_n →a.s. µ (and in L²), i.e.,

lim_{n→∞} |X̄_n − µ| = 0 , almost surely and in L² .

A natural question often asked in statistics is, “How large must n be so that X̄_n is close to µ?” That is, what is the rate of convergence of X̄_n to µ? For example, is there some α ∈ R such that

lim_{n→∞} n^α |X̄_n − µ| = c > 0 , a.s.?

It turns out that no such α exists: n^α (X̄_n − µ) cannot converge to a non-zero constant or a non-zero random variable almost surely, or even in probability. However, the CLT says that for α = 1/2, √n (X̄_n − µ) converges in distribution to N(0, σ²). The rate of convergence is, in this weaker sense, √n. (This is often quoted as a rule of thumb in statistical applications.)
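A rough numerical look at this rate (a sketch with an assumed Exponential(1) distribution and arbitrary sample sizes): the spread of √n (X̄_n − µ) stabilizes near σ, while the spread of n (X̄_n − µ) grows without bound.

```python
import numpy as np

rng = np.random.default_rng(1)  # seed is arbitrary

# Exponential(1) samples, so mu = sigma = 1. For alpha = 1/2 the rescaled
# error sqrt(n) * (Xbar_n - mu) has stable spread (~ sigma = 1); for
# alpha = 1 the spread of n * (Xbar_n - mu) grows like sqrt(n).
for n in (100, 1600, 25600):
    xbar = rng.exponential(size=(500, n)).mean(axis=1)
    err = xbar - 1.0
    print(n, np.std(np.sqrt(n) * err), np.std(n * err))
```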

Convergence of the empirical distribution. Suppose that (X_n) are i.i.d. R-valued random variables in L². Call their common distribution function F, and suppose that we would like to estimate it. Let

Y_n(x) = 1_{(−∞,x]} ∘ X_n ,   x ∈ R .

Observe that (Y_n(x))_n are i.i.d. and also in L². Define

F_n(x) = (1/n) Σ_{i=1}^n Y_i(x) , for x fixed.   (11.21)

As x varies, F_n defines a function on R called the empirical distribution function. By the SLLN, for each x ∈ R, we have almost surely that

lim_{n→∞} F_n(x) = E[Y_1(x)] = E[1_{(−∞,x]} ∘ X_1] = P{X_1 ≤ x} = F(x) .

Therefore, F_n(x) →a.s. F(x) for each x ∈ R; the empirical distribution function converges to the true distribution function almost surely and in L². It is straightforward to see that the above argument works for a general measurable space (E, E), with 1_{(−∞,x]} ∘ X_n replaced by 1_A ∘ X_n for any A ∈ E, and therefore F_n(A) →a.s. µ(A), where µ is the (common) distribution of X_n. Specifically,

F_n(A) = (1/n) Σ_{i=1}^n 1_A ∘ X_i = (1/n) Σ_{i=1}^n δ_{X_i}(A) →a.s. µ(A) ,   A ∈ E .

Using this, it can be shown that

∫_E f(x) F_n(dx) = (1/n) Σ_{i=1}^n ∫_E f(x) δ_{X_i}(dx) = (1/n) Σ_{i=1}^n f ∘ X_i →a.s. µf .   (11.22)

^{18}Much of this discussion comes from Jacod and Protter [JP04, Ch. 21].

This is the basis for Monte Carlo methods (based on i.i.d. samples from the unknown distribution). In practice, we often cannot sample i.i.d. from our distribution of interest µ, but we might be able to generate samples from a Markov chain whose stationary distribution is µ. In that case, we need a stronger LLN, known as the ergodic LLN, which says that we can estimate integrals using samples from a Markov chain in the same way as in (11.22). This is the basis for Markov chain Monte Carlo (MCMC) methods. See, for example, Meyn and Tweedie [MT93] or, for a nice overview, Roberts and Rosenthal [RR04].
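A minimal Monte Carlo sketch in the sense of (11.22), with an assumed target (µ = N(0, 1) and f(x) = x⁴, so µf = E[X⁴] = 3):

```python
import numpy as np

rng = np.random.default_rng(2)  # seed is arbitrary

# Monte Carlo integration: the sample average of f over i.i.d. draws from mu
# converges to mu(f) almost surely by the SLLN. Here mu(f) = E[X^4] = 3.
for n in (10**2, 10**4, 10**6):
    x = rng.standard_normal(n)
    print(n, np.mean(x**4))
```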

12 Martingales

Reading: Jacod and Protter [JP04], Ch. 24–27
Supplemental: Çinlar [Çin11], XXX

12.1 Basic definitions

We will consider families of random variables (X_n)_{n∈N} that have very special properties. The ideas in this section extend to more general index sets, with R an important case.

Filtrations. A filtration is a family F = (Fn)n∈N of σ-algebras that satisfies

n ≤ m ⇒ Fn ⊆ Fm . (12.1)

The monotonicity of the filtration means that there is a uniquely determined, smallest σ-algebra that contains all σ-algebras in F:

F_∞ = σ( ⋃_{n∈N} F_n ) .

Let F = (Fn)n∈N be a filtration. A family of random variables (Xn)n∈N is called adapted to F if Xn ∈ Fn for each n ∈ N. Clearly, every sequence (Xn) is adapted to its canonical filtration, defined by

F_n = σ( ⋃_{m≤n} σ(X_m) ) .

An adapted family (Xn, Fn)n∈N is called a martingale if:

i) E[|X_n|] < ∞ for each n (that is, each X_n is integrable); and

ii) E[X_m | F_n] = X_n almost surely, for each m ≥ n.

We also say that (X_n) is an F- or (F_n)-martingale. Note that for index set N, it is enough that E[X_{n+1} | F_n] = X_n a.s. for all n ≥ 0; more general index sets require the more general condition above. If the '=' is replaced by ≤ (almost surely), then (X_n, F_n) is called a supermartingale; if replaced by ≥, it is called a submartingale. At a high level, a martingale is, on average, constant; a submartingale (supermartingale) increases (decreases) on average. I'll say more about the origin of the names a bit later.

Examples.

Example 12.1. Martingale: sum of centered independent random variables. Let (X_n) be a sequence of independent integrable random variables, with respective means (ν_n). Let F_n be the canonical filtration of (X_n), with F_0 = {∅, Ω}. Let S_0 = 0 and S_n = Σ_{i=1}^n (X_i − ν_i). Then (S_n, F_n)_{n≥0} is a martingale because

E[S_m | F_n] = E[S_n + (S_m − S_n) | F_n]
            = S_n + E[ Σ_{i=n+1}^m (X_i − ν_i) | F_n ]
            = S_n + Σ_{i=n+1}^m (E[X_i] − ν_i)
            = S_n .

Example 12.2. Martingale: product of independent random variables. Let (Rn) be a sequence of independent random variables with mean 1 and finite variances. Let X0 = 1 and

X_n = X_0 ∏_{i=1}^n R_i ,

and F the canonical filtration of (R_n). Then E[X_{n+1} | F_n] = X_n E[R_{n+1} | F_n] = X_n, so (X_n) is an F-martingale.

In each of the previous examples, the filtrations were generated by independent random variables; by Kolmogorov's 0-1 law, the limits must be constant almost surely. Much more interesting objects occur in practice, which give rise to random limits. The following is a simple (canonical) example.

Example 12.3. P´olya’s urn. Imagine an urn (a fancy bucket) that contains R0 = 1 red ball and W0 = 1 white ball. At each time step n ≥ 1, I draw a ball from the urn, then replace that ball and add another of the same color. Let Xn ∈ {0, 1} record the color of the ball drawn at step n, with 0 corresponding to red. The proportion of white balls after n steps is

W′_n = W_n/(W_n + R_n) = W_n/(n + 2) .

Now, with F the canonical filtration of (X_n),

E[W′_{n+1} | F_n] = (n + 3)^{−1} (W_n + E[X_{n+1} | F_n]) = (n + 3)^{−1} W_n (1 + (n + 2)^{−1}) = (n + 2)^{−1} W_n = W′_n ,

and therefore (W′_n) is an F-martingale. It turns out (we won't show this) that the limiting proportion of white balls, W′_∞, is a random variable with distribution Unif(0, 1).
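A simulation sketch of this example (the number of steps and number of runs are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)  # seed is arbitrary

# Polya's urn: run many independent urns and record the final proportion of
# white balls; across runs, the limits should look Uniform(0, 1).
def polya_proportion(steps=2000):
    white, total = 1, 2                    # W_0 = R_0 = 1
    for _ in range(steps):
        if rng.random() < white / total:   # a white ball is drawn
            white += 1                     # replace it and add another white
        total += 1
    return white / total

limits = np.array([polya_proportion() for _ in range(1000)])
print(limits.mean(), limits.std())  # Uniform(0, 1): mean 1/2, sd ~ 0.2887
```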

Some basic properties. Martingales are hugely important in probability theory and in applications, in large part because they are disarmingly simple. Once it has been established that something is a martingale (or, in many cases, a submartingale), proving many desirable properties becomes quite simple. The hard part, however, often is finding a martingale lurking around in the system being studied. Except for some simple cases (e.g., the first two examples above), there aren't a lot of recipes for constructing a (sub)martingale, unless one is already in hand. Some useful properties and observations are listed below.

a) The martingale property E[X_m | F_n] = X_n a.s. can be expressed equivalently as

∫_A X_n dP = ∫_A X_m dP , for each A ∈ F_n .

b) The martingale property E[X_m | F_n] = X_n a.s. can be (and is) expressed equivalently as

E[(X_m − X_n) | F_n] = 0 a.s., for m ≥ n .

c) Let X be an F-submartingale. For m ≥ n, the random variable E[(X_m − X_n) | F_n] is non-negative, and therefore it is almost surely zero if and only if its expectation E[E[(X_m − X_n) | F_n]] = 0. It follows that the submartingale X is a martingale if and only if E[X_n] = E[X_0] for all n ≥ 0.

d) If X and Y are F-submartingales, then so is aX + bY , for a, b ∈ R+. This is because, for m ≥ n,

E[aXm + bYm | Fn] = aE[Xm | Fn] + bE[Ym | Fn] ≥ aXn + bYn .

e) If X and Y are F-martingales, then so is aX + bY , for a, b ∈ R. This can be shown as above.

f) Let f be a convex function on R. If X is an F-martingale, and if f ∘ X_n is integrable for every n, then f ∘ X is an F-submartingale. This follows from Jensen's inequality (for conditional expectations): for m ≥ n,

E[f ∘ X_m | F_n] ≥ f ∘ E[X_m | F_n] = f ∘ X_n .

Two important examples (in the context of martingale theory) are f(x) = |x| and f(x) = x⁺ = max{0, x}.

g) The previous property also applies to F-submartingales, if f is non-decreasing.

Exercise 35 (Convex increasing functions of submartingales): Prove property g) above.

Stopping times. The "constant expectation" property of martingales can be extended to hold over certain random times. A stopping time is a random variable T taking values in N ∪ {+∞}, such that the event {T ≤ n} ∈ F_n, for all n. Constant integer-valued random variables are (uninteresting) stopping times. More interesting stopping times are constructed from sequences of random variables. For example, let

T = inf{n ≥ 0 : X_n ≥ 73}  if X_n ≥ 73 for some n ∈ N, and T = +∞ otherwise.

Theorem 12.1. Let T be a stopping time bounded by some c ∈ R, and let (X_n) be a martingale with respect to its canonical filtration. Then E[X_T] = E[X_0].
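A numerical sanity check of Theorem 12.1, in a hypothetical setup (a simple random walk, with the hitting level and the cap chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(4)  # seed is arbitrary

# X_n is a +/-1 random walk started at 0 (a martingale); T is the first time
# the walk hits 3, capped at c = 50 so that T is a bounded stopping time.
def stopped_value(cap=50, level=3):
    x = 0
    for _ in range(cap):
        if x >= level:
            break
        x += 1 if rng.random() < 0.5 else -1
    return x

vals = [stopped_value() for _ in range(100_000)]
print(np.mean(vals))  # should be close to E[X_0] = 0
```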

12.2 Martingale Convergence Theorem

Ultimately, we will prove the following result, which ...

Theorem 12.2 (Martingale Convergence Theorem). Let X = (X_n, F_n)_{n∈N} be a submartingale.

i) If X is bounded in L¹, the sequence (X_n)_{n∈N} converges to some X_∞ almost surely as n → ∞, and X_∞ ∈ L¹.^{19}

ii) If X is even a martingale, and if the collection (X_n)_{n∈N} is uniformly integrable, then (X_n)_{n∈N} converges almost surely and in L¹ to some X_∞ as n → ∞. Moreover, X_n = E[X_∞ | F_n] for all n ∈ N.

iii) Conversely, for any integrable, real-valued random variable X, (E[X | F_n], F_n)_{n∈N} is a uniformly integrable martingale.

Remarks. Observe that in part i) we get almost sure convergence of (X_n) if it is bounded in L¹, and the limit is also in L¹. However, we do not get convergence in L¹; that is, it may not be the case that E[X_n] → E[X_∞]. For that, we need the uniform integrability condition of part ii). The L¹-convergence claim in part ii) also holds for submartingales, with the conclusion that X_n ≤ E[X_∞ | F_n]. This implies that X_∞ extends X to a submartingale X̄ = (X_n)_{n∈N∪{∞}} (and therefore also for martingales).

Upcrossings. To show that (Xn) converges almost surely, we need to show that limn Xn exists (almost surely), which is true iff lim infn Xn = lim supn Xn. We will use the idea of upcrossings to prove by contradiction that this is true: Suppose that lim infn Xn < lim supn Xn. Then there are numbers a < b such that lim infn Xn < a < b < lim supn Xn. Recall that for any a > lim infn Xn,(Xn) will fall in the interval [lim infn Xn, a] infinitely often; likewise for the interval [b, lim supn Xn]. Therefore, (Xn) makes upward crossings, or “upcrossings”, from below a to above b infinitely many times. Thus, if we can show that the number of upcrossings is finite almost surely for all a < b, then it must be that lim infn Xn = lim supn Xn.

Formally, we express upcrossings via stopping times. Define T_0 = 0, and inductively for j ≥ 0,

S_{j+1} = min{n > T_j : X_n ≤ a}      (first time after T_j that X_n ≤ a)
T_{j+1} = min{n > S_{j+1} : X_n ≥ b}   (first time after S_{j+1} that X_n ≥ b) .   (12.2)

We also use the conventions that min ∅ = +∞ and max ∅ = 0. Observe that all S_j and T_j are stopping times. The number of [a, b]-upcrossings up to time n is defined as

U_n(a, b) = max{k : T_k ≤ n} = Σ_{k=1}^∞ 1_{(0,n]} ∘ T_k .

This is the quantity we want to control (i.e., show finiteness of) as n grows.
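As a concrete illustration, here is a direct transcription of (12.2) into code that counts completed upcrossings of a finite path (the path and the interval below are made-up numbers):

```python
# Count completed upcrossings of [a, b] by a path x_0, x_1, ..., x_n,
# following (12.2): wait for the path to drop to a, then for it to rise to b.
def upcrossings(x, a, b):
    count, below = 0, False
    for v in x:
        if not below and v <= a:    # a time S_{j+1}: the path has dropped to a
            below = True
        elif below and v >= b:      # a time T_{j+1}: the path has risen to b
            below = False
            count += 1
    return count

print(upcrossings([0.0, -1.0, 2.0, -1.5, 3.0, 0.5], a=-0.5, b=1.0))  # -> 2
```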

^{19}Note that this is not claiming that X_n →L¹ X_∞.

To that end, define the upcrossing indicator process

C_n = Σ_{k=1}^∞ 1_{(S_k, T_k]}(n) = 1{an upcrossing is being made at time n} .   (12.3)

(Cn) is a predictable process (that is, Cn ∈ Fn−1 for all n). To see this, note that we can write (12.3) as C1 = 1{X0 < a} and

Cn = 1{Cn−1 = 1 and Xn−1 ≤ b} + 1{Cn−1 = 0 and Xn−1 < a} , n ≥ 2 . (12.4)

We will use (C_n) to accumulate (X_n) during active upcrossings, via Y_1 = 0 and

Y_n = Σ_{m=1}^n C_m (X_m − X_{m−1}) , n ≥ 2 .   (12.5)

The upcrossing indicator (Cn) just picks out parts of (Xn) that are traversing from below a to above b. Hence,

Y_n ≥ (b − a) U_n(a, b) − |(X_n − a)^−| almost surely.   (12.6)

Furthermore, observe that because Cm+1 ≤ 1 and E[Xm+1 − Xm | Fm] ≥ 0,

E[Y_{m+1} − Y_m | F_m] = E[C_{m+1}(X_{m+1} − X_m) | F_m]
                      = C_{m+1} E[X_{m+1} − X_m | F_m]      (since C_{m+1} ∈ F_m)
                      ≤ E[X_{m+1} − X_m | F_m] .

Now taking expectations of both sides and summing over m yields

E[Y_n] ≤ E[X_n − X_0] .   (12.7)

From this and (12.6) we can deduce the following lemma, which is an important result on its own.

Lemma 12.3 (Doob's Upcrossing Inequality). Let (X_n) be a submartingale, let a < b, and let U_n(a, b) be the number of upcrossings of [a, b] by time n. Then

E[U_n(a, b)] ≤ E[(X_n − a)⁺ − (X_0 − a)⁺] / (b − a) , for all n ∈ N.

Proof. Let M_n = (X_n − a)⁺. Since (X_n) is a submartingale, (M_n) is also a submartingale (since x ↦ (x − a)⁺ is convex and non-decreasing). Furthermore, an upcrossing of [a, b] by (X_n) is the same as an upcrossing of [0, b − a] by (M_n). Now, re-defining Y_n in (12.5) in terms of M_n, (12.6) becomes (after rearrangement)

U_n(a, b) ≤ Y_n / (b − a) ,

because (M_n) never falls below 0 by construction, and therefore |M_n^−| = 0. Taking expectations and putting in (12.7) yields

E[U_n(a, b)] ≤ E[Y_n]/(b − a) ≤ E[M_n − M_0]/(b − a) = E[(X_n − a)⁺ − (X_0 − a)⁺]/(b − a) .

Proof of Theorem 12.2i. First, observe that being bounded in L¹ means that for the submartingale (X_n),

E[X_n⁺] ≤ E[|X_n|] = 2E[X_n⁺] − E[X_n] ≤ 2E[X_n⁺] − E[X_0] ,   (12.8)

and therefore

sup_n E[X_n⁺] < ∞  ⟺  sup_n E[|X_n|] < ∞ .   (12.9)

As outlined above, we will prove by contradiction that (X_n) has an almost sure limit. Suppose the contrary, and in particular that lim inf_n X_n < a < b < lim sup_n X_n for some a < b. Let U_n(a, b) be the number of upcrossings of [a, b] by time n, which is non-decreasing in n, and therefore U_∞(a, b) = lim_n U_n(a, b) exists. By the Monotone Convergence Theorem, using the fact that (x − a)⁺ ≤ x⁺ + |a| for all a, x ∈ R,

E[U_∞(a, b)] = lim_n E[U_n(a, b)] ≤ sup_n E[(X_n − a)⁺]/(b − a) ≤ (sup_n E[X_n⁺] + |a|)/(b − a) < ∞ .

Since E[U_∞(a, b)] < ∞, P{U_∞(a, b) < ∞} = 1. Therefore, (X_n) upcrosses [a, b] only finitely many times almost surely. If we let

A = ⋃_{a<b; a,b∈Q} {lim inf_n X_n < a < b < lim sup_n X_n} ,

then A is a countable union of null sets, so P(A) = 0. Hence, lim inf_n X_n = lim sup_n X_n = lim_n X_n = X_∞ almost surely.

X_∞ may be infinite, so we still need to show that it is in L¹. To that end, by Fatou's Lemma and (12.8),

E[|X_∞|] = E[lim inf_n |X_n|] ≤ lim inf_n E[|X_n|] ≤ 2 sup_n E[X_n⁺] − E[X_0] < ∞ .

Therefore, X_∞ ∈ L¹.

Proof of Theorem 12.2ii. By part i), X_n →a.s. X_∞. Assuming uniform integrability of (X_n) means that X_n →L¹ X_∞ as well, by Corollary 11.14. In particular,

|E[X_n] − E[X_∞]| ≤ E[|X_n − X_∞|] → 0 , as n → ∞ .   (12.10)

Showing that Xn = E[X∞ | Fn] is equivalent to showing that E[1H (X∞ − Xn)] = 0 for all H ∈ Fn (see the ‘Uniqueness and versions’ discussion just before Proposition 10.1). To that end, fix n, and let H ∈ Fn and m ≥ n. Then by the martingale property,

E[1H Xn] = E[1H Xm] , m ≥ n .

Now, Proposition 11.10 says that if X_n →L¹ X, then for every bounded random variable Y, lim_n E[X_n Y] = E[XY]. Hence, since X_m − X_n →L¹ X_∞ − X_n (as m → ∞), with Y = 1_H we have

E[1_H (X_∞ − X_n)] = lim_m E[1_H (X_m − X_n)] = 0 .

Since H ∈ Fn was arbitrary, this implies Xn = E[X∞ | Fn].

Proof of Theorem 12.2iii. The martingale property follows from the properties of repeated conditioning: for each m ≤ n, F_m ⊂ F_n, and therefore (almost surely)

E[Xn | Fm] = E[E[X | Fn] | Fm] = E[X | Fm] = Xm ,

so (Xn, Fn) is a martingale. To show uniform integrability, observe that by Jensen’s inequality (Eq. (8.12)),

|Xn| = |E[X | Fn]| ≤ E[|X| | Fn] almost surely.

Hence, for every A ∈ F_n,

∫_A |X_n| dP ≤ ∫_A |X| dP ,

which holds in particular for A = {|Xn| > b}, b ∈ R. By Proposition 11.12, since X is integrable, (Xn) is uniformly integrable.

12.3 Martingale inequalities

A tail bound for an R-valued random variable X is an inequality that upper-bounds the probability that X takes values far from its mean. Markov's inequality (8.9) is an example:

P{|X| > c} ≤ (1/c) E[|X|] .

Even this elementary example is extremely useful; we've used it (and its variants like Chebyshev's inequality) numerous times in our proofs.

Maximal inequalities. Instead of a single random variable, suppose we have a sequence of random variables (X_n)_{n∈N}. A natural (and useful) question is whether similar tail bounds hold uniformly (i.e., simultaneously) for all random variables in the sequence. Such bounds are called maximal inequalities. For example, when the X_n's are independent, much can be said about the sequence of sums (S_n); Çinlar [Çin11, Ch. III.7] is devoted to the topic. The following result is a commonly used example.

Theorem 12.4 (Kolmogorov's maximal inequality). Suppose that (X_n) is a sequence of independent R-valued random variables, each with mean zero and finite variance. Then, for every c ∈ (0, ∞),

P{max_{k≤n} |S_k| > c} ≤ (1/c²) Var[S_n] .

We saw in Example 12.1 that (S_n) is a martingale with respect to the canonical filtration of (X_n); we might ask whether such an inequality holds only for the particular martingale (S_n), or whether it (or something similar) holds for more general martingales. In fact, it does, and Kolmogorov's inequality is recovered as a special case.

Theorem 12.5 (Doob's martingale inequality). Let X = (X_n, F_n)_{n∈N} be a martingale in L^p for some p ∈ [1, ∞). Then, for c > 0,

P{max_{k≤n} |X_k| > c} ≤ (1/c^p) E[|X_n|^p] .
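A quick numerical check of Doob's inequality with p = 2, in a hypothetical setting (a ±1 random walk martingale, so E[|X_n|²] = n; the constants are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)  # seed is arbitrary

# Doob's inequality with p = 2 for a +/-1 random walk martingale:
# P{max_{k<=n} |X_k| > c} <= E[X_n^2] / c^2 = n / c^2.
n, c, reps = 200, 30.0, 50_000
paths = np.cumsum(rng.choice((-1, 1), size=(reps, n)), axis=1)
lhs = np.mean(np.abs(paths).max(axis=1) > c)
print(lhs, n / c**2)  # empirical probability vs. the bound
```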

This result actually follows as a corollary of a more general theorem whose proof is interesting: it involves many of the key ideas related to martingales and stopping times. See Çinlar [Çin11, Thm. V.3.21]. A simpler result that also has an independent-variables counterpart is Azuma's inequality, which is the martingale generalization of Hoeffding's inequality for independent random variables. (Therefore, it is often called the Azuma–Hoeffding inequality.) Azuma's inequality provides a powerful tail bound for a martingale whose increments ∆_n = X_n − X_{n−1} are bounded almost surely.

Theorem 12.6 (Azuma–Hoeffding inequality). Let X = (X_n, F_n)_{n∈N} be a martingale with E[X_0] = µ. If there is a sequence of constants c_n ≥ 0 such that |X_1 − µ| = |∆_1| ≤ c_1 and |∆_n| ≤ c_n almost surely, then for any n ≥ 1 and any b > 0,

P{|X_n − µ| > b} ≤ 2 exp( −b² / (2 Σ_{m=1}^n c_m²) ) .

The proof relies on the following approximation.

Lemma 12.7. Fix some constant d > 0. Then for every x such that |x/d| ≤ 1 and every t > 0,

e^{tx} ≤ e^{t²d²/2} + ((e^{td} − e^{−td})/(2d)) x .

A proof of this can be found in, e.g., the lecture notes here.

Proof of Theorem 12.6. A martingale with its mean subtracted is a mean-zero martingale, so without loss of generality, we may assume that E[X_0] = 0. By Markov's inequality, for any b > 0 and t > 0,

P{X_n > b} = P{e^{tX_n} > e^{tb}} ≤ e^{−tb} E[e^{tX_n}] = e^{−tb} E[e^{t Σ_{m=1}^n ∆_m}] .

Now, via repeated conditioning and noting that ∆_m ∈ F_{m′} for m ≤ m′,

E[e^{t Σ_{m=1}^n ∆_m}] = E[ e^{t Σ_{m=1}^{n−1} ∆_m} E[e^{t∆_n} | F_{n−1}] ] .

Applying Lemma 12.7 to the conditional expectation with x = ∆_n and d = c_n, and using the martingale property E[∆_n | F_{n−1}] = 0, we have

E[e^{t∆_n} | F_{n−1}] ≤ e^{t²c_n²/2} + ((e^{tc_n} − e^{−tc_n})/(2c_n)) E[∆_n | F_{n−1}] = e^{t²c_n²/2} .   (12.11)

We can continue iterating in this way to obtain

P{X_n > b} ≤ e^{−tb} E[e^{t Σ_{m=1}^n ∆_m}] ≤ e^{−tb} e^{t² Σ_{m=1}^n c_m² / 2} .

Optimizing over t to minimize the bound, we set t = b / Σ_{m=1}^n c_m² to obtain

P{X_n > b} ≤ exp( −b² / (2 Σ_{m=1}^n c_m²) ) .

A similar argument for the negative tail of Xn yields

P{X_n < −b} ≤ exp( −b² / (2 Σ_{m=1}^n c_m²) ) .

Combining the two with a union bound yields the result.
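A simulation sketch comparing the Azuma–Hoeffding bound with the actual tail, in a hypothetical setting (a ±1 random walk, so µ = 0 and c_m = 1 for every m; n and b are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)  # seed is arbitrary

# Empirical tail of X_n for a +/-1 random walk vs. the Azuma-Hoeffding bound
# 2 * exp(-b^2 / (2n)) from Theorem 12.6 (with c_m = 1 for all m).
n, b, reps = 100, 20.0, 100_000
Xn = rng.choice((-1, 1), size=(reps, n)).sum(axis=1)
empirical = np.mean(np.abs(Xn) > b)
bound = 2 * np.exp(-b**2 / (2 * n))
print(empirical, bound)  # the empirical tail sits well below the bound
```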

Remark: Bounds for sub- and supermartingales. Observe that the inequality E[e^{t∆_n} | F_{n−1}] ≤ e^{t²c_n²/2} from (12.11) holds even for supermartingales (for t > 0, the coefficient multiplying E[∆_n | F_{n−1}] is positive, and E[∆_n | F_{n−1}] ≤ 0), and therefore the one-sided bound

P{X_n − µ ≥ b} ≤ exp( −b² / (2 Σ_{m=1}^n c_m²) )

holds for supermartingales. Since the negative of a supermartingale is a submartingale, the other one-sided bound,

P{X_n − µ ≤ −b} ≤ exp( −b² / (2 Σ_{m=1}^n c_m²) ) ,

holds for submartingales.

A Additional results

A.1 An example of the use of Radon–Nikodym derivatives

A simple application of change of measure using Radon–Nikodym derivatives is the following result.

Proposition A.1. Let X be a random variable taking values in (E, E), and µ its distribution. If E[f(X)] = 0 for some f ∈ E_+, then f(X) = 0 µ-a.s. Furthermore, if Y is a random variable taking values in (E, E), with distribution ν ≪ µ, then f(Y) = 0 ν-a.s.

Proof. The first claim is a restatement of Proposition 7.5 iii). To see the second claim, note that

E[f(Y)] = ∫_E f dν = ∫_E f (dν/dµ) dµ = E[f(X) (dν/dµ)(X)] = 0 ,

where the final equality follows from the fact that f(X) = 0 µ-a.s. Another application of Proposition 7.5 iii) implies that f(Y) = 0 ν-a.s.
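The change-of-measure identity in this proof is also the basis of importance sampling. A minimal sketch under assumed distributions (µ = N(0, 1), ν = N(1, 1), and f(x) = x², so E[f(Y)] = 2; nothing here is specific to the proposition):

```python
import numpy as np

rng = np.random.default_rng(7)  # seed is arbitrary

# Importance sampling via E[f(Y)] = E[f(X) * (dnu/dmu)(X)]: with mu = N(0, 1)
# and nu = N(1, 1), the Radon-Nikodym derivative is (dnu/dmu)(x) = exp(x - 1/2).
x = rng.standard_normal(1_000_000)
w = np.exp(x - 0.5)         # dnu/dmu evaluated at the samples
print(np.mean(x**2 * w))    # estimates E[Y^2] = Var[Y] + E[Y]^2 = 2
```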

References

[Abb15] S. Abbott. Understanding Analysis. 2nd ed. Springer New York, 2015.
[AB06] C. D. Aliprantis and K. C. Border. Infinite Dimensional Analysis: A Hitchhiker's Guide. Springer-Verlag, 2006.
[Çin11] E. Çinlar. Probability and Stochastics. Springer New York, 2011.
[Daw79] A. P. Dawid. "Conditional Independence in Statistical Theory". In: J. Royal Stat. Soc. Ser. B 41.1 (1979), pp. 1–31.
[Daw80] A. P. Dawid. "Conditional Independence for Statistical Operations". In: The Annals of Statistics 8.3 (1980), pp. 598–617.
[Dia88] P. Diaconis. "Sufficiency as statistical symmetry". In: Proceedings of the AMS Centennial Symposium. Ed. by F. Browder. American Mathematical Society, 1988, pp. 15–26.
[DF84] P. Diaconis and D. Freedman. "Partial Exchangeability and Sufficiency". In: Proc. Indian Stat. Inst. Golden Jubilee Int'l Conf. Stat.: Applications and New Directions. Ed. by J. K. Ghosh and J. Roy. Indian Statistical Institute, 1984, pp. 205–236.
[Doo94] J. L. Doob. Measure Theory. Springer New York, 1994.
[Gut05] A. Gut. Probability: A Graduate Course. Springer New York, 2005.
[JP04] J. Jacod and P. Protter. Probability Essentials. 2nd ed. Springer-Verlag Berlin Heidelberg, 2004.
[Lau74] S. L. Lauritzen. "Sufficiency, Prediction and Extreme Models". In: Scandinavian Journal of Statistics 1.3 (1974), pp. 128–134.
[MT93] S. P. Meyn and R. L. Tweedie. Markov Chains and Stochastic Stability. Springer-Verlag London, 1993.
[RR04] G. O. Roberts and J. S. Rosenthal. "General state space Markov chains and MCMC algorithms". In: Probability Surveys 1 (2004), pp. 20–71.
[SS05] E. M. Stein and R. Shakarchi. Real Analysis: Measure Theory, Integration, and Hilbert Spaces. Princeton University Press, 2005.
