
5

Absolute continuity and related topics

5.1 Signed and complex measures

Relaxation of the requirement that a measure be nonnegative yields what is usually called a signed measure. Specifically this is an extended real-valued, countably additive set function μ on a class E (containing ∅), such that μ(∅) = 0, and such that μ assumes at most one of the values +∞ and –∞ on E. As for measures, a signed measure μ defined on a class E is called finite on E if |μ(E)| < ∞ for each E ∈ E, and σ-finite if for each E ∈ E there is a sequence {En}∞n=1 of sets in E with E ⊂ ∪∞n=1En and |μ(En)| < ∞, that is, if each E can be covered by the union of a sequence of sets with finite (signed) measure. It will usually be assumed that the class on which μ is defined is a σ-ring or σ-field. Some of the important properties of measures (see Section 2.2) hold also for signed measures. In particular a signed measure is subtractive and continuous from below and above. The basic properties of signed measures are given in the following theorem.

Theorem 5.1.1 Let μ be a signed measure on a σ-ring S.

(i) If E, F ∈ S, E ⊂ F and |μ(F)| < ∞ then |μ(E)| < ∞.

(ii) If E, F ∈ S, E ⊂ F and |μ(E)| < ∞ then μ(F – E) = μ(F) – μ(E).

(iii) If {En}∞n=1 is a disjoint sequence of sets in S such that |μ(∪∞n=1En)| < ∞ then the series Σ∞n=1 μ(En) converges absolutely.

(iv) If {En}∞n=1 is a monotone sequence of sets in S, and if |μ(En)| < ∞ for some integer n in the case when {En} is a decreasing sequence, then

μ(limn En) = limn μ(En).

Proof If E, F ∈ S, E ⊂ F then F = E ∪ (F – E), a union of two disjoint sets, and from the countable (and hence also finite) additivity of μ, μ(F) = μ(E) + μ(F – E).


Hence (i) follows since if μ(F) is finite, so are (both) μ(E) and μ(F – E). On the other hand if μ(E) is assumed finite it can be subtracted from both sides to give (ii).

(iii) Let E+n = En or ∅, and E–n = ∅ or En, according as μ(En) ≥ 0 or μ(En) < 0 respectively. Then

Σ∞n=1 μ(E+n) = μ(∪∞n=1E+n) and Σ∞n=1 μ(E–n) = μ(∪∞n=1E–n)

imply by (i) that Σ∞n=1 μ(E+n) and Σ∞n=1 μ(E–n) are both finite. Hence

Σ∞n=1 |μ(En)| = Σ∞n=1 (μ(E+n) – μ(E–n)) = Σ∞n=1 μ(E+n) – Σ∞n=1 μ(E–n)

is finite as required.

(iv) is shown as for measures (Theorems 2.2.4 and 2.2.5). □

While not needed here, it is worth noting that the requirement that μ be (extended) real may also be altered to allow complex values. That is, a complex measure is a complex-valued, countably additive set function μ defined on a class E (containing ∅) and such that μ(∅) = 0. Thus if En are disjoint sets of E with ∪∞n=1En = E ∈ E, we have μ(E) = Σ∞n=1 μ(En). Since the convergence of a complex sequence requires convergence of its real and imaginary parts, it follows that the real and imaginary parts of μ are countably additive. That is, a complex measure μ may be written in the form μ = λ + iν where λ and ν are finite signed measures. Conversely, of course, if λ and ν are finite signed measures then λ + iν is a complex measure. Thus the complex measures are precisely the set functions of the form λ + iν where λ and ν are finite signed measures. Some of the properties of complex measures are given in Ex. 5.29.
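On a finite space a signed measure is simply an assignment of (possibly negative) masses to points, which makes the properties of Theorem 5.1.1 easy to check numerically. The following is a minimal illustrative sketch; the space, masses, and names are invented for the example and are not from the text.

```python
# A signed measure on a four-point space, given by (possibly negative)
# point masses.  All names and values here are illustrative only.
point_masses = {"a": 2.0, "b": -1.5, "c": 0.5, "d": -3.0}

def mu(E):
    """Signed measure of a subset E: the sum of the point masses in E."""
    return sum(point_masses[x] for x in E)

X = set(point_masses)
E = {"a", "b"}
F = {"a", "b", "c"}          # E is a subset of F

# Theorem 5.1.1 (ii): mu(F - E) = mu(F) - mu(E) when mu(E) is finite.
subtractive_ok = abs(mu(F - E) - (mu(F) - mu(E))) < 1e-12

# Finite additivity over a disjoint partition of X.
additive_ok = abs(mu(X) - (mu({"a", "b"}) + mu({"c", "d"}))) < 1e-12
```

Since the space is finite every set has finite signed measure, so the finiteness hypotheses of the theorem hold automatically here.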

5.2 Hahn and Jordan decompositions

If μ1, μ2 are two measures on a σ-field S, their sum μ1 + μ2 (defined for E ∈ S as μ1(E) + μ2(E)) is clearly a measure on S. The difference μ1(E) – μ2(E) is not necessarily defined for all E ∈ S (i.e. if μ1(E) = μ2(E) = ∞). However, if at least one of the measures μ1 and μ2 is finite, μ1 – μ2 is defined for every E ∈ S and is a signed measure on S. It will be shown in this section that every signed measure can be written as a difference of two measures of which at least one is finite (Theorem 5.2.2).

If μ is a signed measure on a measurable space (X, S), a set E ∈ S will be called positive (resp. negative, null), if μ(F) ≥ 0 (resp. μ(F) ≤ 0, μ(F) = 0) for all F ∈ S with F ⊂ E. Notice that measurable subsets of positive sets are positive sets. Further, the union of a sequence {An} of positive sets is clearly positive (if F ∈ S, F ⊂ ∪∞n=1An, then F = ∪∞n=1(F ∩ An) = ∪∞n=1Fn where Fn are disjoint sets of S and Fn ⊂ F ∩ An (Lemma 1.6.3), so that μ(Fn) ≥ 0 and μ(F) = Σn μ(Fn) ≥ 0). Similar statements are true for negative and null sets.

Theorem 5.2.1 (Hahn Decomposition) If μ is a signed measure on the measurable space (X, S), then there exist two disjoint sets A, B such that A is positive, and B is negative, and A ∪ B = X.

Proof Since μ assumes at most one of the values +∞, –∞, assume for definiteness that –∞ < μ(E) ≤ +∞ for all E ∈ S. Define

λ = inf{μ(E):E negative}.

Since the empty set ∅ is negative, λ ≤ 0. Let {Bn}∞n=1 be a sequence of negative sets such that λ = limn→∞ μ(Bn) and let B = ∪∞n=1Bn. The theorem will be proved in steps as follows:

(i) B is negative since as noted above the countable union of negative sets is negative.

(ii) μ(B) = λ, and thus –∞ < λ ≤ 0. For certainly λ ≤ μ(B) by (i) and the definition of λ. Also for each n, B = (B – Bn) ∪ Bn and hence

μ(B) = μ(B – Bn) + μ(Bn) ≤ μ(Bn)

since B – Bn ⊂ B (negative). It follows that μ(B) ≤ limn→∞ μ(Bn) = λ, so that μ(B) = λ as stated.

(iii) Let A = X – B. If F ⊂ A is negative, then F is null. For let F ⊂ A be negative and G ∈ S, G ⊂ F. Then G is negative and E = B ∪ G is negative. Hence, by the definition of λ and (ii), λ ≤ μ(E) = μ(B) + μ(G) = λ + μ(G). Thus μ(G) ≥ 0 but since F is negative, μ(G) ≤ 0, so that μ(G) = 0. Thus F is null.

(iv) A = X – B is positive. Assume on the contrary that there exists E0 ⊂ A, E0 ∈ S, with μ(E0) < 0. Since E0 is not null, by (iii) it is not negative. Let k1 be the smallest positive integer such that there is a measurable set E1 ⊂ E0 with μ(E1) ≥ 1/k1. Since μ(E0) is finite (–∞ < μ(E0) < 0) and E1 ⊂ E0, Theorem 5.1.1 (i) and (ii) give μ(E0 – E1) = μ(E0) – μ(E1) < 0, since μ(E0) < 0, μ(E1) > 0. Thus the same argument now applies to E0 – E1. Let k2 be the smallest positive integer such that there is a measurable set E2 ⊂ E0 – E1 with μ(E2) ≥ 1/k2. Proceeding inductively, let kn be the smallest positive integer such that there is a measurable set En ⊂ E0 – ∪n–1i=1 Ei with μ(En) ≥ 1/kn.

Write F0 = E0 – ∪∞i=1Ei. Now ∪∞n=1En ⊂ E0, |μ(E0)| < ∞, so that Σ∞n=1 μ(En) (= μ(∪∞n=1En)) converges and hence μ(En) → 0, so that kn → ∞. Now for each n, F0 ⊂ E0 – ∪n–1i=1 Ei. Hence for all F ∈ S, F ⊂ F0, we have μ(F) < 1/(kn – 1) so that μ(F) ≤ 0, since kn → ∞. Thus F0 is negative and by (iii) F0 is null. But

μ(F0) = μ(E0) – Σ∞i=1 μ(Ei) < 0

since μ(E0) < 0, μ(Ei) > 0, i = 1, 2, .... But μ(F0) < 0 contradicts the fact that F0 is null. Hence the assumption that A is not positive leads to a contradiction, so that A is positive, as stated. □
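For a signed measure given by point masses on a finite space, a Hahn decomposition can be read off directly: collect the points of nonnegative mass into A and the rest into B. The following sketch (names and values invented for illustration) verifies positivity and negativity over every measurable subset by brute force.

```python
from itertools import chain, combinations

# Signed measure via point masses (illustrative values only).
point_masses = {"a": 2.0, "b": -1.5, "c": 0.0, "d": -3.0, "e": 1.0}

def mu(E):
    return sum(point_masses[x] for x in E)

# Candidate Hahn decomposition: A carries the nonnegative masses,
# B the strictly negative ones.
A = {x for x, m in point_masses.items() if m >= 0}
B = set(point_masses) - A

def subsets(S):
    """All subsets of a finite set S (as tuples)."""
    s = list(S)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

# A is positive (mu(F) >= 0 for every F contained in A); B is negative.
A_positive = all(mu(F) >= 0 for F in subsets(A))
B_negative = all(mu(F) <= 0 for F in subsets(B))
```

Moving the zero-mass point "c" from A to B gives a second valid Hahn decomposition, illustrating the non-uniqueness of the decomposition (a null set may be attached to either side).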

A representation of X as a disjoint union of a positive set A and a negative set B is called a Hahn decomposition of X with respect to μ. Thus, by the theorem, a Hahn decomposition always exists, but is clearly not unique (since a null set may be attached to either A or B – see the example after Theorem 5.2.3). Even though a Hahn decomposition of X with respect to the signed measure μ is not unique, it does provide a representation of μ as the difference of two measures which does not depend on the particular Hahn decomposition used. This is seen in the following theorem.

Theorem 5.2.2 (Jordan Decomposition) Let μ be a signed measure on a measurable space (X, S). If X = A ∪ B is a Hahn decomposition of X for μ, then the set functions μ+, μ– defined on S by

μ+(E) = μ(E ∩ A), μ–(E) = –μ(E ∩ B) for each E ∈ S,

are measures on S, at least one of which is finite, and μ = μ+ – μ–. The measures μ+, μ– do not depend on the particular Hahn decomposition chosen. The expression μ = μ+ – μ– is called the Jordan decomposition of the signed measure μ.

Proof Since A ∩ E ⊂ A (positive) and B ∩ E ⊂ B (negative), the set functions μ+ and μ– are nonnegative, and thus are clearly measures on S. Since μ assumes at most one of the values ±∞, at least one of μ+, μ– is finite. Also, for every E ∈S,

μ(E) = μ(E ∩ A) + μ(E ∩ B) = μ+(E) – μ–(E)

and thus μ = μ+ – μ–. In order to prove that μ+, μ– do not depend on the particular Hahn decomposition chosen, we consider two Hahn decompositions X = A1 ∪ B1 = A2 ∪ B2 of X with respect to μ and show that for each E ∈ S,

μ(E ∩ A1)=μ(E ∩ A2) and μ(E ∩ B1)=μ(E ∩ B2).

Notice that the set E ∩ (A1 – A2) is a subset of the positive set A1, and thus μ{E ∩ (A1 – A2)} ≥ 0, as well as of the negative set B2, so that μ{E ∩ (A1 – A2)} ≤ 0. Hence μ{E ∩ (A1 – A2)} = 0 for each E ∈ S. Similarly μ{E ∩ (A2 – A1)} = 0 and it follows that

μ(E ∩ A1) = μ(E ∩ A1 ∩ A2) = μ(E ∩ A2)

as desired. It follows in the same way that μ(E ∩ B1) = μ(E ∩ B2) and thus the proof is complete. □

It is clear that a signed measure may be written as a difference of two measures in many ways; e.g. μ = (μ+ + λ) – (μ– + λ) where λ is an arbitrary finite measure. However, among all possible decompositions of a signed measure as a difference of two measures, the Jordan decomposition is characterized by a certain uniqueness property and also by a “minimal property”, given in Ex. 5.6.

The set function |μ| defined on S by |μ|(E) = μ+(E) + μ–(E) is clearly a measure (see Ex. 4.11) and is called the total variation of μ. Note that a set E ∈ S is positive if and only if μ–(E) = 0. For if E is positive, E ∩ B is a subset of both the positive set E and the negative set B so that μ(E ∩ B) = 0 and hence μ–(E) = 0. Conversely if μ–(E) = 0 and F ∈ S, F ⊂ E then μ–(F) = 0 and μ(F) = μ+(F) ≥ 0, showing that E is positive. Similarly E is negative if and only if μ+(E) = 0. Also |μ(E)| ≤ |μ|(E) with equality only if E is positive or negative. Finally note that |μ|(E) = 0 implies that E is a null set with respect to |μ|, μ+, μ– and μ.

A useful example of a signed measure is provided by the indefinite integral of a function whose integral can be defined, as shown in the following result.

Theorem 5.2.3 Let (X, S, μ) be a measure space and f a measurable function defined a.e. on X and such that either f+ ∈ L1(X, S, μ) or f– ∈ L1(X, S, μ). Then the set function ν defined for each E ∈ S by

ν(E) = ∫E f dμ

is a signed measure on S; and if f ∈ L1(X, S, μ) then ν is a finite signed measure.

Proof Clearly ν(∅) = 0 and if f ∈ L1(X, S, μ) then ν is finite. The proof will be completed by checking countable additivity of ν. Let {En}∞n=1 be a sequence of disjoint measurable sets, E = ∪∞n=1En. Then f+χE = Σ∞n=1 f+χEn a.e. (i.e. for all x for which f is defined) and by the corollary to Theorem 4.5.2

∫E f+ dμ = ∫ f+χE dμ = Σ∞n=1 ∫ f+χEn dμ = Σ∞n=1 ∫En f+ dμ.

Hence ∫E f+ dμ = Σ∞n=1 ∫En f+ dμ and similarly ∫E f– dμ = Σ∞n=1 ∫En f– dμ. Since either f+ ∈ L1(μ) or f– ∈ L1(μ), at least one of the two positive series converges to a finite number and thus

ν(E) = ∫E f+ dμ – ∫E f– dμ = Σ∞n=1 (∫En f+ dμ – ∫En f– dμ) = Σ∞n=1 ∫En f dμ = Σ∞n=1 ν(En)

as required. □

It is clear that a Hahn decomposition of X with respect to ν is A ∪ B where A = {x : f(x) ≥ 0} and B = Ac (i.e. {x : f(x) < 0} if f is defined on X). If the set {x : f(x) = 0} is nonempty then another Hahn decomposition is A1 ∪ B1 where A1 = {x : f(x) > 0} and B1 = Ac1. The Jordan decomposition ν = ν+ – ν– of ν is given in both cases by

ν+(E) = ∫E f+ dμ, ν–(E) = ∫E f– dμ for each E ∈ S,

and the total variation |ν| of ν is

|ν|(E) = ν+(E) + ν–(E) = ∫E f+ dμ + ∫E f– dμ = ∫E |f| dμ.

Finally the following simple application of the Jordan decomposition shows that extensions of σ-finite signed measures have a uniqueness property corresponding to that for measures. This will be useful later.

Lemma 5.2.4 Let μ, ν be signed measures on the σ-field S which are equal on a semiring P such that S(P) = S. If μ is σ-finite on P then μ = ν on S.

Proof Write μ = μ+ – μ–, ν = ν+ – ν–. For E ∈ P,

μ+(E) – μ–(E) = ν+(E) – ν–(E)

and hence μ+(E) + ν–(E) = ν+(E) + μ–(E) when all four terms are finite. But if e.g. μ+(E) = ∞ then clearly ν+(E) = ∞ (and μ–(E), ν–(E) are finite) so that the same rearrangement holds, i.e. μ+ + ν– = ν+ + μ– on P. Since these two σ-finite measures are equal on P, they are equal on S(P) = S, from which μ = ν on S follows by the reverse rearrangement. □
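Theorems 5.2.2 and 5.2.3 can both be illustrated on a finite space: take a (nonnegative) measure μ given by point masses and an integrand f, form ν(E) = ∫E f dμ, and compute the Jordan decomposition and total variation of ν. The following is a sketch only; all names and values are invented for the example.

```python
# Measure mu and integrand f on a four-point space (illustrative values).
mu_mass = {"a": 1.0, "b": 2.0, "c": 0.5, "d": 1.0}
f = {"a": 3.0, "b": -1.0, "c": 0.0, "d": -2.0}

def nu(E):
    """nu(E) = integral of f over E with respect to mu."""
    return sum(f[x] * mu_mass[x] for x in E)

# A Hahn decomposition for nu: A = {f >= 0}, B = {f < 0}.
A = {x for x in f if f[x] >= 0}
B = set(f) - A

def nu_plus(E):    # nu+(E) = nu(E ∩ A) = integral of f+ over E
    return nu(set(E) & A)

def nu_minus(E):   # nu-(E) = -nu(E ∩ B) = integral of f- over E
    return -nu(set(E) & B)

def total_var(E):  # |nu|(E) = nu+(E) + nu-(E) = integral of |f| over E
    return nu_plus(E) + nu_minus(E)

X = set(f)
E = {"a", "b", "d"}
jordan_ok = abs(nu(E) - (nu_plus(E) - nu_minus(E))) < 1e-12
bound_ok = abs(nu(E)) <= total_var(E)
```

Here nu(E) = 3 − 2 − 2 = −1 while |nu|(E) = 3 + 2 + 2 = 7, showing the strict inequality |ν(E)| < |ν|(E) for a set that is neither positive nor negative.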

5.3 Integral with respect to signed measures

If μ is a signed measure on (X, S) with Jordan decomposition μ = μ+ – μ–, the integral with respect to μ over X of any f which belongs to both L1(X, S, μ+) and L1(X, S, μ–) may be defined by

∫ f dμ = ∫ f dμ+ – ∫ f dμ–
= ∫ f+ dμ+ – ∫ f– dμ+ – ∫ f+ dμ– + ∫ f– dμ–.

Notice that since |μ| = μ+ + μ– we have for every measurable f defined a.e. (|μ|) on X

∫ |f| d|μ| = ∫ |f| dμ+ + ∫ |f| dμ–

(see Ex. 4.11) and thus f belongs to both L1(X, S, μ+) and L1(X, S, μ–) if and only if f ∈ L1(X, S, |μ|). Further, as at the end of Section 4.3, if f is a measurable function defined a.e. (|μ|) on X but f ∉ L1(X, S, |μ|), we may define ∫ f dμ = +∞ when the two negative terms in the above defining expression for ∫ f dμ are finite and one of the positive terms is +∞. That is, ∫ f dμ = +∞ when f– ∈ L1(μ+), f+ ∈ L1(μ–) and f+ ∉ L1(μ+) or f– ∉ L1(μ–). Similarly ∫ f dμ is defined as –∞ when f+ ∈ L1(μ+), f– ∈ L1(μ–) and f– ∉ L1(μ+) or f+ ∉ L1(μ–). This integral has many of the basic properties of the integral with respect to a measure described in Chapter 4. A few of these are collected here, more as examples and for reference than for detailed study.

Theorem 5.3.1 (i) If μ is a signed measure and f ∈ L1(|μ|), then |∫ f dμ| ≤ ∫ |f| d|μ|.

(ii) (Dominated Convergence) Let μ be a signed measure, {fn} a sequence of functions in L1(|μ|) and g ∈ L1(|μ|) such that |fn| ≤ |g| a.e. (|μ|) for

each n = 1, 2, .... If f is a measurable function such that fn → f a.e. (|μ|) then f ∈ L1(|μ|) and

∫ |fn – f| d|μ| → 0, ∫ fn dμ → ∫ f dμ as n → ∞.

Proof (i) By using the corresponding property for measures (Theorem 4.4.5) and Ex. 4.11, we have by the definition ∫ f dμ = ∫ f dμ+ – ∫ f dμ–,

|∫ f dμ| ≤ |∫ f dμ+| + |∫ f dμ–| ≤ ∫ |f| dμ+ + ∫ |f| dμ– = ∫ |f| d|μ|.

(ii) The first limit is just dominated convergence for the measure |μ| (Theorem 4.5.5), and the second limit follows from the first and the inequality in (i). □
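The definition of ∫ f dμ via the Jordan decomposition, and the inequality of Theorem 5.3.1 (i), can be checked directly when μ is given by signed point masses. A sketch with invented values:

```python
# Signed measure mu via point masses, and an integrand f (illustrative).
masses = {"a": 2.0, "b": -1.0, "c": 0.5}
f = {"a": 1.0, "b": 4.0, "c": -2.0}

def integral(h, m):
    """Integral of h with respect to the point-mass measure m."""
    return sum(h[x] * m[x] for x in m)

# Jordan decomposition and total variation of mu.
mu_plus  = {x: max(v, 0.0) for x, v in masses.items()}
mu_minus = {x: max(-v, 0.0) for x, v in masses.items()}
abs_mu   = {x: abs(v) for x, v in masses.items()}

lhs = integral(f, masses)                                  # integral of f dmu
via_jordan = integral(f, mu_plus) - integral(f, mu_minus)  # via mu+ and mu-
abs_f = {x: abs(v) for x, v in f.items()}
bound = integral(abs_f, abs_mu)                            # integral of |f| d|mu|
```

Here ∫ f dμ = 2 − 4 − 1 = −3, while ∫ |f| d|μ| = 2 + 4 + 1 = 7, consistent with the inequality in (i).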

The next result is the transformation theorem for signed measures. As for measures it may be extended to nonintegrable cases where integrals are defined.

Theorem 5.3.2 Let (X, S) and (Y, T) be measurable spaces, μ a signed measure on S and T a measurable transformation defined a.e. (|μ|) on X into Y. Then the set function μT–1 defined on T by (μT–1)(E) = μ(T–1E), E ∈ T, is a signed measure on T, and if f is a T-measurable function defined a.e. (μT–1) on Y and such that fT ∈ L1(|μ|), then f ∈ L1(|μT–1|) and

∫Y f dμT–1 = ∫X fT dμ.

Proof Exactly as when μ is a measure it is seen that μT–1 is countably additive (Theorem 3.7.1) and that μT–1(∅) = 0. Also, since μ assumes at most one of the values ±∞, so does μT–1. Thus μT–1 is a signed measure on T.

Now assume first for simplicity that T is defined on X. Then T–1T is a σ-field (Theorem 3.2.2) and let λ denote the restriction of μ from S to T–1T ⊂ S. Clearly λT–1 = μT–1. Let Y = A ∪ B be a Hahn decomposition of Y for λT–1, with A positive and B negative. We now show that X = (T–1A) ∪ (T–1B) is a Hahn decomposition of X for λ. Indeed T–1A and T–1B are disjoint sets in T–1T with union X. Now if E is a T–1T-measurable subset of T–1A, then E = T–1G for some G ∈ T. Since E = T–1G ⊂ T–1A we have E = T–1(G ∩ A) and thus λ(E) = λT–1(G ∩ A) ≥ 0 since A is positive for λT–1. It follows that T–1A is positive for λ and similarly T–1B is negative for λ.

Now let λ = λ+ – λ– be the Jordan decomposition of λ. We show that λT–1 = (λ+ – λ–)T–1 = λ+T–1 – λ–T–1 is the Jordan decomposition of λT–1. Indeed for each E ∈ T,

(λ+T–1)(E) = λ(T–1E ∩ T–1A) = λ{T–1(E ∩ A)} = (λT–1)(E ∩ A) = (λT–1)+(E)

since Y = A ∪ B is a Hahn decomposition of Y for λT–1. Hence λ+T–1 = (λT–1)+ and similarly λ–T–1 = (λT–1)–. It thus follows that λT–1 = λ+T–1 – λ–T–1 is the Jordan decomposition of λT–1, and

|λT–1| = λ+T–1 + λ–T–1 = (λ+ + λ–)T–1 = |λ|T–1.

Notice that |λ|(E) ≤ |μ|(E) for each E ∈ T–1T since

|λ|(E) = λ+(E) + λ–(E) = λ(E ∩ T–1A) – λ(E ∩ T–1B)
= μ(E ∩ T–1A) – μ(E ∩ T–1B)
≤ |μ|(E ∩ T–1A) + |μ|(E ∩ T–1B) = |μ|(E).

Thus by Theorem 4.6.1

∫Y |f| d|μT–1| = ∫Y |f| d|λT–1| = ∫Y |f| d|λ|T–1 = ∫X |fT| d|λ| ≤ ∫X |fT| d|μ|

(the inequality being an easy exercise whose details are left to the interested reader). Hence fT ∈ L1(|μ|) implies f ∈ L1(|μT–1|) and, again by Theorem 4.6.1,

∫Y f dμT–1 = ∫Y f dλT–1 = ∫Y f dλ+T–1 – ∫Y f dλ–T–1 = ∫X fT dλ+ – ∫X fT dλ– = ∫X fT dλ = ∫X fT dμ

with the last equality from Ex. 4.10. Thus the theorem follows when T is defined on X.

The requirement that T is defined on X may then be weakened to T defined a.e. (|μ|) on X in the usual straightforward way (i.e. if T is defined on E ∈ S with |μ|(Ec) = 0, apply the previous result to the transformation T′ which is defined on X by T′x = Tx, x ∈ E, and T′x = y0, x ∈ Ec, where y0 is any fixed point in Y). This completes the proof of the theorem. □
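On a finite space the image measure μT–1 and the change-of-variables formula of Theorem 5.3.2 reduce to regrouping point masses by their image under T. A sketch (spaces, map, and values all invented for illustration):

```python
# Signed measure mu on X = {0, 1, 2, 3} and a map T into Y = {"e", "o"}.
masses = {0: 1.5, 1: -0.5, 2: 2.0, 3: -1.0}
T = {0: "e", 1: "e", 2: "o", 3: "o"}

# Image measure (mu T^{-1})(E) = mu(T^{-1}E): regroup masses by image point.
image = {}
for x, m in masses.items():
    image[T[x]] = image.get(T[x], 0.0) + m

f = {"e": 2.0, "o": -1.0}
lhs = sum(f[y] * image[y] for y in image)       # integral of f d(mu T^{-1}) over Y
rhs = sum(f[T[x]] * masses[x] for x in masses)  # integral of (f o T) dmu over X
```

The two sums contain exactly the same terms, grouped differently, which is the content of the transformation theorem in this elementary setting.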

5.4 Absolute continuity and singularity

In this section (X, S) will be a fixed measurable space and μ, ν two signed measures on S (in particular one or both of μ and ν may be measures). Then ν is said to be absolutely continuous with respect to μ, written ν ≪ μ, if ν(E) = 0 for all E ∈ S such that |μ|(E) = 0. Of course when μ is a measure |μ| = μ and ν ≪ μ if all measurable sets with μ-measure zero have also ν-measure zero. In any case, the involvement of |μ| in the definition implies trivially that ν ≪ μ if and only if ν ≪ |μ|. If μ and ν are mutually absolutely continuous, that is if ν ≪ μ and μ ≪ ν, then μ and ν are said to be equivalent, written μ ∼ ν. When both μ and ν are measures, they are equivalent if and only if they have the same zero measure sets.

Theorem 5.2.3 provides an example of a signed measure ν which is absolutely continuous with respect to a measure μ: the indefinite μ-integral defined by ν(E) = ∫E f dμ where f is such that f+ ∈ L1(μ) or f– ∈ L1(μ). In fact the celebrated Radon–Nikodym Theorem of the next section (Theorem 5.5.3) shows that when μ is a σ-finite measure then all σ-finite signed measures ν with ν ≪ μ are indefinite μ-integrals.

For two signed measures we now show that ν ≪ μ if and only if |ν| ≪ |μ|, i.e. ν ≪ μ whenever all measurable sets with total μ-variation zero have also total ν-variation zero. It follows that μ ∼ ν if and only if the total variations |μ| and |ν| give zero measure to the same class of measurable sets.

Theorem 5.4.1 If μ and ν are signed measures on the measurable space (X, S) then the following are equivalent:

(i) ν ≪ μ
(ii) ν+ ≪ μ and ν– ≪ μ
(iii) |ν| ≪ |μ|.

Proof To see that (i) implies (ii), fix E ∈ S with |μ|(E) = 0, and let X = A ∪ B be a Hahn decomposition of X with respect to ν. Then since |μ| is a measure, |μ|(E) = 0 implies |μ|(E ∩ A) = |μ|(E ∩ B) = 0. Since ν ≪ μ, ν(E ∩ A) = ν(E ∩ B) = 0 and thus ν+(E) = ν–(E) = 0. It follows that ν+ ≪ μ, ν– ≪ μ, and |ν| ≪ μ, giving (ii).
Clearly (ii) implies (iii) since |ν|(E) = ν+(E) + ν–(E) = 0 if |μ|(E) = 0. Finally to show that (iii) implies (i), let E ∈ S with |μ|(E) = 0. By (iii) |ν|(E) = 0, so that |ν(E)| ≤ |ν|(E) = 0, showing ν(E) = 0 and hence (i). □

Notice that, by Theorem 5.4.1, ν ≪ μ if and only if |ν| ≪ |μ| and thus if and only if |ν|(E) = 0 whenever |μ|(E) = 0, or equivalently, |μ|(E) > 0 whenever |ν|(E) > 0. In particular μ ∼ ν if and only if |μ| ∼ |ν| and thus if and only if |μ| and |ν| assign strictly positive measure to the same class of sets. A notion “opposite” to equivalence (∼), and thus also to absolute continuity (≪), would therefore be one under which |μ| and |ν| are concentrated on disjoint sets, so that they have essentially distinct classes of sets of strictly positive measure. Specifically two signed measures μ, ν defined on

S are called singular, written μ ⊥ ν, if and only if there is a set E ∈ S such that |μ|(E) = 0 = |ν|(Ec). It then follows that for every F ∈ S, |μ|(F ∩ E) = 0 and |ν|(F ∩ Ec) = 0 and thus

μ(F) = μ(F ∩ Ec) and ν(F) = ν(F ∩ E),

i.e. the measure μ is concentrated on the set Ec and the measure ν is concentrated on the set E. Important implications of the notions of absolute continuity and singularity are contained in the Lebesgue decomposition and the Radon–Nikodym Theorem given in the following section.
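Both notions are transparent for (nonnegative) point-mass measures on a finite space: ν ≪ μ means ν puts no mass where μ puts none, while ρ ⊥ μ means the mass-carrying points of the two measures can be separated by a measurable set. A sketch with invented values:

```python
# Three point-mass measures on a four-point space (illustrative values).
mu_mass  = {"a": 1.0, "b": 2.0, "c": 0.0, "d": 0.0}
nu_mass  = {"a": 0.5, "b": 1.0, "c": 0.0, "d": 0.0}   # nu << mu
rho_mass = {"a": 0.0, "b": 0.0, "c": 3.0, "d": 1.0}   # rho ⊥ mu

def support(m):
    """Points carrying nonzero mass."""
    return {x for x, v in m.items() if v != 0.0}

# nu << mu: every mu-null point is nu-null, i.e. supp(nu) ⊂ supp(mu).
abs_cont = support(nu_mass) <= support(mu_mass)

# rho ⊥ mu: take E = supp(rho); then mu(E) = 0 and rho(E^c) = 0.
E = support(rho_mass)
singular = (sum(mu_mass[x] for x in E) == 0.0
            and sum(rho_mass[x] for x in set(rho_mass) - E) == 0.0)
```

For signed measures the same checks would be run on the total variations |μ|, |ν|; here all masses are nonnegative so |μ| = μ.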

5.5 Radon–Nikodym Theorem and the Lebesgue decomposition

The Lebesgue–Radon–Nikodym Theorem asserts that every σ-finite signed measure ν may be written as the sum of two signed measures, of which the first is an indefinite integral with respect to a given σ-finite measure μ and the second is singular with respect to μ. We establish this result first for finite measures, and then extend it to the σ-finite and signed cases. A function f satisfying a certain property is said to be essentially unique if, whenever g is any other function with this property, then f = g a.e.

Lemma 5.5.1 Let (X, S, μ) be a finite measure space and ν a finite measure on S. Then there exist two uniquely determined finite measures ν1 and ν2 on S such that

ν = ν1 + ν2, ν1 ≪ μ, ν2 ⊥ μ,

and an essentially unique μ-integrable function f such that for all E ∈ S,

ν1(E) = ∫E f dμ.

The function f may be taken nonnegative.

Proof Uniqueness is most readily shown. For suppose ν = ν1 + ν2 = ν3 + ν4 where ν1 ≪ μ, ν2 ⊥ μ, ν3 ≪ μ, ν4 ⊥ μ. Then λ = ν1 – ν3 = ν4 – ν2 is a finite signed measure which is both absolutely continuous and singular with respect to μ (Ex. 5.11) and hence must be zero (Ex. 5.12). That is, ν1 = ν3 and ν2 = ν4 as required for uniqueness of the decomposition ν = ν1 + ν2. Further if ν1(E) = ∫E f dμ = ∫E g dμ for all E ∈ S, it follows from Theorem 4.4.8 (Corollary) that f = g a.e. (μ). Hence the uniqueness statements are proved.

Turning now to the existence of ν1, ν2 and f, let K denote the class of all nonnegative measurable functions f on X such that

∫E f dμ ≤ ν(E) for all E ∈ S.

The method of proof is to find f ∈ K maximizing ∫ f dμ and thus “extracting as much of ν as is possible by ν1(E) = ∫E f dμ”, the remainder ν2 = ν – ν1 being shown to be singular.

Note that K is nonempty since it contains the function which is identically zero. Write

α = sup{∫X f dμ : f ∈ K},

and let {fn} be a sequence of functions in K such that ∫X fn dμ → α. Write gn(x) = max{f1(x), ..., fn(x)} ≥ 0. Then if E ∈ S, for fixed n, E can be written as ∪ni=1Ei where the Ei are disjoint measurable sets and gn(x) = fi(x) for x ∈ Ei. (Write E1 = {x : gn(x) = f1(x)}, E2 = {x : gn(x) = f2(x)} – E1, etc.) Thus

∫E gn dμ = Σni=1 ∫Ei gn dμ = Σni=1 ∫Ei fi dμ ≤ Σni=1 ν(Ei) = ν(E),

showing that gn ∈ K. Since {gn} is an increasing sequence it has a limit f(x) = limn→∞ gn(x) and by monotone convergence

∫E f dμ = limn→∞ ∫E gn dμ ≤ ν(E).

It follows that f ∈ K and ∫X f dμ = limn→∞ ∫X gn dμ ≥ limn→∞ ∫X fn dμ = α, so that ∫X f dμ = α. Write now

ν1(E) = ∫E f dμ and ν2(E) = ν(E) – ν1(E) for all E ∈ S.

Then ν1 is clearly a finite measure (Theorem 5.2.3) with f ≥ 0, f ∈ L1(μ), and ν1 ≪ μ. Further ν2 is finite, countably additive, and ν2(E) ≥ 0 for all E ∈ S since f ∈ K implies that ν1(E) = ∫E f dμ ≤ ν(E). Hence ν2 is a finite measure, and it only remains to show that ν2 ⊥ μ.

To see this, consider the finite signed measure λn = ν2 – n–1μ (n = 1, 2, ...) and let X = An ∪ Bn be a Hahn decomposition of X for λn (An positive, Bn negative). If hn = f + n–1χAn, then for all E ∈ S,

∫E hn dμ = ∫E f dμ + n–1μ(An ∩ E) = ν(E) – ν2(E) + n–1μ(An ∩ E)
= ν(E) – ν2(E ∩ Bn) – λn(An ∩ E) ≤ ν(E)

since ν2 is a measure and An is positive for λn. Thus hn ∈ K so that

α ≥ ∫X hn dμ = ∫X f dμ + n–1μ(An) = α + n–1μ(An),

which implies that μ(An) = 0. If A = ∪∞n=1An, then μ(A) = 0. Since Ac ⊂ Acn = Bn we have λn(Ac) ≤ 0 and thus ν2(Ac) ≤ n–1μ(Ac) for each n. Thus ν2(Ac) = 0 = μ(A), showing that ν2 ⊥ μ, and thus completing the proof. □

We next establish the Lebesgue Decomposition Theorem in its general form.

Theorem 5.5.2 (Lebesgue Decomposition Theorem) If (X, S, μ) is a σ-finite measure space and ν is a σ-finite signed measure on S, then there exist two uniquely determined σ-finite signed measures ν1 and ν2 such that

ν = ν1 + ν2, ν1 ≪ μ, ν2 ⊥ μ.

If ν is a measure, so are ν1 and ν2. ν = ν1 + ν2 is called the Lebesgue decomposition of ν with respect to μ.

Proof The existence of ν1 and ν2 will first be shown when both μ and ν are σ-finite measures. Then clearly X = ∪∞n=1Xn, where Xn are disjoint measurable sets with 0 ≤ μ(Xn) < ∞, 0 ≤ ν(Xn) < ∞. For each n = 1, 2, ..., define

μ(n)(E) = μ(E ∩ Xn) and ν(n)(E) = ν(E ∩ Xn) for all E ∈ S.

Then μ(n), ν(n) are finite measures and by Lemma 5.5.1, ν(n) = ν(n)1 + ν(n)2 where ν(n)1 ≪ μ(n), ν(n)2 ⊥ μ(n). Now define the set functions ν1, ν2 for E ∈ S by (writing Σn for Σ∞n=1)

ν1(E) = Σn ν(n)1(E), ν2(E) = Σn ν(n)2(E).

Then ν = ν1 + ν2 since ν(E) = Σn ν(n)(E) = Σn (ν(n)1(E) + ν(n)2(E)). Also ν1 and ν2 are readily seen to be σ-finite measures. For countable additivity, if E = ∪∞k=1Ek where Ek are disjoint sets of S then

ν1(E) = Σn ν(n)1(E) = Σn Σk ν(n)1(Ek) = Σk Σn ν(n)1(Ek) = Σk ν1(Ek)

by interchanging the order of summation of the double series whose terms are nonnegative. Hence ν1 is a measure, and similarly so is ν2. σ-finiteness follows since X (and hence each set of S) may be covered by ∪∞n=1Xn, where

νi(Xn) = Σm ν(m)i(Xn) ≤ Σm ν(m)(Xn) = ν(Xn) < ∞, i = 1, 2.

To show that ν1 ≪ μ, fix E ∈ S with μ(E) = 0. Then μ(n)(E) = μ(E ∩ Xn) = 0 and since ν(n)1 ≪ μ(n) we have ν(n)1(E) = 0. It follows that ν1(E) = Σn ν(n)1(E) = 0 and hence ν1 ≪ μ.

The proof (when ν is a σ-finite measure) is completed by showing that ν2 ⊥ μ. Since for each n = 1, 2, ..., ν(n)2 ⊥ μ(n), there is a set En ∈ S such that μ(n)(En) = 0 and ν(n)2(Ecn) = 0.

Let Fn = En ∩ Xn, F = ∪∞n=1Fn. Then the sets Fn are disjoint and

μ(F) = Σn μ(Fn) = Σn μ(n)(En) = 0.

On the other hand ν(n)(Xcn) = ν(Xcn ∩ Xn) = 0 implies ν(n)2(Xcn) = 0 and since Fcn = Ecn ∪ Xcn it follows that ν(n)2(Fcn) = 0. Now

ν2(Fc) = Σn ν(n)2(Fc) ≤ Σn ν(n)2(Fcn) = 0

since Fc ⊂ Fcn. Hence μ(F) = 0 = ν2(Fc) and thus ν2 ⊥ μ as desired. Thus the result follows when ν is a σ-finite measure.

When ν is a σ-finite signed measure it has the Jordan decomposition ν = ν+ – ν–, where at least one of the measures ν+, ν– is finite and the other σ-finite. Using the theorem for σ-finite measures, write ν+ = ν+,1 + ν+,2 and ν– = ν–,1 + ν–,2 where ν+,1, ν–,1 ≪ μ and ν+,2, ν–,2 ⊥ μ. If, for instance, ν– is finite, then so are the measures ν–,1, ν–,2, and hence ν = (ν+,1 – ν–,1) + (ν+,2 – ν–,2) = ν1 + ν2 with ν1 = ν+,1 – ν–,1 ≪ μ and ν2 = ν+,2 – ν–,2 ⊥ μ (Ex. 5.11). Thus existence of the Lebesgue decomposition follows when ν is a σ-finite signed measure.

To show uniqueness, suppose first that ν is a σ-finite measure and ν = ν1 + ν2 = ν3 + ν4 where ν1, ν3 ≪ μ and ν2, ν4 ⊥ μ. Since both μ and ν are σ-finite we again write X = ∪∞n=1Xn where Xn are disjoint measurable sets with both μ(Xn), ν(Xn) finite. For each n = 1, 2, ... define the finite measures μ(n), ν(n)i, i = 1, 2, 3, 4 by μ(n)(E) = μ(E ∩ Xn) and ν(n)i(E) = νi(E ∩ Xn) for all E ∈ S. Then clearly

ν(n)1 + ν(n)2 = ν(n)3 + ν(n)4; ν(n)1, ν(n)3 ≪ μ(n); ν(n)2, ν(n)4 ⊥ μ(n).

By the uniqueness part of Lemma 5.5.1, ν(n)1 = ν(n)3 and ν(n)2 = ν(n)4 for all n = 1, 2, ..., so that

ν1 = Σn ν(n)1 = Σn ν(n)3 = ν3

and similarly ν2 = ν4.
Thus uniqueness follows when ν is a σ-finite measure. If ν is a σ-finite signed measure with two decompositions ν1 + ν2 = ν3 + ν4, uniqueness follows by using the Jordan decomposition for each

νi, rearranging the equation so that each side is positive, and applying the result for measures. □

We now prove the general form of the Radon–Nikodym Theorem.

Theorem 5.5.3 (Radon–Nikodym Theorem) Let (X, S, μ) be a σ-finite measure space and ν a σ-finite signed measure on S. If ν ≪ μ then there is an essentially unique finite-valued measurable function f on X such that for all E ∈ S,

ν(E) = ∫E f dμ.

f is μ-integrable if and only if ν is finite. In general at least one of f+, f– is μ-integrable; indeed f+ or f– is integrable according as ν+ or ν– is finite. If ν is a measure then f is nonnegative.

Proof The existence of f follows from Lemma 5.5.1 if μ, ν are finite measures. For by the uniqueness of the Lebesgue decomposition of ν = ν1 + ν2 = ν + 0 (regarding zero as a measure) we must have ν1 = ν and thus

ν(E) = ν1(E) = ∫E f dμ, E ∈ S,

for some nonnegative μ-integrable f which (by Theorem 4.4.2 (iv)) may be taken to be finite-valued.

Assume now that μ, ν are σ-finite measures. As in previous proofs write X = ∪∞n=1Xn where Xn are disjoint measurable sets with μ(Xn) < ∞, ν(Xn) < ∞, and define μ(n)(E) = μ(E ∩ Xn), ν(n)(E) = ν(E ∩ Xn). Then μ(n), ν(n) are finite measures on S with ν(n) ≪ μ(n), and by the result just shown for finite measures, ν(n)(E) = ∫E fn dμ(n), all E ∈ S, for some nonnegative, finite-valued, measurable fn. Thus (using Ex. 4.9)

ν(E ∩ Xn) = ν(n)(E) = ∫ χE fn dμ(n) = ∫Xn χE fn dμ = ∫ χE χXn fn dμ.

Hence, writing f = Σ∞n=1 χXn fn and using monotone convergence,

ν(E) = Σ∞n=1 ν(E ∩ Xn) = Σ∞n=1 ∫ χE χXn fn dμ = ∫ χE f dμ = ∫E f dμ.

f is a nonnegative measurable function and is finite-valued (the Xn are disjoint and thus f(x) = fn(x) on each Xn). Thus the existence of f follows when μ, ν are σ-finite measures.

When ν is a σ-finite signed measure, it has Jordan decomposition ν = ν+ – ν–, where at least one of the measures ν+, ν– is finite and the other σ-finite.
Using the results just shown for finite and σ-finite measures we have

ν+(E) = ∫E f+ dμ, ν–(E) = ∫E f– dμ, E ∈ S,

for some nonnegative finite-valued measurable functions f+, f–, at least one of which is μ-integrable. Notice that if X = A ∪ B is a Hahn decomposition of X for ν, then ν+(B) = 0 = ν–(A) and thus we may take f+ = 0 on B and f– = 0 on A. Then clearly

ν(E) = ∫E f dμ, all E ∈ S,

where f = f+ – f– (and f+, f– are the positive and negative parts of f) has all properties stated in the theorem. Thus the existence of f is shown.

To show its essential uniqueness let g be another function with the same properties as f. Write X = ∪∞n=1Xn, where Xn are disjoint measurable sets with μ(Xn) and ν(Xn) finite. Then for each fixed n,

ν(n)(E) = ν(E ∩ Xn) = ∫E f χXn dμ = ∫E gχXn dμ for all E ∈ S.

Since ν(n) is a finite signed measure, f χXn and gχXn are μ-integrable (see Theorem 5.2.3 and the discussion following its proof) and by Theorem

4.4.8 (Corollary), f χXn = gχXn a.e. (μ) for all n. Thus f = g a.e. (μ) on X. It follows that f is essentially unique and the proof of the theorem is complete. □

The following result provides an informative equivalent definition of absolute continuity for finite signed measures. This may be given a straightforward direct proof but as shown here follows neatly as a corollary to the above theorem, from the result for the indefinite integral of an L1-function shown in Theorem 4.5.3.

Corollary Let (X, S, μ) be a σ-finite measure space and ν a finite signed measure on S. Then ν ≪ μ if and only if given any ε > 0 there exists δ = δ(ε) > 0 such that |ν(E)| < ε whenever E ∈ S and μ(E) < δ.

Proof If the stated condition holds, and μ(E) = 0, then |ν(E)| < ε for any ε > 0 and thus ν(E) = 0, i.e. ν ≪ μ. Conversely, a finite signed measure ν with ν ≪ μ may be written as ν(E) = ∫E f dμ for some f ∈ L1 by the theorem, and hence the result just restates Theorem 4.5.3. □

The Lebesgue decomposition and Radon–Nikodym Theorem may be combined into the following single statement which provides a useful representation of a measure in terms of another. This generalizes the more limited statement of Lemma 5.5.1.

Theorem 5.5.4 (Lebesgue–Radon–Nikodym Theorem) Let (X, S, μ) be a σ-finite measure space and ν a σ-finite signed measure on S. Then there exist two uniquely determined σ-finite signed measures ν1 and ν2 such that

ν = ν1 + ν2, ν1 ≪ μ, ν2 ⊥ μ,

and an essentially unique finite-valued measurable function f on X such that f+ or f– is μ-integrable and for all E ∈ S,

ν1(E) = ∫E f dμ.

Thus for some E0 ∈Swith μ(E0)=0we have for all E ∈S, ν E fdμ ν E ∩ E fdμ ν E ∩ E ( )= E + 2( 0)= E + ( 0) since μ(E0)=0⇒ ν1(E ∩ E0)=0.f isμ-integrable if and only if ν1 is finite. ν  μ if and only if ν(E0)=0.Ifν is a measure so are ν1, ν2 and f is nonnegative.
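On a finite space, where every measure is a finite sum of point masses, the Lebesgue–Radon–Nikodym decomposition can be computed explicitly: $E_0 = \{x : \mu(\{x\}) = 0\}$ carries the singular part, and off $E_0$ the derivative is the ratio of point masses. The following Python sketch is an illustration, not from the text; the particular masses and all names (`mu_mass`, `nu_mass`, etc.) are invented here.

```python
# Illustrative sketch (not from the text): the decomposition nu = nu1 + nu2
# of Theorem 5.5.4 on a finite space, where every measure is determined by
# its point masses.  The masses below are arbitrary choices.

X = [0, 1, 2, 3]
mu_mass = {0: 2.0, 1: 1.0, 2: 0.0, 3: 0.0}   # mu gives no mass to {2, 3}
nu_mass = {0: 1.0, 1: 4.0, 2: 5.0, 3: 0.5}

# E0 = {x : mu({x}) = 0} carries the mu-singular part nu2; off E0,
# nu1 << mu with Radon-Nikodym derivative f(x) = nu({x}) / mu({x}).
E0 = [x for x in X if mu_mass[x] == 0.0]
f = {x: (nu_mass[x] / mu_mass[x] if mu_mass[x] > 0 else 0.0) for x in X}

def nu(E):
    return sum(nu_mass[x] for x in E)

def nu1(E):
    return sum(nu_mass[x] for x in E if x not in E0)

def nu2(E):
    return sum(nu_mass[x] for x in E if x in E0)

def integral_f_dmu(E):
    # int_E f dmu as a finite sum
    return sum(f[x] * mu_mass[x] for x in E)

# nu(E) = int_E f dmu + nu(E ∩ E0) for every subset E, as in Theorem 5.5.4.
for E in ([], [0], [0, 2], [1, 3], X):
    assert abs(nu(E) - (integral_f_dmu(E) + nu2(E))) < 1e-12
    assert abs(nu1(E) - integral_f_dmu(E)) < 1e-12
print("decomposition verified on all test sets")
```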

Note that both the Lebesgue decomposition theorem and the Radon– Nikodym Theorem may fail in the absence of σ-finiteness. For a simple example see Ex. 5.20.

5.6 Derivatives of measures

If μ is a σ-finite measure and ν a σ-finite signed measure on (X, S) such that ν ≪ μ, then the function f appearing in the relation $\nu(E) = \int_E f\,d\mu$ is called the Radon–Nikodym derivative of ν with respect to μ, and written dν/dμ. It is not defined uniquely at every point x, since any measurable g equal to f a.e. (μ) will also satisfy $\nu(E) = \int_E g\,d\mu$ for all E ∈ S. However, dν/dμ is essentially unique, in the sense already described. (f and g may be regarded as “versions” of dν/dμ.)
An important use of the Radon–Nikodym Theorem concerns a change of measure in an integral. If μ, ν are two σ-finite measures with ν ≪ μ, the following result shows that $\int f\,d\nu = \int f\,\frac{d\nu}{d\mu}\,d\mu$ (as if the dμ were cancelled). This and other properties of the Radon–Nikodym derivative justify the quite suggestive symbol used to denote it.

Theorem 5.6.1 Let μ, ν be σ-finite measures on the measurable space (X, S), with ν ≪ μ. If f is a measurable function defined on X which is either nonnegative or ν-integrable, then
$$\int f\,d\nu = \int f\,(d\nu/d\mu)\,d\mu.$$

Proof Write dν/dμ = g. If E ∈ S then
$$\int \chi_E\,g\,d\mu = \int_E g\,d\mu = \nu(E) = \int \chi_E\,d\nu.$$
Thus the desired result holds whenever f is the indicator function of a measurable set E. Hence it also holds for a nonnegative simple function f and, by monotone convergence, for a nonnegative measurable function f (in the usual way, let fn be an increasing sequence of nonnegative simple functions converging to f at each point x; note that g ≥ 0 a.e. (μ), hence fn g increases to fg a.e. and thus Theorem 4.5.2 applies). Finally, by expressing any ν-integrable f as f+ – f– we see that the result holds for such an f also. □
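A numeric sanity check of the change-of-measure rule, not from the text: take μ to be Lebesgue measure on [0, 1] and dν/dμ = g with g(x) = 2x, so that ν has distribution function G(x) = x². The ν-integral of f can then be computed independently by the quantile transform $\int f\,d\nu = \int_0^1 f(G^{-1}(u))\,du$ with $G^{-1}(u) = \sqrt{u}$, and compared with $\int fg\,d\mu$. All integrals below are midpoint Riemann sums; the names are invented for this sketch.

```python
# Illustration (not from the text) of Theorem 5.6.1 on [0, 1] with
# mu = Lebesgue measure, d(nu)/d(mu) = g(x) = 2x, f(x) = x^2.
# Exact value of both sides: ∫_0^1 x^2 · 2x dx = 1/2.

N = 200_000
h = 1.0 / N
mid = [(k + 0.5) * h for k in range(N)]      # midpoints for Riemann sums

f = lambda x: x * x
g = lambda x: 2.0 * x                        # Radon-Nikodym derivative

int_f_g_dmu = sum(f(x) * g(x) * h for x in mid)   # ∫ f (dnu/dmu) dmu
int_f_dnu = sum(f(u ** 0.5) * h for u in mid)     # ∫ f dnu via quantile transform

assert abs(int_f_g_dmu - int_f_dnu) < 1e-4
assert abs(int_f_dnu - 0.5) < 1e-4
print(int_f_dnu, int_f_g_dmu)
```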

A comment on the requirement that f be defined for all x may be helpful. If f ∈ L1(X, S, ν), the set where f is not defined has ν-measure zero, but not necessarily zero μ-measure. The result is nevertheless true if f is defined a.e. (μ). It is, indeed, true if f ∈ L1(X, S, ν) even if f is not defined a.e. (μ), provided the definition of f is extended in any way (preserving measurability) to all, or almost all (μ), points x. (See Ex. 5.21.)
Theorem 5.6.1 expresses the integral with respect to ν as an integral with respect to μ when ν ≪ μ. If moreover μ{x : dν/dμ(x) = 0} = 0 then μ ≪ ν, so that μ ∼ ν. For if f = dν/dμ then
$$\int_E \frac{1}{f}\,d\nu = \int_E \frac{1}{f}\,\frac{d\nu}{d\mu}\,d\mu = \mu(E),$$
so that μ ≪ ν and dμ/dν = (dν/dμ)^{–1} a.e. (ν). Hence μ-integrals can be expressed as ν-integrals as well (see Ex. 5.18).
In general (when no absolute continuity assumptions are made) one can still express ν-integrals in terms of μ-integrals and a “remainder” term. This is an immediate corollary of the Lebesgue–Radon–Nikodym Theorem 5.5.4, the change of measure rule of Theorem 5.6.1 and Ex. 4.9.

Corollary Let μ, ν, f and E0 be as in Theorem 5.5.4 (μ(E0) = 0). If g is a measurable function defined on X, and either nonnegative or ν-integrable, then
$$\int g\,d\nu = \int gf\,d\mu + \int_{E_0} g\,d\nu.$$

Radon–Nikodym derivatives may in some ways be manipulated like ordinary derivatives of functions. For example it is obvious that
$$\frac{d(\lambda+\nu)}{d\mu} = \frac{d\lambda}{d\mu} + \frac{d\nu}{d\mu} \quad \text{a.e. } (\mu)$$
if λ ≪ μ and ν ≪ μ. A “chain rule” also follows as a corollary of the previous theorem.

Theorem 5.6.2 Let μ, ν be σ-finite measures on the measurable space (X, S) and λ a σ-finite signed measure on S. Then if λ ≪ ν ≪ μ,
$$\frac{d\lambda}{d\mu} = \frac{d\lambda}{d\nu} \cdot \frac{d\nu}{d\mu} \quad \text{a.e. } (\mu).$$

Proof Assume that λ is a measure (the signed measure case can be obtained from this by the Jordan decomposition). For each E ∈ S,
$$\int_E \frac{d\lambda}{d\mu}\,d\mu = \lambda(E) = \int_E \frac{d\lambda}{d\nu}\,d\nu = \int_E \frac{d\lambda}{d\nu} \cdot \frac{d\nu}{d\mu}\,d\mu$$
by Theorem 5.6.1. Now the essential uniqueness of the Radon–Nikodym derivative (Theorem 5.5.3) implies that dλ/dμ = (dλ/dν)·(dν/dμ) a.e. (μ). □
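On a finite space every Radon–Nikodym derivative is simply a ratio of point masses, so the chain rule can be verified pointwise. The sketch below is an illustration, not from the text; the three measures are arbitrary choices that satisfy λ ≪ ν ≪ μ.

```python
# Discrete illustration (not from the text) of the chain rule
# d(lambda)/d(mu) = (d(lambda)/d(nu)) * (d(nu)/d(mu)) of Theorem 5.6.2.

X = [0, 1, 2]
mu = {0: 1.0, 1: 2.0, 2: 4.0}    # mu charges every point
nu = {0: 3.0, 1: 1.0, 2: 2.0}    # hence nu << mu
lam = {0: 6.0, 1: 0.5, 2: 1.0}   # and lam << nu

# each derivative is the ratio of point masses
dnu_dmu = {x: nu[x] / mu[x] for x in X}
dlam_dnu = {x: lam[x] / nu[x] for x in X}
dlam_dmu = {x: lam[x] / mu[x] for x in X}

for x in X:
    assert abs(dlam_dmu[x] - dlam_dnu[x] * dnu_dmu[x]) < 1e-12
print("chain rule verified at every point")
```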

5.7 Real line applications

This section concerns some applications of the previous results to the real line, as well as some further results valid only on the real line. As usual R will denote the real line, B the Borel sets of R, and m Lebesgue measure on B. We begin with a refinement of the Lebesgue decomposition of a Lebesgue–Stieltjes measure with respect to Lebesgue measure.
A measure ν on B is called discrete or atomic if there is a countable set C such that ν(C^c) = 0, i.e. if the measure ν has all its mass concentrated on a countable set of points. This means that, if ν ≠ 0, then ν({x}) > 0 for some (or all) x ∈ C. Since countable sets have zero Lebesgue measure, discrete measures are singular with respect to Lebesgue measure. Recall that a measure ν on B is a Lebesgue–Stieltjes measure if and only if ν{(a, b]} < ∞ for all –∞ < a < b < ∞, or equivalently if and only if ν = μF, the Lebesgue–Stieltjes measure corresponding to a finite-valued, nondecreasing, right-continuous function F on R (Theorem 2.8.1). Since such a measure ν is σ-finite it has, by Theorem 5.5.2, a Lebesgue decomposition with respect to Lebesgue measure m, which we will here write as ν = ν0 + ν1, where ν0 ⊥ m and ν1 ≪ m. It will be shown that the singular part ν0 of ν may be further decomposed into two parts, one of which is discrete and the other singular with respect to m with no mass “at any one point”, i.e. having no atoms.

Theorem 5.7.1 If ν is a Lebesgue–Stieltjes measure on B, then there are three uniquely determined measures ν1, ν2, ν3 on B such that ν = ν1 + ν2 + ν3 and such that ν1 ≪ m, ν2 is discrete, and ν3 ⊥ m with ν3({x}) = 0 for all x ∈ R.

Proof As noted above we may write ν = ν0 + ν1 where ν0 ⊥ m and ν1 ≪ m. Now let C = {x : ν0({x}) > 0}. Since ν0({x}) ≤ ν({x}) for each x and the atoms of ν are countable (Lemma 2.8.2), it follows that C is a countable set.
Write ν2(B) = ν0(B ∩ C), ν3(B) = ν0(B ∩ C^c) for B ∈ B. Then ν0 = ν2 + ν3 and hence ν = ν1 + ν2 + ν3. Now ν2 is discrete since ν2(C^c) = 0; and ν3 ⊥ m since ν0 ⊥ m implies ν0(G) = m(G^c) = 0 for some G, and hence ν3(G) ≤ ν0(G) = 0 = m(G^c). Further, for any x ∈ R, by definition of C,
$$\nu_3(\{x\}) = \nu_0(\{x\} \cap C^c) = \begin{cases} \nu_0(\varnothing) = 0 & \text{if } x \in C \\ \nu_0(\{x\}) = 0 & \text{if } x \notin C. \end{cases}$$

To prove uniqueness suppose that ν = ν1 + ν2 + ν3 = ν1′ + ν2′ + ν3′, where each νi′ has the same properties as νi. Since (ν2 + ν3) and (ν2′ + ν3′) are both singular with respect to m, the uniqueness of the Lebesgue decomposition gives ν1 = ν1′ and ν2 + ν3 = ν2′ + ν3′ = ν0, say. Then clearly there is a countable set C such that ν2(C^c) = ν2′(C^c) = 0 (the union of the countable sets supporting ν2 and ν2′), so that for B ∈ B,
$$\nu_2(B) = \nu_2(B \cap C) = \sum_{x \in B \cap C} \nu_2(\{x\}) = \sum_{x \in B \cap C} \nu_0(\{x\}).$$
Similarly this is also ν2′(B), so that ν2 = ν2′ and hence ν3 = ν3′. □

ν1 is called the absolutely continuous part of ν, ν2 is the discrete singular part of ν (usually called just the “discrete part”), and ν3 is the continuous singular part of ν (usually called just the “singular part”). From Theorem

5.7.1 we can obtain a corresponding decomposition of F if ν = μF, and thus of any nondecreasing right-continuous function F. Before stating this decomposition the following terminology is needed.
Let F be a nondecreasing right-continuous function defined on R and μF its corresponding Lebesgue–Stieltjes measure. If μF ≪ m, F is said to be absolutely continuous with density function f = dμF/dm. Since μF{(a, b]} < ∞ for all –∞ < a < b < ∞, it follows from the Radon–Nikodym Theorem that f ∈ L1(a, b) and that
$$F(b) - F(a) = \mu_F\{(a,b]\} = \int_{(a,b]} f(t)\,dt = \int_a^b f(t)\,dt.$$
Thus for each a and all x,
$$F(x) = F(a) + \int_a^x f(t)\,dt,$$
where we write $\int_a^x f(t)\,dt = -\int_x^a f(t)\,dt$ when x < a. Also, by Theorem 5.6.1,
$$\int g(x)\,dF(x) = \int g(x)f(x)\,dx$$
whenever g is a nonnegative measurable function on R or is μF-integrable.
If F is continuous and μF ⊥ m, F is said to be (continuous) singular. Recall that F is continuous if and only if μF({x}) = 0 for all x ∈ R. Thus “F is singular” means that μF ⊥ m and μF({x}) = 0, all x ∈ R.
If μF is atomic (discrete), F is called discrete. Then μF(C^c) = 0 for some countable set $C = \{x_n\}_{n=1}^\infty$ and for –∞ < a < b < ∞,
$$F(b) - F(a) = \mu_F\{(a,b]\} = \mu_F\{(a,b] \cap C\} = \sum_{a < x_n \le b} \mu_F(\{x_n\}).$$
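The interval masses of a mixed F split accordingly into a density part and a jump part. The following Python sketch, an illustration not from the text, takes F to be the sum of an absolutely continuous part with density 1/2 on [0, 1] and a single jump of size 1 at x = 1/2 (the continuous singular part is zero here); all function names are invented.

```python
# Illustration (not from the text): a nondecreasing right-continuous F
# that is the sum of an absolutely continuous part (density f = 1/2 on
# [0,1]) and a discrete part (jump of size 1 at x = 1/2).

def F(x):
    ac = 0.5 * min(max(x, 0.0), 1.0)       # absolutely continuous part
    jump = 1.0 if x >= 0.5 else 0.0        # discrete part (right-continuous)
    return ac + jump

def mu_F(a, b):
    """mu_F{(a, b]} = F(b) - F(a)."""
    return F(b) - F(a)

def density_part(a, b):
    # integral of the density 1/2 over (a, b] ∩ [0, 1]
    lo, hi = max(a, 0.0), min(b, 1.0)
    return 0.5 * max(hi - lo, 0.0)

def jump_part(a, b):
    # sum of jumps at points x_n with a < x_n <= b (only x = 1/2 here)
    return 1.0 if a < 0.5 <= b else 0.0

for (a, b) in [(0.0, 1.0), (0.0, 0.4), (0.4, 0.5), (0.5, 0.9), (-1.0, 2.0)]:
    assert abs(mu_F(a, b) - (density_part(a, b) + jump_part(a, b))) < 1e-12
print("mu_F = absolutely continuous part + discrete part on all test intervals")
```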

Corollary to Theorem 5.7.1 Every nondecreasing and right-continuous function F defined on R has a decomposition

$$F(x) = F_1(x) + F_2(x) + F_3(x), \qquad x \in \mathbb{R},$$
where F1, F2, F3 are nondecreasing and right-continuous, F1 is absolutely continuous, F2 is discrete, and F3 is singular. Each of F1, F2, F3 is unique up to an additive constant. F has at most countably many discontinuities, arising solely from possible jumps in the discrete component F2.

Proof Let μF = ν1 + ν2 + ν3 be the decomposition of μF into its three components. Write Fi(x) = νi{(0, x]} for x ≥ 0, and –νi{(x, 0]} for x < 0 (as in the proof of Theorem 2.8.1). Then the corollary follows immediately from Theorem 5.7.1 by noting that F(x) – F(0) = F1(x) + F2(x) + F3(x) and by adding the constant F(0) to any one of the Fi’s. Since each νi (i = 1, 2, 3) is unique, each Fi is unique up to an additive constant by Theorem 2.8.1.
Lemma 2.8.2 showed that F has at most countably many (jump) discontinuities. This also follows from the above decomposition, since the absolutely continuous and continuous singular components of a Lebesgue–Stieltjes measure have no atoms; hence the only atoms arise from the discrete component. □
We introduced the notion of an absolutely continuous nondecreasing function F defined on R (or on [a, b]) and showed that for any –∞ < a < b < ∞ there exists an essentially unique nonnegative function f ∈ L1(a, b) such that for all a ≤ x ≤ b,
$$F(x) = F(a) + \int_a^x f(t)\,dt.$$
This definition can be extended by allowing f to take negative as well as positive values, but still of course requiring f ∈ L1(a, b). The resulting functions F are also said to be absolutely continuous. As will be seen later in this section, the set function μF{(x, y]} = F(y) – F(x), a ≤ x < y ≤ b, can be extended to a finite signed (Lebesgue–Stieltjes) measure on B[a, b] which is such that μF ≪ m with dμF/dm = f. This property justifies the terminology used. F is also clearly continuous (an immediate application of dominated convergence), and in fact is differentiable with derivative f a.e. (Theorem 5.7.3).
This a.e. differentiability suggests that it should be possible to use an absolutely continuous function F for substitution of variables in integration, i.e. to evaluate $\int g(x)\,dx$ as $\int g(F(t))f(t)\,dt$ (formally writing x = F(t) and regarding f(t) as the derivative F′(t)).
This is readily seen to be true for nondecreasing (absolutely continuous) F, for which it is simply checked (Ex. 2.19) that $\mu_F F^{-1} = m$, Lebesgue measure, and hence by Theorem 4.6.1, for appropriate functions g,
$$\int g(y)\,dy = \int g\,d(\mu_F F^{-1}) = \int (g \circ F)\,d\mu_F = \int (g \circ F)\,\frac{d\mu_F}{dm}\,dm = \int g(F(x))f(x)\,dx$$
by Theorem 5.6.1. When F is not monotone the proof still relies on the above simple argument but requires splitting the interval of integration into parts, as seen in the figure in Theorem 5.7.2. The proof is straightforward but more tedious, and is given here for reference.
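The nondecreasing case can be checked numerically. The sketch below, an illustration not from the text, takes F(t) = t² on [0, 1] (so f(t) = 2t) and g = cos, for which both sides of the substitution formula equal sin 1; integrals are midpoint Riemann sums.

```python
# Numeric check (not from the text) of the substitution formula for a
# nondecreasing absolutely continuous F: with F(t) = t^2 on [0, 1] and
# g = cos,  ∫_{F(0)}^{F(1)} g(x) dx = ∫_0^1 g(F(t)) f(t) dt = sin(1).
import math

N = 200_000
h = 1.0 / N
mid = [(k + 0.5) * h for k in range(N)]

lhs = sum(math.cos(x) * h for x in mid)                 # ∫_0^1 cos x dx
rhs = sum(math.cos(t * t) * 2.0 * t * h for t in mid)   # ∫_0^1 cos(t^2)·2t dt

assert abs(lhs - math.sin(1.0)) < 1e-6
assert abs(lhs - rhs) < 1e-6
print(lhs, rhs)
```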

Theorem 5.7.2 Let F be an absolutely continuous function on [a, b], –∞ < a < b < ∞, with $F(x) = F(a) + \int_a^x f(t)\,dt$, f ∈ L1(a, b), and g a Borel measurable function defined on R. If g(F(t))f(t) ∈ L1(a, b), then g(x) ∈ L1(F(a), F(b)) or g(x) ∈ L1(F(b), F(a)) according as F(a) < F(b) or F(b) < F(a) respectively, and
$$\int_{F(a)}^{F(b)} g(x)\,dx = \int_a^b g(F(t))\,f(t)\,dt$$
(where $\int_\alpha^\beta g(x)\,dx = -\int_\beta^\alpha g(x)\,dx$ for β < α).

Proof For E ∈ B denote the Borel subsets of E by B(E), and write m for Lebesgue measure on B(E). Define ν for E ∈ B(a, b) by $\nu(E) = \int_E f(t)\,dt$. Since f ∈ L1(a, b), ν is a finite signed measure by Theorem 5.2.3. Also ν ≪ m, and by the Radon–Nikodym Theorem 5.5.3, dν/dm = f.
Consider the function F as a transformation from ((a, b), B(a, b), ν) into (R, B). Since F is continuous it is measurable (Ex. 3.10) and induces the signed measure νF^{–1} on B. We will show that if F(a) < F(b) then
$$\nu F^{-1}(B) = m\{B \cap (F(a), F(b))\} \quad \text{for all } B \in \mathcal{B}, \tag{5.1}$$
and if F(b) < F(a) then
$$\nu F^{-1}(B) = -m\{B \cap (F(b), F(a))\} \quad \text{for all } B \in \mathcal{B}. \tag{5.2}$$
(For F nondecreasing this was shown in Ex. 2.19.)
Let m, M be the minimum and maximum values of (the continuous function) F on [a, b]. Assume first that F(a) < F(b). Let I = (c, d) be an open interval of R. Since F is continuous, F^{–1}I is an open subset of (a, b) and as such it may be written as a countable union of open intervals; these are facts of elementary real line topology. Clearly F^{–1}I is nonempty if and only if I ∩ [m, M] is nonempty, and this is henceforth assumed without loss of generality.
Consider first the case where I contains neither F(a) nor F(b), i.e. I either is a subset of or is disjoint from (F(a), F(b)). Then, by the continuity of F, open intervals Jn = (α, β) can be found such that F(x) ∈ I = (c, d) for all x ∈ Jn, and
F(α) = c, F(β) = d or F(α) = d, F(β) = c (interval of type 1), or
F(α) = F(β) = c (interval of type 2), or
F(α) = F(β) = d (interval of type 3)

(see figure below). It follows that $F^{-1}I = \cup_k J_{1k} \cup_p J_{2p} \cup_q J_{3q}$, where for each i = 1, 2, 3 and k = 1, 2, 3, ..., the $J_{ik}$ are the distinct intervals of type i. Since $\nu(J_{ik}) = \int_{J_{ik}} f(t)\,dt$, which is d – c = m(I) or c – d = –m(I) for i = 1, and is zero for i = 2, 3,
$$(\nu F^{-1})(I) = \nu(F^{-1}I) = \sum_k \nu(J_{1k}).$$
Also $|\nu(J_{1k})| = m(I)$ for all k, which implies $|\nu|(J_{1k}) \ge |\nu(J_{1k})| = m(I)$. However, since ν is a finite signed measure, |ν| is finite and $\sum_k |\nu|(J_{1k}) = |\nu|(\cup_k J_{1k}) < \infty$; it follows that the number of nonempty $J_{1k}$’s is finite. They may therefore be ordered as $\{J_{11}, J_{12}, \ldots, J_{1s}\}$.
Now it is quite clear from the continuity of F that $\nu(J_{1k}) + \nu(J_{1,k+1}) = 0$, since if $\nu(J_{1k}) = m(I)$ then F is “increasing overall” on $J_{1k}$, hence overall decreasing on the next interval $J_{1,k+1}$, and thus $\nu(J_{1,k+1}) = -m(I)$; similarly if $\nu(J_{1k}) = -m(I)$ then $\nu(J_{1,k+1}) = m(I)$. Since $(\nu F^{-1})(I) = \nu(J_{11}) + \cdots + \nu(J_{1s})$ it follows that $(\nu F^{-1})(I) = 0$ when s is even, and $(\nu F^{-1})(I) = m(I)$ when s is odd.
If I ⊂ (F(a), F(b)) it is clear that s is odd and thus $(\nu F^{-1})(I) = m(I)$. On the other hand if I ⊂ R – (F(a), F(b)) then s is even and $(\nu F^{-1})(I) = 0$. In either case $(\nu F^{-1})(I) = m\{I \cap (F(a), F(b))\}$.
Now consider the case where I contains F(b) but not F(a). We can then write

$$F^{-1}I = \cup_k J_{1k} \,\cup_p J_{2p} \,\cup_q J_{3q} \cup (b', b)$$

where (b′, b) is disjoint from all intervals $J_{ik}$ and F(b′) = c. It is again clear that the number s of nonempty $J_{1k}$ is even, and thus
$$(\nu F^{-1})(I) = \nu\{(b', b)\} = \int_{b'}^{b} f(t)\,dt = F(b) - F(b') = F(b) - c = m\{I \cap (F(a), F(b))\}.$$

The same result is obtained similarly when I contains F(a) but not F(b).

It then follows that for every open interval I in R

(νF–1)(I)=m{I ∩ (F(a), F(b))}.

Hence the same is true for semiclosed intervals and then, by Lemma 5.2.4, for all Borel sets. Thus (5.1) is established, i.e. νF^{–1} is Lebesgue measure m on the Borel subsets of (F(a), F(b)), and the zero measure on the Borel subsets of R – (F(a), F(b)). Similarly, when F(b) < F(a), (5.2) is established, i.e. νF^{–1} is negative Lebesgue measure (–m) on the Borel subsets of (F(b), F(a)) and the zero measure on the Borel subsets of R – (F(b), F(a)).
Now g(F(t))f(t) ∈ L1(a, b) implies $\int_a^b |g(F(t))||f(t)|\,dt < \infty$. By the discussion following Theorem 5.2.3 we have $|\nu|(E) = \int_E |f(t)|\,dt$, and thus by the Radon–Nikodym Theorem and Theorem 5.6.1, $\int_a^b |g(F(t))|\,d|\nu|(t) < \infty$, i.e. g∘F ∈ L1(|ν|). Hence g ∈ L1(|ν|F^{–1}), i.e. g ∈ L1(F(a), F(b)) if F(a) < F(b), g ∈ L1(F(b), F(a)) if F(b) < F(a), and by the transformation theorem for signed measures (Theorem 5.3.2),
$$\int_{-\infty}^{\infty} g\,d(\nu F^{-1}) = \int_a^b (g \circ F)\,d\nu = \int_a^b g(F(t))f(t)\,dt$$
by the Radon–Nikodym Theorem and Theorem 5.6.1. Also, by what has been shown, when F(a) < F(b), $\int_{-\infty}^{\infty} g\,d(\nu F^{-1}) = \int_{F(a)}^{F(b)} g(x)\,dx$. When F(b) < F(a), $\int_{-\infty}^{\infty} g\,d(\nu F^{-1}) = -\int_{F(b)}^{F(a)} g(x)\,dx = \int_{F(a)}^{F(b)} g(x)\,dx$, and hence
$$\int_{F(a)}^{F(b)} g(x)\,dx = \int_a^b g(F(t))f(t)\,dt$$
in all cases, which completes the proof of the theorem. □
Absolutely continuous functions have many important properties, some of which we now state; their proofs may be found in standard texts on Real Analysis. First, there is an equivalent definition of absolute continuity more in line with the definition of continuity (in fact of uniform continuity), as follows. A function F is absolutely continuous on [a, b] if and only if for every ε > 0 there is a δ = δ(ε) > 0 such that
$$\sum_{i=1}^{n} |F(x_i') - F(x_i)| < \varepsilon$$
for every finite collection $\{(x_i, x_i')\}_{i=1}^{n}$ of disjoint intervals in [a, b] with $\sum_{i=1}^{n} |x_i' - x_i| < \delta$. An important property of absolutely continuous functions is their differentiability a.e.
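Before turning to differentiability, the conclusion of Theorem 5.7.2 for a genuinely non-monotone F can be checked numerically. The sketch below, an illustration not from the text, uses F(t) = t² on [−1, 2], which decreases on [−1, 0] and increases on [0, 2] with f(t) = 2t changing sign; the excursion of F below F(−1) = 1 is traversed once in each direction and cancels.

```python
# Numeric check (not from the text) of Theorem 5.7.2 for non-monotone F:
# with F(t) = t^2 on [-1, 2], f(t) = 2t, and g(x) = x,
#   ∫_{F(-1)}^{F(2)} g(x) dx = ∫_1^4 x dx = 7.5
# should equal ∫_{-1}^{2} g(F(t)) f(t) dt = ∫_{-1}^{2} t^2 · 2t dt = 7.5.

N = 300_000
a, b = -1.0, 2.0
h = (b - a) / N
mid_t = [a + (k + 0.5) * h for k in range(N)]

g = lambda x: x

# right-hand side: ∫_a^b g(F(t)) f(t) dt
rhs = sum(g(t * t) * 2.0 * t * h for t in mid_t)

# left-hand side: ∫_{F(a)}^{F(b)} g(x) dx = ∫_1^4 g(x) dx
M = 300_000
h2 = 3.0 / M
lhs = sum(g(1.0 + (k + 0.5) * h2) * h2 for k in range(M))

assert abs(lhs - 7.5) < 1e-4
assert abs(rhs - 7.5) < 1e-4
print(lhs, rhs)
```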
Theorem 5.7.3 Every absolutely continuous function is differentiable a.e. (m). In particular if F is absolutely continuous on [a, b] and $F(x) = F(a) + \int_a^x f(t)\,dt$, a ≤ x ≤ b, f ∈ L1(a, b), then F′(x) = f(x) a.e. (m) on [a, b]. If moreover f is continuous, then F′(x) = f(x) for all a ≤ x ≤ b.
This property makes precise the sense in which integration is the inverse of differentiation, and vice versa. Thus if f ∈ L1(a, b) we have
$$\frac{d}{dx}\int_a^x f(t)\,dt = f(x) \quad \text{a.e. } (m),$$
and if F is absolutely continuous on [a, b], then
$$\int_a^b F'(t)\,dt = F(b) - F(a).$$
A further important class of functions are the functions of bounded variation. A real-valued function F defined on [a, b], –∞ < a < b < +∞, is said to be of bounded variation if it is the difference of two nondecreasing functions defined on [a, b] (the term “bounded variation” will be justified below and in Ex. 5.26). Since nondecreasing functions have at most a countable number of points of discontinuity (which must be jumps), the same is true for functions of bounded variation. Hence it is easily seen that if a function F of bounded variation is right-continuous, then F = F1 – F2 where the functions F1 and F2 are nondecreasing and may be taken to be both right-continuous, e.g. by replacing F1(x), F2(x) by F1(x + 0), F2(x + 0) – cf. Ex. 5.27. The relationship between nondecreasing functions and (Lebesgue–Stieltjes) measures given in Theorem 2.8.1 provides a corresponding relationship between functions of bounded variation and signed measures.
Theorem 5.7.4 (i) If F is a right-continuous function of bounded variation on [a, b], –∞ < a < b < +∞, then there is a unique finite signed measure μF on the Borel subsets of (a, b] such that μF{(x, y]} = F(y) – F(x) whenever a ≤ x < y ≤ b.
(ii) Conversely, if ν is a finite signed measure on the Borel subsets of (a, b], –∞ < a < b < +∞, then there exists a right-continuous function F of

bounded variation on [a, b] such that ν = μF . F is unique up to an additive constant.

Proof (i) Let F = F1 – F2, where F1 and F2 are nondecreasing and right-continuous functions on [a, b]. Let $\mu_{F_1}$ and $\mu_{F_2}$ be the Lebesgue–Stieltjes measures corresponding to F1 and F2, and define $\mu_F = \mu_{F_1} - \mu_{F_2}$. Clearly μF is a finite signed measure on the Borel subsets of (a, b] and whenever a ≤ x < y ≤ b,
$$\mu_F\{(x,y]\} = \mu_{F_1}\{(x,y]\} - \mu_{F_2}\{(x,y]\} = F_1(y) - F_1(x) - \{F_2(y) - F_2(x)\} = \{F_1(y) - F_2(y)\} - \{F_1(x) - F_2(x)\} = F(y) - F(x).$$
Hence μF{(x, y]} depends on F but not on its particular representation as F1 – F2. The uniqueness of μF now follows from the fact that if two finite signed measures ν1, ν2 agree on the semiring P(a, b] of intervals (x, y], a ≤ x ≤ y ≤ b, then they agree on B(a, b] = S(P(a, b]) (Lemma 5.2.4).
(ii) Conversely, if ν is a finite signed measure on B(a, b], let ν = ν+ – ν– be its Jordan decomposition and define F1(x) = ν+(a, x], F2(x) = ν–(a, x], a ≤ x ≤ b. Clearly F1 and F2 are nondecreasing and right-continuous, and if F = F1 – F2, then F is a right-continuous function of bounded variation on [a, b]. Clearly μF and ν are equal on P(a, b] and hence also on B(a, b] (Lemma 5.2.4), i.e. ν = μF. Finally, if G is another right-continuous function of bounded variation such that μG = ν = μF, we have for all a ≤ x ≤ b, G(x) – G(a) = μG(a, x] = μF(a, x] = F(x) – F(a). Hence G(x) = F(x) + G(a) – F(a), which shows that F is unique up to an additive constant. □

If F is a right-continuous function of bounded variation on [a, b] and g a Borel measurable function such that the integral $\int_{(a,b]} g\,d\mu_F$ is defined, we write
$$\int_{(a,b]} g(x)\,dF(x) = \int_{(a,b]} g\,dF = \int_{(a,b]} g\,d\mu_F,$$
and thus define the Lebesgue–Stieltjes integral $\int_{(a,b]} g\,dF$ by $\int_{(a,b]} g\,d\mu_F$.
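When F is a right-continuous step function of bounded variation, μF is purely discrete and the Lebesgue–Stieltjes integral reduces to a sum over the jump points, weighted by the (possibly negative) jump sizes. The following Python sketch is an illustration, not from the text; the jump locations and sizes are arbitrary choices.

```python
# Sketch (not from the text): a Lebesgue-Stieltjes integral against a
# right-continuous step function F of bounded variation on (0, 1].

jumps = {0.25: 2.0, 0.5: -1.0, 0.75: 0.5}   # signed jumps of F

def F(x):
    # F(x) = sum of jumps at points <= x (right-continuous, F(0) = 0)
    return sum(s for p, s in jumps.items() if p <= x)

def ls_integral(g, a, b):
    """∫_{(a,b]} g dF for the step function F above."""
    return sum(g(p) * s for p, s in jumps.items() if a < p <= b)

# Consistency with mu_F{(x, y]} = F(y) - F(x): integrate g = 1.
one = lambda x: 1.0
assert abs(ls_integral(one, 0.0, 1.0) - (F(1.0) - F(0.0))) < 1e-12

val = ls_integral(lambda x: x, 0.0, 1.0)   # 0.25·2 - 0.5·1 + 0.75·0.5 = 0.375
assert abs(val - 0.375) < 1e-12
print(val)
```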

Absolutely continuous functions on [a, b] are of bounded variation, and in fact their Lebesgue–Stieltjes signed measures are absolutely continuous with respect to Lebesgue measure. Indeed if F is absolutely continuous on [a, b] then $F(x) = F(a) + \int_a^x f(t)\,dt$, a ≤ x ≤ b, f ∈ L1[a, b]. Writing f = f+ – f– gives
$$F(x) = F(a) + \int_a^x f_+(t)\,dt - \int_a^x f_-(t)\,dt.$$

Since f+(t) ≥ 0, f–(t) ≥ 0, their integrals are nondecreasing functions of x and thus F is of bounded variation. Clearly whenever a ≤ x ≤ y ≤ b,
$$\mu_F\{(x,y]\} = F(y) - F(x) = \int_x^y f(t)\,dt = \int_{(x,y]} f(t)\,dt$$
and hence
$$\mu_F(B) = \int_B f(t)\,dt \quad \text{for all } B \in \mathcal{B}(a,b],$$
since the two finite signed measures agree on P(a, b]. Thus μF ≪ m and dμF/dm = f.
We finally mention that, as shown in Ex. 5.26, a function F is of bounded variation on [a, b] if and only if
$$\sup \sum_{n=1}^{N} |F(x_n) - F(x_{n-1})| < \infty,$$
where the supremum is taken over all N and all subdivisions a = x0 < x1 < ··· < xN = b. This justifies the use of the term bounded variation, and in fact the sup is called the total variation of F on [a, b]. One can similarly consider functions F of bounded variation on R, in which case the corresponding Lebesgue–Stieltjes measure μF is a finite signed measure on B.
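The supremum defining the total variation can be approached numerically through refining subdivisions. The sketch below, an illustration not from the text, uses F = sin on [0, 2π], whose total variation is 4 (F rises by 1, falls by 2, then rises by 1); variation sums over nested uniform subdivisions increase toward this supremum.

```python
# Numeric sketch (not from the text) of total variation via subdivisions.
# For F = sin on [0, 2*pi] the total variation is 4.
import math

def variation_sum(F, a, b, N):
    # sum of |F(x_n) - F(x_{n-1})| over the uniform subdivision with N steps
    xs = [a + k * (b - a) / N for k in range(N + 1)]
    return sum(abs(F(xs[k]) - F(xs[k - 1])) for k in range(1, N + 1))

a, b = 0.0, 2.0 * math.pi
coarse = variation_sum(math.sin, a, b, 10)
fine = variation_sum(math.sin, a, b, 100_000)

# sums increase under refinement (the fine grid contains the coarse one)
assert coarse <= fine <= 4.0 + 1e-9
assert abs(fine - 4.0) < 1e-3
print(coarse, fine)
```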

Exercises

5.1 Give an example of a signed measure μ on a measurable space (X, S) for which there is a measurable set E with μ(E) = 0 and a measurable subset F of E with μ(F) > 0.
5.2 If μi are measures, define $\mu(E) = \sum_{i=1}^{\infty} \mu_i(E)$. Is μ a measure? If the μi are finite, is

μ necessarily either finite or σ-finite? If each μi is a finite signed measure, is μ a signed measure?
5.3 If ν is a finite signed measure on the measurable space (X, S), show that there exists a finite constant M such that |ν(E)| ≤ M for all E ∈ S.
5.4 If λ, ν are finite signed measures, show that so is aλ + bν, where a, b are real numbers. If λ, ν are signed measures, show that so is aλ + bν provided that ab > 0 if λ and ν assume the same infinite value, and ab < 0 if one of λ, ν assumes the value +∞ and the other –∞.

5.5 If λ and ν are finite signed measures, or signed measures assuming the same infinite value +∞ or –∞ (if at all), show that |λ + ν|≤|λ| + |ν|, i.e. that for each measurable set E, |λ + ν|(E) ≤|λ|(E)+|ν|(E).

5.6 Let μ be a signed measure on (X, S) and μ = μ+ – μ– its Jordan decomposition. (i) Show that μ+ ⊥ μ– and that (μ+, μ–) is the unique pair of mutually singular measures on S whose difference is μ (this is a uniqueness property of the Jordan decomposition). (ii) If μ = λ1 – λ2 where λ1, λ2 are measures on S, show that

μ+ ≤ λ1 and μ– ≤ λ2 (this is a “minimal property” of the Jordan decomposition).
5.7 Let μ be a finite signed measure on a measurable space (X, S). Show that for all E ∈ S,
$$|\mu|(E) = \sup \sum_{i=1}^{n} |\mu(E_i)|,$$
where the sup is taken over all finite partitions of E into disjoint measurable sets Ei, $E = \cup_{i=1}^{n} E_i$, and also
$$|\mu|(E) = \sup \left| \int_E f\,d\mu \right|,$$
where the sup is taken over all measurable functions f such that |f| ≤ 1 a.e. (|μ|) on X.
5.8 Let (X, S, μ) be a measure space and ν a signed measure on S. Show that ν ⊥ μ if and only if both ν+ ⊥ μ and ν– ⊥ μ.
5.9 If (X, S, μ) is a measure space and ν is a signed measure on S, show that ν ⊥ μ if and only if there is a set G ∈ S with μ(G) = 0 and such that ν(E) = 0 for every measurable subset E of G^c.
5.10 Let (X, S, μ) be a measure space, and let λ, ν each be a signed measure on S such that |λ(E)| ≤ |ν|(E) for all E ∈ S. (In particular this holds if |λ(E)| ≤ |ν(E)| for all E ∈ S.) Show that
(i) If ν ≪ μ then λ ≪ μ.
(ii) If ν ⊥ μ then λ ⊥ μ.
5.11 Let μ be a measure and let λ, ν be signed measures on a measurable space (X, S) such that λ and ν assume the same infinite value +∞ or –∞ (if at all). Show that
(i) If λ ≪ μ, ν ≪ μ then λ + ν ≪ μ.
(ii) If λ ⊥ μ, ν ⊥ μ then λ + ν ⊥ μ.
Note: To show (ii) find a set G such that μ(G) = 0 and both |λ|(G^c) = |ν|(G^c) = 0, and use Ex. 5.4.
5.12 If μ is a measure on (X, S) and ν is a signed measure on S such that both ν ≪ μ and ν ⊥ μ, show that ν = 0 (i.e. ν(E) = 0 for all measurable E). Note: It is simplest to show that |ν|(X) = 0, using Theorem 5.4.1.

5.13 If ν is a signed measure, show that ν+ ⊥ ν– and that ν ≪ |ν|.
5.14 Let (X, S) and (Y, T) be measurable spaces, let T be a measurable transformation from (X, S) into (Y, T), and let μ, ν be two measures on S. Show that
(i) If ν ≪ μ, then νT^{–1} ≪ μT^{–1}.
(ii) If ν ∼ μ, then νT^{–1} ∼ μT^{–1}.
(iii) If νT^{–1} ⊥ μT^{–1}, then ν ⊥ μ.
(The converse statements are not true in general.)
5.15 Let μ and ν be two σ-finite measures on the measurable space (X, S) such that ν(E) ≤ μ(E) for all E in S. Show that ν is absolutely continuous with respect to μ and that the Radon–Nikodym derivative f = dν/dμ satisfies 0 ≤ f ≤ 1 a.e. (μ).
5.16 If μ is a σ-finite measure and ν a σ-finite signed measure on (X, S) such that ν ≪ μ, show that
$$|\nu|\left\{x : \frac{d\nu}{d\mu}(x) = 0\right\} = 0.$$
5.17 Let μ, ν be σ-finite measures on a measurable space (X, S). Show that ν ≪ μ + ν and
$$0 \le \frac{d\nu}{d(\mu+\nu)} \le 1 \quad \text{a.e. } (\mu + \nu).$$
If also ν ≪ μ, show that one of the inequalities is strict.
5.18 All measures considered here are σ-finite measures on the measurable space (X, S).
(i) If ν ≪ μ and dν/dμ = f, then show that ν ∼ μ if and only if μ{x ∈ X : f(x) = 0} = 0, and then dμ/dν = 1/f.

(ii) If νi ∼ μ and dνi/dμ = fi, i = 1, 2, then show that ν1 ∼ ν2 and dν2/dν1 = f2/f1 a.e. (μ).
(iii) On the measurable space (R, B) (R = the real line, B = the Borel sets of R) give the following examples:
(a) a finite measure equivalent to Lebesgue measure,
(b) two (mutually) singular measures each of which is absolutely continuous with respect to Lebesgue measure.
5.19 Let μ, ν and f be as in Theorem 5.5.4. Show that
(i) μ{x : f(x) > 0} = 0 if and only if μ ⊥ ν.
(ii) μ{x : f(x) = 0} = 0 if and only if μ ≪ ν.
5.20 Let X = [0, 1], S the class of Lebesgue measurable subsets of X, m Lebesgue measure on S, and ν counting measure on S (i.e. if E ∈ S is a finite set of points, ν(E) is the number of points in E; otherwise ν(E) = +∞).

(i) Show that ν has no Lebesgue decomposition with respect to m.
(ii) Show that m ≪ ν but that there is no nonnegative, ν-integrable function f on X such that $m(E) = \int_E f\,d\nu$ for all E ∈ S.
Note that ν is not σ-finite, and thus σ-finiteness cannot be dropped in the Lebesgue decomposition theorem and the Radon–Nikodym Theorem.
5.21 With the notation of Theorem 5.6.1 suppose that f is nonnegative, measurable, and defined a.e. (ν). Let f* be defined for all x in such a way that f*(x) = f(x) when x ∈ A, the set where f is defined, and so that f* is measurable. Show that $\int f\,d\nu = \int f^* \frac{d\nu}{d\mu}\,d\mu$. (Note that the right hand side is $\int_A f^* \frac{d\nu}{d\mu}\,d\mu$ since dν/dμ = 0 a.e. (μ) on A^c.) Show a corresponding result if f ∈ L1(X, S, ν).
5.22 Let 0 = x0 < x1 < ··· < xn < +∞, let a0, a1, ..., an be positive numbers, and let F be defined on the real line by
$$F(x) = \begin{cases} 0 & \text{for } x < 0 \\ \sum_{i=0}^{k} a_i + 1 - e^{-x} & \text{for } x_k \le x < x_{k+1},\ k = 0, 1, \ldots, n-1 \\ \sum_{i=0}^{n} a_i + 1 - e^{-x} & \text{for } x \ge x_n. \end{cases}$$

If μF is the Lebesgue–Stieltjes measure corresponding to F, find:

(i) a Hahn decomposition for μF ,

(ii) the Lebesgue decomposition of μF with respect to Lebesgue measure,

(iii) the Radon–Nikodym derivative of the absolutely continuous part of μF with respect to Lebesgue measure,

(iv) the discrete and the continuous singular parts of μF.
5.23 Let R be the real line, R+ = (0, +∞), B the Borel sets of R, B+ the Borel sets of R+ (i.e. the σ-field generated by P = {(a, b] : 0 < a ≤ b < +∞}), and m Lebesgue measure. Let the transformation T from (R, B, m) into (R+, B+) be defined by Tx = e^x for all x ∈ R. Show that T is measurable and that the measure mT^{–1} it induces on B+ is absolutely continuous with respect to Lebesgue measure, with Radon–Nikodym derivative $\frac{d(mT^{-1})}{dm}(x) = 1/x$. (Hint: Use the property $\int_a^b \frac{1}{x}\,dx = \log b - \log a$ for 0 < a ≤ b < +∞, and the extension theorem.)
5.24 Let R be the real line, L the σ-field of Lebesgue measurable sets, and μ a σ-finite measure on L. For every a in R, let Ta be the transformation from (R, L, μ) to (R, L) defined by Ta(x) = x + a for all x ∈ R, and let $\mu_a = \mu T_a^{-1}$.

Then a is called an admissible translation of μ if μa is absolutely continuous with respect to μ.

If a is an admissible translation of μ, write fa = dμa/dμ. Prove that if a and b are admissible translations then so is a + b, and that $f_{a+b}(x) = f_a(x) f_b(x - a)$ a.e. (μ).
5.25 Let R be the real line, B the Borel sets of R, m Lebesgue measure on B, I a bounded interval, BI the Borel subsets of I, and mI Lebesgue measure on BI (i.e. the restriction of m to BI). Let f be a real-valued, Borel measurable

function defined on I. Then the induced measure $\nu = m_I f^{-1}$ on B is called the occupation time measure of f; ν(E) is the “amount of time” in I spent by f at values in E ∈ B. Also, if ν is absolutely continuous with respect to m, its Radon–Nikodym derivative φ is called the local time of f. Denote by fA the restriction of f to A ∈ BI and by νA the occupation time measure of fA. Show the following:

(a) If f has local time φ then for every A ∈ BI, fA has a local time, denoted by φA, and φA ≤ φ a.e. (m).
(b) For every A, B ∈ BI,
$$\int_A \varphi_B(f(t))\,dt = \int_{-\infty}^{\infty} \varphi_A(x)\varphi_B(x)\,dx = \int_B \varphi_A(f(t))\,dt.$$
(c) φ(f(t)) > 0 a.e. (mI). (Hint: Let A = {t ∈ I : φ(f(t)) = 0} and show that φA = 0 a.e. (m) by using (a) and (b).)
5.26 Let F be a real-valued function on [a, b] and define the extended real-valued function V(x) on [a, b] by
$$V(x) = \sup \sum_{n=1}^{N} |F(x_n) - F(x_{n-1})|, \qquad a \le x \le b,$$

where the supremum is taken over all N and all subdivisions a = x0 < x1 < ··· < xN = x. Clearly 0 ≤ V(x) ≤ V(y) ≤ ∞ whenever a ≤ x < y ≤ b. Show by the following steps that F is of bounded variation on [a, b] (Section 5.7) if and only if V(b) < ∞, thus justifying the term used.

(i) If F is of bounded variation show that V(b) < ∞. (Write F = F1 – F2, F1, F2 nondecreasing and show that V(b) ≤ F1(b)–F1(a)+F2(b)– F2(a).) (ii) If V(b) < ∞ show that F is of bounded variation as follows. First show that |F(y)–F(x)|≤V(y)–V(x) whenever a ≤ x < y ≤ b. Then define

F1(x)=(V(x)+F(x))/2, F2(x)=(V(x)–F(x))/2, a ≤ x ≤ b,

and show that F1, F2 are nondecreasing functions and F = F1 – F2. (iii) If F is a right-continuous function of bounded variation on [a, b]show | | ≤ ≤ ≤| | that μF (a, x]=V(x), a x b.(V(x) μF (a, x] follows directly from the definition of V. For the reverse inequality notice that by (ii), | |≤ | |≤ ∈B μF (x, y] μV (x, y], hence μF (B) μV (B)forallB [a, b], and | | ≤ μF (B) μV (B).) 5.27 Show that if a function F(x) of bounded variation is right-continuous, then the nondecreasing functions F1(x), F2(x) in the representation F = F1 – F2 may each be taken to be right-continuous. 5.28 State the change of variable of integration result (Theorem 5.7.2) for a func- tion F of bounded variation. Are any adjustments needed in the proof of Theorem 5.7.2 in this case? Exercises 117

5.29 Let μ be a complex measure on the measurable space (X, S). Then μ may be written as μ = μ1 + iμ2 where μ1, μ2 are finite signed measures. Write ν = |μ1| + |μ2|. Then, by Ex. 5.17, further write g1 = dμ1/dν, g2 = dμ2/dν and define the total variation of the complex measure μ by
$$|\mu|(E) = \int_E \sqrt{g_1^2 + g_2^2}\,d\nu \quad \text{for all } E \in \mathcal{S}.$$
Show that |μ| is a finite measure on (X, S), and that there is a complex-valued measurable function f (i.e. f = f1 + if2 where f1, f2 are measurable) such that |f| = 1 and, for all E ∈ S, $\mu(E) = \int_E f\,d|\mu|$. (This may be written f = dμ/d|μ|, and is called the polar representation or decomposition of μ. This definition of the total variation of a complex measure μ is equivalent to the more intuitive definition
$$|\mu|(E) = \sup \sum_{k=1}^{n} |\mu(E_k)|,$$
where the sup is taken over all n and over all disjoint partitions $E = \cup_{k=1}^{n} E_k$ of E.)

6

Convergence of measurable functions, Lp-spaces

6.1 Modes of pointwise convergence

Throughout this chapter (X, S, μ) will denote a fixed measure space. Consider a sequence {fn} of functions defined on E ⊂ X and taking values in R*. If f is a function on E (to R*) and fn(x) → f(x) for all x ∈ E, then fn converges pointwise on E to f. If E ∈ S and μ(E^c) = 0 then fn → f (pointwise) a.e. (as in Chapter 4). It is clear that if fn → f a.e. and fn → g a.e. then f = g a.e., since the limit is unique where it exists.

If fn is finite-valued on E, and given any ε > 0, x ∈ E, there exists N = N(x, ε) such that |fn(x) – fm(x)| < ε for all n, m > N, then {fn} is said to be a (pointwise) Cauchy sequence on E. If E ∈ S and μ(E^c) = 0, {fn} is called Cauchy a.e. Since each Cauchy sequence of real numbers has a finite limit, if {fn} is Cauchy on E (or Cauchy a.e.) there is a finite-valued function f such that fn → f on E (or fn → f a.e.).

If {fn} is a sequence of finite-valued functions on a set E and f is finite-valued on E, we say that fn converges to f uniformly on E if, given any ε > 0, there exists N = N(ε) such that |fn(x) – f(x)| < ε for all n ≥ N, x ∈ E. If E ∈ S and μ(E^c) = 0, we say that fn → f uniformly a.e. Similarly, if given any ε > 0, there exists N = N(ε) such that |fn(x) – fm(x)| < ε whenever n, m > N, x ∈ E, then {fn} is called a uniformly Cauchy sequence on E. Such a sequence is pointwise Cauchy on E and thus has a pointwise limit f(x) on E. By letting m → ∞ in the definition just given, it follows that |fn(x) – f(x)| ≤ ε for all n ≥ N, x ∈ E; that is fn → f uniformly on E. One may also talk about a sequence which is convergent or Cauchy (pointwise or uniformly) a.e. on a set E ∈ S. (For example fn → f a.e. on E if fn(x) → f(x) on E – F for some F ∈ S, μ(F) = 0.) The above remarks all hold for such sequences (e.g. if fn is Cauchy a.e. on E then fn converges a.e. on E to some f). In addition to pointwise convergence (a.e.) and uniform convergence (a.e.), a third (technically useful) concept is that of “almost uniform

convergence”. Specifically if {fn} and f are functions defined on E ∈ S and taking values in R*, fn is said to converge to f almost uniformly on E if, given any ε > 0, there is a measurable set F = F_ε with μ(F) < ε and such that fn → f uniformly on E – F. (In particular, this requires fn and f to be finite-valued on E – F for any ε > 0, and it is easily seen that this requires fn and f to be finite-valued a.e. on E.) Similarly a sequence {fn} of (a.e. finite-valued) functions on E is said to be almost uniformly Cauchy on E if given any ε > 0 there is a measurable subset F = F_ε with μ(F) < ε such that fn is uniformly Cauchy on E – F. We abbreviate “almost uniformly” to a.u. It is worth remarking that while uniform convergence a.e. clearly implies convergence almost uniformly, the converse is not true (Ex. 6.1). The following result shows that, as would be expected, almost uniform convergence implies convergence a.e.
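Before the formal results, a concrete illustration may help. An assumption of this sketch (not part of the text above) is Lebesgue measure on [0, 1] and fn(x) = x^n: then fn → 0 pointwise on [0, 1) but not uniformly, while deleting an interval (1 – ε, 1] of measure ε restores uniform convergence — exactly the almost uniform behaviour just defined.

```python
# fn(x) = x**n on [0, 1): the pointwise limit is 0, but sup|fn| stays near 1,
# so convergence is not uniform; after removing (1 - eps, 1] (here eps = 0.05)
# the supremum is (1 - eps)**n -> 0, i.e. convergence is almost uniform.
def sup_on(n, upper, grid=10000):
    # approximate sup of x**n over [0, upper]
    return max((upper * k / grid) ** n for k in range(grid + 1))

n = 200
sup_full = sup_on(n, 0.9999)   # essentially sup over [0, 1): still near 1
sup_trim = sup_on(n, 0.95)     # sup over [0, 1 - eps]: already negligible
print(sup_full, sup_trim)
```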

Theorem 6.1.1 If {fn} is a sequence of functions on E ∈ S to R*, and fn is almost uniformly Cauchy on E (or fn → f almost uniformly on E), then fn is Cauchy a.e. on E (or fn → f a.e. on E).

Proof Suppose {fn} is a.u. Cauchy on E. Then given any integer p ≥ 1 there exists a measurable set Fp such that μ(Fp) < 1/p and {fn} is uniformly Cauchy on E – Fp, and hence pointwise Cauchy on E – Fp. Let F = ∩_{p=1}^∞ Fp. Then μ(F) ≤ μ(Fp) < 1/p for every p and hence μ(F) = 0. If x ∈ E – F then x ∈ E – Fp for some p and hence {fn(x)} is a Cauchy sequence. That is, {fn} is pointwise Cauchy on E – F. This proves the first assertion. The second follows similarly. □

This result will be used to show that a sequence which is almost uniformly Cauchy converges almost uniformly.

Theorem 6.1.2 If {fn} is almost uniformly Cauchy on E ∈S, then there exists a function f such that fn → f almost uniformly on E.

Proof If {fn} is Cauchy a.u., it is Cauchy a.e. on E by Theorem 6.1.1, and hence there is a function f on E such that fn → f a.e. on E. Since fn is a.u. Cauchy, given ε > 0 there is a measurable set F = F_ε, μ(F) < ε, such that fn is uniformly Cauchy on E – F. The set of points of E where fn does not converge to f has measure zero and may be included in F without increasing its measure. But fn is uniformly Cauchy and hence converges uniformly to a function g on E – F. Since uniform convergence implies convergence at each x it follows that fn converges to both f and g on E – F. Thus f = g there and fn → f uniformly on E – F. But this shows that fn → f a.u. on E, as required. □

One would not necessarily expect convergence a.e. to imply almost uniform convergence, i.e. the converse of Theorem 6.1.1 to hold. This does in fact hold, however, for measurable functions on sets of finite measure.

Theorem 6.1.3 (Egoroff’s Theorem) Let E ∈ S, with μ(E) < ∞, and let {fn} and f be measurable functions defined and finite a.e. on E and such that fn → f a.e. on E. Then fn → f almost uniformly on E.

Proof By excluding the zero measure subset of E where fn or f is not defined, or infinite, or where fn(x) does not converge to f(x), it is seen that no generality is lost in assuming that fn(x), f(x) are defined and finite and that fn(x) → f(x) for all x ∈ E. Write, for m, n = 1, 2, ...,
E_n^m = ∩_{i=n}^∞ {x ∈ E : |fi(x) – f(x)| < 1/m}.
Then E_n^m ∈ S, and for each fixed m, {E_n^m} is monotone increasing in n with lim_n E_n^m = E (since fn → f on E). Thus E – E_n^m is decreasing in n and lim_n (E – E_n^m) = ∅. Since μ(E) < ∞ it follows that μ(E – E_n^m) → 0 as n → ∞. Hence, given ε > 0 there is an integer N_m = N_m(ε) such that μ(E – E_n^m) < ε/2^m for n ≥ N_m. Write F = F_ε = ∪_{m=1}^∞ (E – E_{N_m}^m). Then clearly F ⊂ E, F ∈ S and

μ(F) ≤ ∑_{m=1}^∞ μ(E – E_{N_m}^m) < ∑_{m=1}^∞ ε/2^m = ε.
We now show that fn → f uniformly on E – F. If x ∈ E – F, then x ∈ E_{N_m}^m, m = 1, 2, ..., and thus

|fi(x) – f(x)| < 1/m for all i ≥ N_m.

Hence given any δ > 0, m may be chosen such that 1/m < δ, giving |fi(x) – f(x)| < δ for all i ≥ N_m and all x ∈ E – F. (Note N_m does not depend on x.) It follows that fn → f uniformly on E – F, and thus fn → f a.u. on E. □

6.2 Convergence in measure

We turn now to another form of convergence (particularly important in applications to probability theory). Consider a measurable set E and a sequence of measurable functions {fn} defined on E, and finite a.e. on E. Then if f is a measurable function defined and finite a.e. on E we say that fn → f in measure on E if for any given ε > 0,

μ{x ∈ E : |fn(x) – f(x)| ≥ ε} → 0 as n → ∞.

That is, the emphasis is not on the difference between fn and f at each point, but rather on the measure of the set where the difference is at least ε. Similarly fn is a Cauchy sequence in measure on E if for each ε > 0,

μ{x ∈ E : |fn(x) – fm(x)| ≥ ε} → 0 as n, m → ∞.

The set E will be regarded as the precise set of definition of the fn and f (even if some of these functions have been defined on larger sets). Then E may be omitted in the above expressions. Finally, if μ(E^c) = 0 and fn → f in measure on E (or {fn} is Cauchy in measure on E) we say that fn → f in measure (or {fn} is Cauchy in measure) without reference to a set. It will be seen next that a sequence which converges in measure is Cauchy in measure, and that limits in measure are essentially unique.
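How convergence in measure differs from pointwise convergence is shown vividly by the classical “typewriter” example (the sketch assumes Lebesgue measure on [0, 1); the particular enumeration of dyadic intervals is one standard choice):

```python
# The "typewriter" sequence on [0, 1): fn is the indicator of the n-th
# interval in the enumeration [j/2**k, (j+1)/2**k), k = 0, 1, 2, ...,
# j = 0, ..., 2**k - 1.  The measure of {x : |fn(x)| >= eps} is the interval
# length 2**-k -> 0 (so fn -> 0 in measure), yet fn(x) = 1 for one n in every
# generation k, so {fn(x)} converges at no point x.
def interval(n):
    k = (n + 1).bit_length() - 1      # generation
    j = (n + 1) - (1 << k)            # position within generation k
    return j / 2 ** k, (j + 1) / 2 ** k

def f(n, x):
    a, b = interval(n)
    return 1.0 if a <= x < b else 0.0

lengths = [interval(n)[1] - interval(n)[0] for n in range(63)]  # k = 0..5
x = 0.3
hits = [n for n in range(63) if f(n, x) == 1.0]  # one hit per generation
print(lengths[-1], hits)
```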

Theorem 6.2.1 (i) If {fn} converges in measure (to f, say) on E ∈ S, then {fn} is Cauchy in measure on E.
(ii) If {fn} converges in measure on E to both f and g, then f = g a.e. on E, i.e. limits in measure are “essentially unique”.

Proof Since |fn – fm| ≤ |fn – f| + |f – fm|, it follows that for any ε > 0

{x : |fn(x) – fm(x)| ≥ ε} ⊂ {x : |fn(x) – f(x)| ≥ ε/2} ∪ {x : |fm(x) – f(x)| ≥ ε/2}

(for if x is not in the right hand side, then |fn(x) – fm(x)| < ε). The measure of each set on the right tends to zero as n, m → ∞ since fn → f in measure on E. Hence so does the measure of the set on the left hand side, showing that {fn} is Cauchy in measure on E. To prove (ii) note that it follows in an exactly analogous way that for any ε > 0,

μ{x : |f(x) – g(x)| ≥ ε} ≤ μ{x : |f(x) – fn(x)| ≥ ε/2} + μ{x : |fn(x) – g(x)| ≥ ε/2} → 0 as n → ∞.
Hence μ{x : |f(x) – g(x)| ≥ ε} = 0 for each ε > 0 and thus
μ{x : f(x) ≠ g(x)} = μ[∪_{n=1}^∞ {x : |f(x) – g(x)| ≥ 1/n}] = 0,
so that f = g a.e. on E, as required. □
We now turn to the relationship between convergence in measure, and almost uniform (and hence also a.e.) convergence. It will first be shown that almost uniform convergence of measurable functions implies convergence in measure.

Theorem 6.2.2 Let {fn}, f be measurable functions defined on E ∈ S and finite a.e. on E.

(i) If { fn} is Cauchy almost uniformly on E, it is Cauchy in measure on E. (ii) If fn → f almost uniformly on E, then fn → f in measure on E.

Proof If {fn} is Cauchy a.u. on E, given any δ > 0 there is a measurable set F_δ ⊂ E such that μ(F_δ) < δ and fn – fm → 0 uniformly on E – F_δ as n, m → ∞. Hence if ε > 0, there exists N = N(ε, δ) such that |fn(x) – fm(x)| < ε for all n, m ≥ N, and all x ∈ E – F_δ. Thus

μ{x : |fn(x) – fm(x)| ≥ ε} ≤ μ(F_δ) < δ for m, n ≥ N,
so that μ{x : |fn(x) – fm(x)| ≥ ε} → 0 as n, m → ∞. Hence (i) follows, and the proof of (ii) is virtually the same. □
As a corollary, convergence of measurable functions a.e. on sets of finite measure implies convergence in measure.

Corollary If μ(E) < ∞ and fn → f a.e. on E, then fn → f in measure on E.

Proof By Egoroff’s Theorem (Theorem 6.1.3) fn → f a.u. on E and thus by Theorem 6.2.2 (ii), fn → f in measure on E. □
In the converse direction we show that convergence in measure implies almost uniform (and hence also a.e.) convergence of a subsequence of the original sequence. This is a corollary of the following result, which shows that if a sequence is Cauchy in measure, it has a limit in measure (a property, i.e. completeness, of all modes of convergence considered previously).

Theorem 6.2.3 Let {fn} be a sequence of measurable functions on a set E ∈ S which is Cauchy in measure on E. Then
(i) there is a subsequence {f_{n_k}} which is Cauchy almost uniformly on E, and

(ii) there is a measurable function f on E such that fn → f in measure on E. By Theorem 6.2.1 (ii) f is essentially unique on E.

Proof (i) For each integer k there exists an integer n_k such that for n, m ≥ n_k,
μ{x : |fn(x) – fm(x)| ≥ 2^{-k}} ≤ 2^{-k}.

Further we may take n_1 < n_2 < n_3 < ···. Write
E_k = {x : |f_{n_k}(x) – f_{n_{k+1}}(x)| ≥ 2^{-k}}, k = 1, 2, ...,
F_k = ∪_{m=k}^∞ E_m.
Then μ(E_k) ≤ 2^{-k} and μ(F_k) ≤ ∑_{m=k}^∞ μ(E_m) ≤ 2^{-k+1}. Now given ε > 0, choose k such that 2^{-k+1} < ε and hence μ(F_k) < ε. Also for all x ∈ E – F_k, x ∈ E – E_m for m ≥ k and hence |f_{n_m}(x) – f_{n_{m+1}}(x)| < 2^{-m} for all m ≥ k, and thus for all ℓ ≥ m ≥ k,
|f_{n_m}(x) – f_{n_ℓ}(x)| ≤ ∑_{i=m}^{ℓ–1} |f_{n_i}(x) – f_{n_{i+1}}(x)| < 2^{-m+1} → 0 as m → ∞.
Hence {f_{n_m}} is uniformly Cauchy on E – F_k where μ(F_k) < ε. Thus {f_{n_m}} is Cauchy a.u., as required.
(ii) By (i) there is a subsequence {f_{n_k}} of {fn} which is Cauchy a.u. and thus converges a.u. to a measurable f on E (Theorem 6.1.2). Given ε > 0,
{x : |f_k(x) – f(x)| ≥ ε} ⊂ {x : |f_k(x) – f_{n_k}(x)| ≥ ε/2} ∪ {x : |f_{n_k}(x) – f(x)| ≥ ε/2}.

Since {fn} is Cauchy in measure (and n_k → ∞ as k → ∞) the measure of the first set on the right tends to zero as k → ∞. But the measure of the second set also tends to zero, since f_{n_k} → f a.u. and hence, by Theorem 6.2.2, in measure. Thus μ{x : |f_k(x) – f(x)| ≥ ε} → 0 as k → ∞, showing that fn → f in measure. □
Corollary If fn → f in measure on E then there is a subsequence {f_{n_k}} such that f_{n_k} → f almost uniformly, and hence also a.e.

Proof By Theorem 6.2.1 (i), {fn} is Cauchy in measure on E, and by (i) of Theorem 6.2.3 it has a subsequence {f_{n_k}} which is Cauchy a.u. on E, and hence convergent a.u. on E to some function g (Theorem 6.1.2). Then by Theorem 6.2.2, f_{n_k} → g in measure also, and hence f = g a.e. on E by Theorem 6.2.1. Thus f_{n_k} → f a.u. on E, and hence also f_{n_k} → f a.e. on E (Theorem 6.1.1). □
The final theorem of this section gives a necessary and sufficient condition (akin to the definition of convergence in measure) for convergence a.e. on a set of finite measure. This result is interesting in applications to probability.

Theorem 6.2.4 Let {fn}, f be measurable functions defined and a.e. finite-valued on E ∈ S, where μ(E) < ∞. Write, for ε > 0 and n = 1, 2, ...,
E_n(ε) = {x : |fn(x) – f(x)| ≥ ε}.
Then fn → f a.e. on E if and only if for every ε > 0,
lim_{n→∞} μ{∪_{m=n}^∞ E_m(ε)} = 0.

Proof fn may fail to converge to f at points x ∈ E for which f(x) has infinite values – assumed to be a zero measure set. Aside from these points, fn(x) fails to converge to f(x) if and only if x ∈ D = ∪_{k=1}^∞ lim sup_n E_n(1/k), since x ∈ D if and only if for some k, |fn(x) – f(x)| ≥ 1/k for infinitely many n. Since lim sup_n E_n(1/k) is clearly monotone nondecreasing in k,

μ(D) = lim_{k→∞} μ{lim sup_n E_n(1/k)} = lim_{k→∞} lim_{n→∞} μ{F_n(1/k)},
where F_n(ε) = ∪_{m=n}^∞ E_m(ε) (μ(E) being finite). If lim_{n→∞} μ{F_n(ε)} = 0 for each ε > 0, it thus follows that μ(D) = 0 and hence fn → f a.e. on E. Conversely, if fn → f a.e. on E, then μ(D) = 0. But this means lim_{n→∞} μ{F_n(1/k)} = 0 for each k, since this quantity is nonnegative and nondecreasing in k. Given ε > 0 choose k with 1/k < ε. Then

0 ≤ lim_{n→∞} μ{F_n(ε)} ≤ lim_{n→∞} μ{F_n(1/k)} = 0,
which yields the desired conclusion lim_{n→∞} μ{F_n(ε)} = 0. □
Note that the corollary to Theorem 6.2.2 also follows simply from the present theorem. The principal relationships between the forms of convergence considered for measurable functions are illustrated diagrammatically in Section 6.5.
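The criterion of Theorem 6.2.4 can be followed by hand in a simple case (an illustration, not part of the theorem): for fn(x) = x^n on [0, 1] with Lebesgue measure and f = 0 (a.e. limit), E_n(ε) = [ε^{1/n}, 1], and since x^m ≤ x^n for m ≥ n on [0, 1] the tail union ∪_{m≥n} E_m(ε) is just E_n(ε) itself, of measure 1 – ε^{1/n} → 0.

```python
# Theorem 6.2.4 for fn(x) = x**n on [0, 1], f = 0 (Lebesgue measure):
# E_n(eps) = {x : x**n >= eps} = [eps**(1/n), 1], and the union over m >= n
# equals E_n(eps) itself, so its measure 1 - eps**(1/n) -> 0 as n -> infinity,
# matching the fact that fn -> f a.e.
eps = 0.1
tail_measures = [1 - eps ** (1 / n) for n in (1, 10, 100, 1000)]
print(tail_measures)
```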

6.3 Banach spaces

In this section we introduce the notion of a Banach space, which will be referred to in the following sections. Although the results of the next section may be developed without it, the framework and language of Banach spaces will be helpful and useful. The discussion is kept here to the bare minimum necessary for stating the results of Section 6.4. It is first useful to define a metric space and some related concepts. A set L is called a metric space if there is a real-valued function d(f, g) defined for f, g ∈ L and called a distance function or metric such that for all f, g, h in L,
(i) d(f, g) ≥ 0 and d(f, g) = 0 if and only if f = g
(ii) d(f, g) = d(g, f)
(iii) d(f, g) ≤ d(f, h) + d(h, g).

Since by definition a metric space consists of a set L together with a metric d, we will denote it by (L, d) (clearly one may be able to define several metrics on a set). The simplest example of a metric space is the real line L = R, with d(f, g) = |f – g|; or the finite-dimensional space L = R^n, with the Euclidean metric d(f, g) = {∑_{k=1}^n (x_k – y_k)^2}^{1/2} where f = (x_1, ..., x_n), g = (y_1, ..., y_n). Once an appropriate measure of distance is introduced one can define the notion of convergence. A sequence {fn} in a metric space (L, d) will be said to converge to f ∈ L (fn → f or lim_n fn = f) if d(fn, f) → 0 as n → ∞. A simple property of convergence for later use is the following.

Lemma 6.3.1 Let (L, d) be a metric space and fn, f, g elements of L. Then

(i) The limit of a convergent sequence is unique, i.e. if fn → f and fn → g, then f = g.

(ii) If fn → f,gn → g, then d(fn, gn) → d(f , g).

Proof (i) Assume that fn → f and fn → g. For each n

0 ≤ d(f, g) ≤ d(f, fn) + d(fn, g)
and since both terms on the right hand side converge to zero as n → ∞, it follows that d(f, g) = 0 and thus f = g.
(ii) Applying properties (iii) and (ii) of a distance function twice it follows that

d(fn, gn) ≤ d(fn, f )+d(f , g)+d(gn, g)

d(f , g) ≤ d(f , fn)+d(fn, gn)+d(gn, g) and thus,

|d(fn, gn)–d(f , g)|≤d(fn, f )+d(gn, g).

Hence fn → f , gn → g implies d(fn, gn) → d(f , g). 

A sequence { fn} in a metric space (L, d) is called Cauchy if d(fn, fm) → 0 as n, m →∞. Note that if fn → f , then it follows from the inequality

d(fn, fm) ≤ d(fn, f) + d(f, fm)
that {fn} is Cauchy. Thus a sequence in a metric space which converges to an element of the metric space is Cauchy. However, the converse is not always true, i.e. a Cauchy sequence does not necessarily converge in a metric space. Whenever every Cauchy sequence in a metric space converges to an element of the metric space, the metric space is called complete. The real line with d(x, y) = |x – y| is of course a complete metric space.
Let (L, d) be a metric space. A subset E of L is said to be dense in L if for every f ∈ L and every ε > 0 there is g ∈ E with d(f, g) < ε. A metric space is called separable if it has a countable dense subset. Again the real line with d(f, g) = |f – g| is separable, since the set of rational numbers forms a countable dense subset of R.
Another useful concept is that of a linear space. Specifically, a set L is called a linear space (over the real numbers) if there is
(i) a map, called addition, which assigns to each f and g in L an element of L denoted by f + g, with the following properties
(1) f + g = g + f, for all f, g ∈ L,
(2) f + (g + h) = (f + g) + h, for all f, g, h ∈ L,
(3) there is an element of L, denoted by 0, such that f + 0 = 0 + f = f for all f ∈ L,
(4) for each f ∈ L there exists an element of L (denoted by –f) such that f + (–f) = 0. One naturally then writes g – f for g + (–f).
(ii) a map, called scalar multiplication, which assigns to each real a and f ∈ L an element of L denoted simply by af, with the properties that for all a, b ∈ R and f, g ∈ L,
(1) a(f + g) = af + ag
(2) (a + b)f = af + bf
(3) a(bf) = (ab)f
(4) 0f = 0, 1f = f.
The simplest example of a linear space is the set of real numbers R, or R^n. Also the set of all finite-valued measurable functions defined on a measurable space (X, S) (or defined a.e.
on a measure space (X, S, μ)) is a linear space with addition and scalar multiplication defined in the usual way: (f + g)(x) = f(x) + g(x) and (af)(x) = af(x). Finally L1(X, S, μ) is also a linear space.
A linear space L is called a normed linear space if there is a real-valued function defined on L, called a norm and denoted by ‖·‖, such that for all f, g ∈ L, and a ∈ R,
(i) ‖f‖ ≥ 0 and ‖f‖ = 0 if and only if f = 0
(ii) ‖af‖ = |a|‖f‖
(iii) ‖f + g‖ ≤ ‖f‖ + ‖g‖.

It is straightforward to verify that the following are all normed linear spaces. R^n is a normed linear space with ‖f‖ = {∑_{k=1}^n x_k^2}^{1/2} where f = (x_1, ..., x_n). The set C[0, 1] of all continuous real-valued functions on [0, 1] is a normed linear space with ‖f‖ = sup_{0≤t≤1} |f(t)|. L1(X, S, μ) is a normed linear space with ‖f‖ = ∫|f| dμ, if we put f = g in the space L1 whenever f = g a.e. A normed linear space clearly becomes a metric space with distance function d(f, g) = ‖f – g‖.

A complete normed linear space is called a Banach space (the completeness is of course meant with respect to the distance induced by the norm as above). Again the simplest example of a Banach space is the real line R, or R^n. Also C[0, 1] with norm ‖f‖ = sup_{0≤t≤1} |f(t)| can be easily seen to be a Banach space. It will be shown in Section 6.4 that L1(X, S, μ) is a Banach space. Of course there are normed linear spaces that are not Banach spaces. As an example, it may be easily seen that ‖f‖ = (∫_0^1 |f(t)|^2 dt)^{1/2} defines a norm on C[0, 1], but this normed linear space is not complete, as the following Cauchy sequence {fn} shows, where fn(t) = 0 for 0 ≤ t ≤ 1/2, fn(t) = 1 for 1/2 + 1/n ≤ t ≤ 1, and fn(t) = n(t – 1/2) for 1/2 ≤ t ≤ 1/2 + 1/n (in fact its “completion” is the space L2[0, 1] defined in Section 6.4).
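The incompleteness claim can be checked numerically for the sequence fn just described (the midpoint Riemann sums below are an approximation introduced for illustration):

```python
# fn(t) = 0 on [0, 1/2], ramp n(t - 1/2) on [1/2, 1/2 + 1/n], and 1 beyond:
# under ||f|| = (integral_0^1 |f|^2 dt)**(1/2) the sequence is Cauchy, but its
# pointwise limit is the discontinuous step 1_(1/2, 1], so no limit exists
# in C[0, 1].  Norms are approximated by midpoint Riemann sums.
def f(n, t):
    if t <= 0.5:
        return 0.0
    if t >= 0.5 + 1.0 / n:
        return 1.0
    return n * (t - 0.5)

def l2_dist(g, h, grid=20000):
    s = sum((g((k + 0.5) / grid) - h((k + 0.5) / grid)) ** 2 for k in range(grid))
    return (s / grid) ** 0.5

d_small = l2_dist(lambda t: f(10, t), lambda t: f(20, t))
d_tiny = l2_dist(lambda t: f(100, t), lambda t: f(200, t))
print(d_small, d_tiny)   # distances shrink as the indices grow
```

Here the exact value of the first distance is (1/120)^{1/2}, and doubling the indices by a factor of ten shrinks the distance by 10^{1/2}, consistent with the Cauchy property.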

6.4 The spaces Lp

In this section the class L1 of functions is generalized in an obvious way and the properties of the resulting class are studied. (X, S, μ) will be a fixed measure space throughout. For each real p > 0 and measurable f defined a.e., write
‖f‖p = (∫|f|^p dμ)^{1/p} (= ∞ if ∫|f|^p dμ = ∞).
The subclass of all such f for which ‖f‖p < ∞ is denoted by Lp = Lp(X, S, μ). Equivalently Lp is clearly the class of all measurable functions f such that |f|^p ∈ L1. It is convenient and useful to define the class L∞ = L∞(X, S, μ) as the set of all measurable functions defined a.e. which are essentially bounded in the sense that |f(x)| ≤ M a.e. for some finite M. For each f ∈ L∞, ‖f‖∞ will denote the essential supremum of |f|, that is the least such M, i.e.

‖f‖∞ = ess sup |f| = inf{M > 0 : μ{x : |f(x)| > M} = 0}.
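For concreteness, ‖f‖p can be computed numerically; the concrete choice f(t) = t on [0, 1] with Lebesgue measure and the midpoint Riemann sums are assumptions of this sketch. The exact value here is (1/(p + 1))^{1/p}, which increases toward ‖f‖∞ = 1 as p grows (consistent, since μ([0, 1]) = 1, with Theorem 6.4.8 below).

```python
# ||f||_p = (integral |f|**p dmu)**(1/p) for f(t) = t on [0, 1], approximated
# by a midpoint Riemann sum; the exact value is (1/(p + 1))**(1/p), which
# increases to ess sup |f| = 1 as p -> infinity.
def lp_norm(f, p, grid=20000):
    s = sum(abs(f((k + 0.5) / grid)) ** p for k in range(grid)) / grid
    return s ** (1.0 / p)

f = lambda t: t
norms = {p: lp_norm(f, p) for p in (1, 2, 4, 20, 100)}
print(norms)
```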

In the following we concentrate on the classes of functions Lp for 0 < p ≤ ∞. With addition of functions and scalar multiplication defined in the usual way (i.e. (f + g)(x) = f(x) + g(x) at all points x for which the sum makes sense, and (af)(x) = af(x) at all points x where f is defined) it is simply shown that each Lp, 0 < p ≤ ∞, is a linear space. Of course for p = 1 this was already established in Theorem 4.4.3.

Theorem 6.4.1 Each Lp, 0 < p ≤ ∞, is a linear space. In particular if f1, ..., fn are in Lp and a1, ..., an are real numbers then a1 f1 + ··· + an fn ∈ Lp.

Proof If f ∈ Lp and a is a real number it is clear that af ∈ Lp. That f, g ∈ Lp implies f + g ∈ Lp is again clear when p = ∞, and for 0 < p < ∞ we have

|f(x) + g(x)| ≤ |f(x)| + |g(x)|,
|f(x) + g(x)|^p ≤ 2^p max(|f(x)|^p, |g(x)|^p) ≤ 2^p (|f(x)|^p + |g(x)|^p)
at all points for which f + g is defined, and hence a.e. Since the right hand side is in L1, so is |f + g|^p (Theorem 4.4.6), showing that f + g ∈ Lp, as required. It is now quite clear that all properties of addition and scalar multiplication are satisfied so that each Lp is a linear space. □

Further properties of Lp-spaces are based on the following important classical inequalities.

Theorem 6.4.2 (Hölder’s Inequality) Let 1 ≤ p, q ≤ ∞ be such that 1/p + 1/q = 1 (with q = ∞ when p = 1). If f ∈ Lp and g ∈ Lq then fg ∈ L1 and

‖fg‖1 ≤ ‖f‖p ‖g‖q.
For 1 < p, q < ∞ equality holds if and only if f = 0 a.e. or g = 0 a.e. or |f|^p = c|g|^q a.e. for some c > 0. If p = q = 2 the last equality of course becomes |f| = c|g|, some c > 0.

Proof For p = 1, q = ∞ we have |g(x)| ≤ ‖g‖∞ a.e. and thus ‖fg‖1 = ∫|fg| dμ ≤ ‖g‖∞ ∫|f| dμ = ‖f‖1 ‖g‖∞ (< ∞), and similarly for p = ∞, q = 1. Now assume that 1 < p, q < ∞. If 0 < α < 1, then

t^α – 1 ≤ α(t – 1)
for all t ≥ 1, with equality only when t = 1. (This is easily seen from the equality at t = 1 and the fact that the derivative of the left side is strictly less than that of the right side for t > 1.) Putting t = a/b we thus have for a ≥ b > 0,
a^α b^{1–α} ≤ αa + (1 – α)b, 0 < α < 1. (6.1)
This inequality holds for a ≥ b > 0 and thus for a ≥ b ≥ 0, with equality only if a = b (≥ 0). But by symmetry it holds also if b ≥ a ≥ 0, and thus for all a ≥ 0, b ≥ 0, with equality only when a = b.
If f = 0 a.e. or g = 0 a.e., the conclusions of the theorem are clearly true. It may therefore be assumed that neither f nor g is zero a.e.; that is we assume ‖f‖p^p = ∫|f|^p dμ > 0, ‖g‖q^q = ∫|g|^q dμ > 0 (Theorem 4.4.7). Then by (6.1), writing a = |f(x)|^p/‖f‖p^p, b = |g(x)|^q/‖g‖q^q, α = 1/p, 1 – α = 1/q, it follows that
|f(x)| |g(x)| / (‖f‖p ‖g‖q) ≤ |f(x)|^p / (p ‖f‖p^p) + |g(x)|^q / (q ‖g‖q^q) (6.2)
for all x for which f and g are both defined and finite, and hence a.e. Since the right hand side is in L1 (|f|^p ∈ L1, |g|^q ∈ L1), it follows from Theorem 4.4.6 that |fg| ∈ L1, and by Theorem 4.4.4 the integral of the left hand side of (6.2) does not exceed that of the right, i.e.
∫|fg| dμ / (‖f‖p ‖g‖q) ≤ ∫|f|^p dμ / (p ‖f‖p^p) + ∫|g|^q dμ / (q ‖g‖q^q) = 1/p + 1/q = 1.
Hence fg ∈ L1 and ‖fg‖1 = ∫|fg| dμ ≤ ‖f‖p ‖g‖q. Finally if equality holds,
∫ { |f(x)|^p / (p ‖f‖p^p) + |g(x)|^q / (q ‖g‖q^q) – |f(x)g(x)| / (‖f‖p ‖g‖q) } dμ(x) = 0
and since by (6.2) the integrand is nonnegative, it must be zero a.e. by Theorem 4.4.7. But since equality holds in (6.1) only when a = b, we must thus have |f(x)|^p/‖f‖p^p = |g(x)|^q/‖g‖q^q a.e., from which the final conclusion of the theorem follows. □
In the special case when p = q = 2, Hölder’s Inequality is usually called the Schwarz Inequality. When 0 < p < 1 and 1/p + 1/q = 1 (hence q < 0) a reverse Hölder’s Inequality holds for nonnegative functions (see Ex. 6.18).
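A numerical sanity check of both the inequality and its equality condition; the concrete functions and the midpoint Riemann sums on [0, 1] with Lebesgue measure are assumptions of this sketch.

```python
# Hoelder's Inequality on [0, 1] with p = 3, q = 3/2 (so 1/p + 1/q = 1):
# ||fg||_1 <= ||f||_p ||g||_q, with equality when |f|**p = c |g|**q a.e.
def integral(h, grid=20000):
    # midpoint Riemann sum over [0, 1]
    return sum(h((k + 0.5) / grid) for k in range(grid)) / grid

p, q = 3.0, 1.5
f = lambda t: t
g = lambda t: 1.0 - t

lhs = integral(lambda t: abs(f(t) * g(t)))
rhs = (integral(lambda t: abs(f(t)) ** p) ** (1 / p)
       * integral(lambda t: abs(g(t)) ** q) ** (1 / q))

# equality case: g2(t) = t**2 satisfies |f|**3 = |g2|**(3/2) (c = 1)
g2 = lambda t: t ** 2
lhs2 = integral(lambda t: f(t) * g2(t))
rhs2 = (integral(lambda t: f(t) ** p) ** (1 / p)
        * integral(lambda t: g2(t) ** q) ** (1 / q))
print(lhs, rhs, lhs2, rhs2)
```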

Theorem 6.4.3 (Minkowski’s Inequality) If 1 ≤ p ≤ ∞ and f, g ∈ Lp then f + g ∈ Lp and

‖f + g‖p ≤ ‖f‖p + ‖g‖p.

For 1 < p < ∞ equality holds if and only if f = 0 a.e. or g = 0 a.e. or f = cg a.e. for some c > 0. For p = 1 equality holds if and only if fg ≥ 0 a.e.

Proof Theorem 6.4.1 shows that f + g ∈ Lp. Since |f(x) + g(x)| ≤ |f(x)| + |g(x)| for all x where both f and g are defined and finite, and thus a.e., the inequality clearly follows for p = 1 and p = ∞. When p = 1 equality holds if and only if |f + g| = |f| + |g| a.e., which is equivalent to fg ≥ 0 a.e. Assume now that 1 < p < ∞. Then the following holds a.e.

|f + g|^p = |f + g| · |f + g|^{p–1} ≤ |f| · |f + g|^{p–1} + |g| · |f + g|^{p–1}. (6.3)

Since p > 1 there exists q > 1 such that 1/p + 1/q = 1. Further (p – 1)q = p, so that |f + g|^{(p–1)q} = |f + g|^p ∈ L1 and hence |f + g|^{p–1} ∈ Lq. Thus by Hölder’s Inequality,
∫|f| |f + g|^{p–1} dμ ≤ ‖f‖p (∫|f + g|^{(p–1)q} dμ)^{1/q} = ‖f‖p ‖f + g‖p^{p/q} (6.4)
and similarly for |g| |f + g|^{p–1}. It then follows that
‖f + g‖p^p = ∫|f + g|^p dμ ≤ (‖f‖p + ‖g‖p) ‖f + g‖p^{p/q}
and since p – p/q = 1, ‖f + g‖p ≤ ‖f‖p + ‖g‖p as required. Equality holds if and only if equality holds a.e. in (6.3), and in both (6.4) as stated and with f, g interchanged. That is, if and only if fg ≥ 0 and (by Theorem 6.4.2)

f = 0 or f + g = 0 or |f + g|^p = c1 |f|^p, c1 > 0, and
g = 0 or f + g = 0 or |f + g|^p = c2 |g|^p, c2 > 0,
where each relationship is meant a.e. This is easily seen to be equivalent to f = 0 a.e. or g = 0 a.e. or f = cg a.e. for some c > 0. □
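As with Hölder's Inequality, a quick numerical check is possible; the particular functions, exponent, and midpoint Riemann sums on [0, 1] are assumptions of this sketch.

```python
# Minkowski's Inequality ||f + g||_p <= ||f||_p + ||g||_p on [0, 1] for p = 3,
# plus the equality case f = c*g (here g3 = 2f); midpoint Riemann sums.
def lp(h, p, grid=20000):
    return (sum(abs(h((k + 0.5) / grid)) ** p for k in range(grid)) / grid) ** (1 / p)

p = 3.0
f = lambda t: t
g = lambda t: (1.0 - t) ** 2      # not proportional to f: strict inequality
lhs = lp(lambda t: f(t) + g(t), p)
rhs = lp(f, p) + lp(g, p)

g3 = lambda t: 2.0 * t            # f = (1/2) g3: the equality case
lhs3 = lp(lambda t: f(t) + g3(t), p)
rhs3 = lp(f, p) + lp(g3, p)
print(lhs, rhs, lhs3, rhs3)
```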

When 0 < p < 1 a reverse Minkowski Inequality holds for nonnegative functions in Lp (see Ex. 6.18). However, the following inequality also holds.

Theorem 6.4.4 If 0 < p < 1 and f, g ∈ Lp then f + g ∈ Lp and
‖f + g‖p^p = ∫|f + g|^p dμ ≤ ∫|f|^p dμ + ∫|g|^p dμ = ‖f‖p^p + ‖g‖p^p
with equality if and only if fg = 0 a.e.

Proof Since 0 < p < 1 we have (1 + t)^p ≤ 1 + t^p for all t ≥ 0 with equality only when t = 0. (This is easily seen again from the equality at t = 0 and the fact that the derivative of the left side is strictly less than that of the right side for t > 0.) Putting t = a/b we thus have for a ≥ 0, b > 0,
(a + b)^p ≤ a^p + b^p. (6.5)
This inequality holds for a ≥ 0, b > 0, and thus also for a, b ≥ 0, with equality only when a = 0 or b = 0, i.e. ab = 0.
Now f + g ∈ Lp by Theorem 6.4.1. By (6.5), |f + g|^p ≤ (|f| + |g|)^p ≤ |f|^p + |g|^p a.e. and the result follows by integrating both sides (Theorem 4.4.4). Also equality holds if and only if |f + g|^p = |f|^p + |g|^p a.e., i.e. fg = 0 a.e., since there is equality in (6.5) only when ab = 0. □

It is next shown that ‖·‖p may be used to introduce a metric on each Lp, 0 < p ≤ ∞, provided we do not distinguish between two functions in Lp which are equal a.e. That is, equality of two elements f, g in Lp (written f = g) is taken to mean that f(x) = g(x) a.e. (More precisely Lp could be defined as the set of all equivalence classes of measurable functions f with |f|^p ∈ L1 under the equivalence relation f ∼ g if f = g a.e.) This metric turns out to be different for 0 < p < 1 and for 1 ≤ p ≤ ∞.

Theorem 6.4.5 (i) For 1 ≤ p ≤ ∞, Lp is a normed linear space with norm ‖f‖p and hence metric dp(f, g) = ‖f – g‖p.
(ii) For 0 < p < 1, Lp is a metric space with metric dp(f, g) = ‖f – g‖p^p.

Proof (i) Assume 1 ≤ p ≤ ∞ and f, g ∈ Lp. Then ‖f‖p ≥ 0 and ‖f‖p = 0 if and only if f = 0 a.e., and thus f = 0 as an element of Lp. Also for 1 ≤ p < ∞,
‖af‖p = (∫|af|^p dμ)^{1/p} = |a| ‖f‖p,
and quite clearly ‖af‖∞ = |a| ‖f‖∞. Finally by Minkowski’s Inequality, ‖f + g‖p ≤ ‖f‖p + ‖g‖p. Hence ‖f‖p is a norm on Lp, which thus is a normed linear space, proving (i).
(ii) Assume 0 < p < 1. As in (i) it is quite clear that dp(f, g) ≥ 0 with dp(f, g) = 0 if and only if f = g, and that dp(f, g) = dp(g, f). The last (triangle) property follows from Theorem 6.4.4:
dp(f, g) = ‖f – g‖p^p = ‖f – h + h – g‖p^p ≤ ‖f – h‖p^p + ‖h – g‖p^p = dp(f, h) + dp(h, g).

Hence Lp is a metric space with distance function dp, for 0 < p < 1. □

Thus each Lp, 0 < p ≤ ∞, is a metric space with distance function
dp(f, g) = ‖f – g‖p^p for 0 < p < 1, dp(f, g) = ‖f – g‖p for 1 ≤ p ≤ ∞.
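To see why the pth power is needed in dp when 0 < p < 1, note that ‖·‖p itself violates the triangle inequality. A check with indicators of the two halves of [0, 1] (Lebesgue measure assumed; the grid evaluation is an illustrative approximation of integrals that are elementary here):

```python
# p = 1/2 on [0, 1]: f, g indicators of [0, 1/2) and [1/2, 1].  Then
# ||f||_p = ||g||_p = (1/2)**2 = 1/4 but ||f + g||_p = 1 > 1/4 + 1/4, so
# ||.||_p is NOT a norm for p < 1; the p-th powers do satisfy the triangle
# inequality, as Theorem 6.4.4 guarantees.
def lp_pow(h, p, grid=10000):
    # integral of |h|**p over [0, 1] by a midpoint sum
    return sum(abs(h((k + 0.5) / grid)) ** p for k in range(grid)) / grid

p = 0.5
f = lambda t: 1.0 if t < 0.5 else 0.0
g = lambda t: 0.0 if t < 0.5 else 1.0
norm_f = lp_pow(f, p) ** (1 / p)
norm_g = lp_pow(g, p) ** (1 / p)
norm_sum = lp_pow(lambda t: f(t) + g(t), p) ** (1 / p)
print(norm_sum, norm_f + norm_g)
```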

From now on all properties of each Lp as a metric space will be meant with respect to this distance function dp. For instance fn → f in Lp will mean that dp(fn, f) → 0, or equivalently ‖fn – f‖p → 0, and thus for 0 < p < ∞, ∫|fn – f|^p dμ → 0, and for p = ∞, ess sup |fn – f| → 0. The next result shows that convergence in Lp implies convergence in measure as well as convergence of the integrals of the pth absolute powers.

Theorem 6.4.6 Let 0 < p ≤ ∞ and fn, f be elements of Lp.

(i) If { fn} is Cauchy in Lp, then it is Cauchy in measure if p < ∞, and for p = ∞ uniformly Cauchy a.e. (hence also Cauchy a.u. and in measure).

(ii) If fn → f in Lp, then fn → f in measure if p < ∞, and for p = ∞ uniformly a.e. (hence also a.u. and in measure), and ‖fn‖p → ‖f‖p. Thus for 0 < p < ∞, ∫|fn|^p dμ → ∫|f|^p dμ.

Proof (ii) Assume that fn → f in Lp. Since the zero function belongs to Lp, Lemma 6.3.1 shows that dp(fn, 0) → dp(f, 0), where dp is defined in the discussion preceding the theorem. It follows, for all 0 < p ≤ ∞, that ‖fn‖p → ‖f‖p.
We now show that fn → f in measure when 0 < p < ∞. Since fn, f ∈ Lp, each fn and f are defined and finite a.e. For every ε > 0 write E_n(ε) = {x : |fn(x) – f(x)| ≥ ε}. Then
|fn – f|^p ≥ |fn – f|^p χ_{E_n(ε)} ≥ ε^p χ_{E_n(ε)} a.e.
Thus ‖fn – f‖p^p ≥ ε^p μ{E_n(ε)}, showing that μ{E_n(ε)} → 0 since ‖fn – f‖p → 0. Hence fn → f in measure as required. For p = ∞, it follows from the facts that |fn(x) – f(x)| ≤ ‖fn – f‖∞ a.e. and ‖fn – f‖∞ → 0 that fn → f uniformly a.e. (i) is shown similarly. □
The next theorem is the main result of this section, showing that each Lp, 0 < p ≤ ∞, is complete as a metric space, i.e. whenever {fn} is a Cauchy sequence in Lp, there exists f ∈ Lp such that fn → f in Lp. For 1 ≤ p ≤ ∞ this means that Lp is a Banach space. As before we put f = g if f = g a.e.

Theorem 6.4.7 (i) For 1 ≤ p ≤ ∞, Lp is a Banach space with norm ‖f‖p.
(ii) For 0 < p < 1, Lp is a complete metric space with metric dp(f, g) = ‖f – g‖p^p.

Proof Since by Theorem 6.4.5 each Lp, 0 < p ≤ ∞, is a metric space with metric dp (defined as in (i) or (ii)), it suffices to show that it is complete, i.e. that each Cauchy sequence in Lp converges to an element of Lp.
First assume that 0 < p < ∞ and let {fn} be a Cauchy sequence in Lp. By Theorem 6.4.6 (i), {fn} is Cauchy in measure and by Theorem 6.2.3 (ii) there is a measurable f (defined a.e.) such that fn → f in measure. By the corollary to Theorem 6.2.3, there is a subsequence {f_{n_k}} converging to f a.e. Hence for all k,
‖f_{n_k} – f‖p^p = ∫|f_{n_k} – f|^p dμ = ∫ lim_j |f_{n_k} – f_{n_j}|^p dμ
≤ lim inf_j ∫|f_{n_k} – f_{n_j}|^p dμ (Fatou’s Lemma)
= lim inf_j ‖f_{n_k} – f_{n_j}‖p^p
and thus for all p > 0,

dp(f_{n_k}, f) ≤ lim inf_j dp(f_{n_k}, f_{n_j}).

But since {fn} is Cauchy in Lp, given ε > 0, there exists N = N(ε) such that dp(fn, fm) < ε/2 when n, m ≥ N. Thus if n_k, n_j ≥ N it follows that dp(f_{n_k}, f_{n_j}) < ε/2 and hence lim inf_j dp(f_{n_k}, f_{n_j}) ≤ ε/2, so that dp(f_{n_k}, f) ≤ ε/2 for n_k ≥ N. In particular this implies that ‖f_{n_k} – f‖p < ∞ and thus (f_{n_k} – f) ∈ Lp, and also f = (f – f_{n_k}) + f_{n_k} ∈ Lp, since Lp is a linear space (Theorem 6.4.1). Furthermore for all k ≥ N (requiring n_k to be strictly increasing so that n_k ≥ k ≥ N)
dp(f_k, f) ≤ dp(f_k, f_{n_k}) + dp(f_{n_k}, f) < ε
from which it follows that dp(f_k, f) → 0, giving f_k → f in Lp.
Now let p = ∞ and let {fn} be a Cauchy sequence in L∞. By combining a countable number of zero measure sets, a set E ∈ S with μ(E^c) = 0 can be found such that for all x ∈ E and all n, m

|fn(x) – fm(x)| ≤ ‖fn – fm‖∞.

Since ‖fn – fm‖∞ → 0 as n, m → ∞, {fn} is uniformly Cauchy on E. Hence there is a function f defined on E such that fn → f uniformly on E. By Theorem 3.4.7, f is measurable and thus may be extended to a measurable function defined on the entire space X by putting f(x) = 0 for x ∈ E^c. Since fn → f uniformly on E, sup_{x∈E} |fn(x) – f(x)| → 0. Hence given ε > 0, there exists N = N(ε) such that sup_{x∈E} |fn(x) – f(x)| < ε when n ≥ N. Then

|f(x)| ≤ |f(x) – fn(x)| + |fn(x)|, x ∈ E, implies that for n ≥ N,

$$\sup_{x \in E} |f(x)| \le \sup_{x \in E} |f(x) - f_n(x)| + \sup_{x \in E} |f_n(x)| < \varepsilon + \|f_n\|_\infty.$$
Since $\mu(E^c) = 0$, it follows that $f \in L_\infty$. Also for $n \ge N$ we have $|f_n - f| < \varepsilon$ a.e., which implies $\|f_n - f\|_\infty \le \varepsilon$. Hence $\|f_n - f\|_\infty \to 0$ and thus $f_n \to f$ in $L_\infty$. $\square$

The final result of this section shows that the spaces $L_p$, $0 < p \le \infty$, are ordered by inclusion when the underlying measure space is finite, a result especially important in probability theory.

Theorem 6.4.8 If $(X, S, \mu)$ is a finite measure space ($\mu(X) < \infty$) and $0 < q \le p \le \infty$ then $L_p \subset L_q$ and for $f \in L_p$:

$$\|f\|_q \le \|f\|_p \, \{\mu(X)\}^{\frac{1}{q} - \frac{1}{p}}.$$

Proof Assume first that $p = \infty$ and $f \in L_\infty$. Then $|f(x)| \le \|f\|_\infty$ a.e. and thus
$$\int |f(x)|^q \, d\mu(x) \le \|f\|_\infty^q \, \mu(X) < \infty$$

which implies that $f \in L_q$ and $\|f\|_q \le \|f\|_\infty \{\mu(X)\}^{1/q}$, as required.

Now assume that $0 < q < p < \infty$ and let $f \in L_p$. Put $r = p/q \ge 1$. Then $\int (|f|^q)^r \, d\mu = \int |f|^p \, d\mu < \infty$ implies that $|f|^q \in L_r$. Define $r'$ by $1/r + 1/r' = 1$. Since $\mu(X) < \infty$, the constant function $1 \in L_{r'}$ and by Hölder's Inequality $|f|^q \cdot 1 \in L_1$. Hence $f \in L_q$. Again by Hölder's Inequality,
$$\|f\|_q^q = \int |f|^q \, d\mu \le \Bigl(\int (|f|^q)^r \, d\mu\Bigr)^{1/r} \Bigl(\int 1^{r'} \, d\mu\Bigr)^{1/r'} = \Bigl(\int |f|^p \, d\mu\Bigr)^{q/p} \{\mu(X)\}^{1 - \frac{q}{p}} = \|f\|_p^q \, \{\mu(X)\}^{1 - \frac{q}{p}}$$
and the desired inequality follows by taking $q$th roots. $\square$

Corollary If $(X, S, \mu)$ is a finite measure space and $0 < q < p \le \infty$, convergence in $L_p$ implies convergence in $L_q$.
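As a sanity check (not part of the text), the inequality of Theorem 6.4.8 can be verified numerically on a discrete finite measure space, where the integral reduces to a weighted sum. All point masses and function values below are illustrative:

```python
# Toy finite measure space: X = {0,1,2,3} with point masses mu_i,
# and a function f given by its values f_i (all values illustrative).
mu = [0.5, 1.0, 0.25, 2.0]          # mu(X) = 3.75 < infinity
f  = [3.0, -1.5, 4.0, 0.5]

def norm_p(f, mu, p):
    """||f||_p = (sum |f_i|^p mu_i)^(1/p) on a discrete measure space."""
    return sum(abs(v) ** p * m for v, m in zip(f, mu)) ** (1.0 / p)

mu_X = sum(mu)
p, q = 4.0, 2.0                      # 0 < q <= p < infinity
lhs = norm_p(f, mu, q)
rhs = norm_p(f, mu, p) * mu_X ** (1.0 / q - 1.0 / p)
print(lhs, rhs)                      # lhs <= rhs, as Theorem 6.4.8 asserts
```

The factor $\{\mu(X)\}^{1/q - 1/p}$ is what fails when $\mu(X) = \infty$, which is why the inclusion $L_p \subset L_q$ breaks down on infinite measure spaces.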

6.5 Modes of convergence – a summary

This chapter has concerned a variety of convergence modes including convergence (pointwise) a.e., almost uniform, in measure, and in $L_p$. The diagram below indicates some of the important relationships between these forms of convergence (which have been shown to hold in this chapter). The arrows indicate that one form of convergence implies another. The word "finite" indicates that the corresponding implication holds when $\mu$ is finite, but not in general. The word "subsequence" indicates that one mode of convergence for $\{f_n\}$ implies another for some subsequence $\{f_{n_k}\}$. Examples showing that no further relationships hold in general are given in the exercises (Exs. 6.2, 6.7 and 6.11).
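One of the gaps in the diagram — convergence in measure without convergence at even a single point — can be seen concretely in the "typewriter" sequence of Ex. 6.7 below. A minimal sketch in Python; the enumeration of the intervals is the standard one but the sampled point is illustrative:

```python
# "Typewriter" sequence: the k-th function is the indicator of
# [(i-1)/n, i/n], with (n,i) running through (1,1), (2,1), (2,2), (3,1), ...
def block(k):
    """Map k = 1, 2, ... to the pair (n, i) in that enumeration."""
    n = 1
    while k > n:
        k -= n
        n += 1
    return n, k

def chi(k, x):
    n, i = block(k)
    return 1.0 if (i - 1) / n <= x <= i / n else 0.0

# Lebesgue measure of {x : chi_k != 0} is 1/n_k -> 0: convergence in measure.
measures = [1 / block(k)[0] for k in range(1, 100)]
# Yet at any fixed point, say x = 0.3, the values never settle down:
vals = [chi(k, 0.3) for k in range(1, 100)]
print(measures[-1], sorted(set(vals)))
```

Every fixed $x$ is covered by one interval of each level $n$, so the sequence of values contains infinitely many 1's and infinitely many 0's.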

Exercises

6.1 Consider the unit interval with Lebesgue measure. Let

$$f_n(x) = \begin{cases} 1, & 0 \le x \le 1/n \\ 0, & 1/n < x \le 1 \end{cases} \qquad \text{and} \qquad f(x) = 0, \quad 0 \le x \le 1.$$

Does $\{f_n\}$ converge to $f$ (a) for all $x$? (b) a.e.? (c) uniformly on [0,1]? (d) uniformly a.e. on [0,1]? (e) almost uniformly? (f) in measure? (g) in $L_p$?

6.2 Let $X = \{1, 2, 3, \ldots\}$, $S$ = all subsets of $X$, and let $\mu$ be counting measure on $X$. Define $f_n(x) = \chi_{\{1,2,\ldots,n\}}(x)$. Does $f_n$ converge (a) pointwise? (b) almost uniformly? (c) in measure? Comment concerning Theorem 6.1.1 and the corollary to Theorem 6.2.2.

6.3 Let $\{f_n\}$ be a Cauchy sequence a.e. on $(X, S, \mu)$ and $E \in S$ with $0 < \mu(E) < \infty$. Show that there exists a real number $C$ and a measurable set $F \subset E$ such that $\mu(F) > 0$ and $|f_n(x)| \le C$ for all $x \in F$, $n = 1, 2, \ldots$. (Show in fact that given any $\varepsilon > 0$, $F \subset E$ may be chosen so that $\mu(E - F) < \varepsilon$.)

6.4 Let $\{f_n\}$, $\{g_n\}$ be a.e. finite measurable functions on $(X, S, \mu)$. If $f_n \to f$ in measure and $g_n \to g$ in measure, show that

(i) $af_n \to af$ in measure, for any real $a$,
(ii) $f_n + g_n \to f + g$ in measure, and hence
(iii) $af_n + bg_n \to af + bg$ in measure for any real $a, b$.

6.5 If $f_n \to f$ in measure, show that $|f_n| \to |f|$ in measure.

6.6 Let $(X, S, \mu)$ be a finite measure space. Let $\{f_n\}$, $f$, $\{g_n\}$, $g$ ($n = 1, 2, \ldots$) be a.e. finite measurable functions on $X$.

(i) Show that given any $\varepsilon > 0$ there exists $E \in S$, $\mu(E^c) < \varepsilon$, and a constant $C$ such that $|g(x)| \le C$ for all $x \in E$.
(ii) If $f_n \to 0$ in measure, show that $f_n^2 \to 0$ in measure.
(iii) If $f_n \to f$ in measure, show that $f_n g \to fg$ in measure (use (i)).
(iv) If $f_n \to f$ in measure, show that $f_n^2 \to f^2$ in measure (apply (ii) to $f_n - f$ and use (iii) with $g = f$).
(v) If $f_n \to f$ in measure, $g_n \to g$ in measure, show that $f_n g_n \to fg$ in measure ($f_n g_n = \frac{1}{4}\{(f_n + g_n)^2 - (f_n - g_n)^2\}$ a.e.).

6.7 Let $(X, S, \mu)$ be the unit interval [0, 1] with the Borel sets and Lebesgue measure. For $n = 1, 2, \ldots$ let

$$E_n^i = [(i-1)/n, \; i/n], \quad i = 1, \ldots, n,$$
with indicator functions $\chi_n^i$. Show that the sequence $\{\chi_1^1, \chi_2^1, \chi_2^2, \chi_3^1, \chi_3^2, \chi_3^3, \ldots\}$ converges in measure to zero but does not converge at any point of $X$.

6.8 Let $\{f_n\}$ be a sequence of measurable functions on $(X, S, \mu)$ which is Cauchy in measure. Suppose $\{f_{n_k}\}$, $\{f_{m_k}\}$ are two subsequences converging a.e. to $f$, $g$ respectively. Show that $f = g$ a.e.

6.9 Let $(X, S, \mu)$ be a finite measure space and $F$ a field generating $S$. If $f$ is an $S$-measurable function defined and finite a.e., show that given any $\varepsilon, \delta > 0$ there is a simple $F$-measurable function $g$ (i.e. $g = \sum_{i=1}^n a_i \chi_{E_i}$ where $E_i \in F$) such that $\mu\{x : |f(x) - g(x)| > \varepsilon\} < \delta$.

Hence every $S$-measurable finite a.e. function can be approximated "in measure" by a simple $F$-measurable function. (Hint: Use Theorem 3.5.2 and its corollary and Theorem 2.6.2.) The result remains true if $f$ is measurable with respect to the $\sigma$-field obtained by completing the measure $\mu$.

6.10 Let $(X, S, \mu)$ be a finite measure space and $L$ the set of all measurable functions defined and finite a.e. on $X$. For any $f, g \in L$ define
$$d(f, g) = \int_X \frac{|f - g|}{1 + |f - g|} \, d\mu.$$
Show that $(L, d)$ is a metric space (identifying $f$ and $g$ if $f = g$ a.e.). Prove that convergence with respect to $d$ is equivalent to convergence in measure. Is $(L, d)$ complete?

6.11 Give an example of a sequence converging in measure but not in $L_p$, for an arbitrary but fixed $0 < p \le \infty$. (Hint: Modify appropriately $f_n$ of Ex. 6.1.)

6.12 Let $\{f_n\}$ and $f$ be in $L_p$, $0 < p < \infty$. If $f_n \to f$ a.e. and $\|f_n\|_p \to \|f\|_p$, then show that $f_n \to f$ in $L_p$. (Hint: Apply Fatou's Lemma to $\{2^p(|f_n|^p + |f|^p) - |f_n - f|^p\}$.) In Chapter 11 (Theorem 11.4.2) it is shown that a.e. convergence may be replaced by convergence in measure, when the measure space is finite.

6.13 Let $p \ge 1$, $\frac{1}{p} + \frac{1}{q} = 1$, and $f_n, f \in L_p$ and $g_n, g \in L_q$, $n = 1, 2, \ldots$. If $f_n \to f$ in $L_p$ and $g_n \to g$ in $L_q$, show that $f_n g_n \to fg$ in $L_1$.

6.14 If $0 < p < r < q < \infty$ show that $L_p \cap L_q \subset L_r$ and that if $f \in L_p \cap L_q$ then

$$\|f\|_r \le \max\{\|f\|_p, \|f\|_q\}.$$

6.15 Suppose $p > 1$, $q > 1$, $r > 1$, $\frac{1}{p} + \frac{1}{q} + \frac{1}{r} = 1$, and let $f \in L_p$, $g \in L_q$, $h \in L_r$. Show that $fgh \in L_1$ and $\|fgh\|_1 \le \|f\|_p \|g\|_q \|h\|_r$. (Show $fg \in L_s$, i.e. $|f|^s |g|^s \in L_1$, where $1/s = 1 - 1/r$.) The Hölder Inequality may thus be generalized to apply to the product of $n > 2$ functions.

6.16 Let $(X, S, \mu)$ be the unit interval (0, 1) with the Borel sets and Lebesgue measure and let $f(x) = x^{-a}$, $a > 0$. Show that $f \in L_p$ for all $0 < p < p_0$, and $f \notin L_p$ for all $p \ge p_0$, and find $p_0$ in terms of $a$.

6.17 If $(X, S, \mu)$ is a finite measure space, show that for all $f \in L_\infty$

$$\lim_{p \to \infty} \|f\|_p = \|f\|_\infty.$$

(Hint: Use the fact that for $a > 0$, $\lim_{p \to \infty} a^{1/p} = 1$ to show that for each $\varepsilon > 0$

$$(1 - \varepsilon)\|f\|_\infty \le \liminf_{p \to \infty} \|f\|_p \le \limsup_{p \to \infty} \|f\|_p \le \|f\|_\infty.)$$
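The limit in Ex. 6.17 is easy to watch numerically on a discrete finite measure space, where $\|f\|_\infty$ is simply the largest $|f_i|$; the masses and values below are illustrative:

```python
# Ex. 6.17 numerically: on a finite measure space, ||f||_p -> ||f||_inf.
mu = [0.2, 0.5, 1.3, 0.8]            # point masses (illustrative)
f  = [1.0, -3.0, 2.5, 0.7]

def norm_p(f, mu, p):
    return sum(abs(v) ** p * m for v, m in zip(f, mu)) ** (1.0 / p)

sup_norm = max(abs(v) for v in f)    # ||f||_inf on a discrete space
norms = [norm_p(f, mu, p) for p in (1, 2, 4, 8, 16, 32, 64)]
print(norms, sup_norm)               # norms approach sup_norm = 3.0
```

Note that the norms need not approach $\|f\|_\infty$ monotonically from below when $\mu(X) > 1$; only the limit is guaranteed.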

6.18 Let $(X, S, \mu)$ be a measure space and $0 < p < 1$.

(i) If $f \in L_p$ and $g \in L_q$ where $\frac{1}{p} + \frac{1}{q} = 1$ (hence $q < 0$), show that

$$\|fg\|_1 \ge \|f\|_p \|g\|_q$$
provided $\int |g|^q \, d\mu > 0$. (Notice that $fg$ may not belong to $L_1$.) (Hint: Let $r = \frac{1}{p} > 1$, $\frac{1}{r} + \frac{1}{r'} = 1$, $\varphi = |fg|^p$, $\psi = |g|^{-p}$, and use Hölder's Inequality for $\varphi$ and $\psi$ with $r$ and $r'$.)

(ii) If f , g ∈ Lp and fg ≥ 0 a.e. show that

$$\|f + g\|_p \ge \|f\|_p + \|g\|_p.$$

(Hint: Proceed as in the proof of Minkowski's Inequality and use (i).)

(iii) If $X$ contains two disjoint measurable sets each having finite positive measure, show that $\|f\|_p$ is not a norm by constructing two functions $f, g \in L_p$ such that $\|f + g\|_p > \|f\|_p + \|g\|_p$. (Hint: If $E, F$ are the two disjoint sets take $f = a\chi_E$, $g = b\chi_F$, and determine $a, b$ using $(1 + t)^p < 1 + t^p$ for $t > 0$.)

(iv) If the assumption of (iii) is not satisfied, determine all elements of $L_p$ and show that it is a Banach space with norm $\|f\|_p$, but a trivial one. In fact this is true for all $0 < p < \infty$. (Hint: If there are no sets of finite positive measure, show that $L_p = \{0\}$, i.e. $L_p$ consists of only the zero function. If there is a measurable set $E$ of finite positive measure, show that $L_p$ consists of all multiples of the indicator function of $E$.)

6.19 Let $0 < p < \infty$ and let $\ell_p$ be the set of all real sequences $\{a_n\}_{n=1}^\infty$ such that $\sum_{n=1}^\infty |a_n|^p < \infty$. Let also $\ell_\infty$ be the set of all bounded real sequences $\{a_n\}_{n=1}^\infty$, i.e. $|a_n| \le M$ for all $n$ and some $0 < M < \infty$.

(i) Show that $\ell_p = L_p(X, S, \mu)$, $0 < p \le \infty$, where $X$ is the set of positive integers, $S$ the class of all subsets of $X$, and $\mu$ is counting measure on $S$.

(ii) Show that $\ell_p$, $1 \le p \le \infty$, is a Banach space, and write down its norm; show that $\ell_p$, $0 < p < 1$, is a complete metric space, and write down its distance function; show that if $1 < p < \infty$, $\frac{1}{p} + \frac{1}{q} = 1$, and $\{a_n\}_{n=1}^\infty \in \ell_p$, $\{b_n\}_{n=1}^\infty \in \ell_q$, then $\{a_n b_n\}_{n=1}^\infty \in \ell_1$ and
$$\Bigl|\sum_{n=1}^\infty a_n b_n\Bigr| \le \sum_{n=1}^\infty |a_n b_n| \le \Bigl(\sum_{n=1}^\infty |a_n|^p\Bigr)^{1/p} \Bigl(\sum_{m=1}^\infty |b_m|^q\Bigr)^{1/q};$$
and that if $1 \le p < \infty$ and $\{a_n\}_{n=1}^\infty, \{b_n\}_{n=1}^\infty \in \ell_p$ then
$$\Bigl(\sum_{n=1}^\infty |a_n + b_n|^p\Bigr)^{1/p} \le \Bigl(\sum_{n=1}^\infty |a_n|^p\Bigr)^{1/p} + \Bigl(\sum_{n=1}^\infty |b_n|^p\Bigr)^{1/p}.$$

(iii) If $0 < p < q < \infty$ show that $\ell_p \subset \ell_q \subset \ell_\infty$.

6.20 Let $(X, S, \mu)$ be a measure space and $S$ the class of all simple functions $\varphi$ on $X$ such that $\mu\{x \in X : \varphi(x) \ne 0\} < +\infty$. If $0 < p < +\infty$ then prove that $S$ is dense in $L_p$.
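The sequence-space Hölder and Minkowski inequalities of Ex. 6.19 (ii) can be checked mechanically on finitely supported sequences (all further terms zero); the particular values and exponents below are illustrative:

```python
# Hoelder and Minkowski in l_p, on finitely supported sequences.
a = [1.0, -2.0, 0.5, 3.0]
b = [0.3, 1.5, -1.0, 0.25]
p, q = 3.0, 1.5                      # conjugate: 1/3 + 1/1.5 = 1

lp = lambda s, r: sum(abs(x) ** r for x in s) ** (1.0 / r)

holder_lhs = sum(abs(x * y) for x, y in zip(a, b))
holder_rhs = lp(a, p) * lp(b, q)

mink_lhs = lp([x + y for x, y in zip(a, b)], p)
mink_rhs = lp(a, p) + lp(b, p)
print(holder_lhs <= holder_rhs, mink_lhs <= mink_rhs)
```

By Ex. 6.19 (i) these are exactly the $L_p$ inequalities for counting measure on the positive integers.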

6.21 Let $(X, S, \mu)$ be the real line with the Borel sets and Lebesgue measure. Then show that for $0 < p < +\infty$:

(i) $L_p = L_p(X, S, \mu)$ is separable,
(ii) the set of all continuous functions that vanish outside a bounded closed interval is dense in $L_p$.

(Hints: (i) Use Ex. 6.20 and the approximation of every measurable set of finite Lebesgue measure by a finite union of intervals, and of an interval by an interval with rational end points (the class of all intervals with rational end points is countable). (ii) Use Ex. 6.20, part (c) of Ex. 3.12, and a natural approximation of a step function by a continuous function.)

6.22 Let $(X, S, \mu)$ be the real line with the Borel sets and Lebesgue measure. If $f$ is a function on $X$ and $t \in X$, define the translate $f_t$ of $f$ by $t$ as the function given by $f_t(x) = f(x - t)$. Let $1 \le p < \infty$ and $f \in L_p$.

(i) Show that for all $t \in X$, $f_t \in L_p$ and $\|f_t\|_p = \|f\|_p$.
(ii) Show that if $t \to s$ in $X$, then $f_t \to f_s$ uniformly in $L_p$, i.e. given any $\varepsilon > 0$ there exists $\delta > 0$ such that $\|f_t - f_s\|_p < \varepsilon$ whenever $|t - s| < \delta$. In particular $f_t \to f$ in $L_p$ and
$$\lim_{t \to 0} \int_{-\infty}^{\infty} |f(x - t) - f(x)|^p \, dx = 0.$$
(Hint: Prove this first for a continuous function which vanishes outside a bounded closed interval and then use Ex. 6.21 (ii).)

6.23 Let $(X, S, \mu)$ be the unit interval [0, 1] with the Borel sets and Lebesgue measure, let $g \in L_p$, $1 \le p \le +\infty$, and define $f$ on [0, 1] by
$$f(x) = \int_0^x g(u) \, du \quad \text{for all } x \in [0, 1].$$
(i) Show that $f$ is uniformly continuous on [0, 1].
(ii) Show that for $1 < p < +\infty$
$$\sup \sum_{n=1}^N \frac{|f(y_n) - f(x_n)|^p}{(y_n - x_n)^{p-1}} \le \|g\|_p^p < \infty$$
where the supremum is taken over all positive integers $N$ and all nonoverlapping intervals $\{(x_n, y_n)\}_{n=1}^N$ in [0, 1].

6.24 Let $(X, S)$ be a measurable space and $\mu_1, \mu_2$ two probability measures on $S$. If $\lambda$ is a measure on $S$ such that $\mu_1 \ll \lambda$ and $\mu_2 \ll \lambda$ (for example $\mu_1 + \mu_2$ is such a measure) and if $f_i$ is the Radon–Nikodym derivative of $\mu_i$ with respect to $\lambda$, $i = 1, 2$, define
$$h_\lambda(\mu_1, \mu_2) = \int (f_1 f_2)^{1/2} \, d\lambda.$$

(i) Prove that $h_\lambda$ does not depend on the measure $\lambda$ used in its definition, and thus we write $h(\mu_1, \mu_2)$ for $h_\lambda(\mu_1, \mu_2)$. (Hint: If $\lambda'$ is another measure on $S$ such that $\mu_1 \ll \lambda'$ and $\mu_2 \ll \lambda'$, put $\nu = \lambda + \lambda'$ and show that $h_\lambda(\mu_1, \mu_2) = h_\nu(\mu_1, \mu_2) = h_{\lambda'}(\mu_1, \mu_2)$.)

(ii) Show that

$$0 \le h(\mu_1, \mu_2) \le 1$$

and that in particular $h(\mu_1, \mu_2) = 0$ if and only if $\mu_1 \perp \mu_2$, and $h(\mu_1, \mu_2) = 1$ if and only if $\mu_1 = \mu_2$.

(iii) Here take $X$ to be the real line, $S$ the Borel sets, and $\mu$ the measure on $S$ which is absolutely continuous with respect to Lebesgue measure on $S$ with Radon–Nikodym derivative $\frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}$. For every $a \in X$ let $T_a$ be the transformation from $(X, S, \mu)$ to $(X, S)$ defined by $T_a(x) = x - a$ for all $x \in X$, and let $\mu_a = \mu T_a^{-1}$. Find $h(\mu, \mu_a)$ as a function of $a$, and use this expression to conclude that for mutually absolutely continuous probability measures $\mu_1$ and $\mu_2$ ($\mu_1 \sim \mu_2$), $h(\mu_1, \mu_2)$ can take any value in the interval (0, 1].
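Without giving away the closed form asked for in Ex. 6.24 (iii), the behaviour of $h(\mu, \mu_a)$ can be explored numerically: both densities are explicit, so $h$ is a one-dimensional integral. A simple left Riemann sum over a truncated range (the cut-offs and step count are illustrative) suffices:

```python
import math

# h(mu, mu_a) = integral of sqrt(phi(x) * phi(x - a)) dx, phi = N(0,1) density.
def phi(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def h(a, lo=-20.0, hi=20.0, n=40000):
    """Riemann-sum approximation of the affinity; bounds/step are illustrative."""
    dx = (hi - lo) / n
    return sum(math.sqrt(phi(lo + k * dx) * phi(lo + k * dx - a)) * dx
               for k in range(n))

vals = [h(a) for a in (0.0, 1.0, 2.0, 4.0)]
print(vals)   # h(0) = 1, and h decreases toward 0 as the translate moves away
```

This is consistent with part (ii): $h = 1$ exactly when $\mu_a = \mu$ (i.e. $a = 0$), and $h$ stays strictly positive for every finite $a$ since the two Gaussian measures are mutually absolutely continuous.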

Product spaces

7.1 Measurability in Cartesian products

Up to this point, our attention has focussed on just one fixed space $X$. Consider now two (later more than two) such spaces $X$, $Y$, and their Cartesian product $X \times Y$, defined to be the set of all ordered pairs $(x, y)$ with $x \in X$, $y \in Y$. The most familiar example is, of course, the Euclidean plane, where $X$ and $Y$ are both (copies of) the real line $\mathbb{R}$.

Our main interest will be in defining a natural measure-theoretic structure in $X \times Y$ (i.e. a $\sigma$-field and a measure) in the case where both $X$ and $Y$ are measure spaces. However, for slightly more generality it is useful to first consider $\sigma$-rings $S$, $T$ in $X$, $Y$, respectively, and define a natural "product" $\sigma$-ring in $X \times Y$. First, a rectangle in $X \times Y$ (with sides $A \subset X$, $B \subset Y$) is defined to be a set of the form
$$A \times B = \{(x, y) : x \in A, y \in B\}.$$
Rectangles may be regarded as the simplest subsets of $X \times Y$ and have the following property.

Lemma 7.1.1 If $S$, $T$ are semirings in $X$, $Y$ respectively, then the class $P$ of all rectangles $A \times B$ such that $A \in S$, $B \in T$, is a semiring in $X \times Y$.

Proof P is clearly nonempty. If Ei ∈P, i = 1, 2, then Ei = Ai × Bi where Ai ∈S, Bi ∈T. It is easy to verify that

E1 ∩ E2 =(A1 ∩ A2) × (B1 ∩ B2) and hence E1 ∩ E2 ∈Psince A1 ∩ A2 ∈S, B1 ∩ B2 ∈T. It is also easily checked (draw a picture!) that

$$E_1 - E_2 = [(A_1 \cap A_2) \times (B_1 - B_2)] \cup [(A_1 - A_2) \times B_1].$$
The two sets forming the union on the right are clearly finite disjoint unions of sets of $P$, and are disjoint since $(A_1 - A_2)$ is disjoint from $A_1 \cap A_2$. Thus $E_1 - E_2$ is expressed as a finite disjoint union of sets of $P$. Hence $P$ is a semiring. $\square$
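The difference identity in the proof is easy to misremember; it can be checked mechanically on small finite sets (the particular sets below are illustrative), with Python's set operations standing in for $\cap$, $-$ and $\cup$:

```python
# E1 - E2 = [(A1 & A2) x (B1 - B2)]  U  [(A1 - A2) x B1], checked on finite sets.
A1, B1 = {1, 2, 3}, {10, 20}
A2, B2 = {2, 3, 4}, {20, 30}

rect = lambda A, B: {(x, y) for x in A for y in B}

lhs = rect(A1, B1) - rect(A2, B2)
rhs = rect(A1 & A2, B1 - B2) | rect(A1 - A2, B1)
print(lhs == rhs)
# The two pieces of the union are disjoint, as the proof asserts:
print(rect(A1 & A2, B1 - B2).isdisjoint(rect(A1 - A2, B1)))
```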


If $S$, $T$ are $\sigma$-rings, the $\sigma$-ring in $X \times Y$ generated by this semiring $P$ is called the product $\sigma$-ring of $S$ and $T$, and is denoted by $S \times T$. It is clear that if $S$ and $T$ are both $\sigma$-fields, so is $S \times T$, which is also then called the product $\sigma$-field of $S$ and $T$. Thus if $(X, S)$ and $(Y, T)$ are measurable spaces then so is $(X \times Y, S \times T)$. The sets of $P$ may be called measurable rectangles (cf. Ex. 7.1).

An important notion is that of sections of sets in the product space. If $E \subset X \times Y$ is a subset of $X \times Y$, then for each $x \in X$ and $y \in Y$, the sets $E_x \subset Y$ and $E^y \subset X$ defined by

$$E_x = \{y : (x, y) \in E\} \quad \text{and} \quad E^y = \{x : (x, y) \in E\}$$
are called the $x$-section of $E$ and the $y$-section of $E$, respectively. Note that if $A \subset X$ and $B \subset Y$, $(A \times B)_x = B$ or $\emptyset$ according as $x \in A$ or $x \in A^c$, and $(A \times B)^y = A$ or $\emptyset$ according as $y \in B$ or $y \in B^c$.

It is convenient to introduce (for each fixed $x \in X$) the transformation $T_x$ from $Y$ into $X \times Y$ defined by $T_x y = (x, y)$, and for each fixed $y \in Y$ the transformation $T^y$ from $X$ into $X \times Y$ defined by $T^y x = (x, y)$. Then if $E \subset X \times Y$ its sections are simply given by $E_x = T_x^{-1}E$ and $E^y = (T^y)^{-1}E$.

Lemma 7.1.2 If $E, F$ are subsets of $X \times Y$ and $x \in X$, then $(E - F)_x = E_x - F_x$. If $E_i$ are subsets of $X \times Y$ for $i = 1, 2, \ldots$, and $x \in X$, then
$$\Bigl(\bigcup_1^\infty E_i\Bigr)_x = \bigcup_1^\infty (E_i)_x, \qquad \Bigl(\bigcap_1^\infty E_i\Bigr)_x = \bigcap_1^\infty (E_i)_x.$$
Corresponding conclusions hold for $y$-sections.

Proof These are easily shown directly, or follow immediately using the transformation $T_x$, e.g. (using Lemma 3.2.1)
$$(E - F)_x = T_x^{-1}(E - F) = T_x^{-1}E - T_x^{-1}F = E_x - F_x. \qquad \square$$

It also follows easily in the next result that $T_x$, $T^y$ are measurable, and that sections of measurable sets are measurable:

Theorem 7.1.3 If $(X, S)$, $(Y, T)$ are measurable spaces then the transformations $T_x$ and $T^y$ are measurable transformations from $(Y, T)$ and $(X, S)$ respectively into $(X \times Y, S \times T)$. Thus $E_x \in T$ and $E^y \in S$ for every $E \in S \times T$, $x \in X$, $y \in Y$.

Proof For each $x \in X$, $A \in S$, $B \in T$, $T_x^{-1}(A \times B) = (A \times B)_x = B$ or $\emptyset \in T$, and it follows that $T_x^{-1}E \in T$ for each $E$ in the semiring $P$ of rectangles $A \times B$ with $A \in S$, $B \in T$. Since $S(P) = S \times T$, the measurability of $T_x$ follows from Theorem 3.3.2. Measurability of $T^y$ follows similarly. $\square$

It also follows that measurable functions on the product space have measurable "sections", just as measurable sets on the product space do. Let $f(x, y)$ be a function defined on a subset $E$ of $X \times Y$. For each $x \in X$, the $x$-section of $f$ is the function $f_x$ defined on $E_x \subset Y$ by $f_x(y) = f(T_x y) = f(x, y)$, $y \in E_x$; i.e. $f_x$ is the function on a subset of $Y$ resulting from holding $x$ fixed in $f(x, y)$. Similarly for each $y \in Y$, the $y$-section of $f$ is the function $f^y$ defined on $E^y \subset X$ by $f^y(x) = f(T^y x) = f(x, y)$, $x \in E^y$.

Theorem 7.1.4 Let $(X, S)$ and $(Y, T)$ be measurable spaces and let $f$ be an $S \times T$-measurable function defined on a subset of $X \times Y$. Then every $x$-section $f_x$ is $T$-measurable and every $y$-section $f^y$ is $S$-measurable.

Proof For each $x \in X$, $f_x$ is the composition $f T_x$ of the measurable function $f$ and the measurable transformation $T_x$ (Theorem 7.1.3). Hence each $f_x$ is $T$-measurable, and similarly each $f^y$ is $S$-measurable. $\square$

7.2 Mixtures of measures

In this section it will be shown that under appropriate conditions, a family of measures may be simply "mixed" to form a new measure. This will not only give an immediate definition of an appropriate "product measure" (as will be seen in the next section) but is important for a variety of e.g. probabilistic applications.

It is easily seen (cf. Ex. 5.2) that if $\lambda_i$ is a measure on a measurable space $(X, S)$ for each $i = 1, 2, \ldots$, then $\lambda$ defined for $E \in S$ by $\lambda(E) = \sum_1^\infty \lambda_i(E)$ is also a measure on $S$. $\lambda$ may be regarded as a simple kind of mixture of the measures $\lambda_i$. More general mixtures may be defined as shown in the following result.

Theorem 7.2.1 Let $(X, S, \mu)$ be a measure space, and $(W, \mathcal{W})$ a measurable space. Suppose that for every $x \in X$, $\lambda_x$ is a measure on $\mathcal{W}$ such that for every fixed $E \in \mathcal{W}$, $\lambda_x(E)$ is $S$-measurable in $x$, and for $E \in \mathcal{W}$ define
$$\lambda(E) = \int_X \lambda_x(E) \, d\mu(x).$$

Then $\lambda$ is a measure on $\mathcal{W}$. Further, $\lambda(E) = 0$ if and only if $\lambda_x(E) = 0$ a.e. $(\mu)$.

Proof If $E_i$ are disjoint sets in $\mathcal{W}$ and $E = \bigcup_1^\infty E_i$,
$$\lambda(E) = \int_X \lambda_x\Bigl(\bigcup_1^\infty E_i\Bigr) d\mu(x) = \int_X \sum_1^\infty \lambda_x(E_i) \, d\mu(x) = \sum_1^\infty \int_X \lambda_x(E_i) \, d\mu(x) = \sum_1^\infty \lambda(E_i)$$
using the corollary to Theorem 4.5.2. Thus $\lambda$ is countably additive and hence a measure, since $\lambda(\emptyset) = 0$. The final statement follows at once from Theorem 4.4.7. $\square$
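Theorem 7.2.1 can be seen in miniature when $\mu$ is a weighting on finitely many indices, so that $\lambda(E) = \sum_i \lambda_i(E)\,\mu(\{i\})$; the component measures and weights below are illustrative:

```python
# A finite mixture: lambda(E) = sum_i mu_i * lambda_i(E), each lambda_i a
# measure on W = {0, 1, 2} given by its point masses (values illustrative).
W = {0, 1, 2}
lambdas = [{0: 1.0, 1: 0.0, 2: 2.0},     # point masses of lambda_1
           {0: 0.5, 1: 3.0, 2: 0.0}]     # point masses of lambda_2
mu = [0.25, 4.0]                          # weights mu({1}), mu({2})

def lam(E):
    return sum(w * sum(l[x] for x in E) for l, w in zip(lambdas, mu))

# Additivity of the mixture over the disjoint sets {0}, {1}, {2}:
print(lam(W), lam({0}) + lam({1}) + lam({2}))
```

With countably many components and counting-measure weights this is exactly the example $\lambda = \sum_1^\infty \lambda_i$ mentioned before the theorem.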

For obvious reasons $\lambda$ will be termed a mixture of the measures $\lambda_x$, with respect to the measure $\mu$. Note that in the example $\lambda = \sum_1^\infty \lambda_i$ given prior to the theorem, $\mu$ is simply counting measure on $X = \{1, 2, 3, \ldots\}$.

The next task is to show that integration with respect to $\lambda$ may be done in two stages, as a "repeated" integral, first with respect to $\lambda_x$ and then with respect to $\mu$; i.e. that
$$\int_W f \, d\lambda = \int_X \Bigl\{\int_W f \, d\lambda_x\Bigr\} d\mu(x)$$
for any suitable $f$ on $W$. For clarity this is split into two parts, first showing the result when $f$ is nonnegative and defined at all points of $W$.

Lemma 7.2.2 Let $f$ be a nonnegative $\mathcal{W}$-measurable function defined at all points of $W$ and let $\lambda$ be as in Theorem 7.2.1. Then $\int_W f \, d\lambda_x$ is a nonnegative, $S$-measurable function of $x$ and
$$\int_X \Bigl\{\int_W f \, d\lambda_x\Bigr\} d\mu(x) = \int_W f \, d\lambda.$$

Proof If $f$ is a nonnegative simple function, $f(w) = \sum_1^n a_i \chi_{E_i}(w)$, say ($E_i$ disjoint sets in $\mathcal{W}$), then
$$\int_W f \, d\lambda_x = \sum_1^n a_i \lambda_x(E_i)$$
which is nonnegative and $S$-measurable since $\lambda_x(E_i)$ is measurable for each $E_i$. Further
$$\int_X \Bigl\{\int_W f \, d\lambda_x\Bigr\} d\mu(x) = \sum_1^n a_i \int_X \lambda_x(E_i) \, d\mu(x) = \sum_1^n a_i \lambda(E_i) = \int_W f \, d\lambda.$$
Thus the result holds for nonnegative simple functions.

If $f$ is a nonnegative measurable function defined on all of $W$, write $f = \lim_{n\to\infty} f_n$ where $\{f_n\}$ is an increasing sequence of nonnegative simple functions. By monotone convergence (or simply definition)
$$\int_W f \, d\lambda_x = \lim_{n\to\infty} \int_W f_n \, d\lambda_x$$
so that $\int_W f \, d\lambda_x$ is a limit of nonnegative measurable functions and hence is nonnegative and measurable. Also
$$\int_X \Bigl\{\int_W f \, d\lambda_x\Bigr\} d\mu(x) = \int_X \Bigl\{\lim_{n\to\infty} \int_W f_n \, d\lambda_x\Bigr\} d\mu(x) = \lim_{n\to\infty} \int_X \Bigl\{\int_W f_n \, d\lambda_x\Bigr\} d\mu(x)$$
by monotone convergence, since $\int_W f_n \, d\lambda_x$ is nonnegative and nondecreasing in $n$. But the final expression above is (since $f_n$ is simple)
$$\lim_{n\to\infty} \int_W f_n \, d\lambda = \int_W f \, d\lambda$$
again using monotone convergence, so that the result follows. $\square$

This result will now be generalized as the main theorem of the section.

Theorem 7.2.3 Let $(X, S, \mu)$ be a measure space, $(W, \mathcal{W})$ a measurable space and $\lambda_x$ a measure on $\mathcal{W}$ for each $x \in X$, such that $\lambda_x(E)$ is $S$-measurable as a function of $x$ for each $E \in \mathcal{W}$. Let $\lambda$ be the mixture of the $\lambda_x$ as defined above, and $f$ a $\mathcal{W}$-measurable function defined a.e. $(\lambda)$ on $W$. Then

(i) If $f$ is nonnegative a.e. $(\lambda)$ on $W$, then $\int_W f \, d\lambda_x$ is a nonnegative $S$-measurable function defined a.e. $(\mu)$ on $X$, and
$$\int_W f \, d\lambda = \int_X \Bigl\{\int_W f \, d\lambda_x\Bigr\} d\mu(x). \qquad (7.1)$$

(ii) If $\int_W |f| \, d\lambda < \infty$ (i.e. $f \in L_1(W, \mathcal{W}, \lambda)$) or if $\int_X \{\int_W |f| \, d\lambda_x\} \, d\mu(x) < \infty$, then $f \in L_1(W, \mathcal{W}, \lambda_x)$ for a.e. $x$ $(\mu)$, $\int_W f \, d\lambda_x \in L_1(X, S, \mu)$, and (7.1) holds.
Proof (i) Let $E$ ($\in \mathcal{W}$) be the set where $f$ is defined and nonnegative, and write $f^*(w) = f(w)$ for $w \in E$, $f^*(w) = 0$ otherwise. Thus $f^* = f$ a.e. $(\lambda)$ and $f^*$ is defined everywhere. Now since $f$ is defined a.e. $(\lambda)$, $\lambda(E^c) = 0$ and hence $\lambda_x(E^c) = 0$ a.e. $(\mu)$ by Theorem 7.2.1. That is, if $A = \{x : \lambda_x(E^c) = 0\}$ we have $A \in S$ (since $\lambda_x(E^c)$ is $S$-measurable), and $\mu(A^c) = 0$.

Now $f^* = f$ on $E$ and if $x \in A$, $\lambda_x(E^c) = 0$, so that $f^* = f$ a.e. $(\lambda_x)$ and $\int f \, d\lambda_x = \int f^* \, d\lambda_x$, which is $S$-measurable by Lemma 7.2.2. Thus $\int f \, d\lambda_x$, defined precisely on $A \in S$, is $S$-measurable (Lemma 3.4.1) and defined a.e. since $\mu(A^c) = 0$.

Finally $\int_W f \, d\lambda_x = \int_W f^* \, d\lambda_x$ for $x \in A$ and hence a.e. $(\mu)$ since $\mu(A^c) = 0$, so that
$$\int_X \Bigl\{\int_W f \, d\lambda_x\Bigr\} d\mu(x) = \int_X \Bigl\{\int_W f^* \, d\lambda_x\Bigr\} d\mu(x) = \int_W f^* \, d\lambda = \int_W f \, d\lambda$$
since $f^* = f$ a.e. $(\lambda)$, as required.

(ii) Note first that by (i) with $|f|$ for $f$ we have
$$\int_W |f| \, d\lambda = \int_X \Bigl\{\int_W |f| \, d\lambda_x\Bigr\} d\mu(x)$$
so that finiteness of one side implies that of the other, and the two finiteness conditions in the statement of (ii) are equivalent. For brevity write $L_1(\lambda)$ for

$L_1(W, \mathcal{W}, \lambda)$, $L_1(\lambda_x)$ for $L_1(W, \mathcal{W}, \lambda_x)$, and $L_1(\mu)$ for $L_1(X, S, \mu)$. Then assuming $f \in L_1(\lambda)$ we have $f_+ \in L_1(\lambda)$, $f_- \in L_1(\lambda)$ (Theorem 4.4.5). Now $\int_W f_+ \, d\lambda_x$ is $S$-measurable by (i) and
$$\int_X \Bigl\{\int_W f_+ \, d\lambda_x\Bigr\} d\mu(x) = \int_W f_+ \, d\lambda < \infty. \qquad (7.2)$$
Hence $\int_W f_+ \, d\lambda_x < \infty$ a.e. $(\mu)$ so that $f_+ \in L_1(\lambda_x)$ a.e. $(\mu)$. The same is true with $f_-$ instead of $f_+$ and hence $f = f_+ - f_- \in L_1(\lambda_x)$ a.e. $(\mu)$, which proves the first statement of (ii). Further
$$\int_W f \, d\lambda_x = \int_W f_+ \, d\lambda_x - \int_W f_- \, d\lambda_x \quad \text{a.e. } (\mu)$$
and since by (7.2) $\int_W f_+ \, d\lambda_x \in L_1(\mu)$ (and correspondingly $\int_W f_- \, d\lambda_x \in L_1(\mu)$) we have $\int_W f \, d\lambda_x \in L_1(\mu)$ (which is the second statement of (ii)) and
$$\int_X \Bigl\{\int_W f \, d\lambda_x\Bigr\} d\mu(x) = \int_X \Bigl\{\int_W f_+ \, d\lambda_x\Bigr\} d\mu(x) - \int_X \Bigl\{\int_W f_- \, d\lambda_x\Bigr\} d\mu(x) = \int_W f_+ \, d\lambda - \int_W f_- \, d\lambda$$

(again using (7.2) and its counterpart for $f_-$). But this latter expression is just $\int_W f \, d\lambda$, so that the final statement of (ii) follows. $\square$

7.3 Measure and integration on product spaces

If $(X, S)$, $(Y, T)$ are measurable spaces, the product measurable space is simply $(X \times Y, S \times T)$ where $S \times T$ is defined as in Section 7.1. This product space will be identified with the space $(W, \mathcal{W})$ of the previous section, and a mixed measure thus defined on $S \times T$ from "component measures" $\mu$ on $S$ and $\nu_x$ defined on $T$ for each $x \in X$. These will be assumed to be uniformly $\sigma$-finite for $x \in X$, in the sense that there are sets $B_n \in T$, $\bigcup_n B_n = Y$, such that $\nu_x(B_n) < \infty$ for all $x \in X$. Clearly the sets $B_n$ can (and will) be taken to be disjoint. The results thus obtained have important uses e.g. in probability theory. In the next section the measures $\nu_x$ will be taken to be independent of $x$, leading to traditional "product measures".

Theorem 7.3.1 Let $(X, S, \mu)$ be a measure space, $(Y, T)$ a measurable space, and let $\nu_x$ be a measure on $T$ for each $x \in X$. Suppose that $\nu_x(B)$ is $S$-measurable in $x$ for each fixed $B \in T$ and that $\{\nu_x : x \in X\}$ is a uniformly $\sigma$-finite family. Then

(i) $\nu_x(E_x)$ is $S$-measurable for each $E \in S \times T$, and $\lambda$ defined on $S \times T$ by
$$\lambda(E) = \int_X \nu_x(E_x) \, d\mu(x) \quad \text{for } E \in S \times T,$$

is a measure on $S \times T$ satisfying
$$\lambda(A \times B) = \int_A \nu_x(B) \, d\mu(x) \quad \text{for } A \in S, \; B \in T.$$

(ii) $\lambda$ is the unique measure on $S \times T$ with this latter property if also $\int_{A_n} \nu_x(B_m) \, d\mu(x) < \infty$, $m, n = 1, 2, \ldots$, for some sequence of sets $A_n \in S$ with $\bigcup_1^\infty A_n = X$.

Proof (i) Write $W = X \times Y$, $\mathcal{W} = S \times T$ and for each $x \in X$, $E \in \mathcal{W}$, define $\lambda_x(E) = \nu_x(E_x)$ ($= \nu_x T_x^{-1}E$, where $T_x$ again denotes the measurable transformation $T_x y = (x, y)$). It is clear that $\lambda_x$ is a measure on $\mathcal{W}$. That $\lambda$ may be defined as in (i) and is a measure will follow at once from Theorem 7.2.1 provided we show that $\nu_x(E_x)$ is $S$-measurable for each $E \in \mathcal{W} = S \times T$. To see this let $C$ be a set in $T$ such that $\nu_x(C) < \infty$ for all $x \in X$. Write

$$\mathcal{D} = \{E \in S \times T : \nu_x(E_x \cap C) \text{ is } S\text{-measurable}\}.$$
Since for $E, F \in \mathcal{D}$ with $E \supset F$, $\nu_x\{(E - F)_x \cap C\} = \nu_x(E_x \cap C) - \nu_x(F_x \cap C)$ (as $\nu_x(F_x \cap C) \le \nu_x(C) < \infty$), and $\nu_x\{(\bigcup_1^\infty E_i)_x \cap C\} = \sum_1^\infty \nu_x(E_{i,x} \cap C)$ for disjoint sets $E_i \in \mathcal{D}$, it is clear that $\mathcal{D}$ is a $\mathcal{D}$-class. If $E$ is a measurable rectangle ($E = A \times B$, $A \in S$, $B \in T$), then $\nu_x(E_x \cap C) = \nu_x(B \cap C)\chi_A(x)$, which is measurable since $\nu_x(B \cap C)$ is measurable by assumption and $A \in S$, so that $\nu_x(E_x \cap C)$ is $S$-measurable for measurable rectangles $E$. Since $\mathcal{D}$ thus contains the semiring of measurable rectangles, it contains the generated $\sigma$-ring $S \times T$. Hence $\nu_x(E_x \cap C)$ is $S$-measurable for any $E \in S \times T$.

Replacing $C$ by $B_m$, where the $B_m$ are as in the theorem statement, we have for $E \in S \times T$,
$$\nu_x(E_x) = \sum_{m=1}^\infty \nu_x(E_x \cap B_m)$$
which is a countable sum of $S$-measurable functions and hence is measurable as required. The final statement of (i) follows simply since, as noted above, $\nu_x((A \times B)_x) = \nu_x(B)\chi_A(x)$ for $A \in S$, $B \in T$.

(ii) will follow immediately from the uniqueness part of Theorem 2.5.4 provided $\lambda$ is $\sigma$-finite on the semiring $P$ of measurable rectangles $A \times B$, $A \in S$, $B \in T$. But under the assumptions of (ii)
$$X \times Y = \bigcup_{n=1}^\infty \bigcup_{m=1}^\infty (A_n \times B_m)$$
where $\lambda(A_n \times B_m) = \int_{A_n} \nu_x(B_m) \, d\mu(x) < \infty$. The double union may be written as a single union, to show that $\lambda$ has the required $\sigma$-finiteness property. $\square$

Notice that if $\mu$ and each $\nu_x$ are probability measures, and if for each fixed $B \in T$, $\nu_x(B)$ is $S$-measurable in $x$, then Theorem 7.3.1 is applicable and $\lambda$ is also a probability measure.

Theorem 7.2.3 may now be applied to give the following result for integration with respect to the measure $\lambda$ on $S \times T$.

Theorem 7.3.2 With the notation and conditions of Theorem 7.3.1 for the existence of the measure $\lambda$ on $S \times T$ given by $\lambda(E) = \int_X \nu_x(E_x) \, d\mu(x)$, let $f$ be a measurable function defined a.e. $(\lambda)$ on $X \times Y$ (with $x$-section $f_x$ as usual).

(i) If $f \ge 0$ a.e. $(\lambda)$ then $\int_Y f_x \, d\nu_x$ is defined a.e. $(\mu)$ on $X$, is $S$-measurable, and
$$\int_{X \times Y} f \, d\lambda = \int_X \Bigl\{\int_Y f_x \, d\nu_x\Bigr\} d\mu(x).$$

(ii) If $\int_{X \times Y} |f| \, d\lambda < \infty$, i.e. $f \in L_1(X \times Y, S \times T, \lambda)$, or if $\int_X \{\int_Y |f_x| \, d\nu_x\} \, d\mu(x) < \infty$, then $\int_Y f_x \, d\nu_x \in L_1(X, S, \mu)$ and
$$\int_{X \times Y} f \, d\lambda = \int_X \Bigl\{\int_Y f_x \, d\nu_x\Bigr\} d\mu(x).$$

Proof As in Theorem 7.3.1 define the measure λx on S×T by

$$\lambda_x(E) = \nu_x(E_x) = \nu_x T_x^{-1}(E), \quad \text{where } T_x y = (x, y).$$
Then if e.g. $f \ge 0$ a.e. $(\lambda)$ we have
$$\int_{X \times Y} f \, d\lambda_x = \int_{X \times Y} f \, d(\nu_x T_x^{-1}) = \int_Y (f T_x) \, d\nu_x = \int_Y f_x \, d\nu_x$$
by the transformation theorem (Theorem 4.6.1). Hence (i) follows at once from Theorem 7.2.3 by identifying $(W, \mathcal{W})$ with $(X \times Y, S \times T)$ (noting that $\lambda(E) = \int \nu_x(E_x) \, d\mu(x) = \int \lambda_x(E) \, d\mu(x)$) and hence
$$\int_{X \times Y} f \, d\lambda = \int_X \Bigl\{\int_{X \times Y} f \, d\lambda_x\Bigr\} d\mu = \int_X \Bigl\{\int_Y f_x \, d\nu_x\Bigr\} d\mu.$$
(ii) follows in almost precisely the same way. $\square$

It is sometimes convenient to refer to $\int_{X \times Y} f \, d\lambda$ as a double integral (emphasizing the fact that the integration is over a product space $X \times Y$, even though only one integration is involved). Correspondingly we may call $\int_X \{\int_Y f_x \, d\nu_x\} \, d\mu(x)$ a repeated or iterated integral. Theorem 7.3.2 thus gives conditions under which a double integral may be evaluated as a repeated integral. The case of most immediate concern, that when $\nu_x$ is independent of $x$, will be considered in the next section.

7.4 Product measures and Fubini's Theorem

As noted, this section specializes the results of the previous one to the case where $\nu_x = \nu$, independent of $x$. Then the measure $\lambda$ is a true "product measure" in that the measure $\lambda$ of a rectangle $A \times B$ is (as will be seen) the product $\mu(A)\nu(B)$ of the measures of its sides.

Theorem 7.4.1 Let $(X, S, \mu)$ be a measure space and $(Y, T, \nu)$ a $\sigma$-finite measure space. Then

(i) $\lambda$ defined for $E \in S \times T$ by $\lambda(E) = \int_X \nu(E_x) \, d\mu(x)$ is a measure on $S \times T$ satisfying $\lambda(A \times B) = \mu(A) \cdot \nu(B)$ when $A \in S$, $B \in T$.

(ii) If further $\mu$ is $\sigma$-finite, then also $\lambda(E) = \int_Y \mu(E^y) \, d\nu(y)$ for $E \in S \times T$. Then $\lambda$ is $\sigma$-finite and is the unique measure on $S \times T$ satisfying $\lambda(A \times B) = \mu(A) \cdot \nu(B)$ for $A \in S$, $B \in T$.

Proof (i) follows immediately from Theorem 7.3.1 by noting that the constant $\nu(B)$ is $S$-measurable for each $B \in T$, and $\nu$ is $\sigma$-finite, uniformity not being an issue. The first statement of (ii) follows by interchanging the roles of $X$ and $Y$, and the remainder follows simply from Theorem 7.3.1. $\square$

If $(X, S, \mu)$, $(Y, T, \nu)$ are $\sigma$-finite measure spaces, the measure $\lambda$ defined as above on $S \times T$ has (as noted) the property that $\lambda(A \times B) = \mu(A)\nu(B)$ for $A \in S$, $B \in T$. For this reason it is referred to as the product measure and is written as $\mu \times \nu$. $(X \times Y, S \times T, \mu \times \nu)$ is then called the product measure space, and by Theorem 7.4.1 the product measure of a set $E \in S \times T$ is expressed in terms of the measures of its sections by
$$(\mu \times \nu)(E) = \int_X \nu(E_x) \, d\mu(x) = \int_Y \mu(E^y) \, d\nu(y).$$
This is a general version of the customary way of calculating areas in calculus and as an immediate corollary gives a useful criterion for a set $E \in S \times T$ to have zero product measure.
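The section formula above is transparent on discrete spaces, where both integrals become weighted sums over sections; the measures and the set $E$ below are illustrative:

```python
# (mu x nu)(E) = integral of nu(E_x) d mu(x) = integral of mu(E^y) d nu(y),
# checked on finite discrete spaces (all point masses illustrative).
mu = {"a": 0.5, "b": 2.0}                 # measure on X = {a, b}
nu = {1: 1.5, 2: 0.25, 3: 1.0}            # measure on Y = {1, 2, 3}
E = {("a", 1), ("a", 3), ("b", 2)}        # a set in the product space

via_x_sections = sum(mu[x] * sum(nu[y] for (x2, y) in E if x2 == x)
                     for x in mu)
via_y_sections = sum(nu[y] * sum(mu[x] for (x, y2) in E if y2 == y)
                     for y in nu)
print(via_x_sections, via_y_sections)     # the two evaluations agree
```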

Corollary Let $(X, S, \mu)$, $(Y, T, \nu)$ be $\sigma$-finite measure spaces. Then for any fixed $E \in S \times T$, $(\mu \times \nu)(E) = 0$ if and only if $\nu(E_x) = 0$ a.e. $(\mu)$, or equivalently if and only if $\mu(E^y) = 0$ a.e. $(\nu)$.

The above corollary is sometimes referred to as (a part of) Fubini's Theorem. However, the main part of Fubini's Theorem is the following counterpart of Theorem 7.3.2 when $\nu_x$ is independent of $x$.

Theorem 7.4.2 (Fubini's Theorem) Let $(X, S, \mu)$, $(Y, T, \nu)$ be $\sigma$-finite measure spaces and let $f$ be an $S \times T$-measurable function defined a.e. $(\lambda = \mu \times \nu)$ on $X \times Y$.

(i) If $f \ge 0$ a.e. $(\lambda)$, then $\int_Y f_x \, d\nu$ and $\int_X f^y \, d\mu$ are respectively $S$- and $T$-measurable (defined a.e. $(\mu)$, $(\nu)$ respectively) and
$$\int_{X \times Y} f \, d\lambda = \int_X \Bigl\{\int_Y f_x \, d\nu\Bigr\} d\mu(x) = \int_Y \Bigl\{\int_X f^y \, d\mu\Bigr\} d\nu(y). \qquad (7.3)$$

(ii) The three conditions
$$\int_{X \times Y} |f| \, d\lambda < \infty, \quad \int_X \Bigl\{\int_Y |f_x| \, d\nu\Bigr\} d\mu(x) < \infty, \quad \int_Y \Bigl\{\int_X |f^y| \, d\mu\Bigr\} d\nu(y) < \infty,$$
are equivalent and each guarantees that $f_x \in L_1(Y, T, \nu)$ a.e. $(\mu)$, $f^y \in L_1(X, S, \mu)$ a.e. $(\nu)$, $\int_Y f_x \, d\nu \in L_1(X, S, \mu)$, $\int_X f^y \, d\mu \in L_1(Y, T, \nu)$, and that (7.3) holds.

Proof This follows at once from Theorem 7.3.2 – in part directly, and in part by interchanging the roles of $X$ and $Y$ in an obvious way. $\square$

It is convenient to write $\int\!\!\int f \, d\nu \, d\mu$ and $\int\!\!\int f \, d\mu \, d\nu$ respectively for the repeated integrals $\int_X \{\int_Y f_x \, d\nu\} \, d\mu(x)$, $\int_Y \{\int_X f^y \, d\mu\} \, d\nu(y)$. The main use of Theorem 7.4.2 is to invert the order of such repeated integrals, e.g. of $\int\!\!\int f \, d\nu \, d\mu$ to obtain $\int\!\!\int f \, d\mu \, d\nu$. By the theorem, this may be done whenever the ($S \times T$-measurable) function $f$ is nonnegative, or, if $f$ can take both positive and negative values, whenever one of $\int\!\!\int |f| \, d\nu \, d\mu$, $\int\!\!\int |f| \, d\mu \, d\nu$ can be shown to be finite.

It should also be noted that commonly one wishes to invert the order of integration of $\int_X \{\int_{E_x} f_x \, d\nu\} \, d\mu(x)$ where $E \in S \times T$. Replacing $f$ by $f\chi_E$ one sees that this integral is simply $\int_E f \, d(\mu \times \nu)$ or $\int_Y \{\int_{E^y} f^y \, d\mu\} \, d\nu(y)$ under the appropriate conditions from Theorem 7.4.2.

The product measure space $(X \times Y, S \times T, \mu \times \nu)$ is not generally complete even if both spaces $(X, S, \mu)$ and $(Y, T, \nu)$ are complete (cf. Ex. 7.5). Sometimes one wishes to use Fubini's Theorem on the completed space $(X \times Y, \overline{S \times T}, \overline{\mu \times \nu})$, where $\overline{S \times T}$ is the completion of $S \times T$ with respect to $\mu \times \nu$, and $\overline{\mu \times \nu}$ is the extension of $\mu \times \nu$ from $S \times T$ to $\overline{S \times T}$ (see Section 2.6).
The results of Theorem 7.4.2 hold for the completed product space, as we show now, the only difference being that almost all, rather than all, sections of $f$ are measurable in this case.

Theorem 7.4.3 Let $(X, S, \mu)$ and $(Y, T, \nu)$ be two complete $\sigma$-finite measure spaces and let $f$ be defined a.e. $(\overline{\mu \times \nu})$ on $X \times Y$, and $\overline{S \times T}$-measurable.

(i) If $f$ is nonnegative a.e. $(\overline{\mu \times \nu})$, then $f_x$ is $T$-measurable for a.e. $x$ $(\mu)$, $f^y$ is $S$-measurable for a.e. $y$ $(\nu)$, the functions $\int_Y f_x \, d\nu$ and $\int_X f^y \, d\mu$ are defined for a.e. $x$, $y$, are $S$- and $T$-measurable respectively, and
$$\int f \, d(\overline{\mu \times \nu}) = \int\!\!\int f \, d\mu \, d\nu = \int\!\!\int f \, d\nu \, d\mu. \qquad (7.4)$$

(ii) If $f \in L_1(X \times Y, \overline{S \times T}, \overline{\mu \times \nu})$ then $f_x \in L_1(Y, T, \nu)$ for a.e. $x$ $(\mu)$, $f^y \in L_1(X, S, \mu)$ for a.e. $y$ $(\nu)$, $\int_Y f_x \, d\nu \in L_1(X, S, \mu)$, $\int_X f^y \, d\mu \in L_1(Y, T, \nu)$, and (7.4) holds.

Proof (i) Since $f$ is $\overline{S \times T}$-measurable, there is an $S \times T$-measurable function $g$ defined on (all of) $X \times Y$ such that $f = g$ a.e. $(\overline{\mu \times \nu})$ (Ex. 3.9), and it may be assumed that $g \ge 0$ on $X \times Y$ since $f \ge 0$ a.e. $(\overline{\mu \times \nu})$. We will show that for a.e. $x$ $(\mu)$ we have $f_x = g_x$ a.e. $(\nu)$. Let

E = {(x, y):f (x, y)=g(x, y)}.

Then E ∈ S×T and (μ × ν)(Ec) = 0, and by the corollary to Theorem c { } 7.4.1 ν(Ex) = 0 for a.e. x (μ). But Ex = y : fx(y)=gx(y) and thus for a.e. x (μ)wehavefx = gx a.e. (ν). Since each gx is T -measurable (by Theorem 7.1.4) and (Y, T , ν) is complete, it follows from Theorem 3.6.1 that fx is T -measurable for a.e. x (μ). Hence fx dν = gx dν for a.e. x (μ) and since (X, S, μ) is also complete, again by Theorem 3.6.1, fx dν is S- measurable. Finally fdν dμ = { f (y) dν(y)} dμ(x) x = { g (y) dν(y)} dμ(x) x = gd(μ × ν) (Theorem 7.4.2 (i)) = gd(μ × ν) (Ex. 4.10) = fd(μ × ν) the last equality holding since f = g a.e. (μ × ν) and thus also a.e. (μ × ν). y y It is shown similarly that f is S-measurable for a.e. y (ν), that f dμ is T - measurable and that fdμ dν = fd(μ × ν), completing the proof of (i). (ii) is shown as (i): the details should be furnished by the reader as an exercise.  152 Product spaces
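The equality of the double integral with either repeated integral in (7.3) can be seen very concretely for purely atomic (point-mass) measures, where every integral is a finite sum. The following sketch is only an illustration; the measures and the function f are invented for it and are not part of the text's development.

```python
# Finite-space sanity check of Fubini's Theorem, equation (7.3): for purely
# atomic (point-mass) measures mu on X and nu on Y every integral is a finite
# sum, and the double integral equals either repeated integral.

def integral(weights, g):
    """Integral of g against a point-mass measure {point: mass}."""
    return sum(mass * g(x) for x, mass in weights.items())

mu = {0: 0.5, 1: 2.0, 2: 1.5}           # a finite measure on X = {0, 1, 2}
nu = {0: 1.0, 1: 0.25, 2: 3.0, 3: 0.5}  # a finite measure on Y = {0, 1, 2, 3}
f = lambda x, y: (x + 1) * (y - 1) ** 2  # nonnegative, so part (i) applies

# Integral over the product measure mu x nu: the rectangle {x} x {y} has
# mass mu{x} nu{y}.
double = sum(mx * ny * f(x, y) for x, mx in mu.items() for y, ny in nu.items())

# The two repeated integrals of (7.3).
int_dnu_dmu = integral(mu, lambda x: integral(nu, lambda y: f(x, y)))
int_dmu_dnu = integral(nu, lambda y: integral(mu, lambda x: f(x, y)))

assert abs(double - int_dnu_dmu) < 1e-12
assert abs(double - int_dmu_dnu) < 1e-12   # both equal 54.0 here
```

Of course no finite example proves the theorem; the point is only that (7.3) is, in the atomic case, the familiar fact that a finite double sum may be performed in either order.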

7.5 Signed measures on product spaces It is of interest to note that products of signed (or even complex) measures may also be quite simply defined. In this section we briefly consider the most useful case of finite signed measures.

Theorem 7.5.1 Let (X, S) and (Y, T) be measurable spaces and μ and ν finite signed measures on S and T respectively. Then there is a unique finite signed measure μ × ν on S×T such that for all A ∈ S and B ∈ T,

(μ × ν)(A × B)=μ(A)ν(B).

Moreover (μ × ν)+ = μ+ × ν+ + μ– × ν– and (μ × ν)– = μ+ × ν– + μ– × ν+, and thus |μ × ν| = |μ| × |ν| and for all E ∈ S×T,

(μ × ν)(E) = ∫_X ν(E_x) dμ(x) = ∫_Y μ(E^y) dν(y).

Proof Let μ = μ+ – μ– and ν = ν+ – ν– be the Jordan decompositions of μ and ν and define μ × ν by

μ × ν =[(μ+ × ν+)+(μ– × ν–)] – [(μ+ × ν–)+(μ– × ν+)].

Since μ+, μ–, ν+, ν– are measures, it follows immediately from Theorem 7.4.1 that (μ × ν)(A × B) = μ(A)ν(B) and (μ × ν)(E) = ∫_X ν(E_x) dμ(x) = ∫_Y μ(E^y) dν(y).

Now let X = A ∪ B, with A positive and B negative, be a Hahn decomposition of (X, S, μ) and Y = C ∪ D, with C positive and D negative, a Hahn decomposition of (Y, T, ν). Notice that if E × F ∈ S×T, E × F ⊂ A × C, then (μ × ν)(E × F) ≥ 0. Hence (μ × ν)(G) ≥ 0 for all finite disjoint unions G of such measurable rectangles. But given ε > 0 it is readily shown from Theorem 2.6.2 that a measurable set G ⊂ A × C may be approximated by such a union H of measurable rectangles in the sense that |μ × ν|(GΔH) < ε. Since (μ × ν)(H) ≥ 0 it follows that (μ × ν)(G) ≥ –ε and hence (μ × ν)(G) ≥ 0, ε being arbitrary. Thus any measurable subset of A × C has nonnegative μ × ν-measure, so that A × C is positive for μ × ν. Similarly B × D is positive for μ × ν, whereas A × D and B × C are negative sets for μ × ν. Hence

X × Y = {(A × C) ∪ (B × D)} ∪ {(A × D) ∪ (B × C)}

is a Hahn decomposition for (X × Y, S×T, μ × ν). It is then clear that (μ × ν)+, the restriction of μ × ν to (A × C) ∪ (B × D), equals μ+ × ν+ + μ– × ν–, since the two finite measures agree on the measurable rectangles. Similarly (μ × ν)– = μ+ × ν– + μ– × ν+.

Finally the uniqueness of μ × ν follows from the uniqueness of its restriction to each of the subsets A × C, A × D, B × C, B × D, i.e. from the uniqueness of μ+ × ν+, μ+ × ν–, μ– × ν+, μ– × ν–, which is guaranteed by Theorem 7.4.1.
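For signed point-mass measures on finite spaces the content of Theorem 7.5.1 can be checked directly: the product multiplies the (signed) masses, the rectangle formula becomes an identity about sums, and the total variation multiplies. A hedged sketch with invented data; the names mu, nu, measure below are purely illustrative.

```python
# Finite sketch of Theorem 7.5.1: for signed point-mass measures mu, nu,
# the product built from the Jordan decompositions has mass mu{x} nu{y} at
# (x, y), satisfies (mu x nu)(A x B) = mu(A) nu(B), and |mu x nu| = |mu| x |nu|.

mu = {0: 2.0, 1: -1.0, 2: 0.5}   # signed masses on X = {0, 1, 2}
nu = {0: -3.0, 1: 1.5}           # signed masses on Y = {0, 1}

prod = {(x, y): mx * ny for x, mx in mu.items() for y, ny in nu.items()}

def measure(m, points):
    """Signed measure of a set of points under a point-mass measure m."""
    return sum(mass for p, mass in m.items() if p in points)

A, B = {0, 2}, {1}
rect = {(x, y) for x in A for y in B}
assert abs(measure(prod, rect) - measure(mu, A) * measure(nu, B)) < 1e-12

# Total variation: |mu x nu|(X x Y) = |mu|(X) |nu|(Y), as in the theorem.
tv_prod = sum(abs(m) for m in prod.values())
tv_mu = sum(abs(m) for m in mu.values())
tv_nu = sum(abs(m) for m in nu.values())
assert abs(tv_prod - tv_mu * tv_nu) < 1e-12
```

Here the Hahn decomposition of the product is visible by inspection: the positive set for mu x nu consists exactly of the points (x, y) whose masses have equal signs.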

Fubini’s Theorem holds for finite signed measures as well. In view of Theorem 7.5.1, this is an immediate consequence of Fubini’s Theorem for measures (Theorem 7.4.2) and we now state it, leaving the simple details to the reader.

Theorem 7.5.2 Let (X, S) and (Y, T) be measurable spaces, and μ, ν finite signed measures on S, T respectively. If f ∈ L1(X × Y, S×T, |μ| × |ν|), then f_x ∈ L1(Y, T, |ν|) for a.e. x (|μ|), f^y ∈ L1(X, S, |μ|) for a.e. y (|ν|), the functions ∫ f_x dν and ∫ f^y dμ which are thus defined a.e. (|μ|) on X and a.e. (|ν|) on Y are in L1(X, S, |μ|) and L1(Y, T, |ν|) respectively, and

∫ f d(μ × ν) = ∫∫ f dμ dν = ∫∫ f dν dμ.

7.6 Real line applications

This section concerns some applications to the real line R = (–∞, +∞). As usual B denotes the Borel sets of R and m Lebesgue measure on B. Write R2 for the plane R × R, B2 = B×B for the class of two-dimensional Borel sets, or simply the Borel sets of R2, and m2 = m × m for two-dimensional Lebesgue measure, or Lebesgue measure on R2. The completion of B×B with respect to m × m is called the class of two-dimensional Lebesgue measurable sets, or the Lebesgue measurable sets of R2, and is denoted by L2. Notice that L2 properly contains L×L, i.e. the product σ-field L×L is not complete, as shown in Ex. 7.5.

In the sequel we will write L1(R) for L1(R, B, m), and L1(R2) for L1(R2, B2, m × m). Note that f, g ∈ L1(R) does not (in general) imply fg ∈ L1(R), as the example f(x) = g(x) = x^{–1/2} χ_{(0,1)}(x) demonstrates. However, the following remarkable and useful result follows as a first application of Fubini’s Theorem.

Theorem 7.6.1 Let f, g be functions defined on R. If f, g ∈ L1(R) then for a.e. x ∈ R the function of y, f(x – y)g(y), belongs to L1(R), and if for these x’s we define

h(x) = ∫_{–∞}^{∞} f(x – y)g(y) dy,

then h ∈ L1(R) and ‖h‖1 ≤ ‖f‖1 ‖g‖1. h is called the convolution of f and g and is here denoted by f ∗ g.

Proof Define the function F(x, y) on R2 by F(x, y) = f(x – y)g(y) and assume for the moment that F is B2-measurable. Then by Fubini’s Theorem for nonnegative functions (Theorem 7.4.2),

∫_{R2} |F| d(m × m) = ∫_{–∞}^{∞} {∫_{–∞}^{∞} |f(x – y)g(y)| dx} dy
 = ∫_{–∞}^{∞} |g(y)| {∫_{–∞}^{∞} |f(x – y)| dx} dy
 = ‖f‖1 ∫_{–∞}^{∞} |g(y)| dy = ‖f‖1 ‖g‖1

since ∫_{–∞}^{∞} |f(x – y)| dx = ∫_{–∞}^{∞} |f(x)| dx by the translation invariance of Lebesgue measure (see the last paragraph of Section 4.7). Thus F ∈ L1(R2) and by Fubini’s Theorem for integrable functions F_x ∈ L1(R) for a.e. x ∈ R (m), and h(x) = ∫_{–∞}^{∞} F_x(y) dy, which is thus defined a.e. on R, belongs to L1(R). Applying again Fubini’s Theorem for nonnegative functions it follows as before that

‖h‖1 = ∫_{–∞}^{∞} |h(x)| dx ≤ ∫_{–∞}^{∞} {∫_{–∞}^{∞} |f(x – y)g(y)| dy} dx = ‖f‖1 ‖g‖1.

It thus only remains to be shown that F is B2-measurable for the proof of the theorem to be complete. Consider the functions F1, F2 defined on R2 by F1(x, y) = x and F2(x, y) = y. Clearly F1 and F2 are B2-measurable. Since f and g are B-measurable, by Theorem 3.3.1 the compositions f(x – y) = f{F1(x, y) – F2(x, y)} = (f ∘ (F1 – F2))(x, y) and g(y) = g{F2(x, y)} = (g ∘ F2)(x, y) are B2-measurable, and hence so also is their product F(x, y) = f(x – y)g(y) (Theorem 3.4.4).

The notion of convolution of two integrable functions has an immediate, and useful, generalization to the convolution of two finite signed measures given in Ex. 7.24.

The next application of Fubini’s Theorem gives the formula for integration by parts in a general form.

Theorem 7.6.2 If F and G are right-continuous functions of bounded variation on [a, b], –∞ < a < b < ∞, then

∫_{(a,b]} G(x) dF(x) = F(b)G(b) – F(a)G(a) – ∫_{(a,b]} F(x – 0) dG(x).

Proof Let E = {(x, y) ∈ (a, b] × (a, b] : y ≤ x}. Then E ∈ B2 since the functions F1(x, y) = x, F2(x, y) = y are B2-measurable and E = {(a, b] × (a, b]} ∩ {(x, y) : F2(x, y) ≤ F1(x, y)}. If μ_F and μ_G are the finite signed Lebesgue–Stieltjes measures on B(a, b] corresponding to F and G (see Theorem 5.7.4) then by Theorem 7.5.1,

(μ_F × μ_G)(E) = ∫_{(a,b]} μ_G(E_x) dμ_F(x) = ∫_{(a,b]} μ_F(E^y) dμ_G(y).

Since E_x = (a, x] and E^y = [y, b] this is written

∫_{(a,b]} {G(x) – G(a)} dF(x) = ∫_{(a,b]} {F(b) – F(y – 0)} dG(y)

so that

∫_{(a,b]} G(x) dF(x) – G(a){F(b) – F(a)} = F(b){G(b) – G(a)} – ∫_{(a,b]} F(y – 0) dG(y)

and the desired expression follows by cancelling the terms F(b)G(a).

For absolutely continuous functions integration by parts has a simpler form.

Corollary If F and G are absolutely continuous functions on [a, b], –∞ < a < b < ∞, with F(x) = F(a) + ∫_a^x f(t) dt, G(x) = G(a) + ∫_a^x g(t) dt, f, g ∈ L1(a, b), then

∫_a^b G(x)f(x) dx + ∫_a^b F(x)g(x) dx = F(b)G(b) – F(a)G(a).

Proof The result follows immediately from the theorem since F is continuous and dμ_F/dm = f, and similarly for G.

Further real line applications are given in the exercises.
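Theorem 7.6.1 lends itself to a crude numerical check: approximating f, g and h = f ∗ g by Riemann sums on a finite grid, one can watch ‖h‖1 ≤ ‖f‖1‖g‖1 hold, with near equality since both test functions below are nonnegative. The window, grid size and test functions are arbitrary choices for the illustration, not taken from the text.

```python
import math

# Riemann-sum sketch of Theorem 7.6.1: the convolution h = f * g satisfies
# ||h||_1 <= ||f||_1 ||g||_1.  Grid and test functions are illustrative choices.
N, L = 400, 8.0                 # N grid points on the window [-L, L)
dx = 2 * L / N
xs = [-L + i * dx for i in range(N)]

f = lambda x: 1.0 if 0.15 <= x <= 1.37 else 0.0   # indicator of [0.15, 1.37]
g = lambda x: math.exp(-abs(x))                   # e^{-|x|}, a function in L1(R)

def norm1(func):
    """Riemann-sum approximation to the L1 norm over the window."""
    return sum(abs(func(x)) for x in xs) * dx

def h(x):
    """Riemann-sum approximation to the convolution (f * g)(x)."""
    return sum(f(x - y) * g(y) for y in xs) * dx

h_norm = sum(abs(h(x)) for x in xs) * dx
assert h_norm <= norm1(f) * norm1(g) + 1e-9
# Since f, g >= 0, the inequality is an equality up to truncation error:
assert abs(h_norm - norm1(f) * norm1(g)) < 0.01
```

The deficit in the last comparison comes entirely from truncating the window, mirroring the proof: for nonnegative f, g the double integral of f(x – y)g(y) factors exactly by translation invariance.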

7.7 Finite-dimensional product spaces

The results of Sections 7.1, 7.3–7.5 may be generalized to include the product of a finite number of factor spaces. To see this, first let X1, ..., Xn be spaces and Π_{i=1}^n Xi = X1 × X2 × ... × Xn their Cartesian product, i.e. {(x1, ..., xn) : xi ∈ Xi, i = 1, ..., n}.

If Si are semirings of subsets of Xi, i = 1, ..., n, the class Pn of all rectangles A1 × A2 × ... × An such that Ai ∈ Si for each i is again a semiring. In fact the proof of Lemma 7.1.1 generalizes at once by noting that (A1 × A2 × ... × An) – (B1 × B2 × ... × Bn) may be expressed as the finite disjoint union ∪_{r=1}^n E_r where

E_r = (A1 ∩ B1) × (A2 ∩ B2) × ... × (A_{r–1} ∩ B_{r–1}) × (A_r – B_r) × A_{r+1} × ... × An.

(Note that if r < s, E_r ⊂ A1 × A2 × ... × (A_r – B_r) × A_{r+1} × ... × An whereas E_s ⊂ A1 × A2 × ... × (A_r ∩ B_r) × A_{r+1} × ... × An and hence E_r ∩ E_s = ∅.)

For σ-rings S1, S2, ..., Sn the product σ-ring Π_{i=1}^n Si = S1 × S2 × ... × Sn is simply defined to be the σ-ring generated by this semiring Pn. We assume now that the Si are σ-fields, so that (X1, S1), ..., (Xn, Sn) are measurable spaces, and (X1 × X2 × ... × Xn, S1 × S2 × ... × Sn) is a measurable space, the “product measurable space” (Π_{i=1}^n Xi, Π_{i=1}^n Si).

If E is a subset of X1 × X2 × ... × Xn, a section may be defined by fixing any number of the coordinates x1, x2, ..., xn (xi ∈ Xi) to obtain a subset of the product of the remaining spaces Xi. For example

E_{x1,x2,...,xr} = {(x_{r+1}, x_{r+2}, ..., xn) : (x1, x2, ..., xn) ∈ E} = T_x^{–1} E ⊂ X_{r+1} × X_{r+2} × ... × Xn

where T_x, for x = (x1, x2, ..., xr), is the mapping of X_{r+1} × ... × Xn into X1 × X2 × ... × Xn given by T_x(x_{r+1}, x_{r+2}, ..., xn) = (x1, x2, ..., xn). It is easily seen that Theorem 7.1.3 generalizes so that each T_x is measurable and if E ∈ S1 × S2 × ... × Sn then any section is a member of the appropriate σ-field (S_{r+1} × S_{r+2} × ... × Sn in the example given).

Suppose now that μ1, ..., μn are σ-finite measures on S1, ..., Sn. Write Yn = X1 × X2 × ... × Xn and Tn = S1 × S2 × ... × Sn. Then a product measure λn, denoted by μ1 × μ2 × ... × μn, may be defined (e.g. inductively) on Tn, with the property that

λn(A1 × A2 × ... × An) = μ1(A1)μ2(A2) ... μn(An)

where Ai ∈ Si, i = 1, ..., n. To see this more precisely, we suppose that λ_{n–1} has been defined on T_{n–1} with this product property. We may “identify” Yn with the product space Y_{n–1} × Xn in a natural way by the mapping T((x1, ..., x_{n–1}), xn) = (x1, ..., xn) from Y_{n–1} × Xn to Yn. That is, while Yn is the product of n factor spaces, it may be regarded as the product of two spaces (of which one is itself a product) in this way. It may be shown that if E ∈ Tn then T^{–1}E ∈ T_{n–1} × Sn (Ex. 7.30) and thus λn is naturally defined by λn = (λ_{n–1} × μn)T^{–1}. If E = A1 × A2 × ... × An (Ai ∈ Si, i = 1, ..., n) then T^{–1}E = (A1 × A2 × ... × A_{n–1}) × An and hence

λn(E) = λ_{n–1}(A1 × A2 × ... × A_{n–1})μn(An) = μ1(A1)μ2(A2) ... μn(An)

as required. λn is the unique measure on Tn with this property since any other such measure must coincide with λn on the semiring Pn and hence on S1 × S2 × ... × Sn (σ-finiteness on Pn is clear). λn is thus also σ-finite. Thus in summary the following result holds.

Theorem 7.7.1 Let (Xi, Si, μi) be σ-finite measure spaces for i = 1, 2, ..., n. Then there exists a unique measure λn (written μ1 × μ2 × ... × μn) on the σ-field S1 × S2 × ... × Sn such that

λn(A1 × A2 × ... × An) = Π_{i=1}^n μi(Ai)

for each such rectangle with Ai ∈ Si, i = 1, ..., n. λn is σ-finite.

The results of Section 7.4 also generalize to apply to a product of n > 2 measure spaces using the same “identification” of Yn with Y_{n–1} × Xn as above. For example, suppose that the function f(x1, ..., xn) defined on Yn is S1 × S2 × ... × Sn- (i.e. Tn-) measurable and, say, nonnegative. It is usually convenient to evaluate ∫ f dλn as a repeated integral ∫ ... ∫ f dμ1 dμ2 ... dμn, say. It is clear what is meant by such a repeated integral.

First, for fixed x2, x3, ..., xn the “section” f_{x2,...,xn}(x1) = f(x1, ..., xn) is integrated over X1, giving a function f^{(2)}(x2, ..., xn) say, on X2 × ... × Xn. Then f^{(2)}_{x3,...,xn}(x2) is integrated over X2 to give f^{(3)}(x3, ..., xn), and so on. That is, the repeated integral may be precisely defined by

∫ ... ∫ f dμ1 dμ2 ... dμn = ∫_{Xn} f^{(n)}(xn) dμn(xn)

where f^{(1)} = f and the f^{(i)} are defined inductively on Xi × ... × Xn by

f^{(i+1)}(x_{i+1}, ..., xn) = ∫_{Xi} f^{(i)}_{x_{i+1},...,xn}(xi) dμi(xi).

To show the equality of ∫ f dλn and the repeated integral we regard f as a function f* on Y_{n–1} × Xn by writing f*{(x1, ..., x_{n–1}), xn} = f(x1, ..., xn); i.e. f* = fT where T denotes the mapping used above. T is a measurable transformation (Ex. 7.30) and thus by Theorem 4.6.1 and the fact that λn = (λ_{n–1} × μn)T^{–1},

∫_{Yn} f dλn = ∫_{Yn} f d(λ_{n–1} × μn)T^{–1} = ∫_{Y_{n–1}×Xn} fT d(λ_{n–1} × μn)
 = ∫_{Y_{n–1}×Xn} f* d(λ_{n–1} × μn) = ∫_{Xn} {∫_{Y_{n–1}} f*_{xn} dλ_{n–1}} dμn(xn)

by Fubini’s Theorem for nonnegative functions. But f*_{xn} is a function on Y_{n–1} whose value at (x1, ..., x_{n–1}) is f(x1, ..., xn) and hence f*_{xn} = f_{xn}. Thus

∫_{Yn} f dλn = ∫_{Xn} {∫_{Y_{n–1}} f_{xn} dλ_{n–1}} dμn(xn).

The inner integral on the right (with respect to λ_{n–1}) may clearly be reduced in the same way, and so on, leading to the repeated integral. (The precise notational details are indicated as Ex. 7.31.) Thus ∫ f dλn may be evaluated as a repeated integral in the indicated order. Similarly, any other order may be used (see e.g. Ex. 7.32).

Fubini’s Theorem for L1-functions also generalizes in the obvious way to the case of a product of n measure spaces. We state this together with a summary of the above discussion as a theorem.

Theorem 7.7.2 (Fubini, n factors) Let (Xi, Si, μi) be σ-finite measure spaces for i = 1, ..., n, and denote their product by (Yn, Tn, λn). Let f be a Tn-measurable function defined on Yn.

(i) If f is nonnegative then ∫ f dλn may be expressed as a repeated integral in any chosen order (e.g. ∫ ... ∫ f dμ1 dμ2 ... dμn). In particular the repeated integrals taken in any two distinct orders have the same value.

(ii) The same conclusions hold if f ∈ L1(Yn, Tn, λn). This latter condition is equivalent (by (i)) to the finiteness of any repeated integral of |f|, e.g. ∫ ... ∫ |f| dμ1 ... dμn < ∞.
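For purely atomic measures the n-factor theorem reduces to the statement that an iterated finite sum may be carried out in any order. The following sketch checks all 3! orders of a three-factor repeated sum against the product-measure sum; the spaces and f are invented for the illustration.

```python
from itertools import permutations

# Discrete sketch of Theorem 7.7.2: repeated integrals of a nonnegative f over
# a product of three point-mass measure spaces agree in every order.
spaces = [
    {0: 1.0, 1: 0.5},             # (X1, mu1)
    {0: 2.0, 1: 0.25, 2: 1.0},    # (X2, mu2)
    {0: 0.5, 1: 4.0},             # (X3, mu3)
]
f = lambda x1, x2, x3: x1 + x2 * x3 + 1   # nonnegative on X1 x X2 x X3

# Integral against the product measure lambda3 = mu1 x mu2 x mu3.
total = sum(m1 * m2 * m3 * f(x1, x2, x3)
            for x1, m1 in spaces[0].items()
            for x2, m2 in spaces[1].items()
            for x3, m3 in spaces[2].items())

def repeated(order):
    """Repeated integral, integrating out the coordinates in the given order."""
    def step(fixed, remaining):
        if not remaining:
            return f(fixed[0], fixed[1], fixed[2])
        i = remaining[0]
        return sum(mass * step({**fixed, i: x}, remaining[1:])
                   for x, mass in spaces[i].items())
    return step({}, list(order))

for order in permutations(range(3)):
    assert abs(repeated(order) - total) < 1e-12
```

In the atomic case the assertion is just commutativity of finite sums; the force of the theorem is that the same bookkeeping survives the passage to general σ-finite measures.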

For each i = 1, 2, ..., n, let Xi = R the real line, Si = B the Borel sets of R, and mi = m Lebesgue measure. Write Rn for the n-dimensional Euclidean space X1 × X2 × ... × Xn, Bn for S1 × S2 × ... × Sn, the class of n-dimensional Borel sets, or the Borel sets of Rn, and mn for m1 × m2 × ... × mn, called n-dimensional Lebesgue measure, or Lebesgue measure on Rn. The completion of Bn with respect to mn is called the class of n-dimensional Lebesgue measurable sets, or Lebesgue measurable sets of Rn, and is denoted by Ln. (As for n = 2, Ln properly contains L × L × ... × L.)

7.8 Lebesgue–Stieltjes measures on Rn

The previous section concerned product measures, where the measure of a rectangle is the product of the measures of its sides. It is natural to consider more general measures (useful in particular for probability applications involving dependence) and we do so in this section in the context of the measurable space (Rn, Bn), where Rn is the n-dimensional Euclidean space and Bn is the class of Borel sets of Rn. As defined, Bn is the σ-field generated by the semiring of measurable rectangles E1 × E2 × ... × En where each Ei is a Borel set of R. It is also generated by an even simpler semiring: if a = (a1, a2, ..., an), b = (b1, b2, ..., bn), a ≤ b (i.e. ai ≤ bi for each i), let (a, b] denote the “bounded semiclosed interval” of Rn defined by (a, b] = (a1, b1] × (a2, b2] × ... × (an, bn]. It is not difficult to check that the class Pn of all such bounded semiclosed intervals is a semiring, and that its generated σ-ring is Bn (Ex. 7.33).

In Section 2.8 it was shown how a nondecreasing right-continuous function F(x) can be used to define a Lebesgue–Stieltjes measure on B, and conversely. In this section the procedure will be generalized to define measures on Bn. Such measures are of fundamental importance in the theory of probability and stochastic processes.

The measures on B obtained in Section 2.8 did not have to be finite, provided they took finite values on bounded intervals (and hence were σ-finite, of course). Here we consider, for simplicity, only finite measures (which will be sufficient for all our applications). The main result, an analog of Theorem 2.8.1, is as follows.

Theorem 7.8.1 (i) Let ν be a finite measure on Bn. Then there is a unique function F(x1, ..., xn) on Rn which is bounded, nondecreasing and right-continuous in each xi, tends to zero as any xi → –∞, and is such that

ν{(a, b]} = Σ* (–1)^{n–r} F(c1, ..., cn)

for all a = (a1, ..., an), b = (b1, ..., bn) with a ≤ b (ai ≤ bi, 1 ≤ i ≤ n), where the Σ* denotes that the sum is taken over all 2^n distinct terms with ci = ai or bi, i = 1, ..., n, and r is the number of ci equal to bi.

(ii) Conversely, let F(x1, ..., xn) be a function on Rn which is bounded, nondecreasing and right-continuous in each xi, tends to zero as any xi → –∞, and satisfies the condition

Σ* (–1)^{n–r} F(c1, ..., cn) ≥ 0

for all a ≤ b in Rn, with the notation as in (i). Then there is a unique finite measure μF on Bn such that

μF{(a, b]} = Σ* (–1)^{n–r} F(c1, ..., cn)

for all a ≤ b. In particular for all x = (x1, ..., xn),

μF{(–∞, x]} = F(x1, ..., xn)

where (–∞, x] = (–∞, x1] × ... × (–∞, xn].

Proof (i) Define F on Rn by F(x1, ..., xn) = ν{(–∞, x]}, x = (x1, ..., xn) ∈ Rn. It is easily verified that F is bounded, nondecreasing, right-continuous, and that F(x1, ..., xn) → 0 as any xi → –∞. In order to express ν{(a, b]} in terms of F note that if Ai = (–∞, ai] and Bi = (–∞, bi] then for each x = (x1, ..., xn) ∈ Rn,

χ_{(a,b]}(x) = Π_{i=1}^n {χ_{Bi}(xi) – χ_{Ai}(xi)} = Σ* (–1)^{n–r} χ_{C1}(x1) ... χ_{Cn}(xn)

where Ci = (–∞, ci] = Ai or Bi and the notation for Σ* and r is as in (i) of the theorem statement. It follows that

ν{(a, b]} = ∫_{Rn} χ_{(a,b]} dν = Σ* (–1)^{n–r} F(c1, ..., cn).

Since letting a → –∞ in the last expression shows that ν{(–∞, b]} = F(b1, ..., bn), it follows that F is uniquely determined by ν.

(ii) Define the nonnegative set function μF on the semiring Pn of intervals (a, b], a ≤ b, by

μF{(a, b]} = Σ* (–1)^{n–r} F(c1, ..., cn).

Notice that when a = b this gives μF(∅) = 0. It is shown in Lemma 7.8.2 (below) that μF is finitely additive on Pn. Now let I = ∪_{k=1}^∞ I_k where I, I_k ∈ Pn and the I_k’s are disjoint. Then it is shown in Lemma 7.8.3 that μF(I) ≤ Σ_{k=1}^∞ μF(I_k), and it is easily seen (Ex. 2.18) that for each K, Σ_{k=1}^K μF(I_k) ≤ μF(I) and hence Σ_{k=1}^∞ μF(I_k) ≤ μF(I). Thus μF(I) = Σ_{k=1}^∞ μF(I_k) and μF is countably additive on Pn. Since μF is clearly finite on Pn, by the extension theorem (Theorem 2.5.4) μF has a unique extension to a finite measure on S(Pn) = Bn.

The following two lemmas were used in the proof of the theorem.

Lemma 7.8.2 Let F be as in (ii) of Theorem 7.8.1, and define the set function μF on Pn by μF(∅) = 0 and μF(a, b] = Σ* (–1)^{n–r} F(c1, c2, ..., cn) for all a ≤ b. Then μF is a (nonnegative) finitely additive set function on Pn.

Proof For simplicity of notation consider the two-dimensional case – the general one follows inductively. Let I0 ∈ P2, I0 = ∪_{k=1}^K I_k where the I_k are disjoint sets of P2.

Suppose first that the rectangles I_k occur in “regular stacks”. Specifically this means that the union may be written as I0 = ∪_{i=1}^M ∪_{j=1}^N E_{ij} where I0 = (a0, aM] × (b0, bN], E_{ij} = (a_{i–1}, a_i] × (b_{j–1}, b_j], each I_k being one of the terms E_{ij} in the union. Then for fixed i,

Σ_{j=1}^N μF(E_{ij}) = Σ_{j=1}^N [F(a_i, b_j) – F(a_i, b_{j–1})] – Σ_{j=1}^N [F(a_{i–1}, b_j) – F(a_{i–1}, b_{j–1})]
 = F(a_i, bN) – F(a_i, b0) – [F(a_{i–1}, bN) – F(a_{i–1}, b0)]

so that

Σ_{i=1}^M Σ_{j=1}^N μF(E_{ij}) = Σ_{i=1}^M [F(a_i, bN) – F(a_{i–1}, bN)] – Σ_{i=1}^M [F(a_i, b0) – F(a_{i–1}, b0)]
 = F(aM, bN) – F(a0, bN) – F(aM, b0) + F(a0, b0)

which gives μF(I0) = Σ_{ij} μF(E_{ij}) = Σ_{k=1}^K μF(I_k) for this “stacked rectangle” case.

The general case may be reduced to the stacked one as follows. If I_k = (α_k, α′_k] × (β_k, β′_k], denote the distinct ordered values of α1, α′1, α2, α′2, ..., αK, α′K (in increasing order of size) by a0, a1, ..., aM, and those of β1, β′1, ..., βK, β′K by b0, b1, ..., bN. Then I0 is the union of the disjoint intervals (a_{i–1}, a_i] × (b_{j–1}, b_j] and by the above μF(I0) = Σ_{i=1}^M Σ_{j=1}^N μF{(a_{i–1}, a_i] × (b_{j–1}, b_j]}. But each I_k is a disjoint union of a certain stacked group of these intervals and μF(I_k) is therefore just the sum of the corresponding terms μF{(a_{i–1}, a_i] × (b_{j–1}, b_j]}. Hence μF(I0) = Σ_{k=1}^K μF(I_k), as required.

Lemma 7.8.3 Under the same conditions and notation as Lemma 7.8.2, if I ∈ Pn, I_k ∈ Pn, k = 1, 2, ..., and I ⊂ ∪_{k=1}^∞ I_k, then μF(I) ≤ Σ_{k=1}^∞ μF(I_k).

Proof Write I_k = (a_k, b_k], I = (a0, b0], and let h = (h, h, ..., h) denote the vector with all components equal to h > 0. The right-continuity of F implies that μF{(a, b + h]} ↓ μF{(a, b]} as h ↓ 0. Hence for each k, h_k > 0 may be chosen so that

μF{(a_k, b_k + h_k]} ≤ μF(I_k) + ε/2^k

where ε > 0 is given. Now for any h > 0, [a0 + h, b0] ⊂ ∪_{k=1}^∞ (a_k, b_k + h_k) and hence by the Heine–Borel Theorem, for some K,

(a0 + h, b0] ⊂ [a0 + h, b0] ⊂ ∪_{k=1}^K (a_k, b_k + h_k) ⊂ ∪_{k=1}^K (a_k, b_k + h_k].

It is easy to see from this and Lemma 7.8.2 (cf. Ex. 2.18) that

μF{(a0 + h, b0]} ≤ Σ_{k=1}^K μF{(a_k, b_k + h_k]} ≤ Σ_{k=1}^∞ μF(I_k) + ε

from which the desired conclusion follows simply by letting first ε ↓ 0 and then h ↓ 0, since the right-continuity of F implies that μF{(a0 + h, b0]} → μF{(a0, b0]}.
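The alternating vertex sum of Theorem 7.8.1 is easy to compute mechanically: run over the 2^n vertices c of (a, b], attach the sign (–1)^{n–r} where r is the number of coordinates with ci = bi, and sum. The sketch below does this, and checks the product case F(x, y) = G1(x)G2(y), where the sum should collapse to a product of one-dimensional increments. The particular G1, G2 are invented for the illustration.

```python
from itertools import product

# Sketch of the vertex formula in Theorem 7.8.1:
#   mu_F{(a, b]} = sum over vertices c of (-1)^(n-r) F(c),
# where r = number of coordinates with c_i = b_i.
def mu_F(F, a, b):
    n = len(a)
    total = 0.0
    for choice in product((0, 1), repeat=n):           # 0 -> a_i, 1 -> b_i
        c = [b[i] if choice[i] else a[i] for i in range(n)]
        r = sum(choice)                                 # number of c_i equal to b_i
        total += (-1) ** (n - r) * F(*c)
    return total

# Two-dimensional product case: F(x, y) = G1(x) G2(y).
G1 = lambda x: max(0.0, min(x, 1.0))    # bounded, nondecreasing, right-continuous
G2 = lambda y: 0.0 if y < 0 else 1.0    # unit jump at 0
F = lambda x, y: G1(x) * G2(y)

a, b = (0.25, -1.0), (0.75, 2.0)
expected = (G1(b[0]) - G1(a[0])) * (G2(b[1]) - G2(a[1]))   # 0.5 * 1.0
assert abs(mu_F(F, a, b) - expected) < 1e-12
```

The collapse to a product of increments in this case is exactly the content of Ex. 7.35: μF is then the product of the one-dimensional Lebesgue–Stieltjes measures determined by G1 and G2.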

The measure μF constructed in Theorem 7.8.1 (ii) is called the Lebesgue–Stieltjes measure on Bn corresponding to the function F. The expression of μF{(a, b]} in terms of F becomes quite involved for large n but may be described as the sum, with alternating signs, of the values of F at the vertices (c1, ..., cn) of the interval (a, b] (this is easily seen pictorially for n = 2). μF{(a, b]} may also be expressed as a generalized difference of values of F (see Ex. 7.34).

Note that while the function F has been assumed bounded, the discussion may be generalized to the case where F is not bounded, but real-valued, yielding a σ-finite measure μF. As noted before, however, the case where μF is finite will be the most useful one in applications to probability. Further, a common special case of the above discussion occurs when F(x1, x2, ..., xn) = G1(x1)G2(x2) ... Gn(xn) where each Gi is a nondecreasing, bounded, right-continuous function on R with Gi(–∞) = 0. It should be verified (Ex. 7.35) that μF = μ_{G1} × μ_{G2} × ... × μ_{Gn}, i.e. the n-fold product of the Lebesgue–Stieltjes measures μ_{Gi} determined by each Gi on B. This measure is useful in probability theory for dealing with independent random variables.

The final result of this section (used in the next) establishes “regularity” of finite measures on Bn, closely approximating a set B ∈ Bn in measure “from without” by an open set, and “from within” by a bounded closed (i.e. compact) set. While this is topological in nature (and capable of substantial generalization) only the very simplest and most familiar concepts of open and closed sets in Rn will be needed in this context.

Lemma 7.8.4 (Regularity) Let μ be a finite measure on (Rn, Bn). Given B ∈ Bn and ε > 0 there is an open set G and a bounded closed set F such that F ⊂ B ⊂ G and μ(B – F) < ε, μ(G – B) < ε.

Proof Since the semiring of rectangles (a1, b1] × (a2, b2] × ... × (an, bn] generates Bn (Ex. 7.33) it follows from the extension procedure of Section 2.5 that rectangles Bi, i = 1, 2, ... of this form exist with ∪_{i=1}^∞ Bi ⊃ B and Σ_{i=1}^∞ μ(Bi) < μ(B) + ε/2. The sides of the rectangles may clearly be extended to give rectangles Ei ⊃ Bi with open sides and such that μ(Ei) < μ(Bi) + ε/2^{i+1}. Hence G = ∪_{i=1}^∞ Ei is an open set with G ⊃ B, and μ(G) ≤ Σ μ(Ei) ≤ Σ μ(Bi) + ε/2 < μ(B) + ε.

To define the bounded closed set F, note that the above result may be applied to B^c to give an open set U ⊃ B^c with μ(U) < μ(B^c) + ε, so that clearly μ(U^c) > μ(B) – ε (e.g. μ(U^c) = μ(Rn) – μ(U)). If I_r = [–r, r] × [–r, r] × ... × [–r, r] (= [–r, r]^n), then I_r ↑ Rn as r → ∞ so that U^c ∩ I_r ↑ U^c and hence μ(U^c ∩ I_r) → μ(U^c). Thus for some N, μ(I_N ∩ U^c) > μ(B) – ε and the proof is completed on writing F for the bounded closed set I_N ∩ U^c.

7.9 The space (RT, BT)

In previous sections, product and other Lebesgue–Stieltjes measures on finite-dimensional product spaces were investigated. We now consider infinite product spaces in this section, and corresponding product measures in the next, as well as general (not necessarily product) measures on them. For simplicity we will deal with the case where all component measurable spaces are copies of the real line with its Borel sets. This is the most interesting case in connection with the theory of probability and stochastic processes. However, all results of these sections are also valid for more general component measurable spaces (which, incidentally, need not be copies of the same measurable space) satisfying certain topological conditions.

Let T be an arbitrary (index) set. It may be convenient to think of T as time, i.e. a subset of R, and draw pictures – but no conditions will be imposed on T throughout this section. For each t ∈ T let the measurable space (Xt, St) be a copy of the real line R with its Borel sets, i.e.

(Xt, St)=(R, B) for all t ∈ T.

Recall that the finite-dimensional (Cartesian) product Π_{i=1}^n X_{ti} = X_{t1} × ... × X_{tn} is the set {(x(t1), ..., x(tn)) : x(ti) ∈ R, i = 1, ..., n}, in other words the set of all real-valued functions on the set (t1, ..., tn). Similarly the product of the spaces Xt, t ∈ T, is defined to be the set of all real-valued functions on T, denoted by

RT = Π_{t∈T} Xt

and called the function space on T. Each element x in RT is a real-valued function x(t) defined on T, and each x(t) is called a coordinate of x, or the t-coordinate of x.

The first task is to define the product σ-field of the σ-fields St, t ∈ T, for which the following notation will be used.

Let u = (t1, t2, ..., tn) denote an ordered n-tuple of distinct points ti ∈ T (with “order” denoting only that t1 is the first element, t2 the second, and so on – not a size ordering, since the set T may not be “size ordered” in any sense). In particular, for distinct t1, t2, the tuples (t1, t2) and (t2, t1) are different 2-tuples.

For u = (t1, t2, ..., tn) write

Ru = Π_{i=1}^n X_{ti} = X_{t1} × ... × X_{tn} (= Rn)
Bu = Π_{i=1}^n S_{ti} = S_{t1} × ... × S_{tn} (= Bn).

The projection map πu from RT onto Ru is defined by

πu(x) = (x(t1), ..., x(tn)) for all x ∈ RT.

If v = (s1, s2, ..., sk) is another such k-tuple, and k ≤ n, define v ⊂ u to mean that each element sj of v is one of the ti in u (not necessarily in the same order), i.e. sj = t_{τj} say, 1 ≤ j ≤ k. Then we define the “projection mapping” πu,v from Ru to Rv by

πu,v(x(t1), x(t2), ..., x(tn)) = (x(s1), x(s2), ..., x(sk))

noting that this involves both evaluation of x(t) at a subset of values of the tj and a possible permutation of their order. It is apparent that πu,v is a measurable mapping.

If as above v = (s1, s2, ..., sk) ⊂ u = (t1, t2, ..., tn) and sj = t_{τj}, 1 ≤ j ≤ k, then for x ∈ RT

πu,v πu x = πu,v(x(t1), ..., x(tn)) = (x(s1), ..., x(sk)) = πv x

so that πu,v πu = πv. To fix ideas, if u = (t1, t2, t3), v = (t1, t2) then πu,v(x(t1), x(t2), x(t3)) = (x(t1), x(t2)), and if u = (t1, t2), v = (t2, t1) then πu,v(x(t1), x(t2)) = (x(t2), x(t1)).

Now for fixed u = (t1, ..., tn) ⊂ T and B ∈ Bu the following subset of RT

C = {x ∈ RT : (x(t1), ..., x(tn)) ∈ B} = {x ∈ RT : πu x ∈ B} = πu^{–1} B

is called a cylinder set with base B at u = (t1, ..., tn). A cylinder with base at u is also a cylinder with base at any w ⊃ u, since if u = (t1, ..., tn), w = (s1, ..., s_{n+1}) (with tj = s_{τj}, 1 ≤ j ≤ n) and B ∈ Bu, then

πu^{–1} B = πw^{–1} πw,u^{–1} B,

and πw,u^{–1} B ∈ Bw, so this is a cylinder with base πw,u^{–1} B at w. The class of all cylinder sets with base at a given u is denoted by

C(u) = C(t1, ..., tn) = {πu^{–1} B : B ∈ Bu} = πu^{–1} Bu = π_{t1,...,tn}^{–1} Bu

and each C(u) is a σ-field (by Theorem 3.2.2). The class of all cylinder sets is denoted by C, and each set in C is called a cylinder set in RT. Thus

C = ∪_{u⊂T: u finite} C(u) = ∪_{n; t1,...,tn∈T} C(t1, ..., tn).

Lemma 7.9.1 C is a field.

Proof Let E1, E2 ∈ C. Then by the definition of C, we have Ei ∈ C(ui), i = 1, 2, where u1, u2 are ordered finite subsets of T. Let u = u1 ∪ u2, consisting of all the distinct elements of u1 and u2 in some arbitrary but fixed order. Then E1, E2 ∈ C(u), and since C(u) is a σ-field it follows that E1 ∪ E2 and E1^c belong to C(u) and hence to C, so that C is a field.

The σ-field generated by the field C is called the product σ-field of St, t ∈ T, or the product σ-field in RT, and is denoted by

BT = Π_{t∈T} St = S(C).

Note that for each ordered finite subset u = (t1, ..., tn) of T the projection map πu is a measurable transformation from (RT, BT) onto (Ru, Bu), since for each B ∈ Bu we have πu^{–1} B ∈ C(u) ⊂ C ⊂ BT. When u consists of a single point, u = {t}, πu = πt is called the evaluation function at t since πt(x) = x(t) for all x ∈ RT. It can easily be seen that BT is the σ-field of subsets of RT generated by the evaluation functions πt, t ∈ T, i.e. BT is the smallest σ-field of subsets of RT with respect to which all evaluation functions are measurable (Ex. 7.36).

When T is a countably infinite set, for example the set of positive integers, T = {1, 2, ...}, then RT becomes the set of all real sequences and we use instead the more suggestive notation R∞, B∞. R∞ is also called the (real) sequence space.

Even though, when T is an uncountable set, the function space (RT, BT) is clearly much larger than the sequence space (R∞, B∞), each measurable set in (RT, BT) essentially belongs to some (R∞, B∞) (Theorem 7.9.2). A corresponding statement holds for measurable functions on (RT, BT), and this property is often very useful in dealing with such functions. The projection maps and cylinder sets have been defined for ordered finite subsets u of T. The same definitions apply quite clearly when u is an ordered countable subset of T, u = (t1, t2, ...). Then the projection map πu from RT to Ru (= R∞) is defined by

πu(x) = (x(t1), x(t2), ...) for all x ∈ RT,

a cylinder set with base B ∈ Bu at u is the subset πu^{–1} B of RT, and the class of all cylinder sets at u is again denoted by C(u), and is given by C(u) = πu^{–1} Bu. For every ordered subset v of u the map πu,v from Ru to Rv is defined similarly, and by definition (i.e. applying the definition of BT to Bu),

Bu = σ(∪_{v⊂u: v finite} πu,v^{–1} Bv)

–1 B Ru since πu, are the cylinder sets at  in . The following result is not needed in the sequel but provides the useful characterization of measurable sets as cylinders with base in countably many dimensions referred to above. Theorem 7.9.2 With the above notation

T B = ∪{u⊂T: u countable} C(u). Hence if E ∈BT there is a countable subset S of T (depending on E) such that E ∈C(S). Further, if f is a BT -measurable function there is a countable subset S of T (depending on f ) such that f is C(S)-measurable. Proof For each ordered u ⊂ T, –1 u –1 –1  C(u)=π B = π σ ∪{ ⊂ } π B u u  u:  finite u, –1 –1  = σ ∪{ ⊂ } π π B  u:  finite u u, = σ ∪{⊂u:  finite} C() –1 –1 B –1B C C ⊂BT since πu πu, = π = (). Since for each finite , () , it follows that C(u) ⊂BT and thus

T E = ∪{u⊂T: u countable} C(u) ⊂B. In order to show the reverse inclusion BT ⊂Eit suffices to show that E is a σ-field containing C (since BT = S(C)). Each set in C is in some C(t , ..., t ) and hence of the form π–1 (B) for some B ∈Bn. But this 1 n (t1,...,tn) set may also be written as π–1 (B × R × R × ...) for any choice of (t1,...,tn,...) tn+1, tn+2, ..., and thus it belongs to C(t1, ..., tn, ...) and also to E, since B × R × R × ... ∈B∞. It follows that E contains C. We now show that E is a σ-field. For n =1,2,...,letEn ∈E. Then En ∈C(un) for some ∪∞ countable subset un of T.Ifu = n=1un then u is also a countable subset ∈C –1 ∈B∞ of T and En (u) for all n. Hence En = πu (Bn) for some Bn and ∪∞ –1 ∪∞ ∪∞ C n=1En = πu ( n=1Bn) implies that n=1En belongs to (u), and thus also to E so that E is closed under the formation of countable unions. Similarly, E is closed under complementation. Now let f be a BT -measurable function defined on RT . Then for a –1 rational r, f {–∞}, {x : f (x) ≤ r} belong respectively to C(u∞), C(ur) where u∞ and ur are countable subsets of T. Then u = u∞ ∪ (∪rur) is also a countable subset of T and f –1{–∞} ∈ C(u), {x : f (x) ≤ r}∈C(u) for each rational r,i.e.f is C(u)-measurable.  Theorem 7.9.2 shows that each set E ∈BT is of the form –1 { ∈ RT ∈ } E = πS B = x :(x(s1), x(s2), ...,) B 7.10 Measures on RT , Kolmogorov’s Extension Theorem 167

for some countable subset S = (s_1, s_2, ...) of T and some B ∈ B^∞, i.e. it can be described by conditions on a countable number of coordinates. Hence each B^T-measurable set, as well as each B^T-measurable function, depends only on a countable number of coordinates.
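This coordinate structure can be sketched concretely. The snippet below is purely illustrative (all names are hypothetical, not from the text): a point of R^T is just a function on T, and membership of a point in a cylinder set only ever inspects the coordinates listed in the cylinder's index tuple.

```python
# Hypothetical sketch: a cylinder set in R^T with finite index tuple u and a
# base predicate on R^u.  Membership of a "point" x : T -> R inspects only
# the coordinates listed in u, never the rest of the (huge) index set T.
def cylinder(u, base):
    """Indicator of the cylinder {x : (x(t1), ..., x(tn)) in base}."""
    return lambda x: base(tuple(x(t) for t in u))

x = lambda t: t * t                                  # the path x(t) = t^2 on T = R
E = cylinder((1.0, 3.0), lambda v: v[0] + v[1] > 5)  # depends on t = 1, 3 only
in_E = E(x)                                          # x(1) + x(3) = 10 > 5
```

Evaluating `E(x)` never touches x(t) for any t outside the finite tuple, mirroring how a B^T-measurable set constrains only countably many coordinates.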

7.10 Measures on R^T, Kolmogorov's Extension Theorem

This section concerns the construction of (probability) measures on the space (R^T, B^T) from probability measures on “finite-dimensional” subspaces. For each u = (t_1, ..., t_n) ⊂ T, π_u (as defined above) is a measurable transformation from (R^T, B^T) onto (R^u, B^u). Hence if μ is a probability measure on (R^T, B^T), each

ν_u = ν_{(t_1,...,t_n)} = μπ_u^{-1} = μπ_{(t_1,...,t_n)}^{-1}

is a probability measure on (R^u, B^u) = (R^n, B^n). The converse question is of interest in the theory of probability and stochastic processes, i.e. given for each ordered finite (nonempty) subset (t_1, ..., t_n) of T a probability measure ν_{(t_1,...,t_n)} on (R^n, B^n), is there a probability measure μ on (R^T, B^T) such that μπ_{(t_1,...,t_n)}^{-1} = ν_{(t_1,...,t_n)}? Note that if v ⊂ u and B ∈ B^v then

ν_u(π_{u,v}^{-1}B) = μ(π_u^{-1}π_{u,v}^{-1}B) = μ(π_v^{-1}B) = ν_v(B)

and thus ν_u π_{u,v}^{-1} = ν_v. This necessary (“consistency”) condition turns out to be sufficient as well, which is the main result of this section. For clarity the result will be shown in two parts and combined as Theorem 7.10.3.
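A small numerical illustration of the consistency condition (numbers below are made up): finite-dimensional distributions obtained by projecting one joint law are automatically consistent, since marginalizing ν_{(t1,t2)} over the second coordinate must recover ν_{(t1,)}.

```python
# Made-up joint law nu_{(t1,t2)} on a 2 x 2 grid; pushing it forward under the
# projection onto the first coordinate gives the one-dimensional law, i.e.
# nu_u . pi_{u,v}^{-1} = nu_v with u = (t1, t2) and v = (t1,).
nu_t1t2 = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}

def marginal(joint, keep):
    """Push a finite joint law forward under projection onto `keep` coordinates."""
    out = {}
    for point, p in joint.items():
        key = tuple(point[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

nu_t1 = marginal(nu_t1t2, keep=(0,))    # the (t1,)-marginal: {(0,): 0.4, (1,): 0.6}
```

Kolmogorov's theorem runs this in the other direction: given a consistent family of such finite-dimensional laws, it produces a single μ on (R^T, B^T) with these projections.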

Lemma 7.10.1 With the above notation let ν_u be a probability measure on (R^u, B^u) for each ordered finite subset u ⊂ T, assumed consistent as defined above. Then a set function μ may be defined unambiguously on the field C of cylinder sets by μ(E) = ν_u(B) when E ∈ C(u), E = π_u^{-1}(B). μ is a measure on each C(u) and is finitely additive on C.

Proof If E ∈ C, then E ∈ C(u) for some finite subset u of T and hence E = π_u^{-1}(B), B ∈ B^u. To show that μ is uniquely defined by μ(E) = ν_u(B) it is necessary to check that different representations for E give the same value for μ(E).

Thus let E ∈ C and suppose that E = π_u^{-1}B = π_v^{-1}C where B ∈ B^u, C ∈ B^v and u, v are finite subsets of T. Let w = u ∪ v. Then E ∈ C(w) so that E = π_w^{-1}D for some D ∈ B^w. Now π_w maps R^T onto R^w and it is simply shown that

D = π_w π_w^{-1}D = π_w E = π_w π_u^{-1}B = π_{w,u}^{-1}B,

since u ⊂ w implies π_{w,u}π_w = π_u, and by the consistency condition

ν_w(D) = ν_w π_{w,u}^{-1}(B) = ν_u(B).

Similarly it can be shown that ν_w(D) = ν_v(C). Hence ν_u(B) = ν_v(C) and μ is uniquely defined on C by μ(E) = ν_u(B). Now if E_i are disjoint sets of C(u), E_i = π_u^{-1}B_i where the B_i are disjoint sets of B^u. Hence ∪E_i = π_u^{-1}(∪B_i) and

μ(∪_{i=1}^∞ E_i) = ν_u(∪_{i=1}^∞ B_i) = Σ_{i=1}^∞ ν_u(B_i) = Σ_{i=1}^∞ μ(E_i).

Hence μ is a measure on C(u), for each finite u ⊂ T. Finally, to show finite additivity of μ on C it is sufficient to show additivity since C is a field. If E, F are disjoint sets of C, E ∈ C(u), F ∈ C(v) say, then both E and F belong to C(w) for w = u ∪ v. Since μ is a measure on C(w) it follows that μ(E ∪ F) = μ(E) + μ(F) as desired. □

The above result uses the given consistent measures on classes Bu to define an additive set function μ on C which is a measure on each C(u). This will be combined with the following result which shows that such a set function μ is actually a measure on the field C and hence may be extended to S(C). The proof may be recognized as a thinly disguised variant of that for Tychonoff’s Theorem for compactness of product spaces.

Theorem 7.10.2 Let μ be a finitely additive set function on C such that μ is a probability measure on C(u) for each finite set u ⊂ T. Then μ is a probability measure on C and hence may be extended to a probability measure on S(C) = B^T.

Proof Since μ is finitely additive, to show countable additivity it is sufficient by Theorem 2.2.6 to show that μ is continuous from above at ∅, i.e. that μ(E_n) → 0 for any decreasing sequence of sets E_n ∈ C with ∩_{n=1}^∞ E_n = ∅. Equivalently it is sufficient to assume (as we now do) that the E_n are decreasing sets of C with μ(E_n) ≥ h for some h > 0 and show that ∩_{n=1}^∞ E_n ≠ ∅.

Now E_n ∈ C(u_n) where (replacing u_n by ∪_{k=1}^n u_k) it may be assumed that u_1 ⊂ u_2 ⊂ u_3 ⊂ ..., u_j = (t_1, t_2, ..., t_{n_j}) say, and ∪u_j = (t_1, t_2, ...). By Lemma 7.8.4 the base of the cylinder E_n contains a bounded closed subset approximating it in ν_{u_n} (= μπ_{u_n}^{-1})-measure. Thus a cylinder F_n ⊂ E_n may be constructed with bounded closed base in R^{u_n}, and such that μ(E_n – F_n) < h/2^{n+1}. The (decreasing) cylinders C_n = ∩_{r=1}^n F_r have bounded closed bases B_n in R^{u_n} and

(E_n – C_n) = ∪_{r=1}^n (E_n – F_r) ⊂ ∪_{r=1}^n (E_r – F_r)

so that (since μ is additive and thus also monotone), μ(E_n – C_n) ≤ Σ_{r=1}^n μ(E_r – F_r) ≤ h/2, giving

μ(C_n) = μ(E_n) – μ(E_n – C_n) ≥ h/2 > 0

from which it follows that no C_n is empty. Thus for each j, C_j contains a point x_j say, so that the point (x_j(t_1), ..., x_j(t_{n_j})) of R^{u_j} belongs to the bounded closed base B_j of the cylinder C_j ⊂ E_j.

If Σ denotes a subsequence {j_r} of the positive integers (with j_1 < j_2 < j_3 < ...) and a_j is a sequence of real numbers we shall write “{a_j : j ∈ Σ} converges” to mean that a_{j_r} converges as r → ∞.

Now the sequence {x_j(t_1)}_{j=1}^∞ of bounded (since x_j ∈ C_1) real numbers has a convergent subsequence. That is, there is a subsequence Σ_1 of the positive integers such that {x_j(t_1) : j ∈ Σ_1} converges. Similarly a subsequence of {x_j(t_2) : j ∈ Σ_1} converges and hence Σ_1 has a subsequence Σ_2 such that {x_j(t_2) : j ∈ Σ_2} converges. Proceeding in this way we obtain subsequences Σ_s of the positive integers such that Σ_1 ⊃ Σ_2 ⊃ Σ_3 ⊃ ... and {x_j(t_s) : j ∈ Σ_s} converges. Form now the “diagonal subsequence” Σ of positive integers consisting of the first member of Σ_1, the second of Σ_2, and so on. Clearly {x_j(t_s) : j ∈ Σ} converges for each s. Writing Σ = {r_k} this means that x_{r_k}(t_s) converges to a limit, y_s say, as k → ∞, for each s. Let y be any element of R^T such that y(t_s) = y_s, s = 1, 2, ....
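The diagonal construction just used is purely combinatorial, and a tiny illustrative sketch (not part of the proof; the nested sequences below are made up) may make it concrete: the diagonal takes the s-th member of the s-th nested subsequence, so its tail from position s on lies inside Σ_s.

```python
# Illustrative diagonal subsequence: sigmas[s] plays the role of Sigma_{s+1},
# each a (sorted) subsequence of the previous one; the diagonal picks the
# (s+1)-th term of sigmas[s].
def diagonal(sigmas):
    return [sigmas[s][s] for s in range(len(sigmas))]

sigma1 = list(range(1, 101))    # 1, 2, ..., 100
sigma2 = sigma1[::2]            # a subsequence of sigma1
sigma3 = sigma2[::2]            # a subsequence of sigma2
diag = diagonal([sigma1, sigma2, sigma3])
```

From the s-th position on the diagonal is a subsequence of Σ_s, which is why {x_j(t_s) : j ∈ Σ} inherits convergence for every s at once.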

Since (x_j(t_1), ..., x_j(t_{n_1})) belongs to the base B_1 of C_1 for every j and B_1 is closed, it follows that (y(t_1), ..., y(t_{n_1})) = (y_1, ..., y_{n_1}) ∈ B_1 and hence y ∈ C_1. In a similar way we may show that y ∈ C_2, y ∈ C_3 and so on. That is

y ∈ ∩_{j=1}^∞ C_j ⊂ ∩_{j=1}^∞ F_j ⊂ ∩_{j=1}^∞ E_j,

showing that ∩_{j=1}^∞ E_j ≠ ∅ and thus completing the proof. □

The main theorem now follows by combining the last two results.

Theorem 7.10.3 (Kolmogorov's Extension Theorem) Let T be an arbitrary set and for each ordered finite subset u of T let ν_u be a probability measure on (R^u, B^u). If the family {ν_u : u an ordered finite subset of T} is consistent, in the sense that ν_u π_{u,v}^{-1} = ν_v whenever v ⊂ u, then there is a unique probability measure μ on (R^T, B^T) such that for all finite subsets u of T, μπ_u^{-1} = ν_u.

Proof The set function μ defined as in Lemma 7.10.1 satisfies the conditions of Theorem 7.10.2 and hence is a probability measure on the field C, so that it has an extension to a probability measure on S(C) = B^T. If λ is another probability measure on C with λπ_u^{-1} = ν_u then λ = μ on C(u) for each finite u, so that λ = μ on C and hence on S(C) = B^T by the uniqueness of the extension from C to S(C). □

Corollary If for each t ∈ T, μ_t is a probability measure on (X_t, S_t) = (R, B), there is a unique probability measure μ on (R^T, B^T) such that for each u = (t_1, ..., t_n) ⊂ T,

μπ_u^{-1} = μ_{t_1} × ... × μ_{t_n}.

Proof Define ν_u = μ_{t_1} × ... × μ_{t_n} on (R^u, B^u).

Let v ⊂ u and assume for simplicity of notation that v = (t_1, ..., t_k), 1 ≤ k ≤ n. Then for each B ∈ B^v, π_{u,v}^{-1}B = B × X_{t_{k+1}} × ... × X_{t_n} and

(ν_u π_{u,v}^{-1})(B) = ν_u(π_{u,v}^{-1}B)
                     = (μ_{t_1} × ... × μ_{t_n})(B × X_{t_{k+1}} × ... × X_{t_n})
                     = (μ_{t_1} × ... × μ_{t_k})(B) μ_{t_{k+1}}(X_{t_{k+1}}) ... μ_{t_n}(X_{t_n})

= ν_v(B).

Thus the family of probability measures {ν_u : u an ordered finite subset of T} is consistent, and the conclusion follows from Kolmogorov's Extension Theorem. □

The measure μ in this corollary is denoted by

μ = ∏_{t∈T} μ_t.
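For a finite index set the corollary is elementary, and a hedged numerical sketch (marginals and events below are made up) shows what μπ_u^{-1} = μ_{t_1} × ... × μ_{t_n} says for a rectangle base: the cylinder's measure is the product of the coordinate measures.

```python
# For mu = prod_t mu_t, the measure of a cylinder with rectangle base
# B1 x ... x Bn equals mu_{t1}(B1) * ... * mu_{tn}(Bn).
from functools import reduce

def product_cylinder_prob(marginals, events):
    """marginals[i]: dict outcome -> probability; events[i]: set of outcomes."""
    factors = [sum(p for x, p in m.items() if x in ev)
               for m, ev in zip(marginals, events)]
    return reduce(lambda a, b: a * b, factors, 1.0)

mu_t = {0: 0.5, 1: 0.5}          # a fair-coin marginal mu_t at every index t
p = product_cylinder_prob([mu_t] * 3, [{1}, {0, 1}, {1}])   # 0.5 * 1.0 * 0.5
```

This is the probabilistic content of the corollary: independent coordinates, each with its own marginal law.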

In fact this corollary holds if (Xt, St) is an arbitrary measurable space for each t, in contrast to the topological nature of Theorem 7.10.3, where the product space and product σ-field definitions extend those for the above real line cases in obvious ways e.g. as stated in the following theorem. (For proof see e.g. [Halmos, Theorem 38 B].)

Theorem 7.10.4 Let (X_i, S_i, μ_i) be a sequence of measure spaces with μ_i(X_i) = 1 for all i. Then there exists a unique measure μ on the σ-field S = ∏_{i=1}^∞ S_i such that for every measurable set E of the form A × ∏_{i=n+1}^∞ X_i,

μ(E)=(μ1 × μ2 × ...× μn)(A).

Exercises

7.1 If S, T are σ-rings on spaces X, Y respectively and A, B are nonempty subsets of X, Y respectively, show that A × B ∈ S × T if and only if A ∈ S, B ∈ T (i.e. a rectangle A × B belongs to S × T if and only if it is a member of the semiring P (cf. Lemma 7.1.1)).

7.2 Let X = Y be the same uncountable set and let the σ-rings S = T each be the class of all countable subsets of X, Y respectively. What is S × T?

7.3 In Ex. 7.2 let D denote the “diagonal” in X × Y, i.e. D = {(x, y) : x = y}. Show that D_x ∈ T, D^y ∈ S if x ∈ X, y ∈ Y, but that D ∉ S × T (cf. Theorem 7.1.3).

7.4 Show that the functions f(x, y) = x, g(x, y) = y defined on the plane R^2 are B^2-measurable. Hence show that the “diagonal” D = {(x, y) : x = y} is a Borel set of the plane.

7.5 Let R be the real line, B the Borel sets of R and L the Lebesgue measurable sets of R, i.e. L = B̄, the completion of B with respect to Lebesgue measure. Assuming that there is a Lebesgue measurable set which is not a Borel set (cf. Halmos, Exs. 15.6, 19.4) show that B × B ⊂ L × L but B × B ≠ L × L. Is L × L the class of two-dimensional Lebesgue measurable sets defined in Section 7.6, i.e. is L × L the completion of B × B? (Assume that there is a set E ⊂ R which is not Lebesgue measurable (cf. Halmos, Theorem 16.D) and use Ex. 7.1 applied to the set {x} × E for some fixed x.)

7.6 Let f be a real-valued function defined on R^2 such that each f_x is Borel measurable on R, and each f^y is continuous on R. Show that f is Borel measurable on R^2. (Hint: For n = 1, 2, ..., define f_n(x, y) = f(k/2^n, y) for k/2^n < x ≤ (k+1)/2^n, k = 0, ±1, ±2, ..., and show that f_n → f on R^2.)

7.7 Let E ⊂ R^2 be such that each E^y is a Lebesgue measurable set in R and {E^y, –∞ < y < ∞} form a monotone increasing (or decreasing) family, i.e. E^y ⊂ E^{y′} whenever y < y′. Show that E is a Lebesgue measurable set in R^2. (Hint: Fix any I = [a, b], –∞ < a < b < ∞, define the Lebesgue measurable sets F_n, G_n, n = 1, 2, ..., of R^2 by

F_n = ∪_k (E^{y_{k,n}} × {y : y_{k,n} ≤ y < y_{k+1,n}}) ∩ (I × I),
G_n = ∪_k (E^{y_{k+1,n}} × {y : y_{k,n} < y ≤ y_{k+1,n}}) ∩ (I × I)

for k = 0, 1, ..., 2^n – 1, where y_{k,n} = a + (b – a)k2^{-n}, and show that F_n ↑ F, G_n ↓ G, F ⊂ E ∩ (I × I) ⊂ G and (G – F) has Lebesgue measure zero.)

7.8 Let f be a real-valued function defined on R^2 such that each f_x is Lebesgue measurable on R, and each f^y is monotone on R. Show that f is Lebesgue measurable on R^2. (Hint: If all f^y's are increasing (or decreasing) the result follows from Ex. 7.7. The general case follows by showing that A = {y : f^y is increasing} and B = {y : f^y is decreasing} are Lebesgue measurable sets in R.)

7.9 Let f be a Borel measurable function on R^2 and g a Borel measurable function on R. Show that f(x, g(x)) is Borel measurable on R.

7.10 Let (X, S, μ), (Y, T, ν) and (X × Y, S × T, λ) be finite measure spaces. If

λ(E × F) = ∫_{E×F} f d(μ × ν)

for all E ∈ S, F ∈ T, for some nonnegative S × T-measurable function f on X × Y, then prove that λ is absolutely continuous with respect to μ × ν with Radon–Nikodym derivative f.

7.11 Let (X, S, μ) and (Y, T, ν) be σ-finite measure spaces. If E, F ∈ S × T and ν(E_x) = ν(F_x) for a.e. x (μ), show that (μ × ν)(E) = (μ × ν)(F).

7.12 Let (X, S, μ) and (Y, T, ν) be σ-finite measure spaces. If a subset E of X × Y is S × T-measurable and such that for every x ∈ X either ν(E_x) = 0 or ν(E_x^c) = 0, then prove that μ(E^y) is a constant a.e. (ν). (Hint: Show that μ(E^y Δ A) = 0 a.e. (ν), where A = {x : ν(E_x^c) = 0}.)

7.13 Let (X, S, μ) be a σ-finite measure space, let (Y, T, ν) be the real line R with Borel sets and Lebesgue measure, and let f_1 and f_2 be measurable functions on X. Prove that the set

E = {(x, y) ∈ X × Y : f_1(x) < y < f_2(x)}

is product measurable, i.e. E ∈ S × T, and that

(μ × ν)(E) = ∫_A (f_2 – f_1) dμ

where A = {x ∈ X : f_1(x) < f_2(x)}. In particular if f is a nonnegative measurable function on X then

(μ × ν){(x, y) ∈ X × Y : 0 < y < f(x)} = ∫_X f dμ.

What happens if “<” in the definition of E is replaced by “≤”?

7.14 Let (X, S, μ) be a σ-finite measure space, f a finite-valued nonnegative measurable function defined on X and for each t ≥ 0, E^t = {x : f(x) > t}. Let g be a nonnegative function defined on (0, ∞) and such that g ∈ L_1(0, a) for all a > 0, and define G(x) = ∫_0^x g(t) dt, x ≥ 0. Show that

∫_X G{f(x)} dμ(x) = ∫_0^∞ μ(E^t)g(t) dt

(applying Theorem 7.4.1 to E = {(x, t) ∈ X × [0, ∞) : 0 < t < f(x)}) and that, in particular,

∫_X f dμ = ∫_0^∞ μ(E^t) dt

(which may serve as a definition of the abstract Lebesgue integral ∫_X f dμ if the Lebesgue integral over (0, ∞) is defined), and for p > 1,

∫_X f^p dμ = p ∫_0^∞ μ(E^t) t^{p–1} dt.

7.15 Let (X, S, μ) and (Y, T, ν) be two finite measure spaces and {f_n}_{n=1}^∞, f be S × T-measurable functions defined on X × Y. If for a.e. y (ν)

f_n^y(x) → f^y(x) in μ-measure as n → ∞,

show the following.
(i) f_n → f in μ × ν-measure.
(ii) There is a subsequence {f_{n_k}}_{k=1}^∞ such that for a.e. x (μ)

f_{n_k,x}(y) → f_x(y) a.e. (ν) as k → ∞.
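The identity ∫_X f dμ = ∫_0^∞ μ(E^t) dt of Ex. 7.14 is easy to check numerically for a finite measure space. The sketch below (values of f are made up) uses counting measure on five atoms and a crude midpoint Riemann sum in t.

```python
# Check of the "layer cake" identity: for counting measure mu on five atoms,
# the integral of f equals the integral over t of mu(E^t), E^t = {x : f(x) > t}.
f = {0: 0.7, 1: 2.0, 2: 0.0, 3: 1.3, 4: 0.7}

lhs = sum(f.values())                      # ∫ f dμ = 4.7 for counting measure

n = 200000
dt = max(f.values()) / n
rhs = sum(dt * sum(1 for v in f.values() if v > (k + 0.5) * dt)
          for k in range(n))               # midpoint Riemann sum of μ(E^t)
```

Since t ↦ μ(E^t) is a step function here, the Riemann sum agrees with the left side up to an error of order dt per atom.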

7.16 Let μ be Lebesgue measure on (R, B), ν be “counting measure” on (R, B) (ν(E) is the number of points in the set E ∈ B), D be the diagonal of R^2 defined in Ex. 7.4, and f = χ_D. Evaluate ∫(∫f dμ) dν, ∫(∫f dν) dμ. What conclusion can you draw concerning Fubini's Theorem?

7.17 Let (X, S, μ), (Y, T, ν) be σ-finite measure spaces, let f(x) and g(y) be integrable functions on (X, S, μ) and (Y, T, ν) respectively, and define h on X × Y by h(x, y) = f(x)g(y). Show that h is integrable on (X × Y, S × T, μ × ν) and that

∫_{X×Y} h d(μ × ν) = ∫_X f dμ · ∫_Y g dν.

7.18 With the notation and assumptions of Ex. 4.22, show that g is Lebesgue integrable on the real line.

7.19 Let (X, S, μ) be a σ-finite measure space. Let Y be the set of positive integers, T the class of all subsets of Y, and ν counting measure on Y. If {f_n} is a sequence of nonnegative measurable functions on X, show by Fubini's Theorem that

∫_X (Σ_{n=1}^∞ f_n) dμ = Σ_{n=1}^∞ ∫_X f_n dμ (≤ ∞).

(Define g(n, x) = f_n(x) on Y × X and note that

{(n, x) : g(n, x) < c} = ∪_{m=1}^∞ ({m} × {x : f_m(x) < c}).)

This provides an alternative proof for the corollary to Theorem 4.5.2 but only when μ is σ-finite; a similar proof for Ex. 4.20 may be constructed.

7.20 Let {a_{n,m}}_{n,m=1}^∞ be a double sequence of real numbers. Show that the relations Σ_n Σ_m a_{nm} = Σ_m Σ_n a_{nm}, whenever a_{n,m} ≥ 0 for all n, m = 1, 2, ..., or Σ_n Σ_m |a_{nm}| < ∞, are special cases of Fubini's Theorem.

7.21 Continuing Theorem 7.2.3, assume that ν is a measure on W. Show that if λ_x ≪ ν a.e. (μ) then λ ≪ ν. Is the converse true? If λ and ν are σ-finite, λ_x ≪ ν a.e. (μ) and the Radon–Nikodym derivative (dλ_x/dν)(w) is measurable in (x, w), what additional assumption is needed in order to show that

(dλ/dν)(w) = ∫_X (dλ_x/dν)(w) dμ(x)?

7.22 Let (X, S) and (Y, T) be measurable spaces, μ and μ′ σ-finite measures on S, and ν and ν′ σ-finite measures on T. Show the following.
(i) If μ′ ≪ μ and ν′ ≪ ν, then μ′ × ν′ ≪ μ × ν and (d(μ′ × ν′)/d(μ × ν))(x, y) = (dμ′/dμ)(x)(dν′/dν)(y).
(ii) If μ′ ⊥ μ or ν′ ⊥ ν, then μ′ × ν′ ⊥ μ × ν.

(iii) If the subscripts 1 and 2 denote the absolutely continuous and the singular parts in the Lebesgue decomposition of μ′ (ν′, μ′ × ν′) with respect to μ (ν, μ × ν), then

(μ′ × ν′)_1 = μ′_1 × ν′_1 and (μ′ × ν′)_2 = μ′_1 × ν′_2 + μ′_2 × ν′_1 + μ′_2 × ν′_2.

7.23 Let f and g be functions defined on R and 1 ≤ p ≤ ∞. If f ∈ L_1(R) and g ∈ L_p(R) show that the integral defining the convolution (f ∗ g)(x) exists for a.e. x ∈ R. Show that f ∗ g ∈ L_p and

‖f ∗ g‖_p ≤ ‖f‖_1 ‖g‖_p.

7.24 Let M be the set of all finite signed measures on (R, B).
(i) Show that M is a Banach space with respect to the norm ‖ν‖ = |ν|(R), ν ∈ M.
(ii) Let ν, λ ∈ M and define the set function ν ∗ λ on B by

(ν ∗ λ)(B) = ∫_{-∞}^∞ ν(B – y) dλ(y)

for all B ∈ B, where B – y = {x – y : x ∈ B}. Show that ν ∗ λ ∈ M, ν ∗ λ = λ ∗ ν, ‖ν ∗ λ‖ ≤ ‖ν‖·‖λ‖, and that

∫_{-∞}^∞ f d(ν ∗ λ) = ∫_{-∞}^∞ ∫_{-∞}^∞ f(x + y) dν(x) dλ(y)

whenever either integral exists. (Hint: (ν ∗ λ)(B) = (ν × λ)(E) where E = {(x, y) : x + y ∈ B}.) If δ ∈ M denotes the measure with total mass 1 at 0 (i.e. δ({0}) = 1 and δ(B) = δ(B ∩ {0}), B ∈ B) show that for all ν ∈ M

ν ∗ δ = ν = δ ∗ ν.

(iii) If ν, λ ∈ M and m is Lebesgue measure, show the following. If ν ≪ m then ν ∗ λ ≪ m and

(d(ν ∗ λ)/dm)(x) = ∫_{-∞}^∞ (dν/dm)(x – y) dλ(y).

If ν, λ ≪ m then

d(ν ∗ λ)/dm = (dν/dm) ∗ (dλ/dm).

If ν and λ are discrete (see Section 5.7) then so is ν ∗ λ.

7.25 Prove the following form of the formula for integration by parts. If F and G are right-continuous functions of bounded variation on [a, b], –∞ < a < c < d < b < ∞, then

∫_{[c,d]} G(x) dF(x) + ∫_{[c,d]} F(x – 0) dG(x) = F(d)G(d) – F(c – 0)G(c – 0).
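A discrete analogue of the inequality in Ex. 7.23 (counting measure on the integers, p = 2) is easy to verify; the finitely supported sequences below are made up, and the computed norms must satisfy ‖f ∗ g‖_2 ≤ ‖f‖_1 ‖g‖_2.

```python
# Discrete convolution on the integers and the inequality of Ex. 7.23, p = 2.
f = [0.5, -1.0, 2.0]
g = [1.0, 0.25, -0.5, 3.0]

conv = [sum(f[i] * g[k - i] for i in range(len(f)) if 0 <= k - i < len(g))
        for k in range(len(f) + len(g) - 1)]

norm1_f = sum(abs(a) for a in f)                 # ||f||_1
norm2_g = sum(b * b for b in g) ** 0.5           # ||g||_2
norm2_conv = sum(c * c for c in conv) ** 0.5     # ||f * g||_2
```

The same triangle-inequality mechanism (Minkowski's integral inequality) drives the continuous case.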

7.26 If f ∈ L_1(a, b) and G is a right-continuous function of bounded variation on [a, b], show that fG ∈ L_1(a, b) and

∫_a^b f(x)G(x) dx = F(b)G(b) – ∫_{(a,b]} F(x) dG(x)

where F(x) = ∫_a^x f(t) dt.

7.27 Let f, g ∈ L_1(R),

F(x) = ∫_{-∞}^x f(t) dt, G(x) = ∫_{-∞}^x g(t) dt, –∞ < x < ∞,

and F(∞) = lim_{x→∞} F(x), G(∞) = lim_{x→∞} G(x). Show that

∫_{-∞}^∞ F(x)g(x) dx + ∫_{-∞}^∞ G(x)f(x) dx = F(∞)G(∞).

7.28 Let –∞ < a < b < ∞, F be a continuous nondecreasing function on [a, b], and G a continuous function of bounded variation on [a, b]. Show that there is a u, a ≤ u ≤ b, such that

∫_{[a,b]} F(x) dG(x) = F(a){G(u) – G(a)} + F(b){G(b) – G(u)}.

(Hint: Use Theorem 7.6.2 and the first mean value theorem for integrals, Ex. 4.4.) This is called the second mean value theorem for integrals. In particular, if F is as above and g ∈ L_1(a, b), then there is a u, a ≤ u ≤ b, such that

∫_a^b F(x)g(x) dx = F(a)∫_a^u g(x) dx + F(b)∫_u^b g(x) dx.

7.29 Let S, T be σ-rings of subsets of spaces X, Y respectively and let μ, ν be σ-finite measures on S, T. Use Theorem 7.2.1 to show that there exists a unique (σ-finite) measure λ on the σ-ring S × T such that λ(A × B) = μ(A)ν(B) for all A ∈ S, B ∈ T. (Hint: It is sufficient to show that if λ is defined on the semiring P of measurable rectangles A × B, A ∈ S, B ∈ T by λ(A × B) = μ(A)ν(B) and if A × B = ∪_1^∞ E_i for disjoint, nonempty E_i ∈ P then λ(A × B) = Σ_1^∞ λ(E_i). This follows very simply from the theorem by considering the spaces (A, S_0, μ_0), (B, T_0, ν_0) where S_0 is the σ-field S ∩ A = {F ∩ A : F ∈ S} of subsets of A, T_0 = T ∩ B and μ_0 = μ, ν_0 = ν on S_0, T_0 respectively.)

7.30 With the notation of Section 7.7 show that the mapping T((x_1 ... x_{n–1}), x_n) = (x_1, x_2, ..., x_n) is a measurable transformation from (Y_{n–1} × X_n, T_{n–1} × S_n) to (Y_n, T_n); i.e. that T^{-1}E ∈ T_{n–1} × S_n if E ∈ T_n.

7.31 In Section 7.7 (with the notation used there) it was shown that ∫_{Y_n} f dλ_n = ∫_{X_n} {∫_{Y_{n–1}} f_{x_n} dλ_{n–1}} dμ_n(x_n). Then the identity ∫_{Y_n} f dλ_n = ∫ ... ∫ f dμ_1 ... dμ_n can be shown as follows.
(i) Assume inductively that the result is true for integrals of functions of (n – 1) variables. Hence show that ∫_{Y_n} f dλ_n = ∫_{X_n} {∫ ... ∫ f_{x_n} dμ_1 ... dμ_{n–1}} dμ_n(x_n).

(ii) Check (from the precise definition of repeated integrals) that the right hand side is ∫ ... ∫ f dμ_1 ... dμ_n. Show inductively that

∫ ... ∫ f_{x_i,...,x_n}(x_1, ..., x_{i–1}) dμ_1 ... dμ_{i–1} = f^{(i)}(x_i, ..., x_n) = f^{(i)}_{x_{i+1},...,x_n}(x_i).

7.32 Let (X_i, S_i, μ_i) be σ-finite measure spaces, i = 1, 2, 3. Let f be a nonnegative measurable function on (X_1 × X_2 × X_3, S_1 × S_2 × S_3). If λ = μ_1 × μ_2 × μ_3 show that

∫f dλ = ∫∫∫ f dμ_2 dμ_1 dμ_3.

(Consider the transformation T of X_1 × X_2 × X_3 to X_2 × X_1 × X_3 given by T(x_1, x_2, x_3) = (x_2, x_1, x_3) and write f = f* ∘ T where f* is a certain function on X_2 × X_1 × X_3.)

7.33 Show that the class P^n of bounded semiclosed intervals (a, b] of R^n is a semiring which generates the σ-field of Borel sets of R^n.

7.34 Let μ be a finite measure on the σ-field B^n of Borel sets of R^n and F(x_1, x_2, ..., x_n) = μ{(–∞, x_1] × (–∞, x_2] × ... × (–∞, x_n]}. Show that the measure of an interval (a, b] may be written as

μ{(a, b]} = Δ_{h_1}^{(1)} Δ_{h_2}^{(2)} ... Δ_{h_n}^{(n)} F(a_1, a_2, ..., a_n)

where a = (a_1, a_2, ..., a_n), b = (b_1, b_2, ..., b_n), h_i = b_i – a_i and Δ_h^{(i)} is the difference operator defined by

Δ_h^{(i)} F(x_1, ..., x_n) = F(x_1, ..., x_{i–1}, x_i + h, x_{i+1}, ..., x_n) – F(x_1, ..., x_n).

7.35 For each i = 1, 2, ..., n, let G_i(x) be a bounded nondecreasing function on R which is right-continuous and such that lim_{x→–∞} G_i(x) = 0. If F(x_1, x_2, ..., x_n) = G_1(x_1)G_2(x_2) ... G_n(x_n) show that μ_F = μ_{G_1} × μ_{G_2} × ... × μ_{G_n}.

7.36 Show that B^T is the smallest σ-field of subsets of R^T with respect to which all evaluation functions π_t, t ∈ T, are measurable.

7.37 Let μ be a measure on (R^T, B^T) and let B̄^T be the completion of B^T with respect to μ. Show that if E ∈ B̄^T (respectively, f is a B̄^T-measurable function) there is a countable subset S of T such that E ∈ C̄(S) (respectively, f is C̄(S)-measurable) where C̄(S) is the completion of the σ-field C(S) with respect to the restriction of μ to C(S).

8

Integrating complex functions, Fourier theory and related topics

The intent of this short chapter is to indicate how the previous theory may be extended in an obvious way to include the integration of complex-valued functions with respect to a measure (or signed measure) μ on a measurable space (X, S). The primary purpose of this is to discuss Fourier and related transforms, which are important in a wide variety of contexts – and in particular in the Chapter 12 discussion of characteristic functions of random variables, which provide a standard and useful tool in summarizing their probabilistic properties.

Some standard inversion theorems will be proved here to help avoid overload of the Chapter 12 material. However, the methods of this chapter also apply to other diverse applications, e.g. to Laplace and related transforms used in fields such as physics as well as in probabilistic areas such as stochastic modeling, and may be useful for reference.

Finally it might be emphasized (as noted later) that the integrals considered here involve complex functions as integrands and, as for the preceding development, form a “Lebesgue-style” theory. This is in contrast to what is termed “complex variable” methodology, which is a “Riemann-style” theory in which integrals are considered with respect to a complex variable z along some curve in the complex plane. The latter methods – not considered here – can be especially useful in providing means for evaluation of integrals such as characteristic functions which may resist simple real variable techniques.

8.1 Integration of complex functions

Let (X, S, μ) be a measure space and f a complex-valued function defined on X with real and imaginary parts u, v: f(x) = u(x) + iv(x). f is said to be measurable if u and v are measurable functions.


We say f ∈ L_1(X, S, μ) if u and v both belong to L_1(X, S, μ) and write

∫f dμ = ∫u dμ + i∫v dμ.

As noted above this is not integration with respect to a complex variable, i.e. we are not considering contour integrals. The integral involves a complex-valued function, integrated with respect to a (real) measure on (X, S).

Many properties of integrals of real functions hold in the complex case also. Some of the most elementary and obvious ones are given in the following theorem.

Theorem 8.1.1 Let (X, S, μ) be a measure space and write L_1 = L_1(X, S, μ). Let f be a complex measurable function on X, f = u + iv. Then

(i) f ∈ L_1 if and only if |f| = (u^2 + v^2)^{1/2} ∈ L_1.
(ii) If f, g ∈ L_1 and α, β are complex, then αf + βg ∈ L_1 and ∫(αf + βg) dμ = α∫f dμ + β∫g dμ.
(iii) If f ∈ L_1 then |∫f dμ| ≤ ∫|f| dμ.

Proof (i) Measurability of |f| follows from that of u, v. Also it is easily checked that

|u|, |v| ≤ |f| = (u^2 + v^2)^{1/2} ≤ |u| + |v|

from which (i) follows in both directions.

(ii) is easily checked by expressing f, g, α, β in terms of their real and imaginary parts and applying the corresponding result for real functions.

(iii) is perhaps slightly more involved to show directly than one might imagine. Write z = ∫f dμ and z = re^{iθ}. Then

|∫f dμ| = r = e^{-iθ}z = e^{-iθ}∫f dμ = ∫(e^{-iθ}f) dμ.

But since this is real, the imaginary part of the integral must vanish, giving

|∫f dμ| = ∫ℜ[e^{-iθ}f] dμ (ℜ denoting “real part”) ≤ ∫|e^{-iθ}f| dμ = ∫|f| dμ

as required. □

Many of the simple results for real functions will be used for complex functions with little if any comment, in view of their obvious nature – e.g. Theorems 4.4.3, 4.4.6, 4.4.8, 4.4.9. Of course some results (e.g. Theorem 4.4.4) simply have no immediate generalization to complex functions.

For the most part the more important and sophisticated theorems also generalize in cases where the generalized statements have meaning. This is the case for Fubini's Theorem for L_1-functions (Theorem 7.4.2 (ii)), the “Transformation Theorem” (Theorem 4.6.1), Dominated Convergence (Theorem 4.5.5) and the uses of the Radon–Nikodym Theorem such as Theorem 5.6.1 (for complex integrable functions). It may be checked that these results follow from the real counterparts. As an example we prove the dominated convergence theorem in the complex setting.
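Part (iii) of Theorem 8.1.1 is easy to check numerically for a discrete measure; the atoms and complex values below are made up.

```python
# |∫ f dμ| ≤ ∫ |f| dμ for a complex f on a three-atom probability space.
weights = [0.2, 0.5, 0.3]                 # masses of the three atoms
f_vals = [1 + 2j, -0.5 + 1j, 2 - 1j]      # values of f at the atoms

integral = sum(w * v for w, v in zip(weights, f_vals))              # ∫ f dμ
integral_of_abs = sum(w * abs(v) for w, v in zip(weights, f_vals))  # ∫ |f| dμ
```

The inequality is strict here because the values of f point in different complex directions, so cancellation occurs in ∫f dμ but not in ∫|f| dμ.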

Theorem 8.1.2 (Dominated Convergence for complex sequences) Let {f_n} be a sequence of complex-valued functions in L_1(X, S, μ) such that |f_n| ≤ |g| a.e. where g ∈ L_1. Let f be a complex measurable function such that f_n → f a.e. Then f ∈ L_1 and ∫|f_n – f| dμ → 0. In particular ∫f_n dμ → ∫f dμ.

Proof Write f_n = u_n + iv_n, f = u + iv. Since f_n → f a.e. it follows that u_n → u, v_n → v a.e. Also |u_n| ≤ |g|, |v_n| ≤ |g|. Hence u, v ∈ L_1 by Theorem 4.5.5 (hence f ∈ L_1), and

∫|u_n – u| dμ → 0, ∫|v_n – v| dμ → 0.

Thus

∫|(u_n + iv_n) – (u + iv)| dμ ≤ ∫(|u_n – u| + |v_n – v|) dμ → 0

or ∫|f_n – f| dμ → 0 as required. Finally

|∫f_n dμ – ∫f dμ| = |∫(f_n – f) dμ| ≤ ∫|f_n – f| dμ

by Theorem 8.1.1 and thus the final statement follows. □
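A numerical illustration (not a proof) of Theorem 8.1.2, with a made-up example: f_n(x) = e^{ix/n} x e^{-x} on (0, ∞) is dominated by |f_n(x)| = x e^{-x} ∈ L_1 and converges pointwise to f(x) = x e^{-x}, so the L_1 distances ∫|f_n − f| dx should shrink to 0.

```python
import math

xs = [k * 0.01 for k in range(1, 3000)]     # crude grid on (0, 30)

def l1_dist(n):
    """Riemann-sum approximation of ∫ |f_n - f| dx with f_n = e^{ix/n} x e^{-x}."""
    return sum(abs(complex(math.cos(x / n), math.sin(x / n)) - 1)
               * x * math.exp(-x) * 0.01 for x in xs)

dists = [l1_dist(n) for n in (1, 10, 100)]   # decreasing toward 0
```

Since |e^{ix/n} − 1| ≤ x/n, the exact distance is at most (1/n)∫x^2 e^{-x} dx = 2/n, consistent with what the sums show.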

We conclude this section with some comments concerning L_p-spaces of complex functions, and the Hölder and Minkowski Inequalities.

As for real functions, if f is complex and measurable we define ‖f‖_p = (∫|f|^p dμ)^{1/p} for p > 0 and say that f ∈ L_p if ‖f‖_p < ∞. Clearly such (complex, measurable) f ∈ L_p if and only if |f| ∈ L_p, i.e. |f|^p ∈ L_1. It is also easily checked that if f = u + iv, then f ∈ L_p if and only if each of u, v are in L_p. (For if f ∈ L_p, |u|^p ≤ |f|^p ∈ L_1, whereas if u, v ∈ L_p then |u| + |v| ∈ L_p and |f|^p ≤ (|u| + |v|)^p ∈ L_1.)

Further if f, g are complex functions in L_p, it is readily seen that f + g ∈ L_p and hence αf + βg ∈ L_p for any complex α, β. For |f|, |g| are real functions in L_p and hence |f| + |g| ∈ L_p, so that |f + g| ≤ (|f| + |g|) ∈ L_p, showing that |f + g|^p ∈ L_1 and hence f + g ∈ L_p.

Hölder's Inequality generalizes verbatim for complex integrands, since if f ∈ L_p, g ∈ L_q for some p ≥ 1, q ≥ 1, 1/p + 1/q = 1, then |f| ∈ L_p, |g| ∈ L_q so that |fg| ∈ L_1 by Theorem 6.4.2 and

∫|fg| dμ = ∫|f||g| dμ ≤ (∫|f|^p dμ)^{1/p}(∫|g|^q dμ)^{1/q}.

Armed with Hölder's Inequality, Minkowski's Inequality follows by the same proof as in the real case.

The complex L_p-space may be discussed in the same manner as the real L_p-space (cf. Section 6.4). This is a linear space (over the complex field) and is normed by ‖f‖_p = (∫|f|^p dμ)^{1/p} (p ≥ 1). It is easily checked that if f_n → f in L_p (i.e. ‖f_n – f‖ → 0) and if f_n = u_n + iv_n, f = u + iv, then u_n → u, v_n → v in L_p, and conversely (e.g. |u_n – u|^p ≤ |f_n – f|^p and hence ‖u_n – u‖ ≤ ‖f_n – f‖, whereas also ‖f_n – f‖ ≤ ‖u_n – u‖ + ‖v_n – v‖). Using these facts, completeness of L_p follows from the results for the real case. As for the real case, L_p is a complete metric space for 0 < p < 1 (Theorem 6.4.7).

8.2 Fourier–Stieltjes, and Fourier Transforms in L_1

Suppose that F is a real bounded, nondecreasing function (assumed right-continuous, for convenience) on the real line R, defining the measure μ_F. The Fourier–Stieltjes Transform F*(t) of F is defined as a complex function on R by

F*(t) = ∫_{-∞}^∞ e^{itx} dF(x) (= ∫e^{itx} dμ_F).

This integral exists since |e^{itx}| = 1 and μ_F(R) < ∞.

A function F on R is of bounded variation (b.v.) on R (cf. Section 5.7 for finite ranges) if it can be expressed as the difference of two bounded nondecreasing functions, F = F_1 – F_2 (again assume F_1, F_2 to be right-continuous for convenience). If F is b.v. its Fourier–Stieltjes Transform is defined as

F*(t) = F_1*(t) – F_2*(t).
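For a purely discrete F the transform is a finite sum, which makes the definition easy to experiment with (jump sizes and locations below are made up): if F has jumps α_j at λ_j, then F*(t) = Σ_j α_j e^{iλ_j t}, so F*(0) equals the total mass and |F*(t)| can never exceed it.

```python
import cmath

# Made-up jumps of a discrete nondecreasing F: (jump size alpha_j, location lambda_j).
jumps = [(0.5, -1.0), (0.3, 0.0), (0.2, 2.5)]

def F_star(t):
    """Fourier–Stieltjes Transform of the discrete F with the jumps above."""
    return sum(a * cmath.exp(1j * lam * t) for a, lam in jumps)
```

This also previews a point made below: such an F*(t) is a sum of oscillating terms of fixed amplitude and need not tend to zero as t → ±∞.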

(Note that this definition is unambiguous since if also F = G_1 – G_2 then G_1 + F_2 = G_2 + F_1, and it is readily checked that G_1* + F_2* = G_2* + F_1*, giving G_1* – G_2* = F_1* – F_2*.)

Theorem 8.2.1 If F is b.v., its Fourier–Stieltjes Transform F*(t) is uniformly continuous on R.

Proof Suppose F is nondecreasing. For any real t, s, t – s = h,

|F*(t) – F*(s)| = |∫(e^{itx} – e^{isx}) dF(x)|
               ≤ ∫|e^{isx}(e^{ihx} – 1)| dF(x)
               = ∫|e^{ihx} – 1| dF(x).

As h → 0, |e^{ihx} – 1| → 0 and is bounded by |e^{ihx}| + 1 = 2 which is dF-integrable. Hence by Dominated Convergence (Theorem 8.1.2) ∫|e^{ihx} – 1| dF(x) → 0 as h → 0 (through any sequence and hence generally). Thus given ε > 0 there exists δ > 0 such that ∫|e^{ihx} – 1| dF(x) < ε if |h| < δ. Then |F*(t) – F*(s)| < ε for all t, s such that |t – s| < δ, which proves uniform continuity. If F is b.v. the result follows by writing F = F_1 – F_2. □

Suppose now that f is a real Lebesgue measurable function on R and f ∈ L_1 = L_1(–∞, ∞) (Lebesgue measure). Then f(x)e^{itx} ∈ L_1 for all real t, and we define the L_1 Fourier Transform f† of f by

f†(t) = ∫_{-∞}^∞ e^{itx}f(x) dx.

First note that if f, g ∈ L_1 then (αf + βg)† = αf† + βg† for any real constants α, β.

It is also immediate that f†(t) = F*(t) where F(x) = ∫_{-∞}^x f(u) du. For if f is nonnegative, F is then nondecreasing and

F*(t) = ∫e^{itx} dF(x) = ∫e^{itx}f(x) dx

by Theorem 5.6.1. The general case follows by writing f = f_+ – f_–, F_1(x) = ∫_{-∞}^x f_+(u) du, F_2(x) = ∫_{-∞}^x f_–(u) du.

If f ∈ L_1 it follows from the above fact and Theorem 8.2.1 that f†(t) is uniformly continuous on R.

It is clear that a general Fourier–Stieltjes Transform F*(t) does not have to tend to zero as t → ±∞. For example if F(x) has a single jump of size α at x = λ, then F*(t) = αe^{iλt}. However, the Fourier Transform f†(t) of an L_1-function f does tend to zero as t → ±∞, as the important Theorem 8.2.3 shows. This depends on the following useful lemma.

Lemma 8.2.2 Let f ∈ L_1(–∞, ∞) (Lebesgue measure). Then given ε > 0 there exists a function h of the form h(x) = Σ_1^n α_j χ_{I_j}(x), where I_1, ..., I_n are (disjoint) bounded intervals, such that ∫_{-∞}^∞ |h – f| dx < ε.

Proof Since f ∈ L_1, there exists A < ∞ such that ∫_{(|x|>A)} |f(x)| dx < ε/3, and hence ∫|g – f| dx < ε/3 where g(x) = f(x) for |x| < A, and g(x) = 0 for |x| ≥ A. By the definition of the integral, g(x) may be approximated by a simple function k(x) = Σ_{j=1}^n α_j χ_{B_j}(x) where the B_j are bounded Borel sets and where ∫|g – k| dx < ε/3, so that ∫|f – k| dx < 2ε/3. Finally for each j there is a finite union I_j of bounded intervals such that m(B_j Δ I_j) < ε/(3n max|α_j|) where m denotes Lebesgue measure (Theorem 2.6.2), so that writing h(x) = Σ_1^n α_j χ_{I_j} we have

∫|k – h| dx ≤ Σ|α_j| ∫|χ_{I_j} – χ_{B_j}| dx = Σ|α_j| m(I_j Δ B_j) < ε/3

Theorem 8.2.3 (Riemann–Lebesgue Lemma) Let f ∈ L1(–∞, ∞) (i.e. f is Lebesgue integrable). Then its Fourier Transform f †(t) → 0 as t →±∞.

Proof Let g be any function of the form cχ(a,b] for finite constants a, b, c. b g† t c eitx dx c eitb eita it t →±∞ Then ( )= a = [ – ]/( ) which tends to zero as . n If h(x)= j=1 αjgj(x) where each gj is of the above type, then clearly h†(t) → 0ast →±∞. Now given > 0 there is (by Lemma 8.2.2) a function h of the above type such that |h(x)–f (x)| dx < . Hence |f †(t)| = | eitx(f (x)–h(x)) dx + h†(t)| ≤ |f (x)–h(x)| dx + |h†(t)| < + |h†(t)|. Since h†(t) → 0 it follows that |f †(t)| can be made arbitrarily small for t sufficiently large (positive or negative) and hence f †(t) → 0ast →±∞,as required. 

8.3 Inversion of Fourier–Stieltjes Transforms The main result of this section is an inversion formula from which F may be “recovered” from a knowledge of its Fourier–Stieltjes Transform. In fact the ˜ 1 1 formula gives not F itself but F(x)= 2 [F(x+0)+F(x–0)] = 2 [F(x)+F(x–0)], assuming right-continuity. F itself is easily obtained from F˜ since F = F˜ at continuity points, and at discontinuities F(x)=F˜ (x +0). Theorem 8.3.1 (Inversion for Fourier–Stieltjes Transforms) Let F be b.v. with Fourier–Stieltjes Transform F*. Then for all real a, b (a < bsay)with the above notation,

–ibt –iat 1 T e – e F˜ (b)–F˜ (a) = lim F*(t) dt. T→∞ 2π –T –it Also, for any real a, the jump of F at a is 1 T F(a +0)–F(a – 0) = lim e–iatF*(t) dt T→∞ 2T –T (which will be zero if F is continuous at a). 8.3 Inversion of Fourier–Stieltjes Transforms 183

Proof If the result holds for bounded nondecreasing functions, it clearly holds for a b.v. function. Hence we assume that F is nondecreasing and bounded (and right-continuous for convenience). Now

$$\frac{1}{2\pi}\int_{-T}^{T} \frac{e^{-ibt} - e^{-iat}}{-it}\, F^*(t)\,dt = \frac{1}{2\pi}\int_{-T}^{T} \frac{e^{-ibt} - e^{-iat}}{-it} \int_{-\infty}^{\infty} e^{itx}\,dF(x)\,dt = \frac{1}{2\pi}\int_{-\infty}^{\infty}\Bigl(\int_{-T}^{T} \frac{e^{it(x-b)} - e^{it(x-a)}}{-it}\,dt\Bigr)\,dF(x)$$
by an application of Fubini's Theorem (noting that the integrand may be written as $\int_{x-b}^{x-a} e^{itu}\,du$ and its modulus therefore does not exceed the constant $(b-a)$, which is integrable with respect to the product of Lebesgue measure on $(-T, T)$ and $F$-measure). Now the inner integral above is
$$\int_{-T}^{T}\int_{x-b}^{x-a} e^{itu}\,du\,dt = \int_{x-b}^{x-a}\int_{-T}^{T} e^{itu}\,dt\,du = 2\int_{x-b}^{x-a} \frac{\sin Tu}{u}\,du = 2\int_{T(x-b)}^{T(x-a)} \frac{\sin u}{u}\,du = 2\{H[T(x-a)] - H[T(x-b)]\}$$
where $H(x) = \int_0^x \frac{\sin u}{u}\,du$. As is well known, $H$ is a bounded, odd function which converges to $\frac{\pi}{2}$ as $x \to \infty$. Hence $\lim_{T\to\infty} H[T(x-a)] = -\frac{\pi}{2}$, $0$ or $\frac{\pi}{2}$ according as $x < a$, $x = a$, or $x > a$. Thus (with the corresponding limit for $H[T(x-b)]$),
$$\lim_{T\to\infty}\{H[T(x-a)] - H[T(x-b)]\} = \begin{cases} 0 & x < a \text{ or } x > b \\ \pi/2 & x = a \text{ or } x = b \\ \pi & a < x < b. \end{cases}$$
Further $\{H[T(x-a)] - H[T(x-b)]\}$ is dominated in absolute value by a constant (which is $dF$-integrable) and hence, by dominated convergence,

$$\lim_{T\to\infty} \frac{1}{2\pi}\int_{-T}^{T} \frac{e^{-ibt} - e^{-iat}}{-it}\, F^*(t)\,dt = \frac{2}{2\pi}\Bigl[\frac{\pi}{2}(F(a) - F(a-0)) + \pi(F(b-0) - F(a)) + \frac{\pi}{2}(F(b) - F(b-0))\Bigr]$$
which reduces to $\tilde F(b) - \tilde F(a)$, as required. The second expression is obtained similarly. Specifically
$$\frac{1}{2T}\int_{-T}^{T} e^{-iat} F^*(t)\,dt = \frac{1}{2T}\int_{-T}^{T} e^{-iat}\int_{-\infty}^{\infty} e^{itx}\,dF(x)\,dt = \frac{1}{2T}\int_{-\infty}^{\infty}\int_{-T}^{T} e^{it(x-a)}\,dt\,dF(x) = \int_{-\infty}^{\infty} \frac{\sin T(x-a)}{T(x-a)}\,dF(x)$$

(using Fubini) where the value of the integrand at $x = a$ is unity. The integrand tends to zero as $T \to \infty$ for all $x \ne a$ and is bounded by one ($dF$-integrable). Hence the integral converges as $T \to \infty$, by dominated convergence, to the value

$$\mu_F(\{a\}) = F(a) - F(a-0) = F(a+0) - F(a-0)$$
as required. □
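The jump formula of Theorem 8.3.1 lends itself to a numerical check. In the sketch below (a hypothetical example of our own), $F$ is a mixture of a point mass $\alpha$ at $\lambda$ with a standard normal distribution, so that $F^*(t) = \alpha e^{i\lambda t} + (1-\alpha)e^{-t^2/2}$; averaging $e^{-iat}F^*(t)$ over $(-T, T)$ recovers the jump $\alpha$ at $a = \lambda$ and approximately zero at a continuity point.

```python
import math, cmath

# Hypothetical F: point mass alpha at lam, mixed with (1 - alpha) * N(0,1);
# its Fourier-Stieltjes Transform is alpha e^{i lam t} + (1-alpha) e^{-t^2/2}.
alpha, lam = 0.3, 1.0

def F_star(t):
    return alpha * cmath.exp(1j * lam * t) + (1 - alpha) * math.exp(-t * t / 2)

def jump_at(a, T=200.0, m=40000):
    """(1/2T) int_{-T}^{T} e^{-iat} F*(t) dt by the midpoint rule:
    approximates the jump of F at a for large T."""
    w = 2 * T / m
    s = sum(cmath.exp(-1j * a * (-T + (k + 0.5) * w)) * F_star(-T + (k + 0.5) * w)
            for k in range(m))
    return (s * w / (2 * T)).real

at_atom = jump_at(lam)      # approximately alpha
elsewhere = jump_at(0.5)    # approximately zero (F is continuous at 0.5)
```

As $T$ grows the average isolates the atom: the oscillatory contributions of the continuous part and of the atom at any $a \ne \lambda$ wash out at rate $1/T$.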

A most interesting case occurs when the (complex) function $F^*(t)$ is itself in $L_1(-\infty, \infty)$. First of all it is then immediate that $F$ must be continuous since dominated convergence gives
$$\lim_{T\to\infty} \int_{-T}^{T} e^{-iat} F^*(t)\,dt = \int_{-\infty}^{\infty} e^{-iat} F^*(t)\,dt$$
and hence it follows from the second formula of Theorem 8.3.1 that $F(a+0) - F(a-0) = 0$. Similarly, the limit in the first inversion may be written as $\int_{-\infty}^{\infty}$ instead of $\lim_{T\to\infty}\int_{-T}^{T}$ (again by dominated convergence) and $\tilde F = F$ (since $F$ is continuous), giving
$$F(b) - F(a) = \frac{1}{2\pi}\int_{-\infty}^{\infty} \frac{e^{-ibt} - e^{-iat}}{-it}\, F^*(t)\,dt.$$
In fact even more is true and can be shown using the following obvious lemma.

Lemma 8.3.2 Let $F = F_1 - F_2$ be a b.v. function on $\mathbb{R}$ ($F_1, F_2$ bounded nondecreasing) and $g$ a real function in $L_1(-K, K)$ for every finite $K$, such that $F(b) - F(a) = \int_a^b g(x)\,dx$ for all real $a < b$. Then $g \in L_1(-\infty, \infty)$ and $\mu_F(E) = \int_E g(x)\,dx$ for all Borel sets $E$ ($\mu_F$ is defined to be $\mu_{F_1} - \mu_{F_2}$).

Proof Fix $K$ and define the finite signed measures
$$\mu(E) = \mu_F(E \cap (-K, K)), \qquad \nu(E) = \int_{E \cap (-K,K)} g(x)\,dx.$$
Clearly $\mu = \nu$ for all sets of the form $(a, b]$ and hence for all Borel sets (Lemma 5.2.4). Thus the “total variations” $|\mu|$, $|\nu|$ are equal, giving
$$\int_{(-K,K)} |g(x)|\,dx = |\nu|(-K, K) = |\mu|(-K, K) \le (\mu_{F_1} + \mu_{F_2})(-K, K) \le (\mu_{F_1} + \mu_{F_2})(\mathbb{R}) < \infty.$$

Hence $g \in L_1(-\infty, \infty)$ by monotone convergence ($K \to \infty$). Thus $\mu_F(E)$ and $\int_E g\,dx$ are two finite signed measures which are equal on sets $(a, b]$ and thus on $\mathcal{B}$, as required. □

Theorem 8.3.3 Let $F$ be b.v. on $\mathbb{R}$, with Fourier–Stieltjes Transform $F^*$, and assume $F^* \in L_1(-\infty, \infty)$. Then $F$ is absolutely continuous, and specifically
$$F(x) = F(-\infty) + \int_{-\infty}^{x} g(u)\,du$$
where $g(u) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-iut} F^*(t)\,dt$ is real and in $L_1(-\infty, \infty)$.

Proof The formula just prior to Lemma 8.3.2 gives
$$F(b) - F(a) = \frac{1}{2\pi}\int_{-\infty}^{\infty}\int_a^b e^{-iut} F^*(t)\,du\,dt = \int_a^b g(u)\,du$$
by Fubini's Theorem (since $F^* \in L_1$) and the definition of $g$. To see that $g$ is real note that the integral of its imaginary part over any finite interval is zero, and it follows that the imaginary part of $g$ has zero integral over any Borel set $E$, and is thus zero a.e. (Theorem 4.4.8). But a function which is continuous and zero a.e. is everywhere zero (as is easily checked) and thus $g$ is real. The result now follows at once by applying Lemma 8.3.2 to $F$ and $g$. □

We may now obtain an important inversion theorem for $L_1$ Fourier Transforms when the transform is also in $L_1$.

Theorem 8.3.4 Let $f \in L_1(-\infty, \infty)$. Then if its Fourier Transform $f^\dagger(t)$ is in $L_1(-\infty, \infty)$, we have the inversion
$$f(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-ixt} f^\dagger(t)\,dt \quad \text{a.e. (Lebesgue measure).}$$

Proof Write $F(x) = \int_{-\infty}^x f(u)\,du$. Then by Theorem 8.3.3, for all $a, b$,
$$\int_a^b f(u)\,du = F(b) - F(a) = \int_a^b g(u)\,du$$
where $g(x) = \frac{1}{2\pi}\int e^{-ixt} f^\dagger(t)\,dt$ is real and in $L_1(-\infty, \infty)$. The finite signed measures $\int_E f\,dx$, $\int_E g\,dx$ are thus equal for all $E$ of the form $(a, b]$ and hence for all $E \in \mathcal{B}$ (and finally for all Lebesgue measurable sets $E$). Hence $f = g$ a.e. by the corollary to Theorem 4.4.8, as required. □

Note that the expression $f(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-ixt} f^\dagger(t)\,dt$ a.e. may be regarded as displaying $f$ as an “inverse Fourier Transform”. For (apart from the factor $\frac{1}{2\pi}$ and the negative sign in the exponent) this has the form of the Fourier Transform of the (assumed $L_1$) function $f^\dagger$. Of course we have defined Fourier Transforms of real functions since that is our primary interest (and $f^\dagger$ may be complex) but one could also define the transform of a complex $L_1$-function. The “inverse transform” is thus an ordinary Fourier Transform with a negative sign in the exponent and the factor $\frac{1}{2\pi}$.
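Theorem 8.3.4 can be illustrated with the standard normal density, for which both $f$ and $f^\dagger(t) = e^{-t^2/2}$ are in $L_1$ (a standard transform pair); the inversion integral then recovers $f(x)$ pointwise. The numerical sketch below, a truncated midpoint rule of our own construction, checks this at a few points.

```python
import math

def f_dagger(t):
    # Fourier Transform of the standard normal density; both f and f^dagger are in L1
    return math.exp(-t * t / 2)

def inverse(x, T=12.0, m=20000):
    """(1/2pi) int e^{-ixt} f^dagger(t) dt, truncated to [-T, T]; the
    integrand here is even in t, so only the cosine part survives."""
    w = 2 * T / m
    s = sum(math.cos(x * (-T + (k + 0.5) * w)) * f_dagger(-T + (k + 0.5) * w)
            for k in range(m))
    return s * w / (2 * math.pi)

xs = (0.0, 1.0, 2.0)
recovered = [inverse(x) for x in xs]
density = [math.exp(-x * x / 2) / math.sqrt(2 * math.pi) for x in xs]
```

The truncation error is negligible since $e^{-t^2/2}$ decays so fast, and the recovered values agree with $(2\pi)^{-1/2}e^{-x^2/2}$ to high accuracy.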

8.4 “Local” inversion for Fourier Transforms

In the last section it was shown that the inversion
$$f(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-ixt} f^\dagger(t)\,dt \quad \text{a.e.}$$
holds when the transform $f^\dagger(t) \in L_1$. There are important cases when $f^\dagger$ does not belong to $L_1$ but where an inversion is still possible. For example suppose $f(x) = 0$ for $x < 0$ and $f(x) = e^{-x}$ for $x > 0$. Then
$$f^\dagger(t) = \int_0^\infty e^{-x} e^{ixt}\,dx = \int_0^\infty e^{-x}\cos xt\,dx + i\int_0^\infty e^{-x}\sin xt\,dx = \frac{1}{1+t^2} + \frac{it}{1+t^2} = \frac{1}{1-it}.$$
Clearly $f^\dagger(t) \notin L_1$ since $|f^\dagger(t)| = (1+t^2)^{-1/2}$. To obtain an appropriate inversion the following limit is needed.

Lemma 8.4.1 (Dirichlet Limit) If for some $\delta > 0$, $g(x)$ is a bounded nondecreasing function of $x$ in $(0, \delta)$, then
$$\frac{2}{\pi}\int_0^\delta \frac{\sin Tx}{x}\, g(x)\,dx \to g(0+)$$
as $T \to \infty$.

Proof $\int_0^\delta \frac{\sin Tx}{x}\,dx = \int_0^{T\delta} \frac{\sin u}{u}\,du \to \frac{\pi}{2}$ as $T \to \infty$ (cf. proof of Theorem 8.3.1). Thus it will be sufficient to show that
$$\int_0^\delta \frac{\sin Tx}{x}\,(g(x) - g(0+))\,dx \to 0.$$
Given $\epsilon > 0$ there exists $\eta > 0$ such that $g(\eta) - g(0+) < \epsilon$. Then
$$\int_0^\eta \frac{\sin Tx}{x}\,(g(x) - g(0+))\,dx = [g(\eta - 0) - g(0+)]\int_\xi^\eta \frac{\sin Tx}{x}\,dx$$
for some $\xi \in [0, \eta]$, by the second mean value theorem for integrals. The last expression may be written as
$$(g(\eta - 0) - g(0+))\int_{\xi T}^{\eta T} \frac{\sin x}{x}\,dx.$$
But since $\int_0^T (\sin u/u)\,du$ is bounded, $\bigl|\int_{T_1}^{T_2} (\sin u/u)\,du\bigr| < A$ for some $A$ and all $T_1, T_2 \ge 0$. Thus for all $T$
$$\Bigl|\int_0^\eta \frac{\sin Tx}{x}\,(g(x) - g(0+))\,dx\Bigr| \le A\epsilon.$$

Now $(g(x) - g(0+))/x \in L_1([\eta, \delta])$ ($g$ being bounded and $\eta > 0$). The Riemann–Lebesgue Lemma (Theorem 8.2.3) applies equally well to a finite range of integration (or the function may be extended to be zero outside such a range). Considering the imaginary part of the integral we see that
$$\int_\eta^\delta (g(x) - g(0+))\,\frac{\sin Tx}{x}\,dx \to 0 \quad \text{as } T \to \infty.$$
Hence
$$\limsup_{T\to\infty}\Bigl|\int_0^\delta (g(x) - g(0+))\,\frac{\sin Tx}{x}\,dx\Bigr| \le A\epsilon$$
for any $\epsilon > 0$, from which the required result follows. □

Recall from Section 5.7 that a function $f$ is b.v. in a finite range if it can be written as the difference of two bounded nondecreasing functions in that range. The Dirichlet Limit clearly holds for such b.v. functions (in $(0, \delta)$) also. The desired inversion may now be obtained.

Theorem 8.4.2 (Local Inversion Theorem for $L_1$ Transforms) If $f \in L_1$, and $f$ is b.v. in $(x - \delta, x + \delta)$ for a fixed given $x$ and for some $\delta > 0$, then
$$\tfrac12\{f(x+0) + f(x-0)\} = \lim_{T\to\infty} \frac{1}{2\pi}\int_{-T}^{T} e^{-itx} f^\dagger(t)\,dt.$$

Proof
$$\frac{1}{2\pi}\int_{-T}^{T} e^{-itx} f^\dagger(t)\,dt = \frac{1}{2\pi}\int_{-T}^{T}\int_{-\infty}^{\infty} e^{-it(x-y)} f(y)\,dy\,dt = \frac{1}{2\pi}\int_{-\infty}^{\infty}\Bigl(\int_{-T}^{T} e^{-it(x-y)}\,dt\Bigr) f(y)\,dy \quad \text{(Fubini)}$$
$$= \frac{1}{\pi}\int_{-\infty}^{\infty} \frac{\sin T(x-y)}{x-y}\, f(y)\,dy = \frac{1}{\pi}\int_{-\infty}^{\infty} \frac{\sin Tu}{u}\, f(x+u)\,du.$$

Now for $x$ fixed, $f(x+u)/u$ is in $L_1(\delta, \infty)$ and $L_1(-\infty, -\delta)$ for $\delta > 0$, so that
$$\int_{|u|>\delta} \frac{\sin Tu}{u}\, f(x+u)\,du \to 0 \quad \text{as } T \to \infty$$
by the Riemann–Lebesgue Lemma. Thus we need consider only the range $[-\delta, \delta]$ for the integral. Now $f(x+u)$ is b.v. in $(0, \delta)$ and by the Dirichlet Limit $\frac{1}{\pi}\int_0^\delta \frac{\sin Tu}{u} f(x+u)\,du \to \frac12 f(x+0)$. Similarly $\frac{1}{\pi}\int_{-\delta}^0 \frac{\sin Tu}{u} f(x+u)\,du \to \frac12 f(x-0)$ and hence
$$\frac{1}{\pi}\int_{-\delta}^{\delta} \frac{\sin Tu}{u}\, f(x+u)\,du \to \tfrac12(f(x+0) + f(x-0))$$
giving the desired conclusion of the theorem. □

Corollary If $f$ is continuous at $x$ the stated inversion formula gives $f(x)$. If also $f^\dagger \in L_1$, $f(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-ixt} f^\dagger(t)\,dt$.

In contrast to the previous inversion formula, that considered here applies to the value of $f$ at a given point $x$ rather than holding a.e. It is often convenient to use complex variable methods (i.e. contour integrals) to evaluate the formula. For example in the case $f^\dagger(t) = \frac{1}{1-it}$ one may consider $\frac{1}{2\pi}\int_C \frac{e^{-izx}}{1-iz}\,dz$ around upper and lower semicircles to recover $f(x) = 0$ for $x < 0$ and $f(x) = e^{-x}$ for $x > 0$. (The limit as $T \to \infty$ occurs naturally, making the semicircle larger.) The case $x = 0$ is easily checked directly, giving the value $\frac12$ $(= (f(0+) + f(0-))/2)$.

9

Foundations of probability

9.1 Probability space and random variables

By a probability space we mean simply a measure space for which the measure of the whole space is unity. It is customary to denote a probability space by (Ω, F, P), rather than the (X, S, μ) used in previous chapters for general measure spaces. That is, P is a measure on a σ-field F of subsets of a space Ω, such that P(Ω) = 1 (and P is thus called a probability measure).

It will be familiar to the reader that this framework is used to provide a mathematical (“probabilistic”) model for physical situations involving randomness, i.e. a random experiment E – which may be very simple, such as the tossing of coins or dice, or quite complex, such as the recording of an entire noise waveform. In this model, each point ω ∈ Ω represents a possible outcome that E may have. The measurable sets E ∈ F are termed events. An event E represents that “physical event” which occurs when the experiment E is conducted if the actual outcome obtained corresponds to one of the points of E.

It will also be familiar that the complement Ec of an event E represents another physical event – which occurs precisely when E does not occur if E is conducted. Further, for two events E, F, E ∪ F represents that event which occurs if either or both of E, F occur, whereas E ∩ F represents occurrence of both these events simultaneously. If E ∩ F = ∅, the events E and F cannot occur together when E is performed. Similar interpretations hold for other set operations such as –, Δ, ∪₁^∞ and so on.

The probability measure P(E) (sometimes written also as Pr(E)) of an event E is referred to as the “probability that the event E occurs” when E is conducted. As is intuitively reasonable, its values lie between zero and one (P being monotone). If E, F are events which cannot occur together (i.e. disjoint events – E ∩ F = ∅), it is also intuitively plausible that the probability P(E ∪ F) of one or other of E, F occurring should be equal to P(E) + P(F).
This is true since the measure P is additive. (Of course, the

countable additivity of P implies a corresponding statement for a sequence of disjoint events.) It is worth recalling that these properties are also intuitively desirable from a consideration of the “frequency interpretation” of P(E) as the proportion of times E occurs in very many repetitions of E. Thus the requirements which make P a probability measure are consistent with intuitive properties which probability should have.

We turn now to random variables. To conform to the notion of a random variable as a “numerical outcome of a random experiment”, it is intuitively reasonable to consider a function on Ω (i.e. an assignment of a numerical value to each possible outcome ω). For example for two tosses of a coin we may write Ω = (HH, HT, TH, TT), with the number of heads ξ(ω) taking the respective values 2, 1, 1, 0. It will be convenient to allow infinite values on occasions. Precisely, the following definitions will apply.

By an extended (real) random variable we shall mean a measurable function (Section 3.3) ξ = ξ(ω) defined a.e. on (Ω, F, P). If the values of ξ are finite a.e., we shall simply refer to ξ as a random variable (r.v.).

Note that the precise usage of the term random variable is not uniform among different authors. Sometimes it is required that a r.v. be defined and finite for all ω, and sometimes defined for all ω and finite a.e. The latter definition is inesthetic since the sum of two such “r.v.'s” need not be defined for all ω, and hence not a r.v. The former can be equally as good as the definition above since a redefinition of an a.e. finite function will lead to one which is everywhere finite, with the “same properties except on a zero measure set” (a fact which will be used from time to time anyway). Which definition is chosen is largely a matter of personal preference since there are compensating advantages and disadvantages of each, and in any case the differences are of no real consequence.
As in previous chapters, B (B∗) will be used to denote the σ-field of Borel sets (extended Borel sets – Section 3.1) on the real line R (extended real line R∗). By a Borel function f on R (R∗) we mean that f (either real or extended real) is measurable with respect to B (B∗). An extended r.v. ξ, viewed as a mapping (transformation) from Ω to R∗, induces the probability measure Pξ⁻¹ on B∗ (Section 3.7). As discussed in the next section this is the distribution of ξ, using the notation (for B ∈ B∗) P{ξ ∈ B} = P(ξ⁻¹B). Similarly other obvious notation (such as P{ξ ≤ a} for Pξ⁻¹(−∞, a]) will be clear and used even if not formally defined.

A further convenient notation is the use of the abbreviation “a.s.” (“almost surely”), which is usually preferred over “a.e.” when the measure involved is a probability measure. This is especially useful when another measure (e.g. Lebesgue) is considered simultaneously with P, since then “a.s.” will refer to P, and “a.e.” to the other measure. It is also not uncommon to use the phrase “with probability one” instead of “a.s.”. Thus statements (for a Borel set B) such as

“ξ ∈ B a.e. (P)”, “ξ ∈ B a.s.”, “ξ ∈ B with probability one”, P{ξ ∈ B} =1 are equivalent. Finally the measures P, Pξ–1 may or may not be complete (Section 2.6). Completeness may, of course, be simply achieved where needed or desired by the completion procedure of Theorem 2.6.1.

9.2 Distribution function of a random variable

As above, a r.v. ξ on (Ω, F, P) induces the distribution Pξ⁻¹ on (R∗, B∗) and also, by restriction, on (R, B). Further if A denotes the (measurable) set of points ω where ξ is either not defined or ξ(ω) = ±∞, then P(A) = 0 and Pξ⁻¹(R) = P(Ω) − P(A) = 1, so that Pξ⁻¹ is a probability measure on B, and, since Pξ⁻¹(R∗) = 1, also on B∗. Now Pξ⁻¹ as a measure on (R, B) is a Lebesgue–Stieltjes measure, corresponding to the point function (Theorem 2.8.1) given by

F(x)=Pξ–1{(–∞, x]} = P{ξ ≤ x} ,

i.e. Pξ⁻¹ = μF in the notation of Section 2.8. F is called the distribution function (d.f.) of ξ. According to Theorem 2.8.1, F(x) is nondecreasing and continuous to the right. Further it is easily checked, writing F(−∞) = lim_{x→−∞} F(x), F(∞) = lim_{x→∞} F(x), that F(−∞) = 0, F(∞) = 1. In fact these properties are also sufficient for a function F to be the d.f. of some r.v. ξ, as concluded in the following theorem.

Theorem 9.2.1 (i) For a function F on R to be the d.f. (P{ξ ≤ x}) of some r.v. ξ, it is necessary and sufficient that F be nondecreasing, continuous to the right and that lim_{x→−∞} F(x) = 0, lim_{x→∞} F(x) = 1.

(ii) Two r.v.'s ξ, η (on the same or different probability spaces) have the same distribution (i.e. Pξ⁻¹B = Pη⁻¹B for all B ∈ B∗) if and only if they have the same d.f. F.

Proof The necessity of the conditions in (i) has been shown by the remarks above. Conversely if F is a nondecreasing function with the properties stated in (i), we may define a probability space (R, B, μF) where μF is the measure defined by F (as in Theorem 2.8.1). Since
$$\mu_F(R) = \lim_{n\to\infty} \mu_F\{(-n, n]\} = \lim_{n\to\infty}\{F(n) - F(-n)\} = 1,$$
it follows that μF is a probability measure. If ξ denotes the “identity r.v.” on (R, B, μF) given (for real ω) by ξ(ω) = ω, its d.f. is
$$\mu_F\xi^{-1}\{(-\infty, x]\} = \mu_F\{(-\infty, x]\} = F(x),$$
so that F is the d.f. of a r.v. ξ as required.

To prove (ii), note that clearly if ξ, η have the same distribution (on either B∗ or B) they have the same d.f. (Take B = (−∞, x].) Conversely if ξ, η have the same d.f., then by the uniqueness part of Theorem 2.8.1, Pξ⁻¹ and Pη⁻¹ are equal on B (being measures on (R, B) corresponding to the same function F), i.e. Pξ⁻¹(B) = Pη⁻¹(B) for all B ∈ B. But this also holds if B is replaced by B ∪ {∞}, B ∪ {−∞} or B ∪ {∞} ∪ {−∞} (since e.g. Pξ⁻¹(B ∪ {∞}) = Pξ⁻¹(B) = Pη⁻¹(B) = Pη⁻¹(B ∪ {∞})). That is Pξ⁻¹ = Pη⁻¹ on B∗ also. □
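The proof constructs a r.v. with d.f. F as the identity map on (R, B, μF). A closely related, concrete construction (not the one in the text, but standard) is the quantile or inverse-transform method: for the exponential d.f. F(x) = 1 − e⁻ˣ, the r.v. ξ = −log(1 − U) with U uniform on (0, 1) has d.f. F. The sampling sketch below is our own illustration.

```python
import math, random

random.seed(0)

def F(x):
    # exponential(1) d.f.: nondecreasing, right-continuous, F(-inf)=0, F(inf)=1
    return 1 - math.exp(-x) if x > 0 else 0.0

def quantile(u):
    # generalized inverse of F on [0, 1): quantile(u) <= x  iff  u <= F(x)
    return -math.log(1 - u) if u > 0 else 0.0

n = 200000
sample = [quantile(random.random()) for _ in range(n)]
# empirical fraction of the sample at or below x, compared with F(x)
emp = {x: sum(1 for s in sample if s <= x) / n for x in (0.5, 1.0, 2.0)}
```

The empirical fractions agree with F at each checkpoint to within sampling error, confirming that the constructed r.v. has the prescribed d.f.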

If two r.v.'s ξ, η (on the same or different probability spaces) have the same distribution (Pξ⁻¹B = Pη⁻¹B for all B ∈ B, or equivalently for all B ∈ B∗) we say that they are identically distributed, and write $\xi \stackrel{d}{=} \eta$. By the theorem it is necessary and sufficient for this that they have the same d.f. It is, incidentally, usually “distributional properties” of a r.v. which are important in probability theory. If ξ is a r.v. on some (Ω, F, P), we can always find an identically distributed r.v. on the real line. For if F is the d.f. of ξ, a r.v. η may be constructed on (R, B, μF) as above (η(x) = x). η has the same d.f. F as ξ, and hence the same distribution as ξ, by Theorem 9.2.1.

As noted, if F is the d.f. of ξ, Pξ⁻¹ is the Lebesgue–Stieltjes measure μF defined by F as in Section 2.8. However, in addition to being everywhere finite, as required in Section 2.8, a d.f. is bounded (with values between zero and one).

A d.f. F may have discontinuities, but as noted above it is continuous to the right. Also since F is monotone the limit F(x−0) = lim_{h↓0} F(x−h) exists for every x. The measure of a single point is clearly the jump μF({x}) = F(x) − F(x−0). The following useful result follows from Lemma 2.8.2.

Lemma 9.2.2 Let F be a d.f. (with corresponding probability measure μF on B). Then μF has at most countably many “atoms” (i.e. points x with μF({x}) > 0). Correspondingly F has at most countably many discontinuity points.

Two extreme kinds of distribution and d.f. are of special interest. The first corresponds to r.v.'s ξ whose distribution Pξ⁻¹ on B is discrete. That is (cf. Section 5.7) there is a countable set C such that Pξ⁻¹(Cᶜ) = 0. If C = {x₁, x₂, ...} and Pξ⁻¹{xᵢ} = pᵢ, we have for any B ∈ B
$$P\xi^{-1}(B) = P\xi^{-1}(B \cap C) = \sum_{\{i:\, x_i \in B\}} P\xi^{-1}\{x_i\} = \sum_{\{i:\, x_i \in B\}} p_i$$
and thus for the d.f.
$$F(x) = P\xi^{-1}(-\infty, x] = \sum_{\{i:\, x_i \le x\}} p_i.$$

F increases by jumps of size pᵢ at the points xᵢ and is called a discrete d.f. The r.v. ξ with such a d.f. is also said to be a discrete r.v. Note that such a d.f. may often be visualized as an increasing “step function” with successive stairs of heights pᵢ. This is the case (cf. Section 5.7) if the xᵢ can be written as a sequence in increasing order of size. However, such size ordering is not always possible – as when the set of xᵢ consists of all rational numbers. Two standard examples of discrete r.v.'s are

(i) Binomial, where C = {0, 1, 2, ..., n} and
$$p_r = \binom{n}{r} p^r (1-p)^{n-r}, \quad r = 0, 1, \ldots, n \quad (0 \le p \le 1),$$

(ii) Poisson, where C = {0, 1, 2, ...} and

$$p_r = e^{-m} m^r / r!, \quad r = 0, 1, 2, \ldots \quad (m > 0).$$
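Both families are easy to tabulate from the formulas above. The sketch below (our own illustration) checks that each pmf sums to one, and also exhibits the classical fact, not needed in the text, that Binomial(n, p) with n large and p small is close to Poisson(m = np).

```python
import math

def binomial_pmf(n, p):
    # p_r = C(n, r) p^r (1-p)^{n-r}, r = 0, ..., n
    return [math.comb(n, r) * p**r * (1 - p)**(n - r) for r in range(n + 1)]

def poisson_pmf(m, rmax):
    # p_r = e^{-m} m^r / r!, truncated at rmax (the tail mass is negligible here)
    return [math.exp(-m) * m**r / math.factorial(r) for r in range(rmax + 1)]

pb = binomial_pmf(1000, 0.002)   # n large, p small, np = 2
pp = poisson_pmf(2.0, 30)
close = max(abs(pb[r] - pp[r]) for r in range(31))  # Poisson approximation
```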

At the “other extreme” the distribution Pξ⁻¹ (= μF) of ξ may be absolutely continuous with respect to Lebesgue measure. Then for any B ∈ B
$$P\xi^{-1}(B) = \int_B f(x)\,dx$$
where the Radon–Nikodym derivative f (of Pξ⁻¹ with respect to Lebesgue measure) is nonnegative a.e. and hence may be taken as everywhere nonnegative (by writing e.g. zero instead of negative values). f is in L₁(−∞, ∞) and its integral is unity. It is called the probability density function (p.d.f.) for ξ and the d.f. is given by
$$F(x) = P\xi^{-1}(-\infty, x] = \int_{-\infty}^x f(u)\,du.$$
(F is thus an absolutely continuous function – cf. Section 5.7.) We then say that ξ has an absolutely continuous distribution or simply that ξ is an absolutely continuous r.v. Common examples are

(i) the normal distribution N(μ, σ²) where
$$f(x) = (\sigma\sqrt{2\pi})^{-1} \exp\{-(x-\mu)^2/2\sigma^2\} \quad (\mu \text{ real}, \ \sigma > 0),$$

(ii) the gamma distribution with parameters α > 0, β > 0, where
$$f(x) = \alpha^\beta (\Gamma(\beta))^{-1} e^{-\alpha x} x^{\beta - 1} \quad (x > 0).$$
The case β = 1 gives the exponential distribution.
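As a p.d.f. must, each of these densities integrates to unity. The midpoint-rule check below is our own sketch (function names illustrative); it also confirms that the β = 1 case of the gamma density reduces to the exponential density αe^{−αx}.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    # (sigma sqrt(2 pi))^{-1} exp{-(x-mu)^2 / 2 sigma^2}
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

def gamma_pdf(x, alpha, beta):
    # alpha^beta / Gamma(beta) * e^{-alpha x} x^{beta-1} for x > 0
    if x <= 0:
        return 0.0
    return alpha**beta / math.gamma(beta) * math.exp(-alpha * x) * x**(beta - 1)

def integral(g, a, b, m=100000):
    # midpoint-rule estimate of the integral of g over [a, b]
    w = (b - a) / m
    return sum(g(a + (k + 0.5) * w) for k in range(m)) * w

total_normal = integral(normal_pdf, -10.0, 10.0)                  # ~ 1
total_gamma = integral(lambda x: gamma_pdf(x, 2.0, 3.0), 0.0, 40.0)  # ~ 1
```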

There is a third “extreme type” of r.v. which is not typically encountered in classical statistics but has received significant recent attention in connection with the use of fractals in important applied sciences. This is a r.v. ξ whose distribution is singular with respect to Lebesgue measure (Section 5.4) and such that Pξ⁻¹{x} = 0 for every singleton set {x}. That is, Pξ⁻¹ has mass confined to a set B of Lebesgue measure zero, but unlike a discrete r.v. Pξ⁻¹ has no atoms in B (or Bᶜ, of course). The corresponding d.f. F is everywhere continuous, but clearly by no means absolutely continuous. Such a d.f. (and the r.v.) will be called singular (though continuous singular would perhaps be a better name).

It is readily seen from Section 5.7 that any d.f. whatsoever may be represented in terms of the three special types considered above, as the following celebrated result shows.

Theorem 9.2.3 (Lebesgue Decomposition for d.f.’s) Any d.f. F may be written as a “convex combination”

$$F(x) = \alpha_1 F_1(x) + \alpha_2 F_2(x) + \alpha_3 F_3(x)$$
where F₁, F₂, F₃ are d.f.'s, F₁ being absolutely continuous, F₂ discrete, F₃ singular, and where α₁, α₂, α₃ are nonnegative with α₁ + α₂ + α₃ = 1. The constants α₁, α₂, α₃ are unique, and so is the Fᵢ corresponding to any αᵢ > 0 (hence the term αᵢFᵢ is unique for each i).

Proof By Theorem 5.7.1 (Corollary) we may write $F(x) = F_1^*(x) + F_2^*(x) + F_3^*(x)$, where the $F_i^*$ are nondecreasing functions defining measures $\mu_{F_i^*}$ which are respectively absolutely continuous, discrete and singular (for $i$ = 1, 2, 3). Further, noting that $\sum_{i=1}^3 F_i^*(-\infty) = 0$, we may replace $F_i^*$ by $F_i^* - F_i^*(-\infty)$ and hence take $F_i^*(-\infty) = 0$ for each $i$. Write now $\alpha_i = F_i^*(\infty)$ and $F_i(x) = F_i^*(x)/\alpha_i$ if $\alpha_i > 0$ (and an arbitrary d.f. of “type $i$” if $\alpha_i = 0$). Then $F_i$ is a d.f. and the desired decomposition $F(x) = \alpha_1 F_1(x) + \alpha_2 F_2(x) + \alpha_3 F_3(x)$ follows. Letting $x \to \infty$ we see that $\alpha_1 + \alpha_2 + \alpha_3 = 1$.

If there is another such decomposition, $F = \beta_1 G_1 + \beta_2 G_2 + \beta_3 G_3$ say, then
$$\mu_{\alpha_1 F_1} + \mu_{\alpha_2 F_2} + \mu_{\alpha_3 F_3} = \mu_{\beta_1 G_1} + \mu_{\beta_2 G_2} + \mu_{\beta_3 G_3}$$
and hence by Theorem 5.7.1, $\mu_{\alpha_i F_i} = \mu_{\beta_i G_i}$. Hence $\alpha_i F_i$ differs from $\beta_i G_i$ at most by an additive constant, which must be zero since $F_i$ and $G_i$ vanish at $-\infty$. Since $F_i(\infty) = G_i(\infty) = 1$ we thus have $\alpha_i = \beta_i$ and hence also $F_i = G_i$ (provided $\alpha_i > 0$). □
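The decomposition can be seen concretely for a d.f. with only absolutely continuous and discrete parts (α₃ = 0): summing the jumps of F recovers α₂, and the rest of the mass is α₁. The mixture below is a hypothetical example of our own, with the weights and jump locations chosen arbitrarily.

```python
import math

# hypothetical mixed d.f.: alpha1 * (exponential d.f.) + alpha2 * (two-point d.f.); alpha3 = 0
a1, a2 = 0.6, 0.4
jumps = {0.0: 0.25, 1.0: 0.75}   # the discrete d.f. F2 jumps by these amounts

def F(x):
    ac = 1 - math.exp(-x) if x > 0 else 0.0
    disc = sum(p for xi, p in jumps.items() if xi <= x)
    return a1 * ac + a2 * disc

def jump(x, h=1e-9):
    # F(x) - F(x-0), approximated with a small left increment
    return F(x) - F(x - h)

total_jump = jump(0.0) + jump(1.0)   # recovers alpha2, the weight of the discrete part
```

Since the absolutely continuous part contributes no jumps, the total jump mass identifies α₂ (here 0.4), and at any continuity point the jump is zero.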

9.3 Random elements, vectors and joint distributions

It is natural to extend the concept of a r.v. by considering more general mappings rather than just “measurable functions”. These will be precisely “measurable transformations” as discussed in Chapter 3, but the term “measurable mapping” will be more natural (and thus used) in the present context. Specifically let ξ be a measurable mapping defined a.s. on a probability space (Ω, F, P), to a measurable space (X, S) (i.e. ξ⁻¹E ∈ F for all E ∈ S). Then ξ will be called a random element (r.e.) on (Ω, F, P) with values in X (or in (X, S)). An extended r.v. is thus a r.e. with values in (R∗, B∗). Another case of importance is when (X, S) = (R∗ⁿ, B∗ⁿ) and ξ(ω) = (ξ₁(ω), ..., ξₙ(ω)). A r.e. of this form and such that each ξᵢ is finite a.s. will be called a random vector or vector random variable. Yet more generally a stochastic process may be defined as a r.e. of (X, S) = (Rᵀ, Bᵀ) (cf. Section 7.9) for e.g. an index set T = {1, 2, 3, ...} or T = (0, ∞). As will be briefly indicated in Chapter 15 this is alternatively described as an infinite (countable or uncountable) family of r.v.'s.

Before pursuing probabilistic properties of random elements it will be convenient to develop some notation and obvious measurability results in the slightly more general framework in which ξ is a mapping defined on a space Ω, not necessarily a probability space, with values in a measurable space (X, S). Apart from notation this is precisely the framework of Section 3.2, replacing X by Ω and (Y, T) by (X, S), and identifying ξ with the transformation T. It will be more natural in the present context to refer to ξ as a mapping rather than a transformation, but the results of Section 3.2 apply. For such a mapping ξ the σ-field σ(ξ) generated by ξ is defined on Ω (cf. Section 3.2, identifying ξ with T) by

$$\sigma(\xi) = \sigma(\xi^{-1}\mathcal{S}) = \sigma(\xi^{-1}E : E \in \mathcal{S}).$$

As noted in Section 3.3, σ(ξ) is the smallest σ-field G on Ω making ξ G|S-measurable. Further if ξ(ω) is defined for every ω then the σ-ring ξ⁻¹(S) contains ξ⁻¹(X) = Ω and hence is itself the σ-field σ(ξ). Note that σ(ξ) depends on the “range” σ-field S.

More generally if C is any family of mappings on the same space Ω, but with values in possibly different measurable spaces, we write

σ(C)=σ(∪ξ∈Cσ(ξ)).

If the family is written as an indexed set C ={ξλ : λ∈Λ}, where ξλ maps Ω into (Xλ, Sλ), we write

σ(C)=σ{ξλ : λ ∈ Λ} = σ (∪λ∈Λσ(ξλ)) .

For Λ = {1, 2, ..., n} write σ(C) = σ(ξ₁, ξ₂, ..., ξₙ). The following lemma, stated for reference, should be proved as an exercise (Ex. 9.7).

Lemma 9.3.1 (i) If C is any family of mappings on the space Ω, σ(C) is then the unique smallest σ-field on Ω with respect to which every ξ ∈ C is measurable. (σ(C) is called the σ-field generated by C.)

(ii) If C = {ξλ : λ ∈ Λ}, with ξλ taking values in (Xλ, Sλ), then σ(C) = σ{ξλ⁻¹Bλ : Bλ ∈ Sλ, λ ∈ Λ}.

(iii) If Cλ is a family of mappings on the space Ω for each λ in an index set Λ, then
$$\sigma(\cup_{\lambda\in\Lambda} C_\lambda) = \sigma(\cup_{\lambda\in\Lambda} \sigma(C_\lambda)).$$

As indicated above, we shall be especially interested in the case where (X, S) = (R∗ⁿ, B∗ⁿ), leading to random vectors. The following lemma will be applied to show the equivalence of a random vector and its component r.v.'s.

Lemma 9.3.2 Let ξ be a mapping defined on a space Ω with values in (R∗ⁿ, B∗ⁿ), so that ξ = (ξ₁, ξ₂, ..., ξₙ) where each ξᵢ maps Ω into (R∗, B∗). Then σ(ξ) = σ(ξ₁, ξ₂, ..., ξₙ). That is, the σ-field generated on Ω by the mapping ξ into (R∗ⁿ, B∗ⁿ) is identical to that generated by the family of its components ξᵢ, each mapping Ω into (R∗, B∗).

Proof If Bᵢ ∈ B∗ for each i, then ξ⁻¹(B₁ × B₂ × ... × Bₙ) = ∩ᵢ₌₁ⁿ ξᵢ⁻¹Bᵢ. Since the rectangles B₁ × B₂ × ... × Bₙ generate B∗ⁿ, the corollary to Theorem 3.3.2 gives
$$\sigma(\xi) = \sigma\{\cap_{i=1}^n \xi_i^{-1} B_i : B_i \in \mathcal{B}^*\} = \sigma\{\xi_i^{-1} B_i : B_i \in \mathcal{B}^*,\ 1 \le i \le n\}$$
as is easily checked. But this is just σ(ξ₁, ξ₂, ..., ξₙ) by Lemma 9.3.1 (ii). □

We proceed now to consider random vectors – measurable mappings ξ = (ξ₁, ξ₂, ..., ξₙ) defined a.s. on a probability space (Ω, F, P) with values in (R∗ⁿ, B∗ⁿ), their components ξᵢ being finite a.s. (i.e. ξ ∈ Rⁿ a.s.).

The following result shows that a random vector ξ is, equivalently, just a family of n r.v.'s (ξ₁, ..., ξₙ) (with σ(ξ) = σ(ξ₁, ..., ξₙ) as shown above).

Theorem 9.3.3 Let ξ be a mapping defined a.s. on a probability space (Ω, F, P), with values in R∗ⁿ. Write ξ = (ξ₁, ξ₂, ..., ξₙ). Then σ(ξ) = σ(ξ₁, ξ₂, ..., ξₙ). Further, ξ is a random element in (R∗ⁿ, B∗ⁿ) (i.e. F|B∗ⁿ-measurable) if and only if each ξᵢ is an extended r.v. (i.e. F|B∗-measurable). Hence ξ is a random vector (r.e. of (Rⁿ, Bⁿ)) if and only if each ξᵢ is a r.v.

Proof That σ(ξ) = σ(ξ₁, ξ₂, ..., ξₙ) restates Lemma 9.3.2. The mapping ξ is a r.e. on (Ω, F, P) with values in (R∗ⁿ, B∗ⁿ) iff it is F-measurable, i.e. σ(ξ) ⊂ F. But this is precisely σ(ξ₁, ξ₂, ..., ξₙ) ⊂ F, which holds iff all ξᵢ are extended r.v.'s. The final statement also follows immediately. □

The distribution of a r.e. ξ on (Ω, F, P) with values in (X, S) is defined to be the probability measure Pξ⁻¹ on S – directly generalizing the distribution of a r.v. Note that a corresponding point function (d.f.) is not defined as before except in special cases where e.g. X = Rⁿ (or at least has some “order structure”). The distribution Pξ⁻¹ of a random vector ξ = (ξ₁, ..., ξₙ) is a probability measure on B∗ⁿ, and its restriction to Bⁿ is a probability measure on (Rⁿ, Bⁿ), as in the case n = 1 considered previously. The corresponding point function (cf. Section 7.8)
$$F(x_1, \ldots, x_n) = P\{\xi_i \le x_i,\ 1 \le i \le n\} = P\xi^{-1}\{(-\infty, x]\} \quad (x = (x_1, \ldots, x_n))$$
is the joint distribution function of ξ₁, ..., ξₙ. As shown in Theorem 7.8.1, such a function has the following properties:

(i) F is bounded, nondecreasing and continuous to the right in each xᵢ.

(ii) For any a = (a₁, ..., aₙ), b = (b₁, ..., bₙ), aᵢ < bᵢ, we have
$$\sum\nolimits^* (-1)^{n-r} F(c_1, c_2, \ldots, c_n) \ge 0$$
where $\sum^*$ denotes summation over the 2ⁿ distinct terms with cᵢ = aᵢ or bᵢ, and r is the number of cᵢ which are bᵢ's.

In addition, since Pξ⁻¹ is a probability measure it is easy to check that the following also hold:

(iii) 0 ≤ F(x₁, ..., xₙ) ≤ 1 for all x₁, ..., xₙ, lim_{xᵢ→−∞} F(x₁, ..., xₙ) = 0 (for any fixed i), and
$$\lim_{(x_1, \ldots, x_n)\to(\infty, \ldots, \infty)} F(x_1, \ldots, x_n) = 1.$$

In fact these conditions are also sufficient for F to be the joint d.f. of some set of r.v.'s, as stated in the following theorem.

Theorem 9.3.4 A function F on Rⁿ is the joint d.f. of some r.v.'s ξ₁, ..., ξₙ if and only if it satisfies Conditions (i)–(iii) above. Then for aᵢ ≤ bᵢ, 1 ≤ i ≤ n, P{aᵢ < ξᵢ ≤ bᵢ, 1 ≤ i ≤ n} is given by the sum in (ii) above.

Sketch of Proof The necessity of the conditions has been noted. The sufficiency follows simply from the fact (Theorem 7.8.1) that F defines a measure μF on (Rⁿ, Bⁿ). It is easily checked that μF is a probability measure. If Ω = Rⁿ, F = Bⁿ, P = μF and ξᵢ(x₁, x₂, ..., xₙ) = xᵢ, then ξ₁, ..., ξₙ are r.v.'s on Ω with the joint d.f. F. (The details should be worked through as an exercise.) □

As in the previous section, it is of particular interest to consider the case when Pξ⁻¹ is absolutely continuous with respect to n-dimensional Lebesgue measure, i.e. for every E ∈ Bⁿ,
$$P\xi^{-1}(E) = \int_E f(u_1, \ldots, u_n)\,du_1 \ldots du_n$$
for some Lebesgue integrable f which is thus (Radon–Nikodym Theorem) nonnegative a.e. (hence may be taken everywhere nonnegative) and integrates over Rⁿ to unity. Equivalently, this holds if and only if
$$F(x_1, \ldots, x_n) = \int_{-\infty}^{x_n} \cdots \int_{-\infty}^{x_1} f(u_1, \ldots, u_n)\,du_1 \ldots du_n$$
for all choices of x₁, ..., xₙ. We say that f is the joint p.d.f. of the r.v.'s ξ₁, ..., ξₙ whose d.f. is F. As noted above its integral over any set E ∈ Bⁿ gives Pξ⁻¹(E), which is the probability P{ξ ∈ E} that the value of the vector (ξ₁(ω), ..., ξₙ(ω)) lies in the set E.

Next note that if the r.v.'s ξ₁, ..., ξₙ have joint d.f. F, the joint d.f. of any subset, say ξ₁, ..., ξₖ, of the ξ's may be obtained by letting the remaining x's (xₖ₊₁, ..., xₙ) tend to +∞; e.g. F(x₁, ..., xₙ₋₁, ∞) = lim_{xₙ→∞} F(x₁, ..., xₙ₋₁, xₙ) is the joint d.f. of ξ₁, ..., ξₙ₋₁. This is easily checked. If F is absolutely continuous, the joint density for ξ₁, ..., ξₖ may be obtained by integrating the density f(x₁, ..., xₙ) (corresponding to F) over xₖ₊₁, ..., xₙ. Again this is easily checked (Ex. 9.9). Of course, if we “put” x₂ = x₃ = ··· = xₙ = ∞ in the joint d.f. (or integrate the joint density over these variables in the absolutely continuous case) we obtain just the d.f. (or p.d.f.) of ξ₁. Accordingly the d.f. (or p.d.f.) of ξ₁ is called a marginal d.f. (or p.d.f.), obtained from the joint d.f. (or p.d.f.) in this way.

Finally, note that if ξ₁, ..., ξₙ, ξ₁∗, ..., ξₙ∗ are r.v.'s such that ξᵢ = ξᵢ∗ a.s. for each i, then the joint d.f.'s of the two families (ξ₁, ..., ξₙ), (ξ₁∗, ..., ξₙ∗) are equal. This is obvious, but should be checked.
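The marginal-density remark above admits a direct numerical check. In the sketch below (our own; the joint density of two independent exponential(1) components is an assumed example), integrating the joint p.d.f. over the second variable recovers the exponential marginal e⁻ˣ.

```python
import math

def joint_pdf(x, y):
    # assumed joint density: independent exponential(1) components
    return math.exp(-x - y) if x > 0 and y > 0 else 0.0

def marginal(x, B=40.0, m=40000):
    # integrate the joint density over the remaining variable (midpoint rule)
    w = B / m
    return sum(joint_pdf(x, (k + 0.5) * w) for k in range(m)) * w

checks = [(x, marginal(x)) for x in (0.5, 1.0, 2.0)]  # should match e^{-x}
```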

9.4 Expectation and moments

Let (Ω, F, P) be a probability space. If ξ is a r.v. or extended r.v. on this space, we write Eξ to denote ∫ ξ(ω) dP(ω) whenever this integral is defined, e.g. if ξ is a.s. nonnegative or ξ ∈ L1(Ω, F, P). E thus simply denotes the operation of integration with respect to P, and Eξ is termed the mean or expectation of ξ. In the case where ξ ∈ L1(Ω, F, P) (and hence in particular ξ is a.s. finite and thus a r.v.) Eξ and E|ξ| are finite (since |ξ| ∈ L1 also). It is then customary to say that the mean of ξ exists, or that ξ has a finite mean. Since E denotes integration, any theorem of integration theory will be used with this notation without comment.

Suppose now that ξ is finite a.s. (i.e. is a r.v.) with d.f. F. Let g(x) = |x|, so that g(ξ(ω)) is defined a.s. and then

E|ξ| = ∫_Ω g(ξ(ω)) dP(ω) = ∫_{R*} g(x) dPξ⁻¹(x),

viewing ξ as a transformation from Ω to R* (Theorem 4.6.1). But this latter integral is just ∫_R g(x) dPξ⁻¹(x) = ∫ |x| dF(x) (since Pξ⁻¹ = μ_F – see Section 4.7) and hence E|ξ| = ∫ |x| dF(x) ≤ ∞. E|ξ| is thus finite if and only if ∫ |x| dF(x) < ∞, and in this case the same argument but with g(x) = x gives Eξ = ∫ x dF(x). If also ξ has an absolutely continuous distribution, with p.d.f. f, then (Theorem 5.6.1) Eξ = ∫ x f(x) dx. On the other hand, if ξ is discrete with P{ξ = xn} = pn, it is easily checked (Ex. 9.12) that E|ξ| = ∑ pn|xn| and, when E|ξ| < ∞, that Eξ = ∑ pn xn.

Suppose now that ξ is a r.v. on (Ω, F, P) and that g is a real-valued measurable function on R. Then g(ξ(ω)) is clearly a r.v. (Theorem 3.4.3) and an argument along the precise lines of that given above at once demonstrates the truth of the following result.
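For the discrete case, Eξ = ∑ pn xn can be checked against a long-run sample average; the support points and probabilities below are illustrative choices, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = np.array([-2.0, 0.0, 1.0, 5.0])   # illustrative support points
ps = np.array([0.1, 0.3, 0.4, 0.2])    # probabilities summing to 1
exact = np.sum(ps * xs)                # Eξ = Σ pn xn
sample = rng.choice(xs, size=200000, p=ps)
approx = sample.mean()                 # law-of-large-numbers estimate of Eξ
```

The sample mean agrees with the exact value 1.2 to within sampling error.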

Theorem 9.4.1 If ξ is a r.v. and g is a finite real-valued measurable function on R, then E|g(ξ)| < ∞ if and only if ∫ |g(x)| dF(x) < ∞. Then

Eg(ξ) = ∫ g(x) dF(x).

In particular consider g(x) = x^p for p = 1, 2, 3, ... . We call E|ξ|^p the pth absolute moment of ξ and, when it is finite, say that the pth moment of ξ exists, given by Eξ^p. This holds equivalently if ξ ∈ Lp(Ω, F, P), and the theorem shows that Eξ^p = ∫ x^p dF(x). If p > 0 but p is not an integer then x^p is not real-valued for x < 0 and thus ξ^p(ω) is not necessarily defined a.s. However, if ξ is a nonnegative r.v. (a.s.) ξ^p(ω) is defined a.s. and the above remarks hold. In any case one can still consider E|ξ|^p for all p > 0 regardless of the signs of the values of ξ.

It will be seen in the next section that if ξ ∈ Lp = Lp(Ω, F, P) for some p > 1 (i.e. E|ξ|^p < ∞) then ξ ∈ Lq for 1 ≤ q ≤ p. (This fact applies since P is a finite measure – it does not apply to Lp classes for general measures.) Thus in this case the mean of ξ exists in particular, and (since any constant belongs to Lp on account of the finiteness of P) if p is a positive integer, ξ – Eξ ∈ Lp, or E|ξ – Eξ|^p < ∞. This quantity is called the pth absolute central moment of ξ, and E(ξ – Eξ)^p the pth central moment, p = 1, 2, .... If p = 2, the quantity E(ξ – Eξ)² is the variance of ξ (denoted by var(ξ) or σ²_ξ). It is readily checked (Ex. 9.13) that a central moment may be expressed in terms of ordinary moments (and conversely) and in particular that var(ξ) = Eξ² – (Eξ)².

Joint moments of two or more r.v.'s are also commonly used. For example if ξ, η have finite second moments (ξ, η ∈ L2) then, as will be seen in Theorems 9.5.2, 9.5.1, they are both in L1 and (ξ – Eξ)(η – Eη) ∈ L1. The expectation γ = E{(ξ – Eξ)(η – Eη)} is termed the covariance (cov(ξ, η)) of ξ and η, and ρ = γ/(σ_ξ σ_η) is their correlation, where σ²_ξ = var(ξ) and σ²_η = var(η). See Ex. 9.20 for some useful interpretations and properties which should be checked.

A most important family of r.v.'s in statistical theory and practice arising from Theorem 9.3.4 is that of multivariate normal r.v.'s ξ1, ξ2, ..., ξn whose joint distribution is specified by their means, variances and covariances (or correlations). For the nonsingular case they have the joint p.d.f.

f(x1, x2, ..., xn) = (2π)^{-n/2} |Λ|^{-1/2} exp{-½ (x – μ)′ Λ⁻¹ (x – μ)}

where x = (x1, x2, ..., xn)′, μ = (μ1, μ2, ..., μn)′ (μi = Eξi) and Λ is the covariance matrix with (i, j)th element γij = cov(ξi, ξj), assumed nonsingular (that is, its determinant |Λ| is not zero). See Exs. 9.21, 9.22 for further details, properties and comments.
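A direct evaluation of the quoted density is sketched below; the mean vector and covariance matrix are arbitrary illustrative choices. At x = μ the exponent vanishes, so the density there equals (2π)^{-n/2}|Λ|^{-1/2}.

```python
import numpy as np

def mvn_pdf(x, mu, cov):
    """Nonsingular multivariate normal density as displayed above."""
    n = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(cov, diff)   # (x - μ)' Λ⁻¹ (x - μ)
    det = np.linalg.det(cov)                   # |Λ|
    return (2 * np.pi) ** (-n / 2) * det ** (-0.5) * np.exp(-0.5 * quad)

mu = np.array([1.0, -1.0])                     # illustrative mean vector
cov = np.array([[2.0, 0.6],                    # illustrative covariance
                [0.6, 1.0]])                   # matrix, |Λ| = 1.64 > 0
peak = mvn_pdf(mu, mu, cov)                    # value at the mean
```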

9.5 Inequalities for moments and probabilities

There are a number of standard and useful inequalities concerning moments of a r.v., and probabilities of exceeding a given value. A few of these will be given now, starting with a "translation" of the Hölder and Minkowski Inequalities (Theorems 6.4.2, 6.4.3) into the expectation notation.

Theorem 9.5.1 Suppose that ξ, η are r.v.'s on (Ω, F, P).

(i) (Hölder's Inequality) If E|ξ|^p < ∞, E|η|^q < ∞ where 1 < p, q < ∞, 1/p + 1/q = 1, then E|ξη| < ∞ and

|Eξη| ≤ E|ξη| ≤ (E|ξ|^p)^{1/p} (E|η|^q)^{1/q}

with equality in the second inequality only if one of ξ, η is zero a.s. or if |ξ|^p = c|η|^q a.s. for some constant c > 0.

(ii) (Minkowski's Inequality) If E|ξ|^p < ∞, E|η|^p < ∞ for some p ≥ 1 then E|ξ + η|^p < ∞ and

(E|ξ + η|^p)^{1/p} ≤ (E|ξ|^p)^{1/p} + (E|η|^p)^{1/p}

with equality (if p > 1) only if one of ξ, η is zero a.s. or if ξ = cη a.s. for some constant c > 0. For p = 1 equality holds if and only if ξη ≥ 0 a.s.

(iii) If 0 < p < 1 and E|ξ|^p < ∞, E|η|^p < ∞, then E|ξ + η|^p < ∞ and E|ξ + η|^p ≤ E|ξ|^p + E|η|^p, with equality iff ξη = 0 a.s. (see also Ex. 9.19).

The norm notation – writing ||ξ||_p = (E|ξ|^p)^{1/p} – gives the neatest statements of the inequalities, as in Section 6.4, in the case p ≥ 1. For Hölder's Inequality may be written as ||ξη||_1 ≤ ||ξ||_p ||η||_q and Minkowski's Inequality as ||ξ + η||_p ≤ ||ξ||_p + ||η||_p.

The following result, mentioned in the previous section, is an immediate corollary of (i), and restates Theorem 6.4.8 (with μ(X) = 1).

Theorem 9.5.2 If ξ is a r.v. on (Ω, F, P) and E|ξ|^p < ∞ for some p > 0, then E|ξ|^q < ∞ for 0 < q ≤ p, and (E|ξ|^q)^{1/q} ≤ (E|ξ|^p)^{1/p}, i.e. ||ξ||_q ≤ ||ξ||_p.

In particular it follows that if Eξ² < ∞ then E|ξ| < ∞ and (Eξ)² ≤ (E|ξ|)² ≤ Eξ² (which, of course, may be readily shown directly from E(|ξ| – E|ξ|)² ≥ 0).

Another very simple class of ("Markov type") inequalities relates probabilities such as P{ξ ≥ a}, P{|ξ| ≥ a} etc., to moments of ξ. The following result gives typical examples of such inequalities.

Theorem 9.5.3 Let g be a nonnegative, real-valued function on R, and let ξ be a r.v.

(i) If g(x) is even, and nondecreasing for 0 ≤ x < ∞, then for all a ≥ 0 with g(a) ≠ 0,

(ii) If g is nondecreasing on –∞ < x < ∞ then for all a with g(a) ≠ 0,

P{ξ ≥ a}≤E{g(ξ)}/g(a).

Proof Note first that the monotonicity of g in each case implies its (Borel) measurability (cf. Ex. 3.11). With g as in (i) it is clear that g(ξ(ω)) is defined and finite a.s. and is thus a (nonnegative) r.v. and

Eg(ξ) = ∫ g(ξ(ω)) dP(ω) ≥ ∫_{ω:|ξ(ω)|≥a} g(ξ(ω)) dP(ω) ≥ g(a) P{|ξ| ≥ a},

since g(ξ(ω)) ≥ g(a) if |ξ(ω)| ≥ a. Hence (i) is proved, and the proof of (ii) is similar.

For an inequality in the opposite direction see Ex. 9.18.

Corollary (i) If ξ is any r.v. and 0 < p < ∞, a > 0, then

P{|ξ| ≥ a} ≤ E|ξ|^p / a^p.

(ii) If ξ is a r.v. with Eξ² < ∞, then for all a > 0,

P{|ξ – Eξ| ≥ a} ≤ var(ξ)/a².

The inequality in (i) (which follows by taking g(x) = |x|^p) is called "the" Markov Inequality. The case p = 2 in (i) is the well known Chebyshev Inequality.
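The bound in (ii) can be sanity-checked on simulated data (the distribution parameters below are arbitrary choices); note that the inequality also holds exactly for the empirical distribution of any sample.

```python
import numpy as np

rng = np.random.default_rng(2)
xi = rng.normal(loc=0.5, scale=2.0, size=500000)   # illustrative r.v. sample
a = 3.0
# empirical P{|ξ - Eξ| ≥ a}, using the sample mean for Eξ
lhs = np.mean(np.abs(xi - xi.mean()) >= a)
bound = xi.var() / a ** 2                          # var(ξ)/a²
```

Here the true probability is about 0.13 while the Chebyshev bound is about 4/9, illustrating that the bound, though valid, can be far from sharp.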

The final inequality, which is sometimes very useful, concerns convex functions of a r.v. We recall that a function g defined on the real line is convex if g(λx + (1 – λ)y) ≤ λg(x) + (1 – λ)g(y) for any x, y and 0 ≤ λ ≤ 1. A convex function is known to be continuous and thus Borel measurable.

Theorem 9.5.4 (Jensen's Inequality) If ξ is a r.v. with E|ξ| < ∞ and g is a convex function on R such that E|g(ξ)| < ∞, then

g(Eξ) ≤Eg(ξ).

Proof Since g is convex it is known that given any x0 there is a real number h = h(x0) such that g(x) – g(x0) ≥ (x – x0)h for all x. (This may be proved for example by showing that for all x < x0 < y we have (g(x) – g(x0))/(x – x0) ≤ (g(y) – g(x0))/(y – x0), and taking h = sup_{x<x0} (g(x) – g(x0))/(x – x0).) With x0 = Eξ this gives

g(ξ)–g(Eξ) ≥ (ξ – Eξ)h (h = h(Eξ)).

The desired conclusion follows at once by taking expectations of both sides since the expectation of the right hand side is zero. 
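Jensen's Inequality with the convex choice g(x) = e^x can be checked numerically (the uniform distribution here is an arbitrary illustrative choice); it in fact holds exactly for the empirical distribution of any sample.

```python
import numpy as np

rng = np.random.default_rng(3)
xi = rng.uniform(-1.0, 2.0, size=200000)  # illustrative sample of ξ
lhs = np.exp(xi.mean())                   # g(Eξ) with g(x) = exp(x)
rhs = np.exp(xi).mean()                   # E g(ξ)
```

For this choice g(Eξ) ≈ e^{0.5} ≈ 1.65 while Eg(ξ) ≈ 2.34, consistent with g(Eξ) ≤ Eg(ξ).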

9.6 Inverse functions and probability transforms

If F is a strictly increasing continuous function on the real line (or a subinterval thereof) and a = inf F(x), b = sup F(x), then its inverse function F⁻¹ is immediately defined for y ∈ (a, b) by F⁻¹(y) = x, where x is the unique value such that F(x) = y. Then F⁻¹(F(x)) = x for all x in the domain of F and F(F⁻¹(y)) = y for all y in the domain (a, b) of F⁻¹.

If F is strictly increasing but not everywhere continuous, F⁻¹(y) is not thus defined in this way for all y ∈ (a, b): e.g. if x0 is a discontinuity point of F and F(x0) > F(x0 – 0), there is no x for which F(x) = y if y ∈ (F(x0 – 0), F(x0)). On the other hand, if F is continuous and nondecreasing but not strictly increasing, there is an interval (x1, x2) on which F is constant, i.e. F(x) = y say for x1 < x < x2. Hence there is no unique x for which F(x) = y.

It is, however, useful to define an inverse function F⁻¹ when F is nondecreasing (or nonincreasing) but not necessarily strictly monotone or continuous, and this may be done in various equally natural ways to retain some of the useful properties valid for the strictly monotone continuous case. We employ the following (commonly used) form of definition. Let F be a nondecreasing function defined on an interval and for y ∈ (inf F(x), sup F(x)) define F⁻¹(y) by

F⁻¹(y) = inf{x : F(x) ≥ y}.

To see the meaning of this definition it is helpful to visualize its value at points y ∈ (F(x0 – 0), F(x0 + 0)) where F is discontinuous at x0, or at points y = F(x) for x such that F is constant in some neighborhood (x – ε, x + ε). It is also helpful to determine the points x for which F⁻¹(F(x)) ≠ x, and the points y such that F(F⁻¹(y)) ≠ y. The following results are examples of many useful properties of this form of the inverse function, the proofs of which may be supplied as exercises by an interested reader.1

1 Or see e.g. [Resnick, Section 0.2] for an excellent detailed treatment.

Lemma 9.6.1 If F is a nondecreasing function on R with inverse F⁻¹ defined as above, then

(i) (a) F⁻¹ is nondecreasing and left-continuous (F⁻¹(y – 0) = F⁻¹(y)).
(b) F⁻¹(F(x)) ≤ x.
(c) If F is strictly increasing from the left at x, in the sense that F(a) < F(x) whenever a < x, then F⁻¹(F(x)) = x.
(ii) If F is right-continuous then
(a) {x : F(x) ≥ y} is closed for each y
(b) F(F⁻¹(y)) ≥ y
(c) F⁻¹(y) ≤ x if and only if y ≤ F(x)
(d) x < F⁻¹(y) if and only if F(x) < y.
(iii) If for a given y, F is continuous at F⁻¹(y) then F(F⁻¹(y)) = y. Hence if F is everywhere continuous then F(F⁻¹(y)) = y for all y.

Results of this type are useful for transformation of r.v.'s to standard distributions ("probability transformations"). For example, it should be shown as an exercise (Ex. 9.4) that if ξ has a continuous distribution function F, then F(ξ) is a uniform r.v., and (Ex. 9.5) that if ξ is a uniform r.v. and F some d.f., then η = F⁻¹(ξ) is a r.v. with d.f. F. Such results can be useful for simulation and sometimes allow the proof of properties of general r.v.'s to be done just under special assumptions such as uniformity, normality, etc.

We shall be interested later in the topic of "convergence in distribution" involving the convergence of d.f.'s Fn to a d.f. F at continuity points of the latter. The following result (which may be proved as an exercise, or reference made to e.g. [Resnick]) involves the more general framework where the Fn's need not be d.f.'s (and convergence at continuity points is then commonly referred to as vague convergence – cf. Section 11.3).
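The definition F⁻¹(y) = inf{x : F(x) ≥ y} and property (ii)(c) of Lemma 9.6.1 can be sketched for a right-continuous step d.f.; the support points and probabilities below are illustrative choices.

```python
import numpy as np

# Discrete r.v. with P{ξ = xj} = pj; F is its right-continuous step d.f.
xs = np.array([-1.0, 0.0, 2.0])
ps = np.array([0.25, 0.5, 0.25])      # dyadic values: exact in floating point
cum = np.cumsum(ps)                   # F at the jump points: 0.25, 0.75, 1.0

def F(x):
    """Right-continuous step d.f."""
    return float(cum[xs <= x].max()) if np.any(xs <= x) else 0.0

def F_inv(y):
    """Generalized inverse inf{x : F(x) >= y} for 0 < y <= 1."""
    return float(xs[np.searchsorted(cum, y)])

# Check (ii)(c): F_inv(y) <= x  iff  y <= F(x), over a grid of test points.
ok = all((F_inv(y) <= x) == (y <= F(x))
         for y in [0.1, 0.25, 0.5, 0.75, 0.9, 1.0]
         for x in [-2.0, -1.0, -0.5, 0.0, 1.0, 2.0, 3.0])
```

For instance F_inv(0.75) = 0.0 (the smallest x with F(x) ≥ 0.75), even though F jumps through 0.75 at that point.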

Lemma 9.6.2 If Fn, n ≥ 1, F are nondecreasing and Fn(x) → F(x) at all continuity points x of F, then Fn⁻¹(y) → F⁻¹(y) at all continuity points y of F⁻¹.

Exercises

9.1 Let pj ≥ 0, ∑_{j=1}^∞ pj = 1, xj real, F(x) = ∑_{xj≤x} pj. Show that ν(E) = ∑_{xj∈E} pj defines a measure on the Borel sets B and ν(E) = μ_F(E) for E ∈ B. (If E = ∪_{k=1}^∞ Ek write χj = χ_E(xj), χjk = χ_{Ek}(xj) so that ν(E) = ∑_j χj pj, ν(Ek) = ∑_j χjk pj.) Thus for given pj ≥ 0, ∑ pj = 1, there is a discrete r.v. ξ with P{ξ = xj} = pj and P{ξ ∈ E} = ∑_{xj∈E} pj.

9.2 Let F be a d.f. and F(x) = ∫_{-∞}^x f(t) dt where f ∈ L1(–∞, ∞). (It is not initially assumed that f ≥ 0.) Define the finite signed measure ν(E) = ∫_E f dx. Show that ν(E) = μ_F(E) on the Borel sets B. (Hint: Use Lemma 5.2.4.) Hence show that f ≥ 0 a.e.

9.3 Let Ω be the unit interval, F its Borel subsets, and P Lebesgue measure on F. Let ξ(ω) = ω, η(ω) = 1 – ω. Show that ξ, η have the same distribution but are not identical. In fact P(ξ ≠ η) = 1.

9.4 Let ξ be a r.v. whose d.f. F is continuous. Let η = F(ξ) (i.e. η(ω) = F(ξ(ω))). Show that η is uniformly distributed on (0, 1), i.e. that its d.f. G is given by G(x) = 0 for x < 0, G(x) = x for 0 ≤ x ≤ 1 and G(x) = 1 for x > 1. What if F is not continuous? (For simplicity assume F has just one jump.)

9.5 Let F be any d.f. and define its inverse F⁻¹ as in Section 9.6. Show that if ξ is uniformly distributed over (0, 1), then η = F⁻¹(ξ) has d.f. F.

9.6 If ξ, η are discrete r.v.'s, is ξ + η discrete? What about ξη and ξ/η? What happens to these combinations if ξ is discrete and η continuous?

9.7 Prove Lemma 9.3.1. (Hints: For (i) it may be noted that (a) every ξ ∈ C is σ(C)-measurable and (b) if every ξ ∈ C is G-measurable (for some fixed σ-field G) then G ⊃ σ(ξ), each ξ ∈ C. Clearly in (ii) the σ-field on the left contains that on the right. However, each ξλ is measurable with respect to the σ-field on the right, which therefore contains the smallest σ-field yielding measurability of all ξλ, viz. σ(C).)

9.8 In Theorem 9.3.3, the ξi are all defined on the same subset of Ω (i.e. where ξ is defined). If we start with mappings ξ1, ..., ξn defined (and finite a.s.) on possibly different subsets D1, ..., Dn (with P(Di) = 1) we may define ξ = (ξ1, ..., ξn) on D = ∩_{i=1}^n Di. If ξ1, ..., ξn are each r.v.'s then ξ is a random vector, as in the theorem. Show that the converse may not be true, that is, if ξ is a random vector, it is not necessarily true that the ξi are r.v.'s (it is true if the Di are measurable – e.g. if P is complete).

9.9 Let F be an absolutely continuous d.f. on R^n (with density f(x1, ..., xn)) for r.v.'s ξ1, ..., ξn. Show that the r.v.'s ξ1, ..., ξk (k < n) have an absolutely continuous distribution and find their joint p.d.f.

9.10 The concept of a "continuous singular" d.f. or probability measure in R² is more common than in R. For example, let F be any continuous d.f. on R. For any Borel set B in R² define μ(B) = μ_F(B⁰) where B⁰ is the section of B defined by y = 0. Show that μ has no point atoms but is singular with respect to two-dimensional Lebesgue measure.

9.11 More generally suppose that C is a simple curve in the plane given parametrically as x = x(s), y = y(s), where x and y are (Borel) measurable 1-1 functions of s. If μ is a probability measure on (R, B) we may define a probability measure on (R², B²) by ν(E) = μT⁻¹(E) where T is the measurable transformation Ts = (x(s), y(s)). The measure ν is singular with respect to Lebesgue measure and has no atoms if μ has no atoms. If s is distance along the curve, ν(E) may be regarded as the μ-measure of E ∩ C

considered as a linear set with origin at s = 0. For example, if C is the diagonal x = y we have x(s) = s/√2 = y(s). Write down the two-dimensional d.f. F(x, y) (= P((–∞, x] × (–∞, y])) corresponding to ν in terms of the d.f. G corresponding to μ. Note that F(x, y) is continuous (but μ_F is not absolutely continuous with respect to Lebesgue measure).

9.12 Let ξ be discrete with P{ξ = xn} = pn. Show that E|ξ| = ∑ pn|xn| and if E|ξ| < ∞ then Eξ = ∑ pn xn.

9.13 Let ξ be a r.v. with E|ξ|^n < ∞ for some positive integer n. Express the nth central moment of ξ in terms of the first n ordinary moments, and conversely.

9.14 Let ξ be a r.v. with E|ξ| < ∞ and let En be any sequence of sets with P(En) → 0. Show that E(ξχ_{En}) → 0 (cf. Theorem 4.5.3). Show in particular that E(ξχ_{(|ξ|>n)}) → 0.

9.15 Let ξ be a r.v. on (Ω, F, P) and define En = {ω : |ξ(ω)| ≥ n}. Show that

∑_{n=1}^∞ P(En) ≤ E|ξ| ≤ 1 + ∑_{n=1}^∞ P(En)

and hence that E|ξ| < ∞ if and only if ∑_{n=1}^∞ P(En) < ∞. If ξ takes only positive integer values, show that Eξ = ∑_{n=1}^∞ P(En). (Hint: Let Fn = {ω : n ≤ |ξ(ω)| < n + 1} and note that ∑_{n=1}^∞ nP(Fn) = ∑_{n=1}^∞ P(En).)

9.16 If ξ is a nonnegative r.v. with d.f. F show that

Eξ = ∫_0^∞ [1 – F(x)] dx.

(Hint: Use Fubini's Theorem.) If ξ is a real-valued r.v. with d.f. F show that

E|ξ| = ∫_{-∞}^0 F(x) dx + ∫_0^∞ [1 – F(x)] dx

and thus E|ξ| < ∞ if and only if ∫_{-∞}^0 F(x) dx < ∞ and ∫_0^∞ [1 – F(x)] dx < ∞, in which case

Eξ = ∫_0^∞ [1 – F(x)] dx – ∫_{-∞}^0 F(x) dx.

9.17 Let F be any d.f. Show that, for any h > 0,

∫_{-∞}^∞ (F(x + h) – F(x)) dx = h.

Why does this not contradict the obvious statement that ∫_{-∞}^∞ F(x + h) dx = ∫_{-∞}^∞ F(x) dx?

9.18 Let g be a nonnegative bounded function on R, and ξ a r.v. If g is even and nondecreasing on 0 < x < ∞, show that

P{|ξ|≥a}≥E{g(ξ)–g(a)}/M

for any M < ∞ such that g(ξ(ω)) ≤ M a.s. (e.g. M = sup g(x)). If g is instead nondecreasing on (–∞, ∞), show that the same inequality holds with ξ instead of |ξ| on the left.

9.19 Let ξ, η be r.v.'s with E|ξ|^p < ∞, E|η|^p < ∞. Show that for p > 0, E|ξ + η|^p ≤ c_p{E|ξ|^p + E|η|^p} where c_p = 1 if 0 < p ≤ 1, c_p = 2^{p-1} if p > 1. (Hint: (1 + x)^p ≤ c_p(1 + x^p) for x ≥ 0. Note equality when x = 0 for p ≤ 1, and x = 1 for p > 1, and consider derivatives.)

9.20 Show that the covariance γ of two r.v.'s ξ1, ξ2 satisfies |γ| ≤ σ1σ2 where σi is the standard deviation of ξi, i = 1, 2, and hence that the correlation ρ satisfies |ρ| ≤ 1. The parameters γ and especially ρ are regarded as simple measures of dependence of ξ1, ξ2. What is the value of ρ if ξ1 = aξ2 (a) for some a > 0, (b) for a < 0?

9.21 Write down the covariance matrix Λ for a pair of r.v.'s ξ1, ξ2 in terms of their means μ1, μ2, standard deviations σ1, σ2 and correlation ρ. Show that Λ is nonsingular if |ρ| < 1 and then obtain its inverse. Hence write down the joint p.d.f. of ξ1 and ξ2 in terms of μi, σi, i = 1, 2, and ρ, when ξ1 and ξ2 are assumed to be jointly normal.

9.22 If ξ1, ξ2, ..., ξn are jointly normal, with means μi, 1 ≤ i ≤ n, and nonsingular covariance matrix Λ, show that the members of any subgroup (e.g. ξ1, ξ2, ..., ξk, k ≤ n) are jointly normal, writing down their covariance matrix in terms of Λ.

10

Independence

10.1 Independent events and classes

Two events A, B are termed independent if P(A ∩ B) = P(A) · P(B). Physically this means (as can be checked by interpreting probabilities as long term frequencies) that the proportion of those times A occurs, for which B also occurs in many repetitions of the experiment E, is ultimately the same as the proportion of times B occurs in all. That is, roughly, "knowledge of the occurrence or not of A does not affect the probability of B" (and conversely). We are, of course, interested primarily in the mathematical definition given, and its consequences.

The definition of independence can be usefully extended to a class of events. We say that A is a class of independent events (or that the events A of a class are independent) if for every finite subclass of distinct events A1, A2, ..., An of A, we have P(∩_{i=1}^n Ai) = ∏_{i=1}^n P(Ai). Note that it is not, in general, sufficient for this that the events of A be pairwise independent (see Ex. 10.1).

A more general notion concerns a family of independent classes. If Aλ is a class of events for each λ in some index set Λ, {Aλ : λ ∈ Λ} is said to be a family of independent classes of events (or the classes {Aλ : λ ∈ Λ} are independent), if for every choice of one member Aλ from each Aλ, the events {Aλ : λ ∈ Λ} are independent.

Note that a class A of independent events may be regarded as a family of independent classes of events, where the classes of the family each consist of just one event of A. This viewpoint is sometimes useful. Note also that while the index set Λ may be infinite (of any order), a family A = {Aλ : λ ∈ Λ} is independent if and only if every finite subfamily {Aλ1, ..., Aλn} is independent (for distinct λi). Thus it usually suffices to consider finite families.
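The remark that pairwise independence does not imply independence (Ex. 10.1) is illustrated by the classic two-coin construction below; it is a standard example, not one worked in the text.

```python
from itertools import product

# Sample space: two fair coin tosses, each outcome with probability 1/4.
omega = list(product("HT", repeat=2))
P = lambda E: sum(1 for w in omega if w in E) / 4.0

A = {w for w in omega if w[0] == "H"}     # first toss heads
B = {w for w in omega if w[1] == "H"}     # second toss heads
C = {w for w in omega if w[0] == w[1]}    # the two tosses agree

# A, B, C are pairwise independent ...
pairwise = (P(A & B) == P(A) * P(B) and
            P(A & C) == P(A) * P(C) and
            P(B & C) == P(B) * P(C))
# ... but not independent: P(A ∩ B ∩ C) = 1/4 while the product is 1/8.
joint = P(A & B & C) == P(A) * P(B) * P(C)
```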

Remark If A1, ..., An are classes of events such that each Ai contains a set Ci with P(Ci) = 1 (e.g. Ci = Ω) then to show that A1, A2, ..., An are independent classes it is only necessary to show that P(∩_{i=1}^n Ai) = ∏_{i=1}^n P(Ai) for this one n, and all choices of Ai ∈ Ai, 1 ≤ i ≤ n. For this relation then follows at once for subfamilies – e.g.

∏_{i=1}^{n-1} P(Ai) = ∏_{i=1}^{n-1} P(Ai) P(Cn) = P((∩_{i=1}^{n-1} Ai) ∩ Cn)
= P(∩_{i=1}^{n-1} Ai) – P((∩_{i=1}^{n-1} Ai) ∩ Cn^c)
= P(∩_{i=1}^{n-1} Ai)

since P(Cn^c) = 0.

A family of independent classes may often be enlarged without losing independence. The following is a small result in this direction – its proof is left as an easy exercise (cf. Ex. 10.3).

Lemma 10.1.1 Let {Aλ : λ ∈ Λ} be independent classes of events, and Aλ* = Aλ ∪ Gλ where, for each λ, Gλ is any class of sets E such that P(E) = 0 or 1. Then {Aλ* : λ ∈ Λ} are independent classes.

The next result is somewhat more sophisticated and very useful.

Theorem 10.1.2 Let {Aλ : λ ∈ Λ} be independent classes of events such that each Aλ is closed under finite intersections. Let Bλ be the σ-field generated by Aλ, Bλ = σ(Aλ). Then {Bλ : λ ∈ Λ} are also independent classes.

Proof Define Aλ* = Aλ ∪ {Ω}. Then by Lemma 10.1.1 {Aλ* : λ ∈ Λ} are independent classes, and clearly Bλ is also the σ-field generated by Aλ*. Thus we assume without loss of generality that Ω ∈ Aλ for each λ.

In accordance with a remark above, it is sufficient to show that any finite subfamily {Bλ1, Bλ2, ..., Bλn} (with distinct λi) are independent classes. If it is shown that {Bλ1, Aλ2, ..., Aλn} are independent classes, the result will then follow inductively.

Let G be the class of sets E ∈ F such that P(E ∩ A2 ∩ ... ∩ An) = P(E)P(A2) ... P(An) for all Ai ∈ Aλi (i = 2, ..., n). If E ∈ G, F ∈ G and E ⊃ F, Ai ∈ Aλi (i = 2, ..., n),

P{(E – F) ∩ A2 ∩ ...∩ An}

= P(E ∩ A2 ∩ ...∩ An)–P(F ∩ A2 ∩ ...∩ An)

= P(E)P(A2) ...P(An)–P(F)P(A2) ...P(An)

= P(E – F)P(A2) ... P(An).

Thus E – F ∈ G and G is therefore closed under proper differences. Similarly it is easily checked that G is closed under countable disjoint unions, so that G is a D-class. But G ⊃ Aλ1 which is closed under intersections, and hence by Theorem 1.8.5 (Corollary) G contains the σ-ring generated by Aλ1. This σ-ring is the σ-field Bλ1 since Ω ∈ Aλ1, and hence G ⊃ Bλ1. Hence (using the Remark preceding Lemma 10.1.1) {Bλ1, Aλ2, ..., Aλn} are independent classes and, as noted, this is sufficient for the result of the theorem.

If a class A of independent events is regarded as a family of independent classes in the manner described above (i.e. each class consisting of one member of A) we may, according to the theorem, enlarge each (1-member) class {A} to the σ-field it generates, viz. {A, A^c, Ω, ∅}. Thus these classes constitute, for A ∈ A, a family of independent classes. A class of independent events may now be obtained by selecting one event from each {A, A^c, Ω, ∅}. Thus the following corollary to Theorem 10.1.2 holds.

Corollary If A is a class of independent events, and if some of the events of A are replaced by their complements, then the resulting class is again a class of independent events.

This result can, of course, be shown "by hand" from the definition. For example, if A, B are independent then it follows directly that so are A, B^c (which should be shown as an exercise).

The final result of this section is a useful extension of Theorem 10.1.2 involving the "grouping" of a family of independent classes. In this, by a partition of the set Λ we mean any class of disjoint sets {Λγ : γ ∈ Γ} with ∪_{γ∈Γ} Λγ = Λ. If {Aλ : λ ∈ Λ} are independent classes, clearly the "grouped classes" {∪_{λ∈Λγ} Aλ : γ ∈ Γ} are independent. The following result shows that the same is true for Bγ = σ(∪_{λ∈Λγ} Aλ), γ ∈ Γ, provided each Aλ is closed under finite intersections. This does not follow immediately from Theorem 10.1.2 since ∪_{λ∈Λγ} Aλ need not be closed under intersections, but the classes may be expanded to have this closure property and allow application of the theorem.

Theorem 10.1.3 Let {Aλ : λ ∈ Λ} be independent classes, each assumed to be closed under finite intersections. Let {Λγ : γ ∈ Γ} be a partition of Λ, and Bγ = σ(∪_{λ∈Λγ} Aλ). Then {Bγ : γ ∈ Γ} are independent classes.

Proof For each γ ∈ Γ let Gγ denote the class of all sets of the form A1 ∩ A2 ∩ ... ∩ An, for Ai ∈ Aλi, where λ1, ..., λn are any distinct members of Λγ (n = 1, 2, ...). Gγ is closed under finite intersections since each Aλ is so closed. Further, {Gγ : γ ∈ Γ} are independent classes (which is easily checked from the definition of the sets of Gγ). Hence, by Theorem 10.1.2, the σ-fields {σ(Gγ) : γ ∈ Γ} are independent classes. But clearly ∪_{λ∈Λγ} Aλ ⊂ Gγ so that Bγ ⊂ σ(Gγ), and hence {Bγ : γ ∈ Γ} are independent classes, as required.

10.2 Independent random elements

We will be primarily concerned with the concept of independence in the context of random variables. However, the definition and results of this section will apply more generally to arbitrary random elements, since this extra generality can be useful.

Specifically, suppose that for each λ in an index set Λ, ξλ is a random element on a fixed probability space (Ω, F, P), with values in a measurable space (Xλ, Sλ) – which may change with λ. (If ξλ is a r.v., of course, Xλ = R*, Sλ = B*.) If the classes {σ(ξλ) : λ ∈ Λ} are independent, then {ξλ : λ ∈ Λ} is said to be a family of independent r.e.'s, or the r.e.'s {ξλ : λ ∈ Λ} are independent.

Since σ(ξλ) = σ(ξλ⁻¹Sλ) = σ{ξλ⁻¹B : B ∈ Sλ} and ξλ⁻¹Sλ is closed under intersections, it follows at once from Theorem 10.1.2 that the following criterion holds – facilitating the verification of independence of r.e.'s.

Theorem 10.2.1 The r.e.'s {ξλ : λ ∈ Λ} are independent iff {ξλ⁻¹Sλ : λ ∈ Λ} are independent classes, i.e. iff for each n = 1, 2, ..., distinct λi ∈ Λ and Bi ∈ Sλi, 1 ≤ i ≤ n,

P(∩_{i=1}^n ξλi⁻¹Bi) = ∏_{i=1}^n P(ξλi⁻¹Bi).

Indeed these conclusions hold if each Sλ is replaced by Gλ where Gλ is any class of subsets of Xλ, closed under intersections and such that σ(Gλ) = Sλ for each λ.

Proof The main conclusion follows as noted prior to the statement of the theorem. The final conclusion follows by exactly the same pattern (see Ex. 10.9).

The above definition is readily extended to include independence of families of r.e.'s. Specifically, let Cλ be a family of random elements for each λ in an index set Λ. Then if the σ-fields {σ(Cλ) : λ ∈ Λ} are independent classes of events, we shall say that {Cλ : λ ∈ Λ} are independent families of random elements, or "the classes Cλ of r.e.'s are independent for λ ∈ Λ".

Thus we have the notions of independence for random elements, and for families of r.e.'s, parallel to the corresponding notions for events and classes of events. (However, see Ex. 10.10.) Theorem 10.1.3 has the following obvious (and useful) analog for independent random elements.

Theorem 10.2.2 Let {Cλ : λ ∈ Λ} be independent families of random elements on a space (Ω, F, P), let {Λγ : γ ∈ Γ} be a partition of Λ, and write Hγ = ∪_{λ∈Λγ} Cλ. Then {Hγ : γ ∈ Γ} are independent families of random elements.

Proof From Lemma 9.3.1 (iii) we have

σ(Hγ) = σ(∪_{λ∈Λγ} σ(Cλ)).

But since {σ(Cλ):λ ∈ Λ} are independent classes (each closed under intersections), it follows from Theorem 10.1.3 that {σ(Hγ):γ ∈ Γ} are also independent classes. 

The following result gives a useful characterization of independence of r.e.’s in terms of product forms for the distributions of finite subfamilies. This is especially important for the case of r.v.’s considered in the next section.

Theorem 10.2.3 Let ξ1, ξ2, ..., ξn be r.e.'s on (Ω, F, P) with values in measurable spaces (Xi, Si), 1 ≤ i ≤ n. Then ξ = (ξ1, ξ2, ..., ξn) is a r.e. on (Ω, F, P) with values in (∏_{i=1}^n Xi, ∏_{i=1}^n Si), and ξ1, ..., ξn are independent iff

Pξ⁻¹ = Pξ1⁻¹ × Pξ2⁻¹ × ... × Pξn⁻¹ (= ∏_{i=1}^n Pξi⁻¹),

i.e. the distribution of ξ is the product (probability) measure having the individual distributions as components.

Thus, for a general index set Λ, r.e.'s (ξλ : λ ∈ Λ) are independent iff the distribution of ξ = (ξλ1, ..., ξλn) factors in the above manner for each n and each choice of distinct λi.

Proof That ξ =(ξ1, ..., ξn) is a r.e. follows simply (as in Theorem 9.3.3 for the special case of random variables and vectors) and

ξ⁻¹(B1 × B2 × ... × Bn) = ∩_{i=1}^n ξi⁻¹(Bi)

for any Bi ∈ Si, 1 ≤ i ≤ n. Thus if the ξi are independent, Pξ⁻¹(B1 × B2 × ... × Bn) = ∏_{i=1}^n Pξi⁻¹(Bi), so that Pξ⁻¹ and the product measure ∏_{i=1}^n Pξi⁻¹ agree on measurable rectangles and hence on all sets of ∏_{i=1}^n Si. Conversely if Pξ⁻¹ = ∏_{i=1}^n Pξi⁻¹,

P(∩_{i=1}^n ξi⁻¹Bi) = Pξ⁻¹(B1 × B2 × ... × Bn) = ∏_{i=1}^n Pξi⁻¹(Bi).

As noted the same relation is automatic for subclasses of (ξ1, ξ2, ..., ξn) by writing appropriate Bi = Xi, so that independence of (ξ1, ..., ξn) follows. 

10.3 Independent random variables

The independence properties developed in the last section, of course, apply in particular to random variables, as will be seen in the following results. For simplicity these are mainly stated for finite families, since the results for infinite families involve just finite subfamilies.

Theorem 10.3.1 The following conditions are each necessary and sufficient for independence of r.v.'s ξ1, ξ2, ..., ξn (on a probability space (Ω, F, P)).

(i) P(∩_{i=1}^n ξi⁻¹Bi) = ∏_{i=1}^n P(ξi⁻¹Bi) for every choice of extended Borel sets B1, ..., Bn.
(ii) (i) holds for all choices of (ordinary) Borel sets B1, ..., Bn (in place of all extended Borel sets).
(iii) The distribution Pξ⁻¹ of the random vector ξ = (ξ1, ξ2, ..., ξn) on (R^n, B^n) (or (R*^n, B*^n)) is the product of the distributions Pξi⁻¹ on (R, B) (or (R*, B*)), i.e.

Pξ⁻¹ = Pξ1⁻¹ × Pξ2⁻¹ × ... × Pξn⁻¹.

(iv) The joint d.f. F_{1,...,n}(x1, ..., xn) of ξ1, ..., ξn factors as ∏_{i=1}^n Fi(xi), where Fi is the d.f. of ξi.

Proof Independence of (ξ1, ξ2, ..., ξn) is readily seen to be equivalent to each of (i)–(iii) using Theorem 10.2.3. (iii) at once implies (iv), and that (iv) implies e.g. (iii) is readily checked.  The next result is a useful application of Theorem 10.2.2.

Theorem 10.3.2 Let (ξ11, ..., ξ1n1, ξ21, ..., ξ2n2, ξ31, ...) be independent r.v.'s on a space (Ω, F, P). Define random vectors ξ1, ξ2, ... by ξi = (ξi1, ξi2, ..., ξini). Then (ξ1, ξ2, ...) are independent random vectors. Moreover if φi is a finite-valued measurable function on (R*^{ni}, B*^{ni}) for i = 1, 2, ..., and ηi = φi(ξi), then (η1, η2, ...) are independent r.v.'s.

Proof By Theorem 10.2.2, {(ξi1, ξi2, ..., ξini) : i = 1, 2, ...} are independent families of r.v.'s, so that {σ(ξi1, ..., ξini) : i = 1, 2, ...} are independent classes of events. But, by Lemma 9.3.2, σ(ξi) = σ(ξi1, ..., ξini), so that (ξ1, ξ2, ...) are independent random vectors, as required.

Further, a typical generating set of σ(ηi) is ηi⁻¹B for B ∈ B. But ηi⁻¹B = ξi⁻¹(φi⁻¹B) ∈ σ(ξi) so that σ(ηi) ⊂ σ(ξi). Since {σ(ξi) : i = 1, 2, ...} are independent classes, so are the classes {σ(ηi) : i = 1, 2, ...}, i.e. (η1, η2, ...) are independent r.v.'s, completing the proof.

Corollary The theorem remains true if the φi are defined only on (measurable) subsets Di ⊂ R*^{ni} such that ξi ∈ Di a.s. (so that ηi may be defined at fewer ω-points than ξi – though still a.s.). In particular the theorem holds if Di = R^{ni}, i.e. if the φi are defined for finite values of their arguments only – the case of practical importance.

Proof Define φi* = φi on (the measurable set) Di and zero on R*^{ni} – Di. Then if ηi* = φi*(ξi) we have ηi* = ηi a.s. Since (η1*, η2*, ...) are independent by the theorem, so are (η1, η2, ...) (Ex. 10.11).

The next result concerns the existence of a sequence of independent r.v.'s with given d.f.'s.

Theorem 10.3.3 Let Fi be a d.f. for each i =1,2,.... Then there is a probability space (Ω, F , P) and a sequence (ξ1, ξ2, ...) of independent r.v.’s such that ξi has d.f. Fi.

Proof Write μi for the Lebesgue–Stieltjes (probability) measure on (R, B) corresponding to Fi. Then by Theorem 7.10.4, there exists a probability measure P on (R^∞, B^∞) such that for any n and Borel sets B1, B2, ..., Bn,

P(B1 × B2 × ... × Bn × R × R × ...) = ∏_{i=1}^n μi(Bi).

Write (Ω, F, P) for the probability space (R^∞, B^∞, P) and define ξ1, ξ2, ... on this space by ξi(ω) = xi when ω = (x1, x2, x3, ...). Each ξi is clearly a r.v. and for Borel sets B1, B2, ..., Bn

P{∩_{i=1}^n ξi^{-1}(Bi)} = P(B1 × B2 × ... × Bn × R × ...) = ∏_{i=1}^n μi(Bi).

In particular, B1 = B2 = ... = B_{n-1} = R gives P(ξn^{-1}Bn) = μn(Bn) for each n, so that (writing i for n) P(∩_{i=1}^n ξi^{-1}Bi) = ∏_{i=1}^n P(ξi^{-1}Bi) and hence the ξi are

independent. Also Pξn^{-1}(–∞, x] = μn(–∞, x] = Fn(x), so that ξn has d.f. Fn as required.
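The proof above realizes the ξi as coordinate maps on (R^∞, B^∞). A common computational counterpart, stated here as a sketch rather than as the book's construction, is the standard quantile transform: if U is uniform on (0, 1) and Q(u) = inf{x : F(x) ≥ u}, then Q(U) has d.f. F, so independent uniforms yield independent r.v.'s with the prescribed d.f.'s. The helper names (quantile, sample_independent) are illustrative:

```python
# Sketch of independent r.v.'s with given d.f.'s via the quantile transform
# (an assumption-laden numerical illustration, not the text's construction).
import math
import random

def quantile(F, u, lo=-50.0, hi=50.0, tol=1e-9):
    """Generalized inverse Q(u) = inf{x : F(x) >= u}, found by bisection;
    assumes all relevant mass of F lies in [lo, hi]."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if F(mid) >= u:
            hi = mid
        else:
            lo = mid
    return hi

def sample_independent(dfs, rng):
    """One draw of (xi_1, ..., xi_n), xi_i with d.f. F_i, independent."""
    return [quantile(F, rng.random()) for F in dfs]

F_unif = lambda x: min(max(x, 0.0), 1.0)                 # uniform on [0,1]
F_exp = lambda x: 1.0 - math.exp(-x) if x > 0 else 0.0   # unit exponential

rng = random.Random(0)
draws = [sample_independent([F_unif, F_exp], rng) for _ in range(2000)]
mean_exp = sum(d[1] for d in draws) / len(draws)         # near E xi = 1
```

Independence of the output follows from independence of the uniform inputs, exactly as in Theorem 10.3.2.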

Note that a more general result of this kind, where the ξi need not be independent, will be indicated in Chapter 15 for stochastic process theory.

If ξ1, ξ2 are r.v.'s in L2(Ω, F, P) then ξ1ξ2 ∈ L1(Ω, F, P) (i.e. E|ξ1ξ2| < ∞). This is not the case in general if we just assume that ξ1 and ξ2 each belong to L1. However, it is an interesting and important fact that it is true for independent r.v.'s, and then E(ξ1ξ2) = Eξ1 · Eξ2. This will follow as a corollary of the following general result.

Theorem 10.3.4 Let ξ1, ξ2 be independent r.v.'s with d.f.'s F1, F2 and let h be a finite measurable function on (R^2, B^2). Then h(ξ1, ξ2) is a r.v. and

Eh(ξ1, ξ2) = ∫_Ω ∫_Ω h(ξ1(ω1), ξ2(ω2)) dP(ω1) dP(ω2) = ∫_R ∫_R h(x1, x2) dF1(x1) dF2(x2),

whenever h is nonnegative, or E|h(ξ1, ξ2)| < ∞.

Proof It is clear that h(ξ1, ξ2) is a r.v. Writing ξ = (ξ1, ξ2) we have

Eh(ξ1, ξ2) = ∫_Ω h(ξ(ω)) dP(ω) = ∫_{R^2} h(x1, x2) dPξ^{-1}(x1, x2) = ∫_{R^2} h(x1, x2) d(Pξ1^{-1} × Pξ2^{-1})

by Theorem 4.6.1 and Theorem 10.3.1 (iii). Fubini's Theorem (the appropriate version according as h is nonnegative, or h(ξ1, ξ2) ∈ L1) now gives the repeated integral

Eh(ξ1, ξ2) = ∫_R ∫_R h(x1, x2) dPξ1^{-1}(x1) dPξ2^{-1}(x2),

which may be written either as ∫_R ∫_R h(x1, x2) dF1(x1) dF2(x2) or, by Theorem 4.6.1 applied in turn to each of ξ1, ξ2, as ∫_Ω ∫_Ω h(ξ1(ω1), ξ2(ω2)) dP(ω1) dP(ω2). Hence the result follows.

Theorem 10.3.5 Let ξ1, ..., ξn be independent r.v.'s with E|ξi| < ∞ for each i. Then E|ξ1ξ2...ξn| < ∞ and E(ξ1ξ2...ξn) = ∏_{i=1}^n Eξi.

Proof Since by Theorem 10.3.2, ξ1 and (ξ2ξ3...ξn) are independent, the result will follow inductively from that for n = 2. The n = 2 result follows at once from Theorem 10.3.4, first with h(x1, x2) = |x1x2| to give

E|ξ1ξ2| = ∫_Ω ∫_Ω |ξ1(ω1)||ξ2(ω2)| dP(ω1) dP(ω2) = E|ξ1|E|ξ2| < ∞,

and then with h(x1, x2) = x1x2 to give E(ξ1ξ2) = Eξ1Eξ2.

Corollary If ξ1, ..., ξn are independent r.v.'s with Eξi^2 < ∞ for each i, then the variance of (ξ1 + ξ2 + ... + ξn) is given by

var(ξ1 + ξ2 + ... + ξn) = var(ξ1) + var(ξ2) + ... + var(ξn).

The simple proof is left as an exercise.
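Both the product rule E(ξ1ξ2) = Eξ1Eξ2 and the additivity of variances can be verified exactly on a small discrete product space; the particular values and probabilities below are arbitrary illustrations:

```python
# Exact check, on a small discrete product space, of E(xi1 xi2) = E xi1 E xi2
# (Theorem 10.3.5) and var(xi1 + xi2) = var(xi1) + var(xi2) (Corollary).
from fractions import Fraction
from itertools import product

xi1 = [(-1, Fraction(1, 3)), (2, Fraction(2, 3))]                      # (value, prob)
xi2 = [(0, Fraction(1, 2)), (1, Fraction(1, 4)), (5, Fraction(1, 4))]

def E(f):
    """Expectation of f(x1, x2) under the independent product law."""
    return sum(f(v, w) * p * q for (v, p), (w, q) in product(xi1, xi2))

E1, E2 = E(lambda v, w: v), E(lambda v, w: w)
assert E(lambda v, w: v * w) == E1 * E2          # product of expectations

var_sum = E(lambda v, w: (v + w) ** 2) - (E1 + E2) ** 2
var1 = E(lambda v, w: v ** 2) - E1 ** 2
var2 = E(lambda v, w: w ** 2) - E2 ** 2
assert var_sum == var1 + var2                    # variances add
```

Rational arithmetic makes both identities exact rather than approximate.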

10.4 Addition of independent random variables

We next obtain the distribution and d.f. of the sum of independent r.v.'s.

Theorem 10.4.1 Let ξ1, ξ2 be independent r.v.'s with distributions Pξ1^{-1} = π1, Pξ2^{-1} = π2. Then

(i) The distribution π of ξ1 + ξ2 is given for Borel sets B (writing B – y = {x – y : x ∈ B}) by

π(B) = ∫_{–∞}^∞ π1(B – y) dπ2(y) = ∫_{–∞}^∞ π2(B – y) dπ1(y) = π1 ∗ π2(B),

where π1 ∗ π2 is called the convolution of the measures π1, π2 (cf. Section 7.6).

(ii) In particular the d.f. F of ξ1 + ξ2 is given in terms of the d.f.'s F1, F2 of ξ1, ξ2 by

F(x) = ∫_{–∞}^∞ F1(x – y) dF2(y) = ∫_{–∞}^∞ F2(x – y) dF1(y) = F1 ∗ F2(x),

where F1 ∗ F2 is the (Stieltjes) convolution of F1 and F2.
(iii) If F1 is absolutely continuous with density f1, then F is absolutely continuous with density f(x) = ∫ f1(x – y) dF2(y).
(iv) If also F2 is absolutely continuous (with density f2) then

f(x) = ∫_{–∞}^∞ f1(x – y) f2(y) dy = ∫_{–∞}^∞ f2(x – y) f1(y) dy = f1 ∗ f2(x),

i.e. the convolution of f1 and f2 (cf. Section 7.6).

Proof If φ(x1, x2)=x1 + x2 (measurable) and ξ =(ξ1, ξ2), we have

π(B) = P{ξ1 + ξ2 ∈ B} = P{φ(ξ) ∈ B} = P{ξ ∈ φ^{-1}B} = E χ_{φ^{-1}B}(ξ) = ∫_R ∫_R χ_{φ^{-1}B}(x1, x2) dπ1(x1) dπ2(x2)

by Theorem 10.3.4. The integrand is one if x1 + x2 ∈ B, i.e. if x1 ∈ B – x2, and zero otherwise, so that the inner integral is π1(B – x2), measurable by Fubini's Theorem, giving the first result for π(B). The second follows similarly. Thus (i) holds.

The expressions for F(x) in (ii) follow at once by writing B = (–∞, x], where e.g. π1(B – y) = F1(x – y) etc.

If F1 is absolutely continuous with density f1 we have

F(x) = ∫_{–∞}^∞ F1(x – y) dF2(y) = ∫_{–∞}^∞ ∫_{–∞}^{x–y} f1(t) dt dF2(y) = ∫_{–∞}^∞ ∫_{–∞}^x f1(u – y) du dF2(y)

by the transformation t = u – y for fixed y in the inner integral. Thus

F(x) = ∫_{–∞}^x ∫_{–∞}^∞ f1(u – y) dF2(y) du

by Fubini's Theorem for nonnegative functions. That is, F(x) = ∫_{–∞}^x f(u) du where f(u) = ∫_{–∞}^∞ f1(u – y) dF2(y). It is easily seen that the (nonnegative) function f is in L1(–∞, ∞) (Lebesgue measure) and thus provides a density for F. Hence (iii) follows, and (iv) is immediate from (iii).
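For discrete distributions the convolution formula of (i) reduces to a finite sum, which can be checked against direct enumeration; the fair-dice example below is an illustration:

```python
# The distribution of xi1 + xi2 as the convolution pi1 * pi2, checked exactly
# for two independent fair dice against direct enumeration of pairs.
from fractions import Fraction
from itertools import product

die = {k: Fraction(1, 6) for k in range(1, 7)}

def convolve(pi1, pi2):
    """(pi1 * pi2)({s}) = sum over y of pi1({s - y}) pi2({y})."""
    out = {}
    for x, p in pi1.items():
        for y, q in pi2.items():
            out[x + y] = out.get(x + y, 0) + p * q
    return out

conv = convolve(die, die)

direct = {}                          # brute-force law of the sum
for x, y in product(die, die):
    direct[x + y] = direct.get(x + y, 0) + die[x] * die[y]

assert conv == direct
assert conv[7] == Fraction(6, 36)    # 7 is the most likely total
```

The same routine applies to any pair of finitely supported distributions, mirroring how (iv) specializes the Stieltjes convolution to densities.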

10.5 Borel–Cantelli Lemma and zero-one law

We recall that if An is any sequence of subsets of the space Ω, then A = lim sup An = ∩_{n=1}^∞ ∪_{m=n}^∞ Am is the set of all ω ∈ Ω which belong to An for infinitely many values of n. If the An are measurable sets (i.e. events), so is A. In intuitive terms, A occurs if infinitely many of the An occur (simultaneously) when the underlying experiment is performed. The following result gives a simple but very useful condition under which P(A) = 0, i.e. with probability one only a finite number of the An occur.

Theorem 10.5.1 (Borel–Cantelli Lemma) Let {An} be a sequence of events of the probability space (Ω, F, P), and A = lim sup An. If ∑_{n=1}^∞ P(An) < ∞, then P(A) = 0.

Proof P(A) = P(∩_{n=1}^∞ ∪_{m=n}^∞ Am) ≤ P(∪_{m=n}^∞ Am) for any n = 1, 2, .... Hence P(A) ≤ ∑_{m=n}^∞ P(Am) for all n, and this tends to zero as n → ∞ since ∑ P(An) converges. Thus P(A) = 0.

The converse result is not true in general (Ex. 10.12). However, it is true if the events An form an independent sequence. Indeed, rather more is then true, as the following result shows.
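The tail-sum estimate P(A) ≤ ∑_{m=n}^∞ P(Am) used in the proof can be traced numerically; with P(Am) = 1/m² (an illustrative summable choice) the bounds visibly shrink to zero:

```python
# Numerical illustration of the Borel-Cantelli bound: the tail sums of a
# summable sequence P(A_m) = 1/m^2 dominate P(lim sup A_n) and tend to 0.
# (Truncation at N only lowers the sum, so the bounds below are safe.)
def tail_sum(n, N=10**5):
    """Truncated tail sum over n <= m < N of 1/m^2."""
    return sum(1.0 / m ** 2 for m in range(n, N))

assert tail_sum(2) < 1.0            # comparison bound: sum_{m>=n} 1/m^2 <= 1/(n-1)
assert tail_sum(100) < 1.0 / 99
assert tail_sum(1000) < tail_sum(100) < tail_sum(2)
```

Since the bound can be made arbitrarily small by taking n large, P(lim sup An) = 0, exactly as in the proof.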

Theorem 10.5.2 (Borel–Cantelli Lemma for Independent Events) Let {An} be an independent sequence of events on (Ω, F, P), and A = lim sup An. Then P(A) is zero or one, according as ∑_{n=1}^∞ P(An) < ∞ or ∑_{n=1}^∞ P(An) = ∞.

Proof Since P(A) = 0 when ∑ P(An) < ∞, it will be sufficient to show that P(A) = 1 when ∑ P(An) = ∞. Suppose, then, that ∑ P(An) = ∞. Then

P(A) = P(∩_{n=1}^∞ ∪_{m=n}^∞ Am) = lim_{n→∞} P(∪_{m=n}^∞ Am) = lim_{n→∞} lim_{k→∞} P(∪_{m=n}^k Am).

Now

P((∪_{m=n}^k Am)^c) = P(∩_{m=n}^k Am^c) = ∏_{m=n}^k P(Am^c),

since the events An^c, A_{n+1}^c, ..., Ak^c are independent by Theorem 10.1.2 (Corollary). Thus

P((∪_{m=n}^k Am)^c) = ∏_{m=n}^k (1 – P(Am)) ≤ ∏_{m=n}^k e^{–P(Am)}

(by using 1 – x ≤ e^{–x} for all 0 ≤ x ≤ 1). The latter term is e^{–∑_{m=n}^k P(Am)}, which tends to zero as k → ∞ since ∑ P(Am) = ∞. Thus lim_{k→∞} P(∪_{m=n}^k Am) = 1, giving P(A) = 1, as required.
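In the divergent case the product bound in the proof can be followed numerically. With independent events Am of probability 1/m (an illustrative choice), the product ∏_{m=n}^k (1 – 1/m) telescopes to (n – 1)/k, which tends to zero as k → ∞, so P(∪_{m≥n} Am) = 1 for every n:

```python
# Exact evaluation of P(no A_m occurs, n <= m <= k) = prod (1 - 1/m) for the
# divergent example P(A_m) = 1/m (starting at m = 2 so each factor is > 0).
from fractions import Fraction

def prob_none(n, k):
    """prod_{m=n}^{k} (1 - 1/m), the probability that none of A_n..A_k occur."""
    p = Fraction(1)
    for m in range(n, k + 1):
        p *= 1 - Fraction(1, m)
    return p

# The product telescopes: prod_{m=n}^{k} (m-1)/m = (n-1)/k -> 0 as k grows.
assert prob_none(2, 100) == Fraction(1, 100)
assert prob_none(5, 1000) == Fraction(4, 1000)
```

Since the complement probability vanishes in k for every starting index n, infinitely many of the Am occur with probability one, as the theorem asserts.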

Note (though not shown here) that this result is in fact true if the An are only assumed to be pairwise independent. (See, for example, Theorem 4.3.2 of [Chung].)

The above theorem states in particular that a certain event A must have probability zero or one. Results of such a kind are therefore often referred to as "zero-one laws". A particularly well known result of this type is the "Kolmogorov Zero-One Law", which is shown next. Theorem 10.5.2 is an example of a zero-one law, together with necessary and sufficient conditions for the two alternatives.

First we require some general terminology. If Fn is a sequence of sub-σ-fields of F, then the σ-fields Gn = σ(∪_{k=n+1}^∞ Fk) form a decreasing sequence (Gn ⊃ G_{n+1}) whose intersection ∩_{n=0}^∞ Gn = F∞ (clearly a σ-field) is called the tail σ-field of the sequence Fn. Sets of F∞ are called tail events and F∞-measurable functions are called tail functions (or tail r.v.'s if defined and finite a.s.).

Theorem 10.5.3 (Kolmogorov Zero-One Law) Let (Ω, F, P) be a probability space. If Fn is a sequence of independent sub-σ-fields of F, then each tail event has probability zero or one, and each tail r.v. is constant a.s.

Proof Write Hn = σ(∪_{i=1}^n Fi) and, as above, Gn = σ(∪_{k=n+1}^∞ Fk). Then since each Fi is closed under intersections, it follows simply from Theorem 10.1.3 that Hn and Gn are independent classes. Since Gn ⊃ F∞, it follows that Hn and F∞ are independent, from which it also follows at once that F∞ and ∪_{n=1}^∞ Hn are independent. Now ∪_{n=1}^∞ Hn is a field (note that Hn is nondecreasing), and hence closed under intersections, so that by Theorem 10.1.2, F∞ and σ(∪_{n=1}^∞ Hn) are independent. But clearly σ(∪_{n=1}^∞ Hn) ⊃ σ(∪_{n=1}^∞ Fn) = G0 ⊃ F∞, so that {F∞, F∞} are independent. Thus if A ∈ F∞ we must have P(A) = P(A ∩ A) = (P(A))^2, so that P(A) is zero or one, as required.

Finally suppose that ξ is a tail r.v. with d.f. F. For any x, {ω : ξ(ω) ≤ x} is a tail event and hence has probability zero or one, i.e. F(x) = 0 or 1. Since F is not identically zero or one, it must have a unit jump at a finite point a (= inf{x : F(x) = 1}), so that P{ξ = a} = 1.

Corollary 1 Let {ξn : n = 1, 2, ...} be a sequence of independent r.v.'s and define the tail σ-field F∞ = ∩_{n=0}^∞ σ(ξ_{n+1}, ξ_{n+2}, ...). Then each tail event has probability zero or one, and each tail r.v. is constant a.s.

Proof Identify Fn with σ(ξn), and hence Gn = σ(∪_{k=n+1}^∞ σ(ξk)) = σ(ξ_{n+1}, ξ_{n+2}, ...).

Corollary 2 If {Cn : n = 1, 2, ...} is a sequence of independent classes of r.v.'s, the conclusion of the theorem holds, with tail σ-field F∞ = ∩_{n=0}^∞ σ(∪_{k=n+1}^∞ Ck).

Corollary 2, which follows by identifying Fn with σ(Cn), and hence Gn with σ(∪_{k=n+1}^∞ σ(Ck)) = σ(∪_{k=n+1}^∞ Ck), includes a zero-one law for an independent sequence of stochastic processes.

Exercises

10.1 Let Ω consist of the integers {1, 2, ..., 9} with probabilities 1/9 each. Show that the events {1, 2, 3}, {1, 4, 5}, {2, 4, 6} are pairwise independent, but not independent as a class.

10.2 Construct an example of three events A, B, C which are not independent but which satisfy P(A ∩ B ∩ C) = P(A)P(B)P(C).

10.3 Let {Aλ : λ ∈ Λ} be a family of independent classes of events. Show that arbitrary events of probability zero or one may be added to any or all Aλ while still preserving independence. Show that if Bλ is formed from Aλ

by including (i) all proper differences of two sets of Aλ, (ii) all countable disjoint unions of sets of Aλ, or (iii) all limits of monotone sequences of sets of Aλ, then {Bλ : λ ∈ Λ} is a family of independent classes. (Hint: Consider a finite index set Λ, Ω ∈ Aλ, and show that independence is preserved when just one Aλ is replaced by Bλ.)

10.4 If E1, E2, ..., En are independent, show that

∑_{j=1}^n P(Ej) – ∑_{j<k} P(Ej)P(Ek) ≤ P(∪_{j=1}^n Ej) ≤ ∑_{j=1}^n P(Ej).

If the events E1^{(n)}, ..., En^{(n)} change with n so that ∑_{j=1}^n P(Ej^{(n)}) → 0, show that P(∪_{j=1}^n Ej^{(n)}) ∼ ∑_{j=1}^n P(Ej^{(n)}) as n → ∞.

10.5 Let ξ, η be independent r.v.'s with E|ξ| < ∞. Show that, for any Borel set B,

∫_{η^{-1}B} ξ dP = Eξ · P(η ∈ B).

10.6 Let ξ, η be random variables on the probability space (Ω, F, P), let E ∈ F, and let f be a Borel measurable function on the plane. If ξ is independent of η and E (i.e. if the classes of events σ(ξ) and σ{σ(η), E} are independent) show that

∫_E ∫_Ω f(ξ(ω1), η(ω2)) dP(ω1) dP(ω2) = ∫_E f(ξ(ω), η(ω)) dP(ω)

whenever f is nonnegative or E|f(ξ, η)| < ∞. (Hint: Prove this first for an indicator function f.) If the random variable ζ defined on the probability space (Ω, F, P) has the same distribution as ξ, show that

∫_E ∫_Ω f(ζ(ω′), η(ω)) dP(ω′) dP(ω) = ∫_E f(ξ(ω), η(ω)) dP(ω).

10.7 For n = 1, 2, ... let Rn(x) be the Rademacher functions: Rn(x) = +1 or –1 according as the integer k for which (k–1)/2^n < x ≤ k/2^n (0 ≤ x ≤ 1) is odd or even. Let (Ω, F, P) be the "unit interval probability space" (consisting of the unit interval, Lebesgue measurable sets and Lebesgue measure). Prove that {Rn, n = 1, 2, ...} are independent r.v.'s with the same d.f. Show that any two of R1, R2, R1R2 are independent, but the three together are not.

10.8 A r.v. η is called symmetric if η and –η have the same distribution. Let ξ be a r.v. Let ξ1 and ξ2 be two independent r.v.'s each having the same distribution as ξ and let ξ* = ξ1 – ξ2.

(a) Show that ξ* is symmetric (it is called the symmetrization of ξ) and that

μ*(B) = ∫_{–∞}^∞ μ(x – B) dμ(x) = ∫_{–∞}^∞ μ(x + B) dμ(x)

for all Borel sets B, where μ, μ* are the distributions of ξ, ξ* respectively, and x – B = {x – y : y ∈ B}, x + B = {x + y : y ∈ B}.

(b) Show that for all t ≥ 0, real a

P{|ξ*|≥t}≤2P{|ξ – a|≥t/2}.
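The Rademacher functions of Ex. 10.7 can be examined by exact computation: R1, R2 and R1R2 are constant on the dyadic cells of length 1/8, so averaging over the cell midpoints gives exact probabilities. This sketch checks pairwise independence and its failure for the triple:

```python
# Rademacher functions (Ex. 10.7): pairwise independence of R1, R2, R1*R2
# holds exactly, but the three are not independent as a class.
import math
from fractions import Fraction

def R(n, x):
    """R_n(x) = +1 or -1 according as the k with (k-1)/2^n < x <= k/2^n
    is odd or even."""
    k = math.ceil(x * 2 ** n)
    return 1 if k % 2 == 1 else -1

# Midpoints of the 8 dyadic cells of length 1/8; the three functions are
# constant on each cell, so cell counts give exact probabilities.
pts = [Fraction(2 * i + 1, 16) for i in range(8)]
trip = [(R(1, x), R(2, x), R(1, x) * R(2, x)) for x in pts]

def P(pred):
    return Fraction(sum(1 for t in trip if pred(t)), len(trip))

# Any two of the three are independent ...
for i, j in [(0, 1), (0, 2), (1, 2)]:
    for a in (1, -1):
        for b in (1, -1):
            assert P(lambda t: t[i] == a and t[j] == b) == \
                   P(lambda t: t[i] == a) * P(lambda t: t[j] == b)

# ... but the triple is not: the product of any two determines the third.
p3 = P(lambda t: t == (1, 1, 1))
assert p3 == Fraction(1, 4) and p3 != Fraction(1, 8)  # 1/8 = product of marginals
```

This is the same mechanism exploited in Ex. 10.10: C1 = {R1}, C2 = {R2, R1R2} gives the weaker but not the stronger form of independence.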

10.9 Criterion for independence of r.e.'s analogous to Theorem 10.1.2: Let ξλ be a random element on (Ω, F, P) with values in (Xλ, Sλ), say, for each λ in an index set Λ. For each λ, let Eλ be a class of subsets of Xλ which is closed under finite intersections and whose generated σ-ring S(Eλ) = Sλ, and write Gλ = ξλ^{-1}Eλ (= {ξλ^{-1}E : E ∈ Eλ}). Then {ξλ : λ ∈ Λ} is a class of independent random elements if and only if {Gλ : λ ∈ Λ} is a family of independent classes of events.

10.10 A weaker concept of independence of a family of classes of random elements would be the following. Let {Cλ : λ ∈ Λ} be a family of classes of random elements and suppose that for every choice of one member ξλ from each Cλ, {ξλ : λ ∈ Λ} is a class of independent random elements. Such a definition would be more strictly analogous to the procedure used for classes of sets. Show that it is, in fact, a weaker requirement than the definition in the text. (E.g. take two classes C1 = {ξ}, C2 = {η, ζ} where any two of ξ, η, ζ are independent but the three together are not (cf. Ex. 10.7). Show that {C1, C2} satisfies the weaker definition, but is not independent in the sense of the text.)

10.11 For each λ in an index set Λ, let ξλ, ξλ* be random elements on (Ω, F, P) with values in (Xλ, Sλ) and such that ξλ = ξλ* a.s. Show that if {ξλ : λ ∈ Λ} is a class of independent random elements, then so is {ξλ* : λ ∈ Λ} (e.g. show (∩_{i=1}^n ξλi*^{-1}Ei) Δ (∩_{i=1}^n ξλi^{-1}Ei) ⊂ ∪_{i=1}^n {ω : ξλi(ω) ≠ ξλi*(ω)}).

10.12 A bag contains one black ball and m white balls. A ball is drawn at random. If it is black it is returned to the bag. If it is white, it and an additional white ball are returned to the bag. Let An denote the event that the black ball is not drawn in the first n trials. Discuss the (converse to the) Borel–Cantelli Lemma with reference to the events An.

10.13 Let (Ω, F, P) be the "unit interval probability space" of Ex. 10.7. Define r.v.'s ξn by

ξn(ω) = χ_{[0, 1/2 + 1/n)}(ω) + 2χ_{[1/2 + 1/n, 1]}(ω).

Find the tail σ-field of {ξn} and comment on the zero-one law.

10.14 Let ξ be a r.v. which is independent of itself. Show that ξ is a constant, with probability one.

10.15 Let {ξn}_{n=1}^∞ be a sequence of independent random variables on the probability space (Ω, F, P). Prove that the probability of pointwise convergence of

(i) the sequence {ξn(ω)}_{n=1}^∞
(ii) the series ∑_{n=1}^∞ ξn(ω)

is equal to zero or one, and that whenever (i) converges its limit is equal to a constant a.s. (Hint: Show that the set C of all points ω ∈ Ω for which the sequence {ξn(ω)}_{n=1}^∞ converges is given by

C = ∩_{k=1}^∞ ∪_{N=1}^∞ ∩_{n=N}^∞ ∩_{m=N}^∞ {ω ∈ Ω : |ξn(ω) – ξm(ω)| ≤ 1/k}.)

10.16 Prove that a sequence of independent identically distributed random variables converges pointwise with zero probability, except when all the random variables are equal to a constant a.s. (Hint: Use the result and the hint of the previous problem.)

11

Convergence and related topics

11.1 Modes of probabilistic convergence

Various modes of convergence of measurable functions to a limit function were considered in Chapter 6, and will be restated here with the special terminology customarily used in the probabilistic context. In this section the modes of convergence all concern a sequence {ξn} of r.v.'s on the same probability space (Ω, F, P) such that the values ξn(ω) "become close" (in some "local" or "global" sense) to a "limiting r.v." ξ(ω) as n → ∞. In the next section we shall consider the weaker form of convergence where the ξn's can be defined on different spaces, and where one is interested only in the limiting form of the distribution of the ξn (i.e. Pξn^{-1}B for Borel sets B). This "convergence in distribution" has wide use in statistical theory and application.

The later sections of the chapter will be concerned with various important relationships between the forms of convergence, convergence of series of independent r.v.'s, and related topics. Note that in certain calculations concerning convergence (especially in Section 11.5) it will be implicitly assumed that the r.v.'s involved are defined for all ω. No comment will be made in these cases, since it is a trivial matter to obtain these results for r.v.'s ξn not defined everywhere by considering ξn* defined for all ω, and equal to ξn a.s.

In this section, then, we shall consider a sequence {ξn} of r.v.'s on the same fixed probability space (Ω, F, P). The following definitions will apply:

Almost sure convergence

Almost sure convergence of a sequence of r.v.’s ξn to a r.v. ξ (ξn → ξ a.s.) is, of course, just a.e. convergence of ξn to ξ with respect to the probability measure P. This is also termed convergence with probability 1. Similarly to say that {ξn} is Cauchy a.s. means that it is Cauchy a.e. (P), as defined in Chapter 6.


A useful necessary and sufficient condition for a.s. convergence is provided by Theorem 6.2.4, which is restated in the present context:

Theorem 11.1.1 ξn → ξ a.s. if and only if for every ε > 0, writing En(ε) = {ω : |ξn(ω) – ξ(ω)| ≥ ε},

lim_{n→∞} P(∪_{m=n}^∞ Em(ε)) (= P(lim sup_{n→∞} En(ε))) = 0.

That is, ξn → ξ a.s. if (except on a zero probability set) the events En(ε) occur only finitely often for each ε > 0, or, equivalently, the probability that |ξm – ξ| ≥ ε for some m ≥ n tends to zero as n → ∞. The following very simple but sometimes useful sufficient condition for a.s. convergence is immediate from the above criterion.

Theorem 11.1.2 Suppose that, for each ε > 0,

∑_{n=1}^∞ P{|ξn – ξ| ≥ ε} < ∞.

Then ξn → ξ a.s. as n →∞. Proof This is an immediate and obvious application of the Borel–Cantelli Lemma (Theorem 10.5.1). 

A corresponding condition for {ξn} to be a Cauchy sequence a.s. (and hence convergent a.s. to some ξ) will now be obtained.

Theorem 11.1.3 Let {εn} be positive constants, n = 1, 2, ..., with ∑_{n=1}^∞ εn < ∞, and suppose that

∑_{n=1}^∞ P{|ξn+1 – ξn| > εn} < ∞.

Then {ξn} is a Cauchy sequence a.s. (and hence convergent to some r.v. ξ a.s.).

Proof By the Borel–Cantelli Lemma (Theorem 10.5.1) the probability is zero that |ξn+1 – ξn| > εn for infinitely many n. That is, for each ω except on a set of P-measure zero, there is a finite N = N(ω) such that |ξn+1(ω) – ξn(ω)| ≤ εn when n ≥ N(ω). Given ε > 0 we may (by increasing N if necessary) require that ∑_{n=N}^∞ εn < ε (N now depends on ε and ω, of course). Thus if n > m ≥ N,

|ξn – ξm| ≤ ∑_{k=m}^{n–1} |ξk+1 – ξk| ≤ ∑_{k=N}^∞ |ξk+1 – ξk| ≤ ∑_{k=N}^∞ εk < ε

and hence {ξn(ω)} is a Cauchy sequence, as required.

Convergence in probability

This is just convergence in measure, with the previous terminology. That is, ξn tends to ξ in probability (ξn →P ξ) if for each ε > 0,

P{ω : |ξn(ω) – ξ(ω)| ≥ ε} → 0 as n → ∞,

i.e. P(En(ε)) → 0 as n → ∞, with the notation of Theorem 11.1.1, or in probabilistic language P{|ξn – ξ| ≥ ε} → 0 for each ε > 0. That is, for each (large) n there is high probability that ξn will be close to ξ – but not necessarily high probability that ξm will be close to ξ simultaneously for all m ≥ n. Thus convergence in probability is a weaker requirement than almost sure convergence. This is made specific by the corollary to Theorem 6.2.2 (or implied by Theorem 11.1.1) which shows that if ξn → ξ a.s., then ξn →P ξ. It also follows (from the corollary to Theorem 6.2.3) that if ξn converges to ξ in probability, then a subsequence ξ_{nk}, say, of ξn converges to ξ a.s. We state these two results as a theorem:

Theorem 11.1.4 (i) If ξn → ξ a.s., then ξn →P ξ.
(ii) If ξn →P ξ, then there exists a subsequence {ξ_{nk}} converging to ξ a.s. ({nk} is the same for all ω).
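The gap between the two modes is exhibited by the standard "sliding indicator" example (a classical illustration, not from the text): on Ω = [0, 1) with Lebesgue measure, write n = 2^k + j with 0 ≤ j < 2^k and let ξn be the indicator of I_n = [j·2^{-k}, (j + 1)·2^{-k}). Then P(ξn ≠ 0) = 2^{-k} → 0, so ξn → 0 in probability, yet every ω lies in infinitely many of the intervals, so ξn(ω) converges for no ω:

```python
# Sliding-indicator sequence: convergence in probability without a.s.
# convergence.  Each dyadic "block" 2**k <= n < 2**(k+1) sweeps [0,1) once.
def interval(n):
    k = n.bit_length() - 1            # n = 2**k + j with 0 <= j < 2**k
    j = n - 2 ** k
    return (j * 2.0 ** -k, (j + 1) * 2.0 ** -k)

def xi(n, omega):
    a, b = interval(n)
    return 1 if a <= omega < b else 0

omega = 0.3
# xi_n(omega) = 1 once per complete block, hence 1 infinitely often,
# so the sequence xi_n(omega) has no limit at this (or any) omega ...
hits = [n for n in range(1, 10000) if xi(n, omega) == 1]
# ... while P(xi_n = 1) = |I_n| = 2**-k -> 0, i.e. xi_n -> 0 in probability.
lengths = [interval(n)[1] - interval(n)[0] for n in range(1, 10000)]
```

Consistently with Theorem 11.1.4 (ii), the subsequence taken at the start of each block (n = 2^k) does converge to 0 a.e.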

The following result will be useful for later applications.

Theorem 11.1.5 (i) ξn →P ξ if and only if each subsequence of {ξn} contains a further subsequence which converges to ξ a.s.
(ii) If ξn →P ξ, and f is a continuous function on R, then f(ξn) →P f(ξ).
(iii) (ii) holds if f is continuous except for x ∈ D where Pξ^{-1}D = 0.

Proof (i) If ξn → ξ in probability, any subsequence also converges to ξ in probability, and, by Theorem 11.1.4 (ii), contains a further subsequence converging to ξ a.s.

Conversely suppose that each subsequence of {ξn} contains a further subsequence converging a.s. to ξ. If ξn does not converge to ξ in probability, there is some ε > 0 with P{|ξn – ξ| ≥ ε} ↛ 0, and hence also some δ > 0 such that P{|ξn – ξ| ≥ ε} > δ infinitely often. That is, for some subsequence {ξ_{nk}}, P{|ξ_{nk} – ξ| ≥ ε} > δ, k = 1, 2, .... But this means that no subsequence of {ξ_{nk}} can converge to ξ in probability (and thus certainly not a.s.), so a contradiction results. Hence we must have ξn → ξ in probability as asserted.

(ii) Suppose ξn →P ξ and write ηn = f(ξn), η = f(ξ). Any subsequence {ξ_{nk}} of {ξn} has, by (i), a further subsequence {ξ_{mℓ}} converging to ξ a.s. Hence, by continuity, f(ξ_{mℓ}) → f(ξ) a.s. That is, the subsequence {η_{nk}} of {ηn} has a further subsequence converging to η a.s. and hence, again by (i), ηn → η in probability, so that (ii) holds.

For (iii) essentially the same proof applies – noting that f(ξ_{mℓ}) still converges to f(ξ) a.s. since any further points ω where convergence does not occur are contained in the zero probability set ξ^{-1}D.

Convergence in pth order mean

Again, Lp convergence of measurable functions (p > 0) includes Lp convergence for r.v.'s ξn. Specifically, if ξn, ξ have finite pth moments (i.e. ξn, ξ ∈ Lp(Ω, F, P)) we say that ξn → ξ in pth order mean if ξn → ξ in Lp, i.e. if

E|ξn – ξ|^p = ∫ |ξn – ξ|^p dP → 0 as n → ∞.

The reader should review the properties of Lp-spaces given in Section 6.4, including the inequalities restated in probabilistic terminology in Section 9.5. Especially recall that Lp is a linear space for all p > 0 (if ξ, η ∈ Lp then aξ + bη ∈ Lp for any real a, b), and that Lp is complete. Many of the useful results apply whether 0 < p < 1 or p ≥ 1, and in particular we shall find the following lemma (which restates part of Theorem 6.4.6 (ii)) to be useful.

Theorem 11.1.6 Let {ξn} (n = 1, 2, ...), ξ be r.v.'s in Lp for some p > 0 and ξn → ξ in Lp. Then

(i) ξn →P ξ
(ii) E|ξn|^p → E|ξ|^p.

By (i), if ξn → ξ in Lp (p > 0) then ξn →P ξ. This implies also, of course, that a subsequence ξ_{nk} → ξ a.s. (Theorem 11.1.4 (ii)). However, the sequence ξn itself does not necessarily converge a.s. Conversely, nor does a.s. convergence of ξn necessarily imply convergence in any Lp. There is, however, a converse result when the ξn are dominated by an Lp r.v. In particular the case p = 1 may be regarded as a form of the dominated convergence theorem applicable to finite measure (e.g. probability) spaces, with a.s. convergence replaced by convergence in probability. (We shall also see a more general converse later – Theorem 11.4.2.)

Theorem 11.1.7 Let {ξn}, ξ be r.v.'s such that ξn →P ξ. Suppose η ∈ Lp for some p > 0, and |ξn| ≤ η a.s., n = 1, 2, .... Then ξn → ξ in Lp.

Proof Note first that clearly ξn ∈ Lp. Further, since ξn →P ξ, a subsequence ξ_{nk} → ξ a.s. so that |ξ| ≤ η a.s. Since η ∈ Lp it follows that ξ ∈ Lp. Now |ξn – ξ| ≤ 2η ∈ Lp and hence, for any ε > 0,

E|ξn – ξ|^p = ∫_{|ξn–ξ|<ε} |ξn – ξ|^p dP + ∫_{|ξn–ξ|≥ε} |ξn – ξ|^p dP ≤ ε^p + 2^p ∫_{(|ξn–ξ|≥ε)} η^p dP.

The last term tends to zero by Theorem 4.5.3 since P{|ξn – ξ| ≥ ε} → 0, so that lim sup_{n→∞} E|ξn – ξ|^p ≤ ε^p. Since ε is arbitrary, lim_{n→∞} E|ξn – ξ|^p = 0 as required.

11.2 Convergence in distribution

As noted in the previous section, it is of interest to consider another form of convergence – involving just the distributions of a sequence of r.v.'s, and not their values at each ω. That is, given a sequence {ξn} of r.v.'s we inquire whether the distributions P{ξn ∈ B} converge to that of a r.v. ξ, i.e. to P{ξ ∈ B}, for sets B ∈ B.

In fact, it is a little too stringent to require this for all B ∈ B. For suppose that ξn has d.f. Fn(x) which is zero for x ≤ –1/n, one for x ≥ 1/n and is linear in (–1/n, 1/n). Clearly one would want to say that the limiting distribution of ξn is the probability measure π with unit mass at zero, i.e. the distribution of the r.v. ξ = 0. But, taking B to be the "singleton set" {0}, we have P{ξn = 0} = 0, which does not converge to P{ξ = 0} = 1.

It is easy to see (at least once one is told!) what should be done to give an appropriate definition. In the above example, the d.f.'s Fn(x) of ξn converge to a limiting d.f. F(x) (zero for x < 0, one for x ≥ 0) at all points x other than the discontinuity point x = 0 of F, at which Fn(0) = 1/2. Equivalently, as we shall see, Pξn^{-1}{(a, b]} → μF{(a, b]} for all a, b with μF{a} = μF{b} = 0. This is conveniently used as the basis for a definition of convergence in distribution. It will also then be true – though we shall neither need nor show this – that Pξn^{-1}(B) → μF(B) for all Borel sets B whose (topological) boundary has μF-measure zero.

The definition below will be stated in what appears to be a slightly more general form, concerning a sequence {πn} of probability measures on B. The use of "π" in the present context will be helpful to distinguish probability measures on R from those on Ω.
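The d.f.'s Fn of the example above can be written down directly; the computation confirms that Fn(0) = 1/2 for every n while Fn(x) converges at every other x, which is exactly why the definition of weak convergence exempts discontinuity points of the limit:

```python
# The text's example: F_n is 0 for x <= -1/n, 1 for x >= 1/n, linear between.
# F_n(x) -> F(x) = 1_[0,inf)(x) at every continuity point x != 0 of the
# limit, but F_n(0) = 1/2 for all n, so pointwise convergence at x = 0 fails.
def F_n(n, x):
    if x <= -1.0 / n:
        return 0.0
    if x >= 1.0 / n:
        return 1.0
    return (x + 1.0 / n) * n / 2.0       # linear interpolation on (-1/n, 1/n)

# Constant value 1/2 at the discontinuity point of the limit:
assert all(abs(F_n(n, 0.0) - 0.5) < 1e-9 for n in (1, 10, 1000))
# Convergence at continuity points (here x = +-0.01, once 1/n < 0.01):
assert F_n(1000, -0.01) == 0.0 and F_n(1000, 0.01) == 1.0
```

The same pattern motivates the "π-continuity interval" restriction in the definition that follows.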

Of course, each πn may be regarded as the distribution of some r.v. (Section 9.2). We shall speak of weak convergence of the sequence πn since it is this terminology which is used in the most abstract and general setting for the subject, described in a variety of treatises beginning with the classic volume [Billingsley].

Suppose, then, that {πn} is a sequence of probability measures on (R, B). Then we say that πn converges weakly to a probability measure π on B (πn →w π) if πn{(a, b]} → π{(a, b]} for all a, b such that π({a}) = π({b}) = 0 (i.e. for each "π-continuity interval" (a, b]). It is readily seen (Ex. 11.10) that open intervals (a, b) or closed intervals [a, b] may replace the semiclosed interval (a, b] in the definition.

Correspondingly if Fn is a d.f. for n = 1, 2, ..., and F is a d.f., we write Fn →w F if Fn(x) → F(x) for each x at which F is continuous. It is obvious that if Fn is the d.f. corresponding to πn, and F to π (πn = μFn, π = μF), then Fn →w F implies πn →w π. The converse is also quite easy to prove directly (Ex. 11.9) but will follow in the course of the proof of Theorem 11.2.1 below.

If {ξn} is a sequence of r.v.'s with d.f.'s {Fn}, and ξ is a r.v. with d.f. F, we say that ξn converges in distribution to ξ (ξn →d ξ) if Fn →w F (i.e. Pξn^{-1} →w Pξ^{-1}). Note that the ξn do not need to be defined on the same probability space for convergence in distribution.¹ Further, even if they are all defined on the same (Ω, F, P), the fact that ξn →d ξ does not require that the values ξn(ω) approach those of ξ(ω) in any sense, as n → ∞. This is in contrast to the other forms of convergence already considered, which (as we shall see) imply convergence in distribution. For example, if {ξn} is any sequence of r.v.'s with the same d.f. F, then ξn converges in distribution to any r.v. ξ with the d.f. F. This emphasizes that convergence in distribution is concerned only with limits of the probabilities P{ξn ∈ B} as n becomes large.

Relationships with other forms of convergence will be addressed in the next section. The following result is a central criterion for weak convergence, indeed leading to its definition in more abstract settings, in which the result is sometimes termed the "Portmanteau Theorem" (e.g. [Billingsley]).

Theorem 11.2.1 Let {πn : n =1,2,...}, π, be probability measures on (R, B), with corresponding d.f.’s {Fn : n =1,2,...}, F. Then the following are equivalent

¹ Strictly we should write Pn since the ξn may be defined on different spaces (Ωn, Fn, Pn), but it is conventional to omit the n and unlikely to cause confusion.

(i) Fn →w F
(i′) For each x, lim sup_n Fn(x) ≤ F(x) and lim inf_n Fn(x) ≥ F(x – 0)
(ii) πn →w π
(iii) ∫_{–∞}^∞ g dπn → ∫_{–∞}^∞ g dπ for every real, bounded continuous function g on R.

Further, weak limits are unique (e.g. if Fn →w F and Fn →w G then F = G).

Proof The uniqueness statement is immediate since, for example, if Fn →w F and Fn →w G then F = G at all continuity points of both F and G, and hence for all points x except those in a countable set. From this it is seen at once that F(x + 0) = G(x + 0) for all x, and hence F = G.

It is immediate that (i′) implies (i). On the other hand if (i) holds, for given x choose y > x such that F is continuous at y. Then lim sup Fn(x) ≤ lim Fn(y) = F(y), from which it follows that lim sup Fn(x) ≤ F(x) by letting y ↓ x. That lim inf_n Fn(x) ≥ F(x – 0) follows similarly. Hence (i) and (i′) are equivalent.

To prove the equivalence of (i), (ii), (iii), note first, as already pointed out above, that (i) clearly implies (ii). Suppose now that (ii) holds. To show (iii), let g be a fixed, real, bounded, continuous function on R, and M = sup_{x∈R} |g(x)| (< ∞). We shall show that lim sup ∫ g dπn ≤ ∫ g dπ. Then replacing g by –g it will follow that lim inf ∫ g dπn = – lim sup ∫ (–g) dπn ≥ – ∫ (–g) dπ = ∫ g dπ, to yield the desired result lim ∫ g dπn = ∫ g dπ. It will be slightly more convenient to assume that 0 ≤ g(x) ≤ 1 for all x (which may be done by considering (g + M)/2M instead of g).

Let D be the set of atoms of π (i.e. discontinuities of F). By Lemma 9.2.2, D is at most countable and thus every interval contains points of its complement D^c. Let ε > 0. Since π(R) = 1 there are thus points a, b in D^c such that π{(a, b]} > 1 – ε/2. Hence also, since πn →w π, we must have πn{(a, b]} > 1 – ε/2 for all n ≥ some N1 = N1(ε). Thus for n ≥ N1,

∫_{–∞}^∞ g dπn = ∫_{(a,b]} g dπn + ∫_{(a,b]^c} g dπn ≤ ∫_{(a,b]} g dπn + ε/2

since g ≤ 1 and πn{(a, b]^c} < ε/2 when n ≥ N1. Hence

lim sup_{n→∞} ∫ g dπn ≤ lim sup_{n→∞} ∫_{(a,b]} g dπn + ε/2.

Now g is uniformly continuous on the finite interval [a, b] and hence there exists δ = δ(ε) such that |g(x) – g(y)| < ε/4 if |x – y| < δ, a ≤ x, y ≤ b.

Choose a partition a = x0 < x1 < ... < xm = b of [a, b] such that xk ∉ D and xk – x_{k–1} < δ, k = 1, ..., m. Then if x_{k–1} < x ≤ xk we have

g(x) ≤ g(xk) + ε/4 ≤ g(x) + ε/2

and hence

∫_{(a,b]} g dπn ≤ ∑_{k=1}^m (g(xk) + ε/4) πn{(x_{k–1}, xk]}.

Letting n → ∞ (with the partition fixed), πn{(x_{k–1}, xk]} → π{(x_{k–1}, xk]}, giving

lim sup_{n→∞} ∫_{(a,b]} g dπn ≤ ∑_{k=1}^m (g(xk) + ε/4) π{(x_{k–1}, xk]} ≤ ∫_{(a,b]} (g(x) + ε/2) dπ ≤ ∫_{–∞}^∞ g dπ + ε/2.

Thus, gathering facts, we have

lim sup_{n→∞} ∫_{–∞}^∞ g dπn ≤ ∫_{–∞}^∞ g dπ + ε,

from which the desired result follows since ε > 0 is arbitrary. Thus (ii) implies (iii).

Finally we assume that (iii) holds and show that (i′) follows, i.e. lim sup_n Fn(x) ≤ F(x), lim inf_n Fn(x) ≥ F(x – 0), for any fixed point x. Let ε > 0 and write g_ε(t) for the bounded continuous function which is unity for t ≤ x, decreases linearly to zero at t = x + ε, and is zero for t > x + ε. Then

Fn(x) = ∫_{(–∞,x]} g_ε(t) dπn(t) ≤ ∫_{–∞}^∞ g_ε dπn → ∫_{–∞}^∞ g_ε dπ ≤ F(x + ε).

Hence lim sup_{n→∞} Fn(x) ≤ F(x + ε) for ε > 0, and letting ε → 0 gives lim sup_{n→∞} Fn(x) ≤ F(x).

It may be similarly shown (by writing h_ε(t) = 1 for t ≤ x – ε, zero for t ≥ x and linear in (x – ε, x)) that lim inf_{n→∞} Fn(x) ≥ F(x – ε) for all ε > 0, and hence lim inf Fn(x) ≥ F(x – 0), as required. Thus (iii) implies (i′) and hence (i), completing the proof of the equivalence of (i)–(iii).

Corollary 1 If πn →w π then (iii) also holds for bounded measurable functions g just assumed to be continuous a.e. (π).

Proof It may be assumed (by subtracting its lower bound) that g is nonnegative. Then a sequence {gm} of continuous functions may be found (cf. Ex. 11.11 for a sketch of their construction) such that 0 ≤ gm(x) ↑ g(x) at each continuity point x of g. Hence, for fixed m,

lim inf_{n→∞} ∫ g dπn ≥ lim inf_{n→∞} ∫ gm dπn = ∫ gm dπ

by (iii), and hence by monotone convergence, letting m → ∞,

lim inf_{n→∞} ∫ g dπn ≥ ∫ g dπ.

The same argument with –g shows that lim inf ∫ (–g) dπn ≥ ∫ (–g) dπ, so that lim sup ∫ g dπn ≤ ∫ g dπ, and hence (iii) holds for this g as required.

The above criteria may be translated as conditions for convergence in distribution of a sequence of r.v.'s, as follows.

Corollary 2  If $\{\xi_n : n = 1, 2, \dots\}$, $\xi$ are r.v.'s with d.f.'s $\{F_n : n = 1, 2, \dots\}$, $F$, then the following are equivalent:

(i) $\xi_n \xrightarrow{d} \xi$
(ii) $F_n \xrightarrow{w} F$
(iii) $P\xi_n^{-1} \xrightarrow{w} P\xi^{-1}$
(iv) $Eg(\xi_n) \to Eg(\xi)$ for every bounded continuous real function $g$ on $\mathbb{R}$.

If (iv) holds for all such $g$, it also holds if $g$ is just bounded and continuous a.e. $(P\xi^{-1})$.

Proof  These are immediate on identifying $P\xi_n^{-1}$, $P\xi^{-1}$ with $\pi_n$, $\pi$ of Theorem 11.2.1, and noting that (iv) here becomes the statement of Corollary 1 of the theorem. □

The final result of this series is a very useful one which shows that an (a.e.) continuous function of a sequence converging in distribution also converges in distribution.

Theorem 11.2.2 (Continuous Mapping Theorem)  Let $\xi_n \xrightarrow{d} \xi$ where $\xi_n$, $\xi$ have distributions $\pi_n$, $\pi$, and let $h$ be a measurable function on $\mathbb{R}$ which is continuous a.e. $(\pi)$. Then $h(\xi_n) \xrightarrow{d} h(\xi)$.

Proof  This follows at once from the final statement in (iv) of Corollary 2 on replacing the bounded continuous $g$ by its composition $g \circ h$, which is clearly bounded and continuous a.e. $(\pi)$, giving

$$Eg(h(\xi_n)) = E(g\circ h)(\xi_n) \to E(g\circ h)(\xi) = Eg(h(\xi)). \qquad \Box$$
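A hedged numerical sketch of the theorem (this concrete example is mine, not the text's): let $\xi_n$ be uniform on $[0, 1+1/n]$, so $\xi_n \xrightarrow{d} \xi$ with $\xi$ uniform on $[0,1]$, and let $h(x) = \mathbf{1}_{\{x > 1/2\}}$, which is discontinuous only at $1/2$, a point of $\pi$-measure zero. The distribution of $h(\xi_n)$ then converges to that of $h(\xi)$:

```python
# P(h(xi_n) = 0) = P(xi_n <= 1/2) for xi_n uniform on [0, 1 + 1/n].
def p_h_zero_n(n):
    return 0.5 / (1.0 + 1.0 / n)

P_H_ZERO_LIMIT = 0.5   # P(h(xi) = 0) = P(xi <= 1/2) for xi uniform on [0, 1]

# The two-point distribution of h(xi_n) converges to that of h(xi):
assert abs(p_h_zero_n(10**6) - P_H_ZERO_LIMIT) < 1e-5
```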

Note that this result may be equivalently stated as follows: if $\pi_n$, $\pi$ are probability measures on $\mathcal{B}$ such that $\pi_n \xrightarrow{w} \pi$, then $\pi_n h^{-1} \xrightarrow{w} \pi h^{-1}$ if $h$ is continuous a.e. $(\pi)$. More general, useful forms of the mapping theorem are given in [Kallenberg 2, Theorem 3.2.7].

Remark  The definition of weak convergence $\pi_n \xrightarrow{w} \pi$ only involved $\pi_n(a,b] \to \pi(a,b]$ for intervals $(a,b]$ with $\pi\{a\} = \pi\{b\} = 0$. It may, however, then be shown that $\pi_n(B) \to \pi(B)$ for any Borel set $B$ whose boundary has $\pi$-measure zero (a so-called "$\pi$-continuity set"). It may also be shown that two useful further necessary and sufficient conditions for weak convergence may be added to those of Theorem 11.2.1, viz.

(iv) $\limsup_{n\to\infty} \pi_n(F) \le \pi(F)$ for all closed $F$
(v) $\liminf_{n\to\infty} \pi_n(G) \ge \pi(G)$ for all open $G$.

These are readily proved (see e.g. the "Portmanteau Theorem" of [Billingsley]) and, of course, suggest extensions of the theory to more abstract (topological) contexts.

We next obtain a useful and well known result, "Helly's Selection Theorem", concerning a sequence of d.f.'s. This theorem states that if $\{F_n\}$ is any sequence of d.f.'s, a subsequence $\{F_{n_k}\}$ may be selected such that $F_{n_k}(x)$ converges to a nondecreasing function $F(x)$ at all continuity points of the latter. The limit $F$ need not be a d.f., however, as is easily seen from the example where $F_n(x) = 0$, $x < -n$, $F_n(x) = 1$, $x > n$, and $F_n$ is linear in $[-n, n]$. (Here $F_n(x) \to 1/2$ for all $x$.)

A condition which will be seen to be useful in ensuring that such a limit is, in fact, a d.f. is the following. A family $\mathcal{H}$ of probability measures (or corresponding d.f.'s) on $\mathcal{B}$ is called tight if given $\epsilon > 0$ there exists $A$ such that $\pi\{(-A, A]\} > 1 - \epsilon$ for all $\pi \in \mathcal{H}$ (or $F(A) - F(-A) > 1 - \epsilon$ for all d.f.'s $F$ with $\mu_F \in \mathcal{H}$). Note that if $\pi_n \xrightarrow{w} \pi$, it may readily be shown that the sequence $\{\pi_n\}$ is tight (Ex. 11.18).
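The escaping-mass example just described can be checked directly (a numerical sketch of the text's example; the code itself is mine):

```python
# F_n = 0 below -n, 1 above n, linear on [-n, n]: the mass escapes to ±infinity
# and the pointwise limit 1/2 is not a distribution function.
def F_n(n, x):
    if x < -n:
        return 0.0
    if x > n:
        return 1.0
    return (x + n) / (2.0 * n)

# For each fixed x, F_n(x) -> 1/2, so F(+inf) - F(-inf) = 0 rather than 1:
for x in (-3.0, 0.0, 3.0):
    assert abs(F_n(10**6, x) - 0.5) < 1e-5
```

Tightness rules out exactly this behaviour: it keeps all but mass $\epsilon$ inside a fixed interval $(-A, A]$ uniformly in $n$.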

Theorem 11.2.3 (Helly's Selection Theorem)  Let $\{F_n : n = 1, 2, \dots\}$ be a sequence of d.f.'s. Then there is a subsequence $\{F_{n_k} : k = 1, 2, \dots\}$ and a nondecreasing, right-continuous function $F$ with $0 \le F(x) \le 1$ for all $x \in \mathbb{R}$, such that $F_{n_k}(x) \to F(x)$ as $k \to \infty$ at all $x$ where $F$ is continuous. If in addition the sequence $\{F_n\}$ is tight, then $F$ is a d.f. and $F_{n_k} \xrightarrow{w} F$.

Proof  We will choose a subsequence $\{F_{n_k}\}$ whose values converge at all rational numbers. Let $\{r_i\}$ be an enumeration of the rationals. Since $\{F_n(r_1) : n = 1, 2, \dots\}$ is bounded, it has at least one limit point, and there is a subsequence $S_1$ of $\{F_n\}$ whose members converge at $x = r_1$. Similarly there is a subsequence $S_2$ of $S_1$ whose members converge at $r_2$ as well as at $r_1$. Proceeding in this way we obtain sequences $S_1, S_2, \dots$ such that $S_n$ is a subsequence of $S_{n-1}$ and the members of $S_n$ converge at $x = r_1, r_2, \dots, r_n$. Let $S$ be the (infinite) sequence consisting of the first member of $S_1$, the second of $S_2$, and so on (the "diagonal" sequence). Clearly the members of $S$ ultimately belong to $S_n$ and hence converge at $r_1, r_2, \dots, r_n$ for any $n$, i.e. at all $r_k$.

Write $S = \{F_{n_k}\}$ and $G(r) = \lim_{k\to\infty} F_{n_k}(r)$ for each rational $r$. Clearly $0 \le G(r) \le 1$ and $G(r) \le G(s)$ if $r, s$ are rational with $r < s$. Now define $F$ by
$$F(x) = \inf\{G(r) : r \text{ rational},\ r > x\}.$$
Clearly $F$ is nondecreasing, $0 \le F(x) \le 1$ for all $x \in \mathbb{R}$, and $G(x) \le F(x)$ when $x$ is rational. To see that $F$ is right-continuous, fix $x \in \mathbb{R}$. Then for any $y \in \mathbb{R}$ and rational $r$ with $x < y < r$, $F(x+0) \le F(y) \le G(r)$, so that $F(x+0) \le G(r)$ for all rational $r > x$. Hence $F(x+0) \le \inf\{G(r) : r \text{ rational},\ r > x\} = F(x)$, showing that $F$ is right-continuous.

Now let $x$ be a point where $F$ is continuous. Then given $\epsilon > 0$ there exist rational numbers $r, s$ with $r < x < s$ such that $F(x) - \epsilon < G(r)$ and $G(s) < F(x) + \epsilon$. Since $F_{n_k}(r) \le F_{n_k}(x) \le F_{n_k}(s)$, letting $k \to \infty$ gives
$$F(x) - \epsilon < G(r) \le \liminf_{k\to\infty} F_{n_k}(x) \le \limsup_{k\to\infty} F_{n_k}(x) \le G(s) < F(x) + \epsilon,$$
and since $\epsilon$ is arbitrary, $F_{n_k}(x) \to F(x)$.

Suppose now that the sequence $\{F_n\}$ is tight. Given $\epsilon > 0$, let $A$ be such that $F_n(A) - F_n(-A) > 1 - \epsilon$ for all $n$. Let $\alpha \le -A$, $\beta \ge A$ be continuity points of $F$. Then $F_{n_k}(\beta) - F_{n_k}(\alpha) > 1 - \epsilon$ for all $k$, and hence $F(\beta) - F(\alpha) = \lim_k \big(F_{n_k}(\beta) - F_{n_k}(\alpha)\big) \ge 1 - \epsilon$. It follows that $F(\infty) - F(-\infty) \ge 1 - \epsilon$ for all $\epsilon$ and hence $F(\infty) - F(-\infty) = 1$. Thus $F(\infty) = 1 + F(-\infty)$, together with $0 \le F(-\infty)$ and $F(\infty) \le 1$, gives $F(-\infty) = 0$ and $F(\infty) = 1$. Thus $F$ is a d.f. and $F_{n_k} \xrightarrow{w} F$. □

An important notion closely related to tightness (in fact identical to tightness in this real line context) is that of relative compactness. Specifically, a family $\mathcal{H}$ of probability measures on $\mathcal{B}$ is called relatively compact if every sequence $\{\pi_n\}$ of elements of $\mathcal{H}$ has a weakly convergent subsequence $\{\pi_{n_k}\}$ (i.e. $\pi_{n_k} \xrightarrow{w} \pi$ for some probability measure $\pi$, not necessarily in $\mathcal{H}$). If $\mathcal{H}$ is a sequence this means that every subsequence has a further subsequence which is weakly convergent. It follows from the previous theorem that a family which is tight is also relatively compact. In fact the converse is also true (in this real line framework and in many other useful topological contexts). This is summarized in the following theorem.

Theorem 11.2.4 (Prohorov's Theorem)  A family $\mathcal{H}$ of probability measures on $\mathcal{B}$ is relatively compact if and only if it is tight.

Proof  In view of the preceding paragraph, we need only prove that if $\mathcal{H}$ is relatively compact it is also tight. If it is not tight, there is some $\epsilon > 0$ such that $\pi\{(-a, a]\} \le 1 - \epsilon$ for some $\pi \in \mathcal{H}$, whatever $a$ is chosen. This means that for each $n$ there is a member $\pi_n$ of $\mathcal{H}$ with $\pi_n\{(-n, n]\} \le 1 - \epsilon$. But since $\mathcal{H}$ is relatively compact, a subsequence $\pi_{n_k} \xrightarrow{w} \pi$, a probability measure, as $k \to \infty$.

Let $a, b$ be any points such that $\pi(\{a\}) = \pi(\{b\}) = 0$. Then for sufficiently large $k$, $(a, b] \subset (-n_k, n_k]$ and hence
$$\pi\{(a,b]\} = \lim_{k\to\infty} \pi_{n_k}\{(a,b]\} \le \limsup_k \pi_{n_k}\{(-n_k, n_k]\} \le 1 - \epsilon.$$
But this contradicts the fact that we may choose $a, b$ with $\pi(\{a\}) = \pi(\{b\}) = 0$ so that $\pi\{(a,b]\} > 1 - \epsilon$ (since $\pi(\mathbb{R}) = 1$). Thus $\mathcal{H}$ is indeed tight. □

It is well known (and easily shown) that if every convergent subsequence of a bounded sequence $\{a_n\}$ of real numbers has the same limit $a$, then $a_n \to a$ (i.e. the whole sequence converges). The next result demonstrates an analogous property for weak convergence.

Theorem 11.2.5  Let $\{F_n\}$ be a tight sequence of d.f.'s such that every weakly convergent subsequence $\{F_{n_k}\}$ has the same limiting d.f. $F$. Then $F_n \xrightarrow{w} F$.

Proof  Suppose the result is not true. Then there is a continuity point $x$ of the d.f. $F$ such that $F_n(x) \not\to F(x)$. By the above result stated for real sequences, there must be a subsequence $\{F_{n_k}\}$ of $\{F_n\}$ such that $F_{n_k}(x) \to \lambda \ne F(x)$. By Theorem 11.2.3, a subsequence $\{F_{m_k}\}$ of $\{F_{n_k}\}$ converges weakly, and by assumption its limit is $F$. Thus $F_{m_k}(x) \to F(x)$, contradicting the convergence of $F_{n_k}(x)$ to $\lambda \ne F(x)$. □

Finally, as indicated earlier, the notion of weak convergence may be generalized to apply to more abstract situations. The most obvious of these replaces $\mathbb{R}$ by $\mathbb{R}^k$, for which the generalization is immediate. Specifically we say that a sequence $\{\pi_n\}$ of probability measures on $\mathcal{B}^k$ converges weakly to a probability measure $\pi$ on $\mathcal{B}^k$ ($\pi_n \xrightarrow{w} \pi$) if $\pi_n(I) \to \pi(I)$ for every "continuity rectangle" $I$, i.e. any rectangle $I$ whose boundary has $\pi$-measure zero. In $\mathbb{R}$ the boundary of $I = (a, b]$ is just the two points $\{a, b\}$. In $\mathbb{R}^2$ it is the four edges, and in $\mathbb{R}^k$ it is the $2k$ bounding hyperplanes.

As in $\mathbb{R}$ we say that a sequence $\{F_n\}$ of d.f.'s in $\mathbb{R}^k$ converges weakly to a d.f. $F$, $F_n \xrightarrow{w} F$, if $F_n(x) \to F(x)$ at all points $x = (x_1, \dots, x_k)$ at which $F$ is continuous. It may then be shown that $F_n \xrightarrow{w} F$ if and only if the corresponding probability measures converge (i.e. $\pi_n = \mu_{F_n} \xrightarrow{w} \pi = \mu_F$). If $F_n$ is the joint d.f. of r.v.'s $(\xi_n^{(1)}, \dots, \xi_n^{(k)})$ ($= \xi_n$, say) and $F$ is the joint d.f. of $(\xi^{(1)}, \dots, \xi^{(k)}) = \xi$, and $F_n \xrightarrow{w} F$, we say that $\xi_n$ converges to $\xi$ in distribution ($\xi_n \xrightarrow{d} \xi$), i.e. $P\xi_n^{-1} \xrightarrow{w} P\xi^{-1}$.

More abstract (topological) spaces than $\mathbb{R}^k$ do not necessarily have an order structure to support the notions of distribution functions and of rectangles. However, the notion of bounded continuous functions does exist, so that (iii) of Theorem 11.2.1 ($\int g\,d\pi_n \to \int g\,d\pi$ for every bounded continuous function $g$) can be used as the definition of weak convergence of probability measures, $\pi_n \xrightarrow{w} \pi$. This is needed for the consideration of convergence in distribution of a sequence of random elements (e.g. stochastic processes) to a random element $\xi$ in topological spaces more general than $\mathbb{R}$ ($P\xi_n^{-1} \xrightarrow{w} P\xi^{-1}$), but our primary focus on random variables does not require the generalization here. We refer the interested reader to [Billingsley] for an eminently readable detailed account.

11.3 Relationships between forms of convergence

Returning now to the real line context, it is useful to note some relationships between the various forms of convergence. Convergence a.s. and convergence in $L_p$ both imply convergence in probability. It is also simply shown by the next result that convergence in probability implies convergence in distribution. (For another proof see Ex. 11.12.)

Theorem 11.3.1  Let $\{\xi_n\}$ be a sequence of r.v.'s on the same probability space $(\Omega, \mathcal{F}, P)$ and suppose that $\xi_n \xrightarrow{P} \xi$ as $n \to \infty$. Then $\xi_n \xrightarrow{d} \xi$ as $n \to \infty$.

Proof  Let $g$ be any bounded continuous function on $\mathbb{R}$. By Theorem 11.1.5 (ii) it follows that $g(\xi_n) \xrightarrow{P} g(\xi)$. But $|g(\xi_n)|$ is bounded by a constant and any constant is in $L_1$, so that $g(\xi_n) \to g(\xi)$ in $L_1$ by Theorem 11.1.7, and hence in particular $Eg(\xi_n) \to Eg(\xi)$. Hence (iv) of Corollary 2 to Theorem 11.2.1 shows that $\xi_n \xrightarrow{d} \xi$. □

Of course, the converse to Theorem 11.3.1 is not true (even though the $\xi_n$ are defined on the same space). However, if $\xi_n$ converges in distribution to some constant $a$, it is easy to show that $\xi_n \xrightarrow{P} a$ (Ex. 11.13).

Convergence in distribution by no means implies a.s. convergence (even for r.v.'s defined on the same $(\Omega, \mathcal{F}, P)$). However, the following representation of Skorohod shows that a sequence $\{\xi_n\}$ convergent in distribution may for some purposes be replaced by a sequence $\tilde{\xi}_n$ with the same individual distributions as the $\xi_n$, but which converges a.s. This can enable the use of the simpler theory of a.s. convergence in proving results for convergence in distribution.

Theorem 11.3.2 (Skorohod's Representation)  Let $\{\xi_n\}$, $\xi$ be r.v.'s with $\xi_n \xrightarrow{d} \xi$. Then there exist r.v.'s $\{\tilde{\xi}_n\}$, $\tilde{\xi}$ on the "unit interval probability space" $([0,1], \mathcal{B}([0,1]), m)$ (where $m$ is Lebesgue measure) such that

(i) $\tilde{\xi}_n \stackrel{d}{=} \xi_n$ for each $n$, $\tilde{\xi} \stackrel{d}{=} \xi$, and
(ii) $\tilde{\xi}_n \to \tilde{\xi}$ a.s.

Proof  Let $\xi_n$, $\xi$ have d.f.'s $F_n$, $F$, respectively, and let $U(u) = u$ for $0 \le u \le 1$. Then $U$ is a uniform r.v. on $[0,1]$ and (cf. Section 9.6 and Ex. 9.5) $\tilde{\xi}_n = F_n^{-1}(U)$, $\tilde{\xi} = F^{-1}(U)$ have d.f.'s $F_n$, $F$, i.e. $\tilde{\xi}_n \stackrel{d}{=} \xi_n$, $\tilde{\xi} \stackrel{d}{=} \xi$, so that (i) holds.

Since $\xi_n \xrightarrow{d} \xi$, $F_n \xrightarrow{w} F$, and hence by Lemma 9.6.2, $F_n^{-1} \to F^{-1}$ at continuity points of $F^{-1}$. Thus

$$1 \ge m\{u \in [0,1] : \tilde{\xi}_n(u) \to \tilde{\xi}(u)\} = m\{u \in [0,1] : F_n^{-1}(u) \to F^{-1}(u)\} \ge m\{u \in [0,1] : F^{-1} \text{ is continuous at } u\} = 1$$
(using $\tilde{\xi}_n(u) = F_n^{-1}(U(u)) = F_n^{-1}(u)$),

since the discontinuities of $F^{-1}$ are countable. Hence $\tilde{\xi}_n(u) \to \tilde{\xi}(u)$ for a.e. $u$, giving (ii). □
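The quantile-transform construction can be made concrete in a simple case (a hypothetical example of mine, not the text's): $\xi_n$ uniform on $[0, 1+1/n]$ and $\xi$ uniform on $[0,1]$, represented on $([0,1], m)$ via $U(u) = u$:

```python
# Skorohod representatives on ([0,1], m) via inverse d.f.'s.
def xi_tilde_n(n, u):
    """F_n^{-1}(u) for the uniform distribution on [0, 1 + 1/n]."""
    return u * (1.0 + 1.0 / n)

def xi_tilde(u):
    """F^{-1}(u) for the uniform distribution on [0, 1]."""
    return u

# The representatives converge at every u in (0,1), i.e. a.s. (m):
assert all(abs(xi_tilde_n(10**6, u) - xi_tilde(u)) < 1e-5
           for u in (0.1, 0.5, 0.9))
```

Here each $\tilde{\xi}_n$ has the same distribution as $\xi_n$, yet all live on the one space $[0,1]$, which is exactly what makes a.s. convergence meaningful.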

Note that while the r.v.'s $\xi_n$ may be defined on different probability spaces, their "representatives" $\tilde{\xi}_n$ are defined on the same probability space (as they must be if a.s. convergent).

Finally, note that weak convergence, $\pi_n \xrightarrow{w} \pi$, has been defined for probability measures $\pi_n$, $\pi$, but the same definition applies to measures $\mu_n$ and $\mu$ just assumed to be finite on $\mathcal{B}$, i.e. $\mu_n(\mathbb{R}) < \infty$, $\mu(\mathbb{R}) < \infty$. Of course, $\mu_n(\mathbb{R})$ and $\mu(\mathbb{R})$ need not be unity, but if $\mu_n \xrightarrow{w} \mu$ it follows in particular that $\mu_n(\mathbb{R}) \to \mu(\mathbb{R})$.

Suppose now that $\mu_n$, $\mu$ are Lebesgue–Stieltjes measures, i.e. measures on $\mathcal{B}$ which are finite on bounded sets but possibly have infinite total measure (or equivalently are defined by finite-valued, nondecreasing but not necessarily bounded functions $F$). Then the previous definition of weak convergence could still be used, but the important criterion (iii) of Theorem 11.2.1 does not apply sensibly since e.g. the bounded continuous function $g(x) = 1$ may not be integrable. This is the case for Lebesgue measure itself, of course. However, an appropriate extended notion of convergence may be given in this case.

Specifically, if $\{\mu_n\}$, $\mu$ are such measures on $\mathcal{B}$ (finite on bounded sets), we say that $\mu_n$ converges vaguely to $\mu$ ($\mu_n \xrightarrow{v} \mu$) if $\int f\,d\mu_n \to \int f\,d\mu$ for every continuous function $f$ with compact support, i.e. such that $f(x) = 0$ if $|x| > a$ for some constant $a$. Clearly $\int f\,d\mu_n$ and $\int f\,d\mu$ are defined and finite for such functions.

The notion of vague convergence applies in particular if $\mu_n$ and $\mu$ are finite measures, and it is clearly then implied by weak convergence. The following easily proved result (Ex. 11.20) summarizes the relationship between weak and vague convergence in this case when both apply.

Theorem 11.3.3  Let $\mu_n$, $\mu$ be finite measures on $\mathcal{B}$ (i.e. $\mu_n(\mathbb{R}) < \infty$, $\mu(\mathbb{R}) < \infty$). Then, as $n \to \infty$, $\mu_n \xrightarrow{w} \mu$ if and only if $\mu_n \xrightarrow{v} \mu$ and $\mu_n(\mathbb{R}) \to \mu(\mathbb{R})$.
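The need for the extra condition $\mu_n(\mathbb{R}) \to \mu(\mathbb{R})$ is illustrated by a standard example (mine, not the text's): $\mu_n = \delta_n$, the point mass at $n$. For continuous $f$ of compact support, $\int f\,d\mu_n = f(n) \to 0$, so $\mu_n$ converges vaguely to the zero measure, yet $\mu_n(\mathbb{R}) = 1$ for every $n$, so there is no weak convergence:

```python
# mu_n = delta_n: vague limit is the zero measure, total mass is preserved.
def integral_mu_n(n, f):
    return f(n)                        # integral of f against the point mass at n

def f(x):
    return max(0.0, 1.0 - abs(x))      # continuous, support contained in [-1, 1]

# Vague convergence to the zero measure: f(n) = 0 once n >= 2.
assert all(integral_mu_n(n, f) == 0.0 for n in range(2, 50))
MU_N_TOTAL = 1.0                       # mu_n(R) = 1, which does not tend to 0
```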

As for weak convergence, the notion of vague convergence can be extended to apply in more general topological spaces than the real line. Discussion of these forms of convergence and their relationships may be found in the volumes [Kallenberg] and [Kallenberg 2].

11.4 Uniform integrability

We turn now to the relation between $L_p$ convergence and convergence in probability. $L_p$ convergence implies convergence in probability (Theorem 11.1.6). We have seen that the converse is true provided each term of the sequence is dominated by a fixed $L_p$ r.v. (Theorem 11.1.7). A weaker condition turns out to be necessary and sufficient, and since it is important for other purposes, we investigate it now.

Specifically, a family $\{\xi_\lambda : \lambda \in \Lambda\}$ of ($L_1$) r.v.'s is said to be uniformly integrable if
$$\sup_{\lambda\in\Lambda} \int_{\{|\xi_\lambda(\omega)| > a\}} |\xi_\lambda(\omega)|\,dP(\omega) \to 0 \quad \text{as } a \to \infty,$$
or equivalently if $\sup_{\lambda\in\Lambda} \int_{\{|x|>a\}} |x|\,dF_\lambda(x) \to 0$ as $a \to \infty$, where $F_\lambda$ is the d.f. of $\xi_\lambda$. From this latter form it is evident that (like convergence in distribution (Section 11.2)) uniform integrability does not require the r.v.'s to be defined on the same probability space. Of course, we always have $\int_{\{|\xi_\lambda|>a\}} |\xi_\lambda|\,dP \to 0$ (equivalently $\int_{\{|x|>a\}} |x|\,dF_\lambda(x) \to 0$) for each $\lambda$ as $a \to \infty$ (dominated convergence). The extra requirement is that these limits should be uniform in $\lambda \in \Lambda$.

It is clear that identically distributed ($L_1$) r.v.'s are uniformly integrable, since $\int_{\{|x|>a\}} |x|\,dF(x) \to 0$ where $F$ is the common d.f. of the family. It is also immediate that finite families of ($L_1$) r.v.'s are uniformly integrable, and that an arbitrary family $\{\xi_\lambda\}$ defined on the same probability space, with each member dominated (in absolute value) by an integrable r.v. $\xi$, is uniformly integrable. For then $|\xi_\lambda| \chi_{\{|\xi_\lambda|\ge a\}} \le |\xi| \chi_{\{|\xi|\ge a\}}$ and hence
$$\int_{\{|\xi_\lambda|\ge a\}} |\xi_\lambda|\,dP \le \int_{\{|\xi|\ge a\}} |\xi|\,dP.$$

The concept of uniform integrability is closely related to what is called "uniform absolute continuity". If $\xi \in L_1$, we know that (the measure) $\int_E |\xi|\,dP$ is absolutely continuous with respect to $P$. Recall (Theorem 4.5.3) that then, given $\epsilon > 0$ there exists $\delta > 0$ such that $\int_E |\xi|\,dP < \epsilon$ if $P(E) < \delta$. If $\{\xi_\lambda : \lambda \in \Lambda\}$ is a family of ($L_1$) r.v.'s, each indefinite integral $\int_E |\xi_\lambda|\,dP$ is absolutely continuous. If for each $\epsilon$ one $\delta$ may be found serving for all $\xi_\lambda$ (i.e. $\int_E |\xi_\lambda|\,dP < \epsilon$ for all $\lambda$ when $P(E) < \delta$), then the family of indefinite integrals $\{\int_E |\xi_\lambda|\,dP : \lambda \in \Lambda\}$ is called uniformly absolutely continuous.
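A standard family that fails the definition (a hypothetical illustration of mine, not from the text) is $\xi_n = n\,\chi_{[0,1/n]}$ on $([0,1], \text{Lebesgue})$: each $E|\xi_n| = 1$, but the mass concentrates on a shrinking set, so the tail integrals do not go to zero uniformly:

```python
# Tail integral of xi_n = n * indicator([0, 1/n]) on ([0,1], Lebesgue measure):
# the set {|xi_n| > a} is [0, 1/n] when n > a (and empty otherwise), so the
# integral there is n * (1/n) = 1.
def tail_integral(n, a):
    return 1.0 if n > a else 0.0

# For every threshold a, sup_n of the tail integral is still 1,
# so the family is NOT uniformly integrable:
assert max(tail_integral(n, 100.0) for n in range(1, 1000)) == 1.0
assert tail_integral(50, 100.0) == 0.0   # individual tails do vanish for fixed n
```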

Theorem 11.4.1  A family of $L_1$ r.v.'s $\{\xi_\lambda : \lambda \in \Lambda\}$ is uniformly integrable if and only if:

(i) the indefinite integrals $\int_E |\xi_\lambda|\,dP$ are uniformly absolutely continuous, and

(ii) the expectations $E|\xi_\lambda|$ are bounded; i.e. $E|\xi_\lambda| < M$ for some $M < \infty$ and all $\lambda \in \Lambda$.

Proof  Suppose the family is uniformly integrable. To see that (i) holds, note that for any $E \in \mathcal{F}$, $\lambda \in \Lambda$,
$$\int_E |\xi_\lambda|\,dP = \int_{E\cap\{|\xi_\lambda|\le a\}} |\xi_\lambda|\,dP + \int_{E\cap\{|\xi_\lambda|>a\}} |\xi_\lambda|\,dP \le aP(E) + \int_{\{|\xi_\lambda|>a\}} |\xi_\lambda|\,dP.$$
Given $\epsilon > 0$ we may, by uniform integrability, choose $a$ so that the last term does not exceed $\epsilon/2$ for all $\lambda \in \Lambda$. For $P(E) < \delta = \epsilon/2a$ we thus have $\int_E |\xi_\lambda|\,dP < \epsilon$ for all $\lambda \in \Lambda$, so that (i) follows.

(ii) is even simpler. For we may choose $a$ such that $\int_{\{|\xi_\lambda|>a\}} |\xi_\lambda|\,dP < 1$ for all $\lambda \in \Lambda$, and hence $E|\xi_\lambda| \le 1 + \int_{\{|\xi_\lambda|\le a\}} |\xi_\lambda|\,dP \le 1 + a$, which is a suitable upper bound.

Conversely, suppose that (i) and (ii) hold and write

$$\sup_{\lambda\in\Lambda} E|\xi_\lambda| = M < \infty.$$
Then by the Markov Inequality (Theorem 9.5.3, Corollary), for all $\lambda \in \Lambda$ and all $a > 0$,
$$P\{|\xi_\lambda| > a\} \le E|\xi_\lambda|/a \le M/a.$$
Given $\epsilon > 0$, choose $\delta = \delta(\epsilon)$ so that $\int_E |\xi_\lambda|\,dP < \epsilon$ for all $\lambda \in \Lambda$ when $P(E) < \delta$. For $a > M/\delta$ we have $P\{|\xi_\lambda| > a\} < \delta$ and thus $\int_{\{|\xi_\lambda|>a\}} |\xi_\lambda|\,dP < \epsilon$ for all $\lambda \in \Lambda$. But this is just a statement of the required uniform integrability. □

The following result shows in detail how $L_p$ convergence and convergence in probability are related, and in particular generalizes the (probabilistic form of) dominated convergence (Theorem 11.1.7), replacing domination by uniform integrability.

Theorem 11.4.2  If $\xi_n \in L_p$ ($0 < p < \infty$) for all $n = 1, 2, \dots$, and $\xi_n \xrightarrow{P} \xi$, then the following are equivalent:

(i) $\{|\xi_n|^p : n = 1, 2, \dots\}$ is a uniformly integrable family
(ii) $\xi \in L_p$ and $\xi_n \to \xi$ in $L_p$ as $n \to \infty$
(iii) $\xi \in L_p$ and $E|\xi_n|^p \to E|\xi|^p$ as $n \to \infty$.

Proof  We show first that (i) implies (ii). Since $\xi_n \xrightarrow{P} \xi$, a subsequence $\xi_{n_k} \to \xi$ a.s. Hence, by Fatou's Lemma and (ii) of the previous theorem,
$$E|\xi|^p \le \liminf_{k\to\infty} E|\xi_{n_k}|^p \le \sup_{n\ge 1} E|\xi_n|^p < \infty$$
so that $\xi \in L_p$. Further,
$$E|\xi_n - \xi|^p = \int_{\{|\xi_n-\xi|^p\le\epsilon\}} |\xi_n - \xi|^p\,dP + \int_{\{|\xi_n-\xi|^p>\epsilon\}} |\xi_n - \xi|^p\,dP \le \epsilon + 2^p \int_{E_n} |\xi_n|^p\,dP + 2^p \int_{E_n} |\xi|^p\,dP$$
where $E_n = \{\omega : |\xi_n - \xi| > \epsilon^{1/p}\}$ (hence $P(E_n) \to 0$) and use has been made of the inequality $|a + b|^p \le 2^p(|a|^p + |b|^p)$ (cf. proof of Theorem 6.4.1). Uniform integrability of $\{|\xi_n|^p\}$ implies the uniform absolute continuity of the integrals $\int_E |\xi_n|^p\,dP$ (Theorem 11.4.1). Thus $\int_E |\xi_n|^p\,dP < \epsilon$ when $P(E) < \delta$ ($= \delta(\epsilon)$), for all $n$, and hence there is some $N_1$ (making $P(E_n) < \delta$ for $n \ge N_1$) such that $\int_{E_n} |\xi_n|^p\,dP < \epsilon$ when $n \ge N_1$. Correspondingly for $n \ge$ some $N_2$ we have $\int_{E_n} |\xi|^p\,dP < \epsilon$, and hence for $n \ge \max(N_1, N_2)$, $E|\xi_n - \xi|^p < \epsilon + 2^p\epsilon + 2^p\epsilon$, showing that $\xi_n \to \xi$ in $L_p$. Thus (i) implies (ii).

That (ii) implies (iii) follows at once from Theorem 11.1.6. The proof will be completed by showing that (iii) implies (i). Let $A$ be any fixed nonnegative real number such that $P\{|\xi| = A\} = 0$, and define the function $h(x) = |x|^p$ for $|x| < A$, $h(x) = 0$ otherwise. Now since $\xi_n \to \xi$ in probability and $h$ is continuous except at $\pm A$ (but $P\{\xi = \pm A\} = 0$), it follows from Theorem 11.1.5 (iii) that $h(\xi_n) \to h(\xi)$ in probability. Since $h(\xi_n) \le A^p \in L_1$ it follows from Theorem 11.1.7 that $h(\xi_n) \to h(\xi)$ in $L_1$. Thus $Eh(\xi_n) \to Eh(\xi)$, and hence by (iii),

$$E|\xi_n|^p - Eh(\xi_n) \to E|\xi|^p - Eh(\xi), \quad \text{i.e.} \quad \int_{\{|\xi_n|>A\}} |\xi_n|^p\,dP \to \int_{\{|\xi|>A\}} |\xi|^p\,dP.$$
Now if $\epsilon > 0$ we may choose $A = A(\epsilon)$ such that this limit is less than $\epsilon$ (and $P\{|\xi| = A\} = 0$), so that there exists $N = N(\epsilon)$ such that
$$\int_{\{|\xi_n|>A\}} |\xi_n|^p\,dP < \epsilon$$
for all $n \ge N$. Since, as noted above, the finite family $\{|\xi_n|^p : n = 1, 2, \dots, N-1\}$ is uniformly integrable, we have $\sup_{1\le n\le N-1} \int_{\{|\xi_n|^p\ge a\}} |\xi_n|^p\,dP \to 0$ as $a \to \infty$, and hence there exists $A' = A'(\epsilon)$ such that
$$\max_{1\le n\le N-1} \int_{\{|\xi_n|^p > A'\}} |\xi_n|^p\,dP < \epsilon.$$
Now taking $A'' = A''(\epsilon) = \max(A, A')$, we have $\int_{\{|\xi_n| > A''\}} |\xi_n|^p\,dP < \epsilon$ for all $n$, and hence, finally, $\sup_n \int_{\{|\xi_n|^p > a\}} |\xi_n|^p\,dP < \epsilon$ whenever $a > (A''(\epsilon))^p$, demonstrating the desired uniform integrability. □

Note that (iii) states that $\int g\,d\pi_n \to \int g\,d\pi$ where $\pi_n$, $\pi$ are the distributions of $\xi_n$ and $\xi$, and $g$ is the function $g(x) = |x|^p$. This result would have followed from weak convergence of $\pi_n$ to $\pi$ alone (i.e. $\xi_n \xrightarrow{d} \xi$) if $g$ were bounded (by Theorem 11.2.1). It is thus the fact that $|x|^p$ is not bounded that makes the extra conditions necessary. Finally, also note that while we are used to sufficient (e.g. "domination type") conditions for (ii), the fact that (i) is actually necessary for (ii) indicates the appropriateness of uniform integrability as the correct condition to consider for sufficiency when $\xi_n \xrightarrow{P} \xi$.
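A small closed-form check of the theorem in a well-behaved case (this example and its names are mine, not the text's): on $([0,1], m)$ let $\xi(\omega) = \omega$ and $\xi_n = \xi + 1/n$. Then $\xi_n \to \xi$ uniformly (hence in probability), the family is bounded by 2 (hence $\{|\xi_n|\}$ is uniformly integrable), and (ii), (iii) of Theorem 11.4.2 hold with $p = 1$:

```python
# Exact moments for xi(w) = w, xi_n = xi + 1/n on ([0,1], Lebesgue measure).
def E_abs_diff(n):
    """E|xi_n - xi| = 1/n."""
    return 1.0 / n

def E_abs(n):
    """E|xi_n| = 1/2 + 1/n."""
    return 0.5 + 1.0 / n

assert E_abs_diff(10**6) < 1e-5            # (ii): L1 convergence
assert abs(E_abs(10**6) - 0.5) < 1e-5      # (iii): E|xi_n| -> E|xi| = 1/2
```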

11.5 Series of independent r.v.’s

It follows (Ex. 10.15) from the zero-one law of Chapter 10 that if $\{\xi_n\}$ are independent r.v.'s then
$$P\{\omega : \textstyle\sum_{n=1}^{\infty} \xi_n(\omega) \text{ converges}\} = 0 \text{ or } 1.$$
In this section necessary and sufficient conditions will be obtained for this probability to be unity, i.e. for $\sum_{1}^{\infty} \xi_n$ to converge a.s. First, two inequalities are needed.

Theorem 11.5.1 (Kolmogorov Inequalities)  Let $\xi_1, \xi_2, \dots, \xi_n$ be independent r.v.'s with zero means and (possibly different) finite second moments $E\xi_i^2 = \sigma_i^2$. Write $S_k = \sum_{j=1}^{k} \xi_j$. Then, for every $a > 0$,

(i) $P\{\max_{1\le k\le n} |S_k| \ge a\} \le \sum_{i=1}^{n} \sigma_i^2 / a^2$.
(ii) If in addition the r.v.'s $\xi_i$ are bounded, $|\xi_i| \le c$ a.s., $i = 1, 2, \dots, n$, then
$$P\{\max_{1\le k\le n} |S_k| < a\} \le (c + a)^2 \Big/ \sum_{i=1}^{n} \sigma_i^2.$$
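Before the proof, inequality (i) can be sanity-checked exhaustively for a small case (a numerical sketch of mine, not part of the text): ten i.i.d. $\pm 1$ steps, so each $\sigma_i^2 = 1$, with all $2^{10}$ equally likely sign patterns enumerated.

```python
from itertools import product

# Exhaustive check of P(max_{k<=n} |S_k| >= a) <= (sum sigma_i^2) / a^2
# for n = 10 independent ±1 steps and a = 4.
n, a = 10, 4
hits = 0
for signs in product((-1, 1), repeat=n):
    s, running_max = 0, 0
    for x in signs:
        s += x
        running_max = max(running_max, abs(s))
    if running_max >= a:
        hits += 1
prob = hits / 2 ** n          # exact probability over all sign patterns
assert prob <= n / a ** 2     # Kolmogorov bound: 10/16
```

Note that for $n = 1$ this computation reduces to Chebyshev's inequality, as remarked after the proof below.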

Proof First we prove (i), so do not assume ξi bounded. Write

$$E = \{\omega : \max_{1\le k\le n} |S_k(\omega)| \ge a\}$$

$$E_1 = \{\omega : |S_1(\omega)| \ge a\}, \qquad E_k = \{\omega : |S_k(\omega)| \ge a\} \cap \bigcap_{i=1}^{k-1} \{\omega : |S_i(\omega)| < a\}, \quad k > 1.$$

It is readily checked that $\chi_{E_k}$ and $\chi_{E_k} S_k$ are Borel functions of $\xi_1, \dots, \xi_k$. By Theorems 10.3.2 (Corollary) and 10.3.5 it follows that if $i > k$,
$$E(\chi_{E_k} S_k \xi_i) = E(\chi_{E_k} S_k)\, E\xi_i = 0, \qquad E(\chi_{E_k} \xi_i^2) = E\chi_{E_k}\, E\xi_i^2,$$
and for $j > i > k$,
$$E(\chi_{E_k} \xi_i \xi_j) = E\chi_{E_k}\, E\xi_i\, E\xi_j = 0.$$

Hence since
$$S_n^2 = \Big(S_k + \sum_{i=k+1}^{n} \xi_i\Big)^2 = S_k^2 + 2S_k \sum_{i=k+1}^{n} \xi_i + \sum_{i=k+1}^{n} \xi_i^2 + 2\sum_{n\ge j>i>k} \xi_i\xi_j,$$
it follows that
$$E(\chi_{E_k} S_n^2) = E(\chi_{E_k} S_k^2) + P(E_k) \sum_{i=k+1}^{n} \sigma_i^2, \tag{11.1}$$
so that
$$E(\chi_{E_k} S_n^2) \ge E(\chi_{E_k} S_k^2) \ge a^2 P(E_k)$$
since $\chi_{E_k} S_k^2 \ge a^2 \chi_{E_k}$ by definition of $E_k$. Thus since $E = \bigcup_1^n E_k$, and the sets $E_k$ are disjoint, $\chi_E = \sum_1^n \chi_{E_k}$ and

$$a^2 P(E) = a^2 \sum_1^n P(E_k) \le \sum_1^n E(\chi_{E_k} S_n^2) = E(S_n^2 \chi_E) \le ES_n^2 = \sum_1^n \sigma_i^2$$
by independence of the $\xi_i$. Thus $P(E) \le \sum_{i=1}^n \sigma_i^2 / a^2$, which is the desired result (i).

To prove (ii), assume now that $|\xi_i| \le c$ a.s. for each $i$, and note that the equality (11.1) still holds, so that

$$E(\chi_{E_k} S_n^2) \le E(\chi_{E_k} S_k^2) + P(E_k)\sum_1^n \sigma_i^2 \le (a + c)^2 P(E_k) + P(E_k)\sum_1^n \sigma_i^2$$
since $|S_k| \le |S_{k-1}| + |\xi_k| \le a + c$ on $E_k$. Summing over $k$ from 1 to $n$ we have

$$E(\chi_E S_n^2) \le (a + c)^2 P(E) + P(E)\sum_1^n \sigma_i^2$$

and thus (noting that $|S_n| \le a$ on $E^c$)

$$\sum_1^n \sigma_i^2 = ES_n^2 = E(\chi_E S_n^2) + E(\chi_{E^c} S_n^2) \le (a+c)^2 P(E) + P(E)\sum_1^n \sigma_i^2 + a^2 P(E^c) \le (a+c)^2 + P(E)\sum_1^n \sigma_i^2.$$

Rearranging gives
$$P(E^c) \le (a + c)^2 \Big/ \sum_1^n \sigma_i^2,$$
or
$$P\{\max_{1\le k\le n} |S_k| < a\} \le (a + c)^2 \Big/ \sum_1^n \sigma_i^2,$$
which is the desired result. □

Note that the inequality (i) is a generalization of the Chebychev Inequality (which it becomes when $n = 1$). Note also that the same inequality holds for $P\{\max_{1\le k\le n} |S_k| \le a\}$ in (ii) as for $P\{\max_{1\le k\le n} |S_k| < a\}$. (For we may replace $a$ in (ii) by $a + \epsilon$ and let $\epsilon \downarrow 0$.)

The next lemma will be useful in obtaining our main theorems concerning a.s. convergence of series of r.v.'s.

Lemma 11.5.2  Let $\{\xi_n\}$ be a sequence of r.v.'s and write $S_n = \sum_1^n \xi_i$. Then $\sum_1^\infty \xi_n$ converges a.s. if and only if

$$\lim_{k\to\infty} P\{\max_{n\le r\le k} |S_r - S_n| > \epsilon\} \to 0 \quad \text{as } n \to \infty$$
for each $\epsilon > 0$. (Note that the $k$-limit exists by monotonicity.)

Proof  Since $\sum_1^\infty \xi_n$ converges if and only if the sequence $\{S_n\}$ is Cauchy, it is readily seen that
$$\{\omega : \textstyle\sum_1^\infty \xi_n \text{ converges}\} = \bigcap_{m=1}^\infty \bigcup_{n=1}^\infty \{\omega : |S_i - S_j| \le 1/m \text{ for all } i, j \ge n\} = \bigcap_{m=1}^\infty \bigcup_{n=1}^\infty \bigcap_{k=n}^\infty \{\omega : \max_{n\le i,j\le k} |S_i - S_j| \le 1/m\}.$$
Now if $E_{mnk}^c$ denotes the set in braces, i.e. $E_{mnk} = \{\omega : \max_{n\le i,j\le k} |S_i - S_j| > 1/m\}$, it is clear that $E_{mnk}$ is nonincreasing in $n$ ($\le k$), and nondecreasing in both $k$ ($\ge n$) and $m$, so that, writing $D$ for the set where $\sum_1^\infty \xi_n$ does not converge, we have
$$P(D) = P\{\bigcup_{m=1}^\infty \bigcap_{n=1}^\infty \bigcup_{k=n}^\infty E_{mnk}\} = \lim_{m\to\infty} \lim_{n\to\infty} \lim_{k\to\infty} P(E_{mnk}).$$

Since $P(E_{mnk})$ is nondecreasing in $m$, $P(D) = 0$ if and only if $\lim_{n\to\infty} \lim_{k\to\infty} P(E_{mnk}) = 0$ for each $m$, which clearly holds if and only if

$$\lim_{k\to\infty} P\{\max_{n\le i,j\le k} |S_i - S_j| > \epsilon\} \to 0 \quad \text{as } n \to \infty$$
for each $\epsilon > 0$. But for fixed $n$, $k$,

$$P\{\max_{n\le i\le k} |S_i - S_n| > \epsilon\} \le P\{\max_{n\le i,j\le k} |S_i - S_j| > \epsilon\} \le P\{\max_{n\le i\le k} |S_i - S_n| > \epsilon/2\}$$

(since $|S_i - S_j| \le |S_i - S_n| + |S_n - S_j|$), from which it is easily seen that $P(D) = 0$ if and only if $\lim_{k\to\infty} P\{\max_{n\le r\le k} |S_r - S_n| > \epsilon\} \to 0$ as $n \to \infty$ for each $\epsilon > 0$, as required. □

The next theorem (which will follow at once from the above results), while not as general as the "Three Series Theorem" to be obtained subsequently, nevertheless gives a simple useful condition for a.s. convergence of series of independent r.v.'s when the terms have finite variances.

Theorem 11.5.3  Let $\{\xi_n\}$ be a sequence of independent r.v.'s with zero means and finite variances $E\xi_n^2 = \sigma_n^2$. Suppose that $\sum_1^\infty \sigma_n^2 < \infty$. Then $\sum_1^\infty \xi_n$ converges a.s.

Proof  Writing $S_n = \sum_1^n \xi_i$, and noting that $S_r - S_n$ is (for $r > n$) the sum of the $r - n$ r.v.'s $\xi_{n+1}, \dots, \xi_r$, we have by Theorem 11.5.1
$$P\{\max_{n\le r\le k} |S_r - S_n| > \epsilon\} \le \sum_{i=n+1}^{k} \sigma_i^2 / \epsilon^2$$
so that
$$\lim_{k\to\infty} P\{\max_{n\le r\le k} |S_r - S_n| > \epsilon\} \le \sum_{i=n+1}^{\infty} \sigma_i^2 / \epsilon^2,$$
which tends to zero as $n \to \infty$ by virtue of the convergence of $\sum_1^\infty \sigma_i^2$. Hence the result follows immediately from Lemma 11.5.2. □

The next result is the celebrated "Three Series Theorem", which gives necessary and sufficient conditions for a.s. convergence of series of independent r.v.'s, without assuming existence of any moments of the terms.

Theorem 11.5.4 (Kolmogorov's Three Series Theorem)  Let $\{\xi_n : n = 1, 2, \dots\}$ be independent r.v.'s and let $c$ be a positive constant. Write $E_n = \{\omega : |\xi_n(\omega)| \le c\}$ and define $\xi_n'(\omega)$ as $\xi_n(\omega)$ or $c$ according as $\omega \in E_n$ or $\omega \in E_n^c$. Then a necessary and sufficient condition for the convergence (a.s.) of $\sum_1^\infty \xi_n$ is the convergence of all three of the series
$$\text{(a)}\ \sum_1^\infty P(E_n^c) \qquad \text{(b)}\ \sum_1^\infty E\xi_n' \qquad \text{(c)}\ \sum_1^\infty \sigma_n'^2,$$
$\sigma_n'^2$ being the variance of $\xi_n'$.

Proof  To see the sufficiency of the conditions, note that (a) may be rewritten as $\sum P(\xi_n \ne \xi_n')$, and convergence of this series implies (a.s.), by the Borel–Cantelli Lemma, that $\xi_n(\omega) = \xi_n'(\omega)$ when $n$ is sufficiently large (how large depending on $\omega$). Hence $\sum \xi_n$ converges a.s. if and only if $\sum \xi_n'$ does. But by Theorem 11.5.3 applied to $\xi_n' - E\xi_n'$ (using (c), $E(\xi_n' - E\xi_n')^2 = \sigma_n'^2$) we have that $\sum (\xi_n' - E\xi_n')$ converges a.s. Hence by (b) $\sum \xi_n'$ converges a.s., and, by the discussion above, so does $\sum \xi_n$, as required.

Conversely, suppose that $\sum_1^\infty \xi_n$ converges a.s. Since this implies that $\xi_n \to 0$ a.s., we must have $\xi_n = \xi_n'$ a.s. when $n$ is sufficiently large, and hence $\sum P\{\xi_n \ne \xi_n'\} < \infty$ by Theorem 10.5.2. That is, condition (a) holds, and further $\sum \xi_n'$ converges a.s.

Now let $\eta_n$, $\zeta_n$ be r.v.'s with the same distributions as $\xi_n'$ and such that $\{\eta_n, \zeta_n : n = 1, 2, \dots\}$ are all independent as a family. (Such r.v.'s may be readily constructed using product spaces.) It is easily shown (cf. Ex. 11.30) that $\sum \eta_n$ and $\sum \zeta_n$ both converge a.s. (since $\sum \xi_n'$ does), and hence so does $\sum (\eta_n - \zeta_n)$. Writing $S_k = \sum_1^k (\eta_n - \zeta_n)$ we have, in particular, that the sequence $\{|S_k| : k = 1, 2, \dots\}$ is bounded for a.e. $\omega$, i.e. $P\{\sup_{k\ge 1} |S_k| < \infty\} = 1$, and hence $\lim_{a\to\infty} P\{\sup_{k\ge 1} |S_k| < a\} = 1$, so that $P\{\sup_{k\ge 1} |S_k| < a\} > \theta$ for some $\theta > 0$, $a > 0$. Thus, for any $n$, $P\{\max_{1\le k\le n} |S_k| < a\} > \theta$. But Theorem 11.5.1 (ii) applies to the r.v.'s $\eta_k - \zeta_k$ (with variance $2\sigma_k'^2$, and writing $2c$ for $c$), to give
$$(2c + a)^2 \Big/ \Big(2\sum_1^n \sigma_k'^2\Big) \ge P\{\max_{1\le k\le n} |S_k| < a\} > \theta$$
for all $n$. That is, for all $n$,

$$\sum_1^n \sigma_k'^2 < (2c + a)^2 / 2\theta,$$
which shows that $\sum_1^\infty \sigma_k'^2$ converges; i.e. (c) holds.

(b) is now easily checked, since the r.v.'s $\xi_n' - E\xi_n'$ have zero means, and the sum of their variances ($\sum \sigma_n'^2$) is finite. Hence $\sum (\xi_n' - E\xi_n')$ converges a.s., as does $\sum \xi_n'$. By choosing some fixed $\omega$ where convergence (of both) takes place, we see that $\sum E\xi_n'$ must converge, concluding the proof of the theorem. □
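As a worked instance (my example, not the text's): take $\xi_n = \epsilon_n/n$ with $P(\epsilon_n = \pm 1) = 1/2$ and truncation level $c = 1$. Since $|\xi_n| = 1/n \le 1$ for every $n$, $\xi_n' = \xi_n$; series (a) is identically zero and (b) is zero by symmetry, so convergence of the random harmonic series $\sum \pm 1/n$ a.s. reduces to the convergence of series (c):

```python
import math

# Series (c): sum of var(xi'_n) = sum 1/n^2, which converges (to pi^2/6).
partial = sum(1.0 / k ** 2 for k in range(1, 100_001))
assert abs(partial - math.pi ** 2 / 6) < 1e-4   # tail beyond 10^5 is ~1e-5

# Hence all three series converge for c = 1, and by the theorem
# sum_n epsilon_n / n converges a.s.
```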

Note that it follows from the theorem that if the series (a), (b), (c) converge for some $c > 0$, they converge for all $c > 0$. Note also that the proof of the theorem applies equally if $\xi_n'(\omega)$ is defined to be zero (rather than $c$) when $\omega \in E_n^c$. This definition of $\xi_n'$ can be simpler in practice.

Convergence in probability does not in general imply convergence a.s. Our final task in this section is to show, however, that convergence of a series of independent r.v.'s in probability does imply its convergence a.s.

Theorem 11.5.5  Let $\{\xi_n\}$ be a sequence of independent r.v.'s. Then the series $\sum_1^\infty \xi_n$ converges in probability if and only if it converges a.s.

Proof  Certainly convergence a.s. implies convergence in probability. By Lemma 11.5.2 (using $2\epsilon$ in place of $\epsilon$) the result will follow if it is shown that for each $\epsilon > 0$

$$\lim_{k\to\infty} P\{\max_{n\le r\le k} |S_r - S_n| > 2\epsilon\} \to 0 \quad \text{as } n \to \infty,$$
with $S_n = \sum_1^n \xi_i$. Instead of appealing to Kolmogorov's Inequality (as in the previous theorem), the convergence in probability may be used to obtain this as follows. If $n < r \le k$, $|S_r - S_n| > 2\epsilon$ and $|S_k - S_r| \le \epsilon$, then

$$|S_k - S_n| = |(S_r - S_n) - (S_r - S_k)| \ge |S_r - S_n| - |S_r - S_k| > \epsilon$$
and hence
$$\bigcup_{r=n+1}^{k} \{\omega : \max_{n\le j<r} |S_j - S_n| \le 2\epsilon,\ |S_r - S_n| > 2\epsilon,\ |S_k - S_r| \le \epsilon\}$$

$$\subset \{\omega : |S_k - S_n| > \epsilon\}.$$

The sets of the union are disjoint. Also, for each $r$ the event $\{\max_{n\le j<r} |S_j - S_n| \le 2\epsilon,\ |S_r - S_n| > 2\epsilon\}$ depends only on $\xi_{n+1}, \dots, \xi_r$, while $\{|S_k - S_r| \le \epsilon\}$ depends only on $\xi_{r+1}, \dots, \xi_k$, so the two are independent. Hence
$$P\{|S_k - S_n| > \epsilon\} \ge \sum_{r=n+1}^{k} P\{\max_{n\le j<r} |S_j - S_n| \le 2\epsilon,\ |S_r - S_n| > 2\epsilon\}\, P\{|S_k - S_r| \le \epsilon\}.$$
Since $\sum_1^\infty \xi_n$ converges in probability, $\{S_n\}$ is a Cauchy sequence in probability, and hence, given $\eta > 0$, there is an integer $N$ with $P\{|S_k - S_n| > \epsilon\} < \eta$ when $k, n \ge N$. Hence also $P\{|S_k - S_r| \le \epsilon\} > 1 - \eta$ if $k \ge r \ge N$, giving
$$\sum_{r=n+1}^{k} P\{\max_{n\le j<r} |S_j - S_n| \le 2\epsilon,\ |S_r - S_n| > 2\epsilon\} \le \eta/(1 - \eta)$$
for $k > n \ge N$. But the (disjoint) events in this sum decompose $\{\max_{n\le r\le k} |S_r - S_n| > 2\epsilon\}$ according to the first $r$ at which $|S_r - S_n|$ exceeds $2\epsilon$. Rephrasing, we have

$$P\{\max_{n\le r\le k} |S_r - S_n| > 2\epsilon\} \le \eta/(1 - \eta)$$
and hence $\lim_{k\to\infty} P\{\max_{n\le r\le k} |S_r - S_n| > 2\epsilon\} \le \eta/(1 - \eta)$ for $n \ge N$, giving

$$\lim_{k\to\infty} P\{\max_{n\le r\le k} |S_r - S_n| > 2\epsilon\} \to 0 \quad \text{as } n \to \infty,$$
concluding the proof. □

It may even be shown that if a series $\sum_1^\infty \xi_n$ of independent r.v.'s converges in distribution, it converges in probability and hence a.s. Since we shall use characteristic functions to prove it, the explicit statement and proof of this still stronger result are deferred to the next chapter (Theorem 12.5.2).

11.6 Laws of large numbers

The last section concerned convergence of series $\sum_1^\infty \xi_n$ of independent r.v.'s. For convergence it is necessary in particular that the terms tend to zero, i.e. $\xi_n \to 0$ a.s. Thus the discussion there certainly does not apply to any (nontrivial) independent sequences whose terms have the same distributions. It is mainly to such "independent and identically distributed" (i.i.d.) random variables that the present section will apply.

Specifically we shall consider an independent sequence $\{\xi_n\}$ with $S_n = \sum_1^n \xi_i$ and obtain conditions under which the averages $S_n/n$ converge to a constant, either in probability or a.s., as $n \to \infty$. For i.i.d. random variables with a finite mean, the constant will turn out to be $\mu = E\xi_i$. Results of this type are usually called laws of large numbers, convergence in probability being called a weak law and convergence with probability one a strong law.

Two versions of the strong law will be given – one applying to independent r.v.'s with finite second moments (but not necessarily having the same distributions), and the other applying to i.i.d. r.v.'s with finite first moments. Since convergence a.s. implies convergence in probability, weak laws will follow trivially as corollaries. However, the weak law for i.i.d. r.v.'s may also be easily obtained directly by use of characteristic functions, as will be seen in the next chapter.

Lemma 11.6.1  If $\{y_n\}$ is a sequence of real numbers such that $\sum_{n=1}^\infty y_n/n$ converges, then $\frac{1}{n}\sum_{i=1}^n y_i \to 0$ as $n \to \infty$.

Proof  Writing $s_n = \sum_{i=1}^n y_i/i$ ($s_0 = 0$) and $t_n = \sum_1^n y_i$, it is easily checked that
$$t_n/n = -\frac{1}{n}\sum_{i=1}^{n-1} s_i + s_n.$$
Since $\frac{1}{n}\sum_{i=1}^{n-1} s_i$ is well known (or easily shown) to converge to the same limit as $s_n$, it follows that $t_n/n \to 0$, which is the result required. □

The first form of the strong law of large numbers requires the independent r.v.'s $\xi_n$ to have finite variances but not necessarily to be identically distributed.

Theorem 11.6.2 (Strong Law, First Form) If $\xi_n$ are independent r.v.'s with finite means $\mu_n$ and finite variances $\sigma_n^2$, satisfying $\sum_{n=1}^\infty \sigma_n^2/n^2 < \infty$, then
$$\frac{1}{n}\sum_{i=1}^n (\xi_i - \mu_i) \to 0 \quad \text{a.s.}$$
In particular if $\frac{1}{n}\sum_{i=1}^n \mu_i \to \mu$ (e.g. if $\mu_n \to \mu$) then $\frac{1}{n}\sum_{i=1}^n \xi_i \to \mu$ a.s.

Proof It is sufficient to consider the case where $\mu_n = 0$ for all $n$, since the general case follows by replacing $\xi_i$ by $(\xi_i - \mu_i)$. Assume then that $\mu_n = 0$ for all $n$ and write $\eta_n(\omega) = \xi_n(\omega)/n$. Then $E\eta_n = 0$ and
$$\sum_{n=1}^\infty \mathrm{var}(\eta_n) = \sum_{n=1}^\infty \sigma_n^2/n^2 < \infty.$$
Thus by Theorem 11.5.3, $\sum_{n=1}^\infty \xi_n/n = \sum_{n=1}^\infty \eta_n$ converges a.s. and the desired conclusion follows at once from Lemma 11.6.1. □

The following result yields the most common form of the strong law, which applies to i.i.d. r.v.'s (but assumes only the existence of first moments).
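A minimal Monte Carlo sketch of Theorem 11.6.2, under an arbitrary illustrative choice of distributions: $\xi_n$ uniform on $[-n^{1/4}, n^{1/4}]$, so the $\xi_n$ are not identically distributed, have mean zero and $\mathrm{var}(\xi_n) = n^{1/2}/3$, and $\sum \sigma_n^2/n^2 < \infty$ as the theorem requires.

```python
import random

random.seed(42)

# Independent, non-identically distributed: xi_n uniform on [-n**0.25, n**0.25],
# so var(xi_n) = n**0.5 / 3 and sum var(xi_n)/n^2 < infinity (Theorem 11.6.2 applies).
N = 200000
total = 0.0
for n in range(1, N + 1):
    b = n ** 0.25
    total += random.uniform(-b, b)
avg = total / N
print(avg)  # should be near 0
```

The standard deviation of the average here is roughly $0.02$, so a single run lands close to zero with very high probability.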

Theorem 11.6.3 (Strong Law, Second Form) Let $\{\xi_n\}$ be independent and identically distributed r.v.'s with (the same) finite mean $\mu$. Then
$$\frac{1}{n}\sum_{i=1}^n \xi_i \to \mu \quad \text{a.s. as } n \to \infty.$$

Proof Again, if the result holds when $\mu = 0$, replacing $\xi_i$ by $(\xi_i - \mu)$ shows that it holds when $\mu \neq 0$. Hence we assume that $\mu = 0$.

Write $\eta_n(\omega) = \xi_n(\omega)$ if $|\xi_n(\omega)| \leq n$, $\eta_n(\omega) = 0$ otherwise (for $n = 1, 2, \ldots$). First it will be shown that $\frac{1}{n}\sum_{i=1}^n (\xi_i - \eta_i) \to 0$ a.s. We have
$$\sum_{n=1}^\infty P(\xi_n \neq \eta_n) = \sum_{n=1}^\infty P(|\xi_n| > n) = \sum_{n=1}^\infty (1 - F(n))$$
where $F$ is the (common) d.f. of the $|\xi_n|$. But $1 - F(n) \leq 1 - F(x)$ for $n - 1 < x \leq n$, so that
$$\sum_{n=1}^\infty (1 - F(n)) \leq \int_0^\infty (1 - F(x))\, dx = E|\xi_1| < \infty$$
by e.g. Ex. 9.16, so that $\sum_n P(\xi_n \neq \eta_n) < \infty$. Hence by the Borel–Cantelli Lemma, for a.e. $\omega$, $\xi_n(\omega) = \eta_n(\omega)$ when $n$ is sufficiently large, and hence it follows at once that $\frac{1}{n}\sum_{i=1}^n (\xi_i - \eta_i) \to 0$ a.s.

The proof will be completed by showing that $\frac{1}{n}\sum_{i=1}^n \eta_i \to 0$ a.s. Note first that the variance of $\eta_n$ satisfies
$$\mathrm{var}(\eta_n) \leq E\eta_n^2 = \int_{|x|\leq n} x^2\, dF(x)$$
since the $|\xi_i|$ have d.f. $F$. Hence
$$\sum_{n=1}^\infty n^{-2}\mathrm{var}(\eta_n) \leq \sum_{n=1}^\infty n^{-2}\int_{|x|\leq n} x^2\, dF(x) = \sum_{n=1}^\infty n^{-2}\sum_{k=1}^n \int_{\{k-1<|x|\leq k\}} x^2\, dF(x)$$
$$= \sum_{k=1}^\infty \int_{\{k-1<|x|\leq k\}} x^2\, dF(x) \sum_{n=k}^\infty n^{-2} \leq \sum_{k=1}^\infty (C/k)\int_{\{k-1<|x|\leq k\}} x^2\, dF(x)$$
where $C$ is a constant such that $\sum_{n=k}^\infty 1/n^2 < C/k$ for all $k = 1, 2, \ldots$. (It is easily proved that such a $C$ exists – e.g. by dominating the sum by an integral.) Hence
$$\sum_{n=1}^\infty n^{-2}\mathrm{var}(\eta_n) \leq C\sum_{k=1}^\infty \int_{\{k-1<|x|\leq k\}} |x|\, dF(x) = CE|\xi_1| < \infty.$$
It thus follows from Theorem 11.6.2 (since the $\eta_n$ are clearly independent) that $n^{-1}\sum_{i=1}^n (\eta_i - E\eta_i) \to 0$ a.s. But $E\eta_n = E(\xi_n \chi_{\{|\xi_n|\leq n\}}) = E\xi_n - E(\xi_n \chi_{\{|\xi_n|>n\}}) = -E(\xi_n \chi_{\{|\xi_n|>n\}})$ since $E\xi_n = 0$. Hence $|E\eta_n| \leq E(|\xi_n|\chi_{\{|\xi_n|>n\}}) = \int_n^\infty x\, dF(x) \to 0$ as $n \to \infty$ ($E|\xi_n| < \infty$). Thus $n^{-1}\sum_{i=1}^n E\eta_i \to 0$, so that by the above $n^{-1}\sum_{i=1}^n \eta_i \to 0$ a.s., as required to complete the proof. □
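A one-line sanity check of the second form of the strong law, for an arbitrary illustrative i.i.d. choice (exponential r.v.'s with mean $\mu = 2$, which have a finite first moment as the theorem requires):

```python
import random

random.seed(0)

# i.i.d. exponential r.v.'s with mean mu = 2 (finite first moment),
# so the strong law gives S_n / n -> 2 a.s.
mu = 2.0
N = 200000
s = sum(random.expovariate(1 / mu) for _ in range(N))
print(s / N)  # close to 2
```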

Exercises

11.1 Let $\{\xi_n\}_{n=1}^\infty$ be a sequence of r.v.'s with $E\xi_n^2 < \infty$ and let $\mu_n = E\xi_n$, $\sigma_n^2 = \mathrm{var}(\xi_n)$. If $\mu_n \to \mu$ and $\sum_1^\infty \sigma_n^2 < \infty$, show that $\xi_n \to \mu$ a.s.

11.2 Let $\{\xi_n\}_{n=1}^\infty$ be a sequence of random variables on the probability space $(\Omega, \mathcal{F}, P)$ and $\{c_n\}_{n=1}^\infty$ a sequence of positive numbers. Define the truncation of $\xi_n$ at $c_n$ by $\eta_n = \xi_n \chi_{A_n^c}$, where

$$A_n = \{\omega \in \Omega : |\xi_n(\omega)| > c_n\}.$$
Prove that if $\sum_{n=1}^\infty P(A_n) < \infty$ and if $\eta_n \to \xi$ almost surely, then $\xi_n \to \xi$ almost surely.

11.3 Prove that $\xi_n \to \xi$ in probability if and only if
$$\lim_{n\to\infty} E\left[\frac{|\xi_n - \xi|}{1 + |\xi_n - \xi|}\right] = 0.$$

11.4 The result of Ex. 11.3 may be expressed in terms of a "metric" $d$ on the "space" of r.v.'s, provided we regard two r.v.'s which are equal a.s. as being the same in the space. Define $d(\xi, \eta) = E\frac{|\xi - \eta|}{1 + |\xi - \eta|}$ ($d$ is well defined for any $\xi, \eta$). Then $d(\xi, \eta) \geq 0$ with equality only if $\xi = \eta$ a.s., and $d(\xi, \eta) = d(\eta, \xi)$ for all $\xi, \eta$. Show that the "triangle inequality" holds, i.e.

$$d(\xi, \zeta) \leq d(\xi, \eta) + d(\eta, \zeta)$$
for any $\xi, \eta, \zeta$. (Hint: For any $a, b$ it may be shown that $\frac{|a+b|}{1+|a+b|} \leq \frac{|a|}{1+|a|} + \frac{|b|}{1+|b|}$.) Ex. 11.3 may then be restated as "$\xi_n \to \xi$ in probability if and only if $d(\xi_n, \xi) \to 0$, i.e. $\xi_n \to \xi$ in this metric space".

11.5 Show that the statement "If $E\xi_n \to 0$ then $\xi_n \to 0$ in probability" is false, though the statement "If $\xi_n \geq 0$ and $E\xi_n \to 0$ then $\xi_n \to 0$ in probability" is true.

11.6 Let $\{\xi_n\}$ be a sequence of r.v.'s. Show that there exist constants $A_n$ such that $\xi_n/A_n \to 0$ a.s.

11.7 If $\xi_n \to \xi$ a.s. show that given $\epsilon > 0$ there exists $M$ such that $P\{\sup_{n\geq 1} |\xi_n| \leq M\} > 1 - \epsilon$.

11.8 Complement the uniqueness statement in Theorem 11.2.1 by showing explicitly that if $\{\pi_n : n = 1, 2, \ldots\}$, $\pi$, $\pi^*$ are probability measures on $(\mathbb{R}, \mathcal{B})$ such that $\pi_n \xrightarrow{w} \pi$ and $\pi_n \xrightarrow{w} \pi^*$, then $\pi = \pi^*$ on $\mathcal{B}$. (Consider the corresponding d.f.'s.)

11.9 Let $\{F_n\}$ be a sequence of d.f.'s with corresponding probability measures $\{\pi_n\}$. Show directly from the definitions that if $\pi_n \xrightarrow{w} \pi$ then $F_n \xrightarrow{w} F$. (Hint: Show that if $a, x$ are continuity points of $F$ then $\liminf_{n\to\infty} F_n(x) \geq F(x) - F(a)$, and let $a \to -\infty$.)

11.10 Show that in the definition $\pi_n(a, b] \to \pi(a, b]$ for all finite $a, b$ for weak convergence of probability measures $\pi_n \xrightarrow{w} \pi$, intervals $(a, b]$ or open intervals $(a, b)$ may be equivalently used. For example show that if $\pi_n \xrightarrow{w} \pi$ then $\pi_n\{b\} \to \pi\{b\}$ for any $b$ such that $\pi\{b\} = 0$, and that this also holds under the alternative assumptions replacing semiclosed intervals by open or by closed intervals.

11.11 Prove the assertion needed in Corollary 1, Theorem 11.2.1 that if $\pi$ is a probability measure on $\mathcal{B}$ and $g$ is a nonnegative bounded $\mathcal{B}$-measurable function which is continuous a.e. $(\pi)$, then a sequence $\{g_n\}$ of continuous functions may be found with $0 \leq g_n(x) \uparrow g(x)$ at each continuity point $x$ of $g$. This may be shown by defining continuous functions $h_1, h_2, \ldots$ such that $0 \leq h_n(x) \leq g(x)$ and $\sup_n h_n(x) = g(x)$, and writing $g_n(x) = \max_{1\leq i\leq n} h_i(x)$.

(Hint: Consider $h_{m,r}$ defined for each integer $m$ and rational $r$ by $h_{m,r}(x) = \min(r, m\inf\{|x - y| : g(y) \leq r\})$, with $\inf(\emptyset) = +\infty$.)

11.12 Let $\{\xi_n\}_{n=1}^\infty$, $\xi$ be r.v.'s with d.f.'s $\{F_n\}_{n=1}^\infty$, $F$ respectively. Assume that $\xi_n \xrightarrow{P} \xi$. Show that given $\epsilon > 0$,

$$F_n(x) \leq F(x + \epsilon) + P\{|\xi_n - \xi| \geq \epsilon\}$$

$$F(x - \epsilon) \leq F_n(x) + P\{|\xi_n - \xi| \geq \epsilon\}.$$
Hence show that $\xi_n \xrightarrow{d} \xi$ (by this alternative method to that of Theorem 11.3.1).

11.13 Convergence in distribution does not necessarily imply convergence in probability. However, show that if $\xi_n \xrightarrow{d} \xi$ and $\xi(\omega) = a$, constant, almost surely, then $\xi_n \to \xi$ in probability.

11.14 Let $\{\xi_n\}$, $\xi$ be r.v.'s such that $\xi_n \xrightarrow{d} \xi$.

(i) If each $\xi_n$ is discrete, can $\xi$ be absolutely continuous?
(ii) If each $\xi_n$ is absolutely continuous, can $\xi$ be discrete?

11.15 Let $\{\xi_n\}_{n=1}^\infty$ and $\xi$ be random variables on $(\Omega, \mathcal{F}, P)$ such that for each $n$ and $k = 0, 1, \ldots, n$, $P\{\xi_n = k/n\} = 1/(n+1)$, and $\xi$ has the uniform distribution on $[0, 1]$. Prove that $\xi_n \xrightarrow{d} \xi$.

11.16 Let $\{\xi_n\}_{n=1}^\infty$ and $\xi$ be random variables on $(\Omega, \mathcal{F}, P)$ and let $\xi_n = x_n$ (constant) a.s. for all $n = 1, 2, \ldots$. Prove that $\xi_n \xrightarrow{d} \xi$ if and only if the sequence of real numbers $\{x_n\}_{n=1}^\infty$ converges and $\xi = \lim_n x_n$ a.s.

11.17 Let the random variables $\{\xi_n\}_{n=1}^\infty$ and $\xi$ have densities $\{f_n\}_{n=1}^\infty$ and $f$ respectively with respect to Lebesgue measure $m$. If $f_n \to f$ a.e. $(m)$ on the real line $\mathbb{R}$, show that $\xi_n \xrightarrow{d} \xi$. (Hint: Prove that $f_n \to f$ in $L_1(\mathbb{R}, \mathcal{B}, m)$ by looking at the positive and negative parts of $f - f_n$.)

11.18 Let $\{\pi_n\}_{n=1}^\infty$, $\pi$ be probability measures on $\mathcal{B}$. Show that if $\pi_n \xrightarrow{w} \pi$ then $\{\pi_n\}_{n=1}^\infty$ is tight.

11.19 Weak convergence of d.f.'s may also be expressed in terms of a metric. If $F, G$ are d.f.'s, the "Lévy distance" $d(F, G)$ is defined by $d(F, G) = \inf\{\epsilon > 0 : G(x - \epsilon) - \epsilon \leq F(x) \leq G(x + \epsilon) + \epsilon \text{ for all real } x\}$. Show that $d$ is a metric, and that $F_n \xrightarrow{w} F$ if and only if $d(F_n, F) \to 0$.

11.20 Prove Theorem 11.3.3, i.e. that for finite measures $\mu_n, \mu$ on $\mathcal{B}$, $\mu_n \xrightarrow{w} \mu$ if and only if $\mu_n \xrightarrow{v} \mu$ and $\mu_n(\mathbb{R}) \to \mu(\mathbb{R})$ as $n \to \infty$.

11.21 Suppose $\{\xi_u : u \in U\}$, $\{\eta_v : v \in V\}$ are each uniformly integrable families. Show that the family $\{\xi_u + \eta_v : u \in U, v \in V\}$ is uniformly integrable.

11.22 If the random variables $\{\xi_n\}_{n=1}^\infty$ are identically distributed with finite means, show that $\xi_n \to \xi$ in probability if and only if $\xi_n \to \xi$ in $L_1$.

11.23 If the random variables $\{\xi_n\}_{n=1}^\infty$ are such that $\sup_n E(|\xi_n|^p) < \infty$ for some $p > 1$, show that $\{\xi_n\}_{n=1}^\infty$ is uniformly integrable.

As a consequence, show that if the random variables $\{\xi_n\}_{n=1}^\infty$ have uniformly bounded second moments, then $\xi_n \to \xi$ in probability if and only if $\xi_n \to \xi$ in $L_1$.

11.24 Let $\{\xi_n\}$ be r.v.'s with $E|\xi_n| < \infty$ for each $n$. Show that the family $\{\xi_n : n = 1, 2, \ldots\}$ is uniformly integrable if and only if the family $\{\xi_n : n \geq N\}$ is uniformly integrable for some integer $N$. Indeed this holds if given $\epsilon > 0$ there exist $N = N(\epsilon)$, $A = A(\epsilon)$ such that $\int_{\{|\xi_n|\geq a\}} |\xi_n|\, dP < \epsilon$ for all $n \geq N(\epsilon)$, $a \geq A(\epsilon)$. Show that a corresponding statement holds for uniform absolute continuity of the families $\{\int_E |\xi_n|\, dP : n \geq 1\}$ and $\{\int_E |\xi_n|\, dP : n \geq N\}$.

11.25 Let $\{\xi_n\}_{n=1}^\infty$ be a sequence of independent random variables such that $\xi_n = \pm 1$ each with probability $1/2$, and let $\{a_n\}_{n=1}^\infty$ be a sequence of real numbers.
(i) Find a necessary and sufficient condition for the series $\sum_{n=1}^\infty a_n\xi_n$ to converge a.s.
(ii) If $a_n = 2^{-n}$ prove that $\sum_{n=1}^\infty a_n\xi_n$ has the uniform distribution over $[-1, 1]$.

11.26 Let $\{\xi_n\}_{n=1}^\infty$ be a sequence of independent random variables such that for every $n$, $\xi_n$ has the uniform distribution on $[-n^{1/3}, n^{1/3}]$. Find the probability of convergence of the series $\sum_{n=1}^\infty \xi_n$ and of the sequence $(1/n)\sum_{k=1}^n \xi_k$ as $n \to \infty$.

11.27 The random series $\sum_{n=1}^\infty \pm 1/n$ is formed where the signs are chosen independently and the probability of a positive sign for the $n$th term is $p_n$. Express the probability of convergence of the series in terms of the sequence $\{p_n\}_{n=1}^\infty$.

11.28 Let $\{\xi_n\}_{n=1}^\infty$ be a sequence of independent r.v.'s such that each $\xi_n$ has the uniform distribution on $[a_n, 2a_n]$, $a_n > 0$. Show that the series $\sum_{n=1}^\infty \xi_n$ converges a.s. if and only if $\sum_{n=1}^\infty a_n < \infty$. What happens if $\sum_{n=1}^\infty a_n = +\infty$?

11.29 Let $\{\xi_n\}_{n=1}^\infty$ be a sequence of nonnegative random variables such that for each $n$, $\xi_n$ has the density $\lambda_n e^{-\lambda_n x}$ for $x \geq 0$, where $\lambda_n > 0$.
(i) If $\sum_{n=1}^\infty 1/\lambda_n < \infty$ show that $\sum_{n=1}^\infty \xi_n < \infty$ almost surely.
(ii) If the random variables $\{\xi_n\}_{n=1}^\infty$ are independent show that
$$\sum_{n=1}^\infty 1/\lambda_n < \infty \text{ if and only if } \sum_{n=1}^\infty \xi_n < \infty \text{ a.s.}$$
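The claim of Ex. 11.25(ii) above lends itself to a quick simulation (an informal check, not a proof): samples of $\sum_{n\geq 1} \pm 2^{-n}$, truncated at 40 terms, should have empirical d.f. close to that of the uniform distribution on $[-1, 1]$, namely $(x+1)/2$.

```python
import bisect
import random

random.seed(1)

def signed_dyadic(n_terms=40):
    """One (truncated) sample of sum_{n>=1} ±2^-n with independent fair signs."""
    return sum(random.choice((-1, 1)) * 2.0 ** -n for n in range(1, n_terms + 1))

samples = sorted(signed_dyadic() for _ in range(20000))

def ecdf(x):
    # fraction of samples <= x
    return bisect.bisect_right(samples, x) / len(samples)

# The uniform distribution on [-1, 1] has d.f. (x + 1)/2.
for x in (-0.5, 0.0, 0.5):
    print(round(ecdf(x), 2), (x + 1) / 2)
```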
and
$$\sum_{n=1}^\infty 1/\lambda_n = \infty \text{ if and only if } \sum_{n=1}^\infty \xi_n = \infty \text{ a.s.}$$

11.30 Let $\{\xi_n\}$, $\{\xi_n^*\}$ be two sequences of r.v.'s such that, for each $n$, the joint distribution of $(\xi_1, \ldots, \xi_n)$ is the same as that of $(\xi_1^*, \ldots, \xi_n^*)$. Show that $P\{\sum_1^\infty \xi_n \text{ converges}\} = P\{\sum_1^\infty \xi_n^* \text{ converges}\}$. (Hint: If $D$, $D^*$ denote respectively the sets where $\sum \xi_n$, $\sum \xi_n^*$ do not converge, use e.g. the expression for $P(D)$ in

the proof of Lemma 11.5.2, and the corresponding expression for $P(D^*)$, to show that $P(D) = P(D^*)$. In particular this result applies if $\{\xi_n\}$, $\{\xi_n^*\}$ are each classes of independent r.v.'s and $\xi_n^*$ has the same distribution as $\xi_n$ for each $n$ – this is the case used in Theorem 11.5.4.)

11.31 For any sequence of random variables $\{\xi_n\}_{n=1}^\infty$ prove that
(i) if $\xi_n \to 0$ a.s. then $(1/n)\sum_{k=1}^n \xi_k \to 0$ a.s.
(ii) if $\xi_n \to 0$ in $L_p$, $p > 1$, then $(1/n)\sum_{k=1}^n \xi_k \to 0$ in $L_p$ and hence also in probability.

11.32 Let $\{\xi_n\}_{n=1}^\infty$ be a sequence of independent and identically distributed r.v.'s with $E\xi_n = \mu \neq 0$ and $E\xi_n^2 = a^2 < \infty$. Find the a.s. limit of the sequence
$$\frac{\xi_1^2 + \cdots + \xi_n^2}{\xi_1 + \cdots + \xi_n}\cdot\frac{1}{?}$$

11.33 Let $\{\xi_n\}_{n=1}^\infty$ be a sequence of independent and identically distributed random variables and $S_n = \sum_1^n \xi_i$. If $E(|\xi_1|) = +\infty$ prove that

$$\limsup_{n\to\infty} |S_n|/n = +\infty \quad \text{a.s.}$$
It then follows from the strong law of large numbers that $(1/n)\sum_{k=1}^n \xi_k$ converges a.s. if and only if $E(|\xi_1|) < +\infty$. (Hint: Use Ex. 9.15 to conclude that for every $a > 0$ the events $\{\omega \in \Omega : |\xi_n(\omega)| \geq an\}$ occur infinitely often with probability one.)

12

Characteristic functions and central limit theorems

12.1 Definition and simple properties

This chapter is concerned with one of the most useful tools in probability theory – the characteristic function of a r.v. (not to be confused with the characteristic function (i.e. indicator) of a set). We shall investigate properties of such functions, and some of their many implications, especially concerning independent r.v.'s and central limit theory. Chapter 8 should be reviewed for the needed properties of integrals of complex-valued functions and basic Fourier theory.

If $\xi$ is a r.v. on a probability space $(\Omega, \mathcal{F}, P)$, $e^{it\xi(\omega)}$ is a complex $\mathcal{F}$-measurable function (Chapter 8) (and therefore will be called a complex r.v.). The integration theory of Section 8.1 applies and $E\xi$ will be used for $\int \xi\, dP$ as for real r.v.'s. Since $|e^{it\xi}| = 1$ it follows that $e^{it\xi} \in L_1(\Omega, \mathcal{F}, P)$. The function
$$\phi(t) = \int e^{it\xi(\omega)}\, dP(\omega) \ (= Ee^{it\xi})$$
of the real variable $t$ is termed the characteristic function (c.f.) of the r.v. $\xi$. By definition, if $\xi$ has d.f. $F$,
$$\phi(t) = E\cos t\xi + iE\sin t\xi = \int_{-\infty}^\infty \cos tx\, dF(x) + i\int_{-\infty}^\infty \sin tx\, dF(x) = \int_{-\infty}^\infty e^{itx}\, dF(x).$$
Thus $\phi(t)$ is simply the Fourier–Stieltjes transform $F^*(t)$ of the d.f. $F$ of $\xi$ (cf. Section 8.2). If $F$ is absolutely continuous, with density $f$, it is immediate that
$$\phi(t) = \int_{-\infty}^\infty e^{itx}f(x)\, dx,$$
showing that $\phi$ is the $L_1$ Fourier transform $f^\dagger(t)$ of the p.d.f. $f$. If $F$ is discrete, with mass $p_j$ at $x_j$, $j = 1, 2, \ldots$, then
$$\phi(t) = \sum_{j=1}^\infty p_j e^{itx_j}.$$
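Since $\phi(t) = Ee^{it\xi}$, the c.f. can be approximated by an empirical average $(1/N)\sum_k e^{it\xi_k}$ over samples. A minimal sketch for an arbitrary illustrative choice, $\xi = \pm 1$ with probability $1/2$ each, whose exact c.f. is $\cos t$:

```python
import cmath
import math
import random

random.seed(7)

# Empirical characteristic function (1/N) sum_k e^{i t xi_k} for xi = ±1
# with probability 1/2 each; the exact c.f. is E e^{it xi} = cos t.
N = 100000
xs = [random.choice((-1, 1)) for _ in range(N)]

def ecf(t):
    return sum(cmath.exp(1j * t * x) for x in xs) / N

for t in (0.0, 0.7, 2.0):
    print(abs(ecf(t) - math.cos(t)))  # small sampling error
```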


Some simple properties of a c.f. are summarized in the following theorem.

Theorem 12.1.1 A c.f. $\phi$ has the following properties:
(i) $\phi(0) = 1$,
(ii) $|\phi(t)| \leq 1$ for all $t \in \mathbb{R}$,
(iii) $\phi(-t) = \overline{\phi(t)}$ for all $t \in \mathbb{R}$, where the bar denotes the complex conjugate,
(iv) $\phi$ is uniformly continuous on $\mathbb{R}$ (cf. Theorem 8.2.1).

Proof (i) $\phi(0) = E1 = 1$.
(ii) $|\phi(t)| = |Ee^{it\xi}| \leq E|e^{it\xi}| = E1 = 1$, using Theorem 8.1.1 (iii).
(iii) $\phi(-t) = Ee^{-it\xi} = \overline{Ee^{it\xi}} = \overline{\phi(t)}$.
(iv) Let $t, s \in \mathbb{R}$, $t - s = h$. Then
$$|\phi(t) - \phi(s)| = |E(e^{i(s+h)\xi} - e^{is\xi})| = |Ee^{is\xi}(e^{ih\xi} - 1)| \leq E|e^{ih\xi} - 1| \quad (|e^{is\xi(\omega)}| = 1).$$

Now for all $\omega$ such that $\xi(\omega)$ is finite, $\lim_{h\to 0} |e^{ih\xi(\omega)} - 1| = 0$ and $|e^{ih\xi(\omega)} - 1| \leq |e^{ih\xi(\omega)}| + 1 = 2$ (which is $P$-integrable). Thus by dominated convergence, $E|e^{ih\xi} - 1| \to 0$ as $h \to 0$. Finally this means that given $\epsilon > 0$ there exists $\delta > 0$ such that $E|e^{ih\xi} - 1| < \epsilon$ if $|h| < \delta$. Thus $|\phi(t) - \phi(s)| < \epsilon$ for all $t, s$ such that $|t - s| < \delta$, which shows uniform continuity of $\phi(t)$ on $\mathbb{R}$. □

The following result is simple but stated here for completeness.

Theorem 12.1.2 If a r.v. $\xi$ has c.f. $\phi(t)$, and if $a, b$ are real, then the r.v. $\eta = a\xi + b$ has c.f. $e^{ibt}\phi(at)$. In particular the c.f. of $-\xi$ is $\phi(-t) = \overline{\phi(t)}$.

Proof $Ee^{it(a\xi+b)} = e^{itb}Ee^{ita\xi} = e^{ibt}\phi(at)$. □

In Theorem 12.1.1 it was shown that $\phi(0) = 1$ and $|\phi(t)| \leq 1$ for all $t$ if $\phi$ is a c.f. We shall see now that if $|\phi(t)| = 1$ for any nonzero $t$ then $\xi$ must be a discrete r.v. of a special kind. We shall say that a r.v. $\xi$ is of lattice type if there are real numbers $a, b$ ($b > 0$) such that $\xi(\omega)$ belongs to the set $\{a + nb : n = 0, \pm 1, \pm 2, \ldots\}$ with probability one. The d.f. $F$ of such a r.v. thus has jumps at some or all of these points and is constant between them. The corresponding c.f. is, writing $p_n = P\{\xi = a + nb\}$,
$$\phi(t) = \sum_{n=-\infty}^\infty p_n e^{i(a+nb)t} = e^{iat}\sum_{n=-\infty}^\infty p_n e^{inbt}.$$
Hence $|\phi(t)| = |\sum_{n=-\infty}^\infty p_n e^{inbt}|$ is periodic with period $2\pi/b$.
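The periodicity of $|\phi|$ for a lattice r.v. is easy to observe numerically. A small sketch with an arbitrary illustrative lattice distribution on $\{0, 1, 2\}$ (so $a = 0$, $b = 1$ and the claimed period is $2\pi$):

```python
import cmath
import math

# |phi(t)| for a lattice r.v. on {0, 1, 2} (a = 0, b = 1) should have
# period 2*pi/b = 2*pi; the probabilities below are an arbitrary choice.
p = {0: 0.2, 1: 0.5, 2: 0.3}

def phi(t):
    return sum(pk * cmath.exp(1j * t * k) for k, pk in p.items())

for t in (0.3, 1.1, 2.5):
    print(abs(abs(phi(t)) - abs(phi(t + 2 * math.pi))))  # ~ 0
```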

Theorem 12.1.3 Let $\phi(t)$ be the c.f. of a r.v. $\xi$. Then one of the following three cases must hold:

(i) $|\phi(t)| < 1$ for all $t \neq 0$,

(ii) $|\phi(t_0)| = 1$ for some $t_0 > 0$ and $|\phi(t)| < 1$ for $0 < t < t_0$,
(iii) $\phi(t) = e^{iat}$ for all $t$, some real $a$ (and hence $|\phi(t)| = 1$ for all $t$).

In case (ii), $\xi$ is of lattice type, belonging to the set $\{a + 2n\pi/t_0 : n = 0, \pm 1, \ldots\}$ a.s., for some real $a$. The absolute value of its c.f. is then periodic with period $t_0$. In case (iii), $\xi = a$ a.s. Finally, if $\xi$ has an absolutely continuous distribution, then (i) holds. This is also the case if $\xi$ is discrete but not constant or of lattice type.

Proof Since $|\phi(t)| \leq 1$ it follows that either (i) holds or that $|\phi(t_0)| = 1$ for some $t_0 \neq 0$. Suppose the latter is the case. Then $\phi(t_0) = e^{iat_0}$ for some real $a$. Consider the r.v. $\eta = \xi - a$. The c.f. of $\eta$ is $\psi(t) = e^{-iat}\phi(t)$ and $\psi(t_0) = 1$. Hence
$$1 = Ee^{it_0\eta} = \int \cos(t_0\eta(\omega))\, dP(\omega)$$
since the imaginary part must vanish (to give the real value 1). Hence
$$\int [1 - \cos(t_0\eta(\omega))]\, dP(\omega) = 0.$$
The integrand is nonnegative and thus must vanish a.s. by Theorem 4.4.7. Hence $\cos(t_0\eta(\omega)) = 1$ a.s., showing that

$$t_0\eta(\omega) \in \{2n\pi : n = 0, \pm 1, \ldots\} \quad \text{a.s.}$$
and thus

$$\xi(\omega) \in \{a + 2n\pi/t_0 : n = 0, \pm 1, \ldots\} \quad \text{a.s.}$$
Hence $\xi$ is a lattice r.v.

Now since we assume that (i) does not hold, either (ii) holds or else every neighborhood of $t = 0$ contains such a $t_0$ with $|\phi(t_0)| = 1$. In this case a sequence $t_k \to 0$ may be found such that $\xi(\omega) \in \{a_k + 2n\pi/t_k : n = 0, \pm 1, \ldots\}$ a.s. (for some real $a_k$), i.e. for each $k$, $\xi$ belongs to a lattice whose points are $2\pi/t_k$ apart. At least one of the values $a_1 + 2n\pi/t_1$ has positive probability, and if (ii) does not hold, there cannot be more than one. For if there were two, distance $d$ apart, we could choose $k$ so that $2\pi/t_k > d$, and obtain a contradiction since the values of $\xi$ must also lie in a lattice whose points are $2\pi/t_k$ apart. Thus if (ii) does not hold we have $\xi = a$ a.s., where $a$ is that one value of $a_1 + 2n\pi/t_1$ which has nonzero probability, and thus has probability 1. Hence (iii) holds and $|\phi(t)| = |e^{iat}| = 1$ for all $t$; indeed $\phi(t) = e^{iat}$. Note that if (ii) or (iii) holds, $\xi$ is discrete. Hence $|\phi(t)| < 1$ for all $t \neq 0$ if $\xi$ is absolutely continuous. □

One of the most convenient properties of characteristic functions is the simple means of calculating the c.f. of a sum of independent r.v.'s, as contained in the following result.

Theorem 12.1.4 Let $\xi_1, \xi_2, \ldots, \xi_n$ be independent r.v.'s with c.f.'s $\phi_1, \phi_2, \ldots, \phi_n$ respectively. Then the c.f. $\phi$ of $\eta = \xi_1 + \xi_2 + \cdots + \xi_n$ is simply the product $\phi(t) = \phi_1(t)\phi_2(t)\cdots\phi_n(t)$.

Proof This follows by the analog of Theorem 10.3.5. For the complex r.v.'s $e^{it\xi_j}$, $1 \leq j \leq n$, are obviously independent, showing that $E\prod_{j=1}^n e^{it\xi_j} = \prod_{j=1}^n Ee^{it\xi_j}$. This may also be shown directly from that result by writing $e^{it\xi_j} = \cos t\xi_j + i\sin t\xi_j$ and using independence of $(\cos t\xi_j, \sin t\xi_j)$ and $(\cos t\xi_k, \sin t\xi_k)$ for $j \neq k$. □
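A tiny concrete check of the product rule, with an arbitrary illustrative choice: two independent $\pm 1$ coins each have c.f. $\cos t$, and their sum takes values $-2, 0, 2$ with probabilities $1/4, 1/2, 1/4$, so its c.f. should equal $(\cos t)^2$.

```python
import cmath
import math

# c.f. of the sum of two independent ±1 coins, computed directly from its
# distribution on {-2, 0, 2}; Theorem 12.1.4 says this equals (cos t)^2.
def cf_sum(t):
    return 0.25 * cmath.exp(-2j * t) + 0.5 + 0.25 * cmath.exp(2j * t)

for t in (0.4, 1.0, 2.3):
    print(abs(cf_sum(t) - math.cos(t) ** 2))  # ~ 0
```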

We conclude this section with a few examples of c.f.’s.

(i) Degenerate distribution. If $\xi = a$ (constant) a.s. then the c.f. of $\xi$ is $\phi(t) = e^{ita}$.

(ii) Binomial distribution.
$$P(\xi = r) = \binom{n}{r} p^r(1-p)^{n-r}, \quad r = 0, 1, \ldots, n,\ 0 < p < 1,$$
$$\phi(t) = \sum_{r=0}^n \binom{n}{r} p^r(1-p)^{n-r}e^{itr} = \sum_{r=0}^n \binom{n}{r} (pe^{it})^r(1-p)^{n-r} = (1 - p + pe^{it})^n = (q + pe^{it})^n,$$
where $q = 1 - p$.

(iii) Uniform distribution on $[-a, a]$. $\xi$ has p.d.f. $\frac{1}{2a}$, $-a \leq x \leq a$, and
$$\phi(t) = \frac{1}{2a}\int_{-a}^a e^{itx}\, dx = \frac{e^{ita} - e^{-ita}}{2ita} = \frac{\sin at}{at} \quad (\phi(0) = 1).$$

(iv) Normal distribution $N(\mu, \sigma^2)$. $\xi$ has p.d.f. $\frac{1}{\sigma(2\pi)^{1/2}}\exp\left(\frac{-(x-\mu)^2}{2\sigma^2}\right)$ and
$$\phi(t) = \frac{1}{\sigma(2\pi)^{1/2}}\int_{-\infty}^\infty e^{itx}\exp\left(\frac{-(x-\mu)^2}{2\sigma^2}\right)dx.$$
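The closed form in example (iii) can be verified by direct numerical integration of $\frac{1}{2a}\int_{-a}^a \cos tx\, dx$ (the imaginary part vanishes by symmetry); the values of $a$ and $t$ below are arbitrary:

```python
import math

# Check phi(t) = sin(at)/(at) for the uniform distribution on [-a, a]
# by a midpoint-rule integration of (1/2a) * integral of cos(tx) dx
# (the sin part cancels by symmetry).
a, t = 2.0, 1.3
n = 100000
h = 2 * a / n
num = sum(math.cos(t * (-a + (k + 0.5) * h)) for k in range(n)) * h / (2 * a)
print(num, math.sin(a * t) / (a * t))  # the two agree closely
```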

This is perhaps most easily evaluated, first for $\mu = 0$, $\sigma = 1$, as a contour integral, making the substitution $z = x - it$ to give
$$(2\pi)^{-1/2}e^{-t^2/2}\int_C e^{-z^2/2}\, dz$$
where $C$ is the line $\mathcal{I}(z) = -t$ ($\mathcal{I}$ denoting "imaginary part"). This may be evaluated along the real axis instead of $C$ (by Cauchy's Theorem) to give $e^{-t^2/2}$. If $\xi$ is $N(\mu, \sigma^2)$, $\eta = (\xi - \mu)/\sigma$ is $N(0, 1)$ and thus has this c.f. $e^{-t^2/2}$. By Theorem 12.1.2, $\xi$ thus has c.f. $\phi(t) = e^{i\mu t - \sigma^2t^2/2}$.

12.2 Characteristic function and moments

The c.f. of a r.v. $\xi$ is very useful in determining the moments of $\xi$ (when they exist), and the d.f. or p.d.f. of $\xi$. It is especially convenient to use the c.f. for either of these purposes when $\xi$ is a sum of independent r.v.'s, $\sum_1^n \xi_i$ say, for then the c.f. of $\xi$ is simply obtained as the product of those of the $\xi_i$'s. Both uses of the c.f. and related matters are explored here, first considering the relation between existence of moments of $\xi$ and of derivatives of $\phi$.

Theorem 12.2.1 Let $\xi$ be a r.v. with d.f. $F$ and c.f. $\phi$. If $E|\xi|^n < \infty$ for some integer $n \geq 1$, then $\phi$ has a (uniformly) continuous derivative of order $n$ given by
$$\phi^{(n)}(t) = i^nE(\xi^n e^{it\xi}) = i^n\int_{-\infty}^\infty x^n e^{itx}\, dF(x),$$
and, in particular, $E\xi^n = \phi^{(n)}(0)/i^n$.

Proof For any $t$, $(\phi(t+h) - \phi(t))/h = \int e^{itx}(e^{ihx} - 1)/h\, dF(x)$. Since the function $(e^{ihx} - 1)/h \to ix$ as $h \to 0$ and $|(e^{ihx} - 1)/h| = |\int_0^x e^{ihy}\, dy| \leq |x|$, dominated convergence shows that $\lim_{h\to 0}(\phi(t+h) - \phi(t))/h = \int_{-\infty}^\infty ix\, e^{itx}\, dF(x)$, i.e. the derivative $\phi'(t)$ exists, given by $\phi'(t) = \int_{-\infty}^\infty ixe^{itx}\, dF(x)$. The proof may be completed by induction using the same arguments. Uniform continuity follows as for $\phi$ itself. □
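The relation $E\xi = \phi'(0)/i$ can be checked numerically with a finite-difference derivative. As an arbitrary illustrative case, take the exponential distribution with rate $\lambda$, whose c.f. is $\lambda/(\lambda - it)$ and whose mean is $1/\lambda$:

```python
# For the exponential distribution with rate lam, the c.f. is
# phi(t) = lam / (lam - i t), and Theorem 12.2.1 gives E(xi) = phi'(0)/i.
lam = 2.0

def phi(t):
    return lam / (lam - 1j * t)

h = 1e-6
d1 = (phi(h) - phi(-h)) / (2 * h)   # central-difference estimate of phi'(0)
mean = (d1 / 1j).real
print(mean)  # ~ 1/lam = 0.5
```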

Corollary If for some integer $n \geq 1$, $E|\xi|^n < \infty$, then, writing $m_k = E\xi^k$,

$$\phi(t) = \sum_{k=0}^n \frac{(it)^k}{k!}m_k + o(t^n) = \sum_{k=0}^{n-1} \frac{(it)^k}{k!}m_k + \frac{\theta t^n}{n!}E|\xi|^n$$
where $\theta = \theta_t$ is a complex number with $|\theta_t| \leq 1$. (The "$o(t^n)$" term above is to be taken as $t \to 0$, i.e. $o(t^n)$ is a function $\psi(t)$ such that $\psi(t)/t^n \to 0$ as $t \to 0$.)

Proof The first relation follows at once from the Taylor series expansion
$$\phi(t) = \sum_{k=0}^n \frac{t^k}{k!}\phi^{(k)}(0) + o(t^n).$$
The second follows from the alternative Taylor expansion
$$\phi(t) = \sum_{k=0}^{n-1} \frac{t^k}{k!}\phi^{(k)}(0) + \frac{t^n}{n!}\phi^{(n)}(\alpha t) \quad (|\alpha| < 1),$$
defining $\theta$ by
$$\theta E|\xi|^n = \phi^{(n)}(\alpha t) = i^n\int_{-\infty}^\infty x^n e^{it\alpha x}\, dF(x),$$
from which it follows that
$$|\theta|E|\xi|^n \leq \int_{-\infty}^\infty |x|^n\, dF(x) = E|\xi|^n.$$
Thus $|\theta| \leq 1$ if $E|\xi|^n > 0$, and in the degenerate case where $E|\xi|^n = 0$, i.e. $\xi = 0$ a.s., we may clearly take $\theta = 0$. □

The converse to Theorem 12.2.1 holds for derivatives and moments of even order, as shown in the following result (see also Exs. 12.12, 12.13, 12.14).

Theorem 12.2.2 Suppose that, for some integer $n \geq 1$, the c.f. $\phi(t)$ of the r.v. $\xi$ has $2n$ finite derivatives at $t = 0$. Then $E|\xi|^{2n} < \infty$.

Proof Consider first the second derivative (i.e. $n = 1$). Since $\phi''$ exists at $t = 0$ we have
$$\phi(t) = \phi(0) + t\phi'(0) + \tfrac{1}{2}t^2\phi''(0) + o(t^2)$$
$$\phi(-t) = \phi(0) - t\phi'(0) + \tfrac{1}{2}t^2\phi''(0) + o(t^2)$$
and thus by addition of these two equations,
$$\phi''(0) = \lim_{t\to 0}\frac{\phi(t) - 2\phi(0) + \phi(-t)}{t^2} = \lim_{t\to 0}\int_{-\infty}^\infty \frac{e^{itx} - 2 + e^{-itx}}{t^2}\, dF(x) = -2\lim_{t\to 0}\int_{-\infty}^\infty \frac{1 - \cos tx}{t^2}\, dF(x)$$
($F$ being the d.f. of $\xi$). But $(1 - \cos tx)/t^2 \to x^2/2$ as $t \to 0$ and hence by Fatou's Lemma
$$-\phi''(0) = 2\lim_{t\to 0}\int_{-\infty}^\infty \frac{1 - \cos tx}{t^2}\, dF(x) \geq \int_{-\infty}^\infty x^2\, dF(x).$$
Since $-\phi''(0)$ is (real and) finite it follows that $\int x^2\, dF(x) < \infty$, i.e. $E\xi^2 < \infty$.

The case for $n > 1$ may be obtained inductively from the $n = 1$ case as follows. Suppose the result is true for $(n - 1)$ and that $\phi^{(2n)}(0)$ exists. Then $E\xi^{2n-2}$ exists by the inductive hypothesis and by Theorem 12.2.1
$$\phi^{(2n-2)}(0) = (-1)^{n-1}\int_{-\infty}^\infty x^{2n-2}\, dF(x).$$
If $\int_{-\infty}^\infty x^{2n-2}\, dF(x) = 0$, then $F$ is the d.f. of the degenerate distribution with all its mass at zero, i.e. $\xi = 0$ a.s., so that the desired conclusion $E\xi^{2n} < \infty$ follows trivially. Otherwise write
$$G(x) = \int_{-\infty}^x u^{2n-2}\, dF(u)\Big/\int_{-\infty}^\infty u^{2n-2}\, dF(u).$$
$G$ is clearly a d.f. and (writing $\lambda^{-1} = \int_{-\infty}^\infty u^{2n-2}\, dF(u)$) has c.f.
$$\psi(t) = \int e^{itx}\, dG(x) = \lambda\int_{-\infty}^\infty x^{2n-2}e^{itx}\, dF(x) = \lambda(-1)^{n-1}\phi^{(2n-2)}(t)$$
($\lambda x^{2n-2}$ being the Radon–Nikodym derivative $d\mu_G/d\mu_F$). Since $\phi^{(2n)}(0)$ exists so does $\psi''(0)$, and by the first part of this proof (with $n = 1$ and $\psi$ for $\phi$)
$$-\psi''(0) \geq \int_{-\infty}^\infty x^2\, dG(x) = \lambda\int_{-\infty}^\infty x^{2n}\, dF(x)$$
(Theorem 5.6.1). Thus $\int x^{2n}\, dF(x)$ is finite as required. □

The corollary to Theorem 12.2.1 provides Taylor expansions of the c.f. $\phi(t)$ when $n$ moments exist. The following is an interesting variant of such expansions when an even number of moments exists, which sheds light on the nature of the remainder term. It is given here for two moments (which will be useful in the central limit theory to be considered in Section 12.6). The extension to $2n$ moments is evident.

Lemma 12.2.3 Let $\xi$ be a r.v. with zero mean, finite variance $\sigma^2$, d.f. $F$, and c.f. $\phi$. Then $\phi$ can be written as
$$\phi(t) = 1 - \tfrac{1}{2}\sigma^2t^2\psi(t)$$
where $\psi$ is a characteristic function. Specifically $\psi$ corresponds to the p.d.f.
$$g(x) = \frac{2}{\sigma^2}\int_x^\infty [1 - F(u)]\, du, \quad x \geq 0,$$
$$g(x) = \frac{2}{\sigma^2}\int_{-\infty}^x F(u)\, du, \quad x < 0.$$

Proof Clearly $g(x) \geq 0$. Further, using Fubini's Theorem,
$$\int_0^\infty g(x)\, dx = \frac{2}{\sigma^2}\int_0^\infty dx\int_x^\infty du\int_{(u,\infty)} dF(y) = \frac{2}{\sigma^2}\int_{(0,\infty)} dF(y)\int_0^y du\int_0^u dx = \frac{1}{\sigma^2}\int_{(0,\infty)} y^2\, dF(y).$$
Similarly
$$\int_{-\infty}^0 g(x)\, dx = \frac{1}{\sigma^2}\int_{(-\infty,0]} y^2\, dF(y)$$
and hence $\int_{-\infty}^\infty g(x)\, dx = 1$. Thus $g$ is a p.d.f. Now by the same inversion of integration order as above,
$$\int_0^\infty g(x)e^{itx}\, dx = \frac{2}{\sigma^2}\int_{(0,\infty)} dF(y)\int_0^y du\int_0^u e^{itx}\, dx = \frac{2}{it\sigma^2}\int_{(0,\infty)} dF(y)\int_0^y (e^{itu} - 1)\, du = \frac{2}{(it)^2\sigma^2}\int_{(0,\infty)} (e^{ity} - 1 - ity)\, dF(y).$$
Similarly
$$\int_{-\infty}^0 g(x)e^{itx}\, dx = \frac{2}{(it)^2\sigma^2}\int_{(-\infty,0]} (e^{ity} - 1 - ity)\, dF(y)$$
and hence the c.f. corresponding to $g$ is
$$\psi(t) = \int_{-\infty}^\infty e^{itx}g(x)\, dx = \frac{2}{\sigma^2t^2}(1 - \phi(t))$$
since $\int_{-\infty}^\infty y\, dF(y) = E\xi = 0$. Thus $\phi(t) = 1 - \tfrac{1}{2}\sigma^2t^2\psi(t)$, as required. □

Note that the conclusion of this lemma may be written as $\phi(t) = 1 - \tfrac{1}{2}\sigma^2t^2 + \tfrac{1}{2}t^2\sigma^2(1 - \psi(t))$. The final term is $o(t^2)$ as $t \to 0$ since $\psi(t) \to 1$, so that the standard representation $\phi(t) = 1 - \tfrac{1}{2}\sigma^2t^2 + o(t^2)$ for a c.f. (with zero mean and finite second moment) also follows from this. However, the present result gives a more specific form for the $o(t^2)$ term since $\psi$ is known to be a c.f.
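Lemma 12.2.3 can be tested concretely. For the (arbitrary illustrative) uniform distribution on $[-1, 1]$ one has $\sigma^2 = 1/3$, $\phi(t) = \sin t/t$, and the stated $g$ works out to $g(x) = \tfrac{3}{2}(1 - |x|)^2$ on $[-1, 1]$; the c.f. of $g$ should then equal $\frac{2}{\sigma^2 t^2}(1 - \phi(t))$.

```python
import math

# Check Lemma 12.2.3 for the uniform distribution on [-1, 1]:
# sigma^2 = 1/3, phi(t) = sin(t)/t, and g(x) = 1.5 * (1 - |x|)^2 on [-1, 1].
sigma2 = 1.0 / 3.0
t = 1.7
phi = math.sin(t) / t
psi = 2 * (1 - phi) / (sigma2 * t * t)

# Numerically integrate cos(tx) * g(x) over [-1, 1] (the imaginary part
# cancels by symmetry of g).
n = 200000
h = 2.0 / n
num = sum(math.cos(t * (-1 + (k + 0.5) * h)) * 1.5 * (1 - abs(-1 + (k + 0.5) * h)) ** 2
          for k in range(n)) * h
print(psi, num)  # the two agree closely
```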

12.3 Inversion and uniqueness

The c.f. completely characterizes the distribution by specifying the d.f. $F$ precisely. In fact since $\phi$ is the Fourier–Stieltjes transform of $F$, this may be shown from the inversion formulae of Sections 8.3 and 8.4, which are summarized as follows.

Theorem 12.3.1 Let $\phi$ be the c.f. of a r.v. $\xi$ with d.f. $F$. Then

(i) If $\tilde{F}(x) = \tfrac{1}{2}(F(x) + F(x - 0))$, then for any $a < b$,
$$\tilde{F}(b) - \tilde{F}(a) = \lim_{T\to\infty}\frac{1}{2\pi}\int_{-T}^T \frac{e^{-ibt} - e^{-iat}}{-it}\phi(t)\, dt,$$
and for any real $a$ the jump of $F$ at $a$ is
$$F(a) - F(a - 0) = \lim_{T\to\infty}\frac{1}{2T}\int_{-T}^T e^{-iat}\phi(t)\, dt.$$

(ii) If $\phi \in L_1$, then $F$ is absolutely continuous with p.d.f.
$$f(x) = \frac{1}{2\pi}\int_{-\infty}^\infty e^{-ixt}\phi(t)\, dt \quad \text{a.e.}$$
$f$ is continuous and thus is also the (continuous) derivative of $F$ at each $x$.
(iii) If $F$ is absolutely continuous with p.d.f. $f$ which is of bounded variation in a neighborhood of some given point $x$, then
$$\tfrac{1}{2}\{f(x+0) + f(x-0)\} = \lim_{T\to\infty}\frac{1}{2\pi}\int_{-T}^T e^{-ixt}\phi(t)\, dt.$$
If $\phi \in L_1$ this may again be written as $\frac{1}{2\pi}\int_{-\infty}^\infty e^{-ixt}\phi(t)\, dt$.

Proof (i) follows from Theorem 8.3.1.
(ii) It follows from Theorem 8.3.3 that $F(x) = \int_{-\infty}^x f(u)\, du$ where $f$, defined as $\frac{1}{2\pi}\int e^{-ixt}\phi(t)\, dt$, is real, continuous, and in $L_1$. We need to show that $f$ is nonnegative, whence it will follow that $f$ is a p.d.f. for $F$. But if $f$ were negative for some $x$ it would, by continuity, be negative in a neighborhood of that $x$, and hence $F$ would be decreasing in that interval. Thus $f(x) \geq 0$ for all $x$. Finally since $f$ is continuous it follows at once that $F'(x) = \frac{d}{dx}\int_{-\infty}^x f(u)\, du = f(x)$ for each $x$.
(iii) just restates Theorem 8.4.2 and its corollary. □

Theorem 12.3.1 shows that there is a one-to-one correspondence between d.f.'s and their c.f.'s, and this is now stated separately.

Theorem 12.3.2 (Uniqueness Theorem) The c.f. of a r.v. uniquely determines its d.f., and hence its distribution, and vice versa, i.e. two d.f.'s $F_1, F_2$ are identical if and only if their c.f.'s $\phi_1, \phi_2$ are identical.

Proof It is clear that $F_1 \equiv F_2$ implies $\phi_1 \equiv \phi_2$. For the converse assume that $\phi_1 \equiv \phi_2$. Then by Theorem 12.3.1 (i), $\tilde{F}_1(b) - \tilde{F}_1(a) = \tilde{F}_2(b) - \tilde{F}_2(a)$ for all $a, b$ and hence, letting $a \to -\infty$, $\tilde{F}_1(b) = \tilde{F}_2(b)$ for all $b$. But, for any d.f. $F$, $\lim_{b\downarrow x}\tilde{F}(b) = F(x + 0) = F(x)$ and thus, for all $x$,

$$F_1(x) = \lim_{b\downarrow x}\tilde{F}_1(b) = \lim_{b\downarrow x}\tilde{F}_2(b) = F_2(x)$$
as required. □
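The inversion formula of Theorem 12.3.1 (ii) is easy to exercise numerically. For the standard normal (an arbitrary illustrative choice), $\phi(t) = e^{-t^2/2} \in L_1$, and the formula should recover the $N(0,1)$ density; by symmetry the imaginary part of the integrand cancels, leaving $\cos(xt)e^{-t^2/2}$:

```python
import math

# Inversion (Theorem 12.3.1(ii)) for the standard normal: phi(t) = e^{-t^2/2}
# is in L1, so f(x) = (1/2pi) * integral e^{-ixt} phi(t) dt. Only the real
# part cos(xt) e^{-t^2/2} survives; we truncate the integral at |t| <= T.
def f_from_cf(x, T=10.0, n=20000):
    h = 2 * T / n
    s = sum(math.cos(x * (-T + (k + 0.5) * h)) * math.exp(-((-T + (k + 0.5) * h) ** 2) / 2)
            for k in range(n))
    return s * h / (2 * math.pi)

for x in (0.0, 1.0):
    exact = math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
    print(f_from_cf(x), exact)  # the two agree closely
```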

12.4 Continuity theorem for characteristic functions

In this section we shall relate weak convergence of the previous chapter to pointwise convergence of c.f.'s. It will be useful to first prove the following two results.

Lemma 12.4.1 If $\xi$ is a r.v. with d.f. $F$ and c.f. $\phi$, there exists a constant $C > 0$ such that for all $a > 0$
$$P\{|\xi| \geq a\} = \int_{|x|\geq a} dF(x) \leq Ca\int_0^{a^{-1}} \mathcal{R}[1 - \phi(t)]\, dt$$
($\mathcal{R}$ denoting "real part"). $C$ does not depend on $\xi$.

Proof

$$a\int_0^{a^{-1}} \mathcal{R}[1 - \phi(t)]\, dt = a\int_0^{a^{-1}}\left\{\int_{-\infty}^\infty (1 - \cos tx)\, dF(x)\right\}dt = \int_{-\infty}^\infty \left\{a\int_0^{a^{-1}} (1 - \cos tx)\, dt\right\}dF(x) \quad \text{(Fubini)}$$
$$= \int_{-\infty}^\infty \left(1 - \frac{\sin a^{-1}x}{a^{-1}x}\right)dF(x) \geq \int_{|a^{-1}x|\geq 1}\left(1 - \frac{\sin a^{-1}x}{a^{-1}x}\right)dF(x) \geq \inf_{|t|\geq 1}\left(1 - \frac{\sin t}{t}\right)\int_{|x|\geq a} dF(x),$$
which gives the desired result if $C^{-1} = \inf_{|t|\geq 1}\left(1 - \frac{\sin t}{t}\right)$. (Note that $C^{-1} = 1 - \sin 1$, so that $C$ is approximately 6.3.) □

The next result uses this one to provide a convenient necessary and sufficient condition for tightness of a sequence of d.f.'s in terms of their c.f.'s.

Theorem 12.4.2 Let $\{F_n\}$ be a sequence of d.f.'s with c.f.'s $\{\phi_n\}$. Then $\{F_n\}$ is tight if and only if $\limsup_{n\to\infty} \mathcal{R}[1 - \phi_n(t)] \to 0$ as $t \to 0$.

Proof If $\{F_n\}$ is tight we may, given $\epsilon > 0$, choose $A$ so that $F_n(-A) < \epsilon/8$, $1 - F_n(A) < \epsilon/8$ for all $n$, and hence
$$\mathcal{R}[1 - \phi_n(t)] = \int_{-\infty}^\infty (1 - \cos tx)\, dF_n(x) \leq \int_{|x|\leq A} (1 - \cos tx)\, dF_n(x) + \epsilon/2.$$

Now if $a > 0$ and $aA < \pi$, then $1 - \cos tx \leq 1 - \cos aA$ for $|x| \leq A$, $|t| \leq a$, and thus

$$\mathcal{R}[1 - \phi_n(t)] \leq (1 - \cos aA) + \epsilon/2$$
when $|t| \leq a$. Hence $\limsup_{n\to\infty} \mathcal{R}[1 - \phi_n(t)] < \epsilon$ for $|t| \leq a$ if $a$ is chosen so that $1 - \cos aA < \epsilon/2$, giving the desired conclusion.

Conversely suppose that $\limsup_{n\to\infty} \mathcal{R}[1 - \phi_n(t)] \to 0$ as $t \to 0$. By Lemma 12.4.1 there exists $C$ such that for any $a > 0$,
$$\int_{|x|\geq a} dF_n(x) \leq Ca\int_0^{a^{-1}} \mathcal{R}[1 - \phi_n(t)]\, dt.$$

Hence by Fatou's Lemma (Theorem 4.5.4) applied to $2 - \mathcal{R}[1 - \phi_n(t)]$, or by Ex. 4.17,
$$\limsup_{n\to\infty}\int_{|x|\geq a} dF_n(x) \leq Ca\int_0^{a^{-1}} \limsup_{n\to\infty}\mathcal{R}[1 - \phi_n(t)]\, dt.$$
But given $\epsilon > 0$ the integrand on the right tends to zero by assumption, and hence may be taken less than $\epsilon/C$ for $0 \leq t \leq a^{-1}$ if $a = a(\epsilon)$ is chosen to be large, and hence $\limsup_{n\to\infty}\int_{|x|\geq a} dF_n(x) < \epsilon$. Thus there exists $N$ such that $\int_{|x|\geq a} dF_n(x) < \epsilon$ for all $n \geq N$. Since the finite family $F_1, F_2, \ldots, F_{N-1}$ is tight, $\int_{|x|>a_n} dF_n(x) < \epsilon$ for some $a_n$, $n = 1, 2, \ldots, N - 1$, and hence $\int_{|x|>A} dF_n(x) < \epsilon$ for all $n$ if $A = \max\{a_1, \ldots, a_{N-1}, a\}$. This exhibits the required tightness of $\{F_n\}$. □

The following is the main result of this section (characterizing weak convergence in terms of c.f.'s).

Theorem 12.4.3 (Continuity Theorem for c.f.'s) Let $\{F_n\}$ be a sequence of d.f.'s with c.f.'s $\{\phi_n\}$.

(i) If $F$ is a d.f. with c.f. $\phi$ and if $F_n \xrightarrow{w} F$, then $\phi_n(t) \to \phi(t)$ for all $t \in \mathbb{R}$.
(ii) Conversely, if $\phi$ is a complex function such that $\phi_n(t) \to \phi(t)$ for all $t \in \mathbb{R}$ and if $\phi$ is continuous at $t = 0$, then $\phi$ is the c.f. of a d.f. $F$ and $F_n \xrightarrow{w} F$.

Proof (i) If $F_n \xrightarrow{w} F$ then by Theorem 11.2.1,
$$\int_{-\infty}^\infty \cos tx\, dF_n(x) \to \int_{-\infty}^\infty \cos tx\, dF(x) \quad \text{and} \quad \int_{-\infty}^\infty \sin tx\, dF_n(x) \to \int_{-\infty}^\infty \sin tx\, dF(x),$$
and hence $\int_{-\infty}^\infty e^{itx}\, dF_n(x) \to \int_{-\infty}^\infty e^{itx}\, dF(x)$, i.e. $\phi_n(t) \to \phi(t)$, as required.

(ii) Since $\phi_n(t) \to \phi(t)$ for all $t$, we have $\phi(0) = \lim \phi_n(0) = 1$ and

$$\limsup_{n\to\infty} \mathcal{R}[1 - \phi_n(t)] = 1 - \mathcal{R}[\phi(t)] \to 0 \quad \text{as } t \to 0$$
since $\phi$ is continuous at $t = 0$. Thus by Theorem 12.4.2, $\{F_n\}$ is tight. If now $\{F_{n_k}\}$ is any weakly convergent subsequence of $\{F_n\}$, $F_{n_k} \xrightarrow{w} F$ say, where $F$ has c.f. $\psi$, then, by (i), $\psi(t) = \lim_{k\to\infty} \phi_{n_k}(t) = \phi(t)$. Hence $F$ has c.f. $\phi$. Thus every weakly convergent subsequence has the same weak limit $F$ (determined by the c.f. $\phi$), and the tight sequence $\{F_n\}$ therefore converges weakly to $F$ by Theorem 11.2.5, concluding the proof. □

Corollary If $\{\xi_n\}$ is a sequence of r.v.'s with d.f.'s $\{F_n\}$ and c.f.'s $\{\phi_n\}$, and if $\xi$ is a r.v. with d.f. $F$ and c.f. $\phi$, then $\xi_n \xrightarrow{d} \xi$ ($F_n \xrightarrow{w} F$) if and only if $\phi_n(t) \to \phi(t)$ for all real $t$.

This follows at once from the theorem since $\phi$ is a c.f. and hence continuous at $t = 0$.

12.5 Some applications

In this section we give some applications of the continuity theorem for characteristic functions, beginning with a useful condition for a sequence of r.v.'s to converge in distribution to zero. By the Corollary to Theorem 12.4.3 this is equivalent to the convergence of their c.f.'s to one on the entire real line. As shown next, it suffices for this special case that the sequence of c.f.'s converges to one in some neighborhood of zero.

Theorem 12.5.1 If {ξn} is a sequence of r.v.’s with c.f.’s {φn}, the following are equivalent

(i) $\xi_n \to 0$ in probability,
(ii) $\xi_n \xrightarrow{d} 0$,
(iii) $\phi_n(t) \to 1$ for all $t$,
(iv) $\phi_n(t) \to 1$ in some neighborhood of $t = 0$.

Proof The equivalence of (i) and (ii) is already known from Ex. 11.13. If $\xi_n \xrightarrow{d} 0$ then by Theorem 12.4.3, $\phi_n(t) \to 1$ for all $t$, so that (ii) implies (iii). Since (iii) implies (iv) trivially, the proof will be completed by showing that (iv) implies (ii).

Suppose then that for some $a > 0$, $\phi_n(t) \to 1$ for all $t \in [-a, a]$. Then $\limsup_n \mathcal{R}[1 - \phi_n(t)] = 0$ for $|t| \leq a$ and thus Theorem 12.4.2 applies trivially to show that the sequence $\{F_n\}$ is tight (where $F_n$ is the d.f. of $\xi_n$). Let $\{F_{n_k}\}$ be any weakly convergent subsequence of $\{F_n\}$, $F_{n_k} \xrightarrow{w} F$, say, where $F$ has c.f. $\phi$. Then $\phi_{n_k}(t) \to \phi(t)$ for all $t$ by Theorem 12.4.3, and hence $\phi(t) = 1$ for $|t| \leq a$. Thus by Theorem 12.1.3, $\phi(t) = e^{ibt}$ for all $t$ (some $b$), and since $\phi(t) = 1$ for $|t| < a$ it follows that $b = 0$ and $\phi(t) = 1$ for all $t$, so that $F(x)$ is zero for $x < 0$ and one for $x \geq 0$. This means that any weakly convergent subsequence of the tight sequence $\{F_n\}$ has the weak limit $F$, and hence by Theorem 11.2.5, $F_n \xrightarrow{w} F$. This, restated, is the desired conclusion (ii), $\xi_n \xrightarrow{d} 0$. □

Note that it is not true in general that if a sequence {φn} of c.f.'s converges to a c.f. φ in some neighborhood of t = 0 then it converges to φ for all t. It is true, however, as shown in this proof, in the special case where φ ≡ 1. (Cf. Ex. 12.26 also.)
In Theorem 11.5.5 it was shown that convergence of a series of independent r.v.'s in probability implies a.s. convergence. The following result shows that convergence in distribution is even sufficient for a.s. convergence in such a case. It also provides a single necessary and sufficient condition, expressed in terms of c.f.'s, for a.s. convergence of a series of independent r.v.'s and should thus be compared with Kolmogorov's Three Series Theorem 11.5.4.

Theorem 12.5.2 Let {ξn} be a sequence of independent r.v.'s with c.f.'s {φn}. Then the following are equivalent

(i) The series Σ_{n=1}^∞ ξn converges a.s.
(ii) Σ_{n=1}^∞ ξn converges in probability.
(iii) Σ_{n=1}^∞ ξn converges in distribution.
(iv) The products Π_{k=1}^n φk(t) converge to a nonzero limit as n → ∞, in some neighborhood of the origin.

Proof That (i) and (ii) are equivalent follows from Theorem 11.5.5. Clearly (ii) implies (iii), and (iii) implies (iv). The proof will be completed by showing that (iv) implies (ii).
If (iv) holds, Π_{k=1}^n φk(t) → φ(t), say, where φ(t) ≠ 0 for t ∈ [–a, a], some a > 0. Let {mk}, {nk} be sequences tending to infinity as k → ∞, with nk > mk. Then

Π_{j=mk}^{nk} φj(t) = Π_{j=1}^{nk} φj(t) / Π_{j=1}^{mk–1} φj(t) → 1 as k → ∞ for |t| ≤ a.

By Theorem 12.5.1, Σ_{j=mk}^{nk} ξj → 0 in probability. Since {mk} and {nk} are arbitrary sequences it is clear that Σ_{j=1}^n ξj is Cauchy in probability and hence Σ_{j=1}^∞ ξj is convergent in probability, concluding the proof of the theorem. □
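A concrete example of condition (iv) (the r.v.'s are my own illustrative choice, not from the text): for ξk = ±2^{–k} with probability 1/2 each, φk(t) = cos(t/2^k), and the classical identity Π_{k=1}^∞ cos(t/2^k) = sin t / t shows the products converge to a nonzero limit near the origin, so by Theorem 12.5.2 the series Σ ξk converges a.s. (in fact to a uniform r.v. on (–1, 1), whose c.f. is sin t / t).

```python
import math

# xi_k = +/- 2^{-k} with prob 1/2 each has c.f. phi_k(t) = cos(t / 2^k).
def partial_product(t, n):
    p = 1.0
    for k in range(1, n + 1):
        p *= math.cos(t / 2 ** k)
    return p

t = 1.0
limit = math.sin(t) / t          # nonzero limit: c.f. of U(-1, 1)
approx = partial_product(t, 50)  # partial product, n = 50
```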

The weak law of large numbers is, of course, an immediate corollary of the strong law (Theorem 11.6.3). However, as noted in Section 11.6, it is useful to also obtain it directly since the use of c.f.’s gives a very easy proof.

Theorem 12.5.3 Let {ξn} be a sequence of independent r.v.'s with the same d.f. F and finite mean μ. Then

(1/n) Σ_{i=1}^n ξi → μ in probability as n → ∞.

Proof If φ is the c.f. of each ξn, the c.f. of Sn = Σ_{i=1}^n ξi is (φ(t))^n and that of Sn/n is ψn(t) = (φ(t/n))^n. But since φ(t) = 1 + iμt + o(t) (Theorem 12.2.1, Corollary) we have, for any fixed t, φ(t/n) = 1 + iμt/n + o(1/n) as n → ∞ and thus

ψn(t) = (1 + iμt/n + o(1/n))^n.

It is well known (and if not should be made so!) that the right hand side converges to e^{iμt} as n → ∞. Since e^{iμt} is the c.f. of the constant r.v. μ it follows that Sn/n →d μ (by Theorem 12.4.3, Corollary) and by Ex. 11.13, n^{–1}Sn → μ in probability. □
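The only analytic fact used beyond the expansion of φ is the limit (1 + iμt/n)^n → e^{iμt}; a quick numerical check of that limit (the values of μ and t below are illustrative assumptions):

```python
import cmath

# (1 + i*mu*t/n)^n should approach exp(i*mu*t) as n grows.
def approx_cf(mu, t, n):
    return (1 + 1j * mu * t / n) ** n

mu, t = 2.0, 0.7
err = abs(approx_cf(mu, t, 10 ** 6) - cmath.exp(1j * mu * t))
```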

The weak law of large numbers just proved shows that the average (1/n) Σ_{j=1}^n ξj of independent and identically distributed (i.i.d.) r.v.'s is likely to lie close to μ = Eξ1 as n becomes large. On the other hand, the simple form of the central limit theorem (CLT) to be given next shows how a limiting distribution may be obtained for (1/n) Σ_{j=1}^n ξj (suitably normalized). A more general form of the central limit theorem is given in the next section.

Theorem 12.5.4 (Central Limit Theorem – Elementary Form) Let {ξn} be a sequence of independent r.v.'s with the same distribution and with finite mean μ and variance σ². Then the sequence of normalized r.v.'s

Zn = (1/(σ√n)) Σ_{j=1}^n (ξj – μ) = (√n/σ)((1/n) Σ_{j=1}^n ξj – μ)

converges in distribution to a standard normal r.v. Z (p.d.f. (2π)^{–1/2}e^{–x²/2}).

Proof Write Zn = n^{–1/2} Σ_{j=1}^n ηj where ηj = (ξj – μ)/σ are independent with zero means, unit variances and the same d.f. Let φ(t) denote their common c.f. which may (by Theorem 12.2.1, Corollary) be written as

φ(t) = 1 – t²/2 + o(t²).

The c.f. of Zn is by Theorems 12.1.2, 12.1.4

ψn(t) = [φ(tn^{–1/2})]^n

which may therefore (for fixed t, as n → ∞) be written, by the corollary to Theorem 12.2.1,

ψn(t) = (1 – t²/(2n) + o(1/n))^n → e^{–t²/2} as n → ∞.

Since this limit is the c.f. corresponding to the standard normal distribution (Section 12.1), Zn →d Z by Theorem 12.4.3. □
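The convergence [φ(tn^{–1/2})]^n → e^{–t²/2} can be sketched numerically; as an illustrative assumption the common distribution is taken to be uniform on (–√3, √3), which has mean 0, variance 1 and c.f. sin(√3 t)/(√3 t):

```python
import math

# c.f. of the uniform distribution on (-sqrt(3), sqrt(3)):
# mean 0, variance 1, phi(t) = sin(sqrt(3) t)/(sqrt(3) t).
def phi(t):
    x = math.sqrt(3.0) * t
    return math.sin(x) / x if x != 0 else 1.0

# c.f. of Z_n = n^{-1/2}(eta_1 + ... + eta_n)
def psi_n(t, n):
    return phi(t / math.sqrt(n)) ** n

t = 1.5
err = abs(psi_n(t, 10 ** 5) - math.exp(-t * t / 2))
```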

12.6 Array sums, Lindeberg–Feller Central Limit Theorem

As seen in the elementary form of the CLT (Theorem 12.5.4) the partial sums Σ_{i=1}^n ξi of i.i.d. r.v.'s with finite second moments have a normal limit when standardized by means and standard deviations, i.e.

(1/(σ√n)) (Σ_{j=1}^n ξj – nμ) →d N(0, 1).

A more general form of the result allows the ξi to have different distributions with finite second moments and gives necessary and sufficient conditions for this normal limit. This is the Lindeberg–Feller result.
It is useful to generalize further by considering a triangular array {ξni : 1 ≤ i ≤ kn, n ≥ 1}, independent in i for each n, rather than just a single sequence (but including that case – with kn = n, ξni = ξi) and consider the limiting distribution of Σ_{i=1}^{kn} ξni. This is an extensively studied area, "Central Limit Theory", where the types of possible limit for such sums are investigated. For the case of pure sums (ξni = ξi) the limits are so-called "stable" r.v.'s (if ξ, η are i.i.d. with a stable distribution G, then αξ + βη, α > 0, β > 0, has the distribution G(ax + b), some a > 0, b). For array sums the possible limits are (under natural conditions) the more general "infinitely divisible laws" corresponding to r.v.'s which may be split up as the sum of n i.i.d. components for any n. Here we look at just the special case of the normal limit for array sums under the so-called Lindeberg conditions using a proof due to W.L. Smith.
The following lemma will be useful in proving the main theorem. When unstated, the range of j in a sum or product is from j = 1 to kn.

Lemma 12.6.1 Let kn → ∞ and let {anj : 1 ≤ j ≤ kn, n = 1, 2, ...} be complex numbers such that

(i) max_j |anj| → 0 and
(ii) Σ_j |anj| ≤ K for all n, some K > 0.

Then Π_j (1 – anj) exp(Σ_j anj) → 1 as n → ∞.

Proof This is perhaps most simply shown by use of the expansion

log(1 – z) = –z + ψ(z), |ψ(z)| ≤ A|z|² for complex z, |z| < 1,

valid for the "principal branch" of the logarithm. It may alternatively be shown from the version of this for real z, avoiding the multivalued logarithm but requiring more detailed calculation. Using the above expansion we have, for sufficiently large n,

|log{Π_j (1 – anj) exp(Σ_j anj)}| = |Σ_j (log(1 – anj) + anj)|
≤ A Σ_j |anj|²
≤ A (max_j |anj|) Σ_j |anj|

which tends to zero by the assumptions, and hence the result Π_j (1 – anj) × exp(Σ_j anj) → 1 as required. □
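A minimal numerical sketch of the lemma (the array anj = c/n, j = 1, ..., n, is my own illustrative choice): conditions (i) and (ii) hold with K = c, and the product Π_j(1 – anj) exp(Σ_j anj) = (1 – c/n)^n e^c indeed tends to 1:

```python
import math

# For a_nj = c/n (k_n = n): max_j |a_nj| = c/n -> 0 and sum_j |a_nj| = c.
# The lemma's product reduces to (1 - c/n)^n * exp(c) -> 1.
def lemma_quantity(c, n):
    return (1 - c / n) ** n * math.exp(c)

errs = [abs(lemma_quantity(0.5, n) - 1) for n in (10, 100, 10000)]
```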

Theorem 12.6.2 (Array Form of Lindeberg–Feller Central Limit Theorem) Let {ξnj, 1 ≤ j ≤ kn, n = 1, 2, ...} be a triangular array of r.v.'s, independent in j for each n, ξnj having d.f. Fnj, mean zero and finite variance σnj², such that sn² = Σ_j σnj² → 1 as n → ∞. Let ξ be a standard normal (N(0, 1)) r.v. Then Σ_j ξnj →d ξ and max_j σnj² → 0 if and only if the Lindeberg condition (L) holds, viz.,

Σ_j ∫_{|x|>ε} x² dFnj(x) (= Σ_j E{ξnj² χ(|ξnj| > ε)}) → 0 as n → ∞, each ε > 0. (L)

Proof Note first that (L) implies that max_j σnj² → 0 since clearly max_j σnj² ≤ ε² + Σ_j E{ξnj² χ(|ξnj| > ε)}. Hence max_j σnj² → 0 may be assumed as a basic condition in the proof in both directions.
Now let φnj be the c.f. of ξnj and ψnj the corresponding c.f. determined as in Lemma 12.2.3, i.e.

φnj(t) = 1 – (1/2) σnj² t² ψnj(t).

Then the c.f. of ζn = Σ_j ξnj is

Φn(t) = Π_j φnj(t) = Π_j (1 – (1/2) σnj² t² ψnj(t)).

It is easily checked that the conditions of Lemma 12.6.1 are satisfied with anj = σnj² t² ψnj(t)/2, so that

Φn(t) exp((t²/2) sn² Ψn(t)) → 1

where Ψn(t) = sn^{–2} Σ_j σnj² ψnj(t). Since sn² → 1, if Ψn(t) → 1 it follows that Φn(t) → e^{–t²/2}. Conversely if Φn(t) → e^{–t²/2} clearly exp((t²/2) sn²(Ψn(t) – 1)) → 1 (since sn → 1), so that Ψn(t) → 1. Hence Φn(t) → e^{–t²/2} if and only if Ψn(t) → 1.
But Ψn(t) is a convex combination of the c.f.'s ψnj (since Σ_j σnj² = sn²) and hence is clearly itself a c.f. for each n (see also next section). Thus ζn = Σ_j ξnj (with c.f. Φn) converges in distribution to a standard normal r.v. if and only if Ψn(t) → 1 for each t, or equivalently if and only if the d.f. Gn corresponding to Ψn converges weakly to U(x) = 0 for x < 0 and 1 for x ≥ 0.
Now it follows from Lemma 12.2.3 that Ψn corresponds to the p.d.f. gn (d.f. Gn) where for x > 0

gn(x) = (2/sn²) Σ_j ∫_x^∞ (1 – Fnj(u)) du.

Using the same inversions of integration as in Lemma 12.2.3 (or integration by parts) it follows readily that for any ε > 0

∫_ε^∞ gn(x) dx = (1/sn²) Σ_j ∫_ε^∞ (u – ε)² dFnj(u).

This and the corresponding result for x < 0 (and noting sn → 1) show that Gn →w U if and only if for each ε > 0

Σ_j ∫_{|x|>ε} (|x| – ε)² dFnj(x) → 0 as n → ∞. (L′)

Now (L′) has the same form as (L) with integrand (|x| – ε)² instead of x² in the same range (|x| > ε). But in this range 0 < |x| – ε < |x|, so that (|x| – ε)² ≤ x² and hence (L) implies (L′). Conversely if (L′) holds for each ε > 0 it holds with ε/2 instead of ε and hence (reducing the integration range)

Σ_j ∫_{|x|>ε} (|x| – ε/2)² dFnj(x) → 0.

But in the range |x| > ε, 1 – ε/(2|x|) > 1/2, so that

(|x| – ε/2)² = x² (1 – ε/(2|x|))² > x²/4

and hence (L) holds. Thus (L) and (L′) are equivalent, completing the proof. □

Corollary 1 ("Standard" Form of Lindeberg–Feller Theorem) Let {ξn} be independent r.v.'s with d.f.'s {Fn}, zero means, and finite variances {σn²} with σ1² > 0. Write sn² = Σ_{j=1}^n σj². Then sn^{–1} Σ_{j=1}^n ξj →d ξ, standard normal, and max_{1≤j≤n} σj²/sn² → 0, if and only if the Lindeberg condition

sn^{–2} Σ_{j=1}^n ∫_{|x|>εsn} x² dFj(x) → 0 as n → ∞, each ε > 0. (L″)

Proof This follows from the theorem by writing ξnj = ξj/sn, 1 ≤ j ≤ n, n = 1, 2, .... □
The theorem may also be formulated for r.v.'s with nonzero means in the obvious way:

Corollary 2 If {ξn} are independent r.v.'s with d.f.'s {Fn}, means {μn}, and finite variances {σn²} with σ1² > 0, sn² = Σ_{j=1}^n σj², max_j σj²/sn² → 0, then a necessary and sufficient condition for (1/sn) Σ_{j=1}^n (ξj – μj) to converge in distribution to a standard normal r.v. is the Lindeberg condition

(1/sn²) Σ_{j=1}^n ∫_{|x–μj|>εsn} (x – μj)² dFj(x) → 0 as n → ∞ for each ε > 0. (L‴)
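As a hedged illustration of checking the Lindeberg condition (L″) (the uniform distribution is my own choice, not from the text): for i.i.d. uniform (–1, 1) variables, σj² = 1/3 and sn² = n/3, and since |ξj| ≤ 1 the region |x| > εsn is eventually empty, so the Lindeberg ratio is exactly zero for large n.

```python
import math

# Lindeberg ratio for i.i.d. uniform(-1, 1): sigma_j^2 = 1/3, s_n^2 = n/3.
# Once eps * s_n >= 1 the region |x| > eps*s_n carries no mass at all.
def lindeberg_ratio(n, eps):
    s_n = math.sqrt(n / 3.0)
    cut = eps * s_n
    if cut >= 1.0:
        return 0.0
    # integral of x^2 * (density 1/2) over cut < |x| < 1 equals (1 - cut^3)/3
    integral = (1.0 - cut ** 3) / 3.0
    return n * integral / (s_n ** 2)

vals = [lindeberg_ratio(n, 0.1) for n in (10, 1000)]
```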

12.7 Recognizing a c.f. – Bochner's Theorem

A characteristic function is the Fourier–Stieltjes Transform of a d.f. It is sometimes important to know whether a given complex-valued function is a c.f. or not (i.e. whether it can be written as such a transform) and often this will not be immediately obvious. We shall, below, give necessary and sufficient conditions in terms of "positive definite" functions (Bochner's Theorem). This is a most useful characterization for theoretical purposes – especially concerning applications to stationary stochastic processes – but it is not so readily used in the practical situation of recognizing whether a given function is a c.f. from its functional form. A simple sufficient criterion which is occasionally very useful in recognizing special types of c.f. is given in Theorem 12.7.4.

First of all it should be noted that c.f.'s may sometimes be recognized by virtue of being certain combinations of known c.f.'s (see also [Chung]). For example, if φj(t), j = 1, ..., n, are c.f.'s we know that Π_{j=1}^n φj(t) is a c.f. (Theorem 12.1.4). So is any "convex combination" Σ_{j=1}^n αjφj(t) (αj ≥ 0, Σ_{j=1}^n αj = 1), which corresponds to the "mixed" d.f. Σ_{j=1}^n αjFj(x) if φj corresponds to Fj. Indeed, we may have an infinite convex combination – as should be checked. (See also Ex. 12.11.) Of course, if φ is a c.f. so is e^{ibt}φ(at) for any real a, b (Theorem 12.1.2), and φ(–t). But φ(–t) is the complex conjugate of φ(t), and thus |φ(t)|² = φ(t)φ(–t) is a c.f. also.
In all cases mentioned the reader should determine what r.v.'s the indicated c.f.'s correspond to, where possible. For example, if ξ, η are independent with the same d.f. F (and c.f. φ) it should be checked that the c.f. of ξ – η is |φ(t)|².
Both Bochner's Theorem and the criterion for recognizing certain c.f.'s will be consequences of the following lemma.
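The claim that |φ(t)|² is the c.f. of ξ – η can be checked by hand for a simple two-point distribution (the Bernoulli choice below is illustrative, not from the text): ξ – η takes the values –1, 0, 1 with probabilities p(1 – p), 1 – 2p(1 – p), p(1 – p), and the corresponding c.f. agrees with |φ(t)|².

```python
import cmath

# c.f. of a Bernoulli(p) r.v.: phi(t) = (1-p) + p e^{it}
def phi(t, p):
    return (1 - p) + p * cmath.exp(1j * t)

# c.f. of xi - eta for i.i.d. Bernoulli(p), computed from its distribution
def phi_diff(t, p):
    q = p * (1 - p)
    return q * cmath.exp(-1j * t) + (1 - 2 * q) + q * cmath.exp(1j * t)

t, p = 0.8, 0.3
err = abs(abs(phi(t, p)) ** 2 - phi_diff(t, p))
```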

Lemma 12.7.1 Let φ(t) be a continuous complex function on R with φ(0) = 1, |φ(t)|≤1 for all t and such that for all T

g(λ, T) = (1/2π) ∫_{–T}^T μ(t/T) φ(t) e^{–iλt} dt

is real and nonnegative for each real λ, where μ(t) is 1 – |t| for |t| ≤ 1 and zero for |t| > 1. Then

(i) for each fixed T, g(λ, T) is a p.d.f. with corresponding c.f. φ(t)μ(t/T),
(ii) φ(t) is a c.f.

Proof (ii) will follow at once from (i) by Theorem 12.4.3 since φ(t) = lim_{T→∞} φ(t)μ(t/T) (μ(t/T) → 1 as T → ∞) and φ is continuous at t = 0.
To prove (i) we first show that g(λ, T) is integrable, i.e. ∫_{–∞}^∞ g(λ, T) dλ < ∞, since g is assumed nonnegative. Let M > 0. Then (∫ = ∫_{–∞}^∞)

∫ g(λ, T) μ(λ/(2M)) dλ = (1/2π) ∫ μ(λ/(2M)) {∫ μ(t/T) φ(t) e^{–iλt} dt} dλ.

By the definition of μ(t), both ranges of integration are really finite and since the integrand is bounded (|φ(t)| ≤ 1) the integration order may be changed to give

∫ g(λ, T) μ(λ/(2M)) dλ = (1/2π) ∫ μ(t/T) φ(t) {∫ μ(λ/(2M)) e^{–iλt} dλ} dt
= (1/2π) ∫ μ(t/T) φ(t) {∫_{–2M}^{2M} (1 – |λ|/(2M)) e^{–iλt} dλ} dt
= (1/π) ∫ μ(t/T) φ(t) {∫_0^{2M} (1 – λ/(2M)) cos λt dλ} dt

since cos λt is even, and sin λt is odd. Integration by parts then gives

∫ g(λ, T) μ(λ/(2M)) dλ = (M/π) ∫ μ(t/T) φ(t) (sin Mt/(Mt))² dt
≤ (M/π) ∫ (sin Mt/(Mt))² dt   (|φ(t)| ≤ 1, μ(t/T) ≤ 1)
= (1/π) ∫ (sin t/t)² dt = 1,

as is well known. Now, letting M → ∞, monotone convergence (μ(λ/(2M)) ↑ 1) gives ∫ g(λ, T) dλ ≤ 1. Thus g(·, T) ∈ L1(–∞, ∞).
To see that its integral is in fact equal to one, note that as defined g(λ, T) is a Fourier Transform (1/2π) ∫ μ(t/T) φ(t) e^{–iλt} dt of the L1-function (1/2π) μ(t/T) φ(t) (zero for |t| > T). Since g(λ, T) is itself in L1, inversion (from Theorem 8.3.4 with obvious sign changes) gives

(1/2π) μ(t/T) φ(t) = (1/2π) ∫ e^{+iλt} g(λ, T) dλ.

This holds a.e. and hence for all t, since both sides are continuous. In particular t = 0 gives ∫ g(λ, T) dλ = φ(0) = 1, so that g(λ, T) is a p.d.f. with the corresponding c.f. ∫ e^{iλt} g(λ, T) dλ = μ(t/T)φ(t), which completes the proof of (i), and thus of the lemma also. □

Corollary The function ψ(t) = 1 – |t|/T for |t| ≤ T, and zero for |t| > T, is a c.f.

Proof Take φ(t) ≡ 1 in the lemma and note (cf. proof) that

(1/2π) ∫_{–T}^T (1 – |t|/T) e^{–iλt} dt = (T/2π) (sin(Tλ/2)/(Tλ/2))² ≥ 0. □

We shall now obtain Bochner's Theorem as a consequence of this lemma. For this it will first be necessary to define and state some simple properties of positive definite functions.
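The Corollary's computation can be verified numerically; the midpoint-rule integration below is only an illustrative sketch of the identity (with arbitrarily chosen λ and T), not part of the proof:

```python
import cmath, math

# Numerical value of (1/2pi) * integral_{-T}^{T} (1 - |t|/T) e^{-i lam t} dt
# via the midpoint rule.
def g_numeric(lam, T, steps=20000):
    h = 2 * T / steps
    total = 0j
    for k in range(steps):
        t = -T + (k + 0.5) * h
        total += (1 - abs(t) / T) * cmath.exp(-1j * lam * t) * h
    return total / (2 * math.pi)

# Closed form from the Corollary: (T/2pi) * (sin(T lam/2)/(T lam/2))^2 >= 0
def g_closed(lam, T):
    x = T * lam / 2
    return (T / (2 * math.pi)) * (math.sin(x) / x) ** 2

lam, T = 1.3, 2.0
err = abs(g_numeric(lam, T) - g_closed(lam, T))
```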

A complex function f(t) (t ∈ R) will be called positive definite (or nonnegative definite) if for any integer n = 1, 2, 3, ..., and real t1, ..., tn and complex z1, ..., zn we have

Σ_{j,k=1}^n f(tj – tk) zj z̄k ≥ 0   (12.1)

(z̄ denoting the complex conjugate of z).

("≥ 0" is here used as a shorthand for the statement "is real and ≥ 0".) Notice that by a well known result on positive definite quadratic forms, (12.1) implies that the determinant of the matrix {f(tj – tk)}_{j,k=1}^n is nonnegative. The needed simple properties of a positive definite function are given in the following theorem.

Theorem 12.7.2 If f (t) is a positive definite function, then

(i) f(0) ≥ 0,
(ii) f(–t) = f̄(t) (the complex conjugate of f(t)) for all t,
(iii) |f(t)| ≤ f(0) for all t,
(iv) |f(t + h) – f(t)|² ≤ 4f(0)|f(0) – f(h)| for all t, h,
(v) f(t) is continuous for all t (indeed uniformly continuous) if it is continuous at t = 0.

Proof

(i) That f(0) is real and nonnegative follows by taking n = 1, t1 = 0, z1 = 1 in (12.1).

(ii) If n = 2, t1 = 0, t2 = t, z1 = z2 = 1 we obtain 2f(0) + f(t) + f(–t) ≥ 0 from (12.1), and hence f(t) + f(–t) is real (= α, say).

If n = 2, t1 = 0, t2 = t, z1 = 1, z2 = i we see that i f(t) – i f(–t) is real and hence f(t) – f(–t) is purely imaginary (= iβ, say).
Thus f(t) = (α + iβ)/2 and f(–t) = (α – iβ)/2, giving f(–t) = f̄(t).

(iii) If t1 – t2 = t, nonnegativity of the determinant of the matrix {f(tj – tk)}_{j,k=1,2} gives f(0)² ≥ f(t)f(–t) = |f(t)|², so that |f(t)| ≤ f(0).

(iv) If n = 3, t1 = 0, t2 = t, t3 = t + h, then

det{f(tj – tk)}_{j,k=1}^3 = det
| f(0)      f(–t)    f(–t – h) |
| f(t)      f(0)     f(–h)     |
| f(t + h)  f(h)     f(0)      |   ≥ 0

gives

f(0)³ – f(0)|f(t)|² – f(0)|f(t + h)|² – f(0)|f(h)|² + 2R[f(t)f(h)f(–t – h)] ≥ 0

and thus, with obvious use of (iii),

f(0)|f(t + h) – f(t)|² = f(0)|f(t + h)|² + f(0)|f(t)|² – 2f(0)R[f(t)f(–t – h)]
≤ f(0)³ – f(0)|f(h)|² + 2R[f(t)f(–t – h){f(h) – f(0)}]
≤ 2f(0)²{f(0) – |f(h)|} + 2f(0)²|f(0) – f(h)|
≤ 4f(0)²|f(0) – f(h)|

from which the desired inequality follows (even if f(0) = 0, by (iii)).
(v) is clear from (iv). □

Theorem 12.7.3 (Bochner's Theorem) A complex function φ(t) (t ∈ R) is a c.f. if and only if it is continuous, positive definite, and φ(0) = 1. By Theorem 12.7.2 (v), continuity for all t may be replaced by continuity at t = 0.

Proof If φ is a c.f., it is continuous and φ(0) = 1. If t1, ..., tn are real and z1, ..., zn complex then (writing φ(t) = ∫ e^{itx} dF(x))

Σ_{j,k=1}^n φ(tj – tk) zj z̄k = ∫ (Σ_{j,k=1}^n e^{i(tj–tk)x} zj z̄k) dF(x) = ∫ |Σ_{j=1}^n zj e^{itj x}|² dF(x) ≥ 0

and hence φ is positive definite.
Conversely suppose that φ is continuous and positive definite with φ(0) = 1. As in Lemma 12.7.1, define g(λ, T) = (1/2π) ∫_{–T}^T (1 – |t|/T) φ(t) e^{–iλt} dt. It is easy to see that g may be written as

g(λ, T) = (1/2πT) ∫_0^T ∫_0^T φ(t – u) e^{–iλ(t–u)} dt du

(by splitting the square of integration into two parts above and below the diagonal t = u and putting t – u = s). But this latter integral involves a continuous integrand and may be evaluated as the limit of Riemann sums of the form (using the same dissection {tj} on each axis)

(1/2πT) Σ_{j,k=1}^n φ(tj – tk) zj z̄k

with zj = e^{–iλtj}(tj – tj–1). Since φ is positive definite such sums are nonnegative and hence so is g(λ, T). Since |φ(t)| ≤ φ(0) by Theorem 12.7.2 (iii) and φ(0) = 1, the conditions for Lemma 12.7.1 are satisfied and φ is thus a c.f. □
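A small numerical sketch of definition (12.1) (the particular functions and points are my own choices): the Gaussian c.f. e^{–t²/2} always yields a nonnegative quadratic form, while the "box" function equal to 1 for |t| ≤ 1 and 0 otherwise, which is not a c.f., fails positive definiteness for a suitable choice of points.

```python
import math

# Quadratic form sum_{j,k} f(t_j - t_k) z_j conj(z_k); real part returned
# (for a positive definite f the form is real and nonnegative).
def quad_form(f, ts, zs):
    return sum(f(tj - tk) * zj * zk.conjugate()
               for tj, zj in zip(ts, zs) for tk, zk in zip(ts, zs)).real

gauss = lambda t: math.exp(-t * t / 2)          # Gaussian c.f.: positive definite
box = lambda t: 1.0 if abs(t) <= 1 else 0.0     # not a c.f.: fails (12.1)

ts = [0.0, 0.8, 1.6]
zs = [1 + 0j, -1 + 0j, 1 + 0j]
q_gauss = quad_form(gauss, ts, zs)
q_box = quad_form(box, ts, zs)
```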

We turn now to the "practical criterion" referred to above. As will be seen, this criterion provides sufficient conditions for a function to be a c.f. and, while these are useful, they are far indeed from being necessary. Basically the result gives conditions under which a real function φ(t) which is convex on (0, ∞) will be a c.f.

Theorem 12.7.4 Let φ(t) be a real, nonnegative, even, continuous function on R such that φ(t) is nonincreasing and convex on t ≥ 0, and such that φ(0) = 1. Then φ is a c.f.

Proof Consider first a convex polygon φ(t) of the type shown in the figure below, with vertices at 0 < a1 < a2 < ··· < an (φ(t) being constant for t ≥ an). It is easy to see that φ(t) may be written as

φ(t) = Σ_{k=1}^n λk μ(t/ak) + λ_{n+1}

where μ(t) = 1 – |t| for |t| ≤ 1 and μ(t) = 0 otherwise. (This expression is clearly linear between ak and ak+1, and at aj takes the value φ(aj) = Σ_{k=j+1}^n λk μ(aj/ak) + λ_{n+1}, so that λ_{n+1}, λn, ..., λ1 may be successively calculated from φ(an), φ(an–1), ..., φ(a1), φ(0) = 1.)
The polygon edge between aj and aj+1 has the form Σ_{k=j+1}^n λk μ(t/ak) + λ_{n+1} and hence (if continued back) intercepts t = 0 at height Σ_{k=j+1}^{n+1} λk. By convexity these intercepts decrease as j increases and hence λj = Σ_{k=j}^{n+1} λk – Σ_{k=j+1}^{n+1} λk > 0. Since φ(0) = 1 we also have Σ_{j=1}^{n+1} λj = 1.

Now μ(t/ak) is a c.f. (Lemma 12.7.1, Corollary) for each k, and so also is the constant function 1. φ(t) is thus seen to be a convex combination of c.f.'s and is thus itself a c.f.
If now φ(t) is a function satisfying the conditions of the theorem, it may clearly be expressed as a limit of such convex polygons (e.g. inscribed with vertices at r/2^n, r = 0, 1, ..., n2^n). Hence by Theorem 12.4.3, φ is a c.f. □
Applications of this theorem are given in the exercises.
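The successive computation of the λk in the proof can be carried out for a concrete polygon (the vertex values below are an illustrative assumption): with φ(0) = 1, φ(1) = 0.5, φ(2) = 0.2 and φ constant beyond t = 2, one finds λ3 = 0.2, λ2 = 0.6, λ1 = 0.2, all positive and summing to one.

```python
# mu(t) = 1 - |t| for |t| <= 1, zero otherwise (the triangular c.f.)
def mu(t):
    return max(0.0, 1.0 - abs(t))

# Convex polygon: phi(0)=1, phi(1)=0.5, phi(2)=0.2, constant 0.2 beyond t=2.
lam3 = 0.2                            # constant tail value phi(a_n)
lam2 = (0.5 - lam3) / mu(1.0 / 2.0)   # from phi(a_1) = lam2*mu(a_1/a_2) + lam3
lam1 = 1.0 - lam2 - lam3              # from phi(0) = lam1 + lam2 + lam3

def phi(t):
    return lam1 * mu(t / 1.0) + lam2 * mu(t / 2.0) + lam3
```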

12.8 Joint characteristic functions

It is also useful to consider the joint c.f. of m r.v.’s ξ1, ..., ξm defined for real t1, ..., tm by

φ(t1, ..., tm) = E e^{i(t1ξ1+···+tmξm)}.

We shall not investigate such functions in any great detail here, but will indicate a few of their more important properties. First it is easily shown that if F is the joint d.f. of ξ1, ..., ξm, then

φ(t1, ..., tm) = ∫_{R^m} e^{i(t1x1+···+tmxm)} dF(x1, ..., xm)

(where "dF", of course, means dμF = dP(ξ1, ..., ξm)^{–1} in the notation of Section 9.3). Further, the simplest properties of c.f.'s of a single r.v. clearly generalize easily. For example, it is easily seen that φ(0, ..., 0) = 1, |φ(t1, ..., tm)| ≤ 1, and so on. The following obvious but useful property should also be pointed out: The joint c.f. of ξ1, ..., ξm is uniquely determined by the c.f.'s of all linear combinations a1ξ1 + ··· + amξm, a1, ..., am ∈ R. Indeed if φ_{a1,...,am}(t) denotes the c.f. of a1ξ1 + ··· + amξm, i.e. E exp{it(a1ξ1 + ··· + amξm)}, it is clear that φ(t1, ..., tm) = φ_{t1,...,tm}(1).

Generalizations of the inversion, uniqueness and continuity theorems are, of course, of interest. First a useful form of the inversion theorem may be stated as follows (cf. Theorem 12.3.1).

Theorem 12.8.1 Let F and φ be the joint d.f. and c.f. of the r.v.'s ξ1, ..., ξm. Then if I = (a, b], a = (a1, ..., am), b = (b1, ..., bm) (ai ≤ bi, 1 ≤ i ≤ m), is any continuity rectangle (Section 10.2) for F,

μF(I) = lim_{T→∞} (1/(2π)^m) ∫_{–T}^T ··· ∫_{–T}^T Π_{j=1}^m [(e^{–ibjtj} – e^{–iajtj})/(–itj)] φ(t1, ..., tm) dt1 ... dtm.

Here μF(I) is defined as in Lemma 7.8.2.

This result is obtained in a similar manner to Theorem 12.3.1 (from the m-dimensional form of Theorem 8.3.1), and we do not give a detailed proof.
To obtain the uniqueness theorem, an m-dimensional form is needed of the fact that a d.f. F has at most countably many discontinuities (Lemma 9.2.2) (or equivalently that the corresponding measure μF has at most countably many points of positive mass, i.e. x such that μF({x}) > 0). Consider the case m = 2, and for a given s let Ls denote the line x = s, –∞ < y < ∞. If μ is a probability measure on the Borel sets of R² then by the same argument as for m = 1, there are at most countably many values of s for which μ(Ls) > 0. Similarly there are at most countably many values of t such that μ(L^t) > 0, where L^t denotes the line y = t, –∞ < x < ∞. It thus follows that given any values s0, t0, there are values s, t arbitrarily close to s0, t0 respectively, such that μ(Ls) = μ(L^t) = 0. (Such Ls, L^t will be called lines of zero μ-mass.) Precisely the same considerations hold in R^m for m > 2, with (m – 1)-dimensional hyperplanes of the form {(x1, ..., xm) : xi = constant} taking the place of lines. With these observations we now obtain the uniqueness theorem for m-dimensional c.f.'s.

Theorem 12.8.2 The joint c.f. of m r.v.'s uniquely determines their joint d.f., and hence their distribution, and conversely; i.e. two d.f.'s F1, F2 in R^m are identical if and only if their c.f.'s φ1, φ2 are identical.

Proof It is clear that F1 ≡ F2 implies φ1 ≡ φ2. For the converse assume φ1 ≡ φ2 and consider the case m = 2. (The case m > 2 follows with the obvious changes.) With the above notation let (a, b) be a point in R² such that La, L^b have zero μF1- and μF2-mass. Choose ak, bk, both tending to –∞ as k → ∞, and such that L_{ak}, L^{bk} have zero μF1- and μF2-mass (which is possible since only countably many lines have positive (μF1 + μF2)-mass).

Then writing Ik =(ak, a] × (bk, b],

F1(a, b) = lim_{k→∞} [F1(a, b) – F1(ak, b) – F1(a, bk) + F1(ak, bk)]
= lim_{k→∞} μF1(Ik)
= lim_{k→∞} μF2(Ik)

by Theorem 12.8.1, since Ik is a continuity rectangle for both μF1 and μF2, and F1, F2 have the same c.f. But by the same argument (with F2 for F1), lim_{k→∞} μF2(Ik) = F2(a, b). Hence F1(a, b) = F2(a, b) for any (a, b) such that La and L^b have zero μF1- and μF2-mass.
Finally for any a, b, ck ↓ a, dk ↓ b may be chosen such that L_{ck} and L^{dk} have zero μF1- and μF2-mass and hence F1(ck, dk) = F2(ck, dk) by the above. By right-continuity of F1 and F2 in each argument, F1(a, b) = F2(a, b), as required. □

The following characterization of independence of m r.v.'s ξ1, ..., ξm may now be obtained as an application. (Compare this theorem with Theorem 12.1.4.)

Theorem 12.8.3 The r.v.'s ξ1, ..., ξm are independent if and only if their joint c.f. satisfies φ(t1, ..., tm) = Π_{i=1}^m φi(ti), where φi is the c.f. of ξi.

Proof If the r.v.'s are independent

φ(t1, ..., tm) = E e^{i(t1ξ1+···+tmξm)} = Π_{j=1}^m φj(tj)

by (the complex r.v. form of) Theorem 10.3.5. Conversely if ξ1, ..., ξm have joint d.f. F and individual d.f.'s Fj, and φ(t1, ..., tm) = Π_{j=1}^m φj(tj) for all t1, ..., tm, then F(x1, ..., xm) and F1(x1) ... Fm(xm) are both d.f.'s on R^m with the same c.f. (clearly ∫ e^{i(t1x1+···+tmxm)} d[F1(x1) ... Fm(xm)] = Π_{j=1}^m φj(tj)). Hence by the uniqueness theorem, F(x1, ..., xm) = F1(x1) ... Fm(xm), so that the r.v.'s are independent by Theorem 10.3.1. □
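The factorization in Theorem 12.8.3 can be sketched for two independent Bernoulli variables (an illustrative choice, not from the text): the joint c.f. computed from the product distribution equals φ1(t1)φ2(t2).

```python
import cmath

# c.f. of a Bernoulli(p) r.v.
def phi_bernoulli(t, p):
    return (1 - p) + p * cmath.exp(1j * t)

# joint c.f. E exp(i(t1*xi1 + t2*xi2)) from the product distribution
def joint_cf(t1, t2, p1, p2):
    total = 0j
    for x1, w1 in ((0, 1 - p1), (1, p1)):
        for x2, w2 in ((0, 1 - p2), (1, p2)):
            total += w1 * w2 * cmath.exp(1j * (t1 * x1 + t2 * x2))
    return total

t1, t2, p1, p2 = 0.4, -1.1, 0.3, 0.6
err = abs(joint_cf(t1, t2, p1, p2) - phi_bernoulli(t1, p1) * phi_bernoulli(t2, p2))
```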

Finally, weak convergence of d.f.'s in R^m (Section 11.2) may be considered by means of their c.f.'s, giving rise to the following general version of the continuity theorem (Theorem 12.4.3).

Theorem 12.8.4 Let {Fn(x1, ..., xm)} be a sequence of m-dimensional d.f.’s with c.f.’s {φn(t1, ..., tm)}. 280 Characteristic functions and central limit theorems

(i) If F(x1, ..., xm) is a d.f. with c.f. φ(t1, ..., tm) and if Fn →w F, then φn(t1, ..., tm) → φ(t1, ..., tm) as n → ∞, for all t1, ..., tm ∈ R.
(ii) If φ(t1, ..., tm) is a complex function which is continuous at (0, ..., 0) and if φn(t1, ..., tm) → φ(t1, ..., tm) as n → ∞, for all t1, ..., tm ∈ R, then φ is the c.f. of an (m-dimensional) d.f. F and Fn →w F.

As a corollary to this result we may obtain an elegant simple device due to H. Cramér and H. Wold, which enables convergence in distribution of random vectors to be reduced to convergence of ordinary r.v.'s.

Theorem 12.8.5 (Cramér–Wold Device) Let ξ = (ξ1, ..., ξm), ξn = (ξn1, ..., ξnm), n = 1, 2, ..., be random vectors. Then

ξn →d ξ as n → ∞

a1ξn1 + ··· + amξnm →d a1ξ1 + ··· + amξm as n → ∞ for all a1, ..., am ∈ R.

Proof By the continuity theorems 12.4.3 and 12.8.4, ξn →d ξ is equivalent to

E e^{i(t1ξn1+···+tmξnm)} → E e^{i(t1ξ1+···+tmξm)}

for all t1, ..., tm ∈ R, and a1ξn1 + ··· + amξnm →d a1ξ1 + ··· + amξm is equivalent to

E e^{it(a1ξn1+···+amξnm)} → E e^{it(a1ξ1+···+amξm)}

for all t ∈ R. It is then clear that the former implies the latter (by taking tj = taj) and conversely (by taking t = 1). □
This result shows that to prove convergence in distribution of a sequence of random vectors it is sufficient to consider convergence of arbitrary (but fixed) finite linear combinations of the components. This is especially useful for jointly normal r.v.'s since then each linear combination is also normal.

Exercises

12.1 Find the c.f.'s for the following r.v.'s
(a) Geometric: P{ξ = n} = pq^{n–1}, n = 1, 2, 3, ... (0 < p < 1, q = 1 – p)
(b) Poisson: P{ξ = n} = e^{–λ}λ^n/n!, n = 0, 1, 2, ... (λ > 0)

(c) Exponential: p.d.f. λe^{–λx}, x ≥ 0 (λ > 0)
(d) Cauchy: p.d.f. λ/(π(λ² + x²)), –∞ < x < ∞ (λ > 0).
12.2 Let ξ, η be independent r.v.'s each being uniformly distributed on (–1, 1). Evaluate the distribution of ξ + η and hence its c.f. Check this with the square of (the absolute value of) the c.f. of ξ.
12.3 Let ξ be a standard normal r.v. Find the p.d.f. and c.f. of ξ².
12.4 If ξ1, ..., ξn are independent standard normal r.v.'s, find the c.f. of Σ_{i=1}^n ξi². Check that this corresponds to the p.d.f. 2^{–n/2}Γ(n/2)^{–1}x^{(n/2)–1}e^{–x/2} (x > 0) (χ² with n degrees of freedom).
12.5 Find two r.v.'s ξ, η which are not independent but have the same p.d.f. f, and are such that the p.d.f. of ξ + η is the convolution f ∗ f. (Hint: Try ξ = η with an appropriate d.f.)
12.6 According to Section 7.6 if f, g are in L1(–∞, ∞) then the convolution h = f ∗ g ∈ L1 and has L1 Fourier Transform ĥ = f̂ĝ. In the case where f and g are nonnegative (e.g. p.d.f.'s) give an alternative proof of this result based on Theorem 10.4.1 and Section 12.1. Give a corresponding result for Fourier–Stieltjes Transforms of the Stieltjes Convolution (F1 ∗ F2)(x) = ∫ F1(x – y) dF2(y) of two d.f.'s F1, F2.
12.7 If ξ is a r.v. with c.f. φ show that

E|ξ| = (1/π) ∫_{–∞}^∞ R[1 – φ(t)]/t² dt.

(Hint: ∫_{–∞}^∞ (sin t/t)² dt = π.)
12.8 Let φ be the c.f. of a r.v. ξ. Suppose that

lim_{t↓0} (1 – φ(t))/t² = σ²/2 < ∞.

Show that Eξ = 0 and Eξ² = σ². In particular if φ(t) = 1 + o(t²) show that ξ = 0 a.s. (Hints: R[1 – φ(t)]/t² = ∫ [(1 – cos tx)/t²] dF(x) → σ²/2. Apply Fatou's Lemma to show ∫ x² dF(x) < ∞. Then use the corollary of Theorem 12.2.1.)
12.9 A r.v. ξ is called symmetric if ξ and –ξ have the same d.f. Show that ξ is symmetric if and only if its c.f. φ is real-valued.
12.10 Show that the real part of a c.f. is a c.f. but that the same is never true of the imaginary part.
12.11 Let ξ1 and ξ2 be independent r.v.'s with d.f.'s F1 and F2 and c.f.'s φ1 and φ2.

(i) Show that the c.f. φ of ξ1ξ2 is given by
φ(t) = ∫_{–∞}^∞ φ1(ty) dF2(y) = ∫_{–∞}^∞ φ2(tx) dF1(x) for all t ∈ R.

(ii) If F2(0–) = F2(0), show that the r.v. ξ1/ξ2 is well defined and its c.f. φ is given by
φ(t) = ∫_{–∞}^∞ φ1(t/y) dF2(y) for all t ∈ R.

As a consequence of (i) and (ii), if φ is a c.f. and G a d.f., then ∫_{–∞}^∞ φ(ty) dG(y) is a c.f., and so is ∫_{–∞}^∞ φ(t/y) dG(y) if G(0–) = G(0).
12.12 If f(t) is a function defined on the real line write Δ_h f(t) = f(t + h) – f(t), for real h, and say that f has a generalized second derivative at t when the following limit exists and is finite

lim_{h,h′→0} Δ_h Δ_{h′} f(t)/(h h′)

for all sequences h → 0 and h′ → 0. Show that if f has two derivatives at t then it has a generalized second derivative at t, and that the converse is not true. If φ(t) is a characteristic function show that the following are equivalent:
(i) φ has a generalized second derivative at t = 0,
(ii) φ has two finite derivatives at t = 0,
(iii) φ has two derivatives at every real t,
(iv) ∫_{–∞}^∞ x² dF(x) < ∞, where F is the d.f. of φ.
12.13 If f(t) is a function defined on the real line its first symmetric difference may be defined by
Δ¹_s f(t) = f(t + s) – f(t – s)
for real s, and its higher order symmetric differences by
Δ^{n+1}_s f(t) = Δ¹_s Δ^n_s f(t)
for n = 1, 2, .... If the limit
lim_{s→0} Δ^n_s f(t)/(2s)^n
exists and is finite, we say that f has an nth symmetric derivative at t. Now let φ be the c.f. of a r.v. ξ, and n a positive integer. Show that if
lim inf_{s→0} |Δ^{2n}_s φ(0)/(2s)^{2n}| < ∞
then Eξ^{2n} < ∞. (Hint: Show that
Δ^n_s f(t) = Σ_{k=0}^n (–1)^k (n choose k) f[t + (n – 2k)s]
and
Δ^{2n}_s φ(t) = ∫_{–∞}^∞ e^{itx} (2i sin sx)^{2n} dF(x).)
Show also that the following are equivalent
(i) φ has (2n)th symmetric derivative at t = 0,
(ii) φ has 2n finite derivatives at t = 0,

(iii) φ has 2n finite derivatives at every real t,
(iv) Eξ^{2n} < ∞.

12.14 Let ξ be a r.v. with c.f. φ and denote by ρn the nth symmetric difference of φ at 0:
ρn(t) = Δ^n_t φ(0)
(see Ex. 12.13). If 0 < p < 2n, show that E|ξ|^p < ∞ if and only if
∫_0^ε |ρ_{2n}(t)|/t^{1+p} dt < ∞
for some ε > 0, in which case
E|ξ|^p = [2^{2n} ∫_0^∞ (sin x)^{2n}/x^{1+p} dx]^{–1} ∫_0^∞ |ρ_{2n}(t)|/t^{1+p} dt.
(Hint: Show that
∫_0^∞ |ρ_{2n}(t)|/t^{1+p} dt = 2^{2n} ∫_{–∞}^∞ |x|^p (∫_0^∞ (sin u)^{2n}/u^{1+p} du) dF(x).)
12.15 Let φ be the c.f. corresponding to the d.f. F. Note that by Theorem 12.3.1 the jump (if any) of F at x may be written as
F(x) – F(x – 0) = lim_{T→∞} (1/2T) ∫_{–T}^T e^{–ixt} φ(t) dt.

If φ(t0) = 1 for some t0 ≠ 0, show that the mass of F is concentrated on the points {2nπ/t0 : n = 0, ±1, ...} and the μF-measure of the point 2nπ/t0 is
(1/t0) ∫_0^{t0} φ(t) e^{–2πnit/t0} dt.
(Compare Theorem 12.1.3.)
12.16 Show that |cos t| is not a c.f. (e.g. use the result of Ex. 12.15 with n = 4). Hence the absolute value of a c.f. is not necessarily a c.f.

12.17 If φ is the c.f. corresponding to the d.f. F (and measure μF) prove that
Σ_{x∈R} [μF({x})]² = lim_{T→∞} (1/2T) ∫_{–T}^T |φ(t)|² dt.
(Hint: Mimic the proof of the last part of Theorem 8.3.1 or (more simply) apply the second inversion formula of Theorem 12.3.1 (i) for a = 0 and ξ = ξ1 – ξ2, where ξ1, ξ2 are i.i.d. with c.f. φ.) What is the implication of this if φ ∈ L2(–∞, ∞)?
12.18 If φ is the c.f. corresponding to the d.f. F and φ ∈ L2(–∞, ∞), show that F is absolutely continuous with density a multiple of the Fourier Transform of φ. (Hint: Use Parseval's Theorem.) This is an L2 analog of Theorem 12.3.1 (ii).
12.19 Show that the conclusion of the continuity theorem for characteristic functions is not necessarily true if φ is not continuous at t = 0, by considering a sequence of random variables {ξn}_{n=1}^∞ such that for each n, ξn has the uniform distribution on [–n, n].

12.20 If φ(t) is a characteristic function, then so is e^{λ[φ(t)–1]} for each λ > 0. (Hint: Use e^{λ(φ–1)} = lim_n (1 + λ(φ–1)/n)^n.)

12.21 If the random variable ξ_n has a binomial distribution with parameters (n, p_n), n = 1, 2, ..., and np_n → λ > 0 as n → ∞, prove that ξ_n converges in distribution to a random variable which has the Poisson distribution with parameter λ. Show also that otherwise, if p_n → 0 and np_n → ∞, then ξ_n (suitably standardized) has a limiting normal distribution.

12.22 If the r.v.'s ξ and {ξ_n}_{n=1}^∞ are such that for every n, ξ_n is normal with mean 0 and variance σ_n², show that the following are equivalent

(i) ξ_n → ξ in probability
(ii) ξ_n → ξ in L₂

and that in each case ξ is normal with zero mean.

12.23 Let {ξ_n}_{n=1}^∞ be a sequence of random variables such that for each n, ξ_n has a Poisson distribution with parameter λ_n. If ξ_n →d ξ (after any normalization needed) show that ξ has either a Poisson or normal distribution.

12.24 Show that sin(t/2)/(t/2) is the c.f. of the uniform distribution on (–1/2, 1/2) and prove by using c.f.'s that for all real t,

lim_{n→∞} [sin(n^{–1/2}t)/(n^{–1/2}t)]^n = e^{–t²/6}.

12.25 Let {ξ_n}_{n=1}^∞ be independent random variables with finite means μ_n and variances σ_n², and let s_n² = Σ_{k=1}^n σ_k². Prove that the Lindeberg condition is satisfied, and thus the Lindeberg Central Limit Theorem (Corollary 2 of Theorem 12.6.2) is applicable, if the random variables {ξ_n}_{n=1}^∞:

(i) are uniformly bounded, i.e. for some 0 < M < ∞, |ξ_n| ≤ M a.s. for all n, and s_n² → ∞; or
(ii) are identically distributed; or
(iii) satisfy Liapounov's condition

(1/s_n^{2+δ}) Σ_{k=1}^n E(|ξ_k – μ_k|^{2+δ}) → 0 for some δ > 0.
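The limit in Ex. 12.24 is easy to check numerically. The sketch below (an illustration, not part of the exercises; the function name is made up) compares [sin(n^{–1/2}t)/(n^{–1/2}t)]^n with e^{–t²/6} for large n.

```python
import math

def cf_uniform_ratio(t: float, n: int) -> float:
    """(sin(h t)/(h t))^n with h = n**-0.5: the c.f. of n^{-1/2} times a sum
    of n i.i.d. uniform(-1/2, 1/2) random variables, evaluated at t."""
    h = n ** -0.5
    return (math.sin(h * t) / (h * t)) ** n

# For large n this should be close to the limiting normal c.f. e^{-t^2/6}.
for t in (0.5, 1.0, 2.0):
    approx = cf_uniform_ratio(t, 10**6)
    limit = math.exp(-t**2 / 6)
    assert abs(approx - limit) < 1e-3, (t, approx, limit)
```

The error is of order t⁴/n, so n = 10⁶ gives agreement well inside the tolerance used above.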

12.26 If two c.f.'s φ₁, φ₂ are equal on a neighborhood of zero then whatever derivatives of φ₁ exist at zero must be equal to those of φ₂ there. Hence existing moments corresponding to each distribution must be the same. Show, however, that it is not necessarily true that φ₁ = φ₂ everywhere, and hence not necessarily true that the d.f.'s are the same. Note that if φ₂ ≡ 1 and φ₁ = φ₂ in a neighborhood of zero it is true that φ₁ = φ₂ everywhere.

13

Conditioning

13.1 Motivation

In this chapter (Ω, F, P) will, as usual, denote a fixed probability space. If A and B are two events and P(B) > 0, the conditional probability P(A|B) of A given B is defined to be

P(A|B) = P(A ∩ B)/P(B)

and has a good interpretation; given that event B occurs, the probability of event A is proportional to the probability of the part of A which lies in B. It has also an appealing frequency interpretation – as the proportion of those repetitions of the experiment in which B occurs, for which A also occurs.

It is also important to be able to define P(A|B) in many cases for which P(B) = 0, for example if B is the event η = y where η is a continuous r.v. and y is a fixed value. There are various ways of making an appropriate definition depending on the purpose at hand. Here we are interested in integration over y to provide formulae such as

P(A) = ∫ P(A|η = y) f(y) dy    (13.1)

if η has a density f, which will be a particular case of the general definitions to be given. Other situations require different conditioning definitions – e.g. especially if particular fixed values of y are involved without integration, in a condition η = y. A particular such case occurs if η(t) is the value of say temperature at time t and one is interested in defining P(A|η(t) = 0). The definition used for (13.1) will not have the empirical interpretation as the proportion of those time instants t where η(t) = 0 for which A occurs. In such cases so-called "Palm distributions" can be appropriate. Here, however, we consider the definitions of conditional probability and expectation for obtaining the probability P(A) by conditioning on values of a r.v. η and integrating over those values as in (13.1). This will be achieved

in a much more general setting via the Radon–Nikodym Theorem, (13.1) being a quite special case.

To motivate the approach it is illuminating to proceed from the special case where η is a r.v. which can take one of n possible values y₁, y₂, ..., y_n with P(η = y_j) = p_j > 0, 1 ≤ j ≤ n, Σ_{j=1}^n p_j = 1. Then for all A ∈ F, P(A|η = y_j) = P(A ∩ η^{–1}{y_j})/Pη^{–1}{y_j} so that

P(A) = Σ_j P(A ∩ η^{–1}(y_j)) = Σ_j P(A|η = y_j) p_j = ∫_{–∞}^∞ P(A|η = y) dPη^{–1}(y)

where P(A|η = y) is P(A|η = y_j) at y_j and (say) zero otherwise. More generally it is easily shown that for all A ∈ F and B ∈ B

P(A ∩ η^{–1}B) = ∫_B P(A|η = y) dPη^{–1}(y).    (13.2)

This relation holds in the above case where Pη^{–1} is confined to the points y₁, y₂, ..., y_n so that the condition "η = y" has positive probability for each such value. However, in other cases where Pη^{–1} need not have atoms, the relation may (as will be seen) be used to provide a definition of P{A|η = y}.

First, however, note that in the case considered (13.2) may be written with g(y) = P(A|η = y) as

P(A ∩ η^{–1}B) = ∫_B g(y) dPη^{–1}(y) = ∫_{η^{–1}B} g(η(ω)) dP(ω).

Since σ(η) = σ{η^{–1}(B) : B ∈ B} it follows that for E ∈ σ(η)

P(A ∩ E) = ∫_E g(η(ω)) dP(ω).

The function g(η(ω)) depends on the set A ∈ F and writing it explicitly as P(A|η)(ω) we have

P(A ∩ E) = ∫_E P(A|η)(ω) dP(ω)    (13.3)

for each A ∈ F, E ∈ σ(η). Since g is trivially Borel measurable, P(A|η) as defined on Ω is a σ(η)-measurable function for each fixed A ∈ F and is referred to as the "conditional probability of A given η". This is related to but distinguished from the function P(A|η = y) in (13.2), naturally referred to as the "conditional probability of A given η = y".

The version P(A|η)(ω) leads to a yet more general abstraction. The function P(A|η)(ω) was defined in such a way that it is σ(η)-measurable and satisfies (13.3) for each E ∈ σ(η).
These requirements involve η only through its generated σ-field σ(η) (⊂ F) and it is therefore natural to write alternatively

P(A|η)(ω) = P(A|σ(η))(ω)

for a σ(η)-measurable function of ω satisfying (13.3) for E ∈ σ(η). This immediately suggests a generalization to consider arbitrary σ-fields G ⊂ F and to define the conditional probability P(A|G)(ω) of A ∈ F with respect to the σ-field G ⊂ F as a G-measurable function such that

P(A ∩ E) = ∫_E P(A|G)(ω) dP(ω) for each A ∈ F, E ∈ G.

Existence of such a function follows simply from the Radon–Nikodym Theorem. However, this will be done within the context of conditional expectations E(ξ|G) of a r.v. ξ (with E|ξ| < ∞), with P(A|G) = E(χ_A|G) appearing as a special case. The conditioning P(A|η = y) "given the value of a r.v. η" considered above, will be discussed subsequently.
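The discrete case above can be verified directly on a finite sample space. The sketch below (a hypothetical two-dice example; the names `eta` and `p_A_given` are illustrative only) computes P(A|η = y_j) elementarily and checks that integrating it against the distribution of η recovers P(A), as in (13.2) with B = R.

```python
from fractions import Fraction as F

# Finite probability space: two dice throws, each of the 36 outcomes
# equally likely.
omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]
p = {w: F(1, 36) for w in omega}

eta = lambda w: w[0] + w[1]          # a discrete r.v. (the total)
A = {w for w in omega if w[0] == 6}  # the event "first die shows 6"

# P(A | eta = y) = P(A ∩ eta^{-1}{y}) / P(eta^{-1}{y}) for each attained y.
values = sorted({eta(w) for w in omega})
p_eta = {y: sum(p[w] for w in omega if eta(w) == y) for y in values}
p_A_given = {y: sum(p[w] for w in omega if eta(w) == y and w in A) / p_eta[y]
             for y in values}

# Integrating the conditional probability over the distribution of eta
# recovers P(A).
total = sum(p_A_given[y] * p_eta[y] for y in values)
assert total == sum(p[w] for w in A) == F(1, 6)
```

Exact rational arithmetic (`fractions`) is used so the identity holds with equality rather than up to rounding.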

13.2 Conditional expectation given a σ-field

Let ξ be a r.v. with E|ξ| < ∞ and G a sub-σ-field of F. The conditional expectation of ξ given G will be defined in a way which extends the definition of conditional probability suggested in the previous section. Consider the set function ν defined for all E ∈ G by

ν(E) = ∫_E ξ dP.

Then ν is a finite signed measure on G and ν ≪ P_G where P_G denotes the restriction of P from F to G. Thus by the Radon–Nikodym Theorem (Theorem 5.5.3) there is a finite-valued G-measurable and P_G-integrable function f on Ω, uniquely determined a.s. (P_G), such that for all E ∈ G,

ν(E) = ∫_E f dP_G = ∫_E f dP

(for the second equality see Ex. 4.10). We write f = E(ξ|G) and call it the conditional expectation of ξ given the σ-field G. Thus the conditional expectation E(ξ|G) of ξ given G is a G-measurable and P-integrable r.v. which is determined uniquely a.s. by the equality

∫_E ξ dP = ∫_E E(ξ|G) dP for all E ∈ G.

It is readily seen that this definition extends that suggested in Section 13.1 when G = σ(η) for a r.v. η taking a finite number of values (Ex. 13.1). The equality may also be rephrased in "E-form" as E(χ_E ξ) = E(χ_E E(ξ|G)) for all E ∈ G.

If η is a r.v. the conditional expectation E(ξ|η) of ξ given η is defined by taking G = σ(η), i.e. E(ξ|η) = E(ξ|σ(η)), so that E(ξ|η) is a σ(η)-measurable function f satisfying ∫_E ξ dP = ∫_E f dP for each E ∈ σ(η). It is enough that this equality holds for all E of the form η^{–1}(B) for B ∈ B since the class of such sets is either σ(η) if η is defined for all ω, or otherwise generates σ(η).
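When G is generated by a finite partition of a space with equally likely points, the Radon–Nikodym derivative above is simply the blockwise average, and the defining equality ∫_E ξ dP = ∫_E E(ξ|G) dP can be checked block by block. A minimal numerical sketch (the partition and sample values are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# N equally likely sample points; G is generated by a partition into blocks.
N = 12
xi = rng.normal(size=N)
blocks = [range(0, 3), range(3, 7), range(7, 12)]

# E(xi|G) is constant on each block, equal to the block average: the
# Radon-Nikodym derivative of nu(E) = ∫_E xi dP with respect to P on G.
cond = np.empty(N)
for blk in blocks:
    idx = list(blk)
    cond[idx] = xi[idx].mean()

# Defining equality ∫_E xi dP = ∫_E E(xi|G) dP for every E in G:
# it suffices to check each generating block.
for blk in blocks:
    idx = list(blk)
    assert abs(xi[idx].sum() - cond[idx].sum()) < 1e-12

# Taking E = Omega gives E{E(xi|G)} = E(xi).
assert abs(cond.mean() - xi.mean()) < 1e-12
```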

For a family {ηλ : λ ∈ Λ} of r.v.’s the conditional expectation E(ξ|ηλ : λ ∈ Λ) of ξ given {ηλ : λ ∈ Λ} is defined by

E(ξ|η_λ : λ ∈ Λ) = E(ξ|σ(η_λ : λ ∈ Λ))

where σ(η_λ : λ ∈ Λ) is the sub-σ-field of F generated by the union of the σ-fields {σ(η_λ) : λ ∈ Λ} (cf. Section 9.3).

The simplest properties of conditional expectations are stated in the following result.

Theorem 13.2.1 Let ξ and η be r.v.'s with finite expectations and a, b real numbers.

(i) E{E(ξ|G)} = Eξ.
(ii) E(aξ + bη|G) = aE(ξ|G) + bE(η|G) a.s.
(iii) If ξ = η a.s. then E(ξ|G) = E(η|G) a.s.
(iv) If ξ ≥ 0 a.s., then E(ξ|G) ≥ 0 a.s. Hence if ξ ≤ η a.s., then E(ξ|G) ≤ E(η|G) a.s.
(v) If ξ is G-measurable then E(ξ|G) = ξ a.s.

Proof (i) Since Ω ∈ G we have

Eξ = ∫_Ω ξ dP = ∫_Ω E(ξ|G) dP = E{E(ξ|G)}.

(ii) For every E ∈ G we have

∫_E (aξ + bη) dP = a ∫_E ξ dP + b ∫_E η dP = a ∫_E E(ξ|G) dP + b ∫_E E(η|G) dP = ∫_E {aE(ξ|G) + bE(η|G)} dP

and since the r.v. within brackets is G-measurable the result follows from the definition.

(iii) This is obvious from the definition of conditional expectation.

(iv) If ξ ≥ 0 a.s., ν (as defined at the start of this section, ν(E) = ∫_E ξ dP) is a measure (rather than a signed measure) and from the Radon–Nikodym Theorem we have E(ξ|G) ≥ 0 a.s. The second part follows from the first part and (ii) since by (ii) E(η|G) – E(ξ|G) = E((η – ξ)|G) ≥ 0 a.s.

(v) This also follows at once from the definition of conditional expectation. □

A variety of general results concerning conditional expectations will now be obtained – some involving conditional versions of standard theorems. The first is an important result on successive conditioning.

Theorem 13.2.2 If ξ is a r.v. with E|ξ| < ∞ and G1, G2 two σ-fields with G2 ⊂G1 ⊂F then

E{E(ξ|G1)|G2} = E(ξ|G2)=E{E(ξ|G2)|G1} a.s.

Proof Repeated use of the definition shows that for all E ∈ G₂ ⊂ G₁,

∫_E E{E(ξ|G₁)|G₂} dP = ∫_E E(ξ|G₁) dP = ∫_E ξ dP

which implies that E{E(ξ|G₁)|G₂} = E(ξ|G₂) a.s. The right hand equality follows from Theorem 13.2.1 (v). □
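The successive-conditioning identity of Theorem 13.2.2 can be seen concretely for nested partition-generated σ-fields, where conditioning is blockwise averaging. A small sketch (partitions and values invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# N equally likely points; G2 ⊂ G1 are generated by nested partitions
# (every G2-block is a union of G1-blocks).
N = 12
xi = rng.normal(size=N)
g1 = [range(0, 2), range(2, 4), range(4, 8), range(8, 12)]  # finer
g2 = [range(0, 4), range(4, 12)]                            # coarser

def cond_exp(x, partition):
    """Conditional expectation given a partition: blockwise average."""
    out = np.empty_like(x)
    for blk in partition:
        idx = list(blk)
        out[idx] = x[idx].mean()
    return out

lhs = cond_exp(cond_exp(xi, g1), g2)   # E{E(xi|G1)|G2}
rhs = cond_exp(xi, g2)                 # E(xi|G2)
assert np.allclose(lhs, rhs)

# E{E(xi|G2)|G1} = E(xi|G2), since E(xi|G2) is already G1-measurable.
assert np.allclose(cond_exp(cond_exp(xi, g2), g1), rhs)
```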

The fundamental convergence theorems for integrals and expectations (monotone and dominated convergence, Fatou’s Lemma) have conditional versions. We prove the monotone convergence result – the other two then follow from it in the same way as for the corresponding “unconditional” theorems.

Theorem 13.2.3 (Conditional Monotone Convergence Theorem) Let {ξn} be an increasing sequence of nonnegative r.v.’s with lim ξn = ξ a.s., where Eξ<∞. Then

E(ξ|G) = lim_{n→∞} E(ξ_n|G) a.s.

Proof By Theorem 13.2.1 (iv) the sequence {E(ξ_n|G)} is increasing and nonnegative a.s. The limit lim_{n→∞} E(ξ_n|G) is then G-measurable and two applications of (ordinary) monotone convergence give, for any E ∈ G,

∫_E lim_{n→∞} E(ξ_n|G) dP = lim_{n→∞} ∫_E E(ξ_n|G) dP = lim_{n→∞} ∫_E ξ_n dP = ∫_E ξ dP

showing that lim_{n→∞} E(ξ_n|G) satisfies the conditions required to be a version of E(ξ|G) and hence the desired result follows. □

Theorem 13.2.4 (Conditional Fatou Lemma) Let {ξn} be a sequence of nonnegative r.v.’s with Eξn < ∞ and E{lim infn→∞ ξn} < ∞. Then

E(lim inf_{n→∞} ξ_n|G) ≤ lim inf_{n→∞} E(ξ_n|G) a.s.

This and the next result will not be proved here since – as already noted – they follow from Theorem 13.2.3 in the same way as the ordinary versions of Fatou's Lemma and dominated convergence follow from monotone convergence.

Theorem 13.2.5 (Conditional Dominated Convergence Theorem) Let {ξn} be a sequence of r.v.’s with ξn → ξ a.s. and |ξn|≤η a.s. for all n where E|η| < ∞. Then

E(ξ|G) = lim_{n→∞} E(ξ_n|G) a.s.

The following result is frequently useful.

Theorem 13.2.6 Let ξ, η be r.v.'s with E|η| < ∞, E|ξη| < ∞ and such that η is G-measurable (ξ being F-measurable, of course). Then

E(ξη|G) = ηE(ξ|G) a.s.

Proof If η = χ_G for some G ∈ G then ηE(ξ|G) is G-measurable and for any E ∈ G,

∫_E ηE(ξ|G) dP = ∫_{E∩G} E(ξ|G) dP = ∫_{E∩G} ξ dP = ∫_E ξη dP

and hence E(ξη|G) = ηE(ξ|G) a.s. It follows from Theorem 13.2.1 (ii) that the result is true for simple G-measurable r.v.'s η.

Now if η is an arbitrary G-measurable r.v. (with η ∈ L₁, ξη ∈ L₁), let {η_n} be a sequence of simple G-measurable r.v.'s such that for all ω ∈ Ω, lim_n η_n(ω) = η(ω) and |η_n(ω)| ≤ |η(ω)| for all n (Theorem 3.5.2, Corollary). It then follows from the conditional dominated convergence theorem (|ξη_n| ≤ |ξη| ∈ L₁) that

E(ξη|G) = lim_{n→∞} E(η_n ξ|G) = lim_{n→∞} η_n E(ξ|G) = ηE(ξ|G) a.s. □

The next result shows that in the presence of independence conditional expectation is the same as expectation.

Theorem 13.2.7 If ξ is a r.v. with E|ξ| < ∞ and σ(ξ) and G are independent then E(ξ|G) = Eξ a.s.

In particular if ξ and η are independent r.v.'s and E|ξ| < ∞, then E(ξ|η) = Eξ a.s.

Proof For any E ∈ G the r.v.'s ξ and χ_E are independent and thus

∫_E ξ dP = E(ξχ_E) = Eξ · Eχ_E = ∫_E Eξ dP.

Since the constant Eξ is G-measurable, it follows that E(ξ|G) = E(ξ) a.s. □
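The "pull-out" property of Theorem 13.2.6 is also easy to check on a partition example, since a G-measurable factor is constant on each block. A sketch with invented data:

```python
import numpy as np

rng = np.random.default_rng(2)

# Equally likely points; G generated by a partition into blocks.
N = 12
blocks = [range(0, 4), range(4, 9), range(9, 12)]
xi = rng.normal(size=N)

def cond_exp(x, partition):
    """Conditional expectation given a partition: blockwise average."""
    out = np.empty_like(x)
    for blk in partition:
        idx = list(blk)
        out[idx] = x[idx].mean()
    return out

# A G-measurable r.v. is constant on each block of the partition.
eta = np.empty(N)
for c, blk in zip((2.0, -1.0, 0.5), blocks):
    eta[list(blk)] = c

# Theorem 13.2.6: E(xi*eta | G) = eta * E(xi | G).
assert np.allclose(cond_exp(xi * eta, blocks), eta * cond_exp(xi, blocks))
```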

The conditional expectation E(ξ|η) of ξ given a r.v. η is σ(η)-measurable and hence, as shown in the next result, it is a Borel measurable function of η.

Theorem 13.2.8 If ξ and η are r.v.'s with E|ξ| < ∞ then there is a Borel measurable function h on R such that E(ξ|η) = h(η) a.s.

Proof This follows immediately from Theorem 3.5.3 since E(ξ|η) is σ(η)-measurable, i.e. E(ξ|η)(ω) = h(η(ω)) for some (Borel) measurable h. □

Finally in this list we note the occasionally useful property that conditional expectations satisfy Jensen's Inequality just as expectations do.

Theorem 13.2.9 If g is a convex function on R and ξ and g(ξ) have finite expectations then

g(E{ξ|G}) ≤ E{g(ξ)|G} a.s.

Proof As stated in the proof of Theorem 9.5.4, g(x) ≥ g(y) + (x – y)h(y) for all x and y and some h(y) which is easily seen to be bounded on closed and bounded intervals. Thus whenever y_n → x, g(y_n) + (x – y_n)h(y_n) → g(x). Hence for every real x,

g(x) = sup_{r rational} {g(r) + (x – r)h(r)}.

Putting x = ξ and y = r in the inequality gives

g(ξ) ≥ g(r) + (ξ – r)h(r) a.s.

and by taking conditional expectations and using (ii) and (iv) of Theorem 13.2.1

E{g(ξ)|G} ≥ g(r) + (E(ξ|G) – r)h(r) a.s.

Since the last inequality holds for all rational r, by taking the supremum of the right hand side and combining a countable set of events of zero probability we find

E{g(ξ)|G} ≥ sup_{r rational} {g(r) + (E(ξ|G) – r)h(r)} = g(E{ξ|G}) a.s. □

A different proof is suggested in Ex. 13.7.
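For the convex function g(x) = x², the conditional Jensen inequality says that on each block the square of the block mean is at most the block mean of the squares. A quick check on invented partition data:

```python
import numpy as np

rng = np.random.default_rng(3)

# Equally likely points; G generated by a partition into blocks.
N = 12
blocks = [range(0, 5), range(5, 8), range(8, 12)]
xi = rng.normal(size=N)

def cond_exp(x, partition):
    """Conditional expectation given a partition: blockwise average."""
    out = np.empty_like(x)
    for blk in partition:
        idx = list(blk)
        out[idx] = x[idx].mean()
    return out

# Conditional Jensen with g(x) = x^2: g(E(xi|G)) <= E(g(xi)|G) pointwise
# (a small tolerance guards against floating-point rounding).
assert np.all(cond_exp(xi, blocks) ** 2 <= cond_exp(xi ** 2, blocks) + 1e-12)
```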

13.3 Conditional probability given a σ-field

If A is an event in F and G is a sub-σ-field of F the conditional probability P(A|G) of A given G is defined by

P(A|G) = E(χ_A|G).

Then for E ∈ G, P(A ∩ E) = ∫_E χ_A dP = ∫_E E(χ_A|G) dP = ∫_E P(A|G) dP, so that P(A|G) is a G-measurable (and P-integrable) r.v. which is determined uniquely a.s. by the equality

P(A ∩ E) = ∫_E P(A|G) dP for all E ∈ G

(i.e. P(A ∩ E) = E{χ_E P(A|G)}). In particular (by putting E = Ω)

P(A) = ∫_Ω P(A|G) dP (i.e. EP(A|G) = P(A))

for all A ∈ F. If η is a r.v. then the conditional probability P(A|η) of A ∈ F given η is defined as P(A|η) = P(A|σ(η)) = E(χ_A|η). The particular consequence EP(A|η) = P(A) is, of course, natural.

The properties of conditional probability follow immediately from those of conditional expectation. Some of these properties are collected in the following theorems for ease of reference.

Theorem 13.3.1 (i) If A ∈ G then

P(A|G)(ω) = χ_A(ω) = 1 for ω ∈ A, 0 for ω ∉ A, a.s.

(ii) If the event A is independent of the class G of events then

P(A|G)(ω)=P(A) a.s.

Theorem 13.3.2 (i) If A ∈ F then 0 ≤ P(A|G) ≤ 1 a.s.

(ii) P(Ω|G) = 1 a.s., P(∅|G) = 0 a.s.

(iii) If {A_n} is a disjoint sequence of events in F and A = ∪_{n=1}^∞ A_n then

P(A|G) = Σ_{n=1}^∞ P(A_n|G) a.s.

(iv) If A, B ∈ F and A ⊂ B then

P(A|G) ≤ P(B|G) a.s.

and P(B – A|G) = P(B|G) – P(A|G) a.s.

(v) If {A_n}_{n=1}^∞ is a monotone (increasing or decreasing) sequence of events in F and A is its limit, then

P(A|G) = lim_{n→∞} P(A_n|G) a.s.

Proof These conclusions follow readily from the properties established for conditional expectations. For example, to show (iii) note that χ_A = Σ_{n=1}^∞ χ_{A_n} and conditional monotone convergence (Theorem 13.2.3) gives E(χ_A|G) = Σ_{n=1}^∞ E(χ_{A_n}|G) a.s., which simply restates (iii). □

The above properties look like those of a probability measure, with the exception that they hold a.s., and it is natural to ask whether for fixed ω ∈ Ω, P(A|G)(ω) as a function of A ∈ F is a probability measure. Unfortunately the answer is in general negative and this is due to the fact that the exceptional G-measurable set of zero probability that appears in each property of Theorem 13.3.2 depends on the events for which the property is expressed. In particular property (i) stated in detail would read:

(i) For every A ∈ F there is N_A ∈ G depending on A such that P(N_A) = 0 and for all ω ∉ N_A, 0 ≤ P(A|G)(ω) ≤ 1.

It is then clear that the statement

0 ≤ P(A|G) ≤ 1 for all A ∈ F a.s.

is not necessarily true in general, since to obtain this we would need to combine the zero probability sets N_A to get a single zero probability set N. This can be done (as in the example of Section 13.1) if there are only countably many sets A ∈ F, but not necessarily otherwise. In fact, in general, there may not even exist an event E ∈ G with P(E) > 0 such that 0 ≤ P(A|G)(ω) ≤ 1 for all A ∈ F and all ω ∈ E. Thus in general there is no event E ∈ G with P(E) > 0 such that for every fixed ω ∈ E, P(A|G)(ω) is a probability measure on F.

In the next section we consider the case where the conditional probability does have a version which is a probability measure for all ω (a "regular conditional probability") and show that then conditional expectations can be expressed as integrals with respect to this version.

13.4 Regular conditioning

As seen in the previous section, conditional probabilities are not in general probability measures for fixed ω. If a conditional probability has a version which is a probability measure for all ω, then this version is called a regular conditional probability. Specifically let G be a sub-σ-field of F. A function P(A, ω) defined for each A ∈ F and ω ∈ Ω, with values in [0, 1], is called a regular conditional probability on F given G if

(i) for each fixed A ∈ F, P(A, ω) is a G-measurable function of ω, and for each fixed ω ∈ Ω, P(A, ω) is a probability measure on F, and

(ii) for each fixed A ∈ F, P(A, ω) = P(A|G)(ω) a.s.

Regular conditional probabilities do not always exist without any further assumptions on Ω, F and G. As we have seen, a simple case when they exist is when G is the σ-field generated by a discrete r.v. However, if a regular conditional probability does exist we can express conditional expectations as integrals with respect to it, just as ordinary expectations are expressed as integrals with respect to the probability measure. The notation ∫_Ω ξ(ω′)P(dω′, ω) will be convenient to indicate integration of ξ with respect to the measure P(·, ω).

Theorem 13.4.1 If ξ is a r.v. with E|ξ| < ∞, and P(A, ω) is a regular conditional probability on F given G, then

E(ξ|G)(ω) = ∫_Ω ξ(ω′)P(dω′, ω) a.s.

Proof If ξ = χ_A for some A ∈ F, then ∫_Ω ξ(ω′)P(dω′, ω) = P(A, ω) which is G-measurable and equal a.s. to P(A|G)(ω) = E(χ_A|G)(ω) = E(ξ|G)(ω). Thus ∫_Ω ξ(ω′)P(dω′, ω) is G-measurable and equal a.s. to E(ξ|G)(ω) when ξ is a set indicator. It follows by Theorem 13.2.1 (ii) that the same is true for a simple r.v. ξ and, by using the ordinary and the conditional monotone convergence theorem, it is also true for any r.v. ξ ≥ 0 with Eξ < ∞. Using again Theorem 13.2.1 (ii), this is also true for any r.v. ξ with E|ξ| < ∞. □

If one is only interested in expressing a conditional expectation E{g(ξ)|G} for a particular ξ and Borel measurable g, as an integral with respect to a conditional probability (as in the previous theorem), then attention may be restricted to conditional probabilities P(A|G) of events A in σ(ξ), since F may be replaced by σ(ξ) in defining integrals of ξ over Ω (Ex. 4.10). We will call this restriction the conditional probability of ξ given G and it will be seen in Theorem 13.4.5 that a regular version exists under a simple condition on ξ. To be precise let ξ be a r.v. and G a sub-σ-field of F.
A function P_{ξ|G}(A, ω) defined for each A ∈ σ(ξ) and ω ∈ Ω, with values in [0, 1], is called a regular conditional probability of ξ given G if

(i) for each fixed A ∈ σ(ξ), P_{ξ|G}(A, ω) is a G-measurable function of ω, and for each fixed ω ∈ Ω, P_{ξ|G}(A, ω) is a probability measure on σ(ξ), and

(ii) for each fixed A ∈ σ(ξ), P_{ξ|G}(A, ω) = P(A|G)(ω) a.s.

Theorem 13.4.5 will show that under a very mild condition on ξ (that the range of ξ is a Borel set) a regular conditional probability P_{ξ|G} of ξ given G exists for all G. Also as already noted if G = σ(η) and η is a discrete r.v. then a regular conditional probability P_{ξ|G} exists. Two further cases where P_{ξ|G} exists trivially (in view of Theorem 13.3.1) are the following:

(i) if σ(ξ) and G are independent then

P_{ξ|G}(A, ω) = P(A) for all A ∈ σ(ξ) and ω ∈ Ω

and (ii) if ξ is G-measurable then

P_{ξ|G}(A, ω) = χ_A(ω) for all A ∈ σ(ξ) and ω ∈ Ω.

As will now be shown, when a regular conditional probability of ξ given G exists, then the conditional expectation of every σ(ξ)-measurable r.v. with finite expectation can be expressed as an integral with respect to the regular conditional probability.

Theorem 13.4.2 If ξ is a r.v., g a Borel measurable function on R such that E|g(ξ)| < ∞, and P_{ξ|G} is a regular conditional probability of ξ given G, then

E{g(ξ)|G}(ω) = ∫_Ω g(ξ(ω′)) P_{ξ|G}(dω′, ω) a.s.

Proof The proof extends that of Theorem 13.4.1, with σ(ξ) replacing F. If A ∈ σ(ξ) the r.v. η = χ_A satisfies E(η|G)(ω) = ∫ η(ω′) P_{ξ|G}(dω′, ω) a.s. This remains true if χ_A is replaced by a nonnegative simple σ(ξ)-measurable r.v. η and hence, by the standard extension (cf. Theorem 13.4.1), for any σ(ξ)-measurable η with E|η| < ∞. But g(ξ) is such a r.v. and hence the result follows. □

The distribution of a r.v. ξ (Chapter 9) is the probability measure Pξ^{–1} induced from P on the Borel sets of the real line by ξ and expectations of functions of ξ are expressible as integrals with respect to this distribution. Similarly, conditional distributions on the Borel sets of the real line may be induced from regular conditional probabilities and used to obtain conditional expectations. Indeed if the regular conditional probability P_{ξ|G}(A, ω) of ξ given G exists then a (regular) conditional distribution Q_{ξ|G}(B, ω) of ξ given G may be defined for any Borel set B on the real line (i.e. B ∈ B) and ω ∈ Ω by

Q_{ξ|G}(B, ω) = P_{ξ|G}(ξ^{–1}B, ω) for all B ∈ B, ω ∈ Ω.

Clearly Q_{ξ|G} has properties similar to P_{ξ|G} and the only problem is that this "definition" of Q_{ξ|G} requires the existence of P_{ξ|G} (which is not always guaranteed). However, this problem is easily eliminated by defining Q_{ξ|G} in terms of properties it inherits from P_{ξ|G} but without reference to the latter. More specifically let ξ be a r.v. and G a sub-σ-field of F. A function

Q_{ξ|G}(B, ω) defined for each B ∈ B and ω ∈ Ω, with values in [0, 1], is called a regular conditional distribution of ξ given G if

(i) for each fixed B ∈ B, Q_{ξ|G}(B, ω) is a G-measurable function of ω, and for each fixed ω ∈ Ω, Q_{ξ|G}(B, ω) is a probability measure on the Borel sets B, and

(ii) for each fixed B ∈ B, Q_{ξ|G}(B, ω) = P(ξ^{–1}B|G)(ω) a.s.

It is clear that if a regular conditional probability P_{ξ|G} of ξ given G exists then Q_{ξ|G} as defined above from it, is a regular conditional distribution of ξ given G. We shall see that, in contrast to regular conditional probability, a regular conditional distribution of ξ given G always exists (Theorem 13.4.3) and that the conditional expectation of every σ(ξ)-measurable r.v. with finite expectation may be expressed as an integral over R with respect to the regular conditional distribution (Theorem 13.4.4).

As for the regular conditional probability of ξ given G the following intuitively appealing results hold:

(i) if σ(ξ) and G are independent, then

Q_{ξ|G}(B, ω) = Pξ^{–1}(B) for all B ∈ B and ω ∈ Ω,

i.e. for each fixed ω ∈ Ω the conditional distribution of ξ given G is just the distribution of ξ.

(ii) If ξ is G-measurable, then

Q_{ξ|G}(B, ω) = χ_{ξ^{–1}B}(ω) = χ_B(ξ(ω)) for all B ∈ B and ω ∈ Ω,

i.e. for each fixed ω ∈ Ω the conditional distribution of ξ given G is a probability measure concentrated at the point ξ(ω).

Theorem 13.4.3 If ξ is a r.v. and G a sub-σ-field of F , then there exists a regular conditional distribution of ξ given G.

Proof Write A_x = ξ^{–1}(–∞, x] for any real x. By Theorem 13.3.2 it is clear that for any fixed x, y with x ≥ y, P(A_x|G)(ω) ≥ P(A_y|G)(ω) a.s.; for any fixed x, P(A_{x+1/n}|G)(ω) → P(A_x|G)(ω) a.s. as n → ∞; and for any fixed sequence {x_n} with x_n → ∞ (–∞), P(A_{x_n}|G)(ω) → 1 (0) a.s. By combining a countable number of zero measure sets in G we obtain a G-measurable set N with P(N) = 0 such that for each ω ∉ N

(a) P(A_x|G)(ω) is a nondecreasing function of rational x
(b) lim_{n→∞} P(A_{x+1/n}|G)(ω) = P(A_x|G)(ω) for all rational x
(c) lim_{x→∞} P(A_x|G)(ω) = 1 and lim_{x→–∞} P(A_x|G)(ω) = 0 for rational x → ±∞.

Define functions F(x, ω) as follows:

for ω ∉ N: F(x, ω) = P(A_x|G)(ω) if x is rational,
           = lim{F(r, ω) : r rational, r ↓ x} if x is irrational;
for ω ∈ N: F(x, ω) = 0 or 1 according as x < 0 or x ≥ 0.

Then it is easily checked that F(x, ω) is a distribution function for each fixed ω ∈ Ω and hence defines a probability measure Q(B, ω) on the class B of Borel sets, satisfying Q((–∞, x], ω) = F(x, ω) for each real x. It will follow that Q(B, ω) is the desired regular conditional distribution of ξ given G if we show that for each B ∈ B,

(i) Q(B, ω) is a G-measurable function of ω
(ii) Q(B, ω) = P(ξ^{–1}B|G)(ω) a.s.

Let D be the class of all Borel sets B for which (i) and (ii) hold. If x is rational and B = (–∞, x], then Q(B, ω) = F(x, ω), which is equal to the G-measurable function P(A_x|G)(ω) if ω ∉ N and a constant (0 or 1) if ω ∈ N. Further N ∈ G and P(N) = 0. Since A_x = ξ^{–1}B, (i) and (ii) both follow when B = (–∞, x], for rational x. Thus (–∞, x] ∈ D when x is rational.

It is easily checked that D is a D-class. If B_i are disjoint sets of D, with B = ∪_1^∞ B_i, we have Q(B, ω) = Σ_1^∞ Q(B_i, ω) which is G-measurable since each term is, so that (i) holds. Also, Σ_1^∞ Q(B_i, ω) = Σ_1^∞ P(ξ^{–1}B_i|G)(ω) = P(∪_1^∞ ξ^{–1}B_i|G)(ω) a.s. by Theorem 13.3.2, and this is P(ξ^{–1}B|G), so that D is closed under countable disjoint unions. Similarly it is closed under proper differences. Thus D is a D-class containing the class of all sets of the form (–∞, x] for rational x. But this latter class is closed under intersections, and its generated σ-ring is B (cf. Ex. 1.21). Hence D ⊃ B, as desired. □

The following result shows in particular that the conditional expectation of a function g of a r.v. ξ may be obtained by integrating g with respect to a regular conditional distribution of ξ (cf. Theorem 13.4.2).

Theorem 13.4.4 Let ξ be a r.v. and Q_{ξ|G} a regular conditional distribution of ξ given G. Let η be a G-measurable r.v. and g a Borel measurable function on the plane such that E|g(ξ, η)| < ∞. Then

E{g(ξ, η)|G}(ω) = ∫_{–∞}^∞ g(x, η(ω)) Q_{ξ|G}(dx, ω) a.s.

In particular, if E is a Borel measurable set of the plane and E^y its y-section {x ∈ R : (x, y) ∈ E}, then

P{(ξ, η) ∈ E|G}(ω) = Q_{ξ|G}(E^{η(ω)}, ω) a.s.

Proof We will first show that for every E ∈ B², Q_{ξ|G}(E^{η(ω)}, ω) is G-measurable and P{(ξ, η) ∈ E|G}(ω) = Q_{ξ|G}(E^{η(ω)}, ω) a.s. Let E = A × B where A, B ∈ B. Then Q_{ξ|G}(E^{η(ω)}, ω) = Q_{ξ|G}(A, ω) or Q_{ξ|G}(∅, ω) according as η(ω) ∈ B or B^c, so that clearly Q_{ξ|G}(E^{η(ω)}, ω) is G-measurable. Further, since Q_{ξ|G}(A, ω) = P(ξ^{–1}A|G) a.s. and P(ξ^{–1}∅|G) = 0 a.s., it follows that

Q_{ξ|G}(E^{η(ω)}, ω) = χ_{η^{–1}B}(ω) P{ξ^{–1}A|G}(ω) a.s.
                     = χ_{η^{–1}B}(ω) E{χ_{ξ^{–1}A}|G}(ω) a.s.
                     = E{χ_{ξ^{–1}A} χ_{η^{–1}B}|G}(ω) a.s.
                     = P{(ξ, η) ∈ E|G}(ω) a.s.

(since χ_{η^{–1}B} is σ(η)-measurable, hence G-measurable). Hence Q_{ξ|G}(E^{η(ω)}, ω) is (a version of) P{(ξ, η) ∈ E|G} when E = A × B, A, B ∈ B.

Now denote by D the class of subsets E of R² such that Q_{ξ|G}(E^{η(ω)}, ω) is G-measurable and P{(ξ, η) ∈ E|G}(ω) = Q_{ξ|G}(E^{η(ω)}, ω) a.s. (the exceptional set depending in general on each set E). Then by writing P{(ξ, η) ∈ E|G} = E{χ_{(ξ,η)∈E}|G} and using the properties of conditional expectation and the regular conditional distribution it is seen immediately that D is a D-class (i.e. closed under countable disjoint unions and proper differences). Since D contains the Borel measurable rectangles of R², it will contain the σ-field they generate, the Borel sets B² of R². Hence the second equality of the theorem is proved.

The first equality is then obtained by the usual extension. If g = χ_E, the indicator of a set E ∈ B², then by the above the equality holds. Hence it also holds for a B²-measurable simple function g. By using the ordinary and the conditional monotone convergence theorem (and Theorem 3.5.2) we see that it is true for all nonnegative B²-measurable functions g and hence also for all g as in the theorem. □

Since a regular conditional distribution Q_{ξ|G} of ξ given G always exists, one may attempt to obtain a regular conditional probability P_{ξ|G} of ξ given G by

P_{ξ|G}(A, ω) = Q_{ξ|G}(B, ω) when A ∈ σ(ξ), B ∈ B, A = ξ^{–1}B

(as was pointed out earlier in this section, if P_{ξ|G} exists this relationship defines a regular conditional distribution Q_{ξ|G}). However, given A ∈ σ(ξ) there may be several Borel sets B such that A = ξ^{–1}B for which the values Q_{ξ|G}(B, ω) are not all equal (for fixed ω), and then P_{ξ|G} is not defined in the above way. Under a rather mild condition on ξ it is shown in the following theorem that this difficulty is eliminated and a regular conditional probability can then be defined from a regular conditional distribution.

Theorem 13.4.5 Let ξ be a r.v. (for convenience defined for all ω) and G a sub-σ-field of F . If the range E = {ξ(ω): ω ∈ Ω} of ξ is a Borel set then there exists a regular conditional probability of ξ given G.

Proof Let Q_{ξ|G} be a regular conditional distribution of ξ given G, which always exists by Theorem 13.4.3. Then since E ∈ B and ξ^{–1}(E) = Ω,

Q_{ξ|G}(E, ω) = P(ξ^{–1}(E)|G)(ω) = P(Ω|G)(ω) = 1 a.s.

and thus there is a set N ∈ G, with P(N) = 0, such that for all ω ∉ N, Q_{ξ|G}(E, ω) = 1.

Now fix A ∈ σ(ξ) with A = ξ^{–1}(B₁) = ξ^{–1}(B₂) where B₁, B₂ ∈ B. Then B₁ – B₂ and B₂ – B₁ are Borel subsets of E^c and thus for all ω ∉ N (since Q_{ξ|G}(·, ω) is a measure for every ω)

Q_{ξ|G}(B₁ – B₂, ω) = 0 = Q_{ξ|G}(B₂ – B₁, ω)

so that

Q_{ξ|G}(B₁, ω) = Q_{ξ|G}(B₁ ∩ B₂, ω) = Q_{ξ|G}(B₂, ω).

Hence the following definition is unambiguous:

P_{ξ|G}(A, ω) = Q_{ξ|G}(B, ω) for ω ∉ N and all A ∈ σ(ξ),
P_{ξ|G}(A, ω) = p(A) for ω ∈ N,

where B ∈ B is such that A = ξ^{–1}(B) and p is an arbitrary but fixed probability measure on σ(ξ). Since Q_{ξ|G} is a regular conditional distribution of ξ given G and since P(N) = 0, it is clear that P_{ξ|G} is a regular conditional probability of ξ given G. □

Finally, if η is a r.v. then the following notions:

regular conditional probability on F given η,
regular conditional probability of ξ given η,
regular conditional distribution of ξ given η,

are defined (as usual) as the corresponding quantities introduced in this section with G = σ(η), the notation used here for the last two being P_{ξ|η} and Q_{ξ|η}. A regular conditional distribution Q_{ξ|η} of ξ given η always exists (Theorem 13.4.3) and the conditional expectation given η of every σ(ξ, η)-measurable r.v. with finite expectation is expressed as an integral with respect to Q_{ξ|η}, as follows from Theorem 13.4.4. Thus, if g is a Borel measurable function on the plane such that E|g(ξ, η)| < ∞, then

E{g(ξ, η)|η}(ω) = ∫_{–∞}^∞ g(x, η(ω)) Q_{ξ|η}(dx, ω) a.s.

In particular, if E is a Borel measurable set of the plane and E^y its y-section {x ∈ R : (x, y) ∈ E}, then

P{(ξ, η) ∈ E|η}(ω) = Q_{ξ|η}(E^{η(ω)}, ω) a.s.

13.5 Conditioning on the value of a r.v.

As promised in Section 13.1 we will now define conditional expectation (and hence then also conditional probability) given the event that a r.v. η takes the value y, which may have probability zero (possibly for every y). The conditional expectation given η = y will be defined first, giving the conditional probability as a particular case.

Specifically if ξ, η are r.v.'s, with E|ξ| < ∞, it is known by Theorem 13.2.8 that the conditional expectation of ξ given η is a Borel measurable function of η, i.e. E(ξ|η)(ω) = h(η(ω)) for some Borel function h. The conditional expectation of ξ given the value y of η may then be simply defined by

E(ξ|η = y) = h(y)

that is, E(ξ|η = y) may be regarded as a version of the conditional expectation induced on R by the transformation η(ω) (and thus Borel, rather than σ(η)-measurable). If B ∈ B it follows at once that

∫_B E(ξ|η = y) dPη^{–1}(y) = ∫_B h(y) dPη^{–1}(y) = ∫_{η^{–1}B} h(η(ω)) dP(ω) = ∫_{η^{–1}B} E(ξ|η)(ω) dP(ω) = ∫_{η^{–1}B} ξ dP.

Since in particular ∫_B h(y) dPη^{–1}(y) = ∫_{η^{–1}B} ξ dP, any two choices of h(y) have the same integral ∫_B h dPη^{–1} for every B and hence must be equal a.s. (Pη^{–1}), so that E(ξ|η = y) is uniquely defined (a.s.).

This is, of course, totally analogous to the defining property for E(ξ|η) and may be similarly used as an independent definition of E(ξ|η = y), as indicated in the following result.

Theorem 13.5.1 For a r.v. ξ with E|ξ| < ∞ and a r.v. η, the conditional expectation of ξ given η = y may be equivalently defined (uniquely a.s. (Pη^{–1})) as a B-measurable function E{ξ|η = y} satisfying

∫_{η^{–1}B} ξ dP = ∫_B E(ξ|η = y) dPη^{–1}(y) for each B ∈ B.

In particular it follows by taking B = R that

Eξ = ∫ E(ξ|η = y) dPη^{–1}(y) = ∫ E(ξ|η = y) dF_η(y)

where F_η is the d.f. of η.

Proof That E(ξ|η = y) exists satisfying the defining equation and is a.s. unique follows as above, or may be shown directly by use of the Radon–Nikodym Theorem, similarly to the definition of E(ξ|G) in Section 13.2. 

The conditional probability P(A|η = y) of A ∈ F given η = y is now defined as

P(A|η = y) = E(χA|η = y) a.s. (Pη⁻¹).

Thus P(A|η = y) is a Borel measurable (and Pη⁻¹-integrable) function on R which is determined uniquely a.s. (Pη⁻¹) by the equality

P(A ∩ η⁻¹B) = ∫_B P(A|η = y) dPη⁻¹(y) for all B ∈ B.

In particular, for B = R,

P(A) = ∫_{–∞}^{∞} P(A|η = y) dPη⁻¹(y).

Since P(A|η = y) = f(y) where P(A|η)(ω) = f(η(ω)), the properties of P(A|η = y) are easily deduced from those of P(A|η). In particular all properties of Theorem 13.3.2 are valid, with "given G" replaced by "given η = y" and "a.s." replaced by "a.s. (Pη⁻¹)".
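When η is discrete this last formula reduces to the elementary law of total probability, with the integral becoming a sum over the atoms of Pη⁻¹. A minimal sketch, using an assumed joint distribution purely for illustration:

```python
from fractions import Fraction

# Assumed joint distribution of (xi, eta) on a small grid (illustrative values).
joint = {
    (0, 0): Fraction(1, 8), (0, 1): Fraction(1, 4),
    (1, 0): Fraction(1, 8), (1, 1): Fraction(1, 2),
}

# Probability of an event described by a predicate on (x, y).
def P(pred):
    return sum(p for (x, y), p in joint.items() if pred(x, y))

p_A = P(lambda x, y: x == 1)            # the event A = {xi = 1}

# P(A | eta = y) = P(A, eta = y) / P(eta = y), defined when P(eta = y) > 0.
total = Fraction(0)
for y in {y for (_, y) in joint}:
    p_y = P(lambda x, yy: yy == y)
    p_A_given_y = P(lambda x, yy: x == 1 and yy == y) / p_y
    total += p_A_given_y * p_y

assert total == p_A   # P(A) = sum_y P(A | eta = y) P(eta = y)
print(p_A)            # 5/8
```

Exact rational arithmetic (`Fraction`) keeps the identity an equality rather than a floating-point approximation.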

In a similar way the following notions can be defined for r.v.'s ξ, η:
regular conditional probability of F given η = y
regular conditional probability of ξ given η = y
regular conditional distribution of ξ given η = y
with properties similar to those of the corresponding notions "given η" or "given G" as developed in Section 13.4. These definitions and properties will not all be listed here, in order to avoid overburdening the text, but as an example consider the third notion (which always exists), defined as follows. A function Qˆξ|η(B, y) defined on B × R to [0, 1] is called a regular conditional distribution of ξ given η = y if

(i) for each fixed B ∈ B, Qˆξ|η(B, y) is a Borel measurable function of y, and for each fixed y ∈ R, Qˆξ|η(·, y) is a probability measure on the Borel sets B, and
(ii) for each fixed B ∈ B, Qˆξ|η(B, y) = P(ξ⁻¹B|η = y) a.s. (Pη⁻¹).

As for a regular conditional distribution of ξ given η there are the following extreme cases:

(i) if ξ and η are independent then Qˆξ|η(B, y) = Pξ⁻¹(B) for all B ∈ B and y ∈ R, i.e. for every fixed y ∈ R, the conditional distribution of ξ given η = y is equal to the distribution of ξ; and
(ii) if ξ is σ(η)-measurable then Qˆξ|η(B, y) = χB(f(y)) for all B ∈ B and y ∈ R, where f is defined by ξ = f(η), i.e. for each fixed y ∈ R, the conditional distribution of ξ given η = y is a probability measure concentrated at the point f(y).

The main properties of a regular conditional distribution of ξ given η = y are collected in the following result.

Theorem 13.5.2 Let ξ and η be r.v.’s. Then

(i) There exists a regular conditional distribution of ξ given η = y.

(ii) If Qξ|η and Qˆ ξ|η are regular conditional distributions of ξ given η and given η = y respectively, then

Qξ|η(B, ω) = Qˆξ|η(B, η(ω)) for all B ∈ B and ω ∉ N

where N ∈ σ(η) and P(N) = 0.

(iii) If g is a Borel measurable function on the plane such that E|g(ξ, η)| < ∞, then

E{g(ξ, η)|η = y} = ∫_{–∞}^{∞} g(x, y) Qˆξ|η(dx, y) a.s. (Pη⁻¹).

In particular, if E is a Borel measurable set of the plane and E^y its y-section {x ∈ R : (x, y) ∈ E}, then

P{(ξ, η) ∈ E|η = y} = Qˆξ|η(E^y, y) a.s. (Pη⁻¹).

Proof The construction of a regular conditional distribution of ξ given η = y follows that of Theorem 13.4.3 in detail, with the obvious adjustments: "given G" is replaced by "given η = y", the exceptional G-measurable sets with P-measure zero become Borel sets with Pη⁻¹-measure zero, and instead of defining F(x, ω) from R × Ω to [0, 1], it is defined from R × R to [0, 1]. All the needed properties for conditional probabilities given η = y are valid since, as already noted, Theorem 13.3.2 holds with "G" replaced by "η = y". Now let Qξ|η and Qˆξ|η be regular conditional distributions of ξ given η and given η = y respectively. Then for each fixed B ∈ B, Qξ|η(B, ω) = P(ξ⁻¹B|η)(ω) a.s., Qˆξ|η(B, y) = P(ξ⁻¹B|η = y) a.s. (Pη⁻¹), and it follows from the conditional probability version of Theorem 13.5.1 that

Qξ|η(B, ω) = Qˆξ|η(B, η(ω)) a.s.

From now on we write Q and Qˆ for Qξ|η and Qˆξ|η. Let {Bn} be a sequence of Borel sets which generates the σ-field of Borel sets B (cf. Ex. 1.21).

Then by combining a countable number of σ(η)-measurable sets of zero probability we obtain a set N ∈ σ(η) with P(N) = 0 such that

Q(Bn, ω) = Qˆ(Bn, η(ω)) for all n and all ω ∉ N.

Denote by C the class of all subsets B of the real line such that Q(B, ω) = Qˆ(B, η(ω)) for all ω ∉ N. Since for each ω ∈ Ω, Q(·, ω) and Qˆ(·, η(ω)) are probability measures on B, it follows simply that C is a σ-field, and since it contains {Bn} it will contain the generated σ-field B. Thus Q(B, ω) = Qˆ(B, η(ω)) for all B ∈ B and ω ∉ N, i.e. (ii) holds.
(iii) follows immediately from Theorem 13.4.4 (see also the last paragraph of Section 13.4), the relationship between Qξ|η and Qˆξ|η, and Theorem 13.5.1 in the following form: If E{g(ξ, η)|η}(ω) = f(η(ω)) a.s. then E{g(ξ, η)|η = y} = f(y) a.s. (Pη⁻¹). 

13.6 Regular conditional densities

For two r.v.'s ξ and η we have (in Sections 13.4 and 13.5) defined the regular conditional distribution Qξ|η(B, ω) of ξ given η and the regular conditional distribution Qˆξ|η(B, y) of ξ given η = y, and have shown that both always exist. For each fixed ω and y, Qξ|η(·, ω) and Qˆξ|η(·, y) are probability measures on the Borel sets B, and if they are absolutely continuous with respect to Lebesgue measure it is natural to call their Radon–Nikodym derivatives conditional densities of ξ given η, and given η = y, respectively. As is clear from the previous sections, regular versions of conditional densities will be of primary interest. To be precise, a function fξ|η(x, ω) defined on R × Ω to [0, ∞] is called a regular conditional density of ξ given η if it is B × σ(η)-measurable, for every fixed ω, fξ|η(x, ω) is a probability density function in x, and for all B ∈ B and ω ∈ Ω,

Qξ|η(B, ω) = ∫_B fξ|η(x, ω) dx.

Similarly a function fˆξ|η(x, y) defined on R² to [0, ∞] is called a regular conditional density of ξ given η = y if it is B × B-measurable, for every fixed y, fˆξ|η(x, y) is a probability density function in x, and for all B ∈ B and y ∈ R,

Qˆξ|η(B, y) = ∫_B fˆξ|η(x, y) dx.

It is easy to see that fξ|η exists if and only if fˆξ|η exists and that in this case they are related by

fξ|η(x, ω) = fˆξ|η(x, η(ω)) a.e.

(with respect to the product of Lebesgue measure and P) (cf. Theorem 13.5.2). It is also clear (in view of Theorems 13.4.2 and 13.5.2) that con- ditional expectations can be expressed in terms of regular conditional den- sities, whenever the latter exist; for instance if g is a Borel measurable function on the plane such that E|g(ξ, η)| < ∞ then we have the following:

E{g(ξ, η)|η = y} = ∫_{–∞}^{∞} g(x, y) fˆξ|η(x, y) dx a.s. (Pη⁻¹)
E{g(ξ, η)|η}(ω) = ∫_{–∞}^{∞} g(x, η(ω)) fξ|η(x, ω) dx a.s.

The following result shows that a regular conditional density exists if the r.v.’s ξ and η have a joint probability density function. If f (x, y) is a joint p.d.f. of ξ and η (assumed defined and nonnegative everywhere) then the functions fξ(x) and fη(y) defined for all x and y by

fξ(x) = ∫_{–∞}^{∞} f(x, y) dy,  fη(y) = ∫_{–∞}^{∞} f(x, y) dx

are p.d.f.'s of ξ, η respectively (Section 9.3).

Theorem 13.6.1 Let ξ and η be r.v.'s with joint p.d.f. f(x, y) and fη(y) defined as above. Then the function fˆ(x, y) defined by

fˆ(x, y) = f(x, y)/fη(y) if fη(y) > 0,
fˆ(x, y) = h(x)          if fη(y) = 0,

where h(x) is an arbitrary but fixed p.d.f., is a regular conditional density of ξ given η = y. Hence a regular conditional density of ξ given η is given by fξ|η(x, ω) = fˆ(x, η(ω)).

Proof Since f is B × B-measurable, it follows by Fubini's Theorem that fη is B-measurable and hence fˆ is B × B-measurable. From the definition of fˆ it is clear that it is nonnegative and that for every fixed y, ∫_{–∞}^{∞} fˆ(x, y) dx = 1. Hence for fixed y, fˆ(x, y) is a p.d.f. in x. Now define Qˆ(B, y) for all B ∈ B and y ∈ R by

Qˆ(B, y) = ∫_B fˆ(x, y) dx.

It follows from the properties of fˆ just established that for each fixed B ∈ B, Qˆ(B, y) is a Borel measurable function of y, and for each fixed y ∈ R, Qˆ(·, y) is a probability measure on the Borel sets. In order to conclude that Qˆ = Qˆξ|η it suffices then to show that for each fixed B ∈ B,

Qˆ(B, y) = P(ξ⁻¹B|η = y) a.s. (Pη⁻¹). Now for every fixed B ∈ B and every E ∈ B we have

∫_E Qˆ(B, y) dPη⁻¹(y) = ∫_{E∩{fη>0}} (∫_B fˆ(x, y) dx) dPη⁻¹(y)
    = ∫_{E∩{fη>0}} ∫_B fˆ(x, y) fη(y) dx dy
    = ∫_{E∩{fη>0}} ∫_B f(x, y) dx dy
    = P(ξ⁻¹B ∩ η⁻¹(E ∩ {fη > 0})) = P(ξ⁻¹B ∩ η⁻¹E)

since Pη⁻¹{y : fη(y) = 0} = 0. It follows that Qˆ(B, y) = P(ξ⁻¹B|η = y) a.s. (Pη⁻¹) and thus fˆ(x, y) is a regular conditional density of ξ given η = y. 
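As a numerical sanity check of Theorem 13.6.1, the following sketch takes the joint p.d.f. f(x, y) = x + y on the unit square (an assumed illustrative density, not from the text), computes the marginal fη by quadrature, and verifies that fˆ(·, y) = f(·, y)/fη(y) integrates to 1:

```python
# Midpoint-rule quadrature on [0, 1]; exact here since the integrands
# are linear in x for each fixed y.
n = 2000
h = 1.0 / n
xs = [(i + 0.5) * h for i in range(n)]

def f(x, y):                 # assumed joint p.d.f. on the unit square
    return x + y

def f_eta(y):                # marginal p.d.f. of eta; analytically y + 1/2
    return sum(f(x, y) * h for x in xs)

y = 0.3
fy = f_eta(y)
f_hat = [f(x, y) / fy for x in xs]   # conditional density of xi given eta = y

mass = sum(v * h for v in f_hat)
assert abs(mass - 1.0) < 1e-6        # f_hat(., y) is a probability density
assert abs(fy - (y + 0.5)) < 1e-6
print(mass, fy)
```

The same loop run over a grid of y values would confirm that fˆ is jointly measurable in practice; here one fixed y suffices to illustrate the normalization.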

13.7 Summary

This is a summary of the main concepts defined in this chapter and their mutual relationships.

I. 1. E(ξ|G): conditional expectation of ξ given G
   2. P(A|G): conditional probability of A ∈ F given G
   Relationship: P(A|G) = E(χA|G).

II. 1. Pξ|G(A, ω): regular conditional probability of ξ given G (A ∈ σ(ξ)) (exists if ξ(Ω) ∈ B)
    2. Qξ|G(B, ω): regular conditional distribution of ξ given G (B ∈ B) (always exists)
    Relationship, when they both exist: for a.e. ω ∈ Ω

Qξ|G(B, ω) = Pξ|G(ξ⁻¹B, ω) for all B ∈ B.

If G = σ(η) all concepts in I and II retain their name with "given η" replacing "given G".

III. 1. E(ξ|η = y): conditional expectation of ξ given η = y.
     2. P(A|η = y): conditional probability of A ∈ F given η = y.
     Relationship to I:
     E(ξ|η = y) = f(y) a.e. (Pη⁻¹) if and only if E(ξ|η) = f(η) a.s.
     P(A|η = y) = f(y) a.e. (Pη⁻¹) if and only if P(A|η) = f(η) a.s.

     3. Qˆξ|η(B, y): regular conditional distribution of ξ given η = y (B ∈ B) (always exists)

Relationship to II:

Qξ|η(B, ω) = Qˆξ|η(B, η(ω)) for all B ∈ B, ω ∉ N ∈ σ(η) with P(N) = 0.

Exercises

13.1 Let ξ be a r.v. with E|ξ| < ∞ and G a purely atomic sub-σ-field of F, i.e. G is generated by the disjoint events {E0, E1, E2, ...} with P(E0) = 0, P(En) > 0 for n = 1, 2, ... and Ω = ∪_{n≥0} En. Using the definition of E(ξ|G) given in Section 13.2 show that

E(ξ|G) = Σ_{n≥1} χ_{En} (1/P(En)) ∫_{En} ξ dP a.s.

(Hint: Show first that every set E in G is the union of a subsequence of {En, n ≥ 0}.)

13.2 If the r.v.'s ξ and η are such that E|ξ| < ∞ and η is bounded, then show that

E[E(ξ|G)η] = E[ξE(η|G)] = E[E(ξ|G)E(η|G)].

13.3 Let ξ, η, ζ be r.v.'s with E|ξ| < ∞ and η independent of the pair ξ, ζ. Show that E(ξ|η, ζ) = E(ξ|ζ) a.s. Show also that if ξ is a Borel measurable function of η and ζ (ξ = f(η, ζ)) then it is a Borel measurable function of ζ only (ξ = g(ζ)).

13.4 State and prove the conditional form of the Hölder and Minkowski Inequalities.

13.5 If ξ ∈ Lp(Ω, F, P), p ≥ 1, show that E(ξ|G) ∈ Lp(Ω, F, P) and

||E(ξ|G)||_p = E^{1/p}[|E(ξ|G)|^p] ≤ E^{1/p}(|ξ|^p) = ||ξ||_p.

(Hint: Use the Conditional Jensen's Inequality (Theorem 13.2.9).)

13.6 Two r.v.'s ξ and η in L2(Ω, F, P) are called orthogonal if E(ξη) = 0. Let ξ ∈ L2(Ω, F, P); then E(ξ|G) ∈ L2(Ω, F, P) by Ex. 13.5. Show that E(ξ|G) is the unique r.v. η ∈ L2(Ω, G, P_G) which minimizes E(ξ – η)² and that the minimum value is E(ξ²) – E{E²(ξ|G)}. E(ξ|G) is called the (in general, nonlinear) mean square estimate of ξ based on G.
(Hint: Show that ξ – E(ξ|G) is orthogonal to all r.v.'s in L2(Ω, G, P_G), so that E(ξ|G) is the projection of ξ onto L2(Ω, G, P_G), and that for every η ∈ L2(Ω, G, P_G), E(ξ – η)² = E{ξ – E(ξ|G)}² + E{η – E(ξ|G)}².)
In particular, if η is a r.v., then E(ξ|η) is the unique r.v. ζ ∈ L2(Ω, σ(η), P_{σ(η)}) which minimizes E(ξ – ζ)², or equivalently h(η) = E(ξ|η) is the unique function g ∈ L2(R, B, Pη⁻¹) which minimizes E[ξ – g(η)]². E(ξ|η) is called

the (in general, nonlinear) mean square estimate or least squares regression of ξ based on η. It follows from Ex. 13.12 that if ξ and η have a joint normal distribution then E(ξ|η) = a + bη a.s., and thus the least squares regression of ξ based on η is linear.

13.7 Prove the conditional form of Jensen's Inequality (Theorem 13.2.9) by using regular conditional distributions and the ordinary form of Jensen's Inequality (Theorem 9.5.4).

13.8 Let ξ and η be independent r.v.'s. Show that for every Borel set B,

P(ξ + η ∈ B|η)(ω) = Pξ⁻¹{B – η(ω)} a.s.

where B – y = {x : x + y ∈ B}. What is then P(ξ + η ∈ B|η = y) equal to? Show also that

Qξ+η|η(B, ω) = Pξ⁻¹{B – η(ω)}

is a regular conditional distribution of ξ + η given η.

13.9 Let G be a sub-σ-field of F. We say that a family of classes of events {Aλ, λ ∈ Λ} is conditionally independent given G if

P(∩_{k=1}^{n} Aλk | G) = ∏_{k=1}^{n} P(Aλk | G) a.s.

for any n, any λ1, ..., λn ∈ Λ and any Aλk ∈ Aλk, k = 1, ..., n. Generalize the Kolmogorov Zero-One Law to conditional independence: if {ξn}_{n=1}^{∞} is a sequence of conditionally independent r.v.'s given G and A is a tail event, show that

P(A|G) = 0 or 1 a.s.,

and if ξ is a tail r.v., show that ξ = η a.s. for some G-measurable r.v. η.

13.10 Let ξ and η be r.v.'s with E|ξ| < ∞. If y ∈ R is such that P(η = y) > 0 then show that E(ξ|η = y), as defined in Section 13.5, is given by

E(ξ|η = y) = (1/P(η = y)) ∫_{{η=y}} ξ dP.

(Hint: Let D be the at most countable set of points y ∈ R such that P(η = y) > 0. Define f : R → R by f(y) = (1/P(η = y)) ∫_{{η=y}} ξ dP if y ∈ D and f(y) = E(ξ|η = y) if y ∉ D, and show that for all Borel sets B, ∫_B f dPη⁻¹ = ∫_{η⁻¹B} ξ dP.)

13.11 Let ξ be a r.v. and η a discrete r.v. with values y1, y2, .... Find expressions for the regular conditional probability of ξ given η and for the regular conditional distribution of ξ given η and given η = y. Simplify these expressions further when ξ is discrete with values x1, x2, ....

13.12 Let the r.v.'s ξ1 and ξ2 have a joint normal distribution with E(ξi) = μi, var(ξi) = σi² > 0, i = 1, 2, and E{(ξ1 – μ1)(ξ2 – μ2)} = ρσ1σ2, |ρ| < 1, i.e. ξ1 and ξ2 have the joint p.d.f.

(1/(2πσ1σ2√(1 – ρ²))) exp{–(1/(2(1 – ρ²))) [(x1 – μ1)²/σ1² – 2ρ(x1 – μ1)(x2 – μ2)/(σ1σ2) + (x2 – μ2)²/σ2²]}.

Find the regular conditional density of ξ1 given ξ2 = x2 and show that

E(ξ1|ξ2) = μ1 + ρ(σ1/σ2)(ξ2 – μ2) a.s.

(What happens when |ρ| = 1?)

13.13 Let the r.v.'s ξ and η be such that ξ has a uniform distribution on [0, 1] and the (regular) conditional distribution of η given ξ = x, x ∈ [0, 1], is uniform on [–x, x]. Find the regular conditional densities of ξ given η = y and of η given ξ = x, and the conditional expectations E(ξ + η|ξ) and E(ξ + η|η).

14

Martingales

14.1 Definition and basic properties

In this chapter we consider the notion of a martingale sequence, which has many of the useful properties of a sequence of partial sums of independent r.v.'s (with zero means) and which forms the basis of a significant segment of basic probability theory.

As usual, (Ω, F, P) will denote a fixed probability space. Let {ξn} be a sequence of r.v.'s and {Fn} a sequence of sub-σ-fields of F. Where nothing else is specified in writing sequences such as {ξn}, {Fn}, etc., it will be assumed that the range of n is the set of positive integers {1, 2, ...}. We say that {ξn, Fn} is a martingale (respectively, a submartingale, a supermartingale) if for every n,

(i) Fn ⊂Fn+1

(ii) ξn is Fn-measurable and integrable

(iii) E(ξn+1|Fn) = ξn (resp. ≥ ξn, ≤ ξn) a.s.

This definition trivially contains the notion of {ξn, Fn,1 ≤ n ≤ N} being a martingale (respectively, a submartingale, a supermartingale); just take ξn = ξN and Fn = FN for all n > N. Clearly {ξn, Fn} is a submartingale if and only if {–ξn, Fn} is a supermartingale. Thus the properties of super- martingales can be obtained from those of submartingales and in the sequel only martingales and submartingales will typically be considered.

Example 1 Let {ξn} be a sequence of independent r.v.’s in L1 with zero means and let

Sn = ξ1 + ···+ ξn, Fn = σ(ξ1, ..., ξn), n =1,2,....


Then {Sn, Fn} is a martingale since for every n, Sn is clearly Fn-measurable and integrable, and

E(Sn+1|Fn)=E(ξn+1 + Sn|Fn)

= E(ξn+1|Fn)+E(Sn|Fn)

= Eξn+1 + Sn = Sn a.s. since Sn is Fn-measurable, σ(ξn+1) and Fn are independent, and Eξn+1 =0.
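When the steps take the values ±1 with probability 1/2 each (a special case of Example 1), conditioning on Fn amounts to fixing the first n steps, and the identity E(Sn+1|Fn) = Sn can be checked exactly by enumerating all equally likely paths. A minimal sketch:

```python
from itertools import product
from fractions import Fraction

# Brute-force check of E(S_{n+1} | F_n) = S_n for +-1 steps: the conditional
# expectation given the first n steps is the average of S_{n+1} over the two
# equally likely one-step extensions of each prefix.
N = 5
violations = 0
for n in range(1, N):
    for prefix in product([-1, 1], repeat=n):
        s_n = sum(prefix)
        cond_mean = Fraction(sum(s_n + step for step in (-1, 1)), 2)
        if cond_mean != s_n:
            violations += 1

assert violations == 0
print("martingale property holds for all prefixes up to length", N - 1)
```

The enumeration is exact rather than Monte Carlo, so the assertion is the martingale property itself for this finite horizon.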

Example 2 Let {ξn} be a sequence of independent r.v.'s in L1 with finite, nonzero means Eξn = μn, and let

ηn = ∏_{k=1}^{n} (ξk/μk),  Fn = σ(ξ1, ..., ξn), n = 1, 2, ....

Then {ηn, Fn} is a martingale since for every n, ηn is clearly Fn-measurable and integrable, and

E(ηn+1|Fn) = E{(ξn+1/μn+1) ηn | Fn} = ηn E{ξn+1/μn+1 | Fn}
    = ηn E(ξn+1/μn+1) = ηn a.s.

since ηn is Fn-measurable, and σ(ξn+1) and Fn are independent.
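As with Example 1, the product-martingale identity E(ηn+1|Fn) = ηn can be verified exactly for a simple two-valued step distribution; the values 1/2 and 3/2 with probability 1/2 each (so μk = 1) are an assumed illustrative choice:

```python
from itertools import product
from fractions import Fraction

# eta_n = prod_{k<=n} xi_k / mu_k with xi_k in {1/2, 3/2} equally likely,
# so mu_k = 1 and eta_n = prod xi_k.  Check E(eta_{n+1} | F_n) = eta_n
# by averaging over the one-step extensions of every prefix.
vals = (Fraction(1, 2), Fraction(3, 2))
N = 4
violations = 0
for n in range(1, N):
    for prefix in product(vals, repeat=n):
        eta_n = Fraction(1)
        for v in prefix:
            eta_n *= v
        cond_mean = sum(eta_n * v for v in vals) / 2
        if cond_mean != eta_n:
            violations += 1

assert violations == 0
print("E(eta_{n+1} | F_n) = eta_n for all prefixes")
```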

Example 3 Let ξ be an integrable r.v. and {Fn} an increasing sequence of sub-σ-fields of F (i.e. Fn ⊂Fn+1, n =1,2,...). Let

ξn = E(ξ|Fn)forn =1,2,....

Then {ξn, Fn} is a martingale since for each n, ξn is Fn-measurable and integrable, and

E(ξn+1|Fn)=E{E(ξ|Fn+1)|Fn}

= E(ξ|Fn)=ξn a.s. by Theorem 13.2.2 since Fn ⊂Fn+1. It will be shown in Section 14.3 that a martingale {ξn, Fn} is of this type, i.e. ξn = E(ξ|Fn) for some ξ ∈ L1, if and only if the sequence {ξn} is uniformly integrable. The following results contain the simplest properties of martingales.

Theorem 14.1.1 (i) If {ξn, Fn} and {ηn, Fn} are two martingales (resp. submartingales, supermartingales) then for any real numbers a and b (resp. nonnegative numbers a and b) {aξn + bηn, Fn} is a martingale (resp. submartingale, supermartingale).

(ii) If {ξn, Fn} is a martingale (resp. submartingale, supermartingale) then the sequence {Eξn} is constant (resp. nondecreasing, nonincreasing). (iii) Let {ξn, Fn} be a submartingale (resp. supermartingale). Then {ξn, Fn} is a martingale if and only if the sequence {Eξn} is constant. Proof (i) is obvious from the linearity of conditional expectation (Theo- rem 13.2.1 (ii)). (ii) If {ξn, Fn} is a martingale we have for every n =1,2,..., E(ξn+1| Fn)=ξn a.s. and thus

Eξn+1 = E{E(ξn+1|Fn)} = Eξn. Similarly for a sub- and supermartingale. (iii) The “only if” part follows from (ii). For the “if” part assume that {ξn, Fn} is a submartingale and that {Eξn} is constant. Then for all n,

E{E(ξn+1|Fn)–ξn} = Eξn+1 – Eξn =0 and since E(ξn+1|Fn)–ξn ≥ 0 a.s. (from the definition of a submartingale) and E(ξn+1|Fn)–ξn ∈ L1, it follows (Theorem 4.4.7) that

E(ξn+1|Fn)–ξn = 0 a.s.

Hence {ξn, Fn} is a martingale.  The next theorem shows that any martingale is also a martingale relative to σ(ξ1, ..., ξn), and extends property (iii) of the martingale (submartin- gale, supermartingale) definitions.

Theorem 14.1.2 If {ξn, Fn} is a martingale, then so is {ξn, σ(ξ1, ..., ξn)} and for all n, k =1,2,...

E(ξn+k|Fn)=ξn a.s. with corresponding statements for sub- and supermartingales.

Proof If {ξn, Fn} is a martingale, since for every n, ξn is Fn-measurable and F1 ⊂F2 ⊂ ...⊂Fn,wehave

σ(ξ1, ..., ξn) ⊂Fn. It follows from Theorem 13.2.2, and Theorem 13.2.1 (v) that

E(ξn+1|σ(ξ1, ...ξn)) = E{E(ξn+1|Fn)|σ(ξ1, ..., ξn)}

= E{ξn|σ(ξ1, ..., ξn)}

= ξn a.s.

so that {ξn, σ(ξ1, ..., ξn)} is indeed a martingale.

The equality E(ξn+k|Fn)=ξn a.s. holds for k = 1 and all n by the defini- tion of a martingale. If it holds for some k and all n, then it also holds for k + 1 and all n since

E(ξn+k+1|Fn)=E{E(ξn+k+1|Fn+k)|Fn}

= E{ξn+k|Fn} = ξn a.s. by Theorem 13.2.2 (Fn ⊂Fn+k), the definition of a martingale, and the inductive hypothesis. The result thus follows for all n and k. The corresponding statements for submartingales and supermartingales follow with the obvious changes. 

In the sequel the statement that “{ξn} is a martingale or sub-, super- martingale” without reference to σ-fields {Fn} will mean that Fn is to be understood to be σ(ξ1, ..., ξn). The following result shows that appropriate convex functions of martin- gales (submartingales) are submartingales.

Theorem 14.1.3 Let {ξn, Fn} be a martingale (resp. a submartingale) and g a convex (resp. a convex nondecreasing) function on the real line. If g(ξn) is integrable for all n, then {g(ξn), Fn} is a submartingale.

Proof Since g is Borel measurable, g(ξn)isFn-measurable for all n. Also, since g is convex and ξn, g(ξn) are integrable, Theorem 13.2.9 gives

g(E{ξn+1|Fn}) ≤E{g(ξn+1)|Fn} a.s. for all n.If{ξn, Fn} is a martingale then E(ξn+1|Fn)=ξn a.s. and thus

g(ξn) ≤E{g(ξn+1)|Fn} a.s. which shows that {g(ξn), Fn} is a submartingale. If {ξn, Fn} is a submartin- gale then E(ξn+1|Fn) ≥ ξn a.s. and if g is nondecreasing we have

g(ξn) ≤ g(E{ξn+1|Fn}) ≤E{g(ξn+1)|Fn} a.s. which again shows that {g(ξn), Fn} is a submartingale. 

The following properties follow immediately from this theorem.

Corollary (i) If {ξn, Fn} is a submartingale, so is {ξn+, Fn} (where ξ+ = ξ for ξ ≥ 0 and ξ+ =0for ξ<0).

(ii) If {ξn, Fn} is a martingale then {|ξn|, Fn} is a submartingale, and so is {|ξn|^p, Fn}, 1 < p < ∞, provided ξn ∈ Lp for all n.

A connection between martingales and submartingales is given in the following.

Theorem 14.1.4 (Doob’s Decomposition) Every submartingale {ξn, Fn} can be uniquely decomposed as

ξn = ηn + ζn for all n, a.s.

where {ηn, Fn} is a martingale and the sequence of r.v.'s {ζn} is such that

ζ1 = 0 a.s.,
ζn ≤ ζn+1 for all n a.s.,
ζn+1 is Fn-measurable for all n.

{ζn} is called the predictable increasing sequence¹ associated with the submartingale {ξn}.

Proof Define

η1 = ξ1, ζ1 = 0

and for n ≥ 2

ηn = ξ1 + Σ_{k=2}^{n} {ξk – E(ξk|Fk–1)},  ζn = Σ_{k=2}^{n} {E(ξk|Fk–1) – ξk–1}

or equivalently

ηn = ηn–1 + ξn – E(ξn|Fn–1), ζn = ζn–1 + E(ξn|Fn–1)–ξn–1.

Then η1 + ζ1 = ξ1 and for all n ≥ 2

ηn + ζn = ξ1 + Σ_{k=2}^{n} (ξk – ξk–1) = ξn a.s.

Now {ηn, Fn} is a martingale, since for all n, ηn is clearly Fn-measurable and integrable and

E(ηn+1|Fn)=E{ηn + ξn+1 – E(ξn+1|Fn)|Fn}

= ηn + E(ξn+1|Fn)–E(ξn+1|Fn)

= ηn a.s.

Also, ζ1 = 0 by definition, and for all n, ζn+1 is clearly Fn-measurable and integrable, and the submartingale property E(ξn+1|Fn) ≥ ξn a.s. implies that

ζn+1 = ζn + E(ξn+1|Fn)–ξn ≥ ζn a.s.

Thus {ζn} has the stated properties.

¹ This terminology is most evident when e.g. Fn = σ(ξ1, ..., ξn), so that the Fn-measurability of ζn+1 implies that ζn+1 may be written as a function of (ξ1, ..., ξn) and so is "predictable" from these values.

The uniqueness of the decomposition is shown as follows. Let ξn = η′n + ζ′n be another decomposition, with {η′n} and {ζ′n} having the same properties as {ηn} and {ζn}. Then for all n,

ηn – η′n = ζ′n – ζn = θn,

say. Since {ηn, Fn} and {η′n, Fn} are martingales, so is {θn, Fn}, so that

E(θn+1|Fn) = θn for all n a.s.

Also, since ζn+1 and ζ′n+1 are Fn-measurable, so is θn+1 and thus

E(θn+1|Fn)=θn+1 for all n a.s.

It follows that θ1 = ··· = θn = θn+1 = ··· a.s., and since θ1 = 0 a.s. we have θn = 0 for all n a.s., and thus ηn = η′n and ζn = ζ′n for all n a.s. 
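For the simple ±1 random walk the decomposition of the submartingale ξn = Sn² is explicit: E(Sn+1²|Fn) = Sn² + 1, so ζn = n – 1 and ηn = Sn² – (n – 1). A sketch verifying the martingale property of ηn by exhaustive enumeration:

```python
from itertools import product
from fractions import Fraction

# Doob decomposition of xi_n = S_n^2 for the +-1 walk: the predictable
# increasing part is zeta_n = n - 1 (zeta_1 = 0), and
# eta_n = S_n^2 - (n - 1) should be a martingale.
N = 5
violations = 0
for n in range(1, N):
    for prefix in product([-1, 1], repeat=n):
        s = sum(prefix)
        eta_n = s * s - (n - 1)
        # average of eta_{n+1} = (s + step)^2 - n over the two extensions
        cond_mean = Fraction(sum((s + step) ** 2 - n for step in (-1, 1)), 2)
        if cond_mean != eta_n:
            violations += 1

assert violations == 0
print("eta_n = S_n^2 - (n - 1) is a martingale up to horizon", N)
```

The check works because ((s + 1)² + (s – 1)²)/2 = s² + 1, which is exactly the one-step submartingale increment absorbed into ζ.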

14.2 Inequalities

There are a number of basic and useful inequalities for probabilities, moments and "crossings" of submartingales, and the simpler of these are given in this section. The first provides a martingale form of Kolmogorov's Inequality (Theorem 11.5.1).

Theorem 14.2.1 If {(ξn, Fn) : 1 ≤ n ≤ N} is a submartingale, then for all real a

a P{max_{1≤n≤N} ξn ≥ a} ≤ ∫_{{max_{1≤n≤N} ξn ≥ a}} ξN dP ≤ E|ξN|.

Proof Define (as in the proof of Theorem 11.5.1)

E = {ω : max_{1≤n≤N} ξn(ω) ≥ a}

E1 = {ω : ξ1(ω) ≥ a},
En = {ω : ξn(ω) ≥ a} ∩ ∩_{k=1}^{n–1} {ω : ξk(ω) < a}, n = 2, ..., N.

Then En ∈ Fn for all n = 1, ..., N, the {En} are disjoint and E = ∪_{n=1}^{N} En. Thus

∫_E ξN dP = Σ_{n=1}^{N} ∫_{En} ξN dP.

Now for each n = 1, ..., N,

∫_{En} ξN dP = ∫_{En} E(ξN|Fn) dP ≥ ∫_{En} ξn dP ≥ a P(En)

since En ∈ Fn, E(ξN|Fn) ≥ ξn by Theorem 14.1.2, and ξn ≥ a on En. It follows that

∫_E ξN dP ≥ a Σ_{n=1}^{N} P(En) = a P(E).

This proves the left half of the inequality of the theorem and the right half is obvious. 
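A Monte Carlo illustration of the theorem for the submartingale ξn = |Sn|, where Sn is a ±1 random walk (the horizon, threshold and sample size below are arbitrary choices):

```python
import random

# Estimate both sides of a * P{max_n |S_n| >= a} <= E|S_N| by simulation.
random.seed(0)
N, a, trials = 16, 8, 20000
hits = 0
abs_SN_sum = 0.0
for _ in range(trials):
    s, m = 0, 0
    for _ in range(N):
        s += random.choice((-1, 1))
        m = max(m, abs(s))
    hits += (m >= a)
    abs_SN_sum += abs(s)

lhs = a * hits / trials      # estimate of a * P{max |S_n| >= a}
rhs = abs_SN_sum / trials    # estimate of E|S_N|
assert lhs <= rhs
print(f"{lhs:.3f} <= {rhs:.3f}")
```

For these parameters the bound is far from tight, so the sampled inequality holds with a comfortable margin despite Monte Carlo noise.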

That Theorem 14.2.1 contains Kolmogorov’s Inequality (Theorem 11.5.1) follows from Example 1 and the following corollary.

Corollary Let {(ξn, Fn) : 1 ≤ n ≤ N} be a martingale and a > 0. Then

(i) P{max_{1≤n≤N} |ξn| ≥ a} ≤ (1/a) ∫_{{max_{1≤n≤N} |ξn| ≥ a}} |ξN| dP ≤ E|ξN|/a.
(ii) If also EξN² < ∞, then

P{max_{1≤n≤N} |ξn| ≥ a} ≤ EξN²/a².

Proof Since {(ξn, Fn) : 1 ≤ n ≤ N} is a martingale, {(|ξn|, Fn) : 1 ≤ n ≤ N} is a submartingale ((ii) of Theorem 14.1.3, Corollary) and (i) follows from the theorem.
For (ii) we will show that EξN² < ∞ implies Eξn² < ∞ for all n = 1, ..., N. Then by part (ii) of the corollary to Theorem 14.1.3, {(ξn², Fn) : 1 ≤ n ≤ N} is a submartingale and (ii) follows from the theorem.
To show that if {(ξn, Fn) : 1 ≤ n ≤ N} is a martingale and EξN² < ∞, then Eξn² < ∞ for all n = 1, ..., N, we define gk on the real line for each k = 1, 2, ..., by

gk(x) = x² for |x| ≤ k,
gk(x) = 2k(|x| – k/2) for |x| > k.

Then each gk is convex and gk(x) ↑ x² for all real x. For each fixed k = 1, 2, ..., since for all n = 1, ..., N,

E|gk(ξn)| = ∫_{{|ξn|≤k}} ξn² dP + ∫_{{|ξn|>k}} 2k(|ξn| – k/2) dP ≤ k² + 2kE|ξn| < ∞,

it follows from Theorem 14.1.3 that {(gk(ξn), Fn) : 1 ≤ n ≤ N} is a submartingale and thus, by Theorem 14.1.1 (ii),

0 ≤ E{gk(ξ1)} ≤ ... ≤ E{gk(ξN)} < ∞.

Since gk(x) ↑ x² for each x as k → ∞, the monotone convergence theorem implies that for each n = 1, ..., N, E{gk(ξn)} ↑ Eξn². Hence we have

0 ≤ Eξ1² ≤ ... ≤ EξN²

and the result follows since EξN² < ∞. 

As a consequence of Theorem 14.2.1, the following inequality holds for nonnegative submartingales.

Theorem 14.2.2 If {(ξn, Fn) : 1 ≤ n ≤ N} is a submartingale such that ξn ≥ 0 a.s., n = 1, ..., N, then for all p > 1,

E(max_{1≤n≤N} ξn)^p ≤ (p/(p – 1))^p EξN^p.

Proof Define ζ = max_{1≤n≤N} ξn and η = ξN. Then ζ, η ≥ 0 a.s. and it follows from Theorem 14.2.1 that for all x > 0,

G(x) = P{ζ > x} ≤ (1/x) ∫_{{ζ≥x}} η dP.

Now by applying the monotone convergence theorem and Fubini's Theorem (i.e. integration by parts) we obtain

E(ζ^p) = ∫_0^∞ x^p d{1 – G(x)} = ∫_0^∞ x^p d{–G(x)}
    = lim_{A↑∞} ∫_0^A x^p d{–G(x)}
    = lim_{A↑∞} {p ∫_0^A x^{p–1} G(x) dx – A^p G(A)}
    ≤ lim_{A↑∞} p ∫_0^A x^{p–1} G(x) dx = p ∫_0^∞ x^{p–1} G(x) dx
    ≤ p ∫_0^∞ x^{p–1} (1/x) (∫_{{ζ≥x}} η dP) dx

by the inequality for G shown above. Change of integration order thus gives

E(ζ^p) ≤ p ∫_Ω η(ω) (∫_0^{ζ(ω)} x^{p–2} dx) dP(ω)
    = (p/(p – 1)) ∫_Ω η(ω) ζ^{p–1}(ω) dP(ω) = (p/(p – 1)) E(ηζ^{p–1})
    ≤ (p/(p – 1)) E^{1/p}(η^p) E^{(p–1)/p}(ζ^p),

by Hölder's Inequality. It follows that E^{1/p}(ζ^p) ≤ (p/(p – 1)) E^{1/p}(η^p), which implies the result. 

The following corollary follows immediately from the theorem and (ii) of Theorem 14.1.3, Corollary.

Corollary If {(ξn, Fn) : 1 ≤ n ≤ N} is a martingale and p > 1, then

E(max_{1≤n≤N} |ξn|^p) ≤ (p/(p – 1))^p E|ξN|^p.

The final result of this section is an inequality for the number of "upcrossings" of a submartingale, which will be pivotal in the next section in deriving the submartingale convergence theorem. This requires the following definitions and notation. Let {x1, ..., xN} be a finite sequence of real numbers and let a < b be real numbers. Let τ1 be the first integer in {1, ..., N} such that x_{τ1} ≤ a, τ2 the first integer in {1, ..., N} larger than τ1 such that x_{τ2} ≥ b, τ3 the first integer in {1, ..., N} larger than τ2 such that x_{τ3} ≤ a, τ4 the first integer in {1, ..., N} larger than τ3 such that x_{τ4} ≥ b, and so on, with τi = N + 1 if the condition cannot be satisfied. In other words,

τ1 =min{j :1≤ j ≤ N, xj ≤ a},

τ2 =min{j : τ1 < j ≤ N, xj ≥ b},

τ2k+1 =min{j : τ2k < j ≤ N, xj ≤ a},3≤ 2k +1≤ N

τ2k+2 = min{j : τ2k+1 < j ≤ N, xj ≥ b}, 4 ≤ 2k + 2 ≤ N,

and τi = N + 1 if the corresponding set is empty. Let M be the number of τi that do not exceed N. Then the number of upcrossings U_[a,b] of the interval [a, b] by the sequence {x1, ..., xN} is defined by

U_[a,b] = [M/2] = M/2 if M is even, (M – 1)/2 if M is odd,

and is the number of times the sequence (completely) crosses from ≤ a to ≥ b.
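The τi construction translates directly into code: scan the sequence alternately for a term ≤ a and then for a term ≥ b, counting each completed pair as one upcrossing. A minimal sketch:

```python
def upcrossings(xs, a, b):
    """Count upcrossings of [a, b] by the finite sequence xs (a < b),
    following the tau_i construction: alternately seek a term <= a,
    then a term >= b; each completed low/high pair is one upcrossing."""
    count = 0
    looking_for_low = True
    for x in xs:
        if looking_for_low:
            if x <= a:
                looking_for_low = False
        elif x >= b:
            looking_for_low = True
            count += 1
    return count

assert upcrossings([0, 2, 0, 2], 0, 2) == 2   # M = 4, [M/2] = 2
assert upcrossings([2, 0, 2, 0], 0, 2) == 1   # M = 3, [M/2] = 1
assert upcrossings([1, 1, 1], 0, 2) == 0
print(upcrossings([0, 3, -1, 5, 0, 4], 1, 2))  # 3
```

The boolean state replaces the explicit τi indices; the value returned equals [M/2] since an incomplete final descent contributes no count.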

Theorem 14.2.3 Let {(ξn, Fn) : 1 ≤ n ≤ N} be a submartingale, a < b real numbers, and let U_[a,b](ω) be the number of upcrossings of the interval [a, b] by the sequence {ξ1(ω), ..., ξN(ω)}. Then

EU_[a,b] ≤ (E(ξN – a)+ – E(ξ1 – a)+)/(b – a) ≤ (EξN+ + a–)/(b – a).

Proof It should be checked that U_[a,b](ω) is a r.v. This may be done by first showing that the {τn(ω) : 1 ≤ n ≤ N} are r.v.'s and then using the definition of U_[a,b] in terms of the τn's.
Assume first that a = 0 and ξn ≥ 0 for all n = 1, ..., N. Define {ηn(ω) : 1 ≤ n ≤ N} by

ηn(ω) = 1 if τ2k–1(ω) ≤ n < τ2k(ω) for some k = 1, ..., [N/2],
ηn(ω) = 0 otherwise.

We now show that each ηn is an Fn-measurable r.v. Since by definition {η1 =1} = {ξ1 =0}, η1 is an F1-measurable r.v. If ηn is Fn-measurable, 1 ≤ n ≤ N, then it is clear from the definition of the ηn’s that

{ηn+1 = 1} = {ηn = 1, 0 ≤ ξn+1 < b} ∪ {ηn = 0, ξn+1 = 0}

and thus ηn+1 is Fn+1-measurable. It follows by finite induction that each ηn is Fn-measurable. Define

ζ = ξ1 + Σ_{n=1}^{N–1} ηn(ξn+1 – ξn).

If M(ω) is the number of τn(ω)'s that do not exceed N, so that U_[0,b](ω) = [M(ω)/2], then if M is even

ζ = ξ1 + Σ_{k=1}^{U_[0,b]} (ξ_{τ2k} – ξ_{τ2k–1})

and if M is odd

ζ = ξ1 + Σ_{k=1}^{U_[0,b]} (ξ_{τ2k} – ξ_{τ2k–1}) + (ξN – ξ_{τM}).

Since ξ_{τ2k} – ξ_{τ2k–1} ≥ b and ξN – ξ_{τM} = ξN – 0 ≥ 0, we have in either case, i.e. for all ω ∈ Ω,

ζ ≥ ξ1 + bU_[0,b]

and thus

EU_[0,b] ≤ (Eζ – Eξ1)/b.

Also

Eζ = Eξ1 + Σ_{n=1}^{N–1} E{ηn(ξn+1 – ξn)}.

Since ηn is Fn-measurable, 0 ≤ ηn ≤ 1, and E(ξn+1 – ξn|Fn) ≥ 0 by the submartingale property, we have for n = 1, ..., N – 1,

E{ηn(ξn+1 – ξn)} = E(E{ηn(ξn+1 – ξn)|Fn})

= E(ηnE{ξn+1 – ξn|Fn})

≤E(E{ξn+1 – ξn|Fn})

= E(ξn+1 – ξn).

It follows that

Eζ ≤ Eξ1 + Σ_{n=1}^{N–1} E(ξn+1 – ξn) = EξN

and hence

EU_[0,b] ≤ (EξN – Eξ1)/b.

For the general case note that the number of upcrossings of [a, b] by {ξn}_{n=1}^{N} is equal to the number of upcrossings of [0, b – a] by {ξn – a}_{n=1}^{N}, and this is also equal to the number of upcrossings of [0, b – a] by {(ξn – a)+ : 1 ≤ n ≤ N}. Since {(ξn, Fn) : 1 ≤ n ≤ N} is a submartingale, so is {(ξn – a, Fn) : 1 ≤ n ≤ N} and also {((ξn – a)+, Fn) : 1 ≤ n ≤ N} by (i) of Theorem 14.1.3, Corollary. It follows from the particular case just considered that

EU_[a,b] ≤ (E(ξN – a)+ – E(ξ1 – a)+)/(b – a)
    ≤ E(ξN – a)+/(b – a) ≤ (EξN+ + a–)/(b – a)

since (ξN – a)+ ≤ ξN+ + a–. 

14.3 Convergence

In this section it is shown that under mild conditions submartingales and martingales (and also supermartingales) converge almost surely. The convergence theorems which follow are very useful in probability and statistics. We start with a sufficient condition for a.s. convergence of a submartingale.

Theorem 14.3.1 Let {ξn, Fn} be a submartingale. If

lim_{n→∞} Eξn+ < ∞

then there is an integrable r.v. ξ∞ such that ξn → ξ∞ a.s.

Proof For every pair of real numbers a < b, let U^(n)_[a,b](ω) be the number of upcrossings of [a, b] by {ξi(ω) : 1 ≤ i ≤ n}. Then {U^(n)_[a,b](ω)} is a nondecreasing sequence of random variables and thus has a limit

U_[a,b](ω) = lim_{n→∞} U^(n)_[a,b](ω) a.s.

By monotone convergence and Theorem 14.2.3, we have

EU_[a,b] = lim_{n→∞} EU^(n)_[a,b] ≤ lim_{n→∞} (Eξn+ + a–)/(b – a) < ∞,

so that U_[a,b] < ∞ a.s. It follows that if

E_[a,b] = {ω ∈ Ω : lim inf_n ξn(ω) < a < b < lim sup_n ξn(ω)}

P(E[a,b]) = 0 for all a < b. Thus if

E = ∪_{a,b rational} E_[a,b] = {ω ∈ Ω : lim inf_n ξn(ω) < lim sup_n ξn(ω)}

then P(E) = 0. It follows that lim inf_n ξn(ω) = lim sup_n ξn(ω) a.s. and thus the limit lim_{n→∞} ξn exists a.s. Denote this limit by ξ∞. Then, by Fatou's Lemma,

E|ξ∞| ≤ lim inf_n E|ξn|

and since (by Theorem 14.1.1 (ii)) Eξn ≥ Eξ1,

E|ξn| = E(2ξn+ – ξn) ≤ 2Eξn+ – Eξ1 we obtain

E|ξ∞| ≤ lim inf_n {2Eξn+ – Eξ1} = 2 lim_n Eξn+ – Eξ1 < ∞.

Thus ξ∞ is integrable. 

The next theorem gives conditions under which the a.s. converging submartingale of Theorem 14.3.1 converges also in L1. Throughout the following, given a sequence of σ-fields {Fn}, we denote by F∞ the σ-field generated by ∪_{n=1}^{∞} Fn. Also, by including (ξ∞, F∞) in the sequence, we call {(ξn, Fn) : n = 1, 2, ..., ∞} a martingale (respectively submartingale, supermartingale) if for all m, n in {1, 2, ..., ∞} with m < n,

(i) Fm ⊂ Fn
(ii) ξn is Fn-measurable and integrable
(iii) E(ξn|Fm) = ξm a.s. (resp. ≥ ξm, ≤ ξm).

We have the following result.

Theorem 14.3.2 If {ξn, Fn} is a submartingale, the following are equiva- lent

(i) the sequence {ξn} is uniformly integrable
(ii) the sequence {ξn} converges in L1

(iii) the sequence {ξn} converges a.s. to an integrable r.v. ξ∞ such that {(ξn, Fn) : n = 1, 2, ..., ∞} is a submartingale and lim_n Eξn = Eξ∞.

Proof (i) $\Rightarrow$ (ii): Since $\{\xi_n\}$ is uniformly integrable, Theorem 11.4.1 implies $\sup_n E|\xi_n| < \infty$ and thus, by Theorem 14.3.1, there is an integrable r.v. $\xi_\infty$ such that $\xi_n \to \xi_\infty$ a.s. Since a.s. convergence implies convergence in probability, it follows from Theorem 11.4.2 that $\xi_n \to \xi_\infty$ in $L_1$.

(ii) $\Rightarrow$ (iii): If $\xi_n \to \xi_\infty$ in $L_1$ we have by Theorem 11.4.2, $E|\xi_n| \to E|\xi_\infty| < \infty$ and thus $\sup_n E|\xi_n| < \infty$. It then follows from Theorem 14.3.1 that $\xi_n \to \xi_\infty$ a.s. In order to show that $\{(\xi_n, \mathcal{F}_n) : n = 1, 2, \ldots, \infty\}$ is a submartingale it suffices to show that for all $n = 1, 2, \ldots$
$$E(\xi_\infty|\mathcal{F}_n) \ge \xi_n \quad \text{a.s.}$$
For every fixed $n$ and $E \in \mathcal{F}_n$, using the definition of conditional expectation and the convergence $\xi_m \to \xi_\infty$ in $L_1$ (which implies $\int_E \xi_m\,dP \to \int_E \xi_\infty\,dP$),
$$\int_E E(\xi_\infty|\mathcal{F}_n)\,dP = \int_E \xi_\infty\,dP = \lim_{m\to\infty} \int_E \xi_m\,dP = \lim_{m\to\infty} \int_E E(\xi_m|\mathcal{F}_n)\,dP \ge \int_E \xi_n\,dP$$
since $E(\xi_m|\mathcal{F}_n) \ge \xi_n$ a.s. for $m > n$. Thus $E(\xi_\infty|\mathcal{F}_n) \ge \xi_n$ a.s. (see Ex. 4.14) and as already noted above $\lim_n E\xi_n = E\xi_\infty$.

(iii) $\Rightarrow$ (i): Since $\{(\xi_n, \mathcal{F}_n) : n = 1, 2, \ldots, \infty\}$ is a submartingale, so is $\{(\xi_n^+, \mathcal{F}_n) : n = 1, 2, \ldots, \infty\}$. Thus using the submartingale property repeatedly we have
$$\int_{\{\xi_n^+ > a\}} \xi_n^+\,dP \le \int_{\{\xi_n^+ > a\}} E(\xi_\infty^+|\mathcal{F}_n)\,dP = \int_{\{\xi_n^+ > a\}} \xi_\infty^+\,dP$$
and
$$P\{\xi_n^+ > a\} \le \frac{1}{a} E\xi_n^+ \le \frac{1}{a} E\{E(\xi_\infty^+|\mathcal{F}_n)\} = \frac{1}{a} E\xi_\infty^+ \to 0 \quad \text{as } a \to \infty,$$
which clearly imply that $\{\xi_n^+\}$ is uniformly integrable. Since $\xi_n^+ \to \xi_\infty^+$ a.s. and thus also in probability, and since the sequence is uniformly integrable, it follows by Theorem 11.4.2 that $\xi_n^+ \to \xi_\infty^+$ in $L_1$, and hence that $E\xi_n^+ \to E\xi_\infty^+$. Since by assumption $E\xi_n \to E\xi_\infty$, it also follows that $E\xi_n^- \to E\xi_\infty^-$. Since clearly $\xi_n^- \to \xi_\infty^-$ a.s. and hence in probability, Theorem 11.4.2 implies that $\{\xi_n^-\}$ is uniformly integrable. Since $\xi_n = \xi_n^+ - \xi_n^-$, the uniform integrability of $\{\xi_n : n = 1, 2, \ldots\}$ follows (see Ex. 11.21). $\square$

For martingales the following more detailed and useful result holds.

Theorem 14.3.3 If $\{\xi_n, \mathcal{F}_n\}$ is a martingale, the following are equivalent:

(i) the sequence $\{\xi_n\}$ is uniformly integrable
(ii) the sequence $\{\xi_n\}$ converges in $L_1$
(iii) the sequence $\{\xi_n\}$ converges a.s. to an integrable r.v. $\xi_\infty$ such that $\{(\xi_n, \mathcal{F}_n) : n = 1, 2, \ldots, \infty\}$ is a martingale
(iv) there is an integrable r.v. $\eta$ such that $\xi_n = E(\eta|\mathcal{F}_n)$ a.s. for all $n = 1, 2, \ldots$.

Proof That (i) implies (ii) and (ii) implies (iii) follow from Theorem 14.3.2. That (iii) implies (i) is shown as in Theorem 14.3.2 by considering $|\xi_n|$ instead of $\xi_n^+$, and it is shown trivially by taking $\eta = \xi_\infty$ that (iii) implies (iv).

(iv) $\Rightarrow$ (i): Put $\xi_\infty = \eta$. Then $E(\xi_\infty|\mathcal{F}_n) = E(\eta|\mathcal{F}_n) = \xi_n$ and clearly $\{(\xi_n, \mathcal{F}_n) : n = 1, 2, \ldots, \infty\}$ is a martingale and thus $\{(|\xi_n|, \mathcal{F}_n) : n = 1, 2, \ldots, \infty\}$ is a submartingale. We thus have
$$\int_{\{|\xi_n| > a\}} |\xi_n|\,dP \le \int_{\{|\xi_n| > a\}} E(|\xi_\infty|\,\big|\,\mathcal{F}_n)\,dP = \int_{\{|\xi_n| > a\}} |\xi_\infty|\,dP$$
and
$$P\{|\xi_n| > a\} \le \frac{1}{a} E|\xi_n| \le \frac{1}{a} E|\xi_\infty| \to 0 \quad \text{as } a \to \infty,$$
which clearly imply that $\{\xi_n\}$ is uniformly integrable. $\square$

As a simple consequence of the previous theorem we have the following very useful result.

Theorem 14.3.4 Let $\xi$ be an integrable r.v., $\{\mathcal{F}_n\}$ a sequence of sub-$\sigma$-fields of $\mathcal{F}$ such that $\mathcal{F}_n \subset \mathcal{F}_{n+1}$ for all $n$, and $\mathcal{F}_\infty$ the $\sigma$-field generated by $\bigcup_{n=1}^\infty \mathcal{F}_n$. Then
$$\lim_{n\to\infty} E(\xi|\mathcal{F}_n) = E(\xi|\mathcal{F}_\infty) \quad \text{a.s. and in } L_1.$$

Proof Let $\xi_n = E(\xi|\mathcal{F}_n)$, $n = 1, 2, \ldots$. Then $\{\xi_n, \mathcal{F}_n\}$ is a martingale (by Example 3 in Section 14.1) which satisfies (iv) of Theorem 14.3.3. It follows by (ii) and (iii) of that theorem that there is an integrable r.v. $\xi_\infty$ such that
$$\xi_n \to \xi_\infty \quad \text{a.s. and in } L_1.$$
It suffices now to show that $\xi_\infty = E(\xi|\mathcal{F}_\infty)$ a.s. Since by (iii) of Theorem 14.3.3, $\{(\xi_n, \mathcal{F}_n) : n = 1, 2, \ldots, \infty\}$ is a martingale, we have that for all $E \in \mathcal{F}_n$,
$$\int_E \xi_\infty\,dP = \int_E E(\xi_\infty|\mathcal{F}_n)\,dP = \int_E \xi_n\,dP = \int_E E(\xi|\mathcal{F}_n)\,dP = \int_E \xi\,dP.$$
Hence $\int_E \xi_\infty\,dP = \int_E \xi\,dP$ for all sets $E$ in $\mathcal{F}_n$ and thus in $\bigcup_{n=1}^\infty \mathcal{F}_n$. It is clear that the class of sets for which this holds is a $\mathcal{D}$-class, and since it contains $\bigcup_{n=1}^\infty \mathcal{F}_n$ (which is closed under intersections) it contains also $\mathcal{F}_\infty$. Hence $\int_E \xi_\infty\,dP = \int_E \xi\,dP$ for all $E \in \mathcal{F}_\infty$ and since $\xi_\infty = \lim_n \xi_n$ is $\mathcal{F}_\infty$-measurable, it follows that $\xi_\infty = E(\xi|\mathcal{F}_\infty)$ a.s. $\square$
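Theorem 14.3.4 can be visualized with dyadic $\sigma$-fields on $[0,1]$. In the sketch below (an illustrative example of our own; the function $f$ and the evaluation point are arbitrary choices), $U$ is uniform on $[0,1]$, $\mathcal{F}_n$ is generated by the first $n$ binary digits of $U$, and $E(f(U)|\mathcal{F}_n)$ is the average of $f$ over the dyadic interval of length $2^{-n}$ containing the point.

```python
def cond_exp_dyadic(f, u, n, grid=10000):
    """E(f(U) | F_n) evaluated at u, where U ~ Uniform(0, 1) and F_n is
    the sigma-field generated by the first n binary digits of U (the
    dyadic partition of [0, 1] into 2**n intervals).  On the dyadic
    interval containing u the conditional expectation is the average of
    f over that interval, approximated here by a midpoint Riemann sum."""
    k = int(u * 2 ** n)                      # index of the dyadic cell
    a, b = k / 2 ** n, (k + 1) / 2 ** n
    xs = (a + (b - a) * (i + 0.5) / grid for i in range(grid))
    return sum(f(x) for x in xs) / grid

f = lambda x: x * x                          # arbitrary integrable choice
u = 0.6180339887                             # arbitrary evaluation point
approximations = [cond_exp_dyadic(f, u, n) for n in range(1, 21)]
errors = [abs(a - f(u)) for a in approximations]
# errors shrink: E(f(U) | F_n)(u) -> f(u), and here F_infinity is the
# Borel sigma-field, so E(f(U) | F_infinity) = f(U)
```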

A result similar to Theorem 14.3.4 is also true for decreasing (rather than increasing) sequences of $\sigma$-fields and follows easily if we introduce the concepts of reverse submartingale and martingale as follows. Let $\{\xi_n\}$ be a sequence of r.v.'s and $\{\mathcal{F}_n\}$ a sequence of sub-$\sigma$-fields of $\mathcal{F}$. We say that $\{\xi_n, \mathcal{F}_n\}$ is a reverse martingale (respectively, submartingale, supermartingale) if for every $n$,

(i) $\mathcal{F}_n \supset \mathcal{F}_{n+1}$
(ii) $\xi_n$ is $\mathcal{F}_n$-measurable and integrable
(iii) $E(\xi_n|\mathcal{F}_{n+1}) = \xi_{n+1}$ a.s. (resp. $\ge \xi_{n+1}$, $\le \xi_{n+1}$).

The following convergence result corresponds to Theorem 14.3.1.

Theorem 14.3.5 Let $\{\xi_n, \mathcal{F}_n\}$ be a reverse submartingale. Then there is a r.v. $\xi_\infty$ such that $\xi_n \to \xi_\infty$ a.s., and if
$$\lim_{n\to\infty} E\xi_n > -\infty$$
then $\xi_\infty$ is integrable.

Proof The proof is similar to that of Theorem 14.3.1. For each fixed n, define

$$\eta_k = \xi_{n-k+1}, \quad \mathcal{G}_k = \mathcal{F}_{n-k+1}, \quad k = 1, 2, \ldots, n,$$
i.e. $\{\eta_1, \mathcal{G}_1; \eta_2, \mathcal{G}_2; \ldots; \eta_n, \mathcal{G}_n\} = \{\xi_n, \mathcal{F}_n; \xi_{n-1}, \mathcal{F}_{n-1}; \ldots; \xi_1, \mathcal{F}_1\}$. Then $\{(\eta_k, \mathcal{G}_k) : 1 \le k \le n\}$ is a submartingale since
$$E(\eta_{k+1}|\mathcal{G}_k) = E(\xi_{n-k}|\mathcal{F}_{n-k+1}) \ge \xi_{n-k+1} = \eta_k \quad \text{a.s.}$$
If $U^{(n)}_{[a,b]}(\omega)$ denotes the number of upcrossings of the interval $[a,b]$ by the sequence $\{\xi_n(\omega), \xi_{n-1}(\omega), \ldots, \xi_1(\omega)\}$, then $U^{(n)}_{[a,b]}(\omega)$ is equal to the number of upcrossings of the interval $[a,b]$ by the submartingale $\{\eta_1(\omega), \ldots, \eta_n(\omega)\}$ and by Theorem 14.2.3 we have
$$E U^{(n)}_{[a,b]} \le \frac{E\eta_n^+ + a^-}{b-a} = \frac{E\xi_1^+ + a^-}{b-a}.$$

As in the proof of Theorem 14.3.1 it follows that the sequence $\{\xi_n\}$ converges a.s., i.e. $\xi_n \to \xi_\infty$ a.s. Again as in the proof of Theorem 14.3.1 we have by Fatou's Lemma,
$$E|\xi_\infty| \le \liminf_n E|\xi_n| \quad \text{and} \quad E|\xi_n| = 2E\xi_n^+ - E\xi_n.$$
But now
$$E\xi_n^+ = E\eta_1^+ \le E\eta_n^+ = E\xi_1^+$$
since $\{(\eta_k^+, \mathcal{G}_k) : 1 \le k \le n\}$ is a submartingale. Also $\{E\xi_n\}$ is clearly a nonincreasing sequence. Since $\lim_n E\xi_n > -\infty$ it follows that
$$E|\xi_\infty| \le 2E\xi_1^+ - \lim_{n\to\infty} E\xi_n < \infty$$
and thus $\xi_\infty$ is integrable. $\square$

Corollary If $\{\xi_n, \mathcal{F}_n\}$ is a reverse martingale, then there is an integrable r.v. $\xi_\infty$ such that $\xi_n \to \xi_\infty$ a.s.

Proof If $\{\xi_n, \mathcal{F}_n\}$ is a reverse martingale, clearly the sequence $\{E\xi_n\}$ is constant and thus $\lim_n E\xi_n = E\xi_1 > -\infty$. The result then follows from the theorem. $\square$

We now prove the result of Theorem 14.3.4 for decreasing sequences of σ-fields.

Theorem 14.3.6 Let $\xi$ be an integrable r.v., $\{\mathcal{F}_n\}$ a sequence of sub-$\sigma$-fields of $\mathcal{F}$ such that $\mathcal{F}_n \supset \mathcal{F}_{n+1}$ for all $n$, and $\mathcal{F}_\infty = \bigcap_{n=1}^\infty \mathcal{F}_n$. Then
$$\lim_{n\to\infty} E(\xi|\mathcal{F}_n) = E(\xi|\mathcal{F}_\infty) \quad \text{a.s. and in } L_1.$$

Proof Let $\xi_n = E(\xi|\mathcal{F}_n)$. Then $\{\xi_n, \mathcal{F}_n\}$ is a reverse martingale since $\mathcal{F}_n \supset \mathcal{F}_{n+1}$, $\xi_n$ is $\mathcal{F}_n$-measurable and integrable, and by Theorem 13.2.2,
$$E(\xi_n|\mathcal{F}_{n+1}) = E\{E(\xi|\mathcal{F}_n)|\mathcal{F}_{n+1}\} = E(\xi|\mathcal{F}_{n+1}) = \xi_{n+1} \quad \text{a.s.}$$
It follows from the corollary of Theorem 14.3.5 that $\xi_n \to \xi_\infty$ a.s. for some integrable r.v. $\xi_\infty$.

We first show that $\xi_n \to \xi_\infty$ in $L_1$ as well. This follows from Theorem 11.4.2 since the sequence $\{\xi_n\}_{n=1}^\infty$ is uniformly integrable, as is seen from
$$\int_{\{|\xi_n| > a\}} |\xi_n|\,dP \le \int_{\{|\xi_n| > a\}} E(|\xi|\,\big|\,\mathcal{F}_n)\,dP = \int_{\{|\xi_n| > a\}} |\xi|\,dP$$
and
$$P\{|\xi_n| > a\} \le \frac{1}{a} E|\xi_n| \le \frac{1}{a} E|\xi| \to 0 \quad \text{as } a \to \infty,$$
since $|\xi_n| = |E(\xi|\mathcal{F}_n)| \le E(|\xi|\,\big|\,\mathcal{F}_n)$ a.s. and thus $E|\xi_n| \le E|\xi|$.

We now show that $\xi_\infty = E(\xi|\mathcal{F}_\infty)$ a.s. For every $E \in \mathcal{F}_\infty$ we have $E \in \mathcal{F}_n$ for all $n$ and since $\xi_n = E(\xi|\mathcal{F}_n)$ and $\xi_n \to \xi_\infty$ in $L_1$,
$$\int_E \xi\,dP = \int_E \xi_n\,dP \to \int_E \xi_\infty\,dP \quad \text{as } n \to \infty.$$
Hence $\int_E \xi\,dP = \int_E \xi_\infty\,dP$ for all $E \in \mathcal{F}_\infty$. Also the relations $\xi_\infty = \lim_n \xi_n$ a.s. and $\mathcal{F}_n \supset \mathcal{F}_{n+1}$ imply that $\xi_\infty$ is $\mathcal{F}_n$-measurable for all $n$ and thus $\mathcal{F}_\infty$-measurable. It follows that $\xi_\infty = E(\xi|\mathcal{F}_\infty)$ a.s. $\square$

14.4 Centered sequences

In this section the results of Section 14.3 will be used to study the convergence of series and the law of large numbers for "centered" sequences of r.v.'s, a concept which generalizes that of a sequence of independent and zero mean r.v.'s. We will also give martingale proofs for some of the previous convergence results for sequences of independent r.v.'s.

A sequence of r.v.'s $\{\xi_n\}$ is called centered if for every $n = 1, 2, \ldots$, $\xi_n$ is integrable and
$$E(\xi_n|\mathcal{F}_{n-1}) = 0 \quad \text{a.s.}$$
where $\mathcal{F}_n = \sigma(\xi_1, \ldots, \xi_n)$ and $\mathcal{F}_0 = \{\emptyset, \Omega\}$. For $n = 1$ this condition is just $E\xi_1 = 0$ while for $n > 1$ it implies the weaker condition $E\xi_n = 0$. $\mathcal{F}_n$ will be assumed to be $\sigma(\xi_1, \ldots, \xi_n)$ throughout this section unless otherwise stated.

The basic properties of centered sequences are collected in the following theorem. Property (i) shows that results obtained for centered sequences are directly applicable to arbitrary sequences of integrable r.v.'s appropriately modified, i.e. centered.

Theorem 14.4.1 (i) If $\{\xi_n\}$ is a sequence of integrable r.v.'s then the sequence $\{\xi_n - E(\xi_n|\mathcal{F}_{n-1})\}$ is centered.
(ii) The sequence of partial sums of a centered sequence is a zero mean martingale, and conversely, every zero mean martingale is the sequence of partial sums of a centered sequence.
(iii) A sequence of independent r.v.'s $\{\xi_n\}$ is centered if and only if for each $n$, $\xi_n \in L_1$ and $E\xi_n = 0$.
(iv) If the sequence of r.v.'s $\{\xi_n\}$ is centered and $\xi_n \in L_2$ for all $n$, then the r.v.'s of the sequence are orthogonal: $E\xi_n\xi_m = 0$ for all $n \ne m$.

Proof (i) is obvious. For (ii) let $\{\xi_n\}$ be centered and let $S_n = \xi_1 + \cdots + \xi_n = S_{n-1} + \xi_n$ for $n = 1, 2, \ldots$, where $S_0 = 0$. Then each $S_n$ is integrable and $\mathcal{F}_n$-measurable and
$$E(S_n|\mathcal{F}_{n-1}) = E(S_{n-1}|\mathcal{F}_{n-1}) + E(\xi_n|\mathcal{F}_{n-1}) = S_{n-1} \quad \text{a.s.}$$
Note that $\mathcal{F}_n = \sigma(\xi_1, \ldots, \xi_n) = \sigma(S_1, \ldots, S_n)$. It follows that $\{S_n\}$ is a martingale with zero mean since $ES_1 = E\xi_1 = 0$. Conversely, if $\{S_n\}$ is a zero mean martingale, let $\xi_n = S_n - S_{n-1}$ for $n = 1, 2, \ldots$, where $S_0 = 0$. Then each $\xi_n$ is $\mathcal{F}_n$-measurable and
$$E(\xi_n|\mathcal{F}_{n-1}) = E(S_n|\mathcal{F}_{n-1}) - S_{n-1} = 0 \quad \text{a.s.}$$
Hence $\{\xi_n\}$ is centered and clearly $\xi_1 + \cdots + \xi_n = S_n - S_0 = S_n$.

(iii) follows immediately from the fact that for independent integrable r.v.'s $\{\xi_n\}$ and all $n = 1, 2, \ldots$ we have from Theorem 10.3.2 that the $\sigma$-fields $\mathcal{F}_{n-1}$ and $\sigma(\xi_n)$ are independent and thus by Theorem 13.2.7,
$$E(\xi_n|\mathcal{F}_{n-1}) = E\xi_n \quad \text{a.s.}$$
(iv) Let $\{\xi_n\}$ be centered, $\xi_n \in L_2$ for all $n$, and $m < n$. Then since $\xi_m$ is $\mathcal{F}_m \subset \mathcal{F}_{n-1}$-measurable and $E(\xi_n|\mathcal{F}_{n-1}) = 0$ a.s. we have
$$E(\xi_n\xi_m) = E\{E(\xi_n\xi_m|\mathcal{F}_{n-1})\} = E\{\xi_m E(\xi_n|\mathcal{F}_{n-1})\} = E\{0\} = 0. \qquad \square$$

We now prove for centered sequences of r.v.'s some of the convergence results shown in Sections 11.5 and 11.6 for sequences of independent r.v.'s. In view of Theorem 14.4.1 (iii), the following result on the convergence of series of centered r.v.'s generalizes the corresponding result for series of independent r.v.'s (Theorem 11.5.3).

Theorem 14.4.2 If $\{\xi_n\}$ is a centered sequence of r.v.'s and if $\sum_{n=1}^\infty E\xi_n^2 < \infty$, then the series $\sum_{n=1}^\infty \xi_n$ converges a.s. and in $L_2$.

Proof Let $S_n = \sum_{k=1}^n \xi_k$. Then $S_n \in L_2$ since by assumption $E\xi_n^2 < \infty$ for all $n$. It follows from Theorem 14.4.1 (iv) that for all $m < n$,
$$E(S_n - S_m)^2 = E\Big(\sum_{k=m+1}^n \xi_k\Big)^2 = \sum_{k=m+1}^n E\xi_k^2 \to 0 \quad \text{as } m, n \to \infty$$
since $\sum_{k=1}^\infty E\xi_k^2 < \infty$. Hence $\{S_n\}_{n=1}^\infty$ is a Cauchy sequence in $L_2$ and by Theorem 6.4.7 (i) there is a r.v. $S \in L_2$ such that $S_n \to S$ in $L_2$. Thus the series converges in $L_2$. Now Theorem 9.5.2 shows that convergence in $L_2$ implies convergence in $L_1$ and thus $S_n \to S$ in $L_1$. Since by Theorem 14.4.1 (ii), $\{S_n\}_{n=1}^\infty$ is a martingale, condition (ii) of Theorem 14.3.3 is satisfied and thus (by (iii) of that theorem) $S_n \to S$ a.s. and the series converges also a.s. $\square$
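Theorem 14.4.2 is easy to test numerically. In this illustrative sketch (the $\pm 1/n$ terms are our own choice, not from the text), the terms are independent and centered with $\sum_n E\xi_n^2 = \sum_n 1/n^2 < \infty$, so the partial sums should settle down to a limit.

```python
import random

random.seed(1)

def partial_sums(n_terms):
    """Partial sums S_n of sum_n eps_n / n with eps_n = +/-1 equally
    likely: the terms xi_n = eps_n / n are independent and centered,
    and sum E[xi_n^2] = sum 1/n^2 < infinity, so Theorem 14.4.2 gives
    a.s. (and L2) convergence of the series."""
    s, out = 0.0, []
    for n in range(1, n_terms + 1):
        s += random.choice([-1.0, 1.0]) / n
        out.append(s)
    return out

partials = partial_sums(200000)
# A convergent sequence is Cauchy: the late partial sums cluster
# (here within sum_{n > 199000} 1/n < 0.006 of each other, deterministically).
tail_spread = max(partials[-1000:]) - min(partials[-1000:])
```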

Note that the result of this theorem follows also directly from Ex. 14.8. We now prove a strong law of large numbers for centered sequences which generalizes the corresponding result for sequences of independent r.v.’s (Theorem 11.6.2).

Theorem 14.4.3 If $\{\xi_n\}$ is a centered sequence of r.v.'s and if
$$\sum_{n=1}^\infty E\xi_n^2/n^2 < \infty$$
then
$$\frac{1}{n}\sum_{k=1}^n \xi_k \to 0 \quad \text{a.s.}$$
Proof This follows from Theorem 14.4.2 and Lemma 11.6.1 in the same way as Theorem 11.6.2 follows from Theorem 11.5.3 and Lemma 11.6.1. $\square$

The special convergence results for sequences of independent r.v.'s, i.e. Theorems 11.5.4, 11.6.3 and 12.5.2, can also be obtained as applications of the martingale convergence theorems. As an illustration we include here martingale proofs of the strong law of large numbers (second form, Theorem 11.6.3) and of Theorem 12.5.2.

Theorem 14.4.4 (Strong Law, Second Form) Let $\{\xi_n\}$ be independent and identically distributed r.v.'s with (the same) finite mean $\mu$. Then
$$\frac{1}{n}\sum_{k=1}^n \xi_k \to \mu \quad \text{a.s. and in } L_1.$$
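The strong law is easy to check by simulation before turning to the proof. A minimal sketch (the uniform distribution and the sample size are illustrative assumptions of ours, not part of the theorem):

```python
import random

random.seed(2)

# Empirical check of the strong law: xi_k ~ Uniform(0, 1) i.i.d. with
# mean mu = 0.5.  The running average S_n / n should be close to mu
# for large n on almost every sample path.
n = 100000
s = 0.0
for _ in range(n):
    s += random.random()
sample_mean = s / n          # S_n / n, close to mu = 0.5
```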

Proof Let $S_n = \xi_1 + \cdots + \xi_n$. We first show that for each $1 \le k \le n$,
$$E(\xi_k|S_n) = \frac{1}{n} S_n \quad \text{a.s.}$$
Every set $E \in \sigma(S_n)$ is of the form $E = S_n^{-1}(B)$, $B \in \mathcal{B}$, and thus
$$\int_E \xi_k\,dP = E(\xi_k \chi_{\{S_n \in B\}}) = \int_{-\infty}^\infty \cdots \int_{-\infty}^\infty x_k \chi_B(x_1 + \cdots + x_n)\,dF(x_1)\ldots dF(x_n)$$
where $F$ is the common d.f. of the $\xi_n$'s. It follows from Fubini's Theorem that the last expression does not depend on $k$ and thus
$$\int_E \xi_k\,dP = \frac{1}{n}\sum_{i=1}^n \int_E \xi_i\,dP = \frac{1}{n}\int_E S_n\,dP$$
which implies $E(\xi_k|S_n) = \frac{1}{n} S_n$ a.s.

Now let $\mathcal{F}_n = \sigma(S_n, S_{n+1}, \ldots)$ (hence $\mathcal{F}_n \supset \mathcal{F}_{n+1}$) and let $\mathcal{F}_\infty = \bigcap_{n=1}^\infty \mathcal{F}_n$. Since $S_{n+1} - S_n = \xi_{n+1}$ it is clear that $\mathcal{F}_n = \sigma(S_n, \xi_{n+1}, \xi_{n+2}, \ldots)$. Also since the classes of events $\sigma(\xi_1, S_n)$ and $\sigma(\xi_{n+1}, \xi_{n+2}, \ldots)$ are independent, an obvious generalization of Ex. 13.3 gives
$$E(\xi_1|S_n) = E(\xi_1|\mathcal{F}_n) \quad \text{a.s.}$$
Thus
$$\frac{1}{n} S_n = E(\xi_1|\mathcal{F}_n) \quad \text{a.s.}$$
and Theorem 14.3.6 implies that
$$\frac{1}{n} S_n \to E(\xi_1|\mathcal{F}_\infty) \quad \text{a.s. and in } L_1.$$
Now $\lim_n \frac{1}{n} S_n = \lim_n \frac{1}{n}(S_n - S_k)$ implies that $\lim_n \frac{1}{n} S_n$ is a tail r.v. of the independent sequence $\{\xi_n\}$ and by Kolmogorov's Zero-One Law (Theorem 10.5.3) it is constant a.s. Hence $E(\xi_1|\mathcal{F}_\infty)$ is constant a.s. and thus
$$E(\xi_1|\mathcal{F}_\infty) = E\xi_1 = \mu \quad \text{a.s.}$$
It follows that $\frac{1}{n} S_n \to \mu$ a.s. and in $L_1$. $\square$

The following result gives a martingale proof of Theorem 12.5.2.

Theorem 14.4.5 Let $\{\xi_n\}$ be a sequence of independent random variables with characteristic functions $\{\phi_n\}$. Then the following are equivalent:

(i) the series $\sum_{n=1}^\infty \xi_n$ converges a.s.
(ii) the series $\sum_{n=1}^\infty \xi_n$ converges in distribution
(iii) the products $\prod_{k=1}^n \phi_k(t)$ converge to a nonzero limit in some neighborhood of the origin.

Proof Clearly, it suffices to show that (iii) implies (i), i.e. assume that
$$\lim_{n\to\infty} \prod_{k=1}^n \phi_k(t) = \phi(t) \ne 0 \quad \text{for each } t \in [-a,a] \text{ for some } a > 0.$$
Let $S_n = \sum_{k=1}^n \xi_k$ and $\mathcal{F}_n = \sigma(\xi_1, \ldots, \xi_n) = \sigma(S_1, \ldots, S_n)$. For each fixed $t \in [-a,a]$ the sequence $e^{itS_n}/\prod_{k=1}^n \phi_k(t)$ is integrable ($dP$), indeed uniformly bounded, and it follows from Example 2 of Section 14.1 that $\{e^{itS_n}/\prod_{k=1}^n \phi_k(t), \mathcal{F}_n\}$ is a martingale, in the sense that its real and imaginary parts are martingales. Since for each $t$ the sequence is uniformly bounded, Theorem 14.3.1 applied to the real and imaginary parts shows that the sequence $e^{itS_n}/\prod_{k=1}^n \phi_k(t)$ converges a.s. as $n \to \infty$. Since the denominator converges to a nonzero limit, it follows that $e^{itS_n}$ converges a.s. as $n \to \infty$, for each $t \in [-a,a]$. Some analysis using this fact will lead to the conclusion that $S_n$ converges a.s.

We have that for every $t \in [-a,a]$ there is a set $\Omega_t \in \mathcal{F}$ with $P(\Omega_t) = 0$ such that for every $\omega \notin \Omega_t$, $e^{itS_n(\omega)}$ converges. Now consider $e^{itS_n(\omega)}$ as a function of the two variables $(t, \omega)$, i.e. in the product space $([-a,a] \times \Omega, \mathcal{B}_{[-a,a]} \times \mathcal{F}, m \times P)$, where $\mathcal{B}_{[-a,a]}$ is the $\sigma$-field of Borel subsets of $[-a,a]$ and $m$ denotes Lebesgue measure. Then clearly $e^{itS_n(\omega)}$ is product measurable and hence
$$D = \{(t,\omega) \in [-a,a] \times \Omega : e^{itS_n(\omega)} \text{ does not converge}\} \in \mathcal{B}_{[-a,a]} \times \mathcal{F}.$$
Note that the $t$-section of $D$ is
$$D_t = \{\omega \in \Omega : (t,\omega) \in D\} = \{\omega \in \Omega : e^{itS_n(\omega)} \text{ does not converge}\} = \Omega_t.$$
It follows from Fubini's Theorem that
$$(m \times P)(D) = \int_{-a}^a P(D_t)\,dt = \int_{-a}^a 0\,dt = 0$$
and hence
$$0 = (m \times P)(D) = \int_\Omega m(D^\omega)\,dP(\omega).$$
Hence $m(D^\omega) = 0$ a.s., i.e. there is $\Omega_0 \in \mathcal{F}$ with $P(\Omega_0) = 0$ such that $m(D^\omega) = 0$ for all $\omega \notin \Omega_0$. But
$$D^\omega = \{t \in [-a,a] : (t,\omega) \in D\} = \{t \in [-a,a] : e^{itS_n(\omega)} \text{ does not converge}\}.$$
Hence for every $\omega \notin \Omega_0$ there is $D^\omega \in \mathcal{B}_{[-a,a]}$ with $m(D^\omega) = 0$ such that $e^{itS_n(\omega)}$ converges for all $t \in [-a,a] - D^\omega$. The proof will be completed by showing that for all $\omega \notin \Omega_0$, $S_n(\omega)$ converges to a finite limit; since $P(\Omega_0) = 0$, this means that $S_n$ converges a.s.

Fix $\omega \notin \Omega_0$. To show the convergence of $S_n(\omega)$, we argue first that the sequence $\{S_n(\omega)\}$ is bounded. Indeed, by passing to a subsequence if necessary, suppose by contradiction that $S_n(\omega) \to \infty$. Denote the limit of $e^{itS_n(\omega)}$ by $g(t)$, defined a.e. $(m)$ on $[-a,a]$. Dominated convergence yields that
$$\int_0^u e^{itS_n(\omega)}\,dt = \frac{e^{iuS_n(\omega)} - 1}{iS_n(\omega)} \to \int_0^u g(t)\,dt$$
for any $u \in [-a,a]$. But since $S_n(\omega) \to \infty$, it follows that $\int_0^u g(t)\,dt = 0$ for any $u \in [-a,a]$, and hence $g(t) = 0$ a.e. $(m)$ on $[-a,a]$. This is a contradiction since $|g(t)| = 1 = \lim_n |e^{itS_n(\omega)}|$ a.e. $(m)$ on $[-a,a]$. If $\{S_n(\omega)\}$ is bounded and there are two convergent subsequences $S_{n_k}(\omega) \to s_1$ and $S_{m_k}(\omega) \to s_2$, then $e^{its_1} = e^{its_2}$ a.e. $(m)$ on $[-a,a]$. Since $e^{its}$ is continuous in $t \in [-a,a]$, it follows that $e^{its_1} = e^{its_2}$ for all $t \in [-a,a]$. Differentiating the two sides of the last equality and setting $t = 0$ yields $s_1 = s_2$, and hence $S_n(\omega)$ converges. $\square$
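The role of condition (iii) can be seen numerically. In the following sketch (an illustrative example of our own, not from the text), take $\xi_k = \pm 1/k$ with equal probabilities, so $\phi_k(t) = \cos(t/k)$; the partial products converge to a nonzero limit near the origin, consistent with the a.s. convergence of $\sum \xi_k$ guaranteed by Theorem 14.4.2.

```python
import math

def cf_partial_product(t, n):
    """Partial product prod_{k <= n} phi_k(t) for xi_k = +/-1/k with
    equal probabilities, whose characteristic function is
    phi_k(t) = cos(t/k)."""
    p = 1.0
    for k in range(1, n + 1):
        p *= math.cos(t / k)
    return p

# Near the origin the products converge to a nonzero limit
# (condition (iii)), matching the a.s. convergence of sum xi_k.
t = 0.5
values = [cf_partial_product(t, n) for n in (10, 100, 1000, 10000)]
```

Had the variances not been summable (e.g. $\xi_k = \pm 1$), the products $\prod_k \cos(t)$ would instead tend to $0$ for almost every small $t \ne 0$, signalling divergence of the series.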

14.5 Further applications

In this section we give some further applications of the martingale convergence results of Section 14.3. The first application is related to the Lebesgue decomposition of one measure with respect to another, and thus also to the Radon–Nikodym Theorem; it helps to identify Radon–Nikodym derivatives and is also of interest in probability and especially in statistics.

Theorem 14.5.1 Let $(\Omega, \mathcal{F}, P)$ be a probability space and $\{\mathcal{F}_n\}$ a sequence of sub-$\sigma$-fields of $\mathcal{F}$ such that $\mathcal{F}_n \subset \mathcal{F}_{n+1}$ for all $n$, with $\sigma(\bigcup_{n=1}^\infty \mathcal{F}_n) = \mathcal{F}$. Let $Q$ be a finite measure on $(\Omega, \mathcal{F})$ and consider its Lebesgue–Radon–Nikodym decomposition with respect to $P$:
$$Q(E) = \int_E \xi\,dP + Q(E \cap N) \quad \text{for all } E \in \mathcal{F}$$
where $0 \le \xi \in L_1(\Omega, \mathcal{F}, P)$, $N \in \mathcal{F}$ and $P(N) = 0$. Denote by $P_n$, $Q_n$ the restrictions of $P$, $Q$ to $\mathcal{F}_n$. If $Q_n \ll P_n$ for all $n = 1, 2, \ldots$, then

(i) $\{\frac{dQ_n}{dP_n}, \mathcal{F}_n\}$ is a martingale on $(\Omega, \mathcal{F}, P)$ and
$$\frac{dQ_n}{dP_n} \to \xi \quad \text{a.s. } (P).$$
(ii) $Q \ll P$ if and only if $\{\frac{dQ_n}{dP_n}\}$ is uniformly integrable on $(\Omega, \mathcal{F}, P)$, in which case
$$\frac{dQ_n}{dP_n} \to \frac{dQ}{dP} \quad \text{a.s. } (P) \text{ and in } L_1(\Omega, \mathcal{F}, P).$$

Proof (i) Let $\xi_n = \frac{dQ_n}{dP_n}$. Since $Q$ and thus $Q_n$ are finite, it follows that $\xi_n \in L_1(\Omega, \mathcal{F}, P)$, i.e. $\xi_n$ is $\mathcal{F}_n$-measurable and $P$-integrable. For every $E \in \mathcal{F}_n$ we have
$$\int_E \xi_{n+1}\,dP = \int_E \xi_{n+1}\,dP_{n+1} = Q_{n+1}(E) = Q_n(E) = \int_E \xi_n\,dP_n = \int_E \xi_n\,dP.$$
Hence $E(\xi_{n+1}|\mathcal{F}_n) = \xi_n$ a.s. for all $n$ and thus $\{\xi_n, \mathcal{F}_n\}_{n=1}^\infty$ is a martingale on $(\Omega, \mathcal{F}, P)$. We also have $\xi_n \ge 0$ a.s. and
$$E\xi_n = \int_\Omega \xi_n\,dP = Q_n(\Omega) = Q(\Omega) < \infty.$$
It follows from Theorem 14.3.1 that there is an integrable random variable $\xi_\infty$ such that
$$\xi_n \to \xi_\infty \quad \text{a.s. } (P).$$
Since $\xi_n \ge 0$ a.s. we have $\xi_\infty \ge 0$ a.s. We now show that $\xi_\infty = \xi$ a.s. Since $\xi_n \to \xi_\infty$ a.s., Fatou's Lemma gives
$$\int_E \xi_\infty\,dP \le \liminf_n \int_E \xi_n\,dP \quad \text{for all } E \in \mathcal{F}.$$
Hence for all $E \in \mathcal{F}_n$,
$$\int_E \xi_\infty\,dP \le \liminf_n Q_n(E) = Q(E)$$
and thus $\int_E \xi_\infty\,dP \le Q(E)$ for all $E \in \bigcup_{n=1}^\infty \mathcal{F}_n$. We conclude that the same is true for all $E \in \mathcal{F}$, either from the uniqueness of the extension of the finite measure $\mu(E) = Q(E) - \int_E \xi_\infty\,dP$ (Theorem 2.5.3) or from the monotone class theorem (Ex. 1.16). Since $P(N) = 0$ it follows that for every $E \in \mathcal{F}$,
$$\int_E \xi_\infty\,dP = \int_{E \cap N^c} \xi_\infty\,dP \le Q(E \cap N^c) = \int_{E \cap N^c} \xi\,dP = \int_E \xi\,dP$$
and thus $\xi_\infty \le \xi$ a.s.

For the reverse inequality we have $\int_E \xi\,dP \le Q(E)$ for all $E \in \mathcal{F}$, and hence for all $E \in \mathcal{F}_n$,
$$\int_E E(\xi|\mathcal{F}_n)\,dP = \int_E \xi\,dP \le Q(E) = Q_n(E) = \int_E \xi_n\,dP.$$
Since both $E(\xi|\mathcal{F}_n)$ and $\xi_n$ are $\mathcal{F}_n$-measurable, it follows as in the previous paragraph that
$$E(\xi|\mathcal{F}_n) \le \xi_n \quad \text{a.s.}$$
Since this is true for all $n$ and since $\xi_n \to \xi_\infty$ a.s. and, by Theorem 14.3.4, $E(\xi|\mathcal{F}_n) \to E(\xi|\mathcal{F}) = \xi$ a.s., it follows that $\xi \le \xi_\infty$ a.s. Thus $\xi_\infty = \xi$ a.s., i.e. (i) holds.

(ii) First assume that $Q \ll P$. Then $Q(N) = 0$ and $\xi = \frac{dQ}{dP}$. Hence by (i), $\xi_n \to \xi$ a.s. Also for all $E \in \mathcal{F}_n$ we have
$$\int_E \xi\,dP = Q(E) = Q_n(E) = \int_E \xi_n\,dP_n = \int_E \xi_n\,dP$$
and thus $\xi_n = E(\xi|\mathcal{F}_n)$. Hence condition (iv) of Theorem 14.3.3 is satisfied and from (i) and (ii) of the same theorem we have that $\{\xi_n\}_{n=1}^\infty$ is uniformly integrable on $(\Omega, \mathcal{F}, P)$, and $\xi_n \to \xi$ in $L_1(\Omega, \mathcal{F}, P)$.

Conversely, assume that the sequence $\{\xi_n\}_{n=1}^\infty$ is uniformly integrable on $(\Omega, \mathcal{F}, P)$. Then by Theorem 14.3.3, since $\{\xi_n, \mathcal{F}_n\}_{n=1}^\infty$ is a martingale on $(\Omega, \mathcal{F}, P)$, there is a r.v. $\xi \in L_1(\Omega, \mathcal{F}, P)$ such that $\xi_n = E(\xi|\mathcal{F}_n)$ a.s. for all $n$. It follows from Theorem 14.3.4 that
$$\xi_n = E(\xi|\mathcal{F}_n) \to E(\xi|\mathcal{F}) = \xi \quad \text{a.s. and in } L_1(\Omega, \mathcal{F}, P).$$
It now suffices to show that $Q \ll P$ and $\xi = \frac{dQ}{dP}$ a.s. Indeed for all $E \in \mathcal{F}_n$ we have
$$Q(E) = Q_n(E) = \int_E \xi_n\,dP = \int_E E(\xi|\mathcal{F}_n)\,dP = \int_E \xi\,dP.$$
Hence $Q(E) = \int_E \xi\,dP$ for all $E \in \bigcup_{n=1}^\infty \mathcal{F}_n$ and since the class of sets for which it is true is clearly a $\sigma$-field, it follows that it is true for all $E \in \mathcal{F}$. Thus $Q \ll P$ and $\xi = \frac{dQ}{dP}$ a.s. $\square$

Application of the theorem to the positive and negative parts in the Jordan decomposition of a finite signed measure gives the following result.

Corollary 1 The theorem remains true if Q is a finite signed measure.

We now show how Theorem 14.5.1 can be used in finding expressions for Radon–Nikodym derivatives.

Corollary 2 Let $(\Omega, \mathcal{F}, P)$ be a probability space and $Q$ a finite signed measure on $\mathcal{F}$ such that $Q \ll P$. For every $n$ let $\{E_k^{(n)} : k \ge 1\}$ be a measurable partition of $\Omega$ (i.e. $\Omega = \bigcup_{k=1}^\infty E_k^{(n)}$ where the $E_k^{(n)}$ are disjoint sets in $\mathcal{F}$) and let $\mathcal{F}_n$ be the $\sigma$-field it generates. Assume that the partitions become finer as $n$ increases (i.e. each $E_i^{(n)}$ is the union of sets from $\{E_k^{(n+1)}\}$), so that $\mathcal{F}_n \subset \mathcal{F}_{n+1}$. If the partitions are such that $\mathcal{F} = \sigma(\bigcup_{n=1}^\infty \mathcal{F}_n)$, then
$$\frac{dQ}{dP}(\omega) = \lim_{n\to\infty} \frac{Q(E_{k_n(\omega)}^{(n)})}{P(E_{k_n(\omega)}^{(n)})} \quad \text{a.s. and in } L_1(\Omega, \mathcal{F}, P)$$
where for every $\omega$ and $n$, $k_n(\omega)$ is the unique $k$ such that $\omega \in E_k^{(n)}$.
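Before the proof, the partition scheme can be tried out numerically. The following sketch is purely illustrative (the measures, densities, and dyadic partition are our own assumptions, not from the text): take $P$ = Lebesgue measure on $[0,1]$ and $dQ/dP(x) = 3x^2$, and approximate the derivative at a point by the ratio $Q(E)/P(E)$ over shrinking dyadic cells containing it.

```python
def rn_derivative_approx(q_cdf, p_cdf, x, n):
    """Ratio Q(E)/P(E) over the dyadic cell E of [0, 1] containing x,
    with 2**n cells -- the partition approximation of dQ/dP in
    Corollary 2, expressed through the c.d.f.s of Q and P."""
    k = int(x * 2 ** n)
    a, b = k / 2 ** n, (k + 1) / 2 ** n
    num, den = q_cdf(b) - q_cdf(a), p_cdf(b) - p_cdf(a)
    return num / den if den > 0 else 0.0

# Illustrative measures: P = Lebesgue on [0, 1], dQ/dP(x) = 3x^2.
p_cdf = lambda x: x
q_cdf = lambda x: x ** 3
x = 0.7
estimates = [rn_derivative_approx(q_cdf, p_cdf, x, n) for n in (2, 5, 10, 20)]
# estimates approach dQ/dP(x) = 3 * 0.7**2 = 1.47 as the cells shrink
```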

Proof This is obvious from the simple observation that
$$\frac{dQ_n}{dP_n}(\omega) = \sum_{k=1}^\infty \frac{Q(E_k^{(n)})}{P(E_k^{(n)})}\,\chi_{E_k^{(n)}}(\omega) \quad \text{a.s.}$$
where $\frac{Q(E_k^{(n)})}{P(E_k^{(n)})}$ is taken to be zero whenever $P(E_k^{(n)}) = 0$. $\square$

Since conditional expectations and conditional probabilities as defined in Chapter 13 are Radon–Nikodym derivatives of finite signed measures with respect to probability measures, Corollary 2 can be used to express them as limits, and the resulting expressions are also intuitively appealing. Such a result will be stated for a conditional probability given the value of a r.v.

Corollary 3 Let $\eta$ be a r.v. on the probability space $(\Omega, \mathcal{F}, P)$ and $A \in \mathcal{F}$. For each $n$, let $\{I_k^{(n)} : -\infty < k < \infty\}$ be a partition of the real line into intervals. Assume that the partitions become finer as $n$ increases and that
$$\delta^{(n)} = \sup_k m(I_k^{(n)}) \to 0 \quad \text{as } n \to \infty$$
($m$ = Lebesgue measure). Then
$$P(A|\eta = y) = \lim_{n\to\infty} \frac{P(A \cap \eta^{-1}I_{k_n(y)}^{(n)})}{P(\eta^{-1}I_{k_n(y)}^{(n)})} \quad \text{a.s. } (P\eta^{-1}) \text{ and in } L_1(\mathbb{R}, \mathcal{B}, P\eta^{-1})$$
where for each $y$ and $n$, $k_n(y)$ is the unique $k$ such that $y \in I_k^{(n)}$.

Proof By Section 13.5, $P(A|\eta = y)$ is the Radon–Nikodym derivative of the finite measure $\nu$, defined for each $B \in \mathcal{B}$ by $\nu(B) = P(A \cap \eta^{-1}B)$, with respect to $P\eta^{-1}$. The result follows from Corollary 2 and the simple observation that if $\mathcal{B}_n = \sigma(\{I_k^{(n)}\}_{k=-\infty}^\infty)$ then $\mathcal{B}_n \subset \mathcal{B}_{n+1}$ and $\sigma(\bigcup_{n=1}^\infty \mathcal{B}_n) = \mathcal{B}$. $\square$

The second application concerns "likelihood ratios" and is related to the principle of maximum likelihood.

Theorem 14.5.2 Let $\{\xi_n\}$ be a sequence of r.v.'s on the probability space $(\Omega, \mathcal{F}, P)$, and $\mathcal{F}_n = \sigma(\xi_1, \ldots, \xi_n)$. Let $Q$ be another probability measure on $(\Omega, \mathcal{F})$. Assume that for every $n$, $(\xi_1, \ldots, \xi_n)$ has p.d.f. $p_n$ under the probability $P$ and $q_n$ under the probability $Q$, and define
$$\eta_n(\omega) = \begin{cases} \dfrac{q_n(\xi_1(\omega), \ldots, \xi_n(\omega))}{p_n(\xi_1(\omega), \ldots, \xi_n(\omega))} & \text{if the denominator} \ne 0 \\ 0 & \text{otherwise.} \end{cases}$$
Then $\{\eta_n, \mathcal{F}_n\}_{n=1}^\infty$ is a supermartingale on $(\Omega, \mathcal{F}, P)$ and there is a $P$-integrable r.v. $\eta_\infty$ such that
$$\eta_n \to \eta_\infty \quad \text{a.s.}$$
and
$$0 \le E\eta_\infty \le E\eta_{n+1} \le E\eta_n \le 1 \quad \text{for all } n.$$

Proof Since $p_n$ and $q_n$ are Borel measurable functions, $\eta_n$ is $\mathcal{F}_n$-measurable. Also $\eta_n \ge 0$. If $A_n = \{(x_1, \ldots, x_n) \in \mathbb{R}^n : p_n(x_1, \ldots, x_n) > 0\}$ then $P(\xi_1, \ldots, \xi_n)^{-1}(A_n^c) = 0$ and thus $P(\xi_1, \ldots, \xi_n, \xi_{n+1})^{-1}(A_n^c \times \mathbb{R}) = 0$. Further
$$E\eta_n = \int_\Omega \eta_n\,dP = \int_{\mathbb{R}^n} \frac{q_n}{p_n}\chi_{A_n}\,dP(\xi_1, \ldots, \xi_n)^{-1} = \int_{\mathbb{R}^n} \frac{q_n}{p_n}\chi_{A_n}\,p_n\,dx_1 \ldots dx_n = \int_{\mathbb{R}^n} q_n\chi_{A_n}\,dx_1 \ldots dx_n \le \int_{\mathbb{R}^n} q_n\,dx_1 \ldots dx_n = 1$$
and thus $0 \le E\eta_n \le 1$.

Also, for every $E \in \mathcal{F}_n$ there is a $B \in \mathcal{B}^n$ such that $E = (\xi_1, \ldots, \xi_n)^{-1}(B)$ and
$$\int_E \eta_{n+1}\,dP = \int_\Omega \eta_{n+1}\chi_E\,dP = \int_{\mathbb{R}^{n+1}} \frac{q_{n+1}}{p_{n+1}}\chi_{B\times\mathbb{R}}\,\chi_{A_{n+1}}\,dP(\xi_1, \ldots, \xi_{n+1})^{-1} = \int_{A_{n+1}} \frac{q_{n+1}}{p_{n+1}}\chi_{B\times\mathbb{R}}\,dP(\xi_1, \ldots, \xi_{n+1})^{-1} = \int_{A_{n+1} - A_n^c \times \mathbb{R}} \frac{q_{n+1}}{p_{n+1}}\chi_{B\times\mathbb{R}}\,dP(\xi_1, \ldots, \xi_{n+1})^{-1}$$
since $P(\xi_1, \ldots, \xi_{n+1})^{-1}(A_n^c \times \mathbb{R}) = 0$. Hence, since $A_{n+1} - A_n^c \times \mathbb{R} \subset A_n \times \mathbb{R}$,
$$\int_E \eta_{n+1}\,dP = \int_{A_{n+1} - A_n^c \times \mathbb{R}} q_{n+1}\chi_{B\times\mathbb{R}}\,dx_1 \ldots dx_n\,dx_{n+1} \le \int_{A_n \times \mathbb{R}} q_{n+1}\chi_{B\times\mathbb{R}}\,dx_1 \ldots dx_n\,dx_{n+1} = \int_{A_n} \Big(\int_{\mathbb{R}} q_{n+1}(x_1, \ldots, x_n, x_{n+1})\,dx_{n+1}\Big)\chi_B\,dx_1 \ldots dx_n = \int_{A_n} q_n\chi_B\,dx_1 \ldots dx_n = \int_{A_n} \frac{q_n}{p_n}\chi_B\,dP(\xi_1, \ldots, \xi_n)^{-1} = \int_\Omega \eta_n\chi_E\,dP = \int_E \eta_n\,dP.$$
It follows that $E(\eta_{n+1}|\mathcal{F}_n) \le \eta_n$ for all $n$, a.s., and thus $\{\eta_n, \mathcal{F}_n\}_{n=1}^\infty$ is a supermartingale on $(\Omega, \mathcal{F}, P)$. Hence $\{-\eta_n, \mathcal{F}_n\}_{n=1}^\infty$ is a negative submartingale which, by the submartingale convergence Theorem 14.3.1, converges a.s. to a $P$-integrable r.v. $-\eta_\infty$. Then by Theorem 14.1.1 (ii) and the first result of this proof we have $0 \le E\eta_{n+1} \le E\eta_n \le 1$ for all $n$. Finally by Fatou's Lemma $E\eta_\infty \le E\eta_n$, and this completes the proof. $\square$

If for each $n$ the distribution of $(\xi_1, \ldots, \xi_n)$ under $Q$ is absolutely continuous with respect to its distribution under $P$, then the following stronger result holds.

Corollary 1 Under the assumptions of Theorem 14.5.2, if for all $n$, $Q(\xi_1, \ldots, \xi_n)^{-1} \ll P(\xi_1, \ldots, \xi_n)^{-1}$ (which is the case if $q_n = 0$ whenever $p_n = 0$) and $\mathcal{F} = \sigma(\bigcup_{n=1}^\infty \mathcal{F}_n)$, then $\{\eta_n, \mathcal{F}_n\}$ is a martingale. Furthermore $Q \ll P$ if and only if $\{\eta_n\}$ is uniformly integrable, in which case
$$\eta_n \to \frac{dQ}{dP} \quad \text{a.s. and in } L_1(\Omega, \mathcal{F}, P), \text{ as } n \to \infty.$$

Proof For each $n$ let $Q_n$, $P_n$ be the restrictions of $Q$, $P$ to $\mathcal{F}_n$. For every $E \in \mathcal{F}_n$ we have $E = (\xi_1, \ldots, \xi_n)^{-1}(B)$, $B \in \mathcal{B}^n$, and since by absolute continuity $P(\xi_1, \ldots, \xi_n)^{-1}(A_n^c) = 0$ implies $Q(\xi_1, \ldots, \xi_n)^{-1}(A_n^c) = 0$, we have
$$Q_n(E) = Q(\xi_1, \ldots, \xi_n)^{-1}(B) = Q(\xi_1, \ldots, \xi_n)^{-1}(B \cap A_n) = \int_{B \cap A_n} q_n\,dx_1 \ldots dx_n = \int_{B \cap A_n} \frac{q_n}{p_n}\,dP(\xi_1, \ldots, \xi_n)^{-1} = \int_B \frac{q_n}{p_n}\chi_{A_n}\,dP(\xi_1, \ldots, \xi_n)^{-1} = \int_E \eta_n\,dP_n.$$
Hence $\frac{dQ_n}{dP_n} = \eta_n$ and the result follows from Theorem 14.5.1. $\square$

When the r.v.'s $\{\xi_n\}$ are i.i.d. under both $P$ and $Q$, the following result provides a test for the distribution of a r.v. using independent observations.

Corollary 2 Assume that the conditions of Theorem 14.5.2 are satisfied and that under each probability measure $P$, $Q$ the r.v.'s $\{\xi_n\}$ are independent and identically distributed with (common) p.d.f. $p$, $q$ respectively. Then $\eta_n \to 0$ a.s. $(P)$ and $P \perp Q$, provided the distributions determined by $p$ and $q$ are distinct.

Proof In this case we have
$$\eta_n = \prod_{k=1}^n \frac{q(\xi_k)}{p(\xi_k)} \quad \text{a.s. } (P)$$
and thus by Theorem 14.5.2,
$$\eta_\infty = \prod_{k=1}^\infty \frac{q(\xi_k)}{p(\xi_k)} \quad \text{a.s. } (P).$$
Now let $\{\xi_n'\}$ be an i.i.d. sequence of r.v.'s independent also of the sequence $\{\xi_n\}$, with the same distribution as the sequence $\{\xi_n\}$ (such r.v.'s can always be constructed using product spaces). Let also
$$\eta_\infty' = \prod_{k=1}^\infty \frac{q(\xi_k')}{p(\xi_k')} \quad \text{a.s. } (P).$$
Then $\eta_\infty$ and $\eta_\infty\eta_\infty'$ are clearly identically distributed, and $\eta_\infty$, $\eta_\infty'$ are independent and identically distributed, so that
$$P\{\eta_\infty = 0\} = P\{\eta_\infty\eta_\infty' = 0\} = 1 - P\{\eta_\infty\eta_\infty' > 0\} = 1 - P\{\eta_\infty > 0\}P\{\eta_\infty' > 0\} = 1 - [1 - P\{\eta_\infty = 0\}]^2.$$
It follows that $P\{\eta_\infty = 0\} = 0$ or $1$.

Assume now that $P\{\eta_\infty = 0\} = 0$, so that $\eta_\infty > 0$ a.s. $(P)$. Then the r.v.'s $\log(\eta_\infty\eta_\infty') = \log\eta_\infty + \log\eta_\infty'$ and $\log\eta_\infty$ are identically distributed, and $\log\eta_\infty$, $\log\eta_\infty'$ are independent and identically distributed; thus if $\phi(t)$ is the c.f. of $\log\eta_\infty$ we have $\phi^2(t) = \phi(t)$ for all $t \in \mathbb{R}$. Since $\phi(0) = 1$ and $\phi$ is continuous, it follows that $\phi(t) = 1$ for all $t \in \mathbb{R}$ and thus $\eta_\infty = 1$ a.s. $(P)$. It follows that $\prod_{k=1}^\infty \frac{q(\xi_k)}{p(\xi_k)} = 1$ a.s. $(P)$ and thus $\eta_1 = \frac{q(\xi_1)}{p(\xi_1)} = 1$ a.s. Then for each $B \in \mathcal{B}$ we have, using the notation and facts from the proof of Corollary 1,
$$Q\xi_1^{-1}(B) = Q_1\xi_1^{-1}(B) = \int_{\xi_1^{-1}(B)} \eta_1\,dP_1 = P_1\xi_1^{-1}(B) = P\xi_1^{-1}(B)$$
which contradicts the assumption that the distributions of $\xi_1$ under $P$ and $Q$ are distinct. (In fact one can similarly show that $Q(\xi_1, \ldots, \xi_n)^{-1}(B) = P(\xi_1, \ldots, \xi_n)^{-1}(B)$ for all $B \in \mathcal{B}^n$ and all $n$, which implies that $P = Q$.) Hence, under the assumptions of the theorem, $P\{\eta_\infty = 0\} = 1$ and the proof may be completed by showing that $P \perp Q$.

By reversing the roles of the probability measures $P$ and $Q$ we have that
$$\prod_{k=1}^n \frac{p(\xi_k)}{q(\xi_k)} \to 0 \quad \text{a.s. } (Q).$$
Let $E_Q$ be the set of $\omega \in \Omega$ such that $\prod_{k=1}^n \frac{p(\xi_k(\omega))}{q(\xi_k(\omega))} \to 0$ and $E_P$ the set of $\omega \in \Omega$ such that $\prod_{k=1}^n \frac{q(\xi_k(\omega))}{p(\xi_k(\omega))} \to 0$. Then $P(E_P) = 1 = Q(E_Q)$ and clearly $E_P \cap E_Q = \emptyset$ since $\prod_{k=1}^n \frac{q(\xi_k)}{p(\xi_k)} \cdot \prod_{k=1}^n \frac{p(\xi_k)}{q(\xi_k)} = 1$ for all $n$. It follows that $P \perp Q$. $\square$
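The consistency phenomenon in Corollary 2 is easy to see numerically. In the following sketch (an illustrative example of our own; the normal densities and all parameters are assumptions, not from the text), samples are drawn under $P$ while the likelihood ratio uses $q/p$: the logarithm of $\eta_n$ drifts to $-\infty$ at a linear rate, i.e. $\eta_n \to 0$.

```python
import random

random.seed(3)

def log_likelihood_ratio(n, mu_p=0.0, mu_q=1.0):
    """log eta_n = sum_k log( q(xi_k) / p(xi_k) ) for xi_k i.i.d. under P,
    with p, q the N(mu_p, 1) and N(mu_q, 1) densities; for these the log
    ratio reduces to (mu_q - mu_p) * x - (mu_q**2 - mu_p**2) / 2."""
    total = 0.0
    for _ in range(n):
        x = random.gauss(mu_p, 1.0)          # sampling under P
        total += (mu_q - mu_p) * x - (mu_q ** 2 - mu_p ** 2) / 2
    return total

# Under P the log-likelihood ratio drifts linearly to -infinity
# (expected slope -1/2 per observation here), i.e. eta_n -> 0 a.s.,
# as Corollary 2 asserts for distinct p and q.
log_eta = log_likelihood_ratio(5000)
```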

Exercises

14.1 Let {ξn, Fn} be a submartingale. Let the sequence of r.v.’s {εn} be such that for all n, εn is Fn-measurable and takes only the values 0 and 1. Define the sequence of r.v.’s {ηn} by

η1 = ξ1

ηn+1 = ηn + εn(ξn+1 – ξn), n ≥ 1.

Show that $\{\eta_n, \mathcal{F}_n\}$ is also a submartingale and $E\eta_n \le E\xi_n$ for all $n$. If $\{\xi_n, \mathcal{F}_n\}$ is a martingale show that $\{\eta_n, \mathcal{F}_n\}$ is also a martingale and $E\eta_n = E\xi_n$ for all $n$. (Do you see any gambling interpretation of this?)

14.2 Prove that every uniformly integrable submartingale $\{\xi_n, \mathcal{F}_n\}$ can be uniquely decomposed as

ξn = ηn + ζn for all n a.s.

where $\{\eta_n, \mathcal{F}_n\}$ is a uniformly integrable martingale and $\{\zeta_n, \mathcal{F}_n\}$ is a negative ($\zeta_n \le 0$ for all $n$ a.s.) submartingale such that $\lim_n \zeta_n = 0$ a.s. This is called the Riesz decomposition of a submartingale.

14.3 Let $\{\mathcal{F}_n\}$ be a sequence of sub-$\sigma$-fields of $\mathcal{F}$ such that $\mathcal{F}_n \subset \mathcal{F}_{n+1}$ for all $n$ and $\mathcal{F}_\infty = \sigma(\bigcup_{n=1}^\infty \mathcal{F}_n)$. Show that if $E \in \mathcal{F}_\infty$ then

$$\lim_{n\to\infty} P(E|\mathcal{F}_n) = \chi_E \quad \text{a.s.}$$

14.4 (Polya's urn scheme) Suppose an urn contains $b$ blue and $r$ red balls. At each drawing a ball is drawn at random, its color is noted and the drawn ball together with $a > 0$ balls of the same color are added to the urn. Let $b_n$ be the number of blue balls and $r_n$ the number of red balls after the $n$th drawing and let $\xi_n = b_n/(b_n + r_n)$ be the proportion of blue balls. Show that $\{\xi_n\}$ is a martingale and that $\xi_n$ converges a.s. and in $L_1$.

14.5 The inequalities proved in Theorems 14.2.1 and 14.2.2 for finite submartingales depend only on the fact that the submartingales considered have a "last element". Specifically show that if $\{\xi_n, \mathcal{F}_n : n = 1, 2, \ldots, \infty\}$ is a submartingale then for all real $a$,
$$aP\{\sup_{1\le n\le\infty} \xi_n \ge a\} \le \int_{\{\sup_{1\le n\le\infty}\xi_n \ge a\}} \xi_\infty\,dP \le E|\xi_\infty|,$$

and if also $\xi_n \ge 0$ a.s. for all $n = 1, 2, \ldots, \infty$, then for all $1 < p < \infty$,
$$E\big(\sup_{1\le n\le\infty} \xi_n\big)^p \le \Big(\frac{p}{p-1}\Big)^p E\xi_\infty^p.$$

14.6 The following is an example of a martingale converging a.s. but not in $L_1$. Let $\Omega$ be the set of all positive integers, $\mathcal{F}$ the $\sigma$-field of all subsets of $\Omega$, and $P$ defined by
$$P(\{n\}) = \frac{1}{n} - \frac{1}{n+1} \quad \text{for all } n = 1, 2, \ldots.$$
Let $[n, \infty)$ denote the set of all integers $\ge n$ and define
$$\mathcal{F}_n = \sigma(\{1\}, \{2\}, \ldots, \{n\}, [n+1, \infty))$$
$$\xi_n = (n+1)\chi_{[n+1,\infty)}$$
for $n = 1, 2, \ldots$. Show that $\{\xi_n, \mathcal{F}_n\}_{n=1}^\infty$ is a martingale with $E\xi_n = 1$. Show also that $\xi_n$ converges a.s. (and find its limit) but not in $L_1$.

14.7 If $\{\xi_n, \mathcal{F}_n : n = 1, 2, \ldots, \infty\}$ is a nonnegative submartingale, show that $\{\xi_n, n = 1, 2, \ldots\}$ is uniformly integrable (cf. Theorem 14.3.2).

14.8 Let $\{\xi_n, \mathcal{F}_n\}_{n=1}^\infty$ be a martingale or a nonnegative submartingale. If
$$\lim_{n\to\infty} E(|\xi_n|^p) < \infty$$

for some $1 < p < \infty$, show that $\xi_n$ converges a.s. and in $L_p$. (Hint: Use Theorems 14.3.1 and 14.2.2.)

14.9 Let $(\Omega, \mathcal{F}, P)$ be a probability space and $\{\mathcal{F}_n\}_{n=1}^\infty$ a sequence of sub-$\sigma$-fields of $\mathcal{F}$ such that $\mathcal{F}_n \subset \mathcal{F}_{n+1}$ and $\mathcal{F} = \sigma(\bigcup_{n=1}^\infty \mathcal{F}_n)$. Let $Q$ be a finite measure on $(\Omega, \mathcal{F})$. Denote by $P_n$, $Q_n$ the restrictions of $P$, $Q$ to $\mathcal{F}_n$ and the corresponding Lebesgue–Radon–Nikodym decompositions by
$$Q_n(E) = \int_E \xi_n\,dP_n + Q_n(E \cap N_n), \quad E \in \mathcal{F}_n$$
$$Q(E) = \int_E \xi\,dP + Q(E \cap N), \quad E \in \mathcal{F}$$
where $0 \le \xi_n \in L_1(\Omega, \mathcal{F}_n, P_n)$, $0 \le \xi \in L_1(\Omega, \mathcal{F}, P)$, $N_n \in \mathcal{F}_n$, $N \in \mathcal{F}$ and $P_n(N_n) = 0$, $P(N) = 0$. Show that $\{\xi_n, \mathcal{F}_n\}_{n=1}^\infty$ is a supermartingale and that $\xi_n \to \xi$ a.s. $(P)$. (Hint: Imitate the proof of Theorem 14.5.1.)

14.10 Let $f$ be a Lebesgue integrable function defined on $[0,1]$. For each $n$, let $0 = a_0^{(n)} < a_1^{(n)} < \ldots < a_n^{(n)} = 1$ be a partition of $[0,1]$ with $\delta^{(n)} = \sup_{0\le k\le n-1}(a_{k+1}^{(n)} - a_k^{(n)}) \to 0$, and assume that the partitions become finer as $n$ increases. For each $n$, define $f_n$ on $[0,1]$ by
$$f_n(x) = \frac{1}{a_{k+1}^{(n)} - a_k^{(n)}} \int_{a_k^{(n)}}^{a_{k+1}^{(n)}} f(y)\,dy \quad \text{for } a_k^{(n)} < x \le a_{k+1}^{(n)}$$
and by continuity at $x = 0$. Then show that

$$\lim_{n\to\infty} f_n(x) = f(x) \quad \text{a.e. } (m) \text{ and in } L_1 \quad (m = \text{Lebesgue measure}).$$

14.11 Let $(\Omega, \mathcal{F})$ be a measurable space and assume that $\mathcal{F}$ is purely atomic, i.e. $\mathcal{F}$ is generated by the disjoint sets $\{E_n\}_{n=1}^\infty$ with $\Omega = \bigcup_{n=1}^\infty E_n$. Let $(T, \mathcal{T})$ be another measurable space, $\{P_t, t \in T\}$ a family of probability measures on $(\Omega, \mathcal{F})$ and $\{Q_t, t \in T\}$ a family of signed measures on $(\Omega, \mathcal{F})$. Assume that for each $t \in T$, $Q_t \ll P_t$ and that for each $E \in \mathcal{F}$, $P_t(E)$ and $Q_t(E)$ are measurable functions on $(T, \mathcal{T})$. Show that there is a $\mathcal{T} \times \mathcal{F}$-measurable function $\xi(t, \omega)$ such that for each fixed $t \in T$,
$$\xi(t, \omega) = \frac{dQ_t}{dP_t}(\omega) \quad \text{a.s. } (P_t).$$
(Hint: Apply Theorem 14.5.1 with $\mathcal{F}_n = \sigma(E_1, \ldots, E_n)$.)

15

Basic structure of stochastic processes

Our aim in this final chapter is to indicate how basic distributional theory for stochastic processes, alias random functions, may be developed from the considerations of Chapters 7 and 9. This is primarily for reference and for readers with a potential interest in the topic. The theory will be first illustrated by a discussion of the definition of the Wiener process, and conditions for sample function continuity. This will be complemented, and the chapter completed, with a sketch of the construction and basic properties of point processes and random measures in a purely measure-theoretic framework, consistent with the nontopological flavor of the entire volume.

15.1 Random functions and stochastic processes
In this section we introduce some basic distributional theory for stochastic processes and random functions, using the product space measures of Chapter 7 and the random element concepts of Chapter 9.
By a stochastic process one traditionally means a family of real random variables {ξ_t : t ∈ T} (ξ_t = ξ_t(ω)) on a probability space (Ω, F, P), T being a set indexing the ξ_t. If T = {1, 2, 3, ...} or {..., –2, –1, 0, 1, 2, ...} the family {ξ_n : n = 1, 2, ...} or {ξ_n : n = ..., –2, –1, 0, 1, 2, ...} is referred to as a stochastic sequence or discrete parameter stochastic process, whereas {ξ_t : t ∈ T} is termed a continuous parameter stochastic process if T is an interval (finite or infinite).
We assume throughout this chapter that each r.v. ξ_t(ω) is defined (and finite) for all ω (not just a.e.). Then for a fixed ω the values ξ_t(ω) define a function ξω ((ξω)(t) = ξ_t(ω), t ∈ T) in R^T, and the F|B-measurability of each ξ_t(ω) implies F|B^T-measurability of ξ, as will be shown in Lemma 15.1.1. The mapping ξ is thus a random element (r.e.) of (R^T, B^T) and is termed a random function (r.f.). As will be seen in Lemma 15.1.1 the converse also holds – if ξ is a measurable mapping from (Ω, F, P) to (R^T, B^T) then the ω-functions ξ_t(ω) = (ξω)(t) are F|B-measurable for each t, i.e. the ξ_t are

r.v.'s. Thus the notions of a stochastic process (family of r.v.'s) and a r.f. are entirely equivalent. For a fixed ω, the function (ξω)(t), t ∈ T, is termed a sample function (or sample path or realization) of the process.

Lemma 15.1.1 For each t ∈ T, let ξ_t = ξ_t(ω) be a real function of ω ∈ Ω and let ξ be the mapping from Ω to R^T defined as ξω = {ξ_t(ω) : t ∈ T}. Then ξ_t is F|B-measurable for each t ∈ T iff ξ is F|B^T-measurable (see Section 7.9 for the definition of B^T).

Proof For u = (t_1, ..., t_k) the projection π_u = π_{t_1,...,t_k} from R^T to R^k is clearly B^T|B^k-measurable since if B ∈ B^k, π_u^{-1}B is a cylinder and hence is in B^T. Hence if ξ is F|B^T-measurable, ξ_t = π_t ξ is F|B-measurable for each t.
Conversely if each ξ_t is F|B-measurable, (ξ_{t_1}, ..., ξ_{t_k}) is clearly F|B^k-measurable, i.e. π_u ξ is F|B^k-measurable for u = (t_1, ..., t_k). Hence if B ∈ B^k, ξ^{-1}π_u^{-1}B = (π_u ξ)^{-1}B ∈ F, i.e. ξ^{-1}E ∈ F for each cylinder E. Since these cylinders generate B^T, it follows that ξ is F|B^T-measurable as required. □

Probabilistic properties of individual ξ_t or finite groups (ξ_{t_1}, ..., ξ_{t_k}) are, of course, defined by the respective marginal or joint distributions
Pξ_t^{-1}(B) = P{ω : ξ_t(ω) ∈ B}, B ∈ B,
P(ξ_{t_1}, ..., ξ_{t_k})^{-1}(B) = P{ω : (ξ_{t_1}(ω), ..., ξ_{t_k}(ω)) ∈ B}, B ∈ B^k.
These are respectively read as P{ξ_t ∈ B}, P{(ξ_{t_1}, ..., ξ_{t_k}) ∈ B} and are as noted Lebesgue–Stieltjes measures on B and B^k corresponding to the distribution functions
F_t(x) = P{ξ_t ≤ x}, F_{t_1,...,t_k}(x_1, ..., x_k) = P{ξ_{t_i} ≤ x_i, 1 ≤ i ≤ k}.
These joint distributions of ξ_{t_1}, ..., ξ_{t_k} for t_i ∈ T, 1 ≤ i ≤ k, k = 1, 2, ..., are termed the finite-dimensional distributions (fidi's) of the process {ξ_t : t ∈ T}.
The fidi's determine many useful probabilistic properties of the process but are restricted to probabilities of sets of values taken by finite groups of ξ_t's. On the other hand, one may be interested in the probability that the entire sample function ξ_t, t ∈ T, lies in a given set of functions, i.e. P{ξ ∈ E} = P{ω : ξω ∈ E} = Pξ^{-1}(E), which is defined for E ∈ B^T. Further assumptions may be needed for sets E of interest but not in B^T, e.g. to determine that the sample functions are continuous a.s. (see Sections 15.3, 15.4).

This probability measure Pξ^{-1} on B^T is called the distribution of (the r.f.) ξ and it encompasses the fidi's. Specifically, the fidi's are special cases of values of Pξ^{-1}; for example, if B ∈ B^k

P{(ξ_{t_1}, ..., ξ_{t_k}) ∈ B} = P{π_{t_1,...,t_k} ξ ∈ B} = Pξ^{-1}(π_{t_1,...,t_k}^{-1} B),
i.e. the probability that the sample function ξω lies in the cylinder π_{t_1,...,t_k}^{-1}B of B^T. That is, the fidi's have the form Pξ^{-1}π_{t_1,...,t_k}^{-1} for each k and t_1, ..., t_k ∈ T.
On the other hand, note also that the fidi's determine the distribution of a stochastic process; that is, if two stochastic processes have the same fidi's, then they have the same distribution. This follows from Theorem 2.2.7 and the fact that B^T is generated by the cylinders π_{t_1,...,t_k}^{-1}(B).
The fidi's of a stochastic process are thus related to the distribution Pξ^{-1} of ξ on B^T exactly as the measures ν_u are related to μ in Section 7.10. In particular the fidi's are consistent as there defined, i.e. if u = (t_1, ..., t_k), v = (s_1, ..., s_l) ⊂ u, ξ_u = (ξ_{t_1}, ..., ξ_{t_k}), ξ_v = (ξ_{s_1}, ..., ξ_{s_l}), then Pξ_u^{-1}π_{uv}^{-1} = Pξ_v^{-1}, i.e. P(π_{uv}ξ_u)^{-1} = Pξ_v^{-1} (π_{uv} denoting the projection from R^u onto R^v). This may be made more transparent by noting its equivalence to consistency of the d.f.'s in the sense that for each n = 1, 2, ... and any choice of t_1, ..., t_n and x_1, ..., x_n

(i) F_{t_1,...,t_n}(x_1, ..., x_n) is unaltered by the same permutation of both t_1, ..., t_n and x_1, ..., x_n,
(ii) F_{t_1,...,t_{n-1}}(x_1, ..., x_{n-1}) = F_{t_1,...,t_{n-1},t_n}(x_1, ..., x_{n-1}, ∞) = lim_{x_n→∞} F_{t_1,...,t_{n-1},t_n}(x_1, ..., x_{n-1}, x_n).
The requirement (i) can of course be achieved (on the real line) by defining F_{t_1,...,t_n} for t_1 < ··· < t_n and rearranging other time sets to natural order, and hence is not an issue when T is a subset of R.
Kolmogorov's Theorem (Theorem 7.10.3) may then be put in the following form.

Theorem 15.1.2 Let {ν_u} be as in Theorem 7.10.3, a family of probability measures defined on (R^u, B^u) for finite subsets u of an index set T. If the family {ν_u} is consistent in the sense that ν_u π_{uv}^{-1} = ν_v for each u, v with v ⊂ u, then there is a stochastic process {ξ_t : t ∈ T} (unique in distribution) having {ν_u} as its fidi's. That is, P{(ξ_{t_1}, ..., ξ_{t_k}) ∈ B} = ν_u(B) for each choice of k, u = (t_1, ..., t_k), B ∈ B^k.

Proof Let P denote the unique probability measure on (R^T, B^T) in Theorem 7.10.3, satisfying Pπ_u^{-1} = ν_u for each finite set u ⊂ T. Define the probability space (Ω, F, P) as (R^T, B^T, P). The projection r.v.'s ξ_t(ω) = π_t ω = ω(t) for ω ∈ R^T give the desired stochastic process {ξ_t : t ∈ T} with the given fidi's ν_u. □

Corollary 1 below restates the theorem in terms of distribution functions. Corollary 2 considers the special case of an independent family.
Corollary 1 Let {F_{t_1,...,t_k} : t_1, ..., t_k ∈ T, k = 1, 2, ...} be a family of k-dimensional d.f.'s, assumed consistent in the sense described prior to the statement of the theorem. Then there is a stochastic process {ξ_t : t ∈ T} having these d.f.'s defining its fidi's, i.e.
P{ξ_{t_i} ≤ x_i, 1 ≤ i ≤ k} = F_{t_1,...,t_k}(x_1, ..., x_k)
for each choice of k, t_1, ..., t_k.

Proof This follows since the d.f.'s F_{t_1,...,t_k} clearly determine consistent probability distributions ν_u for each u = (t_1, ..., t_k). □

Corollary 2 If F_i are d.f.'s for i = 1, 2, ..., there exists a sequence of independent r.v.'s ξ_1, ξ_2, ... such that ξ_i has d.f. F_i for each i.
Proof This follows from Corollary 1 by noting consistency of the d.f.'s
F_{t_1,...,t_k}(x_1, ..., x_k) = ∏_{i=1}^k F_{t_i}(x_i). □
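Corollary 2 has a concrete counterpart familiar from simulation: given the quantile functions (generalized inverses) F_i^{-1}, an independent sequence with the prescribed d.f.'s may be produced by the inverse-transform method. A minimal Python sketch; the three d.f.'s used below are illustrative choices, not taken from the text:

```python
import math
import random

def independent_sequence(quantiles, seed=0):
    """Produce independent xi_i with d.f. F_i by inverse transform:
    if U is uniform on (0,1), then F^{-1}(U) has d.f. F."""
    rng = random.Random(seed)
    return [q(rng.random()) for q in quantiles]

# Illustrative d.f.'s via their quantile functions:
# exponential(1), uniform(0,1), and standard Cauchy.
qs = [lambda u: -math.log(1 - u),
      lambda u: u,
      lambda u: math.tan(math.pi * (u - 0.5))]
xs = independent_sequence(qs, seed=42)
```

The construction mirrors the proof: each coordinate is obtained from its own independent uniform variable, so the joint d.f. factors as the product ∏ F_{t_i}(x_i).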

15.2 Construction of the Wiener process in R^{[0,1]}

The Wiener process W_t on [0, 1] (a.k.a. Brownian motion) provides an illuminating and straightforward example of the use of Kolmogorov's Theorem to construct a stochastic process.
W_t is to be defined by the requirement that all its fidi's be normal with zero means and cov(W_s, W_t) = min(s, t). Thus the fidi for (W_{t_1}, W_{t_2}, ..., W_{t_k}), 0 ≤ t_1 < t_2 < ··· < t_k ≤ 1, is to be normal, with zero means and covariance matrix (see Section 9.4)

Λ_{t_1,...,t_k} =
⎡ t_1 t_1 t_1 ··· t_1 ⎤
⎢ t_1 t_2 t_2 ··· t_2 ⎥
⎢ t_1 t_2 t_3 ··· t_3 ⎥
⎢  .   .   .  ⋱   .  ⎥
⎣ t_1 t_2 t_3 ··· t_k ⎦

This matrix is readily seen to be nonnegative definite (e.g. its determinant is t_1(t_2 – t_1)(t_3 – t_2)···(t_k – t_{k-1}), as may be simply shown by subtracting the (i – 1)th row from the ith for i = k, k – 1, ..., 2). Thus Λ_{t_1,...,t_k} is a covariance matrix of a k-dimensional normal distribution, and the elimination of one or more points t_j gives a matrix of the same form in the remaining t_j's, showing the consistency required for Kolmogorov's Theorem (or Theorem 15.1.2). Hence, by that theorem, there is a process {W_t : t ∈ [0, 1]} with the desired fidi's.
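The determinant formula above can be verified exactly for particular 0 < t_1 < ··· < t_k; a small sketch using exact rational arithmetic (the time points below are illustrative):

```python
from fractions import Fraction

def wiener_cov(ts):
    """Covariance matrix of (W_{t_1}, ..., W_{t_k}): entries min(t_i, t_j)."""
    return [[min(s, t) for t in ts] for s in ts]

def det(m):
    """Plain cofactor-expansion determinant (fine for small k)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

ts = [Fraction(1, 8), Fraction(1, 3), Fraction(1, 2), Fraction(7, 8)]
lam = wiener_cov(ts)
# The text's formula: det = t_1 (t_2 - t_1)(t_3 - t_2)...(t_k - t_{k-1}) > 0.
expected = ts[0]
for a, b in zip(ts, ts[1:]):
    expected *= (b - a)
assert det(lam) == expected
```

Since the determinant (and by the same row operations every leading principal minor) is positive for distinct increasing t_i, the matrix is indeed a valid normal covariance matrix.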

15.3 Processes on special subspaces of R^T
A stochastic process ξ constructed via Kolmogorov's Theorem is a random element of (R^T, B^T). Hence one may determine the probability P{ξ ∈ E} that the sample function ξ_t, t ∈ T, lies in the set E of functions, for any E ∈ B^T. However, one is sometimes interested in sets E which are not in B^T (as, for example, when T = [0, 1], E = C[0, 1], the set of continuous functions on [0, 1]).
A small but useful extension to the framework occurs when ξ ∈ A a.s. where A ⊂ R^T but A may or may not be in B^T. Note that the statement ξ ∈ A a.s. means that A^c ⊂ A_0 for some A_0 ∈ B^T with Pξ^{-1}(A_0) = 0. The extension may be simply achieved by assuming that the space (Ω, F, P) is complete (or if not, by completing it to be so in the standard manner – see Section 2.6). Then with A, A_0 as above, ξ^{-1}A^c ∈ F since P is complete on F. Hence also ξ^{-1}A ∈ F, Pξ^{-1}(A^c) = 0 and ξ^{-1}(A ∩ E) = ξ^{-1}A ∩ ξ^{-1}E ∈ F for all E ∈ B^T. Hence if ξ_t, t ∈ T, is redefined as a fixed function in A at points ω ∈ Ω for which {ξ_t(ω) : t ∈ T} ∉ A (or if the space Ω is reduced to eliminate such points), then A includes all the values of (ξ_t(ω) : t ∈ T) and may be regarded as a space with a σ-field A = A ∩ B^T. ξ is then a random element in (A, A) with distributions satisfying Pξ^{-1}(F) = Pξ^{-1}(E) for F = E ∩ A, E ∈ B^T.
An interesting and useful special case occurs when T is an interval and A is the set of real, continuous functions on T. For example, take T to be the unit interval [0, 1] (with standard notation A = C[0, 1], the space of continuous functions on [0, 1]). If a stochastic process {ξ_t : t ∈ [0, 1]} has a.s. continuous sample functions (i.e. ξ_t(ω) is continuous on 0 ≤ t ≤ 1 a.s.), then the r.f. ξ may be regarded as a random element of (C, C) where C = C[0, 1] (⊂ R^{[0,1]}) and C = C ∩ B^{[0,1]}. This is a natural and simple viewpoint.
It is, of course, possible to regard C as a space of continuous functions, without reference to R^T, and to view it as a metric space, with metric defined by the norm ||x|| = sup{|x(t)| : 0 ≤ t ≤ 1}. The class of Borel sets of such a topological space is then defined to be the σ-field generated by the open sets. This may be shown to be also generated by the (finite-dimensional) cylinder sets of C, i.e. sets of the form π_{t_1,...,t_k}^{-1}B where B ∈ B^k

and π_{t_1,...,t_k} is the usual projection mapping but restricted to C rather than R^T. It may thus be seen that the Borel sets form precisely the same σ-field C ∩ B^T in C as defined and used above. This connection provides a vehicle for the consideration of properties which involve topology more intimately – such as the development of weak convergence theory in C.

15.4 Conditions for continuity of sample functions
In view of the above discussion it is of interest to give conditions on a process which will guarantee a.s. continuity of sample functions. The theorem to be shown, generalizing original results of Kolmogorov (see [Loève] and [Cramér & Leadbetter]), gives sufficient conditions for a process ξ_t on [0, 1] to have an equivalent version η_t (i.e. ξ_t = η_t a.s. for each t) with a.s. continuous sample functions.

Theorem 15.4.1 Let ξ_t be a process on [0, 1] such that for all t, t + h ∈ [0, 1]

P{|ξ_{t+h} – ξ_t| ≥ g(h)} ≤ q(h)
where g, q are nonnegative functions of h > 0, nonincreasing as h ↓ 0, and such that Σ_n g(2^{-n}) < ∞, Σ_n 2^n q(2^{-n}) < ∞. Then there exists a process η_t on [0, 1] with a.s. continuous sample functions and such that ξ_t = η_t a.s. for each t. In particular, of course, η has the same fidi's as ξ.

Proof Approximate ξ_t by piecewise linear processes ξ_t^n taking the values ξ_t at t = t_{n,r} = r/2^n, r = 0, 1, ..., 2^n, and linear between such points. Then clearly for t_{n,r} ≤ t ≤ t_{n,r+1},
|ξ_t^{n+1} – ξ_t^n| ≤ |ξ_{t_{n+1,2r+1}} – (1/2)(ξ_{t_{n+1,2r}} + ξ_{t_{n+1,2r+2}})| ≤ (1/2)A + (1/2)B
where
A = |ξ_{t_{n+1,2r+1}} – ξ_{t_{n+1,2r}}|, B = |ξ_{t_{n+1,2r+1}} – ξ_{t_{n+1,2r+2}}|
and hence
P{max_{t_{n,r} ≤ t ≤ t_{n,r+1}} |ξ_t^{n+1} – ξ_t^n| ≥ g(2^{-n-1})} ≤ P{A ≥ g(2^{-n-1})} + P{B ≥ g(2^{-n-1})} ≤ 2q(2^{-n-1})
so that
P{max_{0 ≤ t ≤ 1} |ξ_t^{n+1} – ξ_t^n| ≥ g(2^{-n-1})} ≤ 2^{n+1} q(2^{-n-1}).
Since Σ 2^n q(2^{-n}) < ∞ it follows by the Borel–Cantelli Lemma (Theorem 10.5.1) that a.s., max_{0 ≤ t ≤ 1} |ξ_t^{n+1} – ξ_t^n| < g(2^{-n-1}) for n ≥ n_0 = n_0(ω). Since Σ g(2^{-n}) < ∞ it follows that {ξ_t^n} is uniformly Cauchy a.s. and thus uniformly convergent a.s. to a continuous η_t as n → ∞. Also η_t = ξ_t a.s. for t = t_{n,r} since ξ_t^{n+p} = ξ_t, p = 0, 1, ....
If t is not equal to any t_{n,r}, t = lim t_{n,r_n}, 0 < t – t_{n,r_n} < 2^{-n} and
P{|ξ_{t_{n,r_n}} – ξ_t| ≥ g(t – t_{n,r_n})} ≤ q(t – t_{n,r_n}) ≤ q(2^{-n})
so that P{|ξ_{t_{n,r_n}} – ξ_t| ≥ g(2^{-n})} ≤ q(2^{-n}) and the Borel–Cantelli Lemma gives ξ_{t_{n,r_n}} → ξ_t a.s.
Since η_{t_{n,r_n}} → η_t a.s. and ξ_{t_{n,r_n}} = η_{t_{n,r_n}} a.s., it follows that ξ_t = η_t a.s. for each t as required. □
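The key geometric fact in the proof – that on each dyadic interval the difference of consecutive piecewise-linear approximants is largest at the midpoint, where it equals the deviation of the sample function from the average of the endpoint values – can be checked numerically for any fixed sample function. A sketch (the sample function used is an arbitrary illustration):

```python
import math

def dyadic_interp(f, n, t):
    """Piecewise linear interpolant xi^n of a sample function f
    through its values at the dyadic points r/2^n."""
    k = 2 ** n
    r = min(int(t * k), k - 1)
    a, b = r / k, (r + 1) / k
    return f(a) + (f(b) - f(a)) * (t - a) / (b - a)

f = lambda t: math.sin(7 * t) + t * t   # any fixed sample function
n = 4
for r in range(2 ** n):
    a, b = r / 2 ** n, (r + 1) / 2 ** n
    mid = (a + b) / 2
    # Midpoint deviation: |f(mid) - (f(a)+f(b))/2| bounds xi^{n+1} - xi^n on [a, b].
    bound = abs(f(mid) - 0.5 * (f(a) + f(b)))
    grid = [(r + j / 200) / 2 ** n for j in range(201)]
    worst = max(abs(dyadic_interp(f, n + 1, t) - dyadic_interp(f, n, t)) for t in grid)
    assert worst <= bound + 1e-12
```

The difference ξ^{n+1} – ξ^n is piecewise linear on [a, b] with value 0 at the endpoints, so its maximum modulus is attained at the midpoint, exactly as used in the estimate P{A ≥ g} + P{B ≥ g}.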

15.5 The Wiener process on C and Wiener measure The preceding theorem readily applies to the Wiener process yielding the following result.

Theorem 15.5.1 The Wiener process {W_t : t ∈ [0, 1]} may be taken to have a.s. continuous sample functions.

Proof This follows from the above result. For W_{t+h} – W_t is normal, with zero mean and variance |h|. Take 0 < a < 1/2. Then

P{|W_{t+h} – W_t| ≥ |h|^a} = 2{1 – Φ(|h|^{a–1/2})} ≤ 2|h|^{1/2–a} φ(|h|^{a–1/2})
(where Φ, φ are the standard normal d.f. and p.d.f. respectively) since 1 – Φ(x) ≤ φ(x)/x for x > 0. If g(h) = |h|^a, q(h) = 2|h|^{1/2–a} φ(|h|^{a–1/2}) then
Σ g(2^{-n}) = Σ 2^{-na} < ∞, Σ 2^n q(2^{-n}) = 2 Σ 2^{n(1+2a)/2} φ(2^{n(1–2a)/2}) < ∞

(the last convergence being easily checked). Hence a.s. continuity of (an equivalent version of) W_t follows from Theorem 15.4.1. □
As seen in Section 15.3, a process with a.s. continuous sample functions may be naturally viewed as a random element of (C, C) where C = C[0, 1] and C = C ∩ B^{[0,1]}. By Theorem 15.5.1, the Wiener process W_t may be so regarded. The steps in the construction were (a) to use Kolmogorov's Theorem to define a process, say W_t^0, in (R^T, B^T) having the prescribed (normal) fidi's, (b) to replace W_t^0 by an equivalent version W_t with a.s. continuous sample functions, i.e. W_t = W_t^0 a.s. for each t (hence with the same fidi's), and (c) to consider W = {W_t : t ∈ [0, 1]} as a random element of (C, C) by restricting to C = C[0, 1] (and taking C = C ∩ B^{[0,1]}, equivalently the Borel σ-field of the topological space C as noted in Section 15.3).
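The tail estimate 1 – Φ(x) ≤ φ(x)/x used in the proof, and the resulting bound on P{|W_{t+h} – W_t| ≥ |h|^a}, can be checked numerically with only the standard library (Φ computed via the complementary error function; the values of a and h are illustrative):

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def upper_tail(x):
    """1 - Phi(x), via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2))

# Mills-ratio bound 1 - Phi(x) <= phi(x)/x for x > 0, as used in the proof.
for x in [0.1, 0.5, 1.0, 2.0, 4.0]:
    assert upper_tail(x) <= phi(x) / x

# Resulting bound on P{|W_{t+h} - W_t| >= |h|^a} for a = 1/4, h = 1/64.
a, h = 0.25, 1 / 64
lhs = 2 * upper_tail(h ** (a - 0.5))
rhs = 2 * h ** (0.5 - a) * phi(h ** (a - 0.5))
assert lhs <= rhs
```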

As a result of this construction a probability measure PW^{-1} (the distribution of W) is obtained on the measurable space (C, C). This probability measure is termed Wiener measure and is customarily also denoted by W. This measure has, of course, multivariate normal form for the fidi probabilities induced on the sets B^u, u = (t_1, ..., t_k), for each k. Of course, the space (C, C, W) can be used as the (Ω, F, P) on which the Wiener process is defined as the identity mapping Wω = ω.
Finally, it may be noted that an alternative approach to Wiener measure and the Wiener process is to define the latter as a distributional limit of simple processes of random walk type (cf. [Billingsley]). This is less direct and does require considerable weak convergence machinery, but has the advantage of simultaneously producing the "invariance principle" (functional central limit theorem) of Donsker, which has significant use e.g. in applications to areas such as sequential analysis.

15.6 Point processes and random measures
In the preceding sections we have indicated some basic structural theory for stochastic processes with continuous sample functions and given useful sufficient conditions for continuity. This included the construction and continuity of the celebrated Wiener process – a key component, along with its various extensions, in stochastic modeling in diverse fields.
At the other end of the spectrum are processes whose sample functions are patently discontinuous, which may be used to model random sequences of points (i.e. point processes) and their extensions to more general random measures. A special position among these is held by the Poisson process, which is arguably equally as prominent as the Wiener process for its extensions and applications.
There are a number of ways of providing a framework for point processes on the (e.g. positive) real line, perhaps the most obvious being the description as a family {τ_n : n = 1, 2, ...} of r.v.'s 0 ≤ τ_1 ≤ τ_2 ≤ ··· (defined on (Ω, F, P)), representing the positions of points. To avoid accumulation points it is assumed that τ_n → ∞ a.s. In particular the assumption that τ_1, τ_2 – τ_1, τ_3 – τ_2, ... are independent and identically distributed with d.f. F(·) leads to a renewal process, and the particular case F(x) = 1 – e^{–λx}, x > 0, gives a Poisson process with intensity λ. Fine detailed accounts of these and related processes abound, of which, for example, [Feller] may be regarded as a seminal work. Our purpose here is just to indicate how a general abstract framework may arise naturally by adding randomness to the measure-theoretic structure considered throughout this volume, in line with the random element approach to real-valued processes of the preceding sections.
An alternative viewpoint to that above of regarding a point process as the sequence {τ_n : 0 < τ_1 < τ_2 < ···} of its point occurrence times is to consider the family of (extended) r.v.'s ξ(B), taking values 0, 1, 2, ..., +∞, consisting of the numbers of τ_i in (Borel) sets B. The assumption τ_n → ∞ means that ξ(B) < ∞ for bounded Borel sets B. Since ξ(B) is clearly countably additive, it may be regarded as a (random) counting measure on the Borel sets of [0, ∞). The two alternative viewpoints are connected e.g. by the relation {ξ(0, x] ≥ n} = {τ_n ≤ x}. A simple Poisson process with intensity λ may then be regarded as a random counting measure ξ(B) as above with P{ξ(B) = r} = e^{–λm(B)}(λm(B))^r/r! (m = Lebesgue measure as always) for each Borel B ⊂ [0, ∞), and such that ξ(B_1), ξ(B_2) are independent for disjoint such B_1, B_2.
It is natural to extend this latter view of a point process (a) to include ξ(B) which are not necessarily integer-valued (i.e. to define random measures (r.m.'s) which are not necessarily point processes) and (b) to consider such concepts on a space more general than the real line, such as R^k or a space S with a topological structure. A detailed, encyclopedic account of r.m.'s may be found in [Kallenberg] for certain metric ("Polish") spaces. The topological assumptions involved are most useful for consideration of more intricate properties (such as weak convergence) of point processes and r.m.'s. However, for the basic r.m. framework they are primarily used to define a purely measure-theoretic structure involving classes of sets (semirings, rings, σ-fields) considered without topology in this volume. Hence our preferred approach in this brief introduction is to define a "clean" purely measure-theoretic framework in the spirit of this volume, leaving topological considerations for possible later study and as a setting for development of more complex properties of interest.
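The pathwise identity {ξ(0, x] ≥ n} = {τ_n ≤ x} linking the two viewpoints can be illustrated by simulating the renewal description with exponential interarrivals; a minimal sketch (the intensity and horizon are illustrative):

```python
import random

def poisson_points(lam, horizon, seed=0):
    """Occurrence times tau_1 < tau_2 < ... of a Poisson process with
    intensity lam on (0, horizon], via i.i.d. exponential interarrivals."""
    rng = random.Random(seed)
    taus, t = [], 0.0
    while True:
        t += rng.expovariate(lam)
        if t > horizon:
            return taus
        taus.append(t)

def xi(taus, a, b):
    """Counting-measure viewpoint: xi(a, b] = number of points in (a, b]."""
    return sum(1 for t in taus if a < t <= b)

taus = poisson_points(lam=2.0, horizon=10.0, seed=7)
# Pathwise: {xi(0, x] >= n} holds exactly when {tau_n <= x} does.
for x in [0.5, 1.0, 3.7, 9.9]:
    for n in range(1, len(taus) + 1):
        assert (xi(taus, 0, x) >= n) == (taus[n - 1] <= x)
```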
Our interest in the possible use of a measure-theoretic framework arose from hearing a splendid lecture series on random measures in the early 1970's by Olav Kallenberg – leading to his subsequent classic book [Kallenberg]. Similar developments were also of interest to others at that time and since – including papers by D.G. Kendall, B.D. Ripley, J. Mecke and a subsequent book on Poisson processes by J.F.C. Kingman.

15.7 A purely measure-theoretic framework for r.m.'s
Let S be an abstract space on which a r.m. is to be defined and S a σ-field of subsets of S, i.e. (S, S) is a measurable space (Chapter 3). Our basic structural assumption about S is that there is a countable semiring P in S whose members cover S (i.e. if P = {E_1, E_2, ...}, then ∪_1^∞ E_i = S) and such that P generates S as a σ-ring (S(P) = S). Note that since S = ∪_1^∞ E_i ∈ S(P) = S, P also generates S as a σ-field (σ(P) = S(P) = S). We shall refer to a system (S, S, P) satisfying these assumptions as a basic structure for defining a random measure or point process.
Two rings connected with such a basic structure are of interest:
(i) R(P), the ring generated by P, i.e. the class of all finite (disjoint) unions of sets of P.
(ii) S_0 = S_0(P), the class of all sets E ∈ S such that E ⊂ ∪_1^n E_i for some n and sets E_1, E_2, ..., E_n in P.

S_0 is clearly a ring and P ⊂ R(P) ⊂ S_0 ⊂ S. The ring S_0 will be referred to as the class of bounded measurable sets, since its members play this role in the real line, where P = {(a, b] : a, b rational, –∞ < a < b < ∞}. This is incidentally also the case in popular topological frameworks, e.g. where S is a second countable locally compact Hausdorff space, S is the class of Borel sets (generated by the open sets) and P is the ring generated by a countable base of bounded sets.

In these examples, the ring S0 is precisely the class of all bounded mea- surable sets. As noted S0 will be referred to as the “class of bounded measurable sets” even in the general context.

Let (S, S, P) be a basic structure, and (Ω, F, P) a probability space. Let ξ = {ξω(B) : ω ∈ Ω, B ∈ S} be such that

(i) For each fixed ω ∈ Ω, ξω(B) is a measure on S.

(ii) For each fixed B ∈ P, ξω(B) is a r.v. on (Ω, F, P). Then ξ is called a random measure (r.m.) on S (defined with respect to (Ω, F, P)). Further, if the r.m. ξ is such that ξω(B) is integer-valued a.s. for each B ∈ P we call ξ a point process.

If ξ is a r.m., since ξω(B) is finite a.s. for each B ∈ P and P is countable, the null sets may be combined to give a single null set Λ ∈ F, P(Λ) = 0, such that ξω(B) is finite for all B ∈ P, ω ∈ Ω – Λ. Indeed ξω(B) < ∞ for all B ∈ S_0 when ω ∈ Ω – Λ, since such B can be covered by finitely many sets of P. If desired, Ω may be reduced to Ω – Λ, thus ensuring that ξω(B) is finite for all ω, B ∈ S_0.
If ξ is a r.m., ξω(B) is an extended r.v. for each B ∈ S, and a r.v. for B ∈ S_0. For if S = ∪_1^∞ B_i where the B_i are disjoint sets of P, then B = ∪_1^∞ (B ∩ B_i) so that ξω(B) = Σ_1^∞ ξω(B ∩ B_i), which is the measurable sum of (nonnegative) measurable terms.

If ξ is a r.m., its expectation or intensity measure λ = Eξ is defined by λ(B) = Eξ(B) for B ∈ S. Countable additivity is immediate (e.g. from Theorem 4.5.2 (Corollary)). Note that λ is not necessarily finite, even on P. Point processes and r.m.'s have numerous properties which we do not consider in detail here. Some of these provide means of defining new r.m.'s from one or more given r.m.'s. An example is the following direct definition of a r.m. as an integral of an existing r.m., proved by D-class methods:

Theorem 15.7.1 If ξ is a r.m. and f is a nonnegative S-measurable function then ξf = ∫_S f(s) dξω(s) is F-measurable. Furthermore, if f is bounded on each set of P, ν_f(B) = ∫_B f(s) dξω(s), B ∈ S, is a r.m.
It follows from the first part of this result that e^{–ξf} = e^{–∫f dξ} is a nonnegative bounded r.v. for each nonnegative S-measurable function f and hence has a finite mean. L_ξ(f) = Ee^{–ξf} is termed the Laplace Transform (L.T.) of the r.m. ξ, and is a useful tool for many calculations. In particular for B ∈ S, L_ξ(tχ_B) = Ee^{–tξ(B)} is the L.T. of the nonnegative r.v. ξ(B), a useful alternative to the c.f. for nonnegative r.v.'s.

15.8 Example: The sample point process

Let τ be a r.e. in our basic space (S, S), and consider δ_s(B) = χ_B(s), which may be viewed as unit mass at s, even if the singleton set {s} is not S-measurable. Then it is readily checked that the composition δ_{τω}(B) defines a point process ξ^{(1)} with unit mass at the single point τω. If the r.e. τ has distribution ν = Pτ^{-1} (Section 9.3), ξ^{(1)} has intensity Eξ^{(1)}(B) = Eχ_B(τω) = Eχ_{τ^{-1}B}(ω) = Pτ^{-1}(B) = ν(B). Further straightforward calculations show that ξ^{(1)} has L.T.
L_{ξ^{(1)}}(f) = Ee^{–f(τω)} = ∫ e^{–f(s)} dPτ^{-1}(s) = ν(e^{–f}).

Suppose now that τ_1, τ_2, ..., τ_n are independent r.e.'s of S with common distribution Pτ_j^{-1} = ν. Then f(τ_1), f(τ_2), ..., f(τ_n) are i.i.d. (extended) r.v.'s for any nonnegative measurable f, and in particular χ_B(τ_1), χ_B(τ_2), ..., χ_B(τ_n) are i.i.d. with P{χ_B(τ_1) = 1} = ν(B) = 1 – P{χ_B(τ_1) = 0}. Hence if ξ^{(n)} is the point process Σ_1^n δ_{τ_j} and B ∈ S,
ξ^{(n)}(B) = Σ_1^n δ_{τ_j}(B) = Σ_1^n χ_B(τ_j),
so that ξ^{(n)}(B) is binomial with parameters (n, ν(B)). ξ^{(n)} is thus a point process consisting of n events at points {τ_1, τ_2, ..., τ_n}, its intensity being

Eξ^{(n)} = nν, and its L.T. is readily calculated to be
L_{ξ^{(n)}}(f) = Ee^{–Σ_1^n δ_{τ_j}(f)} = Ee^{–Σ_1^n f(τ_j)} = (Ee^{–f(τ_1)})^n = (ν(e^{–f}))^n.

ξ^{(n)} is referred to as the sample point process consisting of n independent points τ_1, τ_2, ..., τ_n.
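On a finite space the identity L_{ξ^{(n)}}(f) = (ν(e^{–f}))^n can be verified by brute-force enumeration of all configurations (τ_1, ..., τ_n); a sketch with a hypothetical three-point space (the distribution ν and function f are illustrative):

```python
import itertools
import math

# Hypothetical finite basic space S = {0, 1, 2} with distribution nu
# and a nonnegative function f on S.
S = [0, 1, 2]
nu = [0.5, 0.3, 0.2]
f = [0.0, 1.0, 2.5]
n = 3

# Exact L.T. of the sample point process xi^(n): E exp(-sum_j f(tau_j)),
# enumerating all n-tuples (tau_1, ..., tau_n) in S^n with their probabilities.
lhs = sum(math.exp(-sum(f[s] for s in tau)) * math.prod(nu[s] for s in tau)
          for tau in itertools.product(S, repeat=n))
# Closed form (nu(e^{-f}))^n, from the independence of the tau_j.
rhs = sum(nu[s] * math.exp(-f[s]) for s in S) ** n
assert abs(lhs - rhs) < 1e-12
```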

15.9 Random element representation of a r.m.
As seen in Section 15.1, a real-valued stochastic process (family of r.v.'s) {ξ_t : t ∈ T} may be equivalently viewed as a random function, i.e. a r.e. of R^T. Similarly one may regard a r.m. {ξ(B) : B ∈ S} as a mapping ξ from Ω into the space M of all measures μ on S which are finite on P, i.e. ξω is the element of M defined by (ξω)(B) = ξω(B), B ∈ S. A natural σ-field M for the space M is that generated by the functions φ_B(μ) = μ(B), B ∈ S, i.e. the smallest σ-field making each φ_B M|B-measurable (M = σ{φ_B^{-1}E : B ∈ S, E ∈ B}; cf. Lemma 9.3.1).
It may then be readily checked (cf. Section 9.3) that a r.m. ξ is a measurable mapping from (Ω, F, P) to (M, M), i.e. a random element of (M, M). As defined in Section 9.3 for r.e.'s, the distribution of the r.m. ξ is the probability measure Pξ^{-1} on M. It is then true that any probability measure π on M may be taken to be the distribution of a r.m., namely the identity r.m. ξ(μ) = μ on the probability space (M, M, π).

15.10 Mixtures of random measures
As noted, r.m.'s may be obtained by specifying their distributions as any probability measures on (M, M). Suppose now that (Θ, T, Q) is a probability space, and for each θ ∈ Θ, ξ^{(θ)} is a r.m. in (S, S) with distribution π_θ, π_θ(A) = P{ξ^{(θ)} ∈ A} for each A ∈ M. (Note that the ξ^{(θ)}'s can be defined on different probability spaces.) If for each A ∈ M, π_θ(A) is a T-measurable function of θ, it follows from Theorem 7.2.1 that π(A) = ∫_Θ π_θ(A) dQ(θ) is a probability measure on M, and thus may be taken to be the distribution of a r.m. ξ, which may be called the mixed r.m. formed by mixing ξ^{(θ)} with respect to Q. Of course, it is the distribution of ξ rather than ξ itself which is uniquely specified.

The following intuitively obvious results are readily shown:
(i) If ξ is the mixture of ξ^{(θ)} (Pξ^{-1}(A) = ∫ P{ξ^{(θ)} ∈ A} dQ(θ)) and B ∈ S, the distribution of the (extended) r.v. ξ(B) is (for Borel sets E)
P{ξ(B) ∈ E} = P{φ_B ξ ∈ E} = Pξ^{-1}(φ_B^{-1}E) = ∫ P{ξ^{(θ)} ∈ φ_B^{-1}E} dQ(θ) = ∫ P{ξ^{(θ)}(B) ∈ E} dQ(θ).
(ii) The intensity Eξ satisfies (for B ∈ S) Eξ(B) = ∫ Eξ^{(θ)}(B) dQ(θ).

(iii) The Laplace Transform L_ξ(f) is, for nonnegative measurable f, L_ξ(f) = ∫ L_{ξ^{(θ)}}(f) dQ(θ).

Example Mixing the sample point process.
Write ξ^{(0)} = 0 and for n ≥ 1, ξ^{(n)} = Σ_1^n δ_{τ_j} as in Section 15.8, where τ_1, ..., τ_n are i.i.d. random elements of (S, S) with (common) distribution Pτ_j^{-1} = ν, say.
Let Θ = {0, 1, 2, 3, ...}, T = all subsets of Θ, and Q the probability measure with mass q_n at n = 0, 1, ... (q_n ≥ 0, Σ_0^∞ q_n = 1). Then the mixture ξ has distribution
Pξ^{-1}(A) = ∫ P_θ(A) dQ(θ) = Σ_{n=0}^∞ q_n P_n(A)

where P_n(A) = P{ξ^{(n)} ∈ A}. For each B ∈ S the distribution of ξ(B) is given by the probabilities
P{ξ(B) = r} = Σ_{n=r}^∞ q_n P{ξ^{(n)}(B) = r} = Σ_{n=r}^∞ q_n \binom{n}{r} ν(B)^r (1 – ν(B))^{n–r}
and
Eξ(B) = Σ_{n=0}^∞ q_n n ν(B) = q̄ ν(B)
where q̄ is the mean of the distribution {q_n}. That is, Eξ = q̄ν. The Laplace Transform of ξ is
L_ξ(f) = ∫ L_{ξ^{(θ)}}(f) dQ(θ) = Σ_{n=0}^∞ q_n L_{ξ^{(n)}}(f) = Σ_{n=0}^∞ q_n (ν(e^{–f}))^n = G(ν(e^{–f}))
where G denotes the probability generating function (p.g.f.) of the distribution {q_n}.
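These mixture formulas are easy to check numerically for a concrete choice of weights; the sketch below uses hypothetical geometric weights q_n = (1 – ρ)ρ^n, whose p.g.f. is G(s) = (1 – ρ)/(1 – ρs) and whose mean is q̄ = ρ/(1 – ρ):

```python
import math

rho = 0.6      # geometric mixing weights q_n = (1 - rho) * rho^n
nu_B = 0.35    # nu(B) for some fixed set B (illustrative value)
# For f = 2*chi_B:  nu(e^{-f}) = e^{-2} nu(B) + (1 - nu(B)) = 1 - nu(B)(1 - e^{-2}).
s = 1 - nu_B * (1 - math.exp(-2.0))

# Truncated series sum_n q_n (nu(e^{-f}))^n versus the p.g.f. form G(nu(e^{-f})).
series = sum((1 - rho) * rho ** n * s ** n for n in range(200))
closed = (1 - rho) / (1 - rho * s)
assert abs(series - closed) < 1e-12

# Mean check: E xi(B) = qbar * nu(B), with qbar = rho/(1-rho) for this law.
qbar = sum(n * (1 - rho) * rho ** n for n in range(500))
assert abs(qbar - rho / (1 - rho)) < 1e-9
```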

15.11 The general Poisson process
We now outline how the general Poisson process may be obtained on our basic space (S, S) from the mixed sample point process considered in the last section.
First define a "finite Poisson process" as simply a mixed sample point process with q_n = e^{–a} a^n/n! for a > 0, n = 0, 1, 2, ..., i.e. Poisson probabilities. For B ∈ S,
P{ξ(B) = r} = Σ_{n=r}^∞ (e^{–a} a^n/n!) \binom{n}{r} ν(B)^r (1 – ν(B))^{n–r}
which reduces simply to e^{–aν(B)}(aν(B))^r/r!, r = 0, 1, 2, ..., i.e. a Poisson distribution for any B ∈ S, with mean aν(B). In particular if B = S, ξ(S) has a Poisson distribution with mean a. This, of course, implies ξ(S) < ∞ a.s., so that the total number of Poisson points in the whole space is finite. This limits the process (ordinarily one thinks of a Poisson process – e.g. on the line – as satisfying P{ξ(S) = ∞} = 1), which is the reason for referring to this as a "finite Poisson process". This process has intensity measure aν = λ, say, and Laplace Transform G(ν(e^{–f})) where G(s) = e^{–a(1–s)}, i.e.
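The reduction of this mixed-binomial series to the Poisson distribution with mean aν(B) can be checked numerically (truncating the series; the values of a and ν(B) are illustrative):

```python
import math

a, p = 3.0, 0.4   # a > 0 and p = nu(B) for some fixed set B
for r in range(8):
    # Mixed-binomial series: sum over the Poisson(a) number of sample points.
    mixed = sum(math.exp(-a) * a ** n / math.factorial(n)
                * math.comb(n, r) * p ** r * (1 - p) ** (n - r)
                for n in range(r, 120))
    # Claimed closed form: Poisson with mean a*p.
    poisson = math.exp(-a * p) * (a * p) ** r / math.factorial(r)
    assert abs(mixed - poisson) < 1e-12
```

This is the familiar "Poisson thinning" computation: a Poisson(a) number of points, each independently landing in B with probability ν(B), yields a Poisson(aν(B)) count in B.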

L_ξ(f) = e^{–a(1–ν(e^{–f}))} = e^{–aν(1–e^{–f})} = e^{–λ(1–e^{–f})} (since ν(1) = 1).

Any finite (nonzero) measure λ on S may be taken as the intensity measure of a finite Poisson process (by taking a = λ(S) and ν = λ/λ(S)).
The general Poisson process (for which ξ(S) can be infinite-valued) can be obtained by summing a sequence of independent finite Poisson processes, as we now indicate, following the construction of a sequence of independent r.v.'s as in Corollary 2 of Theorem 15.1.2. Let λ ∈ M (i.e. a measure on S which is finite on P). From the basic assumptions it is readily checked that S may be written as ∪_1^∞ S_i, where the S_i are disjoint sets of P, and we write λ_i(B) = λ(B ∩ S_i), B ∈ S. The λ_i, i = 1, 2, ..., are finite measures on S and may thus be taken as the intensities of independent finite Poisson processes ξ_i, whose distributions on (M, M) are P_i, say. (P_i assigns measure 1 to the set {μ ∈ M : μ(S – S_i) = 0}.)
Define now ξ = Σ_1^∞ ξ_j. Since, for B ∈ P, E{Σ_1^∞ ξ_j(B)} = Σ_1^∞ λ_j(B) = Σ_1^∞ λ(B ∩ S_j) = λ(B) < ∞ (λ ∈ M), we see that Σ_1^∞ ξ_j(B) converges a.s. for B ∈ P and hence ξ is a point process. By the above Eξ(B) = λ(B), so that ξ has intensity measure λ. ξ is the promised Poisson process in S with intensity measure λ ∈ M.
Some straightforward calculation using independence and dominated convergence shows that its L.T. is
L_ξ(f) = lim_{n→∞} ∏_1^n L_{ξ_j}(f) = e^{–Σ_1^∞ λ_j(1–e^{–f})} = e^{–λ(1–e^{–f})}
i.e. the same form as in the finite case. In summary then the following result holds.
Theorem 15.11.1 Let (S, S, P) be a basic structure, and let λ be a measure on S which is finite on (the semiring) P. Then there exists a Poisson process ξ on S with intensity Eξ = λ, thus having the L.T.

L_ξ(f) = e^{–λ(1–e^{–f})}.
By writing f = Σ_{i=1}^n t_i χ_{B_i} and using the result for L.T.'s corresponding to Theorem 12.8.3 for c.f.'s (with analogous proof using the uniqueness theorem for L.T.'s, see e.g. [Feller]), it is seen simply that ξ(B_i), i = 1, 2, ..., n, are independent Poisson r.v.'s with means λ(B_i) when the B_i are disjoint sets of S.
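The construction translates directly into a simulation recipe: on each piece S_i, draw a Poisson number of points and scatter them i.i.d. according to λ_i/λ(S_i). A minimal sketch on S = [0, ∞) with λ = Lebesgue measure and pieces S_i = [i, i + 1) (so each λ_i has total mass 1):

```python
import random

def finite_poisson(a, sampler, rng):
    """Finite Poisson process: a Poisson(a) number of i.i.d. points.
    Poisson(a) is sampled by counting unit-rate exponential arrivals up to a."""
    n = 0
    t = rng.expovariate(1.0)
    while t <= a:
        n += 1
        t += rng.expovariate(1.0)
    return [sampler(rng) for _ in range(n)]

def poisson_on_interval(num_pieces, seed=0):
    """Superpose independent finite Poisson processes on [i, i+1),
    i = 0, ..., num_pieces-1, each with intensity Lebesgue measure there."""
    rng = random.Random(seed)
    pts = []
    for i in range(num_pieces):
        pts += finite_poisson(1.0, lambda r, i=i: i + r.random(), rng)
    return sorted(pts)

pts = poisson_on_interval(10, seed=3)
count = lambda a, b: sum(1 for x in pts if a < x <= b)
# Pathwise, the superposition is a counting measure: counts are additive
# over disjoint sets, and all points lie in the union of the pieces.
assert count(0, 10) == count(0, 4) + count(4, 10)
assert all(0 <= x < 10 for x in pts)
```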

15.12 Special cases and extensions
As defined, the general Poisson process ξ has intensity Eξ = λ where λ is a measure on S which is finite on P. The simple familiar stationary Poisson process on the real line is a very special case where (S, S) is (R, B), P can be taken to be the semiclosed intervals {(a, b] : a, b rational, –∞ < a < b < ∞} and λ is a multiple of Lebesgue measure, λ(B) = λm(B) for a finite positive constant λ, termed the intensity of the simple Poisson process. Nonstationary Poisson processes on the line are simply obtained by taking an intensity measure λ ≪ m, having a time varying intensity function λ(t), λ(B) = ∫_B λ(t) dt.
These Poisson processes have no fixed atoms (points s at which P{ξ{s} > 0} > 0) and no "multiple atoms" (random points s with ξ{s} > 1). On the other hand fixed atoms or multiple atoms are possible if a chosen intensity measure has atoms.
Poisson processes' distributions may be "mixed" to form "mixed Poisson processes" or "compound Poisson processes", and intensity measures may themselves be taken to be stochastic to yield "doubly stochastic Poisson processes" ("Cox processes" as they are generally known). These latter are particularly useful for modeling applications involving stochastic occurrence rates.
The very simple definition of a basic structure in Section 15.7 suffices admirably for the definition of Poisson processes. However, its extensions such as those above and other random measures typically require at least a little more structure. One such assumption is that of separation of two points of S by sets of P – a simple further requirement closely akin to the definition of Hausdorff spaces. Such an assumption typically suffices for the definition and basic framework of many point processes. However, more intricate properties such as a full theory of weak convergence of r.m.'s are usually achieved by the introduction of more topological assumptions about the space S.

References

Billingsley, P. Convergence of Probability Measures, 2nd edn, Wiley–Interscience, 1999.
Chung, K.L. A Course in Probability Theory, 3rd edn, Academic Press, 2001.
Cramér, H., Leadbetter, M.R. Stationary and Related Stochastic Processes, Probability and Mathematical Statistics Series, Wiley, 1967. Reprinted by Dover Publications Inc., 2004.
Feller, W. An Introduction to Probability Theory and Its Applications, vol. 1, John Wiley & Sons, 1950.
Halmos, P.R. Measure Theory, Springer-Verlag, 1974.
Kallenberg, O. Random Measures, 4th edn, Academic Press, 1986.
Kallenberg, O. Foundations of Modern Probability, 2nd edn, Springer-Verlag, 2002.
Loève, M. Probability Theory I, II, 4th edn, Graduate Texts in Mathematics, vol. 45, Springer-Verlag, 1977.
Resnick, S.I. Extreme Values, Regular Variation, and Point Processes, 2nd edn, Springer-Verlag, 2008.

Index

Lp-space, 127 central limit theorem complex, 180 array form of Lindeberg–Feller, 269 λ-system, 19 elementary form, 267 μ*-measurable, 29 standard form of Lindeberg–Feller, 271 σ-algebra change of variables in integration, 106 see σ-field, 13 characteristic function (c.f.) of a random σ-field, 13 variable, 254 generated by a class of sets, 14 inversion and uniqueness, 261 generated by a random variable, 195 inversion theorem, 278 generated by a transformation, 47 joint, 277 σ-finite, 22, 86 recognizing, 271 σ-ring, 13 uniqueness, 262, 278 generated by a class of sets, 14 Chebychev Inequality, 202, 243 D-class, 15 classes of sets, 1, 2 absolute continuity, 94, 105, 110, 193, completion, 34, 41, 81 199 conditional distribution, 295 almost everywhere, 57 conditional expectation, 287, 288, 300, almost surely (a.s.), 190 305 atoms, 192 conditional probability, 285, 291, 301, 305 Banach space, 127 conditionally independent, 307 binomial distribution, 193, 257 consistency of a family of measures Bochner’s Theorem, 275 (distributions), 167, 342 Borel measurable function, 59, 190 continuity theorem for characteristic Borel sets, 16 functions, 264, 279 extended, 45 continuous from above (below), 25 n-dimensional, 158 continuous mapping theorem, 231 two-dimensional, 153 convergence Borel–Cantelli Lemma, 217 almost everywhere (a.e.), 58 bounded variation, 110, 180 almost sure (a.s.), with probability one, Brownian motion 223 see Wiener process, 343 almost uniform (a.u.), 119 Cauchy sequence, 118 in distribution, 227, 228 almost surely (a.s.), 224 in measure, 120 almost uniformly, 119 in probability, 225 in measure, 121 in pth order mean (Lp-spaces), 226 in metric space, 125 modes, summary, 134 uniformly, 118 of integrals, 73 centered sequences, 325 pointwise, 118

convergence (cont.) Fubini’s Theorem, 150, 158 uniformly, 118 functional central limit theorem uniformly a.e., 118 (invariance principle), 347 vague, 204, 237 gamma distribution, 194 weak, 228 generalized second derivative, 282 convex, 202 Hahn decomposition, 88 convolution, 153, 216 minimal property, 90 correlation, 200 Hausdorff space, 349, 355 counting measure, 41, 81 Heine–Borel Theorem, 37 covariance, 200 Helly’s Selection Theorem, 232 Cox processes, 354 Hölder’s Inequality, 128, 179, 201 Cramér–Wold device, 280 cylinder set, 164 increasing sequence of functions, 55 De Morgan laws, 6 independent events and their classes, 208 degenerate distribution, 257 independent random elements and their density function, 105 families, 211 discrete measures, 104, 105 independent random variables, 213 distribution addition, 216 marginal, 198, 341 existence, 214 of a random element, 197 indicator (characteristic) functions, 7 of a random measure, 351 integrability, 68 of a random variable, 190 integrable function, 67–69 distribution function (d.f.), 191 integral, 68 absolutely continuous, 193, 199 defined, 69 discrete, 193 indefinite, 66 joint, 197 of complex functions, 177 dominated convergence, 76, 92, 179 of nonnegative measurable functions, conditional, 290 63 Doob’s decomposition, 313 of nonnegative simple functions, 62 Egoroff’s Theorem, 120 with respect to signed measures, 92 equivalent integration by parts, 154 signed measures, 95 inverse functions, 203 stochastic processes, 345 inverse image, 46 essentially unique, 96 Jensen’s Inequality, 202 event, 189 conditional, 291, 306 expectation, 199 Jordan decomposition, 89, 152 extension of measures, 27, 31 Kolmogorov Inequalities, 241, 314 Fatou’s Lemma, 76 Kolmogorov Zero-One Law, 218 conditional, 289 Kolmogorov’s Extension Theorem, 167, field (algebra), 9 169, 342 finite-dimensional distributions (fidi’s), Kolmogorov’s Three Series Theorem, 341 244 Laplace Transform (L.T.), 350 Fourier Transform, 181, 254
laws of large numbers, 247, 248, 327 Dirichlet Limit, 186 Lebesgue decomposition, 96, 98, 106, inverse, 185 194 inversion, 182 Lebesgue integrals, 78 “local” inversion, 186 Lebesgue measurable function, 59 local inversion theorem, 187 Lebesgue measurable sets, 38 Fourier–Stieltjes Transform, 180, 254 n-dimensional, 158 inversion, 182 two-dimensional, 153

Lebesgue measure, 37 multivariate, 200 n-dimensional, 158 normed linear space, 126 two-dimensional, 153 outer measure, 29 Lebesgue–Stieltjes integrals, 78, 111 Lebesgue–Stieltjes measures, 39, 78, 111, Palm distributions, 285 158, 162 point process, 349 Lévy distance, 251 Poisson distribution, 193 Liapounov’s condition, 284 Poisson process, 353 likelihood ratios, 333 compound, 354 Lindeberg condition, 269 doubly stochastic, 354 linear mapping (transformation), 17, 38 stationary and nonstationary, 354 linear space, 126 Pólya’s urn scheme, 337 Portmanteau Theorem, 228 Markov Inequality, 202 positive definite, 274 martingale (submartingale, probability density function (p.d.f.), supermartingale), 309, 320 193 convergence, 319 predictable increasing sequence, 313 joint, 198 reverse, 323 probability measure (probability), 44, 189 upcrossings, 317 frequency interpretation, 190 mean square estimate, 306 inequalities, 200 measurability criterion, 48 probability space, 189 measurable functions, 47 probability transforms, 204 combining, 50 product measurable space, 155 complex-valued, 178 product measure, 149, 156 extended, 45 product spaces, 141 measurable space, 44 σ-field, 142, 165 measurable transformation, 47 σ-ring, 141, 142 measure space, 44 diagonal, 171 measures, 22 finite-dimensional, 155 complete, 34 space (R^T, B^T), 163 complex, 87 Prohorov’s Theorem, 234 from outer measures, 29 projection map, 164, 165 induced by transformations, 58 Rademacher functions, 220 mixtures of, 143 Radon–Nikodym derivative, 102 on R^T, 167 chain rule, 103 regularity, 162 Radon–Nikodym Theorem, 96, 100, 179 metric space, 124 random element (r.e.), 195 complete, 126 random experiment, 189 separable, 126 random function (r.f.), 340 Minkowski’s Inequality, 129, 180, 201 random measure (r.m.), 350 reverse, 130 basic structure, 349 moments, 199 intensity measure, 350 absolute, 199, 200 mixed, 351 central, 200 random element representation, 351 inequalities, 200 random variables (r.v.’s), 190 monotone class theorem, 14, 19 absolutely continuous, 193 monotone convergence theorem, 74 discrete, 193 conditional, 289 extended, 190 nonnegative definite, 274 identically distributed, 192 norm, 126 symmetric, 281 normal distribution, 194, 257 random vector, 195, 196 real line applications, 78, 104, 153 intersection, 3 rectangle, 141 limits, 6 regular conditional density, 303 lower limit, 6 regular conditional distribution, 296, 299, monotone increasing (decreasing), 7 301, 302, 305 proper difference, 4 regular conditional probability, 293, 299, symmetric difference, 4 301, 305 union (sum), 3 relatively compact, 234 upper limit, 6 repeated (iterated) integral, 148, 157 signed measure, 86, 152 Riemann integrals, 79, 80, 84 null, negative, positive, 87 Riemann–Lebesgue Lemma, 182 total variation, 112 rings, 8, 11 simple functions, 54 sample functions (paths), 341 singularity, 94, 105, 194 continuity, 345 Skorohod’s Representation, 236 sample point process, 351 stochastic process, 195, 340 mixing, 352 continuous parameter, 340 Schwarz Inequality, 129 on special subspaces of R^T, 344 section of a set, 142 realization, 341 semiring, 10 stochastic sequence or discrete set functions, 21 parameter, 340 additive, 22 tail σ-field, event and random variables, countable subadditivity, 29 218 extensions and restrictions, 22 three series theorem, 244 finitely additive (countably additive), tight family, 232 22 transformation, 45 monotone, 23 transformation theorem, 77, 93, 179 subtractive, 23 triangular array, 268 set mapping, 46 Tychonoff’s Theorem, 168 sets, 1 uniform absolute continuity, 238 complement of a set, 4 uniform distribution, 257 convergent, 7 uniform integrability, 238 difference, 4 disjoint, 4 variance, 200 empty, 3 Wiener measure, 347 equalities, 5 Wiener process, 343, 346