
Probability and Measure Theory

Robert L. Wolpert, Institute of Statistics and Decision Sciences, Duke University, Durham, NC, USA

Convergence of Random Variables

1 Convergence Concepts

1.1 Convergence of Real Numbers

A sequence of real numbers $a_n$ converges to a limit $a$ if and only if, for each $\epsilon > 0$, the sequence eventually lies within a ball of radius $\epsilon$ centered at $a$. It's okay if the first few (or few million) terms lie outside that ball, and the number of terms that do lie outside the ball may depend on how big $\epsilon$ is (if $\epsilon$ is small enough it may take millions of terms before the remaining sequence lies inside the ball). This can be made mathematically precise by introducing a letter (say, $N_\epsilon$) for how many initial terms we have to throw away: $a_n \to a$ if and only if, for each $\epsilon > 0$, there is an $N_\epsilon < \infty$ so that, for each $n \ge N_\epsilon$, $|a_n - a| < \epsilon$: only finitely many $a_n$ can be farther than $\epsilon$ from $a$.

The same notion of convergence works in any (complete) metric space, where we require that the distance $d(a_n, a)$ from $a_n$ to $a$ tend to zero in the sense that it exceeds each number $\epsilon > 0$ for at most some finite number $N_\epsilon$ of terms. Points $a_n$ in $d$-dimensional Euclidean space converge to a limit $a \in \mathbb{R}^d$ if and only if each of their coordinates converges; and, since there are only finitely many coordinates, if they all converge then they do so uniformly (i.e., for each $\epsilon$ we can take the same $N_\epsilon$ for all $d$ of the coordinate sequences).
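To make the $N_\epsilon$ bookkeeping concrete, here is a minimal Python sketch (the example sequence $a_n = 1 + \sin(n)/n$, the function name, and the finite search horizon are all illustrative choices, not from the notes): it scans an initial stretch of a sequence and reports the first index after which every inspected term stays within $\epsilon$ of the limit.

```python
import math

def first_settling_index(seq, a, eps, n_max=10_000):
    """Return the smallest N with |seq(n) - a| < eps for all N <= n <= n_max
    (a finite-horizon stand-in for N_eps); None if even the last inspected
    term lies outside the eps-ball."""
    N = None
    for n in range(1, n_max + 1):
        if abs(seq(n) - a) >= eps:
            N = None           # a term escaped the ball, so the tail restarts
        elif N is None:
            N = n              # tentative first index of the final in-ball run
    return N

# a_n = 1 + sin(n)/n converges to 1, and N_eps grows as eps shrinks.
a = lambda n: 1 + math.sin(n) / n
for eps in (0.1, 0.01, 0.001):
    print(eps, first_settling_index(a, 1.0, eps))
```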

2 Convergence of Random Variables

For random variables $X_n$ the idea of convergence to a limiting random variable $X$ is more delicate, since each $X_n$ is a function of $\omega \in \Omega$ and usually there are infinitely many points $\omega \in \Omega$. What should we mean in asking about the convergence of a sequence $X_n$ of random variables to a limit $X$? Should we mean that $X_n(\omega)$ converges to $X(\omega)$ for each fixed $\omega$? Or that these sequences converge uniformly in $\omega \in \Omega$? Or that some notion of the distance $d(X_n, X)$ between $X_n$ and the limit $X$ decreases to zero? Should the probability measure $\mathsf{P}$ be involved in some way? Here are a few different choices of what we might mean by the statement that "$X_n$ converges to $X$," for a sequence of random variables $X_n$ and a random variable $X$, all defined on the same probability space $(\Omega, \mathcal{F}, \mathsf{P})$:

pw: The sequence of real numbers $X_n(\omega) \to X(\omega)$ for every $\omega \in \Omega$ (pointwise convergence):

$$(\forall \epsilon > 0)\ (\forall \omega \in \Omega)\ (\exists N_{\epsilon,\omega} < \infty)\ (\forall n \ge N_{\epsilon,\omega})\quad |X_n(\omega) - X(\omega)| < \epsilon.$$

uni: The sequences of real numbers $X_n(\omega) \to X(\omega)$ uniformly for $\omega \in \Omega$:

$$(\forall \epsilon > 0)\ (\exists N_\epsilon < \infty)\ (\forall \omega \in \Omega)\ (\forall n \ge N_\epsilon)\quad |X_n(\omega) - X(\omega)| < \epsilon.$$

a.s.: Outside some null event $N \in \mathcal{F}$, each sequence of real numbers $X_n(\omega) \to X(\omega)$ (almost-sure convergence, or "almost everywhere" (a.e.)): for some $N \in \mathcal{F}$ with $\mathsf{P}[N] = 0$,

$$(\forall \epsilon > 0)\ (\forall \omega \notin N)\ (\exists N_{\epsilon,\omega} < \infty)\ (\forall n \ge N_{\epsilon,\omega})\quad |X_n(\omega) - X(\omega)| < \epsilon,$$

i.e., $\mathsf{P}\bigl[\bigcup_{\epsilon > 0} \bigcap_{N < \infty} \bigcup_{n \ge N} \{|X_n(\omega) - X(\omega)| \ge \epsilon\}\bigr] = 0$.

$L_\infty$: Outside some null event $N \in \mathcal{F}$, the sequences of real numbers $X_n(\omega) \to X(\omega)$ converge uniformly ("almost-uniform" or "$L_\infty$" convergence): for some $N \in \mathcal{F}$ with $\mathsf{P}[N] = 0$,

$$(\forall \epsilon > 0)\ (\exists N_\epsilon < \infty)\ (\forall \omega \notin N)\ (\forall n \ge N_\epsilon)\quad |X_n(\omega) - X(\omega)| < \epsilon.$$

i.p.: For each $\epsilon > 0$, the probabilities $\mathsf{P}[|X_n - X| > \epsilon] \to 0$ (convergence "in probability," or "in measure"):

$$(\forall \epsilon > 0)\ (\forall \eta > 0)\ (\exists N_{\epsilon,\eta} < \infty)\ (\forall n \ge N_{\epsilon,\eta})\quad \mathsf{P}[|X_n - X| > \epsilon] < \eta.$$

$L_1$: The expectation $\mathsf{E}[|X_n - X|]$ converges to zero (convergence "in $L_1$"):

$$(\forall \epsilon > 0)\ (\exists N_\epsilon < \infty)\ (\forall n \ge N_\epsilon)\quad \mathsf{E}[|X_n - X|] < \epsilon.$$

$L_p$: For some fixed number $p > 0$, the expectation of the $p$th power $\mathsf{E}[|X_n - X|^p]$ converges to zero (convergence "in $L_p$," sometimes called "in the $p$th mean"):

$$(\forall \epsilon > 0)\ (\exists N_\epsilon < \infty)\ (\forall n \ge N_\epsilon)\quad \mathsf{E}[|X_n - X|^p] < \epsilon.$$

i.d.: The distributions of $X_n$ converge to the distribution of $X$, i.e., the measures $\mathsf{P} \circ X_n^{-1}$ converge in some way to $\mathsf{P} \circ X^{-1}$ ("vague" or "weak" convergence, or "convergence in distribution," sometimes written $X_n \Rightarrow X$):

$$(\forall \epsilon > 0)\ (\forall \varphi \in C_b(\mathbb{R}))\ (\exists N_{\epsilon,\varphi} < \infty)\ (\forall n \ge N_{\epsilon,\varphi})\quad |\mathsf{E}\,\varphi(X_n) - \mathsf{E}\,\varphi(X)| < \epsilon.$$

Which of these eight notions of convergence is right for random variables? The answer is that all of them are useful for one purpose or another. You will want to know which ones imply which other ones, and under what conditions. All but the first two (pointwise, uniform) notions depend upon the measure $\mathsf{P}$; it is possible for a sequence $X_n$ to converge to $X$ in any of these senses for one probability measure $\mathsf{P}$, but to fail to converge for another $\mathsf{P}'$. Most of them can be phrased as metric convergence for some notion of distance between random variables:

i.p.: $X_n \to X$ in probability if and only if $d_0(X, X_n) \to 0$ as real numbers, where:

$$d_0(X, Y) \equiv \mathsf{E}\left[\frac{|X - Y|}{1 + |X - Y|}\right]$$

$L_1$: $X_n \to X$ in $L_1$ if and only if $d_1(X, X_n) = \|X - X_n\|_1 \to 0$ as real numbers, where:

$$\|X - Y\|_1 \equiv \mathsf{E}|X - Y|$$

$L_p$: $X_n \to X$ in $L_p$ if and only if $d_p(X, X_n) = \|X - X_n\|_p \to 0$ as real numbers, where:

$$\|X - Y\|_p \equiv (\mathsf{E}|X - Y|^p)^{1/p}$$

$L_\infty$: $X_n \to X$ almost uniformly if and only if $d_\infty(X, X_n) = \|X - X_n\|_\infty \to 0$ as real numbers, where:

$$\|X - Y\|_\infty = \mathrm{l.u.b.}\,\{r < \infty : \mathsf{P}[|X - Y| > r] > 0\}$$

As the notation suggests, convergence in probability and in $L_\infty$ are in some sense limits of convergence in $L_p$ as $p \to 0$ and $p \to \infty$, respectively. Almost-sure convergence is an exception: there is no metric notion of distance $d(X, Y)$ for which $X_n \to X$ a.s. if and only if $d(X, X_n) \to 0$.
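The metrics above are easy to estimate by simulation. Here is a minimal sketch, assuming Python with numpy; the helper names and the illustrative family $X_n = X + n^{-1/2} \cdot \text{noise}$ (which converges to $X$ i.p. and in $L_p$) are my own choices, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

def d0(x, y):
    """Monte Carlo estimate of d_0(X, Y) = E[|X-Y| / (1 + |X-Y|)] from paired samples."""
    z = np.abs(x - y)
    return float(np.mean(z / (1 + z)))

def lp_dist(x, y, p):
    """Monte Carlo estimate of ||X - Y||_p = (E|X-Y|^p)^(1/p)."""
    return float(np.mean(np.abs(x - y) ** p) ** (1 / p))

x = rng.normal(size=100_000)
for n in (1, 10, 100, 1000):
    xn = x + rng.normal(size=x.size) / np.sqrt(n)    # X_n -> X as n grows
    print(n, round(d0(xn, x), 4), round(lp_dist(xn, x, 2), 4))
```

Both estimates shrink toward zero, consistent with convergence i.p. and in $L_2$.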

2.1 Almost-Sure Convergence

Let $X$ and $\{X_n\}$ be a collection of RV's on some $(\Omega, \mathcal{F}, \mathsf{P})$. The set of points $\omega$ for which $X_n(\omega)$ does converge to $X(\omega)$ is just

$$\bigcap_{\epsilon > 0} \bigcup_{m=1}^{\infty} \bigcap_{n=m}^{\infty} [\omega : |X_n(\omega) - X(\omega)| \le \epsilon],$$

the points which, for all $\epsilon > 0$, have $|X_n(\omega) - X(\omega)|$ less than $\epsilon$ for all but finitely many $n$. The sequence $X_n$ is said to converge "almost everywhere" (a.e.) to $X$, or to converge to $X$ "almost surely" (a.s.), if this set of $\omega$ has probability one, or (equivalently) if its complement is a null set:

$$\mathsf{P}\Bigl[\bigcup_{\epsilon > 0} \bigcap_{m=1}^{\infty} \bigcup_{n=m}^{\infty} [\omega : |X_n(\omega) - X(\omega)| > \epsilon]\Bigr] = 0.$$

The union over $\epsilon > 0$ is only a countable one, since we need include only rational $\epsilon$ (or, for that matter, any sequence $\epsilon_k$ tending to zero, such as $\epsilon_k = 1/k$). Thus $X_n \to X$ a.e. if and only if, for each $\epsilon > 0$,

$$\mathsf{P}\Bigl[\bigcap_{m=1}^{\infty} \bigcup_{n=m}^{\infty} [\omega : |X_n(\omega) - X(\omega)| > \epsilon]\Bigr] = 0. \tag{a.e.}$$

This combination of intersection and union occurs frequently in probability, and has a name: for any sequence $E_n$ of events, $\bigcap_{m=1}^{\infty} \bigcup_{n=m}^{\infty} E_n$ is called the lim sup of the $\{E_n\}$, and is sometimes described more colorfully as $[E_n \text{ i.o.}]$, the set of points in $E_n$ "infinitely often." Its complement is the lim inf of the sets $F_n = E_n^c$, $\bigcup_{m=1}^{\infty} \bigcap_{n=m}^{\infty} F_n$: the set of points in all but finitely many of the $F_n$. Since $\mathsf{P}$ is countably additive, and since the intersection in the definition of lim sup is decreasing and the union in the definition of lim inf

is increasing, we always have $\mathsf{P}[\bigcup_{n=m}^{\infty} E_n] \searrow \mathsf{P}[\bigcap_{m=1}^{\infty} \bigcup_{n=m}^{\infty} E_n]$ and $\mathsf{P}[\bigcap_{n=m}^{\infty} F_n] \nearrow \mathsf{P}[\bigcup_{m=1}^{\infty} \bigcap_{n=m}^{\infty} F_n]$ as $m \to \infty$. Thus,

Theorem 1. $X_n \to X$ $\mathsf{P}$-a.s. if and only if for every $\epsilon > 0$,

$$\lim_{m \to \infty} \mathsf{P}\bigl[\,|X_n - X| > \epsilon \text{ for some } n \ge m\,\bigr] = 0.$$

In particular, $X_n \to X$ $\mathsf{P}$-a.s. if $\sum_n \mathsf{P}[|X_n - X| > \epsilon] < \infty$ for each $\epsilon > 0$ (why?).
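This Borel-Cantelli criterion can be watched numerically. A hypothetical sketch (Python with numpy; the path counts and the two tail-probability families are illustrative choices): take independent indicators $X_n$ with $\mathsf{P}[X_n = 1] = p_n$ and $X = 0$; when $\sum p_n < \infty$ almost every path produces its last 1 early, while for $p_n = 1/n$ (not summable) a fixed fraction of paths keep producing 1's arbitrarily late.

```python
import numpy as np

rng = np.random.default_rng(1)
N, paths = 5_000, 2_000
ns = np.arange(1, N + 1)

# By Borel-Cantelli (and its converse for independent events),
# X_n -> 0 a.s. iff sum(p_n) < infinity.
for label, p in (("p_n = 1/n^2 (summable)    ", 1 / ns**2),
                 ("p_n = 1/n   (not summable)", 1 / ns)):
    X = rng.random((paths, N)) < p              # each row is one sample path
    seen_late = X[:, N // 2:].any(axis=1)       # a 1 occurred in the last half
    print(label, "fraction of paths with a late 1:", float(seen_late.mean()))
```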

2.2 Convergence In Probability

The sequence $X_n$ is said to converge to $X$ "in probability" (i.p.) if, for each $\epsilon > 0$,

$$\mathsf{P}[\omega : |X_n(\omega) - X(\omega)| > \epsilon] \to 0. \tag{i.p.}$$

If we denote by $E_n$ the event $[\omega : |X_n(\omega) - X(\omega)| > \epsilon]$, we see that convergence almost surely requires that $\mathsf{P}[\bigcup_{n \ge m} E_n] \to 0$ as $m \to \infty$, while convergence in probability requires only that $\mathsf{P}[E_n] \to 0$. Thus:

Theorem 2. If $X_n \to X$ a.e. then $X_n \to X$ i.p.

Here is a partial converse:

Theorem 3. If $X_n \to X$ i.p., then there is a subsequence $n_k$ such that $X_{n_k} \to X$ a.e.

Proof. Set $n_0 = 0$ and, for each integer $k \ge 1$, set

$$n_k = \inf\left\{n > n_{k-1} : \mathsf{P}\left[\omega : |X_n(\omega) - X(\omega)| > \frac{1}{k}\right] \le 2^{-k}\right\}.$$

For any $\epsilon > 0$ we have $\frac{1}{k} < \epsilon$ eventually (namely, for $k > k_0 = \lceil \frac{1}{\epsilon} \rceil$) and for each $m > k_0$,

$$\begin{aligned}
\mathsf{P}\Bigl[\bigcup_{k=m}^{\infty} [\omega : |X_{n_k}(\omega) - X(\omega)| > \epsilon]\Bigr]
&\le \mathsf{P}\Bigl[\bigcup_{k=m}^{\infty} [\omega : |X_{n_k}(\omega) - X(\omega)| > \tfrac{1}{k}]\Bigr] \\
&\le \sum_{k=m}^{\infty} \mathsf{P}[\omega : |X_{n_k}(\omega) - X(\omega)| > \tfrac{1}{k}] \\
&\le \sum_{k=m}^{\infty} 2^{-k} = 2^{1-m} \to 0 \quad \text{as } m \to \infty. \qquad \square
\end{aligned}$$
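The proof's recipe for extracting an a.e.-convergent subsequence can be run directly. A sketch under illustrative assumptions (Python with numpy; $X_n - X \sim \mathsf{N}(0, 1/n)$ stands in for any sequence known to converge i.p., and the tail probabilities are estimated by simulation rather than computed exactly):

```python
import numpy as np

rng = np.random.default_rng(2)

def est_prob_exceed(n, eps, draws=20_000):
    """Monte Carlo estimate of P[|X_n - X| > eps] for the illustrative
    model X_n - X ~ N(0, 1/n)."""
    return float(np.mean(np.abs(rng.normal(scale=1 / np.sqrt(n), size=draws)) > eps))

# Mimic the proof: n_k = first n > n_{k-1} with P[|X_n - X| > 1/k] <= 2^{-k}.
n_prev, subseq = 0, []
for k in range(1, 8):
    n = n_prev + 1
    while est_prob_exceed(n, 1 / k) > 2.0 ** (-k):
        n += 1
    subseq.append(n)
    n_prev = n
print(subseq)   # indices along which a.e. convergence is guaranteed
```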

2.2.1 A Counter-Example

If $X_n \to X$ a.e. implies $X_n \to X$ i.p., and if the converse holds at least along subsequences, are the two notions really identical? Or is it possible for RV's $X_n$ to converge to $X$ i.p., but not a.e.? The answer is that the two notions are different, and that a.e. convergence is strictly stronger than convergence i.p. Here's an example: Let $(\Omega, \mathcal{F}, \mathsf{P})$ be the unit interval with the Borel sets and Lebesgue measure. Define a sequence of random variables $X_n : \Omega \to \mathbb{R}$ by

$$X_n(\omega) = \begin{cases} 1 & \text{if } \frac{i}{2^j} < \omega \le \frac{i+1}{2^j} \\ 0 & \text{otherwise} \end{cases} \qquad \text{where } n = i + 2^j,\ 0 \le i < 2^j.$$

Each $X_n$ is one on an interval of length $2^{-j}$, where $j = \lfloor \log_2(n) \rfloor$; since $\frac{1}{n} \le \frac{1}{2^j} < \frac{2}{n}$,

$$\mathsf{P}[|X_n| > \epsilon] = 2^{-j} < \frac{2}{n} \to 0$$

for each $0 < \epsilon < 1$, and $X_n \to 0$ i.p. On the other hand, for every $j > 0$ we have

$$\Omega = \bigcup_{i=0}^{2^j - 1} \left(\frac{i}{2^j}, \frac{i+1}{2^j}\right] = \bigcup_{n=2^j}^{2^{j+1}-1} [\omega : X_n(\omega) = 1],$$

so $[\omega : X_n(\omega) \to 0]$ is empty, not a set of probability one! Obviously $X_n$ does not converge a.e. This example is a building-block for several examples to

come, so getting to know it well is worthwhile. Try to verify that $X_n \to 0$ in probability and in $L_p$ but not almost surely. What is $\|X_n\|_p$? Why doesn't $X_n \to 0$ a.s.? What would happen if we multiplied $X_n$ by $n$? By $n^2$? What about the subsequence $Y_n = X_{2^n}$? Does $X_n$ converge in $L_\infty$?
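A quick simulation of this example shows both halves at once (a minimal sketch in Python with numpy; the sample sizes and the particular $\omega$ are arbitrary choices): the marginal probabilities $\mathsf{P}[X_n = 1] = 2^{-\lfloor \log_2 n \rfloor}$ tend to zero, yet each fixed $\omega$ is hit exactly once in every dyadic block of indices.

```python
import numpy as np

rng = np.random.default_rng(3)

def X(n, omega):
    """The moving-blocks sequence: X_n = 1 on (i/2^j, (i+1)/2^j], n = i + 2^j."""
    j = int(np.floor(np.log2(n)))
    i = n - 2 ** j
    return ((i / 2 ** j < omega) & (omega <= (i + 1) / 2 ** j)).astype(float)

omega = rng.random(100_000)              # sample points of the unit interval
for n in (10, 100, 1000, 10_000):
    print(n, X(n, omega).mean())         # ~ P[X_n = 1] = 2^{-floor(log2 n)} -> 0

# Yet within each block 2^j <= n < 2^{j+1} exactly one X_n(omega) equals 1,
# so X_n(omega) converges for no omega.
w = 0.3141
hits = [n for n in range(1, 1024) if X(n, np.array([w]))[0] == 1]
print(len(hits), hits)                   # one hit per dyadic block
```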

3 Cauchy Convergence

Sometimes we wish to consider a sequence $X_n$ that converges to some limit $X$, perhaps without knowing $X$ in advance; the concept of Cauchy convergence is ideal for this. For any of the distance measures $d_p$ above, with $0 \le p \le \infty$, say "$X_n$ is a Cauchy sequence in $L_p$" if

$$(\forall \epsilon > 0)\ (\exists N < \infty)\ (\forall n \ge m \ge N)\quad d_p(X_m, X_n) < \epsilon.$$

The spaces $L_p$ for $0 \le p \le \infty$ are all complete in the sense that if $X_n$ is Cauchy for $d_p$ then there exists $X \in L_p$ for which $d_p(X_n, X) \to 0$. To see this, take an increasing subsequence $N_k$ along which $d_p(X_m, X_n) < 2^{-k}$ for $n \ge m \ge N_k$, set $X_0 = 0$ and $N_0 = 0$, and set $Y_k \equiv X_{N_k} - X_{N_{k-1}}$. Check to confirm that $\sum_{k=1}^{\infty} Y_k$ converges a.s., to some limit $X \in L_p$ with $d_p(X_n, X) \to 0$.

4 Uniform Integrability

Let $Y \ge 0$ be integrable on some probability space $(\Omega, \mathcal{F}, \mathsf{P})$,

$$\mathsf{E}[Y] = \int_\Omega Y \, d\mathsf{P} < \infty;$$

it follows (from the DCT or MCT, for example) that

$$\lim_{t \to \infty} \mathsf{E}\bigl[Y \mathbf{1}_{[Y > t]}\bigr] = \lim_{t \to \infty} \int_{[\omega : Y(\omega) > t]} Y \, d\mathsf{P} = 0$$

and, consequently, for any sequence of random variables $X_n$ "dominated" by $Y$ in the sense that $|X_n| \le Y$ a.s.,

$$\lim_{t \to \infty} \mathsf{E}\bigl[|X_n| \mathbf{1}_{[|X_n| > t]}\bigr] = \lim_{t \to \infty} \int_{[\omega : |X_n(\omega)| > t]} |X_n| \, d\mathsf{P} \le \lim_{t \to \infty} \int_{[\omega : Y(\omega) > t]} Y \, d\mathsf{P} = 0,$$

uniformly in $n$.

Call the sequence $X_n$ uniformly integrable (or simply UI) if $\mathsf{E}[|X_n| \mathbf{1}_{[|X_n| > t]}] \to 0$ as $t \to \infty$, uniformly in $n$, even if it is not dominated by a single integrable random variable $Y$. The big result is:

Theorem 4. If $X_n \to X$ i.p. and if $X_n$ is UI then $X_n \to X$ in $L_1$.

Proof. Without loss of generality take $X \equiv 0$. Fix any $\epsilon > 0$; find (by UI) $t_\epsilon > 0$ such that $\mathsf{E}[|X_n| \mathbf{1}_{[|X_n| > t_\epsilon]}] \le \epsilon$ for all $n$. Now find (by $X_n \to X$ i.p.) $N_\epsilon \in \mathbb{N}$ such that, for $n \ge N_\epsilon$, $\mathsf{P}[|X_n| > \epsilon] < \epsilon / t_\epsilon$; then:

$$\begin{aligned}
\mathsf{E}[|X_n|] &= \int_{[|X_n| \le \epsilon]} |X_n| \, d\mathsf{P} + \int_{[\epsilon < |X_n| \le t_\epsilon]} |X_n| \, d\mathsf{P} + \int_{[t_\epsilon < |X_n|]} |X_n| \, d\mathsf{P} \\
&\le \int_{[|X_n| \le \epsilon]} \epsilon \, d\mathsf{P} + \int_{[\epsilon < |X_n| \le t_\epsilon]} t_\epsilon \, d\mathsf{P} + \int_{[t_\epsilon < |X_n|]} |X_n| \, d\mathsf{P} \\
&\le \epsilon + t_\epsilon \, \mathsf{P}[|X_n| > \epsilon] + \epsilon \\
&\le 3\epsilon. \qquad \square
\end{aligned}$$

Similarly, for any $p > 0$, $X_n \to X$ (i.p.) and $|X_n|^p$ UI (for example, $|X_n| \le Y \in L_p$) gives $X_n \to X$ ($L_p$). In the special case of $|X_n| \le Y \in L_p$ this is just Lebesgue's Dominated Convergence Theorem (DCT). We have seen that $\{X_n\}$ is UI whenever $|X_n| \le Y \in L_1$, but UI is more general than that. Here are two more criteria:

Theorem 5. If $\{X_n\}$ is uniformly bounded in $L_p$ for some $p > 1$ then $\{X_n\}$ is UI.

Proof. Let $c \in \mathbb{R}_+$ be an upper bound for $\mathsf{E}|X_n|^p$. First recall that, by Fubini's Theorem, any random variable $X$ satisfies, for any $q > 0$,

$$\mathsf{E}|X|^q = \int_0^\infty q x^{q-1}\, \mathsf{P}[|X| > x] \, dx.$$

We apply this for $q = 1$ and $q = p$ to the random variables $X_n$ and, for $t > 0$, to $X_n \mathbf{1}_{[|X_n| > t]}$. Fix any $t > 0$; then

$$\begin{aligned}
\mathsf{E}\bigl[|X_n| \mathbf{1}_{[|X_n| > t]}\bigr] &= \int_0^\infty \mathsf{P}\bigl[|X_n| \mathbf{1}_{[|X_n| > t]} > x\bigr] \, dx \\
&= \int_0^t \mathsf{P}[|X_n| > t] \, dx + \int_t^\infty \mathsf{P}[|X_n| > x] \, dx \\
&\le t\, \mathsf{P}[|X_n|^p > t^p] + \int_t^\infty \frac{p x^{p-1}}{p t^{p-1}}\, \mathsf{P}[|X_n| > x] \, dx \\
&\le t\, \frac{\mathsf{E}|X_n|^p}{t^p} + \frac{1}{p t^{p-1}} \int_0^\infty p x^{p-1}\, \mathsf{P}[|X_n| > x] \, dx \\
&= t^{1-p}\,(1 + p^{-1})\, \mathsf{E}|X_n|^p \\
&\le c\, t^{1-p}\,(1 + p^{-1}) \to 0 \quad \text{as } t \to \infty, \text{ uniformly in } n. \qquad \square
\end{aligned}$$

Theorem 6. If $\{X_n\}$ is UI, then

$$(\forall \epsilon > 0)\ (\exists \delta > 0)\ (\forall A \in \mathcal{F} \text{ with } \mathsf{P}(A) < \delta)\quad \mathsf{E}[|X_n| \mathbf{1}_A] < \epsilon.$$

Conversely, if $\{X_n\}$ is uniformly bounded in $L_1$ and if $(\forall \epsilon > 0)(\exists \delta > 0)$ such that $\mathsf{E}[|X_n| \mathbf{1}_A] < \epsilon$ whenever $\mathsf{P}[A] < \delta$, then $\{X_n\}$ is UI.

Proof. Straightforward. The condition "$\{X_n\}$ is uniformly bounded in $L_1$" is unnecessary if $(\Omega, \mathcal{F}, \mathsf{P})$ is non-atomic.
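A numerical diagnostic makes these criteria concrete: estimate $\sup_n \mathsf{E}[|X_n| \mathbf{1}_{[|X_n| > t]}]$ for growing $t$ and see whether it decays. The sketch below (Python with numpy; both test families and all names are my own illustrative choices) contrasts an $L_2$-bounded family, UI by Theorem 5, with the non-UI family $X_n = n$ with probability $1/n$ (else $0$), for which $\mathsf{E}[X_n \mathbf{1}_{[X_n > t]}] = 1$ whenever $n > t$.

```python
import numpy as np

rng = np.random.default_rng(4)

def tail_mean(x, t):
    """Monte Carlo estimate of E[|X| 1_{|X| > t}] from a sample x."""
    ax = np.abs(x)
    return float(np.mean(np.where(ax > t, ax, 0.0)))

def ui_profile(sample_fn, ns, ts, draws=100_000):
    """For each t, the sup over n of the estimated E[|X_n| 1_{|X_n| > t}];
    this tends to 0 as t grows exactly when the family is UI."""
    samples = [sample_fn(n, draws) for n in ns]
    return [round(max(tail_mean(x, t) for x in samples), 4) for t in ts]

ns, ts = range(1, 51), [1, 2, 5, 10, 20]

# UI family (Theorem 5): X_n ~ N(0,1) for every n is bounded in L_2.
print(ui_profile(lambda n, d: rng.normal(size=d), ns, ts))

# Non-UI family: X_n = n with probability 1/n, else 0; the sup never decays.
print(ui_profile(lambda n, d: n * (rng.random(d) < 1 / n), ns, ts))
```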

5 Summary: Uniform Integrability and Convergence Concepts

I. Uniform Integrability (UI)
   A. $|X_n| \le Y \in L_1$ implies $\int_{[|X_n| > t]} |X_n| \, d\mathsf{P} \le \int_{[Y > t]} Y \, d\mathsf{P} \to 0$ as $t \to \infty$, uniformly in $n$; this uniform convergence is the definition of UI.
   B. If $(\Omega, \mathcal{F}, \mathsf{P})$ is nonatomic, $\{X_n\}$ is UI iff $\forall \epsilon\ \exists \delta$ such that $\int_\Lambda |X_n| \, d\mathsf{P} < \epsilon$ whenever $\mathsf{P}[\Lambda] \le \delta$ (take $\delta = \epsilon/2t$). [If $(\Omega, \mathcal{F}, \mathsf{P})$ has atoms one must also require $\mathsf{E}|X_n| \le B$.]
   C. $\mathsf{E}|X_n|^p \le c < \infty$ implies $\{|X_n|^r\}$ is UI for each $r < p$.

II. Convergence Concepts
   A. $X_n \to X$ i.p. iff every subsequence $n_k$ has a further subsequence $n_{k_i}$ with $X_{n_{k_i}} \to X$ a.e. (by contradiction)
   B. $X_n \to X$ a.s. and $\varphi(x)$ continuous implies $\varphi(X_n) \to \varphi(X)$ a.s.
   C. $X_n \to X$ i.p. and $\varphi(x)$ continuous implies $\varphi(X_n) \to \varphi(X)$ i.p. (use A)
   D. Definition: $X_n \Rightarrow X$ if $\mathsf{E}\,\varphi(X_n) \to \mathsf{E}\,\varphi(X)$ for all $\varphi \in C_b(\mathbb{R})$
      1. Prop: $X_n \to X$ i.p. implies $X_n \Rightarrow X$ (use II.C)
      2. Prop: $X_n \Rightarrow X$ implies $F_n(r) \to F(r)$ wherever $F(r) = F(r-)$.
         a. Remark: Even if $X_n \Rightarrow X$, $F_n(r)$ may not converge where $F(r)$ jumps;
         b. Remark: Even if $X_n \Rightarrow X$, $f_n(r) = F_n'(r)$ may not converge to $f(r) = F'(r)$; in fact, either may fail to exist.

III. Implications among these notions: a.e., i.p., $L_p$, $L_r$, $L_\infty$, i.d. ($0 < p < r < \infty$)
   A. a.e. $\Rightarrow$ i.p. (Theorem 2)
      1. i.p. $\not\Rightarrow$ a.e. (counterexample: the moving-blocks example of §2.2.1)
   B. $L_p$ $\Rightarrow$ i.p. (Chebychev's inequality)
      1. a.e. $\not\Rightarrow$ $L_p$ (counterexample: $X_n = n^{1/p} \mathbf{1}_{(0,1/n]}$)
      2. i.p. $\not\Rightarrow$ $L_p$ (counterexample: $X_n = n^{1/p} \mathbf{1}_{(0,1/n]}$)
   C. $L_r$ $\Rightarrow$ $L_p$ (by Jensen's inequality)
      1. $L_p$ $\not\Rightarrow$ $L_r$ (counterexample: $X_n = n^{1/r} \mathbf{1}_{(0,1/n]}$)
   D. $L_\infty$ $\Rightarrow$ $L_p$ (simple estimate)
      1. $L_p$ $\not\Rightarrow$ $L_\infty$ (counterexample: $X_n = n^{1/2p} \mathbf{1}_{(0,1/n]}$)
   E. $L_\infty$ $\Rightarrow$ a.e. (uniform cgce implies pointwise cgce)
   F. i.p. $\Rightarrow$ i.d. (II.D.1 above)
      1. i.d. $\not\Rightarrow$ i.p. (counterexample: $X_n$, $X$ on different spaces)
      2. i.d. $\Rightarrow$ $\exists\, (\tilde\Omega, \tilde{\mathcal{F}}, \tilde{\mathsf{P}})$ and $\tilde X_n \overset{d}{=} X_n$, $\tilde X \overset{d}{=} X$ with $\tilde X_n \to \tilde X$ a.e. (Skorokhod representation)

6 Infinite Coin-Toss and the Laws of Large Numbers

The traditional interpretation of the probability of an event $E$ is its asymptotic frequency: the limit as $n \to \infty$ of the fraction of $n$ repeated, similar, and independent trials in which $E$ occurs. Similarly the "expectation" of a random variable $X$ is taken to be its asymptotic average, the limit as $n \to \infty$ of the average of $n$ repeated, similar, and independent replications of $X$. As statisticians trying to make inference about the underlying probability distribution $f(x \mid \theta)$ governing observed random variables $X_i$, this suggests that we should be interested in the probability distribution for large $n$ of quantities like the average of the RV's, $\frac{1}{n} \sum_{i=1}^n X_i$.

Three of the most celebrated theorems of probability theory concern this sum. For independent random variables $X_i$, all with the same probability distribution satisfying $\mathsf{E}|X_i|^3 < \infty$, set $\mu = \mathsf{E} X_i$, $\sigma^2 = \mathsf{E}|X_i - \mu|^2$, and $S_n = \sum_{i=1}^n X_i$. The three main results are:

Laws of Large Numbers:
$$\frac{S_n - n\mu}{\sigma n} \longrightarrow 0 \quad \text{(i.p. and a.s.)}$$

Central Limit Theorem:
$$\frac{S_n - n\mu}{\sigma \sqrt{n}} \Longrightarrow \mathsf{N}(0, 1) \quad \text{(i.d.)}$$

Law of the Iterated Logarithm:
$$\limsup_{n \to \infty} \frac{S_n - n\mu}{\pm\sigma\sqrt{2n \log\log n}} = 1 \quad \text{(a.s.)}$$

Together these three give a clear picture of how quickly and in what sense $\frac{1}{n} S_n$ tends to $\mu$. We begin with the Law of Large Numbers (LLN), in its "weak" form (asserting convergence i.p.) and in its "strong" form (convergence a.s.). There are several versions of both theorems. The simplest requires the $X_i$ to be IID and $L_2$; stronger results allow us to weaken (but not eliminate) the independence requirement, permit non-identical distributions, and consider what happens if the RV's are only $L_1$ (or worse!) instead of $L_2$. The text covers these things well; to complement it I am going to: (1) prove the simplest version, and with it the Borel-Cantelli theorems; and (2) show what happens with Cauchy random variables, which don't satisfy the requirements (the LLN fails).

I. Weak version, non-iid, $L_2$: $\mu_i = \mathsf{E} X_i$, $\sigma_{ij} = \mathsf{E}[X_i - \mu_i][X_j - \mu_j]$
   A. $Y_n = (S_n - \Sigma \mu_i)/n$ satisfies $\mathsf{E} Y_n = 0$, $\mathsf{E} Y_n^2 = \frac{1}{n^2} \Sigma_i \sigma_{ii} + \frac{2}{n^2} \Sigma_{i<j} \sigma_{ij}$;
   B. Chebychev: if the $X_i$ are uncorrelated ($\sigma_{ij} = 0$, $i \ne j$) with $\sigma_{ii} \le M$ and (WLOG) $\mu_i = 0$, then $\mathsf{P}[|S_n| > n\epsilon] \le \frac{nM}{n^2\epsilon^2} = \frac{M}{n\epsilon^2} \to 0$.

II. Strong version, uncorrelated, $L_2$ (again $\mu_i = 0$, $\sigma_{ii} \le M$)
   A. Along the subsequence $n^2$:
      1. $\mathsf{P}[|S_{n^2}| > n^2\epsilon] < \frac{M}{n^2\epsilon^2}$, so $\Sigma_n \mathsf{P}[|S_{n^2}| > n^2\epsilon] < \frac{M\pi^2}{6\epsilon^2}$
      2. Borel-Cantelli: $\mathsf{P}[|S_{n^2}| > n^2\epsilon \text{ i.o.}] = 0$, so $\frac{1}{n^2} S_{n^2} \to 0$ a.s.
      3. $D_n = \max_{n^2 \le k < (n+1)^2} |S_k - S_{n^2}|$ satisfies $\mathsf{E} D_n^2 \le 2n\, \mathsf{E}|S_{(n+1)^2} - S_{n^2}|^2 \le 4n^2 M$
      4. Chebychev: $\mathsf{P}[D_n > n^2\epsilon] < \frac{4n^2 M}{n^4\epsilon^2}$, so $\frac{D_n}{n^2} \to 0$ a.s.
   B. $|S_k|/k \le \frac{|S_{n^2}| + D_n}{n^2} \to 0$ a.s. for $n^2 \le k < (n+1)^2$, QED
      1. Bernoulli RV's, normal-number theorem, Monte Carlo (see the sketch below).
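Item II.B.1's Monte Carlo is easy to carry out. A minimal sketch (Python with numpy; the path count and sample sizes are arbitrary choices) tracks the running averages $S_n/n$ of Bernoulli(1/2) draws along several independent sample paths; each path individually drifts to $1/2$, as the strong LLN asserts.

```python
import numpy as np

rng = np.random.default_rng(5)

n, paths = 100_000, 5
X = (rng.random((paths, n)) < 0.5).astype(float)       # Bernoulli(1/2) draws
running_mean = np.cumsum(X, axis=1) / np.arange(1, n + 1)
for k in (100, 1_000, 10_000, 100_000):
    print(k, np.round(running_mean[:, k - 1], 4))      # every path nears 0.5
```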

III. Weak version, pairwise-iid, $L_1$
   A. Equivalent sequences: $\Sigma_n \mathsf{P}[X_n \ne Y_n] < \infty$
      1. $\Sigma_n [X_n - Y_n]$ converges a.s.
      2. $\Sigma_{i=1}^n X_i$ and $\frac{1}{a_n} \Sigma_{i=1}^n X_i$ converge iff $\Sigma_{i=1}^n Y_i$ and $\frac{1}{a_n} \Sigma_{i=1}^n Y_i$ do
      3. Truncation: $Y_n = X_n \mathbf{1}_{[|X_n| \le n]}$

IV. Counterexamples: Cauchy
   A. $X_i \sim \frac{dx}{\pi[1 + x^2]}$ $\Rightarrow$ $\mathsf{P}[|S_n/n| \le \epsilon] \equiv \frac{2}{\pi} \tan^{-1}(\epsilon) \not\to 1$, WLLN fails.
   B. $\mathsf{P}[X_i = \pm n] = \frac{c}{n^2}$, $n \ge 1$; $X_i \notin L_1$, and $S_n/n \not\to 0$ i.p. or a.s.
   C. $\mathsf{P}[X_i = \pm n] = \frac{c}{n^2 \log n}$, $n \ge 3$; $X_i \notin L_1$, but $S_n/n \to 0$ i.p. and not a.s.
   D. Medians: for ANY RV's with $X_n \to X_\infty$ i.p., the medians satisfy $m_n \to m_\infty$ if $m_\infty$ is unique.

Let $X_i$ be iid standard Cauchy RV's, with

$$\mathsf{P}[X_1 \le t] = \int_{-\infty}^t \frac{dx}{\pi[1 + x^2]} = \frac{1}{2} + \frac{1}{\pi} \arctan(t)$$

and characteristic function

$$\mathsf{E}\, e^{i\lambda X_1} = \int_{-\infty}^{\infty} e^{i\lambda x}\, \frac{dx}{\pi[1 + x^2]} = e^{-|\lambda|},$$

so $S_n/n$ has characteristic function

$$\mathsf{E}\, e^{i\lambda S_n/n} = \mathsf{E}\, e^{i\frac{\lambda}{n}[X_1 + \cdots + X_n]} = \left(\mathsf{E}\, e^{i\frac{\lambda}{n} X_1}\right)^n = \bigl(e^{-|\lambda|/n}\bigr)^n = e^{-|\lambda|}$$

and $S_n/n$ also has the standard Cauchy distribution, with $\mathsf{P}[S_n/n \le t] = \frac{1}{2} + \frac{1}{\pi} \arctan(t)$; in particular, $S_n/n$ does not converge almost surely, or even in probability.
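This failure is easy to see in simulation (a minimal sketch in Python with numpy; the sample sizes are arbitrary choices): the running averages of standard Cauchy draws keep making large jumps at every scale, and $\mathsf{P}[|S_n/n| \le 1]$ stays near $\frac{2}{\pi}\arctan(1) = 0.5$ instead of tending to 1.

```python
import numpy as np

rng = np.random.default_rng(6)

n = 100_000
X = rng.standard_cauchy(size=n)
Sn_over_n = np.cumsum(X) / np.arange(1, n + 1)
for k in (10, 100, 1_000, 10_000, 100_000):
    print(k, round(float(Sn_over_n[k - 1]), 3))   # wild jumps persist at every scale

# P[|S_n/n| <= 1] stays near 2/pi * arctan(1) = 0.5 for every n (here n = 50).
means = rng.standard_cauchy(size=(20_000, 50)).mean(axis=1)
print(float(np.mean(np.abs(means) <= 1)))
```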
