STA 711: Probability & Measure Theory — Robert L. Wolpert

8 The Laws of Large Numbers

The traditional interpretation of the probability of an event $E$ is its asymptotic relative frequency: the limit as $n\to\infty$ of the fraction of $n$ repeated, similar, and independent trials in which $E$ occurs. Similarly the "expectation" of a random variable $X$ is taken to be its asymptotic average, the limit as $n\to\infty$ of the average of $n$ repeated, similar, and independent replications of $X$. For statisticians trying to make inference about the underlying probability distribution $f(x\mid\theta)$ governing observed random variables $X_i$, this suggests that we should be interested in the probability distribution for large $n$ of quantities like the sample average of the RVs, $\bar X_n := \frac1n\sum_{i=1}^n X_i$, or the partial sum $S_n := \sum_{i=1}^n X_i$. Three of the most celebrated results of probability theory concern this quantity. For independent random variables $X_i$, all with the same probability distribution satisfying $E|X_i|^3 < \infty$, set $\mu := E X_i$, $\sigma^2 := E|X_i - \mu|^2$, and $S_n := \sum_{i=1}^n X_i$. The three main results are:

Laws of Large Numbers (LLN):
$$\frac{S_n - n\mu}{\sigma n} \longrightarrow 0 \quad\text{(pr. and a.s.)}$$

Central Limit Theorem (CLT):
$$\frac{S_n - n\mu}{\sigma\sqrt n} \Longrightarrow \text{No}(0,1) \quad\text{(in dist.)}$$

Law of the Iterated Logarithm (LIL):
$$\limsup_{n\to\infty}\ \frac{S_n - n\mu}{\pm\sigma\sqrt{2n\log\log n}} = 1.0 \quad\text{(a.s.)}$$

Together these three results give a clear picture of how quickly and in what sense $\frac1n S_n$ tends to $\mu$. We begin with the Law of Large Numbers (LLN), first in its "weak" form (asserting convergence pr.) and then in its "strong" form (convergence a.s.). There are several versions of both theorems. The simplest requires the $X_i$ to be IID and $L_2$; stronger results allow us to weaken (but not eliminate) the independence requirement, permit non-identical distributions, and consider what happens if we relax the $L_2$ requirement and allow the RVs to be only $L_1$ (or worse!). The text covers these things well; to complement it I am going to: (1) Prove the simplest version, and with it the Borel-Cantelli theorems; and (2) Show what happens with Cauchy random variables, which don't satisfy the requirements (the LLN fails).

8.1 Proofs of the Weak and Strong Laws

Here are two simple versions (one Weak, one Strong) of the Law of Large Numbers; first we prove an elementary but very useful result:

Proposition 1 (Markov's Inequality) Let $\varphi(x) \ge 0$ be non-decreasing on $\mathbb R_+$. For any random variable $X \ge 0$ and constant $a \in \mathbb R_+$,
$$P[X \ge a] \le P[\varphi(X) \ge \varphi(a)] \le E[\varphi(X)]/\varphi(a).$$

To see this, set $Y := \varphi(a)\,\mathbf 1_A$ for the event $A := \{\varphi(X) \ge \varphi(a)\}$ and note $Y \le \varphi(X)$, so $EY \le E\varphi(X)$.

Theorem 1 ($L_2$ WLLN) Let $\{X_n\}$ be independent random variables with the same mean $\mu = E[X_n]$ and uniformly bounded variances $E(X_n - \mu)^2 \le B$ for some fixed bound $B < \infty$. Set $S_n := \sum_{j\le n} X_j$ and $\bar X_n := S_n/n = \frac1n\sum_{j\le n} X_j$. Then, as $n\to\infty$,
$$(\forall\,\epsilon > 0)\qquad P[\,|\bar X_n - \mu| > \epsilon\,] \to 0. \tag{1}$$

Proof.
$$E(S_n - n\mu)^2 = \sum_{i=1}^n E(X_i - \mu)^2 \le nB,$$
so for $\epsilon > 0$

$$P[\,|\bar X_n - \mu| > \epsilon\,] = P[(S_n - n\mu)^2 > (n\epsilon)^2] \le E[(S_n - n\mu)^2]/n^2\epsilon^2 \le B/n\epsilon^2 \to 0 \quad\text{as } n\to\infty.$$
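To see the bound in action, here is a minimal simulation sketch (assuming Python with numpy; the Exponential(1) summands, the sample sizes, and $\epsilon = 0.1$ are my own arbitrary choices, not from the notes): it estimates $P[\,|\bar X_n - \mu| > \epsilon\,]$ by Monte Carlo and compares it with the Chebychev bound $B/n\epsilon^2$ from the proof.

```python
import numpy as np

# Estimate P[|Xbar_n - mu| > eps] by Monte Carlo for iid Exponential(1) summands
# (mu = 1, variance B = 1) and compare with the Chebychev bound B/(n eps^2).
rng = np.random.default_rng(0)
mu, B, eps = 1.0, 1.0, 0.1
reps = 500                                  # Monte Carlo replications per sample size

for n in [10, 100, 1000, 10000]:
    xbar = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
    p_hat = np.mean(np.abs(xbar - mu) > eps)
    bound = min(B / (n * eps**2), 1.0)      # the bound is vacuous when it exceeds 1
    print(f"n={n:6d}  estimated prob={p_hat:.4f}  Chebychev bound={bound:.4f}")
```

Both columns shrink with $n$, and the Monte Carlo estimate always sits below the bound.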

This Law of Large Numbers is called weak because its conclusion is only that $\bar X_n - \mu$ converges to zero in probability (Eqn (1)); the strong Law of Large Numbers asserts convergence of a stronger sort, called almost sure convergence (Eqn (2) below). If $P[\,|\bar X_n - \mu| > \epsilon\,]$ were summable then by B-C we could conclude almost-sure convergence; unfortunately we have only the bound $P[\,|\bar X_n - \mu| > \epsilon\,] < c/n$, which tends to zero but isn't summable. It is summable along the subsequence $n^2$, however; our approach to proving a strong LLN is to show that $|S_k - S_{n^2}|$ isn't too big for any $n^2 \le k < (n+1)^2$.

Theorem 2 ($L_2$ SLLN) Under the same conditions,
$$P[\bar X_n \to \mu] = 1. \tag{2}$$

Proof. Without loss of generality take $\mu = 0$ (otherwise subtract $\mu$ from each $X_n$), and fix $\epsilon > 0$. Set $S_n := \sum_{j\le n} X_j$. Then
$$P[\,|S_{n^2}| > n^2\epsilon\,] \le E|S_{n^2}|^2/(n^2\epsilon)^2 \le n^2 B/n^4\epsilon^2 = B/n^2\epsilon^2,$$
so $P[\,|S_{n^2}| > n^2\epsilon\ \text{i.o.}\,] = 0$ by B-C and hence $S_{n^2}/n^2 \to 0$ a.s. Set
$$D_n := \max_{n^2 \le k < (n+1)^2} |S_k - S_{n^2}|.$$
Then
$$E D_n^2 = E\Big[\max_{n^2 \le k < (n+1)^2} |S_k - S_{n^2}|^2\Big] \le \sum_{n^2 \le k < (n+1)^2} E|S_k - S_{n^2}|^2 \le 4n^2 B,$$
since each term satisfies $E|S_k - S_{n^2}|^2 \le (k - n^2)B \le 2nB$ and at most $2n$ of them are nonzero. By Chebychev,
$$P[D_n > n^2\epsilon] \le 4n^2 B/n^4\epsilon^2 = 4B/n^2\epsilon^2,$$
which is summable, so $D_n/n^2 \to 0$ a.s. Finally, for $n^2 \le k < (n+1)^2$ with $n = \lfloor\sqrt k\rfloor$,
$$\frac{|S_k|}{k} \le \frac{|S_{n^2}| + D_n}{n^2} \to 0 \quad\text{a.s. as } k\to\infty.$$
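Both ingredients of this proof, the smallness of $|S_{n^2}|/n^2$ along the subsequence and of the gap maxima $D_n/n^2$, can be watched numerically. A sketch (again assuming numpy; the mean-zero Uniform$(-1,1)$ summands are chosen purely for illustration):

```python
import numpy as np

# Watch the two quantities controlled in the proof: |S_{n^2}| / n^2 and D_n / n^2,
# where D_n = max_{n^2 <= k < (n+1)^2} |S_k - S_{n^2}|, along one path of iid
# mean-zero Uniform(-1, 1) summands.
rng = np.random.default_rng(1)
N = 500                                       # blocks n = 1, ..., N
X = rng.uniform(-1.0, 1.0, size=(N + 1) ** 2)
S = np.concatenate(([0.0], np.cumsum(X)))     # S[k] = X_1 + ... + X_k

for n in [10, 50, 100, 500]:
    block = np.abs(S[n * n:(n + 1) ** 2] - S[n * n])   # |S_k - S_{n^2}| over the block
    print(f"n={n:4d}  |S_n2|/n^2 = {abs(S[n * n]) / n**2:.5f}"
          f"  D_n/n^2 = {block.max() / n**2:.5f}")
```

Both ratios shrink as $n$ grows, which is exactly what the B-C arguments above exploit.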


Each of these LLNs required only that $\mathrm{Cov}(X_n, X_m) \le 0$, not pairwise (let alone full) independence. To see that, suppose $E X_n = \mu$ for all $n$, $\mathrm{Cov}(X_n, X_m) \le 0$ for $n \ne m$, and $\mathbf V X_n \le B$. Then
$$E(S_n - n\mu)^2 = \sum_{i,j=1}^{n} E(X_i - \mu)(X_j - \mu) = \sum_{i=1}^n E(X_i - \mu)^2 + 2\sum_{1\le i<j\le n} E(X_i - \mu)(X_j - \mu) \le nB,$$
so, again, for $\epsilon > 0$, $P[\,|\bar X_n - \mu| > \epsilon\,] \le B/n\epsilon^2 \to 0$ as before.

8.2 Other Strong Laws

Let's first state two lemmas:

Lemma 1 (Lévy) If $\{X_n\}$ is an independent sequence then $\sum_{n=1}^\infty X_n$ converges pr. if and only if it converges a.s.

Lemma 2 (Kronecker) Suppose $\{x_n\} \subset \mathbb R$ and $0 < a_n \nearrow \infty$. Then
$$\sum_{k=1}^\infty \frac{x_k}{a_k}\ \text{converges} \quad\Longrightarrow\quad \frac{1}{a_n}\sum_{k=1}^n x_k \to 0,$$
and, from these, prove two useful theorems. First,

Theorem 3 (Kolmogorov Convergence Criterion) Let $\{X_n\} \subset L_2$ be an independent sequence with $\mu_n := E X_n$ and $\sigma_n^2 := \mathbf V(X_n)$. Then
$$\sum_{n=1}^\infty \sigma_n^2 < \infty \quad\Longrightarrow\quad \sum_{n=1}^\infty (X_n - \mu_n)\ \text{converges a.s.}$$

Proof. Without loss of generality take $\mu_n \equiv 0$. Fix $\epsilon > 0$. For $1 \le n \le N < \infty$,
$$P[\,|S_N - S_n| > \epsilon\,] = P\Big[\Big|\sum_{n<k\le N} X_k\Big| > \epsilon\Big] \le \frac{1}{\epsilon^2}\sum_{n<k\le N}\sigma_k^2,$$
which tends to zero as $n, N \to \infty$ because $\sum_k \sigma_k^2 < \infty$. Thus the partial sums are Cauchy in probability, so $\sum_n X_n$ converges pr. and hence, by Lemma 1, converges a.s.
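A quick numerical illustration of Theorem 3 (a sketch assuming numpy; the choice $\sigma_n = 1/n$ is mine): since $\sum_n \sigma_n^2 = \pi^2/6 < \infty$, the partial sums of $\sum_n X_n$ should stabilize along a single path.

```python
import numpy as np

# Independent X_n ~ No(0, sigma_n^2) with sigma_n = 1/n, so sum sigma_n^2 < infinity.
# Theorem 3 says the random series sum_n X_n converges a.s.; watch one path stabilize.
rng = np.random.default_rng(2)
N = 10 ** 6
n = np.arange(1, N + 1)
X = rng.normal(0.0, 1.0 / n)          # X_n has standard deviation 1/n
partial = np.cumsum(X)

for m in [10, 100, 10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6]:
    print(f"sum of first {m:>7d} terms: {partial[m - 1]: .6f}")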

Theorem 4 (Kronecker $L_2$ SLLN) Let $\{X_n\} \subset L_2$ be an independent sequence, and let $\{b_n\} \subset \mathbb R_+$ be a monotone sequence increasing to infinity such that
$$\sum_n \mathbf V\Big(\frac{X_n}{b_n}\Big) < \infty.$$
Then
$$\frac{S_n - E S_n}{b_n} \to 0 \quad\text{a.s.}$$

In particular, for iid $\{X_n\} \subset L_2$ we may take $b_n = n$ to see that $\bar X_n \to \mu$ a.s.; more generally, Theorem 4 is a SLLN for independent but not necessarily identically-distributed $L_2$ random variables.

Proof. Again take $E[X_n] \equiv 0$, and set $Y_n := X_n/b_n$, with variance $\sigma_n^2$. By hypothesis $\sum_n \sigma_n^2 < \infty$, so $\sum_n Y_n$ converges a.s. by Theorem 3. Thus, for almost every $\omega$, $\sum_n X_n(\omega)/b_n$ converges and, by Lemma 2, also $S_n/b_n \to 0$.
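A sketch of Theorem 4 in action (assuming numpy; the variance growth $\mathbf V X_n = \sqrt n$ and the choice $b_n = n$ are my own illustration): here $\sum_n \mathbf V(X_n/n) = \sum_n n^{-3/2} < \infty$, so $S_n/n \to 0$ a.s. even though the variances grow without bound.

```python
import numpy as np

# Independent mean-zero X_n with V(X_n) = sqrt(n), so V(X_n / n) = n^(-3/2) is summable
# and Theorem 4 gives S_n / n -> 0 a.s. despite the growing, non-identical variances.
rng = np.random.default_rng(3)
N = 10 ** 6
n = np.arange(1, N + 1)
X = rng.normal(0.0, n ** 0.25)        # standard deviation n^(1/4), variance sqrt(n)
Sn_over_n = np.cumsum(X) / n

for m in [10 ** 2, 10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6]:
    print(f"S_n/n at n={m:>7d}: {Sn_over_n[m - 1]: .5f}")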

Kolmogorov’s L1 SLLN

The most-cited and most-used version of the Strong Law for iid sequences is that due to Kolmogorov, with no moment assumptions:

Theorem 5 (Kolmogorov's $L_1$ SLLN) Let $\{X_n\}$ be iid and set $S_n := \sum_{i\le n} X_i$. There exists $c \in \mathbb R$ such that
$$\bar X_n = S_n/n \to c \quad\text{a.s.}$$
if and only if $E|X_1| < \infty$, in which case $c = E X_1$.

This has the pleasant consequence that the usual estimators for the mean and variance of iid sequences are consistent:

Corollary 1

$$\{X_n\} \subset L_1 \quad\Longrightarrow\quad \bar X_n \to \mu \ \text{a.s.}$$
$$\{X_n\} \subset L_2 \quad\Longrightarrow\quad \frac1n\sum_{i\le n} (X_i - \bar X_n)^2 \to \sigma^2 \ \text{a.s.}$$
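A small check of Corollary 1 (a sketch assuming numpy; the Gamma(2,1) data, with $\mu = 2$ and $\sigma^2 = 2$, is an arbitrary choice): the running sample mean and the running value of $\frac1n\sum_{i\le n}(X_i - \bar X_n)^2$ both settle down near the population values.

```python
import numpy as np

# Consistency of the usual estimators for iid Gamma(shape=2, scale=1) data:
# mu = 2 and sigma^2 = 2, so Xbar_n -> 2 and (1/n) sum (X_i - Xbar_n)^2 -> 2 a.s.
rng = np.random.default_rng(4)
X = rng.gamma(shape=2.0, scale=1.0, size=10 ** 6)

for n in [10 ** 2, 10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6]:
    xbar = X[:n].mean()
    s2 = np.mean((X[:n] - xbar) ** 2)   # the (1/n) variance estimator in Corollary 1
    print(f"n={n:>7d}  mean={xbar:.4f}  variance={s2:.4f}")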


Here's a quick summary of some LLN facts:

I. Weak version, non-iid, $L_2$: $\mu_i = E X_i$, $\sigma_{ij} = E[X_i - \mu_i][X_j - \mu_j]$
   A. $Y_n = (S_n - \Sigma\mu_i)/n$ satisfies $E Y_n = 0$, $E Y_n^2 = \frac1{n^2}\Sigma_{i\le n}\sigma_{ii} + \frac2{n^2}\Sigma_{i<j\le n}\sigma_{ij}$;
   B. Chebychev: if $\sigma_{ii} \le M$ and $\sigma_{ij} \le 0$ for $i\ne j$, then $P[\,|S_n - \Sigma\mu_i| > n\epsilon\,] < Mn/n^2\epsilon^2 = M/n\epsilon^2 \to 0$.

II. Strong version, $L_2$ (same conditions; take $\mu_i = 0$ WLOG):
   A. Along the subsequence $n^2$:
      1. $P[\,|S_{n^2}| > n^2\epsilon\,] < M/n^2\epsilon^2$, so $\Sigma_n\, P[\,|S_{n^2}| > n^2\epsilon\,] < M\pi^2/6\epsilon^2 < \infty$
      2. Borel-Cantelli: $P[\,|S_{n^2}| > n^2\epsilon\ \text{i.o.}\,] = 0$, $\therefore \frac1{n^2}S_{n^2} \to 0$ a.s.
      3. $D_n := \max_{n^2\le k<(n+1)^2}|S_k - S_{n^2}|$, so $E D_n^2 \le 2n\, E|S_{(n+1)^2-1} - S_{n^2}|^2 \le 4n^2 M$
      4. Chebychev: $P[D_n > n^2\epsilon] < 4n^2 M/n^4\epsilon^2$, $\therefore D_n/n^2 \to 0$ a.s.
   B. $|S_k|/k \le \frac{|S_{n^2}| + D_n}{n^2} \to 0$ a.s. as $k\to\infty$, QED
      1. Bernoulli RVs, normal number theorem, Monte Carlo integration.

III. Weak version, pairwise-iid, $L_1$
   A. Equivalent sequences: If $\Sigma_n P[X_n \ne Y_n] < \infty$, then:
      1. $\Sigma_n |X_n - Y_n| < \infty$ a.s.
      2. $\Sigma_{i=1}^n X_i$ converges iff $\Sigma_{i=1}^n Y_i$ converges
      3. If $(\exists\, a_n \nearrow \infty)(\exists\, X) \ni \frac1{a_n}\Sigma_{i\le n} X_i \to X$, then $\frac1{a_n}\Sigma_{i\le n} Y_i \to X$ too.
   B. If $\{X_n\}$ are iid $L_1$ then $X_n$ and $Y_n := X_n \mathbf 1_{[|X_n|\le n]}$ are equivalent.

IV. Strong version, iid, $L_1$
   A. Kolmogorov: For $\{X_n\}$ IID, $(\exists\, c \ni \bar X_n \to c$ a.s.$) \Leftrightarrow (X_n \in L_1)$.

V. Counterexamples:
   A. Cauchy: $X_i \sim \frac{dx}{\pi[1+x^2]} \Rightarrow P[\,|S_n/n| \le \epsilon\,] \equiv \frac2\pi\tan^{-1}(\epsilon) \not\to 1$, WLLN fails.
   B. $P[X_i = \pm n] = \frac{c}{n^2}$, $n \ge 1$: $X_i \notin L_1$, and $S_n/n \not\to 0$ pr. or a.s.
   C. $P[X_i = \pm n] = \frac{c}{n^2\log n}$, $n > 1$: $X_i \notin L_1$, but $S_n/n \to 0$ pr. and not a.s.
   D. Medians: for ANY RVs $X_n \to X_\infty$ pr., $m_n \to m_\infty$ if the median $m_\infty$ is unique.


8.3 An example where the LLN fails: iid Cauchy RVs

Let Xi be iid standard Cauchy RVs, with

$$P[X_1 \le t] = \int_{-\infty}^t \frac{dx}{\pi[1+x^2]} = \frac12 + \frac1\pi\arctan(t)$$
and characteristic function (we'll learn more about these next week)
$$E\,e^{i\omega X_1} = \int_{-\infty}^\infty e^{i\omega x}\,\frac{dx}{\pi[1+x^2]} = e^{-|\omega|}.$$

The sample mean X¯n := Sn/n has characteristic function

$$E\,e^{i\omega S_n/n} = E\,e^{i\frac\omega n[X_1+\cdots+X_n]} = \big(E\,e^{i\frac\omega n X_1}\big)^n = \big(e^{-|\omega/n|}\big)^n = e^{-|\omega|}.$$
Thus $S_n/n$ also has the standard Cauchy distribution, with $P[S_n/n \le t] = \frac12 + \frac1\pi\arctan(t)$. In particular, $S_n/n$ does not converge almost surely, or even in probability.
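A simulation makes the failure vivid (a sketch assuming numpy; the sample sizes are arbitrary): the running mean of a Cauchy sample keeps jumping around no matter how large $n$ becomes, in contrast with the well-behaved examples earlier.

```python
import numpy as np

# Sample means of iid standard Cauchy variables do NOT settle down:
# S_n / n is again standard Cauchy for every n, so the running mean keeps jumping.
rng = np.random.default_rng(5)
X = rng.standard_cauchy(size=10 ** 6)
running_mean = np.cumsum(X) / np.arange(1, X.size + 1)

for n in [10 ** 2, 10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6]:
    print(f"n={n:>7d}  S_n/n = {running_mean[n - 1]: .4f}")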

8.4 A LLN for Correlated Sequences

In many applications we would like a Law of Large Numbers for sequences of random variables that are not independent. For example, in Monte Carlo integration we have a stationary Markov chain $\{X_t\}$ (this means that the distribution of $X_t$ is the same for all $t$ and that the conditional distribution of a future value $X_u$ for $u > t$, given the past $\{X_s : s \le t\}$, depends only on the present $X_t$) and want to estimate the population mean $E[\varphi(X_t)]$ for some function $\varphi(\cdot)$ by the sample mean
$$E[\varphi(X_t)] \approx \frac1T\sum_{t=0}^{T-1}\varphi(X_t).$$
Even though they are identically distributed, the random variables $Y_t := \varphi(X_t)$ won't be independent if the $X_t$ aren't independent, so the LLN we already have doesn't quite apply.

Theorem 6 If an $L_2$ sequence $\{Y_t\}$ has a constant mean $\mu = E Y_t$ and a summable autocovariance $\gamma_{st} := E(Y_s - \mu)(Y_t - \mu)$ that satisfies $\sum_{t=-\infty}^\infty |\gamma_{st}| \le c < \infty$ uniformly in $s$, then the random variables $\{Y_t\}$ obey a LLN:
$$E[Y_t] = \lim_{T\to\infty}\frac1T\sum_{t=0}^{T-1} Y_t.$$

Proof. Let $S_T$ be the sum of the first $T$ of the $Y_t$ and (as usual) set $\bar Y_T := S_T/T$. The variance of $S_T$ is

$$E[(S_T - T\mu)^2] = \sum_{s=0}^{T-1}\sum_{t=0}^{T-1} E[(Y_s - \mu)(Y_t - \mu)] \le \sum_{s=0}^{T-1}\sum_{t=-\infty}^{\infty}|\gamma_{st}| \le T c,$$

so $\bar Y_T$ has variance $\mathbf V[\bar Y_T] \le c/T$ and by Chebychev's inequality
$$P[\,|\bar Y_T - \mu| > \epsilon\,] \le \frac{E[(\bar Y_T - \mu)^2]}{\epsilon^2} = \frac{E[(S_T - T\mu)^2]}{T^2\epsilon^2} \le \frac{Tc}{T^2\epsilon^2} = \frac{c}{T\epsilon^2} \to 0 \quad\text{as } T\to\infty.$$
As in the iid case, this WLLN can be extended to a SLLN by first applying it along the sequence $\{T^2 : T\in\mathbb N\}$, applying B-C, then filling in the gaps.

A sequence of random variables $\{Y_t\}$ is called stationary if each $Y_t$ has the same probability distribution and, moreover, each finite set $(Y_{t_1+h}, Y_{t_2+h}, \cdots, Y_{t_k+h})$ has a joint distribution that doesn't depend on $h$. The sequence is called "$L_2$" if each $Y_t$ has a finite variance $\sigma^2$ (and hence also a well-defined mean $\mu$); by stationarity it also follows that the covariance

$$\gamma_{st} = E[(Y_s - \mu)(Y_t - \mu)]$$
is finite and depends only on the absolute difference $|t - s|$ (write: $\gamma_{st} = \gamma(s-t) = \gamma(t-s)$).

Corollary 2 If a stationary $L_2$ sequence $\{Y_t\}$ has a summable covariance function, i.e., satisfies $\sum_{t=-\infty}^\infty |\gamma(t)| \le c < \infty$, then
$$E[Y_t] = \lim_{T\to\infty}\frac1T\sum_{t=0}^{T-1} Y_t.$$
Note we didn't need full stationarity. It would be good enough for $\{Y_t\}$ to be "2nd-order stationary," i.e., to have a common mean $\mu = E Y_t$ and a covariance function $\gamma_{st} = E(Y_s - \mu)(Y_t - \mu) = \gamma(s-t)$ that depends only on the absolute difference $|s - t|$.
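Here is a sketch illustrating Corollary 2 (assuming numpy; the two-term moving average $Y_t = \mu + (Z_t + Z_{t-1})/2$ is my own choice of a stationary sequence with summable covariance, since $\gamma(h) = 0$ for $|h| > 1$): its sample means converge to $\mu$ even though adjacent terms are correlated.

```python
import numpy as np

# A stationary, correlated sequence with summable covariance: the moving average
# Y_t = mu + (Z_t + Z_{t-1})/2 with iid No(0,1) innovations Z_t.  Here gamma(0) = 1/2,
# gamma(+-1) = 1/4, and gamma(h) = 0 otherwise, so Corollary 2 gives Ybar_T -> mu.
rng = np.random.default_rng(6)
mu, T = 3.0, 10 ** 6
Z = rng.normal(0.0, 1.0, size=T + 1)
Y = mu + 0.5 * (Z[1:] + Z[:-1])

running_mean = np.cumsum(Y) / np.arange(1, T + 1)
for t in [10 ** 2, 10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6]:
    print(f"T={t:>7d}  Ybar_T = {running_mean[t - 1]:.4f}")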

8.4.1 Examples

1. IID: If $X_t$ are independent and identically distributed, and if $Y_t = \varphi(X_t)$ has finite variance $\sigma^2$, then $Y_t$ has a well-defined finite mean $\mu$ and $\bar Y_T \to \mu$. Here
$$\gamma_{st} = \begin{cases}\sigma^2 & \text{if } s = t\\ 0 & \text{if } s \ne t,\end{cases}$$
so $c = \sigma^2 < \infty$.

2. AR$_1$: If $Z_t$ are iid No$(0,1)$ for $-\infty < t < \infty$, $\mu \in \mathbb R$, $\sigma > 0$, and $-1 < \rho < 1$, and
$$X_t := \mu + \sigma\sum_{s=0}^\infty \rho^s Z_{t-s} = \rho X_{t-1} + \alpha + \sigma Z_t, \qquad (*)$$
where $\alpha = (1-\rho)\mu$, then the $X_t$ are identically distributed (all with the No$\big(\mu, \frac{\sigma^2}{1-\rho^2}\big)$ distribution) but not independent (since $\gamma_{st} = \frac{\sigma^2}{1-\rho^2}\rho^{|s-t|} \ne 0$ for $s \ne t$). This is called an "autoregressive process" (because of equation (*), expressing $X_t$ as a regression on previous


$X_s$'s) of order one (because only one earlier $X_s$ appears in (*)), and is about the simplest non-iid sequence occurring in applications. Since the covariance is summable,
$$\sum_{t=-\infty}^\infty |\gamma(t)| = \frac{\sigma^2}{1-\rho^2}\cdot\frac{1+|\rho|}{1-|\rho|} = \frac{\sigma^2}{(1-|\rho|)^2} < \infty,$$
we again have $\bar X_T \to \mu$ a.s. as $T\to\infty$.

3. Geometric Ergodicity: If for some $0 < \rho < 1$ and $c > 0$ we have $|\gamma_{st}| \le c\rho^{|s-t|}$ for a Markov chain $Y_t$, the chain is called geometrically ergodic (because $c\rho^t$ is a geometric sequence), and the same argument as for AR$_1$ shows that $\bar Y_t$ converges. Meyn & Tweedie (1993), Tierney (1994), and others have given conditions for MCMC chains to be geometrically ergodic, and hence for the almost-sure convergence of sample to population means. The mean squared error MSE $:= E|\bar Y_t - \mu|^2$ for a geometrically ergodic sequence is bounded by MSE $\le \frac{1+\rho}{1-\rho}(\sigma^2/t) \asymp 1/t$, but the constant grows without bound as $\rho \to 1$. Irreducible aperiodic finite-state Markov chains are geometrically ergodic, with $\rho$ the second-largest eigenvalue of the one-step transition matrix.

4. General Ergodicity: Consider the three sequences of random variables on $(\Omega, \mathcal F, P)$ with $\Omega = (0,1]$ and $\mathcal F = \mathcal B(\Omega)$, each with $X_0(\omega) = \omega$:
(a) $X_{n+1} := 2X_n$ (mod 1);

(b) $X_{n+1} := X_n + \alpha$ (mod 1) (Does it matter if $\alpha$ is rational?);
(c) $X_{n+1} := 4X_n(1 - X_n)$.

For each there exists a probability measure $P$ (a distribution for $X_0$) such that the $X_n$ are identically distributed — the Un$(0,1)$ distribution for (a) and (b), and the "arcsin" distribution Be$(\frac12, \frac12)$ for (c) — and each is of the form $X_n = X_0 \circ T^n(\omega)$ for a measure-preserving transformation $T:\Omega\to\Omega$, a measurable mapping for which $P(E) = P\big(T^{-1}(E)\big)$ for each $E \in \mathcal F$. Such a transformation is called ergodic if the only $T$-invariant events (those that satisfy $E = T^{-1}(E)$) are "almost trivial" in the sense that $P[E] = 0$ or $P[E] = 1$. A sequence $X_n := X_0 \circ T^n$ is called ergodic if $T$ is. Each of the sequences (a)–(c) is ergodic (can you show that?).

Birkhoff's Ergodic Theorem asserts that, if $X_n \in L_1(\Omega, \mathcal F, P)$ is an integrable ergodic sequence, then $\bar X_n$ converges almost-surely to a $T$-invariant limit $X_\infty$ as $n\to\infty$. Since only constants are $T$-invariant for ergodic sequences, it follows that $\bar X_n \to \mu := E X_n$ a.s. for ergodic sequences. The conditions here are weaker than those for the usual LLN; in all three cases above, for example, all ergodic, each $X_n$ is completely determined by $X_0$ so there is complete dependence, with $\sigma(X_n) \subset \sigma(X_m) \subset \sigma(X_0)$ for all $0 \le m \le n$!

For any $L_1$ distribution $\mu(dx)$ on $(\mathbb R, \mathcal B)$, we can construct iid random variables $\{X_n\} \stackrel{\text{iid}}\sim \mu(dx)$ on the product space $\Omega = \mathbb R^\infty$, $\mathcal F = \mathcal B^\infty$, $P = \mu^{\otimes\mathbb N}$, and a measure-preserving transformation $T:\Omega\to\Omega$ called the left-shift, by
$$X_n(\omega_1,\omega_2,\omega_3,\cdots) := \omega_n, \qquad T(\omega_1,\omega_2,\omega_3,\cdots) := (\omega_2,\omega_3,\omega_4,\cdots).$$
The $\sigma$-algebra $\mathcal T = \{A : A = T^{-1}(A)\}$ of $T$-invariant sets is contained in the tail $\sigma$-algebra for the independent random variables $\{X_n\}$, so by Kolmogorov's zero-one law $\mathcal T$ is almost-trivial


and so $T$ is ergodic. It follows from Birkhoff's Ergodic Theorem that the sample mean $\bar X_n := \frac1n\sum_{j=1}^n X_j$ converges almost-surely to a $T$-invariant and hence almost-surely constant random variable whose value must be $\mu$, proving a strong LLN for iid random variables that assumes only $L_1$:

Theorem 7 ($L_1$ iid SLLN) Let $\{X_n\}$ be iid $L_1(\Omega, \mathcal F, P)$ random variables with mean $\mu = E[X_n]$. Set $S_n := \sum_{j\le n} X_j$ and $\bar X_n := S_n/n = \frac1n\sum_{j\le n} X_j$. Then:

$$P[\bar X_n \to \mu] = 1.$$
Thus the "space average" $\int_{\mathbb R} x\,\mu(dx)$ and the limiting "time average" $\lim_{n\to\infty}\bar X_n$ coincide for ergodic sequences.
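As a numerical companion to examples (b) and (c) above, here is a sketch (assuming Python/numpy; example (a), the doubling map, is omitted because in binary floating point it collapses to 0 after roughly 53 iterations). Birkhoff's theorem predicts that both time averages below should approach the space average $E X = \frac12$: the mean of Un$(0,1)$ for the rotation, and the mean of Be$(\frac12,\frac12)$ for the logistic map.

```python
import numpy as np

# Time averages for two of the ergodic maps above, started from an arbitrary X_0:
#   (b) rotation  X_{n+1} = X_n + alpha (mod 1), alpha irrational, invariant law Un(0,1)
#   (c) logistic  X_{n+1} = 4 X_n (1 - X_n),     invariant law Be(1/2, 1/2) ("arcsin")
# Birkhoff's theorem predicts both time averages tend to the space average E[X] = 1/2.
# (The doubling map (a) is omitted: float64 round-off drives it to 0 in ~53 steps.)
N = 10 ** 6
alpha = np.sqrt(2.0) - 1.0            # an irrational rotation angle

x_rot, x_log = 0.3, 0.3               # arbitrary starting points
sum_rot = sum_log = 0.0
for _ in range(N):
    sum_rot += x_rot
    sum_log += x_log
    x_rot = (x_rot + alpha) % 1.0
    x_log = 4.0 * x_log * (1.0 - x_log)

print(f"rotation time average: {sum_rot / N:.4f}   (space average 0.5)")
print(f"logistic time average: {sum_log / N:.4f}   (space average 0.5)")
```

Complete dependence on $X_0$ does not spoil the convergence of the time averages, which is exactly the point of the ergodic theorem.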

Last edited: October 25, 2017
