STA 711: Probability & Measure Theory — Robert L. Wolpert

8 The Laws of Large Numbers

The traditional interpretation of the probability of an event $E$ is its asymptotic relative frequency: the limit as $n\to\infty$ of the fraction of $n$ repeated, similar, and independent trials in which $E$ occurs. Similarly the "expectation" of a random variable $X$ is taken to be its asymptotic average, the limit as $n\to\infty$ of the average of $n$ repeated, similar, and independent replications of $X$. For statisticians trying to make inference about the underlying probability distribution $f(x\mid\theta)$ governing observed random variables $X_i$, this suggests that we should be interested in the probability distribution for large $n$ of quantities like the sample average of the RVs, $\bar X_n := \frac1n\sum_{i=1}^n X_i$, or the partial sum $S_n := \sum_{i=1}^n X_i$. Three of the most celebrated results of probability theory concern this quantity. For independent random variables $X_i$, all with the same probability distribution satisfying $E|X_i|^3 < \infty$, set $\mu := E X_i$, $\sigma^2 := E|X_i - \mu|^2$, and $S_n := \sum_{i=1}^n X_i$. The three main results are:

Laws of Large Numbers (LLN):
$$\frac{S_n - n\mu}{\sigma n} \longrightarrow 0 \quad\text{(pr. and a.s.)}$$

Central Limit Theorem (CLT):
$$\frac{S_n - n\mu}{\sigma\sqrt n} \Longrightarrow \text{No}(0,1) \quad\text{(in dist.)}$$

Law of the Iterated Logarithm (LIL):
$$\limsup_{n\to\infty}\ \frac{S_n - n\mu}{\pm\sigma\sqrt{2n\log\log n}} = 1.0 \quad\text{(a.s.)}$$

Together these three results give a clear picture of how quickly and in what sense $\frac1n S_n$ tends to $\mu$. We begin with the Law of Large Numbers (LLN), first in its "weak" form (asserting convergence pr.) and then in its "strong" form (convergence a.s.). There are several versions of both theorems. The simplest requires the $X_i$ to be IID and $L_2$; stronger results allow us to weaken (but not eliminate) the independence requirement, permit non-identical distributions, and consider what happens if we relax the $L_2$ requirement and allow the RVs to be only $L_1$ (or worse!). The text covers these things well; to complement it I am going to: (1) Prove the simplest version, and with it the Borel-Cantelli theorems; and (2) Show what happens with Cauchy random variables, which don't satisfy the requirements (the LLN fails).

8.1 Proofs of the Weak and Strong Laws

Here are two simple versions (one Weak, one Strong) of the Law of Large Numbers; first we prove an elementary but very useful result:

Proposition 1 (Markov's Inequality) Let $\varphi(x) \ge 0$ be non-decreasing on $\mathbb R_+$. For any random variable $X \ge 0$ and constant $a \in \mathbb R_+$,
$$P[X \ge a] \le P[\varphi(X) \ge \varphi(a)] \le E[\varphi(X)]/\varphi(a).$$

To see this, set $Y := \varphi(a)\,\mathbf 1_A$ for the event $A := \{\varphi(X) \ge \varphi(a)\}$ and note $Y \le \varphi(X)$, so $EY \le E\varphi(X)$.

Theorem 1 ($L_2$ WLLN) Let $\{X_n\}$ be independent random variables with the same mean $\mu = E[X_n]$ and uniformly bounded variances $E(X_n - \mu)^2 \le B$ for some fixed bound $B < \infty$. Set $S_n := \sum_{j\le n} X_j$ and $\bar X_n := S_n/n = \frac1n\sum_{j\le n} X_j$. Then, as $n\to\infty$,
$$(\forall\,\epsilon > 0)\qquad P[\,|\bar X_n - \mu| > \epsilon\,] \to 0. \tag{1}$$

Proof.
$$E(S_n - n\mu)^2 = \sum_{i=1}^n E(X_i - \mu)^2 \le nB,$$
so for $\epsilon > 0$

$$P[\,|\bar X_n - \mu| > \epsilon\,] = P[(S_n - n\mu)^2 > (n\epsilon)^2] \le E[(S_n - n\mu)^2]/n^2\epsilon^2 \le B/n\epsilon^2 \to 0 \quad\text{as } n\to\infty.$$
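To see the bound in action, here is a minimal simulation sketch (assuming Python with numpy; the Exponential(1) summands, the sample sizes, and $\epsilon = 0.1$ are my own arbitrary choices, not from the notes): it estimates $P[\,|\bar X_n - \mu| > \epsilon\,]$ by Monte Carlo and compares it with the Chebychev bound $B/n\epsilon^2$ from the proof.

```python
import numpy as np

# Estimate P[|Xbar_n - mu| > eps] by Monte Carlo for iid Exponential(1) summands
# (mu = 1, variance B = 1) and compare with the Chebychev bound B/(n eps^2).
rng = np.random.default_rng(0)
mu, B, eps = 1.0, 1.0, 0.1
reps = 500                                  # Monte Carlo replications per sample size

for n in [10, 100, 1000, 10000]:
    xbar = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
    p_hat = np.mean(np.abs(xbar - mu) > eps)
    bound = min(B / (n * eps**2), 1.0)      # the bound is vacuous when it exceeds 1
    print(f"n={n:6d}  estimated prob={p_hat:.4f}  Chebychev bound={bound:.4f}")
```

Both columns shrink with $n$, and the Monte Carlo estimate always sits below the bound.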

This Law of Large Numbers is called weak because its conclusion is only that $\bar X_n - \mu$ converges to zero in probability (Eqn (1)); the strong Law of Large Numbers asserts convergence of a stronger sort, called almost sure convergence (Eqn (2) below). If $P[\,|\bar X_n - \mu| > \epsilon\,]$ were summable then by B-C we could conclude almost-sure convergence; unfortunately we have only the bound $P[\,|\bar X_n - \mu| > \epsilon\,] < c/n$, which tends to zero but isn't summable. It is summable along the subsequence $n^2$, however; our approach to proving a strong LLN is to show that $|S_k - S_{n^2}|$ isn't too big for any $n^2 \le k < (n+1)^2$.

Theorem 2 ($L_2$ SLLN) Under the same conditions,
$$P[\bar X_n \to \mu] = 1. \tag{2}$$

Proof. Without loss of generality take $\mu = 0$ (otherwise subtract $\mu$ from each $X_n$), and fix $\epsilon > 0$. Set $S_n := \sum_{j\le n} X_j$. Then
$$P[\,|S_{n^2}| > n^2\epsilon\,] \le E|S_{n^2}|^2/(n^2\epsilon)^2 \le n^2 B/n^4\epsilon^2 = B/n^2\epsilon^2,$$
so $P[\,|S_{n^2}| > n^2\epsilon\ \text{i.o.}\,] = 0$ by B-C and hence $S_{n^2}/n^2 \to 0$ a.s. Set
$$D_n := \max_{n^2 \le k < (n+1)^2} |S_k - S_{n^2}|.$$
Then
$$E D_n^2 = E\Big[\max_{n^2 \le k < (n+1)^2} |S_k - S_{n^2}|^2\Big] \le \sum_{n^2 \le k < (n+1)^2} E|S_k - S_{n^2}|^2 \le 4n^2 B,$$
since each term satisfies $E|S_k - S_{n^2}|^2 \le (k - n^2)B \le 2nB$ and at most $2n$ of them are nonzero. By Chebychev,
$$P[D_n > n^2\epsilon] \le 4n^2 B/n^4\epsilon^2 = 4B/n^2\epsilon^2,$$
which is summable, so $D_n/n^2 \to 0$ a.s. Finally, for $n^2 \le k < (n+1)^2$ with $n = \lfloor\sqrt k\rfloor$,
$$\frac{|S_k|}{k} \le \frac{|S_{n^2}| + D_n}{n^2} \to 0 \quad\text{a.s. as } k\to\infty.$$
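Both ingredients of this proof, the smallness of $|S_{n^2}|/n^2$ along the subsequence and of the gap maxima $D_n/n^2$, can be watched numerically. A sketch (again assuming numpy; the mean-zero Uniform$(-1,1)$ summands are chosen purely for illustration):

```python
import numpy as np

# Watch the two quantities controlled in the proof: |S_{n^2}| / n^2 and D_n / n^2,
# where D_n = max_{n^2 <= k < (n+1)^2} |S_k - S_{n^2}|, along one path of iid
# mean-zero Uniform(-1, 1) summands.
rng = np.random.default_rng(1)
N = 500                                       # blocks n = 1, ..., N
X = rng.uniform(-1.0, 1.0, size=(N + 1) ** 2)
S = np.concatenate(([0.0], np.cumsum(X)))     # S[k] = X_1 + ... + X_k

for n in [10, 50, 100, 500]:
    block = np.abs(S[n * n:(n + 1) ** 2] - S[n * n])   # |S_k - S_{n^2}| over the block
    print(f"n={n:4d}  |S_n2|/n^2 = {abs(S[n * n]) / n**2:.5f}"
          f"  D_n/n^2 = {block.max() / n**2:.5f}")
```

Both ratios shrink as $n$ grows, which is exactly what the B-C arguments above exploit.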


Each of these LLNs required only that $\mathrm{Cov}(X_n, X_m) \le 0$, not pairwise (let alone full) independence. To see that, suppose $E X_n = \mu$ for all $n$, $\mathrm{Cov}(X_n, X_m) \le 0$ for $n \ne m$, and $\mathbf V X_n \le B$. Then
$$E(S_n - n\mu)^2 = \sum_{i,j=1}^{n} E(X_i - \mu)(X_j - \mu) = \sum_{i=1}^n E(X_i - \mu)^2 + 2\sum_{1\le i<j\le n} E(X_i - \mu)(X_j - \mu) \le nB,$$
so, again, for $\epsilon > 0$, $P[\,|\bar X_n - \mu| > \epsilon\,] \le B/n\epsilon^2 \to 0$ as before.

8.2 Other Strong Laws

Let's first state two lemmas:

Lemma 1 (Lévy) If $\{X_n\}$ is an independent sequence then $\sum_{n=1}^\infty X_n$ converges pr. if and only if it converges a.s.

Lemma 2 (Kronecker) Suppose $\{x_n\} \subset \mathbb R$ and $0 < a_n \nearrow \infty$. Then
$$\sum_{k=1}^\infty \frac{x_k}{a_k}\ \text{converges} \quad\Longrightarrow\quad \frac{1}{a_n}\sum_{k=1}^n x_k \to 0,$$
and, from these, prove two useful theorems. First,

Theorem 3 (Kolmogorov Convergence Criterion) Let $\{X_n\} \subset L_2$ be an independent sequence with $\mu_n := E X_n$ and $\sigma_n^2 := \mathbf V(X_n)$. Then
$$\sum_{n=1}^\infty \sigma_n^2 < \infty \quad\Longrightarrow\quad \sum_{n=1}^\infty (X_n - \mu_n)\ \text{converges a.s.}$$

Proof. Without loss of generality take $\mu_n \equiv 0$. Fix $\epsilon > 0$. For $1 \le n \le N < \infty$,
$$P[\,|S_N - S_n| > \epsilon\,] = P\Big[\Big|\sum_{n<k\le N} X_k\Big| > \epsilon\Big] \le \frac{1}{\epsilon^2}\sum_{n<k\le N}\sigma_k^2,$$
which tends to zero as $n, N \to \infty$ because $\sum_k \sigma_k^2 < \infty$. Thus the partial sums are Cauchy in probability, so $\sum_n X_n$ converges pr. and hence, by Lemma 1, converges a.s.
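A quick numerical illustration of Theorem 3 (a sketch assuming numpy; the choice $\sigma_n = 1/n$ is mine): since $\sum_n \sigma_n^2 = \pi^2/6 < \infty$, the partial sums of $\sum_n X_n$ should stabilize along a single path.

```python
import numpy as np

# Independent X_n ~ No(0, sigma_n^2) with sigma_n = 1/n, so sum sigma_n^2 < infinity.
# Theorem 3 says the random series sum_n X_n converges a.s.; watch one path stabilize.
rng = np.random.default_rng(2)
N = 10 ** 6
n = np.arange(1, N + 1)
X = rng.normal(0.0, 1.0 / n)          # X_n has standard deviation 1/n
partial = np.cumsum(X)

for m in [10, 100, 10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6]:
    print(f"sum of first {m:>7d} terms: {partial[m - 1]: .6f}")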

Theorem 4 (Kronecker $L_2$ SLLN) Let $\{X_n\} \subset L_2$ be an independent sequence, and let $\{b_n\} \subset \mathbb R_+$ be a monotone sequence increasing to infinity such that
$$\sum_n \mathbf V\Big(\frac{X_n}{b_n}\Big) < \infty.$$
Then
$$\frac{S_n - E S_n}{b_n} \to 0 \quad\text{a.s.}$$

In particular, for iid $\{X_n\} \subset L_2$ we may take $b_n = n$ to see that $\bar X_n \to \mu$ a.s.; more generally, Theorem 4 is a SLLN for independent but not necessarily identically-distributed $L_2$ random variables.

Proof. Again take $E[X_n] \equiv 0$, and set $Y_n := X_n/b_n$, with variance $\sigma_n^2$. By hypothesis $\sum_n \sigma_n^2 < \infty$, so $\sum_n Y_n$ converges a.s. by Theorem 3. Thus, for almost every $\omega$, $\sum_n X_n(\omega)/b_n$ converges and, by Lemma 2, also $S_n/b_n \to 0$.
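A sketch of Theorem 4 in action (assuming numpy; the variance growth $\mathbf V X_n = \sqrt n$ and the choice $b_n = n$ are my own illustration): here $\sum_n \mathbf V(X_n/n) = \sum_n n^{-3/2} < \infty$, so $S_n/n \to 0$ a.s. even though the variances grow without bound.

```python
import numpy as np

# Independent mean-zero X_n with V(X_n) = sqrt(n), so V(X_n / n) = n^(-3/2) is summable
# and Theorem 4 gives S_n / n -> 0 a.s. despite the growing, non-identical variances.
rng = np.random.default_rng(3)
N = 10 ** 6
n = np.arange(1, N + 1)
X = rng.normal(0.0, n ** 0.25)        # standard deviation n^(1/4), variance sqrt(n)
Sn_over_n = np.cumsum(X) / n

for m in [10 ** 2, 10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6]:
    print(f"S_n/n at n={m:>7d}: {Sn_over_n[m - 1]: .5f}")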

Kolmogorov’s L1 SLLN

The most-cited and most-used version of the Strong Law for iid sequences is that due to Kolmogorov, with no moment assumptions:

Theorem 5 (Kolmogorov's $L_1$ SLLN) Let $\{X_n\}$ be iid and set $S_n := \sum_{i\le n} X_i$. There exists $c \in \mathbb R$ such that
$$\bar X_n = S_n/n \to c \quad\text{a.s.}$$
if and only if $E|X_1| < \infty$, in which case $c = E X_1$.

This has the pleasant consequence that the usual estimators for the mean and variance of iid sequences are consistent:

Corollary 1

$$\{X_n\} \subset L_1 \quad\Longrightarrow\quad \bar X_n \to \mu \ \text{a.s.}$$
$$\{X_n\} \subset L_2 \quad\Longrightarrow\quad \frac1n\sum_{i\le n} (X_i - \bar X_n)^2 \to \sigma^2 \ \text{a.s.}$$
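A small check of Corollary 1 (a sketch assuming numpy; the Gamma(2,1) data, with $\mu = 2$ and $\sigma^2 = 2$, is an arbitrary choice): the running sample mean and the running value of $\frac1n\sum_{i\le n}(X_i - \bar X_n)^2$ both settle down near the population values.

```python
import numpy as np

# Consistency of the usual estimators for iid Gamma(shape=2, scale=1) data:
# mu = 2 and sigma^2 = 2, so Xbar_n -> 2 and (1/n) sum (X_i - Xbar_n)^2 -> 2 a.s.
rng = np.random.default_rng(4)
X = rng.gamma(shape=2.0, scale=1.0, size=10 ** 6)

for n in [10 ** 2, 10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6]:
    xbar = X[:n].mean()
    s2 = np.mean((X[:n] - xbar) ** 2)   # the (1/n) variance estimator in Corollary 1
    print(f"n={n:>7d}  mean={xbar:.4f}  variance={s2:.4f}")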


Here's a quick summary of some LLN facts:

I. Weak version, non-iid, $L_2$: $\mu_i = E X_i$, $\sigma_{ij} = E[X_i - \mu_i][X_j - \mu_j]$
   A. $Y_n = (S_n - \Sigma\mu_i)/n$ satisfies $E Y_n = 0$, $E Y_n^2 = \frac1{n^2}\Sigma_{i\le n}\sigma_{ii} + \frac2{n^2}\Sigma_{i<j\le n}\sigma_{ij}$;
   B. Chebychev: if $\sigma_{ii} \le M$ and $\sigma_{ij} \le 0$ for $i\ne j$, then $P[\,|S_n - \Sigma\mu_i| > n\epsilon\,] < Mn/n^2\epsilon^2 = M/n\epsilon^2 \to 0$.

II. Strong version, $L_2$ (same conditions; take $\mu_i = 0$ WLOG):
   A. Along the subsequence $n^2$:
      1. $P[\,|S_{n^2}| > n^2\epsilon\,] < M/n^2\epsilon^2$, so $\Sigma_n\, P[\,|S_{n^2}| > n^2\epsilon\,] < M\pi^2/6\epsilon^2 < \infty$
      2. Borel-Cantelli: $P[\,|S_{n^2}| > n^2\epsilon\ \text{i.o.}\,] = 0$, $\therefore \frac1{n^2}S_{n^2} \to 0$ a.s.
      3. $D_n := \max_{n^2\le k<(n+1)^2}|S_k - S_{n^2}|$, so $E D_n^2 \le 2n\, E|S_{(n+1)^2-1} - S_{n^2}|^2 \le 4n^2 M$
      4. Chebychev: $P[D_n > n^2\epsilon] < 4n^2 M/n^4\epsilon^2$, $\therefore D_n/n^2 \to 0$ a.s.
   B. $|S_k|/k \le \frac{|S_{n^2}| + D_n}{n^2} \to 0$ a.s. as $k\to\infty$, QED
      1. Bernoulli RVs, normal number theorem, Monte Carlo integration.

III. Weak version, pairwise-iid, $L_1$
   A. Equivalent sequences: If $\Sigma_n P[X_n \ne Y_n] < \infty$, then:
      1. $\Sigma_n |X_n - Y_n| < \infty$ a.s.
      2. $\Sigma_{i=1}^n X_i$ converges iff $\Sigma_{i=1}^n Y_i$ converges
      3. If $(\exists\, a_n \nearrow \infty)(\exists\, X) \ni \frac1{a_n}\Sigma_{i\le n} X_i \to X$, then $\frac1{a_n}\Sigma_{i\le n} Y_i \to X$ too.
   B. If $\{X_n\}$ are iid $L_1$ then $X_n$ and $Y_n := X_n \mathbf 1_{[|X_n|\le n]}$ are equivalent.

IV. Strong version, iid, $L_1$
   A. Kolmogorov: For $\{X_n\}$ IID, $(\exists\, c \ni \bar X_n \to c$ a.s.$) \Leftrightarrow (X_n \in L_1)$.

V. Counterexamples:
   A. Cauchy: $X_i \sim \frac{dx}{\pi[1+x^2]} \Rightarrow P[\,|S_n/n| \le \epsilon\,] \equiv \frac2\pi\tan^{-1}(\epsilon) \not\to 1$, WLLN fails.
   B. $P[X_i = \pm n] = \frac{c}{n^2}$, $n \ge 1$: $X_i \notin L_1$, and $S_n/n \not\to 0$ pr. or a.s.
   C. $P[X_i = \pm n] = \frac{c}{n^2\log n}$, $n > 1$: $X_i \notin L_1$, but $S_n/n \to 0$ pr. and not a.s.
   D. Medians: for ANY RVs $X_n \to X_\infty$ pr., $m_n \to m_\infty$ if the median $m_\infty$ is unique.


8.3 An example where the LLN fails: iid Cauchy RVs

Let Xi be iid standard Cauchy RVs, with

$$P[X_1 \le t] = \int_{-\infty}^t \frac{dx}{\pi[1+x^2]} = \frac12 + \frac1\pi\arctan(t)$$
and characteristic function (we'll learn more about these next week)
$$E\,e^{i\omega X_1} = \int_{-\infty}^\infty e^{i\omega x}\,\frac{dx}{\pi[1+x^2]} = e^{-|\omega|}.$$

The sample mean X¯n := Sn/n has characteristic function

$$E\,e^{i\omega S_n/n} = E\,e^{i\frac\omega n[X_1+\cdots+X_n]} = \big(E\,e^{i\frac\omega n X_1}\big)^n = \big(e^{-|\omega/n|}\big)^n = e^{-|\omega|}.$$
Thus $S_n/n$ also has the standard Cauchy distribution, with $P[S_n/n \le t] = \frac12 + \frac1\pi\arctan(t)$. In particular, $S_n/n$ does not converge almost surely, or even in probability.
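A simulation makes the failure vivid (a sketch assuming numpy; the sample sizes are arbitrary): the running mean of a Cauchy sample keeps jumping around no matter how large $n$ becomes, in contrast with the well-behaved examples earlier.

```python
import numpy as np

# Sample means of iid standard Cauchy variables do NOT settle down:
# S_n / n is again standard Cauchy for every n, so the running mean keeps jumping.
rng = np.random.default_rng(5)
X = rng.standard_cauchy(size=10 ** 6)
running_mean = np.cumsum(X) / np.arange(1, X.size + 1)

for n in [10 ** 2, 10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6]:
    print(f"n={n:>7d}  S_n/n = {running_mean[n - 1]: .4f}")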

8.4 A LLN for Correlated Sequences

In many applications we would like a Law of Large Numbers for sequences of random variables that are not independent. For example, in Monte Carlo integration we have a stationary Markov chain $\{X_t\}$ (this means that the distribution of $X_t$ is the same for all $t$ and that the conditional distribution of a future value $X_u$ for $u > t$, given the past $\{X_s : s \le t\}$, depends only on the present $X_t$) and want to estimate the population mean $E[\varphi(X_t)]$ for some function $\varphi(\cdot)$ by the sample mean
$$E[\varphi(X_t)] \approx \frac1T\sum_{t=0}^{T-1}\varphi(X_t).$$
Even though they are identically distributed, the random variables $Y_t := \varphi(X_t)$ won't be independent if the $X_t$ aren't independent, so the LLN we already have doesn't quite apply.

Theorem 6 If an $L_2$ sequence $\{Y_t\}$ has a constant mean $\mu = E Y_t$ and a summable autocovariance $\gamma_{st} := E(Y_s - \mu)(Y_t - \mu)$ that satisfies $\sum_{t=-\infty}^\infty |\gamma_{st}| \le c < \infty$ uniformly in $s$, then the random variables $\{Y_t\}$ obey a LLN:
$$E[Y_t] = \lim_{T\to\infty}\frac1T\sum_{t=0}^{T-1} Y_t.$$

Proof. Let $S_T$ be the sum of the first $T$ of the $Y_t$ and (as usual) set $\bar Y_T := S_T/T$. The variance of $S_T$ is

$$E[(S_T - T\mu)^2] = \sum_{s=0}^{T-1}\sum_{t=0}^{T-1} E[(Y_s - \mu)(Y_t - \mu)] \le \sum_{s=0}^{T-1}\sum_{t=-\infty}^{\infty}|\gamma_{st}| \le T c,$$

so $\bar Y_T$ has variance $\mathbf V[\bar Y_T] \le c/T$ and by Chebychev's inequality
$$P[\,|\bar Y_T - \mu| > \epsilon\,] \le \frac{E[(\bar Y_T - \mu)^2]}{\epsilon^2} = \frac{E[(S_T - T\mu)^2]}{T^2\epsilon^2} \le \frac{Tc}{T^2\epsilon^2} = \frac{c}{T\epsilon^2} \to 0 \quad\text{as } T\to\infty.$$
As in the iid case, this WLLN can be extended to a SLLN by first applying it along the sequence $\{T^2 : T\in\mathbb N\}$, applying B-C, then filling in the gaps.

A sequence of random variables $\{Y_t\}$ is called stationary if each $Y_t$ has the same probability distribution and, moreover, each finite set $(Y_{t_1+h}, Y_{t_2+h}, \cdots, Y_{t_k+h})$ has a joint distribution that doesn't depend on $h$. The sequence is called "$L_2$" if each $Y_t$ has a finite variance $\sigma^2$ (and hence also a well-defined mean $\mu$); by stationarity it also follows that the covariance

$$\gamma_{st} = E[(Y_s - \mu)(Y_t - \mu)]$$
is finite and depends only on the absolute difference $|t - s|$ (write: $\gamma_{st} = \gamma(s-t) = \gamma(t-s)$).

Corollary 2 If a stationary $L_2$ sequence $\{Y_t\}$ has a summable covariance function, i.e., satisfies $\sum_{t=-\infty}^\infty |\gamma(t)| \le c < \infty$, then
$$E[Y_t] = \lim_{T\to\infty}\frac1T\sum_{t=0}^{T-1} Y_t.$$
Note we didn't need full stationarity. It would be good enough for $\{Y_t\}$ to be "2nd-order stationary," i.e., to have a common mean $\mu = E Y_t$ and a covariance function $\gamma_{st} = E(Y_s - \mu)(Y_t - \mu) = \gamma(s-t)$ that depends only on the absolute difference $|s - t|$.
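Here is a sketch illustrating Corollary 2 (assuming numpy; the two-term moving average $Y_t = \mu + (Z_t + Z_{t-1})/2$ is my own choice of a stationary sequence with summable covariance, since $\gamma(h) = 0$ for $|h| > 1$): its sample means converge to $\mu$ even though adjacent terms are correlated.

```python
import numpy as np

# A stationary, correlated sequence with summable covariance: the moving average
# Y_t = mu + (Z_t + Z_{t-1})/2 with iid No(0,1) innovations Z_t.  Here gamma(0) = 1/2,
# gamma(+-1) = 1/4, and gamma(h) = 0 otherwise, so Corollary 2 gives Ybar_T -> mu.
rng = np.random.default_rng(6)
mu, T = 3.0, 10 ** 6
Z = rng.normal(0.0, 1.0, size=T + 1)
Y = mu + 0.5 * (Z[1:] + Z[:-1])

running_mean = np.cumsum(Y) / np.arange(1, T + 1)
for t in [10 ** 2, 10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6]:
    print(f"T={t:>7d}  Ybar_T = {running_mean[t - 1]:.4f}")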

8.4.1 Examples

1. IID: If $X_t$ are independent and identically distributed, and if $Y_t = \varphi(X_t)$ has finite variance $\sigma^2$, then $Y_t$ has a well-defined finite mean $\mu$ and $\bar Y_T \to \mu$. Here
$$\gamma_{st} = \begin{cases}\sigma^2 & \text{if } s = t\\ 0 & \text{if } s \ne t,\end{cases}$$
so $c = \sigma^2 < \infty$.

2. AR$_1$: If $Z_t$ are iid No$(0,1)$ for $-\infty < t < \infty$, $\mu \in \mathbb R$, $\sigma > 0$, and $-1 < \rho < 1$, and
$$X_t := \mu + \sigma\sum_{s=0}^\infty \rho^s Z_{t-s} = \rho X_{t-1} + \alpha + \sigma Z_t, \qquad (*)$$
where $\alpha = (1-\rho)\mu$, then the $X_t$ are identically distributed (all with the No$\big(\mu, \frac{\sigma^2}{1-\rho^2}\big)$ distribution) but not independent (since $\gamma_{st} = \frac{\sigma^2}{1-\rho^2}\rho^{|s-t|} \ne 0$ for $s \ne t$). This is called an "autoregressive process" (because of equation (*), expressing $X_t$ as a regression on previous


$X_s$'s) of order one (because only one earlier $X_s$ appears in (*)), and is about the simplest non-iid sequence occurring in applications. Since the covariance is summable,
$$\sum_{t=-\infty}^\infty |\gamma(t)| = \frac{\sigma^2}{1-\rho^2}\cdot\frac{1+|\rho|}{1-|\rho|} = \frac{\sigma^2}{(1-|\rho|)^2} < \infty,$$
we again have $\bar X_T \to \mu$ a.s. as $T\to\infty$.

3. Geometric Ergodicity: If for some $0 < \rho < 1$ and $c > 0$ we have $|\gamma_{st}| \le c\rho^{|s-t|}$ for a Markov chain $Y_t$, the chain is called geometrically ergodic (because $c\rho^t$ is a geometric sequence), and the same argument as for AR$_1$ shows that $\bar Y_t$ converges. Meyn & Tweedie (1993), Tierney (1994), and others have given conditions for MCMC chains to be geometrically ergodic, and hence for the almost-sure convergence of sample to population means. The mean squared error MSE $:= E|\bar Y_t - \mu|^2$ for a geometrically ergodic sequence is bounded by MSE $\le \frac{1+\rho}{1-\rho}(\sigma^2/t) \asymp 1/t$, but the constant grows without bound as $\rho \to 1$. Irreducible aperiodic finite-state Markov chains are geometrically ergodic, with $\rho$ the second-largest eigenvalue of the one-step transition matrix.

4. General Ergodicity: Consider the three sequences of random variables on $(\Omega, \mathcal F, P)$ with $\Omega = (0,1]$ and $\mathcal F = \mathcal B(\Omega)$, each with $X_0(\omega) = \omega$:
(a) $X_{n+1} := 2X_n$ (mod 1);

(b) $X_{n+1} := X_n + \alpha$ (mod 1) (Does it matter if $\alpha$ is rational?);
(c) $X_{n+1} := 4X_n(1 - X_n)$.

For each there exists a probability measure $P$ (a distribution for $X_0$) such that the $X_n$ are identically distributed — the Un$(0,1)$ distribution for (a) and (b), and the "arcsin" distribution Be$(\frac12, \frac12)$ for (c) — and each is of the form $X_n = X_0 \circ T^n(\omega)$ for a measure-preserving transformation $T:\Omega\to\Omega$, a measurable mapping for which $P(E) = P\big(T^{-1}(E)\big)$ for each $E \in \mathcal F$. Such a transformation is called ergodic if the only $T$-invariant events (those that satisfy $E = T^{-1}(E)$) are "almost trivial" in the sense that $P[E] = 0$ or $P[E] = 1$. A sequence $X_n := X_0 \circ T^n$ is called ergodic if $T$ is. Each of the sequences (a)–(c) is ergodic (can you show that?).

Birkhoff's Ergodic Theorem asserts that, if $X_n \in L_1(\Omega, \mathcal F, P)$ is an integrable ergodic sequence, then $\bar X_n$ converges almost-surely to a $T$-invariant limit $X_\infty$ as $n\to\infty$. Since only constants are $T$-invariant for ergodic sequences, it follows that $\bar X_n \to \mu := E X_n$ a.s. for ergodic sequences. The conditions here are weaker than those for the usual LLN; in all three cases above, for example, all ergodic, each $X_n$ is completely determined by $X_0$ so there is complete dependence, with $\sigma(X_n) \subset \sigma(X_m) \subset \sigma(X_0)$ for all $0 \le m \le n$!

For any $L_1$ distribution $\mu(dx)$ on $(\mathbb R, \mathcal B)$, we can construct iid random variables $\{X_n\} \stackrel{\text{iid}}\sim \mu(dx)$ on the product space $\Omega = \mathbb R^\infty$, $\mathcal F = \mathcal B^\infty$, $P = \mu^{\otimes\mathbb N}$, and a measure-preserving transformation $T:\Omega\to\Omega$ called the left-shift, by
$$X_n(\omega_1,\omega_2,\omega_3,\cdots) := \omega_n, \qquad T(\omega_1,\omega_2,\omega_3,\cdots) := (\omega_2,\omega_3,\omega_4,\cdots).$$
The $\sigma$-algebra $\mathcal T = \{A : A = T^{-1}(A)\}$ of $T$-invariant sets is contained in the tail $\sigma$-algebra for the independent random variables $\{X_n\}$, so by Kolmogorov's zero-one law $\mathcal T$ is almost-trivial


and so $T$ is ergodic. It follows from Birkhoff's Ergodic Theorem that the sample mean $\bar X_n := \frac1n\sum_{j=1}^n X_j$ converges almost-surely to a $T$-invariant and hence almost-surely constant random variable whose value must be $\mu$, proving a strong LLN for iid random variables that assumes only $L_1$:

Theorem 7 ($L_1$ iid SLLN) Let $\{X_n\}$ be iid $L_1(\Omega, \mathcal F, P)$ random variables with mean $\mu = E[X_n]$. Set $S_n := \sum_{j\le n} X_j$ and $\bar X_n := S_n/n = \frac1n\sum_{j\le n} X_j$. Then:

$$P[\bar X_n \to \mu] = 1.$$
Thus the "space average" $\int_{\mathbb R} x\,\mu(dx)$ and the limiting "time average" $\lim_{n\to\infty}\bar X_n$ coincide for ergodic sequences.
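As a numerical companion to examples (b) and (c) above, here is a sketch (assuming Python/numpy; example (a), the doubling map, is omitted because in binary floating point it collapses to 0 after roughly 53 iterations). Birkhoff's theorem predicts that both time averages below should approach the space average $E X = \frac12$: the mean of Un$(0,1)$ for the rotation, and the mean of Be$(\frac12,\frac12)$ for the logistic map.

```python
import numpy as np

# Time averages for two of the ergodic maps above, started from an arbitrary X_0:
#   (b) rotation  X_{n+1} = X_n + alpha (mod 1), alpha irrational, invariant law Un(0,1)
#   (c) logistic  X_{n+1} = 4 X_n (1 - X_n),     invariant law Be(1/2, 1/2) ("arcsin")
# Birkhoff's theorem predicts both time averages tend to the space average E[X] = 1/2.
# (The doubling map (a) is omitted: float64 round-off drives it to 0 in ~53 steps.)
N = 10 ** 6
alpha = np.sqrt(2.0) - 1.0            # an irrational rotation angle

x_rot, x_log = 0.3, 0.3               # arbitrary starting points
sum_rot = sum_log = 0.0
for _ in range(N):
    sum_rot += x_rot
    sum_log += x_log
    x_rot = (x_rot + alpha) % 1.0
    x_log = 4.0 * x_log * (1.0 - x_log)

print(f"rotation time average: {sum_rot / N:.4f}   (space average 0.5)")
print(f"logistic time average: {sum_log / N:.4f}   (space average 0.5)")
```

Complete dependence on $X_0$ does not spoil the convergence of the time averages, which is exactly the point of the ergodic theorem.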

Last edited: October 25, 2017
