
Summer Workshop on Distribution Theory & its Summability Perspective

[Cover figure: a histogram of the drink mix distribution — frequencies (0 to 50) plotted against the amount of drink mix in ounces (16 to 22), with the center of mass marked at 18.9.]

M. Kazım Khan Kent State University (USA)

Place: Ankara University, Department of Mathematics

Dates: 16 May - 27 May 2011

Supported by: The Scientific and Technical Research Council of Turkey (TÜBİTAK)

Preface

This is a collection of the lecture notes I gave at Ankara University, Department of Mathematics, during the last two weeks of May 2011. I am grateful to Professor Cihan Orhan and the Scientific and Technical Research Council of Turkey (TÜBİTAK) for the invitation and their support. The primary focus of the lectures was to introduce the basic components of distribution theory and to bring out the role that summability theory plays in it. I did not assume any prior knowledge of probability on the part of the participants. Therefore, the first few lectures are completely devoted to building the language of probability and distribution theory. These tools are then used freely in the rest of the lectures. To save some time, I did not prove most of these results. A few lectures then deal with Fourier inversion theory, specifically from the summability perspective. The next batch consists of convergence concepts, where I introduce the weak and the strong laws of large numbers. Again, long proofs were omitted. A notable exception deals with the results that involve uniformly integrable sequence spaces. Since this is a new concept from the summability perspective, I have tried to sketch some of those proofs. I must acknowledge the legendary Turkish hospitality of all the people I came to meet. As always, it was a pleasure visiting Turkey and I hope to have the chance to visit again.

Mohammad Kazım Khan, Kent State University, Kent, Ohio, USA.

List of Participants

1- AYDIN, Didem, Ankara Üniversitesi
2- AYGAR, Yelda, Ankara Üniversitesi
3- AYKOL, Canay, Ankara Üniversitesi
4- BAŞCANBAZ TUNCA, Gülen, Ankara Üniversitesi
5- CAN, Çağla, Ankara Üniversitesi
6- CEBESOY, Şerifenur, Ankara Üniversitesi
7- COŞKUN, Cafer, Ankara Üniversitesi
8- ÇETİN, Nursel, Ankara Üniversitesi
9- DÖNE, Yeşim, Ankara Üniversitesi
10- ERDAL, İbrahim, Ankara Üniversitesi
11- GÜREL, Övgü, Ankara Üniversitesi
12- İPEK, Pembe, Ankara Üniversitesi
13- KATAR, Deniz, Ankara Üniversitesi
14- ORHAN, Cihan, Ankara Üniversitesi
15- SAKAOĞLU, İlknur, Ankara Üniversitesi
16- SOYLU, Elis, Ankara Üniversitesi
17- ŞAHİN, Nilay, Ankara Üniversitesi
18- TAŞ, Emre, Ankara Üniversitesi
19- ÜNVER, Mehmet, Ankara Üniversitesi
20- YARDIMCI, Şeyhmus, Ankara Üniversitesi
21- YILMAZ, Başar, Ankara Üniversitesi
22- YURDAKADİM, Tuğba, Ankara Üniversitesi

Contents

Preface 3
List of Participants 5
Contents 7
List of Figures 9

1 Modeling Distributions 1
  1.1 Distributions 1
  1.2 Probability Space & Random Variables 8
2 Probability Spaces & Random Variables 11
3 Expectations 21
  3.1 Properties of Lebesgue Integral 22
  3.2 Covariance 23
4 Various Inequalities 27
  4.1 Hölder & Minkowski's Inequalities 28
  4.2 Jensen's Inequality 30
5 Classification of Distributions 35
  5.1 Absolute Continuity & Singularity 41
6 Conditional Distributions 49
  6.1 Conditional Expectations 52
7 Conditional Expectations & Martingales 57
  7.1 Properties of E(X | Y) 57
  7.2 Martingales 59
8 Independence & Transformations 63
  8.1 Transformations of Random Variables 63
  8.2 Sequences of Independent Random Variables 68
  8.3 Generating Functions 70
9 Ranks, Order Statistics & Records 75
10 Fourier Transforms 83
  10.1 Examples 83
11 Summability Assisted Inversion 89
12 General Inversion 97
  12.1 Fourier & Dirichlet Series 99
13 Basic Limit Theorems 107
  13.1 Convergence in Distribution 108
  13.2 Convergence in Probability & WLLN 111
14 Almost Sure Convergence & SLLN 117
15 The Lp Spaces & Uniform Integrability 127
  15.1 Uniform Integrability 132
16 Laws of Large Numbers 141
  16.1 Subsequences & Kolmogorov Inequality 142
17 WLLN, SLLN & Uniform SLLN 151
  17.1 Glivenko-Cantelli Theorem 163
18 Random Series 169
  18.1 Zero-One Laws & Random Series 169
  18.2 Refinements of SLLN 175
19 Kolmogorov's Three Series Theorem 183
20 The Law of Iterated Logarithms 189

List of Figures

1.1 A Histogram for the Drink Mix Distribution 4
1.2 Inverse Image of an Interval 8
8.1 Inverse of a Distribution Function 66
11.1 Triangular Density 95
12.1 Dirichlet Kernels for n = 5 and n = 8 101
12.2 Fejér Kernels for T = 5 and T = 8 103
12.3 Poisson Kernels for r = 0.8 and r = 0.9 105
14.1 Density of Random Harmonic Series 124

Lecture 1

Modeling Distributions

A phenomenon, when repeatedly observed, gives rise to a distribution. In other words, a distribution is our way of capturing the variability in the phenomenon. Such distributions arise in almost all fields of endeavor. In the social sciences they are used to keep tabs on social indicators; in finance they are used to study and quantify the financial health of corporations and to price various assets and derived securities such as options and bonds. Data distributions appear in statistics. In mathematics, distributions of zeros of orthogonal polynomials appear, and the distribution of primes is a fundamental entity. In the natural sciences, about a century and a half ago Maxwell conjured up a distribution to describe the speed of molecules in an ideal gas, which was later observed to be quite accurate. Genetic diversity and its quantification are still in their infancy in terms of discovering the underlying distributions they hide.

In this chapter we will collect the tools that are quite effective in studying distributions. We will present the following basic notions:

• Some examples of distributions,
• A framework by which distributions can be modeled,
• Transforms of distributions, such as moment generating functions and characteristic functions,
• Conditional probabilities and conditional expectations.

These results will be used in the remainder of the book.

1.1 Distributions

Any characteristic, when repeatedly measured, yields a collection of measured/collected responses. The word "variable" is used for the characteristic that is being measured, since it may vary from measurement to measurement. The collection of all the measured responses is called the "distribution" of the variable. Sometimes the word "data" is also used to refer to the distribution of the variable. Distributions may be real or just imagined entities. Here we collect a few examples of distributions of the following sorts to show their vast diversity.

(i) Distributions arising while measuring mass produced products.
(ii) Distributions arising in categorical populations.
(iii) Distribution of Stirling numbers.
(iv) Distribution of zeros of orthogonal polynomials.
(v) Distributional convergence of summability theory.
(vi) Distributions of eigenvalues of Toeplitz matrices.
(vii) Maxwell's law of ideal gas.
(viii) Distribution of primes.
(ix) The Feynman-Kac formula and partial differential equations.

Of course, this is just a tiny sample of topics from an enormous field. One obvious omission is the field of Schwartz's distributions. This is purely because there are excellent books on the subject.¹ We will, however, briefly visit this branch while discussing summability assisted Fourier inversion theory.

Example - 1.1.1 - (Measurement distributions — accuracy of automatic filling machines) Kountry Times makes 20 ounce cans of lemonade drink mix. Due to unknown random fluctuations, the actual fill weight of each can is rarely equal to 20 oz. Here is a collection of fill weights of 200 randomly chosen cans.

18.3 19.4 18.8 19.6 19.8 17.7 18.2 20.1 17.2 18.8 19.0
18.6 18.0 18.9 19.1 17.2 17.3 19.4 18.6 20.5 20.8 19.9
18.7 16.7 19.2 18.8 18.3 18.3 18.3 17.9 18.2 17.5 17.6
19.7 20.5 19.5 18.6 19.9 19.3 18.5 19.9 18.7 20.3 19.2
18.9 18.6 19.4 18.7 18.5 19.2 17.3 18.0 17.7 19.2 19.1
18.8 18.3 21.0 18.0 18.9 19.9 21.4 18.8 19.0 18.9 18.7
18.9 19.2 17.6 20.0 19.5 19.4 18.3 19.9 18.4 18.3 18.6
19.4 17.7 18.8 17.8 19.2 18.6 20.2 19.0 18.3 18.3 19.0
18.4 19.4 19.4 17.9 19.2 18.5 17.7 19.3 19.0 16.7 18.3
19.7 18.8 19.4 20.3 18.3 18.6 19.4 18.4 18.6 19.1 18.0
18.8 18.3 18.7 19.1 17.8 17.5 17.0 19.4 19.2 19.8 18.6
17.7 17.9 19.1 18.2 19.5 19.6 20.4 20.7 19.8 18.9 19.2
17.8 21.0 17.5 17.9 18.5 21.1 19.8 18.3 20.2 17.4 18.8
18.5 19.7 19.0 18.3 19.3 18.8 18.1 17.8 19.1 20.1 19.9
21.0 17.9 18.3 17.1 18.7 18.5 19.1 17.6 20.4 19.2 19.2
20.2 17.4 18.4 18.9 18.4 18.8 18.3 19.8 18.7 19.1 20.4
18.7 18.9 18.0 20.7 20.8 19.9 20.6 19.2 18.4 18.5 18.5
18.4 19.9 17.9 19.4 19.2 20.4 19.7 17.5 19.0 17.9 18.4
19.7 19.1

In this example, the feature being measured is the fill weight (measured in ounces). We see an unexpectedly large amount of variability. The issue is: "Does the distribution say anything about whether the advertised average fill weight of 20 oz is being met or not?"

¹For instance, see ...

The average, or the mean, and the variance of this collected distribution, now denoted as x_1, x_2, ..., x_n, are

\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^n x_i = 18.940,

S^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2 = \frac{\sum_{i=1}^n x_i^2 - n\bar{x}^2}{n-1} = 0.8574.

The mean gives a feeling that the automatic filling machine may be malfunctioning. Since the units of the variance are squared, we work with its positive square root, S, called the standard deviation. For our fill weight distribution, S = 0.926 oz. The standard deviation typically gives a scale by which we can gauge the "width" of the distribution. Typically, plus/minus three times the standard deviation around the mean contains most of the values of the distribution. Note that the smallest value of our data distribution is 16.7 oz and the largest value is 21.4 oz. In this case all the values of the distribution lie within 3S of the mean.

Of course, knowing this distribution of 200 observations is only partially interesting. The real aim is to conclude something about the source of these 200 observations, called the population distribution, which is a mythical entity and represents how the automatic filling machine behaves. To get a feel for, and then model, the shape of the source distribution we resort to figures. We make some groups, also called bins, say J_1 = (16.5, 17.0], J_2 = (17.0, 17.5], etc., and count the number of observations that fall into these bins. Dividing the frequencies by the total number of observations gives the relative frequency distribution, which does not change the shape. A plot of this frequency distribution is called a histogram of the distribution.

[Figure 1.1: A Histogram for the Drink Mix Distribution.]

One tends to observe such bell shaped histograms. The superimposed curve is called a normal curve and is proportional to

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( \frac{-1}{2\sigma^2}(x-\mu)^2 \right), \qquad -\infty < x < \infty.
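The same summary statistics and bin frequencies can be computed directly. The short sketch below is added purely for illustration and is not part of the original notes; it uses only the first ten fill weights in place of the full 200 observations, so its numerical output differs from the values quoted above.

```python
import numpy as np

# First ten fill weights (ounces) from Example 1.1.1; substitute the full data set.
weights = np.array([18.3, 19.4, 18.8, 19.6, 19.8, 17.7, 18.2, 20.1, 17.2, 18.8])

xbar = weights.mean()            # sample mean
s2 = weights.var(ddof=1)         # sample variance with divisor n - 1
s = np.sqrt(s2)                  # sample standard deviation

# Frequencies over bins of width 0.5 starting at 16.5.
# (np.histogram uses left-closed bins, which is adequate for this illustration.)
bins = np.arange(16.5, 21.6, 0.5)
freq, _ = np.histogram(weights, bins=bins)

print(xbar, s2, s)
print(dict(zip([f"{a:.1f}-{a + 0.5:.1f}" for a in bins[:-1]], freq)))
```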

The grouped frequencies for the fill weight data begin as follows.

Fill Weights    Frequency    Relative Frequency
16.5 − 17.0     2            0.010
17.0 − 17.5     8            0.040
17.5 − 18.0     23           0.115
18.0 − 18.5     33           0.165
18.5 − 19.0     44           0.220
...             ...          ...

Most measurement type distributions are mathematically modeled by a normal curve.³ Symbolically we denote this by X ∼ N(µ, σ²), where X represents the fill weight of a randomly chosen can. The letter N stands for the word "normal distribution", µ is the center or mean, and σ is the standard deviation. In words, the modeled density describes where a randomly selected can's fill weight will fall. The histogram reflects the empirical evidence for our model. The quantity P(a < X ≤ b) = ∫_a^b f(x) dx then models the proportion of cans whose fill weight falls in the bin (a, b].

³Normal distributions are also called Gaussian distributions, since the German mathematician and astronomer Carl Friedrich Gauss (1777–1855) showed their importance as models of measurement errors in celestial objects.

Example - 1.1.2 - Consider a presidential election between two candidates, Bush and Gore, in which roughly 80 million votes are cast for Bush and 85 million for Gore. Listing every single vote gives the collection

\underbrace{\text{Bush, Bush, } \cdots \text{, Bush}}_{80 \text{ million}}, \; \underbrace{\text{Gore, Gore, } \cdots \text{, Gore}}_{85 \text{ million}}.

This is a very large categorical distribution. However, a very simple way to represent this distribution, without losing any information, is to write it in its frequency format: write B (for Bush) once and put its frequency next to it, and write G (for Gore) once and put its frequency next to it. We may code the categories (letters) B, G by numbers if we like. For instance, denoting B by 0 and G by 1, we may write the distribution of our coded variable, say X, as

Values of X     0              1
Frequencies     80,000,000     85,000,000

The resulting relative frequency distribution is specified by the proportion

p = \frac{85{,}000{,}000}{80{,}000{,}000 + 85{,}000{,}000} = \frac{85}{165} = 0.51515,

where X denotes the preference of an individual, represented in the coded form of 0 or 1. Note that the population mean is p and the population variance is p(1 − p).

Example - 1.1.3 - (Distribution of zeros of orthogonal polynomials) So far we talked about data distributions and their sources. As mentioned in the beginning, distributions appear everywhere. Consider a sequence of polynomials p_n(x), n = 0, 1, 2, ..., where p_0(x) ≡ 1 and p_n(x) is of degree n. Suppose there exist real constants a_n, b_n, with a_n > 0 for all n ≥ 0, such that

a_{n+1}\, p_{n+1}(x) + (b_n - x)\, p_n(x) + a_n\, p_{n-1}(x) \equiv 0, \qquad n \ge 1.

There is a result of Favard which says that each p_n(x) has exactly n distinct real zeros, which we denote as

x_{1n} < x_{2n} < \cdots < x_{nn}.

The issue is: what are they? Having these zeros gives us an extremely fast numerical integration method (called Gaussian quadrature), among other benefits. If it is the case that, for every ε > 0,

\frac{1}{n+1}\, \#\{k \le n : |a_k - a| \ge \varepsilon\} \to 0, \qquad \frac{1}{n+1}\, \#\{k \le n : |b_k - b| \ge \varepsilon\} \to 0,

then, in the language of summability theory, {a_k} is Cesàro-statistically convergent to a and {b_k} is Cesàro-statistically convergent to b. If a > 0, then the histogram of the distribution of the zeros of p_n(x) is approximately the curve f(x). This is just the tip of the iceberg. Much more can be deduced in much more general settings.
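To make the notion of Cesàro-statistical convergence in Example 1.1.3 concrete, here is a small illustrative sketch (added here, not part of the original notes): a sequence that equals 5 on the sparse set of perfect squares and 1/k elsewhere does not converge, yet the proportion of indices where it stays away from 0 tends to 0, so it is Cesàro-statistically convergent to 0. The threshold eps and the cutoffs n are arbitrary choices for the demonstration.

```python
import math

def x(k):
    # 5 on perfect squares (a sparse "bad" set), 1/k elsewhere
    r = math.isqrt(k)
    return 5.0 if r * r == k else 1.0 / k

def exceedance_density(n, eps=0.1, limit=0.0):
    """Proportion of indices k <= n with |x_k - limit| >= eps."""
    bad = sum(1 for k in range(1, n + 1) if abs(x(k) - limit) >= eps)
    return bad / n

for n in [100, 1_000, 10_000, 100_000]:
    print(n, exceedance_density(n))
# The printed proportions shrink roughly like 1/sqrt(n), so the sequence is
# statistically convergent to 0 even though x_k itself does not converge.
```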

Example - 1.1.4 - (Convergence and summability) A matrix summability method consists of numbers a_{nk}, n, k = 0, 1, 2, ..., arranged in a matrix form, A = [a_{nk}]. Such a matrix is constructed with the aim of converting a nonconvergent sequence, x_0, x_1, ..., into a convergent one. In other words, if

y_n := \sum_{k=0}^{\infty} a_{nk}\, x_k, \qquad n = 0, 1, 2, \ldots,

then our hope is that y_n should converge. However, when (x_k) is itself convergent to some number ℓ, then we insist that (y_n) should also be convergent to the same ℓ. A matrix A = [a_{nk}] which has this "convergence reproducing" property is called regular. To handle the kind of examples we will present, we need a bit more general concept that allows x = (x_{kn}) to be a matrix as well, and the x_{kn} need not be numbers but could be functions. When x_{kn} = x_k for all n, we revert to classical summability. There are four notions of convergence.

(i) Let y_n = \sum_{k=0}^{\infty} x_{kn}\, a_{nk} be defined for all n, called the A-transform of x. We say that x is A-summable to α if y_n → α. This notion can be extended to the case when the x_{kn} and α lie in a normed linear space.

(ii) If the x_{kn} are real, let F(t) be a distribution, i.e., a nondecreasing right continuous function with F(−∞) = 0 and F(+∞) = 1. We say x is A-distributionally convergent to F if for all t at which F is continuous we have

\lim_{n \to \infty} \sum_{k :\, x_{kn} \le t} a_{nk} = F(t).

This notion can be extended to higher dimensional forms when both x_{kn} and t are d-dimensional vectors.

(iii) We say x = (x_{kn}) is A-statistically convergent to α if for every ε > 0 we have

\lim_{n \to \infty} \sum_{k :\, |x_{kn} - \alpha| > \varepsilon} a_{nk} = 0.

This notion can be extended to the case when the x_{kn} and α lie in a topological space. Example 1.1.3 uses this notion of convergence for the {a_k} and {b_k} sequences, with the matrix A being the Cesàro matrix.

(iv) We say x = (x_{kn}) is A-strongly convergent to α if

\lim_{n \to \infty} \sum_{k=0}^{\infty} |x_{kn} - \alpha|\, a_{nk} = 0.

This notion can be extended to the case when the x_{kn} and α lie in a metric space.

Example - 1.1.5 - (Distribution of primes) Let π(n) be the prime counting function. That is, π(n) is the total number of primes that lie in the interval (0, n]. Gauss, as a teenager, conjectured that

\pi(n) \sim \frac{n}{\ln n}.

The prime number theorem says that

\pi(n) \sim \mathrm{Li}(n) := \int_2^n \frac{1}{\ln x}\, dx \sim n \sum_{j=0}^{\infty} \frac{j!}{(\ln n)^{j+1}}.

This was proved by both Hadamard and de la Vallée Poussin in 1896, by showing that the Riemann zeta function ζ(z) has no zeros of the type 1 + it. Hardy and Wright's⁴ book provides more details.

In 1914 Littlewood⁵ showed that π(n) − Li(n) is positive and negative infinitely often. Since Li(n) ∼ n/ln n + n/(ln n)² + 2n/(ln n)³ + ···, Chebyshev asked about the behavior of the ratio

X_n := \frac{\pi(n)}{n/\ln n}, \qquad n = 1, 2, \ldots.

Chebyshev showed that 7/8 < liminf_n X_n ≤ limsup_n X_n < 9/8. The recent book of Havil⁶ shows that if lim_n X_n exists, then lim_n X_n = 1. As evidence of the deep roots of π(x), the Riemann hypothesis is equivalent to the statement

|\pi(n) - \mathrm{Li}(n)| = O\!\left((\ln n)\sqrt{n}\right).

For more, see Ingham.⁷

Example - 1.1.6 - Of course there are many more examples, such as

(i) Asymptotic normality of the Stirling numbers.
(ii) Distributions of eigenvalues of Toeplitz matrices.
(iii) Distributions of eigenvalues of random matrices — Wigner's law.
(iv) Maxwell's law of ideal gas — the distribution of the speed of molecules in an ideal gas.
(v) The Feynman-Kac formula. Solutions of many types of PDEs and Brownian motion go hand in hand, the poster child being the heat equation. This link has found an unexpected admirer, namely the financial industry, since it ties very nicely into the price of various derived financial securities, such as call and put and many other options.

The above examples give a glimpse of the importance of the concept of a distribution. Probability theory provides the ideal language to express the concepts of distribution theory. Therefore we start off by building some basic structures of probability theory.

⁴Hardy, G. H. and Wright, E. M. An Introduction to the Theory of Numbers, 5th ed. Oxford University Press, 1979.
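A quick numerical look at Example 1.1.5 (an illustrative sketch, not part of the original notes): the function below counts primes with a simple sieve and compares π(n) with n/ln n and with a numerical value of Li(n) obtained by integrating 1/ln x on a fine grid; the grid size is an arbitrary choice.

```python
import math
import numpy as np

def prime_pi(n):
    """Count primes in (0, n] with a simple sieve of Eratosthenes."""
    sieve = np.ones(n + 1, dtype=bool)
    sieve[:2] = False
    for p in range(2, math.isqrt(n) + 1):
        if sieve[p]:
            sieve[p * p :: p] = False
    return int(sieve.sum())

def li(n, points=100_000):
    """Numerical Li(n) = integral from 2 to n of dx / ln x."""
    x = np.linspace(2.0, float(n), points)
    return float(np.trapz(1.0 / np.log(x), x))

for n in [10**3, 10**4, 10**5, 10**6]:
    print(n, prime_pi(n), round(n / math.log(n)), round(li(n)))
# pi(n) stays much closer to Li(n) than to n/ln n, in line with the
# prime number theorem quoted above.
```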

1.2 Probability Space & Random Variables

Our aim is to construct a mathematical structure to house the concept of a distribution. Distributions always describe some features of some variables. Since variables may have random components in them, distributions are often linked to probability theory through a concept called a random variable, a random variable being a function defined over a probability space. To see what we need, consider the following diagram, in which ω belongs to some abstract set, Ω, shown as the horizontal axis for convenience.

[Figure 1.2: Inverse Image of an Interval — the event A ⊆ Ω consists of those ω for which X(ω) lands in the bin J = (a, b].]

To connect to the histogram of X, if J = (a, b] is any bin, the area of the rectangle over it represents the "size" of the event {ω : a < X(ω) ≤ b}; this event is the inverse image A = X⁻¹(J) = {ω ∈ Ω : X(ω) ∈ J}.
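A tiny concrete illustration of inverse images (a sketch added here, not part of the original notes): for a finite Ω the event X⁻¹((a, b]) is just the set of outcomes whose X-value falls in the bin, and, under equally likely outcomes, its "size" is the proportion of such outcomes. The two-dice sample space and the bin (6, 8] are arbitrary choices.

```python
# Outcomes of rolling two fair dice; X is the sum of the faces.
omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]

def X(w):
    return w[0] + w[1]

def inverse_image(a, b):
    """The event X^{-1}((a, b]) = {w in Omega : a < X(w) <= b}."""
    return [w for w in omega if a < X(w) <= b]

A = inverse_image(6, 8)             # outcomes with sum 7 or 8
print(len(A), len(A) / len(omega))  # 11 outcomes, probability 11/36
```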

Since the bins can be combined, we also need

X^{-1}\left(\bigcup_i J_i\right) = \bigcup_i X^{-1}(J_i) = \bigcup_i A_i.

In general, ∪_i A_i ∈ E whenever the A_i ∈ E. In particular, since R = ∪_i(−i, i], we insist that Ω = ∪_i X⁻¹((−i, i]) ∈ E.

(iii) Since Jᶜ = (−∞, a] ∪ (b, ∞) is a union of some other J's, we insist that Aᶜ = X⁻¹(Jᶜ) should also be in our collection. More generally, if A ∈ E then Aᶜ ∈ E.

(iv) The concept of size should be defined for all A ∈ E. Furthermore, the concept of size should respect disjointness. That is, if A₁, A₂, ... ∈ E are pairwise disjoint, then their individual sizes should add up to the size attached to their union ∪_i A_i.

Around 1930, A. N. Kolmogorov realized that all of the above requirements were part and parcel of the then newly discovered Lebesgue measure and integration theory. His 1933 book on the foundations of probability theory, detailing this, is now a classic. Let us collect and freeze these notions for our future use.

Definition - 1.2.1 - (Probability space) A probability space (Ω, E, P) consists of the following items.

• Ω = the set of all possible outcomes of an experiment, also called the sample space.

• E = the set of all subsets of Ω for which a (probability) function (measure) P can be defined. Each member of E is called an event (or a measurable set). The collection of all events, E, must obey the conditions
  (i) Ω ∈ E,
  (ii) if A ∈ E then Aᶜ ∈ E,
  (iii) if A₁, A₂, ... ∈ E then ∪_i A_i ∈ E.
Any collection of subsets of the space Ω that obeys the above conditions is called a sigma field.

• The probability measure, P, is a real valued function over E, with the following requirements:
  (i) P(Ω) = 1,
  (ii) 0 ≤ P(A) ≤ 1 for any A ∈ E,
  (iii) if A₁, A₂, ... ∈ E are disjoint, then P(∪_i A_i) = Σ_i P(A_i).

Remark - 1.2.1 - Note that the definition of P is tied to the collection of events (sigma field). Condition (i) of the definition of a sigma field is needed for condition (i) of the definition of P to avoid logical inconsistencies. The same is the case with the third conditions of the two concepts.

Exercise - 1.2.1 - Let (Ω, E, P) be a probability space for a random experiment. Show that P satisfies the following properties for any A, B ∈ E:
  (i) P(∅) = 0,
  (ii) P(Aᶜ) = 1 − P(A),
  (iii) P(Aᶜ ∩ B) = P(B) − P(A ∩ B),
  (iv) P(A ∪ B) = P(A) + P(B) − P(A ∩ B),
  (v) if A ⊆ B, then P(A) ≤ P(B).
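For a finite sample space with equally likely outcomes, the properties listed in Exercise 1.2.1 can be checked directly. The following sketch is added for illustration and is not part of the original notes; the two events A and B inside the two-dice space used above are arbitrary choices.

```python
from fractions import Fraction

omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]

def P(event):
    """Equally likely outcomes: P(A) = |A| / |Omega|."""
    return Fraction(len(event), len(omega))

A = {w for w in omega if w[0] + w[1] == 7}   # sum equals 7
B = {w for w in omega if w[0] == 6}          # first die shows 6
Ac = set(omega) - A

# Properties (ii)-(iv) of Exercise 1.2.1:
assert P(Ac) == 1 - P(A)
assert P(Ac & B) == P(B) - P(A & B)
assert P(A | B) == P(A) + P(B) - P(A & B)
print(P(A), P(B), P(A | B))   # 1/6, 1/6, 11/36
```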

Lecture 2

Probability Spaces & Random Variables

Remark - 2.0.2 - (Is P a continuous function?) The usual notion of continuity does not apply to the function P, since its domain has not been given any topological structure. The problem is that we cannot talk about lim_{n→∞} A_n for an arbitrary sequence of sets (events) in E. For some special sequences of sets, "convergence of sets" can be defined. When A₁ ⊆ A₂ ⊆ ···, then we define lim_{n→∞} A_n = ∪_{n=1}^∞ A_n. Similarly, if A₁ ⊇ A₂ ⊇ ···, then lim_{n→∞} A_n = ∩_{n=1}^∞ A_n. A question arises: for such sets, is P continuous in the sense that

P\left(\lim_{n\to\infty} A_n\right) = \lim_{n\to\infty} P(A_n)?

Here is a result that answers this question.

Theorem - 2.0.1 - (The continuity property of P) If {A_n, n ≥ 1} and {B_n, n ≥ 1} are sequences of events such that A₁ ⊆ A₂ ⊆ ··· and B₁ ⊇ B₂ ⊇ ···, then

(i) \lim_n P(A_n) = P(\lim_n A_n), and (ii) \lim_n P(B_n) = P(\lim_n B_n).

Proof: Note that lim_n A_n = ∪_{n=1}^∞ A_n = A₁ ∪ (A₂ − A₁) ∪ (A₃ − A₂) ∪ ···, where the unions on the right side are disjoint. Thus,

P\left(\bigcup_{n=1}^{\infty} A_n\right) = P(A_1) + P(A_2 - A_1) + P(A_3 - A_2) + \cdots
= P(A_1) + \lim_{n\to\infty} \sum_{i=1}^{n-1} P(A_{i+1} - A_i)
= P(A_1) + \lim_{n\to\infty} \sum_{i=1}^{n-1} \left( P(A_{i+1}) - P(A_i) \right)
= P(A_1) + \lim_{n\to\infty} P(A_n) - P(A_1) = \lim_{n\to\infty} P(A_n).

The reader should prove part (ii) (cf. Exercise (2.0.2)). ♠

As we saw above, a monotone sequence of sets has a limit. In general, if A₁, A₂, ... is any sequence of sets, the new sequence B₁ = ∪_{k≥1} A_k, B₂ = ∪_{k≥2} A_k, ..., B_n = ∪_{k≥n} A_k, for n = 1, 2, ..., becomes monotone. That is, B₁ ⊇ B₂ ⊇ B₃ ⊇ ···. Hence, the sequence B₁, B₂, ... has a limit, which is called limsup_n A_n and stands for

\limsup_n A_n := \lim_{n\to\infty} B_n = \bigcap_{n=1}^{\infty} B_n = \bigcap_{n=1}^{\infty} \bigcup_{k\ge n} A_k.

Similarly, the new sequence C_n = ∩_{k≥n} A_k is a monotone sequence, since C₁ ⊆ C₂ ⊆ ···. It also has a limit, called liminf_n A_n, which stands for

\liminf_n A_n := \lim_{n\to\infty} C_n = \bigcup_{n=1}^{\infty} C_n = \bigcup_{n=1}^{\infty} \bigcap_{k\ge n} A_k.

Since C_n ⊆ B_n for every n, their respective limits also share the same relationship, namely liminf_n A_n ⊆ limsup_n A_n. Note that the definition of E ensures that both liminf_n A_n and limsup_n A_n are in E whenever all the A_n are in E.

In the probability literature, the event ∪_i A_i is often read as "at least one of the A_i occurs". Similarly, the event ∩_i A_i is often read as "every one of the A_i occurs". Continuing this further, the event limsup_n A_n stands for "infinitely many of the A_i occur" and liminf_n A_n stands for "all but finitely many of the A_i occur". The reader should try to see why this interpretation is justified. Here is another consequence of the definition of the probability function.

Theorem - 2.0.2 - (The first Borel-Cantelli lemma) Let A₁, A₂, ... be a sequence of events. If Σ_n P(A_n) < ∞, then P(limsup_n A_n) = 0.

Proof: Note that if B_n = ∪_{k≥n} A_k, then B₁ ⊇ B₂ ⊇ ···. Thus, by the continuity property of P,

0 \le P\left(\limsup_n A_n\right) = \lim_{n\to\infty} P(B_n) = \lim_{n\to\infty} P\left(\bigcup_{k\ge n} A_k\right).

By the subadditivity property of P, we get P(∪_{k≥n} A_k) ≤ Σ_{k≥n} P(A_k). Since the tail of a convergent series goes to zero,

0 \le P\left(\limsup_n A_n\right) = \lim_{n\to\infty} P\left(\bigcup_{k\ge n} A_k\right) \le \lim_{n\to\infty} \sum_{k\ge n} P(A_k) = 0. ♠

Remark - 2.0.3 - (Inclusion-exclusion principle) It is a natural question to ask, "can one find the probability of a union of events when one knows only the probabilities of the individual events?" The answer is yes, provided the probabilities of their intersections are known; the resulting formula is called the inclusion-exclusion principle and is due to H. Poincaré:

P\left(\bigcup_{i=1}^n A_i\right) = \sum_{i=1}^n P(A_i) - \sum_{i<j} P(A_i \cap A_j) + \sum_{i<j<k} P(A_i \cap A_j \cap A_k) - \cdots + (-1)^{n+1} P\left(\bigcap_{i=1}^n A_i\right)
= B_1 - B_2 + B_3 - \cdots + (-1)^{n+1} B_n,

where

B_j = \sum_{1 \le i_1 < i_2 < \cdots < i_j \le n} P(A_{i_1} \cap A_{i_2} \cap \cdots \cap A_{i_j}), \qquad j = 1, 2, \ldots, n.

Therefore,

P(A_1^c \cap A_2^c \cap \cdots \cap A_n^c) = 1 - P(A_1 \cup A_2 \cup \cdots \cup A_n) = 1 - \sum_{j=1}^n (-1)^{j-1} B_j = \sum_{j=0}^n (-1)^j B_j, \quad \text{where } B_0 = 1,

and it represents the probability of none of the A_i's occurring. This is a special case of a yet more general result due to Jordan, who proved it in 1927. It says that

P\left(\{\text{exactly } k \text{ events among } A_1, \ldots, A_n \text{ occur}\}\right) = \sum_{j=k}^n (-1)^{j-k} \binom{j}{k} B_j,

which reduces to the result of Poincaré for k = 0. Both of these results can be proved by induction and are left for the reader as exercises.

Remark - 2.0.4 - (Various assignment methods) How should one define the function P : E → [0, 1] so that the three requirements of its definition are fulfilled and, at the same time, P is realistic? The word "realistic" points towards our desire that it should be applicable in various real life situations. This is a modeling issue. Typically any one of the following four techniques is invoked, due to various reasonings:

• (Equilikely outcomes). When Ω consists of finitely many outcomes and there is no reason to prefer any one outcome over another, P(A) is taken to be the number of outcomes in A divided by the number of outcomes in Ω.

• (Equilikely outcomes over a continuum). When Ω is a continuum and we still insist that the outcomes should show no preference, the above counting method breaks down. Its natural analog then becomes

P(A) = \frac{\text{size of } A}{\text{size of } \Omega}.

In this case, typically E is taken to be the smallest sigma field containing the bounded rectangles, called the Borel sigma field (cf. Exercise (2.0.6)).

• (Weighted versions). A large class of probability models come in a weighted form of the above two items. In the following, several standard models of probability and statistics of this sort are listed.

• (Independence). Another distinct modeling technique that sets probability theory apart from other disciplines is the modeling tool of independence. We will briefly describe this concept a bit later.

Example - 2.0.1 - (Secretary's matching problem — equilikely probability space) Here we illustrate the use of the inclusion-exclusion principle applied to a particular equilikely probability space and solve the secretary's matching problem. A secretary types n letters addressed to n different people. Then he types n envelopes with the same n addresses. However, he puts the letters into the envelopes randomly. (The word "random" here stands for no preference for any particular letter going into any particular envelope. This can then be interpreted to mean that the resulting probability space is equilikely.) We would like to know the probability that at least one of the letters is correctly put into its own envelope.

Let A_i be the event that letter i goes into its own envelope (1 ≤ i ≤ n), i.e., a match occurs for the i-th letter. We want the probability that at least one of the A_i occurs, i.e., P(∪_{i=1}^n A_i). To find P(A_i), P(A_i ∩ A_j), P(A_i ∩ A_j ∩ A_k), ..., for distinct i, j, k, we proceed as follows. There are n! ways to place the letters into the n envelopes. Therefore,

P(A_i) = \frac{(n-1)!}{n!} = \frac{1}{n}, \qquad i = 1, 2, \ldots, n.

Similarly,

P(A_i \cap A_j) = \frac{(n-2)!}{n!} = \frac{1}{n(n-1)}, \qquad P(A_i \cap A_j \cap A_k) = \frac{1}{n(n-1)(n-2)},

etc. So, by the result of Poincaré,

P\left(\bigcup_{i=1}^n A_i\right) = \sum_{i=1}^n \frac{1}{n} - \sum_{i<j} \frac{1}{n(n-1)} + \sum_{i<j<k} \frac{1}{n(n-1)(n-2)} - \cdots
= 1 - \frac{1}{2!} + \frac{1}{3!} - \cdots + (-1)^{n+1} \frac{1}{n!},

since the j-th sum contains \binom{n}{j} equal terms.
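A quick simulation of the matching probability (an illustrative sketch, not part of the original notes): it compares a Monte Carlo estimate of P(at least one match) with the exact inclusion-exclusion sum above; the values of n and the number of trials are arbitrary choices, and the well-known limit 1 − 1/e ≈ 0.632 is printed for reference.

```python
import math
import random

def exact_at_least_one_match(n):
    """Inclusion-exclusion: 1 - 1/2! + 1/3! - ... + (-1)^{n+1}/n!."""
    return sum((-1) ** (j + 1) / math.factorial(j) for j in range(1, n + 1))

def simulated_at_least_one_match(n, trials=100_000, seed=0):
    rng = random.Random(seed)
    letters = list(range(n))
    hits = 0
    for _ in range(trials):
        perm = letters[:]
        rng.shuffle(perm)                       # a random envelope assignment
        if any(perm[i] == i for i in range(n)):  # at least one fixed point
            hits += 1
    return hits / trials

for n in [3, 5, 10]:
    print(n, exact_at_least_one_match(n), simulated_at_least_one_match(n))
print("1 - 1/e =", 1 - math.exp(-1))
```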

Example - 2.0.2 - (Random selection) Pick a point “at random” from (0, 2π]. Remark - 2.0.5 - (Whats observable?) The actual elements of Ω may or may What does this mean? Well, it might be painful to construct an experiment which not be observable. Even worse is that the probability of events, P(A), is NEVER will do just that. At least approximately it can be performed with the help of observable. Often the observables are the values of certain random variables that a spinner which does not have any preferential stopping spots. Mathematically turn up as a result of performing the experiment. speaking it stands for the following probability space. Probabilists propose models for the unknown (unobservable) probability func- Ω = (0, 2π] • tion P and then, using those models, deduce results by knowing some partial • is the smallest sigma field of subsets of (0, 2π] which contains all the inter- information concerning the experiment or without performing the random •vals E of (0, 2π]. We will call it the Borel sigma field, over (0, 2π]. experiment at all. The results are only as good as the models. P P b a Statisticians test the validity of the proposed models for after perform- ((a, b]) = 2π− 0 for any 0 a < b 2π. • • − ≤ ≤ ing the experiment a large number of times and observing certain random Note that the actual act of selection of a point from the interval (0, 2π], or as to variables (called data analysis and statistical inference). how does one physically perform this operation, are none of our concerns. This is an idealization and the phrase “selection of a point at random” points towards this Every random variable has its own (unique) distribution function (or distribu- mathematical model (abstraction) of the physical act. The resulting cumulative tion, for short). All the probabilistic properties of the random variable are stored distribution is in its distribution. Probability theory essentially is the study of these distributional 0 if t 0, properties. F (t) = t if 0 ≤< t < 2π,  2π 1 if t 2π. Definition - 2.0.2 - (Multivariate distribution) Let X be a d-dimensional ran-  ≥ dom vector (i.e, d number of random variables, X1,X2, ,Xd, all defined over the Exercise - 2.0.2 - Finish the proof of part (ii) of Theorem (2.0.1). same probability space (Ω, , P)). The multivariate (or··· joint) distribution of X is a function, E Exercise - 2.0.3 - By using induction, prove the result of Poincar´e. X1 x1 P Exercise - 2.0.4 - (Continuity property of revisited) Show that for any X2 x2 d sequence of events An, n 1, we have F (x1,x2, ,xd) = P(X x) = P  .   .  , x R , ≥ ··· ≤ . ≤ . ∈     P (liminf An) liminf P (An) limsup P (An) P (limsup An).  Xd   xd  n ≤ n ≤ n ≤ n     = P(X1 x1, X2 x2, , Xn xn). Hence, deduce that if liminfn An = limsupn An =: limn An, then P (limn An) = ≤ ≤ ··· ≤ limn P (An), giving a slight extension of the continuity property of P. Here the inequality between vectors means componentwise inequalities must hold for all the components, and the commas separating the events, Xi xi , stand Exercise - 2.0.5 - (Intersection of sigma fields is a sigma field) Let Ω be a for the intersection operations. { ≤ } nonempty set and let ,α Λ be any nonempty collection of sigma fields of {Gα ∈ } subsets of Ω. Show that = α ΛGα is again a sigma field. Distributions are nonnegative, right continuous functions, (in each variable) F ∩ ∈ which may or may not be differentiable. 
Even more, some distributions may have Exercise - 2.0.6 - (Smallest sigma field containing a class) Let Ω be a nonempty jumps, but the points of jump can always be counted (i.e., is a countable set). For set and let be a collection of of subsets of Ω. (Note that need not have any a distribution, F , of one random variable, a point x is a point of jump of F if properties.)A Show that = is again a sigma field, whereA the intersection FA ∩G⊃AG is over all sigma fields that contain . [Example: On R the smallest sigma field P(X = x) = F (x) F (x−) > 0. containing the collectionG of all intervalsA is called the Borel sigma field.] − For a distribution, F , of two random variables, a point (x1,x2) is a point of jump Exercise - 2.0.7 - (Generated sigma field) The collection, of F if

1 σ(X) := X− (B): B , P(X1 = x1, X2 = x2) = F (x1,x2) F (x−,x2) F (x1,x−) + F (x−,x−) > 0. ∈ B − 1 − 2 1 2 is always a sigma field (as can be verified easily), and is called the sigma field Most of the commonly used distributions fall into two categories: The differentiable generated by X. So, a real valued function, X, over Ω, is a random variable for kind, which are (strangely) given the name “continuous”, and the jump type, which the probability space (Ω, , P) if and only if σ(X) . Verify that σ(X) is a sigma are given the name “discrete”. However, there are other types of distributions which field when X is any real valuedE function on Ω. ⊆ E do not fit into these two categories. Probability Spaces & Random Variables 17 18 Probability Spaces & Random Variables

Definition - 2.0.3 - (Continuous case) The (joint) density, of a random vector (Exponential). X Exp(λ) stands for X having the density with (joint) distribution F , when it exists, is a nonnegative function • ∼ λx f(x) = λe− , x > 0. ∂d f(x) = F (x), with f(x) dx = 1. Here λ > 0, is the parameter of the density. ∂x ∂x Rd 1 ··· d Z (Gamma). X Gamma(λ,α) (also sometimes denoted as X G(λ,α)) In this case, for any (Borel) subset B of Rd, we take P(X B) = f(x) dx. • ∼ ∼ ∈ B stands for X having the density R α (Discrete case) The joint (discrete) density, of a random vector with (joint) λ α 1 λx f(x) = x − e− , x > 0. distribution F , when it exists, is a nonnegative function , f, with a countable subset Γ(α) D Rd, so that ⊆ Here λ,α > 0, are the parameters of the density. When α = 1 we get the 1 k f(x) = P(X = x) > 0, x D, with f(x) = 1. exponential density as a special case. If we take λ = and α = , we get ∈ 2 2 all x D the chi square density with k degrees of freedom. X∈ (Multivariate normal). X N(µ, V) stands for the X = [X , ,X ] Remark - 2.0.6 - (Notation) X F or X f. The actual probability space 1 d ′ • having the (joint) density ∼ ··· (Ω, , P), over which X is defined,∼ is often suppresed∼ once the distribution F (or theE density f) is obtained. One may safely assume that there is some probability 1 1 1 d space from where the specified random variable with its distribution, came from. f(x) = exp − (x µ)′V− (x µ) ; x R . (2π)d/2 det(V) 2 − − ∈ This observation was proved by Kolmogorov. Hence, X, F and f are related as   Actually, a multivariatep normal random variable is defined through its mo- d F (x), continuous case, P(X t) = F (t), f(x) = dx ment generating function (mgf) since it uses V only, even when V is not ≤ F (x) F (x−), discrete case. invertible. The vector µ = [µ , , µ ] controls the center of the density ( − 1 d ′ and the d d positive-definite matrix,··· V, controls the spread and shape of × Example - 2.0.3 - Here are some commonly used (discrete and continuous) models the density. for distributions (actually densities) of random variables. (Binomial). X B(n,p) stands for X having the (discrete) density, • ∼ (Normal). X N(µ, σ2) stands for X having the density • ∼ n x n x f(x) = p (1 p) − , x = 0, 1, 2, ,n. 1 (x µ)2 /(2σ2) x − ··· f(x) = e− − , < x < .   σ√2π −∞ ∞ (Poisson). X P oisson(λ) represents a random variable whose (discrete) The parameters, µ, σ > 0, control the shape of the density. • density is ∼ x 2 λ λ (Lognormal). X LN(µ, σ ) stands for X having the density f(x) = e− , x = 0, 1, 2, . • ∼ x! ··· 1 (ln x µ)2/(2σ2) f(x) = e− − , x > 0. (Geometric). X Geometric(p) represents a random variable whose (dis- √ xσ 2π • crete) density is ∼ The parameters, µ, σ > 0, control the shape of the density. f(x) = p (1 p)x, x = 0, 1, 2, . (Chi square). X χ2 stands for X having the density − ··· • ∼ (k) Definition - 2.0.4 - (Independence) Two events, A, B, are called independent if 1 k 1 x/2 f(x) = x 2 − e− , 0 < x < . 2k/2Γ( k ) ∞ P(A B) = P(A)P(B). 2 ∩ Here the single parameter, k > 0, called the degree of freedom, controls the This is a distinguishing concept of probability theory. The early literature of prob- shape of the density. By the way, ability theory heavily relied on it. Much of the modern probability theory evolved while proving old results that assumed this structure by trying relaxing it as much 1 √ Γ( 2 ) = π, Γ(1) = Γ(2) = 1, Γ(x + 1) = x Γ(x), x > 1. (0.1) as possible. Probability Spaces & Random Variables 19 20 Probability Spaces & Random Variables

Definition - 2.0.5 - (Independence of events) A (finite or infinite) sequence of Remark - 2.0.7 - (Zero-one property) When A , A , is a sequence of in- 1 2 ··· events, A1, A2, is called independent if for any finite subset of them their joint dependent random variables the two Borel-Cantelli lemmas together show that ··· P probability is the product of the individual probabilities. That is (limsupn An) is always either 0 or 1. It cannot have any other value. This fact is a special case of a more general result, known as Kolmogorov’s zero-one law. P( i J Ai) = P(Ai), ∩ ∈ i J Definition - 2.0.6 - (Independence of random variables) When the distribu- Y∈ tion of X =[X , ,X ]′ can be written as a product, i.e., for any finite subset J of positive integers. 1 ··· d X1 x1 Theorem - 2.0.3 - (2nd Borel-Cantelli lemma) If A1, A2, are independent d ··· X2 x2 ∞ P P P Rd P P F (x) = (X x) =  .   .  = (Xi xi); x , events such that (An) = , then limsup An = 1. ≤ . ≤ . ≤ ∈ ∞ n . . i=1 n=1       Y X  Xd   xd      Proof: By the continuity property of P, we have     we say that X ,X , ,X are mutually independent (or just independent). If 1 2 ··· d for every pair, (Xi,Xj), i = j, the two random variables are independent then P P P 6 limsup An = An = lim An , and, X1,X2, ,Xd are called pairwise independent. The single word “independent” n   k     k 1 n k →∞ n k ··· \≥ [≥ [≥ will always refer to mutual independence.  m    Remark - 2.0.8 - (The iid notation & the notion of random sample) The P An = lim P An .   m iid n k →∞ n=k ! notation X1,X2, ,Xd stands for the case when [≥ [ ··· ∼   Now, the independence of events gives that X1 x1 d X2 x2 m m m d F (x) = P(X x) = P  .   .  = P(X1 xi), x R . P A = 1 P Ac = 1 (1 P(A )) ≤ . ≤ . ≤ ∈ n − n − − n     i=1 n=k ! n=k ! n=k  X   x  Y [ \ Y  d   d  m     If we denote the common function P(X t) by G(t), then the above iid notation = 1 exp ln(1 P(An)) . 1 − − iid ≤ (n=k ) takes the form X ,X , ,X G, or G is replaced by the name given to G. In X 1 2 ··· d ∼ P this case the collection X1,X2, ,Xd is called a random sample from G, and G Here, if (An) = 1, then the equality remains valid if we agree to take ln 0 = . { ··· } xk −∞ is called the population distribution. By the Taylor series ∞ = ln(1 x), we see that k=1 k − − 2 Example - 2.0.4 - Recall that N(0, 1) is the name given to the standard normal P (P(A )) iid ln (1 P(A )) = P(A ) + n + P(A ). distribution. So, the notation, X ,X , ,X N(0, 1), uses the (common) dis- n n 2 n 1 2 d − − ··· ! ≤ − tribution function ··· ∼ This inequality remains valid when P(A ) = 1 since ln0 = < 1. Thus, t u2/2 n 1 u2/2 e− m m −∞ − G(t) = P(X1 t) = e− du, with density f(u) = , u R. ln (1 P(A )) P(A ). This implies that ≤ √2π √2π ∈ − n ≤ − n Z−∞ nX=k nX=k Remark - 2.0.9 - (Independence of events, & sigma fields) For a probability m m space (Ω, , P), two events, A, B , are independent if P(A B) = P(A)P(B). E ∈ E ∩ 0 lim exp ln (1 P(An)) lim exp P(An) = 0. Extending this idea, if , are subsigma fields of , then , are called indepen- ≤ m − ≤ m − F G E F G →∞ (n=k ) →∞ ( n=k ) dent if for any A , and any B , we have P(A B) = P(A)P(B). Using this X X ∈F ∈ G ∩ This gives that notion of independence of sigma fields, it turns out that two random variables X, Y are independent if and only if their respectively generated sigma fields, σ(X),σ(Y ) m are independent. For the most part we will not work at this generality. P An = lim 1 exp ln (1 P(An)) = 1.   m − − 2 n k →∞ (n=k )! 
Since this is true for all k, P(limsup_n A_n) = 1. ♠

Exercise - 2.0.8 - When X ∼ N(0, 1), by rewriting P(X² ≤ t) as P(−√t ≤ X ≤ √t) and then differentiating with respect to t, show that Y = X² ∼ χ²₍₁₎.
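A numerical sanity check of Exercise 2.0.8 (an illustrative sketch, not part of the original notes): squares of standard normal draws should have mean 1 and variance 2, the moments of the chi-square distribution with one degree of freedom; the sample size and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
y = x ** 2

# chi-square with 1 degree of freedom has mean 1 and variance 2
print(y.mean(), y.var())      # approximately 1 and 2
print(np.mean(y <= 1.0))      # approx P(chi2_1 <= 1) = P(|X| <= 1) ~ 0.6827
```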

difference of positive and negative parts of h. This not only gives us the change of variables formula, but more importantly, this way we are able to construct integrals over abstract probability spaces. Using this inegral, we define the expectation (or mean) of a random variable X to be

E(X) = X dP = t dFX (t), whenever E X = X dP < . R | | | | ∞ Lecture 3 ZΩ Z ZΩ Higher order moments are defined by using h(X) = Xk, by using positive integers for k. In particular, the variability of a distribution is captured by V ar(X) = E(X2) (E(X))2, Std(X) = V ar(X), Expectations − called the variance and standard deviation respectively. p

3.1 Properties of Lebesgue integral The concept of an average or a mean of values of h(X), when X is a random variable and h is a function of interest, is captured by the Lebesgue integral. We present a Ignoring some technical details the resulting integral has the usual properties: brief heuristic argument here along with collecting its basic properties that we will E E E need. (Linearity) (ah(X) + bg(X) + c) = a (h(X)) + b (g(X)) + c for any • constants a, b, c. The multivariate extensions go along the same lines. For We illustrate the basic idea with the help of the distribution FX of any random variable X defined over some probability space (Ω, , P). The Riemann-Stieltjes in- instance, for two random variables, X, Y , b E tegral a h(t) dFX (t) partitions the x-axis using intervals, while Lebesgue’s method E(ah(X, Y ) + bg(X, Y ) + c) = aE(h(X, Y )) + bE(g(X, Y )) + c. of integration, instead partitions the y-axis using intervals. Then the two methods R perform distinctly different actions. When (xi 1,xi] is one of the partitioning in- (Positivity) If h1(t, s) h2(t, s) then E(h1(X, Y )) E(h2(X, Y )), and tervals, the Riemann-Stieltjes integral measures− its size by using the distribution • ≥ ≥ E(h(X, Y )) E h(X, Y ) . as FX (xi) FX (xi 1). Furthermore it uses an arbitrary point ai [xi 1,xi] and | | ≤ | | evaluates h−(a ) to create− the Riemann-Stieltjes sum ∈ − i (Change of variable formula CVF) If Z = h(X, Y ) with distribution n n • FY (t) then R(h, FX , ) = h(ai) (FX (xi) FX (xi 1)) = h(ai) P(xi 1

(Lebesgue dominated convergence theorem) If X (ω) X(ω) and for any bivariate distribution, H, as long as the of the random variables • n → Xn Y for some random variable Y with E(Y ) < then exist and F and G were the marginal distributions of H”. Note that this approach | | ≤ ∞ does not need the joint density to compute the covariance. E(X) = X dP = lim Xn dP = lim E(Xn). n n ZΩ ZΩ Note from the definition that the covariance obeys the following properties, (Fubini-Tonelli’s theorem) If F,G are distributions of X, Y respectively, known as the symmetry and blinearity properties: • then Tonelli’s theorem says that (i) Cov(X, Y ) = Cov(Y,X), • h(x,y) dF (x) dG(y) = h(x,y) dG(y) dF (x). (ii) Cov(cX, Y ) = Cov(X, cY ) = cCov(X, Y ), for any constant c, R R | | R R | | • Z Z Z Z (iii) Cov(X + X , Y ) = Cov(X , Y ) + Cov(X , Y ), Fubini’s theorem says that if either side of above equation is finite then the • 1 2 1 2 above interchange of integrals can be performed without the absolute values (iii) Cov(c, Y ) = 0, for any constant c. around h(x,y) as well. • In particular, V ar(X) = Cov(X,X) when E(X2) < . We say that X and Y are The integral has some further properties which we will mention as they are needed. uncorrelated if Cov(X, Y ) = 0. Next, we turn our attention∞ towards inequalities. Example - 3.1.1 - When X 0 with distribution F (x) and E X < , Tonelli’s R theorem gives ≥ | | ∞ Proposition - 3.2.1 - (Existence of moments) Let h : [0, ) be a strictly increasing continuous function. Then for any nonnegative random∞ → variable X we X ∞ ∞ have E(X) = X dP = 1 dxdP = 1 dP dx = (1 F (x)) dx. Ω Ω 0 0 ω:X(ω)>x 0 − Z Z Z Z Z{ } Z ∞ 1 1 P(X h(n)) h− X(s) dP = h− (t) dFX (t) ≥ ≤ ◦ R n=1 S 3.2 Covariance X Z Z ∞ 1 2 2 = E h− (X) P(X h(n)). Definition - 3.2.1 - (Variance-covariance matrix) If E( X1 + + Xd ) is ≤ ≥ | | ··· | | n=0 finite, the variance-covariance matrix of the random vector X =[X1,X2, ,Xd]′  X is ··· Proof: Just note that X1 EX1 . [X1 Xd] . [EX1 EXd] ∞ ∞ ∞ ∞ V = E . . 1 1  .  ···   .  ···  P (X h(n)) = P (h− (X) n) = P (i h− (X)

∞ 1 = h− (X(s)) dP (s) i=0 Ai X Z ∞ ∞ (i + 1) dP (s) = (i + 1)P (A ) ≤ i i=0 Ai i=0 X Z X i ∞ 1 = P (i h− (X)

Example - 3.2.1 - If X ∼ N(0, σ²) and Y = X², then E(X) = 0 and E(Y) = E(X²) = σ². Here are the justifications.

E(X) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} u\, e^{-u^2/(2\sigma^2)}\, du = 0,

since the integrand is an odd function. Next,

E(Y) = E(X^2) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} u^2\, e^{-u^2/(2\sigma^2)}\, du
= \frac{2}{\sigma\sqrt{2\pi}} \int_0^{\infty} u^2\, e^{-u^2/(2\sigma^2)}\, du, \quad \text{the integrand is an even function,}
= \frac{\sigma^2}{\sqrt{2\pi}} \int_0^{\infty} t^{1/2}\, e^{-t/2}\, dt, \quad \text{by substituting } t = u^2/\sigma^2,
= \frac{\sigma^2}{\sqrt{2\pi}}\, 2^{3/2}\, \Gamma\!\left(\tfrac{3}{2}\right), \quad \text{since the total area under the } \chi^2_{(3)} \text{ density is one,}
= \sigma^2, \quad \text{by (0.1).}
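A quick Monte Carlo check of Example 3.2.1 (an illustrative sketch, not part of the original notes), with σ = 1.7 and the sample size chosen arbitrarily:

```python
import numpy as np

sigma = 1.7
rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=sigma, size=1_000_000)

print(x.mean())                       # close to E(X) = 0
print((x ** 2).mean(), sigma ** 2)    # close to E(X^2) = sigma^2 = 2.89
```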

The AM-GM inequality says that µ µ˜. To prove this, define a random variable ≥ X ∆ = a1, a2, , an with equal probabilities assigned to the elements of ∆ (the∈ set ∆{ contains··· the given} positive numbers and repetitions are allowed). Then, it is obvious to see that E(X) = µ. Also, by CVF, we have log a + log a + + log a E log(X) = 1 2 ··· n = log µ˜ . Lecture 4 { } n { } Since log x is a convex function, Jensen’s inequality gives − log µ = log E(X) E log(X) = log µ˜ . − { } − { } ≤ − { } − { } Multiplying both sides by 1 and then taking the antilogs give µ µ˜. Various Inequalities − ≥ Exercise - 4.0.2 - (AGH inequality) Let a, b be two positive numbers. The 1 (1/a) + (1/b) − quantity is called the harmonic mean of the two numbers. Use 2 A useful inequality involving a random variable is due to Jensen. It says that if f   is a convex function over an interval and a random variable X takes values in that the AM-GM inequality to prove the following AGH inequality: interval, then Arithmetic Mean Geometric Mean Harmonic Mean. E(f(X)) f(E(X)), ≥ ≥ ≥ when the expectation on the left side exists. In particular, by taking f(t) = t2, we get 4.1 Holder & Minkowski’s Inequalities E X2 (E( X ))2 . ≥ | | An extension of the CBS inequality is known as the H¨older inequality. For this we This implies that V ar(X) 0.  need an extended version of the AM-GM inequality — known as Young’s inequality. Let X be a random variable≥ with variance σ2. Chebyshev’s inequality says that Proposition - 4.1.1 - σ2 (Young’s inequality) Let X ∆ = a, b with density P ( X E(X) >ε) , for any ε > 0. f(a) = p, f(b) = 1 p, where 0 < p < 1 and a, b 0. Then∈ { } | − | ≤ ε2 − ≥ p 1 p This is a bit crude inequality and surprisingly pervasive in probability theory and ap + b(1 p) = E(X) a b − , analysis. Here we collect some of the standard inequalities from analysis. − ≥ · 1 where the equality holds if and only if a = b. (For p = 2 this reduces to the AM-GM Exercise - 4.0.1 - For any p > 0, show that the following statements are equivalent. inequality.) (i) E Y p < , • | | ∞ Proof: Look at the inequality backwards. We need to show that p (ii) ∞ P ( Y n) < , • n=1 | | ≥ ∞ a p p ap + b(1 p) b, assuming that b = 0. (iii)P[0, ) P ( Y t) dt < , • ∞ | | ≥ ∞ − ≥ b 6 p 1   (iv) R n∞=1 n − P ( Y n) < . (The inequality is trivially true when either a = 0 or b = 0.) So, we need to prove • | | ≥ ∞ a a p Then showP that that p + (1 p) or equivalently, b − ≥ b p ∞ p 1 ∞ p 1   E( Y ) = p t − P ( Y >t) dt = p t − P ( Y t) dt. p | | | | | | ≥ f(t) = t + tp + (1 p) 0, for all t > 0. Z0 Z0 − − ≥ Example - 4.0.2 - (AM-GM inequality) The arithmetic mean µ of n numbers, The minimum of this function can be obtained with brute force, and we leave it for a1 + a2 + + an the reader. Instead, there is an easier way. Define a random variable U that takes a1, a2, an, is µ = ··· . If these numbers are positive, their geomet- a ··· n two values, b and 1, with respective probabilities, p and (1 p). Now apply the ric mean is AM-GM inequality to get − 1 n n log a1 + log a2 + + log an a a p µ˜ = antilog ··· = ai . p + (1 p) = µ µ˜ = . e n b − ≥ b ♠ i=1 !   Y   4.1 Holder & Minkowski’s Inequalities 29 30 Various Inequalities

1 p 1 1 p Proposition - 4.1.2 - (H¨older’s inequality) Let p, q be numbers such that p,q > This gives that E X + Y < . Since p + q = 1, we have 1 + q = p which gives 1 1 p q | | ∞ p 1 q p 1 and p + q = 1. If X, Y are random variables with E( X ) < and E( Y ) < that p = (p 1)q. Note E X + Y − = E( X + Y ) < . Now apply the then | | ∞ | | ∞ − | | | | ∞ 1 1 H¨older inequality to get E XY (E X p) p (E Y q) q .  | | ≤ | | · | | 1 p 1 p 1 p 1 q q Equality holding if and only if α X p = β Y q for some constants α, β. E X + Y − X (E X ) p E X + Y − . | | | | {| | | |} ≤ | | | |   1 1 p 1 p p Proof: If E X p = 1 and E Y q = 1 then take a = X p and b = Y q and replace Similarly, we get E X + Y − Y (E Y ) p (E X + Y ) q . Now we put these {| | | |} ≤ | | | | 1 | | | | 2 | | | | pieces together as follows. p by our p in Young’s inequality to get p p 1 E X + Y = E X + Y − X + Y 1 1 X p Y q | | | | | | a + b a1/p b1/q or | | + | | XY . p q ≥ p q ≥| | p 1  p 1         E X + Y − X + E X + Y − Y ≤ {| | | |} {| | | |}

Taking expectations of both sides we get 1 1 1 1 (E X p) p (E X + Y p) q + (E Y p) p (E X + Y p) q 1 1 1 1 1 1 ≤ | | | | | | | | p q p p q q E XY E X + E X = + =1=(E X ) (E Y ) . 1 1 1 | | ≤ p | | q | | p q | | | | = (E X + Y p) q (E X p) p + (E Y p) p . | | { | | | | } p q If E X = 0 or if E Y = 0 then one of the random variables is identically zero Dividing by the first expression of the right hand side (and noting that 1 = 1 1 ) | | | | p − q and the inequality is trivially true. Otherwise, define p 1 1 p 1 p 1 gives that (E X + Y ) − q (E X ) p + (E Y ) p . If E X + Y p = 0 then there | | ≤ | | | | | | X Y was nothing to prove in the first place. U := 1 ,V := 1 . ♠ (E X p) p (E Y q) q | | | | This gives that E U p = 1 and E V q = 1 . Thus, we have 4.2 Jensen’s Inequality | | | | 1 1 Now explain the Jensen inequality in a bit more detail. E UV 1 or E XY (E X p) p (E Y q) q . | | ≤ | | ≤ | | · | | ♠ Definition - 4.2.1 - (Convex functions) Let f be a real valued function on an Exercise - 4.1.1 - Finish the proof of the above proposition by showing when the interval (α, β). We say f is convex on (α, β) if, for any subinterval [a, b] (α, β), equality holds. the graph of f on [a, b] lies on or below the line segment connecting (a, f⊆(a)) and (b, f(b)). This is equivalent to saying that Proposition - 4.1.3 - (Minkowski’s inequality) Let X, Y be two random vari- ables with E X p < , and E Y p < . Then E X + Y p < and f(θa + (1 θ)b) θf(a) + (1 θ)f(b), (0 θ 1), | | ∞ | | ∞ | | ∞ − ≤ − ≤ ≤ 1 1 1 for all a, b (α, β), a < b. (E X + Y p) p (E X p) p + (E Y p) p ; for any p 1. ∈ | | ≤ | | | | ≥ HW1 Exercise - 4.2.1 - If φ is a convex function over (a, b) then show that φ is contin- Proof: The case of p = 1 is trivial, so assume that p > 1. The triangle inequality, uous. X + Y X + Y , gives that | |≤| | | | Proposition - 4.2.1 - A function f defined over an interval (α, β) is convex if and p p p X + Y ( X + Y ) (2 max X , Y ) only if for every random variable, say Xn, taking values in (α, β) and having a finite | | ≤ | | | | ≤ p {| |p| |}p p p p = 2 max X , Y 2 ( X + Y ) . range a1, a2, , an we have f(E(Xn)) E f(Xn) . {| | | | } ≤ | | | | { ··· } ≤ { } 1 p 1/p Proof: In mathematics texts, the quantity (E|X| ) is often represented as ||X||p and is The only if part follows easily by taking n = 2. The if part is obtained by called the p-norm of X. repeated use of the definition and induction. We illustrate the induction argument 2 Note that 1 − p of Young’s inequality is now 1/q of this proposition. That is either we for n = 3 only. work with p, 1 − p where 0 1, 1 1 q > 1 with p + q = 1 as we do in this proposition. This is just using different notations. f(E(X)) = f(p1a1 + p2a2 + p3a3) 4.2 Jensen’s Inequality 31 32 Various Inequalities

p1a1 + p2a2 positive convex function. Finally to show that Ef(X ) Ef(X), just note that, = f (1 p3) + p3a3 n → − 1 p3 by our construction (since all constructions f(X ) f(X) or using g instead of  −  | n |≤| | p1 p2 f and having the same property), (1 p3)f a1 + a2 + p3f(a3) ≤ − 1 p3 1 p3  − −  f(Xn) f(X) f(X) + f(Xn) 2 f(X) p p | − |≤| | | | ≤ | | (1 p ) 1 f(a ) + 2 f(a ) + p f(a ) ≤ − 3 1 p 1 1 p 2 3 3 The left side goes to zero, the right side has finite expectation. Hence, the Lebesgue  − 3 − 3  = p f(a ) + p f(a ) + p f(a ) = E f(X) . dominated convergence theorem gives that E f(Xn) f(X) 0. 1 1 2 2 3 3 { } | − |→ ♠ The same argument goes for higher values of n. HW2 Exercise - 4.2.2 - Let f be a convex function over an interval I and let X be a ♠ random variable taking values in I and E X < . Prove that E f(X) is well | | ∞ { } defined by showing that E f −(X) < . Proposition - 4.2.2 - Let f be a convex function over an interval I and let X be { } ∞ a random variable taking values in I with E X < so that E f(X) < . Then Proposition - 4.2.3 - (Convexity & Jensen’s inequality) Let I be an interval. there exists a sequence of simple random variables| | ∞X taking values| | in I∞so that n The following statements are equivalent X X, E(X ) E(X), E f(X ) E f(X) as well as Ef(X ) Ef(X). n → n → | n |→ | | n → (i) f is convex over I, • Proof: As in approximating any random variable by a sequence of simple ran- (ii) E(f(X)) f(E(X)) simple r.v. X on I, R i i+1 dom variables, we subdivide the real line by considering intervals [ 2n , 2n ) • ≥ ∀ n i i+1 (iii) E(f(X)) f(E(X)) r.v. X on I with E X < . for i = 0, 1, 2 ,n2 to cover [0,n) and considering the intervals [ 2n , 2n ) for • ≥ ∀ | | ∞ i = 1, 2, ···, n2n to cover the interval [ n, 0). We need only consider those subintervals− − ··· that− intersect with I and ignore− the rest. Now over each such subin- Proof: It is clear that (iii) implies (ii) since a simple random variable automat- ically has a finite expectation. Proposition (4.2.1) gives that (ii) implies (i). The terval the value of X is approximated by a value taken by Xn as follows. Since i i+1 statement (i) implies (iii) is known as Jensen’s inequality, which we now prove. f is a continuous function, over the closed interval [ n , n ], let an be the point | | 2 2 So, let f be a convex function and let X be a random variable taking values in I of minimum of f(t) and let Xn = an over this subinterval. When X falls in the | | with E X < . Since Ef(X) is always well defined, (cf. Exercise (4.2.2)) and interval [n, ), even though the minimum f(t) may not exist, inft [n, ) f(t) is | | ∞ still a finite∞ value. (Draw all five possible| shapes| of f and then all∈ eight∞ | or| so Ef −(X) < , the only possibility is that Ef(X) = , in which case we have nothing to prove.∞ So, assume that E f(X) < . Take∞ a sequence of simple ran- corresponding shapes of f(t) . There are only two shapes, corresponding to those | | ∞ f which are always nonnegative| | and monotone and convex and the decreasing side dom variables Xn taking values in I so that Xn X, and E(Xn) E(X) and Ef(X ) Ef(X) (cf. Proposition (4.2.2)). Note→ that → has a finite asymptote, these shapes are the ones where the minimum of f(t) over n → [n, ) or ( , n] does not occur at t = n. For all the rest of the shapes, and ∞ −∞ − ± f(E(X)) = f lim E(Xn) , by construction of X, large enough n, we have that f(t) will have its minimums at t = n.) Consider n | | ± →∞ the rest of the six shapes first. 
In these cases, take Xn = n when X n and take = lim f (E(Xn)), continuity of f ≥ n Xn = n when X < n. Hence, we see that by construction f(Xn) f(X) for →∞ − − 1 | |≤| | lim E (f(Xn)) , by Proposition (4.2.1), all n. Furthermore, Xn X χ{|X|

E Xn X χ{|X|≥n} = E n X χ{|X|≥n} 2E X χ{|X|≥n} 0. | − | | − | ≤ | | → This finishes the proof.    ♠ Hence, not only Xn X but also E Xn X 0. Furthermore, the continuity of f implies that f(X→ ) f(X). Since| −f(X| →) f(X) and E f(X) < , n → | n |≤| | | | ∞ Remark - 4.2.1 - (Carefull) Now we consider more than one random variable at the Lebesgue dominated convergence theorem gives that E f(Xn) E f(X) . a time. When X, Y have a induced measure on R2, (and without loss of general- Now consider those two shapes in which f(t) 0 and convex| and| either → entirely| | ≥ ity, ignoring the inifinite values of X and Y ) then for any real valued measurable increasing or entirely decreasing. For simplicity consider the case when f is de- function h over R2, the change of variable formula (once again) gives that creasing and let a be the right asymptote. Then consider another convex function g(t) = f(t) a 1. Now this g is one of those considered above and hence we can find − − E(h(X, Y )) = h(X(s), Y (s)) dP (s) = h(x,y) dPX,Y (s, y) simple random variables X X with E(X ) E(X) and Eg(X ) Eg(X), R2 n → n → n → ZS Z which is another way of saying Ef(Xn) Ef(X). But since f 0, it is the same as saying that E f(X ) E f(X) . Similar→ argument takes care≥ of the increasing, =: h(x,y) dFX,Y (x,y), n R2 | |→ | | Z 4.2 Jensen’s Inequality 33 34 Various Inequalities

provided the first integral exists. In particular, if X, Y have finite means then, by the linearity of the Lebesgue integral,

E(X + Y) = E(X) + E(Y).

D Exercise - 4.2.7 - Let X be (any) random variable and define a sequence of discrete random variables,

X_n = j/2^n   if   j/2^n ≤ X < (j + 1)/2^n,

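Jensen's inequality (Proposition (4.2.3)) is easy to check numerically. A minimal sketch, not part of the original notes and assuming Python with numpy is available, for the convex function f(x) = x^2 and X ∼ Exp(1):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.exponential(scale=1.0, size=1_000_000)   # X ~ Exp(1), taking values in I = (0, infinity)
    f = lambda t: t**2                               # a convex function on I
    print(f(X.mean()))        # f(E X)  ~ 1
    print(f(X).mean())        # E f(X)  ~ 2, so E f(X) >= f(E X), as Jensen's inequality asserts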
HW3 Exercise - 4.2.3 - Prove that E|Y| < ∞ if and only if for any constant c > 0, \sum_{n=1}^{\infty} P(|Y| ≥ cn) < ∞. Deduce that P(|Y| ≥ cn i.o.) = 0.

HW4 Exercise - 4.2.4 - (Lyapunov inequality) For any s > 1, deduce from the Hölder inequality that E(|X|) ≤ (E|X|^s)^{1/s}. And deduce the Lyapunov inequality

(E|X|^r)^{1/r} ≤ (E|X|^s)^{1/s},   for any 0 < r < s.

HW5 Exercise - 4.2.5 - Let X, Y be two nonnegative random variables and let p ≥ 0 be a number. Prove the following:

E(X + Y)^p ≤ E(X^p) + E(Y^p)  if p ∈ [0, 1],   and   E(X + Y)^p ≤ 2^{p-1} (E(X^p) + E(Y^p))  if p > 1.

HW6 Exercise - 4.2.6 - If X is a nonnegative random variable with distribution F and finite variance then show that

E(X^2) = 2 \int_0^{\infty} x (1 - F(x)) \, dx.

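The identity in Exercise (4.2.6) can also be checked numerically. A short sketch (not part of the notes; it assumes numpy, and the choice X ∼ Exp(2) is only an illustration) compares 2/λ² with the tail integral:

    import numpy as np

    lam = 2.0                              # X ~ Exp(lam): E(X^2) = 2/lam^2 = 0.5
    x = np.linspace(0.0, 40.0, 400001)
    tail = np.exp(-lam * x)                # 1 - F(x) for the exponential distribution
    y = x * tail
    rhs = 2.0 * float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))   # trapezoid rule
    print(2.0 / lam**2, rhs)               # both are 0.5, up to discretization error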
a function

f(x) = \sum_{n=1}^{\infty} b_n \chi_{[a_n, \infty)}(x),   x ∈ R.

It is easy to see that f is non-decreasing and the series is uniformly convergent. The following proposition implies that all points of discontinuity of f are of the first kind.

Lecture 5

Classification of Distributions

Proposition - 5.0.4 - (Basic facts for nondecreasing functions) Let f be non-decreasing on an interval I ⊆ R. Then the following results hold.

• (First kind discontinuities) All points of discontinuity, if any, of f are of the first kind. Therefore, we may write

f(x−) = \sup_{t < x} f(t);   f(x+) = \inf_{t > x} f(t).

(Discontinuities are countable) The set D I over which f is discontin- • uous is at most countably infinite. ⊆ First, let us recall some elemantry facts about non-decreasing functions and left + and right limits. (Continuity) f is continuous at x if and only if f(x−) = f(x) = f(x ). • ′ Definition - 5.0.2 - (Discontinuties of type I and type II) Let f be any real Proof: Since f is non-decreasing, for any 0 < h < h R + valued function over an interval I . For any x I, the symbol f(x ) and f(x−) ′ are defined as ⊆ ∈ f(x) f(x + h ) f(x + h). ≤ ≤ + f(x ) := lim f(x + h), f(x−) := lim f(x + h), That is, f(x + h) is decreasing when h is decreasing and it is bounded below by h 0 h 0 ↓ ↑ f(x). Thus the limit must exist. Similar argument goes for the left limit. when the limits exist and are called the right limit and left limit of f at x respec- Any non-decreasing function has discontinuities only of the first kind (i.e., jump tively. We say f is right continuous at x if f(x+) = f(x) and f is left continuous discontinuities). Note that when f is bounded, at x if f(x−) = f(x). (If I =[a, b], then by default f is right continuous at b and + D = ∞ x I : f(x ) f(x−) 1/n = ∞ D . left continuous at a.) The quantities ∪n=1 ∈ − ≥ ∪n=1 n  + + Each Dn is a finite set, since, it consists of those points where f has a jump of at f(x ) f(x), f(x) f(x−), f(x ) f(x−) − − − least 1/n. (If Dn had infinite points, adding all these jumps would become infinite are called the right jump, left jump and the jump of f at x respectively. which would violate the boundedness of f). Hence, D must be countable. + When f is not bounded, let E =[ n,n] and let D be the points of disconti- A point x I is called a point of discontinuity of the first kind if f(x ) and n − n ∈ nuities of f which are in En. Since f, when restricted on the inverse image of En , f(x−) both exist and f is not continuous at x. All other points of discontinuity are considered to be of the second kind. is a bounded non-decreasing function, Dn is a countable set. Taking union over all Dn gives that D must be countable as well. For example, let f be the Dirichlet function on [0,1], i.e., Finally, for any ǫ > 0, we have f(x ǫ) f(x) f(x + ǫ). Letting ǫ 0 gives + − ≤ ≤ ց that f(x−) f(x) f(x ). So, if f is continuous at x then the left side must equal ≤ ≤ 1 if x is rational, x [0, 1], the right side. Conversely, when the left side equal the right side, then f must be f(x) := ∈ 0 otherwise. continous at x.  ♠

Then all points are points of discontiuity of the second kind. For χ 1 (t) the point R 2 Proposition - 5.0.5 - Let f1 and f2 be non-decreasing functions over . If over 1 { } R t = 2 is a point of discontinuity of the first kind. a dense subset D , f1(x) = f2(x) for x D, then f1 and f2 must have the As a second example, let a be an enumeration of the rationals and let b same points of jump⊆ (if any) and f (x) = f ∈(x) for all x at which f and f are { n} { n} 1 2 1 2 be a sequence of non-negative numbers so that the series n bn converges. Define continuous. P Classification of Distributions 37 38 Classification of Distributions

Proof: Let x R be fixed and let tn x, then HW7 Exercise - 5.0.10 - Let F be a nondecreasing function which is bounded so that ∈ ր a F (x) b for all x R. Prove that for any ǫ > 0, the total number of points of ≤ ≤ ∈ f1(x−)= lim f1(tn)= lim f2(tn) = f2(x−). jump of F having jump size greater than ǫ is no more than (b a)/ǫ. tn x tn x ↑ ↑ − + + Exercise - 5.0.11 - (Specification over a dense subset suffices) Let f be a Similarly, we see that f1(x ) = f2(x ). Hence, we see that non-decreasing function defined over a dense subset D R. Let + + ⊆ f1(x ) f1(x−) = f2(x ) f2(x−). − − ∼f (x)= inf f(t); x R. t D; t>x ∈ ∈ So, f1 and f2 must have the same points of jump and at the points of continuity we have f1(x) = f2(x). Show that ∼f (x) f(x) on D, and it is non-decreasing and right continuous on R. ♠ ≥ Remark - 5.0.2 - (Normalization of monotone functions) We cannot say any- HW8 Exercise - 5.0.12 - Let f be a nondecreasing function defined over a dense subset thing about the values of the functions over the points of jump. An easy example D and let ∼f be its nondecreasing, right continuous version over R as in Exer- is to take f1(x) = χ[0, )(x) and let f2(x) take value zero over ( , 0) and value cise (5.0.11). If f is uniformly continuous on D then show that ∼f is also uniformly ∞ −∞ one over (0, ). And f2(x) could take any fixed value between zero and one. continuous on R. By an example show that the assumption of uniform continuity ∞ There are three commonly used ways of rectifiying misbehavior at points of of f on D cannot be relaxed to just continuity on D. jump (called normalizing) of a non-decreasing function. If f is a non-decreasing function, then we define The following is one of the main results of this section which says that any distribution can be uniquely decomposed into a convex combination of a discrete (i) f (x) = f(x−), (Chinese & Eastern European way). • R and a continuous distribution. + (ii) f (x) = f(x ), (American & Western European way). Let a1, a2, be an enumeration of all the points of discontinuities of a • A { ···} + nondecreasing right continuous function F along with their corresponding jumps, f(x ) + f(x−) (iii) fF (x) = , (Applied mathematicians way). • 2 b := F (a ) F (a−) > 0; j = 1, 2, . j j − j ··· Proposition - 5.0.6 - (Right continuous normalization) If f(x) is a non-decreasing Define functions D and C as follows: function then fA(x) is a non-decreasing and right continuous function. ∞ R D(x) := bj χ[aj , )(x), C(x) := F (x) D(x), x . Proof: If x 0 small enough so that x + ǫ

+ (a) C(x) and D(x) are non-decreasing functions, We already know that fA(x) fA(x ) since fA is non-decreasing. Now, for any • ǫ > 0, ≤ (b) D is right continuous and C is continuous. + + • fA(x ) fA(x + ǫ) = f((x + ǫ) ) f(x + 2ǫ). (c) The above decomposition F (x) = C(x) + D(x) is unique. ≤ ≤ • The first inequality follows by the fact that fA is non-decreasing, the middle equality (d) Every non-trivial (non-constant) bounded non-decreasing right continu- is just the definition of fA and the last inequality follows by the fact that f is non- • ous function F can be written as decreasing. Letting ǫ drop to zero gives that F (x) = αFd(x) + (F (+ ) α)Fc(x), + + fA(x ) f(x ) = fA(x). ∞ − ≤ for an α [0, F (+ )], where F is a discrete and F ∈ ∞ d c So, fA is right continuous. is a continuous probability distribution. ♠ Classification of Distributions 39 40 Classification of Distributions

Proof: D is a non-decreasing function since x

C(x) = F (x) D(x) F (y) D(y) = C(y). 0 = K(t) K(t−) C(t) C(t−) − ≤ − − − − = K(t) C(t) K(t−) C(t−) { − } −  − So, C is also a non-decreasing function as well. Next, each χ[aj , )(x) is a right ∞ = D(t) ∆(t) D(t−) ∆(t−) continuous function for each fixed j. The series { − } −  − = D(t) D(t−)  ∆(t) ∆(t−) = 0. ∞ − − − 6 D(x) := bj χ[aj , )(x)   ∞ This contradiction proves the result. To prove the last part, we already know that j=1 X F (x) = D(x) + C(x). If D( ) = α (0, F ( )), then we may write ∞ ∈ ∞ converges uniformly. Therefore, D(x) is right continuous. Hence, C(x) = F (x) − D(x) C(x) D(x) is also right continuous. Now we show that C is left continuous. For this, F (x) = α + (F ( ) α) = αF (x) + (F ( ) α)F (x). α ∞ − F ( ) α d ∞ − c just note that for any x R, we have   ∞ − ∈ If D( ) = 0, then we may take α = 0 in this case and take 0 if x = aj for any j, ∞ F (x) F (x−) = D(x) D(x−) = 6 − − bj if x = aj for some j.  C(x) F (x) = F ( ) = F ( ) Fc(x), Hence, we for any x R we have, ∞ F ( ) ∞ ∈ ∞ since, D(x)=0 for all x. And if D( ) = F ( ) then we may take α = F ( ) and C(x) C(x−) = F (x) F (x−) D(x) D(x−) = 0. − − − − ∞ ∞ ∞ D(x) To prove the uniqueness of the above decomposition suppose there exists a contin- F (x) = α = αF (x), α d uous function K(x) and another function ∆(x) of the type   since C(x)=0 for all x. ∞ ♠ ∆(x) := βj χ (x), βj < , [αj ,∞) | | ∞ j=1 j X X HW9 Exercise - 5.0.13 - Let F be a distribution function and let aj be the collection { } where βj are not zero and αj is a sequence of real numbers and of all (if any) points of jump. Prove that R F (x) = K(x) + ∆(x); x . lim F (aj ) F (aj−) = 0. ∈ ǫ 0 { − } ↓ aj : aj (x ǫ,x) We want to prove that K must necessarily be C and ∆ must be D. Suppose X∈ − D(x) = ∆(x) for some x R. Then atmost one of the following two possibilities Does the conclusion remain true if the interval (x ǫ, x) is replaced by (x ǫ, x] in 6 ∈ may occur the above sum? − −

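The decomposition F = D + C described above is easy to carry out for a concrete example. A small sketch (not from the notes; it assumes numpy, and the mixed distribution below is a hypothetical choice) builds D from the list of jumps and checks that C = F − D has no jumps:

    import numpy as np

    # A hypothetical mixed distribution: point mass 1/2 at 0 plus half of a Uniform(0,1).
    def F(x):
        x = np.asarray(x, dtype=float)
        return 0.5 * (x >= 0) + 0.5 * np.clip(x, 0.0, 1.0)

    jumps = [(0.0, 0.5)]                 # pairs (a_j, b_j) with b_j = F(a_j) - F(a_j-)

    def D(x):                            # discrete part built from the jumps
        x = np.asarray(x, dtype=float)
        return sum(b * (x >= a) for a, b in jumps)

    x = np.linspace(-1.0, 2.0, 300001)
    step = lambda v: float(np.max(np.diff(v)))   # largest increment on the grid
    print(step(F(x)))            # about 0.5 : F jumps at 0
    print(step(D(x)))            # about 0.5 : D carries that jump
    print(step(F(x) - D(x)))     # about 0   : C = F - D is continuous (and non-decreasing)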
The set a1, a2, = α1,α2, . That is, there is a point t which is in HW10 Exercise - 5.0.14 - Let F be a distribution function. A point x is called a point of • one set and{ not the· · · }other 6 { so that···}D(t) D(t ) = ∆(t) ∆(t ). Note that the − − support of F if for every ǫ > 0 we have F (x + ǫ) F (x ǫ) > 0. The collection left and the right limits of ∆ must exist− since6 its series− converges uniformly of all such points gives the support of F . Show that− if F −is a non-decreasing right and χ[αj , )(x) are non-decreasing. ∞ continuous function then the following results hold. The two sets are equal, i.e., a , a , = α ,α , . In this case relabel 1 2 1 2 1. Each point of jump of F belongs to the support of F . • α so that both D and ∆ have{ the···} same points{ of···} jump but for some t, the { j} jump sizes are different. That is, D(t) D(t−) = ∆(t) ∆(t−). 2. Each isolated point of support of F is a point of jump of F . − 6 − 5.1 Absolute Continuity & Singularity 41 42 Classification of Distributions

3. The support of a distribution function is a closed set. Definition - 5.1.2 - (Absolutely continuous distributions) Let F,G be any R 4. Provide an example of a discrete distribution whose support is the whole real two distributions over . We say G is absolutely continuous with respect to F if for any ǫ > 0 there exists a δ > 0 such that for any finite collection of disjoint line. n subintervals (xi,yi), i = 1, 2, ,n, of its domain with i=1(F (yi) F (xi)) < δ 5. The support E of a continuous distribution is a perfect set (i.e., each point implies that ··· − of E is a limit point of E). n P G(y ) G(x ) < ǫ. | i − i | i=1 X 5.1 Absolute Continuity & Singularity When G is absolutely continuous with respect to F we denote this by G F . When we take F (x) = I(x) = x, and G I often we simply say that G is≪ absolutely To see how to decompose a distribution into “smooth” and “non-smooth” ones we continuous without mentioning that≪ it is this I(x) = x that we used.3 need to introduce a new concept. Definition - 5.1.1 - (Absolutely continuous & singular functions) A real- Example - 5.1.2 - (Continuous random variables) All absolutely convergent valued function G defined over R (or a closed interval [a, b]) is called absolutely continuous integrals give rise to absolutely continuous distributions. Suppose g is the density if for any ǫ > 0 there exists a δ > 0 such that for any finite collection of disjoint of a continuous random variable X, with distribution n subintervals (xi,yi), i = 1, 2, ,n, of its domain with total length i=1(yi xi) < t δ implies that ··· − G(t) = g(x) dx. n P Z−∞ G(yi) G(xi) < ǫ. | − | Note that for any xi

Note that for any

Remark - 5.1.4 - (Decomposition to absolutely continuous distributions) = G(y) G(x) Gac(y) Gac(x) Now we spend some time showing how to decompose a continuous distribution − − { − } function into a continuous singular and absolutely continuous components. Recall = G(y) G(x) G′(t) dt − − (x,y] that by Radon-Nikodym theorem, if G is absolutely continuous with respect to Z 0, I(x) = x, then there exists a nonnegative integrable function g so that ≥ ′ ′ we see that Gs(x) is also a nondecreasing function. Since, Gs(x) = G (x) G(t) = g(x) dx; forall t R. a.e. ′ ′ − Gac′(x) = G (x) G′(x) = 0, we see that Gs(x) is singular with respect to ( ,t] ∈ − Z −∞ I(x) = x. (i.e. a singular distribution) If G is a continuous distribution then the So, the function g is the density of G. corresponding Gs will also be continuous but still singular. On the other hand, if G is singular with respect to I(x) = x then G must be Hence, any distribution could be decomposed into its absolutely continuous and concentrated on a set whose length is zero. For instance, any discrete distribution singular parts. Combining with our earlier decomposition results, we see that any must be singular with respect to I(x) = x. We may have a continuous distribution distribution can be decomposed into discrete, absolutely continuous and continuous which is singular, as the above example of Cantor function shows. How do we singular components. Such a decomposition becomes unique if we insist that the separate a singular portion from its distribution? component distributions be probability distributions. To construct this, let G be any probability distribution. Its a fact from real variables theory, that G(x) is always differentiable a.e.[I], where I(x) = x, with a HW11 Exercise - 5.1.2 - Show that if the support of a distribution G has length (Lebesgue measure) zero then G is singular. Give an example of a singular distribution whose nonnegative derivative G′(t) = g(t). Now for any x

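The Cantor function mentioned above is the standard example of a continuous distribution that is singular with respect to I(x) = x. A small sketch (an illustration, not part of the notes) evaluates it by following ternary digits and confirms that it is non-decreasing and flat on the removed middle thirds:

    def cantor(x, depth=48):
        """Approximate Cantor function value at x in [0,1] by following ternary digits."""
        x = min(max(float(x), 0.0), 1.0)
        y, scale = 0.0, 0.5
        for _ in range(depth):
            if x < 1.0 / 3.0:
                x = 3.0 * x
            elif x > 2.0 / 3.0:
                y += scale
                x = 3.0 * x - 2.0
            else:
                return y + scale          # x lies in a removed middle third: flat piece
            scale /= 2.0
        return y

    xs = [i / 200.0 for i in range(201)]
    vals = [cantor(t) for t in xs]
    print(all(b >= a for a, b in zip(vals, vals[1:])))          # non-decreasing
    print(cantor(0.0), cantor(1/3), cantor(0.5), cantor(1.0))   # 0.0, 0.5, 0.5, ~1.0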
Exercise - 5.1.7 - (Integration by parts revisited) Let f and g be two non- called a bivariate distribution it must obey the extra condition that its G-area (an decreasing functions over R with the corresponding generated Borel measures F analog of our concept of G-length) of a rectangle (a, b] (c, d], defined as follows and G respectively. For any

+ is nonnegative. If G is a probability distribution, the G-area is the probability of f(x−)dG(x) + g(x )dF (x) = F (b)G(b) F (a)G(a), (a,b] (a,b] − the rectangle. Show that all of the following functions are distributions and give Z Z the same G-area of any rectangle [a, b] [c, d]. × where E fdG(x) stands for the integral with respect to the measure represented by G. Then deduce that G1(x,y) = xy R G2(x,y) = xy + x + f(x )dG(x) + g(x−)dF (x) = F (b)G(b) F (a)G(a), G (x,y) = xy + y − 3 Z(a,b] Z(a,b] G4(x,y) = xy + x + y and hence deduce that G5(x,y) = xy + x + y + 13.

F (b)G(b) F (a)G(a) Is any of them a probability distribution? Show that if G(x,y) is a distribution − + + f(x ) + f(x−) g(x ) + g(x−) then, for any fixed constant k and a fixed interval A, the new function = dG(x) + dF (x). (a,b] 2 (a,b] 2 Z Z H(x,y) := G(x,y) + G-area(A (k,y]) × Exercise - 5.1.8 - (Functions of bounded variation & Jordan decompo- sition) A real-valued function f is called of bounded variation and denoted as is also a distribution and gives the same H-area to any rectnagle as G does. (Hence f BV , if f = F G for some nondecreasing functions F,G. Let = a = H is another “version” of G.) Finally, if G is a bounded bivariate distribution ∈ − P { explain why, without effecting the G-areas of rectangles, we may conveniently define x0,x1,x2, ,xn = b be a partition of [a, b]. If f is a real valued function over [a, b], we define··· } a version of G to be 2 n n H(x,y) := G-area(( ,x] ( ,y]); (x,y) R . + + −∞ × −∞ ∈ (f) := f(xi) f(xi 1) , −(f) := f(xi) f(xi 1) −, and P { − − } P { − − } i=1 i=1 X X n + (f) := i=1 f(xi) f(xi 1) , where x , x − stand for the positive and the P | − − | {+ } { } negative parts of x. Note that (f) = (f) + −(f). The positive, negative and total variationP of f are respectivelyP definedP by P

+ + V (f) := sup (f),V − (f) := sup −(f),V (f) := sup (f). [a,b] P [a,b] P [a,b] P P P P (a) Prove the following decomposition of f known as the Jordan decomposition.

+ f(x) = f(a) + V (f) V − (f); x [a, b], [a,x] − [a,x] ∈ and the two functions on the right are the positive and the negative variations of f. + (b) Is it true that V[a,x](f) and V[−a,x](f) are right continuous? (c) Show that f BV [a, b] if and only if V[a,b](f) < . (d) Can any comparison∈ be made between the spaces∞BV [a, b] and BV (R)?

Exercise - 5.1.9 - (Distributions on Rk) Consider R2 for simplicity. If G(x,y) is non-decreasing and right continuous in both of its variables in order for it to be 50 Conditional Distributions

This property may fail when j,k are not integers. For instance, j = k = 1/2 gives a counter example. The above notions involving probabilities carry over to densities. Suppose (X, Y ) have a discrete bivariate density, f(x,y). The conditional density of X given Y = y is taken to be

f_{X|Y=y}(x) = \frac{f_{X,Y}(x, y)}{f_Y(y)},   x ∈ R,

Lecture 6

both in the discrete and the continuous cases. A conditional density acts just as Conditional Distributions any usual density, as a function of x. However, as a function of y it does not act like a density. The Bayes theorem in density format is

fX Y =y(x) fY (y) fY X=x(y) = | , x R, When A, B are two events, the conditional probability of A given B, denoted as | fX (x) ∈ P(A B), is define as | The theorem of total probablity, in density format, is P P (A B) P (A B) = P ∩ , when (B) > 0. ∞ | (B) fX (x) = fX Y =y(x) fY (y) dy | Z−∞ The Bayes’ theorem shows how to flip the positions of A, B. That is, Example - 6.0.6 - (Conditional densities in the bivariate normal case) Let 2 2 P(A B) P(B) (X, Y ) BV N((µ1, µ2), (σ1,σ2),ρ). To make our life a bit simpler, let us first take P(B A) = | , when P(A) > 0, P(B) > 0. µ = µ∼= 0 and σ = σ = 1. In this case, the marginal density of Y is obtained | P(A) 1 2 1 2 by simply integrating out the unwanted variable x. That is,

The theorem of total probability (TTP) states that if C , C , , forms a partition 2 2 1 2 ∞ 1 ∞ x 2ρxy + y ··· − − of the sample space then fY (y) = f(x,y) dx = exp 2 dx 2π 1 ρ2 ( 2(1 ρ ) ) Z−∞ − Z−∞  −  (y2 (ρy)2)/(2(1 ρ2)) ∞ e− − − p∞ 1 2 2 P(A) = P(A Ci) P(Ci), when P(Ci) > 0 for all i. = exp − x 2ρxy + (ρy) dx | 2π 1 ρ2 2(1 ρ2) − i=1 − Z−∞  −  X y2/2   e− p 1 ∞ 1 2 = exp − 2 [x ρy] dx Example - 6.0.5 - (Memory less property) For X Geometric(p), prove that √2π 2π(1 ρ2) 2(1 ρ ) − ! P P ∼ − Z−∞  −  (X j + k X j) = (X k), for any nonnegative integers j,k. Does this 2 ≥ | ≥ ≥ e y /2 p property hold when j,k are not necessarily integers? = − . √2π When j,k are integers, we have Here the we used the fact that the area under the normal density N(ρy, (1 ρ2)) − P(X j + k X j) is one. Hence, we see that Y N(0, 1). Therefore, the conditional density of X P(X j + k X j) = ≥ ∩ ≥ ∼ ≥ | ≥ P(X j) given Y = y is P ≥ (X j + k) 1 1 2 2 = ≥ exp − 2 x 2ρxy + y P f(x,y) 2π√1 ρ2 2(1 ρ ) (X j) − − − ≥ fX Y =y(x) = = 2 j+k n 1 y /2 o (1 p) | fY (y) e−   = − √2π (1 p)j 1 1 − = exp − [x ρy]2 . = P(X k). 2π(1 ρ2) 2(1 ρ2) − ≥ −  −  p Conditional Distributions 51 52 Conditional Distributions

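The conclusion of Example (6.0.6), that X given {Y = y} is N(ρy, 1 − ρ²) in the standard bivariate normal case, can be checked by simulation. A sketch (not in the notes; it assumes numpy, and conditioning is approximated by a small window around the value v0):

    import numpy as np

    rng = np.random.default_rng(1)
    rho, v0, n = 0.6, 1.0, 2_000_000
    V = rng.standard_normal(n)
    U = rho * V + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)   # (U, V) ~ BVN((0,0),(1,1),rho)
    near = np.abs(V - v0) < 0.02                                   # condition on {V close to v0}
    print(U[near].mean(), rho * v0)           # conditional mean     ~ rho * v0
    print(U[near].var(),  1.0 - rho**2)       # conditional variance ~ 1 - rho^2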
Hence, we see that the conditional density of X given Y = y is N(ρy, (1 ρ2)). 6.1 Conditional Expectations 2 2 − More generally, when (U, V ) BV N((µ1, µ2), (σ1 ,σ2),ρ), the reader can verify ∼ 2 A conditional expectation is just an ordinary expectation but obtained from a con- (along the above lines) that the marginal density of U is N(µ1,σ1), the marginal 2 ditional density. For instance, in the case of continuous densities, density of V is N(µ2,σ2), and the conditional density of U given V = v is

∞ ρσ1 2 2 U V = v N(µ + (v µ ),σ (1 ρ )) (0.1) E(X Y = y) = x fX Y =y(x) dx, (conditional expectation), 1 2 1 | |{ } ∼ σ2 − − | Z−∞ 2 ∞ 2 Remark - 6.0.5 - E(X Y = y) = x fX Y =y(x) dx, (conditional second moment), It should be kept in mind that the conditional density is defined, | | in the continuous random variable case, when the given event has zero probability. Z−∞ V ar(X Y = y) = E(X2 Y = y) (E(X Y = y))2 , (conditional variance). We should not use the usual rules of conditional probability on the conditioning | | − | events in such situations. If we still apply those rules, we may run into contradictory In the discrete case, we use summations instead of integrals. results. Such contradictions are called Borel paradoxes. Here is one such paradox. Remark - 6.1.1 - iid So far we have defined conditional distributions and conditional Example - 6.0.7 - (A Borel paradox) Let X, Y N(0, 1). Let us find the expectations for discrete or continuous cases. To define conditional expectations conditional density of X given X = Y . (Note that P∼(X = Y ) = 0.) { } for the general case, and to stay away from Borel paradoxes, one really needs Method 1. Let Z = X Y and note that we want to find the conditional density the concept of differentiation on abstract spaces. In particular, a theorem due to of X given that Z = 0 .− This requires us to find M. Johann Radon and Otton Nikodym. For the time being we will bypass this { } detail. fX,Z(x, 0) fX Z=0(x) = . | fZ (z) Example - 6.1.1 - (Convolutions) Here we show that sometimes the theorem of total probability can also be used for obtaining new distributions. If X, Y are two It is not difficult to see that Z N(0, 2). The absolute value of the jacobian is 1. ∼ (say continuous) random variables with a joint density f(x,y), and the marginal The joint density of X and Z is density of Y is denoted as g(y), then the theorem of total probability can be used

1 x2/2 1 (x z)2/2 to find the density of their sum, Z = X + Y . Indeed, f (x,z) = e− e− − . X,Z √ · √ 2π 2π ∞ P(Z z) = P(X z Y ) = P(X z y Y = y) gY (y) dy 1 x2 1 ≤ ≤ − ≤ − | Thus, the required conditional density is fX Z=0(x) = e− , which is N(0, ). Z−∞ | √π 2 Differentiating with respect to z gives the density of Z (which is a bit more general Method 2. Consider the random variable Z = X/Y . Now X = Y if and only if version of convolution) Z = 1. We now want to find the conditional density of X given X = 1 , { Y } ∞ ∞ fX,Z(x, 1) fZ (z) = fX Y =y(z y) gY (y) dy = f(z y,y) dy. fX Z=1(x) = . | − − | fZ (1) Z−∞ Z−∞ By using the change of variable method, the absolute value of the jacobian of the If the rvs are discrete, the integral is replaced by a summation symbol. If X, Y transformation is x /z2. Therefore, are independent with their respective densities, h(x) and g(y), then the resulting | | density, fZ, is the convolution of h and g. Now the distribution of W = X Y can 2 −2 − x x 1 x (1+z )/2 be obtained along the above lines as well. What about the distribution of R = X fX,Z(x,z) = | | fX,Y (x,x/z) = | | e− . Y z2 z2 2π or S = XY ? The same theorem of total probability can be used to obtain Integrating out x, the marginal density of Z is Cauchy(0, 1). Thus, the required ∞ ∞ 1 conditional density is f (z) = y f(zy,y) dy, f (z) = f(z/y,y) dy. R | | S y 1 x2 Z−∞ Z−∞ | | x 2π e− x2 fX Z=1(x) = | | = x e− , | 1 So far we have seen the usual four basic operations of algebra +, , and 2π | | performed on random variables and how to obtain their corresponding distributions.− ÷ × which is not even a normal density. This example underscores the importance of One natural question is, “what about the composition “ ” operation?” Conditioning keeping the conditioning event as specified to avoid arbitrary results. can handle this operation as well. See Example (6.1.3).◦ 6.1 Conditional Expectations 53 54 Conditional Distributions

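The convolution formula of Example (6.1.1) can be verified numerically. The sketch below (an illustration, not part of the notes, assuming numpy) takes X, Y iid Exp(λ), computes ∫ f(z − y) f(y) dy by the trapezoid rule, and compares it with the Gamma(λ, 2) density λ² z e^{−λz}:

    import numpy as np

    lam, z = 1.5, 2.0
    f = lambda t: lam * np.exp(-lam * t) * (t >= 0)      # Exp(lam) density
    y = np.linspace(0.0, z, 200001)
    g = f(z - y) * f(y)
    conv = float(np.sum(0.5 * (g[1:] + g[:-1]) * np.diff(y)))   # trapezoid rule
    exact = lam**2 * z * np.exp(-lam * z)                       # Gamma(lam, 2) density at z
    print(conv, exact)                                          # the two values agree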
Remark - 6.1.2 - (Conditional expectation & conditional variance ran- HW13 Exercise - 6.1.2 - Let X Exp(λ). Compare P(X x+y X x) with P(X y), dom variables) It should be noted that E(X Y = y) is a real number. How- when x,y > 0. ∼ ≥ | ≥ ≥ ever, this number may change for different values| of y. For instance, E(X Y = | HW14 2 2 3) may be different from E(X Y = 13.4). That is, E(X Y = y) is a func- Exercise - 6.1.3 - Let (U, V ) BV N((µ1, µ2), (σ1 ,σ2),ρ), verify that the marginal | | 2 ∼ 2 tion of y. In this function of y, if we replace the dummy variable, y, by the density of U is N(µ1,σ1), the marginal density of V is N(µ2,σ1), and the condi- random variable Y then we obtain the conditional expectation random variable, tional density of U given V = v is denoted as E(X Y ). This random variable has its own distribution. Similary, ρσ2 2 2 | U V = v N(µ1 + (v µ2),σ2(1 ρ )). the conditional variance random variable V ar(X Y ) is obtained by replacing the |{ } ∼ σ1 − − dummy variable y by the random variable Y in V| ar(X Y = y). Usually we are not interested in the distributions| of the random variables Exercise - 6.1.4 - Let (X, Y ) have the following bivariate density E(X Y ) and V ar(X Y ). However, these two very special random variables have two 3x if 0

ρσ1 P 1 1 2 2 2 Here ties are allowed in the sense that if X1 = X2 then (U = X1 X) = n + n = n . E(X Y = y) = µ1 + (y µ2), Var(X Y = y) = σ1(1 ρ ). (1.2) | | σ2 − | − The conditional distribution of U X is called the empirical distribution function. A better way is to define the distribution|{ } In this case the conditional variance random variable, V (X Y ), is a constant and | the conditional expectation random variable, E(X Y ), is P the number of Xi which are x | FU X(x) = (U x X) = ≤ . | ≤ | n ρσ1 E(X Y ) = µ1 + (Y µ2). Using the empirical distribution, show that | σ2 − n 2 1 2 By Example (6.0.6), since we know that the marginal of Y is N(µ2,σ2 ), we see E(U X) = X , Var(U X) = (X X ) . | n | n i − n that i=1 ρσ1 X E ( E(X Y ) ) = µ1 + (E(Y ) µ2) = µ1 = E(X), | σ2 − Then by using the theorems of total expectation and total variance, verify that which verifies the theorem of total expectation. The theorem of total variance also E(U) = E(X1) and V ar(U) = V ar(X1). From the statistical point of view, the is easily verified since conditional distribution of U given X itself is considered as an estimator of whole F . More precisely, for a fixed real number x, let Yi(x) := 1 if Xi = x and zero E ( V ar(X Y ) ) + V ar ( E(X Y ) ) otherwise, i = 1, 2, ,n. Verify that Y (x), Y (x), , Y (x) iid B(1, F (x)), and | | ··· 1 2 ··· n ∼ 2 2 ρσ1 n = E σ1(1 ρ ) + V ar µ1 + (Y µ2) 1 F (x)(1 F (x)) − σ2 − FU X(x) = Yi(x), E(FU X(x)) = F (x), Var(FU X(x)) = − .   | n | | n  ρσ 2 i=1 = σ2(1 ρ2) + 1 σ2 = σ2 = V ar(X). X 1 − σ 2 1 The quantity  2  sup FU X(x) F (x)

measures the largest discrepency between the two distribut ions, FU X(x) and F (x), Exercise - 6.1.1 - Verify statements (1.2). and plays a basic role in model selection. | 6.1 Conditional Expectations 55 56 Conditional Distributions

Example - 6.1.3 - (Compositions of random variables) Your friend rolls a fair die and tells you the value he observed. Then you toss a coin (with probability of heads being p) that many times. What is the distribution of the number of heads you observed? To model this, let N be the number of heads you observed and let Y be the face value your friend observed. Note that N = X(Y) is a composition of two random variables, where X(n) ∼ B(n, p) and Y ∼ Uniform{1, 2, 3, 4, 5, 6}. By the theorem of total probability

P(N = k) = \sum_{j=1}^{6} P(N = k | Y = j) P(Y = j)
         = \sum_{j=k}^{6} \binom{j}{k} p^k (1 - p)^{j-k} \frac{1}{6}
         = \frac{p^k}{6} \sum_{i=0}^{6-k} \binom{i + k}{i} (1 - p)^i,   k = 0, 1, 2, 3, 4, 5, 6.

Furthermore, by the theorem of total expectation, E(N) = E(E(N | Y)) = E(Y p) = 3.5 p, and by the theorem of total variance

Var(N) = Var(Y p) + E(Y p (1 - p)) = p^2 Var(Y) + 3.5 p (1 - p),   where Var(Y) = 35/12.

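A quick Monte Carlo sanity check of Example (6.1.3) (not part of the notes; it assumes numpy). It estimates E(N) and Var(N) and compares them with 3.5p and p²(35/12) + 3.5p(1 − p), since E(Y) = 3.5 and Var(Y) = 35/12 for a fair die:

    import numpy as np

    rng = np.random.default_rng(2)
    p, reps = 0.3, 1_000_000
    Y = rng.integers(1, 7, size=reps)        # face value of a fair die
    N = rng.binomial(Y, p)                   # number of heads in Y coin tosses
    print(N.mean(), 3.5 * p)                                      # E(N) = E(Y) p
    print(N.var(), p**2 * (35.0 / 12.0) + 3.5 * p * (1.0 - p))    # total-variance formula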
(vi) (Given factor comes out) Any factor determined by the information • on the right side of the vertical line can be treated as a constant: For instance, E(h(Y )g(X, Y ) Y ) = h(Y ) E(g(X, Y ) Y ). | | (vii) (Compression property) • E(E(Z X) X, Y ) = E(E(Z X, Y ) X) = E(Z X). Lecture 7 | | | | | That is, the smaller given information survives. (viii) (Conditional Jensen’s inequality) For any convex function h if • E X < and E h(X) < then Conditional Expectations & | | ∞ | | ∞ E(h(X) Y ) h(E(X Y )). | ≥ | In particular, E( X Y ) E(X Y ) . Martingales | || ≥| | | (ix) (Covariance property) Cov(Y, E(X Y )) = Cov(X, Y ) when variances • are finite. | (x) (Independence property) If X, Y are independent then E(X Y ) = • | 7.1 Properties of E(X Y ) E(X). | (xi) (Projection property) When X, Y have finite variances, among all It should now be self evident as to what the notation • functions h(Y ) with finite variance, E(X Y , Y ) E(X h(Y ))2 | 1 2 − is the least for the function h(Y ) = E(X Y ). must stand for. Indeed, it is a random variable that is a function of Y1, Y2 and is | obtained by computing Let us prove the last two properties. Note that when X, Y are independent, then E(X Y = v, Y = u) | 1 2 fX,Y (x,y) fX,Y (x,y) = fX (x) fY (y), = fX Y =y(x) = = fX (x). | and then replacing v by Y1 and u by Y2 in it. Of course the more general version ⇒ fY (y) This gives E(X Y ) = E(X). For the projection property, add and subtract E(X Y ) E(X Y = v, Y = u, , Y = w) | | | 1 2 ··· n and expand to get is obtained similarly. Here are the main properties. E(X h(Y ))2 = E (X E(X Y )) + (E(X Y ) h(Y )) 2 − { 2 − | |2 − } (i) (Function of given info) E(X Y ) is always a function of Y . = E(X E(X Y )) + E(E(X Y ) h(Y )) + 2E((X E(X Y ))(E(X Y ) h(Y ))). • | − | | − − | | − (ii) (Indicator property) For any indicator random variable χ , where the Now notice something surprising. The TTE and Property (iv) give • B event B is determined by Y , we have E((X E(X Y ))(E(X Y ) h(Y )) − | | − = E (E((X E(X Y ))(E(X Y ) h(Y )) Y )) , TTE, E(XχB ) = E(χB E(X Y )). − | | − | | = E ((E(X Y ) h(Y )) E((X E(X Y ))) Y )) , Prop (iv), | − − | | (iii) (TTE) E(E(X Y )) = E(X). = E ((E(X Y ) h(Y )) (E(X) E(E(X Y ))) Y )) , Linearity, • | | − − | | = E ((E(X Y ) h(Y )) (E(X) E(X)) Y )) , TTE, (iv) (TTV) E(V ar(X Y )) + V ar(E(X Y )) = V ar(X). | − − | • | | = 0. (v) (Linearity on the left side of the vertical line) • So, we see that E(ah(X, Y ) + bg(X, Y ) + c Y ) = aE(h(X, Y ) Y ) + bE(g(X, Y ) Y ) + c. E(X h(Y ))2 = E(X E(X Y ))2 + E(E(X Y ) h(Y ))2 E(X E(X Y ))2. | | | − − | | − ≥ − | where a, b, c are constants. Need say no more! 7.2 Martingales 59 60 Conditional Expectations & Martingales

iid Example - 7.1.1 - Let X ,X , , F , having finite variance. Let S = X + Definition - 7.2.1 - (Martingale) Let Y1, Y2, be a sequence of (finitely or in- 1 2 ··· ∼ n 1 ··· X2 + + Xn. Let us find finitely many) random variables. Let Mn = fn(Y1, Y2, , Yn), n = 1, 2, be a ··· new sequence made up from the Y , Y , , so that E M···< for all n =··· 1, 2, . 1 2 ··· | n| ∞ ··· (i) E(Sn X1), and (ii) E(X1 Sn). We say that the collection, M , M , , forms a martingale if | | { 1 2 ···} E (M Y , Y , , Y ) = M , n = 1, 2, . The first one is easy since n+1 | 1 2 ··· n n ···

Note that the value of Mn is determined by the values of Y1, Y2, , Yn, for each E(Sn X1) = X1 + E(X2) + E(X3) + + E(Xn) = X1 + (n 1)E(X1). ··· | ··· − n. We rephrase this by saying “Mn is an adapted process to Y1, Y2, .” ··· The reader should justify the above steps. The second one is not too hard either, Example - 7.2.1 - (Centered random walks are martingales) Let Y1, Y2, thanks to the above properties of conditional expectation random variable. Indeed, be any sequence of independent random variables with finite means µ = E(Y···), let E(X S ) = Z. Since X are iid, one may not object to noting E(X S ) = n n 1| n i 2| n n = 1, 2, . Let = E(Xn Sn) = Z. Adding these up gives nZ = E(Sn Sn) = Sn. Hence, ··· ··· | Sn | n E(X1 Sn) = Z = . | n M = (Y µ ) + (Y µ ) + + (Y µ ) = (Y µ ), n = 1, 2, . n 1 − 1 2 − 2 ··· n − n i − i ··· i=1 Exercise - 7.1.1 - Consider X, Y with joint density X Note that the value of Mn is determined by the values of Y1, , Yn. Also E Mn < 6(1 y) if 0

7.2 Martingales

Two of the most useful constructs of modern probability theory are martingales and Markov chains. While a martingale is based on conditional expectations, Markov chains are based on conditional probabilities. Although the two concepts are distinct, they have some common ground. Here we collect a few results concerning martingales.

Exercise - 7.2.1 - (Wald's martingale) Let Y_1, Y_2, ⋯ be a sequence of independent and identically distributed random variables such that the moment generating function φ(θ) := E(e^{θ Y_1}) exists in a neighborhood of zero. Fix a real number θ and define

M_n := \frac{\exp\{θ(Y_1 + Y_2 + ⋯ + Y_n)\}}{(φ(θ))^n} = \prod_{i=1}^{n} \frac{e^{θ Y_i}}{φ(θ)}.

Show that M_1, M_2, ⋯ obeys the martingale property.

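Since a martingale has constant expectation, Exercise (7.2.1) implies E(M_n) = E(M_1) = 1 for every n. A quick simulation of this consequence (not from the notes; it assumes numpy and takes normal increments, for which φ(θ) = e^{θ²/2}):

    import numpy as np

    rng = np.random.default_rng(3)
    theta, n, reps = 0.5, 10, 100_000
    Y = rng.standard_normal((reps, n))                    # increments with phi(theta) = exp(theta^2/2)
    S = np.cumsum(Y, axis=1)
    M = np.exp(theta * S - 0.5 * theta**2 * np.arange(1, n + 1))   # Wald's martingale M_1,...,M_n
    print(np.round(M.mean(axis=0), 2))                    # every column average is close to 1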
Exercise - 7.2.2 - (Variance martingale) Let Y , Y , be a sequence of inde- 1 2 ··· pendent and identically distributed random variables such that E(Yn) = 0 and V ar(Y ) = σ2 < . Define a sequence of random variables, n ∞ n 2 M := Y nσ2, n = 1, 2, . n i − ··· i=1 ! X Show that M , M , obeys the martingale property. 1 2 ··· iid Exercise - 7.2.3 - (Random harmonic series) Let U1, U2, B(1,p) repre- sent the zero/one outcomes of a coin toss. Define ··· ∼

X_n := \sum_{i=1}^{n} \frac{2U_i - 1}{i},   n ≥ 1.

Let F_n denote the information regarding U_1, U_2, ⋯, U_n, as a shorthand notation, so that E(W | F_n) stands for the conditional expectation E(W | U_1, U_2, ⋯, U_n). For which value(s) of p does X_1, X_2, ⋯ obey the martingale property with respect to F_1, F_2, ⋯?

Exercise - 7.2.4 - Let Y_1, Y_2, ⋯ be a sequence of iid random variables with E(Y_1) = 0 and mgf E(e^{u Y_1}) = φ(u). Let M_n := e^{u S_n - n \ln φ(u)}, where S_n = Y_1 + Y_2 + ⋯ + Y_n. Show that for each u, {M_n} is a martingale.

HW17 Exercise - 7.2.5 - (Martingale generation technique) Consider Exercise (7.2.4). (i) Formally differentiate M_n with respect to u once to deduce that {S_n} is a centered random walk martingale of Example (7.2.1). (ii) Formally differentiate M_n twice with respect to u to deduce that {S_n^2 - n σ^2} is a variance martingale of Exercise (7.2.2), where σ^2 = Var(Y_1). (iii) By differentiating M_n three times and letting u → 0, obtain an expression for a possible martingale containing an S_n^3 term. Then verify directly.

Exercise - 7.2.6 - Let Y_1, Y_2, ⋯ be iid N(µ, σ²) and let S_n = \sum_{k=1}^{n} Y_k. Find a nonzero value of θ so that M_n := e^{θ S_n} is a martingale.

HW18 Exercise - 7.2.7 - Let Y_1, Y_2, ⋯ be iid N(µ, σ²) and let S_n = \sum_{k=1}^{n} Y_k. For a fixed value of r > 0, find two nonzero values of θ for which M_n := e^{θ S_n - n r} becomes a martingale.

Question (ii) was answered in the last chapter. Now we try to answer question (i). It turns out that the density of Y can be obtained from the density of X in a straightforward way, especially when h is a one-one function. For instance, when X is a continuous random variable with density f_X, then the density of Y is

f_Y(y) = f_X(h^{-1}(y)) \left| \frac{d\, h^{-1}(y)}{dy} \right|

Lecture 8

Independence & Transformations

We will now provide some of the standard techniques of distribution theory. There are three (somewhat unrelated) themes here.

1. The first theme shows a little bit of the normal sampling theory, which is explored in more detail in the following chapter.

2. The second theme deals with the ranks and order statistics, which lie behind the nonparametric side of statistical inference.

3. The third theme provides convolution densities, i.e., the density of a sum (or linear combination) of (usually independent) random variables. Random walks form a rich subclass of this topic. Also this theme leads one to define splines,

In the discrete case the last derivative term is omitted. More generally, if h is a k-one function, we break up the domain of h over intervals where it is one-one, apply the above result over each subinterval and then add up all the terms. More precisely,

Proposition - 8.1.1 - Let X be a continuous random variable taking values in an open set Δ ⊆ R with density f_X, where Δ is the disjoint union of open sets Δ_1, Δ_2, ⋯, Δ_k. If h is a k-one function which is strictly monotone over each Δ_i with continuously differentiable inverse functions h_1^{-1}, h_2^{-1}, ⋯, h_k^{-1}, then the density of Z = h(X) is given by

f_Z(z) = \sum_{i=1}^{k} \chi_{h(Δ_i)}(z) \, f_X(h_i^{-1}(z)) \cdot \left| \frac{d}{dz} h_i^{-1}(z) \right|,   z ∈ h(Δ),

where χ_{h(Δ_i)}(z) = 1 if z ∈ h(Δ_i), and otherwise the whole entry in the summand is taken to be zero.

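Proposition (8.1.1) applied to h(x) = x² with X ∼ N(0, 1) gives the density e^{−y/2}/√(2πy) (the computation appears in Example (8.1.1) below). A numerical cross-check, not part of the notes and assuming numpy, compares the empirical probability of an interval with the integral of this density:

    import numpy as np

    rng = np.random.default_rng(4)
    Y = rng.standard_normal(1_000_000) ** 2             # Y = X^2 with X ~ N(0,1)
    lo, hi = 0.01, 4.0
    y = np.linspace(lo, hi, 400001)
    dens = np.exp(-y / 2.0) / np.sqrt(2.0 * np.pi * y)  # density obtained from the proposition
    area = float(np.sum(0.5 * (dens[1:] + dens[:-1]) * np.diff(y)))   # trapezoid rule
    print(((Y > lo) & (Y <= hi)).mean(), area)          # both are about 0.875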
(i) What is the distribution of the new rv Y = h(X)? Some terms in the summation may become zero if χh(∆i)(z) = 0. (For convenience • of notation we ignore this indicator term.) Dividing by 2ǫ on both sides gives that (ii) What is the expectation of the new rv, E(h(X))? • k 1 1 E P (z ǫ

F(t) As ǫ drops to zero, both δi and τi drop to zero. The left side gives the density of Z and so, z k w τi + δi 1 v fZ (z) = lim fX (h− (z)). ǫ 0 2ǫ i i=1 → X u 1 1 Now note that τi + δi = hi− (z + ǫ) hi− (z ǫ) when hi is increasing in Ii and 1 1 − − τ +δ = h− (z ǫ) h− (z +ǫ) when h is decreasing in I . Consider the increasing t i i i − − i i i case. H(u) H(v) H(w) H(z) 1 1 τ + δ h− (z + ǫ) h− (z ǫ) lim i i = lim i − i − ǫ 0 2ǫ ǫ 0 2ǫ Figure 8.1: Inverse of a Distribution Function. → → ′ ′ 1 1 h− (z + ǫ) + h− (z ǫ) = lim i i − (L’Hopital’s rule) ǫ 0 → 2 ′ d 1 1 The monotonicity of F applied to the left and the far right sides of (1.2) gives that = h− (z), (continuity of h− ). dz i i F (H(U)) F (x). Since the minimum is attained, the left equality part of (1.2) ≤ In the decreasing case we get the negative sign with the derivative. also gives that U F (H(U)). Putting these two facts together gives U F (x). ♠ Hence, U B. We≤ have shown that A B. Conversely, if U B then U ≤ F (x). ∈ ⊆ ∈ ≤ Example - 8.1.1 - In Exercise (2.0.8) the reader showed that Y = X2 χ2 when Hence, x is among all those values of t for which F (t) U. Therefore, ∼ (1) ≥ X N(0, 1). Now we verify this result using the above proposition. Since h(x) = x2 ∼ H(U) = min t : F (t) U x. is a 2-one function, its inverse functions being h1(y) = √y and h2(y) = √y for { ≥ } ≤ y > 0, we see that the density of Y is − This gives that U A, and hence B A. Hence, A = B, making their probabilities equal, namely P(X∈ x) = P(U ⊆F (x)) = F (x), since U is a uniform random 1 d 1 1 d 1 ≤ ≤ f (y) = f (h− (y)) h− (y) + f (h− (y)) h− (y) variable and F (x) [0, 1] for all real values of x. Y X 1 · dy 1 X 2 · dy 2 ∈

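Example (8.1.2) is exactly how random variates are produced in practice. A minimal sketch (not in the notes, assuming numpy) uses the explicit inverse H(u) = −log(1 − u)/λ of F(x) = 1 − e^{−λx} and checks the resulting sample:

    import numpy as np

    rng = np.random.default_rng(5)
    lam, n = 2.0, 1_000_000
    U = rng.uniform(size=n)
    X = -np.log(1.0 - U) / lam          # H(U) for F(x) = 1 - exp(-lam*x), x >= 0
    print(X.mean(), 1.0 / lam)                              # mean of Exp(lam)
    t = 1.0
    print((X <= t).mean(), 1.0 - np.exp(-lam * t))          # empirical vs. exact F(t)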
1 y/2 1 1 y/2 1 HW19 Exercise - 8.1.1 - Let F be a continuous cdf and let H(p) be the inverse function = e− + e − √2π 2√ y √2π 2√y as defined in Example (8.1.2) for p (0, 1). Prove that 1 ∈ (1/2) 2 1 1 y/2 (i) F (H(p)) = p. = y 2 − e− , y > 0. • √π (ii) H(p) is strictly increasing in p and H(F (x)) x, for F (x) (0, 1). 1 1 2 1 • ≤ ∈ This is the density of Gamma( 2 , 2 ) χ(1) since Γ( 2 ) = √π. (iii) H(p) is left continuous. ∼ • Example - 8.1.2 - (From uniform to any F , i.e., Simulation) Let U ∼ The multivariate extension of the above proposition is similar. For simplicity Uniform(0, 1), and let F (x) be a given distribution. If F is invertible, then we mention the bivariate case, since the multivariate case is analogous. Suppose 1 X = F − (U) has the distribution F . This idea works even when F is not one- that U, V be two continuous random variables with joint density f(u, v). Suppose one, if we construct an inverse carefully. Define an “inverse” function we want to find the density of a new random variable X = h(U, V ). One way this is accomplished is by the following few steps. H(u) := inf t : F (t) u . (1.1) { ≥ } Step 1. Introduce another random variable Y = k(U, V ), so that the trans- One can prove that, (i) the infimum is obtained, and (ii) X := H(U) F . (i) Since • formation (h(u, v), k(u, v)) is a one-one mapping. F is a right continuous function, the minimum is always attained.∼ Figure (8.1) Step 2. Find the inverse of the transformation, which we denote by u = explains how the above defined inverse operation works out for various values of • u,v,w and z. (ii) We need to show that P(X x) = F (x), for all real numbers r1(x,y), v = r2(x,y). ≤ x. Since X x = U : H(U) x , let A := U : H(U) x . Define another Step 3. Find the jacobian of the inverse transformation, event B :={ U≤: U} {F (x) . We claim≤ } that A = {B. To prove≤ this} we need to show • { ≤ } ∂r1 ∂r1 that A B and B A. Let U A. Then H(U) x. In other words ∂x ∂y ⊆ ⊆ ∈ ≤ J = ∂r ∂r 2 2 H(U) = min t : F (t) U x. (1.2) ∂x ∂y { ≥ } ≤

8.1 Transformations of Random Variables 67 68 Independence & Transformations

Step 4. Obtain the joint density of X, Y be the formula HW20 Exercise - 8.1.2 - (Separate functions of independent r.vs. are indepen- • dent) Let X, Y be independent random variables. If U = h(X) and V = g(Y ) are g(x,y) = f(u, v) J , where insert u = r (x,y), v = r (x,y). | | 1 2 such that U, V have joint mgfs, prove that U, V are also independent. (Hint: use (In the discrete case Step 3 is not needed and J is ignored.) the independence property of mgfs.) | | Step 5. Finally get the marginal of X, by integrating out (or summing out Exercise - 8.1.3 - (Separate functions of uncorrelated r.vs. are not neces- • in the discrete case) the variable y. sarily uncorrelated) Construct two random variables X, Y that are uncorrelated and functions, h, g, so that U = h(X) and V = g(Y ) are correlated. Example - 8.1.3 - Let us prove that if U, V iid N(0, 1) then X = U + V N(0, 2). Let us introduce a new random variable Y :=∼U V so that the transformation∼ is one-one. Now the inverse transformation is u = (x−+y)/2, v = (x y)/2. This gives 8.2 Sequences of Independent Random Variables that the Jacobian of the inverse transformation is J = 1 . Hence− the joint density 2 Earlier we gave a definition of independence of a sequence of infinitely many events. of X, Y is How do we know that such a sequence exists? Intuitively it does not take much 1 ((x+y)2 +(x y)2)/8 to imagine an infinite sequence of fair coin tosses. But what sort of (Ω, , P) cor- g(x,y) = f(u, v) J = e− − E | | 2 2π responds to such an experiment? Here we present a heuristic argument, which can × 1 x2/4 y2/4 be made as precise as one desires, while gives a bit more general result of an in- = e− e− . 4π finite sequence or independent random variables. Then we present some standard transforms, such as their partial sums of which random walks is a famous example, This shows that X, Y are independent and X N(0, 2), as well as Y N(0, 2). ∼ ∼ rank statistics, record statistics and order statistics. Example - 8.1.4 - (Ratio of standard normals is Cauchy) Finding the density Remark - 8.2.1 - (Infinite sequence of independent events, Rademacher of a ratio of two random variables is very similar to the convolution density. We sequence) If we identify a head by 1 and a tail by 0, then members of Ω must be illustrate this by showing that the density of the ratio of two indepenent standard iid infinite sequences of 0’s and 1’s, denoted as ω = (u1,u2, ). Such a sequence also normals happens to be a Cauchy density. Let X, Y N(0, 1), and let Z = X/Y . gives us a real number, ··· Let V = Y . Now the inverse transformation is y = v∼and x = vz. In other words, u u u ω r := 1 + 2 + 3 + , r [0, 1]. (2.3) v z 1 2 3 J = = v. ↔ 2 2 2 ··· ∈ 0 1 which lies in the interval [0, 1]. Conversely, every number in [0, 1] may be written in

Therefore, the joint density of Z,V is (nonterminating) binary expansion of the above type. Therefore, we may identify

each infinite coin toss experiment to represent a point in [0, 1]. f (z, v) = f (vz) g (v) v . Z,V X Y | | Next note that the subset of Ω that describes the fate of the first n tosses corresponds to an interval in [0, 1] and vice-versa. Therefore, it is natural to consider Integrating out the unwanted variable gives the density of Z, the smallest sigma field containing the subintervals of [0, 1] to be our event space = , i.e., the Borel sigma field over [0, 1]. ∞ E B fZ(z) = v fX (vz) fY (v) dv. Finally, as our model, we insist that the probability function, P that we aim to | | Z−∞ define, must reflect the intuitive nature of the experiment. Namely, In our case, X, Y are standard normals, therefore, (i) Unrelatedness of the coin tosses must be preserved. That is, the fate of • the first n coin tosses should not influence (or be influenced by) the fate of 1 ∞ (1+z2)v2/2 f (z) = v e− dv another batch of m tosses beyond the first n tosses. Z 2π | | Z−∞ (ii) Since the coin is fair, we may identify a “random selection” of a point from 2 ∞ (1+z2)v2/2 • = ve− dv the interval [0, 1] through the outcome of an infinite fair coin toss experiment. 2π 0 Z P 1 ∞ t 2 2 It is not too difficult, then, to show that a function can be defined which satisfies = e− dt, t = (1 + z )v /2, P π (1 + z2) these requirements. In fact, that function assigns the probability to any interval, Z0 1 (a, b) [0, 1] to be its length, = , z R. ⊆ π (1 + z2) ∈ P([a, b]) = P((a, b]) = P([a, b)) = P((a, b)) = b a, 0 a b 1. − ≤ ≤ ≤ 8.2 Sequences of Independent Random Variables 69 70 Independence & Transformations

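A simulation check of Example (8.1.4) (an illustration, not part of the notes, assuming numpy): the empirical cdf of X/Y for independent standard normals is compared with the Cauchy(0, 1) cdf 1/2 + arctan(z)/π:

    import numpy as np

    rng = np.random.default_rng(6)
    n = 1_000_000
    Z = rng.standard_normal(n) / rng.standard_normal(n)         # ratio of independent N(0,1)'s
    for z in (-1.0, 0.0, 1.0, 3.0):
        print(z, (Z <= z).mean(), 0.5 + np.arctan(z) / np.pi)   # empirical vs. Cauchy(0,1) cdf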
iid With this probability space ([0, 1], , P) we can now construct an example of an HW22 Exercise - 8.2.4 - Let U, V Gamma(λ, 1 ). Show that U + V Gamma(λ, 1) E ∼ 2 ∼ ∼ infinite sequence of independent events. As long as the events describe separate 1 iid Exp(λ). (Here you may use the fact that Γ( 2 ) = √π.) Let X, Y N(0, 1), deduce portions of the infinite coin tosses, they will give an infinite sequence of independent that X2 + Y 2 χ2 . ∼ events. ∼ (2) Next, if we define a random variable U : [0, 1] [0, 1] to be the identity function, then U Uniform(0, 1). By (2.3), we also see→ that 8.3 Generating Functions ∼ ∞ u Finally we should mention that there are several types of transforms that are used U = i . 2i in probability theory. The context is a bit different from the one we have presented i=1 X above. Now the aim is not to find the distribution but instead find the expectation. In fact, with the help of this example, we can construct an infinite sequence of inde- These transforms store some features of the distribution of X which may be easier pendent random variables, Y1, Y2, , having any specified distributions F1, F2, to study in the transformed domain than in the original distribution. Typical respectively. The idea is very simple··· and here it is. ··· examples along these lines are as follows. Instead of writing the outcomes of the infinitely many tosses as a sequence, Generating functions, (designed for nonnegative integer valued rvs). write it as a double array as presented in the adjacent figure. That is, we put at • the head of each arrow the outcome of the coin toss. So, u1 goes at the head of the ∞ X k first arrow, u goes at the head of the second arrow, and so on. E(t ) = t P(X = k), t (0, 1]. 2 ∈ k=0   -u1 - X 6 6 6 Laplace transforms, (designed for nonnegative continuous rvs). ? ?   ? ? • 6 u2 6 λX ∞ λx ?- - - - ? E(e− ) = e− f(x) dx, λ (0, ). 6 0 ∈ ∞       ? Z Moment generating functions, or bilateral Laplace transforms, (defined for all Note that this is how we also show that the set of rationals is countable. So, the • varieties of random variables, as long as they exist). The moment generating p positions of the heads of the arrows my be identified as the rationals q , where function (mgf) of X = [X1, ,Xd]′ is the following expectation (provided q = 1, 2, and p = 0, 1, 2, . Considering column wise, we have partitioned it exists for all θ in a neighborhood··· of 0.) the sequence··· of infinite± coin± tosses··· into separate infinitely many infinite subse- θ X θ1X1+ +θdXd quences of fair coin tosses. Each column, say p, gives rise to a uniform random E(e · ) = E e ··· , θ =[θ1, , θd]′. ··· variable, which we many denote as U Uniform(0, 1). Since they are made from p  separate portions (columns), U , U , ∼forms an independent sequence (actually an Fourier-Stieltjes transforms, also called the characteristic functions, (defined 1 2 ··· • iid sequence) of Uniform(0, 1) random variables. Finally, by taking Yi = Hi(Ui), for all varieties of random variables). The characteristic function of X = where H is as defined in (1.1) from F , we see that Y , Y , will be an infinite [X1, ,Xd]′ is defined to be the expectation, i i 1 2 ··· sequence of independent random variables with Y F . ··· i i iθ X iθ X + +iθdXd d ∼ E(e · ) = E e 1 1 ··· , θ R . 
So, for instance, we may define random varibles X,Y,Z on ([0, 1], , P) so that ∈ X N(µ, σ2), Y P oisson(13) and Z Cauchy(0, 1) or a randomB vector X = A useful device to handle sums of independent random variables (and for several [X ∼,X , ,X ] ∼ MV N(µ, V). In some∼ sense ([0, 1], , P) may be taken to be 1 2 d ′ other uses) is the concept of a transform or a generating function. the “mother”··· of all∼ random variables. B Definition - 8.3.1 - iid (Moment generating function) The moment generating function HW21 Exercise - 8.2.1 - Let U, V Exp(λ). Show that U + V Gamma(λ, 2). More (mgf) of X = [X1, ,Xd]′ is the following expectation (provided it exists for all ∼ ∼ iid ··· generally, U + U + + U Gamma(λ,n) when U , U , , U Exp(λ). θ in a neighborhood of 0.) 1 2 ··· n ∼ 1 2 ··· n ∼ θ X θ X + +θdXd E(e · ) = E e 1 1 ··· , θ =[θ , , θ ]′. Exercise - 8.2.2 - If X ,X iid B(n,p), find P (X + X = k) for k = 0, 1, , 2n. 1 ··· d 1 2 ∼ 1 2 ··· The characteristic function of X =[X1, ,X d]′ is defined to be the expectation, Exercise - 8.2.3 - If X, Y are independent with X P oisson(λ), Y P oisson(µ), ··· ∼ ∼ iθ X iθ1 X1+ +iθdXd d find P (X + Y = k) for k = 0, 1, 2 . E(e · ) = E e ··· , θ R . ··· ∈  8.3 Generating Functions 71 72 Independence & Transformations

Example - 8.3.1 - If X N(0,σ2) then, by completing the square, we have (Uniqueness) If two random variables have the same moment generating func- ∼ • tion then they both have the same distribution. θX 1 ∞ θu u2/(2σ2) E(e ) = e e− du σ√2π (Independence) If the joint moment generating function of X, Y , Z−∞ • 2 2 1 ∞ 2 2 2 σ θ /2 (u θσ ) /(2σ ) θ1 X+θ2 Y = e e− − du φ(θ1, θ2) = E(e ), σ√2π  −∞  2 2 Z = eσ θ /2. happens to be the product of the individual moment generating functions of X and Y , Here we used the fact that the area under a normal (or any) density is one. This θ1 X θ2 Y is a useful trick and we will use it whenever applicable. A bit more generally, if E(e ) E(e ), for all (θ1, θ2) in a neighborhood of (0, 0), U N(µ, σ2), then X = U µ N(0,σ2). That is, U = X + µ, and the above result∼ gives the mgf of U N−(µ,∼ σ2) as then the two random variables are independent, and vice-versa. ∼

2 2 E(eθU ) = eµθ+(σ θ /2). Example - 8.3.3 - (Two normal random variables are independent if and only if they are uncorrelated) In general if two random variables are indepen- 2 iθU iµθ (σ2θ2/2) Similarly, the characteristic function of U N(µ, σ ) is E(e ) = e − . dent then they must be uncorrelated (provided their moments exist, of course). ∼ However, the converse does not hold as one can easily construct counter examples Example - 8.3.2 - When X Gamma(λ,α), one can show that the mgf of X is (cf. Exercise (??)). One nice exception exists for normal random variables, for ∼ which the two concepts are equivalent, as we now show. A bit more general version λ α is in the next chapter (cf. Example (??)). E(etX ) = , t<λ. λ t We will find the mgf for (U, V ) BV N((0, 0), (1, 1),ρ). The trick is to use the  −  fact that the area under any normal∼ density is 1. That is, This is easy to prove. Indeed, 1 ∞ 1 (x µ)2 α e− 2σ2 − dx = 1, for any µ R, and any σ > 0. tX λ ∞ xt α 1 λx √ ∈ E e = e x − e− dx 2π Γ(α) Z−∞ Z0  α α The joint mgf of U, V is, λ (λ t) ∞ α 1 (λ t)x = − x − e− − dx (λ t)α Γ(α) 0 E(etU+sV ) −α  Z  λ tx = , t<λ. 1 ∞ e ∞ 1 (λ t)α = esy exp − (x2 + y2 2ρxy) dy dx − √2π 2π(1 ρ2) 2(1 ρ2) − Z−∞ − Z−∞  −  Here we used that the area under the gamma (or any) density is one. In particular, tx x2/2(1 ρ2) 2 2 2 1 ∞ ep e− − (y 2(ρx + s(1 ρ ))y) when X χ(d), its mgf is = exp − − − dy dx ∼ √2π 2π(1 ρ2) R 2(1 ρ2) Z−∞ − Z  −  1 ∞ 2 2 1 (ρx+s(1 ρ2 ))2 tX 1 1 txp x /2(1 ρ ) 2(1−ρ2 ) E e = , for t < . = e e− − e − dx (1 2t)d/2 2 √2π − Z−∞  1 ∞ 1 = exp − x2 (ρx + s(1 ρ2))2 2(1 ρ2)tx dx Remark - 8.3.1 - (Properties of mgf) Some of the nice features of moment gen- √2π 2(1 ρ2) − − − − Z−∞  −  erating functions are listed below. Similar results hold for characteristic functions  1 ∞ (x2(1 ρ2) 2s(1 ρ2)ρx 2(1 ρ2)tx s2(1 ρ2)2)/(2(1 ρ2)) as well. Through out we will take these facts for granted. = e− − − − − − − − − dx √2π Z−∞ (Generation of moments) By differentiation we can derive the moments of • s2 (1 ρ2)/2 1 ∞ 1 2 the random variable from its moment generating function. More precisely, if = e − exp − x 2(sρ + t)x dx θX √2π 2 − φ(θ) = E(e ) is the moment generating function of X then Z−∞   s2(1 ρ2) (sρ + t)2  = exp − + d d2 d3 2 2 E(X) = φ(θ) E(X2) = φ(θ) E(X3) = φ(θ)   dθ dθ2 dθ3 2 2 θ=0 θ=0 θ=0 = exp (t + s + 2tsρ)/2 .

 8.3 Generating Functions 73 74 Independence & Transformations

If (X, Y) ∼ BVN((µ1, µ2), (σ1², σ2²), ρ) then it is easy to see that U = (X − µ1)/σ1 and V = (Y − µ2)/σ2 have joint cdf

P(U ≤ u, V ≤ v) = P(X ≤ µ1 + uσ1, Y ≤ µ2 + vσ2).

Differentiating with respect to u, v shows that (U, V) ∼ BVN((0, 0), (1, 1), ρ). Hence, the mgf of (X, Y) is

Ψ(t, s) = E(e^{tX + sY})
        = E(e^{t(µ1 + σ1 U) + s(µ2 + σ2 V)})
        = e^{tµ1 + sµ2} E(e^{(tσ1)U + (sσ2)V})
        = e^{tµ1 + sµ2} \exp\big( ((tσ1)^2 + (sσ2)^2 + 2 t s ρ σ1 σ2)/2 \big).

Once again we see that X, Y are independent if and only if ρ = 0.

HW23 Exercise - 8.3.1 - If X, Y are independent random variables so that X ∼ N(µ1, σ1²), Y ∼ N(µ2, σ2²), then find the mgf of X + Y. What might be its distribution?

HW24 Exercise - 8.3.2 - Verify that the joint mgf of X̄_n and (X_i − X̄_n), i = 1, 2, ⋯, n, when X_1, X_2, ⋯, X_n iid ∼ N(µ, σ²), is as follows (see also Exercise (??)):

E\big( e^{t \bar{X}_n + \sum_{i=1}^{n} s_i (X_i - \bar{X}_n)} \big) = e^{\mu t + \sigma^2 t^2/(2n)} \, e^{(\sigma^2/2) \sum_{i=1}^{n} (s_i - \bar{s}_n)^2}.

Classify the following statements as true or false, with justifications.

• (i) X̄_n is independent of {(X_i − X̄_n), i = 1, 2, ⋯, n}.
• (ii) X̄_n ∼ N(µ, σ²/n).
• (iii) X̄_n is independent of \sum_{i=1}^{n} (X_i − X̄_n)².
• (iv) X̄_n is independent of \sum_{i=1}^{n} (X_i − X̄_n)³.
• (v) X̄_n is independent of \sum_{i=1}^{n} (X_i − X̄_n)⁴.

j = 1, 2, ⋯, n. Yet in other words,

R_k = \sum_{i=1}^{n} \chi_{\{X_i \le X_k\}} = the number of X_i which are ≤ X_k.

N_1 := \min\{ n ≥ 1 : X_{n+1} ≥ X_1 \}.

For two independent random variables their addition yielded the convolution density. The amount XN1 is the first record value and N1 is the first record time. Future Now consider the operations of minimum and maximum performed on independent record times and record values are defined similarly in a recursive fashion. That is, random variables. What will be their densities? after Nm and XNm , have been defined, let More elaborately, suppose X1,X2, ,Xn are independent and identically dis- ··· Nm+1 := min n > Nm : Xn XN , m = 0, 1, 2, , tributed continuous random variables with cumulative distribution function FX . { ≥ m } ··· Define new random variables, be the (m + 1)-th record time and XNm+1 be the (m + 1)-th record value. The amounts ∆m := Nm Nm 1 are called inter-record times. X(1) := Y1 := min X1,X2, ,Xn , − − { ··· } The aim of this section is to find the distributions of order and rank statistics X := Y := next to the min X ,X , ,X , (2) 2 { 1 2 ··· n} as well as the record times and record values. As we shall see there is a bit of X := Y := second next to the min X ,X , ,X , (3) 3 { 1 2 ··· n} parallelism that exists between the two scenarios: We will show that . . . . The distribution of the ranks does not depend on F , however, the distribution • X := Y := max X ,X , ,X . of the order statistics is heavily dependent on F . (n) n { 1 2 ··· n} Similarly, the distribution of the record times does not depend on F , however, These random variables are called the order statistics1. When n is an odd integer, • the distribution of the record values is heavily dependent on F . the middle of these Y ’s is called the of the sample. When n is an even integer, the median of the sample is taken to be the average of the middle two Y ’s. Note that the average of all the Y ’s is the same as the average of all the X’s. Example - 9.0.4 - (How do record and order statistics arise in real life?) It is quite intuitive to imagine how records are set in a continuing process where Sometimes the order statistics are also denoted as X(i) = Yi, i = 1, 2, ,n. In order to show their dependence on n, we sometimes use the more elaborate··· notation identically distributed experiments are performed, such as the Olympic games or floods or tides or earthquakes. Note that

X(n:i) = Yi, i = 1, 2, ,n. ··· 1 P(N n) = P(N >n) = P(all X ,X , ,X X ) − 1 ≤ 1 2 3 ··· n+1 ≤ 1 The vector of order statistics, D(X):=(X ,X , ,X ), where X = (X ,X , ,X ), ∞ 1 (1) (2) ··· (n) 1 2 ··· n = F (x)n f(x) dx = , n = 1, 2, . provides ranks, denoted as R1,R2, ,Rn, for the original X1,X2, ,Xn. In n + 1 ··· ··· ··· Z−∞ words, R1 is the position of X1 when X1,X2, ,Xn are rearranged in increasing ··· 1 order. Therefore, YR1 = X1. Similarly, R2 is the position of X2 when X1,X2, ,Xn This gives that P(N = 1) = and ··· 1 2 are rearranged in increasing order, and so on. More mathematically, Xj = YRj , 1 1 P(N = j) = P(N j) P(N = j 1) = , j = 2, . For more on order statistics, see the last chapter. 1 1 ≤ − 1 − j(j + 1) ··· Ranks, Order Statistics & Records 77 78 Ranks, Order Statistics & Records
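Because the law of the first record time involves no trace of F, it is easy to check by simulation with any continuous distribution. The sketch below is my own illustration (the horizon, replication count and the two test distributions are arbitrary choices); it estimates P(N_1 = j) for j = 1, 2, 3 and compares with 1/(j(j+1)).

```python
import numpy as np

rng = np.random.default_rng(1)
reps, horizon = 50_000, 200      # horizon: how many future observations are examined

for name, X in [("Exp(1)", rng.exponential(1.0, (reps, horizon))),
                ("N(0,1)", rng.normal(0.0, 1.0, (reps, horizon)))]:
    beats = X[:, 1:] >= X[:, :1]            # events {X_{n+1} >= X_1}, n = 1, 2, ...
    seen = beats.any(axis=1)                # paths on which a record occurs within the horizon
    N1 = beats[seen].argmax(axis=1) + 1     # first record time N_1 on those paths
    for j in (1, 2, 3):
        est = np.mean(N1 == j) * np.mean(seen)   # estimate of P(N_1 = j); truncation bias is tiny
        print(f"{name}:  P(N1 = {j}) ~ {est:.4f}   vs 1/(j(j+1)) = {1/(j*(j+1)):.4f}")
```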

n 1 Note that the density of the first record time, N1, does not depend on F . As we − n 1 i n i 1 will see the distribution of the record value, X does depend on F . − (F (y)) (1 F (y)) − − . N1 − i − ) Order statistics, also, arise quite naturally in real life. For instance, suppose Xi=k   a crane is lifting a tank with the help of a metal chain having n identical links. Now if we let j = i+1 in the second sum, we see that the two sums become identical There is some possibility that, due to the heavy load, the chain might break. We after separating the first term of the first sum. This gives can find the probability distribution of the breaking load of the chain by using the first order statistic Y1. Indeed, the chain when suspended with the load is going n 1 k 1 n k fk(y) = nf(y) − (F (y)) − (1 F (y)) − . to expose all its links with the same weight. The weakest of the links will break k 1 −  −  first and at that moment the chain will break. If X1,X2, ,Xn are the breaking ··· This finishes the proof strengths of the individual links, then the breaking strength of the weakest link is ♠ the first order statistic Y1. Thus, Remark - 9.0.2 - (A faster way of obtaining the density of Yk) There is an P (chain breaks for weight y) = P (Y1 y) = 1 P (Y1 >y) ≤ ≤ − easy (heuristic) way to get the density fk of the order statistics Yk. Just use the = 1 P (X1 > y,X2 > y, ,Xn >y) fact that for h > 0, − n ··· = 1 (1 FX (y)) . − − Fk(y + h) Fk(y) 1 fk(y)= lim − = lim P (y < Yk y + h). Differentiating with respect to y gives the density of the breaking load of the chain: h 0 h h 0 h → → ≤ d n 1 The following argument will be similar for the case when h < 0. The event fY (y) = P (Y1 y) = n(1 FX (y)) − fX (y). 1 dy ≤ − y < Y y + h means that (for a very small h) we must have k 1 of the X’s { k ≤ } − Note that this clearly depends on F . We will see that the distribution of the rank fall below y, one of the X’s must fall in the interval (y,y + h] and the remaining n k of the X’s fall above y + h. By using the we get R1 of X1, however, will not depend on F . Consider the order statistics first. We − will denote the density of Yk by fk or fYk . 1 fk(y) = lim P (y < Yk y + h) iid h 0 h ≤ Proposition - 9.0.1 - → (Distribution of Yk) When X1,X2, ,Xn F continu- k 1 ··· ∼ 1 n!(F (y)) − n k ous random variables, the density of the k-th order statistic, Yk, is as follows: = lim (F (y + h) F (y)) (1 F (y + h)) − h 0 h (k 1)!(n k)! − − n! →  − −  k 1 n k k 1 fk(y) = (F (y)) − f(y) (1 F (y)) − . n!(F (y)) − F (y + h) F (y) n k (k 1)! (n k)! − = lim − (1 F (y + h)) − − − (k 1)!(n k)! h 0 h − − − → Proof: We will first find the distribution of Yk. Its derivative will then give the n! k 1 n k density. = (F (y)) − f(y) (1 F (y)) − . (k 1)!(n k)! − − − Fk(y) = P (Yk y) = P (at least k of X1,X2, ,Xn are y) n ≤ ··· ≤ Remark - 9.0.3 - (Joint Density of Yk, Yj ) The above trick that we used to find n i n i = (F (y)) (1 F (y)) − . the density of Yk goes through without much difficulty for the joint density of two i − (and more) order statistics. Note that for kz); (yz) X k ≤ j Here when i = n, the second term becomes zero since (n i) becomes zero. So, we = P (at least k of the X’s ( ,y) and at most j 1 fall in ( ,z)). see that − ∈ −∞ − −∞ n Now consider a three sided die with probabilities n 1 i 1 n i fk(y) = nf(y) − (F (y)) − (1 F (y)) − ( i 1 − p1 := P (X1 y) = F (y), Xi=k  −  ≤ Ranks, Order Statistics & Records 79 80 Ranks, Order Statistics & Records

p := P (yz) = 1 F (z). 3 1 − 1 P(R1 = i1,R2 = i2, ,Rn = in) = , We roll this die n times and need to find the probability of at least k faces of the ··· n! first type and at most j 1 faces of the first and second types. This is just summing − where (i ,i , ,i ) is a permutation of (1, 2, ,n). the appropriate multinomial probabilities. 1 2 ··· n ··· The vector of order statistics (Y , Y , , Y ) and the vector of ranks (R ,R , j 1 j 1 i 1 2 n 1 2 − − − • ··· n ,Rn) are independent. P (Yk y, Yj >z) = ··· ≤ i,ℓ,n i ℓ i=k ℓ=0  − −  X X Proof: We already know the result of part (i). Recall Xj = YRj , j = 1, 2, ,n. i ℓ n i ℓ ··· (F (y)) (F (z) F (y)) (1 F (z)) − − . Let y 0 so small that the intervals (y ǫ, y + ǫ] do not overlap. Differentiating gives the joint density f (y,z) for y

The joint density of the first two record values, XN1 ,XN2 , is Exercise - 9.0.14 - (Infinite number of records, but finitely many consec- • utive records) Let U , U , be a Bernoulli process and let T ,T , be a record ln(1 F (x)) 1 2 ··· 1 2 ··· − − f(x) f(y), x

Exercise - 9.0.13 - (Bernoulli process & record time process) An infinite sequence of independent and identically distributed random variables U , U , 1 2 ··· is called a Bernoulli process if Uk B(1,p). An infinite sequence of independent random variables T ,T , is called∼ a record time process if T = 1 and if T 1 2 ··· 1 k ∼ 1 iid B(1, k ) for k = 2, 3, . To explain why, let X0,X1,X2, F , where F is a continuous distribution.··· Take T 1 and for k = 2, 3, , let··· ∼ 1 ≡ ··· 1 if Xk is a record i.e.,Xk > max1 i k 1 Xi, T = ≤ ≤ − k 0 otherwise.  Show that T ,T , forms a record time process. 1 2 ··· 84 Fourier Transforms
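Proposition 9.0.1 is also easy to check numerically. For a Uniform(0, 1) sample the density of Y_k reduces to the Beta(k, n-k+1) density, i.e. f_k(y) = n!/((k-1)!(n-k)!) y^{k-1}(1-y)^{n-k}. The following sketch (my own illustration, with arbitrary n, k and sample size) compares a histogram of simulated k-th order statistics with this formula.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(2)
n, k, reps = 7, 3, 200_000

# k-th order statistic of a Uniform(0,1) sample of size n, repeated reps times
Y = np.sort(rng.uniform(0, 1, (reps, n)), axis=1)[:, k - 1]

# f_k(y) = n!/((k-1)!(n-k)!) * y^{k-1} (1-y)^{n-k};  n!/((k-1)!(n-k)!) = k * C(n, k)
grid = np.linspace(0.05, 0.95, 10)
fk = k * comb(n, k) * grid ** (k - 1) * (1 - grid) ** (n - k)

hist, edges = np.histogram(Y, bins=50, range=(0, 1), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
for y, val in zip(grid, fk):
    est = hist[np.argmin(np.abs(centers - y))]   # empirical density at the nearest bin
    print(f"y = {y:.2f}   empirical {est:.3f}   f_k(y) = {val:.3f}")
```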

Example - 10.1.1 - (Characteristic function of a standard uniform) Let X Uniform(0, 1), then the characteristic function is obtained as follows. ∼ φ (t) = E eitX = E cos(tX) + iE sin(tX) X { } { } { } 1 sin t E cos(tX) = cos(tx) dx = { } t Z0 Lecture 10 1 cos t 1 E sin(tX) = sin(tx) dx = − + { } t t Z0 sin t i cos t + i i(1 eit) eit 1 φ (t) = E eitX = − = − = − . Fourier Transforms X { } t t it As is the case in this example, the complex integral often behaves just like the real integral of exponential function. Also, note that when t = 0, we get φX (0) = E(e0) = 1. By using L’Hopital’s rule it is easy to see that the function, (eit 1)/(it), also gives the same value for t = 0. So, we can say that the characteristic− function Recall that the characteristic function, φ (t), of a random variable X having dis- X is continuous at t = 0. tribution F is the expectation

itX itx Example - 10.1.2 - (Characteristic function of triangular random vari- φX (t) = E e = e dF (x) a a R able) Let X Uniform( , ) where, a > 0, and let X, Y be independent and Z ∼ − 2 2 dF (x) identically distributed random variables. It is not hard to see that the density of ∞  itx e f(x) dx, continuous case, f(x) = dx , −∞ the new random variable, Z = X + Y , is = eitk P (X = k) integer case, P (X = k) = F (k) F (k 1),  R k − −  itxk a z  k e P (X = xk) discrete case, P (X = xk) = F (xk) F (xk−). −| | if a

10.1 Examples a itx a x φZ (t) = e −|2 | dx. Fourier transform is a versatile tool of analysis with numerous applications. Here, a a Z−   we list a few tricks of the trade. By using the fact that eitx = cos(tx) + i sin(tx), as we did in the previous example, Definition - 10.1.1 - Let f be a function such that ∞ f(x) dx < , (such func- the above integral reduces to | | ∞ tions will be called absolutely integrable or just integrable).−∞ Then the Fourier trans-form R 2(1 cos at) sin(at/2) 2 of f, denoted by (f,t), is defined to be φ (t) = − = . F Z a2t2 at/2 ∞ ∞ ∞   (f,t) = eitxf(x) dx = cos(tx) f(x) dx + i sin(tx) f(x) dx. F Remark - 10.1.2 - Note that, by using the result of Exercise (10.1.1), this char- Z−∞ Z−∞ Z−∞ acteristic function is the product of the characteristic functions of X and Y . This Remark - 10.1.1 - In probability applications f is usually a density (satisfying the should not be surprising, keeping in mind the convolution property that we have absolute integrability condition automatically) and the resulting Fourier transform often seen in the past for discrete random variables. is known as the characteristic function of the density or the characteristic function of the random variable having the density f. So, if X is a random variable with Example - 10.1.3 - (Characteristic function of exponential random vari- density f then the characteristic function, φX (t), is able) Let X Exp(λ). Its characteristic function is ∼ itX ∞ itx E e = φX (t) = e f(x) dx = (f,t). ∞ λx ∞ λx { } F φX (t) = cos(tx) λe− dx + i sin(tx) λe− dx. Z−∞ Z0 Z0 10.1 Examples 85 86 Fourier Transforms

By (twice) integrating by parts, we get that (v) (Smoothness) Let f be piecewise continuous and absolutely integrable. • 2 If xf(x) is also absolutely integrable then (f,t) is continuously differentiable ∞ λx λ F cos(tx) λe− dx = . and λ2 + t2 d Z0 (f,t) = i (xf(x),t). Also, integration by parts gives that dtF F (vi) (Anti-derivative) Let f be an integrable and differentiable function so ∞ t ∞ tλ ′ λx λx • that f is continuous and integrable as well. Then sin(tx) e− dx = cos(tx) e− dx = 2 2 . 0 λ 0 λ + t Z Z d λ2 + itλ λ it (f,t) = f(x),t . Thus, the characteristic function of X is φ (t) = = . − F F dx X λ2 + t2 λ it   − (vii) (Expansion) Let f be an integrable function so that xn+1f is also Example - 10.1.4 - x Note that the function, f(x) = sin x e−| |, is integrable. To • absolutely integrable (for some positive integer n). Then find its Fourier transform, (f,t), note that F n j n+1 ix ix (it) (gjf, 0) t ( gn+1f , 0) ie + ie− (f,t) = F + ξ | | F | | , sin x = − . F j! n (n + 1)! 2 j=0 X Therefore, the Fourier transform is where g (x) = xj and ξ is a complex number with ξ 1. j n | n| ≤ i ∞ ix ix x itx ∞ 1 x i(t 1)x (e− e ) e−| | e dx = i e−| | e − dx Proof: (Property (iv)) It is trivial to see that 2 − 2 Z−∞ Z−∞ ∞ 1 x i(t+1)x ∞ itx ∞ e−| | e dx (f,t) e f(x) dx = f(x) dx < . − 2 |F | ≤ | || | | | ∞ Z−∞  Z−∞ Z−∞ 1 1 = i . To show continuity, note that 1 + (t 1)2 − 1 + (t + 1)2  −  ∞ The last equality comes by using the result of Exercise (10.1.2). (f,t + h) (f,t) = (eix(t+h) eitx) f(x) dx |F −F | − Z−∞ ∞ Remark - 10.1.3 - It is a standard exercise in contour integration to show that eitx eihx 1 f(x) dx t the characteristic function of the Cauchy(0, 1) density is e−| |. We will bypass the ≤ | | | − | | | Z−∞ ∞ contour integration by first inventing the inversion formula and then deduce this = eihx 1 f(x) dx. result. The characteristic function of N(0, 1) is obtained in the next section by | − | | | Z−∞ using some general properties of Fourier transforms. The last integral does not depend on t. Its integrand is bounded by 2 f(x) whose | | Now we obtain some useful properties of (f,t). In particular, we will get the integral is finite. And the integrand goes to 0 as h goes to zero. Thus, we can take properties of the characteristic functions of continuousF random variables. Here is the limit inside as h goes to zero to get the property. Therefore, we can interchange the main result of this section. the integral and the derivative operations. This is because the integrand and its derivative (with respect to t) are integrable. To show the continuity of (xf,t), Theorem - 10.1.1 - (Fundamental properties of Fourier transforms) Every note that F Fourier transform enjoys the following properties. ∞ (xf,t + h) (xf,t) = xf(x)eitx eihx 1 dx (i) (Linearity) (αf + βg,t) = α (f,t) + β (g,t), for any constants α, β. |F −F | − • F icx F F Z−∞ (ii) (Translation) (e f(x),t) = (f,t + c), for any constant c.  • F F ∞ ihx (iii) (Scale) (f(x), tc) = 1 (f(x/c),t) , for any constant c > 0. xf(x) e 1 dx. c ≤ | | − • F F Z−∞ (iv) (Uniform continuity) If f is integrable and piecewise continuous then • (f,t) is a bounded function of t and the bound is The integrand is bounded by 2 xf(x) which is integrable and the integrand drops F to zero as h goes to zero. So, taking| the| limit inside the integral gives the result. ∞ (f,t) f(x) dx < . (Property (vi)). By the fundamental theorem of calculus |F | ≤ | | ∞ x Z−∞ ′ Furthermore, (f,t) is a uniformly continuous function of t. f(x) = f (u)du + f(0). 

′ By the integrability of f , limx f(x) exist. Now the integrability of f implies Remark - 10.1.4 - (Convolution property) Let f,g be piecewise continuous and that the limits must be zero. So,→±∞ by integration by parts, integrable functions. Recall that the convolution of these two functions, denoted by f g(t), is ′ itx ∞ itx ∗ (f ,t) = e f(x) ∞ it e f(x)dx = it (f,t). ∞ F |−∞ − − F f g(t) = f(x)g(t x)dx. Z−∞ ∗ − Z−∞ j/(n+1) (n+1 j)/(n+1) (Property (vii)). For j 1, 2, ,n write g f = g f f − . Now One way the convolution function arises is via adding independent random vari- ∈ { ··· } j j apply the H¨older inequality, with p = (n + 1)/j, to see that gjf are absolutely ables. More precisely, if X, Y are independent continuous random variables with integrable as well. Next recall the fact that respective densities f and g, then the density of Z = X + Y is the convolution f g(t). This shows that the characteristic function of Z is the product of the n (itx)j tx n+1 characteristic∗ functions of X and Y , eitx = + R (itx); R (itx) | | . j! n | n | ≤ (n + 1)! j=0 X (f g, θ) = (f, θ) (g, θ), θ R. F ∗ F F ∈ Therefore, HW29 Exercise - 10.1.1 - Find the characteristic function of X Uniform( a, a) where n j (it) ∞ ∞ ∼ − (f,t) = xjf(x) dx + R (itx)f(x) dx a > 0. Deduce that (sin t)/t is a characteristic function. F j! n j=0 Z−∞ Z−∞ X HW30 Exercise - 10.1.2 - n j Find the characteristic function of the Laplace density (also (it) ∞ λ λ x = (gjf, 0) + Rn(itx)f(x) dx. called the double exponential density) f(x) = 2 e− | |; < x < . (Hint: the j! F 2 2 2 −∞ ∞ j=0 answer is λ /(λ + t ).) X Z−∞ Finally, the last term is a complex number with magnitude x Exercise - 10.1.3 - Find the Fourier transform of f(x) = cos x e−| |. n+1 ∞ t ∞ R (itx)f(x) dx | | x n+1 f(x) dx Exercise - 10.1.4 - Prove the first three properties of the Fourier transforms and n ≤ (n + 1)! | | | | Z−∞ Z−∞ deduce that, for any constant c and a random variable X, n+1 t = | | ( gn+1f , 0). ict (n + 1)!F | | φX+c(t) = e φX (t), φcX (t) = φX (tc).

This finishes the proof. Exercise - 10.1.5 - ♠ Prove that the characteristic function of a continuous random variable is always uniformly continuous and bounded by 1. Example - 10.1.5 - (Characteristic function of N(0, 1)) Let X N(0, 1). By ∼ the smoothness property, HW31 Exercise - 10.1.6 - Suppose the first moment of X is finite. Show that its charac- teristic function is differentiable and d 1 ∞ itx x2/2 itX φ(t) = ixe e− dx, where φ(t) = E(e ). d d dt √2π φ (t) = E iXeitX , φ (t) = iE(X). Z−∞ dt X { } dt X t=0 Since, x cos tx is an odd function of x and the density is an even function,

HW32 Exercise - 10.1.7 - Let f be piecewise continuous and absolutely integrable. If d 2 2 ∞ x2/2 2 φ(t) = i x sin(tx) e− dx. x f(x) is absolutely integrable then prove that (f,t) is twice differentiable and dt π F r Z0 d2 By integration by parts, we get (f,t) = x2f(x),t . dt2 F −F  d 2 ∞ x2/2 1 Suppose that the second moment of X is finite. Deduce that its characteristic φ(t) = t cos(tx) e− dx = tφ(t), or dφ(t) = tdt. dt − π − φ(t) − function is twice differentiable with r Z0 t2 2 2 t /2 c 2 d Integrating both sides, ln φ(t) = 2 + c, or φ(t) = e− e . Now for t = 0, we get E(X ) = φ (t) . c − − dt2 X φ(0) = e . For any random variable, φ(0) = 1. This gives that the characteristic t=0 t2/2 function of N(0, 1) must be φ(t) = e− .
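The closed form phi(t) = e^{-t^2/2} can be checked against the empirical characteristic function (1/m) sum_j e^{itZ_j} of a simulated standard normal sample. The following is a small sketch of that check (not part of the notes); the sample size and the grid of t values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
Z = rng.standard_normal(200_000)

for t in (0.0, 0.5, 1.0, 2.0):
    emp = np.mean(np.exp(1j * t * Z))        # empirical characteristic function at t
    print(f"t={t:3.1f}   empirical {emp.real:+.4f}{emp.imag:+.4f}i   exp(-t^2/2) = {np.exp(-t**2 / 2):.4f}")
```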


Lecture 11

Summability Assisted Inversion

This is called the Cesaro method. Another averaging (summability) operation is
\[
\frac{1}{T}\int_0^{\infty} e^{-t/T}\, L_x(t)\, dt,
\]
and is known as the Cauchy/Abel method. Note that in both cases we used a probabilistic expectation of $L_x$: in the first case we used a Uniform(0, 2T) density and in the second case an Exp(1/T) density. In general, if $p_T(t)$ is some density, then the summability smoothing of $L_x(\cdot)$ is the following expectation:
\[
\int_0^{\infty} p_T(t)\, L_x(t)\, dt
= \int_0^{\infty} p_T(t)\, \frac{1}{2\pi}\int_{-t}^{t} e^{-i\theta x}\, \mathcal{F}(f,\theta)\, d\theta\, dt
= \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-i\theta x}\, \mathcal{F}(f,\theta)\, g_T(\theta)\, d\theta,
\qquad g_T(\theta) := \int_{|\theta|}^{\infty} p_T(t)\, dt.
\]

Step 2. Parseval’s relation. When f,gT are integrable, Parseval’s relation says that

iθx Now we show how to recover the function f from its Fourier transform (f,t). In e− (f, θ) gT (θ) dθ = f(t) (gT ,t x) dt = f(x + θ) (gT , θ) dθ. F R F R F − R F this regards, the fundamental insight is due to Dirichlet. Z Z Z Let f be an integrable function (i.e., ∞ f(x) dx < ). The Fourier trans- This relationship follows by writing the Fourier transform as an integral and then form of f is −∞ | | ∞ R switching the order of integration. The reader should fill in the details. (f, θ) := eiθx f(x) dx, θ R. F R ∈ Example - 11.0.6 - (Three examples) Note that regardless of which pT (t) den- Z sity we choose to use, we will have gT (θ) 0. If gT (θ) is integrable, it is only off The fundamental question of all transform theories1is “does the trasform, (f, θ), ≥ F by a constant to be a probability density in itself. If cgT is a probability density, contain all the information needed to reconstruct f from it”? One of the first for some constant c, we have reconstruction schemes was suggested by Dirichlet. He wondered if f could be reconstructed from (f, θ) as follows 1 iθx 1 F e− (f, θ) gT (θ) dθ = f(x + θ) (cgT , θ) dθ, 2π R F c2π R F Z Z t ? 1 iθx Since g (θ) is symmetric, (cg , θ) will be a real-valued function. Consider the f(x) = lim e− (f, θ) dθ (0.1) T F T t 2π t F following three examples. →∞ Z− (i) When we take pT (t) to be Uniform(0, 2T ) density, we get gT (θ) = Amazingly this works, at least for some special functions f, such as those integrable • θ (1 | | ), for θ 2T , and zero otherwise. Note that g is a constant mul- f’s for which (f, θ) is again integrable. Even in the case of those f’s for which − 2T | | ≤ T F 1 θ 1 (0.1) does not work, some averaging (summability smoothing) operations performed tiple of the Triangular( 2T, 2T ) density, 2T (1 2| T| ), where c = 2T . The − − sin T θ 2 on top of the above Dirichlet scheme can, again, reconstruct f from (f, θ). Their characteristic function (Fourier transform) of cgT is ( T θ ) . So, if we let F 2 underlying ideas are summarized below. C have density T sin T θ , θ R, the Cesaro inversion formula becomes T π T θ ∈ 1 t iθx  Step 1. Summability assistance. Let Lx(t) := e− (f, θ) dθ. The 2T 2π t F 1 iθx θ T sin T θ 2 most basic averaging (summability) operation is the simple− averaging, lim e− (f, θ) 1 | | dθ = lim f(x + θ)( ) dθ, R T 2π 2T F − 2T T π R T θ →∞ Z−   →∞ Z 2T = lim E (f(x + CT )) . 1 T L (t) dt. →∞ 2T x Z0 1 t/T (ii) When we take pT (t) = T e− , for t > 0 and zero otherwise, i.e., the 1 • 1 θ /T such as generating functions, moment generating functions, Laplace transforms, Exp( T ) density, then gT (θ) = e−| | . It is a constant multiple of the dou- 1 θ /T 1 Fourier transforms, Fourier-Stieltjes transforms etc. ble exponential density, 2T e−| | , with c = 2T . The characteristic function Summability Assisted Inversion 91 92 Summability Assisted Inversion

1 (Fourier transform) of cgT is 1+(T θ)2 . So, the Cauchy inversion formula be- (i) limT R hT (θ) dθ = 1, comes • →∞ (ii) supT R RhT (θ) dθ < , 1 iθx θ /T T 1 • | | ∞ lim e− (f, θ)e−| | dθ = lim f(x + θ) 2 dθ, R T 2π R F T π R 1 + (T θ) (iii) limT θ >δ hT (θ) dθ = 0 for any δ > 0. →∞ Z →∞ Z • →∞ | | | | = lim E (f(x + KT )) , T The densities of theR random variables, CT , KT ,WT , are all approximate identities. →∞ Now for any bounded function f, if x is any point of continuity of f, then for any where K has the density T 1 , θ R. T π 1+(T θ)2 ∈ ǫ > 0, there exists a δ > 0 so that f(x + θ) f(x) < ǫ whenever θ δ. This gives that | − | | | ≤ (iii) When we take p (t) to be the density of Exp( 1 ) random variable, we • T 2T θ2/2T q get gT (θ) = e− . It is a constant multiple of the N(0,T ) density, where f(x + θ)hT (θ) dθ f(x) f(x) hT (θ) dθ 1 + Kf hT (θ) dθ 2 1 T θ /2 R − ≤ | | R − θ >δ | | c = . The characteristic function (Fourier transform) of cgT is e− . Z Z Z| | √2πT Now the inversion scheme, called the Gauss-Weierstrass inversion formula, + f(x + θ) f(x ) hT (θ) dθ, (0.3) becomes θ δ | − || | Z| |≤

2 √ 2 1 iθx θ /2T T T θ /2 where Kf is a bound for 2f. The last term is bounded by ǫ sup hT (θ) dθ. So lim e− (f, θ)e− dθ = lim f(x + θ)e− dθ, T R | | T 2π R F T √2π R the last term can be made arbitrarily small by picking ǫ small enough. The first →∞ Z →∞ Z R = lim E (f(x + WT )) , two terms go to zero as T gets large. By using a slightly more elaborate argument, T →∞ (see Khan [?]), we can somewhat relax the continuity assumption on x and get 1 where WT N(0, T ). + ∼ f(x ) + f(x−) lim f(x + θ)hT (θ) dθ = , T R 2 Step 3. Approximate identity. So, continuing from Step 2, the issue is, “when →∞ Z is it that at any point of simple discontinuity. Also, the boundedness assumption on f can be ? iθx relaxed somewhat, but the amount of relaxation is tied to which inversion technique f(x) = lim e− (f, θ) gT (θ) dθ = lim f(x + θ) (gT , θ) dθ. T R F T R F we choose to use. This gives us the general inversion theorem. →∞ Z →∞ Z The answer depends on f. If f is integrable so that (f, θ) is also integrable, then + F2 1 ∞ iθx f(x ) + f(x−) even the limit in Dirichlet’s original idea, (0.1), holds and we get lim e− (f, θ) gT (θ) dθ = . T 2π F 2 →∞ Z−∞ 1 ∞ iθx f(x) = e− (f, θ) dθ, θ R. (0.2) 2π F ∈ Remark - 11.0.5 - (Summary) The moral of this elaborate story can be summa- Z−∞ rized as follows: When (f, θ) is not integrable, in each of the above three examples, F Take any density, pT (t), defined over (0, ) and get gT (θ) = t> θ pT (t) dt. • ∞ | | lim Ef(x + CT )= lim Ef(x + KT )= lim Ef(x + WT ) = f(x), If the summability density, pT (t), has a finite first moment, then gT (t) will T T T R →∞ →∞ →∞ be integrable. Hence, h (θ) := (g , θ) will be well defined and will be real, T F T if f(x) does not grow “too rapidly” as x , and if x is a point of continuity of because of the symmetry of gT . | | | |→∞ f. The key reason behind this is that each random variable, CT , KT ,WT , converges If the collection, h (θ),T = 1, 2, , forms an approximate identity, • { T ···} to zero in probability as T gets large. In other words, if hT (θ) denotes the density of any of these three random varibles, then in each case, then

t ∞ 1 ∞ lim hT (θ) dθ = 0, for any δ > 0. iθx T lim pT (t) Lx(t) dt = lim pT (t) e− (f, θ) dθ = f(x), →∞ θ >δ T 0 2π T 0 t F Z| | →∞ Z →∞ Z Z− A sequence of functions, h (θ),T = 1, 2, , is called an approximate identity if { T ···} for any bounded integrable function f, and x is its point of continuity. This is the 2Thanks to a theorem, called the Lebesgue dominated convergence theorem. underlying idea behind Fourier inversion. Summability Assisted Inversion 93 94 Summability Assisted Inversion
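As a concrete numerical illustration of this summary (my own sketch, not part of the notes): take f to be the Uniform(-1, 1) density, whose Fourier transform F(f, theta) = sin(theta)/theta is not absolutely integrable, and recover f(x) through the Gauss-Weierstrass damped integral (1/2pi) * integral of e^{-i theta x} F(f, theta) e^{-theta^2/(2T)} d theta for a moderately large T. The truncation range, grid and value of T below are arbitrary choices.

```python
import numpy as np

theta = np.linspace(-400, 400, 800_001)
d = theta[1] - theta[0]
ft = np.sinc(theta / np.pi)                 # sin(theta)/theta, the FT of the Uniform(-1,1) density
T = 200.0                                   # summability parameter; larger T means weaker damping

for x in (0.0, 0.5, 1.5):
    integrand = np.exp(-1j * theta * x) * ft * np.exp(-theta**2 / (2 * T))
    fx = (integrand.sum() * d).real / (2 * np.pi)
    print(f"x = {x:3.1f}   recovered {fx:.4f}   true f(x) = {0.5 if abs(x) < 1 else 0.0}")
```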

Remark - 11.0.6 - (Delta function) We may convey the above summary/idea in Hence, f,g must agree at their points of continuity. (At the points of discontinuity a language, and a notation, that avoids mentioning which summability technique they may differ.) ♠ was deployed. This is done with the help of the delta function which you may think The following proposition shows that we really cannot do better than what we have of as the limit of the density of either of the three random variables CT , KT ,WT : when there are points of discontinuity. The integrability of the Fourier transform 2 is really a strong assumption. It necessitates that the function f should have no T sin T θ T 1 √T T θ2/2 δ(θ)= lim = lim = lim e− , θ R. points of discontinuity. T π T θ T π 1 + (T θ)2 T √2π ∈ →∞   →∞ →∞ Proposition - 11.0.1 - Let f be a piecewise continuous and integrable function The delta function is given the following properties: with the Fourier transform, (f,t). If f has a point of discontinuity then (f,t) F F (i) δ(0) = + , and δ(x) = 0 for any x = 0, so that δ(x) = δ( x). can not be integrable. • ∞ 6 − (ii) δ(x) dx = 1, or more generally, as our last three examples suggested, • R Proof: Suppose (f,t) is integrable. Recall that (f,t) is always continuous for integrable f. WhenF (f,t) is also integrable then ourF inversion theorem gives R f(x+) + f(x ) f(x + θ) δ(θ) dθ = − . F R 2 + f(x ) + f(x−) 1 ∞ iθx Z = lim e− (f, θ) gT (θ) dθ 2 T 2π F iθx →∞ Z−∞ (iii) R e δ(x) dx 1, i.e, (δ, θ) 1, which is visible from our three • ≡ F ≡ 1 ∞ iθx examples since = e− (f, θ)dθ. R 2π F Z−∞ θ θ /T θ2/2T lim 1 | | = lim e−| | = lim e− 1. The last expression being a Fourier transform of (f,t), it must be a continuous T − 2T T T ≡ F →∞   →∞ →∞ function. So the left side must also be a continuous function. ♠ (iv) t δ(x) dx = ∆(t), where ∆(t) = 0 for t < 0 and ∆(t) = 1 for t 0. Remark - 11.0.7 - Many of the continuous densities which are defined on the half • The idea−∞ behind this is the fact that each of the three random variables,≥ real line or finite intervals have points of discontinuity. The above proposition C , KR ,W , converged to zero in probability, and ∆ is the distribution of a T T T shows that all such densities cannot have integrable characteristic functions. Thus, r.v. which takes value zero with probability one. we have to resort to the summability assisted limit operation to invert such Fourier It should be noted that these properties of δ(θ) are only operational devices abbrevi- transforms. The inversion formula is more useful to prove other theoretical results ately representing more elaborate approximation (limit) arguments. Such functions than to actually perform the inversion. are studied under the topic of “Schwarz distributions”, which is a bit different no- tion than the usual probability distribution. Example - 11.0.7 - Now we find the Fourier transform of the Cauchy(0, 1) den- x sity. Recall (Exercise (10.1.2)) that the Laplace density (1/2)e−| | had the Fourier Theorem - 11.0.2 - (Uniqueness) Let f,g be two piecewise continuous and inte- transform (characteristic function) (f,t) = 1/(1+t2). Now both the Laplace den- F grable functions so that (f, θ) = (g, θ). Then f and g are essentially3 the same sity and its Fourier transforms are continuous and integrable. So, the first inversion functions. F F theorem gives that 1 ∞ ixt 1 1 x e− dt = e−| |. 
Proof: By the above inversion theorem, 2π 1 + t2 2 Z−∞ + By making the transformation y = t we get f(x ) + f(x−) 1 ∞ iθx − = lim e− (f, θ) gT (θ) dθ 2 T 2π F ∞ 1 →∞ Z−∞ ixy x e 2 dy = e−| |. 1 ∞ iθx π(1 + y ) = lim e− (g, θ) gT (θ) dθ Z−∞ T 2π F →∞ Z−∞ 2 + Since, g(y) = 1/(π(1 + y )) is the Cauchy(0, 1) density, we see that the Fourier g(x ) + g(x−) t = . transform (characteristic function) of this density is (g,t) = e−| |. 2 F
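A quick numerical sanity check of this conclusion (again my own sketch, with an arbitrary grid and arbitrary test values of t): integrate e^{ity}/(pi(1 + y^2)) over a long grid and compare with e^{-|t|}.

```python
import numpy as np

y = np.linspace(-2_000, 2_000, 2_000_001)
d = y[1] - y[0]
dens = 1.0 / (np.pi * (1.0 + y**2))         # Cauchy(0,1) density

for t in (0.5, 1.0, 2.0):
    phi = (np.exp(1j * t * y) * dens).sum() * d   # characteristic function at t, by quadrature
    print(f"t = {t:3.1f}   integral {phi.real:.4f}   e^(-|t|) = {np.exp(-abs(t)):.4f}")
```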

3They may differ on a "negligible" set, such as artificially making the functions differ at the points of discontinuity.

Here is another nice example of the many uses of the inversion formula. In this example we see how to find some not too obvious integrals.

[Figure 11.1: Triangular Density. The density f of X + Y for X, Y iid Uniform(-1, 1): it rises linearly from 0 at x = -2 to its peak of 1/2 at x = 0 and falls back to 0 at x = 2.]

Example - 11.0.8 - Recall (Exercise (10.1.1)) that the Uniform(-1, 1) density has the characteristic function sin t/t. Also, if X, Y are iid Uniform(-1, 1), then X + Y has the triangular density f pictured in Figure 11.1. Its characteristic function is $(\sin t/t)^2$, since $E(e^{it(X+Y)}) = E(e^{itX})\,E(e^{itY})$. Note that both the density and its characteristic function are continuous and integrable functions. So, the inversion formula immediately gives that
\[
\frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-ixt}\,\frac{(\sin t)^2}{t^2}\,dt = f(x).
\]
In particular, when x = 0,
\[
\frac{1}{2\pi}\int_{-\infty}^{\infty} \frac{(\sin t)^2}{t^2}\,dt = f(0) = \frac{1}{2},
\qquad\text{or}\qquad
\int_{-\infty}^{\infty} \frac{(\sin t)^2}{t^2}\,dt = \pi.
\]
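The last identity invites a two-line numerical check (my own sketch; the truncation at plus/minus 2000 is arbitrary and accounts for the small discrepancy):

```python
import numpy as np

t = np.linspace(-2_000, 2_000, 4_000_001)
d = t[1] - t[0]
integrand = np.sinc(t / np.pi) ** 2         # (sin t / t)^2, with the t = 0 point handled correctly
print("integral ~", float((integrand.sum() * d).round(4)), "   pi =", round(np.pi, 4))
```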

Proof: It does not take much effort (by the usual convolution argument) to show directly that the said function, $f_Z(x)$, is indeed a density of Z. Moving along, if $(F(x+h) - F(x))/h$ is a density, then note that its characteristic function (by using the anti-derivative property of Fourier transforms) has to be

∞ F (x + h) F (x) 1 (f(x + h),t) (f,t) eitx − dx = F F h h it − it Lecture 12 Z−∞  − −  1 e ithφ(t) φ(t) = − h it − it  − −  ith 1 e− = φ(t) − . General Inversion ith Since the last term is a characteristic function. This is a bounded function since it is the product of two characteristic functions, each of which has to be bounded. The continuity of fZ (x) is trivial when x,x + h are points of continuity of F . Now Now we present a particular form of the inversion theorem that is useful in proba- take W N(0,n) and use Parseval’s relation to get bility and statistics. ∼ ith 1 ∞ itx 1 e− t2/(2n) ∞ n(y x)2/2 iθx e− φ(t) − e− dt = fZ (y)e− − dy Theorem - 12.0.3 - (General inversion) Let φ(θ) = R e dF (x) be the char- √2πn ith Z−∞ Z−∞ acteristic function of a distribution F of a random variable X. ∞ ny2/2 R = fZ (y + x)e− dy. (i) (Clever trick). Define a new random variable Z = X + U, where U is • Z−∞ independent of X and U Uniform( h, 0). Prove that Z has the following 1/2 probability density ∼ − Multiplying both sides by (n/(2π)) and invoking the Gauss-Weierstrass inversion gives the result. Finally, the point of discontinuity of (any distribution) F is a F (x + h) F (x) countable set. So, taking an infinite sequence of points x = t0

(The fact that f is a density, and hence integrable, opens the door for 1 F (x)= lim F (tM+1) F (x) Z − M − inversion by our earlier summability assisted tools.) →∞ Each of the terms of the last limit is obtained by the inversion scheme that we have 1 e−iθh (ii) Show that the Fourier transform of fZ is (fZ , θ) = φ(θ) − . proved. Hence, F (x) can be computed at each of its points of continuity. But then • F iθh by the right continuity of F , it is fully specified at all x R. (iii) Show that fZ (x) is a bounded function of x and if x,x + h are points of ∈ ♠ • continuity of F , then x is a point of continuity of fZ . HW33 Exercise - 12.0.8 - Let f be any integrable function. Prove that,1 for any real (iv) Explain why the following equalities and convergences hold as T . number x, • →∞ T 1 iθx ∞ sin Ty iθh e− (f, θ)dθ = f(x + y) dy. 1 iθx 1 e− θ2/2T √T T θ2/2 2π T F πy e− φ(θ) − e− dθ = fZ (x + θ)e− dt Z− Z−∞ 2π R iθh √2π R Z Z D Exercise - 12.0.9 - (Plancheral’s identity) Let X be a continuous random vari- F (x + h) F (x) fZ(x) := − , able with bounded density f, distribution F and characteristic function φ(t). Then → h φ(t) 2 is integrable if and only if f 2 is integrable and in this case 2T iθh | | 1 iθx 1 e− θ e− φ(θ) − 1 | | dθ fZ(x), ∞ 1 ∞ 2π 2T iθh − 2T → f 2(x) dx = φ(t) 2 dt. Z−   iθh 2π | | 1 ∞ iθx 1 e− θ /T Z−∞ Z−∞ e− φ(θ) − e−| | dθ fZ(x), 1 2π iθh → This approach leads to the Dirichlet inversion formula. However, its proof requires Z−∞ sin T y some more work. Note that the right hand side function, hT (y) := , is neither (v) Explain why the last part characterizes F . πy • nonnegative nor |hT (y)| is integrable. It is not an approximate identity. 12.1 Fourier & Dirichlet Series 99 100 General Inversion

HW34 Exercise - 12.0.10 - (Parseval’s relation) Prove the following results. So, someone comes along and gives us the Fourier coefficients (1.1) of a 2π- periodic integrable function f. One of the fundamental issues regarding Fourier (i) For any integrable functions f(x), g(x) having respective Fourier trans- series is to see if the sequence of partial sums of the resulting Fourier series, • forms (f,t), (g,t), we have F F n a ∞ ∞ S (f,t) := 0 + (a cos kt + b sin kt), n = 1, 2, , g(t) (f,t) dt = f(y) (g,y) dy. n 2 k k ··· F F k=1 Z−∞ Z−∞ X (ii) For any random variables X, U with respective characteristic functions converges in some sense. And if it does, to which limiting function. This is again • an inversion question. To see why, it will be more illuminating if we rewrite the φX (t), φU (t), E φX (U) = E φU (X) . { } { } partial sums in complex form,

(iii) Let X,W be two random variables with respective characteristic func- n ikt ikt ikt ikt n • a0 e + e− e e− ikt tions φ (t) and φ (t). Then, for any x R, Sn(f,t) = + ak + bk − = f(k)e , X W ∈ 2 2 2 k=1   k= n ixW X X− E e− φX (W ) = E φW (X x) . b { − } ak ibk 1 π iku where we take b0 = 0 and for k = 0, 1, ,n, f(k) = −2 = 2π π f(u)e− du,  ··· − (iv) For any integrable functions f,g with respective Fourier transforms ak +ibk 1 π iku • f( k) = 2 = 2π π f(u)e du. Conversely, ak = f(k) +Rf( k) and bk = (f,t), (g,t), and any x R, we have − − b − F F ∈ i(f(k) f( k)). R b − − b b ∞ ixt ∞ e− g(t) (f,t) dt = f(y) (g,y x) dy. F F − Remarkb -b 12.1.1 - (Reconstruction of 2π-periodic functions) The aim of the Z−∞ Z−∞ following set of exercises is to make a connection with our earlier discussion about Fourier inversion formulas and convergence of Fourier series. When f is 2π-periodic Exercise - 12.0.11 - (Continuity theorem) Let Xn be a sequence of random variables with respective characteristic functions φ{ (t)} . Then P (X x) con- and integrable, let the Fourier transform of f be { n } n ≤ verge P (X x) at all x at which F (x) := P (X x) is continuous, if and only if π ≤ ≤ iju the sequence φn(t) converges to a continuous limit φ(t) (in which case φ is the (f,j) := f(u) e du; j = 0, 1, 2, { } F π ± ± ··· characteristic function of X). Give a sketch of the proof in the backward direction Z− by using the result of Theorem (12.0.3). Again, the issue is: “does the collection (f,j), j = 0, 1, contain all the information needed to reconstruct f”? Once{F again, Dirichlet± proposed···} a reconstruc- 12.1 Fourier & Dirichlet Series tion scheme, (which happens to be the partial sums sequence), n n Fourier transform is a function-to-function transform. Fourier and Dirichlet series ijx 1 ikx S (f,x) := f(j)e = e− (f,k), (1.2) are function-to-sequence transformations. n 2π F j= n k= n X− X− b Definition - 12.1.1 - (Trigonometric & Fourier series) Any series of the form and showed (now known as Dirichlet-Jordan theorem[?], p. 57. [?]) that it does converge to f(x) provided a ∞ 0 + (a cos kt + b sin kt), 2 k k f is 2π-periodic and continuous and, k=1 • X f ′ exists and is bounded and has at most finitely many points of discontinuity. • is called a trigonometric series. If the coefficients, a , b , are obtained from an k k It was discovered by du Bois and Raymond (see Hardy and Rogosinski [?]) that integrable function f over [ π,π], which is 2π-periodic,{ } { by} the formulas − assuming only continuity of f is not enough, in the sense that there exist continuous π π f for which limsup S (f, 0) = . But once again, this failure is primarily due 1 1 n | n | ∞ an = f(x) cos nxdx, bm = f(x) sin mxdx, (1.1) to Dirichlet’s reconstruction scheme. The collection (f,k),k = 0, 1, does π π π π {F ± ···} Z− Z− store enough information to construct f, if we use an averaged version (summability where n = 0, 1, 2, and m = 1, 2, , then the resulting trigonometric series is assisted form) of the Dirichlet reconstruction scheme. This is the subject matter of called the Fourier··· series of f. ··· the following developments. 12.1 Fourier & Dirichlet Series 101 102 General Inversion

Theorem - 12.1.1 - (Dirichlet kernel) Let f be 2π-periodic integrable function This gives that 1 n ikx with partial sums of its Fourier series S (f,x) = e− (f,k), where n 2π k= n F π (f,k) = π f(x)eikx dx, k = 0, 1, 2, . Then we have− 1 sin((N + 0.5)(t x)) π P SN (f,x) = f(t) − dt F − ± ± ··· 2π π sin((t x)/2) Z− − R π sin((n + 0.5)u) π π x S (f,x) = f(x+u) D (u) du, D (u) := , D (u) du = 1. − n n n 2π sin(u/2) n = f(u + x) DN (u) du, u = t x, π π π x − Z− Z− Z− − The kernel, Dn(u), is called the Dirichlet kernel. where we take D (u) := sin((N+0.5)u) . Finally, for f(u) 1 over [ π,π], by the N 2π sin(u/2) ≡ − Proof: Figure (12.1) gives two representative shapes of Dn(u), for n = 5 and orthogonality of sin, cos functions, we have a0 = 2 and ak = bk = 0 for k 1. a0 ≥ n = 8. Using the fact that cos(A B) = cos A cos B + sin A sin B. Hence, Sn(f,x) = = 1 for all n 1. This gives that − 2 ≥ π π 2 3 Dn(u) du = f(x + u)Dn(u) du = Sn(f,x) = 1. 2.5 π π Z− Z− 1.5 2 This finishes the proof. ♠ 1 1.5 Here is the main result of this section which shows how to get the inversion for (u) (u) 5 8 1 any continuous 2π-periodic function. D D 0.5 0.5

0 Theorem - 12.1.2 - (Fejer’s theorem) Let f be 2π-periodic integrable function 0 1 n ikx with partial sums of its Fourier series Sn(f,x) = e− (f,k), where −0.5 2π k= n π ikx − F (f,k) = π f(x)e dx, k = 0, 1, 2, . Then the following results hold. −0.5 −1 F ± ± ··· P −5 0 5 −5 0 5 − u u (i) The CesaroR mean of S (f,x),n = 0, 1, 2, is { n ···} T T Figure 12.1: Dirichlet Kernels for n = 5 and n = 8. 1 j ijx σ (f,x) := S (f,x) = 1 | | (f,j)e− . T T + 1 n − T + 1 F n=0 j= T −   N X X 1 π 1 π SN (f,x) = f(x) dx + f(t) cos kt cos kx + sin kt sin kx (ii) We also have 2π π π π { } Z− k=1  Z−  X 2 π N T π θ 1 j ijx 1 sin(T + 1) 2 = f(t) 1 + 2 cos k(t x) dt. 1 | | (f,j)e− = f(x + θ) θ dθ 2π π − − T + 1 F π T + 1 sin ( k=1 ) j= T   Z− 2 ! Z− X X− Note that 2 sin A sin B = sin(A + B) + sin(A B) and taking A = x and B = kx, = Ef(x + YT ), − 2 by the telescoping effect we get, θ 2 1 sin(T +1) 2 where YT is a random variable with density hT (θ) = T +1 θ for N N sin 2 2 sin(x/2) cos(kx) = sin(k + (1/2))x sin(k (1/2))x θ ( π,π), and zero otherwise.   { − − } ∈ − k=1 k=1 (iii) The sequence of functions, h (θ),T = 1, 2, , is an approximate identity. X X { T ···} = sin(1 (1/2))x + sin(N + (1/2))x. That is, − − lim hT (θ) dθ = 0, for any δ > 0. Dividing by sin(x/2), we get T θ >δ →∞ Z| | N N 1 (iv) If f is continuous at x, then σ (f,x) converges to f(x). 2 cos(kx) = 2 sin(x/2) cos(kx) T sin(x/2) (v) If f is a continuous function then σT (f,x) converges to f(x) uniformly in x. kX=1 kX=1 sin(N + (1/2))x (vi) (Uniqueness). When f,g are 2π-periodic and continuous functions, and if = 1 + . − sin(x/2) f(j) = g(j) for all j, then f(t) = g(t) for all t.


(vii) (Weierstrass approximation theorem). Show that for any continuous we get 2π-periodic function f, there exists a sequence of trigonometric polynomials that converge to f uniformly. T T 2 sin(u/2) sin(u(k + 0.5)) = (cos(ku) cos((k + 1)u) − k=0 k=0 Proof: The function hT (θ) is shown in Figure (12.2), and is called the Fejer kernel. X X = cos 0 cos((T + 1)u) (i) The fact that − = 1 cos((T + 1)u). 1 1.5 − Thus we have 0.8 1 1 1 cos((T + 1)u) 1 hT (u) = − 0.6 T + 1 sin(u/2) 2 sin(u/2)

(u) (u) 2 5 8 1 2(sin((T + 1)u/2)) h h = . 0.4 T + 1 2 (sin(u/2))2 0.5

0.2 To see how Parseval’s relation gives the same result, note that

T π T 0 0 −5 0 5 −5 0 5 j ijx j ijx 1 | | (f,j)e− = f(u) 1 | | e− du. u u − T + 1 F − T + 1 j= T   π j= T   X− Z− X− Figure 12.2: Fejer Kernels for T = 5 and T = 8. The reader may simplify the sum to get the same result. Next, it is clear that hT (θ) 0. Since for f(u) = 1, Sn(f,x) 1 for all x and all n, its average will ≥ ≡ π T T also be σT (f,x)=1 for all T and all x. Hence, π hT (θ) dθ = 1. This shows that 1 j ijx − σ (f,x) := S (f,x) = 1 | | (f,j)e− YT is a random variable with density hT (θ). (iii) The graph of the density shows T T + 1 n − T + 1 F R n=0 j= T   that it ought to be an approximate identity, since most of the area is concentrated X X− near zero. Indeed, for any δ > 0, note that if θ > δ then sin θ > c > 0 for some | | 2 δ is just that the sum of all the uniform probabilities, pT (t), (over 0, 1, ,T ) for constant cδ. This shows that, as T , values greater than or equal to j is ··· →∞ | | 1 2π 1 T + 1 j j hT (θ) dθ 2 0. j + ( j + 1) + + T = −| | = 1 | | . θ >δ ≤ (T + 1) cδ → T + 1{| | | | ··· } T + 1 − T + 1 Z| | Parts (iv) and (v) now follow from our earlier argument concerning approximations (ii) One direct way of obtaining the Fejer kernel is to take the cesaro sum of the [ Dirichlet kernel. Indeed, via approximate identities. Part (vi) follows immediately, since f g(j) = f(j) g(j) = 0 for all j. Now the zero function has all Fourier coefficients− equal to− T 1 zero, whose resulting Fourier series must converge to it uniformly, by partb (v). h (u) = D (u) T T + 1 k bTherefore, f(t) = g(t) for all t. Now part (vii) is obvious once we note that σn(f,x) kX=0 is a trigonometric polynomial. T ♠ 1 sin(u(k + 0.5)) = T + 1 sin(u/2) k=0 Exercise - 12.1.1 - (Abel/Poisson theorem) Let f be 2π-periodic integrable X 1 n ikx T function with partial sums of its Fourier series Sn(f,x) = 2π k= n e− (f,k), 1 1 π ikx − F = sin(u(k + 0.5)) . where (f,k) = π f(x)e dx, k = 0, 1, 2, . T + 1 sin(u/2) F − ± ± ··· P (k=0 ) X (i) Verify that theR Abel/Poisson mean of Sn(f,x),n = 0, 1, 2, is Here we can use some trigonometry (as before). Multiply by 2 sin(u/2) and using { ···} ∞ ∞ the equation n j ijx A (f,x) := (1 r) r S (f,x) = r| | (f,j)e− . r − n F 2 sin(u/2) sin(u(k + 0.5)) = cos(ku) cos((k + 1)u), n=0 j= − X X−∞ 12.1 Fourier & Dirichlet Series 105 106 General Inversion

(ii) Either by Parseval’s relation, or directly, verify that Assume that k fk < . These functions are not periodic, when λk are not integers, and such| series| are∞ called Dirichlet series. Note that we cannot talk about π ∞ 2 P j ijx (1 r ) the Fourier transform of such f’s, since ∞ f(t) dt involves meaningless terms r| | (f,j)e− = f(x + θ) − dθ −∞ | | 2 2 itθ itλk F π 2π (1 r) + 2r sin (θ/2) of the type ∞ e e dt. Such functions commonly arise in probability theory j= Z− R X−∞ { − } — as the characteristic−∞ functions of discrete (but non-lattice) random variables —. = Ef(x + Zr), R Here a probabilist faces the following question, “if I give you the fuction f(t) and (1 r2) tell you that it is some Dirichlet series, can you find its fk and the corresponding where Zr is a random variable with density hr(θ) := 2 − 2 { } 2π (1 r) +2r sin (θ/2) λk ”? The answer is yes, and is the subject matter of the following exercise. This for θ ( π,π), and zero otherwise. The function h (θ) is{ shown− in Figure} { } ∈ − r point of view is useful for spectral representation theory of second order stationary (12.3) and is called the Poisson kernel. processes.

1.5 3.5 Exercise - 12.1.3 - (Inversion for nonintegrable functions) Let f be inte- 3 grable over every interval of the type [ T,T ] and represent an absolutely convergent iλk t − Dirichlet series k fke . 2.5 1 (i) Show thatP if x is a point which does not equal any λk of the function f, 2 •

(u) (u) then .8 .9 T h h 1.5 1 itx ∞ sin T (λj x) L˜x(T ) := e− f(t) dt = fj − . 0.5 2T T (λj x) 1 T j= Z− X−∞ − 0.5 (ii) Show that if x = λ , then • k 0 0 T −5 0 5 −5 0 5 ∞ u u 1 itx sin T (λj x) L˜x(T ) = e− f(t) dt = fj − + fk. 2T T T (λj x) Z− j= ,λj =λk − Figure 12.3: Poisson Kernels for r =0.8 and r =0.9. −∞X 6 (iii) Letting T get large in parts (i) and (ii), deduce that • (iii) Verify that hr(θ),r (0, 1) is an approximate identity by showing that T { ∈ } 1 itx fk if x = λk for some k = 0, 1, 2, lim e− f(t) dt = ± ± ··· T 2T 0 otherwise. →∞ T lim hr(θ) dθ = 0, for any δ > 0. Z−  − r 1 θ >δ → Z| | (Hence, we may recover the coefficients and the exponents of the Dirichlet series by knowing the function f. Note that this “inversion” scheme uses 2T (iv) Now conclude that if f is continuous at x, then Ar(f,x) converges to f(x). instead of 2π, compared to the Fourier coefficients.)

(v) Conclude that if f is a continuous function then Ar(f,x) converges to f(x) uniformly in x.
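Part (iii) of the exercise can be visualised numerically. The sketch below (my own illustration) uses the Poisson kernel written in the equivalent standard form with 1 - 2r cos(theta) + r^2 in the denominator, checks that its total mass on (-pi, pi) is 1, and watches the mass outside {|theta| <= delta} vanish as r increases to 1; the choice of delta and the values of r are arbitrary.

```python
import numpy as np

theta = np.linspace(-np.pi, np.pi, 200_001)
d = theta[1] - theta[0]

def poisson_kernel(r, th):
    # h_r(th) = (1 - r^2) / (2*pi*(1 - 2 r cos th + r^2)),  th in (-pi, pi)
    return (1 - r**2) / (2 * np.pi * (1 - 2 * r * np.cos(th) + r**2))

delta = 0.25
for r in (0.5, 0.9, 0.99, 0.999):
    h = poisson_kernel(r, theta)
    total = h.sum() * d                           # should be close to 1
    tail = h[np.abs(theta) > delta].sum() * d     # mass outside |theta| <= delta
    print(f"r = {r:5.3f}   total mass ~ {total:.4f}   mass on |theta| > {delta}: {tail:.5f}")
```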

Exercise - 12.1.2 - (The moral of Fourier series convergence) Let f be 2π- periodic integrable function with partial sums of its Fourier series Sn(f,x) = 1 n ikx π ikx 2π k= n e− (f,k), where (f,k) = π f(x)e dx, k = 0, 1, 2, . Ex- plain the− generalF steps behind theF convergence− of summability assisted± ± inversion.··· P R Remark - 12.1.2 - (Is Dirichlet inversion scheme the only game in town?) There are many functions which do not fit into the varieties that we have studied so far. For instance, ∞ f(t) = f eitλk , t R. k ∈ k= X−∞ 108 Basic Limit Theorems

For instance, U could be a binomial random variable and V could be a Poisson random variable, or a normal random variable. How should we compare U, V in such a case? Well, the natural thing to do is to see if their respective cdf (and/or densities) are similar. This is called the distributional comparison and it compares

F (x) = P(U x) with F (x) = P(V x), for each x R. (0.3) Lecture 13 U ≤ V ≤ ∈ The weak laws of large numbers use the (0.2) to measure closeness of two random variables. The strong laws of large numbers use the above mentioned almost sure sense and the central limit theorem uses (0.3) for comparison. Basic Limit Theorems The above three comparison methods are distinctly different and lead to differ- ent types of limit theorems. There are, however, some general links between them that we will provide in this lecture.

Much of the modern probability theory is based on various types of limit theorems. Here we look at the three or four broad categories. 13.1 Convergence in Distribution Recall that a random variable is a function on a sample space, Ω. The heart of all limit theorems lies in specifying how do we decide two random variables, U, V , Definition - 13.1.1 - (Convergence in distribution) We say that a sequence, are “close”. X1,X2, , of random variables converges in distribution to a random variable X if ··· 1. We could say U, V are “close” if, U(ω) V (ω) is small for each ω Ω. This | − | ∈ F (x) := P (X x) P (X x) =: F (x), approach is called pointwise comparison. By the way, this assumes that the n n ≤ → ≤ random variables be defined for the same experiment. Sometimes we add a for all real numbers, x, at which the distribution F (x) is continuous. We denote disclaimer that the closeness may fail for a few ωs as long as these “bad” dist dist this type of covergence by Xn X, or Fn F . ωs form a set (event) whose probability is zero. This version of pointwise → → comparison is called almost sure comparison. It is natural to ask, is the limiting distribution unique? To see that it is, suppose 2. We could say U, V are “close” if on the average (in the sense of expectations) that G is another possible such limit. Now their difference, U V , is small. (Once again this approach would be | − | meaningless if U, V are defined over two different sample spaces.) Here we F (x) G(x) F (x) Fn(x) + Fn(x) G(x) could consider several sub-varieties. For instance: | − | ≤ | − | | − | goes to zero when x is a point of continuity of both F and G. For other points, we Pick a p 1, and we may compute the Lp distance: • ≥ use the right continuity of both F,G to get that F (x) = G(x) for all x. (E U V p)1/p, (0.1) | − | dist Proposition - 13.1.1 - (Equivalent form for ) Let X,X1,X2, be a se- We may compute the, so called L0 distance, → ··· • quence of random variables. The following statements are equivalent. U V E | − | (0.2) (i) X dist X. 1 + U V • n →  | − | (ii) E (f(Xn)) E (f(X)) for every bounded continuous function f over R. and see if this is small. This unusual looking expectation approach • → can be studied via probabilities. The expectation in (0.2) is finite for Proof: Let F, F be the distributions of X,X respectively. Assume (i) holds all random variables whereas Lp distance may not be finite for some n n and let f be a nonzero bounded continuous function over R. Denote its bound by random variables. B = supx f(x) > 0. Now cut the tails off of the distribution of X. That is, for | | P ǫ 3. All of the above comparisons are stringent in the sense that they lock us into any ǫ > 0, find continuity points c of F so that ( X > c) B . Since Fn( c) ± | | P≤ 2ǫ ± → requiring that both U and V must be defined for the same random experi- F ( c), there exists an N such that for all n N we have ( Xn > c) B . Next ment. It is quite possible that U could be coming from one random experi- for± this ǫ, over the interval [ c, c] approximate≥ the continuous| function| ≤f by a step ment while V could be coming from a totally different random experiment. function h so that h(t) = −m a χ (t), where c = c < c < < c = c i=1 i (ci−1,ci] − 0 1 ··· m P 13.1 Convergence in Distribution 109 110 Basic Limit Theorems

and all these ci are points of continuity of F , and supt [ c,c] f(t) h(t) < ǫ. Exercise - 13.1.2 - Show that X, Y have the same distribution if and only if Extend h(t) = 0 for t [ c, c]. Note that ∈ − | − | E(h(X)) = E(h(Y )) for all bounded real-valued continuous functions h. 6∈ − m m Remark - 13.1.1 - (The continuity theorem and the Cram´er-Wold de- E (h(Xn)) = ai Fn(ci) Fn(ci 1) ai F (ci) F (ci 1) = E (h(X)) . { − − } → { − − } i=1 i=1 vice) Let X,X1,X2, is a sequence of random variables with respective char- X X ··· dist acteristic functions φ(t),φ (t),φ (t), . If X X then Proposition (13.1.1) Furthermore, for all n N, we also have 1 2 ··· n → ≥ gives that φn(t) φ(t) for every t R. This is because cos(tx), sin(tx) are → ∈ R E (f(Xn)) E (f(X)) bounded continuous functions of x for each fixed t . The converse holds as | − | R ∈ dist E E E well, namely if φn(t) φ(t) for every t then Xn X. This is known as f(Xn)χ Xn c f(X)χ X c + f(Xn)χ Xn >c → ∈ → ≤ {| |≤ } − {| |≤ } {| | } the continuity theorem. We will prove this later after studying Fourier-Stieltjes E + f(X )χ X >c  {| | } transforms. E E P P f(Xn)χ Xn c f(X)χ X c + B ( Xn > c) + B ( X > c) A d-dimensional version of convergence in distribution is defined analogously ≤ {| |≤ } − {| |≤ } | | | | E E and the corresponding analog of Proposition (13.1.1) holds as well. Furthermore f(Xn)χ Xn c  f(X)χ X c  + 3ǫ ≤ {| |≤ } − {| |≤ } the continuity theorem also holds. More precisely if X, X1, X2, is a sequence E E E E ··· f(Xn)χ Xn c  (h(Xn)) + (h (Xn)) (h(X)) dist dist ≤ {| |≤ } − | − | of d-dimensional random vectors then Xn X if and only if t′Xn tX for all E E → → + (h(X)) (f(X)) χ X c + 3ǫ vectors t′ =[t1,t2, ,td] consisting of real numbers. This result is known as the − {| |≤ } E E ··· (h(Xn)) (h(X)) + 5ǫ 5ǫ. Cram´er-Wold device. ≤ | − | → Since ǫ is arbitrary, the left side must go to zero. Example - 13.1.1 - (Limiting distributions of extreme order statistics) It The converse is easy. Let x, x + ǫ, x ǫ be points of continutiy of F . Consider − turns out that there are essentially three types of limiting distributions when one the bounded continuous functions f(t) shown below. tries to find the limiting distribution of

6 X(n) bn Zn := − , X(n) = max X1, ,Xn , . 1 . an { ··· } ...... where a > 0, and b are constants chosen so that Z dist G for some distribution . . n n n → . . G. The following three cases give the three varieties of G. . . . . iid . . Case 1: (Gumbel’s extreme value distribution) Let X1,X2, Exp(1). . . - t n ··· ∼ Now P(X(n) t) = (1 e− ) for any t > 0. Therefore, any an > 0 and bn, we see 0 x x + ǫ that ≤ −

n (bn+tan) bn By part (ii) we see that P(Zn t) = P(X(n) bn + tan) = 1 e− , t > . ≤ ≤ − −an   F (x + ǫ) E(f(X)) = lim E (f(Xn)) limsup P(Xn x). ≥ n ≥ n ≤ When we take an = 1 and bn = ln n, we see that a limiting distribution exists and equals Letting ǫ drop to zero over those x + ǫ which are points of continuity of F gives that limsup F (x) F (x). For the other side, consider the shifted version, h(t) = t ln n n t n n lim P(Zn t) = lim 1 e− e− , = exp e− =: G(t), t R. ≤ n n f(t + ǫ), which is also bounded and continuous. This gives →∞ ≤ →∞ − − ∈   This G is called Gumbel’s extreme value distribution. F (x ǫ) E(h(X)) = lim E (h(Xn)) liminf P(Xn x). − ≤ n ≤ n ≤ iid Letting ǫ drop to zero over those x ǫ which are points of continuity of F gives Case 2: (Frechet’s extreme value distribution) Let X1,X2, P areto(a, 1). a P 1 n ··· ∼ − that is, f(x) = a+1 for x > 1. Now (X(n) t) = (1 a ) for any t > 1. There- that liminfn Fn(x) F (x). x ≤ − t ≥ ♠ fore, any an > 0 and bn, we see that dist dist HW35 Exercise - 13.1.1 - If X X then show that h(X ) h(X) for any continuous n n n P P 1 1 bn R → → (Zn t) = (X(n) bn + tan) = 1 , t > − . function h over . ≤ ≤ − (b + ta )a a  n n  n 13.2 Convergence in Probability & WLLN 111 112 Basic Limit Theorems

1/a x So, if we take bn = 0 and an = n then Proof: Since 1+x is a continuous strictly increasing function of x > 0, therefore (1/t)a n P a Yn Y ε lim (Zn t) == lim 1 = exp (1/t) = G(t), t (0, ). Y Y ε if and only if | − | . n ≤ n − n {− } ∈ ∞ | n − | ≥ 1 + Y Y ≥ 1 + ε →∞ →∞   | n − | This G is called Frechet’s extreme value distribution, for any fixed constant a > 0. Therefore, Markov’s inequality shows that if (i) holds then iid Case 3: (Weibull’s extreme value distribution) Let X1,X2, Uniform(0, 1). Y Y ε 1 + ε Y Y ··· ∼ P P n E n a > 0 and b , we see that Now P(X t) = tn for any t (0, 1). Therefore, any ( Yn Y ε) = | − | | − | 0. n n (n) ≤ ∈ | − | ≥ 1 + Yn Y ≥ 1 + ε ≤ ε 1 + Yn Y → an > 0 and bn, we see that  | − |   | − | This gives (ii). For the converse, a “Hungarian trick” gives b 1 b P(Z t) = P(X b + ta ) = (b + ta )n , n < t < − n . n ≤ (n) ≤ n n n n −a a n n E Yn Y E Yn Y E Yn Y | − | = | − | χ Yn Y ε + | − | χ Yn Y <ε 1 1 + Yn Y 1 + Yn Y {| − |≥ } 1 + Yn Y {| − | } If we try bn = 1 and an = n , we get  | − |  | − |   | − |  E ε P ε χ Yn Y ε + = ( Yn Y ε) + . n ≤ {| − |≥ } 1 + ε | − | ≥ 1 + ε t t ( t) lim P(Zn t) = lim 1 + = e = e− − , t ( , 0). n ≤ n n ∈ −∞ ε →∞ →∞   When (ii) holds, the right side goes to 1+ε which can be made arbitrarily close to zero since ε > 0 is arbitrary. For the uniqueness of the limit, This is a special case of G(t) = exp ( t)α , for t < 0 and G(t) = 1 for t 0, for {− − } ≥ a positive constant α, known as Weibull’s extreme value distribution. P( Y Z ε) P( Y Y + Y Z ε) | − | ≥ ≤ | n − | | n − | ≥ P( Yn Y ε/2) + P( Yn Z ε/2). 13.2 Convergence in Probability & WLLN ≤ | − | ≥ | − | ≥ which goes to zero. As ε 0, by the continuity property P( Y Z > 0) = 0. ↓ | − | ♠ Let us start off by giving an official name to the unusual looking expectation sense, (0.2), of distance. Lp prob Remark - 13.2.1 - ( implies ) The last proposition says that convergence in probability can be→ proved by showing→ that either of the two quantities: Definition - 13.2.1 - (Convergence in probability) A sequence of random variables, Y ,n = 1, 2, , (all random variables defined over the same proba- n ··· P E Yn Y bility space) is said to converge in probability to a random variable, Y , denoted by ( Yn Y ε), or | − | , | − | ≥ 1 + Yn Y prob  | − | Yn Y , if → gets small as n gets larger. Unfortunately, neither of these two expressions is easy E Yn Y lim | − | = 0. to compute exactly. Instead often E(Y Y )2 is not hard to compute when it is n 1 + Yn Y n →∞  | − | finite. Via Chebyshev’s inequality, − The reason the above form of convergence is called convergence in probability E p is that it can be performed via probabilities, instead of the above types of expec- P Yn Y ( Yn Y ε) | −p | , for any p > 0. tations. | − | ≥ ≤ ε p E p p L Proposition - 13.2.1 - (Equivalent form for convergence in probability) When Yn Y 0 we say that Yn converge to Y in L , denoted as Yn Y . The | − | → p → Let Y, Y ,n = 1, 2, be random variables all defined over the same probability above observation shows that L convergence implies convergence in probability. n p space (Ω, ,P ). Then··· the following statements are equivalent. So, if Y, Z are two potential limits of L convergence, then by the uniqueness of the E limit obtained by convergence in probability, it must be that P(Y = Z) = 1 as well. 1. (i) Yn converge to Y in probability, in the sense of Definition (13.2.1). By the way, convergence in probability does not imply convergence in Lp. We will 2. 
(ii) For every ε > 0, we have address the converse after introducing the concept of uniform integrability. P lim ( Yn Y ε) = 0. prob n dist →∞ | − | ≥ Proposition - 13.2.2 - ( implies ) Let Y, Yn, n = 1, 2, be random variables all defined over the→ same probability→ space (Ω, ,P ). Then··· the following prob prob Uniqueness: If Y Y and Y Z then P(Y = Z) = 1. results hold. E n → n → 13.2 Convergence in Probability & WLLN 113 114 Basic Limit Theorems

(a) $Y_n \xrightarrow{prob} Y$ implies $Y_n \xrightarrow{dist} Y$.

(b) $Y_n \xrightarrow{dist} Y$ and $P(Y = c) = 1$ for some constant $c$ imply $Y_n \xrightarrow{prob} Y$.

Proof: For any $\epsilon > 0$, the fact $P(|Y_n - Y| \ge \epsilon) \to 0$ says that there exists a positive integer $N$ such that for all $n \ge N$ we have
$$P(|Y_n - Y| \ge \epsilon) \le \epsilon.$$
If $F, F_n$ are the distributions of $Y$ and $Y_n$ respectively, then another Hungarian trick gives
$$F(x-\epsilon) = P(Y \le x-\epsilon) = P(Y \le x-\epsilon,\ |Y_n - Y| \ge \epsilon) + P(Y \le x-\epsilon,\ |Y_n - Y| < \epsilon) \le P(|Y_n - Y| \ge \epsilon) + P(Y_n \le x) \le \epsilon + F_n(x),$$
for all $n \ge N$. Starting with $F_n$ on the left side gives an analogous inequality. So,
$$F(x-\epsilon) \le \epsilon + F_n(x), \qquad F_n(x-\epsilon) \le \epsilon + F(x), \qquad \text{for all } n \ge N.$$
Replacing $x$ by $x+\epsilon$ in the second inequality, we get
$$F(x-\epsilon) \le \epsilon + F_n(x) \le 2\epsilon + F(x+\epsilon).$$
Hence, we see that
$$F(x-\epsilon) \le \epsilon + \liminf_n F_n(x) \le \epsilon + \limsup_n F_n(x) \le 2\epsilon + F(x+\epsilon).$$
When $x$ is a point of continuity of $F$, letting $\epsilon$ drop to zero gives that $F_n(x) \to F(x)$. This proves part (a). To prove part (b), the reader may verify that for any $\epsilon > 0$,
$$P(|Y_n - Y| \ge \epsilon) \le 1 - P(Y_n \le c + 0.5\epsilon) + P(Y_n \le c - 0.5\epsilon).$$
Since $c \pm 0.5\epsilon$ are continuity points of the distribution of $Y$, the right side goes to zero. ♠

Remark - 13.2.2 - (Bernoulli's, Chebyshev's and Khintchin's WLLN) Bernoulli was the first one to notice that if $X_1, X_2, \dots$ forms a sequence of iid fair coin toss random variables (i.e., $P(X_i = 1) = 1 - P(X_i = 0) = \frac12$), then
$$Y_n := \frac{X_1 + X_2 + \cdots + X_n}{n} \xrightarrow{prob} E(X_1) = \frac12.$$
This is called Bernoulli's weak law of large numbers. Chebyshev extended Bernoulli's WLLN by noting that there was nothing special about fair coin tosses in his proof. One could have used any sequence of independent and identically distributed random variables $X_1, X_2, \dots$, as long as they had finite variance. The proof takes one line, where we take $P(Y = \mu) = 1$. For any $\varepsilon > 0$, by Chebyshev's inequality,
$$P(|Y_n - Y| \ge \varepsilon) = P(|Y_n - \mu| \ge \varepsilon) \le \frac{E(Y_n - \mu)^2}{\varepsilon^2} = \frac{Var(Y_n)}{\varepsilon^2} = \frac{\sigma^2}{n\varepsilon^2} \to 0,$$
as $n$ gets large. In fact, Chebyshev invented his inequality for this purpose. By the way, here we have proved a bit more by showing that the convergence occurs in the $L^2$ sense. Not too long after Chebyshev, Khintchin improved Chebyshev's version of the WLLN substantially in two directions. He showed that $\frac1n \sum_{i=1}^n X_i \xrightarrow{prob} E(X_1)$ holds under only the following two conditions:

(i) The sequence $X_1, X_2, \dots$ consists of only pairwise independent random variables.

(ii) Each $X_i$ has the same distribution with finite mean.

That is, now $Var(X_i)$ need not be finite, nor need the $X_i$ be mutually independent anymore. We will postpone its proof for now. A slightly more "expensive" version is in our reach provided we do not mind taking the continuity theorem of Remark (13.1.1) for granted. At this moment we will also take for granted that if $E|X_1| < \infty$ then the characteristic function of $X_1$ is differentiable.

Proposition - 13.2.3 - (A WLLN) Let $X_1, X_2, \dots$ be a sequence of independent and identically distributed random variables with finite mean $E(X_1) = \mu$. Then the sample mean $\overline{X}_n := \frac1n \sum_{i=1}^n X_i$ converges to $\mu$ in probability.

Proof: By part (b) of Proposition (13.2.2) we need only show that the sample mean $\overline{X}_n$ converges in distribution to a constant random variable $Y$, namely $P(Y = \mu) = 1$. Since the characteristic function of this $Y$ is $E(e^{itY}) = e^{it\mu}$, by the continuity theorem we need only verify that $E(e^{it\overline{X}_n})$ converges to $e^{it\mu}$ for all real $t$. For this purpose, note that if $\phi(t) = E(e^{itX_1})$ then the characteristic function of $\overline{X}_n$ is $(\phi(t/n))^n$. When $E|X_1| < \infty$ the characteristic function is differentiable, giving $\phi'(0) = i\mu$. Thus, L'Hopital's rule gives
$$\lim_{n\to\infty} \ln\big( (\phi(t/n))^n \big) = \lim_{n\to\infty} \frac{\ln(\phi(t/n))}{1/n}, \quad \frac00 \text{ form}, \quad = \lim_{n\to\infty} \frac{\phi'(t/n)\,(-t/n^2)}{\phi(t/n)\,(-1/n^2)} = t\,\frac{\phi'(0)}{\phi(0)} = it\mu.$$
Therefore, $\lim_n E(e^{it\overline{X}_n}) = e^{it\mu} = E(e^{itY})$. ♠

Exercise - 13.2.1 - (Slutsky's theorem) Let $X_n, X, Y_n$ be defined over the same probability space.

(i) Let $Y_n \xrightarrow{prob} 0$ and let $X_n \xrightarrow{dist} X$. Then show that $X_n + Y_n \xrightarrow{dist} X$ and $X_n Y_n \xrightarrow{dist} 0$.

(ii) If $Y_n \xrightarrow{prob} c$, where $c$ is any real number, and $X_n \xrightarrow{dist} X$, then show that $X_n + Y_n \xrightarrow{dist} X + c$ and $X_n Y_n \xrightarrow{dist} cX$.

[Hints: For the first part of item (i) verify
$$P(X_n \le x - \epsilon) \le P(|Y_n| > \epsilon) + P(X_n + Y_n \le x), \quad\text{and}\quad P(X_n + Y_n \le x) \le P(|Y_n| > \epsilon) + P(X_n \le x + \epsilon).$$
Prove the second part of item (i) by verifying
$$P(|X_n Y_n| > \epsilon) \le P(|Y_n| > \delta) + P(|X_n| > \epsilon/\delta).$$
For part (ii) consider $Z_n := Y_n - c$.]

Exercise - 13.2.2 - Let $X_1, X_2, \dots$ be a sequence of independent random variables from $N(\mu, \sigma^2)$. Prove that the sample variance
$$\frac{1}{n-1}\sum_{i=1}^n (X_i - \overline{X}_n)^2$$
converges to $\sigma^2$ in probability. [Hint: You may use Khintchin's WLLN.]
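The convergence in Proposition (13.2.3) is easy to watch numerically. The following sketch is mine, not part of the original notes; the Exp(1) distribution, the tolerance $\varepsilon = 0.1$ and the sample sizes are arbitrary choices. It estimates $P(|\overline{X}_n - \mu| \ge \varepsilon)$ by repeated sampling and shows it shrinking as $n$ grows.

    import numpy as np

    rng = np.random.default_rng(0)
    mu, eps, reps = 1.0, 0.1, 1000   # true mean of Exp(1), tolerance, Monte Carlo repetitions

    for n in (10, 100, 1000, 10000):
        # reps independent copies of the sample mean of n observations
        xbar = rng.exponential(scale=mu, size=(reps, n)).mean(axis=1)
        # Monte Carlo estimate of P(|sample mean - mu| >= eps)
        print(n, np.mean(np.abs(xbar - mu) >= eps))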

Exercise - 13.2.3 - Do Exercise (13.2.2) when the $X_i$ form a random sample from some distribution having finite variance $\sigma^2$.

Exercise - 13.2.4 - If $Z_n$ converge to $Z$ in distribution, where $Z \sim F$, and if $x_n \to x$ where $x$ is a point of continuity of $F$, then
$$\lim_{n\to\infty} P(Z_n \le x_n) = F(x).$$

Lemma - 13.2.1 - (Taylor expansion) Suppose $f$ is well defined in $[a, b]$. Fix a point $c \in (a, b)$. Suppose $f$ is continuous in a neighborhood around $c$ and differentiable at $c$. (Note that we are not asking that $f$ be differentiable in a neighborhood around $c$.) Then we have
$$f(c + x) = f(c) + x f'(c) - x\,\delta(x),$$
for a function $\delta$, where $\delta(x) \to 0$ as $x \to 0$.

Proof: Define a new function $g$ by
$$g(x) = f(c + x) - f(c) - x\{f'(c) - \epsilon\}; \qquad -(c - a) \le x \le (b - c),$$
where $\epsilon > 0$ is a number. Note that $g(0) = 0$. Since $g$ may not be differentiable in a neighborhood of $0$, we will compare the slopes of the secant lines. The differentiability of $f$ at $c$ gives that
$$\lim_{x\to 0} \frac{f(c+x) - f(c)}{x} = f'(c).$$
Therefore, for any $\epsilon > 0$ and all sufficiently small $x$, $x \in (0, \epsilon)$,
$$\frac{f(c+x) - f(c)}{x} > f'(c) - \epsilon \quad\text{and}\quad \frac{f(c+x) - f(c)}{x} < f'(c) + \epsilon.$$
After multiplying by $x$ on both sides and combining these two inequalities we get
$$x\{f'(c) - \epsilon\} < f(c+x) - f(c) < x\{f'(c) + \epsilon\}$$
for any $\epsilon > 0$ and all sufficiently small $x \in (0, \epsilon)$. In particular, we can always find a $\delta(x)$ such that
$$f(c + x) - f(c) = x\{f'(c) - \delta(x)\}.$$
The above two inequalities imply that, for all sufficiently small $x \in (0, \epsilon)$, it must be that $0 \le |\delta(x)| < \epsilon$. A similar argument gives the same conclusion for small negative values of $x$. Now, letting $\epsilon$ drop to zero forces $\delta(x)$ and $x$ to go to zero. That is,
$$f(c + x) = f(c) + x f'(c) - x\,\delta(x),$$
where $\delta(x) \to 0$ as $x \to 0$. ♠

Exercise - 13.2.5 - (Delta method) Let $k_n$ be a sequence of positive numbers diverging to infinity. Let $T_n$ be a sequence of random variables so that for some constant $\mu$ we have
$$k_n (T_n - \mu) \xrightarrow{dist} Z.$$

(i) For any function $f$ which is differentiable at $\mu$ with $f'(\mu) \ne 0$, show that
$$k_n \big(f(T_n) - f(\mu)\big) \xrightarrow{dist} f'(\mu)\, Z.$$

(ii) For any function $f$ which is twice differentiable at $\mu$ with $f''(\mu) \ne 0$ and $f'(\mu) = 0$, show that
$$k_n^2 \big(f(T_n) - f(\mu)\big) \xrightarrow{dist} \frac{f''(\mu)}{2}\, Z^2.$$
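Exercise (13.2.5) can also be explored by simulation. The sketch below is my own illustration; the choices $T_n$ = sample mean of Exp(1) data, $k_n = \sqrt{n}$ and $f(x) = x^2$ are not from the notes. It compares a few quantiles of $k_n(f(T_n) - f(\mu))$ with those of $f'(\mu)Z$, where $Z$ is the normal limit of $k_n(T_n - \mu)$.

    import numpy as np

    rng = np.random.default_rng(1)
    n, reps, mu = 1000, 5000, 1.0             # Exp(1): mean 1, variance 1, so sqrt(n)(T_n - mu) -> N(0,1)
    f = lambda x: x ** 2                      # f'(mu) = 2*mu = 2
    Tn = rng.exponential(scale=mu, size=(reps, n)).mean(axis=1)
    lhs = np.sqrt(n) * (f(Tn) - f(mu))        # k_n (f(T_n) - f(mu))
    rhs = 2 * mu * rng.standard_normal(reps)  # f'(mu) * Z with Z ~ N(0,1)
    print(np.quantile(lhs, [0.1, 0.5, 0.9]))  # the two sets of quantiles should be close
    print(np.quantile(rhs, [0.1, 0.5, 0.9]))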

Lecture 14

Almost Sure Convergence & SLLN

Definition - 14.0.2 - On a probability space a property holds almost surely if there exists an event $A$ with $P(A) = 1$ and the property holds for each $\omega \in A$. (We do not care whether the property holds or fails on $A^c$.)

What this phrase means is that those $\omega \in \Omega$ for which the property may not hold (or we are unable or do not want to check for some reason) form a "bad" set that has zero probability and that we intend to ignore. In the above definition that ignorable "bad" set is $A^c$.

Example - 14.0.1 - Let $X \sim N(0, 1)$. If we define $Y = \frac{1}{X - 3}$ then the new random variable $Y$ is not completely well defined since $X$ can take the value $3$ and we will have the unpleasant situation "$\frac10$". However, $P(X = 3) = 0$. Therefore, $Y$ is well defined for each $\omega$ in the set $A = \{\omega : X(\omega) \ne 3\}$ and $P(A) = 1$. Hence, even though $Y$ is not a well defined function mathematically, $Y$ is a well defined function probabilistically in the almost sure sense. Probabilists ignore events of zero probability, even countably infinitely many of them, since
$$P(\cup_i B_i) \le \sum_i P(B_i),$$
and a sum of infinitely many zeros is still zero.

Definition - 14.0.3 - (Almost sure convergence) We say that a sequence of random variables $\{Y_n\}$ converges almost surely to another random variable $Y$, denoted by $Y_n \xrightarrow{a.s.} Y$, if there exists a set $A$ with $P(A) = 1$ and $\lim_{n\to\infty} Y_n(\omega) = Y(\omega)$ for each $\omega \in A$. That is, for each fixed $\omega \in A$, we have the ordinary type of convergence: for every $\varepsilon > 0$, there exists a positive integer $N(\omega, \varepsilon)$ such that
$$|Y_n(\omega) - Y(\omega)| < \varepsilon, \qquad \text{for all } n \ge N(\omega, \varepsilon).$$

The first order of business is to know where this kind of convergence fits into the grand scheme of things, in relation to the other forms of convergence that we have already defined so far.

Proposition - 14.0.4 - (Equivalent forms for $\xrightarrow{a.s.}$) Let $Y_1, Y_2, \dots$ be a sequence of random variables defined on a probability space $(\Omega, \mathcal{E}, P)$ and let $A_n(\epsilon) = \{\omega \in \Omega : |Y_n(\omega)| \ge \epsilon\}$. The following statements are equivalent.

(i) $Y_n$ converge to $0$ almost surely.

(ii) For every $\epsilon > 0$ we have $P(\limsup_n A_n(\epsilon)) = 0$.

(iii) For every $\epsilon > 0$ we have $\lim_{n\to\infty} P(\cup_{k\ge n} A_k(\epsilon)) = 0$.

(iv) $\sup_{k\ge n} |Y_k| \xrightarrow{prob} 0$ (as $n \to \infty$).

(Uniqueness): If $Y_n \xrightarrow{a.s.} Y$ and $Y_n \xrightarrow{a.s.} Z$ then $P(Y = Z) = 1$.

Proof: The main thing to notice is that almost sure convergence of $Y_n$ to zero can be symbolically stated as
$$P\big(\{\omega \in \Omega : \forall \epsilon > 0\ \exists N \in \mathbb{N} \text{ so that } \forall n \ge N,\ |Y_n(\omega)| < \epsilon\}\big) = 1.$$
We may restate this in terms of set notation as
$$P\Big( \bigcap_{\epsilon > 0} \bigcup_{N=1}^\infty \bigcap_{n \ge N} \{\omega \in \Omega : |Y_n(\omega)| < \epsilon\} \Big) = 1.$$
Note that $\epsilon$ could be taken to be positive rationals. Since the probability is one for the intersection over all $\epsilon > 0$, and it can not get any higher, it must be that for each individual $\epsilon > 0$ we have
$$P\Big( \bigcup_{N=1}^\infty \bigcap_{n \ge N} \{\omega \in \Omega : |Y_n(\omega)| < \epsilon\} \Big) = 1.$$
This probability is that of $\liminf_n A_n(\epsilon)^c$, which is the same as $(\limsup_n A_n(\epsilon))^c$. So, parts (i) and (ii) are equivalent. The continuity property of $P$ shows that parts (ii) and (iii) are equivalent. Now part (iv) says that for every $\epsilon > 0$ we have
$$0 = \lim_{n\to\infty} P\Big( \sup_{k\ge n} |Y_k| \ge \epsilon \Big) \ge \lim_{n\to\infty} P\big( \cup_{k\ge n} \{|Y_k| \ge \epsilon\} \big),$$
giving part (iii). The converse holds since $\{\sup_{k\ge n} |Y_k| \ge 2\epsilon\} \subseteq \cup_{k\ge n} \{|Y_k| \ge \epsilon\}$. The uniqueness of the limit follows trivially. ♠

Corollary - 14.0.1 - For the notation of Proposition (14.0.4) we have

(a) ($\xrightarrow{a.s.}$ implies $\xrightarrow{prob}$) If $Y_n \xrightarrow{a.s.} 0$ then $Y_n \xrightarrow{prob} 0$.

(b) (Sufficient condition for $\xrightarrow{a.s.}$) If $\sum_{k=1}^\infty P(A_k(\epsilon)) < \infty$ for every $\epsilon > 0$ (called complete convergence) then $Y_n \xrightarrow{a.s.} 0$.

(c) (Necessary condition for $\xrightarrow{a.s.}$ under independence) If $Y_n$ is a sequence of independent random variables and $Y_n \xrightarrow{a.s.} 0$ then for every $\epsilon > 0$, $\sum_{k=1}^\infty P(A_k(\epsilon)) < \infty$.

(d) ($\xrightarrow{a.s.}$ versus $\xrightarrow{L^1}$) If $Y_n \xrightarrow{a.s.} 0$ and if $\sup_n E|Y_n|^p < \infty$ for some $p > 1$ then $Y_n \xrightarrow{L^1} 0$.

(e) ($\xrightarrow{prob}$ implies $\xrightarrow{a.s.}$ subsequentially) If $Y_n \xrightarrow{prob} 0$ then there exists an increasing subsequence of positive integers, $k_1, k_2, \dots$, so that $Y_{k_n} \xrightarrow{a.s.} 0$.

(f) (Subsequential characterization of $\xrightarrow{prob}$) $Y_n \xrightarrow{prob} 0$ if and only if for every subsequence $Y_{n(k)}$ there exists a further subsequence $Y_{n(k_j)}$ that converges to $0$ almost surely.

Proof: Part (a) follows from part (iv) of Proposition (14.0.4) and the fact that $0 \le |Y_n| \le \sup_{k\ge n} |Y_k|$. For part (b), when $\sum_{k=1}^\infty P(A_k(\epsilon)) < \infty$ then automatically the tail of this convergent series must go to zero, i.e., $\lim_n P(\cup_{k\ge n} A_k(\epsilon)) = 0$, so by part (iii) of Proposition (14.0.4) part (b) holds. For part (c) note that the sequence $A_k(\epsilon)$ is given to be independent and, by part (ii) of Proposition (14.0.4), $P(\limsup_n A_n(\epsilon)) = 0$; the contrapositive of the second Borel-Cantelli lemma gives that $\sum_{k=1}^\infty P(A_k(\epsilon)) < \infty$. For part (d), Hölder's inequality (with conjugate exponents $p$ and $q$) gives that
$$E|Y_n| = \int_\Omega |Y_n|\,\chi_{\{|Y_n|\ge\epsilon\}}\,dP + \int_\Omega |Y_n|\,\chi_{\{|Y_n|<\epsilon\}}\,dP \le \big(E|Y_n|^p\big)^{1/p}\big(P(|Y_n|\ge\epsilon)\big)^{1/q} + \epsilon.$$
By part (a), $P(|Y_n|\ge\epsilon) \to 0$, so $\limsup_n E|Y_n| \le \epsilon$; since $\epsilon > 0$ is arbitrary, $Y_n \xrightarrow{L^1} 0$. For part (e), consider the events
$$A_n(\epsilon_n) = \{|Y_n| \ge \epsilon_n\}, \qquad \text{for } \epsilon_n = \tfrac1n.$$
Since $Y_n$ converges to zero in probability, $P(A_n(\epsilon_1)) \to 0$, giving a $k_1$ so that $P(A_n(\epsilon_1)) < \epsilon_1^2$ for all $n \ge k_1$. Similarly, $P(A_n(\epsilon_2)) \to 0$; from this pick a $k_2 > k_1$ so that $P(A_n(\epsilon_2)) < \epsilon_2^2$ for all $n \ge k_2$. Continuing this way we obtain a subsequence $k_1 < k_2 < \cdots$ with $P(A_{k_n}(\epsilon_n)) < \epsilon_n^2 = 1/n^2$. Since $\sum_n 1/n^2 < \infty$, the first Borel-Cantelli lemma gives that, almost surely, only for finitely many $n$ we have $|Y_{k_n}(\omega)| \ge \frac1n$. That is, $|Y_{k_n}(\omega)| < \frac1n$ for all but finitely many $n$; that is, $Y_{k_n}(\omega)$ must be converging to zero. Finally, to prove part (f), since convergence in probability is ordinary convergence of $E\big(\frac{|Y_n|}{1+|Y_n|}\big)$ to zero, if $Y_n$ converge to $0$ in probability then every subsequence $Y_{n(k)}$ will also converge to $0$ in probability, and for it part (e) gives a further subsequence that converges to zero almost surely. Now to prove the converse, we assume that for every subsequence $Y_{n(k)}$ there exists a further subsequence $Y_{n(k_j)}$ that converges to $0$ almost surely. If $Y_n$ did not converge to $0$ in probability, then $E\big(\frac{|Y_n|}{1+|Y_n|}\big) \not\to 0$. Hence, for some $\epsilon > 0$ there exists an infinite subsequence $Y_{n(k)}$ such that $E\big(\frac{|Y_{n(k)}|}{1+|Y_{n(k)}|}\big) > \epsilon$ for all $k$. No subsequence of $\{Y_{n(k)}\}$ can therefore converge to $0$ in probability, but this contradicts the fact that it has a subsequence which converges to $0$ almost surely. This contradiction proves the proposition. ♠

In the above discussion the limiting random variable is specified. In the absence of the limiting random variable, we can proceed to characterize almost sure convergence as follows. This point of view will be needed while discussing random series.

Exercise - 14.0.6 - (Convergence without specifying the limit) Let $\{S_n, n = 1, 2, \dots\}$ be a sequence of measurable functions on some probability space $(S, \mathcal{F}, P)$. Let $C$ be the set of all $s \in S$ so that $S_n(s)$ converges as $n$ gets large. Verify that
$$C = \bigcap_{k\ge1} \bigcup_{n\ge1} \Big\{ s : \sup_{j > i \ge n} |S_j(s) - S_i(s)| \le \tfrac1k \Big\},$$
so that $C$ is a measurable set.

The following representation theorem (often attributed to Skorokhod) converts convergence in distribution into almost sure convergence on a suitably chosen probability space.

Theorem - (Skorokhod representation) - Let $\{X_n\}$ and $X$ be random variables with respective distributions $F_n$ and $F$ such that $X_n \xrightarrow{dist} X$. Then there exists a probability space $(S, \mathcal{E}, P)$ and random variables $\{Y_n\}$ and $Y$ defined on it, having respective distributions $F_n$ and $F$, such that $Y_n \xrightarrow{a.s.} Y$.

Proof: Take a uniform random variable $U \sim \text{Uniform}(0,1)$ defined over some probability space $(S, \mathcal{E}, P)$. (For instance, $S = [0,1]$, $\mathcal{E} = \mathcal{B}$ and $P((a,b)) = b - a$, with $U(u) = u$.) Now take
$$Y_n(s) := \inf\{x : F_n(x) \ge s\}, \qquad Y(s) := \inf\{x : F(x) \ge s\}, \qquad s \in (0,1),$$
so that $Y_n$ has distribution $F_n$ and $Y$ has distribution $F$.

Fix $s \in (0,1)$ and $\epsilon > 0$, and pick a continuity point $x$ of $F$ with $Y(s) - \epsilon < x < Y(s)$. Then $F(x) < s$, and the convergence $F_n(x) \to F(x)$ implies that $F_n(x) < s$ for all large $n$, so that $Y_n(s) \ge x > Y(s) - \epsilon$ for all large $n$. This being true for every $\epsilon > 0$, we must have $Y(s) \le \liminf_n Y_n(s)$. To go the other way, pick any $t$ such that $s < t$, let $\epsilon > 0$, and choose a continuity point $x$ of $F$ with $Y(t) \le x < Y(t) + \epsilon$, so that $F(x) \ge t > s$. The convergence of $F_n(x) \to F(x)$ implies that $F_n(x) > s$ for each large $n$. The definition of $Y_n(s)$, being the smallest such $x$, gives
$$Y_n(s) \le x < Y(t) + \epsilon \qquad \text{for all large } n.$$
Making $n$ large and then letting $\epsilon$ drop to zero gives that
$$\limsup_n Y_n(s) \le Y(t), \qquad \text{whenever } t > s.$$
Now $Y(t)$ is a nondecreasing function, so it can have at most countably many points of jump. Ignoring those $s$, letting $t \downarrow s$ gives that $\limsup_n Y_n(s) \le Y(s)$. Hence, $Y_n(s)$ converge to $Y(s)$ for all $s \in (0,1)$ except perhaps countably many $s$. ♠

Remark - 14.0.3 - (Summability & the strong law of large numbers) Summability theory came into being while trying to create an algorithm that assigns a limit to nonconvergent sequences. Of course, to avoid any arbitrariness in the assigned limit we require that the algorithm, when applied to a convergent sequence, must give the correct limit. Any such limit assignment algorithm is called a regular summability method. (Although most well known summability methods are linear operations, they do not have to be.) There are many such methods and one of the most popular of them all is called the Cesàro method. For any sequence $a_1, a_2, \dots$ of real (or complex) numbers the Cesàro method assigns the value
$$\lim_{n\to\infty} \frac1n \sum_{k=1}^n a_k,$$
provided this limit exists. It is not difficult to show that if $\{a_k, k \ge 1\}$ is a convergent sequence then the Cesàro method does give the correct answer (i.e., the Cesàro method is regular). There are many nonconvergent sequences for which the Cesàro method does not work (it is ineffective). The following proposition gives uncountably many such examples. On the other hand, there are quite a few nonconvergent sequences for which it is effective. For instance, for $a_k = (-1)^k$, it gives the limit $0$. The reader may try to find a few more examples.

Proposition - 14.0.5 - (Cesàro method is ineffective when mean does not exist) Let $X_1, X_2, \dots$ be a sequence of independent and identically distributed random variables so that $E|X_1| = +\infty$. Then the following results hold:

(i) $P(\{\omega \in \Omega : |X_k(\omega)| \ge k \text{ infinitely often}\}) = 1$.

(ii) $P(\{\omega \in \Omega : \frac1n \sum_{k=1}^n X_k \text{ converges to a finite limit}\}) = 0$.

Proof: First note that if $\frac1n \sum_{k=1}^n a_k \to L$ for a sequence of real numbers $a_k$, then
$$\frac{|a_n|}{n} = \left| \frac1n \sum_{k=1}^n (a_k - L) - \frac{n-1}{n}\cdot\frac{1}{n-1}\sum_{k=1}^{n-1} (a_k - L) + \frac{L}{n} \right| \to 0.$$
To prove (i) just note that $P(|X_1| > x)$ is a decreasing function of $x$. Therefore,
$$+\infty = E|X_1| = \int_0^\infty P(|X_1| > x)\,dx = \sum_{k=0}^\infty \int_k^{k+1} P(|X_1| > x)\,dx \le \sum_{k=0}^\infty P(|X_1| > k) = \sum_{k=0}^\infty P(|X_k| > k).$$
In the last equality we used the fact that $X_1, X_2, \dots$ are identically distributed; in particular, $\sum_k P(|X_k| > k) = \infty$. The events $A_k = \{|X_k| > k\}$ being independent, the second Borel-Cantelli lemma gives that $|X_k| > k$ occurs infinitely often almost surely. This gives part (i). Now part (ii) follows since the Cesàro method is ineffective for any such sequence: on any $\omega$ for which $\frac1n\sum_{k=1}^n X_k(\omega)$ converges to a finite limit, the first display forces $X_n(\omega)/n \to 0$, so $|X_n(\omega)| \ge n$ could occur only finitely often, which by part (i) happens only on a set of probability zero. ♠

Example - 14.0.2 - Let $X_1, X_2, \dots \overset{iid}{\sim} \text{Cauchy}(0, 1)$. Since $E|X_1| = +\infty$, the above proposition shows that $P\big(\frac1n \sum_{i=1}^n X_i \text{ converges}\big) = 0$. Either by using the uniqueness of characteristic functions (a result that we will prove later) or by using convolutions, $\overline{X}_n = \frac1n \sum_{i=1}^n X_i \sim \text{Cauchy}(0,1)$, so $\overline{X}_n \xrightarrow{dist} \text{Cauchy}(0, 1)$.
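Example (14.0.2) is easy to see in a simulation. The short sketch below is my own (seed and sample sizes are arbitrary); it shows that the running sample means of Cauchy(0, 1) data keep fluctuating instead of settling down, exactly as Proposition (14.0.5) predicts.

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.standard_cauchy(10**6)                       # X_1, X_2, ... iid Cauchy(0,1)
    running_mean = np.cumsum(x) / np.arange(1, len(x) + 1)
    for n in (10**2, 10**3, 10**4, 10**5, 10**6):
        # no stabilization: the sample mean of Cauchy data is again Cauchy(0,1) for every n
        print(n, running_mean[n - 1])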

Example - 14.0.3 - (A strong law of large numbers, SLLN) In light of the last result one must impose finiteness of some moment in order to hope that a strong law of large numbers might hold. A remarkable result of Kolmogorov says that assuming $E|X_1| < \infty$ does make the Cesàro method effective, and that the Cesàro limit turns out to be $E(X_1)$. We will postpone the proof of this result until we build several necessary results. However, if $E(X_1^4)$ is finite, the proof is easy. Note that, by Markov's inequality, for any $\varepsilon > 0$,
$$\sum_{n=1}^\infty P(|\overline{X}_n - \mu| \ge \varepsilon) \le \sum_{n=1}^\infty \frac{E(\overline{X}_n - \mu)^4}{\varepsilon^4} = \sum_{n=1}^\infty \frac{nE(X_1-\mu)^4 + 3n(n-1)\sigma^4}{\varepsilon^4 n^4} \quad \text{(by Exercise (14.0.8))}$$
$$\le \frac{E(X_1-\mu)^4}{\varepsilon^4}\sum_{n=1}^\infty \frac{1}{n^3} + \frac{3\sigma^4}{\varepsilon^4}\sum_{n=1}^\infty \frac{1}{n^2} < \infty.$$
Item (b) of Corollary (14.0.1) shows that $\overline{X}_n$ converge to $\mu$ almost surely. Furthermore, since we have shown that $\sup_{n\ge1} E|\overline{X}_n - \mu|^4 < \infty$, item (d) of Corollary (14.0.1) shows that $\overline{X}_n$ converge to $\mu$ in the $L^1$ sense as well, i.e., $\lim_{n\to\infty} E|\overline{X}_n - \mu| = 0$. As a special case, a result of Borel falls out, which he proved when $X_i \sim B(1, \frac12)$. Borel's result is also known as Borel's normal number theorem.

Exercise - 14.0.8 - Verify $E(\overline{X}_n - \mu)^4 = \dfrac{n\,E(X_1-\mu)^4 + 3n(n-1)\sigma^4}{n^4}$.

Remark - 14.0.4 - (Fatou's lemma & asymptotic variance) If $(Y_n - E(Y_n)) \xrightarrow{a.s.} Y$, by Fatou's lemma, all we can say is
$$Var(Y) \le E(Y^2) \le \liminf_n E(Y_n - E(Y_n))^2 = \liminf_n Var(Y_n).$$
In general the inequality can be strict. However, for certain martingales (as mentioned below) equality can be guaranteed.

Example - 14.0.4 - (Martingale convergence theorem) One nice thing about martingales is that they converge almost surely under one simple condition. If $M_1, M_2, \dots$ is a martingale with $\sup_n E|M_n|^p < \infty$ for some constant $p \ge 1$, then there exists a random variable $Z$ such that $M_n \xrightarrow{a.s.} Z$. This is called the martingale convergence theorem. In fact, if $p = 2$ then we can further say that $E|M_n - Z|^2 \to 0$ as well; in particular, $Var(M_n) \to Var(Z)$. Its proof is quite deep and we will postpone it until we build the necessary tools later on.

To see an example of its use, consider the random harmonic series,
$$M_n := \sum_{k=1}^n \frac{2U_k - 1}{k}, \qquad n = 1, 2, \dots,$$
where $U_1, U_2, \dots \overset{iid}{\sim} B(1, \frac12)$, i.e., fair coin toss outcomes: $U_i = 1$ if head occurs and $U_i = 0$ otherwise. Note $M_n$ is a martingale and
$$E|M_n|^2 = Var(M_n) = \sum_{k=1}^n \frac{Var(2U_k - 1)}{k^2} = \sum_{k=1}^n \frac{1}{k^2}.$$
This gives that $\sup_n E|M_n|^2 = \sum_{k=1}^\infty \frac{1}{k^2} = \frac{\pi^2}{6}$. The martingale convergence theorem therefore guarantees the existence of a random variable, say $Z$, so that $M_n \to Z$ almost surely. In other words,
$$Z = \sum_{k=1}^\infty \frac{2U_k - 1}{k}.$$
What is the distribution of $Z$? It is known that $Z$ has a density (with respect to Lebesgue measure), but no one knows its closed form expression. However, $E(Z) = 0$ and $Var(Z) = \frac{\pi^2}{6}$. Many types of extensions and variants have been explored in the literature. Four simulation approximations of the density of $Z$ are presented in Figure (14.1). The superimposed curve is the density of $N(0, \frac{\pi^2}{6})$ for comparison purposes.

[Figure 14.1: Density of Random Harmonic Series. Four histograms of simulated values of $M_n$, for $n = 30, 50, 100$ and $500$, each with the $N(0, \pi^2/6)$ density superimposed.]

HW37 Exercise - 14.0.9 - (Martingale convergence) Let $U_1, U_2, \dots \overset{iid}{\sim} \text{Uniform}\{0, 1, 2, \dots, 9\}$ and
$$X_n := \sum_{i=1}^n \frac{U_i - 4.5}{10^i}, \qquad n \ge 1.$$
Does $X_1, X_2, \dots$ obey the martingale property with respect to $U_1, U_2, \dots$? If so, does the martingale converge? If so, to which random variable and in which sense?
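Figure (14.1) is straightforward to reproduce. The sketch below is mine (the number of replications is an arbitrary choice); it simulates $M_n = \sum_{k\le n} (2U_k - 1)/k$ from Example (14.0.4) and compares the empirical mean and variance of its distribution with those of $N(0, \pi^2/6)$.

    import numpy as np

    rng = np.random.default_rng(3)
    reps = 10000
    for n in (30, 50, 100, 500):
        signs = 2 * rng.integers(0, 2, size=(reps, n)) - 1      # 2U_k - 1 = +-1 fair coin outcomes
        Mn = (signs / np.arange(1, n + 1)).sum(axis=1)          # random harmonic partial sums
        print(n, Mn.mean(), Mn.var(), np.pi**2 / 6)             # Var(M_n) approaches pi^2/6 = Var(Z)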

Exercise - 14.0.10 - (Martingale convergence) For Exercise (14.0.9) write a computer program and simulate the distribution of Xn for n = 5, 10, 15 and 20.
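One possible sketch for Exercise (14.0.10) is the following (my own, not a prescribed solution; the number of replications is arbitrary). It simulates $X_n = \sum_{i\le n} (U_i - 4.5)/10^i$ for the requested values of $n$ and prints a few empirical quantiles; these settle close to the quantiles of the Uniform$(-0.5, 0.5)$ distribution.

    import numpy as np

    rng = np.random.default_rng(4)
    reps = 50000
    for n in (5, 10, 15, 20):
        digits = rng.integers(0, 10, size=(reps, n))                        # U_i uniform on {0,1,...,9}
        Xn = ((digits - 4.5) / 10.0 ** np.arange(1, n + 1)).sum(axis=1)     # martingale of Exercise (14.0.9)
        print(n, np.quantile(Xn, [0.05, 0.25, 0.5, 0.75, 0.95]))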

HW38 Exercise - 14.0.11 - (How to pick a point at random from $[-0.5, 0.5]$?) By Exercise (14.0.9) explain how you can pick a point (approximately) at random from $[-0.5, 0.5]$.

(Note that f 0 is always a member of since for any g , 0 = g g ). ≡ L ∈L − ∈L Minkowski’s inequality shows that on Lp, for each 1 p < , ≤ ∞ 1 p X := X p dP k kp | | ZS  is a semi-norm. To make it a norm, i.e., to make X = 0 imply that X = 0, (it Lecture 15 k kp only implies that X a.s.= 0) we identify all those r.v.s which are equal almost surely as one equivalence class of r.vs. We denote this set by Lp as well. More formally, if p [X] is the class of functions which are equal almost surely to X, and [Y ] is another The L Spaces & Uniform such class then we define α[X]:=[αX] for any α R ∈ Integrability [X]+[Y ]:=[X + Y ].

This definition does not, clearly, depend on the choice of X1 [X] and Y1 [Y ]. Therefore, Lp becomes a linear space. If we define ∈ ∈ We start off with an exercise for the reader. It is a special case of the concept of [X] pdP := X pdP uniform integrability which we will pick up later in this lecture. | | | | ZS ZS then it is a well defined number and Lp becomes a normed linear space. In fact, if HW39 Exercise - 15.0.12 - (Uniform integrability of integrable r.v.) Let X be a [X], [Y ] Lp then for p 1, random variable. Prove that the following three statements are equivalent. ∈ ≥ 1/p (a) E X < . ρ([X], [Y ]) := X Y pdP = X Y • | | ∞ | − | k − kp  S  (b) If An is a sequence of events so that P (An) 0 then X dP 0. Z • → An | | → To avoid this complicated notation we will denote the elements [X] by X from now p (c) limn X dP = 0. R on and continue to use the terminology “random variable in L ”. • →∞ X n | | {| |≥ } Here are the four basic varieties. In particular, deduceR that (a) implies limn nP ( X n) = 0, but the converse may 0 0 not hold. | | ≥ (i) (L , case) Over the whole L we may define the distance between X, Y • as X Y ρ (X, Y ) = E | − | . The linear space consisting of all the random variables (actually equivalence 0 1 + X Y P 0  | − | classes) defined over a probability space (Ω, , ) will be denoted by L . The space One can verify that ρ obeys the requirements for a distance function. In 0 E 0 L is broken down into various subsets with the help of a constant p [0, + ]. L0 prob p ∈ p ∞ this case X X, or ρ (X ,X) 0, stands for X X. When p > 0, the set L consists of all those random variables for which E X < . n → 0 n → n → It is not difficult to see that Lp itself is linear, called the Lp space. | | ∞ (ii) (Lp, p (0, 1) case) It turns out that for p (0, 1), if we define • ∈ ∈ ρ (X, Y ) := E X Y p, Definition - 15.0.4 - Let be a linear space. A real valued function on is p | − | L k · k L called a seminorm if for any f,g , then again the resulting linear space becomes a metric space. In this case ∈L p L p (i) f 0. Xn X or ρp(Xn,X) 0, stands for E Xn X 0. • k k ≥ → → | − | → (ii) αf = α f for any α R. (iii) (Lp, 1 p < case) Using Minkowski’s inequality, when 1 p < , • k k | | k k ∈ • ≤ ∞ ≤ ∞ (iii) f + g f + g . ρ (X, Y ) := X Y = (E X Y )1/p , • k k ≤ k k k k p || − ||p | − | p Furthermore, is called a norm if in addition p L k · k makes L a normed linear space. In this case Xn X or ρp(Xn,X) 0, (iv) f = 0 if and only if f 0. stands for (E X X p)1/p 0. → → • k k ≡ | n − | → The Lp Spaces & Uniform Integrability 129 130 The Lp Spaces & Uniform Integrability

(iv) (L∞ case) Finally, there is one further case, namely that of L∞. This is p p • ǫ + Y dP. the space of essentially bounded random variables. That is, X L∞ if and ≤ Xn ǫ | | a.s.∈ Z{| |≥ } only if there exists a positive real number K such that X K. In this the last integral goes to zero by Exercise (15.0.12). Since ǫ is arbitrary, the result space the size of a random variable is measured by | | ≤ follows. ♠ a.s. p ρ (X, Y ) = inf K : X Y K = inf K : P( X Y > K) = 0 Since the concepts of convergence in probability and L convergence are defined ∞ | − | ≤ { | − | } over metric spaces, it is natural to ask if the Cauchy criterion can be invoked? That n o The number ρ (X, Y ) is called the essential supremum of X Y . It is is, does every Cauchy sequence have to converge to some element of the space? If ∞ L∞ | − | the answer is yes, we say that the metric space is complete. It turns out that again a metric space. In this case Xn X or ρ (Xn,X) 0. p → ∞ → all these L spaces are complete metric spaces for their respective metrices. The p As subsets, the various Lp spaces form a tower, thanks to the Lyapunov inequality. completeness of L spaces for p 1 is a theorem due to Riesz and Fischer which is the main focus of this section.≥ In the proof of the Riesz-Fisher theorem we will L0 need the following proposition. 1/2 p L Proposition - 15.0.7 - Let 0 < p < and let Xn,n 1 be a sequence in L so that ∞ { ≥ } 1 L p 1/p 1 (E Xn Xn+1 ) < L2 | − | 4n L∞ for all n 1, then X converge almost surely to a random variable. ≥ n Proof: Consider the set n A := s S : X (s) X (s) 2− , (n 1). n ∈ | n − n+1 | ≥ ≥ By the Chebyshev inequality we get

np p np p np P (An) 2 Xn Xn+1 dP 2 E Xn Xn+1 < 2− . As far as convergence is concerned, Fatou’s lemma gives that for any X ,X , ≤ | − | ≤ | − | 1 2 ZAn L0 we always have ···∈ ∞ p p E liminf Xn liminf E Xn . Therefore, P (An) < . By the first Borel-Centelli lemma P limsupAn = 0. n | | ≤ n | | ∞ n n=1     That is, forX almost all s S, there exists an N(s) such that Remark - 15.0.5 - (Some other “Lp” spaces) We should mention that there are ∈ P n are several other classification techniques that use the tail ( X >t) itself, rather Xn(s) Xn+1(s) < 2− for all n N(s). than the moments, to define the, so called, weak Lp space. | | | − | ≥ Thus, for all m>n N(s), We should also mention the weighted Lp space used for defining orthogonal ≥ polynomials. In this space, a single random varible, X, which lies in Lp for all p 1 m 1 m 1 ≥ − − k n+1 (i.e., X has all moments) is used to collect all functions h(t) so that E h(X) 2 < . Xn(s) Xm(s) Xk(s) Xk+1(s) < 2− 2− . 2 | | ∞ | − | ≤ | − | ≤ The resulting L (X) space consists of deterministic functions h. Such a space has kX=n kX=n all polynomials as its members and infinitely many orthogonal polynomials, when That is, Xn(s),n 1 is a Cauchy sequence of real numbers for almost all s S X takes infinitely many values. and hence{ it converges.≥ } ∈ ♠ Lp prob Proposition - 15.0.6 - Let 0 < p < be a number. If Xn 0 then Xn 0. Remark - 15.0.6 - The existence of the random variable X in the above propo- prob ∞ → → sition does not guarantee that it is a member of Lp. Of course, the following Conversely, if Xn 0 and there exists a random variable Y such that Xn < Y → p | | p L Riesz-Fischer theorem shows when it is. so that E Y < then Xn 0. p | | ∞ → Note that if Xn X in the metric space L , 1 p < , then X is unique almost surely. Also, it→ implies that ≤ ∞ Proof: The forward case is obvious. Conversely, Xn p X p Xn X p 0 as n . p p p | k k − k k | ≤ k − k → →∞ E( Xn ) = Xn dP + Xn dP That is, p is a continuous real valued function. | | Xn <ǫ | | Xn ǫ | | Z{| | } Z{| |≥ } k · k The Lp Spaces & Uniform Integrability 131 132 The Lp Spaces & Uniform Integrability

Theorem - 15.0.2 - (Riesz-Fischer: completeness of $L^p$) For $1 \le p < \infty$ the space $L^p$ is a complete metric space.

Exercise - 15.0.15 - ($L^p$, $0 < p < 1$ case) For any fixed $p \in (0, 1)$ prove or disprove that $L^p$ is a complete metric space.

p Proof: Let Xn,n 1 be a Cauchy sequence in (L ,ρp). That is, for any ε > 0, Exercise - 15.0.16 - (Completeness of ℓ∞) Let ℓ∞ be space of all real (or com- { ≥ } 1 plex) sequences. That is, functions defined over 1, 2, with norm f = there exists N such that Xn Xm p < ε for all n,m N. For ε = 4 let N1 be k − k ≥ 2 { ···} k k 1 1 supk f(k) . Show that the space is complete. the integer so that Xn Xm p < 4 for all n,m N. For ε = 4 let N2 > N1 | | k − k 1 2 ≥ be the integer so that Xn Xm p < 4 for all n,m N2. Continue  this way Exercise - 15.0.17 - (Completeness of L∞(µ)) Let (Ω, , µ) be a measure space. and consider the subsequencek − X k ,i 1 . Since, ≥ E { Ni ≥ } Let L∞(µ) be the space of functions consisting of those f for which 1 X X < , i 1, f ess := inf K : µ( s : f(s) > K ) = 0 Ni − Ni+1 p 4i ≥ || || K>0 { { | | } }

is finite. Show that f ess is a norm. Is L∞(µ) complete? the previous proposition says that XN i ,i 1 converge to a random variable X || || a.s. on S. Therefore, X is measurable{ and≥ Fatou’s} lemma gives

p p p 15.1 Uniform Integrability Xn X dP liminf Xn XNi dP = liminf Xn XNi p. | − | ≤ i | − | i k − k ZS ZS The basic theme of this section is to improve the Lebesgue dominated convergence Now, for any ε > 0, picking n large enough, the Cauchy sequence X gives that theorem. Recall that the Lebesgue dominated convergence theorem gives sufficient { n} conditions for a sequence of random variables, Xn converging almost surely to X X p <ε for all large i. { } k n − Ni kp another random variable X so that p p P That is, Xn X dP < for large n and lim Xn X dP = 0. Thus, lim Xn X d = 0. | − | ∞ n | − | n | − | S →∞ S →∞ ZΩ X = (XZ X) + X Lp and X X 0 as nZ . − n − n ∈ k n − kp → →∞ ♠ Now we will provide some necessary and sufficient conditions for this to hold. This requires the notion of uniform integrability. Remark - 15.0.7 - Any complete normed linear space is called a Banach space. We have just proved that the space Lp, for each p 1, is a Banach spaces. Another Definition - 15.1.1 - Let (Ω, , P) be a probability space. Let K = X ,α I ≥ α important example of a Banach space is C[a, b] with be a collection of random variables.k We say the collection is uniformly{ integrable∈ if} P f := max f(x) . k k a x b| | lim sup Xα dP = 0. ≤ ≤ t →∞ α I x Ω: Xα(x) t | | n n p ∈ Z{ ∈ | |≥ } Other common examples are (R, ), (R , p), (R , ), ℓ , 1 p . The ∞ E list of examples is extremely large|·| (infinite).k · k k · k ≤ ≤∞ Note that any collection consisting of only one random variable, X, with X < , is automatically uniformly integrable, cf. Exercise (15.0.12). Using this we| | see ∞ Exercise - 15.0.13 - In (R, ) we have a simple way of checking summability that any finite collection of such random variables will also be uniformly integrable. of a series. One useful result|·| for characterizing completeness of a normed linear Natanson [ ] uses the term equi-absolutely continuous integrals to describe uni- space is the following summability comparison of absolute (norm) summability and form integrability. Uniform integrability plays a major role in Probability, Analysis ordinary summability. and other fields. Here is another way of of looking at uniform integrability. Let (B, ) be a normed linear space. Then show that the following are equivalent k · k Proposition - 15.1.1 - A collection Xα,α I is uniformly integrable if and only if the following results hold. { ∈ } (i) (B, ) is a Banach space. (i) supE X < , • k · k α n α I | | ∞ (ii) For any f ,n 1 in B such that ∞ f < implies that f ∈ • { n ≥ } n=1 k nk ∞ i=1 i (ii) For any ε > 0, there exists a δ > 0 such that for any event A with converges to an element f in the metric space (X, ). P(A) < δ we have, P k · k P Exercise - 15.0.14 - (L0 is complete) Show that the space L0 is complete. That P is, show that for any Cauchy sequence X in L0, there exists a random variable sup Xα d < ε. n α I A | | prob { } ∈ Z X so that X X. [Hint: Use Exercise (14.0.7).] n → 15.1 Uniform Integrability 133 134 The Lp Spaces & Uniform Integrability

prob Proof: Assume the two conditions hold. By Chebyshev’s inequality, – (i) X X, n → E E – (ii) X has finite expectation, Xα supα I Xα P ( Xα t) | | ∈ | |. | | ≥ ≤ t ≤ t – ((iii) Xn,n 1 is uniformly integrable. { ≥ } 1 E P Thus, if we let Aα,t := Xα t and let t > δ supα I Xα , then (Aα,t) < δ Proof: Assume (1) holds. For n = 1, we see that for all α I. Thus, {| | ≥ } ∈ | | ∈ E X E X X + E X . | | ≤ | 1 − | | 1| P P Xα d sup Xβ d < ε, for each α I. This shows that X has finite expectation giving (ii). By Chebyshev’s inequality, A | | ≤ β I A | | ∈ Z α,t ∈ Z α,t E X X Since, this is true for each α I, we have, P ( X X >ε) | n − | 0 ∈ | n − | ≤ ε →

sup Xα dP < ε for all large t. as n . This gives part (i). To show uniform integrability note that for ε = 1, α I A | | →∞ ∈ Z α,t E Xn E Xn X + E X To prove the converse, let ε > 0. Then there exists a T (ǫ) > 0 such that | | ≤ | − | | | 1 + E X = K , for all n N . ≤ | | 1 ≥ 1 P ε sup Xα d < ; forall t T (ε). Since Xn are integrable, we see that α I Xα t | | 2 ≥ { } ∈ Z{| |≥ } supE Xn max E Xi + K1; 1 i < N1 < . We use this T = T (ε) carefully to show that both (i) and (ii) hold. Indeed, n | | ≤ { | | ≤ } ∞ Next, for any ε > 0, using the uniform integrability of X and the given information, P P P sup Xα d = sup Xα d + Xα d (a) there exists a δ > 0 such that E X < ε whenever P(A) < δ, α I Ω | | α I Xα T | | Xα 0 so that ≤ α I Xα T | | ≤ 2 ∞ 1 2 N2 1 2 N2 ∈ (Z{| |≥ } ) { ··· } E ε P This gives (i). To show (ii), take δ = ε/(2T ). Then for any event A, with P(A) < δ, Xi < whenever (A) < δi, 1 i N2. | | 2 ≤ ≤ P Take δ∗ = min δ, δ1, δ1,...,δN2 . Then, for any event A with (A) < δ∗, sup Xα dP = sup Xα dP + Xα dP { } α I A | | α I ( A Xα T | | A Xα

So, for any ǫ > 0, we have Exercise - 15.1.7 - (Beppo Levi’s theorem) Let Xn be random variables with finite expectations such that E X X = X X dP + X X dP n n n E | − | Xn X t | − | Xn X

HW40 Exercise - 15.1.1 - If Xi are identically distributed and have finite expectations then prove that Xi is uniformly integrable. lim fn(t) f(t) dm(t) = 0. { } n R | − | →∞ Z Exercise - 15.1.2 - Let Xt,t T and Ys, s U be uniformly integrable fami- And hence Eh(Xn) Eh(X) for any bounded continuous function h. lies over the same probability{ space.∈ } Prove{ that∈ } →

(1) Xt + Ys,t T, s U is also uniformly integrable Exercise - 15.1.9 - (Modified LDCT) Let Xn Yn where Yn have finite ex- • { ∈ ∈ } prob prob | | ≤ pectations and that X X and Y Y with Y having finite expectation and (2) Any subcollection of a uniformly integrable family is again uniformly n → n → • E(Y ) E(Y ). Then show that integrable. n →

lim E Xn X = 0. n Exercise - 15.1.3 - Let Xt,t T be a family of random variables defined over a →∞ | − | probability space so that{X ∈X almost} surely for each t T and that E X < . | t| ≤ ∈ | | ∞ Prove that the family is ui. Exercise - 15.1.10 - Let K = Xα, α I be a collection of random variables in p { ∈ } L for a p > 1. If supα I Xα p < then show that K is uniformly integrable. ∈ || || ∞ prob Exercise - 15.1.4 - Let X a.s. X (or that X X or that X dist X) and let n → n → n → Exercise - 15.1.11 - Let p > 0 (note that p could be less than one). Let K = Xn be uniformly integrable. Prove that X has finite expectation and that p { } Xn,n = 1, 2, be a set in L . Prove that the following are equivalent. { ···} lim E(Xn) = E(X). p prob n (1) Xn , n 1 is uniformly integrable and Xn X, →∞ • {| | ≥ } → (2) E X X p 0, Exercise - 15.1.5 - Consider the probability space ( 1, 1), , P), where P((a, b]) = • | n − | → b a { − B − , and let X be the function over ( 1, 1) such that it takes the value n over prob 2 n − (3) X Lp, E X p E X p and X X. the interval (0, 1/n) and takes the value n over the interval ( (1/n), 0) and zero • ∈ | n| → | | n → − − a.s. otherwise. Show that Xn are not uniformly integrable. However Xn 0 =: X → Exercise - 15.1.12 - Let (S, ,P ) be a probability space. Let Φ(x) : [0, ) and that F ∞ → [0, ) be an increasing function so that Φ(x)/x as x . Let K L1(P ) E(Xn) E(X). ∞ → ∞ → ∞ ⊆ → be a collection of random variables (automatically having finite means). If Does this contradict Theorem (15.1.1)? sup E(Φ( X )) = U < , X K | | ∞ Exercise - 15.1.6 - Let X ,n 1 be a collection of random varibles having finite ∈ { n ≥ } expectations. Show that the following two statements are equivalent. then prove that K is uniformly integrable.

(a) $\lim_{n\to\infty} E|X_n - X| = 0$.

(b) $E|X| < \infty$, $E|X_n| \to E|X|$ and $X_n \xrightarrow{prob} X$.

Exercise (15.1.12) contains the gist of uniform integrability over probability spaces. The following theorem shows why.

Theorem - 15.1.2 - (de la Vall´ee Poussin) Let (S, ,P ) be a probability space where φ is the number of possible values of n for which T k. Since, T 0, ∞ x Φ(x) 1 x 1 x Φ(x) = φ(t) dt; x > 0, = φ(t) dt φ(t) dt 0 x x ≥ x Z Z0 Zx/2 φ(x/2) x φ(x/2) where φ(t) is a non-negative and non-decreasing function with φ(t) . By the 1 dt = . uniform integrability of K, for ε = 1/2n, we find positive numbers T →(we ∞ can take ≥ x 2 →∞ n Zx/2 Tn to be positive integers so that Tn Tn | | 2 ··· ∈ Z{| | } ∞ EΦ( Xα ) = E Φ( Xα )χ k Xα j jX=T X X ∞ ∞ = φ P ( X >j) 1. X dP = X dP. j | α| ≤ ≤ j X

1 ∞ Exercise - 15.1.13 - If Xn is uniformly integrable then prove that Xn is also X dP P ( X k). { } { } n α α uniformly integrable. (Averages preserve uniform integrability of random vari- 2 ≥ Xα >Tn | | ≥ | | ≥ Z{| | } kX=Tn ables). Summing this inequality over n gives that Exercise - 15.1.14 - (This assumes WLLN) If Xn are independent and identi- { } L1 ∞ ∞ ∞ ∞ cally distributed random variables with E X1 < then prove that Xn E(X1). 1 P ( Xα k) = χ Tn k (n) P ( Xα k) | | ∞ → ≥ | | ≥ { ≤ } | | ≥ n=1 k=Tn k=1 n=1 X X X X Exercise - 15.1.15 - Extend the above two exercises as follows. Let A = (ank) be ∞ ∞ ∞ a non-negative summability method whose rows add up to 1. If Xn is uniformly = P ( Xα k) χ Tn k (n) = P ( Xα k) φk, { } | | ≥ { ≤ } | | ≥ integrable then prove that (AX)n,n = 1, 2, is also uniformly integrable. k=1 n=1 k=1 { ···} X X X 15.1 Uniform Integrability 139 140 The Lp Spaces & Uniform Integrability

Exercise - 15.1.16 - (This assumes WLLN for summability methods) Let $X_n$ be independent and identically distributed random variables with $E|X_1| < \infty$ and let $A$ be a non-negative summability method whose rows add up to one. If $(AX)_n$ obey the WLLN then prove that $(AX)_n \xrightarrow{L^1} E(X_1)$.
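Definition (15.1.1) can also be probed numerically. The sketch below is my own illustration; it contrasts the family of Exercise (15.1.5), where $X_n$ takes the value $\pm n$ on an interval of length $1/n$ (not uniformly integrable), with a single integrable random variable (automatically uniformly integrable), by estimating $\sup_n E\big[|X_n|\,\chi_{\{|X_n|\ge t\}}\big]$ for growing $t$. The sample sizes are arbitrary.

    import numpy as np

    rng = np.random.default_rng(5)
    u = rng.uniform(-1.0, 1.0, size=200000)        # the underlying probability space (-1, 1) of Exercise (15.1.5)

    def tail_mean(x, t):
        """Monte Carlo estimate of E[|X| 1{|X| >= t}]."""
        ax = np.abs(x)
        return np.mean(ax * (ax >= t))

    ns = [10, 100, 1000, 10000]
    for t in (10.0, 100.0, 1000.0):
        bad = max(tail_mean(n * np.sign(u) * (np.abs(u) < 1.0 / n), t) for n in ns)   # the non-UI family
        good = tail_mean(rng.standard_normal(200000), t)                              # one integrable X
        print(t, bad, good)    # 'bad' stays near 1 no matter how large t is, 'good' drops to 0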

Here our aim is to focus on the laws of large numbers. The following tools/techniques show up again and again. 1. The subsequence method. 2. The truncation method along with Kolmogorov-type inequalities. Lecture 16 3. Usage of Kornecker’s lemma. 4. And the relatively newer approach of martingales. We will postpone the topic on martingales to be picked up later. The rest are Laws of Large Numbers explained one by one. 16.1 Subsequences & Kolmogorov Inequality

Chebyshev invented his famous inequality in order to prove the weak law of large Suppose X , n = 1, 2, is a sequence of random variables defined over a prob- { n ···} numbers, which goes as follows. If Xk,k 1 are pairwise uncorrelated random ability space (S, ,P ). Consider the new sequence of partial sums 2 { ≥ } E variables with E(Xn) < M for all n then

Sn := X1 + X2 + + Xn; n = 1, 2, . n ··· ··· 1 Sn E(Sn) L2 Zn := (Xk E(Xk)) = − 0, Here are some of the typical questions probability theory deals with:- n − n → Xk=1 What is the limiting behavior of Sn as n gets large? Does the series n Xn and hence in probability automatically. This result of Chebyshev was extended by • 1 ǫ itself make sense? For instance if Xn n− − for sure for some ǫ > 0, then Rajchman by considering subsequences and he showed that the convergence takes | | ≤ P the series converges absolutely and the random series is a genuine random place in almost sure sense. Note that, variable. Our restriction is, however, very strong and we need to explore n what we can do without it. When the X are independent random variables, 1 n E(Z ) = 0, Var(Z ) = σ2, where σ2 = V ar(X ). it turns out that we can give a precise answer towards the convergence of the n n n2 k k k random series. This is the celebrated “three series theorem” of Kolmogorov. Xk=1

The distribution properties of Sn when Xi are iid are explored under the ti- Theorem - 16.1.1 - (Subsequence method of Rajchman) Let Xk,k 1 be • 2 { ≥ } tle of “random walk”. Further specialization of random walk occurs when pairwise uncorrelated random variables with E(Xn) < M for all n then the X are non-negative random variables. This goes under the title of i n “renewal theory”. 1 Sn E(Sn) a.s. Z := (X E(X )) = − 0. n n k − k n → Estimates of the probabilities P( S > nǫ) are called “large deviations” re- k=1 n X • sults. | | Proof: Without loss of any generality, we may assume that E(Xk)= 0 for all k. If we use normalizing contants and consider • By Chebyshev inequality we have Sn an n(i) ∞ ∞ ∞ Zn := − , n = 1, 2, , 1 1 2 1 M bn ··· P ( Z > ǫ) σ . | n(i)| ≤ ǫ2 n(i)2 k ≤ ǫ2 n(i) i=1 i=1 k=1 i=1 then the asymptotic behavior of Zn opens the door to a rich history. For X X X X convergence in distribution these topics come under “central limit theorem”. If we pick n(i) growing fast enough so that the last series converges, then, by the a.s. Topics such as the weak and strong “laws of large numbers”, consider the first Borel-Cantelli lemma, we have Zn(i) 0 as i . For instance, take • 2 | | → → ∞ convergence aspects of Sn . n(i) = i . n Take g(m)=([√m])2. That is, g(m), for m = 1, 2, 3, 4, looks like Further refinements of the rate of convergence in the law of large numbers, ··· • Sn i.e., with an = o(n) are considered by the “law of itterated logarithms”. 1, 1, 1, 4, 4, 4, 4, 4, 9, 9, 9, 9, 9, 9, 9, 16, an ··· 16.1 Subsequences & Kolmogorov Inequality 143 144 Laws of Large Numbers

Since Z converge to zero almost surely, we have Hence, we have 0 m g(m) 2√m. By the Chebyshev inequality we get n(i) ≤ − ≤ a.s. 2 Z1, Z4, Z9, Z16, 0. ∞ ∞ E(Zm Zg(m)) ··· → P ( Z Z > ǫ) − | m − g(m)| ≤ ǫ2 If we repeat a few terms of the convergent sequence, we should still get a convergent m=1 m=1 X X sequence. That is, 1 ∞ 2√m 4m M + < . a.s. ≤ ǫ2 m2 m2g(m) ∞ Z1, Z1, Z1, Z4, Z4, Z4, Z4, Z4, Z9, Z9, Z9, Z9, Z9, Z9, Z9, 0. m=1 ··· → X  

The triangle inequality gives that Therefore, by the first Borel-Cantelli lemma, Zm Zg(m) goes to zero almost surely as m goes to infinity. | − | Z Z Z + Z . ♠ | m|≤| m − g(m)| | g(m)| So, all we need to show is that the first term on the right side also goes to zero almost Remark - 16.1.1 - (Cantelli’s SLLN) Let Xn be independent random vari- { } 4 surely. For this we again use the Chebyshev inequality and the first Borel-Cantelli ables (only pairwise independence is needed however) and let E(Xn) < K for all lemma. Now, n. Then 2 E(Z Z ) Sn E(Sn) (X1 E(X1)) + (X2 E(X2)) + + (Xn E(Xn)) m − g(m) − := − − ··· − = V ar(Z Z ) n n m − g(m) = V ar(Z ) + V ar(Z ) 2Cov(Z , Z ) goes to zero almost surely. Of course this is just a simple consequence of the above m g(m) − m g(m) m g(m) m g(m) SLLN of Rajchman and the CBS inequality 1 1 2 = V ar(X ) + V ar(X ) Cov(X ,X ). m2 k g(m)2 k − mg(m) k i E(X2) (E(X4))1/2 √K; forall n. i=1 n n Xk=1 Xk=1 kX=1 X ≤ ≤ Since, Cov(X ,X ) = 0 whenever k = i, we have Direct calculations are not hard. Again without loss of generality assume that k i 6 E(Xn)=0 for all n. Then just note that m g(m) g(m) 2 1 2 1 2 2 2 E(Zm Zg(m)) = σk + σk σi , 4 − m2 g(m)2 − mg(m) E(Sn) = E XiXj XkXℓ k=1 k=1 i=1 i j X X X X X Xk Xℓ since g(m) m for all m. Therefore, we have ≤ = E(XiXjXkXℓ) 2 i j k ℓ E(Zm Zg(m)) X X X X − n g(m) m g(m) g(m) = E(X4) + 2 E(X2X2) 1 2 1 2 2 2 i i j = + σk + σk σi i i= j m2   g(m)2 − mg(m) 6 i=1 X X X kX=1 k=gX(m)+1 Xk=1 X = Kn + 2Kn(n 1).  m g(m) g(m) g(m) − 1 σ2 σ2 σ σ = σ2 + k + k 2 i i Therefore, m2 k m2 g(m)2 − mg(m) k=g(m)+1 k=1 k=1 i=1 ∞ S ∞ Kn + 2Kn(n 1) X X X X P n > ǫ − < . g(m) n ≤ ǫ4n4 ∞ m g(m) σ σ 2 n=1   n=1 M − + k k X X ≤ m2 m − g(m) This finishes the result. k=1   X 2 m g(m) 1 1 Exercise - 16.1.1 - E 2 M − + Mg(m) Construct a sequence of random variables so that (Xn) 0 ≤ m2 m − g(m) but the almost sure convergence of 1 n X fails. →   n k=1 k m g(m) (m g(m))2 P = M − 2 + M −2 . HW41 Exercise - 16.1.2 - Let A = [ank] be a regular summability method. If Xn is m m g(m) 2 { } any sequence of random variables such that E(Xn) 0 then prove the following Now note the fact that results. → 2 2 m 2√m √m 1 ([√m]) = g(m) m. (i) ∞ E X a 0. − ≤ − ≤ ≤ • k=1 | k| nk →  P 16.1 Subsequences & Kolmogorov Inequality 145 146 Laws of Large Numbers

HW42 (ii) (AX)n := k∞=1 Xk ank exists almost surely for each n. Exercise - 16.1.4 - (Improvement of Rajchman’s result) Let Xn be pairwise • 2 { } uncorrelated random variables and let E(Xn) < K for all n. Then prove that L1 P (iii) (AX)n 0. • → Sn E(Sn) a.s. 3 L2 − 0; for any α > . (iv) (AX) 0. nα → 4 • n → prob Exercise - 16.1.5 - (A variant of Rajchman’s result) Let X be pairwise (Both (iii) and (iv) imply (AX) 0). Construct an example of independent n n θ { } 1 → uncorrelated random variables and let V ar(Xn) = O(n ) for some θ [0, ]. Then 2 prob 1 n ∈ 2 random variables for which E(Xn) 0, Xn 0 and n k=1 Xk fails to converge find various combinations of α > 0 so that 6→ 1 →n 2 to zero in probability. So, automatically n k=1 Xk will fail to go to zero in L . P Sn E(Sn) a.s. P − α 0. In contrast to the last exercise, if pairwise uncorrelation is assumed, variances n → can be allowed to be unbounded as the following exercise shows. Remark - 16.1.4 - (How Kolmogorov’s inequality comes into the picture) Now we explain the main reason why Kolmogorov’s inequality (and its general- 2 Exercise - 16.1.3 - (L -version of Rajchman’s SLLN) Let X1,X2, be pairwise ity called Hajek-Renyi inequality) is so useful from the point of view of proving ··· uncorrelated random variables so that V ar(Xk) = o(k). Let A = [ank] be a row- the strong law of large numbers. Recall that a sequence of random variables Wn 2 finite regular summability method for which sup k ank < . Prove that converges to a random variable W if and only if n k | | ∞ P2 E L (Xk (Xk)) ank 0. P sup Wm W ǫ 0; as n , − → m n | − | ≥ → →∞ Xk  ≥  for any ǫ > 0. That is, if A := W W ǫ , then we need to show that In particular, this result holds for the Cesaro method. m {| m − | ≥ }

P ( m n Wm W ǫ ) 0; as n . Remark - 16.1.2 - The above results point out two counter-balancing effects of ∪ ≥ {| − | ≥ } → →∞ two types of assumptions: If the crude bound is finite, that is, if (i) Some sort of lack of “dependence” of the terms of X . • k ∞ (ii) Some kind of control on the moments of X . P ( m n Wm W ǫ ) P (Am) < , k ∪ ≥ {| − | ≥ } ≤ ∞ • m=n Their judicial mix gives rise to various forms of weak and strong laws of large X numbers. then the desired convergence takes place without any independence assumption and without the help of any other inequality. If the crude bound is infinite, we could still lump some of the events A together judiciously and apply the crude bound Remark - 16.1.3 - (Koml´os’ remarkable result) Existence of variance is a strong m on these lumps, and hope to have the resulting bound to be finite. That is, we pick (very restrictive) condition. Consider the above exercises along with a result of n = m

random variables). To see why, note that if Wk = Sk/k where E(Xi) = 0, we may We will prove that take W = 0. Then n n 1 − E(Y ) 2 2 2 2 2 mj+1 E(χ ) ; where Y = c S + (c c )S . ( ) 1 k ≤ ǫ2 n n k − k+1 k ∗ Bj = Ak = max Wk W ǫ = max Sk ǫ k=m k=m mj

If we assume that Xk are independent and have finite variances, here comes Kol- m n 2 2 mogorov’s inequality (or its extension the Hajek-Renyi inequality) and gives a much E(Y ) = cm V ar(Xk) + ckV ar(Xk). better bound on P(B ). This then leads to the Kolmogorov’s criterion for the strong k=1 k=m+1 j X X law of large numbers. This criterion pays the price of assuming independence of So, we need only prove the inequality in ( ). For this purpose, let χ be the event random variables. 0 that non of the χ equals one, k = m,m +∗ 1, ,n. So, k ··· 3 n Theorem - 16.1.2 - (H´ajek-R´enyi inequality) Let X1,X2, be a sequence ··· E(Y ) = Y dP + Y dP of independent random variables with mean zero and finite variances. Let cn be k=m χk =1 χ0=1 a sequence of non-increasing positive numbers and let Sn = X1 + X2 + + Xn. Z{ } Z{ } Xn Then, for any ǫ > 0 and any positive integers m and n, n>m, we have ··· Y dP ; (since Y 0). ≥ χk =1 ≥ m n k=m Z{ } 1 2 2 X P max ck Sk ǫ cm V ar(Xk) + ckV ar(Xk) . 2 2 m k n ǫ2 Now, we will show that E(χkS ) E(χkS ) for any j>k. Indeed, ≤ ≤ | | ≥ ≤ ! j k   kX=1 k=Xm+1 ≥ 2 2 E(χkSj ) = E(χk(Sk + Sj Sk) ) Proof: For ǫ > 0, define a sequence of indicator functions χk, − = E(χ S2) + E(χ (S S )2) + 2E(χ S (S S )). k k k j − k k k j − k 1 if cm Sm < ǫ, , ck 1 Sk 1 < ǫ, ck Sk ǫ χ := − − k 0 otherwise| |, ··· | | | | ≥ Note, that the random variable χkSk depends only on X1, ,Xk. Whereas the  random variable S S depends only on X , ,X . By independence,··· we then j − k k+1 ··· j for k = m,m + 1, ,n. That is, χ is the indicator function which checks if get that E(χkSk(Sj Sk)) = E(χkSk) E(Sj Sk) = 0. Therefore, for k

max cj Sj ǫ = cj Sj ǫ for some m j n E(Y ) Y dP m j n | | ≥ { | | ≥ ≤ ≤ } ≥ χk =1  ≤ ≤  kX=m Z{ } = χm = 1 or χm+1 = 1 or or χn = 1 n n 1 { ··· } − = χ + χ + + χ 1 . = c2 S2 dP + (c2 c2 ) S2 dP m m+1 n  n n j − j+1 j  { ··· ≥ } χk =1 j=m χk =1 kX=m  Z{ } X Z{ }  Note that χi + χj > 1 cannot happen since this implies that both χi = 1 and n n 1 − χj = 1, which is contradictory. Same argument shows that  2 2 2 2 2  cn Sn dP + (cj cj+1) Sj dP ≥  χk =1 − χk=1  k=m Z{ } j=k Z{ } χm + χm+1 + + χn 1 = χm + χm+1 + + χn = 1 . X  X  { ··· ≥ } { ··· } n n 1  2 2 − 2 2 2  cn Sk dP + (cj cj+1) Sk dP ≥  χk =1 − χk =1  k=m Z{ } j=k Z{ } So, P max cj Sj ǫ = P χm + χm+1 + + χn = 1 X  X  m j n ≤ ≤ | | ≥ { ··· } n n 1   2 2 2 − n cnǫ ǫ 2 2  2 1 dP + 2 (cj cj+1) dP = E χm + χm+1 + + χn = E(χk). ≥  ck χk =1 ck − χk =1  { ··· } kX=m  Z{ } Xj=k Z{ }  kX=m n 2 2 2 n 1 3 c ǫ ǫ P (χ = 1) −  H´ajek, J. and R´enyi, A. (1955). Generalization of an inequality of Kolmogorov. Acta n k 2 2 = 2 P (χk = 1) + 2 (cj cj+1) Math. Acad. Sci. Hung. vol. 6, pp. 281-283.  ck ck −  kX=m  Xj=k    16.1 Subsequences & Kolmogorov Inequality 149 150 Laws of Large Numbers

n n 1 c2 1 − We need only show that the right hand side goes to zero. For this, just note that = ǫ2 E(χ ) n + (c2 c2 ) k  c2 c2 j − j+1  k=m k k j=k m N m X  X  1 1 V ar(Xj ) n n V ar(Xj) V ar(Xj ) + . c2 c2 c2 m2 ≤ m2 j2 = ǫ2 E(χ )  n + k − n = ǫ2  E(χ ) j=1 j=1 j=N+1 k c2 c2 k X X X k=m  k k  k=m X X Letting m and then N go to infinity ends the proof. 2 ♠ = ǫ P max cj Sj ǫ . m j n | | ≥  ≤ ≤  Exercise - 16.1.9 - Does convergence in L2 take place in the above Kolmogorov This finishes the proof. criterion (Theorem (16.1.3))? Justify. If yes, what if we only had pairwise inde- ♠ pendence, would convergence in L2 sense still hold? HW43 Exercise - 16.1.6 - Show that for the Y as defined in ( ) above, ∗ Exercise - 16.1.10 - Let X1,X2, be a sequence of independent random variables m n ··· with finite variances, σ2, k = 1, 2, . Let a , a , be any sequence of real E(Y ) = c2 V ar(X ) + c2 V ar(X ). k 1 2 m k k k numbers. ··· ··· k=1 k=m+1 X X 2 2 a.s ak σk 1 n E (i) If ∞ 2 < , then show that ak(Xk (Xk)) 0. Exercise - 16.1.7 - (Kolmogorov inequality) Let X ,X , be a sequence of • k=1 k ∞ n k=1 − → 1 2 2 2 1 n ··· (ii) In particular, if ∞ a /k < then prove that for a (U independent random variables with mean zero and finite variances. Let Sn = X1 + P k=1 k P n k=1 k k • a.s. iid ∞ − X2 + + Xn. Then, for any ǫ > 0, explain why p) 0, where U , U , B(1,p). ··· → 1 P2 ··· ∼ P n 1 P max Sj ǫ V ar(Xk). 1 j n | | ≥ ≤ ǫ2 ≤ ≤   kX=1 Exercise - 16.1.8 - Let S = X +X + +X , where X ,X , iid with E(X ) = n 1 2 ··· n 1 2 ··· ∼ 1 0, V ar(X1) = 1. Prove that

$$P\Big( \max_{1\le j\le n} S_j \ge x \Big) \le 2\,P\big( S_n \ge x - (2n)^{1/2} \big).$$
[Hint: consider $A_k = \{S_1 < x, S_2 < x, \dots, S_{k-1} < x, S_k \ge x\}$, $k = 1, 2, \dots, n$.]
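Kolmogorov's inequality (Exercise (16.1.7)) is also easy to check by simulation. The sketch below is mine (the step distribution, $n$ and $\epsilon$ are arbitrary choices); it compares a Monte Carlo estimate of $P(\max_{j\le n} |S_j| \ge \epsilon)$ with the bound $\frac{1}{\epsilon^2}\sum_{k=1}^n Var(X_k)$.

    import numpy as np

    rng = np.random.default_rng(6)
    n, reps, eps = 200, 20000, 20.0
    steps = rng.choice([-1.0, 1.0], size=(reps, n))               # mean-zero steps with Var(X_k) = 1
    max_abs_S = np.max(np.abs(np.cumsum(steps, axis=1)), axis=1)  # max_j |S_j| along each path
    print("P(max |S_j| >= eps) ~", np.mean(max_abs_S >= eps))
    print("Kolmogorov bound     ", n / eps**2)                    # (1/eps^2) * sum Var(X_k)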

Theorem - 16.1.3 - (Kolmogorov criterion (1930)) Let $X_k$ be mutually independent random variables with finite variances $\sigma_k^2$. If $\sum_k \sigma_k^2/k^2 < \infty$ then
$$\frac1n \sum_{k=1}^n \big(X_k - E(X_k)\big) \xrightarrow{a.s.} 0.$$

Proof: Without loss of generality, take $E(X_k) = 0$ for all $k = 1, 2, \dots$ and let $S_n = X_1 + X_2 + \cdots + X_n$. Take $c_k = 1/k$ in the Hájek-Rényi inequality to get
$$P\Big( \max_{m\le j\le n} \frac1j |S_j| \ge \epsilon \Big) \le \frac{1}{\epsilon^2}\Big( \frac{1}{m^2}\sum_{j=1}^m Var(X_j) + \sum_{j=m+1}^n \frac{Var(X_j)}{j^2} \Big).$$
Letting $n$ go to infinity, the continuity property of probability gives that

$$P\Big( \sup_{j\ge m} \frac1j |S_j| \ge \epsilon \Big) \le \frac{1}{\epsilon^2}\Big( \frac{1}{m^2}\sum_{j=1}^m Var(X_j) + \sum_{j=m+1}^\infty \frac{Var(X_j)}{j^2} \Big).$$
We need only show that the right hand side goes to zero. For this, just note that for any fixed $N \le m$,
$$\frac{1}{m^2}\sum_{j=1}^m Var(X_j) \le \frac{1}{m^2}\sum_{j=1}^N Var(X_j) + \sum_{j=N+1}^m \frac{Var(X_j)}{j^2}.$$
Letting $m$ and then $N$ go to infinity ends the proof. ♠
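A small simulation of Theorem (16.1.3) along a single sample path is instructive. The sketch below is my own; the variance pattern $\sigma_k^2 = \sqrt{k}$, which satisfies $\sum_k \sigma_k^2/k^2 < \infty$ even though the variances are unbounded, is an arbitrary choice.

    import numpy as np

    rng = np.random.default_rng(7)
    N = 10**6
    sigma = np.arange(1, N + 1) ** 0.25          # sigma_k^2 = sqrt(k), so sum sigma_k^2 / k^2 < infinity
    x = sigma * rng.standard_normal(N)           # independent, mean zero, Var(X_k) = sqrt(k)
    avg = np.cumsum(x) / np.arange(1, N + 1)
    for n in (10**3, 10**4, 10**5, 10**6):
        print(n, avg[n - 1])                     # (1/n) sum_{k<=n} X_k settles down toward 0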

To get rid of the assumption on the second moments of the random variables (in the above strong law of large numbers) we often use the truncation approach. However, we still pay the price by assuming that the random variables are (at least pair wise) independent. This section pulls the main benefits that the truncation method brings out.

Lecture 17 Definition - 17.0.1 - (Equivalence of random sequences) Two sequences of random variables X ,k 1 and Y ,k 1 are said to be equivalent if { k ≥ } { k ≥ } P (X = Y ) < . k 6 k ∞ WLLN, SLLN & Uniform Xk (This implies that P(limsup X = Y ) = P (X = Y i.o.) = 0.) SLLN k{ k 6 k} k 6 k So, the above truncation method shows that if X are iid with E X < , k | 1| ∞ then our truncated sequence is equivalent to Xk . The following proposition pulls out the main benefit of the truncated seuquence.{ } The truncation method was invented to get rid of the need to assumpe the finiteness of the second moments of the random variables as demanded by the Kolmogorov Proposition - 17.0.1 - If two sequences of random variables (Xk) and (Yk) are criterion (Theorem (16.1.3)). equivalent, then n (i) k=1(Xk Yk) converges almost surely as n gets large. Remark - 17.0.5 - (The truncation method) So, let Xk be identically dis- • − tributed random variables with finite first moment. There are many forms of trun- P 1 n a.s. (ii) If ak , then a k=1(Xk Yk) 0, as n gets large. cation, however, the following one will suffice for us. • ր∞ n − → P c Proof: Since P (Xk = Yk i.o.) = 0, therefore, for any ω B , Xk(ω) = Yk(ω) Xk if Xk k 6 ∈ Yk := | | ≤ for all but finitely many k, where B := ω : Xk(ω) = Yk(ω) i.o. . Therefore, for 0 if Xk > k. { 6 } c  | | all except finitely many values of k, we have Xk(ω) Yk(ω) = 0, for each ω B . − ∈ Note that Hence, ∞ (X (ω) Y (ω)) < for each ω Bc. ∞ ∞ ∞ k − k ∞ ∈ P (X = Y ) = P ( X >k) = P ( X >k) < , k=1 k 6 k | k| | 1| ∞ X Xk=1 Xk=1 kX=1 Since, P (Bc) = 1, part (i) follows. Part (ii) is then a trivial consequence of part (i). since E X1 < . Hence, by using the first Borel-Cantelli lemma we see that ♠ P (limsup| |X =∞ Y ) = 0. That is, X (ω) = Y (ω) for infinitely many k only k{ k 6 k} k 6 k over a set of probability zero. So, almost surely, Xk = Yk for all but finitely many HW44 Exercise - 17.0.11 - If two sequences of random variables Xk,k 1 and Yk,k values of k. This gives that 1 are equivalent, then show that almost surely the limiting{ behavior≥ } of { ≥ } n n 1 a.s. n n (X Y ) converges a.s. and (X Y ) 0. 1 1 k − k n k − k → Xk is the same as that of Yk k=1 k=1 an an X X Xk=1 Xk=1 So, note that n n n for any sequence (ak) of positive numbers going to infinity. Furthermore, if for some 1 1 1 X = (X Y ) + Y random variable X, n k n k − k n k k=1 k=1 k=1 n n X X X 1 prob 1 prob shows that the limiting behavior of the last term on the right side is the same as Xk X, then show that Yk X. an → an → the limiting behavior of the left side term. Xk=1 Xk=1 WLLN, SLLN & Uniform SLLN 153 154 WLLN, SLLN & Uniform SLLN
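The truncation scheme just described is easy to visualize numerically. In the sketch below (mine; the Pareto-type distribution with finite mean but infinite variance, and the sample size, are arbitrary choices), $Y_k = X_k\,\chi_{\{|X_k|\le k\}}$; only finitely many indices differ between the two sequences, and their Cesàro averages stay together, as Proposition (17.0.1) predicts.

    import numpy as np

    rng = np.random.default_rng(8)
    N = 10**6
    x = rng.pareto(1.5, size=N) + 1.0              # classical Pareto(1.5): E X_1 = 3, infinite variance
    k = np.arange(1, N + 1)
    y = np.where(np.abs(x) <= k, x, 0.0)           # truncated sequence Y_k = X_k 1{|X_k| <= k}
    print("indices with X_k != Y_k:", int(np.sum(x != y)))     # a handful, all at small k
    print("averages at n = N:", x.mean(), y.mean())            # the two averages essentially agree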

Let p ,k 1 be any sequence of numbers, and let X ,k 1 and Y ,k then the upper bound { k ≥ } { k ≥ } { k ≥ 1 be equivalent. Then the almost sure convergence of k pkXk is the same as } n 2 that of pkYk. i=1 E(Yi ) E X1 n(n + 1) k P | | n2ǫ2 ≤ 2n2ǫ2 P HW45 ExerciseP - 17.0.12 - Prove the following results. does not go down to zero. The trick is to use a sequence a of positive integers { n} (i) Let (Xk) be a sequence of random variables such that k P ( Xk >k) < going to infinity slower than n as shown below. So, • . If | | ∞ P Xk if Xk k n n an n Yk := | | ≤ 0 if Xk > k, E(Y 2) = E X2χ = + E X2χ .  | | i 1 {|X1|≤i} 1 {|X1|≤i} i=1 i=1 i=1 i=a +1! then prove that (Yk) is equivalent to (Xk). X X   X Xn  

(ii) Let (Xk) be identically distributed sequence of random variables with We use the crude estimate on the first part and a sharp estimate on the second • E X1 < . Let Yk be defined as in part (i). Prove that Yk is equivalent to part. That is, | | ∞ (Xk). [Hint: E X < is related to k P ( X >k) < ] | | ∞ | | ∞ a a n n a (a + 1) P 2 n n UC Exercise - 17.0.13 - Let (X ) be any sequence of random variables such that E X1 χ{|X |≤i} iE X1 = E X1 . k 1 ≤ | | 2 | | i=1 i=1 P ( Xk >g(k)) < for some sequence of numbers g(k). If   k | | ∞ X X P X if X g(k) So, if an/n goes to zero (e.g., an = ln n) then this upper bound would go down to Y := k k k 0 if |X |>g ≤ (k), zero. For the second part, use the fact that  | k| then prove that (Y ) is equivalent to (X ). X i = X a a < X i , k k {| 1| ≤ } {| 1| ≤ n} ∪ { n | 1| ≤ } Now we present an application of the truncation method. We will prove the where the above union is disjoint. So we have WLLN by assuming only pairwise independence. When we have mutual indepen- dence, the characteristic functions approach gives WLLN rather easily. n E X2χ 1 {|X1|≤i} Theorem - 17.0.4 - (Khintchin’s WLLN — pairwise independent case) Let i=an+1   X n n (Xk) be identically distributed random variables which are pairwise independent = E X2χ + E X2χ and let X1 have finite mean µ. Then, 1 {|X1|≤an} 1 {an<|X1|≤i} i=an+1   i=an+1   n X n X 1 prob Xi E(X1). an(n an)E X1 + iE X1 χ n → {an<|X1 |≤i} i=1 ≤ − | | | | X i=an+1   Xn Proof: By using the truncation scheme of Remark (17.0.5), we need only show an(n an)E X1 + nE X1 χ ≤ − | | | | {an<|X1 |} that Y n µ in probability. As in the SLLN of Kolmogorov, if we show that i=a +1 → Xn   n 1 prob = an(n an)E X1 + n(n an)E X1 χ (Y E(Y )) 0; as n − | | − | | {an<|X1|} n k − k → →∞ k=1 2   X nanE X1 + n E X1 χ{|X |>an } . ≤ | | | | 1 then we will be done. We will use Chebyshev inequality and some clever tricks.   First Chebyshev’s inequality gives that When we divide this by n2 and let n go to infinity, both of the terms drop to zero.

n n 2 This finishes the proof. i=1 V ar(Yi) i=1 E(Yi ) ♠ P ( Y n E(Y n) > ǫ) . | − | ≤ n2ǫ2 ≤ n2ǫ2 If we assume that the sequence X is iid then we may hope to avoid making P P { k} 2 the assumption that E X1 < . However, what should then be the limiting value? At this stage if we use a crude estimate of E(Yi ) by Well, we may not have| any.| But∞ the following result shows that there is a sequence 2 2 2 E(Y ) = E(X χ ) = E(X χ ) iE X1 , µn which picks up the limiting behavior in probabilistic sense. i i {|Xi|≤i} 1 {|X1|≤i} ≤ | | WLLN, SLLN & Uniform SLLN 155 156 WLLN, SLLN & Uniform SLLN

Theorem - 17.0.5 - (WLLN of Khintchin — iid case) Let Xk be an iid Both of these terms go to zero as given. sequence so that { } ♠ Proof of Theorem (17.0.5). We will use the result of Proposition (17.0.2) with lim nP ( X1 >n) = 0. (0.1) X = X and take b = n. Then note that n | | n,k k n n n 1 prob Let µn = E X1χ . Then Xi µn 0. P ( Xn,k >n) = nP ( X1 >n) 0. {|X1|≤n} n − → | | | | → i=1 k=1   X X Instead of proving the result directly, we pull out the main steps into a separate To prove the second condition of Proposition (17.0.2), let Y = X χ . n,k n,k {|Xn,k|≤n} proposition as follows. Since X = X , we see that Y = X χ . Hence, we need to prove that n,k k n,k k {|Xk|≤n}

Proposition - 17.0.2 - Consider the triangular array Xn,k, 1 k n, n = 1, 2, n n ≤ ≤ ··· 1 2 1 2 1 2 of random variables where for each fixed n, Xn,k, k = 1, 2, ,n are independent. E X χ = E X χ = E X χ ··· n2 k {|Xk|≤n} n2 1 {|X1|≤n} n 1 {|X1|≤n} Define a truncated triangular array k=1 k=1 X   X     X if X b , goes to zero as n gets large. Now note that Y = n,k n,k n n,k 0 if |X |> ≤ b ,  | n,k| n 1 2 1 ∞ E X χ = 2xP X1 χ >x dx where b is a sequence of positive numbers going to infinity so that 1 {|X1|≤n} {|X1|≤n} n n n 0 | |   Z n   n n 2 1 2 = xP X1 χ{|X |≤n} >x dx lim P ( Xn,k > bn) = 0, and lim E(Yn,k) = 0. n | | 1 n n b2 0 →∞ | | →∞ n Z n   kX=1 Xk=1 2 xP ( X1 >x) dx. If S = X + X + + X then we have ≤ n 0 | | n n,1 n,2 ··· n,n Z n S E(Y ) prob It is given to us that the integrand goes to zero. The integral version of the Cesaro n − k=1 n,k 0. b → transfrom will converge to the same limit by regularity. P n ♠ Proof: Let ǫ > 0 be arbitrary and let T = Y + Y + + Y . Just note n n,1 n,2 n,n Exercise - 17.0.14 - that ··· Give an example of a sequence of iid random variables for which condition (0.1) does not hold and Theorem (17.0.5) fails. S n E(Y ) P n − k=1 n,k > ǫ b  P n  Remark - 17.0.6 - (Converse of WLLN of Khintchin — iid case) It turns n out that the conditions in Khintchin’s WLLN (Theorem (17.0.5)) happen to be Tn k=1 E(Yn,k) P (Sn = Tn) + P − > ǫ ≤ 6 b sufficient as well. Let X1,X2, be iid random variables. The following statements  P n  ··· n are equivalent. T n k=1 E(Yn,k) P (Xn,k = Yn,k, for some k n) + P − > ǫ prob ≤ 6 ≤ bn 1 n  P  (i) n k=1 Xk µ. n n • → Tn E(Yn,k) P (X = Y ) + P − k=1 > ǫ P n,k n,k (ii) limn nP ( X1 >n) = 0andlimn µn = limn E X1χ = µ. ≤ 6 bn {|X1|≤n} k=1  P  • | | Xn n   itX1 Tn k=1 E(Yn,k) (iii) φ(t) := E(e ) is differentiable at t = 0 with φ′(0) = iµ. = P ( Xn,k > bn) + P − > ǫ • | | bn k=1  P  For a proof see Laha and Rohatgi p. 320. X n One can have a sequence of iid random variables for which the WLLN holds V ar (Tn) P ( Xn,k > bn) + , (Chebyshev’s inequality) P 1 ≤ | | ǫ2b2 and the SLLN does not hold. For instance, (X1 x) = 1 x ln x as x . k=1 n ≤ − →∞ Xn n 1 2 Here is a second application of truncation method in conjunction with the P ( Xn,k > bn) + 2 2 E(Yn,k). ≤ | | ǫ bn Kolmogorov inequality. kX=1 kX=1 WLLN, SLLN & Uniform SLLN 157 158 WLLN, SLLN & Uniform SLLN
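The role of the condition $nP(|X_1| > n) \to 0$ can be seen in a quick simulation. The Python sketch below (the distribution is an arbitrary choice made only for illustration) uses symmetric variables with $P(|X_1| > x) = 1/x$ for $x \ge 1$, so that $nP(|X_1| > n) = 1$ does not vanish and $E|X_1| = \infty$; in the spirit of Exercise (17.0.14), the sample means then refuse to settle near any constant.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def one_sample_mean(n):
    # |X| = 1/U gives P(|X| > x) = 1/x for x >= 1; attach a random sign.
    u = rng.random(n)
    signs = rng.choice((-1.0, 1.0), size=n)
    return np.mean(signs / u)

for n in (10**3, 10**5, 10**6):
    means = np.array([one_sample_mean(n) for _ in range(20)])
    # Under a weak law these 20 values would concentrate near one constant;
    # here nP(|X1| > n) = 1 does not vanish and they stay widely spread.
    print(f"n = {n:>7}: 20 sample means range over "
          f"[{means.min():9.2f}, {means.max():9.2f}]")
\end{verbatim}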

∞ ∞ Theorem - 17.0.6 - (Kolmogorov’s SLLN) Let X1,X2, be independent and j ··· = E X1 χ{j−1<|X |≤j} identically distributed random variables. Then, k2 | | 1 j=1 X Xk=j   n ∞ ∞ 1 a.s. 1 X := X µ if and only if E X < , E(X ) = µ. = jE X1 χ{j−1<|X |≤j} n k 1 1 | | 1 k2 n → | | ∞ j=1 k=j Xk=1 X   X a.s. ∞ C Furthermore, if E X1 = then limsup Xn = . jE X1 χ | | ∞ n | | ∞ ≤ | | {j−1<|X1 |≤j} j Xj=1   Proof: First assume that E X < and let E(X ) = µ. We start off just as in | 1| ∞ 1 ∞ Khintchin’s WLLN by defining = C E X1 χ | | {j−1<|X1|≤j} j=1 X   Xk if Xk k Yk := | | ≤ ∞ 0 if Xk > k.  | | = CE X1 χ{j−1<|X |≤j}  | | 1  a.s. j=1 Just as in Khintchin’s WLLN, we see that we need only prove that Yn µ X → = CE X < .  due to Remark (17.0.5). Since the first moment exists, | 1| ∞ This gives the first part. E(Yk) = E Xkχ = E X1χ E(X1); as k . n {|Xk|≤k} {|X1|≤k} To prove the converse, use the fact that ( a )/n µ implies that a = → →∞ k=1 k → k     o(k), cf. Proposition (14.0.5). Hence, Sn/n converges to µ almost surely im- Therefore, by the regularity of the Ces`aro summability method we have plies that X /n goes to zero almost surely. IfPE X = then it must be that n | 1| ∞ n P ( X n) = . But then it must be that P ( X n) = . Now 1 n | 1| ≥ ∞ n | n| ≥ ∞ E(Yk) E(X1); as n . the second Borel-Cantelli lemma gives that P ( Xn n i.o.) = 1. This contra- n → →∞ dictsP with the fact that X /n goes to zero almost{| surely.P| ≥ } Hence, it must be that kX=1 n E X1 < . And the first part of the proof now gives that E(X1) must be µ. ∞ | | ∞ V ar(Yk) To finish the proof, assume that E X1 = . Then, for any k > 0 we have So, by Kolmogorov criterion, Theorem (16.1.3), we need only prove that 2 < . | | ∞ k ∞ E( X1 /k) = . So, P ( X1 > kj) = for each k > 0. Since, Xj are k=1 | | ∞ j | | ∞ We do not need to be as sophisticated as we were in Khintchin’s weak law.X Indeed, identically distributed we have P

2 2 ∞ E(Yk ) = E X1 χ{|X |≤k} P ( X > jk) = ; for each k > 0. 1 | j| ∞ j=1  k  X = E X2 χ Since, the events in question are independent, the second Borel-Cantelli lemma  1 {j−1<|X1|≤j}  j=1 gives that X k  P ( X > jk i.o.) = 1; for each k > 0. | j| = E X2χ 1 {j−1<|X1 |≤j} Now, notice the inequality j=1   X j j 1 j j 1 k − − X = X X X + X . jE X1 χ . j i i i i ≤ | | {j−1<|X1 |≤j} | | − ≤ j=1 i=1 i=1 i=1 i=1 X   X X X X j j 1 So, if Xj > jk then Xi > jk/2 or − X i > jk/2. Hence, if for some Therefore, we have | | | i=1 | | i=1 | ω S, X (ω) > jk for infinitely many j then j X (ω) > jk/2 for infinity j P P i=1 i ∞ ∞ 2 ∈ | | | | V ar(Yk) E(Yk ) many j. Thus, we have 2 2 k P k ≤ k P ( X > i.o.) = 1. Xk=1 kX=1 | j| 2 k ∞ 1 This gives that limsup X k/2 almost surely. Since, k is arbitrarily large. we jE X1 χ j j ≤ k2 | | {j−1<|X1|≤j} | | ≥ j=1 get the result. kX=1 X   ♠ WLLN, SLLN & Uniform SLLN 159 160 WLLN, SLLN & Uniform SLLN
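The series bound just derived is easy to check numerically. The following Python sketch (the choice $X_1 \sim \mathrm{Exp}(1)$ is only an example) estimates $\sum_k E\big(X_1^2 \chi_{\{|X_1| \le k\}}\big)/k^2$ by Monte Carlo and compares it with $2E|X_1|$; the constant $2$ is one admissible choice of the constant $C$ above, since $\sum_{k \ge j} k^{-2} \le 2/j$.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(1.0, size=200_000)   # assumed example: X1 ~ Exp(1), so E|X1| = 1
x2 = x * x

total = 0.0
for k in range(1, 301):
    # Monte Carlo estimate of E( X1^2 * chi_{|X1| <= k} ) / k^2
    total += np.mean(np.where(x <= k, x2, 0.0)) / k**2

# terms with k > 300 contribute at most E(X1^2)/k^2 = 2/k^2, i.e. less than 2/300 in total
print(f"sum over k <= 300 of E(Y_k^2)/k^2 ~ {total:.3f}  (tail < {2/300:.4f})")
print(f"bound 2 * E|X1|                   = {2.0 * np.mean(x):.3f}")
\end{verbatim}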

1 N. Etemadi gave a remarkably simple method to prove that when E(X1) = we get that

µ and Xi are only pairwise independent and identically distributed then Sn/n j j converges to µ almost surely. The proof uses a mixup of the truncation argument 1 1 1 E(X1) liminf Yk limsup Yk aE(X1). and the subsequence argument. a ≤ j j ≤ j j ≤ Xk=1 kX=1 Theorem - 17.0.7 - (SLLN of Etemadi) Let X ,X , be pairwise independent Since a > 1 is arbitrarily close to 1, the result is proved. So, all we need to prove 1 2 ··· and identically distributed random variables with E X1 < and let E(X1) = µ. is (0.3). In fact we will show a little more that the left side in (0.3) converges Sn| | ∞ If Sn = X1 + X2 + + Xn is the partial sum then n converges to µ almost surely. completely to zero as n gets large. For this purpose let Y1 +Y2 + +Ym(n) = Tm(n). ··· Now, for any ǫ > 0, we have ··· Proof: Once again we use the truncation argument with m(n) ∞ 1 Xk if Xk k P (Yk E(Yk)) > ǫ Yk := | | ≤  m(n) −  0 if Xk > k. n=1 k=1  | | X X   As in the last proof, we need only show that the averages of Yk sequences converge ∞ = P Tm(n) E(Tm(n)) > ǫm(n) almost surely to µ. As before, E(Yk) µ. And the previous argument verbatim − → n=1 gives that X  ∞ V ar (Tm(n)) 2 2 2 , (Chebyshev), ∞ V ar(Yk) ∞ E(Y ) ≤ ǫ m(n) k < . (0.2) n=1 k2 ≤ k2 ∞ X k=1 k=1 m(n) X X 1 ∞ 1 = V ar(Y ), (pairwise independence), The problem is that now we cannot invoke the Kolmogorov criteria to conclude the ǫ2 m(n)2 k n=1 k=1 result, since it requires mutual independence of the random sequence. So, Etemadi X X + 1 ∞ ∞ 1 took the following detours. Since Xn = Xn Xn−, if the result is proved for the = V ar(Y ) , (Tonelli), + − ǫ2 k m(n)2 sequence Xn a similar argument will work for the other sequence. So, without k=1 n: m(n) k { } X X ≥ loss of generality assume that Xn 0 for each n. Now follow the line of attack of the 2 subsequence method. Take any a≥ > 1 and consider the subsequence m(n)=[an], 1 ∞ ∞ 1 = V ar(Yk) n = 1, 2, . Over this subsequence we will prove that ǫ2 [an] k=1 n: m(n) k   ··· X X ≥ m(n) 2 1 1 ∞ ∞ 2 a.s. V ar(Yk) (Yk E(Yk)) 0. (0.3) ≤ ǫ2 an m(n) − → k=1 n: an k   Xk=1 X X≥ 4 ∞ 1 1 Now, since E(Y ) µ, the regularity of the Cesaro means and (0.3) will imply k V ar(Yk) 1 → ≤ ǫ2 1 (aloga k)2 that k=1 a2 m(n) X − 2 1 a.s. 4a ∞ V ar(Yk) Yk µ. = < , by (0.2). m(n) → 2 2 2 k=1 (a 1)ǫ k ∞ X − kX=1 Using this then we fill the inbetween terms as follows. Note that the fact that This finishes the proof. X 0, implies that Y 0 and therefore, for any m(n) j

Exercises (17.0.15), (17.0.16) reveal a qualitative difference between the SLLN (iv) Conclude that in Exercises (17.0.15) and (17.0.16) L1 convergence takes of Kolmogorov and the SLLN of Etemadi. Of course SLLN of Etemadi is a gen- • place as well. eralization of the SLLN of Kolmogorov. However, the difference between them is 2 revealed when we introduce an arbitrary bounded sequence a into the Cesaro D Exercise - 17.0.18 - (Almost sure convergence may imply L convergence) { k} transforms, Let U1, U2, , be independent and identically distributed B(1,p) for some fixed n p (0, 1). For··· any real sequence a if 1 ∈ { k} akXk. n n k=1 1 X a (U p) a.s. 0, n k k − → In the case of the Kolmogorov SLLN this transform is guaranteed to reproduce the k=1 1 n X convergence or oscillatory behavior of n k=1 ak (with the assumption on mutual n 2 2 2 independence of Xk). While Etemadi’s SLLN guarantees that the transform will then k=1 ak = o(n ) and hence the L -convergence must take place. [Hint: 1 Pn reproduce the convergence behavior of a (with only pairwise independence (Uk p)(Uj p)/(p(1 p)),j>k 1 is an orthonormal system. Egorov’s the- n k=1 k { −P − − ≥ } assumption on Xk). Somehow it seems as if mutual independence may be necessary orem says that almost sure convergence implies uniform convergence over a set, A, P for this extra benefit that Komogorov’s SLLN provides. However, this is only a of positive probability. Expand χA by the orthogonal system.] conjecture at this moment. UC Exercise - 17.0.19 - Let X1,X2, be any sequence of independent random vari- ables and let Z = 1 n X ,···n = 1, 2, . Prove that the following three Exercise - 17.0.15 - (Weighted version of SLLN of Kolmogorov) Let X1,X2, n n k=1 k ··· statements are equivalent. ··· be any sequence of independent and identically distributed random variables with P a.s. E X1 < . If a1, a2, is any bounded sequence of real numbers then (i) Z 0. | | ∞ ··· • n → prob a.s. n (ii) (a) Zn 0 and (b) Z2n 0. 1 a.s. a (X E(X )) 0. • → → n k k − 1 → prob (ii) (a) Zn 0 and (b) P( 2Z2n+1 Z2n > ǫ) < for any ǫ > 0. kX=1 • → n | − | ∞ P E + [Note that ak does not have to be Cesaro summable. It says that the convergence Exercise - 17.0.20 - Let X1,X2, be iid random variables. If (X1 ) = + and { } 1 n 1 n E 1 ···n a.s. ∞ or oscillatory behavior of n k=1 ak is reproduced by n k=1 akXk. So, a bounded (X1−) < then show that n k=1 Xk + . 1 n ∞ → ∞ sequence is C1-summable if and only if n k=1 akXk converges almost surely.] P P UC Exercise - 17.0.21 - Suppose weP have two coins, a dime which is assumed to be P Exercise - 17.0.16 - (Weighted version of SLLN of Etemadi) Let X1,X2, fair and a quarter which is biased. A nickle (fair) is tossed and if a head appears, be any sequence of pairwise independent and identically distributed random vari-··· the dime is chosen otherwise the quarter is chosen. The chosen coin is then tossed ables with E X < . If a , a , is any bounded sequence of real numbers infinitely many times with outcomes X1,X2, , where Xk = 1 if the k-th toss | 1| ∞ 1 2 ··· lands heads and X = 0 otherwise. Does the strong··· law of large numbers hold for which is Cesaro summable to α then k the sequence X1,X2, . Prove your assertions. n ··· 1 a.s. akXk αE(X1). HW46 Exercise - 17.0.22 - Prove the following results. n → k=1 X (i) For any ǫ > 0 prove that • Exercise - 17.0.17 - (Uniform integrability is preserved by regular meth- n k n k ods) lim sup p (1 p) − = 0. n p [0,1] k − →∞ ∈ k: k np nǫ   (i) Show that if E( X ) < then X is uniformly integrable. 
(ii) Let $X_1, X_2, \dots$ be any sequence of identically distributed random variables with $E(|X_1|) < \infty$. Prove that the collection $\{X_i,\ i \ge 1\}$ is uniformly integrable.

(iii) Let $X_1, X_2, \dots$ be any sequence of random variables that is uniformly integrable. Let $A = [a_{nk}]$ be a regular summability method. Prove that the collection $\{\sum_{k=1}^{\infty} a_{nk} X_k,\ n \ge 1\}$ is uniformly integrable.

(ii) (continuing Exercise (17.0.22)) For any continuous function $f$ on $[0,1]$,
\[
\lim_{n\to\infty} \sup_{p\in[0,1]} \Big| \sum_{k=0}^{n} f\Big(\tfrac{k}{n}\Big) \binom{n}{k} p^{k} (1-p)^{n-k} - f(p) \Big| = 0.
\]
$B_n(f,p) := \sum_{k=0}^{n} f\big(\tfrac{k}{n}\big)\binom{n}{k} p^{k} (1-p)^{n-k}$ is called the Bernstein polynomial.
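Exercise (17.0.22) is the probabilistic (Bernstein polynomial) route to the Weierstrass approximation theorem. The short Python sketch below illustrates part (ii) numerically; the test function $f(x) = |x - 1/2|$ and the evaluation grid are arbitrary choices.

\begin{verbatim}
from math import comb

def bernstein(f, n, p):
    """Evaluate B_n(f, p) = sum_k f(k/n) C(n,k) p^k (1-p)^(n-k)."""
    return sum(f(k / n) * comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n + 1))

f = lambda x: abs(x - 0.5)        # continuous, not differentiable at 1/2
grid = [i / 400 for i in range(401)]

for n in (10, 50, 250):
    err = max(abs(bernstein(f, n, p) - f(p)) for p in grid)
    print(f"n = {n:>4}:  sup_p |B_n(f,p) - f(p)| ~ {err:.4f}")
\end{verbatim}

The uniform error should shrink roughly like $n^{-1/2}$ for this $f$, which is the rate one expects from the binomial variance term in the proof.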

17.1 Glivenko-Cantelli Theorem

By the SLLN we see that $F_n(x) \stackrel{a.s.}{\to} F(x)$ for each fixed $x$. When $F$ is continuous, the above result of Polya shows that the convergence must be uniform. The following theorem of Glivenko and Cantelli says that the convergence is uniform for any $F$.

Proposition - 17.1.1 - (George Polya) If Fn,n 1 converges in distribution iid { ≥ } Theorem - 17.1.1 - (Glivenko-Cantelli (1933) If X1,X2, F , where F to another distribution G and if G is continuous then supt R Fn(t) G(t) 0. ··· ∼ ∈ | − |→ is any probability distribution, and Fn(x) is the empirical distribution as defined above, then Proof: Suppose not and let there be an ǫ > 0 and a subsequence nk, k 1, so a.s. ≥ sup Fn(x) F (x) 0. that ǫ, for all k 1. t R | − | ≥ ∈ Proof: For any distribution F , let ∆ be the set of points of jump of F with j ∆ ∈ By the definition of sup apllied for each fixed k 1, we get a sequence xk, k 1 if and only if F (j) F (j ) = p > 0. Clearly this set is countable. If this set is t ≥ { ≥ } − j so that empty just ignore the− portion of the argument that relies on it being nonempty. ǫ Define a discrete distribution F (t) and a continuous distribution F (t) by F (x ) G(x ) > , for all k 1. (1.4) d c | nk k − k | 2 ≥ 1 1 Fd(t) := pj , Fc(t) := (F (t) pFd(t)) , t R, We will soon show that xk has to be a bounded sequence. Therefore, it must p 1 p − ∈ { } j ∆: j t − have a limit point, say α, so that there exists a subsequence xkj α as j gets ∈X≤ δ → large. That is xk α for all j J for any δ > 0. Of course, G is continuous | j − | ≤ 4 ≥ where p = j ∆ pj . So, we see that F (t) = (1 p)Fc(t) + pFd(t). What is not at α giving G(xk ) G(α) and we are given that Fn (α) G(α). Therefore, for ∈ − j → kj → usually noticed is that the random variable X F can also be similarly decomposed all j J, as X = UXP+ (1 U)X , where X F and∼ X F and U Bernoulli(p), ≥ d − c d ∼ d c ∼ c ∼ with all three random variables, U, Xd,Xc, being mutulally independent. Indeed, G(α δ) Fn (α δ) Fn (xk ) Fn (α + δ) G(α + δ) − ← kj − ≤ kj j ≤ kj → P(UX + (1 U)X t) = pP(X t) + (1 p)P(X t) Since δ > 0 is arbitrary, this contradicts (1.4). To finish the proof we now show d − c ≤ d ≤ − c ≤ = (1 p)Fc(t) + pFd(t) = F (t). that xk is bounded. First for our ǫ > 0 cut the left tail by picking an a so that − { ǫ} G(a) < 8 . Next, since Fnk (a) G(a), we have a K so that for all k K iid iid iid → ≥ Let Uk U, Xd,k Fd, and Xc,k Fc. Now it is easy to see that ǫ ǫ ∼ ∼ ∼ G(a) Fnk (a) G(a) + . n − 8 ≤ ≤ 8 1 Fn(t) = I(UkXd,k + (1 Uk)Xc,k t) Now for any k K, if xk < a then n − ≤ ≥ Xk=1 ǫ 1 n 1 n Fnk (xk) Fnk (a) G(a) + = I(U = 1) I(X t) + I(U = 0) I(X t) ≤ ≤ 8 n k d,k ≤ n k c,k ≤ Xk=1 Xk=1 implying F (x ) G(x ) 2G(a) + ǫ 3ǫ contradicting (1.4). Therefore, n n | nk k − k | ≤ 8 ≤ 8 1 1 x a for all k K. For an upper bound, see Exercise (17.1.1). = Uk I(Xd,k t) + (1 Uk) I(Xc,k t) n n ≤ n − ≤ ≥ ≥ ♠ k=1 k=1 iid X X Let X ,X , ,X F The empirical distribution function is defined as =: Dn(t) + Cn(t), (say). 1 2 ··· n ∼

the number of Xi which are x Let us use the notations Fn(x) := ≤ , x R. n ∈ n n 1 1 Fd,n(t) = I(Xd,k t), Fc,n(t) = I(Xc,k t). Note that Fn(x) is a random variable, i.e., a function of X1,X2, ,Xn. More n ≤ n ≤ ··· Xk=1 kX=1 precisely, for a fixed real number x, let Yi(x) := 1 if Xi x and zero otherwise, iid ≤ a.s. R i = 1, 2, ,n. Then Y (x), Y (x), , Y (x) B(1, F (x)), and By the SLLN, Fc,n(t) Fc(t) for all t , and the weighted version gives ··· 1 2 ··· n ∼ → ∈ n n 1 F (x)(1 F (x)) 1 1 a.s F (x) = Y (x), E(F (x)) = F (x), Var(F (x)) = − . C (t) = (1 U ) I(X t) F (t). n n i n n n 1 p n n(1 p) − k c,k ≤ → c i=1 k=1 X − − X 17.1 Glivenko-Cantelli Theorem 165 166 WLLN, SLLN & Uniform SLLN

By Polya’s theorem (actually Exercise (17.1.2)), Proposition - 17.1.2 - (Hoeffding’s inequality) Let X be any random variable with P(a X b) = 1. Then for any ǫ > 0 we have a.s ≤ ≤ sup Cn(t) (1 p)Fc(t) 0. t | − − | → 2 2 t(X E(X)) t (b a) E(e − ) exp − , for all t > 0. ≤ 8 Here almost surely over the intersection of the two events (one for ω of U’s and one   for ω of X ’s) each with probability one. To take care of F (t) part we take a ′ c,k d,n When X ,X , ,X are independent with P(a X b ) = 1 and S = X + slightly different tack. The SLLN and its weighted version, for each j ∆, give 1 2 ··· n i ≤ i ≤ i n 1 ∈ + Xn then ··· n n 1 a.s. P 1 a.s 2ǫ2 I(Xd,k = j) (X = j), Uk I(Xd,k = j) P (X = j). P(S E(S ) ǫ) exp . n → np → n n n − 2 − ≥ ≤ (bi ai) Xk=1 Xk=1  i=1 −  1 n P E Consider the matrix B =[bnj], where bnj = np k=1 Uk I(Xd,k = j). This matrix Proof: Without loss of generality we may assume (X) = 0. By the convexity of obeys the following properties (almost surely). the function etx we have P (i) b β for each j ∆. X a b X • nj → j ∈ etX − etb + − eta. ≤ b a b a (ii) supn j bnj C < , and − − • | | ≤ ∞ ta tb This gives that E(etX ) be ae =: h(t). The reader may verify that h(0) = 1, (iii) limnP j bnj = σ. b−a • →∞ ≤ 2 −ta 2 tb h′(0) = 0 and h′′(t) = (ba e ab e )/(b a). Therefore, if we write h(t) = P 1 ′′ ′ 2 (where βj = (X =Pj), C = p and σ = 1 in our case). Any such matrix is called ln h(t) g(t) − − h(t)h (t) (h (t)) e =: e then g(0) = 0, g′(0) = 0 and g′′(t) = − 2 . Therefore, a conservative matrix. A famous theorem of Kojima-Schur says that B takes a (h(t)) convergent sequence to a convergent sequence if and only if it is a conservative beta aetb t2 t2(b a)2 abeξ(a+b) matrix. In our case βj 0, βj = σ, and another result, which we will call − = exp g′′(ξ) = exp − − , ξ [0,t]. ≥ j b a 2 2 (beξa aeξb)2 ∈ discrete Scheff´e’s theorem see Exercise (17.1.3), says that we must have     P − − To maximize the exponent in ξ, we may write it as lim bnj βj = 0. n | − | →∞ j abeξ(a+b) beξa X − = u(ξ) (1 u(ξ)), u(ξ) = , (beξa aeξb)2 − beξa aeξb Thus we have − − Note that E(X) = 0 implies that a 0 b, making u(ξ) [0, 1]. Hence, u(ξ)(1 1 n 1 ≤ ≤ ∈ − P u(ξ)) 4 . This gives the first inequality. For the second inequality we apply the Dn(t) pFd(t) = p Uk I(Xd,k = j) (X = j) ≤ | − | np − first inequality to get j ∆, j t k=1 ! ∈ X≤ X n t(Sn E(Sn)) ǫt 1 P(Sn E(Sn) ǫ) = P(e − e ), t 0, p Uk I(Xd,k = j) P(X = j) = p bnj βj . − ≥ ≥ ≥ ≤ np − | − | 1 t(S E(S )) j ∆ k=1 j E(e n n ) ǫt − X∈ X X ≤ e n The right side does not depend on t and goes to zero, giving uniform convergence 1 t(Xi E(Xi)) = E(e − ) in t. Hence, we see that eǫt i=1 Yn 2 2 sup Fn(t) F (t) = sup Dn(t) pFd(t) + Cn(t) (1 p)Fc(t) 1 t (bi ai) /8 | − | | − − − | e − ) t t ≤ eǫt i=1 sup Dn(t) pFd(t) + sup Cn(t) (1 p)Fc(t) 0, Y ≤ | − | | − − | → n t t 1 t2 = exp (b a )2 . eǫt 8 i − i over an event of probability one. ( i=1 ) ♠ X The above theorem is just the tip of the iceberg. In order to see the rate of 4ǫ Taking t = n 2 gives the second inequality. P (bi ai) convergence, we bring in an inequality of Hoeffding. i=1 − ♠ 17.1 Glivenko-Cantelli Theorem 167 168 WLLN, SLLN & Uniform SLLN
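Hoeffding's inequality is easy to test by simulation. In the Python sketch below (an illustration only; $X_i \sim B(1, \tfrac12)$ so that $b_i - a_i = 1$) the second inequality reads $P(S_n - E(S_n) \ge n\delta) \le e^{-2n\delta^2}$, and the simulated frequency should fall below the bound, typically well below it, since the bound is not tight at this scale.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
n, delta, reps = 500, 0.05, 200_000

s = rng.binomial(n, 0.5, size=reps)        # S_n for Bernoulli(1/2) summands
freq = np.mean(s - n / 2 >= n * delta)     # estimated P(S_n - E S_n >= n*delta)
bound = np.exp(-2 * n * delta**2)          # Hoeffding bound with sum (b_i - a_i)^2 = n

print(f"simulated P(S_n - E S_n >= {n*delta:.0f}) ~ {freq:.4f}")
print(f"Hoeffding bound exp(-2 n delta^2)      = {bound:.4f}")
\end{verbatim}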

Theorem - 17.1.2 - (Improved Glivenko-Cantelli theorem) For the Glivenko- Cantelli theorem we have

\[
P\Big( \sup_{t\in\mathbb{R}} |F_n(t) - F(t)| > \epsilon \Big) \le 8(n+1)\, e^{-n\epsilon^2/32}.
\]
Instead of proving this result, we explain its significance and the context in which it should be viewed. First consider the collection of functions $\mathcal{F} := \{ \chi_{(-\infty,t]}(x),\ t \in \mathbb{R} \}$. Note that $P(X \le t) = F(t) = E f_t(X)$, where $f_t(x) = \chi_{(-\infty,t]}(x) \in \mathcal{F}$. If $X_1, X_2, \dots, X_n \stackrel{iid}{\sim} F$ then the empirical distribution function, $F_n(t)$, is $\frac{1}{n}\sum_{i=1}^{n} f_t(X_i)$ and the Glivenko-Cantelli theorem may be stated as a uniform form of the strong law of large numbers,

\[
\sup_{f\in\mathcal{F}} \Big| \frac{1}{n}\sum_{i=1}^{n} f(X_i) - E f(X_1) \Big| \stackrel{a.s.}{\to} 0.
\]
In other words, the convergence is uniform over all the functions $f \in \mathcal{F}$.

Exercise - 17.1.1 - Finish the proof of Polya's theorem by showing that the sequence $\{x_k\}$ is bounded above.

Exercise - 17.1.2 - (A modified Polya's result) Let $G$ be a probability distribution and let $F_1, F_2, \dots$ be a sequence of nondecreasing and right continuous functions with $F_n(-\infty) \equiv 0$ and $F_n(+\infty) \to 1$. If $G$ is a continuous function such that $F_n(t) \to G(t)$ for all $t$ then prove that $\sup_t |F_n(t) - G(t)| \to 0$.
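The quantity $\sup_t |F_n(t) - F(t)|$ can be computed exactly from the order statistics. The Python sketch below (with $F = \mathrm{Exp}(1)$ as an arbitrary continuous example) shows it shrinking as $n$ grows, consistent with Theorem (17.1.1) and with the exponential rate of Theorem (17.1.2).

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)

def dn_exponential(n):
    """Exact sup_t |F_n(t) - F(t)| for an Exp(1) sample, via order statistics."""
    x = np.sort(rng.exponential(1.0, size=n))
    F = 1.0 - np.exp(-x)                 # true CDF at the order statistics
    i = np.arange(1, n + 1)
    return max(np.max(np.abs(i / n - F)), np.max(np.abs((i - 1) / n - F)))

for n in (100, 1_000, 10_000, 100_000):
    print(f"n = {n:>7}:  D_n = sup_t |F_n(t) - F(t)| ~ {dn_exponential(n):.4f}")
\end{verbatim}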

Exercise - 17.1.3 - (Discrete Scheffé's theorem) Let $B = [b_{nj}]$ be a nonnegative conservative matrix, i.e., having the following properties.

(i) $b_{nj} \to \beta_j$ for each $j \in \Delta$.

(ii) $\sup_n \sum_j b_{nj} \le C < \infty$, and

(iii) $\lim_{n\to\infty} \sum_j b_{nj} = \sigma$.

If $\sum_j \beta_j = \sigma$ then show that $\lim_{n\to\infty} \sum_j x_{nj}\,(b_{nj} - \beta_j) = 0$, for any bounded double array $[x_{nj}]$.

Exercise - 17.1.4 - (Scheffé's theorem) Let $f_n$ be a sequence of probability densities so that $f_n \to f$ (pointwise) where $f$ is also a density. Then show that

\[
\lim_{n\to\infty} \int_{\mathbb{R}} |f_n(t) - f(t)| \, dt = 0.
\]

$+ \frac{1}{n}\sum_{k=N+1}^{n} X_k$ converge to zero, for each $N$. But the first term always goes to zero for each $N$. Hence, $\frac{1}{n}\sum_{k=1}^{n} X_k$ converges to zero if and only if $\frac{1}{n}\sum_{k=N+1}^{n} X_k$ converges to zero, for each $N$. Hence, the event is a member of $\sigma(X_{N+1}, X_{N+2}, \dots)$ for each $N$.

Exercise - 18.1.1 - Is it true that $\{\liminf_n \sum_{k=1}^{\infty} X_k \le 0\}$ is a member of $\mathcal{T}$? (Justify your answer).

Lecture 18

Random Series

Theorem - 18.1.1 - (Kolmogorov's 0-1 Law) Let $X_1, X_2, \dots$ be a sequence of independent random variables and let $\mathcal{T}$ be the tail sigma field. For any $A \in \mathcal{T}$, either $P(A)$ is 0 or 1.

Proof: Just note that σ(X1,X2, ,Xn) and B σ(Xn+1,Xn+2, ) are inde- pendent. To prove this, we proceed··· as follows. ∈ ··· By the independence of random variables, we see that σ(X1,X2, ,Xn) is We start off with the zero-one law of Kolmogorov. Then we present the famous ··· independent of σ(Xn+1,Xn+2, ,Xn+m) for each m 1. Therefore, for any A result of Kolmogorov concerning the convergence of a series whose components are ··· ≥ ∈ σ(X1,X2, ,Xn) and any B m∞=1σ(Xn+1,Xn+2, ,Xn+m) we have P (A random variables. Then we present some refinements of the strong laws of large ··· ∈ ∪ ··· ∩ B) = P (A)P (B). But the set m∞=1σ(Xn+1,Xn+2, ,Xn+m) forms a π-system. numbers. Hence, by Proposition (??),1 σ(∪X ,X , ,X ) is independent··· of σ(X ,X , ). 1 2 ··· n n+1 n+2 ··· Next, since σ( + , + , ), we see that σ(X1,X2, ,Xn) is inde- pendent of forT each ⊆ n.X That\ ∞ is,X\ for∈ any··· A σ(X ,X , ,X ) and··· any B we 18.1 Zero-One Laws & Random Series T ∈ 1 2 ··· n ∈T have P (A B) = P (A)P (B). Thus, the collection (π-system) ∞ σ(X ,X , ,X ) ∩ ∪n=1 1 2 ··· n For this, let us recall the definition of σ(X ,X , ,X ) . It is the smallest is independent of . Once again, by Proposition (??), it must be that 1 2 ··· n ⊆ k T sigma field with respect to which X1,X2, ,Xn are measurable. For an infinite ··· P σ( n∞=1σ(X1,X2, ,Xn)) = σ(X1,X2, ) sequence of random variables X1,X2, , we take σ(X1,X2, ) to be the smallest ∪ ··· ··· ··· ··· sigma field with respect to which any finite subcollection of the Xi’s is measurable. is independent of . But σ( , , ). Hence, is independent of . That is, That is, if A, B T thenTP ( ⊆A BX∞) =X∈P (···A)P (B). TakingT B = A gives thatT ∞ 2 ∈ T ∩ σ(X ,X , ) = σ σ(X ,X , ,X ) . P (A) = P (A) . Thus, it must be that P (A) is either zero or 1. 1 2 ··· 1 2 ··· n ♠ n=1 ! [ a.s. Another sigma field, called the tail sigma field, , is defined as follows. Exercise - 18.1.2 - Let Xk,k be a sequence of random variables so that Xk T X. Prove the following results.{ ≥} →

∞ (i) Show that there exists an event A with P(A) > 0 so that Xk converges to = σ( , + , ). • T X\ X\ ∞ ··· X uniformly on A. = \\∞ (ii) Show that there exists an event A with P(A) > 0 and a constant N so This is a sigma field since intersection of sigma fields is again a sigma field. • that supω A Xn(ω) X(ω) 1 for all n N. ∈ | − | ≤ ≥ Example - 18.1.1 - Here are some examples of elements of . Let X1,X2, be [Hint: you may use Egorov’s theorem.] a sequence of random variables. T ··· UC Exercise - 18.1.3 - (Buck-Pollard)2Let ([0, 1], , P) be the usual probability space n with P((a, b)) = b a and is the Borel sigmaE field. For any t (0, 1] let ∞ 1 − E ∈ Xk converges , Xk 0 . t = U1 + U2 + be the usual nonterminating dyadic expansion. This makes n → 2 22 (k=1 ) ( k=1 ) ··· X X 1 which says that if two π-systems are independent then their respective generated The first event belongs to is clear since the convergence of the series takes place if sigma fields are also independent. T 2 and only if k∞=n Xk converges for each n. To see why the second event also belongs R. Creighton Buck & Harry Pollard. (1943), Convergence and summability of subse- to note that the averages 1 n X converge to zero if and only if 1 N X quences, Bull. Am. Math. Soc. vol. 49, pp. 924–931. T P n k=1 k n k=1 k P P 18.1 Zero-One Laws & Random Series 171 172 Random Series

iid U , U , B(1, 1 ). For a given sequence a of constants assume there exists The issue is, can we take the left side limit inside the V ar( ) operation? In other 1 2 2 k · a random··· variable∼ Z so that { } words, is V ar( ) operation continuous in this sense so that we may have · n 1 a.s. ∞ Zn := ak (2Uk 1) Z. V ar(X + X + + X + ) = V ar(X ). n − → 1 2 n k k=1 ··· ··· X kX=1 Prove the following results. One major problem is to define the random variable on the left side whose variance is sought. The random series need to converge to a random variable before we can (i) If A is an event with P(A) > 0 so that Z converges uniformly, then there n ask about its variance. This is answered in the following theorem which says that • exists a constant K so that Z(t) K for all t A. | | ≤ ∈ a sufficient condition is to have k V ar(Xk) < for both issues to be settled. (ii) There exists a constant K so that the event D = t : Z(t) K t : ∞ • Z(t) K has probability P(D) = 1. Hence, E(Z) must{ exist,≤ say }µ. ∩ { Theorem - 18.1.2 - (Khintchin-KolmogorovP criterion) Let X ,X , be a ≥ − } 1 2 ··· sequence of independent random variables with means E(Xk) = µk, and finite (iii) For any ǫ > 0 explain why exactly one of the events, Z > µ + ǫ , 2 2 • { } variances V ar(Xk) = σ so that σ < . Then the follwoing statements hold: Z µ ǫ , Z µ ǫ has probability one. k k k ∞ { ≤ − } {| − | ≤ } (i) The random series k∞=1P(Xk µk) almost surely converges to a rv L. U1(t) U2(t) 1 U1(t) 1 U2(t) • − (iv) Verify that if t = 2 + 22 + then 1 t = − 2 + −22 + . • ··· − ··· 2 n 2 That is, U (1 t) = 1 U (t) for all n. Therefore, there exists a t (0, 1) (ii) E(L ) < and EP( (Xi µi) L) 0. n − − n ∈ • ∞ i=1 − − → for which Z(t) and Z(1 t) exist with Z(t) + Z(1 t) = 0, and hence Z = 0 n − − (iii) L (Xk µk),P n = 1, 2, 3, is uniformly integrable. almost surely. • { k=1 − ···} ∞ 2 (iv) E(LP) = 0, and V ar(L) = k=1 V ar(Xk). L n 2 2 • (v) Using Exercise (17.0.18) deduce that Zn 0. Hence, k=1 ak = o(n ). • → n Proof: Let S = (X µ ).P Now we will prove the existence of a ran- [For more on this, see Exercise (18.2.1).] P n i=1 i − i dom variable L so that Sn converge to L almost surely. By using Kolmogorov’s inequality, (by the firstP n terms to be zero) Now we present a third application of truncation in conjunction with the Kol- mogorov inequality. Here we will turn our attention to the question of whether an k m infinite seires of random variables converges or not. Our goal is to see when does 1 2 P max Sk Sn > ǫ = P max (Xj µj) > ǫ σk. nn n the continuity property of probability measures gives that Sn := Xk, Xk=1 P P 1 ∞ 2 and ask do the random variables Sn converge to a random variable in some sense. (sup Sk Sn > ǫ) = lim max Sk Sn > ǫ σk. | − | m nn →∞ ≤ We will assume that the random variables are mutually independent. The key re-   k=Xn+1 sults are the Khintchin-Kolmogorov theorem and the famous “three series theorem” Since infn supk>n Sk Sn supk>n Sk Sn , we see that due to Kolmogorov. Recall that if Xk are independent random variables then | − | ≤ | − | n 1 ∞ P inf sup S S > ǫ P(sup S S > ǫ) σ2 V ar(X + X + + X ) = V ar(X ). k n k n 2 k 1 2 n k n 1 k>n | − | ≤ k>n | − | ≤ ǫ ···  ≥  k=n+1 Xk=1 X for all n 1. Hence, it must be that This gives that ≥ n ∞ lim V ar Xk = V ar(Xk). P n inf sup Sk Sn > ǫ = 0. →∞ ! n 1 k>n | − | kX=1 Xk=1  ≥  18.1 Zero-One Laws & Random Series 173 174 Random Series

Since, ǫ > 0 is arbitrary, we have Exercise - 18.1.4 - Verify that a sequence of real numbers, a , is a Cauchy se- { k} quence if and only if infn 1 supk>n ak an = 0. ≥ | − | P inf sup Sk Sn = 0 = 1. n 1 k>n | − |  ≥  Exercise - 18.1.5 - Let X1,X2, be any sequence of iid random variables with ··· n This implies that S is a Cauchy sequence on a set of probability 1, (see Exercise E X1 < 1. Prove that the random series ∞ Xk converges absolutely k | | n=1 k=1 (18.1.4)) making {S } almost surely convergent to some limiting random variable almost surely. k P Q L. This proves part{ (i)} of the theorem. For the proof of the second part, we will use the fact that the space L2 of square Exercise - 18.1.6 - (Random harmonic series revisited) Recall Exercise (7.2.3), iid integrable random variables is complete. Since E(Sn) = 0, for any m>n, we have where U , U , B(1, 1 ) represent the zero/one outcomes of a fair coin toss. De- 1 2 ··· ∼ 2 m fine 2 n E Sn Sm = V ar(Sn Sm) = V ar(Xi) V ar(Xi) 0. 2Ui 1 | − | − ≤ → Xn := − , n 1. i=n+1 i n+1 i ≥ X ≥X i=1 2 X Therefore, Sn, n 1 is a Cauchy sequence in L . Hence, there exists a random Show that there exists a random variable L so that { 2 ≥ } 2 variable, T , in L so that Sn converge to T in the L norm. This implies that S converge to T in probability as well. And almost sure convergence of S to L (i) E(L) = 0, V ar(L) = π2/6. n n • implies convergence in probability to L as well. Hence, it must be that T = L with a.s. (ii) X L. probability one. This gives part (ii). • n → 2 2 2 Note that supn E(Sn) = σk < . Hence, Sn is uniformly integrable. L k ∞ { } (iii) Xn L. Since, Sn converge to L almost surely, we have E(L) = limn E(Sn) = 0. Now we • → show that LS is also uniformlyP integrable. Indeed, by the CBS inequality, { n} Exercise - 18.1.7 - (Pick a point at random revisited) Recall Exercise (14.0.9), 1/2 iid ∞ where U1, U2, Uniform 0, 1, , 9 , and 2 1/2 2 2 2 ··· ∼ { ··· } sup E( LSn ) sup E(L ) E(Sn) E(L ) σk < . n | | ≤ n ≤ ∞ n k=1 ! U 4.5  p X X := i − , n 1. 2 2 n 10i ≥ Denote the sum k σk by K. By the fact that L is integrable, for any ǫ > 0 there i=1 exists a δ(ǫ) > 0 so that X P Show that there exists a random variable L so that L2 dP ǫ, whenever P (A) < δ. ≤ (i) L Uniform( 0.5, 0.5). ZA • ∼ − 2 a.s. For any given ǫ > 0, using δ( ǫ ), once again the CBS inequality gives that (ii) Xn L. K • → L2 sup LS dP (iii) Xn L. | n| • → n ZA 1/2 1/2 2 1/2 iid 1 2 2 1/2 2 1/2 ǫ Exercise - 18.1.8 - (Cantor distribution) Let U1, U2, B(1, 2 ), sup L dP E(Sn) = K L dP K = ǫ. ··· ∼ ≤ n ≤ K ZA  ZA    n p 2Ui To prove the last part, since LS converges to L2 almost surely, by part (iii), we X := , n 1. n n 3i ≥ have E(LS ) E(L2). Finally, by part (ii), i=1 n → X n 2 Show that there exists a random variable L so that 0 = lim E (Xi µi) L a.s. n − − ! (i) Xn L. Xi=1 • → n 2 ∞ L 2 2 2 2 (ii) Xn L. = lim σk + E(L ) 2E(LSn) = σk E(L ). n − − • → (k=1 ) k=1 (iii) Show that L has the Cantor distribution. X X • This finishes the proof. ♠ (iv) Show that E(L) = 1 and V ar(L) = 1 . • 2 8 18.2 Refinements of SLLN 175 176 Random Series

n (v) Write a computer program that simulates one million approximate out- 1 • = Pk (ak ak 1) comes of L and plot the resulting histograms for various number of bins, say Pn − − Xk=1 20, 50, 100 and 500. Describe what happens to be resulting histograms. n 1 [Hint: for part (iv) you cannot use the density of L since it does not exist.] = Pnan (Pk Pk 1) ak 1 Pn − − − − ( k=1 ) n X Exercise - 18.1.9 - (Brownian motion a.k.a Wiener process) Take Z0, Z1, Z2, 1 ··· = an (Pk Pk 1) ak 1 to be independent and identically distributed standard normal random variables. − Pn − − − For any positive t, define kX=1 α α = 0. n → − tZ0 2 sin kt Wn(t) := + Zk,Wn(0) := 0. Here we used the regularity of the Riesz means. √π π k ♠ r k=1 X An analog of this result holds for the Euler/Borel methods, cf. Exercise??. As Fix t, and consider the sequence W0(t),W1(t),W2(t), . Show that there exists a.s ··· an immediate consequence we prove a strong law of large numbers. a random varible, say W (t), so that Wn(t) W (t). What is V ar(W (t)) for 0 t π? (As a function of t, W (t) turns out→ to be the Brownian motion.) ≤ ≤ Proposition - 18.2.2 - For Xn be any sequence of independent random variables,

UC Exercise - 18.1.10 - (Wiener's original Brownian motion)

\[
\sum_{k=1}^{\infty} \frac{1}{k^2}\, Var(X_k) < \infty
\quad \text{implies} \quad
\frac{1}{n}\sum_{k=1}^{n} \big(X_k - E(X_k)\big) \stackrel{a.s.}{\to} 0.
\]

18.2 Refinements of SLLN

1 ∞ Now, we will see a rather elementary result from summability theory which has had Proof: By Khintchin-Kolmogorov criterion, k=1 k2 V ar(Xk) < implies that Xk E(Xk) ∞ a profound impact on the question of covergence of sums of independent random − converges almost surely. We take P = n in Kronecker’s lemma and k k P n variables. get the conclusion. P ♠

Proposition - 18.2.1 - (Kronecker’s lemma) Let Pn be a sequence of strictly positive numbers with P P and P . Let x ,x , be a sequence of Remark - 18.2.1 - (Beta matrices & Kronecker’s lemma) A summability ma- n ≤ n+1 n → ∞ 1 2 ··· real numbers and let trix B =[bnk] is called a beta matrix if for every convergent series k xk we have yn := xk bnk exists for all n and yn converges to some value. For a characteri- sn := x1 + x2 + + xn. k ··· zation of beta matrices see Exericse (??). P P Then, as n , For a beta matrix, a non-decreasing sequence, P , of positive numbers will →∞ { k} n n be called a beta-null sequence of B =[bnk] if for every convergent series of the type s 1 x xk n k the transform yn := xk bnk exists for all n and yn 0. = xk 0, provided converges. k Pk k → Pn Pn → Pk Kronecker’s lemma says that for the P -Riesz method, the sequence P k=1 k=1   P P k k X X itself is its own beta-null sequence. In particular{ } for the Cesaro method we{ have} Proof: Let us take P = k is its beta-null sequence. It is not difficult to show that for the regular n k xk Euler and Borel methods their beta-null sequence is Pk = √k. an := , Pk Xk=1   Remark - 18.2.2 - Our goal is to avoid assuming the existence of finite variance for and let a0 = 0 and let P0 = 0. We are given that an α, (say). We will apply → the sequence of random variables in the strong law of large numbers. This involves Abel’s summation by parts to the sequences ak and Pk . Indeed, we have { } { } a truncation argument. Not only the end result is important but also the method n is of some value since this method has other applications. sn 1 = xk Pn Pn k=1 Xn The next theorem shows the main refinement if the mean exists but the variance 1 xk may not. Then after this result we present a theorem of Feller which gives a = Pk Pn Pk refinement of the SLLN when the mean does not exist either. Xk=1 18.2 Refinements of SLLN 177 178 Random Series
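Kronecker's lemma with $P_k = k$ is exactly how Proposition (18.2.2) converts the almost sure convergence of $\sum_k (X_k - E(X_k))/k$ into a strong law. The Python sketch below (with $X_k$ iid $N(0,1)$ as an arbitrary example) follows one simulated path and shows the two phenomena side by side: the partial sums of $\sum_k X_k/k$ settle down while $S_n/n$ is pushed to zero.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(5)
N = 1_000_000
x = rng.standard_normal(N)             # assumed example: X_k iid N(0,1), mean 0

k = np.arange(1, N + 1)
series = np.cumsum(x / k)              # partial sums of sum_k X_k/k (converges a.s.)
means = np.cumsum(x) / k               # S_n/n (tends to 0, by Kronecker's lemma)

for n in (10**3, 10**4, 10**5, 10**6):
    print(f"n = {n:>8}:  sum_(k<=n) X_k/k = {series[n-1]:8.4f}"
          f"    S_n/n = {means[n-1]:8.5f}")
\end{verbatim}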

Theorem - 18.2.1 - (Extension of Khintchin-Kolmogorov criterion — due This gives that 3 to Chung, Marcinkiewicz & Zygmund) Let f(x) be a nonnegative , even 2 2 V ar(Yn) Yn Xn f(Xn) and continuous function on the real line. Also, let f(x)/x be increasing for x > 0 E = E χ{|X |≤b } E < . b2 ≤ b2 b2 n n ≤ f(b ) ∞ and f(x)/x2 be decreasing for x > 0. (Such as f(x) = x p; for some 1 p 2). Let n n n  n  n  n  n  n  | | ≤ ≤ X X X X (Xk) be a sequence of independent random variables with finite means E(Xk) = µk, By Khintchin-Kolmogorov criterion we must have and let bk be a sequence of positive increasing numbers going to infinity. If 1 E a.s. ∞ Ef(Xk µk) (Yk (Yk)) bk − → − < k f(bk) ∞ k=1 X X E(Y ) a.s. k n (Xk µk) 1 n to some random variable. Now we need to show that k b converges. For this then (i) k=1 − converges a.s. and (ii) k=1(Xk µk) 0. k bk bn E(Yk ) − → we will show a bit more by showing that | | converges. The key fact is that k bk P Proof: PSince, all the assumptions and conclusionsP are in terms of Xn µn, without E(X ) = 0 gives that − k P loss of generality, take µn = 0. The trick is to truncate the random variables and E E E use the Khintchin-Kolmogorov criterion for convergence of a random series. This (Xk)=0= Xkχ Xk bk + Xkχ Xk >bk will give part (i). The second part is just Kronecker’s lemma. So, let us define {| |≤ } {| | } Therefore, we see that   X if X b Y := n n n n 0 if |X |> ≤ b . E E E n n (Yk) = Xkχ Xk bk = Xkχ Xk >bk  | | | | {| |≤ } {| | } Now, we will show that Xn and Yn are equivalent. That is,   Hence, by the last inequality of (2.1) we have P (Xk = Yk) = P ( Xk > bk) < . E E E 6 | | ∞ (Yk) (Xkχ Xk >bk ) f(Xk) k k | | = | {| | } | < . X X bk bk ≤ f(bk) ∞ In this regard, we use the fact that f(x)/ x is increasing as x goes to infinity. Note, Xk Xk Xk | | that This gives part (i) of the theorem. f(bn) f(Xn) Xn f(Xn) ♠ X > b | | . {| n| n} ⊆ b ≤ X ⊆ b ≤ f(b )  n | n|   n n  In the end we mention a beautiful refinement due to Feller. So, we get that Theorem - 18.2.2 - (Feller’s extension of SLLN (1946)) Let (X ) be a se- X f(X ) k n n quence of mutually independent and identically distributed random variables with P ( Xn > bn) = E χ{|Xn|>bn} E | |χ{|Xn|>bn} E .(2.1) | | ≤ bn ≤ f(bn)     E X1 = . Let bn be a sequence of positive numbers with bn/n . Then,  | | ∞ ր So, we have n ∞ i=1 Xi a.s. Ef(Xn) (i) If P ( X1 > bn) < then limsup = 0. | | ∞ n b P (Xn = Yn) = P ( Xn > bn) < . n=1 n 6 | | ≤ f(bn) ∞ P n n n X n X X X ∞ i=1 Xi a.s. Thus, the two random sequences X and Y are equivalent. Therefore, the (ii) If P ( X1 > bn) = then limsup = . n n | | ∞ n b ∞ { } { } n=1 P n two random sequences Xn/bn and Yn/bn are equivalent. To show that the X random series Y /b {converges} almost{ surely,} we use the Khintchin-Kolmogorov k k k Proof: The proof proceeds by using a slightly modified form of the truncation criterion. We need only prove that V ar(Yk ) < . In this regard, we use the P n bk ∞ argument. Define other given fact that f(x)/x2 is decreasing. So, over the set where X equals Y , P k k we have, Xn µn if Xn bn Y := where µ := E X χ . 2 n − | | ≤ n 1 {|X1|≤bn} f(bn) f(Xn) Xn f(Xn) µn if Xn > bn, Xn bn 2 2 2 .  − | |   | | ≤ ⇒ b ≤ X ⇒ b ≤ f(bn) n n n This truncation has the property that 3Chung, Kai Lai (1947), A note on some strong laws of large numbers. Amer. J. Math., vol. 69, pp. 189-192. E(Yn) = E (Xn µn)χ µnP ( Xn > bn) − {|Xn|≤bn} − | |  18.2 Refinements of SLLN 179 180 Random Series

2 = E X1χ{|X |≤b } µnP ( Xn bn) µnP ( Xn > bn) ∞ ∞ j b b 1 n − | | ≤ − | | X2 dP , since k j ,   ≤ 1  k2b2  k ≥ j = µn µn = 0, j=1 bj−1 < X1 bj j − X Z{ | |≤ } Xk=j   as well as ∞ 2 ∞ j 2 1 = 2 X1 dP 2 bj bj−1 < X1 bj  k  P (Yn = (Xn µn)) P ( Xn > bn) = P ( X1 > bn). j=1 Z{ | |≤ } k=j 6 − ≤ | | | | X X n n n ∞ 2   X X X j 2 1 C 2 X1 dP , where C is a constant, So, if we assume that the last series converges (part (i)) then the random sequence ≤ bj bj−1 < X1 bj j j=1 Z{ | |≤ }   Xn µn is equivalent to the sequence Yn . To prove (i), we use the Khintchin- X { − } { } ∞ Kolmogorov criterion for which we show that j 2 C 2 X1 dP ≤ bj bj−1 < X1 bj n j=1 Z{ | |≤ } V ar(Y ) 1 X k 2 (a) 2 < and (b) µk 0. ∞ X1 b ∞ bn → = C j dP k k k=1 b2 X X j=1 bj−1 < X1 bj j X Z{ | |≤ } The first result, (a), will then imply that the random series Yk converges almost k bk ∞ X1 1 n C j dP, since | | 1, surely as well as that b k=1 Yk converges to zero almost surely. The equivalence ≤ b ≤ n P j=1 bj−1 < X1 bj j Xk µk Z{ | |≤ } then gives that the random series − converges almost surely as well as that X P k bk 1 n ∞ b k=1(Xk µk) converges to zero almost surely. But then the second result, (b), = C j P (bj 1 < X1 bj) n − P − | | ≤ gives that j=1 P n n n X 1 1 1 a.s. j Xk = (Xk µk) + µk = 0. ∞ bn bn − bn → = C P (bj 1 < X1 bj) k=1 k=1 k=1 − | | ≤ X X X j=1 k=1 So, we now prove (a) and (b). Just note that X X ∞ ∞ = C P (bj 1 < X1 bj) 2 − | | ≤ ∞ V ar(Yk) ∞ E(Y ) k Xk=1 Xj=k b2 ≤ b2 k=1 k k=1 k ∞ X X = C P ( X1 > bk 1) ∞ 1 | | − = µ2 P ( X > b ) + µ2P ( X b ) k=1 b2 k | k| k k | k| ≤ k X k=1 k ∞ X  = CP ( X1 > b0) + C P ( X1 > bk 1) 2 | | | | − 2µkE Xkχ + E X χ k=2 − {|Xk|≤bk } k {|Xk|≤bk } X    o ∞ ∞ 1 = µ2 2µ µ + E X2χ = CP ( X1 > b0) + C P ( X1 > bj) < , j = k 1, 2 k k k k {|Xk|≤bk } | | | | ∞ − bk − · j=1 kX=1    X ∞ 1 E X2χ since the last series is given to be convergent. This proves (a). To prove (b), we 2 k {|Xk|≤bk } ≤ bk will use the above proved fact that kX=1   ∞ 1 2 = E X χ ∞ 2 1 {|X1|≤bk } bk j P (bj 1 < X1 bj ) < . (2.2) kX=1   − | | ≤ ∞ k j=1 ∞ 1 X = X2 dP, take b = 0, b2 1 0 k j=1 bj−1< X1 bj Note that for any positive integer N

18.2 Refinements of SLLN 181 182 Random Series

N n a2 ∞ k 1 1 (i) For any bounded sequence ak , or more general sequence obeying k=1 k2 < = E X1 χ X1 bk + E X1 χ X1 bk • a.s. { } bn | | {| |≤ } bn | | {| |≤ } , we have Z 0. Therefore, any such sequence is Cesaro summable if k=1 k=N+1 n P X  X  and∞ only if 1 n→ a U converges almost surely. That is, any sequence of N n n k=1 k k 1 1 a2 the type ∞ k < , is Cesaro summable if and only if almost all of its = E X1 χ X1 bk + E X1 χ X1 bN k=1Pk2 bn | | {| |≤ } bn | | {| |≤ } subsequences are Cesaro∞ summable. kX=1 k=XN+1 n   P 1 (ii) For any arbitrary sequence ak , if almost all of its subsequences are + E X1 χ bN < X1 bk • { } bn | | { | |≤ } Cesaro summable then it must be that the sequence itself is Cesaro summable. k=N+1 X  n N n 2 2 [Remark: Condition k=1 ak = o(n ) of part (vii) of Exercise (18.1.3) is only 1 1 2 E b χ + E b χ ak N X1 bN N X1 bN a necessary condition while the above condition of ∞ < is a sufficient ≤ bn {| |≤ } bn {| |≤ } k=1 k2 k=1 k=N+1 P 1 n ∞ condition for the almost sure convergence of Zn = ak(2Uk 1). There are X n  X  n k=1 1 other known sufficient conditions.] P − + E X1 χ bN < X1 bk P bn | | { | |≤ } k=N+1 UC Exercise - 18.2.2 - If X ,X , are independent random variables then show that n X  1 2 nb 1 ··· N k Xk converges almost surely if and only if k Xk converges in probability. = + E X1 χ bN < X1 bk bn bn | | { | |≤ } k=N+1 P P Xn  nbN 1 + E X1 χ bN < X1 bn ≤ bn bn | | { | |≤ } k=N+1 X  nbN n + E X1 χ bN < X1 bn ≤ bn bn | | { | |≤ } n  nbN n = + E X1 χ bj−1< X1 bj bn bn | | { | |≤ } j=XN+1 n  nbN n + bjP (bj 1 < X1 bj) ≤ bn bn − | | ≤ j=XN+1 n nbN bj n = + jP (bj 1 < X1 bj ) bn j · bn · − | | ≤ j=XN+1 n nbN bj bn + j P (bj 1 < X1 bj ), since , ≤ bn − | | ≤ j ≤ n j=XN+1 nbN ∞ + j P (bj 1 < X1 bj ). ≤ bn − | | ≤ j=XN+1 First let n go to infinity and then let N go to infinity. The first term drops to zero since bn must go to infinity, for otherwise, if it remains bounded then P ( X n k | 1| ≥ bk) being convergent would imply that E X1 < which would contradict E X1 = . Then as N gets large the second term| | drops∞ to zero, due to beingP the| tail| of ∞a convergent series (cf. (2.2)). This finishes the proof of part (i). Part (ii) has a similar proof whose details we omit. ♠

Exercise - 18.2.1 - (Buck-Pollard revisited) Continuing Exercise (18.1.3) prove the following further results.

Furthermore,

\begin{align*}
|c_k - c_{k+1}| &= \left|\frac{1}{P(A_k)}\int_{A_k} T_k\,dP - \frac{1}{P(A_{k+1})}\int_{A_{k+1}} T_{k+1}\,dP\right| \\
&= \left|\frac{1}{P(A_k)}\int_{A_k} T_k\,dP - \frac{1}{P(A_{k+1})}\int_{A_{k+1}} (T_k + Y_{k+1})\,dP\right|
\end{align*}

1 1 = Sk dP E(Sk) Sk dP + E(Sk) P (Ak) Ak − − P (Ak+1) Ak+1 Kolmogorov’s Three Series Z Z

1 Yk+1 dP −P (Ak+1) Ak+1 Theorem Z

1 1 1 = Sk dP Sk dP Yk+1 dP P (Ak) Ak − P (Ak+1) Ak+1 − P (Ak+1) Ak+1 Z Z Z

Now we turn our attention to the famous three series theorem of Kolmogorov. For 1 1 1 the proof of the three series theorem we will need the following lower inequality Sk dP + Sk dP + Yk+1 dP ≤ P (Ak) Ak P (Ak+1) Ak+1 P (Ak+1) Ak+1 Z Z Z also due to Kolmogorov. 2ǫ + K, since Sk ǫ over Ak and Ak+1 . ≤ | | ≤ Proposition - 19.0.3 - (Kolmogorov’s lower inequality) Let X1,X2, be in- Also note that dependent random variables having finite expectations. If there exists a··· constant 1 K such that Xn E(Xn) K, n = 1, 2, , then for any ǫ > 0, we have E(S ) c = E(S ) T dP | − | ≤ ··· | − k − k| − k − P (A ) k k ZAk k (2K + 4ǫ)2 1 P max Xi ǫ n . = E(Sk) Sk dP E(Sk) 1 k n ≤ ≤ V ar(X ) − − P (Ak) A − ≤ ≤ i=1 ! k=1 k Z k X 1 P Sk dP <ǫ. Proof: Note that the given condition on X makes X to be a bounded random ≤ P (Ak) A | | n n Z k variable, however, the location of the support of the random variables may drift n Now the monotonicity of the sequence Ak gives that away without any bounds. Let Sn = i=1 Xi and let S0 = 0. Denote the event of interest by P (T c )2 dP = (T c )2 dP (T c )2 dP. k+1 − k+1 k+1 − k+1 − k+1 − k+1 An := max Sk ǫ . Ak+1 Ak Ak Ak+1 1 k n | | ≤ Z Z Z −  ≤ ≤  We will deal with the two integrals one at a time. Over the event Ak Ak+1, we If P (An) = 0 then there is nothing to prove. Otherwise, since An 1 An, we − − ⊇ notice that Sk ǫ. By adding and subtracting ck in the integrands we have see that P (Ak) > 0 for each k = 0, 1, 2, ,n. If we denote the centered random | | ≤ ··· n variable by Yk = Xk E(Xk), then let Tn = i=1 Yi and once again let T0 = 0. 2 Furthermore, let − (Tk+1 ck+1) dP Ak Ak − P Z − +1 1 2 = (Tk+1 ck + ck ck+1) dP ck = Tk dP, k = 0, 1, 2, . − − P (A ) ··· Ak Ak+1 k ZAk Z − 2 Note that = (Tk ck + ck ck+1 + Yk+1) dP Ak Ak − − Z − +1 2 (Tk ck) dP = ckP (Ak) ckP (Ak) = 0. (0.1) ( Sk + E(Sk) ck + ck ck+1 + Yk+1 ) dP A − − ≤ Ak Ak+1 | | | − − | | − | | | Z k Z − Kolmogorov’s Three Series Theorem 185 186 Kolmogorov’s Three Series Theorem

2 V ar(X )P (A ) (4ǫ + 2K)2P (A A ). (ǫ + ǫ + (2ǫ + A) + A) dP ≥ k+1 n − k − k+1 ≤ Ak Ak Z − +1 2 Adding over k = 0, 1, 2, ,n 1, the telescoping effect gives that = (4ǫ + 2A) P (Ak Ak+1). ··· − − n n 1 − On the other hand, (T c )2 dP P (A ) V ar(X ) (4ǫ + 2K)2 P (A A ) n − n ≥ n k − k − k+1 ZAn k=1 k=0 2 Xn X (Tk+1 ck+1) dP 2 Ak − P (A ) V ar(X ) (4ǫ + 2K) P (S A ). Z ≥ n k − − n k=1 = (T c + c c + Y )2 dP X k − k k − k+1 k+1 ZAk Now we use the fact that = (T c )2 dP + (c c )2 dP + Y 2 dP + k − k k − k+1 k+1 (T c )2 dP ( S + E(S ) c )2 dP (ǫ + ǫ)2 dP = 4ǫ2P (A ). ZAk ZAk ZAk n − n ≤ | n| | − n − n| ≤ n ZAn ZAn ZAn 2 (T c )(c c ) dP + 2 (T c )Y dP + k − k k − k+1 k − k k+1 Hence we have ZAk ZAk n 2 (ck ck+1)Yk+1 dP. 2 2 2 − 4ǫ P (An) P (An) V ar(Xk) (4ǫ + 2K) + (4ǫ + 2K) P (An). ZAk ≥ − kX=1 The last three terms drop out by using (0.1) and the fact that Y is independent k+1 Rearranging gives that of all the previous random variables and E(Yk+1) = 0. Hence, we have n 2 2 2 2 (4ǫ + 2K) P (An) V ar(Xk) + (4ǫ + 2K) 4ǫ P (An) (Tk+1 ck+1) dP ≥ − k=1 Ak − Z Xn  2 2 2 = (Tk ck) dP + (ck ck+1) dP + Yk+1 dP P (An) V ar(Xk). Ak − Ak − Ak ≥ Z Z Z kX=1 (T c )2 dP + Y 2 dP + ≥ k − k k+1 This finishes the proof. ZAk ZAk ♠ = (T c )2 dP + E(Y 2 ) 1 dP, (independence) Theorem - 19.0.3 - (Kolmogrov’s 3-series theorem) Let (X ) be mutually k − k k+1 k ZAk ZAk independent random variables. And let K > 0 (for Kolmogrov) be a fixed constant. 2 Define a truncated sequence (Y ) by = (Tk ck) dP + V ar(Xk+1)P (Ak). j Ak − Z X if X K Y := j j Hence, we see that j 0 if |X |> ≤ K.  | j| n 2 Then, the partial sums X converges almost surely if and only if the following (Tk+1 ck+1) dP j=1 j Ak+1 − three series converge: Z P 2 2 (i) ∞ P (X = Y ) = ∞ P ( X > K) = (Tk+1 ck+1) dP (Tk+1 ck+1) dP • j=1 j 6 j j=1 | j | Ak − − Ak Ak − Z Z − +1 (ii)P j∞=1 E(Yj ) P 2 2 • (Tk ck) dP + V ar(Xk+1)P (Ak) (4ǫ + 2K) P (Ak Ak+1). ≥ − − − (iii)P ∞ V ar(Yj ). ZAk • j=1

Stating differently, (by using Ak An for k n) we have Proof: OneP way the proof is easy. Assume (i), (ii) and (iii) hold for some constant ⊇ ≤ K > 0. Then by item (iii) and Khintchin-Kolmogorov criterion, we have 2 2 (Tk+1 ck+1) dP (Tk ck) dP A − − A − ∞ Z k+1 Z k (Y E(Y )) converges almost surely. 2 j − j V ar(Xk+1)P (Ak) (4ǫ + 2K) P (Ak Ak+1) j=1 ≥ − − X Kolmogorov’s Three Series Theorem 187 188 Kolmogorov’s Three Series Theorem

1 Therefore, by item (ii) it must be that converges almost surely if and only if p > 2 . Then write a computer program 1 that simulates its density and V ar(Zp) for various values of p > . [In particular, ∞ 2 iid Yj converges almost surely. if X F with 0 < Var(X ) < , then Xk does not converge almost surely.] i 1 k √k j=1 ∼ ∞ X P 2Uk 1 iid 1 Therefore, by item (i) it must be that Exercise - 19.0.5 - Let X = − , k = 1, 2, where U , U , B(1, ). k √k ··· 1 2 ··· ∼ 2 Prove the following results. ∞ Xj converges almost surely. (i) Yk diverges almost surely, where Yk = (Xk + Xk 1), k 2, Y1 = j=1 • k − − ≥ X Xk. n n − P Conversely, suppose now that Xj converges almost surely. So, Xn = Xj j=1 j=1 (ii) Zk converges almost surely, where Zk = Xk Xk 1, k 2, Z1 = X1. n 1 − • k − − ≥ j=1− Xj 0 almost surely. Therefore, for any constant K > 0, P ( Xn > → P P| | P K i.o.) = 0. By the independence of events and the second Borel-Cantelli lemma, Exercise - 19.0.6 - (Randomly modulating random series) Let X1,X2, be P iid ··· any sequence of random variables and let U (t), U (t), B(1, 1 ) be the dyadic ∞ 1 2 ··· ∼ 2 P ( Xn > K) < . expansion of t [0, 1]. Assume X and U are independent. Prove that the | | ∞ ∈ { k} { k} n=1 following statements are equivalent. X Fix a constant K > 0 and this gives part item (i). Hence, the two random sequences (i) X (2U 1) converges almost surely. • k k k − Xn and Yn are equivalent implying that the random series { } { } (ii)P X2 converges almost surely. • k k ∞ (iii)P Xk (2Uk 1) converges almost surely. Yj converges almost surely. • k | | − j=1 2 FurthermoreP any one of the above items implies X (2Uk 1) converges almost X k k − surely, as well as if Xk(j) is any subsequence of Xk then Xk(j)(2Uj 1) Since Yn K, we see that Yn E(Yn) 2K. Now Kolmogorov’s lower inequality { } P { } j − | | ≤ | − | ≤ converges almost surely. Give an example of Xk for which all of the above results gives that { } P hold but k Xk = + almost surely. k | | ∞ (4A + 4)2 P P max Yi 1 m . Exercise - 19.0.7 - Let X ,X , be a sequence of independent random variables. nn | i=n+1 i| ≤ i>n i of the random series Yi does not go to zero, implying that the series Yi i P P i is not convergent. This contradiction implies that i V ar(Yi) < , giving item P E∞ P (iii). Now Khintchin-Kolmogorov criterion implies that k(Yk (Yk)) converges almost surely as well. Hence item (ii) must hold as well.P − P ♠ Exercise - 19.0.3 - By using Kolmogorov’s three series theorem give another proof of Theorem (18.2.1).

Exercise - 19.0.4 - (Random p-series) Let $U_1, U_2, \dots \stackrel{iid}{\sim} B(1, \tfrac12)$. Show that the random series
\[
Z_p := \sum_{k=1}^{\infty} \frac{2U_k - 1}{k^p}
\]
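A minimal version of the computer experiment requested in Exercise (19.0.4) is sketched below in Python, with the series truncated at $N$ terms and sample variances in place of density plots. For $p = 0.75$ the variances approach $\sum_k k^{-1.5} = \zeta(1.5) \approx 2.612$, while for $p = 0.45$ they keep growing with $N$, reflecting the almost sure divergence of the series.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(6)
reps = 1_000

def truncated_pseries(p, N):
    """reps independent copies of sum_(k<=N) (2 U_k - 1)/k^p."""
    signs = 2 * rng.integers(0, 2, size=(reps, N)) - 1
    return signs @ (1.0 / np.arange(1, N + 1) ** p)

for p in (0.75, 0.45):
    print(f"p = {p}:")
    for N in (10**2, 10**3, 10**4):
        v = truncated_pseries(p, N).var()
        print(f"   N = {N:>6}: sample variance of the truncated sum ~ {v:.3f}")
\end{verbatim}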

Lecture 20

[continuation of the proof of Proposition (20.0.4) below:] where $j_1 + j_2 + \cdots + j_r = 2m$ and the sum is over all such terms with $j_i \ge 1$ and $r \ge 1$. Should any of the $j_1, j_2, \dots, j_r$ be odd, that term will drop to zero when its expectation is taken. Hence, only even terms will survive, in which case the expectation of the "$R$" terms will be the product of their expectations and each one of them equals one. So, we may write the above sum as
\[
E\left(\sum_{k=1}^{n} x_k R_k\right)^{2m} = \sum \frac{(2m)!}{(2j_1)!\,(2j_2)!\cdots(2j_r)!}\; x_{k_1}^{2j_1} x_{k_2}^{2j_2}\cdots x_{k_r}^{2j_r},
\]

where 2j1 + 2j2 + + 2jr = 2m and the sum is over all such terms with ji 1, and r 1. This may··· be rewritten as ≥ The Law of Iterated ≥ n 2m (2m)! 2 j1 2 j2 2 jr Logarithms E xkRk = xk xk xkr , 1 2 (2j1)! (2j2)! (2jr)! ··· k=1 ! ··· X X    where j1 + j2 + + jr = m and the sum is over all such terms. Hence we may optimize this sum··· as follows Sn a.s. Recall the Borel normal number theorem which says that n 0 when we take → n 2m n iid 1 Sn = (2Ui 1) and Ui B(1, ) random variables. The question arose as to i=1 2 E xkRk − ∼ a.s. 0.5+ǫ what was the rate of convergence? Hausdorff (1913) showed that Sn = o(n ) k=1 ! P a.s. X for any ǫ > 0. A year later, Hardy and Littlewood (1914) showed that S = n (2m)! j1! j2! jr! m! 2 j1 2 j2 2 jr 0.5 = ··· x x x O((n log n) ). Steinhaus (1922) improved the constant of Hardy and Littlewood (2j )! (2j )! (2j )! m! j ! j ! j ! k1 k2 kr a.s. 1 2 r × 1 2 r ··· Sn ··· ··· by showing that limsup 1. A year later, Khintchin settled the upper X  j1  j2  jr n √2n log n m! x2 x2 x2 ≤ (2m)! j1! j2! jr! k1 k2 kr bound issue by proving the following result. Just as Kolmogorov proved his famous sup ··· ··· r,j ,j ,··· ,jr ≥1 (2j )! (2j )! (2j )! m! j ! j ! j ! ≤ 1 2 1 2 r 1 2  r  inequality to derive his famous results, Khintchin also discovered an inequality for J1+j2 +···+jr =m ··· X ··· his results and it has found many other uses and generalizations. We start off with n m 1 (2m)! j1! j2! jr! 2 his inequality. = sup ··· xk . r,j1,j2,··· ,jr ≥1 (2j1)! (2j2)! (2jr)! m! × ! J1+j2 +···+jr =m ··· k=1 Proposition - 20.0.4 - (Khintchin’s inequality) Let R = 2U 1, where U , U , X k k − 1 2 U , iid B(1, 1 ). Then for any real numbers x ,x , ,x , we have To get an upper bound on the “sup” term, just note that 3 ··· ∼ 2 1 2 ··· n n m n 2m n m r (2m)! j1! j2! jr! (2m)! jℓ! 2 (2m)! 2 = A2m x E xkRk x , m = 1, 2, . ··· k ≤ ≤ 2m m! k ··· (2j1)! (2j2)! (2jr)! m! m! (2jℓ)! k=1 ! k=1 ! k=1 ! ··· ℓ=1 X X X Yr (2m)! 1 for a constant Am that depends only on m. = m! (jℓ + 1)(jℓ + 2) (2jℓ) ℓ=1 ··· Proof: We will only prove the upper inequality since thats what we need. The Yr proof uses critically the facts that R1,R2, are independent and (2m)! 1 ··· ≤ m! (1 + 1)(2) (2) 2m+1 2m ℓ=1 E(R ) = 0, E(Rk ) = 1, m = 1, 2, . ··· k Yr ··· (2m)! 1 Just note that = m! 2jℓ n 2m ℓ=1 (2m)! Y E x R = E xj1 xj2 xjr Rj1 Rj2 Rjr , (2m)! 1 k k k1 k2 kr k1 k2 kr = ! j1! j2! jr! ··· ··· m! 2j1 +j2+ +jr Xk=1 X ···  ··· 1 (2m)! A. Khintchin (1923), Uber¨ dyadische Br¨uche, Math. Z. vol. 18, pp. 109-116. = , when j1 + j2 + + jr = m. m! 2m ··· The Law of Iterated Logarithms 191 192 The Law of Iterated Logarithms
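For a small number of terms the bound in Khintchin's inequality can be verified exactly, since the expectation is a plain average over the $2^n$ sign patterns. The Python sketch below does this for an arbitrary test vector $x$ and $m = 3$, checking $(\sum_k x_k^2)^m \le E(\sum_k x_k R_k)^{2m} \le \frac{(2m)!}{2^m m!}(\sum_k x_k^2)^m$.

\begin{verbatim}
from itertools import product
from math import factorial

x = [0.5, -1.2, 2.0, 0.3, -0.7, 1.5, -0.4, 0.9]   # arbitrary test vector
m = 3                                              # check the 2m-th moment, 2m = 6

# Exact expectation: R_k = +-1 with probability 1/2 each, independent,
# so average the 2m-th power of the weighted sum over all sign patterns.
moment = sum(sum(s * xi for s, xi in zip(signs, x)) ** (2 * m)
             for signs in product((-1, 1), repeat=len(x))) / 2 ** len(x)

s2 = sum(xi * xi for xi in x)
lower = s2 ** m
upper = factorial(2 * m) / (2 ** m * factorial(m)) * s2 ** m

print(f"(sum x_k^2)^m                = {lower:.3f}")
print(f"E(sum x_k R_k)^(2m)  (exact) = {moment:.3f}")
print(f"(2m)!/(2^m m!) (sum x_k^2)^m = {upper:.3f}")
\end{verbatim}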

This finishes the proof. Since √c < c, this gives that for all large k, we have Lck cLck−1 . We consider all ♠ such k’s only. For such k’s we see that ≤ We can bypass the use of the following inequality, however, we have elected this route to highlight a use of Khintchin’s inequality. The proof of this inequality is Sn based on the proof of Kolmogorov’s inequality. We leave its proof for the reader Ek = sup | | > (1 + ε) (ck−1 c p p p ck−1 1. ( ≤ ) 1 k n | | ≤ p 1 | |  ≤ ≤   −  S sup n > c2 p p | | ⊆ k−1 k L k−1 Note that ( p 1 ) 4 for any p 2. (c c Lck−1 iid 1 ⊆ Theorem - 20.0.4 - (Khintchin’s LIL, 1923) Let U1, U2, , B(1, ) and let 2 S∗k >cL k =: Gk. R = 2U 1 be the (so called) Rademacher functions. If S···= ∼n R then ⊆ { c c } k k − n k=1 k It will be enough if we show that S P limsup n 1, a.s. n √2n log log n ≤ ∞ P (G ) < , for every ǫ > 0, k ∞ k=1 Proof: Let Ln := √2n log log n. The way we will prove this statement is by X showing that the contrary holds only with probability zero. In other words, we will since then the Borel-Cantelli lemma will give that P (Eki.o.) P (Gki.o.) = 0 for show that ≤ S every ε > 0. Note that, for any real a, P limsup | n| > 1 = 0. aS∗ aS∗ n Ln ∗ e n + e− n   E eaSn 2E But this is equivalent to proving that for any ε > 0 we show that ≤ 2     2 2 4 4 6 6 a E(S∗) a E(S∗) a E(S∗) Sn = 2 1 + n + n + n + P limsup | | > (1 + ε) = 0. 2! 4! 6! ··· n Ln     2 2 4 4 6 6 a E(Sn) a E(Sn) a E(Sn) 2 8 1 + + + + Note that this in turn implies that we need only show that for c := 1+ ε so that ≤ 2! 4! 6! ··· 1 < c < 1 + ε,   ∞ 2m m Sn a (2m)!Bn P limsup | | > c = 0. 8 1 + , Khintchin with Bn = n, ≤ (2m)! 2m m! n Ln m=1 !   X Define ∞ (a2n/2)m 8 1 + n n ≤ m! m=1 ! Sn := Rk, Bn = V ar(Sn) = 1 = n, Sn∗ := max Sk . X 1 k n | | a2 n/2 ≤ ≤ = 8e . Xk=1 Xk=1 Note that log log n is positive for all n greater than some n0, which we will take to So, by Markov’s inequality, for any a > 0 we have k be the case. Also, consider all large k>n0 so that c is strictly increasing (which aS∗ happens after some point n ) and define E ck 2 k 00 e ea c /2 P (Gk) acL 8 acL . ≤ e ck  ≤ e ck ck Sn Sn E := k−1 > (1 + ε) = sup > (1 + ε) . k n=c +1 | | | | k ∪ Ln ck−1 0. Now take a = cLck /c and get   ( ≤ ) 2 2 k 2 2 k k k 1 c (L k ) /(2c ) c (L k ) /c Note that c = c c − and P (Gk) 8e c e− c ≤ 2 2 k 1/2 c (L k ) /(2c ) k k = 8 e− c Lck c log log c log log c + log(k) = √c √c. c2 log log(ck) Lck−1 ck 1 log log ck 1 ≤ log log c + log(k 1) → = 8 e− p− −  −  p The Law of Iterated Logarithms 193 194 The Law of Iterated Logarithms
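A law of the iterated logarithm concerns a limsup, so a simulation can only hint at it, but the scaling $\sqrt{2n\log\log n}$ is already visible on moderately long paths. The Python sketch below (path length and seed are arbitrary choices) tracks $|S_n|/\sqrt{2n\log\log n}$ along a few independent Rademacher paths; the running maxima typically sit near, and usually somewhat below, $1$.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(7)
N = 1_000_000

for path in range(3):
    steps = 2 * rng.integers(0, 2, size=N) - 1        # Rademacher steps 2 U_k - 1
    s = np.cumsum(steps)
    n = np.arange(10, N + 1)                          # start at n = 10 so loglog n > 0
    ratio = np.abs(s[9:]) / np.sqrt(2 * n * np.log(np.log(n)))
    print(f"path {path}: max of |S_n|/sqrt(2 n loglog n) over 10 <= n <= {N}"
          f" = {ratio.max():.3f}")
\end{verbatim}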

\[
= 8\, e^{-c^2 \log(k\,\log(c))}
\]

Within a year,$^2$ Khintchin was able to settle the whole problem by showing that actually equality holds in his earlier result.

Theorem - 20.0.5 - (Khintchin's LIL, 1924) Let $U_1, U_2, \cdots \overset{iid}{\sim} B(1,\frac{1}{2})$ and let $R_k = 2U_k-1$ be the (so called) Rademacher functions. If $S_n = \sum_{k=1}^n R_k$ then
\[
\limsup_n \frac{|S_n|}{\sqrt{2n\log\log n}} = 1, \quad a.s.
\]

Proof: By Khintchin's LIL applied to $-(2U_k-1)$, we see that
\[
-\liminf_n \frac{S_n}{\sqrt{2n\log\log n}}
= \limsup_n \frac{-S_n}{\sqrt{2n\log\log n}}
\le \limsup_n \frac{|S_n|}{\sqrt{2n\log\log n}} \le 1, \quad a.s.
\]
Therefore, for all but finitely many $n$, we can say that
\[
S_n \ge -2\sqrt{2n\log\log n}, \quad a.s.
\]
Let $A_1$ be a set with $P(A_1)=1$ such that the above result holds for all $\omega\in A_1$. Now we modify the argument of Khintchin's proof in the reverse direction, with $(1+\varepsilon)$ replaced by $(1-\varepsilon)$, hoping to use Borel-Cantelli-II, for which we need independent events. So, the old events $E_k$ or $G_k$ in Khintchin's proof won't cut it. Instead, for a constant $0<\beta<1$, yet unspecified, consider the events
\[
D_k := \Big\{S_{n_k} - S_{n_{k-1}} > \beta\sqrt{2(n_k-n_{k-1})\log\log(n_k-n_{k-1})}\Big\}, \qquad k=2,3,\cdots,
\]
where we take $n_k = c^k$, and $c>1$ is still unspecified. Note that $D_2, D_3, \cdots$ are independent events. For a moment assume that we can prove that $\sum_k P(D_k) = +\infty$. Then, by Borel-Cantelli-II, it must be that $P(D_k\ i.o.) = 1$. So, there exists a set $A_2$ with $P(A_2)=1$ such that for each $\omega\in A_2$ infinitely many $D_k$ occur. Hence, over $A_1\cap A_2$, which is an event with probability one, we see that
\begin{align*}
S_{n_k} &\ge S_{n_{k-1}} + \beta\sqrt{2(n_k-n_{k-1})\log\log(n_k-n_{k-1})}\\
&\ge -2\sqrt{2n_{k-1}\log\log n_{k-1}} + \beta\sqrt{2(n_k-n_{k-1})\log\log(n_k-n_{k-1})}\\
&\ge \sqrt{2n_k\log\log n_k}\,\left(-\frac{2}{\sqrt{c}} + \beta\sqrt{\Big(1-\frac{1}{c}\Big)\,\frac{\log\log(c^{k-1}(c-1))}{\log k + \log\log c}}\right)\\
&\ge \sqrt{2n_k\log\log n_k}\,\left(-\frac{2}{\sqrt{c}} + \beta\sqrt{\Big(1-\frac{\varepsilon}{2}\Big)\Big(1-\frac{1}{c}\Big)}\right),
\end{align*}
for all $k$ large enough so that $\frac{\log\log(c^{k-1}(c-1))}{\log k + \log\log c} \ge \big(1-\frac{\varepsilon}{2}\big)$. Now we fix the constants $\beta\in(0,1)$ and $c>1$, which are picked so that
\[
-\frac{2}{\sqrt{c}} + \beta\sqrt{\Big(1-\frac{\varepsilon}{2}\Big)\Big(1-\frac{1}{c}\Big)} > 1-\varepsilon.
\]
This is certainly possible when $\beta$ is close enough to 1 and $c$ is large enough. Because of this we see that $|S_n| \ge S_n > (1-\varepsilon)\sqrt{2n\log\log n}$ infinitely often with probability one. So, we are only left with the task of showing that $\sum_k P(D_k) = +\infty$ for any choice of $0<\beta<1$ and any $c>1$. There are several techniques for finding a lower bound of the probability $P\big(\sum_{k=n+1}^m (2U_k-1) > t\big)$ when $m>n$, but none is trivial. After proving the central limit theorem we will use it to derive this result. See Exercise (20.0.10) for the normal case. ♠

HW47 Exercise - 20.0.8 - (An extension of H\'ajek-R\'enyi inequality) Let $S_n = X_1 + X_2 + \cdots + X_n$, $n\ge 1$, be the sequence of partial sums of independent random variables having mean zero and finite $p$-th moment, where $p>1$ is a fixed constant. Let $c_n$ be a sequence of non-increasing positive numbers. Then for any $n>m$ we have
\[
P\Big(\max_{m\le k\le n} c_k|S_k| \ge \varepsilon\Big)
\le \frac{1}{\varepsilon^p}\Big(c_n^p\,E|S_n|^p + \sum_{k=m}^{n-1}(c_k^p - c_{k+1}^p)\,E|S_k|^p\Big).
\]
For $c_k = 1$, if $S_n^* = \max_{1\le k\le n}|S_k|$, then show that
\[
P(S_n^* > \varepsilon) \le \frac{1}{\varepsilon^p}\,E\big(|S_n|^p\,\chi_{\{S_n^*>\varepsilon\}}\big),
\qquad
E(S_n^*)^p \le \Big(\frac{p}{p-1}\Big)^p E\big(|S_n|^p\big).
\]

HW48 Exercise - 20.0.9 - Show that $E(X^{2m}) = \frac{(2m)!}{2^m\,m!}$ when $X\sim N(0,1)$. Furthermore, for any $t>0$ prove the following inequalities:
\[
\frac{t}{1+t^2}\,e^{-t^2/2} \;\le\; \int_t^\infty e^{-u^2/2}\,du \;\le\; \frac{1}{t}\,e^{-t^2/2}.
\]

HW49 Exercise - 20.0.10 - (LIL for the normal case) Prove the law of iterated logarithm when $R_i \overset{iid}{\sim} N(0,\sigma^2)$, namely
\[
\limsup_{n\to\infty} \frac{S_n}{\sqrt{2n\log\log n}} \overset{a.s.}{=} \sigma,
\qquad
\liminf_{n\to\infty} \frac{S_n}{\sqrt{2n\log\log n}} \overset{a.s.}{=} -\sigma,
\]
by proving the following steps.

(i) Explain why it is sufficient to assume that $\sigma = 1$.

(ii) Show that Khintchin's proof of the upper inequality goes verbatim, without using Khintchin's inequality, after proving the fact that $S_n/\sqrt{n}$ is distributed as a standard normal random variable and using its moments as given in Exercise (20.0.9).

(iii) Show that Khintchin's lower bound proof goes almost verbatim as well, after proving the fact that $S_m - S_n = \sum_{i=n+1}^m R_i \sim N(0, m-n)$ and using the lower inequality for $P(S_m - S_n > t\sqrt{m-n})$ from Exercise (20.0.9).

$^2$ Khintchin, A. (1924), Über einen Satz der Wahrscheinlichkeitsrechnung, Fund. Math., vol. 6, pp. 9-20.
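Before proving the tail inequalities in Exercise (20.0.9), it can be reassuring to check them numerically. The sketch below is an added illustration, not part of the original notes; it uses the identity $\int_t^\infty e^{-u^2/2}\,du = \sqrt{\pi/2}\;\mathrm{erfc}(t/\sqrt{2})$ and a handful of arbitrarily chosen test points.

```python
# Numerical sanity check (illustration only, not a proof) of the bounds in
# Exercise (20.0.9):
#   t/(1+t^2) * exp(-t^2/2)  <=  int_t^inf exp(-u^2/2) du  <=  (1/t) * exp(-t^2/2).
# The middle integral equals sqrt(pi/2) * erfc(t/sqrt(2)).
import math

for t in (0.5, 1.0, 2.0, 3.0, 5.0):          # arbitrary test points
    tail = math.sqrt(math.pi / 2) * math.erfc(t / math.sqrt(2))
    lower = t / (1 + t * t) * math.exp(-t * t / 2)
    upper = (1 / t) * math.exp(-t * t / 2)
    assert lower <= tail <= upper
    print(f"t={t:3.1f}  lower={lower:.5e}  integral={tail:.5e}  upper={upper:.5e}")
```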

Remark - 20.0.3 - (LIL of Kolmogorov, 1929) Kolmogorov then extended Khintchin's result by allowing nonidentically distributed random variables into the picture and proved the following version of the LIL.$^3$ Let $R_1, R_2, \cdots$ be a sequence of independent random variables with mean zero and $Var(R_k) = \sigma_k^2$. Let $S_n = \sum_{k=1}^n R_k$ and $B_n^2 = Var(S_n) = \sum_{k=1}^n \sigma_k^2$. If the random variables $R_k$ are bounded with bound
\[
|R_n| = o\Big(\frac{B_n}{\sqrt{\log\log B_n}}\Big), \tag{0.1}
\]
then the LIL holds (using $R_k$ and $-R_k$),
\[
\limsup_n \frac{S_n}{\sqrt{2B_n^2\log\log B_n^2}} \overset{a.s.}{=} 1,
\qquad
\liminf_n \frac{S_n}{\sqrt{2B_n^2\log\log B_n^2}} \overset{a.s.}{=} -1.
\]
Or, stating both parts together,
\[
\limsup_n \frac{|S_n|}{\sqrt{2B_n^2\log\log B_n^2}} \overset{a.s.}{=} 1.
\]
Marcinkiewicz and Zygmund$^4$ showed that the condition (0.1) cannot be replaced by a big "O" version of it.

Remark - 20.0.4 - (LIL of Hartman and Wintner, 1941) Hartman and Wintner showed that Kolmogorov's condition (0.1) could indeed be relaxed considerably under the added assumption that the random variables $R_1, R_2, \cdots$ are iid. In fact they proved the following result for the non-iid case:$^5$ Let $R_1, R_2, \cdots$ be a sequence of independent random variables with mean zero and $Var(R_k) = \sigma_k^2$. Let $S_n = \sum_{k=1}^n R_k$ and $B_n^2 = Var(S_n) = \sum_{k=1}^n \sigma_k^2$ be such that $\frac{B_n^2}{n} > c$ for some $c>0$. Then
\[
\limsup_n \frac{S_n}{\sqrt{2B_n^2\log\log B_n^2}} = 1, \quad a.s.,
\]
provided that the $R_k$ are random variables whose distributions have a common (uniform) tail bound by another random variable $Y$ which has finite variance:
\[
\sup_n P(|R_n| \ge r) = O\big(P(|Y| \ge r)\big), \quad \text{as } r\to\infty.
\]
A new proof of this result was provided by Alejandro de Acosta.$^6$

Remark - 20.0.5 - (Fine tuning of LIL by Slivka) John Slivka was a Ph.D. student of Norman C. Severo at the State University of New York at Buffalo. In his Ph.D. dissertation he considered the following type of fine tuning. As seen in the above discussion, if $S_n = \sum_{k=1}^n R_k$, where the $R_k$ are independent with mean zero, $B_n = \sum_{k=1}^n Var(R_k)$, and we let $b_\varepsilon(n) = (1+\varepsilon)(2B_n\log\log B_n)^{1/2}$ for $n\ge 3$, the logarithm having base $e$, then the event $\{S_n > b_\varepsilon(n)\}$ occurs only finitely many times, almost surely. Hence, we may define a sequence of 0's and 1's by
\[
Y_n(\varepsilon) := \begin{cases} 1 & \text{if } S_n > b_\varepsilon(n),\\ 0 & \text{otherwise.}\end{cases}
\]
So, for any $\varepsilon>0$ we see that $N(\varepsilon) := \sum_{n=1}^\infty Y_n(\varepsilon)$ is a finite function (a random variable). Slivka proved that whenever the LIL holds, then for any $\varepsilon>0$ it so happens that
\[
E\big(N(\varepsilon)\big) = +\infty.
\]
In fact, he showed something even more remarkable. He proved that for any $\lambda>0$,
\[
E\big(N(\varepsilon)^\lambda\big) = +\infty.
\]
This result came out in 1969.$^7$

Remark - 20.0.6 - (The current state of affairs) Literally to this day, various improvements and variants of the LIL keep appearing in the literature in different settings. No doubt finer details will appear in the future. One finely tuned result is an improvement of Kolmogorov's theorem due to Feller. It goes as follows. Let $S_n = X_1 + X_2 + \cdots + X_n$, where the $X_i$ are independent random variables with $E(X_i) = 0$ and $\sigma_i^2 = Var(X_i)$. Let $B_n^2 = \sigma_1^2 + \cdots + \sigma_n^2$. For any increasing sequence of positive numbers $\phi_n$, the following results hold:

(a) If $\sum_k \frac{\phi_k}{k}\,e^{-\phi_k^2/2} < \infty$ then $P(S_n > B_n\phi_n\ i.o.) = 0$.

(b) If $\sum_k \frac{\phi_k}{k}\,e^{-\phi_k^2/2} = \infty$ then $P(S_n > B_n\phi_n\ i.o.) = 1$.
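As a rough numerical illustration of Feller's dichotomy (added here, not part of the original notes), the sketch below computes partial sums of $\sum_k (\phi_k/k)\,e^{-\phi_k^2/2}$ for the trial choice $\phi_k = \sqrt{(2+\delta)\log\log k}$, whose terms behave like $1/(k(\log k)^{1+\delta/2})$ up to a slowly varying factor, so the series converges for $\delta>0$ and diverges for $\delta\le 0$. The cutoff K and the starting index k0 are arbitrary, and finite partial sums can only hint at the dichotomy, not establish it.

```python
# Illustration only (added sketch, not from the notes): partial sums of Feller's
# series sum_k (phi_k / k) * exp(-phi_k**2 / 2) for the trial choice
# phi_k = sqrt((2 + delta) * log log k).  Terms behave roughly like
# 1 / (k * (log k)**(1 + delta/2)), so the series converges for delta > 0
# and diverges for delta <= 0.  K and k0 are arbitrary cutoffs.
import math

def feller_partial_sum(delta, K=1_000_000, k0=10):
    total = 0.0
    for k in range(k0, K + 1):
        phi = math.sqrt((2 + delta) * math.log(math.log(k)))
        total += (phi / k) * math.exp(-phi * phi / 2)
    return total

for delta in (-0.5, 0.0, 0.5):
    print(f"delta = {delta:+.1f}:  partial sum up to K = 10^6 is "
          f"{feller_partial_sum(delta):.3f}")
```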

$^3$ A. Kolmogorov (1929), Über das Gesetz des iterierten Logarithmus, Math. Annalen, vol. 101, pp. 126-135.
$^4$ Marcinkiewicz and Zygmund (1937), Remarque sur la loi du logarithme itéré, Fund. Math., vol. 29, pp. 215-222.
$^5$ Philip Hartman and Aurel Wintner (1941), On the law of the iterated logarithm, Amer. J. Math., vol. 63, no. 1, pp. 169-176.
$^6$ Alejandro de Acosta (1983), A New Proof of The Hartman-Wintner Law of the Iterated Logarithm, Ann. Prob., vol. 11, no. 2, pp. 270-276.
$^7$ John Slivka (1969), On the Law of The Iterated Logarithm, Proc. National Academy of Sci., vol. 63, pp. 289-291.