
Distributions

Connexions module m43336

Zdzisław (Gustav) Meglicki, Jr, Office of the VP for Information Technology, Indiana University. RCS: Section-1.tex,v 1.78 2012/12/17 16:29:57 gustav Exp

Copyright © 2012 by Zdzisław Meglicki. December 17, 2012

Abstract. We introduce the concept of a probability distribution and its characterizations in terms of moments and averages. We present examples and discuss probability distributions on multidimensional spaces; this also includes marginal and conditional distributions. We discuss and prove some fundamental theorems about probability distributions. Finally, we illustrate how random variables associated with various probability distributions can be generated on a computer.

Contents

1 Random Variables and Probability Distributions

2 Characteristics of Probability Distributions: Averages and Moments

3 Examples of Probability Distributions
3.1 Uniform Distribution
3.2 Exponential Distribution
3.3 Normal (Gaussian) Distribution
3.4 Cauchy-Lorentz Distribution

4 Probability Distributions on Rⁿ: Marginal and Conditional Distributions

5 Variable Transformations
5.1 Application to Gaussian Distributions
5.2 Application to Cauchy-Lorentz Distributions
5.3 Cumulative Probability Distribution Theorem
5.4 Linear Combination of Random Variables
5.5 Covariance and Correlation
5.6 Central Limit Theorem
5.7 Computer Generation of Random Variables


1 Random Variables and Probability Distributions

Variables in mathematics. A variable in mathematics is an argument of a function. The variable may assume various values (hence the name) within its domain, to which the function responds by producing the corresponding values, which usually reside in a set different from the domain of the function's argument. Using a formal notation, we may describe this as follows:

X \ni x \mapsto f(x) = y \in Y.  (1)

Here X is the function's domain, x ∈ X is the function's variable, f is the function itself, and y is the value that the function returns for a given x; it belongs to Y, the set of function values. Another way to describe this is

f : X \to Y.  (2)

Neither of the above specifies what the function f actually does. The formulas merely state that f maps elements of X onto elements of Y. For a mapping to be called a function, the mapping from x to y must be unique. But this requirement is not adhered to too strictly, and we do work with multivalued functions too.

Variables in physics. A variable in physics is something that can be measured, for example, the position of a material point, or temperature, or mass. How does a physics variable relate to a variable in mathematics? Depending on the position of the material point x, if the point is endowed with an electric charge qe and some externally applied electromagnetic field E is present, then the force that acts on the point will be a vector-valued function of its position:

F(x) = qeE(x). (3)

We read this as follows: the material point is endowed with electric charge qe and is located at x. The position of the point is the variable here (it is actually a vector variable in this case, but we may also think of it as three variables). The electric field E happens to have the vector value E(x) at this point. It couples to the point's charge, and in effect the force F(x), also a vector, is exerted upon the point. The variable x may itself be the value of another function, perhaps a function of time t. We may then write x = f(t), or just x(t) for short, in which case

F(x(t)) = qeE(x(t)).  (4)

A random variable is a physics variable that may assume different values when measured, with a certain probability assigned to each possible value. We may think of it as an ordered pair

(X, P : X \ni x \mapsto P(x) \in [0, 1])  (5)

where X is a domain and P(x) is the probability of the specific value x ∈ X occurring. The probability is restricted to real values within the line segment between 0 and 1, which is what [0, 1] denotes. This should not be confused with {0, 1}, which is a set of two elements, 0 and 1, with nothing in between.

We will often refer to a specific instance of (5), (x, P(x)) for short, also calling it a random variable. Otherwise, x can be used like a normal physics variable in physics expressions. However, its association with probability carries forward to anything the variable touches, meaning that when used as an argument in successive functions, it makes the functions' outcomes random too. And so, if, say, (x, P(x)) is a random variable, then, for example, (E(x), P(x)) also becomes a random variable, although the resulting probability in (E, PE(E)) is not the same as P(x). There are ways to evaluate PE(E), about which we'll learn more in Section 5.

Random variables and probability distributions. A set of all pairs (x, P(x)) is the same as P : X ∋ x ↦ P(x) ∈ [0, 1], because we can understand a function as a certain relation, and a relation is a set of pairs—this is one of the definitions of a function. So a theory of random variables is essentially a theory of probability distributions P(x), their possible transformations, evolutions and ways to evaluate them. And random variables themselves are merely arguments of such probability distributions. The notation used in the theory of Markov processes, as well as the concepts, can sometimes be quite convoluted—and it can get even worse when mathematicians lay their hands on it—and it will help us at times to get down to earth and remember that a random variable is simply a measurable quantity x that is associated with a certain probability of occurring, P(x).

Random variables and the mathematicians. The formal mathematical definition of a random variable is that it is a function between a probability space and a measurable space, the probability space being a triple (Ω, E, P), where Ω is a sample space, E is a set of subsets of Ω called events (it has to satisfy a certain list of properties to be such), and P is a probability measure (P(x) dx can be thought of as a measure in our context, assuming that X is a continuous domain). A space is said to be measurable if there exists a collection of its subsets with certain properties. The reason why this formal mathematical definition is a bit opaque is that the process of repetitive measurements that yields probability densities for various values of a measured quantity is quite complex. It is intuitively easy to understand—once you've carried out such measurements yourself—but not so easy to capture in the well-defined mathematical structures that mathematicians like to work with. In particular, mathematicians do not like the frequency interpretation of probability and prefer to work with the Bayesian concept of it. It is easier to formalize.

Probability density (or distribution). In the following we will be interested in variables that are sampled from a continuous domain, x ∈ R̄ = [−∞, +∞] (the bar over R means that we have compactified R by adding the infinities to it—sometimes we may be interested in what happens when x → ∞; in particular, our integrals will normally run from −∞ to +∞). The associated probabilities will then be described in terms of probability densities such that

P (x) ≥ 0 everywhere (6)

and

\int_{-\infty}^{+\infty} P(x)\,dx = 1,  (7)

the probability of finding x between, say, x1 and x2 being

P(x \in [x_1, x_2]) = \int_{x_1}^{x_2} P(x)\,dx \in [0, 1].  (8)

Equation (7) then means that x has to be somewhere between −∞ and +∞, which is a trivial observation. Nevertheless, the resulting condition, as imposed on P(x), is not so trivial and has important consequences. Conversely, a function that satisfies (6) and (7) can always be thought of as a probability density. If P1(x) and P2(x) are two probability densities and if they differ on more than a set of measure zero, then (x, P1(x)) and (x, P2(x)) are two different random variables. Here we assume that P1 and P2 are normal, well defined functions, not the so-called generalized functions about which we'll say more later.

Cumulative probability distribution. Once we have a P(x) we can construct another function out of it, namely

D(x) = \int_{-\infty}^{x} P(x')\,dx'.  (9)

This function is called the distribution of x, or, more formally, the distribution of (x, P(x)), thus emphasizing that it is a function of x and a functional of P, hence a property of the random variable (x, P(x)). Oftentimes people call P(x) a distribution (or a continuous distribution) too—physicists especially—and there is nothing that can be done about this. This nomenclature is entrenched. In this situation D(x) would be called a cumulative distribution function.

The cumulative distribution D(x) is used, for example, in computer generation of random numbers with arbitrary (not necessarily uniform) distributions, so it is a useful and important function. Still, in the following I will call P(x) a probability distribution, too, and this will tie nicely with the Schwartz theory of distributions about which I'll say more later.

2 Characteristics of Probability Distributions: Averages and Moments

Average. The average value of f(x) on the statistical ensemble described by the random variable (x, P(x)) is given by

\langle f(x)\rangle = \int_{-\infty}^{+\infty} f(x)\,P(x)\,dx.  (10)

It is easy to see why. For a given x_i, the value returned by f is f(x_i). If we were to sample N times (N → ∞) the various x-es, evaluating their f(x), the value x_i would pop up N P(x_i) dx times. Therefore their contribution to the total sum of f(x)-es would be N f(x_i) P(x_i) dx. The total sum of f(x)-es obtained in the sampling would therefore be

N\int_{-\infty}^{+\infty} f(x)\,P(x)\,dx,  (11)

their average is obtained by dividing by the number of samples N, thus leading to (10).

Mean. In particular, the average of x itself (also called the mean of x) is

\langle x\rangle = \int_{-\infty}^{+\infty} x\,P(x)\,dx.  (12)

Moments. Averages of higher powers of x are called moments. The average of x itself, ⟨x⟩, is the first moment. The second moment is

\langle x^2\rangle = \int_{-\infty}^{+\infty} x^2\,P(x)\,dx,  (13)

and so on. The zeroth moment is

\langle x^0\rangle = \int_{-\infty}^{+\infty} x^0\,P(x)\,dx = \int_{-\infty}^{+\infty} 1\cdot P(x)\,dx = 1,  (14)

the normalization itself. Although we may write hxni on paper, the integral that is implied may Existence of moments not exist. This depends on how quickly P (x) dies out. For some com- monly used probability distributions, e.g., Gaussian, all moments (assum- ing a finite n) may exist, for other distributions, e.g., the Cauchy-Lorentz distribution (most physicists know it as Lorentz bell curve encountered in the theory of resonance), none exist other than the zeroth one. The Cauchy-Lorentz random variable does not even have the average value defined! This has striking ramifications regarding data processing. If a particular process is described by the Cauchy-Lorentz distribution, aver- aging the results of measurements for the process makes no sense. What happens when we sample data so distributed is that the running average does not converge. It continues to jump all over the place. If all finite moments of a given probability distribution P (x) do exist the average of any f of the random variable x can be expressed using the moments, namely

\langle f(x)\rangle = \int_{-\infty}^{+\infty} f(x)\,P(x)\,dx
= \sum_{n=0}^{\infty}\frac{1}{n!}\left.\frac{d^n f(x)}{dx^n}\right|_{x=0}\int_{-\infty}^{+\infty} x^n P(x)\,dx
= \sum_{n=0}^{\infty}\frac{1}{n!}\left.\frac{d^n f(x)}{dx^n}\right|_{x=0}\langle x^n\rangle.  (15)

It is normally easier to just evaluate the integral than the infinite sum involving derivatives of f(x), but the above expression is sometimes useful in derivations and approximations. The average value of x − ⟨x⟩ is, of course, zero. This is because there is as much of x on the left side of ⟨x⟩ as there is on the right side, weighed

by P (x). We can see this by direct computation too:

\langle x - \langle x\rangle\rangle = \int_{-\infty}^{+\infty}\left(x - \langle x\rangle\right)P(x)\,dx
= \int_{-\infty}^{+\infty} x\,P(x)\,dx - \langle x\rangle\int_{-\infty}^{+\infty} P(x)\,dx
= \langle x\rangle - \langle x\rangle = 0.  (16)

But the average value of (x − ⟨x⟩)² is not zero, except in one particular case that is somewhat pathological and about which we'll say more later. It is not zero because this time the contributions from the left and the right side of ⟨x⟩ do not subtract. This quantity is called the variance and we are going to refer to it as σ². The reason for this is that the square root of the variance is called the standard deviation, and people who measure things (physicists and engineers especially) are intimately familiar with it:

\mathrm{var}(x) = \sigma^2(x) = \langle (x - \langle x\rangle)^2\rangle.  (17)

It is easy to express the variance of x in terms of the second and first moments of x:

\sigma^2(x) = \langle (x - \langle x\rangle)^2\rangle
= \int_{-\infty}^{+\infty}(x - \langle x\rangle)^2 P(x)\,dx
= \int_{-\infty}^{+\infty}\left(x^2 - 2x\langle x\rangle + \langle x\rangle^2\right)P(x)\,dx
= \langle x^2\rangle - 2\langle x\rangle^2 + \langle x\rangle^2 = \langle x^2\rangle - \langle x\rangle^2.  (18)

Because the first integral in the evaluation above is manifestly positive (or zero), we find that

Observation 2.1.  \langle x^2\rangle \ge \langle x\rangle^2.  (19)

It's an important inequality that's used in many derivations and proofs in random variable theories.

Sure variable. When is the variance of x zero? Let us observe first that (x − ⟨x⟩)² is positive for all x with the exception of x = ⟨x⟩. Because P(x) is positive or zero too, in order for \int_{-\infty}^{+\infty}(x - \langle x\rangle)^2 P(x)\,dx to be zero, P(x) must be zero everywhere where (x − ⟨x⟩)² isn't zero, that is, everywhere with the exception of x = ⟨x⟩. But the Lebesgue measure theory tells us that if P(x) is non-zero at one point only, that is, on a set of measure zero, then the integral \int_{-\infty}^{+\infty} P(x)\,dx must be zero, so the resulting P(x) cannot be a probability distribution. This is where the pathology creeps in.

Sure variables and the Dirac delta. Within the body of the random variable theory we cannot describe a variable that is guaranteed to have a certain value at all times! Such a variable is called a sure variable. In order to include sure variables into

our theory, we have to describe them in terms of the Dirac delta function, which is not a real function. It is a distribution (as in the Schwartz theory of distributions, not as defined by (9)), otherwise also known as a generalized function (this is Sobolev's terminology). A distribution is a functional that maps functions onto numbers. An integral is an example of a Schwartz distribution, and an informal integral of a Dirac delta is another example. In summary, admitting Dirac deltas to our arsenal of tools, x is a sure variable if it is described by the pair

(x, δ(x − x0)) (20)

where δ(x) is the Dirac delta for which we have

\delta(x) = 0 \quad\text{if } x \ne 0,  (21)

and δ(0) = +∞ (22) where the infinity is “so large” that it defeats the measure theory and

\int_{-\infty}^{+\infty}\delta(x)\,dx = 1.  (23)

Variance vanishes for sure variables only. To conclude, we can now say that if the variance of x vanishes, then x is a sure variable, the probability distribution function of which is δ(x − x₀), where x₀ = ⟨x⟩. The reverse is also true: if x is a sure variable, then its variance vanishes. This can be demonstrated trivially by evaluating the variance of (x, δ(x − x₀)). Ipso facto, for real random variables that aren't sure, their variance is always positive.
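Before moving on to examples, these characteristics are easy to probe numerically. The following sketch (assuming Python with NumPy, which this module does not otherwise use, and an arbitrarily chosen Gaussian as the test distribution) estimates the mean, the second moment and the variance from samples, and checks Observation 2.1:

    # Monte Carlo check of the mean, the second moment, the variance
    # sigma^2 = <x^2> - <x>^2 (18), and the inequality <x^2> >= <x>^2 (19),
    # using a Gaussian with mean 1 and standard deviation 2 as a test case.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=1.0, scale=2.0, size=1_000_000)

    mean = x.mean()                 # should approach <x> = 1
    second_moment = (x**2).mean()   # should approach <x^2> = 1 + 4 = 5
    variance = second_moment - mean**2

    print(f"<x>         = {mean:.4f}   (exact 1.0)")
    print(f"<x^2>       = {second_moment:.4f}   (exact 5.0)")
    print(f"var(x)      = {variance:.4f}   (exact 4.0)")
    print("inequality <x^2> >= <x>^2:", second_moment >= mean**2)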

3 Examples of Probability Distributions

3.1 Uniform Distribution

A simple example of a probability distribution function is the uniform distribution on [x₁, x₂] (x₁ < x₂). The function is zero outside of [x₁, x₂] and 1/(x₂ − x₁) inside. The mean of such a uniform distribution is

\langle x\rangle = \int_{-\infty}^{+\infty} x\,P(x)\,dx
= \frac{1}{x_2 - x_1}\int_{x_1}^{x_2} x\,dx
= \frac{x_2^2 - x_1^2}{2(x_2 - x_1)}
= \frac{(x_2 - x_1)(x_2 + x_1)}{2(x_2 - x_1)} = \frac{x_2 + x_1}{2}.  (24)

We can just as easily evaluate arbitrary (finite) moments of the distribution:

\langle x^n\rangle = \int_{-\infty}^{+\infty} x^n P(x)\,dx
= \frac{1}{x_2 - x_1}\int_{x_1}^{x_2} x^n\,dx
= \frac{x_2^{n+1} - x_1^{n+1}}{(n+1)(x_2 - x_1)}.  (25)

It is useful to take this expression further:

\langle x^n\rangle = \frac{1}{n+1}\,\frac{x_2^{n+1}\left(1 - \left(\frac{x_1}{x_2}\right)^{n+1}\right)}{x_2\left(1 - \frac{x_1}{x_2}\right)}
= \frac{1}{n+1}\,x_2^n\,\frac{\left(\frac{x_1}{x_2}\right)^{n+1} - 1}{\frac{x_1}{x_2} - 1}.  (26)

The expression to the right of 1/(n + 1) looks like the sum of a geometric series, the first element of which is x₂ⁿ and the successive elements of which are constructed by multiplying the first one by x₁/x₂:

x_2^n,\; x_1 x_2^{n-1},\; x_1^2 x_2^{n-2},\;\ldots,\; x_1^n,  (27)

the sum of which can be then written as

\sum_{i=0}^{n} x_1^i x_2^{n-i}.  (28)

N-th moment of the uniform distribution. In summary, the n-th moment of the distribution can be rewritten as

Observation 3.1.
\langle x^n\rangle = \frac{1}{n+1}\sum_{i=0}^{n} x_1^i x_2^{n-i}.  (29)

This form may sometimes be easier to use than (25), though generally not. But it's good to know that we can switch between (25) and (29). This may come in handy.

Variance and standard deviation. In particular, using (29), the second moment of the distribution is

\langle x^2\rangle = \frac{1}{3}\left(x_2^2 + x_1 x_2 + x_1^2\right)  (30)

and so the variance is

\sigma^2 = \langle x^2\rangle - \langle x\rangle^2
= \frac{1}{3}\left(x_2^2 + x_1 x_2 + x_1^2\right) - \frac{1}{4}\left(x_2 + x_1\right)^2
= \frac{1}{12}\left(4x_2^2 + 4x_1 x_2 + 4x_1^2 - 3x_2^2 - 6x_1 x_2 - 3x_1^2\right)
= \frac{1}{12}\left(x_2^2 - 2x_1 x_2 + x_1^2\right) = \frac{1}{12}\left(x_2 - x_1\right)^2.  (31)

Therefore the standard deviation is

\sigma = \frac{x_2 - x_1}{2\sqrt{3}}.  (32)
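These results are easy to verify by sampling. The sketch below (Python with NumPy assumed; the choice of x₁ and x₂ is arbitrary) compares sampled moments of the uniform distribution against (25) and (31):

    # Check the moments of the uniform distribution on [x1, x2]:
    #   <x^n> = (x2^{n+1} - x1^{n+1}) / ((n+1)(x2 - x1))   (25)
    #   var(x) = (x2 - x1)^2 / 12                          (31)
    import numpy as np

    rng = np.random.default_rng(1)
    x1, x2 = 2.0, 5.0
    x = rng.uniform(x1, x2, size=2_000_000)

    for n in (1, 2, 3, 4):
        exact = (x2**(n + 1) - x1**(n + 1)) / ((n + 1) * (x2 - x1))
        print(f"<x^{n}> sampled = {(x**n).mean():10.4f}   exact = {exact:10.4f}")

    print("variance sampled =", x.var(), "  exact =", (x2 - x1)**2 / 12)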

Uniform distribution and the Dirac delta. If we were to name the uniform distribution P_{x_1 x_2}(x), then we could think of the Dirac delta function as

\delta(x - x_1) = \lim_{x_2\to x_1} P_{x_1 x_2}(x).  (33)

Although the value of P_{x_1 x_2}(x) at x = x₁ would become infinity for x₂ → x₁, and so the limit does not really exist, when placed under the integral and multiplied by an arbitrary function, the limit acquires a well defined meaning:

\int_{-\infty}^{+\infty} f(x)\,\delta(x - x_1)\,dx = \lim_{x_2\to x_1}\int_{-\infty}^{+\infty} f(x)\,P_{x_1 x_2}(x)\,dx
= \lim_{x_2\to x_1}\frac{1}{x_2 - x_1}\int_{x_1}^{x_2} f(x)\,dx.  (34)

In the limit, we can approximate the integral with f((x₁ + x₂)/2)(x₂ − x₁), which cancels the 1/(x₂ − x₁) in front of the integral, leaving just

\lim_{x_2\to x_1} f\!\left(\frac{x_1 + x_2}{2}\right) = f(x_1)  (35)

behind. Thus the Dirac delta becomes a conceptual shortcut for this kind of operation.

3.2 Exponential Distribution

The exponential distribution is defined by

P(x) = \begin{cases} a e^{-ax} & \text{for } x \ge 0 \\ 0 & \text{for } x < 0, \end{cases}  (36)

where a > 0.

N-th moment of the exponential distribution. The n-th moment of the distribution is

\langle x^n\rangle = \int_{-\infty}^{+\infty} x^n P(x)\,dx
= \int_{0}^{+\infty} x^n a e^{-ax}\,dx
= \frac{1}{a^n}\int_{0}^{+\infty} (ax)^n e^{-ax}\,d(ax)
= \frac{1}{a^n}\int_{0}^{+\infty} y^n e^{-y}\,dy
= \frac{1}{a^n}\,\Gamma(n+1),  (37)

where Γ(n) is the Euler gamma function. Because Γ(n) = (n − 1)!, we get

\langle x^n\rangle = \frac{n!}{a^n}.  (38)

In particular,

\langle x\rangle = \frac{1}{a},  (39)

and

\sigma^2 = \langle x^2\rangle - \langle x\rangle^2 = \frac{2}{a^2} - \frac{1}{a^2} = \frac{1}{a^2},  (40)

and so

\sigma = \frac{1}{a} = \langle x\rangle.  (41)

(38) is easy to arrive at without invoking the Euler gamma result. Instead, we integrate by parts, which yields

\int_{0}^{+\infty} y^n e^{-y}\,dy = -\int_{0}^{+\infty} y^n\,de^{-y}
= \left.-y^n e^{-y}\right|_{0}^{+\infty} + \int_{0}^{+\infty} e^{-y}\,dy^n
= n\int_{0}^{+\infty} y^{n-1} e^{-y}\,dy.  (42)

Repeating the step n − 1 times we obtain

n(n-1)(n-2)\cdots 2\cdot 1\int_{0}^{+\infty} e^{-y}\,dy = n!  (43)

because the integral itself evaluates to 1.

Exponential distribution and the Dirac delta. The exponential distribution converges to the Dirac delta for a → ∞. The distribution is normalized because

\langle x^0\rangle = \frac{0!}{a^0} = \frac{1}{1} = 1.  (44)

Because for this distribution σ = 1/a, we often replace a with 1/σ, so that

P(x) = \begin{cases} \frac{1}{\sigma} e^{-x/\sigma} & \text{for } x \ge 0 \\ 0 & \text{for } x < 0, \end{cases}  (45)

and

\langle x^n\rangle = n!\,\sigma^n.  (46)
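Again, a quick sampling check of (38) is possible; the sketch below assumes Python with NumPy and an arbitrary value of a:

    # Check <x^n> = n!/a^n (38) for the exponential distribution P(x) = a e^{-a x}.
    import math
    import numpy as np

    rng = np.random.default_rng(2)
    a = 1.5
    x = rng.exponential(scale=1.0 / a, size=2_000_000)

    for n in range(5):
        exact = math.factorial(n) / a**n
        print(f"n = {n}:  sampled <x^n> = {(x**n).mean():9.4f}   exact = {exact:9.4f}")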

3.3 Normal (Gaussian) Distribution

The normal or Gaussian distribution centered on x0 and of standard deviation σ is defined by the formula:

P(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-x_0)^2/(2\sigma^2)},  (47)

where σ must be positive and x₀ can be any finite real number.

Basic Gaussian integral. There is a neat trick to the evaluation of \int_{-\infty}^{+\infty} e^{-ax^2}\,dx. We begin by

evaluating the square of the integral:

\left(\int_{-\infty}^{+\infty} e^{-ax^2}\,dx\right)\left(\int_{-\infty}^{+\infty} e^{-ay^2}\,dy\right) = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} e^{-a(x^2 + y^2)}\,dx\,dy.  (48)

Now we switch to cylindrical coordinates on the plane, (r, φ), where r² = x² + y², which yields

\int_{0}^{+\infty}\int_{0}^{2\pi} e^{-ar^2}\, r\,d\phi\,dr = 2\pi\int_{0}^{+\infty} e^{-ar^2}\,\frac{1}{2a}\,d(ar^2) = \frac{\pi}{a}\int_{0}^{+\infty} e^{-z}\,dz = \frac{\pi}{a}.  (49)

Therefore

\int_{-\infty}^{+\infty} e^{-ax^2}\,dx = \sqrt{\frac{\pi}{a}}.  (50)

In particular, for a = 1

Observation 3.2.
\int_{-\infty}^{+\infty} e^{-x^2}\,dx = \sqrt{\pi},  (51)

which is easy to remember, and worth remembering. But we will make use of (50) soon enough too.

Odd moments of the basic Gaussian. It is easy to see that

\int_{-\infty}^{+\infty} x\, e^{-ax^2}\,dx = 0  (52)

because x is antisymmetric, whereas e^{-x^2} is symmetric.

Mean of the Gaussian distribution. In the case of (47) the distribution is symmetric around x₀, which, for a = 1, yields

\int_{-\infty}^{+\infty} x\, e^{-(x-x_0)^2}\,dx = \int_{-\infty}^{+\infty}\left(x_0 + (x - x_0)\right)e^{-(x-x_0)^2}\,d(x - x_0) = x_0\int_{-\infty}^{+\infty} e^{-z^2}\,dz = x_0\sqrt{\pi}.  (53)

For (47), the \sqrt{\pi} that is thrown out by (53) cancels with the 1/\sqrt{\pi} in the definition of P(x), whereas the \sqrt{2\sigma^2} produced by the change of variable cancels with the \sqrt{2\sigma^2} in the normalization prefactor.

Normalization of the Gaussian distribution. We're going to show this by checking whether P(x) is indeed normalized:

\int_{-\infty}^{+\infty}\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-x_0)^2/(2\sigma^2)}\,dx
= \frac{1}{\sqrt{2\pi\sigma^2}}\int_{-\infty}^{+\infty} e^{-\left((x-x_0)/\sqrt{2\sigma^2}\right)^2}\sqrt{2\sigma^2}\,d\!\left(\frac{x - x_0}{\sqrt{2\sigma^2}}\right)
= \frac{1}{\sqrt{\pi}}\int_{-\infty}^{+\infty} e^{-z^2}\,dz = \frac{\sqrt{\pi}}{\sqrt{\pi}} = 1.  (54)

In summary, the mean of the distribution is

\langle x\rangle = x_0.  (55)

Second moment of the basic Gaussian. We are now going to evaluate the second moment of the distribution. But before we get to juggle the π-s and the σ-s, we'll just work out the simple integral first:

\int_{-\infty}^{+\infty} x^2 e^{-ax^2}\,dx.  (56)

From our tricky evaluation of (50) we can anticipate that this is going to be just as tricky... The trick this time is to observe that

\frac{d}{da}\, e^{-ax^2} = -x^2 e^{-ax^2}.  (57)

Therefore

\int_{-\infty}^{+\infty} x^2 e^{-ax^2}\,dx = -\frac{d}{da}\int_{-\infty}^{+\infty} e^{-ax^2}\,dx
= -\frac{d}{da}\sqrt{\frac{\pi}{a}}
= -\sqrt{\pi}\left(-\frac{1}{2}\,a^{-3/2}\right)
= \frac{1}{2}\sqrt{\frac{\pi}{a^3}}.  (58)

In particular, for a = 1

\int_{-\infty}^{+\infty} x^2 e^{-x^2}\,dx = \frac{\sqrt{\pi}}{2},  (59)

which is easy to remember and worth remembering too.

Second moment of the Gaussian distribution. Now we are fully equipped to calculate the second moment of the distribution:

\langle x^2\rangle = \frac{1}{\sqrt{2\pi\sigma^2}}\int_{-\infty}^{+\infty} x^2 e^{-(x-x_0)^2/(2\sigma^2)}\,dx
= \frac{1}{\sqrt{2\pi\sigma^2}}\int_{-\infty}^{+\infty}\left((x - x_0)^2 + 2x x_0 - x_0^2\right)e^{-(x-x_0)^2/(2\sigma^2)}\,dx
= \frac{1}{\sqrt{2\pi\sigma^2}}\left(\int_{-\infty}^{+\infty}(x - x_0)^2 e^{-(x-x_0)^2/(2\sigma^2)}\,dx
+ 2x_0\int_{-\infty}^{+\infty} x\, e^{-(x-x_0)^2/(2\sigma^2)}\,dx
- x_0^2\int_{-\infty}^{+\infty} e^{-(x-x_0)^2/(2\sigma^2)}\,dx\right).  (60)

The second integral multiplied by the factor in front of the whole expression is simply the mean, x₀. The third integral after the multiplication is the normalization integral and evaluates to 1, so the difference between the

second and the third component in the above expression is 2x_0^2 − x_0^2 = x_0^2.

Variance and standard deviation of the Gaussian distribution. Consequently, the variance, ⟨x²⟩ − ⟨x⟩², is just the first integral times the factor in front:

\langle x^2\rangle - \langle x\rangle^2 = \frac{1}{\sqrt{2\pi\sigma^2}}\int_{-\infty}^{+\infty}(x - x_0)^2 e^{-(x-x_0)^2/(2\sigma^2)}\,dx
= \frac{\left(2\sigma^2\right)^{3/2}}{\sqrt{2\pi\sigma^2}}\int_{-\infty}^{+\infty}\frac{(x - x_0)^2}{2\sigma^2}\, e^{-(x-x_0)^2/(2\sigma^2)}\,d\!\left(\frac{x - x_0}{\sqrt{2\sigma^2}}\right)
= \frac{\left(2\sigma^2\right)^{3/2}}{\sqrt{2\pi\sigma^2}}\int_{-\infty}^{+\infty} z^2 e^{-z^2}\,dz
= \frac{\left(2\sigma^2\right)^{3/2}}{\sqrt{2\pi\sigma^2}}\,\frac{\sqrt{\pi}}{2} = \sigma^2.  (61)

And so, the standard deviation is σ, as we have asserted (but not proven) at the beginning.

The trick shown in (58) can be used to evaluate \int_{-\infty}^{+\infty} x^n e^{-ax^2}\,dx for even values of n. For odd values of n the integrals are zero, unless the Gaussian is shifted, in which case the shifts have to be taken into account as in (53).

N-th moment of the basic Gaussian. For an arbitrary moment ⟨xⁿ⟩ the tricks shown above can still be used, but we can do better. A compact expression exists for the generic n-th moment integral, namely

\int_{-\infty}^{+\infty} x^n e^{-ax^2}\,dx = \begin{cases} \dfrac{1}{2^n}\dfrac{n!}{(n/2)!}\sqrt{\dfrac{\pi}{a^{n+1}}} & \text{for even } n \\ 0 & \text{for odd } n. \end{cases}  (62)

In particular, for n = 0 we get \sqrt{\pi/a} in agreement with (50), and for n = 2 we get \frac{1}{2}\sqrt{\pi/a^3} in agreement with (58). Also, for a = 1 and for even n this is just

\frac{1}{2^n}\frac{n!}{(n/2)!}\sqrt{\pi}.  (63)

Induction proof. To prove the correctness of the formula all we need is to carry out the induction step. We assume that the formula is correct for n − 2 and try to see if this leads to a correct expression for n. The formula for n − 2 is

\frac{1}{2^{n-2}}\,\frac{(n-2)!}{\left(\frac{n-2}{2}\right)!}\,\sqrt{\pi}\, a^{-(n-1)/2}.  (64)

We apply the trick used in (58). The minus d/da of (64) should yield

(62). Let us see if it does:

-\frac{1}{2^{n-2}}\,\frac{(n-2)!}{\left(\frac{n-2}{2}\right)!}\,\sqrt{\pi}\,\frac{d}{da}\, a^{-(n-1)/2}
= -\frac{1}{2^{n-2}}\,\frac{(n-2)!}{\left(\frac{n-2}{2}\right)!}\,\sqrt{\pi}\left(-\frac{n-1}{2}\right) a^{-(n-1)/2-1}
= \frac{1}{2^{n-1}}\,\frac{(n-1)!}{\left(\frac{n-2}{2}\right)!}\,\sqrt{\frac{\pi}{a^{n+1}}}
= \frac{1}{2^{n-1}}\,\frac{(n-1)!}{\left(\frac{n-2}{2}\right)!}\,\frac{n/2}{n/2}\,\sqrt{\frac{\pi}{a^{n+1}}}
= \frac{1}{2^{n}}\,\frac{n!}{\left(\frac{n}{2}-1\right)!\,\frac{n}{2}}\,\sqrt{\frac{\pi}{a^{n+1}}}
= \frac{1}{2^{n}}\,\frac{n!}{\left(\frac{n}{2}\right)!}\,\sqrt{\frac{\pi}{a^{n+1}}},  (65)

which is indeed the same as (62), which therefore proves it.

N-th moment of the Gaussian distribution. Now let us find the actual n-th moment:

\langle x^n\rangle = \frac{1}{\sqrt{\pi}}\,\frac{1}{\sqrt{2}\,\sigma}\int_{-\infty}^{+\infty} x^n e^{-(x-x_0)^2/(2\sigma^2)}\,dx
= \frac{1}{\sqrt{\pi}}\,\frac{1}{\sqrt{2}\,\sigma}\int_{-\infty}^{+\infty} x^n e^{-(x-x_0)^2/(2\sigma^2)}\,\sqrt{2\sigma^2}\,d\!\left(\frac{x - x_0}{\sqrt{2\sigma^2}}\right)
= \frac{1}{\sqrt{\pi}}\int_{-\infty}^{+\infty} x^n e^{-(x-x_0)^2/(2\sigma^2)}\,d\!\left(\frac{x - x_0}{\sqrt{2\sigma^2}}\right).  (66)

To progress we have to express xⁿ in terms of x − x₀. We use Newton's binomial formula:

x^n = \left(x_0 + (x - x_0)\right)^n = \sum_{k=0}^{n}\frac{n!}{k!(n-k)!}\, x_0^{n-k}(x - x_0)^k.  (67)

We plug (67) in (66), which yields

\frac{1}{\sqrt{\pi}}\, n!\sum_{k=0}^{n}\frac{x_0^{n-k}}{k!(n-k)!}\int_{-\infty}^{+\infty}(x - x_0)^k e^{-(x-x_0)^2/(2\sigma^2)}\,d\!\left(\frac{x - x_0}{\sqrt{2\sigma^2}}\right)
= \frac{1}{\sqrt{\pi}}\, n!\sum_{k=0}^{n}\frac{x_0^{n-k}}{k!(n-k)!}\left(\sqrt{2\sigma^2}\right)^k\int_{-\infty}^{+\infty}\frac{(x - x_0)^k}{\left(\sqrt{2\sigma^2}\right)^k}\, e^{-(x-x_0)^2/(2\sigma^2)}\,d\!\left(\frac{x - x_0}{\sqrt{2\sigma^2}}\right)
= \frac{1}{\sqrt{\pi}}\, n!\sum_{k=0}^{n}\frac{x_0^{n-k}}{k!(n-k)!}\left(\sqrt{2\sigma^2}\right)^k\int_{-\infty}^{+\infty} z^k e^{-z^2}\,dz
= \frac{1}{\sqrt{\pi}}\, n!\sum_{k=0,2,4,\ldots}\frac{x_0^{n-k}}{k!(n-k)!}\left(\sqrt{2\sigma^2}\right)^k\frac{1}{2^k}\frac{k!}{(k/2)!}\sqrt{\pi}
= n!\sum_{k=0,2,4,\ldots}\frac{x_0^{n-k}\,\sigma^k}{2^{k/2}\,(n-k)!\,(k/2)!}.  (68)

For the 0th moment, there is only one term in the sum and n = k = 0, which yields 1, so the Gaussian is indeed normalized, in agreement with (54). For the 1st moment, there is still one term only, but now n = 1, so we get x₀ in agreement with (55). And for the second moment we have two terms:

2\left(\frac{x_0^2}{2} + \frac{\sigma^2}{2}\right) = x_0^2 + \sigma^2,  (69)

wherefrom we get that the variance is x_0^2 + \sigma^2 - x_0^2 = \sigma^2, in agreement with (61).
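Formula (68) can be checked by sampling as well. The sketch below (Python with NumPy assumed; x₀ and σ are arbitrary) compares Monte Carlo estimates of ⟨xⁿ⟩ with the closed form:

    # Check the n-th moment of a Gaussian, equation (68):
    #   <x^n> = n! * sum_{k=0,2,4,...,n} x0^{n-k} sigma^k / (2^{k/2} (n-k)! (k/2)!)
    import math
    import numpy as np

    def gaussian_moment(n, x0, sigma):
        return math.factorial(n) * sum(
            x0**(n - k) * sigma**k
            / (2**(k // 2) * math.factorial(n - k) * math.factorial(k // 2))
            for k in range(0, n + 1, 2)
        )

    rng = np.random.default_rng(3)
    x0, sigma = 1.0, 0.5
    x = rng.normal(loc=x0, scale=sigma, size=4_000_000)

    for n in range(5):
        print(f"n = {n}:  sampled = {(x**n).mean():8.4f}"
              f"   formula = {gaussian_moment(n, x0, sigma):8.4f}")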

3.4 Cauchy-Lorentz Distribution

Lorentz bell curve. The Cauchy-Lorentz distribution is given by

P(x) = \frac{1}{\pi}\,\frac{a}{(x - x_0)^2 + a^2},  (70)

where a is positive and called the halfwidth of the distribution, and x₀ ∈ R.

Halfwidth. The curve attains a maximum of 1/(πa) at x = x₀. It attains half of its maximum at an x that is a solution of

\frac{1}{\pi}\,\frac{a}{(x - x_0)^2 + a^2} = \frac{1}{2\pi a},  (71)

which is x = x₀ ± a, hence the term halfwidth.

Normalization: no higher moments exist. It is easy to see that the distribution is normalized:

\int_{-\infty}^{+\infty}\frac{1}{\pi}\,\frac{a}{(x - x_0)^2 + a^2}\,dx
= \frac{a}{\pi}\int_{-\infty}^{+\infty}\frac{1}{a^2\left(\frac{(x - x_0)^2}{a^2} + 1\right)}\,a\,d\!\left(\frac{x - x_0}{a}\right)
= \frac{1}{\pi}\int_{-\infty}^{+\infty}\frac{dz}{z^2 + 1}
= \frac{1}{\pi}\int_{-\infty}^{+\infty}\frac{d\arctan z}{dz}\,dz
= \frac{1}{\pi}\left(\arctan(+\infty) - \arctan(-\infty)\right) = \frac{1}{\pi}\left(\frac{\pi}{2} - \left(-\frac{\pi}{2}\right)\right) = 1.  (72)

Alas, this zeroth moment is the only one that exists. All integrals with P(x) multiplied by xⁿ, where n ≥ 1, are divergent.
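The non-existence of the mean shows up dramatically in practice: the running average of Cauchy-Lorentz samples never settles down, as the following sketch illustrates (Python with NumPy assumed; the Gaussian running average is included for contrast, and the location parameter 3 is arbitrary):

    # The running average of Cauchy-Lorentz samples does not converge,
    # because the distribution has no first moment; a Gaussian running
    # average with the same nominal center is shown for comparison.
    import numpy as np

    rng = np.random.default_rng(4)
    n = np.arange(1, 200_001)

    cauchy = rng.standard_cauchy(size=n.size) + 3.0   # x0 = 3, a = 1
    normal = rng.normal(loc=3.0, scale=1.0, size=n.size)

    run_cauchy = np.cumsum(cauchy) / n
    run_normal = np.cumsum(normal) / n

    for i in (1_000, 10_000, 100_000, 200_000):
        print(f"N = {i:7d}:  Gaussian running mean = {run_normal[i-1]:7.3f}"
              f"   Cauchy running mean = {run_cauchy[i-1]:9.3f}")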

4 Probability Distributions on Rⁿ: Marginal and Conditional Distributions

Although we have mentioned probability distributions on R³ in the previous sections, we didn't talk much about them. In principle, probability distributions on Rⁿ are much the same as probability distributions on R, but there are some additional questions that can be asked of such distributions that lead to new concepts.

Probability distribution in 3D. Let us first consider P_{R³} = P(x, y, z). The normalization condition for the distribution is

\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} P(x, y, z)\,dx\,dy\,dz = 1.  (73)

Marginal probability distribution. But we can also ask the following question: what is the probability of finding x ∈ [x₁, x₂] and y ∈ [y₁, y₂] with z anywhere, i.e., anywhere within [−∞, +∞]? The answer to this question is given by a new probability distribution, this time on R², that is constructed by integrating P(x, y, z) over z,

P_{xy}(x, y) = \int_{-\infty}^{+\infty} P(x, y, z)\,dz.  (74)

We can also go further and ask about the probability of finding x ∈ [x₁, x₂] with y and z anywhere. This time the relevant probability distribution is given by

P_x(x) = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} P(x, y, z)\,dy\,dz.  (75)

And, of course, we can similarly construct P_{xz}, P_{yz}, P_y and P_z. These distributions are called marginal probability distributions.

Conditional probability distribution. The next question is more complicated. We can ask, for example, about the probability of finding x ∈ [x₁, x₂] and y ∈ [y₁, y₂], but only as the fraction of the cases for which y ∈ [y₁, y₂], while ignoring z altogether. This time the corresponding distribution function will be (we'll prove this more formally below, see (85))

P_{x|y}(x|y) = \frac{\int_{-\infty}^{+\infty} P(x, y, z)\,dz}{\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} P(x, y, z)\,dx\,dz}.  (76)

This distribution is called a conditional probability distribution, and the expression P_{x|y}(x|y) is read "probability density of x given y". It is not to be confused with \int_{-\infty}^{+\infty} P(x, y = y_0, z)\,dz for some fixed y₀. We use the bar to separate the variables to emphasize that the one on the right is a conditioning variable. We can see that the numerator is a function of (x, y), whereas the denominator is a function of y. Or, we can also say that the integral in the numerator is over z, which is the ignored variable, whereas the integrals in the denominator are over (x, z), i.e., over all the variables that are not the conditioning ones. Thus the integral in the denominator scales the probability to the narrower domain, the one that corresponds to the conditioning variable.
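One way to get a feel for (74) and (76) is to discretize everything on a grid. The sketch below (Python with NumPy assumed; the joint density is an arbitrary made-up product of Gaussians, used only as an example) builds P(x, y, z) numerically, integrates out z to form the marginal, and divides by the marginal of y to obtain the conditional:

    # Discretized illustration of (74) and (76)/(85): build a joint density
    # P(x, y, z) on a grid, integrate out z to get the marginal P_xy, and
    # divide by the marginal P_y to get the conditional P_{x|y}(x|y).
    import numpy as np

    x = np.linspace(-5, 5, 101)
    y = np.linspace(-5, 5, 101)
    z = np.linspace(-5, 5, 101)
    dx = dy = dz = x[1] - x[0]
    X, Y, Z = np.meshgrid(x, y, z, indexing="ij")

    P = np.exp(-0.5 * (X - 0.5 * Y)**2) * np.exp(-0.5 * Y**2) * np.exp(-0.5 * Z**2)
    P /= P.sum() * dx * dy * dz                  # enforce the normalization (73)

    P_xy = P.sum(axis=2) * dz                    # marginal distribution (74)
    P_y = P_xy.sum(axis=0) * dx                  # marginal over x as well
    P_x_given_y = P_xy / P_y[np.newaxis, :]      # conditional distribution (85)

    # each column of the conditional must integrate to 1 in x
    print("normalization of P(x|y):", (P_x_given_y.sum(axis=0) * dx)[:5])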

In a similar genre we can define P_{x|yz}(x|y, z), the probability density of x given y and z, and P_{xy|z}(x, y|z), the probability density of (x, y) given z. All these distributions must be normalized with respect to their normal

variables, not the conditioning ones, that is

\int_{-\infty}^{+\infty} P(x|y)\,dx = 1, \qquad
\int_{-\infty}^{+\infty} P(x|y, z)\,dx = 1, \qquad
\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} P(x, y|z)\,dx\,dy = 1.  (77)

The marginal distributions must similarly be normalized:

\int_{-\infty}^{+\infty} P_x(x)\,dx = 1, \qquad
\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} P_{xy}(x, y)\,dx\,dy = 1.  (78)

The same goes for various other combinations of (x, y, z), of course. The conditioning variable can be x instead of z, and so on.

Working out expressions for conditional distributions. This is getting complicated. But the way through the thicket is to think of the differential forms

Z +∞ Px(x) dx = 1, −∞ Z +∞ Z +∞ Pxy(x, y) dx dy = 1. (78) −∞ −∞ The same goes for various other combinations of (x, y, z), of course. The conditioning variable can be x instead of z and so on. This is getting complicated. But the way through the thicket is to Working out expressions for think of the differential forms conditional distributions

P_x(x)\,dx, \quad P_{xy}(x, y)\,dx\,dy, \quad P_{x|y}(x|y)\,dx, \quad P_{xy|z}(x, y|z)\,dx\,dy,  (79)

as the actual probabilities to which various probability rules can be explicitly applied. In particular, let us go back to the actual definition of the conditional probability. At its most fundamental level, it is defined as

P(x|y) = \lim_{n\to\infty}\frac{k_n(y\text{ and }x)}{k_n(y)}
= \lim_{n\to\infty}\frac{k_n(y\text{ and }x)/n}{k_n(y)/n}
= \frac{P(y\text{ and }x)}{P(y)},  (80)

where k_n(x) is the number of occurrences of x in n samplings. This implies

P (y and x) = P (x|y)P (y) and P (x and y) = P (y|x)P (x). (81)

But since x and y = y and x, P (y and x) = P (x and y), and

P (x and y) = P (y|x)P (x) = P (x|y)P (y). (82)

Switching to our differential probabilities

P_{xy}(x, y)\,dx\,dy = P_{x|y}(x|y)\,dx\; P_y(y)\,dy.  (83)

Dividing by dx dy yields

P_{xy}(x, y) = P_{x|y}(x|y)\,P_y(y),  (84)

wherefrom

P_{x|y}(x|y) = \frac{P_{xy}(x, y)}{P_y(y)},  (85)

which yields (76) on substituting the integrals in place of the marginal probabilities.

Relationships between conditional and marginal distributions. There is a number of expressions similar to (84) from which formulas for conditional probability distributions can be derived in terms of marginal probability distributions, for example

P(x, y, z) = P_{xy}(x, y)\,P_{z|xy}(z|x, y), \qquad
P(x, y, z) = P_x(x)\,P_{yz|x}(y, z|x),  (86)

which imply

P_{z|xy}(z|x, y) = \frac{P(x, y, z)}{\int_{-\infty}^{+\infty} P(x, y, z)\,dz}, \qquad
P_{yz|x}(y, z|x) = \frac{P(x, y, z)}{\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} P(x, y, z)\,dy\,dz}.  (87)

Again, as we have observed above, the numerator in these expressions is a function of all the variables that appear on the left hand side of the equations, whereas the denominator is only a function of the conditioning variables, meaning that the other variables must be "integrated away." By means of such integrations, the probability distribution P(x, y, z) defines all marginal and conditional probabilities. The reverse is also the case. By substituting (84) into P(x, y, z) as given by (86) we obtain

P(x, y, z) = P_y(y)\,P_{x|y}(x|y)\,P_{z|xy}(z|x, y).  (88)

This holds for various permutations of x, y and z. The expression is referred to as full conditioning of P(x, y, z). These rules hold for probability distributions on Rⁿ and for the marginal and conditional probability distributions derived from them, but they do not hold for probabilities in general. It is the existence of P(x, y, z) that makes the difference here: it binds the relationships.

There are various other expressions that link conditional and marginal probability distributions, for example,

P_{x|z}(x|z) = \int_{-\infty}^{+\infty} P_{xy|z}(x, y|z)\,dy, \qquad
P_x(x) = \int_{-\infty}^{+\infty} P_y(y)\,P_{x|y}(x|y)\,dy, \qquad
P_z(z) = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} P_{xy}(x, y)\,P_{z|xy}(z|x, y)\,dx\,dy, \qquad
P_{yz}(y, z) = \int_{-\infty}^{+\infty} P_x(x)\,P_{y|x}(y|x)\,P_{z|xy}(z|x, y)\,dx.  (89)

Statistically independent random variables. We call the variables x, y and z, all connected through the probability distribution P(x, y, z), statistically independent when conditioning any one of them by the other two, either individually or together, does not make a difference, that is, for example,

P_x(x) = P_{x|y}(x|y) = P_{x|yz}(x|y, z).  (90)

Let us recall that

P(x, y, z) = P_{xy}(x, y)\,P_{z|xy}(z|x, y).  (91)

If x, y, and z are independent, we get that

Pz|xy(z|x, y) = Pz(z) and

Pxy(x, y) = Px|y(x|y)Py(y) = Px(x)Py(y), (92)

therefore

Theorem 4.1 (Statistically Independent Random Variables). Random variables x, y and z are statistically independent if and only if

P (x, y, z) = Px(x)Py(y)Pz(z). (93)

Averages with respect to multi-dimensional distributions. If f is a function on R³, reasoning as in Section 2 yields an expression for the average value of f with respect to the distribution P:

\langle f(x, y, z)\rangle = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} f(x, y, z)\,P(x, y, z)\,dx\,dy\,dz.  (94)

Marginal probability distributions become useful if f does not depend on one of the variables, for example, if f = f(x, y) then

\langle f(x, y)\rangle = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} f(x, y)\,P(x, y, z)\,dx\,dy\,dz
= \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} f(x, y)\,P_{xy}(x, y)\,dx\,dy.  (95)

If we would like to evaluate the average of f along x for y fixed at some y_j, then we would use the conditional probability P_{x|y}(x|y) as follows:

\langle f(x, y_j)\rangle = \int_{-\infty}^{+\infty} f(x, y_j)\,P_{x|y}(x|y_j)\,dx.  (96)

Covariance. Given P(x, y, z) we can evaluate moments of any of the variables, that is, ⟨xⁿ⟩, ⟨yⁿ⟩ and ⟨zⁿ⟩. But we can also compute a measure of correlation for the different variables, called covariance. For x and y it is defined as

\mathrm{cov}(x, y) = \langle (x - \langle x\rangle)(y - \langle y\rangle)\rangle.  (97)

We should first observe that, unlike variance, covariance can be negative. It is worthwhile to compute it further:

\mathrm{cov}(x, y) = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty}(x - \langle x\rangle)(y - \langle y\rangle)\,P(x, y, z)\,dx\,dy\,dz
= \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty}\left(xy - x\langle y\rangle - \langle x\rangle y + \langle x\rangle\langle y\rangle\right)P_{xy}(x, y)\,dx\,dy
= \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} xy\,P_{xy}(x, y)\,dx\,dy - 2\langle x\rangle\langle y\rangle + \langle x\rangle\langle y\rangle
= \langle xy\rangle - \langle x\rangle\langle y\rangle.  (98)

This holds for any combination of x, y and z including

cov(x, x) = var(x). (99)

Correlation. There is another quantity associated with probability distributions and random variables that is actually called correlation, and it is defined as

\mathrm{corr}(x, y) = \frac{\mathrm{cov}(x, y)}{\sigma(x)\,\sigma(y)}.  (100)

So it is much like covariance, but additionally scaled by the standard deviations of the variables tested.
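Covariance and correlation are also straightforward to estimate from samples. The following sketch (Python with NumPy assumed; the correlated pair is an arbitrary construction) checks (97), (98) and (100):

    # Sample covariance and correlation, equations (97), (98) and (100),
    # for correlated variables built from independent Gaussians: y = x + noise.
    import numpy as np

    rng = np.random.default_rng(5)
    n = 1_000_000
    x = rng.normal(size=n)
    y = x + 0.5 * rng.normal(size=n)     # correlated with x by construction

    cov = np.mean((x - x.mean()) * (y - y.mean()))      # definition (97)
    cov_alt = np.mean(x * y) - x.mean() * y.mean()      # identity (98)
    corr = cov / (x.std() * y.std())                    # definition (100)

    print(f"cov(x, y)    = {cov:.4f}   (exact 1.0)")
    print(f"<xy>-<x><y>  = {cov_alt:.4f}")
    print(f"corr(x, y)   = {corr:.4f}   (exact {1/np.sqrt(1.25):.4f})")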

5 Variable Transformations

In Section 1 we discussed the case of propagation of randomness. We had a random variable x associated with a probability distribution P(x) and a function E(x), which therefore also became randomized, but, as we pointed out at the time, P(x) was not the same as P(E). So, the question arises, how to find P(E)? The solution to the conundrum is in the observation that we only ever access probability distributions through integrals, while evaluating averages, moments, or probabilities over some subsets of the domain.

Let us look at a 1-dimensional case first. Let us suppose we have a differentiable and invertible variable transformation y : x ↦ y(x), which implies y⁻¹ : y ↦ x(y), and a probability distribution P(x). Let us also assume there is a function f(x) that has an average of

\langle f(x)\rangle = \int_{-\infty}^{+\infty} f(x)\,P(x)\,dx.  (101)

We can rewrite this in terms of the average over y as follows:

\langle f(x(y))\rangle_y = \int_{-\infty}^{+\infty} f(x(y))\,P(x(y))\left|\frac{dx}{dy}\right| dy.  (102)

The reason why we take the absolute value of the derivative dx/dy here is because we do not change the bounds on the integral, which we would normally have to do if the derivative was negative—we would then integrate from +∞ to −∞. In more mathematical terms, we are transforming

the positive measure dx into another positive measure |dx/dy| dy.

Random Variable Transformation Theorem. The transformation tells us that

P_y(y) = P_x(x(y))\left|\frac{dx}{dy}\right|.  (103)

Here we have marked P with indexes, x and y, to emphasize that P_y is not the same function as P_x, and not of the same variable either. The above equation is a simple special case of the Random Variable Transformation Theorem. In particular, if y = ax + b, then |dx/dy| = 1/|a| and

Theorem 5.1 (Linear Transformation).
P_y(y) = \frac{1}{|a|}\, P_x\!\left(\frac{y - b}{a}\right).  (104)

In summary, to obtain P_y it is not enough to transform P_x(x) → P_x(x(y)); we have to multiply this additionally by the Jacobian of the transformation, |dx/dy| (this is because the volume element is really an antisymmetric differential form, for example, |dx × dy| in two dimensions). In a multidimensional case y = {y₁, y₂, ..., y_n} and x = {x₁, x₂, ..., x_n}, assuming that the transformation x ↦ y(x) is invertible and differentiable, we have that

P_y(y) = P_x(x(y))\left|\frac{\partial(x_1, x_2, \ldots, x_n)}{\partial(y_1, y_2, \ldots, y_n)}\right|.  (105)

But what if the transformation, let us call it y = f(x) this time, is not invertible or not differentiable? Can we still arrive at P(y)? Yes we can, and it is here that we get to appreciate how useful marginal and conditional probability distributions can be. Let then x be a random variable associated with a probability distribution P(x), and let y = f(x). We think of x and y as joint random variables associated with some P(x, y). In this case P_y(y) and P_x(x) are marginal probabilities, which are connected by

P_y(y) = \int_{-\infty}^{+\infty} P_x(x)\,P_{y|x}(y|x)\,dx.  (106)

The trick is in figuring out P_{y|x}(y|x). Since y and x are connected by x ↦ y = f(x), the expression "given that x = x₀" implies that y = f(x₀); in other words, given x₀, y is a sure variable, the value of which is f(x₀). The conditional probability distribution is therefore

P_{y|x}(y|x) = \delta(y - f(x)),  (107)

which yields

P_y(y) = \int_{-\infty}^{+\infty} P_x(x)\,\delta(y - f(x))\,dx.  (108)

The trick extends to higher dimensions. Assuming that y = {y₁, y₂, ..., y_n} and x = {x₁, x₂, ..., x_m}, and that y_i = f_i(x), we'll end up with a product of conditional probabilities, every one of which will be δ(y_i − f_i(x)), so that

Theorem 5.2 (Random Variable Transformation).

P_y(y) = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty}\cdots\int_{-\infty}^{+\infty} P_x(x)\prod_{i=1}^{n}\delta(y_i - f_i(x))\,dx_1\,dx_2\ldots dx_m.  (109)

This is the most general form of the Random Variable Transformation theorem. How do we get from this the expression with the Jacobian? The Dirac delta is a function of f(x), so we have to change the integration variable to f(x) in order to carry out the integration; then we substitute y for f(x) in the integrated function, drop the delta and drop the integral:

P_y(y) = \int_{-\infty}^{+\infty} P_x(x)\,\delta(y - f(x))\,dx
= \int_{-\infty}^{+\infty} P_x\!\left(f^{-1}(f(x))\right)\delta(y - f(x))\left|\frac{dx}{df(x)}\right| df(x)
= P_x\!\left(f^{-1}(y)\right)\left|\frac{dx}{dy}\right|,  (110)

which is the same as (103).
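The transformation law (103) can be verified by histogramming transformed samples. The sketch below (Python with NumPy assumed) takes y = exp(x) with x standard normal, a choice made purely for illustration, and compares the histogram of y with P_x(x(y))|dx/dy|:

    # Check the transformation law (103): for y = exp(x) with x standard normal,
    #   P_y(y) = P_x(ln y) * |dx/dy| = P_x(ln y) / y   (the log-normal density).
    import numpy as np

    rng = np.random.default_rng(6)
    x = rng.normal(size=2_000_000)
    y = np.exp(x)

    counts, edges = np.histogram(y, bins=60, range=(0.05, 6.0))
    width = edges[1] - edges[0]
    hist_density = counts / (y.size * width)
    centers = 0.5 * (edges[1:] + edges[:-1])

    P_x = lambda t: np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)
    predicted = P_x(np.log(centers)) / centers          # equation (103)

    for c, h, p in list(zip(centers, hist_density, predicted))[::12]:
        print(f"y = {c:5.2f}:  histogram = {h:.4f}   P_x(x(y))|dx/dy| = {p:.4f}")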

5.1 Application to Gaussian Distributions

The Gaussian distribution in its simplest form is (see Section 3.3, (47))

P_x(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}.  (111)

Linear transformations of Gaussian variables. Let us transform this distribution's variable according to y = σ_y x + y₀, where σ_y > 0. According to (104)

P_y(y) = \frac{1}{\sigma_y}\, P_x\!\left(\frac{y - y_0}{\sigma_y}\right) = \frac{1}{\sqrt{2\pi\sigma_y^2}}\, e^{-(y-y_0)^2/(2\sigma_y^2)},  (112)

which is the same as (47). Another transformation, of the form z = σ_z y + z₀ (which implies y = (z − z₀)/σ_z), yields

P_z(z) = \frac{1}{\sigma_z}\, P_y\!\left(\frac{z - z_0}{\sigma_z}\right)
= \frac{1}{\sqrt{2\pi\sigma_y^2\sigma_z^2}}\exp\left(-\frac{\left(\frac{z - z_0}{\sigma_z} - y_0\right)^2}{2\sigma_y^2}\right)
= \frac{1}{\sqrt{2\pi\sigma_y^2\sigma_z^2}}\exp\left(-\frac{\left(z - (\sigma_z y_0 + z_0)\right)^2}{2\sigma_y^2\sigma_z^2}\right).  (113)

Sometimes a special notation is used for the Gaussians, namely, what we have called P_y(y) in (112) would be described as N_{y_0,\sigma_y^2}(y), the N standing for normal. Using this notation we could then sum up (113) as the following

Theorem 5.3 (Linear Transformation of a Gaussian Variable).

(y, N_{y_0,\sigma_y^2}) \text{ and } z = \sigma_z y + z_0 \;\Rightarrow\; P_z = N_{\sigma_z y_0 + z_0,\;\sigma_y^2\sigma_z^2}.  (114)

Our next observation is about the probability distribution of a sum of two statistically independent Gaussian random variables. We use the Random Variable Transformation theorem again to arrive at the result. Let x be one such variable, its probability distribution given by N_{x_0,\sigma_x^2}(x), and let y be another one, its probability distribution given by N_{y_0,\sigma_y^2}(y). What is the probability distribution of z = x + y? The theorem tells us that this is going to be

P_z(z) = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} N_{x_0,\sigma_x^2}(x)\, N_{y_0,\sigma_y^2}(y)\,\delta(z - (x + y))\,dx\,dy.  (115)

The trick is now to evaluate this integral and find a more useful expression. We will discover that a sum of two statistically independent Gaussian variables is also a Gaussian variable:

Theorem 5.4 (Normal (Gaussian) Sum).

(x, N_{x_0,\sigma_x^2}) \text{ and } (y, N_{y_0,\sigma_y^2}) \text{ and } z = x + y \;\Rightarrow\; P_z = N_{x_0+y_0,\;\sigma_x^2+\sigma_y^2}.  (116)

Proof. Before we get to compute this result we need to learn a little more about the Dirac delta first. It's easy to see that the delta's Fourier transform is

\hat{\delta}(k) = \int_{-\infty}^{+\infty}\delta(x)\, e^{-2\pi i k x}\,dx = e^{-2\pi i k\cdot 0} = 1.  (117)

Therefore the reverse transform should read:

\int_{-\infty}^{+\infty} 1\cdot e^{2\pi i k x}\,dk = \int_{-\infty}^{+\infty}\hat{\delta}(k)\, e^{2\pi i k x}\,dk = \delta(x),  (118)

which yields this peculiar integral representation of the Dirac delta (a detour: the Fourier representation of the Dirac delta function):

\delta(x) = \int_{-\infty}^{+\infty} e^{2\pi i k x}\,dk.  (119)

Is this really so? The integral looks poorly defined for starters. But let us have a closer look at it. Let us first assume that x ≠ 0. In this case

\int_{-\infty}^{+\infty} e^{2\pi i k x}\,dk = \int_{-\infty}^{+\infty}\cos 2\pi k x\,dk + i\int_{-\infty}^{+\infty}\sin 2\pi k x\,dk.  (120)

Both integrals evaluate to zero, because the integrated functions oscillate forever (along the k axis) and spend exactly as much time above the zero value as they spend below it. But when x = 0, then

\int_{-\infty}^{+\infty} e^{2\pi i k\cdot 0}\,dk = \int_{-\infty}^{+\infty} 1\,dk = \infty.  (121)

So, it looks like a delta. But does it integrate to 1? Let's see (a proof that \int_{-\infty}^{+\infty}\delta(x)\,dx = 1):

\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} e^{2\pi i k x}\,dk\,dx
= \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty}\cos(2\pi k x)\,dk\,dx
+ i\int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty}\sin(2\pi k x)\,dk\,dx = \ldots  (122)

The second integral vanishes because of the antisymmetry of sine. But the first integral looks promising. Because of the symmetry of cosine there is going to be a bump in the middle of the (x, k) plane that will not be cancelled by the oscillations of the cosine as we move away from it. To see this we first convert to the cylindrical coordinates in the (x, k) plane treating k like y

\ldots = \int_{0}^{+\infty}\int_{0}^{2\pi}\cos\left(2\pi\, r\sin\phi\; r\cos\phi\right) r\,d\phi\,dr
= \frac{1}{4\pi}\int_{0}^{+\infty}\int_{0}^{4\pi}\cos\left(\pi r^2\sin 2\phi\right)d(2\phi)\,d(\pi r^2)
= \frac{1}{4}\int_{0}^{+\infty} 4\left(\frac{1}{\pi}\int_{0}^{\pi}\cos\left(\rho\sin\psi\right)d\psi\right)d\rho = \ldots  (123)

So far we have merely rearranged the integration variables and cleaned up the expression to be integrated. The integral over 2φ from 0 to 4π splits into four integrals from 0 to π, because sin 2φ is a periodic function and because the cosine is insensitive to the sign of its argument.

The Dirac delta integral is reduced to a well defined Bessel function integral. The expression in the brackets is, as it turns out, the integral representation of the zeroth order Bessel function of the first kind (see, for example, equation 9.1.18 in "Handbook of Mathematical Functions", ed. M. Abramowitz and I. Stegun, Dover, tenth printing, 1972). It is not otherwise integrable in terms of other elementary functions, so we simply replace it with J_0(\rho):

\ldots = \int_{0}^{+\infty} J_0(\rho)\,d\rho = 1.  (124)

That this is indeed 1 is one of the elementary properties of Bessel functions, namely

\int_{0}^{+\infty} J_n(\rho)\,d\rho = 1 \quad\text{for } n \ge 0.  (125)

See, for example, equation 11.4.17 in the "Handbook of Mathematical Functions" mentioned above. This impressive and quick detour through the Bessel functions has a snag. If you ask around, very few people can tell you how to derive (125). They will refer you to Bessel function tables, as I have, but even there you won't find the actual derivation. Could it be that the result is produced by reasoning that merely reverses what we have done? An independent way might be to use a series expansion for J_n(\rho) and work from there. In summary, we have a need for an alternative, perhaps more tedious but also more transparent proof that the integral of (119) is 1. We are going to spend some time on it because the Dirac delta function is an

important business, and it is worth learning how to understand it better and how to work with it.

Another way to see that \int_{-\infty}^{+\infty}\delta(x)\,dx = 1. This time, we will not use the transformation to cylindrical coordinates, because this inevitably leads to Bessel functions. Instead we'll work with a regularized form of (119) that looks as follows

\delta(x) = \lim_{\alpha\to 0}\lim_{\kappa\to\infty}\int_{-\kappa}^{+\kappa} e^{2\pi i k x}\, e^{-\alpha|k|}\,dk,  (126)

where α ≥ 0 and κ > 0 too. For α = 0 the second exp becomes 1 and if κ = ∞ at the same time, then we get back to (119). But (126) is a normal, regular, well defined integral. No hocus-pocus here. So, now the idea is that we evaluate the delta integral using (126) instead of (119) and take limits in the end only:

\int_{-\infty}^{+\infty}\delta(x)\,dx = \int_{-\infty}^{+\infty}\left(\lim_{\alpha\to 0}\lim_{\kappa\to\infty}\int_{-\kappa}^{+\kappa} e^{2\pi i k x}\, e^{-\alpha|k|}\,dk\right)dx
= \frac{1}{2\pi}\int_{-\infty}^{+\infty}\left(\lim_{\alpha\to 0}\lim_{\kappa\to\infty}\int_{-2\pi\kappa}^{+2\pi\kappa} e^{i(2\pi k)x}\, e^{-\frac{\alpha}{2\pi}|2\pi k|}\,d(2\pi k)\right)dx = \ldots  (127)

Now we substitute 2πk = k', 2πκ = κ' and α/(2π) = α', and, observing that when α → 0 then α' → 0 too, we write

\ldots = \frac{1}{2\pi}\int_{-\infty}^{+\infty}\left(\lim_{\alpha'\to 0}\lim_{\kappa'\to\infty}\int_{-\kappa'}^{+\kappa'} e^{i k' x}\, e^{-\alpha'|k'|}\,dk'\right)dx
= \frac{1}{2\pi}\int_{-\infty}^{+\infty}\left(\lim_{\alpha'\to 0}\lim_{\kappa'\to\infty}\int_{-\kappa'}^{+\kappa'}\left(\cos k' x + i\sin k' x\right) e^{-\alpha'|k'|}\,dk'\right)dx = \ldots  (128)

Here we again drop the sin k'x term on account of its antisymmetry, which

leaves us with

\ldots = \frac{1}{2\pi}\int_{-\infty}^{+\infty}\left(\lim_{\alpha'\to 0}\lim_{\kappa'\to\infty}\int_{-\kappa'}^{+\kappa'} e^{-\alpha'|k'|}\cos k' x\,dk'\right)dx
= \frac{1}{\pi}\int_{-\infty}^{+\infty}\left(\lim_{\alpha'\to 0}\lim_{\kappa'\to\infty}\int_{0}^{+\kappa'} e^{-\alpha' k'}\cos k' x\,dk'\right)dx
= \frac{1}{\pi}\int_{-\infty}^{+\infty}\lim_{\alpha'\to 0}\left(\int_{0}^{\infty} e^{-\alpha' k'}\cos k' x\,dk'\right)dx
= \frac{1}{\pi}\int_{-\infty}^{+\infty}\lim_{\alpha'\to 0}\left(\frac{e^{-\alpha' k'}}{\alpha'^2 + x^2}\left(-\alpha'\cos k' x + x\sin k' x\right)\Bigg|_{k'=0}^{k'=\infty}\right)dx
= \frac{1}{\pi}\int_{-\infty}^{+\infty}\lim_{\alpha'\to 0}\left(\frac{\alpha'}{\alpha'^2 + x^2}\right)dx
= \frac{1}{\pi}\lim_{\alpha'\to 0}\int_{-\infty}^{+\infty}\frac{\alpha'}{\alpha'^2 + x^2}\,dx
= \frac{1}{\pi}\lim_{\alpha'\to 0}\left(\alpha'\,\frac{1}{\alpha'}\arctan\frac{x}{\alpha'}\Bigg|_{x=-\infty}^{x=+\infty}\right)
= \frac{1}{\pi}\lim_{\alpha'\to 0}\left(\frac{\pi}{2} - \left(-\frac{\pi}{2}\right)\right) = 1.  (129)

In the last row of this evaluation the limit α' → 0 is harmless, because by now the two α' have cancelled out.

The actual proof commences here. Now that we have seen in two ways that \int_{-\infty}^{+\infty}\delta(x)\,dx = 1, we can commence our proof of the Gaussian sum theorem. We assume that the two Gaussian variables x and y are independent; therefore, from the theorem about statistically independent variables,

P(x, y) = P_x(x)\,P_y(y),  (130)

where both P_x and P_y are Gaussians. If z = x + y, then from the Random Variable Transformation theorem

P_z(z) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} P_x(x)\,P_y(y)\,\delta(z - (x + y))\,dx\,dy
= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi\sigma_x^2}}\, e^{-(x-x_0)^2/(2\sigma_x^2)}\,\frac{1}{\sqrt{2\pi\sigma_y^2}}\, e^{-(y-y_0)^2/(2\sigma_y^2)}
\times\left(\lim_{\alpha\to 0}\lim_{\kappa\to\infty}\int_{-\kappa}^{\kappa} e^{2\pi i k(z - (x + y))}\, e^{-\alpha|k|}\,dk\right)dx\,dy = \ldots  (131)

Now we substitute

x' = \frac{x - x_0}{\sqrt{2\sigma_x^2}} \;\Rightarrow\; dx = \sqrt{2\sigma_x^2}\,dx', \qquad
y' = \frac{y - y_0}{\sqrt{2\sigma_y^2}} \;\Rightarrow\; dy = \sqrt{2\sigma_y^2}\,dy',  (132)

which also implies that

\frac{(x - x_0)^2}{2\sigma_x^2} = x'^2, \qquad \frac{(y - y_0)^2}{2\sigma_y^2} = y'^2.  (133)

We can drop the limits right away, because the Gaussians regularize all integrals, thus replacing κ with ∞ and α with 0. Then, upon substituting z₀ = x₀ + y₀, our P_z(z) becomes

\ldots = \frac{1}{\pi}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-x'^2} e^{-y'^2}\, e^{2\pi i k\left(z - \left(\sqrt{2\sigma_x^2}\,x' + \sqrt{2\sigma_y^2}\,y' + z_0\right)\right)}\,dk\,dx'\,dy'
= \frac{1}{\pi}\int_{-\infty}^{\infty} e^{2\pi i k(z - z_0)}
\left(\int_{-\infty}^{\infty} e^{-2\pi i k\sqrt{2\sigma_x^2}\,x'}\, e^{-x'^2}\,dx'\right)
\left(\int_{-\infty}^{\infty} e^{-2\pi i k\sqrt{2\sigma_y^2}\,y'}\, e^{-y'^2}\,dy'\right)dk = \ldots  (134)

Another detour: Fourier transform of a Gaussian. The integrals in the brackets are nothing more than Fourier transforms of Gaussians. It is important to know how to compute such a transform, so we are going to stop here again and discuss this. We consider a generic integral of the form

\int_{-\infty}^{\infty} e^{-ikx}\, e^{-ax^2}\,dx = \ldots  (135)

As we have seen with Gaussian integrals in Section 3.3, some trickery may be needed. We introduce

x' = x + \frac{ik}{2a} \;\Rightarrow\; dx' = dx.  (136)

Then

-ax'^2 = -a\left(x + \frac{ik}{2a}\right)^2 = -a\left(x^2 + 2x\frac{ik}{2a} - \frac{k^2}{4a^2}\right) = -ax^2 - ikx + \frac{k^2}{4a}.  (137)

Therefore

e^{-ax'^2} = e^{-ax^2}\, e^{-ikx}\, e^{\frac{k^2}{4a}}.  (138)

If we multiply both sides by e^{-\frac{k^2}{4a}}, we can rewrite (135) as

\ldots = e^{-\frac{k^2}{4a}}\int_{-\infty}^{\infty} e^{-ax'^2}\,dx' = \sqrt{\frac{\pi}{a}}\, e^{-\frac{k^2}{4a}},  (139)

where we have made use of (50). We see right away that the Fourier transform of a Gaussian in the x space is a Gaussian in the k space. This has various interesting ramifications; for example, in quantum mechanics a Gaussian packet in the physical space is characterized by a Gaussian

distribution in the momentum space, and the related widths Δx and Δp are such that ΔxΔp ≈ ℏ/2; in pulse propagation, a Gaussian pulse is also Gaussian in the frequency space, and it has frequency components from −∞ to +∞: any dispersion present in the propagating medium will distort it.

The proof resumes. We will come back to it shortly, but right now we know enough to continue with our proof. To this effect we introduce

k_x = 2\pi k\sqrt{2\sigma_x^2} \quad\text{and}\quad k_y = 2\pi k\sqrt{2\sigma_y^2}.  (140)

Then (134) becomes

\ldots = \frac{1}{\pi}\int_{-\infty}^{+\infty} e^{2\pi i k(z - z_0)}
\left(\int_{-\infty}^{\infty} e^{-i k_x x'}\, e^{-x'^2}\,dx'\right)
\left(\int_{-\infty}^{\infty} e^{-i k_y y'}\, e^{-y'^2}\,dy'\right)dk
= \int_{-\infty}^{+\infty} e^{2\pi i k(z - z_0)}\, e^{-k_x^2/4}\, e^{-k_y^2/4}\,dk
= \frac{1}{2\pi}\int_{-\infty}^{+\infty} e^{2\pi i k(z - z_0)}\, e^{-4\pi^2 k^2(\sigma_x^2 + \sigma_y^2)/2}\,d(2\pi k)
= \frac{1}{2\pi}\int_{-\infty}^{+\infty} e^{i k'(z - z_0)}\, e^{-k'^2(\sigma_x^2 + \sigma_y^2)/2}\,dk' = \ldots  (141)

where we have substituted k' = 2πk. This time we have the reverse Fourier transform integral:

\int_{-\infty}^{\infty} e^{ikx}\, e^{-ak^2}\,dk.  (142)

The computation proceeds exactly as before, the only difference being the sign in front of ikx, but on closer inspection this doesn't matter, so the result is still

\sqrt{\frac{\pi}{a}}\, e^{-\frac{x^2}{4a}}.  (143)

Thus, to continue the original computation,

\ldots = \frac{1}{2\pi}\sqrt{\frac{2\pi}{\sigma_x^2 + \sigma_y^2}}\exp\left(-\frac{(z - z_0)^2}{4\left(\sigma_x^2 + \sigma_y^2\right)/2}\right)
= \frac{1}{\sqrt{2\pi\left(\sigma_x^2 + \sigma_y^2\right)}}\exp\left(-\frac{\left(z - (x_0 + y_0)\right)^2}{2\left(\sigma_x^2 + \sigma_y^2\right)}\right),  (144)

which is exactly what we wanted to show, which ends the proof. It’s amazing I haven’t lost any pies on the way...

The summary of the proof. Although it took us a while to get here, this was because we had made two detours: first, we explored and proved (in two ways) the Fourier expression for the Dirac delta; second, we showed that the Fourier transform of a Gaussian was a Gaussian, and that this held in both directions.

The computation of the Gaussian sum itself proceeded directly from the Random Variable Transformation theorem: the delta became a reverse Fourier transform, while the (x + y) term in the delta resulted in the forward Fourier transforms from x to k and from y to k. We combined the resulting Gaussians in the k space, then applied the reverse Fourier transform to obtain a Gaussian again in the z space, with the expected width and shift. This is a classic proof that I still remember from my undergraduate days (a long time ago) and that aptly illustrates the combined power of the Fourier transform and Dirac delta concepts.
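Before proceeding, the Gaussian sum theorem is easy to confirm numerically. The following sketch (Python with NumPy assumed; the means and widths are arbitrary) compares the sampled sum with N_{x₀+y₀, σx²+σy²}:

    # Numerical check of the Gaussian sum theorem (116): z = x + y for two
    # independent Gaussians is Gaussian with mean x0 + y0 and variance
    # sigma_x^2 + sigma_y^2.
    import numpy as np

    rng = np.random.default_rng(7)
    x = rng.normal(loc=1.0, scale=2.0, size=3_000_000)
    y = rng.normal(loc=-0.5, scale=1.5, size=3_000_000)
    z = x + y

    print("mean of z     :", z.mean(), "  expected", 1.0 - 0.5)
    print("variance of z :", z.var(), "  expected", 2.0**2 + 1.5**2)

    # compare the histogram of z with the predicted Gaussian density
    counts, edges = np.histogram(z, bins=7, range=(-7, 8))
    centers = 0.5 * (edges[1:] + edges[:-1])
    width = edges[1] - edges[0]
    sigma2 = 2.0**2 + 1.5**2
    predicted = np.exp(-(centers - 0.5)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    for c, h, p in zip(centers, counts / (z.size * width), predicted):
        print(f"z = {c:5.1f}:  histogram = {h:.4f}   Gaussian = {p:.4f}")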

The next theorem is more limited than the Gaussian sum theorem and applies to zero-centered Gaussians of variance 1 only: a ratio of two statistically independent Gaussian variables is a Cauchy-Lorentz variable.

Theorem 5.5 (Normal (Gaussian) Ratio).
(x, N_{0,1}) \text{ and } (y, N_{0,1}) \text{ and } z = \frac{x}{y} \;\Rightarrow\; P_z(z) = \frac{1}{\pi}\,\frac{1}{z^2 + 1}.  (145)

P_z is therefore a Cauchy-Lorentz distribution. A special notation similar to N_{x_0,\sigma_x^2} exists for Cauchy-Lorentz distributions, namely

\frac{1}{\pi}\,\frac{a}{(x - x_0)^2 + a^2} = C_{x_0,a}(x).  (146)

Therefore in this case

P_z = C_{0,1}.  (147)

Proof. To prove the theorem we apply the Random Variable Transformation theorem directly:

P_z(z) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\,\frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}\,\delta\!\left(z - \frac{x}{y}\right)dx\,dy = \ldots  (148)

First we replace x with u = x/y, then x = uy and dx = y du. But for negative y we would have to change the from-to limits on the integral, so we will use |y| du instead. This way

\ldots = \frac{1}{2\pi}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-y^2(u^2+1)/2}\,\delta(z - u)\,|y|\,du\,dy
= \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-y^2(z^2+1)/2}\,|y|\,dy
= \frac{1}{\pi}\int_{0}^{\infty} e^{-y^2(z^2+1)/2}\, y\,dy
= \frac{1}{2\pi}\int_{0}^{\infty} e^{-y^2(z^2+1)/2}\,dy^2
= \frac{1}{2\pi}\int_{0}^{\infty} e^{-v(z^2+1)/2}\,dv = \ldots  (149)

where we have substituted v = y² and, prior to that, made use of the symmetry of exp(−y²(z² + 1)/2) to replace the integral from −∞ to +∞

with two times the integral from 0 to ∞. This is why the theorem would not work for shifted Gaussians. Integrating further

\ldots = \frac{1}{2\pi}\,\frac{2}{z^2 + 1}\int_{0}^{\infty} e^{-v(z^2+1)/2}\,d\!\left(v\,\frac{z^2 + 1}{2}\right)
= \frac{1}{\pi}\,\frac{1}{z^2 + 1} = C_{0,1}(z).  (150)
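The ratio theorem, too, can be confirmed by sampling; since moments of the result do not exist, the sketch below (Python with NumPy assumed) compares the empirical cumulative distribution of z = x/y with the Cauchy-Lorentz cumulative distribution arctan(z)/π + 1/2:

    # Check the Gaussian ratio theorem (145): z = x/y for two independent
    # standard normals follows the Cauchy-Lorentz distribution C_{0,1}.
    import numpy as np

    rng = np.random.default_rng(8)
    x = rng.normal(size=2_000_000)
    y = rng.normal(size=2_000_000)
    z = x / y

    # compare the empirical CDF with the Cauchy CDF, arctan(z)/pi + 1/2
    for t in (-5.0, -1.0, 0.0, 1.0, 5.0):
        empirical = np.mean(z <= t)
        exact = np.arctan(t) / np.pi + 0.5
        print(f"P(z <= {t:5.1f}):  sampled = {empirical:.4f}   Cauchy = {exact:.4f}")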

5.2 Application to Cauchy-Lorentz Distributions

A sum of two statistically independent Cauchy-Lorentz variables is also a Cauchy-Lorentz variable. The theorem is similar to the one about the Gaussian variables, and so is its proof:

Theorem 5.6 (Cauchy-Lorentz Sum).
(x, C_{x_0,a_x}) \text{ and } (y, C_{y_0,a_y}) \text{ and } z = x + y \;\Rightarrow\; P_z = C_{x_0+y_0,\;a_x+a_y}.  (151)

Proof. And here is the computation:

P_z(z) = \frac{a_x a_y}{\pi^2}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\frac{\delta(z - (x + y))}{\left((x - x_0)^2 + a_x^2\right)\left((y - y_0)^2 + a_y^2\right)}\,dx\,dy
= \frac{a_x a_y}{\pi^2}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\frac{e^{2\pi i k(z - (x + y))}}{\left((x - x_0)^2 + a_x^2\right)\left((y - y_0)^2 + a_y^2\right)}\,dx\,dy\,dk = \ldots  (152)

Now we substitute

x' = \frac{x - x_0}{a_x} \;\Rightarrow\; dx = a_x\,dx', \qquad y' = \frac{y - y_0}{a_y} \;\Rightarrow\; dy = a_y\,dy',  (153)

so that

(x - x_0)^2 + a_x^2 = a_x^2(x'^2 + 1) \quad\text{and}\quad (y - y_0)^2 + a_y^2 = a_y^2(y'^2 + 1).  (154)

Returning to our triple integral (152), we notice the cancellation of all explicit a_x and a_y factors upon the substitution, so that the integral becomes

\ldots = \frac{1}{\pi^2}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\frac{e^{2\pi i k(z - (x + y))}}{(x'^2 + 1)(y'^2 + 1)}\,dx'\,dy'\,dk = \ldots  (155)

But we still have x + y in the exponent, and this now becomes

x + y = a_x x' + a_y y' + x_0 + y_0,  (156)

so that

e^{2\pi i k(z - (x + y))} = e^{2\pi i k(z - a_x x' - a_y y' - x_0 - y_0)} = e^{2\pi i k(z - (x_0 + y_0))}\, e^{-2\pi i k a_x x'}\, e^{-2\pi i k a_y y'},  (157)

with the effect that (155) becomes

\ldots = \frac{1}{\pi^2}\int_{-\infty}^{\infty} e^{2\pi i k(z - (x_0 + y_0))}
\left(\int_{-\infty}^{\infty}\frac{e^{-2\pi i k a_x x'}}{x'^2 + 1}\,dx'\right)
\left(\int_{-\infty}^{\infty}\frac{e^{-2\pi i k a_y y'}}{y'^2 + 1}\,dy'\right)dk = \ldots  (158)

We will show shortly that

\int_{-\infty}^{\infty}\frac{e^{ikx}}{x^2 + 1}\,dx = \pi e^{-|k|}.  (159)

The proof of this is kind of backward. But for the time being we'll just take this result as is and use it to complete our main computation. We observe that the π factors cancel out completely and we obtain

\ldots = \int_{-\infty}^{\infty} e^{2\pi i k(z - (x_0 + y_0))}\, e^{-|2\pi k(a_x + a_y)|}\,dk
= 2\int_{0}^{\infty}\cos\left(2\pi k(z - (x_0 + y_0))\right) e^{-2\pi k(a_x + a_y)}\,dk
= \frac{2\pi(a_x + a_y)}{4\pi^2(z - (x_0 + y_0))^2 + 4\pi^2(a_x + a_y)^2}

= \frac{1}{\pi}\,\frac{a_x + a_y}{(z - (x_0 + y_0))^2 + (a_x + a_y)^2},  (160)

which is exactly what we wanted to show. To complete the proof we must still demonstrate (159). Also, we should show how to compute

\int_{0}^{\infty} e^{-ax}\cos bx\,dx = \frac{a}{a^2 + b^2},  (161)

which we have made use of in the second last line of (160). We begin with (159). The trick this time is to recall that the function f(x) and its Fourier transform \hat{f}(k) make a pair and that we can switch from one to the other and back as follows:

Theorem 5.7 (Reverse Fourier Transform).
\hat{f}(k) = \int_{-\infty}^{\infty} e^{-2\pi i k x} f(x)\,dx, \qquad
f(x) = \int_{-\infty}^{\infty} e^{2\pi i k x}\hat{f}(k)\,dk.  (162)

The idea behind having the 2π factor in the exponent is that this way we don't have to remember which of the two integrals has the 1/(2π) factor in front. Neither does!

Proof (of the reverse Fourier transform). That the second equation indeed holds can easily be seen by invoking the integral representation of the Dirac delta (119):

\int_{-\infty}^{\infty} e^{2\pi i k x}\hat{f}(k)\,dk = \int_{-\infty}^{\infty} e^{2\pi i k x}\int_{-\infty}^{\infty} e^{-2\pi i k x'} f(x')\,dx'\,dk
= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{2\pi i k(x - x')} f(x')\,dk\,dx'
= \int_{-\infty}^{\infty}\delta(x' - x)\, f(x')\,dx' = f(x).  (163)

To prove (159), the left side of which is a reverse Fourier transform of 1/(x² + 1) (up to the missing 2π factor), we need to show that the Fourier transform of the right side is 1/(x² + 1) (again up to the missing 2π factor). So, here is the calculation:

\int_{-\infty}^{\infty} e^{-2\pi i k x}\, e^{-|x|}\,dx
= \int_{-\infty}^{0} e^{-2\pi i k x}\, e^{x}\,dx + \int_{0}^{\infty} e^{-2\pi i k x}\, e^{-x}\,dx
= \int_{-\infty}^{0} e^{x(1 - 2\pi i k)}\,dx + \int_{0}^{\infty} e^{-x(1 + 2\pi i k)}\,dx
= \frac{1}{1 - 2\pi i k} + \frac{1}{1 + 2\pi i k}
= \frac{2}{1 + 4\pi^2 k^2},  (164)

which by (163) implies that

\int_{-\infty}^{+\infty}\frac{2 e^{2\pi i k x}}{1 + 4\pi^2 k^2}\,dk = e^{-|x|}.  (165)

The integral on the left is further transformed by variable substitution:

\int_{-\infty}^{+\infty}\frac{2 e^{2\pi i k x}}{1 + 4\pi^2 k^2}\,dk
= \frac{1}{\pi}\int_{-\infty}^{+\infty}\frac{e^{i(2\pi k)x}}{1 + (2\pi k)^2}\,d(2\pi k)
= \frac{1}{\pi}\int_{-\infty}^{+\infty}\frac{e^{i k' x}}{1 + k'^2}\,dk',  (166)

which on multiplication by π yields (159). Because of the symmetry of 1/(x² + 1), (159) also implies that

\int_{-\infty}^{\infty}\frac{\cos kx}{x^2 + 1}\,dx = \pi e^{-|k|}.  (167)

Integrals (159) and (167) are quite non-trivial, and it is instructive that the Fourier transform theory lets us evaluate them so easily. Of course, the more difficult proof of the immensely useful integral representation of the Dirac delta (119) hides behind it all.

The last remaining bit is the integral (161), which we have made use of to arrive at (160). This is an elementary integral that can be integrated by parts, but we have to do this twice. The first step converts the integral into a certain algebraic expression plus an integral of e^{-ax} sin bx times some constant. The second step adds yet another algebraic term plus the original integral of e^{-ax} cos bx times another constant. So we end up with a simple linear algebraic equation that can be solved for the original integral, which yields

\[
  \int e^{-ax} \cos bx \, dx = \frac{b\, e^{-ax} \sin bx - a\, e^{-ax} \cos bx}{a^2+b^2} . \tag{168}
\]
The first term in the numerator on the right-hand side vanishes at infinity and at zero. The second term vanishes at infinity too, but is $-a$ at zero. The integral from 0 to $\infty$ therefore yields $a/(a^2+b^2)$. And this, at last, completes the proof of the Cauchy-Lorentz sum theorem.
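For the record, here is a worked sketch of the two integrations by parts behind (168); it is not spelled out in the original:
\[
  I = \int e^{-ax} \cos bx \, dx
    = \frac{1}{b}\, e^{-ax} \sin bx + \frac{a}{b} \int e^{-ax} \sin bx \, dx ,
  \qquad
  \int e^{-ax} \sin bx \, dx
    = -\frac{1}{b}\, e^{-ax} \cos bx - \frac{a}{b}\, I .
\]
Substituting the second result into the first and collecting the two occurrences of $I$,
\[
  I \left( 1 + \frac{a^2}{b^2} \right)
  = \frac{1}{b}\, e^{-ax} \sin bx - \frac{a}{b^2}\, e^{-ax} \cos bx
  \quad\Rightarrow\quad
  I = \frac{b\, e^{-ax} \sin bx - a\, e^{-ax} \cos bx}{a^2 + b^2} ,
\]
which is (168).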

5.3 Cumulative Probability Distribution Theorem

A cumulative probability distribution $D(x)$ as defined by (9) is a monotonically increasing function of $x$ that is 0 at $-\infty$ and 1 at $\infty$. Let $y = D(x) = \int_{-\infty}^{x} P_x(x')\, dx'$. Clearly $y \in [0,1]$. Knowing $P_x(x)$ we may ask what the probability density $P_y(y)$ of $y$ is. Because $y$ is restricted to the interval $[0,1]$ (not the two-element set $\{0,1\}$), $P_y(y)$ must be zero outside the interval. Within the interval, the sought distribution is given by the Random Variable Transformation Theorem, which in this particular case yields
\[
  P_y(y) = P_x(x)\, \frac{dx}{dy} = P_x(x)\, \frac{1}{dy(x)/dx} . \tag{169}
\]
But since $y(x) = \int_{-\infty}^{x} P_x(x')\, dx'$, its derivative with respect to $x$ is $P_x(x)$ itself, therefore
\[
  P_y(y) = \frac{P_x(x)}{P_x(x)} = 1 . \tag{170}
\]
A function that is 1 on $[0,1]$ and zero everywhere else is a uniform distribution like the distributions we discussed in Section 3.1. A special notation

is used for such distributions: $U_{x_0,x_1}$. So in this case we have

Theorem 5.8 (Cumulative Distribution Function).
\[
  y = \int_{-\infty}^{x} P_x(x')\, dx'
  \quad\Rightarrow\quad
  P_y = U_{0,1} . \tag{171}
\]

Generating random numbers with arbitrary distributions Computational environments usually provide means to generate random numbers distributed uniformly within $[0,1]$. In effect they produce the random variable $(y, U_{0,1})$. The Cumulative Distribution Function Theorem provides us with the means to convert such a random variable into a variable that has a different distribution, for example, Gaussian or Lorentzian. To do so, we simply invert $y = D(x)$:
\[
  x = D^{-1}(y) . \tag{172}
\]

The variable $x$ is then guaranteed to be distributed according to the derivative of $D$:
\[
  \frac{dD(x)}{dx} = P_x(x) . \tag{173}
\]
Monte Carlo methods are based on this observation.
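A minimal sketch of this recipe in Fortran, not part of the original text, might look as follows; the inverse cumulative distribution dinv is a hypothetical placeholder (here the inverse of $D(x) = 1 - e^{-x}$), to be replaced by whatever inverse is available analytically:

  PROGRAM inverse_transform
    IMPLICIT NONE
    REAL :: y, x
    CALL RANDOM_SEED()        ! default seeding; the seeding recipe of Section 5.7 can be used instead
    CALL RANDOM_NUMBER(y)     ! y is the uniform variable (y, U_{0,1})
    x = dinv(y)               ! x = D**(-1)(y) is then distributed with density dD/dx
    PRINT *, x
  CONTAINS
    REAL FUNCTION dinv(y)     ! placeholder inverse CDF: here D(x) = 1 - exp(-x)
      REAL, INTENT(IN) :: y
      dinv = -LOG(1.0 - y)
    END FUNCTION dinv
  END PROGRAM inverse_transform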

5.4 Linear Combination of Random Variables

Let $q = f(x,y,z)$ and let $x$, $y$ and $z$ be bound with a probability distribution $P_{x,y,z}(x,y,z)$. What is the probability distribution $P_q(q)$? The answer follows from the Random Variable Transformation Theorem:

\[
  P_q(q) = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty}
  P_{x,y,z}(x,y,z)\, \delta\bigl(q - f(x,y,z)\bigr)\, dx\, dy\, dz . \tag{174}
\]

The average value of $q$ with respect to $P_q$ is

\[
  \langle q \rangle_q = \int_{-\infty}^{\infty} q\, P_q(q)\, dq
  = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty}
    q\, P_{x,y,z}(x,y,z)\, \delta\bigl(q - f(x,y,z)\bigr)\, dq\, dx\, dy\, dz
\]
\[
  = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty}
    f(x,y,z)\, P_{x,y,z}(x,y,z)\, dx\, dy\, dz
  = \langle f(x,y,z) \rangle_{x,y,z} . \tag{175}
\]

And so, we find that the average value of $q$ with respect to $P_q$ is equal to the average value of $f$ with respect to $P_{x,y,z}$. This result is not so much obvious as it is desirable: that it holds demonstrates the consistency of the formalism. Let us introduce another variable $p = g(x,y,z)$, distributed according to a $P_p(p)$ that is calculated similarly to $P_q(q)$. Then a third variable, say, $s = \alpha p + \beta q$, will be distributed according to $P_s(s)$ given by

\[
  P_s(s) = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty}
  P_{x,y,z}(x,y,z)\, \delta\bigl(s - (\alpha g(x,y,z) + \beta f(x,y,z))\bigr)\, dx\, dy\, dz , \tag{176}
\]

Average of a linear combination wherefrom we find the average value of s to be

\[
  \langle s \rangle_s = \langle \alpha g(x,y,z) + \beta f(x,y,z) \rangle_{x,y,z}
  = \alpha \langle g(x,y,z) \rangle_{x,y,z} + \beta \langle f(x,y,z) \rangle_{x,y,z}
  = \alpha \langle p \rangle_p + \beta \langle q \rangle_q . \tag{177}
\]

The average of a linear combination of random variables is the same linear combination of the averages of the variables.

Variance of a linear combination Now we turn to the variance of $s$:

\[
  \mathrm{var}(s) = \langle s^2 \rangle - \langle s \rangle^2
  = \langle (\alpha p + \beta q)^2 \rangle - \langle \alpha p + \beta q \rangle^2
  = \alpha^2 \langle p^2 \rangle + 2\alpha\beta \langle pq \rangle + \beta^2 \langle q^2 \rangle
    - \alpha^2 \langle p \rangle^2 - 2\alpha\beta \langle p \rangle \langle q \rangle - \beta^2 \langle q \rangle^2
\]
\[
  = \alpha^2 \bigl( \langle p^2 \rangle - \langle p \rangle^2 \bigr)
  + \beta^2 \bigl( \langle q^2 \rangle - \langle q \rangle^2 \bigr)
  + 2\alpha\beta \bigl( \langle pq \rangle - \langle p \rangle \langle q \rangle \bigr)
  = \alpha^2\, \mathrm{var}(p) + \beta^2\, \mathrm{var}(q) + 2\alpha\beta\, \mathrm{cov}(p,q) . \tag{178}
\]
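A small numerical check of (178), not part of the original notes, can be run in a few lines of Fortran. Here $p$ and $q$ are deliberately made correlated by sharing the uniform deviate $x$; the names and the choice $\alpha = 2$, $\beta = 3$ are purely illustrative. The two printed numbers agree to within rounding, because the identity holds for sample moments just as it does for the true ones.

  PROGRAM variance_of_linear_combination
    IMPLICIT NONE
    INTEGER, PARAMETER :: n = 100000
    REAL, PARAMETER :: alpha = 2.0, beta = 3.0
    REAL :: x(n), y(n), p(n), q(n), s(n)
    REAL :: mp, mq, ms, var_p, var_q, var_s, cov_pq
    CALL RANDOM_SEED()
    CALL RANDOM_NUMBER(x)
    CALL RANDOM_NUMBER(y)
    p = x                      ! p and q are correlated through the shared x
    q = x + y
    s = alpha * p + beta * q
    mp = SUM(p) / n;  mq = SUM(q) / n;  ms = SUM(s) / n
    var_p  = SUM((p - mp)**2) / n
    var_q  = SUM((q - mq)**2) / n
    var_s  = SUM((s - ms)**2) / n
    cov_pq = SUM((p - mp) * (q - mq)) / n
    PRINT *, 'var(s) measured:  ', var_s
    PRINT *, 'var(s) from (178):', alpha**2 * var_p + beta**2 * var_q &
                                   + 2.0 * alpha * beta * cov_pq
  END PROGRAM variance_of_linear_combination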

5.5 Covariance and Correlation

Covariance Range Theorem Now, if $s = \alpha p - \beta q$, then

\[
  \mathrm{var}(s) = \alpha^2\, \mathrm{var}(p) + \beta^2\, \mathrm{var}(q) - 2\alpha\beta\, \mathrm{cov}(p,q) . \tag{179}
\]

Since variance is always positive or zero we get that

\[
  \alpha^2\, \mathrm{var}(p) + \beta^2\, \mathrm{var}(q) - 2\alpha\beta\, \mathrm{cov}(p,q) \ge 0 . \tag{180}
\]

Let us assume that p is not a sure variable. This implies that its variance is positive. We can choose α and β as we please, so let us choose

\[
  \alpha = \mathrm{cov}(p,q)
  \quad\text{and}\quad
  \beta = \mathrm{var}(p) > 0 . \tag{181}
\]

Then (180) reads

\[
  \mathrm{cov}^2(p,q)\, \mathrm{var}(p) + \mathrm{var}^2(p)\, \mathrm{var}(q)
  \ge 2\, \mathrm{cov}(p,q)\, \mathrm{var}(p)\, \mathrm{cov}(p,q) . \tag{182}
\]

The next step is to divide both sides by var(p), which we can do, because it is positive. This yields

\[
  \mathrm{cov}^2(p,q) + \mathrm{var}(p)\, \mathrm{var}(q) \ge 2\, \mathrm{cov}^2(p,q) , \tag{183}
\]

and upon subtracting $\mathrm{cov}^2(p,q)$ from both sides we obtain

\[
  \mathrm{var}(p)\, \mathrm{var}(q) \ge \mathrm{cov}^2(p,q) . \tag{184}
\]

We can also express this in terms of standard deviations, because $\mathrm{var} = \sigma^2$. So
\[
  \sigma^2(p)\, \sigma^2(q) \ge \mathrm{cov}^2(p,q) . \tag{185}
\]
And, on taking the square root of this inequality,

Theorem 5.9 (Covariance Range).

\[
  \sigma(p)\, \sigma(q) \ge |\mathrm{cov}(p,q)| . \tag{186}
\]

The absolute value of the covariance shows up on the right-hand side of the inequality because covariance can be negative.

The meaning of corr Because correlation is defined as covariance divided by the product of standard deviations, the covariance range theorem implies that

\[
  1 \ge |\mathrm{corr}(p,q)| . \tag{187}
\]

This is why correlation is defined as it is. It varies between $+1$ and $-1$, which correspond to perfect correlation (for example, when $p = q$; recall that in this case $\mathrm{cov}(p,q) = \mathrm{cov}(q,q) = \mathrm{var}(q) = \sigma^2(q)$) and perfect anti-correlation (for example, when $p = -q$). It is zero for uncorrelated variables.

Mean and variance of uncorrelated variables If $p$ and $q$ are uncorrelated and $s = p + q$ then
\[
  \langle s \rangle = \langle p \rangle + \langle q \rangle
  \quad\text{and}\quad
  \mathrm{var}(s) = \mathrm{var}(p) + \mathrm{var}(q) , \tag{188}
\]

which is good to know. This holds for any number of variables. It is also good to know when two random variables are uncorrelated.

Statistically independent variables are uncorrelated It is easy to see that if they are statistically independent, that is,

\[
  P_{p,q}(p,q) = P_p(p)\, P_q(q) , \tag{189}
\]

they are also uncorrelated because

\[
  \mathrm{cov}(p,q) = \langle pq \rangle - \langle p \rangle \langle q \rangle
  = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} p\, q\, P_{p,q}(p,q)\, dp\, dq
    - \int_{-\infty}^{\infty} p\, P_p(p)\, dp \int_{-\infty}^{\infty} q\, P_q(q)\, dq
\]
\[
  = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} p\, P_p(p)\, q\, P_q(q)\, dp\, dq
    - \int_{-\infty}^{\infty} p\, P_p(p)\, dp \int_{-\infty}^{\infty} q\, P_q(q)\, dq
  = 0 . \tag{190}
\]

But being uncorrelated does not in general imply that the variables are statistically independent. Dependent variables may fake a lack of correlation by fluctuating in just such a way that $\langle pq \rangle = \langle p \rangle \langle q \rangle$.
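A standard counterexample, not worked out in the original notes, makes the point: take $p$ distributed uniformly on $[-1,1]$ and set $q = p^2$. Then
\[
  \langle p \rangle = 0 ,
  \qquad
  \langle pq \rangle = \langle p^3 \rangle = 0 ,
  \qquad
  \mathrm{cov}(p,q) = \langle pq \rangle - \langle p \rangle \langle q \rangle = 0 ,
\]
so $p$ and $q$ are uncorrelated, yet $q$ is a deterministic function of $p$ and therefore as far from statistically independent as can be.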

5.6 Central Limit Theorem

The Central Limit Theorem follows from all that we have been discussing in this section on random variable transformations. The section may therefore be considered a part or an extension of the theorem's proof.

Sum and average, mean and variance We consider $n$ statistically independent (and therefore uncorrelated) random variables $(x_i, P(x_i))$, all characterized by the same probability distribution (density) $P(x)$, of some well defined (finite) mean $x_0$ and variance $\sigma_x^2$. We are interested in three things: the sum $s_n$ and/or the average $a_n$ of all $x_i$, the variance of the sum and/or average, and the

probability density of the sum, $P_{s_n}$, and/or average, $P_{a_n}$. As to the first two, we already have our answers, because they follow from (188), namely

\[
  s_n = \sum_{i=1}^{n} x_i
  \quad\Rightarrow\quad
  \langle s_n \rangle = \sum_{i=1}^{n} \langle x_i \rangle = n x_0
  \quad\Rightarrow\quad
  \mathrm{var}(s_n) = \sum_{i=1}^{n} \mathrm{var}(x_i) = n \sigma_x^2 . \tag{191}
\]

Because $a_n = s_n/n$, referring to (177) and (178), we get

\[
  a_n = \frac{1}{n} \sum_{i=1}^{n} x_i
  \quad\Rightarrow\quad
  \langle a_n \rangle = \frac{\langle s_n \rangle}{n} = x_0
  \quad\Rightarrow\quad
  \mathrm{var}(a_n) = \frac{\mathrm{var}(s_n)}{n^2} = \frac{\sigma_x^2}{n} . \tag{192}
\]

The third item, the probability distribution (density) of $s_n$ or $a_n$, is what the Central Limit Theorem is about. There is no easy answer for a finite $n$, but in the limit of $n \to \infty$ the answer is

Gaussian distribution emerges for a large number of samples Theorem 5.10 (Central Limit).

\[
  P_{s_n} \xrightarrow{\;n\to\infty\;} N_{n x_0,\, n\sigma_x^2} ,
  \qquad
  P_{a_n} \xrightarrow{\;n\to\infty\;} N_{x_0,\, \sigma_x^2/n} . \tag{193}
\]

For very large $n$ both distributions approach a Gaussian, which in the case of the average is centered on $x_0$ and whose variance $\sigma_x^2/n$ shrinks like $1/n$, approaching zero in the limit. If the $x_i$ correspond to, say, successive measurements of a single variable under fixed experimental conditions, so that the variable's probability distribution does not change as the experiment is repeated, the Central Limit Theorem tells us that for a very large $n$ the average of the measurements will have a Gaussian distribution centered on $x_0$ with its standard deviation equal to $\sigma_x/\sqrt{n}$.

The theorem does not apply to the Cauchy-Lorentz distribution It is important to remember that the Central Limit Theorem does not hold if $P(x)$ does not have a finite mean and a finite variance. For example, it does not apply to the Cauchy-Lorentz distribution. In this case the Cauchy-Lorentz sum theorem tells us that the distribution of the sum will be a Cauchy-Lorentz distribution, still without a defined mean and without a finite variance.
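Before turning to the proof, here is a small numerical illustration, not part of the original notes: averaging $n$ uniform deviates, whose distribution has mean $1/2$ and variance $1/12$, produces values that cluster around $1/2$ with variance close to $1/(12n)$, as (193) predicts. The values of n and m below are arbitrary.

  PROGRAM clt_illustration
    IMPLICIT NONE
    INTEGER, PARAMETER :: n = 1000      ! number of x_i averaged into one a_n
    INTEGER, PARAMETER :: m = 100000    ! number of independent averages
    REAL :: x(n), a(m), mean_a, var_a
    INTEGER :: j
    CALL RANDOM_SEED()
    DO j = 1, m
       CALL RANDOM_NUMBER(x)
       a(j) = SUM(x) / n                ! one realization of the average a_n
    END DO
    mean_a = SUM(a) / m
    var_a  = SUM((a - mean_a)**2) / m
    PRINT *, 'mean of the averages:     ', mean_a        ! close to 0.5
    PRINT *, 'variance of the averages: ', var_a         ! close to 1/(12 n)
    PRINT *, 'predicted by (193):       ', 1.0 / (12.0 * n)
  END PROGRAM clt_illustration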

Proof. The proof takes the Random Variable Transformation Theorem as its starting point. The Dirac delta gets rewritten in its Fourier form, which leads to a product of Fourier transforms of probability distributions, a trick we've seen already in the Gaussian Sum Theorem. Each transform is then expanded in the Taylor series, in which the linear term drops out, leaving the quadratic one and the higher terms. We substitute this back into the product and take the reverse Fourier transform of it (the origin of which is the same as we had in the Gaussian Sum Theorem), and it all folds neatly into a Gaussian distribution on account of that quadratic term in the Taylor expansion. The methodology should therefore be quite familiar to an attentive student of these notes.

Introduction of a normalized variable But before we get to all this we want to get rid of the $x_0$ shift. To this effect we introduce a new random variable in place of $s_n$ and $a_n$. Let's call it $z_n$ and let its definition be

\[
  z_n = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} (x_i - x_0) . \tag{194}
\]

We can express $s_n$ and $a_n$ in terms of $z_n$ as follows:
\[
  s_n = \sqrt{n}\, z_n + n x_0
  \quad\text{and}\quad
  a_n = \frac{1}{\sqrt{n}}\, z_n + x_0 . \tag{195}
\]

Now we endeavour to demonstrate that the probability distribution of $z_n$ is $N_{0,\sigma_x^2}$, wherefrom the probability distributions of $s_n$ and $a_n$ will be obtained by the application of the Linear Transformation of a Gaussian Variable Theorem, (114). Because all $x_i$ are statistically independent, their combined probability distribution is $\prod_{i=1}^{n} P(x_i)$, and the Random Variable Transformation Theorem integral becomes

\[
  P_{z_n}(z_n) = \int_{-\infty}^{\infty} \!\cdots\! \int_{-\infty}^{\infty}
  \left( \prod_{i=1}^{n} P(x_i) \right)
  \delta\!\left( z_n - \frac{1}{\sqrt{n}} \sum_{j=1}^{n} (x_j - x_0) \right)
  dx_1 \cdots dx_n . \tag{196}
\]

Our next step is to replace the δ with its integral representation as per (119). Because the delta argument is quite complex here, we’re going to take the 2π factor out of it outright, that is

\[
  \delta(x) = \int_{-\infty}^{\infty} e^{2\pi i k x}\, dk
  = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{i(2\pi k)x}\, d(2\pi k)
  = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{i k' x}\, dk' . \tag{197}
\]
In effect

\[
  \delta\!\left( z_n - \frac{1}{\sqrt{n}} \sum_{j=1}^{n} (x_j - x_0) \right)
  = \frac{1}{2\pi} \int_{-\infty}^{\infty}
    \exp\!\left( i k z_n - \frac{ik}{\sqrt{n}} \sum_{j=1}^{n} (x_j - x_0) \right) dk
  = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{i k z_n}
    \prod_{j=1}^{n} e^{-ik (x_j - x_0)/\sqrt{n}}\, dk . \tag{198}
\]

And our original integral now splits into

\[
  P_{z_n}(z_n)
  = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{i k z_n}
    \prod_{i=1}^{n} \left( \int_{-\infty}^{\infty} P(x_i)\, e^{-ik(x_i - x_0)/\sqrt{n}}\, dx_i \right) dk
  = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{i k z_n}
    \left( \int_{-\infty}^{\infty} P(x)\, e^{-ik(x - x_0)/\sqrt{n}}\, dx \right)^{\!n} dk , \tag{199}
\]

because at this stage $x_i$ is a dummy variable and for all $x_i$ the distribution $P(x_i)$ is the same. And so we have here something that looks like a Fourier transform of $P(x)$ to the power of $n$ and then the reverse Fourier transform into the $z_n$ space. The standard trick!

The integral in the round brackets, to the power of $n$, is an unknown functional of $P$, about which we know nothing other than that it is a sensible enough probability distribution to have a finite mean and variance. Given some $P$, it is also a function of $k$. Let us then define

\[
  f(k') = \int_{-\infty}^{\infty} P(x)\, e^{-ik'(x - x_0)}\, dx . \tag{200}
\]
Then the integral in (199) is $f(k/\sqrt{n})$. We want to have the $\sqrt{n}$ factor appear here explicitly because we are going to make use of it running to infinity.

Taylor expansion: the central trick of the proof Now, the central trick of the Central Limit Theorem is to expand $f(k')$ in the Taylor series around zero. We can do this quite easily because of the exp function within which the $k'$ appears, and we do not need to know anything about $P$. And so

\[
  f(0) = 1 ,
  \qquad
  \left. \frac{df(k')}{dk'} \right|_{k'=0} = -i \int_{-\infty}^{\infty} (x - x_0)\, P(x)\, dx = 0 ,
  \qquad
  \left. \frac{d^2 f(k')}{dk'^2} \right|_{k'=0} = - \int_{-\infty}^{\infty} (x - x_0)^2\, P(x)\, dx
  = -\mathrm{var}(x) = -\sigma_x^2 . \tag{201}
\]
Therefore
\[
  f\!\left( \frac{k}{\sqrt{n}} \right) = 1 - \frac{\sigma_x^2 k^2}{2n} + \ldots , \tag{202}
\]
and the whole expression in (199) becomes

\[
  P_{z_n}(z_n) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{i k z_n}
  \left( 1 - \frac{\sigma_x^2 k^2}{2n} + \ldots \right)^{\!n} dk . \tag{203}
\]

This is how the Gaussian emerges As $n$ runs to infinity we can neglect the $\ldots$ and then we make use of
\[
  \lim_{n\to\infty} \left( 1 - \frac{a}{n} \right)^{\!n} = e^{-a} , \tag{204}
\]
which lets us replace the whole bracket to the power of $n$ with

\[
  e^{-\sigma_x^2 k^2 / 2} , \tag{205}
\]

which yields
\[
  P_{z_n}(z_n) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{i k z_n}\, e^{-\sigma_x^2 k^2/2}\, dk . \tag{206}
\]
Ah, but we have already worked out this type of integral in our discussion of the Gaussian sum theorem, (143). There we have found that

\[
  \int_{-\infty}^{\infty} e^{ikx}\, e^{-ak^2}\, dk = \sqrt{\frac{\pi}{a}}\, e^{-\frac{x^2}{4a}} . \tag{207}
\]

Here our $x$ is $z_n$ and our $a$ is $\sigma_x^2/2$, which yields

\[
  P_{z_n}(z_n) = \frac{1}{2\pi} \sqrt{\frac{2\pi}{\sigma_x^2}}\, e^{-\frac{z_n^2}{2\sigma_x^2}}
  = \frac{1}{\sqrt{2\pi\sigma_x^2}}\, e^{-\frac{z_n^2}{2\sigma_x^2}}
  = N_{0,\sigma_x^2} . \tag{208}
\]

The last step is to switch back from $z_n$ to the sum $s_n$ and the average $a_n$. The transformation is given by (195) and the transformation rule for the Gaussian is given by (114), which yields

\[
  N_{0,\sigma_x^2} \xrightarrow{\;\sqrt{n}\, z_n + n x_0\;} N_{n x_0,\, n\sigma_x^2} ,
  \qquad
  N_{0,\sigma_x^2} \xrightarrow{\;z_n/\sqrt{n} + x_0\;} N_{x_0,\, \sigma_x^2/n} , \tag{209}
\]
which ends the proof.

5.7 Computer Generation of Random Variables

Cumulative Distribution Function Theorem Computer generation of random variables that correspond to various distributions is based on our observations made in Section 5.3 and the corollary of the Cumulative Distribution Function Theorem presented therein. In particular, in order to generate $x$ with the distribution $P_x(x)$, we need to find, preferably analytically, its cumulative distribution function $D(x) = \int_{-\infty}^{x} P_x(x')\, dx'$. This function grows monotonically from 0 to 1. If we can generate a uniform random variable restricted to $[0,1]$, let's call it $y$, then $x = D^{-1}(y)$. The reason why we want to do it analytically is performance. If we have to iterate in order to compute $D^{-1}(y)$, the procedure may be too costly. In Monte Carlo simulations we must be able to generate random numbers of a prescribed distribution cheaply and quickly.

In summary, our task is going to be easy if $P_x(x)$ is analytically integrable and its integral $D(x)$ easily invertible, so as to generate $D^{-1}$. This procedure is easy to apply to

1. the uniform random variable scattered over an arbitrary finite interval $[a,b]$,
2. the exponential random variable,
3. the Cauchy-Lorentz random variable.

For the Gaussian variable we will have to be a little more clever, because the Gaussian is not integrable in terms of elementary functions, but we'll still make use of the Cumulative Distribution Function Theorem.

Random distribution in [0, 1] There are various algorithms for the generation of pseudo-random numbers in the $[0,1]$ interval. The basic ones are discussed in "Numerical Recipes" by Press, Flannery, Teukolsky and Vetterling. The most commonly used, including system-supplied "random number generators," are based on the Method of Uniform Deviates. Another popular method is the Rejection Method, also discussed in "Numerical Recipes."

Fortran Fortran's standard subroutine is RANDOM_NUMBER. The subroutine is available in Fortran 95 and later. The routine is polymorphic, meaning that its one variable may be a scalar or an array, both of type REAL. Here is a simple example of its use, based on "The GNU Fortran Compiler":

  INTEGER :: i, n, clock
  INTEGER, DIMENSION(:), ALLOCATABLE :: seed
  REAL :: r(5,5)

  call RANDOM_SEED(SIZE = n)
  ALLOCATE(seed(n))
  CALL SYSTEM_CLOCK(COUNT=clock)
  seed = clock + 37 * (/ (i - 1, i = 1, n) /)
  call RANDOM_SEED(PUT = seed)
  DEALLOCATE(seed)

  call RANDOM_NUMBER(r)

In this example, the subroutine RANDOM_NUMBER returns a 5×5 array of REAL random numbers between 0.0 and 1.0 in r. Before we can call RANDOM_NUMBER, though, we must seed the generator first. The corresponding routine in Fortran 95 and later is RANDOM_SEED. This subroutine, which is also polymorphic, takes a single argument labeled SIZE, PUT, or GET. Called with SIZE, it returns the length that the seed array passed via PUT or GET must have; in the example this length ends up in n. Once RANDOM_SEED has told us how long the seed array must be, we allocate the array of seeds dynamically and initialize its entries to numbers that are related to the time returned by the system clock. The time is an integer number of seconds from the beginning of the world, which in the Unix world means the 1st of January 1970. The array entries are initialized by executing the implicit DO loop that follows the clock + on the initialization line. Once the array has been initialized it is passed to the RANDOM_SEED subroutine (this is what the PUT label means) and the generator gets seeded. In case we forget what the seeds were, we can obtain the array back from RANDOM_SEED by using the GET label. All subroutines used in the example are INTRINSIC, meaning they are a part of the language itself. The defaults of INTEGER and REAL in GNU Fortran are INTEGER(KIND=4) and REAL(KIND=4).

Guile The problem with GNU Guile is that it evolves somewhat adventurously, so its interfaces and standards can change at short notice. The following refers to Guile 1.8.7. The Guile random number generation functions are described in Section 5.5.2.15 of the reference manual. The good thing about them is that functions are available not only for the generation of uniformly distributed random numbers in the $[0,1]$ interval. There are also functions for the generation of random numbers with exponential distribution, Gaussian (normal) distribution, random numbers scattered within a sphere, and more. To seed the random number generator we use the function seed->random-state, which returns a new random state using its argument, which should be an integer. There is a predefined global variable *random-state*, which the random number generation functions use by default if an alternative variable is not provided. For example, to seed the generator using the time of day we may use the following code

  (let ((time (gettimeofday)))
    (set! *random-state*
          (seed->random-state (+ (car time) (cdr time)))))

Function gettimeofday returns an ordered pair that contains the number of seconds and microseconds from the beginning of the world, which is, as in the case of GNU Fortran, the 1st of January 1970. For example

  guile> (gettimeofday)
  (1339015475 . 551498)
  guile>

Now, once the generator has been seeded, we would use

(random N) to generate a random integer between 0 and N − 1;

(random:uniform) to generate an inexact real number (this is Guile's parlance for "floating point") in a uniform distribution in [0, 1[ (the notation means that 1 itself is excluded);

Guile can generate exponential and Gaussian variables (random:exp) to generate an inexact real number in an exponential distribution with mean 1; multiply the result by x to switch to the mean of x;

(random:normal) to generate an inexact real number in a Gaussian distribution with mean 0 and standard deviation 1, as in (111); we would then use (114) to transform it to another, possibly off-center, Gaussian with a different standard deviation;

(random:normal-vector! V) to fill a predefined vector V with inexact real random numbers that are independent and Gaussian distributed with mean 0 and standard deviation 1, as above;

(random:hollow-sphere! V) to fill a predefined vector V with inexact real random numbers the sum of whose squares is equal to 1.0. In effect this function generates random points on an n-dimensional sphere, n being the size of V;

(random:solid-sphere! V) to fill a predefined vector V with inexact real random numbers the sum of whose squares is less than 1.0. In effect this function generates random points within an n-dimensional ball, n being the size of V, but not on the surface of the ball.

Inversion generating method Now, once we know how to generate random numbers within $[0,1]$, how do we go about transforming them into other distributions? As we have commented above, the Cumulative Distribution Function Theorem is the key.

Uniform distribution We have defined it in Section 3.1. The actual distribution function, let's call it $U_{x_1,x_2}$, is zero outside $[x_1,x_2]$ and $1/(x_2-x_1)$ inside. The cumulative distribution function is therefore 0 for $x < x_1$, 1 for $x_2 < x$, and
\[
  D_{x_1,x_2}(x) = \int_{x_1}^{x} \frac{dx'}{x_2 - x_1} = \frac{x - x_1}{x_2 - x_1}
  \quad\text{for } x \in [x_1, x_2] . \tag{210}
\]

If we now set $[0,1] \ni y = D_{x_1,x_2}(x)$, we can solve it for $x$, which yields
\[
  [x_1, x_2] \ni x = x_1 + (x_2 - x_1)\, y . \tag{211}
\]

Exponential distribution We have defined it in Section 3.2. The distribution function, let's call it $E_\sigma$, was
\[
  E_\sigma(x) =
  \begin{cases}
    \frac{1}{\sigma}\, e^{-x/\sigma} & \text{for } x \ge 0 , \\
    0 & \text{for } x < 0 ,
  \end{cases} \tag{212}
\]
where $\sigma$ is the distribution's standard deviation. The corresponding cumulative distribution function, let's call it $D_\sigma$, is 0 for $x < 0$ and
\[
  D_\sigma(x) = \int_{0}^{x} e^{-x'/\sigma}\, d\frac{x'}{\sigma} = 1 - e^{-x/\sigma} . \tag{213}
\]
Now, we could set this to $y$, but if $y \in [0,1]$ then $1 - y \in [0,1]$ too, and this makes it easier to solve the equation for $x$:
\[
  [0,1] \ni 1 - y = 1 - e^{-x/\sigma} , \tag{214}
\]
therefore $y = e^{-x/\sigma}$ and
\[
  [0,\infty] \ni x = \sigma \log\frac{1}{y} . \tag{215}
\]

Because the standard deviation of $E_\sigma$ is also its mean, this explains why we have to multiply the return of (random:exp) in Guile, which corresponds to $\sigma = 1$, by the mean of the distribution.
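In Fortran, the inversions (211) and (215) take just a line each. The following sketch is not part of the original text; x1, x2 and sigma are arbitrary illustrative values. It turns the uniform deviate delivered by RANDOM_NUMBER into a uniform variable on [x1, x2] and into an exponential variable with mean sigma:

  PROGRAM uniform_and_exponential
    IMPLICIT NONE
    REAL, PARAMETER :: x1 = -2.0, x2 = 3.0   ! endpoints of the target interval
    REAL, PARAMETER :: sigma = 1.5           ! mean (and standard deviation) of E_sigma
    REAL :: y, xu, xe
    CALL RANDOM_SEED()
    CALL RANDOM_NUMBER(y)
    xu = x1 + (x2 - x1) * y                  ! equation (211)
    CALL RANDOM_NUMBER(y)
    xe = sigma * LOG(1.0 / (1.0 - y))        ! equation (215); 1-y is used because y = 0 may occur
    PRINT *, xu, xe
  END PROGRAM uniform_and_exponential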

Cauchy-Lorentz distribution We have defined it in Section 3.4. The distribution function, let's call it $C_{x_0,a}$, was
\[
  C_{x_0,a}(x) = \frac{1}{\pi}\, \frac{a}{(x - x_0)^2 + a^2} , \tag{216}
\]
where $x_0$ is the center of the distribution and $a$ is its half-width, that is, $C_{x_0,a}$ attains half of its peak value of $1/(\pi a)$ at $x = x_0 \pm a$. The corresponding cumulative distribution function, let's call it $D_{x_0,a}$, is
\[
  D_{x_0,a}(x) = \frac{a}{\pi} \int_{-\infty}^{x} \frac{dx'}{(x' - x_0)^2 + a^2}
  = \frac{a}{\pi} \int_{-\infty}^{x - x_0} \frac{d(x' - x_0)}{(x' - x_0)^2 + a^2}
  = \frac{1}{\pi} \int_{-\infty}^{(x - x_0)/a} \frac{d\frac{x' - x_0}{a}}{\left( \frac{x' - x_0}{a} \right)^2 + 1}
  = \frac{1}{\pi} \left( \arctan\left( \frac{x - x_0}{a} \right) + \frac{\pi}{2} \right) . \tag{217}
\]
This leads to the following equation
\[
  [0,1] \ni y = \frac{1}{\pi} \left( \arctan\left( \frac{x - x_0}{a} \right) + \frac{\pi}{2} \right) , \tag{218}
\]
which solves for $x$ as follows
\[
  [-\infty, \infty] \ni x = a \tan\left( \pi \left( y - \frac{1}{2} \right) \right) + x_0 . \tag{219}
\]
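A corresponding sketch in Fortran, again not part of the original text (x0 and a are illustrative values), implements (219) directly:

  PROGRAM cauchy_lorentz_sampler
    IMPLICIT NONE
    REAL, PARAMETER :: pi = 3.1415927
    REAL, PARAMETER :: x0 = 0.0, a = 1.0   ! center and half-width of C_{x0,a}
    REAL :: y, x
    CALL RANDOM_SEED()
    CALL RANDOM_NUMBER(y)
    x = a * TAN(pi * (y - 0.5)) + x0       ! equation (219)
    PRINT *, x
  END PROGRAM cauchy_lorentz_sampler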

Gaussian is not easily integrable Gaussian distribution The Gaussian distribution is tricky because it is not easily integrable. In order to evaluate Gaussian moments in Section 3.3 we had to resort to a trick that transformed the Gaussian integral into an integral over the whole 2D plane, then switched to cylindrical coordinates. But even this worked for a definite integral from $-\infty$ to $+\infty$ only. Still, a similar trick will work now, but before we can resort to it, we have to go back and discuss cumulative distributions for probability densities of more than one variable.

Cumulative distributions for multivariable probability densities Let $x$ and $y$ be bound by $P(x,y)$. Following our remarks in Section 4 we define
\[
  P_x(x) = \int_{-\infty}^{\infty} P(x,y)\, dy ,
  \qquad
  P_{y|x}(y|x) = \frac{P(x,y)}{P_x(x)} = \frac{P(x,y)}{\int_{-\infty}^{\infty} P(x,y)\, dy} . \tag{220}
\]

And then we define the following cumulative distribution functions

\[
  D_x(x) = \int_{-\infty}^{x} P_x(x')\, dx' ,
  \qquad
  D_{y|x}(y|x) = \int_{-\infty}^{y} P_{y|x}(y'|x)\, dy' . \tag{221}
\]

Equating both to some $z_1 \in [0,1]$ and $z_2 \in [0,1]$, then solving for $x$ and $y$, yields random variables with the joint probability density $P(x,y)$.

Gaussian is easier to handle in 2D We are going to apply this procedure to

\[
  P_{x_1,x_2}(x_1,x_2)
  = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x_1 - x_0)^2}{2\sigma^2} \right)
    \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x_2 - x_0)^2}{2\sigma^2} \right)
  = \frac{1}{2\pi\sigma^2} \exp\left( -\frac{(x_1 - x_0)^2 + (x_2 - x_0)^2}{2\sigma^2} \right) . \tag{222}
\]

We introduce two new variables, r and θ such that

\[
  x_1 = x_0 + r \cos\theta ,
  \qquad
  x_2 = x_0 + r \sin\theta . \tag{223}
\]

It is easy to see that

\[
  (x_1 - x_0)^2 + (x_2 - x_0)^2 = r^2 . \tag{224}
\]

According to the Random Variable Transformation theorem, (105),

\[
  P_{r,\theta} = P_{x_1,x_2}(x_1,x_2)\, \frac{\partial(x_1,x_2)}{\partial(r,\theta)} . \tag{225}
\]

From the above we find that
\[
  \frac{\partial x_1}{\partial r} = \cos\theta ,
  \quad
  \frac{\partial x_1}{\partial \theta} = -r \sin\theta ,
  \quad
  \frac{\partial x_2}{\partial r} = \sin\theta ,
  \quad
  \frac{\partial x_2}{\partial \theta} = r \cos\theta . \tag{226}
\]
And the Jacobian is

\[
  \begin{vmatrix} \cos\theta & \sin\theta \\ -r\sin\theta & r\cos\theta \end{vmatrix}
  = r\cos^2\theta + r\sin^2\theta = r . \tag{227}
\]
Therefore
\[
  P_{r,\theta} = r\, P_{x_1,x_2}(x_1,x_2)
  = \frac{r}{2\pi\sigma^2} \exp\left( -\frac{r^2}{2\sigma^2} \right) . \tag{228}
\]
As there is no explicit dependence here on $\theta$, on integrating $\theta$ away from 0 to $2\pi$ we get

\[
  P_r(r) = \frac{r}{\sigma^2} \exp\left( -\frac{r^2}{2\sigma^2} \right) \tag{229}
\]
and
\[
  P_{\theta|r}(\theta|r) = \frac{P_{r,\theta}(r,\theta)}{\int_0^{2\pi} P_{r,\theta}(r,\theta)\, d\theta}
  = \frac{1}{2\pi} . \tag{230}
\]
This leads to the following equations for the cumulative distributions of $r$ and $\theta$:
\[
  [0,1] \ni z_1 = D_r(r)
  = \frac{1}{\sigma^2} \int_0^{r} r' \exp\left( -\frac{r'^2}{2\sigma^2} \right) dr'
  = \int_0^{\frac{r^2}{2\sigma^2}} \exp\left( -\frac{r'^2}{2\sigma^2} \right) d\frac{r'^2}{2\sigma^2}
  = 1 - \exp\left( -\frac{r^2}{2\sigma^2} \right) ,
\]
\[
  [0,1] \ni z_2 = D_{\theta|r}(\theta|r)
  = \frac{1}{2\pi} \int_0^{\theta} d\theta' = \frac{\theta}{2\pi} . \tag{231}
\]

We solve these equations as follows. If $z_1 \in [0,1]$ then $1 - z_1 \in [0,1]$ too. Therefore we write
\[
  1 - z_1 = 1 - \exp\left( -\frac{r^2}{2\sigma^2} \right) , \tag{232}
\]
therefore
\[
  r = \sigma \sqrt{2 \ln\frac{1}{z_1}} , \tag{233}
\]
and
\[
  \theta = 2\pi z_2 . \tag{234}
\]

The procedure returns two numbers Substituting $r$ and $\theta$ into (223) we finally obtain
\[
  x_1 = x_0 + \sigma \sqrt{2 \ln\frac{1}{z_1}}\, \cos(2\pi z_2) ,
  \qquad
  x_2 = x_0 + \sigma \sqrt{2 \ln\frac{1}{z_1}}\, \sin(2\pi z_2) . \tag{235}
\]
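This recipe is commonly known as the Box-Muller transform. A minimal Fortran sketch, not part of the original text (x0 and sigma are illustrative values), reads:

  PROGRAM box_muller
    IMPLICIT NONE
    REAL, PARAMETER :: pi = 3.1415927
    REAL, PARAMETER :: x0 = 0.0, sigma = 1.0    ! center and standard deviation
    REAL :: z1, z2, r, theta, g1, g2
    CALL RANDOM_SEED()
    CALL RANDOM_NUMBER(z1)
    CALL RANDOM_NUMBER(z2)
    z1 = 1.0 - z1                               ! avoid z1 = 0, since (233) takes log(1/z1)
    r     = sigma * SQRT(2.0 * LOG(1.0 / z1))   ! equation (233)
    theta = 2.0 * pi * z2                       ! equation (234)
    g1 = x0 + r * COS(theta)                    ! equation (235): a pair of Gaussian numbers
    g2 = x0 + r * SIN(theta)
    PRINT *, g1, g2
  END PROGRAM box_muller

Each pass produces two independent Gaussian numbers, g1 and g2, which is why library routines built on this transform typically return one and cache the other for the next call.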