Probability Distributions
Connexions module m43336
Zdzisław (Gustav) Meglicki, Jr, Office of the VP for Information Technology, Indiana University. RCS: Section-1.tex,v 1.78 2012/12/17 16:29:57 gustav Exp
Copyright © 2012 by Zdzislaw Meglicki. December 17, 2012
Abstract We introduce the concept of a probability distribution and its characterizations in terms of moments and averages. We present examples and discuss probability distributions on multidimensional spaces; this also includes marginal and conditional probabilities. We discuss and prove some fundamental theorems about probability distributions. Finally, we illustrate how random variables associated with various probability distributions can be generated on a computer.
Contents
1 Random Variables and Probability Distributions
2 Characteristics of Probability Distributions: Averages and Moments
3 Examples of Probability Distributions
  3.1 Uniform Distribution
  3.2 Exponential Distribution
  3.3 Normal (Gaussian) Distribution
  3.4 Cauchy-Lorentz Distribution
4 Probability Distributions on Rⁿ: Marginal and Conditional Distributions
5 Variable Transformations
  5.1 Application to Gaussian Distributions
  5.2 Application to Cauchy-Lorentz Distributions
  5.3 Cumulative Probability Distribution Theorem
  5.4 Linear Combination of Random Variables
  5.5 Covariance and Correlation
  5.6 Central Limit Theorem
  5.7 Computer Generation of Random Variables
Licensed to Connexions by Zdzislaw Meglicki, Jr
1 Random Variables and Probability Distributions
Variables in mathematics A variable in mathematics is an argument of a function. The variable may assume various values (hence the name) within its domain to which the function responds by producing the corresponding values, which usually reside in a set different from the domain of the function’s argument. Using a formal notation, we may describe this as follows:
X ∋ x ↦ f(x) = y ∈ Y. (1)
Here X is the function's domain, x ∈ X is the function's variable, f is the function itself, and y is the value that the function returns for a given x; y belongs to Y, the set of function values. Another way to describe this is

f : X → Y. (2)

Neither of the above specifies what the function f actually does. The formulas merely state that f maps elements of X onto elements of Y. For a mapping to be called a function, the mapping from x to y must be unique. But this requirement is not adhered to all that strictly, and we do work with multivalued functions too.

Variables in physics A variable in physics is something that can be measured, for example, the position of a material point, temperature, or mass. How does a physics variable relate to a variable in mathematics? If a material point at position x is endowed with an electric charge q_e and some externally applied electric field E is present, then the force that acts on the point will be a vector-valued function of its position:
F(x) = q_e E(x). (3)
We read this as follows: the material point is endowed with electric charge q_e and is located at x. The position of the point is the variable here (it is actually a vector variable in this case, but we may also think of it as three scalar variables). The electric field E happens to have the vector value E(x) at this point. It couples to the point's charge, and in effect the force F(x), also a vector, is exerted upon the point. The variable x may itself be a value of another function, perhaps a function of time t. We may then write x = f(t), or just x(t) for short, in which case

F(x(t)) = q_e E(x(t)). (4)

Random variable A random variable is a physics variable that may assume different values when measured, with a certain probability assigned to each possible value. We may think of it as an ordered pair
(X, P : X ∋ x ↦ P(x) ∈ [0, 1]) (5)
where X is a domain and P(x) is the probability of the specific value x ∈ X occurring. The probability is restricted to real values within the line segment between 0 and 1, which is what [0, 1] means. This should not be confused with {0, 1}, which is a set of two elements, 0 and 1, that does not contain anything in between.

Creative Commons Attribution License (CC-BY 3.0)

We will often refer to a specific instance of (5), (x, P(x)) for short, also calling it a random variable. Otherwise, x can be used like a normal physics variable in physics expressions. However, its association with probability carries forward to anything the variable touches, meaning that when used as an argument in successive functions, it makes the functions' outcomes random too. And so, if, say, (x, P(x)) is a random variable, then, for example, (E(x), P(x)) also becomes a random variable, although the resulting probability in (E, P_E(E)) is not the same as P(x). There are ways to evaluate P_E(E), about which we'll learn more in Section 5.

Random variables and probability distributions A set of all pairs (x, P(x)) is the same as P : X ∋ x ↦ P(x) ∈ [0, 1], because we can understand a function as a certain relation, and a relation is a set of pairs; this is one of the definitions of a function. So a theory of random variables is essentially a theory of probability distributions P(x): their possible transformations, evolutions, and ways to evaluate them. Random variables themselves are merely arguments of such probability distributions. The notation and concepts used in the theory of Markov processes can sometimes be quite convoluted (and things get even worse when mathematicians lay their hands on them), so it will help us at times to get down to earth and remember that a random variable is simply a measurable quantity x that is associated with a certain probability of occurring, P(x).
Random variables and mathematicians The formal mathematical definition of a random variable is that it is a function between a probability space and a measurable space, the probability space being a triple (Ω, E, P) where Ω is a sample space, E is a set of subsets of Ω called events (it has to satisfy a certain list of properties to be such), and P is a probability measure (P(x) dx can be thought of as a measure in our context, assuming that X is a continuous domain). A space is said to be measurable if there exists a collection of its subsets with certain properties.

The reason why this formal mathematical definition is a bit opaque is that the process of repetitive measurements that yields probability densities for various values of a measured quantity is quite complex. It is intuitively easy to understand, once you have carried out such measurements yourself, but not so easy to capture in the well-defined mathematical structures that mathematicians like to work with. In particular, mathematicians do not like the frequency interpretation of probability and prefer to work with the Bayesian concept of it. It is easier to formalize.

Probability density (or distribution) In the following we will be interested in variables that are sampled from a continuous domain, x ∈ R̄ = [−∞, +∞] (the bar over R means that we have compactified R by adding the infinities to it; sometimes we may be interested in what happens when x → ∞, and in particular our integrals will normally run from −∞ to +∞). The associated probabilities will then be described in terms of probability densities such that
P(x) ≥ 0 everywhere (6)
and

∫_{−∞}^{+∞} P(x) dx = 1, (7)
the probability of finding x between, say, x₁ and x₂ being
P(x ∈ [x₁, x₂]) = ∫_{x₁}^{x₂} P(x) dx ∈ [0, 1]. (8)

Equation (7) then means that x has to be somewhere between −∞ and +∞, which is a trivial observation. Nevertheless, the resulting condition, as imposed on P(x), is not so trivial and has important consequences. Conversely, a function that satisfies (6) and (7) can always be thought of as a probability density. If P₁(x) and P₂(x) are two probability densities and they differ on more than a set of measure zero, then (x, P₁(x)) and (x, P₂(x)) are two different random variables. Here we assume that P₁ and P₂ are normal, well-defined functions, not the so-called generalized functions about which we'll say more later.

Cumulative probability distribution Once we have a P(x), we can construct another function out of it, namely

D(x) = ∫_{−∞}^{x} P(x′) dx′. (9)

This function is called the distribution of x or, more formally, the distribution of (x, P(x)), the latter emphasizing that it is a function of x and a functional of P, hence a property of the random variable (x, P(x)). Oftentimes people, physicists especially, call P(x) a distribution (or continuous distribution) too, and there is nothing that can be done about this. This nomenclature is entrenched. In this situation D(x) would be called a cumulative distribution function.

The cumulative distribution D(x) is used, for example, in computer generation of random numbers with arbitrary (not necessarily uniform) distributions, so it is a useful and important function. Still, in the following I will call P(x) a probability distribution too, and this will tie in nicely with the Schwartz theory of distributions, about which I'll say more later.
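The role of D(x) in random-number generation can be previewed already. Below is a minimal, illustrative Python sketch of inverse-CDF sampling, assuming the exponential density P(x) = e^(−x) for x ≥ 0 (treated later in this module), whose cumulative distribution D(x) = 1 − e^(−x) inverts to x = −ln(1 − u). The function name is my own, not from the text.

```python
import math
import random

def sample_exponential(n, seed=0):
    """Draw n samples from P(x) = exp(-x), x >= 0, by inverting
    the cumulative distribution D(x) = 1 - exp(-x)."""
    rng = random.Random(seed)
    # u is uniform on [0, 1); the inverse CDF is -ln(1 - u)
    return [-math.log(1.0 - rng.random()) for _ in range(n)]

samples = sample_exponential(100_000)
mean = sum(samples) / len(samples)  # should be close to <x> = 1
```

With u drawn uniformly from [0, 1), D⁻¹(u) is distributed with density P(x); this is the idea behind the Cumulative Probability Distribution Theorem of Section 5.3.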
2 Characteristics of Probability Distributions: Averages and Moments
Average The average value of f(x) on the statistical ensemble described by the random variable (x, P (x)) is given by
⟨f(x)⟩ = ∫_{−∞}^{+∞} f(x) P(x) dx. (10)
It is easy to see why. For a given xᵢ, the value returned by f is f(xᵢ). If we were to sample the various x-es N times (N → ∞), evaluating f(x) for each, the value xᵢ would pop up N P(xᵢ) dx times. Therefore the contribution of these samples to the total sum of the f(x)-es would be N f(xᵢ) P(xᵢ) dx. The total sum of the f(x)-es obtained in the sampling would therefore be
∫_{−∞}^{+∞} N f(x) P(x) dx, (11)
their average obtained by dividing by the number of samples N, thus leading to (10).

Mean In particular, the average of x itself (also called the mean of x) is
⟨x⟩ = ∫_{−∞}^{+∞} x P(x) dx. (12)
Moments Averages of higher powers of x are called moments. The average of x itself, ⟨x⟩, is the first moment. The second moment is
⟨x²⟩ = ∫_{−∞}^{+∞} x² P(x) dx (13)
and so on. The zeroth moment is
⟨x⁰⟩ = ∫_{−∞}^{+∞} x⁰ P(x) dx = ∫_{−∞}^{+∞} 1 · P(x) dx = 1, (14)
the normalization integral itself.

Existence of moments Although we may write ⟨xⁿ⟩ on paper, the integral that is implied may not exist. This depends on how quickly P(x) dies out. For some commonly used probability distributions, e.g., the Gaussian, all moments (assuming a finite n) exist; for other distributions, e.g., the Cauchy-Lorentz distribution (most physicists know it as the Lorentz bell curve encountered in the theory of resonance), none exist other than the zeroth one. The Cauchy-Lorentz random variable does not even have an average value defined! This has striking ramifications for data processing. If a particular process is described by the Cauchy-Lorentz distribution, averaging the results of measurements of the process makes no sense. What happens when we sample data so distributed is that the running average does not converge. It continues to jump all over the place.

If all finite moments of a given probability distribution P(x) exist, the average of any analytic function f of the random variable x can be expressed using the moments, namely
⟨f(x)⟩ = ∫_{−∞}^{+∞} f(x) P(x) dx
       = Σ_{n=0}^{∞} (1/n!) [dⁿf(x)/dxⁿ]_{x=0} ∫_{−∞}^{+∞} xⁿ P(x) dx
       = Σ_{n=0}^{∞} (1/n!) [dⁿf(x)/dxⁿ]_{x=0} ⟨xⁿ⟩. (15)
It is normally easier to just evaluate the integral than the infinite sum involving the derivatives of f(x) of all orders, but the above expression is sometimes useful in derivations and approximations.

The average value of x − ⟨x⟩ is, of course, zero. This is because there is as much of x on the left side of ⟨x⟩ as there is on the right side, weighed by P(x). We can see this by direct computation too:
⟨x − ⟨x⟩⟩ = ∫_{−∞}^{+∞} (x − ⟨x⟩) P(x) dx
          = ∫_{−∞}^{+∞} x P(x) dx − ⟨x⟩ ∫_{−∞}^{+∞} P(x) dx
          = ⟨x⟩ − ⟨x⟩ = 0. (16)
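The earlier remark that a Cauchy-Lorentz running average never settles down, while averages of well-behaved distributions do, is easy to watch numerically. A minimal sketch, assuming only the standard library; the tangent construction of Cauchy samples is a standard trick, not something derived in this module:

```python
import math
import random

def running_averages(samples):
    """Return the sequence of running means of the given samples."""
    total, out = 0.0, []
    for i, s in enumerate(samples, start=1):
        total += s
        out.append(total / i)
    return out

rng = random.Random(1)
n = 50_000
# Standard Cauchy-Lorentz samples: tan(pi * (u - 1/2)) with u uniform on (0, 1)
cauchy = [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]
gauss = [rng.gauss(0.0, 1.0) for _ in range(n)]

cauchy_avg = running_averages(cauchy)
gauss_avg = running_averages(gauss)
# The Gaussian running average settles near 0; the Cauchy one keeps jumping.
```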
Variance But the average value of (x − ⟨x⟩)² is not zero, except in one particular, somewhat pathological case about which we'll say more later. It is not zero because this time the contributions from the left and the right side of ⟨x⟩ do not subtract. This quantity is called the variance, and we are going to refer to it as σ². The reason for this is that the square root of the variance is called the standard deviation, and people who measure things (physicists and engineers especially) are intimately familiar with it:
var(x) = σ²(x) = ⟨(x − ⟨x⟩)²⟩. (17)
It is easy to express the variance of x in terms of the second and first moments of x:
σ²(x) = ⟨(x − ⟨x⟩)²⟩
      = ∫_{−∞}^{+∞} (x − ⟨x⟩)² P(x) dx
      = ∫_{−∞}^{+∞} (x² − 2x⟨x⟩ + ⟨x⟩²) P(x) dx
      = ⟨x²⟩ − 2⟨x⟩² + ⟨x⟩²
      = ⟨x²⟩ − ⟨x⟩². (18)
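The identity ⟨x²⟩ − ⟨x⟩² just derived can be sanity-checked numerically. A small sketch comparing the definitional form (17) with the moment form (18) on made-up sample data (the data and function names are illustrative only):

```python
def variance_by_definition(xs):
    """sigma^2 = <(x - <x>)^2>, straight from the definition (17)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def variance_by_moments(xs):
    """sigma^2 = <x^2> - <x>^2, the form derived in (18)."""
    m1 = sum(xs) / len(xs)
    m2 = sum(x * x for x in xs) / len(xs)
    return m2 - m1 * m1

data = [1.0, 2.0, 4.0, 8.0, 16.0]  # arbitrary illustrative sample
```

Both functions return the same number, as (18) promises; the moment form is popular in one-pass computations because it needs only running sums of x and x².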
Because the first integral in the evaluation above is manifestly positive (or zero), we find that

Observation 2.1. ⟨x²⟩ ≥ ⟨x⟩². (19)

It is an important inequality, used in many derivations and proofs in theories of random variables; quantum mechanics is one of them.

Sure variable When is the variance of x zero? Let us observe first that (x − ⟨x⟩)² is positive for all x with the exception of x = ⟨x⟩. Because P(x) is positive or zero too, in order for ∫_{−∞}^{+∞} (x − ⟨x⟩)² P(x) dx to be zero, P(x) must be zero everywhere where (x − ⟨x⟩)² isn't zero, that is, everywhere with the exception of x = ⟨x⟩. But Lebesgue measure theory tells us that if P(x) is non-zero at one point only, that is, on a set of measure zero, then the integral ∫_{−∞}^{+∞} P(x) dx must be zero, so the resulting P(x) cannot be a probability distribution. This is where the pathology creeps in.

Sure variables and Dirac delta Within the body of random variable theory we cannot describe a variable that is guaranteed to have a certain value at all times! Such a variable is called a sure variable. In order to include sure variables in our theory, we have to describe them in terms of the Dirac delta function, which is not a real function. It is a distribution (as in the Schwartz theory of distributions, not as defined by (9)), otherwise also known as a generalized function (this is Sobolev's terminology). A distribution is a functional that maps functions onto numbers. An integral is an example of a Schwartz distribution, and an informal integral of a Dirac delta is another example. In summary, admitting Dirac deltas to our arsenal of tools, x is a sure variable if it is described by the pair
(x, δ(x − x₀)) (20)
where δ(x) is the Dirac delta for which we have
δ(x) = 0 if x ≠ 0, (21)
and

δ(0) = +∞, (22)

where the infinity is "so large" that it defeats the measure theory and
∫_{−∞}^{+∞} δ(x) dx = 1. (23)
Variance vanishes for sure variables only To conclude, we can now say that if the variance of x vanishes, then x is a sure variable whose probability distribution function is δ(x − x₀), where x₀ = ⟨x⟩. The reverse is also true: if x is a sure variable, then its variance vanishes. This can be demonstrated trivially by evaluating the variance of (x, δ(x − x₀)). Ipso facto, for real random variables that aren't sure, the variance is always positive.
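The connection between vanishing variance and sure variables can be made concrete by approximating (x, δ(x − x₀)) with a uniform density on a shrinking interval and watching the variance go to zero. A tiny sketch, using the closed-form variance width²/12 of a uniform density, which is derived in the next section:

```python
def uniform_variance(width):
    """Variance of the uniform density on an interval of the given
    width: sigma^2 = width**2 / 12 (derived for the uniform
    distribution in the next section)."""
    return width ** 2 / 12.0

# A uniform density on [x0 - w/2, x0 + w/2] approximates the sure
# variable (x, delta(x - x0)) as w -> 0: its variance vanishes.
variances = [uniform_variance(10.0 ** -k) for k in range(6)]
```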
3 Examples of Probability Distributions

3.1 Uniform Distribution

A simple example of a probability distribution function is the uniform distribution on [x₁, x₂] (x₁ < x₂). The function is zero outside of [x₁, x₂] and 1/(x₂ − x₁) inside. The mean of such a uniform distribution is
⟨x⟩ = ∫_{−∞}^{+∞} x P(x) dx
    = (1/(x₂ − x₁)) ∫_{x₁}^{x₂} x dx
    = (x₂² − x₁²)/(2(x₂ − x₁))
    = (x₂ − x₁)(x₂ + x₁)/(2(x₂ − x₁))
    = (x₁ + x₂)/2. (24)
We can just as easily evaluate arbitrary (finite) moments of the distribution:

⟨xⁿ⟩ = ∫_{−∞}^{+∞} xⁿ P(x) dx
     = (1/(x₂ − x₁)) ∫_{x₁}^{x₂} xⁿ dx
     = (x₂ⁿ⁺¹ − x₁ⁿ⁺¹)/((n + 1)(x₂ − x₁)). (25)

It is useful to take this expression further:
⟨xⁿ⟩ = (1/(n + 1)) · x₂ⁿ⁺¹ (1 − (x₁/x₂)ⁿ⁺¹) / (x₂ (1 − x₁/x₂))
     = (x₂ⁿ/(n + 1)) · ((x₁/x₂)ⁿ⁺¹ − 1) / ((x₁/x₂) − 1). (26)
The expression to the right of 1/(n + 1) looks like the sum of a geometric series: the first element is x₂ⁿ and each successive element is constructed by multiplying the preceding one by x₁/x₂:
x₂ⁿ, x₁ x₂ⁿ⁻¹, x₁² x₂ⁿ⁻², …, x₁ⁿ, (27)
the sum of which can then be written as
Σ_{i=0}^{n} x₁ⁱ x₂ⁿ⁻ⁱ. (28)
N-th moment of uniform distribution In summary, the n-th moment of the distribution can be rewritten as

Observation 3.1. ⟨xⁿ⟩ = (1/(n + 1)) Σ_{i=0}^{n} x₁ⁱ x₂ⁿ⁻ⁱ. (29)

This form may sometimes be easier to use than (25), though generally not. But it's good to know that we can switch between (25) and (29). This may come in handy.

Variance and standard deviation In particular, using (29), the second moment of the distribution is

⟨x²⟩ = (1/3)(x₂² + x₁x₂ + x₁²) (30)

and so the variance is
σ² = ⟨x²⟩ − ⟨x⟩²
   = (1/3)(x₂² + x₁x₂ + x₁²) − (1/4)(x₂ + x₁)²
   = (1/12)(4x₂² + 4x₁x₂ + 4x₁² − 3x₂² − 6x₁x₂ − 3x₁²)
   = (1/12)(x₂² − 2x₁x₂ + x₁²) = (1/12)(x₂ − x₁)². (31)
Therefore the standard deviation is

σ = (x₂ − x₁)/(2√3). (32)
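Results (24), (25), (29), and (31) invite a quick consistency check. A short sketch, using exact rational arithmetic so the comparisons are equalities rather than approximations (the endpoint values are my own choice):

```python
from fractions import Fraction

def moment_closed_form(n, x1, x2):
    """n-th moment of the uniform density on [x1, x2], form (25)."""
    return (x2 ** (n + 1) - x1 ** (n + 1)) / ((n + 1) * (x2 - x1))

def moment_geometric_sum(n, x1, x2):
    """The same moment written as the finite sum (29)."""
    return sum(x1 ** i * x2 ** (n - i) for i in range(n + 1)) / (n + 1)

x1, x2 = Fraction(1), Fraction(4)  # exact rational endpoints

mean = moment_closed_form(1, x1, x2)             # (x1 + x2)/2, by (24)
var = moment_closed_form(2, x1, x2) - mean ** 2  # (x2 - x1)^2/12, by (31)
```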
Uniform distribution and Dirac delta If we were to name the uniform distribution P_{x₁x₂}(x), then we could think of the Dirac delta function as
δ(x − x₁) = lim_{x₂→x₁} P_{x₁x₂}(x). (33)
Although the value of P_{x₁x₂}(x) at x = x₁ would become infinite as x₂ → x₁, so that the limit does not really exist, when placed under an integral and multiplied by an arbitrary function, the limit acquires a well-defined meaning:
∫_{−∞}^{+∞} f(x) δ(x − x₁) dx = lim_{x₂→x₁} ∫_{−∞}^{+∞} f(x) P_{x₁x₂}(x) dx
                              = lim_{x₂→x₁} (1/(x₂ − x₁)) ∫_{x₁}^{x₂} f(x) dx (34)
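The limit in (34) can be observed numerically: the average of f over [x₁, x₂] tends to f(x₁) as x₂ → x₁. A small sketch with f(x) = x², an arbitrary test function of my choosing, using the exact antiderivative x³/3 so no numerical quadrature is needed:

```python
def averaged_against_uniform(x1, x2):
    """(1/(x2 - x1)) times the integral of f(x) = x^2 over [x1, x2],
    evaluated with the exact antiderivative x^3/3."""
    return (x2 ** 3 - x1 ** 3) / (3.0 * (x2 - x1))

x1 = 2.0
# Shrink x2 toward x1; the averages approach f(x1) = 4.0,
# which is exactly what integrating f against delta(x - x1) gives.
approximations = [averaged_against_uniform(x1, x1 + 10.0 ** -k) for k in range(1, 8)]
```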