Probability: A brief introduction (Dana Longcope 1/18/05)

Probability is a vast sub-topic within mathematics with numerous applications in Physics, Quantum Mechanics being only one. Mathematical treatments can appear quite daunting, but fortunately most of us have experience with random processes in life, games of chance and such things. The key concept is the random variable. A random variable $x$ is one which will assume a different value each time we measure it (sometimes we say each time it is "realized"). If we measure it $N$ times we find $N$ different values; we refer to the $i$-th measurement as $x_i$. There are basically two different kinds of random variable: discrete variables and continuous variables. We discuss these separately below.

Discrete random variables

A discrete random variable is one which can take only discrete values, let's say integers. For example, $d$ is the number of spots showing on a 6-sided die after I roll it (i.e. it is the roll of a die). $d$ may therefore take on the values 1, 2, . . . , 6, and no others; I cannot roll the values $d = \pi/2$ or $d = \sqrt{2}$. I might, for example, roll my die 10 different times and obtain the 10 realizations

$$d_1 = 5,\; d_2 = 1,\; d_3 = 3,\; d_4 = 4,\; d_5 = 4,\; d_6 = 3,\; d_7 = 6,\; d_8 = 2,\; d_9 = 3,\; d_{10} = 5$$

but the next time I rolled 10 times I would get 10 different values. Since $d$ is a random variable we don't know its value prior to rolling (at least that is the basic hypothesis of random variables). We characterize a random variable by listing the probabilities of its various outcomes. We denote by $P_d$ the probability that a given realization will assume the value $d$. A probability $P_j = 0$ means that outcome $d = j$ is completely impossible (so $P_7 = 0$ since the die doesn't have a 7-spotted side); $P_j = 1$ means that that particular outcome is a certainty. These are the two extremes in probability, and every probability must be within the range $0 \le P_j \le 1$ . . . always. There is no such thing as a negative probability, or a probability more certain than perfect certainty. In the case of a fair 6-sided die we know that all 6 possible outcomes are equally likely. Furthermore, since the sum of all probabilities must be one (more on that below), the value of each one must be 1/6:
$$P_1 = \tfrac{1}{6},\quad P_2 = \tfrac{1}{6},\quad P_3 = \tfrac{1}{6},\quad P_4 = \tfrac{1}{6},\quad P_5 = \tfrac{1}{6},\quad P_6 = \tfrac{1}{6} .$$
If I want to know the probability that $d$ will take one value from a set of possibilities I sum the probabilities of each outcome in the set. For example, the probability that $d$ will be an even number is
$$P(d \ \text{is even}) = P_2 + P_4 + P_6 = \tfrac{1}{6} + \tfrac{1}{6} + \tfrac{1}{6} = \tfrac{1}{2} .$$
A simple consequence of this fact is that if I sum up the probabilities of all possible outcomes I must get 1: we are perfectly certain that $d$ will assume some value. In probability this is called normalization:

$$\sum_j P_j = 1 . \qquad (1)$$
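Here is a minimal Python sketch (assuming the fair 6-sided die above) that checks the normalization condition (1) and the value $P(d\ \text{is even}) = 1/2$ by summing the tabulated probabilities:

    from fractions import Fraction

    # Probabilities for a fair 6-sided die: P_j = 1/6 for j = 1, ..., 6.
    P = {j: Fraction(1, 6) for j in range(1, 7)}

    # Normalization, eq. (1): the probabilities of all outcomes sum to 1.
    print(sum(P.values()))                     # 1

    # Probability of a set of outcomes: sum the probabilities of its members.
    print(sum(P[j] for j in P if j % 2 == 0))  # 1/2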

Let's consider taking a function of our random variable: $f(d)$. Since $d$ will take on only integer values, $f$ need only be defined for integers. Perhaps I am playing a game where a die roll $d$ wins me $f(d)$ dollars from the following payoff table

    d      1     2     3     4     5     6
    f(d)  -1     0    -1    0.5   0.5   0.5

(A negative value of $f$ means that I lose $|f|$ dollars.) The natural thing to ask is whether I should play this game. To answer this we compute the mean value or expectation of the function $f(d)$. The mean is defined as a sum over all possibilities

$$\langle f \rangle = \sum_d P_d \, f(d) \qquad (2)$$

From the payoff table above we find
$$\langle f \rangle = \tfrac{1}{6}\left(-1 + 0 - 1 + 0.5 + 0.5 + 0.5\right) = -\tfrac{1}{12} .$$
This means I lose, on average, about $0.08 each time I roll the die. Of course, I never lose $0.08 on a particular roll; that is just the mean value. One should be careful not to confuse the mean with the experimental average. The mean, $\langle f \rangle$, is found from knowledge of the probabilities. It is a precise number which is always the same. We will always use means in Quantum Mechanics I. The average, for which I'll use the notation $\bar{f}$, comes from a set of $N$ experimental realizations, $d_i$:

$$\bar{f} = \frac{1}{N} \sum_{i=1}^{N} f(d_i)$$

This is what you compute in the laboratory. Using the 10 realizations from earlier I get $\bar{f} = -0.15$. (This time, at least, I seem to have lost even more than I "expected".) The average will be different every time you perform a new set of experiments, and will almost never be the same as the mean. The useful relationship is that $\bar{f}$ will be approximately equal to $\langle f \rangle$ as long as $N$ is "large enough". This little tidbit goes by the name of the law of large numbers. The trick is to know what "large enough" really means. . . but we cannot get into that here. This is (probably) the last we'll say of averages in this class.
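To see the distinction concretely, here is a minimal Python sketch (the payoff table is the one above; the number of simulated rolls $N$ is an arbitrary choice) that computes the exact mean from equation (2) and compares it with the experimental average over $N$ simulated rolls:

    import random

    # Payoff table from above: f(d) in dollars for each face d = 1..6.
    f = {1: -1.0, 2: 0.0, 3: -1.0, 4: 0.5, 5: 0.5, 6: 0.5}

    # Exact mean, eq. (2): sum of P_d * f(d) with P_d = 1/6.
    mean_f = sum(f[d] / 6.0 for d in f)

    # Experimental average from N simulated die rolls.
    N = 10_000
    rolls = [random.randint(1, 6) for _ in range(N)]
    avg_f = sum(f[d] for d in rolls) / N

    print(mean_f)  # always -1/12 = -0.0833...
    print(avg_f)   # close to -0.0833..., but different on every run

Re-running the last few lines gives a different average each time, while the mean never changes; as $N$ grows the two agree better and better, which is the law of large numbers in action.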

Definition (2) gives the recipe for computing the mean of any function $f(d)$. It is worth making note of a few properties of the mean.

1. The mean is linear: If my function can be expressed as the sum of two functions, $f(d) = g(d) + h(d)$, then the mean of the sum is the sum of the means

$$\langle g + h \rangle = \langle g \rangle + \langle h \rangle .$$
If $\alpha$ is a constant (i.e. it does not depend on $d$ and is not otherwise random) then I can take it outside the mean:
$$\langle \alpha f \rangle = \alpha \langle f \rangle .$$

2. The mean of a number is that number:

$$\langle 3 \rangle = \sum_j 3 P_j = 3 \sum_j P_j = 3$$
where I've used the fact that $P_j$ is normalized (i.e. eq. [1]). Since the mean of a function, $\langle f \rangle$, is not itself a random variable, we consider it to be a number. This means that the mean of a mean is that mean:
$$\big\langle \langle f \rangle \big\rangle = \langle f \rangle .$$
This looks somewhat puzzling at first, but we will run into its likeness often in the future.

3. The mean of a product is NOT the product of the means: It is usually a bad idea to discuss something that is not true. But this case appears so often, and can cause so much harm if it is mistakenly used, that I felt it worth stating up front. In mathematical terms

$$\langle g h \rangle \neq \langle g \rangle \langle h \rangle$$
Please note that there is a not-equals sign in this expression. Among many other things this means that $\langle f^2 \rangle$ is different from $\langle f \rangle^2$. These are two different things (see the numerical check below).
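Here is a minimal Python sketch (using the fair die, with the simple choices $g(d) = d$ and $h(d) = d$ picked purely for illustration) that confirms the linearity property while showing that $\langle d^2 \rangle$ is not $\langle d \rangle^2$:

    # Fair die: P_d = 1/6 for d = 1..6.
    faces = range(1, 7)

    def mean(func):
        """Mean of a function of the die roll, eq. (2), with P_d = 1/6."""
        return sum(func(d) / 6.0 for d in faces)

    def g(d): return d          # g(d) = d
    def h(d): return d          # h(d) = d, so g*h = d^2

    # Linearity: <g + h> equals <g> + <h>.
    print(mean(lambda d: g(d) + h(d)), mean(g) + mean(h))  # 7.0  7.0

    # But <g h> is NOT <g><h>: <d^2> = 91/6 = 15.166..., <d>^2 = 12.25.
    print(mean(lambda d: g(d) * h(d)), mean(g) * mean(h))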

It is common to take means of the random variable itself and of various powers of it. For example, $\langle d \rangle = 3.5$ for the 6-sided die. This tells us that the mean roll is a 3.5. . . although it's not easy to know what that means. One way to state it is to say that $\langle d \rangle$ is the centroid of the distribution $P_d$. A given roll will differ from the mean by an amount $\delta d = d - \langle d \rangle$. If I use this to find the mean departure from the mean I find
$$\langle \delta d \rangle = \big\langle d - \langle d \rangle \big\rangle = \langle d \rangle - \big\langle \langle d \rangle \big\rangle = 0 .$$
(This was done laboriously on purpose; please check that you understand each step.) The trivial result came about because $d$ goes above the mean by as much as it goes below the mean ($\delta d$ is positive as much as it is negative). We can obtain a more informative result by calculating the mean of the square of the departure:

$$\langle (\delta d)^2 \rangle = \big\langle [\, d - \langle d \rangle \,]^2 \big\rangle = \big\langle d^2 - 2 d \langle d \rangle + \langle d \rangle^2 \big\rangle$$
$$= \langle d^2 \rangle - \big\langle 2 d \langle d \rangle \big\rangle + \big\langle \langle d \rangle^2 \big\rangle = \langle d^2 \rangle - 2 \langle d \rangle \langle d \rangle + \langle d \rangle^2$$
$$= \langle d^2 \rangle - \langle d \rangle^2$$
Note that this would also be trivial if $\langle d^2 \rangle$ were the same as $\langle d \rangle^2$; but it is not. In fact $\langle (\delta d)^2 \rangle \ge 0$, since it is a sum of non-negative numbers, so this exercise proves that $\langle d^2 \rangle \ge \langle d \rangle^2$ for any random variable. The expression above, which characterizes how far a given roll is expected to differ from the mean roll, is called the variance of the random variable $d$:

$$\mathrm{Var}(d) = \langle (\delta d)^2 \rangle = \langle d^2 \rangle - \langle d \rangle^2$$

$$\sigma_d = \sqrt{\langle (\delta d)^2 \rangle}$$
which tells, in some sense, how far from the mean the value is likely to be: it is the "width" of the distribution. For the case of die rolls we find that $\langle d^2 \rangle = 91/6$, so $\sigma_d = 1.71$. A roll will be, on average, within about 1.71 of the mean value 3.5. This statement will appear puzzling, knowing what you do about dice. That goes to show that "mean" and "standard deviation", while their definitions are precise, don't always convey the information you might want them to.
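Explicitly, the arithmetic behind $\langle d^2 \rangle = 91/6$ and $\sigma_d = 1.71$ is just definition (2) applied to $f(d) = d^2$:
$$\langle d^2 \rangle = \tfrac{1}{6}\left(1 + 4 + 9 + 16 + 25 + 36\right) = \tfrac{91}{6}, \qquad \mathrm{Var}(d) = \tfrac{91}{6} - (3.5)^2 = \tfrac{35}{12}, \qquad \sigma_d = \sqrt{\tfrac{35}{12}} \approx 1.71 .$$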

Continuous random variables

It is somewhat trickier to work with a random variable which may take on a continuum of values. Say you get your hands on a single neutron sitting by itself. On its own a neutron is unstable; it will live for a time $t$ before decaying into a proton by emitting an electron. The lifetime $t$ of that particular neutron is a random variable and it may take any value greater than zero. Since there is a continuum of possible values, the probability of any particular one, say $t = 1568.34$ seconds, is exactly zero. We must characterize the probability of a continuous random variable in terms of a probability density function, or pdf, $p(t)$. Like its brother densities, mass density or charge density, this is a function which must be integrated to give useful information. In this case integrating it over a range of values gives the probability that $t$ will assume a value in that range:

$$P(a < t < b) = \int_a^b p(t)\, dt . \qquad (3)$$
(Since there is no chance at all that $t$ will take on the values $a$ or $b$ exactly, we can replace the argument with $a \le t \le b$, $a \le t < b$ or $a < t \le b$. . . they're all the same.) Since $t$ is certain to take on some value (in this case a value greater than zero) the normalization condition on its pdf is

$$\int_0^\infty p(t)\, dt = 1 .$$
For a more general random variable $x$, which might take on negative values as well, we use the normalization

$$\int_{-\infty}^{\infty} p(x)\, dx = 1 . \qquad (4)$$
The density $p(t)$ must be positive in order that all probabilities are positive. It is not necessary that $p(t) < 1$, only that its integral never exceed 1. . . over any interval. It is actually possible (as you will see in your homework) for $p(x)$ to diverge, so long as it is an integrable singularity. Probabilities do not have dimensions; they are simple numbers. Probability densities do, however, have dimensions, which are the reciprocal of the random variable's dimension. For a decay time $t$ in seconds, $p(t)$ has units of $\mathrm{s}^{-1}$. (This way the integral, which is a probability, has no dimensions.)
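To make equations (3) and (4) concrete, here is a minimal Python sketch using a uniform density on $0 < x < 2$, $p(x) = 1/2$ (a made-up example, chosen only because the integrals are easy to check by hand):

    from scipy.integrate import quad

    def p(x):
        """Uniform pdf on 0 < x < 2: p(x) = 1/2 there, zero elsewhere."""
        return 0.5 if 0.0 < x < 2.0 else 0.0

    # Normalization, eq. (4): the integral over all x is 1.
    total, _ = quad(p, 0.0, 2.0)
    print(total)    # 1.0

    # Eq. (3): probability that x falls between a = 0.5 and b = 1.5.
    prob, _ = quad(p, 0.5, 1.5)
    print(prob)     # 0.5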

Nuclear lifetimes, such as that of the neutron, are distributed according to a pdf called an exponential distribution,
$$p(t) = \frac{1}{\tau} e^{-t/\tau} , \qquad t > 0 , \qquad (5)$$
where $\tau$ is the mean lifetime. The mean lifetime of a neutron is $\tau = 1039$ sec. You can quickly verify that expression (5) satisfies the normalization condition.
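Indeed, carrying out the integral confirms the normalization:
$$\int_0^\infty \frac{1}{\tau}\, e^{-t/\tau}\, dt = \Big[ -e^{-t/\tau} \Big]_0^\infty = 0 - (-1) = 1 .$$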

The probability that the nucleus decays during the interval $0 < t < T$ is
$$P(0 < t < T) = \int_0^T p(t)\, dt = \frac{1}{\tau} \int_0^T e^{-t/\tau}\, dt = 1 - e^{-T/\tau} .$$
If, for example, we set $T = \tau \ln 2$ the probability of decay is exactly one-half; that time is known as its half-life (the half-life of a neutron is 12 minutes).

There is a 50% chance that it will decay before $t = T$ and an equal chance that it will decay some time after that. It seems lopsided, but that's how these things work. If you have a very large collection of unstable atoms (and atoms are so small it's hard to have anything but large numbers of them) then after waiting a time $T$ about half of them will have decayed, and half will still remain. In a time of twice the half-life, $T = 2\tau \ln 2 = \tau \ln 4$, we find there to be a $P = 3/4$ probability of decay; only one-quarter of our atoms will be left. It is the nature of the exponential distribution that during each successive half-life approximately one half of the remaining atoms will decay. (I say "approximately" here because our collection of atoms constitutes a set of independent realizations of the random variable $t$: $t_1, t_2, \ldots, t_N$. Their combined behavior will be an average which is only approximately equal to the mean.)
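Here is a minimal Python sketch of that last point (the sample size $N$ is an arbitrary choice): it draws $N$ exponentially distributed lifetimes with $\tau = 1039$ sec and counts the fraction that have decayed after one and after two half-lives.

    import numpy as np

    tau = 1039.0                    # mean neutron lifetime, in seconds
    T_half = tau * np.log(2)        # half-life, about 720 s = 12 minutes

    # N independent realizations of the lifetime t, drawn from eq. (5).
    N = 100_000
    rng = np.random.default_rng()
    t = rng.exponential(scale=tau, size=N)

    # Fractions decayed: roughly 1/2 after one half-life, 3/4 after two.
    print(np.mean(t < T_half))      # close to 0.5
    print(np.mean(t < 2 * T_half))  # close to 0.75

The fractions fluctuate a little from run to run; that is exactly the average-versus-mean distinction from the discrete case.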

The mean of a function of a continuous random variable, $f(x)$, is defined by analogy to the discrete case,
$$\langle f \rangle = \int_{-\infty}^{\infty} f(x)\, p(x)\, dx , \qquad (6)$$
integrating over the entire range of values the random variable $x$ might assume (for generality I've taken this to be $-\infty < x < \infty$). The mean value is
$$\langle x \rangle = \int_{-\infty}^{\infty} x\, p(x)\, dx ,$$
and so on. For the exponential distribution we find

$$\langle t \rangle = \int_0^\infty t\, p(t)\, dt = \frac{1}{\tau} \int_0^\infty t\, e^{-t/\tau}\, dt = \tau$$
$$\langle t^2 \rangle = \frac{1}{\tau} \int_0^\infty t^2\, e^{-t/\tau}\, dt = 2\tau^2$$
From this we can see that $\tau$ is indeed the mean lifetime, as I had originally claimed, that the variance of the lifetime is $\mathrm{Var}(t) = \langle t^2 \rangle - \langle t \rangle^2 = 2\tau^2 - \tau^2 = \tau^2$, and that its standard deviation is $\sigma_t = \tau$.
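As a final check, here is a minimal Python sketch that evaluates these two integrals numerically for $\tau = 1039$ sec and compares them with $\tau$ and $2\tau^2$:

    import numpy as np
    from scipy.integrate import quad

    tau = 1039.0

    def p(t):
        """Exponential pdf, eq. (5), for t > 0."""
        return np.exp(-t / tau) / tau

    mean_t, _ = quad(lambda t: t * p(t), 0.0, np.inf)
    mean_t2, _ = quad(lambda t: t**2 * p(t), 0.0, np.inf)

    print(mean_t, tau)                   # both about 1039
    print(mean_t2, 2 * tau**2)           # both about 2.16e6
    print(np.sqrt(mean_t2 - mean_t**2))  # standard deviation, about 1039 = tau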
