
Probability and basic definitions

• Statistics is a mathematical discipline that allows us to understand phenomena shaped by many events that we cannot keep track of. Since we lack the information needed to predict the individual events, we consider them random. However, the collective effect of many such events can usually be understood and predicted. Precisely this problem is the subject of statistical mechanics, which applies mathematical statistics to describe physical systems with many degrees of freedom (whose dynamics appears random to us due to its complexity).

• Events or outcomes are mathematically represented by values that a random variable x can take in a measurement. Most generally, if the random variable can have any value from a set X, then an event is defined by a subset A ⊆ X. If one measures x and finds x ∈ A, then the event A occurred. The most elementary event {x} is specified by a single value that x can take.

• Ensemble and probability: One assumes that every measurement is done in exactly the same conditions and independently of all other measurements. A way to facilitate such a complete equality of all measurements is to prepare an ensemble of N identical systems and perform one measurement of x on each. If the outcome x is obtained N(x) times, then the objective probability p(x) of this outcome is defined as:
\[
p(x) \equiv p(\{x\}) = \lim_{N\to\infty} \frac{N(x)}{N}
\]
For non-elementary events:
\[
p(A) = \sum_{x\in A} p(\{x\}) = \lim_{N\to\infty} \frac{N(x\in A)}{N}
\]
where N(x ∈ A) is the number of events A in N measurements (the number of times it was found that x ∈ A). One must often estimate an objective probability using fundamental principles, as in statistical mechanics; such estimates can be considered subjective probabilities.
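As a minimal numerical illustration of the frequency definition above (a sketch, not part of the original notes): simulate an ensemble of N throws of a fair six-sided die and estimate the probability of an elementary and a non-elementary event from the fraction of outcomes falling in each. The die and the events are hypothetical examples chosen here.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 100_000                      # ensemble size (number of identical measurements)
x = rng.integers(1, 7, size=N)   # each measurement: one throw of a fair six-sided die

# Elementary event {4}: estimate p({4}) = N(4)/N, expected 1/6
p_4 = np.mean(x == 4)

# Non-elementary event A = {2, 4, 6}: p(A) = sum of elementary probabilities, expected 1/2
A = [2, 4, 6]
p_A = np.mean(np.isin(x, A))

print(f"p({{4}}) ~ {p_4:.4f}  (exact 1/6 = {1/6:.4f})")
print(f"p(A)    ~ {p_A:.4f}  (exact 1/2 = 0.5)")
```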

• In general, probability has the following properties (which follow trivially from the definition):
\[
0 \le p(x) \le 1 \quad (\forall x\in X), \qquad \sum_{x\in X} p(x) = p(X) = 1
\]

• An event represented by the union A ∪ B of two sets A and B occurs if either one of the events A or B occurs. It is possible for both events A and B to occur at the same time if their intersection A ∩ B is not an empty set. However, we consider A and B as outcomes of a single measurement of a scalar random variable. The events A and B are not independent, since the occurrence of one can exclude the other (when their intersection is empty). Under these assumptions, the following can be easily deduced from the definition of probability:
\[
p(A\cup B) = p(A) + p(B) - p(A\cap B)
\]

• Two events A and B are independent if they correspond to the measurements of two uncorrelated random variables a ∈ X_a and b ∈ X_b. The event A occurs if a ∈ A and the event B occurs if b ∈ B. If we label by AB (or A ⊗ B) the event in which both A and B happen (mathematically, (a, b) ∈ A ⊗ B), then it can be easily shown from the definition of probability that:
\[
p(AB) = p(A)\,p(B)
\]

• Conditional probability: If an event B occurs (and one observes it), then the conditional probability p(A|B) that an event A occurs as well at the same time is:
\[
p(A|B) = \frac{p(AB)}{p(B)}
\]
If A and B are independent, then the probability of A has nothing to do with B and p(A|B) = p(A).

• If the random variable x takes values from a continuous set X (such as the real numbers), then one can sensibly define probability only for a range of values, e.g.:

\[
p(x_1, x_2) = \lim_{N\to\infty} \frac{N(x_1 \le x < x_2)}{N}
\]

where N(x_1 ≤ x < x_2) is the number of outcomes x_1 ≤ x < x_2 among N experiments. Cumulative probability function: P(x) = p(−∞, x) is the probability that a measurement outcome will be smaller than x. The most direct substitute for the probability function itself is the probability distribution function (PDF), or probability density:

\[
f(x) = \frac{dP(x)}{dx}
\]
The probability that a measurement outcome will be in the interval (x, x + dx) is f(x)dx. The PDF is normalized:
\[
\int_{-\infty}^{\infty} f(x)\,dx = 1
\]
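A short numerical sketch (not from the original notes) of the relation f(x) = dP(x)/dx: estimate the cumulative probability P(x) from sampled data, differentiate it numerically, and compare with a normalized histogram. The exponential sample used here is just an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(1)
samples = rng.exponential(scale=1.0, size=200_000)  # arbitrary continuous distribution

# Empirical cumulative probability P(x) on a grid
grid = np.linspace(0.0, 5.0, 51)
P = np.array([np.mean(samples < x) for x in grid])

# PDF estimate as the numerical derivative f(x) = dP/dx
f_from_cdf = np.gradient(P, grid)

# PDF estimate from a normalized histogram on the same bins
hist, edges = np.histogram(samples, bins=grid, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

# Both estimates should be close to the true density exp(-x) on this interval
print("max |dP/dx - exp(-x)|:     ", np.max(np.abs(f_from_cdf[1:-1] - np.exp(-grid[1:-1]))))
print("max |histogram - exp(-x)|: ", np.max(np.abs(hist - np.exp(-centers))))
```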

• The mean, or expectation value, of the random variable x ∈ X is labeled ⟨x⟩ or x̄, and defined by:

\[
\langle x\rangle = \sum_{x\in X} x\,p(x) \qquad \text{or} \qquad \langle x\rangle = \int dx\, x f(x)
\]
for discrete and continuous distributions respectively. It is the single value that best represents the entire distribution of random outcomes. If the distribution scatters little about the most probable outcome x_most, then ⟨x⟩ ≈ x_most (but they are generally not equal). If the distribution scatters symmetrically about x_0 (which need not be equal to x_most), then ⟨x⟩ = x_0. The average value is especially useful for learning about the statistics of sums of random variables. One similarly defines the expectation value of any function of x:

\[
\langle g(x)\rangle = \sum_{x\in X} g(x)\,p(x) \qquad \text{or} \qquad \langle g(x)\rangle = \int dx\, g(x) f(x)
\]

• The variance of a random distribution is defined by:
\[
\mathrm{Var}(x) = \langle (x-\langle x\rangle)^2\rangle = \sum_{x\in X} (x-\langle x\rangle)^2\, p(x)
\]
It is usually more convenient to calculate it as:

\[
\langle (x-\langle x\rangle)^2\rangle = \langle x^2 - 2x\langle x\rangle + \langle x\rangle^2\rangle = \langle x^2\rangle - 2\langle x\rangle\langle x\rangle + \langle x\rangle^2 = \langle x^2\rangle - \langle x\rangle^2
\]

The standard deviation σ = √Var(x) = √⟨(x − ⟨x⟩)²⟩ is the measure of how much the random outcomes scatter about the average value of the distribution (the average of the deviation from the mean; the deviation is squared inside the average in order to avoid the cancellation of scatters in opposite directions, but the square root eventually undoes this squaring and makes σ directly comparable to the outcomes of x). A sharp distribution that scatters little has a small σ.
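The identity Var(x) = ⟨x²⟩ − ⟨x⟩² is easy to check numerically. Below is a small sketch (not from the original notes) that computes the mean, the variance by its definition, and the variance by the shortcut formula for an arbitrary discrete distribution.

```python
import numpy as np

# An arbitrary discrete distribution: outcomes x with probabilities p (must sum to 1)
x = np.array([1.0, 2.0, 3.0, 4.0])
p = np.array([0.1, 0.2, 0.3, 0.4])

mean = np.sum(x * p)                        # <x>
var_def = np.sum((x - mean) ** 2 * p)       # <(x - <x>)^2>
var_short = np.sum(x ** 2 * p) - mean ** 2  # <x^2> - <x>^2
sigma = np.sqrt(var_def)                    # standard deviation

print(mean, var_def, var_short, sigma)      # the two variance values coincide
```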

Uniform distribution

• A random variable x whose possible outcomes are all equally likely belongs to the uniform probability distribution U(X). If the set X of all possible outcomes is finite and has N elements, then:
\[
p(x) = \frac{1}{N} \quad (\forall x\in X)
\]
Otherwise, if X = (a, b) is a continuous range of real numbers between a and b, then:
\[
f(x) = \frac{1}{b-a}
\]

\[
\langle x\rangle = \int_a^b dx\, x f(x) = \frac{1}{b-a}\int_a^b dx\, x = \frac{1}{b-a}\,\frac{b^2-a^2}{2} = \frac{b+a}{2}
\]
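A quick numerical check (a sketch, not part of the notes) of ⟨x⟩ = (a + b)/2 for the continuous uniform distribution, with arbitrarily chosen endpoints a and b:

```python
import numpy as np

rng = np.random.default_rng(2)
a, b = 1.0, 4.0                           # arbitrary interval endpoints
x = rng.uniform(a, b, size=1_000_000)     # samples from U(a, b)

print(np.mean(x), (a + b) / 2)            # sample mean vs (a + b)/2
```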

• The random variable
\[
n = \sum_{i=1}^{N} x_i, \qquad
x_i = \begin{cases} 1 & \text{with probability } p \\ 0 & \text{with probability } 1-p \end{cases} \quad (\forall i)
\]
belongs to the binomial distribution: n ∼ B(N, p). This distribution describes the statistics of the number n of random events x_i = 1, which independently occur with probability p. Example applications are tossing a coin, throwing a die (how many times n one gets a particular number on the die, with probability p = 1/6, in N attempts), a random walk (how far a random walker gets after N steps if the probability of a forward step is p), etc.

• Let W(n) be the probability that n events will occur in N attempts. In order to determine W(n) experimentally, one must carry out M times a sequence of N measurements of the random variables x_i, and count how many times M(n) there were n positive outcomes x_i = 1 in a sequence; then W(n) ≈ M(n)/M for large M. We will determine this probability analytically. Consider a single sequence. The probability that the sequence

will be 011 ··· 01, for example, is equal to the probability that x_1 = 0 (1 − p) and x_2 = 1 (p) and x_3 = 1 (p) and ... x_{N−1} = 0 (1 − p) and x_N = 1 (p). Since all individual events are independent, the probability of the whole sequence is the product of the individual event probabilities:

W (011 ··· 01) = (1 − p) · p · p ··· (1 − p) · p

The probability of a particular long sequence can be very small because the number of possible sequences

is very large. If there were n outcomes x_i = 1 in the sequence, and hence N − n outcomes x_i = 0, then:

\[
W(\text{sequence}) = p^n (1-p)^{N-n}
\]

There are many different sequences that have the same number n of positive outcomes. They are non-overlapping (mutually exclusive) events, so their probabilities add up:

\[
W(n) = \binom{N}{n} p^n (1-p)^{N-n}
\]
where the binomial coefficient

\[
\binom{N}{n} \equiv \frac{N!}{n!\,(N-n)!} = \frac{(N-n+1)(N-n+2)\cdots(N-1)N}{n!}
\]

is the number of sequences that have n positive outcomes, i.e. the number of ways one can select n elements from a set of N elements (there are N ways to select the 1st element, then N − 1 ways to select the 2nd element, down to N − n + 1 ways to select the last, nth element from the remaining ones; the numerator on the right is the number of possible selections, but among them each group of selected elements is assembled in n! different orders and we do not care about the order).
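As a sketch (not from the original notes), the formula W(n) = C(N, n) pⁿ(1−p)^{N−n} can be checked against the experimental procedure described above: repeat M sequences of N trials and compare the observed frequencies M(n)/M with the analytic probabilities.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(3)
N, p, M = 10, 0.3, 200_000             # arbitrary choice of parameters

# M sequences of N trials; count positive outcomes n in each sequence
n_counts = rng.binomial(1, p, size=(M, N)).sum(axis=1)

for n in range(N + 1):
    W_empirical = np.mean(n_counts == n)                   # M(n) / M
    W_analytic = comb(N, n) * p**n * (1 - p)**(N - n)      # binomial formula
    print(f"n={n:2d}  empirical={W_empirical:.4f}  analytic={W_analytic:.4f}")
```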

• Normalization:
\[
\sum_{n=0}^{N} W(n) = \sum_{n=0}^{N} \binom{N}{n} p^n (1-p)^{N-n} = \bigl[p + (1-p)\bigr]^N = 1
\]
To see this, just consider brute-force expanding the square brackets above (without disassembling the brackets for 1 − p). There are N copies of [p + (1 − p)] that are multiplied together and generate the sum of all possible products of N factors that can be either p or 1 − p...

• The expectation value of the binomial distribution B(N, p) is pN, which can be easily understood given that it represents the number of positive outcomes of probability p out of N attempts (recall the very definition of objective probability). Formally:

\[
\begin{aligned}
\langle n\rangle = \sum_{n=0}^{N} n\,W(n) &= \sum_{n=1}^{N} n\,\frac{N!}{n!\,(N-n)!}\, p^n (1-p)^{N-n} = \sum_{n=1}^{N} \frac{N!}{(n-1)!\,(N-n)!}\, p^n (1-p)^{N-n} \\
&= pN \sum_{n=1}^{N} \frac{(N-1)!}{(n-1)!\,[(N-1)-(n-1)]!}\, p^{n-1} (1-p)^{(N-1)-(n-1)} \\
&= pN \sum_{n'=0}^{N-1} \binom{N-1}{n'} p^{n'} (1-p)^{N-1-n'} \\
&= pN
\end{aligned}
\]

The last line follows from the normalization of the binomial distribution B(N − 1, p).

• The variance of B(N, p) is p(1 − p)N. We find this from:

\[
\begin{aligned}
\langle n^2 - n\rangle = \langle n(n-1)\rangle &= \sum_{n=2}^{N} n(n-1)\,\frac{N!}{n!\,(N-n)!}\, p^n (1-p)^{N-n} = \sum_{n=2}^{N} \frac{N!}{(n-2)!\,(N-n)!}\, p^n (1-p)^{N-n} \\
&= p^2 N(N-1) \sum_{n=2}^{N} \frac{(N-2)!}{(n-2)!\,[(N-2)-(n-2)]!}\, p^{n-2} (1-p)^{(N-2)-(n-2)} \\
&= p^2 N(N-1) \sum_{n'=0}^{N-2} \binom{N-2}{n'} p^{n'} (1-p)^{N-2-n'} \\
&= p^2 N(N-1)
\end{aligned}
\]

and

\[
\mathrm{Var}(n) = \langle n^2\rangle - \langle n\rangle^2 = \langle n^2 - n\rangle + \langle n\rangle - \langle n\rangle^2 = p^2 N(N-1) + pN - p^2 N^2 = p(1-p)N
\]
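A minimal numerical check (a sketch, not from the notes) of ⟨n⟩ = pN and Var(n) = p(1 − p)N:

```python
import numpy as np

rng = np.random.default_rng(4)
N, p, M = 50, 0.2, 500_000            # arbitrary parameters

n = rng.binomial(N, p, size=M)        # M draws of n ~ B(N, p)

print(np.mean(n), p * N)              # sample mean vs pN
print(np.var(n), p * (1 - p) * N)     # sample variance vs p(1-p)N
```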

Poisson distribution

• Consider a physical system that experiences random uncorrelated events of a certain kind with an average rate of λ events per unit time (for example, a collection of radioactive atoms whose lifetime is τ = λ⁻¹ will experience decay events with rate λ). Since the events are uncorrelated, their number n(t) in a time interval t does not depend on the instant (t = 0) at which one begins to count them. The statistics of the random variable n(t) is described by the Poisson distribution, n(t) ∼ Pois(λ, t).

• The probability W_{λ,t}(n) to observe n events in a time interval t can be obtained from the binomial distribution. Divide the time interval t into N → ∞ infinitesimal intervals ∆t = t/N. If p is the probability for a single event to occur in a time interval ∆t, then the average number of events that occur in ∆t is:
\[
\langle n(\Delta t)\rangle = 0\times(1-p) + 1\times p = p \;\to\; \lambda\,\Delta t = \frac{\lambda t}{N}
\]
We computed this first as a statistical expectation value of the random variable n(∆t) that counts how many events occurred in the interval ∆t. In doing so, we neglected the possibility that more than one event could have happened in such a small interval ∆t → 0, and treated n(∆t) as a binary variable that takes only the values 0 or 1. Then, we used the definition of the rate λ to express ⟨n(∆t)⟩ in terms of ∆t. Now, since n(∆t) is binary, we can regard the total number of events in the period t

\[
n(t) = \sum_{i=1}^{N} n_i(\Delta t)
\]

as a random variable from the binomial distribution B(N, p) in the limit N → ∞, p = λt/N. Each individual measurement in the equivalent B(N, p) distribution refers to whether an event n_i(∆t) = 1 occurred (with probability p) in an infinitesimal interval (i − 1)∆t ≤ t < i∆t. Therefore,
\[
\begin{aligned}
W_{\lambda,t}(n) = \lim_{N\to\infty} \binom{N}{n} p^n (1-p)^{N-n}
&= \lim_{N\to\infty} \frac{N!}{n!\,(N-n)!} \left(\frac{\lambda t}{N}\right)^n \left(1-\frac{\lambda t}{N}\right)^{N-n} \\
&= \lim_{N\to\infty} \frac{1}{n!}\, N(N-1)(N-2)\cdots(N-n+1) \left(\frac{\lambda t}{N\bigl(1-\frac{\lambda t}{N}\bigr)}\right)^n \left(1-\frac{\lambda t}{N}\right)^N \\
&= \lim_{N\to\infty} \frac{1}{n!}\, N^n \left(\frac{\lambda t}{N\bigl(1-\frac{\lambda t}{N}\bigr)}\right)^n \left(1-\frac{\lambda t}{N}\right)^N \\
&= \lim_{N\to\infty} \frac{1}{n!} \left(\frac{\lambda t}{1-\frac{\lambda t}{N}}\right)^n \left(1-\frac{\lambda t}{N}\right)^N
 = \frac{(\lambda t)^n}{n!} \lim_{N\to\infty} \left(1-\frac{\lambda t}{N}\right)^N \\
&= \frac{(\lambda t)^n}{n!}\, e^{-\lambda t}
\end{aligned}
\]
In the 2nd line we just canceled out the common factors in N! and (N − n)! and reorganized the terms involving λ, without making any approximations. In the 3rd line we neglected all appearances of n next to N, since n is finite and N → ∞. Next, we pulled the factor of N^n into the (···)^n term, and then approximated 1 − λt/N ≈ 1 in the denominator. In the last step, we used the definition of the exponential function exp(x) ≡ lim_{N→∞}(1 + x/N)^N.

• The expectation value and variance of Pois(λ, t) can be computed brute-force using the obtained probability W_{λ,t}(n), or elegantly from the binomial distribution results:
\[
\langle n(t)\rangle = \lim_{N\to\infty} Np = \lim_{N\to\infty} N\,\frac{\lambda t}{N} = \lambda t
\]
\[
\mathrm{Var}\,n(t) = \lim_{N\to\infty} Np(1-p) = \lim_{N\to\infty} N\,\frac{\lambda t}{N}\left(1 - \frac{\lambda t}{N}\right) = \lambda t
\]
We see that the average has the expected value (from the definition of the rate λ as the average number of events per unit time), and the variance is equal to the average.
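A numerical sketch (not in the original notes) of the limit derived above: for large N the binomial probabilities B(N, p = λt/N) approach the Poisson probabilities (λt)ⁿ e^{−λt}/n!.

```python
from math import comb, exp, factorial

lam_t = 3.0                 # the product λt, arbitrary choice
N = 10_000                  # number of infinitesimal sub-intervals (large)
p = lam_t / N

for n in range(8):
    W_binomial = comb(N, n) * p**n * (1 - p)**(N - n)
    W_poisson = lam_t**n * exp(-lam_t) / factorial(n)
    print(f"n={n}  binomial={W_binomial:.6f}  Poisson={W_poisson:.6f}")
```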

Exponential distribution

• The time interval τ > 0 between two successive events that occur with an average rate λ is a random variable belonging to the exponential distribution Exp(λ). This distribution is continuous.

• If f(τ) is the exponential PDF, then f(τ)dτ is the probability that an event will occur in the time interval τ ≤ t < τ + dτ after the previous event. We can obtain this probability as the product of Poisson probabilities for two independent outcomes: 1) no event occurs in 0 ≤ t < τ, and 2) one event occurs in τ ≤ t < τ + dτ:
\[
f(\tau)\,d\tau = W_{\lambda,\tau}(0)\times W_{\lambda,d\tau}(1) = e^{-\lambda\tau}\,\frac{(\lambda\tau)^0}{0!} \times e^{-\lambda\,d\tau}\,\frac{(\lambda\,d\tau)^1}{1!}
\]
In the limit dτ → 0, we find:
\[
f(\tau) = \lambda e^{-\lambda\tau}
\]
Verify by normalization (using the change of variables x = λτ):

\[
\int_0^{\infty} d\tau\, f(\tau) = \lambda \int_0^{\infty} d\tau\, e^{-\lambda\tau} = \int_0^{\infty} dx\, e^{-x} = 1
\]

• The average time interval between two successive Poisson events is as expected (from the definition of the rate λ):
\[
\langle\tau\rangle = \int_0^{\infty} d\tau\, \tau f(\tau) = \lambda \int_0^{\infty} d\tau\, \tau e^{-\lambda\tau} = \frac{1}{\lambda}\int_0^{\infty} dx\, x e^{-x} = \frac{1}{\lambda}
\]
The variance is:
\[
\mathrm{Var}(\tau) = \langle\tau^2\rangle - \langle\tau\rangle^2 = \lambda \int_0^{\infty} d\tau\, \tau^2 e^{-\lambda\tau} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}\int_0^{\infty} dx\, x^2 e^{-x} - \frac{1}{\lambda^2} = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}
\]
so the standard deviation is equal to the mean. The last integral is solved using integration by parts, and is known as the Gamma function:
\[
\Gamma(n+1) \equiv \int_0^{\infty} dx\, x^n e^{-x} = \Bigl[-x^n e^{-x}\Bigr]_0^{\infty} + n\int_0^{\infty} dx\, x^{n-1} e^{-x} = \cdots = n!\int_0^{\infty} dx\, e^{-x} = n!
\]
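A simulation sketch (not part of the notes): generate Poisson events as a long sequence of independent infinitesimal-interval trials, measure the waiting times between successive events, and compare their mean and variance with 1/λ and 1/λ², respectively.

```python
import numpy as np

rng = np.random.default_rng(5)
lam = 2.0                      # event rate λ (arbitrary)
dt = 1e-3                      # small time step Δt
steps = 5_000_000              # total simulated time = steps * dt

# Bernoulli process: at most one event per step, with probability p = λΔt
events = rng.random(steps) < lam * dt
times = np.flatnonzero(events) * dt        # event times
tau = np.diff(times)                       # waiting times between successive events

print(np.mean(tau), 1 / lam)               # mean waiting time vs 1/λ
print(np.var(tau), 1 / lam**2)             # variance vs 1/λ^2
```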

Gaussian distribution

• The Gaussian distribution is of fundamental importance in mathematics and statistical mechanics because of the central limit theorem (proved later): the average of many independent random variables always converges to a Gaussian distribution. Many quantities of interest in statistical mechanics will be averages over many degrees of freedom in macroscopic systems (e.g. the average velocity of a particle in a gas), so their statistics will be described by Gaussian distributions.

• The Gaussian or normal distribution N(µ, σ) is a continuous probability distribution defined on x ∈ (−∞, ∞) by the PDF:
\[
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
\]
The mean and standard deviation are µ and σ respectively.

• The coefficient 1/(σ√2π) is required by normalization:

\[
\int_{-\infty}^{\infty} dx\, f(x) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} dx\, \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) = \frac{1}{\sigma\sqrt{2\pi}} \times \sqrt{2\sigma^2} \int_{-\infty}^{\infty} d\xi\, e^{-\xi^2} = \frac{1}{\sigma\sqrt{2\pi}} \times \sqrt{2\sigma^2} \times \sqrt{\pi} = 1
\]
Here, we changed variables to ξ = (x − µ)/√(2σ²) and used the well known integral:

\[
I = \int_{-\infty}^{\infty} d\xi\, e^{-\xi^2} = \sqrt{\pi}
\]

which can be derived from:

\[
I^2 = \int_{-\infty}^{\infty} dx \int_{-\infty}^{\infty} dy\, e^{-(x^2+y^2)} = \int_0^{2\pi} d\theta \int_0^{\infty} r\,dr\, e^{-r^2} = 2\pi \times \frac{1}{2}\int_0^{\infty} d(r^2)\, e^{-r^2} = \pi
\]

by switching from Cartesian (x, y) to polar (r, θ) coordinates.

• The mean is:

\[
\begin{aligned}
\langle x\rangle = \int_{-\infty}^{\infty} dx\, x f(x) &= \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} dx\, (x-\mu+\mu) \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \\
&= \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} dx\, (x-\mu) \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) + \mu \int_{-\infty}^{\infty} dx\, f(x) = \mu
\end{aligned}
\]

Using the trick x = x − µ + µ we separated the initial integral into two. The first one vanishes because it is an integral of an odd function (x − µ) × exp(···) over a symmetric interval. The second integral is simply µ multiplied by the normalization of the PDF.

• The variance is:

\[
\begin{aligned}
\mathrm{Var}(x) = \langle(x-\langle x\rangle)^2\rangle = \int_{-\infty}^{\infty} dx\, (x-\mu)^2 f(x)
&= \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} dx\, (x-\mu)^2 \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \\
&= \frac{1}{\sigma\sqrt{2\pi}} \times (2\sigma^2)^{3/2} \int_{-\infty}^{\infty} d\xi\, \xi^2 e^{-\xi^2} \\
&= \frac{1}{\sigma\sqrt{2\pi}} \times (2\sigma^2)^{3/2} \left( \left[-\xi\,\frac{e^{-\xi^2}}{2}\right]_{-\infty}^{\infty} + \frac{1}{2}\int_{-\infty}^{\infty} d\xi\, e^{-\xi^2} \right) \\
&= \frac{1}{\sigma\sqrt{2\pi}} \times (2\sigma^2)^{3/2} \times \frac{\sqrt{\pi}}{2} = \sigma^2
\end{aligned}
\]
so that the standard deviation is σ.

Central limit theorem

• The nth moment of the probability distribution of a random variable x is defined by:

\[
\mu_n = \langle x^n\rangle
\]
The zeroth moment is always 1, the first moment is the mean of the distribution, etc.

• All moments can be obtained by taking derivatives of the generating function G(t) at t = 0:

\[
G(t) = \langle e^{tx}\rangle \quad\Rightarrow\quad \mu_n = \frac{d^n G}{dt^n}\bigg|_{t\to 0}
\]
This follows from the Taylor expansion of the exponential function:

\[
G(t) = \langle e^{tx}\rangle = \left\langle \sum_{m=0}^{\infty} \frac{(xt)^m}{m!} \right\rangle = \sum_{m=0}^{\infty} \frac{\langle x^m\rangle}{m!}\, t^m
\]

When we take the nth derivative term-by-term, using

\[
\frac{d}{dt} t^m = m t^{m-1} \;(m\ge 1), \qquad \frac{d^2}{dt^2} t^m = \frac{d}{dt}\, m t^{m-1} = m(m-1) t^{m-2} \;(m\ge 2), \quad \text{etc.}
\]

we get a sum of various non-negative powers of t:

\[
\frac{d^n G}{dt^n} = \sum_{m=0}^{\infty} \frac{\langle x^m\rangle}{m!} \times \frac{d^n t^m}{dt^n} = \sum_{m=n}^{\infty} \frac{\langle x^m\rangle}{m!} \times m(m-1)(m-2)\cdots(m-n+1)\, t^{m-n}
\]

Then, evaluating this nth derivative at t = 0 kills all terms with non-zero powers of t, leaving behind just the first term (t⁰ = 1):

\[
\frac{d^n G}{dt^n}\bigg|_{t\to 0} = \left[\frac{\langle x^n\rangle}{n!} \times n(n-1)(n-2)\cdots(n-n+1)\, t^0 + (\cdots)t^1 + (\cdots)t^2 + \cdots\right]_{t\to 0} = \frac{\langle x^n\rangle}{n!}\, n! = \mu_n
\]
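The relation µ_n = dⁿG/dtⁿ|_{t→0} can be illustrated symbolically. The sketch below (not from the original notes) builds G(t) for a simple two-valued distribution and compares its derivatives at t = 0 with the directly computed moments ⟨xⁿ⟩.

```python
import sympy as sp

t = sp.symbols('t')

# A simple discrete distribution: x = 0 with probability 1 - p, x = 1 with probability p
p = sp.Rational(1, 3)
outcomes = {0: 1 - p, 1: p}

# Generating function G(t) = <exp(t x)>
G = sum(prob * sp.exp(t * x) for x, prob in outcomes.items())

for n in range(4):
    moment_from_G = sp.limit(sp.diff(G, t, n), t, 0)                  # d^n G / dt^n at t -> 0
    moment_direct = sum(prob * x**n for x, prob in outcomes.items())  # <x^n>
    print(n, moment_from_G, moment_direct)
```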

• Cumulants κ_n of the probability distribution of a random variable x are defined by the generating function K(t):
\[
K(t) = \log\langle e^{tx}\rangle \quad\Rightarrow\quad \kappa_n = \frac{d^n K}{dt^n}\bigg|_{t\to 0}
\]
The first few cumulants are:
\[
\kappa_0 = K(0) = \log\langle e^{0\cdot x}\rangle = \log(1) = 0
\]
\[
\kappa_1 = \frac{dK}{dt}\bigg|_{t\to 0} = \frac{d}{dt}\log G(t)\bigg|_{t\to 0} = \frac{1}{G}\frac{dG}{dt}\bigg|_{t\to 0} = \frac{\mu_1}{\langle e^{0\cdot x}\rangle} = \langle x\rangle
\]
\[
\kappa_2 = \frac{d^2 K}{dt^2}\bigg|_{t\to 0} = \frac{d}{dt}\left[\frac{1}{G}\frac{dG}{dt}\right]_{t\to 0} = \frac{1}{G^2}\left[G\frac{d^2 G}{dt^2} - \left(\frac{dG}{dt}\right)^2\right]_{t\to 0} = \mu_2 - \mu_1^2 = \mathrm{Var}(x)
\]

\[
\kappa_3 = \frac{d^3 K}{dt^3}\bigg|_{t\to 0} = \frac{d}{dt}\left\{\frac{1}{G^2}\left[G\frac{d^2 G}{dt^2} - \left(\frac{dG}{dt}\right)^2\right]\right\}_{t\to 0} = \cdots = \mu_3 - 3\mu_2\mu_1 + 2\mu_1^3
\]
We see that the zeroth cumulant vanishes, the first cumulant is the mean, and the second cumulant is the variance.
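Continuing the symbolic sketch from above (again not part of the notes), the cumulants follow from derivatives of K(t) = log G(t); for the same hypothetical two-valued distribution one can verify κ₁ = µ₁, κ₂ = µ₂ − µ₁², and κ₃ = µ₃ − 3µ₂µ₁ + 2µ₁³.

```python
import sympy as sp

t = sp.symbols('t')

# Same two-valued distribution as before: x = 0 or 1, with p(1) = 1/3
p = sp.Rational(1, 3)
outcomes = {0: 1 - p, 1: p}

G = sum(prob * sp.exp(t * x) for x, prob in outcomes.items())  # moment generating function
K = sp.log(G)                                                  # cumulant generating function

mu = [sum(prob * x**n for x, prob in outcomes.items()) for n in range(4)]   # moments
kappa = [sp.simplify(sp.limit(sp.diff(K, t, n), t, 0)) for n in range(4)]   # cumulants

print(kappa[1], mu[1])                                          # kappa_1 = mean
print(kappa[2], sp.simplify(mu[2] - mu[1]**2))                  # kappa_2 = variance
print(kappa[3], sp.simplify(mu[3] - 3*mu[2]*mu[1] + 2*mu[1]**3))
```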

• Homogeneity of moments and cumulants: If µ_n^{(x)} and κ_n^{(x)} are the moments and cumulants of the random variable x respectively, then for any constant c:

\[
\mu_n^{(cx)} = c^n \mu_n^{(x)}, \qquad \kappa_n^{(cx)} = c^n \kappa_n^{(x)}
\]
– Proof: For moments:
\[
\mu_n^{(cx)} = \langle (cx)^n\rangle = c^n \langle x^n\rangle = c^n \mu_n^{(x)}
\]

– For cumulants: The cumulant κ_n^{(x)} is generated by the nth derivative of the generating function K(t) at t → 0. When expanded in terms of the generating function G(t) for moments, κ_n^{(x)} becomes a sum of various combinations of derivatives of G which always have the same total power n of dt in the denominators (see the example for κ_2 above). Therefore, after substituting t → 0 (which makes G(t) → 1), each such term becomes a product of several moments, e.g. µ_{n_1}^{(x)} µ_{n_2}^{(x)} ··· µ_{n_k}^{(x)}, but always with n_1 + n_2 + ··· + n_k = n. Since µ_n^{(cx)} = c^n µ_n^{(x)}, then:
\[
\begin{aligned}
\kappa_n^{(cx)} &= \sum C_{n_1\cdots n_k}\, \mu_{n_1}^{(cx)} \mu_{n_2}^{(cx)} \cdots \mu_{n_k}^{(cx)} \\
&= \sum c^{n_1+\cdots+n_k}\, C_{n_1\cdots n_k}\, \mu_{n_1}^{(x)} \mu_{n_2}^{(x)} \cdots \mu_{n_k}^{(x)} \\
&= c^n \sum C_{n_1\cdots n_k}\, \mu_{n_1}^{(x)} \mu_{n_2}^{(x)} \cdots \mu_{n_k}^{(x)} \\
&= c^n \kappa_n^{(x)}
\end{aligned}
\]

• If one knows all moments of a probability distribution, or alternatively all cumulants, one can reconstruct the PDF.

– Proof: If all moments ⟨x^m⟩, m ∈ {0, 1, 2, ...} are known, then one can construct an auxiliary function g(t) ≡ G(it) using the Taylor expansion of the generating function G(t):

\[
g(t) \equiv G(it) \equiv \langle e^{itx}\rangle = \sum_{m=0}^{\infty} \frac{\langle x^m\rangle}{m!}\, i^m t^m
\]
Here, i = √−1 is the imaginary unit. The function g(t) is useful because of its relation to the Fourier representation of the Dirac delta function:

\[
\int_{-\infty}^{\infty} dt\, e^{itx} = 2\pi\delta(x)
\]

where δ(x) is defined by:

\[
\delta(x) = \begin{cases} 0, & x\neq 0 \\ \infty, & x = 0 \end{cases}, \qquad \int_{-\infty}^{\infty} dx\, \delta(x) = 1
\]

The above integral representation of δ(x) can be proven by calculating a well-defined Gaussian integral D(x; α) = ∫dt exp(itx − αt²) with a finite positive α, and then showing that D(x; α) converges to 2πδ(x) in the α → 0 limit. The Dirac delta function has the following essential property for any function f(x):
\[
\int_{-\infty}^{\infty} dx\, \delta(x-x_0)\, f(x) = f(x_0)
\]

which follows directly from its definition. Applied to the PDF f(x), it allows us to calculate f(x_0) at any desired point x_0:
\[
f(x_0) = \langle \delta(x-x_0)\rangle = \left\langle \frac{1}{2\pi}\int_{-\infty}^{\infty} dt\, e^{it(x-x_0)} \right\rangle = \frac{1}{2\pi}\int_{-\infty}^{\infty} dt\, \langle e^{itx}\rangle\, e^{-itx_0} = \frac{1}{2\pi}\int_{-\infty}^{\infty} dt\, g(t)\, e^{-itx_0}
\]

if we can obtain g(t), i.e. if we know all moments. This completes the proof. Since G(t) = e^{K(t)} and K(t) is fully determined by the knowledge of all cumulants, we can similarly reconstruct the PDF from the cumulants.

• The Gaussian distribution N(µ, σ) has cumulants κ₁ = µ, κ₂ = σ², and all other cumulants equal to zero.

– Proof: First, calculate the generating function for moments:

\[
\begin{aligned}
G(t) = \langle e^{tx}\rangle &= \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} dx\, \exp\left(-\frac{(x-\mu)^2}{2\sigma^2} + tx\right) \\
&= \frac{1}{\sigma\sqrt{2\pi}} \times \sqrt{2\sigma^2} \int_{-\infty}^{\infty} d\xi\, e^{-\xi^2 + t(\xi\sqrt{2\sigma^2}+\mu)} = \frac{e^{\mu t}}{\sqrt{\pi}} \int_{-\infty}^{\infty} d\xi\, e^{-\xi^2 + t\xi\sqrt{2\sigma^2}} \\
&= \frac{e^{\mu t}}{\sqrt{\pi}} \int_{-\infty}^{\infty} d\xi\, e^{-\left(\xi - \frac{1}{2}t\sqrt{2\sigma^2}\right)^2 + \frac{1}{4}t^2\,2\sigma^2} \\
&= e^{\mu t} \times e^{\frac{1}{2}\sigma^2 t^2}
\end{aligned}
\]
We changed variables from x to ξ = (x − µ)/√(2σ²) in the second line. Then we completed the square in the exponent in the third line: the exponents of e^{···} are equal in the third line and the last integral of the second line, but the third line becomes a constant exp(σ²t²/2) times a pure Gaussian integral that evaluates to √π as we saw before. Now that we know G(t), we immediately know the generating function for cumulants:
\[
K(t) = \log G(t) = \mu t + \frac{1}{2}\sigma^2 t^2
\]
The derivatives of K(t) generate the cumulants and we clearly reproduce the stated result: all cumulants are zero except κ₁ = µ and κ₂ = σ².
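A symbolic sketch (not from the notes) confirming that K(t) = µt + σ²t²/2 for the Gaussian, so that all cumulants beyond the second vanish:

```python
import sympy as sp

t, x, mu = sp.symbols('t x mu', real=True)
sigma = sp.symbols('sigma', positive=True)

# Gaussian PDF and its moment generating function G(t) = <exp(t x)>
f = sp.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * sp.sqrt(2 * sp.pi))
G = sp.integrate(f * sp.exp(t * x), (x, -sp.oo, sp.oo))

# Cumulant generating function; expect mu*t + sigma**2*t**2/2
K = sp.simplify(sp.log(G))
print(K)

# Cumulants kappa_n = d^n K / dt^n at t = 0; all should vanish for n >= 3
for n in range(1, 6):
    print(n, sp.simplify(sp.diff(K, t, n).subs(t, 0)))
```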

• Central limit theorem: Let x_i ∈ (−∞, ∞) be N independent random variables from the same probability distribution with mean µ and variance σ². Let X be the random variable defined by:
\[
X = \frac{1}{\sigma\sqrt{N}} \sum_{i=1}^{N} (x_i - \mu)
\]
In the limit N → ∞, the statistics of X converges to the Gaussian distribution N(0, 1) with mean 0 and variance 1, regardless of the distribution of x_i.

– Proof: Construct the generating function for the cumulants of X:
\[
\begin{aligned}
K_X(t) = \log\langle e^{tX}\rangle &= \log\left\langle \exp\left(\frac{t}{\sigma\sqrt{N}} \sum_{i=1}^{N} (x_i-\mu)\right)\right\rangle = \log\left\langle \prod_{i=1}^{N} \exp\left(\frac{t(x_i-\mu)}{\sigma\sqrt{N}}\right)\right\rangle \\
&= \log\prod_{i=1}^{N}\left\langle \exp\left(\frac{t(x_i-\mu)}{\sigma\sqrt{N}}\right)\right\rangle = \sum_{i=1}^{N} \log\left[\left\langle \exp\left(\frac{t x_i}{\sigma\sqrt{N}}\right)\right\rangle \exp\left(-\frac{t\mu}{\sigma\sqrt{N}}\right)\right] \\
&= \sum_{i=1}^{N}\left[ K_i\!\left(\frac{t}{\sigma\sqrt{N}}\right) - \frac{t\mu}{\sigma\sqrt{N}}\right] = N\left[ K_x\!\left(\frac{t}{\sigma\sqrt{N}}\right) - \frac{t\mu}{\sigma\sqrt{N}}\right]
\end{aligned}
\]
We are able to pull the product symbol outside of the averaging brackets ⟨···⟩ at the beginning of the second line only because the random variables x_i are independent: the average of a product of independent random values equals the product of their averages. Going through to the end, we find that the cumulant generating function K_X for X equals a shifted sum of the analogous generating functions K_x for the individual random variables x_i, evaluated at the rescaled parameter t/(σ√N). Since all x_i have the same distribution, the last expression is simply proportional to N. Now we can calculate the cumulants of X (substituting τ = t/(σ√N)):
\[
\kappa_n^{(X)} = \frac{d^n K_X(t)}{dt^n}\bigg|_{t\to 0} = N\left[\frac{d^n K_x(t/\sigma\sqrt{N})}{dt^n}\bigg|_{t\to 0} - \frac{\mu\,\delta_{n,1}}{\sigma\sqrt{N}}\right] = N\left[\frac{1}{\sigma^n N^{n/2}} \frac{d^n K_x(\tau)}{d\tau^n}\bigg|_{\tau\to 0} - \frac{\mu\,\delta_{n,1}}{\sigma\sqrt{N}}\right] = N\left[\frac{1}{\sigma^n N^{n/2}}\,\kappa_n^{(x)} - \frac{\mu\,\delta_{n,1}}{\sigma\sqrt{N}}\right]
\]
The zeroth cumulant always vanishes. The first cumulant is the mean:
\[
\kappa_1^{(X)} = N\left[\frac{1}{\sigma\sqrt{N}}\,\kappa_1^{(x)} - \frac{\mu}{\sigma\sqrt{N}}\right] = 0
\]
The second cumulant is the variance:
\[
\kappa_2^{(X)} = N\left[\frac{1}{\sigma^2 N}\,\kappa_2^{(x)}\right] = 1
\]
The higher order cumulants κ_n^{(X)} become negligible when N → ∞:

\[
\kappa_n^{(X)} \propto N^{1-\frac{n}{2}} \xrightarrow{N\to\infty} 0 \qquad (n > 2)
\]
We showed earlier that the probability distribution of X is completely determined by the cumulants, and the distribution in which only the first and second cumulants are non-zero is the Gaussian. This proves the central limit theorem.
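A simulation sketch of the theorem (not part of the original notes): draw x_i from a strongly non-Gaussian distribution (here an exponential, an arbitrary choice), form X = (1/(σ√N)) Σ(x_i − µ), and check that its sample mean, variance, and third cumulant approach 0, 1, and 0 as N grows.

```python
import numpy as np

rng = np.random.default_rng(6)
mu, sigma = 1.0, 1.0        # mean and standard deviation of the Exp(1) distribution
M = 20_000                  # number of independent realizations of X

for N in (1, 10, 100, 1000):
    x = rng.exponential(scale=1.0, size=(M, N))          # N variables per realization
    X = (x - mu).sum(axis=1) / (sigma * np.sqrt(N))      # centered, rescaled sum

    kappa3 = np.mean((X - X.mean()) ** 3)                # third cumulant, -> 0 as N grows
    print(f"N={N:5d}  mean={X.mean():+.4f}  var={X.var():.4f}  kappa3={kappa3:+.4f}")
```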
