
Selected Topics on Data Analysis in Particle Physics Lecture Notes

V. Karimäki

HEP Summer School, Lammi Biological Station March 1992

Printed version edited from hand-made transparencies, Spring 2010

Contents

1 Introduction

2 Propagation of measurement errors
  2.1 Error propagation in case of linear transformation
  2.2 Error propagation in case of non-linear transformation

3 Fitting parameterized model with experimental data
  3.1 The method of maximum likelihood
  3.2 Least squares method as a special case of maximum likelihood

4 Least squares solutions with error estimates
  4.1 Linear χ2 solution
  4.2 Non-linear least squares fit
    4.2.1 Convergence criteria
  4.3 Least squares fit with constraints
  4.4 Fit quality tests
    4.4.1 Pull or stretch values of fitted parameters
    4.4.2 Fit probability distribution
  4.5 Maximum Likelihood and Poisson statistics
    4.5.1 Maximum Likelihood and parameter errors

5 Introduction to Monte Carlo simulation
  5.1 Generation of non-uniform distribution by inversion method
  5.2 Inversion method for discrete distribution
  5.3 Approximate inversion method using tabulation
  5.4 Hit-or-Miss method
  5.5 Hit-or-Miss method by comparison function
  5.6 Composition method

6 Generation from common continuous distributions
  6.1 Uniform distribution
  6.2 Gaussian distribution
    6.2.1 Polar or Box-Muller method
    6.2.2 Faster variation of the polar method
    6.2.3 Gaussian generation by summation method
    6.2.4 Kahn's method
  6.3 Multidimensional Gaussian distribution
  6.4 Exponential distribution
  6.5 χ2 distribution
  6.6 Cauchy or Breit-Wigner distribution
  6.7 Landau distribution

7 Monte Carlo integration
  7.1 Basics of the MC integration
  7.2 Convergence and precision of the MC integration
  7.3 Methods to accelerate the convergence and precision
    7.3.1 Stratified sampling
    7.3.2 Importance sampling
    7.3.3 Control variates
    7.3.4 Antithetic variates
  7.4 Limits of integration in the MC method
  7.5 Comparison of convergence with other methods

A Basic statistics for simulation
  A.1 Introduction
  A.2 Definition of probability
  A.3 Combined probabilities
  A.4 Probability density function
  A.5 Cumulative distribution function
  A.6 Marginal and conditional distribution
  A.7 Expectation value, mean value and variance
  A.8 Covariance and correlation
  A.9 Independent variates, covariance matrix
  A.10 Change of variables
  A.11 Distribution of a function of random variates
  A.12 Moments of a distribution
  A.13 Characteristic function
  A.14 Cumulants of a distribution
  A.15 Probability generating function

B Exercises
  B.1 Basic statistics exercises
  B.2 Uniform random number generation exercises
  B.3 Generation from arbitrary distributions exercises
  B.4 Well-known continuous distributions exercises
  B.5 Discrete distributions exercises
  B.6 Monte Carlo integration exercises
  B.7 Application examples

1 Introduction

BlaBla ...

2 Propagation of measurement errors

Error propagation is needed when quantities to be studied must be calculated from directly measurable quantities with given covariance matrix. Let us denote a set of measurements as follows:

• Set of measurements: x = (x1, x2, . . . , xn)

• Covariance matrix: Vx, with vij = cov(xi, xj)

The components $x_i$ represent individual measurements which may be correlated. In case of uncorrelated $x_i$ the covariance matrix is diagonal. Furthermore, let us suppose that we are interested in a set of transformed variables which are derived from the measured variables $\mathbf{x}$ by some transformation formula:
$$\mathbf{x}' = \mathbf{x}'(\mathbf{x}) \qquad (1)$$
or:
$$\mathbf{x}' = (x'_1, x'_2, \ldots, x'_m), \qquad m \le n \qquad (2)$$
where the components $x'_i$ are functions of $\mathbf{x}$. For example: the measured variables could be the Cartesian coordinates $\mathbf{x} = (x, y, z)$ and the transformed variables could be the spherical coordinates $\mathbf{x}' = (r, \varphi, \theta)$.

Now the question is: What is the covariance matrix $V_{x'}$ of the new variables $\mathbf{x}'$? This is the problem of the error propagation. The answer is derived by recalling first the definition of the covariance matrix:
$$(V_x)_{ij} \equiv \mathrm{cov}(x_i, x_j) \equiv \mathrm{cov}(\mathbf{x})_{ij} \equiv \sigma_{ij} \equiv v_{ij} = \int (x_i - \langle x_i\rangle)(x_j - \langle x_j\rangle)\, f(\mathbf{x})\, d\mathbf{x} \qquad (3)$$
where $f(\mathbf{x})$ is the probability density and where we have listed various notations for the covariance. From the definition it readily follows that
$$\mathrm{cov}(a x_i, b x_j) = ab\,\mathrm{cov}(x_i, x_j) \qquad (4)$$
or more generally:
$$\mathrm{cov}\Big(\sum_i a_i x_i,\ \sum_j b_j x_j\Big) = \sum_{i,j} a_i b_j\,\mathrm{cov}(x_i, x_j) \qquad (5)$$

2.1 Error propagation in case of linear transformation

We first consider linear transformations $\mathbf{x} \to \mathbf{x}'$. A linear transformation can be written in matrix notation:
$$\mathbf{x}' = J\,\mathbf{x} \qquad (6)$$
where $J$ is an $m \times n$ matrix independent of $\mathbf{x}$. According to the definition (3) we can derive the expression for the covariance matrix element $i, j$:
$$(V_{x'})_{ij} = \mathrm{cov}\Big(\sum_k J_{ik} x_k,\ \sum_l J_{jl} x_l\Big) = \sum_{k,l} J_{ik} J_{jl}\,\mathrm{cov}(x_k, x_l) = (J V_x J^T)_{ij}$$
where we have used equation (5). The above result implies the following transformation law for the covariance matrix:
$$V_{x'} = J V_x J^T \qquad (7)$$
This is the error propagation formula in case of a linear transformation of variables $\mathbf{x}' = J\mathbf{x}$.

Examples

Example 1: Error estimate of a sum of measured quantities $s = \sum_{i=1}^n x_i$. Error $\Delta s = {}$? In matrix notation we write $s = J\mathbf{x}$ where $J = (1\ 1\ \cdots\ 1)$ and it follows that
$$(\Delta s)^2 \equiv \sigma_s^2 = (1\ 1\ \cdots\ 1)\, V_x\, (1\ 1\ \cdots\ 1)^T = \sum_{i=1}^n \sigma_{ii} + 2\sum_{i>j}\sigma_{ij} \qquad (8)$$
where we use the fact that the covariance matrix is symmetric and the notation $(\Delta x_i)^2 = \sigma_{ii}$. If $V_x$ is diagonal (uncorrelated measurements $x_i$), we get:
$$\sigma_s^2 = \sum_{i=1}^n \sigma_i^2 \qquad (9)$$
where our notation is $\sigma_{ii} = \sigma_i^2$.

Example 2: Error estimate of the weighted mean
$$\mu = \frac{\sum_i w_i x_i}{\sum_i w_i}$$
of $N$ quantities. Error estimate $\sigma_\mu = {}$?

Here we assume that the $x_i$ are independent of each other so that $\mathrm{cov}(x_i, x_j) = 0$ for $i \ne j$ and
$$V_x = \begin{pmatrix} \sigma_1^2 & & & \\ & \sigma_2^2 & & \\ & & \ddots & \\ & & & \sigma_N^2 \end{pmatrix} \qquad (10)$$
where all the off-diagonal elements are zero. Now the $J$ matrix is $1 \times N$:
$$J = \Big(\sum_i w_i\Big)^{-1}(w_1\ w_2\ \cdots\ w_N)$$
and we get
$$\sigma_\mu^2 = \Big(\sum_i w_i\Big)^{-2}(w_1\ w_2\ \cdots\ w_N)\, V_x\, (w_1\ w_2\ \cdots\ w_N)^T = \Big(\sum_i w_i\Big)^{-2}\sum_i (w_i\sigma_i)^2 \qquad (11)$$
Normally the weights are inverse variances, i.e. $w_i = \sigma_i^{-2}$, and inserting into the above result we obtain:
$$\frac{1}{\sigma_\mu^2} = \sum_i\frac{1}{\sigma_i^2}. \qquad (12)$$
A special case is that the weights are equal: $\sigma_i = \Delta x$ for all $i$. In this case we get
$$\mu = \frac{1}{N}\sum_{i=1}^N x_i\,; \qquad \sigma_\mu = \frac{\Delta x}{\sqrt{N}} \qquad (13)$$
which are the formulae for a simple (unweighted) mean and its error estimate.

Example 3: Error estimate of simple sums of three quantities x1, x2 and x3:

$$u_1 = x_1 + x_2 \qquad (14)$$
$$u_2 = x_2 + x_3. \qquad (15)$$
Here the transformation is from 3D to 2D so that $n = 3$ and $m = 2$. The transformation matrix is:
$$J = \begin{pmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \end{pmatrix}$$
so that the covariance matrix of $\mathbf{u} = (u_1, u_2)$ is:
$$V_u = \begin{pmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \end{pmatrix}\begin{pmatrix} \sigma_{11} & \sigma_{12} & \sigma_{13} \\ \sigma_{12} & \sigma_{22} & \sigma_{23} \\ \sigma_{13} & \sigma_{23} & \sigma_{33} \end{pmatrix}\begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 0 & 1 \end{pmatrix} \qquad (16)$$
where the middlemost matrix is the covariance matrix $V_x$ of the variables $\mathbf{x} = (x_1, x_2, x_3)$.
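For illustration, a minimal C++ sketch which evaluates $V_u = J V_x J^T$ numerically for this 3-to-2 example; the numbers in $V_x$ are invented for the illustration:

#include <cstdio>

int main() {
  // Illustrative covariance matrix Vx of (x1, x2, x3)
  double Vx[3][3] = {{1.0, 0.2, 0.0},
                     {0.2, 2.0, 0.3},
                     {0.0, 0.3, 3.0}};
  // Transformation u1 = x1 + x2, u2 = x2 + x3
  double J[2][3] = {{1, 1, 0},
                    {0, 1, 1}};
  double Vu[2][2] = {{0, 0}, {0, 0}};
  // Vu = J Vx J^T
  for (int i = 0; i < 2; ++i)
    for (int j = 0; j < 2; ++j)
      for (int k = 0; k < 3; ++k)
        for (int l = 0; l < 3; ++l)
          Vu[i][j] += J[i][k] * Vx[k][l] * J[j][l];
  printf("Vu = [[%g, %g], [%g, %g]]\n", Vu[0][0], Vu[0][1], Vu[1][0], Vu[1][1]);
  return 0;
}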

2.2 Error propagation in case of non-linear transformation

We define again a transformation of a set of measurements $\mathbf{x}$ as:
$$\mathbf{x}' = \mathbf{x}'(\mathbf{x}), \qquad \mathbf{x} = (x_1, \ldots, x_n), \qquad \mathbf{x}' = (x'_1, \ldots, x'_m)$$
where $\mathbf{x}'$ is under-constrained, i.e. $m \le n$. Here $\mathbf{x}$ is again an array of measurements with known covariance matrix $V_x$, a symmetric $n \times n$ matrix. The problem is now how to calculate the covariance matrix of the quantities $\mathbf{x}'$, i.e. the covariance matrix $V_{x'}$ of the $m$ quantities $x'_1, \ldots, x'_m$.

We expand $\mathbf{x}'$ in a Taylor series around the expectation value of $\mathbf{x}$: $\mathbf{x} = \langle\mathbf{x}\rangle$. Each component $x'_i$ is expanded as:
$$x'_i = x'_i(\langle\mathbf{x}\rangle) + \nabla_x x'_i\big|_{\mathbf{x}=\langle\mathbf{x}\rangle}\cdot(\mathbf{x} - \langle\mathbf{x}\rangle) + \cdots \qquad (17)$$
so that the expansion in matrix form reads:
$$\mathbf{x}' = \mathbf{x}'(\langle\mathbf{x}\rangle) + J(\mathbf{x} - \langle\mathbf{x}\rangle) + \cdots \qquad (18)$$
where $J$ is now the Jacobian derivative matrix of the transformation:
$$J = \begin{pmatrix} \nabla x'_1 \\ \vdots \\ \nabla x'_m \end{pmatrix} = \begin{pmatrix} \frac{\partial x'_1}{\partial x_1} & \cdots & \frac{\partial x'_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial x'_m}{\partial x_1} & \cdots & \frac{\partial x'_m}{\partial x_n} \end{pmatrix} \qquad (19)$$
computed at $\mathbf{x} = \langle\mathbf{x}\rangle$. Neglecting higher order terms we have the expansion:
$$\mathbf{x}' = \mathbf{x}'(\langle\mathbf{x}\rangle) + J\mathbf{x} - J\langle\mathbf{x}\rangle. \qquad (20)$$
The first and the third terms are constant, because they are calculated at a fixed point $\mathbf{x} = \langle\mathbf{x}\rangle$, so that their covariances vanish and we have $\mathrm{cov}(\mathbf{x}') = \mathrm{cov}(J\mathbf{x}) = J\,\mathrm{cov}(\mathbf{x})\,J^T$. Using the $V$ notation for the covariance matrix, the error propagation formula in case of a non-linear transformation reads:
$$V_{x'} = J V_x J^T \qquad (21)$$
which is the same formula as in the case of a linear transformation, except that the matrix $J$ is now the Jacobian derivative matrix of the non-linear transformation computed at $\mathbf{x} = \langle\mathbf{x}\rangle$.

Examples

Example 1: Error estimate of a product $u = x_1 x_2$. Given the covariance matrix of $(x_1, x_2)$, what is the error estimate $\Delta u$? The Jacobian is $J = \big(\partial u/\partial x_1\ \ \partial u/\partial x_2\big) = \big(x_2\ \ x_1\big)$ so that
$$(\Delta u)^2 \equiv \sigma_u^2 = \big(x_2\ \ x_1\big)\begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}\begin{pmatrix} x_2 \\ x_1 \end{pmatrix} = x_2^2\sigma_1^2 + 2x_1 x_2\sigma_{12} + x_1^2\sigma_2^2. \qquad (22)$$
In general a single valued function $f = f(x, y)$ has the error estimate:
$$\sigma_f^2 = \left(\frac{\partial f}{\partial x}\right)^2\sigma_x^2 + 2\frac{\partial f}{\partial x}\frac{\partial f}{\partial y}\sigma_{xy} + \left(\frac{\partial f}{\partial y}\right)^2\sigma_y^2 \qquad (23)$$
where $\sigma_{xy}$ is the covariance of $x$ and $y$. Notice that the formula
$$\Delta f = \frac{\partial f}{\partial x}\Delta x + \frac{\partial f}{\partial y}\Delta y,$$
which is usually offered as an error formula for a function in elementary courses, is valid only if the correlation between $x$ and $y$ is +100%: $\sigma_{xy} = \sigma_x\sigma_y$.

Example 2: Transformation from Cartesian coordinates $(x, y)$ to polar coordinates $(r, \varphi)$. The transformation is
$$\begin{cases} r = \sqrt{x^2 + y^2} \\ \varphi = \arctan\dfrac{y}{x} \end{cases}$$
and the Jacobian:
$$J = \begin{pmatrix} \frac{\partial r}{\partial x} & \frac{\partial r}{\partial y} \\ \frac{\partial\varphi}{\partial x} & \frac{\partial\varphi}{\partial y} \end{pmatrix} = \begin{pmatrix} \frac{x}{r} & \frac{y}{r} \\ \frac{-y}{r^2} & \frac{x}{r^2} \end{pmatrix} = \frac{1}{r}\begin{pmatrix} x & y \\ \frac{-y}{r} & \frac{x}{r} \end{pmatrix}.$$
The transformed covariance matrix is then
$$V' \equiv \begin{pmatrix} \sigma_{rr} & \sigma_{r\varphi} \\ \sigma_{r\varphi} & \sigma_{\varphi\varphi} \end{pmatrix} = J V J^T = \frac{1}{r^2}\begin{pmatrix} x & y \\ \frac{-y}{r} & \frac{x}{r} \end{pmatrix}\begin{pmatrix} \sigma_{xx} & \sigma_{xy} \\ \sigma_{xy} & \sigma_{yy} \end{pmatrix}\begin{pmatrix} x & \frac{-y}{r} \\ y & \frac{x}{r} \end{pmatrix}.$$

Exercises

Exercise 1: Transformation of momentum variables. Suppose that the momentum of a charged particle is measured in 3D in terms of the curvature $\rho$ and the angular variables $\lambda$ and $\varphi$. The curvature $\rho$ is the curvature of the projected trajectory in the $xy$ plane and it is due to a magnetic field $\mathbf{B} = (0, 0, B_z)$. The covariance matrix of the measurement is known:
$$V = \begin{pmatrix} \sigma_{\rho\rho} & \sigma_{\rho\lambda} & \sigma_{\rho\varphi} \\ \sigma_{\rho\lambda} & \sigma_{\lambda\lambda} & \sigma_{\lambda\varphi} \\ \sigma_{\rho\varphi} & \sigma_{\lambda\varphi} & \sigma_{\varphi\varphi} \end{pmatrix}$$
Calculate the covariance matrix in terms of the Cartesian momentum components $\mathbf{p} = (p_x, p_y, p_z)$. Hints: The angle $\lambda$ is related to the polar angle $\theta$ as $\lambda = \pi/2 - \theta$. The relation between the curvature and momentum is $\rho = 0.3\,q B_z/p_T$ where $p_T$ is the transverse momentum, i.e. the momentum projected onto the $xy$ plane, $q = \pm 1$ is the particle charge, and the units of $\rho$, $B_z$ and $p_T$ are [m$^{-1}$], [T] and [GeV], respectively.

Exercise 2: Error estimate of the invariant mass. A particle decays into two particles with masses $m_1$ and $m_2$. The momenta of the decay products $\mathbf{p}_1 = (p_{x1}, p_{y1}, p_{z1})$ and $\mathbf{p}_2 = (p_{x2}, p_{y2}, p_{z2})$ are measured together with the corresponding covariance matrices $V_1$ and $V_2$. The invariant mass squared of the two-particle system is calculated as:
$$M^2 = (E_1 + E_2)^2 - (\mathbf{p}_1 + \mathbf{p}_2)^2$$
where $E_i = \sqrt{m_i^2 + \mathbf{p}_i^2}$. What is the error estimate of the invariant mass $M$? The particle measurements 1 and 2 are assumed independent.

3 Fitting parameterized model with experimental data

3.1 The method of maximum likelihood

Let us suppose that we have N measurements of a quantity y:

$$\mathbf{y} = (y_1, y_2, \ldots, y_N)$$
which may depend on a vector variable $\mathbf{x}$. We want to fit these measurements with a modeling function:
$$y = y(\alpha_1, \ldots, \alpha_m; \mathbf{x})$$
where $\alpha_1, \ldots, \alpha_m$ are the parameters of the model.

The measurements $y_1, y_2, \ldots, y_N$ represent a random sample from a probability density $g(\mathbf{y})$ in the $N$-dimensional measurement space. The a priori probability density $g(\mathbf{y})$ is parameterized by the model $y = y(\boldsymbol\alpha; \mathbf{x})$:

g(y) → g ◦ y(α; x; y).

Now we define the likelihood function:

$$L_y(\boldsymbol\alpha; \mathbf{x}; \mathbf{y}) = g \circ y(\boldsymbol\alpha; \mathbf{x}; \mathbf{y}) \qquad (24)$$
subject to the normalization:
$$\int L_y(\boldsymbol\alpha; \mathbf{x}; \mathbf{y})\, d\mathbf{y} = 1 \quad \text{for all } \boldsymbol\alpha; \mathbf{x}. \qquad (25)$$

The quantity $L_y(\boldsymbol\alpha; \mathbf{x}; \mathbf{y})\,d\mathbf{y}$ is the probability of the sample $\mathbf{y} = (y_1, y_2, \ldots, y_N)$ with parameters $\boldsymbol\alpha$ defining a probability density in measurement space. With these introductory preparations we introduce the method of Maximum Likelihood: The principle is to determine the model parameters $\boldsymbol\alpha = (\alpha_1, \ldots, \alpha_m)$ such that the likelihood function becomes maximal. Or, equivalently, the logarithm of the likelihood function:
$$\log L = \log[g \circ y(\boldsymbol\alpha; \mathbf{x}; \mathbf{y})] \qquad (26)$$
becomes maximal under the constraint
$$\int L\, d\mathbf{y} = 1.$$

The joint probability density g(y) can be for example:

1. A Gaussian:
$$g(\mathbf{y}) = \frac{1}{\sqrt{(2\pi)^N\det V}}\exp\big[-\tfrac{1}{2}(\mathbf{y} - \langle\mathbf{y}\rangle)^T V^{-1}(\mathbf{y} - \langle\mathbf{y}\rangle)\big]$$
with $V = \mathrm{cov}(\mathbf{y})$ (an $N \times N$ symmetric matrix) and $\langle\mathbf{y}\rangle$ represented by the model: $\langle\mathbf{y}\rangle = y(\alpha_1, \ldots, \alpha_m, \mathbf{x})$.

7 350 350

300 300

250 250

200 200

150 150

100 100

50 50

0 0 0.2 0.4 0.6 0.8 1 0.2 0.4 0.6 0.8 1

Figure 1: Spectrum of a quantity x in a histogram presentation and as a plot with error bars. The solid line in the second picture is a fit result by LSQ.

2. A Poisson distribution, when the $y_i = n_i$ are integers:
$$p_i = \frac{\lambda_i^{n_i}}{n_i!}\, e^{-\lambda_i}$$
where the Poisson expectation values $\lambda_i$ represent the model: $\lambda_i = \lambda_i(\alpha_1, \ldots, \alpha_m; \mathbf{x})$.

The likelihood function is a product of the probabilities $p_i$: $L = \prod_i p_i$.

3.2 Least squares method as a special case of maximum likelihood

The least squares method (LSQ) follows from the principle of maximum likelihood, when the a priori probability density is Gaussian. We consider the spectrum of a 1D quantity $x$ which can be presented as a histogram showing the frequency of $x$ values or as a plot with measurement errors (Fig. 1). The measured frequencies in a histogram (histogram 'bars') are subject to statistical fluctuations which follow a Poisson or Gaussian distribution. Suppose the fluctuations are Gaussian, i.e. we have a large statistics histogram. Then the likelihood function is:
$$L_y(\boldsymbol\alpha; \mathbf{y}; \mathbf{x}) = \frac{1}{\sqrt{(2\pi)^N\det V}}\exp\Big\{-\tfrac{1}{2}[\mathbf{y} - \mathbf{y}(\boldsymbol\alpha; \mathbf{x})]^T V^{-1}[\mathbf{y} - \mathbf{y}(\boldsymbol\alpha; \mathbf{x})]\Big\} \qquad (27)$$
The array $\mathbf{y}$ holds the measurements (histogram values) and the array
$$\mathbf{y}(\boldsymbol\alpha; \mathbf{x}) = (y(\boldsymbol\alpha; x_1), \ldots, y(\boldsymbol\alpha; x_N))$$
holds the corresponding model values. Taking the logarithm of eq. (27) we get:
$$\log L_y(\boldsymbol\alpha; \mathbf{y}; \mathbf{x}) = \mathrm{const} - \tfrac{1}{2}[\mathbf{y} - \mathbf{y}(\boldsymbol\alpha; \mathbf{x})]^T V^{-1}[\mathbf{y} - \mathbf{y}(\boldsymbol\alpha; \mathbf{x})]$$
From this it follows that maximizing the Gaussian likelihood is the same as minimizing the so called least squares function:
$$\chi^2 = [\mathbf{y} - \mathbf{y}(\boldsymbol\alpha; \mathbf{x})]^T V^{-1}[\mathbf{y} - \mathbf{y}(\boldsymbol\alpha; \mathbf{x})] \qquad (28)$$
where $V_{ij} = \mathrm{cov}(y_i, y_j)$, $i, j = 1, \ldots, N$, $\mathbf{y}$ is an array of measurements and $\mathbf{y}(\boldsymbol\alpha; \mathbf{x})$ represents the model. The matrix $V^{-1}$ is usually called the weight matrix. In case the measurements $y_i$ are uncorrelated, the covariance matrix is diagonal: $V_{ij} = \delta_{ij}\sigma_{ij}$. In this case eq. (28) reduces to the following sum:
$$\chi^2 = \sum_{i=1}^N\left(\frac{y_i - f(\boldsymbol\alpha; x_i)}{\sigma_i}\right)^2 \qquad (29)$$
where we denote the model function as $f$ and the diagonal elements of $V$ as $\sigma_i^2$. The quantity $\sigma_i$ is normally the error estimate of the measurement $y_i$. For example, in Fig. 1 we have a spectrum which follows the behavior $f(x) = \alpha_1 + \alpha_2\sqrt{x}$. The curve in the figure is fitted with this parameterization by minimizing (29) with respect to $\alpha_1$ and $\alpha_2$.

4 Least squares solutions with error estimates

We consider the following cases:

• linear χ2 minimization

• non-linear or general χ2 minimization

• constrained χ2 minimization

The χ2 minimization problem is called linear, if the model function is linear in terms of its parameters α.

4.1 Linear χ2 solution

Our linear model is now:
$$f(x) = \sum_{j=1}^m \alpha_j f_j(x), \qquad m < N \qquad (30)$$
i.e. a measured value of the quantity $y$ is modeled by this linear expression at a space point $x$. For $N$ measurements of $y$: $\mathbf{y} = (y_1, y_2, \ldots, y_N)$ we write the model as follows:
$$\mathbf{f}(\mathbf{x}) = (f(x_1), \ldots, f(x_N))$$
Denoting the matrix elements $H_{ij} \equiv f_j(x_i)$ we can express the model in the form of a matrix equation:
$$\mathbf{f}(\mathbf{x}) = H(\mathbf{x})\,\boldsymbol\alpha \qquad (31)$$
where $H$ is an $N \times m$ matrix as defined above. With these notations the linear $\chi^2$ function is expressed as:
$$\chi^2 = (\mathbf{y} - H\boldsymbol\alpha)^T V^{-1}(\mathbf{y} - H\boldsymbol\alpha) \qquad (32)$$
where $V$ is the covariance matrix ($N \times N$) of the measurements $\mathbf{y}$ ($\mathbf{y}$ is a column vector and $\mathbf{y}^T$ a row vector). To minimize, we differentiate (32) with respect to $\boldsymbol\alpha$ and get:
$$\nabla_\alpha\chi^2 = -2(H\nabla_\alpha\boldsymbol\alpha)^T V^{-1}(\mathbf{y} - H\boldsymbol\alpha) = -2(HI)^T V^{-1}(\mathbf{y} - H\boldsymbol\alpha) = -2H^T V^{-1}(\mathbf{y} - H\boldsymbol\alpha) = 0$$
where $I = \nabla_\alpha\boldsymbol\alpha$ is a unit matrix. It follows that the linear LSQ solution for the parameters $\boldsymbol\alpha$ minimizing the $\chi^2$ equation (32) is:
$$\boldsymbol\alpha = (H^T V^{-1} H)^{-1} H^T V^{-1}\mathbf{y} \qquad (33)$$

The solved parameters $\boldsymbol\alpha = (\alpha_1, \ldots, \alpha_m)$ are subject to statistical errors depending on the precision of the measurements $\mathbf{y}$, i.e. on the covariance matrix $V$ (and on the modeling matrix $H$). The solution (33) is of the form $\boldsymbol\alpha = J\mathbf{y}$ with $J = (H^T V^{-1} H)^{-1} H^T V^{-1}$ so that the error propagation technique described in section 2.1 applies. Using eq. (7) we obtain the error estimate (covariance matrix) of the parameters $\boldsymbol\alpha$:
$$V_\alpha = J V J^T = \big[(H^T V^{-1} H)^{-1} H^T V^{-1}\big]\, V\,\big[V^{-1} H (H^T V^{-1} H)^{-1}\big]$$
where we have used the symmetry of the matrices $V$ and $H^T V^{-1} H$. The above equation reduces trivially to:
$$V_\alpha = (H^T V^{-1} H)^{-1} \qquad (34)$$

As an example we consider a special case where the measurements yi are independent of each other.

Example: independent measurements yi

In this case the covariance matrix $V$ of the measurements $y_i$ is diagonal and the diagonal elements are the variances of $y_i$, which are the squared error estimates of $y_i$: $V_{ij} = \delta_{ij}\sigma_{ij}$. We use the notation $\mathrm{var}(y_i) \equiv \sigma_{ii} \equiv \sigma_i^2$. Then the $\chi^2$ function becomes:
$$\chi^2 = \sum_{i=1}^N\left(\frac{y_i - \sum_{j=1}^m\alpha_j f_j(x_i)}{\sigma_i}\right)^2. \qquad (35)$$
In the simplest case the functions $f_j$ are powers of a single variable $x$ and the fitting model is a polynomial. We define a column vector $\mathbf{c}$ and an $m \times m$ matrix $B$ as follows:
$$\mathbf{c} = H^T V^{-1}\mathbf{y}, \qquad B = H^T V^{-1} H$$
Then the 'best fit' parameters $\boldsymbol\alpha$ are:
$$\boldsymbol\alpha = B^{-1}\mathbf{c}$$
where the elements of $\mathbf{c}$ and $B$ are:
$$c_i = \sum_{k=1}^N\sigma_k^{-2} H_{ki}\, y_k = \sum_{k=1}^N\sigma_k^{-2} f_i(x_k)\, y_k$$
$$B_{ij} = \sum_k H_{ki}(V^{-1}H)_{kj} = \sum_k H_{ki}\sum_l\sigma_k^{-2}\delta_{kl}H_{lj} = \sum_k\sigma_k^{-2} H_{ki} H_{kj}$$
$$B_{ij} = \sum_k\sigma_k^{-2} f_i(x_k) f_j(x_k) \qquad (36)$$
Notice that in case the model is a 1D polynomial, $f_j(x) = x^j$, the $B$ matrix elements are $B_{ij} = \sum_k x_k^{i+j}/\sigma_k^2$.
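As an illustration, a minimal C++ sketch of a straight-line fit $f(x) = \alpha_1 + \alpha_2 x$ built from the $\mathbf{c}$ vector and $B$ matrix above; the data values are invented for the example:

#include <cstdio>
#include <cmath>

int main() {
  // Illustrative data: y measured at points x with errors sigma
  const int N = 5;
  double x[N]     = {1, 2, 3, 4, 5};
  double y[N]     = {2.1, 2.9, 4.2, 5.0, 6.1};
  double sigma[N] = {0.2, 0.2, 0.3, 0.3, 0.3};

  // Model f(x) = a1*1 + a2*x, i.e. f1(x) = 1, f2(x) = x
  double B[2][2] = {{0, 0}, {0, 0}}, c[2] = {0, 0};
  for (int k = 0; k < N; ++k) {
    double w = 1.0 / (sigma[k] * sigma[k]);     // 1/sigma_k^2
    double f[2] = {1.0, x[k]};
    for (int i = 0; i < 2; ++i) {
      c[i] += w * f[i] * y[k];
      for (int j = 0; j < 2; ++j) B[i][j] += w * f[i] * f[j];
    }
  }
  // Invert the 2x2 matrix B; Valpha = B^{-1} is the parameter covariance (34)
  double det = B[0][0]*B[1][1] - B[0][1]*B[1][0];
  double Vp[2][2] = {{ B[1][1]/det, -B[0][1]/det},
                     {-B[1][0]/det,  B[0][0]/det}};
  double a1 = Vp[0][0]*c[0] + Vp[0][1]*c[1];
  double a2 = Vp[1][0]*c[0] + Vp[1][1]*c[1];
  printf("alpha1 = %.3f +- %.3f, alpha2 = %.3f +- %.3f\n",
         a1, sqrt(Vp[0][0]), a2, sqrt(Vp[1][1]));
  return 0;
}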

4.2 Non-linear least squares fit

In case of a model which is non-linear in terms of its parameters α the model expression Hα is replaced by values of a function f:

$$\mathbf{f}(\boldsymbol\alpha; \mathbf{x}) = \mathbf{f}_\alpha = (f(\boldsymbol\alpha; x_1), \ldots, f(\boldsymbol\alpha; x_N))$$
where $\boldsymbol\alpha$ is an array of $m$ model parameters $\boldsymbol\alpha = (\alpha_1, \ldots, \alpha_m)$. Then the $\chi^2$ function reads:
$$\chi^2 = (\mathbf{y} - \mathbf{f}_\alpha)^T V^{-1}(\mathbf{y} - \mathbf{f}_\alpha) \qquad (37)$$
where again $\mathbf{y}$ is an array of $N$ measurements and $V$ is their covariance matrix. This $\chi^2$ function must be minimized by iteration. The iteration method needs an initial value which we denote by $\boldsymbol\alpha^* = (\alpha_1^*, \ldots, \alpha_m^*)$. We expand each component of the model function in a Taylor series around the initial value $\boldsymbol\alpha^*$ at each space point $x_i$, $i = 1, \ldots, N$:
$$f(\boldsymbol\alpha; x_i) = f(\boldsymbol\alpha^*; x_i) + \nabla_\alpha f(\boldsymbol\alpha; x_i)\big|_{\boldsymbol\alpha=\boldsymbol\alpha^*}\cdot(\boldsymbol\alpha - \boldsymbol\alpha^*) + \cdots. \qquad (38)$$
This is a set of $N$ equations. We linearize by neglecting the higher order terms of the model function $f$ and write the equations (38) in matrix form as:
$$\mathbf{f}_\alpha = \mathbf{f}_{\alpha^*} + J\Delta\boldsymbol\alpha \qquad (39)$$
where $\Delta\boldsymbol\alpha \equiv \boldsymbol\alpha - \boldsymbol\alpha^*$ and $J$ is the Jacobian derivative matrix of the model function:
$$J = \begin{pmatrix} \nabla_\alpha f(\boldsymbol\alpha; x_1) \\ \vdots \\ \nabla_\alpha f(\boldsymbol\alpha; x_N) \end{pmatrix} = \begin{pmatrix} \frac{\partial f(\boldsymbol\alpha; x_1)}{\partial\alpha_1} & \cdots & \frac{\partial f(\boldsymbol\alpha; x_1)}{\partial\alpha_m} \\ \vdots & \ddots & \vdots \\ \frac{\partial f(\boldsymbol\alpha; x_N)}{\partial\alpha_1} & \cdots & \frac{\partial f(\boldsymbol\alpha; x_N)}{\partial\alpha_m} \end{pmatrix} \qquad (40)$$
computed at $\boldsymbol\alpha = \boldsymbol\alpha^*$. We denote by $\boldsymbol\varepsilon^*$ the departure of the measurements from the model at $\boldsymbol\alpha = \boldsymbol\alpha^*$:
$$\boldsymbol\varepsilon^* = \mathbf{y} - \mathbf{f}_{\alpha^*}$$
With this notation and using the linearized form of the model (39) we get a linearized expression for the $\chi^2$ function:
$$\chi^2 = (\boldsymbol\varepsilon^* - J\Delta\boldsymbol\alpha)^T V^{-1}(\boldsymbol\varepsilon^* - J\Delta\boldsymbol\alpha). \qquad (41)$$

We find the minimum by setting the differential to zero:

$$\nabla_\alpha\chi^2 = -2(J\nabla_\alpha\Delta\boldsymbol\alpha)^T V^{-1}(\boldsymbol\varepsilon^* - J\Delta\boldsymbol\alpha) = -2J^T V^{-1}(\boldsymbol\varepsilon^* - J\Delta\boldsymbol\alpha) = 0$$
and solving for $\Delta\boldsymbol\alpha$:
$$\Delta\boldsymbol\alpha = (J^T V^{-1} J)^{-1} J^T V^{-1}\boldsymbol\varepsilon^* \qquad (42)$$
This is a correction to the parameters $(\alpha_1^*, \ldots, \alpha_m^*)$ at a new iteration stage:
$$\boldsymbol\alpha_{\mathrm{next}} = \boldsymbol\alpha^* + \Delta\boldsymbol\alpha.$$
The iteration is continued until some convergence criteria, discussed below, are satisfied. The covariance matrix of the converged solution $\boldsymbol\alpha = (\alpha_1, \ldots, \alpha_m)$ is:
$$V_\alpha = (J^T V^{-1} J)^{-1} \qquad (43)$$

It is the same expression as in case of linear LSQ (34) except that the matrix H of linear mapping is replaced by the Jacobian J (40) of the non-linear transformation computed at the χ2 minimum: H ↔ J.

Example: independent measurements yi

In this case the covariance matrix $V$ is diagonal and the diagonal elements are the variances of $y_i$, which are the squared error estimates of $y_i$: $V_{ij} = \delta_{ij}\sigma_{ij}$. Then the $\chi^2$ function becomes:
$$\chi^2 = \sum_{i=1}^N\left(\frac{y_i - f(\boldsymbol\alpha; x_i)}{\sigma_i}\right)^2. \qquad (44)$$
Similarly to the previous section we define an $m$-element column vector $\mathbf{c}$ and an $m \times m$ matrix $B$ as follows:
$$\mathbf{c} = J^T V^{-1}\boldsymbol\varepsilon^*, \qquad B = J^T V^{-1} J.$$
$B$ and $\mathbf{c}$ are calculated at $\boldsymbol\alpha = \boldsymbol\alpha^*$ at each iteration stage and the iterated correction is
$$\Delta\boldsymbol\alpha = B^{-1}\mathbf{c}.$$

The elements of $\mathbf{c}$ and $B$ are (in case of uncorrelated measurements):
$$c_i = \sum_{k=1}^N\sigma_k^{-2}\,\varepsilon_k^*\left(\frac{\partial f(\boldsymbol\alpha; x_k)}{\partial\alpha_i}\right)_{\boldsymbol\alpha=\boldsymbol\alpha^*}$$
$$B_{ij} = \sum_k J_{ki}(V^{-1}J)_{kj} = \sum_k J_{ki}\sum_l\sigma_k^{-2}\delta_{kl}J_{lj} = \sum_k\sigma_k^{-2} J_{ki}J_{kj} = \sum_k\sigma_k^{-2}\left(\frac{\partial f(\boldsymbol\alpha; x_k)}{\partial\alpha_i}\,\frac{\partial f(\boldsymbol\alpha; x_k)}{\partial\alpha_j}\right)_{\boldsymbol\alpha=\boldsymbol\alpha^*}$$
where $\varepsilon_k^* = y_k - f(\boldsymbol\alpha^*; x_k)$. The inverse of $B$ gives the covariance matrix $V_\alpha$.
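As an illustration of the iteration, the following minimal C++ sketch fits an exponential model $f(\boldsymbol\alpha; x) = \alpha_1 e^{\alpha_2 x}$ to invented data points using the correction $\Delta\boldsymbol\alpha = B^{-1}\mathbf{c}$ above (a sketch only; no safeguards against a singular $B$ or divergence):

#include <cstdio>
#include <cmath>

int main() {
  const int N = 6;
  double x[N]     = {0, 1, 2, 3, 4, 5};
  double y[N]     = {5.1, 3.0, 1.9, 1.2, 0.8, 0.4};   // illustrative data
  double sigma[N] = {0.3, 0.3, 0.2, 0.2, 0.1, 0.1};
  double a[2] = {5.0, -0.4};                           // initial values alpha*

  for (int iter = 0; iter < 10; ++iter) {
    double B[2][2] = {{0, 0}, {0, 0}}, c[2] = {0, 0};
    for (int k = 0; k < N; ++k) {
      double w    = 1.0 / (sigma[k] * sigma[k]);
      double f    = a[0] * exp(a[1] * x[k]);
      double J[2] = {exp(a[1] * x[k]), a[0] * x[k] * exp(a[1] * x[k])}; // df/da1, df/da2
      double eps  = y[k] - f;                                          // epsilon*_k
      for (int i = 0; i < 2; ++i) {
        c[i] += w * J[i] * eps;
        for (int j = 0; j < 2; ++j) B[i][j] += w * J[i] * J[j];
      }
    }
    double det = B[0][0]*B[1][1] - B[0][1]*B[1][0];
    double da[2] = {( B[1][1]*c[0] - B[0][1]*c[1]) / det,
                    (-B[1][0]*c[0] + B[0][0]*c[1]) / det};
    a[0] += da[0];  a[1] += da[1];
    if (fabs(da[0]) < 1e-8 && fabs(da[1]) < 1e-8) break;   // crude convergence test
  }
  printf("alpha1 = %.4f, alpha2 = %.4f\n", a[0], a[1]);
  return 0;
}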

4.2.1 Convergence criteria

We denote by $\boldsymbol\varepsilon = \mathbf{y} - \mathbf{f}_\alpha$ the deviations of the measurements $\mathbf{y}$ from the model $\mathbf{f} = \mathbf{f}(\boldsymbol\alpha; \mathbf{x})$ at an iteration stage. Then $\Delta\chi^2 = 2\,\Delta\boldsymbol\varepsilon^T V^{-1}\boldsymbol\varepsilon = -2\,\Delta\mathbf{f}^T V^{-1}\boldsymbol\varepsilon$, so we have an estimated distance from the minimum:
$$\Delta\chi^2 = -2\,\Delta\boldsymbol\alpha^T J^T V^{-1}\boldsymbol\varepsilon \qquad (45)$$
The equation $\Delta\chi^2 = 1$ defines a one-standard-deviation (hyper)ellipsoid in the error space. The condition $\Delta\chi^2 < 0.1$ is usually a sufficient convergence criterion, since then $\Delta\alpha_i \ll \sqrt{\sigma^2_{\alpha_i}}$, i.e. the parameter changes at the iteration are much smaller than their statistical errors.

4.3 Least squares fit with constraints

By constraints we mean that the parameters to be fitted are related to each other by a physical law. In particle physics this kind of law is for example the momentum-energy conservation law. Take a particle decay, for example: K± → π±π+π−.

[Diagram: a K meson decaying into three pions.]

Consider that the 3-momenta of the four particles are measured. Hence there are 3×4 parameters (all momentum components) to be fitted so that the energy-momentum (4-momentum) conservation constraint is satisfied. Let us denote them $\alpha_{ij}$, $i = 1, 2, 3$, $j = 1, \ldots, 4$, and the covariance matrix of the particle $j$ measurement as $V_j$. Usually one chooses the parameters for the particle $j$ as:
$$\boldsymbol\alpha_j = (1/p_j,\ \lambda_j,\ \varphi_j) \qquad (46)$$

The 4-momentum conservation imposes the following four constraint equations:

$$\sum_{j=1}^4 p_{1j} = \sum_{j=1}^4 p_{2j} = \sum_{j=1}^4 p_{3j} = 0 \qquad (47)$$
$$\sum_{j=1}^4\sqrt{m_j^2 + p_{1j}^2 + p_{2j}^2 + p_{3j}^2} = 0. \qquad (48)$$

Here the momentum and energy of the incoming particle are taken negative.

The measured $\boldsymbol\alpha_j$ do not satisfy the 4-momentum conservation due to measurement errors, so the problem is to find corrections $\Delta\boldsymbol\alpha_j$ such that the constraint equations hold and the function $\chi^2 = \Delta\boldsymbol\alpha^T V^{-1}\Delta\boldsymbol\alpha$ is minimized. More generally we have the equations:
$$\chi^2 = \Delta\mathbf{y}^T V^{-1}\Delta\mathbf{y}, \qquad \Delta\mathbf{y} = \mathbf{y} - \mathbf{y}^{\mathrm{meas}} \qquad (49)$$
$$f_k(\mathbf{y}; \mathbf{A}) = 0, \qquad k = 1, \ldots, m \qquad (50)$$
where $\mathbf{y}$ refers to an array of $n$ measurements and $\mathbf{A}$ to an array of possible extra $p$ ($p < m$) parameters to be fitted. The solution goes by the method of Lagrange multipliers. One has to minimize the function:
$$\chi^2 = \Delta\mathbf{y}^T V^{-1}\Delta\mathbf{y} + 2\sum_{k=1}^m\lambda_k f_k(\mathbf{y}; \mathbf{A})$$
or in matrix form:
$$\chi^2 = \Delta\mathbf{y}^T V^{-1}\Delta\mathbf{y} + 2\boldsymbol\lambda^T\mathbf{f} \qquad (51)$$
to be minimized for $\Delta\mathbf{y}$, $\mathbf{A}$ and $\boldsymbol\lambda$. The number of parameters is $N = n + p + m$. The solution must be found by iteration. Suitable starting values are $\Delta\mathbf{y} = 0$. By linearizing the functions $\mathbf{f}$ around the "so far best" values $\mathbf{A}^*$, $\mathbf{y}^*$ we have the condition for the new iterated values of $\mathbf{A}$ and $\mathbf{y}$:
$$\mathbf{f}(\mathbf{A}; \mathbf{y}) = \mathbf{f}(\mathbf{A}^*; \mathbf{y}^*) + A(\Delta\mathbf{a} - \Delta\mathbf{a}^*) + B(\Delta\mathbf{y} - \Delta\mathbf{y}^*) \approx 0 \qquad (52)$$

where $\Delta\mathbf{a}$ and $\Delta\mathbf{y}$ are the new corrections to the initial values. Notice that $\mathbf{A} - \mathbf{A}^* = \Delta\mathbf{a} - \Delta\mathbf{a}^*$ and $\mathbf{y} - \mathbf{y}^* = \Delta\mathbf{y} - \Delta\mathbf{y}^*$. Here the matrices $A$ ($m \times p$) and $B$ ($m \times n$) are the Jacobians:
$$A = \frac{\partial(\mathbf{f})}{\partial(\mathbf{A})}, \qquad B = \frac{\partial(\mathbf{f})}{\partial(\mathbf{y})}$$
calculated at $\mathbf{A}^*$, $\mathbf{y}^*$. The linearized constraint equations are then:
$$A\Delta\mathbf{a} + B\Delta\mathbf{y} = \mathbf{c}\,; \qquad \mathbf{c} = A\Delta\mathbf{a}^* + B\Delta\mathbf{y}^* - \mathbf{f}(\mathbf{A}^*, \mathbf{y}^*)$$
and the linearized $\chi^2$:
$$\chi^2 = \Delta\mathbf{y}^T V^{-1}\Delta\mathbf{y} + 2\boldsymbol\lambda^T(A\Delta\mathbf{a} + B\Delta\mathbf{y} - \mathbf{c}). \qquad (53)$$

Minimization by differentiation yields the following group of equations:
$$\begin{cases} V^{-1}\Delta\mathbf{y} + B^T\boldsymbol\lambda = 0 \\ A^T\boldsymbol\lambda = 0 \\ A\Delta\mathbf{a} + B\Delta\mathbf{y} - \mathbf{c} = 0 \end{cases} \qquad (54)$$

The total number of equations here is $N = n + m + p$ for $\Delta\mathbf{y}$, $\Delta\mathbf{a}$ and $\boldsymbol\lambda$. A few matrix operations are needed to find the solution (we leave the detailed derivation to the reader). We first denote $V_B = B V B^T$ and $V_A = A^T V_B^{-1} A$. With these notations we get:
$$\begin{cases} \Delta\mathbf{a} = V_A^{-1} A^T V_B^{-1}\mathbf{c} \\ \boldsymbol\lambda = (V_B^{-1} A V_A^{-1} A^T V_B^{-1} - V_B^{-1})\,\mathbf{c} \\ \Delta\mathbf{y} = -V B^T\boldsymbol\lambda \end{cases} \qquad (55)$$
The solution is iterated until no significant changes to the parameter values occur. A special case is that there are no extra (unmeasured) parameters ($p = 0$), which implies $A \equiv 0$. This is the case e.g. when all particles can be measured. Then the correction vector at an iteration step for the measurements reads:
$$\begin{cases} \Delta\mathbf{y} = V B^T V_B^{-1}\mathbf{c} \\ \mathbf{c} = B\Delta\mathbf{y}^* - \mathbf{f}(\mathbf{y}^*) \end{cases} \qquad (56)$$

where $\mathbf{y}^* = \mathbf{y}^{\mathrm{meas}} + \Delta\mathbf{y}^*$ is from the previous iteration and $V_B$ is defined above. We denote the converged solution vector as
$$\hat{\mathbf{y}} = \mathbf{y}^{\mathrm{meas}} + \Delta\mathbf{y}. \qquad (57)$$

Next we consider the calculation of the covariance matrix $V(\hat{\mathbf{y}})$ for the solution $\hat{\mathbf{y}}$. The Jacobian of the expression (57) is
$$\frac{\partial(\hat{\mathbf{y}})}{\partial(\mathbf{y})} = 1 + \frac{\partial(\Delta\mathbf{y})}{\partial(\mathbf{y})} = 1 + V B^T V_B^{-1}\frac{\partial(\mathbf{c})}{\partial(\mathbf{y})}$$
On the other hand $\mathbf{c} = B\Delta\mathbf{y}^* - \mathbf{f}$ and it follows: $\frac{\partial(\mathbf{c})}{\partial(\mathbf{y})} = -\frac{\partial(\mathbf{f})}{\partial(\mathbf{y})} = -B$. The covariance matrix of $\hat{\mathbf{y}}$ is obtained by error propagation:
$$V(\hat{\mathbf{y}}) = \frac{\partial(\hat{\mathbf{y}})}{\partial(\mathbf{y})}\, V\left[\frac{\partial(\hat{\mathbf{y}})}{\partial(\mathbf{y})}\right]^T$$

and we get:
$$V(\hat{\mathbf{y}}) = (1 - V B^T V_B^{-1} B)\, V\,(1 - V B^T V_B^{-1} B)^T = V - 2\, V B^T V_B^{-1} B V + V B^T V_B^{-1} B V B^T V_B^{-1} B V$$
$$V(\hat{\mathbf{y}}) = V - V B^T V_B^{-1} B V \qquad (58)$$
Notice: we use here $V_B^{-1} B V B^T = V_B^{-1} V_B = 1$ (unit matrix), and the fact that a covariance matrix and hence its inverse is symmetric by definition. On the other hand, for the covariance matrix of $\Delta\mathbf{y}$ we get:
$$V(\Delta\mathbf{y}) = \frac{\partial(\Delta\mathbf{y})}{\partial(\mathbf{y})}\, V\left[\frac{\partial(\Delta\mathbf{y})}{\partial(\mathbf{y})}\right]^T = V B^T V_B^{-1} B V\,(V B^T V_B^{-1} B)^T = V B^T V_B^{-1} B V$$
$$V(\Delta\mathbf{y}) = V - V(\hat{\mathbf{y}}) \qquad (59)$$

4.4 Fit quality tests

4.4.1 Pull or stretch values of fitted parameters

A pull value for the parameter i is defined as

$$p_i = \frac{\Delta y_i}{\sqrt{V(\Delta y)_{ii}}} = \frac{\Delta y_i}{\sqrt{V_{ii} - V(\hat y)_{ii}}} \qquad (60)$$
i.e. the difference between the fitted and measured values divided by its error estimate. The pull values $p_i$ are randomly distributed and they should follow the standard normal distribution N(0, 1). Plotting the pull values and checking whether they follow N(0, 1) is used to control the fit quality. If this test fails, the reason might be that the used model does not describe the experimental measurements, or the measurements are biased, or the error estimates, which are used for the weights, are not properly estimated.

4.4.2 Fit probability distribution

When fits are repeated several times with variable input, the resulting $\chi^2$ values should follow the theoretical $\chi^2$ distribution with the parameter $N_D$, the number of degrees of freedom. Inspecting the $\chi^2$ distribution is not an easy way to see whether the fits are OK, especially when the number $N_D$ varies from one fit to another. A better way is to make the so called $\chi^2$ probability distribution, which should follow a uniform distribution in the range (0,1) independent of $N_D$. If a variable is a random variate, like the $\chi^2$ values, then also any function of it is a random variate. It can be shown that if a variate $x$ follows the probability distribution $f(x)$, then the values of its cumulative distribution function
$$F(x) = \int_{-\infty}^x f(t)\,dt$$

Figure 2: Random Standard Normal distribution and its fit

follow a uniform distribution in the range (0,1). Let us denote the $\chi^2$ distribution function as $f_{N_D}(x)$. Then the corresponding cumulative distribution function reads:
$$F_{N_D}(\chi^2) = \int_0^{\chi^2} f_{N_D}(t)\,dt.$$
It is not integrable in closed form, but its evaluation can be found in certain libraries. The ROOT TMath class offers one, the Prob() function. The code

double chiProb=TMath::Prob(chi,ndf);

computes the $\chi^2$ probability for a given $\chi^2$ value (chi) and for the number of degrees of freedom ndf.

Figure 3: χ2 distribution with NDF=3 and the corresponding χ2 probability distribution.

In real life there are often outliers in the measured data, which do not fit well the assumed model. The outliers tend to create a "forward peak" in the fits' χ2 probability distribution. In the above figures we have an ideal case with no outliers.

4.5 Maximum Likelihood and Poisson statistics

Fitting histograms with low statistics with the χ2 method is not ideal, because the bin fluctuations are not Gaussian. In contrast, large statistics histograms can be fitted with the least squares method assuming that the bin error is $\sqrt{n_i}$, where $n_i$ is the number of entries in bin number $i$.

Figure 4: Example of a small statistics histogram - exponential distribution. The true mean of the generated histogram should be 1 and the slope should be -1. The curve is fitted with the Least Squares method. For this histogram the Poisson likelihood fit gives the estimate slope = -1.09.

For small statistics histograms one chooses a Maximum Likelihood fit with Poisson statistics:
$$P(n_i) \equiv p_i = \frac{\langle n_i\rangle^{n_i}}{n_i!}\, e^{-\langle n_i\rangle}, \qquad i = 1, \ldots, N_{\mathrm{bins}}, \qquad (61)$$
i.e. we have $N_{\mathrm{bins}}$ Poisson distributions which describe the probability of fluctuations in each bin $i$. Suppose we have a model $f(x; \bar a)$ to describe the spectrum $dN/dx$. The model should be such that
$$\langle n_i\rangle = \int_{x_i}^{x_{i+1}} f(x; \bar a)\,dx \equiv f_i(\bar a), \qquad i = 1, \ldots, N_{\mathrm{bins}}$$
($f_i(\bar a)$ is a notation for the integral). The bins are independent, so the likelihood function is
$$L = \prod_{i=1}^{N_{\mathrm{bins}}} p_i = \prod_{i=1}^{N_{\mathrm{bins}}}\frac{f_i(\bar a)^{n_i}}{n_i!}\, e^{-f_i(\bar a)} \qquad (62)$$

The parameters $\bar a$ can be determined by minimizing $-\log L$:
$$-\log L = \sum_{i=1}^{N_{\mathrm{bins}}}[\log n_i! + f_i(\bar a) - n_i\log f_i(\bar a)] \qquad (63)$$
$$= \mathrm{const} + \sum_{i=1}^{N_{\mathrm{bins}}}[f_i(\bar a) - n_i\log f_i(\bar a)] \qquad (64)$$
The minimization is performed by finding the zeros of the derivatives of $-\log L$ with respect to the parameters $\bar a$:
$$-\frac{\partial(\log L)}{\partial a_j} = \sum_{i=1}^{N_{\mathrm{bins}}}\frac{\partial f_i(\bar a)}{\partial a_j}\left(1 - \frac{n_i}{f_i(\bar a)}\right) = 0, \qquad j = 1, \ldots, N_p, \qquad (65)$$
where $N_p$ is the number of parameters $\bar a$. Normally this group of $N_p$ equations must be solved numerically by iteration.
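For illustration, the following minimal C++ sketch (the function name and the binning interface are chosen freely here) evaluates the parameter-dependent part of $-\log L$ for an exponential model $f(x; \bar a) = a_1 e^{a_2 x}$; it could be handed to any numerical minimizer. It assumes $a_2 \ne 0$ and positive bin contents of the model integral:

#include <cmath>

// -log L (up to the constant sum of log n_i!) for an exponential model
// f(x; a) = a[0]*exp(a[1]*x), with Nbins bins bounded by edge[0..Nbins].
double negLogL(const double* a, const int* n, const double* edge, int Nbins) {
  double s = 0.0;
  for (int i = 0; i < Nbins; ++i) {
    // f_i(a) = integral of the model over bin i
    double fi = (a[0] / a[1]) * (exp(a[1] * edge[i+1]) - exp(a[1] * edge[i]));
    s += fi - n[i] * log(fi);
  }
  return s;
}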

Figure 5: Comparison of small statistics histograms fitted with Least Squares and Poisson Likelihood method. Exponential histograms with only 20 entries were generated and fitted 10000 times. The fitted slopes (absolute values) are plotted. The LSQ method largely and systematically underestimates the slope values, whereas the Poisson Likelihood gives correct estimates on the average.

In ROOT, for example, a loglikelihood fit to an exponential can be performed as follows:

his->Fit("expo","LO");

where his is a pointer to the exponential histogram.

4.5.1 Maximum Likelihood and parameter errors

Let us denote:
$$W(\bar a) \equiv \log L(\bar a) \qquad (66)$$
We expand $W(\bar a)$ in a Taylor series around $\bar a = \bar a^*$, where $\bar a^*$ maximizes $W$:
$$W(\bar a) = W(\bar a^*) + J(\bar a - \bar a^*) - \tfrac{1}{2}(\bar a - \bar a^*)^T H\,(\bar a - \bar a^*) + \ldots$$
where $J$ is the Jacobian of $W$ and $H$ is the Hessian of $-W$:
$$H_{ij} = -\frac{\partial^2\log L}{\partial a_i\,\partial a_j}$$
Now at the maximum $\bar a = \bar a^*$ the second term in the series vanishes. By comparison with the series expansion of the exponential we conclude that:
$$L(\bar a) \simeq \mathrm{const}\cdot\exp(-\tfrac{1}{2}\Delta\bar a^T H\,\Delta\bar a)$$
which is a general (multidimensional) Gaussian distribution with the weight matrix $H$. The covariance matrix is its inverse:
$$\mathrm{cov}(a_i, a_j) \equiv V_{ij} = (H^{-1})_{ij} \qquad (67)$$

This is the error matrix of the parameters fitted with the Maximum Likelihood method.

5 Introduction to Monte Carlo simulation

Monte Carlo simulation is sometimes called stochastic simulation. The basic requirement of Monte Carlo simulation is to have means to generate random numbers which follow density functions typical of the physical processes in question. For example, γ-spectra have peaks which can be simulated with Gaussians.

5.1 Generation of non-uniform distribution by inversion method

This method can also be called the method of the inverted cumulative distribution function. It is generally valid and useful whenever the inverse of the cumulative distribution can be evaluated. Here we use the notation: $f(x)$ is the density function of the random variate $x$ and $F(x)$ the corresponding cumulative distribution function:
$$F(x) = \int_{-\infty}^x f(t)\,dt$$

Figure 6: A probability density function and its cumulative distribution

In all its simplicity the inversion method works as follows:

1◦ Generate $u$ from the uniform distribution (0, 1)

2◦ Calculate $x = F^{-1}(u)$

Then the random variates x follow the density f(x). It can be proved as follows. The cumulative probability (=F ) of the variate x:

$$P[x \le t] = P[F^{-1}(u) \le t] = P[F(F^{-1}(u)) \le F(t)] = P[u \le F(t)] = F(t) \quad \square$$
The second equality follows from the fact that cumulative distribution functions are monotonically increasing and the last equality follows from $u$ being uniform in the range (0, 1).

Example: Particle lifetime or intensity of a radiative source as a function of time: exponential distribution as a function of time:
$$f(t) = \lambda e^{-\lambda t} \quad (t \ge 0)$$
$$F(t) = 1 - e^{-\lambda t} \quad (0 \le F < 1)$$
$$F^{-1}(u) = -\lambda^{-1}\log(1 - u) \quad (0 \le u < 1)$$

Here $\lambda$ is a parameter. The mean lifetime is $\tau = \lambda^{-1}$ and the half-life is $t_{1/2} = \tau\log 2$. Now we can write a C++ function which generates variates from an exponential with mean tau:

#include <cstdlib>
#include <cmath>

double rnexpo(double tau) {
  double rnduni = (rand() + 1.0) / ((double)RAND_MAX + 1.0);  // uniform (0,1], avoids log(0)
  return -tau * log(rnduni);
}

Notice: Instead of $1 - u$ one can use $u$, because both are uniform (0,1). We use ROOT to make an example plot with this code using tau=2:

Figure 7: Random exponential distribution generated with the above code. The straight line in logarithmic scale shows that the plot is exponential.

If one is working with ROOT, the class TRandom provides a good many utilities for random number generation. For example the code

rnexp = gRandom->Exp(Tau);

provides an exponential random number with mean Tau.

5.2 Inversion method for discrete distribution

In order to generate from a discrete probability $p_i$, $i = 1, 2, \ldots$ one tabulates the cumulative probability:
$$F_j = \sum_{i=1}^j p_i, \qquad j = 1, \ldots, N \qquad (68)$$
where $N$ is large enough so that $F_N \simeq 1$. The procedure is then:

22 1◦ Generate u from uniform [0,1)

2◦ Determine k such that $F_{k-1} \le u < F_k$

Then the generated values $k$ follow the wanted discrete distribution.

Example: Suppose the discrete probabilities are $p_i = \mathrm{const}/i$, $i = 1, \ldots, 10$. The task is to write a code for the generation.

#include <iostream>
#include <cstdlib>

void generateK(int *k) {
  static bool done = false;
  const int nProb = 10;
  static float iProbCumul[nProb];
  if (!done) {                               // initializations: p_i = const/i
    done = true;
    float iProb[nProb];
    float sum = 0;
    for (int i = 0; i < nProb; i++) { iProb[i] = 1.0f/(i+1); sum += iProb[i]; }
    iProbCumul[0] = iProb[0]/sum;            // normalized cumulative probabilities
    for (int i = 1; i < nProb; i++) iProbCumul[i] = iProbCumul[i-1] + iProb[i]/sum;
  }
  float u = (float)rand()/(float)RAND_MAX;   // uniform [0,1)
  *k = 1;
  for (int i = 1; i < nProb; i++) if (u >= iProbCumul[i-1]) *k = i+1;
}

int main() {                                 // testing
  for (int i = 0; i < 100; i++) { int k; generateK(&k); std::cout << k << " "; }
  std::cout << std::endl;
  return 0;
}

5.3 Approximate inversion method using tabulation

This method can be used for continuous density distribution, when a closed form is not available for inversion method. One tabulates the pairs of values (xi,F(xi)), xi < xi+1. Then the generation algorithm goes as follows:

1◦ Generate u from uniform [0,1)

2◦ Find $x_i$ such that $F(x_i) \le u < F(x_{i+1})$

3◦ Compute $x$ by interpolation:
$$x = \frac{[F(x_{i+1}) - u]\,x_i + [u - F(x_i)]\,x_{i+1}}{F(x_{i+1}) - F(x_i)}$$
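A minimal C++ sketch of the corresponding generation step, assuming the table $(x_i, F(x_i))$ has already been filled with $F$ increasing from about 0 to 1 (the function name is chosen freely):

#include <cstdlib>

// Approximate inversion: xtab[0..n-1] and Ftab[0..n-1] tabulate (x_i, F(x_i)),
// with Ftab increasing from ~0 to ~1.
double generateTabulated(const double* xtab, const double* Ftab, int n) {
  double u = (double)rand() / (double)RAND_MAX;    // uniform [0,1)
  int i = 0;
  while (i < n - 2 && Ftab[i+1] <= u) ++i;         // find F(x_i) <= u < F(x_{i+1})
  // linear interpolation between x_i and x_{i+1}
  return ((Ftab[i+1] - u)*xtab[i] + (u - Ftab[i])*xtab[i+1]) / (Ftab[i+1] - Ftab[i]);
}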

5.4 Hit-or-Miss method

Suppose a density distribution f = f(x) is defined in a finite range:

x ∈ [a, b] and M = max{f(x)|a ≤ x ≤ b}.

If the inversion method is not applicable (i.e. the inverse cumulative density function cannot be computed in closed form), the following method is always valid (von Neumann 1951):

1◦ Generate a random number $r_1$ uniform in [a, b)

2◦ Generate a random number $r_2$ uniform in [0, M)

3◦ If $r_2 > f(r_1)$, go back to 1◦ (miss)

4◦ If $r_2 \le f(r_1)$, take $r_1$ (hit)

5◦ Go back to 1◦ for the next random number

The random numbers $r_1$ then follow the density distribution $f(x)$. The efficiency of the method goes like the ratio of the areas:
$$\mathrm{Eff} = \frac{\int_a^b f(x)\,dx}{M(b - a)} = \frac{1}{M(b - a)}$$
The efficiency can be bad for distributions which have a narrow peak. Note: If the maximum $M$ cannot be calculated, one can use a value $M'$ which is known to be larger than the maximum in the range [a, b) (which would make the efficiency smaller).
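A minimal C++ sketch of the method for an arbitrary density $f$ on [a, b), with a user-supplied bound $M \ge \max f$ (the function name is chosen freely):

#include <cstdlib>

// Hit-or-miss generation from density f on [a,b), with M >= max f(x) on [a,b).
double hitOrMiss(double (*f)(double), double a, double b, double M) {
  while (true) {
    double r1 = a + (b - a) * rand() / (double)RAND_MAX;   // uniform in [a,b)
    double r2 = M * rand() / (double)RAND_MAX;             // uniform in [0,M)
    if (r2 <= f(r1)) return r1;                            // hit
    // otherwise miss: try again
  }
}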

5.5 Hit-or-Miss method by comparison function

Let f be a density function from which we should generate random variates. Now we do not restrict the range of f's definition to a finite range (in the figure below the range is (−∞, ∞)).

24 Let g be another density function for which we have a generation method available (e.g. the inversion method) and which satisfies the condition:

∃ constant a such that a g(x) ≥ f(x) ∀x.

We thus require that we can find such a constant a that the graph of the product function a g(x) is everywhere above the graph of the function f(x) (and is defined in the same range).

[Figure: the comparison function a·g(x) lies everywhere above f(x).]

The distribution f can be generated then using the following Hit-or-Miss method which uses the inversion method for g:

1◦ Generate a random number $r_1$ uniform (0, 1)

2◦ Compute $x = G^{-1}(r_1)$

3◦ Generate a random number $r_2$ uniform (0, $a\,g(x)$)

4◦ If $r_2 > f(x)$, go back to 1◦ (miss)

5◦ If $r_2 \le f(x)$, take $x$ (hit)

6◦ Go back to 1◦

5.6 Composition method

The composition method can be used when the density function $f$ can be expressed as the following integral composition:
$$f(x) = \int g_y(x)\,dH(y) = \int g_y(x)\,h(y)\,dy \qquad (69)$$
where $h$ and $g_y$ are density functions. The distribution $f$ is then generated with the following algorithm:

1◦ Generate a random number y from the h distribution

◦ 2 Generate x from gy distribution using the generated y value

The density distribution $h$ can also be discrete. Then the probability element $h(y)dy$ in the formula (69) is replaced by the discrete probability $p_i$ and the integral by a sum:
$$f(x) = \sum_i p_i g_i(x)$$
The functions $g_i$ can be for example pieces of the function $f$ in suitable intervals $i$:
$$g_i(x) = \frac{f(x)}{p_i}, \qquad p_i = \int_i f(x)\,dx$$
where the integration is taken over the interval $i$ ($p_i$ is a normalization factor). Then the generation algorithm reads:

1◦ Generate a random $i$ from the discrete distribution $p_i$.

2◦ Generate $x$ in the interval $i$ from the distribution $g_i$.

In each interval one uses a generation method applicable in the interval in question. This is a method used often in practical applications.

6 Generation from common continuous distributions

6.1 Uniform distribution

From now on we use the notation Unif(a, b) for a uniform distribution in the interval (a, b). The density function is:
$$f(x) = \begin{cases} (b - a)^{-1} & a \le x \le b \\ 0 & \text{otherwise} \end{cases}$$
Or by using the step function:
$$f(x) = \frac{\Theta(x - a)\,\Theta(b - x)}{b - a}$$
The mean and variance are:
$$\bar x = \tfrac{1}{2}(a + b), \qquad \sigma_x^2 = \tfrac{1}{12}(b - a)^2$$
The generation from the distribution Unif(a, b) is based on the generation from the basic distribution Unif(0, 1), for which there is a large number of options in various program libraries. This generator forms in fact a base of generation for almost all other density distributions. The generation of the general uniform distribution Unif(a, b) is based on the transformation:
$$x \to x' = a + (b - a)x,$$
which is a linear mapping from the interval (0, 1) to the interval (a, b). Hence the algorithm is simply:

1◦ Generate $r_1$ Unif(0, 1)

2◦ Compute $r_2 = a + (b - a)r_1$

Then the random numbers r2 are distributed as Unif(a, b).

6.2 Gaussian distribution

In what follows we denote the general normal (Gaussian) distribution as N(µ, σ) where µ is the mean of the distribution and σ its standard deviation. The Gaussian density function is
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}, \qquad (70)$$
which is defined on the whole real axis. The Gaussian N(0, 1) is called the standard normal distribution. The Gaussian cumulative distribution function is:
$$F(x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\frac{x-\mu}{\sigma}} e^{-\frac{1}{2}t^2}\,dt \equiv \Psi\!\left(\frac{x-\mu}{\sigma}\right). \qquad (71)$$

27 Figure 8: Standard normal distribution

The function Ψ is the cumulative distribution function of the standard normal N(0, 1). It cannot be computed in closed form. Ψ is found for example in the ROOT TMath class as TMath::Freq(double x). The even moments of N(µ, σ) are:
$$\mu_{2k} = \frac{(2k)!}{2^k k!}\,\sigma^{2k}, \qquad k = 1, 2, 3, \ldots$$
All odd moments vanish. It follows that the skewness and kurtosis are
$$\gamma_1 = \frac{\mu_3}{\sigma^3} = 0, \qquad \gamma_2 = \frac{\mu_4}{\sigma^4} - 3 = 0.$$
The characteristic function is $\Phi(t) = \exp(it\mu - \tfrac{1}{2}t^2\sigma^2)$. The half width at half maximum (HWHM) of N(µ, σ) is $\sqrt{2\log 2}\,\sigma \simeq 1.18\,\sigma$. The tails of the Gaussian decrease fast: as can be seen from the table below, only about 2.7‰ of the Gaussian variates are farther than three standard deviations (three σ's) from the mean.
$$P(|x - \mu| \le \sigma) = 2\Psi(1) - 1 = 68.27\%$$
$$P(|x - \mu| \le 2\sigma) = 2\Psi(2) - 1 = 95.45\%$$
$$P(|x - \mu| \le 3\sigma) = 2\Psi(3) - 1 = 99.73\%.$$
Let us consider the following example in which we apply the method presented in Appendix A to derive the density of a function which depends on two random variates of given density.

Example: Let $x$ and $y$ be independent of each other with the density N(0, 1). Find out what is the density distribution of the ratio $z = y/x$. Thus we have to calculate the integral:
$$f(z) = \int_{-\infty}^{\infty} dx\int_{-\infty}^{\infty} dy\,\frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x^2}\,\frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}y^2}\,\delta\!\Big(z - \frac{y}{x}\Big)$$
Using the formula
$$\delta(g(y)) = \sum_i\frac{\delta(y - y_i)}{|g'(y_i)|},$$
where the $y_i$'s are the zeros of the function $g$, we get the δ-function in the form: $\delta(z - \frac{y}{x}) = |x|\,\delta(y - zx)$. It follows that the density function of the variate $z$ is:
$$f(z) = \frac{1}{2\pi}\int_{-\infty}^{\infty} dx\,|x|\, e^{-\frac{1}{2}x^2}\int_{-\infty}^{\infty} dy\, e^{-\frac{1}{2}y^2}\,\delta(y - zx) = \frac{1}{\pi}\int_0^{\infty} dx\, x\, e^{-\frac{1}{2}x^2(1+z^2)} = \frac{1}{\pi}\,\frac{1}{1 + z^2}.$$
This is the so called Cauchy distribution which we shall discuss more closely later. Let us consider the transformations between the standard normal and the general Gaussian distribution: N(µ, σ) ↔ N(0, 1). We can easily show that

if $x$ is N(µ, σ), then $t = \frac{x - \mu}{\sigma}$ is N(0, 1);
if $t$ is N(0, 1), then $x = \sigma t + \mu$ is N(µ, σ).

It follows that it is sufficient to master the generation of N(0, 1) in order to be able to generate variates which follow the general Gaussian. Neither the cumulative distribution function Ψ (71) nor its inverse can be calculated in closed form, so the inversion method cannot be applied, except possibly via the tabulation method. However, the total integral $\int e^{-\frac{1}{2}t^2}\,dt$ can be calculated by a trick in which one uses polar coordinates:
$$\Big[\int_{-\infty}^{\infty} e^{-\frac{1}{2}x^2}\,dx\Big]^2 = \int_{-\infty}^{\infty} e^{-\frac{1}{2}x^2}\,dx\int_{-\infty}^{\infty} e^{-\frac{1}{2}y^2}\,dy = \int\!\!\!\int e^{-\frac{1}{2}(x^2+y^2)}\,dx\,dy = \int_{r=0}^{\infty}\!\int_{\varphi=0}^{2\pi} e^{-\frac{1}{2}r^2}\, r\,dr\,d\varphi = 2\pi\int_0^{\infty} e^{-\frac{1}{2}r^2}\, d(\tfrac{1}{2}r^2) = 2\pi.$$
Maybe the most important method of generating N(0, 1) is based on a similar trick, presented in the following.

6.2.1 Polar or Box-Muller method

Let x and y be independent standard normal variates. Their common 2D distribution function in plane is:

$$f(x, y) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x^2}\,\frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}y^2} = \frac{1}{2\pi} e^{-\frac{1}{2}(x^2+y^2)}$$
We transform this to the polar coordinate system which is defined as:
$$\begin{cases} x = r\cos\varphi \\ y = r\sin\varphi \end{cases} \qquad (72)$$
The surface element in the $xy$ plane transforms as $dx\,dy \to r\,dr\,d\varphi$, where the factor $r$ is the Jacobian determinant of the transformation (72). The density of the new variates $r$ and $\varphi$ is then:
$$g(r, \varphi) = f(x, y)\, r = \frac{1}{2\pi}\, r e^{-\frac{1}{2}r^2}.$$
We observe that the combined density of $r$ and $\varphi$ is separable:
$$g(r, \varphi) = f_\varphi(\varphi)\, f_r(r), \qquad \text{where } f_\varphi(\varphi) = \frac{1}{2\pi} \text{ and } f_r(r) = r e^{-\frac{1}{2}r^2}.$$
Hence the new variates $r$ and $\varphi$ are independent of each other, so they can be generated independently. The generation of $\varphi$ is trivial: Unif(0, 2π). The generation of $r$ can be made with the inversion method, because the cumulative density is:
$$F_r(r) = \int_0^r f_r(r)\,dr = 1 - e^{-\frac{1}{2}r^2},$$
and its inverse:
$$F_r^{-1}(u) = \sqrt{-2\log(1 - u)}.$$
So we conclude: when generating the variates $r$ and $\varphi$, one gets by inserting into the transformation formula (72) two normal variates $x$ and $y$. We have derived the polar or Box-Muller algorithm:

◦ 1 Generate u1 Unif(0, 1) and compute ϕ = 2πu1

2◦ Generate $u_2$ Unif(0, 1) and compute $r = \sqrt{-2\log u_2}$

3◦ Inserting into the formula (72) we get
$$\begin{cases} x = \sqrt{-2\log u_2}\,\cos(2\pi u_1) \\ y = \sqrt{-2\log u_2}\,\sin(2\pi u_1) \end{cases} \qquad (73)$$

two independent normal variates N (0, 1).

The algorithm is explicit and precise, but rather slow, because it involves the computation of four fairly CPU-costly functions: sqrt, log, cos and sin.
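A direct C++ transcription of the algorithm might look as follows (a sketch; the function name is chosen freely and rand() is used as the Unif(0, 1) source):

#include <cstdlib>
#include <cmath>

// Box-Muller: fills x and y with two independent N(0,1) variates.
void boxMuller(double& x, double& y) {
  const double pi = 3.14159265358979323846;
  double u1 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);   // uniform in (0,1)
  double u2 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
  double r  = sqrt(-2.0 * log(u2));
  x = r * cos(2.0 * pi * u1);
  y = r * sin(2.0 * pi * u1);
}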

6.2.2 Faster variation of the polar method

1◦ Generate $u_1$ and $u_2$ Unif(0, 1)

2◦ $v_1 \leftarrow 2u_1 - 1$; $v_2 \leftarrow 2u_2 - 1$. The points $(v_1, v_2)$ distribute uniformly in a square.

3◦ $w \leftarrow v_1^2 + v_2^2$. If $w > 1$, return to 1◦. The points $(v_1, v_2)$ are now uniform in the unit circle and they are used to compute a random sin and cos:
$$\cos\varphi = v_1/\sqrt{w}, \qquad \sin\varphi = v_2/\sqrt{w}$$

Moreover $w$ is Unif(0, 1) (show) and it replaces the random number $u_2$ in the formula (73).

4◦ Inserting into the formula (73) we get:
$$\begin{cases} x = v_1\sqrt{-2\log w/w} \\ y = v_2\sqrt{-2\log w/w} \end{cases}$$

The method is faster, because one avoids the computation of sin and cos. A negative effect is a small loss in the generation of random numbers: the relative loss is $1 - \frac{\pi}{4} \simeq 21\%$ of the random numbers due to the cut at point 3◦.

6.2.3 Gaussian generation by summation method

This method is based on the central limit theorem which states that the density of the sum variate Sn of n independent random numbers x1, x2, ...xn

$$S_n = \sum_{i=1}^n x_i$$
approaches N(µ, σ²) in the limit $n \to \infty$, where
$$\mu = \sum_{i=1}^n E(x_i) \qquad \text{and} \qquad \sigma^2 = \sum_{i=1}^n\sigma^2_{x_i}$$

Example: Let the $x_i$ be random variates Unif(0, 1). Then for the sum variate we have:
$$\mu = n\cdot\tfrac{1}{2} = \tfrac{n}{2}, \qquad \sigma^2 = n\cdot\tfrac{1}{12} = \tfrac{n}{12}.$$
We define a new random variate:
$$y_n = \sqrt{\frac{12}{n}}\Big(S_n - \frac{n}{2}\Big) = \sqrt{\frac{12}{n}}\, S_n - \sqrt{3n}.$$
The mean and variance of this variate are:
$$\bar y_n = \sqrt{\frac{12}{n}}\,\bar S_n - \sqrt{3n} = \sqrt{\frac{12}{n}}\cdot\frac{n}{2} - \sqrt{3n} = 0$$
$$\sigma^2_{y_n} = \Big(\sqrt{\tfrac{12}{n}}\Big)^2\mathrm{var}\Big(S_n - \frac{n}{2}\Big) = \frac{12}{n}\cdot\frac{n}{12} = 1.$$
According to the central limit theorem the behaviour of the variate $y_n$ approaches N(0, 1) in the limit $n \to \infty$. For practical applications the convergence is fast and $n \approx 10$ is large enough for many applications. The relative precision is poorest in the tails, because they get truncated: $|y_n| \le \sqrt{3n}$.

Often one uses the value $n = 12$. Then the expression of $y_n$ is simple:
$$y_{12} = -6 + \sum_{i=1}^{12} x_i$$
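A minimal C++ sketch of this $n = 12$ recipe (the function name is chosen freely here):

#include <cstdlib>

// Approximate N(0,1) variate: sum of 12 Unif(0,1) numbers minus 6.
double gaussBySum() {
  double s = 0.0;
  for (int i = 0; i < 12; ++i) s += rand() / (double)RAND_MAX;
  return s - 6.0;
}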

The summation method is especially suitable for calculation in parallel processors: the terms of the sum can be computed simultaneously, when one uses the multiplicative congruential method for the generation of Unif(0, 1). This is based on the following property:
$$\text{If } x_{i+1} = a x_i\ (\mathrm{mod}\ m), \text{ then } x_{i+k} = a^k x_i\ (\mathrm{mod}\ m)$$
It follows that the sequence of $k$ (e.g. $k = 64$) consecutive random numbers Unif(0, 1) can be computed directly from the same seed using simply different multiplicative coefficients $a, a^2, \ldots, a^k$. In parallel processor computers this can be done simultaneously.

6.2.4 Kahn’s method

Kahn's method is based on the approximative formula of the normal distribution:
$$\frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x^2} \simeq k\,\frac{e^{-kx}}{(1 + e^{-kx})^2} = f(x),$$
where the constant is $k = \sqrt{8/\pi}$. The cumulative density and its inverse can be easily calculated:
$$F(x) = \int_{-\infty}^x f(x)\,dx = (1 + e^{-kx})^{-1}$$
$$F^{-1}(u) = k^{-1}\log\frac{u}{1 - u}$$
The generation algorithm is in all its simplicity as follows:

1◦ Generate u Unif(0, 1)

2◦ Compute $x = \sqrt{\pi/8}\,\log\dfrac{u}{1 - u}$

The method is fast and well applicable for example in the simulation of measurement errors of physical quantities, if one does not require a precise Gaussian distribution.
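A minimal C++ sketch of Kahn's method (the function name is chosen freely):

#include <cstdlib>
#include <cmath>

// Kahn's approximation of an N(0,1) variate: x = sqrt(pi/8) * log(u/(1-u)).
double gaussKahn() {
  const double pi = 3.14159265358979323846;
  double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0);   // uniform in (0,1)
  return sqrt(pi / 8.0) * log(u / (1.0 - u));
}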

6.3 Multidimensional Gaussian distribution

Variate: $\vec x = (x_1, \ldots, x_k)$, a vector in $k$-dimensional space
Parameters: $\vec\mu = (\mu_1, \ldots, \mu_k)$, a vector in $k$-dimensional space; $V$, a symmetric positive definite $k \times k$ matrix ($\det V > 0$)
Density function: $f(\vec x) = \dfrac{1}{\sqrt{(2\pi)^k\det V}}\exp[-\tfrac{1}{2}(\vec x - \vec\mu)^T V^{-1}(\vec x - \vec\mu)]$
Characteristic function: $\Phi(\vec t) = \exp(i\,\vec t\cdot\vec\mu - \tfrac{1}{2}\vec t\cdot V\vec t)$

The connection of the parameters $\vec\mu$ and $V$ to the characteristics of the density function is as follows:
$\vec\mu = E(\vec x)$, the expectation value of $\vec x$;
$v_{ij} = \mathrm{cov}(x_i, x_j)$, the covariance matrix elements;
$v_{ii} = \sigma^2_{x_i}$, the diagonal elements are variances.

The inverse V −1 is often called a weight matrix.

32 Example 1: Gaussian distribution in plane (k=2)

According to the definition of the correlation ρ, one has $v_{12} = v_{21} = \mathrm{cov}(x_1, x_2) = \rho\sigma_1\sigma_2$, so that $V$ can be expressed as
$$V = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}$$
from which we get the weight matrix:
$$V^{-1} = \frac{(\sigma_1\sigma_2)^{-2}}{1 - \rho^2}\begin{pmatrix} \sigma_2^2 & -\rho\sigma_1\sigma_2 \\ -\rho\sigma_1\sigma_2 & \sigma_1^2 \end{pmatrix}$$

A general 2-dimensional Gaussian density distribution expressed as a function of $\sigma_1$, $\sigma_2$ and the correlation coefficient $\rho$ is then:

Figure 9: Two-dimensional Gaussian density with correlation coefficient ρ = −0.7.

$$f(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}\exp\left\{-\frac{1}{2(1-\rho^2)}\left[\Big(\frac{x_1-\mu_1}{\sigma_1}\Big)^2 - 2\rho\,\frac{x_1-\mu_1}{\sigma_1}\,\frac{x_2-\mu_2}{\sigma_2} + \Big(\frac{x_2-\mu_2}{\sigma_2}\Big)^2\right]\right\} \qquad (74)$$

Example 2: If the component variates $x_1, x_2, \ldots, x_k$ are mutually independent, all off-diagonal elements of the matrix $V$ vanish, so $V$ and $V^{-1}$ are diagonal matrices:
$$v_{ij} = (V^{-1})_{ij} = 0 \text{ when } i \ne j, \qquad v_{ii} = \sigma_i^2, \quad (V^{-1})_{ii} = \sigma_i^{-2}$$
It follows that the corresponding density function factorizes for the components $x_i$:
$$f(\vec x) = \prod_{i=1}^k\frac{1}{\sqrt{2\pi}\,\sigma_i}\, e^{-\frac{1}{2}\left(\frac{x_i-\mu_i}{\sigma_i}\right)^2}$$

Generation of multidimensional Gaussian

a) Uncorrelated case

Each component of the vector $\vec x$ is generated independently of the others: one generates $u_i$ from N(0, 1) and computes $x_i = \sigma_i u_i + \mu_i$, $i = 1, \ldots, k$, resulting in a random vector $\vec x$.

b) Two-dimensional distribution with correlation

The density given in equation (74) can be generated with the following modified Box-Muller method:
$$x_1 = \sqrt{-2\log u_1}\,\sin 2\pi u_2$$
$$x_2 = \sqrt{-2\log u_1}\,\sin(2\pi u_2 - \phi)$$
where the phase angle $\phi$ defines the correlation: $\cos\phi = \rho$. The validity of the method can be proved e.g. using the following theorem:

Theorem: If $x$ and $y$ are independent Gaussian random variates:

$$x: \mathrm{N}(\mu_x, \sigma_x^2), \qquad y: \mathrm{N}(\mu_y, \sigma_y^2)$$
then the variate $z = ax + by$ is Gaussian N($a\mu_x + b\mu_y$, $a^2\sigma_x^2 + b^2\sigma_y^2$).

Proof: The theorem is proved with the aid of the characteristic function:
$$\Phi_z(t) = \Phi_{ax+by}(t) = \Phi_{ax}(t)\,\Phi_{by}(t) = \Phi_x(at)\,\Phi_y(bt) = \exp(iat\mu_x - \tfrac{1}{2}a^2t^2\sigma_x^2)\exp(ibt\mu_y - \tfrac{1}{2}b^2t^2\sigma_y^2) = \exp[it(a\mu_x + b\mu_y) - \tfrac{1}{2}t^2(a^2\sigma_x^2 + b^2\sigma_y^2)].$$
We see that the characteristic function of $z$ is that of a Gaussian whose mean and variance are $a\mu_x + b\mu_y$ and $a^2\sigma_x^2 + b^2\sigma_y^2$.

c) General case

Let the vector ~x be a general k-dimensional Gaussian variate and ~µ its mean vector:

$$\vec x = (x_1, \ldots, x_k); \qquad \vec\mu = (\mu_1, \ldots, \mu_k)$$
Furthermore, let us assume that in addition to $\vec\mu$ we know the covariance matrix $V$:
$$V = E[(\vec x - \vec\mu)(\vec x - \vec\mu)^T] = \begin{pmatrix} v_{11} & v_{12} & \cdots & v_{1k} \\ v_{21} & v_{22} & \cdots & v_{2k} \\ \vdots & \vdots & & \vdots \\ v_{k1} & v_{k2} & \cdots & v_{kk} \end{pmatrix}$$
The question is: How to generate vectors $\vec x$? In the following we present a method which is based on the fact that the covariance matrix $V$ can always be expressed as a product $V = CC^T$. This $C$-matrix is utilized in the generation algorithm which goes as follows:

1◦ Generate $\vec u = (u_1, \ldots, u_k)$, where the $u_i$ are N(0, 1)

34 2◦ Compute ~x = ~µ + C~u

Then the vectors $\vec x$ are Gaussian N($\vec\mu$, $V$). The algorithm is proved as follows: At point 2◦ one computes $\vec x$ as a matrix product where every component of $\vec x$ is a linear combination of standard normal variates $u_i$. From the above theorem it follows that the vector $\vec x$ is a normal variate. Moreover we see easily that these variates have the right mean $\vec\mu$ and covariance matrix $V$:
$$E(\vec x) = E(\vec\mu + C\vec u) = E(\vec\mu) + C\,E(\vec u) = \vec\mu + C\vec 0 = \vec\mu$$
$$E[(\vec x - \vec\mu)(\vec x - \vec\mu)^T] = E[C\vec u(C\vec u)^T] = E[C\vec u\,\vec u^T C^T] = C\,E(\vec u\,\vec u^T)\,C^T = CC^T = V \quad \text{(QED)}$$

The expectation value of $\vec u$ is a null vector and the expectation value of $\vec u\,\vec u^T$ is the unit matrix, because the $u_i$ are independent N(0, 1). The matrix $C$ can be computed with the following recursive algorithm (the so called Cholesky factorization):

do i = 1, k
  do j = 1, i-1
    c_ij = (v_ij - sum_{n=1..j-1} c_in c_jn) / c_jj
    c_ji = 0
  enddo
  c_ii = (v_ii - sum_{n=1..i-1} c_in^2)^(1/2)
enddo

The resulting C-matrix is a triangular matrix: All the elements above the diagonal are zero.
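A compact C++ sketch of both steps, the Cholesky factorization and the generation $\vec x = \vec\mu + C\vec u$, assuming $V$ is positive definite; the N(0, 1) generator used here is just the Box-Muller cosine branch, and the function names are chosen freely:

#include <cstdlib>
#include <cmath>
#include <vector>

// Any N(0,1) generator will do; here the Box-Muller cosine branch.
double gaussRandom() {
  double u1 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
  double u2 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
  return sqrt(-2.0 * log(u2)) * cos(2.0 * 3.14159265358979 * u1);
}

// Cholesky factorization V = C C^T (V symmetric positive definite),
// followed by generation of one Gaussian vector x = mu + C u.
std::vector<double> gaussVector(const std::vector<std::vector<double> >& V,
                                const std::vector<double>& mu) {
  int k = (int)mu.size();
  std::vector<std::vector<double> > C(k, std::vector<double>(k, 0.0));
  for (int i = 0; i < k; ++i)
    for (int j = 0; j <= i; ++j) {
      double s = V[i][j];
      for (int n = 0; n < j; ++n) s -= C[i][n] * C[j][n];
      C[i][j] = (i == j) ? sqrt(s) : s / C[j][j];
    }
  std::vector<double> u(k);
  for (int i = 0; i < k; ++i) u[i] = gaussRandom();
  std::vector<double> x(mu);
  for (int i = 0; i < k; ++i)
    for (int j = 0; j <= i; ++j) x[i] += C[i][j] * u[j];
  return x;
}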

6.4 Exponential distribution

The exponential distribution is important in physics phenomena. The density function is: f(x) = λe−λx, x > 0. The most important characterizing parameters are:

Expectation: $\langle x\rangle = \lambda^{-1}$; Variance: $\sigma_x^2 = \lambda^{-2}$; Skewness: $\gamma_1 = 2$; Kurtosis: $\gamma_2 = 6$

The characteristic function is: $\phi(t) = (1 - it\lambda^{-1})^{-1}$

In the following we consider a few examples related to exponential probability.

35 Example 1: Decay of a radioactive sample. Number of nuclear decays in a short time interval is proportional to the length of the interval and to the number of nuclei:

$$dN = -\lambda N\,dt \quad\Rightarrow\quad \int_{N_0}^N\frac{dN}{N} = -\int_0^t\lambda\,dt \quad\Rightarrow\quad \log\frac{N}{N_0} = -\lambda t$$
$$N(t) = N_0\, e^{-\lambda t}$$
The parameter $\lambda$ can be called the (mean) decay rate.

Example 2: Particle lifetime. The probability distribution of the lifetime is:
$$f(t) = \frac{1}{\tau_0}\exp\Big(-\frac{t}{\tau_0}\Big),$$
where $\tau_0$ is the mean lifetime of the particle in its rest frame. The mean lifetime of a moving particle is affected by time dilatation, according to which its lifetime, measured in the laboratory frame, gets longer: $\tau_0 \to \gamma\tau_0$. Here $\gamma$ is the so called Lorentz factor:
$$\gamma = \frac{1}{\sqrt{1 - (v/c)^2}}$$

The total energy of the particle is $E = m = \gamma m_0$ ($m$ = 'relativistic mass') in units where the speed of light is unity ($c \equiv 1$). Furthermore, the relationship between the particle's relativistic mass, momentum and energy is:
$$p = mv = Ev \quad\Rightarrow\quad v = \frac{p}{E}$$
With these formulae we can derive the probability density function of the particle flight path length $\ell$:
$$t = \frac{\ell}{v} = \frac{E}{p}\cdot\frac{\ell}{c} \quad\Longrightarrow\quad f(\ell) = \Big(\frac{p}{m_0}\, c\tau_0\Big)^{-1}\exp\Big(-\frac{m_0\,\ell}{p\, c\tau_0}\Big)$$

Generation of exponential

The cumulative distribution function and its inverse can be easily calculated:
$$f(x) = \lambda e^{-\lambda x} \quad\Rightarrow\quad F(x) = \int_0^x\lambda e^{-\lambda t}\,dt = 1 - e^{-\lambda x}$$
$$\Rightarrow\quad F^{-1}(u) = -\lambda^{-1}\log(1 - u)$$
Hence one can use the method of the inverse distribution function, according to which the generation algorithm is as follows:

1◦ Generate $u$ from Unif(0, 1)

2◦ Compute $x = -\lambda^{-1}\log u$

Simulation applications of the exponential are useful especially in radiation physics. The free path of a particle, i.e. the distance traveled before any interaction, is a random variable which follows the exponential distribution independently of the type of interaction: decay, elastic or inelastic collision with atoms, Compton scattering, etc.

36 6.5 χ2 distribution

Variate: $\chi^2_N$, a positive real number
Parameter: $N$, a positive integer ('degrees of freedom')
Density function: $f(\chi^2_N) = \dfrac{1}{2}\Big(\dfrac{\chi^2_N}{2}\Big)^{\frac{N}{2}-1} e^{-\frac{\chi^2_N}{2}}\Big/\,\Gamma\Big(\dfrac{N}{2}\Big)$
Characteristic function: $\phi(t) = (1 - 2it)^{-N/2}$

Figure 10: Chi-squared distribution for N=3

Notice that $\chi^2_N$ does not mean the square of $\chi_N$; it is just a notation for a variate which follows the $\chi^2$ distribution. Often one omits the parameter $N$ in the notation: $\chi^2_N \to \chi^2$. The Γ function which appears in the normalization factor has the following properties:
$$\Gamma(n) = (n-1)!, \qquad \Gamma(\tfrac{1}{2}) = \sqrt{\pi}, \qquad \Gamma(x) = (x-1)\Gamma(x-1)$$
$$\Gamma(x) \simeq \sqrt{2\pi}\, x^{x-\frac{1}{2}} e^{-x}(1 + 0.0833/x)$$
The last (approximate) expression is the so called Stirling's formula. Other characteristic properties of the $\chi^2$ distribution are:

Mean: $\langle\chi^2\rangle = N$; Variance: $\sigma^2_{\chi^2} = 2N$; Skewness: $\gamma_1 = \sqrt{8/N}$; Kurtosis: $\gamma_2 = 12/N$

Lemma: The sum of the squares of $N$ independent standard normal variates $x_i$: N(0, 1) follows the $\chi^2_N$ distribution:
$$S_N = \sum_{i=1}^N x_i^2$$

Proof: We prove the lemma by using characteristic functions. Let $x$ be N(0, 1). Then the characteristic functions of $x^2$ and $S_N$ are:
$$\phi_{x^2}(t) = \int_{-\infty}^{\infty} e^{itx^2}\frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x^2}\,dx = \int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x^2(1-2it)}\,dx = (1 - 2it)^{-\frac{1}{2}}$$
$$\phi_{S_N}(t) = \phi_{\sum x_i^2}(t) = \prod_{i=1}^N\phi_{x_i^2}(t) = (1 - 2it)^{-\frac{N}{2}}$$
The characteristic function of the random variate $S_N$ equals the characteristic function of $\chi^2_N$, so that the probability densities of $S_N$ and $\chi^2_N$ are the same. $\square$

More generally: If $x_i$ follows N($\mu_i$, $\sigma_i^2$), then $\frac{x_i - \mu_i}{\sigma_i}$ follows N(0, 1). This implies that the sum variate:
$$\chi^2_N = \sum_{i=1}^N\left(\frac{x_i - \mu_i}{\sigma_i}\right)^2 \qquad (75)$$
follows the $\chi^2$ distribution. From this we can readily conclude on the basis of the central limit theorem that the $\chi^2_N$ distribution behaves as
$$\lim_{N\to\infty} f(\chi^2_N) = \mathrm{N}(N, 2N)$$
for large $N$. It means that the $\chi^2_N$ density function approaches a Gaussian with mean $N$ and variance $2N$. The convergence is rather slow. Another variate, whose density distribution approaches the Gaussian faster, is obtained by the following transformation of the $\chi^2_N$ variable:
$$a_{\chi^2_N} = \sqrt{2\chi^2_N} - \sqrt{2N - 1}$$
The density distribution of this variate approaches N(0, 1).

Another way to get $\chi^2_N$ variates is by summing up independent exponential variates. In the following the $y_i$'s are exponentially distributed and $x$ is N(0, 1):

1◦ $N$ even:
$$\chi^2_N = 2\sum_{i=1}^{N/2} y_i\,; \qquad f(y_i) = e^{-y_i} \qquad (76)$$
2◦ $N$ odd:
$$\chi^2_N = 2\sum_{i=1}^{[N/2]} y_i + x^2\,; \qquad f(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x^2} \qquad (77)$$
The notation $[N/2]$ means the nearest integer smaller than $N/2$. Notice: $f(\chi^2_2) = \frac{1}{2}e^{-\frac{1}{2}\chi^2_2}$, so for the parameter $N = 2$ the $\chi^2$ distribution is exponential.

Generation of the χ2 distribution

a) If N is large (≥ 20): from the Gaussian

b) N ≤ 20: using the formulas (76)-(77):

1◦ Generate [N/2] random numbers from e−x.

2◦ Compute the sum (76).

3◦ If N is odd, generate x : N (0, 1) and add x2 to the sum.
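A minimal C++ sketch combining steps 1◦-3◦ (the function name is chosen freely; the N(0, 1) variate for odd N is drawn with the Box-Muller recipe of section 6.2.1):

#include <cstdlib>
#include <cmath>

// Chi-squared variate with N degrees of freedom, using formulas (76)-(77).
double chi2Random(int N) {
  const double pi = 3.14159265358979323846;
  double s = 0.0;
  for (int i = 0; i < N/2; ++i) {
    double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    s += -2.0 * log(u);                      // 2*y_i with y_i exponential
  }
  if (N % 2 == 1) {                          // odd N: add the square of one N(0,1)
    double u1 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    double u2 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    double x  = sqrt(-2.0 * log(u2)) * cos(2.0 * pi * u1);
    s += x * x;
  }
  return s;
}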

The χ2 distribution is important in:

• statistical tests (e.g. random number generator tests)

• optimization of the parameters (’model fitting’) when checking the applicability of a parameterized model

Example: Confidence Level (CL)

$$CL = \int_{\chi^2_{\mathrm{fit}}}^{\infty} f(\chi^2)\,d\chi^2$$

$\chi^2_{\mathrm{fit}}$ versus CL can be obtained from curves or tables. A very powerful way of testing is to plot the so called $\chi^2$ probability (see section 4.4).

6.6 Cauchy or Breit-Wigner distribution

The density function of the Cauchy distribution is of the form:
$$f(x) = \frac{1}{\pi b}\,\frac{1}{1 + \big(\frac{x-a}{b}\big)^2} = \frac{1}{\pi}\,\frac{b}{b^2 + (x - a)^2}, \qquad b > 0.$$
The density function has two parameters, of which one ($a$) sets the mean and the other ($b$) sets the width. In the following the Cauchy distribution is denoted as C(a, b). The characteristic figures of the Cauchy distribution are:

Mean: $\langle x\rangle = a$; Variance: $\sigma_x^2 = \infty$ (!)

The variance is not defined, because the integral in the variance formula diverges. The same is true for all higher order even central moments. All odd central moments vanish, which follows from the fact that the Cauchy distribution is symmetric around its mean $a$. The characteristic function is:

φ(t) = exp(iat − b | t |).

39 Figure 11: Cauchy distribution (blue), a=0, b=1; standard normal for comparison

Though the standard deviation of the Cauchy distribution is = ∞, its half width half maximum (HWHM) can be calculated: HWHM = b. The Cauchy distribution has the following property (same as Gaussian N (µ, σ2)) :  x is C(a1, b1) ⇒ z = x + y is C(a1 + a2, b1 + b2) y is C(a2, b2)

In physics the Cauchy distribution appears in the resonance peak (Breit-Wigner):

f(E) = (1/π) · (Γ/2)/((Γ/2)² + (E − E₀)²) = C(E₀, Γ/2)
Here Γ is the width of the resonance and E₀ its energy at rest. For a resonance particle, E₀ = ⟨E⟩ is the particle mass.

Example 1: The lifetime distribution of the resonance is given by the Fourier trans- formation of the Breit-Wigner, E → t:

f(t) ∝| φ(t) |2=| exp(iat) |2 · | exp(−b | t |) |2= exp(−Γ | t |).

It follows that the mean lifetime τ of the resonance particle is inversely proportional to the width of the resonance peak: τ ∝ 1/Γ.

Example 2: A rotating straight line goes through the point (a, b) (b > 0). When the angle φ between the line and the x axis is a uniformly distributed random variate, the x coordinate of the intercept of the line with the x axis is distributed like Cauchy C(a, b) (exercise).
One of the applications of the Cauchy distribution is to look for errors in a simulation program: using Cauchy (long tails) instead of Gaussian in the error generation one gets plenty of situations which would otherwise be rare. Then there is a larger chance for a programming bug to show up.

Generation of Cauchy distribution

There are several methods:
a) In the attached figure the angular variate φ is Unif(−π/2, +π/2). Then the random variate x is C(a, b). Hence the algorithm reads:

(figure: a line through the point (a, b), at angle φ to the x axis, crosses the x axis at x)
1° Generate u : Unif(0, 1)
2° Compute x = a + b tan[π(u − ½)]

b) If u₁ and u₂ are independent and normal N(0, 1), then the ratio u = u₁/u₂ is C(0, 1). The algorithm reads:

◦ 1 Generate u1 and u2 from N (0, 1)

2° Compute x = a + b·u₁/u₂
c) If the points (x₁, x₂) in the plane are distributed uniformly in the unit circle, then the random variate x = x₁/x₂ is C(0, 1), and the algorithm reads:

1° Generate u₁ and u₂ : Unif(0, 1)

2° If (2u₁ − 1)² + (2u₂ − 1)² ≤ 1, then

3° compute x = a + b(u₁ − ½)/(u₂ − ½); otherwise return to 1°
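A minimal Python sketch of methods a) and c); the function names are illustrative only and numpy is an assumed dependency.

    import numpy as np

    rng = np.random.default_rng()

    def cauchy_tangent(a, b):
        """Method a): x = a + b*tan(pi*(u - 1/2)) with u ~ Unif(0,1)."""
        u = rng.uniform()
        return a + b * np.tan(np.pi * (u - 0.5))

    def cauchy_ratio(a, b):
        """Method c): ratio of the coordinates of a point uniform in the unit circle."""
        while True:
            x1 = 2.0 * rng.uniform() - 1.0
            x2 = 2.0 * rng.uniform() - 1.0
            if x1 * x1 + x2 * x2 <= 1.0:       # accept only points inside the circle
                return a + b * x1 / x2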

6.7 Landau distribution

In physics the Landau distribution is related to the motion of charged particles in a medium. When a particle undergoes electromagnetic interactions in the medium and ionizes atoms, it loses energy. The energy loss, the so-called ionization energy, per traveled unit distance (keV/cm) is a random variate which follows the Landau distribution. The exact definition of the distribution reads:

Φ(λ) = (1/(2πi)) ∫_{c−i∞}^{c+i∞} exp(λs + s log s) ds    (c > 0).
This is a Laplace transformation of the function s^s. In practical applications the following approximate expression is precise enough:
Φ(λ) ≈ (1/√(2π)) e^{−(λ + e^{−λ})/2}
where λ = λ̄ (E − E_p)/(E_m − E_p) and
E_p = most probable ionization energy
E_m = average ionization energy ≈ 153.5 (z/β)² (Z/A) ρ X keV (the constant 153.5 is in keV cm²/g)
λ̄ ≈ 1.270

where z is the charge of the ionizing particle and β = v/c its velocity. The quantities Z, A and ρ are the atomic number, the mass number and the density of the medium. X is the distance over which the energy loss is measured.
A typical feature of the Landau distribution is its very long tail. In other words, random fluctuations with E > E_p are highly probable.
There is more information on the subject in F. Sauli: Principles of operation of multiwire proportional and drift chambers, CERN Yellow Report 77-09 (May 1977) (http://cdsweb.cern.ch/record/117989/files/Chapter01.pdf).

Generation of Landau distribution

We do not discuss here the theory of generation. If needed, one can use existing generation programs. In the ROOT analysis toolkit the function is TRandom::Landau(peak,halfWidth). In the picture below we use this function to produce a typical histogram of the Landau distribution.

Figure 12: A randomly generated Landau distribution, peak=10, halfWidth=2
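A minimal sketch that produces a histogram like Figure 12, assuming ROOT with its Python bindings (PyROOT) is available; the seed, binning and number of entries are arbitrary choices of this example.

    import ROOT

    rng = ROOT.TRandom3(12345)
    h = ROOT.TH1D("hLandau", "Landau;energy loss;entries", 100, 0.0, 50.0)
    for _ in range(100000):
        h.Fill(rng.Landau(10.0, 2.0))   # TRandom::Landau(peak, halfWidth)
    c = ROOT.TCanvas()
    h.Draw()
    c.SaveAs("landau.png")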

7 Monte Carlo integration

7.1 Basics of the MC integration

The MC method is simple and also efficient compared to other numerical methods, when the volume of integration is multi-dimensional. Let us consider 1-dimensional integration. Conclusions which will be drawn below are valid also for n-dimensional integration:

I = ∫_a^b g(x) dx    (78)

On the other hand the expectation value of the function g(x) over the uniform distribution Unif(a, b) is:

E(g) = (1/(b − a)) ∫_a^b g(x) dx,    (79)
from which it follows that I = (b − a) E(g).

Let us assume that the values of the function g(x) are computed at random points x_i which are distributed uniformly Unif(a, b). An estimate of the expectation value of g is then:
E(g) ≈ (1/N) Σ_{i=1}^N g(x_i)
It follows that the integral (78) has the estimate:

I ≈ I_N = ((b − a)/N) Σ_{i=1}^N g(x_i)    (80)
where x_i, i = 1, ..., N, are random numbers whose distribution is Unif(a, b). The formula (80) can immediately be generalized to multi-dimensional space:

I = ∫···∫_A g(x⃗) dx⃗ ≈ I_N = (V_A/N) Σ_{i=1}^N g(x⃗_i)    (81)
where the x⃗_i are random vectors (points in space) uniformly distributed in the space of integration A and V_A is the (hyper)volume of A. A is a (hyper)rectangle if the limits of integration are constant.

Example: Consider integration in the plane. In the formula (81) we have the volume (area) V_A. A question arises: how to know or estimate V_A when the area has an arbitrary shape whose value cannot be computed in closed form? In this case one can apply the Hit or Miss method in the following way:

Generate uniformly pairs (x₁, x₂) in a rectangle [a₁, b₁] × [a₂, b₂] which contains the area A (see figure). Out of N generated pairs in total, M hit the area A. The size of the area, V_A, is then estimated as:
V_A ≈ (b₁ − a₁)(b₂ − a₂) M/N
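A short Python sketch of this recipe for a hypothetical region (the quarter of the unit disc inside the unit square) and a hypothetical integrand; the surrounding rectangle is the unit square, so its volume is 1.

    import numpy as np

    rng = np.random.default_rng(0)

    def g(x, y):                     # illustrative integrand
        return np.exp(-(x + y))

    N = 100000
    x = rng.uniform(0.0, 1.0, N)     # points in the surrounding square
    y = rng.uniform(0.0, 1.0, N)
    inside = x**2 + y**2 <= 1.0      # hits of the region A
    V_A = inside.mean()                       # area estimate M/N
    I_N = g(x[inside], y[inside]).sum() / N   # formula (82) with V_hat = 1
    print(V_A, I_N)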

If the volume of the surrounding region is denoted as V̂_A, one gets more generally:
∫···∫_A g(x⃗) dx⃗ ≈ (V̂_A/N) Σ_{i=1}^M g(x⃗_i).    (82)
Often a suitable variable transformation

rk = rk(x1, ··· , xn), k = 1, ··· , n, can transform the integration region A to a hypercube:

0 ≤ rk ≤ 1, k = 1, ··· , n, and the equation (81) transforms to the normal form:

∫_0^1 ··· ∫_0^1 f(r⃗) dr⃗ = lim_{N→∞} (1/N) Σ_{i=1}^N f(r⃗_i)    (83)

f(r⃗) = g(x⃗) · ∂(x₁, ··· , xₙ)/∂(r₁, ··· , rₙ), evaluated at x_k = x_k(r₁, ··· , rₙ)
(see Appendix A.10: Variable transformation).

Example: General constant integration limits: I = ∫_{a₁}^{b₁} ··· ∫_{aₙ}^{bₙ} g(x⃗) dx⃗.
The required transformation is obtained by the equations:

x_k = a_k + (b_k − a_k) r_k,    k = 1, ··· , n.
The Jacobian determinant of this transformation is simply ∏_{k=1}^n (b_k − a_k), so we get
∫_{a₁}^{b₁} ··· ∫_{aₙ}^{bₙ} g(x⃗) dx⃗ = ∏_{k=1}^n (b_k − a_k) ∫_0^1 ··· ∫_0^1 g[a⃗ + (b⃗ − a⃗)·r⃗] dr⃗.
In what follows we restrict ourselves to integrals of normal form in order to simplify the notation.

7.2 Convergence and precision of the MC integration

Let us consider a 1-variable integral (for simplicity of notation; the formulae are valid for multi-variable integrals):

I = ∫_0^1 f(x) dx ≈ I_N = (1/N) Σ_{i=1}^N f(x_i),    0 < x_i < 1.    (84)

I_N is a sum of random variables and therefore approaches a Gaussian distribution according to the central limit theorem. The variance of this distribution is:

σ²_{I_N} = σ²((1/N) Σ_{i=1}^N f(x_i)) = (1/N²) Σ_{i=1}^N σ²(f(x_i)) = (N/N²) σ²_f = σ²_f/N.
We have derived an important result for the precision and the rate of convergence of the (crude) Monte Carlo integration:

var(I_N) = (V²/N) var(f),    (85)
where V is the total volume of the integration region. This means that I_N converges as N^{−1/2}:

I = I_N ± (V/√N) σ_f.    (86)

We emphasize that the convergence (86) is independent of the dimension of the integration. This is the basic reason for the good efficiency of the MC integration in multi-dimensional space.
In the formula (86) there appears the notion of the variance of the function f within the range of integration. In the following we explain this notion in more detail. Without loss of generality we consider again the unit range [0,1]. The variance of the function f means the variance with respect to the uniform density. The definition of variance readily gives the theoretical expression of var(f):

var(f) ≡ σ_f² = ∫_0^1 (f(x) − ⟨f⟩)² · 1 · dx = ∫_0^1 (f(x) − I)² dx = ⟨f²⟩ − ⟨f⟩².    (87)
We see that the expression contains the value of the integral I, so this equation has only a theoretical meaning from the point of view of MC integration. In practice the integral (87) must also be computed with the MC method. According to the equation (87) we obtain the following empirical expression for var(f):

var(f) ≡ σ_f² ≈ (1/N) Σ_{i=1}^N f(x_i)² − [(1/N) Σ_{i=1}^N f(x_i)]².    (88)
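Putting (84), (86) and (88) together, a minimal 1D sketch in Python (the integrand is a hypothetical example and V = 1 for the unit interval):

    import numpy as np

    rng = np.random.default_rng(1)
    f = lambda x: np.exp(x)          # example integrand on [0, 1]

    N = 100000
    fx = f(rng.uniform(0.0, 1.0, N))
    I_N = fx.mean()                              # crude MC estimate, formula (84)
    var_f = (fx**2).mean() - fx.mean()**2        # empirical var(f), formula (88)
    err = np.sqrt(var_f / N)                     # error estimate, formula (86) with V = 1
    print(I_N, "+/-", err, " exact:", np.e - 1.0)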

If we do not know the volume of the integration region, we need to use the hit or miss method as in the formula (82). The error estimate of I_N is then:
∆I_N = (V̂/N) { Σ_{i=1}^M f(x_i)² − (1/N) [Σ_{i=1}^M f(x_i)]² }^{1/2},
where V̂ is the volume of the surrounding region, M the number of generations which hit the integration region and N the number of generations in the whole surrounding volume V̂.
Example: Constant function: f(x) = a, 0 < x < 1. Then ⟨f⟩ = a and the variance of f is:
σ_f² = ∫_0^1 (f − a)² dx = ∫_0^1 0 · dx = 0
The example shows that the precision of the MC integration, ∆I_N = σ_f/√N, is the better the closer f is to a constant function. When var(f) is large (figure), the largest contribution to the integral comes from the region where f is large. Despite that, in crude MC the number of generations in the region where the function is small is the same as in the region where the function is large. Therefore points are 'wasted' in crude Monte Carlo.

7.3 Methods to accelerate the convergence and precision

The method of MC integration in which the random points are generated uniformly in the region of integration is called crude Monte Carlo. According to the formula ∆I_N = σ_f/√N the precision can be improved by
- accelerating the convergence N^{−1/2}
- reducing the variance.
Convergence can be accelerated by using quasi-random numbers which are customized for a given problem. It can be shown that with quasi-random numbers it is possible to attain a convergence which goes like ∝ N^{−1} at best. We do not discuss this option in detail, but rather concentrate on methods of variance reduction. The most commonly used variance reduction methods are:
- stratified sampling ('kerrostettu otanta' in Finnish)
- importance sampling
- using control variates
- using antithetic (negatively correlated) variates
which we consider in the following.

7.3.1 Stratified sampling

The stratified sampling (’kerrostettu otanta’ in Finnish) is in a sense analogous to the composition method (5.6). Let us consider the basic form (1D case).

I = ∫_0^1 f(x) dx.
In this method the integration region is divided into smaller intervals. Let the dividing points be

0 = α0 < α1 < ··· < αk = 1.

Then the integral I is written as a sum:
I = Σ_{j=1}^k ∫_{α_{j−1}}^{α_j} f(x) dx
When crude Monte Carlo is applied in each interval, the formula for stratified sampling reads:

I'_N = Σ_{j=1}^k ((α_j − α_{j−1})/n_j) Σ_{i=1}^{n_j} f[α_{j−1} + (α_j − α_{j−1}) r_{ij}],    (89)
where N = Σ_j n_j and the r_{ij} are random numbers Unif(0,1). In the following we study the variance of the MC estimate I'_N:

var(I'_N) = Σ_{j=1}^k ((α_j − α_{j−1})/n_j)² Σ_{i=1}^{n_j} var_j(f)
          = Σ_{j=1}^k ((α_j − α_{j−1})/n_j)² n_j · var_j(f) = Σ_{j=1}^k ((α_j − α_{j−1})/n_j) ∫_{α_{j−1}}^{α_j} (f − ⟨f⟩_j)² dx,
where var_j(f) and ⟨f⟩_j denote the variance and the mean of f over the interval (α_{j−1}, α_j). Hence

var(I'_N) = Σ_{j=1}^k { ((α_j − α_{j−1})/n_j) ∫_{α_{j−1}}^{α_j} f²(x) dx − (1/n_j) [∫_{α_{j−1}}^{α_j} f(x) dx]² }.    (90)

With a suitable choice of n_j and α_j one has var(I_N) − var(I'_N) > 0, which means that the variance of I'_N is reduced in comparison with the crude MC estimate I_N.

Example: Uniform intervals α_j − α_{j−1} = 1/k and n_j = N/k. Then (see formula (85))
var(I_N) − var(I'_N) = (k/N) Σ_{j=1}^k [∫_{α_{j−1}}^{α_j} f(x) dx]² − (1/N) [Σ_{j=1}^k ∫_{α_{j−1}}^{α_j} f(x) dx]².

One can show that this is always ≥ 0. For example, in the case that the range of integration is divided in two, one gets:

var(I_N) − var(I'_N) = (1/N) [∫_0^{1/2} f(x) dx − ∫_{1/2}^1 f(x) dx]² ≥ 0.
An optimal division depends on the shape of the function. If the shape of the function is not known, division into uniform intervals already gives a smaller variance. In the previous example the numbers of generations n_j were all chosen equal. Obviously a better result is obtained when the numbers n_j are set proportional to the integral of |f| over each interval:
n_j ∝ ∫_{α_{j−1}}^{α_j} |f(x)| dx.
This method can be used only iteratively, because in the beginning the integral of the function |f| is unknown. In another method one adjusts the intervals α_j such that the contribution to the total integral from each interval is roughly equal:

∫_{α_{j−1}}^{α_j} |f(x)| dx ≈ const.

In this case the numbers of generations n_j are set equal in each interval. Stratified sampling is used for example in the Monte Carlo integration code Vegas (http://en.wikipedia.org/wiki/VEGAS_algorithm).
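A minimal Python sketch of formula (89) with uniform intervals and equal n_j; the integrand is again a hypothetical example.

    import numpy as np

    rng = np.random.default_rng(2)
    f = lambda x: np.exp(x)

    k, n_per = 10, 1000                       # k strata, n_j = n_per points in each
    alpha = np.linspace(0.0, 1.0, k + 1)      # dividing points 0 = a_0 < ... < a_k = 1
    I_strat = 0.0
    for j in range(k):
        width = alpha[j + 1] - alpha[j]
        r = rng.uniform(0.0, 1.0, n_per)
        I_strat += width / n_per * f(alpha[j] + width * r).sum()   # formula (89)
    print(I_strat, " exact:", np.e - 1.0)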

7.3.2 Importance sampling

The basic principle in importance sampling is to try to weight the integrand such that its variance is as small as possible. For this purpose let us consider the following identities:
f(x) = (f(x)/g(x)) · g(x)   ⇒   f(x) dx = (f(x)/g(x)) dG(x),
where g is a density function which satisfies the conditions listed below and G is the corresponding cumulative distribution.

The importance sampling corresponds to the change of variable r = G(x) so the integral of the function f can be written:

I = ∫_a^b f(x) dx = ∫_0^1 f(G^{−1}(r))/g(G^{−1}(r)) dr.    (91)
The function g must satisfy the conditions:

• g(x) = dG/dx is a density function in the range (a, b) of integration
• G(x) can be inverted, or there is some other method available to generate from g
• the ratio f(x)/g(x) is as uniform as possible in the range (a, b)

The MC estimate of the integral (91) with importance sampling is then:

I_N = (1/N) Σ_{i=1}^N f(x_i)/g(x_i),    (92)
where the random points x_i must be generated from the density g. The non-uniform sampling is compensated by weighting the terms of the sum by g^{−1}. If the inversion method can be used, the formula (92) takes the form:

I_N = (1/N) Σ_{i=1}^N f(G^{−1}(r_i))/g(G^{−1}(r_i)),    (93)
where the random numbers r_i are generated from Unif(0,1).
Example 1: If f(x) > 0 ∀x one could choose g(x) = c f(x) (c = constant), so that f/g is constant and var(f/g) = 0. This way the MC integration would be absolutely precise. This is a useless observation, however, because in order to know G(x) we should know the integral ∫ f(x) dx!
Example 2: If f(x) resembles the Gaussian N(0, σ), one gets:
∫_{−∞}^{∞} f(x) dx ≈ (√(2π)σ/N) Σ_{i=1}^N f(x_i) e^{+½(x_i/σ)²}
where the random points x_i are generated from the distribution N(0, σ).
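A Python sketch of Example 2: sample from N(0, σ) and weight by the inverse density, as in formula (92); the integrand here is an illustrative choice that roughly resembles a Gaussian.

    import numpy as np

    rng = np.random.default_rng(3)
    sigma = 1.0
    f = lambda x: np.exp(-0.5 * x**2) / (1.0 + 0.1 * x**2)     # illustrative integrand
    g = lambda x: np.exp(-0.5 * (x / sigma)**2) / (np.sqrt(2.0 * np.pi) * sigma)

    N = 100000
    x = rng.normal(0.0, sigma, N)     # points generated from the density g
    I_N = np.mean(f(x) / g(x))        # formula (92)
    print(I_N)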

7.3.3 Control variates

In this method the integral is divided in two terms

I = ∫_0^1 f(x) dx = ∫_0^1 [f(x) − g(x)] dx + ∫_0^1 g(x) dx.    (94)
The problem is to find a suitable function g which fulfils the following conditions:

49 • var(f − g) < var(f)

• ∫_0^1 g(x) dx can be computed analytically.

The integral ∫ (f − g) dx is computed with crude Monte Carlo. In comparison with importance sampling this method has a number of advantages:

- g’s zeros is not a problem

- g’s generation method is not needed

- g can have negative values.

The variance of the function f − g is

var(f − g) = var(f) + var(g) − 2 cov(f, g),
which is smaller than the variance of the original integrand var(f), if
cov(f, g) > ½ var(g).
The condition for the method to work is a sufficiently large positive correlation between the functions f and g.
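A Python sketch of formula (94) with f(x) = e^x and the control function g(x) = 1 + x + x²/2, whose integral over [0,1] is 5/3; this particular pair is an illustrative choice (it echoes exercise B.6.10), not prescribed by the text.

    import numpy as np

    rng = np.random.default_rng(4)
    f = lambda x: np.exp(x)
    g = lambda x: 1.0 + x + 0.5 * x**2        # control function, integral over [0,1] = 5/3

    N = 100000
    x = rng.uniform(0.0, 1.0, N)
    diff = f(x) - g(x)
    I_N = diff.mean() + 5.0 / 3.0             # MC estimate of (f - g) plus the exact part
    err = np.sqrt(diff.var() / N)             # error from var(f - g), much smaller than var(f)
    print(I_N, "+/-", err, " exact:", np.e - 1.0)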

7.3.4 Antithetic variates

The crude Monte Carlo estimate of the integral IN is divided in two parts:

I_N = I'_N + I''_N    (95)

where the covariance of the two parts is negative: cov(I'_N, I''_N) < 0. Then the variance of I_N becomes smaller, because

var(I_N) = var(I'_N) + var(I''_N) + 2 cov(I'_N, I''_N).

The negative correlation of the terms I'_N and I''_N is obtained by using different variates in their generation, so-called antithetic variates ('vastakkaisvariaatit' in Finnish) u, ϕ(u):

1° u and ϕ(u) are Unif(0,1)

2◦ cov(u, ϕ(u)) < 0.

Example: Consider the variates u and 1 − u where u is generated from Unif(0,1). These variates provide maximal negative correlation.

Then it follows from the equation (95):

I_N = (1/N) Σ_{i=1}^N [½ f(u_i) + ½ f(1 − u_i)].
This gives a reduced variance provided that the integrand f is not symmetric (f(u) ≠ f(1 − u)). The variance gets reduced since, if u_i hits a region where f is small, then 1 − u_i hits a region where f is large. More generally, consider the estimate

I_N = (1/N) Σ_{i=1}^N {α f(αu_i) + (1 − α) f[1 − (1 − α)u_i]} ≡ (1/N) Σ_{i=1}^N F_α(u_i).    (96)
The variance of (96) becomes minimized with a choice of α which minimizes the variance of the function F_α:

var(F_α) = ∫_0^1 F_α² dx − [∫_0^1 F_α dx]² = ∫_0^1 F_α² dx − I²
         = α ∫_0^α f²(x) dx + (1 − α) ∫_α^1 f²(x) dx + 2(1 − α) ∫_0^α f(x) f[1 − (α^{−1} − 1)x] dx − I².
This expression has a minimum in the range 0 < α < 1. The precise value of α is difficult to determine, but a good approximation is given by the equation f(α) = (1 − α)f(1) + αf(0). Notice that (96) is also an application of stratified sampling with the divisions [0, α] and [α, 1], because 0 ≤ αu_i ≤ α and α ≤ 1 − (1 − α)u_i ≤ 1.
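A Python sketch of the simplest antithetic pair u and 1 − u; the integrand is an illustrative, non-symmetric choice, and the crude MC estimate with the same total number of function evaluations is printed for comparison.

    import numpy as np

    rng = np.random.default_rng(5)
    f = lambda x: np.exp(x)                   # not symmetric about x = 1/2

    N = 50000
    u = rng.uniform(0.0, 1.0, N)
    I_anti = (0.5 * (f(u) + f(1.0 - u))).mean()         # antithetic estimate
    I_crude = f(rng.uniform(0.0, 1.0, 2 * N)).mean()    # crude MC, same cost
    print(I_anti, I_crude, " exact:", np.e - 1.0)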

7.4 Limits of integration in the MC method

Non-constant limits of integration can often yield difficulties in MC integration. As an example let us consider the integral:

I = ∫_{x=0}^{1} ∫_{y=0}^{x} f(x, y) dx dy

An algorithm in which one
1° generates a random number x_i : Unif[0,1],
2° generates a random number y_i : Unif[0, x_i] and
3° computes I_N = (1/N) Σ_{i=1}^N f(x_i, y_i)
gives a wrong answer, because the number of points is the same in each vertical band (figure) and hence the density of points in the region of integration is not uniform.

Three different methods are presented below in order to compute the integral by MC method. In the formulae a is the area of the triangle.

A. Rejection (choosing the points at random in the square and rejecting those above the diagonal):
1° gen. 0 < x_i < 1
2° gen. 0 < y_i < 1
3° if y_i > x_i → back to 1°
4° I_N = (a/N) Σ_i^N f(x_i, y_i)

B. Folding (folding the points above the diagonal below the diagonal):
1° gen. 0 < r_1 < 1
2° gen. 0 < r_2 < 1
3° x_i = max(r_1, r_2), y_i = min(r_1, r_2)
4° I_N = (a/N) Σ_i^N f(x_i, y_i)

C. Weighting (compensating the non-uniformity of the points by weighting with the inverse of the density (= 1/x)):
1° gen. 0 < x_i < 1
2° gen. 0 < y_i < x_i
3° I_N = (1/N) Σ_i^N x_i f(x_i, y_i)

The rejection method is inefficient, because 50% of the generations are lost. As a matter of fact, method C is the same as importance sampling with the sampling density g(x, y) = 1/x.
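The three recipes in Python for a hypothetical integrand f(x, y) = e^{xy}; the labels follow the table above, and in method A the average is taken over the accepted points.

    import numpy as np

    rng = np.random.default_rng(6)
    f = lambda x, y: np.exp(x * y)
    N = 100000
    a = 0.5                                   # area of the triangle y < x in the unit square

    # A. Rejection: uniform points in the square, keep those below the diagonal
    x, y = rng.uniform(size=N), rng.uniform(size=N)
    keep = y <= x
    I_A = a * f(x[keep], y[keep]).mean()

    # B. Folding: (max, min) of two uniforms is uniform in the triangle
    r1, r2 = rng.uniform(size=N), rng.uniform(size=N)
    I_B = a * f(np.maximum(r1, r2), np.minimum(r1, r2)).mean()

    # C. Weighting: y uniform in (0, x); compensate with the weight x
    x = rng.uniform(size=N)
    y = rng.uniform(size=N) * x
    I_C = np.mean(x * f(x, y))

    print(I_A, I_B, I_C)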

7.5 Comparison of convergence with other methods

When the efficiency of the MC integration is compared with other numerical methods, let us consider the situation first in the 1D case. Other frequently used methods are the trapezoid rule, Simpson's rule and Gauss's m-point rule. The precision in these methods depends on the number of divisions.
Trapezoid rule: the error is proportional to the sagitta s ∝ h² ∝ n^{−2}, where h is the division and n the number of divisions, so the convergence is ∝ n^{−2}.
Simpson's rule: ∫_a^b f(x) dx = (h/3)(f₀ + 4f₁ + 2f₂ + ··· + 4f_{2n−1} + f_{2n}),
h = (b − a)/(2n); f_i = f(a + ih). Convergence is ∝ n^{−4}.
Gauss rule: ∫_a^b f(x) dx = Σ_{i=1}^k g_m(x_{i−1}, x_i),
where g_m is the m-point rule estimate in the interval [x_{i−1}, x_i]. In this method the convergence is ∝ n^{−(2m−1)} where n = k · m.
Comparison in N-dimensional space:

Method         Convergence      Nb. of points
Monte Carlo    n^{−1/2}         n
Trapezoid      n^{−2/N}         n^N
Simpson        n^{−4/N}         n^N
Gauss          n^{−(2m−1)/N}    n^N

One can see that for numerical methods other than Monte Carlo, both convergence and precision deteriorate quickly as the dimension N grows. Also the number of points needed grows quickly as a function of N in the other numerical methods.

A Basic statistics for simulation

A.1 Introduction

Monte Carlo simulation is to a large extent based on the generation and application of different kinds of random numbers. Therefore understanding simulation methods requires good knowledge of the basic concepts of probability and statistics. In this Appendix we give a brief introduction to these concepts. The reader who wants to learn more can consult for instance the following bibliography:
S. Brandt: Statistical and Computational Methods in Data Analysis (North Holland, 1970)
W.T. Eadie, D. Drijard, F.E. James, M. Roos, B. Sadoulet: Statistical Methods in Experimental Physics (North Holland 1971)
D.J. Hudson: Lectures on Elementary Statistics and Probability, CERN 63-29
J. Orear: Notes on Statistics for Physicists, UCRL-8417

A.2 Definition of probability

1. Frequency of events based or ’experimental’ definition:

Let N be the total number of events and n(X) the number of events related to given type X of events. The probability P (X) for an event type X to happen is then defined as the asymptotic limit:

P(X) = lim_{N→∞} n(X)/N    (97)
2. Mathematical or 'modern' definition reads as follows:

Let Ω be a set of events and let the elementary events X_i and X_j be mutually exclusive. Then we postulate:

a) P(X_k) ≥ 0 ∀k
b) P(X_i or X_j) = P(X_i) + P(X_j)    (98)
c) Σ_Ω P(X_k) = 1


A.3 Combined probabilities

Let us assume that there are events of type A and B. The formulae for combined probabilities are then:

P(A or B) ≡ P(A + B) = P(A) + P(B) − P(AB)
P(A and B) ≡ P(AB) = P(A) · P(B), if A and B are independent; otherwise
P(AB) = P(A|B) P(B) = P(B|A) P(A),  where P(A|B) is the conditional probability.

When the number of different types of events is larger than two, the combined probability of any event to happen is computed from the following general formula:

P(Σ_{i=1}^n A_i) = S₁ − S₂ + S₃ − ··· − (−1)ⁿ Sₙ,  where    (99)

S₁ = Σ_{i=1}^n P(A_i),  S₂ = Σ_{i<j} P(A_i A_j),  S₃ = Σ_{i<j<k} P(A_i A_j A_k),  etc.

A.4 Probability density function

A random variate can have either discrete or continuous values. We define first the discrete probability density in view of the following example:

Example: A γ -spectrometer whose ’energy windows’ are (Ei,Ei +∆E), i = 0,...,M. The normalized channel readings Ni form a discrete probability density distribution:

P_i = P(E_i < E < E_i + ∆E) = N_i/N

Σ_i P_i = Σ_i N_i/N = N/N = 1.
In nature the γ energy distribution is in fact continuous. We define the continuous density function as a limit:
f(E) = lim_{∆E→0} P(E < E′ < E + ∆E)/∆E
Any density function is non-negative in its whole domain of definition and its integral over the full domain equals unity (the probability that a random variate gets some value = 1):


f(E) ≥ 0;    ∫_{E_min}^{E_max} f(E) dE = 1

Generally: Any function of n variables f(~x), ~x = (x1, x2, ..., xn), which is uniquely defined in a given region A and which fulfils the conditions

f(x⃗) ≥ 0  ∀ x⃗ ∈ A
∫_A f(x⃗) dⁿx⃗ = 1    (100)
can be interpreted as a probability density of the vector quantity x⃗ = (x₁, ..., xₙ). In the case of discrete probability the integral is replaced by a sum: ∫ → Σ, Σ f(x⃗_i) = 1.

A.5 Cumulative distribution function

Let f(x) be a (one variable) density function. Then the cumulative distribution function F is defined as follows:
F(x) = ∫_{t_min}^x f(t) dt    (101)
It follows from the definition:

F(x_min) = 0
F(x_max) = 1
x₁ < x₂ ⇒ F(x₁) ≤ F(x₂)

A value of the distribution function at certain x1 gives the probability that the random variate x is smaller than x1: P (x < x1) = F (x1).

A.6 Marginal and conditional distribution

Marginal distribution is a projection of at least 2-dimensional density function:

g(x) = ∫_{y_min}^{y_max} f(x, y) dy    (102)
Conditional distribution: Let f(x, y) be a 2-dimensional probability density distribution. Then the distribution of y on the condition x = h(y) is ∝ f(h(y), y). Also the conditional distribution must be normalized to one, so that:

g(y | x = h(y)) = f(h(y), y) / ∫ f(h(y), y) dy.    (103)

A.7 Expectation value, mean value and variance

Let f(x) be the probability density of a quantity x. Then the expectation value of a quantity g = g(x) is (g is a general function):
E(g) = ∫ g(x) f(x) dx    (104)

In particular, if g(x) = x, we have the mean of the distribution:
E(x) ≡ x̄ ≡ ⟨x⟩ = ∫ x f(x) dx    (105)

When g(x) = (x − x̄)² we have the variance:
var(x) ≡ σ_x² = E((x − x̄)²) = ··· = E(x²) − E(x)²    (106)
The quantity √σ² ('root mean square' = r.m.s.) is often called the standard deviation.

A.8 Covariance and correlation

The formulae (104)-(106) can be generalized to multiple dimensions. For simplicity we write the corresponding formulae below for 2-dimensional distributions f(x, y). The means and variances of the variates x and y are:
⟨x⟩ ≡ E(x) = ∫∫ x f(x, y) dx dy
⟨y⟩ ≡ E(y) = ∫∫ y f(x, y) dx dy
σ_x² = ∫∫ (x − E(x))² f(x, y) dx dy
σ_y² = ∫∫ (y − E(y))² f(x, y) dx dy,
where the integration goes over the domain of definition of the function f. An important quantity related to the distribution f(x, y) is the covariance, which is defined as follows:
cov(x, y) = E[(x − x̄)(y − ȳ)] = E(xy) − E(x)E(y)    (107)
and the correlation coefficient:
corr(x, y) ≡ ρ(x, y) = cov(x, y)/√(σ_x² σ_y²)    (108)
The correlation coefficient is dimensionless and its values are limited as: −1 ≤ ρ ≤ 1.

[Sketches: scatter plots illustrating cov(x, y) > 0, cov(x, y) = 0 and cov(x, y) < 0]

A.9 Independent variates, covariance matrix

The random variates x1, x2, ..., xn are mutually independent, if their common density function factorizes:

f(~x) = f(x1, ..., xn) = f1(x1)f2(x2) ··· fn(xn) (109)

It is easy to see that the covariance and correlation of independent variates are = 0. The converse is not necessarily true: random variates for which ρ = 0 are uncorrelated, but not necessarily independent. For example, if x and y are distributed uniformly inside a circle, their correlation is zero, but they are not independent due to the boundary condition 'inside the circle'. The covariance of two random variates can be generalized to a probability density of multiple variates:
cov(x_i, x_j) = ∫ ··· ∫ (x_i − x̄_i)(x_j − x̄_j) f(x₁, ··· , xₙ) dx₁ ··· dxₙ    (110)

A matrix whose elements are cov(x_i, x_j) is called the covariance matrix or error matrix. A covariance matrix is diagonal if the variates are mutually independent. The global correlation coefficient is defined as:

ρ_k = max_y ρ(x_k, y)
where y is a linear combination of all the other variates x_i, i ≠ k. ρ_k measures the total correlation of x_k with all the other variates. It can be computed from the formula

ρ_k = √(1 − [V_kk · (V^{−1})_kk]^{−1})    (111)
where V is the covariance matrix.
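A small Python sketch of formula (111); numpy is an assumed dependency and the covariance matrix is a hypothetical 2x2 example.

    import numpy as np

    V = np.array([[2.0, 0.5],
                  [0.5, 1.0]])                 # hypothetical covariance matrix
    Vinv = np.linalg.inv(V)
    # formula (111), evaluated element-wise for every variate k
    rho_global = np.sqrt(1.0 - 1.0 / (np.diag(V) * np.diag(Vinv)))
    print(rho_global)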

A.10 Change of variables

In a typical simulation task one has to generate random variates from some probability distribution. Often the quantity in question is generated 'around a corner': one generates an auxiliary quantity as a function of which the quantity of interest can be expressed. Using this method requires that one knows the transformation of densities between the auxiliary quantity and the original random quantity of interest. Let us consider the situation in the case of multiple variates. Let

f = f(x1, x2, ..., xn) be a density function. Let the equations

x'_i = x'_i(x₁, ..., xₙ),    i = 1, ..., n
define an invertible (one-to-one) change of variables

x_i → x'_i,    i = 1, ..., n.

Then the density function f' of the variates x'_i is:

f'(x'₁, ..., x'ₙ) = f(x₁, ..., xₙ) |∂(x₁, ..., xₙ)/∂(x'₁, ..., x'ₙ)|, evaluated at x_i = x_i(x'₁, ..., x'ₙ)    (112)
where
∂(x₁, ..., xₙ)/∂(x'₁, ..., x'ₙ) = det( ∂x_j/∂x'_i ),    i, j = 1, ..., n,
is the Jacobian determinant ('Jacobian') of the transformation.

Example: The distribution of the momentum p of a β particle is f = f(p). What is the distribution g of the energy E = √(p² + m²)? We compute p = √(E² − m²), from which
g(E) = f(p) ∂p/∂E = E f(√(E² − m²))/√(E² − m²)
One observes that in a spectral transformation from a measured quantity to a calculated quantity one has to multiply by the Jacobian determinant.

A.11 Distribution of a function of random variates

Let f(x) and g(y) be the probability densities of the variates x and y. What is the density distribution h(z) of a variate z = z(x, y)? The answer is expressed using the Dirac δ function as follows:
h(z) = ∫∫ f(x) g(y) δ(z − z(x, y)) dx dy    (113)

The Dirac’s δ function is defined as follows:  0 when x 6= 0 δ(x) = (114) ∞ when x = 0

The most important properties:

∫_{−∞}^x δ(t) dt = Θ(x);    dΘ(x)/dx = δ(x);    ∫_{−∞}^x Θ(t) dt = xΘ(x)
∫_{−∞}^{∞} f(x) δ(x − x₀) dx = f(x₀)    (115)

δ(g(x)) = Σ_i δ(x − x_i)/|g'(x_i)|

Here f and g are arbitrary functions. The points x_i are the zeros of the function g. The Θ function (the so-called step function) is defined as follows:

Θ(x) = 0 when x < 0,  1 when x > 0

Example: The sum z = x + y:
h(z) = ∫∫ f(x) g(y) δ(z − x − y) dx dy = ∫ f(z − y) g(y) dy = ∫ f(x) g(z − x) dx

Let us apply the formula to the case where f and g are Unif(0,1). They can be expressed with the Θ function as follows:
f(x) = Θ(x)Θ(1 − x),    g(y) = Θ(y)Θ(1 − y)
What is the probability density of the sum z = x + y? According to the formula:

h(z) = ∫_{−∞}^{∞} Θ(x)Θ(1 − x)Θ(z − x)Θ(1 − z + x) dx
     = ∫_{−∞}^{∞} xΘ(x) { δ(1 − x)Θ(z − x)Θ(1 − z + x) + Θ(1 − x)δ(z − x)Θ(1 − z + x)
                          − Θ(1 − x)Θ(z − x)δ(1 − z + x) } dx
     = zΘ(z)Θ(1 − z) + (2 − z)Θ(z − 1)Θ(2 − z)

In the calculation one utilizes partial integration together with the formulae of the δ function (115) (reader’s exercise).

Figure A.11: Density function of the sum z = x + y, when x, y : Unif(0, 1) (a triangle with maximum at z = 1).

The result is a ’triangle function’ which has a maximum at z = 1. One can show that the distribution of a sum of three random variates Unif(0,1) builds up from three pieces of parabola.

A.12 Moments of a distribution

Moments are characteristic numbers which give information about the shape of a distribution. Let us define:
μ'_n ≡ E(xⁿ)    algebraic moment of order n
μ_n ≡ E[(x − x̄)ⁿ]    central moment of order n

Moments of density functions of more than one variable are defined analogously. For example the algebraic moment of order m in x and of order n in y:

μ'_mn = E(x^m y^n)
Using central moments the following characteristics of a density are defined:
β₁ = μ₃²/μ₂³  skewness        β₂ = μ₄/μ₂²  kurtosis
γ₁ = √β₁  skewness coefficient    γ₂ = β₂ − 3  kurtosis coefficient

The kurtosis coefficient γ2 (’huipukkuus’ in Finnish) is defined such that its value for Gaussian = 0. Hence it is a characteristic which can be used to indicate how much a distribution resembles Gaussian.

A.13 Characteristic function

The characteristic function is the Fourier transform of the density function. Let f(x) be the density function of a random variate x. The characteristic function is then defined as:

Φ_x(t) = E(e^{itx}) = ∫_{−∞}^{∞} e^{itx} f(x) dx    (116)
Φ is a complex-valued function of a real variable t. In the discrete case the definition is written as:

Φ_x(t) = Σ_j p_j e^{itx_j}    (117)

If the cumulative distribution function F (x) of x is continuous and dF (x) = f(x)dx, one can express f(x) by Fourier inversion of the characteristic function:

f(x) = (1/2π) ∫_{−∞}^{∞} Φ_x(t) e^{−ixt} dt
It is easy to prove the following important properties of the characteristic function:

Φ(0) = 1,    |Φ(t)| ≤ 1,    Φ_{ax+b}(t) = e^{ibt} Φ_x(at),    Φ_{x+y}(t) = Φ_x(t)Φ_y(t).

The last equation is valid on the condition that x and y are independent. The series expansion of Φ can be expressed with algebraic moments and, conversely, the moments can be calculated from the Φ function:

Φ_x(t) = Σ_{j=0}^{∞} μ'_j (it)^j/j!
μ'_n = (−i)ⁿ dⁿ/dtⁿ Φ_x(t)|_{t=0}    (118)
μ_n = (−i)ⁿ dⁿ/dtⁿ [e^{−ix̄t} Φ_x(t)]|_{t=0}
One can say that the characteristic function is the moment generating function.

For example take the Gaussian distribution:

f(x) = (1/(√(2π)σ)) e^{−½((x − x̄)/σ)²}

Φ_x(t) = ∫_{−∞}^{∞} e^{ixt} f(x) dx = e^{ix̄t − ½t²σ²}
μ₂ = −d²/dt² e^{−½t²σ²}|_{t=0} = σ²

A.14 Cumulants of a distribution

Let Φx(t) be the characteristic function of the random variate x. We define the cu- mulants generating function:

Kx(t) = ln Φx(t) (119)

The coefficients Kj of the series expansion of Kx(t)

K_x(t) = Σ_{j=0}^{∞} K_j (it)^j/j!    (120)
are the cumulants of the density function of x. The first five cumulants expressed using the distribution moments are:

K₁ = μ'₁ = x̄,  K₂ = μ₂ = σ²,  K₃ = μ₃,  K₄ = μ₄ − 3μ₂²,  K₅ = μ₅ − 10μ₂μ₃.

A.15 Probability generating function

Let P (k) = pk be a discrete density distribution. The characteristic function is:

Φ_k(t) = Σ_{k=0}^{∞} p_k e^{itk}
Let us denote z = exp(it). Then

G(z) = E(z^k) = Σ_{k=0}^{∞} p_k z^k    (121)
is the probability generating function. The derivatives of G are connected with characteristic properties of the distribution, as we see from the following examples:

G'(1) = Σ_{k=0}^{∞} k p_k = ⟨k⟩
G''(1) = ⟨k²⟩ − ⟨k⟩
σ_k² = ⟨k²⟩ − ⟨k⟩² = G''(1) − [G'(1)]² + G'(1).

B Exercises

B.1 Basic statistics exercises

1. Prove the validity of the following equations using the basic definitions of mean, variance and covariance:
a) ⟨ax + by⟩ = a⟨x⟩ + b⟨y⟩
b) var(x) = ⟨x²⟩ − ⟨x⟩²
c) cov(x, y) = ⟨xy⟩ − ⟨x⟩⟨y⟩
d) var(x + y) = var(x) + var(y) + 2cov(x, y)

2. The statistical quantity x is uniformly distributed in the range (a, b). Calculate the variance of x.
3. Show that if x and y are independent random variables, then cov(x, y) = 0.
4. Show that the kurtosis coefficient for the Gaussian distribution vanishes.
5. A β-active isotope is studied with a magnetic spectrometer. Hence the β-spectrum is measured as a function of particle momentum p. One wants to determine the end-point energy of the spectrum with the Kurie-plot method, in which the spectrum should be measured as a function of the β-particle kinetic energy T. How does one compute the transformation momentum spectrum → kinetic energy spectrum? Hint: Use the formulae T = E − mc² and E² = (pc)² + (mc²)² where m is the electron rest mass.
6. Random variates x and y are independent and uniformly distributed in the range (0,1). What is the density function of the product z = xy? What is the probability for z < 1/e?

7. Random variates x1 and x2 are independent and uniformly distributed in the range (0,1). Show that the variates z1 = x1 + x2 − 1 and z2 = x1 − x2 have the same density function. Show that cov(z1, z2) = 0. 8. Random variates x and y are independent and uniformly distributed in the range (0,1). Derive the density function for the sum z = x + y. Hint: See the solution in the lecture notes (Appendix), calculate the missing intermediate steps. 9. Write the following small simulation program: two numbers in the range (0,1) are chosen at random and then added. This is repeated N times (say N = 10000). See what kind of statistical distribution you get for the sum values by making a histogram. 10. Show that if x and y are independent random variates, then the characteristic function of the sum variable z = x + y equals the product of the characteristic functions of the x and y variables. Hint: Express the density function of z as as a double integral containing a δ-function. Integrate over z utilizing the δ-function.

B62 B.2 Uniform random number generation exercises

1. Consider the following random number generators: a) xi+1 = 7xi (mod 11) b) xi+1 = 6xi (mod 11) Generate the first 10 random numbers of the two series for x0 = 1. When the generators a) and b) are combined using the shuffling method what are the first 3 generated numbers?

2. On the basis of given rules find the period of the following generators: a) xi+1 = 11xi (mod 32), x0 odd b) xi+1 = 121xi + 567 (mod 1000)

3. We define a random variate z = max{x₁, x₂, . . . , x_k} where the variates x_i, i = 1, . . . , k are mutually independent and unif(0, 1). Show that the distribution function of z is F(z) = z^k.

4. Assume that in the generalized Buffon's needle experiment the needle length is l > D, where D is the distance between the parallel lines drawn on a plane. Show that the probability of a needle to drop on a line is P(hit) = (2/(πq))(1 − √(1 − q²) + q arccos q), where q = D/l. Hint: ∫ arccos t dt = t arccos t − √(1 − t²).

5. Write code (say Subroutine Runs(N,NR), if in FORTRAN) which generates N random numbers unif(0, 1) and computes the number NR of up-down runs in the sequence of generated N numbers. For random number generation you can use a library function.

6. Write a program to test the algorithm of the previous exercise. The program should call the above routine (or function) NN times and make a histogram of the number NR (number of up-down runs). Compare the mean and variance of the histogram with the theoretical asymptotic limits (2N − 1)/3 and (16N − 29)/90.

7. In order to execute a χ²-test write a function (e.g. Function Chisq(NN,m), if in FORTRAN) in which one generates NN random numbers unif(0, 1), counts the number of these random numbers in each of the m intervals ((i−1)/m, i/m), i = 1, ..., m, into a table (say LUKUM(i)) and computes the χ²-function using the statistics in the table.

8. Write a program to test the function of the previous exercise such that it computes repetitively the quantity y = √(2χ²) − √(2m − 3), e.g. using values NN = 10000 and m = 100. Make a plot of the quantity y to see that the computed random variates y follow approximately the normal distribution N(0, 1).

9. Write an algorithm (e.g. Subroutine Corr Test(N,Rho), if in FORTRAN) which computes the correlation coefficient of successive pairs of random numbers as- suming that the generation is performed by a function named Rn User(Dummy). N (input parameter) is the total number of pairs to be generated and Rho is the correlation coefficient.

10. Write an algorithm (say Subroutine Chi Test(M,N,Chi), if in FORTRAN) which computes the χ2 test function (Chi) for a random generator function Rn Flat (unif(0, 1)). The variable M is the total number of random numbers to be generated and N is the number of divisions in the interval (0, 1).

11. You have a random number generator unif(0, 1) at your disposal. How do you generate uniformly random numbers or points a) in the range (a, b)? b) in the square bounded by the lines x = ±1, y = ±1? c) inside the unit circle? d) on the circumference of the unit circle? How do you compute an estimate for the value of π using the algorithm of the point c)? Hint: Hit or miss method.

B.3 Generation from arbitrary distributions exercises

1. The distribution function of a random variate x is F(x) = 1 − exp(−3√x), x > 0.

How do you generate the random numbers x?

2. Write the generator of the previous exercise as a function routine (e.g. Func- tion Rnexsq(Dummy) if in fortran), test it and make a histogram of the random numbers x in the range (0, 10).

3. The density function of a random variate x is of the form
f(x) = c/(1 + bx)² when 0 ≤ x ≤ D, and f(x) = 0 otherwise,

where c and b are constants (c > b). Show that D = (c − b)−1. How do you apply the inversion method to generate the random numbers x?

4. Write the generation algorithm of the previous exercise as a function in which the parameters c and b of the distribution are input parameters. Test the function by generating a histogram of random numbers when c = 2, b = 1.

5. When the intermediate vector boson W decays as W → eν (electron + neutrino), the spectrum of the e transverse momentum p_T is approximately of the shape
f(p_T) = const × p_T/√(1 − (p_T/p_max)²).
Using the inversion method write an algorithm which generates the transverse momenta p_T, when p_max = 40 GeV/c. Make a histogram of the generated p_T values.

6. In physics data analysis one encounters often with a problem: One knows the theoretical spectrum of the quantity y in the form of a function h = h(y) and

B64 the measurement precision or resolution function r = ry(x). Then the observable spectrum of the quantity y is Z f(x) = ry(x)h(y)dy.

How do you simulate (or generate) the observable spectrum f when one assumes that h is the spectrum of the previous example and the resolution function is a Gaussian distribution with a standard deviation σ = 3 GeV/c? Hint: Composition method.
7. A random variate x gets values in the interval (0,1) and its density function f = f(x) fulfils the condition f(x) < x f(1). Draw a flow chart or pseudo code for an algorithm which generates the variates x using the generalized hit or miss method when the comparison function is a straight line through the origin and the point (1, f(1)).
8. Apply the (general) hit or miss method to generate from the normal distribution N(0, 1) by choosing as comparison function A · C(0,1), where A is a constant and C(0,1) is the Cauchy distribution with parameters a = 0, b = 1. By solving an extreme value problem for the ratio N(0, 1)/(A · C(0,1)) show that the best value for the constant A is √(2π/e). Write a flow diagram or a working routine.
9. The density function of the variates x, y is
f(x, y) = 1/π for x² + y² ≤ 1, and 0 for x² + y² > 1.
How does one express the same density as a function of a) the polar coordinates r, φ, b) the quantities r², φ? Conclude from this how to generate random points uniformly in a unit circle using polar coordinates.
10. A straight line of infinite length rotates freely around a point (a, b) on the line (b > 0). The rotation angle is a uniformly distributed random variate. The line crosses the x-axis at the points (x, 0). Show that the random variate x follows the Cauchy distribution:
f(x) = (1/(bπ)) · 1/(1 + ((x − a)/b)²).

B.4 Wellknown continuous distributions exercises

1. We define the random variates x₁, x₂ (cf. the Box–Müller method) as:
x₁ = √(−2 log u₁) sin(2πu₂)
x₂ = √(−2 log u₁) sin(2πu₂ − φ),

where u1 and u2 are uniformly distributed in the range [0,1] and φ is a constant (0 ≤ φ ≤ 2π). Show directly on the basis of the definition of mean, variance and covariance that:

a) E(x₁) = E(x₂) = 0  b) var(x₁) = var(x₂) = 1  c) ρ(x₁, x₂) = cos φ.
Hint: Use ∫₀¹ sin²(2πu) du = ½ and ∫₀¹ sin(2πu) cos(2πu) du = 0.

B65 2. The random variates x1, x2, . . . , xn are independent and uniform in the range (0,1). Let us denote n X y = xi. i=1 Determine the constants a and b such that the mean and variance of the variate z = ay + b are hzi = 0 and var(z) = 1, respectively.

3. Write a function Ransum(Rnd,M) which yields normal random numbers Rnd for the distribution N (0, 1) using the summation method. M is a parameter (input) which equals the number of terms in the sum. Test the routine by generating eg. 10000 random numbers and by computing their mean and variance.

4. The distribution of independent variates x and y is normal N (0,1). Show that the quantity q = x2 + y2 follows the exponential distribution.

5. The random numbers x₁ and x₂ are uniform and independent in the range (0,1). Show analytically that the random variates z₁ = √x₁ and z₂ = max(x₁, x₂) have the same density distribution (which?).

6. The random points (x, y, z) are distributed uniformly within a unit sphere centred about the origin. Hence the density distribution of these points is
f(x, y, z) = 3/(4π) for x² + y² + z² ≤ 1, and 0 otherwise.

When one transforms from Cartesian to spherical coordinates r, θ, φ (x = r sin θ cos φ, y = r sin θ sin φ, z = r cos θ) the following transformation equation is valid:
(3/(4π)) dx dy dz = dr³ · (−½ d cos θ) · (1/(2π)) dφ
Use this information to find an algorithm which generates directly the r, θ, φ coordinates such that the points (x, y, z) are distributed uniformly on the surface of the sphere.

7. When one transforms from Cartesian coordinates (x, y, z) to spherical coordinates (r, θ, φ) the space element transforms as follows: dx dy dz → r² dr sin θ dθ dφ. On the basis of this information show that the algorithm
1° Generate u₁ unif(0,1), compute r = u₁^{1/3}
2° Generate u₂ unif(0,1), compute cos θ = −1 + 2u₂, sin θ = sin(arccos(−1 + 2u₂))
3° Generate u₃ unif(0,1), compute φ = 2πu₃
4° Compute x = r sin θ cos φ, y = r sin θ sin φ, z = r cos θ

generates uniformly points in a unit sphere.

8. When nuclei are bombarded by heavy ions, they disintegrate in N pieces with the probabilites as shown in the table below:

N   2    3    4    5    6
P   0.1  0.3  0.3  0.2  0.1

A random generator gives integers uniformly in the range 0 → 99. The first six numbers of the random sequence are 97, 55, 24, 13, 86 and 91. Using this sequence generate six simulated disintegrations of the atoms to N pieces.

9. Write an algorithm which generates random variates N from the discrete distri- bution of the previous exercise.

10. In a Poisson test one generates NN=900 random numbers unif(0,1) and these are distributed in N=300 uniform windows (bins). The expected number of windows with zero entries is about 300e−3 ≈ 15. Prove this result.

11. Pairs of random numbers (x1, x2) are generated independent of each other from the normal distribution N (0, 1). Show that the random numbers

x₃ = ρx₁ + √(1 − ρ²) x₂

where −1 < ρ < 1, follow the normal distribution N (0, 1) and that the correlation coefficient of the variates x1 and x3 equals ρ. 12. Write a routine (eg. Subroutine Gauss2(X1,X2,E1,E2,S1,S2,RO), if in Fortran) which generates 2-dimensional Gaussian variates (X1,X2) whose expectation val- ues and standard deviates are (E1,E2) and (S1,S2), respectively. The parameter RO is the correlation coefficient. Use the method of previous exercise.

13. Test the algorithm of the previous exercise by plotting a 2-dimensional histogram (’scatter’ plot or lego plot) with the parameter values E1=E2=0, and

a) S1=1.0, S2=1.0, RO=0.5, b) S1=1.0, S2=0.5, RO=0.8.

14. Prove the Cholesky factorization algorithm of a covariance matrix.

15. Write an algorithm by which you generate unit vectors (x, y, z) such that their direction is distributed isotropically.

16. Write a code for the Cholesky algorithm and the corresponding algorithm for Gaussian generation in k-dimensional space (say Subroutine Gaussk(X,U,V,k), if in Fortran). The input arguments are U (mean vector μ⃗) and V (k × k covariance matrix). X is the generated random vector. Test and apply in the case k = 2, μ⃗ = (6.0, 3.0) and
V = [ 2.0  0.5 ]
    [ 0.5  1.0 ]

B67 B.5 Discrete distributions exercises

1. Random numbers m and n are independent Poisson variates with mean µ and ν, respectively. Show that their sum l = m + n is a Poisson variate with mean λ = µ + ν.

2. Show that the expectation value of the binomial variate k = 0, 1, ..., N is ⟨k⟩ = Np. The frequency of a binomial variate k is P_k = (N choose k) p^k (1 − p)^{N−k} with 0 < p < 1.
3. Derive the mean and variance of the Poisson distribution p_n = (λⁿ/n!) e^{−λ}.
4. There are 3 kinds of nuclei, A, B and C, in a radioactive sample. Their fractional numbers are p₁, p₂ and p₃. The nuclei are uniformly distributed inside a sphere with radius R. Write a simulation algorithm which generates a random point (x, y, z) in the sphere as the position of a decaying nucleus and a random index i = 1, 2 or 3 corresponding to the type of nucleus A, B or C, respectively.

5. A radioactive source emits particles with a mean frequency of λ = 0.5s−1. Start- ing from t = 0 generate consecutive particle emission times t until the time exceeds t1 = 2s. The number of emissions preceeding this last emission is de- noted by n. Repeat this kind of generation, say, N = 1000 times and make statistics on the numbers n (n = 0, 1, 2, 3, ··· ). What is the theoretical probabil- ity distribution of the random variates n? Compare the generated distribution with this.

6. Write and test a code which generates Poisson random variates. Write the code as a function, say Integer Function Psngen(T) where T is the given mean of the Poisson distribution.

7. A source emits β particles isotropically. The mean free path of the particles in the surrounding material is L. Write a simulation algorithm which generates the first collision point of a β particle. Assume that the source is point-like and that it is placed in the origin.

8. A water reservoir can take V₀ = 1000 m³ of water. Water is used at a smooth rate of ν = 10 m³/day. The reservoir gets new water only by the effect of rains. Rains appear at random with a mean rate of λ = 0.1 day⁻¹ (once in 10 days on the average). The amount of new water brought by a rain is a Gaussian random variate with mean and standard deviation of V₁ = 80 m³ and σ_{V₁} = 5 m³, respectively. Write an algorithm which simulates the total amount of water in the reservoir day by day. Assume that the reservoir is full in the beginning and that the water which may flood over is lost.

9. Write a program which generates 10000 random numbers from the normal dis- tribution N (0,1), inserts them in an array and sorts them in an increasing order. Write a routine which generates normal variates from the table by empirical method. Compare the speed of the routine with the speed of the routine which you used to generate the table. Note for cernlib users: Call Timex(Time Elapsed) gives the cpu time used so far.

B68 B.6 Monte Carlo integration exercises

1. Calculate var(f) in the interval (0,1) for the functions a) f(x) = x² and b) f(x) = exp(x). How many random numbers unif(0,1) does one have to generate in order that the Monte Carlo method gives the integral of these functions with a relative accuracy of one per mille?

2. Write an algorithm which computes by the MC method the integral
I = ∫_A e^{xy} dx dy
where the surface A is a closed region limited by the curves y = x² − 1 and y = 1 − x².

3. How does the empirical error estimation formula of MC-integration change, when one uses the ’hit or miss’ method in the integration, i.e. one generates points in a (hyper)rectangle B which envelops the integration region A and rejects the points outside A? Use the notations VB = the volume of rectangle and N (M) = number of points in A (B).

4. Write a flow chart (or pseudo code) for an algorithm which computes by the MC method the integral
I = ∫ ··· ∫ x₁² x₂² ··· x_N² dx₁ dx₂ ... dx_N

where the region of integration is a unit sphere in N-dimensional space. Apply the rejection method to define the region of integration.

5. Write the algorithm of the previous example as a function which is invoked as Value I = Finteg(N dim,N gen) and which computes the integral by generating N gen points (including rejections) in N dim dimensional space. Test the routine in the case N dim = 2 when the exact result is π/24.

6. Write the MC integration sum formula for the integral

I = ∫₀¹ (e^{x²} − 1) dx
using the importance sampling method with the weight function g(x) = 2x.

7. Show that the error squared of the MC integration with importance sampling method is about

(∆I_N)² = (1/N²) Σ_{i=1}^N (f(x_i)/g(x_i))² − (1/N³) [Σ_{i=1}^N f(x_i)/g(x_i)]²

where g(x) is a density function used in sampling and xi are random numbers generated from g.

B69 8. Write a routine MCimps(N gen,Funct,Gsamp) which computes the integral of a function Funct by importance sampling method using the function Gsamp to sample random points in the integration region with a given density. Funct and Gsamp should be coded functions. N gen is the number of points to be generated. Apply the algorithm for the integral Z ∞ Z ∞ √ − 1 r2 re 2 dxdy −∞ −∞

(r = √(x² + y²)) and the normal distribution N(0, 1) is used as the sampling density.

9. How would you code the function Gsamp in the previous example when the integration region is a rectangle a < x < b, c < y < d and the algorithm works as crude Monte Carlo so that the routine MCimps need not be changed.?

10. How would you apply the method of control variates to compute the integral I = ∫₀¹ exp(x) dx when the control function is g(x) = 1 + x + x²/2? What is the error formula in this case?

B.7 Application examples

1. The exercise following this one requires a solution to the following problem: A point (x₀, y₀, z₀) is inside a sphere with radius R and centred about the origin. A straight line starts from this point in the direction (θ, φ) and crosses the surface of the sphere at a certain point. What is the distance of the point (x₀, y₀, z₀) from the crossing point? Show that the distance is calculated from the formula s = √(q + p²) − p, where p = x₀ sin θ cos φ + y₀ sin θ sin φ + z₀ cos θ and q = R² − x₀² − y₀² − z₀².

3. When a set of linear equations with n unknowns is solved by Monte Carlo method, one generates Markov chains on the basis of an n×n–matrix. What is the Markov chain obtained by the 3 × 3 matrix: 0.2 0.5 0.0 P = 0.1 0.4 0.3 0.0 0.6 0.2 when one uses random numbers in the range 0 → 99 and the first 7 numbers of the random sequence are 0, 66, 13, 78, 51, 96 ja 77? Assume that one starts with the first row of the matrix.

B70 4. The neutrino flux from supernova 1987A was observed by a small burst of neu- trinos in some underground neutrino experiments in Japan and US. In order to eliminate the possibility that such a burst is a random fluctation in the back- ground neutrino flux one must be able to estimate the probability of such a fluctuation. Suppose that the detector counts the background neutrinos with a mean frequency ν. Write a simulation algorithm by which you can estimate the probability that during a measurement period of length T the detector counts a random burst of ≥ N neutrinos within a short time window ∆t. Apply the algorithm using the following numerical values: ν = 1/min, T = 5h, N = 4 ja ∆t = 5s. 5. The spectrum of particle transverse momenta r in high energy particle collisions follows approximately the density function f(r) = a2r exp(−ar) where a is a constant. Show that the following algortihm:

◦ 1 generate r1 and r2 unif(0,1) ◦ 2 calculate r = − log(r1r2)/a gives random numbers r with the density f(r). 6. Show that if the random variates x and y are distributed exponentially with means x0 and y0, the quantity z = min{x, y} is also exponentially distributed −1 −1 −1 and that its mean z0 is given by: z0 = x0 +y0 . Hint: Consider the probability P (min{x, y} < z). 7. A material composed of the elements A and B is studied with X-rays. The X-rays scattering from the element A traverse the material a mean distance xA before scattering and those scattering from the element B a mean distance of xB. Show that when the X-ray scatters from either of the atom A or B, the probablity that −1 −1 −1 the scatterer is the atom A is xT /xA where xT = xA + xB . 8. A collimated beam of K− particles is sent to a 2 m long hydrogen bubble chamber where each beam particle is observed either a) to traverse the chamber without a collision, b) to deacy or c) to collide with a hydrogen nucleus. Write an algorithm, which simulates these different possibilities and which also gives the interaction point (in case b or c). One knows that half of the K− particles decay (provided they do not collide) within a 3.47 m distance and that the collision cross section is about 26.4 mbarn. The density of liquid hydrogen in bubble chamber is ρ=0.063 g/cm3. 9. A processor tends to send signals to a data bus in such a way that the time difference of consecutive signals is random and uniformly distributed between 1–9 µs. The bus can accept only one signal at a time. The signal stays in the bus a time t which is distributed between 0–9 µs with the density √ f(t) = (6 t)−1, 0 < t < 9. Design an algorithm which simulates the time instants when signals come to a queue for the data bus, enter the bus and exit. Take especially into account the forming of a queue.

B71 10. Write the code for the above simulation algorithm and study with it the following problems: a) How many signals on the average enter the bus during a time window of 0.1 s? b) How many signals are in the queue on the average? c) What is the length of the queue at maximum when you simulate the functioning of the bus during a total time of 1 s.

11. In the simulation of Compton scattering one has to generate random numbers from the following distributions (derive the normalization constants):

f1(x) = Cx

f2(x) = B/x

where x0 < x < 1 and x0 > 0. How do you perform the generation? 12. Write a routine which generates the Compton scattering parameter e = E0/E. Give photon energy before collision as an input parameter. Apply the routine to make a histogram for the quantities e and cos θ, when E = 10 MeV.

Index
Central limit theorem, 38
Characteristic function, A60
Covariance matrix, 2
  maximum likelihood, 20
Cumulants, A61

Error, 2 matrix, 2 propagation, 2

Fit quality, 16 fit probability, 16 pulls, 16 Fitting, 7 least squares, 8 maximum likelihood, 7, 18

Generating moments, A60 Generating function, A61

Introduction, 1

Jacobian, 4, A58

Least Squares, 8 constraints, 13 convergence, 13 error estimates, 10 Life time, 22 half-life, 22

Maximum Likelihood, 7 gaussian statistics, 8 poisson statistics, 18 Moments, A59 Monte Carlo, 21

Propagation, 2 linear transformations, 2 non-linear transformations, 4

ROOT, 17 likelihood, 19 shuffling, B63

Table of contents, iii
