E. Santovetti Università degli Studi di Roma Tor Vergata

Statistical data analysis lecture I

Useful books:
G. Cowan, Statistical Data Analysis, Clarendon Press, Oxford, 1998
G. D'Agostini, "Bayesian Reasoning in Data Analysis - A Critical Introduction", World Scientific Publishing, 2003

Data analysis in particle physics
The aim of experimental particle physics is to find or build environments able to test the theoretical models, e.g. the Standard Model (SM). In particle physics we study the result of an interaction and measure several quantities for each produced particle (charge, momentum, energy, ...).

(Figure: e+ e- interaction event)

The tasks of data analysis are:
● measure (estimate) the parameters;
● quantify the uncertainty of the parameter estimates;
● test the extent to which the predictions of a theory are in agreement with the data.
There are several sources of uncertainty:
● theory is not deterministic (quantum mechanics);
● random measurement fluctuations, even without quantum effects;
● errors due to faulty instruments or procedures.

We can quantify the uncertainty using probability.

Definition
In probability theory, the probability P of some event A, denoted P(A), is usually defined in such a way that P satisfies the Kolmogorov axioms:

1) The probability of an event is a non-negative real number: P(A) ≥ 0  ∀ A ∈ S

2) The total (maximal) probability is one: P(S) = 1

3) If two events are pairwise disjoint, the probability of the two events is the sum of the two probabilities

A∩B = ∅ ⇒ P(A∪B) = P(A) + P(B)

From these axioms we can derive further properties:

P(Ā) = 1 − P(A)    P(A∪Ā) = 1    P(∅) = 0    A ⊂ B ⇒ P(A) ≤ P(B)

P(A∪B) = P(A) + P(B) − P(A∩B)

(Andrey Kolmogorov, 1933)

Conditional probability, independence
An important concept to introduce is the conditional probability: the probability of A given B (with P(B) ≠ 0),
P(A|B) = P(A∩B) / P(B)
In fact it is meaningless to define an absolute probability: probability depends on the information we have about the event itself and on the surrounding conditions. In this way we establish a connection between A and B, and in physics connections (relations) are important.
Let us make an example with the rolling of a die:
P(n<3 | n even) = P((n<3)∩(n even)) / P(n even) = (1/6) / (3/6) = 1/3
If two events are independent (uncorrelated):

P(A∩B) = P(A)·P(B)   and in that case:

P(A|B) = P(A∩B) / P(B) = P(A)·P(B) / P(B) = P(A)
As expected, the probability of A given B, if A and B are independent, does not depend on B.
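As a quick numerical cross-check of the dice example above (a sketch added here, not part of the original slides), a short Monte Carlo in Python estimates P(n<3 | n even) and compares it with the exact value 1/3:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n = rng.integers(1, 7, size=1_000_000)   # one million throws of a fair die

even = (n % 2 == 0)
less_than_3 = (n < 3)

# conditional probability: P(n<3 | n even) = P((n<3) and (n even)) / P(n even)
p_cond = np.mean(less_than_3 & even) / np.mean(even)
print(f"P(n<3 | n even) = {p_cond:.4f}  (exact value: {1/3:.4f})")
```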

Interpretation of probability
I. Relative frequency: the probability of a given event is the limit of its relative frequency of occurrence. Let A be a particular event; then
P(A) = lim (number of occurrences of A) / n   for n → ∞ trials
Examples: quantum mechanical effects, particle scattering, radioactive decays, ... The limit operation is not to be understood in the usual mathematical sense.
II. Subjective probability: A and B are hypotheses, statements that are either true or false.

We can define the probability as the price a person thinks it fair to pay (not the maximum price), if he gains 1 in case the event happens and 0 in case it does not (de Finetti, Savage). If the possible gain is T, and you think it fair to bet S that the event will happen, then P(A) = S/T.

P(A) = S / T
Both interpretations are consistent with the Kolmogorov axioms. In particle physics the frequency interpretation is often the most useful, but subjective probability can provide a more natural treatment of non-repeatable phenomena: systematic uncertainties.

ISO definition of probability

... In contrast to this frequency-based point of view of probability, an equally valid viewpoint is that probability is a measure of the degree of belief that an event will occur. For example, suppose one has a chance of winning a small sum of money D and one is a rational bettor. One's degree of belief in event A occurring is p = 0.5 if one is indifferent to these two betting choices: (1) receiving D if event A occurs but nothing if it does not occur; (2) receiving D if event A does not occur but nothing if it does occur.

In the case of generic p (0 ≤ p ≤ 1) the two choices to which the rational bettor is indifferent are: 1) receiving (1-p)D if event A occurs but nothing if it does not occur; 2) receiving pD if event A does not occur but nothing if it does occur

The cases in which the probability can easily be evaluated from objective parameters are few and somewhat artificial. In real life we face complex problems on which we have to decide, taking our responsibilities.

Bayes theorem
From the definition of conditional probability we can write:
P(A|B) = P(A∩B) / P(B)   and   P(B|A) = P(B∩A) / P(A)
and, since P(A∩B) = P(B∩A), we can conclude:
P(A|B) = P(B|A) P(A) / P(B)
Thomas Bayes (1702 – 1761), An essay towards solving a problem in the doctrine of chances, Philos. Trans. R. Soc. 53 (1763) 370; reprinted in Biometrika 45 (1958) 293.
The probability of the event A, given B, is the probability of the event B, given A, multiplied by the probability of A and divided by the probability of B. The probability of B in the denominator can be seen as a normalization factor.

The law of total probability

Consider a subset E of the total sample space S. Assume that S is divided into disjoint subsets Hi (hypotheses), such that S = ∪i Hi. We can write E as
E = E∩S = E∩(∪i Hi) = ∪i (E∩Hi)
and the probability is
P(E) = P(E∩(∪i Hi)) = Σi P(E∩Hi) = Σi P(E|Hi) P(Hi)     (law of total probability)
The Bayes theorem becomes (again, the denominator is a normalization factor):

P(Hi|E) = P(E|Hi) P(Hi) / Σj P(E|Hj) P(Hj)     i.e.     P(Hi|E) ∝ P(E|Hi) P(Hi)

Bayes theorem: interpretation keys
The Bayes theorem can also be written as
P(Hi|E) / P(Hi) = P(E|Hi) / P(E)
the probability of Hi is modified by the fact that E is true by the same factor by which the probability of E would be modified if Hi were true (a soccer team has double the probability of winning the match if it is ahead in the score at half time, then ...).

P(Hi|E) ∝ P(E|Hi) P(Hi)
The Bayes theorem can be used to test a theory (H) given new experimental evidence (E). The probability of the theory being true, after the new evidence, is proportional to the 'a priori' probability of the theory and to the likelihood, i.e. the probability of the event E given the theory H.

The probability of a hypothesis given two events, P(H|E1∩E2), can be evaluated in two ways: 1) applying the theorem directly to the event E = E1∩E2; 2) applying the theorem first to the event E1 and then applying the theorem, with the resulting probability as prior, to the event E2. It is remarkable that the results are the same and independent of the order.

P(Hk|E1∩E2∩...∩En) ∝ Πi=1..n P(Ei|Hk) · P0(Hk) ∝ P(E1∩...∩En|Hk) · P0(Hk)
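The order independence stated above can be illustrated with a small Python sketch (not from the slides; the hypotheses, priors and likelihood values are made up for illustration, and the product form assumes the events are independent given each hypothesis):

```python
import numpy as np

def bayes_update(prior, likelihood):
    """One Bayes update: P(Hk|E) is proportional to P(E|Hk) * P(Hk), normalized over the hypotheses."""
    post = np.asarray(prior) * np.asarray(likelihood)
    return post / post.sum()

prior  = np.array([0.7, 0.3])   # P0(H1), P0(H2)   (illustrative numbers)
lik_E1 = np.array([0.2, 0.6])   # P(E1|H1), P(E1|H2)
lik_E2 = np.array([0.5, 0.1])   # P(E2|H1), P(E2|H2)

seq_12 = bayes_update(bayes_update(prior, lik_E1), lik_E2)   # E1 first, then E2
seq_21 = bayes_update(bayes_update(prior, lik_E2), lik_E1)   # E2 first, then E1
joint  = bayes_update(prior, lik_E1 * lik_E2)                # E1 and E2 at once

print(seq_12, seq_21, joint)   # the three posteriors coincide
```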

Bayes theorem: interpretation keys (2)
The Bayes theorem can also be written as
P(H1|E) / P(H2|E) = [P(E|H1) / P(E|H2)] · [P(H1) / P(H2)]
This version of the Bayes theorem is very useful if we want to compare two different hypotheses or theories (it is often meaningless to define an absolute probability for a theory). The ratio of the probabilities is modified by the ratio of the likelihood factors (Bayes factor).

in summary...

Final probability = likelihood · initial probability

We can use the theorem to solve problems of "inverse probability", e.g. the problem of the probabilities of causes.

If there are several causes that can generate the same experimental effect, the probability that the effect is produced by a certain cause is proportional to the probability of the cause multiplied by the probability that this cause produces the observed effect

P(Ci|E) ∝ P(E|Ci) · P0(Ci)

Bayes theorem: example 1

Consider a school with 60% male students and 40% female students. Female students wear pants or a skirt in equal numbers, while male students wear only pants. An observer sees, from very far, a student with pants. What is the probability that this student is a girl? This problem can be easily solved using the Bayes theorem with: event A: the observed student is a girl; event B: the observed student wears pants. We have to evaluate P(A|B), and:
P(A) is the probability that a student is female, without any condition: 40% = 2/5
P(A') is the probability that a student is male, without any condition: 60% = 3/5
P(B|A) is the probability that a student wears pants, given that this student is female: 1/2
P(B|A') is the probability that a student wears pants, given that this student is male: 1
P(B) is the probability that a student (any) wears pants. Since the number of students wearing pants is 80 (60 male + 20 female) out of 100 total students, P(B) = 80/100 = 4/5

P(A|B) = P(B|A) P(A) / P(B) = (1/2 · 2/5) / (4/5) = 1/4
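The same number can be reproduced with a few lines of Python (added here as a check, not part of the original slides):

```python
# Example 1: P(girl | pants) from the Bayes theorem and the law of total probability
p_girl, p_boy = 0.4, 0.6
p_pants_given_girl, p_pants_given_boy = 0.5, 1.0

p_pants = p_pants_given_girl * p_girl + p_pants_given_boy * p_boy   # normalization: 4/5
p_girl_given_pants = p_pants_given_girl * p_girl / p_pants
print(p_girl_given_pants)   # 0.25
```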

Bayes theorem: example 2
Suppose the probability (for anyone) of having AIDS is:
P(AIDS) = 0.001    P(no AIDS) = 0.999
Suppose now you take the AIDS test and the result is positive (+), but you know that the test can be wrong:
P(+|AIDS) = 0.98    P(−|AIDS) = 0.02    probabilities to (in)correctly identify an infected person

P(+|no AIDS) = 0.03    P(−|no AIDS) = 0.97    probabilities to (in)correctly identify an uninfected person
How worried should you be? The probability of having AIDS, given a positive result, is:
P(AIDS|+) = P(+|AIDS) P(AIDS) / [P(+|AIDS) P(AIDS) + P(+|no AIDS) P(no AIDS)]
P(AIDS|+) = (0.98 × 0.001) / (0.98 × 0.001 + 0.03 × 0.999) = 0.0317
You are probably ok! This is a simple example of an inverse probability problem.

The case of the suspicious cardsharper: example 3
Suppose you meet an old friend (you have not seen him for a long time) and he asks you to gamble for the coffee: whoever draws the highest card wins. You accept the gamble and you lose! Since it is a long time that you have not seen your friend, you do not know him very well anymore. What is the probability that he is a cardsharper and is cheating you?

P(H|v1) = P(v1|H) P0(H) / [P(v1|H) P0(H) + P(v1|H̄) P0(H̄)]
where H is the hypothesis that he is a cardsharper and v1 is the event that he wins one gamble. We see that we have to establish the a priori probability P0(H) (not easy). We can write:

P(v1|H) = 1 (a cardsharper wins for sure), P(v1|H̄) = 1/2 (a fair game), P0(H) = 0.1.
From this we can then evaluate our probability:
P(H|v1) = (1 × 0.1) / (1 × 0.1 + 1/2 × 0.9) ≈ 0.18

Starting from a 10% a priori probability, we obtain an 18% probability that our friend is a cardsharper.

The case of the suspicious cardsharper: example 3
At this point your friend proposes to gamble another time... Mmmmh. You accept and lose again! What is the new probability that he is a cardsharper?

and considering that:
P(v1∩v2|H) = P(v1|H) P(v2|H) = 1     P(v1∩v2|H̄) = P(v1|H̄) P(v2|H̄) = 1/4

We have
P(H|v1∩v2) = (1 × 0.1) / (1 × 0.1 + 1/4 × 0.9) ≈ 0.31

The same result is obtained if we apply the Bayes theorem after the first gamble, using as a priori probability P(H|v1).

The case of the suspicious cardsharper: example 3

The probability changes with new facts. If you continue to gamble and continue to lose, the probability changes and goes rapidly to 1. In any case, this is strongly dependent on the a priori probability: if this were zero (you had complete confidence in your friend), he would never become a cardsharper.
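A minimal sketch of this behaviour (assuming, as in the slides, that a cardsharper wins for sure and an honest friend wins with probability 1/2):

```python
def p_cardsharper(prior, k):
    """Posterior probability of the cardsharper hypothesis after k lost gambles."""
    num = 1.0**k * prior                 # P(k wins | cardsharper) * prior
    den = num + 0.5**k * (1.0 - prior)   # + P(k wins | honest) * (1 - prior)
    return num / den

for prior in (0.1, 0.5, 0.0):
    print(prior, [round(p_cardsharper(prior, k), 3) for k in range(6)])
# prior = 0.1 -> 0.1, 0.182, 0.308, 0.471, ... rapidly approaching 1
# prior = 0.0 -> stays exactly 0, whatever the evidence
```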

Frequentist statistics – philosophy approach
In frequentist statistics, probabilities are associated only with the data, i.e. outcomes of repeatable observations. There are two problems:
● We can define the probability only in an approximate way, since the number of trials is limited. In the extreme case where we do just one trial the probability is 0 or 1 (practical limitation).

● A more substantial problem is that we should be able to repeat the experiment several times under exactly the same conditions (and independently of one another). Deciding whether the conditions are the same or not makes this definition of probability subjective and possibly arbitrary. The requirement of a large number of experiments is in contrast with the difficult repeatability of certain events (what about the probability of finding a certain space shuttle in a certain position at a certain time?).
"Empirical law of chance": probability as an objective reality, innate in nature.

Bayesian statistics – philosophy approach

It is difficult to establish certainties; it is easier to invalidate a theory than to validate it. To be valid, a theory has to be plausible a priori and the probability of its validity, given a certain event, has to be high. To invalidate a theory it is enough that this latter probability is low.

Probability changes in time with new evidence: experience changes the degree of reliability of a certain theory.

It is impossible to establish certainties, but we can classify theories according to their reliability.

Bayesian statistics is the systematic application of the Bayes theorem to update the degree of confidence in the possible hypotheses.

The dependence on the initial (a priori) probability is strongly reduced when we have much experimental information. In this case the conclusions come from the experimental data.

Probability has an unavoidable subjective component.

Random variables and probability density functions (PDF)

A random variable is a numerical characteristic assigned to an element of the sample space. It can be discrete or continuous. Suppose the outcome of an experiment is the continuous value x

The probability to find x in the interval [x, x+dx] is f(x) dx, where f(x) is the probability density function (pdf). From the Kolmogorov axioms the pdf is normalized to one:
∫ f(x) dx = 1   (integral over the whole range of x)

For a discrete variable xi the pdf is replaced by the probabilities P(xi) = pi, with Σi pi = 1.

Cumulative distribution function

Consider the probability to have a value less than or equal to x

F(x) = ∫_{−∞}^{x} f(t) dt

This is the cumulative distribution function. From it we can give an equivalent definition of the pdf:
f(x) = ∂F(x)/∂x

(Figure: a pdf and the corresponding cumulative distribution)

Histograms

A histogram is the distribution of a finite number of outcomes of a certain variable (say x). A pdf is a histogram with an infinite data sample, zero bin width and area normalized to one:

f(x) = lim_{n→∞} N(x) / (n Δx)

n = number of entries, Δx = bin width
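A short Python sketch (not from the slides) of how a normalized histogram approximates the pdf of the sampled variable:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=100_000)      # sample from a known pdf (standard Gaussian)

counts, edges = np.histogram(x, bins=50)    # N(x) per bin
dx = np.diff(edges)                         # bin widths
f_approx = counts / (counts.sum() * dx)     # N(x) / (n * delta_x)

print(np.sum(f_approx * dx))                # total area = 1, as for a pdf
```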

Multivariate pdf

Most of the time the outcome of an experiment is characterized by several different variables (example: mass, time, etc.). Consider two variables, x and y:

event A: x [x, x+dx] event B: y [y, y+dy]

The pdf is now a function of all the variables:

∫...∫ f(x1, x2, ..., xn) dx1 dx2 ... dxn = 1

If two variables are uncorrelated

f(x1, x2) = f1(x1) f2(x2)

Marginal pdf

Sometimes we want the pdf with respect to one variable (or a subset of the variables). Since we do not care about the remaining variables, we integrate over them:

f1(x1) = ∫ dx2 ∫ dx3 ... ∫ dxn f(x1, x2, ..., xn)

Marginal pdf ~ projection of joint pdf onto individual axes.

Function of a random variable
A function of a random variable is itself a random variable. Suppose x follows a pdf f(x) and consider a function a(x). What is the pdf g(a) of a?

g(a) da = ∫_{dS} f(x) dx, where dS is the region of x space for which a is in [a, a+da]. For the one-variable case with a unique inverse this is simply
g(a) = f(x(a)) |dx/da|

Example:

A uniform pdf in the variable x becomes a completely different pdf in the new variable a.

Functions without a unique inverse
If the function does not have a unique inverse, we have to consider all the possible solutions and include all the dx intervals corresponding to the da interval:

a(x) = x²

a = x²,   x = ±√a,   dx = ± da / (2√a)

dS = dx1 ∪ dx2 = [−√a − da/(2√a), −√a] ∪ [√a, √a + da/(2√a)]

g(a) = f(√a)/(2√a) + f(−√a)/(2√a)
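A minimal numerical check of this result (added here, not in the slides): take x uniform in [-1, 1], so f(x) = 1/2 and g(a) = 1/(2√a); the exact probability content of each bin of a, √a(high) − √a(low), can be compared with the Monte Carlo fractions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=1_000_000)   # f(x) = 1/2 on [-1, 1]
a = x**2                                     # transformed variable, g(a) = 1/(2*sqrt(a))

counts, edges = np.histogram(a, bins=50, range=(0.0, 1.0))
p_mc = counts / counts.sum()                           # observed fraction per bin
p_exact = np.sqrt(edges[1:]) - np.sqrt(edges[:-1])     # integral of g(a) over each bin

print(np.max(np.abs(p_mc - p_exact)))                  # only statistical fluctuations
```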

Functions of more than one r.v.
Suppose now we have a function a(x1, ..., xn) of many random variables:

g(a) da = ∫...∫_{dS} f(x1, ..., xn) dx1 ... dxn, where dS represents the region of the hyper-space (x1, ..., xn) defined by the relation a ≤ a(x1, ..., xn) ≤ a + da.

Functions of more than one r.v. (example)

Let us make an example with n = 2, where the random variables (x, y) > 0 follow the joint pdf f(x, y). Consider the function a(x, y):

(Mellin convolution)

More on transformation of variables

Consider now the case of the pdf of n variables

and assume there exist n new variables

for which the inverse functions exist. The pdf of the new variables y is:
g(y1, ..., yn) = f(x1(y), ..., xn(y)) |J|

J is the Jacobian determinant: J = det(∂xi/∂yj).

To get the pdf of a subsample of the variables, we integrate over the unwanted components.

Expectation values

Consider a r.v. X with pdf f(x)

We define the expectation value (mean) of x as the sum of all the possible values of x, each weighted with its probability (center of gravity):
E[x] = ∫ x f(x) dx = μ

For a generic function y(x):
E[y] = ∫ y(x) f(x) dx

A measure of the width of a pdf is given by the variance:
V[x] = E[(x − E[x])²] = E[x²] − μ²,   σ = √V[x]

Covariance and correlation

Define the covariance cov[x, y] (also indicated as Vxy):
cov[x, y] = E[(x − μx)(y − μy)] = E[xy] − μx μy

Strictly related to the covariance is the correlation factor ρxy:
ρxy = cov[x, y] / (σx σy)

In the case of two independent variables E[xy] = E[x] E[y], so cov[x, y] = 0.

As expected the correlation between two independent (uncorrelated) variables is zero (the converse is not always true).

For a generic system with n variables the covariance matrix is
Vij = cov[xi, xj] = E[(xi − μi)(xj − μj)]

The covariance matrix is a crucial ingredient in evaluating the errors of variables and of functions of variables (error propagation).

Correlation examples

(Figures: scatter-plot examples of different correlations)

Error propagation
Suppose we have a system of observables x with pdf f(x).

We have all the information about errors and correlations from the covariance matrix. It can happen that we need to evaluate errors (and correlations) of variables y(x). A rigorous way is to build the new pdf g(y) and from g(y) evaluate the new covariance matrix for the y variables. This procedure may not be practical, since the transformation from x to y can be very complex and the f(x) function is not always fully known. In order to evaluate V[y] we need E[y] and E[y²]. Let us expand the y(x) function in a Taylor series up to first order:
y(x) ≈ y(μ) + Σi [∂y/∂xi]_{x=μ} (xi − μi)

Putting everything together, we have:
V[y] = σy² ≈ Σij [∂y/∂xi ∂y/∂xj]_{x=μ} Vij

Error propagation (2)
In general, for m variables y, we will have a new covariance matrix:
Ukl = Σij [∂yk/∂xi ∂yl/∂xj]_{x=μ} Vij

or, in matrix notation, U = A V Aᵀ

A = m n matrix

If the variables x are all uncorrelated (Vij = δij σi²):
Ukl = Σi [∂yk/∂xi ∂yl/∂xi]_{x=μ} σi²

and we find the usual formula where the errors add in quadrature:
σy² = Σi [∂y/∂xi]²_{x=μ} σi²

Error propagation tells us how to find the errors of the new variables starting from the covariance matrix of the original variables. The limit of this approach is that we are assuming y(x) to be linear in a region of size σx around μ; this is not always the case. (Figure: a case where the linear approximation of y(x) over ±σx is good and one where it is not.)

Error propagation: special cases

If x1 and x2 are uncorrelated:
● absolute errors add in quadrature for a sum (or difference): y = x1 + x2 ⇒ σy² = σ1² + σ2²
● relative errors add in quadrature for a product (or ratio): y = x1 x2 ⇒ (σy/y)² = (σ1/x1)² + (σ2/x2)²
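Both rules can be verified with a quick Monte Carlo sketch (illustrative numbers, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(10.0, 0.3, size=1_000_000)   # uncorrelated Gaussian inputs
x2 = rng.normal( 5.0, 0.2, size=1_000_000)

s = x1 + x2                                  # sum
p = x1 * x2                                  # product

print(np.std(s),            np.hypot(0.3, 0.2))             # absolute errors in quadrature
print(np.std(p) / p.mean(), np.hypot(0.3/10.0, 0.2/5.0))    # relative errors in quadrature
```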

Consider now

Consider now the case in which the two variables are completely correlated

Then the error is
σy² = σ1² + σ2² − 2σ1σ2 = (σ1 − σ2)²   (for y = x1 − x2 with ρ = 1)

For 100% correlation the error on the difference → 0 (the same holds for y = x1 + x2 with ρ = −1).

Catalogue of pdfs

Distributions
Distribution – HEP usage
Binomial – Branching ratio
Multinomial – Histogram with fixed N
Poisson – Number of events
Uniform – Monte Carlo methods
Exponential – Decay time
Gaussian – Measurement error
Chi square – Goodness of fit
Cauchy – Mass of resonance
Landau – Ionization energy loss
Beta – Prior pdf for efficiency
Gamma – Sum of exponential variables
Student's t – Resolution function with adjustable tails
Crystal Ball – Invariant mass with radiating particles
Argus function – Background shape

Binomial
Consider N independent experiments (Bernoulli trials). The outcome of each experiment is success or failure (1 or 0), and the probability of success is p (failure: 1 − p). What is the probability to have n successes over N experiments (0 ≤ n ≤ N)? The probability to have, in a given order, a particular sequence such as ssfsf... with n successes is p^n (1−p)^(N−n).

However, the order is not important, and we have to multiply by the number of combinations with n successes in N trials, N!/[n!(N−n)!]:
f(n; N, p) = N! / [n!(N−n)!] p^n (1−p)^(N−n),   E[n] = Np,   V[n] = Np(1−p)

Example: the number of times you get a six if you throw a die 10 times is binomial with p = 1/6, N = 10; the expected number is E[n] = Np = 10/6 ≈ 1.67.
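The full distribution of this example can be written down in a few lines of Python (a sketch, not part of the slides):

```python
from math import comb

N, p = 10, 1 / 6
pmf = [comb(N, n) * p**n * (1 - p)**(N - n) for n in range(N + 1)]   # binomial pmf

print(sum(pmf))                               # 1.0 (normalization)
print(sum(n * w for n, w in enumerate(pmf)))  # mean = N*p, about 1.67
print(pmf[0], pmf[1], pmf[2])                 # probabilities of 0, 1, 2 sixes
```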

Example: observe N decays of a W±; the number n of them that are W→μν is a binomial r.v., with p = branching ratio.

Efficiency evaluation
An interesting application of the binomial distribution is the efficiency evaluation, a quite common problem in experimental physics. The efficiency can be defined as the fraction of times, or probability, that the device gives a positive signal upon the occurrence of a process of interest. A practical way to estimate the efficiency is to perform a large number N of samplings of our process of interest, and to count the number of times n the device gives a positive signal (i.e. it has been efficient). This leads to the estimate of the true efficiency given by ε = n/N.

m = εN = n is a binomial variable with expectation value εtrue N and variance V[m] = N εtrue (1 − εtrue),

and the error on the efficiency will be
σε = σm / N = √( ε(1−ε) / N )   (with ε estimated by n/N)
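A minimal sketch of the estimate (the event counts below are hypothetical):

```python
import numpy as np

N = 2000      # number of sampled events (hypothetical)
n = 1840      # events for which the device gave a positive signal (hypothetical)

eff = n / N
err = np.sqrt(eff * (1.0 - eff) / N)   # binomial error on the efficiency
print(f"efficiency = {eff:.3f} +/- {err:.3f}")
```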

Multinomial
Like the binomial, but with several possible final outcomes. The outcome of each experiment can be 1, 2, 3, ..., m, and the probability parameter p becomes a vector (p1, ..., pm) with Σi pi = 1.

For N trials we want the probability to get n1 trials of outcome 1, n2 trials of outcome 2, ..., nm trials of outcome m:
f(n1, ..., nm; N, p) = N! / (n1! n2! ... nm!) p1^n1 p2^n2 ... pm^nm

Considering the outcome i as the positive result and all the rest as failure, we can write
E[ni] = N pi,   V[ni] = N pi (1 − pi)

We can also find the covariance matrix:
Vij = −N pi pj for i ≠ j,   Vii = N pi (1 − pi)

Example: consider a histogram with m bins and N total entries: ni is the number of entries in the i-th bin.

Poisson
Consider the limit of the binomial distribution for N → ∞ and p → 0, with ν = Np constant; n follows the Poisson distribution
f(n; ν) = ν^n e^(−ν) / n!

with the following expectation values:
E[n] = ν,   V[n] = ν

Example: number of scattering events n with cross section σ found for a fixed integrated luminosity
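The binomial-to-Poisson limit can be checked numerically (a sketch, not from the slides), keeping ν = Np fixed while N grows and p shrinks:

```python
from math import comb, exp, factorial

nu, N = 3.0, 10_000        # fixed mean nu = N*p, large N, small p
p = nu / N

for n in range(6):
    binom_pmf   = comb(N, n) * p**n * (1 - p)**(N - n)
    poisson_pmf = nu**n * exp(-nu) / factorial(n)
    print(n, round(binom_pmf, 5), round(poisson_pmf, 5))   # nearly identical values
```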

Uniform
Consider a continuous random variable x distributed as follows:
f(x; a, b) = 1/(b − a) for a ≤ x ≤ b, 0 otherwise

The expectation values are:
E[x] = (a + b)/2,   V[x] = (b − a)²/12

For any r.v. x with cumulative distribution F(x), the variable y = F(x) is uniform in [0, 1]. This can be used to generate a particular distribution starting from the uniform one (you need F⁻¹).

Example: in the decay π⁰ → γγ the energy of the photon is uniform in [Emin, Emax].
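The inverse-transform trick mentioned above can be sketched in Python for the exponential pdf of the next slide (F(t) = 1 − exp(−t/τ), so F⁻¹(u) = −τ ln(1 − u); not part of the original slides):

```python
import numpy as np

rng = np.random.default_rng(0)
tau = 2.0
u = rng.uniform(0.0, 1.0, size=1_000_000)   # y = F(t) is uniform in [0, 1]

t = -tau * np.log(1.0 - u)                  # invert F(t) = 1 - exp(-t/tau)

print(t.mean(), t.std())                    # both close to tau, as for an exponential
```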

Exponential
The exponential pdf for a continuous r.v. t is defined by:
f(t; τ) = (1/τ) e^(−t/τ) for t ≥ 0

The expectation values are:
E[t] = τ,   V[t] = τ²

Example: proper decay time of an unstable particle with life time τ

Gaussian
The Gaussian (normal) distribution for a continuous r.v. x is defined by:
f(x; μ, σ) = 1/(σ√(2π)) exp( −(x − μ)² / (2σ²) )

The expectation values are E[x] = μ and V[x] = σ². Due to the importance of the Gaussian distribution, the names μ and σ are often used also for other pdfs to indicate the mean and the standard deviation. The Gaussian pdf is so useful because almost any random variable that is the sum of a large number of small contributions follows it. This follows from the central limit theorem.

Central limit theorem

For n independent random variables xi, distributed according to arbitrary pdfs but with finite variances σi², consider the sum y = Σi xi.

In the limit n → ∞, y is a Gaussian r.v. with
E[y] = Σi μi,   V[y] = Σi σi²

Experimental errors are often the sum of several contributions, and then the measurements can be treated as Gaussian variables. The CLT can be proved using characteristic functions (Fourier transforms), see ... For finite n, the theorem is approximately valid to the extent that the fluctuation of the sum is not dominated by one (or a few) terms. Beware of measurement errors with non-Gaussian tails. Good example: velocity component vx of air molecules. Bad example: energy loss of a charged particle traversing a thin gas layer.

Central limit theorem: example 1
The best way to see the central limit theorem is to imagine throwing dice: first you throw one die, then two dice, and so on. Each time, you make the distribution of the sum of all the dice. The distribution tends to a definite shape (Gaussian) with an increasing peak value (3.5 n). The variable is the sum of the n dice outcomes.

The width σ starts from ~1/3 and increases with √n
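A few lines of Python reproduce the dice experiment (a sketch, not from the slides): the mean of the sum grows as 3.5 n, the width grows as √n, and the shape becomes Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)
for n_dice in (1, 2, 5, 30):
    throws = rng.integers(1, 7, size=(100_000, n_dice))   # 100k pseudo-experiments
    s = throws.sum(axis=1)                                 # sum of the n dice
    print(n_dice, s.mean(), s.std())                       # mean ~ 3.5*n, width ~ sqrt(n)
```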

Central limit theorem: example 2
In this case we sum an increasing number of variables coming from a disjoint uniform distribution. The variable is again the sum of the n sampled values.

Even at n > 10 the Gaussian approximation is very good. The starting disjoint uniform distribution is chosen to have σ = 1.

Multivariate Gaussian distribution

The multivariate Gaussian distribution for a set of continuous variables (x1, x2, ..., xn) is defined by:
f(x; μ, V) = (2π)^(−n/2) |V|^(−1/2) exp[ −½ (x − μ)ᵀ V⁻¹ (x − μ) ]

x and μ are column vectors and V is the covariance matrix.

It is essentially the joint distribution of several correlated Gaussian variables. If the variables are uncorrelated, the distribution becomes the product of the single Gaussians.

Chi square (χ²) distribution
The chi square distribution for a continuous variable z is defined by:
f(z; n) = 1/(2^(n/2) Γ(n/2)) z^(n/2 − 1) e^(−z/2) for z ≥ 0

n is the number of degrees of freedom (ndof), n ≥ 1; E[z] = n, V[z] = 2n.

Consider n Gaussian r.v. xi, with means μi and σi

the variable z = Σi (xi − μi)²/σi² follows the chi square distribution with n dof.

The chi square distribution is very useful for many applications; one of the most important is the test of the goodness of a fit.
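A small Monte Carlo sketch (not from the slides) of the statement above: the sum of n squared standardized Gaussian variables has mean n and variance 2n, as expected for a χ² with n dof:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
mu    = np.array([1.0, -2.0, 0.5, 3.0, 0.0])   # illustrative means
sigma = np.array([0.1,  2.0, 0.5, 1.0, 3.0])   # illustrative widths

x = rng.normal(mu, sigma, size=(1_000_000, n))   # n Gaussian r.v. per experiment
z = np.sum(((x - mu) / sigma) ** 2, axis=1)      # chi-square variable

print(z.mean(), z.var())                         # close to n and 2n
```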

Cauchy-Lorentz (Breit-Wigner) distribution
The Breit-Wigner distribution for a continuous variable x is defined by:
f(x; x0, Γ) = (1/π) (Γ/2) / [ (x − x0)² + (Γ/2)² ]
x0 = most probable value, Γ = full width at half maximum.

The Cauchy function has many applications in physics and mathematics: solution of the forced oscillator (resonance); closely related to the Poisson kernel, solution of the Laplace equation (mathematics).

Mass of a resonance particle (K*, ρ, Φ, etc.): x0 is the mass and Γ is the resonance width, inverse of the lifetime (particle physics).

Landau distribution
The Landau distribution is used in physics to describe the fluctuations of the energy loss of a charged particle passing through a thin layer of matter.

(Figure: ionization left along the track of a charged particle crossing a layer of thickness d)

Very long tail at high Δ (width much larger than Gaussian). The energy loss depends on β (Bethe-Bloch); the measurement of ΔE can therefore be used for particle identification.

L. Landau, J. Phys. USSR 8 (1944) 201; see also W. Allison and J. Cobb, Ann. Rev. Nucl. Part. Sci. 30 (1980) 253.

Beta distribution
In probability theory and statistics, the beta distribution is a family of continuous probability distributions defined on the interval (0, 1), parameterized by two positive shape parameters, typically denoted by α and β. The beta distribution is suited to the statistical modeling of proportions in applications where values of proportions equal to 0 or 1 do not occur.

Often used to represent the pdf of a continuous r.v. that is different from zero only in a finite interval:
f(x; α, β) = Γ(α + β)/(Γ(α)Γ(β)) x^(α−1) (1 − x)^(β−1) for 0 < x < 1

Crystal Ball function
In particle physics, the Crystal Ball function (from the Crystal Ball experiment at SLAC) is a modified Gaussian, with a power-law tail on the low side, used to describe the invariant mass of the mother particle in a two-body decay in the presence of radiated photons.

typical example:

Argus function
In physics, the ARGUS distribution, named after the particle physics experiment ARGUS, is the probability distribution of the reconstructed invariant mass of a decayed particle candidate in continuum background. The function is very useful to model the combinatorial background or a continuum distribution with sharp edges at the kinematic boundaries.

Φ and ϕ are the cumulative and density functions of the standard Gaussian. The analytic expression of the cumulative distribution is known, which allows a faster calculation.
