
39. Statistics

Revised October 2019 by G. Cowan (RHUL).

This chapter gives an overview of statistical methods used in high-energy physics. In statistics, we are interested in using a given sample of data to make inferences about a probabilistic model, e.g., to assess the model's validity or to determine the values of its parameters. There are two main approaches to statistical inference, which we may call frequentist and Bayesian.

In frequentist statistics, probability is interpreted as the frequency of the outcome of a repeatable experiment. The most important tools in this framework are parameter estimation, covered in Section 39.2, statistical tests, discussed in Section 39.3, and confidence intervals, which are constructed so as to cover the true value of a parameter with a specified probability, as described in Section 39.4.2. Note that in frequentist statistics one does not define a probability for a hypothesis or for the value of a parameter.

In Bayesian statistics, the interpretation of probability is more general and includes degree of belief (called subjective probability). One can then speak of a probability density function (p.d.f.) for a parameter, which expresses one's state of knowledge about where its true value lies. Bayesian methods provide a natural means to include additional information, which in general may be subjective; in fact they require prior probabilities for the hypotheses (or parameters) in question, i.e., the degree of belief about the parameters' values, before carrying out the measurement. Using Bayes' theorem (Eq. (38.4)), the prior degree of belief is updated by the data from the experiment. Bayesian methods for interval estimation are discussed in Sections 39.4.1 and 39.4.2.4.

For many inference problems, the frequentist and Bayesian approaches give similar numerical values, even though they answer different questions and are based on fundamentally different interpretations of probability. In some important cases, however, the two approaches may yield very different results. For a discussion of Bayesian vs. non-Bayesian methods, see references written by a statistician [1], by a physicist [2], or the detailed comparison in Ref. [3].

39.1 Fundamental concepts

Consider an experiment whose outcome is characterized by one or more data values, which we can write as a vector x. A hypothesis H is a statement about the probability for the data, often written P(x|H). (We will usually use a capital letter for a probability and lower case for a probability density. Often the term p.d.f. is used loosely to refer to either a probability or a probability density.) This could, for example, define completely the p.d.f. for the data (a simple hypothesis), or it could specify only the functional form of the p.d.f., with the values of one or more parameters not determined (a composite hypothesis).

If the probability P(x|H) for data x is regarded as a function of the hypothesis H, then it is called the likelihood of H, usually written L(H). Often the hypothesis is characterized by one or more parameters θ, in which case L(θ) = P(x|θ) is called the likelihood function.

In some cases one can obtain at least approximate frequentist results using the likelihood evaluated only with the data obtained. In general, however, the frequentist approach requires a full specification of the probability model P(x|H) both as a function of the data x and hypothesis H. In the Bayesian approach, inference is based on the posterior probability for H given the data x, which represents one's degree of belief that H is true given the data. This is obtained from Bayes' theorem (38.4), which can be written

P(H|x) = \frac{P(x|H)\,\pi(H)}{\int P(x|H')\,\pi(H')\,dH'} .   (39.1)


Here P(x|H) is the likelihood for H, which depends only on the data actually obtained. The quantity π(H) is the prior probability for H, which represents one's degree of belief for H before carrying out the measurement. The integral in the denominator (or sum, for discrete hypotheses) serves as a normalization factor. If H is characterized by a continuous parameter θ then the posterior probability is a p.d.f. p(θ|x). Note that the likelihood function itself is not a p.d.f. for θ.

39.2 Parameter estimation

Here we review point estimation of parameters, first with an overview of the frequentist approach and its two most important methods, maximum likelihood and least squares, treated in Sections 39.2.2 and 39.2.3. The Bayesian approach is outlined in Sec. 39.2.5.

An estimator θ̂ (written with a hat) is a function of the data used to estimate the value of the parameter θ. Sometimes the word 'estimate' is used to denote the value of the estimator when evaluated with given data. There is no fundamental rule dictating how an estimator must be constructed. One tries, therefore, to choose that estimator which has the best properties. The most important of these are (a) consistency, (b) bias, (c) efficiency, and (d) robustness.

(a) An estimator is said to be consistent if the estimate θ̂ converges in probability (see Ref. [3]) to the true value θ as the amount of data increases. This property is so important that it is possessed by all commonly used estimators.

(b) The bias, b = E[θ̂] − θ, is the difference between the expectation value of the estimator and the true value of the parameter. The expectation value is taken over a hypothetical set of similar experiments in which θ̂ is constructed in the same way. When b = 0, the estimator is said to be unbiased. The bias depends on the chosen metric, i.e., if θ̂ is an unbiased estimator of θ, then θ̂² is not in general an unbiased estimator for θ².

(c) Efficiency is the ratio of the minimum possible variance for any estimator of θ to the variance V[θ̂] of the estimator θ̂. For the case of a single parameter, under rather general conditions the minimum variance is given by the Rao-Cramér-Fréchet bound,

\sigma^2_{\rm min} = \left( 1 + \frac{\partial b}{\partial\theta} \right)^2 / I(\theta) ,   (39.2)

where

I(\theta) = E\left[ \left( \frac{\partial \ln L}{\partial\theta} \right)^2 \right] = -E\left[ \frac{\partial^2 \ln L}{\partial\theta^2} \right]   (39.3)

is the Fisher information, L is the likelihood, and the operator E[·] in (39.3) is the expectation value with respect to the data. For the final equality to hold, the range of allowed data values must not depend on θ.

The mean-squared error,

{\rm MSE} = E[(\hat{\theta} - \theta)^2] = V[\hat{\theta}] + b^2 ,   (39.4)

is a measure of an estimator's quality which combines bias and variance.

(d) Robustness is the property of being insensitive to departures from assumptions in the p.d.f., e.g., owing to uncertainties in the distribution's tails.

It is not in general possible to optimize simultaneously for all the measures of estimator quality described above. For some common estimators, the properties above are known exactly. More generally, it is possible to evaluate them by Monte Carlo simulation. Note that they will in general depend on the unknown θ.
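As a rough illustration (not part of the original review), the following Python sketch evaluates the bias, variance, and MSE of Eq. (39.4) by Monte Carlo for two variance estimators of a Gaussian sample; the sample size and parameter values are arbitrary.

```python
# Illustrative sketch: Monte Carlo study of bias, variance, and MSE (Eq. 39.4)
# for the ML variance estimator of a Gaussian (divides by n, hence biased),
# compared with the unbiased estimator of Eq. (39.6) (divides by n-1).
import numpy as np

rng = np.random.default_rng(seed=1)
n, n_exp = 10, 100000           # sample size, number of simulated experiments
mu_true, sigma2_true = 0.0, 4.0

x = rng.normal(mu_true, np.sqrt(sigma2_true), size=(n_exp, n))
s2_ml = x.var(axis=1, ddof=0)   # ML estimator, 1/n
s2_ub = x.var(axis=1, ddof=1)   # Eq. (39.6), 1/(n-1)

for name, est in [("ML (1/n)", s2_ml), ("unbiased (1/(n-1))", s2_ub)]:
    bias = est.mean() - sigma2_true
    var = est.var()
    mse = ((est - sigma2_true) ** 2).mean()   # equals var + bias^2
    print(f"{name}: bias={bias:.3f}  variance={var:.3f}  MSE={mse:.3f}")
```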


39.2.1 Estimators for mean, variance, and median

Suppose we have a set of n independent measurements, x₁, . . . , xₙ, each assumed to follow a p.d.f. with unknown mean µ and unknown variance σ² (the measurements do not necessarily have to follow a Gaussian distribution). Then

\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i   (39.5)

\widehat{\sigma^2} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \hat{\mu})^2   (39.6)

are unbiased estimators of µ and σ². The variance of µ̂ is σ²/n and the variance of σ̂² is

V\left[\widehat{\sigma^2}\right] = \frac{1}{n} \left( m_4 - \frac{n-3}{n-1}\,\sigma^4 \right) ,   (39.7)

where m₄ is the 4th central moment of x (see Eq. (38.8)).

For Gaussian distributed x_i, this becomes 2σ⁴/(n − 1) for any n ≥ 2, and for large n the standard deviation of σ̂ is σ/√(2n). For any n and Gaussian x_i, µ̂ is an efficient estimator for µ, and the estimators µ̂ and σ̂² are uncorrelated. Otherwise the mean (39.5) is not necessarily the most efficient estimator; this is discussed further in Sec. 8.7 of Ref. [4].

If σ² is known, it does not improve the estimate µ̂, as can be seen from Eq. (39.5); however, if µ is known, one can substitute it for µ̂ in Eq. (39.6) and replace n − 1 by n to obtain an estimator of σ² still with zero bias but smaller variance. If the x_i have different, known variances σ_i², then the weighted average

\hat{\mu} = \frac{1}{w} \sum_{i=1}^{n} w_i x_i ,   (39.8)

where w_i = 1/σ_i² and w = Σ_i w_i, is an unbiased estimator for µ with a smaller variance than an unweighted average. The standard deviation of µ̂ is 1/√w.

As an estimator for the median x_med, one can use the value x̂_med such that half the x_i are below and half above (the sample median). If there are an even number of observations and the sample median lies between two observed values, the estimator is set by convention to their arithmetic average. If the p.d.f. of x has the form f(x − µ) and µ is both mean and median, then for large n the variance of the sample median approaches 1/[4nf²(0)], provided f(0) > 0. Although estimating the median can often be more difficult computationally than the mean, the resulting estimator is generally more robust, as it is insensitive to the exact shape of the tails of a distribution.
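A minimal sketch (not from the original text) of the estimators of Eqs. (39.5)–(39.8) and the sample median for a toy data set; the data values and assumed uncertainties are invented for illustration.

```python
# Illustrative sketch: the estimators of Eqs. (39.5)-(39.8) for toy data.
import numpy as np

rng = np.random.default_rng(seed=2)
x = rng.normal(10.0, 2.0, size=50)          # 50 measurements
sigma_i = np.full(50, 2.0)                  # assumed known uncertainties

mu_hat = x.mean()                           # Eq. (39.5)
var_hat = x.var(ddof=1)                     # Eq. (39.6), 1/(n-1) convention
med_hat = np.median(x)                      # sample median

w = 1.0 / sigma_i**2                        # weights w_i = 1/sigma_i^2
mu_w = np.sum(w * x) / np.sum(w)            # weighted average, Eq. (39.8)
sigma_mu_w = 1.0 / np.sqrt(np.sum(w))       # its standard deviation

print(mu_hat, var_hat, med_hat, mu_w, sigma_mu_w)
```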

39.2.2 The method of maximum likelihood

Suppose we have a set of measured quantities x and the likelihood L(θ) = P(x|θ) for a set of parameters θ = (θ₁, . . . , θ_N). The maximum likelihood (ML) estimators for θ are defined as the values that give the maximum of L. Because of the properties of the logarithm, it is usually easier to work with ln L, and since both are maximized for the same parameter values θ, the ML estimators can be found by solving the likelihood equations,

\frac{\partial \ln L}{\partial\theta_i} = 0 , \quad i = 1, \ldots, N .   (39.9)

Often the solution must be found numerically. Maximum likelihood estimators are important because they are asymptotically (i.e., for large data samples) unbiased, efficient and have a Gaussian distribution under quite general conditions, and the method has a wide range of applicability.

In general the likelihood function is obtained from the probability of the data under assumption of the parameters. An important special case is when the data consist of i.i.d. (independent and identically distributed) values. Here one has a set of n statistically independent quantities x = (x₁, . . . , xₙ), where each component follows the same p.d.f. f(x; θ). In this case the joint p.d.f. of the data sample factorizes and the likelihood function is

L(\theta) = \prod_{i=1}^{n} f(x_i; \theta) .   (39.10)

In this case the number of events n is regarded as fixed. If however the probability to observe n events itself depends on the parameters θ, then this dependence should be included in the likelihood. For example, if n follows a Poisson distribution with mean µ and the independent x values all follow f(x; θ), then the likelihood becomes

L(\theta) = \frac{\mu^n}{n!} e^{-\mu} \prod_{i=1}^{n} f(x_i; \theta) .   (39.11)

Equation (39.11) is often called the extended likelihood (see, e.g., Refs. [5–7]). If µ is given as a function of θ, then including the probability for n given θ in the likelihood provides additional information about the parameters. This therefore leads to a reduction in their statistical uncertainties and in general changes their estimated values.

In evaluating the likelihood function, it is important that any normalization factors in the p.d.f. that involve θ be included. However, we will only be interested in the maximum of L and in ratios of L at different values of the parameters; hence any multiplicative factors that do not involve the parameters that we want to estimate may be dropped, including factors that depend on the data but not on θ.

Under a one-to-one change of parameters from θ to η, the ML estimators θ̂ transform to η(θ̂). That is, the ML solution is invariant under change of parameter. However, other properties of ML estimators, in particular the bias, are not invariant under change of parameter.
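A minimal sketch (not from the original text) of an unbinned extended ML fit based on Eq. (39.11), using a Poisson-distributed event count and an exponential p.d.f.; the parameter names mu and tau are illustrative.

```python
# Illustrative sketch: unbinned extended ML fit, Eq. (39.11), with
# n ~ Poisson(mu) and each x_i following an exponential with mean tau.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(seed=3)
mu_true, tau_true = 100.0, 2.0
n = rng.poisson(mu_true)
x = rng.exponential(tau_true, size=n)

def neg_log_L(par):
    mu, tau = par
    if mu <= 0 or tau <= 0:
        return np.inf
    # -ln L from Eq. (39.11), dropping the constant ln(n!) term
    return mu - n * np.log(mu) + np.sum(np.log(tau) + x / tau)

res = minimize(neg_log_L, x0=[50.0, 1.0], method="Nelder-Mead")
mu_hat, tau_hat = res.x
print(f"mu_hat = {mu_hat:.1f}, tau_hat = {tau_hat:.2f}")
```

Here the extended ML estimators are µ̂ = n and τ̂ = x̄, which the numerical minimization reproduces.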

The inverse V⁻¹ of the covariance matrix V_ij = cov[θ̂_i, θ̂_j] for a set of ML estimators can be estimated by using

(\widehat{V}^{-1})_{ij} = -\left. \frac{\partial^2 \ln L}{\partial\theta_i \, \partial\theta_j} \right|_{\hat{\theta}} .   (39.12)

For finite samples, however, Eq. (39.12) can result in a misestimation of the variances. In the large sample limit (or in a linear model with Gaussian data), L has a Gaussian form and ln L is (hyper)parabolic. In this case, s times the standard deviations σ_i of the estimators for the parameters can be obtained from the hypersurface defined by the θ such that

\ln L(\theta) = \ln L_{\rm max} - s^2/2 ,   (39.13)

where ln L_max is the value of ln L at the solution point (compare with Eq. (39.73)). The minimum and maximum values of θ_i on the hypersurface then give an approximate s-standard deviation confidence interval for θ_i (see Section 39.4.2.2).

39.2.2.1 ML with binned data

If the total number of data values x_i, (i = 1, . . . , n_tot), is small, the unbinned maximum likelihood method, i.e., use of Equation (39.10) (or (39.11) for extended ML), is preferred since binning can only result in a loss of information, and hence larger statistical errors for the parameter estimates. If the sample is large, it can be convenient to bin the values in a histogram with N bins, so that one obtains a vector of data n = (n₁, . . . , n_N) with expectation values µ = E[n] and probabilities f(n; µ). Suppose the mean values µ can be determined as a function of a set of parameters θ. Then one may maximize the likelihood function based on the contents of the bins.

As mentioned in Sec. 39.2.2, the total number of events n_tot = Σ_i n_i can be regarded either as fixed or as a random variable. If it is fixed, the histogram follows a multinomial distribution,

f_{\rm M}(\mathbf{n}; \theta) = \frac{n_{\rm tot}!}{n_1! \cdots n_N!}\, p_1^{n_1} \cdots p_N^{n_N} ,   (39.14)

where we assume the probabilities p_i are given functions of the parameters θ. The distribution can be written equivalently in terms of the expected number of events in each bin, µ_i = n_tot p_i. If the n_i are regarded as independent and Poisson distributed, then the data are instead described by a product of Poisson probabilities,

f_{\rm P}(\mathbf{n}; \theta) = \prod_{i=1}^{N} \frac{\mu_i^{n_i}}{n_i!} e^{-\mu_i} ,   (39.15)

where the mean values µ_i are given functions of θ. The total number of events n_tot thus follows a Poisson distribution with mean µ_tot = Σ_i µ_i.

When using maximum likelihood with binned data, one can find the ML estimators and at the same time obtain a statistic usable for a test of goodness-of-fit (see Sec. 39.3.2). Maximizing the likelihood L(θ) = f_M/P(n; θ) is equivalent to maximizing the likelihood ratio λ(θ) = f_M/P(n; θ)/f(n; µ̂), where in the denominator f(n; µ) is a model with an adjustable parameter for each bin, µ = (µ₁, . . . , µ_N), and the corresponding estimators are µ̂ = (n₁, . . . , n_N) (called the "saturated model"). Equivalently one often minimizes the quantity −2 ln λ(θ). For independent Poisson distributed n_i this is [8]

-2 \ln \lambda(\theta) = 2 \sum_{i=1}^{N} \left[ \mu_i(\theta) - n_i + n_i \ln \frac{n_i}{\mu_i(\theta)} \right] ,   (39.16)

where for bins with n_i = 0, the last term in (39.16) is zero. The expression (39.16) without the terms µ_i − n_i also gives −2 ln λ(θ) for multinomially distributed n_i, i.e., when the total number of entries is regarded as fixed. In the limit of zero bin width, minimizing (39.16) is equivalent to maximizing the unbinned extended likelihood function (39.11); in the corresponding multinomial case without the µ_i − n_i terms one obtains Eq. (39.10).

A smaller value of −2 ln λ(θ̂) corresponds to better agreement between the data and the hypothesized form of µ(θ). The value of −2 ln λ(θ̂) can thus be translated into a p-value as a measure of goodness-of-fit, as described in Sec. 39.3.2. Assuming the model is correct, then according to Wilks' theorem [9], for sufficiently large µ_i and provided certain regularity conditions are met, the minimum of −2 ln λ as defined by Eq. (39.16) follows a χ² distribution (see, e.g., Ref. [8]). If there are N bins and m fitted parameters, then the number of degrees of freedom for the χ² distribution is N − m if the data are treated as Poisson-distributed, and N − m − 1 if the n_i are multinomially distributed.

Suppose the n_i are Poisson-distributed and the overall normalization µ_tot = Σ_i µ_i is taken as an adjustable parameter, so that µ_i = µ_tot p_i(θ), where the probability to be in the ith bin, p_i(θ), does not depend on µ_tot. Then by minimizing Eq. (39.16), one obtains that the area under the fitted function is equal to the sum of the histogram contents, i.e., Σ_i µ̂_i = Σ_i n_i. This is a property not possessed by the estimators from the method of least squares (see, e.g., Sec. 39.2.3 and Ref. [7]).
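A rough sketch (not from the original text) of a binned ML fit by minimizing Eq. (39.16), with µ_i = µ_tot p_i(θ) for an exponential shape; the bin layout and parameter names are illustrative. For this parametrization the fitted µ̂_tot reproduces the total observed number of entries, illustrating the property noted above.

```python
# Illustrative sketch: binned ML fit by minimizing -2 ln lambda, Eq. (39.16),
# for a histogram with mu_i(theta) = mu_tot * p_i(tau) (exponential shape).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(seed=4)
edges = np.linspace(0.0, 5.0, 11)                 # 10 bins on [0, 5]
x = rng.exponential(1.5, size=rng.poisson(200))
n_obs, _ = np.histogram(x, bins=edges)

def mu_bins(par):
    mu_tot, tau = par
    # expected contents: integral of the exponential p.d.f. over each bin,
    # normalized to the fit range
    cdf = 1.0 - np.exp(-edges / tau)
    p = np.diff(cdf) / cdf[-1]
    return mu_tot * p

def neg2_log_lambda(par):
    if np.any(np.asarray(par) <= 0):
        return np.inf
    mu = mu_bins(par)
    # Eq. (39.16); the n_i * ln(n_i/mu_i) term is zero for empty bins
    t = np.where(n_obs > 0,
                 n_obs * np.log(np.where(n_obs > 0, n_obs, 1) / mu), 0.0)
    return 2.0 * np.sum(mu - n_obs + t)

res = minimize(neg2_log_lambda, x0=[150.0, 1.0], method="Nelder-Mead")
print("mu_tot_hat, tau_hat =", res.x, " min -2ln(lambda) =", res.fun)
```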


39.2.2.2 Frequentist treatment of nuisance parameters

Suppose we want to determine the values of parameters θ using a set of measurements x described by a probability model P_x(x|θ). In general the model is not perfect, which is to say it cannot provide an accurate description of the data even at the most optimal point of its parameter space. As a result, the estimated parameters can have a systematic bias.

One can improve the model by including in it additional parameters. That is, P_x(x|θ) is replaced by a more general model P_x(x|θ, ν), which depends on parameters of interest θ and nuisance parameters ν. The additional parameters are not of intrinsic interest but must be included for the model to be accurate for some point in the enlarged parameter space.

Although including additional parameters may eliminate or at least reduce the effect of systematic uncertainties, their presence will result in increased statistical uncertainties for the parameters of interest. This occurs because the estimators for the nuisance parameters and those of interest will in general be correlated, which results in an enlargement of the contour defined by Eq. (39.13).

To reduce the impact of the nuisance parameters one often tries to constrain their values by means of control or calibration measurements, say, having data y. For example, some components of y could represent estimates of the nuisance parameters, often from separate experiments. Suppose the measurements y are statistically independent from x and are described by a model P_y(y|ν). The joint model for both x and y is in this case therefore the product of the probabilities for x and y, and thus the likelihood function for the full set of parameters is

L(\theta, \nu) = P_x(x|\theta, \nu)\, P_y(y|\nu) .   (39.17)

Note that in this case if one wants to simulate the experiment by means of Monte Carlo, both the primary and control measurements, x and y, must be generated for each repetition under assumption of fixed values for the parameters θ and ν.

Using all of the parameters (θ, ν) in Eq. (39.13) to find the statistical errors in the parameters of interest θ is equivalent to using the profile likelihood, which depends only on θ. It is defined as

L_{\rm p}(\theta) = L(\theta, \hat{\hat{\nu}}(\theta)) ,   (39.18)

where the double-hat notation indicates the profiled values of the parameters ν, defined as the values that maximize L for the specified θ. The profile likelihood is discussed further in Section 39.3.2.1 in connection with hypothesis tests.
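A minimal sketch (not from the original text) of the profile likelihood, Eq. (39.18), for a counting experiment with a control measurement as in Eq. (39.17); the counts and scale factor are invented for illustration.

```python
# Illustrative sketch: profile likelihood, Eq. (39.18), for x ~ Poisson(s + b)
# with a control measurement y ~ Poisson(tau*b) constraining the nuisance b.
import numpy as np
from scipy.optimize import minimize_scalar

x_obs, y_obs, tau = 15, 20, 2.0   # main count, control count, scale factor

def neg_log_L(s, b):
    # -ln L(s, b) from Eq. (39.17), dropping factorial constants
    return (s + b) - x_obs * np.log(s + b) + tau * b - y_obs * np.log(tau * b)

def neg_log_Lp(s):
    # profile over b: minimize -ln L at fixed s to get the double-hat b(s)
    res = minimize_scalar(lambda b: neg_log_L(s, b),
                          bounds=(1e-6, 100.0), method="bounded")
    return res.fun

for s in (0.0, 2.0, 5.0, 10.0):
    print(f"s = {s:4.1f}:  -ln L_p(s) = {neg_log_Lp(s):.3f}")
```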

39.2.3 The method of least squares

The method of least squares (LS) coincides with the method of maximum likelihood in the following special case. Consider a set of N independent measurements y_i at known points x_i. The measurement y_i is assumed to be Gaussian distributed with mean µ(x_i; θ) and known variance σ_i². The goal is to construct estimators for the unknown parameters θ. The log-likelihood function contains the sum of squares

\chi^2(\theta) = -2 \ln L(\theta) + {\rm constant} = \sum_{i=1}^{N} \frac{(y_i - \mu(x_i; \theta))^2}{\sigma_i^2} .   (39.19)

The parameter values that maximize L are the same as those which minimize χ².

The minimum of the chi-square function in Equation (39.19) defines the least-squares estimators θ̂ for the more general case where the y_i are not Gaussian distributed as long as they are independent. If they are not independent but rather have a covariance matrix V_ij = cov[y_i, y_j], then the LS estimators are determined by the minimum of

\chi^2(\theta) = (\mathbf{y} - \boldsymbol{\mu}(\theta))^T V^{-1} (\mathbf{y} - \boldsymbol{\mu}(\theta)) ,   (39.20)


where y = (y₁, . . . , y_N) is the (column) vector of measurements, µ(θ) is the corresponding vector of predicted values, and the superscript T denotes the transpose. If the y_i are not Gaussian distributed, then the LS and ML estimators will not in general coincide.

Often one further restricts the problem to the case where µ(x_i; θ) is a linear function of the parameters, i.e.,

\mu(x_i; \theta) = \sum_{j=1}^{m} \theta_j h_j(x_i) .   (39.21)

Here the h_j(x) are m linearly independent functions, e.g., 1, x, x², . . . , x^{m−1} or Legendre polynomials. We require m < N and at least m of the x_i must be distinct. Minimizing χ² in this case with m parameters reduces to solving a system of m linear equations. Defining H_ij = h_j(x_i) and minimizing χ² by setting its derivatives with respect to the θ_i equal to zero gives the LS estimators,

\hat{\theta} = (H^T V^{-1} H)^{-1} H^T V^{-1} \mathbf{y} \equiv D \mathbf{y} .   (39.22)

The covariance matrix for the estimators U_ij = cov[θ̂_i, θ̂_j] is given by

U = D V D^T = (H^T V^{-1} H)^{-1} ,   (39.23)

or equivalently, its inverse U⁻¹ can be found from

(U^{-1})_{ij} = \frac{1}{2} \left. \frac{\partial^2 \chi^2}{\partial\theta_i \, \partial\theta_j} \right|_{\theta = \hat{\theta}} = \sum_{k,l=1}^{N} h_i(x_k) (V^{-1})_{kl} h_j(x_l) .   (39.24)

The LS estimators can also be found from the expression

\hat{\theta} = U \mathbf{g} ,   (39.25)

where the vector g is defined by

g_i = \sum_{j,k=1}^{N} y_j h_i(x_k) (V^{-1})_{jk} .   (39.26)

For the case of uncorrelated y_i, for example, one can use (39.25) with

(U^{-1})_{ij} = \sum_{k=1}^{N} \frac{h_i(x_k)\, h_j(x_k)}{\sigma_k^2} ,   (39.27)

g_i = \sum_{k=1}^{N} \frac{y_k\, h_i(x_k)}{\sigma_k^2} .   (39.28)
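A minimal sketch (not from the original text) of the closed-form linear LS solution of Eqs. (39.22)–(39.23) for a straight-line fit; the data values are invented for illustration.

```python
# Illustrative sketch: linear least squares, Eqs. (39.22)-(39.23),
# for a straight-line fit mu(x) = theta_0 + theta_1 * x.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])         # measurements
sigma = np.full(5, 0.3)                          # known uncertainties
V = np.diag(sigma**2)                            # covariance matrix of the y_i

H = np.column_stack([np.ones_like(x), x])        # H_ij = h_j(x_i), h = (1, x)
Vinv = np.linalg.inv(V)
U = np.linalg.inv(H.T @ Vinv @ H)                # Eq. (39.23)
theta_hat = U @ H.T @ Vinv @ y                   # Eq. (39.22)

print("theta_hat =", theta_hat)
print("std devs  =", np.sqrt(np.diag(U)))
```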

Expanding χ²(θ) about θ̂, one finds that the contour in parameter space defined by

\chi^2(\theta) = \chi^2(\hat{\theta}) + 1 = \chi^2_{\rm min} + 1   (39.29)

has tangent planes located at plus-or-minus-one standard deviation σ_θ̂ from the LS estimates θ̂ (the relation is approximate if the fit function µ(x; θ) is nonlinear in the parameters).

In constructing the quantity χ²(θ) one requires the variances or, in the case of correlated measurements, the covariance matrix. Often these quantities are not known a priori and must be

estimated from the data. In this case the least-squares and maximum-likelihood methods are no longer exactly equivalent even for Gaussian distributed measurements.

An important example is where the measured value y_i represents the event count in a histogram bin. If, for example, y_i represents a Poisson variable, for which the variance is equal to the mean, then one can either estimate the variance from the predicted value, µ(x_i; θ), or from the observed number itself, y_i. In the first option, the variances become functions of the parameters, and as a result the estimators may need to be found numerically. The second option can be undefined if y_i is zero, and for small y_i, the variance will be poorly estimated. In either case, one should constrain the normalization of the fitted curve to the correct value, i.e., one should determine the area under the fitted curve directly from the number of entries in the histogram (see Ref. [7], Section 7.4). As noted in Sec. 39.2.2.1, this issue is avoided when using the method of extended maximum likelihood with binned data by minimizing Eq. (39.16). In that case if the expected number of events µ_tot does not depend on the other fitted parameters θ, then its extended ML estimator is equal to the observed total number of events.

As the minimum value of the χ² represents the level of agreement between the measurements and the fitted function, it can be used for assessing the goodness-of-fit; this is discussed further in Section 39.3.2.

39.2.4 Parameter estimation with constraints

In some applications one is interested in using a set of measured quantities y = (y₁, . . . , y_N) to estimate a set of parameters θ = (θ₁, . . . , θ_M) subject to a number of constraints. For example, one may have measured coordinates from two tracks, and one wishes to estimate their momentum vectors subject to the constraint that the tracks have a common vertex. The parameters can also include momenta of undetected particles such as neutrinos, as long as the constraints from conservation of energy and momentum and from known masses of particles involved in the reaction chain provide enough information for these quantities to be inferred.

A set of K constraints can be given in the form of equations

c_k(\theta) = 0 , \quad k = 1, \ldots, K .   (39.30)

In some problems it may be possible to define a new set of parameters η = (η₁, . . . , η_L) with L = M − K such that every point in η-space automatically satisfies the constraints. If this is possible then the problem reduces to one of estimating η with, e.g., maximum likelihood or least squares and then transforming the estimators back into θ-space. In many cases it may be difficult or impossible to find an appropriate transformation η(θ).

Suppose that the parameters are determined through minimizing an objective function such as χ²(θ) in the method of least squares. Here one may enforce the constraints by finding the stationary points of the Lagrange function

\mathcal{L}(\theta, \lambda, \mathbf{y}) = \chi^2(\theta, \mathbf{y}) + \sum_{k=1}^{K} \lambda_k c_k(\theta)   (39.31)

with respect to both the parameters θ and a set of Lagrange multipliers λ = (λ₁, . . . , λ_K). Combining the parameters and Lagrange multipliers into an (M+K)-component vector γ = (θ₁, . . . , θ_M, λ₁, . . . , λ_K), the solutions for γ, i.e., the estimators γ̂, are found (e.g., numerically) from the system of equations

F_i(\gamma, \mathbf{y}) \equiv \frac{\partial \mathcal{L}}{\partial\gamma_i} = 0 , \quad i = 1, \ldots, M + K .   (39.32)

To obtain the covariance matrix of the estimated parameters one can find solutions γ̃ corresponding to the expectation values of the data ⟨y⟩ and expand F_i(γ̂, y) to first order about these values.


This gives (see, e.g., Sec. 11.6 of Ref. [7]) linearized approximations for the estimators, γ̂(y) ≈ γ̃ + C(y − ⟨y⟩), where the matrix C = −A⁻¹B, and A and B are given by

A_{ij} = \left. \frac{\partial F_i}{\partial\gamma_j} \right|_{\tilde{\gamma}, \langle\mathbf{y}\rangle} \quad {\rm and} \quad B_{ij} = \left. \frac{\partial F_i}{\partial y_j} \right|_{\tilde{\gamma}, \langle\mathbf{y}\rangle} .   (39.33)

In practice the values ⟨y⟩ and corresponding solutions γ̃ are estimated using the data from the actual measurement. Using this approximation for γ̂(y), one can find the covariance matrix U_ij = cov[γ̂_i, γ̂_j] of the estimators for the γ_i in terms of that of the data V_ij = cov[y_i, y_j] using error propagation (cf. Eqs. (39.42) and (39.43)),

U = C V C^T .   (39.34)

The upper-left M × M block of the matrix U gives the covariance matrix for the estimated parameters cov[θ̂_i, θ̂_j]. One can show for linear constraints that cov[θ̂_i, θ̂_j] is also given by the upper-left M × M block of 2A⁻¹. If the parameters are estimated using the method of least squares, then the number of degrees of freedom for the distribution of the minimized χ² is increased by the number of constraints, i.e., it becomes N − M + K. Further details can be found in, e.g., Ch. 8 of Ref. [4] and Ch. 7 of Ref. [10].
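A rough sketch (not from the original text) of a constrained fit via the stationary points of the Lagrange function, Eqs. (39.31)–(39.32), for a minimal invented example: two measured fractions constrained to sum to one.

```python
# Illustrative sketch: constrained LS via Eqs. (39.31)-(39.32) with the single
# constraint c(theta) = theta_1 + theta_2 - 1 = 0.
import numpy as np
from scipy.optimize import fsolve

y = np.array([0.7, 0.4])          # measured values
sigma = np.array([0.1, 0.1])      # their uncertainties

def F(gamma):
    t1, t2, lam = gamma
    # gradient of L = chi^2 + lambda * c(theta) w.r.t. (theta_1, theta_2, lambda)
    return [2 * (t1 - y[0]) / sigma[0]**2 + lam,
            2 * (t2 - y[1]) / sigma[1]**2 + lam,
            t1 + t2 - 1.0]

t1, t2, lam = fsolve(F, x0=[0.5, 0.5, 0.0])
print(f"theta_hat = ({t1:.3f}, {t2:.3f}), lambda = {lam:.3f}")
```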

39.2.5 The Bayesian approach

In the frequentist methods discussed above, probability is associated only with data, not with the value of a parameter. This is no longer the case in Bayesian statistics, however, which we introduce in this section. For general introductions to Bayesian statistics see, e.g., Refs. [11–14].

Suppose the outcome of an experiment is characterized by a vector of data x, whose probability distribution depends on an unknown parameter (or parameters) θ that we wish to determine. In Bayesian statistics, all knowledge about θ is summarized by the posterior p.d.f. p(θ|x), whose integral over any given region gives the degree of belief for θ to take on values in that region, given the data x. It is obtained by using Bayes' theorem,

p(\theta|x) = \frac{P(x|\theta)\,\pi(\theta)}{\int P(x|\theta')\,\pi(\theta')\,d\theta'} ,   (39.35)

where P(x|θ) is the likelihood function, i.e., the joint p.d.f. for the data viewed as a function of θ, evaluated with the data actually obtained in the experiment, and π(θ) is the prior p.d.f. for θ. Note that the denominator in Eq. (39.35) serves to normalize the posterior p.d.f. to unity.

As it can be difficult to report the full posterior p.d.f. p(θ|x), one would usually summarize it with statistics such as the mean (or median) value, and covariance matrix. In addition one may construct intervals with a given probability content, as is discussed in Sec. 39.4.1 on Bayesian interval estimation.

39.2.5.1 Priors

Bayesian statistics supplies no unique rule for determining the prior π(θ); this reflects the analyst's subjective degree of belief (or state of knowledge) about θ before the measurement was carried out. For the result to be of value to the broader community, whose members may not share these beliefs, it is important to carry out a sensitivity analysis, that is, to show how the result changes under a reasonable variation of the prior probabilities. One might like to construct π(θ) to represent complete ignorance about the parameters by setting it equal to a constant. A problem here is that if the prior p.d.f. is flat in θ, then it is not flat for a nonlinear function of θ, and so a different parametrization of the problem would lead in general to a non-equivalent posterior p.d.f.
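As a simple illustration of Eq. (39.35) with a constant prior (not part of the original review), the posterior for a Poisson mean can be evaluated numerically on a grid; the observed count is invented.

```python
# Illustrative sketch: posterior p.d.f. of Eq. (39.35) on a grid,
# for n ~ Poisson(mu) with a flat (improper but usable) prior in mu.
import numpy as np
from scipy.stats import poisson

n_obs = 7
mu = np.linspace(0.01, 30.0, 3000)            # grid of parameter values
prior = np.ones_like(mu)                       # constant prior
like = poisson.pmf(n_obs, mu)                  # likelihood P(n_obs | mu)

post = like * prior
post /= np.trapz(post, mu)                     # normalize, cf. Eq. (39.35)

mean = np.trapz(mu * post, mu)
cdf = np.cumsum(post) * (mu[1] - mu[0])
median = mu[np.searchsorted(cdf, 0.5)]
print(f"posterior mean = {mean:.2f}, median = {median:.2f}")
```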

For the special case of a constant prior, one can see from Bayes' theorem (39.35) that the posterior is proportional to the likelihood, and therefore the mode (peak position) of the posterior is equal to the ML estimator. The posterior mode, however, will change in general upon a transformation of parameter. One may use as the Bayesian estimator a summary statistic other than the mode, such as the median, which is invariant under parameter transformation. But this will not in general coincide with the ML estimator.

The difficult and subjective nature of encoding personal knowledge into priors has led to what is called objective Bayesian statistics, where prior probabilities are based not on an actual degree of belief but rather derived from formal rules. These give, for example, priors which are invariant under a transformation of parameters, or ones which result in a maximum gain in information for a given set of measurements. For an extensive review see, e.g., Ref. [15]. Objective priors do not in general reflect degree of belief, but they could in some cases be taken as possible, although perhaps extreme, subjective priors. The posterior probabilities as well therefore do not necessarily reflect a degree of belief. However one may regard investigating a variety of objective priors to be an important part of the sensitivity analysis. Furthermore, use of objective priors with Bayes' theorem can be viewed as a recipe for producing estimators or intervals which have desirable frequentist properties.

An important procedure for deriving objective priors is due to Jeffreys. According to Jeffreys' rule one takes the prior as

\pi(\theta) \propto \sqrt{\det(I(\theta))} ,   (39.36)

where

I_{ij}(\theta) = -E\left[ \frac{\partial^2 \ln P(x|\theta)}{\partial\theta_i \, \partial\theta_j} \right]   (39.37)

is the Fisher information matrix. One can show that the Jeffreys prior leads to inference that is invariant under a transformation of parameters. One should note that the Jeffreys prior does not in general correspond to one's degree of belief about the value of a parameter. As examples, the Jeffreys prior for the mean µ of a Gaussian distribution is a constant, and for the mean of a Poisson distribution one finds π(µ) ∝ 1/√µ.

Neither the constant nor 1/√µ priors can be normalized to unit area and are therefore said to be improper. This can be allowed because the prior always appears multiplied by the likelihood function, and if the likelihood falls to zero sufficiently quickly then one may have a normalizable posterior density.

An important type of objective prior is the reference prior due to Bernardo and Berger [16]. To find the reference prior for a given problem one considers the Kullback-Leibler divergence D_n[π, p] of the posterior p(θ|x) relative to a prior π(θ), obtained from a set of i.i.d. data x = (x₁, . . . , xₙ):

D_n[\pi, p] = \int p(\theta|x) \ln \frac{p(\theta|x)}{\pi(\theta)} \, d\theta .   (39.38)

This is effectively a measure of the gain in information provided by the data. The reference prior is chosen so that the expectation value of this information gain is maximized for the limiting case of n → ∞, where the expectation is computed with respect to the marginal distribution of the data,

p(x) = \int p(x|\theta)\,\pi(\theta) \, d\theta .   (39.39)
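The Poisson example for the Jeffreys prior can be checked numerically; a rough sketch (not from the original text), computing the expectation in Eq. (39.37) by summing over the possible data values:

```python
# Illustrative check: Fisher information for a Poisson mean, Eq. (39.37).
# Since d^2/dmu^2 ln P(n|mu) = -n/mu^2, one expects I(mu) = 1/mu,
# giving the Jeffreys prior pi(mu) ∝ 1/sqrt(mu) via Eq. (39.36).
import numpy as np
from scipy.stats import poisson

def fisher_info(mu, n_max=500):
    n = np.arange(n_max)
    return np.sum(poisson.pmf(n, mu) * n / mu**2)

for mu in (1.0, 4.0, 25.0):
    print(f"mu = {mu:5.1f}:  I(mu) = {fisher_info(mu):.4f}  (expect {1/mu:.4f})")
```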

For a single, continuous parameter the reference prior is usually identical to the Jeffreys prior. In the multiparameter case an iterative algorithm exists, which requires sorting the parameters by order of inferential importance. Often the result does not depend on this order, but when it does, this can be part of a sensitivity analysis. Further discussion and applications to particle physics problems can be found in Ref. [17].

39.2.5.2 Bayesian treatment of nuisance parameters

As discussed in Sec. 39.2.2, a model may depend on parameters of interest θ as well as on nuisance parameters ν, which must be included for an accurate description of the data. Knowledge about the values of ν may be supplied by control measurements, theoretical insights, physical constraints, etc. Suppose, for example, one has data y from a control measurement which is characterized by a probability P_y(y|ν). Suppose further that before carrying out the control measurement one's state of knowledge about ν is described by an initial prior π₀(ν), which in practice is often taken to be a constant or in any case very broad. By using Bayes' theorem (39.1) one obtains the updated prior π(ν) (i.e., now π(ν) = π(ν|y), the probability for ν given y),

\pi(\nu|y) \propto P(y|\nu)\,\pi_0(\nu) .   (39.40)

In the absence of a model for P(y|ν) one may make some reasonable but ad hoc choices. For a single nuisance parameter ν, for example, one might characterize the uncertainty by a p.d.f. π(ν) centered about its nominal value with a certain standard deviation σ_ν. Often a Gaussian p.d.f. provides a reasonable model for one's degree of belief about a nuisance parameter; in other cases, more complicated shapes may be appropriate. If, for example, the parameter represents a non-negative quantity then a log-normal or gamma p.d.f. can be a more natural choice than a Gaussian truncated at zero. Note also that truncation of the prior of a nuisance parameter ν at zero will in general make π(ν) nonzero at ν = 0, which can lead to an unnormalizable posterior for a parameter of interest that appears multiplied by ν.

The likelihood function, prior, and posterior p.d.f.s all depend on both θ and ν, and are related by Bayes' theorem, as usual. Note that the likelihood here only refers to the primary measurement x. Once any control measurements y are used to find the updated prior π(ν) for the nuisance parameters, this information is fully encapsulated in π(ν) and the control measurements do not appear further. One can obtain the posterior p.d.f. for θ alone by integrating over the nuisance parameters, i.e.,

p(\theta|x) = \int p(\theta, \nu|x) \, d\nu .   (39.41)

Such integrals can often not be carried out in closed form, and if the number of nuisance parameters is large, then they can be difficult to compute with standard Monte Carlo methods. Markov Chain Monte Carlo (MCMC) techniques are often used for computing integrals of this type (see Sec. 40.5).
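For a single nuisance parameter the integral of Eq. (39.41) can be done directly on a grid; a rough sketch (not from the original text), with all numerical values invented:

```python
# Illustrative sketch: marginalizing a posterior over a nuisance parameter,
# Eq. (39.41), for n ~ Poisson(s + b) with a Gaussian prior on the background b.
import numpy as np
from scipy.stats import poisson, norm

n_obs, b0, sigma_b = 12, 5.0, 1.0
s = np.linspace(0.0, 25.0, 251)[:, None]       # parameter of interest (column)
b = np.linspace(0.01, 10.0, 400)[None, :]      # nuisance parameter (row)

# joint posterior up to normalization: flat prior in s, Gaussian prior in b
joint = poisson.pmf(n_obs, s + b) * norm.pdf(b, b0, sigma_b)
post_s = np.trapz(joint, b[0], axis=1)          # integrate over b, Eq. (39.41)
post_s /= np.trapz(post_s, s[:, 0])             # normalize

print("posterior mean of s =", np.trapz(s[:, 0] * post_s, s[:, 0]))
```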

39.2.6 Propagation of errors

Consider a set of n quantities θ = (θ₁, . . . , θₙ) and a set of m functions η(θ) = (η₁(θ), . . . , η_m(θ)). Suppose we have estimated θ̂ = (θ̂₁, . . . , θ̂ₙ), using, say, maximum-likelihood or least-squares, and we also know or have estimated the covariance matrix V_ij = cov[θ̂_i, θ̂_j]. The goal of error propagation is to determine the covariance matrix for the functions, U_ij = cov[η̂_i, η̂_j], where η̂ = η(θ̂). In particular, the diagonal elements U_ii = V[η̂_i] give the variances. The new covariance matrix can be found by expanding the functions η(θ) about the estimates θ̂ to first order in a Taylor series. Using this one finds

U_{ij} \approx \sum_{k,l} \left. \frac{\partial\eta_i}{\partial\theta_k} \frac{\partial\eta_j}{\partial\theta_l} \right|_{\hat{\theta}} V_{kl} .   (39.42)

This can be written in matrix notation as U ≈ A V Aᵀ where the matrix of derivatives A is

A_{ij} = \left. \frac{\partial\eta_i}{\partial\theta_j} \right|_{\hat{\theta}} ,   (39.43)

and Aᵀ is its transpose. The approximation is exact if η(θ) is linear (it holds, for example, in Equation (39.23)). If this is not the case, the approximation can break down if, for example, η(θ) is significantly nonlinear close to θ̂ in a region of a size comparable to the standard deviations of θ̂.
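A minimal sketch (not from the original text) of error propagation via Eqs. (39.42)–(39.43) for two invented functions of two estimated parameters:

```python
# Illustrative sketch: error propagation, U ≈ A V A^T (Eqs. 39.42-39.43),
# for eta = (theta_0 * theta_1, theta_0 / theta_1).
import numpy as np

theta_hat = np.array([4.0, 2.0])
V = np.array([[0.04, 0.01],        # covariance matrix of the estimators
              [0.01, 0.09]])

t0, t1 = theta_hat
A = np.array([[t1, t0],            # d(t0*t1)/d(t0, t1)
              [1/t1, -t0/t1**2]])  # d(t0/t1)/d(t0, t1)

U = A @ V @ A.T
print("std devs of eta:", np.sqrt(np.diag(U)))
```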

39.3 Statistical tests

In addition to estimating parameters, one often wants to assess the validity of certain statements concerning the data's underlying distribution. Frequentist hypothesis tests, described in Sec. 39.3.1, provide a rule for accepting or rejecting hypotheses depending on the outcome of a measurement. In significance tests, covered in Sec. 39.3.2, one gives the probability to obtain a level of incompatibility with a certain hypothesis that is greater than or equal to the level observed with the actual data. In the Bayesian approach, the corresponding procedure is based fundamentally on the posterior probabilities of the competing hypotheses. In Sec. 39.3.3 we describe a related construct called the Bayes factor, which can be used to quantify the degree to which the data prefer one or another hypothesis.

39.3.1 Hypothesis tests

A frequentist test of a hypothesis (often called the null hypothesis, H₀) is a rule that states for which data values x the hypothesis is rejected. A region of x-space called the critical region, w, is specified such that there is no more than a given probability under H₀, α, called the size or significance level of the test, to find x ∈ w. If the data are discrete, it may not be possible to find a critical region with exact probability content α, and thus we require P(x ∈ w|H₀) ≤ α. If the data are observed in the critical region, H₀ is rejected.

The data x used to construct a test could be, for example, a set of values that characterizes an individual event. In this case the test corresponds to classification as, e.g., signal or background. Alternatively the data could represent a set of values from a collection of events. Often one is interested in knowing whether all of the events are of a certain type (background), or whether the sample contains at least some events of a new type (signal). Here the background-only hypothesis plays the role of H₀, and in the alternative H₁ both signal and background are present. Rejecting H₀ is, from the standpoint of frequentist statistics, the required step to establish discovery of the signal process.

The critical region is not unique. Its choice should take into account the probabilities for the data predicted by some alternative hypothesis (or set of alternatives) H₁. Rejecting H₀ if it is true is called a type-I error, and occurs by construction with probability no greater than α. Not rejecting H₀ if an alternative H₁ is true is called a type-II error, and for a given test this will have a certain probability β = P(x ∉ w|H₁). The quantity 1 − β is called the power of the test of H₀ with respect to the alternative H₁. A strategy for defining the critical region can therefore be to maximize the power with respect to some alternative (or alternatives) given a fixed size α.

To maximize the power of a test of H₀ with respect to the alternative H₁, the Neyman–Pearson lemma states that the critical region w should be chosen such that for all data values x inside w, the likelihood ratio

\lambda(x) = \frac{f(x|H_1)}{f(x|H_0)}   (39.44)

is greater than or equal to a given constant c_α, and everywhere outside the critical region one has λ(x) < c_α, where the value of c_α is determined by the size of the test α. Here H₀ and H₁ must be simple hypotheses, i.e., they should not contain undetermined parameters.

It is convenient to define the test using a scalar function of the data x called a test statistic, t(x), such that the boundary of the critical region is given by a surface of constant t(x). The Neyman–Pearson lemma is equivalent to the statement that the likelihood ratio (39.44) represents the optimal test statistic. It can be difficult in practice, however, to determine λ(x), since this requires knowledge of the joint p.d.f.s f(x|H₀) and f(x|H₁). Often one does not have explicit formulae for these, but rather Monte Carlo models that allow one to generate instances of x that follow the p.d.f.s.

In the case where the likelihood ratio (39.44) cannot be used explicitly, there exist a variety of other multivariate methods for constructing a test statistic that may approach its performance. These are based on machine-learning algorithms that use samples of training data corresponding to the hypotheses in question, often generated from Monte Carlo models. Methods often used in HEP include Fisher Discriminants, Neural Networks, Boosted Decision Trees and Support Vector Machines. Descriptions of these and other methods can be found in Refs. [18–21], in Proceedings of the PHYSTAT conference series [22], and in the HEP Community White Paper [23]. Software for HEP includes the TMVA [24] and scikit-learn [25] packages.

An important issue in constructing a test is the choice of variables that enter into the data vector x. For purposes of classification one may choose, for example, to form certain functions of particle momenta such as, e.g., invariant masses that are felt to be physically meaningful in the context of a particular event type. It may be difficult to know, however, whether there may exist further features that would help distinguish between signal and background. Recently, so-called Deep Neural Networks containing several or more hidden layers have been applied in HEP [26, 27]; these allow one to use directly as inputs the elements of the data vector x (features) that represent lower-level quantities such as individual particle momenta, rather than needing to first construct "by hand" higher level features. Each hidden layer then allows the network to construct significant high-level features in an automatic way.

The multivariate algorithms designed to classify events into signal and background types also form the basis of tests of the hypothesis that a sample of events consists of background only. Such a test can be constructed using the distributions of the test statistic t(x) for event classification obtained from a multivariate algorithm such as a Neural Network output. The distributions p(t|s) and p(t|b) for signal and background events, respectively, are used to construct the likelihood ratio of the signal-plus-background hypothesis relative to that of background only. To the extent that the test statistic t(x) approximates the likelihood ratio (or a monotonic function thereof) for individual events given by (39.44), the resulting test of the background-only hypothesis for the event sample will have maximum power with respect to the signal-plus-background alternative (see Ref. [28]).
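A minimal sketch (not from the original text) of a Neyman–Pearson test based on the likelihood ratio of Eq. (39.44), for two simple Gaussian hypotheses; the size α and distribution parameters are invented, and the cut and power are estimated by Monte Carlo.

```python
# Illustrative sketch: likelihood-ratio test, Eq. (39.44), with
# x ~ N(0,1) under H0 and x ~ N(1,1) under H1.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(seed=5)
alpha = 0.05

def lam(x):
    # likelihood ratio f(x|H1)/f(x|H0); monotonic in x for this example
    return norm.pdf(x, 1, 1) / norm.pdf(x, 0, 1)

x0 = rng.normal(0, 1, 100000)              # data generated under H0
c_alpha = np.quantile(lam(x0), 1 - alpha)  # cut giving size alpha

x1 = rng.normal(1, 1, 100000)              # data generated under H1
power = np.mean(lam(x1) >= c_alpha)        # 1 - beta
print(f"c_alpha = {c_alpha:.3f}, power = {power:.3f}")
```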
39.3.2 Tests of significance (goodness-of-fit)

Often one wants to quantify the level of agreement between the data and a hypothesis without explicit reference to alternative hypotheses. This can be done by defining a statistic t whose value reflects in some way the level of agreement between the data and the hypothesis. The analyst must decide what values of the statistic correspond to better or worse levels of agreement with the hypothesis in question; the choice will in general depend on the relevant alternative hypotheses. The hypothesis in question, H₀, will determine the p.d.f. f(t|H₀) for the statistic.

The significance of a discrepancy between the data and what one expects under the assumption of H₀ is quantified by giving the p-value, defined as the probability to find t in the region of equal or lesser compatibility with H₀ than the level of compatibility observed with the actual data. For example, if t is defined such that large values correspond to poor agreement with the hypothesis, then the p-value would be

p = \int_{t_{\rm obs}}^{\infty} f(t|H_0) \, dt ,   (39.45)

where t_obs is the value of the statistic obtained in the actual experiment. The p-value is a function of the data, and is therefore itself a random variable. If the hypothesis used to compute the p-value is true, then for continuous data p will be uniformly distributed between

zero and one. Note that the p-value is not the probability for the hypothesis; in frequentist statistics, this is not defined. The p-value should not be confused with the size (significance level) of a test, or the confidence level of a confidence interval (Section 39.4), both of which are pre-specified constants. We may formulate a hypothesis test, however, by defining the critical region to correspond to the data outcomes that give the lowest p-values, so that finding p ≤ α implies that the data outcome was in the critical region. When constructing a p-value, one generally chooses the region of data space deemed to have lower compatibility with the model being tested as one having higher compatibility with a given alternative, such that the corresponding test will have a high power with respect to this alternative.

When searching for a new phenomenon, one tries to reject the hypothesis H₀ that the data are consistent with known (e.g., Standard Model) processes. If the p-value of H₀ is sufficiently low, then one is willing to accept that some alternative hypothesis is true. Often one converts the p-value into an equivalent significance Z, defined so that a Z standard deviation upward fluctuation of a Gaussian random variable would have an upper tail area equal to p, i.e.,

Z = \Phi^{-1}(1 - p) .   (39.46)

Here Φ is the cumulative distribution of the standard Gaussian, and Φ⁻¹ is its inverse (quantile) function. Often in HEP the level of significance where an effect is said to qualify as a discovery is Z = 5, i.e., a 5σ effect, corresponding to a p-value of 2.87 × 10⁻⁷.
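The conversion of Eq. (39.46) is a one-liner; a brief sketch (not from the original text):

```python
# Illustrative sketch: converting between a p-value and the equivalent
# Gaussian significance Z of Eq. (39.46).
from scipy.stats import norm

p = 2.87e-7
Z = norm.ppf(1 - p)        # Phi^{-1}(1 - p); gives Z ≈ 5 for this p
print(f"Z = {Z:.2f}")

Z = 5.0
p = norm.sf(Z)             # upper tail area, 1 - Phi(Z)
print(f"p = {p:.3g}")      # ≈ 2.87e-7
```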

One's actual degree of belief that a new process is present, however, will depend in general on other factors as well, such as the plausibility of the new signal hypothesis and the degree to which it can describe the data, one's confidence in the model that led to the observed p-value, and possible corrections for multiple observations out of which one focuses on the smallest p-value obtained (the "look-elsewhere effect", discussed in Section 39.3.2.2).

39.3.2.1 Treatment of nuisance parameters for frequentist tests

Suppose one wants to test hypothetical values of parameters θ, but the model also contains nuisance parameters ν. To find a p-value for θ we can construct a test statistic q_θ such that larger values constitute increasing incompatibility between the data and the hypothesis. Then for an observed value of the statistic q_θ,obs, the p-value of θ is

p_\theta(\nu) = \int_{q_{\theta,{\rm obs}}}^{\infty} f(q_\theta|\theta, \nu) \, dq_\theta ,   (39.47)

which depends in general on the nuisance parameters ν. In the strict frequentist approach, θ is rejected only if the p-value is less than α for all possible values of the nuisance parameters.

The difficulty described above is effectively solved if we can define the test statistic q_θ in such a way that its distribution f(q_θ|θ) is independent of the nuisance parameters. Although exact independence is only found in special cases, it can be achieved approximately by use of the profile likelihood ratio. This is given by the profile likelihood from Eq. (39.18) divided by the value of the likelihood at its maximum, i.e., when evaluated with the ML estimators θ̂ and ν̂:

\lambda_{\rm p}(\theta) = \frac{L(\theta, \hat{\hat{\nu}}(\theta))}{L(\hat{\theta}, \hat{\nu})} .   (39.48)

Wilks' theorem [9] states that, providing certain general conditions are satisfied, the distribution of −2 ln λ_p(θ), under assumption of θ, approaches a χ² distribution in the limit where the data sample is very large, independent of the values of the nuisance parameters ν. Here the number of degrees of freedom is equal to the number of components of θ. More details on use of the profile likelihood are given in Refs. [29, 30] and in contributions to the PHYSTAT conferences [22]; explicit formulae for special cases can be found in Ref. [31]. Further discussion on how to incorporate systematic uncertainties into p-values can be found in Ref. [32].

Even with use of the profile likelihood ratio, for a finite data sample the p-value of hypothesized parameters θ will retain in general some dependence on the nuisance parameters ν. Ideally one would find the maximum of p_θ(ν) from Eq. (39.47) explicitly, but that is often impractical. An approximate and computationally feasible technique is to use the p-value evaluated at the profiled values of the nuisance parameters, \hat{\hat{\nu}}(\theta), as defined in Section 39.2.2.2. The resulting p-value is correct if the true values of the nuisance parameters are equal to the profiled values used; otherwise it could be either too high or too low. This is discussed further in Section 39.4.2 on confidence intervals.

One may also treat model uncertainties in a Bayesian manner but then use the resulting model in a frequentist test. Suppose the uncertainty in a set of nuisance parameters ν is characterized by a Bayesian prior p.d.f. π(ν). This can be used to construct the marginal (also called the prior predictive) model for the data x and parameters of interest θ,

P_{\rm m}(x|\theta) = \int P(x|\theta, \nu)\,\pi(\nu) \, d\nu .   (39.49)

The marginal model does not represent the probability of data that would be generated if one were really to repeat the experiment, as in that case one would assume that the nuisance parameters do not vary. Rather, the marginal model represents a situation in which every repetition of the experiment is carried out with new values of ν, randomly sampled from π(ν). It is in effect an average of models each with a given ν, where the average is carried out with respect to the prior p.d.f. π(ν).

The marginal model for the data x can be used to determine the distribution of a test statistic Q, which can be written

P_{\rm m}(Q|\theta) = \int P(Q|\theta, \nu)\,\pi(\nu) \, d\nu .   (39.50)

In a search for a new signal process, the test statistic can be based on the ratio of likelihoods corresponding to the experiments where signal and background events are both present, L_{s+b}, to that of background only, L_b. Often the likelihoods are evaluated with the profiled values of the nuisance parameters, which may give improved performance. It is important to note, however, that it is through use of the marginal model for the distribution of Q that the uncertainties related to the nuisance parameters are incorporated into the result of the test. Different choices for the test statistic itself only result in variations of the power of the test with respect to different alternatives.

39.3.2.2 The look-elsewhere effect

The "look-elsewhere effect" relates to multiple measurements used to test a single hypothesis. The classic example is when one searches in a distribution for a peak whose position is not predicted in advance. Here the no-peak hypothesis is tested using data in a given range of the distribution. In the frequentist approach the correct p-value of the no-peak hypothesis is the probability, assuming background only, to find a signal as significant as the one found or more so anywhere in the search region. This can be substantially higher than the probability to find a peak of equal or greater significance in the particular place where it appeared.

There is in general some ambiguity as to what constitutes the relevant search region or even the broader set of relevant measurements. Although the desired p-value is well defined once the search region has been fixed, an exact treatment can require extensive computation.

The "brute-force" solution to this problem by Monte Carlo involves generating data under the background-only hypothesis and for each data set, fitting a peak of unknown position and recording

a measure of its significance. To establish a discovery one often requires a p-value smaller than 2.87 × 10⁻⁷, corresponding to a 5σ or larger effect. Determining this with Monte Carlo thus requires generating and fitting a very large number of experiments, perhaps several times 10⁷. In contrast, if the position of the peak is fixed, then the fit to the distribution is much easier, and furthermore one can in many cases use formulae valid for sufficiently large samples that bypass completely the need for Monte Carlo (see, e.g., [31]). However, this fixed-position or "local" p-value would not be correct in general, as it assumes the position of the peak was known in advance.

A method that allows one to modify the local p-value computed under assumption of a fixed position to obtain an approximation to the correct "global" value using a relatively simple calculation is described in Ref. [33]. Suppose a test statistic q₀, defined so that larger values indicate increasing disagreement with the data, is observed to have a value u. Furthermore suppose the model contains a nuisance parameter θ (such as the peak position) which is only defined under the signal model (there is no peak in the background-only model). An approximation for the global p-value is found to be

p_{\rm global} \approx p_{\rm local} + \langle N_u \rangle ,   (39.51)

where ⟨N_u⟩, which is much smaller than one in cases of interest, is the mean number of "upcrossings" of the statistic q₀ above the level u in the range of the nuisance parameter considered (e.g., the mass range).

The value of ⟨N_u⟩ can be estimated from the number of upcrossings ⟨N_{u_0}⟩ above some much lower value, u₀, by using a relation due to Davis [34],

\langle N_u \rangle \approx \langle N_{u_0} \rangle \, e^{-(u - u_0)/2} .   (39.52)

By choosing u₀ sufficiently low, the value of ⟨N_u⟩ can be estimated by simulating only a very small number of experiments, or even from the observed data, rather than the 10⁷ needed if one is dealing with a 5σ effect.

39.3.2.3 Goodness-of-fit with the method of least squares

When estimating parameters using the method of least squares, one obtains the minimum value of the quantity χ² (39.19). This statistic can be used to test the goodness-of-fit, i.e., the test provides a measure of the significance of a discrepancy between the data and the hypothesized functional form used in the fit. It may also happen that no parameters are estimated from the data, but that one simply wants to compare a histogram, e.g., a vector of Poisson distributed numbers n = (n₁, . . . , n_N), with a hypothesis for their expectation values µ_i = E[n_i]. As the distribution is Poisson with variances σ_i² = µ_i, the χ² (39.19) becomes Pearson's χ² statistic,

\chi^2 = \sum_{i=1}^{N} \frac{(n_i - \mu_i)^2}{\mu_i} .   (39.53)

If the hypothesis µ = (µ₁, . . . , µ_N) is correct, and if the expected values µ_i in (39.53) are sufficiently large (or equivalently, if the measurements n_i can be treated as following a Gaussian distribution), then the χ² statistic will follow the χ² p.d.f. with the number of degrees of freedom equal to the number of measurements N minus the number of fitted parameters.

Alternatively, one may fit parameters and evaluate goodness-of-fit by minimizing −2 ln λ from Eq. (39.16). One finds that the distribution of this statistic approaches the asymptotic limit faster than does Pearson's χ². Therefore if one uses the asymptotic χ² p.d.f. as the statistic's approximate distribution to compute a p-value, one obtains in general a more accurate result from −2 ln λ than from Pearson's χ² (see Ref. [8] and references therein).
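A minimal sketch (not from the original text) of Pearson's χ² test, Eq. (39.53), with the p-value computed from the upper tail of the χ² p.d.f. (cf. Eq. (39.54) below); the histogram contents and hypothesis are invented.

```python
# Illustrative sketch: Pearson's chi-square test, Eq. (39.53), for fixed
# expected bin contents mu_i (no parameters fitted from the data).
import numpy as np
from scipy.stats import chi2

n = np.array([12, 20, 31, 18, 9, 5])                  # observed contents
mu = np.array([10.0, 22.0, 30.0, 20.0, 10.0, 4.0])    # hypothesis

chi2_obs = np.sum((n - mu) ** 2 / mu)      # Eq. (39.53)
n_dof = len(n)                             # no parameters estimated
p = chi2.sf(chi2_obs, n_dof)               # upper-tail probability
print(f"chi2 = {chi2_obs:.2f}, ndof = {n_dof}, p-value = {p:.3f}")
```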


Assuming the goodness-of-fit statistic follows a χ2 p.d.f., the p-value for the hypothesis is then Z ∞ p = f(z; nd) dz , (39.54) χ2 2 where f(z; nd) is the χ p.d.f. and nd is the appropriate number of degrees of freedom. Values are shown in Fig. 39.1 or obtained from the ROOT function TMath::Prob. If the conditions for using the χ2 p.d.f. do not hold, the statistic can still be defined as before, but its p.d.f. must be determined by other means in order to obtain the p-value, e.g., using a Monte Carlo calculation. 2 Since the mean of the χ distribution is equal to nd, one expects in a “reasonable” experiment 2 2 2 to obtain χ ≈ nd. Hence the quantity χ /nd is sometimes reported. Since the p.d.f. of χ /nd depends on nd, however, one must report nd as well if one wishes to determine the p-value. The 2 p-values obtained for different values of χ /nd are shown in Fig. 39.2. If the minimized χ2 value indicates a low level of agreement between data and hypothesis, one may be tempted to expect a high degree of uncertainty for any fitted parameters. Poor goodness- of-fit, however, does not mean that one will have large statistical errors for parameter estimates. If, for example, the error bars (or covariance matrix) used in constructing the χ2 are underestimated, then this will lead to underestimated statistical errors for the fitted parameters. The standard deviations of estimators that one finds from, say, Eq. (39.13) reflect how widely the estimates would be distributed if one were to repeat the measurement many times, assuming that the hypothesis and measurement errors used in the χ2 are also correct. They do not include the systematic error which may result from an incorrect hypothesis or incorrectly estimated measurement errors in the χ2. 39.3.3 Bayes factors In Bayesian statistics, all of one’s knowledge about a model is contained in its posterior prob- ability, which one obtains using Bayes’ theorem (Eq. (39.35)). Thus one could reject a hypothesis H if its posterior probability P (H|x) is sufficiently small. The difficulty here is that P (H|x) is proportional to the prior probability P (H), and there will not be a consensus about the prior prob- abilities for the existence of new phenomena. Nevertheless one can construct a quantity called the Bayes factor (described below), which can be used to quantify the degree to which the data prefer one hypothesis over another, and is independent of their prior probabilities. Consider two models (hypotheses), Hi and Hj, described by vectors of parameters θi and θj, respectively. Some of the components will be common to both models and others may be distinct. The full prior probability for each model can be written in the form

π(Hi, θi) = P (Hi)π(θi|Hi) . (39.55)

Here P(H_i) is the overall prior probability for H_i, and π(θ_i|H_i) is the normalized p.d.f. of its parameters. For each model, the posterior probability is found using Bayes' theorem,

P(H_i|x) = \frac{\int P(x|\theta_i, H_i)\, P(H_i)\, \pi(\theta_i|H_i)\, d\theta_i}{P(x)} ,   (39.56)

where the integration is carried out over the internal parameters θ_i of the model. The ratio of posterior probabilities for the models is therefore

\frac{P(H_i|x)}{P(H_j|x)} = \frac{\int P(x|\theta_i, H_i)\, \pi(\theta_i|H_i)\, d\theta_i}{\int P(x|\theta_j, H_j)\, \pi(\theta_j|H_j)\, d\theta_j}\, \frac{P(H_i)}{P(H_j)} .   (39.57)

The Bayes factor is defined as

B_{ij} = \frac{\int P(x|\theta_i, H_i)\, \pi(\theta_i|H_i)\, d\theta_i}{\int P(x|\theta_j, H_j)\, \pi(\theta_j|H_j)\, d\theta_j} .   (39.58)



Figure 39.1: One minus the χ2 cumulative distribution, 1 − F (χ2; n), for n degrees of freedom. This gives the p-value for the χ2 goodness-of-fit test as well as one minus the coverage probability for confidence regions (see Sec. 39.4.2.2).

This gives what the ratio of posterior probabilities for models i and j would be if the overall prior probabilities for the two models were equal. If the models have no nuisance parameters, i.e., no internal parameters described by priors, then the Bayes factor is simply the likelihood ratio. The Bayes factor therefore shows by how much the probability ratio of model i to model j changes in the light of the data, and thus can be viewed as a numerical measure of evidence supplied by the data in favour of one hypothesis over the other.

Although the Bayes factor is by construction independent of the overall prior probabilities P (Hi) and P (Hj), it does require priors for all internal parameters of a model, i.e., one needs the functions π(θi|Hi) and π(θj|Hj). In a Bayesian analysis where one is only interested in the posterior p.d.f. of a parameter, it may be acceptable to take an unnormalizable function for the prior (an improper prior) as long as the product of likelihood and prior can be normalized. But improper priors are only defined up to an arbitrary multiplicative constant, and so the Bayes factor would depend on this constant. Furthermore, although the range of a constant normalized prior is unimportant for parameter determination (provided it is wider than the likelihood), this is not so for the Bayes factor when such a prior is used for only one of the hypotheses. So to compute a Bayes factor, all internal parameters must be described by normalized priors that represent meaningful probabilities over the entire range where they are defined. An exception to this rule may be considered when the identical parameter appears in the models for both numerator and denominator of the Bayes factor. In this case one can argue that the arbitrary constants would cancel. One must exercise some caution, however, as parameters with the same name and physical meaning may still play different roles in the two models.
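As a simple numerical illustration, the integrals in Eq. (39.58) can be evaluated directly for a Poisson counting experiment; the following Python sketch (all numbers hypothetical) compares a signal-plus-background model, with a normalized flat prior for the signal mean s, to the background-only model:

```python
from scipy.stats import poisson
from scipy.integrate import quad

# Hypothetical inputs: n observed events, known background mean b.
n, b = 10, 3.0

# H1: signal + background, with a *normalized* flat prior for s on [0, s_max]
# (an assumption made here purely for illustration).
s_max = 20.0
prior = lambda s: 1.0 / s_max                 # integrates to one

# Marginal likelihoods: numerator and denominator of Eq. (39.58).
m1, _ = quad(lambda s: poisson.pmf(n, s + b) * prior(s), 0.0, s_max)
m0 = poisson.pmf(n, b)                        # H0 has no internal parameters

print(f"B10 = {m1 / m0:.1f}")                 # roughly 61 for these inputs
```

Because the prior used here is proper, the arbitrary-constant ambiguity described above does not arise; with an improper prior the value of B10 would be undefined.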



Figure 39.2: The ‘reduced’ χ2, equal to χ2/n, for n degrees of freedom. The curves show as a function of n the χ2/n that corresponds to a given p-value.

Both integrals in Equation (39.58) are of the form

m = \int P(x|\theta)\, \pi(\theta)\, d\theta ,   (39.59)

which is the marginal likelihood seen previously in Eq. (39.49) (in some fields this quantity is called the evidence). Computing marginal likelihoods can be difficult; in many cases it can be done with the nested sampling algorithm [35] as implemented, e.g., in the program MultiNest [36]. A review of Bayes factors can be found in Ref. [37].

39.4 Intervals and limits

When the goal of an experiment is to determine a parameter θ, the result is usually expressed by quoting, in addition to the point estimate, some sort of interval which reflects the statistical precision of the measurement. In the simplest case, this can be given by the parameter's estimated value θ̂ plus or minus an estimate of the standard deviation of θ̂, σ_θ̂. If, however, the p.d.f. of the estimator is not Gaussian or if there are physical boundaries on the possible values of the parameter, then one usually quotes instead an interval according to one of the procedures described below. In reporting an interval or limit, the experimenter may wish to

• communicate as objectively as possible the result of the experiment;
• provide an interval that is constructed to cover on average the true value of the parameter with a specified probability;


• provide the information needed by the consumer of the result to draw conclusions about the parameter or to make a particular decision;
• draw conclusions about the parameter that incorporate stated prior beliefs.

With a sufficiently large data sample, the point estimate and standard deviation (or for the multiparameter case, the parameter estimates and covariance matrix) satisfy essentially all of these goals. For finite data samples, no single method for quoting an interval will achieve all of them. In addition to the goals listed above, the choice of method may be influenced by practical considerations such as ease of producing an interval from the results of several measurements. Of course the experimenter is not restricted to quoting a single interval or limit; one may choose, for example, first to communicate the result with a confidence interval having certain frequentist properties, and then in addition to draw conclusions about a parameter using a judiciously chosen subjective Bayesian prior. It is recommended, however, that there be a clear separation between these two aspects of reporting a result. In the remainder of this section, we assess the extent to which various types of intervals achieve the goals stated here.

39.4.1 Bayesian intervals

As described in Sec. 39.2.5, a Bayesian posterior probability may be used to determine regions that will have a given probability of containing the true value of a parameter. In the single parameter case, for example, an interval (called a Bayesian or credible interval) [θ_lo, θ_up] can be determined which contains a given fraction 1 − α of the posterior probability, i.e.,

1 - \alpha = \int_{\theta_{\rm lo}}^{\theta_{\rm up}} p(\theta|x)\, d\theta .   (39.60)

Sometimes an upper or lower limit is desired, i.e., θlo or θup can be set to a physical boundary or to plus or minus infinity. In other cases, one might be interested in the set of θ values for which p(θ|x) is higher than for any θ not belonging to the set, which may constitute a single interval or a set of disjoint regions; these are called highest posterior density (HPD) intervals. Note that HPD intervals are not invariant under a nonlinear transformation of the parameter. If a parameter is constrained to be non-negative, then the prior p.d.f. can simply be set to zero for negative values. An important example is the case of a Poisson variable n, which counts signal events with unknown mean s, as well as background with mean b, assumed known. For the signal mean s, one often uses the prior

\pi(s) = \begin{cases} 0 & s < 0 \\ 1 & s \ge 0 \end{cases} .   (39.61)

This prior may be regarded as providing an interval whose frequentist properties can be studied, rather than as representing a degree of belief. For example, to obtain an upper limit on s, one may proceed as follows. The likelihood for s is given by the Poisson distribution for n with mean s + b,

P(n|s) = \frac{(s+b)^n}{n!}\, e^{-(s+b)} ,   (39.62)

which together with the prior (39.61) in (39.35) gives the posterior density for s. An upper limit s_up at confidence level (or here, rather, credibility level) 1 − α can be obtained by requiring

1 - \alpha = \int_{-\infty}^{s_{\rm up}} p(s|n)\, ds = \frac{\int_{-\infty}^{s_{\rm up}} P(n|s)\, \pi(s)\, ds}{\int_{-\infty}^{\infty} P(n|s)\, \pi(s)\, ds} ,   (39.63)

where the lower limit of integration is effectively zero because of the cut-off in π(s). By relating the integrals in Eq. (39.63) to incomplete gamma functions, the solution for the upper limit is found to be

s_{\rm up} = \frac{1}{2} F_{\chi^2}^{-1}[p,\, 2(n+1)] - b ,   (39.64)

where F_{\chi^2}^{-1} is the quantile of the χ2 distribution (inverse of the cumulative distribution). Here the quantity p is

p = 1 - \alpha \left( 1 - F_{\chi^2}[2b,\, 2(n+1)] \right) ,   (39.65)

where F_{\chi^2} is the cumulative χ2 distribution. For both F_{\chi^2} and F_{\chi^2}^{-1} above, the argument 2(n + 1) gives the number of degrees of freedom. For the special case of b = 0, the limit reduces to

s_{\rm up} = \frac{1}{2} F_{\chi^2}^{-1}(1 - \alpha;\, 2(n+1)) .   (39.66)

It happens that for the case of b = 0, the upper limit from Eq. (39.66) coincides numerically with the frequentist upper limit discussed in Section 39.4.2.3. Values for 1 − α = 0.9 and 0.95 are given by the values µ_up in Table 39.3. The frequentist properties of confidence intervals for the Poisson mean found in this way are discussed in Refs. [2] and [38].
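Equations (39.64) and (39.65) are straightforward to evaluate numerically; a minimal Python sketch (with illustrative inputs, not from the text):

```python
from scipy.stats import chi2

def bayesian_upper_limit(n, b, alpha=0.05):
    """Bayesian upper limit on the Poisson signal mean s for a flat prior,
    Eqs. (39.64)-(39.65); n = observed events, b = known background mean."""
    p = 1.0 - alpha * (1.0 - chi2.cdf(2.0 * b, 2 * (n + 1)))
    return 0.5 * chi2.ppf(p, 2 * (n + 1)) - b

# For b = 0 this reduces to Eq. (39.66) and reproduces the mu_up values of
# Table 39.3, e.g. ~3.00 for n = 0 at 95% credibility level.
print(bayesian_upper_limit(0, 0.0))
print(bayesian_upper_limit(3, 1.5))   # hypothetical nonzero background
```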

As in any Bayesian analysis, it is important to show how the result changes under assumption of different prior probabilities. For example, one could consider the Jeffreys prior as described in Sec. 39.2.5. For this problem one finds the Jeffreys prior π(s) ∝ 1/√(s + b) for s ≥ 0 and zero otherwise. As with the constant prior, one would not regard this as representing one's prior beliefs about s, both because it is improper and also as it depends on b. Rather it is used with Bayes' theorem to produce an interval whose frequentist properties can be studied.

If the model contains nuisance parameters then these are eliminated by marginalizing, as in Eq. (39.41), to obtain the p.d.f. for the parameters of interest. For example, if the parameter b in the Poisson counting problem above were to be characterized by a prior p.d.f. π(b), then one would first use Bayes' theorem to find p(s, b|n). This is then marginalized to find p(s|n) = ∫ p(s, b|n) db, from which one may determine an interval for s. One may not be certain whether to extend a model by including more nuisance parameters. In this case, a Bayes factor may be used to determine to what extent the data prefer a model with additional parameters, as described in Section 39.3.3.

39.4.2 Frequentist confidence intervals

The unqualified phrase “confidence intervals” refers to frequentist intervals obtained with a procedure due to Neyman [39], described below. The boundary of the interval (or in the multiparameter case, region) is given by a specific function of the data, which would fluctuate if one were to repeat the experiment many times. The coverage probability refers to the fraction of intervals in such an ensemble that contain the true parameter value. Confidence intervals are constructed so as to have a coverage probability greater than or equal to a given confidence level, regardless of the true parameter's value. It is important to note that in the frequentist approach, such a probability is not meaningful for a fixed interval. In this section we discuss several techniques for producing intervals that have, at least approximately, this property of coverage.

39.4.2.1 The Neyman construction for confidence intervals

Consider a p.d.f. f(x; θ) where x represents the outcome of the experiment and θ is the unknown parameter for which we want to construct a confidence interval. The variable x could (and often does) represent an estimator for θ.
Using f(x; θ) and a pre-defined rule, we can find for every value of θ a set of values x1(θ, α) and x2(θ, α) such that

P(x_1 < x < x_2;\, \theta) = \int_{x_1}^{x_2} f(x; \theta)\, dx \ge 1 - \alpha .   (39.67)


If x is discrete, the integral is replaced by the corresponding sum. In that case there may not exist a range of x values whose summed probability is exactly equal to a given value of 1 − α, and one requires by convention P (x1 < x < x2; θ) ≥ 1 − α. This is illustrated for continuous x in Fig. 39.3: a horizontal line segment [x1(θ, α), x2(θ, α)] is drawn for representative values of θ. The union of such intervals for all values of θ, designated in the figure as D(α), is known as a confidence belt. Typically the curves x1(θ, α) and x2(θ, α) are monotonic functions of θ, which we assume for this discussion.


Figure 39.3: Construction of the confidence belt (see text).

Upon performing an experiment to measure x and obtaining a value x0, one draws a vertical line through x0. The confidence interval for θ is the set of all values of θ for which the corresponding line segment [x1(θ, α), x2(θ, α)] is intercepted by this vertical line. Such confidence intervals are said to have a confidence level (CL) equal to 1 − α.

Now suppose that the true value of θ is θ0, indicated in the figure. We see from the figure that θ0 lies between θ1(x) and θ2(x) if and only if x lies between x1(θ0) and x2(θ0). The two events thus have the same probability, and since this is true for any value θ0, we can drop the subscript 0 and obtain

1 − α = P (x1(θ) < x < x2(θ)) = P (θ2(x) < θ < θ1(x)) . (39.68)

In this probability statement, θ1(x) and θ2(x), i.e., the endpoints of the interval, are the random variables and θ is an unknown constant. If the experiment were to be repeated a large number of times, the interval [θ1, θ2] would vary, covering the fixed value θ in a fraction 1 − α of the experiments.


The condition of coverage in Eq. (39.67) does not determine x1 and x2 uniquely, and additional criteria are needed. One possibility is to choose central intervals such that the probabilities to find x below x1 and above x2 are each α/2. In other cases, one may want to report only an upper or lower limit, in which case one of P(x ≤ x1) or P(x ≥ x2) can be set to α and the other to zero. Another principle based on likelihood ratio ordering for determining which values of x should be included in the confidence belt is discussed below.

When the observed random variable x is continuous, the coverage probability obtained with the Neyman construction is 1 − α, regardless of the true value of the parameter. Because of the requirement P(x1 < x < x2) ≥ 1 − α when x is discrete, one obtains in that case confidence intervals that include the true parameter with a probability greater than or equal to 1 − α.

An equivalent method of constructing confidence intervals is to consider a test (see Sec. 39.3) of the hypothesis that the parameter's true value is θ (assume one constructs a test for all physical values of θ). One then excludes all values of θ where the hypothesis would be rejected in a test of size α or less. The remaining values constitute the confidence interval at confidence level 1 − α. If the critical region of the test is characterized by having a p-value p_θ ≤ α, then the endpoints of the confidence interval are found in practice by solving p_θ = α for θ.

In the procedure outlined above, one is still free to choose the test to be used; this corresponds to the freedom in the Neyman construction as to which values of the data are included in the confidence belt. One possibility is to use a test statistic based on the likelihood ratio,

\lambda(\theta) = \frac{f(x; \theta)}{f(x; \hat{\theta})} ,   (39.69)

where θ̂ is the value of the parameter which, out of all allowed values, maximizes f(x; θ). This results in the intervals described in Ref. [40] by Feldman and Cousins. The same intervals can be obtained from the Neyman construction described above by including in the confidence belt those values of x which give the greatest values of λ(θ).

If the model contains nuisance parameters ν, then these can be incorporated into the test (or the p-values) used to determine the limit by profiling as discussed in Section 39.3.2.1. As mentioned there, the strict frequentist approach is to regard the parameter of interest θ as excluded only if it is rejected for all possible values of ν. The resulting interval for θ will then cover the true value with a probability greater than or equal to the nominal confidence level for all points in ν-space. If the p-value is based on the profiled values of the nuisance parameters, i.e., with ν = ν̂(θ) used in Eq. (39.47), then the resulting interval for the parameter of interest will have the correct coverage if the true values of ν are equal to the profiled values. Otherwise the coverage probability may be too high or too low. This procedure has been called profile construction in HEP [41] (see also [32]).

39.4.2.2 Gaussian distributed measurements

An important example of constructing a confidence interval is when the data consists of a single random variable x that follows a Gaussian distribution; this is often the case when x represents an estimator for a parameter and one has a sufficiently large data sample. If there is more than one parameter being estimated, the multivariate Gaussian is used.
For the univariate case with known σ, the probability that the measured value x will fall within ±δ of the true value µ is

1 - \alpha = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{\mu-\delta}^{\mu+\delta} e^{-(x-\mu)^2/2\sigma^2}\, dx = {\rm erf}\!\left(\frac{\delta}{\sqrt{2}\,\sigma}\right) = 2\Phi\!\left(\frac{\delta}{\sigma}\right) - 1 ,   (39.70)

where erf is the Gaussian error function, which is rewritten in the final equality using Φ, the Gaussian cumulative distribution. Fig. 39.4 shows a δ = 1.64σ confidence interval unshaded. The choice δ = σ gives an interval called the standard error, which has 1 − α = 68.27% if σ is known. Values of α for other frequently used choices of δ are given in Table 39.1.


Figure 39.4: Illustration of a symmetric 90% confidence interval (unshaded) for a Gaussian- distributed measurement of a single quantity. Integrated probabilities, defined by α = 0.1, are as shown.

Table 39.1: Area of the tails α outside ±δ from the mean of a Gaussian distribution.

α            δ      α      δ
0.3173       1σ     0.2    1.28σ
4.55×10⁻²    2σ     0.1    1.64σ
2.7×10⁻³     3σ     0.05   1.96σ
6.3×10⁻⁵     4σ     0.01   2.58σ
5.7×10⁻⁷     5σ     0.001  3.29σ
2.0×10⁻⁹     6σ     10⁻⁴   3.89σ
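The entries of Table 39.1 follow from Eq. (39.70); a short Python check using the standard Gaussian cumulative distribution (values shown in the comments are rounded):

```python
from scipy.stats import norm

# Two-sided tail probability alpha outside +-delta, in units of sigma:
# alpha = 2 * (1 - Phi(delta/sigma)), cf. Eq. (39.70).
for n_sigma in [1, 2, 3, 4, 5, 6]:
    alpha = 2.0 * norm.sf(n_sigma)          # sf = 1 - CDF
    print(f"{n_sigma} sigma: alpha = {alpha:.2e}")

# Inverse direction: delta for a given alpha, delta = Phi^{-1}(1 - alpha/2).
for alpha in [0.2, 0.1, 0.05, 0.01]:
    print(f"alpha = {alpha}: delta = {norm.isf(alpha / 2):.2f} sigma")
```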

We can set a one-sided (upper or lower) limit by excluding above x + δ (or below x − δ). The values of α for such limits are half the values in Table 39.1.


The relation (39.70) can be re-expressed using the cumulative distribution function for the χ2 distribution as

\alpha = 1 - F(\chi^2; n) ,   (39.71)

for χ2 = (δ/σ)^2 and n = 1 degree of freedom. This can be seen as the n = 1 curve in Fig. 39.1 or obtained by using the ROOT function TMath::Prob.

For multivariate measurements of, say, n parameter estimates θ̂ = (θ̂1, . . . , θ̂n), construction of the confidence region requires the full covariance matrix V_ij = cov[θ̂i, θ̂j], which can be estimated as described in Sections 39.2.2 and 39.2.3. Under fairly general conditions with the methods of maximum-likelihood or least-squares in the large sample limit, the estimators will be distributed according to a multivariate Gaussian centered about the true (unknown) values θ, and furthermore, the likelihood function itself will take on a Gaussian shape.

The standard error ellipse for the pair (θ̂i, θ̂j) is shown in Fig. 39.5, corresponding to a contour χ2 = χ2_min + 1 or ln L = ln L_max − 1/2. The ellipse is centered about the estimated values θ̂, and the tangents to the ellipse give the standard deviations of the estimators, σi and σj. The angle of the major axis of the ellipse is given by

\tan 2\phi = \frac{2 \rho_{ij} \sigma_i \sigma_j}{\sigma_j^2 - \sigma_i^2} ,   (39.72)

where ρ_ij = cov[θ̂i, θ̂j]/σiσj is the correlation coefficient. The correlation coefficient can be visualized as the fraction of the distance σi from the ellipse's horizontal center-line at which the ellipse becomes tangent to vertical, i.e., at the distance ρijσi below the center-line as shown. As ρij goes to +1 or −1, the ellipse thins to a diagonal line.

It could happen that one of the parameters, say, θj, is known from previous measurements to a precision much better than σj, so that the current measurement contributes almost nothing to the knowledge of θj. However, the current measurement of θi and its dependence on θj may still be important. In this case, instead of quoting both parameter estimates and their correlation, one sometimes reports the value of θi that minimizes χ2 at a fixed value of θj, such as the PDG best value. This θi value lies along the dotted line between the points where the ellipse becomes tangent to vertical, and has statistical error σ_inner as shown on the figure, where σ_inner = (1 − ρ_ij^2)^{1/2} σi. Instead of the correlation ρij, one reports the dependency dθ̂i/dθj, which is the slope of the dotted line. This slope is related to the correlation coefficient by dθ̂i/dθj = ρij × (σi/σj).

As in the single-variable case, because of the symmetry of the Gaussian function between θ and θ̂, one finds that contours of constant ln L or χ2 cover the true values with a certain, fixed probability. That is, the confidence region is determined by

\ln L(\theta) \ge \ln L_{\rm max} - \Delta \ln L ,   (39.73)

or, where a χ2 has been defined for use with the method of least-squares,

\chi^2(\theta) \le \chi^2_{\rm min} + \Delta\chi^2 .   (39.74)

Values of Δχ2 or 2Δ ln L are given in Table 39.2 for several values of the coverage probability 1 − α and number of fitted parameters m. For Gaussian distributed data, these are related by Δχ2 = 2Δ ln L = F^{-1}_{χ2_m}(1 − α), where F^{-1}_{χ2_m} is the chi-square quantile (inverse of the cumulative distribution) for m degrees of freedom.

For non-Gaussian data samples, the probability for the regions determined by Equations (39.73) or (39.74) to cover the true value of θ becomes independent of θ only in the large-sample limit.



Figure 39.5: Standard error ellipse for the estimators θ̂i and θ̂j. In the case shown the correlation is negative.
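The quantities appearing in Fig. 39.5 follow directly from the covariance matrix; a small Python sketch using an invented covariance matrix (all numbers illustrative):

```python
import numpy as np

# Hypothetical covariance matrix for the pair (theta_i, theta_j).
V = np.array([[4.0, -1.8],
              [-1.8, 1.0]])

sigma_i, sigma_j = np.sqrt(V[0, 0]), np.sqrt(V[1, 1])
rho_ij = V[0, 1] / (sigma_i * sigma_j)        # correlation coefficient

# Angle of the ellipse's major axis, Eq. (39.72).
phi = 0.5 * np.arctan2(2 * rho_ij * sigma_i * sigma_j,
                       sigma_j**2 - sigma_i**2)

# Error on theta_i at fixed theta_j, and the dependency d(theta_i)/d(theta_j).
sigma_inner = np.sqrt(1 - rho_ij**2) * sigma_i
dependency = rho_ij * sigma_i / sigma_j

print(f"rho = {rho_ij:.2f}, phi = {np.degrees(phi):.1f} deg, "
      f"sigma_inner = {sigma_inner:.2f}, dependency = {dependency:.2f}")
```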

Table 39.2: Values of Δχ2 or 2Δ ln L corresponding to a coverage probability 1 − α in the large data sample limit, for joint estimation of m parameters.

(1 − α) (%)   m = 1   m = 2   m = 3
68.27          1.00    2.30    3.53
90.            2.71    4.61    6.25
95.            3.84    5.99    7.82
95.45          4.00    6.18    8.03
99.            6.63    9.21   11.34
99.73          9.00   11.83   14.16

So for a finite data sample these are not exact confidence regions according to our previous definition. Nevertheless, they can still have a coverage probability only weakly dependent on the true parameter, and approximately as given in Table 39.2. In any case, the coverage probability of the intervals or regions obtained according to this procedure can in principle be determined as a function of the true parameter(s), for example, using a Monte Carlo calculation.

One of the practical advantages of intervals that can be constructed from the log-likelihood function or χ2 is that it is relatively simple to produce the interval for the combination of several experiments. If N independent measurements result in log-likelihood functions ln Li(θ), then the combined log-likelihood function is simply the sum,

\ln L(\theta) = \sum_{i=1}^{N} \ln L_i(\theta) .   (39.75)

This can then be used to determine an approximate confidence interval or region with Eq. (39.73), just as with a single experiment.
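As an illustration of Eq. (39.75) together with Eq. (39.73), the following Python sketch combines two hypothetical Gaussian measurements of the same parameter and reads off the approximate 68.27% interval from Δ ln L = 1/2 (the Gaussian likelihood shape is an assumption of this example):

```python
import numpy as np

# Hypothetical measurements of the same theta: (estimate, standard deviation).
measurements = [(10.2, 1.0), (11.1, 0.8)]

# Combined log-likelihood, Eq. (39.75), assuming Gaussian likelihoods.
theta = np.linspace(5.0, 15.0, 10001)
lnL = sum(-0.5 * ((theta - x) / s)**2 for x, s in measurements)

# Approximate 68.27% interval from ln L >= ln L_max - 1/2 (m = 1, Table 39.2).
inside = theta[lnL >= lnL.max() - 0.5]
print(f"theta_hat = {theta[np.argmax(lnL)]:.2f}, "
      f"interval = [{inside.min():.2f}, {inside.max():.2f}]")
```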


39.4.2.3 Poisson or binomial data

Another important class of measurements consists of counting a certain number of events, n. In this section, we will assume these are all events of the desired type, i.e., there is no background. If n represents the number of events produced in a reaction with cross section σ, say, in a fixed integrated luminosity L, then it follows a Poisson distribution with mean µ = σL. If, on the other hand, one has selected a larger sample of N events and found n of them to have a particular property, then n follows a binomial distribution where the parameter p gives the probability for the event to possess the property in question. This is appropriate, e.g., for estimates of branching ratios or selection efficiencies based on a given total number of events.

For the case of Poisson distributed n, limits on the mean value µ can be found from the Neyman procedure as discussed in Section 39.4.2.1 with n used directly as the statistic x. The upper and lower limits are found to be

\mu_{\rm lo} = \frac{1}{2} F_{\chi^2}^{-1}(\alpha_{\rm lo};\, 2n) ,   (39.76a)

\mu_{\rm up} = \frac{1}{2} F_{\chi^2}^{-1}(1 - \alpha_{\rm up};\, 2(n+1)) ,   (39.76b)

where confidence levels of 1 − α_lo and 1 − α_up refer separately to the corresponding intervals µ ≥ µ_lo and µ ≤ µ_up, and F_{\chi^2}^{-1} is the quantile of the χ2 distribution (inverse of the cumulative distribution). For central confidence intervals at confidence level 1 − α, set α_lo = α_up = α/2.

Table 39.3: Lower and upper (one-sided) limits for the mean µ of a Poisson variable given n observed events in the absence of background, for confidence levels of 90% and 95%.

          1 − α = 90%          1 − α = 95%
n        µlo      µup        µlo      µup
0        –        2.30       –        3.00
1        0.105    3.89       0.051    4.74
2        0.532    5.32       0.355    6.30
3        1.10     6.68       0.818    7.75
4        1.74     7.99       1.37     9.15
5        2.43     9.27       1.97    10.51
6        3.15    10.53       2.61    11.84
7        3.89    11.77       3.29    13.15
8        4.66    12.99       3.98    14.43
9        5.43    14.21       4.70    15.71
10       6.22    15.41       5.43    16.96
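Equations (39.76a,b) are easily evaluated with the χ2 quantile; a minimal Python sketch that reproduces the entries of Table 39.3:

```python
from scipy.stats import chi2

def poisson_limits(n, alpha_lo=0.05, alpha_up=0.05):
    """One-sided limits on a Poisson mean, Eqs. (39.76a,b)."""
    mu_lo = 0.5 * chi2.ppf(alpha_lo, 2 * n) if n > 0 else None
    mu_up = 0.5 * chi2.ppf(1.0 - alpha_up, 2 * (n + 1))
    return mu_lo, mu_up

# For n = 3 at 95% CL this gives mu_lo ~ 0.818 and mu_up ~ 7.75,
# matching the corresponding row of Table 39.3.
print(poisson_limits(3))
```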

It happens that the upper limit from Eq. (39.76b) coincides numerically with the Bayesian upper limit for a Poisson parameter, using a uniform prior p.d.f. for µ. Values for confidence levels of 90% and 95% are shown in Table 39.3. For the case of binomially distributed n successes out of N trials with probability of success p, the upper and lower limits on p are found to be

p_{\rm lo} = \frac{n\, F_F^{-1}[\alpha_{\rm lo};\, 2n,\, 2(N-n+1)]}{N - n + 1 + n\, F_F^{-1}[\alpha_{\rm lo};\, 2n,\, 2(N-n+1)]} ,   (39.77a)

p_{\rm up} = \frac{(n+1)\, F_F^{-1}[1-\alpha_{\rm up};\, 2(n+1),\, 2(N-n)]}{(N-n) + (n+1)\, F_F^{-1}[1-\alpha_{\rm up};\, 2(n+1),\, 2(N-n)]} .   (39.77b)

Here F_F^{-1} is the quantile of the F distribution (also called the Fisher–Snedecor distribution; see Ref. [4]).
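A sketch of Eqs. (39.77a,b) in Python, with SciPy's F-distribution quantile playing the role of F_F^{-1} (inputs and the edge-case conventions for n = 0 and n = N are illustrative assumptions):

```python
from scipy.stats import f as f_dist

def binomial_limits(n, N, alpha_lo=0.05, alpha_up=0.05):
    """One-sided limits on a binomial parameter p, Eqs. (39.77a,b)."""
    if n > 0:
        F = f_dist.ppf(alpha_lo, 2 * n, 2 * (N - n + 1))
        p_lo = n * F / (N - n + 1 + n * F)
    else:
        p_lo = 0.0                      # no lower limit above zero for n = 0
    if n < N:
        F = f_dist.ppf(1.0 - alpha_up, 2 * (n + 1), 2 * (N - n))
        p_up = (n + 1) * F / (N - n + (n + 1) * F)
    else:
        p_up = 1.0                      # no upper limit below one for n = N
    return p_lo, p_up

# Illustrative: 3 successes in 10 trials, central 90% interval.
print(binomial_limits(3, 10))
```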


39.4.2.4 Parameter exclusion in cases of low sensitivity

An important example of a statistical test arises in the search for a new signal process. Suppose the parameter µ is defined such that it is proportional to the signal cross section. A statistical test may be carried out for hypothesized values of µ, which may be done by computing a p-value, p_µ, for all µ. Those values not rejected in a test of size α, i.e., for which one does not find p_µ ≤ α, constitute a confidence interval with confidence level 1 − α.

In general one will find that for some regions in the parameter space of the signal model, the predictions for data are almost indistinguishable from those of the background-only model. This corresponds to the case where µ is very small, as would occur, e.g., in a search for a new particle with a mass so high that its production rate in a given experiment is negligible. That is, one has essentially no experimental sensitivity to such a model.

One would prefer that if the sensitivity to a model (or a point in a model's parameter space) is very low, then it should not be excluded. Even if the outcomes predicted with or without signal are identical, however, the probability to reject the signal model will equal α, the type-I error rate. As one often takes α to be 5%, this would mean that in a large number of searches covering a broad range of a signal model's parameter space, there would inevitably be excluded regions in which the experimental sensitivity is very small, and thus one may question whether it is justified to regard such parameter values as disfavored.

Exclusion of models to which one has little or no sensitivity occurs, for example, if the data fluctuate very low relative to the expectation of the background-only hypothesis. In this case the resulting upper limit on µ may be anomalously low. As a means of controlling this effect one often determines the mean or median limit under assumption of the background-only hypothesis, as discussed in Sec. 39.5.

One way to mitigate the problem of excluding models to which one is not sensitive is the CLs method, where the measure used to test a parameter is increased for decreasing sensitivity [42, 43]. The procedure is based on a statistic called CLs, which is defined as

{\rm CL}_s = \frac{p_\mu}{1 - p_b} ,   (39.78)

where p_b is the p-value of the background-only hypothesis. In the usual formulation of the method, both p_µ and p_b are defined using a single test statistic, and the definition of CLs above assumes this statistic is continuous; more details can be found in Refs. [42, 43].

A point in a model's parameter space is regarded as excluded if one finds CLs ≤ α. As the denominator in Eq. (39.78) is always less than or equal to unity, the exclusion criterion based on CLs is more stringent than the usual requirement p_µ ≤ α. In this sense the CLs procedure is conservative, and the coverage probability of the corresponding intervals will exceed the nominal confidence level 1 − α. If the experimental sensitivity to a given value of µ is very low, then one finds that as p_µ decreases, so does the denominator 1 − p_b, and thus the condition CLs ≤ α is effectively prevented from being satisfied. In this way the exclusion of parameters in the case of low sensitivity is suppressed.
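For a simple counting experiment the CLs computation can be sketched directly; in the following Python fragment (all numbers hypothetical) the observed count n itself serves as the test statistic, and 1 − p_b is taken as the background-only probability of a yield at or below the one observed — a convention adopted here since, as noted above, the usual definition assumes a continuous statistic:

```python
from scipy.stats import poisson

s, b, n_obs = 5.0, 3.0, 1     # hypothetical signal, background, observed count

def cls(mu):
    p_mu = poisson.cdf(n_obs, mu * s + b)   # low-tail p-value of strength mu
    one_minus_pb = poisson.cdf(n_obs, b)    # 1 - p_b for background only
    return p_mu / one_minus_pb

for mu in [0.5, 1.0]:
    p_mu = poisson.cdf(n_obs, mu * s + b)
    print(f"mu = {mu}: p_mu = {p_mu:.3f}, CLs = {cls(mu):.3f}")
```

With these inputs, µ = 0.5 has p_µ ≈ 0.027 and would be excluded by the condition p_µ ≤ 0.05, but CLs ≈ 0.13 lies above 0.05, illustrating the more conservative character of the CLs criterion.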
The CLs procedure has the attractive feature that the resulting intervals coincide with those obtained from the Bayesian method in two important cases: the mean value of a Poisson or Gaussian distributed measurement with a constant prior. The CLs intervals overcover for all values of the parameter µ, however, by an amount that depends on µ.

The problem of excluding parameter values to which one has little sensitivity is particularly acute when one wants to set a one-sided limit, e.g., an upper limit on a cross section. Here one tests a value of a rate parameter µ against the alternative of a lower rate, and therefore the critical region of the test is taken to correspond to data outcomes with a low event yield. If the number of events found in the search region fluctuates low enough, however, it can happen that all physically meaningful signal parameter values, including those to which one has very little sensitivity, are rejected by the test.

Another solution to this problem, therefore, is to replace the one-sided test by one based on the likelihood ratio, where the critical region is not restricted to low rates. This is the approach followed in the Feldman-Cousins procedure described in Section 39.4.2.1. The critical region for the test of a given value of µ contains data values characteristic of both higher and lower rates. As a result, for a given observed rate one can in general obtain a two-sided interval. If, however, the parameter estimate µ̂ is sufficiently close to the lower limit of zero, then only high values of µ are rejected, and the lower edge of the confidence interval is at zero. Note, however, that the coverage property of 1 − α pertains to the entire interval, not to the probability for the upper edge µ_up to be greater than the true value µ. For parameter estimates increasingly far away from the boundary, i.e., for increasing signal significance, the point µ = 0 is excluded and the interval has nonzero upper and lower edges.

An additional difficulty arises when a parameter estimate is not significantly far away from the boundary, in which case it is natural to report a one-sided confidence interval (often an upper limit). It is straightforward to force the Neyman prescription to produce only an upper limit by setting x2 = ∞ in Eq. (39.67). Then x1 is uniquely determined and the upper limit can be obtained. If, however, the data come out such that the parameter estimate is not so close to the boundary, one might wish to report a central confidence interval (i.e., an interval based on a two-sided test with equal upper and lower tail areas). As pointed out by Feldman and Cousins [40], if the decision to report an upper limit or two-sided interval is made by looking at the data (“flip-flopping”), then in general there will be parameter values for which the resulting intervals have a coverage probability less than 1 − α. With the confidence intervals suggested in [40], the prescription determines whether the interval is one- or two-sided in a way which preserves the coverage probability, and the intervals are thus said to be unified. The intervals according to this method for the mean of a Poisson variable in the absence of background are given in Table 39.4. (Note that α in Ref. [40] is defined following Neyman [39] as the coverage probability; this is opposite the modern convention used here, in which the coverage probability is 1 − α.) The values of 1 − α given here refer to the coverage of the true parameter by the whole interval [µ1, µ2]. In Table 39.3 for the one-sided upper limit, however, 1 − α refers to the probability to have µ_up ≥ µ (or µ_lo ≤ µ for lower limits).

A potential difficulty with unified intervals arises if, for example, one constructs such an interval for a Poisson parameter s of some yet to be discovered signal process with, say, 1 − α = 0.9. If the true signal parameter is zero, or in any case much less than the expected background, one will usually obtain a one-sided upper limit on s. In a certain fraction of the experiments, however, a two-sided interval for s will result. Since one typically chooses 1 − α to be only 0.9 or 0.95 when setting limits, the value s = 0 may be found below the lower edge of the interval before the existence of the effect is well established. It must then be communicated carefully that in excluding s = 0 at, say, 90% or 95% confidence level from the interval, one is not necessarily claiming to have discovered the effect, for which one would usually require a higher level of significance (e.g., 5σ).

Another possibility is to construct a Bayesian interval as described in Section 39.4.1. The presence of the boundary can be incorporated simply by setting the prior density to zero in the unphysical region. More specifically, the prior may be chosen using formal rules such as the reference prior or Jeffreys prior mentioned in Sec. 39.2.5. In HEP a widely used prior for the mean µ of a Poisson distributed measurement has been the uniform distribution for µ ≥ 0.


Table 39.4: Unified confidence intervals [µ1, µ2] for the mean of a Poisson variable given n observed events in the absence of background, for confidence levels of 90% and 95%.

          1 − α = 90%         1 − α = 95%
n        µ1      µ2         µ1      µ2
0        0.00    2.44       0.00    3.09
1        0.11    4.36       0.05    5.14
2        0.53    5.91       0.36    6.72
3        1.10    7.42       0.82    8.25
4        1.47    8.60       1.37    9.76
5        1.84    9.99       1.84   11.26
6        2.21   11.47       2.21   12.75
7        3.56   12.53       2.58   13.81
8        3.96   13.99       2.94   15.29
9        4.36   15.30       4.36   16.77
10       5.50   16.50       4.75   17.82

This prior does not follow from any fundamental rule nor can it be regarded as reflecting a reasonable degree of belief, since the prior probability for µ to lie between any two finite values is zero. The procedure above can be more appropriately regarded as a way for obtaining intervals with frequentist properties that can be investigated. The resulting upper limits have a coverage probability that depends on the true value of the Poisson parameter, and is nowhere smaller than the stated probability content. Lower limits and two-sided intervals for the Poisson mean based on flat priors undercover, however, for some values of the parameter, although to an extent that in practical cases may not be too severe [2, 38].

In any case, it is important to always report sufficient information so that the result can be combined with other measurements. Often this means giving an unbiased estimator and its standard deviation, even if the estimated value is in the unphysical region. It can also be useful with a frequentist interval to calculate its subjective probability content using the posterior p.d.f. based on one or several reasonable guesses for the prior p.d.f. If it turns out to be significantly less than the stated confidence level, this warns that it would be particularly misleading to draw conclusions about the parameter's value from the interval alone.

39.5 Experimental sensitivity

In this section we describe methods for characterizing the sensitivity of a search for a new physics signal. As discussed in Sec. 39.3, an experimental analysis can often be formulated as a test of hypothetical model parameters. Therefore we may quantify the sensitivity by giving the results that we expect from such a test under specific assumptions about the signal process. Here to be concrete we will consider a parameter µ proportional to the rate of a signal process, although the concepts described in this section may be easily generalized to other parameters.

One may wish to establish discovery of the signal process by testing and rejecting the hypothesis that µ = 0, and in addition one often wants to test nonzero values of µ to construct a confidence interval (e.g., limits) as described in Sec. 39.4. In the frequentist framework, the result for each tested value of µ is the p-value p_µ, or equivalently the significance Z_µ = Φ^{-1}(1 − p_µ), where as usual Φ is the standard Gaussian cumulative distribution and its inverse Φ^{-1} is the standard Gaussian quantile.

Prior to carrying out the experiment, one generally wants to quantify what significance Z_µ is expected under given assumptions for the presence or absence of the signal process. Specifically, for the significance of a test of µ = 0 (the discovery significance) one usually quotes the Z_0 one would expect if the signal is present at a given nominal rate, which we can define in general to correspond to µ = 1. For limits, one often gives the expected limit under assumption of the background-only (µ = 0) model. These quantities are used to optimize the analysis and to quantify the experimental sensitivity, that is, to characterize how likely it is to make a discovery if the signal is present, and to say what values of µ one may be able to exclude if the signal is in fact absent.

First we clarify the notion of expected significance. Because the significance Z_µ is a function of the data, it is itself a random quantity characterized by a certain sampling distribution. This distribution depends on the assumed value of µ, which is not necessarily the same as the hypothesized value of µ being tested. We may therefore consider the distribution f(Z_µ|µ′), i.e., the distribution of Z_µ that would be obtained by considering data samples generated under assumption of µ′. In a similar way one can talk about the sampling distribution of an upper limit for µ, f(µ_up|µ′).

One can identify the expected significance or limit with either the mean or median of these distributions, but the median may be preferred since it is invariant under monotonic transformations. For example, the monotonic relation between p-value and significance, p = 1 − Φ(Z), then gives med[p_µ|µ′] = 1 − Φ(med[Z_µ|µ′]), whereas the corresponding relation does not hold in general for the mean. In some cases one may be able to write down approximate formulae for the distributions of Z_µ and for limits, but more generally they must be determined from Monte Carlo calculations. In many cases of interest, the significance Z_µ and the limits on µ will have approximate Gaussian distributions.

As an example, consider a Poisson counting experiment, where the result consists of an observed number n of events, modeled as a Poisson distributed variable with a mean of µs + b. Here s and b, the expected numbers of events from signal and background processes, are taken to be known. If we are interested in discovering the signal process we test and try to reject the hypothesis µ = 0. To characterize the experimental sensitivity, we want to give the discovery significance expected under the assumption of µ = 1.

In the limit where its mean value is large, the Poisson variable n can be approximated as an almost continuous Gaussian variable with mean µs + b and standard deviation σ = \sqrt{\mu s + b}. In the usual case where a physical signal model corresponds to µ > 0, the p-value of µ = 0 is the probability to find n greater than or equal to the value observed,

p_0 = 1 - \Phi\!\left(\frac{n - b}{\sqrt{b}}\right) ,   (39.79)

and the corresponding significance is Z_0 = Φ^{-1}(1 − p_0) = (n − b)/√b. The median (here equal to the mean) of n assuming µ = 1 is s + b, and therefore the median discovery significance is

{\rm med}[Z_0|\mu = 1] = \frac{s}{\sqrt{b}} .   (39.80)

The figure of merit “s/√b” has been widely used in HEP as a measure of expected discovery significance. A better approximation for the Poisson counting experiment, however, may be obtained by testing µ = 0 using the likelihood ratio (39.48), λ(0) = L(0)/L(µ̂), where

L(\mu) = \frac{(\mu s + b)^n}{n!}\, e^{-(\mu s + b)}   (39.81)

is the likelihood function and µ̂ = (n − b)/s is the ML estimator.
In this example there are no nuisance parameters, as s and b are taken to be known.


For the case where the relevant signal models correspond to positive µ, one may test the µ = 0 hypothesis with the statistic q_0 = −2 ln λ(0) when µ̂ > 0, i.e., when an excess is observed, and q_0 = 0 otherwise. One can show (see, e.g., [31]) that in the large-sample limit, the discovery significance is then Z_0 = \sqrt{q_0}, for which one finds

Z_0 = \sqrt{2\left(n \ln\frac{n}{b} + b - n\right)}   (39.82)

for n > b and Z_0 = 0 otherwise. To approximate the expected discovery significance assuming µ = 1, one may simply replace n with its expectation value E[n|µ = 1] = s + b (the so-called “Asimov data set”), giving

{\rm med}[Z_0|\mu = 1] = \sqrt{2\left((s + b)\ln\left(1 + \frac{s}{b}\right) - s\right)} .   (39.83)

This has been shown in Ref. [31] to provide a good approximation to the median discovery significance for values of s of several and for b well below unity. The right-hand side of Eq. (39.83) reduces to s/√b in the limit s ≪ b.

Beyond the simple Poisson counting experiment, in general one may test values of a parameter µ with more complicated functions of the measured data to obtain a p-value p_µ, and from this one can quote the equivalent significance Z_µ or find, e.g., an upper limit µ_up. In this case as well one may quantify the experimental sensitivity by giving the significance Z_µ expected if the data are generated with a different value of the parameter µ′. In some problems, finding the sampling distribution of the significance or limits may be possible using large-sample formulae as described, e.g., in Ref. [31]. In other cases a Monte Carlo study may be needed. Using whatever method of calculation is most appropriate, one usually quotes the expected (mean or, preferably, median) significance or limit as the primary measures of experimental sensitivity.

Even if the true signal is present at its nominal rate, the actual discovery significance Z_0 obtained from the real data is subject to statistical fluctuations and will not in general be equal to its expected value. In an analogous way, the observed limit will differ from the expected limit even if the signal is absent. Upon observing such a difference one would like to know how large it is compared to expected statistical fluctuations. Therefore, in addition to the observed significance and limits it is useful to communicate not only their expected values but also a measure of the width of their distributions. As the distributions of significance and limits are often well approximated by a Gaussian, one may indicate the intervals corresponding to plus-or-minus one and/or two standard deviations. If the distributions are significantly non-Gaussian, one may use instead the quantiles that give the same probability content, i.e., [0.1587, 0.8413] for ±1σ and [0.02275, 0.97725] for ±2σ. An upper limit found significantly below the background-only expectation may indicate a strong downward fluctuation of the data, or perhaps as well an incorrect estimate of the background rate.

The procedures described above pertain to frequentist hypothesis tests and limits. Bayesian limits, just like those found from a frequentist procedure, are functions of the data and one may therefore find, usually with approximations or Monte Carlo studies, their sampling distribution and corresponding mean (or, preferably, median) and standard deviation. When trying to establish discovery of a signal process, the Bayesian approach may employ a Bayes factor as described in Sec. 39.3.3. In the case of the Poisson counting experiment with the likelihood from Eq. (39.81), the log of the Bayes factor that compares µ = 1 to µ = 0 is ln B_10 = ln(L(1)/L(0)) = n ln(1 + s/b) − s.
Since ln B_10 is linear in n, its expectation value assuming µ = 1 is obtained by replacing n with E[n|µ = 1] = s + b, giving

{\rm E}[\ln B_{10}|\mu = 1] = (s + b)\ln\left(1 + \frac{s}{b}\right) - s .   (39.84)

Comparing this to Eq. (39.83), one finds med[Z_0|1] = \sqrt{2\,{\rm E}[\ln B_{10}|1]}. Thus for this particular problem the frequentist median discovery significance can be related to the corresponding Bayes factor in a simple way.
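The approximations above are easy to evaluate; a short Python sketch (with purely illustrative values of s and b) compares the Asimov significance of Eq. (39.83) with the simple s/√b figure of merit:

```python
import numpy as np

def z0_observed(n, b):
    """Discovery significance from the likelihood-ratio statistic, Eq. (39.82)."""
    return np.sqrt(2.0 * (n * np.log(n / b) + b - n)) if n > b else 0.0

def z0_median(s, b):
    """Median (Asimov) discovery significance assuming mu = 1, Eq. (39.83)."""
    return np.sqrt(2.0 * ((s + b) * np.log(1.0 + s / b) - s))

s, b = 10.0, 5.0                    # hypothetical signal and background means
print(f"median Z0 = {z0_median(s, b):.2f}")  # ~ 3.60
print(f"s/sqrt(b) = {s / np.sqrt(b):.2f}")   # ~ 4.47, larger than Asimov here
print(f"observed Z0 for n = 18: {z0_observed(18, b):.2f}")
```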


In some analyses, the goal may not be to establish discovery of a signal process but rather to measure, as accurately as possible, the signal rate. If we consider again the Poisson counting experiment described by the likelihood function of Eq. (39.81), the ML estimator µ̂ = (n − b)/s has a variance, assuming µ = 1, of

V[\hat{\mu}] = V\!\left[\frac{n - b}{s}\right] = \frac{1}{s^2}\, V[n] = \frac{s + b}{s^2} ,   (39.85)

so that the standard deviation of µ̂ is σ_µ̂ = \sqrt{s + b}/s. One may therefore use s/\sqrt{s + b} as a figure of merit to be maximized in order to obtain the best measurement accuracy of a rate parameter. The quantity s/\sqrt{s + b} is also the expected significance with which one rejects s assuming the signal is absent, and thus can be used to optimize the expected upper limit on s.

References
[1] B. Efron, Am. Stat. 40, 11 (1986).
[2] R. D. Cousins, Am. J. Phys. 63, 398 (1995).
[3] A. Stuart, J.K. Ord and S. Arnold, Kendall's Advanced Theory of Statistics, Vol. 2A: Classical Inference and the Linear Model, 6th ed., Oxford Univ. Press (1999), and earlier editions by Kendall and Stuart. The likelihood-ratio ordering principle is described at the beginning of Ch. 23. Chapter 26 compares different schools of statistical inference.
[4] F. James, Statistical Methods in Experimental Physics, 2nd ed., World Scientific (2006), ISBN 9789812567956.
[5] L. Lyons, Statistics for Nuclear and Particle Physicists, Cambridge Univ. Press (1986), ISBN 9780521379342, URL http://www.cambridge.org/uk/catalogue/catalogue.asp?isbn=0521255406.
[6] R. J. Barlow, Nucl. Instrum. Meth. A297, 496 (1990).
[7] G. Cowan, Statistical Data Analysis, Oxford Univ. Press (1998), ISBN 9780198501565.
[8] S. Baker and R. D. Cousins, Nucl. Instrum. Meth. 221, 437 (1984).
[9] S. S. Wilks, Annals Math. Statist. 9, 1, 60 (1938).
[10] O. Behnke et al., editors, Data Analysis in High Energy Physics, Wiley-VCH, Weinheim, Germany (2013), ISBN 9783527410583, 9783527653447, 9783527653430, URL http://www.wiley-vch.de/publish/dt/books/ISBN3-527-41058-9.
[11] A. O'Hagan and J.J. Forster, Bayesian Inference, 2nd ed., volume 2B of Kendall's Advanced Theory of Statistics, Arnold, London (2004).
[12] D. Sivia and J. Skilling, Data Analysis: A Bayesian Tutorial, Oxford University Press (2006).
[13] P.C. Gregory, Bayesian Logical Data Analysis for the Physical Sciences, Cambridge University Press (2005).
[14] J.M. Bernardo and A.F.M. Smith, Bayesian Theory, Wiley (2000).
[15] R. E. Kass and L. Wasserman, J. Am. Stat. Assoc. 91, 1343 (1996).
[16] J.M. Bernardo, J. R. Statist. Soc. B41, 113 (1979); J.M. Bernardo and J.O. Berger, J. Am. Stat. Assoc. 84, 200 (1989). See also J.M. Bernardo, Reference Analysis, in Handbook of Statistics, 25 (D.K. Dey and C.R. Rao, eds.), 17–90, Elsevier (2005) and references therein.
[17] L. Demortier, S. Jain and H. B. Prosper, Phys. Rev. D82, 034002 (2010), [arXiv:1002.1111].
[18] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, New York (2006).


[19] T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning, 2nd ed., Springer, New York (2009).
[20] A. Webb, Statistical Pattern Recognition, 2nd ed., Wiley, New York (2002).
[21] L.I. Kuncheva, Combining Pattern Classifiers, Wiley, New York (2004).
[22] Links to the Proceedings of the PHYSTAT conference series (Durham 2002, Stanford 2003, Oxford 2005, and Geneva 2007, 2011) can be found at phystat.org.
[23] K. Albertsson et al. (2018), [arXiv:1807.02876].
[24] A. Hocker et al. (2007), [arXiv:physics/0703039]; software available from tmva.sourceforge.net.
[25] F. Pedregosa et al., J. Machine Learning Res. 12, 2825 (2011), [arXiv:1201.0490].
[26] P. Baldi, P. Sadowski and D. Whiteson, Nature Commun. 5, 4308 (2014), [arXiv:1402.4735].
[27] D. Guest, K. Cranmer and D. Whiteson, Ann. Rev. Nucl. Part. Sci. 68, 161 (2018), [arXiv:1806.11484].
[28] K. Cranmer, J. Pavez and G. Louppe (2015), [arXiv:1506.02169].
[29] N. Reid, Likelihood Inference in the Presence of Nuisance Parameters, Proceedings of PHYSTAT2003, L. Lyons, R. Mount and R. Reitmeyer, eds., eConf C030908, Stanford (2003).
[30] W. A. Rolke, A. M. Lopez and J. Conrad, Nucl. Instrum. Meth. A551, 493 (2005), [arXiv:physics/0403059].
[31] G. Cowan et al., Eur. Phys. J. C71, 1554 (2011), [Erratum: Eur. Phys. J. C73, 2501 (2013)], [arXiv:1007.1727].
[32] L. Demortier, P-Values and Nuisance Parameters, Proceedings of PHYSTAT 2007, CERN-2008-001, p. 23.
[33] E. Gross and O. Vitells, Eur. Phys. J. C70, 525 (2010), [arXiv:1005.1891].
[34] R. B. Davies, Biometrika 74, 33 (1987).
[35] J. Skilling, Nested Sampling, AIP Conference Proceedings 735, 395–405 (2004).
[36] F. Feroz, M. P. Hobson and M. Bridges, Mon. Not. Roy. Astron. Soc. 398, 1601 (2009), [arXiv:0809.3437].
[37] R. E. Kass and A. E. Raftery, J. Am. Statist. Assoc. 90, 430, 773 (1995).
[38] B. P. Roe and M. B. Woodroofe, Phys. Rev. D63, 013009 (2001), [hep-ex/0007048].
[39] J. Neyman, Phil. Trans. Roy. Soc. Lond. A236, 767, 333 (1937); reprinted in A Selection of Early Statistical Papers on J. Neyman, University of California Press, Berkeley (1967).
[40] G. J. Feldman and R. D. Cousins, Phys. Rev. D57, 3873 (1998), [arXiv:physics/9711021]. This paper does not specify what to do if the ordering principle gives equal rank to some values of x. Eq. 21.6 of Ref. [3] gives the rule: all such points are included in the acceptance region (the domain D(α)). Some authors have assumed the contrary, and shown that one can then obtain null intervals.
[41] K. Cranmer, in “Statistical Problems in Particle Physics, Astrophysics and Cosmology (PHYSTAT 05): Proceedings, Oxford, UK, September 12–15, 2005,” 112–123 (2005), [arXiv:physics/0511028], URL http://www.physics.ox.ac.uk/phystat05/proceedings/files//Cranmer_LHCStatisticalChallenges.ps.
[42] A. L. Read, in “Workshop on Confidence Limits, CERN, Geneva, Switzerland, 17–18 Jan 2000: Proceedings,” 81–101 (2000), URL http://weblib.cern.ch/abstract?CERN-OPEN-2000-205.
[43] T. Junk, Nucl. Instrum. Meth. A434, 435 (1999), [hep-ex/9902006].
