Practical Statistics for Particle Physics
Center for Cosmology and Particle Physics

Kyle Cranmer, New York University


Lecture 3

Outline

Lecture 1: Building a probability model
‣ preliminaries, the marked Poisson process
‣ incorporating systematics via nuisance parameters
‣ constraint terms
‣ examples of different “narratives” / search strategies

Lecture 2: Hypothesis testing
‣ simple models, the Neyman-Pearson lemma, and the likelihood ratio
‣ composite models and the profile likelihood ratio
‣ review of ingredients for a hypothesis test

Lecture 3: Limits & Confidence Intervals
‣ the meaning of confidence intervals as inverted hypothesis tests
‣ asymptotic properties of likelihood ratios
‣ Bayesian approach

LEP Higgs

A simple likelihood ratio with no free parameters:

$$Q = \frac{L(x|H_1)}{L(x|H_0)} = \frac{\prod_i^{N_{\rm chan}} {\rm Pois}(n_i|s_i+b_i)\, \prod_j^{n_i} \frac{s_i f_s(x_{ij}) + b_i f_b(x_{ij})}{s_i + b_i}}{\prod_i^{N_{\rm chan}} {\rm Pois}(n_i|b_i)\, \prod_j^{n_i} f_b(x_{ij})}$$

$$q = \ln Q = -s_{\rm tot} + \sum_i^{N_{\rm chan}} \sum_j^{n_i} \ln\!\left(1 + \frac{s_i f_s(x_{ij})}{b_i f_b(x_{ij})}\right)$$
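As a rough numerical sketch of this q = ln Q for a single channel (the Gaussian signal shape, flat background, and yields below are illustrative placeholders, not the LEP Higgs model):

```python
import numpy as np
from scipy.stats import norm, uniform

# One channel: s expected signal, b expected background, discriminating variable x
s, b = 5.0, 20.0
f_s = norm(115.0, 3.0)                    # illustrative signal pdf in the reconstructed mass
f_b = uniform(80.0, 80.0)                 # flat background pdf on [80, 160]

def ln_q(x):
    """q = ln Q = -s_tot + sum_j ln(1 + s f_s(x_j) / (b f_b(x_j)))"""
    return -s + np.sum(np.log1p(s * f_s.pdf(x) / (b * f_b.pdf(x))))

rng = np.random.default_rng(0)
n = rng.poisson(s + b)                    # one pseudo-experiment generated under s+b
x = np.where(rng.random(n) < s / (s + b),
             f_s.rvs(n, random_state=rng), f_b.rvs(n, random_state=rng))
print(f"ln Q = {ln_q(x):.2f}  (larger values are more signal-like)")
```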

[Figures: (left) probability density of -2 ln(Q) at m_H = 115 GeV/c², with the observed value and the expectations for background and for signal plus background; (right) observed and expected -2 ln(Q) vs m_H (GeV/c²), LEP.]

The Test Statistic and its distribution

Consider this schematic diagram

[Schematic: distributions of the test statistic for the signal + background and background-only hypotheses, with the observed value marked; the tail areas correspond to 1-CL_b and CL_s+b, separating the signal-like and background-like regions.]

The “test statistic” is a single number that quantifies the entire experiment. It could just be the number of events observed, but often it is more sophisticated, like a likelihood ratio. What test statistic do we choose? And how do we build the distribution? Usually “toy Monte Carlo”, but what about the uncertainties... what do we do with the nuisance parameters?

An example

Essentially, you need to fit your model to the data twice: once with everything floating, and once with the signal fixed to 0:

$$\lambda(\mu=0) = \frac{P(m, a \mid \mu=0, \hat{\hat{\nu}}(\mu=0; m, a))}{P(m, a \mid \hat{\mu}, \hat{\nu})}$$

From the ATLAS note reproduced on the slide: the a_i are the parameters used to parameterize the fake-tau background and ν represents all nuisance parameters of the model: σ_H, m_Z, σ_Z, r_QCD, a_1, a_2, a_3. When using the alternate parameterization of the signal, the exact form of Equation 14 is modified to coincide with the parameters of that model. Figure 14 shows the fit to the signal candidates for a m_H = 120 GeV Higgs with (a,c) and without (b,d) the signal contribution. It can be seen that the background shapes and normalizations are trying to accommodate the excess near m_H = 120 GeV, but the control samples are constraining the variation. Table 13 shows the significance calculated from the profile likelihood ratio for the ll-channel, the lh-channel, and the combined fit for various Higgs boson masses.

[Figure 14 panels (a)-(d): VBF H(120)→ττ→lh and →ll, √s = 14 TeV, 30 fb⁻¹; events / (5 GeV) vs M_ττ (GeV).]

Figure 14: Example fits to a data sample with the signal-plus-background (a,c) and background-only (b,d) models for the lh- and ll-channels at m_H = 120 GeV with 30 fb⁻¹ of data. Not shown are the control samples that were fit simultaneously to constrain the background shape. These samples do not include pileup.

Properties of the Profile Likelihood Ratio

After a close look at the profile likelihood ratio

$$\lambda(\mu) = \frac{P(m, a \mid \mu, \hat{\hat{\nu}}(\mu; m, a))}{P(m, a \mid \hat{\mu}, \hat{\nu})}$$

one can see the function is independent of the true values of ν
‣ though its distribution might depend indirectly

Wilks’s theorem states that under certain conditions the distribution of -2 ln λ(µ=µ₀), given that the true value of µ is µ₀, converges to a chi-square distribution
‣ more on this tomorrow, but the important points are:
‣ the “asymptotic distribution” is known and it is independent of ν!

● more complicated if parameters have boundaries (eg. µ ≥ 0)

Thus, we can calculate the p-value for the background-only hypothesis without having to generate Toy Monte Carlo!
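A minimal sketch of the “fit twice” recipe and the asymptotic p-value, for a toy unbinned Gaussian-peak-plus-flat-background model (all names and numbers are illustrative, not the ATLAS model above):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm, chi2

# Toy extended unbinned model: events in m on [100, 160], signal peak at 125.
S_NOMINAL, LO, HI = 10.0, 100.0, 160.0

def nll(params, data):
    mu, b = params
    s = mu * S_NOMINAL
    dens = (s * norm.pdf(data, 125.0, 2.5) + b / (HI - LO)) / (s + b)
    return (s + b) - len(data) * np.log(s + b) - np.sum(np.log(dens))

def q0(data):
    """-2 ln lambda(0): conditional fit (mu fixed to 0) minus unconditional fit."""
    cond = minimize(lambda p: nll([0.0, p[0]], data), x0=[50.0], bounds=[(1e-3, None)])
    glob = minimize(nll, x0=[1.0, 50.0], args=(data,),
                    bounds=[(0.0, None), (1e-3, None)])
    return 2.0 * (cond.fun - glob.fun)

rng = np.random.default_rng(1)
data = np.concatenate([rng.uniform(LO, HI, rng.poisson(50)),
                       rng.normal(125.0, 2.5, rng.poisson(10))])

t0 = q0(data)
# half chi-square because of the mu >= 0 boundary (the caveat noted above)
p0 = 0.5 * chi2.sf(t0, df=1)
print(f"-2 ln lambda(0) = {t0:.2f},  asymptotic p0 = {p0:.3g}")
```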

Toy Monte Carlo

Explicitly build the distribution by generating “toys” / pseudo-experiments assuming a specific value of µ and ν:
‣ randomize both the main measurement m and the auxiliary measurements a
‣ fit the model twice for the numerator and denominator of the profile likelihood ratio
‣ evaluate -2 ln λ(µ) and add it to a histogram

The choice of µ is straightforward: typically µ=0 and µ=1, but the choice of ν is less clear
‣ more on this tomorrow

This can be very time consuming. The plots below use millions of toy pseudo-experiments on a model with ~50 parameters.
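A minimal sketch of this toy procedure for a counting-experiment analogue with one nuisance parameter (the simple Poisson-times-Gaussian model and all numbers are illustrative, not the ~50-parameter model used for the plots below):

```python
import numpy as np
from scipy.stats import poisson, norm

# Illustrative model: n ~ Pois(mu*s + b), background b constrained by a ~ Gauss(b, sigma_b)
s, b_nom, sigma_b = 10.0, 50.0, 5.0
b_grid = np.linspace(10.0, 90.0, 201)
mu_grid = np.linspace(0.0, 5.0, 101)

def neg2lnL(mu, b, n, a):
    return -2.0 * (poisson.logpmf(n, mu * s + b) + norm.logpdf(a, b, sigma_b))

def q_mu(mu, n, a):
    """-2 ln lambda(mu) via grid scans: conditional fit minus unconditional fit."""
    cond = neg2lnL(mu, b_grid, n, a).min()
    glob = neg2lnL(mu_grid[:, None], b_grid[None, :], n, a).min()
    return max(cond - glob, 0.0)

rng = np.random.default_rng(0)
mu_gen, n_toys = 1.0, 1000              # generate toys assuming mu = 1 and b = b_nom
toys = []
for _ in range(n_toys):
    n = rng.poisson(mu_gen * s + b_nom)   # randomize the main measurement
    a = rng.normal(b_nom, sigma_b)        # randomize the auxiliary measurement
    toys.append(q_mu(1.0, n, a))          # evaluate -2 ln lambda(1) and histogram it
toys = np.array(toys)

q_obs = q_mu(1.0, n=55, a=48.0)           # a hypothetical observation
print(f"q(mu=1) observed = {q_obs:.2f}, toy p-value = {np.mean(toys >= q_obs):.3f}")
```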

[Figures: distributions of the profile likelihood ratio test statistic for the signal-plus-background and background hypotheses, with the observed data value, for a 2-channel (3.35σ) and a 5-channel (4.4σ) combination.]

What makes a statistical method

To describe a statistical method, you should clearly specify
‣ choice of a test statistic

● simple likelihood ratio (LEP): $Q_{\rm LEP} = L_{s+b}(\mu=1)\,/\,L_{b}(\mu=0)$
● ratio of profiled likelihoods (Tevatron): $Q_{\rm TEV} = L_{s+b}(\mu=1,\hat{\hat{\nu}})\,/\,L_{b}(\mu=0,\hat{\hat{\nu}}')$

● profile likelihood ratio (LHC): $\lambda(\mu) = L_{s+b}(\mu,\hat{\hat{\nu}})\,/\,L_{s+b}(\hat{\mu},\hat{\nu})$

‣ how you build the distribution of the test statistic

● toy MC randomizing nuisance parameters according to π(ν) • aka Bayes-frequentist hybrid, prior-predictive, Cousins-Highland

● toy MC with nuisance parameters fixed (Neyman Construction)

● assuming asymptotic distribution (Wilks and Wald, more tomorrow)

‣ what condition you use for limit or discovery

● more on this tomorrow

Experimentalist Justification

So far this looks a bit like magic. How can you claim that you incorporated your systematic just by fitting the best value of your uncertain parameters and making a ratio? It won’t work unless the parametrization is sufficiently flexible. So check by varying the settings of your simulation, and see if the profile likelihood ratio is still distributed as a chi-square.

[Figure: ATLAS, probability vs log likelihood ratio for the nominal (fast sim) sample and variations — smeared missing p_T, Q² scales 1-4, leading-order tt, leading-order WWbb, full simulation — for ∫L dt = 10 fb⁻¹.]

Here it is pretty stable, but it’s not perfect (and this is a log plot, so it hides some pretty big discrepancies).

For the distribution to be independent of the nuisance parameters your parametrization must be sufficiently flexible.

A very important point

If we keep pushing this point to the extreme, the physics problem goes beyond what we can handle practically. The p-values are usually predicated on the assumption that the true distribution is in the family of functions being considered
‣ eg. we have sufficiently flexible models of signal & background to incorporate all systematic effects
‣ but we don’t believe we simulate everything perfectly
‣ ...and when we parametrize our models we usually have further approximated our simulation.

● nature -> simulation -> parametrization

At some point these approaches are limited by honest systematic uncertainties (not statistical ones). Statistics can only help us so much after this point. Now we must be physicists!


Confidence Intervals (Limits)


What is a “Confidence Interval”?

‣ you see them all the time:

[Figure: 68% CL contours in the (m_W, m_t) plane from LEP1 and SLD and from LEP2 and the Tevatron (prel.), with the SM prediction shown for m_H = 114, 300, 1000 GeV; m_W (GeV) vs m_t (GeV).]

Want to say there is a 68% chance that the true value of (m_W, m_t) is in this interval
‣ but that’s P(theory|data)!

The correct frequentist statement is that the interval covers the true value 68% of the time
‣ remember, the contour is a function of the data, which is random, so it moves around from experiment to experiment

The Bayesian “credible interval” does mean the probability that the parameter is in the interval. The procedure is very intuitive:

$$P(\theta \in V) = \int_V \pi(\theta \mid x)\, d\theta = \int_V \frac{f(x\mid\theta)\,\pi(\theta)}{\int d\theta'\, f(x\mid\theta')\,\pi(\theta')}\, d\theta$$

Neyman Construction example

For each value of θ consider f(x|θ)

[Figure: f(x|θ) shown for several values θ₀, θ₁, θ₂ along the θ axis, as a function of x.]

Neyman Construction example

Let’s focus on a particular point, f(x|θ₀)

[Figure: the single density f(x|θ₀) as a function of x.]

Neyman Construction example

Let’s focus on a particular point, f(x|θ₀)
‣ we want a test of size α
‣ equivalent to a 100(1−α)% confidence interval on θ
‣ so we find an acceptance region with probability 1−α

[Figure: f(x|θ₀) with an acceptance region containing probability 1−α.]

Neyman Construction example

Let’s focus on a particular point, f(x|θ₀)
‣ there is no unique choice of acceptance region
‣ here’s an example of a lower limit

[Figure: f(x|θ₀) with an acceptance region of probability 1−α and a single tail of probability α.]

Neyman Construction example

Let’s focus on a particular point, f(x|θ₀)
‣ there is no unique choice of acceptance region
‣ and here’s an example of a central interval

[Figure: f(x|θ₀) with a central acceptance region of probability 1−α and tails of probability α/2 on each side.]

Neyman Construction example

Let’s focus on a particular point, f(x|θ₀)
‣ the choice of this region is called an ordering rule
‣ in the Feldman-Cousins approach, the ordering rule is the likelihood ratio: find the contour of the likelihood ratio that gives size α

[Figure: f(x|θ₀) with the acceptance region of probability 1−α defined by the contour]

$$\frac{f(x\mid\theta_0)}{f(x\mid\theta_{\rm best}(x))} = k_\alpha$$
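A sketch of this ordering rule for the textbook Poisson-with-known-background case (the function names, numbers, and the assumption that the accepted set of n is contiguous are all illustrative): rank the outcomes n by R = P(n|µ)/P(n|µ_best), fill the acceptance region until it holds probability 1−α, then invert.

```python
import numpy as np
from scipy.stats import poisson

def fc_acceptance(mu, b, alpha=0.10, n_max=200):
    """Feldman-Cousins acceptance region for n ~ Pois(mu + b), with mu >= 0."""
    n = np.arange(n_max)
    p_mu = poisson.pmf(n, mu + b)
    mu_best = np.maximum(0.0, n - b)           # physically allowed best-fit signal
    p_best = poisson.pmf(n, mu_best + b)
    order = np.argsort(p_mu / p_best)[::-1]    # add n values in decreasing order of R
    accepted, total = [], 0.0
    for k in order:
        accepted.append(n[k])
        total += p_mu[k]
        if total >= 1.0 - alpha:
            break
    return min(accepted), max(accepted)

def fc_interval(n0, b, alpha=0.10):
    """Invert the belt: keep every mu whose acceptance region contains n0."""
    mu_grid = np.linspace(0.0, 30.0, 601)
    inside = [mu for mu in mu_grid
              if fc_acceptance(mu, b, alpha)[0] <= n0 <= fc_acceptance(mu, b, alpha)[1]]
    return min(inside), max(inside)

print(fc_interval(n0=0, b=3.0))   # for n0 = 0 the upper limit decreases as b grows
```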

Neyman Construction example

Now make an acceptance region for every value of θ

[Figure: f(x|θ) with acceptance regions drawn for θ₀, θ₁, θ₂, ... as a function of x.]

Neyman Construction example

This makes a confidence belt for θ

[Figure: the acceptance regions for all θ traced out as a belt in the (x, θ) plane.]

Neyman Construction example

This makes a confidence belt for θ; the regions of data in the confidence belt can be considered as consistent with that value of θ

[Figure: the confidence belt in the (x, θ) plane.]

Neyman Construction example

Now we make a measurement x0

the points θ where the belt intersects x₀ are part of the confidence interval in θ for this measurement

eg. [θ₋, θ₊]

[Figure: the confidence belt with the vertical line at x₀ intersecting it between θ₋ and θ₊.]

Neyman Construction example

For every point θ, if it were true, the data would fall in its acceptance region with probability 1−α. If the data fell in that region, the point θ would be in the interval [θ₋, θ₊]. So the interval [θ₋, θ₊] covers the true value with probability 1−α.
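A minimal numerical sketch of the construction and its inversion for a Gaussian measurement x ~ Gauss(θ, σ) with central acceptance regions (the grid, σ = 1, and the observed value are illustrative):

```python
import numpy as np
from scipy.stats import norm

sigma, alpha = 1.0, 0.32                      # 68% central acceptance regions
theta_grid = np.linspace(-5.0, 15.0, 2001)

# Acceptance region for each theta: central 1 - alpha probability in x
x_lo = theta_grid + sigma * norm.ppf(alpha / 2.0)
x_hi = theta_grid + sigma * norm.ppf(1.0 - alpha / 2.0)

# Invert the belt for an observed x0: keep every theta whose region contains x0
x0 = 3.2
covered = (x_lo <= x0) & (x0 <= x_hi)
theta_minus, theta_plus = theta_grid[covered].min(), theta_grid[covered].max()
print(f"[{theta_minus:.2f}, {theta_plus:.2f}]")   # roughly [x0 - sigma, x0 + sigma]
```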

[Figure: the belt and the interval [θ₋, θ₊] obtained from the measurement x₀.]

A Point about the Neyman Construction

This is not Bayesian... it doesn’t mean the probability that the true value of θ is in the interval is 1−α!

[Figure: the same belt with a particular θ_true indicated; the interval [θ₋, θ₊] obtained from a given x₀ either does or does not contain it.]

Inverting Hypothesis Tests

There is a precise dictionary that explains how to move from hypothesis testing to parameter estimation.
‣ Type I error: the probability the interval does not cover the true value of the parameters (it is now a function of the parameters)
‣ Power: the probability the interval does not cover a false value of the parameters (again, a function of the parameters)

● We don’t know the true value, so consider each point θ₀ as if it were true

What about null and alternate hypotheses?

‣ when testing a point θ₀ it is considered the null
‣ all other points are considered “alternate”

So what about the Neyman-Pearson lemma & likelihood ratio?
‣ as mentioned earlier, there are no guarantees like before
‣ a common generalization that has good power is:

$$\frac{f(x\mid H_0)}{f(x\mid H_1)} \;\to\; \frac{f(x\mid \theta_0)}{f(x\mid \theta_{\rm best}(x))}$$

The Dictionary

There is a formal 1-to-1 mapping between hypothesis tests and confidence intervals:
‣ some refer to the Neyman Construction as an “inverted hypothesis test”

“Test for θ=θ0” ↔ “Is θ0 in confidence interval for θ”

“There is thus no need to derive optimum properties separately for tests and for intervals; there is a one-to-one correspondence between the problems as in the dictionary in Table 20.1” – Stuart99, p. 175.

Using the likelihood ratio hypothesis test, this correspondence is the basis of the intervals in G. Feldman, R. Cousins, Phys. Rev. D57 3873 (1998). [Bob Cousins, CMS, 2008]

Discovery in pictures

Discovery: test b-only (null: s=0 vs. alt: s>0)
• note the one-sided alternative: larger N is “more discrepant”

[Figure: P(N | s+b) for the b-only and s+b hypotheses vs N events, with the observed value and the b-only p-value (aka “CL_b”) indicated.]

Sensitivity for discovery in pictures

When one specifies 5σ one specifies a critical value for the data before “rejecting the null”. This leaves open the question of sensitivity, which is quantified as the “power” of the test against a specific alternative
‣ in the frequentist setup, one chooses a “test statistic” to maximize power

● Neyman-Pearson lemma: likelihood ratio most powerful test for one-sided alternative
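A small sketch of size and power for a counting experiment (the numbers are illustrative): fix the 5σ critical value under the b-only hypothesis, then compute the power against a chosen s₀.

```python
from scipy.stats import poisson, norm

b, s0 = 100.0, 40.0
alpha = norm.sf(5.0)                       # size of a one-sided 5-sigma test
n_crit = int(poisson.isf(alpha, b)) + 1    # smallest N with P(N >= n_crit | b) <= alpha
power = poisson.sf(n_crit - 1, s0 + b)     # P(N >= n_crit | s0 + b)
print(f"critical region N >= {n_crit}, power against s0 = {power:.2f}")
```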

[Figure: P(N | s+b) for the b-only and s₀+b hypotheses vs N events; the critical region is defined by 5σ under b-only, and the power of the test against s₀ is the s₀+b probability beyond it.]

Measurements in pictures

A measurement is typically denoted θ = X ± Y.
‣ X is usually the “best fit” or maximum likelihood estimate
‣ ±Y usually means [X−Y, X+Y] is a 68% confidence interval

Intervals are formally “inverted hypothesis tests”: (null: s=s₀ vs. alt: s ≠ s₀)

‣ one hypothesis test for each value of s₀ against a two-sided alternative
‣ no “uniformly most powerful test” for a two-sided alternative

[Figure: P(N | s+b) vs N events for b-only, s_down+b, (s_best+b), and s_up+b; the observation sits at s_best+b, with 2.5% tails on each side for the excluded s_down and s_up.]

Upper limits in pictures

What do you think is meant by a “95% upper limit”?

Is it like the picture below (see also the sketch that follows)?
‣ i.e. increase s until the probability to have data “more discrepant” is < 5%
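A minimal sketch of that scan for a counting experiment with a known background (the numbers are illustrative): raise s until the probability of data at least as “background-like” as the observation drops below 5%.

```python
import numpy as np
from scipy.stats import poisson

def upper_limit(n_obs, b, cl=0.95):
    """Scan s upward until P(N <= n_obs | s + b) falls below 1 - cl."""
    for s in np.arange(0.0, 50.0, 0.001):
        if poisson.cdf(n_obs, s + b) < 1.0 - cl:
            return s
    return np.nan

print(upper_limit(n_obs=3, b=0.5))   # CLs+b-style 95% upper limit on s
```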

[Figure: P(N | s+b) for the b-only and s₉₅+b hypotheses vs N events; the observed value marks the boundary between “ok” and “excluded”, with a 5% tail (aka “CL_s+b”).]

Upper limits in pictures

Upper limits are trying to exclude large signal rates.

‣ form a 95% “confidence interval” on s of form [0,s95]

Intervals are formally “inverted hypothesis tests”: (null: s=s₀ vs. alt: s < s₀)

‣ One hypothesis test for each value of s0 against a one-sided alternative

The power of the test depends on the specific values of the null s₀ and the alternate s′
‣ but the test is “uniformly most powerful” since it is a one-sided alternative

[Figure: same picture as above — P(N | s+b) for the b-only and s₉₅+b hypotheses, with the 5% tail defining the excluded region.]

The sensitivity problem

The physicist’s worry about limits in general is that if there is a strong downward fluctuation, one might exclude arbitrarily small values of s
‣ with a procedure that produces proper frequentist 95% confidence intervals, one should expect to exclude the true value of s 5% of the time, no matter how small s is!

‣ This is not a problem with the procedure, but an undesirable consequence of the Type I / Type II error-rate setup

[Figure: P(N | s+b) for the b-only and s₉₅+b hypotheses, with the 5% tail.]

Power in the context of limits

Remember, when creating confidence intervals the null is s=s0 ‣ and power is defined under a specific alternative (eg. s=0)

[Figure: P(N | s+b) for the b-only and s₉₅+b hypotheses; the power of the test against s=0 is the b-only probability in the excluded region.]

CLs

To address the sensitivity problem, CLs was introduced

‣ common (misused) nomenclature: CLs = CLs+b/CLb

‣ idea: only exclude if CLs<5% (if CLb is small, CLs gets bigger)

CLs is known to be “conservative” (it over-covers): the expected limit covers with 97.5% probability

● Note: CLs is NOT a probability
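A minimal sketch of the CLs prescription for the same kind of counting experiment (the numbers are illustrative); note again that CL_s+b and CL_b are tail probabilities and their ratio is not itself a probability:

```python
import numpy as np
from scipy.stats import poisson

def cls_limit(n_obs, b, cl=0.95):
    """Raise s until CLs = CLs+b / CLb falls below 1 - cl."""
    cl_b = poisson.cdf(n_obs, b)                 # P(N <= n_obs | b)
    for s in np.arange(0.0, 50.0, 0.001):
        cl_sb = poisson.cdf(n_obs, s + b)        # P(N <= n_obs | s + b)
        if cl_sb / cl_b < 1.0 - cl:
            return s
    return np.nan

# With a strong downward fluctuation (n_obs well below b) the CLs+b limit can
# become arbitrarily small, while the CLs limit stays away from zero.
print(cls_limit(n_obs=0, b=5.0), cls_limit(n_obs=3, b=0.5))
```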

“The CLs ... methods combine size and power in a very ad hoc way and are unlikely to have satisfactory statistical properties.” -- D. Cox & N. Reid

[Figure: P(N | s+b) for the b-only and s₉₅+b hypotheses, with the “CL_b” and “CL_s+b” tail areas indicated.]

The Power Constraint

An alternative to CLs that protects against setting limits when one has no sensitivity is to explicitly define the sensitivity of the experiment in terms of power.
‣ A clean separation of size and power (a new, arbitrary threshold for sensitivity)
‣ Feldman-Cousins foreshadowed the recommendation, with sensitivity defined as 50% power against b-only
‣ David van Dyk presented a similar idea at PhyStat2011 [arXiv:1006.4334]

[Figure: P(N | s+b) for the b-only and s₉₅+b hypotheses; the power of the test against s=0 is indicated.]

[Figure, taken from the Feldman-Cousins paper: “FIG. 15. Comparison of the confidence region for an example of the toy model in which sin²(2θ) = 0 and the sensitivity of the experiment, as defined in the text.” — Δm² vs sin²(2θ), showing the 90% C.L. contour and the sensitivity.]

“Both measures are useful quantities that should be reported in order to extract the most science from catalogs”

“Power-Constrained” CLs+b limits

Even for s=0, there is a 5% chance of a strong downward fluctuation that would exclude the background-only hypothesis
‣ we don’t want to exclude signals for which we have no sensitivity
‣ idea: don’t quote a limit below some threshold defined by an N-σ downward fluctuation of b-only pseudo-experiments (for example: -1σ by convention)

[Figure: 95% upper limit on σ/σ_SM vs m_H (GeV), with the SM b-only expectation and bands. Annotations: where the observed limit is “too lucky” for comfort (below the -1σ background fluctuation), impose the “power constraint”; the -2σ band must go to 0 by a simple logical argument, so remove it.]

Coverage Comparison with CLs

The CLs procedure purposefully over-covers (“conservative”)
‣ and it is not possible for the reader to determine by how much

The power-constrained approach has the specified coverage until the constraint is applied, at which point the coverage is 100%
‣ limits are not ‘aggressive’ in the sense that they under-cover
‣ recent critique of PCL here: http://arxiv.org/pdf/1109.2023v1

[Figure: coverage probability vs µ for CLs and for PCL with µ_min = 0.64; CLs over-covers everywhere, while PCL has the nominal coverage above the constraint and 100% below it.]

Now let’s study Feldman-Cousins

[Figure: the Feldman-Cousins confidence belt, µ vs x.]

[Excerpt from the Feldman-Cousins paper:] ... with Gaussian intervals), 90%, 95%, and 99%. Values in italics indicate results which must be taken with particular caution, since the probability of obtaining the number of events observed or fewer is less than 1%, even if µ = 0. (See Sec. IV C below.)

Figure 8 shows, for n = 0 through n = 10, the value of µ2 as a function of b, for 90% C.L. The small horizontal sections in the curves are the result of the mild pathology mentioned above, in which the original curves make a small dip, which we have eliminated. Dashed portions in the lower right indicate results which must be taken with particular caution, corresponding to the italicized values in the tables. Dotted portions on the upper left indicate regions where µ1 is non-zero. These corresponding values of µ1 are shown in Fig. 9. Figure 8 can be compared with the Bayesian calculation in Fig. 28.8 of Ref. [2], which uses a uniform prior for µt. A noticeable difference is that our curve for n = 0 decreases as a function of b, while the result of the Bayesian calculation stays constant (at 2.3). The decreasing limit in our case reflects the fact that P(n0|µ) decreases as b increases. We find that objections to this behavior are typically based on a misplaced Bayesian interpretation of classical intervals, namely the attempt to interpret them as statements about P(µt|n0).

Values of n are added to the acceptance region for a given µ in decreasing order of R, until the sum of P(n|µ) meets or exceeds the desired C.L. This ordering, for values of n necessary to obtain a total probability of 90%, is shown in the column labeled “rank”. Thus, the acceptance region for µ = 0.5 (analogous to a horizontal line segment in Figure 1) is the interval n = [0, 6]. Due to the discreteness of n, the acceptance region contains more summed probability than 90%; this is unavoidable no matter what the ordering principle, and leads to confidence intervals which are conservative. For comparison, in the column of Table I labeled “U.L.”, we place check marks at the values of n which are in the acceptance region of standard 90% C.L. upper limits for this example; and in the column labeled “central”, we place check marks at the values of n which are in the acceptance region of standard 90% C.L. central confidence intervals.

The construction proceeds by finding the acceptance region for all values of µ, for the given value of b. With a computer, we perform the construction on a grid of discrete values of µ, in the interval [0, 50] in steps of 0.005. This suffices for the precision desired (0.01) in the endpoints of confidence intervals. We find that a mild pathology arises as a result of the fact that the observable n is discrete, and an additional mild pathology when we compare the results for different values of b for fixed n0: the upper endpoint µ2 is not always a decreasing function of b, as would be expected. When this happens, we force the function to be non-increasing, by lengthening selected confidence intervals as necessary. Our compensation for the two pathologies adds slightly to our intervals’ conservatism, which however remains dominated by the unavoidable effects due to the discreteness in n.

The confidence belts resulting from our construction are shown in Fig. 7, which may be compared with Figs. 5 and 6. At large n, Fig. 7 is similar to Fig. 6; the background is effectively subtracted without constraint, and our ordering principle produces two-sided intervals which are approximately central intervals. At small n, the confidence intervals from Fig. 7 automatically become upper limits on µ; i.e., the lower endpoint µ1 is 0 for n ≤ 4 in this case. Thus, flip-flopping between Figs. 5 and 6 is replaced by one coherent set of confidence intervals (and no interval is the empty set). Tables II-IX give our confidence intervals [µ1, µ2] for the signal mean µ for the most commonly used confidence levels, namely 68.27% (sometimes called 1-σ intervals by analogy with Gaussian intervals), ...

FIG. 1. A generic confidence belt construction and its use. For each value of µ, one draws a horizontal acceptance interval [x1, x2] such that P(x ∈ [x1, x2] | µ) = α. Upon performing an experiment to measure x and obtaining the value x0, one draws the dashed vertical line through x0. The confidence interval [µ1, µ2] is the union of all values of µ for which the corresponding acceptance interval is intercepted by the vertical line.

B. Gaussian with Boundary at Origin

It is straightforward to apply our ordering principle to the other troublesome example of Sec. III, the case of a Gaussian resolution function (Eq. 3.1) for µ, when µ is physically bounded to non-negative values. In analogy with the Poisson case, for a particular x, we let µbest be the physically allowed value of µ for which P(x|µ) is maximum. Then µbest = max(0, x), and

$$P(x\mid\mu_{\rm best}) = \begin{cases} 1/\sqrt{2\pi}, & x \ge 0 \\ \exp(-x^2/2)/\sqrt{2\pi}, & x < 0 \end{cases} \quad (4.2)$$

We then compute R in analogy to Eq. 4.1, using Eqs. 3.1 and 4.2:

$$R(x) = \frac{P(x\mid\mu)}{P(x\mid\mu_{\rm best})} = \begin{cases} \exp(-(x-\mu)^2/2), & x \ge 0 \\ \exp(x\mu - \mu^2/2), & x < 0 \end{cases} \quad (4.3)$$

During our Neyman construction of confidence intervals, R determines the order in which values of x are added to the acceptance region at a particular value of µ. In practice, this means that for a given value of µ, one finds the interval [x1, x2] such that R(x1) = R(x2) and

$$\int_{x_1}^{x_2} P(x\mid\mu)\, dx = \alpha. \quad (4.4)$$

We solve for x1 and x2 numerically to the desired precision, for each µ in a grid with 0.001 spacing. With the acceptance regions all constructed, we then read off the confidence intervals [µ1, µ2] for each x0 as in Fig. 1. Figure 10 shows the confidence belt for 90% C.L.; at large x, the confidence intervals [µ1, µ2] are the same as in the unconstrained case, since that is far away from the constraining boundary.

A different way to picture Feldman-Cousins

Most people think of the plot on the left when thinking of Feldman-Cousins
‣ bars are regions “ordered by” R = P(n|µ)/P(n|µbest), with ∫ P(x|µ) dx = α

But this picture doesn’t generalize well to many measured quantities.
‣ Instead, just use R as the test statistic; R is a ratio of two likelihoods: the likelihood of obtaining n given the actual mean µ, and the likelihood of obtaining n given the best-fit physically allowed mean.

[Excerpt from Cowan, Cranmer, Gross & Vitells, “Asymptotic formulae for likelihood-based tests of new physics”:]

In many analyses, the contribution of the signal process to the mean number of events is assumed to be non-negative. This condition effectively implies that any physical estimator for µ must be non-negative. Even if we regard this to be the case, however, it is convenient to define an effective estimator µ̂ as the value of µ that maximizes the likelihood, even if this gives µ̂ < 0 (but providing that the Poisson mean values, µsi + bi, remain nonnegative). This will allow us in Sec. 3.1 to model µ̂ as a Gaussian distributed variable, and in this way we can determine the distributions of the test statistics that we consider. Therefore in the following we will always regard µ̂ as an effective estimator which is allowed to take on negative values.

2.1 Test statistic tµ = −2 ln λ(µ)

From the definition of λ(µ) in Eq. (7), one can see that 0 ≤ λ ≤ 1, with λ near 1 implying good agreement between the data and the hypothesized value of µ. Equivalently it is convenient to use the statistic

$$t_\mu = -2\ln\lambda(\mu) \quad (8)$$

as the basis of a statistical test. Higher values of tµ thus correspond to increasing incompatibility between the data and µ. To quantify the level of disagreement we compute the p-value,

$$p_\mu = \int_{t_{\mu,{\rm obs}}}^{\infty} f(t_\mu\mid\mu)\, dt_\mu, \quad (9)$$

where tµ,obs is the value of the statistic tµ observed from the data and f(tµ|µ) denotes the pdf of tµ under the assumption of the signal strength µ. Useful approximations for this and other related pdfs are given in Sec. 3.3. When using the statistic tµ, a data set may result in a low p-value in two distinct ways: the estimated signal strength µ̂ may be found greater or less than the hypothesized value µ. As a result, the set of µ values that are rejected because their p-values are found below a specified threshold α may lie to either side of those values not rejected, i.e., one may obtain a two-sided confidence interval for µ.

2.2 Test statistic t̃µ for µ ≥ 0

Often one assumes that the presence of a new signal can only increase the mean event rate beyond what is expected from background alone. That is, the signal process necessarily has µ ≥ 0, and to take this into account we define an alternative test statistic below called t̃µ. Even when considering models for which µ ≥ 0, however, we will not restrict the effective estimator µ̂ to be positive, and if the data fluctuate low relative to the expected background one can find µ̂ < 0. For a model where µ ≥ 0, if one finds data such that µ̂ < 0, then the best level of agreement between the data and any physical value of µ occurs for µ = 0. We therefore define

$$\tilde{\lambda}(\mu) = \begin{cases} \dfrac{L(\mu, \hat{\hat{\theta}}(\mu))}{L(\hat{\mu}, \hat{\theta})}, & \hat{\mu} \ge 0 \\[2ex] \dfrac{L(\mu, \hat{\hat{\theta}}(\mu))}{L(0, \hat{\hat{\theta}}(0))}, & \hat{\mu} < 0 \end{cases} \quad (10)$$

Here θ̂̂(0) and θ̂̂(µ) refer to the conditional ML estimators of θ given a strength parameter of 0 or µ, respectively. The variable λ̃(µ) can be used instead of λ(µ) in Eq. (8) to obtain the corresponding test statistic, which we denote t̃µ. That is,

$$\tilde{t}_\mu = -2\ln\tilde{\lambda}(\mu) = \begin{cases} -2\ln\dfrac{L(\mu,\hat{\hat{\theta}}(\mu))}{L(0,\hat{\hat{\theta}}(0))}, & \hat{\mu} < 0 \\[2ex] -2\ln\dfrac{L(\mu,\hat{\hat{\theta}}(\mu))}{L(\hat{\mu},\hat{\theta})}, & \hat{\mu} \ge 0 \end{cases} \quad (11)$$

As was done with the statistic tµ, one can quantify the level of disagreement between the data and the hypothesized value of µ with the p-value, just as in Eq. (9). Also similar to the case of tµ, values of µ both above and below µ̂ may be excluded by a given data set, i.e., one may obtain either a one-sided or two-sided confidence interval for µ. For the case of no nuisance parameters, the test variable t̃µ is equivalent to what is used in constructing confidence intervals according to the procedure of Feldman and Cousins [8].

Feldman-Cousins with and without constraint

With a physical constraint (µ ≥ 0) the confidence band changes, but conceptually it is the same. One does not get empty intervals.

[Figure: the two-sided unconstrained statistic -2 ln λ(µ) compared with the two-sided constrained statistic -2 ln λ̃(µ).]

2.3 Test statistic q0 for discovery of a positive signal

An important special case of the statistic t̃µ described above is used to test µ = 0 in a class of models where we assume µ ≥ 0. Rejecting the µ = 0 hypothesis effectively leads to the discovery of a new signal. For this important case we use the special notation q0 = t̃0. Using the definition (11) with µ = 0 one finds

$$q_0 = \begin{cases} -2\ln\lambda(0), & \hat{\mu} \ge 0 \\ 0, & \hat{\mu} < 0 \end{cases} \quad (12)$$

where λ(0) is the profile likelihood ratio for µ = 0 as defined in Eq. (7).

We may contrast this to the statistic t0, i.e., Eq. (8), used to test µ = 0. In this case one may reject the µ = 0 hypothesis for either an upward or downward fluctuation of the data. This is appropriate if the presence of a new phenomenon could lead to an increase or decrease in the number of events found. In an experiment looking for neutrino oscillations, for example, the signal hypothesis may predict a greater or lower event rate than the no-oscillation hypothesis.

When using q0, however, we consider the data to show lack of agreement with the background-only hypothesis only if µ̂ > 0. That is, a value of µ̂ much below zero may indeed constitute evidence against the background-only model, but this type of discrepancy does not show that the data contain signal events, but rather points to some other systematic error. For the present discussion, however, we assume that the systematic uncertainties are dealt with by the nuisance parameters θ.

If the data fluctuate such that one finds fewer events than even predicted by background processes alone, then µ̂ < 0 and one has q0 = 0. As the event yield increases above the expected background, i.e., for increasing µ̂, one finds increasingly large values of q0, corresponding to an increasing level of incompatibility between the data and the µ = 0 hypothesis. To quantify the level of disagreement between the data and the hypothesis of µ = 0 using the observed value of q0 we compute the p-value in the same manner as done with tµ, namely,

$$p_0 = \int_{q_{0,{\rm obs}}}^{\infty} f(q_0\mid 0)\, dq_0, \quad (13)$$

which can be expressed as a significance using Eq. (1). Here f(q0|0) denotes the pdf of the statistic q0 under assumption of the background-only (µ = 0) hypothesis. An approximation for this and other related pdfs are given in Sec. 3.5.

2.4 Test statistic qµ for upper limits

For purposes of establishing an upper limit on the strength parameter µ, we consider two closely related test statistics. First, we may define

$$q_\mu = \begin{cases} -2\ln\lambda(\mu), & \hat{\mu} \le \mu \\ 0, & \hat{\mu} > \mu \end{cases} \quad (14)$$

where λ(µ) is the profile likelihood ratio as defined in Eq. (7). The reason for setting qµ = 0 for µ̂ > µ is that when setting an upper limit, one would not regard data with µ̂ > µ as representing less compatibility with µ than the data obtained, and therefore this is not taken as part of the rejection region of the test. From the definition of the test statistic one sees that higher values of qµ represent greater incompatibility between the data and the hypothesized value of µ. One should note that q0 is not simply a special case of qµ with µ = 0, but rather has a different definition (see Eqs. (12) and (14)): q0 is zero if the data fluctuate downward (µ̂ < 0), but qµ is zero if the data fluctuate upward (µ̂ > µ). As with the case of discovery, one quantifies the level of agreement between the data and the hypothesized µ with the p-value

$$p_\mu = \int_{q_{\mu,{\rm obs}}}^{\infty} f(q_\mu\mid\mu)\, dq_\mu, \quad (15)$$

which can be expressed as a significance using Eq. (1). Here f(qµ|µ) is the pdf of qµ assuming the hypothesis µ. In Sec. 3.6 we provide useful approximations for this and other related pdfs.

2.5 Alternative test statistic q̃µ for upper limits

For the case where one considers models for which µ ≥ 0, the variable λ̃(µ) can be used instead of λ(µ) in Eq. (14) to obtain the corresponding test statistic, which we denote q̃µ. That is,

$$\tilde{q}_\mu = \begin{cases} -2\ln\tilde{\lambda}(\mu), & \hat{\mu} \le \mu \\ 0, & \hat{\mu} > \mu \end{cases} = \begin{cases} -2\ln\dfrac{L(\mu,\hat{\hat{\theta}}(\mu))}{L(0,\hat{\hat{\theta}}(0))}, & \hat{\mu} < 0 \\[2ex] -2\ln\dfrac{L(\mu,\hat{\hat{\theta}}(\mu))}{L(\hat{\mu},\hat{\theta})}, & 0 \le \hat{\mu} \le \mu \\[2ex] 0, & \hat{\mu} > \mu \end{cases} \quad (16)$$

We give an approximation for the pdf f(q̃µ|µ′) in Sec. 3.7. In numerical examples we have found the difference between the tests based on qµ (Eq. (14)) and q̃µ usually to be negligible, but use of qµ leads to important simplifications. Furthermore, in the context of the approximation used in Sec. 3, the two statistics are equivalent: assuming the approximations below, qµ can be expressed as a monotonic function of q̃µ and thus they lead to the same results.

Modified test statistic for 1-sided upper limits

For a 1-sided upper limit one constructs a test that is more powerful for all µ > 0 (but has no power for µ = 0) simply by discarding “upward fluctuations”.

[Figure: the one-sided unconstrained statistic compared with the one-sided constrained statistic.]

3 Approximate sampling distributions

In order to find the p-value of a hypothesis using Eqs. (13) or (15) we require the sampling distribution for the test statistic being used. In the case of discovery we are testing the background-only hypothesis (µ = 0) and therefore we need f(q0|0), where q0 is defined by Eq. (12). When testing a nonzero value of µ for purposes of finding an upper limit we need the distribution f(qµ|µ) where qµ is defined by Eq. (14), or alternatively we require the pdf of the corresponding statistic q̃µ as defined by Eq. (16). In this notation the subscript of q refers to the hypothesis being tested, and the second argument in f(qµ|µ) gives the value of µ assumed in the distribution of the data.

We also need the distribution f(qµ|µ′) with µ ≠ µ′ to find what significance to expect and how this is distributed if the data correspond to a strength parameter different from the one being tested. For example, it is useful to characterize the sensitivity of a planned experiment by quoting the median significance, assuming data distributed according to a specified signal model, with which one would expect to exclude the background-only hypothesis. For this one would need f(q0|µ′), usually with µ′ = 1. From this one can find the median q0, and thus the median discovery significance. When considering upper limits, one would usually quote the value of µ for which the median p-value is equal to 0.05, as this gives the median upper limit on µ at 95% confidence level. In this case one would need f(qµ|0) (or alternatively f(q̃µ|0)).

In Sec. 3.1 we present an approximation for the profile likelihood ratio, valid in the large sample limit. This allows one to obtain approximations for all of the required distributions, which are given in Sections 3.3 through 3.6. The approximations become exact in the large sample limit.
As with the case of discovery, one quantifiesthe the distribution level of agreementf(qµ µ)where betweenqµ isWe defined also the need bydata the Eq. distribution and (14), orf alternatively(qµ µ￿)withµ = weµ￿ to require find what the significance pdf to expect and | | ￿ hypothesized µ with p-value. For, e.g., an observedof the value correspondingq , statisticone hasq ˜µ ashow defined this is distributed by Eq. (16).if the data In correspondthis notation to a strength the subscript parameter of diffqerent from the one µ,obs being tested. For example, it is useful to characterize the sensitivity of a planned experiment refers to the hypothesis being tested, and the second argument in f(qµ µ) gives the value of by quoting the median significance, assuming dataµ|95 distributed according to a specified signal µ assumed in the distributionµ of themodel, data. with which one would expect to exclude the background-onlyµ hypothesis. For this one 0 would need f(q0 µ0￿), usually with µ￿ = 1. From this one can find the median q0, and thus the We also need the distribution f(qµ µ￿)with|µ = µ￿ to find what significance to expect and median| discovery significance.￿ When considering upper limits, one would usually quote the Kyle Cranmer (NYU) how this isCERN distributed School if HEP, the data Romania,value correspond of µSept.for which 2011 to a the strength median p parameter-value is equal di tofferent 0.05, as from this gives the137 the one median upper limit 7being tested. For example, it is usefulon µ at to 95% characterize confidence level. the sensitivityIn this case one of would a planned need f( experimentq 0) (or alternatively f(˜q 0)). µ| µ| by quoting the median significance, assumingIn Sec. 3.1 we data present distributed an approximation according for the to profile a specified likelihood signal ratio, valid in the large model, with which one would expectsample to exclude limit. This the allows background-only one to obtain approximations hypothesis. for For all of this the requiredone distributions, would need f(q µ ), usually with whichµ = are1. From given in this Sections one can3.3 through find the 3.6 median The approximationsq , and thus become the exact in the large 0| ￿ ￿ 0 median discovery significance. When considering upper limits, one would usually quote the value of µ for which the median p-value is equal to 0.05, as this gives the8 median upper limit on µ at 95% confidence level. In this case one would need f(q 0) (or alternatively f(˜q 0)). µ| µ| In Sec. 3.1 we present an approximation for the profile likelihood ratio, valid in the large sample limit. This allows one to obtain approximations for all of the required distributions, which are given in Sections 3.3 through 3.6 The approximations become exact in the large

8 Study on H WW at 160GeV →

Center for Extracting the upper limits Cosmology and A real life example Particle Physics EachThe colored colored curve curves is represents stand for values a single of pseudo-experiment evaluating Toy-Data • ‣withthe test diff statisticerent µ ispoints changing as µ, the parameter of interest, changes

µ Haoshuang Ji First attempt on Higgs combination Kyle Cranmer (NYU) CERN School HEP, Romania, Sept. 2011 138 Center for Cosmology and Coverage Particle Physics Coverage is the probability that the interval covers the true value.

Methods based on the Neyman-Construction always cover.... by construction. ‣ sometimes they over-cover (eg. “conservative”) Bayesian methods, do not necessarily cover ‣ but that’s not their goal. ‣ but that also means you shouldn’t interpret a 95% Bayesian “Credible Interval” in the same way

Coverage can be thought of as a calibration of our statistical apparatus. [explain under-/over-coverage]

Kyle Cranmer (NYU) CERN School HEP, Romania, Sept. 2011 139 Center for Cosmology and Discrete Problems Particle Physics In discrete problems (eg. number counting analysis with counts described by a Poisson) one sees: ‣ discontinuities in the coverage (as a function of parameter) ‣ over-coverage (in some regions) ‣ Important for experiments with few events. There is a lot of discussion about this, not focusing on it here

0.045 0.04 0.035 0.03 0.025 0.02 0.015 0.01 0.005 0 0 2 4 6 8 10 12 14 16 18 20 n

Kyle Cranmer (NYU) CERN School HEP, Romania, Sept. 2011 140 Center for Cosmology and Neyman Construction with Nuisance parameters Particle Physics In the strict sense, one wants coverage for µ for all values of the nuisance parameters (here $) ‣ The “full construction” one n Challenge for full Neyman Construction is computational time (scan in 50- D isn’t practical) and to avoid significant over-coverage ‣ note: projection of nuisance parameters is a union (eg. set theory) not an integration (Bayesian) PHYSTAT2003, SLAC, Stanford, California, September 8-11, 2003

ideal shape of conf. region 140 , but some overcoverage may just be a natural M 120 (x0,e0) ε 100

80

60

40

20 0 50 100 150 200 250 µ min µmax µ x

G. Punzi - PHYSTAT 05 - Oxford, UK Figure 1: TheK. CranmerNeyman - PHYSTATconstruction 03 - SLACfor a test statistic x, Figure 2: Contours of the likelihood L(x, M H0, b) are | Kyle Cranmer (NYU) CERN School HEP, Romania,an Sept.auxiliary 2011 measurement M, and a nuisance parameter141 shown as concentric ellipses for b = 32 and b = 80. b. Vertical planes represent acceptance regions Wb for H0 Contours of the likelihood ratio in Eq. 5 are shown as given b. The condition for discovery corresponds to data diagonal lines. This figure schematically illustrates that if (x0, M0) that do not intersect any acceptance region. one chooses acceptance regions based solely on contours The contours of L(x, M H0, b) are in color. of the likelihood ratio, that similarity is badly violated. | For example, data M = 80, x = 130 would be considered part of the acceptance region for b = 32, even though it ˆ should clearly be ruled out. where ˆb conditionally maximizes L(x, M H1, b) and ˆb conditionally maximizes L(x, M H , b). | | 0 Now let us take s = 50 and ∆ = 5%, both of which region. Fig. 1 shows the acceptance region with this could be determined from Monte Carlo. In our toy ex- slight modification. 7 ample, we collect data M0 = 100. Let α = 2.85 10− , In the case where s = 50, ∆ = 5%, and M0 = 100, which corresponds to 5σ. The question now is· how one must observe 167 events to claim a discovery. many events x must we observe to claim a discovery?1 While no figure is provided, the range of b consis- The condition for discovery is that (x0, M0) do not lie tent with M0 = 100 (and no constraint on x) is in any acceptance region W . In Fig. 1 a sample of b [68, 200]. In this range, the tests are similar to b ∈ acceptance regions are displayed. One can imagine a a very high degree. horizontal plane at M0 = 100 slicing through the var- ious acceptance regions. The condition for discovery is that x0 > xmax where xmax is the maximal x in the 7. THE COUSINS-HIGHLAND intersection. TECHNIQUE There is one subtlety which arises from the or- dering rule in Eq. 5. The acceptance region Wb = The Cousins-Highland approach to hypothesis test- (x, M) l > lα is bounded by a contour of the ing is quite popular [4] because it is a simple smear- { | } likelihood ratio and must satisfy the constraint of size: ing on the nuisance parameter [5]. In particular, the L(x, M H0, b) = (1 α). While it is true that background-only hypothesis L(x H , b) is transformed Wb | − 0 the likelihood is independent of b, the constraint on from a compound hypothesis with| nuisance parameter ! size is dependent upon b. Similar tests are achieved b to a simple hypothesis L"(x H ) by | 0 when lα is independent of b. The contours of the like- lihood ratio are shown in Fig. 2 together with con- L"(x H0) = L(x H0, b)L(b)db, (6) tours of L(x, M H , b). While tests are roughly sim- | | 0 "b ilar for b M|, similarity is violated for M b. This violation≈ should be irrelevant because clearly# where L(b) is typically a normal distribution. The b M should not be accepted. This problem can problem with this method is largely philosophical: be#avoided by clipping the acceptance region around L(b) is meaningless in a frequentist formalism. In a M = b N∆b, where N is sufficiently large ( 10) Bayesian formalism one can obtain L(b) by consider- to have ±negligible affect on the size of the acceptance≈ ing L(M b) and inverting it with the use of Bayes’s theorem |and the a priori likelihood for b. Typically, L(M b) is normal and one assumes a flat prior on b. 
In |the case where s = 50, L(b) is a normal distribu- 1 tion with mean µ = M = 100 and standard deviation In practice, one would measure x0 and M0 and then ask, 0 “have we made a discovery?”. For the sake of explanation, we σ = ∆M0 = 5, one must observe 161 events to claim a have broken this process into two pieces. discovery. Initially, one might think that 161 is quite

Profile Construction

Gary Feldman presented an approximate Neyman Construction, based on the profile likelihood ratio as an ordering rule, but only performing the construction on a subspace (eg. the nuisance parameters fixed to their conditional maximum likelihood estimates)

The profile construction means that one does not need to scan each nuisance parameter (it keeps the dimensionality constant)
‣ easier computationally (implemented in RooStats)

This approximation does not guarantee exact coverage, but
‣ tests indicate impressive performance
‣ one can expand about the profile construction to improve coverage, with the limiting case being the full construction


Lecture 4

Kyle Cranmer (NYU) CERN School HEP, Romania, Sept. 2011 143 Center for Cosmology and Particle Physics

Asymptotic Properties of likelihood based tests

&

Likelihood-based methods

Kyle Cranmer (NYU) CERN School HEP, Romania, Sept. 2011 144 Center for Cosmology and Likelihood-based Intervals Particle Physics

Wilks’s theorem tells us how the profile 2 log λ(θ) χ2 likelihood ratio evaluated at % is − ∼ n “asymptotically” distributed when ! is true ‣ asymptotically means there is sufficient

data that the log- is ) θ | ) θ (

parabolic λ

‣ 2 log

does NOT require the model f(x|!) to be − ( Gaussian f 2 log λ(θ) ‣ there are some conditions that must be − met for this to true Note common exceptions:

‣ a parameter has no effect on the Trial factors or the look likelihood (eg. mH when testing s=0) elsewhere effect in high energy related to look-elsewhere effect physics. Eilam Gross, Ofer Vitells

‣ require s!0, but this just leads to a Eur.Phys.J. C70 (2010) 525-530 &-function at 0 + "#$ e-Print: arXiv:1005.1891 [physics.data-an] Kyle Cranmer (NYU) CERN School HEP, Romania, Sept. 2011 145 Center for Cosmology and Likelihood-based Intervals Particle Physics

Wilks’s theorem tells us how the profile 2 log λ(θ) χ2 likelihood ratio evaluated at % is − ∼ n “asymptotically” distributed when ! is true ‣ asymptotically means there is sufficient

data that the log-likelihood function is ) θ | ) θ (

parabolic λ

‣ 2 log

does NOT require the model f(x|!) to be − ( Gaussian f 2 log λ(θ) So we don’t really need to go to the − trouble to build its distribution by using Toy Monte Carlo or fancy tricks with Fourier Transforms θ

We can go immediately to the threshold value of the profile likelihood ratio

Kyle Cranmer (NYU) CERN School HEP, Romania, Sept. 2011 146 Likelihood-Ratio Interval example

Center for Cosmology and Likelihood-based Intervals Particle Physics

68% C.L. likelihood-ratio interval 2 n for Poisson process with n=3 χ ∼ observed: ) θ ( λ !"(µ) = µ3 exp(-µ)/3! ) θ 2 log Maximum at µ = 3. ( λ − 2 log

2 −

|

∆2ln = 1 for approximate ±1 − ) θ ) θ ( λ log 2 ( ! f Gaussian standard deviation θ yields interval [1.58, 5.08]

!"#$%&'(%)*'+,'-)$."/.0'''''''''''''And typically we only show the likelihood 1*,'2,'345.,'67'789':;88<= curve and don’t even bother with the implicit (asymptotic) distribution

Kyle Cranmer (NYU) CERN School HEP, Romania, Sept. 2011 148 Bob Cousins, CMS, 2008 35 Center for Cosmology and “The Asimov paper” Particle Physics Recently we showed how to generalize this asymptotic approach ‣ generalize Wilks’s theorem when boundaries are present ‣ use result of Wald to get f(-2log!(µ) | µ’) Asymptotic formulae for likelihood-based tests of new physics Glen Cowan, Kyle Cranmer, Eilam Gross, Ofer Vitells

Eur.Phys.J.C71:1554,2011 http://arxiv.org/abs/1007.1727v2

med[q |µ’] f(q |µ) µ µ f(q | ’) µ µ

p−value

Figure 2: Illustration of the the p- value corresponding to the median of qµ assuming a strength parame- q ! µ ter µ (see text).

Kyle Cranmer (NYU) CERN School HEP, Romania, Sept. 2011 149 procedure can be extended to the case where several search channels are combined, and in Sec. 4.3 we describe how to give statistical error bands for the sensitivity.

4.1 The median significance from Asimov values of the test statistic

By using the Asimov data set one can easily obtain the median values of q0, qµ andq ˜µ,and these lead to simple expressions for the corresponding median significance. From Eqs. (53), (60) and (68) one sees that the significance Z is a monotonic function of q,andtherefore the median Z is simply given by the corresponding function of the median of q,whichis approximated by its Asimov value. For discovery using q0 one wants the median discov- ery significance assuming a strength parameter µ! and for upper limits one is particularly interested in the median exclusion significance assuming µ =0,med[Z 0]. For these one ! µ| obtains

med[Z0 µ!]=√q0,A , (79) | med[Z 0] = √q . (80) µ| µ, A

When usingq ˜µ for establishing upper limits, the general expression for the exclusion significance Zµ is somewhat more complicated depending on µ!,butisinanycasefoundby substituting the appropriate values ofq ˜µ, A and σA into Eq. (68). For the usual case where one wants the median significance for µ assuming data distributed according to the background- only hypothesis (µ! =0),Eq.(68)reducesinfacttoarelationofthesameformasEq. (60), and therefore one finds

med[Z 0] = q˜ . (81) µ| µ, A ! 4.2 Combining multiple channels

In many analyses, there can be several search channels which need to be combined. For each channel i there is a likelihood function Li(µ, θi), where θi represents the set of nuisance parameters for the ith channel, some of which may be common between channels. Here the strength parameter µ is assumed to be the same for all channels. If the channels are statistically independent, as can usually be arranged, the full likelihood function is given by the product over all of the channels,

20 0 10 median[q | 0] µ

!1 10

!2 10

!3 10 Figure 9: The distributions f(qµ|0) (red) and f(qµ|µ)(blue) !4 10 from both the asymptotic formulae 0 5 10 15 20 q and Monte Carlo histograms (see µ text). Center for Cosmology and Median & bands from asymptotics Particle Physics

GetThe verticalMedian line and in Fig. bands 9 gives in the seconds, median value not of q µdays!assuming a strength parameter µ! =0.Theareatotherightofthislineunderthecurveoff(qµ µ)givesthep-value of 0 350 | the10 hypothesized µmedian[q,asshownshadedingreen.Theupperlimiton | 0] µ at a confidence level µ 300 CL = 1 α is the value of µ for which the p-value is p = α.Figure9showsthedistributions !1 µ 10 − 250 for the value of µ that gave pµ =0.05, corresponding to the 95% CL upper limit. 200 !2 10In addition to reporting the median limit, one would like to know how much it would vary for given statistical fluctuations in the data. This is illustratedEvents 150 in Fig. 10, which shows the !3 100 same10 distributions as in Figure 9, but here the vertical line indicates the 15.87% quantile of the Figure 9: The distributions Figure 11: Distribution of the f q µ 50 distribution ( µ 0), corresponding to having ˆ fluctuate downwardf(qµ|0) (red) one and standardf(qµ|µ)(blue) deviation upper limit on µ at 95% CL, as- !4 | below10 its median. from0 both the asymptotic formulae suming data corresponding to the 0 5 10 15 20 !2 !1 0 1 2 3 4 background-only hypothesis (see and Monte Carlo histograms95 (see q µ text). µ text). up 0 CL 95% limits 10 s+b 15.87% quantile (median!1!) 3 The!1 vertical line in Fig. 9 gives the median value of q assuming a strength parameter 10 2.5µ µ! =0.Theareatotherightofthislineunderthecurveoff(qµ µ)givesthep-value of µ 2 |µ the hypothesized!2 ,asshownshadedingreen.Theupperlimiton at a confidence level 10 µ up 1.5 CL = 1 α is the value of µ for which the p-value is pµ = α.Figure9showsthedistributions − µ p . for the!3 value of that gave µ =005, corresponding to the1 95% CL upper limit. 10 Figure 12: The median (central In addition to reporting the median limit, one would like0.5 to know how much it would vary Figure 10: The distributions blue line) and error bands (±1σ in !4 green, ±2σ in yellow) for the 95% for given10 statistical fluctuations in the data. This is illustrated0f(qµ|0) in (red) Fig. and 10,f which(qµ|µ)(blue)as shows the 10 20 30 40 50 60 70 80 90 0 5 10 15 20 25 30 in Fig. 9 and the 15.87% quantile of CL upper limit on the strength pa- same distributions as in Figureq 9, but here the vertical line indicates the 15.87% quantilem of the rameter µ (see text). µ f(q |0) (see text). distribution f(qµ 0), corresponding to havingµ ˆ fluctuate downwardµ one standard deviation Kyle Cranmer (NYU)| CERN School HEP, Romania, Sept. 2011 150 below its median. 6ImplementationinRooStats By simulating the experiment many times with Monte Carlo, we can obtain a histogram µ of the0 upper limits on at 95% CL, as shown in Fig. 11. The 1σ (green) and 2σ (yellow) 10 Many of± the results presented± above are implemented or are being implemented in the error bands are obtained from the MC experiments. The vertical lines indicate the error 15.87% quantile (median!1!) RooStats framework [15], which is a C++ class library based ontheROOT[16]andRooFit[17] bands as estimated directly (without Monte Carlo) using Eqs.(88)and(89).Ascanbeseen !1 packages. The tools in RooStats can be used to represent arbitrary probability density func- from10 the plot, the agreement between the formulae andtions MC that pred inheritictions from isRooAbsPdf excellent.,theabstractinterfacesforprobabilitydensityfunctions provided by RooFit. 
Figures!2 9 through 11 correspond to finding upper limit on µ for a specific value of the peak 10 The framework provides an interface with minimization packages such as Minuit [18]. position (mass). In a search for a signal of unknown mass, the procedure would be repeated This allows one to obtain the estimators required in the the profile likelihood ratio:µ ˆ, for all!3 masses (in practice in small steps). Figure 12 shows theˆ median upper limit at 95% CL 10 θˆ,andθˆ.TheAsimovdatasetdefinedinEq.(24)canbedeterminedforaprobability as a function of mass. The median (central blue line) anddensity erro functionrbands( by specifying1σ in green, the ExpectedData()2σ in command argument in a call to the Figure 10: ± The distributions± yellow)!4 are obtained using Eqs. (88) and (89). The pointsgenerateBinned and connectingmethod. curve The correspond Asimov data together with the standard HESSE 10 f(qµ|0) (red) and f(qµ|µ)(blue)as to the0 upper limit5 from10 a single15 arbitrary20 Monte25 Carlo30matrix datain provided seFig.t, 9 generated and by theMinuit 15.87% accordingmakes quantile it isto of possible the to determine the Fisher information matrix q shown in Eq. (28), and thus obtain the related quantities suchasthevarianceofˆµ and the background-only hypothesis.µ As can be seen, most of thef plot(qµ|0)slieasexpectedwithinthe (see text). 1σ error band. noncentrality parameter Λ,whichenterintotheformulaeforanumberofthedistributions ± of the test statistics presented above. By simulating the experiment many times with Monte Carlo, we can obtain a histogram p of the upper limits on µ at 95% CL, as shown in Fig. 11. TheThe distributions1σ (green) of andthe various2σ (yellow) test statistics and the related formulae for -values, sensi- 28 tivities and± confidence intervals± as given in Sections 2, 3 and4arebeingincorporatedaswell. error bands are obtained from the MC experiments. The vertical lines indicate the error RooStats currently includes the test statistics tµ, t˜µ, q0,andq,qµ,and˜qµ as concrete imple- bands as estimated directly (without Monte Carlo) usingmentations Eqs.(88)and(89).Ascanbeseen of the TestStatistic interface. Together with the Asimov data, this provides from the plot, the agreement between the formulae andthe MC ability pred toictions calculate is the excellent. alternative estimate, σA,forthevarianceofˆµ shown in Eq. (30). The noncentralµ chi-square distribution is being incorporated into both RooStats and ROOT’s Figures 9 through 11 correspond to finding upper limitmathematics on for a libraries specific for value more of general the peak use. The various transformations of the noncentral position (mass). In a search for a signal of unknown mass, the procedure would be repeated for all masses (in practice in small steps). Figure 12 shows the median upper limit at 95% CL 29 as a function of mass. The median (central blue line) and errorbands( 1σ in green, 2σ in ± ± yellow) are obtained using Eqs. (88) and (89). The points and connecting curve correspond to the upper limit from a single arbitrary Monte Carlo data set, generated according to the background-only hypothesis. As can be seen, most of the plotslieasexpectedwithinthe 1σ error band. ±

28 2.2 Test statistic t˜ for µ 0 µ ≥ Often one assumes that the presence of a new signal can only increase the mean event rate beyond what is expected from background alone. That is, the signal process necessarily has µ 0, and to take this into account we define an alternative test statistic below called t˜ . ≥ µ In many analyses, the contribution of the signal processEven to for the when mean considering number models of for events which µ is 0, however, we will not restrict the ≥ assumed to be non-negative. This condition effectivelyeffective implies estimator thatµ ˆ anyto be physical positive, and estimator if the data fluctuate low relative to the expected for µ must be non-negative. Even if we regard this tobackground be the case, one can however, findµ< ˆ 0. it By is defining convenientµ ˆ in this way we will see in Sec. 3.1 that its sampling distribution can be approximated by a Gaussian, which in turn allows one to obtain to define an effective estimatorµ ˆ as the value of µ thatsimple maximizes approximations the for likelihood, the pdfs of the even test statistics this considered. givesµ< ˆ 0 (but providing that the Poisson mean values,Forµs ai + modelbi, remain where µ nonnegative).0, if one finds data This such thatµ< ˆ 0, then the best level of ≥ will allow us in Sec. 3.1 to modelµ ˆ as a Gaussian distributedagreement variable, between the and data in and this any way physical we value can of µ occurs for µ = 0. We therefore define determine the distributions of the test statistics that we consider. Therefore in the following

ˆ we will always regardµ ˆ as an effective estimator which is allowed to take on negativeL(µ, values.θ(µ)) µˆ 0, L(ˆµ,θˆ) ≥ λ˜(µ)= (10) ˆ  L(µ,θ(µ))  ˆ µ<ˆ 0 . L(0,θˆ(0)) 2.1 Test statistic tµ = 2lnλ(µ)  Center for − ˆ ˆ  Cosmology and Feldman-Cousins withHere θˆ and(0) and θˆwithout(µ) refer to the constraint conditional ML estimatorsParticle of θ givenPhysics a strength parameter From the definition of λ(µ) in Eq. (7), one can see that 0 λ 1, with λ near 1 implying good Wilks’s theorem gives a short-cutof≤ 0 or forµ≤,respectively. the Monte Carlo procedure used to find ˜ agreement between the datathreshold and the on hypothesized test statistic ⇒ value MINOSThe of variable µis. asymptotic Equivalentlyλ(µ) can be approximation used it instead is convenient of λ( µof) inFeldman-Cousins Eq. (8) to obtain the corresponding test ˜ to use the statistic statistic, which we denote tµ. That is, ‣ With a physical constraint (µ>0) the confidence band changes ˆ L(µ,θˆ(µ)) 2ln ˆ µ<ˆ 0 , ˜ ˜ − L(0,θˆ(0)) tµ = 2lnλ(µ)tµ = 2lnλ(µ)= (8)ˆ (11) −  L(µ,θˆ(µ)) −  2ln µˆ 0 . − L(ˆµ,θˆ) ≥ as the basis of a statistical test. Higher values of tµ thus correspond to increasing incompat- As was done with the statistic tµ, one can quantify the level of disagreement between the ibility between the data and µ. Two-sided data and the hypothesized value of µ withTwo-sided the p-value, just as in Eq. (9). An approximate formula for the distribution of t˜µ needed to do this is given in Sec. 3.4. We may define a test of a hypothesizedunconstrained value of µ by using the statistic tµconstraineddirectly Also similar to the case of tµ, values of µ both above and belowµ ˆ may be excluded by a (µ) as measure of discrepancy between the data and thegiven hypothesis, data set, i.e., with(µ) one may higher obtain either values a one-sided of tµ or two-sided confidence interval for µ. ! ! correspond to increasing disagreement. To quantifyFor the the level case of of no disagreement nuisance parameters, we the compute test variable t˜µ is equivalent to what is used in the p-value, constructing confidence intervals according to the procedure of Feldman and Cousins [8]. -2 ln -2 ln

2.3 Test statistic q0 for discovery of a positive signal ∞ pµ = f(tµ µ) dtµ , (9) An important special case of the statistic t˜µ described above is used to test µ = 0 in a class tµ,obs | ￿ of model where we assume µ 0. Rejecting the µ = 0 hypothesis effectively leads to the ≥ discovery of a new signal. For this important case we use the special notation q = t˜ . Using where tµ,obs is the value of the statistic tµ observed from the data and f(tµ µ) denotes the 0 0 the definition (11) with µ = 0 one| finds pdf of tµ under the assumption of the signal strength µ. Useful approximations for this and other related pdfs are given in Sec. 3.3. The relation between the p-value and the2ln observedλ(0)µ ˆ 0 , q = − ≥ (12) t and also with the significance Z are0 illustrated in Fig. 1. µ 0 0 µ µ  0ˆµ<0 , Kyle Cranmer (NYU) CERN School HEP, Romania, Sept. 2011 151 where λ(0) is the profile likelihood ratio for µ = 0 as defined in Eq. (7). (a) (b) 6

Figure 1: (a) Illustration of the relation between the p-value obtained from an observed value of 2 the test statistic tµ. (b) The standard normal distribution ϕ(x)=(1/√2π)exp( x /2) showing the relation between the significance Z and the p-value. −

When using the statistic tµ, a data set may result in a low p-value in two distinct ways: the estimated signal strengthµ ˆ may be found greater or less than the hypothesized value µ. As a result, the set of µ values that are rejected because their p-values are found below a specified threshold α may lie to either side of those values not rejected, i.e., one may obtain a two-sided confidence interval for µ.

5 We may contrast this to the statistic t0, i.e., Eq. (8), used to test µ = 0. In this case one may reject the µ = 0 hypothesis for either an upward or downward fluctuation of the data. This is appropriate if the presence of a new phenomenon could lead to an increase or decrease in the number of events found. In an experiment looking for neutrino oscillations, for example, the signal hypothesis may predict a greater or lower event rate than the no- oscillation hypothesis.

When using q0, however, we consider the data to show lack of agreement with the background-only hypothesis only ifµ> ˆ 0. That is, a value ofµ ˆ much below zero may indeed constitute evidence against the background-only model, but this type of discrepancy does not show that the data contain signal events, but rather points to some other systematic error. For the present discussion, however, we assume that the systematic uncertainties are dealt with by the nuisance parameters θ. If the data fluctuate such that one finds fewer events than even predicted by background processes alone, thenµ< ˆ 0 and one has q0 = 0. As the event yield increases above the expected background, i.e., for increasingµ ˆ, one finds increasingly large values of q0, corre- sponding to an increasing level of incompatibility between the data and the µ = 0 hypothesis. To quantify the level of disagreement between the data and the hypothesis of µ =0using the observed value of q0 we compute the p-value in the same manner as done with tµ, namely,

∞ pµ = f(qµ µ) dqµ , (15) ∞ q | p0 = f(q0 0) dq0 . ￿ µ,obs (13) q | ￿ 0,obs which can be expressed as a significance using Eq. (1). Here f(q µ) is the pdf of q assuming p =µ ∞ f(q µ) dq , µ (15) µ | µ| µ the hypothesis µ. In Sec. 3.6 we provide useful approximations￿q forµ,obs this and other related Here f(q0 0) denotes the pdf of the statistic q0 under assumption of the background-only | pdfs. which can be expressed as a significance using Eq. (1). Here f(q µ) is the pdf of q assuming µ| µ (µ = 0) hypothesis. An approximation for this and other related pdfs are giventhe hypothesis in Sec.µ. In 3.5. Sec. 3.6 we provide useful approximations for this and other related pdfs. 2.5 Alternative test statistic q˜ for upper limits µ Center for Cosmology and 2.4 Test statisticModifiedq for upper test limits statistic for 1-sided2.5 Alternativeupper test limits statistic q˜µ for upperParticle limits Physics µ For the case where one considers models for which µ 0, the variable λ˜(µ) can be used ≥ instead of λ(µ) in Eq. (14) to obtainFor the the case corresponding where one considers test models statistic, for which whichµ we0, denote the variableq ˜µ.λ˜(µ) can be used For 1-sided upper-limit the threshold on the test statistic is different ≥ For purposes of establishing an upper limit onThat the is, strength parameter µinstead, we of considerλ(µ) in Eq. two (14) to obtain the corresponding test statistic, which we denoteq ˜µ. closely related test statistics.‣ and First, with we physical may define boundaries, it is againThat more is, complicated ˆ L(µ,θˆ(µ)) ˆ 2ln µ<ˆ 0 L(µ,θˆ(µ)) ˆ 2ln ˆ µ<ˆ 0 2lnλ˜(µ)ˆµ µ −2lnλ˜(µL)ˆ(0,µθ(0))µ − L(0,θˆ(0))   ˆ 2lnλ(µ)ˆµ µ, − ≤ q˜ = − ˆ ≤ = L(µ,θˆ(µ)) (16) q˜µ = =µ  L(µ,θ(µ))  2ln 0 (16)µˆ µ q = − ≤  0ˆ(14)2ln µ>µ 0 µˆ µ L(ˆµ,θˆ) µ 0ˆµ>µ  − L(ˆµ,θˆ) ≤ ≤− ≤ ≤ ￿ 0ˆµ>µ,   0ˆµ>µ.  0ˆµ>µ.     We give an approximation for the pdf f(˜qµ µ￿) in Sec. 3.7.  | where λ(µ) is the profile likelihood ratio as definedWe give in an Eq. approximation (7). The reason for the for pdf settingf(˜qµ µ￿) inqµ Sec.=0 3.7. In numerical| examples we have found that the difference between the tests based on qµ (Eq. (14)) andq ˜ usually to be negligible, but use of q leads to important simplifications. forµ>µ ˆ is that when setting an upper limit, oneIn numerical would not examples regard we have data found with thatµ>µ ˆ µ the diasfference between the testsµ based on qµ Furthermore, in the context of theOne-sided approximation used in Sec. 3, the two statistics are equiv- representing less compatibility with µOne-sidedthan the(Eq. data (14)) obtained, andq ˜µ usually and therefore to be negligible, this is but not use taken of qµ leads to important simplifications. alent. That is, assuming the approximations below, qµ can be expressed as a monotonic Furthermore, in the context of the approximation used in Sec.constrained 3, the two statistics are equiv- as part of the rejection region of the test.unconstrained From the definition of the test statisticfunction one ofq ˜µ seesand thus that they lead to the same results. alent. That is, assuming the approximations below, qµ can be expressed as a monotonic (µ) higher values of qµ represent greater incompatibility between the data and(µ) the hypothesized ! function ofq ˜µ and thus they lead! to the same results. value of µ. 3 Approximate sampling distributions One should note that q is not simply a special case of q with µ = 0,In but order ratherto find the hasp-value a of a hypothesis using Eqs. 
(13) or (15) we require the sampling -2 ln

0 µ -2 ln 3 Approximate samplingdistribution distributions for the test statistic being used. In the case of discovery we are testing the different definition (see Eqs. (12) and (14)). That is, q0 is zero if the data fluctuatebackground-only downward hypothesis (µ = 0) and therefore we need f(q 0), where q is defined by 0| 0 (ˆµ<0), but qµ is zero if the data fluctuate upwardIn order (ˆµ>µ to find). With the p-value that caveat of a hypothesisEq. in (12). mind, When using testing we Eqs. will a nonzero (13) or value (15) of µ wefor require purposes the of finding sampling an upper limit we need the distribution f(q µ)whereq is defined by Eq. (14), or alternatively we require the pdf distribution for the test statistic being used. Inµ| the caseµ of discovery we are testing the often refer in the following to qµ with the idea that this means either q0 orofq theµ as corresponding appropriate statisticq ˜µ as defined by Eq. (16). In this notation the subscript of q background-only hypothesis (µ =refers 0) and to the therefore hypothesis we being need tested,f(q and0 0), the where second argumentq0 is defined in f(q byµ) gives the value of to the context. | µ| Eq. (12). When testing a nonzeroµ valueassumed of inµ thefor distribution purposes of of the finding data. an upper limit we need

As with the case of discovery, one quantifiesthe the distribution level of agreementf(qµ µ)where betweenqµ isWe defined also the need bydata the Eq. distribution and (14), orf alternatively(qµ µ￿)withµ = weµ￿ to require find what the significance pdf to expect and | | ￿ hypothesized µ with p-value. For, e.g., an observedof the value correspondingq , statisticone hasq ˜µ ashow defined this is distributed by Eq. (16).if the data In correspondthis notation to a strength the subscript parameter of diffqerent from the one µ,obs being tested. For example, it is useful to characterize the sensitivity of a planned experiment refers to the hypothesis being tested, and the second argument in f(qµ µ) gives the value of by quoting the median significance, assuming dataµ|95 distributed according to a specified signal µ assumed in the distributionµ of themodel, data. with which one would expect to exclude the background-onlyµ hypothesis. For this one 0 would need f(q0 µ0￿), usually with µ￿ = 1. From this one can find the median q0, and thus the We also need the distribution f(qµ µ￿)with|µ = µ￿ to find what significance to expect and median| discovery significance.￿ When considering upper limits, one would usually quote the Kyle Cranmer (NYU) how this isCERN distributed School if HEP, the data Romania,value correspond of µSept.for which 2011 to a the strength median p parameter-value is equal di tofferent 0.05, as from this gives the152 the one median upper limit 7being tested. For example, it is usefulon µ at to 95% characterize confidence level. the sensitivityIn this case one of would a planned need f( experimentq 0) (or alternatively f(˜q 0)). µ| µ| by quoting the median significance, assumingIn Sec. 3.1 we data present distributed an approximation according for the to profile a specified likelihood signal ratio, valid in the large model, with which one would expectsample to exclude limit. This the allows background-only one to obtain approximations hypothesis. for For all of this the requiredone distributions, would need f(q µ ), usually with whichµ = are1. From given in this Sections one can3.3 through find the 3.6 median The approximationsq , and thus become the exact in the large 0| ￿ ￿ 0 median discovery significance. When considering upper limits, one would usually quote the value of µ for which the median p-value is equal to 0.05, as this gives the8 median upper limit on µ at 95% confidence level. In this case one would need f(q 0) (or alternatively f(˜q 0)). µ| µ| In Sec. 3.1 we present an approximation for the profile likelihood ratio, valid in the large sample limit. This allows one to obtain approximations for all of the required distributions, which are given in Sections 3.3 through 3.6 The approximations become exact in the large

8 Center for Cosmology and The Non-Central Chi-Square Particle Physics Wald’s theorem allows one to find the distribution of -2log'(µ) when µ is not true -- the result is a non-central chi-square distribution

Let Xi be k independent, normally distributed random variables with means µi and . Then the

is distributed according to the noncentral chi- square distribution. It has two parameters: k which specifies the number of degrees of freedom (i.e. the number of Xi), and ! which is related to the mean of the random variables Xi by:

! is sometime called the noncentrality parameter. Note that some references define ! in other ways, such as half of the above sum, or its square root.

Kyle Cranmer (NYU) CERN School HEP, Romania, Sept. 2011 153 data can be represented as one or more histograms. Using the method in an unbinned analysis is a straightforward extension. Suppose for each event in the signal sample one measures a variable x and uses these values to construct a histogram n =(n1,...,nN ). The expectation value of ni can be written

E[ni]=µsi + bi , (2) where the mean number of entries in the ith bin from signal and background are

The results of Wilks and Wald generalize to more than one parameter of interest. If s s f x θ dx , the parameters of interesti = tot can bes( explicitly; s) identified with asubsetoftheparameters(3) θr = where the final equality above exploitsbin thei fact that the estimators for the parameters are (θ ,...,θ ), then the distribution! of 2lnλ(θ )followsanoncentralchi-squaredistribution equal to their hypothesized1 r values when the likelihood− r is evaluated with the Asimov data set. for r-degrees of freedomb b with noncentralityf x θ dx . parameter i = tot b( data; b) can be represented as one or more histograms.(4) Using the method in an unbinned analysis Astandardwaytofindσ is by estimating!bin i the matrix of second derivatives of the log- isr a straightforward extension. 1 likelihoodHere functionthe parameter (cf.µ Eq.determines (18)) the to strength obtain of the theSuppose signal inverse process, forV˜ cov each1ariance with eventµ in=0corresponding the, matrix signal sampleV ,invertingto one measures a variable x and uses these Λr = (θi θi!) ij− (θj θj! ) − (21) µ values to− construct a histogram− n =(n ,...,n ). The expectation value of n can be written find V to,andthenextractingtheelement the background-only hypothesis and V=1beingthenominalsignalhypothesis.The00 correspondingi,j = 1 to the variance1 ofNµ ˆ.Thesecond i f x θ f x θ ! functions sL( ; s)and b( ; b)aretheprobabilitydensityfunctions(pdfs)ofthevariable derivativex of ln is V˜ 1 θ θ E n µs b , for signalwhere and backgroundij− is the events, inverse and of thes and submatrixb represent one obtains parameters from that restricting characterize[ i]= thei + fulli covariance (2) the shapes ofmatrix pdfs. to The the quantities parameterss ofand interest.b are theThe total full meancovariance numbers matrix of signal is given and from inverting tot totwhere the mean number of entries in the ith bin from signal and background are backgroundEq. events, (18), and and the we integrals show an in e (3)fficient and (4) way represen to calculatettheprobabilitiesforaneventto it in Sec. 3.2. be found in bin i.Belowwewilluse2 Nθ =(θ , θ ,b )todenoteallofthenuisanceparameters.2 ∂ ln L nsi b tot ∂ νi ∂νi ∂νi ni s s s f x θ dx , The signal normalization tot=is not, however, an1 adjustable parameter but ratheri2= tot is fixeds( t;o s) (3) the value predicted3.2∂θ The byj∂θ thek Asimov nominal data signal" ν set model.i − and the∂θj∂θ variancek − ∂θj of∂θµˆk νi % !bin i !i = 1 # $ b b f x θ dx . In additionSome to the of the measured formulae histogram given requiren one often the standard makes further deviation subsidiaryσ ofi µ= ˆ measurements,whichisassumedtofollowtot b( ; b) (4) M bin i that help constrain the nuisance parameters.m For example,µ∂2u one may∂u select∂u m a control! sample aGaussiandistributionwithameanofi data!.Belowweshowtwowaysofestimating cani be representedi i as onei or. more histograms.σ Using,bothof the method in an unbinned analysis where one expects mainly background+ events and1 from them constructµ a histogram of some (27) µ which are closely related to au special,Here artificial theis parameter a straightforward data setdetermines th extension.at weu the2 call strength the “Asimov of the signal data process, set”. with =0corresponding " i − ∂θj∂θk − ∂θj ∂θk i % µ chosen kinematic variable. This theni = 1 gives# a setto of$ valuesthe background-onlym =(m1,...,m hypothesisM )forthenumber and =1beingthenominalsignalhypothesis.The ! 
fSupposex θ for eachf x eventθ in the signal sample one measures a variable x and uses these of entries in eachWe of define the M thebins. Asimov The expectation data setfunctions such value that ofs(m when; cans)and one be useswrittenb( ; itb)aretheprobabilitydensityfunctions(pdfs)ofthevariab to evaluate the estimators for le x values toi construct a histogram n =(θ n ,...,nθ ). The expectation value of n can be written all parameters, one obtains the truefor parameter signal and values. background Cons events,ider the and likelihoods and1 b Nrepresent function parameters for that characterizei From (27) one sees that the second derivative of ln L is linear in the datas valuesb n and m . the generic analysis given by Eq. (6).the To shapes simplify of pdfs. the The notati quantitieson in thistot and sectiontot are we thei define total meani numbers of signal and E m u θ , E n µs b , Thus its expectation value is found simply[ i]= bybackgroundi( evaluating) events, wit andhtheexpectationvaluesofthe the integrals in (3) and[ (5) (4)i]= represeni + ttheprobabilitiesforaneventtoi (2) be found in bin i.Belowwewilluseθ =(θs, θb,btot )todenoteallofthenuisanceparameters. data, which is theu same as the Asimov data.The One signalwhere can normalization therefore the meanθ numbers obtainis of not, entries however,the in inverse the anith adjustable bin covariance from parameter signal and but background rather is fixed are to where the i are calculable quantities depending onν thei = parametersµ!si + bi . .Oneoftenconstructstot Center for (22) Cosmology and matrixthis from measurementThe so as main to provide results information onthe the value backgro predictedund by normalization the nominal signal parameter model. Particle Physics b and also possibly on the signalµ and backgroundIn shape addition paramet to the measureders. histogram n one often makes further subsidiary measurements tot Further let θ0 = represent the strength parameter, so that here θsi cans standf forx anyθ dx of , the The Model is just a binnedthat help version constrain theof nuisancethe parameters.i = tot For example,s( ; s) one may select a control sample (3) parameters. The ML estimators for the parameters can be foundbysettingthederivativesbin i The likelihood function2 isL the product2 of PoissonL probabiliN ties for all bins:M u u ! V 1of ln LEmarkedwith∂ respectln Poisson to all of∂ we theln parametershaveAwhere considered one expects∂ν equali ∂ν to mainlyi zero:1 background∂ i events∂ i and1 . from them construct a histogram of some − = = chosen= kinematic variable.+ This then givesb ab set of valuesf x θm =(dx(28) .m1,...,mM )forthenumber jk i = totu b( ; b) (4) − ∂θj∂θk N − ∂θjn∂θj k ∂θM j m∂θk k νiM ∂θj ∂θk i m " % (µs + b ) of entriesi = 1 in eachu of the bins.i = 1 The expectation!bin valuei of i can be written L µ, θ j j e (µsj + bj ) k e uk . ( )= L N −n ! M −m ! u (6) ∂ ln n ! i ∂νi m ! i µ ∂ i µ j = 1 j Herek = the1 parameterk determines the. 
strength of the signal process, with =0corresponding " = 1 "+ 1 L E=0[m ]=u (θ) , (23) (5) In practice one could, for example,∂θ evaluatej theνi − theto∂θ the derivatj background-onlyivesui of− ln hypothesis∂θAj numerically,i and µi =1beingthenominalsignalhypothesis.The use this µ !i = 1 " # !if= 1x"θ f# x θ to find theTo inverse test a hypothesized covariance value matrix, of we and consider then the invert profilefunctionsu and likelihoods ext( ; racts)and ratio theb( ; varianceb)aretheprobabilitydensityfunctions(pdfs)ofthevariab ofµ ˆ.Onecanθ le The “Asimov Data” is anwhere artificial thex i are calculabledataset quantities dependingθ on the parametersθ .Oneoftenconstructs This condition holds if the Asimov data, ni,forA and signalm andi,A ,areequaltotheirexpectationvalues: background events, and s and b represent parameters that characterize see directly from Eq. (28) that this variancethis depends measurement on so the as to parameter provide information valuess on assumed theb background for normalization parameter where the “observations”b arethe shapesset equal of pdfs. to The quantities tot and tot are the total mean numbers of signal and L µ,totθˆand also possibly on the signal and background shape parameters. the Asimov data set, in particular onµ the assumed( ) .background strength events, pa andrameter the integralsµ",whichentersvia in (3) and (4) representtheprobabilitiesforaneventto the expected valuesλ( )= givenThe the likelihood parameters functioni is the product ofθ Poisson(7)θ , θ probabili,b ties for all bins: L(ˆµ, θˆ) be found in bin .Belowwewilluse =( s b tot )todenoteallofthenuisanceparameters. Eq. (22). n = E[n The]=where signalν = theµ normalizations final(θ equality)+b ( aboveθs ) ,is exploits not, however, the fact that an the adjustable estimators(24) parameter for the parameters but rather are is fixed to of the model i,A i equali to their! i hypothesizedi t valuesot when the likelihood is evaluated with the Asimov data set. ˆ the value predicted by theN nominal(µs + b signal)nj model. M umk θˆ θ L j jµ (µsj + bj ) k uk AnotherHere methodin the numerator for estimating denotes theσ valuem(denoted of thatEσmA maximizesinu thisAstandardwaytofindθ . sectionL(µ,forθ)= the to specifiedσ distinguishis by estimating,i.e.,e− the it matrix from of second thee− derivatives. of the log- (6) i,A = [ i]= i( ) n ! m(25)! V 1 Inlikelihood additionθ function to the (cf. measuredj Eq.= 1 (18)) histogram toj obtainµ then inverseone often covk =ariance1 makesk matrix further− subsidiary,invertingto measurements it is the conditional maximum-likelihood (ML) estimatorL of V (and thus is" a function ofV ). " µ approach above based on the second derivatives ofthat ln find help)istofindfindthevaluethatisneces-,andthenextractingtheelement constrain the nuisance parameters.00 corresponding For example, to the one may of ˆ.Thesecond select a control sample We proved that fits to theTo Asimov test aderivative hypothesized data of ln L iscan value of µ we consider the profile likelihood ratio sary to recover theHere known the parameter properties values of representλA (µ). those Becausewhere implied one expects the by the Asimov mainly assumed background data distribution set events corresponding and of fromthe data. them construct a histogram of some be used to get the− non-centrality4 chosen kinematic parameter variable. 
This then gives a set of values m =(m1,...,mM )forthenumber to a strength µ" Ingives practice,µ ˆ = theseµ",fromEq.(17)onefinds are the values that would be estimated fromtheMonteCarlomodelusing of entries in each of the2 ML bins.N Then expectationL µ, θˆ 2 value ofnm can be written ∂ ln µ i ( ∂) ν.i ∂νi ∂νi i i averylargedatasample.needed for Wald’s theorem = λ( )= 1 2 (7) ∂θj∂θk " νi L− µ, ∂θθˆ j∂θk − ∂θj ∂θk νi % !i = 1 # (ˆ$ ) We can use the Asimov data setµ to evaluateµ 2 the “Asimov likelihood”E[mL]=A andu (θ) the, cor- (5) ( ") M m i 2u i u u m µ θˆ . i θ ∂ i ∂ i ∂ i Li . µ responding profile likelihood2lnλA ( ratio) HereλA .Theuseofnon-integervaluesforthedataisnota−in the numerator= Λ denotes the+ valueu of 1 that maximizes(29)u2 for the specified(27) ,i.e., 2 u " i − ∂θj∂θk − ∂θj ∂θk i % θ problem as the factorial− terms in the≈it is Poisson thewhereσ conditional likelihood the i are maximum-likelihood calculable represent quantities constants!i = 1 (ML)# depending estimator that$ cancel on of theθ when(and parameters thus is a.Oneoftenconstructs function of µ). this measurement so as to provide information on the background normalization parameter From (27) one sees that the second derivative of ln L is linear in the data values n and m . forming the likelihood ratio, and thus canb beand dropped. also possibly One on fin theds signal and background shape parameters. i i That is, from the AsimovKyle Cranmer data (NYU) set one obtainsCERN an School estimatetot Thus HEP, its Romania, expectation of th Sept.enoncentralityparameter value 2011 is found simply4 by evaluating withtheexpectationvaluesoftheΛ154 data, which is the same as the Asimov data. One can therefore obtain the inverse covariance that characterizes the distribution f(qµ µ"). Equivalently,The likelihood one can function use is Eq. the product (29) toof Poisson obtain probabili the ties for all bins: | L µ,matrixθˆ fromL µ, θˆ 2 µ A (µ ) A ( ) , variance σ which characterizes the distributionλ ( )= of ˆ,namely,= N (26)M mk A 2 L (µs2 +L b )nNj M uu u L µ, θˆ L (1µ , θ) ∂ ln ∂ lnj A j ∂ν(µsi ∂νj +i b1j ) ∂ ki ∂ i 1uk A (ˆ ) VA− =! LE(µ, θ)= = = e− + e− . . (28) (6) jk − ∂θ ∂θ − ∂θ ∂θn ∂θ ∂θ ν ∂θm ∂θ u " j k %j = 1 j jk! i = 1 j k i i = 1 j k! k i µ µ 2 " ! k!"= 1 ( ") In practice one could, for example, evaluate the the derivatives of ln L numerically, use this 2 To10, test a hypothesized value of µ we consider the profile likelihoodA ratio σA = q− to find the inverse covariance matrix, and then invert and extract(30) the variance ofµ ˆ.Onecan µ, A see directly from Eq. (28) that this variance depends on the parameter values assumed for the Asimov data set, in particular on the assumed strengthˆ parameter µ ,whichentersvia L(µ, θˆ) " Eq. (22). λ(µ)= . (7) where qµ, A = 2lnλA (µ). For the important case where one wants to find theL(ˆµ, θˆ median) − Another method for estimating σ (denoted σA in this section to distinguish it from the exclusion significance for the hypothesis µ assumingapproach that above there based on is the no second signal, derivatives then of ln L)istofindfindthevaluethatisneces- one has ˆ µ Heresaryθˆ in to recoverthe numerator the known propertiesdenotes the of λ valueA ( ). Because of θ that the Asimov maximizes data setL correspondingfor the specified µ,i.e., µ µ µ µ − " =0andtherefore it isto the a strength conditional" gives maximum-likelihood ˆ = ",fromEq.(17)onefinds (ML) estimator of θ (and thus is a function of µ).

(µ µ )2 2lnλ (µ) − " = Λ . (29) µ2 − A ≈ σ42 2 , σA = That is, from the Asimov data set one obtains an estimate of th(31)enoncentralityparameterΛ q that characterizes the distribution f(q µ ). Equivalently, one can use Eq. (29) to obtain the µ, A µ| " variance σ2 which characterizes the distribution ofµ ˆ,namely,

µ µ 2 2 ( ") , 11 σA = − (30) qµ, A

where q = 2lnλ (µ). For the important case where one wants to find the median µ, A − A exclusion significance for the hypothesis µ assuming that there is no signal, then one has µ" =0andtherefore

µ2 2 , σA = (31) qµ, A

11 Center for Cosmology and How well does it work? Particle Physics "#$%&'()*+#'%&,%'#-'),./0%#%12'-#*/3+)&'

D,./0%#%12''!"E#BFCG''8##=')+*&)=.'-#*'-)1*+.',/)++',)/0+&,5

"&=1)$H#BFCI'-*#/'D,1/#J'=)%)',&%K'8##=')8*&&/&$%'61%9'"(5

45'(#6)$' 7,1$8'%9&':*#-1+&';1<&+19##='1$'>&)*29&,'-#*'?&6':9.,12,'@'A)$--'!BCB !!

Kyle Cranmer (NYU) CERN School HEP, Romania, Sept. 2011 155 Center for Cosmology and How well does it work Particle Physics

#$%&'()*+,$(&'-&($.(*-/01&$&23(.$+04,*(

D'+'(&*='( E(C6

F-/01&$&23(.$+04,*(2-( 9$$>(*11+$G20*&2$%(&$(H

,'I',(J!" E(!HK(*,+'*>/(.$+ " L(!"6

56()$7*%( 8-2%9(&:'(;+$.2,'(<2=',2:$$>(2%(?'*+3:'-(.$+(@'7(;:/-23-(A(B*%..(!"C" !" Kyle Cranmer (NYU) CERN School HEP, Romania, Sept. 2011 156 Center for Cosmology and Some non-trivial tests: boundaries Particle Physics #$%&'()*+,$(&'-&($.(*-/01&$&23(.$+04,*'( H ?*0'(0'--*9'(.$+(&'-&(E*-'>($%(!6 H ! *%>(! 92F'(-202,*+(&'-&-(&$( &:'('G&'%&(&:*&(*-/01&$&23 .$+04,*'(*+'(F*,2>6

56()$7*%( 8-2%9(&:'(;+$.2,'(<2=',2:$$>(2%(?'*+3:'-(.$+(@'7(;:/-23-(A(B*%..(!CDC !" Kyle Cranmer (NYU) CERN School HEP, Romania, Sept. 2011 157 The p-value of the hypothesized µ is

p =1 F (q µ)=1 Φ √q (59) µ − µ| − µ ! " and therefore the corresponding significance is

1 Z = Φ− (1 p )=√q . (60) µ − µ µ

As with the statistic tµ above, if the p-value is found below a specified threshold α (often one takes α =0.05), then the value of µ is said to be excluded at a confidence level (CL) of 1 α.Theupperlimitonµ is the largest µ with p α.Herethiscanbeobtainedsimply − µ ≤ by setting pµ = α and solving for µ.UsingEqs.(54)and(59)onefinds

1 µ =ˆµ + σΦ− (1 α) . (61) up − For example, α =0.05 gives Φ 1(1 α)=1.64. Also as noted above, σ depends in general on − − the hypothesized µ.Thusinpracticeonemayfindtheupperlimitnumericallyasthe value of µ for which pµ = α.

3.7 Distribution of q˜µ (upper limits)

Using the alternative statisticq ˜µ defined by Eq. (16) and assuming the Wald approximation we find

µ2 2µµˆ 2 2 µ<ˆ 0 , σ − σ q  (µ µˆ )2 Center for ˜µ = −2 0 µˆ µ, (62)  σ ≤ ≤ Cosmology and  Some non-trivial tests: boundaries 0ˆµ>µ. Particle Physics   The pdf f(˜q µ )isfoundtobe  #$%&'()*+,$(&'-&($.(*-/01&$&23(.$+04,*'(µ| " µ" µ f(˜qµ µ")=Φ − δ(˜qµ) | ' σ ( HIGGS –STATISTICALH COMBINATION! 2 OF SEVERAL IMPORTANT STANDARD MODEL HIGGS ... 1 1 1 1 q µ µ < q µ2/ 2 , ?*0'(0'--*9'(.$+(&'-&(E*-'>($%(! 6 2 exp 2 ˜µ −σ 0 ˜µ σ  √2π √q˜µ − − ≤ +  ) ! " + (63) * 2 ! 2 2  1 1 (q˜µ (µ 2µµ )/σ ) 2 2  exp − − 2 q˜ >µ /σ . H √2π(2µ/σ) − 2 (2µ/σ) µ ! *%>(! 92F'(-202,*+(&'-&-(&$(  , -  µ µ &:'('G&'%&(&:*&(*-/01&$&23 The special case = " is therefore

1) 1 1 1 q˜µ/2 2 2 1) e− 0 < q˜µ µ /σ , ! 2 √2π q˜µ ! (a) √ ≤ (b) .$+04,*'(*+'(F*,2>6 ATLASf q µ 1 q H # " " ATLAS H # " " µ (˜ )= δ(˜ )+  (64) µ µ µ 2 2 2 1 | 2  1 1 (q˜µ + µ /σ ) q >µ2/ 2 . 1  exp 2 ˜µ-1 σ -1 √2π(2µ/σ) − 2 (2Lµ/σ )= 2 fb L = 10 fb =1, , - =1,

µ  µ |  | 1 The corresponding cumulative distribution is 1 f(q f(q -1 -1 We now can describe 10 16 10 effect of the boundary on

-2 the distribution of the 10 10-2 test statistic.

-3 10 10-3 0 2 4 6 8 10 12 0 2 4 6 8 10 12 q q 1 1

56()$7*%( 8-2%9(&:'(;+$.2,'(<2=',2:$$>(2%(?'*+3:'-(.$+(@'7(;:/-23-(A(B*%..(!CDC !" Figure 7: The distribution of the test statistic q1 for µˆ 1 under the s+ b hypothesis (for H !!), for mH = 120 Kyle Cranmer (NYU) CERN School HEP, Romania, Sept. 2011 1 157≤ 1 2 → GeV with an integrated luminosity of (a) 2 fb− and (b) 10 fb− . A "1 distribution is superimposed.

3.2 H W +W → − The H W +W − search is divided into two topologies, production of a Higgs with no jets (H + 0 j) and → with two additional jets (H + 2 j), using in both cases the decay mode H WW e!µ!. The present → → study does not yet consider the final states e!e! or µ!µ!, nor those with hadronic W decays. Future inclusion of these channels is expected to improve the search sensitivity particularly for the high Higgs mass region. The search is described in detail in Ref. [5].

3.2.1 H + 0 j The analysis of the H + 0 j channel uses a two dimensional maximum-likelihood fit of the transverse mass and the transverse momentum of the WW system in two bins of the dilepton opening angle in the transverse plane. The fit includes control samples to measure the backgrounds from tt and Z "". → The QCD WW background requires particular attention. Its distributions of Higgs-candidate trans- verse mass and pT are described with functions containing several adjustable (nuisance) parameters, and several others whose values are determined from a full Monte Carlo simulation and thereafter treated as fixed. The distribution of the test statistic q0 under the background-only (µ = 0) hypothesis is shown in −1 Fig. 8(a) for mH = 150 GeV for an integrated luminosity of 10 fb . The same fixed QCD WW shape 1 2 parameters are used both to generate the data and for calculating the likelihood ratio. A 2 #1 distribution is superimposed, showing the level of agreement of the asymptotic approximation. For this channel, further investigation of the systematic uncertainties was carried out. For the fixed shape parameters related to pT and transverse mass distributions for the QCD WW background, the val- ues used to generate the data were varied relative to what was used when determining the likelihood ratio. This was done in a manner that minimized the sensitivity of the resulting q0 distribution to variations in 2 other fixed parameters such as the QCD Q scale. The resulting distributions of q0 are thus no longer 1 2 expected to follow the 2 #1 form, as can be seen in Fig. 8(b). Because the chi-square approximation is not valid in this case, the p-values are calculated using the q0 distribution obtained directly from the Monte Carlo. An exponential is fitted to the tail region in order to extrapolate to large q0 values, and the median value of q0 under the hypothesis of signal plus background is determined using the same variation of the background parameters. It was found that the median p-value of the background-only hypothesis, with the median computed under assumption of the s+ b hypothesis, is very similar to the original case where the QCD shape parameters are not varied and 1 2 the 2 #1 distribution is used.

16 1495 299 Center for Cosmology and The problem with p-values Particle Physics The decision to reject the null hypothesis is based on the probability for data you didn’t get to agree less well with the hypothesis... ‣ doesn’t sound very convincing when you put it that way. Other criticisms:

● test statistic is “arbitrary” (not really, it is designed to be powerful against an alternative)

● what is the ensemble? Related to conditioning

obs b-only p-value b-only s+b P( N | s+b )

N events more discrepant Kyle Cranmer (NYU) CERN School HEP, Romania, Sept. 2011 158 Center for Cosmology and The Likelihood Principle Particle Physics Likelihood Principle

• As noted above, in both Bayesian methods and likelihood-ratio based methods, the probability (density) for obtaining the data at hand is used (via the likelihood function), but probabilities for obtaining other data are not used! • In contrast, in typical frequentist calculations (e.g., a p-value which is the probability of obtaining a value as extreme or more extreme than that observed), one uses probabilities of data not seen. • This difference is captured by the Likelihood Principle*: If two experiments yield likelihood functions which are proportional, then Your inferences from the two experiments should be identical. • L.P. is built in to Bayesian inference (except e.g., when Jeffreys prior leads to violation). • L.P. is violated by p-values and confidence intervals. • Although practical experience indicates that the L.P. may be too restrictive, it is useful to keep in mind. When frequentist results “make no sense” or “are unphysical”, in my experience the underlying reason can be traced to a bad violation of the L.P.

*There are various versions of the L.P., strong and weak forms, etc. Bob Cousins, CMS, 2008 46 Kyle Cranmer (NYU) CERN School HEP, Romania, Sept. 2011 159 Center for Cosmology and Goal of Likelihood-based Methods Particle Physics

Likelihood-based methods settle between two conflicting desires: ‣ We want to obey the likelihood principle because it implies a lot of nice things and sounds pretty attractive ‣ We want nice frequentist properties (and the only way we know to incorporate those properties “by construction” will violate the likelihood principle)

The asymptotic results give us a a way to approximately cover while simultaneously obeying f(x θ) | the likelihood principle and NOT using a prior θ

θ2

θ1

θ0 x

Kyle Cranmer (NYU) CERN School HEP, Romania, Sept. 2011 160 Center for Cosmology and Particle Physics

Bayesian methods

Kyle Cranmer (NYU) CERN School HEP, Romania, Sept. 2011 161 Center for Cosmology and Bob’s Example Particle Physics A b-taggingThe Falgorithmactory gives a curve like this 10

Background rejection versus Signal efficiency 1

0.95 0.9 MVA Method: 0.85 Fisher 0.8 MLP 0.75 LikelihoodD Background rejection PDERS 0.7 Cuts 0.65 RuleFit HMatrix 0.6 BDT 0.55 Likelihood 0.5 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Signal efficiency

One wantsFigure 5: to decide where to cut and to optimize analysis ‣ For some point on the curve you have:

● P(btag| b-jet), i.e., efficiency for tagging b’s

● P(btag| not a b-jet), i.e., efficiency for background

Kyle Cranmer (NYU) CERN School HEP, Romania, Sept. 2011 162

Code Example 6: Center for Cosmology and Bob’s example of Bayes’ theorem Particle Physics Now that you know: ‣ P(btag| b-jet), i.e., efficiency for tagging b’s ‣ P(btag| not a b-jet), i.e., efficiency for background

Question: Given a selection of jets with btags, what fraction of them are b-jets? ‣ I.e., what is P(b-jet | btag) ?

Answer: Cannot be determined from the given information! ‣ Need to know P(b-jet): fraction of all jets that are b-jets. ‣ Then Bayes’ Theorem inverts the conditionality:

● P(b-jet | btag) ∝P(btag|b-jet) P(b-jet) Note, this use of Bayes’ theorem is fine for Frequentist

Kyle Cranmer (NYU) CERN School HEP, Romania, Sept. 2011 163 Center for Cosmology and An different example of Bayes’ theorem Particle Physics An analysis is developed to search for the Higgs boson ‣ background expectation is 0.1 events

● you know P(N | no Higgs) ‣ signal expectation is 10 events

● you know P(N | Higgs )

Question: one observes 8 events, what is P(Higgs | N=8) ?

Answer: Cannot be determined from the given information! ‣ Need in addition: P(Higgs)

● no ensemble! no frequentist notion of P(Higgs)

● prior based on degree-of-belief would work, but it is subjective. This is why some people object to for particle physics

Kyle Cranmer (NYU) CERN School HEP, Romania, Sept. 2011 164 Center for Cosmology and Markov Chain Monte Carlo Particle Physics Markov Chain Monte Carlo (MCMC) is a nice technique which will produce a sampling of a parameter space which is proportional to a posterior ‣ it works well in high dimensional problems ‣ Metropolis-Hastings Algorithm: generates a sequence of points ￿α (t) { } ● Given the likelihood function L ( ￿α ) & prior P ( ￿α ), the posterior is proportional to L(￿α ) P (￿α ) · ● propose a point ￿α ￿ to be added to the chain according to a proposal density Q ( ￿α ￿ ￿α ) that depends only on current point ￿α | ● if posterior is higher at ￿α ￿ than at ￿α , then add new point to chain ● else: add ￿α ￿ to the chain with probability L(￿α ￿) P (￿α ￿) Q(￿α ￿α ￿) ρ = · | L(￿α ) P (￿α ) · Q(￿α ￿α ) · ￿| ● (appending original point ￿α with complementary probability) ‣ RooStats works with any L ( ￿α ) , P ( ￿α ) ‣ can use any RooFit PDF as proposal function Q (￿α ￿ ￿α ) | ‣ Helper for forming custom multivariate Gaussian, Bank of Clues, etc. ‣ New Sequential Proposal function similar to BAT Kyle Cranmer (NYU) CERN School HEP, Romania, Sept. 2011 165 Center for Cosmology and Examples from Higgs Combination Particle Physics RooStats MCMCCalculator tool used for the ATLAS and CMS Higgs combinations. Combinations include ~25-50 channels and >100 parameters

nuisance parameters vs. Param of interest

Kyle Cranmer (NYU) CERN School HEP, Romania, Sept. 2011 166 Center for Cosmology and The Jeffreys Prior Particle Physics Physicist Sir Harold Jefreys had the clever idea that we can “objectively” create a flat prior uniform in a metric determined by I(θ)

Adds “minimal information” in a precise sense, and results in:

It has the key feature that it is invariant under reparameterization of the parameter vector . In particular, for an alternate parameterization we can derive

Unfortunately, the Jefreys prior in multiple dimensions causes some problems, and in certain circumstances gives undesirable answers.

Kyle Cranmer (NYU) CERN School HEP, Romania, Sept. 2011 167 Center for Cosmology and Reference Priors Particle Physics

Refrerence priors are another type of “objective” priors, that try to save Jeffreys’ basic idea.

See Luc Demortier’s PhyStat 2005 proceedings: http://physics.rockefeller.edu/luc/proceedings/phystat2005_refana.ps

Kyle Cranmer (NYU) CERN School HEP, Romania, Sept. 2011 168 Center for Cosmology and Jeffreys’s Prior Particle Physics

Jeffreys’s Prior is an “objective” prior based on formal rules (it is related to the Fisher Information and the Cramér-Rao bound):

\[
\pi(\vec\theta) \;\propto\; \sqrt{\det \mathcal{I}(\vec\theta)}\,,
\qquad
\mathcal{I}_{ij}(\vec\theta) \;=\; -\,E\!\left[\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\ln f(X;\vec\theta)\,\Big|\,\vec\theta\right].
\]

Eilam, Glen, Ofer, and I showed in arXiv:1007.1727 that the Asimov data provides a fast, convenient way to calculate the Fisher Information. From that paper:

... where the final equality above exploits the fact that the estimators for the parameters are equal to their hypothesized values when the likelihood is evaluated with the Asimov data set.

A standard way to find σ is by estimating the matrix of second derivatives of the log-likelihood function (cf. Eq. (18)) to obtain the inverse covariance matrix V⁻¹, inverting to find V, and then extracting the element V₀₀ corresponding to the variance of μ̂. The second derivative of ln L is

\[
\frac{\partial^2 \ln L}{\partial\theta_j\,\partial\theta_k}
= \sum_{i=1}^{N}\left[\left(\frac{n_i}{\nu_i}-1\right)\frac{\partial^2 \nu_i}{\partial\theta_j\,\partial\theta_k}
 - \frac{\partial \nu_i}{\partial\theta_j}\frac{\partial \nu_i}{\partial\theta_k}\frac{n_i}{\nu_i^2}\right]
+ \sum_{i=1}^{M}\left[\left(\frac{m_i}{u_i}-1\right)\frac{\partial^2 u_i}{\partial\theta_j\,\partial\theta_k}
 - \frac{\partial u_i}{\partial\theta_j}\frac{\partial u_i}{\partial\theta_k}\frac{m_i}{u_i^2}\right]. \tag{27}
\]

From (27) one sees that the second derivative of ln L is linear in the data values n_i and m_i. Thus its expectation value is found simply by evaluating with the expectation values of the data, which is the same as the Asimov data. One can therefore obtain the inverse covariance matrix from

\[
V^{-1}_{jk} = -E\!\left[\frac{\partial^2 \ln L}{\partial\theta_j\,\partial\theta_k}\right]
= -\frac{\partial^2 \ln L_{\mathrm A}}{\partial\theta_j\,\partial\theta_k}
= \sum_{i=1}^{N}\frac{\partial \nu_i}{\partial\theta_j}\frac{\partial \nu_i}{\partial\theta_k}\frac{1}{\nu_i}
+ \sum_{i=1}^{M}\frac{\partial u_i}{\partial\theta_j}\frac{\partial u_i}{\partial\theta_k}\frac{1}{u_i}. \tag{28}
\]

In practice one could, for example, evaluate the derivatives of ln L_A numerically, use this to find the inverse covariance matrix, and then invert and extract the variance of μ̂. One can see directly from Eq. (28) that this variance depends on the parameter values assumed for the Asimov data set, in particular on the assumed strength parameter μ′, which enters via Eq. (22).

Another method for estimating σ (denoted σ_A in this section to distinguish it from the approach above based on the second derivatives of ln L) is to find the value that is necessary to recover the known properties of −2 ln λ_A(μ). Because the Asimov data set corresponding to a strength μ′ gives μ̂ = μ′, from Eq. (17) one finds

\[
-2\ln\lambda_{\mathrm A}(\mu) \;\approx\; \frac{(\mu-\mu')^2}{\sigma^2} \;=\; \Lambda. \tag{29}
\]

That is, from the Asimov data set one obtains an estimate of the noncentrality parameter Λ that characterizes the distribution f(q_μ | μ′). Equivalently, one can use Eq. (29) to obtain the variance σ² which characterizes the distribution of μ̂, namely,

\[
\sigma_{\mathrm A}^2 = \frac{(\mu-\mu')^2}{q_{\mu,\mathrm A}}, \tag{30}
\]

where q_{μ,A} = −2 ln λ_A(μ). For the important case where one wants to find the median exclusion significance for the hypothesis μ assuming that there is no signal, one has μ′ = 0 and therefore

\[
\sigma_{\mathrm A}^2 = \frac{\mu^2}{q_{\mu,\mathrm A}}. \tag{31}
\]

Use this as the basis to calculate Jeffreys’s prior for an arbitrary PDF! Validate on a Poisson:

  RooWorkspace w("w");
  w.factory("Uniform::u(x[0,1])");
  w.factory("mu[100,1,200]");
  w.factory("ExtendPdf::p(u,mu)");
  w.defineSet("poi","mu");
  w.defineSet("obs","x");
  // w.defineSet("obs2","n");
  RooJeffreysPrior pi("jeffreys","jeffreys",*w.pdf("p"),*w.set("poi"),*w.set("obs"));

[Plot: analytic vs. RooStats numerical Jeffreys’s prior for the Poisson mean]

Kyle Cranmer (NYU) CERN School HEP, Romania, Sept. 2011 169
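As a concrete illustration of Eqs. (29)–(31) (a minimal sketch, not from the lecture): for a single-bin counting experiment n ~ Pois(μs + b) with s and b known and no nuisance parameters, the Asimov data set under μ′ = 0 is simply n_A = b, and q_{μ,A} has a closed form. The numbers s = 10, b = 50 below are purely illustrative.

  #include <cmath>
  #include <cstdio>

  int main() {
    const double s = 10.0, b = 50.0;   // hypothetical expected signal and background
    const double mu = 1.0;             // signal-strength hypothesis being tested
    const double muPrime = 0.0;        // Asimov data generated under background only
    const double nA = muPrime*s + b;   // Asimov "observed" count

    // q_{mu,A} = -2 ln lambda_A(mu) for a single-bin Poisson counting model
    const double qmuA = 2.0*((mu*s + b) - nA + nA*std::log(nA/(mu*s + b)));

    const double sigmaA = std::fabs(mu - muPrime)/std::sqrt(qmuA);   // from Eq. (30)
    std::printf("q_mu,A = %.3f   sigma_A = %.3f   median exclusion Z = %.3f\n",
                qmuA, sigmaA, std::sqrt(qmuA));
    return 0;
  }

With these inputs q_{μ,A} ≈ 1.77, so the median significance for excluding μ = 1 under the background-only hypothesis is about 1.3σ.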

Center for Cosmology and Jeffreys’s Prior Particle Physics
Validate Jeffreys’s Prior on a Gaussian: μ, σ, and (μ, σ)

RooWorkspace w("w"); w.factory("Gaussian::g(x[0,-20,20],mu[0,-5,5],sigma[1,0,10])"); w.factory("n[10,.1,200]"); w.factory("ExtendPdf::p(g,n)"); w.var("n")->setConstant();

w.var("sigma")->setConstant(); w.defineSet("poi","mu"); w.defineSet("obs","x"); RooJeffreysPrior pi("jeffreys","jeffreys",*w.pdf("p"),*w.set("poi"),*w.set("obs"));

[Plots: analytic vs. RooStats numerical Jeffreys’s prior, for the Gaussian mean, the Gaussian σ, and the two-dimensional (μ, σ) case]
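For reference, the analytic curves that the numerical RooStats priors are compared against in these plots are the standard Jeffreys priors for a Gaussian with a fixed number of events,

\[
\pi(\mu) \propto \mathrm{const}, \qquad \pi(\sigma) \propto \frac{1}{\sigma}, \qquad \pi(\mu,\sigma) \propto \frac{1}{\sigma^{2}},
\]

while the Poisson validation on the previous slide compares against π(μ) ∝ 1/√μ.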

Kyle Cranmer (NYU) CERN School HEP, Romania, Sept. 2011 170 Center for Cosmology and The Bayesian Solution Particle Physics
Bayesian solutions generically have a prior for the parameters of interest as well as for the nuisance parameters.
‣ the 2010 recommendations largely echo the PDG’s stance

Recommendation: When performing a Bayesian analysis one should separate the objective likelihood function from the prior distributions to the extent possible.

Recommendation: When performing a Bayesian analysis one should investigate the sensitivity of the result to the choice of priors.

Warning: Flat priors in high dimensions can lead to unexpected and/or misleading results.

Recommendation: When performing a Bayesian analysis for a single parameter of interest, one should attempt to include Jeffreys's prior in the sensitivity analysis.
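The recommendations do not prescribe a particular implementation; as a minimal sketch of what such a prior-sensitivity check could look like with RooStats (all names and numbers here are illustrative, not from the lecture), one can compute the same Bayesian upper limit under a flat prior and under the Jeffreys prior for a simple counting model n ~ Pois(μ + b) with b known:

  #include <cstdio>
  #include "RooWorkspace.h"
  #include "RooDataSet.h"
  #include "RooArgSet.h"
  #include "RooStats/ModelConfig.h"
  #include "RooStats/BayesianCalculator.h"
  #include "RooStats/SimpleInterval.h"

  void priorSensitivity() {
    RooWorkspace w("w");
    w.factory("sum::splusb(mu[1,0,20],b[5])");               // expected count mu + b, with b = 5 known
    w.factory("Poisson::model(n[7,0,50],splusb)");           // observed count n, here n_obs = 7
    w.factory("Uniform::flatPrior(mu)");                     // flat prior on mu
    w.factory("EXPR::jeffreysPrior('1./sqrt(mu+b)',mu,b)");  // Jeffreys prior for this model

    RooDataSet data("data", "data", RooArgSet(*w.var("n")));
    data.add(RooArgSet(*w.var("n")));                        // a single "observed" entry, n = 7

    RooStats::ModelConfig mc("mc", &w);
    mc.SetPdf(*w.pdf("model"));
    mc.SetParametersOfInterest(RooArgSet(*w.var("mu")));
    mc.SetObservables(RooArgSet(*w.var("n")));

    const char* priors[] = {"flatPrior", "jeffreysPrior"};
    for (const char* prior : priors) {
      mc.SetPriorPdf(*w.pdf(prior));
      RooStats::BayesianCalculator bc(data, mc);
      bc.SetConfidenceLevel(0.95);
      bc.SetLeftSideTailFraction(0.);                        // one-sided upper limit
      RooStats::SimpleInterval* interval = bc.GetInterval();
      std::printf("prior = %-14s  95%% credible upper limit on mu = %.2f\n",
                  prior, interval->UpperLimit());
      delete interval;
    }
  }

A large difference between the two limits would indicate that the inference is dominated by the prior rather than by the data.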

Kyle Cranmer (NYU) CERN School HEP, Romania, Sept. 2011 171

... a frequentist context (e.g. Eq. 3). Priors for the parameters of interest can have a larger influence on the final inference; thus, in order to compare with frequentist methods it is important to separate these two components as much as possible. In order to understand the sensitivity to the choice of priors, it is recommended to repeat the analysis with several choices of priors. As Michael Goldstein said, “Sensitivity Analysis is at the heart of scientific Bayesianism.”

It is common practice in HEP to use flat priors. This follows partially from the intuitive notion that flat priors are non-informative. One should absolutely not be fooled by this intuitive picture, as flat priors are not invariant under reparametrization of the model, and thus are informative. In many dimensions, uniform priors are especially dangerous, as volume effects push probability away from the origin.

The conceptual goal of choosing a prior in an objective way that adds as little information to the resulting inference as possible has been developed significantly by the statistical community. This was first done by Jeffreys (a physicist and statistician), and his is a recommended prior to try in one-dimensional problems (note, it can be improper). In high-dimensional problems Jeffreys’s rule has problems. The state-of-the-art “objective” priors (i.e. priors chosen by formal rules) are the reference priors of Bernardo and collaborators. While this is an area for our field to investigate, no general tools are available.

Center for Cosmology and Words of wisdom on Bayesian methods Particle Physics

It is also worth noting that a Bayesian method has frequentist properties (and vice versa). So even though a method is Bayesian, one can still quantify its coverage or calibrate it in a frequentist way. To support the points raised above, here are some quotes from professional statisticians (taken from selected PhyStat talks and selections from Bob Cousins’s lectures):

• “Perhaps the most important general lesson is that the facile use of what appear to be uninformative priors is a dangerous practice in high dimensions.” – Brad Efron

• “meaningful prior specification of beliefs in probabilistic form over very large possibility spaces is very difficult and may lead to a lot of arbitrariness in the specification.” – Michael Goldstein

• “Sensitivity Analysis is at the heart of scientific Bayesianism.” – Michael Goldstein

• “Non-subjective Bayesian analysis is just a part – an important part, I believe – of a healthy sensitivity analysis to the prior choice...” – J.M. Bernardo

• “Objective Bayesian analysis is the best frequentist tool around” – Jim Berger

Recommendation: When performing a Bayesian analysis one should separate the objective likelihood function from the prior distributions to the extent possible.

Recommendation: When performing a Bayesian analysis one should investigate the sensitivity of the result to the choice of priors.

Warning: Flat priors in high dimensions can lead to unexpected and/or misleading results.

Recommendation: When performing a Bayesian analysis for a single parameter of interest, one should attempt to include Jeffreys’s prior in the sensitivity analysis.

In addition to Bayesian credible intervals, one can use Bayes factors as an alternative to p-values for hypothesis tests. Bayes factors have not been used extensively in HEP, but it is an area worth further investigation.

Kyle Cranmer (NYU) CERN School HEP, Romania, Sept. 2011 172
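For completeness, the Bayes factor comparing two hypotheses is the ratio of their marginal likelihoods, with all parameters integrated against their priors:

\[
B_{01} \;=\; \frac{p(x\mid H_0)}{p(x\mid H_1)}
\;=\; \frac{\int L(x\mid\theta_0,H_0)\,\pi(\theta_0\mid H_0)\,d\theta_0}
           {\int L(x\mid\theta_1,H_1)\,\pi(\theta_1\mid H_1)\,d\theta_1}\,,
\]

so, unlike a p-value, it depends on the priors assigned to the parameters of each hypothesis, and the same sensitivity-analysis advice applies.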

Center for Cosmology and Coverage & Likelihood principle Particle Physics

Methods based on the Neyman construction always cover... by construction.
‣ this approach violates the likelihood principle
Bayesian methods obey the likelihood principle, but do not necessarily cover.
‣ that doesn’t mean Bayesians shouldn’t care about coverage
Coverage can be thought of as a calibration of our statistical apparatus: a method under-covers if its intervals contain the true value less often than the stated confidence level, and over-covers (is conservative) if they contain it more often.

[Quote from Jim Berger, via Bob Cousins, CosmoStats 2009]

Bayesian and Frequentist results answer different questions.
‣ major differences between them may indicate severe coverage problems and/or violations of the likelihood principle

Kyle Cranmer (NYU) CERN School HEP, Romania, Sept. 2011 173 Center for Cosmology and Joke Particle Physics

“Bayesians address the question everyone is interested in, by using assumptions no-one believes”

“Frequentists use impeccable logic to deal with an issue of no interest to anyone”

-L. Lyons

Kyle Cranmer (NYU) CERN School HEP, Romania, Sept. 2011 174 Center for Cosmology and Particle Physics

The End

Thank You!

Kyle Cranmer (NYU) CERN School HEP, Romania, Sept. 2011 175