<<

and Jet Discriminaon

Mahew Schwartz Harvard University

Based on work with J. Galliccho arXiv:1211.7038. 1106.3076, 1104.1175 and P. Komiske and E. Metodiev arXiv:1612.01551

See also ATLAS arXiv:1405.6583, ATLAS-CONF-2016-034 Larkoski et al. arXiv:1405.6583 Schwartzman et al. arXiv:1407.5675, arXiv:1511.05190 1 Why do we care?

1. BSM searches: New physics mostly quark jets Backgrounds mostly gluon jets _ q q _ q q

~ χ˜ o q g˜ ~ 2 q o χ˜ 1

o ~ χ˜ g˜ q 1 o ~ χ˜ 2 q

_ q q q _ q

Figure 1. decay as an of of a quark-heavy signal, in this case with 8 quarkjetsandno gluon jets produced. Multi-jet2. SM searches events in standard model backgrounds are extremely unlikely to have so many quark jets. • Gluonic backgrounds to e.g. hadronic top decays

1Introduction 3. Improve Monte Carlos • Gluon jet modeling limits accuracy of current simulaons Being able to distinguish quark-initiated from gluon-initiated jets reliably at the LHC could be fantastically useful, since signatures of beyond-the-standard-model physics are often quark heavy. For example, a4. Test precision QCD typical gluino-pair production topology is pictured in Figure 1.Pro- duced in pairs, each gluino’s5. For the challenge: can we do it? cascade decay can produce four and missing transverse 2 momentum due to the escape of the lightest supersymmetric partner. Backgrounds to this process have events with many jets produced from QCD. These jets are predominately glu- onic. Additionally, many R-parity violating SUSY models produce quark jets without the missing transverse momentum. To constrain these models, being able to filter out background QCD events containing gluon jets would be helpful. Leptophobic Z′ or W ′ particles provide other obvious examples where quark/gluon discrimination would be useful. Gluon-heavy backgrounds are especially problematic for signals without leptons, gauge bosons, B-jets, tops, or missing energy. Quark/gluon tagging might beoneofthefewways to improve these searches. Another application is to reduce reduce combinatorial ambiguity within a single event. If jets in a given event could be identified as quark or gluon, their place in a proposed decay topology could be constrained, or they could be classified as initial- state radiation. Examining the quark/gluon tagging scores of jets produced by a new particle might be the only way to measure QCD quantum numbers directly.Alternatively,some signals consist of gluon jets, like coloron models [1]orburied-Higgs,whereh 2a 4g → → and a is CP odd scalar [2]. The same observables and techniques apply to gluon tagging, though here we will treat the quark jets as the signal and the gluon jets as background for concreteness.

–2– Quark/Glue basics

Probability of quark radiang: Probability of quark radiang:

↵ ↵s P (q qg)= s C ( ) P (g gg)= C ( ) ! 2⇡ F ··· ! 2⇡ A ··· 4 C = =1.3 C =3 F 3 A • around twice as likely to radiate than quarks • Gluon jets are faer • Gluon jets are more massive • Gluon jets have more parcles • …

3 Example distribuons Jet mass Charged parcle count Jet broadening ak05_mass components_jet0_ak05_charged_count Q 200 GeV components_jet0_ak05_charged_broadening Q 200 GeV Q 200 GeV 6 G 200 GeV 10 G 200 GeV 8 G 200 GeV

5 7 8 6 4 6 5 3 4 4 2 3

2 2 1 1

0 0 0 0 10 20 30 40 50 60 70 0 5 10 15 20 25 30 35 40 45 0 0.05 0.1 0.15 0.2 0.25 0.3

rd Jet angularity PT fracon of 3 hardest subjet Moment of Hu angularity_ak05_p100_Pt_4v_y_Rmax_overPt subjet_3rdPt_over_jetPt_jet0_ak05_sub_ak010 hu_1_ak05_Pt_4v_y Q 200 GeV Q 200 GeV Q 200 GeV

G 200 GeV G 200 GeV 12 G 200 GeV 8 12

7 10 10 6 8 8 5

4 6 6

3 4 4 2 2 2 1

0 0 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2 0.22 0 0.05 0.1 0.15 0.2 0.25 0.3

4 We looked at 10,000 variables

Integrated/differenal Integrated Jet ShapeIteravely opmized out to r =0.1 Jet angularies “Jet Shape” fa radial profile a!"1.9 5 Optimal Kernel !log r" Optimal Kernel !linear r" 1.5 8.9 Two-Dimensional Geometric Moments 4 1.0 1.0 800 GeV The radial moments above ignored how the pT was distributed around the jet axis. Motivated3 1 0.5 400 GeV 100 GeV by the moment-of-inertia and covariance tensors, a second order 2D geometric moment tensor 800 GeV 0.5 2 100 GeV can be formed as shown in Figure 19.Combinationsofitseigenvaluesandeigenvectors(like 400 GeV r 1 0.02 0.05 0.1 0.2 0.4 0.7 1 r Planar Flow) have been used used to distinguish boosted objects. a!0 0.2 0.4 0.6 0.8 1.0 a!#3 0 None of these variables turn out not be particularly useful for quark/gluon discrimination,0ΠΘ 0.2! 0.40.5 0.6 0.8 1 so no distributions are shown here. Whether a quark emits a gluonΠ or a gluon splits, theΠ the !0.5 4 2 2 R 2-body kinematics are similar. Since it’s this leading emission that dominates the subsequent !1.0 shower,Figure it is understandable 12.TheIntegratedJetShapeFigure that 14. these Profiles shapesfa(θ˜)fordi mightff noterentΨ(r di) choicesff isersignificantlybetweenquarks the of fraction the angularity of thea parameterpT of a spaced jet of at cone 0.1 intervals size R falling within a smaller(in rainbow) cone of and size linearr, radial-moment as illustrated “girth” in (inthe black). far left These panel.profile shapesΨ(r = haveR nothingjet) = to 1 do by with definition. and gluons. Figure 16. Profiles forSubjet the optimal counts kernels found for various jet sizes. Kernels for the higher-pT jets the shapes of the distributions resulting from integrating these moments over jets and histogramming In the center is a plot of the integrated jet shape averaged over all observedgive a jets higher of weighta particular to and properes pT near type the jet axis. the results. i (here ourProperes of quark and gluon dijet sample).pT ∆ Onηi∆ theηi ∆ rightηi∆φ thei distribution of r =0.1 jet shapes is shown. Covariance Tensor: C = jet p ∆φi∆ηi ∆φi∆φi The meanCovariance tensor of these distributionsi∈ givesjet T Ψ"(0Subjet.1) for Count quarks# and gluons. The distribution1st Subjet’s is clearlypT notFraction a !Grejection AngularitiesGrejection for different jet sizes Angularities for different jet sizes Angularities for different jet sizes 0.50 100 to have100 f(0) = 0. This means that the energy deposits near the crowded and noisy jet simple Gaussian centered around the average value, indicating that much information0.98 is discarded in 0.85 0.45 Combination of Eigenvalues center count least. Multiplying by a constant (even a negative one) also does not affect considering only the integrated jet90 shape. The rise at low0.80 r is due to jets where90 the0.96 parton underwent 0.40 Eigenvalues: a>b 0.75 discrimination power, so we normalized our trial profiles so their maximum value was +1. a semi-hard splitting leading to little pT deposited along the jet axis.2 2 0.94 0.35 Quadratic Moment:0.70 g = √a + b 80 We evaluated80 the ROC curve at three different quark efficiencies,20%,50%,and80%. 0.30 Determinant:0.65 det = a b 0.92 a · The best kernels we found had rejection scores that were not significantly higher than 70 a #3 #2 #1 1 2 70 a #3 #2 #1 1 2 Ratio: ρ = b/a #3 #2 #1 0 1 2 those for girth (equation 8.3)orthesquare-rootprofile(equation8.5). For this reason, we Eccentricity: ϵ = √a2 b2 80% quark60 50% quark 20%60 quark 4ab− won’t go into much more detail. Some general trends did appear. By construction, all kernels more precisely as Planar Flow: pf = 2 Figure 15. Gluon rejection power for angularities as a function( ofa+ angularityb) parameter a. Each line Rjet started out at zero at the center of the jetR andjet rose to +1 at somedistanceaway.Inthebest 0.2 0.4 Orientation:0.6 0.8θ 1. 0.2 0.4 0.6 0.8 1. represents growing jet size from R=0.2 in red up to R=1.0 inr purple.p (r′) Herekernels, the scores this for happened all pT sare around r =0.4forlow-pT jets, 0.3 for 100 GeV jets and 0.24 for 400 averaged. The bestIntegrated angularitiesSubjet Jet perform Shape:pT Standard slightlyΨ better(r Deviation)= than massTes, butdr worse′ . than track2st and Subjet’s subjetpT Fraction(8.1) Grejection jet andG 800rejection GeV jets. Beyond this, it mattered less what happened,butthebestkernelsdidfall counts. hp://jets.physics.harvard.edu/qvg100 0 p 100 5 Figure 19. The Covariance Tensor and its eigenvalues and eigenvectors.! T toward the edge of the jet. Examples of such kernels are shown in Figure 16.

90 90 8.7 N-subjettiness the jet mass, but this is not the most useful for our purposes. Agivenangularityhastwo An important distinction must80 be made between this definition, which80 is different function parameters (Rjet and a)inadditiontoanydiscretechoiceslikenormalization(noN-subjettinessne, jet [38 mass,]isafamilyofjetshapesthatattempttocharacterizethedegree to which a for each jet, and what is commonly plotted˜ as ‘jet shape,’ which is averaged over all jets seen jet pT ,jetE)orangleused(70 θ as defined, or geometric θ.) Gluonjet rejections has70 exactly for diNfferentsubjets. N is one of the input parameters, and is commonly taken to be 1, by a detector (with some cuts.) In Figure 12,thisaveragedintegratedjetshapeistheleft choices of a are shown in Figure 15. 2or3.N-subjettiness finds exactly N axes within the jet and associates each particle or p 60 60 T plot, whereas the distribution of integrated jet shapes out to a singledeposit radius to the nearest of r =0 axis..1is These are the N subjets. The N-subjettiness score τ is sum of 8.6 Optimal Kernel for Radial Moment N Rjet Rjet the right plot. The distribution0.2 is clearly0.4 not agaussiancenteredaroundtheaveragevalue.0.6 0.8 1. pT -weighted0.2 radial0.4 moments0.6 for0.8 each subjet.1. In this moment, each bit of pT is multiplied by Given Ψ(0.1)Rather for a than particular sticking to jet powers that of r you,sinesandcosines(likeangularity),oranotherorthonorm want to classify, it’s moreits distance useful to to theknow subjetal the axis full∆R raised to a power β,whichmustbepositive.Specifically, basis, we looked for the kernel f(r)thatgivesthebestdiscriminationpowerbetweenquarks distribution for quarks and gluonsFigure than 10. For just subjets, the smaller two average is better. Gluonvalues.this rejection Historic is at 50% measurements quark acceptance is plotted as a and gluons for each pT .Becausethegoalistofindthebestfunction,theoptimization N function of initial jet size Rjet. These scores are averaged over all jet pT bins from 50 GeV to 16001 GeV. β and calculationsproblem are is for technically the average infinite rather dimensional. than But the through full dis retribution.asonable smoothness The same criteria, is true it forτ jetN, β = pT,k (∆RJ,k) , (8.8) The color corresponds to the subjet algorithm, with anti-kT in red being slightly better thand0 CA in masses: oftencan average be reduced masses to adjusting are calculated a few control-points and measured of a spline fo orcoerdiffffierentcients ofp ansratherthanmass orthonormal J=0 k∈subjetJ green, which is slightly better than kT in blue. As for subjetT size, the darkest color corresponds to! the ! basis. Since adding a constant doesn’t change the discrimination power, we chose our kernels distributions. smallest and best subjet size of Rsub =0.1. Lightestwhere is the largestd0 is and a normalization worst subjet size involving of Rsub = R thejet. jet size R0 to keep τN, β between zero and one: These trends hold even for subjet variables not plotted. jet β d0 = R0 pT,k . (8.9) Measurements at CDF agreed well [30]withPythiaTuneAandHerwigouttopT = k∈jet 380 GeV. At higher pT ,shapesgotnarrower,whichisconsistentwiththemixofquark and ! –23– There are three parameters: N,theexponentβ,andthemethodofchoosingaxes.Asimple gluon jets evolving from 27% quark at 50 GeV to 80% quark at 350 GeV. Early ATLAS way of choosing N axes is to undo a kT or Cambridge-Aachen clustering exactly N steps. A data also agrees moderately–27– well [25]withsimulations.Whenusedevent-by-event,oftena particular annulus was chosen to be integrated over, for example 0.2

–18–

–20– We looked at 10,000 variables

The best two variables in Pythia are:

Charged parcle count 1 • Beer spaal and energy resoluon works beer • e.g. parcles > calorimeter clusters > subjets

and

1 2 Linear radial moment (girth) g = pi r • Similar to jet broadening jet T | i| pT i jet X2

• Many variables have similar performance

6 If the signal is not pure quark and the background is not pure gluon, a cut with these quark and gluon efficiencies will translate into signal and background efficiencies in a way that depends on: the fraction of signal made of quarks sq,signalmadeofgluonssg,background made of quarks bq,andbackgroundmadeofgluonsbg.Inthiscase,

ϵs = sqϵq + sgϵg and ϵb = bqϵq + bgϵg . (12.1)

Now suppose you start with S signal events and B background events. A cut with signal Quark and gluon jet substructure efficiency ϵs and background efficiency ϵb changes the statistical significance in a simple way as before: S Cut Sϵs ϵs Significance Improvement σ = = σ (12.2) √B → √Bϵb √ϵb

2.4 If the signal started with some significance σ,thecutwillimproveitbyafactorϵs/√ϵb, Top 5 combined with BDT which is called the “Significance Improvement” in the plots below. It depends not only on 2.2 the performance of the tagger, but on the quark/gluon makeup of your signal and background best group of 5 2 and where one chooses to operate. A ROC curve for quark/gluon discrimination (ϵg vs ϵq)can charged mult & girth Girth be easily transformed into a Significance Improvement Curve (ϵs/√ϵb vs ϵs)usingequations 1.8 charged mult * girth 12.1 and 12.2.ThefirstplotinFigure27,severalsuchcurvesareshownforsignalsthatcharged mult R=0.5 1.6 are 100% quark, but backgrounds that are mixed.subjet The mult curve Rsub=0.1 labeled 0% has a background

Significance Improvement that is purely gluon jets and corresponds to thegirth curves R=0.5 shown previously. Even when the Parcle count 1.4 background has only 10% quark jets, the maximumoptimal achievable kernel at 80% significance improvement has 1.2 fallen quite a bit. The second plot in this figure shows1st subjet those R=0.5 maxima as a continuous function avg kT of Rsub=0.1 1 of the background’s quark fraction. A typical low-p QCD background has 15% quarks and mass/PtT R=0.3 is indicated. Clearly the tagger will be most useful when the signal and background are both 0.8 decluster kT Rsub=0.1 quite pure. jet shape Ψ(0.1) 0.6 The same analysis can be performed for the published|pull| R=0.3 B-tagger ROC curves. Generic low- planar flow R=0.3 0.4 pT QCD backgrounds have around 2% B-jets. For this value, the significance improvement 0peaks 0.1 at0.2 around 0.3 60% 0.4B-acceptance. 0.5 0.6 This 0.7 is 0.8 the typical 0.9 operating 1 point of these taggers. Quark Jet Acceptance

100 GeV jets, Pythia 7 Figure 23. Significance Improvement Curves for pT = 100 jets for selected variables. These curves show the significance improvement εS/√εB as a function of εS.

–32–

–37– 9

〉 30 〉 0.3 trk

n ATLAS Simulation Pythia Dijets ATLAS Simulation Pythia Dijets 〈 25 Pythia γ+jet 0.25 Pythia γ+jet anti-kt R=0.4, |η| < 0.8 anti-kt R=0.4, |η| < 0.8 Extracted Extracted Extracted from Pythia MC11 Extracted from Pythia MC11 20 Closed symbols: Quarks 0.2 Closed symbols: Quarks s = 7 TeV Open symbols: Gluons s = 7 TeV Open symbols: Gluons Track Width 〈 15 0.15

10 0.1

5 0.05

1.2 1.2 jet jet ATLAS 7 TeV 1.0 p [GeV] 1.0 p [GeV] T T • ATLAS developed procedure to disentangle quark and gluon jets 0.8 0.8 50 100 150 200 250 300 350 50 100 150 200 250 300 350

• Used relavely pure samples (dijetsExtracted MC/MC for gluon, γ + jet for quark) jet Extracted MC/MC jet p [GeV] p [GeV] T T

(a) 9 (b) Simulaon Data 〉 〉 30 0.3〉 30 〉 0.3 trk trk

n Pythia Dijets Pythia Dijets

ATLAS Simulation n ATLAS Simulation 〈 ATLAS Pythia Dijets ATLAS Pythia Dijets 25 Pythia γ+jet 0.25〈 Pythia γ+jet anti-kt R=0.4, |η| < 0.8 25 anti-kt R=0.4, |η| < 0.8 Pythia γ+jet 0.25 Pythia γ+jet anti-kt R=0.4, |η| < 0.8 anti-kt R=0.4, |η| < 0.8 Extracted ExtractedExtracted Extracted Extracted from Pythia MC11 ExtractedExtracted from Pythia from 2011 MC11 Data Extracted from 2011 Data 20 Closed symbols: Quarks 0.2 20 -1 ClosedClosed symbols: symbols: Quarks Quarks 0.2 -1 Closed symbols: Quarks s = 7 TeV Open symbols: Gluons L dts == 74.7 TeV fb , s = 7 TeVOpen symbols: Gluons L dt = 4.7 fb , s = 7 TeV

Track Width ∫ Open symbols: Gluons ∫ Open symbols: Gluons Track Width 〈 〈 15 gluon 0.15 15 0.15 10 0.1 quark 10 0.1 5 0.05 5 0.05

1.2 1.21.2 1.2 jet jet 1.0 p [GeV] 1.0 p p[GeV]jet [GeV] pjet [GeV] T 1.0 T T 1.0 T 0.8 0.80.8 0.8 50 100 150 200 250 300 350 5050 100 100 150 150 200 200 250 250 300 300 350 350 50 100 150 200 250 300 350

Extracted MC/MC jet Extracted MC/MC jet jet jet p [GeV] Extracted Data/MC p p[GeV] [GeV] Extracted Data/MC p [GeV] T T T T (a) (b) (c) (d)

Extracted gluon jet properes closer to quark than in Fig. 2 Average (a,c) ntrk and (b,d) track width forpythia quark- (solid symbols) and gluon-jets (open symbols) as a function of

〉 30 〉 0.3 8 reconstructed jet pT for isolated jets with |η| < 0.8. Results are shown for distributions obtained using the in-situ extraction trk

n ATLAS Pythia Dijets method in Pythia ATLAS6simulation(blackcircles,(a,b))ordata(blackcircles,Pythia Dijets (c,d)), as well as for labelled jets in the dijet 〈 25 Pythia γ+jet 0.25 Pythia γ+jet anti-kt R=0.4, |η| < 0.8 sample (triangles)anti-kt andR=0.4, in |η| the< 0.8γ+jet sample (squares). The error bars represent only statistical uncertainties. Isolated jets are Extracted Extracted Extracted from 2011 Data reconstructedExtracted using thefrom 2011 anti- Datakt jet algorithm with radius parameter R =0.4. The bottom panels show the ratio of the results 20 L dt = 4.7 fb-1, s = 7 TeV Closed symbols: Quarks obtained0.2 with L the dt = in-situ4.7 fb-1, s extraction= 7 TeV Closed method symbols: to Quarks the results inthedijetandγ+jet MC samples. ∫ Open symbols: Gluons ∫ Open symbols: Gluons Track Width 〈 15 0.15

10 5.3.2 Heavy-flavour0.1 input uncertainties the results obtained after the extraction of the pure quark- and gluon-jet properties are added in quadra- 5 The fractions0.05 of b-jets and c-jets are varied by 20% ture to obtain the total systematic uncertainty from ± in the dijet sample, following Ref. [55], and by 50% this effect. ± 1.2 in the 1.2γ+jet sample to estimate a conservative uncer- Uncertainties on the properties of b-jets are deter- pjet [GeV] tainty. As the fractions of b-jets and cp-jetsjet [GeV] are small, mined using a tt¯sample, described in Sec. 3. The purity 1.0 T 1.0 T 0.8 these uncertainties0.8 remain sub-leading. The two input of this sample is generally better than 95%. An enve- 50 100 150 200 250 300 350 fractions50 are varied 100 independently. 150 200 250 The 300 differences 350 in lope 10% uncertainty is included on the b-jet properties jet jet Extracted Data/MC p [GeV] Extracted Data/MC p [GeV] T T (c) (d)

Fig. 2 Average (a,c) ntrk and (b,d) track width for quark- (solid symbols) and gluon-jets (open symbols) as a function of reconstructed jet pT for isolated jets with |η| < 0.8. Results are shown for distributions obtained using the in-situ extraction method in Pythia 6simulation(blackcircles,(a,b))ordata(blackcircles,(c,d)), as well as for labelled jets in the dijet sample (triangles) and in the γ+jet sample (squares). The error bars represent only statistical uncertainties. Isolated jets are reconstructed using the anti-kt jet algorithm with radius parameter R =0.4. The bottom panels show the ratio of the results obtained with the in-situ extraction method to the results inthedijetandγ+jet MC samples.

5.3.2 Heavy-flavour input uncertainties the results obtained after the extraction of the pure quark- and gluon-jet properties are added in quadra- The fractions of b-jets and c-jets are varied by 20% ture to obtain the total systematic uncertainty from ± in the dijet sample, following Ref. [55], and by 50% this effect. ± in the γ+jet sample to estimate a conservative uncer- Uncertainties on the properties of b-jets are deter- tainty. As the fractions of b-jets and c-jets are small, mined using a tt¯sample, described in Sec. 3. The purity these uncertainties remain sub-leading. The two input of this sample is generally better than 95%. An enve- fractions are varied independently. The differences in lope 10% uncertainty is included on the b-jet properties 0.18 0.18 > > trk

0.16 calo 0.16

Track width Jet PT (GeV) Jet PT (GeV)

0.18 0.18 > > trk

0.16 calo 0.16

0.06 The ATLAS Collaboration 0.06 0.04 ATLAS Preliminary 0.04 ATLAS Preliminary s = 8 TeV 20.3 fb s = 8 TeV 20.3 fb 0.02 -1 0.02 -1 1.2 < |η| < 2.1 1.2 < |η| < 2.1 1.2 100 200 300 400 500 1.2 100 200 300 400 500 Abstract Jet P (GeV) Jet P (GeV) 1 T 1 T This note presents a study of the discrimination between jets arising from light flavor quarks 0.8 and gluons using data from ps = 8TeVproton-proton collisions0.8 taken with the ATLAS Extracted Extracted Simulation 100 200 detector300 at the LHC. Templates400 of light-quark500 and gluon jet propertiesSimulation are derived100 and 200 300 400 500 compared with predictionsJet from P Monte (GeV) Carlo generators and high purity quark and gluon Jet P (GeV) jet samples. The power of a varietyT of discriminants derived from data is studied and T systematic uncertainties on their performance are assessed. Particular emphasis is placed on Data appears to be between the dependence of the discriminationPythia performance and Herwig for dierent event samples. Rejections of Figure 5: Means of extractedATLAS-CONF-2016-034 22/07/2016 templates40 to 50% of jets initiated for w fromtrk gluons(left) are achieved and forw 70%calo light(right) quark9 jet acceptance. comparing data (solid line), P (dotted line) and Herwig++ (dashed line). The top plots show the distribution for ⌘ < 0.8, the bottom plots are for | | 1.2 < ⌘ < 2.1. The bottom panel of each plot shows the ratio of the P and Herwig++ distributions to the | | extracted templates. The last pT bin in all plots includes overflow events.

© 2016 CERN for the benefit of the ATLAS Collaboration. Reproduction of this article or parts of it is allowed as specified in the CC-BY-4.0 license. 4.4 Validation

Using the +2jet and trijet validation samples defined in Section 3 it is possible to check the extracted templates against purified quark- and gluon-jet data samples. Figure 7 show a comparison of the means of the template distributions for quarks and gluons as compared with the two samples. The variables ntrk and wcalo are displayed as examples. Generally the extracted templates and validation samples agree within 15%, with the extracted gluon means being typically 10–15 % higher in the validation sample. The quark means are well reproduced, except at the lowest pT bins. Note that no attempt is made to correct the validation samples to 100% light-quark or gluon jet purity. However the purity of these samples is above 90%. therefore dierences between the the validation and extracted templates can be attributed to other sources, such as sample dependence, as discussed in Section 5.

4.5 Discrimination Performance

In order to determine which variables are most powerful for quark-gluon discrimination, a likelihood is created to rank the variables based on the fraction of gluons they reject (gluon rejection) for fixed quark

13 Monte Carlos can be improved…

Les Houches study (2016) Quark, -level Gluon, hadron-level 6 6 Pythia 8.205 Pythia 8.205 Herwig 2.7.1 5 Herwig 2.7.1 Sherpa 2.1.1 5 Sherpa 2.1.1 Vincia 1.201 Vincia 1.201 4 Deductor 1.0.2 1.0.2 Ariadne 5.0.β 4 Deductor ) Ariadne 5.0.β 1 0.5 ) λ

( 3 1 0.5

q Q=200 GeV p λ

R=0.6 ( 3

g Q=200 GeV p 2 R=0.6 2 1

1 0 0 0.2 0.4 0.6 0.8 1 1 λ0.5 [LHA] 0 0 0.2 0.4 0.6 0.8 1 (a) 1 (b) Monte Carlo’s all seem to agree on quarks λ0.5 [LHA] (a) Improved shower MCs in between (b) Herwig and Pythia • Promising sign for future progress 10

(c)

Fig. IV.48: Hadron-level distributions of the LHA for (a) the e+e uu¯ (“quark jet”) sample, ≠ æ (b) the e+e gg (“gluon jet”) sample, and (c) the classifier separation integrand in Eq. (IV.32). ≠ æ Six parton shower generators—Pythia(c) 8.205, Herwig++ 2.7.1, Sherpa 2.1.1, Vincia 1.201, Deductor 1.0.2, and Ariadne 5.0.——are run at their baseline settings with center-of-mass Fig. IV.48:energy Hadron-levelQ = 200 distributionsGeV and jet radius of theR LHA=0.6 for. (a) the e+e uu¯ (“quark jet”) sample, ≠ æ (b) the e+e gg (“gluon jet”) sample, and (c) the classifier separation integrand in Eq. (IV.32). ≠ æ Six parton shower generators—Pythia 8.205, Herwig++ 2.7.1, Sherpa 2.1.1, Vincia 1.201, Deductor 1.0.2, and Ariadne 5.0.——are run at their baseline settings with center-of-mass energy Q = 200 GeV and jet radius R =0.6.

141

141 Deep learning Tradional approach Machine learning approach

Think about physics Think about observaons

Guess some movated observables

Run simulaons

Combine best variables • Boosted Decision Trees Computer learns discriminaon • Neural Networks

Final mulvariate discriminant

Test on data 11 Neural networks

Tradional (shallow) neural networks Useful for mulvariate analysis Deep networks

Hidden Layer Handful of Inputs Output Jet mass Q Q Parcle count G G N-subjeness

(sigmoid) • Many inputs • Many hidden layers Acvaon funcon inspired by biology

12 Recent advances allowing Deep Learning

• In algorithms: – New acvaon funcons to avoid issues such as saturaon – New model regularizaons • Dropout: Randomly selected fracon p of units are ignored during each weight-update. – Network architecture adapted to applicaon • Drascally fewer elements to opmize

• In compung: – Faster compung capabilies • Graphics Processing Units (GPUs) – Easier Usability • Keras Deep Learning Python Library

13 Acvaon Funcons

Tradional New!

• Sigmoid • Tanh • Recfied Linear Unit (ReLU)

• Sigmoid and Tanh can saturate, whereby the gradient becomes vanishingly small for inputs far from zero, making training difficult.

• ReLUs avoid this saturaon problem and have an easily computable gradient.

14 How to link up units?

• Dense (Fully Connected) Layer – Each unit is linked to every unit in the previous layer.

• Convoluonal Layer – Each unit is linked to an n × n patch of the previous layer.

– Units are downsampled to n/2 × n/2 patches with a max-pooling layer.

– Can handle mulple channels. E.g. RGB images.

15 Applicaons to Image Recognion

16 Jet Images Cogan et al. arXiv:1407.5675

• Treat energy deposits as image Applicaon to boosted W tagging

preprocess

2.5 0.6 500 W Jets Light Jets 2.0 0.4 0.2 400

1.5 0.0

1 300 Q 0.2 1.0

0.4 Occupancy 200 0.5 0.6 100 0.8 0.0 1.0 0.0 0.5 1.0 1.5 2.0 2.5 0 Cell 0.4 0.3 0.2 0.1 0.0 0.1 0.2 0.3 0.4 0.5 Coecient Q2 Fisher-Jet Output Fisher discriminant (a) Fisher-Jet (b)17 Fisher-Jet Discriminant Output

Figure 2: A Fisher’s linear discriminant presented as an image (left) and the distributions of the discriminant output when applied to W-jets and Light-jets (right), when the FLD is trained on jets with p [250, 300] GeV, mass M [65, 95] GeV, and separation between T 2 2 subjets of R [0.6, 0.8]. 2

The background rejection vs. signal eciency curves for the FLD, computed using the 1-D likelihood ratios of the output distribution of the FLD for W-jets and QCD jets, can be seen in Figure 3a, along with the rejection vs. eciency curves observed when using N-subjettiness (⌧2/⌧1)[7, 8] computed analogously with the 1-D likelihood ratios. For the rejection vs. eciency curve in Figure 3a Fisher-jets are trained on jets satisfying p [250, 300] in 6 bins of R , and a combined 1D likelihood ratio distribution is T 2 jj computed by taking the likelihood ratio for each jet computed with respect to appropriate Rjj bin and merging these likelihood ratio values into a single distribution. The N- subjettiness distributions are not binned in Rjj as this did not show any improvements in performance. Figure 3b shows the eciency of W jets at a fixed QCD jet rejection of 10 as a function of jet pT for the FLD (combining the 6 bins of Rjj for each jet pT bin) and for N-subjettiness. It can be seen that FLD outperforms N-subjettiness for the full range of jet pT examined. It should be noted that the output of FLD and N-subjettiness are correlated, as shown in Figures 4a and 4b for W and QCD jets respectively, with a correlation coecient of approximately 0.7 for both W and QCD jets. Thus, the Fisher-jet approach is able to combine in a linear way the information comprising the jet e↵ectively, and capture much of the information of N-subjettiness and more. On the other hand, mass, which relies on quadratic relationships between the inputs, is a simple quantity which FLD does not reproduce, as shown in Figures 4c and 4d for W and QCD jets respectively. Since the Fisher-jet output is only slightly correlated with mass, with a correlation coecient of approximately -0.25 for both W and QCD jets indicating a small degree of anti-correlation, the performance of the classifier does not change dramatically whether it is applied to a small window around the W mass, or to a sample without jet mass cuts.

–9– Deep Convolutional Architectures for Jet-Images at the Large Hadron Collider Luke de Oliveiraa, Michael Aaron Kaganb, Lester Mackeyc, Benjamin Nachmanb, Ariel Schwartzmanb aStanford University, Institute for Computational and Mathematical Engineering (ICME), bSLAC National Accelerator Laboratory, cStanford University, Department of Statistics

Introduction Deep Convolutional Networks Physics Performance Improvements The Large Hadron Collider (LHC) at CERN is the largest and most powerful particle accelerator in Deep Learning — convolutional networks in particular — currently represent the state of the art in Our analysis shows that Deep Convolutional Networks significantly improve the classification of the world, collecting 3,200 TB of proton-proton collision data every year. A true instance of Big most image recognition tasks. We apply a deep convolutional architecture to Jet Images, and new physics processes compared to state-of-the-art methods based on physics features, Data, scientists use machine learning for rare-event detection, and hope to catch glimpses of new perform model selection. Below, we visualize a simple architecture used to great success. enhancing the discovery potential of the LHC. More importantly, the improved performance and uncharted physics at unprecedented collision energies. suggests that the deep convolutional network is capturing features and representations beyond We found that architectures with large filters captured the physics response with a higher level of physics-motivated variables. Our work focuses on the idea of the ATLAS detector as a camera, with events captured as accuracy. The learned filters from the convolutional layers exhibit a two prong and location based images in 3D space. Drawing on the success of Convolutional Neural Networks240 in < Computerp /GeV < 260 GeV, 0.19 < τ < 0.21,structure 79 < mass/GeV that sheds< 81 light240 < on p /GeV phenomenological < 260 GeV, 0.19 < τ < structures0.21, 79 < mass/GeV within < 81 jets. T 21 T 21 Vision, we study the potential of deep leaning for interpreting LHC events in new ways. Pythia 8, s = 13 TeV s = 13 TeV, Pythia 8 20 Local ReLU150 Dropout ReLU Dropout Response Fully mass mass+τ21 Normalization Connected ReLU Unit mass+∆R τ21 15 τ +∆R The ATLAS detector ∆R 21 MaxOut MaxOut The ATLAS detector is one of the two general-purpose experiments at the LHC. The 100 million 100 Convnet channel detector captures snapshots of particle collisions occurring 40 million times per second. Convnet Convnet-norm 1/(Background Efficiency) 10 1/(Background Efficiency) We focus our attention to the Calorimeter, which we treat as a digital camera in cylindrical space. Random Random Below, we see a snapshot of a 13 TeV proton-proton collision. Boosted W’s and jet images Boosted50 Boson Type Tagging 5 GenericConvolution overviewMax-Pool slide ConvolutionJet ETmiss Max-Pool Flatten 5 EvenJet Image morede Olivera non-linearity: et al. arXiv:1511.05190 Going Deep -32- Baldi et al. arXiv:1603.09349 0 0 0.2 0.4 0.6 0.8 Boosted0.2 0.4 Boson0.6 Type0.8 Tagging the underlying theoretical questions may naturally be cu- Convolved 4 Signal Efficiency Signal Efficiency 10 rious as to whether the deep network has learned a novel ConvolutionsBenjamin NachmanFeature andLayers Ariel Schartzman Jet ETmiss No pile-up DNN(image) strategy for classificationStandard which physically could informmotivated their stud- Receiver Operating Characteristic discriminants — mass (top) (a) SLAC,(b) Stanford University BDT(expert) ies, or rediscovered and further optimized the existing 3 Deep NN and n-subjettiness (bottom) BDT 10 β=2 features. D2 +mass March 26, 2014 β=1 Benjamin Nachman and Ariel Schartzman +mass While one cannot probe the motivation of the ML al- Figure 6: Left: ROC curves for individual physics-motivated features as well as three deep neural τ21 Understandinggorithm, Improvement it is possible to compare distributionss of events network discriminants. Right: the DNNs are compared with pairwise combinations of the physics- 2 Jet mass SLAC, Stanford University 10 Since the selection of physics-driven variablescategorized is driven as by signal-like physical understanding, by the di↵erent we algorithms want to be in motivated features. order to understand how the classification is being accom- Background rejection sure that the representations we learn are more than simple recombinations of basic physical Max-Pooling March 26, 2014 variables. We introduce a new method to testplished. this — To we compare derive distributions sample weights between to apply di↵erent such algo- that W’→ WZ event n-subjeness 10 B. Nachman (SLAC) Boosted Boson Type Tagging March 26, 2014 1 / 21 rithms, we study simulated events with equivalent back- 240 < p /GeV < 260 GeV, 0.19 < τ < 0.21, 79 < mass/GeV < 81 240 < p /GeV < 260 GeV, 0.19 < τ < 0.21, 79 < mass/GeV < 81 T 21 T 21 ground rejection, see Figs. 5 and 6 for a comparison of the RepeatRepeat s = 13 TeV, Pythia 8 s = 13 TeV, Pythia 8 selected regions in the expert features for the two classi- LHC Events as Images 1 0 0.2 0.4 0.6 0.8 1 fiers. The BDT preferentially selects events with values We transform the ATLAS coordinate system (η, φ) to a rectangular grid that allows for150 an image- Figure150 5: The convolution neural network concept as applied to jet-images. meaning that physical variables have no discrimination power. Then, we apply our learned mass+Convnet Apply deep learning techniquesmass+MaxOut on jet images! [3] of the features close to the characteristic signal values +Convnet B. Nachman (SLAC) Boosted Boson Type+MaxOut Tagging March 26, 2014 1 / 21 Signal efficiency based grid arrangement. During a collision, energy from particles are deposited in pixels in (η, φ) τ21 τ21 discriminant, and check for improvement in our figure of merit — the ROC curve. and away from background-dominated values. The DNN, space. We take these energy levels, and use them as the pixel intensities in a greyscale analogue. ∆R+Convnet4.1 Architecturalconvolutional Selection nets are∆ R+MaxOuta standard image MaxOut MaxOut which has a modestly higher eciency for the equivalent These images — called Jet Images — were first introduced by our group [JHEP 02 (2015) 118], For the MaxOutVisualizing architecture, we utilizemass two FC layers Learning with MaxOut activation (the first with 256 100 Convnet processing100 technique; alsoConvnet consider maxout 4 rejection,Notice selects that events removing near out the the same individual signal values, effects but of enabling the connection between LHC physics event reconstruction and computer vision.. We Below,Convnet-norm weunits, have the secondthe learned with 128 units,convolutional both of which filters have 5(left) piecewiseConvnet-norm and components the difference in the MaxOut-operation), in between the average 10 1/(Background Efficiency) followed by two FC1/(Background Efficiency) layers with ReLU activations (the first with 64 units, the second with 25 units), Pile-up <µ >=50 in somethe cases physics-related can be seen to variables retains aleads slightly to highera likelihood frac- transform each image in (η, φ), rotate around the jet-axis, and normalize each image, as is often signalRandom and background image after applying the learnedRandom convolutional filters (right).• Deep NN does beer than single variables This novel DNN(image) followed by a FC sigmoid layer for classification. We found that the He-uniform initialization [35] tion ofperformance jets away from equivalent the signal-dominated to a random region. guess, The but done in Computer Vision, to account for non-discriminative difference in pixel intensities. difference-visualization technique helps understand what the network learns. BDT(expert) for the initial MaxOut layer weights was needed in order to train the network, which we suspect• Deep NN does about as well as BDT is 3 50 50 =2 likelythe explanation Deep Convolutional is that the DNN Network has discovered retains some the n-subjeness 10 Dβ +mass due to the sparsity of the jet-image input. In cases where other initialization schemes were used, of 6 good discriminants the 2 same signal-richdiscriminative region power. identified This indicates by the expert that the features, deep In our experiments, we build discriminants on top of Jet Images to distinguish between a networks often converged to very sub optimal solutions. This network is trained (and evaluated) on β=1+mass τ21 but hasnetwork in addition learns found beyond avenues theory-driven to optimize variables the perfor- — hypothetical new physics event, W’→ WZ, and a standard model background, QCD. un-normalizedDeep NN jet-images using the transverse energy for the pixel intensities 102 Jet mass mancewe and hypothesize carve into the these background-dominated may have to do region. with 0 For the deep convolution0 networks, we use a convolutional architecture consisting of three sequen- 0.2 0.4 0.6 tial [Conv0.8 + Max-Pool0.2 + Dropout]0.4units, followed0.6 by a local response0.8 normalization (LRN) layer [8], Note thatdensity, DNNs shape, can also spread, be trained and other to be spatially independent driven of Signal Efficiency Signal Efficiency Background rejection mass, by providing a range of mass in training, or train- followed by two fully connected, dense layers. We note that the convolutional layers used are so called 18 features. “full” convolutions – i.e., zero padding is added the the input pre-convolution. Our architecture can 10 ing a network explicitly parameterized [44, 45] in mass. (a) be succinctly written as: (b) Difference in average [Dropout Conv ReLU MaxPool] 3 LRN2D [Dropout FC ReLU] Dropout Sigmoid. image between signal ! ! ! ⇤ ! ! ! ! ! ! DISCUSSION Convolutions (4.1) 1 and background Figure 7: ROC curves that combined the DNN outputs with physics motivated features for the 0 0.2 0.4 0.6 0.8 1 The convolution layers each utilize 32 featureto Jet Images maps, or filters, with filter sizes of 11 11, 3 3, Concluding Remarks Convnet (left) and MaxOut (right) architectures. ⇥ ⇥ and 3 3 respectively. All convolution layers are regularized with the L2 weight matrix norm. A Signal efficiency The signal from massive W qq jets is typically ob- ⇥ We show that modern Deep Convolutional Architectures can significantly enhance the discovery down-sampling of (2, 2), (3, 3), and (3, 3) is performed by the three max pooling layers, respectively. scured by a background from the! copiously produced low- A dropout [8] of 20% is used before the first FC layer, and a dropout 10% is used before the output potential of the LHC for new particles and phenomena. We hope to both inspire future research into Computer Vision-inspired techniquesmass for jets particle due to discovery, quarks or and gluons. continue Highly down ecient this classifi- path layer. The FC hidden layer consists of 64 units. FIG. 4: Signal eciency versus background rejection (inverse cation is critical, and even a small relative improvement After early experiments with the standard 3 3 filter size, we discovered significantly worse towards increased discovery potential for new physics. ⇥ of eciency) for deep networks trained on the images and performance over a more basic MaxOut [7] feedforward network. After further investigation into larger in the classification accuracy can lead to a significant boosted decision trees trained on the expert features, both boost in the power of the collected data to make statis- convolutional filter size, we discovered that larger-than-normal filters work well on our application. with (bottom) and without pile-up (top). Typical choices of Though not common in the Deep Learning community, we hypothesize that this larger filter size is signal eciency in real applications are in the 0.5-0.7 range. tically significant discoveries. Operating the collider is helpful when– 13 dealing – with sparse structures in the input images. In Table 1, we compare di↵erent Also shown are the performance of jet mass individually as very expensive, so particle physicists need tools that al- filter sizes, finding the optimal filter size of 11 11, when considering the Area Under the ROC Curve ⇥ well as two expert variables in conjunction with a mass win- low them to make the most of a fixed-size dataset. How- (AUC) metric, based on the ROC curve outlined in Sections 3 and 5. dow. ever, improving classifier performance becomes increas- ingly dicult as the accuracy of the classifier increases. Physicists have spent significant time and e↵ort de- INTERPRETATION signing features for jet-tagging classification tasks. These –8– designed features are theoretically well motivated, but as Current typical use in experimental analysis is the their derivation is based on a somewhat idealized descrip- combination of the jet mass feature with ⌧21 or one of tion of the task (without detector or pileup e↵ects), they the energy correlation variables. Our results show that cannot capture the totality of the information contained even a straightforward BDT-combination of all six of the in the jet image. We report the first studies of the ap- high-level variables provides a large boost in comparison. plication of deep learning tools to the jet substructure In probing the power of deep learning, we then use as our problem to include simulation of detector and pileup ef- benchmark this combination of the variables provided by fects. the BDT. Our experiments support two conclusions. First, that The deep network has clearly managed to match or machine learning methods, particularly deep learning, slightly exceed the performance of a combination of the can automatically extract the knowledge necessary for state-of-the-art expert variables. Physicists working on classification, in principle eliminating the exclusive re- Deep learning for Q vs G

NN inputs

preprocess

• Center • Crop • Normalize • Zero • Standardize

We use 3 “colors”

• Find an-kT R=0.4 jets • Extract square grid around jet center • Red = pT of charged parcles • Pixelate into ∆η × ∆φ = 0.024 × 0.024 cells • Green = pT of neutral parcles • Produces 33x33 image • Blue = charged parcle mulplicity

19 ⌘

beam Deep NN architecture pre-process

convolutional layer dense layer

quark jet

gluon jet max-pooling

3 ⇥ | {z } Figure• Convoluon layers apply 8x8 pixel filters to images 2: An illustration of the deep convolutional neural network architecture. The first layer is• the 4x4 filters used for for 2 input jet image, followednd by and 3 threerd conv. layers convolutional layers, a dense layer and an output• layer. We use 64 independent filters in each layer • Max-pooling reduces layer size by 4 charged,• Final layer is densely connected to all final filters neutral and negatively charged particles. To be concrete, in this study we take three input channels:

red = transverse momenta of charged particles

green = the transverse momenta of neutral particles 20

blue = charged particle multiplicity

Each of these observables is evaluated on each image pixel. All channels of the image undergo (k) the standard pre-processing: the images are normalized such that ij Iik = 1, where k indexes over channel; the zero centering and standardization are done for each pixel in each P

–9– Results

1000 GeV Jets from Pythia

Deep Color NN

Deep grayscale NN ✏ Q BDT of top 5 variables p✏G

Single variables

Works really well – especially considering we don’t put in any physics!

Figure 5: (top) ROC and (bottom) SIC curves of the FLD and the deep convolutional 21 network trained on (left) 200 GeV and (right) 1000 GeV Pythia jet images with and without color compared to baseline jet observables and a BDT of the five jet observables. Figure 5: (top) ROC and (bottom) SIC curves of the FLD and the deep convolutional network trained on (left) 200 GeV and (right) 1000 GeV Pythia jet images with and without color compared to baseline jet observables and a BDT of the five jet observables. in signal over background discrimination power in a collider physics application, and also exhibits a nontrivial maximum (at some "q) whichin gives signal an over unbiased background measure discrimination of the relative power in a collider physics application, and also performance of di↵erent discriminants [6]. exhibits a nontrivial maximum (at some "q) which gives an unbiased measure of the relative performance of di↵erent discriminants [6]. The ROC and SIC curves of the jet variables and the deep convolutional network on 200 GeV and 1000 GeV Pythia jets are shown in FigureThe ROC5. and The SIC quark curves jet of classificiationthe jet variables and the deep convolutional network on 200 GeV and 1000 GeV Pythia jets are shown in Figure 5. The quark jet classificiation eciency at 50% quark jet classification eciencye forciency each at of 50% the quark jet variables jet classification and the eciency CNN for each of the jet variables and the CNN are listed in Table 1. To combine the jet variablesare into listed more in Table sophisticated1. To combine discriminants, the jet variables a into more sophisticated discriminants, a boosted decision tree (BDT) is implemented withboosted scikit-learn. decision tree The (BDT) convolutional is implemented network with scikit-learn. The convolutional network outperforms the traditional variables and matchesoutperforms or exceeds the the traditional performance variables of the and BDT matches of or exceeds the performance of the BDT of all of the jet variables. The performance of the networksall of the trained jet variables. on images The performance with and of without the networks trained on images with and without color is shown in Figure 6. color is shown in Figure 6.

– 13 – – 13 – Results

200 GeV Jets from Pythia

Deep Color NN ✏Q Deep grayscale NN p✏ G BDT of top 5 variables

Single variables

Works really well – especially considering we don’t put in any physics! Figure 5: (top) ROC and (bottom) SIC curves of the FLD and the deep convolutional 22 network trained on (left) 200 GeV and (right) 1000 GeV Pythia jet images with and without color compared to baseline jet observables and a BDT of the five jet observables. Figure 5: (top) ROC and (bottom) SIC curves of the FLD and the deep convolutional network trained on (left) 200 GeV and (right) 1000 GeV Pythia jet images with and without color compared to baseline jet observables and a BDT of the five jet observables.

in signal over background discrimination power in a collider physics application, and also exhibits a nontrivial maximum (at some "q) which gives an unbiased measure of the relative in signal over background discrimination power in a collider physics application, and also performance of di↵erent discriminants [6]. exhibits a nontrivial maximum (at some "q) which gives an unbiased measure of the relative performance of di↵erent discriminants [6]. The ROC and SIC curves of the jet variables and the deep convolutional network on 200 GeVThe and ROC 1000 and GeV SIC curves Pythia of the jets jet are variables shown and in the Figure deep convolutional5. The quark network jet on classificiation eciency200 GeV at and 50% 1000 quark GeV jet Pythia classification jets are shown eciency in Figure for5. each The of quark the jet jet classificiationvariables and the CNN aree listedciency in at Table 50% quark1. To jet combine classification the e jetciency variables for each into of the more jet variables sophisticated and the CNN discriminants, a are listed in Table 1. To combine the jet variables into more sophisticated discriminants, a boosted decision tree (BDT) is implemented with scikit-learn. The convolutional network boosted decision tree (BDT) is implemented with scikit-learn. The convolutional network outperformsoutperforms the the traditional traditional variables variables and and matches matches or exceeds or exceeds the performance the performance of the BDT of of the BDT of all ofall the of the jet jet variables. variables. The The performance performance of the of networks the networks trained trained on images on with images and without with and without colorcolor is shown is shown in in Figure Figure 66..

– 13 – – 13 – Comparing Pythia and Herwig

• Discriminaon worse in Herwig • Gluon and quark jets are more similar • Consistent with previous studies

Network performance independent of MC used to train

• Indicates robustness • May work on data

23 Figure 8: ROC curves for the Pythia- and Herwig-trained CNNs applied to 200 GeV samples generated with both of the generators. Remarkably, the network performance seems robust to which samples are used for training. distributions and disagree primarily on the gluon distributions. The baseline jet variables and the convolutional network all indeed have worse performance on Herwig jets than on Pythia jets. A comparison of the discrimination power of the observables between the two generators is included in Table 1. It is interesting to consider the four possibilities of applying the convolutional networks trained on Pythia jets or Herwig jets to test samples of Pythia jets or Herwig jets. Figure 8 and Figure 9 show the resulting ROC curves and distributions of convolutional network outputs on the colored jet images. We find that the network is surprisingly insensitive to the generator: the convolutional network trained on Pythia jets and tested on Herwig jets has comparable performance to the convolutional network trained directly on Herwig jets and tested on Herwig jets. This insensitivity is a positive sign for being able to train the network on MC-generated jets and apply it to data robustly.

6 Conclusions

The ability to distinguish quark-initiated jets from gluon-initiated jets would be of tremendous practical application at colliders like the LHC. For example, many signals of beyond the standard model physics contain mostly quark jets, while their backgrounds are gluon-jet dominated. Quark/gluon jet discrimination is also extremely challenging: correlations in their radiation patterns and non-pertubative e↵ects like are hard to disentangle. Thus this task is ideally suited for artificial intelligence.

– 17 –

Figure 6: SIC curve of deep convolutionalIs it learning physics? network performance on Pythia jets with color (solid) and without color (dotted). The introduction of color becomes more helpful at higher energies, with the largestAdd in observables improvement on the 1000 GeV jets. • CPM = charged parcle mulplicity

• N95= a useful discriminant (minimum number of pixels with 95% of jet pT) [Pumplin 1991]

Except at very high pT, no benefit from adding observables

May indicate that NN has “learned” physics

Figure 7: SIC curve of deep convolutional network performance on Pythia jets without color 24 with additional inputs of CPM (solid), N95 (dashed), and zero as a control (dotted). The spread in the SIC curves for models trained on 100 GeV and 200 GeV jets is within the typical variation, with no clear improvement from the additional variables. For models trained on 500 GeV and 1000 GeV jets, a modest improvement was seen from the introduction of CPM, though not as large as the improvement from the introduction of color.

– 16 – Conclusions

• Quark and gluon jets can be disnguished by radiaon paerns • Pythia and Herwig have significant differences, parcularly for gluons • Improved parton showers (e.g. vincia) look promising

• Tradional variables • Two types: shape (mass, girth, n-jeness) and count (# parcles, # subjets) • Marginal gains from exploing correlaons of >2 variables using BDTs

• Deep learning approach • Use image-recognion technology to avoid thinking • Does beer than tradional approach! • Relies heavily on simulaons, but • Performance independent of Pythia or Herwig training

Domo Arigato!

25 BACKUP

26 ��� ������� ��� ��� ����-��-�� ��� ��� ������� ��� κ ����-��-�� κ κ �(��λβ) κ �(��λβ) �(��λβ) �(��λβ) ����� ��� ����� ����� ��� ����� ��� ����� ��� ����� ��� ����� ��� ����� ��� �� > ��� ��� ��� �� > ��� ��� �� > ��� ��� �� = ��� �� > ��� ��� �� = ��� �� = ��� �� = ���

κ ��� ��� κ ��� ��� κ ��� ��� κ ��� ���

��� ��� ��� ��� �� �� �� ��

��� � ��� � ��� Setting the quark� fraction��� f equal to 1/2 and� CA/CF =9/4 for QCD, the mutual ��� ������ ������ ������ ��� ��� ��� ������ ������ ������ ������ ��� informationβ β for quark/gluon discriminationβ β is Analyc approach to correlaons (a) (b) (a) (b) I(T ; A)f=1/2 0.103Larkoski. et al. arXiv:1408.3122 (3.9) ��� ��� ������ ������� � ��� ��� ������' ++������++ κ κ κ �(��λβ) κ �(��λβ) �(��λβ) �(��λβ) This will be the baseline����� ��� ����� value to which all observables����� ��� ����� will be compared. Note that I(T,A) ��� ����� ��� ����� ��� ����� ��� ����� ��� �� > ��� ��� ��� �� > ��� ��� �� > ��� ��� �� > ��� ��� �� = ��� �� = ��� is quite far from 1�� = (i.e.��� a full truth bit), demonstrating�� = ��� Monte-Carlo simulaons the inherent challenge of quark/gluon κ ��� ��� κ ��� ��� κ ��� tagging. ��� κ ��� ���

��� ��� ��� ��� �� �� �� �� ��� ��� ��� 4 Generalized� Angularities� ��� � � ��� ������ ������ ������ ��� ��� ��� ������ ������ ������ ������ ��� β β β β Our analytic studies of quark/gluon separation will focus on the generalized angularities  (c) (c) (d) (d) defined in Eq. (1.2), repeated for convenience: ��� ������� ��� ��� ����-��-�� Analyc approach to generalized angularies  �(��λκ ) FigureFigure 12: The 12: quark/gluon The quark/gluon truth truthoverlap�(�� overlapλκ for) an for individual an individual generalized generalized angularity angularity. Top: . Top: β β   ��� ����� ��� ����� the LLthe and��� LL NLL and analytic NLL analytic calculations. calculations. Note����� ��� ����� that Note these that calculations these calculations are= singular are singularz at✓ =. at 0 or = 0 or (4.1) �� > ��� ��� �� > ��� ��� i i � = ��� � = ��� � = 0. = Bottom: 0. Bottom: the Pythia the Pythia 8 and 8Herwig++and� Herwig++partonparton showers showers (identical (identicali jet to Figs. to4a Figs.and4a and κ ��� ��� 4b). The4bκ).��� solid The boxes solid boxes correspond correspond to dots to indicated��� dots indicated in Fig. in1, Fig. the1 dashed, the dashed boxX2 corresponds box corresponds to to the IRCthe safe IRC angularities safe angularitiese , ande the, and grey the dashed grey dashed curve incurve the in LL/NLL the LL/NLL plots marks plots marks the range the range Here zi is the energy fraction and ✓i = Ri/R0 the angular fraction with respect to the jet ��� of validityof validity��� of our of calculations our calculations (i.e. the (i.e. edge the of edge the of blue the region blue region in Fig. in7). Fig. 7). • Challenging �� radius R , such that 0 �� z , ✓ 1. We measure• the Not impossible angles R with respect to the recoil-free 0  i i  i ��� � winner-take-all��� axis [21–� 23] and we use a jet algorithm• Complementary to MCs that centers the jet on the winner-take- ��� ��� ��� ��� ��� ��� ��� ��� ��� ��� 27 β all axis, suchβ that ✓ 1 is strictly enforced. For the IRC safe angularities e ,itisknown i  (a) that a recoil-free(b) axis improves quark/gluon discrimination power [9]. For the generalized angularities , a recoil-free axis is crucial for the calculations with ,sinceitensures ��� ������ � ��� ������++ . κ  κ �(��λβ) that measures the�(�� radiationλ–β) 25 –– 25 – pattern around the initiating hard quark or gluon and not ��� ����� ��� ����� ��� ����� ��� ����� �� > ��� ��� �� > ��� ��� �� = ��� the displacement (i.e.�� recoil)= ��� of the hard parton away from the jet axis.

κ ��� ��� κ ��� These variables are��� e↵ective quark/gluon discriminants because they probe the angular and energetic structure of jets, both of which are sensitive to the di↵ering color factors between ��� ��� �� quarks and gluons, among�� other e↵ects. Large emphasizes wide-angle radiation whereas

��� � ��� � ��� ��� ��� ��� ��� small��� ���emphasizes��� ��� ��� collinear radiation. Large  emphasizes harder , whereas small  β emphasizesβ softer hadrons. For reference, we highlight the  = 1 and = 0 cases: (c) (d) 1 e = zi✓i , (4.2)  ⌘ Figure 12: The quark/gluon truth overlap for an individual generalized angularity . Top: i jet the LL and NLL analytic calculations. Note that these calculations are singular at  = 0 or X2   = 0. Bottom: the Pythia 8 and Herwig++ parton showers (identical to Figs. 4a and 0 = zi . (4.3) 4b). The solid boxes correspond to dots indicated in Fig. 1, the dashed box corresponds to i jet X2 the IRC safe angularities e, and the grey dashed curve in the LL/NLL plots marks the range of validity of our calculations (i.e. the edge of the blue region1 in Fig. 7). While 0 = 1 is a trivial observable, we can expand around  =1tofind

 lim 0 =1+ ( 1)zi ln zi, (4.4)  1 ! i jet X2 1  so when we present studies for 0, we really mean lim 1 0 ,whichise↵ectively the same as ! the observable i jet zi ln zi. – 25 – To get a feel for2 the performance of the various , we can use parton shower simulations P to estimate their quark/gluon truth overlap. We generate an equal admixture of quark and

– 10 – Data: where are the quark and gluon jets?

Quark200 GeVPurity Gluon for Different Purity pT 50 γj 3 10410 50 γjj 2jet100 γj 100 γjj 200 γjj 103 b+jet 200 γj 1 2 10 400 γj 3jet softest400 γjj 3jet mid

Cross Section in pb b+2jet hardest 101 800 γj b+2jet800 γ jjsoftest Cross Section in pb 10-3

1 3jet hardest 1600 γj 1600 γjj 10-110-6 70%50% 60% 80% 70% 80% 90% 90% 100% 100% GluonQuark Purity Purity

Figure 11:Photon + jet samples Cross section as a functionFigure of 12: quarkCross purity. section as• The a functionb + 2 jet: high purity low s left ofpanel gluon purity shows for the the diff purityerent samples for thewith a 200 GeV cut on all non-b jets. The different points correspond to different cuts placed on a Boosted Decision Tree different samples• Jet closer to photon likely quark with a 200 GeV cuttrained on all to non- optimizeb jets. the gluon The purity. diff• erent TheOne jet is b other is gluon leftmost points dots correspond of each sample to are di thefferent uncut purities. There cuts placed on a Boosted Decision Treeare output, 3 curves for trained the 3-jet samples, to• optimizeDijet: high cross secon, low purity and two t forhe the quarkb+2jet purity. samples, corresponding The leftmost to which of the jets dots of each sample are the uncut purities,(from and hardest each to softest) successive is being do considered.t corresponds Note the to three cutting 3-jet samples the number start with identical cross sections, but higherth purities are achievable for the softer jets. of events in half. By the final dot, which keeps 1/128 of the signal, cutting harder no longer increases28 the purity. The right panel shows the purities for the γ+1jet (red) and γ+2jet (blue) samples for various pT ’s, where the cuts are with BDTs trained on 6 and 9 kinematic variables, respectively. The samples with jets and b’s, where we can use b-tagging information to help purify the sample. black curves correspond to purities obtained after cutting on the single variable ηγηj1 + ∆Rγj2 .The This will in fact be relevant, but we will find that the 3- and 4-jet samples actually work quite blue curve takes the jet closest to thewell, photon and avoid as a having starting to deal point, with whereasb-tagging. the black curve takes the softer of the two jets as its starting point. This is the reason for the lower initial purity but the same To begin, we start with a multivariate BDT analysis using as inputs the (pT , η, φ)ofall cross section. (It was easier to find a singlefinal state variable particles. using The the results softe forr the jet di ratherfferent 200 than GeV the sam jetples closer are shown to in Figure 12. the photon.) We can see that while the b+2jets has good efficiency, it also has a cross section orders of magnitude smaller than the 2-jet sample. The 3-jet sample is somewhere in between, with efficiencies about 80% for a cross section of 100 pb. We will consider these three samples in the following, as there may be situations when each one is advantageous. anticipated, the γ+1jet cannot be purifiedFirst, consider much the —b+2jet putting sample. harsher Looking cuts back at hits Figure a wal 3, weland see that there is a eventually just kills the cross section.contribution On the from right, both ‘GG’we foc (withusggb onfinal just states) the andγ+1jet ‘QG’ (with andqgbγ+2jetfinal states). The ggb section obviously has perfect gluon efficiency regardless of cuts. The main parton level process samples for all pT .TheredcurvesaretheBDToutputusing6inputsforγ+1jet, the contribution in the qgb channel is ub → ubg,whichlookslikefinalstategluonradiationfrom blue curves BDT with 9 inputs fort-channelγ+2jets,ub → andub.Sinceweputahardercutonthe the black curves for ouru singleand g than variable the b,thekinematics

ηγ ηj1 + ∆Rγj2 .Itisnicethatthesinglevariabledoesaswellasthecomprewill mostly have the u going back-to-back with the gb,andsothehensiveg analysiswill be somewhat softer. using the 9 BDT inputs. This explains why the starting efficiencies for the softer jet at pT =200 GeV are around 73%, versus 63% for the harder jet, as shown in see Figure 12. We conclude that the best way to get a clean quark sample at low pT is to use γ+1jet, for simplicity, or γ+2jets at moderate to large pT ,cuttingonthesinglevariableηγηj1 + ∆Rγj2 . Depending on how much cross section you are willing to sacrifice,–11– for 200 GeV jets, you can get 95% quark purity at 2 pb or 99% purity at 500 nb.

3.2 Gluon jet purification

Next, we turn to the more difficult case of gluon jet purification. It is more difficult because there is no starting sample with purity above 80%, and becausetherearenosimplephysically motivated handles for purification. Indeed, for the quark, weusedthefactthatthereisa collinear qγ singularity but no gγ singularity to inspire a ∆Rjγ cut. But for a gluon we cannot use the gq singularity since we are trying to avoid q jets all together. The exception is

–10– Mulvariate approach

• We can think about and visualize single variables

• Two variables are harder

• Muldimensional distribuons are not well-suited for visualizaon.

• There are things that computers are just beer at.

• Mulvariate approaches let you figure out how well you could possibly do

EFFICIENCY

Save you the trouble or looking

for good variables (project killer) FRAMING

See if simple variables POWER can do as well (establishes the goal) Somemes they are really necessary (e.g. b tagging) 29 Mulvariate methods

Lots of methods • Boosted Decision Trees • Arficial Neural Networks • Fischer Discriminants • Rectangular cut opmizaon Useful in many • Projecve Likelihood Esmator areas of science • H-matrix discriminant • Predicve learning/Rule ensemble • Support Vector Machines • K-nearest neighbor • …

For parcle physics, Boosted Decision Trees are best suited for combining variables

Easy to understand Train fast Nearly opmal efficiencies

30 Combining Variables: Girth and Charged Count Quark Gluon .4 .4 40

50

.3 .3 30 40

30 .2 .2 20

20 Girth Girth .1 .1 10 Are they correlated? 10 0 0 0 0 Combining Variables: Girth and Charged0 Count 10 20 30 0 10 20 30 Combining Variables: Girth and Charged CountCharged Count Charged Count QuarkQuark GluonGluon Likelihood: q/(q + g) .4 .4 40 .4 1 .4 .4 40 .9 50 50 .8 .3 .3 30 .3 .3 .3 30 40 .7 40 .6

3030 20 .2 .5 .2.2 .2 .2 20 .4 20 20 Girth Girth Girth

Girth Girth .3 .1 .1 10 10 .1 .1 .1 .2 1010 .1

0 0 00 0 0 0 0 0 0 0 10 20 30 00 10 10 20 20 30 30 00 10 10 20 20 30 30 ChargedCharged Count Count ChargedCharged Count Count Charged Count Jason Gallicchio (Harvard/Davis) Gluon Tagging and Quark & Gluon Samples28 November 2011 23 / 48

Cut here

• Not completely • Can get more discriminaon from 2D cuts

Jason Gallicchio (Harvard/Davis) Gluon Tagging and Quark & Gluon Samples28 November 2011 23 / 48 Jason Gallicchio (Harvard/Davis) Gluon Tagging and Quark & Gluon Samples28 November 2011 23 / 48 31 Pythia vs Herwig

Charged Track Count (ntrk) Linear Radial Moment (jet width) 16 8 14 Q 50 Herwig++ 7 Q 50 Herwig++ 12 Q 50 Pythia8 Q 50 Pythia8 6

10 G 50 Herwig++ G 50 Herwig++ 5 8 G 50 Pythia8 G 50 Pythia8 4 6 3

4 2

2 1

0 0 0 5 10 15 20 25 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8

mass/pT 1-subjettiness, optimized axes β =1/4

7 6

6 Q 50 Herwig++ Q 50 Herwig++ 5 Q 50 Pythia8 Q 50 Pythia8 5 G 50 Herwig++ 4 G 50 Herwig++ 4 G 50 Pythia8 G 50 Pythia8 3 3

2 2

1 1

0 0 0 0.1 0.2 0.3 0.4 0.5 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8

• Pythia and Herwig qualitavely similar • Discriminaon power with Figure 28. DistributionsHerwig of charged ++ universally worse track count and linear radial moment (here calculated using only the charged tracks within the jet) for 50 GeV jets. Quark samples are blue and Gluon samples red. Pythia 8.165 is the lighter shade and Herwig++ 2.5.2 is in the darker shade. 32

–40– Quark and gluon tagging: results Gluon Efficiency % at 50 GeV 200 GeV 50% Quark Acceptance Particles Tracks Particles Tracks P8 H++ P8 H++ P8 H++ P8 H++ 2-Point Moment β=1/5 8.7∗ 17.8∗ 13.7∗ 22.8∗ 8.3 15.9 13.2 19.6 1-Subjettiness β=1/2 9.3 18.5 14.2 22.9 7.6 16.2 12.3 19.4∗ 2-Subjettiness β=1/2 9.2 18.6 13.9 23.6 6.8 15.7∗ 9.8 18.7 Single ∗ ∗ 3-Subjettiness β=1 9.1 19.3 14.6 24.4 5.9 16.7 8.6 19.5 variables Radial Moment β=1 (Girth) 10.3 20.5 16.1 24.9 11.2 18.9 15.3 21.9 Angularity a =+1 10.3 20.0 15.8 24.5 12.0 19.3 14.0 21.6 Det of Covariance Matrix 11.2 21.2 18.1 27.0 9.4 20.9 13.5 24.6 2 jet Track Spread: /pT 16.5 25.3 16.5 25.3 9.3 20.1 9.3 20.1 Track Count ! 17.7 26.4 17.7 26.4 8.9 21.0 8.9 21.0 Decluster with kT , ∆R 15.8 24.5 20.1 28.4 13.9 20.1 16.9 23.4 Jet m/pT for R=0.3 subjet 13.1 25.9 16.3 27.7 11.9 24.2 14.8 26.2 Planar Flow 28.7 34.4 28.7 34.4 39.6 42.9 39.6 42.9 Pull Magnitude 37.0 39.0 32.9 35.6 30.6 30.2 29.6 30.6 Track Count & Girth 9.9 20.1 13.4 23.2 7.1 17.3 7.7∗ 18.7 ∗ ∗ ∗ Pairs of R=0.3 m/pT &R=0.72-Pointβ=1/5 7.9 17.7 12.2 22.1 5.7 14.4 8.5 17.9 1-Subj β=1/2 & R=0.7 2-Point β=1/5 8.5 17.3∗ 12.9 22.1 6.0 14.6 8.6 17.7∗ variables ∗ Girth & R=0.7 2-Point β=1/10 12.6 21.9 12.6 21.9 9.2 18.0 9.2 18.0 1-Subj β=1/2 & 3-Subj β=1 8.9 18.0 14.0 23.2 5.6∗ 15.0 8.4 18.4 3,4,5 Best Group of 3 7.5 17.0 11.0 20.9 4.7 14.0 6.9 16.6 variables Best Group of 4 7.1 16.7 10.6 20.5 4.5 13.7 6.2 16.3 Best Group of 5 6.9 16.4 10.4 20.0 4.3 13.3 6.1 15.9 33 Table 1. Comparison of gluon efficiencies at the 50% quark acceptance working point. All of the single variables use R=0.5 jets, wheras combinations sometimes include R=0.7 jets. Gluon efficiencies, rather than gluon rejections (one minus efficiencies), are shown because a fractional improvement here is the same fractional improvement in S/B. Divided by two, it is also the fractional improvement in S/√B.Thesescoreshave 0.5% statistical errors, but they are correlated — the differences between ± variables has smaller spread, as does the improvement when combining variables. Because of the large number of variables and parameters, and the larger number of possible combinations of these, there is definitely a look-elsewhere-type effect when choosing the top pair. Many pairs statistically tied for the top spot in each category, so five pairs were chosen as representative. Their scores are marked with asterisks, as are the best individual variables in each category. Thebestgroupsof3,4,and5startto show diminishing returns.

–42–