MIT–CTP 5223

Mapping Machine-Learned Physics into a Human-Readable Space

Taylor Faucett,1 Jesse Thaler,2, 3 and Daniel Whiteson1 1Department of Physics and Astronomy, UC Irvine, Irvine, CA 92627 2Center for Theoretical Physics, Massachusetts Institute of Technology, Cambridge, MA 02139 3The NSF AI Institute for Artificial Intelligence and Fundamental Interactions (Dated: April 21, 2021) We present a technique for translating a black-box machine-learned classifier operating on a high- dimensional input space into a small set of human-interpretable observables that can be combined to make the same classification decisions. We iteratively select these observables from a large space of high-level discriminants by finding those with the highest decision similarity relative to the black box, quantified via a metric we introduce that evaluates the relative ordering of pairs of inputs. Successive iterations focus only on the subset of input pairs that are misordered by the current set of observables. This method enables simplification of the machine-learning strategy, interpretation of the results in terms of well-understood physical concepts, validation of the physical model, and the potential for new insights into the nature of the problem itself. As a demonstration, we apply our approach to the benchmark task of jet classification in collider physics, where a convolutional neural network acting on calorimeter jet images outperforms a set of six well-known jet substructure observables. Our method maps the convolutional neural network into a set of observables called energy flow polynomials, and it closes the performance gap by identifying a class of observables with an interesting physical interpretation that has been previously overlooked in the jet substructure literature.

CONTENTS powerful solutions to important and difficult problems in high-energy physics [1, 2]. Examples of tasks that I. Introduction 1 have benefitted from NNs include triggering [3], event re- construction [4, 5], object identification [6–8], and event II. Translating from Machine to Human 3 selection [9–11]. In all of these contexts, though, the A. Average Decision Ordering 3 physicist wonders: what has the machine learned? Satis- B. Black-Box Guided Search Strategy 4 faction with improved performance is tempered by frus- tration with the “black-box” nature of NN strategies. III. A Case Study in Jet Substructure 5 A. Boosted Boson Classification 6 The advent of deep learning has made this question B. Energy Flow Polynomials 7 more urgent, as the data dimensionality of the tasks has increased dramatically and ML approaches have outper- IV. Supplementing Existing Observables 8 formed human-engineered strategies for problems that, A. Black-Box Guiding 9 until recently, were deemed too difficult for ML. Exam- B. Physics Interpretation 9 ples include event classification [12–17], jet substructure studies [18–22], jet flavor classification [23–27], detector V. Iteratively Mapping from Minimal Features 11 unfolding [28–30], and uncertainty estimation [31–35]. A. Black-Box Guiding 11 B. Comparison to Brute Force Search 13 In each case, a deep NN has successfully utilized more C. Comparison to Truth-Label Guiding 13 information from the high-dimensional low-level input D. Physics Interpretation 14 data than was captured by a smaller number of physics- motivated high-level (HL) observables. When there is a VI. Discussion 15 performance gap between machine-learned and human- arXiv:2010.11998v2 [hep-ph] 20 Apr 2021 engineered strategies, physicists want to understand how Acknowledgments 16 the NN is using the low-level information to improve its performance. This desire also applies to situations where A. Network Architectures and Hyperparameters 16 performance of the ML solution matches (but does not 1. Baseline Convolutional Neural Network 17 exceed) the human-engineered approach. Has the ma- 2. Baseline Dense Neural Network 17 3. K-fold Validation 17 chine learned the same strategy that humans devised, or has it found an alternative solution with equal efficacy? References 17 In this paper, we present a technique for translating a black-box ML strategy based on low-level inputs into a human-readable space of HL observables. Instead of try- I. INTRODUCTION ing to directly interpret an NN classifier, our approach is to use the NN as a guide for the construction of a simpler It is widely appreciated that neural networks (NNs) classifier which makes the same decisions but relies on and related machine-learning (ML) tools can provide a small number of physics-inspired human-interpretable 2 inputs, selected iteratively from a large space of observ- helps gauge the effective upper limit on possible ML per- ables. As a demonstration of our method, we present formance and determine asymptotically optimal decision a case study in jet classification [20], using a convolu- boundaries. The new second step is translating as much tional neural network (CNN) to guide the selection of of the ML strategy as possible to a well-understood set of a small set of HL observables called energy flow poly- HL observables. This allows for physical interpretation nomials (EFPs) [36]. We find that the final set of HL of the information being used, validation of the model- observables provides the same classification performance ing, definition of reasonable systematic uncertainties, as as the CNN acting on the low-level inputs, but in a more well as computational benefits due to dimensionality re- compact, interpretable, and physically meaningful for- duction. mat. Our study suggests that physicists should consider an overlooked set of HL observables that are relevant to An extended outline of this paper is as follows. In jet substructure classification. Sec. II, we present a general approach to map an ML This desire to gain insight into the nature of ML strate- model’s learned solution into a human-readable space. gies is important on several levels. First, it is critical that This mapping requires that we construct a large space information used by the NNs be validated as real and of candidate HL observables and use a reliable similar- physical, rather than an artifact of the training samples ity metric for comparing these observables to the learned or procedure. Even in cases where the ML training does solution. As our similarity metric, we introduce average not rely on simulated samples [37–39], it is important to decision ordering (ADO), which is related to Kendall’s understand what information is being used. Translat- rank correlation coefficient [46] and quantifies to what ex- ing a black-box ML strategy into a simpler network that tent two classifiers make the same (even if incorrect) deci- relies on a smaller number of physically meaningful ob- sions. We then present a mapping strategy that leverages servables allows for effective validation. Second, having the ADO. Our black-box guided strategy iteratively finds an interpretable strategy based on HL inputs allows for the maximum ADO between the HL observables and a more reliable estimates of systematic uncertainties. If the fixed black-box ML algorithm. HL space is small enough, the HL inputs themselves can be individually studied and calibrated, which is a more straightforward task than handling low-level inputs di- In Sec. III, we review the collider task of discriminat- rectly. Third, if one can replace a sophisticated ML strat- ing between jets originating from boosted W bosons and egy with a simpler network based on preprocessed inputs, those originating from light quarks and gluons. This this can enable faster interference and lower memory re- is a well-studied problem in the field of jet substruc- quirements at run-time [40, 41]. Finally, if previously ture [21, 47–58], where both HL [59–64] and NN strate- overlooked information in the low-level data is physical, gies [18–20] have proven effective. Our starting point will identifying it can provide new insights into the nature of be the analysis of Ref. [20], which found a small but per- the problem. Indeed, in our jet classification case study, sistent improvement in classification performance with a we show how the six HL observables studied in Ref. [20] deeply-connected CNN when compared to a boosted de- can be augmented by a seventh HL observable that had cision tree (BDT) of HL observables. To augment the not been previously considered in the literature. By scru- set of HL observables for jet tagging, we search the space tinizing the structure of the EFPs, we can provide a phys- of EFPs [36], which form an (over)complete basis of col- ical interpretation for this new observable. lider observables that are infrared and collinear (IRC) There have been previous proposed strategies to draw safe. We also introduce EFP variants inspired by IRC- connections between a learned NN strategy and existing unsafe observables that have been successful in other jet HL observables. This can done by comparing the perfor- tagging studies [65–68]. mance with and without the HL observables [42] or pro- jecting the decision surfaces along those observables [12]. The results of our case study are presented in Secs. IV An alternative strategy is to expand the NN function in and V. We start with the six HL observables from a basis of the input features [43–45]. These strategies are Ref. [20], and use the black-box guided strategy to iden- valuable, but are primarily limited to studying the struc- tify a seventh HL observable that closes the performance ture of the NN in terms of already-identified HL observ- gap with the CNN. We then apply the black-box guided ables. The goal of the present work is to identify new strategy starting from just the mass and transverse mo- HL observables relevant for ML tasks, starting from a mentum of the jet, comparing the results to a brute force large space of HL observables that is as broad and com- strategy of directly searching the space of EFPs and a prehensive as possible (yet still interpretable) and sys- guided search based on ground truth labels. The black- tematically mapping a black-box ML strategy into that box guided strategy significantly outperforms the label space. guided search, reaching comparable performance to the Our mapping strategy suggests a new approach to the brute force strategy with considerably reduced compu- application of deep learning to high-energy physics data. tational costs. At the end of each of these sections, we In this approach, training a powerful deep neural network provide a physical interpretation of the translated ML (DNN) on low-level inputs is just the first step, which strategy, and we draw broader conclusions in Sec. VI. 3

II. TRANSLATING FROM MACHINE TO these decision functions are applied, which in this case is HUMAN binary classification. Because classification performance is invariant under any non-linear monotonic transforma- In our mapping approach, we seek to identify a small tion of f or g, our similarity metric cannot be affected set of physically-motivated HL observables that, when by such a transformation. This rules out naive metrics combined into a joint classifier, make the same classi- like functional overlap or linear correlation. fication decisions as a deep network operating on the Furthermore, it is not sufficient to simply compare low-level features. Crucially, our set of HL features is the overall performance of two classifiers over a given designed to, when combined, maximize the classifica- dataset, since that does not provide information about tion performance by following the learned strategy of the how the low-level inputs are being used. As discussed in black-box NN, which we argue below is more efficient Ref. [67], two decision functions might use information than training directly on ground truth information. If from different regions of the low-level input space and this translation is successful, then we will have expressed make conflicting classification decisions case by case, yet the ML strategy more simply and transparently in terms still achieve similar overall performance. The key to our of a few HL observables. guided strategies is that we aim to match not just the The first step in our approach is to identify a com- classification performance of the black-box NN, but also prehensive set of HL observables that are potentially rel- the specific classification decisions. evant for solving the ML task at hand. This in turn We assume that the decision functions f(x) and g(x) requires (human) knowledge about the physical system are real valued and that the final binary classification is being studied and about the kinds of HL observables determined by a threshold on the decision function out- that have interpretable meaning. We know of no way put. Objects on one side of the threshold are labeled “sig- to automate this step, though automation may not be nal”, while objects on the other side are labeled “back- desirable, as the choice and structure of the HL observ- ground”. Depending on the particular application, this able space defines the physical interpretation. For our threshold can be tuned to different points on the receiver jet substructure classification case study, the EFPs have operating characteristic (ROC) curve to optimize the sig- already been identified as a suitable basis of HL observ- nal acceptance versus background rejection. To quantify ables [36], as discussed further in Sec. III B.1 Other ML overall classification performance, we use the AUC, which tasks in high-energy physics might require the develop- is equivalent to the probability that a randomly selected ment of alternative bases of HL observables. signal/background pair is correctly ordered by a decision In the rest of this section, we describe the aspects of our function f(x): approach that are generic to any ML task, focusing on Z the case of binary classification. To evaluate the relative 0 0 0  AUC[f] = dx dx psig(x) pbkg(x )Θ f(x)−f(x ) . (1) performance of our simple HL network and a black-box NN, we need a metric to evaluate whether two classifiers make the same classification decisions. There are many Here, Θ is the Heaviside theta function (i.e. Θ(x < 0) = 0 such metrics one could use, but we introduce ADO, in and Θ(x ≥ 0) = 1) and psig and pbkg are the ground truth part because it shares the conceptual simplicity of the signal and background probability distributions. A per- area under the curve (AUC) metric often used to bench- fect decision function has AUC = 1 and random guessing 1 mark classifiers against ground truth. Armed with an ex- yields AUC = 2 . plorable set of HL observables and a metric for assessing To compare the classification behavior of two decision learning similarity, we then present a guiding strategy to functions, we consider the decision surface defined by map black-box NNs into a physically meaningful space. a threshold on the function output. For each function, the set of such thresholds defines a set of surfaces in x space. If two decision functions have the same decision A. Average Decision Ordering surfaces, then they are effectively using the same low- level information for classification. Note that the abso- lute output values of the decision functions are not rel- Our guided strategy requires a similarity metric that evant for determining whether the decision surfaces are compares the output of two decision functions f(x) and similar. The relative locations of the decision surfaces g(x). Here, x represents the full high-dimensional low- are determined by the relative ordering of the two de- level data, which are inputs to both the black-box NN cision functions when evaluated at pairs of points in the and the physically-motivated HL observables. Of course, input space. We can encapsulate this information via the the definition of similarity must reflect the task for which decision ordering (DO) for a pair of points x and x0:   DO[f, g](x, x0) = Θ f(x) − f(x0)g(x) − g(x0) , (2) 1 Beyond jet substructure, the EFPs are also formally a complete basis for event-wide IRC-safe observables, but their utility in that where 1 corresponds to f and g having the same ordering context has not yet been established. and 0 corresponds to inverted ordering. 4

If two decision functions have DO = 1 for all pairs of B. Black-Box Guided Search Strategy x and x0, then they are monotonically related to each other, have identical decision surfaces, and are therefore The idea behind the black-box guided strategy is identical for the purposes of classification. To build a shown in Fig. 1, where the goal is to find HL observables summary statistic, we average over all possible values of that maximize the ADO relative to an already trained 0 x and x , weighted by the signal and background distri- ML tool. We denote the reference black-box network as butions, yielding the ADO: “BBN” (with apologies to cosmologists), and it will typi- cally be some kind of deep network acting on the low-level inputs. Starting from a large set S of human-defined HL Z observables motivated by physics considerations, our aim 0 0 0 ADO[f, g] = dx dx psig(x) pbkg(x ) DO[f, g](x, x ). is to build a high-level network (HLN) with the same de- cision surfaces as the BBN. The initial step (n = 0) in the (3) black-box guided approach is to identify the observable This evaluates to 1 if the decision functions make the HL that has the largest ADO with the BBN: same relative classification decision for every pair, to 0 if 1 the functions make the opposite classification for every 1 HL1 = argmax ADO[BBN, HL]Xall . (4) pair, and to 2 if there is no consistency in their orderings. HL∈S Since a decision function can be trivially inverted, the cases of ADO = 0 and ADO = 1 have the same meaning, Here, the Xall subscript indicates that we are using the so we map ADO → 1 − ADO whenever it is less than full set of signal/background training pairs (x, x0) when 1 computing the ADO. The observable HL is therefore 2 . The ADO has a similar philosophy to Kendall’s rank 1 correlation coefficient [46], with the key difference that the physics-motivated observable in the set S that best we are comparing inputs drawn from separate signal and approximates the decision surfaces of the black box. background distributions. In the next step (n = 1), we focus our search for HL ob- servables in regions of the feature space where the black- To gain intuition for the ADO, note that it has a very box network disagrees with our current set of HL observ- similar structure to the AUC in Eq. (1). The AUC is ables, by isolating the subset of signal/background pairs the probability that a single decision function orders ob- X1 that are ordered differently by the BBN and HL1: jects correctly relative to ground truth. The ADO is n o the probability that two decision functions order objects 0 0 X1 = (x, x ) DO[BBN, HL1](x, x ) = 0 . (5) in the same way, even if incorrectly. In the case that f(x) = p (x)/p (x) is the likelihood ratio, then f(x) sig bkg We then identify the observable HL2 that has the largest is an optimal classifier by the Neyman–Pearson lemma, so ADO with the BBN when restricted to the X1 subset: an ADO = 1 implies that g(x) defines the same optimal decision boundaries as f(x). In most ML applications, HL2 = argmax ADO[BBN, HL]X1 . (6) one is trying to maximize the AUC or other similar met- HL∈S ric of absolute classification performance. Our guided strategies, by contrast, aim to maximize the ADO rela- For each subsequent step n > 1, we combine the HL tive to an already trained ML tool. observables already identified in the previous steps into a joint network: There are other similarity metrics that one could use, HLN = NN[HL ,..., HL ], (7) but they are not as easy to interpret in terms of classi- n 1 n fication decisions. One way to capture information sim- where NN indicates a neural network trained on the full ilarity is to use mutual information, or more appropri- signal/background training set with just the n HL ob- ate to binary classification, mutual information with the servables as inputs. From this joint HLN, the misordered truth [67]. For our guided strategies, though, we are subset X is defined via less interested in whether two decision functions have n the same quantity of information available to a classifica- n o X = (x, x0) DO[BBN, HLN ](x, x0) = 0 . (8) tion task, and more interested in quantifying the degree n n to which two decision functions make classification deci- sions in the same way. Even if f(x) and g(x) contain a Because a new HLN is trained in each iteration, Xn may high level of mutual information, they do not necessarily not be a strict subset of Xn−1. The next observable define the same decision boundaries. This is especially HLn+1 is determined via: important to keep in mind given the flexibility of deep   HLn+1 = argmax ADO BBN, HL X . (9) networks, which allow the same discrimination power to HL∈S n be encoded in many informationally equivalent ways. Be- yond the ADO, there are other summary statistics one Note that the same black-box network is used in each could use based on the raw DO information, and we leave iteration, but the changing subset Xn means that differ- a study of those alternatives to future work. ent decision surface information is tested at each step. 5

Signal/Background Pairs Black-Box Guided Search BBN BBN No Same Maximize Decision Decision HL

… … Ordering? … … Ordering

… HL

HLN HL Yes HLN’

FIG. 1. Schematic of the black-box guided search in Sec. II B. In each iteration of this strategy, the relative decision ordering of signal/background pairs between the fixed black-box network (BBN, black triangle) and a trainable network of HL observables (HLN, white triangle) is used to identify the subset (red box) in which pairs are differently ordered. From a large space of HL observables (circles), the one with the largest ADO in the misordered space (blue circle) is selected for the next iteration. The schematic above corresponds to the n = 4 iteration. Note that the BBN is not retrained in each iteration, but the network of HL observables is.

These steps are repeated until ADO[BBN, HLNn+1] gets rectly ordered even by the theoretically optimal classifier. as close to 1 as desired. Instead of getting smaller and smaller, the subsets Xn Isolating the differently-classified pairs in Eq. (8) is will stall at the set of signal/background pairs that can similar in spirit to the boosting step of BDTs [69, 70]. never be ordered correctly. This in turn means that the This approach focuses attention only on the subspace classification performance of HLNn will stall well below of pairs where the BBN disagrees with the current set the theoretical maximum in the label guided approach. of HL observables, allowing us to identify new HL ob- That is why we advocate for the selection of HL observ- servables that make signal-background ordering decisions ables to be guided by the ADO, since then the classifi- most similar to the BBN in that subspace. It is worth cation performance of the HLNn will eventually match emphasizing that the ADO, or some other metric for net- that of the BBN, as desired. work decision similarity, is essential for this approach to As with any “greedy algorithm”, our black-box guided work. strategy cannot identify situations where two HL observ- Later in Sec. V C, we will compare this black-box ables could be combined simultaneously to match the guided approach to a label guided approach. Instead BBN decision surfaces. This means that we might miss of using the ADO, the label guided approach uses the sets of observables that are individually poor classifiers AUC with respect to ground truth information. It is but perform well jointly. If the goal were to just to max- straightforward to understand why the ADO is superior imize performance, this would be an undesirable feature. to the AUC for guiding purposes. To the extent that the In the context of mapping a black-box ML strategy to BBN is well trained, it represents a good approximation a physically-interpretable space, though, we are indeed to the Neyman–Pearson optimal classifier. Achieving the looking for individual observables with high information correct DO relative to the optimal classifier for every sig- content relevant for classification, so this greedy strategy nal/background training pair is the best one could ever is the one most likely to yield physical insight. hope to do. Therefore, if the black-box guiding strategy is working correctly, then the subsets Xn will get smaller and smaller until almost all signal/background pairs have III. A CASE STUDY IN JET SUBSTRUCTURE been correctly ordered relative to the BBN. By contrast, the AUC captures DO relative to truth We now apply the technique introduced in Sec. II to labels. Unless the BBN is able to achieve AUC = 1, there a specific case study involving jet classification at the will inevitably be signal/background pairs that are incor- LHC. In this section, we review boosted W boson clas- 6 sification using jet substructure and highlight the ele- Observable AUC ADO[CNN, Obs.] ments of Ref. [20] that will be used in our case study. We Mjet 0.898 ± 0.004 0.807 then introduce the EFPs [36] as our set of HL physics- β=1 C2 0.660 ± 0.006 0.584 motivated observables. Details about our NN architec- β=2 C2 0.604 ± 0.007 0.548 tures and training parameters are provided in App. A. β=1 D2 0.790 ± 0.005 0.743 β=2 D2 0.807 ± 0.005 0.762 β=1 τ2 0.662 ± 0.006 0.600 A. Boosted Boson Classification 6HL 0.9504 ± 0.0002 0.971 CNN 0.9531 ± 0.0002 1.000 Massive objects produced at the LHC often have 488HL 0.9535 ± 0.0002 0.978 7HL 0.9528 ± 0.0003 0.971 enough transverse momentum that their decay products black-box become collimated. For an object with a hadronic decay TABLE I. Classification performance of the six HL observ- mode, such as a W boson decaying to a quark-antiquark 0 ables studied in Ref. [20], as well as a 6HL joint classifier. pair (W → qq¯ ), the resulting jet in the detector consists The six HL observables face a small but significant perfor- of two clusters of energy, one from each of the fragment- mance gap compared to the benchmark CNN. As discussed ing quarks. The substructure of these jets is distinct from later in Sec. IV A, this performance gap is bridged by a sev- those that arise from the fragmentation of a single hard enth feature discovered using our black-box guided strategy. quark or gluon. Identification of jets with nontrivial sub- The 488HL case involving a large set of EFPs is discussed at structure has become an essential tool for probing the the end of Sec. V B. The uncertainty on the AUC is computed nature of collisions at the LHC [21, 47–58] from 1 standard deviation of 10-fold cross-validation. The de- There are many different ways to represent the infor- cision similarity (ADO) to the benchmark CNN is also shown. mation in a jet. At the most fundamental level, a jet is Details of the NN architectures are provided in App. A. variable-length collection of four-vectors with associated particle properties, motivating set-based ML tools [71– 76]. Another popular approach is to describe a jet as a (multiple proton-proton collisions per beam crossing). grid of calorimeter cells with energy depositions, giving Jets are clustered using the anti-kt algorithm [93] with rise to a “jet image” [19, 77]. In any of these low-level radius parameter R = 1.2, using FastJet 3.1.2 [94]. representations, the jet data is high dimensional. This The dataset contains 5 × 106 events, split equally be- motivates the development of HL observables that intelli- tween signal and background. Following the approach in gently summarize the low-level information to reduce the Ref. [20], each jet is pixelated into a 32 × 32 grid in the effective dimensionality of the task. Physicists have engi- rapidity-azimuth plane, and a jet image is formed from neered numerous HL observables tasks that incorporate the transverse momentum (pT) deposits in each cell. The domain knowledge about jet formation (see Refs. [51, 59– jet image is then trimmed [95], where subjets of radius 64, 78–85] for an incomplete list). Typical usage is to Rsub = 0.2 are discarded if their pT is less than 3% of apply cuts on one or more of these HL observables, or to the original jet. The final jet selection takes jets with trim combine several of them using a shallow ML classifier. trimmed momentum pT ∈ [300, 400] GeV within the In the context of jet classification, ML tools based on rapidity range |η| < 5.0. While important jet informa- low-level inputs have outperformed traditional strategies tion is lost by pixelation and trimming, we include these based on HL observables [86]. Of course, the HL ob- steps in our analysis in order to perform an apples-to- servables themselves are just functions of the low-level apples comparison to Ref. [20]. inputs, so it should be possible to find a large enough set The trimmed jet’s constituents are used to compute six of physics-motivated HL observables that can match the HL jet substructure observables: the trimmed jet mass performance of these ML classifiers [87–89]. This is in- β=1 (Mjet), four ratios of energy correlation functions (C2 , deed the intuition behind the guided strategy in Sec. II, β=2 β=1 β=2 C2 , D2 , D2 ) [62, 64], and the N-subjettiness ratio where the goal is to leverage a black-box ML method to β=1 identify the most effective HL observables. (τ21 ) [60, 61]. These observables are well-established in the context of boosted W classification, including studies Our case study is based on the same datasets as √ at ATLAS [96, 97] and CMS [98]. The W boson classi- Ref. [20]. These datasets correspond to s = 14 TeV fication performance of these six HL observables is sum- proton-proton collision, where hard scattering and res- marized in Table I. The trimmed jet mass is the most onance decay were generated using MadGraph 5 powerful single observable, since the 80.4 GeV mass peak v2.2.3 [90], showering and hadronization were generated is a characteristic feature of boosted W bosons. with Pythia v6.426 [91], and the response of the de- tectors was simulated with Delphes v3.2.0 [92]. The We can use the ADO from Eq. (3) to gain additional in- boosted W signal process is diboson production (pp → sight into these six HL observables. In Fig. 2, we assess W +W −), which yields two fat jets each with 2-prong sub- the pairwise ADO between each of the HL observables structure. The background process is QCD dijet produc- considered. The observable pairs that make the most β=1 β=2 tion (pp → qq, qg, gg), which typically yields 1-prong jets. similar decisions (i.e. ADO → 1) are C2 with C2 β=1 β=2 These samples do not include contamination from pileup and D2 with D2 . This is expected since these ob- 7

1.0

Mjet 0.898

0.9 = 1 0.935 0.660 C2

= 2 0.931 0.755 0.604 0.8 C2 AUC = 1 0.932 0.905 0.878 0.790 D2 0.7

= 2 0.928 0.919 0.922 0.842 0.807 D2 0.6

= 1 0.924 0.901 0.814 0.820 0.835 0.662 21

0.5 = 1 = 2 = 1 = 2 = 1 Mjet C2 C2 D2 D2 21

FIG. 2. Similarity of the classification decisions between the FIG. 3. Classification performance of the six HL observables six HL observables, as quantified by the ADO. A value of in Table I on the diagonal entries, along with AUC values ADO = 1 indicates identical decision ordering for all sig- for pairwise classification between coupled HL inputs on the 1 off-diagonal entries. nal/background pairs, while ADO = 2 corresponds to no similarity. In this way, the ADO has a similar interpreta- tion to the AUC, but with respect to classification decisions instead of ground truth. est though persistent, making it an excellent benchmark problem for studying the interpretation of ML strate- gies. We repeat this exercise in Table I, now using the servables have relatively similar structures except for the NN parameters in App. A for the CNN and for the 6HL choice of β coefficient, which controls the weighting of an- combination: gular information within the jets. These pairs also have  β=1 β=2 β=1 β=2 β=1 similar AUC values, as seen in Table I, since pairs that 6HL ≡ NN Mjet,C2 ,C2 ,D2 ,D2 , τ2 , (10) make common classification decisions should exhibit sim- ilar classification power. Comparing the AUC and ADO where we find a 0.0027 ± 0.0003 gap in AUC, as seen values provides a more detailed picture about the degree in Table I. Using our guided strategy, we seek to under- of correlation in classification. stand the nature of this performance gap, and whether the CNN has found a strategy similar to the existing HL The observable pairs that make the least similar deci- 1 β=1 β observables or something distinct. Does the gap indi- sions (i.e. ADO → 2 ) are Mjet with τ21 and C2 with cate a mild optimization of the same basic HL ideas, or β β=1 D2 . For Mjet versus τ21 , this is expected since N- does it hint at the existence of a new HL observable not subjettiness probes the degree of prong-like collimation, appearing previously in the jet substructure literature? whereas mass is sensitive to the energies of the prongs β β and their relative angles. For C2 versus D2 , this is ex- pected since they have different scalings under boosts B. Energy Flow Polynomials along the jet direction [64]. Pairs that make dissimilar decisions can often be combined into more powerful joint In order to map the CNN from Ref. [20] to a human- classifiers. This is shown in Fig. 3, where we consider the readable space, we first define a suitable set of physics- pairwise classifiers NN[HLi, HLj], where the details of the motivated HL observables for use in the guided strategies. NN parameters are presented in App. A 2. More compre- This requires domain knowledge about the underlying hensive studies of these six jet substructure observables physics as well as intuition for the kinds of information can be found in Ref. [56]. that might be missing from existing HL observables. This While these six engineered HL features are powerful jet also requires identifying HL observables that are likely to substructure discriminants, they do not capture the full work well as classifiers individually, since the black-box information relevant for W boson tagging. Viewing the guided strategy in Sec. II B is based on finding a single calorimeter cells as pixels of a two-dimensional image, we observable that maximizes the ADO in each step. can try to enhance the discrimination power using com- Our set of HL observables is based on the EFPs [36]. puter vision techniques [18–20, 22, 27, 32, 77, 99, 100]. The EFPs are a large (formally infinite) set of Indeed, Ref. [20] demonstrated that a deeply connected parametrized engineered functions, inspired by previous CNN using the low-level jet image inputs yielded better work on energy correlation functions [62, 64, 101–104]. In classification performance than the six HL observables the jet image representation, they are defined in terms of combined with a BDT. The performance gain was mod- the momentum fraction za of calorimeter cell a, as well 8 as the pairwise angular distance θab between cells a and It is important to emphasize that, although the EFP b. The EFPs are built in increasing levels of complexity, space constitutes a formally complete basis for (IRC-safe) from simple sums over single cells to many higher-order jet classification, we are primarily concerned with the combinations of momentum and pair-wise angles. An pragmatics of isolating individual observables that can EFP can be represented as a multigraph where: map out the CNN behavior. The ideal case is that the CNN strategy maps to a single EFP, indicating that it N X can be expressed compactly in terms that can be easily each node =⇒ za, (11) interpreted by humans. Failing that, though, it is still of a=1 considerable value if a similar mapping can be made us- k each k-fold edge =⇒ (θab) . (12) ing a small collection of observables [87–89]. This would still provide a significantly more physically meaningful in- As one example, we have: terpretation of the data and a marked reduction in data complexity and dimensionality. N N N N X X X X 3 2 Beyond this specific benchmark example, if one is un- = zazbzczdθabθbcθacθad. (13) able to map the CNN strategy into a small number of a=1 b=1 c=1 d=1 EFPs, this could mean one of two things. First, it could From these graph representations, we can express both mean that the CNN strategy simply does not operate connected and disconnected graphs. For the purposes of in this HL space, requiring us to revisit the assumption our study, we only consider connected graphs. that the HL space was sufficiently complete to capture The observable corresponding to each graph can be the essential information for jet classification. Second, it modified with parameters (κ, β). These parameters de- could mean that the CNN strategy is still encodable in termine the specific meaning of z and θ , where terms of these HL observables, but in a more complex a ab combination. As an example of this second possibility,  κ consider the C2 [62] and D2 [64] observables discussed in (κ) pTa za = P , (14) Sec. III A. These can be written as EFPs with κ = 1: pTb b  2 (β) 2 2 β/2 (β) θab = ∆ηab + ∆φab . (15) C2 = , (16) Here, p is the transverse momentum of cell a, and η  3 Ta ab (β) (φab) is the pseudorapidity (azimuth) difference between D2 = , (17) cells a and b. The original IRC-safe EFPs require κ = 1, however we include examples with κ 6= 1 to explore a where the graphs corresponds to: broader space of HL observables, motivated by Refs. [65– N N N 2 X X X (β) (β) 68]. Note that κ > 0 generically corresponds to IR-safe = z z z θ θ θ(β), (18) but C-unsafe observables. We intentionally include zero a b c ab bc ca a=1 b=1 c=1 and negative values of κ to explore both IR-unsafe and N N 3 C-unsafe observables as well. For our study, we use X X (β) = z z θ . (19) the EnergyFlow python package [105] to translate jet- a b ab a=1 b=1 image pT, η, and φ information to EFPs with varying graphs and choices of κ and β. The guided strategies, however, would not necessarily be For our guided search, we consider all combinations of able to identify these ratio combinations unless they were 1 1 defined ahead of time.4 Therefore, whether or not the (κ, β) where κ = [−1, 0, 2 , 1, 2] and β = [ 2 , 1, 2]. Each of the 15 combinations of (κ, β) are applied to the complete guided mapping is effective, one learns something about set of connected graphs with degree (i.e. number of edges) the nature of the physics problem either way. d ≤ 7 along with all connected graphs with degree d ≤ 8 and chromatic number c = 4 (to be defined in Sec. IV B below), which comes to 509 in total. This yields a space IV. SUPPLEMENTING EXISTING of 7,635 HL observables to search from. Note that β in OBSERVABLES Eq. (15) can sometimes be traded for k in Eq. (12); we remove degenerate graphs from our space, reducing our In this section, we demonstrate the success of the map- pool of unique observables to 7,545. ping strategy from Sec. II in searching for an additional HL observable in the context of boosted W classification.

2 We thank Patrick Komiske and Eric Metodiev for discussions 4 It might be interesting to combine the guided strategy with some related to the normalization of the κ 6= 1 EFPs. kind of symbolic regression to find interesting combinations [106]. 3 For κ < 0, empty cells are omitted from the sum in Eq. (11). If this symbolic regression allowed for index contraction, then This is not as discontinuous as one might naively think due to one could use the energy flow moments [107] to more efficiently the pixelation and trimming steps. search the space of β = 2 EFPs. 9

1 From Table I, we saw that the difference in classifi- Eq. (22), and they often feature κ = 2 and β = 2 . Re- cation power between a CNN acting on jet images and call that κ = 2 corresponds to IRC-unsafe EFPs, which an NN combination of six HL observables is relatively suggests that IRC-unsafe information is valuable (though small, but genuine and statistically significant. As such, perhaps not uniquely so) for mapping the CNN strategy. 1 it is interesting to ask whether this existing set of six HL Similarly, the appearance of β = 2 suggests the impor- observables could be supplemented by a new single HL tance of probing small-angle behavior. Other IRC-unsafe observable which has not yet been considered by human EFPs with κ = 0 and κ = −1 also perform well, espe- physicists. We employ our black-box guiding strategy cially the constituent multiplicity appearing third on this to find such a seventh HL observable, which closes the list. The best performing IRC-safe κ = 1 observable ap- performance gap previously identified in Ref. [20]. pears 5531th on this list and is not able to close the performance gap with the CNN. Specifically, the EFPs in Eqs. (18) and (19) with κ = 1 have a relatively small A. Black-Box Guiding ADO in the X6 subspace, never getting above 0.5279. This is as one might expect, since these observables al- The first step of the black-box guided strategy from ready effectively appear in the C2 and D2 combinations. Sec. II B is to identify the subset of signal/background Further discussions of the physics implications are pro- pairs that are differently ordered by the CNN and the vided below. 6HL combination: For completeness, we show distributions for the top n o three EFPs from Table II in Fig. 4, both in the full space X = (x, x0) DOCNN, 6HL(x, x0) = 0 . (20) 6 as well as in X6. The first two observables show good sep- aration between signal and background in the full space, 12 Though we have 6.25 × 10 signal/background pairs, we as expected given that their AUC is around 0.8. The 7 construct X6 from a randomly selected subset of 5 × 10 third observable, constituent multiplicity, is a relatively pairs. We then search for the EFP that has the great- poor discriminant by itself. When restricted to the X6 est decision similarity to the CNN in the X6 subspace, subspace, there is only modest residual separation power which ideally captures all of the remaining discrepancies shown by these three observables, but enough to close between the CNN and 6HL approaches: the performance gap with the CNN. black-box HL7 = argmax ADO[CNN, HL]X6 . (21) HL∈EFPs B. Physics Interpretation The results are shown in the first row of Table II. The EFP with the largest ADO with the CNN in the X6 sub- space is: Our first physics conclusion is that the κ-augmented EFP space is sufficiently comprehensive to close the per-  1  N N N N formance gap between the 6HL and the CNN. Had we κ=2,β= 2 X X X X 2 2 2 2p = zazb zc zd θabθbcθacθad. restricted our attention to just the IRC-safe EFPs, this a=1 b=1 c=1 d=1 would not have been the case, since the top ranked κ = 1 (22) EFP in both strategies can only achieve AUC = 0.9507 By itself, Eq. (22) only has an AUC of 0.8031, but when when combined with the six previous HL observables. used as the seventh feature of an NN that also uses the Thus, IRC-unsafe information seems essential for closing previously identified 6HL observables, the performance gap.5   1  Fascinatingly, κ = 2 appears prominently in the top β=1 κ=2,β= 2 7HLblack-box ≡ NN Mjet, . . . , τ2 , , ten EFPs, though in a different form than previously con- sidered in the literature. The key feature of the κ = 2 (23) EFPs is that they weight higher energy particles more it closes the performance gap with the CNN by achieving than lower energy particles. Looking at the top κ = 2 ob- AUC = 0.9528 ± 0.0003, as previewed in Table I. Inter- servables in Table II, they all have the common feature of estingly, this happens even though the ADO between the corresponding to chromatic number c = 3 graphs. Chro- CNN and 7HL is only 0.971, implying that they black-box matic number is the minimum number of colors needed still make inconsistent decisions around 3% of the time. to decorate the nodes of a graph such that no edge con- So while the black box has guided the selection of a new nects same-color nodes. If an EFP has chromatic number HL observable that closes the AUC performance gap, the c, then it is only non-zero if the jet has at least c distinct remaining ADO gap implies that there is additional in- formation not being captured. The remaining rows of Table II show portions of the ranked list of 7,545 EFPs, ordered by their ADO values. 5 As discussed in Ref. [108], CNNs are formally IRC safe. To map Note that the statistical uncertainties on the ADO are this IRC-safe behavior to the EFPs, however, would require very large enough that the precise ranking is not so mean- high-point correlators, with in principle as many nodes as pixels ingful, though the overall trends are. One striking fea- in the original jet image. With IRC-unsafe EFPs, we can instead ture is that many observables have a similar ADO to match the CNN decision boundaries with low-point correlators. 10

Rank EFP κ β Chrom # ADO[EFP, CNN]X6 AUC[EFP] ADO[6HL + EFP, CNN]Xall AUC[6HL + EFP]

1 1 2 2 3 0.6207 0.8031 0.9714 0.9528 ± 0.0003

1 2 2 2 3 0.6205 0.8203 0.9714 0.9524

3 0 – 1 0.6205 0.6737 0.9715 0.9525

1 4 2 2 3 0.6199 0.8301 0.9715 0.9527

1 5 2 2 3 0.6197 0.8290 0.9714 0.9527

1 6 2 2 3 0.6196 0.8251 0.9715 0.9522

1 7 0 2 2 0.6187 0.7511 0.9715 0.9526

1 8 2 2 3 0.6184 0.8257 0.9712 0.9527

1 9 2 2 3 0.6182 0.8090 0.9714 0.9527

1 10 2 2 3 0.6180 0.8314 0.9714 0.9526

60 0 1 2 0.6163 0.7194 0.9715 0.9525

1 341 −1 2 4 0.6142 0.6286 0.9714 0.9509

589 0 2 2 0.6109 0.7579 0.9714 0.9523

3106 −1 – 1 0.5891 0.5882 0.9714 0.9510

1 1 3519 2 2 2 0.5664 0.7698 0.9715 0.9524

1 3521 2 – 1 0.5663 0.7093 0.9714 0.9522

5531 1 2 1 0.5290 0.7454 0.9714 0.9507

1 5554 1 2 2 0.5279 0.8210 0.9713 0.9505

5610 2 – 1 0.5245 0.7117 0.9714 0.9507

5657 1 1 3 0.5224 0.8257 0.9712 0.9506

5793 1 1 2 0.5191 0.8640 0.9714 0.9505

6052 1 2 3 0.5153 0.8500 0.9716 0.9504

7438 1 2 2 0.5011 0.8835 0.9716 0.9506

TABLE II. A selection of EFPs, sorted by their similarity with the CNN, evaluated using ADO in the differently-ordered subspace X6. This corresponds to one step in the black-box guiding technique depicted in Fig. 1. After the top 10, EFPs are shown if they correspond to a dot graph, appear in the C2/D2 observables from Eqs. (16) and (17), or have the highest ADO among graphs with a given value of κ, β, or chromatic number.

D D particle directions, making it an effective probe for devi- literature is pT [65, 66]. In the EFP language, pT is a ations from (c − 1)-prong substructure. The κ = 2 and c = 1 graph with no edges: c = 3 EFPs found by our guided strategy therefore probe N IRC-unsafe deviations from 2-prong substructure (as one X (κ=2) = z2. (24) might expect for boosted W tagging), with a particular a emphasis on the higher energy particles inside the jet. a=1 D By contrast, the only κ = 2 observable that has re- Here, we see that pT is only ranked 5610th by ADO. ceived any significant attention in the jet substructure Apparently, generic IRC-unsafe information is not, by it- 11

Background in space EFPi Background in subspace Xi other hand, the background quark and gluon jets are 1- 1.0 Signal in space EFPi Signal in subspace Xi prong objects, and constituent multiplicity is well-known 0.8 to be a powerful quark/gluon discriminant [68] (though 0.6 sensitive to detector effects [109]). The next κ = 0 ob- ( =2, =0.5) Density 1 0.4 servable on the list has c = 2 and β = 2 , which is an IRC-unsafe probe of 1-prong substructure with an em- 0.2 phasis on collinear physics, which should also be an ef- 0.0 10 8 6 4 10 8 6 4 fective quark/gluon discriminant. This suggests that one log [EFP Observable] log [EFP Observable] 10 10 can improve the classification performance either with a 0.8 Background in space EFP Background in subspace X refined probe of the W boson signal or a refined probe i i Signal in space EFPi Signal in subspace Xi of the quark/gluon background, which happens to have 0.6 a similar effect on the decision boundaries.

0.4 In summary, by translating an ML strategy into a ( =2, =0.5) Density human-readable space, we have identified an important 0.2 class of jet substructure observables that have been miss- ing from previous boosted W boson studies. This mo- 0.0 14 12 10 8 6 14 12 10 8 6 tivates further studies of IRC-unsafe observables, espe- log10 [EFP Observable] log10 [EFP Observable] cially high degree EFPs with κ = 2. In Sec. VI, we dis-

Background in space EFPi Background in subspace Xi cuss the implications of this result on future work with 0.10 Signal in space EFPi Signal in subspace Xi jet substructure observables. 0.08

0.06 ( =0) Density 0.04 V. ITERATIVELY MAPPING FROM MINIMAL FEATURES 0.02

0.00 0 5 10 15 20 25 30 35 0 5 10 15 20 25 30 35 EFP Observable EFP Observable In the previous section, our aim was to supplement an existing set of HL features with one new feature to bridge the gap with the CNN performance. This jet FIG. 4. The top three EFPs from the black-box guided search for a seventh HL observable; see Table II. Shown are the EFP substructure case study is unusual, however, in that it distributions for signal and background events, both in the benefits from a highly mature literature of theoretically full set of events Xall (left column) and in X6 (right column), motivated features. Other applications of our black-box i.e. the differently ordered space between the 6HL and the guided strategy may have to begin from a more minimal CNN. The top two observables, while not identical, have very starting point and build an HL classification strategy es- similar functional forms, up to an overall rescaling. The third sentially from scratch. observable is the jet constituent multiplicity. In this section, we start from just the most basic jet properties—transverse momentum pT and jet mass M —and iteratively identify a small set of EFPs rele- self, useful for boosted W boson classification, but must jet vant for boosted W boson classification. Using the black- be paired with the correct angular dependence to high- box guided strategy, we are able to match the CNN per- light the physics of interest. It is interesting that β = 1 2 formance using around 7 EFPs. This is a similar di- is preferred as the angular exponent, since this choice mensionality to the 7HL combination (which did not in- appeared previously in the context of the Les Houches clude p ), though the physics features being probed will angularity for quark/gluon discrimination [68]. T turn out to be interestingly different. We then show that There are also κ = 0 observables in the top ten EFPs, this black-box strategy is more computationally efficient including the well known constituent multiplicity: than a brute-force search and more effective than a label- N guided search. X (κ=0) = 1. (25) a=1 The fact that a κ = 0, c = 1 observable yields nearly the A. Black-Box Guiding same performance as a class of κ = 2, c = 3 observables is a surprising result of our study. Our tentative inter- Here, we apply the same black-box approach as pretation is that this represents two complementary ap- Sec. IV A, starting just from the smaller set of observ- proaches to solving this jet classification task. On the one ables, pT and Mjet. The motivation for this starting point hand, boosted W bosons are 2-prong objects, so one ex- is as follows. The W boson mass at 80.4 GeV is one of pects c = 3 observables to be most relevant. Indeed, the the most important (and obvious) features of boosted W numerators of Eqs. (16) and (17) are c = 3 graphs that bosons. Because of the choice of za variable in Eq. (14), probe 2-prong substructure, which was part of the orig- though, the EFPs are dimensionless. Therefore, we need inal motivation for the C2 and D2 observables. On the at least one dimensionful HL observable to capture the 12

0.95 0.95

0.94 0.94

Black-box Guided Black-box Guided AUC AUC 0.93 0.93 Brute Force Brute Force Truth Guided Truth Guided 0.92 CNN 0.92 CNN 6 HL 6 HL 0.91 0.91

1.00 1.00

0.98 0.98

0.96 0.96

0.94 0.94 ADO[CNN, Obs.] 0.92 ADO[CNN, Obs.] 0.92

0.90 0.90 0 1 2 3 4 5 6 7 1000 2000 3000 4000 5000 EFPs Computing Time (Min.)

FIG. 5. Performance of the black-box guided search strategy FIG. 6. The same as Fig. 5, but now plotted in terms of the to map the CNN solution into human-interpretable observ- cumulative computing time. ables. Here, we start from just the basic jet features pT and Mjet and iteratively add one EFP at a time. The performance is shown in terms of AUC (top) and ADO (bottom) as a func- information: tion of the scan number. The performance of a brute-force   scan of the EFP space (Sec. V B) and a truth-label guided HLN0 ≡ NN pT,Mjet . (27) search (Sec. V C) are also shown. For reference, the per- This yields an AUC of 0.9119, which is substantially be- formance of the CNN and of the existing 6HL features are indicated by horizontal lines. low the CNN performance for boosted W boson tagging. We then restrict our attention to the subset of events that are differently ordered by these minimal features relative to the CNN: W boson mass peak, and either p or M would suffice T jet n o for this purpose. 0   0 X0 = (x, x ) DO CNN, HLN0 (x, x ) = 0 . (28) We begin from both pT and Mjet for two reasons. The first is that they are ubiquitous jet observables appearing The ADO between the CNN and HLN0 is 0.9150, so X0 in myriad collider studies. The second is to streamline contains 8.5% of the original Xall sample, though we only 7 the selection of EFPs. Naively, Mjet could be derived consider a random subset of 5×10 pairs in X0 to reduce from pT using the EFP in Eq. (15) with κ = 1 and β = 2: the computational burden. Our aim is to find a set of EFPs that can order these signal and background events the same as the CNN decision boundaries. 2 To identify the n-th EFP, we use the black-box guided (κ=1,β=2) Mjet ≈ 2 . (26) strategy of Sec. II B, adapted to the current notation: pT

EFPn = argmax ADO[CNN, EFP]Xn−1 . (29) EFP∈S With the choice of θa variable in Eq. (15), though, Eq. (26) is only approximately true, so multiple EFPs We construct a new joint classifier that includes this are needed to map out the mass information if pT is the EFP: only dimensionful scale. We checked that the black-box HLNn ≡ NN[pT,Mjet, EFP1,..., EFPn]. (30) strategy is still effective starting from just pT or from just Mjet, but the chosen EFPs tend to be more mass-like in This allows us to identify the remaining differently- their structure. By contrast, starting from both pT and ordered subset of events: Mjet yields more variation in the types of EFPs selected. n o X = (x, x0) DO[CNN, HLN ](x, x0) = 0 , (31) We start by training an NN on just the pT and Mjet n n 13 where in each iteration we only keep a random subset The primary computational cost of the brute force ap- of 5 × 107 pairs. The main computational cost in this proach comes from training the joint classifier appear- procedure is in training the joint classifier in Eq. (30). ing in Eq. (34), which combines each EFP with the cur- The AUC and ADO values for this black-box guided rent set of observables. This has to be done for each procedure are shown in Fig. 5 versus EFP scan iteration, EFP under consideration, and it is too computationally and in Fig. 6 versus cumulative computing time. More expensive to examine all 7,545 EFP graphs over multi- details about the selected EFPs are given in Table III. By ple iterations. Therefore, we only consider a subset of the 5th iteration, the AUC performance matches that of graphs at each iteration, which means there is no guar- the 6HL combination. By the 7th iteration, the AUC antee that the brute force method will perform better performance matches that of the CNN, with an ADO than the black-box guided strategy. For our purposes, of 0.974, indicating closer agreement with the CNN deci- our subset consists of the 54 connected graphs of degree 1 1 sions than we found with the 7HLblack-box strategy. Since d ≤ 5 and (κ, β) choices of κ = [ 2 , 1, 2] and β = [ 2 , 1, 2]. we started from a minimal set of jet features, it is not This reduces our original search space down to a more surprising that the EFPs identified here are qualitatively manageable 486 choices. different from the ones in Sec. IV. The physics interpreta- The results from this brute force procedure are shown tion of these various EFPs will be presented in Sec. V D. in Fig. 5 in terms of the ADO and AUC values after each iteration. In the first few iterations, the AUC and ADO values are higher than for black-box guiding, achieving B. Comparison to Brute Force Search a comparable performance to the original 6HL result af- ter the inclusion of a third EFP. The brute force process continues until it matches the CNN performance with An alternative approach to maximizing ADO is to per- 6 EFPs (8 HL inputs total). As one would expect, the form a brute-force search through the space of EFPs to brute force approach performs well as it is effectively try- find the set that maximally matches the decisions of the ing every possible combination of inputs and selecting the CNN. This is much more computationally expensive than best. This computational cost, however, must be weighed the black-box guided strategy, but it has the potential against the marginal decrease in the number of EFPs re- to converge to a smaller number of EFPs if there are quired to match the CNN as well as the need to restrict important correlations between the observables. In an the input space prior to exploring the performance. As absolute brute force search, one would construct all pos- shown in Fig. 6, the brute force approach does not com- sible sets of EFPs, and evaluate the ADO of each relative plete even a single iteration before the guided approaches to the CNN; given the number of graphs and combina- have converged to a complete solution. tions, this would be completely intractable. Instead, we Finally, for completeness, we consider the alternative attempt an iterative greedy algorithm, which incremen- brute-force approach of a network trained with all avail- tally builds the EFP set. This is still computationally able EFPs. Using the baseline DNN architecture in expensive, but (borderline) tractable. App. A 1, a network was trained with M , p , and the We again start from the jet p and M information, jet T T jet reduced set of 486 EFPs as input features. The perfor- but immediately train a joint classifier using each of the mance of “488HL” is shown in Table I, with marginally EFPs as an input: better performance than the CNN, indicating that the h i EFPs are effectively a complete basis for this task. NN pT,Mjet, EFP . (32)

We then select the EFP that yields the largest ADO with C. Comparison to Truth-Label Guiding the CNN, evaluated on the full training set.In the first iteration, we select the EFP via: In the black-box guided strategy, the CNN and the ADO similarity metrics are auxiliary tools to help iden- h  i EFP1 = argmax ADO CNN, NN pT,Mjet, EFP . tify a set of EFPs that maximizes the classification per- EFP∈S Xall formance. Assuming the EFP space is sufficiently com- (33) plete and labeled samples exist, one could dispense with We repeat this procedure in each subsequent iteration, the CNN entirely and simply search the space of EFPs choosing the EFP that yields the largest improvement in for the most powerful set that maximizes AUC, in an ADO when combined with the previous observables: approach similar to that of Ref. [110]. As a computa- tionally tractable alternative to the brute force search, h  i EFPn = argmax ADO CNN, NN ..., EFPn−1, EFP . we test what happens when the selection of the EFP is EFP∈S Xall guided by the truth labels, instead of by the ADO. (34) Analogously to decision ordering in Eq. (2), we de- The key difference from the black-box guided strategy is fine truth ordering (TO) for a pair of signal/background that the joint classifier is trained before evaluating the points x and x0 and a decision function f: ADO, and the ADO is evaluated on the full training set, instead of just the differently-ordered subset. TO[f](x, x0) = Θf(x) − f(x0), (35) 14

Iteration (n) EFP κ β Chrom # ADO[EFP, CNN]Xn−1 AUC[EFP] ADO[HLNn, CNN]Xall AUC[HLNn]

0 Mjet + pT ––– –– 0.9259 0.9119

1 1 2 2 2 0.8144 0.8190 0.9570 0.9382

2 0 2 2 0.6377 0.8106 0.9673 0.9458

3 0 – 1 0.5460 0.6737 0.9692 0.9476

1 4 1 2 2 0.5274 0.8464 0.9712 0.9487

5 −1 – 1 0.5450 0.5882 0.9714 0.9504

1 6 1 2 4 0.5382 0.7678 0.9734 0.9523

1 7 −1 2 2 0.5561 0.5957 0.9741 0.9528

TABLE III. The EFPs selected during each iteration of the black-box guiding strategy beginning from HLN0, which uses just pT and Mjet. For each iteration, the selected EFP is the one with the largest ADO with the CNN in the differently-ordered subspace Xn−1. where 1 corresponds to f correctly ordering the points D. Physics Interpretation and 0 corresponds to inverted ordering. Starting again from the jet pT and Mjet information, we identify the By translating the CNN into a space of physically- subset of event pairs that are incorrectly ordered: motivated observables, we can gain physical insight into the observables used in the classification decision. In n o 0   0 particular, the first few observables in Table III give us Y0 = (x, x ) TO HLN0 (x, x ) = 0 . (36) a glimpse at a possible alternative history for the field of jet substructure, if combinations like C and D had In each iteration, we find the EFP that has the highest 2 2 not been previously identified. Distributions of the EFPs AUC in the incorrectly-ordered subspace, found in the first four iterations are shown in Fig. 7. After pT and Mjet, the first EFP selected by the black- EFP = argmax AUC[EFP] , (37) n Yn−1 box guided strategy is: EFP∈S

 1  κ=2,β= construct a new joint classifier HLNn ≡ HLN0 + nEFP, 2 . (39) and identify the next incorrectly-ordered subset of events: The fact that a κ = 2 observable shows up early in the n o iterative procedure bolsters the evidence from Sec. IV A Y = (x, x0) TO[HLN ](x, x0) = 0 . (38) n n that these kinds of observables are important for mapping the CNN strategy. This is a chromatic number c = 2 Note that this procedure is completely independent of graph, so just like jet mass, it probes deviations from 1- the CNN. prong substructure. However, it uses a 5-point correlator The results from this truth-label guided procedure are (unlike mass which is a 2-point correlator) and it uses the 1 shown in in Fig. 5 in terms of the AUC and ADO. In β = 2 angular exponent (unlike mass which uses β = 2). the first iteration, the classification performance is bet- Putting these together, Eq. (39) is an IRC-unsafe probe ter than in the black-box guided search, which makes of hard, small-angle radiation. sense since the label guided method is trying to optimize The second EFP is also IRC unsafe and also corre- AUC directly. After 7 iterations, though, the classifica- sponds to a c = 2 graph: tion performance never rises above AUC = 0.951. As mentioned in Sec. II B, isolating the incorrectly-ordered (κ=0,β=2). (40) pairs turns out to be counter productive, since some of these pairs could never be ordered correctly even by the optimal classifier. This emphasizes the value of using Here, though, we have κ = 0 and β = 2, which is a probe the ADO relative to an already-trained network, to make of soft, wide-angle radiation. It is interesting that the sure attention is focused on event pairs that have a chance black-box guided strategy selects these two complemen- to be correctly ordered. tary c = 2 observables in the first two iterations, indicat- 15

0.7 Background in space EFPi Background in subspace Xi search until the fourth iteration: Signal in space EFPi Signal in subspace Xi 0.6 κ=1,β= 1 0.5 ( 2 ). (41) 0.4 ( =2, =0.5) Density 0.3

0.2 Moreover, it is a c = 2 graph, so still a probe of 1-prong

0.1 substructure. Only in interaction six do we finally see a

0.0 higher chromatic number graph, but the guided search 12 10 8 6 4 12 10 8 6 4 log10 [EFP Observable] log10 [EFP Observable] skips over the C2/D2-like graphs with c = 3 and goes straight to c = 4. The black-box guided strategy has Background in space EFPi Background in subspace Xi Signal in space EFPi Signal in subspace Xi identified a very different strategy for boosted W boson 0.4 tagging that nevertheless matches the 6HL combination 0.3 with a comparable number of observables.

( =0, =2) Density 0.2 One interpretation of this result is that it simply re- flects the “entropy” of our HL space. There are 4 times as 0.1 many IRC-unsafe observables in our HL collection than

0.0 8 6 4 2 0 2 4 8 6 4 2 0 2 4 IRC-safe ones, so just by random chance, one expects to log10 [EFP Observable] log10 [EFP Observable] see more unsafe observables in the scan. Indeed, there are IRC-safe observables that are highly ranked in the Background in space EFPi Background in subspace Xi 0.10 Signal in space EFPi Signal in subspace Xi first three iterations, just not at the top of the list. An-

0.08 other interpretation is that the black-box guided strategy is teaching us that IRC-unsafe information is more rele- 0.06 ( =0) Density vant for boosted W tagging than one might naively think. 0.04 A related observation was made in Ref. [85], which in- 0.02 troduced a color ring observable to identify color-singlet

0.00 configurations. Intriguingly, when restricted to three par- 0 5 10 15 20 25 30 35 0 4 8 12 16 20 24 28 32 EFP Observable EFP Observable ticles, the angular structure of Eq. (39) defines similar decision boundaries to the color ring.6 Either way, by Background in space EFPi Background in subspace Xi 1.0 Signal in space EFPi Signal in subspace Xi searching through a large space of HL observables in a

0.8 systematic way, the black-box guided strategy has given us a new perspective on an old problem in a human- 0.6 ( =1, =0.5) Density readable format. 0.4

0.2

0.0 VI. DISCUSSION 5 4 3 2 1 5 4 3 2 1 log10 [EFP Observable] log10 [EFP Observable] The ever increasing complexity of new ML strategies FIG. 7. The first four EFP graphs selected by the black- has produced better classification performance for vari- box guided strategy beginning from the minimal set of HL ous physics problems. At the same time, the increasing observables, pT and Mjet; see Table III. Shown are the EFP opaqueness of these methods has widened the gap be- distributions for signal and background events, both in the tween our understanding of a problem and our apprecia- full set of events Xall (left column) and in Xn (right column), tion of the ML solution. In this paper, we have proposed i.e. the differently ordered space between the HLNn and the a new technique for mapping an ML solution into a space CNN after n iterations. of human-interpretable observables. Our guided strate- gies mitigate some of the well-founded concerns about black-box approaches, while still allowing us to capital- ing the importance of 1-prong substructure probes even ize on the black-box performance to efficiently guide the if the goal is to identify 2-prong boosted W bosons. selections of HL observables. The end result is a set of The third EFP is constituent multiplicity, as seen be- HL observables that have a more direct physical inter- fore in Eq. (25), which reinforces the idea that controlling pretation and allow for a more transparent treatment of the composition of the quark/gluon background is impor- systematic uncertainties. tant for W tagging. These three observables, together In our jet substructure case study, we have shown that with pT and Mjet, yield an AUC of 0.9476. This is not the black-box guided strategy could be used to isolate in- as good as the 6HL combination, but still quite encour- formation that is not captured by previous HL represen- aging given that we did not give the black-box guided tations. Remarkably, only a single observable was needed strategy any information about the ratio structures used to construct C2 and D2. The main surprise from this study is that IRC-safe information was not selected by the black-box guided 6 We thank Andrew Larkoski for discussions of this point. 16 to close the performance gap identified in Ref. [20], nearly More broadly, our view is that the ultimate goal of duplicating the CNN strategy with a low-dimensional in- ML research in high-energy physics should not be to de- put representation. Beginning from a minimal set of ba- velop artificial-intelligence physicists which (or should we sic jet observables (pT and Mjet), we successfully con- say who?) can blindly process raw data and make state- densed the CNN behavior to a small set of EFP observ- ments about the Universe without being able to commu- ables which reproduce its performance and very nearly nicate the intermediate steps. The power of modern ML match its decisions. It would be interesting to study the can certainly be used to identify gaps in our knowledge utility of the EFPs in more complicated contexts, such where existing human-engineered approaches are insuffi- as event-wide classification tasks. cient. At the end of the day, though, we should insist that Interestingly, the structure of the selected EFPs differ data analysis strategies used to make statements about in qualitative ways from the C2 and D2 jet substructure physics should be understandable to human physicists. observables custom designed for boosted W boson clas- sification. While these previous observables are based on fully connected graphs, the guided strategy picked ACKNOWLEDGMENTS out multi-node EFP graphs with relatively low chromatic number. While these previous observables use the IRC- We thank Pierre Baldi, Timothy Cohen, Michael Fen- safe choice of κ = 1, the guided strategy emphasized the ton, Patrick Komiske, Andrew Larkoski, Eric Metodiev, importance of unsafe κ = 2 observables, particularly ones Benjamin Nachman, Tilman Plehn, Peter Sadowski, and with non-trivial angular dependence. This motivates fur- Edison Weik for helpful conversations and feedback. JT ther physics studies of these exotic EFPs. It is worth em- was supported by the U.S. Department of Energy (DOE), phasizing that we were only able to identify these new ob- Office of High Energy Physics under Grant No. DE- servables because we considered a sufficiently large space SC0012567. DW and TF are supported by the U.S. De- of HL observables. There may be other hidden orga- partment of Energy (DOE), Office of Science under Grant nizing principles to exploit for jet substructure studies, No. DE-SC0009920. TF was supported by the National which motivates the construction of alternative sets of Science Foundation under Award No. 1633631. observables based on different physical principles than the EFPs. In particular, we did not capitalize on the power counting and scaling properties of ratio/product Appendix A: Network Architectures and observables [64, 104, 110–112], which may reveal more Hyperparameters efficient HL observables for jet classification. It may also be beneficial to leverage first-principles knowledge about For consistency, all neural networks, regardless of input signal/background likelihood ratios [85, 113–118] to iden- data type, use a set of common settings and training tify promising HL observables. procedures: These results have important implications for what we • N = 5 × 106 training samples, broken down as: should regard as “best practices” in the application of ML methods to high-energy physics problems. Primarily, – 70% training, we should be more wary of utilizing opaque ML strate- – 15% validation, gies which obscure how a problem is solved in exchange for relatively small classification improvements. In gen- – 15% testing. eral, one should evaluate whether “additional” informa- Training samples are pre-processed via sci-kit’s tion captured by DNNs represents genuine patterns or is • StandardScaler. the byproduct of something unintentionally pressed into the data during simulation and then re-discovered by the • Adam optimizer used with default tensorflow network. settings: The informational gap in our benchmark problem could be closed using a single HL observable, suggesting – learning_rate = 0.001, that the CNN strategy was not relying on subtle correla- – beta1 = 0.9, tions among the low-level features, but rather exploiting – beta2 = 0.999, information encodable into a κ = 2 EFP. Thus, instead of a purely performance-oriented approach, we suggest a – epsilon = 1e-07, strategy of using deep networks to establish performance – amsgrad = False. benchmarks, but always seek to translate ML strategies into a more tractable space of well-motivated physical ob- • Output layer uses a sigmoid activation. servables. If this proves to be impossible or impractical, • Uncertainties are calculated via a 10-fold cross- it might be that the ML approach really is identifying validation (see App. A 3). genuinely new information, or more likely, that the space of physical observables needs to be augmented or opti- • Early stopping on validation_loss with a pa- mized. tience of 30 epochs. 17

• Model checkpoints saved (for best results only) on 2. Baseline Dense Neural Network minimum validation loss. The following training details are specific to all dense neural networks acting on jet substructure observables, 1. Baseline Convolutional Neural Network including EFPs:

The following training details are specific to all convo- • Hidden layers consist of 3 hidden dense layers with lutional neural networks trained on jet-images: the parameters:

• Input data undergoes a log transformation prior to – 300 nodes, StandardScaler pre-processing. – activation = ’relu’, • Hidden layers consist of 5 hidden convolutional lay- – kernel_constraint = max_norm(3). ers with the parameters: • 2 dropout layers (i.e. 1 between each dense layer) – 500 nodes, with a rate of 0.5. – kernel_size = (4,4), – strides = (1,1), 3. K-fold Validation – padding = ’valid’, To derive uncertainties on the trained model prediction – kernel_initializer = ’glorot_normal’, accuracy, we use the bootstrap cross-validation package – activation = ’relu’, in sci-kit to equally divide the test set 10 times and – kernel_constraint = max_norm(3). measure the performance across 10 bootstrapping itera- tions. Averages and standard deviations are then taken • 3 dropout layers (i.e. 1 between each convolutional from these 10 iterations to define the central value and layer) with a rate of 0.2. uncertainties of the AUC.

[1] Dan Guest, Kyle Cranmer, and Daniel White- [10] DZero Collaboration, “Search for single top quark pro- son, “Deep Learning and its Application to LHC duction at dø using neural networks,” Physics Letters Physics,” Ann. Rev. Nucl. Part. Sci. 68, 161–181 (2018), B 517, 282 – 294 (2001). arXiv:1806.11484 [hep-ex]. [11] CDF Collaboration, Phys. Rev. Lett. 104, 141801 [2] Alexander Radovic, Mike Williams, David Rousseau, (2010), arXiv:0911.3935 [hep-ex]. Michael Kagan, Daniele Bonacorsi, Alexander Himmel, [12] Pierre Baldi, Peter Sadowski, and Daniel Whiteson, Adam Aurisano, Kazuhiro Terao, and Taritree Wongji- “Searching for Exotic Particles in High-Energy Physics rad, “Machine learning at the energy and intensity fron- with Deep Learning,” Nature Commun. 5, 4308 (2014), tiers of ,” Nature 560, 41–48 (2018). arXiv:1402.4735 [hep-ph]. [3] J. K. Kohne et al., “Realization of a second level neural [13] Pierre Baldi, Peter Sadowski, and Daniel White- network trigger for the H1 experiment at HERA,” Nucl. son, “Enhanced Higgs Boson to τ +τ − Search with Instrum. Meth. A389, 128–133 (1997). Deep Learning,” Phys. Rev. Lett. 114, 111801 (2015), [4] Georges Aad et al. (ATLAS), “A neural network clus- arXiv:1410.3469 [hep-ph]. tering algorithm for the ATLAS silicon pixel detector,” [14] Roberto Santos, M. Nguyen, Jordan Webster, Soo Ryu, JINST 9, P09009 (2014), arXiv:1406.7690 [hep-ex]. Jahred Adelman, Sergei Chekanov, and Jie Zhou, “Ma- [5] Carsten Peterson, “Track Finding With Neural Net- chine learning techniques in searches for tth¯ in the works,” Nucl. Instrum. Meth. A279, 537 (1989). h → b¯b decay channel,” JINST 12, P04014 (2017), [6] Halina Abramowicz, Allen Caldwell, and Ralph Sinkus, arXiv:1610.03088 [hep-ex]. “Neural network based electron identification in the [15] A. Aurisano, A. Radovic, D. Rocco, A. Himmel, M.D. ZEUS calorimeter,” Nucl. Instrum. Meth. A365, 508– Messier, E. Niner, G. Pawloski, F. Psihas, A. Sousa, 517 (1995), arXiv:hep-ex/9505004 [hep-ex]. and P. Vahle, “A Convolutional Neural Network Neu- [7] Vardan Khachatryan et al. (CMS), “Observation of the trino Event Classifier,” JINST 11, P09001 (2016), Diphoton Decay of the Higgs Boson and Measurement arXiv:1604.01444 [hep-ex]. of Its Properties,” Eur. Phys. J. C 74, 3076 (2014), [16] Timothy Cohen, Marat Freytsis, and Bryan Ostdiek, arXiv:1407.0558 [hep-ex]. “(Machine) Learning to Do More with Less,” JHEP 02, [8] V. M. Abazov et al. (D0), “First measurement of σ 034 (2018), arXiv:1706.09451 [hep-ph]. √ (pp¯ → Z) . Br (Z → ττ ) at s = 1.96- TeV,” [17] M. Andrews, M. Paulini, S. Gleyzer, and B. Poc- Phys. Rev. D71, 072004 (2005), [Erratum: Phys. zos, “End-to-End Event Classification of High-Energy Rev.D77,039901(2008)], arXiv:hep-ex/0412020 [hep-ex]. Physics Data,” J. Phys. Conf. Ser. 1085, 042022 (2018). [9] DELPHI Collaboration, Physics Letters B 295, 383 – [18] Leandro G. Almeida, Mihailo Backovi´c,Mathieu Cliche, 395 (1992). Seung J. Lee, and Maxim Perelstein, “Playing Tag with 18

ANN: Boosted Top Identification with Pattern Recogni- son, “Deep-Learning Jets with Uncertainties and More,” tion,” JHEP 07, 086 (2015), arXiv:1501.05968 [hep-ph]. SciPost Phys. 8, 006 (2020), arXiv:1904.10004 [hep-ph]. [19] Luke de Oliveira, Michael Kagan, Lester Mackey, [34] Benjamin Nachman, “A guide for deploying Deep Benjamin Nachman, and Ariel Schwartzman, “Jet- Learning in LHC searches: How to achieve optimal- images — deep learning edition,” JHEP 07, 069 (2016), ity and account for uncertainty,” SciPost Phys. 8, 090 arXiv:1511.05190 [hep-ph]. (2020), arXiv:1909.03081 [hep-ph]. [20] Pierre Baldi, Kevin Bauer, Clara Eng, Peter Sadowski, [35] Gregor Kasieczka, Michel Luchmann, Florian Otter- and Daniel Whiteson, “Jet Substructure Classification pohl, and Tilman Plehn, “Per-Object Systematics using in High-Energy Physics with Deep Neural Networks,” Deep-Learned Calibration,” (2020), arXiv:2003.11099 Phys. Rev. D93, 094034 (2016), arXiv:1603.09349 [hep- [hep-ph]. ex]. [36] Patrick T. Komiske, Eric M. Metodiev, and Jesse [21] Andrew J. Larkoski, Ian Moult, and Benjamin Nach- Thaler, “Energy flow polynomials: A complete lin- man, “Jet Substructure at the Large Hadron Col- ear basis for jet substructure,” JHEP 04, 013 (2018), lider: A Review of Recent Advances in Theory and arXiv:1712.07124 [hep-ph]. Machine Learning,” Phys. Rept. 841, 1–63 (2020), [37] Eric M. Metodiev, Benjamin Nachman, and Jesse arXiv:1709.04464 [hep-ph]. Thaler, “Classification without labels: Learning from [22] Gregor Kasieczka, Tilman Plehn, Michael Russell, and mixed samples in high energy physics,” JHEP 10, 174 Torben Schell, “Deep-learning Top Taggers or The End (2017), arXiv:1708.02949 [hep-ph]. of QCD?” JHEP 05, 006 (2017), arXiv:1701.08784 [hep- [38] Anders Andreassen, Ilya Feige, Christopher Frye, and ph]. Matthew D. Schwartz, “JUNIPR: a Framework for Un- [23] Daniel Guest, Julian Collado, Pierre Baldi, Shih-Chieh supervised Machine Learning in Particle Physics,” Eur. Hsu, Gregor Urban, and Daniel Whiteson, “Jet Fla- Phys. J. C 79, 102 (2019), arXiv:1804.09720 [hep-ph]. vor Classification in High-Energy Physics with Deep [39] Jack H. Collins, Kiel Howe, and Benjamin Nach- Neural Networks,” Phys. Rev. D94, 112002 (2016), man, “Anomaly Detection for Resonant New Physics arXiv:1607.08633 [hep-ex]. with Machine Learning,” Phys. Rev. Lett. 121, 241803 [24] Identification of Jets Containing b-Hadrons with Re- (2018), arXiv:1805.02664 [hep-ph]. current Neural Networks at the ATLAS Experiment, [40] Cristian Buciluundefined, Rich Caruana, and Alexan- Tech. Rep. ATL-PHYS-PUB-2017-003 (CERN, Geneva, dru Niculescu-Mizil, “Model compression,” in Proceed- 2017). ings of the 12th ACM SIGKDD International Confer- [25] CMS Collaboration, Heavy flavor identification at CMS ence on Knowledge Discovery and Data Mining, KDD with deep neural networks, Tech. Rep. CMS-DP-2017- ’06 (Association for Computing Machinery, New York, 005 (2017). NY, USA, 2006) p. 535–541. [26] A.M. Sirunyan et al. (CMS), “Identification of heavy- [41] Javier Duarte et al., “Fast inference of deep neural flavour jets with the CMS detector in pp collisions at networks in FPGAs for particle physics,” JINST 13, 13 TeV,” JINST 13, P05011 (2018), arXiv:1712.07158 P07027 (2018), arXiv:1804.06913 [physics.ins-det]. [physics.ins-det]. [42] Spencer Chang, Timothy Cohen, and Bryan Ostdiek, [27] Patrick T. Komiske, Eric M. Metodiev, and “What is the Machine Learning?” Phys. Rev. D 97, Matthew D. Schwartz, “Deep learning in color: towards 056009 (2018), arXiv:1709.10106 [hep-ph]. automated quark/gluon jet discrimination,” JHEP 01, [43] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy, 110 (2017), arXiv:1612.01551 [hep-ph]. “Deep Variational Information Bottleneck,” ArXiv e- [28] Anders Andreassen, Patrick T. Komiske, Eric M. prints (2016), arXiv:1612.00410 [cs.LG]. Metodiev, Benjamin Nachman, and Jesse Thaler, [44] Stefan Wunsch, Raphael Friese, Roger Wolf, and “OmniFold: A Method to Simultaneously Unfold All G¨unter Quast, “Identifying the relevant dependencies Observables,” Phys. Rev. Lett. 124, 182001 (2020), of the neural network response on characteristics of arXiv:1911.09107 [hep-ph]. the input space,” Comput. Softw. Big Sci. 2, 5 (2018), [29] Kaustuv Datta, Deepak Kar, and Debarati Roy, “Un- arXiv:1803.08782 [physics.data-an]. folding with Generative Adversarial Networks,” (2018), [45] Thomas Roxlo and Matthew Reece, “Opening the black arXiv:1806.00433 [physics.data-an]. box of neural nets: case studies in stop/top discrimina- [30] Marco Bellagente, Anja Butter, Gregor Kasieczka, tion,” (2018), arXiv:1804.09278 [hep-ph]. Tilman Plehn, Armand Rousselot, Ramon Winter- [46] M. G. KENDALL, “A new measure of rank correlation,” halder, Lynton Ardizzone, and Ullrich K¨othe,“Invert- Biometrika 30, 81–93 (1938). ible Networks or Partons to Detector and Back Again,” [47] M.H. Seymour, “Tagging a heavy Higgs boson,” (2020), arXiv:2006.06685 [hep-ph]. in ECFA (LHC) Workshop: [31] Christoph Englert, Peter Galler, Philip Harris, and Physics and Instrumentation (1991) pp. 557–569. Michael Spannowsky, “Machine Learning Uncertainties [48] Michael H. Seymour, “Searches for new particles using with Adversarial Neural Networks,” Eur. Phys. J. C 79, cone and cluster jet algorithms: A Comparative study,” 4 (2019), arXiv:1807.08763 [hep-ph]. Z. Phys. C 62, 127–138 (1994). [32] James Barnard, Edmund Noel Dawe, Matthew J. [49] J.M. Butterworth, B.E. Cox, and Jeffrey R. Forshaw, Dolan, and Nina Rajcic, “Parton Shower Uncertain- “WW scattering at the CERN LHC,” Phys. Rev. D 65, ties in Jet Substructure Analyses with Deep Neu- 096014 (2002), arXiv:hep-ph/0201098. ral Networks,” Phys. Rev. D 95, 014018 (2017), [50] J.M. Butterworth, John R. Ellis, and A.R. Raklev, “Re- arXiv:1609.00607 [hep-ph]. constructing sparticle mass spectra using hadronic de- [33] Sven Bollweg, Manuel Haußmann, Gregor Kasieczka, cays,” JHEP 05, 033 (2007), arXiv:hep-ph/0702150. Michel Luchmann, Tilman Plehn, and Jennifer Thomp- 19

[51] Jonathan M. Butterworth, Adam R. Davison, Mathieu arXiv:1408.3122 [hep-ph]. Rubin, and Gavin P. Salam, “Jet substructure as a new [68] Philippe Gras, Stefan H¨oche, Deepak Kar, Andrew Higgs search channel at the LHC,” Phys.Rev.Lett. 100, Larkoski, Leif L¨onnblad, Simon Pl¨atzer, Andrzej 242001 (2008), arXiv:0802.2470 [hep-ph]. Si´odmok, Peter Skands, Gregory Soyez, and Jesse [52] A. Abdesselam et al., “Boosted objects: A Probe of be- Thaler, “Systematics of quark/gluon tagging,” JHEP yond the physics,” Boost 2010 Oxford, 07, 091 (2017), arXiv:1704.03878 [hep-ph]. United Kingdom, June 22-25, 2010, Eur. Phys. J. C71, [69] Robert E. Schapire, “The strength of weak learnability,” 1661 (2011), arXiv:1012.5412 [hep-ph]. Mach. Learn. 5, 197–227 (1990). [53] A. Altheimer et al., “Jet Substructure at the Tevatron [70] Y. Freund, “Boosting a weak learning algorithm by ma- and LHC: New results, new tools, new benchmarks,” jority,” Information and Computation 121, 256 – 285 BOOST 2011 Princeton , NJ, USA, 22–26 May 2011, (1995). J. Phys. G39, 063001 (2012), arXiv:1201.0008 [hep-ph]. [71] Patrick T. Komiske, Eric M. Metodiev, and Jesse [54] Jessie Shelton, “Jet Substructure,” in Theoretical Ad- Thaler, “Energy Flow Networks: Deep Sets for Particle vanced Study Institute in Elementary Particle Physics: Jets,” JHEP 01, 121 (2019), arXiv:1810.05165 [hep-ph]. Searching for New Physics at Small and Large Scales [72] Huilin Qu and Loukas Gouskos, “ParticleNet: Jet Tag- (2013) pp. 303–340, arXiv:1302.0260 [hep-ph]. ging via Particle Clouds,” Phys. Rev. D 101, 056019 [55] A. Altheimer et al., “Boosted objects and jet substruc- (2020), arXiv:1902.08570 [hep-ph]. ture at the LHC. Report of BOOST2012, held at IFIC [73] Eric A. Moreno, Olmo Cerri, Javier M. Duarte, Har- Valencia, 23rd-27th of July 2012,” BOOST 2012 Valen- vey B. Newman, Thong Q. Nguyen, Avikar Periwal, cia, Spain, July 23-27, 2012, Eur. Phys. J. C74, 2792 Maurizio Pierini, Aidana Serikova, Maria Spiropulu, (2014), arXiv:1311.2708 [hep-ex]. and Jean-Roch Vlimant, “JEDI-net: a jet identification [56] D. Adams et al., “Towards an Understanding of the Cor- algorithm based on interaction networks,” Eur. Phys. J. relations in Jet Substructure,” Eur. Phys. J. C 75, 409 C 80, 58 (2020), arXiv:1908.05318 [hep-ex]. (2015), arXiv:1504.00679 [hep-ph]. [74] Vinicius Mikuni and Florencia Canelli, “ABCNet: An [57] Roman Kogler et al., “Jet Substructure at the Large attention-based method for particle tagging,” Eur. Hadron Collider: Experimental Review,” Rev. Mod. Phys. J. Plus 135, 463 (2020), arXiv:2001.05311 Phys. 91, 045003 (2019), arXiv:1803.06991 [hep-ex]. [physics.data-an]. [58] Simone Marzani, Gregory Soyez, and Michael Span- [75] Alexander Bogatskiy, Brandon Anderson, Jan T. Of- nowsky, Looking inside jets: an introduction to jet sub- fermann, Marwah Roussi, David W. Miller, and Risi structure and boosted-object phenomenology, Vol. 958 Kondor, “Lorentz Group Equivariant Neural Network (Springer, 2019) arXiv:1901.10342 [hep-ph]. for Particle Physics,” (2020), arXiv:2006.04780 [hep- [59] Yanou Cui, Zhenyu Han, and Matthew D. Schwartz, ph]. “W-jet Tagging: Optimizing the Identification of [76] Jonathan Shlomi, Sanmay Ganguly, Eilam Gross, Kyle Boosted Hadronically-Decaying W Bosons,” Phys. Rev. Cranmer, Yaron Lipman, Hadar Serviansky, Hag- D 83, 074023 (2011), arXiv:1012.2077 [hep-ph]. gai Maron, and Nimrod Segol, “Secondary Ver- [60] Jesse Thaler and Ken Van Tilburg, “Identifying Boosted tex Finding in Jets with Neural Networks,” (2020), Objects with N-subjettiness,” JHEP 03, 015 (2011), arXiv:2008.02831 [hep-ex]. arXiv:1011.2268 [hep-ph]. [77] Josh Cogan, Michael Kagan, Emanuel Strauss, and [61] Jesse Thaler and Ken Van Tilburg, “Maximiz- Ariel Schwarztman, “Jet-Images: Computer Vision In- ing Boosted Top Identification by Minimizing N- spired Techniques for Jet Tagging,” JHEP 02, 118 subjettiness,” JHEP 02, 093 (2012), arXiv:1108.2701 (2015), arXiv:1407.5675 [hep-ph]. [hep-ph]. [78] David E. Kaplan, Keith Rehermann, Matthew D. [62] Andrew J. Larkoski, Gavin P. Salam, and Jesse Thaler, Schwartz, and Brock Tweedie, “Top Tagging: A “Energy Correlation Functions for Jet Substructure,” Method for Identifying Boosted Hadronically Decay- JHEP 06, 108 (2013), arXiv:1305.0007 [hep-ph]. ing Top Quarks,” Phys.Rev.Lett. 101, 142001 (2008), [63] Georges Aad et al. (ATLAS), “Measurement of the arXiv:0806.0848 [hep-ph]. cross-section of high transverse momentum vector [79] Leandro G. Almeida, Seung J. Lee, Gilad Perez, bosons reconstructed as single jets√ and studies of jet George F. Sterman, Ilmo Sung, and Joseph Virzi, “Sub- substructure in pp collisions at s = 7 TeV with the structure of high-p T Jets at the LHC,” Phys. Rev. D ATLAS detector,” New J. Phys. 16, 113013 (2014), 79, 074017 (2009), arXiv:0807.0234 [hep-ph]. arXiv:1407.0800 [hep-ex]. [80] Stephen D. Ellis, Christopher K. Vermilion, and [64] Andrew J. Larkoski, Ian Moult, and Duff Neill, “Power Jonathan R. Walsh, “Recombination Algorithms and Counting to Better Jet Observables,” JHEP 12, 009 Jet Substructure: Pruning as a Tool for Heavy Par- (2014), arXiv:1409.6298 [hep-ph]. ticle Searches,” Phys. Rev. D 81, 094023 (2010), [65] Francesco Pandolfi, Search for the Standard Model Higgs arXiv:0912.0033 [hep-ph]. Boson in the H → ZZ → l+l−qq Decay Channel at [81] Tilman Plehn, Michael Spannowsky, Michihisa CMS, Ph.D. thesis, Zurich, ETH, New York (2012). Takeuchi, and Dirk Zerwas, “Stop Reconstruc- [66] Serguei Chatrchyan et al. (CMS), “Search for a Higgs tion with Tagged Tops,” JHEP 1010, 078 (2010), − boson in the decay√ channel H to ZZ(*) to q qbar ` l+ arXiv:1006.2833 [hep-ph]. in pp collisions at s = 7 TeV,” JHEP 04, 036 (2012), [82] Stephen D. Ellis, Andrew Hornig, Tuhin S. Roy, arXiv:1202.1416 [hep-ex]. David Krohn, and Matthew D. Schwartz, “Qjets: A [67] Andrew J. Larkoski, Jesse Thaler, and Wouter J. Non-Deterministic Approach to Tree-Based Jet Sub- Waalewijn, “Gaining (Mutual) Information about structure,” Phys. Rev. Lett. 108, 182003 (2012), Quark/Gluon Discrimination,” JHEP 11, 129 (2014), arXiv:1201.1914 [hep-ph]. 20

[83] Mrinal Dasgupta, Alessandro Fregoso, Simone Marzani, JHEP 10, 121 (2018), arXiv:1803.00107 [hep-ph]. and Gavin P. Salam, “Towards an understanding of jet [101] Andrea Banfi, Gavin P. Salam, and Giulia Zan- substructure,” JHEP 09, 029 (2013), arXiv:1307.0007 derighi, “Principles of general final-state resummation [hep-ph]. and automated implementation,” JHEP 03, 073 (2005), [84] Andrew J. Larkoski, Simone Marzani, Gregory Soyez, arXiv:hep-ph/0407286. and Jesse Thaler, “Soft Drop,” JHEP 1405, 146 (2014), [102] Guy Gur-Ari, Michele Papucci, and Gilad Perez, “Clas- arXiv:1402.2657 [hep-ph]. sification of Energy Flow Observables in Narrow Jets,” [85] Andy Buckley, Giuseppe Callea, Andrew J. Larkoski, (2011), arXiv:1101.2905 [hep-ph]. and Simone Marzani, “An Optimal Observable for Color [103] Martin Jankowiak and Andrew J. Larkoski, “Jet Sub- Singlet Identification,” SciPost Phys. 9, 026 (2020), structure Without Trees,” JHEP 06, 057 (2011), arXiv:2006.10480 [hep-ph]. arXiv:1104.1646 [hep-ph]. [86] Anja Butter et al., “The Machine Learning Land- [104] Ian Moult, Lina Necib, and Jesse Thaler, “New An- scape of Top Taggers,” SciPost Phys. 7, 014 (2019), gles on Energy Correlation Functions,” JHEP 12, 153 arXiv:1902.09914 [hep-ph]. (2016), arXiv:1609.07483 [hep-ph]. [87] Kaustuv Datta and Andrew Larkoski, “How Much [105] https://pypi.org/project/EnergyFlow/. Information is in a Jet?” JHEP 06, 073 (2017), [106] Silviu-Marian Udrescu and Max Tegmark, “AI Feyn- arXiv:1704.08249 [hep-ph]. man: a Physics-Inspired Method for Symbolic Regres- [88] Liam Moore, Karl Nordstr¨om,Sreedevi Varma, and sion,” (2019), arXiv:1905.11481 [physics.comp-ph]. Malcolm Fairbairn, “Reports of My Demise Are Greatly [107] Patrick T. Komiske, Eric M. Metodiev, and Exaggerated: N-subjettiness Taggers Take On Jet Im- Jesse Thaler, “Cutting Multiparticle Correlators ages,” SciPost Phys. 7, 036 (2019), arXiv:1807.04769 Down to Size,” Phys. Rev. D 101, 036019 (2020), [hep-ph]. arXiv:1911.04491 [hep-ph]. [89] J.A. Aguilar-Saavedra and B. Zald´ıvar, “Jet tag- [108] Suyong Choi, Seung J. Lee, and Maxim Perelstein, “In- ging made easy,” Eur. Phys. J. C 80, 530 (2020), frared Safety of a Neural-Net Top Tagging Algorithm,” arXiv:2002.12320 [hep-ph]. JHEP 02, 132 (2019), arXiv:1806.01263 [hep-ph]. [90] Johan Alwall, Michel Herquet, Fabio Maltoni, Olivier [109] Gregor Kasieczka, Nicholas Kiefer, Tilman Plehn, and Mattelaer, and Tim Stelzer, “MadGraph 5 : Going Be- Jennifer M. Thompson, “Quark-Gluon Tagging: Ma- yond,” JHEP 1106, 128 (2011), arXiv:1106.0522 [hep- chine Learning vs Detector,” SciPost Phys. 6, 069 ph]. (2019), arXiv:1812.09223 [hep-ph]. [91] Torbjorn Sjostrand, Stephen Mrenna, and Peter Z. [110] Kaustuv Datta and Andrew J. Larkoski, “Novel Jet Skands, “PYTHIA 6.4 Physics and Manual,” JHEP Observables from Machine Learning,” JHEP 03, 086 0605, 026 (2006), arXiv:hep-ph/0603175 [hep-ph]. (2018), arXiv:1710.01305 [hep-ph]. [92] J. de Favereau et al. (DELPHES 3), “DELPHES [111] Kaustuv Datta, Andrew Larkoski, and Benjamin Nach- 3, A modular framework for fast simulation of a man, “Automating the Construction of Jet Observables generic collider experiment,” JHEP 1402, 057 (2014), with Machine Learning,” Phys. Rev. D 100, 095016 arXiv:1307.6346 [hep-ex]. (2019), arXiv:1902.07180 [hep-ph]. [93] Matteo Cacciari, Gavin P. Salam, and Gregory Soyez, [112] Albert M Sirunyan et al. (CMS), “Identification of “The Anti-k(t) jet clustering algorithm,” JHEP 04, 063 heavy, energetic, hadronically decaying particles us- (2008), arXiv:0802.1189 [hep-ph]. ing machine-learning techniques,” JINST 15, P06005 [94] Matteo Cacciari, Gavin P. Salam, and Gregory Soyez, (2020), arXiv:2004.08262 [hep-ex]. “FastJet User Manual,” Eur.Phys.J. C72, 1896 (2012), [113] Davison E. Soper and Michael Spannowsky, “Finding arXiv:1111.6097 [hep-ph]. physics signals with shower deconstruction,” Phys. Rev. [95] David Krohn, Jesse Thaler, and Lian-Tao Wang, D 84, 074002 (2011), arXiv:1102.3480 [hep-ph]. “Jet Trimming,” JHEP 02, 084 (2010), arXiv:0912.1342 [114] Davison E. Soper and Michael Spannowsky, “Finding [hep-ph]. top quarks with shower deconstruction,” Phys. Rev. D [96] Georges Aad et al. (ATLAS), “Identification of boosted, 87, 054012 (2013), arXiv:1211.3140 [hep-ph]. hadronically decaying W√ bosons and comparisons with [115] Davison E. Soper and Michael Spannowsky, “Finding ATLAS data taken at s = 8 TeV,” Eur. Phys. J. C physics signals with event deconstruction,” Phys. Rev. 76, 154 (2016), arXiv:1510.05821 [hep-ex]. D 89, 094005 (2014), arXiv:1402.1189 [hep-ph]. [97] Morad Aaboud et al. (ATLAS), “Performance of top- [116] Danilo Ferreira de Lima, Petar Petrov, Davison Soper, quark and W -boson tagging with ATLAS in Run and Michael Spannowsky, “Quark-Gluon tagging with 2 of the LHC,” Eur. Phys. J. C 79, 375 (2019), Shower Deconstruction: Unearthing and arXiv:1808.07858 [hep-ex]. Higgs couplings,” Phys. Rev. D 95, 034001 (2017), [98] Vardan Khachatryan et al. (CMS), “Identification tech- arXiv:1607.06031 [hep-ph]. niques for highly boosted W bosons that decay into [117] Andrew J. Larkoski and Eric M. Metodiev, “A Theory of hadrons,” JHEP 12, 017 (2014), arXiv:1410.4227 [hep- Quark vs. Gluon Discrimination,” JHEP 10, 014 (2019), ex]. arXiv:1906.01639 [hep-ph]. [99] Quark versus Gluon Jet Tagging Using Jet Images [118] Gregor Kasieczka, Simone Marzani, Gregory Soyez, and with the ATLAS Detector, Tech. Rep. ATL-PHYS-PUB- Giovanni Stagnitto, “Towards Machine Learning An- 2017-017 (2017). alytics for Jet Substructure,” JHEP 09, 195 (2020), [100] Sebastian Macaluso and David Shih, “Pulling Out All arXiv:2007.04319 [hep-ph]. the Tops with Computer Vision and Deep Learning,”