<<

Permutationless Many-Jet Event Reconstruction with Symmetry Preserving Attention Networks

Michael James Fenton ∗†,1 Alexander Shmakov ∗,2 Ta-Wei Ho,3 Shih-Chieh Hsu,4 Daniel Whiteson,1 and Pierre Baldi2

1Department of Physics and Astronomy, University of California Irvine 2Department of Computer Science, University of California Irvine 3Department of Physics, National Tsing Hua University, Taiwan 4Department of Physics and Astronomy, University of Washington

Top , produced in large numbers at the Large Collider, have a complex detector signature and require special reconstruction techniques. The most common decay mode, the “all-jet” channel, results in a 6-jet final-state which is particularly difficult to reconstruct in pp collisions due to the large number of permutations possible. We present a novel approach to this class of problem, based on neural networks using a generalized attention mechanism, that we call Symmetry Preserving Attention Networks (Spa-Net). We train one such network to identify the decay products of each top unambiguously and without combinatorial explosion as an example of the power of this technique. This approach significantly outperforms existing state-of-the-art methods, correctly assigning all jets in 80.7% of 6-jet, 66.8% of 7-jet, and 52.3% of ≥ 8-jet events respectively.

Our code is available at https://github.com/Alexanders101/SPA Net Code Release

INTRODUCTION these events is correctly determining which jets originate from each of the parent top quarks. At the Large Hadron Collider (LHC), protons are col- lided at the highest energy ever produced in the labora- The , as the most massive fundamental par- tory. Most of these collisions produce a high multiplicity ticle in the Standard Model, is the only quark to decay of jets; collimated sprays of particles that originate from before hadronisation. This presents a unique opportu- the strongly-coupled quarks and inside the pro- nity to study an isolated quark - if its decay products can ton. Final-states containing only jets occur through a be correctly identified. Top quarks decay via t W b, → number of physical processes, and the copious produc- and events are categorized by the decay modes of the tion of these “all-jet” topologies presents opportunities W bosons: dileptonic (9%), single-lepton (45%) or all- for precision measurements [1,2], and searches for rare jets (46%) [8]. To date, the most precise measurements Standard Model [3] or new physics [4,5] processes, but of top quark properties are typically performed in the also raises particular challenges. Specifically, it is typi- single-lepton or dilepton channels[9]. The all-jet chan- cally difficult to connect an observed jet with its quark nel, held back by the ambiguous event reconstruction and origin, and the factorial dependence on the number of arXiv:2010.09206v5 [hep-ex] 22 Mar 2021 large backgrounds, is comparatively under-explored. jets leads to the so-called “combinatorial explosion”. For example, top quark pair production has a 6 jet final- ≥ In this letter, we propose a novel architecture for as- state in the so-called “resolved” regime in which each signment of particle origin to jets, Symmetry Preserv- of the three decay products produced from each top are ing Attention NETworks (Spa-Net). Applying attention reconstructed as a single jet. In some events this can networks that naturally reflect the permutation symme- be mitigated using “boosted” reconstruction [6], though try of the task, Spa-Net significantly outperforms exist- this is limited to a small subset of events [7]. Thus, the ing state-of-the-art techniques while avoiding combinato- biggest obstacle to extracting physical information from rial explosion. In the following, we define the nature of the jet assignment task, describe invariance and attention mechanisms in neural networks, describe the dataset and * These authors contributed equally. training, and demonstrate the performance of our tech- † [email protected] nique relative to the state-of-the-art. 2

b masses2:

q 2 2 2 t (mbqq mb0q0q0 ) (mqq mW ) (mq0q0 mW ) χ2 = − + − + − W q σ2 σ2 σ2 ∆mbqq W W q0 (1) W 0 t0 In our dataset (described below), we find mW = q0 81.3 GeV, σW = 12.3 GeV and σ∆mbjj = 26.3 GeV. b0 χ2 is evaluated for every permutation, and the minimum FIG. 1: Production diagram of a top quark pair, tt0, value is chosen. This method typically uses b-tagging and their decays into qqb and q0q0b0, respectively. to consider only permutations with b-jets in the place of b-quarks. This reduces the permutations but prevents the correct solution being found in the presence of mis- tagged jets. We apply this requirement in our studies for JET ASSIGNMENT consistency with recent experimental results - this imple- mentation has been the preferred reconstruction for AT- The jet assignment task is the identification of the orig- LAS [1, 10], while CMS have used a similar method [11]. inal particle which leads to a reconstructed jet. In a colli- sion which produces N jets, there are N! possible assign- Another approach is the use of Boosted Decision Trees ments. Fortunately, symmetries can reduce this number. or Neural Networks as permutation classifiers, though these have mostly been employed in the leptonic channels Top quarks decay via the chain t W b qqb, sup- where combinatoric explosion is reduced [12, 13], or to → → pressing charge labels. A pair of top quarks, tt0, therefore reconstruct tops individually [14]. CMS has also used a 2 produce six quarks, qqbq0q0b0. This process is shown in hybrid χ and BDT method in the all-jet channel [15]. Figure1. The task is to correctly identify six observed A more advanced minimisation technique is imple- jets with six labels: b, b0, 2 q, and 2 q0. The symmetries × × mented in KLFitter [16]. Transfer functions are used between the two tops and between the decay products of to represent the detector response in a likelihood func- the W -bosons reduce this to 6!/(2 2 2) = 90 permuta- × × tion per permutation, which is minimised to find the as- tions. Further complicating things, 50% of tt¯ events at 2 ∼ signments. This operates similarly to the χ technique, the LHC are expected to contain at least one additional though has greater CPU requirements due to the more jet which is not the result of top quark decay, leading complex model. KLFitter has to date been almost exclu- to 7!/(2 2 2) = 630 or 8!/(2 2 2 2) = 2520 sively used in the single lepton channel, and with the χ2 × ×1 × × × permutations in 7- or 8-jet events, respectively. Higher already CPU limited in many cases, it is not discussed jet multiplicity events are also likely; nonetheless, the here. current state-of-the-art techniques depend on enumerat- ing and evaluating each permutation to identify the best candidate. The many incorrect assignments obscure the INVARIANCE AND ATTENTION IN DEEP true assignment, diluting the scientific power of the data, NEURAL NETWORKS and represents a significant computational penalty when all permutations are evaluated for every event. Some- Equivariance and invariance properties can play an im- times, each permutation must be evaluated per system- portant role in the design of both feed-forward and re- atic uncertainty per event, which is often intractable in cursive neural networks across different forms of learn- the datasets typical in high-energy-physics (HEP). ing [17–22]. For instance, classical convolution neu- ral networks can produce object recognition outputs The most common technique is a χ2-minimisation that are invariant with respect to translations in their method, which scores a permutation based on the consis- two-dimensional inputs, and this invariance property tency of the reconstructed W -boson masses with known has been generalized to apply to other manifolds and values and similarity of the two reconstructed top quark

2 2 Explicitly using mtop in the χ equation improves reconstruction efficiency by a few percent at the expense of dramatically sculpt- ing the top mass distribution. We chose to make comparisons to 1 The additional factor of 2 is the invariance of the extraneous jets. the definition used in recent experimental results. 3 groups [23, 24]. In the problem considered here, the network output should be invariant under permutations of the input jet order. Such permutation invariance has been explored in set-based [25, 26] and graph net- works [27]. The output should further identify two dis- tinct interchangeable triplets, qqb and q0q0b0, each includ- ing an interchangeable pair, qq or q0q0. This permuta- tion invariance on the output is a unique property of our dataset which our architecture must account for.

Attention mechanisms allow the network to selectively propagate information (gating) - the activities of a set of neurons are multiplied, component-wise, by the activities of another set of neurons. These gating mechanisms allow FIG. 2: High-level structure of Spa-Net. neural networks to dynamically modulate neuron activ- ity as a function of the other neurons or inputs. Recur- sive and attention-based networks, which allow the net- b-jets. work to infer relationships between different elements in The supervised learning technique employed here re- a sequence, have achieved state-of-the-art performance in quires a training sample in which the correct assignments natural language processing in machine translation [28], are identified. We define the correct jet assignments by language understanding [29], and text generation [30]. matching them to the simulated truth quarks within an p Attention architectures are permutation invariant be- angular distance of ∆R = ∆η2 + ∆φ2 < 0.4. Requir- cause rearranging the order of the elements in the input ing all six quarks to be unambiguously matched leaves sequence induces the same rearrangement in the atten- 2,967,955 total events for training, and 119,283 for per-

s formance evaluation. This matchingr requirement has an

tion weights. This feature can be leveraged to endow g e r

n n r i m e

efficiency of 24% on 6-jeto events, 32% on 7-jet events, r i the network with permutation symmetry [25, 26]. We o d d t r r o s o d f n e e n

and 40% on 8-jet events. Wec also evaluate the per- s e e d

leverage the permutation invariance present in attention- d e t n n t o b

≥ o T a c E c formance in 365,939 eventsA in which only one of the top r

based methods to efficiently model the symmetries of the m n n T E E E

quarks has all decay products matched, with filter effi- t top quark pair system. We generalize the ideas present in r r n e e

ciencies of 56%, 52%, and 48% in 6-jet, 7-jet, and 8-jet e dot-product attention to allow for a three-way symmetry- m m

r d r

r ≥ o o n f e preserving attention mechanism which can perform jet events respectively. f r

e n s s r m e o p n n r i 0 0 0 o d t a e assignments into qqb and q q b triplets. a o s r o f r n d n c T s T e e n t n n I t T

a E t A SPA-NET ARCHITECTUREr e T DATASETS J We perform the jet classification with Spa-Net: an A sample of 50M simulated pp tt¯ events was gener- attention-based neural network which encodes the sym- → ated at √s = 13 TeV using MadGraph aMC@NLO [31] metries of the problem. Inputs to the network are an (v2.7.2), interfaced to Pythia8 [32] (v8.2) for shower- unsorted list of jets, each represented by their 4-vector ing and hadronisation. Detector response was simu- (pT, η, φ, M) as well as a boolean b-tag. M and pT are lated using Delphes [33] (v3.4.2) with the ATLAS pa- logarithmically scaled, and then each component is inde- rameterization. Events are generated at leading order pendently normalized to have zero mean and unit vari- in quantum-chromodynamics (QCD), with the top mass ance. The network, shown in Figure2, consists of six mtop = 173 GeV, and the W -boson forced to decay components: a jet-independent embedding which con- hadronically. Jets, reconstructed using the anti-kT al- verts each jet into a D-dimensional latent space repre- gorithm [34] as implemented in FastJet [35] (v3.2.1) with sentation; a stack of transformer encoders which learn radius R = 0.4, are required to have transverse momen- contextual relationships; two additional transformer en- tum pT 25 GeV and absolute pseudo-rapidity η < 2.5, coders on each branch to extract top-quark information; ≥ | | and are tagged as originating from b-quarks with pT- and two tensor-attention layers to produce the top-quark dependent efficiency and mistag rates. 5,926,407 events distributions. The transformer encoders employ a vari- meet the preselection requirements of 6 jets and 2 ant of attention known as multi-head self-attention [28], ≥ ≥ 4 though they may use any permutation-invariant architec- limitation of the χ2 method, which grows as (N 6). In O ture in general. our dataset, the largest jet multiplicity in a single event is N = 18.

Symmetry Preserving Tensor Attention Training Jet assignment in the context of tt¯events presents sev- eral unique problems for typical classification networks, We train these distributions via cross-entropy between as a variety of symmetries complicate the output gener- the output probabilities and the true target distribution ation and training. Primarily, the physics quantities are on the all-jet tt¯ problem, naming the resulting network invariant under permutation of the W decay products qq. Spatt¯er (Spa-Net for tt¯Reconstruction). This formula- The network must not differentiate between predictions tion contains another symmetry which can be exploited: which prefer one ordering to the other, and match each qq the top quark pairs are invariant with respect to the la- pair to the appropriate b. This naturally creates a triplet bels tt0 t0t. We create a symmetric loss function based ↔ relation qqb, with qq obeying permutation symmetry. In on cross-entropy, H(X,Y ) = P x log(y), (x,y)∈(X,Y ) − order to encode this into the network, we develop a gen- which allows either of the networks two output distri- eralization of the attention mechanism which can learn butions, L and R, to match either one of the targets P P n-way relationships with included symmetries. We call 1 and 2. The target distributions have two symmet- T T T this technique tensor attention. ric non-zero entries, one for each permutation of the qq pair. The loss is expressed as Each tensor attention layer contains a set of weights L D×D×D θ R . This tensor is not inherently symmetric: L R L R ∈ = min ( 1( , 1, , 2), 1( , 2, , 1)) (5) in order to produce an invariant attention weighting, we L L P T P T L P T P T first transform it into an auxiliary weights tensor which where 1(P1,T1,P2,T2) = H(T1,P1) + H(T2,P2). The L conforms to the classification permutation group. We resulting distributions may classify the same jet to be produce S RD×D×D and use it to perform weighted ∈ part of both triplets. To enforce unique predictions, dot-product attention on a list of embedded jets X we select the assignment of the higher probability N×D ∈ P R , where N is the number of jets. Working in flat first, and re-evaluate the other to select the best non- P Euclidean space, we express the attention mechanism in contradictory classification. Einstein notation, with We also note that Spatt¯er does not enforce that b- 1 Sijk = θijk + θjik (2) tagged jets are selected in the position of the b-quarks. 2 This allows the network to correctly predict events in ijk i j k nml O = XnXmXl S (3) which there are mis-tagged jets, while still utilising b- tagging information. In the χ2, allowing this means b- The summation in Equation2 guarantees that the first tagging information is completely lost, and greatly in- two dimensions of S will be symmetric, ensuring that creases the number of permutations. Oijk = Ojik, and enforcing qq invariance. Afterwards, we perform a 3-dimensional softmax on O to generate Spatt¯er contains 2.1M parameters in each tensor at- the joint triplet probability distribution tention layer and 600k parameters in the central trans- former encoder stack. Spatt¯er was trained using the exp Oijk (i, j, k) = (4) AdamW optimizer [36] and Nvidia Titan-V GPUs, con- P exp Oijk P i,j,k verging after approximately 4 hours. We produce individual distributions for each of the two top-quarks, L and R, and we produce a single triplet P P PERFORMANCE from each by selecting the peak of these distributions. An important note is that the weights tensor rank de- To assess performance we define two metrics, which pends only on the hyperparameter D, and thus, it is pos- can only be evaluated on identifiable top quarks where sible to include any number of jets in each event. Further- the correct assignment has been identified as described more, the network evaluation scales only as (N 3) with previously. The first metric is top, the fraction of iden- O respect to the number of jets because we only need to pro- tifiable top quarks which have all three jets correctly as- duce a triplet distribution . This removes a crippling signed. This is reported in two event subsets, those where P 5

top only one top quark is identifiable (1 ), and those where more peaked distribution, with comparable shapes in the top both top quarks are identifiable (2 ). We further define incorrect and unmatched events. event, the fraction of events with two identifiable top quarks in which both top quarks have all jets correctly 25K Correct assigned. TableI shows these metrics for both the χ2 20K Incorrect and Spatt¯er methods. Unmatched 15K 2 event ¯ The χ has an  of 37.7%, while Spatter achieves Count 10K an event of 63.7%. The χ2 suffers from a large reduc- tion in performance as jet multiplicity increases, peaking 5K at 61.8% for events with exactly 6 jets and dropping to 0 40 60 80 100 120 140 160 23.2% in events with at least 8 jets. This drop is less W Candidate Mass [GeV] pronounced for Spatt¯er, at 80.7% in 6-jet events and (a) χ2 52.3% in 8-jet events. We observe a similar trend in ≥ top 25K ¯ Correct the per-top efficiencies; for 2 , Spatter achieves 73.5% inclusively, compared to just 47.0% for the χ2. These 20K Incorrect Unmatched numbers drop to 66.2% and 35.5% respectively in 8-jet ≥ 15K

events. The performance is lower in events in which only Count 10K one top is identifiable, though Spatt¯er still strongly out- 2 top 5K performs the χ , with an 1 of 55.2% and 23.0% respec- tively. We also note that in our evaluation dataset, 8.1% 0 40 60 80 100 120 140 160 of events in which both tops are identifiable have at least W Candidate Mass [GeV] one b-quark matched to non-b-tagged jets. These quarks, (b) Spatt¯er which are impossible for the χ2 to correctly reconstruct, are reconstructed by Spatt¯er with an efficiency of 29.4%. FIG. 3: Stacked distributions of reconstructed mW using(a) the χ2, and (b) Spatt¯er. TABLE I: Performance of the χ2 and Spatt¯er assignments assessed by per-event efficiency event and top per-top efficiencies  , inclusively and by jet 25K Correct multiplicity Njets. 20K Incorrect Unmatched 2 χ Method Spatt¯er 15K N event top top event top top jets 2 1 2 1 Count 10K 6 61.8% 65.0% 24.2% 80.7% 84.1% 56.7% 7 40.8% 50.4% 24.6% 66.8% 75.7% 56.2% 5K ≥8 23.2% 35.5% 20.2% 52.3% 66.2% 52.9% 0 50 100 150 200 250 300 350 Inclusive 37.7% 47.0% 23.0% 63.7% 73.5% 55.2% Top Candidate Mass [GeV] (a) χ2

25K We inspect the reconstructed W mass using the as- Correct signments generated by both methods in Figure3, bro- 20K Incorrect ken down into three categories: “correct”, “incorrect”, Unmatched 15K and “unmatched”, corresponding to the cases in which: Count all three top decay products are correctly assigned, all 10K three top decay products are present in the event but 5K at least one is incorrectly assigned, and one or more of 0 the top decay products is not identifiable, respectively. 50 100 150 200 250 300 350 Top Candidate Mass [GeV] 2 The χ has a narrower peak around mW than Spatt¯er, ¯ though much of this shape comes from the incorrect and (b) Spatter unmatched events. This is explained by the presence FIG. 4: Stacked distributions of reconstructed mtop of mW in Equation1. Figure4 shows the mtop distri- using (a) the χ2, and (b) Spatt¯er. butions for the same three categories. Spatt¯er has the 6

In 11.2% of fully matched events, the network pre- additional ways, such as setting a minimum reconstruc- dicted the same jet to be part of both top quarks. Spatt¯ tion quality requirement that will act to suppress back- er correctly predicts 25.0% of these events after the re- grounds, analogously to how the χ2 is used in [1]. Ad- selection described above, while the χ2 achieved an event ditional input information, such as jet substructure [37] of only 12.7%, indicating that these were generally dif- or (pseudo-)continuous b-tagging [38], may also improve ficult events to classify correctly. We also note that the performance. average softmax value of the predicted jets in these events is only 36%, compared to 74% for all fully reconstructable This letter contributes to a family of work which help events. endow machine learning methods with problem specific invariances. We have presented an efficient polynomial- A final important performance metric is the computa- time approach for tackling classification tasks where the 2 tion time per event. The time required to evaluate the χ targets must obey a set of permutation symmetries. Such scales approximately as P (N, 6) = (N 6) with the num- O symmetries underlie the mathematical foundation of the ber of jets in the event, and this often leads to analyses Standard Model, but they may be found in many other setting a maximum number of jets to consider, degrad- common classification tasks, such as graph matching [39] ing the performance for purely CPU-time reasons. On and hierarchical clustering algorithms. Well trained deep the author’s laptop, a 2019 Dell XPS13 with an Intel- neural networks can replace permutation based algo- ¯ Core i7-1065G7 1.30GHz CPU, Spatter took an average rithms, and avoid combinatorial explosion, by effectively of 4.4 ms to evaluate per event, with no dependence on estimating symmetry-aware pair-wise similarities. Sym- 3 2 jet multiplicity . In contrast, the χ took an average of metries can also be used to create smaller models that 20 ms in 6-jet events, 48 ms in 7-jet events, and 369 ms reduce the amount of data necessary for training [40]. in 8-jet events. ≥ Understanding and exploiting the invariances present in any modeling task is vital for effective learning. CONCLUSIONS

Spa-Net’s have inherent permutation symmetries ACKNOWLEDGEMENTS which make them very well suited to the task of jet assignment, where the permutations otherwise lead to MF would like to thank Nicole Hartman for useful an increase in computation time and dilutes the scien- early discussions, and Megan Remillard for language edit- tific value of the data. Our network Spatt¯er demon- ing assistance. DW and MF are supported by the U.S. strates superior performance on this task. The adoption Department of Energy (DOE), Office of Science under of our technique by the experimental collaborations AT- Grant No. DE-SC0009920. S.-C. Hsu is supported by LAS and CMS will lead to significantly improved preci- the U.S. Department of Energy, Office of Science, Office sion of analyses in the all-jet tt¯ final-state by improving of Early Career Research Program under Award number the fraction of events that are well reconstructed from DE-SC0015971. The work of AS and PB in part sup- 37.7% using existing methods to 64.1% using our new ported by grants NSF NRT 1633631 and ARO 76649- technique. This paradigm shift will allow greatly en- CS to PB. The work of T.-W.H. was supported by the hanced sensitivity in high jet multiplicity events making Taiwan MoST with the grant number MOST-107-2112- many new analyses viable in this final-state. M-007-029-MY3. This letter describes just one of many possible appli- cations of Spa-Net’s to event reconstruction in HEP. Future work may include extending these techniques to alternative all-jet final-states, to other tt¯ decay modes, [1] Georges Aad et al. Measurements of top-quark pair single- and double-differential cross-sections in the all- or many other classes of problem. Though not stud- √ ¯ hadronic channel in pp collisions at s = 13 TeV using ied here, the output of Spatter may also be used in the ATLAS detector. JHEP, 01:033, 2021. [2] Albert M Sirunyan et al. Measurement of differential t¯t production cross sections using top quarks at large √ 3 We pad all events to the maximum of 18 jets, making Spatt¯er’s transverse momenta in pp collisions at s = 13 TeV. 8 evaluation time constant in our testing dataset. The speed could 2020. be further improved through intelligent batching of events by jet [3] Georges Aad et al. Search for the Standard Model Higgs multiplicity. boson decaying into bb produced in association with top 7

√ quarks decaying hadronically in pp collisions at s = 8 [19] Alexander Bogatskiy, Brandon Anderson, Jan T. Offer- TeV with the ATLAS detector. JHEP, 05:160, 2016. mann, Marwah Roussi, David W. Miller, and Risi Kon- [4] ATLAS Collaboration. Search for resonances decaying dor. Lorentz group equivariant neural network for parti- into top-quark pairs using fully hadronic decays in pp cle physics, 2020. √ collisions with ATLAS at s = 7 TeV. JHEP, 01:116, [20] P. Baldi. The inner and outer approaches for 2013. the design of recursive neural networks architec- [5] CMS Collaboration. Search for vector-like T quarks de- tures. Data Mining and Knowledge Discovery, DOI: caying to top quarks and Higgs bosons in the all-hadronic 10.1007/s10618-017-0531-0:1–13, 2017. Available at: channel using jet substructure. JHEP, 06:080, 2015. http://link.springer.com/article/10.1007/s10618-017- [6] A. Abdesselam et al. Boosted Objects: A Probe of Be- 0531-0. yond the Standard Model Physics. Eur. Phys. J. C, [21] P. Baldi. Deep Learning in Science: Theory, Algorithms, 71:1661, 2011. and Applications. Cambridge University Press, Cam- [7] Georges Aad et al. Measurements of top-quark pair differ- bridge, UK, 2020. In press. ential and double-differential cross-sections in the `+jets [22] Forest Agostinelli, Stephen McAleer, Alexander √ channel with pp collisions at s = 13 TeV using the Shmakov, and Pierre Baldi. Solving the rubik’s ATLAS detector. Eur. Phys. J. C, 79(12):1028, 2019. cube with deep reinforcement learning and search. [Erratum: Eur.Phys.J.C 80, 1092]. Nature Machine Intelligence, 1(8):356–363, July 2019. [8] M. Tanabashi et al. Review of . Phys. [23] Taco Cohen, Maurice Weiler, Berkay Kicanaoglu, and Rev. D, 98(3):030001, 2018. Max Welling. Gauge equivariant convolutional networks [9] ATLAS, CDF, CMS, and D0. First combination of Teva- and the icosahedral CNN. volume 97 of Proceedings tron and LHC measurements of the top-quark mass. 3 of Machine Learning Research, pages 1321–1330, Long 2014. Beach, California, USA, 09–15 Jun 2019. PMLR. [10] ATLAS Collaboration. Top-quark mass measurement in [24] Taco Cohen and Max Welling. Group equivariant con- √ the all-hadronic tt decay channel at s = 8 TeV with volutional networks. volume 48 of Proceedings of Ma- the ATLAS detector. JHEP, 09:118, 2017. chine Learning Research, pages 2990–2999, New York, [11] CMS Collaboration. Measurement of the top quark mass New York, USA, 20–22 Jun 2016. PMLR. √ in the all-jets final state at s = 13 TeV and combination [25] Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, with the lepton+jets channel. Eur. Phys. J. C, 79(4):313, Seungjin Choi, and Yee Whye Teh. Set transformer: 2019. A framework for attention-based permutation-invariant [12] J. Erdmann, T. Kallage, K. Kr¨oninger,and O. Nacken- neural networks. volume 97 of Proceedings of Machine horst. From the bottom to the top—reconstruction of tt¯ Learning Research, pages 3744–3753, Long Beach, Cali- events with deep learning. Journal of Instrumentation, fornia, USA, 09–15 Jun 2019. PMLR. 14(11):P11015–P11015, Nov 2019. [26] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, [13] ATLAS Collaboration. Search for the standard model Barnabas Poczos, Russ R Salakhutdinov, and Alexan- Higgs boson produced in association with top quarks and der J Smola. Deep sets. In I. Guyon, U. V. Luxburg, √ decaying into a b¯b pair in pp collisions at s = 13 TeV S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and with the ATLAS detector. Phys. Rev. D, 97(7):072016, R. Garnett, editors, Advances in Neural Information Pro- 2018. cessing Systems 30, pages 3391–3401. Curran Associates, [14] ATLAS Collaboration. CP Properties of Higgs Boson In- Inc., 2017. teractions with Top Quarks in the ttH¯ and tH Processes [27] Jonathan Shlomi, Peter Battaglia, and Jean-Roch Vli- Using H → γγ with the ATLAS Detector. Phys. Rev. mant. Graph neural networks in particle physics. Ma- Lett., 125(6):061802, 2020. chine Learning: Science and Technology, 2(2):021001, [15] CMS Collaboration. Measurement of the t¯tbb¯ production Jan 2021. cross section in the all-jet final state in pp collisions at [28] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob √ s = 13 TeV. Phys. Lett. B, 803:135285, 2020. Uszkoreit, Llion Jones, Aidan N Gomez,Lukasz Kaiser, [16] Johannes Erdmann, Stefan Guindon, Kevin Kroeninger, and Illia Polosukhin. Attention is all you need. In Boris Lemmer, Olaf Nackenhorst, Arnulf Quadt, and I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fer- Philipp Stolte. A likelihood-based reconstruction algo- gus, S. Vishwanathan, and R. Garnett, editors, Ad- rithm for top-quark pairs and the KLFitter framework. vances in Neural Information Processing Systems 30, Nucl. Instrum. Meth. A, 748:18–25, 2014. pages 5998–6008. Curran Associates, Inc., 2017. [17] D. Pfau, J.S. Spencer, A.G. de G. Matthews, and [29] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and W.M.C. Foulkes. Ab-initio solution of the many-electron Kristina Toutanova. BERT: Pre-training of deep bidi- schr¨odinger equation with deep neural networks. Phys. rectional transformers for language understanding. In Rev. Research, 2:033429, 2020. Proceedings of the 2019 Conference of the North Amer- [18] Jason Sang Hun Lee, Inkyu Park, Ian James Watson, and ican Chapter of the Association for Computational Lin- Seungjin Yang. Zero-permutation jet-parton assignment guistics: Human Language Technologies, Volume 1 (Long using a self-attention network, 2020. and Short Papers), pages 4171–4186, Minneapolis, Min- nesota, June 2019. Association for Computational Lin- 8

guistics. [30] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsu- pervised multitask learners. 2019. [31] J. Alwall, R. Frederix, S. Frixione, V. Hirschi, F. Maltoni, O. Mattelaer, H. S. Shao, T. Stelzer, P. Torrielli, and M. Zaro. The automated computation of tree-level and next-to-leading order differential cross sections, and their matching to parton shower simulations. JHEP, 07:079, 2014. [32] Torbj¨ornSj¨ostrand,Stefan Ask, Jesper R. Christiansen, Richard Corke, Nishita Desai, Philip Ilten, Stephen Mrenna, Stefan Prestel, Christine O. Rasmussen, and Pe- ter Z. Skands. An introduction to PYTHIA 8.2. Comput. Phys. Commun., 191:159–177, 2015. [33] J. de Favereau, C. Delaere, P. Demin, A. Giammanco, V. Lemaˆıtre,A. Mertens, and M. Selvaggi. DELPHES 3, A modular framework for fast simulation of a generic collider experiment. JHEP, 02:057, 2014. [34] Matteo Cacciari, Gavin P. Salam, and Gregory Soyez. The anti-kt jet clustering algorithm. JHEP, 04:063, 2008. [35] Matteo Cacciari, Gavin P. Salam, and Gregory Soyez. FastJet User Manual. Eur. Phys. J. C, 72:1896, 2012. [36] Ilya Loshchilov and Frank Hutter. Decoupled weight de- cay regularization. In International Conference on Learn- ing Representations, 2019. [37] Andrew J. Larkoski, Ian Moult, and Benjamin Nachman. Jet Substructure at the Large Hadron Collider: A Review of Recent Advances in Theory and Machine Learning. Phys. Rept., 841:1–63, 2020. [38] ATLAS Collaboration. ATLAS b-jet identification per- formance and efficiency measurement with tt¯events in pp √ collisions at s = 13 TeV. Eur. Phys. J. C, 79(11):970, 2019. [39] Tib´erio S. Caetano, Julian J. McAuley, Li Cheng, Quoc V. Le, and Alexander J. Smola. Learning graph matching. CoRR, abs/0806.2890, 2008. [40] Benjamin Bloem-Reddy and Yee Whye Teh. Probabilistic symmetries and invariant neural networks. Journal of Machine Learning Research, 21(90), 2020. 9

APPENDIX : SUPPLEMENTARY PLOTS 1.0 AUC: 0.851

0.04 0.8 Ground Truth ¯ 0.03 Spatter χ2 0.6

0.02 Density 0.4 0.01 True Positive Rate 0.2 40 60 80 100 120 140 160 W Candidate Mass [GeV] 0.0 (a) m W 0.0 0.2 0.4 0.6 0.8 1.0 0.020 False Positive Rate Ground Truth Spatt¯er 0.015 (a) 2 Top Events χ2 1.0 0.010 AUC: 0.793 Density

0.005 0.8

50 100 150 200 250 300 350 0.6 Top Candidate Mass [GeV]

(b) mtop 0.4 Invariant Mass Distribution For Complete tt¯ Events. True Positive Rate 0.003 Ground Truth 0.2 Spatt¯er 2 0.002 χ 0.0

Density 0.0 0.2 0.4 0.6 0.8 1.0 0.001 False Positive Rate (b) 1 Top Events 0.000 200 400 600 800 1000 1200 Invariant mass of six jets. (GeV) FIG. 6: Receiver operating characteristic (ROC) curve (c) mtt¯ of the predicted top-quark triplet softmax value for Spatt¯er on events with two and one reconstructable FIG. 5: Comparison of (a) m , and (b) m , for the W top tops. Targets are defined as 1 if the predicted triplet truth matched jets (solid red), the χ2 assignments was correct and 0 and otherwise. Included is the area (dotted green), and the Spatt¯er assignments (dashed under each curve. blue). A Gaussian KDE has been applied. 10

Prediction Confidence Distribution Correct events 6 Incorrect events

4 Density 2

0 0.0 0.2 0.4 0.6 0.8 1.0 Prediction Confidence Distribution 2 top events Tensor Tensor 4 1 top events Attention Attention 0 top events

Density 2

0 Transformer Transformer 0.0 0.2 0.4 0.6 0.8 1.0

Encoder EncoderFIG. 7: The distribution of the predicted triplet’s softmax value in Spatt¯er’s output distributions, grouped by either correctness or event type. Transformer Encoder

Transformer Encoder

FIG. 9: High Level Structure of Spa-Net. Left Branch b Distribution Right Branch b Distribution 0.6 0.4

Jet Independent Embeddings 0.4 0.2

Probability Probability 0.2

0.0 0.0 Left Branch qq Distribution Heatmap Right Branch qq Distribution Heatmap 0 0 1 1 2 2 Jet Jet 3 3 4 4 5 5 0 1 2 3 4 5 0 1 2 3 4 5 Jet Jet

FIG. 8: A visualization of an example output for a single event. The top plots display the projected b distribution, and the lower plots show the join qq distributions. 11

APPENDIX : SPAtt¯ER MATHEMATICAL Transformer Encoder FORMULATION The encoder stack consists of a sequence of transformer In this section, we provide a more detailed description encoder blocks as described by Vaswani et al. [28]. A sin- of Spatt¯er’s architecture which is necessary to recre- gle encoder block contains a multi-head attention layer ate the model. Figure9 provides a high-level graphical Attention(Q, K, V )[28]; two LayerNorm operations [? overview of the network. Here we describe the mathe- ]; two feed-forward layers L1 and L2; and a P ReLU non- matics necessary to implement each of the sections in the linearity. In this paper, we provide a general overview of figure. Spatt¯er has a very recursive architecture, so we the encoder structure, but we omit a detailed description define several terms to simplify this description. Spatt¯er of the multi-head attention and layer normalization op- consists of several large stacks, which are compositions of erations. A single transformer encoder block Ti can be several identically-structured blocks, each of which con- expressed as Ti = B A, where tains one or more layers, and each of which contains one ◦ or more parameters.

B(x) = LayerNorm(Attention(x, x, x) + x) (7)

Independent Embedding A(x) = LayerNorm(L2(P ReLU(L1(x))) + x) (8)

The embedding stack consists of several embedding The complete transformer encoder stack, T , is simply blocks, each of which progressively increases the latent a composition of k encoder blocks, one after the other. dimensionality of the input jets up to the final dimension- We use k = 6 in tt¯ . ality D. The embedding blocks are feed-forward, fully- Spa er connected neural networks which are applied indepen- dently to each jet via weight sharing. Each block con- sists of a fully-connected matrix multiplication layer L T = Tk Tk−1 T2 T1 (9) with parameters W RDo×Di and b RDo ; a P ReLU ◦ ◦ · · · ◦ ◦ ∈ ∈ [? ] non-linearity with a parameter a RDo ; and a ∈ one-dimensional BatchNorm [? ] layer with param- D Branch Encoders eters µ, σ R o . A single embedding block with an ∈ output dimensionality Do can be described as EDo = BatchNorm P ReLU L where After passing through the shared, central embedding ◦ ◦ and encoder stacks, Spatt¯er’s signal path splits into two branches, one for each of the target top quarks. Figure9 demonstrates this branch splitting. We name these two x E[x] paths the left and right branch, although this distinction BatchNorm(x) = p− σ + µ V ar[x] ∗ is arbitrary. Each of these branches contains an indepen- P ReLU(x) = max (x, 0) + a min (x, 0) dent embedding and encoder stack. The branch embed- ∗ L(x) = W x + b ding stacks share a near-identical structure to the initial independent embedding stack, except they preserve the

latent dimensionality: Di = Do. The embedding blocks L R We stack several of these embedding block, with each Ei and Ei are applied independently to each jet. The block doubling their latent space dimensionality, starting branch encoder stacks also share an identical structure L from the original 5 jet features. Spatt¯er has a target to the central encoder stack. The left branch, T , and R latent space dimensionality of D = 128, so the Do values right branch, T , can be described as compositions of j follow the sequence: 8 16 32 64 128. The embedding blocks and l transformer encoder block. We → → → → full embedding stack can be described as the following use j = l = 4 in Spatt¯er. composition:

T L = T L T L EL EL (10) l ◦ · · · ◦ 1 ◦ j ◦ · · · ◦ 1 R R R R R E = E128 E64 E32 E16 E8 (6) T = T T E E (11) ◦ ◦ ◦ ◦ l ◦ · · · ◦ 1 ◦ j ◦ · · · ◦ 1 12

D×D Tensor Attention θ1, θ2, θ3 R . Then we compute three intermedi- ∈ ate vectors from X by simply performing regular matrix We provide a detailed description of the tensor atten- multiplication tion output layers in the main text with Equations 2, 3, and 4. Here, we replicate the description in a concise A = θ1X manner. B = θ2X Each tensor attention layers contains a single parame- C = θ3X ter θ RD×D×D. We produce an intermediate symmet- ∈ These can be easily implemented with fully-connected ric tensor S RD×D×D who’s indices obey the symme- ∈ neural network layers and computed in parallel. Finally, try group of the top quark triplet. This is combined with we can estimate our original output by performing a triv- the list of input vectors X RN×D into an output ten- ∈ ial three-form with these tensors. sor O RN×N×N . Finally, this output tensor is passed ∈ through a softmax non-linearity in order to produce valid ijk i j k1nml O = AnBmCl three-way joint distributions L and R. The complete P P equations for this layer are: where 1nml = 1 for all possible indices. The space com- plexity of this decomposition when including the interme- ijk 1 ijk jik S = θ + θ (12) diate tensors reduces to (N 3 +ND+D2) (D2), and 2 O ≈ O ijk i j k nml improvement over the naive approach which has space O = XnXmXl S (13) complexity of (N 3 + ND2 + D3) (D3). Addition- exp Oijk O ≈ O (i, j, k) = (14) ally, when using opt einsum to evaluate the three-form, P P exp Oijk i,j,k the run-time is reduced from (N 3D3) to (N 3D2). O O We can replicate the symmetry constrain we impose in Efficient Tensor Attention Equation 12 by simply requiring the first two decomposed matrices to be equal to each other. Equation 13 can be expression as a cubic tensor form, which naively requires (N 3D3) operations. When ac- θ = θ O 1 2 tually computing this expression, we have to construct several very large intermediate tensors. Since we use This decomposition is not one-to-one, so not every θ optimized GPU linear algebra libraries in order to per- can be represented in this form. However, in our experi- form tensor operations, we have to split the evaluation ments, we found no drop in predictive performance when into several, more fundamental, operations. We use using this decomposition. opt einsum[ ? ] in order to generate an optimal set of op- erations for Equation 13. The summation can be expres- sion using the following intermediate operations, each of Batching which is a generalized matrix-multiplication (GEMM). Since most deep neural network implementations use Amli = SnmlXi n efficient batch matrix multiplication routines in order lij mli j B = A Xm speed up computation, it is beneficial to feed batches ijk lij k O = B Xl of several events into the network simultaneously. How- ever, events can contain a varying number of jets: a With the intermediate tensors Amli and Blij have dimen- single event can have a length of anywhere from 6 to sionalities (D D N) and (D N N) respectively. × × × × 20 momentum vectors. In order to process all of these Since D > N in our situation, this operation requires a events together in batches, we pad all events to the max- large amount of memory to store all of these intermediate imum size of 20 jets by appending 0-vectors, creating components. a batched input tensor X RB×20×5 where B is the ∈ Instead of storing the complete cubic weights ten- batch size. We also construct a secondary masking input sor θ and explicitly finding the summation, we limit M 0, 1 B×20. This vector indicates if the given jet in ∈ { } the possible θ weights that we can learn in order to a given event is a real jet or a padding jet. This vector is greatly reduce the intermediate tensors. Instead of learn- employed during multi-head attention in order to prevent ing θ RD×D×D, we decompose θ into three matrices the attention weights from including the masked jets [28]. ∈ 13

TABLE II: A summary of hyperparameter values for described as: tt¯ , decided via hyperparameter optimization. Spa er M biM bjM bk exp Obijk (b, i, j, k) = P bi bj bk bijk (15) Name Symbol Value P i,j,k M M M exp O Latent Dimensionality D 128 Encoder Count k 6 Branch Embedding Count j 5 Hyperparameters Branch Encoder Count l 3 Batch Size B 4096 We select optimal hyperparameters for Spatt¯er by Learning Rate α 1.5 × 10−3 running a parallel Gaussian process hyperparameter search using the Sherpa hyperparameter optimization library [? ]. We evaluate 500 different sets of hyper- paremeters using a subset of the original training dataset, training on only the first 50% of events. We also sample the last 5% of the training dataset to act as the valida- tion dataset for the hyperparameter search. We evaluate top each model by computing the event purity, 2 , on this validation dataset. The first 100 sets of hyperparameters are randomly sampled from a uniform distribution. Af- The masking vector is also employed during softmax cal- terwards, we use a Gaussian process optimizer in order culation to prevent the masked vectors from skewing the to suggest future hyperparamers which may improve the output distributions. When applied to a batch output purity. A complete listing of the final hyperparameters tensor O RB×20×20×20, the softmax calculation can be for this network can be found in TableII. ∈