MIT–CTP 4979

On the Topic of Jets: Disentangling and at Colliders

Eric M. Metodiev∗ and Jesse Thaler† Center for Theoretical Physics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA

We introduce jet topics: a framework to identify underlying classes of jets from collider data. Because of a close mathematical relationship between distributions of observables in jets and emergent themes in sets of documents, we can apply recent techniques in “topic modeling” to extract jet topics from data with minimal or no input from simulation or theory. As a proof of concept with parton shower samples, we apply jet topics to determine separate and jet distributions for constituent multiplicity. We also determine separate quark and gluon rapidity spectra from a mixed Z-plus-jet sample. While jet topics are defined directly from hadron-level multi-differential cross sections, one can also predict jet topics from first-principles theoretical calculations, with potential implications for how to define quark and gluon jets beyond leading-logarithmic accuracy. These investigations suggest that jet topics will be useful for extracting underlying jet distributions and fractions in a wide range of contexts at the .

When quarks and gluons are produced in high-energy as jet mass, the distribution pMa (x) in mixed sample Ma particle collisions, their fragmentation and hadroniza- is a mixture of the K underlying jet distributions pk(x): tion via quantum chromodynamics (QCD) results in K collimated sprays of particles called jets. To extract X (a) p (x) = f p (x), (1) separate information about quark and gluon jets, though, Ma k k one typically needs to know the relative fractions of k=1 quark and gluon jets in the data sample of interest, where f (a) is the fraction of jet type k in sample a, with estimated by convolving matrix element calculations with k PK f (a) = 1 for all a and R dx p (x) = 1 for all k. non-perturbative parton distribution functions (PDFs). k=1 k k Recent progress in jet substructure—the detailed study For the specific case of quark (q) and gluon (g) jet of particle patterns and correlations within jets [1– mixtures, we have: 10]—has offered new avenues to tag and isolate quark (a) (a) pM (x) = f pq(x) + (1 f ) pg(x). (2) and gluon jets [11–26], with recent applications at the a q − q Large Hadron Collider (LHC) [27–35]. Still, there are Of course, there are well-known caveats to this picture considerable theoretical uncertainties in the modeling of jet generation, which go under the name of “sample of quark and gluon jets, as well as more fundamental dependence”. For instance, “quark” jets from the Z+jet concerns about how to define quark and gluon jets from process are not exactly identical to “quark” jets from first principles in QCD [36–39]. In particular, quark the dijet process due to soft color correlations with the and gluon partons carry color charge, while jets are entire event [37], though these correlations are power composed of color-singlet hadrons, so there is presently suppressed in the small-jet-radius limit [41–43]. Also, no unambiguous definition of “quark” and “gluon” jet at more universal quark/gluon definitions can be obtained the hadron level. using jet grooming methods [44–52]. Here, we assume In this letter, we introduce a data-driven technique that sample-dependent effects can either be quantified or to extract underlying distributions for different jet types mitigated, taking Eq. (2) as the starting assumption for from mixed samples, using quark and gluon jets as an our analysis. example. We call our method “jet topics” because Mixed quark/gluon samples were previously stud- arXiv:1802.00008v2 [hep-ph] 7 Jun 2018 of a mathematical connection to topic modeling, an ied in the context of Classification Without Labels unsupervised learning paradigm for discovering emergent (CWoLa) [53] (see also [54–58]). Via Eq. (2), one can themes in a corpus of documents [40]. Jet topics are prove that the optimal binary mixed-sample classifier,

defined directly from measured multi-differential cross pM1 (x)/pM2 (x), is a monotonic rescaling of the optimal sections, requiring no inputs from simulation or theory. quark/gluon classifier, pq(x)/pg(x). This means that a In this way, jet topics offer a practical way to define classifier trained to optimally distinguish M1 (e.g. Z+jet) jet classes, allowing us to label “quark” and “gluon” jet from M2 (e.g. dijets) is optimal for distinguishing quark distributions at the hadron level without reference to the from gluon jets without requiring jet labels or aggregate underlying partons. class proportions. The CWoLa framework, though, does At colliders like the LHC, it is nearly impossible to not directly yield information about the individual quark kinematically isolate pure samples of different jets (i.e. and gluon distributions pq(x) and pg(x). quark jets, gluon jets, boosted W jets, etc.). Instead, col- With jet topics—and with topic modeling more lider data consist of statistical mixtures Ma of K different generally—one can obtain the full distributions pk(x) and (a) types of jets. For any jet substructure observable x, such fractions fk solely from the mixed-sample distributions 2

Topic Model Jet Distributions Jet Topics Mixed Jet Sample N Word Histogram bin ... Vocabulary Jet observable(s) Mixed Jet Sample 1 Anchor word Pure phase-space region (anchor bin) Topic Type of jet (jet topic) Quark Jet Document Histogram of jet observable(s) Corpus Collection of histograms

Jet Fractions Mixed Data Histogram TABLE I. The correspondence between topic models and jet Gluon Jet distributions. Note that topic modeling treats each document as an unstructured bag of words, in the same way that a FIG. 1. The generation of mixed samples of quark and gluon collection of jets has no intrinsic ordering. jets, highlighting the correspondence with topic models. Each jet is either a quark or gluon jet, sampled according to the underlying quark fraction. The observable is then sampled in Eq. (1), subject to requirements which will be spelled according to a universal distribution for that jet type. Each out below. As originally posed, topic modeling aims to mixed-sample observable distribution is then a mixture of the expose emergent themes in a collection of text documents two universal distributions, giving rise to a “jet document”. (a corpus) [40]. A topic is a distribution over words in the vocabulary. Documents are taken to be unstructured bags of words. Each document arises from an unknown and analogously for pT2 (x). The jet topics are unique and mixture of topics: a topic is sampled according to the universal, in that they are independent of the mixtures mixture proportions and then a word is chosen according used to construct them. to that topic’s distribution over the vocabulary. As long The goal is for the topic distributions pT1 (x) and as each topic has words unique to it, known as anchor pT2 (x) to match the underlying quark and gluon jet words [59, 60], topic-modeling algorithms can learn the distributions pq(x) and pg(x). There are three required underlying topics and proportions from the corpus alone. conditions for this to occur. Two of them (shared with Intriguingly, the generative process for producing CWoLa) are sample independence and different purities, counts of words in a document is mathematically identi- i.e. that the jet samples are obtained from Eq. (2) (a) cal to producing jet observable distributions via Eq. (1), with different values of fq . The third condition is as summarized in Table I. For the case of quark/gluon the presence of anchor bins, which can be stated more jet mixtures, we have suggestively depicted the process formally as: of writing “jet documents” in Fig. 1. Anchor words are analogous to having phase-space regions where each of Mutual irreducibility: Each underlying distribution the underlying distributions is pure, and the presence of pk(x) is not a mixture of the remaining underlying these anchor bins is necessary for jet topics to yield the distributions plus another distribution [55]. underlying “quark” and “gluon” distributions. Due to its theoretical transparency and asymptotic Note that this is a much weaker requirement than the guarantees, we use the Demix method [60] to extract distributions being fully separated. In the quark/gluon jet topics, though other algorithms yield comparable context, a necessary and sufficient condition for mutual results. The key idea is to undo the mixing of the irreducibility is that the reducibility factors κ(q g) = κ(g q) = 0 for feature representation x. We later explore| two fundamental distributions in Eq. (1) by maximally | subtracting the two mixtures from one another, such that the implications of this condition for QCD. With these the zeros of the subtracted distributions correspond to three conditions satisfied, the mixture proportions are the anchor bins. Adopting the notation of Ref. [60], let uniquely determined via the reducibility factors. Taking (1) (2) κ(M1 M2) be the largest subtraction amount κ such that fq > fq , inserting Eq. (2) into Eq. (3) yields: | pM (x) κ pM (x) 0, namely: 1 − 2 ≥ (1) (2) 1 fq fq κ(M1 M2) = − , κ(M2 M1) = . (5) pMi (x) (2) (1) κ(Mi Mj) = min . (3) | 1 fq | fq x | pMj (x) − Even without mutual irreducibility, the extracted jet We refer to κ as the reducibility factor (equivalently, the topics will still relate to the underlying quark and gluon minimum of the mixed-sample likelihood ratio). The distributions. Specifically, jet topics yield the “gluon- jet topics T and T are then the normalized maximal 1 2 subtracted quark distribution”: subtractions of M2 from M1, p (x) κ(M M ) p (x) pq(x) κ(q g) pg(x) M1 1 2 M2 pq|g(x) = − | , (6) pT1 (x) = − | , (4) 1 κ(q g) 1 κ(M1 M2) − | − | 3

0.40 0.05 Jet Topics Z + quark Topics Fracs. w. Mult. Pythia 8.226, √s=13 TeV Pythia 8.226, √s=13 TeV 0.35 Z + gluon R = 0.4, pT [250,275] GeV R = 0.4, pT [250,275] GeV ∈ Jet Topic 1 ∈ 0.04 0.30 Z + jet Jet Topic 2 dijets 0.25 y 0.03 σ d Z + quark d 0.20 1 tot

Z + gluon σ 0.02 Jet Topic 1 0.15 Jet Topic 2 Probability Density 0.10 0.01 0.05

0.00 0.00 2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0 0 20 40 60 80 100 − − − − Constituent Multiplicity Rapidity y

FIG. 2. The jet topics method applied to constituent FIG. 3. Cross sections for jet topics (orange and green) using multiplicity, starting with Z+jet (pink) and dijet (purple) topic fractions extracted from the Z+jet sample across 10 distributions from Pythia 8.226. There is good agreement rapidity bins. The extracted topic cross sections closely track between the two extracted jet topics (orange and green) and the underlying Z+quark and Z+gluon cross sections (red and pure Z+quark and Z+gluon distributions (red and blue). blue).

with more than 30 events. We determine the κ values and the “quark-subtracted gluon distribution”, defined of Eq. (3) by selecting the most constraining (anchor) analogously. By universality, the topics calculated from bin: that with the lowest upper uncertainty bar on pure samples via Eq. (6) and from mixtures via Eq. (4) the ratio. Remarkably, the two extracted jet topics are identical. These may be useful in their own right, overlap very well with the underlying quark and gluon particularly if the quark/gluon fractions are uncertain distributions, providing practical evidence that Eq. (4) but κ(q g) and κ(g q) can be determined analytically or | | works as desired, at least for constituent multiplicity. from simulation (see Fig. 4 below). We verified that similar results could be obtained from We now turn to a practical demonstration of the jet samples with different pT cuts and from mixtures of topics method for realistic quark and gluon samples. dijets at different rapidities. This approach is similar Following Ref. [37], we consider two mixed jet processes to the template extraction procedure in Ref. [24], with at the LHC: the quark-enriched Z+jet process and the the important distinction that the quark/gluon fractions gluon-enriched dijets process. See Ref. [61] for alternative need not be specified a priori. selections for quark- or gluon-enriched samples. The In Fig. 3, we use the extracted jet topics to construct parton shower Pythia 8.226 [62, 63] is used to generate separate jet rapidity spectra for quark and gluon jets 500k jets at √s = 13 TeV including hadronization in the Z+jet samples. Binning the Z+jet sample into and multiple parton interactions (i.e. underlying event). 10 rapidity bins in y < 2, we find the mixture of the Detector-stable, non-neutrino particles are clustered into two topics extracted| above| that most closely matches the anti-kt jets [64] with radius R = 0.4 using FastJet constituent multiplicity histogram in each rapidity bin, 3.3.0 [65]. The hardest jet(s) in each event (one jet minimizing the squared error to find the best mixture. for Z+jet and up to two jets for dijets) are selected if This is an example of the general problem of extracting (a) they have transverse momentum pT [250, 275] GeV sample fractions f from various mixed samples. As ∈ k and rapidity y 2. These cuts resulted in the desired, the extracted topic cross sections in Fig. 3 track | | ≤ Z+jet process having (Pythia-labeled) quark fraction the true quark and gluon rapidity cross sections. (1) (2) fq = 0.88 and the dijet process having fq = 0.37. Thus, just from a collection of mixed-sample his- We use the constituent multiplicity within a jet as the tograms, one can make progress toward extracting both feature representation x, since it is known to be a good the underlying distributions pk(x) and the fraction of quark/gluon discriminant [18]. (a) each jet topic fk . Crucially, Figs. 2 and 3 are just In Fig. 2, we present the result of extracting two novel projections of the hadron-level multi-differential 3 jet topics from these samples. Shown are the con- jet cross section d σ/dpT dy dnconst on two independent stituent multiplicity distributions from the original samples, making jet topics implementable on existing Z+jet and dijet samples, from Pythia-labeled Z+quark LHC jet measurements (e.g. [33]). The agreement and Z+gluon samples, and from the jet topics T1 and T2 between the operationally-defined jet topics and the using Eq. (4). Uncertainties are estimated by assuming theoretically-ambiguous quark and gluon distributions √N bin count uncertainties and only considering bins may even suggest using mutual irreducibility of the final- ± 4 state distributions to define “quark” and “gluon” jets. 0.08 Jet Topics From the perspective of first-principles QCD, the Pythia 8.226, √s=13 TeV R = 0.4, p [250,275] GeV implications of mutual irreducibility are simple yet pro- 0.07 T ∈ found. For the reducibility factors κ(q g) and κ(g q) Z + jet 0.06 to be zero, there must be phase-space| regions almost| dijets entirely dominated by quark or gluon jets. In the 0.05 Z + quark Z + gluon leading-logarithmic (LL) limit, mutual irreducibility can 0.04 Jet Topic 1 be achieved with any jet substructure observable that 0.03 Jet Topic 2 counts the number of parton emissions, such as “soft drop Probability Density “quark”, κ-corrected multiplicity” [26]. At LL order, quark and gluon jets have 0.02 the same emission profile, differing only by a color factor 0.01 in their emission density, CF = 4/3 for quarks and CA = 0.00 0 10 20 30 40 50 60 70 80 3 for gluons. Ignoring the ΛQCD regulator, counting these (infinitely many) emissions results in arbitrarily well- Jet Mass (GeV) separated quark and gluon Poissonian distributions [26], and therefore mutual irreducibility. Beyond LL order, FIG. 4. The jet topics method applied to jet mass up to though, naive quark/gluon definitions may not lead 35 GeV (up to the black, dashed line). The gray curve is the corrected quark topic using to determine κ(q|g), to mutual irreducibility, since running-coupling, higher- Pythia extrapolated beyond 35 GeV by letting jet topic 1 go negative. order, and non-perturbative effects generically contam- There is good agreement between the κ-corrected quark topic inate the anchor bins. That said, as long as these (gray) and the pure Z+quark distribution (red). effects maintain sample independence (perhaps achieved via grooming), then one can still use Eq. (6) to define subtracted “quark” and “gluon” labels. topics to groomed jet mass measurements [69, 70], where Interestingly, many jet substructure observables do not grooming is an essential ingredient that allows κ(q g) to lead to quark/gluon mutual irreducibility, even at LL be calculated to high precision [49–52]. | accuracy. Consider for instance the jet mass m (or any jet There are many potential uses for the jet topics angularity [66–68]). Jet mass exhibits Casimir scaling at framework at the LHC. Focusing just on quark and gluon LL order, meaning that the cumulative density functions jets, one often wants to separately measure quark and CA/CF Σi(m) are related to each other by Σg = Σq [19, 20]. gluon distributions from mixed data samples, without The probability distributions are then given by pi = relying on theory or simulation for fraction estimates. dΣi/dm. Substituting this into Eq. (3) immediately To determine PDFs, it would be beneficial to isolate yields, for all observables with Casimir scaling: different partonic subprocesses, and this could be feasible as long as jet topics is applied both to data and to fixed- CA −1 CA CF order QCD calculations. Similar subprocess isolation κ(g q) = min Σq = 0, (7) | CF might be useful in mono-jet searches for dark matter by 1− CA aiding in signal/background discrimination or in setting CF CF CF κ(q g) = min Σq = , (8) improved limits on specific new physics models [71, 72]. | CA CA For extracting the strong coupling constant αs from since CA/CF = 9/4 > 1 and Σ takes all values between 0 (groomed) jet shape distributions, it would be beneficial and 1. Because of Eq. (8), jet mass alone is not sufficient to determine the quark and gluon jet fractions using data- to extract the quark distribution at LL order without driven methods, since there are uncertainties associated additional information. with whether αs comes multiplied by CF or CA [73]. On the other hand, if the reducibility factors are The extracted topic fractions could be also be used known, then the subtracted distributions in Eq. (6) to augment training with CWoLa, since the classifier can be inverted. This is shown in Fig. 4 for the jet operating points could then be determined entirely from mass, where the “quark” topic has been corrected using data. In heavy ion collisions, quarks and gluons are the value κ(q g) = 0.40 at 35 GeV determined from expected to be modified differently in medium due to the Pythia Z| + q/g distributions, which is known to their different color charges, and jet topics may allow for differ from the LL expectation [37]. This analysis is fully data-driven studies of separate quark and gluon jet performed up to 35 GeV to avoid sample-dependent modifications. effects in the high-mass tails of the distributions. The In conclusion, phrasing jet mixtures as a topic model- qualitative behavior of the topics agrees with the LL ing problem makes available a variety of new and more predictions of Eqs. (7) and (8): no correction is needed sophisticated statistical and mathematical tools for jet to obtain the “gluon” topic, and the “quark” topic is physics (see e.g. [59, 60, 74–88]), including recent efforts a non-trivial mixture of the jet topics. Given the good to determine the appropriate number of topics to use agreement seen here, it would be interesting to apply jet from data [89–91]. We emphasize that jet topics can be 5 applied to any set of multi-differential cross sections— 1661 (2011), arXiv:1012.5412 [hep-ph]. in experiment or in theory—as long as the criteria of [7] A. Altheimer et al., “Jet Substructure at the Tevatron sample independence, different purities, and mutual ir- and LHC: New results, new tools, new benchmarks,” reducibility are met. Furthermore, mutual irreducibility BOOST 2011 Princeton , NJ, USA, 22–26 May 2011, J. Phys. G39, 063001 (2012), arXiv:1201.0008 [hep-ph]. need not be assumed if the subtracted distributions in [8] A. Altheimer et al., “Boosted objects and jet sub- Eq. (6) are sufficient for the intended application, or if the structure at the LHC. Report of BOOST2012, held at reducibility factors are known from theory or simulation. IFIC Valencia, 23rd-27th of July 2012,” BOOST 2012 Of course, experimental studies are needed to understand Valencia, Spain, July 23-27, 2012, Eur. Phys. J. C74, the systematic and statistical uncertainties associated 2792 (2014), arXiv:1311.2708 [hep-ex]. with jet topics for LHC measurements and searches, and [9] D. Adams et al., “Towards an Understanding of the theoretical studies are needed to determine the interplay Correlations in Jet Substructure,” Eur. Phys. J. C75, 409 (2015), arXiv:1504.00679 [hep-ph]. of jet topics with precision calculations. It would also [10] Andrew J. Larkoski, Ian Moult, and Benjamin Nachman, be interesting to design jet substructure observables “Jet Substructure at the Large Hadron Collider: A specifically targeted for mutual irreducibility. More Review of Recent Advances in Theory and Machine generally, topic models may find applications in collider Learning,” (2017), arXiv:1709.04464 [hep-ph]. physics beyond jets and in other disciplines beyond [11] Hans Peter Nilles and K. H. Streng, “Quark - Gluon collider physics, since extracting signal and background Separation in Three Jet Events,” Phys. Rev. D23, 1944 distributions from mixtures is a ubiquitous challenge (1981). [12] Lorella M. Jones, “Tests for Determining the Parton faced when analyzing and interpreting rich data sets. Ancestor of a Hadron Jet,” Phys. Rev. D39, 2550 (1989). The authors would like to thank Mario Campanelli, [13] Z. Fodor, “How to See the Differences Between Quark Timothy Cohen, Philip Harris, Patrick Komiske, Andrew and Gluon Jets,” Phys. Rev. D41, 1726 (1990). Larkoski, Ian Moult, Benjamin Nachman, Gavin Salam, [14] Lorella Jones, “Towards a Systematic Jet Classification,” Clayton Scott, and Wouter Waalewijn for illuminating Phys. Rev. D42, 811–814 (1990). discussions. The authors are grateful to Patrick Komiske [15] Leif L¨onnblad, Carsten Peterson, and Thorsteinn Rogn- valdsson, “Using neural networks to identify jets,” Nucl. for generating the jet samples. This work was supported Phys. B349, 675–702 (1991). by the Office of Nuclear Physics of the U.S. Department [16] Jon Pumplin, “How to tell quark jets from gluon jets,” of Energy (DOE) under grant DE-SC0011090 and the Phys. Rev. D44, 2025–2032 (1991). DOE Office of High Energy Physics under grant DE- [17] Jason Gallicchio and Matthew D. Schwartz, “Quark and SC0012567. Computations in this paper were run on Gluon Tagging at the LHC,” Phys. Rev. Lett. 107, the Odyssey cluster supported by the FAS Division of 172001 (2011), arXiv:1106.3076 [hep-ph]. Science, Research Computing Group at Harvard Univer- [18] Jason Gallicchio and Matthew D. Schwartz, “Quark and Gluon Jet Substructure,” JHEP 04, 090 (2013), sity. Cloud computing resources were provided through arXiv:1211.7038 [hep-ph]. a Microsoft Azure for Research award. [19] Andrew J. Larkoski, Gavin P. Salam, and Jesse Thaler, “Energy Correlation Functions for Jet Substructure,” JHEP 06, 108 (2013), arXiv:1305.0007 [hep-ph]. [20] Andrew J. Larkoski, Jesse Thaler, and Wouter J. Waalewijn, “Gaining (Mutual) Information about ∗ [email protected] Quark/Gluon Discrimination,” JHEP 11, 129 (2014), † [email protected] arXiv:1408.3122 [hep-ph]. [1] M. H. Seymour, “Tagging a heavy ,” in ECFA [21] Biplob Bhattacherjee, Satyanarayan Mukhopadhyay, Mi- Large Hadron Collider Workshop, Aachen, Germany, 4-9 hoko M. Nojiri, Yasuhito Sakaki, and Bryan R. Web- Oct 1990: Proceedings.2. (1991) pp. 557–569. ber, “Associated jet and subjet rates in light-quark [2] Michael H. Seymour, “Searches for new particles using and gluon jet discrimination,” JHEP 04, 131 (2015), cone and cluster jet algorithms: A Comparative study,” arXiv:1501.04794 [hep-ph]. Z. Phys. C62, 127–138 (1994). [22] Danilo Ferreira de Lima, Petar Petrov, Davison Soper, [3] J. M. Butterworth, B. E. Cox, and Jeffrey R. Forshaw, and Michael Spannowsky, “Quark-Gluon tagging with “WW scattering at the CERN LHC,” Phys. Rev. D65, Shower Deconstruction: Unearthing dark matter and 096014 (2002), arXiv:hep-ph/0201098 [hep-ph]. Higgs couplings,” Phys. Rev. D95, 034001 (2017), [4] J. M. Butterworth, John R. Ellis, and A. R. Raklev, arXiv:1607.06031 [hep-ph]. “Reconstructing sparticle mass spectra using hadronic [23] Patrick T. Komiske, Eric M. Metodiev, and Matthew D. decays,” JHEP 05, 033 (2007), arXiv:hep-ph/0702150 Schwartz, “Deep learning in color: towards automated [HEP-PH]. quark/gluon jet discrimination,” JHEP 01, 110 (2017), [5] Jonathan M. Butterworth, Adam R. Davison, Mathieu arXiv:1612.01551 [hep-ph]. Rubin, and Gavin P. Salam, “Jet substructure as a new [24] The ATLAS collaboration (ATLAS), “Discrimination√ of Higgs search channel at the LHC,” Phys. Rev. Lett. 100, Light Quark and Gluon Jets in pp collisions at s = 8 242001 (2008), arXiv:0802.2470 [hep-ph]. TeV with the ATLAS Detector,” (2016). [6] A. Abdesselam et al., “Boosted objects: A Probe of [25] Joe Davighi and Philip Harris, “Fractal based observables beyond the Standard Model physics,” Boost 2010 Oxford, to probe jet substructure of quarks and gluons,” Eur. United Kingdom, June 22-25, 2010, Eur. Phys. J. C71, Phys. J. C78, 334 (2018), arXiv:1703.00914 [hep-ph]. 6

[26] Christopher Frye, Andrew J. Larkoski, Jesse Thaler, for Jet Processes,” JHEP 11, 019 (2016), [Erratum: and Kevin Zhou, “Casimir Meets Poisson: Improved JHEP05,154(2017)], arXiv:1605.02737 [hep-ph]. Quark/Gluon Discrimination with Counting Observ- [43] Daniel W. Kolodrubetz, Piotr Pietrulewicz, Iain W. ables,” JHEP 09, 083 (2017), arXiv:1704.06266 [hep-ph]. Stewart, Frank J. Tackmann, and Wouter J. Waalewijn, [27] CMS Collaboration (CMS), Performance of quark/gluon “Factorization for Jet Radius Logarithms in Jet discrimination in 8 TeV pp data, Tech. Rep. CMS-PAS- Mass Spectra at the LHC,” JHEP 12, 054 (2016), JME-13-002 (2013). arXiv:1605.08038 [hep-ph]. [28] Georges Aad et al. (ATLAS), “Light-quark√ and gluon [44] Stephen D. Ellis, Christopher K. Vermilion, and jet discrimination in pp collisions at s = 7 TeV with Jonathan R. Walsh, “Techniques for improved heavy the ATLAS detector,” Eur. Phys. J. C74, 3023 (2014), particle searches with jet substructure,” Phys. Rev. D80, arXiv:1405.6583 [hep-ex]. 051501 (2009), arXiv:0903.5081 [hep-ph]. [29] Georges Aad et al. (ATLAS), “Jet energy measurement [45] Stephen D. Ellis, Christopher K. Vermilion, and and√ its systematic uncertainty in -proton collisions Jonathan R. Walsh, “Recombination Algorithms and at s = 7 TeV with the ATLAS detector,” Eur. Phys. J. Jet Substructure: Pruning as a Tool for Heavy C75, 17 (2015), arXiv:1406.0076 [hep-ex]. Particle Searches,” Phys. Rev. D81, 094023 (2010), [30] Vardan Khachatryan et al. (CMS), “Measurement of arXiv:0912.0033 [hep-ph]. electroweak production of two jets in association√ with [46] David Krohn, Jesse Thaler, and Lian-Tao Wang, “Jet a Z boson in proton-proton collisions at s = 8 TeV,” Trimming,” JHEP 02, 084 (2010), arXiv:0912.1342 [hep- Eur. Phys. J. C75, 66 (2015), arXiv:1410.3153 [hep-ex]. ph]. [31] Georges Aad et al. (ATLAS), “Search for high-mass [47] Mrinal Dasgupta, Alessandro Fregoso, Simone Marzani, diboson resonances with√ boson-tagged jets in proton- and Gavin P. Salam, “Towards an understanding of jet proton collisions at s = 8 TeV with the ATLAS substructure,” JHEP 09, 029 (2013), arXiv:1307.0007 detector,” JHEP 12, 055 (2015), arXiv:1506.00962 [hep- [hep-ph]. ex]. [48] Andrew J. Larkoski, Simone Marzani, Gregory Soyez, [32] Vardan Khachatryan et al. (CMS), “Search for the and Jesse Thaler, “Soft Drop,” JHEP 05, 146 (2014), standard model Higgs boson produced through vector arXiv:1402.2657 [hep-ph]. boson fusion and decaying to bb,” Phys. Rev. D92, [49] Christopher Frye, Andrew J. Larkoski, Matthew D. 032008 (2015), arXiv:1506.01010 [hep-ex]. Schwartz, and Kai Yan, “Precision physics with pile-up [33] Georges Aad et al. (ATLAS), “Measurement√ of the insensitive observables,” (2016), arXiv:1603.06375 [hep- charged-particle multiplicity inside jets from s = 8 TeV ph]. pp collisions with the ATLAS detector,” Eur. Phys. J. [50] Christopher Frye, Andrew J. Larkoski, Matthew D. C76, 322 (2016), arXiv:1602.00988 [hep-ex]. Schwartz, and Kai Yan, “Factorization for groomed [34] CMS Collaboration (CMS), Performance of quark/gluon jet substructure beyond the next-to-leading logarithm,” discrimination in 13 TeV data, Tech. Rep. CMS-DP- JHEP 07, 064 (2016), arXiv:1603.09338 [hep-ph]. 2016-070 (2016). [51] Simone Marzani, Lais Schunk, and Gregory Soyez, “A [35] M. Aaboud et al. (ATLAS), “Jet energy scale mea- study of jet mass distributions with grooming,” JHEP surements and their√ systematic uncertainties in proton- 07, 132 (2017), arXiv:1704.02210 [hep-ph]. proton collisions at s = 13 TeV with the ATLAS detec- [52] Simone Marzani, Lais Schunk, and Gregory Soyez, “The tor,” Phys. Rev. D96, 072002 (2017), arXiv:1703.09665 jet mass distribution after Soft Drop,” Eur. Phys. J. C78, [hep-ex]. 96 (2018), arXiv:1712.05105 [hep-ph]. [36] Andrea Banfi, Gavin P. Salam, and Giulia Zanderighi, [53] Eric M. Metodiev, Benjamin Nachman, and Jesse “Infrared safe definition of jet flavor,” Eur. Phys. J. C47, Thaler, “Classification without labels: Learning from 113–124 (2006), arXiv:hep-ph/0601139 [hep-ph]. mixed samples in high energy physics,” JHEP 10, 174 [37] Philippe Gras, Stefan H¨oche, Deepak Kar, Andrew (2017), arXiv:1708.02949 [hep-ph]. Larkoski, Leif L¨onnblad, Simon Pl¨atzer, Andrzej [54] Kyle Cranmer, Juan Pavez, and Gilles Louppe, “Ap- Si´odmok, Peter Skands, Gregory Soyez, and Jesse proximating Likelihood Ratios with Calibrated Discrim- Thaler, “Systematics of quark/gluon tagging,” JHEP 07, inative Classifiers,” (2015), arXiv:1506.02169 [stat.AP]. 091 (2017), arXiv:1704.03878 [hep-ph]. [55] Gilles Blanchard, Marek Flaska, Gregory Handy, Sara [38] Daniel Reichelt, Peter Richardson, and Andrzej Pozzi, Clayton Scott, et al., “Classification with asym- Siodmok, “Improving the Simulation of Quark and Gluon metric label noise: Consistency and maximal denoising,” Jets with Herwig 7,” Eur. Phys. J. C77, 876 (2017), Electronic Journal of Statistics 10, 2780–2824 (2016). arXiv:1708.01491 [hep-ph]. [56] Lucio Mwinmaarong Dery, Benjamin Nachman, [39] Jonathan Mo, Frank J. Tackmann, and Wouter J. Francesco Rubbo, and Ariel Schwartzman, “Weakly Waalewijn, “A case study of quark-gluon discrimination Supervised Classification in High Energy Physics,” at NNLL’ in comparison to parton showers,” Eur. Phys. JHEP 05, 145 (2017), arXiv:1702.00414 [hep-ph]. J. C77, 770 (2017), arXiv:1708.00867 [hep-ph]. [57] Timothy Cohen, Marat Freytsis, and Bryan Ostdiek, [40] David M Blei, “Probabilistic topic models,” Communi- “(Machine) Learning to Do More with Less,” JHEP 02, cations of the ACM 55, 77–84 (2012). 034 (2018), arXiv:1706.09451 [hep-ph]. [41] Andrea Banfi, Mrinal Dasgupta, Kamel Khelifa-Kerfa, [58] Patrick T. Komiske, Eric M. Metodiev, Benjamin Nach- and Simone Marzani, “Non-global logarithms and jet man, and Matthew D. Schwartz, “Learning to Classify algorithms in high-pT jet shapes,” JHEP 08, 064 (2010), from Impure Samples,” (2018), arXiv:1801.10158 [hep- arXiv:1004.3483 [hep-ph]. ph]. [42] Thomas Becher, Matthias Neubert, Lorena Rothen, [59] Sanjeev Arora, Rong Ge, and Ankur Moitra, “Learn- and Ding Yu Shao, “Factorization and Resummation ing topic models–going beyond svd,” in Foundations 7

of Computer Science (FOCS), 2012 IEEE 53rd Annual negative matrix factorization give a correct decompo- Symposium on (IEEE, 2012) pp. 1–10. sition into parts?” in Advances in neural information [60] Julian Katz-Samuels, Gilles Blanchard, and Clayton processing systems (2004) pp. 1141–1148. Scott, “Decontamination of Mutual Contamination Mod- [77] Ganesh R Naik and Dinesh K Kumar, “An overview of els,” (2017), arXiv:1710.01167 [stat.ML]. independent component analysis and its applications,” [61] Jason Gallicchio and Matthew D. Schwartz, “Pure Sam- Informatica 35 (2011). ples of Quark and Gluon Jets at the LHC,” JHEP 10, [78] Gilles Blanchard and Clayton Scott, “Decontamination 103 (2011), arXiv:1104.1175 [hep-ph]. of Mutually Contaminated Models,” in Proceedings of the [62] Torbjorn Sjostrand, Stephen Mrenna, and Peter Z. Seventeenth International Conference on Artificial Intel- Skands, “PYTHIA 6.4 Physics and Manual,” JHEP 05, ligence and Statistics, Proceedings of Machine Learning 026 (2006), arXiv:hep-ph/0603175 [hep-ph]. Research, Vol. 33, edited by Samuel Kaski and Jukka [63] Torbj¨orn Sj¨ostrand, Stefan Ask, Jesper R. Chris- Corander (PMLR, Reykjavik, Iceland, 2014) pp. 1–9. tiansen, Richard Corke, Nishita Desai, Philip Ilten, [79] Shantanu Jain, Martha White, Michael W Trosset, Stephen Mrenna, Stefan Prestel, Christine O. Ras- and Predrag Radivojac, “Nonparametric semi-supervised mussen, and Peter Z. Skands, “An Introduction to learning of class proportions,” (2016), arXiv:1601.01944 PYTHIA 8.2,” Comput. Phys. Commun. 191, 159–177 [stat.ML]. (2015), arXiv:1410.3012 [hep-ph]. [80] Julian Katz-Samuels and Clayton Scott, “A mutual [64] Matteo Cacciari, Gavin P. Salam, and Gregory Soyez, contamination analysis of mixed membership and partial “The Anti-k(t) jet clustering algorithm,” JHEP 04, 063 label models,” (2016), arXiv:1602.06235 [stat.ML]. (2008), arXiv:0802.1189 [hep-ph]. [81] Pierre Comon and Christian Jutten, Handbook of Blind [65] Matteo Cacciari, Gavin P. Salam, and Gregory Soyez, Source Separation: Independent component analysis and “FastJet User Manual,” Eur. Phys. J. C72, 1896 (2012), applications (Academic press, 2010). arXiv:1111.6097 [hep-ph]. [82] Vaninirappuputhenpurayil Gopalan Reju, Soo Ngee Koh, [66] Carola F. Berger, Tibor Kucs, and George F. Sterman, and Yann Soon, “An algorithm for mixing matrix esti- “Event shape / energy flow correlations,” Phys. Rev. mation in instantaneous blind source separation,” Signal D68, 014012 (2003), arXiv:hep-ph/0303051 [hep-ph]. Processing 89, 1762–1773 (2009). [67] Leandro G. Almeida, Seung J. Lee, Gilad Perez, [83] Tyler Sanderson and Clayton Scott, “Class Proportion George F. Sterman, Ilmo Sung, and Joseph Virzi, Estimation with Application to Multiclass Anomaly Re- “Substructure of high-pT Jets at the LHC,” Phys. Rev. jection,” in Proceedings of the Seventeenth International D79, 074017 (2009), arXiv:0807.0234 [hep-ph]. Conference on Artificial Intelligence and Statistics, Pro- [68] Stephen D. Ellis, Christopher K. Vermilion, Jonathan R. ceedings of Machine Learning Research, Vol. 33, edited Walsh, Andrew Hornig, and Christopher Lee, “Jet by Samuel Kaski and Jukka Corander (PMLR, Reyk- Shapes and Jet Algorithms in SCET,” JHEP 11, 101 javik, Iceland, 2014) pp. 850–858. (2010), arXiv:1001.0014 [hep-ph]. [84] Clayton Scott, “A Rate of Convergence for Mixture [69] Morad Aaboud et al. (ATLAS), “A measurement√ of the Proportion Estimation, with Application to Learning soft-drop jet mass in pp collisions at s = 13 TeV with from Noisy Labels,” in Proceedings of the Eighteenth the ATLAS detector,” (2017), arXiv:1711.08341 [hep- International Conference on Artificial Intelligence and ex]. Statistics, Proceedings of Machine Learning Research, [70] Measurement of the differential jet production cross sec- Vol. 38, edited by Guy Lebanon and S. V. N. Vish- tion with respect to jet mass and transverse√ momentum wanathan (PMLR, San Diego, California, USA, 2015) in dijet events from pp collisions at s = 13 TeV , Tech. pp. 838–846. Rep. CMS-PAS-SMP-16-010 (CERN, Geneva, 2017). [85] Harish Ramaswamy, Clayton Scott, and Ambuj Tewari, [71] Morad Aaboud et al. (ATLAS), “Search for dark matter “Mixture proportion estimation via kernel embeddings of and other new phenomena in events with an energetic jet distributions,” in Proceedings of The 33rd International and large missing transverse momentum using the AT- Conference on Machine Learning, Proceedings of Ma- LAS detector,” JHEP 01, 126 (2018), arXiv:1711.03301 chine Learning Research, Vol. 48, edited by Maria Florina [hep-ex]. Balcan and Kilian Q. Weinberger (PMLR, New York, [72] Albert M Sirunyan et al. (CMS), “Search for new physics New York, USA, 2016) pp. 2052–2060. in final states with an energetic jet or a hadronically [86] Weicong Ding, Prakash Ishwar, and Venkatesh decaying W√ or Z boson and transverse momentum Saligrama, “Necessary and sufficient conditions and a imbalance at s = 13 TeV,” (2017), arXiv:1712.02345 provably efficient algorithm for separable topic discov- [hep-ex]. ery,” (2015), arXiv:1508.05565 [cs.LG]. [73] J. R. Andersen et al., “Les Houches 2017: Physics at TeV [87] Weicong Ding, Learning mixed membership models with a Colliders Standard Model Working Group Report,” in separable latent structure: theory, provably efficient algo- 10th Les Houches Workshop on Physics at TeV Colliders rithms, and applications, Ph.D. thesis, Boston University (PhysTeV 2017) Les Houches, France, June 5-23, 2017 (2015). (2018) arXiv:1803.07977 [hep-ph]. [88] Sanjeev Arora, Rong Ge, Yonatan Halpern, David [74] David M Blei, Andrew Y Ng, and Michael I Jordan, Mimno, Ankur Moitra, David Sontag, Yichen Wu, and “Latent dirichlet allocation,” Journal of machine Learn- Michael Zhu, “A practical algorithm for topic modeling ing research 3, 993–1022 (2003). with provable guarantees,” in Proceedings of the 30th [75] Daniel D Lee and H Sebastian Seung, “Learning the parts International Conference on Machine Learning, Proceed- of objects by non-negative matrix factorization,” Nature ings of Machine Learning Research, Vol. 28, edited by 401, 788–791 (1999). Sanjoy Dasgupta and David McAllester (PMLR, Atlanta, [76] David Donoho and Victoria Stodden, “When does non- Georgia, USA, 2013) pp. 280–288. 8

[89] Derek Greene, Derek O’Callaghan, and P´adraigCun- Fuzzy Systems and Knowledge Discovery (FSKD), 2014 ningham, “How many topics? stability analysis for 11th International Conference on (IEEE, 2014) pp. 756– topic models,” in Joint European Conference on Ma- 760. chine Learning and Knowledge Discovery in Databases [91] Weizhong Zhao, James J Chen, Roger Perkins, Zhichao (Springer, 2014) pp. 498–513. Liu, Weigong Ge, Yijun Ding, and Wen Zou, “A heuristic [90] Biao Wang, Yang Liu, Zelong Liu, Maozhen Li, and approach to determine an appropriate number of topics Man Qi, “Topic selection in latent dirichlet allocation,” in in topic modeling,” BMC bioinformatics 16, S8 (2015).