Exploring the Space of Jets with CMS Open Data

Exploring the space of jets with CMS open data The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation Komiske, Patrick T., et al. "Exploring the space of jets with CMS open data." Physical Review D, 101, 3 (February 2020): 034009. As Published http://dx.doi.org/10.1103/PhysRevD.101.034009 Publisher American Physical Society (APS) Version Final published version Citable link https://hdl.handle.net/1721.1/125462 Terms of Use Creative Commons Attribution 3.0 unported license Detailed Terms http://creativecommons.org/licenses/by/3.0 PHYSICAL REVIEW D 101, 034009 (2020) Exploring the space of jets with CMS open data † ‡ ∥ Patrick T. Komiske ,1,2,* Radha Mastandrea,1, Eric M. Metodiev ,1,2, Preksha Naik ,1,§ and Jesse Thaler 1,2, 1Center for Theoretical Physics, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA 2Department of Physics, Harvard University, Cambridge, Massachusetts 02138, USA (Received 7 September 2019; accepted 21 January 2020; published 11 February 2020) We explore the metric space of jets using public collider data from the CMS experiment. Starting from pffiffiffi 2.3 fb−1 of proton-proton collisions at s ¼ 7 TeV collected at the Large Hadron Collider in 2011, we isolate a sample of 1,690,984 central jets with transverse momentum above 375 GeV. To validate the performance of the CMS detector in reconstructing the energy flow of jets, we compare the CMS Open Data to corresponding simulated data samples for a variety of jet kinematic and substructure observables. Even without detector unfolding, we find very good agreement for track-based observables after using charged hadron subtraction to mitigate the impact of pileup. We perform a range of novel analyses, using the “energy mover’s distance” (EMD) to measure the pairwise difference between jet energy flows. The EMD allows us to quantify the impact of detector effects, visualize the metric space of jets, extract correlation dimensions, and identify the most and least typical jet configurations. To facilitate future jet studies with CMS Open Data, we make our datasets and analysis code available, amounting to around two gigabytes of distilled data and one hundred gigabytes of simulation files. DOI: 10.1103/PhysRevD.101.034009 I. INTRODUCTION analyses [37,38] were performed using the CMS 2010 Open Data [39], corresponding to 31.8 pb−1 of 7 TeV data Ever since the first evidence for jet structure [1], the from Run 2010B at the Large Hadron Collider (LHC). fragmentation of short-distance quarks and gluons into Among other aspects of jets, these studies explored the long-distance hadrons has been a rich area for experimental groomed momentum fraction z [40], which has sub- and theoretical investigations into quantum chromodynam- g ics (QCD). A variety of observables have been proposed sequently been measured in proton-proton and heavy-ion over the decades to probe the jet formation process [2–8], collisions by CMS [41], ALICE [42], and STAR [43]. The especially with recent advances in the field of jet sub- CMS Open Data release from LHC Run 2011A includes – – detector-simulated Monte Carlo (MC) samples, facilitating structure [9 20]. The stress-energy flow [21 23] is a – particularly powerful probe of jets, since it in principle machine learning studies [44 46], an underlying event contains all the information about a jet that is infrared and study [47], as well as a novel search for dimuon resonances collinear (IRC) safe [24–26]. A variety of observables have [48]. CMS has also released data from Runs 2012B and been built around the energy flow concept [27–31], 2012C, which have been used to search for nonstandard including recent work on machine learning for jet sub- sources of parity violation in jets [49] and extract standard structure [32–34]. model cross sections [50]. Beyond CMS, archival ALEPH data [51] have been used by Ref. [52] to search for new The unprecedented release of public collider data by the – CMS experiment [35] starting in November 2014 [36] has physics and by Refs. [53 55] to perform QCD studies. enabled new exploratory studies of jets. The first such jet While analyses using public collider data cannot match the sophistication or scope of official measurements by the experimental collaborations, they can enable proof-of- *[email protected] † concept collider investigations and help stress-test archival [email protected] ‡ data strategies. [email protected] §[email protected] In this paper, we perform the first exploratory study of ∥ [email protected] the “space” of jets using the CMS 2011 Open Data. This data and MC release corresponds to 2.3 fb−1 of proton- Published by the American Physical Society under the terms of protonffiffiffi collisions collected at a center-of-mass energy of the Creative Commons Attribution 4.0 International license. p 7 Further distribution of this work must maintain attribution to s ¼ TeV. The key idea, as proposed in Ref. [56],isto the author(s) and the published article’s title, journal citation, compute the pairwise distance between jet energy flows, and DOI. Funded by SCOAP3. and then use this information to construct a metric space. 2470-0010=2020=101(3)=034009(36) 034009-1 Published by the American Physical Society PATRICK T. KOMISKE et al. PHYS. REV. D 101, 034009 (2020) This enables a variety of distance-based jet analyses, describe the baseline jet selection criteria used for our including quantitative characterizations and qualitative substructure and EMD studies. visualizations. Because this is an exploratory study, we do not unfold for detector effects nor estimate systematic A. Jet primary dataset uncertainties, but the general agreement between the CMS The CMS Open Data is available on the CERN Open Open Data and simulated MC samples provides evidence Data Portal [36], which currently hosts data collected by for the experimental robustness of these methods. CMS in 2010 [95], 2011 [96], and 2012 [97], as well as “ ’ ” The metric we use is the energy mover s distance specialized samples for machine learning studies [98].It ’ (EMD) [56], inspired by the famous earth mover s distance also contains limited datasets from ALICE [99], ATLAS – [57 61] sharing the same acronym. The EMD has units of [100], and LHCb [101], as well as data from the OPERA “ ” energy (i.e., GeV) and quantifies the amount of work in neutrino experiment [102]. Accompanying the CMS 2011 energy times angle to make one jet radiation pattern look Open Data is a virtual machine which runs version 5.3.32 of like another, including the cost of creating energy for jets the CMS software (CMSSW) framework. This open data with different pT. While we focus on the EMD between initiative complements efforts like HEPDATA [103], RIVET pairs of jets in this study, the same concept could be applied [104], and REANA [105] to preserve the results and work- to pairs of events as a whole. Crucially, the CMS Open Data flows of official collider analyses (see further discussion in contains full information about reconstructed particle flow Ref. [106]). – candidates (PFCs) [62 64], which provide a robust proxy The CMS Open Data is grouped into primary datasets for the energy flow of a jet. It also contains information that contain a subset of the triggers used for event selection about primary vertices, allowing us to mitigate pileup [107]. There are 19 primary datasets included in the (multiple proton-proton collisions per beam crossing) 2011 release, along with corresponding MC samples (see through charged hadron subtraction (CHS) [65]. Because Sec. II D below). All of the primary datasets are provided of the improved resolution and pileup insensitivity of by CMS in their analysis object data (AOD) format, which charged particles (i.e., tracks), we use a track-based variant provides high-level reconstructed objects used for the bulk of EMD for these exploratory studies. of official CMS analyses in Run 1. A subsample of some We base our study on the CMS 2011 Jet primary primary datasets (e.g., Jet [108] and MinimumBias dataset [66] and focus on the HLT_Jet300 single-jet [109]) are also provided in the RAW format, containing the trigger, which we show is fully efficient to reconstruct jets full readout of the CMS detector. with transverse momentum (pT) above 375 GeV. We also Our analysis is based on the Jet primary dataset [66], – use dijet MC samples [67 81], generated with PYTHIA 6 which includes a variety of single jet and dijet triggers. This [82] and simulated using GEANT 4 [83], to understand the primary dataset contains 30,726,331 events spread across performance of the CMS detector in reconstructing the jet 1,223 AOD files, totaling 4.7 TB. The 2011A data-taking energy flow. In order to facilitate future jet studies on the period is subdivided into 318 runs, and the runs are CMS Open Data, we make our MIT Open Data (MOD) subdivided into 109,428 luminosity blocks (LBs) [110]. software framework available [84,85], along with the A luminosity block is the smallest unit of data-taking – distilled data [86] and MC [87 94] files needed to recreate for which there is calibrated luminosity information, the majority of our studies. and during one block, the triggers are guaranteed to have The remainder of this paper is organized as follows. We consistent requirements and prescale factors (see Sec. II C begin in Sec. II by describing the CMS Open Data and the below). Of the events in the Jet primary dataset, MOD software framework used for our analysis. In Sec. III, 26,275,768 are contained in “valid” LBs which are certified we validate the Jet primary dataset by comparing the basic by CMS for use in physics analyses [111].

Exploring the Space of Jets with CMS Open Data

Licensing Open Government Data Jyh-An Lee

Open IP Workgroup Report

Recommendations for Open Data Portals: from Setup to Sustainability

Empowering Social Media: Citizens-Source E-Government and Peer-To-Peer Networks" (2012)

Analytical Report N20

Replace This with the Actual Title Using All Caps

The Power of Open Source Innovation Thingscon 2017

Briefing Paper on Open Data and Privacy

Why We Need Open Knowledge Societies - by Rufus Pollock

Transparency and Open Data Principles: Why They Are Important and How They Increase Public Participation and Tackle Corruption

Open Research Data and Open Peer Review: Perceptions of a Medical and Health Sciences Community in Greece

Open Data Study New Technologies