
Beyond the Identification of Events in Simulated - Collisions

Joshua LaBounty* In collaboration with Dr. Abhay Deshpande and Dr. Nils Feege May 2017

Abstract

In the Standard Model (SM) of particle physics, each conservation law is associated with a symmetry, as stated by Emmy Noether’s eponymous theorem. Experiments to date have not shown any violations of the observed conservation of flavor (the number of each generation of lepton) for the charged, massive leptons (electron, muon, and tau). However, within the SM there exists no known symmetry which would lead to this observed conservation. In fact, Lepton Flavor Violation (LFV) events have been shown to occur in solar neutrinos (neutrino oscillations, which are possible due to their low masses). Discounting Lepton Flavor Conservation, the SM cross sections of LFV events for the charged leptons are predicted to lie outside the reach of any current or planned experiment. However, many models of Beyond the Standard Model (BSM) physics predict the existence of leptoquarks which would mediate flavor violations and increase the cross sections for LFV to within the realm of planned experiments. Thus, the study of leptoquarks provides an avenue with which to probe BSM physics. Studies done at HERA have put constraints on the mass and coupling strength of such leptoquarks. Here we study the e⁻ → τ⁻ conversion process using the LQGENEP event generator. We seek to distinguish the characteristic 3-π decay of the τ⁻ from SM Deep Inelastic Scattering (DIS) decay products using Geant4 simulations of the current Electron Ion Collider (EIC) detector design (based around sPHENIX) in the center of mass energy range of 32 − 181 GeV. To do this, we employ standard anti-kt jet finding algorithms to locate jets in concert with machine learning modules to classify them. In doing so, we analyze the effectiveness of the proposed EIC detector for probing beyond the limits set by previous experiments.

*[email protected]

Contents

1 Introduction
  1.1 The Standard Model
    1.1.1 Deep Inelastic Scattering (DIS)
  1.2 Leptoquarks: Looking Beyond the Standard Model
  1.3 The Case for the EIC
  1.4 The Case for Leptoquark Studies at EIC

2 Generator Level Studies
  2.1 LQGENEP
  2.2 Leptoquark vs. DIS: A Characteristic Signature

3 Detector Level Studies
  3.1 Geant4 Simulations
  3.2 Jet Reconstruction
  3.3 Jet Identification
    3.3.1 Initial Cuts
    3.3.2 Application of Machine Learning
    3.3.3 Energy Dependence

4 Results and Discussions

5 Future Studies

A Monte Carlo Event Generation

B Codebase

C Additional Figures

List of Figures

1 The Standard Model of particle physics [20].
2 Two DIS events, one in which the proton breaks up and one in which it remains whole.
3 Feynman diagrams depicting the e → τ scattering process, which is mediated by a leptoquark. Time flows from left to right. α, β = 1, 2, 3 representing the three generations of quarks.
4 Profile of the EIC-sPHENIX detector in η (Equation 1) and φ.
5 η vs. φ for τ⁻ from a 20x250 GeV collision.
6 η vs. transverse momentum for τ⁻ from a 20x250 GeV collision.
7 Transverse momentum vs. total momentum for τ⁻ from a 20x250 GeV collision.
8 η vs. φ for τ⁻ for various collision energies (1,000,000 events).
9 ∆η vs. ∆φ for τ⁻ jets (20x250 GeV collision).
10 ∆η vs. ∆φ for DIS jets (20x250 GeV collision).
11 Annotated diagram of the EIC detector and its major subsystems [18].
12 Jets in the barrel calorimeters for a single event. Each point on this plot represents a tower which was identified as part of a jet. The calorimeters are arranged by an assigned ID number, where 1 is the CEMC, 2 is the Inner HCAL, and 3 is the Outer HCAL.
13 ∆η vs. ∆φ for τ⁻ jets (20x250 GeV collision).
14 ∆η vs. ∆φ for DIS jets (20x250 GeV collision).
15 Comparison of DIS and τ jets from a single event (20x250 GeV).
16 Plot showing a linear cut between global η and m_jet. In this figure, Class A refers to DIS jets and Class B refers to τ jets.
17 Accuracy of different machine learning algorithms with the smaller, jet-only data set.
18 Accuracy of different machine learning algorithms with the larger, jet+global data set.
19 Monte Carlo calculation of the value of π [21]. As we increase the number of points, the accuracy of our calculation increases.
20 Scatter plots of all of the parameters in the smaller .csv file, for a 20x250 GeV collision.
21 Plot showing the correlation of the different variables used in this analysis. The x and y axes are the same values. A darker box indicates a more correlated set of values.

List of Tables

1 Confusion Matrix for a linear division of 20x250 τ vs. DIS jets. This confusion matrix was created using the division in Figure 16.
2 Confusion Matrix for an AdaBoost Classifier trained on ∼ 800 20x250 τ vs. DIS jets with jet only variables. Total accuracy: 86.7%.
3 Confusion Matrix for an AdaBoost Classifier trained on ∼ 800 20x250 τ vs. DIS jets with jet and global variables. Total accuracy: 90.5%.
4 Confusion Matrix for a Logistic Regression Classifier trained on ∼ 800 20x250 τ vs. DIS jets with jet and global variables and τ:DIS weighting of 1:1. Total accuracy: 84.4%.
5 Confusion Matrix for a Logistic Regression Classifier trained on ∼ 800 20x250 τ vs. DIS jets with jet and global variables and τ:DIS weighting of 10:1. Total accuracy: 75.6%.
6 Confusion matrices for the AdaBoost algorithm for a variety of different electron and proton beam energies.

1 Introduction

1.1 The Standard Model

Figure 1: The Standard Model of particle physics [20].

The Standard Model (SM) of particle physics is one of the greatest achievements of 20th century physics. It describes all of the known particles of matter and anti-matter and the means through which they interact. The particles of the standard model fall into two categories: fermions and bosons. Fermions have a spin of 1/2 and include both quarks and leptons. Quarks carry color charge, which means that they can interact via the strong nuclear force. They also participate in electromagnetic and weak interactions. Leptons, on the other hand, interact only through the weak and (if electrically charged) electromagnetic forces. There are three generations (columns) of both quarks and leptons, but the reason for this parallel is as yet unknown. Bosons have a spin of 0 (Higgs) or 1 (gauge bosons). The spin-1 particles are mediators of the fundamental forces. Together, these particles make up the fundamental building blocks of the visible universe as we know it. A complete description of the Standard Model, and its underlying physics, can be found in many introductory physics textbooks such as [16].

1.1.1 Deep Inelastic Scattering (DIS)

(a) Proton breaks up [1] (b) Proton remains whole [17]

Figure 2: Two DIS events, one in which the proton breaks up and one in which it remains whole.

Deep Inelastic Scattering (DIS) is a standard model process by which a lepton interacts with one of the quarks inside a nucleon. For the purposes of this paper, we will limit ourselves to the case of an electron interacting with the constituent quarks of a proton. In collider experiments, DIS allows the electron to be used as a probe to explore the internal structure of the proton (see Section 1.3). At low energies (< 1 GeV), interactions between electrons and protons can be considered roughly elastic, and the electron interacts with the proton as if it were a point particle. However, as the energy of the interaction is increased, and the de Broglie wavelength of the electron becomes small compared to the size of the proton, we begin to observe the substructure within the proton. At these high energies, the electron will exchange a virtual photon with one of the quarks inside the proton and cause it to recoil. The energy of the electron is altered (hence inelastic) and it is scattered. The scattering of the electron depends on which of the quarks participated in the reaction, and therefore measurements of the scattering angle can provide insight into the substructure of the proton. Experiments of this nature provided the first direct evidence for the existence of the 3 quark structure of the proton and neutron [23]. Today, the energies of these experiments are sufficient to probe structure which exists beyond the three valence quarks, such as the sea quarks and gluons. The energies of the interaction are also sufficient to cause the proton to break into multiple pieces of hadronic debris (Figure 2a). Some DIS experiments (exclusive and semi-inclusive DIS) endeavor to measure these particles as well, while others (inclusive DIS) measure only the final state lepton. A more mathematical description of these processes can be found in [1, Pg. 18-20, 32].

In this study, we expect the majority of the background events to be DIS events as described in Figure 2a. In these events, the quark is ejected from the proton and forms a hadronic jet (see Section 2.2).

1.2 Leptoquarks: Looking Beyond the Standard Model

Within the standard model, there are a number of observed conservation laws. Noether’s Theorem tells us that each one of these conservation laws corresponds to a symmetry. For instance, conservation of electric charge is associated with gauge invariance and conservation of energy is associated with time-translation invariance [25]. However, there is no known symmetry which corresponds to the observed conservation of lepton number¹ (also called lepton flavor).

Until recently, it was not known whether this symmetry was merely undiscovered or simply nonexistent. However, the discovery of neutrino oscillations provided definite proof that conservation of lepton flavor is only an approximate symmetry [4]. These observations showed that the electron neutrinos (νe) produced in nuclear reactions in the sun are (approximately) equally distributed between the three possible flavors (νe, νµ, and ντ) by the time they reach the earth. Therefore, they must be able to change freely between the three flavors. However, the neutrinos are able to exhibit this behavior only because their masses are very small ($\sum_i m_{\nu_i} < 0.5$ eV $= 8.91 \times 10^{-37}$ kg $\ll m_{\mathrm{electron}} = 511$ keV). For the more massive leptons, the cross sections for these events (and thus their probabilities) are

¹ Conservation of lepton number is the principle that the number of each generation of leptons be unchanged in a reaction.

still incredibly small ($\mathrm{BR}(\mu \to e\gamma) < 10^{-54}$) [10, 19]. However, many models of Beyond the Standard Model (BSM) physics predict rates of Lepton Flavor Violation (LFV) events to be much higher than the standard model would give. Observations of LFV at higher rates than expected would be good evidence for these BSM models, and conversely observations of it at expected rates would put hard limits on such models. Thus, LFV events have become an exciting avenue to probe BSM physics.

Here, we study a model of LFV mediated by a particle known as a leptoquark (LQ), which carries both lepton number and color charge. In the general case, there could exist a leptoquark for each of the allowed permutations of spin (0 or 1), fermion number ($F = 3B + L = 0, \pm 2$), $SU(2)_L$ singlets/doublets/triplets, and couplings to left- or right-handed fermions. In the Buchmüller-Rückl-Wyler (BRW) model, utilized in this study, there are 14 such leptoquarks [10, 11]. There are 7 leptoquarks with |F| = 0 and 7 with |F| = 2, each set with one leptoquark for all of the possible couplings between quark generations (1st ↔ 1st, 2nd ↔ 3rd, etc.). More detail, including the Lagrangians for these interactions, can be found in [10]. Figure 3 shows parton level Feynman diagrams of how these leptoquarks mediate an e → τ LFV event.

[Figure 3 diagrams: s-channel and u-channel leptoquark (LQ) exchange for |F| = 0, and s-channel and u-channel leptoquark exchange for |F| = 2.]

Figure 3: Feynman diagrams depicting the e → τ scattering process, which is mediated by a leptoquark. Time flows from left to right. α, β = 1, 2, 3 representing the three generations of quarks.

1.3 The Case for the EIC

Figure 4: Profile of the EIC-sPHENIX detector in η (Equation 1) and φ.

The EIC will probe the frontiers of Quantum Chromodynamics (QCD). Experiments done at HERA, the only previously realized electron-proton collider, have shown that the proton is not composed solely of the traditional 3 valence quarks [1]. Instead, there exists a rich substructure of sea quarks and gluons. However, the HERA experiments (and others since) were not able to explain some of the anomalous properties of the proton. For instance, the spins of the 3 valence quarks can only account for ≈ 30% of the total spin of the proton. By utilizing longitudinally polarized beams of electrons and protons, the EIC will be able to measure and quantify how the spins of gluons and sea quarks contribute to the total spin [1].

Particle collisions are messy. This is especially so at hadron colliders such as RHIC or the LHC, where two protons (or nuclei) are collided against one another. These ‘particles’ are actually complex, many-body systems and collisions between them result in a spray of hadronic debris in both directions along the beam axis. At high energies, this amounts to hundreds of particles to track and identify. In an Electron Ion Collider, however, we can cut this noise in half by replacing one of the beams with an electron beam. The electron, being an elementary particle, cannot break apart. Thus, we can use the electron as a resilient probe to investigate the inner structure of these composite particles.

The EIC science case, laid out in [10], includes a number of so-called “golden measurements”. These are measurements which the EIC is uniquely qualified to undertake, including that of:

• The distribution of polarized gluons within nucleons

• Generalized parton distributions of sea quarks and gluons

• High precision measurements of the weak mixing angle (θW)

• e → τ conversion mediated by a BRW leptoquark

among many others. In this study, we seek to expand upon the last point and examine how effective the current EIC design would be at distinguishing the e → τ conversion process from background DIS processes.

One potential plan for an EIC detector, based on the upcoming sPHENIX experiment, is described in [1] and will be referred to here as ‘EIC-sPHENIX’. This detector would be built around the sPHENIX detector at RHIC (Brookhaven National Laboratory) and would utilize the sPHENIX barrel calorimeters and solenoidal magnet. In addition to these systems, endcap calorimeters are implemented in both the electron-going and hadron-going directions. The electron-going (left side in Figure 4) endcap calorimeter is of particular importance for measurements of DIS because that is the region where the final state lepton (in the vast majority of cases) will be after the interaction. The hadron-going endcap (right side in Figure 4), on the other hand, will absorb and measure the majority of the hadronic debris. These three regions, working in concert, will ensure that the EIC-sPHENIX detector is able to detect particles produced at its interaction point with a range of η from −4 to 4, where

$$
\eta = -\ln\left[\tan\left(\frac{\theta}{2}\right)\right]. \tag{1}
$$

More detail about our specific implementation of the detector in Geant4 can be found in Section 3.1.
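As a quick numerical check of Equation 1, the sketch below converts between the polar angle θ (taken here, as an assumption, to be measured from the beam axis in radians) and η. The function names are illustrative and are not part of the EIC simulation software.

```python
import math

def pseudorapidity(theta):
    """Equation 1: eta = -ln(tan(theta / 2)), for 0 < theta < pi (radians)."""
    return -math.log(math.tan(theta / 2.0))

def polar_angle(eta):
    """Inverse of Equation 1: theta = 2 * atan(exp(-eta))."""
    return 2.0 * math.atan(math.exp(-eta))

# Polar angles at the edges of the barrel (|eta| = 1.1) and the full detector (|eta| = 4)
for eta in (-4.0, -1.1, 0.0, 1.1, 4.0):
    theta_deg = math.degrees(polar_angle(eta))
    print(f"eta = {eta:+.1f} -> theta = {theta_deg:.2f} deg")
```

Under this convention, η = ±4 corresponds to a direction roughly 2 degrees away from the beam axis, which illustrates how close to the beamline the endcap calorimeters must reach.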

1.4 The Case for Leptoquark Studies at EIC

In order to study e → τ LFV events, an initial state electron is essential. Thus, the EIC is uniquely suited to undertake these studies as compared to the LHC or RHIC. Studies done at HERA have also put hard limits on both the mass and coupling strength of BRW leptoquarks [15]. In particular, the ratio $\lambda_{1\alpha}\lambda_{3\beta}/M_{LQ}^2$ (where $\lambda_{ij}$ is the coupling strength of the leptoquark between the $i$th generation lepton and the $j$th generation quark and $M_{LQ}$ is the mass of the leptoquark) was studied. In the high LQ mass approximation, this ratio along with the center of mass energy ($\sqrt{s}$) of the collision determine the cross section of the interaction:

$$
\begin{aligned}
\sigma_{F=0} &= \sum_{\alpha,\beta} \frac{s}{32\pi} \left[ \frac{\lambda_{1\alpha}\lambda_{3\beta}}{M_{LQ}^2} \right]^2 \left\{ \int x\,\bar{q}_\alpha(x, xs)\, f(y)\, dx\, dy + \int x\, q_\beta(x, -u)\, g(y)\, dx\, dy \right\} \\
\sigma_{F=\pm 2} &= \sum_{\alpha,\beta} \frac{s}{32\pi} \left[ \frac{\lambda_{1\alpha}\lambda_{3\beta}}{M_{LQ}^2} \right]^2 \left\{ \int x\, q_\alpha(x, xs)\, f(y)\, dx\, dy + \int x\,\bar{q}_\beta(x, -u)\, g(y)\, dx\, dy \right\}
\end{aligned}
\tag{2}
$$

It is estimated in [15] that the EIC will be able to achieve integrated luminosities of O(1000) fb⁻¹ “within a reasonable timescale”. Thus, it would be sensitive to LQ events with a cross section of O(0.001) fb over that same period of time. Initial designs for the EIC called for $\sqrt{s} \approx 90$ GeV, which would be able to improve upon the HERA limits for mass and coupling strengths of leptoquarks by up to a factor of 200 [10]. For the leptoquark which mediates the e → τ conversion through interactions with a first generation quark in the initial and final state (governed by $\lambda_{11}\lambda_{31}/M_{LQ}^2$) we would expect to improve the limit by only a factor of 20. However, more recent designs call for a $\sqrt{s}$ of up to 181 GeV (a 30x275 collision²). This

² The first number refers here to the electron momentum and the second number refers to the proton momentum. For instance: 30x275 GeV → a 30 GeV e⁻ colliding with a 275 GeV p⁺.

will push our improvement of the HERA limit to a factor of 80. Even if LFV events are not observed at these energies, we will still be able to exclude a number of BSM theories.

2 Generator Level Studies

2.1 LQGENEP

Figure 5: η vs. φ for τ⁻ from a 20x250 GeV collision.

Figure 6: η vs. transverse momentum for τ⁻ from a 20x250 GeV collision.

Figure 7: Transverse momentum vs. total momentum for τ⁻ from a 20x250 GeV collision.

For these studies, we use a Monte Carlo³ event generator called LQGENEP. This generator was developed for the HERA experiment to study leptoquark events in e − p scattering experiments [8]. The simulation was built on top of Pythia6, a widely used event generator, with the BRW leptoquark processes added in. This study uses a modified version of LQGENEP which produces an output file compatible with our analysis chain. For this study, we focus entirely on the leptoquark which couples to first generation quarks ($\lambda_{11} = \lambda_{31} = 0.3$, $M_{LQ} = 1936.5$ GeV [10]) and the e → τ conversion process. This study uses > 1,000,000 LQ events generated with LQGENEP over the full range of energies for which the EIC will operate. The plots in Figures 5 through 7 show the distribution of τ⁻ from these events at a typical EIC energy (20x250 GeV, $\sqrt{s} = 141$ GeV).
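The beam-energy labels used throughout (for example 20x250 GeV) map onto the center of mass energy through a simple relation. The sketch below assumes head-on beams and neglects the electron and proton masses, so that s ≈ 4 E_e E_p; the function name is illustrative.

```python
import math

def sqrt_s(e_electron_gev, e_proton_gev):
    """Approximate center of mass energy (GeV) for head-on, massless beams."""
    return math.sqrt(4.0 * e_electron_gev * e_proton_gev)

# Beam configurations mentioned in the text (electron x proton, in GeV)
for e_e, e_p in [(20, 250), (30, 275)]:
    print(f"{e_e}x{e_p} GeV -> sqrt(s) = {sqrt_s(e_e, e_p):.0f} GeV")
```

Within this approximation the 20x250 GeV and 30x275 GeV configurations give about 141 GeV and about 182 GeV respectively, in line with the 141 GeV and 181 GeV values quoted in the text.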

³ A general description of Monte Carlo simulations can be found in Appendix A.

These plots show that the distribution of τ⁻ produced by a LQ event is uniform in φ, as we would expect. However, there is a clear anisotropic distribution of the particles in η, with the majority falling within a range of 0 < η < 1.

As discussed in Section 1.3, the EIC-sPHENIX detector has calorimeter subsystems covering −4 < η < 4. Within that range, multiple tracking and detector subsystems are implemented. Looking at Figure 5 above, we can see that the vast majority (> 99.9%) of τ particles fall within that range of η. In Figure 8 we see that, for the entire range of e:p momentum ratios within the design constraints of the EIC, the τ particles produced from LQ events fall into this active range.

(a) 30 GeV x 50 GeV (b) 20 GeV x 250 GeV (c) 5 GeV x 275 GeV

Figure 8: η vs. φ for τ⁻ for various collision energies (1,000,000 events).

Out of all of these events, we find that < 1% fall outside the range of η covered by EIC-sPHENIX. Thus, we conclude that the distribution of τ particles is compatible with our detector geometry.

2.2 Leptoquark vs. DIS: A Characteristic Signature

Jets of particles are produced when a quark is ejected from a color neutral system. These quarks, unable to exist in a color charged state, spontaneously create quark-antiquark pairs as they travel. All of these particles are boosted⁴ in the same direction as the original quark, and form a characteristic signature which can be recognized by a particle physics experiment. Pseudo-jets can also be formed when a boosted particle undergoes a decay into one or more hadrons (e.g. τ → 3-π). Because the end products are the same, the signatures of these jets and pseudo-jets can be extremely similar. This means that we can use similar methods to identify them.

Here, we seek to determine whether DIS and τ jets can be distinguished at the generator level. In order to do this, we first identify the particle at the head of the jet in our LQGENEP output file. Then, we iterate through the remaining particles to find its daughter particles (and their daughters, and so on). Using this information, we construct plots showing the difference between the individual particles and the progenitor particle in η and φ (i.e. $\eta_\tau - \eta_i$ and $\phi_\tau - \phi_i$).

⁴ A particle is said to be ‘boosted’ in a direction if it is accelerated in that direction at a significant speed. This causes the daughter particles from this particle to appear more collimated in the direction of the boosting due to relativistic effects.

Figure 9: ∆η vs. ∆φ for τ⁻ jets (20x250 GeV collision).

Figure 10: ∆η vs. ∆φ for DIS jets (20x250 GeV collision).

The difference between these two signals can clearly be seen. The τ jet contains a more tightly packed central region whereas the DIS jet is much more spread out with a lower energy ‘core’. Looking at Figures 9 and 10, we see that the standard deviation of the τ jets in φ is 0.171 radians whereas the standard deviation of the DIS jets is 0.413 radians.

This difference in the width is the characteristic signal that we will be searching for as we transition into detector level studies.
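The width measurement described in this section can be sketched as follows. This is a minimal illustration rather than the analysis code itself: it assumes an unweighted standard deviation (the text does not say whether particles are weighted by energy), wraps ∆φ into (−π, π], and uses invented particle coordinates.

```python
import math
import statistics

def delta_phi(phi_a, phi_b):
    """Signed difference phi_a - phi_b wrapped into (-pi, pi]."""
    d = (phi_a - phi_b) % (2.0 * math.pi)
    return d - 2.0 * math.pi if d > math.pi else d

def jet_widths(progenitor, daughters):
    """Return (std of delta-eta, std of delta-phi) of daughters about the progenitor.

    progenitor is an (eta, phi) pair; daughters is a list of (eta, phi) pairs.
    """
    d_eta = [progenitor[0] - eta for eta, _ in daughters]
    d_phi = [delta_phi(progenitor[1], phi) for _, phi in daughters]
    return statistics.pstdev(d_eta), statistics.pstdev(d_phi)

# Hypothetical three-prong tau decay close to the phi = +/- pi boundary
tau = (0.50, 3.10)
pions = [(0.45, 3.05), (0.55, -3.13), (0.50, 3.12)]
width_eta, width_phi = jet_widths(tau, pions)
print(f"width in eta: {width_eta:.3f}, width in phi: {width_phi:.3f} rad")
```

Wrapping ∆φ matters because a daughter at φ = −3.13 is only about 0.05 radians away from a progenitor at φ = 3.10, not 6.23 radians.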

3 Detector Level Studies

Figure 11: Annotated diagram of the EIC detector and its major subsystems [18].

The results from the generator level studies are promising, but do not represent the conditions of a real experiment. In a real collider experiment, we cannot simply identify and count individual particles and their daughters. Instead we must observe how these particles deposit their energy in our detector subsystems. To simulate the conditions of a real experiment, we utilize a software package called Geant4.

3.1 Geant4 Simulations

Geant4 is a Monte Carlo physics simulation software package which simulates the passage of particles through matter [3]. Utilizing this software package, the sPHENIX collaboration and others have implemented a robust simulation of the detector geometry of the proposed EIC-sPHENIX detector. An annotated diagram of the important subsystems currently implemented in the EIC-sPHENIX simulation can be seen in Figure 11. This simulation includes both the barrel and endcap calorimeter subsystems, which means that the range of η in which particles can be detected covers the full range proposed (−4 to 4). The barrel calorimeter subsystems cover a range of η from −1.1 to +1.1 and the endcap calorimeters bridge the gap to higher |η|.

The barrel region contains three separate calorimeter subsystems: one electromagnetic calorimeter (EMCAL) and two hadronic calorimeters (HCALs). The innermost calorimeter, the EMCAL, consists of a so-called “optical accordion” of tungsten and scintillating fibers [5, 24]. Alternating layers of thin scintillating fibers and tungsten are glued together in a wave pattern. This pattern provides a uniform response across a wide range of angles and positions. This design was chosen because of its potential for high segmentation (0.024 x 0.024 in η and φ [5]), ability to operate in a region of high magnetic field, and potential for very good linearity and energy resolution ($\leq 12\%/\sqrt{E}$). The inner and outer HCALs are constructed of plates of steel surrounding scintillating fibers. These steel-scintillator sandwiches are oriented radially inward with a small (5°) tilt to guarantee that no particle on a radial trajectory can pass through without encountering steel. The inner and outer HCALs are tilted in opposite directions, again to ensure that particles will deposit at least some of their energy in steel. More detail, including diagrams of these subsystems, can be found in [24].

The hadron-going endcap calorimeter system is composed of an electromagnetic calorimeter (fEMCAL) and a hadronic calorimeter (fHCAL). The fEMCAL is interior to the fHCAL, and consists of an array of 3295 Lead Scintillator (PbSc) modules consistent with the PHENIX design [5, 18]. These modules have an energy resolution of $8\%/\sqrt{E}$. They are unable to fit closely to the beamline (reaching only η = 3.0), however, so they are paired with an array of 300 Lead Tungsten (PbW) crystals closer to the beamline, which bring the

η covered by the fEMCAL to 4.0. The fHCAL is a PbSc sandwich calorimeter which sits directly behind the fEMCAL. It consists of 2044 10x10 cm² towers and covers an η range of 1.2 to 4.0.

The electron-going endcap calorimeter consists of a single electromagnetic calorimeter (eEMCAL). This calorimeter consists of 2959 Lead Tungstate (PbWO₄) crystals. These crystals are held in place by a carbon fiber shell which fills the empty space between each one. With this calorimeter, we expect to achieve an energy resolution of $1.5\%/\sqrt{E}$.
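To give a feel for the resolutions quoted in this section, the sketch below evaluates the stochastic term σ_E/E = a/√E for the three electromagnetic calorimeters. It assumes E is expressed in GeV and ignores any constant or noise terms, which the text does not quote. The dictionary keys are labels chosen here for illustration, and the barrel value is the upper bound given above.

```python
import math

# Stochastic terms quoted in the text, as fractions (12% is an upper bound)
STOCHASTIC_TERM = {
    "barrel EMCAL": 0.12,
    "fEMCAL (PbSc)": 0.08,
    "eEMCAL (PbWO4)": 0.015,
}

def relative_resolution(energy_gev, stochastic_term):
    """sigma_E / E = a / sqrt(E), with E in GeV; other terms are neglected."""
    return stochastic_term / math.sqrt(energy_gev)

for name, a in STOCHASTIC_TERM.items():
    for energy in (1.0, 4.0, 25.0):
        res = relative_resolution(energy, a)
        print(f"{name:15s} E = {energy:5.1f} GeV -> sigma/E = {100 * res:.2f}%")
```

Because the relative resolution falls as 1/√E, a 4 GeV deposit is measured twice as precisely (in relative terms) as a 1 GeV deposit in the same calorimeter.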

Once these detectors are implemented in Geant4, we seek to reproduce the conditions of a real experiment. We do this through a process known as digitization. In Geant4 we record the light which is produced by the scintillating material in our detectors, which is a measurable quantity in our experiment. Then, to the output from each of the calorimeters we apply a certain level of randomly generated background noise, which simulates the random firing of photomultipliers and other readouts present in a detector. We then apply an energy cut intended to remove the majority of this noise, but which also unintentionally removes some real energy information. We use known conversion factors (which depend both on our materials and on the geometry of our detectors) to convert the light recorded back into energy deposited in the detector. It is this ‘smeared’, calibrated output which is fed into our analysis modules.

In our simulations, full digitization has been implemented in the barrel EMCAL, but only partially implemented in the barrel HCALs (calibration of recorded light to energy is present, but no random smearing). We can see this in Figure 12, where only the innermost barrel calorimeter has a significant amount of noise present. Similarly, full digitization is present in the eEMCAL, but only partial digitization is present in the fHCAL and fEMCAL.

In a normal DIS event, we would expect to observe only one jet of significant energy in the detector. However, in a LQ event we will have two (one true quark jet and one τ pseudo-jet). Thus, the existence of two jets is a logical, simple cut for the identification of a LQ event. This is easily accomplished with existing analysis modules in the EIC simulation software. Here, we attempt the next step in the analysis of these events: to distinguish between the individual jets and determine which jet was caused by a τ.
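The digitization chain described earlier in this section (recorded light, added noise, threshold cut, calibration back to energy) can be illustrated with a toy model. Everything in this sketch is an assumption made for illustration: the Gaussian noise model, the parameter names, and the numerical values are hypothetical and do not come from the EIC-sPHENIX software.

```python
import random

def digitize(light_yields, photons_per_gev, noise_sigma, threshold_light, seed=0):
    """Toy digitization of the towers of one calorimeter.

    light_yields    : recorded scintillation light per tower (photon counts)
    photons_per_gev : hypothetical calibration constant (light per GeV deposited)
    noise_sigma     : width of the Gaussian readout noise, in photon counts
    threshold_light : towers below this noisy light level are zeroed
    """
    rng = random.Random(seed)
    energies = []
    for light in light_yields:
        noisy = light + rng.gauss(0.0, noise_sigma)  # random readout noise
        if noisy < threshold_light:                  # cut intended to remove noise
            energies.append(0.0)
        else:
            energies.append(noisy / photons_per_gev)  # calibrate light to energy
    return energies

# Two towers with real deposits and two empty towers (values invented)
towers = [500.0, 120.0, 0.0, 0.0]
print(digitize(towers, photons_per_gev=50.0, noise_sigma=10.0, threshold_light=30.0))
```

Setting `noise_sigma=0.0` mimics the partial digitization described for the HCALs, where the calibration is applied but no random smearing is added.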

3.2 Jet Reconstruction

The anti-kt algorithm used here has been studied extensively, and has been shown to be both infrared and collinear safe [12]. The algorithm operates by grouping together objects (in our case, towers or clusters) using the following equations:

$$
\begin{aligned}
d_{ij} &= \min\left( \frac{1}{k_{t,i}^2}, \frac{1}{k_{t,j}^2} \right) \frac{\Delta_{ij}^2}{R^2} \\
d_{iB} &= k_{t,i}^{2p} \quad (p = -1) \\
\Delta_{ij}^2 &= (y_i - y_j)^2 + (\phi_i - \phi_j)^2
\end{aligned}
\tag{3}
$$

where $k_t$ is the transverse momentum (referred to elsewhere as $p_T$), $y$ is the rapidity of the object, and φ is the azimuthal angle of the object. These equations provide a weighted ‘distance’ in which two objects with a large energy will be closer together than two objects with one high and one low energy, and much closer than two low energy objects. Looping over all of the objects in an event, the algorithm finds the smallest of all the $d_{ij}$ and $d_{iB}$. If the smallest value is a $d_{ij}$ (that is, $d_{ij} < d_{iB}$), objects i and j are grouped together into one object. Otherwise, object i is declared a jet and removed from the list. This process repeats until each object is associated with a jet. In isolation, all the objects within R of a jet center will be associated with a single conical jet. If there are two hard (high energy) objects within the range $R < \Delta_{ij} < 2R$ then two overlapping semi-conical jets will be created.
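The clustering procedure can be sketched in a few lines. This is a deliberately simple illustration, not the analysis module used in this study (production analyses typically rely on a dedicated jet-finding library). Two simplifications are assumed: objects are merged with a kt-weighted average of y and φ instead of a full four-vector sum, and every object has kt > 0.

```python
import math

def delta_phi(phi_a, phi_b):
    """Signed difference phi_a - phi_b wrapped into (-pi, pi]."""
    d = (phi_a - phi_b) % (2.0 * math.pi)
    return d - 2.0 * math.pi if d > math.pi else d

def anti_kt(objects, R=0.5):
    """Cluster (kt, y, phi) tuples into jets; returns a list of (kt, y, phi)."""
    objs = [tuple(map(float, o)) for o in objects]
    jets = []
    while objs:
        # Start from the beam distance of the first object, then search for a smaller one.
        best_d, best_i, best_j = objs[0][0] ** -2, 0, None
        for i, (kt_i, y_i, phi_i) in enumerate(objs):
            d_iB = kt_i ** -2
            if d_iB < best_d:
                best_d, best_i, best_j = d_iB, i, None
            for j in range(i + 1, len(objs)):
                kt_j, y_j, phi_j = objs[j]
                delta2 = (y_i - y_j) ** 2 + delta_phi(phi_i, phi_j) ** 2
                d_ij = min(kt_i ** -2, kt_j ** -2) * delta2 / R ** 2
                if d_ij < best_d:
                    best_d, best_i, best_j = d_ij, i, j
        if best_j is None:
            # A beam distance is smallest: object i becomes a final jet.
            jets.append(objs.pop(best_i))
        else:
            # A pair distance is smallest: merge i and j (pop j first since j > i).
            kt_j, y_j, phi_j = objs.pop(best_j)
            kt_i, y_i, phi_i = objs.pop(best_i)
            kt = kt_i + kt_j
            y = (kt_i * y_i + kt_j * y_j) / kt
            phi = phi_i + (kt_j / kt) * delta_phi(phi_j, phi_i)
            objs.append((kt, y, phi))
    return jets

# Two nearby hard towers and one distant soft tower (values invented)
towers = [(10.0, 0.0, 0.0), (5.0, 0.1, 0.1), (1.0, 2.0, 3.0)]
for kt, y, phi in anti_kt(towers, R=0.5):
    print(f"jet: kt = {kt:.1f}, y = {y:.3f}, phi = {phi:.3f}")
```

In this example the two nearby hard towers merge into a single jet while the distant soft tower is left as its own low energy jet, which is the kind of ‘noise jet’ that a minimum energy threshold would later remove.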

Here, we apply the anti-kt algorithm to our simulated LQ events in the Geant4 simulation. For this purpose, we have created a custom analysis module which implements the anti-kt (R = 0.5) reconstruction algorithm for each LQ event using only calorimeter information such as tower position and energy deposited. Various parameters of these jets (η, φ, energy, etc.) are saved in an output file for use in our analysis. A plot of a single event with multiple jets can be seen in Figure 12.

Figure 12: Jets in the barrel calorimeters for a single event. Each point on this plot represents a tower which was identified as part of a jet. The calorimeters are arranged by an assigned ID number; where 1 is the CEMC, 2 is the Inner HCAL, and 3 is the Outer HCal.

From this figure, we can see that a large number of low energy ‘noise jets’ have been identified by our algorithms. These can easily be cut by simply applying a minimum total energy threshold. Then we are left with 2 jets which we need to classify as either DIS or τ. In order to determine which of these jets comes from a τ and which comes from a quark, we need to form an understanding of the general properties of each jet type. In order to do this, we go back one step and remove all the particles from the input files which do not correspond to a τ jet. This is saved as a ‘sterilized’ input file which we put through the Geant4 simulation by itself. This removes the need to distinguish the jets from one another, as the highest energy jet in the output will always be the τ jet. Background jets (from noise towers) are still present, but carry an energy much less than the primary jet and therefore would be discarded in any analysis. The same process is repeated for DIS quark jets. Figures 13 and 14 show the shape of these jets in η and φ.

Figure 13: ∆η vs. ∆φ for τ⁻ jets (20x250 GeV collision).

Figure 14: ∆η vs. ∆φ for DIS jets (20x250 GeV collision).

We can once again see that there is a difference between the jets from these two processes, on average. As at the generator level, the standard deviation in φ for the τ jets (0.098 radians) is less than that of the DIS jets (0.151 radians). However, the difference becomes much less clear when comparing single jets (Figures 15a and 15b). Looking at the energy distribution of an individual jet is clearly not sufficient at this level to make a clear cut between τ and DIS jets. Thus, we now seek to create an algorithmic way to distinguish between τ and DIS jets on an event-by-event basis.

(a) ∆η vs. ∆φ for a single τ⁻ jet. (b) ∆η vs. ∆φ for a single DIS jet.

(c) ∆φ for a single τ − jet. (d) ∆φ for a single DIS jet.

Figure 15: Comparison of DIS and τ jets from a single event (20x250 GeV).

3.3 Jet Identification

3.3.1 Initial Cuts

Using the jets produced by running sterilized input through the Geant4 simulation, we produce two output .csv files which contain a number of the parameters of the jets. One file contains only parameters which are intrinsic to the jet (such as total energy, size in η/φ, number of towers above a certain threshold energy, etc.) and the other contains these in addition to a number of global parameters of the jet (position in the EIC in η/φ). Details regarding the exact parameters and how they are calculated can be found by following the links in Appendix B.

We began by examining these parameters to see if we could make a simple 2-D linear cut in this parameter space which would divide the two sets of jets. After looking at the data (Figure 20), we found that the best parameters for this purpose were global η and the invariant mass of the jet, $m_{jet}$, defined as follows:

$$
E^2 = p^2 + m_{jet}^2 \;\rightarrow\; m_{jet} = \sqrt{E^2 - p^2} = \sqrt{E^2 - \left( p_x^2 + p_y^2 + p_z^2 \right)}. \tag{4}
$$
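Equation 4 can be evaluated directly once the total energy and momentum of a jet are known. The sketch below assumes, for illustration only, that the jet four-momentum is the sum of its constituent (E, px, py, pz) four-vectors in natural units (GeV); the exact definitions used in the analysis are given in the code linked from Appendix B.

```python
import math

def jet_mass(constituents):
    """Invariant mass (Equation 4) of a jet built from (E, px, py, pz) tuples."""
    e = sum(c[0] for c in constituents)
    px = sum(c[1] for c in constituents)
    py = sum(c[2] for c in constituents)
    pz = sum(c[3] for c in constituents)
    m2 = e * e - (px * px + py * py + pz * pz)
    # Guard against small negative values caused by floating point rounding.
    return math.sqrt(max(m2, 0.0))

# A single massless constituent has zero mass; two back-to-back ones do not.
print(jet_mass([(1.0, 1.0, 0.0, 0.0)]))
print(jet_mass([(1.0, 1.0, 0.0, 0.0), (1.0, -1.0, 0.0, 0.0)]))
```

The more collimated the constituents of a jet are, the smaller its invariant mass at fixed energy, which is why a jet mass variable is sensitive to the narrower shape of the τ jets.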

Similar plots, albeit with worse results, can be constructed from any combination of parameters. After plotting these quantities on a plane, we use linear regression to find the line which best divides the two data sets. The resulting plot can be seen on the left side of Figure 16. In that plot, Class A (blue) represents DIS jets and Class B (red) represents τ jets. The blue background represents all of the parameter space assigned to DIS jets, while the brown background represents all of the parameter space assigned to τ jets. It can clearly be seen that this algorithm is not 100% effective, and the plot on the right side of Figure 16 illustrates this. This plot shows the score of each point relative to the decision boundary. Any point with a positive score is said to be below the boundary, and thus a τ jet, whereas any point with a negative score is above the boundary, and thus a DIS jet. A significant overlapping region between the histograms can be seen. When tested on a new validation data set, only 83.4% of jets are correctly identified.

Figure 16: Plot showing a linear cut between global η and mjet. In this figure, Class A refers to DIS jets and Class B refers to τ jets.
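The idea of a linear cut with a signed score can be sketched as follows. This is an illustration under stated assumptions and not the thesis analysis: the data are synthetic Gaussian clusters standing in for (global η, mjet), and scikit-learn's LogisticRegression is swapped in as a convenient linear classifier, since the exact estimator used for the cut is not specified here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins for (global eta, m_jet); these are NOT the thesis data.
dis_jets = rng.normal(loc=[1.0, 4.0], scale=[0.8, 1.5], size=(500, 2))
tau_jets = rng.normal(loc=[-0.5, 1.5], scale=[0.8, 1.0], size=(500, 2))

X = np.vstack([dis_jets, tau_jets])
y = np.concatenate([np.zeros(500, dtype=int), np.ones(500, dtype=int)])  # 0 = DIS, 1 = tau

# Any linear classifier yields a boundary of the form w . x + b = 0.
clf = LogisticRegression().fit(X, y)

# Signed score of each jet relative to the boundary: positive -> tau, negative -> DIS.
scores = clf.decision_function(X)
predicted = (scores > 0).astype(int)

w_eta, w_mass = clf.coef_[0]
b = clf.intercept_[0]
print(f"boundary: {w_eta:.2f} * eta + {w_mass:.2f} * m_jet + {b:.2f} = 0")
print(f"fraction correctly classified: {(predicted == y).mean():.3f}")
```

Histogramming `scores` separately for the two classes gives the kind of overlap plot described for the right side of Figure 16.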

Another illustration of how well our algorithms perform is given by Table 1. The diagonals of this matrix represent the DIS and τ jets from the validation data set which were correctly identified, while the off diagonals represent those jets which were misidentified.

              Predicted τ   Predicted DIS
Actual τ          162             40
Actual DIS         26            170

Table 1: Confusion Matrix for a linear division of 20x250 τ vs. DIS jets. This confusion matrix was created using the division in Figure 16.

These results, with a false positive rate of roughly 13% (26 of the 196 DIS jets are labeled as τ jets), are inadequate for a detector experiment searching for rare events. Thus, we need to explore more sophisticated ways of discerning between the two data sets.
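The rates discussed here follow directly from the entries of Table 1. The short sketch below uses the convention that a false positive is a DIS jet labeled as τ and a false negative is a τ jet labeled as DIS, which is the convention that reproduces the rates quoted alongside Tables 4 and 5. The helper name `rates` is introduced only for illustration.

```python
def rates(tp, fn, fp, tn):
    """Accuracy and error rates from the four entries of a 2x2 confusion matrix."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        # Fraction of true DIS jets that were labeled as tau jets.
        "false_positive_rate": fp / (fp + tn),
        # Fraction of true tau jets that were labeled as DIS jets.
        "false_negative_rate": fn / (fn + tp),
    }


# Entries of Table 1: tau is the "positive" class, DIS the "negative" class.
table_1 = rates(tp=162, fn=40, fp=26, tn=170)
for name, value in table_1.items():
    print(f"{name}: {100 * value:.1f}%")
```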

3.3.2 Application of Machine Learning

The scikit-learn library in Python provides a number of ready-made, advanced machine learning algorithms [22]. We have tested all of the default methods with the following procedure. We randomly split the total number of sample jets (1000 τ jets and 1000 DIS jets) into a training data set (80%) and a validation data set (20%). The training data set is used to train our algorithms, and the validation set is used to evaluate the ultimate performance of the algorithm we choose to pursue. For each of the algorithms we choose, we run through the training data set 10 times. Each time, 90% of the training set is used for training and the remaining 10% is used to evaluate the performance. This gives us a picture of how sensitive the algorithm is to changes in the data sets. A selection of the results of these tests can be seen in Figures 17 and 18.

Figure 17: Accuracy of different machine learning algorithms with the smaller, jet-only data set.

Figure 18: Accuracy of different machine learning algorithms with the larger, jet+global data set.
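The evaluation procedure can be sketched with scikit-learn as follows. This is a sketch under assumptions: the data are generated with `make_classification` as a stand-in for the 2000 jets, the ten 90%/10% passes are read as standard 10-fold cross-validation (repeated random splits would be another reading), and the stratified split and fixed random seeds are choices made here for reproducibility.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the 2000 sample jets; these are NOT the thesis data.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5, random_state=0)

# 80% training set, 20% validation set held back for the final evaluation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

candidates = {
    "ADA": AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0),
    "LR": LogisticRegression(max_iter=1000),
}

cv_scores = {}
for name, model in candidates.items():
    # cv=10: each pass trains on 90% of the training set and scores on the other 10%.
    cv_scores[name] = cross_val_score(model, X_train, y_train, cv=10)
    print(f"{name}: {cv_scores[name].mean():.3f} +/- {cv_scores[name].std():.3f}")
```

The spread of the ten scores for each algorithm is what indicates how sensitive it is to changes in the data set.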

From these plots, we can draw a number of conclusions. First, we can see that the addition of more parameters does not uniformly improve the performance of these algorithms. Some of the parameters we have added are highly correlated (Figure 21), and this can hamper the performance of some algorithms. This can be seen most clearly in QDA (Quadratic Discriminant Analysis), where the average score drops from 83.6% to 54.1%. Not every algorithm suffers from this disadvantage. In fact, the majority of the scores increase with the addition of these parameters. From this information, we chose to investigate two different algorithms: ADA (AdaBoost Classifier [14, 26]) and LR (Logistic Regression). We chose to investigate these two for accuracy and tunability, respectively.

We first look at the AdaBoost Classifier algorithm⁵. This algorithm was the most accurate of all of the algorithms tested, and was less sensitive than most to the difference between the two data sets. After retraining it on both the jet-only and the full jet+global training sets, we evaluate its performance using the validation set. This gives us the following confusion matrices:

⁵ We considered a number of settings for the AdaBoost function, but ultimately used the following in our analysis: AdaBoostClassifier(base_estimator=None, n_estimators=100, learning_rate=0.5, algorithm='SAMME.R', random_state=None)

              Predicted τ   Predicted DIS
Actual τ          171             31
Actual DIS         22            174

Table 2: Confusion Matrix for an AdaBoost Classifier trained on ∼ 800 20x250 τ vs. DIS jets with jet only variables. Total accuracy: 86.7%.

              Predicted τ   Predicted DIS
Actual τ          181             21
Actual DIS         17            179

Table 3: Confusion Matrix for an AdaBoost Classifier trained on ∼ 800 20x250 τ vs. DIS jets with jet and global variables. Total accuracy: 90.5%.

This algorithm is more accurate in either case than the simple cut made earlier. Its overall accuracy also improves by 3.8 percentage points with the addition of global η and φ. Due to limited simulation time, we were not able to have sample sizes larger than O(1000) events. Were we able to increase these sample sizes, it would likely improve our results, as doing so has been shown to increase the accuracy of the AdaBoost algorithm in similar high-dimensional, binary classification problems [9]. While this algorithm was able to improve our overall accuracy, we still have a problem of false positives. Even in the best case for AdaBoost, we still have a false positive rate of 8.7%.

AdaBoost, in its implementation in scikit-learn, has no direct way to weight the different classes to produce an imbalanced result (i.e. one class is identified correctly more often in exchange for the other being misidentified). Thus, we look at this point to a simpler algorithm, Logistic Regression, in which we can assign weights to the two classes of jets. Whichever jet class we weight more heavily as we train the algorithm will be less likely to be misidentified in the validation set. In Tables 4 and 5 below, we show two different confusion matrices for the same Logistic Regression algorithm with the class weightings altered.

              Predicted τ   Predicted DIS
Actual τ          168             34
Actual DIS         28            168

Table 4: Confusion Matrix for a Logistic Regression Classifier trained on ∼ 800 20x250 τ vs. DIS jets with jet and global variables and τ:DIS weighting of 1:1. Total accuracy: 84.4%.

              Predicted τ   Predicted DIS
Actual τ          106             96
Actual DIS          1            195

Table 5: Confusion Matrix for a Logistic Regression Classifier trained on ∼ 800 20x250 τ vs. DIS jets with jet and global variables and τ:DIS weighting of 10:1. Total accuracy: 75.6%.

Just by weighting the classes of jets differently, we can exchange a reduction in the false positive rate (from 14.3% to 0.5%) for an increase in the false negative rate (from 16.8% to 47.5%). We could also do the opposite, and weight the DIS class more heavily to reduce the rate of false negatives while dramatically increasing the false positives. Depending on the position of the machine learning algorithm in the analysis chain, either one may be preferable. If there are additional steps beyond this where false positives will be screened out, then we would want to minimize the number of false negatives as far as possible to prevent events from being lost. However, if this algorithm is the terminal step in the analysis, then we must do everything we can to eliminate false positives. Moreover, false negatives are only a problem if we are restricted by time. If, while conducting an experiment, we know that 47.5% of the time we will falsely discount a τ jet, then only 52.5% of τ jets are retained, and we need to operate our experiment roughly 1/0.525 ≈ 1.9 times as long to recoup those losses.
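The trade-off between false positives and false negatives can be illustrated with the `class_weight` parameter of scikit-learn's LogisticRegression. This is a sketch on synthetic, overlapping clusters and not a reproduction of Tables 4 and 5. How the weighting ratios quoted in those tables map onto `class_weight` is not stated, so the dictionaries below are an assumption. In scikit-learn's convention, a larger weight makes errors on that class costlier during training.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic, overlapping two-feature clusters; these are NOT the thesis data.
dis = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(1000, 2))
tau = rng.normal(loc=[1.5, 1.5], scale=1.0, size=(1000, 2))
X = np.vstack([dis, tau])
y = np.concatenate([np.zeros(1000, dtype=int), np.ones(1000, dtype=int)])  # 0 = DIS, 1 = tau

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)


def false_rates(class_weight):
    """Train with the given class weights; return (false positive rate, false negative rate)."""
    clf = LogisticRegression(class_weight=class_weight).fit(X_train, y_train)
    tn, fp, fn, tp = confusion_matrix(y_val, clf.predict(X_val), labels=[0, 1]).ravel()
    return fp / (fp + tn), fn / (fn + tp)


fpr_equal, fnr_equal = false_rates({0: 1, 1: 1})
# Weighting the DIS class (label 0) ten times more heavily penalizes DIS jets labeled as tau.
fpr_dis_heavy, fnr_dis_heavy = false_rates({0: 10, 1: 1})

print(f"equal weights: FPR = {fpr_equal:.3f}, FNR = {fnr_equal:.3f}")
print(f"DIS x10      : FPR = {fpr_dis_heavy:.3f}, FNR = {fnr_dis_heavy:.3f}")
```

With these toy clusters the heavier DIS weight lowers the false positive rate at the cost of a higher false negative rate, which is the same qualitative exchange described above.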

3.3.3 Energy Dependence

The previous studies were conducted using only the data from 20x250 GeV collisions. However, this is only one potential (albeit typical) momentum combination for the EIC. Here, we examine whether changes in energy influence the algorithm's ability to distinguish between DIS and τ jets. Using the same procedure detailed in Section 3.3.2, we generated 1000 DIS and 1000 τ jets for a number of different center of mass energies and momentum imbalances. We then used the AdaBoost algorithm with the same settings as before to attempt to classify these events.

10x50 GeV (Total Accuracy: 96.0%)
              Predicted τ   Predicted DIS
Actual τ          178             12
Actual DIS          4            206

30x275 GeV (Total Accuracy: 93.5%)
              Predicted τ   Predicted DIS
Actual τ          184             19
Actual DIS          7            187

30x50 GeV (Total Accuracy: 89.7%)
              Predicted τ   Predicted DIS
Actual τ          178             20
Actual DIS         21            180

5x275 GeV (Total Accuracy: 90.5%)
              Predicted τ   Predicted DIS
Actual τ          166             24
Actual DIS         14            196

Table 6: Confusion matrices for the AdaBoost algorithm for a variety of different electron and proton beam energies.

These data show that the AdaBoost algorithm classifies jets well across the wide range of energies of the EIC. In fact, for some of these energies the algorithm is even more effective than for a typical energy. Thus, we can conclude that the profiles and distributions of τ jets and DIS jets vary with energy but are always distinct enough to be distinguished with > 89% accuracy. However, for some of these events the center of mass energy (√s) is much lower than in the typical 20x250 GeV events we have studied previously. The equation for the cross section of a leptoquark interaction, Equation 2, tells us that the cross section for (and thus the probability of) such interactions is directly proportional to s. Thus, while a 10x50 τ jet may be very distinct from its DIS counterpart, it is also very unlikely to occur. Further studies will have to be done in order to quantify which combination of electron and proton energies will give the best results for both the cross section (and thus event rate) and jet identification.
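To make the comparison of center of mass energies concrete, the sketch below evaluates √s for the beam settings discussed in this section. It assumes head-on beams and neglects the electron and proton masses, in which case s ≈ 4 E_e E_p. The function name is introduced here for illustration.

```python
import math


def sqrt_s(electron_energy_gev, proton_energy_gev):
    """Center of mass energy for head-on beams, neglecting the particle masses."""
    return 2.0 * math.sqrt(electron_energy_gev * proton_energy_gev)


# (electron energy, proton energy) in GeV for the settings studied in this section.
beam_settings = [(10, 50), (30, 50), (5, 275), (20, 250), (30, 275)]
for e_e, e_p in beam_settings:
    root_s = sqrt_s(e_e, e_p)
    print(f"{e_e}x{e_p} GeV: sqrt(s) = {root_s:.1f} GeV, s = {root_s**2:.0f} GeV^2")
```

Under this approximation, s for a 10x50 GeV collision is ten times smaller than for a 20x250 GeV collision, so a cross section proportional to s is suppressed by the same factor.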

4 Results and Discussions

Here we have shown that the geometry of the proposed EIC detector, as currently implemented in Geant4, is compatible with a search for e− → τ− LFV events mediated by BRW leptoquarks. These events are primarily concentrated in the barrel region for the entire energy range of the EIC. The τ− particles produced from these leptoquarks decay quickly into multiple hadrons, primarily through the 3-π channel, which go on to deposit their energy in our calorimeter subsystems. We have shown that we can detect and reconstruct these jets with good accuracy using standard anti-kt jet finding algorithms. Using machine learning algorithms, we have also shown that we can identify these jets with up to 90.5% overall accuracy at a typical EIC center of mass energy of 141 GeV (20x250 GeV collision). We have also shown that we can tune the machine learning algorithms to sacrifice some of this accuracy in order to minimize false positives (< 1%).

5 Future Studies

Currently, the machine learning algorithms utilize only information from the jet reconstruction algorithms (η, φ, jet width, jet energy, etc.). The accuracy of the identification of jets can likely be improved with the addition of more parameters from other detector subsystems. For instance, the absence of a final state electron in the EEMC would also be a characteristic signature of an e− → τ− conversion, but is currently not taken into account in this analysis. Future studies can take this information into account, and likely discern LQ events from DIS background with even greater accuracy. This study was also limited by simulation time, and thus our sample sizes for training the machine learning algorithms were only O(1000) jets. Even with all other factors constant, increasing the sample sizes by 1-2 orders of magnitude should improve both the accuracy and precision of the tests. Finally, the analysis of this data was done using only pre-made machine learning algorithms. An algorithm, possibly based on an AdaBoost method, which was custom designed to be both accurate and tunable would most likely improve results beyond what we have achieved here.

A Monte Carlo Event Generation

Monte Carlo simulations are a class of methods which, broadly speaking, harness randomness in order to solve complicated problems. In order to employ a Monte Carlo method, it is necessary to first define a “phase space” of possible outcomes for the problem. Then, by selecting random entries from this phase space, one can begin to build up a statistical picture of the true solution to the problem. More probable outcomes will have a larger area in this phase space than uncommon outcomes, and thus will be chosen more often. The prototypical example of this is as follows. When a circle is inscribed in a square, such that the diameter of the circle (d) is equal to the length of the side of the square and their centers coincide, there exists a simple ratio between their two areas:

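\[
\frac{A_\square}{A_\circ} = \frac{d^2}{\pi (d/2)^2} = \frac{4}{\pi} \;\rightarrow\; \pi = 4\,\frac{A_\circ}{A_\square} \tag{5}
\]

From this, we can construct a simple ratio for the value of π. If we place points uniformly at random within the square, the fraction which land inside the circle approaches the ratio of the two areas, so we can rewrite this expression in terms of the number of points placed within the circle (N◦) and within the square (N□):

\[
\pi \approx 4\,\frac{N_\circ}{N_\square} \tag{6}
\]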

This value will become more and more exact as we place more points, as shown in Figure 19 below:

Figure 19: Monte Carlo calculation of the value of π [21]. As we increase the number of points, the accuracy of our calculation increases.
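A minimal sketch of this calculation is given below. Because a circle inscribed in a square covers a fraction π/4 of the square's area, the fraction of uniformly random points that land inside the circle, multiplied by 4, estimates π. The function name and the fixed seed are choices made here for reproducibility.

```python
import random


def estimate_pi(n_points, seed=0):
    """Estimate pi from the fraction of random points in a square that fall in the inscribed circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_points):
        # Uniform point in the square [-1, 1] x [-1, 1], which circumscribes the unit circle.
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            inside += 1
    # Area ratio circle/square = pi/4, so pi is about 4 * N_circle / N_square.
    return 4.0 * inside / n_points


for n in (100, 10_000, 100_000):
    print(f"N = {n:>7}: pi is approximately {estimate_pi(n):.4f}")
```

The statistical error of the estimate shrinks like 1/√N, so each additional digit of precision costs roughly a hundred times more points.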

Similarly, particle physicists apply Monte Carlo methods to particle collisions and decays. Each decay or interaction has a certain probability of taking place, and by iterating over many possible reactions we can develop a good understanding of how decays take place in a real experiment.

B Codebase

During the course of this thesis project, a great deal of code was written and maintained. In the interest of allowing this work to be carried forward, all of this code is publicly available. Below is a brief summary of where this information can be found.

LQGENEP, the event generation software, can be found in the following GitHub repository: https://github.com/jlabounty/LQGENEP. Within this repository, one can also find a copy of eic-smear. Using the BuildTree() function included with this package, one can convert the LQGENEP output text file into a ROOT TTree to be read into an analysis chain.

The ROOT macros which created the plots seen in this report can be found in the following GitHub repository: https://github.com/jlabounty/leptoquark. Most require ROOT 5.3.4 to run, but some may be compatible with ROOT 6.

The Geant4 simulation files can be found in the following GitHub repository, which was forked from the sPHENIX GitHub page: https://github.com/jlabounty/macros. For further information on how to set up and run the sPHENIX computing environment in order to run these simulations, contact a current administrator.

The analysis module, LeptoquarksReco.{C,h}, which identifies and stores jet information in a ROOT file for later analysis, can be found in the following GitHub repository: https://github.com/sPHENIX-Collaboration/analysis.

C Additional Figures

Figure 20: Scatter plots of all of the parameters in the smaller .csv file, for a 20x250 GeV collision.

Figure 21: Plot showing the correlation of the different variables used in this analysis. The x and y axes are the same values. A darker box indicates a more correlated set of values.

References

[1] A. Accardi et al. “Electron Ion Collider: The Next QCD Frontier”. In: Eur. Phys. J. A52.9 (2016). Ed. by A. Deshpande, Z. E. Meziani, and J. W. Qiu, p. 268. doi: 10.1140/epja/i2016-16268-9. arXiv: 1212.1701 [nucl-ex].

[2] A. Adare et al. “Concept for an Electron Ion Collider (EIC) detector built around the BaBar solenoid”. In: (2014). arXiv: 1402.1209 [nucl-ex].

[3] S. Agostinelli et al. “Geant4 - a simulation toolkit”. In: Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 506.3 (2003), pp. 250–303. issn: 0168-9002. doi: https://doi.org/10.1016/S0168-9002(03)01368-8. url: http://www.sciencedirect.com/science/article/pii/S0168900203013688.

[4] Q. R. Ahmad et al. “Direct Evidence for Neutrino Flavor Transformation from Neutral-Current Interactions in the Sudbury Neutrino Observatory”. In: Phys. Rev. Lett. 89 (1 2002), p. 011301. doi: 10.1103/PhysRevLett.89.011301. url: https://link.aps.org/doi/10.1103/PhysRevLett.89.011301.

[5] C. Aidala et al. “sPHENIX: An Upgrade Concept from the PHENIX Collaboration”. In: (2012). arXiv: 1207.6378 [nucl-ex].

[6] A. Aktas et al. “Search for lepton flavour violation in ep collisions at HERA”. In: The European Physical Journal C 52.4 (2007), pp. 833–847. issn: 1434-6052. doi: 10.1140/epjc/s10052-007-0440-2. url: http://dx.doi.org/10.1140/epjc/s10052-007-0440-2.

[7] Carl H. Albright and Mu-Chun Chen. “Lepton Flavor Violation in Predictive SUSY-GUT Models”. In: Phys. Rev. D77 (2008), p. 113010. doi: 10.1103/PhysRevD.77.113010. arXiv: 0802.4228 [hep-ph].

[8] L. Bellagamba. “LQGENEP: a leptoquark generator for /ep scattering∗”. In: Computer Physics Communications 141 (Nov. 2001), pp. 83–97. doi: 10.1016/S0010-4655(01)00295-8.

[9] Rok Blagus and Lara Lusa. “Boosting for high-dimensional two-class prediction”. In: BMC Bioinformatics 16.1 (2015), p. 300. issn: 1471-2105. doi: 10.1186/s12859-015-0723-9. url: http://dx.doi.org/10.1186/s12859-015-0723-9.

[10] Daniel Boer et al. “Gluons and the quark sea at high energies: Distributions, polarization, tomography”. In: (2011). arXiv: 1108.1713 [nucl-th].

[11] W. Buchmuller, R. Ruckl, and D. Wyler. “Leptoquarks in lepton-quark collisions [Phys. Lett. B 191 (1987) 442]”. In: Physics Letters B 448 (Feb. 1999), pp. 320–320. doi: 10.1016/S0370-2693(99)00014-3.

[12] Matteo Cacciari, Gavin P. Salam, and Gregory Soyez. “The Anti-k(t) jet clustering algorithm”. In: JHEP 04 (2008), p. 063. doi: 10.1088/1126-6708/2008/04/063. arXiv: 0802.1189 [hep-ph].

[13] S. Chekanov et al. “Search for lepton-flavor violation at HERA”. In: Eur. Phys. J. C44 (2005), pp. 463–479. doi: 10.1140/epjc/s2005-02399-1. arXiv: hep-ex/0501070 [hep-ex].

[14] Yoav Freund and Robert E Schapire. “A desicion-theoretic generalization of on-line learning and an application to boosting”. In: European conference on computational learning theory. Springer. 1995, pp. 23–37.

[15] Matthew Gonderinger and Michael J. Ramsey-Musolf. “Electron-to-Tau Lepton Flavor Violation at the Electron-Ion Collider”. In: JHEP 11 (2010). [Erratum: JHEP05,047(2012)], p. 045. doi: 10.1007/JHEP05(2012)047,10.1007/JHEP11(2010)045. arXiv: 1006.5063 [hep-ph].

[16] David Griffiths. Introduction to Elementary Particles. Wiley-VCH, 2008.

[17] Thomas Krahulik. “Simulated Measurements of Exclusive Deep Inelastic Scattering with a Future Electron Ion Collider”. Undergraduate Thesis. 2017.

[18] John Lajoie. “The sPHENIX Detector: The Future of Heavy-Ion Collisions at RHIC, and a Foundation for an EIC Detector”. In: DIS Conference.

[19] R. P. Litchfield. “Muon to electron conversion: The COMET and experiments”. In: Interplay between Particle and Astroparticle physics (IPA2014) London, United Kingdom, August 18-22, 2014. 2014. arXiv: 1412.1406 [physics.ins-det]. url: https://inspirehep.net/record/1332516/files/arXiv:1412.1406.pdf.

[20] Matic Lubej. Standard Model. 2015. url: http://www.physik.uzh.ch/groups/serra/StandardModel.html.

[21] Nicogauro. Monte Carlo method applied to approximating the value of pi. 2017. url: https://commons.wikimedia.org/wiki/File:Pi_30K.gif.

[22] F. Pedregosa et al. “Scikit-learn: Machine Learning in Python”. In: Journal of Machine Learning Research 12 (2011), pp. 2825–2830.

[23] Michael Riordan. “The Discovery of Quarks”. In: SLAC-PUB-5724 (1992).

[24] “sPHENIX preConceptual Design Report”. In: (2015).

[25] The Variational Principles of Mechanics. Dover Publications, 1970.

[26] Ji Zhu et al. “Multi-class adaboost”. In: Statistics and its Interface 2.3 (2009), pp. 349–360.