DisCo Fever: Robust Networks Through Distance Correlation

Gregor Kasieczka¹,∗ and David Shih²,³,⁴,†
¹ Institut für Experimentalphysik, Universität Hamburg, 22761 Hamburg, Germany
² NHETC, Dept. of Physics and Astronomy, Rutgers University, Piscataway, NJ 08854 USA
³ Theory Group, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
⁴ Berkeley Center for Theoretical Physics, University of California, Berkeley, CA 94720, USA

While deep learning has proven to be extremely successful at supervised classification tasks at the LHC and beyond, for practical applications, raw classification accuracy is often not the only consideration. One crucial issue is the stability of network predictions, either versus changes of individual features of the input data, or against systematic perturbations. We present a new method based on a novel application of “distance correlation” (DisCo), a measure quantifying non-linear correlations, that achieves equal performance to state-of-the-art adversarial decorrelation networks but is much simpler and more stable to train. To demonstrate the effectiveness of our method, we carefully recast a recent ATLAS study of decorrelation methods as applied to boosted, hadronic W-tagging. We also show the feasibility of DisCo regularization for more powerful convolutional neural networks, as well as for the problem of hadronic top tagging.

arXiv:2001.05310v2 [hep-ph] 30 Sep 2020

∗ Electronic address: [email protected]
† Electronic address: [email protected]

Introduction

Recent breakthroughs in deep learning have begun to revolutionize many areas of high energy physics. One area that has received considerable focus is the problem of classifying different types of jets at the LHC. Deep neural networks have been applied, for example, to distinguishing top quarks from light quark and gluon jets. For this problem a large number of architectures based on fully connected neural networks [1, 2], image-based methods [3, 4], recursive clustering [5, 6], physics variables [7–10], sets [11], and graphs [12, 13] have been studied [14–16]. Related challenges of identifying vector bosons [17, 18], b-quarks [19, 20], Higgs bosons [13, 21], and distinguishing light quark from gluon jets [22–25] have seen similar progress. Beyond classifying single particles in an event, there is also work on developing holistic methods that classify full events according to the likely physics process that produced them [26, 27]. Finally, some of these novel deep learning methods are beginning to be applied to concrete experimental analyses, see e.g. [28–30].

So far, the recent activity in developing better jet classifiers with deep learning has focused on maximizing their raw performance. However, the most accurate classifier is often not the best one for actual experimental applications. Instead, what is often desired is the most accurate classifier given the constraint that it is decorrelated with one or more auxiliary variables.

The underlying reason for this requirement is that classifiers are trained on Monte Carlo (MC) simulated examples (for which perfect truth labels are available), but are applied to (unlabeled) collision data. While the simulated events are of high fidelity, they do not perfectly reproduce the real data, and this gives rise to systematic differences between training and testing data. Understanding and mitigating these systematic differences is essential in any experimental analysis, and having a decorrelated classifier has many applications in this regard. For example, if the sources of systematic uncertainty are known, one can attempt to explicitly decorrelate a classifier against them in order to reduce or eliminate their effects [31–34]. Or, one can attempt to control for these systematic differences using data-driven methods, such as sidebanding in the invariant mass.¹ If the signal is localized but the background is smooth in mass, the sideband method allows one to calculate MC vs. data correction factors, define control samples, and estimate backgrounds. But if the classifier sculpts features (e.g. bumps) into the background mass distribution, it cannot be relied on for sidebanding. A classifier that is decorrelated with mass is sufficient (although not necessary) to guarantee smoothness of the background mass distribution.

¹ Although different auxiliary variables can be used in experimental analyses, one of the most common choices is invariant mass. So for concreteness, and without loss of generality, we will focus on the case of invariant mass for the remainder of this paper.

The issue is especially acute for powerful multivariate classifiers such as neural networks, which will have a strong incentive to “learn the mass” when building the optimal discriminant. Even if one excludes mass from the list of inputs to the machine learning algorithm, it may not be enough to achieve a decorrelated classifier – many of the other inputs may be correlated with mass, and machine learning methods in general are flexible enough to exploit correlations of inputs. Such improvements will be especially relevant for (but not limited to) searches for new resonances with unknown mass. The identification of resonances in invariant mass distributions is historically the main avenue to discovery in experimental particle physics, and relies on robust background estimates. Therefore an important and significant challenge is to design classifiers that are as fully decorrelated from mass as possible while using maximal information.

In this paper we will present a new method for training decorrelated classifiers which achieves performance comparable to state-of-the-art methods, while being much easier to train. The key observation is that a statistical measure called Distance Correlation (DisCo) [35–38] is sensitive to general, nonlinear correlations between two random variables and can be efficiently computed from finite samples. Distance correlation is well-known in statistics and has been applied to various fields including data science [39] and biology [40]. To our knowledge, this is the first application of DisCo to particle physics.

By including DisCo as an additive regularizer term in the loss function, we demonstrate that we can achieve a state-of-the-art decorrelated classifier with just one additional hyperparameter (the coefficient of the DisCo regularizer). By varying this coefficient, we can control the tradeoff between classification performance and decorrelation, interpolating between a fully decorrelated tagger and a fully performant one.

To validate our methods and rigorously demonstrate that they are state-of-the-art, we will carefully reproduce the results of a recent ATLAS study of decorrelated taggers for identifying boosted W bosons [41]. This study includes a comprehensive set of decorrelation methods, including [31, 42–44]. The most promising technique so far (in terms of achieving the highest classifier performance for a given level of decorrelation) has been adversarially training a pair of neural networks: a classifier distinguishing different classes and an adversary predicting the mass [31, 44] for a given classifier output.

The downside of the adversarial method has been that it is extremely difficult to implement in practice. Not only does one have to essentially train two separate neural networks, each with their own set of hyperparameters, but one has to carefully tune these two neural networks against each other. This stems from the nature of adversarial training: the objective is not to minimize a loss function, but rather to find a saddle point where the classifier loss is minimized but the adversary loss is maximized. Without careful tuning of learning rate schedules, number of epochs, minibatch sizes, etc., the training easily becomes unstable (since the loss is unbounded from below) and can quickly run away to a meaningless result.

By contrast, DisCo regularization maintains the convex objective of the original loss function (i.e. the DisCo term is a positive measure of nonlinear correlations), making it much more stable to train. And since it only has one additional hyperparameter, no additional tuning is required.

We will show, in the context of the ATLAS W-tagging study, that the result of DisCo decorrelation is comparable to that of adversarial decorrelation. In the Appendix, we will also demonstrate the state-of-the-art performance for top tagging with jet images and convolutional neural networks (CNNs).

Distance Correlation

Given a sample of paired vectors $(\vec x_i, \vec y_i)$ (where the index $i$ runs over the sample) drawn randomly from some distribution, we would like a function that measures the extent to which they are drawn from independent distributions, i.e. the extent to which $P_{\mathrm{joint}}(\vec X, \vec Y) = P_X(\vec X) P_Y(\vec Y)$. In order for this function to be applicable in a deep learning context, we also require that this function be differentiable and that it can be computed directly from the sample.

In our case the vectors are one dimensional and correspond to mass $X = m$ and classifier output $Y = y$ but clearly one can imagine many more applications of such a measure at the LHC and beyond.

The usual Pearson correlation coefficient R only measures linear dependencies so it is not suitable for our purposes. Specifically, features can have nonlinear dependencies and still exhibit zero Pearson R.² There are many information-theoretic measures of similarity of distributions such as KL-divergence, Jensen-Shannon distance, and mutual information. These are difficult to compute directly from the sample, without binning. One can approximate these measures by training a classifier and using the likelihood ratio trick, but this again leads to adversarial methods, see e.g. [33, 46–49].

² Since the Pearson correlation coefficient is nonzero only if features are correlated, it can however be used to actively correlate features, see e.g. [45].

One measure that seems to fit the bill perfectly is “distance correlation”, which originated in the works of [35–38]. It can be computed from the sample and it has the key property that it is zero iff X and Y are independent.

The definition of distance covariance is:

$$\mathrm{dCov}^2(X,Y) = \int d^p s\, d^q t\, \left| f_{X,Y}(s,t) - f_X(s) f_Y(t) \right|^2 w(s,t) \tag{1}$$

where $X \in \mathbb{R}^p$, $Y \in \mathbb{R}^q$, $f_X$ and $f_Y$ are the characteristic functions for the random variables X and Y, and $f_{X,Y}$ is the joint characteristic function for X and Y. Finally

$$w(s,t) \propto |s|^{-(p+1)}\, |t|^{-(q+1)} \tag{2}$$

is a weight function that is uniquely determined up to an overall normalization by the requirement that dCov is invariant under constant shifts and orthogonal transformations, and equivariant under scale transformations [50]. Since $f_{X,Y} = f_X f_Y$ iff X and Y are independent random variables, the definition (1) makes clear that distance covariance is a measure of the independence of X and Y that is zero iff X and Y are independent.

Using the definition of the characteristic function it is straightforward to verify that we can also express dCov as

$$\mathrm{dCov}^2(X,Y) = \langle |X - X'|\,|Y - Y'| \rangle + \langle |X - X'| \rangle \langle |Y - Y'| \rangle - 2 \langle |X - X'|\,|Y - Y''| \rangle \tag{3}$$

where $|\cdot|$ refers to the Euclidean vector norm³ and $(X,Y)$, $(X',Y')$, $(X'',Y'')$ are iid from the joint distribution of $(X,Y)$ ($X''$ is not used in (3)). Using this alternative form of $\mathrm{dCov}^2$ it is straightforward to compute a sampling estimate of $\mathrm{dCov}^2$ from a dataset of $(x_i, y_i)$.⁴

Finally, we normalize the distance covariance by the individual distance variances to obtain distance correlation:

$$\mathrm{dCorr}^2(X,Y) = \frac{\mathrm{dCov}^2(X,Y)}{\mathrm{dCov}(X,X)\,\mathrm{dCov}(Y,Y)} \tag{4}$$

FIG. 1: Invariant mass distribution for the inclusive W and QCD samples. (Axes: mass [GeV] vs. normalized counts.)
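As an illustration of Eqs. (3) and (4), the sample estimate can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the implementation used for the results in this paper: it handles only one-dimensional, unweighted samples, it is not differentiable, and the function names (`distance_covariance_sq`, `distance_correlation_sq`, `pearson`) as well as the toy data are hypothetical. Each expectation value in Eq. (3) is replaced by the corresponding average over all pairs (or triples) of sample points.

```python
import math


def _pairwise_abs(values):
    """Matrix of |v_i - v_j| for a 1D sample."""
    return [[abs(p - q) for q in values] for p in values]


def distance_covariance_sq(xs, ys):
    """Plug-in estimate of dCov^2 from Eq. (3) for paired 1D samples."""
    n = len(xs)
    a = _pairwise_abs(xs)
    b = _pairwise_abs(ys)
    a_row = [sum(row) / n for row in a]  # average over X' for each X
    b_row = [sum(row) / n for row in b]  # average over Y' for each Y
    # <|X - X'| |Y - Y'|>
    term1 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n**2
    # <|X - X'|> <|Y - Y'|>
    term2 = (sum(a_row) / n) * (sum(b_row) / n)
    # <|X - X'| |Y - Y''|>: X' and Y'' belong to different sample points
    term3 = sum(a_row[i] * b_row[i] for i in range(n)) / n
    return term1 + term2 - 2.0 * term3


def distance_correlation_sq(xs, ys):
    """dCorr^2 from Eq. (4); returns 0 if either sample is constant."""
    dxx = distance_covariance_sq(xs, xs)
    dyy = distance_covariance_sq(ys, ys)
    if dxx <= 0.0 or dyy <= 0.0:
        return 0.0
    # dCov(X,X) * dCov(Y,Y) = sqrt(dCov^2(X,X) * dCov^2(Y,Y))
    return distance_covariance_sq(xs, ys) / math.sqrt(dxx * dyy)


def pearson(xs, ys):
    """Ordinary Pearson correlation coefficient, for comparison."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy data (not from the paper): y = x^2 on a symmetric grid.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [x * x for x in xs]
print("Pearson R:", pearson(xs, ys))
print("dCorr^2:  ", distance_correlation_sq(xs, ys))
```

On this toy sample the Pearson coefficient vanishes by symmetry, while the estimated dCorr² comes out at roughly 0.27, which illustrates the point made above about nonlinear dependencies. The double loop costs O(n²) in the sample size, so a practical version would vectorize the pairwise distances.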

The distance correlation is bounded between 0 and 1. Normalizing ensures equally strong decorrelation independent of the overall scale.

We will add $\mathrm{dCorr}^2$ as a regularizer term to the usual classifier loss function in the following.⁵ In detail:

$$L = L_{\mathrm{classifier}}(\vec y, \vec y_{\mathrm{true}}) + \lambda\, \mathrm{dCorr}^2_{y_{\mathrm{true}}=0}(\vec m, \vec y) \tag{5}$$

where λ is a single hyperparameter that controls the tradeoff between classifier performance and decorrelation, $\vec y$ is the output of the NN on a single minibatch, and $\vec y_{\mathrm{true}}$ and $\vec m$ are the true labels and masses respectively.⁶ The subscript $y_{\mathrm{true}} = 0$ indicates that the distance correlation is only calculated for the subset of the minibatch that is background; this is the appropriate mode for W-tagging. Of course, for other applications it may be more appropriate to apply the decorrelation to all events, or even to signal events only.

³ In fact there is a family of distance covariance measures parameterized by $0 < \alpha < 2$ where one uses $|X - X'|^\alpha$ instead of $|X - X'|$. These relax the requirement of strict equivariance under rescalings. In this paper we will focus on $\alpha = 1$ but in principle this would be another hyperparameter to explore.
⁴ In the following we will be reweighting by pT. So we actually need a weighted form of distance correlation. That follows easily from the sample definition (3).
⁵ In principle another hyperparameter is the exact power of dCorr that one adds to the loss function. We have not explored this in much detail.
⁶ Our implementation of DisCo is available at https://github.com/gkasieczka/DisCo.

Samples

As discussed in the Introduction, we will focus in this paper on W tagging, for which there is a detailed study of existing decorrelation methods by the ATLAS collaboration [41]. (See the Appendix for a brief demonstration of DisCo decorrelation for top tagging.) By recasting the ATLAS study as closely as possible, we will be able to validate our methods and rigorously demonstrate that our method of distance correlation is state-of-the-art.

Following the ATLAS study, we generate the SM processes pp → WW and pp → jj in Pythia 8.219 [51] at $\sqrt{s} = 13$ TeV with a generator level cut of pT > 250 GeV on the initial particles. We use Delphes 3.4.1 with the default detector card for detector simulation [52]. We also use the built-in functionality of Delphes to simulate pileup with $\langle N_{PU} \rangle = 24$ as per the ATLAS study [41].

Jets are reconstructed using FastJet 3.0.1 [53] and the anti-kT algorithm [54] with R = 1 distance parameter. Jets are required to have |η| < 2 and to be within ∆R < 0.75 of the original parton. The daughters of the W are also required to be within ∆R < 0.75 of the original W. Finally jets are trimmed [55] with parameters $R_{\mathrm{sub}} = 0.2$ and $f_{\mathrm{cut}} = 5\%$. For the final sample, jets are required to have m ∈ [50, 300] GeV and pT ∈ [300, 400] GeV; the mass distributions for signal and background are shown in fig. 1. Apart from the very last requirement on pT, these are all following the ATLAS study. Here we choose to focus on a more narrow range in pT for simplicity.

From this sample of jets, we compute the complete list of high-level kinematic variables shown in table 1 of the ATLAS study, see [41] for more details and original references. These form the inputs for all the methods in the ATLAS study. We will also use them as inputs for the DNN plus distance correlation.

Since we will also study the decorrelation of CNN classifiers (see below), we will also form jet images in the same way as [56]. We form images with ∆η = ∆φ = 2 and 40 × 40 pixel resolution. For simplicity we stick to grayscale images (with pixel intensity equal to pT) for this study. Fig. 2 shows the average of 100,000 W and QCD jet images.

For all methods we reweight the training samples so that the pT distributions of signal and background are flat, following the ATLAS study. We use 50 evenly-spaced pT bins between 300 and 400 GeV. For evaluation
ATLAS also reweights the signal pT distribution to look like background. But since we are taking such a narrow pT slice our pT distributions are basically identical, so we skip this step.

All of the data samples used for this study will be made publicly available here [57].

FIG. 2: Average of 100k jet images for W jets (left) and QCD jets (right).

Methods

Following [41] we measure the tagging performance by the rejection factor $R_{50}$ corresponding to the inverse of the false positive rate (the probability to mis-identify a QCD jet as W jet) at a true positive rate (the probability to correctly identify a W jet) of 50%. The decorrelation is quantified by the inverse of the Jensen-Shannon Divergence $1/\mathrm{JSD}_{50}$ between the inclusive background distribution and the background distribution passing the selection corresponding to a true positive rate of 50%. The Jensen-Shannon Divergence is calculated from histograms with 50 bins between lowest and highest value. The binned entropy is measured in bits.

We have implemented the following pairs of (W-tagging, decorrelation) methods in our work. From the ATLAS study: ($\tau_{21}$, DDT) [42, 58], ($D_2$, kNN) [59–61], (Adaboost BDT, uBoost) [62], and (DNN, adversary) [31]. We will additionally include the simplest and possibly oldest decorrelation method, namely “planing,” or reweighting events so that the mass histograms of signal and background are identical. As this approach is relatively simple to implement and does not add much computational cost, it is a good baseline procedure.⁷ Finally, to all of this we will add our new method (DNN, DisCo regularization) for comparison. For details on all these methods, see the Appendix.

⁷ See [63] for a recent comparison study of planing against other methods.

In addition, we will go beyond the ATLAS study and examine a CNN classifier acting on jet images, together with adversarial and DisCo decorrelation. This will demonstrate that DisCo regularization is effective enough to decorrelate more powerful deep learning classifiers that use low-level, high-dimensional features. For the CNN classifier we use a scaled down version of the classifier in [56]. There are 4 convolutional layers with 64, 32, 32, 32 filters (size 4 × 4), with 2 × 2 Max pooling after the second and fourth layer. This is followed by 3 hidden layers with 32, 64 and 64 nodes. All activations are ReLU. Finally we output to softmax.

For both CNN and DNN with DisCo regularization, we used the Adam optimizer with mini-batch size of 2048 and a fixed learning rate of $10^{-4}$. We found that the relatively large batch size of 2048 helped with the numerical stability of the DisCo regularizer. We note that the sampling estimate (3) for distance covariance is known to be statistically biased, and an unbiased estimator was given in [64]. The bias goes to zero as ∼ 1/n where n is the size of the sample (the minibatch size in our case). We have verified that, as our minibatch size is sufficiently large, there is no practical benefit to using the unbiased estimate of distance covariance in our case.

For the DNN (CNN) we performed a scan in DisCo parameter λ in the range 0–600 (0–250). All classifiers were trained for 200 epochs; no early stopping was used. We have checked that 200 epochs is enough to ensure convergence, in the sense that training for more epochs does not improve things. Then, for each λ and training instance, the model with the best validation loss is selected. This procedure is repeated six times with different random seeds to obtain a sense of the variability in the training outcomes.

In all of the ML based methods we use 250k/80k/80k signal jets and 110k/330k/770k background jets for training/validation/testing. We use so many background jets in order to minimize the statistical error on the JSD calculation (which is calculated only for the background). The deep learning algorithms were implemented with PyTorch and trained on an NVIDIA P100 GPU.

Results

Our final result is shown in fig. 3, where the performance of various decorrelation methods on the test set is summarized in the plane of $1/\mathrm{JSD}_{50}$ (which measures decorrelation) vs. $R_{50}$ (which measures classifier performance). For DNN+DisCo and CNN+DisCo, the envelopes of the 6 independent trainings per λ are shown, together with lines connecting the median-decorrelated points for different values of rejection. For the other ML methods, a representative result is shown per decorrelation parameter. (We have checked that the envelopes for DNN+adversary and CNN+adversary are comparable to their DisCo counterparts.)

The qualitative (and even quantitative) agreement with fig. 11(a) of [41] is excellent, and we see a clear tradeoff between classifier performance and the amount of decorrelation.

Comparing DNN+DisCo to the other methods, we find that it has comparable performance to DNN+adversary. Meanwhile it is much easier to train – whereas DisCo
adds exactly one hyperparameter and no additional neural network parameters to the DNN, the adversary more than doubles the number of hyperparameters and adds an entire second NN to the story. See the Appendix for a complete list of hyperparameters for the adversarial training. These were found through manual tuning and their sheer complexity nicely illustrates the need for a simpler method of decorrelation.

FIG. 3: Decorrelation against background rejection for different approaches. (Axes: $R_{50}$ vs. $1/\mathrm{JSD}_{50}$, both logarithmic. Legend: $\tau_{21}$, $\tau_{21}$-DDT, $D_2$, $D_2$-kNN, Adaboost, uBoost, DNN, DNN+planing, DNN+adversary, DNN+distance correlation, CNN, CNN+planing, CNN+adversary, CNN+distance correlation.)

We see that DisCo regularization is equally capable of decorrelating the more powerful CNN classifier, and again achieves comparable performance to CNN+adversary. One concern could have been that a more powerful deep learning method such as the CNN could overpower the DisCo regularizer, but our result demonstrates that this is not the case. At the highest levels of decorrelation, we note that both DNN and CNN performances are comparable.

In fig. 4, we indicate more directly the level of decorrelation in the background mass distribution for the pure CNN case (no decorrelation), and for the CNN+DisCo method at a working point that achieves $1/\mathrm{JSD}_{50} \sim 10^3$. We see that DisCo is quite effective at stabilizing the background mass distribution against a cut on the classifier.

FIG. 4: QCD mass distribution before and after a cut on CNN plus DisCo (W-tagging) with signal efficiency of 50% and $\mathrm{JSD} \sim 10^{-3}$. (Axes: mass [GeV] vs. normalized counts. Curves: before cut; after cut, no decor.; after cut, DisCo.)

Finally, let us also comment briefly on the performance of planing. Unlike DisCo regularization and some of the other methods studied here, planing yields a single working point, instead of a tunable tradeoff between decorrelation and classifier performance. Since its performance depends on the joint probability distribution for mass and the other observables,⁸ planing is not guaranteed to achieve strong results. But it is interesting to see that in this case (and in many of the cases studied in [63]), planing the DNN and CNN classifiers achieves very good performance. The performance lies on the DisCo regularization curve, and DisCo is capable of further decorrelation.

⁸ Planing replaces p(x, m) with p(x, m)/p(m) which does not guarantee independence.

Conclusions

Deep learning is greatly increasing the classification performance for a wide number of reconstruction problems in particle physics. With the increasing adoption of these powerful machine learning solutions, a thorough understanding of their stability is needed.

In this paper it was shown how a simple regularisation term based on the distance correlation metric can achieve state-of-the-art decorrelation power. Training is easier to set up, with far fewer hyperparameters to optimise, and is more stable than adversarial networks, while simultaneously being more powerful than simpler approaches.

DisCo regularization is an effective and promising new method for decorrelation which should have a host of immediate experimental applications at the LHC. At the same time, the potential use cases are much wider and include problems of fairness and bias of decision algorithms in social applications. This will be an extremely interesting direction for future exploration.

Acknowledgments

We thank Joern Bach, Ben Nachman, Bryan Ostdiek, Tilman Plehn, Matt Schwartz and Mike Williams for helpful discussions. We are grateful to Chris Delitzsch, Steven Schramm and especially Andreas Sogaard for help with details of the ATLAS decorrelation study. GK is
supported by the Deutsche Forschungsgemeinschaft under Germany's Excellence Strategy – EXC 2121 “Quantum Universe” – 390833306. GK is grateful for the generous support and hospitality of the Rutgers NHETC Visitor Program where this work was initiated. DS is supported by DOE grant DOE-SC0010008 and by the Director, Office of Science, Office of High Energy Physics of the U.S. Department of Energy under the Contract No. DE-AC02-05CH11231. DS thanks LBNL, BCTP and BCCP for their generous support and hospitality during his sabbatical year.

Appendix A: More details on the methods

1. Planing

One method to reduce correlation is to remove discriminating information carried by a variable. The approach of giving weights to training events so the distributions for different classes are identical has been long used experimentally⁹ and recently was studied for understanding network decisions [69] and resonance tagging [63]. Specifically a weight $w_{i,C}$ for event with index $i$ of class $C$ is calculated by building a histogram of the feature $x$ so that $n_j$ denotes the number of events in bin $j$.¹⁰ The weight can then be calculated as:

$$w_{i,C}\Big|_{x_i \text{ in bin } j} = A_C\, \frac{1}{n_j}, \tag{A1}$$

where $A_C$ is a per-class normalisation factor.

Planing weights are then used in the training of, e.g., a neural network classifier and modify the contribution of each event to the loss function. When applying the algorithm to events of unknown class in the testing phase no weights are used (i.e. weights are set equal to one).

⁹ See e.g. [65–68]
¹⁰ Due to the explicit use of histogramming, it can be difficult to generalise planing to multiple variables.

2. Designed decorrelated taggers

For decorrelating a classifier for a single selection efficiency, a transformation of the output using the expected shape of the background distribution after the training is completed is possible as well [42]. This approach is named Designed decorrelated taggers (DDT). Concretely, to decorrelate feature $y$ against $x$, it is transformed according to:

$$y' = y - M \cdot (x - O) \tag{A2}$$

where $O$ is an offset and $M$ is a slope parameter $dy/dx$ extracted for the background.

3. Fixed efficiency regression

It is also possible to design decorrelated variables for non-linear relations between features by subtracting the expected response for background examples [70]. This average response can also be parametrised against multiple features. Take for example the decorrelation of a feature $y$ against $x$ and $x'$. The decorrelated $y^{k\text{-NN}}$ can be calculated as

$$y^{k\text{-NN}} = y - y^{(P\%)}(x, x') \tag{A3}$$

with the threshold $y^{(P\%)}(x, x')$ corresponding to a true positive rate for background events $P$ interpolated using a k-nearest neighbour regression fit [61].

4. uBoost

The uBoost approach is a modified training method for boosted decision trees (BDTs). A decision tree is a series of binary selection criteria that subsequently divide the data. Boosting refers to a combination of multiple decision trees to maximise a chosen classification metric such as the Gini coefficient or cross entropy. uBoost [62] introduces an additional weight term in the boosting procedure so that regions in mass with low efficiency receive a higher weight and regions with large efficiency receive a lower weight.

Following ATLAS, we used the implementation provided in the hep_ml v0.6.0 package [71]. The hyperparameters were n_estimators = 500, learning_rate = 0.5, and base_estimator was the DecisionTreeClassifier from sklearn with max_depth = 20 and min_samples_leaf = 0.01. For the uBoost uniforming rate (the analogue of λ for DisCo and adversary), we scanned the range 0–3. We performed 5 independent trainings per uniforming rate and observed that the results were quite stable and consistent between them. Larger values of the uniforming rate were observed to populate lower $R_{50}$ but with an unreliably large variation in $\mathrm{JSD}_{50}$, so they were not included in this study.

5. DNN classifier

As in the ATLAS study, we use for the DNN classifier a fully connected network consisting of 3 hidden layers with 64 nodes each. Except for the final softmax layer we use ReLU activations everywhere. Unlike the ATLAS study, we chose to include a batchnorm layer [72] after the first hidden layer, as we found this improved the stability of the outcome.
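The two simplest transformations in this Appendix, the planing weights of Eq. (A1) and the DDT shift of Eq. (A2), can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions rather than the code used for the study: the function names are hypothetical, the normalisation $A_C$ is chosen here so that the weights of one class average to one, out-of-range values are clamped into the edge bins, and the slope $M$ is obtained from an ordinary least-squares fit, which is only one possible way to extract it from the background.

```python
def planing_weights(values, n_bins, lo, hi):
    """Per-event weights w_i = A / n_j (Eq. A1) for the events of one class.

    Assumptions: A is chosen so that the weights average to one, and
    values outside [lo, hi) are clamped into the first or last bin.
    """
    width = (hi - lo) / n_bins
    bins = [min(max(int((v - lo) / width), 0), n_bins - 1) for v in values]
    counts = [0] * n_bins
    for j in bins:
        counts[j] += 1
    raw = [1.0 / counts[j] for j in bins]  # 1 / n_j for each event
    norm = len(values) / sum(raw)          # plays the role of A
    return [norm * w for w in raw]


def fit_slope(xs, ys):
    """Least-squares slope dy/dx; one possible estimate of M in Eq. (A2)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def ddt_transform(y, x, slope, offset):
    """y' = y - M * (x - O), Eq. (A2)."""
    return y - slope * (x - offset)


# Toy inputs, invented for illustration only.
masses = [55.0, 60.0, 65.0, 70.0, 150.0, 250.0]
weights = planing_weights(masses, n_bins=5, lo=50.0, hi=300.0)

bkg_x = [1.0, 2.0, 3.0, 4.0]
bkg_y = [0.5 + 0.2 * x for x in bkg_x]  # background feature linear in x
slope = fit_slope(bkg_x, bkg_y)
flattened = [ddt_transform(y, x, slope, offset=1.0)
             for x, y in zip(bkg_x, bkg_y)]
```

In the toy example the four events sharing the first mass bin each receive a quarter of the weight of an event that is alone in its bin, so every occupied bin carries the same total weight. The DDT-transformed background values no longer depend on x, because the toy background was constructed to be exactly linear; real distributions would only be flattened on average.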

6. Training with adversary

Adversarial training follows the approach outlined in the Introduction, with the adversary attempting to learn the PDF of the mass. The training objective is given by

$$\theta_C^*, \theta_A^* = \arg\min_{\theta_C} \max_{\theta_A}\; L_C - \lambda L_A \tag{A4}$$

and is usually solved by alternating training of the two networks. The different training objectives between discriminator and adversary are implemented using gradient reversal. Here λ is a tunable hyperparameter defining the relative weight of classification and decorrelation objective. The classifier loss term $L_C$ is the usual cross-entropy term, while the output of the adversary $A$ is the probability density produced by the Gaussian mixture model, and $L_A = -\log A$ is evaluated at the true value of the mass.¹¹

¹¹ An alternative approach to adversarial decorrelation attempts to infer the mass itself, in which case the adversarial loss $L_A(\theta_C, \theta_A)$ would take the form of a regression term or cross-entropy between different mass bins [44].

The adversary predicts a probability distribution function for the mass. The function is parametrised by a sum of 20 Gaussian distributions in a Gaussian mixture model. This means the network outputs 60 quantities, interpreted as relative normalisation, mean and width of 20 Gaussian distributions with a two layer fully connected network for each parameter where the first layer has 64 nodes and is shared. The output of the discriminator and pT are used as inputs to the adversary.

Training the adversary is done in three phases: only training the discriminator for 200 epochs (45 epochs for the CNN); only training the adversary with fixed discriminator for 20 epochs (30 epochs for the CNN); and joint training of both networks for 200 epochs (25 epochs for the CNN). For the DNN, the initial learning rates for the three phases are $\lambda_C = 0.01$, $\lambda_A = 0.05$ and $(\lambda_C, \lambda_A) = (3 \cdot 10^{-6}, 0.0001)$ respectively. The initial learning rates are subject to an exponential decay of $d_C = 0.98$, $d_A = 0.98$, $(d_C, d_A) = (0.97, 0.97)$. For the CNN the initial learning rates are $\lambda_C = 0.0001$, $\lambda_A = 0.0005$, and $(\lambda_C, \lambda_A) = (0.000001, 0.001)$ for the three phases. No exponential decay is used for pre-training the classifier, and decay rates of $d_A = 0.98$ and $(d_C, d_A) = (0.95, 0.99)$ are used for the second and third phase. We verified for some representative values of λ that greatly increasing the number of epochs (training up to 300 epochs) did not noticeably improve performance; nor did changing the model selection to the lowest loss instead of the final epoch. The batch size is 8192 (1000) for the DNN (CNN) approach.

Appendix B: Top tagging

We will also compare the performance of DisCo to adversarial decorrelation in the case of top tagging. For top tagging, we use the QCD and top samples in [16], and we restrict our comparison to CNNs trained on jet images (with the same specifications as the W-tagging). Fig. 5 compares the two methods in the plane of $1/\mathrm{JSD}_{50}$ vs. $R_{50}$. Despite the much higher possible discriminating power in top tagging, we again see that DisCo is comparable to the adversary, demonstrating that DisCo is indeed a powerful and sensitive measure of nonlinear correlation and a very effective penalty term for decorrelation.

FIG. 5: Decorrelation against background rejection for adversarial decorrelation and DisCo in the case of top tagging. (Axes: $R_{50}$ vs. $1/\mathrm{JSD}_{50}$, both logarithmic. Curves: CNN+adversary, CNN+distance correlation.)

References

[1] L. G. Almeida, M. Backović, M. Cliche, S. J. Lee, and M. Perelstein, “Playing Tag with ANN: Boosted Top Identification with Pattern Recognition,” JHEP 07 (2015) 086, arXiv:1501.05968 [hep-ph].
[2] J. Pearkes, W. Fedorko, A. Lister, and C. Gay, “Jet Constituents for Deep Neural Network Based Top Quark Tagging,” arXiv:1704.02124 [hep-ex].
[3] G. Kasieczka, T. Plehn, M. Russell, and T. Schell, “Deep-learning Top Taggers or The End of QCD?,” JHEP 05 (2017) 006, arXiv:1701.08784 [hep-ph].
[4] S. Macaluso and D. Shih, “Pulling Out All the Tops with Computer Vision and Deep Learning,” JHEP 10 (2018) 121, arXiv:1803.00107 [hep-ph].
[5] G. Louppe, K. Cho, C. Becot, and K. Cranmer, “QCD-Aware Recursive Neural Networks for Jet Physics,” JHEP 01 (2019) 057, arXiv:1702.00748 [hep-ph].
[6] S. Egan, W. Fedorko, A. Lister, J. Pearkes, and C. Gay, “Long Short-Term Memory (LSTM) networks with jet
constituents for boosted top tagging at the LHC,” arXiv:1711.09059 [hep-ex].
[7] A. Butter, G. Kasieczka, T. Plehn, and M. Russell, “Deep-learned Top Tagging with a Lorentz Layer,” SciPost Phys. 5 (2018) no. 3, 028, arXiv:1707.08966 [hep-ph].
[8] M. Erdmann, E. Geiser, Y. Rath, and M. Rieger, “Lorentz Boost Networks: Autonomous Physics-Inspired Feature Engineering,” JINST 14 (2019) no. 06, P06006, arXiv:1812.09722 [hep-ex].
[9] L. Moore, K. Nordström, S. Varma, and M. Fairbairn, “Reports of My Demise Are Greatly Exaggerated: N-subjettiness Taggers Take On Jet Images,” SciPost Phys. 7 (2019) no. 3, 036, arXiv:1807.04769 [hep-ph].
[10] B. M. Dillon, D. A. Faroughy, and J. F. Kamenik, “Uncovering latent jet substructure,” Phys. Rev. D100 (2019) no. 5, 056002, arXiv:1904.04200 [hep-ph].
[11] P. T. Komiske, E. M. Metodiev, and J. Thaler, “Energy Flow Networks: Deep Sets for Particle Jets,” JHEP 01 (2019) 121, arXiv:1810.05165 [hep-ph].
[12] H. Qu and L. Gouskos, “ParticleNet: Jet Tagging via Particle Clouds,” arXiv:1902.08570 [hep-ph].
[13] E. A. Moreno, T. Q. Nguyen, J.-R. Vlimant, O. Cerri, H. B. Newman, A. Periwal, M. Spiropulu, J. M. Duarte, and M. Pierini, “Interaction networks for the identification of boosted $H \to b\bar{b}$ decays,” arXiv:1909.12285 [hep-ex].
[14] CMS Collaboration, “Machine learning-based identification of highly Lorentz-boosted hadronically decaying particles at the CMS experiment,” CMS-PAS-JME-18-002 (2019).
[15] ATLAS Collaboration, “Performance of top-quark and W-boson tagging with ATLAS in Run 2 of the LHC,” Eur. Phys. J. C79 (2019) no. 5, 375, arXiv:1808.07858 [hep-ex].
[16] A. Butter et al., “The Machine Learning Landscape of Top Taggers,” SciPost Phys. 7 (2019) 014, arXiv:1902.09914 [hep-ph].
[17] L. de Oliveira, M. Kagan, L. Mackey, B. Nachman, and A. Schwartzman, “Jet-images -- deep learning edition,” JHEP 07 (2016) 069, arXiv:1511.05190 [hep-ph].
[18] Y.-C. J. Chen, C.-W. Chiang, G. Cottin, and D. Shih, “Boosted W/Z Tagging with Jet Charge and Deep Learning,” arXiv:1908.08256 [hep-ph].
[19] ATLAS Collaboration, “Identification of Jets Containing b-Hadrons with Recurrent Neural Networks at the ATLAS Experiment,” Tech. Rep. ATL-PHYS-PUB-2017-003, CERN, Geneva, Mar, 2017. http://cds.cern.ch/record/2255226.
[20] CMS Collaboration, “Performance of the DeepJet b tagging algorithm using 41.9/fb of data from proton-proton collisions at 13 TeV with Phase 1 CMS detector,” CMS-DP-2018-058 (Nov, 2018). http://cds.cern.ch/record/2646773.
[21] J. Lin, M. Freytsis, I. Moult, and B. Nachman, “Boosting $H \to b\bar{b}$ with Machine Learning,” JHEP 10 (2018) 101, arXiv:1807.10768 [hep-ph].
[22] P. T. Komiske, E. M. Metodiev, and M. D. Schwartz, “Deep learning in color: towards automated quark/gluon jet discrimination,” JHEP 01 (2017) 110, arXiv:1612.01551 [hep-ph].
[23] G. Kasieczka, N. Kiefer, T. Plehn, and J. M. Thompson, “Quark-Gluon Tagging: Machine Learning vs Detector,” SciPost Phys. 6 (2019) 069, arXiv:1812.09223 [hep-ph].
[24] H. Luo, M.-x. Luo, K. Wang, T. Xu, and G. Zhu, “Quark jet versus gluon jet: fully-connected neural networks with high-level features,” Sci. China Phys. Mech. Astron. 62 (2019) no. 9, 991011, arXiv:1712.03634 [hep-ph].
[25] K. Fraser and M. D. Schwartz, “Jet Charge and Machine Learning,” JHEP 10 (2018) 093, arXiv:1803.08066 [hep-ph].
[26] M. Erdmann, B. Fischer, and M. Rieger, “Jet-parton assignment in $t\bar{t}H$ events using deep learning,” JINST 12 (2017) no. 08, P08020, arXiv:1706.01117 [hep-ex].
[27] S. Diefenbacher, H. Frost, G. Kasieczka, T. Plehn, and J. M. Thompson, “CapsNets Continuing the Convolutional Quest,” arXiv:1906.11265 [hep-ph].
[28] ATLAS Collaboration, “Search for pair production of heavy vector-like quarks decaying into hadronic final states in pp collisions at $\sqrt{s} = 13$ TeV with the ATLAS detector,” Phys. Rev. D98 (2018) no. 9, 092005, arXiv:1808.01771 [hep-ex].
[29] CMS Collaboration, “Search for pair production of vectorlike quarks in the fully hadronic final state,” Phys. Rev. D100 (2019) no. 7, 072001, arXiv:1906.11903 [hep-ex].
[30] CMS Collaboration, “Search for direct top squark pair production in events with one lepton, jets, and missing transverse momentum at 13 TeV with the CMS experiment,” (2019), arXiv:1912.08887 [hep-ex].
[31] G. Louppe, M. Kagan, and K. Cranmer, “Learning to Pivot with Adversarial Networks,” arXiv:1611.01046 [stat.ME].
[32] C. Englert, P. Galler, P. Harris, and M. Spannowsky, “Machine Learning Uncertainties with Adversarial Neural Networks,” Eur. Phys. J. C79 (2019) no. 1, 4, arXiv:1807.08763 [hep-ph].
[33] P. Windischhofer, M. Zgubič, and D. Bortoletto, “Preserving physically important variables in optimal event selections: A case study in Higgs physics,” arXiv:1907.02098 [hep-ph].
[34] S. Wunsch, S. Jörger, R. Wolf, and G. Quast, “Reducing the dependence of the neural network function to systematic uncertainties in the input space,” arXiv:1907.11674 [physics.data-an].
[35] G. J. Székely, M. L. Rizzo, and N. K. Bakirov, “Measuring and testing dependence by correlation of distances,” Ann. Statist. 35 (2007) no. 6, 2769–2794. https://doi.org/10.1214/009053607000000505.
[36] G. J. Székely and M. L. Rizzo, “Brownian distance covariance,” Ann. Appl. Stat. 3 (2009) no. 4, 1236–1265. https://doi.org/10.1214/09-AOAS312.
[37] G. J. Székely and M. L. Rizzo, “The distance correlation t-test of independence in high dimension,” J. Multivar. Anal. 117 (2013) 193–213. http://dx.doi.org/10.1016/j.jmva.2013.02.012.
[38] G. J. Székely and M. L. Rizzo, “Partial distance correlation with methods for dissimilarities,” Ann. Statist. 42 (2014) no. 6, 2382–2412. https://doi.org/10.1214/14-AOS1255.
[39] R. Li, W. Zhong, and L. Zhu, “Feature screening via distance correlation learning,” Journal of the American Statistical Association 107 (2012) no. 499, 1129–1139.
[40] A. Villaverde and J. Banga, “Reverse engineering and identification in systems biology: Strategies, perspectives and challenges,” Journal of the Royal

Society 11 (02, 2014) 20130505.
[41] ATLAS Collaboration, “Performance of mass-decorrelated jet substructure observables for hadronic two-body decay tagging in ATLAS,” ATL-PHYS-PUB-2018-014 (Jul, 2018). http://cds.cern.ch/record/2630973.
[42] J. Dolen, P. Harris, S. Marzani, S. Rappoccio, and N. Tran, “Thinking outside the ROCs: Designing Decorrelated Taggers (DDT) for jet substructure,” JHEP 05 (2016) 156, arXiv:1603.00027 [hep-ph].
[43] I. Moult, B. Nachman, and D. Neill, “Convolved Substructure: Analytically Decorrelating Jet Substructure Observables,” JHEP 05 (2018) 002, arXiv:1710.06859 [hep-ph].
[44] C. Shimmin, P. Sadowski, P. Baldi, E. Weik, D. Whiteson, E. Goul, and A. Søgaard, “Decorrelated Jet Substructure Tagging using Adversarial Neural Networks,” Phys. Rev. D96 (2017) no. 7, 074034, arXiv:1703.03507 [hep-ex].
[45] S. Chandar, M. M. Khapra, H. Larochelle, and B. Ravindran, “Correlational neural networks,” CoRR abs/1504.07225 (2015), arXiv:1504.07225. http://arxiv.org/abs/1504.07225.
[46] M. I. Belghazi, A. Baratin, S. Rajeswar, S. Ozair, Y. Bengio, A. Courville, and R. D. Hjelm, “Mine: Mutual information neural estimation,” arXiv:1801.04062 [cs.LG].
[47] S. Nowozin, B. Cseke, and R. Tomioka, “f-gan: Training generative neural samplers using variational divergence minimization,” 2016.
[48] S. Mohamed and B. Lakshminarayanan, “Learning in implicit generative models,” arXiv:1610.03483 [stat.ML].
[49] K. Cranmer, J. Pavez, and G. Louppe, “Approximating Likelihood Ratios with Calibrated Discriminative Classifiers,” arXiv:1506.02169 [stat.AP].
[50] G. J. Székely and M. L. Rizzo, “On the uniqueness of distance covariance,” Statistics & Probability Letters 82 (2012) no. 12, 2278–2282. http://www.sciencedirect.com/science/article/pii/S0167715212003124.
[51] T. Sjostrand, S. Mrenna, and P. Z. Skands, “A Brief Introduction to PYTHIA 8.1,” Comput. Phys. Commun. 178 (2008) 852–867, arXiv:0710.3820 [hep-ph].
[52] DELPHES 3 Collaboration, J. de Favereau, C. Delaere, P. Demin, A. Giammanco, V. Lemaître, A. Mertens, and M. Selvaggi, “DELPHES 3, A modular framework for fast simulation of a generic collider experiment,” JHEP 02 (2014) 057, arXiv:1307.6346 [hep-ex].
[53] M. Cacciari, G. P. Salam, and G. Soyez, “FastJet User Manual,” Eur. Phys. J. C72 (2012) 1896, arXiv:1111.6097 [hep-ph].
[54] M. Cacciari, G. P. Salam, and G. Soyez, “The anti-kt jet clustering algorithm,” JHEP 04 (2008) 063, arXiv:0802.1189 [hep-ph].
[55] D. Krohn, J. Thaler, and L.-T. Wang, “Jet Trimming,” JHEP 02 (2010) 084, arXiv:0912.1342 [hep-ph].
[56] S. Macaluso and D. Shih, “Pulling Out All the Tops with Computer Vision and Deep Learning,” JHEP 10 (2018) 121, arXiv:1803.00107 [hep-ph].
[57] G. Kasieczka and D. Shih, “Datasets for boosted W tagging,” Jan., 2020. https://doi.org/10.5281/zenodo.3606767.
[58] J. Thaler and K. Van Tilburg, “Identifying Boosted Objects with N-subjettiness,” JHEP 03 (2011) 015, arXiv:1011.2268 [hep-ph].
[59] A. J. Larkoski, I. Moult, and D. Neill, “Power Counting to Better Jet Observables,” JHEP 12 (2014) 009, arXiv:1409.6298 [hep-ph].
[60] A. J. Larkoski, I. Moult, and D. Neill, “Analytic Boosted Boson Discrimination,” JHEP 05 (2016) 117, arXiv:1507.03018 [hep-ph].
[61] S. A. Dudani, “The distance-weighted k-nearest-neighbor rule,” IEEE Transactions on Systems, Man, and Cybernetics SMC-6 (1976) no. 4, 325–327. https://doi.org/10.1109%2Ftsmc.1976.5408784.
[62] J. Stevens and M. Williams, “uBoost: A boosting method for producing uniform selection efficiencies from multivariate classifiers,” JINST 8 (2013) P12013, arXiv:1305.7248 [nucl-ex].
[63] L. Bradshaw, R. K. Mishra, A. Mitridate, and B. Ostdiek, “Mass Agnostic Jet Taggers,” arXiv:1908.08959 [hep-ph].
[64] G. J. Szekely and M. L. Rizzo, “Partial distance correlation with methods for dissimilarities,” arXiv:1310.2926 [stat.ME].
[65] J. Freeman, J. D. Lewis, W. Ketchum, S. Poprocki, A. Pronko, V. Rusu, and P. Wittich, “An Artificial neural network based b jet identification algorithm at the CDF Experiment,” Nucl. Instrum. Meth. A663 (2012) 37–47, arXiv:1108.4738 [hep-ex].
[66] CMS Collaboration, “Top Tagging with New Approaches,” CMS-PAS-JME-15-002 (2016).
[67] ATLAS Collaboration, “Identification of boosted, hadronically decaying W bosons and comparisons with ATLAS data taken at $\sqrt{s} = 8$ TeV,” Eur. Phys. J. C76 (2016) no. 3, 154, arXiv:1510.05821 [hep-ex].
[68] ATLAS Collaboration, “Identification of high transverse momentum top quarks in pp collisions at $\sqrt{s} = 8$ TeV with the ATLAS detector,” JHEP 06 (2016) 093, arXiv:1603.03127 [hep-ex].
[69] S. Chang, T. Cohen, and B. Ostdiek, “What is the Machine Learning?,” Phys. Rev. D97 (2018) no. 5, 056009, arXiv:1709.10106 [hep-ph].
[70] ATLAS Collaboration, “Performance of mass-decorrelated jet substructure observables for hadronic two-body decay tagging in ATLAS,” Tech. Rep. ATL-PHYS-PUB-2018-014, CERN, Geneva, Jul, 2018. http://cds.cern.ch/record/2630973.
[71] A. Rogozhnikov et al., “hep_ml: Machine Learning for High Energy Physics, version 0.6,”. https://github.com/arogozhnikov/hep_ml.
[72] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv:1502.03167 [cs.LG].