arXiv:2001.05310v2 [hep-ph] 30 Sep 2020
DisCo Fever: Robust Networks Through Distance Correlation

Gregor Kasieczka¹,* and David Shih²,³,⁴,†

¹ Institut für Experimentalphysik, Universität Hamburg, 22761 Hamburg, Germany
² NHETC, Dept. of Physics and Astronomy, Rutgers University, Piscataway, NJ 08854 USA
³ Theory Group, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
⁴ Berkeley Center for Theoretical Physics, University of California, Berkeley, CA 94720, USA

* Electronic address: [email protected]
† Electronic address: [email protected]

While deep learning has proven to be extremely successful at supervised classification tasks at the LHC and beyond, for practical applications, raw classification accuracy is often not the only consideration. One crucial issue is the stability of network predictions, either versus changes of individual features of the input data, or against systematic perturbations. We present a new method based on a novel application of "distance correlation" (DisCo), a measure quantifying non-linear correlations, that achieves equal performance to state-of-the-art adversarial decorrelation networks but is much simpler and more stable to train. To demonstrate the effectiveness of our method, we carefully recast a recent ATLAS study of decorrelation methods as applied to boosted, hadronic W-tagging. We also show the feasibility of DisCo regularization for more powerful convolutional neural networks, as well as for the problem of hadronic top tagging.

Introduction

Recent breakthroughs in deep learning have begun to revolutionize many areas of high energy physics. One area that has received considerable focus is the problem of classifying different types of jets at the LHC. Deep neural networks have been applied, for example, to distinguishing top quarks from light quark and gluon jets. For this problem a large number of architectures based on fully connected neural networks [1, 2], image-based methods [3, 4], recursive clustering [5, 6], physics variables [7–10], sets [11], and graphs [12, 13] have been studied [14–16]. Related challenges of identifying vector bosons [17, 18], b-quarks [19, 20], Higgs bosons [13, 21], and distinguishing light quark from gluon jets [22–25] have seen similar progress. Beyond classifying single particles in an event, there is also work on developing holistic methods that classify full events according to the likely physics process that produced them [26, 27]. Finally, some of these novel deep learning methods are beginning to be applied to concrete experimental analyses, see e.g. [28–30].

So far, the recent activity in developing better jet classifiers with deep learning has focused on maximizing their raw performance. However, the most accurate classifier is often not the best one for actual experimental applications. Instead, what is often desired is the most accurate classifier given the constraint that it is decorrelated with one or more auxiliary variables.

The underlying reason for this requirement is that classifiers are trained on Monte Carlo (MC) simulated examples (for which perfect truth labels are available), but are applied to (unlabeled) collision data. While the simulated events are of high fidelity, they do not perfectly reproduce the real data, and this gives rise to systematic differences between training and testing data. Understanding and mitigating these systematic differences is essential in any experimental analysis, and having a decorrelated classifier has many applications in this regard. For example, if the sources of systematic uncertainty are known, one can attempt to explicitly decorrelate a classifier against them in order to reduce or eliminate their effects [31–34]. Or, one can attempt to control for these systematic differences using data-driven methods, such as sidebanding in the invariant mass.¹ If the signal is localized but the background is smooth in mass, the sideband method allows one to calculate MC vs. data correction factors, define control samples, and estimate backgrounds. But if the classifier sculpts features (e.g. bumps) into the background mass distribution, it cannot be relied on for sidebanding. A classifier that is decorrelated with mass is sufficient (although not necessary) to guarantee smoothness of the background mass distribution.

¹ Although different auxiliary variables can be used in experimental analyses, one of the most common choices is invariant mass. So for concreteness, and without loss of generality, we will focus on the case of invariant mass for the remainder of this paper.

The issue is especially acute for powerful multivariate classifiers such as neural networks, which will have a strong incentive to "learn the mass" when building the optimal discriminant. Even if one excludes mass from the list of inputs to the machine learning algorithm, it may not be enough to achieve a decorrelated classifier: many of the other inputs may be correlated with mass, and machine learning methods in general are flexible enough to exploit correlations of inputs. Such improvements will be especially relevant for (but not limited to) searches for new resonances with unknown mass. The identification of resonances in invariant mass distributions is historically the main avenue to discovery in experimental particle physics, and relies on robust background estimates. Therefore an important and significant challenge is to design classifiers that are as fully decorrelated from mass as possible while using maximal information.

In this paper we will present a new method for training decorrelated classifiers which achieves performance comparable to state-of-the-art methods, while being much easier to train. The key observation is that a statistical measure called Distance Correlation (DisCo) [35–38] is sensitive to general, nonlinear correlations between two random variables and can be efficiently computed from finite samples. Distance correlation is well-known in statistics and has been applied to various fields including data science [39] and biology [40]. To our knowledge, this is the first application of DisCo to particle physics.

By including DisCo as an additive regularizer term in the loss function, we demonstrate that we can achieve a state-of-the-art decorrelated classifier with just one additional hyperparameter (the coefficient of the DisCo regularizer). By varying this coefficient, we can control the tradeoff between classification performance and decorrelation, interpolating between a fully decorrelated tagger and a fully performant one.

To validate our methods and rigorously demonstrate that they are state-of-the-art, we will carefully reproduce the results of a recent ATLAS study of decorrelated taggers for identifying boosted W bosons [41]. This study includes a comprehensive set of decorrelation methods, including [31, 42–44]. The most promising technique so far (in terms of achieving the highest classifier performance for a given level of decorrelation) has been adversarially training a pair of neural networks: a classifier distinguishing different classes and an adversary predicting the mass [31, 44] for a given classifier output.

The downside of the adversarial method has been that it is extremely difficult to implement in practice. Not only does one have to essentially train two separate neural networks, each with their own set of hyperparameters, but one has to carefully tune these two neural networks against each other. This stems from the nature of adversarial training: the objective is not to minimize a loss function, but rather to find a saddle point where the classifier loss is minimized but the adversary loss is maximized. Without careful tuning of learning rate schedules, number of epochs, minibatch sizes, etc., the training easily becomes unstable (since the loss is unbounded from below) and can quickly run away to a meaningless result.

Distance Correlation

Given a sample of paired vectors $(\vec{x}_i, \vec{y}_i)$ (where the index $i$ runs over the sample) drawn randomly from some distribution, we would like a function that measures the extent to which they are drawn from independent distributions, i.e. the extent to which $P_{\rm joint}(\vec{X}, \vec{Y}) = P_X(\vec{X})\,P_Y(\vec{Y})$. In order for this function to be applicable in a deep learning context, we also require that this function be differentiable and that it can be computed directly from the sample.

In our case the vectors are one dimensional and correspond to mass $X = m$ and classifier output $Y = y$, but clearly one can imagine many more applications of such a measure at the LHC and beyond.

The usual Pearson correlation coefficient $R$ only measures linear dependencies so it is not suitable for our purposes. Specifically, features can have nonlinear dependencies and still exhibit zero Pearson $R$.² There are many information-theoretic measures of similarity of distributions such as KL-divergence, Jensen-Shannon distance, and mutual information. These are difficult to compute directly from the sample, without binning. One can approximate these measures by training a classifier and using the likelihood ratio trick, but this again leads to adversarial methods, see e.g. [33, 46–49].

One measure that seems to fit the bill perfectly is "distance correlation", which originated in the works of [35–38]. It can be computed from the sample and it has the key property that it is zero iff $X$ and $Y$ are independent.

The definition of distance covariance is:

\[
\mathrm{dCov}^2(X, Y) = \int d^p s \, d^q t \; \left| f_{X,Y}(s, t) - f_X(s) f_Y(t) \right|^2 w(s, t) \tag{1}
\]

where $X \in \mathbb{R}^p$, $Y \in \mathbb{R}^q$, $f_X$ and $f_Y$ are the characteristic functions for the random variables $X$ and $Y$, and $f_{X,Y}$ is the joint characteristic function for $X$ and $Y$. Finally

\[
w(s, t) \propto |s|^{-(p+1)} \, |t|^{-(q+1)} \tag{2}
\]

is a weight function that is uniquely determined up to an overall normalization by the requirement that dCov is invariant under constant shifts and orthogonal trans-