Deep Neural Networks for Physics Analysis on low-level whole-detector data at the LHC

Wahid Bhimji, Steve Farrell, Thorsten Kurth, Michela Paganini, Prabhat, Evan Racah (Lawrence Berkeley National Laboratory). ACAT 2017, 21st August 2017

- 1 - Introduction / Aims

• Use Deep Neural Networks (NNs) on 'raw' data directly for physics analysis:
  – Without reconstruction of physics objects like jets; without tuning of analysis variables; using data from the whole calorimeter/detector
  – Cutting-edge methods: performance and interpretation
• Run efficiently on NERSC supercomputers
  – Primarily Intel Knights Landing (KNL) Xeon Phi CPU based
  – Distributed training (up to ~10k KNL nodes)
  – Timings, optimisations and recipes

- 2 - Physics Use-Case

• Search for RPV SUSY gluino decays (from ATLAS-CONF-2016-057):
  – Multi-jet final state
  – Analysis from ATLAS-CONF-2016-057 used as a benchmark
  – Classification problem: RPV SUSY vs. QCD
• Simulated samples
  – Pythia event generation (matching the ATLAS configuration)

  – Cascade decay with m(gluino) = 1400 GeV, m(neutralino) = 850 GeV as default; explore other masses
  – Delphes detector simulation (ATLAS card)
• Output calorimeter towers (and tracks) used in the analysis

- 3 - Data processing

• Bin calorimeter tower energy in η/φ to form an 'image' (sketch below)
  – 64x64 bins (~0.1 wide in η/φ, matching the tower granularity) or 224x224
• Also try 3 'channels' (à la RGB images)¹:
  – Energy in the electromagnetic and hadronic calorimeters and the number of tracks in each bin
• Reconstruct jets using the same algorithm as the physics analysis (anti-kt R=1.0, trimmed) for benchmark comparison and pre-selection

¹ Similar to Komiske, Metodiev, and Schwartz, arXiv:1612.01551
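A minimal sketch of this binning step, assuming per-event numpy arrays of tower η, φ and energy; the array names, the η range and the use of numpy are illustrative assumptions, not taken from the analysis code:

    import numpy as np

    def towers_to_image(eta, phi, energy, n_bins=64, eta_max=3.2):
        # Histogram calorimeter towers into an n_bins x n_bins eta/phi 'image';
        # with n_bins=64 and |eta| < 3.2 the bins are ~0.1 wide, as on this slide.
        image, _, _ = np.histogram2d(
            eta, phi, bins=n_bins,
            range=[[-eta_max, eta_max], [-np.pi, np.pi]],
            weights=energy)
        return image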

- 4 - Convolutional (CNN) Architecture

• Popular architecture for natural images and now many HEP studies (so not explained in detail here)
  – Learn non-linear 'filters' that slide across the image: shared filters reduce the number of weights
  – Local structure / translational invariance
  – Stacked layers respond to different scales
  (Figure from Dumoulin and Visin, arXiv:1603.07285)
• We use 3 alternating convolutional and pooling layers (or 4 for the larger images), with bias and/or … (layer shapes and a sketch below)

• QCD generated in pT ranges:
  – Cross-section weights used in the training loss and in evaluation

Layer shapes (channels x height x width, or units):
  Input:                1 (or 3) x 64 x 64
  Conv+Pool (1):        64 x 32 x 32
  Conv+Pool (2):        128 x 8 x 8
  Conv+Pool (3):        256 x 4 x 4
  (Conv+Pool (4):       only for the larger images)
  Fully Connected (FC): 4096
  FC:                   512
  Output:               1
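A rough Keras reconstruction of this stack, sized to reproduce the feature-map shapes above; the kernel sizes, pooling factors, ReLU activations and sigmoid output are assumptions not stated on the slide:

    from keras.models import Sequential
    from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

    model = Sequential([
        Conv2D(64, (3, 3), padding='same', activation='relu',
               input_shape=(64, 64, 1)),       # 1 (or 3) input channels
        MaxPooling2D((2, 2)),                  # -> 64 x 32 x 32
        Conv2D(128, (3, 3), padding='same', activation='relu'),
        MaxPooling2D((4, 4)),                  # -> 128 x 8 x 8
        Conv2D(256, (3, 3), padding='same', activation='relu'),
        MaxPooling2D((2, 2)),                  # -> 256 x 4 x 4
        Flatten(),
        Dense(4096, activation='relu'),        # the 4096 in the table may simply
        Dense(512, activation='relu'),         # be the flattened conv output
        Dense(1, activation='sigmoid'),        # P(signal)
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy')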

- 5 - CNN Performance

• Need good signal efficiency (high True Positive Rate, TPR) and high background rejection (low False Positive Rate, FPR)
  – CNN working point: TPR = 0.77, AMS = 4.2
  – Compare to the physics selections (see backup): TPR = 0.41, AMS = 2.3
  – ROC curve shown relative to the preselection
• Increased signal efficiency at the same background rejection, without using jet variables
• Also compare AMS (approximate median significance), accounting for the initial pre-selection and luminosity (formula sketched below)
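For reference, a sketch of the standard approximate median significance, assuming the commonly used regularised form; the regularisation term b_reg and its value are assumptions, as the slide does not give the exact definition used:

    import math

    def ams(s, b, b_reg=10.0):
        # Approximate median significance for s expected signal and
        # b expected background events (weighted, after pre-selection).
        return math.sqrt(2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s))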

- 6 - Compare to shallow classifier

• Try a (gradient) boosted decision tree (GBDT) and a 1-hidden-layer NN (MLP); a sketch follows below
  – Inputs: the jet variables used in the physics analysis (sum of jet masses, number of jets, Δη between the leading 2 jets) and the 4-momenta of the first 5 jets
• These outperform the cut-based selections, but the CNN performs better
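A minimal sketch of such shallow baselines, assuming scikit-learn; the hyper-parameters and the feature matrix X (jet variables plus the leading-5-jet 4-momenta) are illustrative, not the configuration used in the study:

    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.neural_network import MLPClassifier

    # X_train: (n_events, n_features) array of jet variables + 4-momenta; y_train: labels
    gbdt = GradientBoostingClassifier(n_estimators=200, max_depth=3)
    mlp = MLPClassifier(hidden_layer_sizes=(64,))      # single hidden layer

    gbdt.fit(X_train, y_train, sample_weight=w_train)  # cross-section weights
    mlp.fit(X_train, y_train)
    gbdt_score = gbdt.predict_proba(X_test)[:, 1]      # P(signal)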

- 7 - Weights

• Cross-section weights applied in the training loss
  – Some QCD background events weighted ~10^7 relative to the RPV signal
• Try the log of the weights (sketch below)
  – More stable implementation
  – More focussed on signal performance
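A sketch of how such per-event weights could enter the training loss, assuming a Keras-style fit with sample weights; the use of log1p for the log-compressed variant is an assumption:

    import numpy as np

    w_full = xsec_weight                   # raw cross-section weights (span ~10^7)
    w_log = np.log1p(xsec_weight)          # log-compressed variant (assumed log1p)

    # Pass either weighting to the per-event loss term during training
    model.fit(images, labels, sample_weight=w_log, batch_size=512, epochs=10)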

- 8 - Channels

• Three-channel CNN
  – Separate energy in the electromagnetic and hadronic calorimeters
  – Number of tracks in the same η/φ bin
• Further improves performance

- 9 - Further improving performance

• The implementation with the full weights and the one with the log weights focus differently on signal and background
• Can ensemble these by taking the mean of their predictions; this gives the best performance (sketch below)
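The ensembling step itself is just an average of the two models' signal probabilities; a sketch (the model names are illustrative):

    p_full = model_full_weights.predict(images)   # CNN trained with full weights
    p_log  = model_log_weights.predict(images)    # CNN trained with log weights
    p_ensemble = 0.5 * (p_full + p_log)           # mean of predictions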

- 10 - Robustness to different signals

• Model trained on a specific cascade decay: gluino mass (MGlu) of 1400 GeV and neutralino mass (MNeu) of 850 GeV
• Apply this model to other signal samples without retraining
• Still good performance

- 11 - Pileup

• Most studies here use Delphes without pileup
• Repeat with the Delphes pileup card (mu = 20)
• The physics selections have lower background rejection
• The CNN still performs well
  – 1-channel CNN shown

- 12 - Comparing CNN to jet variables

• Plot the NN output (P(signal)) vs. the benchmark analysis variable
• Clear correlation
  – (Signal-region cuts: N_jets >= 4 (5) and Sum M_jet >= 800 (600) GeV)
• Add the jet variable to the CNN output in a 1-layer NN (sketch below)
  – Little/no increase in performance
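A sketch of the 1-layer combination, assuming Keras and taking the CNN score plus the summed jet mass as the two inputs (the exact jet variable used for the combination is an assumption):

    import numpy as np
    from keras.models import Sequential
    from keras.layers import Dense

    X = np.column_stack([cnn_score, sum_jet_mass])   # CNN P(signal) + jet variable
    combiner = Sequential([Dense(1, activation='sigmoid', input_dim=2)])
    combiner.compile(optimizer='adam', loss='binary_crossentropy')
    combiner.fit(X, labels, epochs=5)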

- 13 - Running at NERSC

- 14 - NERSC and Cori

• NERSC at LBL: production HPC center for the US Dept. of Energy
  – >7000 diverse users across science domains, including many outside HEP
• Cori: NERSC's newest supercomputer, a Cray XC40 (31.4 PF peak)
  – Phase 1: 2388 nodes with dual 16-core Intel Haswell (2.3 GHz), 128 GB DDR4 DRAM
  – Phase 2: 9668 Intel Knights Landing (KNL) nodes: Xeon Phi 68-core (1.4 GHz), 4 hardware threads; AVX-512 vector pipelines; 16 GB MCDRAM, 96 GB DDR4
  – Cray Aries high-speed "dragonfly" topology interconnect
• Many popular deep learning frameworks available
  – Caffe; Keras; Lasagne; PyTorch; Tensorflow; Theano
  – Working with Intel to improve CPU (KNL) performance

- 15 - Timing the RPV SUSY CNN

• Implemented the CNN network in different frameworks:
  – (Pure) Tensorflow, Keras (Theano and TF), Lasagne (Theano), Caffe
• Aim to drive multi-node Cori CPU performance to be comparable with GPU (for real use-cases):
  – Not aiming for an exact comparison: the implementations in the frameworks differ slightly and some have been optimised
• Compare training time (per batch, ignoring I/O) for:
  – GPU: Titan X (Pascal) (10.2 TeraFlops single-precision peak)
  – CPU: Haswell E5-2698 v3, 32 cores @ 2.3 GHz (2.4 TF)
  – KNL: Xeon Phi 7250, 68 cores @ 1.4 GHz (6 TF)

- 16 - Timings and Tensorflow

• CPU performance of default TF 1.2 is poor
• Intel optimisations use the Intel Math Kernel Library (MKL): e.g. conv layers multi-threaded, vectorised over channels/filters, with cache blocking
  – Now in the main TF repo
• Further optimisations (released soon): e.g. MKL element-wise operations (avoiding MKL->Eigen conversions); per-batch time 6x faster for this 64x64 network
• (Intel) Caffe: similar optimisations, plus multi-node with the MLSL library, e.g. scaling to 8 KNL nodes

(Figure: time per batch in seconds, batch size 512, for Lasagne+Theano, Keras+Theano, Keras+Tensorflow, Keras+TF (Intel), Keras+TF (Latest) and Caffe, on GPU, Haswell CPU, single-node KNL and 8-node KNL)

- 17 - Scaling up: "Deep Learning at 15PF" (accepted for SC17), arXiv:1708.05256
Thorsten Kurth, Jian Zhang, Nadathur Satish, Ioannis Mitliagkas, Evan Racah, Mostofa Patwary, Tareq Malas, Narayanan Sundaram, Wahid Bhimji, Mikhail Smorkalov, Jack Deslippe, Mikhail Shiryaev, Srinivas Sridharan, Prabhat, Pradeep Dubey

Hybrid architecture:
• Train on 10 million 224x224 3-channel images (7.4 TB)
• Caffe implementation: multi-node, data parallel
  – Use the Intel MLSL library (wraps the communications; portable)
• Sync/Async and Hybrid strategies (sketch below):
  – Sync: barriers so that nodes iterate together
    • can have straggler nodes and limits the batch size
  – Async: uses parameter servers to scale better
    • can have stale gradients, so may not converge as fast
  – Hybrid: synchronous within a group and asynchronous across groups
    • dedicated parameter servers for each layer of the network
• Modify our CNN layers to reduce communication: remove batch norm. and replace the big (~200 MB) fully connected layers with a convolutional layer

(Figure: hybrid architecture, with per-layer parameter servers (Layer 1 ... Layer N PS), worker groups 1 ... G, and the model-update / new-model flow)
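For intuition only, a toy sketch of the synchronous data-parallel update with mpi4py; this is not the Intel-Caffe/MLSL implementation used in the paper, and the parameter-server and hybrid variants are not shown:

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD

    def sync_update(params, local_grads, lr=0.01):
        # Each worker computes gradients on its own batch; gradients are
        # summed across all workers so every node applies the same update.
        for p, g in zip(params, local_grads):
            g_sum = np.zeros_like(g)
            comm.Allreduce(g, g_sum, op=MPI.SUM)
            p -= lr * g_sum / comm.Get_size()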

- 18 - Scaling up: results. T. Kurth et al., "Deep Learning at 15PF" (accepted for SC17), arXiv:1708.05256

• Single node: 1.9 TF (about 1/3 of peak)
• Strong scaling (overall batch size fixed):
  – The hybrid approach reduces communication and straggler effects
• Weak scaling (constant batch per node):
  – Good scaling, though affected by variability from communication after the fast convolutional layers
• Scaled to 9600 KNL nodes: 11.73 PF single-precision (6170x the single-node rate)
• Time to solution (a target loss) also scales (the 1024-node time is 1/11 of the 64-node time)

- 19 - Conclusions

• Implemented a deep CNN on large whole-detector 'images' directly for physics analysis
  – Outperforms physics-variable-based selections (and shallow classifiers) without jet reconstruction
  – Further improvements from adding 3 channels, modifying the weights, and ensembling models
  – Network is robust to pileup and to application to other signal masses, and appears to learn the physics of interest
• Used to benchmark and improve popular deep learning libraries on CPU, including Xeon Phi/KNL at NERSC
  – Demonstrated distributed training on up to 9600 KNL nodes

- 20 - Thanks: Ben Nachman and Brian Amadio (LBL) for discussions and physics input. Mustafa Mustafa (LBL) for help with Tensorflow optimisations.

Code and sample datasets will be made available with proceedings

- 21 - Backups

- 22 - Benchmark Analysis

Fat-jet object selection:
• Anti-kt R=1.0 trimmed (R_trim = 0.2, pT_frac = 0.05)
• pT > 200 GeV, |η| < 2.0

Preselection:
• Leading fat-jet pT > 440 GeV
• N_fat-jet > 2

Analysis selection (sketch below):
• |Δη_12| between the leading 2 fat-jets < 1.4
• N_fat-jet >= 4 and Sum M_fat-jet > 800 GeV
• or N_fat-jet >= 5 and Sum M_fat-jet > 600 GeV
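An illustrative way to apply these cuts with numpy; the array names are hypothetical, while the cut values follow the slide above:

    import numpy as np

    presel = (leading_fatjet_pt > 440.) & (n_fatjet > 2)
    analysis = (np.abs(deta_12) < 1.4) & (
        ((n_fatjet >= 4) & (sum_fatjet_mass > 800.)) |
        ((n_fatjet >= 5) & (sum_fatjet_mass > 600.)))
    selected = presel & analysis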

- 23 - Interpretation - feature maps

(Figures: learned feature maps for a background QCD event and a signal RPV event)

- 24 - Scaling

• Single-node performance per layer for the 224x224 Caffe implementation
• Time to a loss of 0.05 (corresponds to a fixed significance)
• At 1024 nodes the time is 1/11 of the 64-node time (scales as expected), and the hybrid time is 1.66x the sync time

- 25 - Further work:

Exploring Graph CNNs (with G. Rochette, J. Bruna, G. Louppe, K. Cranmer, NYU)
• Use a list of clusters rather than an image
• Hybrid between a graph network and a CNN
  – Represent clusters as nodes of a graph, with interactions/similarity as edge weights (sketch below)
• Model the interactions, and achieve precision without the sparsity of the image representation
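A hypothetical sketch of the graph construction: clusters become nodes and an edge weight encodes their similarity, here a Gaussian kernel in (η, φ) distance; the actual similarity/interaction used in that work is not specified here:

    import numpy as np

    def adjacency(eta, phi, sigma=0.4):
        # Dense similarity matrix over clusters from their (eta, phi) positions.
        deta = eta[:, None] - eta[None, :]
        dphi = np.arctan2(np.sin(phi[:, None] - phi[None, :]),
                          np.cos(phi[:, None] - phi[None, :]))  # wrap to [-pi, pi]
        dr2 = deta**2 + dphi**2
        return np.exp(-dr2 / (2.0 * sigma**2))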
