Pipelines with Apache Spark and Intel BigDL

M. Migliorini1,2, V. Khristenko1 M. Pierini1, E. Motesnitsalis1, L. Canali1, M. Girone1 1)CERN, Geneva, Switzerland; 2)University of Padova, Padova, Italy

End-to-End ML Pipeline HEP use case

• The goal of this work is to produce a demonstrator of an end-to-end • The ability to classify events is of fundamental importance and Machine Learning pipeline using Apache Spark proved to be able to outperform other ML methods • See paper: “Topology classification with deep learning to improve • Investigate and develop solutions integrating: real time event selection at LHC” (arXiv:1807.00083v2) Data Engineering/Big Data tools • W+j • Machine Learning tools tt ̅ (0.4%) Data analytics platform • W + jets Topology QCD (63.4%) QCD Classifier (36.2%) • Use Industry standard tools: Isolated • Well known and widely adopted Lepton tt̅ • Open the HEP field to a larger community • The Pipeline is composed by the following stages: Particle-sequence High level feature Inclusive classifier classifier classifier

Data Feature Parameters Fully connected neural Training taking as Particle-sequence Ingestion Preparation Tuning Network taking as input a list of features classifier + information input a list of particles computed from low coming from the HLF level features

Data Feature Ingestion Preparation

• Filter events: require the presence of isolated leptons + Dev. dataset Model #1 • Prepare input for the classifiers • Produce multiple datasets • Access physics data stored in • Raw data (list of particles) EOS using Hadoop-XRootD • High Level features Model #2 Connector Full dataset • Store results in parquet files EOS storage • Read ROOT files into a Spark • Dev. dataset (100k events) Input: DF using Spark-ROOT reader • 10 TB of ROOT files • Full dataset (5M events) • 50M events Parameters Tuning

Trained the three models using Training Best Model various hardware and configurations

• Scan a grid of parameters to find the best model • Train multiple models at the + + same time (one per executor)

Summary

Observed near linear scalability of Intel BigDL • Created an end-to-end ML pipeline using Apache Spark • Python & Spark allow to distribute computation in a simple way • Intel BigDL scales well and it is easy to use because it has a similar API to • Interactive analysis made easier by Jupiter Notebooks • Future work • Test pipeline using cloud resources • Further performance improvements on data preparation and training • Model Serving: implement inference Reproduced the classifiers performance Throughput test measurements on the three on streaming data of the source paper different training methods and model types

https://openlab.cern/