Machine Learning Pipelines with Apache Spark and Intel BigDL

M. Migliorini 1,2, V. Khristenko 1, M. Pierini 1, E. Motesnitsalis 1, L. Canali 1, M. Girone 1
1) CERN, Geneva, Switzerland; 2) University of Padova, Padova, Italy

End-to-End ML Pipeline
• The goal of this work is to produce a demonstrator of an end-to-end Machine Learning pipeline using Apache Spark
• Investigate and develop solutions integrating:
  • Data Engineering / Big Data tools
  • Machine Learning tools
  • Data analytics platform
• Use industry-standard tools:
  • Well known and widely adopted
  • Open the HEP field to a larger community
• The pipeline is composed of the following stages: Data Ingestion → Feature Preparation → Parameters Tuning → Training

HEP Use Case
• The ability to classify events is of fundamental importance, and Deep Learning has proved able to outperform other ML methods
• See paper: “Topology classification with deep learning to improve real time event selection at LHC” (arXiv:1807.00083v2)
• [Figure: a topology classifier takes events containing an isolated lepton and assigns them to one of three classes: W + jets (63.4%), QCD (36.2%), or tt̅ (0.4%)]
• Three classifiers are compared:
  • High level feature (HLF) classifier: fully connected neural network taking as input a list of features computed from low level features
  • Particle-sequence classifier: recurrent neural network taking as input a list of particles
  • Inclusive classifier: particle-sequence classifier + information coming from the HLF

Data Ingestion
• Access physics data stored in EOS using the Hadoop-XRootD Connector
• Read ROOT files into a Spark DataFrame using the Spark-ROOT reader
• Input: 10 TB of ROOT files, 50M events

Feature Preparation
• Filter events: require the presence of isolated leptons
• Prepare input for the classifiers:
  • Raw data (list of particles)
  • High level features
• Produce multiple datasets and store the results in Parquet files:
  • Dev dataset (100k events)
  • Full dataset (5M events)

Parameters Tuning
• Scan a grid of parameters to find the best model
• Train multiple models at the same time (one per executor) on the dev dataset; the best model is then trained on the full dataset

Training
• Trained the three models using various hardware and configurations
• Observed near-linear scalability of Intel BigDL
• Reproduced the classifier performance of the source paper
• Measured throughput for the three different training methods and model types

Summary
• Created an end-to-end ML pipeline using Apache Spark
• Python and Spark allow computation to be distributed in a simple way
• Intel BigDL scales well and is easy to use thanks to its Keras-like API
• Interactive analysis is made easier by Jupyter notebooks
• Future work:
  • Test the pipeline using cloud resources
  • Further performance improvements on data preparation and training
  • Model serving: implement inference on streaming data

https://openlab.cern/
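The four pipeline stages can be pictured as a simple chain of dataset-to-dataset functions. The sketch below is a stdlib-only illustration of that structure; the stage bodies are hypothetical stand-ins, not the actual Spark jobs from the poster.

```python
# Illustrative, stdlib-only sketch of chaining the four pipeline stages:
# ingestion -> feature preparation -> parameter tuning -> training.
# All stage bodies are hypothetical stand-ins for the real Spark jobs.
from functools import reduce

def ingestion(_):
    # Stand-in for reading ROOT files into a list of event records.
    return [{"id": i, "n_isolated_leptons": i % 2} for i in range(6)]

def feature_preparation(events):
    # Stand-in for filtering events and deriving classifier inputs.
    return [e for e in events if e["n_isolated_leptons"] > 0]

def parameter_tuning(events):
    # Stand-in for the grid scan: attach the chosen hyperparameters.
    return {"events": events, "best_params": {"lr": 0.01}}

def training(state):
    # Stand-in for fitting the final model on the prepared dataset.
    return {"model": "trained", "n_events": len(state["events"]), **state["best_params"]}

stages = [ingestion, feature_preparation, parameter_tuning, training]
result = reduce(lambda data, stage: stage(data), stages, None)
print(result)  # -> {'model': 'trained', 'n_events': 3, 'lr': 0.01}
```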
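The feature-preparation stage starts by keeping only events that contain an isolated lepton. A minimal sketch of such a selection is shown below; the event schema, field names, and isolation threshold are all illustrative assumptions, not the poster's actual data model.

```python
# Hypothetical sketch of the event-filtering step: keep only events with
# at least one isolated lepton. Schema and threshold are illustrative.
ISO_THRESHOLD = 0.45  # assumed relative-isolation cut, for illustration only

def has_isolated_lepton(event):
    """True if any lepton in the event passes the isolation cut."""
    return any(lep["iso"] < ISO_THRESHOLD for lep in event["leptons"])

events = [
    {"id": 1, "leptons": [{"pt": 25.0, "iso": 0.10}]},  # isolated lepton
    {"id": 2, "leptons": [{"pt": 40.0, "iso": 0.90}]},  # lepton, not isolated
    {"id": 3, "leptons": []},                           # no leptons at all
]

selected = [e for e in events if has_isolated_lepton(e)]
print([e["id"] for e in selected])  # -> [1]
```

In the real pipeline this predicate would be applied as a Spark DataFrame filter over the 50M ingested events rather than a Python list comprehension.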
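The parameter-tuning stage scans a grid of hyperparameters, training several candidate models in parallel and keeping the best one. The sketch below shows that pattern with a thread pool standing in for Spark executors; the grid values and the toy scoring function are assumptions for illustration.

```python
# Sketch of the grid scan: enumerate hyperparameter combinations, "train"
# one candidate per worker in parallel, and keep the best configuration.
# The grid and the toy objective are illustrative; the poster's setup
# trains one model per Spark executor on the dev dataset.
from concurrent.futures import ThreadPoolExecutor
from itertools import product

grid = {"lr": [0.01, 0.1], "batch_size": [64, 128]}

def train_and_score(cfg):
    """Toy stand-in for training a model; lower 'loss' is better."""
    loss = abs(cfg["lr"] - 0.01) + cfg["batch_size"] / 1000.0
    return cfg, loss

# Expand the grid into one config dict per combination.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]

with ThreadPoolExecutor(max_workers=4) as pool:  # one config per worker
    results = list(pool.map(train_and_score, configs))

best_cfg, best_loss = min(results, key=lambda r: r[1])
print(best_cfg)  # -> {'lr': 0.01, 'batch_size': 64}
```

The winning configuration would then be retrained on the full 5M-event dataset, as in the poster's dev-dataset/full-dataset split.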
