Building End-to-End, Integrated Big Data Analytics and AI Solutions

Sajan Govindan

TOOLKITS (app developers):
• OpenVINO™ toolkit: deep learning inference deployment on CPU/GPU/FPGA/VPU for TensorFlow*, MXNet*, ONNX*, Kaldi*
• Analytics Zoo: open source platform for building E2E analytics & AI applications on Apache Spark* with distributed TensorFlow*, Keras*, BigDL
• Open source, scalable, and extensible distributed deep learning platform built on Kubernetes (BETA)

FRAMEWORKS & LIBRARIES (data scientists):
• Intel-optimized frameworks: TensorFlow*, MXNet* (more framework optimizations underway, including PaddlePaddle*, Chainer*, CNTK* & others)
• Python libraries: Scikit-learn, Pandas, NumPy
• R libraries: Cart, RandomForest, e1071
• Distributed libraries: MLlib (on Spark), Mahout

KERNELS (library developers):
• Intel® Distribution for Python*: Intel distribution optimized for machine learning
• Intel® Data Analytics Acceleration Library (DAAL): high performance machine learning & data analytics library
• Intel® Math Kernel Library for Deep Neural Networks (MKL-DNN): open source DNN functions for CPU / integrated graphics
• Open source compiler for deep learning model computations, optimized for multiple devices (CPU, GPU, NNP) from multiple frameworks (TF, MXNet, ONNX)

1 An open source version is available at: 01.org/openvinotoolkit
*Other names and brands may be claimed as the property of others.
Developer personas shown above represent the primary user base for each row, but are not mutually exclusive.
All products, computer systems, dates, and figures are preliminary based on current expectations, and are subject to change without notice.
Visit: www.intel.ai/technology
© 2019 Intel Corporation

Building End-to-End, Integrated Data Analytics & AI Solutions

Distributed, High-Performance Analytics + AI Platform

• BigDL: distributed deep learning framework for Apache Spark*
• Analytics Zoo: distributed TensorFlow*, Keras*, PyTorch* and BigDL on Apache Spark*

https://github.com/intel-analytics/bigdl https://github.com/intel-analytics/analytics-zoo
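BigDL and Analytics Zoo scale training by running it data-parallel on Spark: each partition of an RDD computes gradients on its shard of the data, and the per-partition gradients are combined for one synchronous update per mini-batch. Below is a toy, single-process sketch of that scheme using only the Python standard library; the data, model, and function names are illustrative, not BigDL APIs.

```python
import random

random.seed(0)

# "RDD": samples of y = 3x + noise, scattered across 4 partitions
data = [(x, 3.0 * x + random.gauss(0, 0.1))
        for x in (i / 50 for i in range(200))]
partitions = [data[i::4] for i in range(4)]

w = 0.0  # single weight of a linear model y_hat = w * x

def partition_gradient(part, w):
    # mean squared-error gradient computed on one partition ("executor")
    return sum(2 * (w * x - y) * x for x, y in part) / len(part)

lr = 0.1
for epoch in range(200):
    grads = [partition_gradient(p, w) for p in partitions]  # map over partitions
    w -= lr * sum(grads) / len(grads)  # synchronous averaged update ("reduce")

print(round(w, 2))  # w converges toward the true slope 3.0
```

Because the partitions are equally sized, averaging the per-partition mean gradients equals the global mean gradient, so each step matches single-machine gradient descent; BigDL performs the same reduction with a distributed parameter sync across Spark executors.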

Accelerating Data Analytics + AI Solutions At Scale

Real-World ML/DL Solutions Are Complex Data Analytics Pipelines

“Hidden Technical Debt in Machine Learning Systems”, Sculley et al., Google, NIPS 2015

Analytics Zoo: End-to-End, Integrated Data Analytics + AI Platform

Use case: Recommendation, Anomaly Detection, Text Classification, Text Matching

Model: Image Classification, Object Detection, Seq2Seq, Transformer, BERT

Feature Engineering: image, 3D image, text, time series

High-Level Pipelines:
• tfpark: distributed TensorFlow on Big Data
• Distributed Keras with autograd on Big Data
• nnframes: Spark DataFrames & ML Pipelines for deep learning
• Distributed model serving (batch, streaming & online)

Backend/Library: TensorFlow, Keras, PyTorch, BigDL, NLP Architect, Apache Spark, Apache Flink, Ray, MKL-DNN, OpenVINO, Intel® Optane™ DCPMM, DL Boost (VNNI)
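The distributed model serving pipeline above exposes one trained model through several execution modes. Here is a minimal standard-library sketch of that idea; the model and function names are illustrative, not the Analytics Zoo serving API.

```python
from concurrent.futures import ThreadPoolExecutor

def model(features):
    # stand-in for a loaded neural network: score = mean of the features
    return sum(features) / len(features)

def serve_online(request):
    # online mode: score one request at a time, lowest latency
    return model(request)

def serve_batch(requests, workers=4):
    # batch mode: fan a dataset out over a worker pool, highest throughput
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(model, requests))

print(serve_batch([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]))  # [2.0, 5.0]
print(serve_online([7.0, 9.0]))                          # 8.0
```

A streaming mode would apply the same `model` to each micro-batch as it arrives, e.g. from Apache Flink or Spark, both listed in the backend row.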

https://github.com/intel-analytics/analytics-zoo

Distributed TF & Keras on Spark in Analytics Zoo

Write TensorFlow code inline in a PySpark program:

• Data wrangling and analysis using PySpark
• Deep learning model development using TensorFlow or Keras
• Distributed training / inference on Spark

    # pyspark code
    train_rdd = spark.hadoopFile(…).map(…)
    dataset = TFDataset.from_rdd(train_rdd, …)

    # tensorflow code
    import tensorflow as tf
    slim = tf.contrib.slim
    images, labels = dataset.tensors
    with slim.arg_scope(lenet.lenet_arg_scope()):
        logits, end_points = lenet.lenet(images, …)
    loss = tf.reduce_mean(
        tf.losses.sparse_softmax_cross_entropy(
            logits=logits, labels=labels))

    # distributed training on Spark
    optimizer = TFOptimizer.from_loss(loss, Adam(…))
    optimizer.optimize(end_trigger=MaxEpoch(5))

Deep Learning Pipelines for High Energy Physics at CERN using Apache Spark and Analytics Zoo

High Energy Physics

Massive Data Analysis

CERN, the European Organization for Nuclear Research, operates the Large Hadron Collider (LHC), the world’s largest and most powerful particle accelerator.

THE NEED: Improve event selection accuracy at the particle detectors. Event filtering system accuracy improvements can provide large savings in data analysis resources (compute and storage).

THE CHALLENGE: The LHC is generating 1 petabyte per second of data, with particle collision events happening every 25 ns.

THE SOLUTION1: Use Analytics Zoo & Apache Spark on Intel® Xeon® Scalable servers to implement a full data pipeline for training a topology classifier for event filtering at the LHC.

THE RESULT: Successful test of an end-to-end full data pipeline and a particle classifier implementation that easily scales out.

“Analytics Zoo & BigDL allowed us to Easily Scale-out Deep Learning Training on Apache Spark clusters running on Intel® Xeon® servers, and enabled our researchers to successfully develop an end-to-end data pipeline to improve real-time event selection at the Large Hadron Collider”
Maria Girone, Chief Technology Officer, CERN openlab

1 This CERN solution is a proof of concept and is not in production yet. See proposed solution details at http://db-blog.web.cern.ch/blog/luca-canali/machine-learning-pipelines-high-energy-physics-using-apache-spark-bigdl

AI Particle Classifier for High Energy Physics at CERN

Deep learning pipeline for physics data

Model serving using Spark
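The event-filtering use case above can be pictured with a toy standard-library simulation: a trained classifier scores each collision event, and only high-scoring events are kept for expensive downstream analysis, which is where the compute and storage savings come from. The event model, score function, and threshold below are all made-up illustrations, not CERN's actual topology classifier.

```python
import math
import random

random.seed(42)

def make_event():
    # simulate events: ~5% "interesting" (high energy), rest background
    interesting = random.random() < 0.05
    energy = random.gauss(8.0 if interesting else 2.0, 1.0)
    return {"energy": energy, "label": interesting}

events = [make_event() for _ in range(10_000)]

def classifier_score(event):
    # stand-in for a trained topology classifier: squash energy into (0, 1)
    return 1 / (1 + math.exp(-(event["energy"] - 5.0)))

THRESHOLD = 0.5
kept = [e for e in events if classifier_score(e) >= THRESHOLD]

retention = len(kept) / len(events)
recall = sum(e["label"] for e in kept) / sum(e["label"] for e in events)
print(f"kept {retention:.1%} of events, recall on interesting ones {recall:.1%}")
```

With these assumed distributions the filter discards roughly 95% of events while keeping nearly all interesting ones, which is the trade-off an event filter is tuned for.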

https://db-blog.web.cern.ch/blog/luca-canali/machine-learning-pipelines-high-energy-physics-using-apache-spark-bigdl
https://databricks.com/session/deep-learning-on-apache-spark-at-cerns-large-hadron-collider-with-intel-technologies

And Many More: Technology, Cloud Service Providers, End Users

software.intel.com/data-analytics
Not a full list.

Innovate At Scale

https://software.intel.com/AIonBigData https://software.intel.com/ai

www.intel.ai/technology

© 2019 Intel Corporation