Building End-to-End, Integrated Big Data Analytics and AI Solutions
Sajan Govindan

Intel AI software stack for machine learning and deep learning:

TOOLKITS (app developers)
- OpenVINO™ toolkit¹: deep learning inference deployment on CPU/GPU/FPGA/VPU for Caffe*, TensorFlow*, MXNet*, ONNX*, Kaldi*
- Analytics Zoo: open source platform for building E2E Analytics & AI applications on Apache Spark* with distributed TensorFlow*, Keras*, BigDL
- Open source, scalable, extensible distributed deep learning platform built on Kubernetes (BETA)

LIBRARIES (data scientists)
- Python*: Scikit-learn, Pandas, NumPy
- R*: Cart, RandomForest, e1071
- Distributed: MlLib (on Spark*), Mahout
- Intel-optimized frameworks, with more framework optimizations underway including PaddlePaddle*, Chainer*, CNTK* & others

KERNELS (library developers)
- Intel® Distribution for Python*: Intel distribution optimized for machine learning
- Intel® Data Analytics Acceleration Library (DAAL): high-performance machine learning & data analytics library
- Intel® Math Kernel Library for Deep Neural Networks (MKL-DNN): open source DNN functions for CPU / integrated graphics
- Open source compiler for deep learning model computations, optimized for multiple devices (CPU, GPU, NNP) from multiple frameworks (TF, MXNet, ONNX)
¹ An open source version is available at: 01.org/openvinotoolkit
*Other names and brands may be claimed as the property of others.
Developer personas shown above represent the primary user base for each row, but are not mutually exclusive.
All products, computer systems, dates, and figures are preliminary based on current expectations, and are subject to change without notice.
Visit: www.intel.ai/technology
© 2019 Intel Corporation

Building End-to-End, Integrated Data Analytics & AI Solutions
BigDL: distributed deep learning framework for Apache Spark*
Analytics Zoo: distributed, high-performance Analytics + AI platform (distributed TensorFlow*, Keras*, PyTorch* and BigDL on Apache Spark*)
https://github.com/intel-analytics/bigdl https://github.com/intel-analytics/analytics-zoo
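BigDL-style training on Spark is synchronous data-parallel mini-batch SGD: each partition computes a gradient on its own data shard, the gradients are aggregated, and every worker applies the same update. A minimal pure-Python sketch of that step follows; the toy linear model, shard layout, and function names are illustrative assumptions, not BigDL or Spark APIs.

```python
# Conceptual sketch of one synchronous data-parallel SGD step, as performed
# by BigDL on Spark. In the real system, per-partition gradients come from
# Spark tasks and are aggregated efficiently; here we just loop in-process.

def local_gradient(weight, shard):
    """Gradient of mean squared error for a 1-D linear model y = w * x."""
    return sum(2 * (weight * x - y) * x for x, y in shard) / len(shard)

def distributed_sgd_step(weight, shards, lr=0.05):
    # Each "partition" computes its local gradient; gradients are averaged,
    # and all workers apply the identical update (parameter synchronization).
    grads = [local_gradient(weight, shard) for shard in shards]
    avg_grad = sum(grads) / len(grads)
    return weight - lr * avg_grad

# Data for y = 3x, split across two "partitions"
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = 0.0
for _ in range(200):
    w = distributed_sgd_step(w, shards)
# w converges to the true slope 3.0
```

The key property the sketch preserves is that every worker ends each step with identical parameters, which is what lets the real framework scale training out across a Spark cluster without a separate parameter-server deployment.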
Accelerating Data Analytics + AI Solutions At Scale
Real-World ML/DL Solutions Are Complex Data Analytics Pipelines
“Hidden Technical Debt in Machine Learning Systems”, Sculley et al., Google, NIPS 2015

Analytics Zoo: End-to-End, Integrated Data Analytics + AI Platform
Use case: Recommendation, Anomaly Detection, Text Classification, Text Matching
Model: Image Classification, Object Detection, Seq2Seq, Transformer, BERT
Feature Engineering: image, 3D image, text, time series
High Level Pipelines:
- tfpark: distributed TensorFlow on big data
- Distributed Keras w/ autograd on big data
- nnframes: Spark DataFrames & ML pipelines for deep learning
- Distributed model serving (batch, streaming & online)
Backend/Library: TensorFlow, Keras, PyTorch, BigDL, NLP Architect, Apache Spark, Apache Flink, Ray, MKL-DNN, OpenVINO, Intel® Optane™ DCPMM, DL Boost (VNNI)
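The DL Boost (VNNI) entry in the backend row refers to int8 inference acceleration, which depends on quantizing float weights and activations to 8-bit integers. The following pure-Python sketch shows the symmetric int8 quantization idea conceptually; it is not any library's actual API, and real quantization schemes (per-channel scales, zero points, calibration) are more involved.

```python
# Illustrative sketch of symmetric int8 quantization, the basis of the
# low-precision inference paths that VNNI int8 instructions accelerate.

def quantize_int8(values):
    """Map floats to int8 range [-127, 127] using one symmetric scale."""
    scale = max(abs(v) for v in values) / 127.0
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the quantized integers."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
# Each recovered value is within one quantization step of the original.
```

The payoff on hardware is that the matrix multiplies dominating DNN inference can then run on 8-bit integer units at a multiple of float32 throughput, at the cost of this bounded rounding error.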
https://github.com/intel-analytics/analytics-zoo

Distributed TensorFlow & Keras on Spark in Analytics Zoo
Write TensorFlow code inline in a PySpark program:
- Data wrangling and analysis using PySpark
- Deep learning model development using TensorFlow or Keras
- Distributed training / inference on Spark

# pyspark code
train_rdd = spark.hadoopFile(…).map(…)
dataset = TFDataset.from_rdd(train_rdd, …)

# tensorflow code
import tensorflow as tf
slim = tf.contrib.slim
images, labels = dataset.tensors
with slim.arg_scope(lenet.lenet_arg_scope()):
    logits, end_points = lenet.lenet(images, …)
loss = tf.reduce_mean(
    tf.losses.sparse_softmax_cross_entropy(
        logits=logits, labels=labels))

# distributed training on Spark
optimizer = TFOptimizer.from_loss(loss, Adam(…))
optimizer.optimize(end_trigger=MaxEpoch(5))

Deep Learning Pipelines for High Energy Physics at CERN using Apache Spark and Analytics Zoo

High Energy Physics
Massive Data Analysis

CERN, the European Organization for Nuclear Research, operates the Large Hadron Collider (LHC), the world’s largest and most powerful particle accelerator.
THE NEED: Improve event selection accuracy at the particle detectors. An accuracy improvement in the event filtering system can provide large savings in data analysis resources (compute and storage).

THE CHALLENGE: The LHC is generating 1 Petabyte per second of data, with particle collision events happening every 25 ns!

THE SOLUTION¹: Use Analytics Zoo & Apache Spark on Intel® Xeon® Scalable servers to implement the full data pipeline for training a topology classifier for event filtering at the LHC.

THE RESULT: Successful test of an end-to-end full data pipeline particle classifier implementation that easily scales out.

“Analytics Zoo & BigDL allowed us to easily scale out deep learning training on Apache Spark clusters running on Intel® Xeon® servers, and enabled our researchers to successfully develop an end-to-end data pipeline to improve real-time event selection at the Large Hadron Collider.”
Maria Girone, Chief Technology Officer, CERN openlab

¹ This CERN solution is a proof of concept and is not yet in production. See proposed solution details at http://db-blog.web.cern.ch/blog/luca-canali/machine-learning-pipelines-high-energy-physics-using-apache-spark-bigdl

AI Particle Classifier for High Energy Physics at CERN
Deep learning pipeline for physics data
Model serving using Apache Kafka and Spark
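The serving stage is conceptually a consume → score → publish loop over a stream of events. The sketch below shows that loop in pure Python with in-memory queues standing in for Kafka topics and a trivial callable standing in for the trained classifier; a real deployment would use a Kafka consumer/producer and a loaded Analytics Zoo model instead.

```python
# Minimal sketch of the streaming model-serving loop (here: Kafka + Spark).
# The queues and the "model" are stand-ins, not the actual pipeline's APIs.
import queue

def serve(model, events_in, predictions_out):
    """Drain the input stream, score each event, publish the result."""
    while not events_in.empty():
        event = events_in.get()
        predictions_out.put((event["id"], model(event["features"])))

# Stand-in classifier: flag an event as signal if its feature sum is positive
model = lambda feats: "signal" if sum(feats) > 0 else "background"

events_in, predictions_out = queue.Queue(), queue.Queue()
events_in.put({"id": 1, "features": [0.4, 0.7]})
events_in.put({"id": 2, "features": [-1.2, 0.3]})
serve(model, events_in, predictions_out)
```

Separating the transport (queues/topics) from the scoring function is what lets the same model run in batch, streaming, and online serving modes.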
https://db-blog.web.cern.ch/blog/luca-canali/machine-learning-pipelines-high-energy-physics-using-apache-spark-bigdl
https://databricks.com/session/deep-learning-on-apache-spark-at-cerns-large-hadron-collider-with-intel-technologies

And Many More: Technology, Cloud Service Providers, End Users
software.intel.com/data-analytics
Not a full list.

Innovate At Scale
https://software.intel.com/AIonBigData
https://software.intel.com/ai
www.intel.ai/technology
© 2019 Intel Corporation