Microsoft ML for Apache Spark

Microsoft ML for Apache Spark Unifying Machine Learning Ecosystems at Massive Scales Mark Hamilton Microsoft, MIT [email protected] Overview Background Spark + SparkML MMLSpark Unifying ML Ecosystems LightGBM, CNTK, Vowpal Wabbit Vowpal Wabbit LightGBM CNTK Multilingual Bindings Microservice Orchestration Cognitive Services on Spark Cognitive Services Kubernetes Model Deployment with Spark Serving Use Cases The Snow Leopard Trust ● Code Driver ● Data Worker Worker Worker Age: Name: Age: Name: Age: Name: Int String Int String Int String A fault-tolerant distributed 15 Bob 1 Sam 25 Bob computing framework 25 Alice 2 Claire 25 Tom 33 Mary 53 Tina 66 Ted Map Reduce + SQL Data Source Whole program optimization + SQL, AWS Bucket, Local, CosmosDB, etc… query pushdown 120 110 Elastic 100 80 Scala, Python, R, Java, Julia 60 40 20 ML, Graph Processing, Streaming 0.9 0 Running Time(s) Hadoop Spark ML High level library for distributed machine learning More general than SciKit-Learn All models have a uniform interface Load Data Can compose models into Tokenizer Load Data complex pipelines Term Hashing Pipeline Can save, load, and transport Logistic Reg Evaluate models ● Estimator Evaluate ● Transformer Microsoft Machine Learning for Apache Spark v0.18 Microsoft’s Open Source Contributions to Apache Spark #UnifiedAnalytics #SparkAISummit 5 Distributed Fast Model Microservice Multilingual Binding Machine Learning Deployment Orchestration Generation www.aka.ms/spark Azure/mmlspark Unifying Machine Learning Ecosystems Goals Same API Composable Batch, Streaming, Serving Image Processing Gradient Boosting Deep Learning with Open CV with LightGBM Pipelines (Databricks) Elastically Distributed Fault Tolerant Multi-Language Data Source Agnostic Markus Cozowicz [email protected] Text Analytics with Distributed Model Deep Learning Data Scientist Vowpal Wabbit Interpretability with LIME with CNTK Example Backend: LightGBM on Spark PySpark SparklyR Wrapper Generation Barrier Execution for Synchronizing Workers Spark Worker Spark Worker LightGBM on Spark LGBM LGBM Spark Estimator/Transformer Fast Socket/MPI Process Process LightGBM Scala Bindings communication Scala/Java Interop mapPartitions for MPI Ring LightGBM Java Bindings Transformer LGBM SWIG Process Framework Time(s) Area under LightGBM Core (C++) ROC Spark Worker XGBoost 52.60 .808 SparkML GBT 82.78 .788 LightGBM 45.39 .812 Ilya Matiach, [email protected] Developer, Azure ML Cognitive Services Person High quality pre-built Skateboa intelligent services rd No time intensive model training or deployment “Hello World” Leverage Microsoft Research and Azure ML Place Time Range Available as Docker I had a wonderful trip to Seattle last week and even visited the Space Needle 2 times! Containers En-US 84% positive Place Vision Speech Language Decision Search Object, scene, and Speech transcription Language detection Q&A extraction from Ad-free web, news, image, activity detection (speech-to-text) unstructured text and video search results Named entity recognition Face recognition Custom speech models for Knowledge base creation Trends for video, news and identification unique vocabularies or Key phrase extraction from collections of Q&As complex environment Image identification, Celebrity and landmark Text sentiment analysis Semantic matching for classification and recognition Text-to-speech knowledge extraction Multilingual and contextual knowledge bases Emotion recognition Custom Voice spell checking Customizable content Identification of similar personalization learning images and products Text and handwriting Real-time speech translation Explicit or offensive text recognition (OCR) content moderation Named entity recognition Customizable speech and classification Customizable image transcription and translation PII detection for text recognition moderation Knowledge acquisition Speaker identification for named entities Video metadata, audio, and verification Text translation and keyframe extraction Search query autosuggest Customizable text translation and analysis Ad-free custom search Contextual language Explicit or offensive engine creation understanding content moderation Azure Cognitive Services on Spark Easy to use integration between Spark and the Azure Cognitive Services val df = new TextSentiment() Composable and pipelinable with all other SparkML models! .setTextCol(“text”) .setOutputCol(“sentiment”) Exponential Backoffs, .transform(inputs) Backpressure, Batching, Async Parallelism Features Time (s) Errors # None Fully Fluent API 30.8 18993 EBO+BP 1163.0 0 EBO+BP+B 57.1 0 EBO+BP+B+P 49.7 0 HTTP on Spark Web Service Full Integration between HTTP Protocol and Spark SQL Client Client Client Spark as a Microservice Partition Partition Partition Orchestrator Spark Worker Spark + X df = SimpleHTTPTransformer() .setInputParser(JSONInputParser()) Support for all Spark .setOutputParser(JSONOutputParser() Languages .setDataType(schema)) .setOutputCol("results") .setUrl(…) Deploying on Kubernetes Kubernetes (AKS, ACS, GKE, On-Prem etc) Works on any k8s cluster K8s worker K8s worker K8s worker Cognitive CognitiveSpark Cognitive Helm: Package Manager for Service ServiceWorker Service Cloud Container Container Container Cognitive Kubernetes Services HTTP on Spark HTTP on Spark HTTP on Spark HTTP on Spark Spark Spark Spark helm repo add microsoft \ Worker Worker Worker Storage or https://microsoft.github.io/charts/repo other Spark Serving Databases helm update Jupyter, Zepplin Spark Zepplin, Spark Readers Serving Load LIVY, or Balancer Spark Jupyter helm install microsoft/spark --version 1.0.0 Submit LB REST Requests to Submit Jobs, Run Notebooks, Deployed Models Manage Cluster, etc Users / Apps Dalitso Banda, [email protected] Microsoft AI Development Acceleration Program 12 Model Deployment with Spark Serving Sub-millisecond RESTful Model Deployment on Spark Clusters Users / Apps Batch API: Load Balancer HTTP Requests and spark.read.parquet.load(…) Responses .select(…) Server Server Streaming API: spark.readStream.kafka.load(…) Partition Partition Partition Partition Partition Partition .select(…) Spark Worker Spark Worker Serving API: Spark Master spark.readStream.server(“0.0.0.0”, 5000).load(…) .select(…) AI for Earth Endangered Status Matters Remote Camera Trapping Creating a labelled Training Dataset Creating a labelled Training Dataset Transfer Learning with ResNet 50 Low Level Mid Level Logistic Input Images Output Classes Features Features Regression Other Leopard Any SparkML Predictions Model Filters from Zeiler + Fergus 2013 Performance Without Deep With Deep Featurization, Featurization Augmentation, and Temporal Ensembling Accuracy 65.6% Accuracy 94.7% Goal: Identify Individual Leopards Source: HotSpotter - Patterned Species Instance Recognition Automating Detection with LIME on Spark LIME on Spark End to End Architecture Results Unsupervised Unsupervised Human Labels FRCNN Outputs Human Labels FRCNN Outputs Microsoft Machine Learning for Apache Spark v0.18 Microsoft’s Open Source Contributions to Apache Spark #UnifiedAnalytics #SparkAISummit 26 Distributed Fast Model Microservice Multilingual Binding Machine Learning Deployment Orchestration Generation www.aka.ms/spark Azure/mmlspark Thanks to Get in Touch You all! Ilya Matiach: LightGBM on Spark Support: mmlspark- Markus Cozowicz: VW on Spark [email protected] Sudarshan Raghunathan, Christina Lee, Daniel Ciborowski, Eli Barzilay, Tong Wen, Pablo Castro, Me: [email protected] Chris Hoder, Ryan Gaspar, Henrik Neilsen, Andrew Schonhoffer, Joseph Sirosh Github : Azure/mmlspark Microsoft NERD Garage Team + MIT Externship Website: www.aka.ms/spark Program Snow Leopard Trust: Koustubh Sharma, Rhetick Paper: www.aka.ms/spark-paper Sengupta, Jeff Brown, Michael Despines Microsoft Development Acceleration Team: Contributions Welcome! Dalitso Banda, Casey Hong, Karthik Rajendran, Check out our MSR Podcast on Manon Knoertzer, Tayo Amuneke, Alejandro Buendia Oct 2 Azure CAT, AzureML, and Azure Search Teams Backup Slides AI for Cultural Institutions Celebrating 2 years of Open Access at The MET In 2016 The MET Released 400k MIT, The MET, and Microsoft images under open access participated in a 3-day hackathon to create intelligent experiences using This past winter the MET released a the new collection new subject-keyword dataset of image annotations https://gen.studio 30 Goals: Create new Use new work Explore further works of art to explore with intelligent existing art search Needed Technologies: Generative Reverse image Elasticsearch Adversarial search with Cognitive Networks Services Generative Adversarial Art Training Data Real or Generated? Noise Real or Vector Generated? https://gen.studioGenerated Generator Discriminator Image Custom Reverse Image Search Query ResNet Deep Fast Nearest Closest Image Featurizer Features Neighbor Match Lookup MMLSpark SparkML LSH or Annoy Filters from Zeiler + Fergus 2013 Nearest Example Nearest Neighbors Example Nearest Neighbors Query Images Intelligent Search Index Pipe images through Query Computer Vision API Image: to annotate image for Describe searching A fish Image A picture A picture swimming Stream images and Output: containing a containing a intelligent person glass, cup underwater annotations to Azure Search Deep Feature Nearest Neighbors: End Application: Gen Studio https://gen.studio 36 AI for Accessibility Seeing AI Currency Identification A Familiar Architecture… Labelled Images Deep Features $1 $5 $2 Queries 0 1 Dollar Bing Azure Machine Union + Logistic 5 Dollars Image MobileNet Learning + Distinct Regression 10 Dollars Search Spark Serving 20 Dollars Prep Data Train Deploy.

Microsoft ML for Apache Spark

Asset Finder: a Search Tool for Finding Relevant Graphical Assets Using Automated Image Labelling

A Tensor Compiler for Unified Machine Learning Prediction Serving

Naiad: a Timely Dataflow System

Search, Analyse, Predict Image Spread on Twitter

Lightgbm Release 3.2.1.99

ML Cheatsheet Documentation

1.2 Le Zhang, Microsoft

Alibaba Amazon

2020 Sut20 CASC-J10.Pdf

Reverse Image Search for Scientific Data Within and Beyond the Visible

Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters

A Universal Neural Network Solution for Tabular Data